Secure Similar Document Detection

Optimized Computation Using the Jaccard Coefficient

Research output: Chapter in Book/Report/Conference proceedingConference contributionResearchpeer-review

Abstract

Secure Similar Document Detection (SSDD) problem considers two parties, each holding a private document, who want to compute the similarity between their documents without leaking the document contents to one another. This is a unique problem whose applications span across a variety of domains, including the medical field, military intelligence, and academia. In this paper, we aim to solve the SSDD problem by representing documents as multisets and using the Jaccard Coefficient (JC) as a similarity measure. We first illustrate how the computation of Jaccard Coefficient can be reduced to the computation of intersection size between the multisets. Then, we propose a novel way to securely approximate the size of intersection between multisets using Bloom filters and hash functions, without significant reductions in security and accuracy. Our proposed protocol exploits a unique property of Bloom filters -that computing the dot product of two Bloom filters yields their intersection size.

Original languageEnglish
Title of host publicationProceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages1-4
Number of pages4
ISBN (Electronic)9781538643990
DOIs
StatePublished - 28 Nov 2018
Event4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018 - Omaha, United States
Duration: 3 May 20185 May 2018

Publication series

NameProceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018

Other

Other4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018
CountryUnited States
CityOmaha
Period3/05/185/05/18

Fingerprint

Hash functions
Filter
Coefficients
Hash function
Military
Similarity measure

Keywords

  • Bloom filters
  • Jaccard coefficient
  • efficiency
  • multisets
  • security

Cite this

Forman, S., & Samanthula, B. K. (2018). Secure Similar Document Detection: Optimized Computation Using the Jaccard Coefficient. In Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018 (pp. 1-4). [8552273] (Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/BDS/HPSC/IDS18.2018.00015
Forman, Sam ; Samanthula, Bharath Kumar. / Secure Similar Document Detection : Optimized Computation Using the Jaccard Coefficient. Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018. Institute of Electrical and Electronics Engineers Inc., 2018. pp. 1-4 (Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018).
@inproceedings{2956b4efec2f40349922f6aa4653209e,
title = "Secure Similar Document Detection: Optimized Computation Using the Jaccard Coefficient",
abstract = "Secure Similar Document Detection (SSDD) problem considers two parties, each holding a private document, who want to compute the similarity between their documents without leaking the document contents to one another. This is a unique problem whose applications span across a variety of domains, including the medical field, military intelligence, and academia. In this paper, we aim to solve the SSDD problem by representing documents as multisets and using the Jaccard Coefficient (JC) as a similarity measure. We first illustrate how the computation of Jaccard Coefficient can be reduced to the computation of intersection size between the multisets. Then, we propose a novel way to securely approximate the size of intersection between multisets using Bloom filters and hash functions, without significant reductions in security and accuracy. Our proposed protocol exploits a unique property of Bloom filters -that computing the dot product of two Bloom filters yields their intersection size.",
keywords = "Bloom filters, Jaccard coefficient, efficiency, multisets, security",
author = "Sam Forman and Samanthula, {Bharath Kumar}",
year = "2018",
month = "11",
day = "28",
doi = "10.1109/BDS/HPSC/IDS18.2018.00015",
language = "English",
series = "Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "1--4",
booktitle = "Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018",

}

Forman, S & Samanthula, BK 2018, Secure Similar Document Detection: Optimized Computation Using the Jaccard Coefficient. in Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018., 8552273, Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018, Institute of Electrical and Electronics Engineers Inc., pp. 1-4, 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018, Omaha, United States, 3/05/18. https://doi.org/10.1109/BDS/HPSC/IDS18.2018.00015

Secure Similar Document Detection : Optimized Computation Using the Jaccard Coefficient. / Forman, Sam; Samanthula, Bharath Kumar.

Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018. Institute of Electrical and Electronics Engineers Inc., 2018. p. 1-4 8552273 (Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018).

Research output: Chapter in Book/Report/Conference proceedingConference contributionResearchpeer-review

TY - GEN

T1 - Secure Similar Document Detection

T2 - Optimized Computation Using the Jaccard Coefficient

AU - Forman, Sam

AU - Samanthula, Bharath Kumar

PY - 2018/11/28

Y1 - 2018/11/28

N2 - Secure Similar Document Detection (SSDD) problem considers two parties, each holding a private document, who want to compute the similarity between their documents without leaking the document contents to one another. This is a unique problem whose applications span across a variety of domains, including the medical field, military intelligence, and academia. In this paper, we aim to solve the SSDD problem by representing documents as multisets and using the Jaccard Coefficient (JC) as a similarity measure. We first illustrate how the computation of Jaccard Coefficient can be reduced to the computation of intersection size between the multisets. Then, we propose a novel way to securely approximate the size of intersection between multisets using Bloom filters and hash functions, without significant reductions in security and accuracy. Our proposed protocol exploits a unique property of Bloom filters -that computing the dot product of two Bloom filters yields their intersection size.

AB - Secure Similar Document Detection (SSDD) problem considers two parties, each holding a private document, who want to compute the similarity between their documents without leaking the document contents to one another. This is a unique problem whose applications span across a variety of domains, including the medical field, military intelligence, and academia. In this paper, we aim to solve the SSDD problem by representing documents as multisets and using the Jaccard Coefficient (JC) as a similarity measure. We first illustrate how the computation of Jaccard Coefficient can be reduced to the computation of intersection size between the multisets. Then, we propose a novel way to securely approximate the size of intersection between multisets using Bloom filters and hash functions, without significant reductions in security and accuracy. Our proposed protocol exploits a unique property of Bloom filters -that computing the dot product of two Bloom filters yields their intersection size.

KW - Bloom filters

KW - Jaccard coefficient

KW - efficiency

KW - multisets

KW - security

UR - http://www.scopus.com/inward/record.url?scp=85059786074&partnerID=8YFLogxK

U2 - 10.1109/BDS/HPSC/IDS18.2018.00015

DO - 10.1109/BDS/HPSC/IDS18.2018.00015

M3 - Conference contribution

T3 - Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018

SP - 1

EP - 4

BT - Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Forman S, Samanthula BK. Secure Similar Document Detection: Optimized Computation Using the Jaccard Coefficient. In Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018. Institute of Electrical and Electronics Engineers Inc. 2018. p. 1-4. 8552273. (Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018). https://doi.org/10.1109/BDS/HPSC/IDS18.2018.00015