TY - GEN
T1 - Secure Similar Document Detection
T2 - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018
AU - Forman, Sam
AU - Samanthula, Bharath Kumar
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/11/28
Y1 - 2018/11/28
N2 - Secure Similar Document Detection (SSDD) problem considers two parties, each holding a private document, who want to compute the similarity between their documents without leaking the document contents to one another. This is a unique problem whose applications span across a variety of domains, including the medical field, military intelligence, and academia. In this paper, we aim to solve the SSDD problem by representing documents as multisets and using the Jaccard Coefficient (JC) as a similarity measure. We first illustrate how the computation of Jaccard Coefficient can be reduced to the computation of intersection size between the multisets. Then, we propose a novel way to securely approximate the size of intersection between multisets using Bloom filters and hash functions, without significant reductions in security and accuracy. Our proposed protocol exploits a unique property of Bloom filters -that computing the dot product of two Bloom filters yields their intersection size.
AB - Secure Similar Document Detection (SSDD) problem considers two parties, each holding a private document, who want to compute the similarity between their documents without leaking the document contents to one another. This is a unique problem whose applications span across a variety of domains, including the medical field, military intelligence, and academia. In this paper, we aim to solve the SSDD problem by representing documents as multisets and using the Jaccard Coefficient (JC) as a similarity measure. We first illustrate how the computation of Jaccard Coefficient can be reduced to the computation of intersection size between the multisets. Then, we propose a novel way to securely approximate the size of intersection between multisets using Bloom filters and hash functions, without significant reductions in security and accuracy. Our proposed protocol exploits a unique property of Bloom filters -that computing the dot product of two Bloom filters yields their intersection size.
KW - Bloom filters
KW - Jaccard coefficient
KW - efficiency
KW - multisets
KW - security
UR - http://www.scopus.com/inward/record.url?scp=85059786074&partnerID=8YFLogxK
U2 - 10.1109/BDS/HPSC/IDS18.2018.00015
DO - 10.1109/BDS/HPSC/IDS18.2018.00015
M3 - Conference contribution
AN - SCOPUS:85059786074
T3 - Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018
SP - 1
EP - 4
BT - Proceedings - 4th IEEE International Conference on Big Data Security on Cloud, BigDataSecurity 2018, 4th IEEE International Conference on High Performance and Smart Computing, HPSC 2018 and 3rd IEEE International Conference on Intelligent Data and Security, IDS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 3 May 2018 through 5 May 2018
ER -