XML clustering by principal component analysis

Jianghui Liu, Jason T.L. Wang, Wynne Hsu, Katherine Herbert

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Research > peer-review

28 Citations (Scopus)

Abstract

XML is increasingly important in data exchange and information management. Considerable effort has been spent on developing efficient techniques for storing, querying, indexing and accessing XML documents. In this paper we propose a new approach to clustering XML data. In contrast to previous work, which focused on documents defined by different DTDs, the proposed method works for documents with the same DTD. Our approach is to extract features from documents, modeled by ordered labeled trees, and transform the documents into vectors in a high-dimensional Euclidean space based on the occurrences of the features in the documents. We then reduce the dimensionality of the vectors by principal component analysis (PCA) and cluster the vectors in the reduced-dimensional space. PCA makes it possible to identify vectors with co-occurring features, thereby enhancing the accuracy of the clustering. Experimental results based on documents obtained from Wisconsin's XML data bank show the effectiveness and good performance of the proposed techniques.
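The pipeline summarized in the abstract (tree features, occurrence vectors, PCA, clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' method: the abstract does not say which tree features are extracted or which clustering algorithm is used, so the sketch counts parent/child tag pairs as features, computes PCA by power iteration with deflation, and clusters with a small deterministic k-means. The names `edge_features`, `to_vectors`, `pca_project`, `kmeans`, `cluster_xml`, and `SAMPLE_DOCS` are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def edge_features(xml_text):
    """Count parent/child tag pairs (an assumed feature type, not the paper's)."""
    counts = Counter()
    for parent in ET.fromstring(xml_text).iter():
        for child in parent:
            counts[(parent.tag, child.tag)] += 1
    return counts

def to_vectors(docs):
    """Turn each document into a vector of feature occurrence counts."""
    bags = [edge_features(d) for d in docs]
    vocab = sorted(set().union(*bags))
    return [[float(bag[f]) for f in vocab] for bag in bags], vocab

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def pca_project(vectors, k, iters=200):
    """Project mean-centered vectors onto up to k principal components."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(col) / n for col in zip(*vectors)]
    X = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(x[i] * x[j] for x in X) / max(n - 1, 1) for j in range(d)]
           for i in range(d)]
    comps = []
    for _ in range(k):
        w = [float(j + 1) for j in range(d)]  # fixed start keeps runs repeatable
        norm = 0.0
        for _step in range(iters):  # power iteration toward the top eigenvector
            cw = matvec(cov, w)
            norm = math.sqrt(sum(c * c for c in cw))
            if norm < 1e-12:
                break
            w = [c / norm for c in cw]
        if norm < 1e-12:
            break  # no variance left to explain
        lam = sum(a * b for a, b in zip(w, matvec(cov, w)))  # Rayleigh quotient
        comps.append(w)
        # Deflation: remove the found component so the next pass finds the next one.
        cov = [[cov[i][j] - lam * w[i] * w[j] for j in range(d)] for i in range(d)]
    return [[sum(a * b for a, b in zip(x, c)) for c in comps] for x in X]

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """Plain k-means with farthest-first initialization (an assumed choice)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_xml(docs, n_components=2, n_clusters=2):
    """Vectorize, reduce with PCA, then cluster in the reduced space."""
    vectors, _vocab = to_vectors(docs)
    return kmeans(pca_project(vectors, n_components), n_clusters)

# Made-up documents that share one schema but use different optional elements.
SAMPLE_DOCS = [
    "<pub><title/><author/><author/><journal/></pub>",
    "<pub><title/><author/><journal/><volume/></pub>",
    "<pub><title/><editor/><booktitle/><publisher/></pub>",
    "<pub><title/><editor/><editor/><booktitle/></pub>",
]

if __name__ == "__main__":
    print(cluster_xml(SAMPLE_DOCS))
```

In this toy input, the first principal component contrasts features that tend to occur together (author and journal against editor and booktitle), which is the kind of co-occurrence structure the abstract credits PCA with exposing. A real implementation would more likely rely on a numerical library for the eigendecomposition; the pure-Python version is used here only to keep the sketch self-contained.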

Original language: English
Title of host publication: Proceedings - 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2004
Editors: T.M. Khoshgoftaar
Pages: 658-662
Number of pages: 5
DOI: https://doi.org/10.1109/ICTAI.2004.122
State: Published - 1 Dec 2004
Event: Proceedings - 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2004 - Boca Raton, FL, United States
Duration: 15 Nov 2004 to 17 Nov 2004


Fingerprint

XML
Principal component analysis
Electronic data interchange
Information management

Cite this

Liu, J., Wang, J. T. L., Hsu, W., & Herbert, K. (2004). XML clustering by principal component analysis. In T. M. Khoshgoftaar (Ed.), Proceedings - 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2004 (pp. 658-662) https://doi.org/10.1109/ICTAI.2004.122
@inproceedings{28deffa8bbe34a3aaeb87a8b377824b5,
title = "XML clustering by principal component analysis",
author = "Jianghui Liu and Wang, {Jason T.L.} and Wynne Hsu and Katherine Herbert",
year = "2004",
month = "12",
day = "1",
doi = "10.1109/ICTAI.2004.122",
language = "English",
isbn = "076952236X",
pages = "658--662",
editor = "T.M. Khoshgoftaar",
booktitle = "Proceedings - 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2004",

}
