TY - GEN
T1 - XML clustering by principal component analysis
AU - Liu, Jianghui
AU - Wang, Jason T.L.
AU - Hsu, Wynne
AU - Herbert, Katherine G.
PY - 2004
Y1 - 2004
AB - XML is increasingly important in data exchange and information management. Considerable effort has been spent on developing efficient techniques for storing, querying, indexing and accessing XML documents. In this paper we propose a new approach to clustering XML data. In contrast to previous work, which focused on documents defined by different DTDs, the proposed method works for documents with the same DTD. Our approach is to extract features from documents, modeled as ordered labeled trees, and transform the documents into vectors in a high-dimensional Euclidean space based on the occurrences of the features in the documents. We then reduce the dimensionality of the vectors by principal component analysis (PCA) and cluster the vectors in the reduced-dimensional space. PCA enables one to identify vectors with co-occurring features, thereby enhancing the accuracy of the clustering. Experimental results based on documents obtained from Wisconsin's XML data bank show the effectiveness and good performance of the proposed techniques.
UR - http://www.scopus.com/inward/record.url?scp=16244423653&partnerID=8YFLogxK
U2 - 10.1109/ICTAI.2004.122
DO - 10.1109/ICTAI.2004.122
M3 - Conference contribution
AN - SCOPUS:16244423653
SN - 076952236X
T3 - Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI
SP - 658
EP - 662
BT - Proceedings - 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2004
A2 - Khoshgoftaar, T.M.
T2 - Proceedings - 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2004
Y2 - 15 November 2004 through 17 November 2004
ER -