Semantic enrichment of text representation with wikipedia for text classification

Hiroki Yamakawa, Jing Peng, Anna Feldman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Text classification is a widely studied topic in machine learning, and a number of techniques have been developed to represent and classify text documents. Most of these techniques try to achieve good classification performance while treating a document only as a collection of its words (e.g. statistical analysis of word frequency and distribution patterns). A recent trend in text classification research is to incorporate more semantic interpretation, especially by using Wikipedia. This paper introduces a technique for incorporating the vast amount of human knowledge accumulated in Wikipedia into text representation and classification. The aim is to improve classification performance by transforming general terms into sets of related concepts grouped around semantic themes. To achieve this goal, the paper proposes a method for breaking the enormous amount of extracted Wikipedia knowledge (concepts) into smaller pieces (subsets of concepts). Each subset of concepts is used separately to represent the same set of documents, yielding a number of different representations from which an ensemble of classifiers is built. Experimental results show that an ensemble of classifiers, each trained on a different representation of the document set, achieves higher accuracy and greater stability than a classifier trained only on the original document set.
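
The pipeline described in the abstract (split the concepts into subsets, represent the documents once per subset, train one classifier per representation, combine by voting) can be illustrated with a minimal sketch. This is not the authors' implementation. The term-to-concept map `TERM_CONCEPTS` is a hypothetical stand-in for knowledge extracted from Wikipedia, the round-robin `partition` is an assumed splitting rule (the paper groups concepts around semantic themes), and the nearest-centroid base classifier is swapped in only to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical term-to-concept map; a stand-in for concepts extracted from Wikipedia.
TERM_CONCEPTS = {
    "goal": ["Football", "Scoring"],
    "striker": ["Football"],
    "pitch": ["Football", "Baseball"],
    "inning": ["Baseball"],
    "homer": ["Baseball", "Scoring"],
    "bat": ["Baseball"],
}


def partition(concepts, k):
    """Split concepts into k disjoint subsets (round-robin; an assumed rule)."""
    ordered = sorted(concepts)
    return [ordered[i::k] for i in range(k)]


def represent(doc, subset):
    """Represent a document as concept counts restricted to one subset."""
    counts = Counter(
        concept
        for term in doc.split()
        for concept in TERM_CONCEPTS.get(term, [])
    )
    return [counts[concept] for concept in subset]


def train_centroids(vectors, labels):
    """Nearest-centroid training: average the vectors of each class."""
    sums, sizes = {}, Counter(labels)
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
    return {label: [s / sizes[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vec))
    return min(sorted(centroids), key=dist)


def train_ensemble(docs, labels, k):
    """Train one classifier per concept subset on the same documents."""
    all_concepts = {c for concepts in TERM_CONCEPTS.values() for c in concepts}
    ensemble = []
    for subset in partition(all_concepts, k):
        vectors = [represent(doc, subset) for doc in docs]
        ensemble.append((subset, train_centroids(vectors, labels)))
    return ensemble


def predict(ensemble, doc):
    """Combine the per-subset predictions by majority vote."""
    votes = Counter(
        predict_centroid(centroids, represent(doc, subset))
        for subset, centroids in ensemble
    )
    return votes.most_common(1)[0][0]


# Toy training data (invented for illustration only).
docs = ["goal striker pitch", "striker goal", "inning homer bat", "bat pitch inning"]
labels = ["football", "football", "baseball", "baseball"]
ensemble = train_ensemble(docs, labels, k=3)
print(predict(ensemble, "striker goal"))
```

Each ensemble member sees the same documents through a different concept subset, so the members make partly independent errors; the vote in `predict` is what turns those separate views into a single decision.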

Original language: English
Title of host publication: 2010 IEEE International Conference on Systems, Man and Cybernetics, SMC 2010
Pages: 4333-4340
Number of pages: 8
DOIs: https://doi.org/10.1109/ICSMC.2010.5641812
State: Published - 1 Dec 2010
Event: 2010 IEEE International Conference on Systems, Man and Cybernetics, SMC 2010 - Istanbul, Turkey
Duration: 10 Oct 2010 to 13 Oct 2010

Publication series

Name: Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
ISSN (Print): 1062-922X

Other

Other: 2010 IEEE International Conference on Systems, Man and Cybernetics, SMC 2010
Country: Turkey
City: Istanbul
Period: 10/10/10 to 13/10/10

Fingerprint

Semantics
Classifiers
Learning systems
Statistical methods

Keywords

  • Ensemble
  • Semantics
  • Text classification
  • Text representation
  • Voting
  • Wikipedia

Cite this

Yamakawa, H., Peng, J., & Feldman, A. (2010). Semantic enrichment of text representation with wikipedia for text classification. In 2010 IEEE International Conference on Systems, Man and Cybernetics, SMC 2010 (pp. 4333-4340). [5641812] (Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics). https://doi.org/10.1109/ICSMC.2010.5641812
@inproceedings{18eca4c934e84fbbbc64671f418bcc19,
title = "Semantic enrichment of text representation with wikipedia for text classification",
abstract = "Text classification is a widely studied topic in the area of machine learning. A number of techniques have been developed to represent and classify text documents. Most of the techniques try to achieve good classification performance while taking a document only by its words (e.g. statistical analysis on word frequency and distribution patterns). One of the recent trends in text classification research is to incorporate more semantic interpretation in text classification, especially by using Wikipedia. This paper introduces a technique for incorporating the vast amount of human knowledge accumulated in Wikipedia into text representation and classification. The aim is to improve classification performance by transforming general terms into a set of related concepts grouped around semantic themes. In order to achieve this goal, this paper proposes a unique method for breaking the enormous amount of extracted Wikipedia knowledge (concepts) into smaller pieces (subsets of concepts). The subsets of concepts are separately used to represent the same set of documents in a number of different ways, from which an ensemble of classifiers is built. Experimental results show that an ensemble of classifiers individually trained on a different representation of the document set performs better with increased accuracy and stability than that of a classifier trained only on the original document set.",
keywords = "Ensemble, Semantics, Text classification, Text representation, Voting, Wikipedia",
author = "Hiroki Yamakawa and Jing Peng and Anna Feldman",
year = "2010",
month = "12",
day = "1",
doi = "10.1109/ICSMC.2010.5641812",
language = "English",
isbn = "9781424465880",
series = "Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics",
pages = "4333--4340",
booktitle = "2010 IEEE International Conference on Systems, Man and Cybernetics, SMC 2010",

}

TY - GEN

T1 - Semantic enrichment of text representation with wikipedia for text classification

AU - Yamakawa, Hiroki

AU - Peng, Jing

AU - Feldman, Anna

PY - 2010/12/1

Y1 - 2010/12/1

AB - Text classification is a widely studied topic in the area of machine learning. A number of techniques have been developed to represent and classify text documents. Most of the techniques try to achieve good classification performance while taking a document only by its words (e.g. statistical analysis on word frequency and distribution patterns). One of the recent trends in text classification research is to incorporate more semantic interpretation in text classification, especially by using Wikipedia. This paper introduces a technique for incorporating the vast amount of human knowledge accumulated in Wikipedia into text representation and classification. The aim is to improve classification performance by transforming general terms into a set of related concepts grouped around semantic themes. In order to achieve this goal, this paper proposes a unique method for breaking the enormous amount of extracted Wikipedia knowledge (concepts) into smaller pieces (subsets of concepts). The subsets of concepts are separately used to represent the same set of documents in a number of different ways, from which an ensemble of classifiers is built. Experimental results show that an ensemble of classifiers individually trained on a different representation of the document set performs better with increased accuracy and stability than that of a classifier trained only on the original document set.

KW - Ensemble

KW - Semantics

KW - Text classification

KW - Text representation

KW - Voting

KW - Wikipedia

UR - http://www.scopus.com/inward/record.url?scp=78751492151&partnerID=8YFLogxK

U2 - 10.1109/ICSMC.2010.5641812

DO - 10.1109/ICSMC.2010.5641812

M3 - Conference contribution

AN - SCOPUS:78751492151

SN - 9781424465880

T3 - Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics

SP - 4333

EP - 4340

BT - 2010 IEEE International Conference on Systems, Man and Cybernetics, SMC 2010

ER -
