TY - GEN
T1 - Graph-based iterative hybrid feature selection
AU - Zhong, ErHeng
AU - Xie, Sihong
AU - Fan, Wei
AU - Ren, Jiangtao
AU - Peng, Jing
AU - Zhang, Kun
PY - 2008
Y1 - 2008
N2 - When the number of labeled examples is limited, traditional supervised feature selection techniques often fail due to sample selection bias or unrepresentative samples. To address this, semi-supervised feature selection techniques exploit the statistical information of labeled and unlabeled examples at the same time. However, the results of semi-supervised feature selection can at times be unsatisfactory, and the difficulty lies in how to use the unlabeled data effectively. In contrast to both supervised and semi-supervised feature selection, we propose a "hybrid" framework based on graph models. We first apply supervised methods to select a small set of the most critical features from the labeled data. Importantly, these initial features might otherwise be missed when selection is performed on the labeled and unlabeled examples simultaneously. Next, this initial feature set is expanded and corrected using the unlabeled data. We formally analyze why the expected performance of the hybrid framework is better than that of both supervised and semi-supervised feature selection. Experimental results demonstrate that the proposed method outperforms both traditional supervised and state-of-the-art semi-supervised feature selection algorithms by at least 10% in accuracy on a number of text and biomedical problems with thousands of features to choose from. Software and dataset are available from the authors.
AB - When the number of labeled examples is limited, traditional supervised feature selection techniques often fail due to sample selection bias or unrepresentative samples. To address this, semi-supervised feature selection techniques exploit the statistical information of labeled and unlabeled examples at the same time. However, the results of semi-supervised feature selection can at times be unsatisfactory, and the difficulty lies in how to use the unlabeled data effectively. In contrast to both supervised and semi-supervised feature selection, we propose a "hybrid" framework based on graph models. We first apply supervised methods to select a small set of the most critical features from the labeled data. Importantly, these initial features might otherwise be missed when selection is performed on the labeled and unlabeled examples simultaneously. Next, this initial feature set is expanded and corrected using the unlabeled data. We formally analyze why the expected performance of the hybrid framework is better than that of both supervised and semi-supervised feature selection. Experimental results demonstrate that the proposed method outperforms both traditional supervised and state-of-the-art semi-supervised feature selection algorithms by at least 10% in accuracy on a number of text and biomedical problems with thousands of features to choose from. Software and dataset are available from the authors.
UR - http://www.scopus.com/inward/record.url?scp=67049167710&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2008.63
DO - 10.1109/ICDM.2008.63
M3 - Conference contribution
AN - SCOPUS:67049167710
SN - 9780769535029
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 1133
EP - 1138
BT - Proceedings - 8th IEEE International Conference on Data Mining, ICDM 2008
T2 - 8th IEEE International Conference on Data Mining, ICDM 2008
Y2 - 15 December 2008 through 19 December 2008
ER -