Generalized and heuristic-free feature construction for improved accuracy

Wei Fan, Erheng Zhong, Jing Peng, Olivier Verscheure, Kun Zhang, Jiangtao Ren, Rong Yan, Qiang Yang

Research output: Contribution to conference › Paper

9 Citations (Scopus)

Abstract

State-of-the-art learning algorithms accept data in feature-vector format as input. Examples belonging to different classes may not always be easy to separate in the original feature space. One may ask: can transformation of existing features into a new space reveal significant discriminative information that is not obvious in the original space? First, since there are infinitely many ways to extend features, it is impractical to enumerate them all and then perform feature selection. Second, evaluating discriminative power on the complete dataset is not always optimal, because features that are highly discriminative on a subset of examples may not be significant when evaluated on the entire dataset. Third, feature construction ought to be automated and general, so that it requires no domain knowledge and its accuracy improvement holds across a large number of classification algorithms. In this paper, we propose a framework that addresses these problems through the following steps: (1) divide-and-conquer to avoid exhaustive enumeration; (2) local feature construction and evaluation within subspaces of examples where local error remains high and the features constructed so far still do not predict well; (3) a weighting-rule-based search that is free of domain knowledge and has a provable performance guarantee. Empirical studies indicate that significant improvement (as much as 9% in accuracy and 28% in AUC) is achieved using the newly constructed features over a variety of inductive learners evaluated on a number of balanced, skewed and high-dimensional datasets. Software and datasets are available from the authors.
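The paper's algorithm is not reproduced in this record. As a rough, hypothetical illustration of the divide-and-conquer and local-evaluation idea the abstract describes, the sketch below partitions examples along a random projection, and inside each subspace whose local baseline error is still high, scores pairwise-product candidate features by a standardized mean gap. The function name, the partitioning scheme, the nearest-centroid baseline, and the product operator are all assumptions for illustration, not the authors' method.

```python
import numpy as np
from itertools import combinations

def local_feature_construction(X, y, n_subspaces=4, error_threshold=0.1, seed=0):
    """Illustrative divide-and-conquer feature construction (binary 0/1 labels).

    Divide: partition examples into subspaces via a random-projection split.
    Conquer: inside each subspace whose local nearest-centroid error exceeds
    `error_threshold`, score candidate constructed features (pairwise products
    of original features) by a standardized class-mean gap and keep the best.
    Returns X augmented with the union of locally selected features.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Divide: project onto a random direction, then equal-size bins -> subspaces.
    proj = X @ rng.normal(size=d)
    subspaces = np.array_split(np.argsort(proj), n_subspaces)

    selected = set()
    for idx in subspaces:
        Xs, ys = X[idx], y[idx]
        if len(np.unique(ys)) < 2:
            continue  # nothing to separate locally
        # Local error of a nearest-class-centroid baseline.
        centroids = {c: Xs[ys == c].mean(axis=0) for c in np.unique(ys)}
        pred = np.array([min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
                         for x in Xs])
        if np.mean(pred != ys) <= error_threshold:
            continue  # this subspace is already predicted well enough
        # Conquer: evaluate candidate product features locally, keep the best one.
        best, best_score = None, 0.0
        for i, j in combinations(range(d), 2):
            f = Xs[:, i] * Xs[:, j]
            gap = abs(f[ys == 0].mean() - f[ys == 1].mean())
            score = gap / (f.std() + 1e-12)  # standardized mean gap
            if score > best_score:
                best, best_score = (i, j), score
        if best is not None:
            selected.add(best)

    new_cols = [X[:, i] * X[:, j] for i, j in sorted(selected)]
    return (np.column_stack([X] + new_cols) if new_cols else X), sorted(selected)
```

On XOR-style data (class = sign of x0·x1), where no single original feature separates the classes, this sketch discovers the product feature that does, matching the abstract's point that locally evaluated constructed features can reveal structure invisible in the original space.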

Original language: English
Pages: 629-640
Number of pages: 12
State: Published - 1 Dec 2010
Event: 10th SIAM International Conference on Data Mining, SDM 2010 - Columbus, OH, United States
Duration: 29 Apr 2010 - 1 May 2010

Other

Name: 10th SIAM International Conference on Data Mining, SDM 2010
Country: United States
City: Columbus, OH
Period: 29/04/10 - 1/05/10

Fingerprint

  • Learning algorithms
  • Feature extraction

Keywords

  • Accuracy improvement
  • Automatic
  • Efficiency
  • Feature construction

Cite this

Fan, W., Zhong, E., Peng, J., Verscheure, O., Zhang, K., Ren, J., ... Yang, Q. (2010). Generalized and heuristic-free feature construction for improved accuracy. 629-640. Paper presented at 10th SIAM International Conference on Data Mining, SDM 2010, Columbus, OH, United States.
Scopus record: http://www.scopus.com/inward/record.url?scp=78651326189&partnerID=8YFLogxK