Abstract
State-of-the-art learning algorithms accept data in feature vector format as input. Examples belonging to different classes may not always be easy to separate in the original feature space. One may ask: can transforming existing features into a new space reveal significant discriminative information that is not obvious in the original space? First, since there is an infinite number of ways to extend features, it is impractical to first enumerate candidates and then perform feature selection. Second, evaluating discriminative power on the complete dataset is not always optimal, because features that are highly discriminative on a subset of examples may not be significant when evaluated on the entire dataset. Third, feature construction ought to be automated and general, so that it requires no domain knowledge and its accuracy improvement holds across a large number of classification algorithms. In this paper, we propose a framework that addresses these problems through the following steps: (1) divide-and-conquer to avoid exhaustive enumeration; (2) local feature construction and evaluation within subspaces of examples where the local error is still high and the features constructed so far do not yet predict well; (3) a weighting-rule-based search that is free of domain knowledge and has a provable performance guarantee. Empirical studies indicate that significant improvement (as much as 9% in accuracy and 28% in AUC) is achieved with the newly constructed features over a variety of inductive learners evaluated on a number of balanced, skewed, and high-dimensional datasets. Software and datasets are available from the authors.
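The abstract describes the framework only at a high level. The sketch below is a minimal, hypothetical illustration of the divide-and-conquer / local-evaluation idea, not the authors' actual algorithm: the operator pool (pairwise products), the base learner used to measure local error (logistic regression), the AUC-based selection, and all thresholds (`max_depth`, `err_threshold`, the minimum subspace size of 20) are assumptions made purely for illustration, and the paper's weighting-rule-based search is not reproduced here.

```python
# Hypothetical sketch only: names, operators, and thresholds are illustrative,
# not the method proposed in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def candidate_features(X):
    """Enumerate a small pool of constructed features (here: pairwise products)."""
    d = X.shape[1]
    return [("x%d*x%d" % (i, j), X[:, i] * X[:, j])
            for i in range(d) for j in range(i + 1, d)]

def local_error(X, y):
    """Training error of a simple base learner on one subspace of examples."""
    if len(np.unique(y)) < 2:
        return 0.0
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return 1.0 - clf.score(X, y)

def construct(X, y, depth=0, max_depth=3, err_threshold=0.1):
    """Divide-and-conquer: only construct and evaluate new features inside
    subspaces of examples whose local error is still high."""
    if depth >= max_depth or X.shape[1] < 2 or local_error(X, y) <= err_threshold:
        return []                        # this subspace is already predicted well enough

    def auc(nf):
        a = roc_auc_score(y, nf[1])
        return max(a, 1.0 - a)           # direction of the feature does not matter

    # keep the locally most discriminative constructed feature (by AUC)
    name, values = max(candidate_features(X), key=auc)
    found = [name]

    # split the subspace on the median of the new feature and recurse
    mask = values > np.median(values)
    for part in (mask, ~mask):
        if part.sum() > 20:              # skip tiny subspaces
            found += construct(X[part], y[part], depth + 1, max_depth, err_threshold)
    return found
```

In use, the returned feature names would be deduplicated, materialized on the full training set, and appended to the original feature vectors before training any downstream learner.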
Original language | English |
---|---|
Pages | 629-640 |
Number of pages | 12 |
DOIs | |
State | Published - 2010 |
Event | 10th SIAM International Conference on Data Mining, SDM 2010 - Columbus, OH, United States. Duration: 29 Apr 2010 → 1 May 2010 |
Other
Other | 10th SIAM International Conference on Data Mining, SDM 2010 |
---|---|
Country/Territory | United States |
City | Columbus, OH |
Period | 29/04/10 → 1/05/10 |
Keywords
- Accuracy improvement
- Automatic
- Efficiency
- Feature construction