Dimensionality reduction with unsupervised feature selection and applying non-Euclidean norms for classification accuracy

Amit Saxena, John Wang

Research output: Contribution to journal › Article › Research › peer-review

5 Citations (Scopus)

Abstract

This article presents a two-phase scheme that selects a reduced number of features from a dataset using a Genetic Algorithm (GA) and then tests the classification accuracy (CA) obtained with the reduced feature set. In the first phase, an unsupervised approach to feature subset selection is applied: the GA stochastically selects reduced feature subsets using the Sammon error as the fitness function, yielding several candidate subsets. In the second phase, each reduced feature set is used to test the CA of the dataset, validated with the supervised k-nearest neighbor (k-NN) algorithm. The novelty of the proposed scheme is that each reduced feature set obtained in the first phase is evaluated for CA using k-NN classification with different Minkowski metrics, i.e. non-Euclidean norms, instead of the conventional Euclidean norm (L2). Final results are presented with extensive simulations on seven real and one synthetic dataset. The investigation reveals that using different norms produces better CA and hence offers scope for better feature subset selection.
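The two-phase pipeline described in the abstract can be outlined as a short Python sketch. This is a minimal illustration only, assuming SciPy and scikit-learn as stand-ins for the authors' implementation; the GA encoding, crossover and mutation operators, and the parameter values (population size, number of generations, mutation rate, k, and the set of Minkowski exponents p) are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of the two-phase scheme (illustrative assumptions, not the paper's code).
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def sammon_error(X, mask, eps=1e-12):
    """Sammon error between pairwise distances in the full feature space
    and in the subspace given by the boolean feature mask (used as GA fitness)."""
    d_full = pdist(X)            # distances over all features
    d_sub = pdist(X[:, mask])    # distances over the selected features only
    d_full = np.where(d_full < eps, eps, d_full)
    return np.sum((d_full - d_sub) ** 2 / d_full) / np.sum(d_full)


def ga_select_features(X, n_keep, pop_size=30, generations=50, seed=None):
    """Phase 1 (unsupervised): evolve boolean masks with exactly n_keep features,
    minimizing the Sammon error; class labels are never used here."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]

    def random_mask():
        m = np.zeros(n_feat, dtype=bool)
        m[rng.choice(n_feat, size=n_keep, replace=False)] = True
        return m

    pop = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        scores = np.array([sammon_error(X, m) for m in pop])
        order = np.argsort(scores)
        parents = [pop[i] for i in order[: pop_size // 2]]        # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.choice(len(parents), size=2, replace=False)
            union = np.flatnonzero(parents[a] | parents[b])       # crossover on the union
            child = np.zeros(n_feat, dtype=bool)
            child[rng.choice(union, size=n_keep, replace=False)] = True
            if rng.random() < 0.1 and n_keep < n_feat:            # mutation: swap one feature
                on, off = np.flatnonzero(child), np.flatnonzero(~child)
                child[rng.choice(on)] = False
                child[rng.choice(off)] = True
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda m: sammon_error(X, m))


def evaluate_norms(X, y, mask, p_values=(1, 2, 3, 4), k=5):
    """Phase 2 (supervised): k-NN classification accuracy on the reduced feature
    set under different Minkowski norms (p = 2 is the usual Euclidean case)."""
    return {p: cross_val_score(KNeighborsClassifier(n_neighbors=k, p=p),
                               X[:, mask], y, cv=5).mean()
            for p in p_values}

A call such as evaluate_norms(X, y, ga_select_features(X, n_keep=5)) would report the accuracy of the reduced subset under each norm; comparing these values against the p = 2 baseline mirrors the kind of comparison reported in the article.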

Original language: English
Pages (from-to): 22-40
Number of pages: 19
Journal: International Journal of Data Warehousing and Mining
Volume: 6
Issue number: 2
DOI: 10.4018/jdwm.2010040102
State: Published - 1 Apr 2010

Fingerprint

Feature extraction
Set theory
Genetic algorithms
Testing

Keywords

  • Classification
  • Data mining
  • Feature analysis
  • Genetic algorithms
  • Minkowski metric

Cite this

@article{9d156074717d461392b47de24e14571c,
title = "Dimensionality reduction with unsupervised feature selection and applying non-Euclidean norms for classification accuracy",
abstract = "This article presents a two-phase scheme to select reduced number of features from a dataset using Genetic Algorithm (GA) and testing the classification accuracy (CA) of the dataset with the reduced feature set. In the first phase of the proposed work, an unsupervised approach to select a subset of features is applied. GA is used to select stochastically reduced number of features with Sammon Error as the fitness function. Different subsets of features are obtained. In the second phase, each of the reduced features set is applied to test the CA of the dataset. The CA of a data set is validated using supervised k-nearest neighbor (k-nn) algorithm. The novelty of the proposed scheme is that each reduced feature set obtained in the first phase is investigated for CA using the k-nn classification with different Minkowski metric i.e. non-Euclidean norms instead of conventional Euclidean norm (L2). Final results are presented in the article with extensive simulations on seven real and one synthetic, data sets. It is revealed from the proposed investigation that taking different norms produces better CA and hence a scope for better feature subset selection.",
keywords = "Classification, Data mining, Feature analysis, Genetic algorithms, Minkowski metric",
author = "Amit Saxena and John Wang",
year = "2010",
month = "4",
day = "1",
doi = "10.4018/jdwm.2010040102",
language = "English",
volume = "6",
pages = "22--40",
journal = "International Journal of Data Warehousing and Mining",
issn = "1548-3924",
publisher = "IGI Publishing",
number = "2",

}
