Morisita-Based Feature Selection for Regression Problems

Jean Golay, Michael Leuenberger and Mikhail Kanevski
University of Lausanne - Institute of Earth Surface Dynamics (IDYST), UNIL-Mouline, 1015 Lausanne - Switzerland

Abstract. Data acquisition, storage and management have greatly improved, while the factors governing many phenomena remain poorly known. Consequently, irrelevant and redundant features artificially increase the size of datasets, which complicates learning tasks such as regression. To address this problem, feature selection methods have been proposed. This research introduces a new supervised filter based on the Morisita estimator of intrinsic dimension. The algorithm is simple and does not rely on arbitrary parameters. It is applied to both synthetic and real data, and a comparison with a wrapper based on extreme learning machine is conducted.

1 Introduction

In data mining, it is often not known a priori which features (or input variables) are truly necessary to capture the main characteristics of a studied phenomenon. This lack of knowledge implies that many of the considered features are irrelevant or redundant. They artificially increase the dimension E of the Euclidean space in which the data are embedded (E equals the number of features). This is a serious matter, since fast improvements in data acquisition, storage and management cause the number of redundant and irrelevant features to increase. As a consequence, unless the sample size N grows exponentially with E, the curse of dimensionality is likely to reduce the overall accuracy of the results yielded by any learning algorithm. Besides, large N and E are also difficult to deal with because of computer performance limitations.

In regression, these issues are often addressed by implementing supervised feature selection methods [1]. They can be broadly subdivided into filters and wrappers: filters do not use any evaluation criterion involving a learning machine, while wrappers do. Both approaches can be combined with search strategies, since an exhaustive exploration of the 2^E - 1 possible models (all the combinations of features) is often computationally infeasible. Greedy strategies, such as Sequential Forward Selection (SFS), can be distinguished from randomized ones.

The present paper deals with a new SFS filter algorithm relying on Morisita-based estimates of the intrinsic dimension, M_m, of data [2, 3]. The Morisita estimator of Intrinsic Dimension (ID) is closely related to fractal theory, and M_m (≤ E) can be interpreted as the dimension of the space where the points of a dataset truly reside (i.e. the data manifold [4]). The proposed algorithm is supervised and designed for regression problems. It does not make use of any threshold, unlike what can be found in related works [5, 6], and it keeps the simplicity of the Fractal Dimension Reduction (FDR) algorithm introduced in [7]. The Morisita estimator of ID is presented in Section 2, Section 3 introduces the Morisita-based filter, and Section 4 is devoted to numerical experiments conducted on both synthetic and real data. A comparison with a wrapper combining Extreme Learning Machine (ELM) [8] and an exhaustive search strategy is also carried out.

2 The Morisita Estimator of Intrinsic Dimension

The Morisita estimator of intrinsic dimension, M_m, is based on the multipoint Morisita index I_{m,δ} [2, 3] (named after Masaaki Morisita, who proposed the first version of the index). I_{m,δ} is computed by superimposing a grid of Q quadrats of diagonal size δ onto the data points. It measures how many times more likely it is that m (m ≥ 2) points selected at random will fall in the same quadrat than it would be if the N points of the studied dataset were distributed according to a random distribution generated from a Poisson process (i.e. complete spatial randomness). The formula is the following:

    I_{m,δ} = Q^(m-1) · [ Σ_{i=1}^{Q} n_i (n_i - 1)(n_i - 2) ⋯ (n_i - m + 1) ] / [ N (N - 1)(N - 2) ⋯ (N - m + 1) ]

where n_i is the number of points in the i-th quadrat. For a fixed value of m, I_{m,δ} is calculated for a chosen range of δ values. If a dataset follows a fractal behaviour (i.e. is self-similar), the plot relating log(I_{m,δ}) to log(1/δ) is linear and its slope is defined as the Morisita slope S_m. Finally, M_m is expressed as:

    M_m = E - S_m / (m - 1)

In practice, each variable is rescaled to [0, 1] and the R chosen values of δ can be replaced with the edge lengths ℓ of the quadrats. In the rest of this paper, only M_2 will be used, and it will be computed with an algorithm called Morisita INDex for Intrinsic Dimension estimation (MINDID) [3], whose complexity is O(N · E · R).
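
To make the estimator concrete, the following sketch computes M_m for data rescaled to [0, 1]^E. It is only a minimal illustration of the idea behind MINDID, not the authors' implementation (which was written in R); the function name morisita_id and the default grid resolutions are assumptions made for the example.

import numpy as np

def morisita_id(X, ell_inv=(5, 10, 15, 20), m=2):
    """Minimal sketch of a Morisita-based intrinsic dimension estimate.

    X       : (N, E) array with every column rescaled to [0, 1].
    ell_inv : reciprocals 1/ell of the quadrat edge lengths (grid resolutions).
    Returns M_m = E - S_m / (m - 1), where S_m is the slope of
    log(I_{m,ell}) versus log(1/ell).
    """
    X = np.asarray(X, dtype=float)
    N, E = X.shape
    log_inv_ell, log_I = [], []
    for q in ell_inv:                          # q = 1/ell quadrats per axis
        # Assign every point to a grid cell of edge length ell = 1/q
        cells = np.minimum((X * q).astype(int), q - 1)
        _, counts = np.unique(cells, axis=0, return_counts=True)
        n = counts.astype(float)
        Q = float(q) ** E                      # total number of quadrats
        # Multipoint Morisita index; for m = 2 this reduces to
        # I_{2,ell} = Q * sum_i n_i (n_i - 1) / (N (N - 1))
        num = np.sum(np.prod([n - k for k in range(m)], axis=0))
        den = np.prod([N - k for k in range(m)])
        I = Q ** (m - 1) * num / den
        if I > 0:
            log_inv_ell.append(np.log(q))
            log_I.append(np.log(I))
    # Morisita slope from the log-log plot, then the ID estimate
    S_m = np.polyfit(log_inv_ell, log_I, 1)[0]
    return E - S_m / (m - 1)

As a quick sanity check, morisita_id applied to points filling the unit square should return a value close to 2, while points lying on a line should give a value close to 1.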

3 The Morisita-based Filter for Regression Problems

The Morisita-Based Filter for Regression (MBFR) relies on three observations following from the works by Traina et al. [7] and De Sousa et al. [5]:

1. Given an output variable Y generated from k relevant and non-redundant input variables X1, ..., Xk, and letting ID(·) denote the intrinsic dimension of a dataset, one has that:

   ID(X1, ..., Xk, Y) - ID(X1, ..., Xk) ≈ 0

2. Given i irrelevant input variables I1, ..., Ii completely independent of Y, one has that:

   ID(I1, ..., Ii, Y) - ID(I1, ..., Ii) ≈ ID(Y)

3. Given a randomly selected subset of X1, ..., Xk of size r with 0 < r < k and k > 1, j redundant input variables J1, ..., Jj related to some or all of X1, ..., Xr, and all the i irrelevant input variables I1, ..., Ii, one has that:

   ID(X1, ..., Xr, J1, ..., Jj, I1, ..., Ii, Y) - ID(X1, ..., Xr, J1, ..., Jj, I1, ..., Ii) ≈ H

   where H ∈ ]0, ID(Y)[ and H decreases to 0 as r increases to k.

The difference ID(features, Y) - ID(features) can thus be suggested as a way of measuring the dissimilarity (i.e. the independence) between Y and the selected features, among which only the relevant ones (i.e. the non-redundant features on which Y depends) contribute to reducing the dissimilarity. Based on that property, MBFR (see Algorithm 1) aims at retrieving the relevant features available in a dataset by sorting each subset of variables according to its dissimilarity with Y. MBFR implements an SFS search strategy and relies on the MINDID algorithm [3] for the computation of M_2. Its complexity is O(N · E^3 · R).

Algorithm 1 MBFR
INPUT: a dataset A with f features F1, ..., Ff and one output variable Y; a vector L of ℓ values; two empty vectors of length f, SelF and DissF, for storing, respectively, the names of the selected features and the dissimilarities; an empty matrix Z for storing the selected features.
OUTPUT: SelF and DissF.
1: Rescale each feature and Y to [0, 1].
2: for i = 1 to f do
3:    for j = 1 to (f + 1 - i) do
4:       Diss_j = ID(Z, Fj, Y) - ID(Z, Fj)   (MINDID is used with L)
5:    end for
6:    Store in SelF[i] the name of the Fj yielding the lowest Diss_j.
7:    Store this Diss_j in DissF[i].
8:    Remove the corresponding Fj from A and add it into Z.
9: end for
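
As an illustration only, the selection loop of Algorithm 1 can be sketched in a few lines of Python; the sketch assumes the hypothetical morisita_id helper introduced above and outlines the SFS scheme rather than the authors' R implementation.

import numpy as np

def mbfr(X, y, ell_inv=(5, 10, 15, 20)):
    """Sketch of the Morisita-Based Filter for Regression (SFS scheme).

    X : (N, f) feature matrix, y : (N,) output variable.
    Returns the feature indices in selection order together with the
    dissimilarities Diss = ID(Z, Fj, Y) - ID(Z, Fj).
    """
    def rescale(a):
        a = np.asarray(a, dtype=float)
        return (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))

    X, y = rescale(X), rescale(y).reshape(-1, 1)   # step 1 of Algorithm 1
    remaining = list(range(X.shape[1]))
    Z = np.empty((X.shape[0], 0))                  # already selected features
    sel_order, dissimilarities = [], []
    while remaining:
        diss = []
        for j in remaining:
            F_j = X[:, [j]]
            diss.append(morisita_id(np.hstack([Z, F_j, y]), ell_inv)
                        - morisita_id(np.hstack([Z, F_j]), ell_inv))
        best = int(np.argmin(diss))                # feature with lowest Diss
        sel_order.append(remaining[best])
        dissimilarities.append(diss[best])
        Z = np.hstack([Z, X[:, [remaining[best]]]])
        del remaining[best]
    return sel_order, dissimilarities

Plotting the recorded dissimilarities against the selection order gives curves of the kind discussed in the next section: the values drop from about ID(Y) towards their minimum as the relevant features are picked first.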

4 Experimental Study

4.1 Synthetic Data

The synthetic dataset was constructed as follows. An output variable Y was generated from two uniformly distributed input variables X1 and X2 (see Figure 1) by using an Artificial Neural Network (ANN) with one hidden layer of ten neurons, a sigmoid transfer function, randomly generated weights and no biases. Three redundant (J) and three irrelevant (I) input variables were also included: J3 = log(X1 + 5), J4 = X1 · X2, J5 = X1^4 · X2^4, I6 is uniformly distributed, I7 = log(I6 + 5) and I8 = I6 + I7. Multiple simulations were generated with a fixed sample size N and with the same ANN weights.

Fig. 1: (left) The functional relationship between the dependent variable Y and the relevant features X1 and X2; (right) MBFR applied to one simulation of the synthetic dataset.
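
The construction can be reproduced in outline as follows; the sample size, the range of the uniform inputs, the weight interval and the noise level are illustrative placeholders rather than the exact values used in the paper.

import numpy as np

rng = np.random.default_rng(0)
N = 1000                                   # placeholder sample size

# Two relevant, uniformly distributed inputs (placeholder range)
X1 = rng.uniform(-4.0, 4.0, N)
X2 = rng.uniform(-4.0, 4.0, N)

# One hidden layer of ten sigmoid neurons, random weights, no biases
W_in = rng.uniform(-1.0, 1.0, (2, 10))     # placeholder weight interval
w_out = rng.uniform(-1.0, 1.0, 10)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Y = sigmoid(np.column_stack([X1, X2]) @ W_in) @ w_out
# Y = Y + rng.normal(0.0, 0.05, N)         # optional Gaussian noise (placeholder sd)

# Three redundant (J) and three irrelevant (I) input variables
J3 = np.log(X1 + 5.0)
J4 = X1 * X2
J5 = X1**4 * X2**4
I6 = rng.uniform(-4.0, 4.0, N)
I7 = np.log(I6 + 5.0)
I8 = I6 + I7

features = np.column_stack([X1, X2, J3, J4, J5, I6, I7, I8])
# sel_order, diss = mbfr(features, Y)      # using the sketch from Section 3

Running the mbfr sketch on such a dataset should select X1 and X2 first and bring the dissimilarity close to 0, in line with the behaviour reported below for Figure 1.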

The MBFR algorithm was applied with ℓ^{-1} (the reciprocal of the quadrat edge length) ranging upwards from 5. The right panel of Figure 1 shows the result of one simulation run: X1 and X2 are easily identified as the relevant features, since they reduce the dissimilarity from ID(Y) to about 0. The same conclusion can be drawn from Figure 2, which presents the effect of three levels of Gaussian noise on three sets of simulations; the minimum dissimilarity increased with the standard deviation of the noise. Finally, if Y is shuffled (i.e. its dependence on every feature is broken), the dissimilarities stay close to ID(Y), as shown in Figure 3. In Figures 2 and 3, the names of the redundant and irrelevant features were replaced with letters because their rank was not stable over the simulations. It is also worth mentioning that the corresponding dissimilarities do not fluctuate around ID(Y); this might be related to the increasing dimensionality of the considered space and to the way the dissimilarities are computed (i.e. a subtraction between the IDs of two datasets embedded in spaces of different dimension E).

Fig. 2: MBFR applied to simulations of the synthetic dataset with different levels of Gaussian noise.

Fig. 3: MBFR applied to simulations of the synthetic data with shuffled Y.

4.2 Real Data

The concrete dataset, downloaded from the UCI machine learning repository, was used. The MBFR algorithm was applied with ℓ^{-1} ranging from 2 to 13, and the selected features turned out to be (see Figure 4): Age, Blast FS, Cement and Superplasticizer. The computation lasted 1.85 seconds using R (Intel Core i7-3770 CPU @ 3.40 GHz with 16.0 GB of RAM under Windows 8).

In order to compare and assess the results obtained with the MBFR algorithm, a wrapper method based on ELM [8] and relying on an exhaustive search was applied: (a) a 5-fold cross-validation was used, with 1 fold iteratively allocated to the validation set and the remaining 4 folds assigned to the training set; (b) for each of the 2^E - 1 subsets of features, ELM models with varying numbers of hidden nodes (the only hyper-parameter of ELM) were trained and evaluated; (c) the model showing the minimal Mean Squared Error (MSE) was selected. By iterating this process (a to c), multiple evaluations of the 2^E - 1 subsets of features were carried out, and the rank of the subset selected by the MBFR algorithm was recorded for each of them. The resulting histogram is displayed in Figure 4, and it highlights that the considered subset of features is one of the best from an ELM perspective.

Fig. 4: (left) MBFR applied to the concrete dataset; (right) Histogram of the model ranks based on ELM.
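
A bare-bones version of such a wrapper could be sketched as follows; the random-projection-plus-least-squares model below stands in for the ELM of [8], and the hidden-layer sizes and fold handling are illustrative choices rather than the exact experimental setup.

import numpy as np
from itertools import combinations

def elm_cv_mse(X, y, n_hidden, n_folds=5, seed=0):
    """5-fold cross-validated MSE of a basic ELM (random hidden layer,
    least-squares output weights)."""
    rng = np.random.default_rng(seed)
    N, E = X.shape
    W = rng.normal(size=(E, n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                     # random hidden-layer outputs
    folds = np.array_split(rng.permutation(N), n_folds)
    mse = []
    for k in range(n_folds):
        val = folds[k]
        tr = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        beta, *_ = np.linalg.lstsq(H[tr], y[tr], rcond=None)
        mse.append(np.mean((y[val] - H[val] @ beta) ** 2))
    return np.mean(mse)

def exhaustive_wrapper(X, y, hidden_grid=(5, 10, 20, 40)):
    """Rank every non-empty feature subset (2^E - 1 of them) by its best
    cross-validated ELM MSE over the grid of hidden-layer sizes."""
    E = X.shape[1]
    scores = {}
    for r in range(1, E + 1):
        for subset in combinations(range(E), r):
            cols = list(subset)
            scores[subset] = min(elm_cv_mse(X[:, cols], y, h) for h in hidden_grid)
    return sorted(scores, key=scores.get)      # subsets from best to worst

For the concrete data, which has 8 input features, this amounts to 2^8 - 1 = 255 candidate subsets per run, so the exhaustive search remains tractable; the position of the MBFR subset in the returned ranking corresponds to the rank recorded in the histogram of Figure 4.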
ations). The resulting histogram is displayed in Figure 4 and it highlights that the considered subset of features is one of the best from an ELM perspective. 5 Conclusion 4 3 Frequency 5 6 A. e A. A. rs oa C y ne Fi Fl l. W at er Su pe rp t. en tf.s em C e Bl as Ag 1.8.6.4. ID(Y) Min.. 1. MBFR combines the advantages of the existing algorithms of fractal feature selection, while it is designed for regression tasks. It was applied to both synthetic and real datasets. The results are coherent with those yielded by the ELM-based wrapper and with the work by Traina et al. [7] and De Sousa et al. [5]. In future research, MBFR will be applied to challenging datasets and the problem of the sensitivity of the algorithm to the choice of the scale range will be addressed. 5 1 15 Rank of the Model 5 Fig. 4: (left) MBFR applied to the concrete dataset; (right) Histogram of the model ranks based on ELM. References [1] I. Guyon, S. Gunn, M. Nikravesh and L. A. Zadeh, editors Feature Extraction: Foundations and Applications, Springer, Berlin, 6. [] J. Golay, M. Kanevski, C. D. Vega Orozco and M. Leuenberger, The multipoint Morisita index for the analysis of spatial patterns, Physica A, 46:191-, Elsevier, 14. [3] J. Golay and M. Kanevski, A new estimator of intrinsic dimension based on the multipoint Morisita index, arxiv:148.369, 14. [4] J. A. Lee and M. Verleysen. Nonlinear Dimensionality Reduction, Springer, New York, 7. [5] E. P. M. De Sousa, C. Traina Jr., A. J. M. Traina, L. Wu and C. Faloutsos, A fast and effective method to find correlations among attributes in databases, Data Mining and Knowledge Discovery, 14:367-47, Springer, 7. [6] D. Mo and S. H. Huang, Fractal-based intrinsic dimension estimation and its application in dimensionality reduction, IEEE Transactions on Knowledge and Data Engineering, 4(1):59-71, IEEE Computer Society, 1. [7] C. Traina Jr., A. J. M. Traina, L. Wu and C. Faloutsos, Fast feature selection using fractal dimension. Proceedings of the 15th Brazilian Symposium on Databases(SBBD ), October 15-19 8-3, Joa o Pessoa (Brazil),. [8] G.-B. Huang, Q.-Y. Zhu and C.-K. Siew, Extreme learning machine: Theory and applications, Neurocomputing, 7(1-3):489-51, 6. 84