The Automation of the Feature Selection Process. Ronen Meiri & Jacob Zahavi


Automated Data Science http://www.kdnuggets.com/2016/03/automated-data-science.html

Outline
- The feature selection problem
- Objective of study
- Review of feature selection methods used in our study:
  - Statistical methods: stepwise regression
  - Stochastic search methods: simulated annealing
  - Feature reduction methods: principal component analysis
- Databases and set-up of study
- Results
- Conclusions

Characteristics of the analysis problem
- Many observations, on the order of millions or more
- Tens to hundreds (or thousands) of potential predictors (curse of dimensionality)
- Many predictors are redundant, others are noisy, and some are irrelevant
- Rare events

Feature Selection
- The process of selecting an optimal subset of features
- Objectives:
  - Distinguish informative and predictive features from coincidental features
  - Improve prediction accuracy (best fit)
  - Reduce bias (no overfitting)

Objective of Study
- Seeking the best feature selection method for linear regression
- Methods tested:
  - Statistical methods (StepWise Regression)
  - Stochastic search methods (Simulated Annealing)
  - Feature reduction methods (Principal Component Analysis and Radial Basis Functions)
- Objective: maximize the goal function on the validation dataset

StepWise Regression (SWR)
- The process introduces and eliminates predictors based on the F-distribution
- Significance level for entering variables: F-to-Enter
- Significance level for removing variables: F-to-Remove, where F-to-Enter < F-to-Remove
- Default (per-comparison) values in most packages: F-to-Enter = 5%, F-to-Remove = 10%
- Two alternative approaches have been devised to control the level of significance (K = number of candidate predictors, m = step index):
  - Bonferroni correction: α* = α / K
  - False Discovery Rate (FDR): α_m = m·α / K
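A minimal sketch (not the authors' implementation) of greedy forward selection with an FDR-style entry threshold m·α/K, using statsmodels to obtain the p-value of each candidate term; the synthetic data at the bottom are only for demonstration:

```python
# Sketch of forward stepwise selection with an FDR-adjusted entry threshold.
import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y, alpha=0.05):
    """At step m, admit the best remaining predictor only if its
    p-value passes the FDR-style bound m*alpha/K."""
    K = X.shape[1]
    selected, remaining = [], list(range(K))
    while remaining:
        pvals = {}
        for j in remaining:
            cols = selected + [j]
            model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            pvals[j] = model.pvalues[-1]          # p-value of the newly added term
        best = min(pvals, key=pvals.get)
        m = len(selected) + 1                     # step index
        if pvals[best] <= m * alpha / K:          # FDR-adjusted F-to-Enter
            selected.append(best)
            remaining.remove(best)
        else:
            break
    return selected

# Demonstration on synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = 2 * X[:, 0] - 3 * X[:, 3] + rng.normal(size=500)
print(forward_stepwise(X, y))
```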

Simulated Annealing (SA) Simulates the annealing process coming from condensed matter physics A solid is heated in a heat bath At sufficiently high temperature, the solid is liquefied By slowly cooling down the temperature, system attains a thermal equilibrium and system re-arrange in a lower-energy state As temperature goes to zero, system converges to its ground level state (minimum energy) Converges to global optimum or best set of features
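An illustrative sketch of simulated annealing over feature subsets, assuming a geometric cooling schedule and validation R² as the goal function (both are assumptions of this sketch; the study evaluates several objective criteria and cooling rates):

```python
# Sketch of simulated annealing over feature subsets with a validation goal function.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def sa_feature_selection(X_tr, y_tr, X_va, y_va, T0=1.0, cooling=0.95, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    K = X_tr.shape[1]
    mask = rng.random(K) < 0.5                       # random starting subset

    def goal(m):
        if not m.any():
            return -np.inf
        model = LinearRegression().fit(X_tr[:, m], y_tr)
        return r2_score(y_va, model.predict(X_va[:, m]))   # validation goal

    best_mask, best_val = mask.copy(), goal(mask)
    cur_val, T = best_val, T0
    for _ in range(steps):
        cand = mask.copy()
        cand[rng.integers(K)] ^= True                # flip one feature in or out
        val = goal(cand)
        # accept better moves always, worse moves with Boltzmann probability
        if val > cur_val or rng.random() < np.exp((val - cur_val) / T):
            mask, cur_val = cand, val
            if val > best_val:
                best_mask, best_val = cand.copy(), val
        T *= cooling                                 # slowly cool the system
    return best_mask, best_val
```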

Feature Reduction Methods
- Also known as feature extraction methods
- Represent the information hidden in the original variables by fewer features
- Principal Component Analysis (PCA):
  - A variable reduction method
  - Seeks to remove multicollinearity by using weighted sums of the original predictors to create new features that are uncorrelated (orthogonal)
- Radial Basis Functions (RBF):
  - Kernel functions, usually Gaussian, placed on the density of observations
  - A neural-network (NN) type model
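A sketch of the two reduction ideas, replacing the original predictors either with principal components or with Gaussian radial basis features; choosing the RBF centres by k-means and setting the width by a median-distance heuristic are assumptions of this sketch, not details from the study:

```python
# Sketch of PCA and Gaussian-RBF feature reduction for a downstream linear model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def pca_features(X_train, X_valid, n_components=10):
    pca = PCA(n_components=n_components).fit(X_train)
    return pca.transform(X_train), pca.transform(X_valid)

def rbf_features(X_train, X_valid, n_centres=10, seed=0):
    km = KMeans(n_clusters=n_centres, random_state=seed, n_init=10).fit(X_train)
    width = np.median(km.transform(X_train))         # median-distance width heuristic
    phi = lambda X: np.exp(-(km.transform(X) ** 2) / (2 * width ** 2))
    return phi(X_train), phi(X_valid)

# Either reduced representation is then fed to an ordinary linear regression, e.g.:
# Z_tr, Z_va = pca_features(X_tr, X_va, n_components=10)
# model = LinearRegression().fit(Z_tr, y_tr)
```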

Datasets

Name        # Obs.    # Resp.   # Pred.   Response
Non Prof.    99,200    27,208      307    Binary
Specialty   106,284     5,758      380    Continuous
Gift        101,284     9,707      104    Counter

Each dataset was split randomly into training and validation datasets.
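A minimal sketch of such a random split (the 50/50 proportion is an assumption; the slide does not state the split ratio):

```python
# Sketch of a random training/validation split for one dataset.
from sklearn.model_selection import train_test_split

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)
```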

Evaluation Metrics
- Number of predictors in the final model
- Gini coefficient, in the training dataset, in the validation dataset, and their ratio
  - The Gini coefficient is the area between the predictive model's gains curve and the random (null) model; it is sometimes multiplied by two to yield a measure between 0 and 1
- M-L: the maximum lift, where Lift = (% response in PM selection) / (% response in all data)
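A sketch of how these two metrics could be computed from ranked validation scores; the scaling conventions and the minimum selection fraction used for the lift are assumptions of this sketch:

```python
# Sketch of Gini coefficient and maximum lift from predicted scores.
import numpy as np

def gini_and_max_lift(y_true, scores, min_frac=0.05):
    order = np.argsort(-scores)                    # rank customers by descending score
    y = np.asarray(y_true, dtype=float)[order]
    n = len(y)
    frac_sel = np.arange(1, n + 1) / n             # random (null) model gains
    cum_resp = np.cumsum(y) / y.sum()              # model gains curve
    gini = np.mean(cum_resp - frac_sel)            # area between gains curve and diagonal
    lift = (np.cumsum(y) / np.arange(1, n + 1)) / y.mean()
    ok = frac_sel >= min_frac                      # ignore degenerate tiny selections
    return gini, lift[ok].max()
```

At each cutoff, lift compares the response rate within the selected customers to the overall response rate; the Gini rewards models whose gains curve rises quickly above the diagonal.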

Set-Up of Study
- Approach: compare the best-performing model within each class of models
  - SWR was calibrated based on FDR
  - The SA configuration was selected from among 60 parameter combinations (5 objective criteria, 4 cooling rates and 3 confidence intervals)
  - For PCA, we sought the optimal number of PCs in the range 1-50
  - For RBF, we sought the optimal number of radial bases in the range 2-51 (1-50 DF)
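For concreteness, a sketch of how such a configuration grid could be enumerated; the objective names and numeric values below are placeholders, only the 5 × 4 × 3 = 60 count comes from the slide:

```python
# Sketch of enumerating the SA parameter grid (values are illustrative placeholders).
from itertools import product

objectives = ["gini_train", "gini_valid", "gini_ratio", "max_lift", "n_predictors"]  # assumed names
cooling_rates = [0.90, 0.93, 0.95, 0.99]                                             # assumed values
conf_intervals = [0.90, 0.95, 0.99]                                                  # assumed values

configs = list(product(objectives, cooling_rates, conf_intervals))
assert len(configs) == 60
```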

Gains Chart - "Non-Profit" file: [figure] cumulative buyers (y-axis) vs. number of customers selected (x-axis) for the SA, FDR, PCA and RBF models.

Gains Chart - "Specialty" file: [figure] cumulative buyers (y-axis) vs. number of customers selected (x-axis) for the SA, FDR, PCA and RBF models.

Gains Chart - "Gift" file: [figure] cumulative buyers (y-axis) vs. number of customers selected (x-axis) for the SA, FDR, PCA and RBF models.

Conclusions: SWR and SA
- SWR yields results comparable to SA in all cases
- Both capture almost the same predictors in the final model:

              Non-Profit   Specialty   Gift
  Both            25           27       29
  Only SWR         3            8        2
  Only SA          -            1        4

- The solution is likely to be close to a global, if not the global, optimum
- Hypothesis: marketing data are well behaved
  - The optimal solutions to feature selection, if not unique, lie on the same plateau
  - As a result, even the greedy SWR algorithm is capable of finding this solution
- This conclusion was further verified using simulation studies

Summary
- The winner is FDR
- Feature selection is probably dominated by a few strong predictors
- FDR is able to optimize the balance between false positives (FP) and false negatives (FN)
- Data with complex structures exist, but most business data probably do not have them
- Further research is required to generalize the results of this study