7 Techniques for Data Dimensionality Reduction

1 7 Techniques for Data Dimensionality Reduction
Rosaria Silipo, KNIME.com

2 The 2009 KDD Challenge
Prediction targets: churn (contract renewals), appetency (likelihood to buy a specific product), upselling (likelihood to buy a side product)
Input data sets:
Small data set: 231 columns x 50K rows
Large data set: 15K columns x 50K rows

3 The Problem
Many supervised classification algorithms, such as decision trees, naïve Bayes, neural networks, and logistic regression, cannot deal with a large number of columns.
This is because the algorithms iterate over all columns. The limitation is not tool-dependent; it comes from the structure of the algorithms.

4 Big Data or Dimensionality Reduction?
Optimistic view: all columns carry useful information, so we run a parallelized version of the algorithms, where available, on all data columns -> Spark and Big Data
Pessimistic view: some of the columns are garbage, so we remove all non-informative columns and see if we can get by with just the remaining ones -> Dimensionality Reduction

5 Defining the Baseline
73% accuracy, 81% AuC

6 The 7 Techniques for Dimensionality Reduction

7 1. Missing Values Based Filter
Ratio of missing values = number of missing values / total number of rows
IF ratio of missing values > Threshold => remove column
Result: 82% AuC, 71% reduction rate
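
The rule on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the KNIME workflow from the talk: the toy table, the column names, and the threshold of 0.5 are assumptions, and missing values are represented as `None`.

```python
def missing_ratio(values):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def missing_value_filter(columns, threshold):
    """Keep only the columns whose missing-value ratio is <= threshold."""
    return {
        name: values
        for name, values in columns.items()
        if missing_ratio(values) <= threshold
    }


# Hypothetical toy table: column name -> list of values.
table = {
    "age": [34, 51, None, 28],                      # ratio 0.25
    "fax_number": [None, None, None, "555-0101"],   # ratio 0.75
    "city": ["Rome", "Milan", "Turin", "Rome"],     # ratio 0.0
}
kept = missing_value_filter(table, threshold=0.5)
print(sorted(kept))  # ['age', 'city']
```

The filter works on any column type, since it only counts missing entries.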

8 2. Low Variance Filter
IF column variance < Threshold => remove column
Result: 82% AuC, 73% reduction rate
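
A minimal sketch of the low variance rule, using only the Python standard library. Variance depends on the range of a column, so this sketch rescales every column to [0, 1] first; that normalization step, the toy columns, and the threshold of 0.1 are assumptions and are not stated on the slide.

```python
from statistics import pvariance


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def low_variance_filter(columns, threshold):
    """Keep numeric columns whose normalized variance is >= threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(min_max_normalize(values)) >= threshold
    }


# Hypothetical numeric columns.
table = {
    "constant": [5.0] * 10,                                  # variance 0
    "mostly_zero": [0.0] * 9 + [1.0],                        # variance 0.09
    "spread": [float(v) for v in range(1, 11)],              # variance ~0.102
}
kept = low_variance_filter(table, threshold=0.1)
print(list(kept))  # ['spread']
```

Like the slide's rule, this only applies to numeric columns.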

9 3. High Correlation Filter
On pairs of columns: IF correlation > Threshold => one of the two columns is removed
Result: 82% AuC, 74% reduction rate
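
One possible reading of the pairwise rule is a greedy pass: a column is kept only if it is not highly correlated with any column already kept. The sketch below assumes Pearson correlation and compares its absolute value against the threshold; both choices, the toy data, and the threshold of 0.9 are assumptions rather than details given on the slide.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sx * sy)


def high_correlation_filter(columns, threshold):
    """Return the names of the columns kept by a greedy pairwise filter."""
    kept = []
    for name, values in columns.items():
        # Drop the column if it duplicates the information of a kept one.
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical numeric columns: height appears twice in different units.
table = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "income": [30, 80, 20, 60, 40],
}
kept = high_correlation_filter(table, threshold=0.9)
print(kept)  # ['height_cm', 'income']
```

Which column of a correlated pair survives depends on the column order in this greedy version.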

10 4. Principal Component Analysis
Principal Component Analysis (PCA) transforms the original n coordinates of a data set into a new set of n coordinates called principal components (PCs). As a result of the transformation, the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible); each succeeding component has the highest possible variance under the constraint that it is orthogonal to the preceding PCs.

11 4. Principal Component Analysis
IF variance < Threshold => PC is removed
Result: 72% AuC, 62% reduction rate
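
The PCA description and the removal rule above can be sketched with NumPy via an eigendecomposition of the covariance matrix. This is a sketch under assumptions: the threshold is applied to each PC's share of the total variance rather than to its raw variance, and the synthetic data and the threshold of 0.01 are made up for illustration.

```python
import numpy as np


def pca_reduce(X, threshold):
    """Project X onto the PCs whose share of total variance is >= threshold.

    Returns the projected data and the variance share of every PC.
    """
    Xc = X - X.mean(axis=0)                   # center each column
    cov = np.cov(Xc, rowvar=False)            # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder PCs by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()           # variance share per PC
    keep = ratio >= threshold
    return Xc @ eigvecs[:, keep], ratio


# Synthetic data: the third column is almost the sum of the first two,
# so nearly all the variance lives in two directions.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
noise = 0.01 * rng.normal(size=200)
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1] + noise])

scores, ratio = pca_reduce(X, threshold=0.01)
print(scores.shape)  # (200, 2)
```

PCA applies to numeric columns only, and the resulting PCs are linear combinations of the original columns, so they are harder to interpret than a filtered subset.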

12 5. Random Forests (Tree Ensembles)
2000 trees, 2 levels each, only 3 columns per tree
Score = #splits(lev. 0) / #candidates(lev. 0) + #splits(lev. 1) / #candidates(lev. 1)
IF Score(column) < Threshold => column is removed
Result: 82% AuC, 86% reduction rate
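
The score formula can be illustrated without training any trees, by tallying split statistics from an already trained ensemble. The sketch below uses a hypothetical record per tree (its candidate columns and the column chosen at each split node) and assumes that every split node at a level counts once as a candidate opportunity for each of the tree's columns; that counting convention, the three toy trees, and the threshold of 0.6 are assumptions.

```python
from collections import Counter


def column_scores(trees):
    """Score = splits/candidates at level 0 + splits/candidates at level 1."""
    splits = [Counter(), Counter()]      # per level: times a column was chosen
    candidates = [Counter(), Counter()]  # per level: times it could have been
    for tree in trees:
        for level, chosen in enumerate((tree["level0"], tree["level1"])):
            for column in chosen:        # one entry per split node at this level
                splits[level][column] += 1
                for cand in tree["candidates"]:
                    candidates[level][cand] += 1
    columns = set(candidates[0]) | set(candidates[1])
    return {
        c: sum(
            splits[lvl][c] / candidates[lvl][c]
            for lvl in (0, 1)
            if candidates[lvl][c]
        )
        for c in columns
    }


# Hypothetical split statistics for three shallow trees on 3 columns each.
trees = [
    {"candidates": ("a", "b", "c"), "level0": ["a"], "level1": ["a", "b"]},
    {"candidates": ("a", "c", "d"), "level0": ["a"], "level1": ["c", "a"]},
    {"candidates": ("b", "c", "d"), "level0": ["b"], "level1": ["c", "c"]},
]
scores = column_scores(trees)
kept = sorted(c for c, s in scores.items() if s >= 0.6)
print(kept)  # ['a', 'b']
```

A column that is often offered as a candidate but rarely chosen gets a low score, which is what makes the ratio usable as a relevance measure.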

13 6. Backward Feature Elimination
Start from N columns
Train N models, each on N-1 columns
Remove the column whose removal disrupts performance the least, then repeat on the remaining columns
Result: 78% AuC, 99% reduction rate

14 6. Backward Feature Elimination
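
The backward elimination loop can be sketched independently of any particular model. In the sketch below, `score` stands in for "train a model on these columns and measure its performance"; the `toy_score` function, its weights, and the column names are hypothetical placeholders, not results from the talk.

```python
def backward_elimination(columns, score, min_columns=1):
    """Repeatedly drop the column whose removal hurts `score` the least.

    Returns a list of (remaining_columns, score) pairs, one per step.
    """
    current = list(columns)
    history = [(tuple(current), score(current))]
    while len(current) > min_columns:
        # One candidate model per column left out (N models on N-1 columns).
        trials = [[c for c in current if c != drop] for drop in current]
        current = max(trials, key=score)
        history.append((tuple(current), score(current)))
    return history


# Hypothetical stand-in for model performance: each column adds a fixed gain.
WEIGHTS = {"usage": 0.20, "tenure": 0.10, "noise_1": 0.0, "noise_2": 0.0}


def toy_score(cols):
    return 0.5 + sum(WEIGHTS[c] for c in cols)


history = backward_elimination(WEIGHTS, toy_score)
for cols, s in history:
    print(len(cols), cols, round(s, 2))
```

Each step trains as many models as there are remaining columns, so the total number of models grows quadratically with the number of columns, which is why the approach is slow on wide data sets.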

15 7. Forward Feature Construction
Start from 1 column
Train N models, each on 2 columns
Add the column that gives the best performance, and repeat up to m columns
Result: 63% AuC, 91% reduction rate
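
Forward feature construction is the mirror image of backward elimination and can be sketched the same way. As before, `score` is a placeholder for training and evaluating a model on the given columns; `toy_score`, its weights, and the column names are hypothetical. This sketch also picks the first column by score, which is an assumption, since the slide only says to start from 1 column.

```python
def forward_construction(columns, score, max_columns):
    """Greedily add the column that improves `score` the most."""
    selected = []
    remaining = list(columns)
    while remaining and len(selected) < max_columns:
        # One candidate model per column that could be added next.
        best = max(remaining, key=lambda c: score(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical stand-in for model performance: each column adds a fixed gain.
WEIGHTS = {"noise_1": 0.0, "tenure": 0.10, "usage": 0.20, "noise_2": 0.0}


def toy_score(cols):
    return 0.5 + sum(WEIGHTS[c] for c in cols)


selected = forward_construction(WEIGHTS, toy_score, max_columns=2)
print(selected)  # ['usage', 'tenure']
```

The `max_columns` argument plays the role of m on the slide: the search stops once m columns have been selected.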

16 Comparison Results

17 Comparison: ROC Curves

18 Comparison: Accuracy

19 Final Workflow

20 Conclusions
The best reduction/accuracy ratio belongs to the random forest approach (86% reduction rate, 82% AuC).
Most techniques, even the simplest ones, keep ~82% AuC and have a reduction rate > 70%.
Simpler methods are faster.
Some techniques apply only to numeric columns.
The backward and forward techniques are too slow to work on high-dimensional data sets. They may still be usable as a second step, for example after filtering on missing values.
The most successful techniques were then applied to the KDD large data set.

21 Thank You!
Whitepaper on the KNIME web site: 7 Techniques for Data Dimensionality Reduction n.pdf
Blog post on KDNuggets: 7 Techniques for Data Dimensionality Reduction
Workflow on the KNIME Server under 003_Preprocessing
For more information: education@knime.com

22 Thank You
Free copy of the KNIME Beginner's Luck book at KNIME Press. Promotion code: MeetupItalia2015
