Practical Guidance for Machine Learning Applications Brett Wujek
About the authors (material from SGF Paper SAS2360-2016)
- Brett Wujek, Senior Data Scientist, Advanced Analytics R&D: ~20 years developing engineering product design and optimization software
- Patrick Hall, Senior Machine Learning Specialist, Advanced Analytics R&D: 11th person worldwide to become a Cloudera certified data scientist
- Funda Güneş, Senior Research Statistician, Advanced Analytics R&D: PhD in Statistics with expertise in new modeling techniques
Early Preview April 2016 General Availability September 2016
(Figure: overlapping fields) Statistics, Pattern Recognition, Computational Neuroscience, Data Science, Data Mining, Machine Learning, AI, Databases, KDD
https://github.com/sassoftware/enlighten-apply/tree/master/ml_tables
Data Preparation: garbage in, garbage out (GIGO)
Key questions about your observations and features:
- Do you have the right data?
- What form is your data in?
- Are values appropriate?
Data Preparation: Do you have the right data?
- Beware of relying on "data exhaust": is it biased? Are you extrapolating beyond it?
- Ensure appropriate representation of nominal target values:
  - All prospective values included
  - Rare events: over/undersampling, ensembles, zero-inflated models
Lohninger, H. (1999). Teach/Me Data Analysis. Berlin: Springer-Verlag.
Data Preparation: Do you have the right features?
Curse of dimensionality: with a fixed number of observations, coverage of the feature space shrinks as dimensions grow (8/10 = 80%, 8/100 = 8%, 8/1000 = 0.8%).
Feature engineering:
- Feature selection: MIC, information gain, chi-square; stepwise regression, LASSO, elastic net; decision trees
- Feature extraction (new features): PCA, SVD, nonnegative matrix factorization, autoencoding neural networks
Benefits: simpler models, shorter training times, improved generalization.
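A minimal sketch of LASSO-based feature selection with scikit-learn, on synthetic data (the data set, alpha value, and dimensions are illustrative, not from the slides):

```python
# LASSO feature selection: the L1 penalty drives uninformative
# coefficients to exactly zero, and SelectFromModel keeps the rest.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=0.1, random_state=0)
X = StandardScaler().fit_transform(X)  # LASSO is scale-sensitive

selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)  # far fewer columns survive
```

The same `SelectFromModel` wrapper also works with tree-based importance scores in place of the L1-penalized coefficients.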
Data Preparation: Feature Extraction
Latent features: derive more information from the data than what is directly presented.
- Denoising autoencoders: add noise to the input, encode through hidden layers (often more hidden layers with many nodes), and decode so that output = input
- Matrix factorization: factorization machines, SVD, NMF (sparse to dense)
Data Preparation: What form is your data in?
Tidy data (Wickham, H. (2014). Tidy Data. Journal of Statistical Software 59:1-23):
1. Each variable forms a column
2. Each observation forms a row
3. Each value must have its own cell
Common reshaping operations:
- Melting (stacking): column headers contain values (<$10k, $10-30k, $50-100k); melt into two columns for value and frequency
- String splitting: a column contains multiple pieces of information (M<20, F<20, M<30, F<30, ...); split into multiple columns (M, F, <20, <30, ...)
- Casting (unstacking): column values are actually variable names (max temp, min temp, avg temp); cast rows back into multiple columns
https://github.com/sassoftware/enlighten-apply/tree/master/sas_ue_tidydata
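The melting operation above can be sketched with pandas; the income-bracket columns and counts here are made up for illustration:

```python
# Melting: column headers that hold values become a single "income"
# column paired with a "frequency" column.
import pandas as pd

untidy = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k":    [27, 12],
    "$10-30k":  [34, 27],
})

tidy = untidy.melt(id_vars="religion", var_name="income",
                   value_name="frequency")
print(tidy)

# Casting (unstacking) is the inverse, e.g.:
# tidy.pivot(index="religion", columns="income", values="frequency")
```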
Data Preparation: Are values appropriate?
Managing values:
- Standardization
- Outliers
- High-cardinality variables
- Missing values
Wielenga, D. (2007). Identifying and Overcoming Common Data Mining Mistakes. Proceedings of the SAS Global Forum 2007 Conference. Cary, NC: SAS Institute Inc.
Data Preparation: High-Cardinality Variables
(Figure: California maps contrasting 27 zip codes, 2597 zip codes, and 12 regions. Sources: http://www.unitedstateszipcodes.org/maps/california-zip-code-map.png, http://www.water.ca.gov/)
Data Preparation: High-Cardinality Variable Pairs
Transaction data (Client, Item, Target): (1, 2, 4), (3, 3, 7), (4, 1, 3), (2, 3, 6)

Sparse client-by-item matrix:
           Item 1   Item 2   Item 3   Item 4
Client 1     0        4        0        0
Client 2     0        0        6        0
Client 3     0        0        7        0
Client 4     3        0        0        0

Matrix factorization (FM, SVD, NMF) yields dense factors:
           Factor 1   Factor 2   Factor 3
Client 1    1.304      0.582      0.892
Client 2    0.897      0.843      0.885
Client 3    0.745      1.129      1.002
Client 4    0.921      0.962      0.714

Pairs of high-cardinality variables can often be represented as a sparse matrix, with matrix values populated by the corresponding target value. Sparse matrices can be efficiently factored into dense features suitable for standard algorithm implementations.
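A minimal sketch of this sparse-to-dense step using truncated SVD (one of several factorization choices; the matrix entries and number of factors here are illustrative):

```python
# Factor a sparse client-by-item matrix into dense per-client features.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Rows = clients, columns = items; entries = target values, mostly empty
ratings = csr_matrix(np.array([
    [0, 4, 0, 0],
    [0, 0, 6, 0],
    [0, 0, 7, 0],
    [3, 0, 0, 0],
], dtype=float))

svd = TruncatedSVD(n_components=3, random_state=0)
client_factors = svd.fit_transform(ratings)   # one dense row per client
print(client_factors.shape)                    # (4, 3)
```

The dense `client_factors` rows can then feed any standard algorithm that expects a dense design matrix.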
Data Preparation: Missing Values
Understand their nature, then deal with them:
- Use algorithms that tolerate missing values: naïve Bayes; decision tree, random forest, gradient boosting
- Binning/discretization
- Univariate imputation with missing markers
- Multivariate imputation with missing markers
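Univariate imputation with missing markers can be sketched with scikit-learn's `SimpleImputer`; the tiny array below is illustrative:

```python
# Mean imputation plus binary "was missing" indicator columns, so the
# downstream model can still see which values were originally absent.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_filled = imputer.fit_transform(X)
print(X_filled)  # two imputed columns + two indicator columns
```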
Data Preparation: Data Partitioning
Partition data into training, validation, and test sets. Beware of information leakage: do not use holdout data for transformations/imputation of training data.
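A minimal sketch of a three-way partition that avoids leakage (the data set and split ratios are arbitrary choices for illustration):

```python
# Split into train/validation/test, then fit transformations on the
# training partition only and merely apply them to the holdout data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0)

scaler = StandardScaler().fit(X_train)      # learn parameters on training data only
X_valid_scaled = scaler.transform(X_valid)  # apply, never re-fit, on holdout data
```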
Training: Bias-Variance Tradeoff
A high-bias model underfits; a high-variance model overfits. Honest assessment: track validation error alongside training error as training proceeds (e.g., error vs. tree depth); with smaller data sets, use k-fold cross validation and average the error across folds.
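The k-fold idea can be sketched in a few lines with scikit-learn (model and data set chosen only for illustration):

```python
# 5-fold cross validation: train on 4 folds, assess on the held-out fold,
# rotate, and average the scores for an honest estimate.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```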
Training: Objective
Objective = Loss (accuracy) + Regularization (complexity)
Use regularization to avoid overfitting. Linear regression example: L1 (LASSO) and L2 (ridge) penalties, each weighted by a regularization hyperparameter.
Training: Algorithm Selection (traditional regression, decision tree, neural network, ...)
- What is the size and nature of your data?
- What are you trying to achieve with your model?
- How accurate does your model need to be?
- How much time do you have to train your model?
- How interpretable or understandable does your model need to be?
Training: Algorithm Selection https://github.com/sassoftware/enlighten-apply/tree/master/ml_tables
The Master Algorithm - Pedro Domingos
Ensemble Modeling
Wisdom of the crowd: Aristotle (Politics) argued that the collective wisdom of many is likely more accurate than that of any one individual.
Ensemble Modeling: Combine algorithms' strengths (compensate for weaknesses)
(Figure: averaging a high-bias, underfit model with a high-variance, overfit model yields an ensemble that tracks the target better than either alone.)
Ensemble Modeling: Account for sample variation
(Figure: averaging models trained on different data samples reduces the influence of any single sample's quirks.)
Ensemble Modeling: Build Predictive Models, Then Combine Them
- Different algorithms (e.g., decision tree + SVM + neural network)
- One algorithm, different configurations (e.g., various configurations of neural networks)
- One algorithm, different data samples (e.g., random forest, gradient boosting machine)
Ensemble Modeling
- Bagging (e.g., random forest): base learners are trained independently and their results combined
- Boosting (e.g., gradient boosting machine): each model attempts to improve on past results
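A minimal sketch contrasting the two families in scikit-learn (data set and settings are illustrative):

```python
# Bagging: a random forest grows independent trees on bootstrap samples
# and combines them by vote. Boosting: gradient boosting fits each new
# tree to the errors of the ensemble so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)

X, y = make_classification(n_samples=500, random_state=0)

bagged = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
boosted = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
print(bagged.score(X, y), boosted.score(X, y))
```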
Ensemble Modeling: Combining Models
Given base-model predictions P1 (decision tree), P2 (SVM), P3 (neural network):
- Averaging or voting: e.g., (P1 + P2 + P3)/3
- Stacking/blending: train a second-level model with the predictions as inputs
- Cluster-based selection: choose the best-performing model within each cluster of the input space
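The stacking option can be sketched with scikit-learn's `StackingClassifier`; the base learners mirror the slide's tree/SVM/neural network trio, and all settings are illustrative:

```python
# Stacking: base learners' out-of-fold predictions become the inputs
# to a second-level model (here, logistic regression).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0)),
                ("nnet", MLPClassifier(max_iter=2000, random_state=0))],
    final_estimator=LogisticRegression(),   # second-level model
).fit(X, y)
print(stack.score(X, y))
```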
Model Tuning
Algorithm training options = hyperparameters: max tree depth and splitting criterion for trees; network configuration and solver options for neural networks; polynomial order and penalty parameter for regression; etc. Training fits the model itself (regression coefficients, neural net weights, tree splitting rules, etc.); tuning searches over the hyperparameters for the best model. The best settings are very data/problem dependent! (Tues 12:30, Super Demo D)
Model Tuning: Common Approaches
For hyperparameters x1 and x2 with y = f(x1) + g(x2), candidate points in the (x1, x2) space (each point = one individual model train and assessment):
- Standard grid search
- Random search
- Latin hypercube sampling
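Random search can be sketched with scikit-learn's `RandomizedSearchCV` (grid search would use `GridSearchCV` over the same space); the model, ranges, and budget are illustrative:

```python
# Random search: draw n_iter hyperparameter combinations at random and
# assess each one with cross validation.
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": randint(1, 10),
                         "n_estimators": randint(10, 100)},
    n_iter=10, cv=3, random_state=0,   # each draw = one train/assess cycle
).fit(X, y)
print(search.best_params_)
```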
Model Tuning: Formal Optimization
Formal optimization methods can more intelligently search the hyperparameter space to find a combination which minimizes generalization error. Genetic algorithm outline:
1. Initialize a population via Latin hypercube sampling of the hyperparameter space
2. Evaluate each candidate: train the model, then assess on a validation set or with k-fold cross validation
3. Create the next generation (iteration N) via crossover and mutation
4. Stop at max time, max # evaluations, or max # iterations, keeping the best set of hyperparameter values found
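A highly simplified sketch of the genetic-algorithm loop above, tuning two random forest hyperparameters (population size, generation count, mutation rate, and random initialization in place of Latin hypercube sampling are all simplifying assumptions):

```python
# Toy GA: evaluate candidates by cross validation, keep the best half,
# and breed children by crossover and mutation.
import random
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rng = random.Random(0)

def fitness(depth, trees):
    model = RandomForestClassifier(max_depth=depth, n_estimators=trees,
                                   random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

population = [(rng.randint(1, 10), rng.randint(10, 50)) for _ in range(6)]
for generation in range(3):                        # stop after max # iterations
    ranked = sorted(population, key=lambda p: fitness(*p), reverse=True)
    parents = ranked[:3]                           # selection
    children = []
    for _ in range(3):
        a, b = rng.sample(parents, 2)
        child = (a[0], b[1])                       # crossover: mix parent genes
        if rng.random() < 0.3:                     # mutation: random perturbation
            child = (rng.randint(1, 10), child[1])
        children.append(child)
    population = parents + children

best = max(population, key=lambda p: fitness(*p))
print(best)
```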
Computational Resources
GPUs; multi-threading; low-level vs. interpreted languages (performance vs. ease); distributed computing.
(Figure: forest tuning time in seconds vs. number of nodes for training, for a tiny problem (Iris, 105 train / 45 validate) and a medium problem (credit data, 49k train / 21k validate); adding nodes increases tuning time for the tiny problem, where distribution overhead dominates.)
Interpretability: How do model attributes lead to a decision?
- Contemporary regression techniques with L1 regularization
- Generalized additive models
- Surrogate models
- Train a small, interpretable ensemble
- Variable importance measures
- Partial dependency plots
- Non-negative, monotonic predictors
- LIME
(Tues 2:15, Raphael 3)
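The surrogate-model idea can be sketched as follows: a shallow decision tree is trained to mimic a more complex "black box" model so its behavior can be read as explicit splitting rules (models, data, and depth limit are illustrative):

```python
# Surrogate model: fit an interpretable tree to the black box's
# predictions (not to the true labels) and inspect its rules.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))
print(export_text(surrogate))   # human-readable splitting rules
```

How faithfully the surrogate reproduces the black box (its fidelity) should be checked before trusting its rules as an explanation.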
Deployment (*including data prep*)
- Port to a compiled language
- Consider commercial software that can manage and automate deployment
- Update API
- Deploy as a web service
- Monitor for model decay
Conclusion
Best practices for data, training, and deployment add up to a successful machine learning application.
Resources github.com/sassoftware/enlighten-apply/tree/master/ml_tables
Resources
- 50 Years of Data Science, by David Donoho
- Statistical Modeling: The Two Cultures, by Leo Breiman
- Evolution of Analytics, by SAS