Practical Guidance for Machine Learning Applications

1 Practical Guidance for Machine Learning Applications Brett Wujek

2 About the authors (material from SGF Paper SAS2360-2016)
- Brett Wujek: Senior Data Scientist, Advanced Analytics R&D; ~20 years developing engineering product design and optimization software
- Patrick Hall: Senior Machine Learning Specialist, Advanced Analytics R&D; 11th person worldwide to become a Cloudera certified data scientist
- Funda Güneș: Senior Research Statistician, Advanced Analytics R&D; PhD in Statistics with expertise in new modeling techniques

3 Early Preview: April 2016. General Availability: September 2016.

4 [Diagram of related fields: Statistics, Pattern Recognition, Computational Neuroscience, Data Science, Data Mining, Machine Learning, AI, Databases, KDD]

8 Data Preparation: GIGO (garbage in, garbage out) applies to both observations and features

9 Data Preparation (observations and features)
- Do you have the right data?
- What form is your data in?
- Are values appropriate?

10 Data Preparation: Do you have the right data? (target, observations, features)
- Data exhaust
- Bias?
- Extrapolation
- Ensure appropriate nominal target value representation
  - All prospective values included
  - Rare events
  - Over/undersampling
  - Ensembles
  - Zero-inflated models
Lohninger, H. (1999). Teach/Me Data Analysis. Berlin: Springer-Verlag.

11 Data Preparation: Do you have the right data? Features
- Curse of dimensionality: 8/10 = 80%, 8/100 = 8%, 8/1000 = 0.8%
- Feature engineering: new features
- Feature selection
  - MIC, information gain, chi-square
  - Stepwise regression, LASSO, elastic net
  - Decision tree
- Feature extraction
  - PCA, SVD
  - Nonnegative matrix factorization
  - Autoencoding neural networks
- Result: simpler models, shorter training times, improved generalization
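One way to read the slide's 8/10, 8/100, 8/1000 figures: with 8 observations and each feature cut into 10 intervals, the observations can occupy at most 8 of the 10, 100, or 1000 resulting cells as features are added. The sketch below only reproduces that arithmetic; the function name and the choice of 10 intervals per feature are assumptions made for illustration.

```python
def max_cell_coverage(n_obs, bins_per_feature, n_features):
    """Upper bound on the fraction of grid cells that n_obs points can occupy."""
    n_cells = bins_per_feature ** n_features
    return min(n_obs, n_cells) / n_cells


if __name__ == "__main__":
    # 8 observations, 10 intervals per feature, 1 to 3 features.
    for n_features in (1, 2, 3):
        share = max_cell_coverage(8, 10, n_features)
        print(f"{n_features} feature(s): at most {share:.1%} of cells occupied")
```

Each added feature multiplies the number of cells by 10 while the number of observations stays fixed, which is why the data become sparse so quickly.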

12 Data Preparation: Feature Extraction
- Latent features: derive more information from the data than what is directly presented
- Denoising autoencoders: noise is added to the input (x1, x2, x3), which is encoded into hidden units (h11, h12) and decoded so that output = input (often more hidden layers with many nodes)
- Matrix factorization: factorization machines, SVD, NMF (sparse to dense)

13 Data Preparation: What form is your data in?
Tidy data (Wickham, H. (2014). Tidy Data. Journal of Statistical Software 59.)
1. Each variable forms a column
2. Each observation forms a row
3. Each value must have its own cell
- Melting (stacking): column headers contain values (<$10k, $10-30k, $50-100k); melt into 2 columns for value and frequency
- String splitting: column contains multiple pieces of information (M<20, F<20, M<30, F<30, ...); split into multiple columns (M, F, <20, <30, ...)
- Casting (unstacking): column values are actually variable names (max temp, min temp, avg temp); cast rows back into multiple columns
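The melting and string-splitting operations described on this slide can be sketched in plain Python. The table contents, the column names (`group`, `income`, `frequency`), and the helper names are hypothetical; a data-frame library would normally do this work.

```python
def melt(rows, id_col, var_name, value_name):
    """Stack every non-id column of each wide row into (variable, value) pairs."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col != id_col:
                long_rows.append({id_col: row[id_col], var_name: col, value_name: val})
    return long_rows


def split_sex_age(code):
    """Split a combined code such as 'M<20' into its two variables: ('M', '<20')."""
    return code[0], code[1:]


# Hypothetical wide table: income brackets appear as column headers.
wide = [
    {"group": "A", "<$10k": 27, "$10-30k": 34},
    {"group": "B", "<$10k": 12, "$10-30k": 27},
]

# After melting, each row holds one observation: a group, a bracket, a count.
tidy = melt(wide, id_col="group", var_name="income", value_name="frequency")
```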

14 Data Preparation: Are values appropriate? Managing values
- Standardization
- Outliers
- High cardinality
- Missing values
Wielenga, D. (2007). Identifying and Overcoming Common Data Mining Mistakes. Proceedings of the SAS Global Forum 2007 Conference. Cary, NC: SAS Institute Inc.

15 Data Preparation: High Cardinality Variables (27 zip codes; 2597 zip codes; 12 regions?)

16 Data Preparation: High Cardinality Variables
- Pairs of high cardinality variables (for example, Client and Item) can often be represented as a sparse matrix, with matrix values populated by the corresponding target value
- Sparse matrices can be efficiently factored (FM, SVD, NMF) into dense features (Factor 1, Factor 2, Factor 3) suitable for standard algorithm implementations
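As a rough illustration of turning a sparse client-by-item matrix into dense factors, the sketch below fits a small matrix factorization by stochastic gradient descent in plain Python. This is one simple technique chosen for illustration, not necessarily the FM, SVD, or NMF implementation the authors used; the data, function names, and hyperparameter values are all assumptions.

```python
import random


def factorize(entries, n_factors=2, epochs=300, lr=0.05, reg=0.01, seed=0):
    """Fit dense client and item factors to observed (client, item) -> target entries."""
    rng = random.Random(seed)
    clients = sorted({c for c, _ in entries})
    items = sorted({i for _, i in entries})
    P = {c: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for c in clients}
    Q = {i: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for (c, i), target in entries.items():
            p, q = P[c], Q[i]
            err = target - sum(a * b for a, b in zip(p, q))
            for f in range(n_factors):
                pf, qf = p[f], q[f]
                # Gradient step on squared error with an L2 penalty on the factors.
                p[f] += lr * (err * qf - reg * pf)
                q[f] += lr * (err * pf - reg * qf)
    return P, Q


def rmse(entries, P, Q):
    """Root mean squared reconstruction error over the observed entries only."""
    sq = [(t - sum(a * b for a, b in zip(P[c], Q[i]))) ** 2
          for (c, i), t in entries.items()]
    return (sum(sq) / len(sq)) ** 0.5


# Hypothetical sparse matrix: only some (client, item) cells carry a target value.
entries = {
    ("client1", "item1"): 1.0, ("client1", "item3"): 0.0,
    ("client2", "item1"): 1.0, ("client2", "item2"): 1.0,
    ("client3", "item3"): 1.0, ("client3", "item4"): 0.0,
    ("client4", "item2"): 1.0, ("client4", "item4"): 0.0,
}
P_init, Q_init = factorize(entries, epochs=0)  # untrained starting factors
P, Q = factorize(entries)                      # trained factors
```

The rows of `P` (and of `Q`) are dense numeric features that can replace the original high cardinality client (or item) identifier as model inputs.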

17 Data Preparation: Missing Values
- Understand their nature
- Deal with them
  o Naïve Bayes
  o Decision tree, random forest, gradient boosting
  o Binning/discretization
  o Univariate imputation with missing markers
  o Multivariate imputation with missing markers
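A minimal sketch of univariate imputation with missing markers: each missing value is replaced by a statistic of the observed values, and a companion indicator column records where imputation happened. Using the median and representing missing values as `None` are assumptions made here; the slide does not prescribe either.

```python
import statistics


def impute_with_marker(values):
    """Return (imputed values, 0/1 missing markers, fill value) for one column."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in values]
    markers = [1 if v is None else 0 for v in values]
    return imputed, markers, fill


# Hypothetical column with two missing entries.
column = [1.0, None, 3.0, None, 5.0]
imputed, markers, fill = impute_with_marker(column)
```

Keeping the marker column lets a downstream model learn from the fact that a value was missing, which can itself be informative.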

18 Data Preparation: Data partitioning
- Partitions: Training, Validation, Test
- Information leakage: do not use holdout data for transformations/imputation of training data
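The leakage rule can be sketched as follows: split first, estimate transformation parameters on the training partition only, then apply those same parameters to the holdout partitions. The 60/20/20 split, the helper names, and the toy values are assumptions for illustration.

```python
import random
import statistics


def partition(rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and cut them into training, validation, and test partitions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_valid = round(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def fit_standardizer(train_values):
    """Estimate mean and standard deviation from the training partition only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, stdev):
    """Apply previously fitted parameters; never refit on holdout data."""
    return [(v - mean) / stdev for v in values]


values = [float(v) for v in range(1, 11)]  # hypothetical single feature
train, valid, test = partition(values)
mean, stdev = fit_standardizer(train)
train_z = standardize(train, mean, stdev)
valid_z = standardize(valid, mean, stdev)
test_z = standardize(test, mean, stdev)
```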

20 Training: Bias-Variance Tradeoff
- High bias model (underfit) vs. high variance model (overfit)
- Honest assessment: compare error on training data and validation data across training iterations (e.g., tree depth)
- With smaller data sets, use k-fold cross validation (average error)
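A minimal sketch of k-fold cross validation: every row is held out exactly once, and the k validation errors are averaged. The mean-only "model" is a deliberately trivial stand-in so the example stays self-contained; the function names are assumptions.

```python
import random


def kfold_indices(n_rows, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


def cross_validated_mse(targets, k=5):
    """Average validation MSE of a mean-only model over k folds."""
    errors = []
    for train, valid in kfold_indices(len(targets), k):
        prediction = sum(targets[j] for j in train) / len(train)  # "training"
        errors.append(sum((targets[j] - prediction) ** 2 for j in valid) / len(valid))
    return sum(errors) / len(errors)


splits = list(kfold_indices(10, 5))
```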

21 Training: Objective
Objective = Loss (accuracy) + Regularization (complexity)
- Use regularization to avoid overfitting
- Linear regression example: L1 and L2 penalty terms, each scaled by a regularization hyperparameter
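The slide's formula did not survive extraction. A common way to write the linear regression example, assuming coefficients β, n observations (x_i, y_i), and regularization hyperparameters λ1 and λ2 (all notation chosen here, not taken from the slide), is:

```latex
\min_{\beta}\;
\underbrace{\sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2}_{\text{loss (accuracy)}}
\;+\;
\underbrace{\lambda_1 \lVert \beta \rVert_1}_{\text{L1 penalty}}
\;+\;
\underbrace{\lambda_2 \lVert \beta \rVert_2^2}_{\text{L2 penalty}}
```

Setting λ2 = 0 gives the LASSO, λ1 = 0 gives ridge regression, and using both gives the elastic net.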

22 Training: Algorithm Selection (traditional regression, decision tree, neural network)
- What is the size and nature of your data?
- What are you trying to achieve with your model?
- How accurate does your model need to be?
- How much time do you have to train your model?
- How interpretable or understandable does your model need to be?

23 Training: Algorithm Selection

24 The Master Algorithm - Pedro Domingos

25 Ensemble Modeling: Wisdom of the crowd. Aristotle (Politics): the collective wisdom of many is likely more accurate than any one.

26 Ensemble Modeling: Combine algorithms' strengths (compensate for weaknesses)
[Plot of Target vs. Input: high bias model (underfit), high variance model (overfit), ensemble (average)]

27 Ensemble Modeling: Account for sample variation
[Plot of Target vs. Input: model with sample #1, model with sample #2, ensemble (average)]

28-31 Ensemble Modeling: Build predictive models, then combine models
- Different algorithms. Ex: Decision Tree + SVM + Neural Network
- One algorithm, different configurations. Ex: Various configurations of Neural Networks
- One algorithm, different data samples (different samples of data, Decision Trees). Ex: Random Forest, Gradient Boosting Machine

32 Ensemble Modeling
- Bagging (Random Forest): results of base learners are combined
- Boosting (Gradient Boosting Machine): each model attempts to improve on past results
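A minimal bagging sketch, assuming a one-dimensional regression problem: each base learner is trained on a bootstrap sample (rows drawn with replacement) and the ensemble prediction is the average. The nearest-neighbor base learner, the data, and the function names are stand-ins chosen to keep the example short; a random forest would use decision trees instead.

```python
import random


def fit_nearest_neighbor(sample):
    """Base learner: predict the target of the closest training input."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def bagged_models(data, n_models=25, seed=0):
    """Train one base learner per bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit_nearest_neighbor(bootstrap))
    return models


def bagged_predict(models, x):
    """Combine the base learners by averaging their predictions."""
    return sum(model(x) for model in models) / len(models)


# Hypothetical (input, target) pairs.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
models = bagged_models(data)
prediction = bagged_predict(models, 2.0)
```

Boosting differs in that the models are built in sequence, each one fitted to the errors left by the models before it, rather than independently on resampled data.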

33-35 Ensemble Modeling: Combining predictions P1, P2, P3 (Decision Tree, SVM, Neural Network)
- Averaging or voting: (P1+P2+P3)/3
- Stacking/blending: second-level model with predictions as inputs
- Cluster-based selection: combine models by choosing among P1, P2, P3 per cluster
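The averaging and voting combinations can be sketched in a few lines. The prediction values and labels below are hypothetical stand-ins for the outputs P1, P2, P3 of three already trained models.

```python
from collections import Counter


def average(predictions):
    """Combine numeric predictions, e.g. (P1 + P2 + P3) / 3."""
    return sum(predictions) / len(predictions)


def majority_vote(labels):
    """Combine class predictions by taking the most common label."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical outputs of a decision tree, an SVM, and a neural network.
p_average = average([0.2, 0.4, 0.9])
p_vote = majority_vote(["yes", "no", "yes"])
```

Stacking replaces these fixed rules with a second-level model that is trained on the base models' predictions, ideally predictions made on data the base models did not see during training.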

36 Model Tuning
- Algorithm tuning options = hyperparameters (TUNE)
  - Max tree depth, splitting criterion, etc.
  - Network configuration, solver options, etc.
  - Polynomial order, penalty parameter, etc.
- Model fit from inputs and target (TRAIN)
  - Regression coefficients
  - Neural net weights
  - Tree splitting rules
  - Etc.
- Best model? Very data/problem dependent!
Tues 12:30 Super Demo D

37 Model Tuning: Common Approaches
- Example: y = f(x1) + g(x2) for hyperparameters x1 and x2
- Standard grid search, random search, Latin hypercube
- Each point = individual model train and assessment
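The three sampling designs named on this slide can be sketched for two hyperparameters scaled to [0, 1). The function names and the choice of 9 points are assumptions; each generated point would correspond to one model train and assessment.

```python
import itertools
import random


def grid_search_points(levels_x1, levels_x2):
    """Every combination of the chosen levels."""
    return list(itertools.product(levels_x1, levels_x2))


def random_search_points(n_points, rng):
    """Independent uniform draws for each hyperparameter."""
    return [(rng.random(), rng.random()) for _ in range(n_points)]


def latin_hypercube_points(n_points, n_dims, rng):
    """One point per stratum in every dimension, with strata paired at random."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_points for s in strata])
    return list(zip(*columns))


rng = random.Random(0)
grid = grid_search_points([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
rand = random_search_points(9, rng)
lhs = latin_hypercube_points(9, 2, rng)
```

With 9 evaluations, the grid tries only 3 distinct values of each hyperparameter, while the Latin hypercube tries 9 distinct values of each. When the response is close to additive, as in y = f(x1) + g(x2), that extra per-axis coverage is the main argument for the non-grid designs.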

38 Model Tuning
Formal optimization methods can more intelligently search the hyperparameter space to find a combination that minimizes generalization error.
Genetic algorithm:
1. Latin hypercube sampling of the hyperparameter space gives the initial population
2. Evaluation: train model; assess on validation set or with k-fold cross validation
3. Stop? (Max time? Max # evaluations? Max # iterations?) If so, return the optimal set of hyperparameter values (best model)
4. Otherwise, crossover and mutation produce iteration N (generation); repeat from step 2
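A minimal genetic algorithm sketch for two hyperparameters in [0, 1]. The `validation_error` function is a hypothetical stand-in for "train model, assess on validation set"; the population size, mutation scale, and elitism rule are assumptions, and the initial population is drawn uniformly at random here rather than by Latin hypercube sampling to keep the example short.

```python
import random


def validation_error(x1, x2):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (x1 - 0.3) ** 2 + (x2 - 0.7) ** 2


def tune(pop_size=10, generations=20, seed=0):
    """Return (best hyperparameters, best error recorded at each generation)."""
    rng = random.Random(seed)
    population = [(rng.random(), rng.random()) for _ in range(pop_size)]
    history = []
    for _ in range(generations):  # stopping rule: maximum number of iterations
        scored = sorted(population, key=lambda p: validation_error(*p))
        history.append(validation_error(*scored[0]))
        parents = scored[:pop_size // 2]
        children = []
        while len(children) < pop_size - 1:
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])  # crossover: one gene from each parent
            child = tuple(min(1.0, max(0.0, g + rng.gauss(0, 0.05)))
                          for g in child)  # mutation, clipped to [0, 1]
            children.append(child)
        population = [scored[0]] + children  # elitism: keep the best so far
    best = min(population, key=lambda p: validation_error(*p))
    history.append(validation_error(*best))
    return best, history


best, history = tune()
```

Because the best individual is carried over unchanged into each new generation, the recorded best error can never get worse from one generation to the next.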

39 Computational Resources
- GPUs
- Multi-threading
- Low-level vs. interpreted languages (performance vs. ease)
- Distributed computing
[Plots of time (seconds) vs. number of nodes for training. Tiny problem: "IRIS Forest Tuning Time", 105 train / 45 validate. Medium problem: "Credit Data Tuning Time", 49k train / 21k validate.]

40 Interpretability (Attributes, Model, Decision?)
- Contemporary regression techniques with L1 regularization
- Generalized additive models
- Surrogate models
- Train a small, interpretable ensemble
- Variable importance measures
- Partial dependency plots
- Non-negative, monotonic predictors
- LIME
Tues 2:15 Raphael 3

42 Deployment? *Including Data Prep*
- Port to a compiled language
- Consider commercial software that can manage and automate deployment
- Update API
- Deploy as a web service
- Monitor for decay

43 Conclusion: Best practices in data, training, and deployment lead to a successful machine learning application

44 Resources github.com/sassoftware/enlighten-apply/tree/master/ml_tables

45 Resources
- 50 Years of Data Science by David Donoho
- Statistical Modeling: The Two Cultures by Leo Breiman
- Evolution of Analytics by SAS
