Practical Guidance for Machine Learning Applications


Practical Guidance for Machine Learning Applications Brett Wujek

About the authors (material from SGF Paper SAS2360-2016):
- Brett Wujek, Senior Data Scientist, Advanced Analytics R&D: ~20 years developing engineering product design and optimization software
- Patrick Hall, Senior Machine Learning Specialist, Advanced Analytics R&D: 11th person worldwide to become a Cloudera certified data scientist
- Funda Güneş, Senior Research Statistician, Advanced Analytics R&D: PhD in Statistics with expertise in new modeling techniques

Early preview: April 2016. General availability: September 2016.

Related fields: statistics, pattern recognition, computational neuroscience, data science, data mining, machine learning, AI, databases, KDD.

https://github.com/sassoftware/enlighten-apply/tree/master/ml_tables

Data Preparation: observations and features. Remember GIGO (garbage in, garbage out). Three questions frame the work: Do you have the right data? What form is your data in? Are values appropriate?

Data Preparation: Do you have the right data? Watch for data exhaust, bias, and extrapolation beyond the range of the observed data. For a nominal target, ensure appropriate representation of its values:
- All prospective values included
- Rare events handled via over/undersampling, ensembles, or zero-inflated models
Lohninger, H. (1999). Teach/Me Data Analysis. Berlin: Springer-Verlag.

Data Preparation: Features and the Curse of Dimensionality. A fixed number of observations covers an ever-smaller fraction of the space as features are added (8/10 = 80%, 8/100 = 8%, 8/1000 = 0.8%). Feature engineering fights back in two ways:
- Feature selection: MIC, information gain, chi-square; stepwise regression, LASSO, elastic net; decision trees
- Feature extraction (creating new features): PCA, SVD; nonnegative matrix factorization; autoencoding neural networks
The payoff is simpler models, shorter training times, and improved generalization.
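Both routes can be illustrated in a few lines. The sketch below uses Python with scikit-learn and synthetic data (an assumption for illustration only; the talk itself demonstrates SAS tools):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

# Synthetic data: 100 observations, 20 features, only 5 informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=0.1, random_state=42)

# Feature selection: the LASSO (L1) penalty drives uninformative
# coefficients exactly to zero, keeping a subset of the original inputs
lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"LASSO kept {selected.size} of {X.shape[1]} features")

# Feature extraction: PCA replaces the inputs with a handful of
# new latent components
X_reduced = PCA(n_components=5).fit_transform(X)
print(f"shape after PCA: {X_reduced.shape}")
```

Selection keeps original, interpretable columns; extraction builds new ones, which is often more compact but harder to explain.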

Data Preparation: Feature Extraction. Latent features derive more information from the data than what is directly presented. Two common routes:
- Denoising autoencoders: noise is added to the input, an encoder compresses it through hidden layers (often many layers with many nodes), and a decoder reconstructs the original input (output = input)
- Matrix factorization: factorization machines, SVD, and NMF turn sparse inputs into dense features

Data Preparation: What form is your data in? Aim for tidy data (Wickham, H. (2014). Tidy Data. Journal of Statistical Software 59:1-23):
1. Each variable forms a column
2. Each observation forms a row
3. Each value has its own cell
Common reshaping operations:
- Melting (stacking): column headers contain values (<$10k, $10-30k, $50-100k); melt them into two columns for value and frequency
- String splitting: one column holds multiple pieces of information (M<20, F<20, M<30, F<30, ...); split it into multiple columns (M, F, <20, <30, ...)
- Casting (unstacking): column values are actually variable names (max temp, min temp, avg temp); cast the rows back into multiple columns
https://github.com/sassoftware/enlighten-apply/tree/master/sas_ue_tidydata
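A minimal melting example, sketched here with pandas (an assumption for illustration; the linked companion code is a SAS University Edition example):

```python
import pandas as pd

# "Messy" table: the column headers (<$10k, $10-30k) are actually values
# of an income variable, not variables themselves
messy = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-30k": [34, 27],
})

# Melting (stacking): fold the value-bearing headers into two columns,
# one for the income value and one for the frequency
tidy = messy.melt(id_vars="religion", var_name="income", value_name="freq")
print(tidy)
```

The result has one row per (religion, income) observation, satisfying all three tidy-data rules.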

Data Preparation: Are values appropriate? Managing values means handling standardization, outliers, high-cardinality variables, and missing values. Wielenga, D. (2007). Identifying and Overcoming Common Data Mining Mistakes. Proceedings of the SAS Global Forum 2007 Conference. Cary, NC: SAS Institute Inc.

Data Preparation: High-Cardinality Variables. [Maps of California illustrate the scale: 27 zip codes, 2597 zip codes, or perhaps just 12 regions? Images: http://www.unitedstateszipcodes.org/maps/california-zip-code-map.png, http://www.water.ca.gov/]

Data Preparation: High-Cardinality Variables. Pairs of high-cardinality variables can often be represented as a sparse matrix, with matrix values populated by the corresponding target value. For (client, item, target) triples (1, 2, 4), (2, 3, 6), (3, 3, 7), (4, 1, 3):

            Item 1  Item 2  Item 3  Item 4
  Client 1    0       4       0       0
  Client 2    0       0       6       0
  Client 3    0       0       7       0
  Client 4    3       0       0       0

Sparse matrices can be efficiently factored (FM, SVD, NMF) into dense features suitable for standard algorithm implementations:

            Factor 1  Factor 2  Factor 3
  Client 1   1.304     0.582     0.892
  Client 2   0.897     0.843     0.885
  Client 3   0.745     1.129     1.002
  Client 4   0.921     0.962     0.714
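The factoring step can be sketched in Python with scipy and scikit-learn (an assumption for illustration; the talk uses SAS), building the same 4x4 client-by-item matrix and reducing it to three dense factors per client:

```python
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# (client, item, target) triples as row/column indices and values
rows = [0, 1, 2, 3]          # clients 1-4 (0-based)
cols = [1, 2, 2, 0]          # items 2, 3, 3, 1 (0-based)
vals = [4.0, 6.0, 7.0, 3.0]  # target values
X = csr_matrix((vals, (rows, cols)), shape=(4, 4))

# Truncated SVD factors the sparse matrix directly, without densifying it,
# yielding three latent features per client
svd = TruncatedSVD(n_components=3, random_state=0)
client_factors = svd.fit_transform(X)
print(client_factors.shape)  # (4, 3)
```

The dense factor columns can then replace the raw high-cardinality IDs as inputs to standard algorithms.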

Data Preparation: Missing Values. Understand their nature, then deal with them:
o Use algorithms that handle them directly: naïve Bayes; decision tree, random forest, gradient boosting
o Binning/discretization
o Univariate imputation with missing markers
o Multivariate imputation with missing markers
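Univariate imputation with missing markers, sketched with scikit-learn (an assumption for illustration; the talk uses SAS): the `add_indicator` flag appends one binary marker column per input that had missing values, so the model can still see where values were absent.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0]])

# Univariate (column-mean) imputation plus binary missing markers
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imp = imputer.fit_transform(X)
print(X_imp)
```

Here both columns had missing values, so the output has the two imputed columns followed by two marker columns.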

Data Preparation: Data Partitioning. Split the data into training, validation, and test partitions, and guard against information leakage: do not use holdout data for transformations or imputation of the training data.
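One way to enforce this, sketched with a scikit-learn pipeline (an assumption for illustration; the talk uses SAS): all preprocessing is fit on the training partition only, so holdout statistics never leak into the transformations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training partition only; the test
# partition is transformed with training statistics, never its own
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.3f}")
```

Scaling the full data set before splitting would be exactly the leakage the slide warns against.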

Training: The Bias-Variance Tradeoff. A high-bias model underfits; a high-variance model overfits. Honest assessment tracks error on training and validation data as complexity grows with each training iteration (e.g., tree depth): training error keeps falling, while validation error eventually turns back up. With smaller data sets, use k-fold cross validation and average the error across folds.
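k-fold cross validation in a few lines, sketched with scikit-learn (an assumption for illustration; the talk uses SAS):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each of the k=5 folds serves once as validation data; averaging the
# fold scores gives an honest estimate on a small data set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(f"5-fold mean accuracy: {scores.mean():.3f}")
```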

Training: The Objective. Objective = Loss (accuracy) + Regularization (complexity). Use regularization to avoid overfitting. In the linear regression example, the penalty added to the loss is either L1 (sum of absolute coefficients) or L2 (sum of squared coefficients), scaled by a regularization hyperparameter.
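A sketch of the difference using scikit-learn (an assumption for illustration; the talk uses SAS): on data where only a few inputs are informative, the L1 penalty zeroes out coefficients outright, while ordinary least squares keeps them all.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=30, n_informative=5,
                       noise=5.0, random_state=1)

ols = LinearRegression().fit(X, y)   # no regularization
l2 = Ridge(alpha=10.0).fit(X, y)     # L2: shrinks coefficients toward zero
l1 = Lasso(alpha=5.0).fit(X, y)      # L1: sets many coefficients exactly to zero

print("nonzero coefficients (OLS vs L1):",
      np.count_nonzero(ols.coef_), "vs", np.count_nonzero(l1.coef_))
```

L2 keeps every input but with smaller weights; L1 doubles as a feature-selection device.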

Training: Algorithm Selection (traditional regression, decision tree, neural network, ...). The choice hinges on:
- What is the size and nature of your data?
- What are you trying to achieve with your model?
- How accurate does your model need to be?
- How much time do you have to train your model?
- How interpretable or understandable does your model need to be?

Training: Algorithm Selection https://github.com/sassoftware/enlighten-apply/tree/master/ml_tables

The Master Algorithm - Pedro Domingos

Ensemble Modeling: the wisdom of the crowd. Aristotle argued in Politics that the collective wisdom of many is likely more accurate than that of any one person. (Image: http://www.englishchamberchoir.com/)

Ensemble Modeling: combine algorithms' strengths to compensate for their weaknesses. Averaging a high-bias (underfit) model with a high-variance (overfit) model can land the ensemble closer to the target.

Ensemble Modeling: account for sample variation. Models trained on different samples of the data disagree; averaging them into an ensemble smooths out that variation.

Ensemble Modeling: build predictive models, then combine them. Three ways to generate the base models:
- Different algorithms (e.g., decision tree + SVM + neural network)
- One algorithm, different configurations (e.g., various configurations of neural networks)
- One algorithm, different data samples (e.g., random forest, gradient boosting machine)

Ensemble Modeling: bagging vs. boosting. In bagging (e.g., random forest), base learners are trained independently and their results combined. In boosting (e.g., gradient boosting machine), each model attempts to improve on past results.
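Both families in a few lines, sketched with scikit-learn (an assumption for illustration; the talk uses SAS):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging family: independent trees on bootstrap samples, votes combined
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting family: each new tree corrects the errors of the ensemble so far
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print(f"random forest:     {rf.score(X_te, y_te):.3f}")
print(f"gradient boosting: {gbm.score(X_te, y_te):.3f}")
```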

Ensemble Modeling: combining the models. Given predictions P1, P2, P3 from base models (e.g., a decision tree, an SVM, and a neural network):
- Averaging or voting: (P1 + P2 + P3) / 3
- Stacking/blending: a second-level model takes the predictions as inputs
- Cluster-based selection: pick the prediction of the model that performs best for the cluster an observation belongs to
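Stacking, sketched with scikit-learn (an assumption for illustration; the talk uses SAS): a second-level logistic regression learns how to weigh the predictions of the tree, SVM, and neural network base learners.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Base learners as on the slide: decision tree + SVM + neural network;
# their (cross-validated) predictions feed the second-level model
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
        ("nnet", make_pipeline(StandardScaler(),
                               MLPClassifier(max_iter=1000, random_state=0))),
    ],
    final_estimator=LogisticRegression(),
    cv=3)
stack.fit(X, y)
print(f"training accuracy: {stack.score(X, y):.3f}")
```

Scikit-learn builds the second-level training set from cross-validated base predictions, which avoids leaking each base model's training fit into the blender.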


Model Tuning. Training fits the model itself (regression coefficients, neural net weights, tree splitting rules, etc.); tuning searches over the algorithm's hyperparameters: max tree depth and splitting criterion for trees, network configuration and solver options for neural networks, polynomial order and penalty parameter for regression. The best settings are very data- and problem-dependent. (Tues 12:30, Super Demo D)

Model Tuning: Common Approaches. For hyperparameters x1 and x2 in y = f(x1) + g(x2), each sampled point costs one model train-and-assessment:
- Standard grid search: evaluate a fixed lattice of combinations
- Random search: sample combinations at random
- Latin hypercube: sample so that each hyperparameter's range is covered evenly
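Random search, sketched with scikit-learn (an assumption for illustration; the talk uses SAS autotuning): instead of exhausting a grid, sample a fixed budget of hyperparameter combinations.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Each of the n_iter samples is one model train-and-assessment
# (here assessed with 3-fold cross validation)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": randint(1, 10),
                         "n_estimators": randint(10, 100)},
    n_iter=8, cv=3, random_state=0)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
```

For the same budget, random sampling usually covers each individual hyperparameter's range better than a coarse grid does.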

Model Tuning: formal optimization methods can more intelligently search the hyperparameter space to find a combination that minimizes generalization error. A genetic algorithm, for example: start from a Latin hypercube sample of the hyperparameter space, evaluate each candidate in the population (train the model, then assess on a validation set or via k-fold cross validation), and iterate generations through crossover and mutation until a stopping condition (max time, max number of evaluations, or max number of iterations) yields the optimal set of hyperparameter values and the best model.

Computational Resources: GPUs, multi-threading, low-level vs. interpreted languages (performance vs. ease of use), distributed computing. [Charts: forest tuning time (seconds) vs. number of nodes used for training, for a tiny problem (iris: 105 train / 45 validate) and a medium problem (credit data: 49k train / 21k validate).]

Interpretability: from attributes, through the model, to a decision. Techniques:
- Contemporary regression techniques with L1 regularization
- Generalized additive models
- Surrogate models; train a small, interpretable ensemble
- Variable importance measures
- Partial dependency plots
- Non-negative, monotonic predictors
- LIME
(Tues 2:15, Raphael 3)

Deployment (including data prep!):
- Port to a compiled language
- Consider commercial software that can manage and automate deployment
- Update the API; deploy as a web service
- Monitor for decay

Conclusion: best practices across data preparation, training, and deployment add up to a successful machine learning application.

Resources github.com/sassoftware/enlighten-apply/tree/master/ml_tables

Resources:
- 50 Years of Data Science, by David Donoho
- Statistical Modeling: The Two Cultures, by Leo Breiman
- Evolution of Analytics, by SAS