Machine Learning Track

1 Intel HPC Developer Convention, Salt Lake City 2016
Machine Learning Track
Data Analytics, Machine Learning and HPC in today's changing application environment
Franz J. Király

2 An overview of data analytics
[Diagram labels: Statistical Programming (practical): R, python; DATA; Scientific Questions; Exploration; Statistical Questions; Quantitative Modelling; Methods; The Scientific Method; Descriptive/Explanatory; Predictive/Inferential; Scientific and Statistical Validation; Knowledge]

3 Data analytics and data science in a broader context
Raw data -> Clean data -> Data analytics -> Knowledge
Raw data, clean data: a lot of problems and subtleties arise at these stages already. Often, most of the manpower in a data project needs to go here first, before one can attempt reliable data analytics.
Data analytics: statistics, modelling, data mining, machine learning.
Knowledge: relevant findings and underlying arguments need to be explained well and properly.

4 Big Data?

5 What Big Data may mean in practice
Axes: number of features; number of data samples
Strategies that stop working in reasonable time: manual exploratory data analysis; kernel methods, OLS; random forests; L1, LASSO (around the same order); super-linear algorithms; linear algorithms, including reading in all the data
Solution strategies: feature extraction; feature selection; large-scale strategies for super-linear algorithms; on-line models; distributed computing; sub-sampling

6 Large-scale motifs in data science = where high-performance computing is helpful/impactful
Big models
  Not necessarily a lot of data, but computationally intensive models
  Classical example: finite elements and other numerical models
  New fancy example: large neural networks, aka deep learning
  Common HPC motif: divide/conquer in parts-of-model, e.g. neurons/nodes
Big data = the classic, beloved by everyone
  = what it says, a lot of data (ca. 1 million samples or more)
  Computational challenge arises from processing all of the data
  Example: histogram or linear regression with huge amounts of data
  Common HPC motif: divide/conquer training/fitting of model, e.g. batchwise/epoch fitting
Model validation and model selection = this talk's focus
  Answers the question: which model is best for your data?
  Demanding even for simple models and small amounts of data!
  Example: is deep learning better than logistic regression, or guessing?
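The "divide/conquer over data" motif for the histogram example can be sketched as follows. This is a minimal illustration, not code from the talk: the function names, the fixed bin width and the toy values are all assumptions, and the chunks are processed in a plain loop where a real setup would hand each chunk to a separate worker.

```python
from collections import Counter

def partial_histogram(chunk, bin_width):
    """Count how many values of one chunk fall into each bin (bin index = floor(x / width))."""
    return Counter(int(x // bin_width) for x in chunk)

def chunked_histogram(values, bin_width, chunk_size):
    """Build the full histogram by summing per-chunk histograms.

    Each chunk is independent of the others, so the partial histograms
    could be computed on different cores or nodes and merged afterwards.
    """
    total = Counter()
    for start in range(0, len(values), chunk_size):
        total += partial_histogram(values[start:start + chunk_size], bin_width)
    return dict(total)

# Toy values, assumed for illustration.
data = [0.5, 1.5, 1.7, 2.2, 3.9, 0.1, 2.8]
hist = chunked_histogram(data, bin_width=1.0, chunk_size=3)
```

The merge step is a plain sum of counts, which is why this kind of task parallelizes easily: the result does not depend on how the data were cut into chunks.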

7 Meta-modelling: stylized case studies
Customer: Hospital specializing in treatment of patients with a certain disease.
  Patients with this disease are at risk of experiencing an adverse event (e.g. death).
  Scientific question: depending on patient characteristics, predict the event risk.
  Data set: complete clinical records of patients, including the event if it occurred.
Customer: Retailer who wants to accurately model the behaviour of customers.
  Customers can buy (or not buy) any of a number of products, or churn.
  Scientific question: predict future customer behaviour given past behaviour.
  Data set: complete customer and purchase records of customers.
Customer: Manufacturer who wishes to find the best parameter setting for machines.
  Parameters influence amount/quality of product (or whether the machine breaks).
  Scientific question: find parameter settings which optimize the above.
  Data set: outcomes for parameter settings on those machines.
Of interest: model interpretability; how accurate the predictions are expected to be; whether the algorithm/model is (easily) deployable in the real world.
Not of interest: which algorithm/strategy, out of many, exactly solves the task.

8 Model validation and model selection
= data-centric and data-dependent modelling
A scientific necessity implied by the scientific method and the following:
1. There is no model that is good for all data. (otherwise the concept of a model would be unnecessary)
2. For given data, there is no a-priori reason to believe that a certain type of model will be the best one. (any such belief is not empirically justified, hence pseudoscientific)
3. No model can be trusted unless its validity has been verified by a model-independent argument. (otherwise the justification of validity is circular, hence faulty)
Machine learning provides algorithms & theory for meta-modelling, and powerful algorithms motivated by meta-modelling optimality.

9 Machine Learning and Meta-Modelling in a Nutshell

10 Leitmotifs of Machine Learning
from the intersection of engineering, statistics and computer science
Engineering & statistics idea: statistical models are objects in their own right ("learning machines", modelling strategy)
Engineering & computer science idea: any abstract algorithm can be a modelling strategy/learning machine (computational learning; the modelling strategy is possibly non-explicit)
Computer science & statistics idea: future performance of an algorithm/learning machine can (and should) be estimated (model validation, model selection)

11 Problem types in Machine Learning
Supervised Learning: some data is labelled by an expert/oracle
Task: predict label from covariates
Statistical models are usually discriminative
Examples: regression, classification

12 Problem types in Machine Learning
Unsupervised Learning: the training data is not pre-labelled
Task: find structure or pattern in data
Statistical models are usually generative
Examples: clustering, dimension reduction

13 Advanced learning tasks
Complications in the labelling
  Semi-supervised learning: some training data are labelled, some are not
  Reinforcement learning: data are not directly labelled, only indirect gain/loss
  Anomaly detection: all or most data are positive examples, the task is to flag test negatives
Complications through correlated data and/or time
  On-line learning: the data is revealed with time, models need to update
  Forecasting: each data point has a time stamp, predict the temporal future
  Transfer learning: the data comes in dissimilar batches, train and test may be distinct

14 What is a Learning Machine?
An algorithm that solves, e.g., the previous tasks.
Illustration: supervised learning machine
  training data -> model fitting ("learning") -> fitted model
  new data observations -> prediction (with the fitted model) -> predictions, e.g., to base decisions on
  model tuning parameters
Examples: generalized linear model, linear regression, support vector machine, neural networks (= "deep learning"), random forests, gradient boosting, ...

15 Example: Linear Regression
  training data -> model fitting ("learning") -> fitted model
  new data observations -> prediction (with the fitted model) -> predictions
  Tuning parameter: fit intercept or not?
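The fit/predict picture for linear regression, with "fit intercept or not" as the tuning parameter, might look like the sketch below. It assumes a single covariate and ordinary least squares in closed form; the names `fit_line` and `predict_line` and the toy data are illustrative and not taken from the slides.

```python
def fit_line(xs, ys, fit_intercept=True):
    """Least-squares fit of y = a + b*x, or y = b*x if fit_intercept is False.

    Returns the fitted model as a pair (intercept, slope).
    """
    n = len(xs)
    if fit_intercept:
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
    else:
        # Regression through the origin: the intercept is fixed at zero.
        slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        intercept = 0.0
    return intercept, slope

def predict_line(model, xs):
    """Apply a fitted (intercept, slope) model to new observations."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]

# Toy training data following y = 1 + 2x exactly (assumed for illustration).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

with_intercept = fit_line(train_x, train_y, fit_intercept=True)
without_intercept = fit_line(train_x, train_y, fit_intercept=False)
predictions = predict_line(with_intercept, [4.0, 5.0])
```

On this data the model with intercept recovers the generating line, while the model without intercept has to compensate with a steeper slope. Which setting is better is a question for the data, which is exactly what model validation answers.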

16 Model validation: does the model make sense?
Model learning: training data -> prediction strategy / learning machine (e.g. regression, GLM, advanced methods) -> learnt model
Prediction: test data (hold-out) -> learnt model (e.g. evaluating the regression model) -> predictions
Compare & quantify: predictions against the test labels ("the truth"); in-sample versus out-of-sample
Predictive models need to be validated on unseen data!
The only (general) way to test goodness of prediction is actually observing prediction!
Which means the part of the data used for testing has not been seen by the algorithm before!
(note: this includes the case where machine = linear regression, deep learning, etc.)
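A hold-out validation of this kind can be sketched in a few lines. The learner here is deliberately trivial (it predicts the training mean) so that the split logic stays visible; the function names, the 25% test fraction, the seed and the toy labels are all assumptions made for illustration.

```python
import random

def holdout_split(n, test_fraction, seed):
    """Return (train_idx, test_idx): a random, disjoint split of range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def fit_mean(ys):
    """A deliberately simple learning machine: always predict the training mean."""
    return sum(ys) / len(ys)

def rmse(prediction, ys):
    """Root mean squared error of a constant prediction against the labels ys."""
    return (sum((y - prediction) ** 2 for y in ys) / len(ys)) ** 0.5

# Toy labels, assumed for illustration.
labels = [float(i % 7) for i in range(40)]

train_idx, test_idx = holdout_split(len(labels), test_fraction=0.25, seed=0)
model = fit_mean([labels[i] for i in train_idx])             # sees training data only
in_sample = rmse(model, [labels[i] for i in train_idx])      # optimistic in general
out_of_sample = rmse(model, [labels[i] for i in test_idx])   # the honest estimate
```

The same skeleton applies to any learning machine: only `fit_mean` and the prediction step would change, while the rule that the test indices never reach the fitting step stays fixed.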

17 Re-sampling
[Diagram: all data are split several times into training data and test data (splits 1, 2, 3). Predictors 1, 2, 3 are fitted on each training set and evaluated on the matching test set, giving errors 1, 2, 3 per split; the errors are aggregated and compared.]
Multiple algorithms are compared on multiple data splits/sub-datasets.
State-of-the-art principle in model validation, model comparison and meta-modelling.

Types of re-sampling: how to obtain training/test splits, and pros/cons

k-fold cross-validation (often: k=5)
  How: 1. divide data into k (almost) equal parts; 2. obtain k train/test splits via: each part is test data exactly once, the rest of the data is the training set
  Pros/cons: good compromise between runtime and accuracy when k is small compared to data size

leave-one-out = [number of data points]-fold c.v.
  Pros/cons: very accurate, high run-time

repeated sub-sampling (parameters: training/test size, # of repetitions)
  How: 1. obtain a random sub-sample of training/test data of specified sizes (train/test need not cover all data); 2. repeat 1. the desired number of times
  Pros/cons: can be arbitrarily quick; can be arbitrarily inaccurate (depending on parameter choice); can be combined with k-fold
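The split schemes described above can be written down as index generators. The sketch below is one possible reading, with hypothetical function names; it splits contiguously without shuffling, so in practice the data order would usually be randomized first.

```python
import random

def kfold_indices(n, k):
    """Yield k (train, test) index lists; every index is test data exactly once."""
    # Fold sizes differ by at most one when k does not divide n.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def leave_one_out_indices(n):
    """Leave-one-out is k-fold with k equal to the number of data points."""
    return kfold_indices(n, n)

def repeated_subsampling(n, n_train, n_test, repetitions, seed):
    """Yield random disjoint (train, test) samples; they need not cover all data."""
    rng = random.Random(seed)
    for _ in range(repetitions):
        picked = rng.sample(range(n), n_train + n_test)
        yield picked[:n_train], picked[n_train:]

five_fold = list(kfold_indices(10, 5))
loo = list(leave_one_out_indices(4))
subsamples = list(repeated_subsampling(100, n_train=20, n_test=5, repetitions=3, seed=1))
```

The runtime trade-off from the slide is visible in the counts: k-fold needs k fits, leave-one-out needs one fit per data point, and repeated sub-sampling needs exactly as many fits as repetitions requested, each on as little data as chosen.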

18 Quantitative model comparison
A benchmarking experiment results in a table like this:
[Table: one row per model, with columns RMSE and MAE; each entry is an error estimate ± its uncertainty. Most values did not survive extraction; surviving fragments: 15.3, 0.8, 20.1, 1.1.]
Confidence regions (or paired tests) are used to compare models to each other: A is better than B / B is better than A / A and B are equally good.
An uninformed model (stupid model/random guess) needs to be included, otherwise the statement "is better than an uninformed guess" cannot be made.
useful model = (significantly) better than uninformed baseline
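How a "value ± uncertainty" entry and a paired comparison against the uninformed baseline could be computed is sketched below. The per-fold error values are toy numbers invented for illustration (they are not results from the talk), and the two-standard-error rule is a rough heuristic standing in for a proper paired test.

```python
import statistics

def mean_and_stderr(values):
    """Mean and standard error of the mean over re-sampling folds."""
    return statistics.mean(values), statistics.stdev(values) / len(values) ** 0.5

def compare_paired(errors_a, errors_b):
    """Mean per-fold difference (A minus B) with its standard error.

    Pairing by fold removes the variation that both models share
    because they were evaluated on the same test sets.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return mean_and_stderr(diffs)

# Toy per-fold errors, invented for illustration only.
model_errors = [1.0, 1.2, 0.8, 1.1, 0.9]
baseline_errors = [2.0, 2.3, 1.9, 2.2, 2.1]   # uninformed baseline, same folds

model_mean, model_se = mean_and_stderr(model_errors)
diff, diff_se = compare_paired(model_errors, baseline_errors)

# Rough heuristic, not a formal test: call the model useful if its error is
# more than two standard errors below that of the uninformed baseline.
model_is_useful = diff + 2 * diff_se < 0
```

Errors from overlapping training sets are not independent, so standard errors computed this way tend to be optimistic; they are best read as an indication rather than an exact confidence statement.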

19 Meta-model: automated parameter tuning
Re-sampling is used to determine the best parameter setting.
For validation, new unseen data needs to be used.
Multi-fold schemes are nested: splits within splits (tuning train / tuning test inside the training data, real test outside).
[Diagram: all data are split into training data and test data. The training data are re-sampled; Parameters 1, 2, 3 are each scored by model goodness on the re-sampled training data. The best parameters are then fit to the whole training data; predict & quantify model goodness on the test data.]
Important caveat: the inner training/test splits need to be part of any outer training set, otherwise validation is not out-of-sample!
New tuning parameters: which measure of predictive goodness; which inner re-sampling scheme.
Methods are usually less sensitive to these new tuning parameters.
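The nesting can be made concrete with a small sketch. It assumes a one-covariate ridge regression through the origin as the tunable model, strided folds as the re-sampling scheme and mean squared error as the measure; all names, the penalty grid and the toy data are illustrative choices, not the talk's.

```python
def folds(indices, k):
    """Split a list of indices into k (train, test) pairs using strided folds."""
    for i in range(k):
        test = indices[i::k]
        held_out = set(test)
        train = [j for j in indices if j not in held_out]
        yield train, test

def fit_ridge(xs, ys, lam):
    """Slope of a ridge regression through the origin with penalty lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(slope, xs, ys):
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def tune(xs, ys, train_idx, grid, k_inner):
    """Pick the penalty with the lowest inner-CV error, using train_idx only."""
    def inner_error(lam):
        errors = []
        for tr, te in folds(train_idx, k_inner):
            slope = fit_ridge([xs[i] for i in tr], [ys[i] for i in tr], lam)
            errors.append(mse(slope, [xs[i] for i in te], [ys[i] for i in te]))
        return sum(errors) / len(errors)
    return min(grid, key=inner_error)

def nested_cv(xs, ys, grid, k_outer, k_inner):
    """Outer loop validates the whole 'tune, then refit' strategy on unseen data."""
    results = []
    for train_idx, test_idx in folds(list(range(len(xs))), k_outer):
        best = tune(xs, ys, train_idx, grid, k_inner)      # inner splits only
        slope = fit_ridge([xs[i] for i in train_idx],
                          [ys[i] for i in train_idx], best)  # refit on all training data
        results.append((best, mse(slope, [xs[i] for i in test_idx],
                                  [ys[i] for i in test_idx])))
    return results

# Toy noiseless data y = 2x, assumed for illustration; no penalty is best here.
xs = [i / 10 for i in range(1, 31)]
ys = [2.0 * x for x in xs]
results = nested_cv(xs, ys, grid=[0.0, 1.0, 10.0], k_outer=3, k_inner=4)
```

The caveat from the slide corresponds to the argument `train_idx` passed to `tune`: the inner folds are cut from the outer training indices only, so the outer test indices never influence the choice of penalty.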

20 Meta-Strategies in ML
Model tuning: model with tuning parameters -> best tuning parameters are determined using a data-driven tuning algorithm
Ensemble learning: a number of (possibly "weak") models (A, B, C, D) -> "strong" ensemble model
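One simple way to combine several weak models into an ensemble is majority voting, sketched below. The three threshold classifiers and the six data points are contrived for illustration so that each weak model errs on a different point; real ensembles such as random forests or boosting build and weight their members from data.

```python
from collections import Counter

def majority_vote(models, x):
    """Predict the label that most of the models agree on."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def accuracy(predict, xs, labels):
    """Fraction of points on which the prediction matches the label."""
    return sum(predict(x) == y for x, y in zip(xs, labels)) / len(xs)

# Contrived toy problem: the true label is 1 exactly when x >= 4.
xs = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 0, 1, 1, 1]

# Three weak models, each wrong on a different single point.
weak_models = [
    lambda x: int(x >= 3),             # wrong at x = 3
    lambda x: int(x >= 5),             # wrong at x = 4
    lambda x: int(x >= 4 and x != 6),  # wrong at x = 6
]

def ensemble(x):
    return majority_vote(weak_models, x)

weak_accuracies = [accuracy(m, xs, labels) for m in weak_models]
ensemble_accuracy = accuracy(ensemble, xs, labels)
```

The ensemble is only stronger than its members because their errors do not coincide; three models making the same mistake would vote that mistake through.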

21 Object dependencies in the ML workflow
One interesting dataset (all data: N data points, "small data")
  is re-sampled into multiple train/test splits (typical number: 5-10 outer splits),
  on each of which the strategies 1, 2, ..., M are compared (typically M = 5-20),
  most of which are parameter-tuned by the same principle (3-5 nested splits, over a number of parameter combinations).
Ensembles: further nesting over base learners.
Runtime = 10 x 10 x 5 (x 100) x one run on N samples (usually O(N²) or O(N³))
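The runtime product can be checked with a short calculation. The mapping of the factors to outer splits, strategies, nested splits and parameter combinations is one reading of the slide, and treating the optional factor of 100 as the number of parameter combinations is an assumption.

```python
def count_model_fits(outer_splits, strategies, inner_splits, param_combinations=1):
    """Number of single model fits in a nested benchmark (simplified count).

    Ignores the final refit per outer split and any further nesting
    inside ensembles, both of which add to the total.
    """
    return outer_splits * strategies * inner_splits * param_combinations

without_grid = count_model_fits(10, 10, 5)        # 10 x 10 x 5
with_grid = count_model_fits(10, 10, 5, 100)      # 10 x 10 x 5 x 100
```

Under these assumptions a benchmark on one small dataset already needs 500 model fits, or 50,000 with a parameter grid of 100 combinations, each costing one run on roughly N samples.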

22 Machine Learning Toolboxes

23 An incomplete list of influential toolboxes
scikit-learn is perhaps the most widely used ML toolbox.
[Comparison table; most toolbox names and the cell layout did not survive extraction. Columns: Language; Modular API (e.g., methods); GUI; Common models; Model tuning, meta-methods; Model validation and comparison. Surviving cell entries: R; python; caret; multi-interface; Java; "not entirely"; "mostly kernels"; "some"; "3rd party wrappers"; "few, mostly classifiers"; "few".]

24 The object-oriented ML Toolbox API
as found in the R/mlr or scikit-learn packages
Leading principles: encapsulation, modularization
  learning machine <-> object
  modular structure <-> object orientation
Example object: Linear regression, with fit(traindata), predict(testdata), plus metadata & model info
Abstraction models objects with a unified API:

Concept abstracted: Learning machines
  Public interface: fitting, predicting, set parameters
  In R/mlr: Learner
  In sklearn: estimator
Concept abstracted: Re-sampling schemes
  Public interface: sample, apply & get results
  In R/mlr: ResampleDesc
  In sklearn: splitter classes in model_selection
Concept abstracted: Evaluation metrics
  Public interface: compute from results, tabulate
  In R/mlr: Measure
  In sklearn: metrics classes in metrics
Concept abstracted: Meta-modelling (tuning, ensembling, pipelining)
  Public interface: wrapping machines by strategy
  In R/mlr: various wrappers
  In sklearn: fused classes (tuning), various wrappers (ensembling), Pipeline (pipelining)
Concept abstracted: Learning task
  Public interface: benchmark, list strategies/measures
  In R/mlr: Task
  In sklearn: implicit, not encapsulated
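The idea of a unified learner interface can be illustrated with a toy class hierarchy. The class and method names below are hypothetical and only mimic the fit/predict/set-parameters pattern; they are not the actual mlr or scikit-learn classes, and the learners handle a single numeric covariate.

```python
class Learner:
    """Hypothetical minimal learner interface (not the mlr or scikit-learn API)."""

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        raise NotImplementedError

    def predict(self, X):
        raise NotImplementedError


class MeanRegressor(Learner):
    """Uninformed baseline: always predicts the mean of the training labels."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestNeighbourRegressor(Learner):
    """Averages the labels of the k training points closest to each query."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            order = sorted(range(len(self.X_)), key=lambda i: abs(self.X_[i] - x))
            nearest = order[:self.k]
            predictions.append(sum(self.y_[i] for i in nearest) / len(nearest))
        return predictions


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(learner, X_train, y_train, X_test, y_test, measure):
    """Works for any learner and any measure, because both follow one interface."""
    return measure(y_test, learner.fit(X_train, y_train).predict(X_test))


# Toy data, assumed for illustration.
X_train, y_train = [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]
X_test, y_test = [1.1, 2.9], [1.0, 3.0]

baseline_mae = evaluate(MeanRegressor(), X_train, y_train, X_test, y_test, mean_absolute_error)
nn1_mae = evaluate(NearestNeighbourRegressor(k=1), X_train, y_train, X_test, y_test, mean_absolute_error)
nn2_mae = evaluate(NearestNeighbourRegressor().set_params(k=2), X_train, y_train, X_test, y_test, mean_absolute_error)
```

Because `evaluate` only relies on `fit` and `predict`, re-sampling, tuning and benchmarking code can be written once and reused for every learner, which is the point of the encapsulation principle above.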

25 HPC for benchmarking/validation today
scikit-learn: joblib
mlr: parallelMap
At the selected level (one of 1-4): distribute to clusters/cores
[Diagram, as on slide 21: all data, N data points ("small data"), split into training data and test data; levels: outer splits, strategies 1, 2, ..., M, nested splits, parameter combinations, base learners.]
Plus algorithm-specific HPC interfaces, e.g. deep learning (mutually exclusive)
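Distributing a benchmark at one selected level can be sketched with the Python standard library. Here `concurrent.futures` stands in for joblib or parallelMap, which is a substitution made to keep the example dependency-free; the level chosen is the (outer split, strategy) pair, and the toy data, strategy names and worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

DATA = [(float(x), 2.0 * x + 1.0) for x in range(12)]   # toy data: y = 2x + 1
K = 3                                                    # number of outer splits

def split(fold):
    """Strided k-fold split of DATA into (train, test) for the given fold."""
    train = [p for i, p in enumerate(DATA) if i % K != fold]
    test = [p for i, p in enumerate(DATA) if i % K == fold]
    return train, test

def fit_mean(train):
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x, _ in train))
    return lambda x: my + slope * (x - mx)

STRATEGIES = {"mean baseline": fit_mean, "linear regression": fit_line}

def run_job(job):
    """One independent unit of work: fit one strategy on one outer split."""
    fold, name = job
    train, test = split(fold)
    model = STRATEGIES[name](train)
    mae = sum(abs(model(x) - y) for x, y in test) / len(test)
    return fold, name, mae

# Every (split, strategy) pair is independent of the others, so the jobs
# can be handed to a pool of workers in any order.
jobs = list(product(range(K), STRATEGIES))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_job, jobs))
```

Threads are used here only to keep the sketch short; for CPU-bound fitting in pure Python a process pool or a cluster back end would be the natural choice, and the job list would stay the same.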

26 HPC support tomorrow?
Layer 1: full graph of dependencies: re-samples, algorithms, parameters
Layer 2: scheduler for algorithms and meta-algorithms
  Combining (?) MapReduce, DAAL, dask, joblib -> TBB?
  DATA (e.g. Hadoop); data/task pipeline
  (image source: Continuum Analytics)
Layer 3: optimized primitives: linear systems, convex optimization, stoch. gradient descent
  e.g. MKL, CUDA, BLAS
  (image source: Intel Math Kernel Library)
Layer 4: hardware API, e.g. distributed, multi-core, multi-type/heterogeneous

27 Challenges in ML APIs and HPC
Surprisingly few resources have been invested in ML toolboxes.
Most advanced toolboxes are currently open-source & academic.
Features that would be desirable to the practitioner but are not available without mid-scale software development:
Integration of (a) data management, (b) exploration and (c) modelling
  especially challenging: integration in large-scale scenarios
  e.g. MapReduce for divide/conquer over data, model parts, and models
Full HPC integration on a granular level for distributed ML benchmarking
  making full use of parallelism for nesting and computational redundancies
  complete HPC architecture for the whole model benchmarking workflow
Non-standard modelling tasks, structured data (incl. time series)
  data heterogeneity, multiple datasets, time series, spatial features, images, etc.
  forecasting, on-line learning, anomaly detection, change point detection
  meta-modelling and re-sampling for these is an order of magnitude more costly
