Machine Learning and SystemML. Nikolay Manchev Data Scientist Europe E-


1 Machine Learning and SystemML Nikolay Manchev Data Scientist Europe E-

2 A Simple Problem
In this activity, you will analyze the relationship between educational attainment and median income using data from the ACS by examining a scatter plot and the linear model that best fits that scatter plot, and by solving problems using the linear equation.
Table columns: Educational Attainment, Median Income in USD
Educational attainment levels: Less than high school graduate; High school graduate; Some college or associate's degree; Bachelor's degree; Graduate or professional degree

3 Machine Learning "Field of study that gives computers the ability to learn without being explicitly programmed" Arthur Samuel,

4 Advantages
Machines can handle bigger amounts of data
Machines can work with high-dimensional data
Machines can work it out faster

5 Enneract (9-dimensional hypercube)

6 Use case #1
Detecting potential "lemon cars"
2 million cars
8 000 cars reacquired
10 million repair cases
25 million parts exchanges
Logistic regression model input features
Improved precision/recall by an order of magnitude
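The precision/recall claim above can be made concrete with a minimal sketch. The labels below are hypothetical (1 means a car was reacquired, 0 means it was not) and are not taken from the use case; the function name `precision_recall` is also just an illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and model output, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Precision is the share of flagged cars that really were lemons; recall is the share of real lemons that were flagged. With very few positives (thousands of reacquired cars among millions), both are usually more informative than plain accuracy.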

7 Machine Learning
Supervised Machine Learning: We provide a training set of labelled examples and fit a model to predict the correct labels using the features.
Unsupervised Machine Learning: No desired output is provided. The model finds similarities in the data based on the features alone.

8 Use case #2
Large holiday operator
Looking to enrich their web shop with custom recommendations
Diagram: Search -> Result -> Recommend (labels: Sardinia, Sicily, Majorca, Ibiza, all inclusive, Canary Islands)

9 Piece of cake
Collaborative filtering
Based on user-to-item rating matrix
Computes similarity measure between users
Rating matrix: columns Sardinia, Majorca, Aspen; rows User #1 ... User #n
Make a prediction
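A minimal sketch of user-based collaborative filtering, assuming cosine similarity as the similarity measure (the slide does not say which measure was used). The ratings, the rating scale (negative values mean dislike), and the helper names are all hypothetical.

```python
import math

# Hypothetical user-to-item ratings; a missing key means "not rated yet".
ratings = {
    "user_1": {"Sardinia": 4, "Majorca": 3, "Aspen": -1},
    "user_2": {"Sardinia": 5, "Majorca": 4, "Aspen": -2},
    "user_3": {"Sardinia": -3, "Majorca": -2, "Aspen": 5},
    "user_n": {"Sardinia": 4, "Majorca": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    numerator = 0.0
    denominator = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        numerator += sim * their_ratings[item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None


predicted = predict(ratings, "user_n", "Aspen")
print(predicted)
```

In this toy data, user_n rates the beach destinations like user_1 and user_2, who both disliked Aspen, so the predicted rating comes out negative. The approach depends on a reasonably filled rating matrix, which is exactly what a web shop with mostly anonymous searches tends to lack.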

10 Unsupervised learning to the rescue
Mixture of Gaussians model
Based on search strings
n fixed classes
Hand-crafted rules tailored to classes
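To illustrate what fitting a mixture of Gaussians involves, here is a minimal expectation-maximization (EM) sketch for two components in one dimension. The slides do not describe how search strings were turned into features, so the single numeric feature and the data below are purely hypothetical, as is the function name `fit_gmm_1d`.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, n_iter=50, min_var=1e-3):
    """Fit a two-component 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    means = [min(data), max(data)]  # deterministic initialisation
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, min_var)  # floor avoids a collapsed component
            weights[k] = nk / len(data)
    return weights, means, variances


# Hypothetical feature values forming two obvious groups.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_gmm_1d(data)
print(weights, means, variances)
```

No labels are provided; the two classes emerge from the data alone. Once each search is assigned to its most likely class, hand-crafted rules per class can drive the recommendation.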

11 Use case #2
Large holiday operator in the UK
Looking to enrich their web shop with custom recommendations
Diagram: Search -> Classifier -> Recommend (labels: 1. Corralejo 2. Costa Calma 3. Barracuda Point; all inclusive, H10 Rubicon, Regency Country Club, Taurito Princess; Sardinia, Sicily, Majorca, Ibiza)

12 It's Big Data

13 Why Spark
Traditional approach: MapReduce jobs
Input -> HDFS Read -> Iteration 1 (CPU, Memory) -> HDFS Write -> HDFS Read -> Iteration 2 (CPU, Memory) -> HDFS Write -> Result
The Spark approach: keep data in memory, distribute the execution
Input -> HDFS Read -> Iteration 1 (CPU, Memory) -> Iteration 2 (CPU, Memory)
Zero read/write disk bottleneck
Chain job output into new job input
faster than network & disk

14 IBM's Commitment to Spark
Official announcement (15th June 2015)
IBM will build Spark into the core of its analytics and commerce platforms
IBM will commit over 3,500 researchers & developers to work on Spark-related projects

15 A Simple Problem
In this activity, you will analyze the relationship between educational attainment and median income using data from the ACS by examining a scatter plot and the linear model that best fits that scatter plot, and by solving problems using the linear equation.
Table columns: Educational Attainment, Median Income in USD
Educational attainment levels: Less than high school graduate; High school graduate; Some college or associate's degree; Bachelor's degree; Graduate or professional degree

16 Find the best fitting line
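For a single predictor, the best fitting line in the least-squares sense has a closed form: the slope is the covariance of x and y divided by the variance of x. A minimal sketch follows; the income figures are not preserved in this transcript, so the points below are synthetic placeholders (x could be an encoded education level, y an income in arbitrary units).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic placeholder data, NOT the ACS figures.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted line can then be used the way the activity describes: plug an x value into `slope * x + intercept` to estimate y.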

17 We always look for patterns

18 Use case #3
Predictive model for a bank campaign
We want to predict successful outcomes

19 You need Data Scientists
Algorithms are NOT the problem
Understanding what data goes into those algorithms and how to interpret the results is the crux of the matter
Be very, very careful
Involving a data scientist after you've gathered the data is like involving a doctor after the patient...

20 IBM's Commitment to Spark
Official announcement (15th June 2015)
IBM will build Spark into the core of its analytics and commerce platforms
IBM will commit over 3,500 researchers & developers to work on Spark-related projects
IBM will educate more than data scientists on Spark

21 Big Data University - free online training

22 Data Science before Big Data

23 Enter Big Data

24 Obvious solution Big Data

25 IBM's Commitment to Spark
Official announcement (15th June 2015)
IBM will build Spark into the core of its analytics and commerce platforms
IBM will commit over 3,500 researchers & developers to work on Spark-related projects
IBM will educate more than data scientists on Spark
IBM will open source SystemML and collaborate with Databricks to advance Spark's machine learning capabilities

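26 Linear Regression Refresher
Simple Linear Regression: dependent variable (y), independent variables (X)
In order to estimate the parameters $\theta$ we have to minimize the (regularized) sum of squared errors between $y$ and $X\theta$.
There is an elegant solution that minimizes it: solve the linear system $(X^\top X + \operatorname{diag}(\lambda))\,\theta = X^\top y$.
We can solve it using R:
a = t(X) %*% X + diag(lambda);
b = t(X) %*% y;
theta = solve(a,b);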
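The same three steps can be traced in plain Python on a tiny example. This is only an illustrative sketch: the helper functions and the data are hypothetical, a single scalar `lam` stands in for the `lambda` vector, and a real implementation would use a linear algebra library rather than hand-written elimination.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def solve(a, b):
    """Solve a * theta = b (b is a column matrix) by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i][0]] for i in range(n)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    theta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(aug[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (aug[i][n] - known) / aug[i][i]
    return theta


# Hypothetical data generated from y = 1 + 2x; the first column of X is the intercept.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [[3.0], [5.0], [7.0], [9.0]]
lam = 0.0  # regularization strength; 0 gives ordinary least squares

Xt = transpose(X)
a = matmul(Xt, X)            # a = t(X) %*% X
for i in range(len(a)):
    a[i][i] += lam           # ... + diag(lambda)
b = matmul(Xt, y)            # b = t(X) %*% y
theta = solve(a, b)          # theta = solve(a, b)
print(theta)
```

Note that `a` is only n x n and `b` only n x 1, where n is the number of features. However many observations there are, the final `solve` works on a small system, which is what the execution discussion that follows exploits.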

27 Linear Regression - Execution
a = t(X) %*% X + diag(lambda); b = t(X) %*% y; theta = solve(a,b);
Inputs: X with 500 features and 300M observations (4TB text file); y with 300M observations (9GB text file)
Cluster configuration: 3.5 GB Map Task JVM; 7 GB in-memory master JVM; 128 MB HDFS block size
MAP: compute XTX and yTX for each 1k x 1k block
REDUCE: aggregate a and bT
In-memory computation ((a,b) < 2 MB): 1. get b 2. call solve(a,b)
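The reason this computation distributes well is that XTX and XTy are sums over rows, so each block of rows can be processed independently and the partial results added up. Below is a minimal single-machine sketch of that idea; the function names, block size, and data are hypothetical, and it mimics the map and reduce steps with ordinary Python functions rather than an actual MapReduce or Spark job.

```python
import functools


def partial_products(rows_x, rows_y):
    """Map step: X_block^T X_block and X_block^T y_block for one block of rows."""
    n = len(rows_x[0])
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for x, y in zip(rows_x, rows_y):
        for i in range(n):
            xty[i] += x[i] * y
            for j in range(n):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(left, right):
    """Reduce step: element-wise sum of two partial results."""
    (a1, b1), (a2, b2) = left, right
    a = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    b = [u + v for u, v in zip(b1, b2)]
    return a, b


# Hypothetical data: 6 observations, 2 features (the first is the intercept).
X = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6]]
y = [3, 5, 7, 9, 11, 13]

block_size = 2  # stands in for an HDFS block of rows
blocks = [(X[i:i + block_size], y[i:i + block_size]) for i in range(0, len(X), block_size)]

partials = [partial_products(bx, by) for bx, by in blocks]   # "map"
a, b = functools.reduce(combine, partials)                   # "reduce"

# Same result as processing all rows at once.
full_a, full_b = partial_products(X, y)
print(a, b)
```

Only the small aggregated `a` and `b` need to reach the master, which can then call `solve(a, b)` in memory.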

28 Changes that impact our implementation
Baseline: X is 300M x 500; cluster configuration 3.5 GB Map Task JVM, 7 GB in-memory master JVM, 128 MB HDFS block size
Scenarios shown in the diagram, each with a different plan for computing XTX, XTy and solve(a,b):
3 times more attributes
More observations (600M x 500)
Cluster configuration change (1.5 GB Map Task JVM, 7 GB in-memory master JVM, 128 MB HDFS block size)
The dataset fits in memory (1M x 100)
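A back-of-the-envelope calculation shows why such changes matter. The sketch below assumes dense matrices of 8-byte doubles and naively treats the whole JVM heap as usable; both are simplifying assumptions, and this is not SystemML's actual cost model.

```python
BYTES_PER_DOUBLE = 8
GB = 1024 ** 3


def dense_size_bytes(rows, cols):
    """Memory needed for a dense rows x cols matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


def fits_in_memory(rows, cols, budget_bytes):
    """True if the dense matrix fits in the given memory budget."""
    return dense_size_bytes(rows, cols) <= budget_bytes


map_task_budget = 3.5 * GB  # Map Task JVM size quoted on the slides

# X^T X is features x features, independent of the number of observations.
xtx_500 = dense_size_bytes(500, 500)      # 2,000,000 bytes
xtx_1500 = dense_size_bytes(1500, 1500)   # 18,000,000 bytes: 3x the attributes, 9x the memory

# X itself is far too large for a single task.
x_bytes = dense_size_bytes(300_000_000, 500)

print(xtx_500, xtx_1500, x_bytes)
print(fits_in_memory(500, 500, map_task_budget))
print(fits_in_memory(300_000_000, 500, map_task_budget))
```

A plan that relies on some intermediate fitting into one task's memory can stop being viable when the number of attributes grows or the JVM budget shrinks, even though the three lines of script are unchanged.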

29 To Summarize
3 lines of code
Minor changes in the data set / cluster configuration result in:
4 dramatically different execution plans
major change in performance
best solution becomes a non-working solution
How can we manage this?

30 What's in the SystemML box
High-level language front-ends
High-Level Operations (HOPs): general representation of statements in the data analysis language
Low-Level Operations (LOPs): general representation of operations in the runtime framework
Multiple execution environments

31 Backend performance

32 Out-of-the-box algorithms
Category: Description
Descriptive Statistics: Univariate, Bivariate, Stratified Bivariate
Classification: Logistic Regression, Multi-class SVM, Naïve Bayes, Decision Trees, Random Forest
Clustering: k-Means
Regression: Linear Regression (System of equations, SGD); Generalised Linear Models (Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli; Links for all distributions: identity, log, sq. root, inverse, 1/μ^2; Links for Binomial/Bernoulli: logit, probit, cloglog, cauchit); Stepwise (Linear, GLM)
Dimensionality Reduction: PCA
Matrix Factorization: ALS
Survival Models: Kaplan Meier, Cox
Predict: Scoring
Transformation: Recoding, dummy coding, binning, scaling, missing value imputation

33 Summary
Key features
Cost-based compilation
Out-of-the-box scalable machine learning algorithms
Support for custom algorithms: write your own code and don't worry about scalability, numeric stability, and optimization
Use it standalone, with MR backend, or with Spark backend
Fit into Spark APIs, consume and produce DataFrames
ML Pipeline integration
Use SystemML from Scala, Java, Python, R/SparkR
BigR integration (package)

34 Additional Resources
SystemML is available on GitHub
An in-depth scientific perspective:
Ghoting, Amol, et al. "SystemML: Declarative machine learning on MapReduce." ICDE 2011.
Boehm, Matthias, et al. "SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs." IEEE Data Eng. Bull. 37.3 (2014).
Huang, Botong, et al. "Resource Elasticity for Large-Scale Machine Learning." SIGMOD

