MACHINE LEARNING
Lauri Ilison, PhD
Data Scientist
20.11.2014

Example: Google search
27.11.14

Facebook: 350 million photo uploads every day
The dream is to build full knowledge of the world and know everything that is going on.

Germany's 12th Man at the World Cup: Big Data
The German football team used Big Data and Machine Learning tools to analyze video data from on-field cameras capable of capturing thousands of data points per second, including player position and speed. The team was able to analyze stats about average possession time and cut it down from 3.4 seconds to about 1.1 seconds. That style of play was evident in Germany's 7-1 victory over Brazil, which included three goals scored in a span of 179 seconds.
Spotify
Spotify uses deep learning to create personal music recommendations

Change in business models: From hardware seller to Data Company
- A hardware company was selling speakers and audio systems for supermarkets
- Customers asked for music?
- Customers asked to have the music played?
- The company started selecting the right music to increase sales
- Now they are a Data Company that also sells hardware
Supervised and Unsupervised learning

Machine Learning
- Supervised learning: we have previous knowledge (known outcomes) about the sample cases that are the basis for learning
  - Classification
  - Regression
  - Decision Trees
- Unsupervised learning: we do not have any previous knowledge about the sample cases that are the basis for learning
  - Clustering
  - Hidden Markov Chains
  - Dimensionality reduction

How it works - Linear regression?

Example: Linear Regression
TASK: find the price for a 46 m2 apartment

[Figure: price plotted against apartment size, with the fitted line y = ax + b; marked values: 46 m2, 56K]

In order to find the price of a 46 m2 apartment, we find the linear relation of the samples.
1. We assume a linear relation: Price = a * Size + b
2. We calculate the distance of each sample from the line
3. We search for the equation of the (blue) line with the minimal total distance from the samples (ordinary least squares minimizes the sum of squared vertical distances)
4. Knowing the line function, we calculate the price for the 46 m2 apartment
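The four linear-regression steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method used to produce the slide: the sample sizes and prices below are hypothetical values invented for the example, and `fit_line` is an illustrative helper name. It uses the closed-form ordinary least squares solution for a single feature.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The least-squares line always passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Hypothetical samples (assumption): apartment size in m2, price in thousand EUR.
sizes_m2 = [30, 40, 50, 60, 70]
prices_k = [41, 49, 60, 71, 79]

a, b = fit_line(sizes_m2, prices_k)

# Step 4: use the fitted line to price a 46 m2 apartment.
predicted_price_k = a * 46 + b
print(f"Price = {a:.2f} * Size + {b:.2f}")
print(f"Predicted price for 46 m2: {predicted_price_k:.2f}K")
```

With these made-up samples the fitted line is roughly Price = 0.98 * Size + 11, which puts the 46 m2 apartment at about 56K. Real data would of course give different coefficients.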
Clustering

How it works - Logistic regression?

Example: Bank loan decision
TASK: Find the probability of default for an applicant

Historical loan application data: 3000 samples, 16 factors (parameters P1 to P16)
Target (T): No Default = 0, Default = 1

In order to predict the probability of default we use multivariate logistic regression.

1. Logistic function

   $f(x) = \frac{1}{1 + e^{-x}}$

P1  P2  P3  P4  P5  P6  P7  P8  P9  P10  P11  P12  P13  P14  P15  P16  T
3   5   6   7   8   9   4   2   3   4    5    2    4    6    2    2    0
2   4   5   6   3   2   5   3   2   5    7    3    6    3    7    2    1
3   7   5   4   2   4   7   6   2   5    2    6    5    4    7    2    0
3   5   6   7   8   9   4   2   3   4    5    2    4    6    2    2    1
2   4   5   6   3   2   5   3   2   5    7    3    6    3    7    2    1
3   7   5   4   2   4   7   6   2   5    2    6    5    4    7    2    1
3   5   6   7   8   9   4   2   3   4    5    2    4    6    2    2    1
2   4   5   6   3   2   5   3   2   5    7    3    6    3    7    2    0
3   7   5   4   2   4   7   6   2   5    2    6    5    4    7    2    1
3   5   6   7   8   9   4   2   3   4    5    2    4    6    2    2    0
2   4   5   6   3   2   5   3   2   5    7    3    6    3    7    2    1
3   7   5   4   2   4   7   6   2   5    2    6    5    4    7    2    1
3   5   6   7   8   9   4   2   3   4    5    2    4    6    2    2    1
2   4   5   6   3   2   5   3   2   5    7    3    6    3    7    2    1
3   7   5   4   2   4   7   6   2   5    2    6    5    4    7    2    0
...
3   7   5   4   2   4   7   6   2   5    2    6    5    4    7    2    0

2. We create a model based on the historical data to predict default
3. Testing the model: split the learning dataset randomly into a training set (80%) and a test set (20%)

Confusion matrix (labels as on the slide, which treats class 0, No Default, as the positive class):

            Predicted 0      Predicted 1
Actual 0    True positive    False negative
Actual 1    False positive   True negative
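The three logistic-regression steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the model from the slide: the data is synthetic with 2 features instead of the 16 loan factors, training uses plain batch gradient descent, and the names `logistic`, `predict_proba` and `train` as well as the learning rate and epoch count are choices made for this example. The confusion matrix is keyed by (actual, predicted) so that no positive-class convention is imposed.

```python
import math
import random


def logistic(x):
    """Logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_proba(weights, bias, features):
    """Probability of class 1 for one sample."""
    z = bias + sum(w * v for w, v in zip(weights, features))
    return logistic(z)


def train(samples, epochs=200, learning_rate=0.5):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, target in samples:
            error = predict_proba(weights, bias, features) - target
            for j, value in enumerate(features):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


# Synthetic data (assumption): 200 samples, 2 features, noisy linear boundary.
rng = random.Random(0)
data = []
for _ in range(200):
    features = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    target = 1 if features[0] + features[1] + rng.gauss(0, 0.3) > 0 else 0
    data.append((features, target))

# Step 3: random 80% / 20% split into training and test sets.
rng.shuffle(data)
split = len(data) * 8 // 10
train_set, test_set = data[:split], data[split:]

# Step 2: build the model from the training data only.
weights, bias = train(train_set)

# Count (actual, predicted) pairs on the held-out test set.
confusion = {(actual, predicted): 0 for actual in (0, 1) for predicted in (0, 1)}
for features, target in test_set:
    predicted = 1 if predict_proba(weights, bias, features) >= 0.5 else 0
    confusion[(target, predicted)] += 1

accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / len(test_set)
print(confusion)
print(f"Test accuracy: {accuracy:.2f}")
```

Evaluating on the 20% that the model never saw during training is what makes the confusion matrix an honest estimate of how the model would behave on new applicants.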
Example: missing data prediction

Initial data

Outlook    Temp   Humidity   Windy   Play Golf
Rainy      Hot    High       False   No
Rainy      Hot    High       True    No
Overcast   Hot    High       False   Yes
Sunny      Mild   High       False   Yes
Sunny      Cool   Normal     False   Yes
Sunny      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Rainy      Mild   High       False   No
Rainy      Cool   Normal     False   Yes
Sunny      Mild   Normal     False   Yes
Rainy      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Sunny      Mild   High       True    No

Decision tree based decision model

Outlook
- Sunny -> Windy
  - False -> Yes
  - True -> No
- Overcast -> Yes
- Rainy -> Humidity
  - High -> No
  - Normal -> Yes

Example: Customer churn

Customer historical data -> Decision TREE algorithm -> Churn?

Gender   Customer age   Card type   Brand    Sales total (EUR)   Purchase frequency   Purchase No   Churn
Male     37             type1       brand1   62                  1                    123           no
Female   49             type2       brand1   15                  125                  6             no
Female   38             type3       brand3   116                 31                   5             no
Male     64             type4       brand1   12                  4                    8             no
Female   30             type5       brand6   47                  21                   43            no
Female   30             type4       brand1   25                  82                   16            no
Female   47             type2       brand7   31                  97                   3             yes
Male     30             type3       brand2   35                  162                  6             yes
Female   51             type1       brand3   24                  88                   73            no
Female   30             type3       brand2   31                  32                   22            no
Male     42             type4       brand3   57                  279                  3             yes
Female   30             type1       brand1   25                  175                  11            no
Female   30             type3       brand2   54                  5                    40            no
Male     30             type2       brand7   44                  467                  3             yes
...
Female   30             type3       brand1   46                  150                  3             no

Customer churn prediction rules:

    purchace.freq.sdev <= 165:
    :...purchase.no > 7: no
        purchase.no <= 7:
        :...purchace.freq.sdev > 86:
            :...purchase.no > 4:
            :   :...purchace.freq.sdev <= 126:
            :   :   :...purchase.no > 5: no
            :   :   :   purchase.no <= 5:
            :   :   :   :...brand in {brand1,brand2,brand4}: no
            :   :   :       brand = brand3: yes
            :   :   purchace.freq.sdev > 126:
            :   :   :...purchase.no <= 6: yes
            :   :       purchase.no > 6:
            :   :       :...purchace.freq.sdev <= 139: no
            :   :           purchace.freq.sdev > 139: yes
    ...

Actionable insights for enterprise
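A decision tree is just a set of nested conditions, so the golf tree from the missing-data example above can be written directly as a small function and checked against the 14 rows of the initial data. This is a sketch that hard-codes the tree shown on the slide instead of learning it from the data; the function name `predict_play_golf` is chosen for illustration.

```python
def predict_play_golf(outlook, humidity, windy):
    """Apply the slide's decision tree: the root node splits on Outlook."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        # Sunny days are decided by wind.
        return "No" if windy else "Yes"
    if outlook == "Rainy":
        # Rainy days are decided by humidity.
        return "No" if humidity == "High" else "Yes"
    raise ValueError(f"unknown outlook: {outlook!r}")


# Initial data from the slide: (Outlook, Temp, Humidity, Windy, Play Golf).
rows = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]

# Temp is ignored: the tree never splits on it.
correct = sum(
    predict_play_golf(outlook, humidity, windy) == play
    for outlook, _temp, humidity, windy, play in rows
)
print(f"{correct} of {len(rows)} rows reproduced by the tree")
```

The same function fills in a missing "Play Golf" value for a new day, for example `predict_play_golf("Rainy", "Normal", True)` returns "Yes". The churn rules follow the same pattern, only with numeric thresholds in the conditions.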
Outlier analysis
Detect data that is statistically outside normal behavior
[Figure label: Outlier]

Time series analysis
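One simple way to make "statistically outside normal behavior" concrete is a z-score test: flag every value that lies more than a chosen number of standard deviations from the mean. The sketch below is only one of many possible outlier tests, and it is not taken from the slides. The readings are hypothetical, and both the helper name `find_outliers` and the threshold of 2.0 are assumptions made for this example.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mean) / stdev > threshold
    ]


# Hypothetical readings (assumption): one value is far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 50]

outliers = find_outliers(readings)
print(outliers)  # [(7, 50)]
```

Note that a single extreme value also inflates the mean and the standard deviation it is measured against, which is why a threshold of 3.0 would miss the outlier in this small sample. Robust variants based on the median are a common alternative.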
Hidden Markov Chains
Behavioral DATA

Neural-Network
How to select the right algorithm?

Tools for Machine Learning

Traditional tools:
- R
- Matlab
- Python (scikit-learn, mlpy)
- KNIME
- RapidMiner
- SPSS
- Weka
- SAS
- ...

Tools on Hadoop:
- Mahout
- Spark MLlib
- GraphLab
- Vowpal Wabbit
- R
- H2O
- ...

SaaS tools:
- Microsoft Azure cloud
- Datumbox
- BigML
- Google Prediction API
- wise.io
- ...
Where to start?
- Look at the tutorials
- Read some books for the basics
- Participate in online courses (Coursera.org or similar)
- Experiment with tools
- Participate in online competitions (like Kaggle.com)

If you are interested?
Nortal has interesting Big Data and Machine Learning tasks to solve.
Join our team!

Lauri Ilison, PhD
email: lauri.ilison@nortal.com