MACHINE LEARNING Example: Google search


MACHINE LEARNING
Lauri Ilison, PhD, Data Scientist
20.11.2014

Example: Google search

27.11.14

Facebook: 350 million photo uploads every day
The dream is to build full knowledge of the world and know everything that is going on.

Germany's 12th Man at the World Cup: Big Data
The German football team used Big Data and Machine Learning tools to analyze video data from on-field cameras capable of capturing thousands of data points per second, including player position and speed. The team was able to analyze stats about average possession time and cut it down from 3.4 seconds to about 1.1 seconds. That style of play was evident in Germany's 7-1 victory over Brazil, which included three goals scored in a span of 179 seconds.

Spotify
Spotify uses deep learning to create personal music recommendations.

Change in business models: from hardware seller to data company
- A hardware company was selling speakers and audio systems for supermarkets
- Customers asked for music?
- Customers asked for music to be played?
- The company started selecting the right music to increase sales
- Now they are a data company that also sells hardware

Supervised and Unsupervised learning

Machine Learning
- Supervised learning: we have previous knowledge about the sample cases that are the basis for learning
  - Classification
  - Regression
  - Decision Trees
- Unsupervised learning: we do not have any previous knowledge about the sample cases that are the basis for learning
  - Clustering
  - Hidden Markov Chains
  - Dimensionality reduction

How it works - Linear regression?

Example: Linear Regression
TASK: find the price for a 46 m2 apartment

To find the price of a 46 m2 apartment, we find the linear relation of the samples.
1. We assume a linear relation: Price = a * Size + b (y = ax + b)
2. We calculate the distance of each sample from the line
3. We search for the blue line equation with minimal total distance from the samples
4. Knowing the line function, we calculate the price for the 46 m2 apartment

(Plot: Price against Apartment size, with the fitted line y = ax + b; the marked point is 46 m2, 56K.)
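The four steps above can be sketched in a few lines of Python. This is only an illustration: the sample apartments are hypothetical (the slide's data points are not given), and "minimal total distance" is interpreted here as ordinary least squares, that is, minimizing the sum of squared vertical distances between samples and the line.

```python
# Hypothetical samples: (apartment size in m2, price in thousands).
# These values are made up for illustration; they are not from the slides.
samples = [(30, 38.0), (38, 47.0), (52, 63.0), (60, 71.0), (75, 90.0)]


def fit_line(points):
    """Steps 1 and 3: assume y = a * x + b and return the least-squares (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def total_squared_distance(points, a, b):
    """Step 2: sum of squared vertical distances of the samples from the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in points)


a, b = fit_line(samples)
# Step 4: use the fitted line to price a 46 m2 apartment.
price_46 = a * 46 + b
print(f"Price = {a:.3f} * Size + {b:.3f}")
print(f"Estimated price for 46 m2: {price_46:.1f}K")
```

Any other line, for example one with a slightly different slope, gives a larger value of `total_squared_distance` on the same samples, which is what "minimal total distance" means in step 3 under this interpretation.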

Clustering

How it works - Logistic regression?

Example: Bank loan decision
TASK: Find the probability of default for an applicant

Historical loan application data: 3000 samples, 16 factors (parameters P1 to P16), target T (No Default = 0, Default = 1)

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16  T
 3  5  6  7  8  9  4  2  3   4   5   2   4   6   2   2  0
 2  4  5  6  3  2  5  3  2   5   7   3   6   3   7   2  1
 3  7  5  4  2  4  7  6  2   5   2   6   5   4   7   2  0
 3  5  6  7  8  9  4  2  3   4   5   2   4   6   2   2  1
 2  4  5  6  3  2  5  3  2   5   7   3   6   3   7   2  1
 3  7  5  4  2  4  7  6  2   5   2   6   5   4   7   2  1
 3  5  6  7  8  9  4  2  3   4   5   2   4   6   2   2  1
 2  4  5  6  3  2  5  3  2   5   7   3   6   3   7   2  0
 3  7  5  4  2  4  7  6  2   5   2   6   5   4   7   2  1
 3  5  6  7  8  9  4  2  3   4   5   2   4   6   2   2  0
 2  4  5  6  3  2  5  3  2   5   7   3   6   3   7   2  1
 3  7  5  4  2  4  7  6  2   5   2   6   5   4   7   2  1
 3  5  6  7  8  9  4  2  3   4   5   2   4   6   2   2  1
 2  4  5  6  3  2  5  3  2   5   7   3   6   3   7   2  1
 3  7  5  4  2  4  7  6  2   5   2   6   5   4   7   2  0
..
 3  7  5  4  2  4  7  6  2   5   2   6   5   4   7   2  0

To predict the probability of default we use multivariate logistic regression.
1. Logistic function: $f(x) = \frac{1}{1 + e^{-x}}$
2. We create a model based on historical data that predicts default
3. Testing the model: the learning dataset is split randomly into a training set (80%) and a test set (20%)

            Predicted 0      Predicted 1
Actual 0    True positive    False negative
Actual 1    False positive   True negative

(These labels treat class 0, No Default, as the positive class.)
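A minimal sketch of the three steps above (logistic function, model fitting, 80/20 test), written in plain Python. Everything data-related here is an assumption for illustration: the samples are synthetic with 3 factors instead of the 16 on the slide, the model is fitted with simple stochastic gradient descent, and an applicant is classified as default when the predicted probability is at least 0.5.

```python
import math
import random


def logistic(x):
    """Logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_proba(weights, bias, row):
    """Predicted probability of default (target = 1) for one applicant."""
    return logistic(bias + sum(w * v for w, v in zip(weights, row)))


def fit(rows, targets, epochs=200, learning_rate=0.1):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for row, target in zip(rows, targets):
            error = predict_proba(weights, bias, row) - target
            weights = [w - learning_rate * error * v for w, v in zip(weights, row)]
            bias -= learning_rate * error
    return weights, bias


def confusion_counts(actual, predicted):
    """Count test cases per (actual, predicted) pair."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


# Synthetic stand-in for the historical loan data (hypothetical values).
rng = random.Random(0)
rows = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
targets = [1 if r[0] + 0.5 * r[1] + rng.gauss(0, 0.3) > 0 else 0 for r in rows]

# Random split: 80% training set, 20% test set.
indices = list(range(len(rows)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

weights, bias = fit([rows[i] for i in train_idx], [targets[i] for i in train_idx])

test_actual = [targets[i] for i in test_idx]
test_predicted = [1 if predict_proba(weights, bias, rows[i]) >= 0.5 else 0
                  for i in test_idx]
matrix = confusion_counts(test_actual, test_predicted)
print(matrix)
```

The keys of `matrix` are `(actual, predicted)` pairs, so `matrix[(0, 0)]` and `matrix[(1, 1)]` are the correctly classified test cases and the other two entries are the errors.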

Example: missing data prediction

Initial data

Outlook   Temp  Humidity  Windy  Play Golf
Rainy     Hot   High      False  No
Rainy     Hot   High      True   No
Overcast  Hot   High      False  Yes
Sunny     Mild  High      False  Yes
Sunny     Cool  Normal    False  Yes
Sunny     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Rainy     Mild  High      False  No
Rainy     Cool  Normal    False  Yes
Sunny     Mild  Normal    False  Yes
Rainy     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Sunny     Mild  High      True   No

Decision tree based decision model

Outlook
- Sunny: Windy
  - False: Yes
  - True: No
- Overcast: Yes
- Rainy: Humidity
  - High: No
  - Normal: Yes

Example: Customer churn
Churn?

Customer historical data

Gender  Customer age  Card type  Brand   Sales total in eur  Purchase frequency  Purchase No  Churn
Male    37            type1      brand1  62                  1                   123          no
Female  49            type2      brand1  15                  125                 6            no
Female  38            type3      brand3  116                 31                  5            no
Male    64            type4      brand1  12                  4                   8            no
Female  30            type5      brand6  47                  21                  43           no
Female  30            type4      brand1  25                  82                  16           no
Female  47            type2      brand7  31                  97                  3            yes
Male    30            type3      brand2  35                  162                 6            yes
Female  51            type1      brand3  24                  88                  73           no
Female  30            type3      brand2  31                  32                  22           no
Male    42            type4      brand3  57                  279                 3            yes
Female  30            type1      brand1  25                  175                 11           no
Female  30            type3      brand2  54                  5                   40           no
Male    30            type2      brand7  44                  467                 3            yes
...
Female  30            type3      brand1  46                  150                 3            no

Decision TREE algorithm

Customer Churn prediction rules.

purchace.freq.sdev <= 165:
:...purchase.no > 7: no
    purchase.no <= 7:
    :...purchace.freq.sdev > 86:
        :...purchase.no > 4:
        :   :...purchace.freq.sdev <= 126:
        :   :   :...purchase.no > 5: no
        :   :   :   purchase.no <= 5:
        :   :   :   :...brand in {brand1,brand2,brand4}: no
        :   :   :       brand = brand3: yes
        :   :   purchace.freq.sdev > 126:
        :   :   :...purchase.no <= 6: yes
        :   :       purchase.no > 6:
        :   :       :...purchace.freq.sdev <= 139: no
        :   :           purchace.freq.sdev > 139: yes

Actionable insights for enterprise
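As a small illustration of how a decision tree turns into a decision model, the "Play Golf" tree above can be written as nested conditions and checked against the 14 rows of the initial data. The function name `play_golf` and the tuple layout are choices made for this sketch; the tree itself and the rows are taken from the slide. Note that the tree never looks at Temp.

```python
# The 14 rows of the "Play Golf" table: (Outlook, Temp, Humidity, Windy, Play Golf).
ROWS = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]


def play_golf(outlook, humidity, windy):
    """Follow the tree: split on Outlook first, then on Windy or Humidity."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "No" if windy else "Yes"
    if outlook == "Rainy":
        return "No" if humidity == "High" else "Yes"
    raise ValueError(f"unknown outlook: {outlook!r}")


# Compare the tree's prediction with the recorded decision for every row.
correct = sum(
    play_golf(outlook, humidity, windy) == label
    for outlook, _temp, humidity, windy, label in ROWS
)
print(f"{correct} of {len(ROWS)} rows reproduced by the tree")
```

The same function can then fill in a missing "Play Golf" value for a new day, for example `play_golf("Rainy", "Normal", True)`.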

Outlier analysis
Detect data that is statistically outside normal behavior.
(Figure: data series with one point labelled "Outlier".)

Time series analysis
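The outlier analysis slide does not name a specific method. One common and simple way to flag data that is "statistically out of normal behavior" is a z-score test, sketched below as an assumption rather than as the method used in the talk. The series values and the threshold of 2.5 standard deviations are hypothetical.

```python
import statistics


def find_outliers(values, threshold=2.5):
    """Return (index, value) pairs lying more than `threshold` standard
    deviations away from the mean of the series."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]


# Hypothetical measurements with one obvious spike.
series = [10, 11, 9, 10, 12, 10, 11, 9, 10, 45, 10, 11]
print(find_outliers(series))  # → [(9, 45)]
```

For short series the largest possible z-score is limited by the number of points, so the threshold usually needs tuning to the data at hand.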

Hidden Markov Chains

Behavioral DATA

Neural-Network

How to select the right algorithm?

Tools for Machine Learning

Traditional tools:
- R
- MATLAB
- Python (scikit-learn, mlpy)
- KNIME
- RapidMiner
- SPSS
- Weka
- SAS
- ...

Tools on Hadoop:
- Mahout
- Spark MLlib
- GraphLab
- Vowpal Wabbit
- R
- H2O
- ...

SaaS tools:
- Microsoft Azure cloud
- Datumbox
- BigML
- Google Prediction API
- wise.io
- ...

Where to start?
- Look at the tutorials
- Read some books for the basics
- Participate in online courses (Coursera.org or similar)
- Experiment with tools
- Participate in online competitions (like Kaggle.com)

If you are interested?
Nortal has interesting Big Data and Machine Learning tasks to solve. Join our team!

Lauri Ilison, PhD
email: lauri.ilison@nortal.com