CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8-A.

Similar documents
CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

AMAZON.COM RECOMMENDATIONS ITEM-TO-ITEM COLLABORATIVE FILTERING PAPER BY GREG LINDEN, BRENT SMITH, AND JEREMY YORK

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Logistic Regression and Gradient Ascent

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry

Linear Regression Optimization

Network Traffic Measurements and Analysis

Louis Fourrier Fabien Gaie Thomas Rolf

The exam is closed book, closed notes except your one-page cheat sheet.

CS294-1 Assignment 2 Report

Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CS145: INTRODUCTION TO DATA MINING

HMC CS 158, Fall 2017 Problem Set 3 Programming: Regularized Polynomial Regression

Introduction to Automated Text Analysis. bit.ly/poir599

Evaluating Classifiers

3 Types of Gradient Descent Algorithms for Small & Large Data Sets

model order p weights The solution to this optimization problem is obtained by solving the linear system

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

CSC 4510 Machine Learning

CS249: ADVANCED DATA MINING

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

A Brief Look at Optimization

ECS289: Scalable Machine Learning

Understanding Andrew Ng's Machine Learning Course Notes and codes (Matlab version)

Linear Regression & Gradient Descent

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Today. Gradient descent for minimization of functions of real variables. Multi-dimensional scaling. Self-organizing maps

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Evaluating Classifiers

CS4491/CS 7265 BIG DATA ANALYTICS

Programming Exercise 1: Linear Regression

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

Rapid growth of massive datasets

Fast or furious? - User analysis of SF Express Inc

Crossing the AI Chasm. What We Learned from building Apache PredictionIO (incubating)

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies.

CCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems. Leigh M. Smith Humtap Inc.

Linear Classification and Perceptron

Lecture on Modeling Tools for Clustering & Regression

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

ISyE 6416 Basic Statistical Methods - Spring 2016 Bonus Project: Big Data Analytics Final Report. Team Member Names: Xi Yang, Yi Wen, Xue Zhang

Scaled Machine Learning at Matroid

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

Artificial Neural Networks (Feedforward Nets)

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

CSE 158. Web Mining and Recommender Systems. Midterm recap

CS381V Experiment Presentation. Chun-Chen Kuo

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

General Instructions. Questions

Notes on Multilayer, Feedforward Neural Networks

Perceptron: This is convolution!

10-701/15-781, Fall 2006, Final

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

Machine Learning Classifiers and Boosting

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

Weka ( )

CPSC 340: Machine Learning and Data Mining. Multi-Class Classification Fall 2017

REGRESSION ANALYSIS : LINEAR BY MAUAJAMA FIRDAUS & TULIKA SAHA

Perceptrons and Backpropagation. Fabio Zachert Cognitive Modelling WiSe 2014/15

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CS570: Introduction to Data Mining

Matrix Computations and Neural Networks in Spark

Non-Linearity of Scorecard Log-Odds

CSE446: Linear Regression. Spring 2017

CSC 411: Lecture 02: Linear Regression

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

Bayesian Personalized Ranking for Las Vegas Restaurant Recommendation

Optimization Plugin for RapidMiner. Venkatesh Umaashankar Sangkyun Lee. Technical Report 04/2012. technische universität dortmund

scikit-learn (Machine Learning in Python)

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018

Online Algorithm Comparison points

Machine Learning and Bioinformatics 機器學習與生物資訊學

Apache SystemML Declarative Machine Learning

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Using Existing Numerical Libraries on Spark

Lasso Regression: Regularization for feature selection

Tutorials Case studies

Machine Learning Basics. Sargur N. Srihari

CPSC 340: Machine Learning and Data Mining. Logistic Regression Fall 2016

S2 Text. Instructions to replicate classification results.

Machine Learning in Python. Rohith Mohan GradQuant Spring 2018

Lecture 25: Review I

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

[POLS 8500] Stochastic Gradient Descent, Linear Model Selection and Regularization

Non-trivial extraction of implicit, previously unknown and potentially useful information from data

Hyperparameter optimization. CS6787 Lecture 6 Fall 2017

Bayes Risk. Classifiers for Recognition Reading: Chapter 22 (skip 22.3) Discriminative vs Generative Models. Loss functions in classifiers

EE 511 Linear Regression

Transcription:

CS535 Big Data - Fall 2017 Week 8-A-1
CS535 BIG DATA
FAQs
- Term project proposal: new deadline is tomorrow
- PA1 demo
- No class on Thursday (10/12); an extra class will be added in the week of the term project final presentations
- PA2 has been posted (11/6)
PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS
5. ADVANCED DATA ANALYTICS WITH APACHE SPARK
Computer Science, Colorado State University

CS535 Big Data - Fall 2017 Week 8-A-2 / 8-A-3
Today's topics
- Advanced Data Analytics with Apache Spark
  - Evaluation methodologies
  - Item-to-item collaborative filtering
  - Linear regression models
5. Advanced Data Analytics with Apache Spark: Recommending Music and the Audioscrobbler Dataset
- Evaluating the Recommendation Model

CS535 Big Data - Fall 2017 Week 8-A-4
What is a good recommendation?
- A popular artist?
- Artists the user has listened to?
- Artists the user will listen to?

CS535 Big Data - Fall 2017 Week 8-A-5
Preparing data for evaluation
- To perform a meaningful evaluation, some of the artist play data can be set aside and hidden from the ALS model-building process
- The held-out data can be used as a collection of good recommendations for each user, against which the recommender's score is computed
- One part of the data is used for building the model, the other for testing it
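As a concrete sketch of this data-preparation step: the evaluation code on slide 8-A-9 below calls a helper, buildRatings, that is not shown on the slides. A minimal version, assuming Audioscrobbler input lines of the form "userID artistID playCount" and a broadcast artist-alias map (as in the Advanced Analytics with Spark example this lecture follows), might look like:

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Hypothetical helper: turn raw "user artist count" lines into Rating objects,
// replacing misspelled artist IDs with their canonical IDs where an alias exists.
def buildRatings(
    rawUserArtistData: RDD[String],
    bArtistAlias: Broadcast[Map[Int, Int]]): RDD[Rating] = {
  rawUserArtistData.map { line =>
    val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
    val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
    Rating(userID, finalArtistID, count)
  }
}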

CS535 Big Data - Fall 2017 Week 8-A-6
AUC metric
- Based on the rank used to decide final recommendations
- Receiver Operating Characteristic (ROC) curve
- The Area Under the Curve (AUC) of the ROC may be used as the probability that a randomly chosen good recommendation ranks above a randomly chosen bad recommendation
- 1.0 is perfect, 0.0 is the worst

CS535 Big Data - Fall 2017 Week 8-A-7
AUC metric (continued)
- Spark's BinaryClassificationMetrics generates the ROC and Precision-Recall curves
- Threshold tuning: the model might need to be tuned so that it only predicts a class when the probability is very high (e.g., only block a credit card transaction if the model predicts fraud with >90% probability)
- A P-R curve plots (precision, recall) points for different threshold values, while a ROC curve plots (true positive rate, false positive rate) points
- The mean AUC is obtained by computing the AUC per user and averaging the results

CS535 Big Data - Fall 2017 Week 8-A-8
Computing AUC
- 90% of the data is used for training and the remaining 10% for validation

CS535 Big Data - Fall 2017 Week 8-A-9
Computing AUC (continued)

import org.apache.spark.rdd._

def areaUnderCurve(
    positiveData: RDD[Rating],
    bAllItemIDs: Broadcast[Array[Int]],
    predictFunction: (RDD[(Int, Int)] => RDD[Rating])) = {
  ...
}

val allData = buildRatings(rawUserArtistData, bArtistAlias)
val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))
trainData.cache()
cvData.cache()

val allItemIDs = allData.map(_.product).distinct().collect()
val bAllItemIDs = sc.broadcast(allItemIDs)

val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
val auc = areaUnderCurve(cvData, bAllItemIDs, model.predict)

CS535 Big Data - Fall 2017 Week 8-A-10
k-fold Cross-validation
- Create a k-fold partition of the dataset
- For each of the k experiments, use k-1 folds for training and the remaining fold for testing
[Figure: the total set of examples is divided into folds; in each of experiments 1 through 4 a different fold serves as the test set]

CS535 Big Data - Fall 2017 Week 8-A-11
True error estimate
- k-fold cross-validation is similar to random subsampling
- The advantage of k-fold cross-validation: all the examples in the dataset are eventually used for both training and testing
- The true error is estimated as the average error rate: E = (1/K) Σ_{i=1}^{K} E_i
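Slide 8-A-7 above mentions Spark's BinaryClassificationMetrics without showing its use. A minimal sketch (the score/label pairs below are toy placeholders, not output of the lecture's recommender; sc is the SparkContext):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// (predicted score, true label 0.0/1.0) pairs; toy values for illustration only
val scoreAndLabels = sc.parallelize(Seq(
  (0.9, 1.0), (0.8, 1.0), (0.4, 0.0), (0.2, 0.0)
))

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()              // area under the ROC curve
val auPR  = metrics.areaUnderPR()               // area under the precision-recall curve

// Curve points, useful for threshold tuning
val roc = metrics.roc()                         // (false positive rate, true positive rate) points
val precision = metrics.precisionByThreshold()  // (threshold, precision) points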

CS535 Big Data - Fall 2017 Week 8-A-12
ML Tuning: model selection and hyperparameter tuning with CrossValidator

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice

// Run cross-validation to tune the model hyperparameters, and choose the best set of parameters.
val cvModel = cv.fit(training)

// Full code is available at: https://spark.apache.org/docs/2.1.0/ml-tuning.html

CS535 Big Data - Fall 2017 Week 8-A-13
Hyperparameter selection: MatrixFactorizationModel via ALS.trainImplicit()
- rank = 10: the number of latent factors in the model (the number of columns, k)
- iterations = 5: the number of iterations that the factorization runs
- lambda = 0.1: a standard overfitting parameter; a higher value guards against overfitting, but values that are too high will decrease the factorization's accuracy
- alpha = 1.0: controls the relative weight of observed versus unobserved user-product interactions in the factorization

CS535 Big Data - Fall 2017 Week 8-A-14 / 8-A-15
5. Advanced Data Analytics with Apache Spark
CASE STUDY: Recommendation Systems - Amazon.com's item-to-item collaborative filtering
This material is based on: Greg Linden, Brent Smith, and Jeremy York, "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," IEEE Internet Computing, 2003.

CS535 Big Data - Fall 2017 Week 8-A-16 / 8-A-17
Item-to-item collaborative filtering
- Amazon.com uses recommendations as a targeted marketing tool, in email campaigns and on most of its web pages
- It does NOT match the user to similar customers
- Instead, it matches each of the user's purchased and rated items to similar items, then combines those similar items into a recommendation list
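The CrossValidator snippet on slide 8-A-12 assumes a pipeline and a paramGrid defined earlier. A minimal sketch along the lines of the Spark 2.1 ML tuning guide linked above (the text-classification stages and grid values are illustrative, not the lecture's):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.ParamGridBuilder

// Illustrative pipeline: tokenize text, hash tokens to term-frequency features, fit logistic regression
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Grid of hyperparameter values; CrossValidator evaluates every combination with k-fold CV
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()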

CS535 Big Data - Fall 2017 Week 8-A-18 / 8-A-19
Determining the most-similar match
- The algorithm builds a similar-items table by finding items that customers tend to purchase together
- How about building a product-to-product matrix by iterating through all item pairs and computing a similarity metric for each pair? Many product pairs have no common customer (if you already bought a TV today, will you buy another TV again today?)
- Instead, calculate the similarity between a single product and all related products:

For each item I1 in the product catalog
  For each customer C who purchased I1
    For each item I2 purchased by customer C
      Record that a customer purchased I1 and I2
  For each item I2
    Compute the similarity between I1 and I2

CS535 Big Data - Fall 2017 Week 8-A-20
Computing similarity
- Uses the cosine measure
- Each vector corresponds to an item rather than a customer
- Its M dimensions correspond to the customers who have purchased that item

CS535 Big Data - Fall 2017 Week 8-A-21
Creating a similar-items table
- Building the similar-items table is extremely compute intensive, so it is done offline
- O(N^2 M) in the worst case, where N is the number of items and M is the number of users
- The average case is closer to O(NM), because most customers have very few purchases
- Sampling customers who purchase best-selling titles reduces the runtime even more, with little reduction in quality

CS535 Big Data - Fall 2017 Week 8-A-22
Scalability
- Amazon.com has around 110 million active customers (244 million total customers) and several million catalog items
- Traditional collaborative filtering does little or no offline computation
- Its online computation scales with the number of customers and catalog items

CS535 Big Data - Fall 2017 Week 8-A-23
Key scalability strategy for Amazon recommendations
- Create the expensive similar-items table offline
- The online component only looks up similar items for the user's purchases and ratings
- It scales independently of the catalog size and the total number of customers; it depends only on how many titles the user has purchased or rated
(Customer figures: http://www.fool.com/investing/general/2014/05/24/how-many-customers-does-amazon-have.aspx)
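Slide 8-A-20 names the cosine measure without giving the formula: for item vectors A and B, similarity(A, B) = (A · B) / (||A|| ||B||). A small self-contained sketch in plain Scala, using toy purchase vectors rather than real Amazon data:

// Cosine similarity between two item vectors, where each dimension is a customer
// and the value is 1.0 if that customer purchased the item, 0.0 otherwise.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

val itemA = Array(1.0, 1.0, 0.0, 1.0)     // purchased by customers 1, 2, and 4
val itemB = Array(1.0, 0.0, 0.0, 1.0)     // purchased by customers 1 and 4
val sim = cosineSimilarity(itemA, itemB)  // approximately 0.816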

CS535 Big Data - Fall 2017 Week 8-A-24
Recommendation quality
- The algorithm recommends highly correlated similar items, so recommendation quality is excellent
- The algorithm performs well even with limited user data

CS535 Big Data - Fall 2017 Week 8-A-25
5. Advanced Data Analytics with Apache Spark
CASE STUDY: Linear Models

CS535 Big Data - Fall 2017 Week 8-A-26 / 8-A-27
Spark's linear models
- Classification: linear Support Vector Machines (SVMs), logistic regression
- Regression: linear least squares, Lasso, and ridge regression
- Implementations: SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionWithSGD, LinearRegressionWithSGD, RidgeRegressionWithSGD, LassoWithSGD

CS535 Big Data - Fall 2017 Week 8-A-28
The linear regression model
- The structure is exactly the same as for the linear discriminant function:
  f(x) = w_0 + w_1 x_1 + w_2 x_2 + ...

CS535 Big Data - Fall 2017 Week 8-A-29
Large-scale data analysis using Spark - CASE STUDY: fitting a linear regression model to a large dataset
- How big is the error of the fitted model? We would like to minimize this error.

  Student:       A   B   C   D   E   F   G   H
  Math score:    85  85  62  67  89  65  80  67
  Physics score: 89  92  70  95  95  80  75  80

- The model that fits the data best is the model with the minimum sum of errors on the training data (e.g., the sum or mean of the squares of the errors): least squares regression
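As a rough sketch of how such a model could be fit with one of the MLlib implementations listed on slide 8-A-27, the (Math, Physics) pairs from slide 8-A-29 can be fed to LinearRegressionWithSGD (the iteration count and step size below are illustrative values, not ones given in the lecture, and this basic call fits no intercept; sc is the SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Math score as the single feature, Physics score as the label
val points = Seq(
  (85, 89), (85, 92), (62, 70), (67, 95),
  (89, 95), (65, 80), (80, 75), (67, 80)
).map { case (math, physics) =>
  LabeledPoint(physics.toDouble, Vectors.dense(math.toDouble))
}
val data = sc.parallelize(points).cache()

// Train by stochastic gradient descent; in practice the feature would be scaled first
val model = LinearRegressionWithSGD.train(data, numIterations = 100, stepSize = 0.0001)

// Predict a Physics score for a Math score of 75
val predicted = model.predict(Vectors.dense(75.0))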

CS535 Big Data - Fall 2017 Week 8-A-30
Squared error
- Strongly penalizes very large errors
- Drawback: it is very sensitive to the data; erroneous or outlying data points can severely skew the resultant linear function

CS535 Big Data - Fall 2017 Week 8-A-31
Root Mean Squared Error
- Measures the differences between the values predicted by the model (estimator) and the values actually observed
- RMSD = sqrt( (1/n) Σ_{i=1}^{n} (h_θ(x^(i)) - y^(i))^2 )
- We should choose the objective function to optimize

CS535 Big Data - Fall 2017 Week 8-A-32
Linear Regression
- Calculating a linear regression using the least squares criterion

  Student:       A   B   C   D   E   F   G   H
  Math score:    85  85  62  67  89  65  80  67
  Physics score: 89  92  70  95  95  80  75  80

CS535 Big Data - Fall 2017 Week 8-A-33
Linear Regression (continued)
- h(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_3 + ...
- To simplify the notation, h(x) = Σ_{i=0}^{n} θ_i x_i (with x_0 = 1)
- For this example, h_θ(x) = θ_0 + θ_1 x_1, where the θs are the parameters (the intercept and the slope on the Math score)

CS535 Big Data - Fall 2017 Week 8-A-34
Objective function (cost function)
- For a given training set, how do we pick, or learn, the parameters θ?
- Make h(x) close to y: make your prediction close to the real observation
- We define the objective (cost) function: J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))^2

CS535 Big Data - Fall 2017 Week 8-A-35
Minimization problem
- We have a function J(θ_0, θ_1) and we want to find min over (θ_0, θ_1) of J(θ_0, θ_1)
- Goal: find the parameters that minimize the cost (the output of the objective function)
- Outline of our approach: start with some (θ_0, θ_1), and keep changing (θ_0, θ_1) to reduce J(θ_0, θ_1) until we end up at a minimum
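A small sketch of evaluating the cost J(θ) and the RMSE from slides 8-A-31 and 8-A-34 on the Math/Physics data, in plain Scala (the values of θ_0 and θ_1 are chosen only for illustration):

// (math score, physics score) pairs from slide 8-A-32
val data = Seq(
  (85.0, 89.0), (85.0, 92.0), (62.0, 70.0), (67.0, 95.0),
  (89.0, 95.0), (65.0, 80.0), (80.0, 75.0), (67.0, 80.0)
)

// Hypothesis h_theta(x) = theta0 + theta1 * x with illustrative parameters
val (theta0, theta1) = (30.0, 0.7)
def h(x: Double): Double = theta0 + theta1 * x

// Cost J(theta) = 1/2 * sum of squared errors, and RMSE over the same data
val squaredErrors = data.map { case (x, y) => math.pow(h(x) - y, 2) }
val cost = 0.5 * squaredErrors.sum
val rmse = math.sqrt(squaredErrors.sum / data.size)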

CS535 Big Data - Fall 2017 Week 8-A-36 / 8-A-37
Concept of the gradient descent algorithm (1/2, 2/2)
[Figures: surface plots of the cost J(θ_0, θ_1) over the θ_0 and θ_1 axes]

CS535 Big Data - Fall 2017 Week 8-A-38
Stochastic gradient descent algorithm
- Repeat until convergence {
    θ_j := θ_j - α ∂/∂θ_j J(θ_0, θ_1)
  }
- Simultaneous update (for j = 0 and j = 1): your implementation should perform the simultaneous update (see slide 41)

CS535 Big Data - Fall 2017 Week 8-A-39
Simultaneous update
- Correct (simultaneous update):
  temp0 := θ_0 - α ∂/∂θ_0 J(θ_0, θ_1)
  temp1 := θ_1 - α ∂/∂θ_1 J(θ_0, θ_1)
  θ_0 := temp0
  θ_1 := temp1
- Incorrect:
  temp0 := θ_0 - α ∂/∂θ_0 J(θ_0, θ_1)
  θ_0 := temp0
  temp1 := θ_1 - α ∂/∂θ_1 J(θ_0, θ_1)
  θ_1 := temp1

CS535 Big Data - Fall 2017 Week 8-A-40 / 8-A-41
[Figures: gradient descent on a one-parameter cost J(θ_1) with the update θ_1 := θ_1 - α (d/dθ_1) J(θ_1). Where the slope is positive, the update decreases θ_1; where the slope is negative, the update increases θ_1.]

CS535 Big Data - Fall 2017 Week 8-A-42
Learning rates
- If α is too small, gradient descent can be slow
- If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge

CS535 Big Data - Fall 2017 Week 8-A-43
Learning rates: starting at the optimal value?
- If the current value of θ_1 has already converged (slope = 0), the update leaves it unchanged; no more iterations are required

CS535 Big Data - Fall 2017 Week 8-A-44
Fixed learning rate α
- Gradient descent can converge to a local minimum even with a fixed learning rate
- As we approach a local minimum, the derivative shrinks, so gradient descent automatically takes smaller steps
- There is no need to decrease α over time
- θ_1 := θ_1 - α (d/dθ_1) J(θ_1)
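A minimal sketch of batch gradient descent with the simultaneous update from slide 8-A-39, applied to the Math/Physics example in plain Scala (the learning rate, iteration count, and the mean-centering of the feature are illustrative choices, not values given in the lecture):

// (math score, physics score) pairs from slide 8-A-32
val data = Seq(
  (85.0, 89.0), (85.0, 92.0), (62.0, 70.0), (67.0, 95.0),
  (89.0, 95.0), (65.0, 80.0), (80.0, 75.0), (67.0, 80.0)
)
val m = data.size
val meanX = data.map(_._1).sum / m
// Mean-center the feature so that a fixed learning rate converges quickly
val centered = data.map { case (x, y) => (x - meanX, y) }

var theta0 = 0.0
var theta1 = 0.0
val alpha = 0.001                      // illustrative fixed learning rate
for (_ <- 1 to 5000) {
  // Partial derivatives of J(theta0, theta1) = 1/2 * sum of (h_theta(x) - y)^2
  val grad0 = centered.map { case (x, y) => (theta0 + theta1 * x) - y }.sum
  val grad1 = centered.map { case (x, y) => ((theta0 + theta1 * x) - y) * x }.sum
  // Simultaneous update: compute both new values before assigning either
  val temp0 = theta0 - alpha * grad0
  val temp1 = theta1 - alpha * grad1
  theta0 = temp0
  theta1 = temp1
}
// Predicted physics score for a math score of 75
val prediction = theta0 + theta1 * (75.0 - meanX)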

CS535 Big Data - Fall 2017 Week 8-A-42 CS535 Big Data - Fall 2017 Week 8-A-43 Learning rates Learning rates: starting at the optimal value? If α is too small, gradient descent can be slow J( Current value of has converged already No more iteration required J( Slope = 0 If α is too large, gradient descent can overshoot the minimum It may fail to converge Or even diverge J( J( CS535 Big Data - Fall 2017 Week 8-A-44 Fixed learning rate α Gradient descent can converge to a local minimum, even with a fixed learning rate As we approach a local minimum, gradient descent will automatically take smaller steps No need to decrease α over time := α d d J( 8