COSC6376 Cloud Computing Homework 1 Tutorial


Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Outline
- Homework 1
- Tutorial based on the Netflix dataset

Homework 1: K-means Clustering of Amazon Reviews
- Create related product items based on the Amazon review ratings
- Understand the K-means and canopy clustering algorithms and their relationship
- Implement these algorithms using Apache Spark
- Analyze the effect of running these algorithms on a large data set using Amazon Cloud

Tutorial Based on the Netflix Dataset
- K-means example using the Netflix dataset
- Rating dataset similar to Amazon reviews
- Amazon datasets: productid, userid, rating, timestamp, plus other metadata fields and review texts
- Netflix dataset: movieid, userid, rating, timestamp

Netflix Prize
- Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies
- Netflix's internal movie rating predictor, Cinematch, was used for recommending movies
- $1,000,000 award to those who could improve the prediction by 10% (in terms of root mean squared error)
- Winner: BellKor's Pragmatic Chaos
- Another team, The Ensemble, achieved equally good results but submitted 20 minutes later

Competition Cancelled
- Researchers demonstrated that individuals can be identified by matching the Netflix data set with film ratings posted online
- Netflix users filed a class action lawsuit against Netflix for privacy violations
- Video Privacy Protection Act of 1988


Movie Dataset
The data is in the format UserID::MovieID::Rating::Timestamp

    1::1193::5::978300760
    2::1194::4::978300762
    7::1123::1::978300760
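A minimal parsing sketch in plain Python (the helper name and sample line are illustrative, not part of the assignment):

    # Split one "UserID::MovieID::Rating::Timestamp" record into typed fields.
    def parse_rating(line):
        user_id, movie_id, rating, timestamp = line.strip().split("::")
        return int(user_id), int(movie_id), int(rating), int(timestamp)

    print(parse_rating("1::1193::5::978300760"))   # -> (1, 1193, 5, 978300760)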

K-means Clustering
Clustering problem description:

    iterate {
        compute the distance from all points to all k centers
        assign each point to the nearest k-center
        compute the average of all points assigned to each k-center
        replace the k-centers with the new averages
    }

Good survey: A. K. Jain et al., Data Clustering: A Review, ACM Computing Surveys, 1999
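To make the loop concrete, here is a small self-contained K-means sketch in plain Python with Euclidean distance (the function and variable names are assumptions, not the required solution):

    import math, random

    def kmeans(points, k, max_iters=100, tol=1e-6):
        # points: a list of equal-length numeric tuples
        centroids = random.sample(points, k)
        for _ in range(max_iters):
            # Membership: assign each point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[j].append(p)
            # Update: replace each centroid with the mean of its members.
            new_centroids = [
                tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                for i, c in enumerate(clusters)
            ]
            # Stop when the centroids move only very slightly.
            if all(math.dist(a, b) < tol for a, b in zip(centroids, new_centroids)):
                break
            centroids = new_centroids
        return centroids, clusters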

K-means Illustration
- Randomly select k centroids
- Assign a cluster label to each point according to its distance to the centroids

K-means Illustration
- Recalculate the centroids
- Re-cluster the points
- Repeat until the cluster labels do not change, or the centroids move only very slightly

Summary of K-means
- Determine the value of k
- Determine the initial k centroids
- Repeat until convergence:
  - Determine membership: assign each point to the closest centroid
  - Update centroid positions: compute the average of the assigned members

The Setting
- The dataset is stored in HDFS
- We use a MapReduce implementation of K-means to get the clustering result
- Each iteration is implemented as one MapReduce job
- The k centroids are passed to the Maps
- Map: assign a label to each record according to its distances to the k centroids, emitting <cluster id, record>
- Reduce: calculate the mean for each cluster, and replace the centroid with the new mean
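A sketch of one such iteration in PySpark (the input path, k, and parsing are assumptions; this mirrors the Map and Reduce steps above rather than being the required solution):

    import math
    from pyspark import SparkContext

    sc = SparkContext(appName="KMeansIteration")

    def closest(point, centroids):
        # Map step: label a record with the index of its nearest centroid.
        return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

    points = (sc.textFile("hdfs:///data/points.txt")            # assumed path
                .map(lambda line: tuple(float(x) for x in line.split(","))))

    centroids = points.takeSample(False, 3)                     # k = 3, for example
    bcast = sc.broadcast(centroids)                             # pass the centroids to the maps

    # Map: emit <cluster id, (record, 1)>; Reduce: average per cluster id.
    new_centroids = (points
        .map(lambda p: (closest(p, bcast.value), (p, 1)))
        .reduceByKey(lambda a, b: (tuple(x + y for x, y in zip(a[0], b[0])), a[1] + b[1]))
        .mapValues(lambda s: tuple(x / s[1] for x in s[0]))
        .collectAsMap())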

Complexity
The complexity is quite high: O(k * n * cost(distance metric) * iterations).
Moreover, it can be necessary to send large amounts of data to each mapper node. Depending on the available bandwidth and memory, this can be prohibitive.

Furthermore
There are three big ways a data set can be large:
- There are a large number of elements in the set
- Each element can have many features
- There can be many clusters to discover
Conclusion: clustering can be huge, even when you distribute it.

Canopy Clustering
- A preliminary step to help parallelize the computation
- Clusters the data into overlapping canopies using a very cheap distance metric
- Efficient
- Accurate

Canopy Clustering

    while there are unmarked points {
        pick a point which is not strongly marked; call it a canopy center
        mark all points within some threshold of it as in its canopy
        strongly mark all points within some stronger (tighter) threshold
    }
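A compact version of this loop in plain Python (the names, and the exact bookkeeping of "marked" vs. "strongly marked", are my assumptions):

    def canopy_clustering(points, dist, t1, t2):
        # t1 is the loose threshold (canopy membership); t2 < t1 is the
        # tight threshold that strongly marks points near a center.
        remaining = set(range(len(points)))        # not yet strongly marked
        canopies = []                              # (center index, member indices)
        while remaining:
            c = remaining.pop()                    # pick an unmarked point as center
            members = [i for i in range(len(points))
                       if dist(points[i], points[c]) < t1]
            canopies.append((c, members))
            # Strongly mark: remove points within the tighter threshold.
            remaining -= {i for i in members if dist(points[i], points[c]) < t2}
        return canopies

Canopies may overlap: a point can appear in several member lists, but it stops being a candidate center once it falls within t2 of some center.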

After the Canopy Clustering
- Run K-means clustering as usual
- Treat objects in separate canopies as being at infinite distance
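One way to realize the "infinite distance" rule is to skip any centroid that shares no canopy with the point, as in this hypothetical assignment helper (p_canopies and centroid_canopies[j] are assumed to be sets of canopy ids):

    def nearest_centroid(p, p_canopies, centroids, centroid_canopies, dist):
        # Centroids in disjoint canopies are treated as infinitely far away.
        best, best_d = None, float("inf")
        for j, c in enumerate(centroids):
            if p_canopies & centroid_canopies[j]:  # share at least one canopy
                d = dist(p, c)
                if d < best_d:
                    best, best_d = j, d
        return best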

MapReduce Implementation: The Problem
Efficiently partition a large data set (say, movies with user ratings!) into a fixed number of clusters using canopy clustering, K-means clustering, and a distance measure (rater overlap for canopies and cosine similarity for K-means, as defined below).
The distance metrics:
- The canopy metric ($: cheap)
- The K-means metric ($$$: expensive)

Steps
1. Get the data into a form you can use (MR)
2. Pick canopy centers (MR)
3. Assign data points to canopies (MR)
4. Pick K-means cluster centers
5. Run the K-means algorithm (MR)
6. Iterate!

Canopy Distance Function
- Canopy selection requires a simple distance function
- Here: the number of rater IDs in common
- Close and far distance thresholds:
  - Close distance threshold: 8 rater IDs in common
  - Far distance threshold: 2 rater IDs in common
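In code, this cheap metric is just a set intersection (the thresholds come from the slide; the function itself is an illustrative assumption):

    # Cheap canopy "distance": the number of users who rated both movies.
    # Note this is a similarity: larger values mean the movies are closer.
    def raters_in_common(movie_a_raters, movie_b_raters):
        return len(set(movie_a_raters) & set(movie_b_raters))

    T_CLOSE = 8   # >= 8 raters in common: within the close threshold
    T_FAR   = 2   # >= 2 raters in common: within the far threshold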

K-means Distance Metric
The set of ratings for a movie given by a set of users can be thought of as a vector:

    A = [user1_score, user2_score, ..., usern_score]

To evaluate the distance between two movies A and B, use the similarity metric below:

    Similarity(A, B) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))

where the sums run over all components A_i and B_i for 0 <= i < n (this is cosine similarity).

Example
Three vectors:

    Vector(A) = 1111000
    Vector(B) = 0100111
    Vector(C) = 1110010

Similarity between A and B:

    Similarity(A, B) = Vector(A) . Vector(B) / (|A| * |B|)
    Vector(A) . Vector(B) = 1
    |A| * |B| = 2 * 2 = 4
    Similarity(A, B) = 1/4 = 0.25
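The arithmetic above can be checked with a few lines of Python (a sketch; nothing here is assignment-specific):

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(y * y for y in b) ** 0.5
        return dot / (norm_a * norm_b)

    A = [1, 1, 1, 1, 0, 0, 0]
    B = [0, 1, 0, 0, 1, 1, 1]
    print(cosine_similarity(A, B))   # 1 / (2 * 2) = 0.25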

Data Massaging
Convert the data into the required format. In this case, the data is converted into <MovieId, List of Users> pairs, i.e. <MovieId, List<userId, rating>>.
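In PySpark this conversion might look like the following (the input path is an assumption):

    from pyspark import SparkContext

    sc = SparkContext(appName="DataMassaging")

    # "UserID::MovieID::Rating::Timestamp" -> <MovieId, List<(userId, rating)>>
    ratings = (sc.textFile("hdfs:///data/ratings.txt")
                 .map(lambda line: line.split("::"))
                 .map(lambda f: (int(f[1]), (int(f[0]), int(f[2]))))
                 .groupByKey()
                 .mapValues(list))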

Canopy Center Selection with MapReduce (illustrated)
- Each mapper picks canopy centers from its own partition using the threshold value (Mapper A's center shown in red, Mapper B's in green)
- The reducer merges the candidate centers from all mappers
- This can leave redundant centers within the threshold of each other
- Fix: add a small error to the threshold => threshold + ξ

So far we have found only the canopy centers. Run another MR job to find the points that belong to each canopy center; the canopy clusters are ready when that job completes. What would it look like?

Canopy clusters before the MR job: a sparse movie-user matrix.

Canopy clusters after the MR job: cells with value 1 are grouped together, and users are moved from their original locations.

K-means Clustering
- The output of canopy clustering becomes the input to K-means clustering
- Apply the cosine similarity metric to find similar users
- To compute cosine similarity, create a vector for each user in the format <UserId, List<Movies>>, e.g. <UserId, {m1, m2, m3, m4, m5}>

User A: Toy Story, Avatar, Jumanji, Heat
User B: Avatar, GoldenEye, Money Train, Mortal Kombat
User C: Toy Story, Jumanji, Money Train, Avatar

              Toy Story  Avatar  Jumanji  Heat  GoldenEye  Money Train  Mortal Kombat
    User A        1         1       1      1        0           0             0
    User B        0         1       0      0        1           1             1
    User C        1         1       1      0        0           1             0
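A quick sketch of building such binary vectors and comparing two users (the movie list and helper are assumptions based on the table above):

    MOVIES = ["Toy Story", "Avatar", "Jumanji", "Heat",
              "GoldenEye", "Money Train", "Mortal Kombat"]

    def user_vector(seen):
        # 1 if the user rated the movie, else 0.
        return [1 if m in seen else 0 for m in MOVIES]

    user_a = user_vector({"Toy Story", "Avatar", "Jumanji", "Heat"})
    user_c = user_vector({"Toy Story", "Jumanji", "Money Train", "Avatar"})

    dot = sum(x * y for x, y in zip(user_a, user_c))     # 3 movies in common
    norms = (sum(user_a) ** 0.5) * (sum(user_c) ** 0.5)  # for 0/1 vectors, sum == sum of squares
    print(dot / norms)                                    # 3 / (2 * 2) = 0.75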

- Find the k nearest neighbors within the same canopy cluster
- Do not take any point from another canopy cluster; if you want a small number of neighbors, # of K-means clusters > # of canopy clusters
- After a couple of MapReduce jobs, the K-means clustering is ready

Final illustrations: all points before clustering; after canopy clustering; after canopy clustering combined with K-means clustering.