COSC6376 Cloud Computing Homework 1 Tutorial
1 COSC6376 Cloud Computing Homework 1 Tutorial Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston
2 Outline
- Homework 1
- Tutorial based on Netflix dataset
3 Homework 1: K-means Clustering of Amazon Reviews
- Create related product items based on the Amazon review ratings
- Understand the K-means and canopy clustering algorithms and their relationship
- Implement these algorithms using Apache Spark
- Analyze the effect of running these algorithms on a large data set using Amazon Cloud
4 Tutorial based on Netflix Dataset
- K-means example using the Netflix dataset
- Rating dataset similar to Amazon reviews
- Amazon datasets: productid, userid, rating, timestamp, other metadata fields, and review texts
- Netflix dataset: movieid, userid, rating, timestamp
5 Netflix Prize
- Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies
- Netflix's internal movie rating predictor, Cinematch, is used for recommending movies
- $1,000,000 award to those who can improve the prediction by 10% (in terms of root mean squared error)
- Winner: BellKor's Pragmatic Chaos
- Another team, The Ensemble, achieved equally good results but submitted 20 minutes later
7 Competition Cancelled
- Researchers demonstrated that individuals can be identified by matching the Netflix data sets with film ratings online
- Netflix users filed a class action lawsuit against Netflix for privacy violation
- Video Privacy Protection Act of 1988
9 Movie Dataset The data is in the format UserID::MovieID::Rating::Timestamp 1::1193::5:: ::1194::4:: ::1123::1::
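A record in the `UserID::MovieID::Rating::Timestamp` layout could be parsed along the following lines. This is a minimal sketch: the `Rating` type and `parse_rating` helper are hypothetical names, and the timestamp in the sample record is made up for illustration.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One parsed rating record (hypothetical helper type)."""
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' record."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# Hypothetical record; the timestamp value is invented for this example.
record = parse_rating("1::1193::5::1000000000")
print(record.movie_id, record.rating)
```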
10 K-means Clustering
Clustering problem description:
iterate {
  Compute the distance from every point to every k-center
  Assign each point to its nearest k-center
  Compute the average of the points assigned to each k-center
  Replace the k-centers with the new averages
}
Good survey: A. K. Jain et al., "Data Clustering: A Review", ACM Computing Surveys, 1999
11 K-means Illustration Randomly select k centroids Assign cluster label of each point according to the distance to the centroids
12 K-means Illustration Recalculate the centroids Reclustering Repeat, until the cluster labels do not change, or the changes of centroids are very small
13 Summary of K-means
- Determine the value of k
- Determine the initial k centroids
- Repeat until convergence:
  - Determine membership: assign each point to the closest centroid
  - Update centroid position: compute the average of the assigned members
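The loop summarized above can be sketched on a single machine in plain Python. This is an illustrative sketch, not the homework solution: it assumes squared Euclidean distance, random initial centroids drawn from the data, and "labels stop changing" as the convergence test. The function names and toy points are made up.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random points
    labels = None
    for _ in range(max_iterations):
        # Membership step: label each point with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no label changed
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return centroids, labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The choice of k and of the initial centroids is left to the caller here, matching the first two steps of the summary.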
14 The Setting
- The dataset is stored in HDFS
- We use a MapReduce k-means to get the clustering result
- Implement each iteration as one MapReduce job
- Pass the k centroids to the Maps
- Map: assign a label to each record according to its distances to the k centroids, emitting <cluster id, record>
- Reduce: calculate the mean for each cluster, and replace the centroid with the new mean
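One iteration in this setting can be simulated locally to see what the Map and Reduce steps exchange. The sketch below is plain Python rather than Hadoop or Spark code; `kmeans_map`, `kmeans_reduce`, and `run_iteration` are hypothetical names, and the in-memory dictionary stands in for the shuffle phase.

```python
from collections import defaultdict


def kmeans_map(record, centroids):
    """Map: emit <cluster id, record> for the nearest centroid."""
    distances = [sum((a - b) ** 2 for a, b in zip(record, c)) for c in centroids]
    return distances.index(min(distances)), record


def kmeans_reduce(cluster_id, records):
    """Reduce: the new centroid is the mean of the cluster's records."""
    n = len(records)
    return cluster_id, tuple(sum(col) / n for col in zip(*records))


def run_iteration(records, centroids):
    """Simulate one MapReduce job locally: map, shuffle by key, reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        cluster_id, value = kmeans_map(record, centroids)
        groups[cluster_id].append(value)
    new_centroids = list(centroids)  # clusters with no records keep their centroid
    for cluster_id, members in groups.items():
        _, new_centroids[cluster_id] = kmeans_reduce(cluster_id, members)
    return new_centroids


records = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = run_iteration(records, [(1.0, 1.0), (9.0, 9.0)])
print(centroids)  # [(0.0, 1.0), (10.0, 11.0)]
```

A driver would call `run_iteration` repeatedly, feeding each result back in as the next set of centroids, until the centroids stop moving.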
15 Complexity
The cost is high: k * n * O(distance metric) * (number of iterations), where n is the number of points. Moreover, a large amount of data may have to be sent to each mapper node. Depending on the available bandwidth and memory, this could be infeasible.
16 Furthermore
There are three big ways a data set can be large:
- There are a large number of elements in the set.
- Each element can have many features.
- There can be many clusters to discover.
Conclusion: clustering can be huge, even when you distribute it.
17 Canopy Clustering
- Preliminary step to help parallelize computation
- Clusters data into overlapping canopies using a very cheap distance metric
- Efficient
- Accurate
19 Canopy Clustering
while there are unmarked points {
  pick a point which is not strongly marked
  call it a canopy center
  mark all points within some threshold of it as in its canopy
  strongly mark all points within some tighter threshold
}
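The canopy loop can be sketched as follows. This is a minimal single-machine sketch under stated assumptions: `loose` and `tight` are hypothetical names for the two thresholds, the distance is a plain absolute difference on toy one-dimensional points, and the next center is simply the first point that has not yet been strongly marked.

```python
def canopy_clustering(points, distance, loose, tight):
    """Return a list of (center, members) canopies; canopies may overlap."""
    if tight > loose:
        raise ValueError("tight threshold must not exceed loose threshold")
    candidates = list(points)  # points that are not yet strongly marked
    canopies = []
    while candidates:
        center = candidates[0]
        # Every point within the loose threshold joins this canopy.
        members = [p for p in points if distance(center, p) <= loose]
        canopies.append((center, members))
        # Points within the tight threshold are strongly marked and
        # can no longer become canopy centers.
        candidates = [p for p in candidates if distance(center, p) > tight]
    return canopies


points = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 11.0]
canopies = canopy_clustering(points, lambda a, b: abs(a - b), loose=3.0, tight=2.0)
for center, members in canopies:
    print(center, members)
```

On this toy input the first two canopies share the points 0.0 through 3.0, which illustrates that canopies overlap rather than partition the data.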
20 After the Canopy Clustering
Run K-means clustering as usual. Treat objects that do not share a canopy as being at infinite distance from each other.
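One way to express the "infinite distance" rule is a wrapper around the expensive metric, sketched below. The names `canopy_aware_distance` and `canopies_of` are hypothetical, and the constant stand-in metric exists only to keep the example short.

```python
import math


def canopy_aware_distance(a, b, canopies_of, distance):
    """Return the expensive distance only when a and b share a canopy."""
    if canopies_of[a] & canopies_of[b]:
        return distance(a, b)
    return math.inf  # no shared canopy: skip the expensive computation


# Hypothetical canopy memberships: item id -> set of canopy ids.
canopies_of = {"m1": {0}, "m2": {0, 1}, "m3": {1}}


def expensive(a, b):
    """Stand-in for the expensive metric (assumption for this example)."""
    return 1.0


print(canopy_aware_distance("m1", "m2", canopies_of, expensive))  # 1.0
print(canopy_aware_distance("m1", "m3", canopies_of, expensive))  # inf
```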
21 MapReduce Implementation: Problem
Efficiently partition a large data set (say, movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.
The Distance Metric
- The Canopy Metric ($)
- The K-Means Metric ($$$)
22 Steps
- Get data into a form you can use (MR)
- Pick canopy centers (MR)
- Assign data points to canopies (MR)
- Pick K-means cluster centers
- K-means algorithm (MR)
- Iterate!
23 Canopy Distance Function
- Canopy selection requires a simple distance function: the number of rater IDs in common
- Close and far distance thresholds:
  - Close distance threshold: 8 rater IDs in common
  - Far distance threshold: 2 rater IDs in common
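The cheap canopy metric can be sketched as a set intersection. The threshold values come from the slide; treating them as inclusive ("at least 8", "at least 2") is an assumption, as are the function names and the toy rater sets.

```python
CLOSE_THRESHOLD = 8  # rater IDs in common (value from the slide)
FAR_THRESHOLD = 2    # rater IDs in common (value from the slide)


def common_raters(raters_a, raters_b):
    """Cheap canopy metric: how many rater IDs two movies share."""
    return len(set(raters_a) & set(raters_b))


def canopy_relation(raters_a, raters_b):
    """Classify movie B relative to a canopy centered on movie A.

    Assumption: thresholds are inclusive (>=); the slide does not say.
    """
    shared = common_raters(raters_a, raters_b)
    if shared >= CLOSE_THRESHOLD:
        return "close"      # strongly marked: cannot become a new center
    if shared >= FAR_THRESHOLD:
        return "in_canopy"  # belongs to the canopy, may still become a center
    return "outside"


center = range(10)  # hypothetical rater IDs 0..9
print(canopy_relation(center, range(2, 20)))  # close (8 shared raters)
print(canopy_relation(center, range(7, 15)))  # in_canopy (3 shared raters)
print(canopy_relation(center, range(9, 20)))  # outside (1 shared rater)
```

Note that a larger count means the movies are closer, so the comparisons run in the opposite direction from a true distance.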
24 K-means Distance Metric
The set of ratings for a movie given by a set of users can be thought of as a vector A = [user1_score, user2_score, ..., usern_score].
To evaluate the distance between two movies, A and B, use the similarity metric below:
$$\mathrm{Similarity}(A, B) = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\,\sqrt{\sum_{i} B_i^2}}$$
where each sum runs over all indices 0 <= i < n.
25 Example
Three vectors: Vector(A), Vector(B), Vector(C)
Similarity between A and B:
Similarity(A, B) = Vector(A) · Vector(B) / (|A| * |B|)
Vector(A) · Vector(B) = 1
|A| * |B| = 2 * 2 = 4
Similarity(A, B) = 1/4 = 0.25
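The similarity computation can be checked with a few lines of Python. The two vectors below are illustrative assumptions, chosen so that the dot product is 1 and both norms are 2, matching the numbers in the example; the slide's own vectors did not survive transcription.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an all-zero vector (assumption)
    return dot / (norm_a * norm_b)


# Illustrative 0/1 vectors: one shared non-zero entry, four ones each.
a = [1, 1, 1, 1, 0, 0, 0]
b = [0, 1, 0, 0, 1, 1, 1]
print(cosine_similarity(a, b))  # 0.25
```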
26 Data Massaging
Convert the data into the required format. In this case, the converted data is represented as <MovieId, List of Users>, that is, <MovieId, List<userId, rating>>.
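The conversion could look like the sketch below, which groups raw records by movie. It runs locally on a list of strings rather than as a MapReduce job; `group_by_movie` is a hypothetical name and the sample records are made up.

```python
from collections import defaultdict


def group_by_movie(lines):
    """Build {movie_id: [(user_id, rating), ...]} from '::'-separated records."""
    movies = defaultdict(list)
    for line in lines:
        user_id, movie_id, rating, _timestamp = line.strip().split("::")
        movies[int(movie_id)].append((int(user_id), int(rating)))
    return dict(movies)


# Hypothetical records in UserID::MovieID::Rating::Timestamp form.
lines = ["1::1193::5::100", "2::1193::4::101", "2::1194::3::102"]
print(group_by_movie(lines))
# {1193: [(1, 5), (2, 4)], 1194: [(2, 3)]}
```

In a MapReduce version, the map step would emit <MovieId, (userId, rating)> and the reduce step would collect the list for each movie.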
27 Canopy Cluster Mapper A
28 Threshold Value
35 Reducer
Mapper A - Red center
Mapper B - Green center
36 Redundant Centers within the Threshold of Each Other.
37 Add Small Error => Threshold+ξ
38 So far we have found only the canopy centers. Run another MR job to find the points that belong to each canopy center. The canopy clusters are ready when the job is completed. What would it look like?
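The assignment pass can be sketched locally as follows. It assumes the cheap "raters in common" metric with a minimum of 2 shared raters for membership; `assign_to_canopies`, `min_shared`, and the toy rater sets are hypothetical.

```python
def assign_to_canopies(movie_raters, center_ids, min_shared=2):
    """Map each canopy center to the movies sharing >= min_shared raters with it."""
    canopies = {center: [] for center in center_ids}
    for movie_id, raters in movie_raters.items():
        for center in center_ids:
            if len(raters & movie_raters[center]) >= min_shared:
                canopies[center].append(movie_id)  # a movie may join several canopies
    return canopies


# Hypothetical data: movie id -> set of rater IDs.
movie_raters = {
    "m1": {1, 2, 3, 4},
    "m2": {3, 4, 5, 6},
    "m3": {5, 6, 7},
    "m4": {6, 7, 8},
}
print(assign_to_canopies(movie_raters, ["m1", "m3"]))
# {'m1': ['m1', 'm2'], 'm3': ['m2', 'm3', 'm4']}
```

Here `m2` lands in both canopies, which is the overlap that later lets K-means compare it against centers from either one.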
39 Canopy Cluster - Before MR job Sparse Matrix
40 Canopy Cluster After MR job
41 Cells with value 1 are grouped together, and users are moved from their original locations
42 K-Means Clustering
The output of canopy clustering becomes the input of K-means clustering. Apply the cosine similarity metric to find similar users. To compute cosine similarity, create a vector in the format <UserId, List<Movies>>, e.g. <UserId, {m1, m2, m3, m4, m5}>
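43
User A: Toy Story, Avatar, Jumanji, Heat
User B: Avatar, GoldenEye, Money Train, Mortal Kombat
User C: Toy Story, Jumanji, Money Train, Avatar

The same lists as a user-by-movie matrix (1 = the movie is in the user's list):

        Toy Story  Avatar  Jumanji  Heat  GoldenEye  Money Train  Mortal Kombat
User A  1          1       1        1     0          0            0
User B  0          1       0        0     1          1            1
User C  1          1       1        0     0          1            0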
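For users represented only by their movie lists, cosine similarity reduces to a set computation. The sketch below assumes each list is a 0/1 vector over all movies (an interpretation, since the lists carry no rating values), so the dot product is the number of shared movies and each norm is the square root of the list length. `binary_cosine` is a hypothetical name.

```python
import math
from itertools import combinations

# Movie lists as given for the three users.
watched = {
    "User A": {"Toy Story", "Avatar", "Jumanji", "Heat"},
    "User B": {"Avatar", "GoldenEye", "Money Train", "Mortal Kombat"},
    "User C": {"Toy Story", "Jumanji", "Money Train", "Avatar"},
}


def binary_cosine(movies_a, movies_b):
    """Cosine similarity of 0/1 vectors, computed from the underlying sets."""
    if not movies_a or not movies_b:
        return 0.0
    return len(movies_a & movies_b) / math.sqrt(len(movies_a) * len(movies_b))


similarities = {
    (u, v): binary_cosine(watched[u], watched[v])
    for u, v in combinations(watched, 2)
}
for pair, value in similarities.items():
    print(pair, value)
```

Under this interpretation User A and User C are the most similar pair (three shared movies out of four each), and User A and User B the least similar (one shared movie).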
44 Find the k neighbors from within the same canopy cluster. Do not take any point from another canopy cluster if you want a small number of neighbors. The number of K-means clusters is greater than the number of canopy clusters. After a couple of MapReduce jobs, the K-means clusters are ready.
45 All Points Before Clustering
46 Canopy - Clustering
47 Canopy Clustering and K-means Clustering