Determining the k in k-means with MapReduce

Similar documents
Determining the k in k-means with MapReduce

pfk-means: A Parameter Free K-means Algorithm Omar Kettani*, *(Scientific Institute, Physics of the Earth Laboratory/Mohamed V- University, Rabat)

An Efficient Clustering for Crime Analysis

Databases 2 (VU) ( / )

Dynamic Clustering of Data with Modified K-Means Algorithm

Clustering. (Part 2)

Towards the world s fastest k-means algorithm

Improving the MapReduce Big Data Processing Framework

Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark

COSC6376 Cloud Computing Homework 1 Tutorial

Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist

MapReduce ML & Clustering Algorithms

Mounica B, Aditya Srivastava, Md. Faisal Alam

MapReduce Design Patterns

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University

Distances, Clustering! Rafael Irizarry!

Turkit: Human Computation Algorithms on Mechanical Turk. Greg Little, Lydia B. Chilton, Max Goldman, Robert C. Miller

FAST K-MEANS ALGORITHM CLUSTERING

k-means Gaussian mixture model Maximize the likelihood exp(

Data clustering & the k-means algorithm

Introduction to Hadoop and MapReduce

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

University of Washington Department of Computer Science and Engineering / Department of Statistics

A Simple Cluster Validation Index with Maximal Coverage

PERFORMANCE ANALYSIS OF DATA MINING TECHNIQUES FOR REAL TIME APPLICATIONS

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

k-means Clustering David S. Rosenberg April 24, 2018 New York University

Multivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

DS-Means: Distributed Data Stream Clustering

Machine Learning using MapReduce

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Comparison of Cluster Validation Indices with Missing Data

An Initial Seed Selection Algorithm for K-means Clustering of Georeferenced Data to Improve

An Adaptive and Deterministic Method for Initializing the Lloyd-Max Algorithm

Algorithm Design for MapReduce

Improved Version of Kernelized Fuzzy C-Means using Credibility

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Analyzing Big Data with Microsoft R

CS 345A Data Mining. MapReduce

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

Clustering Lecture 5: Mixture Model

Chapter 6 Continued: Partitioning Methods

Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

Lecture 11 Hadoop & Spark

Objective of clustering

Overview. Audience profile. At course completion. Course Outline. : 20773A: Analyzing Big Data with Microsoft R. Course Outline :: 20773A::

Comparative Analysis of K means Clustering Sequentially And Parallely

SBKMMA: Sorting Based K Means and Median Based Clustering Algorithm Using Multi Machine Technique for Big Data

Clustering. Chapter 10 in Introduction to statistical learning

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b

Approximation Algorithms for NP-Hard Clustering Problems

Lecture 10: Semantic Segmentation and Clustering

Big Data Hadoop Stack

Unsupervised Learning

Clustering. Unsupervised Learning

How to keep capacity predictions on target and cut CPU usage by 5x

Computational Social Choice in the Cloud

CHAPTER 4: CLUSTER ANALYSIS

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

Image Compression with Competitive Networks and Pre-fixed Prototypes*

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

Recitation 9. CS435: Introduction to Big Data. GTA: Bibek R. Shrestha March 23, 2018

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM

CHAPTER 3 TUMOR DETECTION BASED ON NEURO-FUZZY TECHNIQUE

k-means A classical clustering algorithm

Outline. CS-562 Introduction to data analysis using Apache Spark

Big Data Analytics using Apache Hadoop and Spark with Scala

Clustering. Unsupervised Learning

Iterative random projections for high-dimensional data clustering

Project Design. Version May, Computer Science Department, Texas Christian University

k-center Problems Joey Durham Graphs, Combinatorics and Convex Optimization Reading Group Summer 2008

Unsupervised Learning and Clustering

Hadoop: The Definitive Guide PDF

Jeffrey D. Ullman Stanford University

COSC 6339 Big Data Analytics. Hadoop MapReduce Infrastructure: Pig, Hive, and Mahout. Edgar Gabriel Fall Pig

An Introduction to Apache Spark

An Improvement of Centroid-Based Classification Algorithm for Text Classification

Clustering Algorithms. Margareta Ackerman

Meta-Clustering. Parasaran Raman PhD Candidate School of Computing

2/26/2017. For instance, consider running Word Count across 20 splits

Big Data Using Hadoop

Dynamic Clustering in WSN

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

Kapitel 4: Clustering

FINAL PROJECT #3: GEO-LOCATION CLUSTERING IN SPARK

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Dynamic local search algorithm for the clustering problem

ALTERNATIVE METHODS FOR CLUSTERING

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

An Improved Parallel Scalable K-means++ Massive Data Clustering Algorithm Based on Cloud Computing

Transcription:

Algorithms for MapReduce and Beyond 2014 Determining the k in k-means with MapReduce Thibault Debatty, Pietro Michiardi, Wim Mees & Olivier Thonnard

Clustering & k-means Clustering K-means [Stuart P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28:129 137, 1982.] 1982 (a great year!) But still largely used Drawbacks (amongst others): Local minimum K is a parameter! 2

Clustering & k-means Determine k: VERY difficult [Anil K Jain. Data Clustering : 50 Years Beyond K-Means. Pattern Recognition Letters, 2009] Using cluster evaluation metrics: Dunn's index, Elbow, Silhouette, jump method (based on information theory), Gap statistic,... O(k²) 3

G-means G-means [Greg Hamerly and Charles Elkan. Learning the k in kmeans. In Neural Information Processing Systems. MIT Press, 2003] K-means : points in each cluster are spherically distributed around the center Source: scikit-learn 4

G-means G-means [Greg Hamerly and Charles Elkan. Learning the k in kmeans. In Neural Information Processing Systems. MIT Press, 2003] K-means : points in each cluster are spherically distributed around the center normality test & recursion 5

G-means Dataset 6

G-means 1. Pick 2 centers 7

G-means 2. k-means 8

G-means 3. Project 9

G-means 3. Project 10

G-means Normal? No => iterate 4. Normality test 11

G-means 5. Recursion 12

MapReduce G-means Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage 13

MapReduce G-means Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage 14

MapReduce G-means 2. Reduce number of jobs PickInitialCenters while Not ClusteringCompleted do KMeans KMeansAndFindNewCenters TestClusters end while 15

MapReduce G-means 3. Maximize parallelism 4. Limit memory usage TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if 16

MapReduce G-means 3. Maximize parallelism 4. Limit memory usage (risk of crash) TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster ck e end if n e l t ot B 17

MapReduce G-means TestFewClusters TestClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if iner b m o c y r o In mem 18

MapReduce G-means TestFewClusters TestClusters Map(key, point) Map(key, point) #clusters > #reducers Find cluster Find cluster Find vector Find vector & Project point on vector Project point on vector Add projection to list Emit(cluster, projection) Estimated required memory < Java heap Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if 19

MapReduce G-means TestFewClusters TestClusters Map(key, point) Map(key, point) #clusters > #reducers Find cluster Find cluster Find vector Find vector & Project point on vector Project point on vector Add projection to list Emit(cluster, projection) Estimated required memory < Java heap Experimentally: Close() 64 do Bytes / point For each list Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if 20

Comparison MR multi-k-means MR G-means Speed all possible values of k in a single job Quality 21

Comparison Speed MR multi-k-means O(nk²) computations MR G-means O(nk) computations But: more iterations more dataset reads log (k) 2 Quality New centers added if and where needed But: tends to overestimate k! 22

Experimental results : Speed Hadoop Synthetic dataset 10M points in R10 Euclidean distance 8 machines 23

Experimental results : Quality k 100 200 400 x ~1.5 kfound 150 279 639 Within Cluster Sum of Square (less is better) MR G-means multi-k-means 3.34 3.71 3.33 3.6 3.23 3.39 (with same k) Hadoop Synthetic dataset 10M points in R10 Euclidean distance 8 machines 24

Conclusions & future work... MapReduce algorithm to determine k Running time proportional to k Future: Overestimation of k Test on real data Test scalability Reduce I/O (using Spark) Consider skewed data Consider impact of machine failure 25

Thank you! 26