Clustering Large scale data using MapReduce

Topics
1. Clustering Very Large Multi-dimensional Datasets with MapReduce, by Robson L. F. Cordeiro et al.
2. Fast Clustering using MapReduce, by Alina Ene et al.

Background
Hyper-rectangle: a rectangle in n dimensions, used here to describe cluster output.
MapReduce phases:
Mapper: takes a single (key, value) pair and generates any number of (key, value) output pairs.
Shuffle: groups and sorts the (key, value) pairs emitted by the mappers so that all values for the same key reach a single reducer.
Reducer: takes all values emitted for a single key and outputs (key, value) pairs.
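
To make the three phases concrete, here is a minimal single-machine sketch (not code from either paper) that simulates map, shuffle, and reduce over a small list of records; the word-count example and all names are illustrative assumptions.

from collections import defaultdict

def mapper(key, value):
    # Takes a single (key, value) pair and emits any number of (key, value) pairs.
    for word in value.split():
        yield word, 1

def shuffle(mapped_pairs):
    # Groups all values emitted for the same key so one reducer sees them together.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Takes all values for a single key and emits output (key, value) pairs.
    yield key, sum(values)

if __name__ == "__main__":
    records = [(1, "map reduce map"), (2, "reduce shuffle map")]
    mapped = [pair for k, v in records for pair in mapper(k, v)]
    grouped = shuffle(mapped)
    output = [pair for k, vs in grouped.items() for pair in reducer(k, vs)]
    print(sorted(output))  # [('map', 3), ('reduce', 2), ('shuffle', 1)]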

Topic 1: Clustering Very Large Multi-dimensional Datasets with MapReduce, by Robson L. F. Cordeiro et al.

Contributions
Proposes a novel Best of Both Worlds (BoW) method that chooses between two clustering approaches, ParC and SnI.
Proposes cost estimates for each of these approaches, based on environment parameters, to make that decision.
Modular: any sequential clustering algorithm can be plugged in.
Shows good performance improvements: 0.2 TB clustered in 8 minutes using 128 cores.

ParC
P1: Partition the data, using one of several strategies:
- random data partitioning: elements are assigned to machines at random (load balanced)
- address-space data partitioning: elements that are nearby in the data space tend to end up on the same machine (not load balanced)
- arrival-order or file-based data partitioning: the first several elements in the collection go to one machine, the next batch to the second, and so on (load balanced)
P2: Shuffle the partitioned output.
P3: Each reducer clusters the data it receives, using a plug-in that does the actual serial clustering.
P4: Send the cluster descriptions to a single machine.
P5: Merge all cluster descriptions to produce the final set of clusters.
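
A minimal single-machine sketch of the ParC pipeline, assuming the random partitioning strategy; the function names, the plug-in interface, and the toy hyper-rectangle "cluster description" are illustrative assumptions, not the paper's implementation.

import random
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
ClusterDescription = Dict[str, Tuple[float, ...]]

def parc(points: Sequence[Point],
         num_reducers: int,
         serial_cluster: Callable[[List[Point]], List[ClusterDescription]],
         merge: Callable[[List[ClusterDescription]], List[ClusterDescription]]) -> List[ClusterDescription]:
    # P1: random data partitioning (load balanced); the partition id plays the role of the mapper's output key.
    partitions: List[List[Point]] = [[] for _ in range(num_reducers)]
    for p in points:
        partitions[random.randrange(num_reducers)].append(p)
    # P2 + P3: after the shuffle, each "reducer" runs the plug-in serial clustering on its partition.
    descriptions: List[ClusterDescription] = []
    for part in partitions:
        if part:
            descriptions.extend(serial_cluster(part))
    # P4 + P5: all cluster descriptions are sent to one machine and merged into the final clusters.
    return merge(descriptions)

if __name__ == "__main__":
    def one_box(part):
        # Toy plug-in: describe the partition by a single hyper-rectangle (not a real clustering algorithm).
        dims = list(zip(*part))
        return [{"low": tuple(map(min, dims)), "high": tuple(map(max, dims))}]

    data = [(random.random(), random.random()) for _ in range(1000)]
    print(parc(data, num_reducers=4, serial_cluster=one_box, merge=lambda descs: descs))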

ParC example (figure taken from Clustering Very Large Multi-dimensional Datasets with MapReduce by Robson L. F. Cordeiro et al.)

Pros/Cons
+ Parses the input only once.
- High network cost: all of the data is passed from mappers to reducers in the P2 shuffle step.

SnI (Sample-and-Ignore) approach
S1: Read the data and take a sample.
S2: Shuffle the sampled data.
S3: A single reducer looks for clusters in the sample.
S4: Send the cluster descriptions to the mappers.
S5: Read the data again in parallel, ignoring points that already fall in the sampled clusters.
S6: Shuffle the remaining (unclustered) data.
S7: Cluster the unclustered data.
S8: Send the cluster descriptions to one machine.
S9: Merge all resulting clusters.

SnI example (figure taken from Clustering Very Large Multi-dimensional Datasets with MapReduce by Robson L. F. Cordeiro et al.)

Pros/Cons
+ Low network and shuffle cost: sampling reduces the amount of data shuffled.
- Reads the input data twice (higher I/O cost).

Which approach is better? The answer depends on the environment and input parameters:
- network speed
- file size
- disk speed
- instance start-up overhead
- complexity of the sequential clustering plug-in

Cost estimate for ParC (the equation is given in the paper; it combines the mapper, shuffle, and reducer costs), where:
cost_m: cost of the mapper step
cost_s: cost of the shuffle step
cost_r: cost of the reducer step
m: number of mappers
r: number of reducers
Fs: size of the data file
Ns: network data rate
Dr: ratio of data actually transferred from mappers to reducers; this is < 1 because MapReduce optimizes by scheduling a reducer on a machine that already ran a mapper.

Cost estimate for SnI (the equation is given in the paper; it combines the mapper, shuffle, and reducer costs of both passes), where:
cost_m: cost of the mapper steps
cost_s: cost of the shuffle steps
cost_r: cost of the reducer steps
m: number of mappers
r: number of reducers
Fs: size of the data file
Sr: sampling rate
Rr: ratio of the data that does not belong to the major clusters

Best of Both Worlds algorithm
Compute costC (the ParC estimate) and costCs (the SnI estimate).
If costC > costCs, use SnI; otherwise use ParC.
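
A minimal sketch of the BoW decision rule. The detailed cost formulas live in the paper, so here they are passed in as callables; the environment-parameter keys and the dummy cost functions below are illustrative assumptions only.

from typing import Callable, Dict

def best_of_both_worlds(env: Dict[str, float],
                        cost_parc: Callable[[Dict[str, float]], float],
                        cost_sni: Callable[[Dict[str, float]], float]) -> str:
    # Return which approach to run for this environment, per the BoW rule above.
    cost_c = cost_parc(env)    # estimated total cost of ParC
    cost_cs = cost_sni(env)    # estimated total cost of SnI
    return "SnI" if cost_c > cost_cs else "ParC"

if __name__ == "__main__":
    env = {"Fs": 0.2e12, "Ns": 1e9, "m": 128, "r": 128, "Dr": 0.5, "Sr": 0.01, "Rr": 0.1}
    # Dummy cost functions, only to exercise the decision rule; not the paper's formulas.
    choice = best_of_both_worlds(env,
                                 cost_parc=lambda e: e["Dr"] * e["Fs"] / e["Ns"],
                                 cost_sni=lambda e: (e["Sr"] + e["Rr"]) * e["Fs"] / e["Ns"])
    print(choice)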

Results

Evaluation
Cluster quality was compared using synthetic data.
Shows significant scale-up: processed 0.2 TB in 8 minutes using 128 cores.
The results show that the cost estimates are quite accurate (the correct approach was chosen).

Topic 2: Fast Clustering using MapReduce, by Alina Ene et al.

Contributions
Proposes fast clustering algorithms with performance guarantees.
Gives a theoretical proof that the proposed algorithm belongs to the class MRC^0.
Can be applied to both k-median and k-center clustering.

The MRC class
The memory used on any single machine is sublinear in the input size.
The total number of machines used is sublinear in the input size.
The number of rounds is constant (this is what the superscript 0 in MRC^0 denotes).

k-center and k-median
k-center: choose k points, each representing a cluster, so that the maximum distance between a point and the center it is assigned to is minimized.
k-median: choose k centers so that the sum of distances from each point to its assigned center is minimized.
Both problems are NP-hard.
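
In symbols (standard formulations of the two objectives, stated here for reference rather than copied from the slides), with input point set V, distance d, and a chosen center set C of size k:

\min_{C \subseteq V,\; |C| = k} \; \max_{p \in V} \, \min_{c \in C} d(p, c) \quad \text{(k-center)}

\min_{C \subseteq V,\; |C| = k} \; \sum_{p \in V} \, \min_{c \in C} d(p, c) \quad \text{(k-median)}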

Some definitions
The distance from a point p to a set of points S is the minimum distance between p and any point in S.
Instead of clustering the entire dataset, a sample can be clustered. The sample might not represent every point in the dataset; the left-out points are called unrepresented data.
Basic idea: keep sampling until the amount of unrepresented data falls below a certain threshold.

Sequential Iterative-Sample
S is the sampled data, R the still-unrepresented data.
S <- Φ, R <- V (the input set)
while |R| > some threshold:
  Add each point of R to S independently with a small sampling probability.
  Likewise, sample a set H from R.
  Sort the points of H in descending order of their distance to S.
  Pick the (8 log n)-th point in this order as v.
  Remove from R every point whose distance to S is smaller than the distance from v to S.
Output S ∪ R (the remaining unrepresented points are included as well).
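
A rough single-machine sketch of this sampling loop, assuming Euclidean distance; the sampling probability and stopping threshold below are illustrative constants (in the paper they depend on k, n, and an accuracy parameter), not the exact values from the algorithm.

import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def dist_to_set(p: Point, S: Sequence[Point]) -> float:
    # Distance from a point to a set: the minimum distance to any point in the set.
    return min(math.dist(p, s) for s in S) if S else float("inf")

def iterative_sample(V: Sequence[Point],
                     sample_prob: float = 0.05,
                     threshold: int = 100) -> List[Point]:
    n = len(V)
    S: List[Point] = []
    R: List[Point] = list(V)
    while len(R) > threshold:
        # Sample some points of R into S, and an independent pivot sample H.
        newly_sampled = {p for p in R if random.random() < sample_prob}
        S.extend(newly_sampled)
        H = [p for p in R if random.random() < sample_prob]
        if not H:
            continue
        # Sort H from farthest to closest to S and pick the (8 log n)-th point as v.
        H.sort(key=lambda p: dist_to_set(p, S), reverse=True)
        v = H[min(int(8 * math.log(n)), len(H) - 1)]
        cutoff = dist_to_set(v, S)
        # Keep only points that are still far from the sample (and were not just sampled).
        R = [p for p in R if p not in newly_sampled and dist_to_set(p, S) > cutoff]
    return S + R  # the remaining unrepresented points are kept, as in the slide

if __name__ == "__main__":
    pts = [(random.random(), random.random()) for _ in range(2000)]
    print(len(iterative_sample(pts)))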

MapReduce version of Iterative-Sample
While |R| > T:
  Arbitrarily partition R into sets; each set goes to its own reducer.
  At each reducer i, sample the received data to form sets Si and Hi.
  A single machine reads the samples and forms H = union of the Hi's and S = union of the Si's.
  A reducer receives H and S and finds v as in the sequential case: the (8 log n)-th point of H in descending order of distance to S.
  The mappers partition the points of R into subsets; reducer i receives a single partition Ri together with v and S.
  Reducer i computes the distance of each point in Ri to S and removes the point if that distance is smaller than distance(v, S).
  The remaining Ri's are unioned into the new R.
Repeat the while loop; output C = S ∪ R.

Properties of Iterative-Sample
The number of iterations of the loop is constant with high probability.
The size of the set returned by Iterative-Sample is on the order of log n with high probability.

MapReduce-kCenter
C <- Iterative-Sample()
Map C and all pairwise distances between points in C to a single reducer.
The reducer runs a sequential k-center clustering algorithm on C (a standard choice is sketched below).
Return the constructed clusters.
Note: does sampling work here? The results section reports that sampling does not work very well for k-center.
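
The slides do not name the sequential k-center routine run on C. A common choice, shown here purely as an illustrative assumption, is Gonzalez's farthest-first traversal, a 2-approximation for k-center.

import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def farthest_first_kcenter(C: Sequence[Point], k: int) -> List[Point]:
    # Gonzalez's farthest-first traversal: repeatedly add the point farthest
    # from the centers chosen so far. One possible plug-in for the sequential step;
    # the paper does not mandate this particular algorithm.
    centers = [C[0]]  # start from an arbitrary point
    closest = [math.dist(p, centers[0]) for p in C]  # distance to nearest chosen center
    while len(centers) < min(k, len(C)):
        i = max(range(len(C)), key=lambda j: closest[j])
        centers.append(C[i])
        closest = [min(closest[j], math.dist(C[j], C[i])) for j in range(len(C))]
    return centers

if __name__ == "__main__":
    sample = [(random.random(), random.random()) for _ in range(500)]
    print(farthest_first_kcenter(sample, k=4))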

k-median
Since k-median minimizes a sum of distances, every unsampled point has to be represented somehow. Its closest sampled point serves as its representative, and every sampled point is assigned a weight equal to the number of points it represents.
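
A small sketch of this weighting step, assuming Euclidean distance; the function name is illustrative.

import math
from collections import Counter
from typing import Dict, Sequence, Tuple

Point = Tuple[float, ...]

def sample_weights(points: Sequence[Point], sample: Sequence[Point]) -> Dict[Point, int]:
    # Weight each sampled point by how many input points it represents,
    # i.e. how many points have it as their closest sampled point.
    weights: Counter = Counter()
    for p in points:
        representative = min(sample, key=lambda s: math.dist(p, s))
        weights[representative] += 1
    return dict(weights)

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
    print(sample_weights(pts, sample=[(0.0, 0.0), (5.0, 5.0)]))
    # {(0.0, 0.0): 2, (5.0, 5.0): 2}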

MapReduce-kMedian
C <- MapReduce-Iterative-Sample
Partition the input data V into sets Vi; the mappers send each Vi together with C to a reducer.
Each reducer i computes a partial weight wi(y) for every point y in C, based on the points in Vi.
Map all of the weights, C, and the pairwise distances within C to a single reducer.
That reducer computes the total weight w(y) = sum over i of wi(y), then runs a weighted k-median clustering algorithm on C.
Return the clusters formed.

Results observed
k-center does not perform very well with sampling: the maximum distance from a point to a center can be badly affected even if a single point is missed.
Detailed results are therefore presented only for k-median:
about a 1000x speedup over LocalSearch, and
about a 20x speedup over Parallel-Lloyd.