Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist

Slide 1/14: Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce
Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist
*ICSI Berkeley; Zuse Institut Berlin

Slide 2/14: Motivation & Idea (Joos-Hendrik Böse, 4/26/2010)
- Exploratory data studies with MapReduce are tedious: MR jobs are typically long-running, and the user is forced to wait for the final result of a job before she can decide on the next query.
- But "speed of thought" performance is required for interactive analysis, and DSS and OLAP users are demanding.
- The exact (final) result is rarely required in non-lookup queries.
- How to reduce time-to-solution? (i) Increase the degree of parallelization. (ii) Exploit preliminary results, as seen in DBMSs (online aggregation).
- [Figure: MapReduce sits high on the "degree of parallelization" axis, online aggregation high on the "degree of exploiting preliminary results" axis; combining both reaches the region beyond today's systems.]
- Idea: combine MR parallelization and online aggregation to achieve times-to-solution beyond today's systems.

Slide 3/14: Agenda
- Online aggregation
- Streaming MapReduce framework
- Assessing the quality of preliminary results
- Experiments with single-phase ML algorithms
- How to implement online aggregation for multi-phase ML algorithms
- Reducing k-means to a single-phase algorithm
- Experimental results for online k-means

Slide 4/14: Online Aggregation
- Developed for large-scale data analysis on RDBMSs [Hellerstein et al. 1997].
- For an SQL aggregate, progress, preliminary results, and confidence intervals are provided during query execution.
- Puts the user in the "driver's seat": for a query such as SELECT AVG(grade) FROM ENROLL GROUP BY major; the user (i) can cancel futile queries prematurely and (ii) can stop execution once the preliminary result is precise enough.
- Aggregate estimates are often close to the final result after processing only a small fraction of the input data.
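The AVG example above can be sketched with a running mean plus a normal-approximation confidence interval. This is a minimal single-threaded illustration of the idea, not the estimator of Hellerstein et al.; the function name and the 95% z-value are our choices:

```python
import math

def online_avg(stream, z=1.96):
    """Running estimate of AVG(x) with a normal-approximation confidence
    interval, in the spirit of online aggregation (a toy sketch).
    Yields (estimate, CI half-width) after each processed value."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n            # Welford's running mean
        m2 += delta * (x - mean)     # running sum of squared deviations
        # standard error of the mean needs at least two samples
        half_width = z * math.sqrt(m2 / (n - 1) / n) if n > 1 else float("inf")
        yield mean, half_width
```

A user watching this stream could stop as soon as the half-width drops below an acceptable error, exactly the "precise enough" criterion from the slide.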

Slide 5/14: Online Aggregation in MapReduce
- How to implement online aggregation? The standard MapReduce system model is batch-oriented:
  - each operator executes completely before producing any output;
  - the map phase must be finished before the reduce phase is started.
- [Figure: standard MapReduce system model — m map workers read the input files and write intermediate result files; n reduce workers read, sort, and reduce them into k result files.]
- To implement online aggregation, results must be continuously streamed from mappers to reducers.

Slide 6/14: Streaming MapReduce
- Mappers send <k,v> pairs directly to the reducer responsible for k.
- A control component adds progress information to each <k,v> pair.
- Reducers output their results regularly to a collector component.
- The collector computes preliminary (and final) results; preliminary results are used for convergence estimation and visualization.
- Currently implemented as a shared-memory framework.
- [Figure: streaming MapReduce system model — mapper worker threads feed combiners and a key dictionary (key reduction); a control thread monitors progress; reducer worker threads stream to an online collector/visualizer that emits preliminary results (e.g. at 10% of the input) and the final result at 100%.]
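As an illustration of the streaming model, the following toy sketch merges mapped pairs into reducer state immediately and emits preliminary snapshots instead of waiting for the whole input. Word count serves as the example job; all names are ours, and this is not the paper's framework:

```python
from collections import defaultdict

def streaming_wordcount(records, snapshot_every=2):
    """Toy streaming map->reduce: each record is mapped and its pairs are
    merged into the reducer state immediately; a preliminary result is
    snapshotted every `snapshot_every` records."""
    counts = defaultdict(int)              # reducer state, keyed by word
    snapshots = []
    for i, line in enumerate(records, 1):
        for word in line.split():          # map: emit <word, 1>
            counts[word] += 1              # reduce: merge without waiting
        if i % snapshot_every == 0:
            snapshots.append(dict(counts))  # preliminary result
    return dict(counts), snapshots
```

The snapshots play the role of the collector's preliminary results; in the real framework mappers, combiners, and reducers run as concurrent threads rather than in one loop.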

Slide 7/14: Convergence Estimation
- The collector component saves and visualizes intermediate results using signatures of those results.
- A signature sig(R) of each intermediate result R is saved to plot a convergence graph; sig(R) is either the full result R, a compact representation of R, or the quality of R.
- A convergence graph depicts the distance between intermediate results using a specific metric dif(sig(R_i), sig(R_j)).
- Example (relative word frequencies): R is a histogram (word -> relative frequency), sig(R) = R, and dif() is the mean squared error.
- The online convergence curve plots dif(sig(R_{i-1}), sig(R_i)); its slope indicates the degree of change in the results.
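For the word-frequency example, sig(R) = R and dif() is the mean squared error; these definitions can be sketched directly (helper names are ours):

```python
def mse(sig_a, sig_b):
    """dif(): mean squared error over the union of keys of two
    signatures (dicts mapping word -> relative frequency)."""
    keys = set(sig_a) | set(sig_b)
    return sum((sig_a.get(k, 0.0) - sig_b.get(k, 0.0)) ** 2 for k in keys) / len(keys)

def online_convergence_curve(signatures):
    """Points dif(sig(R_{i-1}), sig(R_i)) between successive preliminary
    results; a curve approaching zero signals convergence."""
    return [mse(a, b) for a, b in zip(signatures, signatures[1:])]
```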

Slide 8/14: Experiments: Single-Phase Algorithms
- Single-phase algorithms and datasets: word count (wc) on texts of the Gutenberg project; linear regression (lr) on a synthetic dataset (s); PCA (pca) on a synthetic dataset (s); naive Bayes (nb) on (i) a spam dataset (m) and (ii) a URL dataset (u).
- [Figure: offline convergence curves for these algorithm/dataset combinations.]
- The offline convergence curve plots dif(sig(R_i), sig(R_f)), where R_f is the final result.
- D is the 95th percentile of the dif values.
- S(y) is the smallest fraction of the input data such that all dif values encountered afterwards are smaller than y% of D.
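A possible reading of the D and S(y) statistics in code; we assume a nearest-rank 95th percentile and a convergence curve given as (input fraction, dif) points ordered by fraction, both of which are our interpretation:

```python
def s_of_y(curve, y):
    """curve: list of (input_fraction, dif) points of an offline
    convergence curve, ordered by fraction. D is the 95th percentile
    (nearest-rank) of the dif values; S(y) is the smallest input
    fraction after which every dif stays below y% of D."""
    difs = sorted(d for _, d in curve)
    D = difs[max(0, -(-95 * len(difs) // 100) - 1)]  # nearest-rank 95th pct.
    threshold = y / 100.0 * D
    s = curve[0][0]                                  # no violation at all
    for i in range(len(curve) - 1, -1, -1):
        if curve[i][1] >= threshold:                 # last violating point
            s = curve[i + 1][0] if i + 1 < len(curve) else curve[i][0]
            break
    return s
```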

Slide 9/14: Multi-Phase Algorithms
- Iterative algorithms such as k-means, EM, and SVM require multiple MR phases.
- Example: multi-phase k-means clustering on MR. Input: k cluster centroids (cluster centers) G_i (i = 1..k) and partitioned input data D.
- Map task m computes C_i^m.s, the sum of the sample vectors assigned to G_i, and C_i^m.n, their count.
- The reducer can then compute each new centroid as

  G'_i = ( Σ_{m=1..n} C_i^m.s ) / ( Σ_{m=1..n} C_i^m.n )

- [Figure: MapReduce job 1 takes G and D; its mappers emit C_i^m.s and C_i^m.n and its reducer computes G'. MapReduce job 2 repeats the step on G' to compute G''.]
- Open question: how to derive preliminary results?
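One iteration of the multi-phase scheme can be sketched as a map step emitting per-centroid partial sums and counts, and a reduce step applying the centroid formula above. This is a sequential sketch; the data layout and function names are ours:

```python
def kmeans_map(partition, centroids):
    """Map task m: assign each sample to its nearest centroid G_i and
    emit the partial sums C_i^m.s and counts C_i^m.n."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for x in partition:
        i = min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += x[d]
    return sums, counts

def kmeans_reduce(partials, centroids):
    """Reducer: G'_i = (sum over mappers of C_i^m.s) / (sum of C_i^m.n);
    a centroid with no assigned samples is kept unchanged."""
    k, dim = len(centroids), len(centroids[0])
    total_s = [[0.0] * dim for _ in range(k)]
    total_n = [0] * k
    for sums, counts in partials:
        for i in range(k):
            total_n[i] += counts[i]
            for d in range(dim):
                total_s[i][d] += sums[i][d]
    return [[s / total_n[i] for s in total_s[i]] if total_n[i] else list(centroids[i])
            for i in range(k)]
```

In the multi-phase setting this map/reduce pair is one complete MR job, re-run once per iteration, which is exactly why preliminary results are hard to obtain.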

Slide 10/14: Single-Phase K-Means
- Idea: re-submit the preliminary centroids G to the mappers; G is updated via an event mechanism.
- [Figure: mappers assign the samples of their partitions D_1..D_n to the current G and emit C_i^m.s and C_i^m.n; the reducer periodically computes and updates the global centroids G; an online collector/visualizer writes preliminary results (e.g. at 10% of the input) and the final result at 100%.]
- Problem: a mapper has to revisit samples it already processed whenever an update of the global centroids G is received, and not all samples may fit into memory.

Slide 11/14: Auxiliary Clusters
- Each sample s is stored in an auxiliary cluster (AC).
- An AC represents a group of similar samples by two values: a_c, the centroid of the AC, and a_n, the number of samples represented by the AC.
- The centroids a_c are treated as data points weighted by a_n.
- [Figure: mapper m assigns the samples of D_m to the current G, creates/updates ACs, and assigns each a_c to G; the reducer periodically computes and updates the global centroids G, weighting by a_n.]
- Update and creation of ACs:
  - a new AC is created for each sample s until the maximum number L of ACs is reached;
  - s is added to the closest AC if its distance is smaller than the largest distance between two ACs;
  - otherwise, the two ACs with the smallest distance are merged and a new AC is created for s.
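The three AC rules above can be sketched as follows. We assume Euclidean distance and that centroids are merged as count-weighted means, neither of which the slide spells out; the code also assumes L >= 2 so that pairwise AC distances exist:

```python
import math

def update_acs(acs, s, L):
    """Insert sample s into the auxiliary clusters, following the slide's
    three rules. Each AC is a [centroid, n] pair."""
    if len(acs) < L:                       # rule 1: create ACs until L exist
        acs.append([list(s), 1])
        return acs
    d = lambda a, b: math.dist(a, b)
    pairs = [(d(acs[i][0], acs[j][0]), i, j)
             for i in range(len(acs)) for j in range(i + 1, len(acs))]
    nearest = min(range(len(acs)), key=lambda j: d(s, acs[j][0]))
    if d(s, acs[nearest][0]) < max(p[0] for p in pairs):
        # rule 2: absorb s into the closest AC (weighted-mean update)
        c, n = acs[nearest]
        acs[nearest] = [[(ci * n + si) / (n + 1) for ci, si in zip(c, s)], n + 1]
    else:
        # rule 3: merge the two closest ACs, open a fresh AC for s
        _, i, j = min(pairs)
        (ci, ni), (cj, nj) = acs[i], acs[j]
        merged = [[(a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj)], ni + nj]
        acs = [ac for k, ac in enumerate(acs) if k not in (i, j)] + [merged, [list(s), 1]]
    return acs
```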

Slide 12/14: Experiments: Online K-Means
- Dataset: US Census (1990) with 2.4 million samples.
- Parameters: k = 5; number of ACs L = 1000; dif = MSE.
- [Figure: offline convergence graph; the total sum of distances of online k-means approximates the offline result.]
- The final result R_f of online k-means differs from that of standard k-means, so accuracy is evaluated using the total sum of distances U: standard k-means U_off = 1.52 * 10^9, online k-means U_on = 1.98 * 10^9.
- Accuracy increases with L.
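The accuracy metric U can be read as the summed distance of each sample to its nearest centroid; a sketch under the assumption of plain (unsquared) Euclidean distance, which the slide leaves open:

```python
import math

def total_sum_of_distances(samples, centroids):
    """U: sum over all samples of the distance to the nearest centroid.
    Whether distances are squared is our assumption, not stated on the
    slide; here we use plain Euclidean distance."""
    return sum(min(math.dist(x, c) for c in centroids) for x in samples)
```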

Slide 13/14: Thank You!

Slide 14/14: Summary & Conclusion
- We presented a system model and approaches that exploit the potential of online aggregation in MapReduce.
- We presented an approach to handle iterative algorithms that need multiple MapReduce phases.
- Experiments show that our algorithms are capable of combining the benefits of online aggregation with parallelism in a streaming MapReduce framework.
- Future research includes:
  - adapting other iterative data mining algorithms such as SVM or EM;
  - porting extensions for online aggregation to Hadoop Online.

Slide 15/14 (backup): Progress Estimation
- To estimate progress we monitor: (i) the size of the input data processed by all mappers; (ii) for each key k, how many pairs the mappers emit per input byte.
- Progress is monitored per key k, i.e. how much of the input contributed to a <k,v> pair emitted by a reducer.
- This allows for algorithm-specific progress metrics when the progress of the reducers is heterogeneous.
- [Figure: pairs annotated with progress, e.g. (<k_a, v>, p = 25%) and (<k_b, v>, p = 75%), flow to the online collector thread, which labels its preliminary result with the corresponding progress, here 25%.]
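A possible form of the per-key progress estimate: extrapolate the total number of pairs expected for key k from its observed emission rate per input byte, then compare it with the pairs seen so far. All names are illustrative, not the paper's API:

```python
def progress_per_key(pairs_seen, pairs_per_byte, total_input_bytes):
    """Rough per-key progress: pairs observed for key k so far, divided
    by the pairs expected for k over the whole input (emission rate per
    byte times total input size), clamped to 1.0."""
    return {k: min(1.0, pairs_seen[k] / (pairs_per_byte[k] * total_input_bytes))
            for k in pairs_seen if pairs_per_byte.get(k)}
```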

Slide 16/14 (backup): Experimental Results for Single-Phase K-Means
[Figure: experimental results for single-phase k-means.]