Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros


Overview The purpose of this thesis is to implement a parallel K-means clustering algorithm on the Apache Hadoop framework and benchmark its performance. Outline: data clustering and the K-means algorithm; the Apache Hadoop framework; a parallel K-means implementation on the MapReduce model; experiment results.

Motivation The Big Data explosion produces vast amounts of data that need to be processed and analyzed, both human-generated and machine-generated, creating increased demand for efficient storage and processing solutions.

Data Clustering Machine learning techniques that focus on organizing data into sensible groupings. Unsupervised learning: objects are grouped based on their relationships rather than preset tags or labels.

Clustering Challenges Definition of similarity (familiarity with the available data, data preprocessing); cluster diversity; data representation (number of clusters, data type, data scale).

Cluster Diversity

K-means Clustering One of the most popular machine learning algorithms for clustering data sets. It partitions n observations into k meaningful clusters, aiming to minimize the mean error of all observations with respect to their cluster centers. Centroid-based clustering; hard clustering.

K-means Algorithm Define the number of clusters K and initialize them. Repeat the following steps until the stopping condition is met: assign every point* to its closest cluster; recalculate the cluster centers. (*point: an n-dimensional vector)
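The two steps above can be sketched as a minimal sequential version in plain Python (the function names `kmeans` and `dist2` and the sample points are illustrative, not from the thesis):

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two n-dimensional points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10, seed=0):
    """Define K clusters and initialize them from random points, then
    repeat the assignment and recalculation steps from the slide."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        # Assignment step: every point goes to its closest cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recalculate each center as the mean of its members
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(sorted(kmeans(pts, k=2)))  # one center per well-separated group
```

A fixed iteration count stands in for the slide's stopping condition; a real implementation would stop when the centers no longer move.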

K-means Steps

Apache Hadoop An open-source framework for storage and large-scale processing of data sets on computer clusters. Web scale: terabytes to petabytes of data, hundreds to thousands of machines. Used and developed by industry leaders such as Yahoo!, Facebook and Netflix.

Hadoop Components Hadoop Common (tools, libraries and utilities used by the other Hadoop modules); the Hadoop Distributed File System (HDFS); Hadoop YARN (resource manager, task manager); the Hadoop MapReduce implementation; the Hadoop ecosystem, the underlying platform of many Apache Big Data projects.

Hadoop HDFS A distributed file system modeled after Google's GFS. Runs on commodity hardware and handles failures. Master/slave architecture (a single point of failure?). Designed for BIG files, with a 64-128 MB default block size. Replication across all nodes, which serve as block pools.

Hadoop YARN Yet Another Resource Negotiator, separated from the MapReduce component since version 2. Responsible for resource management, monitoring, task scheduling, job execution and failure management.

Hadoop YARN

Hadoop MapReduce A programming model for processing large data sets with a parallel, distributed algorithm on a cluster; an implementation of the technology introduced by Google, inspired by the map and reduce methods of functional programming. It takes advantage of data locality: parallelism over data.

MapReduce Can be seen as a two-step process. Map step: process the input data and emit it as <key, value> pairs (followed by combining and shuffling). Reduce step: collect the outputs of the mappers, sorted and grouped by their key, and form the final output.

WordCount Example
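The WordCount example can be sketched as a plain-Python simulation of the map, shuffle and reduce phases (function names and sample documents are illustrative, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(document):
    # Map step: emit a <word, 1> pair for every word in the input
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle step: group all emitted values by their key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce step: sum the counts collected for each word
    return (key, sum(values))

docs = ["the quick fox", "the lazy dog the fox"]
pairs = [pair for d in docs for pair in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # e.g. "the" appears 3 times across both documents
```

In real Hadoop the shuffle and sort between the two phases is performed by the framework itself; only the map and reduce functions are user code.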

Apache Hadoop It makes sense only for big amounts of data: the number of mappers comes from dividing the file size by the block size (64-128 MB); every mapper must have enough work to do; each node should have at least 10 mappers; in practice it makes no sense to process anything smaller than 1 GB of data.
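The mapper arithmetic on this slide can be checked with a small helper (the function name is made up; it assumes one input split, and hence one mapper, per HDFS block):

```python
def num_mappers(file_size_bytes, block_size_bytes=128 * 1024 ** 2):
    """Number of splits (and mappers) for a file, assuming one split
    per HDFS block, as described on the slide."""
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

# A 1 GB file with the 128 MB default block size yields 8 mappers;
# halving the block size to 64 MB doubles the split count to 16.
print(num_mappers(1024 ** 3))
print(num_mappers(1024 ** 3, block_size_bytes=64 * 1024 ** 2))
```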

Apache Hadoop Parallel programming is challenging, so Hadoop provides higher-level abstractions: programming interfaces, administration tools, data distribution and measures against race conditions. Scalable and fault tolerant.

K-means on MapReduce The need to parallelize K-means has increased, and it fits the MapReduce model quite well: the map step performs the classification step (data-parallel over points), while the reduce step recomputes the centers (data-parallel over centers).
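One MapReduce job of parallel K-means can be sketched in plain Python (an illustrative simulation, not the thesis's actual Hadoop job; the framework's shuffle is emulated by grouping mapper output by key):

```python
from collections import defaultdict

def kmeans_map(point, centers):
    # Map/classification step: emit <closest-center-index, point>
    dists = [sum((p - c) ** 2 for p, c in zip(point, ctr)) for ctr in centers]
    return (dists.index(min(dists)), point)

def kmeans_reduce(center_index, points):
    # Reduce step: recompute the center as the mean of its points
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_iteration(points, centers):
    # One MapReduce job = one K-means iteration
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centers)
        groups[key].append(value)
    return [kmeans_reduce(k, pts) for k, pts in sorted(groups.items())]

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_iteration(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

In the real job the current centers would be broadcast to every mapper (e.g. via the distributed cache) and the updated centers written back to HDFS for the next iteration.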

Experiment MLSP Amazon Access Data Set: more than 8 million rows in CSV format, with 5-dimensional data points. okeanos cloud infrastructure: a resource pool with 18 cores, 22 GB of RAM and 260 GB of HDD, running Ubuntu Linux 12.04 and Apache Hadoop 2.2.0.

K-means on MapReduce

Execution parameters Input path. Sample size (number of data points). Block size: (a) the block size directly affects the number of splits and, in turn, the number of mappers; (b) a smaller block size means more blocks; (c) the number of blocks equals the number of splits and the number of mappers. Number of clusters. Number of iterations.

Results Master node with 1 core and 4 GB of memory; slave node with 1 core and 2 GB of memory.

Scaling up and out Experiment on 3 nodes with increased resources

K-means on MapReduce Considerations K-means is an iterative algorithm, and every iteration is a new Job; Job initialization produces overhead, and between iterations all the data is written back to disk and then read again. It needs a LOT of data points (~10 million per node) to pay off.

Alternatives Apache Mahout, a machine learning library on top of Hadoop; Apache Spark, an engine for data processing (DAG model); Apache Hama (BSP model); Stratosphere, a MapReduce extension.

Conclusions K-means is a broadly adopted, easy-to-implement and quite efficient clustering algorithm. Apache Hadoop is one of the best frameworks for distributed storage and parallel data processing. MapReduce is becoming deprecated as better alternatives come up.

Data Clustering on Hadoop: Questions?