MapReduce and Friends

MapReduce and Friends. Craig C. Douglas, University of Wyoming, with thanks to Mookwon Seo.

Why was it invented? MapReduce is essentially a distributed mergesort for large distributed-memory computers. It was the basis for the web-page search engine now known as Google, where pages are ranked. It is most useful for embarrassingly parallel applications and has become a paradigm for implementing parallel algorithms.

Is it efficient? Yes: when communication time can be managed so that it is not overwhelming, and it encourages reconsidering how standard algorithms are implemented on distributed parallel machines. No: it frequently uses files instead of pipes to communicate, and it encourages bad algorithm implementations.

Distributed file system We assume we have huge amounts of storage that must be spread across a compute cluster made up of commodity PCs. Files are read and appended, not edited. Files are distributed in chunks (64 MB is a typical size). Chunks are replicated on multiple compute nodes (3 is typical). A master maintains an index of chunk locations.
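As a toy illustration of the master's bookkeeping, here is a sketch of a chunk-location index, assuming 64 MB chunks and 3-way replication; the round-robin placement and node names are made-up simplifications, not how any particular DFS actually places replicas.

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks (typical size)
REPLICATION = 3                 # each chunk stored on 3 nodes (typical)

def place_chunks(file_size, nodes):
    # Master's view: chunk number -> list of nodes holding a replica.
    num_chunks = -(-file_size // CHUNK_SIZE)        # ceiling division
    ring = itertools.cycle(nodes)                   # naive round-robin placement
    return {c: [next(ring) for _ in range(REPLICATION)]
            for c in range(num_chunks)}

index = place_chunks(200 * 1024 * 1024, ["node0", "node1", "node2", "node3"])
print(index)   # e.g. {0: ['node0', 'node1', 'node2'], 1: ['node3', 'node0', 'node1'], ...}
```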

Compute clusters Typically made up of racks of 1U boards with multiple CPUs per board plus local storage. Intra-rack bandwidth is usually 1-40 Gb/s (Ethernet or InfiniBand). The inter-rack connection is usually something fast. Multiple communication fabrics are fairly common on very big clusters.

Distributed file systems Common ones: IBM GPFS, Google GFS, Lustre, Apache HDFS. Wikipedia lists 19 systems (and 3 of the above are not on their list!). Proprietary and open-source ones exist. Some are easy to set up and others are an ongoing nightmare. Reconfiguring a DFS is usually painful.

How does it all interact? The stack, from top to bottom: an SQL-like layer (Pig, Oink, Hive, etc.) or a NoSQL layer (Cassandra, Dynamo, etc.); MapReduce (Hadoop, MR-MPI, etc.); an object store (key-value: BigTable, HBase, etc.); the distributed file system; the compute cluster.

MapReduce systems Common ones: Google's MapReduce (C++), Apache's Hadoop (Java, open source under the Apache License), and MR-MPI (Sandia National Laboratories, BSD license, C/C++ open source). Important features: specialized parallel computing tools; typically the user writes just two serial functions; the whole job need not restart if there is a compute node failure.

Key-value systems Popular ones: Google BigTable, Apache HBase and Cassandra (NoSQL), Amazon Dynamo. Each row is associated with a key. The number of columns in a row can be variable. Each (row, column) has a set of values.
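A toy sketch of that data model, a row key mapping to a variable set of columns, each holding a set of values (plain Python dictionaries; the row keys and column names are invented for illustration).

```python
from collections import defaultdict

# table[row_key][column] -> set of values; rows may have different columns.
table = defaultdict(lambda: defaultdict(set))

table["user:42"]["name"].add("Ada")
table["user:42"]["email"].add("ada@example.org")
table["user:99"]["name"].add("Grace")
table["user:99"]["last_login"].add("2013-11-05")   # a column user:42 does not have

print(dict(table["user:42"]))   # {'name': {'Ada'}, 'email': {'ada@example.org'}}
```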

SQL-like systems Many, many are available. Some popular ones: Apache Hive (open source), which implements QL, a restricted subset of the SQL standard, and sits on top of Hadoop; Yahoo! Pig, which implements a relational algebra; Microsoft SCOPE; Google Sawzall, which implements parallel select + aggregation.

NoSQL systems NoSQL = Not Only SQL (Structured Query Language): a non-relational database with a key-value store structure internally. Examples: Apache HBase and Cassandra, Dynamo, CouchDB. Eventually consistent over quiet periods. Many systems now exist.

SQL versus NoSQL SQL: every record in a table has the same sequence of fields (though not necessarily the same number of fields filled in). NoSQL: documents in a collection may have completely different fields; documents are addressed by a unique key; queries can restrict to a document type.

Why Mergesort? Comparisons and performance: serial computer, O(n log n); parallel computer with n processors, O(log n). Cache-aware versions of mergesort exist. n/2 auxiliary storage is standard, but only O(1) if a linked list is used. Too much copying unless a linked list is used. Lots of communication for parallel computing.

Other sorting methods Heapsort: usually faster on serial computers; impractical for linked lists; O(1) auxiliary storage is standard. Quicksort (serial computers with caches): if quicksort averages C n log n comparisons, mergesort's maximum is roughly C n log n / 1.39, about 0.72 C n log n (quicksort averages about 39% more comparisons than mergesort's worst case). Quicksort is, on average, much faster by clock time.

MapReduce Paradigm A MapReduce system creates a large number of tasks for each of two functions and divides the work among the tasks precisely. Two functions only: Map tasks convert inputs from the DFS to key-value pairs, where the keys are not necessarily unique, and the output is sorted by key. Reduce tasks combine the key-value pairs for a given key, usually with one Reduce task per key, and the output goes to the DFS.
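To make the two-function model concrete, here is a minimal in-memory sketch in plain Python (not tied to Hadoop or any other framework); the word-count Map and Reduce functions are an illustrative choice, not an example from the slides.

```python
from collections import defaultdict

def map_task(record):
    # Map: convert an input record into (key, value) pairs; keys need not be unique.
    for word in record.split():
        yield (word, 1)

def reduce_task(key, values):
    # Reduce: combine all values that share a key.
    return (key, sum(values))

def run_mapreduce(records):
    # "Shuffle": group map output by key (a real system sorts and partitions
    # the pairs and ships each group to a Reduce task).
    groups = defaultdict(list)
    for record in records:
        for key, value in map_task(record):
            groups[key].append(value)
    # One reduce call per key.
    return [reduce_task(k, vs) for k, vs in sorted(groups.items())]

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
```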

What is MapReduce Good for? Matrix-vector and matrix-matrix multiplication (the iterative form of PageRank uses these operations extensively). General relational algebra operations. Join operations in databases. Almost anything embarrassingly parallel that uses lots of data from a DFS. Dealing with failures efficiently.

Failure techniques Re-execute failed tasks, not whole jobs. Some systems do checkpointing and then restart at the last checkpoint, but it is very expensive to dump everything to disk: it adds the cost of extra disk drives that have to be on another compute rack and can lead to early disk failure if used extensively, and moving the data across the inter-rack network is lengthy and may be measured in minutes or fractions of an hour.

What is the obvious Map function? A hash function h(x)! Produce h(x) as the key; the value is x, placed in the h(x) bucket. When mapping finishes, send the h(x) bucket to its Reduce task for combining. An efficient hash-table code is imperative: use memory-cache tricks and sophisticated hash-table implementations, not textbook ones.
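A minimal sketch of that bucketing step, hashing each map-output key into one of a fixed number of buckets, one per Reduce task (plain Python; the use of Python's built-in hash and of four reducers is purely illustrative).

```python
def partition(pairs, num_reducers=4):
    # Each (key, value) pair goes into bucket h(key) mod R; each bucket is
    # later shipped to the Reduce task responsible for it.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets

buckets = partition([("b", 1), ("a", 2), ("b", 3), ("c", 4)])
for r, bucket in enumerate(buckets):
    print(f"reduce task {r}: {bucket}")
```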

MapReduce variant of Join Suppose we have a chunked file with lots of edges from one vertex to another for a graph. We want to find all edges of the form R(a,b) and S(b,c) and join them to create T(a,c) if it does not already exist. Map: key on the b value, using a hash function h from b values to k buckets. Reduce: deal with one bucket.

MapReduce variant of Join Tuple R(a,b) goes to Reduce task h(b) with key = b, value = R(a,b). Tuple S(b,c) goes to Reduce task h(b) with key = b, value = S(b,c). If R(a,b) joins with S(b,c), then both edges are sent to Reduce task h(b), and their join (a,b,c) is appended to the output file on the DFS.

Example of Join Suppose we have a directed graph on vertices 1, 2, 3, 4 with edges (1,2), (1,3), (1,4), (2,3), (3,4).

Example of Join Map output grouped by key b (here R and S are both the edge set above):
key 1: R(a,b) tuples: none; S(b,c) tuples: (1,2), (1,3), (1,4)
key 2: R(a,b) tuples: (1,2); S(b,c) tuples: (2,3)
key 3: R(a,b) tuples: (1,3), (2,3); S(b,c) tuples: (3,4)
key 4: R(a,b) tuples: (1,4), (3,4); S(b,c) tuples: none
Reduce output: T(1,3), T(1,4), T(2,4).
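A minimal in-memory sketch that reproduces this example (plain Python; the shuffle to Reduce tasks is simulated with a dictionary keyed by b, and the helper name join_mapreduce is invented for illustration).

```python
from collections import defaultdict

def join_mapreduce(R, S):
    # Map: tag each tuple with its relation and key it by the join attribute b.
    groups = defaultdict(lambda: {"R": [], "S": []})
    for a, b in R:
        groups[b]["R"].append(a)          # key = b, value = R(a, b)
    for b, c in S:
        groups[b]["S"].append(c)          # key = b, value = S(b, c)
    # Reduce: for each key b, emit T(a, c) for every pairing of R and S tuples.
    T = []
    for b, bucket in groups.items():
        for a in bucket["R"]:
            for c in bucket["S"]:
                T.append((a, c))
    return sorted(set(T))

edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
print(join_mapreduce(edges, edges))       # [(1, 3), (1, 4), (2, 4)]
```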

Matrix-vector multiplication Compute y = Mx. For an NxN matrix M = [m_ij] and N-vectors x = [x_i] and y = [y_i], y_i = m_i1 x_1 + m_i2 x_2 + ... + m_iN x_N. Simplest Map function: key = i, value = m_ij x_j (optimization: ignore zero entries). Works for dense and sparse matrices. The Reduce function adds up the products for a given i. Inexcusably inefficient in this form, however.

Matrix-vector multiplication example Let M = [m_11 m_12 m_13; m_21 m_22 m_23; m_31 m_32 m_33] and x = [x_1; x_2; x_3].
Map output:
key 1: m_11 x_1, m_12 x_2, m_13 x_3
key 2: m_21 x_1, m_22 x_2, m_23 x_3
key 3: m_31 x_1, m_32 x_2, m_33 x_3
Reduce output:
key 1: m_11 x_1 + m_12 x_2 + m_13 x_3
key 2: m_21 x_1 + m_22 x_2 + m_23 x_3
key 3: m_31 x_1 + m_32 x_2 + m_33 x_3
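A sketch of this formulation in plain Python, with the matrix given as (i, j, value) triples; the triple representation is an illustrative choice, not prescribed by the slides.

```python
from collections import defaultdict

def matvec_mapreduce(entries, x):
    # entries: iterable of (i, j, m_ij) triples; x: list indexed by j.
    # Map: emit key = i, value = m_ij * x_j for each nonzero entry.
    groups = defaultdict(list)
    for i, j, m_ij in entries:
        if m_ij != 0:                      # optimization: skip zeros
            groups[i].append(m_ij * x[j])
    # Reduce: sum the products for each row i to get y_i.
    return {i: sum(vals) for i, vals in groups.items()}

M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0), (2, 0, 4.0), (2, 2, 5.0)]
x = [1.0, 1.0, 1.0]
print(matvec_mapreduce(M, x))              # {0: 3.0, 1: 3.0, 2: 9.0}
```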

Matrix-vector multiplication A better approach when x and y are small enough to fit on all nodes: input M by sets of rows and assign the key k based on the row sets. Compute whole y_i's as value elements and store a subvector of y as the value for key k. The Reduce task just writes out the subvectors to the DFS. When x and y are too big, apply 2D domain decomposition methods to M, x, and y.

Matrix-matrix multiplication Compute C = AB. For an NxL matrix A = [a_ij], an LxM matrix B = [b_ij], and an NxM matrix C = [c_ij], C is formed from NxM inner products. Simplest Map function: apply the matrix-vector product formulation, except that the key must identify both the row of A and the column of B: key = (i,k), value = the individual product a_ij b_jk. Works for dense and sparse matrices. The Reduce function adds up the products for a given (i,k).
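A sketch of this one-step formulation with (i,k) keys (plain Python with triple-format inputs; the local grouping of B by j merely simulates making b_jk available to the Map tasks).

```python
from collections import defaultdict

def matmul_mapreduce(A, B):
    # A: (i, j, a_ij) triples; B: (j, k, b_jk) triples.
    b_by_j = defaultdict(list)
    for j, k, b_jk in B:
        b_by_j[j].append((k, b_jk))
    # Map: for every matching j, emit key (i, k) with value a_ij * b_jk.
    groups = defaultdict(list)
    for i, j, a_ij in A:
        for k, b_jk in b_by_j[j]:
            groups[(i, k)].append(a_ij * b_jk)
    # Reduce: sum the products for each (i, k) to get c_ik.
    return {ik: sum(vals) for ik, vals in groups.items()}

A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)]
B = [(0, 0, 4.0), (1, 0, 5.0), (0, 1, 6.0)]
print(matmul_mapreduce(A, B))   # {(0, 0): 14.0, (0, 1): 6.0, (1, 0): 12.0, (1, 1): 18.0}
```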

Matrix-matrix multiplication This is unbelievably inefficient, but seen in practice. Better approach for the Map function: assume each matrix is stored in blocks of size n x n (pad with zeros at the right and bottom of a matrix), where n is convenient for your DFS, and do the matrix-matrix multiplication using a block scheme. Never, ever do a formal transpose on a DFS. This still works for dense and sparse matrices.
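A sketch of the block scheme, assuming both matrices are already cut into full n x n blocks held in dictionaries keyed by block indices (plain Python with NumPy; a real system would read and write the blocks on the DFS rather than keep them in memory).

```python
import numpy as np
from collections import defaultdict

def blocked_matmul(A_blocks, B_blocks, n_blocks):
    # A_blocks[(I, J)] and B_blocks[(J, K)] are n x n NumPy blocks
    # (the matrices are padded with zeros so every block is full size).
    # Map: emit key (I, K) with value A_block @ B_block for each shared J.
    groups = defaultdict(list)
    for (I, J), a in A_blocks.items():
        for K in range(n_blocks):
            b = B_blocks.get((J, K))
            if b is not None:
                groups[(I, K)].append(a @ b)
    # Reduce: sum the partial block products for each output block (I, K).
    return {IK: sum(parts) for IK, parts in groups.items()}

n = 2
A = np.arange(16, dtype=float).reshape(4, 4)
B = np.eye(4)
A_blocks = {(I, J): A[I*n:(I+1)*n, J*n:(J+1)*n] for I in range(2) for J in range(2)}
B_blocks = {(J, K): B[J*n:(J+1)*n, K*n:(K+1)*n] for J in range(2) for K in range(2)}
C_blocks = blocked_matmul(A_blocks, B_blocks, n_blocks=2)
C = np.block([[C_blocks[(0, 0)], C_blocks[(0, 1)]],
              [C_blocks[(1, 0)], C_blocks[(1, 1)]]])
print(np.allclose(C, A @ B))   # True
```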

Security and scaling Most MapReduce systems (e.g., Hadoop) provide no security or firewalling abilities: all users have access to everything in the databases, and there is no encryption by default. This allows for far better scaling on parallel systems, but security is extremely difficult to add later while still scaling. Medical record systems in the USA using Hadoop are on notice that they are not in compliance with privacy laws in effect on January 1, 2014. A big disaster.

Quick summary MapReduce is a distributed mergesort using disk files as intermediaries. Replace the mergesort with a fast sorting algorithm in each Reduce task. There is no reason to restrict yourself to slow disk files if everything fits in the global memory of the compute cluster. Use MPI or OpenMP communication techniques from traditional supercomputing.