
Distributed Systems
21. Graph Computing Frameworks: Can we make MapReduce easier?
Paul Krzyzanowski
Rutgers University
Fall 2016

Apache Pig

Why?
- Make it easy to use MapReduce via scripting instead of Java
- Make it easy to use multiple MapReduce stages
- Built-in common operations for join, group, filter, etc.

How to use?
- Use Grunt, the pig shell
- Submit a script directly to pig
- Use the PigServer Java class (see the sketch at the end of this Pig section)
- Use PigPen, an Eclipse plugin

Count Job (in Pig Latin)

A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';

Pig Framework
- Parse, check, and optimize the script
- Plan the execution, submit the jar to Hadoop, and monitor progress
- Hadoop execution: Map (Filter), Reduce (Counter)
- Pig compiles to several Hadoop MapReduce jobs

Pig: Loading Data

Load/store relations in the following formats:
- PigStorage: field-delimited text
- BinStorage: binary files
- BinaryStorage: single-field tuples with a value of bytearray
- TextLoader: plain text
- PigDump: stores using toString() on tuples, one per line

Example

log = LOAD 'test.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log) AS cnt;
fltrd = FILTER cntd BY cnt > 50;
srtd = ORDER fltrd BY cnt;
STORE srtd INTO 'output';

- Each statement defines a new dataset; datasets can be given aliases to be used later
- FOREACH iterates over the members of a bag
  - Input is grpd: the list of log entries grouped by user
  - Output is group, COUNT(log): a list of {user, count} pairs
- FILTER applies conditional filtering
- ORDER applies sorting

See pig.apache.org for the full documentation.
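The slides list the PigServer Java class as one way to run Pig. As a rough illustration (not from the original slides), here is a minimal sketch that submits the log-analysis example above programmatically; the class and method names follow Pig's documented PigServer API, while the input path and schema are just the ones used in the example.

// Hedged sketch: running the log-analysis pipeline through PigServer
// instead of the Grunt shell or a script file.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class LogQuery {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL is handy for testing; MAPREDUCE submits to Hadoop
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery() defines one dataset, exactly as in the script
        pig.registerQuery("log = LOAD 'test.log' AS (user, timestamp, query);");
        pig.registerQuery("grpd = GROUP log BY user;");
        pig.registerQuery("cntd = FOREACH grpd GENERATE group, COUNT(log) AS cnt;");
        pig.registerQuery("fltrd = FILTER cntd BY cnt > 50;");
        pig.registerQuery("srtd = ORDER fltrd BY cnt;");

        // Triggers compilation into MapReduce jobs and writes the result
        pig.store("srtd", "output");
    }
}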

MapReduce isn't always the answer

MapReduce works well for certain problems:
- Provides automatic parallelization
- Automatic job distribution

For others:
- May require many iterations
- Data locality is usually not preserved between Map and Reduce
- Lots of communication between map and reduce workers

Bulk Synchronous Parallel (BSP)

A computing model for parallel computation: a series of supersteps, each consisting of
1. Computation
2. Communication
3. Barrier synchronization

[Figure: a timeline of supersteps (superstep 1 through superstep 5), each with a compute phase, a communication phase, and a barrier]

BSP: 1. Computation
- Processes (workers) are randomly assigned to processors
- Each process uses only local data
- Each computation is asynchronous of other concurrent computation
- Computation time may vary

BSP: 2. Communication
- End of superstep: messages have been received from all workers
- Start of the next superstep: messages are delivered to all workers
- Messaging is restricted to the end of a computation superstep
- Each worker sends a message to 0 or more workers
- These messages are inputs for the next superstep

BSP: 3. Barrier synchronization
- The next superstep does not begin until all messages have been received
- Barriers ensure no deadlock: no circular dependency can be created
- Barriers provide an opportunity to checkpoint results for fault tolerance
  - On failure, restart the computation from the last superstep

BSP Implementation: Apache Hama

Hama: a BSP framework on top of HDFS
- Provides automatic parallelization & distribution
- Uses Hadoop RPC
  - Data is serialized with Google Protocol Buffers
- ZooKeeper for coordination (the Apache version of Google's Chubby)
  - Handles notifications for barrier sync

Good for applications with data locality
- Matrices and graphs
- Algorithms that require a lot of iterations

Hama programming (high-level)

Pre-processing
- Define the number of peers for the job
- Split the initial input for each of the peers to run their supersteps
- The framework assigns a unique ID to each worker (peer)

Superstep: the worker function is a superstep
- getCurrentMessage(): input messages from the previous superstep
- your code
- send(peer, msg): send messages to a peer
- sync(): synchronize with other peers (barrier)

File I/O
- Key/value model used by Hadoop MapReduce & HBase (the Apache version of Bigtable)
- readNext(key, value)
- write(key, value)

(A sketch of a Hama BSP worker follows the links below.)

For more information

Architecture, examples, API: take a look at
- Apache Hama project page: http://hama.apache.org
- Hama BSP tutorial: https://hama.apache.org/hama_bsp_tutorial.html
- Apache Hama Programming document: http://bit.ly/1aifbxs
  http://people.apache.org/~tjungblut/downloads/hamadocs/apachehamabspprogrammingmodel_06.pdf
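As a rough, illustrative sketch (not from the slides), here is what a small Hama BSP worker might look like using the peer operations listed above. The BSP and BSPPeer class names and signatures are assumed to match the Hama BSP tutorial linked above; check them against the Hama release you use. Each peer sends one value to every peer, hits the barrier, and then reduces the messages it received in the next superstep.

// Hedged sketch of a Hama BSP job: compute the maximum of one value per peer.
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class MaxValueBSP
    extends BSP<NullWritable, NullWritable, Text, IntWritable, IntWritable> {

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, Text, IntWritable, IntWritable> peer)
      throws IOException, SyncException, InterruptedException {

    // Superstep 1: local computation, then send a message to every peer.
    // The random value is a stand-in for real local data (normally read
    // with readNext(key, value)).
    int myValue = new Random().nextInt(100);
    for (String other : peer.getAllPeerNames()) {
      peer.send(other, new IntWritable(myValue));
    }
    peer.sync();  // barrier: end of superstep 1

    // Superstep 2: consume the messages delivered at the barrier
    int max = Integer.MIN_VALUE;
    IntWritable msg;
    while ((msg = peer.getCurrentMessage()) != null) {
      max = Math.max(max, msg.get());
    }
    peer.write(new Text(peer.getPeerName()), new IntWritable(max));
  }
}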

Graphs are common in computing
- Social links (friends)
- Academic citations
- Music
- Movies
- Web pages
- Network connectivity
- Roads
- Disease outbreaks

Processing graphs on a large scale is hard

Computation with graphs:
- Poor locality of memory access
- Little work per vertex
- Distribution across machines: communication complexity, failure concerns

Solutions:
- Application-specific, custom solutions
- MapReduce or databases: but these require many iterations (and a lot of data movement)
- Single-computer libraries: limits scale
- Parallel libraries: do not address fault tolerance
- BSP: close, but too general

Pregel: a vertex-centric BSP

[Figure: an example graph with vertex values and edge weights, from Google's post on large-scale graph computing]
http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html

Pregel: computation

Input: a directed graph
- A vertex is an object
  - Each vertex is uniquely identified with a name
  - Each vertex has a modifiable value
- Directed edges: links to other objects
  - Each edge is associated with its source vertex
  - Each edge has a modifiable value
  - Each edge has a target vertex identifier

Computation: a series of supersteps
- The same user-defined function runs on each vertex
  - Receives messages sent in the previous superstep
  - May modify the state of the vertex or of its outgoing edges
  - Sends messages that will be received in the next superstep (typically to outgoing edges, but they can be sent to any known vertex)
  - May modify the graph topology
- Each superstep ends with a barrier (synchronization point)

Pregel: termination

Pregel terminates when every vertex votes to halt
- Initially, every vertex is in an active state
  - Active vertices compute during a superstep
- Each vertex may choose to deactivate itself by voting to halt
  - The vertex has no more work to do
  - It will not be executed by Pregel UNLESS the vertex receives a message; then it is reactivated and stays active until it votes to halt again
- The algorithm terminates when all vertices are inactive and there are no messages in transit

Vertex state machine: a vertex moves from Active to Inactive when it votes to halt, and back to Active when it receives a message

Pregel: output

Output is the set of values output by the vertices
- Often a directed graph
  - It may be non-isomorphic to the original, since edges and vertices can be added or deleted
- Or summary data

Examples of graph computations
- Shortest path to a node: in each iteration, a node sends the shortest distance it has received to all neighbors (a sketch follows this list)
- Cluster identification: in each iteration, get information about clusters from neighbors and add myself; pass useful clusters to neighbors (e.g., within a certain depth or size); related vertices may be combined. The output is a smaller set of disconnected vertices representing the clusters of interest
- Graph mining: traverse a graph and accumulate global statistics
- Page rank: in each iteration, update web page ranks based on messages from incoming links
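To make the shortest-path example above concrete, here is a hedged sketch modeled on Apache Giraph's well-known single-source shortest-paths example (Giraph, an open-source Pregel, is discussed near the end of these slides). Class and method names follow Giraph's Java API as commonly documented and should be checked against your Giraph version; SOURCE_ID and the Writable types chosen here are illustrative assumptions.

// Hedged sketch: single-source shortest paths in the Pregel/Giraph style.
import java.io.IOException;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ShortestPathsComputation extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  // Assumed source vertex for this illustration
  private static final long SOURCE_ID = 0;

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    // Superstep 0: initialize every distance to "infinity"
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }

    // Shortest distance heard so far: 0 at the source, else the best message
    double minDist = (vertex.getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }

    // If we improved, record it and tell the neighbors; otherwise stay quiet
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        double distance = minDist + edge.getValue().get();
        sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance));
      }
    }

    // Vote to halt; an incoming message will reactivate this vertex
    vertex.voteToHalt();
  }
}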

Example: finding the maximum value
- Each vertex contains a value
- In the first superstep, a vertex sends its value to its neighbors
- In each successive superstep:
  - If a vertex learned of a larger value from its incoming messages, it sends it to its neighbors
  - Otherwise, it votes to halt
- Eventually, all vertices hold the largest value
- When no vertices change in a superstep, the algorithm terminates

Semi-pseudocode (the three template parameters of Vertex are: 1. the vertex value type, 2. the edge value type (none!), 3. the message value type):

class MaxValueVertex : public Vertex<int, void, int> {
  void Compute(MessageIterator *msgs) {
    // find the maximum value
    int maxv = GetValue();
    for (; !msgs->Done(); msgs->Next())
      maxv = max(msgs->Value(), maxv);

    if ((maxv > GetValue()) || (superstep() == 0)) {
      // send the maximum value to all edges
      *MutableValue() = maxv;
      OutEdgeIterator out = GetOutEdgeIterator();
      for (; !out.Done(); out.Next())
        SendMessageTo(out.Target(), maxv);
    } else {
      VoteToHalt();
    }
  }
};

Superstep 0: each vertex propagates its own value to the connected vertices

Superstep 1:
- V0 updates its value: 6 > 3
- V3 updates its value: 6 > 1
- V1 and V2 do not update, so they vote to halt

Superstep 2:
- V1 receives a message and becomes active again
- V2 updates its value: 6 > 2
- V0, V1, and V3 do not update, so they vote to halt

Superstep 3:
- V1 receives a message and becomes active
- V3 receives a message and becomes active
- No vertices update their value; all vote to halt. Done!

[Figure: values of V0, V1, V2, V3 after each superstep]
Superstep 0: 3 6 2 1
Superstep 1: 6 6 2 6
Superstep 2: 6 6 6 6
Superstep 3: 6 6 6 6 (all vertices inactive)

Locality
- Vertices and edges remain on the machine that does the computation
- To run the same algorithm in MapReduce:
  - Requires chaining multiple MapReduce operations
  - The entire graph state must be passed from Map to Reduce, and again as input to the next Map

Pregel API: Basic operations

A user subclasses a Vertex class. Methods:
- Compute(MessageIterator*): executed per active vertex in each superstep
  - MessageIterator identifies incoming messages from previous supersteps
- GetValue(): get the current value of the vertex
- MutableValue(): set the value of the vertex
- GetOutEdgeIterator(): get a list of outgoing edges
  - .Target(): identify the target vertex of an edge
  - .GetValue(): get the value of the edge
  - .MutableValue(): set the value of the edge
- SendMessageTo(): send a message to a vertex
  - Any number of messages can be sent
  - Ordering among messages is not guaranteed
  - A message can be sent to any vertex (but our vertex needs to have its ID)

Pregel API: Advanced operations

Combiners
- Each message has an overhead; let's reduce the number of messages
- Many vertices are processed per worker (multi-threaded)
- Pregel can combine messages targeted to one vertex into one message
- Combiners are application specific
  - The programmer subclasses a Combiner class and overrides the Combine() method
  - There is no guarantee on which messages may be combined

[Figure: two combiner examples: one combiner sums its input messages into a single message; another keeps only the minimum value]
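As an illustration of the combiner contract just described, the sketch below defines a tiny stand-in Combiner interface in Java (Pregel's real Combiner is a C++ class; this interface is hypothetical) together with the two combiners from the figure: one that sums messages and one that keeps the minimum. Because Pregel gives no guarantee about which messages get combined, only reductions that are commutative and associative are safe here.

// Illustrative only: a hypothetical stand-in for Pregel's Combiner class.
import java.util.Arrays;
import java.util.List;

interface Combiner<M> {
    // Reduce many messages bound for one vertex into a single message
    M combine(Iterable<M> messages);
}

class SumCombiner implements Combiner<Integer> {
    public Integer combine(Iterable<Integer> messages) {
        int sum = 0;
        for (int m : messages) sum += m;   // order and grouping do not matter
        return sum;
    }
}

class MinCombiner implements Combiner<Integer> {
    public Integer combine(Iterable<Integer> messages) {
        int min = Integer.MAX_VALUE;
        for (int m : messages) min = Math.min(min, m);
        return min;
    }
}

public class CombinerDemo {
    public static void main(String[] args) {
        // Example: messages queued for a single target vertex
        List<Integer> pending = Arrays.asList(4, 8, 1, 5, 6);
        System.out.println(new SumCombiner().combine(pending)); // one message: 24
        System.out.println(new MinCombiner().combine(pending)); // one message: 1
    }
}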

Pregel API: Advanced operations

Aggregators
- Aggregators handle global data
  - A vertex can provide a value to an aggregator during a superstep
  - The aggregator combines the received values into one value
  - That value is available to all vertices in the next superstep
- The user subclasses an Aggregator class
- Examples:
  - Keep track of the total number of edges in a graph
  - Generate histograms of graph statistics
  - Global flags: execute until some global condition is satisfied
  - Election: find the minimum or maximum vertex

Topology modification
- Add or remove edges and vertices
- Modifications become visible in the next superstep
- Examples:
  - If we're computing a spanning tree: remove unneeded edges
  - If we're clustering: combine vertices into one vertex

Pregel Design

Execution environment
- Many copies of the program are started on a cluster of machines (a rack holds roughly 40-80 computers; a cluster holds 1,000s to 10,000+ computers)
- One copy becomes the master
  - It is not assigned a portion of the graph
  - It is responsible for coordination
- The cluster's name server is Chubby
  - The master registers itself with the name service
  - Workers contact the name service to find the master

Partition assignment
- The master determines the number of partitions in the graph
  - Partition = a set of vertices
  - Default: for N partitions, a vertex goes to partition hash(vertex ID) mod N
  - The assignment may deviate from the default: e.g., place vertices representing the same web site in one partition
- One or more partitions are assigned to each worker
  - More than one partition per worker improves load balancing

Worker
- Responsible for its section of the graph
- Each worker knows the vertex assignments of the other workers
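A minimal sketch of the default partitioning rule above; everything except the hash(vertex ID) mod N rule (the class, the worker list, the helper names) is an illustrative assumption. Since every worker knows this mapping, any worker can route a message for a remote vertex to the worker that owns it.

// Hedged sketch: default Pregel-style partition and worker lookup.
import java.util.List;

public class Partitioner {
    private final int numPartitions;     // N, chosen by the master
    private final List<String> workers;  // worker addresses, assigned by the master

    public Partitioner(int numPartitions, List<String> workers) {
        this.numPartitions = numPartitions;
        this.workers = workers;
    }

    // Default rule from the slides: hash(vertex ID) mod N
    public int partitionOf(String vertexId) {
        return Math.floorMod(vertexId.hashCode(), numPartitions);
    }

    // With more partitions than workers, several partitions map to one
    // worker, which helps balance load across workers of differing speed.
    public String workerFor(String vertexId) {
        return workers.get(partitionOf(vertexId) % workers.size());
    }
}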

Input assignment
- The master assigns parts of the input to each worker
  - The data usually sits in GFS or Bigtable
  - Input = a set of records; record = vertex data and edges
  - Assignment is based on file boundaries
- The worker reads the input
  - If a record belongs to one of the vertices it manages, messages are sent locally
  - Otherwise, the worker sends messages to the remote workers
- After the data is loaded, all vertices are active

Computation
- The master tells each worker to perform a superstep
- The worker:
  - Iterates through its vertices (one thread per partition)
  - Calls the Compute() method for each active vertex
  - Delivers messages from the previous superstep
  - Outgoing messages are sent asynchronously but are delivered before the end of the superstep
- When done, the worker tells the master how many vertices will be active in the next superstep
- The computation is done when there are no more active vertices in the cluster
  - The master may then instruct the workers to save their portion of the graph

Handling failure
- Checkpointing
  - Controlled by the master: every N supersteps, the master asks workers to checkpoint at the start of a superstep
  - The state of the partitions is saved to persistent storage: vertex values, edge values, incoming messages
  - The master is responsible for saving aggregator values
- Failure detection
  - The master sends ping messages to the workers
  - If a worker does not receive a ping within a time period, the worker terminates
  - If the master does not hear from a worker, the master marks that worker as failed
- Recovery
  - When a failure is detected, the master reassigns partitions to the current set of workers
  - All workers reload their partition state from the most recent checkpoint

Pregel outside of Google: Apache Giraph
- Initially created at Yahoo
- Used at Facebook to analyze the social graph of its users
- Runs under the Hadoop MapReduce framework, as a Map-only job
- Adds fault tolerance to the master by using ZooKeeper (the Apache version of Chubby) for coordination
- Uses Java instead of C++

Conclusion
- Vertex-centric approach to BSP
  - Computation = a set of supersteps
  - Compute() is called on each vertex per superstep
  - Communication between supersteps: barrier synchronization
- Hides distribution from the programmer
  - The framework creates lots of workers, distributes partitions among them, and distributes the input
  - It handles message sending, receipt, and synchronization
  - A programmer just has to think from the viewpoint of a vertex
- Checkpoint-based fault tolerance

The End