Programming Systems for Big Data

CS315B Lecture 17. Including material from Kunle Olukotun.

Big Data. We've focused on parallel programming for computational science. There is another class of programming systems focused on Big Data: MapReduce, Spark, and TensorFlow.

Warehouse-Size Cluster. (figure)

Example: Google Cluster. (figure)

Commodity Cluster Architecture. Each rack contains 16-64 nodes (8 cores, 64-256 GB of memory, and 10-30 TB of disk per node). A switch provides 1 Gbps between any pair of nodes in a rack, with a 2-10 Gbps backbone between racks. (figure)

Commodity Cluster Trends. (figure)

Storing Big Data.

Stable Storage. If nodes can fail, how can we store data persistently? Answer: a distributed file system, which provides a global file namespace (GFS, HDFS; note: not HDF5!). Typical usage pattern: huge files (100s of GB to TB), data is rarely updated in place, and reads and appends are common (e.g. log files).

Distributed File System. Chunk servers (a.k.a. DataNodes in HDFS): the file is split into contiguous chunks, typically 16-128 MB each; each chunk is replicated (usually 2x or 3x), and the system tries to keep replicas in different racks. Master node (a.k.a. NameNode in HDFS): stores metadata, and might itself be replicated. Client library for file access: talks to the master to find the chunk (data) servers, then connects directly to the chunk servers to access data.

Hadoop Distributed File System (HDFS). Global namespace. Files are broken into blocks, typically 128 MB each, and each block is replicated on multiple DataNodes. Intelligent client: the client can find the location of blocks and accesses data directly from the DataNodes.
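To make the client protocol concrete, here is a small sketch against the standard Hadoop FileSystem API (the cluster address and file path are made up): the client first asks the NameNode for block locations, and only then reads bytes directly from DataNodes.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020")       // hypothetical cluster address

    val fs   = FileSystem.get(conf)                        // client handle backed by the NameNode
    val file = new Path("/logs/part-00000")                // hypothetical file

    // Metadata step: the NameNode returns, for each block, its offset, length,
    // and the DataNodes holding replicas. No file data moves yet.
    val status = fs.getFileStatus(file)
    for (block <- fs.getFileBlockLocations(status, 0L, status.getLen))
      println(s"offset=${block.getOffset} len=${block.getLength} hosts=${block.getHosts.mkString(",")}")

    // Data step: open() streams bytes directly from the DataNodes, not through the master.
    val in = fs.open(file)
    in.close()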

MapReduce.

The Programming Model. A program consists of two functions: a map function f and a reduce function g. In the map phase, the map function f is applied to every data chunk, and the output is a set of <key, value> pairs. In the reduce phase, the reduce function g is applied once to all values with the same key.
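To make the two-function shape concrete, here is a minimal word-count sketch (a local simulation in plain Scala, not the actual MapReduce runtime; the names mapF and reduceG are ours):

    // Map function f: one input chunk -> a set of <key, value> pairs
    def mapF(chunk: String): Seq[(String, Int)] =
      chunk.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

    // Reduce function g: a key and all of its values -> one result
    def reduceG(word: String, counts: Seq[Int]): (String, Int) = (word, counts.sum)

    val chunks = Seq("the quick brown fox", "the lazy dog")

    // Map phase: apply f to every chunk; "shuffle": group pairs by key
    val grouped = chunks.flatMap(mapF).groupBy(_._1)

    // Reduce phase: apply g once per key
    val counts = grouped.map { case (w, pairs) => reduceG(w, pairs.map(_._2)) }
    counts.toSeq.sortBy(_._1).foreach(println)   // (brown,1), (dog,1), (fox,1), (lazy,1), (quick,1), (the,2)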

Picture. Input -> Map -> intermediate <key, value> pairs -> Reduce -> Output. (figure)

What is MapReduce? A dataflow language: a graph of nodes that are computations and edges that carry data. In particular, MapReduce graphs are acyclic. Like Legion, StarPU, ... but very restricted.

MapReduce Provides. Automatic parallelization and distribution, fault tolerance, I/O scheduling, and monitoring and status updates.

MapReduce: Distributed Execution. The user program forks a master and a set of workers. The master assigns map and reduce tasks. Map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate results to local disk; reduce workers do remote reads and sorting, then write the output files (Output File 0, Output File 1). (figure)

Data Flow. Input and final output are stored on a DFS. The scheduler tries to schedule map tasks close to the physical storage location of the input data (same node or same rack); data locality of I/O is important because the bisection bandwidth of the network is low (~10 Gb/s). Intermediate results are stored on the local FS of the map and reduce workers. The output is often the input to another MapReduce task.

Coordination: The Master. Master data structures: task status (idle, in-progress, completed). Idle tasks get scheduled as workers become available. When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer, and the master pushes this information to the reducers. The master pings workers periodically to detect failures.

Failures. Map worker failure: reduce workers are notified when the task is rescheduled on another worker. Reduce worker failure: the reduce task is rescheduled. Master failure: the MapReduce job is aborted and the client is notified.

How many Map and Reduce jobs? M map tasks, R reduce tasks. Rule of thumb: make M and R much larger than the number of CPUs in the cluster (e.g., 8,000 CPUs and M = 800,000 gives 100 map tasks per CPU). One DFS chunk per map task is common (800,000 x 128 MB = 102 TB). This improves dynamic load balancing and speeds recovery from worker failure. Usually R is smaller than M, because the output is spread across R files.

Partition Function. Inputs to map tasks are created by contiguous splits of the input file at chunk granularity. For reduce, we need to ensure that records with the same intermediate key end up at the same worker. The system uses a default partition function, e.g. hash(key) mod R. It is sometimes useful to override it: e.g., hash(hostname(url)) mod R ensures that URLs from the same host end up in the same output file.

Combiners. Often a map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k, e.g. popular words in word count. We can save network time by pre-aggregating at the mapper: combine(k1, list(v1)) -> v2. The combiner is usually the same as the reduce function; this works only if the reduce function is commutative and associative.
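A small sketch of both ideas in plain Scala (the helpers below are illustrative, not the Hadoop Partitioner/Combiner interfaces):

    import java.net.URI

    val R = 4                                        // number of reduce tasks (assumed)

    // Default partition function: hash(key) mod R
    def defaultPartition(key: String): Int = Math.floorMod(key.hashCode, R)

    // Override for URLs: hash(hostname(url)) mod R, so all pages from one host
    // land in the same reduce task and therefore in the same output file
    def hostPartition(url: String): Int = Math.floorMod(new URI(url).getHost.hashCode, R)

    hostPartition("http://example.com/a") == hostPartition("http://example.com/b")   // true

    // Combiner for word count: same shape as the reducer (sum is commutative and
    // associative), applied to one mapper's local output before it hits the network
    def combine(key: String, values: Seq[Int]): (String, Int) = (key, values.sum)

    val mapperOutput  = Seq(("the", 1), ("the", 1), ("dog", 1))
    val preAggregated = mapperOutput.groupBy(_._1).map { case (k, vs) => combine(k, vs.map(_._2)) }
    // preAggregated: Map(the -> 2, dog -> 1)

Because the combiner runs on the map side, the network only sees one pair per key per mapper instead of one pair per occurrence.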

Execution Summary. 1. Partition input key/value pairs into chunks and run map() tasks in parallel. 2. After all map()s are complete, consolidate all emitted values for each unique emitted key. 3. Partition the space of output map keys and run reduce() in parallel. If a map() or reduce() task fails, re-execute it.

MapReduce & Hadoop Conclusions. MapReduce has proven to be a useful abstraction for huge-scale data parallelism. It greatly simplifies large-scale computations at Google, Yahoo, etc., and is easy to use: the library deals with the messy details of task placement, data movement, and fault tolerance. It is not efficient or expressive enough for all problems, and requires huge data to be worthwhile.

Spark.

Spark Goals. Extend MapReduce to better support two common classes of data analytics: iterative algorithms (machine learning, graphs) and interactive data mining.

Scala. Spark is integrated into the Scala programming language, a Java dialect with functional programming features. This improves programmability over MapReduce implementations, mostly because Scala is just a more modern programming language.

Motivation. MapReduce is inefficient for applications that repeatedly reuse data. Recall that MapReduce programs are acyclic: the only way to encode an iterative algorithm is to wrap a MapReduce program in a loop, which implies the data is reloaded from stable storage on each iteration.

Programming Model. Resilient distributed datasets (RDDs): immutable, partitioned collections of objects, created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage, and cacheable for efficient reuse. Actions on RDDs (count, reduce, collect, save, ...) generate a result on the master.

Transformations.

    // Load a text file from local FS, HDFS, or S3
    val rdd = spark.textFile("hdfs://namenode:0/path/file")

    // Create an RDD from a Scala collection
    val nums = spark.parallelize(List(1, 2, 3))

    // Pass each element through a function
    val squares = nums.map(x => x * x)            // {1, 4, 9}

    // Keep elements passing a predicate
    val even = squares.filter(x => x % 2 == 0)    // {4}

    // Map each element to zero or more others (the sequence of numbers 1, 2, ..., x)
    nums.flatMap(x => 1 to x)                     // {1, 1, 2, 1, 2, 3}

Actions.

    val nums = spark.parallelize(List(1, 2, 3))

    // Retrieve RDD contents as a local collection (could be too big!)
    nums.collect()                       // => Array(1, 2, 3)

    // Return the first K elements
    nums.take(2)                         // => Array(1, 2)

    // Count the number of elements
    nums.count()                         // => 3

    // Merge elements with an associative function
    nums.reduce((a, b) => a + b)         // => 6

    // Write elements to a text file
    nums.saveAsTextFile("hdfs://file.txt")

Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns. (figure: the driver ships tasks to workers; each worker reads its block, caches the filtered messages (Block 1/Cache 1, Block 2/Cache 2, Block 3/Cache 3), and returns results.)

    val lines = spark.textFile("hdfs://...")                  // base RDD
    val errors = lines.filter(_.startsWith("ERROR"))          // transformed RDD
    val messages = errors.map(_.split("\t")(2))
    val cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count                // action
    cachedMsgs.filter(_.contains("bar")).count
    ...

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

RDD Fault Tolerance. RDDs maintain lineage information that can be used to reconstruct lost partitions. Example:

    val messages = textFile(...).filter(_.startsWith("ERROR"))
                                .map(_.split("\t")(2))

Lineage: HDFS file -> filter(_.startsWith(...)) -> filtered RDD -> map(_.split(...)) -> mapped RDD.

Example: Logistic Regression. Goal: find the best line separating two sets of points, starting from a random initial line and converging to the target. (figure)

Example: Logistic Regression.

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)                      // w is mutable, i.e. not functional

    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce((a, b) => a + b)
      w -= gradient
    }
    println("Final w: " + w)

    // The for loop and the gradient update run on the master;
    // map and reduce run on the cluster.

Logistic Regression Performance. Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations. 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each). (figure)

Spark Discussion. Keeps the benefits of MapReduce with a more traditional data-parallel functional programming model. Higher performance comes from keeping intermediate data in memory instead of on disk: memory has 10,000x better latency and 100x better bandwidth than disk. Fault tolerance comes from the functional programming model; the model breaks when you have non-functional code (use of vars).

Spark Discussion (continued). Data partitioning is built in for both MapReduce and Spark; the initial partitioning is just chunking the data sets. The limited set of operations on partitioned data (map, reduce, ...) simplifies communication and placement.
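As a small illustration of where non-functional code breaks the model (a sketch using the standard RDD API; spark is the same context object as in the earlier examples):

    val nums = spark.parallelize(1 to 100)

    // Non-functional: each executor mutates its own serialized copy of `counter`,
    // so on a cluster the driver's copy is typically left at 0
    var counter = 0
    nums.foreach(x => counter += x)
    println("counter = " + counter)

    // Functional: the same sum expressed as a reduce over immutable data,
    // which Spark can recompute safely from lineage after a failure
    val total = nums.reduce(_ + _)
    println("total = " + total)                   // 5050

This is also why Spark provides accumulators for the cases where a driver-visible counter is genuinely needed.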

TensorFlow.

TensorFlow. Another dataflow model, focused on machine learning applications (more on this shortly). The basic data type is a tensor: a multidimensional array.

TensorFlow Example. (figure)

The Dataflow Graph. (figure)

Why TensorFlow? The dataflow model makes tasks, the units of scheduling, explicit. One major motivation for TensorFlow is to make programming GPUs and clusters easier. Tasks can have variants and can be assigned to GPUs or CPUs, if an appropriate variant is available. It supports single-node and multi-node execution, and the implementation has a built-in mapping heuristic.

Data and Communication. Once tasks are assigned, it is clear where data communication is required, e.g. if the source task is on the CPU and the destination task is on the GPU. The implementation automatically inserts copy operations to move data to where it is needed. It is not clear whether multiple alternatives are considered, e.g. zero-copy vs. frame-buffer memory on the GPU.

Sessions. Typically the same graph is reused many times. A session sets up a TensorFlow graph and provides hooks to call the graph with different inputs/outputs. There are also options to call only a portion of the graph, e.g. a particular subgraph.

Automatic Differentiation. Many ML algorithms are essentially optimization algorithms and need to compute gradients. TensorFlow has built-in support for computing the gradient function of a TensorFlow graph: each primitive function has a gradient function, and primitive gradients are composed using the chain rule.
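For instance (a worked example added here for illustration, not taken from the slides), the logistic-regression loss used earlier, L(w) = log(1 + exp(-y (w · x))), is a composition of primitives (dot product, negation, exp, add-one, log); composing their individual gradient functions by the chain rule gives exactly the gradient that appeared in the Spark code: dL/dw = (1 / (1 + exp(-y (w · x))) - 1) * y * x. This graph-level composition is what TensorFlow's gradient construction automates.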

Automatic Differentiation Example. (figure)

Other Features. Some tensors can be updated in place, which leads to the need for special control-flow edges that simply enforce an ordering of side effects on stateful tensors; note the lack of sequential semantics. There are control-flow constructs (loops, if-then-else), but note that automatic differentiation doesn't work for if-then-else.

Other Features. Queues: programmers can add queues to dataflow edges to batch up work, and to allow different parts of the graph to execute asynchronously; note that execution is otherwise synchronous.

Data Partitioning. Interestingly, TensorFlow has no data partitioning primitives (at least none exposed to users), so it is not really a big-data programming model. The underlying linear algebra packages (BLAS) may chunk up arrays. The task parallelism in the dataflow graph, and replication of the graph for multiple-input scenarios, are the primary sources of parallelism.

Summary. Big Data problems are inspiring their own class of programming models, with different constraints: more data and less complex compute, but also more focus on programmer productivity, with no assumption that users are willing to learn a lot about parallel programming.