Distributed Computation Models

Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean

HW4 Recap https://b.socrative.com/ Class: SWE622 2

Review Replicating state machines Case studies - Hypervisor, GFS, MagicPocket 3

Strawman RSM implementation (diagram: operations op1 and op2 sent to each replica) 4

Strawman RSM implementation (same diagram, second animation step) 5

Primary/backup RSM (diagram: operations op1 and op2 go to the primary, which forwards them to each backup) 6

Dangers of ignorance (diagram: Client 1 creates /leader, then is disconnected from ZK; ZK sends a notification that /leader is dead and Client 2 creates /leader, so Client 2 becomes leader; Client 1 later reconnects and discovers it is no longer leader) 7

Dangers of ignorance Each client needs to be aware of whether or not it is connected: when disconnected, it cannot assume that it is still included in any way in operations By default, the ZK client will NOT close the client session just because it's disconnected! The assumption is that things will eventually reconnect It's up to you to decide whether to close it or not 8
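
As a hedged illustration (not from the slides): with Curator, the application can register a ConnectionStateListener so it learns when its ZK connection is SUSPENDED or LOST and can stop assuming it still holds any role such as /leader. The connect string and the handling below are assumptions made for this sketch.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ConnectionAwareClient {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3)); // placeholder connect string
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                // SUSPENDED: the connection dropped, but the session may still be alive on the server.
                // LOST: the session expired, so any ephemeral nodes we created (e.g. /leader) are gone.
                if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
                    System.out.println("Disconnected from ZK (" + newState + "): stop acting as leader");
                } else if (newState == ConnectionState.RECONNECTED) {
                    System.out.println("Reconnected: re-check who holds /leader before resuming");
                }
            }
        });
        client.start();
        Thread.sleep(Long.MAX_VALUE); // keep the demo process alive
    }
}
```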

Hypervisor (Case Study) http://www.cs.cornell.edu/ken/book/2005sp/vft.tocs.pdf 9

MagicPocket 12

Review: Why Distributed? (diagram: key ranges A [0..100] / B [A..N] and A [101..200] / B [O..Z] partitioned and replicated across datacenters in NYC, SF, and London) So far: distributing data. Today: distributing computation. 13

Today Distributed Computation Models Has anyone heard of this "big data" thing? Exploiting parallelism MapReduce Spark 14

More data, more problems I have a 1TB file and I need to sort it, but my computer can only read 60MB/sec About a day later, it's done (just reading 1TB at 60MB/sec takes roughly 4.6 hours, and an external sort makes several read/write passes over the data) 15

More data, more problems Think about scale: Google indexes ~20 petabytes of web pages per day (as of 2008!) Facebook has 2.5 petabytes of user data, increases by 15 terabytes/day (as of 2009!) 16

Distributing Computation 17

Distributing Computation Can't I just add 100 nodes and sort my file 100 times faster? Not so easy: Sending data to/from nodes Coordinating among nodes Recovering when one node fails Optimizing for locality Debugging 18

Distributing Computation Lots of these challenges re-appear, regardless of our specific problem How to split up the task How to put the results back together How to store the data Enter, MapReduce 19

MapReduce A programming model for large-scale computations Takes large inputs, produces output No side-effects or persistent state other than the input and output Runtime library: automatic parallelization, load balancing, locality optimization, fault tolerance 20

MapReduce Partition data into splits and extract information from each one (map) Aggregate, summarize, filter or transform that data (reduce) The programmer provides these two methods 21
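
As a sketch of the abstraction only (these generic interfaces are illustrative, not an API from the slides), the two user-supplied methods have the shape the MapReduce paper gives them: map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v2).

```java
// Illustrative-only interfaces capturing the shape of the two user-supplied methods.
public interface MapReduceUserCode {

    // Collects output key/value pairs produced by map or reduce.
    interface Emitter<K, V> {
        void emit(K key, V value);
    }

    // map(k1, v1) -> list(k2, v2): called once per input record.
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, Emitter<K2, V2> out);
    }

    // reduce(k2, list(v2)) -> list(v2): called once per intermediate key,
    // with all values that were emitted for that key.
    interface Reducer<K2, V2> {
        void reduce(K2 key, Iterable<V2> values, Emitter<K2, V2> out);
    }
}
```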

MapReduce: Divide & Conquer (diagram: Big Data / lots of work is partitioned into pieces w1..w5, each processed by a worker to produce results r1..r5, which are combined into the final result) 22

MapReduce: Example Calculate word frequencies in documents Input: files, one document per record Map parses each document into words and emits key = word, value = count for that occurrence (1) Reduce: compute the sum for each key 23

MapReduce: Example (diagram: Input 1 is "apple orange mango / orange grapes plum", Input 2 is "apple plum mango / apple apple plum". Each line goes to a mapper, which splits it into key->value pairs, e.g. "apple orange mango" becomes (apple, 1), (orange, 1), (mango, 1); these pairs are then sent to reduce) 24

MapReduce: Example (diagram: the map outputs are sorted and shuffled so that all pairs for the same key reach the same reducer, which sums them; final output: (apple, 4), (grapes, 1), (mango, 2), (orange, 2), (plum, 3)) 25
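
For concreteness, here is the standard word-count job written against the Hadoop MapReduce Java API (Hadoop, the open-source MapReduce implementation, is introduced a few slides later). This is the canonical example rather than code from this deck, and the input/output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each line into words and emit (word, 1) per occurrence.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts that arrive for each word after the shuffle.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```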

MapReduce Applications Distributed grep Distributed clustering Web link graph traversal Detecting duplicate web pages 26

MapReduce: Implementation Input data is partitioned into M splits Map: extract information from each split Each map produces R partitions Shuffle and sort: bring the right partitions to the right reducer Output comes from the R reducers 27
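
A small sketch of how an intermediate key gets assigned to one of the R reduce partitions, in the style of Hadoop's default hash partitioner (the class and values below are illustrative, not from the slides).

```java
public class HashPartition {

    // Assign a key to one of numReduceTasks partitions.
    // Masking with Integer.MAX_VALUE keeps the result non-negative even if hashCode() is negative.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int r = 4; // R reduce tasks
        // The same key always lands in the same partition, so one reducer sees all of its values.
        System.out.println("apple  -> partition " + partition("apple", r));
        System.out.println("orange -> partition " + partition("orange", r));
    }
}
```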

MapReduce: Implementation Each worker node is also a GFS chunk server! 28

MapReduce: Scheduling One master, many workers Input data split into M map tasks (typically 64MB each) R reduce tasks Tasks assigned to workers dynamically; stateless and idempotent -> easy fault tolerance for workers Typical numbers: 200,000 map tasks, 4,000 reduce tasks across 2,000 workers 29

MapReduce: Scheduling Master assigns map task to a free worker Prefer "close-by" workers for each task (based on data locality) Worker reads task input, produces intermediate output, stores locally (K/V pairs) Master assigns reduce task to a free worker Reads intermediate K/V pairs from map workers Reduce worker sorts and applies some reduce operation to get the output 30

Fault tolerance via re-execution Ideally, fine-granularity tasks (more tasks than machines) On worker failure: Re-execute completed and in-progress map tasks Re-execute in-progress reduce tasks Commit completion to master On master failure: Recover state (the master checkpoints via a primary/backup mechanism) 31

MapReduce in Practice Originally presented by Google in 2004 Widely used today (Hadoop is an open-source implementation) Many systems are designed with easier programming models that compile into MapReduce code (Pig, Hive) 32

Hadoop Hadoop in Practice, Second Edition 33

Hadoop: HDFS (diagram: an HDFS NameNode coordinating multiple HDFS DataNodes) 34

HDFS (GFS Review) Files are split into blocks (128MB) Each block is replicated (by default on 3 block servers) If a host crashes, its blocks are re-replicated somewhere else If a host is added, blocks are rebalanced Can get awesome locality by pushing the map tasks to the nodes with the blocks (just like MapReduce) 35
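
As an illustrative sketch (not from the slides), the HDFS Java client API exposes where each block of a file lives, which is what lets a scheduler place map tasks next to the data; the path below is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // placeholder path

        // One BlockLocation per block, listing the DataNodes that hold a replica of it.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " lives on: " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```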

YARN Hadoop's distributed resource scheduler Helps with scaling to very large deployments (>4,000 nodes) General purpose system for scheduling executable code to run on workers, managing those jobs, collecting results 36

Hadoop + ZooKeeper Hadoop uses ZooKeeper for automatic failover of HDFS Run a ZooKeeper client on each NameNode (master) The primary NameNode and standbys all maintain a session in ZK, and the primary holds an ephemeral lock If the primary doesn't maintain contact, its session expires, triggering a failover (handled by the ZK client) A similar mechanism is used to fail over a YARN controller 37
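
A hedged sketch of the same pattern using Curator's LeaderLatch recipe, which is backed by an ephemeral, session-scoped znode: whoever holds the latch is active, and if its session expires a standby takes over. The connect string, path, and id are illustrative; this is not the actual HDFS ZKFC code.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class NameNodeFailoverSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",              // placeholder ensemble
                new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Each NameNode's failover controller competes for the same latch path.
        LeaderLatch latch = new LeaderLatch(zk, "/hdfs-leader-lock", "nn-1"); // illustrative path/id
        latch.start();
        latch.await(); // blocks until this node holds the ephemeral lock, i.e. becomes active

        System.out.println("This NameNode is now active");
        // ... serve as primary; if the ZK session expires, the latch is lost
        //     and a standby's await() returns, promoting it ...
    }
}
```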

Hadoop + ZooKeeper https://issues.apache.org/jira/secure/attachment/12519914/zkfc-design.pdf 38

Hadoop + ZooKeeper (diagram: three ZK Servers; a ZKClient running alongside each NameNode, one Primary and one Secondary; many DataNodes below them) 39

Hadoop + ZooKeeper (diagram: the primary NameNode's ZKClient times out and is disconnected; the other ZKClient gets a notification that the leader is gone and its secondary NameNode becomes primary) 40

Hadoop + ZooKeeper (diagram as before) Note: this is why ZK is helpful here: we can have the ZK servers partitioned *too* and still tolerate it the same way 41

Hadoop + ZooKeeper Why run ZK client in a different process? Why run ZK client on the same machine? Can this config still lead to unavailability? Can this config lead to inconsistency? 42

Hadoop Ecosystem 43

Beyond MapReduce MapReduce is optimized for I/O-bound problems, e.g. sorting big files Primitive API: developers have to build on the simple map/reduce abstraction Jobs typically result in many small files (reduce outputs) that need to be combined 44

Apache Spark Developed in 2009 at Berkeley Revolves around resilient distributed datasets (RDDs) Fault-tolerant, read-only collections of elements that can be operated on in parallel 45

Spark vs MapReduce 46

RDD Operations Generally expressive: Transformations turn an existing RDD into a new one map, filter, distinct, union, sample, join, reduceByKey, etc. Actions return results after a computation collect, count, first, foreach, reduce, etc. 47
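
A hedged sketch using Spark's Java API (Spark 2.x style, where flatMap returns an Iterator; the app name and file path are placeholders) showing a chain of transformations followed by actions:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations: each builds a new RDD lazily; nothing executes yet.
        JavaRDD<String> lines = sc.textFile("input.txt"); // placeholder path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.cache(); // keep the result in memory for reuse across actions

        // Actions: these trigger the actual distributed computation.
        System.out.println("distinct words: " + counts.count());
        counts.take(5).forEach(pair -> System.out.println(pair._1() + " -> " + pair._2()));

        sc.stop();
    }
}
```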

Spark Aggressively caches data in memory AND is fault tolerant How? MapReduce got tolerance through its disk replication RDDs are resilient, but they are also restricted Immutable, partitioned records Can only be built through coarse-grained, deterministic transformations (e.g. map, filter, join) 48

Spark - Fault Recovery Can efficiently recover from faults using lineage Just like write-ahead logging: log which RDD was the input and the high-level operation used to generate the output Only need to log the operation applied to the entire dataset, not per-element changes RDDs track the graph of transformations that built them all the way back to their origin 49

How fault tolerant is CFS? (diagram: clients talking to a master and several slaves; when the master crashes, a slave is promoted to master) Goal 1: Promote a new master when the master crashes 50

How fault tolerant is CFS? (diagram: a network partition leaves the old master and a slave disconnected, while the remaining slaves on the OK side promote a new master and continue serving clients) 51

How fault tolerant is CFS? (diagram: a partition separates some slaves from the master and the rest) Goal 2: Promote a new master when partitioned and holding a quorum; cease functioning in the minority 52

Return to failover & CFS WAIT returns the number of servers that acknowledged the update How many do we really need in order to know it took effect? So far, for N servers, we've required N acks N/2+1? 53
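
A hedged sketch of checking a majority quorum instead of all N acks, assuming Jedis's waitReplicas call wraps the Redis WAIT command (the host, key, replica count, and timeout are placeholders, and this is not the homework's required design):

```java
import redis.clients.jedis.Jedis;

public class QuorumWriteSketch {
    public static void main(String[] args) {
        int totalReplicas = 4;                  // N replicas behind the master (placeholder)
        int quorum = totalReplicas / 2 + 1;     // majority: N/2 + 1

        try (Jedis master = new Jedis("localhost", 6379)) {
            master.set("key", "value");
            // WAIT blocks until `quorum` replicas ack or the timeout (ms) expires,
            // and returns how many replicas actually acknowledged the write.
            long acked = master.waitReplicas(quorum, 1000);
            if (acked >= quorum) {
                System.out.println("Write reached a majority (" + acked + "/" + totalReplicas + ")");
            } else {
                System.out.println("Only " + acked + " acks; treat the write as unsafe");
            }
        }
    }
}
```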

Role of ZK in CFS for fault tolerance We will track who the Redis master is with ZK If something is wrong (WAIT fails) or you can't contact the master, then you need to challenge the master's leadership Maybe the master is partitioned from the slaves but ZK isn't Maybe the master has crashed If you promote a master, then you need to track that in ZK If you were disconnected from ZK and reconnect, you need to re-validate who the master is 54

HW 5: Fault Tolerance We're going all-in on ZooKeeper here Use ZK similarly to how HDFS does: each Redis slave will have a dedicated ZK client to determine who the master Redis server is The Redis master holds a lease that can be renewed perpetually When a client notices a problem (e.g. WAIT doesn't work right or it can't talk to the master), it proposes becoming the master As long as a client can talk to a quorum of ZK nodes, it can decide who the leader is Clients don't need to vote; it just matters that there is exactly one leader 55

HW 5: Fault Tolerance Does this make Redis a fault-tolerant CP system? No. Argument: the master might have accepted writes that succeeded locally but were never replicated, then lost leadership Later, the old master comes back online: it accepted writes that nobody else knew about (and the rest of the system kept functioning without them) 56

HW 5: Fault Tolerance Is this a fundamental limitation? Is there a theorem that says this is impossible? No! It's just how Redis is implemented. It would be easier if we could do SET and WAIT in one step, failing the SET if the WAIT failed That's literally what synchronous replication is, and Redis is an asynchronous system https://aphyr.com/posts/283-jepsen-redis https://aphyr.com/posts/307-call-me-maybe-redis-redux 57

Lab: ZooKeeper Leadership We're going to simulate failures in our system, and work to maintain a single leader 58