Distributed Systems. Day 3: Principles Continued Jan 31, 2019

Size: px

Start display at page:

Download "Distributed Systems. Day 3: Principles Continued Jan 31, 2019"

Gwen Waters
5 years ago
Views:

1 Distributed Systems Day 3: Principles Continued Jan 31, 2019

2 Semantic Guarantees of RPCs Semantics At-least-once (1 or more calls) At-most-once (0 or 1 calls) Scenarios: Reading from bank account? Withdrawing from bank account? Depositing into bank account?

3 Non-Idempotent Procedures Transfer $1000 from Kyle s Account to Andy s Current Balance: 5000 Done Current Balance: 4000 Do it again and again and again! Current Balance: 4000 Done Current Balance: 3000 At-most-once semantics

4 Non-Idempotent Versus Idempotent Transfer $1000 from Kyle s Account to Andy s Current Balance: 5000 Done Current Balance: 4000 Read Kyle s Balance Current Balance: 5000 Done Current Balance: 5000 Do it again and again and again! Current Balance: 4000 Do it again and again and again! Current Balance: 5000 Done Current Balance: 3000 Done Current Balance: 5000

5 Semantic Guarantees of RPCs Semantics At-least-once (1 or more calls) At-most-once (0 or 1 calls) Scenarios: Reading from bank account? Withdrawing from bank account? Depositing into bank account? Reads versus Writes versus Appends Solutions: Idempotent calls Maintain history (to filter duplicates)

6 Idempotent Procedures Write File A Block 0 Done (Ahem ) Write File A Block 0 Done At-least-once semantics

7 Maintaining History Transfer $1000 from Kyle s Account to Andy s Done HISTORY Do it again! Done (replay) HISTORY At-most-once semantics

8 History Lost due to Crash Transfer $1000 from Kyle s Account to Andy s Done HISTORY CRASH!! Do it again! Sorry Throws Exception! At-most-once semantics

9 Semantic Guarantees of RPCs Semantics Request is lost Response is lost At-least-once (1 or more calls) At-most-once (0 or 1 calls) Retransmit? Filter duplicate? Yes No Re-execute function Yes Yes, Use history to filter Use history to retransmit

10 RPC in Projects Project 1,2,3 Project 4 We provide RPC Skeletons and you fill in functionality You write RPC skeleton And figure out RPC compilation

11 Replication and Partitioning

12 Partition Cut a file into ``chunks (or sharding) Reduce impact of file growth Limits amount of processing NodeA NodeB NodeC Definition (size) of chunk is app-specific

13 Replications Make many copies of a file Provides fault tolerance Always one copy alive NodeA NodeB NodeC Provides more CPU/BW to the file

14 s://static.googleusercontent.com/media/research.google.com/en//people/jeff/wsdm09-keynote.pdf

15 How to Partition the Data? s://static.googleusercontent.com/media/research.google.com/en//people/jeff/wsdm09-keynote.pdf

16 Use-Case: MapReduce

17 Distributed Data Processing: Index and Rank the World!! Strawman solution: Partition data across servers Have every server process local data Why won t this work? Inter-data dependencies: Ranking of a web page depends on ranking of pages that link to it Need data from all users who have a certain gene to evaluate susceptibility to a disease 72

18 MapReduce Distributed data processing paradigm introduced by Google in 2004 Spark is an evolution of MapReduce TensorFlow applies dataflow principles MapReduce represents A programming interface for data processing jobs Model processing as a data flow Map and Reduce functions (worker functions) A distributed execution framework Scalable and fault-tolerant

19 Fault Tolerance via Master Partition data: * Each partition is a split Master: Sends heartbeats to workers If worker doesn t respond in time period Assume deadand start a new worker Worker does processing

20 An Example What the user says: Goal Find frequent search queries to Bing SELECT FROM HAVING Query, COUNT(*) AS Freq QueryTable Freq > X How it Works: job manager file block 0 file block 1 file block 2 file block 3 task task task assign work, get progress Local write task task output block 0 output block 1 Read Map Reduce 80 Reining in the Outliers in Map-Reduce Clusters using Mantri by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris

21 How do you Ensure Good Performance Potential problems: Slow (or bad nodes) Some nodes have bad hardware Data imbalance Some blocks are larger than others Bad Network Potential solutions: Detect Problem: periodic status updates If progress is slow à problem Potential Solution problem Restart task on a new node. file block 0 file block 1 file block 2 file block 3 job manager task task task assign work, get progress Local write

22 Project 1: LiteMiner Check for node failures Balance Load between Miners: e.g., same number of items to mine Progress reports Miner Miner Miner Pool assign work, Miner Miner Progress reports Detect slow nodes

23 Project 1: LiteMiner Repartition work based on new miner Pool assign work, Join Miner Miner Miner Miner Miner Assign work Miner

24 This Class System Model Synchronous v. asynchronous Implications Failure types Communication overview Networking basics RPC Partition/Replication MapReduce use-case

EECS 498 Introduction to Distributed Systems

EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V Madhyastha Example Scenario Genome data from roughly one million users 125 MB of data per user Goal: Analyze data to identify genes that