Architecture of Systems for Processing Massive Amounts of Data Milan Vojnović Microsoft Research April 2011 The Goals of this Lecture Learn about the underlying principles of system design for processing of massive amounts of data Learn about the state-of-the-art systems used in production and commercial systems of major Internet online service providers Ex. Amazon, Google and Microsoft Learn about some alternative system designs proposed in research papers 2 1
Typical System Characteristics Distributed system built from inexpensive commodity components Shared-nothing model Machines with their own CPU, memory and hard disks interconnected with a network Failures of machines and network components are common 3 Application Requirements Support for storing and processing of large files GB, TB, PB quite common Efficient processing that uses large streaming reads and writes Operations on contiguous regions of a file Support for structured data Ex. tables, incremental processing Parallel processing Parallel computation complexities hidden from the programmer Accommodate declarative and imperative programming Quality of service requirements Ex. fast processing speed, high availability 4 2
Contents Network Architecture File System Job Scheduling Parallel Computing MapReduce, Dryad, SCOPE and DryadLINQ Structured Data BigTable, Amazon's Dynamo, Percolator 5 Network Architecture Typically hierarchical organization Either two- or three-level trees of switches or routers Core tier (ex. 32-128 ports, 10 GigE) Aggregation tier (ex. 48-288 ports, GigE) Computers clustered in racks 6 3
Oversubscription Switches allow all directly connected hosts to communicate with one another at the full speed of their network interface Oversubscription The ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth of a particular network Pros: Lowers the cost Cons: Complicates design of protocols as the design must be conscious of network bandwidth asymmetries 7 Oversubscription: Examples 1:1 = all hosts may potentially communicate with arbitrary other hosts at the full bandwidth of their network interface 5:1 = 20% of available host bandwidth is available for some connection patterns Typical designs 2.5:1 (400 Mbps) to 8:1 (125 Mbps) 8 4
Alternative: Fat-Tree Topology Built using many small commodity switches Lower cost Fat-tree topology k-ary tree k pods, each containing two layers of k/2 aggregation switches (upper and lower layers) (k/2)×(k/2) grid of core switches, each with k ports Each switch in the lower aggregation layer is directly connected to k/2 hosts; the other k/2 ports are connected to k/2 switches in the upper aggregation layer Each core switch has one port connected to each of the k pods A fat-tree with k-port switches supports k³/4 hosts 9 An Example Fat-Tree Topology k = 4 10 5
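As a quick sanity check of the counts above, a short Python sketch (the helper name is just an illustration):

```python
def fat_tree_sizes(k):
    """Component counts of a fat-tree built from k-port switches."""
    assert k % 2 == 0
    pods = k
    agg_switches = k * k            # k pods x two layers of k/2 switches each
    core_switches = (k // 2) ** 2   # (k/2) x (k/2) grid of k-port core switches
    hosts = k ** 3 // 4             # k pods x (k/2 edge switches) x (k/2 hosts)
    return pods, agg_switches, core_switches, hosts

print(fat_tree_sizes(4))   # (4, 16, 4, 16) -- the k = 4 example topology
print(fat_tree_sizes(48))  # 48-port switches already support 27,648 hosts
```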
Addressing IP address blocks 10.0.0.0/8 Pod switch: 10.pod.switch.1 Pod = the pod number, in [0, k-1] Switch = position of the switch in the pod, in [0, k-1], from left to right, bottom to top Core switch: 10.k.j.i (j, i) = switch coordinate in the (k/2)×(k/2) core switch grid; i, j in [1, k/2], from top-left Host: 10.pod.switch.ID ID = host position in the subnet, in [2, k/2+1], from left to right 11 IP Routing Goal: IP routing using paths across the network so that the load on switches is balanced Two-level prefix lookup Primary table contains first-level prefixes Secondary table contains second-level (suffix, port) entries A first-level prefix in the primary table may contain a pointer to a secondary table A prefix is said to be terminating if no (suffix, port) entry is associated with it Inter-pod routing uses the default /0 prefix and suffix routing based on the host ID Centralized configuration of routing table entries Appropriate for data centre scenarios 12 6
IP Routing Example Routing table at switch 10.2.2.1 IP address 10.2.1.2 forwarded to port 1 IP address 10.3.0.3 forwarded to port 3 13 IP Routing Table Generation Aggregation switch routing tables for load balancing 14 7
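The two-level lookup can be sketched in Python; the table below is an illustrative reconstruction for an upper-layer pod switch such as 10.2.2.1 (simple string matching stands in for real longest-prefix matching, and the port assignments are assumptions):

```python
# Toy two-level routing table: a primary list of first-level prefixes, where a
# terminating prefix maps directly to a port and a non-terminating prefix
# points to a secondary (suffix, port) table keyed on the host ID.
def lookup(primary, ip):
    for prefix, entry in primary:
        if ip.startswith(prefix):
            if isinstance(entry, int):          # terminating prefix
                return entry
            for suffix, port in entry:          # secondary suffix table
                if ip.endswith(suffix):
                    return port
    return None

# Illustrative table for pod switch 10.2.2.1:
primary = [
    ("10.2.0.", 0),                 # subnet of edge switch 0 in pod 2
    ("10.2.1.", 1),                 # subnet of edge switch 1 in pod 2
    ("", [(".2", 2), (".3", 3)]),   # default /0 -> suffix routing on host ID
]
print(lookup(primary, "10.2.1.2"))  # 1 (intra-pod, prefix routed)
print(lookup(primary, "10.3.0.3"))  # 3 (inter-pod, suffix routed on host ID)
```

Suffix routing on the host ID is what spreads inter-pod traffic across the core switches.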
IP Routing Table Generation (cont'd) Core switch routing tables Load balancing in the initial part of the route through aggregation switches 15 Packing One drawback of the fat-tree topology is the number of required cables Larger fan-out of switches Packing aims at Minimizing the number of external cables Reducing the overall cable length Allowing for incremental deployment Aggregation switches partitioned over pod racks Star layout per pod to reduce cable length: the pod rack is a hub and the other racks are leaves Only external cabling is to core switches 16 8
Packing: Example 17 Contents Network Architecture File System Job Scheduling Parallel Computing MapReduce, Dryad, SCOPE and DryadLINQ Structured Data BigTable, Amazon's Dynamo, Percolator 18 9
File System Needs to meet a set of design requirements for efficient processing of massive data sets Ex. Google File System (GFS) Cosmos (Microsoft) 19 Design Requirements Storing a modest number of large files Ex. millions of files, each of size 100 MB or larger; multi-GB file sizes are common Large streaming reads Individual operations typically read multiple MBs; clients often read through a contiguous region of a file Many large sequential writes Mostly appends to a file; modifications are rare Semantics to support multiple clients concurrently appending to a file Atomicity with minimal overhead High sustained bandwidth more important than latency 20 10
System Architecture File partitioned into chunks ("extents" in Cosmos) Ex. chunk size 64 MB Choice of chunk size for efficient reads Specialized master nodes Ex. handle namespace management and locking, replica placement, creation, re-replication and rebalancing Chunkservers Maintain replicas Clients Issue read and write requests 21 GFS Architecture Single master node Centralized component: simplifies the design Ex. easier to implement chunk placement strategies using global knowledge Clients never read or write file data through the master Otherwise, the master may become a bottleneck 22 11
GFS Architecture (cont'd) Client sends request for a chunk index The master replies with the corresponding chunk handle and the locations of the replicas Client reads data directly from a chunkserver 23 Consistency Model A file region is consistent if all clients always see the same data, regardless of which replicas they read from Uses a lease mechanism for consistent mutation order across replicas Mutation = an operation that changes the contents or metadata of a chunk (ex. write or append) The master grants a chunk lease to one of the replicas (called primary) Delegation to minimize management overhead of the master Lease granted for a time period that can be repeatedly extended by the primary The primary determines a serial order for all mutations of the chunk, which is then followed by all replicas 24 12
The Lease Mechanism 1. Client asks for the chunkserver that holds the lease for the chunk; if none exists, the master grants one to a replica it chooses 2. The locations of the primary and secondary replicas are communicated to the client 3. The client pushes the data to all the replicas 4. Once all replicas acknowledge receiving the data, the client sends the write request to the primary 5. The primary forwards the write request to all secondary replicas 6. Each secondary acknowledges to the primary that it has applied the operation 7. The primary replies to the client 25 Chunk Replication Resilience to failures of machines and network partitioning Chunks replicated to multiple chunkservers on different racks K-copy replication 26 13
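The seven steps can be condensed into a toy in-memory driver (all class and method names are hypothetical stand-ins, not the real GFS interfaces):

```python
# Toy sketch of the GFS write path; step numbers refer to the list above.
class Replica:
    def __init__(self, name):
        self.name, self.buffer, self.chunk = name, None, []

    def push(self, data):    # step 3: data pushed to and buffered at the replica
        self.buffer = data
        return "ack"

    def apply(self):         # steps 5-6: apply the buffered data in serial order
        self.chunk.append(self.buffer)
        self.buffer = None
        return "ack"

def write(data, primary, secondaries):
    replicas = [primary] + secondaries
    assert all(r.push(data) == "ack" for r in replicas)     # steps 3-4
    primary.apply()                                         # primary fixes the order
    assert all(s.apply() == "ack" for s in secondaries)     # steps 5-6
    return "ok"                                             # step 7: reply to client

primary, secondaries = Replica("cs-a"), [Replica("cs-b"), Replica("cs-c")]
print(write(b"record-1", primary, secondaries))  # ok
```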
Data Flow Pipelined data transmission through a chain of chunkservers The goal is to fully utilize each machine's network interface Each machine forwards the data to the closest machine in the network topology that has not received it Distances estimated from IP addresses Data pipelined over TCP connections In the absence of network congestion, the transfer time for R replicas = B/C + R·L B = number of bytes to transfer C = network interface bandwidth L = latency to transfer bytes between two machines 27 Contents Network Architecture File System Job Scheduling Parallel Computing MapReduce, Dryad, SCOPE and DryadLINQ Structured Data BigTable, Amazon's Dynamo, Percolator 28 14
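Plugging illustrative numbers into the transfer-time formula (the 100 Mbps interface and 1 ms per-hop latency are assumptions, not values from the lecture):

```python
def transfer_time(B, C, R, L):
    # Ideal pipelined replication time: B/C to stream the bytes once through
    # the chain, plus latency L for each of the R hops.
    return B / C + R * L

# 1 MB (8e6 bits) pushed to 3 replicas over 100 Mbps links, 1 ms per hop:
t = transfer_time(B=8e6, C=100e6, R=3, L=1e-3)
print(round(t * 1000), "ms")  # 83 ms
```

Because the transfers are pipelined, the replication factor R only adds latency terms, not extra copies of B/C.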
Job Scheduling A job consists of multiple tasks Ex. a task needs to process data on a machine Scheduling of tasks to machines with respect to the following criteria: Inter-job fairness: allocation of resources fair across jobs with respect to an adopted notion of fairness Data locality: tasks placed near to their data Scheduling of tasks performed by execution control (runtime) Examples: Quincy (Microsoft), Delay Scheduling (Hadoop) 29 Principles of Job Scheduling in Distributed Cluster Systems Separation of the inter-job fairness objective and the data-locality objective Inter-job fairness Ex. weighted round robin, or the special case of uniform round robin Data locality Different ways to accommodate this 30 15
Queue-based Scheduling Data structure to encode locality preference with no inter-job fairness C_i = machine i, R_i = rack i, X = cluster, w_i^j = task i of job j 31 Simple Greedy Fairness M = number of machines K = current number of jobs N_j = number of unfinished tasks of job j Baseline allocation to job j: B_j = min(M/K, N_j) If B_j < M/K, the remaining slots are divided equally among the jobs that have additional tasks to determine the allocation A_j; else A_j = B_j Greedy Fair Scheduler: block job j if the number of allocated machines is A_j or more 32 16
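The allocation A_j amounts to water-filling: jobs with fewer tasks than their equal share release slots, which are split equally among the rest. A sketch (function name hypothetical):

```python
def greedy_fair_allocation(M, demands):
    """Water-filling: each job gets min(M/K, N_j), and slots left over by
    jobs with small demands are split equally among the remaining jobs."""
    alloc = [0.0] * len(demands)
    remaining, active = float(M), {j for j, d in enumerate(demands) if d > 0}
    while remaining > 1e-9 and active:
        share = remaining / len(active)
        remaining = 0.0
        for j in sorted(active):
            give = min(share, demands[j] - alloc[j])   # cap at the job's demand
            alloc[j] += give
            remaining += share - give                  # unused share is recycled
            if alloc[j] >= demands[j] - 1e-9:
                active.discard(j)
    return alloc

# M = 10 slots, jobs with 2, 8 and 8 unfinished tasks:
print([round(a, 6) for a in greedy_fair_allocation(10, [2, 8, 8])])  # [2.0, 4.0, 4.0]
```

Job j would then be blocked once it holds A_j or more machines.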
Simple Greedy Fairness (cont'd) Suffers from a sticky-slot problem When a task finishes, fairness requires to serve the same job Solution by using a hysteresis approach A job is unblocked if the number of its running tasks falls below A_j − α, for some α > 0 Essentially uniform round robin Uniform allocation across jobs Same as with the Hadoop Fair Scheduler 33 Combining Inter-Job Fairness and Locality Preference Each job given an allocation according to a fairness criterion Allocation derived by solving a min-cost flow problem The costs encode locality preference Ex. machine and rack preference 34 17
Delay Scheduling Inter-job fairness criterion: essentially the same as with Quincy Basic idea: scheduling a job that should be served according to the inter-job fairness criterion is postponed for a limited number of scheduling slots, if the head-of-line task of this job cannot be assigned locally 35 Delay Scheduling (cont'd) D = input parameter determining the maximum number of skips per job 36 18
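A minimal sketch of the skipping rule (the job representation and the fairness ordering are simplified assumptions):

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    fair_share: float   # machines owed to the job under the fairness criterion
    running: int        # currently running tasks
    preferred: set      # machines holding this job's input data
    skips: int = 0

def assign_slot(machine, jobs, D):
    """Offer a freed slot on `machine` in fairness order; a job may pass up to
    D times while its head-of-line task has no local data on that machine."""
    for job in sorted(jobs, key=lambda j: j.running / j.fair_share):
        if machine in job.preferred:
            job.skips = 0
            return job.name, "local"
        if job.skips >= D:          # skipped long enough -- launch non-locally
            job.skips = 0
            return job.name, "remote"
        job.skips += 1
    return None, "idle"

jobs = [Job("j1", 5, 1, {"m1"}), Job("j2", 5, 4, {"m2"})]
print(assign_slot("m2", jobs, D=2))  # ('j2', 'local'): j1 skips once
print(assign_slot("m3", jobs, D=2))  # (None, 'idle'): both jobs skip
print(assign_slot("m3", jobs, D=2))  # ('j1', 'remote'): j1 used up its D skips
```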
Configuring Number of Skips M = number of machines L = slots per machine P_j = set of machines on which job j has data to process (preferred machines for job j) p_j = |P_j|/M T = task processing time R = number of replicas per data chunk For a job j that is farthest below its fair share, the probability of launching a non-local task = (1 − p_j)^D Exponentially decreasing with D Choose D such that the average fraction of locally assigned tasks for a job with N tasks is at least 1 − ε, for given ε > 0 37 Assumption 1 Ass. 1: All N tasks require data from the same machine Sufficient: D ≥ log(1/ε) / log(1/(1 − R/M)) = (M/R) log(1/ε) + o(M/R) 38 19
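Evaluating the Assumption 1 bound numerically (the values of M, R and ε are illustrative):

```python
from math import ceil, log

def skips_needed(M, R, eps):
    # Smallest D with (1 - R/M)**D <= eps, i.e.
    # D >= log(1/eps) / log(1/(1 - R/M)) ~ (M/R) * log(1/eps)
    return ceil(log(1.0 / eps) / log(1.0 / (1.0 - R / M)))

D = skips_needed(M=1000, R=3, eps=0.05)
print(D)  # 998, roughly (M/R) * log(1/eps)
```

Since a skip lasts only one scheduling decision, even a D in the hundreds corresponds to a short wall-clock delay on a busy cluster.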
Proof Sketch Probability that a task is assigned to a preferred machine: 1 − (1 − R/M)^D ≈ 1 − e^(−RD/M) Therefore, it suffices to choose D such that 1 − e^(−RD/M) ≥ 1 − ε, which yields the result 39 Assumption 2 Ass. 2: each task prefers a machine selected uniformly at random from the set of M machines Suppose NR = o(√M); then for every ε > 1/N it suffices that D ≥ (M/R) log(1/(1 − 1/(εN))) = (M/R)(1/(Nε) + o(1/(Nε))) 40 20
Proof Sketch First note that the machines preferred by the tasks are all distinct with high probability; ex. the probability goes to 1 if NR = o(√M): (1 − 1/M)(1 − 2/M)···(1 − (NR−1)/M) ≥ (1 − NR/M)^NR ≈ e^(−(NR)²/M) → 1 Given there are K unfinished tasks for job j, the probability of a local task assignment is approximately: 1 − (1 − KR/M)^D ≈ 1 − e^(−(RD/M)K) Average fraction of local assignments per job is at least (1/N) Σ_{K=1..N} (1 − e^(−(RD/M)K)) ≥ 1 − (1/N) Σ_{K=0..∞} e^(−(RD/M)K) = 1 − (1/N) · 1/(1 − e^(−RD/M)) The result follows by requiring that the right-hand side is at least 1 − ε 41 Contents Network Architecture File System Job Scheduling Parallel Computing MapReduce, Dryad, SCOPE and DryadLINQ Structured Data BigTable, Amazon's Dynamo, Percolator 42 21
MapReduce Abstraction of group-by and aggregation Applications typically use several rounds of Map and Reduce phases 43 Example of Map and Reduce 44 22
System Components 45 Dryad A general-purpose distributed execution engine for coarse-grained data-parallel computations Based on specifying a dataflow graph Vertices contain code Directed edges (channels) describe data flows DAG = Directed Acyclic Graph 46 23
System Architecture NS = name server JM = job manager Determines the assignment of vertices to machines and orchestrates the overall execution D = machines V = vertices 47 Vertices and Channels Vertex Denotes computation code Typically sequential But also supports event-based programming, ex. using a shared thread pool Channel types File (default): preserved after vertex execution until the job completes TCP: requires no disk accesses, but both end-point vertices must be scheduled to run at the same time Shared-memory FIFO: low communication cost, but end-point vertices must run within the same process 48 24
Data Flow Graph: Construction Operators Clone Point-wise composition Complete bipartite composition Merge 49 Construction Operators (cont'd) 50 25
Example: Histogram Computation Compute histogram of record frequencies Map phase P = read a part of the file to extract records D = distribute the input using hash partitioning S = perform an in-memory sort C = compute the total count per record 51 Example: Histogram Computation Reduce phase MS = merge sort based on the record hash C = compute the total count per record 52 26
Example: Histogram Computation Optimized Version Wasteful to execute Q vertices for every input partition Small input partition size Much smaller than RAM size Inefficient to read from many input partitions 53 SCOPE Structured Computations Optimized for Parallel Execution Data modelled as a set of rows comprised of typed columns Declarative language A program tells what to do, not how to do it Resembles SQL with C# expressions Sequence of commands, typically data transformation operators (take one or more rowsets as input, perform some operation on the data, output a rowset) The compiler and optimizer are responsible for generating an efficient execution plan, and the runtime for executing the plan with minimal overhead 54 27
SCOPE Software Stack SCOPE script SCOPE compiler SCOPE runtime SCOPE optimizer Cosmos execution environment Cosmos file system Cosmos files 55 Example: Histogram Find most popular queries that were requested more than 1000 times Step-by-step equivalent: 56 28
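The script's shape (extract the query column, group by query with a count, keep counts above 1000, sort by count) can be mimicked with a Python sketch; plain collections stand in for the SCOPE rowset operators, so this is an illustration, not SCOPE itself:

```python
from collections import Counter

def popular_queries(qlog, threshold=1000):
    # EXTRACT query; GROUP BY query with COUNT(); HAVING count > threshold;
    # ORDER BY count DESC -- the shape of the histogram script.
    counts = Counter(qlog)
    rows = [(q, c) for q, c in counts.items() if c > threshold]
    return sorted(rows, key=lambda row: row[1], reverse=True)

qlog = ["x"] * 1500 + ["y"] * 999 + ["z"] * 2000
print(popular_queries(qlog))  # [('z', 2000), ('x', 1500)]
```

In SCOPE itself the same pipeline is declarative, which is what lets the optimizer insert partial aggregation at the rack level.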
Example: Histogram Execution Plan Extractors read extents in parallel Partial aggregation at the rack level* Distribute partition on the grouping column Final aggregation Take only rows with count larger than 1000 Sort by count Merge sorted results * Exploits knowledge about network topology 57 DryadLINQ Similar purpose as with SCOPE but uses LINQ LINQ = Language Integrated Query A set of.net constructs for programming with datasets Objects can be of any.net type Easy to compute using vectors and matrices 58 29
DryadLINQ Software Stack DryadLINQ Dryad Cluster services High-level language API Distributed execution, fault-tolerance, scheduling Remote process execution, naming, storage Windows Server Windows Server Windows Server 59 Example: Histogram 60 30
Contents Network Architecture File System Job Scheduling Parallel Computing MapReduce, Dryad, SCOPE and DryadLINQ Structured Data BigTable, Amazon's Dynamo, Percolator 61 Design Principles Provide a client with a structured data model that supports control over layout and format Distributed storage system for structured data Efficient reads/writes Consistency High availability 62 31
BigTable Data model: multidimensional sorted map (row:string, column:string, time:int64) -> string Column families Group of column keys; basic unit of access control Small number of column families, each possibly consisting of many columns Atomic reads and writes Uses horizontal partitioning Rowsets distributed across machines Efficient reads of short ranges, as they typically require access to a small number of machines Consistency Uses the highly-available and persistent distributed lock service Chubby 63 An Example Table Rows correspond to reversed URLs Contents is the web page content The anchor column family consists of anchor text that referenced the web page Timestamps t_i indicate various snapshots 64 32
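The data model can be mimicked with a toy map keyed by (row, column, timestamp); the rows below follow the example table, while the class and method names are hypothetical:

```python
# Toy model of the sorted multidimensional map:
# (row:str, column:str, time:int) -> str. The real system keeps rows sorted
# so that short row ranges touch few machines; persistence, column families
# and access control are omitted here.
class Table:
    def __init__(self):
        self.cells = {}

    def write(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def read(self, row, column):
        """Most recent version of a cell, or None if the cell is absent."""
        versions = {t: v for (r, c, t), v in self.cells.items()
                    if (r, c) == (row, column)}
        return versions[max(versions)] if versions else None

t = Table()
t.write("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
t.write("com.cnn.www", "contents:", 5, "<html>v1")
t.write("com.cnn.www", "contents:", 6, "<html>v2")
print(t.read("com.cnn.www", "contents:"))  # <html>v2
```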
Interface to a Table Write to a table: Read from a table: 65 System Architecture Tablet = contiguous region of the key space Master Assigns tablets to tablet servers Detects the addition and expiration of tablet servers Balances tablet server load Garbage collection Tablet server Stores a collection of tablets 66 33
Indexing of Tablet Locations Three-level hierarchy similar to B+ trees Stores the location of the root tablet Root tablet stores the locations of all METADATA tablets Each METADATA tablet points to the locations of a set of tablets 67 Amazon's Dynamo Dictionary: key -> value Many services require only storing and retrieving values by primary key No need for complex relational database queries Key requirement: high availability Always writable Relaxing consistency guarantees Other requirements Incremental scalability Symmetry (no special roles taken by some components) Decentralization (no centralized components) Leverage system heterogeneity 68 34
Key System Design Choices Partitioning by consistent hashing Allows for incremental scalability High availability for writes using vector clocks with reconciliations during reads Handling temporary failures using quorum Recovering from system failures using Merkle trees Membership and failure detection using a gossip-based membership protocol 69 Consistent Hashing (for resilience to failures) B is the coordinator for key K Preference list for key K contains B, C and D 70 35
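Consistent hashing and preference lists can be sketched as follows (the hash function and node names are illustrative assumptions):

```python
import hashlib
from bisect import bisect_right

def _pos(s):
    """Position of a key or node on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def preference_list(nodes, key, n=3):
    # The first node clockwise from the key's position is the coordinator;
    # the key is also replicated on the next n-1 distinct nodes on the ring.
    ring = sorted((_pos(node), node) for node in nodes)
    i = bisect_right([pos for pos, _ in ring], _pos(key))
    return [ring[(i + k) % len(ring)][1] for k in range(n)]

nodes = ["A", "B", "C", "D", "E"]
print(preference_list(nodes, "key-42"))  # coordinator first, then two successors
```

Because positions on the ring are fixed by hashing, adding or removing a node only moves the keys adjacent to it, which is what gives incremental scalability.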
Data Versioning using Vector Clocks Vector clock = a list of (node, counter) pairs When neither of two clocks descends from the other, the versions are concurrent and both versions must be kept 71 Percolator Incremental processing using distributed transactions and notifications Two main abstractions ACID transactions over a random-access repository Observers: a way to organize an incremental computation 72 36
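Returning to Dynamo's data versioning: the rule that both versions must be kept when neither clock descends from the other can be sketched with a minimal vector-clock comparison (helper names are hypothetical):

```python
# A clock is a dict node -> counter. A version may be dropped only if its
# clock is dominated; otherwise the versions are concurrent and both are kept
# for read-time reconciliation.
def descends(a, b):
    """True if clock a is equal to or a successor of clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(v1, v2):
    if descends(v1, v2):
        return [v1]
    if descends(v2, v1):
        return [v2]
    return [v1, v2]            # concurrent branches: keep both versions

d1 = {"Sx": 2, "Sy": 1}        # written at Sx, then at Sy
d2 = {"Sx": 2, "Sz": 1}        # concurrent branch written at Sz
print(reconcile(d1, d2))                           # both versions kept
print(reconcile({"Sx": 3, "Sy": 1, "Sz": 1}, d1))  # merged write supersedes d1
```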
Notifications Observers = user-written code that is triggered by changes to the table Similar to database triggers or events in active databases A Percolator application is a series of observers Notifications designed to help structure an incremental computation 73 References Network Architecture A Scalable, Commodity Data Center Network Architecture, M. Al-Fares, A. Loukissas and A. Vahdat, SIGCOMM 2008 File System The Google File System, S. Ghemawat, H. Gobioff and S.-T. Leung, SOSP 2003 Job Scheduling Quincy: Fair Scheduling for Distributed Computing Clusters, M. Isard et al., SOSP 2009 Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, M. Zaharia et al., EuroSys 2010 74 37
References (cont'd) Parallel Computing MapReduce: Simplified Data Processing on Large Clusters, J. Dean and S. Ghemawat, OSDI 2004 Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, M. Isard et al., EuroSys 2007 SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets, R. Chaiken et al., VLDB 2008 DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, Y. Yu et al. Structured Data Bigtable: A Distributed Storage System for Structured Data, F. Chang et al., OSDI 2006 Dynamo: Amazon's Highly Available Key-value Store, G. DeCandia et al., SOSP 2007 Large-scale Incremental Processing Using Distributed Transactions and Notifications, D. Peng and F. Dabek, OSDI 2010 75 38