Coflow. Big Data. Data-Parallel Applications. Big Datacenters for Massive Parallelism. Recent Advances and What s Next?

Size: px

Start display at page:

Download "Coflow. Big Data. Data-Parallel Applications. Big Datacenters for Massive Parallelism. Recent Advances and What s Next?"

Miles Riley
5 years ago
Views:

1 Big Data Coflow The volume of data businesses want to make sense of is increasing Increasing variety of sources Recent Advances and What s Next? Web, mobile, wearables, vehicles, scientific, Cheaper disks, SSDs, and memory Stalling processor speeds Mosharaf Chowdhury Big s for Massive Parallelism BlinkDB Storm Pregel DryadLINQ MapReduce Hadoop Dryad GraphLab Spark Data-Parallel Applications TensorFlow Spark-Streaming Multi-stage dataflow Computation interleaved with communication GraphX Dremel Computation Stage (e.g., Map, Reduce) Hive Distributed across many machines Tasks run in parallel Communication Stage (e.g., Shuffle) Between successive computation stages Reduce Stage A communication stage cannot complete until all the data have been transferred Map Stage

2 Communication is Crucial Performance Facebook jobs spend ~% of run on average in intermediate comm. As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck Flow Transfers data from a source to a destination Independent unit of allocation, sharing, load balancing, and/or prioritization Faster Communication Stages: Networking Approach Configuration should be handled at the system level. Based on a month-long trace with, jobs and Million tasks, collected from a -machine Facebook production MapReduce cluster. Existing Solutions Why Do They Fall Short? r r WFQ GPS RED CSFQ DeTail PDQ pfabric ECN XCP RCP DCTCP D TCP FCP 98s 99s s D s s s Input Links Network Output Links Per-Flow Fairness Flow Independent flows cannot capture the collective communication behavior common in data-parallel applications

3 Why Do They Fall Short? Why Do They Fall Short? r r s r s r s s s s s Network r s s Network r Link to r Link to r Per-Flow Fair Sharing 4 Shuffle = Avg. Flow =. Solutions focusing on flow completion cannot further decrease the shuffle completion Improve Application-Level Performance s s s Network r r Slow down faster flows to accelerate slower flows Coflow Communication abstraction for data-parallel applications to express their performance goals Link to r Link to r Per-Flow Fair Sharing 4 Shuffle = Avg. Flow =. Link to r Link to r Data-Proportional Per-Flow Fair Sharing Allocation Shuffle = 4 Avg. Flow = 4. Minimize completion s,. Meet deadlines, or. Perform fair allocation.. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM.

4 Broadcast Aggregation Shuffle All-to-All Single Flow Parallel Flows How to schedule coflows online... N... N for faster # completion of coflows? to meet # more deadlines? for fair # allocation of the network? Enables coflows in data-intensive clusters Coflow Communication abstraction for data-parallel applications to express their performance goals. Coflow Scheduler. Global Coordination. The Coflow API Faster, application-aware data transfers throughout the network Consistent calculation and enforcement of scheduler decisions Decouples network optimizations from applications, relieving developers and end users. The size of each flow,. The total number of flows, and. The endpoints of individual flows.. Efficient Coflow Scheduling with, SIGCOMM 4.

5 Benefits of Inter-Coflow Scheduling Inter-Coflow Scheduling is NP-Hard Coflow Coflow Coflow Coflow Link Units Link Units Link Units Units Link Units Units Fair Sharing Smallest-Flow First, The Optimal Fair Sharing Flow-level Prioritization The Optimal Concurrent Open Shop Scheduling L L 4 Coflow comp. = Coflow comp. = L L 4 Coflow comp. = Coflow comp. = L L 4 Coflow comp. = Coflow comp. = L L Examples include job scheduling and L caching blocks 4 Solutions use a ordering heuristic Coflow comp. = Coflow comp. = L 4 Coflow comp. = Coflow comp. = L L 4 Coflow comp. = Coflow comp. =. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM.. pfabric: Minimal Near-Optimal Transport, SIGCOMM.. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM... A pfabric: Note on Minimal the Complexity Near-Optimal of the Concurrent Transport, Open Shop SIGCOMM. Problem, Journal of Scheduling, 9(4):89 9, Inter-Coflow Scheduling is NP-Hard Coflow Coflow Link Units Link Units Units Employs a two-step algorithm to minimize coflow completion s Concurrent Open Shop Scheduling with Coupled Resources Examples include job scheduling and caching blocks Solutions use a ordering heuristic Consider matching constraints Input Links Output Links. Ordering heuristic. Allocation algorithm Keep an ordered list of coflows to be scheduled, preempting if needed Allocates minimum required resources to each coflow to finish in minimum

6 Ordering Heuristic Ordering Heuristic C ends C ends C ends C ends C ends C ends C ends C ends C ends O O O O O O O O O 9 Shortest-First (Total CCT = ) 9 Shortest-First () 9 Narrowest-First (Total CCT = 4) C C C Length C C C Width Ordering Heuristic Ordering Heuristic C ends C ends C ends C ends C ends C ends C ends C ends C ends C ends C ends C ends O O O O O O O O O O O O 9 Shortest-First () 9 Narrowest-First (4) 9 Shortest-First () 9 Narrowest-First (4) C C C Size 9 C ends C ends O O C ends C C C Bottleneck C ends C ends O O C ends C ends C ends O O C ends O O O 9 9 Smallest-First (4) Smallest-First (4) Smallest-Bottleneck () 9

7 Allocation Algorithm A coflow cannot finish before its very last flow Finishing flows faster than the bottleneck cannot decrease a coflow s completion Allocate minimum flow rates such that all flows of a coflow finish together on. Coflow Scheduler. Global Coordination. The Coflow API Enables coflows in data-intensive clusters Faster, application-aware data transfers throughout the network Consistent calculation and enforcement of scheduler decisions Decouples network optimizations from applications, relieving developers and end users The Need for Coordination The Need for Coordination 4 C ends C ends O 4 C ends C ends O C ends O C ends O O O O O O C C Bottleneck 4 Scheduling with Coordination (Total CCT = ) Scheduling with Coordination 7 Scheduling without Coordination (Total CCT = ) (Total CCT = 9) Uncoordinated local decisions interleave coflows, hurting performance

Architecture Enables coflows in data-intensive clusters Centralized master-slave architecture Applications use a client library to communicate with the master Actual timing and rates are determined

8 Architecture Enables coflows in data-intensive clusters Centralized master-slave architecture Applications use a client library to communicate with the master Actual timing and rates are determined by the coflow scheduler Sender Receiver Driver Put Get Reg Daemon Topology Monitor Usage Estimator Coflow Scheduler Master Daemon Network Interface Daemon (Distributed) File System TaskName Comp. Tasks calling f Client Library. Coflow Scheduler. Global Coordination. The Coflow API Faster, application-aware data transfers throughout the network Consistent calculation and enforcement of scheduler decisions Decouples network optimizations from applications, relieving developers and end users. Download from The Coflow API register put get unregister broadcast shuffle reducer s mappers driver (JobTracker). NO changes to user jobs. NO storage b register(broadcast) s register(shuffle) id b.put(content) b.unregister() b.get(id) id s s.get(id sl ) Evaluation A -machine trace-driven simulation matched against a -machine EC deployment. Does it improve performance?. Can it beat non-preemptive solutions?. Do we really need coordination? YES

9 Better than Per-Flow Fairness EC.X.8X.X.X Sim. Comm. Heavy Comm. Improv. Job Improv. 4.8X.X.9X.X Better than Per-Flow Fairness EC.8X.X Sim. Comm. Improv. Job Improv..X.X Preemption is Necessary [Sim.] Preemption is Necessary [Sim.].7.7 Overhead Over..... Per-Flow Fair FIFO Per-Flow Priority FIFO-LM NC, 4 Fairness Prioritization NO What About Starvation Overhead Over..... Per-Flow Fair FIFO Per-Flow Priority FIFO-LM NC, 4 Fairness Prioritization NO What About Starvation. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM

10 Lack of Coordination Hurts [Sim.].7 Coflow Communication abstraction for data-parallel applications to express their performance goals Overhead Over..... Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM. pfabric: Minimal Near-Optimal Transport, SIGCOMM 4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM 4., 4 Per-Flow Fair FIFO Per-Flow Priority FIFO-LM NC Fairness Prioritization Smallest-flow-first (per-flow priorities) Minimizes flow completion FIFO-LM 4 performs decentralized coflow scheduling Suffers due to local decisions Works well for small, similar coflows. The size of each flow,. The total number of flows, and. The endpoints of individual flows.! Pipelining between stages! Speculative executions! Task failures and restarts How to Perform Coflow Scheduling Without Complete Knowledge? Aalo Efficiently schedules coflows without complete and future information. Current size is a good predictor of actual size. Set priority that decreases by how much a coflow has sent. Discretize priority levels to blend in FIFO within each level. Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM

11 CODA Efficiently schedules coflows without changing applications * How to Perform Coflow Scheduling Without Changing the Applications?. Learn coflows online from traffic patterns. Error-tolerant scheduling to survive learning errors. Limited to jobs with single coflows. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM HUG Fairly schedules coflows instead of trying to minimize CCT What About Fair Coflow Scheduling?. Multi-resource fairness with high utilization. Fairness-utilization tradeoff results in prisoner s dilemma. HUG: Multi-Resource Fairness for Correlated and Elastic Demands, NSDI

12 Systems ME MIND THE GAP Networking Better capture application-level performance goals using coflows Coflows improve application-level performance and usability Extends networking and scheduling literature Coordination even if not free is worth paying for in many cases Improve Flow s Distributions of Coflow Characteristics Link to r Link to r Per-Flow Fair Sharing 4 s s s Shuffle = Avg. Flow =.. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM.. pfabric: Minimal Near-Optimal Transport, SIGCOMM. Link to r Link to r r r Smallest-Flow First, 4 4 Shuffle = Avg. Flow =. Frac. of Coflows Frac. of Coflows E+.E+8.E+.E+.E+4 Coflow Length (Bytes) E+.E+8.E+.E+.E+4 Coflow Size (Bytes) Frac. of Coflows Frac. of Coflows E+.E+4.E+8 Coflow Width (Number of Flows) E+.E+8.E+.E+.E+4 Coflow Bottleneck Size (Bytes)

13 Traffic Sources. Ingest and replicate new data. Read input from remote machines, when needed. Transfer intermediate data 4. Write and replicate output Percentage of Traffic by Category at Facebook 4 4 Distribution of Shuffle Durations Performance Facebook jobs spend ~% of run on average in intermediate comm. Fraction of Jobs Fraction of Run Spent in Shuffle Month-long trace from a - machine MapReduce production cluster at Facebook, jobs Million tasks Theoretical Results Structure of optimal schedules Permutation schedules might not always lead to the optimal solution Employs a two-step algorithm to support coflow deadlines Approximation ratio of COSS-CR Polynomial- algorithm with constant approximation ratio ( 4 ) The need for coordination Fully decentralized schedulers can perform arbitrarily worse than the optimal. Admission control. Allocation algorithm Do not admit any coflows that cannot be completed within deadline without violating existing deadlines Allocate minimum required resources to each coflow to finish them at their deadlines. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University, 4

14 More Predictable Experimental Methodology Facebook Trace Simulation EC Deployment % of Coflows 7 7 Fair EDF (Earliest-Deadline First) Fair Met Deadline Not Admitted Missed Deadline % of Coflows deployment in EC m.4xlarge machines Each machine has 8 CPU cores, 8.4 GB memory, and Gbps NIC ~9 Mbps/machine during all-to-all communication Trace-driven simulation Detailed replay of a day-long Facebook trace (circa October ) -machine,-rack cluster with : oversubscription. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan Coflow Recent Advances and What s Next? Mosharaf Chowdhury University of Michigan Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow Networking Open Source Apache Spark Open