Camdoop: Exploiting In-network Aggregation for Big Data Applications (Paolo Costa)
1 Camdoop: Exploiting In-network Aggregation for Big Data Applications
Joint work with Austin Donnelly, Antony Rowstron, and Greg O'Shea (MSR Cambridge)
2 MapReduce Overview
Input file (chunks) -> map tasks -> intermediate results -> reduce tasks -> final results
Map: processes input data and generates (key, value) pairs
Shuffle: distributes the intermediate pairs to the reduce tasks
Reduce: aggregates all values associated with each key
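The three phases can be sketched in a few lines; this is a minimal single-process illustration of the model (the function names are ours, not from any particular MapReduce implementation):

```python
# Minimal single-process sketch of the MapReduce model (illustrative only).
from collections import defaultdict

def map_fn(chunk):
    """Map: emit (key, value) pairs; here, one (word, 1) per occurrence."""
    return [(word, 1) for word in chunk.split()]

def shuffle(all_pairs):
    """Shuffle: group every value by its key (the network phase)."""
    groups = defaultdict(list)
    for pairs in all_pairs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: aggregate all values associated with one key."""
    return key, sum(values)

chunks = ["a rose is", "a rose", "is a rose"]
groups = shuffle([map_fn(c) for c in chunks])
result = dict(reduce_fn(k, v) for k, v in groups.items())
# result == {'a': 3, 'rose': 3, 'is': 2}
```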
3 Problem
The shuffle phase is challenging for data center networks
All-to-all traffic pattern with O(N^2) flows
This led to proposals for full-bisection bandwidth
4-5 Data Reduction
The final results are typically much smaller than the intermediate results
In most Facebook jobs the final size is 5.4% of the intermediate size
In most Yahoo jobs the ratio is 8.2%
How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?
6-7 Background: Combiners
To reduce the data transferred in the shuffle, users can specify a combiner function
It aggregates the local intermediate pairs
Server-side only => limited aggregation
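The effect of a combiner can be sketched as follows (our illustration, not actual Hadoop API code): pre-aggregating a map task's local pairs shrinks what crosses the network from one pair per occurrence to at most one pair per distinct key.

```python
# Illustrative combiner: pre-aggregate one map task's local pairs
# before they are sent over the network in the shuffle.
from collections import Counter

def combiner(local_pairs):
    """Sum the local intermediate (word, count) pairs of one map task."""
    combined = Counter()
    for key, value in local_pairs:
        combined[key] += value
    return list(combined.items())

pairs = [("rose", 1), ("is", 1), ("rose", 1), ("rose", 1)]
sent = combiner(pairs)  # 2 pairs cross the network instead of 4
```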
8 Distributed Combiners
It has been proposed to use aggregation trees in MapReduce to perform multiple steps of combiners
e.g., rack-level aggregation [Yu et al., SOSP '09]
9-11 Logical and Physical Topology
What happens when we map the aggregation tree to a typical data center topology (servers under a ToR switch)?
Mismatch between physical and logical topology: two logical links are mapped onto the same physical link
The parent's server link is the bottleneck: only 500 Mbps per child
Full-bisection bandwidth does not help here
Camdoop goal: perform the combiner functions within the network, as opposed to application-level solutions, and reduce shuffle time by aggregating packets on path
12-15 How Can We Perform In-network Processing?
We exploit CamCube: a direct-connect topology (3D torus) that uses no switches or routers for internal traffic
Servers intercept, forward and process packets
Nodes have (x,y,z) coordinates, e.g., (1,2,1) and (1,2,2)
This defines a key-space (=> key-based routing)
Coordinates are locally re-mapped in case of failures
Key property: no distinction between network and computation devices; servers can perform arbitrary packet processing on-path
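The coordinate space and key-based routing can be illustrated with a small sketch (our own, not CamCube code): each node has six wrap-around neighbors, and packets are forwarded hop by hop, greedily, toward a destination coordinate.

```python
# Illustrative 3D-torus neighbor computation and greedy key-based routing
# (a sketch, not the CamCube implementation; names are ours).
N = 3  # 3 x 3 x 3 CamCube, as in the paper's testbed

def neighbors(node):
    """The six one-hop neighbors of a node, with wrap-around links."""
    x, y, z = node
    return [((x + dx) % N, (y + dy) % N, (z + dz) % N)
            for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1)]]

def torus_dist(a, b):
    """Hop distance on the torus (shorter way around each dimension)."""
    return sum(min(abs(ai - bi), N - abs(ai - bi)) for ai, bi in zip(a, b))

def route(src, dst):
    """Greedy hop-by-hop route: always step to a neighbor closer to dst."""
    path, cur = [src], src
    while cur != dst:
        cur = min(neighbors(cur), key=lambda n: torus_dist(n, dst))
        path.append(cur)
    return path
```

Each hop strictly decreases the torus distance, so the route always terminates after exactly `torus_dist(src, dst)` hops.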
16 Mapping the Tree
On a switched topology, the 1 Gbps server link becomes the bottleneck: 1/in-degree Gbps per child
On CamCube, there is a 1:1 mapping between logical and physical topology: 1 Gbps per child
Packets are aggregated on path (=> less traffic)
17 Camdoop Design Goals
1. No change in the programming model
2. Exploit network locality
3. Good server and link load distribution
4. Fault-tolerance
18 Design Goal #1: Programming Model
Camdoop adopts the same MapReduce model
GFS-like distributed file system; each server runs map tasks on local chunks
We use a spanning tree: combiners aggregate the map task outputs and the children's results (if any) and stream the results to their parents
The root runs the reduce task and generates the final output
19-22 Design Goal #2: Network Locality
How to map the tree nodes to servers?
Map task outputs are always read from the local disk
The parent-children pairs are mapped onto physical neighbors, e.g., (1,1,1), (1,2,1), (1,2,2)
This ensures maximum locality and optimizes the network transfer
23 Network Locality
Logical view vs. physical view (3D torus)
One physical link is used by one and only one logical link
24-28 Design Goal #3: Load Distribution
Two issues with a single tree: servers have different in-degrees (poor server load distribution), and each server uses only 1 Gbps instead of 6 (poor bandwidth utilization)
Solution: stripe the data across 6 disjoint trees
Different links are used; all links are used => (up to) 6 Gbps / server
Good load distribution
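Striping can be sketched as a deterministic key-to-tree assignment (a hypothetical illustration; Camdoop builds six edge-disjoint trees in the torus, and the constant and function names below are ours):

```python
# Illustrative striping of intermediate keys across disjoint trees.
import zlib

NUM_TREES = 6

def tree_for_key(key):
    """Deterministically assign a key to one of the six disjoint trees,
    using a stable checksum so every server agrees on the assignment."""
    return zlib.crc32(key.encode()) % NUM_TREES

def stripe(pairs):
    """Partition (key, value) pairs into per-tree streams, so traffic
    spreads over all six of a server's links."""
    streams = [[] for _ in range(NUM_TREES)]
    for key, value in pairs:
        streams[tree_for_key(key)].append((key, value))
    return streams
```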
29 Design Goal #4: Fault-tolerance
The tree is built in the coordinate space
CamCube remaps coordinates in case of failures
Details in the paper
30 Evaluation
Testbed: 27-server CamCube (3 x 3 x 3)
Quad-core Intel Xeon, 12 GB RAM, 6 Intel PRO/1000 PT 1 Gbps ports
Runtime & services implemented in user-space
Simulator: packet-level simulator (CPU overhead not modelled), 512-server (8 x 8 x 8) CamCube
31 Evaluation: Design and Implementation Recap
Camdoop: shuffle & reduce parallelized
The reduce phase is parallelized with the shuffle phase: since all streams are ordered, as soon as the root receives at least one packet from all children, it can start the reduce function
No need to store intermediate results to disk on the reduce servers
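The ordered-stream property can be sketched with Python's heapq.merge (our illustration, not the Camdoop implementation): the root merges sorted child streams and emits each reduced key as soon as its run of pairs ends, without buffering whole files.

```python
# Streaming reduce over sorted child streams: emit a reduced
# (key, total) as soon as all of that key's pairs have arrived.
import heapq
from itertools import groupby
from operator import itemgetter

def streaming_reduce(child_streams):
    """child_streams: iterables of (key, value), each sorted by key.
    Yields (key, sum_of_values) incrementally."""
    merged = heapq.merge(*child_streams, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(value for _, value in group)

a = [("apple", 2), ("pear", 1)]
b = [("apple", 1), ("kiwi", 4), ("pear", 2)]
result = list(streaming_reduce([a, b]))
# result == [('apple', 3), ('kiwi', 4), ('pear', 3)]
```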
32-34 Evaluation: Design and Implementation Recap (cont.)
Camdoop: shuffle & reduce parallelized, runs on CamCube, six disjoint trees, in-network aggregation
TCP Camdoop (switch): the 27 CamCube servers attached to a ToR switch; TCP is used to transfer data in the shuffle phase
Camdoop (no agg): like Camdoop but without in-network aggregation; shows the impact of just running on CamCube
35-36 Validation against Hadoop & Dryad
Sort and WordCount; time (s), logscale
The Camdoop baselines (TCP Camdoop (switch), Camdoop (no agg)) are competitive against Hadoop and Dryad/DryadLINQ
Several reasons: shuffle and reduce parallelized, fine-tuned implementation
We consider these as our baselines
37 Parameter Sweep
Output size / intermediate size (S):
S = 1 (no aggregation): every key is unique
S = 1/N (full aggregation): every key appears in all map task outputs
We use synthetic workloads to explore different values of S; the intermediate data size is 22.2 GB (843 MB/server)
Reduce tasks (R):
R = 1 (all-to-one): e.g., interactive queries, top-k jobs
R = N (all-to-all): common setup in MapReduce jobs; N output files are generated
38-40 All-to-one (R=1)
Time (s), logscale, vs. output size / intermediate size (S), from full aggregation to no aggregation
For TCP Camdoop (switch) and Camdoop (no agg), performance is independent of S; the gap between them shows the impact of running on CamCube
The gap to Camdoop shows the impact of in-network aggregation
The aggregation ratio reported by Facebook is marked on the x-axis
41-42 All-to-all (R=27)
Time (s), logscale, vs. output size / intermediate size (S), from full aggregation to no aggregation
Lines: TCP Camdoop (switch), Camdoop (no agg), Camdoop, and Switch 1 Gbps (bound), the maximum theoretical performance over the switch
The gaps show the impact of running on CamCube and the impact of in-network aggregation
43-45 Number of Reduce Tasks (S=0)
Time (s), logscale, vs. # reduce tasks (R), from all-to-one to all-to-all
For TCP Camdoop (switch) and Camdoop (no agg), performance depends on R (an implementation bottleneck is noted on the plot)
For Camdoop, R does not (significantly) impact performance; speedup annotations (e.g., 4.31x, 1.91x, 1.13x) appear on the plot
All resources are used even when R = 1
Camdoop decouples the job execution time from the number of output files generated
46-47 Behavior at Scale (simulated)
Camdoop, N = 512; time (s), logscale, vs. # reduce tasks (R), logscale, from all-to-one (R=1) to all-to-all (R=512)
Compared against Switch 1 Gbps (bound); note that this bound assumes full-bisection bandwidth
48 Beyond MapReduce
More experiments (failures, multiple jobs, ...) in the paper
49 Beyond MapReduce
The partition-aggregate model is also common in interactive services, e.g., Bing Search, Google Dremel
(Diagram: requests flow from a web server through parent servers to leaf servers, with a cache)
Small-scale data: 10s to 100s of KB returned per server
Typically these services use one reduce task (R=1): a single result must be returned to the user
Full aggregation is common (S ~ 0), e.g., N servers generate their best k responses each and the final result contains the best k responses
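The best-k example can be sketched directly (our illustration, not Camdoop code): each aggregation step merges its children's top-k lists into a single top-k list, so the data volume stays about k items at every level of the tree.

```python
# Minimal sketch of partition-aggregate top-k.
import heapq

K = 3

def best_k(responses, k=K):
    """A leaf server's local top-k responses, highest score first."""
    return heapq.nlargest(k, responses)

def aggregate(child_results, k=K):
    """A parent (or the root) merges its children's top-k lists; the
    output is again only k items => full aggregation, S ~ 0."""
    return heapq.nlargest(k, (r for child in child_results for r in child))

leaves = [best_k([5, 1, 9]), best_k([7, 2, 8]), best_k([3, 6, 4])]
final = aggregate(leaves)  # overall best 3 scores across all leaves
```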
50-51 Small-scale Data (R=1, S=0)
Time (ms) vs. input data size per server (20 KB, 200 KB)
Lines: TCP Camdoop (switch), Camdoop (no agg), Camdoop
In-network aggregation can be beneficial also for (small-scale data) interactive services
52 Conclusions
Camdoop explores the benefits of in-network processing by running combiners within the network
No change in the programming model
Achieves lower shuffle and reduce time
Decouples performance from the # of output files
A final thought: how would Camdoop run on this? AMD SeaMicro, a 512-core cluster for data centers using a 3D torus, with a fast interconnect (5 Gbps / link)
Camdoop: Exploiting In-network Aggregation for Big Data Applications. Paolo Costa, Austin Donnelly, Antony Rowstron, Greg O'Shea. Microsoft Research Cambridge / Imperial College London.
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data
More informationApplication-Aware SDN Routing for Big-Data Processing
Application-Aware SDN Routing for Big-Data Processing Evaluation by EstiNet OpenFlow Network Emulator Director/Prof. Shie-Yuan Wang Institute of Network Engineering National ChiaoTung University Taiwan
More informationBazaar: Enabling Predictable Performance in Datacenters
Bazaar: Enabling Predictable Performance in Datacenters Virajith Jalaparti, Hitesh Ballani, Paolo Costa, Thomas Karagiannis, Antony Rowstron MSR Cambridge, UK Technical Report MSR-TR-22-38 Microsoft Research
More informationAn Exploration of Designing a Hybrid Scale-Up/Out Hadoop Architecture Based on Performance Measurements
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI.9/TPDS.6.7, IEEE
More informationNetwork Design Considerations for Grid Computing
Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom
More informationThrill : High-Performance Algorithmic
Thrill : High-Performance Algorithmic Distributed Batch Data Processing in C++ Timo Bingmann, Michael Axtmann, Peter Sanders, Sebastian Schlag, and 6 Students 2016-12-06 INSTITUTE OF THEORETICAL INFORMATICS
More informationBig Data 7. Resource Management
Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage
More informationBatch Processing Basic architecture
Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3
More informationFuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc
Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,
More informationNext-Generation Cloud Platform
Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology
More informationAnnouncements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414
Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationApril Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.
1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School
More informationDistributed Computation Models
Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case
More informationChisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique
Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.
More informationDistributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju
Distributed Data Infrastructures, Fall 2017, Chapter 2 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note: Term Warehouse-scale
More informationResource and Performance Distribution Prediction for Large Scale Analytics Queries
Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia
More informationWhat Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture
What Is Datacenter (Warehouse) Computing Distributed and Parallel Technology Datacenter, Warehouse and Cloud Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University,
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationTechnical Report. Recomputation-based data reliability for MapReduce using lineage. Sherif Akoush, Ripduman Sohan, Andy Hopper. Number 888.
Technical Report UCAM-CL-TR-888 ISSN 1476-2986 Number 888 Computer Laboratory Recomputation-based data reliability for MapReduce using lineage Sherif Akoush, Ripduman Sohan, Andy Hopper May 2016 15 JJ
More informationPROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP
ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge
More informationCSE 344 MAY 2 ND MAP/REDUCE
CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data
More informationFLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568
FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationTrack Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross
Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk
More informationOn Smart Query Routing: For Distributed Graph Querying with Decoupled Storage
On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage Arijit Khan Nanyang Technological University (NTU), Singapore Gustavo Segovia ETH Zurich, Switzerland Donald Kossmann Microsoft
More informationEN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University
EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University Material from: The Datacenter as a Computer: An Introduction to
More informationCS-2510 COMPUTER OPERATING SYSTEMS
CS-2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application Functional
More information