Accelerating Analytical Workloads

Similar documents
Track Join: Distributed Joins with Minimal Network Traffic. Orestis Polychroniou, Rajkumar Sen, Kenneth A. Ross

Accelerating Foreign-Key Joins using Asymmetric Memory Channels

Things To Know. When Buying for an! Alekh Jindal, Jorge Quiané, Jens Dittrich

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

HyPer on Cloud 9. Thomas Neumann. February 10, Technische Universität München

A Linear Algebra Approach to OLAP

Database Group Research Overview. Immanuel Trummer

Column Stores vs. Row Stores How Different Are They Really?

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

Huge market -- essentially all high performance databases work this way

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

What We Have Already Learned. DBMS Deployment: Local. Where We Are Headed Next. DBMS Deployment: 3 Tiers. DBMS Deployment: Client/Server

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

CSE 344 MAY 2 ND MAP/REDUCE

Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age. Presented by Dennis Grishin

MapReduce. Stony Brook University CSE545, Fall 2016

SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research

Improving the MapReduce Big Data Processing Framework

arxiv: v1 [cs.db] 15 Sep 2017

HyPer-sonic Combined Transaction AND Query Processing

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Jignesh M. Patel. Blog:

Data Transformation and Migration in Polystores

Hash Joins for Multi-core CPUs. Benjamin Wagner

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

Introduction to Column Stores with MemSQL. Seminar Database Systems Final presentation, 11. January 2016 by Christian Bisig

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Introduction to Data Management CSE 344

Evaluation of Relational Operations

CSE 190D Spring 2017 Final Exam Answers

Introduction to Data Management CSE 344

Red Fox: An Execution Environment for Relational Query Processing on GPUs

New Developments in Spark

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

Similarity Joins in MapReduce

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Red Fox: An Execution Environment for Relational Query Processing on GPUs

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

Service Oriented Performance Analysis

A Fast and High Throughput SQL Query System for Big Data

Leveraging Lock Contention to Improve Transaction Applications. University of Washington

Introduction to Data Management CSE 344

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Chapter 18: Parallel Databases

CSE 190D Spring 2017 Final Exam

CS-2510 COMPUTER OPERATING SYSTEMS

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

HyPer-sonic Combined Transaction AND Query Processing

Load Balancing for Entity Matching over Big Data using Sorted Neighborhood

April Copyright 2013 Cloudera Inc. All rights reserved.

Staying FIT: Efficient Load Shedding Techniques for Distributed Stream Processing

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration

A Graph-based Database Partitioning Method for Parallel OLAP Query Processing

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

The Stratosphere Platform for Big Data Analytics

Developing MapReduce Programs

Programming Models MapReduce

CAVA: Exploring Memory Locality for Big Data Analytics in Virtualized Clusters

High-Performance ACID via Modular Concurrency Control

Whitepaper / Benchmark

COLUMN-STORES VS. ROW-STORES: HOW DIFFERENT ARE THEY REALLY? DANIEL J. ABADI (YALE) SAMUEL R. MADDEN (MIT) NABIL HACHEM (AVANTGARDE)

data parallelism Chris Olston Yahoo! Research

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

HYRISE In-Memory Storage Engine

Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

Parallel machines are becoming quite common and affordable. Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

CS542. Algorithms on Secondary Storage Sorting Chapter 13. Professor E. Rundensteiner. Worcester Polytechnic Institute

The Evolution of Big Data Platforms and Data Science

Revisiting Aggregation for Data Intensive Applications: A Performance Study

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

Welcome to the presentation. Thank you for taking your time for being here.

Data Processing on Emerging Hardware

A Comparison of MapReduce Join Algorithms for RDF

Introduction to Database Systems CSE 414

DBMS Data Loading: An Analysis on Modern Hardware. Adam Dziedzic, Manos Karpathiotakis*, Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

Analytics in the cloud

Can We Trust SQL as a Data Analytics Tool?

Implementation of Relational Operations

Chapter 17: Parallel Databases

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Track Join: Distributed Joins with Minimal Network Traffic

Advanced Databases: Parallel Databases A.Poulovassilis

Shark: Hive (SQL) on Spark

CompSci 516: Database Systems

Advanced Database Systems

Transcription:

Accelerating Analytical Workloads
Thomas Neumann, Technische Universität München
April 15, 2014

Scale Out in Big Data Analytics
- Big Data usually means data is distributed
- Scale out to process very large inputs
- But for analytics, data has to be combined and aggregated
- Typically map/reduce-based (Hadoop/Hive etc.): data is copied to processing nodes for aggregation; not very smart, dominated by network traffic
- Smart data movement can speed up processing significantly
[Figures: query plan for TPC-H Query 12 (sort, group by, join of lineitem and orders); main-memory cluster of six nodes connected by a switch]

Running Example (1)
- Focus on analytical query processing in this talk
- TPC-H query 12 used as running example
- Runtime dominated by the join of orders and lineitem
- Example from a well-known benchmark, but applicable to all distributed joins
[Figure: query plan for TPC-H Query 12 (sort, group by, join of lineitem and orders)]

Running Example (2)
- Relations are equally distributed across the nodes
- We make no assumptions about the data distribution
- Thus, tuples may join with tuples on remote nodes
- Communication over the network is required
[Figure: example orders tuples (key, priority) and lineitem tuples (key, shipmode) spread over nodes 1-3]

CPU vs. Network
- CPU speed has grown much faster than network bandwidth
[Figure: computing power in MIPS (80286, 80386, 80486, Pentium, Pentium Pro, Pentium 4, Core 2, Core i7, per core) vs. LAN bandwidth in Mbit/s (Token Ring, Fast Ethernet, Gigabit Ethernet, 10 Gigabit forecast), 1980-2020, both on log scales]

Scale Out: Network is the Bottleneck
- Single node: performance is bound algorithmically
- Cluster: the network is the bottleneck for query processing
- Investing time and effort in decreasing network traffic pays off
- Goal: increase local processing to close the performance gap
[Figure: join performance in M tuples/s, distributed vs. local]

Neo-Join: Network-optimized Join [ICDE14]
1. Open Shop Scheduling: efficient network communication
2. Optimal Partition Assignment: increase local processing
3. Selective Broadcast: handle value skew

Bandwidth Sharing
- Simultaneous use of a single link creates a bottleneck
- Reduces bandwidth by at least a factor of 2
[Figure: nodes 1-3 connected by a switch; when two transfers use the same link, that link becomes the bottleneck]

Naïve Schedule
- Nodes 2 and 3 send to node 1 at the same time
- Bandwidth sharing increases the network duration significantly
[Figure: transfer timelines for nodes 1-3 with and without bandwidth sharing]

Open Shop Scheduling (1)
Avoiding bandwidth sharing translates to open shop scheduling:
- A sender has one transfer per receiver
- A receiver should receive at most one transfer at a time
- A sender should send at most one transfer at a time
[Figure: bipartite graph of transfers between senders (nodes 1-3) and receivers (nodes 1-3)]

Open Shop Scheduling (2)
Compute the optimal schedule:
- Edge weights represent the total transfer duration
- The scheduler repeatedly finds perfect matchings
- Each matching specifies one communication phase
- Transfers in a phase will never share bandwidth
[Figure: weighted bipartite graph of transfers between senders and receivers (nodes 1-3)]
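The Python sketch below illustrates this phase-building idea under simplifying assumptions: durations[s][r] is the amount sender s must ship to receiver r, each phase is a bipartite matching (every node sends to at most one node and receives from at most one), and a phase runs until its shortest transfer finishes. Neo-Join's actual scheduler chooses its matchings more carefully to provably reach the lower bound max(max row sum, max column sum); this sketch just takes any maximum matching per round.

```python
def max_matching(adj, n_right):
    """Kuhn's augmenting-path algorithm for maximum bipartite matching.
    adj[s] lists the receivers that sender s still has data for."""
    match_right = [-1] * n_right  # receiver -> sender

    def try_augment(s, visited):
        for r in adj[s]:
            if r in visited:
                continue
            visited.add(r)
            if match_right[r] == -1 or try_augment(match_right[r], visited):
                match_right[r] = s
                return True
        return False

    for s in range(len(adj)):
        try_augment(s, set())
    return [(match_right[r], r) for r in range(n_right) if match_right[r] != -1]


def open_shop_schedule(durations):
    """Split an all-to-all transfer matrix into communication phases.
    Within a phase no two transfers share a link; a phase lasts until
    its shortest transfer completes (a preemptive simplification)."""
    n = len(durations)
    remaining = [row[:] for row in durations]
    phases = []
    while True:
        adj = [[r for r in range(n) if remaining[s][r] > 0] for s in range(n)]
        matching = max_matching(adj, n)
        if not matching:
            break
        step = min(remaining[s][r] for s, r in matching)
        for s, r in matching:
            remaining[s][r] -= step
        phases.append((step, matching))
    return phases


# Example: every node must ship 4 units to every other node (zero diagonal).
example = [[0, 4, 4],
           [4, 0, 4],
           [4, 4, 0]]
phases = open_shop_schedule(example)
for step, matching in phases:
    print(step, matching)
print("network duration:", sum(step for step, _ in phases))
```

On this symmetric example the sketch needs two phases of length 4, matching the lower bound; with skewed transfer sizes the simple matching choice can be worse than the optimal open shop schedule.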

Optimal Schedule
- The open shop schedule achieves the minimal network duration
- The schedule duration is determined by the maximum straggler
[Figure: transfer timelines for nodes 1-3 under bandwidth sharing vs. the open shop schedule; the maximum straggler determines the duration]

Distributed Join
- Tuples may join with tuples on remote nodes
- Repartition and redistribute both relations for a local join
- Tuples will join only with the corresponding partition
- Using a hash, range, radix, or other partitioning scheme
- In any case: decide how to assign partitions to nodes
[Figure: relations O and L, initially fragmented across the nodes, redistributed into partitions O1-O3 and L1-L3]

Running Example: Hash Partitioning
[Figure: orders and lineitem tuples on nodes 1-3, hash-partitioned on the join key with (x + 2) mod 3 into partitions P1-P3]
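As a rough illustration of this repartitioning step, the sketch below hash-partitions a handful of made-up orders and lineitem tuples on the join key, using the (x + 2) mod 3 function from the figure, so that matching tuples land in the same partition and can be joined locally. The sample data and helper names are illustrative, not taken from the talk.

```python
from collections import defaultdict

def hash_partition(tuples, key_of, num_partitions):
    """Assign every tuple to a partition based on its join key, so that
    tuples with equal keys from both relations land in the same partition."""
    partitions = defaultdict(list)
    for t in tuples:
        # (x + 2) mod 3, as in the running example's figure
        partitions[(key_of(t) + 2) % num_partitions].append(t)
    return partitions

# Tiny samples in the spirit of the running example: (key, payload).
orders = [(1, "1-URGENT"), (2, "2-HIGH"), (6, "1-URGENT"), (13, "1-URGENT")]
lineitem = [(1, "MAIL"), (2, "SHIP"), (6, "SHIP"), (13, "MAIL"), (13, "SHIP")]

o_parts = hash_partition(orders, lambda t: t[0], 3)
l_parts = hash_partition(lineitem, lambda t: t[0], 3)

# After redistribution, partition p of both relations is joined locally
# on whichever node partition p was assigned to.
for p in range(3):
    local_join = [(o, l) for o in o_parts[p] for l in l_parts[p] if o[0] == l[0]]
    print(f"partition {p}: {len(local_join)} joined tuples")
```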

Assign Partitions to Nodes (1)
Option 1: Minimize network traffic
- Assign each partition to the node that owns its largest fragment
- Only the small fragments of a partition are sent over the network
- A schedule with minimal network traffic may still have a high duration
[Figure: hash partitioning (x mod 3) into P1-P3 and the resulting open shop schedule; traffic: 26, time: 26]
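A minimal sketch of this min-traffic heuristic, assuming a small matrix h[i][j] of fragment sizes; the numbers are illustrative, not the exact figures from the slide.

```python
def assign_min_traffic(h):
    """Option 1: assign every partition to the node that already owns its
    largest fragment, so only the smaller fragments travel over the network.
    h[i][j] is the size of the fragment of partition j stored on node i.
    Returns the assignment and the resulting total network traffic."""
    n, p = len(h), len(h[0])
    assignment = [max(range(n), key=lambda i: h[i][j]) for j in range(p)]
    traffic = sum(h[i][j] for j in range(p) for i in range(n) if i != assignment[j])
    return assignment, traffic

# Illustrative fragment sizes h[node][partition].
h = [[5, 6, 5],
     [4, 2, 3],
     [4, 2, 3]]
print(assign_min_traffic(h))
```

Note that on this input every partition ends up on node 0, which minimizes traffic but turns node 0 into a receive straggler, exactly the problem the next slide addresses.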

Assign Partitions to Nodes (2)
Option 2: Minimize response time
- Query response time is the time from request to result
- Query response time is dominated by the network duration
- To minimize the network duration, minimize the maximum straggler
[Figure: the same hash partitioning (x mod 3) with two partition assignments; minimizing traffic yields traffic 26 and time 26, minimizing the maximum straggler yields traffic 28 and time 10]

Minimize Maximum Straggler (Phase 2)
- Formalized as a mixed-integer linear program: decide how to assign partitions to nodes
- Optimal partition assignment is NP-hard in the worst case
- But in practice fast enough using CPLEX or Gurobi (< 0.5 % overhead for 32 nodes with 200 partitions each)
- Works with any partitioning scheme

From the paper: the send cost of node i sums all local fragments of partitions that are not assigned to it; likewise, the receive cost adds the size of remote fragments of partitions that were assigned to node i. MILPs require a linear objective, which minimizing a maximum is not. Fortunately, we can rephrase the objective and instead minimize a new variable w; additional constraints take care that w assumes the maximum over the send/receive costs (OPT-ASSIGN):

minimize w, subject to
  w - \sum_{j=0}^{p-1} h_{ij} (1 - x_{ij}) \geq 0                         for 0 \leq i < n
  w - \sum_{j=0}^{p-1} x_{ij} \sum_{k=0,\, k \neq i}^{n-1} h_{kj} \geq 0   for 0 \leq i < n
  \sum_{i=0}^{n-1} x_{ij} = 1                                             for 0 \leq j < p

One can obtain an optimal solution to the partition assignment problem (OPT-ASSIGN) by passing this mixed-integer linear program to an optimizer such as Gurobi (http://www.gurobi.com) or IBM CPLEX. These solvers can be linked as a library to create and solve linear programs via API calls.
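For intuition, here is a brute-force Python sketch of the same objective: it enumerates every partition-to-node assignment and keeps the one with the smallest maximum per-node send/receive cost. This is only viable for tiny instances and the fragment sizes are made up; the actual system hands the MILP above to a solver such as Gurobi or CPLEX.

```python
from itertools import product

def opt_assign_bruteforce(h):
    """Exhaustively search for the partition-to-node assignment that
    minimizes the maximum straggler, i.e. the largest per-node send or
    receive cost. h[i][j] is the size of the fragment of partition j
    stored on node i."""
    n, p = len(h), len(h[0])
    best_assignment, best_makespan = None, float("inf")
    for assignment in product(range(n), repeat=p):  # assignment[j] = node of partition j
        # Send cost: local fragments of partitions NOT assigned to node i.
        send = [sum(h[i][j] for j in range(p) if assignment[j] != i)
                for i in range(n)]
        # Receive cost: remote fragments of partitions assigned to node i.
        recv = [sum(h[k][j] for j in range(p) if assignment[j] == i
                    for k in range(n) if k != i)
                for i in range(n)]
        makespan = max(send + recv)
        if makespan < best_makespan:
            best_assignment, best_makespan = assignment, makespan
    return best_assignment, best_makespan

# Illustrative fragment sizes h[node][partition].
h = [[5, 6, 5],
     [4, 2, 3],
     [4, 2, 3]]
assignment, makespan = opt_assign_bruteforce(h)
print("partition -> node:", assignment, "max straggler:", makespan)
```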

Running Example: Locality
[Figure: orders and lineitem tuples on nodes 1-3, radix-partitioned on the most significant bits of the key into partitions P1-P3; almost all tuples fall into the partition that is local to their node]

Locality
- The running example exhibits time-of-creation clustering
- Radix repartitioning on the most significant bits retains this locality
- Partition assignment can exploit locality
- Significantly reduces the query response time
[Figure: radix partitioning (MSB) into P1-P3 and the resulting open shop schedule; traffic: 5, time: 3]
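A small sketch of MSB radix partitioning, assuming 5-bit keys and 2 radix bits (both choices are illustrative): consecutive key ranges map to the same partition, so keys that were created around the same time, and therefore tend to sit on the same node, mostly stay local after repartitioning.

```python
def radix_partition_msb(keys, key_bits, partition_bits):
    """Partition keys by their most significant bits. Consecutive key
    ranges map to the same partition, so relations whose keys are
    clustered by time of creation keep most tuples local and little
    data has to cross the network."""
    shift = key_bits - partition_bits
    partitions = {}
    for k in keys:
        partitions.setdefault(k >> shift, []).append(k)
    return partitions

# Assume 5-bit keys (0..31) and 2 radix bits, i.e. key ranges
# 0-7, 8-15, 16-23, 24-31 form the four partitions.
keys_node1 = [1, 1, 2, 2, 6, 6]      # "old" tuples with small keys
keys_node3 = [17, 18, 18, 19, 20]    # "new" tuples with large keys
print(radix_partition_msb(keys_node1, key_bits=5, partition_bits=2))
print(radix_partition_msb(keys_node3, key_bits=5, partition_bits=2))
```

Each node's keys end up almost entirely in a single partition, which the partition assignment can then leave on that node.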

Broadcast
- Alternative to data repartitioning
- Replicate the smaller relation O to all nodes
- The larger relation L remains fragmented across the nodes
[Figure: relation O is broadcast to every node and joined locally with the resident fragment of L]

Selective Broadcast
- Decide per partition whether to assign or to broadcast
- Broadcast orders for P2, let the line items remain fragmented
- Assign the other partitions taking locality into account
- Improves performance for high skew and many duplicates
[Figure: hash partitioning (mod 3) of O1-O3 and L1-L3 with and without selective broadcast; traffic: 14, time: 6]
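The sketch below captures the per-partition decision with a simple traffic-only comparison: broadcasting a partition's orders fragments costs roughly (n - 1) times their total size but leaves the line items in place, whereas assigning the partition ships both relations' remote fragments to one node. The fragment sizes are invented for illustration, and Neo-Join makes this choice as part of its scheduling optimization rather than with a standalone heuristic like this.

```python
def selective_broadcast_plan(o_frag, l_frag):
    """For every partition decide whether to broadcast the (small) orders
    fragments to all nodes or to co-locate both relations on one node.
    o_frag[i][j] / l_frag[i][j]: bytes of partition j of orders / lineitem
    stored on node i."""
    n, p = len(o_frag), len(o_frag[0])
    plan = []
    for j in range(p):
        o_total = sum(o_frag[i][j] for i in range(n))
        # Broadcast: every node needs all orders of partition j except its
        # own copy; the lineitem fragments stay where they are.
        broadcast_cost = sum(o_total - o_frag[i][j] for i in range(n))
        # Assign: ship both relations' remote fragments to the best node.
        assign_cost = min(
            sum(o_frag[k][j] + l_frag[k][j] for k in range(n) if k != i)
            for i in range(n))
        plan.append(("broadcast" if broadcast_cost < assign_cost else "assign", j))
    return plan

# Illustrative sizes: partition 1 has many duplicate lineitem keys (skew),
# so broadcasting its few orders tuples is cheaper than moving the lineitems.
o_frag = [[2, 1, 1], [1, 1, 1], [1, 1, 2]]
l_frag = [[6, 1, 1], [1, 6, 1], [1, 5, 2]]
print(selective_broadcast_plan(o_frag, l_frag))
```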

Experimental Setup
- Cluster of 4 nodes: Core i7, 4 cores, 3.4 GHz, 32 GB RAM
- Gigabit Ethernet
- Tuples consist of a 64-bit key and a 64-bit payload

Locality
- Vary locality from 0 % (uniform distribution) to 100 % (range partitioning)
- Neo-Join improves join performance from 29 M to 156 M tuples/s (> 500 %)
- Setup: 3 nodes, 600 M tuples
[Figure: join performance in tuples/s over locality (0 %-100 %) for Neo-Join, Hadoop Hive, DBMS-X, and MySQL Cluster]

TPC-H Results (scale factor 100)
- Results for three selected TPC-H queries
- Broadcast outperforms hash for large differences in relation size
- Neo-Join always performs better due to selective broadcast and locality
- Setup: 4 nodes, ca. 100 GB of data
[Figure: execution time in seconds of hash, broadcast, and Neo-Join for TPC-H queries Q12, Q14, and Q19]

Further Optimizations
- Network-aware joining is only one ingredient
- All query processing steps are important
  - parallel, network-aware, maximize locality [PVLDB12]
  - group by, sort, cube, ... [DEBUL14, SIGMOD13, PVLDB11]
  - also: smart loading/parsing [PVLDB13]
- Query optimization has a huge impact
  - reformulate the query into a more efficient form [EDBT14, ICDE12]
  - involves algebraic optimization, exploiting statistics, etc. [ICDE11]
  - can improve runtimes by orders of magnitude!
- The result is much faster than a naive map/reduce approach

Conclusion
- Analyzing Big Data is challenging
  - very large volume, distributed
  - many operations require joining data
  - the network is a bottleneck
- We can use optimization techniques to speed up the analysis
  - maximize bandwidth utilization
  - exploit data characteristics (locality, skew, etc.)
  - smart scheduling of operations
- Improves over commonly used approaches like Hive by orders of magnitude