DATABASE CRACKING: Fancy Scan, not Poor Man's Sort! Holger Pirk, Eleni Petraki, Stratos Idreos


1 DATABASE CRACKING: Fancy Scan, not Poor Man's Sort!
Hardware Folks, Cracking Folks, Don
Holger Pirk, Eleni Petraki, Stratos Idreos, Stefan Manegold, Martin Kersten

2 EVALUATING RANGE PREDICATES

3 COMPLEXITY ON PAPER
Scanning: O(n)
Sorting: O(n log(n))
Cracking: O(n), essentially a single Quicksort step
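
To make the "single Quicksort step" concrete, here is a minimal sketch of cracking as one in-place partitioning pass. The function name `crack` and the choice of sending values equal to the pivot to the upper side are assumptions for illustration, not taken from the slides.

```c
#include <stddef.h>

/* Crack `a` in place around `pivot`: afterwards every value < pivot
   precedes every value >= pivot. Returns the number of values < pivot.
   This is one linear pass, i.e. a single Quicksort partitioning step. */
size_t crack(int *a, size_t n, int pivot)
{
    size_t lo = 0, hi = n;   /* a[0..lo) < pivot, a[hi..n) >= pivot */
    while (lo < hi) {
        if (a[lo] < pivot) {
            lo++;            /* already on the correct side */
        } else {
            hi--;            /* move the value to the upper side */
            int tmp = a[lo];
            a[lo] = a[hi];
            a[hi] = tmp;
        }
    }
    return lo;
}
```

Each value is inspected once, which gives the O(n) bound. Note the data-dependent branch inside the loop, which is what the later slides set out to remove.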

4 COSTS IN REALITY
Implement microbenchmarks
1 billion uniform random integer values
Pivot in the middle of the range
Workstation machine (16 GB RAM, 4 Sandy Bridge cores)

5 COSTS IN REALITY
[Figure: wallclock time in s for Parallel Scanning, Cracking, and Parallel Sorting]

6 SO: WHAT'S GOING ON?

7 CACHE MISSES?
[Figure: L1I, L1D, L2, and L3 miss counts (axis from 0 to 1.5B) for Scanning, Cracking, and Sorting]
NOPE!

8 CPU COSTS
Micro-ops issued?
No: Allocation stall? No: Frontend Bound. Yes: Backend Bound.
Yes: Micro-op ever retire? No: Bad Speculation. Yes: Retiring!
Cache Miss Stalls, Other Stalls
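
The decision tree on this slide can be written down as a small classification function. This is only a sketch: the three boolean inputs and all identifiers are hypothetical stand-ins, since real measurements come from hardware performance counters rather than per-slot flags.

```c
#include <stdbool.h>

typedef enum {
    FRONTEND_BOUND,
    BACKEND_BOUND,
    BAD_SPECULATION,
    RETIRING
} slot_class;

/* Classify one pipeline slot by walking the slide's decision tree.
   The flags are hypothetical inputs used for illustration only. */
slot_class classify_slot(bool uop_issued, bool allocation_stall, bool uop_retired)
{
    if (!uop_issued) {
        /* Nothing was issued: blame the backend if allocation stalled,
           otherwise the frontend failed to deliver micro-ops. */
        return allocation_stall ? BACKEND_BOUND : FRONTEND_BOUND;
    }
    /* A micro-op was issued: it is useful work only if it retires. */
    return uop_retired ? RETIRING : BAD_SPECULATION;
}
```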

9 CPU COSTS
[Figure: share of Data Stalls, Retiring, Bad Speculation, Pipeline Frontend, and Pipeline Backend (axis up to 1.0) for Scanning, Cracking, and Sorting]

10 CPU COSTS
[Figure: the same breakdown for Scanning, Cracking, and Sorting, annotated "14 %!!!"]

11 CPU COSTS
[Figure: the same breakdown for Scanning, Cracking, and Sorting, annotated "Lots of Potential"]

12 WHAT CAN WE DO ABOUT IT?

13 INCREASING CPU EFFICIENCY

14 PREDICATION
Branching version:
    for (i = 0; i < size; i++)
        if (input[i] < pivot) {
            output[outi] = input[i];
            outi++;
        }
Predicated version:
    for (i = 0; i < size; i++) {
        output[outi] = input[i];
        outi += (input[i] < pivot);
    }

15 PREDICATION
Turns control dependencies into data dependencies
Eliminates branch mispredictions
Causes unconditional (potentially unnecessary) I/O (limited to caches)
Works only for out-of-place algorithms
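
The two loops from the predication slide, wrapped into self-contained functions so they can be compared side by side. The function names and the returned count are additions for illustration. The output buffer is assumed to hold at least `size` elements, because the predicated version writes unconditionally.

```c
#include <stddef.h>

/* Branching selection: one conditional jump per input value. */
size_t select_branching(const int *input, size_t size, int pivot, int *output)
{
    size_t outi = 0;
    for (size_t i = 0; i < size; i++) {
        if (input[i] < pivot) {
            output[outi] = input[i];
            outi++;
        }
    }
    return outi;
}

/* Predicated selection: every value is written, but the output cursor
   only advances when the predicate holds, so a non-qualifying value is
   overwritten by the next write. */
size_t select_predicated(const int *input, size_t size, int pivot, int *output)
{
    size_t outi = 0;
    for (size_t i = 0; i < size; i++) {
        output[outi] = input[i];              /* unconditional write */
        outi += (size_t)(input[i] < pivot);   /* adds 0 or 1 */
    }
    return outi;
}
```

Both functions produce the same qualifying prefix. The predicated one trades the branch for extra stores, which is the "unconditional I/O" noted above.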

16 PREDICATED CRACKING

17 PREDICATED CRACKING
[Figure: example array with pivot 5, showing the active and backup values]

18 PREDICATED CRACKING
[Figure: pivot, active, and backup values]

19 PREDICATED CRACKING
State Before Iteration
[Figure: pivot, cmp, active, and backup values]

20 PREDICATED CRACKING
Evaluate Predicate & Write
[Figure: pivot, cmp, active, and backup values]

21 PREDICATED CRACKING
Advance Cursor
[Figure: pivot, cmp, active, and backup values]

22 PREDICATED CRACKING
Read Next Element
[Figure: pivot, cmp, active, and backup values]

23 PREDICATED CRACKING
[Figure: pivot, cmp, backup, and active values]

24 PREDICATED CRACKING
Predication for in-place algorithms
No branching
No branch mispredictions
Somewhat intricate
Lots of copying values around (inefficient at integer granularity)
Bulk-copying would be more efficient
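
The animation steps (evaluate predicate and write, advance cursor, read next element) can be sketched as a branch-free in-place loop. This is a reconstruction under assumptions, not the authors' code: the names `lower` and `upper`, the use of `>=` in the comparison (chosen to match the `< pivot` test on the predication slide), and the swap of `active` and `backup` at the end of each iteration are all illustrative choices.

```c
#include <stddef.h>

/* In-place cracking without a data-dependent branch in the loop body.
   Invariant: a[lower] and a[upper] are "holes" whose original values
   are held in `active` and `backup`. Returns the number of values
   < pivot. */
size_t predicated_crack(int *a, size_t n, int pivot)
{
    if (n == 0)
        return 0;
    size_t lower = 0, upper = n - 1;
    int active = a[lower];
    int backup = a[upper];
    while (lower < upper) {
        int cmp = active >= pivot;       /* evaluate predicate (0 or 1) */
        a[lower] = active;               /* write to both holes; only   */
        a[upper] = active;               /* one of the writes will stay */
        lower += (size_t)(1 - cmp);      /* advance exactly one cursor  */
        upper -= (size_t)cmp;
        /* read the value under the cursor that just moved */
        active = cmp * a[upper] + (1 - cmp) * a[lower];
        int tmp = active;                /* swap roles of active/backup */
        active = backup;
        backup = tmp;
    }
    a[lower] = active;                   /* place the last held value   */
    return lower + (size_t)(active < pivot);
}
```

Every iteration performs two stores and one load regardless of the comparison result, which illustrates the "lots of copying" remark on the summary slide.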

25 VECTORIZED CRACKING

26 VECTORIZED CRACKING
Turns in-place cracking into out-of-place cracking
Copies vector-sized chunks and cracks them into the array
Makes vanilla predication possible
Uses SIMD copying for vector copying
Challenge: ensure that values aren't accidentally overwritten

27 VECTORIZED CRACKING
[Figure: alternating copy and partition steps]
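
One possible way to realize the copy-then-partition idea is sketched below. It is an illustration under assumptions, not the authors' implementation: `VEC`, `scatter`, and the rule of always refilling from the side with less free space are hypothetical choices, and `memcpy` stands in for the SIMD copy mentioned on the slide. Buffering one chunk from each end up front keeps at least `VEC` free slots on both sides, which is how this sketch avoids overwriting unread values.

```c
#include <stddef.h>
#include <string.h>

enum { VEC = 4 };  /* hypothetical chunk size, kept tiny for illustration */

/* Predicated out-of-place step: write each buffered value into the free
   slot at both ends of the array; only one cursor moves per value. */
static void scatter(const int *buf, size_t n, int *a,
                    size_t *wl, size_t *wr, int pivot)
{
    for (size_t i = 0; i < n; i++) {
        int lt = buf[i] < pivot;
        a[*wl] = buf[i];
        a[*wr - 1] = buf[i];
        *wl += (size_t)lt;
        *wr -= (size_t)(1 - lt);
    }
}

/* Returns the number of values < pivot; those end up at the front. */
size_t vectorized_crack(int *a, size_t n, int pivot)
{
    int left[VEC], right[VEC], tmp[VEC];
    size_t wl = 0, wr = n;               /* free slots: a[wl..) and a[..wr) */

    if (n == 0)
        return 0;
    if (n < 2 * VEC) {                   /* small input: buffer everything */
        int small[2 * VEC];
        memcpy(small, a, n * sizeof *a);
        scatter(small, n, a, &wl, &wr, pivot);
        return wl;
    }
    memcpy(left, a, sizeof left);                /* frees VEC slots on the left  */
    memcpy(right, a + n - VEC, sizeof right);    /* frees VEC slots on the right */
    size_t rl = VEC, rr = n - VEC;               /* unread values: a[rl..rr)     */

    while (rr - rl >= VEC) {
        if (rl - wl <= wr - rr) {        /* less free space on the left */
            memcpy(tmp, a + rl, sizeof tmp);
            rl += VEC;
        } else {                         /* less free space on the right */
            rr -= VEC;
            memcpy(tmp, a + rr, sizeof tmp);
        }
        scatter(tmp, VEC, a, &wl, &wr, pivot);
    }
    size_t rest = rr - rl;               /* fewer than VEC values remain */
    memcpy(tmp, a + rl, rest * sizeof *a);
    scatter(tmp, rest, a, &wl, &wr, pivot);
    scatter(left, VEC, a, &wl, &wr, pivot);
    scatter(right, VEC, a, &wl, &wr, pivot);
    return wl;
}
```

Because each chunk is read into a private buffer before being written back, the inner loop is the plain out-of-place predicated loop from the predication slide.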

28 RESULTS

29 RESULTS
[Figure: share of Data Stalls, Retiring, Bad Speculation, Pipeline Frontend, and Pipeline Backend (axis up to 1.0) for Vectorized, Predicated, and Original]

30 RESULTS: WORKSTATION
[Figure: wallclock time in s for Scan, Vectorized, Predicated (Register), Predicated (Cache), and Original]

31 RESULTS: SERVER
[Figure: wallclock time in s for Scan, Vectorized, Predicated (Register), Predicated (Cache), and Original, annotated "Not there yet!"]

32 PARALLELIZATION

33 PARALLELIZATION Obvious Solution: Partitioning

34 CRACK & MERGE
Partition
[Figure: array layout x1 y1 | x2 y2 | x3 y3 | x4 y4]

35 CRACK & MERGE
Merge
[Figure: array layout x1 y1 | x2 y2 | x3 y3 | x4 y4]
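
A sequential sketch of the crack-and-merge idea: each chunk is cracked independently (the loop over chunks is where worker threads would run), then misplaced values are swapped across the final boundary. All names are hypothetical, and the element-wise merge is a simplification chosen for brevity; an implementation tuned for speed would more likely move whole x and y pieces in bulk.

```c
#include <stddef.h>

/* Crack one chunk in place; returns the number of values < pivot. */
static size_t crack_chunk(int *a, size_t n, int pivot)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        if (a[lo] < pivot) {
            lo++;
        } else {
            hi--;
            int tmp = a[lo];
            a[lo] = a[hi];
            a[hi] = tmp;
        }
    }
    return lo;
}

/* Crack `parts` chunks independently, then merge. The chunk loop is
   sequential here; each call touches a disjoint range, so every
   iteration could run on its own thread. */
size_t crack_and_merge(int *a, size_t n, int pivot, size_t parts)
{
    if (parts == 0)
        parts = 1;
    size_t below = 0;                    /* total number of values < pivot */
    for (size_t p = 0; p < parts; p++) {
        size_t begin = n * p / parts;
        size_t end = n * (p + 1) / parts;
        below += crack_chunk(a + begin, end - begin, pivot);
    }
    /* Merge: every value >= pivot left of `below` has a matching value
       < pivot right of it, so the inner search stays within bounds. */
    size_t i = 0, j = below;
    while (i < below) {
        if (a[i] >= pivot) {
            while (a[j] >= pivot)
                j++;
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
            j++;
        }
        i++;
    }
    return below;
}
```

The merge touches only misplaced values, but in the x1 y1 x2 y2 ... layout that can be a large share of the array, which motivates the refined layout on the following slides.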

36 REFINED CRACK & MERGE
Partition
[Figure: array layout x1 x2 x3 x4 | y4 y3 y2 y1]

37 REFINED CRACK & MERGE
Smaller Merge
[Figure: array layout x1 x2 x3 x4 | y4 y3 y2 y1]

38 RESULTS: WORKSTATION
[Figure: time in seconds (axis from 0 to 1.6) for Scan, RVPCrack, RPCrack, PVCrack, PCrack, and Vectorized]

39 RESULTS: SERVER
[Figure: time in seconds (axis from 0 to 3.0) for Scan, RVPCrack, RPCrack, PVCrack, PCrack, and Vectorized]

40 IMPACT OF SELECTIVITY: WORKSTATION
[Figure: wallclock time in s against qualifying tuples/pivot for Vectorized, Partition & Merge, Vectorized Partition & Merge, Refined Partition & Merge, Vectorized Refined Partition & Merge, and Scanning]

41 IMPACT OF SELECTIVITY: SERVER
[Figure: wallclock time in s against qualifying tuples/pivot for Vectorized, Partition & Merge, Vectorized Partition & Merge, Refined Partition & Merge, Vectorized Refined Partition & Merge, and Scanning]

42 CONCLUSIONS
