Performance Issues and Query Optimization in Monet


Stefan Manegold

Contents

- Modern Computer Architecture: CPU & Memory system
- Consequences for DBMS
  - Data structures: vertical decomposition
  - Algorithms: tune random memory access
  - Implementation techniques: avoid CPU stalls
- Dissecting CPU & Memory Optimization Effects
  - Monet experiments (focus: partitioned hash-join)
  - Gain (or lose) an order of magnitude in performance
- Query Optimization in Monet
- Conclusion

Trends in DRAM and CPU Speed

[Figure: memory latency (ns), 1 percent improvement per year; CPU clock speed (Hz), 5 percent improvement per year; memory bandwidth (MB/s); x-axis: year]

CPU parallelism:

year  type          execution units
1988  classic       simple
1992  scalar        single pipeline
1996  super-scalar  multiple pipelines
1999  Athlon        9 pipelines
2000  Pentium 4     12 pipelines
?     Itanium       16 pipelines

Modern Computer Architecture: Exploiting CPU Parallelism

Keeping all pipelines busy requires
- independent operations
- simple/predictable code

But there are...
- conditional branches (if-then-else)
- (dereferenced) function calls
- late binding in C++
On such code, compilers fail.

Good: scientific code (matrix computation)
Bad: (generic) DBMS code

[Figure: operator sequence A, B, C, D, E, F; filled pipelines: E C A F D B]

Modern Computer Architecture: Hierarchical Memory System

Caches reduce memory latency only if the requested data is found in the cache.
Otherwise, the CPU stalls for up to 4 ns on each cache miss.

[Figure: memory hierarchy: Chip/Die with CPU, L1 cache, L2 cache, Main Memory; annotated with size, latency, bandwidth]

Consequences for DBMS

Goal                              Optimize
Use cache lines fully             Data structures
Prevent cache misses              Memory access / algorithms
Prevent CPU stalls                Implementation techniques
Exploit CPU-inherent parallelism  Implementation techniques

Data Structures: Vertical Decomposition in Monet

Store vertical fragments instead of wide relations.
- Wide relations: waste bandwidth
- Monet: uses complete bandwidth

[Figure: attributes A_1, A_2, A_3, ..., A_n of a wide relation versus Monet's vertical fragments, showing the requested attribute within a cache line]
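The layout idea on this slide can be sketched in a few lines of Python. This is only an illustration of the data layout, not Monet code: the relation, the attribute values, and the helper names `decompose` and `reconstruct` are all hypothetical, and Python lists do not reproduce the cache-line behaviour the slide is about.

```python
# Hypothetical three-attribute relation, stored row-wise (one tuple per record).
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
]

def decompose(rows):
    """Split n-ary tuples into one [oid, value] column per attribute."""
    n_attrs = len(rows[0])
    return [
        [(oid, row[a]) for oid, row in enumerate(rows)]
        for a in range(n_attrs)
    ]

def reconstruct(columns, oid):
    """Rebuild one tuple by looking up the same oid in every column."""
    return tuple(col[oid][1] for col in columns)

columns = decompose(rows)

# A scan over one attribute reads only that column; the other
# attributes are never touched.
ages = [value for _, value in columns[2]]
```

In a row-wise layout the same scan would drag every attribute of every tuple through the cache, which is the wasted bandwidth the slide refers to.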

Algorithms: Partitioned Joins

- Cluster both input relations
- Create clusters that fit in the memory cache
- Restrict random memory access to the smallest cache
- Avoid cache capacity misses
- Join matching clusters

[Figure: relations L and R, non-clustered and clustered]
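A minimal sketch of the partitioned join, assuming relations are lists of (oid, key) pairs and that clustering uses the lowest bits of the key hash. The function names and the `bits` parameter are illustrative choices, not taken from the slides.

```python
from collections import defaultdict

def cluster(relation, bits):
    """Partition (oid, key) pairs on the lowest `bits` bits of the key hash."""
    mask = (1 << bits) - 1
    clusters = defaultdict(list)
    for oid, key in relation:
        clusters[hash(key) & mask].append((oid, key))
    return clusters

def hash_join(left, right):
    """Plain hash join of two inputs; returns (left_oid, right_oid) pairs."""
    table = defaultdict(list)
    for oid, key in right:
        table[key].append(oid)
    return [(l_oid, r_oid) for l_oid, key in left for r_oid in table.get(key, [])]

def partitioned_hash_join(left, right, bits):
    """Cluster both inputs, then join only clusters with the same number."""
    l_clusters = cluster(left, bits)
    r_clusters = cluster(right, bits)
    result = []
    for cid, l_part in l_clusters.items():
        if cid in r_clusters:
            result.extend(hash_join(l_part, r_clusters[cid]))
    return result
```

Tuples with equal keys always land in clusters with the same number, so joining matching clusters yields the same pairs as one large hash join, while each per-cluster hash table stays small enough to be cache-resident when `bits` is chosen accordingly.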

Algorithms: Straightforward Clustering

Problem: the number of clusters exceeds the number of cache lines, which causes cache thrashing.
Solution: multi-pass clustering.

[Figure: input, (active) cache lines, clustered output]

Algorithms: Multi-Pass Clustering

- Limit the number of clusters per pass
- Avoid cache thrashing
- Trade memory cost for CPU cost

[Figure: input, (active) cache lines, clustered output; Pass 1, Pass 2]
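One way to realise multi-pass clustering is to split on a few key bits per pass. The sketch below assumes integer keys and a radix-style split, which the slide itself does not spell out; `radix_cluster` and `bits_per_pass` are hypothetical names.

```python
def radix_cluster(keys, bits_per_pass):
    """Cluster integer keys on their lowest sum(bits_per_pass) bits.

    Each pass splits every existing cluster on the next group of bits,
    most significant group first, so a single pass never writes to more
    than 2**bits output clusters per input cluster.
    """
    shift = sum(bits_per_pass)
    clusters = [list(keys)]
    for bits in bits_per_pass:
        shift -= bits
        mask = (1 << bits) - 1
        next_clusters = []
        for part in clusters:
            buckets = [[] for _ in range(1 << bits)]
            for key in part:
                buckets[(key >> shift) & mask].append(key)
            next_clusters.extend(buckets)
        clusters = next_clusters
    return clusters

keys = [(i * 7) % 64 for i in range(64)]
one_pass = radix_cluster(keys, [4])      # 16 clusters written in one pass
two_pass = radix_cluster(keys, [2, 2])   # 4 clusters per pass, two passes
```

Both calls produce the same 16 clusters. The two-pass variant reads and writes the data twice but keeps only four output clusters active at a time, which is the trade-off between extra work and fewer active cache lines that the slide describes.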

Dissecting CPU- & Memory-Optimization

Platforms:
- SGI Origin2000 (MIPS R10000, 250 MHz)
- Sun Ultra (UltraSPARC, 200 MHz)
- Intel PC (PentiumIII, 450 MHz)
- AMD PC (Athlon, 600 MHz)

Use hardware event counters.

Break down execution times into
- Memory access (cache misses)
- CPU stalls (on the Intel PC)
- Divisions
- Real CPU work

Memory-Optimization: Multi-Pass Clustering

[Figure: panels for SGI Origin2000 and Intel PC; curves for P = 1 and more passes; x-axis: number of clusters; legend: memory, CPU stalls]

Memory-Optimization: Partitioned Hash-Join

[Figure: panels for SGI Origin2000 and Intel PC; x-axis: cluster size [bytes], 64M down to 64; legend: memory, CPU stalls, divisions]

CPU-Optimization

DBMS techniques for CPU optimization:
- column-at-a-time: Monet operators have a fixed layout and few types
- join([oid,T],[T,oid]) : [oid,oid] has just one degree of freedom (T)
- bulk type-switch technique: a separate routine for each T

    join( [oid,T], [T,oid] ) : [oid,oid]
    {
        switch(T) {
        int:     return join_int( [oid,int], [int,oid] );
        string:  return join_string( [oid,string], [string,oid] );
        default: return join_ADT( [oid,T], [T,oid], ADT );
        }
    }

Type-specific join: replace all function calls by inline code
- less overhead
- code more predictable for the CPU

Also: replace expensive division by a bit operator.
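The bulk type-switch can be mimicked in Python as below. This is a sketch of the dispatch structure only: the function bodies are assumptions, and in Python the int and string routines end up sharing one body, whereas in a compiled language each would be a separately compiled, inlined loop. The last two helpers illustrate replacing a division (modulo) by a bit mask, which is valid only when the bucket count is a power of two.

```python
def join_adt(left, right, equal):
    """Generic fallback: one call to the type's `equal` function per tuple pair."""
    return [(l, r) for l, lv in left for rv, r in right if equal(lv, rv)]

def _join_inline(left, right):
    """Type-specific body: values are compared directly, no per-tuple callback."""
    table = {}
    for value, oid in right:
        table.setdefault(value, []).append(oid)
    return [(l, r) for l, value in left for r in table.get(value, [])]

# Python does not need separate bodies per type; a compiled language would.
join_int = _join_inline
join_string = _join_inline

def join(left, right, value_type, equal=None):
    """Switch on the value type once per bulk operation, not once per tuple."""
    if value_type is int:
        return join_int(left, right)
    if value_type is str:
        return join_string(left, right)
    return join_adt(left, right, equal)

def bucket_div(h, n_buckets):
    """Bucket number via modulo, which costs a division."""
    return h % n_buckets

def bucket_mask(h, n_buckets):
    """Same bucket number via a bit mask; n_buckets must be a power of two."""
    return h & (n_buckets - 1)
```

For example, `join([(0, 5), (1, 7)], [(7, 10), (5, 11), (7, 12)], int)` takes the `join_int` path and matches left oid 0 with right oid 11, and left oid 1 with right oids 10 and 12.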

CPU-Optimization: Partitioned Hash-Join

[Figure: panels for SGI Origin2000 (annotated 15 s, 4 s) and Intel PC (annotated 1 s, 4 s); curves: default, optimized; x-axis: cluster size [bytes], 64M down to 64; legend: memory, CPU stalls, divisions]

CPU-Optimization: Multi-Pass Clustering

[Figure: panels for SGI Origin2000 (annotated 3 s, 0.75 s) and Intel PC (annotated 2.25 s, 0.75 s); curves: default and optimized for P = 1, 2, 3 passes; x-axis: number of clusters; legend: memory, CPU stalls]

CPU- & Memory-Optimization: Overall Performance

Boosting effects: Mc > M & Cm > C

SGI Origin2000: C = 1.5 s, Cm = 16.4 s, M = 28.6 s, Mc = 34.5 s
AMD PC: C = 4.3 s, Cm = 8.4 s, M = 18.9 s, Mc = 23. s

[Figure: one panel per platform; x-axis: cluster size [byte], 64M down to 2k; legend: default, optimized, simple, minimum, 1 pass, 2 passes, 3 passes, 4 passes]

Automatic Tuning of Algorithms

Detailed and accurate main-memory cost models:
- Calibrate CPU costs
- Estimate number of cache misses
- Memory access cost: misses multiplied by their latencies

Calibration tool:
- Automatically analyzes the memory system of any computer
- Extracts number of cache levels, cache sizes, cache line sizes, cache miss latencies
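The "misses multiplied by their latencies" rule can be written down directly. In the sketch below the cache parameters are made-up placeholders for what a calibration tool would report, and the miss estimate covers only the simplest case: one sequential scan over an aligned region with cold caches.

```python
from dataclasses import dataclass

@dataclass
class CacheLevel:
    name: str
    line_size: int       # bytes per cache line
    miss_latency: float  # nanoseconds per miss

def sequential_misses(n_bytes, level):
    """Misses of one sequential scan: one per cache line touched."""
    return -(-n_bytes // level.line_size)  # ceiling division

def memory_cost(misses_per_level, levels):
    """Memory access cost: sum over cache levels of misses times miss latency."""
    return sum(misses_per_level[lvl.name] * lvl.miss_latency for lvl in levels)

# Hypothetical calibration result, not measured on any real machine.
levels = [
    CacheLevel("L1", line_size=32, miss_latency=10.0),
    CacheLevel("L2", line_size=128, miss_latency=100.0),
]

misses = {lvl.name: sequential_misses(1024, lvl) for lvl in levels}
cost_ns = memory_cost(misses, levels)
```

With these placeholder values a 1024-byte scan incurs 32 L1 misses and 8 L2 misses, giving 32 * 10 + 8 * 100 = 1120 ns of memory access cost. Random access patterns need a different miss estimate, but the cost formula stays the same.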

Extreme CPU-Optimization: Select

Exhaustive code expansion on:
- data types
- predicate type (x = C, x < C, C < x, C < x < C1)
- various other properties

Avoid as many branches/conditionals as possible in the inner loop.
- results in 12(!) specific routines generated from one template
- performance improvement: up to factor 1(!)

Lesson learned: even correctly predicted branches do hurt.
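Code expansion from a single template can be illustrated as follows. The template text, the predicate table, and the `expand` helper are hypothetical, and only the predicate-type dimension is expanded here. Python cannot show the branch-prediction effect itself; the point is that each generated routine contains just the one comparison it was expanded for, instead of deciding the predicate type inside the inner loop.

```python
TEMPLATE = """
def {name}(column, c, c1=None):
    # The loop body holds only the comparison this routine was expanded for.
    return [oid for oid, x in enumerate(column) if {condition}]
"""

# One entry per predicate type listed on the slide.
PREDICATES = {
    "eq": "x == c",
    "lt": "x < c",
    "gt": "c < x",
    "between": "c < x < c1",
}

def expand(template, predicates):
    """Generate one specialised select routine per predicate type."""
    routines = {}
    for key, condition in predicates.items():
        name = "select_" + key
        namespace = {}
        exec(template.format(name=name, condition=condition), namespace)
        routines[key] = namespace[name]
    return routines

SELECT = expand(TEMPLATE, PREDICATES)

column = [5, 1, 9, 3, 7]
in_range = SELECT["between"](column, 2, 8)
```

Here `in_range` holds the oids (positions) of the values strictly between 2 and 8. Expanding over data types and further properties as well multiplies the number of generated routines, which is how one template grows into many specific ones.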

Monet Query Optimizer: System Architecture

[Figure: architecture diagram with the following labels]
- multiple users/applications; multiple MIL streams; query results
- MIL stream merger & result dispatcher; merged MIL stream
- Multi Query Optimizer (Dataflow Graph): elimination of common (sub-)expressions, re-use of cached (intermediate) results, pattern rewriting, cache / memory management, parallelization
- optimized MIL stream; Execution Engine; Result Cache; Monet Database System
- optimization levels: Strategic optimization, Tactical optimization, Operational optimization

Monet Query Optimizer: Common Sub-expression Elimination

Avoid redundancy.

Before:

    s39 := {sum}(s13.reverse().join(s38), s13.tunique());
    s49 := {sum}(s13.reverse().join(s48), s13.tunique());
    s66 := {sum}(s13.reverse().join(s65), s13.tunique());
    s76 := {count}(s13.reverse(), s13.tunique());

After:

    s13r := s13.reverse();
    s13t := s13.tunique();
    s39 := {sum}(s13r.join(s38), s13t);
    s49 := {sum}(s13r.join(s48), s13t);
    s66 := {sum}(s13r.join(s65), s13t);
    s76 := {count}(s13r, s13t);
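A toy version of this rewrite, written in Python rather than MIL. Expressions are encoded as nested tuples `(operator, argument, ...)` with plain strings as variable names; this encoding, the function names, and the generated temporaries `t0`, `t1` are assumptions made for the sketch and say nothing about how the Monet optimizer is implemented.

```python
from collections import Counter

def subexpressions(expr):
    """Yield every compound sub-expression of expr, innermost first."""
    if isinstance(expr, tuple):
        for arg in expr[1:]:
            yield from subexpressions(arg)
        yield expr

def eliminate_common(statements):
    """Hoist sub-expressions that occur more than once into temporaries.

    statements is a list of (target, expr) pairs; the result has the same form.
    """
    counts = Counter(
        sub for _, expr in statements for sub in subexpressions(expr)
    )
    names = {}
    output = []

    def rewrite(expr, top):
        if not isinstance(expr, tuple):
            return expr
        new = (expr[0],) + tuple(rewrite(arg, False) for arg in expr[1:])
        if not top and counts[expr] > 1:
            if expr not in names:
                names[expr] = "t%d" % len(names)
                output.append((names[expr], new))  # define before first use
            return names[expr]
        return new

    for target, expr in statements:
        output.append((target, rewrite(expr, True)))
    return output

program = [
    ("s39", ("sum", ("join", ("reverse", "s13"), "s38"), ("tunique", "s13"))),
    ("s49", ("sum", ("join", ("reverse", "s13"), "s48"), ("tunique", "s13"))),
    ("s76", ("count", ("reverse", "s13"), ("tunique", "s13"))),
]
optimized = eliminate_common(program)
```

On this input the repeated `reverse` and `tunique` calls are computed once into `t0` and `t1`, mirroring the roles of `s13r` and `s13t` on the slide.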

Monet Query Optimizer: Heuristic Pattern Rewriting

Inline join (also for {sum} & {count}):

    before:  x := {avg}(g.reverse().join(b),e);
    after:   x := {avg}(b,g,e);

Avoid join:

    before:  x := {count}(b,g,e);
    after:   x := {count}(g.reverse(),e);

Avoid grouping and join:

    before:  x1 := {sum}(b,g,e);
             x2 := {count}(g.reverse(),e);
             x3 := {avg}(b,g,e);

    after:   x1 := {sum}(b,g,e);
             x2 := {count}(g.reverse(),e);
             x3 := x1 [/] x2;
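The last rewrite relies on the per-group average being the per-group sum divided by the per-group count. A small Python check of that identity, using parallel lists of values and group ids as a stand-in for the MIL operands (the helper names and the data are made up):

```python
from collections import defaultdict

def group_sum(values, groups):
    """Sum of values per group id."""
    out = defaultdict(int)
    for v, g in zip(values, groups):
        out[g] += v
    return dict(out)

def group_count(groups):
    """Number of tuples per group id."""
    out = defaultdict(int)
    for g in groups:
        out[g] += 1
    return dict(out)

def group_avg(values, groups):
    """Direct average: needs its own pass over the grouping."""
    buckets = defaultdict(list)
    for v, g in zip(values, groups):
        buckets[g].append(v)
    return {g: sum(vs) / len(vs) for g, vs in buckets.items()}

def elementwise_div(a, b):
    """Per-group division of two aggregate results."""
    return {g: a[g] / b[g] for g in a}

values = [10, 20, 30, 40, 50]
groups = ["a", "b", "a", "b", "a"]

x1 = group_sum(values, groups)
x2 = group_count(groups)
x3 = elementwise_div(x1, x2)  # reuses x1 and x2, no third grouping pass
```

When the sum and the count are needed anyway, deriving the average from them replaces a third grouped aggregation by one cheap element-wise operation.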

Monet Query Optimizer

Parallelization (SMP):
- schedule independent operations for concurrent execution

Cache / Memory management:
- re-use intermediate results as soon as possible
- discard intermediate results (free memory) as soon as possible

In preparation:
- cost-based pattern rewriting
- cost-based cache / memory management
- horizontal fragmentation / partitioning

Experiment: Q1 of TPC-H benchmark on Origin2000

[Table: columns #CPU, time [ms], comment; rows by comment: original Monet code, optimized Monet code, optimized query, parallel execution]

Conclusion

- Bad memory access patterns and poor usage of CPU-inherent parallelism ruin database performance
- Use data structures that exploit full memory bandwidth
- Tune algorithms to achieve optimal memory access
- Optimize (simplify) code to efficiently exploit CPU resources
- Tactical optimization layer allows lazy MIL generation in frontends and applications

Monet home-page: monet
