Performance Issues and Query Optimization in Monet
Stefan Manegold <Stefan.Manegold@cwi.nl>
1  Contents

- Modern Computer Architecture: CPU & Memory System
- Consequences for DBMS
  - Data structures: vertical decomposition
  - Algorithms: tune random memory access
  - Implementation techniques: avoid CPU stalls
- Dissecting CPU & Memory Optimization Effects
  - Monet experiments (focus: partitioned hash-join)
  - Gain (or lose) an order of magnitude in performance
- Query Optimization in Monet
- Conclusion
2  Trends in DRAM and CPU Speed

[Figure: log-scale plot, 1986-2000 — CPU clock speed (Hz) improves roughly 50 percent per year, while memory latency (ns) improves only about 10 percent per year; memory bandwidth (MB/s) grows in between.]

CPU parallelism:

  year  type          execution units
  1988  classic       simple
  1992  scalar        single pipeline
  1996  super-scalar  multiple pipelines
  1999  Athlon         9 pipes
        Pentium 4     12 pipes?
        Itanium       16 pipes
3  Modern Computer Architecture: Exploiting CPU Parallelism

Keeping all pipelines busy requires
- independent operations
- simple/predictable code

But there are...
- conditional branches (if-then-else)
- (dereferenced) function calls
- late binding in C++
... on which compilers fail.

Good: scientific code (matrix computations)
Bad:  (generic) DBMS code

[Figure: operator sequence A, B, C, D, E, F executing concurrently in filled pipelines.]
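The branch problem above can be made concrete. The sketch below (illustrative only, not Monet code; function names are invented) contrasts a select loop with a data-dependent branch against a predicated version whose inner loop contains no branch, so the pipelines stay filled regardless of the selectivity:

```c
#include <stddef.h>

/* Branching version: one conditional branch per value.  On data
   where the predicate holds for ~50% of the values, the branch is
   unpredictable and each misprediction flushes the pipelines. */
size_t select_branch(const int *col, size_t n, int lo, int hi, size_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (col[i] >= lo && col[i] <= hi)
            out[k++] = i;
    }
    return k;
}

/* Predicated version: the comparison result is used as an
   arithmetic value (0 or 1), so the inner loop has no
   data-dependent branch. */
size_t select_pred(const int *col, size_t n, int lo, int hi, size_t *out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        out[k] = i;                          /* always write          */
        k += (col[i] >= lo) & (col[i] <= hi); /* advance only on match */
    }
    return k;
}
```

Both versions produce the same qualifying index list; only the control flow differs.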
4  Modern Computer Architecture: Hierarchical Memory System

[Figure: on the chip/die, the CPU sits next to the L1 cache; the L2 cache and main memory follow — size and latency grow, and bandwidth shrinks, at each level further from the CPU.]

Caches reduce memory latency only if the requested data is found in the cache. Otherwise, the CPU stalls for up to hundreds of nanoseconds on each cache miss.
5  Consequences for DBMS

  Goal                               Optimize
  Use cache lines fully              data structures
  Prevent cache misses               memory access / algorithms
  Prevent CPU stalls                 implementation techniques
  Exploit CPU-inherent parallelism   implementation techniques
6  Data Structures: Vertical Decomposition in Monet

Store vertical fragments instead of wide relations.

- Wide relations waste bandwidth: a scan of one attribute loads cache lines that are mostly filled with the other attributes A_1 ... A_n.
- Monet uses the complete bandwidth: each cache line holds only values of the requested attribute.
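A minimal C sketch of the difference (illustrative layouts, not Monet's BAT implementation): scanning one attribute of a row store strides through wide tuples, while the decomposed column is a dense array read at full bandwidth:

```c
/* Row store: a scan of attribute a4 drags the whole tuple through
   the cache; most of each loaded cache line is wasted. */
struct tuple { int a1; double a2; char a3[48]; int a4; };

long sum_row(const struct tuple *rel, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += rel[i].a4;          /* strided access, ~64 bytes apart */
    return s;
}

/* Column store (vertical fragment, simplified): the attribute is a
   dense array, so every byte of every fetched cache line is a
   requested value. */
long sum_col(const int *a4, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a4[i];              /* sequential access */
    return s;
}
```

The two scans compute the same result; only the memory traffic per value differs.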
7  Algorithms: Partitioned Joins

- Cluster both input relations
  - create clusters that fit in the memory cache
  - restrict random memory access to the smallest cache
  - avoid cache capacity misses
- Join matching clusters

[Figure: relations L and R, non-clustered vs. clustered — after clustering, matching keys (e.g. 2, 3, 17, 47, 66, 96) sit in clusters with the same number in both relations.]
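The clustering step can be sketched as one radix-cluster pass in C (a simplification: Monet clusters [oid,value] pairs on a hash of the join key; here plain integer keys are scattered on their low bits). Keys equal on the low bits land in the same cluster, so cluster c of L need only be joined against cluster c of R:

```c
#include <stdlib.h>

/* One radix-cluster pass: scatter keys into 2^bits clusters on
   their lowest `bits`.  count[] must be zero-initialized by the
   caller and receives the cluster sizes. */
void radix_cluster(const int *in, int n, int bits, int *out, int *count) {
    int nclusters = 1 << bits, mask = nclusters - 1;
    int *pos = calloc(nclusters, sizeof(int));
    for (int i = 0; i < n; i++)              /* histogram of cluster sizes */
        count[in[i] & mask]++;
    for (int c = 1; c < nclusters; c++)      /* prefix sum = start offsets */
        pos[c] = pos[c - 1] + count[c - 1];
    for (int i = 0; i < n; i++)              /* scatter into clusters      */
        out[pos[in[i] & mask]++] = in[i];
    free(pos);
}
```

Choosing `bits` so that one cluster of each relation (plus its hash table) fits in the cache confines the join's random access to cache-resident data.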
8  Algorithms: Straightforward Clustering

[Figure: input scattered into clustered output in a single pass.]

Problem: the number of clusters exceeds the number of (active) cache lines — cache thrashing.
Solution: multi-pass clustering.
9  Algorithms: Multi-Pass Clustering

[Figure: input clustered into the final output in two passes (pass 1, pass 2).]

- Limit the number of clusters per pass to the number of (active) cache lines
- Avoid cache thrashing
- Trade memory cost for CPU cost
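The pass planning itself is simple arithmetic; a minimal sketch (invented helper, not Monet's planner): split the total number of cluster bits over as few passes as possible while keeping the fan-out of each pass below the cache's capacity of active cache lines:

```c
/* Split total_bits radix bits over the fewest passes such that no
   single pass produces more than 2^max_bits clusters.  Writes the
   bits of each pass into bits_per_pass[] and returns the number of
   passes (assumed to fit in 8 here). */
int plan_passes(int total_bits, int max_bits, int bits_per_pass[8]) {
    int passes = (total_bits + max_bits - 1) / max_bits;   /* ceiling */
    for (int p = 0; p < passes; p++)                       /* spread evenly */
        bits_per_pass[p] = total_bits / passes
                         + (p < total_bits % passes ? 1 : 0);
    return passes;
}
```

For example, 1024 clusters (10 bits) with a cache that tolerates at most 64 active clusters (6 bits) yields two passes of 32 clusters each — each pass scans the data once, which is the memory-for-CPU trade-off named above.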
10  Dissecting CPU- & Memory-Optimization

Platforms:
- SGI Origin2000 (MIPS R10000, 250 MHz)
- Sun Ultra (UltraSPARC, 200 MHz)
- Intel PC (Pentium III, 450 MHz)
- AMD PC (Athlon, 600 MHz)

Use hardware event counters to break down execution times into
- memory access (cache misses)
- CPU stalls (on the Intel PC)
- divisions
- real CPU work
11  Memory-Optimization: Multi-Pass Clustering

[Figure: two panels (SGI Origin2000, Intel PC) — time in seconds vs. number of clusters (2 ... 4M), comparing 1 pass against P passes (P=1, P=2); cost broken down into memory access and CPU stalls.]
12  Memory-Optimization: Partitioned Hash-Join

[Figure: two panels (SGI Origin2000, Intel PC) — time in seconds vs. cluster size in bytes (64M ... 64); cost broken down into memory access, CPU stalls, and divisions.]
13  CPU-Optimization

DBMS techniques for CPU optimization:
- column-at-a-time: Monet operators have a fixed layout and few types
- join([oid,T], [T,oid]) : [oid,oid] has just one degree of freedom (T)
- bulk type-switch technique: a separate routine for each T

  join( [oid,T], [T,oid] ) : [oid,oid]
  {
      switch (T) {
      case int:    return join_int( [oid,int], [int,oid] );
      case string: return join_string( [oid,string], [string,oid] );
      default:     return join_ADT( [oid,T], [T,oid], ADT );
      }
  }

In each type-specific join, all function calls are replaced by inlined code:
- less overhead
- code more predictable for the CPU

Also: replace expensive divisions by bit operations.
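The division-to-bit-operation replacement can be illustrated in a few lines (a generic sketch of the trick, not Monet's hash code): when the number of hash buckets or clusters is a power of two, the modulo — which compiles to an integer division, slow and unpipelined on the CPUs of this era — becomes a single bitwise AND:

```c
/* Bucket number via modulo: compiles to an integer division. */
unsigned bucket_div(unsigned hash, unsigned nbuckets) {
    return hash % nbuckets;
}

/* Bucket number via mask: a single AND, but valid only when
   nbuckets is a power of two (nbuckets == 2^k). */
unsigned bucket_mask(unsigned hash, unsigned nbuckets) {
    return hash & (nbuckets - 1);
}
```

This is why the cluster counts in the experiments are powers of two: the "divisions" component visible in the cost breakdowns disappears entirely.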
14  CPU-Optimization: Partitioned Hash-Join

SGI Origin2000: 15 s -> 4 s;  Intel PC: 10 s -> 4 s

[Figure: two panels (SGI Origin2000, Intel PC) — time vs. cluster size in bytes (64M ... 64), default vs. optimized implementation; cost broken down into memory access, CPU stalls, and divisions.]
15  CPU-Optimization: Multi-Pass Clustering

SGI Origin2000: 3 s -> 0.75 s;  Intel PC: 2.25 s -> 0.75 s

[Figure: two panels (SGI Origin2000, Intel PC) — time vs. number of clusters (2 ... 4M), default vs. optimized implementation with P=1, P=2, P=3 passes; cost broken down into memory access and CPU stalls.]
16  CPU- & Memory-Optimization: Overall Performance

Boosting effects: Mc > M and Cm > C

  SGI Origin2000:  C = 1.5 s,  Cm = 16.4 s,  M = 28.6 s,  Mc = 34.5 s
  AMD PC:          C = 4.3 s,  Cm =  8.4 s,  M = 18.9 s,  Mc = 23.0 s

[Figure: two panels (SGI Origin2000, AMD PC) — time vs. cluster size in bytes (64M ... 32), comparing default, CPU-optimized (C), memory-optimized (M), and fully optimized variants (Cm, Mc), with 1 to 4 clustering passes; 'simple' and 'minimum' shown as reference lines.]
17  Automatic Tuning of Algorithms

Detailed and accurate main-memory cost models:
- calibrate CPU costs
- estimate the number of cache misses
- memory access cost: misses multiplied by their latencies

Calibration tool:
- automatically analyzes the memory system of any computer
- extracts the number of cache levels, cache sizes, cache line sizes, and cache miss latencies
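The core idea of such a calibration tool can be sketched as a pointer-chasing loop (an illustration of the technique, not Monet's actual calibrator): each load depends on the previous one, so the time per iteration approximates the latency of whichever cache level the array fits in; sweeping the array size reveals jumps at each cache boundary, and sweeping the stride reveals the cache line size.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase a cyclic pointer chain through `size` bytes with the given
   stride, and return the approximate time per load in nanoseconds.
   Assumes size is at least a few strides. */
double chase_ns_per_load(size_t size, size_t stride, long iters) {
    size_t n = size / sizeof(void *);
    size_t step = stride / sizeof(void *);
    if (step == 0) step = 1;
    void **arr = malloc(n * sizeof(void *));
    for (size_t i = 0; i < n; i++)       /* build cyclic chain */
        arr[i] = &arr[(i + step) % n];
    void **p = arr;
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++)
        p = (void **)*p;                 /* dependent load: cannot overlap */
    double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / iters;
    if (p == NULL) printf("unreachable\n");  /* keep p live */
    free(arr);
    return ns;
}
```

Because the loads form a dependence chain, the CPU cannot hide the misses by overlapping them — exactly the latency the cost models need.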
18  Extreme CPU-Optimization: Select

Exhaustive code expansion on:
- data types
- predicate type (x = C, x < C, C < x, C < x < C1)
- various other properties

Avoid as many branches/conditionals as possible in the inner loop.

- results in 12(!) specific routines generated from one template
- performance improvement: up to a factor of 10(!)
- lesson learned: even correctly predicted branches hurt
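Such template-driven code expansion can be sketched with a C macro (an illustration of the technique; names and signatures are invented): one template instantiates a separate select routine per (type, predicate) combination, so each generated inner loop contains a single fixed, branch-free comparison and no dispatch at all:

```c
/* One template, many specific routines: the predicate result (0/1)
   advances the output cursor, so the loop body has no branch. */
#define GEN_SELECT(NAME, TYPE, PRED)                          \
    int NAME(const TYPE *col, int n, TYPE c, int *out) {      \
        int k = 0;                                            \
        for (int i = 0; i < n; i++) {                         \
            out[k] = i;                                       \
            k += PRED(col[i], c);                             \
        }                                                     \
        return k;                                             \
    }

#define LT(x, c) ((x) < (c))
#define EQ(x, c) ((x) == (c))

GEN_SELECT(select_int_lt, int, LT)     /* x <  C on int    */
GEN_SELECT(select_int_eq, int, EQ)     /* x == C on int    */
GEN_SELECT(select_dbl_lt, double, LT)  /* x <  C on double */
```

Expanding over all supported types, predicate shapes, and other properties multiplies quickly — which is how one template yields the large family of specific routines mentioned above.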
19  Monet Query Optimizer: System Architecture

[Figure: multiple users/applications send multiple MIL streams to a MIL stream merger & result dispatcher; the merged MIL stream passes through three optimization layers into the Monet Database System, and query results flow back to the dispatcher:
- Strategic optimization — Multi-Query Optimizer: elimination of common (sub-)expressions, re-use of cached (intermediate) results; maintains a dataflow graph.
- Tactical optimization — pattern rewriting, cache/memory management; produces an optimized MIL stream and maintains a result cache.
- Operational optimization — parallelization in the execution engine.]
20  Monet Query Optimizer: Common Sub-expression Elimination

Avoid redundancy:

  s39 := {sum}(s13.reverse().join(s38), s13.tunique());
  s49 := {sum}(s13.reverse().join(s48), s13.tunique());
  s66 := {sum}(s13.reverse().join(s65), s13.tunique());
  s76 := {count}(s13.reverse(), s13.tunique());

becomes

  s13r := s13.reverse();
  s13t := s13.tunique();
  s39  := {sum}(s13r.join(s38), s13t);
  s49  := {sum}(s13r.join(s48), s13t);
  s66  := {sum}(s13r.join(s65), s13t);
  s76  := {count}(s13r, s13t);
21  Monet Query Optimizer: Heuristic Pattern Rewriting

Inline the join (also for {sum} & {count}):

  x := {avg}(g.reverse().join(b), e);   ==>   x := {avg}(b, g, e);

Avoid the join:

  x := {count}(b, g, e);   ==>   x := {count}(g.reverse(), e);

Avoid grouping and join:

  x1 := {sum}(b, g, e);              x1 := {sum}(b, g, e);
  x2 := {count}(g.reverse(), e);     x2 := {count}(g.reverse(), e);
  x3 := {avg}(b, g, e);        ==>   x3 := x1 [/] x2;
22  Monet Query Optimizer

Parallelization (SMP):
- schedule independent operations for concurrent execution

Cache / memory management:
- re-use intermediate results as soon as possible
- discard intermediate results (free memory) as soon as possible

In preparation:
- cost-based pattern rewriting
- cost-based cache / memory management
- horizontal fragmentation / partitioning
23  Experiment: Q1 of the TPC-H Benchmark on the Origin2000

  #CPUs  time [ms]  comment
  1      74887      original Monet code
  1      2993       optimized Monet code
  1      12174      optimized query
  2      1434       parallel execution
  3      136
  4      8666
  5      8745
  6      7823
  7      6957
  8      6955
24  Conclusion

- Bad memory access patterns and poor usage of CPU-inherent parallelism ruin database performance.
- Use data structures that exploit the full memory bandwidth.
- Tune algorithms to achieve optimal memory access.
- Optimize (simplify) code to exploit CPU resources efficiently.
- The tactical optimization layer allows lazy MIL generation in frontends and applications.

Monet home page: www.cwi.nl/~monet