Walking Four Machines by the Shore

1 Walking Four Machines by the Shore
Anastassia Ailamaki with Mark Hill and David DeWitt
University of Wisconsin - Madison

2 Workloads on Modern Platforms
[Chart: cycles per instruction, compared with the theoretical minimum, for Desktop/Engineering, Decision Support, and Online Transaction Processing workloads]
High CPI for DB workloads

3 Previous Work
- DBMSs on modern platforms [Barroso 98], [Keeton 98], [Ailamaki 99], etc.
  - Studied one or more DBMSs per platform
  - Located performance bottlenecks
- Cache-conscious software
  - Data placement
  - Optimized use of cache in algorithms
- Instruction stream optimizations
  - Optimized I-cache / branch prediction
Hardware design also affects DBMS behavior

4 Impact of Architectural Decisions
- Shore: a prototype storage manager / DBMS
- Compared Shore on four different systems
  - different processor architectures/µ-architectures
  - different memory subsystems
- Found evidence that DBMSs would benefit from
  - 2-4 way associative, larger L2, no inclusion
  - large blocks, no sub-blocking
  - high-accuracy branch prediction
  - memory-aggressive execution engine
Steps towards a DSS-centric machine

5 Outline
- Introduction
- Experimental setup / methodology
- Processor pipeline
- Branch prediction mechanism
- Memory subsystem
- Conclusions

6 Platform Design Variations
- Architecture
  - RISC or CISC
  - Instruction set
- Microarchitecture
  - Pipeline
  - Speculation (out-of-order, multiple issue)
  - Branch prediction
- Memory subsystem
  - Cache size, associativity
  - Block size, subblocking
  - Inclusion
Which design favors DSS workloads?

7 Why Use Shore?
- Range selection query on 4 commercial DBMSs + Shore
- Breakdown of execution & memory delays
[Chart: Range Selection (no index), % execution time per DBMS (A, B, C, D, Shore); categories: Computation, Memory, Branch mispr., Resource]
[Chart: Range Selection (no index), memory stall time (%) per DBMS (A, B, C, D, Shore); categories: L1 Data, L2 Data, L1 Instruction, L2 Instruction]
We can use Shore to evaluate DSS workload behavior

8 Experimental Setup
Used four machines
- Sun UltraSparc: US-II and US-IIi, Solaris 2.6/2.7
- Intel P6: PII Xeon, Linux v2.2
- DEC Alpha: 21164A, OSF1 v.4.0
Architecture and Processor Microarchitecture
Characteristic    UltraSparc US-II   UltraSparc US-IIi   PII Xeon   Alpha
speed             296 MHz            300 MHz             400 MHz    532 MHz
introduced in
out of order?     no                 no                  yes        no
instruction set   RISC               RISC                CISC       RISC

9 Cache Hierarchies
Cache  Characteristic    UltraSparc US-II   UltraSparc US-IIi   PII Xeon       Alpha
L1 D   size, assoc       KB, DM             16KB, DM            16KB, 2-way    8KB, DM
       block/subblock    32/16              32/16               32/32          32/32
       inclusion by L2   yes                yes                 no             yes
L1 I   size, assoc       16KB, 2-way        16KB, 2-way         16KB, 4-way    8KB, DM
       block/subblock    32/32              32/32               32/32          32/16
       inclusion by L2   yes                yes                 no             no
L2     size, assoc       2 MB, DM           512KB, DM           512KB, 4-way   96KB, 3-way
       block/subblock    64/64              64/64               32/32          64/32
       inclusion by L3   N/A                N/A                 N/A            yes
L3     size, assoc       N/A                N/A                 N/A            4 MB / DM
       block/subblock    N/A                N/A                 N/A            64/64

10 Methodology
- Compiled Shore with gcc
  - Alpha version not optimized
- Ran DSS workload, 100-MB TPC-H dataset
  - Range Selections w/ variable parameters (RS)
  - TPC-H Q1 and Q6: sequential scans, lots of aggregates (sum, avg, count)
  - TPC-H Q12 and Q14: hash joins, complex where clause, conditional aggregates
- Used processor counters
  - Sun: run-pic (by Glenn Ammons, modified)
  - PII: PAPI (public-domain counter library)
  - Alpha: DCPI (sampling software by Compaq)

11 Issue/Retire Width
- Alpha issues at most 2 instructions / cycle (max=4)
- >60% of time Xeon retires 0/1 instruction (max=3)
[Chart: Alpha Issue Breakdown, % of total clock cycles per query (RS, Q1, Q6, Q12, Q14); categories: pipeline dry, 0-issue, 1-issue, 2-issue]
[Chart: Xeon (NT) Retire Breakdown, % of total clock cycles per DBMS (A, B, C, D, Shore); categories: 0-retired, 1-retired, 2-retired, 3-retired]
Issue/retire width is not fully exploited

12 Execution Time Breakdown
[Three charts: clock cycles (%) per query (RS, Q1, Q6, Q12, Q14) on UltraSparc-II, PII Xeon, and Alpha; categories: D-stalls, I-stalls, Branch Misprediction, Other+Computation]
- Memory + branch misprediction stalls = 35-60% of time
- Data accesses: major memory bottleneck (esp. Q12, Q14)

13 Branch Prediction
- Branch penalty = branch frequency * misprediction rate * penalty per misprediction
- Frequency is typically 20-25%
- In-order processors => lower penalty
- Low prediction accuracy may break it (e.g., UltraSparc)
[Table, per platform: branch frequency; branch misprediction rate (RS, Q1, Q6 and Q12, Q14); branch penalty (cycles); branch misprediction stalls (RS, Q1, Q6 and Q12, Q14). Values as extracted: 1% 6%; PII Xeon: 18% 22% 3.5% %; Alpha: % 9% 15% 5 1%]
High-accuracy branch predictors
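
As a rough worked example of the stall formula on this slide, the sketch below multiplies the three factors. Only the 20-25% branch frequency range comes from the slide; the misprediction rate and the penalty in cycles are hypothetical placeholders, not measurements from the study.

```python
def branch_stall_cycles_per_instruction(branch_frequency, misprediction_rate, penalty_cycles):
    """Average cycles lost to branch mispredictions per executed instruction.

    branch_frequency: fraction of instructions that are branches
    misprediction_rate: fraction of branches that are mispredicted
    penalty_cycles: cycles lost on each misprediction
    """
    return branch_frequency * misprediction_rate * penalty_cycles


# Branch frequency range from the slide (20-25%).
# The 5% misprediction rate and 10-cycle penalty are assumed for illustration.
low = branch_stall_cycles_per_instruction(0.20, 0.05, 10)
high = branch_stall_cycles_per_instruction(0.25, 0.05, 10)
print(f"{low:.3f} to {high:.3f} stall cycles per instruction")
```

Under these assumed inputs, halving either the misprediction rate or the penalty halves the stall contribution, which is why the slide weighs predictor accuracy against the lower penalty of in-order pipelines.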

14 Cache Inclusion
- UltraSparc II: 128-bit L1 interface, 2MB L2 cache
- UltraSparc IIi: 64-bit L1 interface, 512KB L2 cache
[Chart: UltraSparc II/IIi cache comparison (RS), normalized units; metrics: elapsed time, L1D misses, L1I misses, L2D misses, L2I misses; series: UltraSparc-II, UltraSparc-IIi]
Small, DM L2 caches should not maintain inclusion

15 Cache Block Size
- Compared two data placement algorithms
- Improving locality pays off with larger cache blocks
[Chart: Improvement on data miss rates, miss rate improvement (0% to 120%) per query (RS, Q1, Q6, Q12, Q14); series: PII Xeon (L2), US-II (L2), A21164 (L3)]
Larger cache line = lower miss rates (leads to higher performance given bandwidth)
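
The interaction between data placement and block size can be illustrated with a small cold-miss count. This is a minimal sketch, not the placement algorithms compared in the talk: the two layouts (one 4-byte field read from each 100-byte record, versus the same fields packed contiguously) and all sizes are hypothetical, and each field is assumed to fit within a single cache block.

```python
def cold_misses(addresses, block_bytes):
    """Count distinct cache blocks touched, i.e. cold misses for one pass."""
    return len({addr // block_bytes for addr in addresses})


NUM_RECORDS = 1000   # assumed
RECORD_BYTES = 100   # assumed record size
FIELD_BYTES = 4      # assumed size of the one field the query reads

# Layout 1: the field sits at the start of each full record.
row_layout = [i * RECORD_BYTES for i in range(NUM_RECORDS)]
# Layout 2: the same fields stored back to back.
packed_layout = [i * FIELD_BYTES for i in range(NUM_RECORDS)]

for block_bytes in (32, 64):
    row = cold_misses(row_layout, block_bytes)
    packed = cold_misses(packed_layout, block_bytes)
    improvement = 100.0 * (row - packed) / row
    print(f"{block_bytes}-byte blocks: {row} vs {packed} misses "
          f"({improvement:.1f}% fewer)")
```

In this toy setup the strided layout misses once per record at either block size, while the packed layout's miss count drops as blocks grow, so the benefit of better locality widens with larger blocks.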

16 Sub-Blocking / Associativity
- UltraSparc: direct-mapped, subblocking (32/16)
- Xeon: 2-way, no subblocking (32/32)
[Two charts, range selections: L1 Data Cache Misses (RS), # L1D misses per record on US-II, US-IIi, and PII Xeon, one plotted against projectivity and one against selectivity; tick values 2%, 10%, 20%, 50%, 100%]
High associativity, no sub-blocking
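
To make the associativity argument concrete, here is a minimal cache-miss counter with LRU replacement. The 16KB capacity and 32-byte blocks match the L1 data caches listed in the Cache Hierarchies table; the access pattern (two addresses 16KB apart, touched alternately) is a hypothetical worst case chosen for illustration, and sub-blocking is not modeled.

```python
from collections import OrderedDict


def count_misses(addresses, cache_bytes, block_bytes, ways):
    """Simulate a set-associative cache with LRU replacement and count misses."""
    num_sets = cache_bytes // (block_bytes * ways)
    # One OrderedDict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used block
            lines[block] = True
    return misses


# Hypothetical pattern: alternate between two addresses exactly 16KB apart.
pattern = [0, 16 * 1024] * 100

direct_mapped = count_misses(pattern, 16 * 1024, 32, ways=1)
two_way = count_misses(pattern, 16 * 1024, 32, ways=2)
print(direct_mapped, two_way)
```

Both addresses map to the same set in either configuration, so the direct-mapped cache evicts on every access while the 2-way cache keeps both blocks resident after the first two misses.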

17 Conclusions
Memory Hierarchy
- Non-blocking caches
- >64-byte block, no sub-blocking
- Generous-sized L1-I (128K) and L2 (> 2MB)
  - A tiny, fast L1/2 with a large, slow L3 won't add much
- High associativity (2-4)
- No inclusion (at least for instructions)
Processor pipeline
- Issue width is fine, out-of-order overlaps stall time
- Execution engine to sustain >1 load/store instr.
- High-accuracy branch prediction
Provided that implementation cost is stable.
