Architecture-Conscious Database Systems
1 Architecture-Conscious Database Systems Anastassia Ailamaki Ph.D. Examination November 30, 2000
2 A DBMS on a 1980 Computer [Figure: DBMS execution on the PROCESSOR (10 cycles/instruction); DBMS data and instructions in MAIN MEMORY (1 Megabyte buffer pool, 6-cycle access); DATABASE on disk (< 1 MBps), with hot data cached in the buffer pool] The main performance bottleneck was I/O latency 2000 Anastassia Ailamaki Ph.D. Defense 2
3 Present and Future Platforms [Figure: DBMS execution on the PROCESSOR (0.33 cycles/instruction); data and instructions in the CACHE, with a 70-cycle miss to MAIN MEMORY; main memory grows from 1 Gigabyte toward a Terabyte; databases on disk] Hot data migrates to larger and slower main memory
4 Processor & Memory Speed Gap [Figure: cycles per instruction and memory latency over time, from the VAX 11/780 to the Pentium II Xeon: CPI falls while memory latency, measured in cycles, grows] One access to memory is 100s of instruction opportunities
5 On Today's Computers "When you think about what today's machines do - they look at the instruction stream dynamically, find parallelism on the fly, execute instructions out of order, and speculate on branch outcomes - it's amazing that they work." John Hennessy, IEEE Computer, August 1999 New architectures are more sophisticated
6 Why Study Database Performance? [Figure: cycles per instruction for Desktop/Engineering (SPECInt), Decision Support, and Online Transaction Processing workloads, compared with the theoretical minimum] High average time per instruction for DB workloads
7 Contributions (I): Analysis Problem: Where does query execution time go? Proposed evaluation framework [VLDB 99] Identified bottlenecks in hardware: memory access, hardware implementation details Discovered two memory-related bottlenecks: second-level cache data access, first-level instruction cache access Methodological discovery: micro-benchmarks A systematic evaluation framework
8 Contributions (II): Software Problem: Current data placement hurts caches Proposed novel data placement [subm. SIGMOD 01]: rearranges data records on the disk page, optimizes data cache performance Evaluated it against the popular scheme: 70% less data-related memory access delays, does not affect I/O behavior, especially beneficial for decision support workloads A cache-conscious data placement
9 Contributions (III): Hardware Problem: Hardware design affects DB behavior Compared Shore on four different systems: different processor architectures/µ-architectures, different memory subsystems Found evidence that DBMSs would benefit from: 2-4 way associative, larger L2 with no inclusion; large blocks, no sub-blocking; high-accuracy branch prediction; memory-aggressive execution engine A step towards a DSS-centric machine
10 Outline Introduction PART I: Analysis Background Query execution time breakdown Experimental results Bottleneck assessment PART II: Partition Attributes Across (PAX) PART III: Towards a DSS-centric h/w design Conclusions
11 Previous Work Workload characterization studies, e.g., [Barroso 98], [Keeton 98] Various platforms, mostly multiprocessor One DBMS per platform Results: commercial apps differ from scientific apps; OLTP differs from DSS workloads; memory is the major bottleneck No coherent study across DBMSs and workloads
12 An Execution Pipeline [Figure: FETCH/DECODE UNIT, DISPATCH/EXECUTE UNIT, and RETIRE UNIT arranged around an INSTRUCTION POOL, backed by the L1 I-CACHE, L1 D-CACHE, L2 CACHE, and MAIN MEMORY] Branch prediction, non-blocking cache, out-of-order
13 Where Does Time Go? Delays (stalls) come from hardware resources, branch mispredictions, and memory Overlap opportunity: Load A; D=B+C; Load E Execution Time = Computation + Stalls - Overlap
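The breakdown on this slide can be sketched as a toy model; the cycle counts below are hypothetical, not measurements from the talk:

```python
# Toy model of the slide's time breakdown (hypothetical numbers): stall
# cycles that the processor overlaps with useful computation are not paid.

def execution_time(computation, stalls, overlap):
    """All quantities in cycles; returns total execution time."""
    return computation + stalls - overlap

# e.g. 1.0e9 computation cycles and 0.8e9 stall cycles, of which the
# out-of-order engine hides 0.3e9 behind computation:
print(execution_time(1.0e9, 0.8e9, 0.3e9))  # 1500000000.0
```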
14 Setup and Methodology Range Selection (sequential, indexed) select avg (a3) from R where a2 > Lo and a2 < Hi Equijoin (sequential) select avg (a3) from R, S where R.a2 = S.a1 Four commercial DBMSs: A, B, C, D 6400 PII Xeon/MT running Windows NT 4 Used processor counters to measure/estimate Crafted microbenchmarks to isolate execution loops
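The range-selection microbenchmark can be mimicked in memory; a hypothetical Python analogue (the table contents and bounds are made up, and the real microbenchmarks ran inside each DBMS):

```python
# In-memory analogue of: select avg(a3) from R where a2 > Lo and a2 < Hi
import random

random.seed(0)  # deterministic toy table R(a1, a2, a3)
R = [(i, random.randrange(100), random.random()) for i in range(10_000)]

def range_avg(rows, lo, hi):
    """Average of a3 over rows whose a2 falls strictly between lo and hi."""
    hits = [a3 for (_a1, a2, a3) in rows if lo < a2 < hi]
    return sum(hits) / len(hits) if hits else None

avg = range_avg(R, 10, 50)
assert avg is not None and 0.0 < avg < 1.0  # a3 values lie in [0, 1)
```

Varying `Lo` and `Hi` varies the selectivity, which is how the later sensitivity experiments are parameterized.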
15 Time Calculations Measured: Resource stalls, L1I stalls Estimated: L1 data stalls: # misses * penalty L2 stalls: # misses * measured memory latency Branch misprediction stalls: # mispr. * penalty Overlap: measured CPI / expected CPI
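These estimates are simple products of counter readings and fixed penalties; a sketch with hypothetical counts and latencies:

```python
# Stall estimation as on the slide (all inputs hypothetical):
#   L1D stalls    = L1D misses * L1D miss penalty
#   L2 stalls     = L2 misses * measured memory latency
#   branch stalls = mispredictions * misprediction penalty

def estimated_stalls(l1d_misses, l1d_penalty, l2_misses, mem_latency,
                     mispredictions, branch_penalty):
    return (l1d_misses * l1d_penalty
            + l2_misses * mem_latency
            + mispredictions * branch_penalty)

# e.g. 2M L1D misses at 4 cycles, 500K L2 misses at 70 cycles,
# 100K mispredictions at 15 cycles:
print(estimated_stalls(2_000_000, 4, 500_000, 70, 100_000, 15))  # 44500000
```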
16 Microbenchmarks vs. TPC [Figure: clock-cycle breakdowns (computation, memory, branch misprediction, resource) for the sequential and indexed microbenchmarks next to the TPC-D and TPC-C benchmarks on Systems B and D] High CPI compared to integer workloads Sequential scan resembles TPC-D, 2ary index resembles TPC-C
17 Execution Time Breakdown (%) [Figure: percent execution time for sequential scan (10% selectivity), 2ary index selection (10%), and join (no index) on DBMSs A-D, split into computation, memory, branch mispredictions, and resource stalls] Stalls at least 50% of time Memory stalls are major bottleneck
18 Memory Stalls Breakdown (%) [Figure: percent memory stall time for sequential scan (10% selectivity), 2ary index selection (10%), and join (no index) on DBMSs A-D, split into L1 Data, L2 Data, L1 Instruction, and L2 Instruction stalls] L1 instruction and L2 data stalls dominate Different memory bottlenecks across DBMSs and queries
19 Summary of Analysis We can use microbenchmarks instead of TPC Execution time breakdown shows trends Memory access is a major bottleneck Increasing memory-processor performance gap Deeper memory hierarchies expected L2 cache data misses L2 grows (8MB), but will be slower Stalls due to L1 I-cache misses L1 I-cache not likely to grow as much as L2 We need to address every reason for stalls
20 Addressing Bottlenecks [Figure: who addresses each stall component: D-cache: DBMS (improve locality); I-cache: DBMS + compiler; branch mispredictions: compiler + hardware; hardware resources: hardware] Data cache: a clear responsibility of the DBMS
21 Outline Introduction PART I: Where Does Time Go? PART II: Partition Attributes Across The current scheme: Slotted pages Partition Attributes Across (PAX) Performance Results PART III: Towards a DSS-centric h/w design Conclusions
22 The Data Placement Tradeoff Slotted Pages: Used by all commercial DBMSs Store table records sequentially Intra-record locality (attributes of record r together) but pollutes cache Inspiration: Vertical partitioning [Copeland 85] Store n-attribute table as n single-attribute tables Problem: High record reconstruction cost Partition Attributes Across (PAX) Have the cake and eat it, too! PAX: Inter-record locality, low reconstruction cost
23 Current Scheme: Slotted Pages Formal name: NSM (N-ary Storage Model) [Figure: an NSM page for relation R (e.g., RID 1: SSN 1237, Name Jane, Age 30): PAGE HEADER, then record headers and records (Jane 30, John 45, Jim 20, Susan 52, ...) stored one after another] Records are stored sequentially Offsets to start of each record at end of page
24 Current Scheme: Slotted Pages [Figure: record layout: HEADER (null bitmap, record length, etc., plus offsets to variable-length fields), then FIXED-LENGTH VALUES, then VARIABLE-LENGTH VALUES] All attributes of a record are stored together
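A slotted page can be sketched as follows; a toy simplification (the page size, record format, and slot encoding are assumptions for illustration, not Shore's actual layout):

```python
PAGE_SIZE = 128  # toy page size in bytes (hypothetical)

class SlottedPage:
    """Records grow from the front of the page; the slot array of
    (offset, length) entries conceptually grows from the back."""
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.free_start = 0   # next free byte for record storage
        self.slots = []       # (offset, length) per record

    def insert(self, record: bytes):
        # Each slot conceptually costs 4 bytes at the end of the page.
        needed = self.free_start + len(record) + 4 * (len(self.slots) + 1)
        if needed > PAGE_SIZE:
            return None       # page full
        off = self.free_start
        self.data[off:off + len(record)] = record
        self.free_start += len(record)
        self.slots.append((off, len(record)))
        return len(self.slots) - 1  # slot id

    def get(self, slot_id):
        off, length = self.slots[slot_id]
        return bytes(self.data[off:off + length])

p = SlottedPage()
rid = p.insert(b"1237|Jane|30")
assert p.get(rid) == b"1237|Jane|30"
```

Because offsets are indirected through the slot array, records can move within the page without invalidating record ids.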
25 NSM Cache Behavior [Figure: an NSM page in MAIN MEMORY; executing select name from R where age > 40 pulls whole cache blocks, containing record headers, names, and ages together, into the CACHE] NSM pollutes the cache and wastes bandwidth
26 Partition Attributes Across (PAX) [Figure: the same records in an NSM page (PAGE HEADER, then full records: Jane 30, John 45, Jim 20, Susan 52) and in a PAX page (PAGE HEADER, then a minipage of names Jane, John, Jim, Susan followed by a minipage of ages)] Partition data within the page for spatial locality
27 PAX: Mapping to Cache [Figure: with PAX, executing select name from R where age > 40 brings cache blocks from MAIN MEMORY that each hold values of a single attribute (block 1: Jane, John, Jim, Susan)] Fewer cache misses, low reconstruction cost
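The contrast between the two layouts can be sketched with the slide's toy records; the flat-list encoding below is an assumption for illustration (real pages also carry headers, slots, and presence bits):

```python
records = [("Jane", 30), ("John", 45), ("Jim", 20), ("Susan", 52)]

# NSM: whole records stored one after another within the page.
nsm_page = [field for record in records for field in record]

# PAX: one minipage per attribute within the same page.
pax_page = [name for name, _ in records] + [age for _, age in records]

print(nsm_page)  # ['Jane', 30, 'John', 45, 'Jim', 20, 'Susan', 52]
print(pax_page)  # ['Jane', 'John', 'Jim', 'Susan', 30, 45, 20, 52]

# A scan of age touches every other slot under NSM but one contiguous run
# under PAX, so each cache line fetched for the scan holds only ages.
assert nsm_page[1::2] == pax_page[4:] == [30, 45, 20, 52]
```

Reconstruction stays cheap because both minipages live on the same page: record i is simply (pax_page[i], pax_page[4 + i]) here.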
28 PAX: Detailed Design [Figure: a PAX page: page header with pid, # attributes, attribute sizes (e.g. v 4 f), free space, and # records; an F-minipage per fixed-length attribute (e.g. Jane, John) with presence bits at its end; a V-minipage per variable-length attribute with v-offsets at its end]
29 Basic Evaluation: Methodology Main-memory resident R Query: select avg(a_i) from R where a_j >= Lo and a_j <= Hi PII Xeon running Windows NT 4 16KB L1-I, 16KB L1-D, 512 KB L2, 512 MB RAM Used processor counters Implemented schemes on Shore Storage Manager Similar behavior to commercial Database Systems
30 Why Use Shore? Range selection query on 4 commercial DBMSs + Shore Breakdown of execution & memory delays [Figure: for a range selection (no index), percent execution time (computation, memory, branch mispredictions, resource) and percent memory stall time (L1 Data, L2 Data, L1 Instruction, L2 Instruction) on DBMSs A-D and Shore] We can use Shore to evaluate DSS workload behavior
31 Effect on Accessing Cache Data [Figure: L1 and L2 data stall cycles per record for NSM vs. PAX, and L2 data stall cycles per record as selectivity varies from 1% to 100%] PAX incurs 70% less data cache penalty than NSM PAX reduces cache misses at both L1 and L2 Selectivity doesn't matter for PAX data stalls
32 Time and Sensitivity Analysis [Figure: clock cycles per record (resource, branch misprediction, memory, computation) for NSM vs. PAX, and elapsed time (sec) as the number of attributes in the record varies] PAX: 75% less memory penalty than NSM (10% of time) Execution times converge as number of attrs increases
33 Sensitivity Analysis (2) Elapsed time sensitivity to projectivity / # predicates Range selection queries, 1% selectivity [Figure: elapsed seconds for NSM vs. PAX as projectivity and the number of attributes in the predicate grow] PAX and NSM times converge as query covers entire tuple
34 Evaluation Using a DSS Benchmark Loaded 100M, 200M, and 500M TPC-H DBs Ran Queries: Range Selections w/ variable parameters (RS) TPC-H Q1 and Q6 sequential scans lots of aggregates (sum, avg, count) grouping/ordering of results TPC-H Q12 and Q14 (Adaptive Hybrid) Hash Join complex where clause, conditional aggregates PII Xeon running Windows NT 4 Used processor counters
35 Insertions with PAX Estimate average field sizes Start inserting records If a record doesn't fit: reorganize page (move minipage boundaries), adjust average field sizes [Figure: elapsed bulk-load times (seconds) for NSM, PAX, and DSM on the 100 MB, 200 MB, and 500 MB databases] % of reorganizations accommodate a single record PAX loads a TPC-H database 2-26% slower than NSM
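The insertion procedure above can be sketched as follows; a toy simplification (the byte budget, proportional split, and reorganization policy below are assumptions for illustration, not PAX's exact algorithm):

```python
PAGE_BYTES = 64  # toy per-page budget for minipage data (hypothetical)

class PaxPage:
    """Minipage capacities start from estimated field sizes; boundaries
    move when a record does not fit."""
    def __init__(self, est_field_sizes):
        total = sum(est_field_sizes)
        self.capacity = [PAGE_BYTES * s // total for s in est_field_sizes]
        self.minipages = [[] for _ in est_field_sizes]
        self.reorganizations = 0

    def _used(self, i):
        return sum(len(v) for v in self.minipages[i])

    def _fits(self, record):
        return all(self._used(i) + len(f) <= self.capacity[i]
                   for i, f in enumerate(record))

    def _reorganize(self, record):
        # Move minipage boundaries using the bytes actually needed so far,
        # which also refreshes the running size estimates.
        self.reorganizations += 1
        need = [self._used(i) + len(f) for i, f in enumerate(record)]
        total = sum(need)
        self.capacity = [PAGE_BYTES * n // total for n in need]

    def insert(self, record):
        if not self._fits(record):
            self._reorganize(record)
        if not self._fits(record):
            return False  # page genuinely full: caller starts a new page
        for i, f in enumerate(record):
            self.minipages[i].append(f)
        return True

page = PaxPage(est_field_sizes=[4, 2])
count = 0
while page.insert(("Jane", "30")):
    count += 1
assert count == 10 and page.reorganizations == 1
```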
36 Elapsed Execution Time [Figure: elapsed time (sec) for TPC-H queries Q1, Q6, Q12, and Q14 under NSM and PAX on the 100M, 200M, and 500M TPC-H databases] PAX improves performance up to 42% even with I/O
37 Speedup [Figure: PAX/NSM speedup on PII/NT for RS, Q1, Q6, Q12, and Q14 on the 100 MB, 200 MB, and 500 MB databases, ranging up to 45%] PAX improves performance up to 42% even with I/O Speedup differs across DB sizes
38 PAX: Summary Advantages High data cache performance Faster than NSM for DSS queries Orthogonal to other storage decisions Does not affect I/O performance Current Disadvantages Complex free space mgmt with variable length attributes Complicates update algorithm PAX beneficial for read-mostly workloads (e.g., DSS) (update-intensive workloads in future work)
39 Outline Introduction PART I: Where Does Time Go? PART II: Partition Attributes Across PART III: Towards DSS-Centric H/W Memory subsystem Branch prediction mechanism Processor pipeline Conclusions
40 Architecture Platform Differences RISC or CISC Instruction set Microarchitecture Pipeline Speculation (out-of-order, multiple issue) Branch prediction Memory subsystem Cache size, associativity Block size, subblocking Inclusion Which design looks beneficial for DSS workloads?
41 Experimental Setup Used four machines Sun UltraSparc: US-II and US-IIi, Solaris 2.6/2.7 Intel P6: PII Xeon, Linux v2.2 DEC Alpha: 21164A, OSF1 v.4.0 Architecture and microarchitecture characteristics (US-II / US-IIi / PII Xeon / Alpha): speed 296 MHz / 300 MHz / 400 MHz / 532 MHz; out of order? no / no / yes / no; instruction set RISC / RISC / CISC / RISC
42 Cache Hierarchies (US-II / US-IIi / PII Xeon / Alpha) L1 D: size, assoc 16KB DM / 16KB DM / 16KB 2-way / 8KB DM; block/subblock 32/16, 32/16, 32/32, 32/32; inclusion by L2 yes, yes, no, yes L1 I: size, assoc 16KB 2-way / 16KB 2-way / 16KB 4-way / 8KB DM; block/subblock 32/32, 32/32, 32/32, 32/16; inclusion by L2 yes, yes, no, no L2: size, assoc 2 MB DM / 512KB DM / 512KB 4-way / 96KB 3-way; block/subblock 64/64, 64/64, 32/32, 64/32; inclusion by L3 N/A, N/A, N/A, yes L3 (Alpha only): size, assoc 4 MB DM; block/subblock 64/
43 Methodology Compiled Shore with gcc Alpha version not optimized Ran DSS workload Range Selections w/ variable parameters (RS) TPC-H 1, 6, 12, 14 Used processor counters Sun: run-pic (by Glenn Ammons, modified) PII: PAPI (public-domain counter library) Alpha: DCPI (sampling software by Compaq)
44 Superscalar Processor Capability Alpha issues at most 2 instructions / cycle (max=4) >60% of the time the Xeon retires 0/1 instruction (max=3) [Figure: Alpha issue breakdown (pipeline dry, 0-, 1-, 2-issue) per query, and PII Xeon (NT) retire breakdown (0- to 3-retired) per DBMS, as % of total clock cycles] The current issue/retire width remains unexploited
45 Clock-per-Record Breakdown [Figure: percent clock cycles per record (D-stalls, I-stalls, branch mispredictions, other) for RS, Q1, Q6, Q12, and Q14 on the UltraSparc-II, PII Xeon, and A21164] Memory + branch misprediction stalls = 35-60% of time Data accesses: major memory bottleneck (esp. Q12, Q14)
46 Branch Prediction Branch penalty = frequency * misprediction rate * penalty Frequency is typically 20-25% In-order processors => lower penalty Low misprediction accuracy may break it (e.g., UltraSparc) [Table: branch frequency and misprediction rate for RS/Q1/Q6 vs. Q12/Q14 on the PII Xeon and Alpha (frequencies up to 22%, misprediction rates from 1% to 15%), with branch penalties of 15 cycles (PII Xeon) and 5 cycles (Alpha)] High-accuracy predictors
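The penalty formula can be evaluated directly; the example numbers below are hypothetical, chosen inside the ranges the slide quotes:

```python
# Branch stall contribution per instruction, per the slide's formula:
#   frequency * misprediction rate * misprediction penalty

def branch_stall_cpi(frequency, mispredict_rate, penalty_cycles):
    """Extra cycles per instruction lost to mispredicted branches."""
    return frequency * mispredict_rate * penalty_cycles

# e.g. 20% branch instructions, 5% mispredicted, 15-cycle penalty:
cpi_cost = branch_stall_cpi(0.20, 0.05, 15)
assert abs(cpi_cost - 0.15) < 1e-12  # about 0.15 extra CPI
```

Cutting the misprediction rate or the pipeline penalty shrinks the product directly, which is why both high-accuracy predictors and in-order (short-penalty) designs help.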
47 Cache Inclusion [Figure: UltraSparc II/IIi comparison (RS): elapsed time and L1D, L1I, L2D, and L2I misses on the UltraSparc-IIi normalized to the UltraSparc-II] Small caches should not maintain inclusion
48 Cache Block Size [Figure: PAX/NSM improvement in lowest-level data cache miss rates (PII Xeon L2, US-II L2, A21164 L3) for RS, Q1, Q6, Q12, and Q14, up to 120%] Larger cache line = lower miss rates
49 Sub-Blocking / Associativity UltraSparc: direct-mapped, subblocking (32/16) Xeon: 2-way, no subblocking (32/32) [Figure: L1D misses per record for RS on the US-II, US-IIi, and PII Xeon as projectivity and selectivity (2%-100%) vary] High associativity, no sub-blocking
50 PAX vs. NSM across platforms [Figure: PAX/NSM speedup on Unix (100MB database) for RS, Q1, Q6, Q12, and Q14 on the PII Xeon, UltraSparc-II, and A21164, up to 45%] PAX improves all queries
51 Summary Memory hierarchy: non-blocking caches; >64-byte block, no sub-blocking; generous-sized L1-I (128K) and L2 (> 2MB); a tiny, fast L1/L2 with a large, slow L3 won't add much; high associativity (2-4); no inclusion (at least for instructions) Processor pipeline: issue width is fine, out-of-order overlaps stall time; execution engine to sustain >1 load/store instr.; high-accuracy branch prediction, provided that implementation cost remains stable
52 Conclusions Found trends in behavior of commercial DBMSs using an analytic framework to model execution time Identified bottlenecks among HW components Main memory access is the new DB bottleneck Major showstoppers: L1 Instruction + L2 Data Proposed new design for cache performance Increase spatial locality using novel data placement 70% less data-related memory access delays Significant improvement on sequential scans Evaluated several hardware parameters Suggested DSS-centric processor and memory design
More informationAdvanced Processor Architecture
Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationStoring Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Database Management Systems need to:
Storing : Disks and Files base Management System, R. Ramakrishnan and J. Gehrke 1 Storing and Retrieving base Management Systems need to: Store large volumes of data Store data reliably (so that data is
More informationMemory. Lecture 22 CS301
Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch
More informationStoring Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Chapter 7
Storing : Disks and Files Chapter 7 base Management Systems, R. Ramakrishnan and J. Gehrke 1 Storing and Retrieving base Management Systems need to: Store large volumes of data Store data reliably (so
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors
ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors Jeffrey Dean Jamey Hicks Carl Waldspurger William Weihl George Chrysos Digital Equipment Corporation 1 Motivation
More informationStoring and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?
Storing and Retrieving Storing : Disks and Files base Management Systems need to: Store large volumes of data Store data reliably (so that data is not lost!) Retrieve data efficiently Alternatives for
More informationStoring and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?
Storing and Retrieving Storing : Disks and Files Chapter 9 base Management Systems need to: Store large volumes of data Store data reliably (so that data is not lost!) Retrieve data efficiently Alternatives
More informationCSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]
CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user
More informationEITF20: Computer Architecture Part3.2.1: Pipeline - 3
EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done
More informationUse-Based Register Caching with Decoupled Indexing
Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register
More informationCache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics
More informationDatabase Systems and Modern CPU Architecture
Database Systems and Modern CPU Architecture Prof. Dr. Torsten Grust Winter Term 2006/07 Hard Disk 2 RAM Administrativa Lecture hours (@ MI HS 2): Monday, 09:15 10:00 Tuesday, 14:15 15:45 No lectures on
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationAs the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.
Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationOutline. 1 Reiteration. 2 Cache performance optimization. 3 Bandwidth increase. 4 Reduce hit time. 5 Reduce miss penalty. 6 Reduce miss rate
Outline Lecture 7: EITF20 Computer Architecture Anders Ardö EIT Electrical and Information Technology, Lund University November 21, 2012 A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21,
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationTopic 14: Dealing with Branches
Topic 14: Dealing with Branches COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 FLASHBACK: Pipeline Hazards Control Hazards What is the next instruction?
More informationTopic 14: Dealing with Branches
Topic 14: Dealing with Branches COS / ELE 375 FLASHBACK: Pipeline Hazards Control Hazards What is the next instruction? Branch instructions take time to compute this. Stall, Predict, or Delay: Computer
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationPowerPC 620 Case Study
Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance
More informationSandor Heman, Niels Nes, Peter Boncz. Dynamic Bandwidth Sharing. Cooperative Scans: Marcin Zukowski. CWI, Amsterdam VLDB 2007.
Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS Marcin Zukowski Sandor Heman, Niels Nes, Peter Boncz CWI, Amsterdam VLDB 2007 Outline Scans in a DBMS Cooperative Scans Benchmarks DSM version VLDB,
More informationNew Advances in Micro-Processors and computer architectures
New Advances in Micro-Processors and computer architectures Prof. (Dr.) K.R. Chowdhary, Director SETG Email: kr.chowdhary@jietjodhpur.com Jodhpur Institute of Engineering and Technology, SETG August 27,
More informationEE 4683/5683: COMPUTER ARCHITECTURE
EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major
More informationParallel Computer Architecture
Parallel Computer Architecture What is Parallel Architecture? A parallel computer is a collection of processing elements that cooperate to solve large problems fast Some broad issues: Resource Allocation:»
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationMultiple Issue ILP Processors. Summary of discussions
Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware
More information2
1 2 3 4 5 6 For more information, see http://www.intel.com/content/www/us/en/processors/core/core-processorfamily.html 7 8 The logic for identifying issues on Intel Microarchitecture Codename Ivy Bridge
More informationPipelining, Branch Prediction, Trends
Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping
More informationDatabase Systems. November 2, 2011 Lecture #7. topobo (mit)
Database Systems November 2, 2011 Lecture #7 1 topobo (mit) 1 Announcement Assignment #2 due today Assignment #3 out today & due on 11/16. Midterm exam in class next week. Cover Chapters 1, 2,
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationIn-order vs. Out-of-order Execution. In-order vs. Out-of-order Execution
In-order vs. Out-of-order Execution In-order instruction execution instructions are fetched, executed & committed in compilergenerated order if one instruction stalls, all instructions behind it stall
More informationColumn-Stores vs. Row-Stores: How Different Are They Really?
Column-Stores vs. Row-Stores: How Different Are They Really? Daniel J. Abadi, Samuel Madden and Nabil Hachem SIGMOD 2008 Presented by: Souvik Pal Subhro Bhattacharyya Department of Computer Science Indian
More informationBeyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji
Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of
More informationComputer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013
18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed
More informationEECS 322 Computer Architecture Superpipline and the Cache
EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:
More informationBranch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken
Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of
More information