Weaving Relations for Cache Performance

Size: px

Start display at page:

Download "Weaving Relations for Cache Performance"

Franklin Kelley
5 years ago
Views:

1 Weaving Relations for Cache Performance Anastassia Ailamaki Carnegie Mellon Computer Platforms in 198 Execution PROCESSOR 1 cycles/instruction Data and Instructions cycles <1MBps 1 Megabyte (Buffer pool) Hot data DATABASE The main performance bottleneck was I/O latency Present Platforms Execution PROCESSOR Data and.33 CACHE cycles/instruction Instructions 7 cycles Data and Instructions 1 Gigabyte 25 MBps DB DB DB Hot data migrates to larger and slower main memory 1

2 Future Platforms Execution PROCESSOR CACHE Data and Instructions Data and Instructions 1 Terabyte MBps DB DB DB Hot data migrates to larger and slower main memory Processor/Memory Speed Gap cycles per instruction VAX 11/78 CPI memory latency Pentium II Xeon Memory latency 1 memory access 1 instructions Why has the Bottleneck Shifted?! Lots of I/O optimizations! hardware (DMA)! software (storage managers do sequential I/O)! Disks are powerful! have cache memories and processors! disks are replacing tapes as backup devices! Databases are largely memory-resident! hot data are in big but slow main memory! Compute- and memory-bound applications! e.g., spatial joins New bottleneck: memory+computation resources 2

3 Why Study DB Workloads? Cycles per instruction Theoretical minimum Desktop/ Decision Engineering Support (SPECInt) Online Transaction Processing High average time per instruction for DB workloads Outline! The memory/processor speed gap! Importance of Data Placement! What s wrong with slotted pages?! Partition Attributes Across (PAX)! Performance results! Summary An Execution Pipeline INSTRUCTION POOL FETCH/ DECODE UNIT DISPATCH EXECUTE UNIT RETIRE UNIT L1 I-CACHE L1 D-CACHE L2 CACHE Branch prediction, non-blocking cache, out-of-order 3

4 Memory hierarchy 11! Memory hierarchy = main memory + caches! What are caches?! Why are they important?! Caches in today s machines! PIII! A2114! Itanium! Power4! Important cache design parameters? Where Does Time Go? Hardware Resources Branch }Delays (Stalls) Mispredictions Memory Computation Overlap opportunity: Load A D=B+C Load E Execution Time = Computation + Stalls - Overlap Execution Time Breakdown (%)! PII Xeon running NT 4., 4 commercial s: A,B,C,D! Stalls at least 5 of the time % execution time Range Selection (no index) Range Selection (index) Join (no index) A B C D B C D Computation Memory Branch mispredictions Resource A B C D Memory stalls are major bottleneck 4

5 Breakdown of Memory Delays! PII Xeon running NT 4., 4 commercial s: A,B,C,D! Memory-related delays: 4-8 of execution time Range Selection (no index) Range selection (index) Join (no index) Memory stalltime (%) A B C D A B C D L1 Data L2 Data L1 Instruction L2 Instruction A B C D Data accesses: 19%-8% of memory stalls Data Placement on Disk Pages! Commercial s use Slotted pages " Store table records sequentially Intra-record locality (attributes of record r together) $ Doesn t work well on today s memory hierarchies! Alternative: Vertical partitioning [Copeland 85] " Store n-attribute table as n single-attribute tables Inter-record locality, saves unnecessary I/O $ Destroys intra-record locality => expensive to reconstruct record! Contribution: Partition Attributes Across have the cake and eat it, too Inter-record locality + low reconstruction cost Outline! The memory/processor speed gap! Importance of Data Placement! What s wrong with slotted pages?! Partition Attributes Across (PAX)! Performance results! Summary 5

6 Current Scheme: Slotted Pages Formal name: NSM (N-ary Storage Model) R PAGE HEADER RH RID 1 SSN 1237 Name Jane Age 3 Jane 3 RH John 45 RH3 153 Jim 2 RH John Susan Jim Susan Leon Dan 37 NSM stores records sequentially w/ offsets Predicate Evaluation using NSM PAGE HEADER RH Jane 3 RH John 45 RH3 153 Jim 2 RH4 758 Susan Leon Jane 3 RH block 1 45 RH3 153 block 2 Jim 2 RH4 block Leon block 4 CACHE select name from R where age > 5 NSM pushes non-referenced data to the cache Need New Data Page Layout! Eliminates unnecessary memory accesses! Improves inter-record locality! Keeps a record s fields together! Does not affect I/O performance and, most importantly, is low-implementation-cost, high-impact

7 Outline! The memory/processor speed gap! Importance of Data Placement! What s wrong with slotted pages?! Partition Attributes Across (PAX)! Performance results! Summary Partition Attributes Across (PAX) NSM PAGE PAX PAGE PAGE HEADER RH PAGE HEADER Jane 3 RH John RH3 153 Jim 2 RH4 758 Susan 52 Jane John Jim Susan Partition data within the page for spatial locality Predicate Evaluation using PAX PAGE HEADER block 1 Jane John Jim Suzan CACHE select name from R where age > 5 Fewer cache misses, low reconstruction cost 7

8 A Real NSM Record HEADER FIXED-LENGTH VALUES VARIABLE-LENGTH VALUES null bitmap, record length, etc offsets to variablelength fields NSM: All fields of record stored together + slots PAX: Detailed Design pid v Jane 3 45 # records free space attribute # attributes sizes John presence bits 1 1 V - Minipage v-offsets F - Minipage } presence bits 1 1 } } f } Page Header F - Minipage PAX: Group fields + amortizes record headers Outline! The memory/processor speed gap! Importance of Data Placement! What s wrong with slotted pages?! Partition Attributes Across (PAX)! Performance results! Summary 8

9 Sanity Check: Basic Evaluation! Main-memory resident R, numeric fields! Query: select avg (a i ) from R where a j >= Lo and a j <= Hi! PII Xeon running Windows NT 4! 1KB L1-I, 1KB L1-D, 512 KB L2, 512 MB RAM! Used processor counters! Implemented schemes on Shore Storage Manager! Similar behavior to commercial Database Systems! Compare Shore query behavior with commercial! Execution time & memory delays (range selection) 1 Execution time breakdown Why Use Shore? 1 Memory stall time breakdown % execution time Memory stall time (%) A B C D Shore A B C D Shore Computation Memory Branch mispr. Resource L1 Data L2 Data L1 Instruction L2 Instruction We can use Shore to evaluate workload behavior Effect on Accessing Cache Data! PAX saves 7 of data penalty (L1+L2)! Selectivity doesn t matter for PAX data stalls Cache data stalls Sensitivity to Selectivity stall cycles / record L1 Data stalls L2 Data stalls NSM PAX page layout stall cycles / record NSM L2 PAX L2 1% 5% selectivity PAX drastically reduces data stalls 9

10 Time and Sensitivity Analysis! PAX: 75% less memory penalty than NSM (1 of time)! Execution times converge as number of attrs increases 18 Execution time breakdown Sensitivity to # of attributes clock cycles per record NSM PAX page layout Hardware Resource Branch Mispredict Memory Computation elapsed time (sec) NSM PAX # of attributes in record PAX improves overall execution time Sensitivity Analysis (2)! Elapsed time sensitivity to projectivity / # predicates! Range selection queries, 1% selectivity 5 5 seconds NSM PAX projectivity seconds NSM PAX # of attributes in predicate PAX,NSM times converge as query covers entire tuple Evaluation Using a DSS Benchmark! 1M, 2M, and 5M TPC-H DBs! Queries: 1. Range Selections w/ variable parameters (RS) 2. TPC-H Q1 and Q! sequential scans! lots of aggregates (sum, avg, count)! grouping/ordering of results 3. TPC-H Q12 and Q14! (Adaptive Hybrid) Hash Join! complex where clause, conditional aggregates! 128MB buffer pool 1

11 PAX/NSM Speedup TPC-H Queries: Speedup! Avg(range selections) + 4 TPC-H queries! ShoreonPII/NT 45% 3 15% 1 MB 2 MB 5 MB PAX/NSM Speedup RS Q1 Q Q12 Q14 Query PAX: 5 elapsed time improvement in TPC-H PAX vs. NSM across platforms! Avg(range selections) + 4 TPC-H queries! Shore on PII/Linux, UltraSparc-II/Solaris, A2114/Tru4 PAX/NSM Speedup 45% 3 15% PAX/NSM Speedup on Unix (1MB database) PII Xeon UltraSparc-II A2114 RS Q1 Q Q12 Q14 Query PAX improves performance across platforms! PAX: a low-cost, high-impact DP technique! Performance! Eliminates unnecessary memory references! High utilization of cache space/bandwidth! Faster than NSM (does not affect I/O)! Usability Conclusions! Orthogonal to other storage decisions! Easy to implement in large existing s 11

Bridging the Processor/Memory Performance Gap in Database Applications

Bridging the Processor/Memory Performance Gap in Database Applications Anastassia Ailamaki Carnegie Mellon http://www.cs.cmu.edu/~natassa Memory Hierarchies PROCESSOR EXECUTION PIPELINE L1 I-CACHE L1 D-CACHE