STEPS Towards Cache-Resident Transaction Processing
Stavros Harizopoulos
joint work with Anastassia Ailamaki
Carnegie Mellon University
VLDB 2004
OLTP workloads on modern CPUs

[Figure: CPI breakdown into computation vs. L1-I, L1-D, L2-I, L2-D, and other stalls, as L2 cache size grows from 256KB to 10MB]
[Figure: max on-chip L2/L3 cache size vs. L1-I cache size for server CPUs introduced 1996-2004]

L1-I stalls account for 25-40% of execution time
Instruction caches cannot grow
We need a solution for instruction cache-residency
Steps for cache-resident code

Eliminate misses for a group of Xactions
Xactions are assigned to threads
Multiplex execution at fine granularity
Reuse instructions in the L1-I cache

STEPS: Synchronized Transactions through Explicit Processor Scheduling
Fewer misses & mispredicted branches

[Figure: number of L1-I misses for index selection, Shore vs. Steps, for 1-8 concurrent threads]
[Figure: normalized cycles, L1-I misses, branch mispredictions, and L1-D misses for the Payment Xaction (TPC-C), Shore vs. Steps]

Up to 1.4x speedup
Eliminate 96% of L1-I misses for each additional thread
Eliminate 64% of mispredicted branches
Outline

Background & related work
Basic implementation of Steps
Microbenchmarks (AthlonXP, SimFlex simulator)
Applying Steps to OLTP workloads
TPC-C results (Shore on AthlonXP)
Background

CPU caches trade size for lookup speed; L1-I misses are expensive.

[Figure: 2-way set-associative L1-I cache example — a loop that conditionally calls B() suffers capacity misses when its code exceeds the L1-I cache, and conflict misses when blocks of the loop and of B() map to the same cache sets]

Higher associativity and larger cache size reduce misses, but mean slower access to the L1-I cache and a slower CPU clock.
Related work

Database & architecture papers:
DB workloads are increasingly non I/O-bound
L2/L3 data misses, L1-I misses
ORACLE OLTP code working set: 560KB

Hardware & compiler approaches:
Increase block size, add stream buffer [asplos98]
Call graph prefetching (for DSS) [tocs03]
Code layout optimizations [isca01] [..]
Related work: within the DBMS

Data-cache misses (mostly DSS)
Cache-aware page layout, B-trees, join algorithms — active area [..]

Instruction-cache misses in DSS
Batch processing of tuples [icde01][sigmod04]

Instruction-cache misses in OLTP
Challenging!
Outline

Related work
Basic implementation of Steps
Microbenchmarks
Applying Steps to OLTP workloads
TPC-C results
Steps overview

DBMS assigns Xactions to threads
Xactions consist of a few basic operators (Ops):
index select, scan, update, insert, delete, commit
Steps groups threads per Op
Within each Op, reuse instructions via I-cache aware context-switching
I-cache aware context-switching

[Figure: two threads executing select() (code steps s1-s7). Before: the operator code does not fit in the instruction cache, so every step misses for both threads. After: the CPU performs a context-switch (CTX) as soon as the cache-sized chunk of code has executed, so thread 2 hits on the instructions thread 1 just brought in]
Basic implementation on Shore

Assume (for now):
threads interested in the same Op
uninterrupted flow (no locks, no I/O)

Fast, small, compatible CTX code
76 bytes; bypass (for now) the full CTX

Add CTX calls throughout the Op source code
Use hardware counters (PAPI) on a sample Op
Outline

Related work
Basic implementation of Steps
Microbenchmarks
Applying Steps to OLTP workloads
TPC-C results
Microbenchmark setup

All experiments on index fetch, in-memory index (45KB footprint)
Fast CTX for both Steps/Shore, warm cache

AMD AthlonXP:
L1 I + D cache size: 64KB + 64KB
associativity: 2-way
block size: 64 bytes
L2 cache size: 256KB

Simulated IA-32 (SimFlex): vary all cache parameters
L1-I cache misses

[Figure: L1-I cache misses on AthlonXP, Shore vs. Steps, for 1-10 concurrent threads]

Steps eliminates 92-96% of misses for additional threads
All misses are conflict misses (cache is 64KB)
L1-I misses & speedup

[Figure: L1-I miss reduction (with its upper limit) and speedup on AthlonXP for 10-80 concurrent threads]

Steps achieves max performance for 6-10 threads
No need for larger thread groups
Smaller L1-I cache

[Figure: counts normalized to Shore (10 threads) on AthlonXP and Pentium III — cycles, L1-I misses, L1-D misses, branches, branch mispredictions, missed BTB entries, and instruction stall cycles]

Steps outperforms Shore even on smaller caches (PIII)
62-64% fewer mispredicted branches on both CPUs
SimFlex: L1-I misses

[Figure: L1-I misses for 10 threads (64-byte cache blocks), Shore vs. Steps, with 16KB, 32KB, and 64KB (AthlonXP-sized) caches, from direct-mapped through 2-way, 4-way, and 8-way to fully associative]

Steps eliminates all capacity misses (16KB, 32KB caches)
Up to 89% overall miss reduction (upper limit is 90%)
Outline

Related work
Basic implementation of Steps
Microbenchmarks
Applying Steps to OLTP workloads
TPC-C results
Design goals

High concurrency on similar Ops
Cover the full spectrum of Ops
Correctness & low overhead for:
locks, latches, mutexes
disk I/O
exceptions (abort & roll back)
housekeeping (deadlock detection, buffer pool)
Overview

1. Thin wrappers per Op to sync Xactions
   Form execution teams per Op
   Flexible definition of Op
2. Best-effort within execution teams
   Fast CTX through fixed scheduling
   Threads leave the team on exceptions
3. Repair thread structures at exceptions
   Modify only the thread package
System design

[Figure: execution teams — each Op (X, Y, Z) has a Steps wrapper; threads flow from one Op's team to another, and a stray thread runs outside its team]

Threads go astray on exceptions
Regroup at the next Op
Can have execution teams per database table
Outline

Related work
Basic implementation of Steps
Microbenchmarks
Applying Steps to OLTP workloads
TPC-C results
Experimentation setup

Shore/Steps: AthlonXP, 2GB RAM, 2 disks
Shore locking
hierarchy: record, page, table, DB
protocol: 2-phase

TPC-C workload: wholesale parts supplier
10-30 Warehouses, 100-300 users
Increased concurrency through zero think time
In-memory database, lazy commits
One Xaction: Payment

[Figure: counts normalized to Shore for 10, 20, and 30 Warehouses — cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, and branch mispredictions]

Steps outperforms Shore
1.4x speedup, 65% fewer L1-I misses
48% fewer mispredicted branches
For 10 Warehouses: 15 ready threads, 7 threads / team
Mix of four Xactions

[Figure: counts normalized to Shore for 10 and 20 Warehouses — cycles, L1-I misses, L1-D misses, L2-I misses, L2-D misses, and branch mispredictions]

Xaction mix reduces the average team size (4.3 in 10W)
Still, Steps has 56% fewer L1-I misses (out of a 77% max)
Summary of results

Steps can handle full OLTP workloads
Significant improvements in TPC-C:
65% fewer L1-I misses
48% fewer mispredicted branches
Room for improvement:
Steps was not tuned for TPC-C
Shore's code yields low concurrency

Steps minimizes both capacity & conflict misses without increasing I-cache size / associativity
Thank you