Database Workload


Lydia Cooper
5 years ago

1 Database Workload
+ Low throughput (0.8 IPC on an 8-wide superscalar, 1/4 of SPEC)
+ Naturally threaded (and widely used) application
- Already high cache miss rates on a single-threaded machine (destructive interference could be a problem)
Key question: Can SMT's simultaneous multi-thread instruction issue hide the latencies from additional misses in these already memory-intensive databases?

2 Workload & Methodology
On-line transaction processing (OLTP)
  updating account balances for a bank
  16 server threads
Decision support systems (DSS)
  data warehousing, data mining
  answering business questions
Trace-driven simulation
  ATOM-generated instruction traces of Oracle
  context, 8-wide SMT simulator

3 OLTP Characterization
Memory behavior for OLTP (1 context, 16 server processes)

Memory region                         L1 cache miss rate (128KB caches)   Memory footprint
Instruction text                      13.7%                               556 KB
Program global area (PGA) (private)   7.4%                                1.3 MB
Buffer cache                          6.8%                                9.3 MB
Metadata                              12.9%                               26.5 MB

High miss rates
Large memory footprints

4 Shared Resources are Critical
Almost all SMT hardware resources are dynamically shared by all executing threads
Benefit: better hardware resource utilization
  SMT outperformed superscalar by 2-3X, single-chip MPs by 52%
Cost: potential inter-thread contention
  shared hardware data structures contain the working sets of multiple threads
  increase in some types of conflict
Managing the shared resources effectively may avoid the conflicts

5 L2 Cache Interference
[Figure: two panels, OLTP and DSS, plotting L2 cache miss rate (global) against number of contexts; label: interthread conflict misses]

6 Critical Working Sets
Total footprint is too big, but how much of it do we really need?
Some data is more important than other data
Skewed reference behavior: a majority of references go to a minority of memory blocks
  for example, 87% of OLTP instruction references are to 31% of the instruction footprint; 98% & 6KB for DSS
  for example, 41% of OLTP metadata references are to 26KB
Commercial workloads have small critical working sets
Even with multiple threads, performance-critical working sets might fit in the caches
Multithreading doesn't have to cause more misses
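The skew described on this slide can be measured from a block-level reference trace. The following is a minimal sketch, assuming a trace is simply a list of block identifiers; the function name `critical_working_set` and the synthetic trace are illustrative and do not come from the original study.

```python
from collections import Counter


def critical_working_set(block_refs, coverage):
    """Return (blocks_needed, footprint_fraction): how many of the most
    frequently referenced blocks are needed to cover the requested share
    of all references, and what fraction of the footprint they represent."""
    counts = Counter(block_refs)
    total = len(block_refs)
    covered = 0
    needed = 0
    # Walk blocks from hottest to coldest until the coverage target is met.
    for _, n in counts.most_common():
        if covered >= coverage * total:
            break
        covered += n
        needed += 1
    return needed, needed / len(counts)


# Synthetic, made-up trace: block 0 is hot, blocks 1..9 are touched once each.
trace = [0] * 90 + list(range(1, 10))
needed, fraction = critical_working_set(trace, 0.87)
print(needed, fraction)
```

A small `fraction` for a high `coverage` is the signature of a small critical working set, which is the property the slide relies on.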

7 Cache Conflicts
[Figure: percentage of L1 and L2 cache references that miss, for OLTP and DSS, broken down into misses to metadata, buffer cache, PGA, and instructions]
L1 and L2 misses dominated by PGA references
Conflict misses can be avoided by page mapping and per-thread address offsetting

8 Page Mapping Alternatives
Map virtual pages to physical page frames
Example reference stream (virtual pages):
Page coloring
[Figure: virtual address space mapped with page coloring; reference stream MMMMMMHMMMMM]
Page coloring exploits spatial locality

9 Bin hopping exploits temporal locality
Bin hopping
[Figure: virtual address space mapped with bin hopping; reference stream MMHMHMHMMMMM]
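The two policies can be contrasted with a small sketch. It assumes the common formulation in which page coloring picks a cache bin from the virtual page number, while bin hopping hands out bins round-robin in the order pages are first touched. The reference stream below is hypothetical (the slides' own example stream did not survive), and the function names are illustrative.

```python
def page_coloring(reference_stream, num_bins):
    """Bin is a function of the virtual page number, so pages that are
    adjacent in the virtual address space land in adjacent bins."""
    return {vpn: vpn % num_bins for vpn in reference_stream}


def bin_hopping(reference_stream, num_bins):
    """Bins are assigned round-robin at first touch, so pages that are
    first used close together in time land in different bins."""
    mapping = {}
    next_bin = 0
    for vpn in reference_stream:
        if vpn not in mapping:
            mapping[vpn] = next_bin
            next_bin = (next_bin + 1) % num_bins
    return mapping


# Hypothetical stream: three pages spaced num_bins apart, used together.
stream = [0, 4, 8, 0, 4, 8]
coloring = page_coloring(stream, 4)
hopping = bin_hopping(stream, 4)
print(coloring)  # all three pages share one bin
print(hopping)   # each page gets its own bin
```

With this stream, page coloring maps all three pages to the same bin, whereas bin hopping spreads them out. A stream of contiguous pages would show the opposite strength of page coloring.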

10 Page Mapping Results
[Figure: L2 cache miss rate (global) vs. number of contexts, for OLTP and DSS; series: bin hopping, page coloring, page coloring with seed]

11 Application-level Offsetting
[Figure: virtual address space before and after offsetting, showing the PGAs of Thread 0 and Thread 1; address labels not recoverable]
Base of each thread's PGA is at the same virtual address
Causes inter-thread conflicts in the (virtually-indexed) L1 cache
Address offsets can avoid interference (thread id * 8KB)
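A short sketch shows why the offset helps in a virtually-indexed cache. The 128KB cache size and the 8KB per-thread offset come from the slides; the 64-byte line size, the direct-mapped organization, and the base address `PGA_BASE` are assumptions chosen only to make the arithmetic concrete.

```python
CACHE_BYTES = 128 * 1024   # L1 size from the slides
LINE_BYTES = 64            # assumption: line size is not given in the slides
NUM_SETS = CACHE_BYTES // LINE_BYTES  # assumption: direct-mapped cache
OFFSET_BYTES = 8 * 1024    # per-thread offset from the slides
PGA_BASE = 0x10000000      # hypothetical base virtual address


def cache_set(addr):
    """Set selected by a virtual address in a virtually-indexed cache."""
    return (addr // LINE_BYTES) % NUM_SETS


def pga_base(thread_id, offsetting):
    """Base virtual address of a thread's PGA, with or without offsetting."""
    return PGA_BASE + (thread_id * OFFSET_BYTES if offsetting else 0)


before = [cache_set(pga_base(t, False)) for t in range(4)]
after = [cache_set(pga_base(t, True)) for t in range(4)]
print(before)  # every thread's PGA starts in the same set
print(after)   # each thread's PGA starts in a different set
```

Without the offset, the same position in every thread's PGA competes for the same cache set. With the offset, the hot start of each PGA is shifted to its own region of the cache.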

12 Offsetting Results
[Figure: L1 data cache miss rate vs. number of contexts, for OLTP and DSS; series: bin hopping no offset, bin hopping with offset]

13 SMT Performance with bin hopping, application offsetting, L1 I-cache thread-sharing
[Figure: instructions/cycle for OLTP and DSS at 1, 2, 4, and 8 contexts]

14 Summary
SMT gets good speedups on commercial database workloads
  3x for debit/credit systems, 1.5x for big search problems
Limits inter-thread data interference to superscalar levels
  virtual-to-physical page mapping policies
  address offsetting for thread-private data
Exploits inter-thread instruction sharing (35% reduction in instruction cache miss rates)
