Anastasia Ailamaki. Performance and energy analysis using transactional workloads



1 Performance and energy analysis using transactional workloads Anastasia Ailamaki EPFL and RAW Labs SA students: Danica Porobic, Utku Sirin, and Pinar Tozun

2 Online Transaction Processing $2B+ industry Characteristics: Has many concurrent requests Touch small part of whole data Need high & predictable performance Primary application for databases

3 Hardware keeps providing new forms of parallelism. How's the utilization? Hardware OLTP runs on: Core ILP SMT multisocket multicores heterogeneous many-cores

4 Utilizing modern processors [Figure: execution cycles breakdown, stalled vs. busy, for Shore-MT, DBMS D, VoltDB, HyPer, and DBMS M; TPC-C, 1GB data, 1 worker thread, Intel Ivy Bridge] Processor stalled most of the time

5 Scaling up OLTP on multisockets [Figure: throughput vs. number of sockets] Multisocket servers severely under-utilized

6 Why care about power? [J. R. Hamilton] [Figure: monthly datacenter costs by category (servers, networking equipment, power distribution & cooling, power, other), with shares of 18%, 13%, 8%, 4%, and 57%; 3% power-related; dynamic fraction increasing] [Figure: energy proportionality, today vs. ideal, plotted against utilization] Energy efficiency as important as performance

7 Why is my system under-utilizing hardware? Why isn't my system faster on new hardware? Are new processors more energy-efficient?

9 Analyzing performance and energy Macrobenchmarks or Microbenchmarks? Execution time breakdowns Measuring energy efficiency

11 Utilization (microarchitecture level) [EDBT13] [Figure: instructions per cycle for TPC-B, TPC-C, and TPC-E, with each processor's maximum marked; panels: Intel Xeon X566 and Sun Niagara T2] OLTP utilizes lean cores better. TPC-E has higher IPC.
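The IPC and utilization figures on this slide come from two hardware counters. A minimal sketch of the arithmetic, assuming utilization means measured IPC divided by the core's maximum IPC; the function names and counter values are hypothetical, not taken from the slides:

```python
def ipc(instructions_retired: int, cycles: int) -> float:
    """Instructions per cycle from two hardware counter readings."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles


def utilization(measured_ipc: float, max_ipc: float) -> float:
    """Fraction of the core's peak IPC that the workload achieves (assumed definition)."""
    return measured_ipc / max_ipc


# Hypothetical counter values for illustration only.
example_ipc = ipc(instructions_retired=8_000_000, cycles=10_000_000)
example_util = utilization(example_ipc, max_ipc=4.0)
print(f"IPC = {example_ipc:.2f}, utilization = {example_util:.0%}")
```

Comparing utilization rather than raw IPC is what lets a lean core with a low maximum IPC be set against a wide out-of-order core.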

12 Macrobenchmark: Execution Cycles & Stalls [EDBT13] [Figure: two panels on Intel Xeon X566 for TPC-B, TPC-C, and TPC-E; execution cycles breakdown (stalled vs. busy) and core stalls (instruction vs. rest)] Over 7% of time goes to stalls. Instruction stalls are the main problem.

13 Utilization across systems [SIGMOD16] [Figure: execution cycles breakdown, stalled vs. busy, for Shore-MT, DBMS D, VoltDB, HyPer, and DBMS M, grouped as on-disk and in-memory; TPC-C, 1GB, 1 worker thread, Intel Ivy Bridge] Even in-memory systems stall > 6% time

14 Microbenchmarks for what-if analysis [SIGMOD16] [Figure: IPC vs. number of rows read for Shore-MT, DBMS D, VoltDB, HyPer, and DBMS M; 1GB, 1 worker thread, Intel Ivy Bridge] Lower data locality -> low IPC for some systems

15 OLTP on hardware islands [PVLDB12] [Diagram: two sockets, each with cores that have private L1 and L2 caches, plus an L3, a memory controller, and inter-socket links; annotated with 21 cycles, 72 cycles, and 237 cycles] How to measure the impact? 2x12-core Intel Xeon E5-265L v3, 24x16GiB DDR3

16 Partition-sensitive microbenchmark [PVLDB12] Single-site version: probe/update N rows from the local site. Multisite version: probe/update 1 row from the local site; probe/update N-1 rows uniformly from any site; sites may reside on the same instance.
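The two transaction shapes of this microbenchmark can be sketched as a small access-pattern generator. This is an illustration under stated assumptions, not the benchmark code from [PVLDB12]: the function names, the `(site, slot)` representation of a row access, and the percentage knob are all hypothetical.

```python
import random


def rows_for_transaction(n_rows, local_site, n_sites, multisite, rng):
    """Return the (site, slot) pairs touched by one transaction."""
    if not multisite:
        # Single-site version: all N rows come from the local site.
        return [(local_site, slot) for slot in range(n_rows)]
    # Multisite version: 1 row from the local site ...
    accesses = [(local_site, 0)]
    # ... and N-1 rows drawn uniformly from any site (possibly the local one).
    for slot in range(1, n_rows):
        accesses.append((rng.randrange(n_sites), slot))
    return accesses


def generate(n_txns, pct_multisite, n_rows, n_sites, seed=0):
    """Generate a workload with the given percentage of multisite transactions."""
    rng = random.Random(seed)
    workload = []
    for _ in range(n_txns):
        local_site = rng.randrange(n_sites)
        multisite = rng.random() * 100 < pct_multisite
        workload.append(
            rows_for_transaction(n_rows, local_site, n_sites, multisite, rng)
        )
    return workload


workload = generate(n_txns=1000, pct_multisite=20, n_rows=10, n_sites=4)
remote = sum(1 for txn in workload if len({site for site, _ in txn}) > 1)
print(f"{remote} of {len(workload)} transactions touch more than one site")
```

Sweeping `pct_multisite` from 0 to 100 is the sensitivity analysis: the same code path runs under every deployment configuration, and only the share of cross-site work changes.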

17 OLTP deployment configurations [PVLDB12] Shared-everything Island shared-nothing Shared-nothing

18 Multisite transactions: read only [PVLDB12] [Figure: throughput (KTps) vs. % multisite transactions for shared-everything, shared-nothing, and island shared-nothing; 4socket x 6cores, 1 rows accessed] More instances -> faster performance degradation

19 Multisite transactions: updates [PVLDB12] [Figure: throughput (KTps) vs. % multisite transactions for shared-everything, shared-nothing, and island shared-nothing; 4socket x 6cores, 1 rows accessed] Update distributed transactions are more expensive

20 Analyzing performance and energy Macrobenchmarks or Microbenchmarks? Execution time breakdowns Measuring energy efficiency

21 My first toy: PII Xeon [VLDB99] [Diagram: fetch/decode unit, dispatch/execute unit, and retire unit around an instruction pool; L1 I-cache and L1 D-cache backed by L2 cache and main memory] Branch prediction, non-blocking caches, out-of-order

22 Where Does Time Go? [VLDB99] Computation. Stalls: cache misses, branch mispredictions, other execution pipeline stalls. Stall time and computation overlap. $\text{Time} = T_{Computation} + T_{Memory} + T_{Branch} + T_{Resource} - T_{Overlap}$
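The breakdown formula on this slide can be worked through numerically. The sketch below assumes each component is available as a cycle count and that the overlap term is obtained by subtraction from the measured total; the class name, method names, and numbers are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class CycleBreakdown:
    """Components of the slide's formula, all in cycles."""
    computation: float
    memory: float
    branch: float
    resource: float

    def component_sum(self) -> float:
        return self.computation + self.memory + self.branch + self.resource

    def total(self, overlap: float) -> float:
        # Time = T_Computation + T_Memory + T_Branch + T_Resource - T_Overlap
        return self.component_sum() - overlap

    def overlap_from_measured(self, measured_total: float) -> float:
        # Rearranged: T_Overlap = sum of components - measured Time.
        return self.component_sum() - measured_total


# Hypothetical cycle counts for illustration only.
b = CycleBreakdown(computation=40.0, memory=35.0, branch=10.0, resource=25.0)
overlap = b.overlap_from_measured(measured_total=100.0)
print(f"overlap = {overlap} cycles, total = {b.total(overlap)} cycles")
```

Because the components sum to more than the measured time whenever stalls overlap with computation, per-component shares computed this way are upper bounds rather than exact fractions.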

23 Setup and Methodology [VLDB99] Range selection (sequential, indexed): select avg(a3) from R where a2 > Lo and a2 < Hi. Equijoin (sequential): select avg(a3) from R, S where R.a2 = S.a1. Four commercial DBMSs: A, B, C, D. 64 PII Xeon/MT running Windows NT 4. Used PII counters. Correctness: measured & computed CPI.
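The two microbenchmark queries can be tried on any SQL engine. A minimal sketch using Python's built-in `sqlite3` module, which was not one of the systems studied; the table contents, the column set, and the `lo`/`hi` bounds are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schemas: R carries the selected and aggregated columns,
# S carries only the join key.
conn.execute("CREATE TABLE R (a1 INTEGER, a2 INTEGER, a3 INTEGER)")
conn.execute("CREATE TABLE S (a1 INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [(i, i % 10, i * 2) for i in range(100)],
)
conn.executemany("INSERT INTO S VALUES (?)", [(i,) for i in range(5)])

# Range selection: the Lo/Hi bounds control the selectivity.
lo, hi = 2, 6
range_avg = conn.execute(
    "SELECT avg(a3) FROM R WHERE a2 > ? AND a2 < ?", (lo, hi)
).fetchone()[0]

# Equijoin between R and S.
join_avg = conn.execute(
    "SELECT avg(a3) FROM R, S WHERE R.a2 = S.a1"
).fetchone()[0]

print(range_avg, join_avg)
conn.close()
```

Queries this simple keep the executed code path narrow, which makes hardware counter readings attributable to a specific operator.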

24 Two very useful breakdowns [VLDB99] processor stalled >5% of time most stalls: L1I and L2D

25 Adapted formula [SIGMOD16] $\text{Time} + T_{Overlap} = T_{Computation} + T_{Memory} + T_{Branch} + T_{Resource}$, where $T_{Branch}$ = # branch mispredictions x penalty, $T_{Memory}$ expands into six terms, and $T_{Resource}$ expands into five terms.

26 Cache misses today [EDBT13] [Figure: misses per k-instructions at L1I, L2I, L1D, L2D, and LLC for TPC-B, TPC-C, and TPC-E; panels: Intel Xeon X566 (32KB L1-I & 32KB L1-D) and Sun Niagara T2 (16KB L1-I & 8KB L1-D)] L1-I misses dominate. TPC-E has lower data miss ratio.
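Misses per k-instructions normalizes raw miss counters by the amount of work done, so workloads with different instruction counts can be compared. A minimal sketch; the counter values are hypothetical and do not reproduce the measurements on this slide.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return misses * 1000 / instructions


# Hypothetical counter readings for one run.
instructions = 1_000_000
miss_counters = {
    "L1I": 45_000,
    "L1D": 20_000,
    "L2I": 9_000,
    "L2D": 6_000,
    "LLC": 1_500,
}

per_level = {level: mpki(m, instructions) for level, m in miss_counters.items()}
dominant = max(per_level, key=per_level.get)

for level, value in per_level.items():
    print(f"{level}: {value:.1f} misses per k-instructions")
print(f"dominant level: {dominant}")
```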

27 More scans -> increased page reuse. TPC-E's lower miss ratio [Figure: average per transaction, #records accessed and #pages accessed (heap, index), for TPC-B, TPC-C, and TPC-E]

28 Breaking down clock cycles [SIGMOD16] [Figure: stall cycles breakdown in 1 instructions, by L1I, L2I, LLC I, L1D, L2D, and LLC D, for Shore-MT, DBMS D, VoltDB, HyPer, and DBMS M; TPC-C, 1GB, 1 worker thread, Intel Ivy Bridge] L1I or LLC D stalls dominate

29 Where do L1I stalls come from? [Figure: stall cycles breakdown in 1 instructions (L1I, L2I) vs. number of rows read, for DBMS D and DBMS M; read N rows, 1GB, 1 worker thread, Intel Ivy Bridge] Code outside storage mgr -> high L1I misses

30 Analyzing performance and energy Macrobenchmarks or Microbenchmarks? Execution time breakdowns Measuring energy efficiency

31 ARM server-grade processor [DAMON16] [Diagram: ARM shown alongside lean and fat cores; 4-way fetch, decode, rename, dispatch; issue to branch, integer (two units), store, FP/SIMD (two units), and load] OLTP on ARM: performance & power?

32 Xeon vs. ARM [DAMON16]
Processor        Intel Xeon          ARM Cortex-A57
# Sockets        2 (one is active)   1
# Cores/socket   8                   8
Issue width      4                   4
Clock speed      2.GHz               2.GHz
L1I / L1D        32KB / 32KB         32KB / 32KB
L2               256KB               256KB
L3 (shared)      2MB                 8MB
RAM              256GB               16GB

33 ARM is a promising alternative [DAMON16] [Figure: normalized throughput, normalized power, and normalized energy efficiency for ARM and Xeon; Silo, TPC-C, 5GB, single thread; annotated factors: 15x and 6x]
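The three normalized quantities on this slide are related by simple arithmetic. The sketch below assumes energy efficiency is throughput divided by power (transactions per joule) and that each metric is normalized to the lower of the two systems; both are assumptions about the plot, and every number is hypothetical.

```python
def energy_efficiency(throughput_tps: float, power_watts: float) -> float:
    """Transactions per joule: (transactions/second) / (joules/second)."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return throughput_tps / power_watts


def normalized(values: dict) -> dict:
    """Scale each value so that the smallest one becomes 1.0."""
    baseline = min(values.values())
    return {name: value / baseline for name, value in values.items()}


# Hypothetical measurements for two systems, A and B.
throughput = {"A": 30_000.0, "B": 10_000.0}   # transactions per second
power = {"A": 90.0, "B": 10.0}                # watts

efficiency = {
    name: energy_efficiency(throughput[name], power[name]) for name in throughput
}

print("normalized throughput:", normalized(throughput))
print("normalized power:", normalized(power))
print("normalized energy efficiency:", normalized(efficiency))
```

In this made-up case system A is 3x faster but draws 9x the power, so system B ends up 3x more energy-efficient; the same arithmetic underlies any throughput-versus-power comparison.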

34 ARM achieves energy proportionality [DAMON16] [Figure: normalized throughput, normalized power, and normalized energy efficiency vs. number of threads for ARM and Xeon; Silo, TPC-C, 5GB]

35 ARM is less suitable for low latency [DAMON16] [Figure: normalized latency of ARM and Xeon at the 5th, 99th, and 99.9th percentiles for Shore-MT and Silo; TPC-C, 5GB, all cores active; annotated factors: 2x, 3x, 5x, and 11x]
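Latency comparisons at several percentiles can be computed from raw per-transaction samples. A minimal sketch using the nearest-rank percentile definition, which is an assumption about how the plotted percentiles were obtained; the sample data and system names are synthetic.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (p in (0, 100])."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round(p * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]


def normalized_latency(system, baseline, p):
    """Latency of `system` at percentile p, relative to `baseline`."""
    return percentile(system, p) / percentile(baseline, p)


# Synthetic latencies in microseconds: system B has a heavier tail than A.
system_a = list(range(1, 1001))
system_b = [2 * x if x <= 900 else 10 * x for x in range(1, 1001)]

for p in (50, 99):
    print(f"p{p}: B is {normalized_latency(system_b, system_a, p):.1f}x A")
```

Reporting the ratio per percentile rather than on the mean shows whether a gap between two systems widens in the tail.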

36 Lessons learned: Macrobenchmarks show the big picture. Microbenchmarks reveal details. Breakdowns correlate numbers. Sensitivity analysis highlights trends. The right methodology is essential for understanding behavior.

37 References
[DAMON16] U. Sirin, R. Appuswamy, and A. Ailamaki: OLTP on a server-grade ARM.
[EDBT13] P. Tözün, I. Pandis, C. Kaynak, D. Jevdjic, and A. Ailamaki: From A to E: Analyzing TPC's OLTP Benchmarks - The obsolete, the ubiquitous, the unexplored.
[PVLDB12] D. Porobic, I. Pandis, M. Branco, P. Tözün, and A. Ailamaki: OLTP on Hardware Islands.
[SIGMOD16] U. Sirin, P. Tözün, D. Porobic, and A. Ailamaki: Micro-architectural Analysis of In-memory OLTP.
[VLDB99] A. Ailamaki, D. DeWitt, M. Hill, and D. Wood: DBMSs On A Modern Processor: Where Does Time Go?
