StressRight: Finding the Right Stress for Accurate In-development System Evaluation

1 StressRight: Finding the Right Stress for Accurate In-development System Evaluation
Jaewon Lee (1), Hanhwi Jang (1), Jae-eon Jo (1), Gyu-Hyeon Lee (2), Jangwoo Kim (2)
High Performance Computing Lab
(1) Pohang University of Science and Technology (POSTECH), (2) Seoul National University

2 Configuring Workloads
- Modern workloads are configurable
- No definite answer: depends on the usage scenario

3 Evaluating a System
- Workloads run on the system and produce a performance report (e.g., latency, throughput)
- Reconfigure workloads & system

4 Evaluating an In-development System
- Workloads run on a system simulator / emulator: no performance report
- Investigate microarchitecture details
- System modeling tools: too slow or too inaccurate

5 Workload Configuration Matters
- Configuration changes system behavior: the system executes different code patterns
- Different analysis results & system design insights
- Must configure to represent actual usage scenarios

6 Index
- Introduction / Motivation
- Limitations
- Proposed idea: StressRight
- Evaluation
- Conclusion

7 Limitations (of the Existing Methods)
- Inaccurate insights about the configurations
- Short simulation: No high-level metrics
- DBT-based simulation: No kernel considerations
- Emulator: No timing considerations
[Figure: latency (normalized) vs. memcached query throughput (normalized)]

8 Index
- Introduction / Motivation
- Limitations
- Proposed idea: StressRight
  - Goals & Key ideas
  - Method details
- Conclusion

9 StressRight: Goals
Goals
- Quickly derive workload-reported performance metrics (e.g., latency, throughput)
  - To explore workload configurations for in-development systems
  - To evaluate the systems with the right stress behaviors
Requirements
- Long workload execution
  - Must observe high-level workload-reported performance metrics
- Efficient performance model
  - To quickly derive the performance metrics

10 StressRight: Key Ideas
- Long workload execution
  - Use timing-agnostic platforms (e.g., emulators)
  - Extract user & kernel behavior, analyze performance later
- Efficient performance model
  - Leverage redundancy in workloads
  - Analyze only the unique behaviors (i.e., code blocks)
  - Overall behavior = Analyzed unique behaviors
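The "analyze only the unique code blocks" idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: `reconstruct_cycles`, the `analyze_block` callback, and the per-block cycle costs are all hypothetical names and values. Only the block sequence A B A A C A B is taken from the overview slide.

```python
from collections import Counter

def reconstruct_cycles(block_trace, analyze_block):
    """Estimate total cycles while analyzing each unique code block only once.

    block_trace   : sequence of code-block IDs recorded by the emulator
    analyze_block : callable returning the estimated cycles of one block
                    (hypothetical stand-in for a detailed performance model)
    """
    counts = Counter(block_trace)  # how often each unique block executed
    # The expensive analysis runs once per unique block, not once per execution.
    cycles_per_block = {blk: analyze_block(blk) for blk in counts}
    total = sum(cycles_per_block[blk] * n for blk, n in counts.items())
    return total, len(counts)

# Hypothetical per-block costs; the trace order follows the overview slide.
trace = ["A", "B", "A", "A", "C", "A", "B"]
cost = {"A": 10, "B": 25, "C": 40}
total_cycles, analyzed = reconstruct_cycles(trace, cost.__getitem__)
print(total_cycles, analyzed)
```

Seven block executions require only three analyses here. The saving grows with the redundancy of the workload.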

11-15 StressRight: Overview
- Emulation (no timing, 1-IPC): code blocks A, B, C execute on Core 0 and Core 1 (A B A A C A B); the emulator reports 100 Ops/sec (inaccurate)
- Functional simulation: the memory / branch trace drives cache and branch models, giving hit rates over time (high, medium, and low $ hit phases)
- Timing reconstruction: the unique code blocks (A, B, C) are timed under those hit rates
- Reschedule & reinterpret: the blocks are placed again on Core 0 and Core 1 (C A A B B A A), giving 120 Ops/sec (accurate)
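The functional-simulation step derives hit rates from a memory trace without modeling time. Here is a minimal sketch of such a cache model, assuming a set-associative cache with LRU replacement. The class name, geometry parameters, and address trace are all hypothetical, and the evaluation slide states that the authors' functional simulators are written in C++.

```python
from collections import OrderedDict

class FunctionalCache:
    """Timing-free set-associative cache with LRU replacement (illustrative)."""

    def __init__(self, num_sets=64, ways=4, line_bytes=64):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        # One OrderedDict per set; insertion order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.accesses = 0

    def access(self, addr):
        """Record one access and return True on a hit, False on a miss."""
        line = addr // self.line_bytes
        lines = self.sets[line % self.num_sets]
        self.accesses += 1
        if line in lines:
            lines.move_to_end(line)      # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)    # evict the least recently used line
        lines[line] = True
        return False

    def hit_rate(self):
        return self.hits / self.accesses if self.accesses else 0.0

# Tiny hypothetical trace on a 2-set, direct-mapped cache.
cache = FunctionalCache(num_sets=2, ways=1, line_bytes=64)
results = [cache.access(a) for a in [0, 8, 64, 0, 128, 0]]
print(results, cache.hit_rate())
```

Feeding the trace in windows and recording `hit_rate()` per window would give the hit-rate-over-time view sketched on the slide.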

16-20 StressRight: Timing Reconstruction
- Challenge: Code blocks are too short
  - Pipeline drain effect is nontrivial
  - At the end of a block the IQ and ROB are empty and the issue rate drops (not true for longer traces)
- Solution: Consider a hypothetical next block
  - Assume: next block issue rate ∝ current block issue rate
  - Use power law* to further adjust the rate (larger window, issue more)
  - Example: current block avg. issue = 2.0 IPC, so the next block issues proportional to 2.0 IPC
*S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, "A mechanistic performance model for superscalar out-of-order processors," ACM TOCS, 2009
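One way to read the "larger window, issue more" adjustment is as a power-law scaling of the issue rate with the number of instructions in the window. The sketch below is an assumption-laden illustration of that reading, not the authors' model: the function name, the exponent `beta`, the issue `width`, and the window sizes are all hypothetical.

```python
def adjusted_issue_rate(current_ipc, current_window, full_window,
                        beta=0.5, width=4.0):
    """Scale a block's average issue rate as if the window were kept full.

    current_ipc    : average issue rate measured for the current block
    current_window : instructions available in the window for that block
    full_window    : window occupancy once a hypothetical next block fills it
    beta           : power-law exponent (hypothetical value)
    width          : maximum issue width of the core (hypothetical value)
    """
    if current_window <= 0 or full_window < current_window:
        raise ValueError("expected 0 < current_window <= full_window")
    # Power law: issue rate grows with (window size) ** beta.
    scaled = current_ipc * (full_window / current_window) ** beta
    # The core can never issue more than its width per cycle.
    return min(width, scaled)

# A block averaging 2.0 IPC with a quarter-full window (hypothetical sizes).
print(adjusted_issue_rate(2.0, current_window=32, full_window=128))
```

With these illustrative numbers the scaled rate reaches the issue width. A smaller `beta` or a fuller starting window gives a smaller correction.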

21-24 StressRight: Multiple Performances
- Challenge: Difficult to model every scenario
  - The same code block (five memory accesses) needs a separate analysis per scenario: 90% $ hit (IPC A), 50% $ hit (IPC B), 30% $ hit (IPC C)
- Solution: Mix template scenarios
  - Random-generate scenarios & mix them
  - Few templates are enough
  - 60% hit template: Hit Miss Hit Hit Miss
  - 40% hit template: Hit Miss Miss Hit Miss
  - 50% hit: mix of the two templates (IPC values shown on the slide: 2.0 IPC, 1.6 IPC)
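Template mixing can be sketched as interpolation between a few pre-analyzed scenarios. The slides do not say how the templates are combined, so the linear interpolation of cycle counts below is an assumption. `mix_templates` and the template numbers are hypothetical and do not reproduce the slide's IPC values.

```python
def mix_templates(templates, target_hit_rate):
    """Estimate a block's cycles at target_hit_rate from analyzed templates.

    templates : list of (hit_rate, cycles) pairs, one per analyzed template
    Interpolates cycles linearly between the two nearest templates and
    clamps to the nearest template outside the covered range.
    """
    points = sorted(templates)
    if not points:
        raise ValueError("need at least one template")
    if target_hit_rate <= points[0][0]:
        return points[0][1]
    if target_hit_rate >= points[-1][0]:
        return points[-1][1]
    for (h0, c0), (h1, c1) in zip(points, points[1:]):
        if h0 <= target_hit_rate <= h1:
            if h1 == h0:                 # duplicate templates: nothing to mix
                return c0
            weight = (target_hit_rate - h0) / (h1 - h0)
            return c0 + weight * (c1 - c0)

# Hypothetical 100-instruction block analyzed under two templates.
instructions = 100
templates = [(0.4, 100.0), (0.6, 50.0)]   # (hit rate, cycles)
cycles = mix_templates(templates, 0.5)
print(cycles, instructions / cycles)
```

Mixing cycle counts rather than IPC values keeps the estimate additive over blocks. Other mixing rules are possible.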

25 StressRight: Rescheduling
- Basic scheduling method: Schedule to the earliest possible slot
- Three rules
  - Rule 1: Blocks from a thread execute serially
  - Rule 2: Critical sections shouldn't overlap
  - Rule 3: Threads should wait for barriers

26 StressRight: Rescheduling
Rule 1: Blocks from a thread execute serially
- Tag code blocks with the executor thread ID
- Prohibit blocks from a thread from running concurrently
[Figure: two Thread 1 blocks overlap on Core 0 and Core 1; after the fix, the Core 1 block waits (idle slot) for the Core 0 block]

27 StressRight: Rescheduling
Rule 2: Critical sections shouldn't overlap
- Tag code blocks with synchronization variable ID (if applicable)
- Prohibit the critical sections from overlapping
[Figure: critical section A of Thread 1 (Core 0) and Thread 2 (Core 1) overlap; after the fix, Core 1 idles until Thread 1 leaves A]

28 StressRight: Rescheduling
Rule 3: Threads should wait for barriers
- Tag code blocks related to barrier operations (if applicable)
- Prohibit the scheduling before the last barrier_wait()
[Figure: Thread 1 (Core 0) reaches barrier_wait() before Thread 2 (Core 1); after the fix, Thread 1 idles until the last barrier_wait()]
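The earliest-slot scheduling with the three rules can be sketched as a single pass over the tagged blocks. This is a simplified illustration under stated assumptions, not the authors' scheduler: the `Block` fields and `reschedule` are hypothetical names, blocks are processed in emulation trace order, and idle gaps are never backfilled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    thread: int                     # executor thread ID (Rule 1)
    cycles: float                   # reconstructed duration of the block
    lock: Optional[str] = None      # synchronization variable ID (Rule 2)
    barrier: Optional[str] = None   # set if the block ends in barrier_wait() (Rule 3)

def reschedule(blocks, num_cores):
    """Place blocks, in trace order, at the earliest slot the rules allow."""
    core_free = [0.0] * num_cores   # time at which each core becomes idle
    thread_ready = {}               # end time of each thread's latest block
    lock_free = {}                  # end time of the latest critical section per lock
    barrier_release = {}            # latest arrival time seen at each barrier
    waiting_on = {}                 # barrier each thread must still pass
    schedule = []
    for blk in blocks:
        earliest = thread_ready.get(blk.thread, 0.0)                  # Rule 1
        if blk.lock is not None:
            earliest = max(earliest, lock_free.get(blk.lock, 0.0))    # Rule 2
        pending = waiting_on.pop(blk.thread, None)
        if pending is not None:
            earliest = max(earliest, barrier_release[pending])        # Rule 3
        # Choose the core that lets this block start soonest.
        core = min(range(num_cores), key=lambda c: max(core_free[c], earliest))
        start = max(core_free[core], earliest)
        end = start + blk.cycles
        core_free[core] = end
        thread_ready[blk.thread] = end
        if blk.lock is not None:
            lock_free[blk.lock] = end
        if blk.barrier is not None:
            barrier_release[blk.barrier] = max(barrier_release.get(blk.barrier, 0.0), end)
            waiting_on[blk.thread] = blk.barrier
        schedule.append((core, start, end))
    return schedule, max(core_free)

# Hypothetical two-thread trace: a shared critical section, then a barrier.
blocks = [
    Block(thread=1, cycles=10, lock="A"),
    Block(thread=2, cycles=10, lock="A"),
    Block(thread=1, cycles=5, barrier="b0"),
    Block(thread=2, cycles=5, barrier="b0"),
    Block(thread=1, cycles=4),
]
schedule, makespan = reschedule(blocks, num_cores=2)
print(schedule, makespan)
```

The sketch relies on the trace order: a thread's post-barrier block appears only after every thread's barrier_wait() block, so the release time is final when it is read.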

29 Index
- Introduction / Motivation
- Limitations
- Proposed idea: StressRight
- Evaluation
- Conclusion

30 Evaluation
- Quantitative analysis
  - Why StressRight would work well
- Accuracy and speed
  - Comparison with cycle-level simulation (MARSSx86)
  - Model 1 / 12 / 16 OoO x86 cores
  - SPEC, PARSEC, memcached
- Implementation
  - Emulation: QEMU
  - Reconstruction models: Python
  - Functional simulators: C++

31 Quantitative Analysis
Efficiency of the method
- # instructions: full-execution vs. unique code blocks
- Orders of magnitude reduction in the analysis load
*mcd: memcached, BS: blackscholes, BT: bodytrack, SW: swaptions, DD: dedup

32-33 Quantitative Analysis
Accuracy of the dynamic resource models
- Functional simulations are accurate enough
[Figure: functional vs. cycle-level memory simulation]

34 Accuracy: SPEC
- Validating the pipeline model
- Correctly estimates the first-order performance
- Improvement in progress: Better memory model

35 Accuracy: PARSEC
- Validating the scheduler
- Correctly estimates the scaling behavior
- Improvement in progress: Barrier synchronizations
*We model a 12-core system

36 Accuracy: memcached
- Reconstructing throughput-latency curve
- StressRight greatly improves over the existing methods
*We model a 16-core system; 8 cores host the server and 8 cores run the load generator

37 Speed evaluation
- Order of magnitude faster vs. simulator
- Main bottleneck is cache simulation
*Reconstruction uses 40 vcpus

38 Conclusion
- Motivation
  - Stress in-development systems with actual usage scenarios to obtain correct insights
- Key ideas
  - Focus only on unique behavior
  - Consider execution dynamics: $, branch, and scheduling
- Results
  - Accurately reconstruct workload-reported performance metrics, an order of magnitude faster than the simulator

39 Thank you!
Jaewon Lee (spiegel0@postech.ac.…)
Pohang University of Science and Technology (POSTECH)
