Improving Cloud Application Performance with Simulation-Guided CPU State Management

1 Improving Cloud Application Performance with Simulation-Guided CPU State Management
Mathias Gottschlag, Frank Bellosa
April 23, 2017
KARLSRUHE INSTITUTE OF TECHNOLOGY (KIT) - OPERATING SYSTEMS GROUP
KIT The Research University in the Helmholtz Association

2 Motivation
Scale-out/server workloads:
- Complex workloads, large working set
- Accesses miss L2 and hit slow L3
- Result: Half of all cycles spent waiting for memory (Kanev et al.: Profiling a Warehouse-Scale Computer, ISCA '15)
Need prefetching to bring data closer to the core
- Only very simple hardware prefetchers available
[Figure: cache hierarchy with core, L1i, L1d, L2, and L3; annotations: "many cache misses (~10 cycles)" and "high latency (~40 cycles)"]
Need for efficient software prefetching
M. Gottschlag, F. Bellosa Improving Cloud Application Performance with Simulation-Guided CPU State Management 2/8

3 Prefetching
Existing software solutions (e.g., Helper Threads, Call Graph Prefetching):
1. Developer profiles the application
2. Compiler inserts prefetching code
3. Application is deployed
Problems:
- Addresses not known at compile time (address calculation overhead)
- Not known whether prefetching is beneficial
- Of limited use for the OS itself (workload not known in advance)

4 Prefetching
Existing software solutions (e.g., Speculative Precomputation, ISCA '01; Call Graph Prefetching, HPCA '01):
1. Developer profiles the application
2. Compiler inserts prefetching code
3. Application is deployed
Problem: Workload not known in advance
- Large working set: Many misses in the OS
- Small working set: No misses in the OS
[Figure: OS and app shown for each of the two cases]
Need to know whether accesses miss the cache

5 Approach
Problem: Cache misses depend on OS, workload, and microarchitecture
[Figure: app + OS -> simulator/profiler -> call graph, memory access traces -> cache/pipeline simulator -> call graph, expensive cache misses -> app + OS (+prefetching when idle)]
1. Temporary execution on simulator to record memory access patterns
2. Cache simulator to identify data worth prefetching
3. Use resulting address list for prefetching
Challenge: Accurate models of existing hardware
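The three steps above can be illustrated with a minimal sketch. It is not the authors' simulator: the set-associative LRU model, the 64-byte line size, and the names `LRUCache` and `prefetch_candidates` are assumptions made purely for illustration. The sketch replays a recorded address trace through a small cache model and returns the cache lines that missed most often as the address list for prefetching.

```python
from collections import Counter, OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed, not taken from the slides)


class LRUCache:
    """Set-associative cache with LRU replacement (illustrative model only)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit and False on a miss; update the LRU state."""
        line = addr // LINE_SIZE
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)  # mark as most recently used
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # evict the least recently used line
        cache_set[line] = None
        return False


def prefetch_candidates(trace, cache, top_n):
    """Replay the trace and return the line addresses with the most misses."""
    misses = Counter()
    for addr in trace:
        if not cache.access(addr):
            misses[(addr // LINE_SIZE) * LINE_SIZE] += 1
    return [line_addr for line_addr, _ in misses.most_common(top_n)]


# Hypothetical trace: addresses 0 and 128 conflict in the same set of a
# tiny direct-mapped cache, so they miss repeatedly.
trace = [0, 128, 0, 128, 0, 64, 64]
address_list = prefetch_candidates(trace, LRUCache(num_sets=2, ways=1), top_n=2)
print(address_list)
```

This sketch counts cold misses as well as conflict misses and ignores pipeline effects and hardware prefetchers, which the slide names as the main modelling challenge.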

6 Prefetching at Idle Time
Server systems: Usually not 100% utilized
- Spare capacity for peak load
- Latency depends on utilization
[Figure: latency plotted against utilization, with the target latency marked]
Why not make use of the resulting idle time?

7 Prefetching at Idle Time
Idea: Use idle time to prefetch next process
Next process can often be predicted:
- One latency-critical service per core
- Predictable network architectures (M. Gottschlag, F. Bellosa, Reducing Response Time with Preheated Caches, ROME '16)
[Figure: timeline alternating between idle and active phases, with prefetching while the system is idle]
Experiments: Good results when prefetching from DRAM into L2 cache
Challenge: Interaction with aggressive hardware prefetchers
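The idea of preheating the cache during idle phases can be illustrated with a toy model. Everything in it is an assumption for illustration: a fully associative LRU cache of 8 lines, a service that touches 6 lines per request, and background activity that touches 8 other lines between requests. The sketch counts only the misses that occur during the active phase, since misses taken while idle do not add to the response time.

```python
from collections import OrderedDict


class LRU:
    """Fully associative LRU cache holding `capacity` lines (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, line):
        """Access a line; return True on a hit and False on a miss."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = None
        return False


def simulate(preheat, rounds=3):
    """Return the number of misses seen during active phases."""
    cache = LRU(capacity=8)
    service = list(range(0, 6))         # lines of the latency-critical service
    background = list(range(100, 108))  # lines touched between requests
    active_misses = 0
    for _ in range(rounds):
        for line in background:         # other work pollutes the cache
            cache.touch(line)
        if preheat:
            for line in service:        # idle phase: prefetch the predicted process
                cache.touch(line)
        for line in service:            # active phase: handle the request
            if not cache.touch(line):
                active_misses += 1
    return active_misses


print(simulate(preheat=False), simulate(preheat=True))
```

In this model the background lines evict the service's working set before every request, so without preheating all 6 lines miss in each of the 3 rounds, while with preheating the active phase sees no misses. Real hardware prefetchers are not modelled, and the slide names their interaction with software prefetching as an open challenge.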

8 Summary
Problem: Compiler-based approaches cannot fully utilize runtime information
- Result: Missed opportunity for optimization
Approach:
- Runtime simulation-based profiling: Collect precise cache miss information
- Prefetching while the system is idle: Hide prefetching cost
Challenges:
- Modelling existing caches and prefetchers
- Interaction between software and hardware prefetchers

9 References
- Ferdman, M. et al.: Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware
- Kanev, S. et al.: Profiling a Warehouse-Scale Computer
- Lee, J., Kim, H., Vuduc, R.: When Prefetching Works, When It Doesn't, and Why
- Collins, J. D. et al.: Speculative Precomputation: Long-Range Prefetching of Delinquent Loads
- Annavaram, M., Patel, J. M., Davidson, E. S.: Call Graph Prefetching for Database Applications
- Gottschlag, M., Bellosa, F.: Reducing Response Time with Preheated Caches

10 Scope
Target:
- Long-running
- Latency-critical
- Networked
Out of scope: Runtime ASLR
Need to figure out interaction with garbage collectors:
- Copying collectors problematic
- Most long-lived objects expected to be at fixed location

11 Evaluation
Baseline: Existing software prefetching
- Software prefetching benefits from runtime information?
- Augment existing approaches with prefetching during idle time?
Targeted workloads: Server and scale-out applications
- CloudSuite, TPC-C, ...

12 Current State
Network packet announcements based on Fastpass
- Announcements 50 µs in advance
- Next process can be predicted
Early prototype: Runtime profiling with Intel PEBS
- Prefetching does not yet yield expected results
Problems identified:
- Conflicts with hardware prefetchers
- PEBS traces incomplete
Next step: Determine benefit in controlled environment
- Simulator
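To get a feeling for what a 50 µs announcement lead allows, the following back-of-the-envelope sketch estimates an upper bound on the number of cache lines that could be prefetched in that window. Only the 50 µs figure comes from the slide. The 100 ns miss latency, the 10 overlapping misses, and the 64-byte line size are hypothetical placeholder values, not measurements, and `prefetch_budget` is a name introduced here.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)


def prefetch_budget(lead_ns, miss_latency_ns, parallel_misses):
    """Upper bound on cache lines fetched within the announcement lead time.

    Assumes misses are issued in batches of `parallel_misses` and that each
    batch takes one full miss latency to complete.
    """
    batches = lead_ns // miss_latency_ns
    return batches * parallel_misses


# 50 µs lead time from the slide; the other two values are placeholders.
lines = prefetch_budget(lead_ns=50_000, miss_latency_ns=100, parallel_misses=10)
print(lines, "lines,", lines * LINE_SIZE, "bytes")
```

With these placeholder values the bound is 5000 lines, or 320000 bytes. The real budget depends on the memory system and on how much of the lead time is actually idle.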

13 Predictable Network Architectures
Problem: Predict next incoming network request
Basis: Deterministic low-latency networks with central arbiter (example: Fastpass, SIGCOMM '14)
- Central arbiter has global view of the network
- Arbiter can announce future incoming network packets (M. Gottschlag, F. Bellosa, Reducing Response Time with Preheated Caches, ROME '16)

Reducing Response Time with Preheated Caches
Mathias Gottschlag and Frank Bellosa
Karlsruhe Institute of Technology
gottschlag@ira.uka.de, bellosa@kit.edu
Abstract. CPU performance is increasingly limited