Adaptive SMT Control for More Responsive Web Applications

Size: px

Start display at page:

Download "Adaptive SMT Control for More Responsive Web Applications"

Sylvia Garrett
6 years ago
Views:

1 Adaptive SMT Control for More Responsive Web Applications Hiroshi Inoue and Toshio Nakatani IBM Research Tokyo University of Tokyo Oct 27, 2014 Raleigh, NC, USA

2 Response time matters! Peak throughput has been the common metric for the Web server performance Even sub-second improvements in response times are essential for better user experiences Amazon: +100 msec 1% drop in sales Yahoo: +400 msec 5-9% drop in traffic Google: +500 msec 20% drop in searches We focus on improving the response time of Web application servers Nicole Sullivan. Design Fast Websites. Oct 14,

3 Key Question: How SMT affects response time? SMT (Simultaneous Multi Threading, a.k.a. Hyper Threading) allows multiple hardware threads to run on one core SMT typically improves aggregated throughput degrades single-thread performance Question: How SMT affects response times of Web application server? 3

4 Outline 1. How SMT affects response time 2. Adaptive SMT control with queuing model 4

5 Evaluations Processors: Xeon (SandyBridge-EP): 2-way SMT, 2.9 GHz, 16 cores POWER7: 4-way SMT, 3.55 GHz, 16 cores Workloads: PHP (MediaWiki) Ruby (Ruby-on-rails) Java (Cognos BI) OS: Redhat Enterprise Linux 6.4 (Kernel ) 5

response time [msec] IBM Research - Tokyo Response

SMT1 was better SMT1 (disabled SMT) SMT2 was better

time when CPUs were not fully loaded 300 SMT

200 lower CPU utilization higher CPU utilization 0 2

6 response time [msec] IBM Research - Tokyo Response time of the PHP application on 16 cores of Xeon 500 SMT1 was better SMT1 (disabled SMT) SMT2 was better 450 SMT2 (2-way SMT) SMT degraded response time when CPUs were not fully loaded 300 SMT improved response time at high CPU utilization lower CPU utilization higher CPU utilization number of emulated clients per core 6

were not fully loaded 400 350 300 250 SMT1 (disabled SMT) SMT2 (2-way SMT) SMT4 (4-way SMT) lower CPU

7 response time [msec] IBM Research - Tokyo Response time of the PHP application on 16 cores of POWER7 700 SMT1 was best SMT2 was best SMT4 was best SMT degraded response time when CPUs were not fully loaded SMT1 (disabled SMT) SMT2 (2-way SMT) SMT4 (4-way SMT) lower CPU utilization higher CPU utilization number of emulated clients per core 7

lower CPU utilization SMT improved response time even when CPUs were not fully loaded

8 response time [msec] IBM Research - Tokyo Response time of the PHP application on 1 core of Xeon SMT1 (disabled SMT) SMT2 (2-way SMT) SMT2 was better lower CPU utilization SMT improved response time even when CPUs were not fully loaded higher CPU utilization number of emulated clients per core 8

9 How SMT affects response time? Low CPU utilization High CPU utilization on 1 core improve improve on multiple cores degrade improve SMT hurts the response time on multicore systems with low CPU utilization level, which is the common case in today s server The crossover point depends on the number of cores 9

lower means faster response response time [msec] response time [msec] IBM Research - Tokyo on 16 cores of Xeon on 1 core of

the histogram towards slower response times 480 460 440 420 400 380 360 340 320 300 280 260 SMT2 (2-way SMT) SMT1 (disabled

(disabled SMT) many long-latency transactions on 1 core SMT reduced the long-latency transactions SMT reduced long-latency

10 lower means faster response response time [msec] response time [msec] IBM Research - Tokyo on 16 cores of Xeon on 1 core of Xeon Histogram of response time at low (~25%) CPU utilization SMT degraded single-thread performance and shifted the peak of the histogram towards slower response times SMT2 (2-way SMT) SMT1 (disabled SMT) almost no longlatency transactions on 16 cores SMT2 (2-way SMT) SMT1 (disabled SMT) many long-latency transactions on 1 core SMT reduced the long-latency transactions SMT reduced long-latency transactions on 1 core histogram histogram 10

11 Breaking down response time response time T r = service time T s SMT typically + waiting time T w increases service time (CPU time) by lowering single-thread performance reduces waiting time (in task scheduling queue) by providing more hardware threads SMT degrades the response time on multicore systems with low CPU utilization level because waiting time is not significant in such case For other cases (single core or high utilization) waiting time affect the total response time 11

12 Outline 1. How SMT affects response time 2. Adaptive SMT control with queuing model 12

13 Adaptive SMT Control We periodically (once per 5 sec) obtain the CPU utilization from /proc/stat, calculate the response time for each SMT level using a new queuing model, and select the best SMT level Implemented as a user-space daemon without modification in OS kernel 13

14 Challenges in queuing model for SMT processors How to model single-thread performance on SMT processor affected by resource contention among the SMT threads How to model task migration behavior of the OS task scheduler aggressively balances the load among the SMT threads within one core while minimizing migrations among different cores 14

15 Hierarchical queuing model 1.In-core modeling: model the single SMT core To calculate service time (i.e. single-thread performance) and waiting time without considering task migration 2.Out-of-core modeling: model the task migration among cores To modify the waiting time considering the task migration Both phases are based on the standard M/M/s model Model takes CPU utilization as input w/o task characteristics See the paper for the model details 15

16 normalized average response time response time [msec] IBM Research - Tokyo Response time predicted by our model on 16-cores of Xeon SMT1 was better SMT1 (disabled SMT) SMT2 (2-way SMT) SMT2 was better predicted response time of MediaWiki (PHP) 400 SMT1 was better SMT2 was better lower CPU utilization higher CPU utilization number of emulated clients per core measured response time of MediaWiki (PHP) response time (SMT1) response time (SMT2) 0.0 lower CPU utilization higher CPU utilization normalized CPU utilization (1.0 means fully utilized without SMT) 16

17 normalized average response time response time [msec] IBM Research - Tokyo Response time predicted by our model on 16-cores of POWER SMT1 was best SMT2 was best SMT4 was best SMT1 (disabled SMT) SMT2 (2-way SMT) SMT4 (4-way SMT) lower CPU utilization higher CPU utilization number of emulated clients per core SMT1 was best response time (SMT1) response time (SMT2) response time (SMT4) predicted response time of MediaWiki (PHP) SMT2 was best SMT4 was best measured response time of MediaWiki (PHP) lower CPU utilization higher CPU utilization normalized CPU utilization (1.0 means fully utilized without SMT) 17

18 normalized average response time response time [msec] IBM Research - Tokyo Response time predicted by our model on 1-core of Xeon SMT1 (disabled SMT) SMT2 (2-way SMT) SMT2 was better predicted response time of MediaWiki (PHP) 400 SMT2 was best lower CPU utilization higher CPU utilization number of emulated clients per core measured response time of MediaWiki (PHP) response time (SMT1) response time (SMT2) 0.0 lower CPU utilization higher CPU utilization normalized CPU utilization (1.0 means fully utilized without SMT) 18

450 SMT1 (disabled SMT) SMT2 (2-way SMT) with Adaptive

300 time compared to the default (SMT2) 250 SMT1 is

improve response time SMT2 is automatically selected

throughput 200 lower CPU utilization higher CPU

19 response time [msec] IBM Research - Tokyo Response time with adaptive SMT control on 16 cores of Xeon SMT1 (disabled SMT) SMT2 (2-way SMT) with Adaptive SMT Control about 10% reduction in response 300 time compared to the default (SMT2) 250 SMT1 is automatically selected with low CPU utilization to improve response time SMT2 is automatically selected with high CPU utilization to achieve higher peak throughput 200 lower CPU utilization higher CPU utilization number of emulated clients per core 19

600 550 500 450 SMT1 (disabled SMT) SMT2 (2-way SMT)

SMT4 is selected 20 400 350 SMT2 is 300 selected about

utilization 250response time compared to 0 2 4 6 8 10 12

20 response time [msec] IBM Research - Tokyo Response time with adaptive SMT control on 16 cores of POWER SMT1 (disabled SMT) SMT2 (2-way SMT) SMT4 (4-way SMT) with Adaptive Control SMT1 is selected SMT4 is selected SMT2 is 300 selected about lower 10% CPU reduction utilization in higher CPU utilization 250response time compared to the default (SMT4) number of emulated clients per core

response time [msec] IBM Research - Tokyo Response time with adaptive SMT control on 1 core of Xeon 500 SMT1 (disabled SMT) 450 SMT2 (2-way SMT) 400 with Adaptive Control 350 300

21 response time [msec] IBM Research - Tokyo Response time with adaptive SMT control on 1 core of Xeon 500 SMT1 (disabled SMT) 450 SMT2 (2-way SMT) 400 with Adaptive Control SMT2 is automatically selected regardless of the CPU utilization lower CPU utilization higher CPU utilization number of emulated clients per core 21

Summary We showed that SMT may degrade the response time on multicore processors with low CPU utilization We developed a new queuing model to predict the response time on multicore SMT processors Our

22 Summary We showed that SMT may degrade the response time on multicore processors with low CPU utilization We developed a new queuing model to predict the response time on multicore SMT processors Our adaptive SMT control based on the new model automatically selected the best SMT level at runtime See the paper for more detail evaluation with Ruby and Java workloads results on moderate number of cores detail of the queuing model 22

Adaptive Multi-Level Compilation in a Trace-based Java JIT Compiler

Adaptive Multi-Level Compilation in a Trace-based Java JIT Compiler Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu and Toshio Nakatani IBM Research Tokyo IBM Research T.J. Watson Research Center October