Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference

Size: px

Start display at page:

Download "Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference"

Marlene Peters
5 years ago
Views:

1 Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference Yunqi Zhang, David Meisner, Jason Mars, Lingjia Tang

2 Internet services User interactive applications Powered by large-scale distributed systems Millions of queries hitting the servers 2

3 Tail latency Orders of magnitude higher than average latency Negatively affects user experience Resource provisioned based on tail latency 3

4 It is challenging for service providers to keep the tail of latency distribution short for interactive services as the size and complexity of the system scales up. Jeffrey Dean, Luiz Barroso The Tail at Scale Google

5 Challenges Tail latency is sensitive to any variance in the system Many services operate at latency as low as microseconds Many architectural components are involved 5

6 Limitations of prior studies NUMA [NUMA Experience HPCA 2013, Tales of the Tail SoCC 2014] NIC [Architecting to Achieve a Billion Request per Second Throughput ISCA 2015, Tales of the Tail SoCC 2014, Chronos SoCC 2012] DVFS [Heracles ISCA 2015, Rubik MICRO 2015, Adrenaline HPCA 2015, Towards Energy Proportionality ISCA 2014, Dynamic Management of TurboMode HPCA 2014] Architectural components have complex interactions 6

7 Goal Attribute and understand the source of tail latency 7

8 Tail latency measurement [ASPLOS 2012] Why are they different? [EuroSys 2014] 8

9 Tail latency measurement [ASPLOS 2012] Client-side queueing bias Query inter-arrival generation Statistical aggregation Why are they different? Performance hysteresis [EuroSys 2014] 8

10 Client-side queueing bias load tester network server 9

11 Client-side queueing bias Multiple clients are needed to avoid client-side queueing bias load tester network server 9

12 Query inter-arrival generation [Closed v.s. Open System Models. NSDI 2006] closed-loop open-loop 10

13 Query inter-arrival generation [Closed v.s. Open System Models. NSDI 2006] Open-loop is necessary to properly exercise the system queueing behavior closed-loop open-loop 10

14 Treadmill Open source Generality < 200 lines of code to integrate each workload CloudSuite [ASPLOS2012] Mutilate [EuroSys2014] Treadmill Memcached load tester network tcpdump server 11

15 Evaluation Extremely long tail Client-side queueing bias Slightly long tail Different distribution Exact same shape Fixed gap to tcpdump 12

16 Evaluation Extremely long tail Client-side queueing bias Slightly long tail Treadmill achieves microsecond-level precision even at high quantiles Different distribution Exact same shape Fixed gap to tcpdump 12

17 Goal Attribute and understand the source of tail latency 13

18 Tail latency attribution Quantile regression (tail latency) (architectural components) attribute variance to explanatory variables and their interactions works with any given quantile rather than average only (ANOVA) Architectural components NUMA allocation policy DVFS frequency governor TurboBoost NIC interrupt handling 14

19 Complex interactions NIC IRQ policy same-node all-nodes DVFS Governor ondemand performance DVFS=ondemand DVFS=performance NIC IRQ same-node NIC IRQ all-nodes more frequency transition overhead Utilization Utilization 15

20 Complex interactions NIC IRQ policy same-node all-nodes DVFS Governor ondemand performance DVFS=ondemand DVFS=performance NIC IRQ same-node NIC IRQ all-nodes more frequency transition overhead Tail latency attribution enables understanding of complex interactions Utilization Utilization 15

21 Tail latency reduction 43% reduction in 99th-percentile latency 93% reduction in its variance 16

22 Conclusion Identifying common pitfalls query inter-arrival generation; client-side queueing bias; statistical aggregation; performance hysteresis Treadmill open source modular load testing infrastructure achieves high precision at microsecond-level Attributing the source of tail latency understand complex interactions among architectural components 43% reduction in 99th-percentile latency 17

23 Thank you!

Treadmill: Tail Latency Measurement at Microsecond-level Precision. Yunqi Zhang Johann Hauswald David Meisner Jason Mars Lingjia Tang

Treadmill: Tail Latency Measurement at Microsecond-level Precision Yunqi Zhang Johann Hauswald David Meisner Jason Mars Lingjia Tang Schedule Welcome Section 1: Tail latency 08:40 ~ 09:00 Overview of data