Diagnosing performance changes by comparing request flows


1 Diagnosing performance changes by comparing request flows
Raja Sambasivan, Alice Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Greg Ganger
Carnegie Mellon, Microsoft Research, Google

2 Perf. diagnosis in distributed systems
- Very difficult and time consuming
  - Root cause could be in any component
- Request-flow comparison
  - Helps localize performance changes
  - Key insight: Changes manifest as mutations in request timing/structure

3 Perf. debugging a feature addition
[Figure: system components: clients, NFS server, metadata server, storage nodes]

4 Perf. debugging a feature addition
- Before addition:
  - Every file access needs an MDS access
[Figure: request path in six numbered steps among client, NFS server, metadata server, and storage nodes]

5 Perf. debugging a feature addition
- After:
  - Metadata prefetched to clients
  - Most requests don't need MDS access
[Figure: request path in three numbered steps among client, NFS server, metadata server, and storage nodes]

6 Perf. debugging a feature addition
- Adding metadata prefetching reduced performance instead of improving it
- How to efficiently diagnose this?

7 Request-flow comparison will show
Precursor: NFS Lookup Call --10μs--> MDS DB Lock --20μs--> MDS DB Unlock --50μs--> NFS Lookup Reply
Mutation: NFS Lookup Call --10μs--> MDS DB Lock --350μs--> MDS DB Unlock --50μs--> NFS Lookup Reply

8 Request-flow comparison will show
(same precursor and mutation graphs as slide 7)
- Root cause localized by showing how mutation and precursor differ

9 Request-flow comparison
- Identifies distribution changes
  - Distinct from anomaly detection
    - E.g., Magpie, Pinpoint, etc.
- Satisfies many use cases
  - Performance regressions/degradations
  - Eliminating the system as the culprit

10 Contributions
- Heuristics for identifying mutations, precursors, and for ranking them
- Implementation in Spectroscope
- Use of Spectroscope to diagnose
  - Unsolved problems in Ursa Minor
  - Problems in Google services

11 Spectroscope workflow
[Figure: non-problem period graphs + problem period graphs -> categorization -> response-time mutation identification / structural mutation identification -> ranking -> UI layer]

12 Graphs via end-to-end tracing
- Used in research & production systems
  - E.g., Magpie, X-Trace, Google's Dapper
- Works as follows:
  - Tracks trace points touched by requests
  - Request-flow graphs obtained by stitching together trace points accessed
- Yields < 1% overhead w/req. sampling
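As a rough illustration of the stitching step, the sketch below assumes each trace record is a hypothetical (request_id, timestamp_us, trace_point) tuple and that every request is strictly serial, so trace points can be joined in timestamp order. Real tracing systems propagate causal metadata to handle concurrency, which is not modeled here.

```python
from collections import defaultdict

def stitch(records):
    """Group trace records by request and join consecutive trace points.

    Returns {request_id: [(src, dst, latency_us), ...]}.
    Assumes serial requests; the record layout is hypothetical.
    """
    by_request = defaultdict(list)
    for request_id, timestamp_us, trace_point in records:
        by_request[request_id].append((timestamp_us, trace_point))

    graphs = {}
    for request_id, points in by_request.items():
        points.sort()  # order the trace points by timestamp
        graphs[request_id] = [
            (src, dst, t_dst - t_src)
            for (t_src, src), (t_dst, dst) in zip(points, points[1:])
        ]
    return graphs

# Records may arrive out of order; timestamps are illustrative.
records = [
    (7, 0, "NFS Lookup Call"),
    (7, 30, "MDS DB Unlock"),
    (7, 10, "MDS DB Lock"),
    (7, 80, "NFS Lookup Reply"),
]
graphs = stitch(records)
```

Each resulting edge carries the latency between two adjacent trace points, which is the form the later comparison steps operate on.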

13 Example: Graph for a striped read
[Figure: request-flow graph for a striped read. Response time: 8,120μs. Node labels: NFS Read Call, Cache Miss, SN1 Read Call, SN2 Read Call, Work on SN1, Work on SN2, SN1 Reply, SN2 Reply, NFS Reply. Edge latencies as extracted: 100μs, 10μs, 10μs, 1,000μs, 1,000μs, 3,500μs, 5,500μs, 1,500μs, 1,500μs, 10μs, 10μs]

14 Example: Graph for a striped read
(same graph as slide 13)
- Nodes show trace points & edges show latencies
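One way to see where the 8,120μs response time comes from is to treat the graph as a DAG and take its slowest path. The sketch below is a reconstruction under assumptions: the extracted slide does not preserve the exact edge layout, so the node names "Work Start" and "Work End" and the assignment of latencies to edges are hypothetical, chosen so that each branch uses the latencies listed on the slide.

```python
# Adjacency list: node -> [(successor, latency_us)].
# The edge layout is an assumed reconstruction of the striped-read figure.
EDGES = {
    "NFS Read Call": [("Cache Miss", 100)],
    "Cache Miss": [("SN1 Read Call", 10), ("SN2 Read Call", 10)],
    "SN1 Read Call": [("SN1 Work Start", 1000)],
    "SN2 Read Call": [("SN2 Work Start", 1000)],
    "SN1 Work Start": [("SN1 Work End", 3500)],
    "SN2 Work Start": [("SN2 Work End", 5500)],
    "SN1 Work End": [("SN1 Reply", 1500)],
    "SN2 Work End": [("SN2 Reply", 1500)],
    "SN1 Reply": [("NFS Read Reply", 10)],
    "SN2 Reply": [("NFS Read Reply", 10)],
    "NFS Read Reply": [],
}

def critical_path(node):
    """Return (latency_us, path) for the slowest path from node to the end."""
    best = (0, [node])
    for successor, latency_us in EDGES[node]:
        sub_latency, sub_path = critical_path(successor)
        candidate = (latency_us + sub_latency, [node] + sub_path)
        if candidate[0] > best[0]:
            best = candidate
    return best

latency_us, path = critical_path("NFS Read Call")
```

Under this layout the slower SN2 branch determines the response time, since the reply cannot be sent until both storage nodes have answered.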

15 Spectroscope workflow
[Figure: non-problem period graphs + problem period graphs -> categorization -> response-time mutation identification / structural mutation identification -> ranking -> UI layer]

16 Categorization step
- Necessary since it is meaningless to compare individual request flows
- Groups together similar request flows
  - Categories: basic unit for comparisons
  - Allows for mutation identification by comparing per-category distributions

17 Choosing what to bin into a category
- Our choice: identically structured reqs
  - Uses same path/similar cost expectation
- Same path/similar costs notion is valid
  - For 88-99% of Ursa Minor categories
  - For 47-69% of Bigtable categories
    - Lower value due to sparser trace points
    - Lower value also due to contention
- Aside: Categorization can be used to localize problematic sources of variance
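Binning identically structured requests can be sketched as grouping by a structure key that ignores latencies. The flow representation below (a dict with "edges" and "response_time_us") is a hypothetical layout chosen for illustration, not Spectroscope's actual data model.

```python
from collections import defaultdict

def structure_key(edges):
    """Structure = the sequence of (src, dst) pairs, ignoring latencies."""
    return tuple((src, dst) for src, dst, _latency_us in edges)

def categorize(flows):
    """Map each distinct structure to the response times of its requests."""
    categories = defaultdict(list)
    for flow in flows:
        key = structure_key(flow["edges"])
        categories[key].append(flow["response_time_us"])
    return dict(categories)

# Two requests share a structure; the third takes a different path.
flows = [
    {"edges": [("Call", "Lock", 10), ("Lock", "Reply", 20)], "response_time_us": 30},
    {"edges": [("Call", "Lock", 12), ("Lock", "Reply", 25)], "response_time_us": 37},
    {"edges": [("Call", "Reply", 5)], "response_time_us": 5},
]
categories = categorize(flows)
```

Each category then holds a response-time distribution per period, which is what the mutation-identification steps compare.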

18 Spectroscope workflow
[Figure: non-problem period graphs + problem period graphs -> categorization -> response-time mutation identification / structural mutation identification -> ranking -> UI layer]

19 Type 1: Response-time mutations
- Requests that:
  - are structurally identical in both periods
  - have larger problem period latencies
- Root cause localized by identifying interactions responsible

20 Categories w/response-time mutations
- Identified via use of a hypothesis test
  - Sets apart natural variance from mutations
  - Also used to find interactions responsible
[Figure: frequency vs. response-time histograms for Category 1 and Category 2]

21 Categories w/response-time mutations
(same bullets and histograms as slide 20)
[Figure annotations: one category is labeled "ID'd as containing mutations", the other "Not ID'd as containing mutations"]
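The slides do not name the hypothesis test. As one reasonable stand-in, the sketch below uses a two-sample Kolmogorov-Smirnov test with the usual large-sample critical value for a 5% significance level (the constant 1.358); the function names and sample data are hypothetical.

```python
import math
from bisect import bisect_right
from statistics import mean

def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def has_response_time_mutation(non_problem, problem, c_alpha=1.358):
    """True if the distributions differ and the problem period is slower."""
    n, m = len(non_problem), len(problem)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    differs = ks_statistic(non_problem, problem) > critical
    return differs and mean(problem) > mean(non_problem)

# Illustrative response times in microseconds for one category.
baseline = [100 + i for i in range(20)]
slower = [1000 + i for i in range(20)]
jittered = [101 + i for i in range(20)]

flagged = has_response_time_mutation(baseline, slower)
not_flagged = has_response_time_mutation(baseline, jittered)
```

A distribution-level test of this kind tolerates small shifts such as the jittered sample, which is the sense in which it separates natural variance from mutations.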

22 Response-time mutation example
NFS Read Call --avg. 10μs--> SN1 Read Start --avg. 20μs (non-problem) / avg. 1,000μs (problem)--> SN1 Read End --avg. 80μs--> NFS Read Reply
- Avg. non-problem response time: 110μs
- Avg. problem response time: 1,090μs

23 Response-time mutation example
(same graph as slide 22)
- Problem localized by ID'ing responsible interaction
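Localizing the responsible interaction in this example amounts to finding the edge whose latency grew between periods. The sketch below compares per-edge averages directly, which is a simplification: the slides say a hypothesis test is also used for this step. The dictionary layout is an assumption.

```python
def responsible_interaction(non_problem_avg, problem_avg):
    """Return the edge whose average latency grew the most, and by how much."""
    increases = {
        edge: problem_avg[edge] - non_problem_avg[edge]
        for edge in non_problem_avg
    }
    edge = max(increases, key=increases.get)
    return edge, increases[edge]

# Average edge latencies in microseconds, taken from the example slide.
non_problem = {
    ("NFS Read Call", "SN1 Read Start"): 10,
    ("SN1 Read Start", "SN1 Read End"): 20,
    ("SN1 Read End", "NFS Read Reply"): 80,
}
problem = {
    ("NFS Read Call", "SN1 Read Start"): 10,
    ("SN1 Read Start", "SN1 Read End"): 1000,
    ("SN1 Read End", "NFS Read Reply"): 80,
}

edge, increase_us = responsible_interaction(non_problem, problem)
```

The edge sums reproduce the slide's averages (110μs and 1,090μs), and the whole 980μs difference sits on the SN1 read interaction.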

24 Type 2: Structural mutations
- Requests that:
  - take different paths in the problem period
- Root cause localized by:
  - identifying their precursors
    - Likely path during the non-problem period
  - ID'ing how mutation & precursor differ

25 ID'ing categories w/structural mutations
- Assume similar workloads executed
- Categories with more problem period requests contain mutations
  - Reverse true for precursor categories
- Threshold used to differentiate natural variance from categories w/mutations

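26 Mapping mutations to precursors
[Figure: candidate precursor categories Read (NP: 1,000, P: 150) and Write (NP: 480, P: 460); structural-mutation category Read (NP: 300, P: 800); NP = non-problem period request count, P = problem period request count]
- Accomplished using three heuristics
  - See paper for details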
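The count-based identification of mutation and precursor categories can be sketched with the request counts from the figure above. The threshold value of 10 and the extra "Create" category are hypothetical, and the mapping heuristics themselves are not reproduced here since the slides defer them to the paper.

```python
THRESHOLD = 10  # hypothetical; the slides do not give the value used

def classify(counts, threshold=THRESHOLD):
    """Label each category by how its request count changed between periods."""
    labels = {}
    for name, (non_problem_count, problem_count) in counts.items():
        delta = problem_count - non_problem_count
        if delta > threshold:
            labels[name] = "mutation"
        elif delta < -threshold:
            labels[name] = "precursor"
        else:
            labels[name] = "natural variance"
    return labels

# (non-problem count, problem count); first three rows are from the slide.
counts = {
    "Read (path A)": (1000, 150),
    "Write": (480, 460),
    "Read (path B)": (300, 800),
    "Create": (200, 205),  # hypothetical category with a negligible change
}
labels = classify(counts)
```

With these numbers both Read (path A) and Write qualify as candidate precursors of the Read (path B) mutation, which is the ambiguity the mapping heuristics are meant to resolve.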

27 Example structural mutation
Precursor: NFS Lookup Call --10μs--> MDS DB Lock --20μs--> MDS DB Unlock --50μs--> NFS Lookup Reply
Mutation: NFS Lookup Call --10μs--> MDS DB Lock --350μs--> MDS DB Unlock --50μs--> NFS Lookup Reply

28 Example structural mutation
(same precursor and mutation graphs as slide 27)
- Root cause localized by showing how mutation and precursor differ

29 Ranking of categories w/mutations
- Necessary for two reasons
  - There may be more than one problem
  - One problem may yield many mutations
- Rank based on:
  - # of reqs affected * Δ in response time
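The ranking rule can be sketched as a sort on the product of the two quantities named on the slide. The category names and numbers below are made up for illustration.

```python
def rank(categories):
    """Sort categories by (# of requests affected) * (change in response time)."""
    return sorted(
        categories,
        key=lambda c: c["requests_affected"] * c["delta_response_time_us"],
        reverse=True,
    )

# Hypothetical categories containing mutations.
categories = [
    {"name": "C", "requests_affected": 2000, "delta_response_time_us": 5},
    {"name": "A", "requests_affected": 500, "delta_response_time_us": 330},
    {"name": "B", "requests_affected": 50, "delta_response_time_us": 980},
]
ranked_names = [c["name"] for c in rank(categories)]
```

Here A (500 * 330 = 165,000) outranks B (50 * 980 = 49,000) and C (2,000 * 5 = 10,000): a moderate slowdown across many requests matters more than a large slowdown on a few.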

30 Spectroscope workflow
[Figure: non-problem period graphs + problem period graphs -> categorization -> response-time mutation identification / structural mutation identification -> ranking -> UI layer]

31 Putting it all together: The UI
- Ranked list: 1. Structural; 2. Response time; 20. Structural
Precursor: NFS Lookup Call --10μs--> MDS DB Lock --20μs--> MDS DB Unlock --50μs--> NFS Lookup Reply
Mutation: NFS Lookup Call --10μs--> MDS DB Lock --350μs--> MDS DB Unlock --50μs--> NFS Lookup Reply

32 Putting it all together: The UI
- Ranked list: 1. Structural; 2. Response time; 20. Structural
Mutation: NFS Read Call --avg. 10μs--> SN1 Read Start --avg. 20μs / avg. 100μs--> SN1 Read End --avg. 80μs--> NFS Read Reply

33 Outline
- Introduction
- End-to-end tracing & Spectroscope
- Ursa Minor & Google case studies
- Summary

34 Ursa Minor case studies
- Used Spectroscope to diagnose real performance problems in Ursa Minor
  - Four were previously unsolved
- Evaluated ranked list by measuring
  - % of top 10 results that were relevant
  - % of total results that were relevant
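Both evaluation measures reduce to a simple ratio over the ranked list. The sketch below assumes each result has already been judged relevant or not by hand; the list of judgments is invented for illustration.

```python
def percent_relevant(ranked_relevance, top_n=None):
    """Percentage of relevant results, optionally limited to the top N."""
    results = ranked_relevance if top_n is None else ranked_relevance[:top_n]
    return 100.0 * sum(results) / len(results)

# Hypothetical relevance judgments for a ranked list of 20 results.
judgments = [True] * 9 + [False] + [True] * 5 + [False] * 5

top10_pct = percent_relevant(judgments, top_n=10)
total_pct = percent_relevant(judgments)
```

Reporting both numbers separates ranking quality (are the first results worth reading?) from overall precision.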

35 Quantitative results
[Figure: bar chart of % top 10 relevant and % total relevant (y-axis 0% to 100%) for the case studies Prefetch, Config., RMWs, Creates, 500μs delay, and Spikes; one entry is marked N/A]

36 Ursa Minor 5-component config
- All case studies use this configuration
[Figure: request path in six numbered steps among client, NFS server, metadata server, and storage nodes]

37 Case 1: MDS configuration problem
- Problem: Slowdown in key benchmarks seen after large code merge

38 Comparing request flows
- Identified 128 mutation categories:
  - Most contained structural mutations
  - Mutations and precursors differed only in the storage node accessed
- Localized bug to unexpected interaction between MDS & the data storage node
  - But, unclear why this was happening

39 Further localization
- Thought mutations may be caused by changes in low-level parameters
  - E.g., function call parameters, etc.
- Identified parameters that separated 1st-ranked mutation from its precursors
  - Showed changes in encoding params
  - Localized root cause to two config files
- Root cause: Change to an obscure config file
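A very simple stand-in for "identifying parameters that separate a mutation from its precursors" is to look for parameters whose observed values never overlap between the two groups. This is not claimed to be the method Spectroscope uses; the parameter names and values below are hypothetical.

```python
def separating_parameters(mutation_reqs, precursor_reqs):
    """Return parameters whose value sets are disjoint between the two groups."""
    names = set().union(*(r.keys() for r in mutation_reqs + precursor_reqs))
    separating = []
    for name in sorted(names):
        mutation_values = {r.get(name) for r in mutation_reqs}
        precursor_values = {r.get(name) for r in precursor_reqs}
        if mutation_values.isdisjoint(precursor_values):
            separating.append(name)
    return separating

# Hypothetical low-level parameters logged with each request.
mutation_reqs = [
    {"encoding": "B", "size": 4096},
    {"encoding": "B", "size": 8192},
]
precursor_reqs = [
    {"encoding": "A", "size": 4096},
    {"encoding": "A", "size": 4096},
]
separators = separating_parameters(mutation_reqs, precursor_reqs)
```

In this toy data only the encoding parameter cleanly splits the groups, mirroring how the case study pointed at encoding parameters and from there to configuration files.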

40 Inter-cluster perf. at Google
- Load tests run on same software in two datacenters yielded very different perf.
  - Developers said perf. should be similar

41 Comparing request flows
- Revealed many mutations
  - High-ranked ones were response-time
- Responsible interactions found both within the service and in dependencies
  - Led us to suspect root cause was issue with the slower load test's datacenter
- Root cause: Shared Bigtable in slower load test's datacenter was not working properly

42 Summary
- Introduced request-flow comparison as a new way to diagnose perf. changes
- Presented algorithms for localizing problems by identifying mutations
- Showed utility of our approach by using it to diagnose real, unsolved problems

43 Related work (I)
- [Abd-El-Malek05b]: Ursa Minor: versatile cluster-based storage. Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, Jay J. Wylie. FAST.
- [Barham04]: Using Magpie for request extraction and workload modelling. Paul Barham, Austin Donnelly, Rebecca Isaacs, Richard Mortier. OSDI.
- [Chen04]: Path-based failure and evolution management. Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, Eric Brewer. NSDI.

44 Related work (II)
- [Cohen05]: Capturing, indexing, and retrieving system history. Ira Cohen, Steve Zhang, Moises Goldszmidt, Terrence Kelly, Armando Fox. SOSP.
- [Fonseca07]: X-Trace: A Pervasive Network Tracing Framework. Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica. NSDI.
- [Hendricks06]: Improving small file performance in object-based storage. James Hendricks, Raja R. Sambasivan, Shafeeq Sinnamohideen, Gregory R. Ganger. Technical Report CMU-PDL.

45 Related work (III)
- [Reynolds06]: Detecting the unexpected in distributed systems. Patrick Reynolds, Charles Killian, Janet L. Wiener, Jeffrey C. Mogul, Mehul A. Shah, Amin Vahdat. NSDI.
- [Sambasivan07]: Categorizing and differencing system behaviours. Raja R. Sambasivan, Alice X. Zheng, Eno Thereska. HotAC II.
- [Sambasivan10]: Diagnosing performance problems by visualizing and comparing system behaviours. Raja R. Sambasivan, Alice X. Zheng, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger. CMU-PDL.

46 Related work (IV)
- [Sambasivan11]: Automation without predictability is a recipe for failure. Raja R. Sambasivan and Gregory R. Ganger. Technical Report CMU-PDL.
- [Sigelman10]: Dapper, a large-scale distributed tracing infrastructure. Benjamin H. Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag. Technical report.
- [Thereska06b]: Stardust: tracking activity in a distributed storage system. Eno Thereska, Brandon Salmon, John Strunk, Matthew Wachs, Michael Abd-El-Malek, Julio Lopez, Gregory R. Ganger. SIGMETRICS.
