Diagnosing performance changes by comparing request flows

1 Diagnosing performance changes by comparing request flows
  Raja Sambasivan, Alice Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Greg Ganger
  Carnegie Mellon, Microsoft Research, Google

2 Perf. diagnosis in distributed systems
  Very difficult and time consuming
    Root cause could be in any component
  Request-flow comparison
    Helps localize performance changes
    Key insight: Changes manifest as mutations in request timing/structure

3 Perf. debugging a feature addition
  [Figure: system diagram showing clients, an NFS server, a metadata server, and storage nodes]

4 Perf. debugging a feature addition
  Before addition:
    Every file access needs an MDS access
  [Figure: request path numbered (1) to (6) among client, NFS server, metadata server, and storage nodes]

5 Perf. debugging a feature addition
  After:
    Metadata prefetched to clients
    Most requests don't need MDS access
  [Figure: request path numbered (1) to (3) among client, NFS server, and storage nodes, with the metadata server off the common path]

6 Perf. debugging a feature addition
  Adding metadata prefetching reduced performance instead of improving it
  How to efficiently diagnose this?

7 Request-flow comparison will show
  Precursor: NFS Lookup Call -[10μs]-> MDS DB Lock -[20μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply
  Mutation: NFS Lookup Call -[10μs]-> MDS DB Lock -[350μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply

8 Request-flow comparison will show
  Precursor: NFS Lookup Call -[10μs]-> MDS DB Lock -[20μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply
  Mutation: NFS Lookup Call -[10μs]-> MDS DB Lock -[350μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply
  Root cause localized by showing how mutation and precursor differ

9 Request-flow comparison
  Identifies distribution changes
    Distinct from anomaly detection
      E.g., Magpie, Pinpoint, etc.
  Satisfies many use cases
    Performance regressions/degradations
    Eliminating the system as the culprit

10 Contributions
  Heuristics for identifying mutations, precursors, and for ranking them
  Implementation in Spectroscope
  Use of Spectroscope to diagnose
    Unsolved problems in Ursa Minor
    Problems in Google services

11 Spectroscope workflow
  Inputs: Non-problem period graphs, Problem period graphs
  Steps: Categorization -> Response-time mutation identification and Structural mutation identification -> Ranking -> UI layer

12 Graphs via end-to-end tracing
  Used in research & production systems
    E.g., Magpie, X-Trace, Google's Dapper
  Works as follows:
    Tracks trace points touched by requests
    Request-flow graphs obtained by stitching together trace points accessed
  Yields < 1% overhead w/req. sampling
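The stitching step could be sketched roughly as follows. This is a minimal illustration, not the tracing infrastructure's actual code: the record layout `(request_id, trace_point, timestamp_us)` and the function name `build_request_flows` are assumptions made for the example, and the sketch orders trace points purely by timestamp, so it only covers requests without concurrent branches.

```python
from collections import defaultdict

def build_request_flows(records):
    """Group trace records by request and stitch them into edge lists.

    records: iterable of (request_id, trace_point, timestamp_us) tuples
             (a hypothetical layout chosen for illustration).
    Returns {request_id: [(src_point, dst_point, latency_us), ...]}.
    """
    by_request = defaultdict(list)
    for request_id, trace_point, timestamp_us in records:
        by_request[request_id].append((timestamp_us, trace_point))

    flows = {}
    for request_id, points in by_request.items():
        points.sort()  # order this request's trace points by timestamp
        flows[request_id] = [
            (src, dst, t_dst - t_src)
            for (t_src, src), (t_dst, dst) in zip(points, points[1:])
        ]
    return flows

# Interleaved records from two requests (illustrative values only).
records = [
    ("r1", "NFS Lookup Call", 0),
    ("r2", "NFS Lookup Call", 5),
    ("r1", "MDS DB Lock", 10),
    ("r1", "MDS DB Unlock", 30),
    ("r2", "NFS Lookup Reply", 65),
    ("r1", "NFS Lookup Reply", 80),
]
flows = build_request_flows(records)
```

Each edge carries the latency between two consecutive trace points, which matches the "nodes are trace points, edges are latencies" reading of the graphs in the following slides.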

13 Example: Graph for a striped read
  Response-time: 8,120μs
  [Figure: request-flow graph. Nodes: NFS Read Call, Cache Miss, SN1 Read Call, SN2 Read Call, Work on SN1, Work on SN2, SN1 Reply, SN2 Reply, NFS Reply. Edge latencies as labeled: 100μs; 10μs and 10μs; 1,000μs and 1,000μs; 3,500μs (SN1) and 5,500μs (SN2); 1,500μs and 1,500μs; 10μs and 10μs]

14 Example: Graph for a striped read
  Response-time: 8,120μs
  [Figure: request-flow graph. Nodes: NFS Read Call, Cache Miss, SN1 Read Call, SN2 Read Call, Work on SN1, Work on SN2, SN1 Reply, SN2 Reply, NFS Reply. Edge latencies as labeled: 100μs; 10μs and 10μs; 1,000μs and 1,000μs; 3,500μs (SN1) and 5,500μs (SN2); 1,500μs and 1,500μs; 10μs and 10μs]
  Nodes show trace points & edges show latencies
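The 8,120μs response time is consistent with the slowest branch of the striped read: 100 + 10 + 1,000 + 5,500 + 1,500 + 10 = 8,120μs. The sketch below computes that as the longest path through the graph. The flattened slide does not show which edges carry the 1,500μs and 10μs labels, so their placement here, and the intermediate nodes "SN1 Done" and "SN2 Done", are assumptions made only so the example is complete.

```python
from functools import lru_cache

# Edge latencies in microseconds. The placement of the 1,500μs and 10μs
# edges and the "SN1 Done"/"SN2 Done" nodes are assumptions, not taken
# from the slide.
EDGES = {
    "NFS Read Call": [("Cache Miss", 100)],
    "Cache Miss": [("SN1 Read Call", 10), ("SN2 Read Call", 10)],
    "SN1 Read Call": [("Work on SN1", 1000)],
    "SN2 Read Call": [("Work on SN2", 1000)],
    "Work on SN1": [("SN1 Reply", 3500)],
    "Work on SN2": [("SN2 Reply", 5500)],
    "SN1 Reply": [("SN1 Done", 1500)],
    "SN2 Reply": [("SN2 Done", 1500)],
    "SN1 Done": [("NFS Reply", 10)],
    "SN2 Done": [("NFS Reply", 10)],
    "NFS Reply": [],
}

@lru_cache(maxsize=None)
def longest_latency(node):
    """Latency of the slowest path from `node` to the end of the request."""
    return max(
        (latency + longest_latency(nxt) for nxt, latency in EDGES[node]),
        default=0,
    )

response_time_us = longest_latency("NFS Read Call")
```

Because the two storage-node reads run in parallel, the request finishes only when the slower one (SN2) does, which is why the longest path rather than the sum of all edges gives the response time.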

15 Spectroscope workflow
  Inputs: Non-problem period graphs, Problem period graphs
  Steps: Categorization -> Response-time mutation identification and Structural mutation identification -> Ranking -> UI layer

16 Categorization step
  Necessary since it is meaningless to compare individual request flows
  Groups together similar request flows
    Categories: basic unit for comparisons
    Allows for mutation identification by comparing per-category distributions

17 Choosing what to bin into a category
  Our choice: identically structured reqs
    Uses same path/similar cost expectation
  Same path/similar costs notion is valid
    For 88-99% of Ursa Minor categories
    For 47-69% of Bigtable categories
      Lower value due to sparser trace points
      Lower value also due to contention
  Aside: Categorization can be used to localize problematic sources of variance
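Binning identically structured requests could look roughly like this. It is a sketch under stated assumptions: each flow is represented as a list of `(src_point, dst_point)` edges plus a response time, and "identical structure" is approximated by an identical sorted edge list. The function name `categorize` and the data layout are hypothetical.

```python
from collections import defaultdict

def categorize(flows):
    """Bin request flows by structure.

    flows: iterable of (edges, response_time_us), where edges is a list of
           (src_point, dst_point) pairs (hypothetical layout).
    Returns {structure_key: [response_time_us, ...]}.
    """
    categories = defaultdict(list)
    for edges, response_time_us in flows:
        key = tuple(sorted(edges))  # same edges -> same category
        categories[key].append(response_time_us)
    return dict(categories)

# Two structures: a lookup that visits the MDS and one that does not.
mds_lookup = [
    ("NFS Lookup Call", "MDS DB Lock"),
    ("MDS DB Lock", "MDS DB Unlock"),
    ("MDS DB Unlock", "NFS Lookup Reply"),
]
direct_lookup = [("NFS Lookup Call", "NFS Lookup Reply")]

flows = [(mds_lookup, 80), (direct_lookup, 20), (mds_lookup, 410), (direct_lookup, 25)]
categories = categorize(flows)
```

Each category then holds a distribution of response times, which is the unit the later steps compare between the non-problem and problem periods.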

18 Spectroscope workflow
  Inputs: Non-problem period graphs, Problem period graphs
  Steps: Categorization -> Response-time mutation identification and Structural mutation identification -> Ranking -> UI layer

19 Type 1: Response-time mutations
  Requests that:
    are structurally identical in both periods
    have larger problem period latencies
  Root cause localized by identifying interactions responsible

20 Categories w/response-time mutations
  Identified via use of a hypothesis test
    Sets apart natural variance from mutations
    Also used to find interactions responsible
  [Figure: frequency vs. response-time distributions for Category 1 and Category 2]

21 Categories w/response-time mutations
  Identified via use of a hypothesis test
    Sets apart natural variance from mutations
    Also used to find interactions responsible
  [Figure: frequency vs. response-time distributions for Category 1 (ID'd as containing mutations) and Category 2 (not ID'd as containing mutations)]
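The slides say only that "a hypothesis test" is used and do not name it. As a stand-in, the sketch below uses a one-sided permutation test on the difference of mean response times; the function name, the number of rounds, and the 0.05 significance level are all assumptions for illustration, and the sample values are made up.

```python
import random

def permutation_p_value(non_problem, problem, rounds=2000, seed=0):
    """One-sided permutation test: is the problem-period mean larger?

    Stand-in for the unnamed hypothesis test on the slide. Returns an
    estimated p-value; small values suggest a response-time mutation.
    """
    rng = random.Random(seed)
    n = len(non_problem)
    m = len(problem)
    observed = sum(problem) / m - sum(non_problem) / n
    pooled = list(non_problem) + list(problem)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # randomly reassign samples to the two periods
        diff = sum(pooled[n:]) / m - sum(pooled[:n]) / n
        if diff >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)

# Illustrative response times in microseconds (not measured data).
baseline = [100, 110, 105, 95, 120, 115, 108, 102]
shifted = [1080, 1100, 1090, 1070, 1110, 1095, 1085, 1105]
similar = [104, 98, 112, 109, 101, 117, 96, 111]

p_shifted = permutation_p_value(baseline, shifted)
p_similar = permutation_p_value(baseline, similar)
contains_mutation = p_shifted < 0.05  # 0.05 is an assumed threshold
```

The same kind of test can be applied per edge rather than per request to find which interaction is responsible, as the slide notes.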

22 Response-time mutation example
  Avg. non-problem response time: 110μs
  Avg. problem response time: 1,090μs
  NFS Read Call -[Avg. 10μs]-> SN1 Read Start -[Avg. 20μs (non-problem), Avg. 1,000μs (problem)]-> SN1 Read End -[Avg. 80μs]-> NFS Read Reply

23 Response-time mutation example
  Avg. non-problem response time: 110μs
  Avg. problem response time: 1,090μs
  NFS Read Call -[Avg. 10μs]-> SN1 Read Start -[Avg. 20μs (non-problem), Avg. 1,000μs (problem)]-> SN1 Read End -[Avg. 80μs]-> NFS Read Reply
  Problem localized by ID'ing responsible interaction

24 Type 2: Structural mutations
  Requests that:
    take different paths in the problem period
  Root cause localized by:
    identifying their precursors
      Likely path during the non-problem period
    ID'ing how mutation & precursor differ

25 ID'ing categories w/structural mutations
  Assume similar workloads executed
    Categories with more problem period requests contain mutations
    Reverse true for precursor categories
  Threshold used to differentiate natural variance from categories w/mutations
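The count-based rule above can be sketched as follows. The threshold value of 50 requests, the function name, and the category names are assumptions; the counts echo the numbers shown on the mapping slide that follows, with the pairing of counts to categories assumed.

```python
def classify_structural(counts, threshold=50):
    """Split categories into structural mutations and precursors.

    counts: {category: (non_problem_count, problem_count)}
    threshold: minimum change in request count treated as more than
               natural variance (an assumed value).
    """
    mutations, precursors = [], []
    for category, (np_count, p_count) in counts.items():
        change = p_count - np_count
        if change >= threshold:
            mutations.append(category)   # more requests in problem period
        elif change <= -threshold:
            precursors.append(category)  # fewer requests in problem period
    return mutations, precursors

counts = {
    "read_a": (1000, 150),
    "write": (480, 460),
    "read_b": (300, 800),
}
mutations, precursors = classify_structural(counts)
```

Here `write` changes by only 20 requests, so it falls under the threshold and is treated as natural variance rather than as a mutation or a precursor.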

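26 Mapping mutations to precursors
  [Figure: candidate precursors and a structural mutation, with request counts]
    Precursor candidate, Read: NP: 1,000, P: 150
    Precursor candidate, Write: NP: 480, P: 460
    Structural mutation, Read: NP: 300, P: 800
  Accomplished using three heuristics
    See paper for details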

27 Example structural mutation
  Precursor: NFS Lookup Call -[10μs]-> MDS DB Lock -[20μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply
  Mutation: NFS Lookup Call -[10μs]-> MDS DB Lock -[350μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply

28 Example structural mutation
  Precursor: NFS Lookup Call -[10μs]-> MDS DB Lock -[20μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply
  Mutation: NFS Lookup Call -[10μs]-> MDS DB Lock -[350μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply
  Root cause localized by showing how mutation and precursor differ

29 Ranking of categories w/mutations
  Necessary for two reasons
    There may be more than one problem
    One problem may yield many mutations
  Rank based on:
    # of reqs affected * Δ in response time
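The ranking rule can be written down directly. In this sketch the dictionary keys (`name`, `requests_affected`, `delta_us`) and the example values are hypothetical; only the scoring formula comes from the slide.

```python
def rank_mutations(categories):
    """Order mutation categories by (# of requests affected) * (Δ response time)."""
    return sorted(
        categories,
        key=lambda c: c["requests_affected"] * c["delta_us"],
        reverse=True,
    )

# Illustrative categories (values are made up).
categories = [
    {"name": "C", "requests_affected": 50, "delta_us": 980},    # score 49,000
    {"name": "A", "requests_affected": 500, "delta_us": 330},   # score 165,000
    {"name": "B", "requests_affected": 2000, "delta_us": 40},   # score 80,000
]
ranked_names = [c["name"] for c in rank_mutations(categories)]
```

Multiplying the two factors means a small slowdown that hits many requests can outrank a large slowdown that hits few, as category B does relative to C here.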

30 Spectroscope workflow
  Inputs: Non-problem period graphs, Problem period graphs
  Steps: Categorization -> Response-time mutation identification and Structural mutation identification -> Ranking -> UI layer

31 Putting it all together: The UI
  Precursor: NFS Lookup Call -[10μs]-> MDS DB Lock -[20μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply
  Mutation: NFS Lookup Call -[10μs]-> MDS DB Lock -[350μs]-> MDS DB Unlock -[50μs]-> NFS Lookup Reply
  Ranked list: 1. Structural; 2. Response time; 20. Structural

32 Putting it all together: The UI
  Mutation: NFS Read Call -[Avg. 10μs]-> SN1 Read Start -[Avg. 20μs, Avg. 100μs]-> SN1 Read End -[Avg. 80μs]-> NFS Read Reply
  Ranked list: 1. Structural; 2. Response time; 20. Structural

33 Outline
  Introduction
  End-to-end tracing & Spectroscope
  Ursa Minor & Google case studies
  Summary

34 Ursa Minor case studies
  Used Spectroscope to diagnose real performance problems in Ursa Minor
    Four were previously unsolved
  Evaluated ranked list by measuring
    % of top 10 results that were relevant
    % of total results that were relevant
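The two evaluation metrics could be computed as in the sketch below. It assumes each ranked result has already been judged relevant or not and that the list is non-empty; the function name and the example judgments are hypothetical, not results from the case studies.

```python
def relevance_metrics(ranked_relevant, top_n=10):
    """Return (% of top-N results relevant, % of all results relevant).

    ranked_relevant: non-empty list of booleans in rank order, True when
                     the result at that rank was judged relevant.
    """
    top = ranked_relevant[:top_n]
    pct_top = 100.0 * sum(top) / len(top)
    pct_total = 100.0 * sum(ranked_relevant) / len(ranked_relevant)
    return pct_top, pct_total

# 20 ranked results: 9 of the top 10 relevant, 14 relevant overall
# (illustrative judgments only).
judgments = [True] * 9 + [False] + [True] * 5 + [False] * 5
pct_top10, pct_total = relevance_metrics(judgments)
```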

35 Quantitative results
  [Figure: bar chart of "% top 10 relevant" and "% total relevant" (y-axis 0% to 100%) for the case studies Prefetch, Config., RMWs, Creates, 500μs delay, and Spikes; one entry is marked N/A]

36 Ursa Minor 5-component config
  All case studies use this configuration
  [Figure: request path numbered (1) to (6) among client, NFS server, metadata server, and storage nodes]

37 Case 1: MDS configuration problem
  Problem: Slowdown in key benchmarks seen after large code merge

38 Comparing request flows
  Identified 128 mutation categories:
    Most contained structural mutations
    Mutations and precursors differed only in the storage node accessed
  Localized bug to unexpected interaction between MDS & the data storage node
    But, unclear why this was happening

39 Further localization
  Thought mutations may be caused by changes in low-level parameters
    E.g., function call parameters, etc.
  Identified parameters that separated 1st-ranked mutation from its precursors
    Showed changes in encoding params
    Localized root cause to two config files
  Root cause: Change to an obscure config file

40 Inter-cluster perf. at Google
  Load tests run on same software in two datacenters yielded very different perf.
    Developers said perf. should be similar

41 Comparing request flows
  Revealed many mutations
    High-ranked ones were response-time
  Responsible interactions found both within the service and in dependencies
    Led us to suspect root cause was issue with the slower load test's datacenter
  Root cause: Shared Bigtable in slower load test's datacenter was not working properly

42 Summary
  Introduced request-flow comparison as a new way to diagnose perf. changes
  Presented algorithms for localizing problems by identifying mutations
  Showed utility of our approach by using it to diagnose real, unsolved problems

43 Related work (I)
  [Abd-El-Malek05b]: Ursa Minor: versatile cluster-based storage. Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, Jay J. Wylie. FAST.
  [Barham04]: Using Magpie for request extraction and workload modelling. Paul Barham, Austin Donnelly, Rebecca Isaacs, Richard Mortier. OSDI.
  [Chen04]: Path-based failure and evolution management. Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, Eric Brewer. NSDI.

44 Related work (II)
  [Cohen05]: Capturing, indexing, and retrieving system history. Ira Cohen, Steve Zhang, Moises Goldszmidt, Terrence Kelly, Armando Fox. SOSP.
  [Fonseca07]: X-Trace: A Pervasive Network Tracing Framework. Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica. NSDI.
  [Hendricks06]: Improving small file performance in object-based storage. James Hendricks, Raja R. Sambasivan, Shafeeq Sinnamohideen, Gregory R. Ganger. Technical Report CMU-PDL.

45 Related work (III)
  [Reynolds06]: Detecting the unexpected in distributed systems. Patrick Reynolds, Charles Killian, Janet L. Wiener, Jeffrey C. Mogul, Mehul A. Shah, Amin Vahdat. NSDI.
  [Sambasivan07]: Categorizing and differencing system behaviours. Raja R. Sambasivan, Alice X. Zheng, Eno Thereska. HotAC II.
  [Sambasivan10]: Diagnosing performance problems by visualizing and comparing system behaviours. Raja R. Sambasivan, Alice X. Zheng, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger. CMU-PDL.

46 Related work (IV)
  [Sambasivan11]: Automation without predictability is a recipe for failure. Raja R. Sambasivan and Gregory R. Ganger. Technical Report CMU-PDL.
  [Sigelman10]: Dapper, a large-scale distributed tracing infrastructure. Benjamin H. Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag. Technical report.
  [Thereska06b]: Stardust: tracking activity in a distributed storage system. Eno Thereska, Brandon Salmon, John Strunk, Matthew Wachs, Michael Abd-El-Malek, Julio Lopez, Gregory R. Ganger. SIGMETRICS.


More information

Research and Teaching Statement

Research and Teaching Statement Research and Teaching Statement Armando Fox February 2008 Research Summary My research focuses on both policy and mechanism for managing datacenter-scale installations (thousands of computers) of interactive-response,

More information

Storage Systems for Shingled Disks

Storage Systems for Shingled Disks Storage Systems for Shingled Disks Garth Gibson Carnegie Mellon University and Panasas Inc Anand Suresh, Jainam Shah, Xu Zhang, Swapnil Patil, Greg Ganger Kryder s Law for Magnetic Disks Market expects

More information

A Two-Tiered Software Architecture for Automated Tuning of Disk Layouts

A Two-Tiered Software Architecture for Automated Tuning of Disk Layouts A Two-Tiered Software Architecture for Automated Tuning of Disk Layouts Brandon Salmon, Eno Thereska, Craig A.N. Soules, Gregory R. Ganger Apr 2003 CMU-CS-03-130 School of Computer Science Carnegie Mellon

More information

Erik Riedel Hewlett-Packard Labs

Erik Riedel Hewlett-Packard Labs Erik Riedel Hewlett-Packard Labs Greg Ganger, Christos Faloutsos, Dave Nagle Carnegie Mellon University Outline Motivation Freeblock Scheduling Scheduling Trade-Offs Performance Details Applications Related

More information

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Crossing the Chasm: Sneaking a parallel file system into Hadoop Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large

More information

The Datacenter Needs an Operating System

The Datacenter Needs an Operating System UC BERKELEY The Datacenter Needs an Operating System Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter Needs an Operating System 3. Mesos,

More information

EVCache: Lowering Costs for a Low Latency Cache with RocksDB. Scott Mansfield Vu Nguyen EVCache

EVCache: Lowering Costs for a Low Latency Cache with RocksDB. Scott Mansfield Vu Nguyen EVCache EVCache: Lowering Costs for a Low Latency Cache with RocksDB Scott Mansfield Vu Nguyen EVCache 90 seconds What do caches touch? Signing up* Logging in Choosing a profile Picking liked videos

More information

PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads

PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads Ran Xu (Purdue), Subrata Mitra (Adobe Research), Jason Rahman (Facebook), Peter Bai (Purdue),

More information

Using Utility to Provision Storage Systems Strunk, Thereska, Faloutsos & Ganger

Using Utility to Provision Storage Systems Strunk, Thereska, Faloutsos & Ganger n e w s l e t t e r o n p d l a c t i v i t i e s a n d e v e n t s S p r i n g 2 0 0 8 http://www.pdl.cmu.edu/ pdl consortium Members http://www.pdl.cmu.edu/publications/ Selected Recent publications

More information

Mining NetFlow Records for Critical Network Activities

Mining NetFlow Records for Critical Network Activities Mining NetFlow Records for Critical Network Activities Shaonan Wang 1, Radu State 1, Mohamed Ourdane 2, and Thomas Engel 1 1 Faculty of Science, Technology and Communication, University of Luxembourg {shaonan.wang,radu.state,thomas.engel}@uni.lu

More information

From the Outside Looking In: Probing Web APIs to Build Detailed Workload Profiles

From the Outside Looking In: Probing Web APIs to Build Detailed Workload Profiles From the Outside Looking In: Probing Web APIs to Build Detailed Workload Profiles Nan Deng, Zichen Xu, Christopher Stewart and Xiaorui Wang The Ohio State University Abstract Cloud applications depend

More information

Keeping an Eye on Virtualization: VM Monitoring Techniques & Applications. Validation

Keeping an Eye on Virtualization: VM Monitoring Techniques & Applications. Validation Keeping an Eye on Virtualization: VM Monitoring Techniques & Applications Validation Why Is this Important? Providing a higher level of reliability and availability is one of the biggest challenges of

More information

Crossing the Chasm: Sneaking a parallel file system into Hadoop

Crossing the Chasm: Sneaking a parallel file system into Hadoop Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large

More information

NetPilot: Automating Datacenter Network Failure Mitigation

NetPilot: Automating Datacenter Network Failure Mitigation NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang Failures are Common and Harmful Network failures are

More information

- From Theory to Practice -

- From Theory to Practice - - From Theory to Practice - Thomas Balthazar Stella Cotton Corey Donohoe Yannick Schutz What is distributed tracing? Tracing requests across distributed system boundaries A Simple Use Case Web Request

More information

Zzyzx: Scalable Fault Tolerance through Byzantine Locking

Zzyzx: Scalable Fault Tolerance through Byzantine Locking Zzyzx: Scalable Fault Tolerance through Byzantine Locking James Hendricks Shafeeq Sinnamohideen Gregory R. Ganger Michael K. Reiter Carnegie Mellon University University of North Carolina at Chapel Hill

More information

Self-tuning Page Cleaning in DB2

Self-tuning Page Cleaning in DB2 Self-tuning Page Cleaning in DB2 Wenguang Wang Richard B. Bunt Department of Computer Science University of Saskatchewan 57 Campus Drive Saskatoon, SK S7N 5A9 Canada Email: {wew36, bunt}@cs.usask.ca Abstract

More information

COPYRIGHT 13 June 2017MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.

COPYRIGHT 13 June 2017MARKLOGIC CORPORATION. ALL RIGHTS RESERVED. Building and Operating High Performance MarkLogic Apps James Clippinger, VP, Strategic Accounts, MarkLogic Erin Miller, Manager, Performance Engineering, MarkLogic COPYRIGHT 13 June 2017MARKLOGIC CORPORATION.

More information

Theia: Visual Signatures for Problem Diagnosis in Large Hadoop Clusters

Theia: Visual Signatures for Problem Diagnosis in Large Hadoop Clusters Theia: Visual Signatures for Problem Diagnosis in Large Hadoop Clusters Elmer Garduno, Soila P. Kavulya, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan Carnegie Mellon University Abstract Diagnosing performance

More information

Performance Modeling of Storage Devices using Machine Learning

Performance Modeling of Storage Devices using Machine Learning Performance Modeling of Storage Devices using Machine Learning Mengzhi Wang CMU-CS-05-185 January 23, 2006 School of Computer Science Computer Science Department Carnegie Mellon University Pittsburgh,

More information

Customizing Progressive JPEG for Efficient Image Storage

Customizing Progressive JPEG for Efficient Image Storage Customizing Progressive JPEG for Efficient Image Storage Eddie Yan Kaiyuan Zhang Xi Wang Karin Strauss Luis Ceze HotStorage 17 July 11, 2017 2 2 2 2 2 2 2 2 2 Summary Today s image hosts need to store

More information

The Case for Semantic Aware Remote Replication

The Case for Semantic Aware Remote Replication The Case for Semantic Aware Remote Replication Xiaotao Liu, Gal Niv, K.K. Ramakrishnan, Prashant Shenoy, Jacobus Van der Merwe University of Massachusetts / AT&T Labs-Research {xiaotaol, gniv, shenoy}@cs.umass.edu

More information

Diagnosing Production-Run Concurrency-Bug Failures. Shan Lu University of Wisconsin, Madison

Diagnosing Production-Run Concurrency-Bug Failures. Shan Lu University of Wisconsin, Madison Diagnosing Production-Run Concurrency-Bug Failures Shan Lu University of Wisconsin, Madison 1 Outline Myself and my group Production-run failure diagnosis What is this problem What are our solutions CCI

More information

Treadmill: Tail Latency Measurement at Microsecond-level Precision. Yunqi Zhang Johann Hauswald David Meisner Jason Mars Lingjia Tang

Treadmill: Tail Latency Measurement at Microsecond-level Precision. Yunqi Zhang Johann Hauswald David Meisner Jason Mars Lingjia Tang Treadmill: Tail Latency Measurement at Microsecond-level Precision Yunqi Zhang Johann Hauswald David Meisner Jason Mars Lingjia Tang Schedule Welcome Section 1: Tail latency 08:40 ~ 09:00 Overview of data

More information

Predictive Elastic Database Systems. Rebecca Taft HPTS 2017

Predictive Elastic Database Systems. Rebecca Taft HPTS 2017 Predictive Elastic Database Systems Rebecca Taft becca@cockroachlabs.com HPTS 2017 1 Modern OLTP Applications Large Scale Cloud-Based Performance is Critical 2 Challenges to transaction performance: skew

More information

RAID4S: Improving RAID Performance with Solid State Drives

RAID4S: Improving RAID Performance with Solid State Drives RAID4S: Improving RAID Performance with Solid State Drives Rosie Wacha UCSC: Scott Brandt and Carlos Maltzahn LANL: John Bent, James Nunez, and Meghan Wingate SRL/ISSDM Symposium October 19, 2010 1 RAID:

More information

Shark: Hive on Spark

Shark: Hive on Spark Optional Reading (additional material) Shark: Hive on Spark Prajakta Kalmegh Duke University 1 What is Shark? Port of Apache Hive to run on Spark Compatible with existing Hive data, metastores, and queries

More information

Metadata Efficiency in a Comprehensive Versioning File System. May 2002 CMU-CS

Metadata Efficiency in a Comprehensive Versioning File System. May 2002 CMU-CS Metadata Efficiency in a Comprehensive Versioning File System Craig A.N. Soules, Garth R. Goodson, John D. Strunk, Gregory R. Ganger May 2002 CMU-CS-02-145 School of Computer Science Carnegie Mellon University

More information

IRONModel: Robust Performance Models in the Wild

IRONModel: Robust Performance Models in the Wild IRONModel: Robust Performance Models in the Wild Eno Thereska Microsoft Research Gregory R. Ganger Carnegie Mellon University ABSTRACT Traditional performance models are too brittle to be relied on for

More information

Detailed Analysis of I/O traces for large scale applications

Detailed Analysis of I/O traces for large scale applications Detailed Analysis of I/O traces for large scale applications N Nakka, A Choudhary, W K Liao Electrical Engineering and Computer Science Northwestern University, Evanston, IL, USA {nakka, choudhar,wkliao}@eecsnorthwesternedu

More information

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju

Distributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju Distributed Data Infrastructures, Fall 2017, Chapter 2 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note: Term Warehouse-scale

More information

Black-Box Problem Diagnosis in Parallel File System

Black-Box Problem Diagnosis in Parallel File System A Presentation on Black-Box Problem Diagnosis in Parallel File System Authors: Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan Presented by: Rishi Baldawa Key Idea Focus is on automatically

More information

PerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds

PerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (TPDS) 1 PerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds Daniel J. Dean, Member, IEEE,

More information

CSE-E5430 Scalable Cloud Computing Lecture 9

CSE-E5430 Scalable Cloud Computing Lecture 9 CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay

More information

An Operating System Architecture for Future Information Appliances

An Operating System Architecture for Future Information Appliances An Operating System Architecture for Future Information Appliances Tatsuo Nakajima Hiroo Ishikawa Yuki Kinebuchi Midori Sugaya Sun Lei Alexandre Courbot Andrej van der Zee Aleksi Aalto Kwon Ki Duk Department

More information

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following: CS 470 Spring 2017 Mike Lam, Professor Distributed Web and File Systems Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapters

More information

PerfGuard: Binary-Centric Application Performance Monitoring in Production Environments

PerfGuard: Binary-Centric Application Performance Monitoring in Production Environments PerfGuard: Binary-Centric Application Performance Monitoring in Production Environments Chung Hwan Kim, Junghwan Rhee *, Kyu Hyung Lee +, Xiangyu Zhang, Dongyan Xu * + Performance Problems Performance

More information

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following: CS 470 Spring 2018 Mike Lam, Professor Distributed Web and File Systems Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapters

More information