Diagnosing performance changes by comparing request flows!
|
|
- Loreen Garrison
- 5 years ago
- Views:
Transcription
1 Diagnosing performance changes by comparing request flows Raja Sambasivan Alice Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Greg Ganger Carnegie Mellon Microsoft Research Google
2 Perf. diagnosis in distributed systems Very difficult and time consuming Root cause could be in any component Request-flow comparison Helps localize performance changes Key insight: Changes manifest as mutations in request timing/structure 2
3 Perf. debugging a feature addition Clients Client Metadata Servers Servers server NFS server Storage nodes 3
4 Perf. debugging a feature addition Before addition: (1) Every file access needs a MDS access Client (6) (4) (2) Metadata server (3) NFS server (5) Storage nodes 4
5 Perf. debugging a feature addition After: Metadata prefetched to clients Most requests donʼt need MDS access Client Metadata server (1) (3) NFS server (2) Storage nodes 5
6 Perf. debugging a feature addition Adding metadata prefetching reduced performance instead of improving it () How to efficiently diagnose this? 6
7 Request-flow comparison will show Precursor Mutation NFS Lookup Call NFS Lookup Call 10μs 10μs MDS DB Lock MDS DB Lock 20μs MDS DB Unlock 350μs 50μs NFS Lookup Reply MDS DB Unlock 50μs NFS Lookup Reply 7
8 Request-flow comparison will show Precursor NFS Lookup Call NFS Lookup Call 10μs 10μs MDS DB Lock MDS DB Lock 20μs MDS DB Unlock 350μs 50μs NFS Lookup Reply MDS DB Unlock 50μs Root cause localized by NFS showing Lookup how Reply mutation and precursor differ 8 Mutation
9 Request-flow comparison Identifies distribution changes Distinct from anomaly detection E.g., Magpie, Pinpoint, etc. Satisfies many use cases Performance regressions/degradations Eliminating the system as the culprit 9
10 Contributions Heuristics for identifying mutations, precursors, and for ranking them Implementation in Spectroscope Use of Spectroscope to diagnose Unsolved problems in Ursa Minor Problems in Google services 10
11 Spectroscope workflow Non-problem period graphs Problem period graphs Categorization Response-time mutation identification Structural mutation identification Ranking UI layer 11
12 Graphs via end-to-end tracing Used in research & production systems E.g., Magpie, X-Trace, Googleʼs Dapper Works as follows: Tracks trace points touched by requests Request-flow graphs obtained by stitching together trace points accessed Yields < 1% overhead w/req. sampling 12
13 Example: Graph for a striped read NFS Read Response-time: Call 8,120μs 100μs Cache 10μs Miss 10μs SN1 Read SN2 Read 1,000μs Call Call 1,000μs Work Work 3,500μs 5,500μs on SN1 on SN2 1,500μs SN1 SN2 1,500μs Reply NFS Reply Reply 13 10μs 10μs
14 Example: Graph for a striped read NFS Read Response-time: Call 8,120μs 100μs Cache 10μs Miss 10μs SN1 Read SN2 Read 1,000μs Call Call 1,000μs Work Work 3,500μs 5,500μs on SN1 on SN2 1,500μs SN1 SN2 1,500μs Reply NFS Reply Nodes show trace points & edges show latencies Reply 14 10μs 10μs
15 Spectroscope workflow Non-problem period graphs Problem period graphs Categorization Response-time mutation identification Structural mutation identification Ranking UI layer 15
16 Categorization step Necessary since it is meaningless to compare individual requests flows Groups together similar request flows Categories: basic unit for comparisons Allows for mutation identification by comparing per-category distributions 16
17 Choosing what to bin into a category Our choice: identically structured reqs Uses same path/similar cost expectation Same path/similar costs notion is valid For 88 99% of Ursa Minor categories For 47 69% of Bigtable categories Lower value due to sparser trace points Lower value also due to contention Aside: Categorization can be used to localize problematic sources of variance 17
18 Spectroscope workflow Non-problem period graphs Problem period graphs Categorization Response-time mutation identification Structural mutation identification Ranking UI layer 18
19 Type 1: Response-time mutations Requests that: are structurally identical in both periods have larger problem period latencies Root cause localized by identifying interactions responsible 19
20 Categories w/response-time mutations Identified via use of a hypothesis test Frequency Sets apart natural variance from mutations Also used to find interactions responsible Category 1 Category 2 Frequency Response-time Response-time 20
21 Categories w/response-time mutations Identified via use of a hypothesis test Frequency Sets apart natural variance from mutations Also used to find interactions responsible Category 1 Category 2 Frequency Response-time IDʼd as containing mutations 21 Response-time Not IDʼd as containing mutations
22 Response-time mutation example Avg. non-problem response time: 110μs NFS Read Call Avg. 10μs SN1 Read Start Avg. 20μs Avg. 1,000μs SN1 Read End Avg. 80μs NFS Read Reply Avg. problem response time: 1,090μs 22
23 Response-time mutation example Avg. non-problem response time: 110μs NFS Read Call Avg. 10μs SN1 Read Start Avg. 20μs Avg. 1,000μs SN1 Read End Avg. problem response time: 1,090μs Avg. 80μs Problem localized by NFS ID ing Read responsible Reply interaction 23
24 Type 2: Structural mutations Requests that: take different paths in the problem period Root caused localized by identifying their precursors Likely path during the non-problem period idʼing how mutation & precursor differ 24
25 IDʼing categories w/structural mutations Assume similar workloads executed Categories with more problem period requests contain mutations Reverse true for precursor categories Threshold used to differentiate natural variance from categories w/mutations 25
26 Mapping mutations to precursors Read Precursors Write Structural Mutation Read? NP: 1,000 NP: 480 NP: 300 P: 150 P: 460 P: 800 Accomplished using three heuristics See paper for details 26
27 Example structural mutation Precursor Mutation NFS Lookup Call NFS Lookup Call 10μs 10μs MDS DB Lock MDS DB Lock 20μs MDS DB Unlock 350μs 50μs NFS Lookup Reply MDS DB Unlock 50μs NFS Lookup Reply 27
28 Example structural mutation Precursor NFS Lookup Call NFS Lookup Call 10μs 10μs MDS DB Lock MDS DB Lock 20μs MDS DB Unlock 350μs 50μs NFS Lookup Reply MDS DB Unlock 50μs Root cause localized by NFS showing Lookup how Reply mutation and precursor 28 differ Mutation
29 Ranking of categories w/mutations Necessary for two reasons There may be more than one problem One problem may yield many mutations Rank based on: # of reqs affected * Δ in response time 29
30 Spectroscope workflow Non-problem period graphs Problem period graphs Categorization Response-time mutation identification Structural mutation identification Ranking UI layer 30
31 Putting it all together: The UI Precursor NFS Lookup Call 10μs MDS DB Lock 20μs MDS DB Unlock 50μs NFS Lookup Reply Mutation NFS Lookup Call 10μs MDS DB Lock 350μs MDS DB Unlock 50μs NFS Lookup Reply Ranked list 1. Structural 2. Response time 20. Structural 31
32 Putting it all together: The UI Mutation NFS Read Call Avg. 10μs SN1 Read Start Avg. 20μs Avg. 100μs SN1 Read End Avg. 80μs NFS Read Reply Ranked list 1. Structural 2. Response time 20. Structural 32
33 Outline Introduction End-to-end tracing & Spectroscope Ursa Minor & Google case studies Summary 33
34 Ursa Minor case studies Used Spectroscope to diagnose real performance problems in Ursa Minor Four were previously unsolved Evaluated ranked list by measuring % of top 10 results that were relevant % of total results that were relevant 34
35 Quantitative results 100% 75% 50% 25% 0% Prefetch Config. RMWs Creates 500us delay % top 10 relevant % total relevant N/A Spikes 35
36 Ursa Minor 5-component config All case studies use this configuration (1) Client (6) (4) (2) Metadata server (3) NFS server (5) Storage nodes 36
37 Case 1: MDS configuration problem Problem: Slowdown in key benchmarks seen after large code merge 37
38 Comparing request flows Identified 128 mutation categories: Most contained structural mutations Mutations and precursors differed only in the storage node accessed Localized bug to unexpected interaction between MDS & the data storage node But, unclear why this was happening 38
39 Further localization Thought mutations may be caused by changes in low-level parameters E.g., function call parameters, etc. Identified parameters that separated 1 st - ranked mutation from its precursors Showed changes in encoding params Localized root cause to two config files Root cause: Change to an obscure config file 39
40 Inter-cluster perf. at Google Load tests run on same software in two datacenters yielded very different perf. Developers said perf. should be similar 40
41 Comparing request flows Revealed many mutations High-ranked ones were response-time Responsible interactions found both within the service and in dependencies Led us to suspect root cause was issue with the slower load testʼs datacenter Root cause: Shared Bigtable in slower load test s datacenter was not working properly 41
42 Summary Introduced request-flow comparison as a new way to diagnose perf. changes Presented algorithms for localizing problems by identifying mutations Showed utility of our approach by using it to diagnose real, unsolved problems 42
43 Related work (I) [Abd-El-Malek05b]: Ursa Minor: versatile cluster-based storage. Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, Jay J. Wylie. FAST, [Barham04]: Using Magpie for request extraction and workload modelling. Paul Barham, Austin Donnelly, Rebecca Isaacs, Richard Mortier. OSDI, [Chen04]: Path-based failure and evolution management. Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, Eric Brewer. NSDI,
44 Related work (II) [Cohen05]: Capturing, indexing, and retrieving system history. Ira Cohen, Steve Zhang, Moises Goldszmidt, Terrence Kelly, Armando Fox. SOSP, [Fonseca07]: X-Trace: A Pervasive Network Tracing Framework. Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica. NSDI, [Hendricks06]: Improving small file performance in objectbased storage. James Hendricks, Raja R. Sambasivan, Shafeeq Sinnamohideen, Gregory R. Ganger. Technical Report CMU-PDL ,
45 Related work (III) [Reynolds06]: Detecting the unexpected in distributed systems. Patrick Reynolds, Charles Killian, Janet L. Wiener, Jeffrey C. Mogul, Mehul A. Shah, Amin Vahdat. NSDI, [Sambasivan07]: Categorizing and differencing system behaviours. Raja R. Sambasivan, Alice X. Zheng, Eno Thereska. HotAC II, [Sambasivan10]: Diagnosing performance problems by visualizing and comparing system behaviours. Raja R. Sambasivan, Alice X. Zheng, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger. CMU-PDL
46 Related work (IV) [Sambasivan11]: Automation without predictability is a recipe for failure. Raja R. Sambasivan and Gregory R. Ganger. Technical Report CMU-PDL , [Sigelman10]: Dapper, a large-scale distributed tracing infrastructure. Benjamin H. Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag. Technical report , [Thereska06b]: Stardust: tracking activity in a distributed storage system. Eno Thereska, Brandon Salmon, John Strunk, Matthew Wachs, Michael Abd-El-Malek, Julio Lopez, Gregory R. Ganger. SIGMETRICS,
Diagnosis via monitoring & tracing
Diagnosis via monitoring & tracing Greg Ganger, Garth Gibson, Majd Sakr adapted from Raja Sambasivan 15-719: Advanced Cloud Computing Spring 2017 1 Problem diagnosis is difficult For developers of clouds
More informationCategorizing and differencing system behaviours
Second Workshop on Hot Topics in Autonomic Computing. June 15, 2007. Jacksonville, FL. Categorizing and differencing system behaviours Raja R. Sambasivan, Alice X. Zheng, Eno Thereska, Gregory R. Ganger
More informationCHECKING, PURE, EFFICIENT, SCALABLE PERFORMANCE DIAGNOSIS FOR PRODUCTION CLOUD COMPUTING SYSTEMS
CHECKING, PURE, EFFICIENT, SCALABLE PERFORMANCE DIAGNOSIS FOR PRODUCTION CLOUD COMPUTING SYSTEMS 1 KHILJI AFZAL HUSSAIN, 2 SYED.ABDULHAQ, 3 P.BABU 1 PG SCHOLAR, CSE (CN), QCET, NELLORE 2,3 ASSOCIATE PROFESSOR,
More informationObserver: keeping system models from becoming obsolete
Second Workshop on Hot Topics in Autonomic Computing. June 15, 2007. Jacksonville, FL. Observer: keeping system models from becoming obsolete Eno Thereska, Anastassia Ailamaki, Gregory R. Ganger Carnegie
More informationTime Capsule: Tracing Packet Latency across Different Layers in Virtualized Systems
Time Capsule: Tracing Packet Latency across Different Layers in Virtualized Systems Kun Suo Jia Rao The University of Texas at Arlington kun.suo@mavs.uta.edu jia.rao@uta.edu Luwei Cheng Facebook chengluwei@fb.com
More informationInternational Journal of Advance Engineering and Research Development. Authenticated Key Exchange Protocols for Parallel Network File Systems
Scientific Journal of Impact Factor (SJIF): 3.134 International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Authenticated
More informationTowards self-predicting systems: What if you could ask what-if?
Towards self-predicting systems: What if you could ask what-if? Eno Thereska Carnegie Mellon University enothereska@cmu.edu Dushyanth Narayanan Microsoft Research, UK dnarayan@microsoft.com Gregory R.
More informationPath-based Macroanalysis for Large Distributed Systems
Path-based Macroanalysis for Large Distributed Systems 1, Anthony Accardi 2, Emre Kiciman 3 Dave Patterson 1, Armando Fox 3, Eric Brewer 1 mikechen@cs.berkeley.edu UC Berkeley 1, Tellme Networks 2, Stanford
More informationAutomatic Configuration of a Distributed Storage System
Automatic Configuration of a Distributed Storage System Keven Chen, Lauro Costa, Steven Liu EECE 571R Final Report, April 2010 The University of British Columbia 1. INTRODUCTION A distributed storage system
More informationAn Experimental Study of Rapidly Alternating Bottlenecks in n-tier Applications
213 IEEE Sixth International Conference on Cloud Computing An Experimental Study of Rapidly Alternating Bottlenecks in n-tier Applications Qingyang Wang 1, Yasuhiko Kanemasa 2, Jack Li 1, Deepal Jayasinghe
More informationSelfTalk for Dena: Query Language and Runtime Support for Evaluating System Behavior
SelfTalk for Dena: Query Language and Runtime Support for Evaluating System Behavior Saeed Ghanbari, Gokul Soundararajan and Cristiana Amza Department of Electrical and Computer Engineering University
More informationPivot Tracing is a monitoring framework for distributed systems that
Pivot Tracing Dynamic Causal Monitoring for Distributed Systems JONATHAN MACE, RYAN ROELKE, AND RODRIGO FONSECA Jonathan Mace is a PhD student in computer science at Brown University, advised by Rodrigo
More information//TRACE: Parallel trace replay with approximate causal events
//TRACE: Parallel trace replay with approximate causal events MICHAEL P. MESNIER, MATTHEW WACHS, RAJA R. SAMBASIVAN, JULIO LOPEZ, JAMES HENDRICKS, GREGORY R. GANGER, DAVID O HALLARON CARNEGIE MELLON UNIVERSITY
More informationDetecting Transient Bottlenecks in n-tier Applications through Fine-Grained Analysis
213 IEEE 33rd International Conference on Distributed Computing Systems Detecting Transient Bottlenecks in n-tier Applications through Fine-Grained Analysis Qingyang Wang 1, Yasuhiko Kanemasa 2, Jack Li
More informationConfiguration-Space Performance Anomaly Depiction
Configuration-Space Performance Anomaly Depiction Christopher Stewart Kai Shen Department of Computer Science University of Rochester {stewart,kshen}@cs.rochester.edu Arun Iyengar Jian Yin IBM Watson Research
More informationSurvey on File System Tracing Techniques
CSE598D Storage Systems Survey on File System Tracing Techniques Byung Chul Tak Department of Computer Science and Engineering The Pennsylvania State University tak@cse.psu.edu 1. Introduction Measuring
More informationThe Game of Twenty Questions: Do You Know Where to Log?
The Game of Twenty Questions: Do You Know Where to Log? Xu Zhao Michael Stumm Kirk Rodrigues Ding Yuan Yu Luo Yuanyuan Zhou University of California San Diego ABSTRACT A production system s printed logs
More informationPrecise Request Tracing and Performance Debugging for Multi-tier Services of Black Boxes
Precise Request Tracing and Performance Debugging for Multi-tier Services of Black Boxes Zhihong Zhang 1,2, Jianfeng Zhan 1, Yong Li 1, Lei Wang 1, Dan Meng 1, Bo Sang 1 1 Institute of Computing Technology,
More informationProfiling Apache HIVE Query from Run Time Logs
Profiling Apache HIVE Query from Run Time Logs Givanna Putri Haryono School of Information Technologies The University of Sydney NSW 2008 Email: ghar1821@uni.sydney.edu.au Ying Zhou School of Information
More informationCLUE: System Trace Analytics for Cloud Service Performance Diagnosis
CLUE: System Trace Analytics for Cloud Service Performance Diagnosis Hui Zhang, Junghwan Rhee, Nipun Arora, Sahan Gamage, Guofei Jiang, Kenji Yoshihira and Dongyan Xu Department of Autonomic Management
More informationPerformance Debugging in Data Centers: Doing More with Less 1
Performance Debugging in Data Centers: Doing More with Less 1 Emmanuel Cecchet, Maitreya Natu, Vaishali Sadaphal, Prashant Shenoy, Harrick Vin Computer Science Department, University of Massachusetts at
More informationService-generated Big Data and Big Data-as-a-Service: An Overview
2013 IEEE International Congress on Big Data Service-generated Big Data and Big Data-as-a-Service: An Overview Zibin Zheng, Jieming Zhu, and Michael R. Lyu Department of Computer Science & Engineering
More informationD3: Declarative Distributed Debugging
D3: Declarative Distributed Debugging Byung-Gon Chun Kuang Chen Gunho Lee Randy H. Katz Scott Shenker Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report
More informationIntroduction Data Model API Building Blocks SSTable Implementation Tablet Location Tablet Assingment Tablet Serving Compactions Refinements
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. M. Burak ÖZTÜRK 1 Introduction Data Model API Building
More informationEnd-to-End Tracing Models: Analysis and Unification
End-to-End Tracing Models: Analysis and Unification Jonathan Leavitt Department of Computer Science Brown University jsleavit@cs.brown.edu May 5, 2014 i Abstract Many applications today require distributed
More informationCauseway: Operating System Support For Controlling And Analyzing The Execution Of Distributed Programs
Causeway: Operating System Support For Controlling And Analyzing The Execution Of Distributed Programs Anupam Chanda, Khaled Elmeleegy, and Alan L. Cox Department of Computer Science Rice University, Houston,
More informationCauseway: Operating System Support For Controlling And Analyzing The Execution Of Distributed Programs
Causeway: Operating System Support For Controlling And Analyzing The Execution Of Distributed Programs Anupam Chanda, Khaled Elmeleegy, and Alan L. Cox Department of Computer Science Rice University, Houston,
More informationAbstract. 1 Introduction
Towards Realistic File-System Benchmarks with CodeMRI Nitin Agrawal, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau Computer Sciences Department, University of Wisconsin-Madison {nitina, dusseau,
More informationReusing migration to simply and efficiently implement multi-server operations in transparently scalable storage systems
Reusing migration to simply and efficiently implement multi-server operations in transparently scalable storage systems Shafeeq Sinnamohideen May 2010 CMU-CS-10-141 School of Computer Science Carnegie
More informationvquery: A Platform for Connecting Configuration and Performance
Ilari Shafer Carnegie Mellon University Snorri Gylfason VMware, Inc. Gregory R. Ganger Carnegie Mellon University Abstract Discovering the causes of performance problems in virtualized systems is often
More informationRAPIDLY ALTERNATING BOTTLENECKS: A STUDY OF TWO CASES IN
RAPIDLY ALTERNATING BOTTLENECKS: A STUDY OF TWO CASES IN N TIER APPLICATIONS 1Qingyang Wang, 2 Yasuhiko Kanemasa, 1 Jack Li, 2 Toshihiro Shimizu, 2 Masazumi Matsubara, 3 Motoyuki Kawaba, 1 Calton Pu 1College
More informationCo-scheduling of disk head time in cluster-based storage
2009 28th IEEE International Symposium on Reliable Distributed Systems (SRDS 09). Co-scheduling of disk head time in cluster-based storage Matthew Wachs, Gregory R. Ganger Parallel Data Laboratory Carnegie
More informationYCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores
YCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores Swapnil Patil Milo Polte, Wittawat Tantisiriroj, Kai Ren, Lin Xiao, Julio Lopez, Garth Gibson, Adam Fuchs *, Billie
More informationGanesha: Black-Box Fault Diagnosis for MapReduce Systems
Ganesha: Black-Box Fault Diagnosis for MapReduce Systems Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan CMU-PDL-08-112 September 2008 Parallel Data Laboratory Carnegie Mellon University
More informationMagpie: Distributed request tracking for realistic performance modelling
Magpie: Distributed request tracking for realistic performance modelling Rebecca Isaacs Paul Barham Richard Mortier Dushyanth Narayanan Microsoft Research Cambridge James Bulpin University of Cambridge
More informationDistributed systems pose unique challenges for
1 of 20 TE X T ONLY Debugging Distributed Systems Challenges and Options for Validation and Debugging IVAN BESCHASTNIKH PATTY WANG YURIY BRUN MICHAEL D. ERNST Distributed systems pose unique challenges
More informationThe Case for Semantic Aware Remote Replication
The Case for Semantic Aware Remote Replication Xiaotao Liu, Gal Niv, K.K. Ramakrishnan, Prashant Shenoy, Jacobus Van der Merwe University of Massachusetts / AT&T Labs-Research Abstract This paper argues
More informationWhich Configuration Option Should I Change?
Which Configuration Option Should I Change? Sai Zhang, Michael D. Ernst University of Washington Presented by: Kıvanç Muşlu I have released a new software version Developers I cannot get used to the UI
More informationWiFröst: Bridging the Information Gap for Debugging of Networked Embedded Systems
WiFröst: Bridging the Information Gap for Debugging of Networked Embedded Systems Will McGrath 1,2, Jeremy Warner 1, Mitchell Karchemsky 1, Andrew Head 1, Daniel Drew 1, Bjoern Hartmann 1 1 UC Berkeley
More informationUsing Runtime Paths for Macro Analysis
Using Runtime Paths for Macro Analysis Mike Chen, Emre Kıcıman, Anthony Accardi, Armando Fox, Eric Brewer {mikechen, brewer}@cs.berkeley.edu, {emrek, fox}@cs.stanford.edu, anthony@tellme.com Abstract We
More informationStatistical Debugging for Real-World Performance Problems. Linhai Song Advisor: Prof. Shan Lu
Statistical Debugging for Real-World Performance Problems Linhai Song Advisor: Prof. Shan Lu 1 Software Efficiency is Critical No one wants slow and inefficient software Frustrate end users Cause economic
More informationEliminating cross-server operations in scalable file systems
Eliminating cross-server operations in scalable file systems James Hendricks, Shafeeq Sinnamohideen, Raja R. Sambasivan, Gregory R. Ganger CMU-PDL-06-105 May 2006 Parallel Data Laboratory Carnegie Mellon
More informationSoftware Error Early Detection System based on Run-time Statistical Analysis of Function Return Values
Software Error Early Detection System based on Run-time Statistical Analysis of Function Return Values Alex Depoutovitch and Michael Stumm Department of Computer Science, Department Electrical and Computer
More informationIndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion
IndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion Kai Ren Qing Zheng, Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University Why Scalable
More information3. Composite-file File System. 2. Observations
The Composite-file File System: Decoupling the One-to-one Mapping of Files and Metadata for Better Performance Shuanglong Zhang, Helen Catanese, and An-I Andy Wang Computer Science Department, Florida
More informationStatistical Debugging for Real-World Performance Problems
Statistical Debugging for Real-World Performance Problems Linhai Song 1 and Shan Lu 2 1 University of Wisconsin-Madison 2 University of Chicago What are Performance Problems? Definition of Performance
More informationFile system virtual appliances
File system virtual appliances MICHAEL ABD-EL-MALEK May 2010 CMU-PDL-09-109 Dept. of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213 Submitted in partial fulfillment
More informationDetecting Performance Anomalies in Global Applications
Detecting Performance Anomalies in Global Applications Terence Kelly Hewlett-Packard Laboratories 1501 Page Mill Road m/s 1125 Palo Alto, CA 94304 USA Abstract Understanding real, large distributed systems
More informationPredicting Response Times for the Spotify Backend
Predicting Response Times for the Spotify Backend Rerngvit Yanggratoke, Gunnar Kreitz, Mikael Goldmann,andRolfStadler ACCESS Linnaeus Center, KTH Royal Institute of Technology, Sweden Spotify, Sweden Email:
More informationBigtable. Presenter: Yijun Hou, Yixiao Peng
Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 06 Presenter: Yijun Hou, Yixiao Peng
More informationSEER: LEVERAGING BIG DATA TO NAVIGATE THE COMPLEXITY OF PERFORMANCE DEBUGGING IN CLOUD MICROSERVICES
SEER: LEVERAGING BIG DATA TO NAVIGATE THE COMPLEXITY OF PERFORMANCE DEBUGGING IN CLOUD MICROSERVICES Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou Cornell
More informationUsing Runtime Paths for Macroanalysis
Using Runtime Paths for Macroanalysis Mike Chen, Emre Kıcıman, Anthony Accardi, Armando Fox, Eric Brewer UC Berkeley, Stanford University, Tellme Networks {mikechen, brewer}@cs.berkeley.edu, {emrek, fox}@cs.stanford.edu,
More informationOptimizing SQL Transactions
High Performance Oracle Optimizing SQL Transactions Dave Pearson Quest Software Copyright 2006 Quest Software Quest Solutions for Enterprise IT Quest Software develops innovative products that help customers
More informationVirtual Allocation: A Scheme for Flexible Storage Allocation
Virtual Allocation: A Scheme for Flexible Storage Allocation Sukwoo Kang, and A. L. Narasimha Reddy Dept. of Electrical Engineering Texas A & M University College Station, Texas, 77843 fswkang, reddyg@ee.tamu.edu
More informationThe Composite-file File System: Decoupling the One-to-One Mapping of Files and Metadata for Better Performance
The Composite-file File System: Decoupling the One-to-One Mapping of Files and Metadata for Better Performance Shuanglong Zhang, Helen Catanese, and An-I Andy Wang, Florida State University https://www.usenix.org/conference/fast16/technical-sessions/presentation/zhang-shuanglong
More informationConnecting the Dots: Using Runtime Paths for Macro Analysis
Connecting the Dots: Using Runtime Paths for Macro Analysis Mike Chen mikechen@cs.berkeley.edu http://pinpoint.stanford.edu Motivation Divide and conquer, layering, and replication are fundamental design
More informationSweet Storage SLOs with Frosting
Sweet Storage SLOs with Frosting Andrew Wang Shivaram Venkataraman Sara Alspaugh Ion Stoica Randy H. Katz Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report
More informationDynamically Provisioning Distributed Systems to Meet Target Levels of Performance, Availability, and Data Quality
Dynamically Provisioning Distributed Systems to Meet Target Levels of Performance, Availability, and Data Quality Amin Vahdat Department of Computer Science Duke University 1 Introduction Increasingly,
More informationYCSB++ benchmarking tool Performance debugging advanced features of scalable table stores
YCSB++ benchmarking tool Performance debugging advanced features of scalable table stores Swapnil Patil M. Polte, W. Tantisiriroj, K. Ren, L.Xiao, J. Lopez, G.Gibson, A. Fuchs *, B. Rinaldi * Carnegie
More informationBigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng
Bigtable: A Distributed Storage System for Structured Data Andrew Hon, Phyllis Lau, Justin Ng What is Bigtable? - A storage system for managing structured data - Used in 60+ Google services - Motivation:
More informationStatistics Driven Workload Modeling for the Cloud
UC Berkeley Statistics Driven Workload Modeling for the Cloud Archana Ganapathi, Yanpei Chen Armando Fox, Randy Katz, David Patterson SMDB 2010 Data analytics are moving to the cloud Cloud computing economy
More informationA High Throughput Atomic Storage Algorithm
A High Throughput Atomic Storage Algorithm Rachid Guerraoui EPFL, Lausanne Switzerland Dejan Kostić EPFL, Lausanne Switzerland Ron R. Levy EPFL, Lausanne Switzerland Vivien Quéma CNRS, Grenoble France
More informationResearch and Teaching Statement
Research and Teaching Statement Armando Fox February 2008 Research Summary My research focuses on both policy and mechanism for managing datacenter-scale installations (thousands of computers) of interactive-response,
More informationStorage Systems for Shingled Disks
Storage Systems for Shingled Disks Garth Gibson Carnegie Mellon University and Panasas Inc Anand Suresh, Jainam Shah, Xu Zhang, Swapnil Patil, Greg Ganger Kryder s Law for Magnetic Disks Market expects
More informationA Two-Tiered Software Architecture for Automated Tuning of Disk Layouts
A Two-Tiered Software Architecture for Automated Tuning of Disk Layouts Brandon Salmon, Eno Thereska, Craig A.N. Soules, Gregory R. Ganger Apr 2003 CMU-CS-03-130 School of Computer Science Carnegie Mellon
More informationErik Riedel Hewlett-Packard Labs
Erik Riedel Hewlett-Packard Labs Greg Ganger, Christos Faloutsos, Dave Nagle Carnegie Mellon University Outline Motivation Freeblock Scheduling Scheduling Trade-Offs Performance Details Applications Related
More informationCrossing the Chasm: Sneaking a parallel file system into Hadoop
Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large
More informationThe Datacenter Needs an Operating System
UC BERKELEY The Datacenter Needs an Operating System Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter Needs an Operating System 3. Mesos,
More informationEVCache: Lowering Costs for a Low Latency Cache with RocksDB. Scott Mansfield Vu Nguyen EVCache
EVCache: Lowering Costs for a Low Latency Cache with RocksDB Scott Mansfield Vu Nguyen EVCache 90 seconds What do caches touch? Signing up* Logging in Choosing a profile Picking liked videos
More informationPYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads
PYTHIA: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads Ran Xu (Purdue), Subrata Mitra (Adobe Research), Jason Rahman (Facebook), Peter Bai (Purdue),
More informationUsing Utility to Provision Storage Systems Strunk, Thereska, Faloutsos & Ganger
n e w s l e t t e r o n p d l a c t i v i t i e s a n d e v e n t s S p r i n g 2 0 0 8 http://www.pdl.cmu.edu/ pdl consortium Members http://www.pdl.cmu.edu/publications/ Selected Recent publications
More informationMining NetFlow Records for Critical Network Activities
Mining NetFlow Records for Critical Network Activities Shaonan Wang 1, Radu State 1, Mohamed Ourdane 2, and Thomas Engel 1 1 Faculty of Science, Technology and Communication, University of Luxembourg {shaonan.wang,radu.state,thomas.engel}@uni.lu
More informationFrom the Outside Looking In: Probing Web APIs to Build Detailed Workload Profiles
From the Outside Looking In: Probing Web APIs to Build Detailed Workload Profiles Nan Deng, Zichen Xu, Christopher Stewart and Xiaorui Wang The Ohio State University Abstract Cloud applications depend
More informationKeeping an Eye on Virtualization: VM Monitoring Techniques & Applications. Validation
Keeping an Eye on Virtualization: VM Monitoring Techniques & Applications Validation Why Is this Important? Providing a higher level of reliability and availability is one of the biggest challenges of
More informationCrossing the Chasm: Sneaking a parallel file system into Hadoop
Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large
More informationNetPilot: Automating Datacenter Network Failure Mitigation
NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang Failures are Common and Harmful Network failures are
More information- From Theory to Practice -
- From Theory to Practice - Thomas Balthazar Stella Cotton Corey Donohoe Yannick Schutz What is distributed tracing? Tracing requests across distributed system boundaries A Simple Use Case Web Request
More informationZzyzx: Scalable Fault Tolerance through Byzantine Locking
Zzyzx: Scalable Fault Tolerance through Byzantine Locking James Hendricks Shafeeq Sinnamohideen Gregory R. Ganger Michael K. Reiter Carnegie Mellon University University of North Carolina at Chapel Hill
More informationSelf-tuning Page Cleaning in DB2
Self-tuning Page Cleaning in DB2 Wenguang Wang Richard B. Bunt Department of Computer Science University of Saskatchewan 57 Campus Drive Saskatoon, SK S7N 5A9 Canada Email: {wew36, bunt}@cs.usask.ca Abstract
More informationCOPYRIGHT 13 June 2017MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
Building and Operating High Performance MarkLogic Apps James Clippinger, VP, Strategic Accounts, MarkLogic Erin Miller, Manager, Performance Engineering, MarkLogic COPYRIGHT 13 June 2017MARKLOGIC CORPORATION.
More informationTheia: Visual Signatures for Problem Diagnosis in Large Hadoop Clusters
Theia: Visual Signatures for Problem Diagnosis in Large Hadoop Clusters Elmer Garduno, Soila P. Kavulya, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan Carnegie Mellon University Abstract Diagnosing performance
More informationPerformance Modeling of Storage Devices using Machine Learning
Performance Modeling of Storage Devices using Machine Learning Mengzhi Wang CMU-CS-05-185 January 23, 2006 School of Computer Science Computer Science Department Carnegie Mellon University Pittsburgh,
More informationCustomizing Progressive JPEG for Efficient Image Storage
Customizing Progressive JPEG for Efficient Image Storage Eddie Yan Kaiyuan Zhang Xi Wang Karin Strauss Luis Ceze HotStorage 17 July 11, 2017 2 2 2 2 2 2 2 2 2 Summary Today s image hosts need to store
More informationThe Case for Semantic Aware Remote Replication
The Case for Semantic Aware Remote Replication Xiaotao Liu, Gal Niv, K.K. Ramakrishnan, Prashant Shenoy, Jacobus Van der Merwe University of Massachusetts / AT&T Labs-Research {xiaotaol, gniv, shenoy}@cs.umass.edu
More informationDiagnosing Production-Run Concurrency-Bug Failures. Shan Lu University of Wisconsin, Madison
Diagnosing Production-Run Concurrency-Bug Failures Shan Lu University of Wisconsin, Madison 1 Outline Myself and my group Production-run failure diagnosis What is this problem What are our solutions CCI
More informationTreadmill: Tail Latency Measurement at Microsecond-level Precision. Yunqi Zhang Johann Hauswald David Meisner Jason Mars Lingjia Tang
Treadmill: Tail Latency Measurement at Microsecond-level Precision Yunqi Zhang Johann Hauswald David Meisner Jason Mars Lingjia Tang Schedule Welcome Section 1: Tail latency 08:40 ~ 09:00 Overview of data
More informationPredictive Elastic Database Systems. Rebecca Taft HPTS 2017
Predictive Elastic Database Systems Rebecca Taft becca@cockroachlabs.com HPTS 2017 1 Modern OLTP Applications Large Scale Cloud-Based Performance is Critical 2 Challenges to transaction performance: skew
More informationRAID4S: Improving RAID Performance with Solid State Drives
RAID4S: Improving RAID Performance with Solid State Drives Rosie Wacha UCSC: Scott Brandt and Carlos Maltzahn LANL: John Bent, James Nunez, and Meghan Wingate SRL/ISSDM Symposium October 19, 2010 1 RAID:
More informationShark: Hive on Spark
Optional Reading (additional material) Shark: Hive on Spark Prajakta Kalmegh Duke University 1 What is Shark? Port of Apache Hive to run on Spark Compatible with existing Hive data, metastores, and queries
More informationMetadata Efficiency in a Comprehensive Versioning File System. May 2002 CMU-CS
Metadata Efficiency in a Comprehensive Versioning File System Craig A.N. Soules, Garth R. Goodson, John D. Strunk, Gregory R. Ganger May 2002 CMU-CS-02-145 School of Computer Science Carnegie Mellon University
More informationIRONModel: Robust Performance Models in the Wild
IRONModel: Robust Performance Models in the Wild Eno Thereska Microsoft Research Gregory R. Ganger Carnegie Mellon University ABSTRACT Traditional performance models are too brittle to be relied on for
More informationDetailed Analysis of I/O traces for large scale applications
Detailed Analysis of I/O traces for large scale applications N Nakka, A Choudhary, W K Liao Electrical Engineering and Computer Science Northwestern University, Evanston, IL, USA {nakka, choudhar,wkliao}@eecsnorthwesternedu
More informationDistributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju
Distributed Data Infrastructures, Fall 2017, Chapter 2 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note: Term Warehouse-scale
More informationBlack-Box Problem Diagnosis in Parallel File System
A Presentation on Black-Box Problem Diagnosis in Parallel File System Authors: Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan Presented by: Rishi Baldawa Key Idea Focus is on automatically
More informationPerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (TPDS) 1 PerfCompass: Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds Daniel J. Dean, Member, IEEE,
More informationCSE-E5430 Scalable Cloud Computing Lecture 9
CSE-E5430 Scalable Cloud Computing Lecture 9 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 15.11-2015 1/24 BigTable Described in the paper: Fay
More informationAn Operating System Architecture for Future Information Appliances
An Operating System Architecture for Future Information Appliances Tatsuo Nakajima Hiroo Ishikawa Yuki Kinebuchi Midori Sugaya Sun Lei Alexandre Courbot Andrej van der Zee Aleksi Aalto Kwon Ki Duk Department
More informationCS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following:
CS 470 Spring 2017 Mike Lam, Professor Distributed Web and File Systems Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapters
More informationPerfGuard: Binary-Centric Application Performance Monitoring in Production Environments
PerfGuard: Binary-Centric Application Performance Monitoring in Production Environments Chung Hwan Kim, Junghwan Rhee *, Kyu Hyung Lee +, Xiangyu Zhang, Dongyan Xu * + Performance Problems Performance
More informationCS 470 Spring Distributed Web and File Systems. Mike Lam, Professor. Content taken from the following:
CS 470 Spring 2018 Mike Lam, Professor Distributed Web and File Systems Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapters
More information