Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency


Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency
Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble
February 2, 2015

Introduction: What is Tail Latency?

[Figure: latency distribution; x-axis: request processing time, y-axis: fraction of requests, showing a long tail]

In Facebook's Memcached deployment, the median latency is 100 µs, but the 95th-percentile latency is 1 ms.

In this talk, we will explore:
Why do some requests take longer than expected?
What causes them to get delayed?

Why is the Tail Important?

Low latency is crucial for interactive services.
A 500 ms delay can cause a 20% drop in user traffic [Google study].
Latency is directly tied to traffic, and hence to revenue.

What makes this challenging is today's datacenter workloads:
Interactive services are highly parallel; a single client request spawns thousands of sub-tasks.
Overall latency depends on the slowest sub-task's latency.
Bad tail: the probability of any one sub-task getting delayed is high.

A Real-Life Example

All requests have to finish within the SLA latency.
Nishtala et al., "Scaling Memcache at Facebook," NSDI 2013.

What Can We Do?

People in industry have worked hard on solutions:
Hedged requests [Jeff Dean et al.]: effective sometimes, but adds application-specific complexity.
Intelligently avoid slow machines: keep track of server status and route requests around slow nodes.

These are attempts to build predictable responses out of less predictable parts. We still don't know what causes requests to get delayed.

Our Approach

1 Pick some real-life applications: RPC server, Memcached, Nginx.
2 Generate the ideal latency distribution.
3 Measure the actual distribution on a standard Linux server.
4 Identify a factor causing deviation from the ideal distribution.
5 Explain and mitigate it.
6 Iterate until we reach the ideal distribution.

Rest of the Talk

1 Introduction
2 Predicted Latency from Queuing Models
3 Measurements: Sources of Tail Latencies
4 Summary

Predicted Latency from Queuing Models: Ideal Latency Distribution

What is the ideal latency for a network server? It serves as the ideal baseline for comparing measured performance.

Assume a simple model, and apply queuing theory:

[Diagram: multiple clients issue requests into a single queue at the server]

Given the arrival distribution and the request processing time, we can predict the time spent by a request in the server.

Tail Latency Characteristics

What is the ideal latency distribution? Assume a server with a single worker and a fixed 50 µs processing time.

[Figure: CCDF P[X >= x] vs. latency in microseconds (log-log axes), with curves for Uniform Request Arrival, Poisson at 70% Utilization, Poisson at 90% Utilization, and Poisson at 70% Utilization with 4 workers. Reading the tail off one example curve: the 99th percentile is 60 µs and the 99.9th percentile is 200 µs.]

Even in the ideal model there is inherent tail latency, due to request burstiness.
Tail latency depends on the average server utilization.
Additional workers can reduce tail latency, even at constant utilization.
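These ideal curves can be reproduced with a simple simulation. Below is a minimal sketch (not from the talk): it simulates a single FIFO worker with a fixed 50 µs service time under Poisson arrivals and prints tail percentiles. Function names and parameters are illustrative.

```python
import random

def simulate_md1(utilization, service_us=50.0, n_requests=200_000, seed=1):
    """Simulate an M/D/1 queue: Poisson arrivals, fixed service time,
    a single FIFO worker. Returns per-request latencies in microseconds."""
    rng = random.Random(seed)
    mean_interarrival = service_us / utilization  # arrival rate = util / service time
    now = 0.0             # arrival clock
    server_free_at = 0.0  # when the worker next becomes idle
    latencies = []
    for _ in range(n_requests):
        now += rng.expovariate(1.0 / mean_interarrival)
        start = max(now, server_free_at)         # wait if the worker is busy
        server_free_at = start + service_us
        latencies.append(server_free_at - now)   # queuing delay + service time
    return latencies

def percentile(xs, p):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100.0 * len(xs)))]

for util in (0.7, 0.9):
    lat = simulate_md1(util)
    print(f"util={util:.0%}  p50={percentile(lat, 50):.0f}us  "
          f"p99={percentile(lat, 99):.0f}us  p99.9={percentile(lat, 99.9):.0f}us")
```

As the slide's curves suggest, raising utilization from 70% to 90% inflates the tail percentiles far more than the median.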

1 Introduction
2 Predicted Latency from Queuing Models
3 Measurements: Sources of Tail Latencies
4 Summary

Testbed

Cluster of standard datacenter machines:
2x Intel L5640 6-core CPUs
24 GB of DRAM
Mellanox 10 Gbps NIC
Ubuntu 12.04, Linux kernel 3.2.0

All servers are connected to a single 10 Gbps ToR switch. One server runs Memcached; the others run workload-generating clients. Results for the other applications are in the paper.

Timestamping Methodology

Append a blank 32-byte buffer to each request, and overwrite the buffer with timestamps as the request moves through the server:
Incoming server NIC
After TCP/UDP processing
Memcached thread scheduled on CPU
Memcached read() return
Memcached write()
Outgoing server NIC

Very low overhead, and no server-side logging.
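A minimal sketch of the buffer-overwrite idea (illustrative, not the authors' actual instrumentation): each stage stamps the current time into its reserved slot of the request's trailing 32-byte buffer. The slot layout and helper names here are assumptions.

```python
import struct
import time

TRAILER = struct.calcsize("<4q")  # 4 slots x 8 bytes = 32-byte trailer

def stamp(buf: bytearray, slot: int) -> None:
    """Overwrite one 8-byte slot in the request's trailing buffer
    with the current monotonic time in nanoseconds."""
    off = len(buf) - TRAILER + 8 * slot
    struct.pack_into("<q", buf, off, time.monotonic_ns())

# Example: a request payload with a blank 32-byte trailer appended.
req = bytearray(b"GET some_key") + bytearray(TRAILER)
stamp(req, 0)   # e.g., after protocol processing
stamp(req, 1)   # e.g., when the worker thread is scheduled
print(struct.unpack_from("<4q", req, len(req) - TRAILER))
```

Because the timestamps travel inside the request itself, the client can recover the full per-stage breakdown without any logging on the server.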

How Far Are We from the Ideal?

[Figure: CCDF P[X >= x] vs. latency in microseconds, comparing the ideal model against standard Linux; single CPU, single core, Memcached running at 80% utilization]

At the tail, standard Linux is about 30x worse than the ideal model.

Rest of the talk: three sources of tail latency, each with a potential way to fix it.
Background Processes
Multicore Concurrency
Interrupt Processing

How Can Background Processes Affect Tail Latency?

Memcached threads time-share a CPU core with other processes, so a request may have to wait for another process to relinquish the CPU. Scheduling time slices are usually a couple of milliseconds.

How can we mitigate it? (See the sketch below.)
Raise priority (decrease niceness): more CPU time.
Upgrade the scheduling class to real-time: pre-emption power.
Run on a dedicated core: no interference whatsoever.
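On Linux, the last two mitigations can be applied from user space. Here is a minimal sketch, assuming Python on Linux with root privileges (required for SCHED_FIFO); the priority value and core number are illustrative.

```python
import os

# Upgrade this process to the real-time scheduling class (SCHED_FIFO),
# so it pre-empts normal (CFS) processes. Priority 50 is illustrative.
try:
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(50))
except PermissionError:
    print("SCHED_FIFO requires root")

# Pin the process to a dedicated core (core 3, illustrative).
# For true isolation, other processes and interrupts must also be
# kept off that core (e.g., the isolcpus= boot parameter plus IRQ
# affinity settings).
os.sched_setaffinity(0, {3})
```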

Impact of Background Processes

[Figure: CCDF of Memcached latency for the ideal model, standard Linux, maximum priority, realtime scheduling, and a dedicated core; single CPU, single core, Memcached running at 80% utilization]

The tail gap relative to the ideal model shrinks from 30x (standard Linux) to 10x (maximum priority), 4x (realtime scheduling), and 3x (dedicated core). Interference from background processes has a large effect on the tail.

Sources of tail latency and potential fixes, so far:
Background Processes: isolate by running on a dedicated core.
Next: Multicore Concurrency, then Interrupt Processing.

Does Adding More CPU Cores Improve Tail Latency?

[Figure: CCDF of latency for the 1-core ideal model, 4-core ideal model, 1-core Linux, and 4-core Linux; single CPU, 4 cores, Memcached running at 80% utilization]

The 1-core Linux configuration sits about 3x from the 1-core ideal model, but 4-core Linux is about 15x from the 4-core ideal model.

Does Adding More CPU Cores Improve Tail Latency?

Yes it does, provided we maintain a single-queue abstraction. Memcached, however, partitions requests statically among its threads.

[Diagram: ideal model, with one shared queue feeding all workers, vs. the Memcached architecture, with a separate queue per thread]

How can we mitigate it? Modify Memcached's concurrency model to use a single queue (see the sketch below).
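A minimal sketch of the single-queue model, using Python threads (illustrative; the talk modified Memcached itself, which is C). With one shared queue, a burst of requests is absorbed by whichever worker frees up first, instead of piling up behind a single statically assigned thread.

```python
import queue
import threading

def handle(req) -> None:
    pass  # placeholder for real request processing

def worker(q: queue.Queue) -> None:
    while True:
        req = q.get()      # all workers pull from one shared queue
        if req is None:    # sentinel: shut down
            return
        handle(req)

# Single-queue model: one queue, N workers.
q = queue.Queue()
workers = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in workers:
    t.start()
for i in range(100):
    q.put(f"request-{i}")
for _ in workers:
    q.put(None)
for t in workers:
    t.join()
```

The contrast with Memcached's default model is that a per-thread design hashes each connection to one fixed queue, so a burst on one queue waits even while other workers sit idle.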

Impact of Multicore Concurrency Model

[Figure: CCDF of latency for the 4-core ideal model, 1-core Linux, 4-core Linux, and 4-core Linux with a single queue; single CPU, 4 cores, Memcached running at 80% utilization]

The single-queue version shrinks the tail gap from 15x to 4x relative to the 4-core ideal model. For multi-threaded applications, a single-queue abstraction can reduce tail latency.

Sources of tail latency and potential fixes, so far:
Background Processes: isolate by running on a dedicated core.
Concurrency Model: ensure a single-queue abstraction.
Next: Interrupt Processing.

How Can Interrupts Affect Tail Latency?

By default, Linux's irqbalance spreads interrupts across all cores, so the OS pre-empts Memcached threads frequently. This introduces extra context-switching overhead and cache pollution.

How can we mitigate it? Use separate cores for interrupt processing and application threads: 3 cores run Memcached threads, and 1 core processes interrupts. (A sketch follows.)
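Interrupt steering is done through procfs on Linux. Here is a minimal sketch, assuming root privileges and an illustrative IRQ number; the NIC's real IRQ numbers must be looked up in /proc/interrupts.

```python
# Pin a NIC interrupt to core 0 by writing a hex CPU bitmask to procfs.
# IRQ 90 is illustrative: check /proc/interrupts for the NIC's actual
# IRQ numbers. Requires root, and irqbalance must be stopped first or
# it will rewrite the affinity.
NIC_IRQ = 90
CPU_MASK = 0x1  # bit 0 set -> deliver this interrupt only on core 0

with open(f"/proc/irq/{NIC_IRQ}/smp_affinity", "w") as f:
    f.write(f"{CPU_MASK:x}")
```

Combined with the affinity settings from the earlier sketch, this gives the 3-application-cores-plus-1-interrupt-core split described above.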

Impact of Interrupt Processing

[Figure: CCDF of latency for the 4-core ideal model, 4-core Linux with interrupts spread across all cores, and 4-core Linux with a separate interrupt core; single CPU, 4 cores, Memcached running at 80% utilization]

Dedicating separate cores to interrupt and application processing improves tail latency; the interrupt-spread configuration sits about 4x from the ideal.

Other Sources of Tail Latency

Thread scheduling policy: non-FIFO ordering of requests.
NUMA effects: increased latency across NUMA nodes.
Hyper-threading: contending hyper-threads can increase latency.
Power-saving features: extra time required to wake the CPU from an idle state. (A sketch of one mitigation follows.)
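The last item can often be mitigated from user space through the Linux PM QoS interface, /dev/cpu_dma_latency. This is not discussed in the talk; it is one known way to keep CPUs out of deep idle states. A minimal sketch, assuming Linux and root privileges:

```python
import os
import struct
import time

# Request a maximum wakeup latency of 0 us, which keeps CPUs out of
# deep C-states. The kernel honors the request only while this file
# descriptor stays open.
fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
os.write(fd, struct.pack("<i", 0))
try:
    time.sleep(60)  # run the latency-sensitive workload here
finally:
    os.close(fd)    # closing the fd lets CPUs enter deep idle again
```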

Summary and Future Work

We explored hardware, OS, and application-level sources of tail latency, pinpointing them using fine-grained timestamping and an ideal model. We obtained substantial improvements, close to the ideal distributions: the 99.9th-percentile latency of Memcached dropped from 5 ms to 32 µs.

Future work:
Sources of tail latency in multi-process environments.
How does virtualization affect tail latency? The overhead of virtualization itself, and interference from other VMs.
New effects when moving to a distributed setting, such as network effects.