D2P: A Distributed Deadline Propagation Approach to Tolerate Long-Tail Latency in Datacenters


Rui Ren, Jiuyue Ma, Xiufeng Sui, Yungang Bao
Institute of Computing Technology, Chinese Academy of Sciences
{renrui, suixiufeng, baoyg}@ict.ac.cn, majiuyue@ncic.ac.cn

Abstract

We propose a Distributed Deadline Propagation (D2P) approach for datacenter applications to tolerate latency variability. The key idea of D2P is to allow local nodes to perceive global deadline information and to propagate that information among distributed nodes. Local nodes can leverage the information to schedule requests and adjust processing speed, reducing latency variability. Preliminary experimental results show that D2P has the potential to reduce long-tail latency in datacenters by leveraging propagated deadline information on local nodes.

1 Introduction

Time is money. For Internet companies, the response time of online services strongly affects user experience, which is a key factor for revenue. For instance, Amazon found that every 100ms increase in the load time of Amazon.com decreases sales by 1% [13]; Google's advertising revenue declines by 20% when response time increases from 0.4s to 0.9s [2]; and Bing's per-user revenue declines by 4.3% when response time increases from 50ms to 2000ms [20].

For the sake of users, datacenter operators usually overprovision resources to guarantee the QoS of these latency-critical applications, even if doing so lowers resource utilization. For instance, Google [5] reports that the CPU utilization of 20,000 servers in a typical online-services datacenter averaged about 30% from January to March 2013. In contrast, batch-workload datacenters averaged 75% utilization during the same period [5].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
APSys '14, June 25-26, 2014, Beijing, China. Copyright 2014 ACM.

Resource sharing is an effective approach to improving datacenter resource utilization, but it also introduces unpredictable performance variability due to interference. Even worse, the variability within a server can be amplified by scale in datacenters [8, 9]; this long-tail phenomenon severely degrades the responsiveness of latency-sensitive services. Datacenter operators therefore usually have to make a tough tradeoff between application QoS and datacenter utilization: either disregard QoS to maximize utilization, or disallow the colocation of latency-critical online applications with other applications to guarantee QoS.

Since latency variability is inevitable, and is impractical to eliminate fully in shared environments, tolerating latency variability (i.e., tail-tolerant computing [9]) has become an important issue in datacenters. There are studies on latency analysis [13, 14, 25], on reducing latency variability for large fan-out search-engine systems through selective replication and backup-request techniques [8], and on reducing network latency [4, 23, 24, 26]. However, these techniques are unsuitable for sequential/dependent applications. In this paper, we use a stage-service model to formalize datacenter applications, and then propose a Distributed Deadline Propagation (D2P) approach to tolerate latency variability for latency-critical applications.
The idea of D2P is inspired by the traffic-light system in Manhattan, New York City, where one can enjoy a chain of green lights after stopping at a red light. In datacenters, latency variability at a node (an analogy to a red light) alters the real-time laxity of subsequent steps, which are, however, unaware of the changed real-time requirements. We implement D2P in Dubbo [1], an open-source distributed service framework. Our preliminary experimental results show that D2P is able to reduce latency variability.

Figure 1: Example of variability-range analysis. Left (from [6]): execution-time distribution of Equake (SPEC CPU2000, 10,000 runs) on a single machine; avg = 2.21, max = 2.3, min = 2.18 (x10^5 us), so the variability range is (2.3 - 2.18)/2.21 * 100% ≈ 0.5%. Right (from [13]): latency profile of a Google backend service; avg = 5ms, max > 300ms, min < 1ms, so the variability range is greater than (300 - 1)/5 * 100% = 600%.

Figure 2: Latency variability is significantly amplified as the depth of service stages increases (response-time distributions for stage-1 through stage-5).

2 Datacenter Applications and Variability

2.1 Datacenter Application Patterns

There are three basic patterns: (1) Partition/Aggregate pattern. This pattern is used by applications such as web search and by data-processing services like MapReduce and Dryad, which scale out by partitioning tasks into many sub-tasks and assigning them to worker machines (possibly at multiple layers) [26]. The fan-out can often be 100X or even 1000X. (2) Sequential/Dependent pattern. In this pattern, requests are processed sequentially, and subsequent requests depend on the results of previous ones. Online shopping services are typical examples. (3) Hybrid pattern combining Sequential/Dependent and Partition/Aggregate. In practice, the hybrid pattern is the most common in datacenters. For example, in Facebook [17], a single user request can result in hundreds of memcached requests that form a dependency graph. When a dependency graph is executed in a datacenter, requests without dependencies can be sent concurrently. Since there is existing work on the Partition/Aggregate pattern [8, 9], this work focuses on the latter two patterns.

2.2 Latency Variability and Long Tail

In both single-machine and datacenter environments, performance (latency) variability is inevitable [6, 13], due to resource sharing, queuing, background maintenance activities, etc.
In datacenters, however, the variability within a server can be amplified when applications are scaled out [8, 9]. For example, Figure 1 compares the performance variability of the execution-time distribution of Equake on a single machine [6] with that of Google backend services in a datacenter [13]. We define the variability range as ((Max - Min)/Average) * 100%. By this definition, the variability range on a single machine is less than 1%, but in a datacenter environment it can exceed 600%. The reason for this large variability is that a user request is divided into a number of sub-requests that are assigned to hundreds or even thousands of machines, so the response time of the request depends on the slowest machine. Assume that only 1% of sub-requests on each machine suffer a slow processing time, e.g., more than one second. If a user request must be processed on 100 such machines in parallel, then 63% of user requests will take longer than one second [9]. This is the long-tail phenomenon.

The long-tail phenomenon exists not only in the Partition/Aggregate pattern but also in the Sequential/Dependent and Hybrid patterns. For example, we implemented a simple five-stage service and measured the response time of each stage. Figure 2 shows that the variance of response time in this five-stage service is exacerbated by 51.2%; in particular, the 90th-percentile and 95th-percentile latencies increase by 2.6X and 2.4X, respectively, from stage-1 to stage-5.

3 Distributed Deadline Propagation (D2P)

In this work we focus on tolerating the long tail in the Sequential/Dependent and Hybrid patterns.

3.1 Stage-Service Model

We define the Stage-Service Model (SSM) to describe services containing these two patterns. In SSM, applications are composed of multiple service stages connected by request queues, as shown in Figure 3.
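The fan-out arithmetic behind the 63% figure in Section 2.2 can be checked with a short calculation (a sketch; the function name is ours, and the 1%-slow/100-machine numbers are from [9]):

```python
# Sketch: why rare per-machine slowness dominates at scale.
# Assumes each of `fanout` sub-requests is independently slow
# with probability `p_slow`; the request is slow if any one is.

def tail_probability(p_slow: float, fanout: int) -> float:
    """Probability that a fan-out request hits at least one slow sub-request."""
    return 1.0 - (1.0 - p_slow) ** fanout

# 1% slow sub-requests, 100-way fan-out: ~63% of user requests are slow.
print(round(tail_probability(0.01, 100), 2))  # -> 0.63
```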
The parameters of SSM are:

N: the depth of service stages.
S_i: the i-th service stage in the processing of the application, 1 <= i <= N.
L_i: the processing time of the i-th service stage. If there are fan-out requests at some service stage in the Hybrid pattern, L_i includes the processing time of the fan-out requests.
P_i: the total processing latency after the i-th service stage, P_i = P_{i-1} + L_i.
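As a minimal illustration of the SSM bookkeeping (the function name is ours, not from the paper's implementation), the cumulative latency P_i is simply a running sum of the per-stage times L_i:

```python
# Minimal sketch of SSM bookkeeping: cumulative latency P_i is the
# running sum of per-stage processing times L_i. Illustrative only.

def cumulative_latencies(stage_latencies):
    """Given [L_1, ..., L_N], return [P_1, ..., P_N] with P_i = P_{i-1} + L_i."""
    totals, running = [], 0.0
    for l_i in stage_latencies:
        running += l_i
        totals.append(running)
    return totals

print(cumulative_latencies([10.0, 25.0, 5.0]))  # -> [10.0, 35.0, 40.0]
```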

Figure 3: Stage-Service Model. A service has N stages (Request -> Stage #1 -> Stage #2 -> ... -> Stage #N); at the end of each stage i we can obtain the cumulative processing latency P_i.

3.2 D2P Based on the Stage-Service Model

The long-tail phenomenon mainly results from inappropriately overlaying the service latency of multiple stages and from missing global processing-time information. For example, in Figure 3, if the service latencies of both Stage 2 and Stage 3 for a request fall into tail regions, the final response time of the request is exacerbated. To address the long-tail problem, we propose a Distributed Deadline Propagation (D2P) approach that dynamically updates the global deadline information of requests and propagates it through the datacenter during the requests' whole lifecycle. Here, for a given user request, the deadline is measured as the difference between the expected response time and the elapsed time. Upon a user request arriving at a datacenter, its Init_deadline is initialized to the expected response time, and the deadline information is then dynamically updated according to formula (1):

    Init_deadline = expected_response_time
    deadline_si = Init_deadline - elapsed_time_si,  (1 <= i <= N, elapsed_time_si = P_i)    (1)

At the same time, we define percentage_elapsed_time according to formula (2), which can be used to describe a request's priority:

    percentage_elapsed_time_si = elapsed_time_si / Init_deadline,  (1 <= i <= N, elapsed_time_si = P_i)    (2)

The time information of each request (deadline, percentage_elapsed_time) is propagated between stages. Each service stage can use this information for scheduling, resource allocation, or other techniques to accelerate requests with urgent deadlines.

3.3 D2P-Enabled Distributed Framework

D2P is a design methodology and can be implemented in various distributed systems. Figure 4 illustrates a typical D2P-enabled multiple-phase distributed service framework that consists of three major steps: (1) assign deadline information to a request; (2) use the deadline information to schedule requests and/or adjust their processing speed; (3) update and propagate the deadline information.

Step 1: When a new request arrives at the front-end servers of the framework, a deadline field (deadline, percentage_elapsed_time) is appended to it and assigned an initial value. Once assigned, the deadline information is propagated along with the original request during its whole lifecycle. In our experiments, we predefine the expected response time, because latency-sensitive services usually have a cut-off latency (e.g., 200ms), and set the initial value of percentage_elapsed_time to zero. An alternative is to use a machine-learning approach to learn the expected response time automatically from profiling data.

Step 2: When a node receives a modified request carrying deadline information, it first extracts the information and computes the deadline laxity, which determines the scheduling and processing priority of the request. In particular, we provide APIs that allow programmers to use the deadline information to control the request. In our experiments, we implement the APIs by attaching Thread Local Storage (TLS) to each thread and extracting the deadline information into TLS for the programmer to reference. Various techniques can then control request processing, such as scheduling and acceleration; for example, the deadline laxity can be used by real-time scheduling algorithms such as Least Laxity First (LLF) [3].
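The laxity-based dispatch that Step 2 enables can be sketched with a simple priority queue (a toy model with illustrative names, not the scheduler used in our experiments):

```python
import heapq

# Toy sketch of deadline-aware dispatch at one stage: requests carrying a
# smaller deadline laxity (equivalently, a larger percentage_elapsed_time)
# are served first, as in Least Laxity First scheduling.

def llf_order(requests):
    """requests: list of (request_id, deadline_laxity_ms). Returns ids in
    the order a Least-Laxity-First scheduler would serve them."""
    heap = [(laxity, rid) for rid, laxity in requests]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# A request that lost time upstream (20ms laxity left) jumps ahead of one
# that is comfortably early (160ms laxity left).
print(llf_order([("a", 160.0), ("b", 20.0), ("c", 90.0)]))  # -> ['b', 'c', 'a']
```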
Step 3: When the request finishes at a service stage, the framework records the elapsed time and calculates new values of deadline and percentage_elapsed_time using formulas (1) and (2), which are written into the deadline-information field of the request. The new (deadline, percentage_elapsed_time) pair is then sent to the next service stage.
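The three steps above can be sketched end-to-end as generic RPC hooks. This is an illustrative model under assumed names; it is not the actual API of our implementation or of Dubbo:

```python
import threading

# End-to-end sketch of the three steps as generic RPC hooks. The function
# names and the dict-based request format are illustrative assumptions.

_tls = threading.local()  # stands in for the per-thread TLS slot

def assign_deadline(payload, expected_response_time_ms):
    """Step 1: attach the initial deadline field to a new request."""
    return {"data": payload,
            "deadline": expected_response_time_ms,
            "init_deadline": expected_response_time_ms,
            "elapsed": 0.0, "pct_elapsed": 0.0}

def handle_at_stage(request, handler, stage_latency_ms):
    """Steps 2 and 3: extract deadline info to TLS, run the stage handler,
    then update the field with formulas (1) and (2) before forwarding."""
    _tls.deadline_info = request                          # step 2: extract
    result = handler(request["data"])
    request["elapsed"] += stage_latency_ms                # step 3: update
    request["deadline"] = request["init_deadline"] - request["elapsed"]
    request["pct_elapsed"] = request["elapsed"] / request["init_deadline"]
    request["data"] = result
    return request                                        # propagate onward

req = assign_deadline("query", expected_response_time_ms=200.0)
req = handle_at_stage(req, lambda d: d.upper(), stage_latency_ms=150.0)
print(req["deadline"], req["pct_elapsed"])  # -> 50.0 0.75
```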

Figure 4: D2P-enabled framework. ❶ Assign deadline information to a request (a manually defined or profiled deadline constraint; the request carries (data, deadline, elapsed_time%)); ❷ call APIs to use the deadline information (scheduling by the deadline field; the deadline is saved into the TLS of a thread-pool thread); ❸ generate the new deadline field and send the request to the next service stage.

4 Leveraging D2P to Tolerate Variability

There are mainly three kinds of techniques to reduce contention for unmanaged shared resources and to enforce QoS requirements. (1) Scheduling. Application-level scheduling [10, 27] and distributed real-time scheduling can co-locate applications with different resource requirements and schedule urgent requests to be processed in time. (2) Resource-allocation adjustment. This is another effective approach, covering machine-level resources (VMs) allocated by hypervisors, OS-level resources (e.g., I/O bandwidth) allocated by OS containers, and architecture-level resources (e.g., shared cache) allocated by page coloring [15, 16]. (3) Trading precision for execution time [19]. For Internet online applications, the precision of an advertisement is iteratively refined; the more time an application spends, the more precise the ads a user gets. This tradeoff can be used to adjust performance per user request according to its deadline requirement, as shown in Figure 5(c). Since D2P allows local nodes to perceive a request's global time information, how best to leverage that information remains an open problem.

5 Evaluation and Discussion

5.1 Experimental Setup and Evaluation

We implement a D2P-enabled framework based on the Dubbo distributed service framework [1] developed by Alibaba. Dubbo is a key part of Alibaba's SOA solution and is deployed throughout alibaba.com, serving 2,000+ services with more than 3 billion invocations every day.
It is also used by dozens of well-known web service providers in China. In the Dubbo framework, an application is divided into clients and servers that communicate via RPC. We implement the D2P mechanism by hooking client and server invocations. As illustrated in Figure 4, ❶ after a client invocation, the framework packs the deadline information into the RPC request; ❷ before a server invocation, the framework extracts the deadline information and copies it into the TLS of the target service thread, as explained in Section 3.3; ❸ after the server finishes processing a request, the framework adds the processing latency to the deadline information stored in TLS, re-packs it into the RPC response, and propagates the updated deadline information back to the client through the RPC callback.

We then implement a priority-based scheduling strategy on the server side that uses the global dynamic deadline information (see Section 3.2) as the priority, in order to evaluate the effect of D2P-enabled deadline-aware scheduling. Specifically: (i) requests are issued from the client (front end) to the servers at service stage S_1 at a rate of 100 requests/second; we assume the request arrival rate and service time follow a normal distribution, simulate this process on the server, and assume there is a single bottleneck node whose queue buffer is almost full. (ii) We use the predefined-deadline method (Step 1 in Section 3.3) to assign a cut-off response time as Init_deadline, here set to 200ms. (iii) After requests have been processed, we record the request processing time as elapsed_time, compute the processing-latency distribution of service stage S_1, and update the deadline field of the requests using formulas (1) and (2).
(iv) The updated deadline information is then sent to the next service stage S_2, and percentage_elapsed_time_s1 is used as the request priority at stage S_2: the larger a request's percentage_elapsed_time_s1, the higher its priority, meaning the request is more urgent to schedule. (v) When requests complete at this service stage, the deadline information is updated and sent to the next service stage, and so on. Throughout this process, the request scheduler of each service stage S_i uses the value of percentage_elapsed_time_s(i-1) as the request priority.

Figure 5(a) shows that latency variability is amplified by 2X (from 50ms to 110ms) under the FIFO scheduling strategy. With the D2P-enabled deadline-aware scheduling strategy, this amplification effect is reduced by about 10%, and the standard deviation of response time is reduced by 22.5%. These results indicate that the D2P approach is useful in distributed systems for tolerating latency variability via real-time scheduling.

Figure 5(b) shows the effect of the performance-adjustment policy based on the tradeoff in Figure 5(c): for the response-time distribution of the five-stage service application, the standard deviation of response time is reduced from 9.28 to 5.77, an improvement of about 37.8%.

Figure 5: (a) Effect of D2P-enabled deadline-aware scheduling: compared with FIFO scheduling, D2P-enabled scheduling reduces response-time variability by 22.5% in terms of standard deviation. (b) Effect of D2P-enabled performance adjustment: on a simulated performance-adjustable server leveraging the tradeoff in Figure 5(c), D2P-enabled processing reduces response-time variability by 37.8% in terms of standard deviation. (c) Trade-off between execution time and ad precision [19].

5.2 Discussion

Apart from the Stage-Service Model in distributed systems discussed in this paper, there are many other situations where the D2P approach can be employed. All these scenarios can be abstracted as multi-phase resource sharing, such as a memory hierarchy or a network switch. In addition, reducing the latency variability of Partition/Aggregate applications is also an important problem, which needs further investigation and is our future work.

6 Related Work

Latency in the long tail: Datacenters run many latency-critical applications that are extremely important for the revenue of Internet companies, so reducing long-tail delay has become an important issue. Some work focuses on reducing network latency by removing network congestion and prioritizing flows [26, 4, 23, 24].
Other work analyzes or diagnoses bottlenecks in large-scale systems through techniques such as model and attribute analysis [13, 14] to understand latency variations, an improved BSP model [7], and co-scheduling methods like Bobtail [25] to avoid long tails. Dean and Barroso describe Google's use of techniques such as container-based isolation, priority management, and backup requests [8, 9], but these approaches are only suitable for the Partition/Aggregate pattern. In this work, D2P focuses on applications containing the Sequential/Dependent and Hybrid patterns, which can be represented by the Stage-Service Model. Besides datacenter applications, the mobile-app marketplace is also concerned with the latency of interactive applications. For example, Microsoft uses AppInsight [18] and Timecard [19] to monitor mobile-app performance and control user-perceived delays in server-based mobile applications. By contrast, D2P focuses on datacenter traffic and leverages the distributed framework Dubbo.

QoS guarantees: Hardware platforms that enforce QoS priorities have been proposed [5, 11], such as CQoS [11] and a QoS-enabled memory architecture for CMP platforms [12]. Software solutions such as compilation techniques can also improve performance and enforce QoS; for example, Tang et al. proposed QoS-Compile [21] and ReQoS [22]. These QoS-guarantee techniques improve only intra-node QoS, while D2P aims to tolerate latency variability across nodes in distributed environments.

7 Conclusions

In this paper, we propose a Distributed Deadline Propagation (D2P) approach to tolerate latency variability for applications containing the Sequential/Dependent pattern, and we design and implement the idea in an extensively used distributed framework. Experimental results show that D2P is able to tolerate variability. However, many open problems remain, such as additional scenarios, hardware acceleration, and architectural support.

References

[1] Dubbo distributed service framework. alibabatech.com/wiki/display/dubbo/home.
[2] Google's Marissa Mayer: Speed wins. com/blog/btl/googles-marissa-mayer-speed-wins/.
[3] Least laxity first. slack_time_scheduling.
[4] ALIZADEH, M., GREENBERG, A., MALTZ, D. A., PADHYE, J., PATEL, P., PRABHAKAR, B., SENGUPTA, S., AND SRIDHARAN, M. Data center TCP (DCTCP). In Proceedings of the ACM SIGCOMM 2010 Conference (New York, NY, USA, 2010), SIGCOMM '10, ACM.
[5] BARROSO, L. A., CLIDARAS, J., AND HOLZLE, U. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture 8, 3 (2013).
[6] CHEN, T., CHEN, Y., GUO, Q., TEMAM, O., WU, Y., AND HU, W. Statistical performance comparisons of computers. In High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on (2012), IEEE.
[7] CIPAR, J., HO, Q., KIM, J. K., LEE, S., GANGER, G. R., GIBSON, G., KEETON, K., AND XING, E. Solving the straggler problem with bounded staleness. In Proceedings of the 14th USENIX Conference on Hot Topics in Operating Systems (Berkeley, CA, USA, 2013), HotOS '13, USENIX Association.
[8] DEAN, J. Achieving rapid response times in large online services. In Berkeley AMPLab Cloud Seminar (2012).
[9] DEAN, J., AND BARROSO, L. A. The tail at scale. Communications of the ACM 56, 2 (2013).
[10] DELIMITROU, C., AND KOZYRAKIS, C. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2013), ASPLOS '13, ACM.
[11] IYER, R. CQoS: A framework for enabling QoS in shared caches of CMP platforms. In Proceedings of the 18th Annual International Conference on Supercomputing (2004), ACM.
[12] IYER, R., ZHAO, L., GUO, F., ILLIKKAL, R., MAKINENI, S., NEWELL, D., SOLIHIN, Y., HSU, L., AND REINHARDT, S. QoS policies and architecture for cache/memory in CMP platforms. In ACM SIGMETRICS Performance Evaluation Review (2007), vol. 35, ACM.
[13] KRUSHEVSKAJA, D., AND SANDLER, M. Understanding latency variations of black box services. In Proceedings of the 22nd International Conference on World Wide Web (Republic and Canton of Geneva, Switzerland, 2013), WWW '13, International World Wide Web Conferences Steering Committee.
[14] OSTROWSKI, K., MANN, G., AND SANDLER, M. Diagnosing latency in multi-tier black-box services. In 5th Workshop on Large Scale Distributed Systems and Middleware (LADIS 2011) (2011).
[15] LIN, J., LU, Q., DING, X., ZHANG, Z., ZHANG, X., AND SADAYAPPAN, P. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In High Performance Computer Architecture (HPCA), 2008 IEEE 14th International Symposium on (2008), IEEE.
[16] LIU, L., CUI, Z., XING, M., BAO, Y., CHEN, M., AND WU, C. A software memory partition approach for eliminating bank-level interference in multicore systems. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (2012), ACM.
[17] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling memcache at Facebook. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2013), NSDI '13, USENIX Association.
[18] RAVINDRANATH, L., PADHYE, J., AGARWAL, S., MAHAJAN, R., OBERMILLER, I., AND SHAYANDEH, S. AppInsight: Mobile app performance monitoring in the wild. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2012), OSDI '12, USENIX Association.
[19] RAVINDRANATH, L., PADHYE, J., MAHAJAN, R., AND BALAKRISHNAN, H. Timecard: Controlling user-perceived delays in server-based mobile applications. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2013), SOSP '13, ACM.
[20] SCHURMAN, E., AND BRUTLAG, J. The user and business impact of server delays.
[21] TANG, L., MARS, J., AND SOFFA, M. L. Compiling for niceness: Mitigating contention for QoS in warehouse scale computers. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (New York, NY, USA, 2012), CGO '12, ACM.
[22] TANG, L., MARS, J., WANG, W., DEY, T., AND SOFFA, M. L. ReQoS: Reactive static/dynamic compilation for QoS in warehouse scale computers. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2013), ASPLOS '13, ACM.
[23] VAMANAN, B., HASAN, J., AND VIJAYKUMAR, T. Deadline-aware datacenter TCP (D2TCP). In Proceedings of the ACM SIGCOMM 2012 Conference (New York, NY, USA, 2012), SIGCOMM '12, ACM.
[24] WILSON, C., BALLANI, H., KARAGIANNIS, T., AND ROWTRON, A. Better never than late: Meeting deadlines in datacenter networks. In Proceedings of the ACM SIGCOMM 2011 Conference (New York, NY, USA, 2011), SIGCOMM '11, ACM.
[25] XU, Y., MUSGRAVE, Z., NOBLE, B., AND BAILEY, M. Bobtail: Avoiding long tails in the cloud. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2013), NSDI '13, USENIX Association.
[26] ZATS, D., DAS, T., MOHAN, P., BORTHAKUR, D., AND KATZ, R. DeTail: Reducing the flow completion time tail in datacenter networks. In Proceedings of the ACM SIGCOMM 2012 Conference (New York, NY, USA, 2012), SIGCOMM '12, ACM.
[27] ZHURAVLEV, S., BLAGODUROV, S., AND FEDOROVA, A. Addressing shared resource contention in multicore processors via scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2010), ASPLOS XV, ACM.


More information

This is a repository copy of M21TCP: Overcoming TCP Incast Congestion in Data Centres.

This is a repository copy of M21TCP: Overcoming TCP Incast Congestion in Data Centres. This is a repository copy of M21TCP: Overcoming TCP Incast Congestion in Data Centres. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/89460/ Version: Accepted Version Proceedings

More information

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

IX: A Protected Dataplane Operating System for High Throughput and Low Latency IX: A Protected Dataplane Operating System for High Throughput and Low Latency Adam Belay et al. Proc. of the 11th USENIX Symp. on OSDI, pp. 49-65, 2014. Presented by Han Zhang & Zaina Hamid Challenges

More information

Tortoise vs. hare: a case for slow and steady retrieval of large files

Tortoise vs. hare: a case for slow and steady retrieval of large files Tortoise vs. hare: a case for slow and steady retrieval of large files Abstract Large file transfers impact system performance at all levels of a network along the data path from source to destination.

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Per-Packet Load Balancing in Data Center Networks

Per-Packet Load Balancing in Data Center Networks Per-Packet Load Balancing in Data Center Networks Yagiz Kaymak and Roberto Rojas-Cessa Abstract In this paper, we evaluate the performance of perpacket load in data center networks (DCNs). Throughput and

More information

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Hyunchul Seok Daejeon, Korea hcseok@core.kaist.ac.kr Youngwoo Park Daejeon, Korea ywpark@core.kaist.ac.kr Kyu Ho Park Deajeon,

More information

Data Centers and Cloud Computing. Slides courtesy of Tim Wood

Data Centers and Cloud Computing. Slides courtesy of Tim Wood Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

Towards Makespan Minimization Task Allocation in Data Centers

Towards Makespan Minimization Task Allocation in Data Centers Towards Makespan Minimization Task Allocation in Data Centers Kangkang Li, Ziqi Wan, Jie Wu, and Adam Blaisse Department of Computer and Information Sciences Temple University Philadelphia, Pennsylvania,

More information

TCP Incast problem Existing proposals

TCP Incast problem Existing proposals TCP Incast problem & Existing proposals Outline The TCP Incast problem Existing proposals to TCP Incast deadline-agnostic Deadline-Aware Datacenter TCP deadline-aware Picasso Art is TLA 1. Deadline = 250ms

More information

QoS-Aware Admission Control in Heterogeneous Datacenters

QoS-Aware Admission Control in Heterogeneous Datacenters QoS-Aware Admission Control in Heterogeneous Datacenters Christina Delimitrou, Nick Bambos and Christos Kozyrakis Stanford University ICAC June 28 th 2013 Cloud DC Scheduling Workloads DC Scheduler S S

More information

CORAL: A Multi-Core Lock-Free Rate Limiting Framework

CORAL: A Multi-Core Lock-Free Rate Limiting Framework : A Multi-Core Lock-Free Rate Limiting Framework Zhe Fu,, Zhi Liu,, Jiaqi Gao,, Wenzhe Zhou, Wei Xu, and Jun Li, Department of Automation, Tsinghua University, China Research Institute of Information Technology,

More information

Data Centers and Cloud Computing. Data Centers

Data Centers and Cloud Computing. Data Centers Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

15-744: Computer Networking. Data Center Networking II

15-744: Computer Networking. Data Center Networking II 15-744: Computer Networking Data Center Networking II Overview Data Center Topology Scheduling Data Center Packet Scheduling 2 Current solutions for increasing data center network bandwidth FatTree BCube

More information

Real-Time Internet of Things

Real-Time Internet of Things Real-Time Internet of Things Chenyang Lu Cyber-Physical Systems Laboratory h7p://www.cse.wustl.edu/~lu/ Internet of Things Ø Convergence of q Miniaturized devices: integrate processor, sensors and radios.

More information

TAPS: Software Defined Task-level Deadline-aware Preemptive Flow scheduling in Data Centers

TAPS: Software Defined Task-level Deadline-aware Preemptive Flow scheduling in Data Centers 25 44th International Conference on Parallel Processing TAPS: Software Defined Task-level Deadline-aware Preemptive Flow scheduling in Data Centers Lili Liu, Dan Li, Jianping Wu Tsinghua National Laboratory

More information

Improving Multipath TCP for Latency Sensitive Flows in the Cloud

Improving Multipath TCP for Latency Sensitive Flows in the Cloud 2016 5th IEEE International Conference on Cloud Networking Improving Multipath TCP for Latency Sensitive Flows in the Cloud Wei Wang,Liang Zhou,Yi Sun Institute of Computing Technology, CAS, University

More information

SEER: LEVERAGING BIG DATA TO NAVIGATE THE COMPLEXITY OF PERFORMANCE DEBUGGING IN CLOUD MICROSERVICES

SEER: LEVERAGING BIG DATA TO NAVIGATE THE COMPLEXITY OF PERFORMANCE DEBUGGING IN CLOUD MICROSERVICES SEER: LEVERAGING BIG DATA TO NAVIGATE THE COMPLEXITY OF PERFORMANCE DEBUGGING IN CLOUD MICROSERVICES Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou Cornell

More information

Quality-Assured Cloud Bandwidth Auto-Scaling for Video-on-Demand Applications

Quality-Assured Cloud Bandwidth Auto-Scaling for Video-on-Demand Applications Quality-Assured Cloud Bandwidth Auto-Scaling for Video-on-Demand Applications Di Niu, Hong Xu, Baochun Li University of Toronto Shuqiao Zhao UUSee, Inc., Beijing, China 1 Applications in the Cloud WWW

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Dynamically Provisioning Distributed Systems to Meet Target Levels of Performance, Availability, and Data Quality

Dynamically Provisioning Distributed Systems to Meet Target Levels of Performance, Availability, and Data Quality Dynamically Provisioning Distributed Systems to Meet Target Levels of Performance, Availability, and Data Quality Amin Vahdat Department of Computer Science Duke University 1 Introduction Increasingly,

More information

Low Latency Datacenter Networking: A Short Survey

Low Latency Datacenter Networking: A Short Survey Low Latency Datacenter Networking: A Short Survey Shuhao Liu, Hong Xu, Zhiping Cai Department of Computer Science, City University of Hong Kong College of Computer, National University of Defence Technology

More information

A Framework for Providing Quality of Service in Chip Multi-Processors

A Framework for Providing Quality of Service in Chip Multi-Processors A Framework for Providing Quality of Service in Chip Multi-Processors Fei Guo 1, Yan Solihin 1, Li Zhao 2, Ravishankar Iyer 2 1 North Carolina State University 2 Intel Corporation The 40th Annual IEEE/ACM

More information

MicroFuge: A Middleware Approach to Providing Performance Isolation in Cloud Storage Systems

MicroFuge: A Middleware Approach to Providing Performance Isolation in Cloud Storage Systems 1 MicroFuge: A Middleware Approach to Providing Performance Isolation in Cloud Storage Systems Akshay Singh, Xu Cui, Benjamin Cassell, Bernard Wong and Khuzaima Daudjee July 3, 2014 2 Storage Resources

More information

A Hardware Evaluation of Cache Partitioning to Improve Utilization and Energy-Efficiency while Preserving Responsiveness

A Hardware Evaluation of Cache Partitioning to Improve Utilization and Energy-Efficiency while Preserving Responsiveness A Hardware Evaluation of Cache Partitioning to Improve Utilization and Energy-Efficiency while Preserving Responsiveness Henry Cook, Miquel Moreto, Sarah Bird, Kanh Dao, David Patterson, Krste Asanovic

More information

QUT Digital Repository:

QUT Digital Repository: QUT Digital Repository: http://eprints.qut.edu.au/ Gui, Li and Tian, Yu-Chu and Fidge, Colin J. (2007) Performance Evaluation of IEEE 802.11 Wireless Networks for Real-time Networked Control Systems. In

More information

Non-preemptive Coflow Scheduling and Routing

Non-preemptive Coflow Scheduling and Routing Non-preemptive Coflow Scheduling and Routing Ruozhou Yu, Guoliang Xue, Xiang Zhang, Jian Tang Abstract As more and more data-intensive applications have been moved to the cloud, the cloud network has become

More information

CSE 124: THE DATACENTER AS A COMPUTER. George Porter November 20 and 22, 2017

CSE 124: THE DATACENTER AS A COMPUTER. George Porter November 20 and 22, 2017 CSE 124: THE DATACENTER AS A COMPUTER George Porter November 20 and 22, 2017 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative

More information

Data Center TCP(DCTCP)

Data Center TCP(DCTCP) Data Center TCP(DCTCP) Mohammad Alizadeh * +, Albert Greenberg *, David A. Maltz *, Jitendra Padhye *, Parveen Patel *, Balaji Prabhakar +, Sudipta Sengupta *, Murari Sridharan * * + Microsoft Research

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Pocket: Elastic Ephemeral Storage for Serverless Analytics

Pocket: Elastic Ephemeral Storage for Serverless Analytics Pocket: Elastic Ephemeral Storage for Serverless Analytics Ana Klimovic*, Yawen Wang*, Patrick Stuedi +, Animesh Trivedi +, Jonas Pfefferle +, Christos Kozyrakis* *Stanford University, + IBM Research 1

More information

No Tradeoff Low Latency + High Efficiency

No Tradeoff Low Latency + High Efficiency No Tradeoff Low Latency + High Efficiency Christos Kozyrakis http://mast.stanford.edu Latency-critical Applications A growing class of online workloads Search, social networking, software-as-service (SaaS),

More information

HSM: A Hybrid Streaming Mechanism for Delay-tolerant Multimedia Applications Annanda Th. Rath 1 ), Saraswathi Krithivasan 2 ), Sridhar Iyer 3 )

HSM: A Hybrid Streaming Mechanism for Delay-tolerant Multimedia Applications Annanda Th. Rath 1 ), Saraswathi Krithivasan 2 ), Sridhar Iyer 3 ) HSM: A Hybrid Streaming Mechanism for Delay-tolerant Multimedia Applications Annanda Th. Rath 1 ), Saraswathi Krithivasan 2 ), Sridhar Iyer 3 ) Abstract Traditionally, Content Delivery Networks (CDNs)

More information

D E N A L I S T O R A G E I N T E R F A C E. Laura Caulfield Senior Software Engineer. Arie van der Hoeven Principal Program Manager

D E N A L I S T O R A G E I N T E R F A C E. Laura Caulfield Senior Software Engineer. Arie van der Hoeven Principal Program Manager 1 T HE D E N A L I N E X T - G E N E R A T I O N H I G H - D E N S I T Y S T O R A G E I N T E R F A C E Laura Caulfield Senior Software Engineer Arie van der Hoeven Principal Program Manager Outline Technology

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

Demand-Aware Flow Allocation in Data Center Networks

Demand-Aware Flow Allocation in Data Center Networks Demand-Aware Flow Allocation in Data Center Networks Dmitriy Kuptsov Aalto University/HIIT Espoo, Finland dmitriy.kuptsov@hiit.fi Boris Nechaev Aalto University/HIIT Espoo, Finland boris.nechaev@hiit.fi

More information

Application-Specific Configuration Selection in the Cloud: Impact of Provider Policy and Potential of Systematic Testing

Application-Specific Configuration Selection in the Cloud: Impact of Provider Policy and Potential of Systematic Testing Application-Specific Configuration Selection in the Cloud: Impact of Provider Policy and Potential of Systematic Testing Mohammad Hajjat +, Ruiqi Liu*, Yiyang Chang +, T.S. Eugene Ng*, Sanjay Rao + + Purdue

More information

VARIABILITY IN OPERATING SYSTEMS

VARIABILITY IN OPERATING SYSTEMS VARIABILITY IN OPERATING SYSTEMS Brian Kocoloski Assistant Professor in CSE Dept. October 8, 2018 1 CLOUD COMPUTING Current estimate is that 94% of all computation will be performed in the cloud by 2021

More information

Towards Makespan Minimization Task Allocation in Data Centers

Towards Makespan Minimization Task Allocation in Data Centers Towards Makespan Minimization Task Allocation in Data Centers Kangkang Li, Ziqi Wan, Jie Wu, and Adam Blaisse Department of Computer and Information Sciences Temple University Philadelphia, Pennsylvania,

More information

Information-Agnostic Flow Scheduling for Commodity Data Centers. Kai Chen SING Group, CSE Department, HKUST May 16, Stanford University

Information-Agnostic Flow Scheduling for Commodity Data Centers. Kai Chen SING Group, CSE Department, HKUST May 16, Stanford University Information-Agnostic Flow Scheduling for Commodity Data Centers Kai Chen SING Group, CSE Department, HKUST May 16, 2016 @ Stanford University 1 SING Testbed Cluster Electrical Packet Switch, 1G (x10) Electrical

More information

Video Diffusion: A Routing Failure Resilient, Multi-Path Mechanism to Improve Wireless Video Transport

Video Diffusion: A Routing Failure Resilient, Multi-Path Mechanism to Improve Wireless Video Transport Video Diffusion: A Routing Failure Resilient, Multi-Path Mechanism to Improve Wireless Video Transport Jinsuo Zhang Yahoo! Inc. 701 First Avenue Sunnyvale, CA 94089 azhang@yahoo-inc.com Sumi Helal Dept

More information

Department of Information Technology Sri Venkateshwara College of Engineering, Chennai, India. 1 2

Department of Information Technology Sri Venkateshwara College of Engineering, Chennai, India. 1 2 Energy-Aware Scheduling Using Workload Consolidation Techniques in Cloud Environment 1 Sridharshini V, 2 V.M.Sivagami 1 PG Scholar, 2 Associate Professor Department of Information Technology Sri Venkateshwara

More information

TCP Nicer: Support for Hierarchical Background Transfers

TCP Nicer: Support for Hierarchical Background Transfers TCP Nicer: Support for Hierarchical Background Transfers Neil Alldrin and Alvin AuYoung Department of Computer Science University of California at San Diego La Jolla, CA 9237 Email: nalldrin, alvina @cs.ucsd.edu

More information

Performance Gain with Variable Chunk Size in GFS-like File Systems

Performance Gain with Variable Chunk Size in GFS-like File Systems Journal of Computational Information Systems4:3(2008) 1077-1084 Available at http://www.jofci.org Performance Gain with Variable Chunk Size in GFS-like File Systems Zhifeng YANG, Qichen TU, Kai FAN, Lei

More information

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services Overview 15-441 15-441 Computer Networking 15-641 Lecture 19 Queue Management and Quality of Service Peter Steenkiste Fall 2016 www.cs.cmu.edu/~prs/15-441-f16 What is QoS? Queuing discipline and scheduling

More information

Challenges in Service-Oriented Networking

Challenges in Service-Oriented Networking Challenges in Bob Callaway North Carolina State University Department of Electrical and Computer Engineering Ph.D Qualifying Examination April 14, 2006 Advisory Committee: Dr. Michael Devetsikiotis, Dr.

More information

PerfGuard: Binary-Centric Application Performance Monitoring in Production Environments

PerfGuard: Binary-Centric Application Performance Monitoring in Production Environments PerfGuard: Binary-Centric Application Performance Monitoring in Production Environments Chung Hwan Kim, Junghwan Rhee *, Kyu Hyung Lee +, Xiangyu Zhang, Dongyan Xu * + Performance Problems Performance

More information

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan Coflow Recent Advances and What s Next? Mosharaf Chowdhury University of Michigan Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow Networking Open Source Apache Spark Open

More information

THE DATACENTER AS A COMPUTER AND COURSE REVIEW

THE DATACENTER AS A COMPUTER AND COURSE REVIEW THE DATACENTER A A COMPUTER AND COURE REVIEW George Porter June 8, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-hareAlike 3.0 Unported (CC BY-NC-A 3.0) Creative Commons

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

RCD: Rapid Close to Deadline Scheduling for Datacenter Networks

RCD: Rapid Close to Deadline Scheduling for Datacenter Networks RCD: Rapid Close to Deadline Scheduling for Datacenter Networks Mohammad Noormohammadpour 1, Cauligi S. Raghavendra 1, Sriram Rao 2, Asad M. Madni 3 1 Ming Hsieh Department of Electrical Engineering, University

More information

Information-Agnostic Flow Scheduling for Commodity Data Centers

Information-Agnostic Flow Scheduling for Commodity Data Centers Information-Agnostic Flow Scheduling for Commodity Data Centers Wei Bai, Li Chen, Kai Chen, Dongsu Han (KAIST), Chen Tian (NJU), Hao Wang Sing Group @ Hong Kong University of Science and Technology USENIX

More information

qtlb: Looking inside the Look-aside buffer

qtlb: Looking inside the Look-aside buffer qtlb: Looking inside the Look-aside buffer Omesh Tickoo 1, Hari Kannan 2, Vineet Chadha 3, Ramesh Illikkal 1, Ravi Iyer 1, and Donald Newell 1 1 Intel Corporation, 2111 NE 25th Ave., Hillsboro OR, USA,

More information

Research and Design of Crypto Card Virtualization Framework Lei SUN, Ze-wu WANG and Rui-chen SUN

Research and Design of Crypto Card Virtualization Framework Lei SUN, Ze-wu WANG and Rui-chen SUN 2016 International Conference on Wireless Communication and Network Engineering (WCNE 2016) ISBN: 978-1-60595-403-5 Research and Design of Crypto Card Virtualization Framework Lei SUN, Ze-wu WANG and Rui-chen

More information

Deconstructing Datacenter Packet Transport

Deconstructing Datacenter Packet Transport Deconstructing Datacenter Packet Transport Mohammad Alizadeh, Shuang Yang, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker Stanford University U.C. Berkeley / ICSI {alizade, shyang, skatti,

More information

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Bhavya K. Daya, Li-Shiuan Peh, Anantha P. Chandrakasan Dept. of Electrical Engineering and Computer

More information

Minimum-cost Cloud Storage Service Across Multiple Cloud Providers

Minimum-cost Cloud Storage Service Across Multiple Cloud Providers Minimum-cost Cloud Storage Service Across Multiple Cloud Providers Guoxin Liu and Haiying Shen Department of Electrical and Computer Engineering, Clemson University, Clemson, USA 1 Outline Introduction

More information

Limitations of Load Balancing Mechanisms for N-Tier Systems in the Presence of Millibottlenecks

Limitations of Load Balancing Mechanisms for N-Tier Systems in the Presence of Millibottlenecks Limitations of Load Balancing Mechanisms for N-Tier Systems in the Presence of Millibottlenecks Tao Zhu 1, Jack Li 1, Josh Kimball 1, Junhee Park 1, Chien-An Lai 1, Calton Pu 1 and Qingyang Wang 2 1 Computer

More information

Asynchronous Method Calls White Paper VERSION Copyright 2014 Jade Software Corporation Limited. All rights reserved.

Asynchronous Method Calls White Paper VERSION Copyright 2014 Jade Software Corporation Limited. All rights reserved. VERSION 7.0.10 Copyright 2014 Jade Software Corporation Limited. All rights reserved. Jade Software Corporation Limited cannot accept any financial or other responsibilities that may be the result of your

More information

Adaptive replica consistency policy for Kafka

Adaptive replica consistency policy for Kafka Adaptive replica consistency policy for Kafka Zonghuai Guo 1,2,*, Shiwang Ding 1,2 1 Chongqing University of Posts and Telecommunications, 400065, Nan'an District, Chongqing, P.R.China 2 Chongqing Mobile

More information

Fastpass A Centralized Zero-Queue Datacenter Network

Fastpass A Centralized Zero-Queue Datacenter Network Fastpass A Centralized Zero-Queue Datacenter Network Jonathan Perry Amy Ousterhout Hari Balakrishnan Devavrat Shah Hans Fugal Ideal datacenter network properties No current design satisfies all these properties

More information

Cache Management for TelcoCDNs. Daphné Tuncer Department of Electronic & Electrical Engineering University College London (UK)

Cache Management for TelcoCDNs. Daphné Tuncer Department of Electronic & Electrical Engineering University College London (UK) Cache Management for TelcoCDNs Daphné Tuncer Department of Electronic & Electrical Engineering University College London (UK) d.tuncer@ee.ucl.ac.uk 06/01/2017 Agenda 1. Internet traffic: trends and evolution

More information

Efficient On-Demand Operations in Distributed Infrastructures

Efficient On-Demand Operations in Distributed Infrastructures Efficient On-Demand Operations in Distributed Infrastructures Steve Ko and Indranil Gupta Distributed Protocols Research Group University of Illinois at Urbana-Champaign 2 One-Line Summary We need to design

More information

Network Function Virtualization. CSU CS557, Spring 2018 Instructor: Lorenzo De Carli

Network Function Virtualization. CSU CS557, Spring 2018 Instructor: Lorenzo De Carli Network Function Virtualization CSU CS557, Spring 2018 Instructor: Lorenzo De Carli Managing middleboxes Middlebox manifesto (ref. previous lecture) pointed out the need for automated middlebox management

More information

Coflow. Big Data. Data-Parallel Applications. Big Datacenters for Massive Parallelism. Recent Advances and What s Next?

Coflow. Big Data. Data-Parallel Applications. Big Datacenters for Massive Parallelism. Recent Advances and What s Next? Big Data Coflow The volume of data businesses want to make sense of is increasing Increasing variety of sources Recent Advances and What s Next? Web, mobile, wearables, vehicles, scientific, Cheaper disks,

More information

Data Center Performance

Data Center Performance Data Center Performance George Porter CSE 124 Feb 15, 2017 *Includes material taken from Barroso et al., 2013, UCSD 222a, and Cedric Lam and Hong Liu (Google) Part 1: Partitioning work across many servers

More information

G-NET: Effective GPU Sharing In NFV Systems

G-NET: Effective GPU Sharing In NFV Systems G-NET: Effective Sharing In NFV Systems Kai Zhang*, Bingsheng He^, Jiayu Hu #, Zeke Wang^, Bei Hua #, Jiayi Meng #, Lishan Yang # *Fudan University ^National University of Singapore #University of Science

More information

Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference

Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference Yunqi Zhang, David Meisner, Jason Mars, Lingjia Tang Internet services User interactive applications

More information

MixApart: Decoupled Analytics for Shared Storage Systems

MixApart: Decoupled Analytics for Shared Storage Systems MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto, NetApp Abstract Data analytics and enterprise applications have very

More information

Armon HASHICORP

Armon HASHICORP Nomad Armon Dadgar @armon Distributed Optimistically Concurrent Scheduler Nomad Distributed Optimistically Concurrent Scheduler Nomad Schedulers map a set of work to a set of resources Work (Input) Resources

More information

Hybrid Auto-scaling of Multi-tier Web Applications: A Case of Using Amazon Public Cloud

Hybrid Auto-scaling of Multi-tier Web Applications: A Case of Using Amazon Public Cloud Hybrid Auto-scaling of Multi-tier Web Applications: A Case of Using Amazon Public Cloud Abid Nisar, Waheed Iqbal, Fawaz S. Bokhari, and Faisal Bukhari Punjab University College of Information and Technology,Lahore

More information

ICON: Incast Congestion Control using Packet Pacing in Datacenter Networks

ICON: Incast Congestion Control using Packet Pacing in Datacenter Networks ICON: Incast Congestion Control using Packet Pacing in Datacenter Networks Hamed Rezaei, Hamidreza Almasi, Muhammad Usama Chaudhry, and Balajee Vamanan University of Illinois at Chicago Abstract Datacenters

More information

The Elasticity and Plasticity in Semi-Containerized Colocating Cloud Workload: a view from Alibaba Trace

The Elasticity and Plasticity in Semi-Containerized Colocating Cloud Workload: a view from Alibaba Trace The Elasticity and Plasticity in Semi-Containerized Colocating Cloud Workload: a view from Alibaba Trace Qixiao Liu* and Zhibin Yu Shenzhen Institute of Advanced Technology Chinese Academy of Science @SoCC

More information

NaaS Network-as-a-Service in the Cloud

NaaS Network-as-a-Service in the Cloud NaaS Network-as-a-Service in the Cloud joint work with Matteo Migliavacca, Peter Pietzuch, and Alexander L. Wolf costa@imperial.ac.uk Motivation Mismatch between app. abstractions & network How the programmers

More information

A QoS Load Balancing Scheduling Algorithm in Cloud Environment

A QoS Load Balancing Scheduling Algorithm in Cloud Environment A QoS Load Balancing Scheduling Algorithm in Cloud Environment Sana J. Shaikh *1, Prof. S.B.Rathod #2 * Master in Computer Engineering, Computer Department, SAE, Pune University, Pune, India # Master in

More information

Nowadays data-intensive applications play a

Nowadays data-intensive applications play a Journal of Advances in Computer Engineering and Technology, 3(2) 2017 Data Replication-Based Scheduling in Cloud Computing Environment Bahareh Rahmati 1, Amir Masoud Rahmani 2 Received (2016-02-02) Accepted

More information

A General Purpose Queue Architecture for an ATM Switch

A General Purpose Queue Architecture for an ATM Switch Mitsubishi Electric Research Laboratories Cambridge Research Center Technical Report 94-7 September 3, 994 A General Purpose Queue Architecture for an ATM Switch Hugh C. Lauer Abhijit Ghosh Chia Shen Abstract

More information

SAMBA-BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ. Ruibing Lu and Cheng-Kok Koh

SAMBA-BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ. Ruibing Lu and Cheng-Kok Koh BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 797- flur,chengkokg@ecn.purdue.edu

More information

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni

More information

Scaling Distributed Machine Learning

Scaling Distributed Machine Learning Scaling Distributed Machine Learning with System and Algorithm Co-design Mu Li Thesis Defense CSD, CMU Feb 2nd, 2017 nx min w f i (w) Distributed systems i=1 Large scale optimization methods Large-scale

More information

TM ALGORITHM TO IMPROVE PERFORMANCE OF OPTICAL BURST SWITCHING (OBS) NETWORKS

TM ALGORITHM TO IMPROVE PERFORMANCE OF OPTICAL BURST SWITCHING (OBS) NETWORKS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 232-7345 TM ALGORITHM TO IMPROVE PERFORMANCE OF OPTICAL BURST SWITCHING (OBS) NETWORKS Reza Poorzare 1 Young Researchers Club,

More information

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information