TAPS: Software Defined Task-level Deadline-aware Preemptive Flow scheduling in Data Centers


2015 44th International Conference on Parallel Processing

TAPS: Software Defined Task-level Deadline-aware Preemptive Flow Scheduling in Data Centers

Lili Liu, Dan Li, Jianping Wu
Tsinghua National Laboratory for Information Science and Technology
Department of Computer Science and Technology, Tsinghua University

Abstract: Many data center applications have deadline requirements, which pose a requirement of deadline-awareness on the network transport; completing within its deadline is a necessary condition for a flow to be useful. Transport protocols in current data centers try to share the network resources fairly and are deadline-agnostic. Several recent works address the problem by making as many flows as possible meet their deadlines. However, for many data center applications a task is not completed until its last flow finishes, which means the bandwidth consumed by completed flows is wasted if some flows of the task miss their deadlines. In this paper we design TAPS, a task-level deadline-aware preemptive flow scheduling scheme, which aims to make more tasks meet their deadlines. We leverage software defined networking (SDN) technology and generalize SDN from flow-level awareness to task-level awareness. The scheduling algorithm runs on the SDN controller, which decides whether a flow should be accepted or discarded, pre-allocates transmission time slices and computes routing paths for accepted flows. Extensive flow-level simulations demonstrate that TAPS outperforms Varys, Baraat, PDQ (Preemptive Distributed Quick flow scheduling), D³ (Deadline-Driven Delivery control protocol) and Fair Sharing transport protocols in deadline-sensitive data center environments. A simple implementation on a real system also shows that TAPS makes highly effective utilization of the network bandwidth in data centers.

I. INTRODUCTION

Data centers, capable of running various cloud applications, are widely deployed around the world. Many cloud applications are interactive in nature and accordingly have soft-real-time requirements [1], [2]. For the sake of both user experience and provider revenue, sometimes even milliseconds of latency matter significantly [3]. A cloud application (task) is usually executed by a number of servers in a distributed manner, and the data flows among these servers are bandwidth-hungry because they shuffle large amounts of data. As a result, the data center network has to provide very low latency to transmit the flows among servers and meet their deadlines.

Unfortunately, traditional competition-based transport protocols in data centers, such as TCP, RCP [4], ICTCP [5] and DCTCP [2], adopt a philosophy of Fair Sharing and let flows compete for the available bandwidth of links. They consider neither meeting the deadline demands of the flows nor minimizing the completion time of the flows. These deadline-agnostic transport protocols cannot make more flows complete within deadlines, and they waste link bandwidth. A recent study shows that the deadlines of 7.25% of flows were missed in three production data centers because of the ineffective utilization of link bandwidth [3].

(This work is supported by the National Key Basic Research Program of China (973 program) under Grant 24CB3478, the National Natural Science Foundation of China under Grants No.6729 and No.64322, the National High-tech R&D Program of China (863 program) under Grant 23AA333, and the Tsinghua University Initiative Scientific Research Program.)
To overcome the problem above, deadline-aware transport protocols for data center networks have recently been proposed, such as D³ [3], PDQ [11] and D²TCP [6]. The basic idea is to introduce a bandwidth competition or allocation algorithm that lets more flows complete within their deadlines, so that link bandwidth is used more effectively. A common goal of these deadline-aware transport protocols is to finish as many flows as possible before their deadlines. However, we argue that for cloud tasks such as financial services, online payment, or scientific computation, the computation results are useful if and only if all the servers finish their computations before the deadline. Consequently, for all the flows of a single task, what really matters is that the last flow completes before the deadline. Otherwise the task fails, and the bandwidth consumed by all the completed flows is wasted as well. Recently, some task-aware flow scheduling schemes have been proposed, e.g. Baraat [7] and Varys [8]. However, Baraat is deadline-agnostic and aims to reduce overall task completion time, which results in low throughput in deadline-constrained cases. Varys is very sensitive to the task arrival order, which may make later-arriving but more urgent tasks miss their deadlines.

To address these problems, in this paper we propose a task-level deadline-aware preemptive flow scheduling algorithm for data centers that tries to finish as many tasks, instead of flows, as possible before their deadlines. We call the protocol TAPS. The design of TAPS leverages the emerging software defined networking (SDN) technique and further generalizes SDN from flow-level awareness to task-level awareness. The core of TAPS is a task-aware flow scheduling algorithm running on the SDN controller. When the scheduling request of a flow arrives, the SDN controller decides whether it should be accepted or discarded according to a reject rule. If the flow is accepted, the SDN controller pre-allocates its transmission time slices and computes its routing paths. Although the allocation problem is proved to be NP-hard, we provide a heuristic solution which works well for arbitrary data center network topologies. Compared to PDQ, Baraat and Varys, TAPS has a near-optimal routing scheme and a better-defined centralized routing algorithm, and thus makes better use of bandwidth and lets more tasks be completed before their deadlines.

Apart from the controller modification, an additional module is added to the servers to maintain the states of local flows. The switches do not need any modification, which is consistent with the trend of employing low-end commodity switches in modern data centers.

We conduct extensive flow-level simulations and report the performance of TAPS in both single-rooted and multi-rooted tree network topologies. For comparison, we also implement several existing flow-level and task-level protocols and report their performance on the same topologies. The results demonstrate that TAPS outperforms the Baraat, Varys, PDQ, D³ and Fair Sharing protocols in deadline-sensitive data center network environments, in terms of the number and total size of tasks completed before their deadlines. We also conduct testbed experiments based on an implementation of TAPS, whose results show that TAPS makes more effective utilization of network bandwidth and fulfills many more tasks than Fair Sharing transport protocols.

II. BACKGROUND AND RELATED WORK

In this section, we briefly review some important properties of current data center applications, including their task-level structure, their latency-awareness, and the multi-path routing they have to consider, which inspire the design of TAPS.

Task-level. Current data center applications and distributed computing systems like MapReduce and Dryad employ a partition/aggregation pattern. They aim to achieve horizontal scalability by partitioning a task into many flows. The unit of these applications is the task, and each task contains a number of flows. Statistics indicate that for web search workloads each task contains at least 88 flows [2], for MapReduce workloads each task contains 3 to even more than 5 flows [9], and for Cosmos workloads most tasks contain 3-7 flows [10]. These statistics reveal that many data center applications generate multiple flows for a single task, and that the task is the unit of processing. The main existing task-level works are Baraat [7] and Varys [8].

Baraat: Baraat is a task-aware scheduling scheme. Task priority follows SJF, and all flows within a task share the same priority. The flow scheduling of Baraat is similar to that of PDQ [11] except for the flow priority. The main goal of Baraat is to minimize the average task completion time. However, ignoring deadline information makes Baraat perform poorly in deadline-sensitive cases.

Varys: Varys is a task-aware and deadline-aware scheduling scheme. Earlier-arrived tasks are scheduled first, and its rate-allocation scheme closely resembles that of D³. Once a task is scheduled, it is never rejected. If a more urgent task arrives later than a less urgent one, and the allocation of the less urgent task leaves insufficient bandwidth for the more urgent one, the more urgent task is discarded by Varys. This makes Varys inherently sensitive to the task arrival order.

Deadline-aware. Many data center applications are interactive and have very strong latency requirements. In such applications the response to a request must be produced very quickly; even milliseconds of extra latency can cause a significant loss of provider revenue [3]. A previous study shows that applications need to complete requests and respond to users within an SLA of 200-300 ms [3]. In many data center applications, flows therefore have specific deadlines to meet and should be completed quickly and early.
The main existing deadline-aware works are D³ [3], PDQ [11], DCTCP [2] and D²TCP [6].

D³: D³ performs explicit, deadline-driven bandwidth allocation in order to complete flows before their deadlines. However, the FCFS allocation used in D³ causes performance issues: the scheduling result depends on the order in which flows arrive. In particular, large flows that arrive earlier can occupy the bottleneck bandwidth and block small flows that arrive later. Furthermore, because D³ is task-agnostic, it makes the data center network miss the deadlines of more tasks.

PDQ: PDQ is a deadline-aware protocol which employs explicit rate control, as TAPS does. Unlike D³ [3], PDQ allocates bandwidth to the most critical flows and allows flow preemption, reducing mean FCT by about 30% compared with D³ and completing more flows within their deadlines. However, distributed scheduling without global knowledge of all ongoing tasks keeps PDQ far from an optimal schedule.

DeTail and D²TCP: DeTail and D²TCP are also deadline-aware protocols. DeTail aims to cut the FCT tail in data center networks, while D²TCP extends DCTCP [2] into a deadline-aware version in order to complete more flows before their deadlines. However, flow-level scheduling alone cannot minimize the number of deadline-missing tasks.

Multi-path routing. Traditional data center topologies are generally tree-based. However, previous research [12] shows that the traditional tree topology cannot meet the requirements of current data centers. To achieve high network capacity, numerous richly connected architectures have been proposed, such as Fat-Tree [12], BCube [13] and FiConn [14], which employ multi-rooted topologies. In data center networks that use multi-rooted tree topologies, generalizing the routing to multi-path is a fundamental and important problem. Nevertheless, transport protocols in data centers still largely follow TCP and emulate fair sharing of the network resources. The main related works are TCP, RCP [4], DCTCP [2] and HULL [15]. Previous studies [3] showed that TCP and RCP with priority queueing miss the deadlines of quite a number of flows and fall behind D³. DCTCP and HULL mainly target reducing queue length with novel rate control and congestion detection mechanisms. Previous research showed that although DCTCP mitigates the latency problem to a certain extent, it cannot achieve a high deadline-sensitive flow completion ratio in data center networks.

III. MOTIVATION AND DESIGN GOALS

In this section, we first present some motivating examples to show the importance of task-level, preemptive and global scheduling in the design of data center scheduling algorithms. Then, based on these properties, we present the design goals of TAPS.

A. Motivation Examples

Consider the scenario shown in Fig. 1, where two concurrent tasks arrive simultaneously.

[Fig. 1. Task-level scheduling vs. flow-level scheduling. (a) shows the size and deadline of the 4 flows (in 2 tasks); (b)-(e) show the scheduling results of Fair Sharing, D³, PDQ and Task-aware Scheduling, respectively. The x-axis is time and the y-axis is the allocation of the bottleneck link bandwidth.]

Task-level Scheduling. Fig. 1 presents an example showing that task-level scheduling has an advantage over flow-level scheduling. There are 2 tasks competing for one bottleneck link, and each task consists of 2 flows. Fig. 1(a) shows the size (expected transmission time) and deadline of each flow in each task. The four concurrent flows arrive simultaneously in the order f_1^1, f_2^1, f_1^2, f_2^2. Fig. 1(b)-(e) show the scheduling results of Fair Sharing, D³, PDQ [11] and task-aware scheduling, respectively.

With the Fair Sharing scheme, flows share the bottleneck link capacity equally. In the end no task is completed, as Fig. 1(b) shows.

With the D³ scheme, each flow requests a rate of r = s/d, where s is the flow size and d the deadline, and earlier-arrived flows are served first. Because f_1^1 arrives earliest it obtains its requested rate, the next flow receives only the residual capacity, and the remaining flows are left with nothing until the link frees up. In the end only one flow and no task is completed, as Fig. 1(c) illustrates.

With the PDQ scheme, the descending priority order of these flows is f_1^2, f_1^1, f_2^2, f_2^1. For brevity, Early Termination [11] is not employed in this example. Each flow is transmitted at the full link capacity in priority order. As a result, 2 flows are completed before their deadlines but neither task is completed within its deadline, as Fig. 1(d) shows.

Fig. 1(e) illustrates the result of a simple task-aware scheduling scheme. The priority of tasks is ordered by EDF [16], and the priority of flows inside each task obeys EDF as well. The flow with the highest priority is transmitted first at the maximum rate, as in [11]. In this way, 2 flows and 1 task can be completed. This example reveals that a task-aware flow scheduling scheme can complete one task, while both flow-granularity scheduling and deadline-agnostic task-granularity scheduling fail to complete any task. Therefore, compared to task-agnostic or deadline-agnostic scheduling, more tasks can be completed using task-aware and deadline-aware flow scheduling; taking tasks and their deadlines into consideration gives a better task completion ratio.

[Fig. 2. Existing task-level scheduling vs. TAPS. (a) shows the size and deadline of the 4 flows (in 2 tasks); (b)-(d) show the scheduling results of Baraat, Varys and our proposed TAPS, respectively. The x-axis is time and the y-axis is the allocation of the bottleneck link bandwidth.]

Preemptive Scheduling. Existing task-aware scheduling schemes such as Baraat [7] and Varys [8] still have room for improvement. They perform badly in some cases because they lack deadline-awareness or preemption. Specifically, Baraat is task-aware but deadline-agnostic, while Varys is task-aware and deadline-aware but obeys FIFO and does not support preemption, which leads some late-arriving but urgent tasks to miss their deadlines. These disadvantages motivate us to propose a task-level, deadline-aware and preemptive scheduling scheme (TAPS for short).

Fig. 2 presents an example that illustrates the preemptive motivation. There are 2 tasks competing for one bottleneck link, and each task consists of 2 flows. The four concurrent flows arrive simultaneously in the order f_1^1, f_2^1, f_1^2, f_2^2. Fig. 2(a) shows the size and deadline of the flows in each task. Fig. 2(b)-(d) show the scheduling results of Baraat, Varys, and our proposed TAPS, respectively.

With the Baraat [7] scheduling scheme, an earlier-arrived task has higher priority, so task t_1 starts first. Since the priority of flows inside a task obeys SJF, flow f_1^1 starts first, which results in the failure of flow f_2^1. Although Baraat schedules flows at task granularity, it fails to complete any task, as Fig. 2(b) shows.
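To make the preemptive, task-level EDF idea concrete, the following Python sketch simulates a single bottleneck link of unit capacity under task-level EDF with preemption, in the spirit of the TAPS schedule in Fig. 2(d). It is a minimal illustration only: the flow sizes and deadlines below are hypothetical (they are not the values of Fig. 2), and the scheduler is a simplification rather than the TAPS algorithm itself.

# Minimal, hypothetical illustration of preemptive task-level EDF on one
# bottleneck link of unit capacity; one flow transmits at a time, at full rate.
# Flow sizes and deadlines are made up for illustration only.

def schedule(tasks, horizon):
    """tasks: {task_id: [(flow_id, size, deadline), ...]}; returns completed task ids."""
    remaining = {(t, f): size for t, flows in tasks.items() for f, size, _ in flows}
    deadline = {(t, f): d for t, flows in tasks.items() for f, _, d in flows}
    for now in range(horizon):
        # flows that are unfinished and can still meet their deadline
        cand = [k for k, r in remaining.items() if r > 0 and now + r <= deadline[k]]
        if not cand:
            continue
        # task-level EDF, ties broken by SJF; re-chosen every slot, hence preemptive
        k = min(cand, key=lambda c: (deadline[c], remaining[c]))
        remaining[k] -= 1
    return [t for t in tasks
            if all(remaining[(t, f)] == 0 for f, _, _ in tasks[t])]

# Hypothetical workload: a less urgent task t1 and a more urgent task t2.
tasks = {"t1": [("f1", 2, 8), ("f2", 2, 8)],
         "t2": [("f1", 2, 4), ("f2", 2, 4)]}
print(schedule(tasks, horizon=8))   # -> ['t1', 't2']: both tasks meet their deadlines

A non-preemptive FIFO scheduler that committed to t1's flows first would, with the same numbers, leave too little time for t2, which is exactly the arrival-order sensitivity discussed above.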

With the Varys [8] scheduling scheme, the earliest-arrived task is scheduled first. Like PDQ and D³, Varys is another rate control protocol, and in a deadline-sensitive environment the rate of a flow is assigned as r = s/d. The flows of t_1 are scheduled first, but after t_1 is scheduled there is not enough bandwidth left for t_2 to transmit, so t_2 is rejected outright. Varys completes 1 task in the end, as Fig. 2(c) shows. The reason why Varys rejects t_2 is that it arrives later than t_1 and Varys does not support task preemption.

To address this problem, the proposed scheduling scheme TAPS performs a global task-level optimization after each task arrives, and thus supports task preemption. Specifically, it uses the same rate control mechanism as PDQ, and it gives an accept/reject decision for a task after it arrives, based on an overall task-level scheduling optimization. When a task is accepted, all its flows are scheduled according to EDF and SJF, and the flow with the highest priority is transmitted at the full link capacity. TAPS completes 2 tasks in this example, as Fig. 2(d) shows. This example reveals that preemptive task-level scheduling can accomplish more tasks in deadline-sensitive situations.

[Fig. 3. Global scheduling motivation example. (a) shows the size, deadline, source and destination of the 4 flows on transmission; (b) shows the optimal scheduling results (transmission intervals) of each flow; (c) shows the topology, where the link capacity is 1 Gbps. Global scheduling can complete more flows before their deadlines than PDQ.]

Global Scheduling. Fig. 3 presents an example that shows the effectiveness of global scheduling. There are 4 flows which go through different paths in the network. Fig. 3(a) shows the size, deadline, source ID and destination ID of the 4 flows; Fig. 3(b) shows the optimal scheduling results and Fig. 3(c) shows the topology. If we schedule the 4 flows with PDQ [11], the scheduling process is as follows. In the 1st time unit, no switch pauses f_1, so f_1 gets rate 1; f_2 is paused at S_1 because f_1 is more critical than f_2, and f_3 is paused by S_5. Assume that the flow list in S_3 is full, so f_4 is paused by S_3. In the 2nd time unit, f_1 is completed, so f_2 and f_3 get rate 1, while f_4 is still paused by S_3. In the 3rd time unit, f_2 and f_3 are completed, but f_4 cannot be completed before its deadline, because it still has 2 size units to transmit and only 1 time unit left. In the end f_4 misses its deadline. As we can see in this example, the links S_3-S_5 and S_5-S_4 are idle in the 1st time unit, but f_4 cannot utilize them. If we schedule globally, we can make full use of these 2 links and let f_4 transmit through them. The optimal schedule for each flow in this example is shown in Fig. 3(b), and it completes all 4 flows before their deadlines. In the end, global scheduling completes all the flows, while PDQ completes only 3 flows. This example reveals that in common situations global scheduling can complete more flows than PDQ.

B. Design Goals

In current data center networks a task always consists of many flows. As mentioned above, simply maximizing the number of completed flows is not enough; making more tasks complete before their deadlines is more meaningful. We give the following design goals for TAPS:

Maximizing the number of tasks completed before their deadlines: TAPS is designed for deadline-sensitive data center environments, which expect all flows in a task to be completed before their deadlines. Note that a task is useful if and only if all flows in the task are completed before their deadlines, so it is more crucial for data centers to fulfill more tasks. This goal requires the preemptive and global scheduling of TAPS. Under this goal, unnecessary bandwidth waste should also be greatly reduced.

Online response to tasks in a dynamic data center network: The traffic in data center networks is dynamic and changes frequently [17]. When a burst of tasks arrives, the data center network should selectively accept them according to the network's tolerance capacity. TAPS is designed to respond to tasks online and dynamically.

Applicability to general data center network topologies: Nowadays most data center network topologies are multi-rooted trees, whereas current latency-aware and task-aware transport protocols, such as D³, can only be applied to single-rooted tree topologies. We extend single-path routing to multi-path routing in TAPS so that it can be applied to general data center topologies.

IV. TAPS DESIGN

In this section, we first give an overview of the overall architecture of TAPS. Then we discuss its core part, the centralized algorithm, in detail. After that we introduce the design of the controller, the servers and the switches, respectively.

A. Architecture Overview

The basic idea of TAPS is to maximize the number of tasks completed before their deadlines. The priority of flows is decided by deadline and flow size, using the EDF [16] and SJF [18] scheduling disciplines; flows with higher priority are transmitted first. In order to minimize the mean flow completion time (FCT), there is at most one flow on transmission on each link at any time [11]. In other words, once a flow starts to send, it occupies the link capacity exclusively.

TAPS leverages the Software Defined Networking (SDN) [19] framework to enforce the flow scheduling mechanisms. Fig. 4 depicts the procedure of TAPS and the messages exchanged among the controller, servers and switches. TAPS senders maintain a set of task-related variables, including flow deadline, expected transmission time and sending rate. The sender encapsulates this task-related information into a scheduling header added to a probe packet and sends the packet to the controller; when a task arrives, it is thus directly reported to the controller for scheduling. When the SDN controller receives the probe packet, it first decides whether the task can be processed by the network according to a reject rule.
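As a rough illustration of the per-flow information that a TAPS sender would place in the scheduling header of its probe packet (source, destination, size and deadline, as in Table I below), here is a minimal Python sketch. The field layout, widths and names are assumptions made for illustration; they are not the actual TAPS wire format.

# Hypothetical encoding of a TAPS scheduling header carried by a probe packet:
# (task id, flow id, source id, destination id, flow size, deadline).
import struct

HDR = "!IIIIQd"   # four uint32 ids, a uint64 size in bytes, a float64 deadline in seconds

def pack_probe(task_id, flow_id, src, dst, size, deadline):
    return struct.pack(HDR, task_id, flow_id, src, dst, size, deadline)

def unpack_probe(buf):
    task_id, flow_id, src, dst, size, deadline = struct.unpack(HDR, buf)
    return {"task": task_id, "flow": flow_id, "src": src, "dst": dst,
            "size": size, "deadline": deadline}

probe = pack_probe(task_id=7, flow_id=1, src=12, dst=88, size=2000, deadline=0.004)
print(unpack_probe(probe))

The controller would parse such a header, run the reject rule described next, and answer the sender with either a rejection or the pre-allocated time slices.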

[Fig. 4. TAPS protocol architecture. 1. A new task arrives. 2. Servers send packets containing the task information {Src_j^i, Dst_j^i, s_j^i, d_j^i} to the controller. 3. The SDN controller computes whether to accept this task. 4. If the task is accepted, the controller (4A) installs forwarding entries on the corresponding switches and (4B) sends packets including the pre-allocated time slices to the senders. 5. Otherwise, the controller informs the senders to discard this task.]

If the task is accepted by the controller, the controller calculates the time slices specifying when to transfer the flows in the task, as well as the routing path for the flows. Afterwards the controller sends the packets with the time slices to the corresponding senders, and installs the routing entries on the corresponding switches along the routing path of each flow. The senders then monitor the time and decide when to send each flow at its assigned sending rate. Intermediate switches forward the packets according to the installed default route. We present the details of TAPS in the following sections.

B. Centralized Algorithm

TABLE I. LIST OF NOTATIONS USED IN THE PAPER
F_trans: the set of all flows on transmission
F_tmp: a temporary set of flows
f_new: the newly arrived flow
Src_j^i: source server ID of flow j in task i
Dst_j^i: destination server ID of flow j in task i
s_j^i: size of flow j in task i
d_j^i: deadline of flow j in task i
E_j^i: expected transmission time of flow j in task i
A_j^i: allocated time slices for flow j in task i
L_j^i: set of links which flow j in task i goes through
O_x: the occupied time set for link x

The centralized algorithm is the core part of the TAPS architecture, and it runs in the SDN controller. The task is the unit of the accept-or-reject decision: we never discard flows of tasks that have been accepted and are transmitting, and we never waste bandwidth on any flow of a task we have decided to discard. Specifically, newly arrived tasks are accepted or discarded according to a rejecting policy, which decides whether a task needs to be discarded in order to save bandwidth for other tasks. The goal of the centralized algorithm is to check whether the newly arrived task can be handled by the network. In general, the centralized algorithm calculates a potential allocation for the newly arrived task and makes a decision according to the rejecting policy.

Problem formulation. We model the task scheduling problem in the following form. The full set of tasks to be scheduled is T = {t_1, ..., t_n}, and task t_i contains m_i flows {f_1^i, ..., f_{m_i}^i}. We assume the bandwidth of each link in the network is uniform, so each flow can always be transferred at its maximum rate. In this scenario we do not need to care about the actual size of a flow, but only about the transfer time it requires. Flow f_j^i is associated with the tuple (d_j^i, E_j^i, A_j^i, L_j^i), where d_j^i, E_j^i and A_j^i are the deadline, the expected transfer time and the allocated time slices of flow f_j^i, respectively, and L_j^i is the set of links that f_j^i is transferred through. Note that flows in the same task have the same deadline, i.e., d_j^i = d^i for every j. The task scheduling problem is then to find a set of tasks T_trans, as large as possible, that the network can process; for each flow f_j^i of each task t_i in T_trans, a scheduled set of transfer time slices is given. The objective is to finish as many tasks as possible, while ensuring that only flows within T_trans are transferred and that no bandwidth is wasted on tasks that would only be partially finished.

NP-hardness proof. We have proved that this problem is NP-hard. To prove the NP-hardness, we reduce a well-known NP-hard problem, the Hamiltonian circuit problem, to a special case of this problem. Suppose we have a graph G = (V, E), in which V = {v_i, i = 1, ..., n} is the set of all vertices and E = {e_j, j = 1, ..., m} is the set of all edges. Finding a Hamiltonian circuit in G amounts to finding a set of n edges E' ⊆ E such that each vertex in V appears exactly twice as an endpoint of edges in E'. This Hamiltonian circuit problem can be reduced to a task-based flow scheduling problem on a single link. Specifically, on a particular link there are m tasks to be scheduled, and each task contains four flows, each of which has size 2 and starts at time zero. Each task corresponds to an edge in E of the Hamiltonian circuit problem: for an edge whose endpoints are v_{i1} and v_{i2}, the four flows of the corresponding task have deadlines i_1 + 1, 2n - i_1, i_2 + 1 and 2n - i_2. Therefore, a schedule of the original problem in which n tasks are completed exists if and only if a circuit can be found in the corresponding Hamiltonian problem.

Algorithm detail. Next, we examine the centralized algorithms in detail. Alg. 1 describes the whole process of TAPS. When a new flow f_new arrives, the algorithm adds f_new to a temporary flow set F_tmp and waits a time interval T for the other flows of the same task. After adding all the flows on transmission F_trans into F_tmp, we try to allocate all the flows in the network, calculating the time slices and the route for each flow in F_tmp (Alg. 2). We denote the task which f_new belongs to as t_id. The new task t_id is then accepted or discarded according to the reject rule. The reject rule states that if one of the following situations occurs, all the flows belonging to t_id are discarded and t_id is added to the discarded task set T_discard: 1) if we accept f_new, flows of more than one task would miss their deadlines; 2) some flows inside t_id have already missed their deadlines; 3) all the deadline-missing flows belong to one task, but that task is not t_id and its completion ratio is not less than that of t_id. If, instead, the single deadline-missing task is not t_id and its completion ratio is less than that of t_id, we discard that deadline-missing task and add it to T_discard.
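The reject rule can be read as a small decision procedure. The following Python sketch is one possible reading of it; it assumes that a trial allocation for F_tmp has already been computed and that each flow is marked as meeting or missing its deadline. The helper completion_ratio and the data layout are hypothetical, introduced only for illustration.

# Hedged sketch of the reject rule: given the trial allocation for F_tmp,
# decide whether the new task t_id is accepted, and whether another task
# should be discarded instead. 'misses' maps each flow to True if it would
# miss its deadline under the trial allocation.
from collections import namedtuple

Flow = namedtuple("Flow", "task flow")

def completion_ratio(task, flows, misses):
    own = [f for f in flows if f.task == task]
    return sum(not misses[f] for f in own) / len(own)

def reject_decision(t_id, flows, misses):
    missing_tasks = {f.task for f in flows if misses[f]}
    if not missing_tasks:
        return "accept", None
    if len(missing_tasks) > 1:              # rule 1: flows of more than one task would miss
        return "reject", None
    (t,) = missing_tasks
    if t == t_id:                           # rule 2: the new task itself misses deadlines
        return "reject", None
    # rule 3: exactly one other task misses; compare completion ratios
    if completion_ratio(t, flows, misses) >= completion_ratio(t_id, flows, misses):
        return "reject", None
    return "accept", t                      # otherwise discard the other task instead

# Tiny usage: the new task t2 misses nothing; old task t1 would lose one of two flows.
flows = [Flow("t1", 1), Flow("t1", 2), Flow("t2", 1)]
misses = {flows[0]: True, flows[1]: False, flows[2]: False}
print(reject_decision("t2", flows, misses))   # -> ('accept', 't1')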

Alg. 2 is the whole process of allocating a routing path for each flow in F. Specifically, for each flow f_j^i we aim to calculate its transmission time slices A_j^i and the set of links L_j^i it goes through. The calculation for f_j^i consists of three steps. First, we compute an alternative path set P which contains all the paths f_j^i may go through. Second, we try to allocate time slices for every path p in P (Alg. 3). Finally, we find the path in P through which f_j^i can be completed the earliest; L_j^i is the set of links of this optimal path, and A_j^i is its time slice set.

Alg. 3 allocates time slices for a specific flow f_j^i when it goes through path p. For each link l_x, the time periods when it is occupied are recorded. We first compute the union T_ocp of the occupied time sets O_x of all the links in p. The complement of T_ocp is the set of times when all the links in p are idle. We allocate the transfer time slices timeslice(p, f_i) as the first E_i (expected transfer time) idle time slices, from which the flow completion time time(p, f_i) is obtained.

Algorithm 2 PathCalculation(F)
1: for each flow f_i in F do
2:   P ← ∅;
3:   add all the possible paths of f_i to P;
4:   for each path p in P do
5:     Allocation(p, f_i);
6:   end for
7:   T ← inf;
8:   for each path p in P do
9:     if T > time(p, f_i) then
10:      T ← time(p, f_i);
11:      L_i ← p;
12:      A_i ← timeslice(p, f_i);
13:    end if
14:  end for
15:  update the occupied set O_x of each link in L_i based on A_i;
16: end for

Algorithm 3 Allocation(p, f_i)
1: T_ocp ← ∅;
2: for each link l_x in p do
3:   push O_x(l_x) into T_ocp;
4: end for
5: timeslice(p, f_i) ← the first E_i time slices in the complement of T_ocp;
6: compute time(p, f_i);

C. TAPS Controller

The SDN controller runs the centralized algorithm when it receives a new flow, in order to determine whether to accept or discard it. The main functions of the controller are as follows:

Compute the route for each flow in F_trans. In Alg. 2, the controller calculates the optimal path L_j^i for f_j^i. Then the controller informs the corresponding switches to install route entries. Since a fundamental constraint of SDN is that the flow table of an SDN switch is very limited in size (usually no more than a few thousand entries), only the first k entries are installed on a particular switch, considering that there is at most one flow on a link at a time. When the controller receives an ACK that a flow has been completed or has missed its deadline, it informs the corresponding switches to withdraw the route entries.

Pre-allocate time slices for each flow in F_trans. In Alg. 3, the controller calculates the time slices A_j^i for f_j^i. After the calculation, the controller sends packets including the time slices for each flow to the senders.

Algorithm 1 Task-aware preemptive flow scheduling
1: if f_new arrives then
2:   t_id ← f_new.id;
3:   if t_id in T_discard then
4:     reject f_new;
5:   end if
6:   F_tmp ← {f_new};
7:   wait time T, and add all arriving flows to F_tmp;
8:   F_tmp ← F_tmp ∪ F_trans;
9:   sort F_tmp according to EDF and SJF;
10:  PathCalculation(F_tmp);
11:  accept or reject f_new according to the reject rule;
12: end if

D. TAPS Server

In TAPS, we add modules to the servers to perform the following functions:

Maintain the states of each flow. In TAPS, each sender maintains several state variables for flow f_j^i: the deadline d_j^i, the expected transmission time E_j^i, and the allocated time slices A_j^i.

Communicate with the controller. Once a new task arrives, the senders send a probe packet with the task information, including the source ID Src_j^i, destination ID Dst_j^i, flow size s_j^i and deadline d_j^i (where i is the task ID and j is the flow ID), to the controller. The senders then wait for the result from the controller. If the task is discarded, the senders do not transfer any flow of the task; otherwise the senders keep the pre-allocation information received from the controller.

Monitor the time to send a flow. The senders monitor the time and keep in touch with the controller to ensure time consistency. A sender sends a flow at its allocated rate at the appropriate time. When a flow is completed, the sender sends a TERM packet to the controller and removes the flow from its maintained set.

E. TAPS Switch

The switches in TAPS need no modification or additional modules to allocate rates for flows, in contrast to switches in other latency-aware protocols that employ explicit rate control [4], e.g., PDQ [11], D³ [3] and Baraat [7]. In TAPS, switches are only in charge of data forwarding: they simply forward packets according to the default entries installed by the SDN controller.

V. EVALUATION

In this section we evaluate TAPS through simulations and compare it with state-of-the-art solutions. The simulation results indicate that TAPS outperforms Baraat, Varys, PDQ [11], D³ and Fair Sharing in terms of task completion ratio and wasted bandwidth, in both single-path and multi-path situations. Furthermore, TAPS also outperforms the other 5 algorithms in terms of flow completion ratio when tasks are not taken into consideration. We first describe the simulation setup and then look into the details of the simulation results.

A. Simulation Setup

The simulations run on two different topology setups. The single-rooted tree topology is identical to the one used in Baraat and similar to those of D³ [3] and PDQ [11], as shown in Fig. 5: a three-level single-rooted tree. Each rack has 40 machines, and inside a rack a Top-of-Rack (ToR) switch connects these 40 machines with 1 Gbps links. 30 ToR switches are connected to an aggregation switch, and 30 aggregation switches are connected to a core switch. The single-rooted tree topology thus has 36,000 physical servers in 30 pods, with each pod comprising 30 racks. The multi-rooted tree topology is a 32-pod fat-tree [12] with 8192 servers and 1 Gbps links.

[Fig. 5. A single-rooted tree topology.]

The simulation data is generated in the same way as the experiment setups of D³ [3] and PDQ [11], but with additional task-level information. Each group of simulation data contains 3 tasks. The arrival times of the tasks follow a Poisson arrival model with arrival rate λ, i.e., λ tasks arrive per second on average, and each task has μ flows on average. All flows within the same task arrive at the same time, and their sending and receiving endpoints are chosen randomly when the task arrives. The deadline of each task is generated from an exponential distribution (default mean deadline = 4 ms); here the mean deadline is the mean flow deadline, i.e., the average expected completion time minus the start time of each flow. The sizes of the flows are generated from a normal distribution (default mean flow size = 2 KB). Note that all flows in the same task have the same deadline. By default, the mean number of flows per task is 2 for the single-rooted simulations and 24 for the multi-rooted simulations.

We evaluate the following six flow scheduling mechanisms with a flow-level simulator written in C++:

TAPS: TAPS is implemented as described in Sec. IV. Upon the arrival of each task, the algorithm decides whether the task should be accepted or declined. If a task is accepted, time slices are pre-allocated and the route is decided for each flow in the task.

Fair Sharing: We implement an ordinary version of Fair Sharing, which is completely agnostic of tasks and deadlines. Each flow competing for a bottleneck link gets a fair share of the link capacity.

D³: The implementation of D³ includes the improvements introduced by [11].

PDQ: We simulate PDQ with the basic Early Termination (ET) function. Suppressed Probing (SP) and Early Start (ES) [11] take buffer occupancy into account and are not appropriate in our flow-level model.

Baraat: We simulate Baraat according to the algorithm in [7].

Varys: We mainly simulate Pseudocode 1 and 2 of Varys [8], adapted to the deadline-sensitive simulations.

Since these algorithms are not naturally designed for multi-rooted tree topologies, we use flow-level ECMP to extend them to make routing decisions in multi-rooted scenarios.
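As a hedged illustration of the workload model just described (Poisson task arrivals with rate λ, roughly μ flows per task, one exponentially distributed deadline per task, normally distributed flow sizes and random endpoints), here is a small Python generator. The structure is our reading of the setup; in particular, the per-task flow-count distribution and the standard deviation of the flow-size distribution are not specified in the text and are assumptions.

# Workload generator in the spirit of Sec. V-A. Parameter defaults follow the
# values quoted in the text; the flow-count distribution and the size standard
# deviation are assumptions.
import random

def generate_tasks(n_tasks, lam, mu, n_servers,
                   mean_deadline=0.004, mean_size=2000, size_sigma=500):
    tasks, t = [], 0.0
    for tid in range(n_tasks):
        t += random.expovariate(lam)                       # Poisson arrivals
        deadline = t + random.expovariate(1.0 / mean_deadline)
        n_flows = max(1, round(random.expovariate(1.0 / mu)))   # ~mu flows on average (assumed)
        flows = []
        for fid in range(n_flows):
            src, dst = random.sample(range(n_servers), 2)  # random, distinct endpoints
            size = max(1, int(random.gauss(mean_size, size_sigma)))
            flows.append({"flow": fid, "src": src, "dst": dst,
                          "size": size, "deadline": deadline})
        tasks.append({"task": tid, "arrival": t, "flows": flows})
    return tasks

print(len(generate_tasks(n_tasks=5, lam=100, mu=20, n_servers=100)))   # -> 5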
For solutions that may start flows even when they cannot possibly finish, namely D³ and Fair Sharing, we stop sending packets of flows that have already missed their deadlines, so that useless transmission is avoided.

We generated multiple groups of simulation data with different parameters to capture the impact of different real-world factors: the mean flow deadline for task urgency, the mean flow size for task duration, and the mean number of flows per task together with the number of tasks for task diffusion. We evaluate the following three metrics. The task completion ratio is the percentage of tasks that are successfully finished before their deadlines; only tasks in which all flows meet their deadlines are counted as completed. As a contrast, we also record the flow completion ratio and the application flow throughput, which are the ratios of the total number and the total size, respectively, of flows finished before their deadlines, regardless of whether their tasks finish or not.

B. Impact of Task Urgency

In the first group of simulations, we vary the mean flow deadline from 2 ms to 6 ms. Fig. 6 shows the experiment results for the single-rooted tree. The results indicate that the performance of every algorithm increases with the mean flow deadline; intuitively, the larger the mean deadline, the easier it is for the same set of tasks to complete. TAPS outperforms the other 5 algorithms in terms of task completion ratio and application throughput. Fair Sharing behaves the worst because it is both deadline-agnostic and task-agnostic. When deadlines are very tight, the performance of D³, PDQ, Varys and Baraat is very similar, but as deadlines become larger the differences begin to emerge. The performance of PDQ and Varys is very close. The reason that TAPS beats PDQ is mainly its rejecting policy, which prevents subsequent flows from interrupting prior flows; PDQ is task-agnostic, although it can make flows complete quickly and before their deadlines. Although Varys is task-aware, it is restricted by the task arrival order. Baraat behaves as it does because it is deadline-agnostic: in deadline-sensitive scenarios, Baraat cannot complete more tasks than the other solutions except Fair Sharing.

[Fig. 6. Application throughput and task completion ratio when varying the deadline, for the single-rooted tree.]

[Fig. 8. Wasted bandwidth when varying the deadline: (a) comparison of all algorithms; (b) comparison without Fair Sharing.]

We can also see, from the difference between Fig. 6(a) and Fig. 6(b), that most algorithms achieve a higher completed-task-number ratio than completed-task-size ratio. For D³ and Fair Sharing, the application flow throughput is higher than the flow completion ratio in the same situation, which reveals that these algorithms tend to complete flows of larger size. PDQ, Varys, Baraat and TAPS show the opposite behaviour: they prefer to fulfill smaller flows, which benefits from the SJF scheduling discipline [18].

[Fig. 7. Multi-rooted simulation results when varying the deadline.]

Fig. 7 shows the experiment results for the multi-rooted tree. They are similar to Fig. 6(b), the main difference being that the growth trends of the curves are more pronounced; the general trend is the same as in Fig. 6(b).

Fig. 8 shows the wasted bandwidth of these algorithms in the single-rooted tree. Wasted bandwidth refers to packets that were transmitted successfully but whose flow eventually missed its deadline; the wasted bandwidth ratio denotes the size of such wasted packets as a percentage of the total task size (written out as an equation after this subsection). Fig. 8(a) shows that Fair Sharing wastes the most bandwidth, and Fig. 8(b) gives a detailed comparison of the other algorithms. The low wasted bandwidth of TAPS comes from its reject policy: a flow that may miss its deadline is not accepted and is never transmitted in the first place. The wasted bandwidth ratio of Baraat is also very high, which shows that its deadline-agnostic design makes Baraat waste plenty of bandwidth. D³ and PDQ share similar results; both are deadline-aware, but PDQ saves more bandwidth than D³. Among them, Varys saves the most bandwidth, benefiting from a reject policy similar to that of TAPS.

C. Impact of Task Duration

In the second group of simulations, we vary the mean flow size from 6 KB to 30 KB in the single-rooted tree topology. Fig. 9 indicates that the other algorithms can hardly complete tasks when the flow size is large, while TAPS achieves a higher completion ratio because of its task awareness, rejecting policy and deadline awareness. Although PDQ can complete more flows and more packets than D³ and Fair Sharing, without global scheduling and near-optimal routing its performance is much lower than that of TAPS. The results of Fig. 9 are similar to those of Sec. V-B in that TAPS outperforms the other algorithms in terms of task completion ratio. In contrast, the performance of D³ is very poor compared to the other algorithms when the flow size is very large.
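In equation form, the wasted bandwidth ratio used in Fig. 8 can be written as follows, where B_f is the number of bytes of flow f that were actually transmitted and s_j^i are the flow sizes (our notation, not the paper's):

\[
\text{Wasted Bandwidth Ratio} \;=\; \frac{\displaystyle\sum_{f \text{ misses its deadline}} B_f}{\displaystyle\sum_{i}\sum_{j} s_j^i}
\]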

[Fig. 9. Application throughput and task completion ratio when varying the flow size: (a) application throughput; (b) task completion ratio.]

D. Impact of Task Diffusion

In the third group of simulations, we vary the mean number of flows per task from 4 to 20 and the task number from 3 to 27 in the single-rooted tree topology.

[Fig. 10. Task completion ratio when varying the number of flows per task.]

[Fig. 11. Flow completion ratio when varying the flow number.]

Fig. 11 demonstrates the near-optimal property of TAPS. The setup of this simulation follows Sec. V-A, except that each task contains only one flow, so task and flow are equivalent here and the task completion ratio equals the flow completion ratio. There are 36,000 tasks in this simulation. We can see that TAPS still outperforms the other 5 algorithms in terms of flow completion ratio, while PDQ outperforms Varys more clearly in Fig. 11. The trends of the remaining algorithms are very similar to those in the figures above.

[Fig. 12. Task completion ratio when varying the task number.]

Fig. 10 and Fig. 12 indicate that as the task number and the flow number increase, the performance of every algorithm decreases. The advantage of TAPS mainly comes from its task awareness and its proper rejection of newly arriving tasks. Thus, under different levels of task diffusion, task awareness plays the most important role in TAPS's performance.

VI. IMPLEMENTATION AND EXPERIMENTS

We deployed TAPS with a software-based controller on a small-scale testbed with a partial fat-tree topology, as shown in Fig. 13. The testbed includes 8 end-hosts arranged across 4 racks and two pods. All the servers are desktops with 2.8 GHz Intel Core 2 Duo E7400 processors and 2 GB of RAM, equipped with Intel 82574L Gigabit Ethernet cards. Each rack has a top-of-rack (ToR) switch which is connected to an aggregate switch. The aggregate switches are connected by core switches; the switches are layer-3 switches (H3C S5500-24P-SI series) and are configured dynamically by the controller, which instructs the servers when to send flows and which flow to send. To evaluate the benefits of exploiting deadline and task information, we compare TAPS with Fair Sharing. For both scheduling approaches, a new flow is directed to the controller by the sender, which has a virtual switch inside, when the flow is generated by a virtual machine. Iperf [20] is used to generate flows in our implementation.

The average flow size and the average deadline (4 ms) are set similarly to Sec. V-A, and the source and destination IDs are generated randomly. We use the effective application throughput as the metric, i.e., the useful data transmitted per unit of time.

[Fig. 13. A partial fat-tree testbed: core switches, aggregate switches, edge switches and servers.]

[Fig. 14. Implementation results: effective application throughput (%) over time for TAPS and Fair Sharing.]

As Fig. 14 shows, TAPS achieves a high effective application throughput, close to 100%, whereas Fair Sharing fails to achieve a stable effective application throughput and stays much lower than TAPS. In TAPS, since a flow occupies the link bandwidth exclusively and there is no competition for link bandwidth among flows, the link bandwidth can be fully utilized. In contrast, Fair Sharing reaches a mean effective application throughput of only up to 60% due to flow competition and its deadline-agnostic design. The tail of the TAPS curve descends little by little, because once a sender has finished all of its flows some bottleneck links become idle, and the idle bandwidth no longer contributes to the throughput. The Fair Sharing curve, on the other hand, changes rapidly as different flows miss their deadlines.

The implementation results demonstrate that TAPS makes highly effective utilization of network bandwidth and fulfills many more tasks than the Fair Sharing transport protocol, thereby saving network bandwidth resources.

VII. CONCLUSION

We proposed TAPS, a task-level deadline-aware scheduling algorithm for data centers which aims to complete more tasks, instead of flows, before their deadlines. We leverage SDN and further generalize SDN from flow-level awareness to task-level awareness. TAPS provides a centralized algorithm which runs on the SDN controller to decide whether a task should be accepted or discarded. If the task is accepted, the controller pre-allocates the transmission time slices for the flows of the task and computes the routing paths for the accepted flows. Extensive flow-level simulations demonstrate that TAPS outperforms PDQ [11], D³ [3] and Fair Sharing transport protocols in deadline-sensitive data center network environments, in both single-path and multi-path paradigms. A simple implementation on a real system also shows that TAPS achieves highly effective utilization of network bandwidth in data centers.

REFERENCES

[1] D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F. Wenisch, "Power management of online data-intensive services," in Proc. 38th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2011.
[2] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, "Data center TCP (DCTCP)," ACM SIGCOMM Computer Communication Review, vol. 40, no. 4, 2010.
[3] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowtron, "Better never than late: Meeting deadlines in datacenter networks," in Proc. ACM SIGCOMM, pp. 50-61, 2011.
[4] N. Dukkipati and N. McKeown, "Why flow-completion time is the right metric for congestion control," ACM SIGCOMM Computer Communication Review, vol. 36, no. 1, 2006.
[5] H. Wu, Z. Feng, C. Guo, and Y. Zhang, "ICTCP: Incast congestion control for TCP in data-center networks," vol. 21, IEEE Press, 2013.
[6] B. Vamanan, J. Hasan, and T. Vijaykumar, "Deadline-aware datacenter TCP (D2TCP)," ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, 2012.
[7] F. R. Dogar, T. Karagiannis, H. Ballani, and A. Rowstron, "Decentralized task-aware scheduling for data center networks," 2013.
[8] M. Chowdhury, Y. Zhong, and I. Stoica, "Efficient coflow scheduling with Varys," in Proc. ACM SIGCOMM, 2014.
[9] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[10] A. Shieh, S. Kandula, A. G. Greenberg, C. Kim, and B. Saha, "Sharing the data center network," in Proc. USENIX NSDI, 2011.
[11] C.-Y. Hong, M. Caesar, and P. Godfrey, "Finishing flows quickly with preemptive scheduling," ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, 2012.
[12] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," ACM SIGCOMM Computer Communication Review, vol. 38, 2008.
[13] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, "BCube: A high performance, server-centric network architecture for modular data centers," ACM SIGCOMM Computer Communication Review, vol. 39, no. 4, 2009.
[14] D. Li, C. Guo, H. Wu, K. Tan, Y. Zhang, and S. Lu, "FiConn: Using backup port for server interconnection in data centers," in Proc. IEEE INFOCOM, 2009.
[15] M. Alizadeh, A. Kabbani, T. Edsall, B. Prabhakar, A. Vahdat, and M. Yasuda, "Less is more: Trading a little bandwidth for ultra-low latency in the data center," in Proc. USENIX NSDI, 2012.
[16] V. Sivaraman, F. M. Chiussi, and M. Gerla, "End-to-end statistical delay service under GPS and EDF scheduling: A comparison study," in Proc. IEEE INFOCOM, 2001.
[17] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, "The nature of data center traffic: Measurements & analysis," in Proc. ACM IMC, 2009.
[18] N. Bansal and M. Harchol-Balter, "Analysis of SRPT scheduling: Investigating unfairness," vol. 29, ACM, 2001.
[19] Software-defined networking (SDN).
[20] Iperf.


GRIN: Utilizing the Empty Half of Full Bisection Networks GRIN: Utilizing the Empty Half of Full Bisection Networks Alexandru Agache University Politehnica of Bucharest Costin Raiciu University Politehnica of Bucharest Abstract Various full bisection designs

More information

Call Admission Control in IP networks with QoS support

Call Admission Control in IP networks with QoS support Call Admission Control in IP networks with QoS support Susana Sargento, Rui Valadas and Edward Knightly Instituto de Telecomunicações, Universidade de Aveiro, P-3810 Aveiro, Portugal ECE Department, Rice

More information

Transport Protocols for Data Center Communication. Evisa Tsolakou Supervisor: Prof. Jörg Ott Advisor: Lect. Pasi Sarolahti

Transport Protocols for Data Center Communication. Evisa Tsolakou Supervisor: Prof. Jörg Ott Advisor: Lect. Pasi Sarolahti Transport Protocols for Data Center Communication Evisa Tsolakou Supervisor: Prof. Jörg Ott Advisor: Lect. Pasi Sarolahti Contents Motivation and Objectives Methodology Data Centers and Data Center Networks

More information

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

DIBS: Just-in-time congestion mitigation for Data Centers

DIBS: Just-in-time congestion mitigation for Data Centers DIBS: Just-in-time congestion mitigation for Data Centers Kyriakos Zarifis, Rui Miao, Matt Calder, Ethan Katz-Bassett, Minlan Yu, Jitendra Padhye University of Southern California Microsoft Research Summary

More information

Data Center TCP(DCTCP)

Data Center TCP(DCTCP) Data Center TCP(DCTCP) Mohammad Alizadeh * +, Albert Greenberg *, David A. Maltz *, Jitendra Padhye *, Parveen Patel *, Balaji Prabhakar +, Sudipta Sengupta *, Murari Sridharan * * + Microsoft Research

More information

Improvement of Buffer Scheme for Delay Tolerant Networks

Improvement of Buffer Scheme for Delay Tolerant Networks Improvement of Buffer Scheme for Delay Tolerant Networks Jian Shen 1,2, Jin Wang 1,2, Li Ma 1,2, Ilyong Chung 3 1 Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science

More information

Delay Controlled Elephant Flow Rerouting in Software Defined Network

Delay Controlled Elephant Flow Rerouting in Software Defined Network 1st International Conference on Advanced Information Technologies (ICAIT), Nov. 1-2, 2017, Yangon, Myanmar Delay Controlled Elephant Flow Rerouting in Software Defined Network Hnin Thiri Zaw, Aung Htein

More information

Data Center Network Topologies II

Data Center Network Topologies II Data Center Network Topologies II Hakim Weatherspoon Associate Professor, Dept of Computer cience C 5413: High Performance ystems and Networking April 10, 2017 March 31, 2017 Agenda for semester Project

More information

Priority Traffic CSCD 433/533. Advanced Networks Spring Lecture 21 Congestion Control and Queuing Strategies

Priority Traffic CSCD 433/533. Advanced Networks Spring Lecture 21 Congestion Control and Queuing Strategies CSCD 433/533 Priority Traffic Advanced Networks Spring 2016 Lecture 21 Congestion Control and Queuing Strategies 1 Topics Congestion Control and Resource Allocation Flows Types of Mechanisms Evaluation

More information

arxiv: v2 [cs.ni] 12 Jun 2012

arxiv: v2 [cs.ni] 12 Jun 2012 Finishing Flows Quickly with Preemptive Scheduling Chi-Yao Hong UIUC cyhong@illinois.edu Matthew Caesar UIUC caesar@illinois.edu P. Brighten Godfrey UIUC pbg@illinois.edu arxiv:6.7v [cs.ni] Jun ABSTRACT

More information

Utilizing Datacenter Networks: Centralized or Distributed Solutions?

Utilizing Datacenter Networks: Centralized or Distributed Solutions? Utilizing Datacenter Networks: Centralized or Distributed Solutions? Costin Raiciu Department of Computer Science University Politehnica of Bucharest We ve gotten used to great applications Enabling Such

More information

CORAL: A Multi-Core Lock-Free Rate Limiting Framework

CORAL: A Multi-Core Lock-Free Rate Limiting Framework : A Multi-Core Lock-Free Rate Limiting Framework Zhe Fu,, Zhi Liu,, Jiaqi Gao,, Wenzhe Zhou, Wei Xu, and Jun Li, Department of Automation, Tsinghua University, China Research Institute of Information Technology,

More information

Sincronia: Near-Optimal Network Design for Coflows. Shijin Rajakrishnan. Joint work with

Sincronia: Near-Optimal Network Design for Coflows. Shijin Rajakrishnan. Joint work with Sincronia: Near-Optimal Network Design for Coflows Shijin Rajakrishnan Joint work with Saksham Agarwal Akshay Narayan Rachit Agarwal David Shmoys Amin Vahdat Traditional Applications: Care about performance

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

Lecture 15: Datacenter TCP"

Lecture 15: Datacenter TCP Lecture 15: Datacenter TCP" CSE 222A: Computer Communication Networks Alex C. Snoeren Thanks: Mohammad Alizadeh Lecture 15 Overview" Datacenter workload discussion DC-TCP Overview 2 Datacenter Review"

More information

Performance Consequences of Partial RED Deployment

Performance Consequences of Partial RED Deployment Performance Consequences of Partial RED Deployment Brian Bowers and Nathan C. Burnett CS740 - Advanced Networks University of Wisconsin - Madison ABSTRACT The Internet is slowly adopting routers utilizing

More information

CS557: Queue Management

CS557: Queue Management CS557: Queue Management Christos Papadopoulos Remixed by Lorenzo De Carli 1 Congestion Control vs. Resource Allocation Network s key role is to allocate its transmission resources to users or applications

More information

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Ali Al-Dhaher, Tricha Anjali Department of Electrical and Computer Engineering Illinois Institute of Technology Chicago, Illinois

More information

Lixia Zhang M. I. T. Laboratory for Computer Science December 1985

Lixia Zhang M. I. T. Laboratory for Computer Science December 1985 Network Working Group Request for Comments: 969 David D. Clark Mark L. Lambert Lixia Zhang M. I. T. Laboratory for Computer Science December 1985 1. STATUS OF THIS MEMO This RFC suggests a proposed protocol

More information

Demand-Aware Flow Allocation in Data Center Networks

Demand-Aware Flow Allocation in Data Center Networks Demand-Aware Flow Allocation in Data Center Networks Dmitriy Kuptsov Aalto University/HIIT Espoo, Finland dmitriy.kuptsov@hiit.fi Boris Nechaev Aalto University/HIIT Espoo, Finland boris.nechaev@hiit.fi

More information

Lecture 21. Reminders: Homework 6 due today, Programming Project 4 due on Thursday Questions? Current event: BGP router glitch on Nov.

Lecture 21. Reminders: Homework 6 due today, Programming Project 4 due on Thursday Questions? Current event: BGP router glitch on Nov. Lecture 21 Reminders: Homework 6 due today, Programming Project 4 due on Thursday Questions? Current event: BGP router glitch on Nov. 7 http://money.cnn.com/2011/11/07/technology/juniper_internet_outage/

More information

Lecture 16: Data Center Network Architectures

Lecture 16: Data Center Network Architectures MIT 6.829: Computer Networks Fall 2017 Lecture 16: Data Center Network Architectures Scribe: Alex Lombardi, Danielle Olson, Nicholas Selby 1 Background on Data Centers Computing, storage, and networking

More information

A Method Based on Data Fragmentation to Increase the Performance of ICTCP During Incast Congestion in Networks

A Method Based on Data Fragmentation to Increase the Performance of ICTCP During Incast Congestion in Networks A Method Based on Data Fragmentation to Increase the Performance of ICTCP During Incast Congestion in Networks Sneha Sebastian P G Scholar, Dept. of Computer Science and Engg. Amal Jyothi College of Engg.

More information

Cloud e Datacenter Networking

Cloud e Datacenter Networking Cloud e Datacenter Networking Università degli Studi di Napoli Federico II Dipartimento di Ingegneria Elettrica e delle Tecnologie dell Informazione DIETI Laurea Magistrale in Ingegneria Informatica Prof.

More information

Congestion Control In the Network

Congestion Control In the Network Congestion Control In the Network Brighten Godfrey cs598pbg September 9 2010 Slides courtesy Ion Stoica with adaptation by Brighten Today Fair queueing XCP Announcements Problem: no isolation between flows

More information

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services Overview 15-441 15-441 Computer Networking 15-641 Lecture 19 Queue Management and Quality of Service Peter Steenkiste Fall 2016 www.cs.cmu.edu/~prs/15-441-f16 What is QoS? Queuing discipline and scheduling

More information

Application-Aware SDN Routing for Big-Data Processing

Application-Aware SDN Routing for Big-Data Processing Application-Aware SDN Routing for Big-Data Processing Evaluation by EstiNet OpenFlow Network Emulator Director/Prof. Shie-Yuan Wang Institute of Network Engineering National ChiaoTung University Taiwan

More information

CS551 Router Queue Management

CS551 Router Queue Management CS551 Router Queue Management Bill Cheng http://merlot.usc.edu/cs551-f12 1 Congestion Control vs. Resource Allocation Network s key role is to allocate its transmission resources to users or applications

More information

Deconstructing Datacenter Packet Transport

Deconstructing Datacenter Packet Transport Deconstructing Datacenter Packet Transport Mohammad Alizadeh, Shuang Yang, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker Stanford University U.C. Berkeley / ICSI {alizade, shyang, skatti,

More information

Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen

Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen Presenter: Haiying Shen Associate professor *Department of Electrical and Computer Engineering, Clemson University,

More information

Guaranteeing Heterogeneous Bandwidth Demand in Multitenant Data Center Networks

Guaranteeing Heterogeneous Bandwidth Demand in Multitenant Data Center Networks 1648 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 23, NO. 5, OCTOBER 2015 Guaranteeing Heterogeneous Bandwidth Demand in Multitenant Data Center Networks Dan Li, Member, IEEE, Jing Zhu, Jianping Wu, Fellow,

More information

DiffServ Architecture: Impact of scheduling on QoS

DiffServ Architecture: Impact of scheduling on QoS DiffServ Architecture: Impact of scheduling on QoS Abstract: Scheduling is one of the most important components in providing a differentiated service at the routers. Due to the varying traffic characteristics

More information

Congestion Control in Mobile Ad-Hoc Networks

Congestion Control in Mobile Ad-Hoc Networks Congestion Control in Mobile Ad-Hoc Networks 1 Sandeep Rana, 2 Varun Pundir, 3 Ram Sewak Singh, 4 Deepak Yadav 1, 2, 3, 4 Shanti Institute of Technology, Meerut Email: sandeepmietcs@gmail.com Email: varunpundir@hotmail.com

More information

Network Control and Signalling

Network Control and Signalling Network Control and Signalling 1. Introduction 2. Fundamentals and design principles 3. Network architecture and topology 4. Network control and signalling 5. Network components 5.1 links 5.2 switches

More information

This is a repository copy of M21TCP: Overcoming TCP Incast Congestion in Data Centres.

This is a repository copy of M21TCP: Overcoming TCP Incast Congestion in Data Centres. This is a repository copy of M21TCP: Overcoming TCP Incast Congestion in Data Centres. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/89460/ Version: Accepted Version Proceedings

More information

phost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric

phost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric phost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric Peter X. Gao petergao@berkeley.edu Rachit Agarwal ragarwal@berkeley.edu Akshay Narayan akshay@berkeley.edu Sylvia Ratnasamy

More information

Delayed reservation decision in optical burst switching networks with optical buffers

Delayed reservation decision in optical burst switching networks with optical buffers Delayed reservation decision in optical burst switching networks with optical buffers G.M. Li *, Victor O.K. Li + *School of Information Engineering SHANDONG University at WEIHAI, China + Department of

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

6.888 Lecture 5: Flow Scheduling

6.888 Lecture 5: Flow Scheduling 6.888 Lecture 5: Flow Scheduling Mohammad Alizadeh Spring 2016 1 Datacenter Transport Goal: Complete flows quickly / meet deadlines Short flows (e.g., query, coordina1on) Large flows (e.g., data update,

More information

Making High Bandwidth But Low Revenue Per Bit Network Applications Profitable

Making High Bandwidth But Low Revenue Per Bit Network Applications Profitable Making High Bandwidth But Low Revenue Per Bit Network Applications Profitable Abstract IMPLICITLY, ALL PREVAILING NEW NETWORK TECHNOLOGY MARKETING, REGARDLESS OF THE ACRONYMS AND BUZZWORDS USED, AVOID

More information

A Generalized Blind Scheduling Policy

A Generalized Blind Scheduling Policy A Generalized Blind Scheduling Policy Hanhua Feng 1, Vishal Misra 2,3 and Dan Rubenstein 2 1 Infinio Systems 2 Columbia University in the City of New York 3 Google TTIC SUMMER WORKSHOP: DATA CENTER SCHEDULING

More information

Slytherin: Dynamic, Network-assisted Prioritization of Tail Packets in Datacenter Networks

Slytherin: Dynamic, Network-assisted Prioritization of Tail Packets in Datacenter Networks Slytherin: Dynamic, Network-assisted Prioritization of Tail Packets in Datacenter Networks Hamed Rezaei University of Illinois at Chicago, USA Email: hrezae2@uic.edu Mojtaba Malekpourshahraki University

More information

Data Center TCP (DCTCP)

Data Center TCP (DCTCP) Data Center TCP (DCTCP) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan Stanford University MicrosoD Research Case

More information

A Scalable, Commodity Data Center Network Architecture

A Scalable, Commodity Data Center Network Architecture A Scalable, Commodity Data Center Network Architecture B Y M O H A M M A D A L - F A R E S A L E X A N D E R L O U K I S S A S A M I N V A H D A T P R E S E N T E D B Y N A N X I C H E N M A Y. 5, 2 0

More information

Pricing Intra-Datacenter Networks with

Pricing Intra-Datacenter Networks with Pricing Intra-Datacenter Networks with Over-Committed Bandwidth Guarantee Jian Guo 1, Fangming Liu 1, Tao Wang 1, and John C.S. Lui 2 1 Cloud Datacenter & Green Computing/Communications Research Group

More information

TCP Bandwidth Allocation for Virtual Networks

TCP Bandwidth Allocation for Virtual Networks TCP Bandwidth Allocation for Virtual Networks Shuoh-Ren Tsai Department of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology, Taiwan shawn@nkfust.edu.tw

More information

Decentralized Task-Aware Scheduling for Data Center Networks

Decentralized Task-Aware Scheduling for Data Center Networks Decentralized Task-Aware Scheduling for Data Center Networks Fahad R. Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony Rowstron Microsoft Research {fdogar, thomkar, hiballan, antr}@microsoft.com ABSTRACT

More information

Buffer Management for Self-Similar Network Traffic

Buffer Management for Self-Similar Network Traffic Buffer Management for Self-Similar Network Traffic Faranz Amin Electrical Engineering and computer science Department Yazd University Yazd, Iran farnaz.amin@stu.yazd.ac.ir Kiarash Mizanian Electrical Engineering

More information

Dynamic Distributed Flow Scheduling with Load Balancing for Data Center Networks

Dynamic Distributed Flow Scheduling with Load Balancing for Data Center Networks Available online at www.sciencedirect.com Procedia Computer Science 19 (2013 ) 124 130 The 4th International Conference on Ambient Systems, Networks and Technologies. (ANT 2013) Dynamic Distributed Flow

More information

Multi-path based Algorithms for Data Transfer in the Grid Environment

Multi-path based Algorithms for Data Transfer in the Grid Environment New Generation Computing, 28(2010)129-136 Ohmsha, Ltd. and Springer Multi-path based Algorithms for Data Transfer in the Grid Environment Muzhou XIONG 1,2, Dan CHEN 2,3, Hai JIN 1 and Song WU 1 1 School

More information

Towards Deadline Guaranteed Cloud Storage Services Guoxin Liu, Haiying Shen, and Lei Yu

Towards Deadline Guaranteed Cloud Storage Services Guoxin Liu, Haiying Shen, and Lei Yu Towards Deadline Guaranteed Cloud Storage Services Guoxin Liu, Haiying Shen, and Lei Yu Presenter: Guoxin Liu Ph.D. Department of Electrical and Computer Engineering, Clemson University, Clemson, USA Computer

More information

Analyzing the Receiver Window Modification Scheme of TCP Queues

Analyzing the Receiver Window Modification Scheme of TCP Queues Analyzing the Receiver Window Modification Scheme of TCP Queues Visvasuresh Victor Govindaswamy University of Texas at Arlington Texas, USA victor@uta.edu Gergely Záruba University of Texas at Arlington

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

Backup segments. Path after failure recovery. Fault. Primary channel. Initial path D1 D2. Primary channel 1. Backup channel 1.

Backup segments. Path after failure recovery. Fault. Primary channel. Initial path D1 D2. Primary channel 1. Backup channel 1. A Segmented Backup Scheme for Dependable Real Time Communication in Multihop Networks Gummadi P. Krishna M. Jnana Pradeep and C. Siva Ram Murthy Department of Computer Science and Engineering Indian Institute

More information

A Network-aware Scheduler in Data-parallel Clusters for High Performance

A Network-aware Scheduler in Data-parallel Clusters for High Performance A Network-aware Scheduler in Data-parallel Clusters for High Performance Zhuozhao Li, Haiying Shen and Ankur Sarker Department of Computer Science University of Virginia May, 2018 1/61 Data-parallel clusters

More information

arxiv: v1 [cs.ni] 11 Oct 2016

arxiv: v1 [cs.ni] 11 Oct 2016 A Flat and Scalable Data Center Network Topology Based on De Bruijn Graphs Frank Dürr University of Stuttgart Institute of Parallel and Distributed Systems (IPVS) Universitätsstraße 38 7569 Stuttgart frank.duerr@ipvs.uni-stuttgart.de

More information

Hardware Evolution in Data Centers

Hardware Evolution in Data Centers Hardware Evolution in Data Centers 2004 2008 2011 2000 2013 2014 Trend towards customization Increase work done per dollar (CapEx + OpEx) Paolo Costa Rethinking the Network Stack for Rack-scale Computers

More information

QoS Featured Wireless Virtualization based on Hardware

QoS Featured Wireless Virtualization based on Hardware QoS Featured Wireless Virtualization based on 802.11 Hardware Cong Wang and Michael Zink Department of Electrical and Computer Engineering University of Massachusetts, Amherst, MA 01003 {cwang, zink} @ecs.umass.edu

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

A Dynamic TDMA Protocol Utilizing Channel Sense

A Dynamic TDMA Protocol Utilizing Channel Sense International Conference on Electromechanical Control Technology and Transportation (ICECTT 2015) A Dynamic TDMA Protocol Utilizing Channel Sense ZHOU De-min 1, a, LIU Yun-jiang 2,b and LI Man 3,c 1 2

More information

Alizadeh, M. et al., " CONGA: distributed congestion-aware load balancing for datacenters," Proc. of ACM SIGCOMM '14, 44(4): , Oct

Alizadeh, M. et al.,  CONGA: distributed congestion-aware load balancing for datacenters, Proc. of ACM SIGCOMM '14, 44(4): , Oct CONGA Paper Review By Buting Ma and Taeju Park Paper Reference Alizadeh, M. et al., " CONGA: distributed congestion-aware load balancing for datacenters," Proc. of ACM SIGCOMM '14, 44(4):503-514, Oct.

More information

ED STIC - Proposition de Sujets de Thèse. pour la campagne d'allocation de thèses 2017

ED STIC - Proposition de Sujets de Thèse. pour la campagne d'allocation de thèses 2017 ED STIC - Proposition de Sujets de Thèse pour la campagne d'allocation de thèses 2017 Axe Sophi@Stic : Titre du sujet : aucun Joint Application and Network Optimization of Big Data Analytics Mention de

More information

Fault tolerant scheduling in real time systems

Fault tolerant scheduling in real time systems tolerant scheduling in real time systems Afrin Shafiuddin Department of Electrical and Computer Engineering University of Wisconsin-Madison shafiuddin@wisc.edu Swetha Srinivasan Department of Electrical

More information

Varys. Efficient Coflow Scheduling. Mosharaf Chowdhury, Yuan Zhong, Ion Stoica. UC Berkeley

Varys. Efficient Coflow Scheduling. Mosharaf Chowdhury, Yuan Zhong, Ion Stoica. UC Berkeley Varys Efficient Coflow Scheduling Mosharaf Chowdhury, Yuan Zhong, Ion Stoica UC Berkeley Communication is Crucial Performance Facebook analytics jobs spend 33% of their runtime in communication 1 As in-memory

More information

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras Week 05 Lecture 18 CPU Scheduling Hello. In this lecture, we

More information

packet-switched networks. For example, multimedia applications which process

packet-switched networks. For example, multimedia applications which process Chapter 1 Introduction There are applications which require distributed clock synchronization over packet-switched networks. For example, multimedia applications which process time-sensitive information

More information

Towards a Robust Protocol Stack for Diverse Wireless Networks Arun Venkataramani

Towards a Robust Protocol Stack for Diverse Wireless Networks Arun Venkataramani Towards a Robust Protocol Stack for Diverse Wireless Networks Arun Venkataramani (in collaboration with Ming Li, Devesh Agrawal, Deepak Ganesan, Aruna Balasubramanian, Brian Levine, Xiaozheng Tie at UMass

More information

A closer look at network structure:

A closer look at network structure: T1: Introduction 1.1 What is computer network? Examples of computer network The Internet Network structure: edge and core 1.2 Why computer networks 1.3 The way networks work 1.4 Performance metrics: Delay,

More information

Variable Step Fluid Simulation for Communication Network

Variable Step Fluid Simulation for Communication Network Variable Step Fluid Simulation for Communication Network Hongjoong Kim 1 and Junsoo Lee 2 1 Korea University, Seoul, Korea, hongjoong@korea.ac.kr 2 Sookmyung Women s University, Seoul, Korea, jslee@sookmyung.ac.kr

More information

DiffServ Architecture: Impact of scheduling on QoS

DiffServ Architecture: Impact of scheduling on QoS DiffServ Architecture: Impact of scheduling on QoS Introduction: With the rapid growth of the Internet, customers are demanding multimedia applications such as telephony and video on demand, to be available

More information

Computer Networks. Sándor Laki ELTE-Ericsson Communication Networks Laboratory

Computer Networks. Sándor Laki ELTE-Ericsson Communication Networks Laboratory Computer Networks Sándor Laki ELTE-Ericsson Communication Networks Laboratory ELTE FI Department Of Information Systems lakis@elte.hu http://lakis.web.elte.hu Based on the slides of Laurent Vanbever. Further

More information