Virtual Private Machines: A Resource Abstraction for Multi-Core Computer Systems

Kyle J. Nesbit
University of Wisconsin-Madison
Department of Electrical and Computer Engineering
kjnesbit@ece.wisc.edu

James Laudon
Google Inc.
jlaudon@google.com

James E. Smith
University of Wisconsin-Madison
Department of Electrical and Computer Engineering
jes@ece.wisc.edu

Abstract

Virtual Private Machines (VPMs) are an abstraction for managing resource sharing in multi-core computer systems. A VPM consists of a complete set of resources, which includes both spatial (microarchitecture) and temporal (processor time slice) resources. Tasks assigned VPMs achieve a minimum level of performance regardless of the other tasks in the system; that is, a VPM provides performance isolation. The VPM abstraction provides the interface between a system's resource management policies and mechanisms. VPM policies, implemented primarily in software, translate system-level performance requirements into VPM assignments. Then VPM mechanisms, implemented in hardware, enforce the VPM assignments. To illustrate the potential of the VPM abstraction, we propose and implement a complete set of VPM policies and mechanisms. The policies translate applications' system-level Quality of Service requirements into VPMs and distribute unassigned and unused resources in order to optimize aggregate system-level performance. A simulation-based study shows that the proposed VPM policies and mechanisms, in combination, provide a high degree of QoS and can significantly improve aggregate performance.

1. Introduction

Sharing hardware resources among concurrently executing tasks improves the overall efficiency of a computer system. If poorly managed, however, resource sharing can also lead to poor Quality of Service (QoS) and unpredictable performance. Traditionally, shared resources are managed at coarse granularities: complete processor(s) and real memory pages. Conventional hardware ISA features and mechanisms allow an OS to implement policies for managing these shared resources efficiently. With the evolution toward single-chip multi-core hardware, however, concurrently executing threads share fine-grained resources at the microarchitecture level. This poses a problem because contemporary systems do not provide mechanisms and ISA support for managing microarchitecture resource sharing. In the absence of such support, hardware-implemented sharing policies are unavoidable. Hardware-implemented policies have a number of disadvantages; for example, they lack flexibility and may conflict with OS objectives and policies.

Consequently, future multi-core computer systems should employ coordinated hardware and software resource management. Doing so is difficult, however. Firstly, modern multi-core chips contain several inter-dependent shared resources, e.g., cache storage and memory system bandwidth. Secondly, modern high-performance, high-volume chips are used in diverse environments, e.g., from embedded devices to servers. Such environments have unique workload- and system-dependent performance objectives. Thirdly, performance objectives for a given system may conflict. For example, providing each thread with a guaranteed level of service from each of the shared resources (our definition of QoS) often conflicts with providing maximum aggregate performance over the collection of threads. Furthermore, achieving a good balance between these conflicting objectives is both workload and system dependent. Therefore, microarchitecture resource management requires a well-structured framework for building resource management solutions that can be tailored to a system's specific performance requirements. To be effective, such a framework must be consistent with well-established operating system principles and have a distinct separation between mechanisms and policies [7], where a system's mechanisms provide a universal set of workload-independent resource management primitives and policies provide system- and workload-dependent resource management solutions. To this end, we propose the Virtual Private Machine (VPM) framework.

1.1 Virtual Private Machine Framework

The real resources of a machine can be distributed into a number of VPMs, where each VPM consists of spatial and temporal components. The spatial component of a VPM specifies the fractions of the system's microarchitecture resources that are dedicated to that VPM [22][23][3]. As an example, consider a baseline system containing four processors with private L1 caches and a shared L2 cache, main memory, and supporting interconnection structures. In Figure 1a, a policy has distributed these resources amongst four VPMs. Each of the VPMs contains a single processor. The policy assigns VPM 1 a significant fraction (50%) of the resources to support a demanding multimedia application and assigns the other three VPMs a much lower fraction of resources (10% each). These assignments leave 20% of the cache and memory resources unallocated, which is excess service. Excess service also includes allocated, but unused, resources. In the VPM framework, excess service policies distribute the system's excess service among the active tasks, thereby assuring that resources are not wasted if there is a task that can use them.

The temporal component of a VPM is based on the well-established concept of Generalized Processor Sharing (GPS) [24] (Figure 1b), and it specifies the fraction of processor time (processor time slices) that a VPM's spatial resources are dedicated to the VPM.¹

¹ In general, the spatial component of a VPM may incorporate multiple processors, and, although they pose no special difficulties, multiprocessor VPMs are outside the scope of this paper.

As with spatial VPM resources, there may be excess temporal resources.

The VPM abstraction provides the conceptual interface between software policies and hardware mechanisms. Software policies translate tasks' performance requirements into VPM resource assignments, and hardware mechanisms enforce the assignments at runtime by offering the tasks at least the assigned amount of service (i.e., QoS). With the assumption that a task will only perform better if it is given additional resources (performance monotonicity), QoS leads to the desirable property of performance isolation; that is, the task performs at least as well as it would if it were executing on a real machine with a configuration equivalent to the task's assigned VPM. This level of performance is assured regardless of the other tasks in the system.

Figure 1. a) Four Spatial VPMs and b) a Complete VPM (including temporal component).

1.2 General Purpose VPM Policies

The VPM framework is capable of satisfying the performance objectives of a wide range of systems and workloads, from embedded systems, to desktops, to servers. In this paper, we apply VPMs to general purpose systems and propose a set of general purpose policies that seamlessly integrate temporal and spatial resource sharing. General purpose systems have two basic performance objectives: predictable single-task performance (QoS / performance isolation) and high aggregate performance. As noted above, these objectives often conflict. Hence, the proposed policies allow a system administrator to dedicate a fraction of each of the system's shared resources (both spatial and temporal) to QoS by assigning them to VPMs. The non-dedicated resources are excess service, which the proposed policies use for optimizing system-level aggregate performance. Therefore, by adjusting the fraction of the system's resources that are dedicated to QoS, the system administrator can tune a system's QoS and aggregate performance to the specifics of the system's workload.

To optimize system-level performance, the policies distribute excess service to the shortest job first (SJF). SJF is a highly effective, well-established heuristic that is used in most general purpose operating systems [][28]. Nonetheless, this is the first work that applies the SJF heuristic to microarchitecture resource sharing. Furthermore, this is the first work that provides a complete microarchitecture resource sharing solution for satisfying realistic, conflicting system-level performance objectives.

Most prior microarchitecture resource sharing work focuses on a single shared resource [8][0][5][20][25][32] and optimizing IPC-based metrics [5][8][9][0][][3][5][20][25][26][32] that are less meaningful to system and application software developers than well-established system-level metrics, e.g., average turn around time.

We evaluate the proposed VPM-based policies through simulation. In contrast to prior work, our simulations model realistic OS scheduling algorithms and each of the baseline system's shared microarchitecture resources in detail, e.g., the cache storage, the status registers at each level of the cache hierarchy, and the cache array, interconnect, and SDRAM bandwidth. We show that with the proposed policies a system administrator can configure a system to ensure tasks are offered an equitable share of the system's resources (QoS), or the system administrator can relax a system's QoS constraints, thus allowing the excess service policies to aggressively optimize aggregate performance. When the system's QoS constraints are relaxed, for our selected benchmarks, the proposed SJF excess service policies can improve average throughput by 86% and turn around time by 77% when compared with a conventional multi-core system that uses least recently used (LRU) cache replacement and first come first serve (FCFS) memory system arbiters. In addition, we show that, as expected, such improvements in aggregate performance come at the cost of reduced QoS and performance isolation.

2. Prior Work

2.1 Hardware Policies and Mechanisms

Most prior research on microarchitecture resource sharing combines hardware policies and mechanisms in order to optimize instructions per cycle (IPC)-based metrics, e.g., IPC-based QoS [5][3], aggregate performance [9][25][32], and fairness metrics [5][20]. However, it is our position that such an approach is insufficient. Firstly, hardware is inflexible, so it is difficult for hardware-based policies to account for all system objectives at design time, particularly in emerging platforms where system objectives are rapidly evolving. Secondly, conventional OS policies are oblivious to tasks' performance measured in IPC, and there is often no clear relationship between commonly used IPC-based metrics and well-established system-level objectives such as improving response time and throughput measured in tasks or jobs per second. Thirdly, OS policies have a global view of the system resources and are better suited for managing resource sharing. Mixing independently developed software and hardware policies can lead to unstable, unpredictable system behavior.

2.2 Software Policies

Generalized Processor Sharing (GPS) is a well-established model of QoS [24] that is frequently applied in networking, operating systems, and real-time systems [2][3][9][30][34]. GPS can satisfy the QoS requirements of many different real-time task models, e.g., periodic, aperiodic, sporadic, inter-sporadic, and rate-based task models [30][34], and is compatible with most general purpose operating systems [9][28].

Briefly, a GPS server has a processing capacity s and a task set that is characterized by a set of positive shares, φ_1, φ_2, …, φ_N, one share per task. As long as the task set is feasible, i.e., Σ φ_i ≤ 1, a GPS server guarantees that each task i will receive processor service Q_i ≥ φ_i · s · T over any time period T that the task is continuously in ready or running states.² It is important to emphasize that s is a fixed rate (processor bandwidth).

Although a good objective, GPS is often unachievable in realistic environments. The basis of GPS is the notion of fluid processor service, where multiple tasks can receive processor service simultaneously. In practice, processor service is distributed in finite time quanta (time slices), one quantum at a time. The concept of proportionate progress captures these non-ideal traits [2]. A task i makes proportionate progress if, over any period of T quanta that the task is continuously in ready or running states, it receives Q_i ≥ ⌊s · φ_i · T⌋ quanta of processor service.³ PD² ("PD squared") is an efficient proportionate-fair (p-fair) multiprocessor scheduling algorithm [3], which, given a feasible task set, ensures all tasks make proportionate progress. On a multiprocessor, a feasible task set satisfies Σ φ_i ≤ p and φ_i ≤ 1 for all i, where p is the number of processors in the system.

In practice, general purpose OS schedulers often combine proportional sharing with aggregate performance optimizations [][28]. One common optimization heuristic is to prioritize tasks SJF, which tends to improve system-level performance metrics such as average response time and throughput measured in jobs per second [][28].

Recent research has focused on software policies for shared microarchitecture resources [7][8][9][4][26][27][29]. Symbiotic scheduling is a technique that improves aggregate system performance by preventing threads with conflicting resource requirements from being co-scheduled [27][29]. El-Haj-Mahmoud et al. and Jain et al. study real-time periodic task scheduling on SMT processors [7][4]. Lee et al. propose a methodology to test and deploy real-time applications on systems with shared microarchitecture resources [6]. Rafique et al. propose using hardware mechanisms to allow an OS to manage shared cache storage [26]. Fedorova et al. and Guo et al. propose schedulers that provide weakly defined QoS properties [8][9]; these schedulers focus only on shared cache storage resources and use ad hoc task models (e.g., strict, elastic, and opportunistic) and scheduling techniques. In contrast with prior work, we provide formal definitions of QoS and performance isolation that are derived from well-established concepts [24] and are compatible with most common real-time task models and general purpose operating systems. Furthermore, the definitions and proposed policies account for all of the system's shared microarchitecture resources, not just shared cache storage.

² This definition of GPS assumes [26] Σ φ_i = 1 to make it consistent with proportionate progress (defined in Section 2.2).
³ The definition of proportionate progress [3] also has an upper bound on quanta that we drop because it does not apply here.
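The GPS bound and feasibility conditions above can be made concrete with a small sketch. The following C fragment is illustrative only; the share values, the unit server capacity, and the function names are ours, not part of any scheduler interface.

#include <stdio.h>

/* A task set is feasible on p processors if sum(phi) <= p and each
 * phi_i <= 1 (the single-server GPS case is p = 1). */
static int feasible(const double *phi, int n, int p)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (phi[i] > 1.0)
            return 0;
        sum += phi[i];
    }
    return sum <= (double)p;
}

/* Minimum service task i must receive over an interval of length T:
 * Q_i >= phi_i * s * T under GPS, or at least floor(s * phi_i * T)
 * quanta once service is quantized (proportionate progress). */
static double gps_min_service(double phi_i, double s, double T)
{
    return phi_i * s * T;
}

int main(void)
{
    double phi[] = { 0.5, 0.25, 0.25 };
    printf("feasible: %d\n", feasible(phi, 3, 1));
    printf("Q_0 over T=100: %.1f\n", gps_min_service(phi[0], 1.0, 100.0));
    return 0;
}

For the shares {0.5, 0.25, 0.25}, the set is feasible on a single server, and task 0 is guaranteed at least 50 units of service over any 100-unit interval in which it is continuously ready or running.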

For our evaluation, we model all shared microarchitecture resources in detail and apply a provably optimal p-fair multiprocessor scheduling algorithm to approximate ideal proportional sharing (GPS) [2][24].

3. Virtual Private Machines

As described in the introduction, the VPM abstraction provides the interface between resource management policies and mechanisms. Policies translate tasks' performance requirements into VPM configurations, and the mechanisms enforce the VPM configurations. In this section, we precisely define the VPM abstraction (Section 3.1) and the properties a system must provide in order to satisfy the VPM abstraction (Section 3.2).

3.1 Formal Definition

A Virtual Private Machine has a spatial component (R) and a temporal component (T). The spatial component of a VPM is composed of a processor core and an assigned portion of all the multi-core system's shared microarchitecture resources. The temporal component of a VPM specifies the fraction of processor time that the spatial VPM resources are assigned to the VPM.

A task i's spatial VPM, denoted as R_i, is defined as the element-wise product of its assigned vector of resource shares (A_i = <φ_i^1, φ_i^2, …, φ_i^n>) and a vector of the system's shared resource capacities (R_sys = <r_sys^1, r_sys^2, …, r_sys^n>), i.e., R_i = <φ_i^1 · r_sys^1, φ_i^2 · r_sys^2, …, φ_i^n · r_sys^n>. For example, if a system's kth shared resource is the L2 cache storage, then task i is assigned φ_i^k of the system's shared L2 cache storage, or φ_i^k · r_sys^k bytes of L2 cache storage. Note that spatial VPMs form a partial ordering. If R_1 = <a_1, a_2, …, a_n> and R_2 = <b_1, b_2, …, b_n>, then R_1 ≥ R_2 iff a_1 ≥ b_1 ∧ a_2 ≥ b_2 ∧ … ∧ a_n ≥ b_n. It is important to emphasize that the ordering is not total. For example, if we have two spatial VPMs, and one of them has more cache storage while the other VPM has more memory bandwidth, then the two VPMs are incomparable. The rationale for introducing the partial ordering will become evident in the next subsection.

The temporal component of a VPM specifies the fraction of processor time that the spatial (microarchitecture) resources are assigned to the VPM. The definition of the temporal component follows from the definition of GPS [24]. A task i is assigned a share φ_i of processor time, i.e., T_i = φ_i · T during any time period T that the task is continuously ready or running.

Combining the spatial and temporal components creates a complete VPM V_i = (A_i · R_sys) ⊗ (φ_i · T). We use the ⊗ operator to compose the spatial and temporal components. Complete VPMs are also partially ordered. For example, if V_1 = (A_1 · R_sys) ⊗ (φ_1 · T) and V_2 = (A_2 · R_sys) ⊗ (φ_2 · T), then V_1 ≥ V_2 iff A_1 · R_sys ≥ A_2 · R_sys ∧ φ_1 · T ≥ φ_2 · T. By factoring out the R_sys and T constants, we have V_1 ≥ V_2 iff A_1 ≥ A_2 ∧ φ_1 ≥ φ_2.
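As a concrete illustration of the spatial-VPM formalism and its partial ordering, the following C sketch computes R_i as the element-wise product of A_i and R_sys and tests dominance. The three-resource layout and the capacity values are assumptions made for the example, not system parameters.

#include <stdbool.h>
#include <stdio.h>

#define N_RESOURCES 3   /* e.g., L2 capacity, L2 bandwidth, memory bandwidth */

typedef struct { double r[N_RESOURCES]; } vpm_spatial;

/* R_i = A_i element-wise-times R_sys. */
static vpm_spatial spatial_vpm(const double share[N_RESOURCES],
                               const double r_sys[N_RESOURCES])
{
    vpm_spatial v;
    for (int k = 0; k < N_RESOURCES; k++)
        v.r[k] = share[k] * r_sys[k];
    return v;
}

/* Partial order: true iff a >= b in every component.  Two VPMs can be
 * incomparable (neither dominates the other). */
static bool dominates(const vpm_spatial *a, const vpm_spatial *b)
{
    for (int k = 0; k < N_RESOURCES; k++)
        if (a->r[k] < b->r[k])
            return false;
    return true;
}

int main(void)
{
    double r_sys[N_RESOURCES] = { 2048.0 /* KB */, 16.0 /* GB/s */, 6.4 /* GB/s */ };
    double a1[N_RESOURCES] = { 0.50, 0.50, 0.50 };
    double a2[N_RESOURCES] = { 0.10, 0.10, 0.10 };
    vpm_spatial v1 = spatial_vpm(a1, r_sys), v2 = spatial_vpm(a2, r_sys);
    printf("v1 >= v2: %d, v2 >= v1: %d\n", dominates(&v1, &v2), dominates(&v2, &v1));
    return 0;
}

If the second VPM instead had more cache storage but less memory bandwidth, dominates() would return false in both directions; the two VPMs would be incomparable.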

In the next subsection, we use these formalisms to precisely define the properties a VPM system should provide.

3.2 Quality of Service and Performance Isolation

In general, there is no formal, agreed-upon definition of QoS. Within the VPM framework, however, we are able to provide a formal definition. Then, using this definition, we can show clearly how QoS leads to the very desirable property of performance isolation.

Definition: A task i is offered Quality of Service if, at runtime, the task is offered a VPM V_offered that is greater than or equal to the VPM it is assigned. That is:

V_offered ≥ (A_i · R_sys) ⊗ (φ_i · T)

An important property of this definition is that it is in terms of service, i.e., it is a bound on the minimum amount of service that a task is offered. As part of the next definition, we employ the abstract relationship Perf(W, V) that maps a task's workload W and a VPM V to performance, e.g., measured in transactions per second.

Definition: A system satisfies performance monotonicity if, for any two VPMs such that V_1 ≥ V_2 and any workload W, the performance of the workload on V_1 is greater than or equal to the performance of the same workload on V_2. That is:

V_1 ≥ V_2 implies Perf(W, V_1) ≥ Perf(W, V_2)

Note that not all systems satisfy performance monotonicity [22], but it is an important assumption that holds under most conditions.

Definition: A system provides performance isolation for task i if its performance when running on its VPM (Perf_i) is greater than or equal to its performance when running on a real machine configured with the same resources as the task's assigned VPM. That is:

Perf_i ≥ Perf(W_i, (A_i · R_sys) ⊗ (φ_i · T))

An important property of this definition is that it is stated in terms of performance, i.e., it is a bound on a task's minimum performance. Moreover, a task's minimum performance depends only on its assigned VPM and is independent of other tasks in the system. Now, we can tie the three definitions together by making an important assertion that relates assigned service to achieved performance. (If space allowed, the assertion could be stated and proved as a theorem.)

Assertion: If a system provides QoS and satisfies performance monotonicity, then the system provides performance isolation.

QoS and performance isolation are ideal primitives for building a range of resource management policies. Firstly, the definitions of QoS and performance isolation presented in this work are workload independent. Workload independence ensures that the definitions are applicable to any workload and are appropriate primitives to incorporate into a system's architecture.

Secondly, the definitions' formalisms provide a well-defined interface between policy and mechanisms, thus facilitating the design of scalable resource management solutions. Thirdly, the definitions extend the notion of GPS to multi-core computer systems, which makes the definitions compatible with most common general purpose operating systems and real-time task models (discussed further in the next section).

4. General Purpose VPM Policies

The VPM abstraction has the ability to satisfy a wide variety of performance objectives, e.g., the VPM abstraction can satisfy most real-time task models' requirements and facilitate the optimization of parallel applications. Satisfying specific performance objectives requires specialized VPM policies. In its full generality, the overall policy design space is enormous. For example, policies can assign tasks homogeneous VPMs or unique heterogeneous VPMs based on the tasks' specific requirements and workload characteristics. Policies can vary VPM assignments dynamically based on tasks' phase behaviors. Policies can be implemented in a concealed hypervisor layer, in the OS, in user space as part of a user-level scheduler, or offline, which may involve the application developer and profiling tools. Furthermore, a system can support multiple policies at the discretion of the OS developers. Exploring the entire design space is outside the scope of this paper; it is a promising area for future research.

To illustrate the VPM framework's potential, we focus on general purpose systems, which have two basic performance objectives: predictable single-task performance (QoS / performance isolation) and high aggregate performance. Because these objectives conflict, most general purpose schedulers are configurable, e.g., in Linux, a system administrator or OS developer can decrease the scheduling time granularity, thus improving the system's response time, or can increase the scheduling time granularity, thus reducing context switching overhead and improving the system's aggregate performance [9]. The proposed VPM-based policies follow this basic design philosophy. The proposed policies are capable of providing high QoS or high aggregate performance, where the balance between QoS and aggregate performance is selectable. Moreover, like the scheduling time granularity example, the proposed policies are straightforward and based on concepts that are familiar to most system administrators and OS developers. In the next subsection (Section 4.1), we discuss the proposed general purpose QoS policy, and in Section 4.2, we discuss the proposed excess service policy.

4.1 QoS Policy

In general purpose operating systems, the level of service a task is offered is determined by the task's static priority (user-defined priority). For example, the Completely Fair Scheduler (CFS) in Linux [9] roughly offers a task i a share φ_i = min{1, p · (pri_i / Σ_j pri_j)} of the processor bandwidth, where pri_i is task i's static priority and p is the number of processors in the system.
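A minimal C sketch of the share computation just described follows; the clamp at one full processor is our reading of the formula, and the priority array and function name are illustrative.

#include <stdio.h>

/* phi_i = min{1, p * pri_i / sum_j pri_j}: a task's share of total
 * processor bandwidth, clamped at one full processor's worth
 * (the clamp is an assumption made for this sketch). */
static double processor_share(const double *pri, int n, int i, int p)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += pri[j];
    double share = (double)p * (pri[i] / sum);
    return share > 1.0 ? 1.0 : share;
}

int main(void)
{
    double pri[8] = { 1, 1, 1, 1, 1, 1, 1, 1 };   /* equal static priorities */
    printf("phi_0 = %.2f\n", processor_share(pri, 8, 0, 4));   /* 0.50 */
    return 0;
}

With eight equal-priority tasks on four processors, each task's share works out to ½, the even-share temporal weight used later in the evaluation.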

CFS and most general purpose schedulers are based on the assumption that the processor service a task receives per time quantum (time slice) is constant. Unfortunately, contemporary multi-core and multi-threaded systems with shared microarchitecture resources violate this assumption because the processor service a task receives per time quantum depends on the task's workload characteristics, the workload characteristics of the other tasks with which it is co-scheduled, and the hardware's sharing policies. The VPM framework solves this problem because the spatial component of a VPM (A_i · R_sys) accounts for microarchitecture resource sharing, and therefore, a policy can use the spatial component of a VPM to recapture the intended QoS and performance isolation properties of a scheduler such as CFS.

The proposed QoS policy allows a system administrator to dedicate a fraction of each of the system's temporal and spatial resources to QoS. The QoS policy then divides these dedicated resources evenly amongst tasks, so all the tasks are assigned the same homogeneous spatial VPM (A · R_sys); notice that we dropped the i subscript on A. Therefore, the minimum processor service a task receives per time quantum is constant for all quanta. Note that with this policy, the system-wide A parameter always satisfies A ≤ <1/p, 1/p, …, 1/p>. The QoS policy calculates the VPMs' temporal component (φ_i) similar to the way the Linux scheduler determines tasks' processor service shares. A task i's temporal service share is φ_i = min{1, λ · p · (pri_i / Σ_j pri_j)}, where λ is the fraction of the temporal (time slice) resources that are dedicated to QoS.

4.2 Shortest Job First Excess Service Policy

Excess service is service that is either unassigned or is assigned but unused. Excess service can be temporal (time slices) or spatial (microarchitecture) resources. The proposed excess service policy distributes excess service to the shortest job first (SJF). To distribute service SJF, the policy monitors tasks' average execution times and assigns tasks VPM priorities (π_i), where the shortest task has the greatest priority and the longest task has the least priority; this use of priorities is similar to the Linux scheduler's use of dynamic priorities []. The policy passes these priorities to the VPM scheduler (Section 6) and hardware mechanisms (Section 5); the VPM scheduler and hardware mechanisms distribute excess service highest priority first.

5. VPM Architectural Support

A system's hardware must provide OS developers with the primitives needed to construct effective resource management policies; ideally, architectural support would allow OS developers to construct any practical resource management solution. In general, a system's architectural support allows the operating system to communicate with the system's hardware mechanisms in order to control the system's shared resources. For example, in conventional systems, the page table (or software-managed TLB) is an ISA feature that allows page management policies to communicate with hardware translation mechanisms and control the distribution of physical memory.
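The SJF excess-service policy can be sketched as follows. The decaying-average update mirrors the rule given later in Section 7.1; the struct fields and the priority encoding (larger value = higher priority) are illustrative.

#include <stdio.h>
#include <stdlib.h>

struct task {
    int id;
    double ave_exec_time;   /* decayed average of completed subtask times */
    int vpm_priority;       /* larger value = higher priority */
};

/* ave <- (ave + exec_time) / 2, applied when a subtask completes. */
static void update_ave(struct task *t, double exec_time)
{
    t->ave_exec_time = (t->ave_exec_time + exec_time) / 2.0;
}

static int by_exec_time_desc(const void *a, const void *b)
{
    double x = ((const struct task *)a)->ave_exec_time;
    double y = ((const struct task *)b)->ave_exec_time;
    return (x < y) - (x > y);   /* longest first */
}

/* Longest task gets priority 0; the shortest gets n-1 (the greatest). */
static void assign_priorities(struct task *tasks, int n)
{
    qsort(tasks, n, sizeof(tasks[0]), by_exec_time_desc);
    for (int i = 0; i < n; i++)
        tasks[i].vpm_priority = i;
}

int main(void)
{
    struct task tasks[3] = { {0, 40.0, 0}, {1, 10.0, 0}, {2, 25.0, 0} };
    update_ave(&tasks[1], 12.0);             /* a subtask of task 1 completes */
    assign_priorities(tasks, 3);
    for (int i = 0; i < 3; i++)
        printf("task %d: ave=%.1f prio=%d\n",
               tasks[i].id, tasks[i].ave_exec_time, tasks[i].vpm_priority);
    return 0;
}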

In addition, page tables often support touched bits that the OS can periodically read and clear, which support many different page replacement algorithms. Similar ISA support is required for the VPM framework. This support allows the VPM policies to 1) communicate with the hardware's mechanisms (Section 5.1), 2) control the system's microarchitecture resources in order to satisfy policies' VPM assignments (Section 5.2), and 3) provide support for software excess service policies (Section 5.3).

5.1 VPM Control Registers

VPM software/hardware communication is done via architected control registers and supporting (privileged) instructions that read/write the control registers. To communicate spatial VPM assignments (A_i), each shared microarchitecture resource k has a set of architected control registers (C.φ_i^k), one register per hardware thread; we use a dot notation with a C prefix to distinguish control registers. These registers store the running tasks' assigned share of the kth resource (φ_i^k). In our baseline multi-core system, there are three composite shared microarchitecture resources: 1) L2 cache bandwidth, 2) L2 cache storage, and 3) SDRAM memory system bandwidth. Each of these composite resources consists of multiple elementary resources. The L2 cache bandwidth consists of multiple banks, each with interconnect, tag array, and data array bandwidth resources [22]. The SDRAM memory bandwidth consists of multiple banks and channels [2]. Therefore, for our baseline multi-core system, a task i's spatial shares can be represented as a 3-tuple A_i = <φ_i^L2_CS, φ_i^L2_BW, φ_i^Mem_BW>, where φ_i^L2_CS is task i's share of cache storage, φ_i^L2_BW is its assigned share of the L2 cache bandwidth, and φ_i^Mem_BW is its share of SDRAM memory bandwidth. Each elementary resource has a hardware mechanism that enforces resource assignments independently; however, in this work, the elementary resources are combined into the three composite resources listed above, and the mechanisms within a composite resource are all controlled by a single control register (C.φ_i^k).

Additional control registers are required to support the priority-based excess service policies. To communicate tasks' VPM priorities (π_i), the hardware has a set of priority control registers (C.π_i), one register for each hardware thread. To distribute excess cache storage, the hardware tracks tasks' LRU stack distance histograms [25] and communicates the tasks' histograms through control registers (C.lru_stack_hist_i[j]). There is one register per hardware thread per cache way, where C.lru_stack_hist_i[j] is the counter that tracks the ith hardware thread's jth cache way. In the remainder of this section, we describe how the hardware mechanisms compute and consume the values stored in the system's control registers.
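One possible software view of the per-thread control-register state described above is sketched below. The field widths, the fixed-point share encoding, and the 32-way histogram depth are assumptions made for illustration, not an ISA definition.

#include <stdint.h>
#include <stdio.h>

#define N_HW_THREADS 4
#define L2_WAYS      32

enum vpm_resource { L2_CS, L2_BW, MEM_BW, N_SHARED_RESOURCES };

struct vpm_ctrl_regs {
    /* C.phi_i^k: thread i's assigned share of resource k, stored here as
     * a fraction of the resource's quanta in 1/256ths (encoding is ours). */
    uint8_t  share[N_HW_THREADS][N_SHARED_RESOURCES];

    /* C.pi_i: thread i's VPM priority, consumed by the excess-service
     * mechanisms in the arbiters. */
    uint8_t  priority[N_HW_THREADS];

    /* C.lru_stack_hist_i[j]: hit counter for thread i at LRU stack depth
     * (cache way) j, maintained by dynamic set sampling. */
    uint32_t lru_stack_hist[N_HW_THREADS][L2_WAYS];
};

int main(void)
{
    struct vpm_ctrl_regs regs = { 0 };
    regs.share[0][L2_CS] = 128;          /* thread 0: half of the L2 storage */
    regs.priority[0] = 7;
    printf("thread 0 L2 share = %u/256, priority = %u\n",
           regs.share[0][L2_CS], regs.priority[0]);
    return 0;
}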

5.2 Hardware Mechanisms

Hardware mechanisms enforce the spatial component of VPM assignments, R_i = (A_i · R_sys). An ideal VPM hardware mechanism ensures each task i is offered r_i^k = φ_i^k · r_sys^k of the system's kth shared resource when it is running. However, in real systems, shared microarchitecture resources are typically partitioned into fixed size quanta. For example, way-partitioning enforces cache storage assignments at the column or way granularity [5][22][25]. Therefore, a hardware mechanism ensures that each task is offered r_i^k ≥ ⌊φ_i^k · r_sys^k⌋ quanta of the shared resource when the task is running. A complete set of hardware mechanisms (each shared resource is under the control of a hardware mechanism) ensures each task i is offered a spatial VPM R_i ≥ ⌊A_i · R_sys⌋ when the task is running. The A_i parameters that we evaluate in this paper coincide with the shared resources' natural partitioning points, and therefore, the floor operator has no effect on our evaluation.

In this work, we use the way-partitioning algorithm presented in [22] to enforce cache storage resource assignments and the fair queuing (FQ) algorithms presented in [2] and [22] to enforce the cache and SDRAM memory system bandwidth assignments. Way-partitioning is implemented using a simple thread-aware replacement policy that requires few changes to the underlying cache microarchitecture. The FQ bandwidth partitioning algorithms operate within a virtual time framework [34]. When a request arrives, the FQ algorithm calculates the request's virtual start-time and virtual finish-time, which are the request's start and finish times if its task were running on its own virtual private bandwidth resource [2][22][34]. A request's virtual finish-time is the time the request must finish (its deadline) in order to fulfill the minimum bandwidth guarantee under ideal conditions. The cache and SDRAM arbiters service requests in earliest virtual finish-time first order to ensure threads are offered their allocated share of the bandwidth resources [34]; in effect, the arbiters schedule requests earliest deadline first (EDF) [6].

5.3 Hardware Support for Excess Service Policies

In addition to enforcing spatial resource assignments, hardware must facilitate software excess service policies in distributing excess memory system bandwidth and cache storage.

5.3.1 Hardware Support for Distributing Excess Bandwidth

Excess memory system bandwidth is available for very short periods of time, and therefore, must be distributed by hardware mechanisms. We propose a simple extension to the baseline arbiters [2][22] that distributes excess bandwidth to the task with the highest VPM priority (π_i). To implement the excess service policy, we leverage the concept of eligibility [2][34]. The key insight is that a request can be delayed up to its virtual start-time without violating task i's bandwidth assignment. When a request's virtual start-time has elapsed (i.e., hw-clock ≥ virtual start-time, where hw-clock holds the current time), the request is eligible and is serviced earliest virtual finish-time first in order to satisfy the task's bandwidth assignment. When a request's virtual start-time is ahead of the current time, i.e., hw-clock < virtual start-time, the request is ineligible.
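The following sketch shows per-thread virtual-time bookkeeping in the general fair-queuing style: a request's virtual start-time is the later of the current clock and the thread's previous virtual finish-time, and the finish-time adds the request's service time scaled by the reciprocal of its share. The exact bookkeeping of the arbiters cited above may differ; this is only an illustration of the eligibility and deadline quantities they use.

#include <stdio.h>

struct fq_thread {
    double share;            /* assigned fraction of the bandwidth resource */
    double last_vfinish;     /* virtual finish-time of the previous request */
};

struct fq_request {
    double vstart;
    double vfinish;
};

static struct fq_request fq_tag(struct fq_thread *t, double hw_clock,
                                double service_time)
{
    struct fq_request r;
    r.vstart  = hw_clock > t->last_vfinish ? hw_clock : t->last_vfinish;
    r.vfinish = r.vstart + service_time / t->share;   /* deadline under ideal sharing */
    t->last_vfinish = r.vfinish;
    return r;
}

int main(void)
{
    struct fq_thread t = { .share = 0.25, .last_vfinish = 0.0 };
    struct fq_request a = fq_tag(&t, 100.0, 10.0);   /* 10-cycle access */
    struct fq_request b = fq_tag(&t, 105.0, 10.0);   /* arrives while busy */
    printf("a: start=%.0f finish=%.0f\n", a.vstart, a.vfinish);
    printf("b: start=%.0f finish=%.0f\n", b.vstart, b.vfinish);
    return 0;
}

In the example, the second request's virtual start-time (140) is ahead of the clock (105), so it would be ineligible and would compete for excess bandwidth by VPM priority, as described above.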

If a task's requests are ineligible, the task has consumed more than its assigned share of the bandwidth resource since the arbiter was reset. Therefore, the task's ineligible requests are contending for excess service and are serviced highest priority first. To summarize, the proposed cache arbiter and memory scheduler compute virtual start- and finish-times as in [2][22], and select requests in the following order: 1) the eligible request with the earliest virtual finish-time; 2) if no requests are eligible, the request with the highest priority; and 3) if more than one request has the same priority, the request with the earliest virtual finish-time.

The FQ arbiters have built-in history that must be reset whenever a new task is context switched in [20][2][22], i.e., when a task is context switched on to a hardware context, the hardware context's virtual start-time register is reset using virtual start-time ← hw-clock. Therefore, the arbiters track the amount of excess service a task has received (normalized to its assigned share) since the task was context switched on to the processor.

5.3.2 Hardware Support for Distributing Excess Cache Storage

In contrast to memory system bandwidth, tasks' cache storage usage changes relatively slowly; therefore, software can effectively manage excess cache storage by adjusting the cache storage control registers (C.φ_i^L2_CS) during context switches (discussed in Section 6.2). However, the software excess cache storage policies require hardware mechanisms to monitor cache usage. In general, there are many ways to monitor cache usage. In this work, we use dynamic set sampling (DSS) [25]. DSS approximates tasks' LRU stack distance histograms, which, as described earlier (Section 5.1), are stored in software-accessible control registers (C.lru_stack_hist_i[j]).

6. VPM Software Support

In this section, we discuss the VPM software support, which consists of the VPM scheduler (Section 6.1) and the excess cache storage policy (Section 6.2).

6.1 VPM Scheduler

The VPM scheduler ensures that co-scheduled tasks' spatial VPM assignments do not conflict, and that tasks' temporal VPM assignments are satisfied. To satisfy a task's temporal resource assignment, the VPM scheduler controls the amount of processor time a task is offered. Ideally, a task i should be offered T_i ≥ φ_i · T during any period of time T that the task is continuously in ready or running states. However, because processor time is partitioned and distributed in fixed sized quanta (time slices), the VPM scheduler offers tasks T_i ≥ ⌊φ_i · T⌋ quanta. When the VPM scheduler context switches a task onto a processor, the scheduler communicates the task's spatial resource shares A_i to the VPM hardware mechanisms through control registers. However, to ensure tasks' spatial resource assignments can be satisfied by the hardware mechanisms, the VPM scheduler must ensure that the running set of tasks' spatial assignments do not conflict, i.e., Σ_{i ∈ running tasks} A_i ≤ <1, 1, …, 1>.
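The co-schedule constraint above is easy to check directly; the sketch below assumes the three composite resources of Section 5.1 and a simple array representation of the A_i vectors.

#include <stdbool.h>
#include <stdio.h>

#define N_SHARED_RESOURCES 3

/* Returns true iff the running tasks' spatial shares sum to at most 1
 * for every shared resource. */
static bool coschedule_feasible(const double A[][N_SHARED_RESOURCES],
                                const int *running, int n_running)
{
    for (int k = 0; k < N_SHARED_RESOURCES; k++) {
        double sum = 0.0;
        for (int r = 0; r < n_running; r++)
            sum += A[running[r]][k];
        if (sum > 1.0)
            return false;
    }
    return true;
}

int main(void)
{
    /* Eight tasks, each assigned 1/4 of every resource (p = 4). */
    double A[8][N_SHARED_RESOURCES];
    for (int i = 0; i < 8; i++)
        for (int k = 0; k < N_SHARED_RESOURCES; k++)
            A[i][k] = 0.25;
    int running[4] = { 0, 3, 5, 7 };
    printf("feasible: %d\n", coschedule_feasible(A, running, 4));
    return 0;
}

With the homogeneous assignment A_i = <¼, ¼, ¼> on a four-processor system, any four running tasks sum to exactly <1, 1, 1>, so the check always passes, which is the special case exploited next.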

In its full generality, the VPM scheduling problem is a multi-resource scheduling problem with no known efficient solution. Fortunately, to support the policies proposed in Section 4, we do not have to solve the general VPM scheduling problem. Instead, we can focus on the case where all tasks' spatial VPM assignments satisfy A_i ≤ <1/p, 1/p, …, 1/p>, where p is the number of processors in the system; we leave the general case for future work. In this special case, tasks' spatial VPM assignments cannot conflict no matter how tasks are co-scheduled. Consequently, we can apply an existing p-fair multiprocessor scheduling algorithm to satisfy a task set's temporal VPM assignments without worrying about spatial conflicts.

In this work, we implement a VPM scheduler based on the PD² scheduling algorithm [2], but we could have used any p-fair scheduler. We chose PD² because it is well-suited for integrating a priority-based excess time slice policy. The PD² algorithm has two scheduling queues: a release queue and a ready queue. When a task enters the system or is context switched out, its release-time is calculated and the task is added to the PD² release queue. The task remains in the release queue until it becomes eligible, i.e., os-clock ≥ release-time, where os-clock is the OS's internal clock. In effect, the release queue prevents tasks from consuming excess time slices. After a task becomes eligible, its PD² priority is computed and it is added to the ready queue. To schedule a task, the scheduler selects from the ready queue the task with the highest PD² priority. The reasoning behind the PD² release time and internal priority algorithms is quite involved; we refer the reader to [2] for the details.

The PD² algorithm naturally supports a priority-based policy to distribute excess time slices. The key insight is that if the PD² ready queue is empty, the current time slice is excess. Therefore, when the ready queue is empty, the scheduler selects the task with the highest VPM priority (π_i) from the release queue. This technique is analogous to the FQ arbiters' eligibility logic (Section 5.3.1).

6.2 Excess Cache Storage Policy

Excess cache storage is quite different from both processor and memory system bandwidth resources. Firstly, it requires oracle knowledge to decide with certainty that cache storage is unused, e.g., whether a line will be accessed before it is replaced [22]. Therefore, the proposed excess cache storage policy only distributes cache storage that is unassigned (1 - Σ_i φ_i^L2_CS). Secondly, the utility a thread receives from additional cache storage is workload dependent. Therefore, in order to make efficient use of excess cache storage, we propose a cache storage policy that takes into account tasks' priorities and their cache utility.

The proposed utility-aware policy distributes excess cache storage to the task that has the highest VPM priority (π_i) and a cache utility above util_threshold. The cache utility threshold is specified as the number of misses eliminated if the task were assigned an extra quantum (way) of cache storage. To determine a task's utility for an additional way of excess cache storage, the excess cache storage policy reads the task's stack distance histograms (C.lru_stack_hist_i[j]) [25].
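A much-simplified sketch of the scheduling decision described above follows: tasks whose release times have passed compete by PD² priority; if none are eligible, the time slice is excess and the highest-VPM-priority released task runs instead. PD² release times and priorities are abstracted here as plain fields; the real algorithms are considerably more involved.

#include <stddef.h>
#include <stdio.h>

struct sched_task {
    const char *name;
    double release_time;    /* earliest time the task may next run */
    int    pd2_priority;    /* stand-in for the PD2 internal priority */
    int    vpm_priority;    /* SJF-derived priority for excess service */
};

static struct sched_task *pick_next(struct sched_task *t, int n, double os_clock)
{
    struct sched_task *best_ready = NULL, *best_released = NULL;
    for (int i = 0; i < n; i++) {
        if (os_clock >= t[i].release_time) {          /* eligible: ready queue */
            if (!best_ready || t[i].pd2_priority > best_ready->pd2_priority)
                best_ready = &t[i];
        } else {                                      /* still in release queue */
            if (!best_released || t[i].vpm_priority > best_released->vpm_priority)
                best_released = &t[i];
        }
    }
    return best_ready ? best_ready : best_released;   /* excess slice if no ready task */
}

int main(void)
{
    struct sched_task t[3] = {
        { "A", 12.0, 5, 1 },
        { "B", 14.0, 2, 7 },   /* short job: highest VPM priority */
        { "C", 16.0, 9, 0 },
    };
    printf("t=10: %s\n", pick_next(t, 3, 10.0)->name);   /* all ineligible -> B */
    printf("t=13: %s\n", pick_next(t, 3, 13.0)->name);   /* A eligible -> A */
    return 0;
}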

The software policy runs during context switches and controls the excess cache storage using the cache's control registers (C.φ_i^L2_CS).

7. Evaluation

7.1 Timing Model

We evaluate the proposed VPM policies and mechanisms through detailed simulation of a multi-core system. The multi-core timing model is based on a structural simulator developed at IBM Research [4]. The model's default configuration is a single processor IBM 970 system []. In its default configuration, the model was validated to be within ±5% of the 970 design group's latch-level processor model. In this paper, we use an alternative simulator configuration (see Table 1) to avoid 970-specific design constraints. The simulator has a cycle accurate model of the L2 cache microarchitecture and an on-chip memory controller attached to a DDR2-800 memory system [8]. Memory system buffers are statically partitioned as in [2][22]. We implemented the hardware mechanisms as described in Section 5. The DSS mechanism is configured with 32 sampled sets [25].

Table 1. System Configuration

Processors: four 4 GHz processors
Issue Buffers: 20 entry BRU/CRU, two 24 entry FXU/LSU, 24 entry FPU
Issue Width: 8 units (2 FXU, 2 LSU, 2 FPU, 1 BRU, 1 CRU)
Reorder Buffer: 24 dispatch groups, 5 instr. per dispatch group
LS Queues: 32 entry load reorder queue, 32 entry store queue
L1 I-Cache: 16KB private, 4-way, 64 byte lines, 2 cycles, 8 MSHRs
L1 D-Cache: 16KB private, 4-way, 64 byte lines, 2 cycles, 6 MSHRs
L2 Cache: runs at ½ core frequency, 2 banks, 2MB, 32-way, 64 byte lines, 6 controller state machines per thread, 8 store gathering entries per thread, read bypassing, retire-at-6 policy, partial-flush on read conflict, 2 cycle interconnect, 4 cycle tag, 8 cycle data, 6 byte data bus per bank
Memory Controller: on-chip, runs at ½ core freq., 6 transaction entries per thread, 8 write buffer entries per thread, closed page policy
SDRAM: DDR2-800, 1 channel, 2 ranks per channel, 8 banks per rank

The VPM software support is implemented as described in Section 6. The SJF excess service policy uses the tasks' average execution times (ave_exec_time_i) to compute VPM priorities. The tasks' average execution times are stored in the tasks' process control blocks and are updated whenever a task's subtask completes; we discuss the task model further in the next subsection. The policy uses a decaying average to update tasks' average execution times, i.e., ave_exec_time_i ← (ave_exec_time_i + exec_time) / 2, where exec_time is the execution time of the just-completed subtask. The timer interrupt granularity is 2 ms, which is the default Linux scheduling granularity [9]. We set the cache storage utility threshold util_threshold using (timer_interrupt_granularity * 0.05) / ave_memory_latency, which roughly approximates whether an additional way of excess cache storage will improve a task's performance by at least 5%.
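The threshold arithmetic can be reproduced directly from the configuration in Table 1; the 100-cycle average miss penalty is the round figure used in the calculation stated in the next paragraph, and the variable names here are ours.

#include <stdio.h>

int main(void)
{
    double core_freq_hz       = 4.0e9;   /* 4 GHz cores (Table 1) */
    double timer_interval_s   = 2.0e-3;  /* 2 ms timer interrupt granularity */
    double ave_miss_penalty   = 100.0;   /* cycles per miss (approximate) */
    double target_improvement = 0.05;    /* 5% of a timer interval */

    double cycles_per_interval = core_freq_hz * timer_interval_s;   /* 8M cycles */
    double util_threshold = cycles_per_interval * target_improvement
                            / ave_miss_penalty;
    printf("util_threshold = %.0f misses\n", util_threshold);       /* 4000 */
    return 0;
}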

Based on our system configuration, the utility threshold is 4,000 misses, i.e., (8 million cycles per timer interrupt × 0.05) / 100 cycles per miss.

7.2 Workloads

To evaluate the proposed policies, we use a quasi-statistical multiprogram workload because the simulation overhead of a detailed multi-core model prevents us from running a real multiprogram workload at meaningful OS time granularities. Furthermore, a statistical workload gives us greater control over the experimental setup, thus allowing us to focus the evaluation on the workload-independent resource management principles presented in the body of this paper. The workload consists of a set of single-threaded tasks; each task consists of a sequence of subtasks. We use the SPEC 2000 benchmark suite to generate the workload, i.e., each task is a SPEC benchmark and a subtask is a section of a benchmark's execution. The specific benchmark suite is not especially important for this study; what is important is that the applications in the benchmark suite exhibit a realistic range of resource utilization(s).

To set up the workload, we started with a simulation time constraint and worked backwards. Early in our analysis we decided simulations must complete in two days in order to make the analysis tractable. With the baseline configuration in Table 1, our simulator can simulate approximately 1 billion cycles (125 timer interrupts) in two days. Given the constraint of 1 billion simulated cycles, we chose the number of tasks in the workload and the size of tasks' subtasks so that 1) a subtask's initial cold-start misses were less than 10% of the subtask's total misses, 2) each task completed at least one subtask on the baseline configuration, and 3) each simulation completed at least 100 subtasks. To satisfy these constraints, we chose a workload with 8 tasks and 10 million instructions per subtask.

We used the benchmarks' memory system utilizations to choose the eight benchmarks. Figure 2 illustrates the cache and SDRAM bandwidth resource utilization with the baseline system's full cache storage and with ¼ of the cache storage. Cache data array utilization is used as a proxy for cache bandwidth utilization, and SDRAM data bus bandwidth utilization is used as a proxy for memory bandwidth utilization [2][22]. The benchmarks are ordered by their SDRAM memory bandwidth utilization. We selected the eight leftmost benchmarks to form the workload because they put the most pressure on the system's shared resources, and consequently, on the proposed policies and mechanisms.

Figure 2. Bandwidth utilization (memory and cache bandwidth, with full and ¼ cache storage, for art, swim, lucas, mcf, equake, wupwise, facerec, mgrid, bzip2, ammp, apsi, twolf, gcc, vpr, gap, mesa, gzip, sixtrack, perlbmk, crafty, and the mean).

For each of the eight benchmarks, we generated ten 10-million-instruction subtask traces using statistical sampling [2]. Although each subtask trace has the same number of instructions, their execution times on the baseline system vary from 8.2 ms to 82.5 ms, or from 4 to 41 times the scheduling granularity. During initialization, the simulator randomly selects one subtask from each task's ten subtask traces and injects it into the simulated system. During simulation, when a task's subtask completes, the simulator randomly selects a subtask from the task's ten subtask traces and injects it into the simulated system.

To conservatively approximate the percentage of total misses that may be cold-start misses, we follow Wood et al. [33] and divide the number of cache blocks by each subtask's misses when running alone. All of the subtasks' initial cold-start misses were less than 10% of their total misses, and on average, less than 4.4% of the total misses. It is important to emphasize that this is a very conservative approximation, i.e., most often the actual cold-start miss count will be much lower; moreover, a subtask's initial cold-start misses are relatively insignificant when compared to its misses due to context switching.

We ran each simulation four times; the simulations are non-deterministic due to the random order in which subtasks enter the system. We ensured that the average throughput of the four simulations had converged to within 2%. The results from all four simulations are included in the results section.

7.3 Metrics

To capture both temporal and spatial resource sharing effects, we chose to use turn around time (TAT) as the primary metric. TAT starts at the time a subtask enters the system and ends at the time the subtask completes and leaves the system. To measure performance isolation, we first compute a subtask's isolated TAT, which is the subtask's TAT if it were running on a private system configured with an even share of the baseline system's shared resources and no more, i.e., ¼ of the cache storage, ¼ of the cache bandwidth, and ¼ of the memory bandwidth of the baseline system. We collect the subtasks' execution times and time-scale them by the task's even-share temporal weight, i.e., the multi-core simulations have four processors and eight processes; therefore, a subtask's isolated TAT is even_share_exec_time / ½. To measure performance isolation, we normalize a subtask's measured TAT to its isolated TAT. A normalized TAT of less than one means the subtask met its even-share performance isolation target. To measure aggregate performance, we use average TAT, which is the TAT averaged over all tasks' subtasks. We also include throughput (Thrpt) results measured as the number of subtasks completed per second.

7.4 Results

The results presented in this section illustrate the effects of the proposed policies from the perspective of a system administrator. To simplify the evaluation, we assume tasks have equal static priorities, and therefore, each task has the same system-wide temporal weight (φ). The administrator dedicates a fraction of the system's shared temporal and spatial resources to providing single-task QoS. These resources (both temporal and spatial) are divided evenly to form homogeneous VPMs, which are passed to the VPM scheduler through the A and φ parameters.
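The isolation metric of Section 7.3 reduces to a couple of lines; the example values below are hypothetical.

#include <stdio.h>

/* Isolated TAT: even-share execution time scaled by the even-share
 * temporal weight (1/2 for eight tasks on four processors). */
static double isolated_tat(double even_share_exec_time, double temporal_weight)
{
    return even_share_exec_time / temporal_weight;
}

static double normalized_tat(double measured_tat, double even_share_exec_time,
                             double temporal_weight)
{
    return measured_tat / isolated_tat(even_share_exec_time, temporal_weight);
}

int main(void)
{
    /* A subtask that runs for 20 ms on an even-share private machine has an
     * isolated TAT of 40 ms; a measured TAT of 36 ms meets the target (< 1). */
    printf("normalized TAT = %.2f\n", normalized_tat(36.0, 20.0, 0.5));
    return 0;
}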

For example, if the administrator dedicates the whole cache (1) to providing QoS, then each task's VPM is assigned ¼ of the system's cache storage, i.e., φ_i^L2_CS = ¼. In the case where all cache storage is dedicated to QoS, there is no cache storage dedicated to optimizing aggregate performance. Or, if the administrator dedicates half of the cache (½) to QoS, then each task's VPM is assigned 1/8 of the cache storage, and the other half of the cache storage is dedicated to optimizing aggregate performance via the excess service policy.

For the first set of results (Figure 3), we analyze the effects of dedicating each shared resource independently. To study a single resource, we vary the fraction of the resource that is dedicated to QoS, while all other shared resources are fully dedicated to QoS and are held constant. For example, to study the effects of the cache storage, we vary the amount of cache storage dedicated to QoS while all memory system bandwidth and processor time slice resources are dedicated to QoS and held constant, i.e., A = <x, ¼, ¼> and φ = ½. Figure 3a illustrates the effects of varying the fraction of the system's processor time slices dedicated to QoS. Figure 3b illustrates the effects of varying the fraction of the memory system bandwidth (both cache and SDRAM bandwidth) dedicated to QoS. We compare the proposed bandwidth mechanisms and policies to first come first serve (fcfs) arbiters and the fair queuing (fq) arbiters presented in [2][22]. Figure 3c illustrates the effects of varying the fraction of the cache storage dedicated to QoS. We compare the proposed cache storage mechanisms and policies to LRU (lru) cache replacement and the utility policy presented in [25].

The key in each graph shows the fraction of the resource dedicated to QoS, i.e., 1, ¾, ½, and ¼. The graph on the left side of each figure shows each task's normalized TATs. Tasks are ordered by their execution time when executing alone, with the longest execution time on the left. The bars plot the tasks' average normalized TATs. The error bars illustrate the range of the tasks' normalized TATs. The graph on the right side of each figure shows the aggregate performance improvement with respect to all resources dedicated to QoS (1).

The policies' most significant trends hold across all three graphs in Figure 3. For the case where all of the resources are dedicated to QoS (the data points labeled 1), the policies and mechanisms satisfy the VPM abstraction's performance isolation requirements. This result is illustrated by the graphs on the left side of each figure, i.e., for the fully dedicated configurations (1), the top of each error bar (the maximum normalized TAT) is less than one. Consequently, the tasks' subtasks run faster than they would if running on a single processor machine with a configuration that is equivalent to an even-share VPM. Reducing the level of QoS from 1 to ¼ significantly decreases the degree of performance isolation offered to the tasks, i.e., the ¼ configuration has a much greater deviation in normalized TAT. The normalized TATs of the tasks on the left side of the graphs (art, mcf, lucas, swim) increase significantly in some


Scheduling II. Today. Next Time. ! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling. ! Scheduling II Today! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling Next Time! Memory management Scheduling with multiple goals! What if you want both good turnaround

More information

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain Software-assisted Cache Mechanisms for Embedded Systems by Prabhat Jain Bachelor of Engineering in Computer Engineering Devi Ahilya University, 1986 Master of Technology in Computer and Information Technology

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs University of Maryland Technical Report UMIACS-TR-2008-13 Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs Wanli Liu and Donald Yeung Department of Electrical and Computer Engineering

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors Walid El-Reedy Electronics and Comm. Engineering Cairo University, Cairo, Egypt walid.elreedy@gmail.com Ali A. El-Moursy Electrical

More information

Decoupled Zero-Compressed Memory

Decoupled Zero-Compressed Memory Decoupled Zero-Compressed Julien Dusser julien.dusser@inria.fr André Seznec andre.seznec@inria.fr Centre de recherche INRIA Rennes Bretagne Atlantique Campus de Beaulieu, 3542 Rennes Cedex, France Abstract

More information

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Noname manuscript No. (will be inserted by the editor) The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Karthik T. Sundararajan Timothy M. Jones Nigel P. Topham Received:

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Chapter 14 Performance and Processor Design

Chapter 14 Performance and Processor Design Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures

More information

Operating System Review Part

Operating System Review Part Operating System Review Part CMSC 602 Operating Systems Ju Wang, 2003 Fall Virginia Commonwealth University Review Outline Definition Memory Management Objective Paging Scheme Virtual Memory System and

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

LECTURE 3:CPU SCHEDULING

LECTURE 3:CPU SCHEDULING LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives

More information

Precedence Graphs Revisited (Again)

Precedence Graphs Revisited (Again) Precedence Graphs Revisited (Again) [i,i+6) [i+6,i+12) T 2 [i,i+6) [i+6,i+12) T 3 [i,i+2) [i+2,i+4) [i+4,i+6) [i+6,i+8) T 4 [i,i+1) [i+1,i+2) [i+2,i+3) [i+3,i+4) [i+4,i+5) [i+5,i+6) [i+6,i+7) T 5 [i,i+1)

More information

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Unit 3 : Process Management

Unit 3 : Process Management Unit : Process Management Processes are the most widely used units of computation in programming and systems, although object and threads are becoming more prominent in contemporary systems. Process management

More information

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Appears in Proc. of the 18th Int l Conf. on Parallel Architectures and Compilation Techniques. Raleigh, NC. Sept. 2009. Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Wanli

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Chapter 5: Process Scheduling

Chapter 5: Process Scheduling Chapter 5: Process Scheduling Chapter 5: Process Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Thread Scheduling Operating Systems Examples Algorithm

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s)

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) 1/32 CPU Scheduling The scheduling problem: - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) When do we make decision? 2/32 CPU Scheduling Scheduling decisions may take

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services Overview 15-441 15-441 Computer Networking 15-641 Lecture 19 Queue Management and Quality of Service Peter Steenkiste Fall 2016 www.cs.cmu.edu/~prs/15-441-f16 What is QoS? Queuing discipline and scheduling

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Uniprocessor Scheduling. Basic Concepts Scheduling Criteria Scheduling Algorithms. Three level scheduling

Uniprocessor Scheduling. Basic Concepts Scheduling Criteria Scheduling Algorithms. Three level scheduling Uniprocessor Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Three level scheduling 2 1 Types of Scheduling 3 Long- and Medium-Term Schedulers Long-term scheduler Determines which programs

More information

Scheduling. Today. Next Time Process interaction & communication

Scheduling. Today. Next Time Process interaction & communication Scheduling Today Introduction to scheduling Classical algorithms Thread scheduling Evaluating scheduling OS example Next Time Process interaction & communication Scheduling Problem Several ready processes

More information

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems 1 Presented by Hadeel Alabandi Introduction and Motivation 2 A serious issue to the effective utilization

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Process Scheduling. Copyright : University of Illinois CS 241 Staff

Process Scheduling. Copyright : University of Illinois CS 241 Staff Process Scheduling Copyright : University of Illinois CS 241 Staff 1 Process Scheduling Deciding which process/thread should occupy the resource (CPU, disk, etc) CPU I want to play Whose turn is it? Process

More information

Operating Systems Unit 3

Operating Systems Unit 3 Unit 3 CPU Scheduling Algorithms Structure 3.1 Introduction Objectives 3.2 Basic Concepts of Scheduling. CPU-I/O Burst Cycle. CPU Scheduler. Preemptive/non preemptive scheduling. Dispatcher Scheduling

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Memory management. Requirements. Relocation: program loading. Terms. Relocation. Protection. Sharing. Logical organization. Physical organization

Memory management. Requirements. Relocation: program loading. Terms. Relocation. Protection. Sharing. Logical organization. Physical organization Requirements Relocation Memory management ability to change process image position Protection ability to avoid unwanted memory accesses Sharing ability to share memory portions among processes Logical

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW 2 SO FAR talked about in-kernel building blocks: threads memory IPC drivers

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Practice Exercises 305

Practice Exercises 305 Practice Exercises 305 The FCFS algorithm is nonpreemptive; the RR algorithm is preemptive. The SJF and priority algorithms may be either preemptive or nonpreemptive. Multilevel queue algorithms allow

More information

Chapter 5: CPU Scheduling. Operating System Concepts 8 th Edition,

Chapter 5: CPU Scheduling. Operating System Concepts 8 th Edition, Chapter 5: CPU Scheduling Operating System Concepts 8 th Edition, Hanbat National Univ. Computer Eng. Dept. Y.J.Kim 2009 Chapter 5: Process Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms

More information

Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000

Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000 Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000 Mitesh R. Meswani and Patricia J. Teller Department of Computer Science, University

More information

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi Chang Joo Lee Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2019 Lecture 8 Scheduling Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ POSIX: Portable Operating

More information

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Dan Doucette School of Computing Science Simon Fraser University Email: ddoucett@cs.sfu.ca Alexandra Fedorova

More information

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections )

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections ) CPU Scheduling CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections 6.7.2 6.8) 1 Contents Why Scheduling? Basic Concepts of Scheduling Scheduling Criteria A Basic Scheduling

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

Chapter 8 Memory Management

Chapter 8 Memory Management 1 Chapter 8 Memory Management The technique we will describe are: 1. Single continuous memory management 2. Partitioned memory management 3. Relocatable partitioned memory management 4. Paged memory management

More information

(b) External fragmentation can happen in a virtual memory paging system.

(b) External fragmentation can happen in a virtual memory paging system. Alexandria University Faculty of Engineering Electrical Engineering - Communications Spring 2015 Final Exam CS333: Operating Systems Wednesday, June 17, 2015 Allowed Time: 3 Hours Maximum: 75 points Note:

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Chapter 5 CPU scheduling

Chapter 5 CPU scheduling Chapter 5 CPU scheduling Contents Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Thread Scheduling Operating Systems Examples Java Thread Scheduling

More information

A Study of the Performance Tradeoffs of a Tape Archive

A Study of the Performance Tradeoffs of a Tape Archive A Study of the Performance Tradeoffs of a Tape Archive Jason Xie (jasonxie@cs.wisc.edu) Naveen Prakash (naveen@cs.wisc.edu) Vishal Kathuria (vishal@cs.wisc.edu) Computer Sciences Department University

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Operating Systems. Process scheduling. Thomas Ropars.

Operating Systems. Process scheduling. Thomas Ropars. 1 Operating Systems Process scheduling Thomas Ropars thomas.ropars@univ-grenoble-alpes.fr 2018 References The content of these lectures is inspired by: The lecture notes of Renaud Lachaize. The lecture

More information

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s)

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) CPU Scheduling The scheduling problem: - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) When do we make decision? 1 / 31 CPU Scheduling new admitted interrupt exit terminated

More information

c 2004 by Ritu Gupta. All rights reserved.

c 2004 by Ritu Gupta. All rights reserved. c by Ritu Gupta. All rights reserved. JOINT PROCESSOR-MEMORY ADAPTATION FOR ENERGY FOR GENERAL-PURPOSE APPLICATIONS BY RITU GUPTA B.Tech, Indian Institute of Technology, Bombay, THESIS Submitted in partial

More information

But this will not be complete (no book covers 100%) So consider it a rough approximation Last lecture OSPP Sections 3.1 and 4.1

But this will not be complete (no book covers 100%) So consider it a rough approximation Last lecture OSPP Sections 3.1 and 4.1 ADRIAN PERRIG & TORSTEN HOEFLER ( 252-0062-00 ) Networks and Operating Systems Chapter 3: Scheduling Source: slashdot, Feb. 2014 Administrivia I will try to indicate book chapters But this will not be

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

PROCESS SCHEDULING II. CS124 Operating Systems Fall , Lecture 13

PROCESS SCHEDULING II. CS124 Operating Systems Fall , Lecture 13 PROCESS SCHEDULING II CS124 Operating Systems Fall 2017-2018, Lecture 13 2 Real-Time Systems Increasingly common to have systems with real-time scheduling requirements Real-time systems are driven by specific

More information