Virtual Private Machines: A Resource Abstraction for Multi-Core Computer Systems

Kyle J. Nesbit
University of Wisconsin-Madison
Department of Electrical and Computer Engineering
kjnesbit@ece.wisc.edu

James Laudon
Google Inc.
jlaudon@google.com

James E. Smith
University of Wisconsin-Madison
Department of Electrical and Computer Engineering
jes@ece.wisc.edu

Abstract

Virtual Private Machines (VPMs) are an abstraction for managing resource sharing in multi-core computer systems. A VPM consists of a complete set of resources, which includes both spatial (microarchitecture) and temporal (processor time slice) resources. Tasks assigned VPMs achieve a minimum level of performance regardless of the other tasks in the system; that is, a VPM provides performance isolation. The VPM abstraction provides the interface between a system's resource management policies and mechanisms. VPM policies, implemented primarily in software, translate system-level performance requirements into VPM assignments. Then VPM mechanisms, implemented in hardware, enforce the VPM assignments. To illustrate the potential of the VPM abstraction, we propose and implement a complete set of VPM policies and mechanisms. The policies translate applications' system-level Quality of Service requirements into VPMs and distribute unassigned and unused resources in order to optimize aggregate system-level performance. A simulation-based study shows that the proposed VPM policies and mechanisms, in combination, provide a high degree of QoS and can significantly improve aggregate performance.

1. Introduction

Sharing hardware resources among concurrently executing tasks improves the overall efficiency of a computer system. If poorly managed, however, resource sharing can also lead to poor Quality of Service (QoS) and unpredictable performance. Traditionally, shared resources are managed at coarse granularities: complete processor(s) and real memory pages. Conventional hardware ISA features and mechanisms allow an OS to implement policies for managing these shared resources efficiently. With the evolution toward single-chip multi-core hardware, however, concurrently executing threads share fine-grained resources at the microarchitecture level. This poses a problem because contemporary systems do not provide mechanisms and ISA support for managing microarchitecture resource sharing. In the absence of such support, hardware-implemented sharing policies are unavoidable. Hardware-implemented policies have a number of disadvantages; for example, they lack flexibility and may conflict with OS objectives and policies.

Consequently, future multi-core computer systems should employ coordinated hardware and software resource management. Doing so is difficult, however. Firstly, modern multi-core chips contain several inter-dependent shared resources, e.g., cache storage and memory system bandwidth. Secondly, modern high-performance, high-volume chips are used in diverse environments, e.g., from embedded devices to servers. Such environments have unique workload- and system-dependent performance objectives. Thirdly, performance objectives for a given system may conflict. For example, providing each thread with a guaranteed level of service from each of the shared resources (our definition of QoS) often conflicts with providing maximum aggregate performance over the collection of threads. Furthermore, achieving a good balance between these conflicting objectives is both workload and system dependent. Therefore, microarchitecture resource management requires a well-structured framework for building resource management solutions that can be tailored to a system's specific performance requirements. To be effective, such a framework must be consistent with well-established operating system principles and have a distinct separation between mechanisms and policies [7], where a system's mechanisms provide a universal set of workload-independent resource management primitives and policies provide system- and workload-dependent resource management solutions. To this end, we propose the Virtual Private Machine (VPM) framework.

1.1 Virtual Private Machine Framework

The real resources of a machine can be distributed into a number of VPMs, where each VPM consists of spatial and temporal components. The spatial component of a VPM specifies the fractions of the system's microarchitecture resources that are dedicated to that VPM [22][23][3]. As an example, consider a baseline system containing four processors with private L1 caches and a shared L2 cache, main memory, and supporting interconnection structures. In Figure 1a, a policy has distributed these resources amongst four VPMs. Each of the VPMs contains a single processor. The policy assigns VPM 1 a significant fraction (50%) of the resources to support a demanding multimedia application and assigns the other three VPMs a much lower fraction of resources (10% each). These assignments leave 20% of the cache and memory resources unallocated, which is excess service. Excess service also includes allocated, but unused, resources. In the VPM framework, excess service policies distribute the system's excess service among the active tasks, thereby assuring that resources are not wasted if there is a task that can use them.

The temporal component of a VPM is based on the well-established concept of Generalized Processor Sharing (GPS) [24] (Figure 1b), and it specifies the fraction of processor time (processor time slices) that a VPM's spatial resources are dedicated to the VPM.¹

¹ In general, the spatial component of a VPM may incorporate multiple processors, and, although they pose no special difficulties, multiprocessor VPMs are outside the scope of this paper.

As with spatial VPM resources, there may be excess temporal resources.

The VPM abstraction provides the conceptual interface between software policies and hardware mechanisms. Software policies translate tasks' performance requirements into VPM resource assignments, and hardware mechanisms enforce the assignments at runtime by offering the tasks at least the assigned amount of service (i.e., QoS). With the assumption that a task will only perform better if it is given additional resources (performance monotonicity), QoS leads to the desirable property of performance isolation; that is, the task performs at least as well as it would if it were executing on a real machine with a configuration equivalent to the task's assigned VPM. This level of performance is assured regardless of the other tasks in the system.

Figure 1. a) Four Spatial VPMs and b) a Complete VPM (including temporal component).

1.2 General Purpose VPM Policies

The VPM framework is capable of satisfying the performance objectives of a wide range of systems and workloads, from embedded systems, to desktops, to servers. In this paper, we apply VPMs to general purpose systems and propose a set of general purpose policies that seamlessly integrate temporal and spatial resource sharing. General purpose systems have two basic performance objectives: predictable single-task performance (QoS / performance isolation) and high aggregate performance. As noted above, these objectives often conflict. Hence, the proposed policies allow a system administrator to dedicate a fraction of each of the system's shared resources (both spatial and temporal) to QoS by assigning them to VPMs. The non-dedicated resources are excess service, which the proposed policies use for optimizing system-level aggregate performance. Therefore, by adjusting the fraction of the system's resources that are dedicated to QoS, the system administrator can tune a system's QoS and aggregate performance to the specifics of the system's workload.

To optimize system-level performance, the policies distribute excess service to the shortest job first (SJF). SJF is a highly effective, well-established heuristic that is used in most general purpose operating systems [][28]. Nonetheless, this is the first work that applies the SJF heuristic to microarchitecture resource sharing. Furthermore, this is the first work that provides a complete microarchitecture resource sharing solution for satisfying realistic, conflicting system-level performance objectives.

Most prior microarchitecture resource sharing work focuses on a single shared resource [8][0][5][20][25][32] and optimizing IPC-based metrics [5][8][9][0][][3][5][20][25][26][32] that are less meaningful to system and application software developers than well-established system-level metrics, e.g., average turn around time.

We evaluate the proposed VPM-based policies through simulation. In contrast to prior work, our simulations model realistic OS scheduling algorithms and each of the baseline system's shared microarchitecture resources in detail, e.g., the cache storage, the status registers at each level of the cache hierarchy, and the cache array, interconnect, and SDRAM bandwidth. We show that with the proposed policies a system administrator can configure a system to ensure tasks are offered an equitable share of the system's resources (QoS), or the system administrator can relax a system's QoS constraints, thus allowing the excess service policies to aggressively optimize aggregate performance. When the system's QoS constraints are relaxed, for our selected benchmarks, the proposed SJF excess service policies can improve average throughput by 86% and turn around time by 77% when compared with a conventional multi-core system that uses least recently used (LRU) cache replacement and first come first serve (FCFS) memory system arbiters. In addition, we show that, as expected, such improvements in aggregate performance come at the cost of reduced QoS and performance isolation.

2. Prior Work

2.1 Hardware Policies and Mechanisms

Most prior research on microarchitecture resource sharing combines hardware policies and mechanisms in order to optimize instructions per cycle (IPC)-based metrics, e.g., IPC-based QoS [5][3], aggregate performance [9][25][32], and fairness metrics [5][20]. However, it is our position that such an approach is insufficient. Firstly, hardware is inflexible, so it is difficult for hardware-based policies to account for all system objectives at design time, particularly in emerging platforms where system objectives are rapidly evolving. Secondly, conventional OS policies are oblivious to tasks' performance measured in IPC, and there is often no clear relationship between commonly used IPC-based metrics and well-established system-level objectives such as improving response time and throughput measured in tasks or jobs per second. Thirdly, OS policies have a global view of the system resources and are better suited for managing resource sharing. Mixing independently developed software and hardware policies can lead to unstable, unpredictable system behavior.

2.2 Software Policies

Generalized Processor Sharing (GPS) is a well-established model of QoS [24] that is frequently applied in networking, operating systems, and real-time systems [2][3][9][30][34]. GPS can satisfy the QoS requirements of many different real-time task models, e.g., periodic, aperiodic, sporadic, inter-sporadic, and rate-based task models [30][34], and is compatible with most general purpose operating systems [9][28].

Briefly, a GPS server has a processing capacity s and a task set that is characterized by a set of positive shares, φ_1, φ_2, …, φ_N, one share per task. As long as the task set is feasible, i.e., Σ φ_i ≤ 1, a GPS server guarantees that each task i will receive processor service Q_i ≥ φ_i · s · T over any time period T that the task is continuously in ready or running states.² It is important to emphasize that s is a fixed rate (processor bandwidth).

Although a good objective, GPS is often unachievable in realistic environments. The basis of GPS is the notion of fluid processor service, where multiple tasks can receive processor service simultaneously. In practice, processor service is distributed in finite time quanta (time slices), one quantum at a time. The concept of proportionate progress captures these non-ideal traits [2]. A task i makes proportionate progress if, over any period of T quanta that the task is continuously in ready or running states, it receives Q_i ≥ ⌊s · φ_i · T⌋ quanta of processor service.³ PD² ("PD squared") is an efficient proportionate-fair (p-fair) multiprocessor scheduling algorithm [3], which, given a feasible task set, ensures all tasks make proportionate progress. On a multiprocessor, a feasible task set satisfies Σ φ_i ≤ p and φ_i ≤ 1 for all i, where p is the number of processors in the system.

In practice, general purpose OS schedulers often combine proportional sharing with aggregate performance optimizations [][28]. One common optimization heuristic is to prioritize tasks SJF, which tends to improve system-level performance metrics such as average response time and throughput measured in jobs per second [][28].

Recent research has focused on software policies for shared microarchitecture resources [7][8][9][4][26][27][29]. Symbiotic scheduling is a technique that improves aggregate system performance by preventing threads with conflicting resource requirements from being co-scheduled [27][29]. El-Haj-Mahmoud et al. and Jain et al. study real-time periodic task scheduling on SMT processors [7][4]. Lee et al. propose a methodology to test and deploy real-time applications on systems with shared microarchitecture resources [6]. Rafique et al. propose using hardware mechanisms to allow an OS to manage shared cache storage [26]. Fedorova et al. and Guo et al. propose schedulers that provide weakly defined QoS properties [8][9]; these schedulers focus only on shared cache storage resources and use ad hoc task models (e.g., strict, elastic, and opportunistic) and scheduling techniques. In contrast with prior work, we provide formal definitions of QoS and performance isolation that are derived from well-established concepts [24] and are compatible with most common real-time task models and general purpose operating systems. Furthermore, the definitions and proposed policies account for all of the system's shared microarchitecture resources, not just shared cache storage.

² This definition of GPS assumes [26] Σ φ_i = 1 to make it consistent with proportionate progress (defined in Section 2.2).
³ The definition of proportionate progress [3] also has an upper bound on quanta that we drop because it does not apply here.
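The GPS bound and feasibility conditions above can be made concrete with a small sketch. The following C fragment is illustrative only; the share values, the unit server capacity, and the function names are ours, not part of any scheduler interface.

#include <stdio.h>

/* A task set is feasible on p processors if sum(phi) <= p and each
 * phi_i <= 1 (the single-server GPS case is p = 1). */
static int feasible(const double *phi, int n, int p)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (phi[i] > 1.0)
            return 0;
        sum += phi[i];
    }
    return sum <= (double)p;
}

/* Minimum service task i must receive over an interval of length T:
 * Q_i >= phi_i * s * T under GPS, or at least floor(s * phi_i * T)
 * quanta once service is quantized (proportionate progress). */
static double gps_min_service(double phi_i, double s, double T)
{
    return phi_i * s * T;
}

int main(void)
{
    double phi[] = { 0.5, 0.25, 0.25 };
    printf("feasible: %d\n", feasible(phi, 3, 1));
    printf("Q_0 over T=100: %.1f\n", gps_min_service(phi[0], 1.0, 100.0));
    return 0;
}

For the shares {0.5, 0.25, 0.25}, the set is feasible on a single server, and task 0 is guaranteed at least 50 units of service over any 100-unit interval in which it is continuously ready or running.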

For our evaluation, we model all shared microarchitecture resources in detail and apply a provably optimal p-fair multiprocessor scheduling algorithm to approximate ideal proportional sharing (GPS) [2][24].

3. Virtual Private Machines

As described in the introduction, the VPM abstraction provides the interface between resource management policies and mechanisms. Policies translate tasks' performance requirements into VPM configurations, and the mechanisms enforce the VPM configurations. In this section, we precisely define the VPM abstraction (Section 3.1) and the properties a system must provide in order to satisfy the VPM abstraction (Section 3.2).

3.1 Formal Definition

A Virtual Private Machine has a spatial component (R) and a temporal component (T). The spatial component of a VPM is composed of a processor core and an assigned portion of all the multi-core system's shared microarchitecture resources. The temporal component of a VPM specifies the fraction of processor time that the spatial VPM resources are assigned to the VPM.

A task i's spatial VPM, denoted as R_i, is defined as the element-wise product of its assigned vector of resource shares (A_i = <φ_i^1, φ_i^2, …, φ_i^n>) and a vector of the system's shared resource capacities (R_sys = <r_sys^1, r_sys^2, …, r_sys^n>), i.e., R_i = <φ_i^1 · r_sys^1, φ_i^2 · r_sys^2, …, φ_i^n · r_sys^n>. For example, if a system's kth shared resource is the L2 cache storage, then task i is assigned φ_i^k of the system's shared L2 cache storage, or φ_i^k · r_sys^k bytes of L2 cache storage. Note that spatial VPMs form a partial ordering. If R_1 = <a_1, a_2, …, a_n> and R_2 = <b_1, b_2, …, b_n>, then R_1 ≥ R_2 iff a_1 ≥ b_1 ∧ a_2 ≥ b_2 ∧ … ∧ a_n ≥ b_n. It is important to emphasize that the ordering is not total. For example, if we have two spatial VPMs, and one of them has more cache storage while the other VPM has more memory bandwidth, then the two VPMs are incomparable. The rationale for introducing the partial ordering will become evident in the next subsection.

The temporal component of a VPM specifies the fraction of processor time that the spatial (microarchitecture) resources are assigned to the VPM. The definition of the temporal component follows from the definition of GPS [24]. A task i is assigned a share φ_i of processor time, i.e., T_i = φ_i · T during any time period T that the task is continuously ready or running.

Combining the spatial and temporal components creates a complete VPM V_i = (A_i · R_sys) ⊗ (φ_i · T). We use the ⊗ operator to compose the spatial and temporal components. Complete VPMs are also partially ordered. For example, if V_1 = (A_1 · R_sys) ⊗ (φ_1 · T) and V_2 = (A_2 · R_sys) ⊗ (φ_2 · T), then V_1 ≥ V_2 iff A_1 · R_sys ≥ A_2 · R_sys ∧ φ_1 · T ≥ φ_2 · T. By factoring out the R_sys and T constants, we have V_1 ≥ V_2 iff A_1 ≥ A_2 ∧ φ_1 ≥ φ_2.
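As a concrete illustration of the spatial-VPM formalism and its partial ordering, the following C sketch computes R_i as the element-wise product of A_i and R_sys and tests dominance. The three-resource layout and the capacity values are assumptions made for the example, not system parameters.

#include <stdbool.h>
#include <stdio.h>

#define N_RESOURCES 3   /* e.g., L2 capacity, L2 bandwidth, memory bandwidth */

typedef struct { double r[N_RESOURCES]; } vpm_spatial;

/* R_i = A_i element-wise-times R_sys. */
static vpm_spatial spatial_vpm(const double share[N_RESOURCES],
                               const double r_sys[N_RESOURCES])
{
    vpm_spatial v;
    for (int k = 0; k < N_RESOURCES; k++)
        v.r[k] = share[k] * r_sys[k];
    return v;
}

/* Partial order: true iff a >= b in every component.  Two VPMs can be
 * incomparable (neither dominates the other). */
static bool dominates(const vpm_spatial *a, const vpm_spatial *b)
{
    for (int k = 0; k < N_RESOURCES; k++)
        if (a->r[k] < b->r[k])
            return false;
    return true;
}

int main(void)
{
    double r_sys[N_RESOURCES] = { 2048.0 /* KB */, 16.0 /* GB/s */, 6.4 /* GB/s */ };
    double a1[N_RESOURCES] = { 0.50, 0.50, 0.50 };
    double a2[N_RESOURCES] = { 0.10, 0.10, 0.10 };
    vpm_spatial v1 = spatial_vpm(a1, r_sys), v2 = spatial_vpm(a2, r_sys);
    printf("v1 >= v2: %d, v2 >= v1: %d\n", dominates(&v1, &v2), dominates(&v2, &v1));
    return 0;
}

If the second VPM instead had more cache storage but less memory bandwidth, dominates() would return false in both directions; the two VPMs would be incomparable.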

In the next subsection, we use these formalisms to precisely define the properties a VPM system should provide.

3.2 Quality of Service and Performance Isolation

In general, there is no formal, agreed-upon definition of QoS. Within the VPM framework, however, we are able to provide a formal definition. Then, using this definition, we can show clearly how QoS leads to the very desirable property of performance isolation.

Definition: A task i is offered Quality of Service if, at runtime, the task is offered a VPM V_offered that is greater than or equal to the VPM it is assigned. That is:

V_offered ≥ (A_i · R_sys) ⊗ (φ_i · T)

An important property of this definition is that it is in terms of service, i.e., it is a bound on the minimum amount of service that a task is offered. As part of the next definition, we employ the abstract relationship Perf(W, V) that maps a task's workload W and a VPM V to performance, e.g., measured in transactions per second.

Definition: A system satisfies performance monotonicity if, for any two VPMs such that V_1 ≥ V_2 and any workload W, the performance of the workload on V_1 is greater than or equal to the performance of the same workload on V_2. That is:

V_1 ≥ V_2 implies Perf(W, V_1) ≥ Perf(W, V_2)

Note that not all systems satisfy performance monotonicity [22], but it is an important assumption that holds under most conditions.

Definition: A system provides performance isolation for task i if its performance when running on its VPM (Perf_i) is greater than or equal to its performance when running on a real machine configured with the same resources as the task's assigned VPM. That is:

Perf_i ≥ Perf(W_i, (A_i · R_sys) ⊗ (φ_i · T))

An important property of this definition is that it is stated in terms of performance, i.e., it is a bound on a task's minimum performance. Moreover, a task's minimum performance depends only on its assigned VPM and is independent of other tasks in the system. Now, we can tie the three definitions together by making an important assertion that relates assigned service to achieved performance. (If space allowed, the assertion could be stated and proved as a theorem.)

Assertion: If a system provides QoS and satisfies performance monotonicity, then the system provides performance isolation.

QoS and performance isolation are ideal primitives for building a range of resource management policies. Firstly, the definitions of QoS and performance isolation presented in this work are workload independent. Workload independence ensures that the definitions are applicable to any workload and are appropriate primitives to incorporate into a system's architecture.

Secondly, the definitions' formalisms provide a well-defined interface between policy and mechanisms, thus facilitating the design of scalable resource management solutions. Thirdly, the definitions extend the notion of GPS to multi-core computer systems, which makes the definitions compatible with most common general purpose operating systems and real-time task models (discussed further in the next section).

4. General Purpose VPM Policies

The VPM abstraction has the ability to satisfy a wide variety of performance objectives, e.g., the VPM abstraction can satisfy most real-time task models' requirements and facilitate the optimization of parallel applications. Satisfying specific performance objectives requires specialized VPM policies. In its full generality, the overall policy design space is enormous. For example, policies can assign tasks homogeneous VPMs or unique heterogeneous VPMs based on the tasks' specific requirements and workload characteristics. Policies can vary VPM assignments dynamically based on tasks' phase behaviors. Policies can be implemented in a concealed hypervisor layer, in the OS, in user space as part of a user-level scheduler, or offline, which may involve the application developer and profiling tools. Furthermore, a system can support multiple policies at the discretion of the OS developers. Exploring the entire design space is outside the scope of this paper; it is a promising area for future research.

To illustrate the VPM framework's potential, we focus on general purpose systems, which have two basic performance objectives: predictable single-task performance (QoS / performance isolation) and high aggregate performance. Because these objectives conflict, most general purpose schedulers are configurable, e.g., in Linux, a system administrator or OS developer can decrease the scheduling time granularity, thus improving the system's response time, or can increase the scheduling time granularity, thus reducing context switching overhead and improving the system's aggregate performance [9]. The proposed VPM-based policies follow this basic design philosophy. The proposed policies are capable of providing high QoS or high aggregate performance, where the balance between QoS and aggregate performance is selectable. Moreover, like the scheduling time granularity example, the proposed policies are straightforward and based on concepts that are familiar to most system administrators and OS developers. In the next subsection (Section 4.1), we discuss the proposed general purpose QoS policy, and in Section 4.2, we discuss the proposed excess service policy.

4.1 QoS Policy

In general purpose operating systems, the level of service a task is offered is determined by the task's static priority (user-defined priority). For example, the Completely Fair Scheduler (CFS) in Linux [9] roughly offers a task i a share φ_i = min{1, p · (pri_i / Σ_j pri_j)} of the processor bandwidth, where pri_i is task i's static priority and p is the number of processors in the system.
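A minimal C sketch of the share computation just described follows; the clamp at one full processor is our reading of the formula, and the priority array and function name are illustrative.

#include <stdio.h>

/* phi_i = min{1, p * pri_i / sum_j pri_j}: a task's share of total
 * processor bandwidth, clamped at one full processor's worth
 * (the clamp is an assumption made for this sketch). */
static double processor_share(const double *pri, int n, int i, int p)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += pri[j];
    double share = (double)p * (pri[i] / sum);
    return share > 1.0 ? 1.0 : share;
}

int main(void)
{
    double pri[8] = { 1, 1, 1, 1, 1, 1, 1, 1 };   /* equal static priorities */
    printf("phi_0 = %.2f\n", processor_share(pri, 8, 0, 4));   /* 0.50 */
    return 0;
}

With eight equal-priority tasks on four processors, each task's share works out to ½, the even-share temporal weight used later in the evaluation.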

CFS and most general purpose schedulers are based on the assumption that the processor service a task receives per time quantum (time slice) is constant. Unfortunately, contemporary multi-core and multi-threaded systems with shared microarchitecture resources violate this assumption because the processor service a task receives per time quantum depends on the task's workload characteristics, the workload characteristics of the other tasks with which it is co-scheduled, and the hardware's sharing policies. The VPM framework solves this problem because the spatial component of a VPM (A_i · R_sys) accounts for microarchitecture resource sharing, and therefore, a policy can use the spatial component of a VPM to recapture the intended QoS and performance isolation properties of a scheduler such as CFS.

The proposed QoS policy allows a system administrator to dedicate a fraction of each of the system's temporal and spatial resources to QoS. The QoS policy then divides these dedicated resources evenly amongst tasks, so all the tasks are assigned the same homogeneous spatial VPM (A · R_sys); notice that we dropped the i subscript on A. Therefore, the minimum processor service a task receives per time quantum is constant for all quanta. Note that with this policy, the system-wide A parameter always satisfies A ≤ <1/p, 1/p, …, 1/p>. The QoS policy calculates the VPMs' temporal component (φ_i) similar to the way the Linux scheduler determines tasks' processor service shares. A task i's temporal service share is φ_i = min{1, λ · p · (pri_i / Σ_j pri_j)}, where λ is the fraction of the temporal (time slice) resources that are dedicated to QoS.

4.2 Shortest Job First Excess Service Policy

Excess service is service that is either unassigned or is assigned but unused. Excess service can be temporal (time slices) or spatial (microarchitecture) resources. The proposed excess service policy distributes excess service to the shortest job first (SJF). To distribute service SJF, the policy monitors tasks' average execution times and assigns tasks VPM priorities (π_i), where the shortest task has the greatest priority and the longest task has the least priority; this use of priorities is similar to the Linux scheduler's use of dynamic priorities []. The policy passes these priorities to the VPM scheduler (Section 6) and hardware mechanisms (Section 5); the VPM scheduler and hardware mechanisms distribute excess service highest priority first.

5. VPM Architectural Support

A system's hardware must provide OS developers with the primitives needed to construct effective resource management policies; ideally, architectural support would allow OS developers to construct any practical resource management solution. In general, a system's architectural support allows the operating system to communicate with the system's hardware mechanisms in order to control the system's shared resources. For example, in conventional systems, the page table (or software-managed TLB) is an ISA feature that allows page management policies to communicate with hardware translation mechanisms and control the distribution of physical memory.
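The SJF excess-service policy can be sketched as follows. The decaying-average update mirrors the rule given later in Section 7.1; the struct fields and the priority encoding (larger value = higher priority) are illustrative.

#include <stdio.h>
#include <stdlib.h>

struct task {
    int id;
    double ave_exec_time;   /* decayed average of completed subtask times */
    int vpm_priority;       /* larger value = higher priority */
};

/* ave <- (ave + exec_time) / 2, applied when a subtask completes. */
static void update_ave(struct task *t, double exec_time)
{
    t->ave_exec_time = (t->ave_exec_time + exec_time) / 2.0;
}

static int by_exec_time_desc(const void *a, const void *b)
{
    double x = ((const struct task *)a)->ave_exec_time;
    double y = ((const struct task *)b)->ave_exec_time;
    return (x < y) - (x > y);   /* longest first */
}

/* Longest task gets priority 0; the shortest gets n-1 (the greatest). */
static void assign_priorities(struct task *tasks, int n)
{
    qsort(tasks, n, sizeof(tasks[0]), by_exec_time_desc);
    for (int i = 0; i < n; i++)
        tasks[i].vpm_priority = i;
}

int main(void)
{
    struct task tasks[3] = { {0, 40.0, 0}, {1, 10.0, 0}, {2, 25.0, 0} };
    update_ave(&tasks[1], 12.0);             /* a subtask of task 1 completes */
    assign_priorities(tasks, 3);
    for (int i = 0; i < 3; i++)
        printf("task %d: ave=%.1f prio=%d\n",
               tasks[i].id, tasks[i].ave_exec_time, tasks[i].vpm_priority);
    return 0;
}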

In addition, page tables often support touched bits that the OS can periodically read and clear, which support many different page replacement algorithms. Similar ISA support is required for the VPM framework. This support allows the VPM policies to 1) communicate with the hardware's mechanisms (Section 5.1), 2) control the system's microarchitecture resources in order to satisfy policies' VPM assignments (Section 5.2), and 3) provide support for software excess service policies (Section 5.3).

5.1 VPM Control Registers

VPM software/hardware communication is done via architected control registers and supporting (privileged) instructions that read/write the control registers. To communicate spatial VPM assignments (A_i), each shared microarchitecture resource k has a set of architected control registers (C.φ_i^k), one register per hardware thread; we use a dot notation with a C prefix to distinguish control registers. These registers store the running tasks' assigned share of the kth resource (φ_i^k). In our baseline multi-core system, there are three composite shared microarchitecture resources: 1) L2 cache bandwidth, 2) L2 cache storage, and 3) SDRAM memory system bandwidth. Each of these composite resources consists of multiple elementary resources. The L2 cache bandwidth consists of multiple banks, each with interconnect, tag array, and data array bandwidth resources [22]. The SDRAM memory bandwidth consists of multiple banks and channels [2]. Therefore, for our baseline multi-core system, a task i's spatial shares can be represented as a 3-tuple A_i = <φ_i^L2_CS, φ_i^L2_BW, φ_i^Mem_BW>, where φ_i^L2_CS is task i's share of cache storage, φ_i^L2_BW is its assigned share of the L2 cache bandwidth, and φ_i^Mem_BW is its share of SDRAM memory bandwidth. Each elementary resource has a hardware mechanism that enforces resource assignments independently; however, in this work, the elementary resources are combined into the three composite resources listed above, and the mechanisms within a composite resource are all controlled by a single control register (C.φ_i^k).

Additional control registers are required to support the priority-based excess service policies. To communicate tasks' VPM priorities (π_i), the hardware has a set of priority control registers (C.π_i), one register for each hardware thread. To distribute excess cache storage, the hardware tracks tasks' LRU stack distance histograms [25] and communicates the tasks' histograms through control registers (C.lru_stack_hist_i[j]). There is one register per hardware thread per cache way, where C.lru_stack_hist_i[j] is the counter that tracks the ith hardware thread's jth cache way. In the remainder of this section, we describe how the hardware mechanisms compute and consume the values stored in the system's control registers.
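One possible software view of the per-thread control-register state described above is sketched below. The field widths, the fixed-point share encoding, and the 32-way histogram depth are assumptions made for illustration, not an ISA definition.

#include <stdint.h>
#include <stdio.h>

#define N_HW_THREADS 4
#define L2_WAYS      32

enum vpm_resource { L2_CS, L2_BW, MEM_BW, N_SHARED_RESOURCES };

struct vpm_ctrl_regs {
    /* C.phi_i^k: thread i's assigned share of resource k, stored here as
     * a fraction of the resource's quanta in 1/256ths (encoding is ours). */
    uint8_t  share[N_HW_THREADS][N_SHARED_RESOURCES];

    /* C.pi_i: thread i's VPM priority, consumed by the excess-service
     * mechanisms in the arbiters. */
    uint8_t  priority[N_HW_THREADS];

    /* C.lru_stack_hist_i[j]: hit counter for thread i at LRU stack depth
     * (cache way) j, maintained by dynamic set sampling. */
    uint32_t lru_stack_hist[N_HW_THREADS][L2_WAYS];
};

int main(void)
{
    struct vpm_ctrl_regs regs = { 0 };
    regs.share[0][L2_CS] = 128;          /* thread 0: half of the L2 storage */
    regs.priority[0] = 7;
    printf("thread 0 L2 share = %u/256, priority = %u\n",
           regs.share[0][L2_CS], regs.priority[0]);
    return 0;
}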

5.2 Hardware Mechanisms

Hardware mechanisms enforce the spatial component of VPM assignments, R_i = (A_i · R_sys). An ideal VPM hardware mechanism ensures each task i is offered r_i^k = φ_i^k · r_sys^k of the system's kth shared resource when it is running. However, in real systems, shared microarchitecture resources are typically partitioned into fixed size quanta. For example, way-partitioning enforces cache storage assignments at the column or way granularity [5][22][25]. Therefore, a hardware mechanism ensures that each task is offered r_i^k ≥ ⌊φ_i^k · r_sys^k⌋ quanta of the shared resource when the task is running. A complete set of hardware mechanisms (each shared resource is under the control of a hardware mechanism) ensures each task i is offered a spatial VPM R_i ≥ ⌊A_i · R_sys⌋ when the task is running. The A_i parameters that we evaluate in this paper coincide with the shared resources' natural partitioning points, and therefore, the floor operator has no effect on our evaluation.

In this work, we use the way-partitioning algorithm presented in [22] to enforce cache storage resource assignments and the fair queuing (FQ) algorithms presented in [2] and [22] to enforce the cache and SDRAM memory system bandwidth assignments. Way-partitioning is implemented using a simple thread-aware replacement policy that requires few changes to the underlying cache microarchitecture. The FQ bandwidth partitioning algorithms operate within a virtual time framework [34]. When a request arrives, the FQ algorithm calculates the request's virtual start-time and virtual finish-time, which are the request's start and finish times if its task were running on its own virtual private bandwidth resource [2][22][34]. A request's virtual finish-time is the time the request must finish (its deadline) in order to fulfill the minimum bandwidth guarantee under ideal conditions. The cache and SDRAM arbiters service requests in earliest virtual finish-time first order to ensure threads are offered their allocated share of the bandwidth resources [34]; in effect, the arbiters schedule requests earliest deadline first (EDF) [6].

5.3 Hardware Support for Excess Service Policies

In addition to enforcing spatial resource assignments, hardware must facilitate software excess service policies in distributing excess memory system bandwidth and cache storage.

5.3.1 Hardware Support for Distributing Excess Bandwidth

Excess memory system bandwidth is available for very short periods of time, and therefore, must be distributed by hardware mechanisms. We propose a simple extension to the baseline arbiters [2][22] that distributes excess bandwidth to the task with the highest VPM priority (π_i). To implement the excess service policy, we leverage the concept of eligibility [2][34]. The key insight is that a request can be delayed up to its virtual start-time without violating task i's bandwidth assignment. When a request's virtual start-time has elapsed (i.e., hw-clock ≥ virtual start-time, where hw-clock holds the current time), the request is eligible and is serviced earliest virtual finish-time first in order to satisfy the task's bandwidth assignment. When a request's virtual start-time is ahead of the current time, i.e., hw-clock < virtual start-time, the request is ineligible.
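The following sketch shows per-thread virtual-time bookkeeping in the general fair-queuing style: a request's virtual start-time is the later of the current clock and the thread's previous virtual finish-time, and the finish-time adds the request's service time scaled by the reciprocal of its share. The exact bookkeeping of the arbiters cited above may differ; this is only an illustration of the eligibility and deadline quantities they use.

#include <stdio.h>

struct fq_thread {
    double share;            /* assigned fraction of the bandwidth resource */
    double last_vfinish;     /* virtual finish-time of the previous request */
};

struct fq_request {
    double vstart;
    double vfinish;
};

static struct fq_request fq_tag(struct fq_thread *t, double hw_clock,
                                double service_time)
{
    struct fq_request r;
    r.vstart  = hw_clock > t->last_vfinish ? hw_clock : t->last_vfinish;
    r.vfinish = r.vstart + service_time / t->share;   /* deadline under ideal sharing */
    t->last_vfinish = r.vfinish;
    return r;
}

int main(void)
{
    struct fq_thread t = { .share = 0.25, .last_vfinish = 0.0 };
    struct fq_request a = fq_tag(&t, 100.0, 10.0);   /* 10-cycle access */
    struct fq_request b = fq_tag(&t, 105.0, 10.0);   /* arrives while busy */
    printf("a: start=%.0f finish=%.0f\n", a.vstart, a.vfinish);
    printf("b: start=%.0f finish=%.0f\n", b.vstart, b.vfinish);
    return 0;
}

In the example, the second request's virtual start-time (140) is ahead of the clock (105), so it would be ineligible and would compete for excess bandwidth by VPM priority, as described above.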

If a task's requests are ineligible, the task has consumed more than its assigned share of the bandwidth resource since the arbiter was reset. Therefore, the task's ineligible requests are contending for excess service and are serviced highest priority first. To summarize, the proposed cache arbiter and memory scheduler compute virtual start- and finish-times as in [2][22], and select requests in the following order: 1) the eligible request with the earliest virtual finish-time; 2) if no requests are eligible, the request with the highest priority; and 3) if more than one request has the same priority, the request with the earliest virtual finish-time.

The FQ arbiters have built-in history that must be reset whenever a new task is context switched in [20][2][22], i.e., when a task is context switched on to a hardware context, the hardware context's virtual start-time register is reset using virtual start-time ← hw-clock. Therefore, the arbiters track the amount of excess service a task has received (normalized to its assigned share) since the task was context switched on to the processor.

5.3.2 Hardware Support for Distributing Excess Cache Storage

In contrast to memory system bandwidth, tasks' cache storage usage changes relatively slowly; therefore, software can effectively manage excess cache storage by adjusting the cache storage control registers (C.φ_i^L2_CS) during context switches (discussed in Section 6.2). However, the software excess cache storage policies require hardware mechanisms to monitor cache usage. In general, there are many ways to monitor cache usage. In this work, we use dynamic set sampling (DSS) [25]. DSS approximates tasks' LRU stack distance histograms, which, as described earlier (Section 5.1), are stored in software-accessible control registers (C.lru_stack_hist_i[j]).

6. VPM Software Support

In this section, we discuss the VPM software support, which consists of the VPM scheduler (Section 6.1) and the excess cache storage policy (Section 6.2).

6.1 VPM Scheduler

The VPM scheduler ensures that co-scheduled tasks' spatial VPM assignments do not conflict, and that tasks' temporal VPM assignments are satisfied. To satisfy a task's temporal resource assignment, the VPM scheduler controls the amount of processor time a task is offered. Ideally, a task i should be offered T_i ≥ φ_i · T during any period of time T that the task is continuously in ready or running states. However, because processor time is partitioned and distributed in fixed sized quanta (time slices), the VPM scheduler offers tasks T_i ≥ ⌊φ_i · T⌋ quanta. When the VPM scheduler context switches a task onto a processor, the scheduler communicates the task's spatial resource shares A_i to the VPM hardware mechanisms through control registers. However, to ensure tasks' spatial resource assignments can be satisfied by the hardware mechanisms, the VPM scheduler must ensure that the running set of tasks' spatial assignments do not conflict, i.e., Σ_{i ∈ running tasks} A_i ≤ <1, 1, …, 1>.
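The co-schedule constraint above is easy to check directly; the sketch below assumes the three composite resources of Section 5.1 and a simple array representation of the A_i vectors.

#include <stdbool.h>
#include <stdio.h>

#define N_SHARED_RESOURCES 3

/* Returns true iff the running tasks' spatial shares sum to at most 1
 * for every shared resource. */
static bool coschedule_feasible(const double A[][N_SHARED_RESOURCES],
                                const int *running, int n_running)
{
    for (int k = 0; k < N_SHARED_RESOURCES; k++) {
        double sum = 0.0;
        for (int r = 0; r < n_running; r++)
            sum += A[running[r]][k];
        if (sum > 1.0)
            return false;
    }
    return true;
}

int main(void)
{
    /* Eight tasks, each assigned 1/4 of every resource (p = 4). */
    double A[8][N_SHARED_RESOURCES];
    for (int i = 0; i < 8; i++)
        for (int k = 0; k < N_SHARED_RESOURCES; k++)
            A[i][k] = 0.25;
    int running[4] = { 0, 3, 5, 7 };
    printf("feasible: %d\n", coschedule_feasible(A, running, 4));
    return 0;
}

With the homogeneous assignment A_i = <¼, ¼, ¼> on a four-processor system, any four running tasks sum to exactly <1, 1, 1>, so the check always passes, which is the special case exploited next.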

In its full generality, the VPM scheduling problem is a multi-resource scheduling problem with no known efficient solution. Fortunately, to support the policies proposed in Section 4, we do not have to solve the general VPM scheduling problem. Instead, we can focus on the case where all tasks' spatial VPM assignments satisfy A_i ≤ <1/p, 1/p, …, 1/p>, where p is the number of processors in the system; we leave the general case for future work. In this special case, tasks' spatial VPM assignments cannot conflict no matter how tasks are co-scheduled. Consequently, we can apply an existing p-fair multiprocessor scheduling algorithm to satisfy a task set's temporal VPM assignments without worrying about spatial conflicts.

In this work, we implement a VPM scheduler based on the PD² scheduling algorithm [2], but we could have used any p-fair scheduler. We chose PD² because it is well-suited for integrating a priority-based excess time slice policy. The PD² algorithm has two scheduling queues: a release queue and a ready queue. When a task enters the system or is context switched out, its release-time is calculated and the task is added to the PD² release queue. The task remains in the release queue until it becomes eligible, i.e., os-clock ≥ release-time, where os-clock is the OS's internal clock. In effect, the release queue prevents tasks from consuming excess time slices. After a task becomes eligible, its PD² priority is computed and it is added to the ready queue. To schedule a task, the scheduler selects from the ready queue the task with the highest PD² priority. The reasoning behind the PD² release time and internal priority algorithms is quite involved; we refer the reader to [2] for the details.

The PD² algorithm naturally supports a priority-based policy to distribute excess time slices. The key insight is that if the PD² ready queue is empty, the current time slice is excess. Therefore, when the ready queue is empty, the scheduler selects the task with the highest VPM priority (π_i) from the release queue. This technique is analogous to the FQ arbiters' eligibility logic (Section 5.3.1).

6.2 Excess Cache Storage Policy

Excess cache storage is quite different from both processor and memory system bandwidth resources. Firstly, it requires oracle knowledge to decide with certainty that cache storage is unused, e.g., whether a line will be accessed before it is replaced [22]. Therefore, the proposed excess cache storage policy only distributes cache storage that is unassigned (1 - Σ_i φ_i^L2_CS). Secondly, the utility a thread receives from additional cache storage is workload dependent. Therefore, in order to make efficient use of excess cache storage, we propose a cache storage policy that takes into account tasks' priorities and their cache utility.

The proposed utility-aware policy distributes excess cache storage to the task that has the highest VPM priority (π_i) and a cache utility above util_threshold. The cache utility threshold is specified as the number of misses eliminated if the task were assigned an extra quantum (way) of cache storage. To determine a task's utility for an additional way of excess cache storage, the excess cache storage policy reads the task's stack distance histograms (C.lru_stack_hist_i[j]) [25].
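A much-simplified sketch of the scheduling decision described above follows: tasks whose release times have passed compete by PD² priority; if none are eligible, the time slice is excess and the highest-VPM-priority released task runs instead. PD² release times and priorities are abstracted here as plain fields; the real algorithms are considerably more involved.

#include <stddef.h>
#include <stdio.h>

struct sched_task {
    const char *name;
    double release_time;    /* earliest time the task may next run */
    int    pd2_priority;    /* stand-in for the PD2 internal priority */
    int    vpm_priority;    /* SJF-derived priority for excess service */
};

static struct sched_task *pick_next(struct sched_task *t, int n, double os_clock)
{
    struct sched_task *best_ready = NULL, *best_released = NULL;
    for (int i = 0; i < n; i++) {
        if (os_clock >= t[i].release_time) {          /* eligible: ready queue */
            if (!best_ready || t[i].pd2_priority > best_ready->pd2_priority)
                best_ready = &t[i];
        } else {                                      /* still in release queue */
            if (!best_released || t[i].vpm_priority > best_released->vpm_priority)
                best_released = &t[i];
        }
    }
    return best_ready ? best_ready : best_released;   /* excess slice if no ready task */
}

int main(void)
{
    struct sched_task t[3] = {
        { "A", 12.0, 5, 1 },
        { "B", 14.0, 2, 7 },   /* short job: highest VPM priority */
        { "C", 16.0, 9, 0 },
    };
    printf("t=10: %s\n", pick_next(t, 3, 10.0)->name);   /* all ineligible -> B */
    printf("t=13: %s\n", pick_next(t, 3, 13.0)->name);   /* A eligible -> A */
    return 0;
}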

The software policy runs during context switches and controls the excess cache storage using the cache's control registers (C.φ_i^L2_CS).

7. Evaluation

7.1 Timing Model

We evaluate the proposed VPM policies and mechanisms through detailed simulation of a multi-core system. The multi-core timing model is based on a structural simulator developed at IBM Research [4]. The model's default configuration is a single processor IBM 970 system []. In its default configuration, the model was validated to be within ±5% of the 970 design group's latch-level processor model. In this paper, we use an alternative simulator configuration (see Table 1) to avoid 970-specific design constraints. The simulator has a cycle accurate model of the L2 cache microarchitecture and an on-chip memory controller attached to a DDR2-800 memory system [8]. Memory system buffers are statically partitioned as in [2][22]. We implemented the hardware mechanisms as described in Section 5. The DSS mechanism is configured with 32 sampled sets [25].

Table 1. System Configuration

Processors: four 4 GHz processors
Issue Buffers: 20 entry BRU/CRU, two 24 entry FXU/LSU, 24 entry FPU
Issue Width: 8 units (2 FXU, 2 LSU, 2 FPU, 1 BRU, 1 CRU)
Reorder Buffer: 24 dispatch groups, 5 instr. per dispatch group
LS Queues: 32 entry load reorder queue, 32 entry store queue
L1 I-Cache: 16KB private, 4-way, 64 byte lines, 2 cycles, 8 MSHRs
L1 D-Cache: 16KB private, 4-way, 64 byte lines, 2 cycles, 6 MSHRs
L2 Cache: runs at ½ core frequency, 2 banks, 2MB, 32-way, 64 byte lines, 6 controller state machines per thread, 8 store gathering entries per thread, read bypassing, retire-at-6 policy, partial-flush on read conflict, 2 cycle interconnect, 4 cycle tag, 8 cycle data, 6 byte data bus per bank
Memory Controller: on-chip, runs at ½ core freq., 6 transaction entries per thread, 8 write buffer entries per thread, closed page policy
SDRAM: DDR2-800, 1 channel, 2 ranks per channel, 8 banks per rank

The VPM software support is implemented as described in Section 6. The SJF excess service policy uses the tasks' average execution times (ave_exec_time_i) to compute VPM priorities. The tasks' average execution times are stored in the tasks' process control blocks and are updated whenever a task's subtask completes; we discuss the task model further in the next subsection. The policy uses a decaying average to update tasks' average execution times, i.e., ave_exec_time_i ← (ave_exec_time_i + exec_time) / 2, where exec_time is the execution time of the just-completed subtask. The timer interrupt granularity is 2 ms, which is the default Linux scheduling granularity [9]. We set the cache storage utility threshold util_threshold using (timer_interrupt_granularity * 0.05) / ave_memory_latency, which roughly approximates whether an additional way of excess cache storage will improve a task's performance by at least 5%.
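The threshold arithmetic can be reproduced directly from the configuration in Table 1; the 100-cycle average miss penalty is the round figure used in the calculation stated in the next paragraph, and the variable names here are ours.

#include <stdio.h>

int main(void)
{
    double core_freq_hz       = 4.0e9;   /* 4 GHz cores (Table 1) */
    double timer_interval_s   = 2.0e-3;  /* 2 ms timer interrupt granularity */
    double ave_miss_penalty   = 100.0;   /* cycles per miss (approximate) */
    double target_improvement = 0.05;    /* 5% of a timer interval */

    double cycles_per_interval = core_freq_hz * timer_interval_s;   /* 8M cycles */
    double util_threshold = cycles_per_interval * target_improvement
                            / ave_miss_penalty;
    printf("util_threshold = %.0f misses\n", util_threshold);       /* 4000 */
    return 0;
}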

Based on our system configuration, the utility threshold is 4,000 misses, i.e., (8 million cycles per timer interrupt × 0.05) / 100 cycles per miss.

7.2 Workloads

To evaluate the proposed policies, we use a quasi-statistical multiprogram workload because the simulation overhead of a detailed multi-core model prevents us from running a real multiprogram workload at meaningful OS time granularities. Furthermore, a statistical workload gives us greater control over the experimental setup, thus allowing us to focus the evaluation on the workload-independent resource management principles presented in the body of this paper. The workload consists of a set of single-threaded tasks; each task consists of a sequence of subtasks. We use the SPEC 2000 benchmark suite to generate the workload, i.e., each task is a SPEC benchmark and a subtask is a section of a benchmark's execution. The specific benchmark suite is not especially important for this study; what is important is that the applications in the benchmark suite exhibit a realistic range of resource utilization(s).

To set up the workload, we started with a simulation time constraint and worked backwards. Early in our analysis we decided simulations must complete in two days in order to make the analysis tractable. With the baseline configuration in Table 1, our simulator can simulate approximately 1 billion cycles (125 timer interrupts) in two days. Given the constraint of 1 billion simulated cycles, we chose the number of tasks in the workload and the size of tasks' subtasks so that 1) a subtask's initial cold-start misses were less than 10% of the subtask's total misses, 2) each task completed at least one subtask on the baseline configuration, and 3) each simulation completed at least 100 subtasks. To satisfy these constraints, we chose a workload with 8 tasks and 10 million instructions per subtask.

We used the benchmarks' memory system utilizations to choose the eight benchmarks. Figure 2 illustrates the cache and SDRAM bandwidth resource utilization with the baseline system's full cache storage and with ¼ of the cache storage. Cache data array utilization is used as a proxy for cache bandwidth utilization, and SDRAM data bus bandwidth utilization is used as a proxy for memory bandwidth utilization [2][22]. The benchmarks are ordered by their SDRAM memory bandwidth utilization. We selected the eight leftmost benchmarks to form the workload because they put the most pressure on the system's shared resources, and consequently, on the proposed policies and mechanisms.

Figure 2. Bandwidth utilization (memory and cache bandwidth, with full and ¼ cache storage, for art, swim, lucas, mcf, equake, wupwise, facerec, mgrid, bzip2, ammp, apsi, twolf, gcc, vpr, gap, mesa, gzip, sixtrack, perlbmk, crafty, and the mean).

For each of the eight benchmarks, we generated ten 10-million-instruction subtask traces using statistical sampling [2]. Although each subtask trace has the same number of instructions, their execution times on the baseline system vary from 8.2 ms to 82.5 ms, or from 4 to 41 times the scheduling granularity. During initialization, the simulator randomly selects one subtask from each task's ten subtask traces and injects it into the simulated system. During simulation, when a task's subtask completes, the simulator randomly selects a subtask from the task's ten subtask traces and injects it into the simulated system.

To conservatively approximate the percentage of total misses that may be cold-start misses, we follow Wood et al. [33] and divide the number of cache blocks by each subtask's misses when running alone. All of the subtasks' initial cold-start misses were less than 10% of their total misses, and on average, less than 4.4% of the total misses. It is important to emphasize that this is a very conservative approximation, i.e., most often the actual cold-start miss count will be much lower; moreover, a subtask's initial cold-start misses are relatively insignificant when compared to its misses due to context switching.

We ran each simulation four times; the simulations are non-deterministic due to the random order in which subtasks enter the system. We ensured that the average throughput of the four simulations had converged to within 2%. The results from all four simulations are included in the results section.

7.3 Metrics

To capture both temporal and spatial resource sharing effects, we chose to use turn around time (TAT) as the primary metric. TAT starts at the time a subtask enters the system and ends at the time the subtask completes and leaves the system. To measure performance isolation, we first compute a subtask's isolated TAT, which is the subtask's TAT if it were running on a private system configured with an even share of the baseline system's shared resources and no more, i.e., ¼ of the cache storage, ¼ of the cache bandwidth, and ¼ of the memory bandwidth of the baseline system. We collect the subtasks' execution times and time-scale them by the task's even-share temporal weight, i.e., the multi-core simulations have four processors and eight processes; therefore, a subtask's isolated TAT is even_share_exec_time / ½. To measure performance isolation, we normalize a subtask's measured TAT to its isolated TAT. A normalized TAT of less than one means the subtask met its even-share performance isolation target. To measure aggregate performance, we use average TAT, which is the TAT averaged over all tasks' subtasks. We also include throughput (Thrpt) results measured as the number of subtasks completed per second.

7.4 Results

The results presented in this section illustrate the effects of the proposed policies from the perspective of a system administrator. To simplify the evaluation, we assume tasks have equal static priorities, and therefore, each task has the same system-wide temporal weight (φ). The administrator dedicates a fraction of the system's shared temporal and spatial resources to providing single-task QoS. These resources (both temporal and spatial) are divided evenly to form homogeneous VPMs, which are passed to the VPM scheduler through the A and φ parameters.
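The isolation metric of Section 7.3 reduces to a couple of lines; the example values below are hypothetical.

#include <stdio.h>

/* Isolated TAT: even-share execution time scaled by the even-share
 * temporal weight (1/2 for eight tasks on four processors). */
static double isolated_tat(double even_share_exec_time, double temporal_weight)
{
    return even_share_exec_time / temporal_weight;
}

static double normalized_tat(double measured_tat, double even_share_exec_time,
                             double temporal_weight)
{
    return measured_tat / isolated_tat(even_share_exec_time, temporal_weight);
}

int main(void)
{
    /* A subtask that runs for 20 ms on an even-share private machine has an
     * isolated TAT of 40 ms; a measured TAT of 36 ms meets the target (< 1). */
    printf("normalized TAT = %.2f\n", normalized_tat(36.0, 20.0, 0.5));
    return 0;
}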

For example, if the administrator dedicates the whole cache (1) to providing QoS, then each task's VPM is assigned ¼ of the system's cache storage, i.e., φ_i^L2_CS = ¼. In the case where all cache storage is dedicated to QoS, there is no cache storage dedicated to optimizing aggregate performance. Or, if the administrator dedicates half of the cache (½) to QoS, then each task's VPM is assigned 1/8 of the cache storage, and the other half of the cache storage is dedicated to optimizing aggregate performance via the excess service policy.

For the first set of results (Figure 3), we analyze the effects of dedicating each shared resource independently. To study a single resource, we vary the fraction of the resource that is dedicated to QoS, while all other shared resources are fully dedicated to QoS and are held constant. For example, to study the effects of the cache storage, we vary the amount of cache storage dedicated to QoS while all memory system bandwidth and processor time slice resources are dedicated to QoS and held constant, i.e., A = <x, ¼, ¼> and φ = ½. Figure 3a illustrates the effects of varying the fraction of the system's processor time slices dedicated to QoS. Figure 3b illustrates the effects of varying the fraction of the memory system bandwidth (both cache and SDRAM bandwidth) dedicated to QoS. We compare the proposed bandwidth mechanisms and policies to first come first serve (fcfs) arbiters and the fair queuing (fq) arbiters presented in [2][22]. Figure 3c illustrates the effects of varying the fraction of the cache storage dedicated to QoS. We compare the proposed cache storage mechanisms and policies to LRU (lru) cache replacement and the utility policy presented in [25].

The key in each graph shows the fraction of the resource dedicated to QoS, i.e., 1, ¾, ½, and ¼. The graph on the left side of each figure shows each task's normalized TATs. Tasks are ordered by their execution time when executing alone, with the longest execution time on the left. The bars plot the tasks' average normalized TATs. The error bars illustrate the range of the tasks' normalized TATs. The graph on the right side of each figure shows the aggregate performance improvement with respect to all resources dedicated to QoS (1).

The policies' most significant trends hold across all three graphs in Figure 3. For the case where all of the resources are dedicated to QoS (the data points labeled 1), the policies and mechanisms satisfy the VPM abstraction's performance isolation requirements. This result is illustrated by the graphs on the left side of each figure, i.e., for the fully dedicated configurations (1), the top of each error bar (the maximum normalized TAT) is less than one. Consequently, the tasks' subtasks run faster than they would if running on a single processor machine with a configuration that is equivalent to an even-share VPM. Reducing the level of QoS from 1 to ¼ significantly decreases the degree of performance isolation offered to the tasks, i.e., the ¼ configuration has a much greater deviation in normalized TAT. The normalized TATs of the tasks on the left side of the graphs (art, mcf, lucas, swim) increase significantly in some


Scheduling II. Today. Next Time. ! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling. ! Scheduling II Today! Proportional-share scheduling! Multilevel-feedback queue! Multiprocessor scheduling Next Time! Memory management Scheduling with multiple goals! What if you want both good turnaround

More information

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain

Software-assisted Cache Mechanisms for Embedded Systems. Prabhat Jain Software-assisted Cache Mechanisms for Embedded Systems by Prabhat Jain Bachelor of Engineering in Computer Engineering Devi Ahilya University, 1986 Master of Technology in Computer and Information Technology

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs

Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs University of Maryland Technical Report UMIACS-TR-2008-13 Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs Wanli Liu and Donald Yeung Department of Electrical and Computer Engineering

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors Walid El-Reedy Electronics and Comm. Engineering Cairo University, Cairo, Egypt walid.elreedy@gmail.com Ali A. El-Moursy Electrical

More information

Decoupled Zero-Compressed Memory

Decoupled Zero-Compressed Memory Decoupled Zero-Compressed Julien Dusser julien.dusser@inria.fr André Seznec andre.seznec@inria.fr Centre de recherche INRIA Rennes Bretagne Atlantique Campus de Beaulieu, 3542 Rennes Cedex, France Abstract

More information

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Noname manuscript No. (will be inserted by the editor) The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation Karthik T. Sundararajan Timothy M. Jones Nigel P. Topham Received:

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Chapter 14 Performance and Processor Design

Chapter 14 Performance and Processor Design Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures

More information

Operating System Review Part

Operating System Review Part Operating System Review Part CMSC 602 Operating Systems Ju Wang, 2003 Fall Virginia Commonwealth University Review Outline Definition Memory Management Objective Paging Scheme Virtual Memory System and

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

LECTURE 3:CPU SCHEDULING

LECTURE 3:CPU SCHEDULING LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives

More information

Precedence Graphs Revisited (Again)

Precedence Graphs Revisited (Again) Precedence Graphs Revisited (Again) [i,i+6) [i+6,i+12) T 2 [i,i+6) [i+6,i+12) T 3 [i,i+2) [i+2,i+4) [i+4,i+6) [i+6,i+8) T 4 [i,i+1) [i+1,i+2) [i+2,i+3) [i+3,i+4) [i+4,i+5) [i+5,i+6) [i+6,i+7) T 5 [i,i+1)

More information

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Unit 3 : Process Management

Unit 3 : Process Management Unit : Process Management Processes are the most widely used units of computation in programming and systems, although object and threads are becoming more prominent in contemporary systems. Process management

More information

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Appears in Proc. of the 18th Int l Conf. on Parallel Architectures and Compilation Techniques. Raleigh, NC. Sept. 2009. Using Aggressor Thread Information to Improve Shared Cache Management for CMPs Wanli

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Chapter 5: Process Scheduling

Chapter 5: Process Scheduling Chapter 5: Process Scheduling Chapter 5: Process Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Thread Scheduling Operating Systems Examples Algorithm

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s)

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) 1/32 CPU Scheduling The scheduling problem: - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) When do we make decision? 2/32 CPU Scheduling Scheduling decisions may take

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services Overview 15-441 15-441 Computer Networking 15-641 Lecture 19 Queue Management and Quality of Service Peter Steenkiste Fall 2016 www.cs.cmu.edu/~prs/15-441-f16 What is QoS? Queuing discipline and scheduling

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Uniprocessor Scheduling. Basic Concepts Scheduling Criteria Scheduling Algorithms. Three level scheduling

Uniprocessor Scheduling. Basic Concepts Scheduling Criteria Scheduling Algorithms. Three level scheduling Uniprocessor Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Three level scheduling 2 1 Types of Scheduling 3 Long- and Medium-Term Schedulers Long-term scheduler Determines which programs

More information

Scheduling. Today. Next Time Process interaction & communication

Scheduling. Today. Next Time Process interaction & communication Scheduling Today Introduction to scheduling Classical algorithms Thread scheduling Evaluating scheduling OS example Next Time Process interaction & communication Scheduling Problem Several ready processes

More information

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems

Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems 1 Presented by Hadeel Alabandi Introduction and Motivation 2 A serious issue to the effective utilization

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Process Scheduling. Copyright : University of Illinois CS 241 Staff

Process Scheduling. Copyright : University of Illinois CS 241 Staff Process Scheduling Copyright : University of Illinois CS 241 Staff 1 Process Scheduling Deciding which process/thread should occupy the resource (CPU, disk, etc) CPU I want to play Whose turn is it? Process

More information

Operating Systems Unit 3

Operating Systems Unit 3 Unit 3 CPU Scheduling Algorithms Structure 3.1 Introduction Objectives 3.2 Basic Concepts of Scheduling. CPU-I/O Burst Cycle. CPU Scheduler. Preemptive/non preemptive scheduling. Dispatcher Scheduling

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Memory management. Requirements. Relocation: program loading. Terms. Relocation. Protection. Sharing. Logical organization. Physical organization

Memory management. Requirements. Relocation: program loading. Terms. Relocation. Protection. Sharing. Logical organization. Physical organization Requirements Relocation Memory management ability to change process image position Protection ability to avoid unwanted memory accesses Sharing ability to share memory portions among processes Logical

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW

Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW Department of Computer Science Institute for System Architecture, Operating Systems Group REAL-TIME MICHAEL ROITZSCH OVERVIEW 2 SO FAR talked about in-kernel building blocks: threads memory IPC drivers

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Practice Exercises 305

Practice Exercises 305 Practice Exercises 305 The FCFS algorithm is nonpreemptive; the RR algorithm is preemptive. The SJF and priority algorithms may be either preemptive or nonpreemptive. Multilevel queue algorithms allow

More information

Chapter 5: CPU Scheduling. Operating System Concepts 8 th Edition,

Chapter 5: CPU Scheduling. Operating System Concepts 8 th Edition, Chapter 5: CPU Scheduling Operating System Concepts 8 th Edition, Hanbat National Univ. Computer Eng. Dept. Y.J.Kim 2009 Chapter 5: Process Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms

More information

Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000

Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000 Evaluating the Performance Impact of Hardware Thread Priorities in Simultaneous Multithreaded Processors using SPEC CPU2000 Mitesh R. Meswani and Patricia J. Teller Department of Computer Science, University

More information

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi Chang Joo Lee Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2019 Lecture 8 Scheduling Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ POSIX: Portable Operating

More information

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Dan Doucette School of Computing Science Simon Fraser University Email: ddoucett@cs.sfu.ca Alexandra Fedorova

More information

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections )

CPU Scheduling. CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections ) CPU Scheduling CSE 2431: Introduction to Operating Systems Reading: Chapter 6, [OSC] (except Sections 6.7.2 6.8) 1 Contents Why Scheduling? Basic Concepts of Scheduling Scheduling Criteria A Basic Scheduling

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

Chapter 8 Memory Management

Chapter 8 Memory Management 1 Chapter 8 Memory Management The technique we will describe are: 1. Single continuous memory management 2. Partitioned memory management 3. Relocatable partitioned memory management 4. Paged memory management

More information

(b) External fragmentation can happen in a virtual memory paging system.

(b) External fragmentation can happen in a virtual memory paging system. Alexandria University Faculty of Engineering Electrical Engineering - Communications Spring 2015 Final Exam CS333: Operating Systems Wednesday, June 17, 2015 Allowed Time: 3 Hours Maximum: 75 points Note:

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Chapter 5 CPU scheduling

Chapter 5 CPU scheduling Chapter 5 CPU scheduling Contents Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Thread Scheduling Operating Systems Examples Java Thread Scheduling

More information

A Study of the Performance Tradeoffs of a Tape Archive

A Study of the Performance Tradeoffs of a Tape Archive A Study of the Performance Tradeoffs of a Tape Archive Jason Xie (jasonxie@cs.wisc.edu) Naveen Prakash (naveen@cs.wisc.edu) Vishal Kathuria (vishal@cs.wisc.edu) Computer Sciences Department University

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Operating Systems. Process scheduling. Thomas Ropars.

Operating Systems. Process scheduling. Thomas Ropars. 1 Operating Systems Process scheduling Thomas Ropars thomas.ropars@univ-grenoble-alpes.fr 2018 References The content of these lectures is inspired by: The lecture notes of Renaud Lachaize. The lecture

More information

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s)

CPU Scheduling. The scheduling problem: When do we make decision? - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) CPU Scheduling The scheduling problem: - Have K jobs ready to run - Have N 1 CPUs - Which jobs to assign to which CPU(s) When do we make decision? 1 / 31 CPU Scheduling new admitted interrupt exit terminated

More information

c 2004 by Ritu Gupta. All rights reserved.

c 2004 by Ritu Gupta. All rights reserved. c by Ritu Gupta. All rights reserved. JOINT PROCESSOR-MEMORY ADAPTATION FOR ENERGY FOR GENERAL-PURPOSE APPLICATIONS BY RITU GUPTA B.Tech, Indian Institute of Technology, Bombay, THESIS Submitted in partial

More information

But this will not be complete (no book covers 100%) So consider it a rough approximation Last lecture OSPP Sections 3.1 and 4.1

But this will not be complete (no book covers 100%) So consider it a rough approximation Last lecture OSPP Sections 3.1 and 4.1 ADRIAN PERRIG & TORSTEN HOEFLER ( 252-0062-00 ) Networks and Operating Systems Chapter 3: Scheduling Source: slashdot, Feb. 2014 Administrivia I will try to indicate book chapters But this will not be

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

PROCESS SCHEDULING II. CS124 Operating Systems Fall , Lecture 13

PROCESS SCHEDULING II. CS124 Operating Systems Fall , Lecture 13 PROCESS SCHEDULING II CS124 Operating Systems Fall 2017-2018, Lecture 13 2 Real-Time Systems Increasingly common to have systems with real-time scheduling requirements Real-time systems are driven by specific

More information