A Multiprogramming Aware OpenMP Implementation


Scientific Programming 11 (2003), IOS Press

Vasileios K. Barekas, Panagiotis E. Hadjidoukas, Eleftherios D. Polychronopoulos and Theodore S. Papatheodorou
High Performance Information Systems Laboratory, Department of Computer Engineering and Informatics, University of Patras, Rio 26500, Patras, Greece
E-mail: {bkb,peh,edp,tsp}@hpclab.ceid.upatras.gr

Abstract. In this work, we present an OpenMP implementation suitable for multiprogrammed environments on Intel-based SMP systems. This implementation consists of a runtime system and a resource manager, while we use the NanosCompiler to transform OpenMP-coded applications into code with calls to our runtime system. The resource manager acts as the operating system scheduler for the applications built with our runtime system. It executes a custom scheduling policy to distribute the available physical processors to the active applications. The runtime system cooperates with the resource manager in order to adapt each application's generated parallelism to the number of processors allocated to it, according to the resource manager's scheduling policy. We use the OpenMP version of the NAS Parallel Benchmarks suite to evaluate the performance of our implementation, comparing it with that of a commercial OpenMP implementation. The comparison shows that our approach performs better in both a dedicated and a heavily multiprogrammed environment.

1. Introduction

The OpenMP Application Programming Interface [9] provides a simple and portable model for programming a wide range of parallel applications on parallel platforms. This simple standard allows applications to be parallelized without paying attention to the underlying architecture. The OpenMP API is portable across a wide range of parallel platforms, including small-scale SMP servers, scalable cc-NUMA multiprocessors and clusters of workstations. The simplicity of the model derives from the fact that the programmer does not need to worry about the details of the underlying platform or the operating system mechanisms. The programmer simply inserts directives into the original sequential code to annotate loops and sections of code that can be executed in parallel. OpenMP support is provided transparently to the user through an OpenMP-capable compiler and the necessary runtime support. The compiler translates the user-inserted directives into appropriate runtime calls that enable the application to execute in parallel with multiple threads, using a fork/join execution model. For every part of the code that has to be executed in parallel, a group of threads is created and a chunk of the total work is assigned to each thread according to a pre-defined scheduling scheme. These threads are scheduled on the available physical processors by the operating system scheduler.

On the other hand, multiprocessor systems are also used as multiprogrammed compute servers, where several users submit parallel and sequential CPU-intensive applications. The threads of all these applications contend for the system's available resources, especially the processors and the memory. Since general-purpose operating system schedulers are designed to deal with sequential applications, they fail to achieve good performance when workloads consisting of multiple parallel applications run on the system. This is due to the scheduler's ignorance of the nature of the parallel applications and their real requirements. It is therefore crucial to provide efficient mechanisms that enable parallel applications to achieve robust performance in multiprogrammed environments.

An efficient way to achieve better performance with OpenMP applications executed in multiprogrammed environments is the employment of a sophisticated runtime system. Such a runtime system enables parallel applications to adapt themselves to the resources made available to them by the operating system: the applications exploit the amount of parallelism that matches the number of processors that the operating system scheduler allocates to them. This approach requires close collaboration between the operating system scheduler and the runtime system, in order to keep each part informed about the requirements and/or the decisions of the other. Although the OpenMP standard already makes provision for dynamic parallelism, several OpenMP implementations either do not support it or support it in a very limited way. In those implementations, the runtime system creates and uses as many threads as each application initially requests, without taking into account the variations of the system workload during execution. This is due to the limitations that the operating system scheduler sets, and to the lack of collaboration between the scheduler and the runtime system.

In this paper, we present an OpenMP implementation consisting of a runtime system and an independent resource manager, which plays the role of the operating system scheduler. The resource manager provides to the runtime system the functionality needed to adapt the application parallelism to the available resources. The collaboration between these two parts makes our implementation capable of supporting the execution of OpenMP applications in a multiprogrammed environment. We utilize the NanosCompiler [2] to transform the original OpenMP-coded applications into equivalent code that contains calls to our runtime system. The runtime system is based on the NTLib multithreaded runtime library [3]. NTLib is a highly optimized library developed for the Windows NT/2000 operating system, which provides lightweight user-level threads and the necessary support to run parallel applications on Intel-based SMP systems. We have extended the runtime library to support the code that the compiler injects into the original OpenMP-coded application. The resource manager we use in this work is based on a portable resource manager [4], which is implemented as a loadable kernel-mode driver. The resource manager communicates with the applications through a shared memory area lying between the operating system kernel space, where the resource manager runs, and the user space, where the applications run. The implementation of the resource manager as a loadable kernel module provides several advantages, such as portability and efficiency: we gain the advantages of an in-kernel implementation while avoiding the need to modify and/or recompile the kernel. The original resource manager was extended in order to adapt its behavior to the peculiarities of OpenMP applications executed in a multiprogrammed environment. The runtime system, in combination with the resource manager, provides an environment capable of efficiently executing OpenMP applications on multiprogrammed systems. The functionality of our environment is provided transparently to the programmer of the OpenMP applications.
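As an illustration of what remains unchanged for the programmer, the following minimal C fragment (our own example, not taken from the benchmarks used later) parallelizes a loop with a single OpenMP directive; the creation of threads, the distribution of work and the final join are handled entirely by the compiler and the runtime system:

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        /* The directive is the only change to the sequential code: the
           compiler outlines the loop body and the runtime forks a team of
           threads, each executing a chunk of the iteration space. */
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (int i = 0; i < N; i++) {
            a[i] = 0.5 * i;
            b[i] = 2.0 * i;
            sum += a[i] * b[i];
        }

        printf("dot product = %f\n", sum);
        return 0;
    }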
In order to evaluate our approach, we have used the OpenMP implementation of the NAS Parallel Benchmarks suite [5]. The experiments on a quad-processor SMP system show that our implementation achieves significantly better performance than a commercial OpenMP implementation, both in a dedicated and in a multiprogrammed environment. In the multiprogrammed environment, we also perform better both in the turnaround time of the individual applications in each workload and in the total execution time of the whole workload. Furthermore, for the execution of an application in a dedicated environment, using only our runtime library, we show that it achieves speedups comparable to or better than those of a commercial OpenMP compiler environment. The rest of this paper is organized as follows: Section 2 describes the features of the runtime system. In Section 3, we present the functionality of the resource manager. Next, in Section 4, we describe how OpenMP applications utilize our environment. In Section 5, we present the experiments that validate the efficiency of our implementation. Finally, in Section 6 we provide our conclusions together with related work.

2. Runtime system

NTLib is a multithreaded runtime library that provides lightweight user-level threads and the necessary support to exploit, with minimal overhead, the parallelism found in applications. The lightweight threads enable the library to efficiently exploit fine-grain parallelism, while its dependence-driven execution model makes it capable of exploiting multiple levels of parallelism. The cost of the user-level thread primitives is very low compared to other thread packages. The runtime library exports a set of functions that supports the nanothreads programming model [7]. These functions are responsible for thread management, the handling of the ready thread queues, the control of the dependencies between threads and the initialization of the runtime environment. More details about the implementation and the functionality of the NTLib runtime library can be found in [3]. The set of exported functions and the functionality of the library were extended in order to support the compiler-generated code that corresponds to the original OpenMP application code. Additionally, support was added to enable the use of the runtime functionality from applications written in the Fortran programming language.

The NanosCompiler utilizes user-level threads to express multiple levels of parallel loops and parallel sections of code. However, in order to express the parallelism of single-level loops, the compiler uses the work descriptor structure. We implement in our runtime system the work descriptor structure and the necessary support functions, based on the implementation proposed in [8], with the appropriate modifications to fit our underlying architecture. A work descriptor consists of a pointer to the function that encapsulates the work to be executed, together with its arguments. Work descriptors provide a very efficient means of distributing work among the participating processors. For each single-level parallel loop, the master processor creates a work descriptor and initializes it for the specific loop. Then, the master processor distributes the work descriptor to all slave processors that participate in the execution of the specific loop, by copying it to a separate memory location for each processor. Each slave processor checks its memory location in order to find a work descriptor, and executes it. The part of the work that each processor executes is determined by the function arguments, the number of processors participating in the execution and the processor's own identifier.

The runtime library executes the created threads according to a dependence-driven execution model, in order to ensure the correct execution order of the created threads and to allow the efficient exploitation of multiple levels of parallelism. For each thread, we keep information about its dependencies on other threads. When a thread finishes its execution, it satisfies one input dependency on each of its successors. When a thread creates a new thread, the dependencies of the creator thread must be preserved; for this reason, we increase the creator thread's dependence count by one and declare it as a successor of the new thread. In this way, we can create multiple levels of nested threads, making it easy to represent the multiple levels of parallelism found in a wide range of applications.
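The following is a rough sketch, in C, of the work descriptor mechanism described above; all type, field and function names are our own illustrative assumptions rather than NTLib's actual interface, and synchronization details such as memory barriers and completion counting are omitted:

    #define MAX_CPUS 4

    /* Signature of a compiler-generated function for a parallel loop body. */
    typedef void (*loop_fn)(long lo, long hi, void *args);

    typedef struct {
        loop_fn      fn;      /* function encapsulating the work          */
        void        *args;    /* arguments shared by all processors       */
        long         lb, ub;  /* global iteration bounds of the loop      */
        int          nprocs;  /* processors participating in this loop    */
        volatile int ready;   /* set once the master has filled the slot  */
    } work_desc_t;

    /* One slot per processor; the master copies the descriptor into each. */
    static work_desc_t slot[MAX_CPUS];

    /* Master side: create and initialize the descriptor, then distribute
       it by copying it to the memory location of every participant. */
    static void distribute_work(loop_fn fn, void *args,
                                long lb, long ub, int nprocs)
    {
        work_desc_t wd = { fn, args, lb, ub, nprocs, 1 };
        for (int p = 0; p < nprocs; p++)
            slot[p] = wd;
    }

    /* Slave side: an idle processor polls its own slot, derives its chunk
       from its identifier and the number of participants, and executes it. */
    static void execute_work(int myid)
    {
        work_desc_t *wd = &slot[myid];
        while (!wd->ready)
            ;                              /* wait for the master          */
        long iters = wd->ub - wd->lb + 1;
        long chunk = (iters + wd->nprocs - 1) / wd->nprocs;
        long lo = wd->lb + (long)myid * chunk;
        long hi = lo + chunk - 1;
        if (hi > wd->ub)
            hi = wd->ub;
        if (lo <= wd->ub)
            wd->fn(lo, hi, wd->args);      /* run this processor's part    */
        wd->ready = 0;
    }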
The main responsibility of the runtime library is to control the generated parallelism, ensuring that it matches the number of processors allocated by the operating system to each application. In other words, the runtime library implements dynamic program adaptability to the available resources by adjusting the amount of generated parallelism. The scheduling policy of the operating system is concerned with the allocation of physical processors to the applications currently running in the system, and the runtime library cooperates with the operating system scheduler in order to achieve the desired adaptation. In our implementation, we use an external resource manager to play the role of the operating system scheduler. The operating system provides kernel-level threads to the runtime system as the kernel abstraction of physical processors on which applications can execute. These kernel-level threads play the role of the virtual processors, which execute the application's user-level threads.

3. Resource manager

In this section, we present the functionality provided by the resource manager to the runtime system. The resource manager is implemented as a loadable kernel-mode device driver; this gives us the ability to start and stop its execution at will. The primary responsibility of the resource manager is to keep track of all applications running on the system and to apply a user-defined scheduling policy to them. The scheduling policy mainly determines the number of physical processors that will be allocated to each application. To achieve this, a shared memory area maintained between the applications and the resource manager acts as the communication path between them. At its initialization phase, the manager creates a shared memory section, which is mapped into the address space of each application that utilizes the runtime system. For each active application in the system, the shared memory area holds information to which both the application and the manager have access. The information kept in the shared area concerns the actual parallelism of the application at any given time, the identifiers of the virtual processors that the application uses, and other useful data. Based on this information, the resource manager calculates the number of physical processors to be allocated to each application and controls the execution of the virtual processors.

When an application starts, it maps the shared memory area into its private address space. Next, the application's main virtual processor creates the remaining virtual processors in suspended mode and registers them in the shared area, making them accessible to the resource manager and thus enabling their manipulation by it. After this point, control of the application's virtual processors passes exclusively to the resource manager, which is responsible for their execution. The resource manager, based on the information in the shared memory area, applies a user-defined scheduling policy to distribute the available physical processors across the registered applications. The scheduling policy is executed periodically, at a fixed time interval called the scheduler quantum. Both the scheduling policy and the scheduler quantum are user-defined and can be changed dynamically at runtime. The scheduling policy determines the number of processors that will be allocated to each application during the next scheduling quantum, as well as which physical processors they will be. The latter is decided based on a per-application history that records the processors on which the application ran during previous quanta.
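A rough sketch of what the per-application record in the shared memory area and the quantum-driven scheduling step could look like is given below; the structure layout and the function names (policy, resume_vp, suspend_vp) are our own assumptions for illustration, not the actual driver interface:

    #define MAX_APPS 32
    #define MAX_VPS  16

    /* Per-application record kept in the shared memory area. */
    typedef struct {
        int pid;               /* owning application                          */
        int requested;         /* parallelism the application can exploit     */
        int allocated;         /* processors granted for the next quantum     */
        int vp_ids[MAX_VPS];   /* identifiers of the registered virtual CPUs  */
        int nvps;              /* number of registered virtual processors     */
        int history[MAX_VPS];  /* physical CPUs used during previous quanta   */
    } app_info_t;

    typedef struct {
        int        napps;
        app_info_t app[MAX_APPS];
    } shared_area_t;

    /* Executed once per scheduler quantum: a user-defined policy fills in
       app[i].allocated, and the manager then resumes or suspends each
       application's virtual processors to enforce the decision. */
    static void scheduling_step(shared_area_t *sa, int ncpus,
                                void (*policy)(shared_area_t *, int),
                                void (*resume_vp)(int),
                                void (*suspend_vp)(int))
    {
        policy(sa, ncpus);
        for (int i = 0; i < sa->napps; i++) {
            app_info_t *a = &sa->app[i];
            for (int v = 0; v < a->nvps; v++) {
                if (v < a->allocated)
                    resume_vp(a->vp_ids[v]);    /* allowed to run              */
                else
                    suspend_vp(a->vp_ids[v]);   /* held until a later quantum  */
            }
        }
    }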
Additionally, the scheduling policy decides which virtual processors will run on these physical processors, in order to preserve affinity. The resource manager applies the scheduling policy decisions to the virtual processors of the applications, allowing some of them to run by resuming their execution, while preventing others by suspending them. In this way, the resource manager keeps the total number of running virtual processors across all applications equal to the number of physical processors in the system.

The running applications cooperate with the resource manager during their execution in order to minimize their execution time, by avoiding the idling of physical processors while there exists some application that could execute useful work on them. Before entering a parallel region, each application informs the resource manager, through the shared memory area, about its requirements, which reflect the actual degree of parallelism that it can exploit in this region. The resource manager responds to the application's requirements and allocates physical processors to it according to the current scheduling policy. The application receives, through the shared memory area, the resource manager's decisions and tries to match the parallelism that it will generate to the number of physical processors currently allocated to it. Based on the requirements of all the applications and the scheduling policy, the resource manager decides on physical processor reallocations between the applications. The removal of a physical processor from an application means blocking the associated virtual processor, while the assignment of a physical processor to an application means unblocking one of the application's virtual processors and binding it to the specific physical processor. This reallocation procedure can block a virtual processor at an unsafe point, for example while it is inside a critical section, preventing the application from making progress on its critical path. In order to eliminate these undesirable preemptions, each virtual processor checks for any non-safely preempted virtual processor whenever it reaches a safe point, where it is known that the application and runtime synchronization constraints are satisfied. If such a virtual processor is found, the currently running virtual processor yields its physical processor to the preempted one.

The general design of the resource manager allows its collaboration with the native operating system scheduler. The resource manager controls the execution of the virtual processors (kernel threads) of the applications built with our runtime library, by allowing some of them to run while preventing others. It does not disturb the kernel threads of the remaining applications that execute on the system at the same time. The virtual processors that the resource manager allows to run contend with the kernel threads of the other applications for the physical processors, and the operating system scheduler decides which of the virtual processors that are allowed to run will actually run on the physical processors. In this way we treat all running applications fairly, without violating the scheduling policy of the operating system. The collaboration of the resource manager with the applications running on the system and its cooperation with the operating system scheduler are depicted in Fig. 1. In this figure there are three applications; the first one is independent, while the other two are controllable by the resource manager. The resource manager's scheduling policy decides to grant two physical processors to each of its applications, but the operating system scheduler ignores this decision and schedules a kernel-level thread of the independent application on physical processor 1. More details about the design and the implementation of the resource manager can be found in [4].

Fig. 1. The resource manager's cooperation with the applications and the operating system scheduler.

4. OpenMP application transformation

In this section, we describe how applications with OpenMP directives use the functionality offered by our environment. We use the NanosCompiler to compile applications that are parallelized with OpenMP directives. The NanosCompiler front-end is capable of converting applications with OpenMP directives into equivalent applications that use calls to our runtime library in order to express the annotated parallelism. During its initialization, each application informs the resource manager about the maximum parallelism that it is capable of exploiting during its execution.

When the NanosCompiler reaches a combined parallel work-sharing construct (OMP PARALLEL DO), which is the common case in the applications examined in this paper, it generates two functions that substitute for the construct and its body. The first function replaces the parallel construct, while the second function contains the code of the parallel loop body. In the first function, the compiler inserts code that consults the resource manager to find out how many physical processors are allocated to the application for the execution of this region. The number of processors allocated to the application is determined through the shared memory area that the resource manager uses for communication with the application. The first function is executed exclusively by the master processor, which creates a work descriptor structure that represents the body of the parallel loop. This work descriptor specifies the second compiler-generated function as the one that represents the loop to be executed in parallel. Next, the master processor distributes the created work descriptor to all the available processors, including itself. The work descriptor is distributed by copying it to a per-processor memory location. At this point, the master virtual processor suspends the execution of the first function in order to execute its own part of the work descriptor. Each slave processor, whenever it becomes idle, searches for a new work descriptor in its own memory location. Each processor calculates the chunk of the work descriptor that it will execute, based on its identifier and the total number of processors participating in the execution of this region. When all processors finish the work descriptor execution, the master virtual processor resumes the execution of the suspended function, while the slave processors remain idle.

When the compiler arrives at a parallel region construct (OMP PARALLEL), it again generates two functions that substitute for the parallel region code. Additionally, in this case the compiler generates one more function for each work-sharing construct (OMP DO or OMP SECTIONS) that exists inside the parallel region. Here the master processor does not create a single work descriptor that all available virtual processors execute. Instead, it creates as many user-level threads as the number of allocated physical processors. The created threads are executed on the available processors through their local ready queues, and each of these threads executes the compiler-generated functions that represent the work-sharing constructs residing inside the parallel region.

The original implementation of the resource manager was designed to work with general parallel applications, where no distinction is made between the processors of an application, since all of them are considered equal. Some modifications in the resource manager design were required in order for it to work properly with the fork/join model of the OpenMP applications. This model defines a master processor, which executes different code than the other processors. Since this code belongs to the critical path of the application, the resource manager must treat the master processor with higher priority. This means that the master processor must always be running, independently of the number of physical processors assigned to the application at any time. In the extreme case under multiprogramming, where a single physical processor is shared among several OpenMP applications, the only virtual processor of each application that executes on this physical processor at any time is its master processor.
Thus, each application makes progress along its critical path; if another physical processor later becomes available to the application, some of its slave virtual processors will be assigned for execution on it.
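To make the transformation described at the beginning of this section more concrete, the sketch below shows roughly the shape of the two functions that could be generated for a combined OMP PARALLEL DO construct, reusing the work-descriptor sketch from Section 2. The function names and the runtime entry points used here (rm_request_cpus, wait_descriptor_done) are illustrative assumptions, not the actual output of the NanosCompiler:

    /* Runtime entry points assumed from the earlier sketches (hypothetical). */
    extern int  rm_request_cpus(int requested);
    extern void wait_descriptor_done(int nprocs);
    extern void distribute_work(void (*fn)(long, long, void *), void *args,
                                long lb, long ub, int nprocs);
    extern void execute_work(int myid);

    /* Second generated function: the parallel loop body, executed by every
       participating processor on its own chunk of the iteration space. */
    static void loop_body_func(long lo, long hi, void *args)
    {
        double *a = (double *)args;
        for (long i = lo; i <= hi; i++)
            a[i] = 2.0 * a[i];
    }

    /* First generated function: executed only by the master processor.  It
       consults the resource manager through the shared memory area, builds
       a work descriptor for the loop body, distributes it, executes its own
       part and then waits for the slave processors to finish. */
    static void parallel_construct_func(double *a, long n)
    {
        int requested = 4;                          /* parallelism of the region  */
        int granted   = rm_request_cpus(requested); /* processors currently held  */

        distribute_work(loop_body_func, a, 0, n - 1, granted);
        execute_work(0);                            /* the master is processor 0  */
        wait_descriptor_done(granted);              /* implicit barrier           */
    }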

5. Performance evaluation

In this section we present the experiments that we have conducted in order to evaluate the performance of our implementation. In our experiments we used the NAS Parallel Benchmarks suite, parallelized with the OpenMP programming model. The NAS Parallel Benchmarks suite is one of the best known benchmarking suites; it is written in Fortran 77 and is often used for performance evaluation on multiprocessor systems. In all experiments we used the class W problem sizes for all the applications. All the experiments were performed on a quad-processor 200 MHz Pentium Pro system equipped with 512 MB of main memory, running the Windows 2000 operating system. We compare the performance obtained by the applications built using our environment (the NanosCompiler, the runtime system and the resource manager) against the performance obtained when we built them with the PGI Workstation compiler version 3.2 [11]. The PGI Workstation is a commercial compiler product which supports the OpenMP standard; it takes as input either C or Fortran code parallelized with OpenMP directives and produces the final executable. In order to produce the final executable from the intermediate code that the NanosCompiler front-end outputs, we linked it with the NTLib library using the Intel Fortran compiler version 4.5 [6].

The NAS Parallel Benchmarks suite consists of seven applications, each of them solving a different problem. In our experiments we ran all of them except the LU application, which the PGI compiler could not build. We performed two kinds of experiments: in the first set, we executed a single instance of an application on a dedicated system; in the second set, we ran each application in a multiprogrammed environment, which we simulated by running multiple instances of the same application. For each application, we built two versions, one with the PGI compiler and one with our implementation, and we compared their performance in the two experiments.

In our first experiment, we measured the execution time for each version of the applications on a dedicated system with 1, 2, and 4 processors. We executed each application 5 times; the average execution times are presented in Table 1, while the corresponding speedups of the applications for each case are depicted in Fig. 2.

Table 1. The average execution times of the applications on a dedicated execution environment.

Fig. 2. The speedup that the applications achieve on a dedicated execution environment.

As shown in Table 1, both versions of all the applications scale similarly when they are executed with up to two processors. On the four-processor execution, three of the six applications built with the PGI compiler fail to scale as well as the Nanos versions. More specifically, for the execution of BT, MG and SP on four processors there is a slowdown over the sequential execution, and the variation between the execution times is quite high. We believe that this is due to incorrect stack alignment of the data in the code that the PGI compiler produces, or to a problem in the runtime library that the PGI compiler uses. The remaining three applications, CG, EP and FT, do not seem to be affected by the same problem and scale well on any number of processors. On the contrary, the Nanos versions of all the applications achieve good speedups in all cases. Even though the Nanos versions achieve better execution times in all cases, their speedups are comparable to those of the PGI versions. The conclusion from this experiment is that our implementation achieves speedups comparable to, or even better than, those of a commercial OpenMP compiler in a dedicated execution environment. Additionally, our implementation is more reliable and more stable than the PGI compiler, achieving robust performance with every application we tested.

In our second experiment, we compare the performance of our implementation in a multiprogrammed environment. In order to simulate a multiprogrammed environment on our system, we ran several workloads, each one consisting of multiple instances of the same application. In each workload, all the instances were started simultaneously. Each instance of the application, assuming that it was running on a dedicated machine, requested four processors from the system. The number of instances that we ran in each workload determines the degree of multiprogramming that we want to achieve. We ran workloads with multiprogramming degrees of 1, 2, 4, and 8 for each application. For our implementation, we executed the workloads while the resource manager was running in the system in order to schedule the running applications on the available physical processors. The resource manager ran with a time-sharing version of the DSS scheduling policy [10] and a scheduling quantum of 120 msec, which matches the quantum used by the operating system scheduler for the kernel-level threads. For the PGI version, we executed the workloads without the resource manager; in this case the operating system scheduler was responsible for the scheduling of the applications on the available processors. We ran each workload 5 times and measured the average turnaround time of all the instances in each workload. The measured times for each case are presented in Table 2, and in Fig. 3 they are depicted using a logarithmic scale for the time axis.

Table 2. The average turnaround times of the applications in each workload on a multiprogrammed execution environment.

Fig. 3. The average turnaround times of the applications in each workload on a multiprogrammed execution environment.

As shown in Table 2 and Fig. 3, the workloads with the Nanos versions of the applications achieve much lower turnaround times. More specifically, when we use the resource manager to coordinate the execution of the multiprogrammed workloads, the turnaround time scales linearly with the degree of multiprogramming. This means that a workload with multiprogramming degree 2 requires approximately twice the time of a uniprogrammed workload. In the PGI case, the workloads fail to scale well, because the operating system scheduler does not schedule them efficiently, since it has no knowledge about the requirements of their applications. The workloads of the three PGI-compiled applications that did not scale well in the dedicated execution now achieve superlinear scaling as the multiprogramming degree increases. This happens because the extra instances fill in the time intervals during which the processors would otherwise be left idle, since each instance does not manage to fully utilize all of its processors. However, in every case our implementation achieves better turnaround times for the applications participating in the workloads; the improvement ranges from 6% to 63% less time than the PGI case. Additionally, in the Nanos case the deviation of the turnaround times between the instances in each workload, as well as the total execution time of all the workloads, is also smaller than in the PGI case. This second experiment shows the performance advantages and the stability of our implementation when it is used in a heavily multiprogrammed environment, where several applications compete for the system resources.

6. Related work and conclusions

As far as we know, there is only one related work on this subject, and none for the Windows operating system. In [1], the authors compare the performance of a runtime system and a modified Linux kernel with that of OmniMP, a research OpenMP compiler. They achieve results similar to ours, since the OmniMP compiler is totally unaware of multiprogramming. Our work differs in that we use an independent resource manager, implemented as a loadable kernel module, to play the role of the operating system scheduler, instead of modifying the kernel. Moreover, our approach is more flexible, without losing the performance advantages, and more portable, since it can be easily installed on new systems by simply loading our module. As we have stated, several OpenMP implementations do not efficiently support application execution in a multiprogrammed environment.
In these implementations, the OpenMP feature for dynamic parallelism has been ignored or not efficiently implemented. Our results confirm this statement, since our implementation performs better than the commercial PGI implementation in a multiprogrammed environment. Furthermore, our experiments show that our implementation achieves better performance even when it is used on a dedicated system. The immaturity of current OpenMP implementations is also shown by the fact that three of the NAS benchmarks fail to scale with more than two processors when compiled with the PGI compiler and run on a dedicated system. The PGI Workstation compiler that we use in this work does not support the dynamic parallelism feature of the OpenMP standard; however, neither was the OpenMP version of the NAS Parallel Benchmarks that we use written to exploit this feature, so our results would not have changed if we had used an application or a compiler that supports dynamic parallelism. Our approach is completely transparent to the programmer of the original OpenMP-coded application and does not depend on the OpenMP standard's dynamic parallelism feature. In the future, we plan to use this feature in our implementation as a way to give the resource manager a hint toward more efficient processor scheduling.
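For reference, the dynamic parallelism feature discussed above is the standard's dynamic adjustment of the number of threads in a team; a minimal C illustration of how an application would enable it is shown below (this is standard OpenMP usage, not code from our implementation):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Allow the implementation to adjust the team size dynamically;
           a parallel region may then receive fewer threads than requested,
           depending on the current system load. */
        omp_set_dynamic(1);
        omp_set_num_threads(4);   /* upper bound requested by the program */

        #pragma omp parallel
        {
            #pragma omp master
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }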

Acknowledgements

We would like to thank our colleagues Christos Antonopoulos and Ioannis Venetis for their valuable help in the compilation of the NAS Parallel Benchmarks. This work was partially supported by G.S.R.T. research program 99E.

References

[1] C. Antonopoulos, I. Venetis, D. Nikolopoulos and T. Papatheodorou, Efficient Dynamic Parallelism with OpenMP on Linux SMPs, in Proc. of the 2000 International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, June 2000.
[2] E. Ayguadé et al., NanosCompiler: A Research Platform for OpenMP Extensions, in Proc. of the First European Workshop on OpenMP, Lund, Sweden, September.
[3] V. Barekas, P. Hadjidoukas, E. Polychronopoulos and T. Papatheodorou, Nanothreads vs. Fibers for the Support of Fine Grain Parallelism in Windows NT/2000 Platforms, in Proc. of the Third International Symposium on High Performance Computing, Tokyo, Japan, October.
[4] P. Hadjidoukas, V. Barekas, E. Polychronopoulos and T. Papatheodorou, A Portable Kernel-Mode Resource Manager on Windows 2000 Platforms, in Proc. of the Twelfth IASTED International Conference on Parallel and Distributed Computing and Systems, Las Vegas, Nevada, November.
[5] H. Jin, M. Frumkin and J. Yan, The OpenMP Implementation of NAS Parallel Benchmarks and its Performance, NAS Technical Report, NASA Ames Research Center, October.
[6] Intel Corporation, Intel Fortran Compiler.
[7] X. Martorell, J. Labarta, N. Navarro and E. Ayguadé, A Library Implementation of the Nano-Threads Programming Model, in Proc. of the 2nd Euro-Par Conference, Lyon, August.
[8] X. Martorell et al., Thread Fork/Join Techniques for Multilevel Parallelism Exploitation in NUMA Multiprocessors, in Proc. of the 13th ACM International Conference on Supercomputing, Rhodes, June.
[9] OpenMP Architecture Review Board, OpenMP Specifications.
[10] E. Polychronopoulos et al., An Efficient Kernel-Level Scheduling Methodology for Multiprogrammed Shared Memory Multiprocessors, in Proc. of the 12th International Conference on Parallel and Distributed Computing Systems, Fort Lauderdale, Florida, August.
[11] Portland Group Inc., Web site.


More information

Standard promoted by main manufacturers Fortran

Standard promoted by main manufacturers  Fortran OpenMP Introducción Directivas Regiones paralelas Worksharing sincronizaciones Visibilidad datos Implementación OpenMP: introduction Standard promoted by main manufacturers http://www.openmp.org Fortran

More information

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering Operating System Chapter 4. Threads Lynn Choi School of Electrical Engineering Process Characteristics Resource ownership Includes a virtual address space (process image) Ownership of resources including

More information

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat

OpenMP 4.0 implementation in GCC. Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Jakub Jelínek Consulting Engineer, Platform Tools Engineering, Red Hat OpenMP 4.0 implementation in GCC Work started in April 2013, C/C++ support with host fallback only

More information

Agenda. Threads. Single and Multi-threaded Processes. What is Thread. CSCI 444/544 Operating Systems Fall 2008

Agenda. Threads. Single and Multi-threaded Processes. What is Thread. CSCI 444/544 Operating Systems Fall 2008 Agenda Threads CSCI 444/544 Operating Systems Fall 2008 Thread concept Thread vs process Thread implementation - user-level - kernel-level - hybrid Inter-process (inter-thread) communication What is Thread

More information

An Efficient Kernel-Level Scheduling Methodology for Multiprogrammed Shared Memory Multiprocessors

An Efficient Kernel-Level Scheduling Methodology for Multiprogrammed Shared Memory Multiprocessors An Efficient Kernel-Level Scheduling Methodology for Multiprogrammed Shared Memory Multiprocessors Polychronopoulos, E. D., Papatheodorou, T. S., Nikolopoulos, D., Martorell, X., Labarta, J., & Navarro,

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

Operating Systems Overview. Chapter 2

Operating Systems Overview. Chapter 2 Operating Systems Overview Chapter 2 Operating System A program that controls the execution of application programs An interface between the user and hardware Masks the details of the hardware Layers and

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

Course Syllabus. Operating Systems

Course Syllabus. Operating Systems Course Syllabus. Introduction - History; Views; Concepts; Structure 2. Process Management - Processes; State + Resources; Threads; Unix implementation of Processes 3. Scheduling Paradigms; Unix; Modeling

More information

Capriccio: Scalable Threads for Internet Services (by Behren, Condit, Zhou, Necula, Brewer) Presented by Alex Sherman and Sarita Bafna

Capriccio: Scalable Threads for Internet Services (by Behren, Condit, Zhou, Necula, Brewer) Presented by Alex Sherman and Sarita Bafna Capriccio: Scalable Threads for Internet Services (by Behren, Condit, Zhou, Necula, Brewer) Presented by Alex Sherman and Sarita Bafna Main Contribution Capriccio implements a scalable userlevel thread

More information

Chapter 4: Threads. Operating System Concepts 9 th Edition

Chapter 4: Threads. Operating System Concepts 9 th Edition Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems COP 4610: Introduction to Operating Systems (Spring 2015) Chapter 13: I/O Systems Zhi Wang Florida State University Content I/O hardware Application I/O interface Kernel I/O subsystem I/O performance Objectives

More information

Table of Contents 1. OPERATING SYSTEM OVERVIEW OPERATING SYSTEM TYPES OPERATING SYSTEM SERVICES Definition...

Table of Contents 1. OPERATING SYSTEM OVERVIEW OPERATING SYSTEM TYPES OPERATING SYSTEM SERVICES Definition... Table of Contents 1. OPERATING SYSTEM OVERVIEW... 1 Definition... 1 Memory Management... 2 Processor Management... 2 Device Management... 2 File Management... 2 Other Important Activities... 3. OPERATING

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 4, April 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Reducing the Number

More information

Xinu on the Transputer

Xinu on the Transputer Purdue University Purdue e-pubs Department of Computer Science Technical Reports Department of Computer Science 1990 Xinu on the Transputer Douglas E. Comer Purdue University, comer@cs.purdue.edu Victor

More information

Informing Algorithms for Efficient Scheduling of Synchronizing Threads on Multiprogrammed SMPs

Informing Algorithms for Efficient Scheduling of Synchronizing Threads on Multiprogrammed SMPs Informing Algorithms for Efficient Scheduling of Synchronizing Threads on Multiprogrammed SMPs Antonopoulos, C. D., Nikolopoulos, D., & Papatheodorou, T. S. (2001). Informing Algorithms for Efficient Scheduling

More information

Towards a Portable Cluster Computing Environment Supporting Single System Image

Towards a Portable Cluster Computing Environment Supporting Single System Image Towards a Portable Cluster Computing Environment Supporting Single System Image Tatsuya Asazu y Bernady O. Apduhan z Itsujiro Arita z Department of Artificial Intelligence Kyushu Institute of Technology

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

A Source-to-Source OpenMP Compiler

A Source-to-Source OpenMP Compiler A Source-to-Source OpenMP Compiler Mario Soukup and Tarek S. Abdelrahman The Edward S. Rogers Sr. Department of Electrical and Computer Engineering University of Toronto Toronto, Ontario, Canada M5S 3G4

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Detection and Analysis of Iterative Behavior in Parallel Applications

Detection and Analysis of Iterative Behavior in Parallel Applications Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University

More information

Threads. CS3026 Operating Systems Lecture 06

Threads. CS3026 Operating Systems Lecture 06 Threads CS3026 Operating Systems Lecture 06 Multithreading Multithreading is the ability of an operating system to support multiple threads of execution within a single process Processes have at least

More information

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition Chapter 6: CPU Scheduling Silberschatz, Galvin and Gagne 2013 Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Real-Time

More information

OPERATING SYSTEMS UNIT - 1

OPERATING SYSTEMS UNIT - 1 OPERATING SYSTEMS UNIT - 1 Syllabus UNIT I FUNDAMENTALS Introduction: Mainframe systems Desktop Systems Multiprocessor Systems Distributed Systems Clustered Systems Real Time Systems Handheld Systems -

More information

Lecture 17: Threads and Scheduling. Thursday, 05 Nov 2009

Lecture 17: Threads and Scheduling. Thursday, 05 Nov 2009 CS211: Programming and Operating Systems Lecture 17: Threads and Scheduling Thursday, 05 Nov 2009 CS211 Lecture 17: Threads and Scheduling 1/22 Today 1 Introduction to threads Advantages of threads 2 User

More information

Windows 7 Overview. Windows 7. Objectives. The History of Windows. CS140M Fall Lake 1

Windows 7 Overview. Windows 7. Objectives. The History of Windows. CS140M Fall Lake 1 Windows 7 Overview Windows 7 Overview By Al Lake History Design Principles System Components Environmental Subsystems File system Networking Programmer Interface Lake 2 Objectives To explore the principles

More information

COMP Parallel Computing. SMM (2) OpenMP Programming Model

COMP Parallel Computing. SMM (2) OpenMP Programming Model COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel

More information

A Dynamic Optimization Framework for OpenMP

A Dynamic Optimization Framework for OpenMP A Dynamic Optimization Framework for OpenMP Besar Wicaksono, Ramachandra C. Nanjegowda, and Barbara Chapman University of Houston, Computer Science Dept, Houston, Texas http://www2.cs.uh.edu/~hpctools

More information

Storage and Sequence Association

Storage and Sequence Association 2 Section Storage and Sequence Association 1 1 HPF allows the mapping of variables across multiple processors in order to improve parallel 1 performance. FORTRAN and Fortran 0 both specify relationships

More information

Threads. Computer Systems. 5/12/2009 cse threads Perkins, DW Johnson and University of Washington 1

Threads. Computer Systems.   5/12/2009 cse threads Perkins, DW Johnson and University of Washington 1 Threads CSE 410, Spring 2009 Computer Systems http://www.cs.washington.edu/410 5/12/2009 cse410-20-threads 2006-09 Perkins, DW Johnson and University of Washington 1 Reading and References Reading» Read

More information

What s in a traditional process? Concurrency/Parallelism. What s needed? CSE 451: Operating Systems Autumn 2012

What s in a traditional process? Concurrency/Parallelism. What s needed? CSE 451: Operating Systems Autumn 2012 What s in a traditional process? CSE 451: Operating Systems Autumn 2012 Ed Lazowska lazowska @cs.washi ngton.edu Allen Center 570 A process consists of (at least): An, containing the code (instructions)

More information

Introduction to OpenMP. Lecture 2: OpenMP fundamentals

Introduction to OpenMP. Lecture 2: OpenMP fundamentals Introduction to OpenMP Lecture 2: OpenMP fundamentals Overview 2 Basic Concepts in OpenMP History of OpenMP Compiling and running OpenMP programs What is OpenMP? 3 OpenMP is an API designed for programming

More information

Parallel Processing Top manufacturer of multiprocessing video & imaging solutions.

Parallel Processing Top manufacturer of multiprocessing video & imaging solutions. 1 of 10 3/3/2005 10:51 AM Linux Magazine March 2004 C++ Parallel Increase application performance without changing your source code. Parallel Processing Top manufacturer of multiprocessing video & imaging

More information

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed

More information

Operating System - Overview

Operating System - Overview Unit 37. Operating System Operating System - Overview An Operating System (OS) is an interface between a computer user and computer hardware. An operating system is a software which performs all the basic

More information

Operating Systems. studykorner.org

Operating Systems. studykorner.org Operating Systems Outlines What are Operating Systems? All components Description, Types of Operating Systems Multi programming systems, Time sharing systems, Parallel systems, Real Time systems, Distributed

More information

CPU Scheduling. Basic Concepts. Histogram of CPU-burst Times. Dispatcher. CPU Scheduler. Alternating Sequence of CPU and I/O Bursts

CPU Scheduling. Basic Concepts. Histogram of CPU-burst Times. Dispatcher. CPU Scheduler. Alternating Sequence of CPU and I/O Bursts CS307 Basic Concepts Maximize CPU utilization obtained with multiprogramming CPU Scheduling CPU I/O Burst Cycle Process execution consists of a cycle of CPU execution and I/O wait CPU burst distribution

More information