Adaptively Mapping Parallelism Based on System Workload Using Machine Learning


Adaptively Mapping Parallelism Based on System Workload Using Machine Learning

Dominik Grewe

Master of Science
Computer Science
School of Informatics
University of Edinburgh
2009

Abstract

Parallel computing has become pervasive and the number of processors placed in computers will further increase in the future. However, software developers are struggling to efficiently exploit the computational resources provided by parallel architectures. It is thus essential to investigate the behaviour of parallel programs and to develop methods that help improve their performance. The software developer should not have to worry about how to map parallelism to the underlying architecture, but should instead concentrate on exposing the parallelism and leave the mapping task to the runtime system.

In this project, the behaviour of parallel programs in the presence of workload is investigated. It is shown that choosing the right number of threads for an application is crucial to achieving the best performance possible when there is other workload running on the system. The default policy of creating as many threads as there are cores is rarely optimal in this situation, and using the optimal number of threads reduces the runtime by 22.5% on average w.r.t. the default policy. Determining the optimal number of threads is not a straightforward task, because it depends not only on the current workload but also on the program itself. For some programs, reducing the number of threads w.r.t. the default yields the optimal solution, whereas for other programs the best performance is achieved using more threads than there are cores.

In order to tackle this problem, a novel technique for choosing the number of threads is presented in this work. Using Machine Learning techniques, a model is created that predicts the optimal number of threads based on the current system workload and the program. Different approaches for modelling this problem and several sets of features are evaluated. With the best model, 92% of the optimal performance is achieved, which corresponds to a runtime reduction of almost 16% over the default policy.

Acknowledgements

At first, I would like to thank my supervisor Professor Michael O'Boyle for his excellent advice and for supporting me throughout the whole project. The discussions we had were of great avail to me and helped me focus on the most vital parts of my work. Furthermore, his reviews of my progress enabled me to present my work in a more structured form. I would also like to thank Zheng Wang for his patience in answering all my questions and for providing me with help on the technical issues that arose during my work on this project. Last but certainly not least, I would like to thank Hugh Leather for the time he spent setting up his great tool libplugin for me.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Dominik Grewe)

Contents

1 Introduction
    1.1 Parallel Computing
        1.1.1 Parallel Architectures
        1.1.2 Parallelism Models
        1.1.3 Parallel Programming
        1.1.4 Overheads in Parallel Computing
        1.1.5 Workload on Parallel Computers
    1.2 Compilation
        1.2.1 The Problem with Static Compilation
        1.2.2 Beyond Static Compilation
        1.2.3 Optimising Parallel Programs
    1.3 Machine Learning in Compilation
        1.3.1 Predictive Modelling
        1.3.2 Using Machine Learning to Optimise Parallel Programs
    1.4 Contributions
    1.5 Organisation of the Dissertation

2 Related Work
    2.1 Using Machine Learning for Program Optimisation
        2.1.1 Learning to Schedule
        2.1.2 Learning to Unroll
        2.1.3 Focusing the Search in Iterative Optimisation
        2.1.4 Automatic Feature Generation
    2.2 Parallel Programs and Machine Learning
        2.2.1 Mapping Parallelism to Multi-Cores
        2.2.2 Scalability Prediction Based on Regression
    2.3 Boosting Program Performance in the Presence of Workload
    2.4 Summary

3 Methodology
    3.1 Experimental Setup
        3.1.1 System Specification
        3.1.2 Benchmark Applications
        3.1.3 Collecting Benchmark Execution Times
    3.2 Machine Learning Models
        3.2.1 Support Vector Machines
        3.2.2 Multi-Layer Perceptrons
    3.3 How to Evaluate Predictions
    3.4 Summary

4 Program Scalability
    4.1 Scaling of Parallel Programs
        4.1.1 Scaling on Idle Machines
        4.1.2 Scaling on Loaded Machines
    4.2 Performance of the Default Policy
    4.3 Summary

5 Predicting the Optimal Number of Threads
    5.1 Feature Selection
        5.1.1 Program Features
        5.1.2 Workload Features
    5.2 Modelling the Problem
    5.3 Creating the Model
    5.4 Using the Model for Prediction
    5.5 Summary

6 Results and Evaluation
    6.1 Prediction Using Different Feature Sets
        6.1.1 Prediction Using Dynamic Features
        6.1.2 Prediction Using Static Features
        6.1.3 Prediction Using Both Feature Sets
        6.1.4 Comparison of the Feature Sets
    6.2 Splitting the Workload
    6.3 Performance of the Machine Learning Models
        6.3.1 Training the Model
        6.3.2 Using the Model for Prediction
    6.4 Summary

7 Conclusion
    7.1 Extensions
    7.2 Summary

A Scaling on Idle Machines
B Scaling on Different utdsp.mult-workloads
C Scaling on Various Workloads

Bibliography

Chapter 1

Introduction

Parallel computing has become ubiquitous in the recent past. Because processor speeds increase only slowly, multiple processors are being placed in a single computer to improve its overall performance. However, this trend forces software developers to rethink their way of programming by detaching themselves from sequential programming and starting to develop parallel software. To support the programmer in this task, it is essential to investigate the behaviour of parallel programs in order to develop techniques that help exploit the resources available in parallel machines.

The first part of this chapter gives an overview of current parallel architectures, different models of how to exploit parallelism and techniques for writing parallel programs. This is followed by a brief discussion of the overheads in parallel computing and the behaviour of workload on parallel computers. In the subsequent section, static compilation and its shortcomings are introduced. This motivates some approaches to overcome these problems, e.g. Iterative Compilation. Furthermore, some ideas about optimising parallel programs are shown. The third section of this chapter introduces the idea behind using Machine Learning to optimise programs and how it can be applied to parallel programs. This is followed by a description of the contributions of this project and an overview of the structure of this thesis.

1.1 Parallel Computing

1.1.1 Parallel Architectures

In order to exploit the increasing availability of transistors in a computer, processors have become highly complex structures that can process data at a very fast pace despite severe problems such as the growing gap between processor and memory speeds [25]. However, hardware designers are struggling to gain even more speed from a single processor while remaining energy efficient. Therefore, the new trend is to have multiple (less complex) processing cores in a single machine to improve the computer's overall performance. Due to this development parallel computing has become pervasive and it is seen as the most promising way to exploit the ever-growing availability of computing resources [26]. Whereas in the past, clusters or other forms of distributed architectures [16] were the predominant form of parallel machines, multi-processors are now commonplace and will be even more so in the future. Most computers shipped today have at least two processors and this number will increase, as current research chips contain up to 80 cores [24].

Multi-Processors

There are two kinds of multi-processors: On the one hand, several individual chips can be placed on a motherboard. Although all processors share the computer's resources, e.g. the main memory, they do not share any of their internal resources, such as caches. On the other hand, multiple processors can be placed on a single chip (chip multi-processors or CMPs). In this case, the processors share some of their resources, such as low-level caches, but each processor has its own computing resources.

Having shared caches can bring both advantages and disadvantages. The advantage of sharing caches is that programs running on the same processor but on different cores have the chance of reusing each other's memory, e.g. when the same program is executed twice, the code sections can be shared and don't have to be loaded twice. Furthermore, if some of the CPUs are idle, the programs running on the remaining cores can take advantage of the whole cache, instead of just having a fraction available. This is not possible with private caches. However, in shared caches, programs may also negatively interfere with each other: If one program loads data into the cache, other data that is possibly still in use by another program gets evicted. That data must then be re-loaded from memory, which takes considerably more time than loading it from the cache. This could have been avoided

by having private caches.

Multi-Threading Processors

Whereas programs executing on multi-processors only share lower-level caches, on multi-threading processors they share almost all of the computing resources, from functional units to high-level caches. The idea behind multi-threading is that the processor stores the architectural state, e.g. register contents, of multiple execution threads. It can then issue instructions from any thread, trying to utilise the time other threads spend waiting for long-latency operations, e.g. due to cache misses. Simultaneous multi-threading (SMT) [47, 34] is the most advanced form of multi-threading. Whereas in blocked multi-threading, for example, a hardware context switch that takes several cycles to complete is needed to issue instructions from a different thread, an SMT processor can execute instructions from multiple instruction streams in the same cycle. An SMT processor's resources are better utilised because less time is spent waiting when a thread is blocked. This not only leads to better performance, but also increases the processor's energy efficiency [34].

To the operating system, the different contexts of an SMT processor appear like different processing cores to which it can assign threads. It is the hardware's task to decide which of these threads' instructions are actually issued for execution. This decision depends on which thread has instructions that are ready for execution and on the availability of computing resources. In general, the hardware scheduler tries to treat all threads equally by giving them a fair share of processing resources.

1.1.2 Parallelism Models

There are two different ways to parallelise a program: With data parallelism, each thread or program computes the same set of instructions but works on different data, e.g. in loops. In task parallelism, the threads or programs perform different tasks on either the same or different data. Furthermore, there are two fundamentally different models of how to exploit parallel architectures: In shared memory programming or Thread-Level Parallelism (TLP), several threads of the same program are executed concurrently and they can communicate with each other using the shared memory of the program's virtual memory space. In the message passing model, multiple (individual) programs are executed concurrently. Because the programs' memories are separate, they have to communicate by sending messages to each other via the operating system.

Shared Memory

The shared memory model maps to the multi-processor architecture, because the threads of a parallel program can execute concurrently on the different processors and share data using the program's virtual memory. However, to share data between threads, they have to synchronise first to make sure that only one thread ever operates on the shared data at a time. These critical sections can be implemented using locks or semaphores, but it requires the programmer's attention to make sure that no deadlocks or race conditions can occur [22]. There also exist other techniques that try to simplify the programmer's task, e.g. Software Transactional Memory [41].

Message Passing

In message passing, programs don't share any memory. They have to send messages to each other via operating system calls in order to exchange data. Due to the non-shared-memory nature of the message passing model, it maps to distributed parallel architectures, such as clusters. Each program resides on a node of the cluster and data is exchanged by sending messages over the network. However, programs using message passing can also be executed on a single machine. Synchronisation between programs is implicit in sending and receiving messages, thus requiring no further action from the programmer. However, the programmer must make sure that no deadlocks can occur, which can happen when two programs wait for messages from each other, for example.

1.1.3 Parallel Programming

There are multiple ways of creating parallel programs. Some programming languages, such as Java, have built-in mechanisms for shared memory programming by providing thread creation and synchronisation methods. For other programming languages, such as C/C++, there exist libraries that provide this functionality, e.g. pthreads [38] in the case of C/C++. Whereas these models provide very basic methods for writing parallel programs, there also exist more advanced frameworks, e.g. OpenMP [17], that simplify the task of writing parallel code. OpenMP allows the programmer to create parallel sections without having to worry about thread creation or synchronisation. For-loops with independent iterations, for example, can be parallelised by adding an OpenMP directive in front of the loop code. During compilation this directive is replaced with code that creates multiple threads, divides the iterations of the loop and assigns a share of them to each thread. At the end of the loop, code for synchronising the threads is included, if not otherwise specified by the programmer. The default number of threads created for a parallel section is the number of cores in the computer. However, the programmer can change this number by calling a library function, setting an environment variable or adding a clause to the OpenMP directive, as the sketch below illustrates.
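For illustration, here is a minimal sketch of such an OpenMP loop in C, showing the three ways of setting the thread count just mentioned; the arrays and the loop body are made up for the example:

```c
#include <omp.h>

#define N 1000000
static double a[N], b[N];

int main(void) {
    /* 1: library call overriding the default thread count */
    omp_set_num_threads(4);

    /* 2: alternatively, set an environment variable before running:
          OMP_NUM_THREADS=4 ./a.out                                  */

    /* 3: a clause on the directive itself, which takes precedence here.
       Each thread is assigned a share of the independent iterations and
       an implicit barrier synchronises the threads at the end of the loop. */
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    return 0;
}
```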

There also exist APIs that allow the programmer to write programs using the message passing model, e.g. MPI [39]. MPI provides functions for sending and receiving messages in various ways, such as uni- or multi-casts and blocking or non-blocking operations. It supports the Single-Program-Multiple-Data programming model, where multiple instances of the same program are spawned and the behaviour of a particular instance usually depends on the identifier that is assigned to this instance of the program.

1.1.4 Overheads in Parallel Computing

By sharing the work between processors, the execution time of programs can be significantly reduced. However, there is also some overhead associated with parallel programming that must be taken into account, because it can significantly reduce the speedup one may expect from parallelising a sequential program. In shared memory programming, thread creation, synchronisation and load imbalance are responsible for the most obvious overheads of parallel programs [22]. When new threads are to be created, the operating system has to be invoked to register the threads and reserve memory for each thread's stack. If threads have to co-operate, e.g. to exchange data, they have to synchronise, which causes inter-process communication. Furthermore, if the overall work is not equally distributed over the threads, some processors might be idle although there is still some work to be done.

But there are also other, less apparent overheads that can reduce the expected speedup of parallel programs. In multi-core processors, for example, two threads of the same program that are executed on different CPUs may work on distinct parts of the same cache line. Because the granularity of cache coherence protocols is a cache line, a write by one thread will lead to an invalidation of the cache line in the other thread's processor. This phenomenon is called false sharing [22] and can be avoided by restructuring the program's memory layout.

In the message passing model, a significant cause of overhead is the sending and receiving of messages. Especially when the program is executed in a network of computers, the cost of sending data from one node to the other can be considerable. It is thus important to only send the smallest amount of data necessary and to keep the

distance that data is sent to a minimum. For small problems that don't require much computation, the overhead caused by parallel programming may not be worth the speedup gained by splitting the work between processors. Furthermore, programs that require much communication between threads spend a considerable amount of time waiting for synchronisation. In these cases, it may not be worth parallelising, because a sequential version of the program is possibly faster.

1.1.5 Workload on Parallel Computers

As already mentioned previously, when several programs or several threads of a program are executed on a multi-core computer, there can be both negative and positive interference. The former can be due to programs causing each other's variables to get evicted from caches. The latter can appear when threads of the same program share parts of their memory, e.g. the program code. Hence, a central role in how the workload of a parallel computer performs is played by the cache configurations of the processors and the memory access behaviour of the programs [28].

It is the operating system's responsibility to schedule program threads to the available CPUs. In order to exploit cache locality, it is useful to try to schedule threads to the CPU they have been running on before: A thread running on processor A populates processor A's cache with its data. If the thread is migrated to another processor B, the data has to be invalidated in processor A's cache and processor B's cache must be repopulated. Hence, the operating system attempts not to migrate threads when possible. This is called processor affinity [42].

On simultaneous multi-threading processors, the operating system has less control over which threads are executed. All it can do is schedule (multiple) threads to an SMT processor; it is then the hardware's responsibility to schedule these threads. A major problem with this setup is that it is no longer possible for the operating system to favour high-priority processes by giving them more CPU time while exploiting the opportunities presented by SMT processors [43, 13]. Although it is possible to only schedule the high-priority process to a processor, this will decrease the overall throughput, because computational resources are wasted.
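Affinity can also be requested explicitly by a program. The following minimal, Linux-specific sketch (my illustration, not part of the thesis) pins the calling process to CPU 0 using the sched_setaffinity system call:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);            /* allow only CPU 0 in the mask */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d is now pinned to CPU 0\n", (int)getpid());
    return 0;
}
```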

1.2 Compilation

1.2.1 The Problem with Static Compilation

Almost every piece of software today is written in a high-level language such as C/C++ or Java. To make these programs run on a computer, whose hardware can only process simple assembler instructions, they have to be mapped from the programming language into the machine language. This is performed by a compiler, which parses the source code and transforms it into target code. Doing this translation correctly is straightforward and has been known for decades. However, for each program there are infinitely many translations that are semantically correct, and finding the best-performing one is the actual challenge.

Modern computers are based on complex architectures whose behaviour is hard to predict. Today's processors have several levels of caches and execute multiple instructions at a time, which may be dynamically reordered to better exploit the existing computational resources. These and other optimisations are the reason that modern computers can execute programs at a high speed despite seemingly severe problems such as the growing gap between processor and memory performance. However, these complex hardware structures are also the reason that writing optimising compilers for these architectures becomes more difficult. The effects of optimisations are difficult to predict, and finding good parameters for program transformations, e.g. loop unrolling, is a tedious task that requires the work of a compiler writer who is familiar with the architecture. Optimisations that work well on one architecture may slow down the program on a different one. Furthermore, interactions between different program transformations are complex and applying one optimisation may disable the use of another one. These intricate dependencies are hard to handle and again require an expert to achieve good results. Often simplified machine models are used to estimate the effects of optimisations, but due to the actual complexity of the real hardware these models are too simplistic to make accurate predictions.

When a sequence of optimisations is finally found, it is applied to all programs that are being compiled. But it has been shown that different programs often need different optimisations to achieve the best performance [15, 30, 20]. Hence, using a fixed optimisation sequence with fixed parameters for transformations can only provide average performance, leaving a significant potential for optimisation unused.
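To make the loop unrolling example concrete, the sketch below shows a reduction loop unrolled by a hand-picked factor of four; the function and its names are hypothetical. Whether four is a good choice depends on the target architecture, which is precisely what makes the parameter hard to fix statically:

```c
/* Original loop: one add, one compare and one branch per element. */
double sum_simple(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by a factor of 4: fewer branches and more scope for
   instruction-level parallelism, at the cost of larger code size. */
double sum_unrolled(const double *a, int n) {
    double sum = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)      /* clean-up loop for the remaining elements */
        sum += a[i];
    return sum;
}
```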

1.2.2 Beyond Static Compilation

Ideally, to overcome the previously mentioned problems of hand-crafted heuristics, a compiler should find the right optimisations for a certain program automatically, instead of requiring many hours of effort from an expert. The compiler should know on its own how to best transform a given program in order to achieve good performance. Additionally, whenever the architecture changes, the optimisations have to be adjusted to it, because transformations may have different effects on the new architecture. Hence, for every architecture the expensive process of finding the right optimisation sequence has to be repeated. It would thus be useful to have a compiler that automatically adapts to a new architecture.

The issues described here were tackled by Iterative Compilation [30]. In Iterative Compilation, every program is compiled many times, each time with a different configuration, to find the best optimisations with respect to some metric such as execution time, code size or energy efficiency. This way, specific transformations are found for each program and each architecture. The major drawback of Iterative Compilation, however, is its long compilation time. The search spaces of optimisation parameters are huge and theoretically every possible configuration has to be tested to find the optimal values. To overcome this problem, several techniques have been proposed to speed up Iterative Compilation, including the use of offline search [46] and static models [30] to prune the search space. However, it is still a time-consuming process.

Even though Iterative Compilation seems to overcome some of the problems of static compilation, its use in general-purpose compilers is significantly limited by its long compilation time; it is thus only feasible for heavily used applications or libraries. It also doesn't exploit the opportunity of adapting a program at runtime. When a program is executed, the system's state may be different from when it was compiled, or the program input may have changed. All this can affect the performance of a program and may require different optimisations to achieve the best performance.

1.2.3 Optimising Parallel Programs

In recent years, parallel computing has become more popular. Almost any computer that is shipped today has two or maybe even four processing cores, and this trend will become even stronger with up to eight cores on a chip by next year and as many as 80 cores in current research processors [24]. Most research in optimising parallel programs focuses on reducing inter-process communication by improving data locality

[4]. It is desirable to have a processor reuse a large share of its data, because exchanging data with other processes slows down the computation. Furthermore, for multi-threaded programs the granularity of the parallelism is important. A large granularity, where threads are created only rarely and perform a large amount of work, is more desirable than a small granularity, where threads only perform small amounts of work and there is much overhead due to thread creation and synchronisation.

A popular method for exploiting multiple cores in a computer is Thread-Level Parallelism (see section 1.1.2). A big potential for exploiting Thread-Level Parallelism is the parallelisation of loops. In many applications, e.g. in digital signal processing, there are loops where each iteration is independent of the others. Hence, all iterations could theoretically be executed concurrently by creating a new thread for each iteration. However, in practice the number of iterations is much larger than the number of cores. Creating more threads than there are cores is usually not useful, because it only creates unnecessary synchronisation overhead. Having fewer threads than cores is, in most cases, a waste of computational resources, because some cores will be idle. Only when the synchronisation overhead is higher than the benefit from parallelisation is a sequential version faster than a parallel one. For these reasons, standard frameworks such as OpenMP [17] create as many threads as there are cores in their default configuration.

However, these results are obtained under the assumption that the program can use all of the computer's resources. But what if there are other programs running that also compete for processing time? Is the default configuration still the best? In the early years of parallel programming, parallel computers were only used for dedicated scientific programs and programmers could be sure that they had exclusive use of the parallel machine. Today, however, parallel computers are everywhere and a user can no longer be sure whether a program can use the vast majority of computing resources on its own or whether there are other programs competing for CPU time as well. Furthermore, more complex programs possibly consist of different parallel tasks that each may perform some parts of their work in parallel. It is then no longer obvious how many threads to use for each parallel region. In the ideal case, the programmer only has to expose the parallel sections of a program and it is the runtime system that decides how to parallelise these sections, if at all.

For reasons described before, optimising programs is a difficult task in general and it becomes even harder with parallel programs, even under the assumption that the program is executed in isolation. Taking possible system workload into account is almost never done due to its additional complexity and makes an already difficult

problem even harder. The goal of this project is to improve the performance of parallel programs under any kind of system workload, by making the program choose the right number of threads for execution using predictive modelling. Offline, a Machine Learning model will be trained to predict the right number of threads given a certain application and system workload. The model will then be used to choose the best configuration for a program when it is executed.

1.3 Machine Learning in Compilation

Traditionally, computer software is written in terms of algorithms. When there is a problem to solve, a programmer develops an algorithm that, given some input data, produces the result. Sorting a sequence of numbers, for example, is a well-studied area in computer science and there are numerous algorithms for solving this problem. Some problems, however, are based on more complex patterns. Programming a computer to recognise handwritten digits, for example, is not easy at all. Manually finding a pattern that distinguishes the different digits is almost impossible. However, now that we have the computational power to process large amounts of data, a Machine Learning algorithm can be devised that automatically finds the pattern [8]. The program learns how to interpret the data by looking at training examples, i.e. input data that is labelled with the correct result. It automatically finds a way to classify (previously unseen) data by fitting a mathematical model to the training data. This is the main advantage of Machine Learning: It is not necessary to fully understand the underlying patterns that lead to a result.

A simplified, but more descriptive example of Machine Learning is curve fitting. Figure 1.1 shows several input data points (x_i, y_i) and an approximation of the function y = f(x) that may have produced this data (assuming some noise). In this case, the only input value is x and the Machine Learning model tries to predict a value y. Given that the curve is a third-order polynomial a_3 x^3 + a_2 x^2 + a_1 x + a_0, the Machine Learning task is to find the parameters a_i that best fit the training data.

A crucial factor in Machine Learning is to find the right features, i.e. the input data to the model. The features must be related to the output to make an accurate prediction possible. All the Machine Learning algorithm can do is fit a model to the data. Hence, it is the user's responsibility to provide good data, which includes finding good features that accurately characterise the data.
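Written out, this curve-fitting task is an ordinary least-squares problem. The objective below is the standard choice and is added here for concreteness; the text itself does not spell out a loss function:

\[
\min_{a_0, a_1, a_2, a_3} \; \sum_{i=1}^{n} \Bigl( y_i - \bigl( a_3 x_i^3 + a_2 x_i^2 + a_1 x_i + a_0 \bigr) \Bigr)^2
\]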

Figure 1.1: Machine Learning as curve fitting

1.3.1 Predictive Modelling

As already mentioned previously, today's processors are very complex pieces of hardware and it is difficult to predict their behaviour. Furthermore, computer architectures change over time and each new platform has a potentially different behaviour. Hence, Machine Learning is a good way to automate the process of finding compiler optimisations for these complex architectures, because instead of studying the architecture to find good optimisations, all that is required is a run of the training phase on the new architecture.

The general idea behind predictive modelling in compilation is to build a Machine Learning model that predicts the right optimisations for a specific program and platform. Unlike in Iterative Compilation [30], multiple expensive profiling runs of the application being compiled are avoided. The compiler receives all the information it needs from the source code and maybe a single profiling run on a small data set. Based on this information it makes a prediction as to which transformations to apply. The Machine Learning model is built from training programs and their optimal transformations (see figure 1.2). These optimal configurations are found in a similar way as in Iterative Compilation, namely by searching the parameter space of optimisations. However, this process is done offline, i.e. it is only done once before the compiler is actually used. Once the model has been built, no more searching is needed. To relate unseen applications to the training data, programs are represented by a number of features. Which features are useful and which are not depends on the problem. A good

starting point, however, are features that describe the type of instructions found in the code, e.g. the number of floating point instructions or branch instructions.

Figure 1.2: Predictive Modelling

1.3.2 Using Machine Learning to Optimise Parallel Programs

Recently, Machine Learning has been used to predict the right number of threads and the OpenMP scheduling policy of a parallel program executed in isolation [51]. On an Intel Xeon and a Cell processor, 96% of the optimal performance is achieved, which, on the Cell platform, is a 37% performance improvement over the OpenMP default configuration. However, assuming no system workload, finding the right number of threads boils down to making a decision between the sequential version of a program or using as many threads as there are cores.

Barnes et al. [7] used Machine Learning techniques to predict the performance of programs on a large number of processors given data gathered during executions on a smaller number of processors. Given the abundance of available processors in today's compute clusters, this is a useful and inexpensive means to find the right number of processors to allocate to a program in order to improve the efficiency of the whole cluster.

The project described here is similar to the former one in the sense that it uses supervised learning to build a model to predict the best number of threads for a parallel program. The crucial difference, however, is that there are other (possibly parallel) programs also competing for computing resources, i.e. the applications are not executed in isolation. By varying the number of threads of parallel programs executed on some workload, the best configuration of the program for a particular workload is found. This information is used to create a Machine Learning model that can accurately predict

the best number of threads to use for an arbitrary parallel program given the current system workload.

1.4 Contributions

There are two main contributions of this project: On the one hand, experiments are carried out to investigate how a parallel program's performance changes when the number of threads is varied. It is shown that, on an idle machine, the default policy of creating as many threads as there are processors is the optimal solution. As soon as there is some other workload on the machine, however, the default policy is rarely optimal. By using the right number of threads, significant speedups can be achieved over the default policy. Whereas in some cases it is better to reduce the number of threads, for some programs the optimal performance is reached by using more threads than there are processors.

Additionally, a novel method for choosing the optimal number of threads in the presence of workload is proposed. By applying Machine Learning techniques, a model is built that predicts, for any program and any workload, the best number of threads to use. Using this technique, an average speedup of 92.14% of the optimal performance is achieved, compared to 77.44% of the optimum for the default policy. This is a speedup of 1.19 over the default policy (92.14/77.44 ≈ 1.19) or, in other words, a runtime reduction of 15.95% (1 − 1/1.19 ≈ 0.16).

Due to its additional complexity, the presence of workload is usually not considered in program optimisation. However, when parallel programs are executed on a computer, the workload has a significant influence on their performance and thus cannot be ignored when choosing the number of threads to create.

1.5 Organisation of the Dissertation

The following chapter, chapter 2, describes work that is related to this project. It starts off with an overview of Machine Learning in Compilation, including learning to predict loop unroll factors and automatically finding program features. Furthermore, the application of Machine Learning in the area of parallel programming is discussed. This is followed by some approaches to improve program performance in the presence of workload. Most research in this area focuses on Simultaneous Multi-Threading architectures and how to increase the throughput of these processors.

In chapter 3, the methodology of this project is explained. This includes the experimental setup, i.e. on which kind of computers the experiments were conducted, which benchmarks were used and what the exact method for determining execution times was. Furthermore, the Machine Learning models used in this project, Support Vector Machines and Multi-Layer Perceptrons, are introduced together with the methods for evaluating predictions.

Chapter 4 describes the data gathered during the experiments. The scalability of programs on idle and loaded machines is illustrated and the default policy for the number of threads, i.e. using as many threads as there are cores, is discussed in terms of its performance in the presence of workload.

Chapter 5 begins with a description of the features used for the Machine Learning algorithms in this project. Because both the program and the workload determine the program's behaviour, two sets of features are required to characterise these two factors. Additionally, there is a section on how the problem of predicting the number of threads can be modelled and how the model can be used to predict the optimal number of threads in a certain situation.

In the subsequent chapter, chapter 6, the results of the Machine Learning models are presented. On the one hand, the accuracy of the models is described, i.e. in how many cases the optimal number of threads is predicted correctly. On the other hand, the performance of the programs using the predictor is shown and compared to the performance using the optimal number of threads or using the default policy. This is followed by a short discussion of the performance of the Machine Learning algorithms, i.e. how long it takes to train the model and how long it takes to make a prediction.

Chapter 7 provides a conclusion to the project. The success and feasibility of the approach is discussed and possible extensions to the project are proposed.

Chapter 2

Related Work

Machine Learning has been used in compilation for several optimisation tasks. This chapter describes some of the approaches of applying Machine Learning in the area of program optimisation and parallel programming. At first, one of the earliest papers on Machine Learning in Compilation is presented: Learning to Schedule. This is followed by two different applications of predictive modelling, namely predicting unroll factors for loop unrolling and predicting good areas in the parameter space of optimisations in order to speed up Iterative Compilation. The last paper is about one of the most important problems in predictive modelling, namely finding good program features. It tries to tackle this problem by automatically generating features for the specific learning task.

In the subsequent section, the use of predictive modelling in parallel programming is presented. The first paper is about finding the optimal number of threads and scheduling policy for a parallel program. The second paper in this section is not concerned with improving a program's performance, but presents an approach of using regression to predict the runtime of programs on many processors.

Because this project is concerned with program performance in the presence of workload, proposals for improving workload performance on multi-core and/or multi-threading processors are described in the last section of this chapter. Due to the limited control an operating system has over which thread is scheduled on a simultaneous multi-threading (SMT) processor, there is a substantial amount of work about scheduling on SMT architectures. This ranges from making the scheduler more aware of which threads can beneficially co-exist on the same processing core to giving the operating system more control over the scheduling on the processor by making changes to the hardware. Furthermore, in grid computing, scheduling a program according to the

current workload is important to achieve good execution times. However, the scale of grid computing is much larger than the scale this project is concerned with, namely multi-processors and multi-cores. Hence, only a brief overview of the topic of grid computing is given.

2.1 Using Machine Learning for Program Optimisation

Machine Learning has been applied to several areas of (sequential) program optimisation [36, 44, 35, 3, 10, 11, 20]. Some of these approaches are introduced in more detail in this section.

2.1.1 Learning to Schedule

One of the first uses of Machine Learning in compilation was to find instruction schedules for basic blocks [36]. By exhaustively searching all possible schedules for a number of small basic blocks, training data is gathered to create a model that helps making decisions on which instruction to schedule next given the current partial schedule and the available instructions. The data is stored as triples (P, I_i, I_j), where P is a partial schedule, i.e. a total order of already scheduled instructions and a partial order of remaining instructions. I_i and I_j belong to the set of available instructions I, from which the next selection is to be made. If instruction I_i is preferable over I_j given the partial schedule P, the triple (P, I_i, I_j) is a positive example and (P, I_j, I_i) is a negative example.

Using the examples and counter-examples generated during the training phase, four different techniques are used to infer schedules for new programs. Among these are decision tree induction [49] and feed-forward artificial neural networks [23]. To relate different triples to each other, five features are used that describe both the partial schedule and the available instructions. These include Odd Partial (is the number of instructions odd or even?) and the instruction class. The authors achieved similar results to a manually tuned heuristic, but the advantage of their approach is that it is completely automatic and hence doesn't require an expert to hand-craft the heuristic.
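The triples can be read as a binary classification problem over preference pairs. The following formulation is a sketch of how the training data is used; the symbols h and φ are my notation, not taken from [36]:

\[
h\bigl(\varphi(P, I_i, I_j)\bigr) =
\begin{cases}
+1 & \text{if } I_i \text{ is preferable over } I_j \text{ given } P, \\
-1 & \text{otherwise,}
\end{cases}
\]

where φ maps a triple to the five features described above. Note that each positive example (P, I_i, I_j) automatically yields the mirrored negative example (P, I_j, I_i), doubling the usable training data.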

2.1.2 Learning to Unroll

Machine Learning has also been used to determine unroll factors for loop unrolling [35, 44]. Loop unrolling is an always-legal transformation, but finding the right unroll factor is crucial to exploiting the potential benefits of this optimisation. Choosing a wrong value can even lead to performance degradations. Because it is not easy to manually determine the right unroll factors, Machine Learning has proved to be a helpful technique to improve and automate this decision.

Monsifrot et al. [35] use oblique decision trees [37] to predict unroll factors. They create polygonal partitionings of the feature set to distinguish different types of loops. In order to characterise the loops, they use features extracted from the source code, including the number of arithmetic operations, the number of array accesses and the number of if-statements. Their experimental results on an UltraSPARC and an IA-64 machine show that the technique slightly improves performance compared to not using loop unrolling at all. Compared to the compilers' default heuristics for loop unrolling, the Machine Learning technique also performs marginally better on both architectures.

Stephenson et al. [44] describe the loop unrolling problem as a multi-class classification problem, where each class corresponds to a different unroll factor. Both techniques used by the authors, nearest neighbour and Support Vector Machines (SVM), outperform the default compiler heuristic for loop unrolling on the SPEC benchmarks running on an Itanium 2 processor. Whereas the nearest neighbour method only yields marginal improvements of about 1%, the more complex SVM approach improves the runtime by 5% on average.

2.1.3 Focusing the Search in Iterative Optimisation

The main problem with Iterative Compilation [30] is that it requires a large number of program evaluations. A huge part of the parameter space needs to be searched in order to find optimisations that yield acceptable performance. In 2006, Agakov et al. [3] used Machine Learning to focus the search in an Iterative Compilation approach for selecting optimisation sequences of length 5 from 14 source-level transformations. Their model mapped programs to promising regions in the search space using 36 static loop-level features, which were condensed to 5 features using Principal Component Analysis.

For the model, the authors use two different approaches. The first one, an independent identically distributed (IID) model, treats program transformations as if they were

independent. Because this is usually not the case in practice, they also use a Markov model, which takes dependencies between transformations into account. For searching, the authors try two different approaches: a simple random search, where the probabilities of transformations are biased according to the Machine Learning model, and a Genetic Algorithm [18], where the initial population is based on the model. Their approach speeds up Iterative Compilation by an order of magnitude and achieves runtime speedups of more than 1.2 on a Texas Instruments and an AMD processor. The Markov model generally outperforms the IID model, but there is no significant difference between the two search algorithms.

2.1.4 Automatic Feature Generation

One of the most difficult challenges in Predictive Modelling is to find good features to accurately make predictions. If bad features are chosen, classification clashes can happen: Two programs whose best optimisation parameters are different may be mapped to the same feature vector, resulting in at least one wrong prediction. To overcome this problem, Leather et al. [32] propose a technique to automatically find representative program features. They describe the feature space as a grammar where every sentence from the grammar represents one feature. Using Genetic Algorithms [18], they search the feature space and gradually build a set of useful features by selecting the ones that improve the prediction accuracy. The method is evaluated on loop unrolling, a well-studied transformation that has often been used in Predictive Modelling before (see section 2.1.2). GCC's unrolling heuristic is only able to achieve 3% of the maximum performance on average. A Machine Learning technique to predict unroll factors [44] is able to push this number to 59%. With their automatic feature generation approach, however, Leather et al. are able to obtain 76% of the maximum speedup available.

2.2 Parallel Programs and Machine Learning

2.2.1 Mapping Parallelism to Multi-Cores

Probably the most similar approach to my project of using Machine Learning to optimise parallel programs is by Wang et al. [51]. In this paper, they try to predict the optimal number of threads and OpenMP scheduling policy for a program executed in isolation. Using static program features, e.g. branch counts, and dynamic program

features, e.g. cache miss rates, they build two models, an Artificial Neural Network [23] and a Support Vector Machine [9], to predict both the best number of threads and the scheduling policy. Instead of directly predicting the best number of threads, they predict speedups for all different configurations and then choose the one with the best predicted performance. Additionally, they build predictors that take program input data into account to make predictions for programs whose behaviour is strongly influenced by input data.

The authors evaluated their predictions on two different platforms: an Intel Xeon machine with 8 cores and a Cell processor with 16 cores in total. On the Xeon computer, the OpenMP default policy already does a good job, giving an average performance of 95% of the optimum. However, it is not very stable, because some programs only achieve 57% to 75% of the optimum. The Machine Learning model gives more consistent results with at least 95% of the optimal performance on all benchmarks. On average, however, it is only marginally better than the default policy, reaching 96% of the optimal speedup. On the Cell processor, the Machine Learning predictor is able to significantly improve performance over the default policy. It gives 96% of the optimal performance as opposed to only 70% for the default scheme. The problem of the default policy on the Cell processor is that it always tries to use all processing resources. Because the data first has to be copied into the Synergistic Processing Elements' memory before it can be processed, it may not be worth parallelising. If the communication cost is too high, it is faster to execute a sequential version of the program on the Power Processing Element only. Hence, for the Cell processor, a Machine Learning model that can predict whether it is worth parallelising a program or not can achieve significantly better results than the default policy of parallelising in every situation. In [45] a similar technique has been used to decide for a loop whether it is worth parallelising or not.

Non-Machine Learning Based Methods

Another technique for finding the best number of threads on a parallel architecture, albeit not using Machine Learning, is presented in [29]. In this approach, loops are analysed at compile- and at run-time to decide whether to execute the loop sequentially, with four threads or with eight threads on a four-core machine supporting HyperThreading [34], i.e. supporting up to eight logical threads on four physical CPUs. The authors restrict the possible number of threads to one, four and eight. If it is not


More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Milind Kulkarni Research Statement

Milind Kulkarni Research Statement Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

Introduction to Computing and Systems Architecture

Introduction to Computing and Systems Architecture Introduction to Computing and Systems Architecture 1. Computability A task is computable if a sequence of instructions can be described which, when followed, will complete such a task. This says little

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

OpenMP: Open Multiprocessing

OpenMP: Open Multiprocessing OpenMP: Open Multiprocessing Erik Schnetter June 7, 2012, IHPC 2012, Iowa City Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to parallelise an existing code 4. Advanced

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

Go Deep: Fixing Architectural Overheads of the Go Scheduler

Go Deep: Fixing Architectural Overheads of the Go Scheduler Go Deep: Fixing Architectural Overheads of the Go Scheduler Craig Hesling hesling@cmu.edu Sannan Tariq stariq@cs.cmu.edu May 11, 2018 1 Introduction Golang is a programming language developed to target

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Machine Learning Based Mapping of Data and Streaming Parallelism to Multi-cores

Machine Learning Based Mapping of Data and Streaming Parallelism to Multi-cores Machine Learning Based Mapping of Data and Streaming Parallelism to Multi-cores Zheng Wang E H U N I V E R S I T Y T O H F R G E D I N B U Doctor of Philosophy Institute of Computing Systems Architecture

More information

Parallel Computing Concepts. CSInParallel Project

Parallel Computing Concepts. CSInParallel Project Parallel Computing Concepts CSInParallel Project July 26, 2012 CONTENTS 1 Introduction 1 1.1 Motivation................................................ 1 1.2 Some pairs of terms...........................................

More information

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering Operating System Chapter 4. Threads Lynn Choi School of Electrical Engineering Process Characteristics Resource ownership Includes a virtual address space (process image) Ownership of resources including

More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

Grand Central Dispatch

Grand Central Dispatch A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:

More information

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing

More information

A Compiler Cost Model for Speculative Multithreading Chip-Multiprocessor Architectures

A Compiler Cost Model for Speculative Multithreading Chip-Multiprocessor Architectures A Compiler Cost Model for Speculative Multithreading Chip-Multiprocessor Architectures Jialin Dou E H U N I V E R S I T Y T O H F R G E D I N B U Doctor of Philosophy Institute of Computing Systems Architecture

More information

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Going With the (Data) Flow

Going With the (Data) Flow 1 of 6 1/6/2015 1:00 PM Going With the (Data) Flow Publish Date: May 20, 2013 Table of Contents 1. Natural Data Dependency and Artificial Data Dependency 2. Parallelism in LabVIEW 3. Overuse of Flat Sequence

More information

Free upgrade of computer power with Java, web-base technology and parallel computing

Free upgrade of computer power with Java, web-base technology and parallel computing Free upgrade of computer power with Java, web-base technology and parallel computing Alfred Loo\ Y.K. Choi * and Chris Bloor* *Lingnan University, Hong Kong *City University of Hong Kong, Hong Kong ^University

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing Introduction to Parallel Computing with MPI and OpenMP P. Ramieri Segrate, November 2016 Course agenda Tuesday, 22 November 2016 9.30-11.00 01 - Introduction to parallel

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

Hugh Leather, Edwin Bonilla, Michael O'Boyle

Hugh Leather, Edwin Bonilla, Michael O'Boyle Automatic Generation for Machine Learning Based Optimizing Compilation Hugh Leather, Edwin Bonilla, Michael O'Boyle Institute for Computing Systems Architecture University of Edinburgh, UK Overview Introduction

More information

High-Performance and Parallel Computing

High-Performance and Parallel Computing 9 High-Performance and Parallel Computing 9.1 Code optimization To use resources efficiently, the time saved through optimizing code has to be weighed against the human resources required to implement

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

Predicting GPU Performance from CPU Runs Using Machine Learning

Predicting GPU Performance from CPU Runs Using Machine Learning Predicting GPU Performance from CPU Runs Using Machine Learning Ioana Baldini Stephen Fink Erik Altman IBM T. J. Watson Research Center Yorktown Heights, NY USA 1 To exploit GPGPU acceleration need to

More information

Intel VTune Amplifier XE

Intel VTune Amplifier XE Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

1 Multiprocessors. 1.1 Kinds of Processes. COMP 242 Class Notes Section 9: Multiprocessor Operating Systems

1 Multiprocessors. 1.1 Kinds of Processes. COMP 242 Class Notes Section 9: Multiprocessor Operating Systems COMP 242 Class Notes Section 9: Multiprocessor Operating Systems 1 Multiprocessors As we saw earlier, a multiprocessor consists of several processors sharing a common memory. The memory is typically divided

More information

Performance Tools for Technical Computing

Performance Tools for Technical Computing Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Introduction. CS3026 Operating Systems Lecture 01

Introduction. CS3026 Operating Systems Lecture 01 Introduction CS3026 Operating Systems Lecture 01 One or more CPUs Device controllers (I/O modules) Memory Bus Operating system? Computer System What is an Operating System An Operating System is a program

More information

Input and Output = Communication. What is computation? Hardware Thread (CPU core) Transforming state

Input and Output = Communication. What is computation? Hardware Thread (CPU core) Transforming state What is computation? Input and Output = Communication Input State Output i s F(s,i) (s,o) o s There are many different types of IO (Input/Output) What constitutes IO is context dependent Obvious forms

More information

CPU Architecture. HPCE / dt10 / 2013 / 10.1

CPU Architecture. HPCE / dt10 / 2013 / 10.1 Architecture HPCE / dt10 / 2013 / 10.1 What is computation? Input i o State s F(s,i) (s,o) s Output HPCE / dt10 / 2013 / 10.2 Input and Output = Communication There are many different types of IO (Input/Output)

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing.

(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing. (big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing. Intro: CMP with MT cores e.g. POWER5, Niagara 1 & 2, Nehalem Off-chip miss

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

Extended Dataflow Model For Automated Parallel Execution Of Algorithms

Extended Dataflow Model For Automated Parallel Execution Of Algorithms Extended Dataflow Model For Automated Parallel Execution Of Algorithms Maik Schumann, Jörg Bargenda, Edgar Reetz and Gerhard Linß Department of Quality Assurance and Industrial Image Processing Ilmenau

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

Database Optimization

Database Optimization Database Optimization June 9 2009 A brief overview of database optimization techniques for the database developer. Database optimization techniques include RDBMS query execution strategies, cost estimation,

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS Performance Analysis of Java NativeThread and NativePthread on Win32 Platform Bala Dhandayuthapani Veerasamy Research Scholar Manonmaniam Sundaranar University Tirunelveli, Tamilnadu, India dhanssoft@gmail.com

More information