Adaptively Mapping Parallelism Based on System Workload Using Machine Learning


Adaptively Mapping Parallelism Based on System Workload Using Machine Learning

Dominik Grewe

Master of Science
Computer Science
School of Informatics
University of Edinburgh
2009

Abstract

Parallel computing has become pervasive and the number of processors placed in computers will further increase in the future. However, software developers are struggling to efficiently exploit the computational resources provided by parallel architectures. It is thus essential to investigate the behaviour of parallel programs and to develop methods that help improve their performance. The software developer should not have to worry about how to map parallelism to the underlying architecture, but should instead concentrate on exposing the parallelism and leave the mapping task to the runtime system.

In this project, the behaviour of parallel programs in the presence of workload is investigated. It is shown that choosing the right number of threads for an application is crucial to achieving the best performance possible when there is other workload running on the system. The default policy of creating as many threads as there are cores is rarely optimal in this situation, and using the optimal number of threads reduces the runtime by 22.5% on average w.r.t. the default policy. Determining the optimal number of threads is not a straightforward task, because it depends not only on the current workload but also on the program itself. For some programs, reducing the number of threads w.r.t. the default yields the optimal solution, whereas for other programs the best performance is achieved using more threads than there are cores.

In order to tackle this problem, a novel technique for choosing the number of threads is presented in this work. Using Machine Learning techniques, a model is created that predicts the optimal number of threads based on the current system workload and the program. Different approaches for modelling this problem and several sets of features are evaluated. With the best model, 92% of the optimal performance is achieved, which corresponds to a runtime reduction of almost 16% over the default policy.

Acknowledgements

At first, I would like to thank my supervisor Professor Michael O'Boyle for his excellent advice and for supporting me throughout the whole project. The discussions we had were of great avail to me and helped me focus on the most vital parts of my work. Furthermore, his reviews of my progress enabled me to present my work in a more structured form. I would also like to thank Zheng Wang for his patience in answering all my questions and for providing me with help on the technical issues that arose during my work on this project. Last but certainly not least, I would like to thank Hugh Leather for the time he spent setting up his great tool libplugin for me.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Dominik Grewe)

Contents

1 Introduction
    1.1 Parallel Computing
        1.1.1 Parallel Architectures
        1.1.2 Parallelism Models
        1.1.3 Parallel Programming
        1.1.4 Overheads in Parallel Computing
        1.1.5 Workload on Parallel Computers
    1.2 Compilation
        1.2.1 The Problem with Static Compilation
        1.2.2 Beyond Static Compilation
        1.2.3 Optimising Parallel Programs
    1.3 Machine Learning in Compilation
        1.3.1 Predictive Modelling
        1.3.2 Using Machine Learning to Optimise Parallel Programs
    1.4 Contributions
    1.5 Organisation of the Dissertation

2 Related Work
    2.1 Using Machine Learning for Program Optimisation
        2.1.1 Learning to Schedule
        2.1.2 Learning to Unroll
        2.1.3 Focusing the Search in Iterative Optimisation
        2.1.4 Automatic Feature Generation
    2.2 Parallel Programs and Machine Learning
        2.2.1 Mapping Parallelism to Multi-Cores
        2.2.2 Scalability Prediction Based on Regression
    2.3 Boosting Program Performance in the Presence of Workload
    2.4 Summary

3 Methodology
    3.1 Experimental Setup
        3.1.1 System Specification
        3.1.2 Benchmark Applications
        3.1.3 Collecting Benchmark Execution Times
    3.2 Machine Learning Models
        3.2.1 Support Vector Machines
        3.2.2 Multi-Layer Perceptrons
    3.3 How to Evaluate Predictions
    3.4 Summary

4 Program Scalability
    4.1 Scaling of Parallel Programs
        4.1.1 Scaling on Idle Machines
        4.1.2 Scaling on Loaded Machines
    4.2 Performance of the Default Policy
    4.3 Summary

5 Predicting the Optimal Number of Threads
    5.1 Feature Selection
        5.1.1 Program Features
        5.1.2 Workload Features
    5.2 Modelling the Problem
    5.3 Creating the Model
    5.4 Using the Model for Prediction
    5.5 Summary

6 Results and Evaluation
    6.1 Prediction Using Different Feature Sets
        6.1.1 Prediction Using Dynamic Features
        6.1.2 Prediction Using Static Features
        6.1.3 Prediction Using Both Feature Sets
        6.1.4 Comparison of the Feature Sets
    6.2 Splitting the Workload
    6.3 Performance of the Machine Learning Models
        6.3.1 Training the Model
        6.3.2 Using the Model for Prediction
    6.4 Summary

7 Conclusion
    7.1 Extensions
    7.2 Summary

A Scaling on Idle Machines
B Scaling on Different utdsp.mult-workloads
C Scaling on Various Workloads

Bibliography

Chapter 1

Introduction

Parallel computing has become ubiquitous in the recent past. Because processor speeds increase only slowly, multiple processors are being placed in a single computer to improve its overall performance. However, this trend forces software developers to rethink their way of programming by detaching themselves from sequential programming and starting to develop parallel software. To support the programmer in this task, it is essential to investigate the behaviour of parallel programs in order to develop techniques that help exploit the resources available in parallel machines.

The first part of this chapter gives an overview of current parallel architectures, different models of how to exploit parallelism and techniques for writing parallel programs. This is followed by a brief discussion of the overheads in parallel computing and the behaviour of workload on parallel computers. In the subsequent section, static compilation and its shortcomings are introduced. This motivates some approaches to overcome these problems, e.g. Iterative Compilation. Furthermore, some ideas about optimising parallel programs are shown. The third section of this chapter introduces the idea behind using Machine Learning to optimise programs and how it can be applied to parallel programs. This is followed by a description of the contributions of this project and an overview of the structure of this thesis.

1.1 Parallel Computing

1.1.1 Parallel Architectures

In order to exploit the increasing availability of transistors in a computer, processors have become highly complex structures that can process data at a very fast pace despite severe problems such as the growing gap between processor and memory speeds [25]. However, hardware designers are struggling to gain even more speed from a single processor while remaining energy efficient. Therefore, the new trend is to have multiple (less complex) processing cores in a single machine to improve the computer's overall performance. Due to this development parallel computing has become pervasive and it is seen as the most promising way to exploit the ever-growing availability of computing resources [26]. Whereas in the past, clusters or other forms of distributed architectures [16] were the predominant form of parallel machines, multi-processors are now commonplace and will be even more so in the future. Most computers shipped today have at least two processors and this number will increase, as current research chips contain up to 80 cores [24].

Multi-Processors

There are two kinds of multi-processors: On the one hand, several individual chips can be placed on a motherboard. Although all processors share the computer's resources, e.g. the main memory, they do not share any of their internal resources, such as caches. On the other hand, multiple processors can be placed on a single chip (chip multi-processors or CMPs). In this case, the processors share some of their resources, such as low-level caches, but each processor has its own computing resources.

Having shared caches can bring both advantages and disadvantages. The advantage of sharing caches is that programs running on the same processor but on different cores have the chance of reusing each other's memory, e.g. when the same program is executed twice, the code sections can be shared and don't have to be loaded twice. Furthermore, if some of the CPUs are idle, the programs running on the remaining cores can take advantage of the whole cache, instead of just having a fraction available. This is not possible with private caches. However, in shared caches, programs may also negatively interfere with each other: If one program loads data into the cache, other data that is possibly still in use by another program gets evicted. That data must then be re-loaded from memory, which takes considerably more time than loading it from the cache. This could have been avoided

by having private caches.

Multi-Threading Processors

Whereas programs executing on multi-processors only share lower-level caches, on multi-threading processors they share almost all of the computing resources, from functional units to high-level caches. The idea behind multi-threading is that the processor stores the architectural state, e.g. register contents, of multiple execution threads. It can then issue instructions from any thread, trying to utilise the time other threads spend waiting for long-latency operations, e.g. due to cache misses. Simultaneous multi-threading (SMT) [47, 34] is the most advanced form of multi-threading. Whereas in blocked multi-threading, for example, a hardware context switch that takes several cycles to complete is needed to issue instructions from a different thread, an SMT processor can execute instructions from multiple instruction streams in the same cycle. An SMT processor's resources are better utilised because less time is spent waiting when a thread is blocked. This not only leads to better performance, but also increases the processor's energy efficiency [34].

To the operating system, the different contexts of an SMT processor appear like different processing cores to which it can assign threads. It is the hardware's task to decide which of these threads' instructions are actually issued for execution. This decision depends on which thread has instructions that are ready for execution and on the availability of computing resources. In general, the hardware scheduler tries to treat all threads equally by giving them a fair share of processing resources.

1.1.2 Parallelism Models

There are two different ways to parallelise a program: With data parallelism, each thread or program computes the same set of instructions but works on different data, e.g. in loops. In task parallelism, the threads or programs perform different tasks on either the same or different data. Furthermore, there are two fundamentally different models of how to exploit parallel architectures: In shared memory programming or Thread-Level Parallelism (TLP), several threads of the same program are executed concurrently and they can communicate with each other using the shared memory of the program's virtual memory space. In the message passing model, multiple (individual) programs are executed concurrently. Because the programs' memories are separate, they have to communicate by sending messages to each other via the operating system.

Shared Memory

The shared memory model maps to the multi-processor architecture, because the threads of a parallel program can execute concurrently on the different processors and share data using the program's virtual memory. However, to share data between threads, they have to synchronise first to make sure that only one thread ever operates on the shared data at a time. These critical sections can be implemented using locks or semaphores, but it requires the programmer's attention to make sure that no deadlocks or race conditions can occur [22]. There also exist other techniques that try to simplify the programmer's task, e.g. Software Transactional Memory [41].

Message Passing

In message passing, programs don't share any memory. They have to send messages to each other via operating system calls in order to exchange data. Due to the non-shared-memory nature of the message passing model, it maps to distributed parallel architectures, such as clusters. Each program resides on a node of the cluster and data is exchanged by sending messages over the network. However, programs using message passing can also be executed on a single machine. Synchronisation between programs is implicit in sending and receiving messages, thus requiring no further action from the programmer. However, the programmer must make sure that no deadlocks can occur, which can happen when two programs wait for messages from each other, for example.

1.1.3 Parallel Programming

There are multiple ways of creating parallel programs. Some programming languages, such as Java, have built-in mechanisms for shared memory programming by providing thread creation and synchronisation methods. For other programming languages, such as C/C++, there exist libraries that provide this functionality, e.g. pthreads [38] in the case of C/C++. Whereas these models provide very basic methods for writing parallel programs, there also exist more advanced frameworks, e.g. OpenMP [17], that simplify the task of writing parallel code. OpenMP allows the programmer to create parallel sections without having to worry about thread creation or synchronisation. For-loops with independent iterations, for example, can be parallelised by adding an OpenMP directive in front of the loop code. During compilation this directive is replaced with code that creates multiple threads, divides the iterations of the loop and assigns a share of them to each thread. At the end of the loop, code for synchronising the threads is included, if not otherwise specified by the programmer. The default number of threads created for a parallel section is the number of cores in the computer. However, the programmer can change this number by calling a library function, setting an environment variable or adding a clause to the OpenMP directive, as the sketch below illustrates.
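For illustration, here is a minimal sketch of such an OpenMP loop in C, showing the three ways of setting the thread count just mentioned; the arrays and the loop body are made up for the example:

```c
#include <omp.h>

#define N 1000000
static double a[N], b[N];

int main(void) {
    /* 1: library call overriding the default thread count */
    omp_set_num_threads(4);

    /* 2: alternatively, set an environment variable before running:
          OMP_NUM_THREADS=4 ./a.out                                  */

    /* 3: a clause on the directive itself, which takes precedence here.
       Each thread is assigned a share of the independent iterations and
       an implicit barrier synchronises the threads at the end of the loop. */
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    return 0;
}
```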

There also exist APIs that allow the programmer to write programs using the message passing model, e.g. MPI [39]. MPI provides functions for sending and receiving messages in various ways, such as uni- or multi-casts and blocking or non-blocking operations. It supports the Single-Program-Multiple-Data programming model, where multiple instances of the same program are spawned and the behaviour of a particular instance usually depends on the identifier that is assigned to this instance of the program.

1.1.4 Overheads in Parallel Computing

By sharing the work between processors, the execution time of programs can be significantly reduced. However, there is also some overhead associated with parallel programming that must be taken into account, because it can significantly reduce the speedup one may expect from parallelising a sequential program. In shared memory programming, thread creation, synchronisation and load imbalance are responsible for the most obvious overheads of parallel programs [22]. When new threads are to be created, the operating system has to be invoked to register the threads and reserve memory for each thread's stack. If threads have to co-operate, e.g. to exchange data, they have to synchronise, which causes inter-process communication. Furthermore, if the overall work is not equally distributed over the threads, some processors might be idle although there is still some work to be done.

But there are also other, less apparent overheads that can reduce the expected speedup of parallel programs. In multi-core processors, for example, two threads of the same program that are executed on different CPUs may work on distinct parts of the same cache line. Because the granularity of cache coherence protocols is a cache line, a write by one thread will lead to an invalidation of the cache line in the other thread's processor. This phenomenon is called false sharing [22] and can be avoided by restructuring the program's memory layout.

In the message passing model, a significant cause of overhead is the sending and receiving of messages. Especially when the program is executed in a network of computers, the cost of sending data from one node to the other can be considerable. It is thus important to only send the smallest amount of data necessary and to keep the

distance that data is sent to a minimum. For small problems that don't require much computation, the overhead caused by parallel programming may not be worth the speedup gained by splitting the work between processors. Furthermore, programs that require much communication between threads spend a considerable amount of time waiting for synchronisation. In these cases, it may not be worth parallelising, because a sequential version of the program is possibly faster.

1.1.5 Workload on Parallel Computers

As already mentioned previously, when several programs or several threads of a program are executed on a multi-core computer, there can be both negative and positive interference. The former can be due to programs causing each other's variables to get evicted from caches. The latter can appear when threads of the same program share parts of their memory, e.g. the program code. Hence, a central role in how the workload of a parallel computer performs is played by the cache configurations of the processors and the memory access behaviour of the programs [28].

It is the operating system's responsibility to schedule program threads to the available CPUs. In order to exploit cache locality, it is useful to try to schedule threads to the CPU they have been running on before: A thread running on processor A populates processor A's cache with its data. If the thread is migrated to another processor B, the data has to be invalidated in processor A's cache and processor B's cache must be repopulated. Hence, the operating system attempts not to migrate threads when possible. This is called processor affinity [42].

On simultaneous multi-threading processors, the operating system has less control over which threads are executed. All it can do is schedule (multiple) threads to an SMT processor; it is then the hardware's responsibility to schedule these threads. A major problem with this setup is that it is no longer possible for the operating system to favour high-priority processes by giving them more CPU time while exploiting the opportunities presented by SMT processors [43, 13]. Although it is possible to only schedule the high-priority process to a processor, this will decrease the overall throughput, because computational resources are wasted.
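Affinity can also be requested explicitly by a program. The following minimal, Linux-specific sketch (my illustration, not part of the thesis) pins the calling process to CPU 0 using the sched_setaffinity system call:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);            /* allow only CPU 0 in the mask */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d is now pinned to CPU 0\n", (int)getpid());
    return 0;
}
```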

1.2 Compilation

1.2.1 The Problem with Static Compilation

Almost every piece of software today is written in a high-level language such as C/C++ or Java. To make these programs run on a computer, whose hardware can only process simple assembler instructions, they have to be mapped from the programming language into the machine language. This is performed by a compiler, which parses the source code and transforms it into target code. Doing this translation correctly is straightforward and has been known for decades. However, for each program there are infinitely many translations that are semantically correct, and finding the best-performing one is the actual challenge.

Modern computers are based on complex architectures whose behaviour is hard to predict. Today's processors have several levels of caches and execute multiple instructions at a time, which may be dynamically reordered to better exploit the existing computational resources. These and other optimisations are the reason that modern computers can execute programs at a high speed despite seemingly severe problems such as the growing gap between processor and memory performance. However, these complex hardware structures are also the reason that writing optimising compilers for these architectures becomes more difficult. The effects of optimisations are difficult to predict, and finding good parameters for program transformations, e.g. loop unrolling, is a tedious task that requires the work of a compiler writer who is familiar with the architecture. Optimisations that work well on one architecture may slow down the program on a different one. Furthermore, interactions between different program transformations are complex and applying one optimisation may disable the use of another one. These intricate dependencies are hard to handle and again require an expert to achieve good results. Often simplified machine models are used to estimate the effects of optimisations, but due to the actual complexity of the real hardware these models are too simplistic to make accurate predictions.

When a sequence of optimisations is finally found, it is applied to all programs that are being compiled. But it has been shown that different programs often need different optimisations to achieve the best performance [15, 30, 20]. Hence, using a fixed optimisation sequence with fixed parameters for transformations can only provide average performance, leaving a significant potential for optimisation unused.
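To make the loop unrolling example concrete, the sketch below shows a reduction loop unrolled by a hand-picked factor of four; the function and its names are hypothetical. Whether four is a good choice depends on the target architecture, which is precisely what makes the parameter hard to fix statically:

```c
/* Original loop: one add, one compare and one branch per element. */
double sum_simple(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by a factor of 4: fewer branches and more scope for
   instruction-level parallelism, at the cost of larger code size. */
double sum_unrolled(const double *a, int n) {
    double sum = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)      /* clean-up loop for the remaining elements */
        sum += a[i];
    return sum;
}
```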

1.2.2 Beyond Static Compilation

Ideally, to overcome the previously mentioned problems of hand-crafted heuristics, a compiler should find the right optimisations for a certain program automatically, instead of requiring many hours of effort from an expert. The compiler should know on its own how to best transform a given program in order to achieve good performance. Additionally, whenever the architecture changes, the optimisations have to be adjusted to it, because transformations may have different effects on the new architecture. Hence, for every architecture the expensive process of finding the right optimisation sequence has to be repeated. It would thus be useful to have a compiler that automatically adapts to a new architecture.

The issues described here were tackled by Iterative Compilation [30]. In Iterative Compilation, every program is compiled many times, each time with a different configuration, to find the best optimisations with respect to some metric such as execution time, code size or energy efficiency. This way, specific transformations are found for each program and each architecture. The major drawback of Iterative Compilation, however, is its long compilation time. The search spaces of optimisation parameters are huge and theoretically every possible configuration has to be tested to find the optimal values. To overcome this problem, several techniques have been proposed to speed up Iterative Compilation, including the use of offline search [46] and static models [30] to prune the search space. However, it is still a time-consuming process.

Even though Iterative Compilation seems to overcome some of the problems of static compilation, its use in general-purpose compilers is significantly limited by its long compilation time; it is thus only feasible for heavily used applications or libraries. It also doesn't exploit the opportunity of adapting a program at runtime. When a program is executed, the system's state may be different from when it was compiled, or the program input may have changed. All this can affect the performance of a program and may require different optimisations to achieve the best performance.

1.2.3 Optimising Parallel Programs

In recent years, parallel computing has become more popular. Almost any computer that is shipped today has two or maybe even four processing cores, and this trend will become even stronger with up to eight cores on a chip by next year and as many as 80 cores in current research processors [24]. Most research in optimising parallel programs focuses on reducing inter-process communication by improving data locality

[4]. It is desirable to have a processor reuse a large share of its data, because exchanging data with other processes slows down the computation. Furthermore, for multi-threaded programs the granularity of the parallelism is important. A large granularity, where threads are created only rarely and perform a large amount of work, is more desirable than a small granularity, where threads only perform small amounts of work and there is much overhead due to thread creation and synchronisation.

A popular method for exploiting multiple cores in a computer is Thread-Level Parallelism (see section 1.1.2). A big potential for exploiting Thread-Level Parallelism is the parallelisation of loops. In many applications, e.g. in digital signal processing, there are loops where each iteration is independent of the others. Hence, all iterations could theoretically be executed concurrently by creating a new thread for each iteration. However, in practice the number of iterations is much larger than the number of cores. Creating more threads than there are cores is usually not useful, because it only creates unnecessary synchronisation overhead. Having fewer threads than cores is, in most cases, a waste of computational resources, because some cores will be idle. Only when the synchronisation overhead is higher than the benefit from parallelisation is a sequential version faster than a parallel one. For these reasons, standard frameworks such as OpenMP [17] create as many threads as there are cores in their default configuration.

However, these results are obtained under the assumption that the program can use all of the computer's resources. But what if there are other programs running that also compete for processing time? Is the default configuration still the best? In the early years of parallel programming, parallel computers were only used for dedicated scientific programs and programmers could be sure that they had exclusive use of the parallel machine. Today, however, parallel computers are everywhere and a user can no longer be sure whether a program can use the vast majority of computing resources on its own or whether there are other programs competing for CPU time as well. Furthermore, more complex programs possibly consist of different parallel tasks that each may perform some parts of their work in parallel. It is then no longer obvious how many threads to use for each parallel region. In the ideal case, the programmer only has to expose the parallel sections of a program and it is the runtime system that decides how to parallelise these sections, if at all.

For reasons described before, optimising programs is a difficult task in general and it becomes even harder with parallel programs, even under the assumption that the program is executed in isolation. Taking possible system workload into account is almost never done due to its additional complexity and makes an already difficult

problem even harder. The goal of this project is to improve the performance of parallel programs under any kind of system workload, by making the program choose the right number of threads for execution using predictive modelling. Offline, a Machine Learning model will be trained to predict the right number of threads given a certain application and system workload. The model will then be used to choose the best configuration for a program when it is executed.

1.3 Machine Learning in Compilation

Traditionally, computer software is written in terms of algorithms. When there is a problem to solve, a programmer develops an algorithm that, given some input data, produces the result. Sorting a sequence of numbers, for example, is a well-studied area in computer science and there are numerous algorithms for solving this problem. Some problems, however, are based on more complex patterns. Programming a computer to recognise handwritten digits, for example, is not easy at all. Manually finding a pattern that distinguishes the different digits is almost impossible. However, now that we have the computational power to process large amounts of data, a Machine Learning algorithm can be devised that automatically finds the pattern [8]. The program learns how to interpret the data by looking at training examples, i.e. input data that is labelled with the correct result. It automatically finds a way to classify (previously unseen) data by fitting a mathematical model to the training data. This is the main advantage of Machine Learning: It is not necessary to fully understand the underlying patterns that lead to a result.

A simplified, but more descriptive example of Machine Learning is curve fitting. Figure 1.1 shows several input data points (x_i, y_i) and an approximation of the function y = f(x) that may have produced this data (assuming some noise). In this case, the only input value is x and the Machine Learning model tries to predict a value y. Given that the curve is a third-order polynomial a_3 x^3 + a_2 x^2 + a_1 x + a_0, the Machine Learning task is to find the parameters a_i that best fit the training data.

A crucial factor in Machine Learning is to find the right features, i.e. the input data to the model. The features must be related to the output to make an accurate prediction possible. All the Machine Learning algorithm can do is fit a model to the data. Hence, it is the user's responsibility to provide good data, which includes finding good features that accurately characterise the data.
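Written out, this curve-fitting task is an ordinary least-squares problem. The objective below is the standard choice and is added here for concreteness; the text itself does not spell out a loss function:

\[
\min_{a_0, a_1, a_2, a_3} \; \sum_{i=1}^{n} \Bigl( y_i - \bigl( a_3 x_i^3 + a_2 x_i^2 + a_1 x_i + a_0 \bigr) \Bigr)^2
\]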

Figure 1.1: Machine Learning as curve fitting

1.3.1 Predictive Modelling

As already mentioned previously, today's processors are very complex pieces of hardware and it is difficult to predict their behaviour. Furthermore, computer architectures change over time and each new platform has a potentially different behaviour. Hence, Machine Learning is a good way to automate the process of finding compiler optimisations for these complex architectures, because instead of studying the architecture to find good optimisations, all that is required is a run of the training phase on the new architecture.

The general idea behind predictive modelling in compilation is to build a Machine Learning model that predicts the right optimisations for a specific program and platform. Unlike in Iterative Compilation [30], multiple expensive profiling runs of the application being compiled are avoided. The compiler receives all the information it needs from the source code and maybe a single profiling run on a small data set. Based on this information it makes a prediction as to which transformations to apply. The Machine Learning model is built from training programs and their optimal transformations (see figure 1.2). These optimal configurations are found in a similar way as in Iterative Compilation, namely by searching the parameter space of optimisations. However, this process is done offline, i.e. it is only done once before the compiler is actually used. Once the model has been built, no more searching is needed. To relate unseen applications to the training data, programs are represented by a number of features. Which features are useful and which are not depends on the problem. A good

starting point, however, are features that describe the type of instructions found in the code, e.g. the number of floating point instructions or branch instructions.

Figure 1.2: Predictive Modelling

1.3.2 Using Machine Learning to Optimise Parallel Programs

Recently, Machine Learning has been used to predict the right number of threads and the OpenMP scheduling policy of a parallel program executed in isolation [51]. On an Intel Xeon and a Cell processor, 96% of the optimal performance is achieved, which, on the Cell platform, is a 37% performance improvement over the OpenMP default configuration. However, assuming no system workload, finding the right number of threads boils down to making a decision between the sequential version of a program or using as many threads as there are cores.

Barnes et al. [7] used Machine Learning techniques to predict the performance of programs on a large number of processors given data gathered during executions on a smaller number of processors. Given the abundance of available processors in today's compute clusters, this is a useful and inexpensive means to find the right number of processors to allocate to a program in order to improve the efficiency of the whole cluster.

The project described here is similar to the former one in the sense that it uses supervised learning to build a model to predict the best number of threads for a parallel program. The crucial difference, however, is that there are other (possibly parallel) programs also competing for computing resources, i.e. the applications are not executed in isolation. By varying the number of threads of parallel programs executed on some workload, the best configuration of the program for a particular workload is found. This information is used to create a Machine Learning model that can accurately predict

the best number of threads to use for an arbitrary parallel program given the current system workload.

1.4 Contributions

There are two main contributions of this project: On the one hand, experiments are carried out to investigate how a parallel program's performance changes when the number of threads is varied. It is shown that, on an idle machine, the default policy of creating as many threads as there are processors is the optimal solution. As soon as there is some other workload on the machine, however, the default policy is rarely optimal. By using the right number of threads, significant speedups can be achieved over the default policy. Whereas in some cases it is better to reduce the number of threads, for some programs the optimal performance is reached by using more threads than there are processors.

Additionally, a novel method for choosing the optimal number of threads in the presence of workload is proposed. By applying Machine Learning techniques, a model is built that predicts, for any program and any workload, the best number of threads to use. Using this technique, an average speedup of 92.14% of the optimal performance is achieved, compared to 77.44% of the optimum for the default policy. This is a speedup of 1.19 over the default policy (92.14/77.44 ≈ 1.19) or, in other words, a runtime reduction of 15.95% (1 − 1/1.19 ≈ 0.16).

Due to its additional complexity, the presence of workload is usually not considered in program optimisation. However, when parallel programs are executed on a computer, the workload has a significant influence on their performance and thus cannot be ignored when choosing the number of threads to create.

1.5 Organisation of the Dissertation

The following chapter, chapter 2, describes work that is related to this project. It starts off with an overview of Machine Learning in Compilation, including learning to predict loop unroll factors and automatically finding program features. Furthermore, the application of Machine Learning in the area of parallel programming is discussed. This is followed by some approaches to improve program performance in the presence of workload. Most research in this area focuses on Simultaneous Multi-Threading architectures and how to increase the throughput of these processors.

In chapter 3, the methodology of this project is explained. This includes the experimental setup, i.e. on which kind of computers the experiments were conducted, which benchmarks were used and what the exact method for determining execution times was. Furthermore, the Machine Learning models used in this project, Support Vector Machines and Multi-Layer Perceptrons, are introduced together with the methods for evaluating predictions.

Chapter 4 describes the data gathered during the experiments. The scalability of programs on idle and loaded machines is illustrated and the default policy for the number of threads, i.e. using as many threads as there are cores, is discussed in terms of its performance in the presence of workload.

Chapter 5 begins with a description of the features used for the Machine Learning algorithms in this project. Because both the program and the workload determine the program's behaviour, two sets of features are required to characterise these two factors. Additionally, there is a section on how the problem of predicting the number of threads can be modelled and how the model can be used to predict the optimal number of threads in a certain situation.

In the subsequent chapter, chapter 6, the results of the Machine Learning models are presented. On the one hand, the accuracy of the models is described, i.e. in how many cases the optimal number of threads is predicted correctly. On the other hand, the performance of the programs using the predictor is shown and compared to the performance using the optimal number of threads or using the default policy. This is followed by a short discussion of the performance of the Machine Learning algorithms, i.e. how long it takes to train the model and how long it takes to make a prediction.

Chapter 7 provides a conclusion to the project. The success and feasibility of the approach is discussed and possible extensions to the project are proposed.

Chapter 2

Related Work

Machine Learning has been used in compilation for several optimisation tasks. This chapter describes some of the approaches of applying Machine Learning in the area of program optimisation and parallel programming. At first, one of the earliest papers on Machine Learning in Compilation is presented: Learning to Schedule. This is followed by two different applications of predictive modelling, namely predicting unroll factors for loop unrolling and predicting good areas in the parameter space of optimisations in order to speed up Iterative Compilation. The last paper is about one of the most important problems in predictive modelling, namely finding good program features. It tries to tackle this problem by automatically generating features for the specific learning task.

In the subsequent section, the use of predictive modelling in parallel programming is presented. The first paper is about finding the optimal number of threads and scheduling policy for a parallel program. The second paper in this section is not concerned with improving a program's performance, but presents an approach of using regression to predict the runtime of programs on many processors.

Because this project is concerned with program performance in the presence of workload, proposals for improving workload performance on multi-core and/or multi-threading processors are described in the last section of this chapter. Due to the limited control an operating system has over which thread is scheduled on a simultaneous multi-threading (SMT) processor, there is a substantial amount of work about scheduling on SMT architectures. This ranges from making the scheduler more aware of which threads can beneficially co-exist on the same processing core to giving the operating system more control over the scheduling on the processor by making changes to the hardware. Furthermore, in grid computing, scheduling a program according to the

current workload is important to achieve good execution times. However, the scale of grid computing is much larger than the scale this project is concerned with, namely multi-processors and multi-cores. Hence, only a brief overview of the topic of grid computing is given.

2.1 Using Machine Learning for Program Optimisation

Machine Learning has been applied to several areas of (sequential) program optimisation [36, 44, 35, 3, 10, 11, 20]. Some of these approaches are introduced in more detail in this section.

2.1.1 Learning to Schedule

One of the first uses of Machine Learning in compilation was to find instruction schedules for basic blocks [36]. By exhaustively searching all possible schedules for a number of small basic blocks, training data is gathered to create a model that helps making decisions on which instruction to schedule next given the current partial schedule and the available instructions. The data is stored as triples (P, I_i, I_j), where P is a partial schedule, i.e. a total order of already scheduled instructions and a partial order of remaining instructions. I_i and I_j belong to the set of available instructions I, from which the next selection is to be made. If instruction I_i is preferable over I_j given the partial schedule P, the triple (P, I_i, I_j) is a positive example and (P, I_j, I_i) is a negative example.

Using the examples and counter-examples generated during the training phase, four different techniques are used to infer schedules for new programs. Among these are decision tree induction [49] and feed-forward artificial neural networks [23]. To relate different triples to each other, five features are used that describe both the partial schedule and the available instructions. These include Odd Partial (is the number of instructions odd or even?) and the instruction class. The authors achieved similar results to a manually tuned heuristic, but the advantage of their approach is that it is completely automatic and hence doesn't require an expert to hand-craft the heuristic.
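The triples can be read as a binary classification problem over preference pairs. The following formulation is a sketch of how the training data is used; the symbols h and φ are my notation, not taken from [36]:

\[
h\bigl(\varphi(P, I_i, I_j)\bigr) =
\begin{cases}
+1 & \text{if } I_i \text{ is preferable over } I_j \text{ given } P, \\
-1 & \text{otherwise,}
\end{cases}
\]

where φ maps a triple to the five features described above. Note that each positive example (P, I_i, I_j) automatically yields the mirrored negative example (P, I_j, I_i), doubling the usable training data.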

2.1.2 Learning to Unroll

Machine Learning has also been used to determine unroll factors for loop unrolling [35, 44]. Loop unrolling is an always-legal transformation, but finding the right unroll factor is crucial to exploiting the potential benefits of this optimisation. Choosing a wrong value can even lead to performance degradations. Because it is not easy to manually determine the right unroll factors, Machine Learning has proved to be a helpful technique to improve and automate this decision.

Monsifrot et al. [35] use oblique decision trees [37] to predict unroll factors. They create polygonal partitionings of the feature set to distinguish different types of loops. In order to characterise the loops, they use features extracted from the source code, including the number of arithmetic operations, the number of array accesses and the number of if-statements. Their experimental results on an UltraSPARC and an IA-64 machine show that the technique slightly improves performance compared to not using loop unrolling at all. Compared to the compilers' default heuristics for loop unrolling, the Machine Learning technique also performs marginally better on both architectures.

Stephenson et al. [44] describe the loop unrolling problem as a multi-class classification problem, where each class corresponds to a different unroll factor. Both techniques used by the authors, nearest neighbour and Support Vector Machines (SVM), outperform the default compiler heuristic for loop unrolling on the SPEC benchmarks running on an Itanium 2 processor. Whereas the nearest neighbour method only yields marginal improvements of about 1%, the more complex SVM approach improves the runtime by 5% on average.

2.1.3 Focusing the Search in Iterative Optimisation

The main problem with Iterative Compilation [30] is that it requires a large number of program evaluations. A huge part of the parameter space needs to be searched in order to find optimisations that yield acceptable performance. In 2006, Agakov et al. [3] used Machine Learning to focus the search in an Iterative Compilation approach for selecting optimisation sequences of length 5 from 14 source-level transformations. Their model mapped programs to promising regions in the search space using 36 static loop-level features, which were condensed to 5 features using Principal Component Analysis.

For the model, the authors use two different approaches. The first one, an independent identically distributed (IID) model, treats program transformations as if they were

independent. Because this is usually not the case in practice, they also use a Markov model, which takes dependencies between transformations into account. For searching, the authors try two different approaches: a simple random search, where the probabilities of transformations are biased according to the Machine Learning model, and a Genetic Algorithm [18], where the initial population is based on the model. Their approach speeds up Iterative Compilation by an order of magnitude and achieves runtime speedups of more than 1.2 on a Texas Instruments and an AMD processor. The Markov model generally outperforms the IID model, but there is no significant difference between the two search algorithms.

2.1.4 Automatic Feature Generation

One of the most difficult challenges in Predictive Modelling is to find good features to accurately make predictions. If bad features are chosen, classification clashes can happen: Two programs whose best optimisation parameters are different may be mapped to the same feature vector, resulting in at least one wrong prediction. To overcome this problem, Leather et al. [32] propose a technique to automatically find representative program features. They describe the feature space as a grammar where every sentence from the grammar represents one feature. Using Genetic Algorithms [18], they search the feature space and gradually build a set of useful features by selecting the ones that improve the prediction accuracy. The method is evaluated on loop unrolling, a well-studied transformation that has often been used in Predictive Modelling before (see section 2.1.2). GCC's unrolling heuristic is only able to achieve 3% of the maximum performance on average. A Machine Learning technique to predict unroll factors [44] is able to push this number to 59%. With their automatic feature generation approach, however, Leather et al. are able to obtain 76% of the maximum speedup available.

2.2 Parallel Programs and Machine Learning

2.2.1 Mapping Parallelism to Multi-Cores

Probably the most similar approach to my project of using Machine Learning to optimise parallel programs is by Wang et al. [51]. In this paper, they try to predict the optimal number of threads and OpenMP scheduling policy for a program executed in isolation. Using static program features, e.g. branch counts, and dynamic program

features, e.g. cache miss rates, they build two models, an Artificial Neural Network [23] and a Support Vector Machine [9], to predict both the best number of threads and the scheduling policy. Instead of directly predicting the best number of threads, they predict speedups for all different configurations and then choose the one with the best predicted performance. Additionally, they build predictors that take program input data into account to make predictions for programs whose behaviour is strongly influenced by input data.

The authors evaluated their predictions on two different platforms: an Intel Xeon machine with 8 cores and a Cell processor with 16 cores in total. On the Xeon computer, the OpenMP default policy already does a good job, giving an average performance of 95% of the optimum. However, it is not very stable, because some programs only achieve 57% to 75% of the optimum. The Machine Learning model gives more consistent results with at least 95% of the optimal performance on all benchmarks. On average, however, it is only marginally better than the default policy, reaching 96% of the optimal speedup. On the Cell processor, the Machine Learning predictor is able to significantly improve performance over the default policy. It gives 96% of the optimal performance as opposed to only 70% for the default scheme. The problem of the default policy on the Cell processor is that it always tries to use all processing resources. Because the data first has to be copied into the Synergistic Processing Elements' memory before it can be processed, it may not be worth parallelising. If the communication cost is too high, it is faster to execute a sequential version of the program on the Power Processing Element only. Hence, for the Cell processor, a Machine Learning model that can predict whether it is worth parallelising a program or not can achieve significantly better results than the default policy of parallelising in every situation. In [45] a similar technique has been used to decide for a loop whether it is worth parallelising or not.

Non-Machine Learning Based Methods

Another technique for finding the best number of threads on a parallel architecture, albeit not using Machine Learning, is presented in [29]. In this approach, loops are analysed at compile- and at run-time to decide whether to execute the loop sequentially, with four threads or with eight threads on a four-core machine supporting HyperThreading [34], i.e. supporting up to eight logical threads on four physical CPUs. The authors restrict the possible number of threads to one, four and eight. If it is not


More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Milind Kulkarni Research Statement

Milind Kulkarni Research Statement Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

Introduction to Computing and Systems Architecture

Introduction to Computing and Systems Architecture Introduction to Computing and Systems Architecture 1. Computability A task is computable if a sequence of instructions can be described which, when followed, will complete such a task. This says little

More information

Multicore Computing and Scientific Discovery

Multicore Computing and Scientific Discovery scientific infrastructure Multicore Computing and Scientific Discovery James Larus Dennis Gannon Microsoft Research In the past half century, parallel computers, parallel computation, and scientific research

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

OpenMP: Open Multiprocessing

OpenMP: Open Multiprocessing OpenMP: Open Multiprocessing Erik Schnetter June 7, 2012, IHPC 2012, Iowa City Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to parallelise an existing code 4. Advanced

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

Go Deep: Fixing Architectural Overheads of the Go Scheduler

Go Deep: Fixing Architectural Overheads of the Go Scheduler Go Deep: Fixing Architectural Overheads of the Go Scheduler Craig Hesling hesling@cmu.edu Sannan Tariq stariq@cs.cmu.edu May 11, 2018 1 Introduction Golang is a programming language developed to target

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Machine Learning Based Mapping of Data and Streaming Parallelism to Multi-cores

Machine Learning Based Mapping of Data and Streaming Parallelism to Multi-cores Machine Learning Based Mapping of Data and Streaming Parallelism to Multi-cores Zheng Wang E H U N I V E R S I T Y T O H F R G E D I N B U Doctor of Philosophy Institute of Computing Systems Architecture

More information

Parallel Computing Concepts. CSInParallel Project

Parallel Computing Concepts. CSInParallel Project Parallel Computing Concepts CSInParallel Project July 26, 2012 CONTENTS 1 Introduction 1 1.1 Motivation................................................ 1 1.2 Some pairs of terms...........................................

More information

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering Operating System Chapter 4. Threads Lynn Choi School of Electrical Engineering Process Characteristics Resource ownership Includes a virtual address space (process image) Ownership of resources including

More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to

More information

Grand Central Dispatch

Grand Central Dispatch A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model

More information

Cover Page. The handle holds various files of this Leiden University dissertation.

Cover Page. The handle   holds various files of this Leiden University dissertation. Cover Page The handle http://hdl.handle.net/1887/22055 holds various files of this Leiden University dissertation. Author: Koch, Patrick Title: Efficient tuning in supervised machine learning Issue Date:

More information

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing

More information

A Compiler Cost Model for Speculative Multithreading Chip-Multiprocessor Architectures

A Compiler Cost Model for Speculative Multithreading Chip-Multiprocessor Architectures A Compiler Cost Model for Speculative Multithreading Chip-Multiprocessor Architectures Jialin Dou E H U N I V E R S I T Y T O H F R G E D I N B U Doctor of Philosophy Institute of Computing Systems Architecture

More information

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Going With the (Data) Flow

Going With the (Data) Flow 1 of 6 1/6/2015 1:00 PM Going With the (Data) Flow Publish Date: May 20, 2013 Table of Contents 1. Natural Data Dependency and Artificial Data Dependency 2. Parallelism in LabVIEW 3. Overuse of Flat Sequence

More information

Free upgrade of computer power with Java, web-base technology and parallel computing

Free upgrade of computer power with Java, web-base technology and parallel computing Free upgrade of computer power with Java, web-base technology and parallel computing Alfred Loo\ Y.K. Choi * and Chris Bloor* *Lingnan University, Hong Kong *City University of Hong Kong, Hong Kong ^University

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing Introduction to Parallel Computing with MPI and OpenMP P. Ramieri Segrate, November 2016 Course agenda Tuesday, 22 November 2016 9.30-11.00 01 - Introduction to parallel

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

Hugh Leather, Edwin Bonilla, Michael O'Boyle

Hugh Leather, Edwin Bonilla, Michael O'Boyle Automatic Generation for Machine Learning Based Optimizing Compilation Hugh Leather, Edwin Bonilla, Michael O'Boyle Institute for Computing Systems Architecture University of Edinburgh, UK Overview Introduction

More information

High-Performance and Parallel Computing

High-Performance and Parallel Computing 9 High-Performance and Parallel Computing 9.1 Code optimization To use resources efficiently, the time saved through optimizing code has to be weighed against the human resources required to implement

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers

Optimizing Parallel Access to the BaBar Database System Using CORBA Servers SLAC-PUB-9176 September 2001 Optimizing Parallel Access to the BaBar Database System Using CORBA Servers Jacek Becla 1, Igor Gaponenko 2 1 Stanford Linear Accelerator Center Stanford University, Stanford,

More information

Predicting GPU Performance from CPU Runs Using Machine Learning

Predicting GPU Performance from CPU Runs Using Machine Learning Predicting GPU Performance from CPU Runs Using Machine Learning Ioana Baldini Stephen Fink Erik Altman IBM T. J. Watson Research Center Yorktown Heights, NY USA 1 To exploit GPGPU acceleration need to

More information

Intel VTune Amplifier XE

Intel VTune Amplifier XE Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

1 Multiprocessors. 1.1 Kinds of Processes. COMP 242 Class Notes Section 9: Multiprocessor Operating Systems

1 Multiprocessors. 1.1 Kinds of Processes. COMP 242 Class Notes Section 9: Multiprocessor Operating Systems COMP 242 Class Notes Section 9: Multiprocessor Operating Systems 1 Multiprocessors As we saw earlier, a multiprocessor consists of several processors sharing a common memory. The memory is typically divided

More information

Performance Tools for Technical Computing

Performance Tools for Technical Computing Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Introduction. CS3026 Operating Systems Lecture 01

Introduction. CS3026 Operating Systems Lecture 01 Introduction CS3026 Operating Systems Lecture 01 One or more CPUs Device controllers (I/O modules) Memory Bus Operating system? Computer System What is an Operating System An Operating System is a program

More information

Input and Output = Communication. What is computation? Hardware Thread (CPU core) Transforming state

Input and Output = Communication. What is computation? Hardware Thread (CPU core) Transforming state What is computation? Input and Output = Communication Input State Output i s F(s,i) (s,o) o s There are many different types of IO (Input/Output) What constitutes IO is context dependent Obvious forms

More information

CPU Architecture. HPCE / dt10 / 2013 / 10.1

CPU Architecture. HPCE / dt10 / 2013 / 10.1 Architecture HPCE / dt10 / 2013 / 10.1 What is computation? Input i o State s F(s,i) (s,o) s Output HPCE / dt10 / 2013 / 10.2 Input and Output = Communication There are many different types of IO (Input/Output)

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing.

(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing. (big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing. Intro: CMP with MT cores e.g. POWER5, Niagara 1 & 2, Nehalem Off-chip miss

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

Extended Dataflow Model For Automated Parallel Execution Of Algorithms

Extended Dataflow Model For Automated Parallel Execution Of Algorithms Extended Dataflow Model For Automated Parallel Execution Of Algorithms Maik Schumann, Jörg Bargenda, Edgar Reetz and Gerhard Linß Department of Quality Assurance and Industrial Image Processing Ilmenau

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

Database Optimization

Database Optimization Database Optimization June 9 2009 A brief overview of database optimization techniques for the database developer. Database optimization techniques include RDBMS query execution strategies, cost estimation,

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS

I. INTRODUCTION FACTORS RELATED TO PERFORMANCE ANALYSIS Performance Analysis of Java NativeThread and NativePthread on Win32 Platform Bala Dhandayuthapani Veerasamy Research Scholar Manonmaniam Sundaranar University Tirunelveli, Tamilnadu, India dhanssoft@gmail.com

More information