
Bachelor Informatica
Informatica, Universiteit van Amsterdam

Tuning and Performance Analysis of the Black-Scholes Model

Robin Hansma

June 9, 2017

Supervisor(s): prof. dr. R.V. (Rob) van Nieuwpoort, A. Sclocco

Signed: prof. dr. R.V. (Rob) van Nieuwpoort, R. Hansma


Abstract

There are different options to increase performance without consuming more power, such as using more power-efficient hardware or improving the efficiency of applications. One way to improve the efficiency of an application is to use auto-tuning to find the best configuration for its parameters. In this thesis we explain the performance of the Black-Scholes kernel on both GPUs and CPUs. We conclude that auto-tuning an already optimised kernel still makes sense: the performance increase for the GPUs is in the range of 8-11%, and for the CPU even 44% (the original kernel was optimised for GPUs).


Contents

1 Introduction
  1.1 Research Question
  1.2 Thesis outline
2 Background
  2.1 Auto-Tuning
  2.2 OpenCL
  2.3 Black-Scholes
3 Related Work
  3.1 Accelerating Radio Astronomy with Auto-Tuning
  3.2 Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
4 Implementation
  4.1 Black-Scholes
    4.1.1 Optimisations
5 Experiments
  5.1 Experimental Setup
  5.2 Performance Results
6 Future work
  6.1 Generalise findings
7 Conclusion


CHAPTER 1
Introduction

In the past, it was common to increase the performance of supercomputers by simply adding more cores, either in the form of more nodes, multi-core CPUs or, recently, many-core GPUs. In order to scale to exascale, extra attention should be paid to the efficiency of the supercomputers and the efficiency of the algorithms running on them. The recent change from single-core processors to multi-core processors required a shift in the mindset of developers [7] and required a lot of programs to be rewritten to make the most out of the new architectures. Rewriting programs requires algorithm-specific knowledge and an upfront investment without any knowledge of how performant the new implementation will be. The result is that a lot of scientific [13] and commercial programs are still not optimised for multi-core processing and thus waste processing power.

The recent shift of interest towards the efficiency of supercomputers, as opposed to raw performance, has led to the usage of a new set of benchmarks to measure this. The results of this set of benchmarks are summarised in the Green500 list, which ranks the top 500 supercomputers by efficiency (MFLOPS/W). The top 10 of this list contains five different architectures and seven different main accelerators, as shown in table 1.1. This illustrates the challenge developers face every day: which platform is best to optimise the algorithm for? To make this decision even harder, these architectures change from year to year.

Auto-tuning frameworks can be used to improve the performance of algorithms for a specific architecture. This improves the performance of the algorithm itself, but also improves the portability of performance [13]. Auto-tuning automatically searches a predefined configuration space for the best configuration on an architecture; this way a developer doesn't need to know the hardware specifics to develop an efficient implementation of the algorithm.

1.1 Research Question

The focus of this thesis is to explain why certain configurations are more efficient on certain architectures than on others. Answering this question can provide a useful insight into architectures and help make them perform better. It also has great use for the auto-tuning field itself, because knowing in advance which configurations are more likely to perform well can decrease the configuration space and thus the time required to find the optimal configuration. The research question of this thesis is therefore: why are some kernel configurations more efficient on certain architectures and not on others? In particular, what we want to find out is the relationship between the configuration and the performance of a kernel on different architectures. This thesis is based on the work of Alessio Sclocco [13]; an analysis of the relationship between this thesis and [13] will be presented in section 3.1. We will extend TuneBench, the auto-tuning framework developed by Alessio Sclocco, by adding another tunable kernel to it.

 #   Supercomputer (full specification)                                           Accelerator
 1   NVIDIA DGX-1, Xeon E5-2698v4 20C 2.2GHz, Infiniband EDR, NVIDIA Tesla P100   NVIDIA Tesla P100
 2   Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100  NVIDIA Tesla P100
 3   ZettaScaler-1.6, Xeon E5-2618Lv3 8C 2.3GHz, Infiniband FDR, PEZY-SCnp        PEZY-SCnp
 4   Sunway MPP, Sunway SW 1.45GHz, Sunway                                        Sunway SW
 5   PRIMERGY CX1640 M1, Intel Xeon Phi 1.3GHz, Intel Omni-Path                   Intel Xeon Phi 7260
 6   PRIMERGY CX1640 M1, Intel Xeon Phi 1.4GHz, Intel Omni-Path                   Intel Xeon Phi 7250
 7   Cray XC40, Intel Xeon Phi 1.3GHz, Aries interconnect                         Intel Xeon Phi
 8   Cray CS-Storm, Intel Xeon E5-2680v2 10C 2.8GHz, Infiniband FDR, NVIDIA K80   NVIDIA K80
 9   Cray XC40, Intel Xeon Phi 1.4GHz, Aries interconnect                         Intel Xeon Phi
10   KOI Cluster, Intel Xeon Phi 1.3GHz, Intel Omni-Path                          Intel Xeon Phi 7230

Table 1.1: The top 10 of the Green500 of November 2016 [1], with five different architectures and seven different main accelerators.

1.2 Thesis outline

The second chapter provides the background information required to understand the rest of the thesis; in particular, the auto-tuning framework, the implemented kernel and the language in which the kernel is implemented are discussed. In the third chapter we present a selection of papers that are important for this thesis, and in particular an overview of [13]. The implementation and the optimisations of the kernel we implemented are discussed in the fourth chapter. The fifth chapter discusses the experimental setup, the experiments and the results. Chapter six proposes some topics for future research in this research area. In the last chapter the conclusion of the thesis is presented.

CHAPTER 2
Background

In this chapter we introduce the background information that is necessary to understand the rest of this thesis. The following three sections describe how auto-tuning works, what OpenCL is and why it is used, and some background information on the kernel. A kernel, as used in this thesis, is a small program which performs only one task. These kernels are as closely related to real-world problems as possible, but should generate reproducible results, so that the outcome can be compared to a sequential implementation to make sure the results are correct.

2.1 Auto-Tuning

The process of auto-tuning consists of automatically running a kernel with several different configurations to find the best configuration possible. A configuration is a combination of parameters, for example the number of threads and the number of times a loop is unrolled. The auto-tuning framework tests all possible configurations in a predefined configuration space. A possible configuration space for the number of threads could be 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024. If more than one parameter is tunable, the optimisation space grows as the Cartesian product of the values of each parameter. This growth is rapid and represents one of the main challenges of auto-tuning.
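To make this growth concrete, the listing below sketches the exhaustive search that an auto-tuning framework performs over the Cartesian product of two tunable parameters, the number of threads and the loop-unroll factor. This is only an illustrative C sketch, not TuneBench code; run_and_time_kernel is a hypothetical placeholder for compiling and benchmarking the kernel with a given configuration, and here it simply returns a dummy value so the listing is self-contained.

#include <float.h>
#include <stdio.h>

/* Hypothetical placeholder: a real auto-tuner would compile and benchmark
 * the OpenCL kernel with the given configuration and return the measured
 * run time in seconds. Here it returns a dummy value. */
static double run_and_time_kernel(int threads, int unroll) {
    return 1.0 / (double)(threads * unroll);
}

int main(void) {
    const int threads[] = {2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
    const int unrolls[] = {1, 2, 3, 4, 8, 16};
    const int n_threads = sizeof(threads) / sizeof(threads[0]);
    const int n_unrolls = sizeof(unrolls) / sizeof(unrolls[0]);

    double best_time = DBL_MAX;
    int best_threads = 0, best_unroll = 0;

    /* The optimisation space is the Cartesian product of the parameter
     * ranges: 10 x 6 = 60 configurations for these two parameters. */
    for (int t = 0; t < n_threads; t++) {
        for (int u = 0; u < n_unrolls; u++) {
            double time = run_and_time_kernel(threads[t], unrolls[u]);
            if (time < best_time) {
                best_time = time;
                best_threads = threads[t];
                best_unroll = unrolls[u];
            }
        }
    }
    printf("best configuration: %d threads, unroll factor %d\n",
           best_threads, best_unroll);
    return 0;
}

Every additional tunable parameter multiplies the number of configurations again, which is exactly the rapid growth described above.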

2.2 OpenCL

TuneBench uses OpenCL, which stands for Open Computing Language, as the language for the kernels. This language is chosen because of the possibility to compile the OpenCL code to native code for GPUs, CPUs and hardware accelerators. Some supported hardware manufacturers are Intel, AMD, NVIDIA, IBM and ARM; for a complete list see the OpenCL website by Khronos [3], the maintainer of OpenCL. The OpenCL code is compiled at run time, which ensures the code can be run on any supported platform without recompiling the kernel manually. A disadvantage of trying to write universal code is that it is not always the most efficient code possible. In order to make the most out of an architecture, architecture-specific code should be written. This, however, decreases the portability of the performance and requires architecture-specific knowledge. Most of the time, though, the performance of a universal OpenCL implementation is close to a native implementation [13, 9].

OpenCL has to abstract away the exact details of the underlying platform in order to provide portability. Therefore it introduces concepts like work-groups and work-items. A work-group consists of one or more work-items which are all executed concurrently within a single compute unit [14]. A compute unit may be a single core, a SIMD unit, or any other element in the OpenCL device capable of executing code. Each work-item in a work-group will execute concurrently on a single compute unit [14]; thus work-items are comparable to threads, but not exactly the same. It is up to the implementation how the work-items are scheduled and whether they are actually treated as threads. Within a work-group, local memory is available; this local memory is shared within the work-group and is faster than the global memory [14].

2.3 Black-Scholes

In this thesis we implement the Black-Scholes algorithm. The Black-Scholes model estimates the price of options on the European financial markets. It is not required to understand exactly how this model works in order to follow the rest of this thesis, but some background does help in explaining the performance of the kernel.

The model covers two types of options, the call option and the put option. An option is a security giving the right to buy (call) or sell (put) an asset, subject to certain conditions, within a specified period of time [6]. The distinction with an American option is that European options can only be exercised on a specified future date, while American options can be exercised up to the date the option expires. To be able to estimate the future price of the call and put options a couple of variables are required. These variables are the current stock price, the strike price of the option, the duration of the option in years, the riskless rate of return and the stock volatility. The riskless rate of return is the interest rate without taking risk into account. This is an assumption made by the Black-Scholes model and of course doesn't hold in the real world. The stock volatility indicates how much the stock price changes. The Black-Scholes kernel used in this thesis is based on code developed by NVIDIA [2]. By using this kernel we focus on tunable optimisations instead of on the implementation of the kernel itself.
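As an illustration of the computation the kernel performs, the listing below prices one European call and put option with the Black-Scholes formulas, using the five input variables listed above: the stock price S, the strike price X, the time to expiry T in years, the riskless rate of return r and the volatility v. This is a plain C sketch and not the NVIDIA kernel: the cumulative normal distribution is expressed with erfcf here, whereas the actual kernel uses a rational approximation, and the numbers in main are made up.

#include <math.h>
#include <stdio.h>

/* Cumulative normal distribution, written in terms of the standard
 * complementary error function. */
static float cnd(float d) {
    return 0.5f * erfcf(-d / sqrtf(2.0f));
}

/* Price one European call/put pair following the Black-Scholes model. */
static void black_scholes(float S, float X, float T, float r, float v,
                          float *call, float *put) {
    float d1 = (logf(S / X) + (r + 0.5f * v * v) * T) / (v * sqrtf(T));
    float d2 = d1 - v * sqrtf(T);
    float discount = expf(-r * T);   /* present value factor of the strike */
    *call = S * cnd(d1) - X * discount * cnd(d2);
    *put  = X * discount * cnd(-d2) - S * cnd(-d1);
}

int main(void) {
    float call, put;
    black_scholes(30.0f, 32.0f, 0.5f, 0.02f, 0.30f, &call, &put);
    printf("call = %f, put = %f\n", call, put);
    return 0;
}

In the OpenCL kernel this computation is simply performed once per work-item, so pricing N options means launching N work-items.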

CHAPTER 3
Related Work

In this chapter some related papers on which this thesis builds are discussed. The most important is the thesis of Alessio Sclocco [13], whose TuneBench framework we extend. The second paper is an important paper which discusses optimising stencil computations. By extensively discussing the optimisations per architecture, it gives an insight into the performance of the different architectures.

3.1 Accelerating Radio Astronomy with Auto-Tuning

As mentioned earlier, this thesis is based on the work described in [13]. This research mainly focused on radio astronomy and how the applications used in that field could be optimised. The techniques explored are using many-cores and auto-tuning. The question of how difficult auto-tuning is, is also explored. To be able to answer whether auto-tuning provides a possible solution, the framework TuneBench was developed, which contains five kernels. The kernels are run on several different platforms: CPUs, GPUs and accelerators. The platforms used are the AMD Opteron 6172, AMD HD6970, AMD HD7970, AMD FirePro W9100, AMD R9 Fury X, Intel Xeon E5620, Intel Xeon E5-2620, Intel Xeon Phi 5110P, Intel Xeon Phi 31S1P, NVIDIA GTX 580, NVIDIA GTX 680, NVIDIA K20, NVIDIA GTX Titan, NVIDIA K20X, NVIDIA GTX Titan X and NVIDIA GTX. By using TuneBench to get insights into the optimisation space, the difficulty of auto-tuning could be studied.

The conclusions drawn from examining the optimisation spaces are that completely memory-bound applications are easier to tune than applications that, by exposing data-reuse through tunable parameters, can be made almost compute-bound. The difficulty of tuning can also be a function of the input size. Another conclusion that can be drawn is that tuning many-core accelerators is, in general, difficult, but application-specific knowledge can help prune the search space of tunable parameters. The last conclusion is that there is little correlation between an application being memory- or compute-bound and it having a more or less portable optimum configuration. The evidence found in [13] shows that the optimum is not really portable among different platforms, not even for the same input size. However, some parameters are stable and do not vary at all. The variability of optimal configurations seems to be increasing in newer architectures.

3.2 Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

After years of simply increasing the frequency and applying other optimisations to increase the per-core performance, performance is now mainly increased by adding more cores. This presents a lot of new challenges and thus several architectural approaches. It is not yet clear which architectural philosophy is best suited for which class of algorithms. This makes optimising for a new architecture an expensive task, because it is not clear in advance whether it will perform better than the current architecture. Auto-tuning helps solve this problem by automatically finding the best configuration for several different architectures, which makes an algorithm extremely portable.

This paper uses stencil operations as a benchmark for several architectures. These kernels can be parallelised very well and have a low computational intensity, offering a mixture of opportunities for on-chip parallelism and challenges for the associated memory systems (and they are thus memory-bound). The architectures used are the Intel Xeon E5355, AMD Opteron 2356, Sun UltraSparc T2+, IBM QS22 PowerXCell 8i Blade and the NVIDIA GTX280. An important thing to note is that the GeForce GTX280 has notably faster onboard memory, but a capacity of only 1GB; when a problem larger than 1GB must be handled, the GPU has to fall back on host DRAM over the even slower PCIe bus, which greatly reduces the throughput and thus the performance.

The comparison of the architectures shows that the applicable optimisations and their effect are highly dependent on the architecture. There are of course some optimisations only available for certain architectures; for example, SIMD is only available when implemented (which is only the case for Intel). On the other hand, some optimisations are very effective on some architectures, like increasing the number of threads on the GeForce GTX280, while being much less effective on the CPUs [8].
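To illustrate why stencil kernels are memory-bound, the listing below applies a one-dimensional three-point stencil: each output element requires three loads and one store but only a handful of floating-point operations, so the memory system, not the compute units, is the bottleneck. This is a deliberately simplified sketch; the paper itself benchmarks larger, multi-dimensional stencils.

#include <stdio.h>

#define N 1024

/* One sweep of a 1D three-point stencil: every output element is a
 * weighted average of its left neighbour, itself and its right
 * neighbour. A handful of flops against three loads and one store per
 * element gives a low computational intensity. */
static void stencil_sweep(const float *in, float *out, int n) {
    for (int i = 1; i < n - 1; i++) {
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
    }
    stencil_sweep(a, b, N);
    printf("b[1] = %f\n", b[1]);
    return 0;
}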

CHAPTER 4
Implementation

This chapter discusses the implementation of the Black-Scholes kernel and the optimisations applied to it.

4.1 Black-Scholes

The original implementation of the Black-Scholes kernel follows the Black-Scholes model [6] and is discussed by NVIDIA [11]; some important parts are highlighted here. In the body of the kernel the future price of one option at a time is calculated. To calculate the value of N options, the kernel is executed N times, each time with a different input option. The cumulative normal distribution function is not implemented in C++, so a rational approximation is used. All calculations are done using single-precision floats.

4.1.1 Optimisations

The tunable parameters that we added are: the number of threads, the number of loop unrolls, and vectorisation with different vector sizes. The original code uses neither loop unrolling nor vectorisation. The number of threads is changed by setting the dimensions of the work-group, as discussed in section 2.2.

Loop unrolling is a process in which the body of the loop is repeated multiple times while updating the control logic of the loop. An example can be seen in figure 4.1, where the original loop is unrolled with an unroll factor of 3. By unrolling a loop, the program size increases in an attempt to decrease the execution time. The improved execution time can be achieved by an increase in instruction-level parallelism, register locality and memory locality [10, 12]. For GPUs this optimisation comes with a trade-off: when the number of registers per thread increases, the number of threads that can execute concurrently decreases [10].

The last optimisation applied to this kernel is vectorisation. This technique requires special vector processors and vector registers to benefit from the optimisation. Intel architectures support vectorisation by using SIMD (Single Instruction Multiple Data) instruction sets like MMX, SSE and AVX. SIMD instructions run the same instruction on multiple data elements at the same time, for example when multiplying an array of 4 elements by 2, as shown in figure 4.2. A schematic drawing of a SIMD instruction in a processor is shown in figure 4.3. Modern GPUs don't support vectorisation; this is because instructions on the GPU are already scheduled in such a manner that memory latency is hidden as much as possible, by using light-weight threads instead of vectors. The OpenCL compiler only supports vectors of size 2, 3, 4, 8 or 16, so only the configurations where the loop-unroll factor is equal to one of these values are computed. The rest of the configurations are skipped, because loop unrolling and vectorisation are applied at the same time. A sketch of how the vector size and the unroll factor come together in the kernel is given after figure 4.3.

# Before loop unrolling:
for (int i = 0; i < 6; i++) {
    print i;
}

# After loop unrolling:
for (int i = 0; i < 6; i += 3) {
    print i;
    print i + 1;
    print i + 2;
}

Figure 4.1: An example of loop unrolling; the above for loop is unrolled to three separate statements.

# Before vectorisation:
a[0] = a[0] * 2
a[1] = a[1] * 2
a[2] = a[2] * 2
a[3] = a[3] * 2

# After vectorisation:
a = a * 2

Figure 4.2: Vectorisation makes it possible to apply one instruction to multiple data elements at the same time; the four statements on top are equal to the one below when vectorisation is enabled.

Figure 4.3: A schematic drawing of a SIMD instruction in a processor, source: Wikipedia [4].
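The sketch below shows, in OpenCL C, how the vector size and the unroll factor coincide: each work-item loads a float4, operates on all four elements at once and stores the result, which corresponds to a vector size of 4 combined with an unroll factor of 4. The kernel only multiplies by two, as in figure 4.2, and the names are illustrative; it is not the actual Black-Scholes kernel from TuneBench. The remaining tunable parameter, the number of threads, corresponds to the work-group size chosen by the host when the kernel is enqueued.

/* Illustrative OpenCL kernel: one work-item processes one chunk of four
 * elements, so vectorisation (float4) and an unroll factor of 4 are
 * applied at the same time, as described in section 4.1.1. */
__kernel void scale_by_two_vec4(__global const float *in,
                                __global float *out,
                                const uint n_chunks)
{
    size_t gid = get_global_id(0);      /* one work-item per chunk */
    if (gid < n_chunks) {
        float4 v = vload4(gid, in);     /* four elements in one load  */
        vstore4(v * 2.0f, gid, out);    /* four elements in one store */
    }
}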

CHAPTER 5
Experiments

After discussing the experimental setup in the first section of this chapter, we discuss the results of the experiments. Each experiment is executed on all of the devices discussed in the experimental setup. We try to provide an explanation for the observed behaviour in the experiments by holding the results next to the hardware specifications.

5.1 Experimental Setup

All experiments are run on the DAS-5 supercomputer [5]. We've used the VU cluster; the devices used for running the experiments are listed in table 5.1. The five devices are based on four different architectures, which makes it possible to compare the different architectures. The exact details of the GPUs and accelerators can be found in the appendix.

The Tesla K20 and K40 are specifically designed for HPC purposes, while the Titan X (Maxwell) and Titan X (Pascal) are in the first place designed for gaming. Therefore the K20 and K40 have ECC memory to correct memory errors and also have a higher double-precision performance. The ECC bits are stored in the main memory, which reduces the size left for application usage by 10% for the K20 and 6.25% for the K40 accelerator. The experimental setup has ECC memory disabled for both the K20 and K40, so both have access to all of the installed memory. There is a configurable boost option available for both the K20 and K40. This boost option increases the clock speed of the shaders to temporarily improve performance, and is only available when there is enough power headroom (capped at 235W). During the experiments the default clock speeds of 705MHz and 745MHz respectively are used. All experiments are repeated 1000 times; the average value of the experiments is used in this thesis.

Device               Architecture   OpenCL version         GFLOP/s   GB/s
NVIDIA GTX TitanX    Maxwell        OpenCL-NVIDIA
NVIDIA GTX TitanX    Pascal         OpenCL-NVIDIA
NVIDIA Tesla K20     Kepler         OpenCL-NVIDIA
NVIDIA Tesla K40     Kepler         OpenCL-NVIDIA
Intel E5-2630        Sandy Bridge   OpenCL-Intel 4.5-mic   307.2     42.6

Table 5.1: The experimental setup with one Intel CPU and four NVIDIA GPUs and accelerators; the theoretical maximum performance is for single-precision floats.

5.2 Performance Results

In this section we discuss the performance results of the Black-Scholes kernel. The first type of plot shown is a histogram of the optimisation space, which shows the performance on the y-axis and the input size on the x-axis.

The thicker the bar is, the more configurations are found with a certain performance; in other words, each vertical bar is a histogram. The other plots show the performance per input size, where the x-axis shows the number of threads and the y-axis the performance in GFLOP/s. Not all plots are shown in this section; only the plots for an input size of 4,000 options and a couple of larger input sizes are shown below, the other plots can be found in the appendix. The maximum input size used is 2.56 million options. This value is chosen because it was the largest input size that could complete all tune runs; a higher input size would give an out-of-resources error when tuning with vector size 16 and 1024 threads. The true maximum performance is thus not yet reached with this input size. In this chapter, when we refer to a vector size of 0, we mean that there is no vectorisation of the code.

5.2.1 Black-Scholes

The Black-Scholes kernel was first tuned on the NVIDIA Titan X (Maxwell); a subset of the results is shown in the figures below. Figure 5.1 shows the minimum and maximum floating-point performance per input size and also the distribution of the configurations.

Figure 5.1: The configuration space of the Black-Scholes kernel in GFLOP/s when tuning on a Titan X (Maxwell) GPU using varying input sizes (minimum, 25th percentile, median, 75th percentile and maximum per input size, for input sizes from 4E3 to 2.56E6).

When looking at the histogram it is clear that the improvement in the performance of the kernel is very dependent on the input size. With a low input size of 4,000 options the maximum and minimum are very close to each other: the input size is too low to fully utilise the hardware. When increasing the input size, the performance increases significantly. For the smaller input sizes the 75th percentile is closer to the median than to the maximum. This means that it is relatively hard to get the maximum performance out of this architecture, since most configurations have median performance or less.

In figure 5.2 the plot using an input size of 4,000 options is shown. This plot supports the observation made earlier using figure 5.1: the performance doesn't vary much and is mainly limited by the input size. Increasing the input size raises the peak performance to 140 GFLOP/s, as can be seen in figure 5.3. The differences between the configurations are now clearly visible. When using a low number of threads, a bigger vector size gives a higher performance. The vector size of 8 is at its peak performance when using 4 threads and degrades as the number of threads increases. The configurations using vectors of size 0 or 2 perform the best, and the peak performance is reached at 32 to 128 threads. After 128 threads the performance drops until it is equal to the performance of vector sizes 3 and 4.

Figure 5.2: The performance of the Black-Scholes kernel on the Titan X (Maxwell) using an input size of 4,000 options with a varying number of threads and vector size.

Figure 5.3: The performance of the Black-Scholes kernel on the Titan X (Maxwell) using an input size of options with a varying number of threads and vector size.

What we can conclude from the analysis of the plot is that increasing the amount of computation per thread (increasing the vector size) doesn't increase the performance.

The theoretical peak performance hasn't been reached yet, so this means that the registers are full and therefore the relatively slow main memory has to be used for storing and retrieving data. An important observation is that the Titan X doesn't have a special vector compute unit available. Increasing the number of threads does improve the performance, because of the increased parallelism. However, increasing the parallelism also increases the register usage of a multiprocessor, so once all the registers are used, increasing the parallelism hurts the performance.

Figure 5.4: The performance of the Black-Scholes kernel on the Titan X (Maxwell) using an input size of options with a varying number of threads and vector size.

The other input sizes show roughly the same behaviour; the base and peak performance are at 111 GFLOP/s and 691 GFLOP/s respectively. When using a low number of threads, the configurations with a higher vector size are more efficient because there are fewer work-groups and thus fewer context switches between them. A context switch normally adds only a little overhead on a GPU (since all scheduling is done in hardware), but in this case the number of work-groups is at its maximum and thus the combined overhead is significant. Also, when choosing a configuration with few threads the GPU isn't fully utilised, because NVIDIA GPUs execute threads in lock-step warps of 32.

In figure 5.5 the histogram of the optimisation space of the Titan X (Pascal) is shown. This plot follows the same pattern as the plot in figure 5.1. The peak performance is a bit higher at 918 GFLOP/s, an increase of roughly 30%, which is less than the increase in theoretical GFLOP/s of roughly 65%. What does stand out is that the 75th percentile and median are higher for the Titan X (Pascal) than for the Titan X (Maxwell), which means that relatively more configurations are close to the maximum. When we look at figures 5.6 to 5.8 we see that the behaviour is the same as for the Titan X (Maxwell). The most notable change is found in the plot for input size 2.56 million, figure 5.8, where the vector sizes 0, 2, 3 and 4 are all within 50 GFLOP/s of each other. This is considerably closer than in figure 5.4, where the difference between those vector sizes is almost 200 GFLOP/s, and it supports the observation made earlier that the Titan X (Pascal) is easier to tune than the Titan X (Maxwell). An explanation for this behaviour hasn't been found. That the Titan X (Maxwell) and Titan X (Pascal) behave mostly the same is because the architecture design of the latter is almost identical to that of the former.

The plot in figure 5.9 shows the optimisation space of the K20 accelerator. In contrast to the earlier observed Titan X (Pascal), the median and 75th percentile are closer to the minimum. This suggests that this accelerator is harder to tune than the Titan X.

Figure 5.5: The number of floating-point operations of the Black-Scholes kernel when tuning on a Titan X (Pascal) GPU using varying input sizes.

Figure 5.6: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of 4,000 options with a varying number of threads and vector size.

Figures 5.10 to 5.12 show the performance per number of threads per vector size for the K20 accelerator. The performance of the K20 benefits significantly from an increase in threads, as can be seen in these figures. This can be explained by the configuration of the CUDA cores: there are more CUDA cores per multiprocessor on the K20 than on the Titan X (Maxwell), namely 192 against 128.

Figure 5.7: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 5.8: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

However, there is a clear drop in performance if more than 256 threads are used. This is caused by the fact that there aren't enough registers or multiprocessors available to process the threads. The number of registers of both devices is the same, but the K20 has 13 multiprocessors while the Titan X (Maxwell) has 24 multiprocessors.

Figure 5.9: The number of floating-point operations of the Black-Scholes kernel when tuning on a K20 accelerator using varying input sizes.

Figure 5.10: The performance of the Black-Scholes kernel on the K20 using an input size of 4,000 options with a varying number of threads and vector size.

As the K40 is a faster version of the K20, the behaviour of the K40 is almost identical to that of the K20. The optimisation space, shown in figure 5.13, is the same but with a higher performance. It is again noteworthy that the K40 is harder to tune than the two Titan X devices. The K40 also shows a drop in performance when using a higher number of threads, but this drop occurs after 512 threads instead of 256 threads. This can be explained by looking at the specification of the K40: this accelerator has 15 multiprocessors instead of the 13 multiprocessors of the K20.

Figure 5.11: The performance of the Black-Scholes kernel on the K20 using an input size of options with a varying number of threads and vector size.

Figure 5.12: The performance of the Black-Scholes kernel on the K20 using an input size of options with a varying number of threads and vector size.

Figure 5.13: The number of floating-point operations of the Black-Scholes kernel when tuning on a K40 accelerator using varying input sizes.

Figure 5.14: The performance of the Black-Scholes kernel on the K40 using an input size of 4,000 options with a varying number of threads and vector size.

The last architecture tested is the Intel Xeon E5-2630 processor. The optimisation space of this device can be seen in figure 5.17. Most of the configurations are close to the median performance of the kernel; this shows that the kernel is not easily tuned for maximum performance, but gets median performance fairly easily. When using an input size of 2.56 million options the performance drops. We've investigated this and concluded that the data must be too big to fit in the L3 cache: the size of the L3 cache is 20MB, while the total size of the input and output array is 20.48MB (one input and one output array of 2.56 million single-precision floats each, i.e. 2 x 10.24MB). This causes the processor to retrieve part of the data from main memory, which is much slower than the cache.

Figure 5.15: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 5.16: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 5.17: The number of floating-point operations of the Black-Scholes kernel when tuning on an Intel E5-2630 CPU using varying input sizes.

Figures 5.18 and 5.19 show that, for a relatively small input size, increasing the number of threads negatively affects the performance for vector sizes 2 and bigger. For vector size 0 the performance stays the same when increasing the number of threads. It makes sense that increasing the number of threads above 128 does not improve, or even hurts, the performance, since the number of threads for a processor like the Xeon E5-2630 is relatively small compared to a GPU: the number of threads available on the Xeon E5-2630 is 16. The plot in figure 5.20 shows that vector sizes bigger than 0 provide better performance. The best performance is achieved when using a vector size of 4 or 8; this is because of the AVX instruction set supported by the processor, which handles 8 single-precision floats at a time.

To provide a baseline we've executed the original kernel on all devices with an input size of 1.28 million options. This way we can see whether the tuning has an effect, even on an already optimised kernel such as the Black-Scholes kernel. The results can be found in table 5.2. We can conclude from the results that the performance increase is significant, namely 8 to 44%. The biggest performance increase can be seen when tuning the Intel E5-2630, which is explained by the fact that the original kernel was developed for NVIDIA GPUs; the default settings are thus quite inefficient for a CPU.

Device                         GFLOP/s (Original)   GFLOP/s (Tuned)   Increase
NVIDIA GTX TitanX (Maxwell)    601.94               658.61            9.41%
NVIDIA GTX TitanX (Pascal)     777.02               864.56            11.27%
NVIDIA Tesla K20               382.90               415.03            8.39%
NVIDIA Tesla K40               459.25               500.49            8.98%
Intel E5-2630                  80.91                116.52            44.01%

Table 5.2: The maximum performance of the original implementation of the kernel compared with the maximum performance of the tuned version of the kernel, using an input size of 1.28 million options.

Figure 5.18: The performance of the Black-Scholes kernel on the E5-2630 using an input size of 4,000 options with a varying number of threads and vector size.

Figure 5.19: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 5.20: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.


CHAPTER 6
Future work

We've made a first effort to explain the performance differences between architectures. In order to fully understand these differences, further research should be done. This chapter describes some of the possible research directions and some possible extensions for TuneBench.

6.1 Generalise findings

In order to completely understand which aspects of an architecture make an application perform better, the performance of more kernels should be examined. More devices should also be tested, so that we can, for example, compare AMD architectures with NVIDIA architectures, or include the Xeon Phi, which has a completely different architectural design.


CHAPTER 7
Conclusion

The goal of this thesis was to explain why some kernel configurations are more efficient on certain architectures than on others. With the help of the TuneBench framework we've examined the Black-Scholes kernel on the NVIDIA Maxwell, Pascal and Kepler architectures, and on the Intel Sandy Bridge architecture. A couple of conclusions can be drawn from this research.

The first, and most obvious, conclusion is that CPUs perform significantly worse than GPUs on highly parallel workloads. The available processing power also limits the maximum input size: an input size that is too big will decrease performance (because of limited cache sizes). Because of its special vector instructions, the CPU does take advantage of vectorisation when the input size is big enough.

As opposed to CPUs, GPUs do not take advantage of vectorisation. Vector instructions are missing on GPUs, but GPUs are designed in such a way that memory latency is hidden as much as possible. This is done by executing multiple warps (or wavefronts for AMD) on a streaming multiprocessor: when a warp has to wait on the memory, another warp is executed on the streaming multiprocessor. Because of this design, the approach for optimising for a GPU is entirely different from optimising for a CPU. It is important to make sure the multiprocessors and CUDA cores of a GPU are constantly performing computations. The number of threads influences the performance significantly: both for the CPU and for the GPUs, increasing the number of threads increases performance.

We've also concluded that auto-tuning is effective, even for already optimised kernels such as the Black-Scholes kernel we've used. Performance increases of 8 to 11% for the GPUs and 44% for the CPU were observed.


Bibliography

[1] The Green500 list of November 2016.
[2] NVIDIA OpenCL SDK, Black-Scholes kernel. compute/cuda/3_0/sdk/website/opencl/website/samples.html.
[3] OpenCL documentation by Khronos.
[4] Wikipedia page on SIMD.
[5] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(5):54-63, 2016.
[6] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637-654, 1973.
[7] L. Chai, Q. Gao, and D. K. Panda. Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system. In Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), May 2007.
[8] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 4:1-4:12, Piscataway, NJ, USA, 2008. IEEE Press.
[9] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a high-level language targeted to GPU codes.
[10] G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan. Optimal loop unrolling for GPGPU programs. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1-11, April 2010.
[11] V. Podlozhnyuk. Black-Scholes option pricing.
[12] V. Sarkar. Optimized unrolling of nested loops. In Proceedings of the 14th International Conference on Supercomputing, ICS '00, New York, NY, USA, 2000. ACM.
[13] A. Sclocco. Accelerating Radio Astronomy with Auto-Tuning. PhD thesis, Vrije Universiteit van Amsterdam.
[14] J. Tompson and K. Schlachter. An introduction to the OpenCL programming model.


Appendix

Hardware specifications

Device 0: GeForce GTX TITAN X
  CUDA Driver Version / Runtime Version:          8.0 / 8.0
  CUDA Capability Major/Minor version number:     5.2
  (24) Multiprocessors, (128) CUDA Cores/MP:      3072 CUDA Cores
  GPU Max Clock rate:                             1076 MHz
  Memory Clock rate:                              3505 MHz
  Memory Bus Width:                               384-bit
  Maximum Texture Dimension Size (x,y,z):         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers:  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers:  2D=(16384, 16384), 2048 layers
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Texture alignment:                              512 bytes
  Concurrent copy and kernel execution:           Yes, with 2 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       Yes
  Device PCI Domain ID / Bus ID / location ID:    0 / 3 / 0

Device 0: TITAN X (Pascal)
  CUDA Driver Version / Runtime Version:          8.0 / 8.0
  CUDA Capability Major/Minor version number:     6.1
  (28) Multiprocessors, (128) CUDA Cores/MP:      3584 CUDA Cores
  GPU Max Clock rate:                             1531 MHz
  Memory Clock rate:                              5005 MHz
  Memory Bus Width:                               384-bit
  Maximum Texture Dimension Size (x,y,z):         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers:  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers:  2D=(32768, 32768), 2048 layers
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Texture alignment:                              512 bytes
  Concurrent copy and kernel execution:           Yes, with 2 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       Yes
  Device PCI Domain ID / Bus ID / location ID:    0 / 130 / 0

Device 0: Tesla K20m
  CUDA Driver Version / Runtime Version:          8.0 / 8.0
  CUDA Capability Major/Minor version number:     3.5
  Total amount of global memory:                  5061 MBytes
  (13) Multiprocessors, (192) CUDA Cores/MP:      2496 CUDA Cores
  GPU Max Clock rate:                             706 MHz
  Memory Clock rate:                              2600 MHz
  Memory Bus Width:                               320-bit
  Maximum Texture Dimension Size (x,y,z):         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers:  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers:  2D=(16384, 16384), 2048 layers
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Texture alignment:                              512 bytes
  Concurrent copy and kernel execution:           Yes, with 2 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       Yes
  Device PCI Domain ID / Bus ID / location ID:    0 / 3 / 0

Device 0: Tesla K40c
  CUDA Driver Version / Runtime Version:          8.0 / 8.0
  CUDA Capability Major/Minor version number:     3.5
  (15) Multiprocessors, (192) CUDA Cores/MP:      2880 CUDA Cores
  GPU Max Clock rate:                             745 MHz
  Memory Clock rate:                              3004 MHz
  Memory Bus Width:                               384-bit
  Maximum Texture Dimension Size (x,y,z):         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers:  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers:  2D=(16384, 16384), 2048 layers
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Texture alignment:                              512 bytes
  Concurrent copy and kernel execution:           Yes, with 2 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       Yes
  Device PCI Domain ID / Bus ID / location ID:    0 / 3 / 0

Plots

Figure 7.1: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.2: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.3: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.4: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.5: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.6: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.7: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.8: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.9: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.10: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.11: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.12: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.13: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.14: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.15: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.16: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 7.17: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 7.18: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 7.19: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 7.20: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.


NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -

More information

MANY-CORE COMPUTING. 7-Oct Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, escience Center

MANY-CORE COMPUTING. 7-Oct Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, escience Center MANY-CORE COMPUTING 7-Oct-2013 Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, escience Center Schedule 2 1. Introduction, performance metrics & analysis 2. Programming: basics (10-10-2013)

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Evaluation Of The Performance Of GPU Global Memory Coalescing

Evaluation Of The Performance Of GPU Global Memory Coalescing Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

The Era of Heterogeneous Computing

The Era of Heterogeneous Computing The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview

More information

A Detailed GPU Cache Model Based on Reuse Distance Theory

A Detailed GPU Cache Model Based on Reuse Distance Theory A Detailed GPU Cache Model Based on Reuse Distance Theory Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal Eindhoven University of Technology (Netherlands) Henri Bal Vrije Universiteit Amsterdam

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

Vectorisation and Portable Programming using OpenCL

Vectorisation and Portable Programming using OpenCL Vectorisation and Portable Programming using OpenCL Mitglied der Helmholtz-Gemeinschaft Jülich Supercomputing Centre (JSC) Andreas Beckmann, Ilya Zhukov, Willi Homberg, JSC Wolfram Schenck, FH Bielefeld

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

More information

HPC future trends from a science perspective

HPC future trends from a science perspective HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles

More information

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Distributed ASCI Supercomputer DAS-1 DAS-2 DAS-3 DAS-4 DAS-5

Distributed ASCI Supercomputer DAS-1 DAS-2 DAS-3 DAS-4 DAS-5 Distributed ASCI Supercomputer DAS-1 DAS-2 DAS-3 DAS-4 DAS-5 Paper IEEE Computer (May 2016) What is DAS? Distributed common infrastructure for Dutch Computer Science Distributed: multiple (4-6) clusters

More information

CUDA programming. CUDA requirements. CUDA Querying. CUDA Querying. A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK

CUDA programming. CUDA requirements. CUDA Querying. CUDA Querying. A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK CUDA programming Bedrich Benes, Ph.D. Purdue University Department of Computer Graphics CUDA requirements A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK Standard C compiler http://www.nvidia.com/cuda

More information

Experts in Application Acceleration Synective Labs AB

Experts in Application Acceleration Synective Labs AB Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Auto-tunable GPU BLAS

Auto-tunable GPU BLAS Auto-tunable GPU BLAS Jarle Erdal Steinsland Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University of Science and Technology Department

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA

More information

An Introduction to OpenACC

An Introduction to OpenACC An Introduction to OpenACC Alistair Hart Cray Exascale Research Initiative Europe 3 Timetable Day 1: Wednesday 29th August 2012 13:00 Welcome and overview 13:15 Session 1: An Introduction to OpenACC 13:15

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

Concurrent Manipulation of Dynamic Data Structures in OpenCL

Concurrent Manipulation of Dynamic Data Structures in OpenCL Concurrent Manipulation of Dynamic Data Structures in OpenCL Henk Mulder University of Twente P.O. Box 217, 7500AE Enschede The Netherlands h.mulder-1@student.utwente.nl ABSTRACT With the emergence of

More information

Timothy Lanfear, NVIDIA HPC

Timothy Lanfear, NVIDIA HPC GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

The Optimal CPU and Interconnect for an HPC Cluster

The Optimal CPU and Interconnect for an HPC Cluster 5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

ATS-GPU Real Time Signal Processing Software

ATS-GPU Real Time Signal Processing Software Transfer A/D data to at high speed Up to 4 GB/s transfer rate for PCIe Gen 3 digitizer boards Supports CUDA compute capability 2.0+ Designed to work with AlazarTech PCI Express waveform digitizers Optional

More information

ECE 574 Cluster Computing Lecture 18

ECE 574 Cluster Computing Lecture 18 ECE 574 Cluster Computing Lecture 18 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 2 April 2019 HW#8 was posted Announcements 1 Project Topic Notes I responded to everyone s

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7 General Purpose GPU Programming Advanced Operating Systems Tutorial 7 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation report prepared under contract with Dot Hill August 2015 Executive Summary Solid state

More information

Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators

Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators Ahmad Abdelfattah 1, Jack Dongarra 2, David Keyes 1 and Hatem Ltaief 3 1 KAUST Division of Mathematical and Computer Sciences and

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

NVIDIA nforce IGP TwinBank Memory Architecture

NVIDIA nforce IGP TwinBank Memory Architecture NVIDIA nforce IGP TwinBank Memory Architecture I. Memory Bandwidth and Capacity There s Never Enough With the recent advances in PC technologies, including high-speed processors, large broadband pipelines,

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

BEST BANG FOR YOUR BUCK

BEST BANG FOR YOUR BUCK Carsten Kutzner Theoretical & Computational Biophysics MPI for biophysical Chemistry BEST BANG FOR YOUR BUCK Cost-efficient MD simulations COST-EFFICIENT MD SIMULATIONS TASK 1: CORE-H.XTC HOW TO GET OPTIMAL

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

CS560 Lecture Parallel Architecture 1

CS560 Lecture Parallel Architecture 1 Parallel Architecture Announcements The RamCT merge is done! Please repost introductions. Manaf s office hours HW0 is due tomorrow night, please try RamCT submission HW1 has been posted Today Isoefficiency

More information

Overview of Project's Achievements

Overview of Project's Achievements PalDMC Parallelised Data Mining Components Final Presentation ESRIN, 12/01/2012 Overview of Project's Achievements page 1 Project Outline Project's objectives design and implement performance optimised,

More information