
Bachelor Informatica
Informatica, Universiteit van Amsterdam

Tuning and Performance Analysis of the Black-Scholes Model

Robin Hansma

June 9, 2017

Supervisor(s): prof. dr. R.V. (Rob) van Nieuwpoort, A. Sclocco

Signed: prof. dr. R.V. (Rob) van Nieuwpoort, R. Hansma


Abstract

There are different options to increase performance without consuming more power, such as using more power-efficient hardware or improving the efficiency of applications. One way to improve the efficiency of an application is to use auto-tuning to find the best configuration for its parameters. In this thesis we explain the performance of the Black-Scholes kernel on both GPUs and CPUs. We conclude that auto-tuning an already optimised kernel still makes sense: the performance increase for the GPUs is in the range of 8-11%, and for the CPU even 44% (the original kernel was optimised for GPUs).


Contents

1 Introduction
  1.1 Research Question
  1.2 Thesis outline
2 Background
  2.1 Auto-Tuning
  2.2 OpenCL
  2.3 Black-Scholes
3 Related Work
  3.1 Accelerating Radio Astronomy with Auto-Tuning
  3.2 Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
4 Implementation
  4.1 Black-Scholes
    4.1.1 Optimisations
5 Experiments
  5.1 Experimental Setup
  5.2 Performance Results
6 Future work
  6.1 Generalise findings
7 Conclusion


CHAPTER 1
Introduction

In the past, it was common to increase the performance of supercomputers by simply adding more cores, either in the form of more nodes, multi-core CPUs or, recently, many-core GPUs. In order to scale to exascale, extra attention should be paid to the efficiency of the supercomputers and the efficiency of the algorithms running on them. The recent change from single-core processors to multi-core processors required a shift in the mindset of developers [7] and required a lot of programs to be rewritten to make the most out of the new architectures. Rewriting programs requires algorithm-specific knowledge and an upfront investment without any knowledge of how performant the new implementation will be. The result is that a lot of scientific [13] and commercial programs are still not optimised for multi-core processing and thus waste processing power.

The recent shift of interest towards the efficiency of supercomputers, as opposed to raw performance, has led to the usage of a new set of benchmarks to measure this. The results of this set of benchmarks are summarised in the Green500 list, which ranks the top 500 supercomputers by efficiency (MFLOPS/W). The top 10 of this list contains five different architectures and seven different main accelerators, as shown in table 1.1. This illustrates the challenge developers face every day: which platform is best to optimise the algorithm for? To make this decision even harder, these architectures change from year to year.

Auto-tuning frameworks can be used to improve the performance of algorithms for a specific architecture. This improves the performance of the algorithm itself, but also improves the portability of performance [13]. Auto-tuning automatically searches a predefined configuration space for the best configuration on an architecture; this way a developer doesn't need to know the hardware specifics to develop an efficient implementation of the algorithm.

1.1 Research Question

The focus of this thesis is to explain why certain configurations are more efficient on certain architectures than on others. Answering this question can provide a useful insight into architectures and help make them perform better. It also has great use for the auto-tuning field itself, because knowing in advance which configurations are more likely to perform well can decrease the configuration space and thus the time required to find the optimal configuration. The research question of this thesis is therefore: why are some kernel configurations more efficient on certain architectures and not on others? In particular, what we want to find out is the relationship between the configuration and the performance of a kernel on different architectures. This thesis is based on the work of Alessio Sclocco [13]; an analysis of the relationship between this thesis and [13] will be presented in section 3.1. We will extend TuneBench, the auto-tuning framework developed by Alessio Sclocco, by adding another tunable kernel to it.

 #   Supercomputer (full specification)                                           Accelerator
 1   NVIDIA DGX-1, Xeon E5-2698v4 20C 2.2GHz, Infiniband EDR, NVIDIA Tesla P100   NVIDIA Tesla P100
 2   Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100  NVIDIA Tesla P100
 3   ZettaScaler-1.6, Xeon E5-2618Lv3 8C 2.3GHz, Infiniband FDR, PEZY-SCnp        PEZY-SCnp
 4   Sunway MPP, Sunway SW 1.45GHz, Sunway                                        Sunway SW
 5   PRIMERGY CX1640 M1, Intel Xeon Phi 1.3GHz, Intel Omni-Path                   Intel Xeon Phi 7260
 6   PRIMERGY CX1640 M1, Intel Xeon Phi 1.4GHz, Intel Omni-Path                   Intel Xeon Phi 7250
 7   Cray XC40, Intel Xeon Phi 1.3GHz, Aries interconnect                         Intel Xeon Phi
 8   Cray CS-Storm, Intel Xeon E5-2680v2 10C 2.8GHz, Infiniband FDR, NVIDIA K80   NVIDIA K80
 9   Cray XC40, Intel Xeon Phi 1.4GHz, Aries interconnect                         Intel Xeon Phi
10   KOI Cluster, Intel Xeon Phi 1.3GHz, Intel Omni-Path                          Intel Xeon Phi 7230

Table 1.1: The top 10 of the Green500 of November 2016 [1], with five different architectures and seven different main accelerators.

1.2 Thesis outline

The second chapter provides the background information required to understand the rest of the thesis; in particular, the auto-tuning framework, the implemented kernel and the language in which the kernel is implemented are discussed. In the third chapter we present a selection of papers that are important for this thesis, and in particular an overview of [13]. The implementation and the optimisations of the kernel we implemented are discussed in the fourth chapter. The fifth chapter discusses the experimental setup, the experiments and the results. Chapter six proposes some topics for future research in this research area. In the last chapter the conclusion of the thesis is presented.

CHAPTER 2
Background

In this chapter we introduce the background information that is necessary to understand the rest of this thesis. The following three sections describe how auto-tuning works, what OpenCL is and why it is used, and some background information on the kernel. A kernel, as used in this thesis, is a small program which performs only one task. These kernels are as closely related to real-world problems as possible, but should generate reproducible results, so that the outcome can be compared to a sequential implementation to make sure the results are correct.

2.1 Auto-Tuning

The process of auto-tuning consists of automatically running a kernel with several different configurations to find the best configuration possible. A configuration is a combination of parameters, for example the number of threads and the number of times a loop is unrolled. The auto-tuning framework tests all possible configurations in a predefined configuration space. A possible configuration space for the number of threads could be 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024. If more than one parameter is tunable, the optimisation space grows as the Cartesian product of the values of each parameter. This growth is rapid and represents one of the main challenges of auto-tuning.
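To make this growth concrete, the listing below sketches the exhaustive search that an auto-tuning framework performs over the Cartesian product of two tunable parameters, the number of threads and the loop-unroll factor. This is only an illustrative C sketch, not TuneBench code; run_and_time_kernel is a hypothetical placeholder for compiling and benchmarking the kernel with a given configuration, and here it simply returns a dummy value so the listing is self-contained.

#include <float.h>
#include <stdio.h>

/* Hypothetical placeholder: a real auto-tuner would compile and benchmark
 * the OpenCL kernel with the given configuration and return the measured
 * run time in seconds. Here it returns a dummy value. */
static double run_and_time_kernel(int threads, int unroll) {
    return 1.0 / (double)(threads * unroll);
}

int main(void) {
    const int threads[] = {2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
    const int unrolls[] = {1, 2, 3, 4, 8, 16};
    const int n_threads = sizeof(threads) / sizeof(threads[0]);
    const int n_unrolls = sizeof(unrolls) / sizeof(unrolls[0]);

    double best_time = DBL_MAX;
    int best_threads = 0, best_unroll = 0;

    /* The optimisation space is the Cartesian product of the parameter
     * ranges: 10 x 6 = 60 configurations for these two parameters. */
    for (int t = 0; t < n_threads; t++) {
        for (int u = 0; u < n_unrolls; u++) {
            double time = run_and_time_kernel(threads[t], unrolls[u]);
            if (time < best_time) {
                best_time = time;
                best_threads = threads[t];
                best_unroll = unrolls[u];
            }
        }
    }
    printf("best configuration: %d threads, unroll factor %d\n",
           best_threads, best_unroll);
    return 0;
}

Every additional tunable parameter multiplies the number of configurations again, which is exactly the rapid growth described above.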

2.2 OpenCL

TuneBench uses OpenCL, which stands for Open Computing Language, as the language for the kernels. This language is chosen because of the possibility to compile the OpenCL code to native code for GPUs, CPUs and hardware accelerators. Some supported hardware manufacturers are Intel, AMD, NVIDIA, IBM and ARM; for a complete list see the OpenCL website by Khronos [3], the maintainer of OpenCL. The OpenCL code is compiled at run time, which ensures the code can be run on any supported platform without recompiling the kernel manually. A disadvantage of trying to write universal code is that it is not always the most efficient code possible. In order to make the most out of an architecture, architecture-specific code should be written. This, however, decreases the portability of the performance and requires architecture-specific knowledge. Most of the time, though, the performance of a universal OpenCL implementation is close to a native implementation [13, 9].

OpenCL has to abstract away the exact details of the underlying platform in order to provide portability. Therefore it introduces concepts like work-groups and work-items. A work-group consists of one or more work-items which are all executed concurrently within a single compute unit [14]. A compute unit may be a single core, a SIMD unit, or any other element in the OpenCL device capable of executing code. Each work-item in a work-group will execute concurrently on a single compute unit [14]; thus work-items are comparable to threads, but not exactly the same. It is up to the implementation how the work-items are scheduled and whether they are actually treated as threads. Within a work-group, local memory is available; this local memory is shared within the work-group and is faster than the global memory [14].

2.3 Black-Scholes

In this thesis we implement the Black-Scholes algorithm. The Black-Scholes model estimates the price of options on the European financial markets. It is not required to understand exactly how this model works in order to follow the rest of this thesis, but some background does help in explaining the performance of the kernel.

The model covers two types of options, the call option and the put option. An option is a security giving the right to buy (call) or sell (put) an asset, subject to certain conditions, within a specified period of time [6]. The distinction with an American option is that European options can only be exercised on a specified future date, while American options can be exercised up to the date the option expires. To be able to estimate the future price of the call and put options a couple of variables are required. These variables are the current stock price, the strike price of the option, the duration of the option in years, the riskless rate of return and the stock volatility. The riskless rate of return is the interest rate without taking risk into account. This is an assumption made by the Black-Scholes model and of course doesn't hold in the real world. The stock volatility indicates how much the stock price changes. The Black-Scholes kernel used in this thesis is based on code developed by NVIDIA [2]. By using this kernel we focus on tunable optimisations instead of on the implementation of the kernel itself.
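As an illustration of the computation the kernel performs, the listing below prices one European call and put option with the Black-Scholes formulas, using the five input variables listed above: the stock price S, the strike price X, the time to expiry T in years, the riskless rate of return r and the volatility v. This is a plain C sketch and not the NVIDIA kernel: the cumulative normal distribution is expressed with erfcf here, whereas the actual kernel uses a rational approximation, and the numbers in main are made up.

#include <math.h>
#include <stdio.h>

/* Cumulative normal distribution, written in terms of the standard
 * complementary error function. */
static float cnd(float d) {
    return 0.5f * erfcf(-d / sqrtf(2.0f));
}

/* Price one European call/put pair following the Black-Scholes model. */
static void black_scholes(float S, float X, float T, float r, float v,
                          float *call, float *put) {
    float d1 = (logf(S / X) + (r + 0.5f * v * v) * T) / (v * sqrtf(T));
    float d2 = d1 - v * sqrtf(T);
    float discount = expf(-r * T);   /* present value factor of the strike */
    *call = S * cnd(d1) - X * discount * cnd(d2);
    *put  = X * discount * cnd(-d2) - S * cnd(-d1);
}

int main(void) {
    float call, put;
    black_scholes(30.0f, 32.0f, 0.5f, 0.02f, 0.30f, &call, &put);
    printf("call = %f, put = %f\n", call, put);
    return 0;
}

In the OpenCL kernel this computation is simply performed once per work-item, so pricing N options means launching N work-items.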

CHAPTER 3
Related Work

In this chapter some related papers on which this thesis builds are discussed. The most important is the thesis of Alessio Sclocco [13], whose TuneBench framework we extend. The second paper is an important paper which discusses optimising stencil computations. By extensively discussing the optimisations per architecture, it gives an insight into the performance of the different architectures.

3.1 Accelerating Radio Astronomy with Auto-Tuning

As mentioned earlier, this thesis is based on the work described in [13]. This research mainly focused on radio astronomy and how the applications used in that field could be optimised. The techniques explored are using many-cores and auto-tuning. The question of how difficult auto-tuning is, is also explored. To be able to answer whether auto-tuning provides a possible solution, the framework TuneBench was developed, which contains five kernels. The kernels are run on several different platforms: CPUs, GPUs and accelerators. The platforms used are the AMD Opteron 6172, AMD HD6970, AMD HD7970, AMD FirePro W9100, AMD R9 Fury X, Intel Xeon E5620, Intel Xeon E5-2620, Intel Xeon Phi 5110P, Intel Xeon Phi 31S1P, NVIDIA GTX 580, NVIDIA GTX 680, NVIDIA K20, NVIDIA GTX Titan, NVIDIA K20X, NVIDIA GTX Titan X and NVIDIA GTX. By using TuneBench to get insights into the optimisation space, the difficulty of auto-tuning could be studied.

The conclusions drawn from examining the optimisation spaces are that completely memory-bound applications are easier to tune than applications that, by exposing data-reuse through tunable parameters, can be made almost compute-bound. The difficulty of tuning can also be a function of the input size. Another conclusion that can be drawn is that tuning many-core accelerators is, in general, difficult, but application-specific knowledge can help prune the search space of tunable parameters. The last conclusion is that there is little correlation between an application being memory- or compute-bound and it having a more or less portable optimum configuration. The evidence found in [13] shows that the optimum is not really portable among different platforms, not even for the same input size. However, some parameters are stable and do not vary at all. The variability of optimal configurations seems to be increasing in newer architectures.

3.2 Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

After years of simply increasing the frequency and applying other optimisations to increase the per-core performance, performance is now mainly increased by adding more cores. This presents a lot of new challenges and thus several architectural approaches. It is not yet clear which architectural philosophy is best suited for which class of algorithms. This makes optimising for a new architecture an expensive task, because it is not clear in advance whether it will perform better than the current architecture. Auto-tuning helps solve this problem by automatically finding the best configuration for several different architectures, which makes an algorithm extremely portable.

This paper uses stencil operations as a benchmark for several architectures. These kernels can be parallelised very well and have a low computational intensity, offering a mixture of opportunities for on-chip parallelism and challenges for the associated memory systems (and they are thus memory-bound). The architectures used are the Intel Xeon E5355, AMD Opteron 2356, Sun UltraSparc T2+, IBM QS22 PowerXCell 8i Blade and the NVIDIA GTX280. An important thing to note is that the GeForce GTX280 has notably faster onboard memory, but a capacity of only 1GB; when a problem larger than 1GB must be handled, the GPU has to fall back on host DRAM over the even slower PCIe bus, which greatly reduces the throughput and thus the performance.

The comparison of the architectures shows that the applicable optimisations and their effect are highly dependent on the architecture. There are of course some optimisations only available for certain architectures; for example, SIMD is only available when implemented (which is only the case for Intel). On the other hand, some optimisations are very effective on some architectures, like increasing the number of threads on the GeForce GTX280, while being much less effective on the CPUs [8].
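To illustrate why stencil kernels are memory-bound, the listing below applies a one-dimensional three-point stencil: each output element requires three loads and one store but only a handful of floating-point operations, so the memory system, not the compute units, is the bottleneck. This is a deliberately simplified sketch; the paper itself benchmarks larger, multi-dimensional stencils.

#include <stdio.h>

#define N 1024

/* One sweep of a 1D three-point stencil: every output element is a
 * weighted average of its left neighbour, itself and its right
 * neighbour. A handful of flops against three loads and one store per
 * element gives a low computational intensity. */
static void stencil_sweep(const float *in, float *out, int n) {
    for (int i = 1; i < n - 1; i++) {
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
    }
    stencil_sweep(a, b, N);
    printf("b[1] = %f\n", b[1]);
    return 0;
}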

CHAPTER 4
Implementation

This chapter discusses the implementation of the Black-Scholes kernel and the optimisations applied to it.

4.1 Black-Scholes

The original implementation of the Black-Scholes kernel follows the Black-Scholes model [6] and is discussed by NVIDIA [11]; some important parts are highlighted here. In the body of the kernel the future price of one option at a time is calculated. To calculate the value of N options, the kernel is executed N times, each time with a different input option. The cumulative normal distribution function is not implemented in C++, so a rational approximation is used. All calculations are done using single-precision floats.

4.1.1 Optimisations

The tunable parameters that we added are: the number of threads, the number of loop unrolls, and vectorisation with different vector sizes. The original code uses neither loop unrolling nor vectorisation. The number of threads is changed by setting the dimensions of the work-group, as discussed in section 2.2.

Loop unrolling is a process in which the body of the loop is repeated multiple times while updating the control logic of the loop. An example can be seen in figure 4.1, where the original loop is unrolled with an unroll factor of 3. By unrolling a loop, the program size increases in an attempt to decrease the execution time. The improved execution time can be achieved by an increase in instruction-level parallelism, register locality and memory locality [10, 12]. For GPUs this optimisation comes with a trade-off: when the number of registers per thread increases, the number of threads that can execute concurrently decreases [10].

The last optimisation applied to this kernel is vectorisation. This technique requires special vector processors and vector registers to benefit from the optimisation. Intel architectures support vectorisation by using SIMD (Single Instruction Multiple Data) instruction sets like MMX, SSE and AVX. SIMD instructions run the same instruction on multiple data elements at the same time, for example when multiplying an array of 4 elements by 2, as shown in figure 4.2. A schematic drawing of a SIMD instruction in a processor is shown in figure 4.3. Modern GPUs don't support vectorisation; this is because instructions on the GPU are already scheduled in such a manner that memory latency is hidden as much as possible, by using light-weight threads instead of vectors. The OpenCL compiler only supports vectors of size 2, 3, 4, 8 or 16, so only the configurations where the loop-unroll factor is equal to one of these values are computed. The rest of the configurations are skipped, because loop unrolling and vectorisation are applied at the same time. A sketch of how the vector size and the unroll factor come together in the kernel is given after figure 4.3.

# Before loop unrolling:
for (int i = 0; i < 6; i++) {
    print i;
}

# After loop unrolling:
for (int i = 0; i < 6; i += 3) {
    print i;
    print i + 1;
    print i + 2;
}

Figure 4.1: An example of loop unrolling; the above for loop is unrolled to three separate statements.

# Before vectorisation:
a[0] = a[0] * 2
a[1] = a[1] * 2
a[2] = a[2] * 2
a[3] = a[3] * 2

# After vectorisation:
a = a * 2

Figure 4.2: Vectorisation makes it possible to apply one instruction to multiple data elements at the same time; the four statements on top are equal to the one below when vectorisation is enabled.

Figure 4.3: A schematic drawing of a SIMD instruction in a processor, source: Wikipedia [4].
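The sketch below shows, in OpenCL C, how the vector size and the unroll factor coincide: each work-item loads a float4, operates on all four elements at once and stores the result, which corresponds to a vector size of 4 combined with an unroll factor of 4. The kernel only multiplies by two, as in figure 4.2, and the names are illustrative; it is not the actual Black-Scholes kernel from TuneBench. The remaining tunable parameter, the number of threads, corresponds to the work-group size chosen by the host when the kernel is enqueued.

/* Illustrative OpenCL kernel: one work-item processes one chunk of four
 * elements, so vectorisation (float4) and an unroll factor of 4 are
 * applied at the same time, as described in section 4.1.1. */
__kernel void scale_by_two_vec4(__global const float *in,
                                __global float *out,
                                const uint n_chunks)
{
    size_t gid = get_global_id(0);      /* one work-item per chunk */
    if (gid < n_chunks) {
        float4 v = vload4(gid, in);     /* four elements in one load  */
        vstore4(v * 2.0f, gid, out);    /* four elements in one store */
    }
}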

CHAPTER 5
Experiments

After discussing the experimental setup in the first section of this chapter, we discuss the results of the experiments. Each experiment is executed on all of the devices discussed in the experimental setup. We try to provide an explanation for the observed behaviour in the experiments by holding the results next to the hardware specifications.

5.1 Experimental Setup

All experiments are run on the DAS-5 supercomputer [5]. We've used the VU cluster; the devices used for running the experiments are listed in table 5.1. The five devices are based on four different architectures, which makes it possible to compare the different architectures. The exact details of the GPUs and accelerators can be found in the appendix.

The Tesla K20 and K40 are specifically designed for HPC purposes, while the Titan X (Maxwell) and Titan X (Pascal) are in the first place designed for gaming. Therefore the K20 and K40 have ECC memory to correct memory errors and also have a higher double-precision performance. The ECC bits are stored in the main memory, which reduces the size left for application usage by 10% for the K20 and 6.25% for the K40 accelerator. The experimental setup has ECC memory disabled for both the K20 and K40, so both have access to all of the installed memory. There is a configurable boost option available for both the K20 and K40. This boost option increases the clock speed of the shaders to temporarily improve performance, and is only available when there is enough power headroom (capped at 235W). During the experiments the default clock speeds of 705MHz and 745MHz respectively are used. All experiments are repeated 1000 times; the average value of the experiments is used in this thesis.

Device               Architecture   OpenCL version         GFLOP/s   GB/s
NVIDIA GTX TitanX    Maxwell        OpenCL-NVIDIA
NVIDIA GTX TitanX    Pascal         OpenCL-NVIDIA
NVIDIA Tesla K20     Kepler         OpenCL-NVIDIA
NVIDIA Tesla K40     Kepler         OpenCL-NVIDIA
Intel E5-2630        Sandy Bridge   OpenCL-Intel 4.5-mic   307.2     42.6

Table 5.1: The experimental setup with one Intel CPU and four NVIDIA GPUs and accelerators; the theoretical maximum performance is for single-precision floats.

5.2 Performance Results

In this section we discuss the performance results of the Black-Scholes kernel. The first type of plot shown is a histogram of the optimisation space, which shows the performance on the y-axis and the input size on the x-axis.

The thicker the bar is, the more configurations are found with a certain performance; in other words, each vertical bar is a histogram. The other plots show the performance per input size, where the x-axis shows the number of threads and the y-axis the performance in GFLOP/s. Not all plots are shown in this section; only the plots for an input size of 4,000 options and a couple of larger input sizes are shown below, the other plots can be found in the appendix. The maximum input size used is 2.56 million options. This value is chosen because it was the largest input size that could complete all tune runs; a higher input size would give an out-of-resources error when tuning with vector size 16 and 1024 threads. The true maximum performance is thus not yet reached with this input size. In this chapter, when we refer to a vector size of 0, we mean that there is no vectorisation of the code.

5.2.1 Black-Scholes

The Black-Scholes kernel was first tuned on the NVIDIA Titan X (Maxwell); a subset of the results is shown in the figures below. Figure 5.1 shows the minimum and maximum floating-point performance per input size and also the distribution of the configurations.

Figure 5.1: The configuration space of the Black-Scholes kernel in GFLOP/s when tuning on a Titan X (Maxwell) GPU using varying input sizes (minimum, 25th percentile, median, 75th percentile and maximum per input size, for input sizes from 4E3 to 2.56E6).

When looking at the histogram it is clear that the improvement in the performance of the kernel is very dependent on the input size. With a low input size of 4,000 options the maximum and minimum are very close to each other: the input size is too low to fully utilise the hardware. When increasing the input size, the performance increases significantly. For the smaller input sizes the 75th percentile is closer to the median than to the maximum. This means that it is relatively hard to get the maximum performance out of this architecture, since most configurations have median performance or less.

In figure 5.2 the plot using an input size of 4,000 options is shown. This plot supports the observation made earlier using figure 5.1: the performance doesn't vary much and is mainly limited by the input size. Increasing the input size raises the peak performance to 140 GFLOP/s, as can be seen in figure 5.3. The differences between the configurations are now clearly visible. When using a low number of threads, a bigger vector size gives a higher performance. The vector size of 8 is at its peak performance when using 4 threads and degrades as the number of threads increases. The configurations using vectors of size 0 or 2 perform the best, and the peak performance is reached at 32 to 128 threads. After 128 threads the performance drops until it is equal to the performance of vector sizes 3 and 4.

Figure 5.2: The performance of the Black-Scholes kernel on the Titan X (Maxwell) using an input size of 4,000 options with a varying number of threads and vector size.

Figure 5.3: The performance of the Black-Scholes kernel on the Titan X (Maxwell) using an input size of options with a varying number of threads and vector size.

What we can conclude from the analysis of the plot is that increasing the amount of computation per thread (increasing the vector size) doesn't increase the performance.

The theoretical peak performance hasn't been reached yet, so this means that the registers are full and therefore the relatively slow main memory has to be used for storing and retrieving data. An important observation is that the Titan X doesn't have a special vector compute unit available. Increasing the number of threads does improve the performance, because of the increased parallelism. However, increasing the parallelism also increases the register usage of a multiprocessor, so once all the registers are used, increasing the parallelism hurts the performance.

Figure 5.4: The performance of the Black-Scholes kernel on the Titan X (Maxwell) using an input size of options with a varying number of threads and vector size.

The other input sizes show roughly the same behaviour; the base and peak performance are at 111 GFLOP/s and 691 GFLOP/s respectively. When using a low number of threads, the configurations with a higher vector size are more efficient because there are fewer work-groups and thus fewer context switches between them. A context switch normally adds only a little overhead on a GPU (since all scheduling is done in hardware), but in this case the number of work-groups is at its maximum and thus the combined overhead is significant. Also, when choosing a configuration with few threads the GPU isn't fully utilised, because NVIDIA GPUs execute threads in lock-step warps of 32.

In figure 5.5 the histogram of the optimisation space of the Titan X (Pascal) is shown. This plot follows the same pattern as the plot in figure 5.1. The peak performance is a bit higher at 918 GFLOP/s, an increase of roughly 30%, which is less than the increase in theoretical GFLOP/s of roughly 65%. What does stand out is that the 75th percentile and median are higher for the Titan X (Pascal) than for the Titan X (Maxwell), which means that relatively more configurations are close to the maximum. When we look at figures 5.6 to 5.8 we see that the behaviour is the same as for the Titan X (Maxwell). The most notable change is found in the plot for input size 2.56 million, figure 5.8, where the vector sizes 0, 2, 3 and 4 are all within 50 GFLOP/s of each other. This is considerably closer than in figure 5.4, where the difference between those vector sizes is almost 200 GFLOP/s, and it supports the observation made earlier that the Titan X (Pascal) is easier to tune than the Titan X (Maxwell). An explanation for this behaviour hasn't been found. That the Titan X (Maxwell) and Titan X (Pascal) behave mostly the same is because the architecture design of the latter is almost identical to that of the former.

The plot in figure 5.9 shows the optimisation space of the K20 accelerator. In contrast to the earlier observed Titan X (Pascal), the median and 75th percentile are closer to the minimum. This suggests that this accelerator is harder to tune than the Titan X.

Figure 5.5: The number of floating-point operations of the Black-Scholes kernel when tuning on a Titan X (Pascal) GPU using varying input sizes.

Figure 5.6: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of 4,000 options with a varying number of threads and vector size.

Figures 5.10 to 5.12 show the performance per number of threads per vector size for the K20 accelerator. The performance of the K20 benefits significantly from an increase in threads, as can be seen in these figures. This can be explained by the configuration of the CUDA cores: there are more CUDA cores per multiprocessor on the K20 than on the Titan X (Maxwell), namely 192 against 128.

Figure 5.7: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 5.8: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

However, there is a clear drop in performance if more than 256 threads are used. This is caused by the fact that there aren't enough registers or multiprocessors available to process the threads. The number of registers of both devices is the same, but the K20 has 13 multiprocessors while the Titan X (Maxwell) has 24 multiprocessors.

Figure 5.9: The number of floating-point operations of the Black-Scholes kernel when tuning on a K20 accelerator using varying input sizes.

Figure 5.10: The performance of the Black-Scholes kernel on the K20 using an input size of 4,000 options with a varying number of threads and vector size.

As the K40 is a faster version of the K20, the behaviour of the K40 is almost identical to that of the K20. The optimisation space, shown in figure 5.13, is the same but with a higher performance. It is again noteworthy that the K40 is harder to tune than the two Titan X devices. The K40 also shows a drop in performance when using a higher number of threads, but this drop occurs after 512 threads instead of 256 threads. This can be explained by looking at the specification of the K40: this accelerator has 15 multiprocessors instead of the 13 multiprocessors of the K20.

Figure 5.11: The performance of the Black-Scholes kernel on the K20 using an input size of options with a varying number of threads and vector size.

Figure 5.12: The performance of the Black-Scholes kernel on the K20 using an input size of options with a varying number of threads and vector size.

Figure 5.13: The number of floating-point operations of the Black-Scholes kernel when tuning on a K40 accelerator using varying input sizes.

Figure 5.14: The performance of the Black-Scholes kernel on the K40 using an input size of 4,000 options with a varying number of threads and vector size.

The last architecture tested is the Intel Xeon E5-2630 processor. The optimisation space of this device can be seen in figure 5.17. Most of the configurations are close to the median performance of the kernel; this shows that the kernel is not easily tuned for maximum performance, but gets median performance fairly easily. When using an input size of 2.56 million options the performance drops. We've investigated this and concluded that the data must be too big to fit in the L3 cache: the size of the L3 cache is 20MB, while the total size of the input and output array is 20.48MB (one input and one output array of 2.56 million single-precision floats each, i.e. 2 x 10.24MB). This causes the processor to retrieve part of the data from main memory, which is much slower than the cache.

Figure 5.15: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 5.16: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 5.17: The number of floating-point operations of the Black-Scholes kernel when tuning on an Intel E5-2630 CPU using varying input sizes.

Figures 5.18 and 5.19 show that, for a relatively small input size, increasing the number of threads negatively affects the performance for vector sizes 2 and bigger. For vector size 0 the performance stays the same when increasing the number of threads. It makes sense that increasing the number of threads above 128 does not improve, or even hurts, the performance, since the number of threads for a processor like the Xeon E5-2630 is relatively small compared to a GPU: the number of threads available on the Xeon E5-2630 is 16. The plot in figure 5.20 shows that vector sizes bigger than 0 provide better performance. The best performance is achieved when using a vector size of 4 or 8; this is because of the AVX instruction set supported by the processor, which handles 8 single-precision floats at a time.

To provide a baseline we've executed the original kernel on all devices with an input size of 1.28 million options. This way we can see whether the tuning has an effect, even on an already optimised kernel such as the Black-Scholes kernel. The results can be found in table 5.2. We can conclude from the results that the performance increase is significant, namely 8 to 44%. The biggest performance increase can be seen when tuning the Intel E5-2630, which is explained by the fact that the original kernel was developed for NVIDIA GPUs; the default settings are thus quite inefficient for a CPU.

Device                         GFLOP/s (Original)   GFLOP/s (Tuned)   Increase
NVIDIA GTX TitanX (Maxwell)    601.94               658.61            9.41%
NVIDIA GTX TitanX (Pascal)     777.02               864.56            11.27%
NVIDIA Tesla K20               382.90               415.03            8.39%
NVIDIA Tesla K40               459.25               500.49            8.98%
Intel E5-2630                  80.91                116.52            44.01%

Table 5.2: The maximum performance of the original implementation of the kernel compared with the maximum performance of the tuned version of the kernel, using an input size of 1.28 million options.

Figure 5.18: The performance of the Black-Scholes kernel on the E5-2630 using an input size of 4,000 options with a varying number of threads and vector size.

Figure 5.19: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 5.20: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.


CHAPTER 6
Future work

We've made a first effort to explain the performance differences between architectures. In order to fully understand these differences, further research should be done. This chapter describes some of the possible research directions and some possible extensions for TuneBench.

6.1 Generalise findings

In order to completely understand which aspects of an architecture make an application perform better, the performance of more kernels should be examined. More devices should also be tested, so that we can, for example, compare AMD architectures with NVIDIA architectures, or include the Xeon Phi, which has a completely different architectural design.


CHAPTER 7
Conclusion

The goal of this thesis was to explain why some kernel configurations are more efficient on certain architectures than on others. With the help of the TuneBench framework we've examined the Black-Scholes kernel on the NVIDIA Maxwell, Pascal and Kepler architectures, and on the Intel Sandy Bridge architecture. A couple of conclusions can be drawn from this research.

The first, and most obvious, conclusion is that CPUs perform significantly worse than GPUs on highly parallel workloads. The available processing power also limits the maximum input size: an input size that is too big will decrease performance (because of limited cache sizes). Because of its special vector instructions, the CPU does take advantage of vectorisation when the input size is big enough.

As opposed to CPUs, GPUs do not take advantage of vectorisation. Vector instructions are missing on GPUs, but GPUs are designed in such a way that memory latency is hidden as much as possible. This is done by executing multiple warps (or wavefronts for AMD) on a streaming multiprocessor: when a warp has to wait on the memory, another warp is executed on the streaming multiprocessor. Because of this design, the approach for optimising for a GPU is entirely different from optimising for a CPU. It is important to make sure the multiprocessors and CUDA cores of a GPU are constantly performing computations. The number of threads influences the performance significantly: both for the CPU and for the GPUs, increasing the number of threads increases performance.

We've also concluded that auto-tuning is effective, even for already optimised kernels such as the Black-Scholes kernel we've used. Performance increases of 8 to 11% for the GPUs and 44% for the CPU were observed.


Bibliography

[1] The Green500 list of November 2016.
[2] NVIDIA OpenCL SDK, Black-Scholes kernel. compute/cuda/3_0/sdk/website/opencl/website/samples.html.
[3] OpenCL documentation by Khronos.
[4] Wikipedia page on SIMD.
[5] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(5):54-63, 2016.
[6] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637-654, 1973.
[7] L. Chai, Q. Gao, and D. K. Panda. Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system. In Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), May 2007.
[8] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 4:1-4:12, Piscataway, NJ, USA, 2008. IEEE Press.
[9] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a high-level language targeted to GPU codes.
[10] G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan. Optimal loop unrolling for GPGPU programs. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1-11, April 2010.
[11] V. Podlozhnyuk. Black-Scholes option pricing.
[12] V. Sarkar. Optimized unrolling of nested loops. In Proceedings of the 14th International Conference on Supercomputing, ICS '00, New York, NY, USA, 2000. ACM.
[13] A. Sclocco. Accelerating Radio Astronomy with Auto-Tuning. PhD thesis, Vrije Universiteit van Amsterdam.
[14] J. Tompson and K. Schlachter. An introduction to the OpenCL programming model.


Appendix

Hardware specifications

Device 0: GeForce GTX TITAN X
  CUDA Driver Version / Runtime Version:          8.0 / 8.0
  CUDA Capability Major/Minor version number:     5.2
  (24) Multiprocessors, (128) CUDA Cores/MP:      3072 CUDA Cores
  GPU Max Clock rate:                             1076 MHz
  Memory Clock rate:                              3505 MHz
  Memory Bus Width:                               384-bit
  Maximum Texture Dimension Size (x,y,z):         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers:  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers:  2D=(16384, 16384), 2048 layers
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Texture alignment:                              512 bytes
  Concurrent copy and kernel execution:           Yes, with 2 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       Yes
  Device PCI Domain ID / Bus ID / location ID:    0 / 3 / 0

Device 0: TITAN X (Pascal)
  CUDA Driver Version / Runtime Version:          8.0 / 8.0
  CUDA Capability Major/Minor version number:     6.1
  (28) Multiprocessors, (128) CUDA Cores/MP:      3584 CUDA Cores
  GPU Max Clock rate:                             1531 MHz
  Memory Clock rate:                              5005 MHz
  Memory Bus Width:                               384-bit
  Maximum Texture Dimension Size (x,y,z):         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers:  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers:  2D=(32768, 32768), 2048 layers
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Texture alignment:                              512 bytes
  Concurrent copy and kernel execution:           Yes, with 2 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       Yes
  Device PCI Domain ID / Bus ID / location ID:    0 / 130 / 0

Device 0: Tesla K20m
  CUDA Driver Version / Runtime Version:          8.0 / 8.0
  CUDA Capability Major/Minor version number:     3.5
  Total amount of global memory:                  5061 MBytes
  (13) Multiprocessors, (192) CUDA Cores/MP:      2496 CUDA Cores
  GPU Max Clock rate:                             706 MHz
  Memory Clock rate:                              2600 MHz
  Memory Bus Width:                               320-bit
  Maximum Texture Dimension Size (x,y,z):         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers:  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers:  2D=(16384, 16384), 2048 layers
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Texture alignment:                              512 bytes
  Concurrent copy and kernel execution:           Yes, with 2 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       Yes
  Device PCI Domain ID / Bus ID / location ID:    0 / 3 / 0

Device 0: Tesla K40c
  CUDA Driver Version / Runtime Version:          8.0 / 8.0
  CUDA Capability Major/Minor version number:     3.5
  (15) Multiprocessors, (192) CUDA Cores/MP:      2880 CUDA Cores
  GPU Max Clock rate:                             745 MHz
  Memory Clock rate:                              3004 MHz
  Memory Bus Width:                               384-bit
  Maximum Texture Dimension Size (x,y,z):         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers:  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers:  2D=(16384, 16384), 2048 layers
  Warp size:                                      32
  Maximum number of threads per multiprocessor:   2048
  Maximum number of threads per block:            1024
  Max dimension size of a thread block (x,y,z):   (1024, 1024, 64)
  Texture alignment:                              512 bytes
  Concurrent copy and kernel execution:           Yes, with 2 copy engine(s)
  Run time limit on kernels:                      No
  Integrated GPU sharing Host Memory:             No
  Support host page-locked memory mapping:        Yes
  Alignment requirement for Surfaces:             Yes
  Device has ECC support:                         Disabled
  Device supports Unified Addressing (UVA):       Yes
  Device PCI Domain ID / Bus ID / location ID:    0 / 3 / 0

Plots

Figure 7.1: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.2: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.3: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.4: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.5: The performance of the Black-Scholes kernel using an input size of options with a varying number of threads and vector size.

Figure 7.6: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.7: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.8: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.9: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.10: The performance of the Black-Scholes kernel on the Titan X (Pascal) using an input size of options with a varying number of threads and vector size.

Figure 7.11: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.12: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.13: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.14: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.15: The performance of the Black-Scholes kernel on the K40 using an input size of options with a varying number of threads and vector size.

Figure 7.16: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 7.17: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 7.18: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 7.19: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.

Figure 7.20: The performance of the Black-Scholes kernel on the E5-2630 using an input size of options with a varying number of threads and vector size.


NVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -

More information

MANY-CORE COMPUTING. 7-Oct Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, escience Center

MANY-CORE COMPUTING. 7-Oct Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, escience Center MANY-CORE COMPUTING 7-Oct-2013 Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, escience Center Schedule 2 1. Introduction, performance metrics & analysis 2. Programming: basics (10-10-2013)

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Evaluation Of The Performance Of GPU Global Memory Coalescing

Evaluation Of The Performance Of GPU Global Memory Coalescing Evaluation Of The Performance Of GPU Global Memory Coalescing Dae-Hwan Kim Department of Computer and Information, Suwon Science College, 288 Seja-ro, Jeongnam-myun, Hwaseong-si, Gyeonggi-do, Rep. of Korea

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

The Era of Heterogeneous Computing

The Era of Heterogeneous Computing The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview

More information

A Detailed GPU Cache Model Based on Reuse Distance Theory

A Detailed GPU Cache Model Based on Reuse Distance Theory A Detailed GPU Cache Model Based on Reuse Distance Theory Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal Eindhoven University of Technology (Netherlands) Henri Bal Vrije Universiteit Amsterdam

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

Vectorisation and Portable Programming using OpenCL

Vectorisation and Portable Programming using OpenCL Vectorisation and Portable Programming using OpenCL Mitglied der Helmholtz-Gemeinschaft Jülich Supercomputing Centre (JSC) Andreas Beckmann, Ilya Zhukov, Willi Homberg, JSC Wolfram Schenck, FH Bielefeld

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Waiting for Moore s Law to save your serial code start getting bleak in 2004 Source: published SPECInt data Moore s Law is not at all

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

More information

HPC future trends from a science perspective

HPC future trends from a science perspective HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles

More information

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Distributed ASCI Supercomputer DAS-1 DAS-2 DAS-3 DAS-4 DAS-5

Distributed ASCI Supercomputer DAS-1 DAS-2 DAS-3 DAS-4 DAS-5 Distributed ASCI Supercomputer DAS-1 DAS-2 DAS-3 DAS-4 DAS-5 Paper IEEE Computer (May 2016) What is DAS? Distributed common infrastructure for Dutch Computer Science Distributed: multiple (4-6) clusters

More information

CUDA programming. CUDA requirements. CUDA Querying. CUDA Querying. A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK

CUDA programming. CUDA requirements. CUDA Querying. CUDA Querying. A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK CUDA programming Bedrich Benes, Ph.D. Purdue University Department of Computer Graphics CUDA requirements A CUDA-capable GPU (NVIDIA) NVIDIA driver A CUDA SDK Standard C compiler http://www.nvidia.com/cuda

More information

Experts in Application Acceleration Synective Labs AB

Experts in Application Acceleration Synective Labs AB Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Auto-tunable GPU BLAS

Auto-tunable GPU BLAS Auto-tunable GPU BLAS Jarle Erdal Steinsland Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University of Science and Technology Department

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC? Nikola Rajovic, Paul M. Carpenter, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Mateo Valero SC 13, November 19 th 2013, Denver, CO, USA

More information

An Introduction to OpenACC

An Introduction to OpenACC An Introduction to OpenACC Alistair Hart Cray Exascale Research Initiative Europe 3 Timetable Day 1: Wednesday 29th August 2012 13:00 Welcome and overview 13:15 Session 1: An Introduction to OpenACC 13:15

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

Concurrent Manipulation of Dynamic Data Structures in OpenCL

Concurrent Manipulation of Dynamic Data Structures in OpenCL Concurrent Manipulation of Dynamic Data Structures in OpenCL Henk Mulder University of Twente P.O. Box 217, 7500AE Enschede The Netherlands h.mulder-1@student.utwente.nl ABSTRACT With the emergence of

More information

Timothy Lanfear, NVIDIA HPC

Timothy Lanfear, NVIDIA HPC GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

The Optimal CPU and Interconnect for an HPC Cluster

The Optimal CPU and Interconnect for an HPC Cluster 5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist

It s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt

More information

ATS-GPU Real Time Signal Processing Software

ATS-GPU Real Time Signal Processing Software Transfer A/D data to at high speed Up to 4 GB/s transfer rate for PCIe Gen 3 digitizer boards Supports CUDA compute capability 2.0+ Designed to work with AlazarTech PCI Express waveform digitizers Optional

More information

ECE 574 Cluster Computing Lecture 18

ECE 574 Cluster Computing Lecture 18 ECE 574 Cluster Computing Lecture 18 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 2 April 2019 HW#8 was posted Announcements 1 Project Topic Notes I responded to everyone s

More information

High Performance Computing with Accelerators

High Performance Computing with Accelerators High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing

More information

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7 General Purpose GPU Programming Advanced Operating Systems Tutorial 7 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous

More information

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation report prepared under contract with Dot Hill August 2015 Executive Summary Solid state

More information

Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators

Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators Ahmad Abdelfattah 1, Jack Dongarra 2, David Keyes 1 and Hatem Ltaief 3 1 KAUST Division of Mathematical and Computer Sciences and

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

NVIDIA nforce IGP TwinBank Memory Architecture

NVIDIA nforce IGP TwinBank Memory Architecture NVIDIA nforce IGP TwinBank Memory Architecture I. Memory Bandwidth and Capacity There s Never Enough With the recent advances in PC technologies, including high-speed processors, large broadband pipelines,

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

BEST BANG FOR YOUR BUCK

BEST BANG FOR YOUR BUCK Carsten Kutzner Theoretical & Computational Biophysics MPI for biophysical Chemistry BEST BANG FOR YOUR BUCK Cost-efficient MD simulations COST-EFFICIENT MD SIMULATIONS TASK 1: CORE-H.XTC HOW TO GET OPTIMAL

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

CS560 Lecture Parallel Architecture 1

CS560 Lecture Parallel Architecture 1 Parallel Architecture Announcements The RamCT merge is done! Please repost introductions. Manaf s office hours HW0 is due tomorrow night, please try RamCT submission HW1 has been posted Today Isoefficiency

More information

Overview of Project's Achievements

Overview of Project's Achievements PalDMC Parallelised Data Mining Components Final Presentation ESRIN, 12/01/2012 Overview of Project's Achievements page 1 Project Outline Project's objectives design and implement performance optimised,

More information