Spreadsheet optimisation on GPUs

Spreadsheet optimisation on GPUs
Using Microsoft Accelerator

Bachelor thesis
The IT University of Copenhagen
May 18th, 2010

Authors: Tim Garbos, Kasper Videbæk
Supervisor: Peter Sestoft

Abstract

The objective of this project is to investigate whether it is possible to use the GPU for parallelizing evaluations in spreadsheets. We develop an experimental prototype based on the CoreCalc[18] spreadsheet implementation and interface with the GPU using Microsoft Accelerator[8]. CoreCalc introduces the concept of sheet defined functions, which are functions that users can define inside a spreadsheet. The project presents ideas for how to use the GPU efficiently in spreadsheets, with a focus on implementing sheet defined functions for the GPU. Benchmarking is performed to compare the performance to the original approach of compiling sheet defined functions to .NET bytecode. Methods of estimating the execution time of a function on the GPU and the CPU are also discussed. Based on our experiments and benchmarks we conclude that it is possible to use the GPU for evaluating functions in spreadsheets, but rather large amounts of data are needed to obtain performance gains. On current GPUs, such gains are realistic in complex, data-heavy cases such as Monte Carlo simulations.

Contents

1. Introduction
   1.1. Context and motivation
   1.2. Problem statement
   1.3. Goals and methods
   1.4. Thesis overview
2. Background
   2.1. Spreadsheet technology
        Introduction to spreadsheets
        CoreCalc spreadsheet implementation
   2.2. Parallel computing and GPGPU
        Parallel Programming Methods
        Stream processing
        Programming for the GPU
        Bottlenecks
        Problems
   2.4. Previous work on parallelism and GPGPU in spreadsheets
3. Microsoft Accelerator
   3.1. Accelerator and the GPU
   3.2. The C# interface
        Operations
        Programming
   3.4. Other findings
        Behavior of booleans
4. Analysis of Microsoft Accelerator
   4.1. Hardware setup
   4.2. Constructing the tests
   4.3. Test results
5. GPGPU approaches for spreadsheets
   5.1. Single normal built-in functions
   5.2. Sheet defined functions
   5.3. Higher order Map function
   5.4. When to use the GPU?
        Estimating execution time
6. Implementation of prototype
   6.1. Built-in functions
   6.2. Sheet defined function
        Accelerator Abstract Syntax
        Higher order functions
        Evaluation leafs on the CPU
7. Performance test of prototype
   How the tests were executed
   Floating point precision and performance
   Hardware
   Results
        Built-in function
        Sheet defined function
8. Perspective
   The future of GPUs
   Other targets of parallelism using Accelerator
   Known problems
        Converting Values
        Limitations in Accelerator abstract syntax
        Exceeding memory and texture limits
        Worst Case Execution Time estimation
9. Conclusion
Bibliography
Appendix
A. Test setups and results from prototype benchmarking
   A.1. Built-in functions
   A.2. Sheet defined functions
        A.2.1. Fibonacci sequence

List of Figures

2.1. Heron's Formula as a Sheet Defined Function in CoreCalc
4.1. Time from sending a data input to getting it back again
4.2. Performance tests of single simple operations on the GPU compared to the CPU
4.3. Performance tests of matrix multiplication on the GPU compared to the CPU
4.4. Graph of nested addition operations
4.5. Performance tests on nested mixed operations on the GPU compared to the CPU
5.1. Formula for estimated computation time using the CPU
5.2. Formula for estimated computation time using the GPU
6.1. Class hierarchy of Expr types[18]
6.2. Class hierarchy of AccExpr
A.1. Matrix multiplication test setup
A.2. Matrix multiplication performance results
A.3. Different variations of Heron's formula in CoreCalc
A.4. Heron's Formula with random data, performance results
A.5. Heron's Formula with parameter data, performance results
A.6. Heron's Formula with constant data, performance results
A.7. Performance test of the Fibonacci sequence
A.8. The valentine simulation SDF defined in CoreCalc
A.9. The results of the valentine simulation
A.10. The performance results of the π approximation simulation
A.11. The commute time simulation SDF defined in CoreCalc
A.12. The performance results of the commute example

Code samples

2.1. An embarrassingly parallel problem
3.1. Accelerator code for addition
3.2. Heron's formula in FPA
3.3. Booleans in Accelerator
4.1. evalTarget code
4.2. FloatParallelArray constructors
4.3. Add operation
6.1. Applier
6.2. GPU and CPU applier
6.3. Original code for creating MMULT
6.4. Modified code for creating MMULT
6.5. AccExpr class
6.6. AccNumber class
6.7. Generate FPA for AccInput
6.8. GenerateBPA of AccComp
Method for minimizing count of created AccExpr objects
Original CoreCalc code for Tabulate
Modified CoreCalc code for Tabulate

List of Tables

4.1. Hardware specification
4.2. Test results from single operations
Original input arguments for the call TABULATE(F, 2, 8)
Reorganized input arguments for the call TABULATE(F, 2, 8)

Foreword

This thesis was written by Tim Garbos and Kasper Videbæk in the period from February 2010 to May 2010 at the IT University of Copenhagen. It is part of a 15 ECTS Bachelor's thesis project supervised by Peter Sestoft. This report, including additional documents, files and source code, is available for download.

Chapter 1. Introduction

1.1. Context and motivation

Spreadsheets are used in almost every business, for everything from keeping track of work hours, project planning, and simple lists to financial calculations and research simulations. While one might not even notice the time it takes to recalculate a spreadsheet that keeps track of work hours, the time needed to calculate a complex Monte Carlo based financial simulation is indeed noticeable. The focus of this thesis is primarily on optimizing this kind of calculation.

1.2. Problem statement

The objective of this project is to investigate whether using the GPU for parallelizing the evaluation of functions in spreadsheets is possible and whether this can be done with a performance gain. Within this objective we also want to investigate whether sheet defined functions (described in section 2.1) can be efficiently restructured to fit the GPU target.

1.3. Goals and methods

To do this we develop an experimental prototype based on the CoreCalc[18] spreadsheet engine. This experimental prototype also includes testing of the Microsoft Accelerator[8] framework v2 preview release for interfacing with the GPU. Our investigation includes searching for and relating to literature about spreadsheet optimisation in relation to parallelism and GPGPU (General Purpose computation on Graphics Processing Units).

We will analyse Microsoft Accelerator and look into its strengths, limitations, and weaknesses in relation to parallelizing spreadsheet evaluations. Furthermore we will benchmark different types of operations and look at the performance difference between a GPU and a CPU. With Microsoft Accelerator's strengths and limitations in mind, and based on the current literature, we will discuss different approaches to parallelizing spreadsheets. We will also discuss a model for estimating, at evaluation time, the execution time of a specific function on the GPU. The prototype implements different approaches to parallel evaluation in spreadsheets; we document and discuss the design of these implementations. Finally, we benchmark our prototype against the original CoreCalc implementation, using different spreadsheet based simulations and real life examples, to determine whether or not it makes sense to evaluate spreadsheet calculations on a GPU.

1.4. Thesis overview

This thesis is structured as follows. Chapter 2 provides background on spreadsheet technology, on parallel computing in relation to graphics processing units, and on the previous research and work done in this area. Chapter 3 gives a technical overview of Microsoft Accelerator, which claims to provide an easy to use interface to parallel computing and GPUs. In chapter 4 we analyse and test Microsoft Accelerator in order to determine what possibilities it provides that may be relevant for parallel evaluation of spreadsheets. The results of chapter 4 are used in chapter 5 to discuss different approaches to parallelizing spreadsheets using the GPU. Chapter 6 describes our experimental prototype; we document and discuss its design and structure. In chapter 7 we document our benchmarks of the experimental prototype against the original CoreCalc implementation and analyse the test results. Chapter 8 relates our results to the future of graphics processors and discusses known problems in our implementation. Finally, we summarize our results and conclude whether or not it is possible to efficiently evaluate spreadsheets in parallel using the GPU.

Chapter 2. Background

2.1. Spreadsheet technology

Introduction to spreadsheets

The most popular spreadsheet application today is Microsoft Excel, but computerized spreadsheets as we know them have been in use since the first WYSIWYG (What You See Is What You Get) spreadsheet application, VisiCalc, developed for the Apple II computer by Dan Bricklin and Bob Frankston [4]. In common spreadsheet software a spreadsheet consists of a workbook that can contain several sheets. Each sheet has multiple cells that together form a grid of rows and columns. Each cell can contain a text value, a number value or a formula. A formula describes how the value of the cell can be calculated from the values of other cells. When a cell is updated, the values of all cells referencing that cell are recalculated. This ensures that the cells of the entire sheet automatically stay up to date.

CoreCalc spreadsheet implementation

In this thesis we base our prototype on CoreCalc [18]. CoreCalc is an open source implementation of core spreadsheet functionality in C#. It is developed at the IT University of Copenhagen and is intended only as a platform for experiments with new technology and functionality. As the documentation states, it is not a replacement for Microsoft Excel, Gnumeric or Open Office Calc, but a research prototype. It might have been possible to base our prototype on Open Office Calc or Gnumeric, as they are both open source, but they are also far more complex and feature rich than CoreCalc, and it might not have been possible to implement our prototype without rewriting parts of their spreadsheet engines. CoreCalc, however, is built as a platform for new experiments, and it features sheet defined functions. Sheet defined functions are separate from the normal spreadsheet evaluation, which gives more possibilities for optimisation. Neither Gnumeric nor Open Office Calc has sheet defined functions.

Spreadsheet programs

The following overview of how a spreadsheet program works is based on A Spreadsheet Core Implementation in C# [18]. Spreadsheet programs are dynamically typed functional programs that are programmed through simple formulas in cells. Spreadsheets handle data types such as strings, numbers, logical expressions, and matrices, but they are handled dynamically. This makes formulas very dynamic, and one can easily introduce an error such as =SQRT(IF(A1 < 0; "Hello world"; 25)), which returns 5 if A1 >= 0 and otherwise returns an error, because SQRT only takes a number as its argument. Functional programming is a paradigm that resembles the evaluation of mathematical functions; for example, the evaluation of spreadsheet cells avoids state and mutable data. One cell cannot change the value of another cell unless that cell somehow depends on it. In functional languages one distinguishes between strict (eager) and non-strict (lazy) evaluation. In eager evaluation all expressions are evaluated regardless of whether they are used or not. In lazy evaluation an expression is only evaluated when there is a demand for it, and the result is then cached so that other demands can use the cached value. Spreadsheets have a similar concept in that a cell is only re-evaluated when one of the cells it references is updated.

Sheet defined functions

CoreCalc introduces a new concept called Sheet Defined Functions (or SDFs for short). Sheet defined functions allow spreadsheet users to define functions which can be used just like normal built-in functions throughout the entire workbook.

Figure 2.1.: Heron's Formula as a Sheet Defined Function in CoreCalc

In fig. 2.1 Heron's formula is implemented. The green cells are input cells that act as arguments for the function, and the blue cell is the output cell, whose formula depends on the input cells. The user is able to use normal formulas and cell references when defining the function. In this figure Heron's formula is implemented with A4 and B4 as input cells, while the last side of the triangle simply calls a random function. In CoreCalc, SDFs are compiled into .NET bytecode for optimised calculations.

2.2. Parallel computing and GPGPU

Parallel computing has been around since the 1950s and has been employed in high-performance computing (computer clusters). It has mainly been used for research, and the first dual-core processors reached the public in the mid-2000s. The reason for the growth in public interest is the physical constraints preventing further increases in the number of operations a single CPU core can perform (frequency scaling). Instead of frequency scaling, more cores are now added to CPUs, and according to Intel[11], an 8-core Xeon CPU has been announced. However, multicore CPUs are not the only modern method of parallelism in computers. General Purpose computation on Graphics Processing Units (GPGPU) means using the GPU, which is designed for computing graphics, for purposes that are normally handled by the CPU. Graphics cards were originally designed to be parallel in order to process each vertex in a 3D model independently. The term GPGPU was first coined in the 2002 paper Physically-Based Visual Simulation on Graphics Hardware [13]; however, research in using GPUs for general purpose computation had been around for a while. In 1999 the PixelFlow SIMD graphics computer was used to crack UNIX password ciphers[7] using a brute-force attack. Graphics card manufacturers have lately added more precise arithmetic to the GPU, making it more suitable for non-graphics work such as scientific computing[14]. In 2006 both NVIDIA's CUDA SDK and ATI's CTM SDK were made public, thereby making GPGPU possible without detailed expert knowledge of the graphics API. Because GPUs are designed for graphics processing they are very limited in terms of programming possibilities and are only efficient when the problems can be computed using stream processing. They can only process single vertices, but can process multiple vertices in parallel. Modern GPUs typically have more than 100 cores (processors), and the new Fermi model from NVIDIA introduces 512 cores[14]. This makes GPUs ideal for operations that should be applied to every value of a large dataset. When writing parallel programs for a 4-core CPU, the theoretical maximum performance gain is 400% (minus some overhead). As GPUs work very differently from normal CPUs and have far more cores, the theoretical maximum performance gain is far better.

Parallel Programming Methods

There are two main parallel programming methods: task-parallelism, where each task can be separated and executed on another processor while still communicating with other tasks, and data-parallelism, where the data for one task can be partitioned and processed individually. GPUs do not fully support synchronization between processors, so classical task-parallel programming methods cannot be used: a thread cannot spawn a new thread or send results to other threads. This leads to data-parallelism.

for (int i = 0; i < N; i++)
{
    r[i] = a[i] + b[i];
}

Listing 2.1: An embarrassingly parallel problem

Given the above C# code, the array r is sequentially filled with the results of the add operation. This can be optimised using data-parallelism: the arrays a and b can be divided into smaller chunks, and given several processors, the chunks can be processed independently and later collected. The example can be generalized by substituting a[i] + b[i] with any non-volatile function; a sketch of such a chunked, data-parallel evaluation is shown below.
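To illustrate the idea, here is a minimal sketch of ours (not part of CoreCalc); it uses Parallel.For from .NET 4, which partitions the index range across the available cores, and the delegate f plays the role of the non-volatile function:

using System;
using System.Threading.Tasks;

static class DataParallelDemo
{
    // Applies a side-effect-free function element-wise; the runtime
    // splits the index range into chunks and runs them on all cores.
    static float[] MapParallel(float[] a, float[] b, Func<float, float, float> f)
    {
        var r = new float[a.Length];
        Parallel.For(0, a.Length, i => { r[i] = f(a[i], b[i]); });
        return r;
    }

    static void Main()
    {
        float[] a = { 1f, 2f, 3f }, b = { 4f, 5f, 6f };
        float[] sum = MapParallel(a, b, (x, y) => x + y);
        Console.WriteLine(string.Join(", ", sum)); // 5, 7, 9
    }
}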

Problems such as this, which can easily be executed in parallel, are classified as embarrassingly parallel problems. Graphics cards solve many data-parallel problems when doing graphical computations, as in games, and modern GPUs typically have over 100 cores[14].

Stream processing

Stream processing is a programming paradigm that uses data-parallelism. In stream processing an algorithm is constructed by defining a kernel, which is a function that processes data and returns an output. A kernel is applied to a set of data called a stream. This works well for processing images and video, but it is not designed for general purpose processing using random data access, control flow or database lookups. GPUs, however, are designed to be efficient at stream processing. Stream processing is intended for applications that have a high number of arithmetic operations per I/O and where every element of the stream should have the same function applied to it. This means that an optimal application for the GPU has a large data set, a high degree of parallelism, a kernel of high arithmetic complexity to be applied to every element, and minimal dependency between operations. In sequential programming for the CPU it is common to control the flow of the program using loops or conditions such as if/then/else. Such flow control has until recently not been possible on the GPU and is still quite limited: some recent GPUs allow branching (if/then/else), but not without a performance loss[2].

Programming for the GPU

Some low level APIs allow for general purpose GPU programming. This section gives a short overview of the different possibilities.

CUDA

Compute Unified Device Architecture, or CUDA, is NVIDIA's architecture for communicating with the GPU from standard programming languages, in this case C, though wrappers for other languages exist. It shares a large part of its interface with both OpenCL and DirectCompute, but is only available for NVIDIA hardware.

OpenCL

The Open Computing Language is an open standard in the spirit of OpenGL and OpenAL (3D graphics and audio) for writing data-based and task-based parallel applications. It shares a range of interfaces with CUDA and DirectCompute, but is managed by the non-profit technology consortium Khronos Group. OpenCL is not bound to specific hardware, and AMD has decided to support OpenCL instead of its now deprecated Close to Metal API (AMD's alternative to CUDA).

DirectCompute

As part of the DirectX framework, DirectCompute is a low level API for programming the GPU. Naturally this too shares a range of interfaces with OpenCL and CUDA.

Microsoft Accelerator

For all of the above frameworks, higher level wrappers exist, and you can program in Python, .NET, Java, or any other popular mainstream language. Other, more specialized frameworks that avoid GPU specific programming are also being developed. Microsoft Accelerator [17] is a high level framework that allows data-based parallelism on the GPU through DirectCompute. It was first introduced in the 2005 technical report Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses [8] by David Tarditi, Sidd Puri, and Jose Oglesby. Their problem statement is that GPUs are difficult to program for general-purpose uses: programmers must either convert their programs to use graphics pipeline operations or use APIs for stream processing. The result of their project is Microsoft Accelerator, a library that uses data-parallelism to program GPUs. The idea is that programmers can use a normal imperative language together with a high level API for data-parallel operations, without worrying about the GPU. The Microsoft Accelerator library compiles the high level data-parallel operations to optimised pixel shaders on the fly. Their benchmarks show that the speed of some compiled operations is comparable to hand-written pixel shaders, with performance typically within 50% of the hand-written shader.

Bottlenecks

When programming for the GPU, the data has to be transferred to the GPU before the result can be computed, and the result has to be transferred back as well. This creates a latency between the CPU and the GPU. The bandwidth and memory limitations of the GPU also differ from those of the CPU. Later in the project we identify these bottlenecks.

Problems

Today integer and double operations are only supported on the newest NVIDIA Fermi cards using CUDA, but according to NVIDIA[15] this problem will be solved in the near future. The same goes for floating point precision, which does not yet match the IEEE 754 floating point standard[10].

2.4. Previous work on parallelism and GPGPU in spreadsheets

We have not been able to find any research or products that explore the use of the GPU to optimize heavy spreadsheets. The topic has been mentioned several times in online forums and in interviews. When it comes to parallelism and spreadsheets in general, only a limited amount of research and products exist. In Andrew P. Wack's PhD dissertation from 1996, Partitioning dependency graphs for concurrent execution: A parallel spreadsheet on a realistically modeled message passing environment [19], he describes how to partition the spreadsheet graph in order to split the computation across multiple computers on a network. Jens Hamann's Master's thesis from 2010 on Parallelization of spreadsheet computations [9] investigates the possibilities of evaluating spreadsheet calculations in parallel and adapts some of Wack's theories to modern multicore CPUs using CoreCalc. Microsoft Excel 2007 has an option for enabling a multithreaded calculation engine that is described as being able to partition groups of cells that can be parallelized. There is no technical documentation of the design of Excel's multithreaded calculation engine or of how it partitions the graph. Hamann mentions the possibility of using the parallel power of the GPU in spreadsheets, and further mentions Microsoft Accelerator as a library that would enable elegant implementations of such an approach. As a consequence of there being very little information on GPGPU and spreadsheets, we have broadened our search to include internet forums. In an interview about GPGPU, Ian Buck from NVIDIA says: "We think your spreadsheet might already be fast enough. While video processing was an obvious application to accelerate (...)". This quote is questioned several times in the comments. Dr. Drey writes: "NVIDIA, saying that spreadsheet is already fast enough may be misleading. Business users have the money. Spreadsheets are already installed (huge existing user base). Many financial spreadsheets are very complicated 24 layers, 4,000 lines, with built in Monte Carlo simulations. Making all these users instantly benefit from faster computing may be the road for success for NVIDIA." Other comments support this idea. On the NVIDIA CUDA forum, in the topic CUDA, NVIDIA GPUs and Microsoft Excel [12], different approaches to and reasons for using CUDA in spreadsheets are discussed. A specific computationally heavy sheet is discussed, and it is concluded that it is not the kind of job for CUDA or any other GPGPU approach. This is based on the fact that GPUs have an SPMD (single program, multiple data) approach, which means they work best when a single kernel (operation) is applied to a big data set. The argument continues, and a theoretical CUDA-accelerated spreadsheet is discussed.

This theoretical spreadsheet would have to be sliceable into series that run the exact same formula on a big data set, such as a whole column. That formula could then be compiled into a kernel and the data uploaded as a stream. It is stated in the forum thread that the sheet would have to be outrageously big (hundreds of thousands of lines) for a simple function to be optimized using the GPU. This statement might have been correct at the time, due to the high transfer latency between the GPU and the CPU, and because GPUs at the time of writing (2008) were not as fast and as well suited for GPGPU as modern GPUs. Moreover, when using complex Monte Carlo simulations, thousands of lines are not unrealistic. Challenges such as float accuracy and rounding problems are also mentioned. In the same forum thread, Hyun-Gon Ryu from Yonsei University discusses different approaches for integrating CUDA with Microsoft Excel using VBA; that is not within the scope of this project, though.

Chapter 3. Microsoft Accelerator

Microsoft Accelerator (referred to as Accelerator from here on) was first developed as a research project aiming to create a GPGPU framework for C#[8]. In its second version it has turned into a general framework for solving data-parallel problems. The current version supports calculations both on the GPU and on multiple processors, and later versions might implement other targets, for example one for FPGAs[17]. This project primarily investigates the DX9Target of Accelerator v2, and this chapter gives a quick overview of how Accelerator works as a middle layer between C# programmers and the GPU.

3.1. Accelerator and the GPU

Accelerator's DirectX 9 target solves data-parallel problems by translating the data into textures and the operations into texture shaders. This allows us to do calculations on the GPU. The procedure for translating the function and data in Accelerator is described in Introduction to Accelerator [17]:

1. Translate the processing code into a form suitable for a GPU by converting it to a DirectX 9 pixel shader.
2. Translate the data into a format that is suitable for the processor by converting it to a DirectX 9 texture.
3. Transfer the shader and textures to the processor and run the operation. DirectX 9 and the associated drivers partition the data and schedule execution on the various pixel shaders. With other processors, your application might have to handle some or all of these tasks.
4. When the operation is complete, retrieve the texture containing the results and convert it back to an array.

3.2. The C# interface

Accelerator works with datatypes called parallel arrays, represented by the classes FloatParallelArray (FPA), BoolParallelArray (BPA) and IntParallelArray (IPA). All of these inherit from the class ParallelArray (PA). Operations on parallel arrays are functional in nature, with no state or mutable data: they have no side effects, operations do not modify their arguments, and results are returned in new arrays.

Operations

Before we go further we need to quickly classify the operation types. Accelerator classifies operations in six categories, described in Introduction to Accelerator [17]:

Construction: The framework provides methods for creating parallel arrays from a System.Array containing the same elements.

Conversion: It also provides methods for converting a parallel array result back into the System.Array type.

Element-wise operations: Most operations are element-wise; take for example the add operation. It takes the nth element of the first array and adds it to the nth element of the second array, resulting in a new array of the same size.

Reductions: Reduction operations reduce the size of the array. A sum operation on an n × m array may compute the sum of each row and return a 1 × m array, or it may even compute the sum of all values and return a 1 × 1 array.

Transformations: Operations that transform the organization of the elements, such as matrix transpose.

Linear algebra: Accelerator provides binary matrix operations such as scalar product, matrix multiplication and outer product.

Programming

The Accelerator Programmers Guide[16] describes the following general procedure for working with Accelerator:

1. Create input arrays.
2. Load each array from step 1 into an Accelerator data-parallel array object.
3. Process the input data by applying Accelerator operations to the data-parallel array objects.
4. Evaluate the results of the operation on a target processor, which returns an array containing the processed data.

All operations, except construction operations, are created by calling static member functions on the PA class; an add operation is created by calling PA.Add, and so forth. Parallel arrays with values are created by constructing FPA objects. The following code (listing 3.1) shows an example where two arrays are created and, using the element-wise Add operation, each pair of elements is added together and returned in the result array.

float[] AddTwoNumbers(float[] x, float[] y) {
    var inputX = new FPA(x);
    var inputY = new FPA(y);
    var add = PA.Add(inputX, inputY);
    float[] result = new float[x.Length];
    evalTarget.ToArray(add, out result);
    return result;
}

Listing 3.1: Accelerator code for addition

The following example is an implementation of Heron's formula, which gives the area of a triangle from its three sides:

s = (a + b + c) / 2
A = sqrt(s(s - a)(s - b)(s - c))

The formula can be implemented in Accelerator like this:

float[] HeronGPU(float[] x, float[] y, float[] z)
{
    var fpaX = new FPA(x);
    var fpaY = new FPA(y);
    var fpaZ = new FPA(z);

    var s = PA.Divide(PA.Add(PA.Add(fpaX, fpaY), fpaZ), 2);

    var sx = PA.Subtract(s, fpaX);
    var sy = PA.Subtract(s, fpaY);
    var sz = PA.Subtract(s, fpaZ);

    var area = PA.Sqrt(PA.Multiply(s, PA.Multiply(PA.Multiply(sx, sy), sz)));

    float[] result = new float[x.Length];
    evalTarget.ToArray(area, out result);
    return result;
}

Listing 3.2: Heron's formula in FPA

Notice in this example (listing 3.2) that we create only one FPA per float array, and that we reuse these several times while building up the Accelerator operations. If we created several FPAs for a single array, we would transfer the same data several times. This becomes important during the implementation in CoreCalc. Further, notice that PA.Divide takes both an FPA and an ordinary float, meaning that each element in the FPA is divided by two. Most element-wise functions allow this.
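For comparison with the CPU benchmarks later in the thesis, a plain sequential C# version of the same computation could look as follows (our own baseline sketch, not taken from the CoreCalc sources; it requires using System for Math.Sqrt):

// Sequential CPU version of Heron's formula, element-wise over
// three arrays of triangle side lengths.
static float[] HeronCPU(float[] x, float[] y, float[] z)
{
    var result = new float[x.Length];
    for (int i = 0; i < x.Length; i++)
    {
        float s = (x[i] + y[i] + z[i]) / 2f; // semi-perimeter
        result[i] = (float)Math.Sqrt(s * (s - x[i]) * (s - y[i]) * (s - z[i]));
    }
    return result;
}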

3.4. Other findings

While working with Accelerator, we have noticed some behavior that is worth mentioning.

Behavior of booleans

Creating BPAs from booleans and working logically with these seems to generate some puzzling results. Consider the following code:

Target t = new DX9Target();
bool[] x = { false, false, false, false };
bool[] r = new bool[4];
bool[] r2 = new bool[4];
t.ToArray(new BPA(x), out r, ExecutionMode.ExecutionModeNormal);
t.ToArray(PA.Not(new BPA(x)), out r2, ExecutionMode.ExecutionModeNormal);

Listing 3.3: Booleans in Accelerator

After executing this (listing 3.3), r holds the values {true, false, false, false} and r2 holds {false, false, false, false}, neither of which is what one would expect from an all-false input. Conditional statements, PA.Cond(BPA, FPA, FPA), seem to work as intended in Accelerator, and compare functions such as PA.CompareLessThan(FPA, FPA) also seem to return true or false as logic would prescribe. Applying further logical operations such as PA.Not to a BPA returned by a compare function also seems to give correct results. Our conclusion is that it is possible to work with conditional statements in Accelerator; however, you should never construct BPAs yourself - instead you should rely on Accelerator to create them for you.

Floating point numbers

The Accelerator API works only with floating point numbers of the float type, most likely because at the time of writing (and development) of Accelerator most GPUs only supported 32-bit floats. Further, many graphics cards are not IEEE 754-compliant[10], which means that results calculated on the GPU might differ from results calculated on the CPU, even at the same nominal precision. NVIDIA Fermi will support double precision in the future[14].

Parameters

Values in FPAs cannot be substituted once the FPA object has been created. This means we cannot build up an expression tree and later substitute the values inside it, which would be ideal for constructing function calls in Accelerator syntax and later substituting the parameters. The Accelerator developers may include this feature in the future[17].

Random numbers

Accelerator does not implement any way to generate random numbers. If random data is needed, it has to be generated on the CPU and transferred to the GPU. Future versions of Accelerator are expected to implement this[17].
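Returning to the boolean findings above, the recommended pattern is to let a compare operation produce the BPA that feeds a conditional. A small sketch of ours, using only operations named in this chapter:

// Element-wise max(a, 0). The BPA is produced by CompareGreater,
// never constructed by hand from a bool[], per the advice above.
static float[] ClampToZero(Target evalTarget, float[] a)
{
    var fpa = new FPA(a);
    var zero = new FPA(0f, a.Length);              // constant array of zeros
    var isPositive = PA.CompareGreater(fpa, zero); // BPA created by Accelerator
    var clamped = PA.Cond(isPositive, fpa, zero);  // element-wise if/then/else
    var result = new float[a.Length];
    evalTarget.ToArray(clamped, out result);
    return result;
}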

Chapter 4. Analysis of Microsoft Accelerator

In this chapter we look into calculations in Accelerator and benchmark them against C# implementations. We identify operations from spreadsheets that may be possible to optimize using Accelerator, and look into the limitations of the framework. We will try to answer the following five questions:

Maximum data: What is the maximum amount of data that we can send to the GPU?

Transfer time: What is the minimum transfer time to the GPU, and how does the amount of data affect the transfer time?

Single operations: What is the performance impact on single operations?

Complex operations: What is the performance impact on complex operations?

Value creation: Is there any performance impact of the different ways to create values?

4.1. Hardware setup

All tests have been run on one specific machine; different hardware setups will of course yield different results in the following performance tests. The machine is what would normally be classified as a gaming machine. It is from Hewlett Packard and the model number is Z400. The GPU is an NVIDIA GT240 and the CPU is an Intel Xeon W3505. It runs Windows XP 32-bit and has DirectX 9 installed. The following table summarizes the hardware specifications.

NVIDIA GT240
  CUDA Cores: 96
  Graphics Clock: 550 MHz
  Processor Clock: 1340 MHz
  Memory Clock: 1700 MHz GDDR5
  Memory: 1 GB
  Memory Interface Width: 128-bit
  Memory Bandwidth: 54.4 GB/sec
  Bus Support: PCI-E 2.0

Intel Xeon W3505
  Cores: 2
  Threads: 2
  Clock speed: 2.53 GHz
  Intel Smart Cache: 4 MB
  Instruction set: 64-bit

Table 4.1.: Hardware specification

The machine further has 4096 MB of DDR3 RAM installed.

4.2. Constructing the tests

In order to ensure stable results in the performance tests, each test case has been constructed to be executed and timed 100 times with randomly generated input data. The results of the GPU and CPU versions of the test cases have been compared and verified to be approximately correct (taking floating point precision problems on the GPU into consideration). The tests have all been built using Visual Studio 2010 release settings and executed outside the Visual Studio environment to ensure that no unnecessary monitoring was done.

4.3. Test results

Maximum data

There are two important factors to consider when looking at the amount of data we can transfer to and process on the GPU: the maximum texture size, and the amount of memory available. The maximum texture size defines the maximum width and height of an array of floats that we are able to send to the GPU. The memory further limits how many textures, and how complex shaders, can be stored. To find the maximum texture size, we simply sent textures of increasing size to the GPU until it returned an error, which happened at around 8000 × 8000. We were not able to find a method for determining the maximum complexity of operations; however, we did find that matrix multiplication is only possible on sizes smaller than two 240 × 240 arrays.

Transfer time

Transfer time can be divided into two components: latency and transfer speed. We define latency as the initial time it takes to transfer data to the GPU. Transfer speed is defined as the time it takes to transfer a single float value. This model is simplified in relation to the hardware architecture, but it suits our purposes:

transferTime(x) = latency + x · speed

To measure the transfer time, we ran the following code on different sizes of x. Note that even though we use the term transfer time, it might be more accurately described as the overhead of any Accelerator evaluation on a GPU target.

evalTarget.ToArray(new FloatParallelArray(x), out result);

Listing 4.1: evalTarget code

Figure 4.1.: Time from sending a data input to getting it back again

Using regression on the data depicted in figure 4.1, we arrived at a latency of 2.1 ms and a speed of 1.68E-05 ms per float. Note that this is the sum of the time for transferring the data to the GPU and transferring it back.
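Plugging the measured constants into the simple model above gives a quick estimator. The sketch below is ours, with the latency and per-float cost hard-coded from the regression:

// Estimated round-trip time (ms) for evaluating an Accelerator
// expression over n floats, using the constants measured above.
static double EstimateTransferMs(long n)
{
    const double latencyMs = 2.1;      // fixed overhead per evaluation
    const double msPerFloat = 1.68e-5; // marginal cost per float, both ways
    return latencyMs + n * msPerFloat;
}

For example, EstimateTransferMs(1000000) gives about 18.9 ms for a round trip of one million floats.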

Single operations

In this section we look into the performance impact of a few single operations in Accelerator and compare them with C# implementations that do the same job. We further derive the actual cost of an operation by doing linear regression on the data and looking at the difference between the overall time spent and the transfer time examined in the previous section.

Figure 4.2.: Performance tests of single simple operations on the GPU compared to the CPU

Using linear regression, we looked at the slopes of both the C# versions and the Accelerator counterparts of the functions above (If, Add, Div, Mul, Sub, Square, and Sum). Delta Slope in the table below means the difference between the slope for a whole operation and the slope for transferring the data. Note that it has not been possible to measure the slope for transferring two constants; we simply multiplied the value for transferring one constant by two. Also note that the Sum operation only has half the transfer slope: for a sum operation, while a lot of data is transferred to the GPU, only a single value is transferred back. These are of course simplifications, and are to some extent inaccurate.

Operation   Slope C#    Slope Transfer   Slope Operation   Delta Slope
Add         6.00E-06    3.36E-05         5.64E-05          2.28E-05
Sub         6.84E-06    3.36E-05         4.57E-05          1.21E-05
Mul         6.82E-06    3.36E-05         4.54E-05          1.18E-05
Div         7.81E-06    3.36E-05         4.57E-05          1.21E-05
Sqr         8.57E-05    1.68E-05         2.00E-05          3.21E-06
Sum         1.69E-05    8.40E-06         2.04E-05          1.20E-05
If          1.67E-05    3.36E-05         2.02E-05          -1.34E-05

Table 4.2.: Test results from single operations

The table gives us an estimate of how well the GPU performs a given operation compared to the CPU, also when the transfer overhead is excluded. Looking only at the slopes, most functions will, given enough data, perform faster on the CPU than on the GPU. However, because of the limits of the GPU, it might not always be possible to reach the amount of data needed. The slope for the Sum operation on the GPU is very close to the slope for the CPU version; this is a general tendency for reduction operations on GPU targets[17]. If we had a slower graphics card or a faster processor, this operation would actually be slower overall, leaving us with no reason at all to transfer such an operation to the GPU.

Figure 4.3.: Performance tests of matrix multiplication on the GPU compared to the CPU

We found good performance gains for a few single operations; matrix multiplication was one. It is a more complex function that requires a series of arithmetic operations, and the test results above show that a performance gain is possible at realistic data sizes.

Complex operations

In this test case we construct complex operations by building graphs of simple operations. This is done to test how the complexity of an operation affects the time spent on the computation, and to mimic possible spreadsheet formulas, for example where A1 has the formula = B1 + C1 and B1 and C1 have formulas pointing to other cells. These graphs have been constructed by nesting simple arithmetic operations as shown in fig. 4.4: each time the graph grows by one, the previously generated graph is used as the left leaf of a new operation, and the right leaf is the same constant FPA as used earlier.

Figure 4.4.: Graph of nested addition operations

Similar graphs for subtraction, multiplication, division, and a graph of mixed operations have been used in the test. The graph of mixed operations switches between multiplication, addition, and subtraction, starting with multiplication.
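As an illustration of how such a graph is built with Accelerator operations (our own sketch; PA and FPA are the abbreviations introduced in chapter 3):

// Builds a left-deep graph of n nested Add operations:
// ((input + c) + c) + ... Reusing the same constant FPA means
// the constant is only transferred to the GPU once.
static FPA BuildNestedAdds(FPA input, FPA constant, int n)
{
    var graph = input;
    for (int i = 0; i < n; i++)
        graph = PA.Add(graph, constant); // previous graph becomes the left leaf
    return graph;
}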

Figure 4.5.: Performance tests on nested mixed operations on the GPU compared to the CPU

As shown in the above chart (fig. 4.5), the CPU time for computing the multiplication and division graphs is low compared to that of the GPU; with only 6-10 nested operations it is possible to outperform Accelerator on large datasets. The Pow2 operation performs very well, probably because of the use of the constant 2. Multiplication also shows good performance. Based on this we conclude: the more complex an operation is, the more potential performance gain there is in running it on the GPU.

Value creation

Values can be created in several different ways in Accelerator. In this section we compare the performance of the different types of array creation.

Creating arrays

FloatParallelArrays can be created in two ways in Accelerator:

public FloatParallelArray(float f, params int[] shape);
public FloatParallelArray(float[,] values);

Listing 4.2: FloatParallelArray constructors

This means that if we want to create an array of constant values, we can either fill a two-dimensional array with the same value, or simply use the first constructor and tell Accelerator the dimensions we wish for. Tests showed that we could get up to 4 times better performance by creating constant arrays with the first method, compared to filling an array in C# and creating the FPA with the second method.

Binary operations

Many binary operations are overloaded to allow easier mass operations with the same constant:

public static FloatParallelArray Add(FloatParallelArray a, float f);
public static FloatParallelArray Add(FloatParallelArray a1, FloatParallelArray a2);

Listing 4.3: Add operation

The two methods give the same result if the array a2 is filled with the value of f. We tested the performance of these and found no difference, provided that either a1 or a2 was a constant array created with the first method for array creation.
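As an illustration (ours) of the two creation paths compared above:

// Fast path: let Accelerator materialize a 1000 x 1000 constant
// array itself; up to 4x faster creation in our tests.
var fast = new FPA(5f, 1000, 1000);

// Slow path: fill a managed two-dimensional array, then copy it in.
var values = new float[1000, 1000];
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        values[i, j] = 5f;
var slow = new FPA(values);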

Chapter 5. GPGPU approaches for spreadsheets

In this chapter we describe different approaches to implementing GPU parallelism in spreadsheets, taking the Accelerator test results into consideration. As described in section 2.4, we have not been able to find any previous work on using the GPU for optimizing spreadsheets, but various articles describe approaches to parallelism in spreadsheets using multicore CPUs or High-Performance Computing (HPC). In this chapter we present our own analysis based on CoreCalc, but also look into how to adapt previous parallelism theories to the GPU.

5.1. Single normal built-in functions

CoreCalc has a range of built-in functions like the ones known from Microsoft Excel. Some of these functions, like matrix multiplication, take one or more matrices as input and should be straightforward to implement on the GPU; if the input is large enough, or the arithmetic operations complex enough, the test results suggest that a performance gain should be possible. Matrix multiplication in particular is more effective on the GPU than on the CPU at relatively low data sizes. Simple functions that do not work with matrices, such as SQRT, SIN, addition, subtraction, division, and multiplication, are also simple to implement for the GPU, but a performance boost is not expected, given the very small input size of 1-2 arguments and the low arithmetic complexity. They are, however, needed in order to use sheet defined functions.

5.2. Sheet defined functions

As described in the introduction, sheet defined functions are functions defined within a spreadsheet using cells.

Due to the high transfer latency of the GPU, the arithmetic complexity of an operation is important in order to benefit from the GPU. This is shown in chapter 4, where we test single operations and nested operations similar to those produced by sheet defined functions. Even though sheet defined functions are more complex, the input data is not necessarily large and will typically be 1-3 arguments, which makes using the GPU questionable. On top of that, all constants have to be transferred to the GPU as well: every time you write a formula such as = C1 * 2 or = C1 * 10, the constant 2 or 10 has to be transferred as a texture to the GPU. The potential performance gain increases, however, when sheet defined functions are used in higher order functions such as tabulate, where the same function is applied to a range of input data.

5.3. Higher order Map function

As we concluded in chapter 4, a rather large amount of data and a complex operation are needed for the GPU to be able to optimise an operation. You therefore need a quite complex sheet defined function for the GPU to be able to optimize the evaluation of a single function call. However, the same function is often used more than once, and in simulations it is not uncommon for the same function to be used thousands of times. If all of these calls could be combined into one single call, sending all the input data to the GPU and processing it with the same operation, we would expect increased performance. CoreCalc includes higher order functions such as Map, RowMap, ColMap and Tabulate, which can all be classified as embarrassingly parallel problems, since there is no dependency between the individual applications of the function; they are thereby also suitable for the GPU. Depending on the complexity of the function and the number of times it is used, it should be possible to obtain a reasonable performance gain.

5.4. When to use the GPU?

As already mentioned several times, it is not always a good idea to send a computation to the GPU. In order to estimate which platform is best suited for a specific computation, we need to estimate the execution time of the operation on both platforms at evaluation time. Both Hamann[9] and Wack[19] work on partitioning the dependency graph of a spreadsheet to limit the parallel execution to where there is a potential performance gain. Both use weighted cells (nodes) in the graph and decide based on the total weight of a partition. As Wack's theory is about distributing the workload to workstations on a network, his model takes network latency, speed, distance and other factors into account. Hamann uses multiple cores on one CPU and simplifies the weighting to simple numbers.

Many of the same principles apply when deciding whether to evaluate an SDF on the CPU or the GPU. Loosely based on their approaches, we first create a simplified model that applies only to SDFs. We use knowledge about the hardware, measured time, input data, and an estimated execution time per operation. To estimate execution time, three major approaches exist: experimental (testing and measuring), probabilistic measurement (based on measurements of small parts), and static analysis, which uses constructed models of processor instructions and timings to predict the result. Execution time estimation of normal programs is non-trivial due to loops and recursive calls that might depend on values not known before runtime, but as we do not allow loops and recursive calls in spreadsheets, the estimation is simplified drastically. Another approach would be to simply run the operation on both platforms the first time it is invoked and remember which performed best. However, as the input parameters might change between calls, and because formulas are easily and often changed in spreadsheets, this approach would not only take more time but would also often be wrong. For estimation on the CPU we use a simplified model that does not take the architecture, cache or any other details of the CPU into account. Given:

m: number of operations
c: computation time of one operation
w: number of cores in the CPU

estimated time = (m · c) / w

Figure 5.1.: Formula for estimated computation time using the CPU

When using this simple model to estimate the execution time of an SDF at evaluation time, w and m are known, but c is unknown, as the SDF can contain many operations and conditions. c can, however, be estimated using the static analysis described later. When using the GPU we have to expand the model with latency and transfer time:

estimated time = k0 + (m · c) / w + c · k1 + m · k2 + r · k2

Given:
k0: initial latency of transferring to the GPU
k1: time to transfer one operation
k2: time to transfer one float
m: number of operations
c: computation time of one operation
w: number of cores in the GPU
r: the result size of the operation

Figure 5.2.: Formula for estimated computation time using the GPU

This formula can be partitioned into (m · c) / w, the time for computing the operations on the GPU; c · k1 + m · k2, the time to transfer the needed data to the GPU; and r · k2, the time to transfer the computed result back. m and r are known at evaluation time of a spreadsheet function, w can be found in the graphics card's specifications, and k0, k1, and k2 can easily be measured. However, c has to be estimated, just as on the CPU.

Estimating execution time

Estimating the execution time (c) can be done by weighting each type of operation with a value, running through all operations to be processed, and adding these values together. For a conditional statement, one estimates both the true leaf and the false leaf; the worst estimate yields the worst case execution time (WCET) and the best estimate yields the best case execution time (BCET). We focus on finding the WCET of a sheet defined function, both for the GPU and the CPU. First we need to assign a weight to each type of operation, for both the GPU and the CPU. As we have benchmarked the different operations, we can derive these weights from the test results. For the CPU this is simply done by using the time of the add operation in the tests, but for the GPU we have to subtract the latency and the transfer time to and from the GPU. For the GPU we also have to find k0, k1, and k2. However, we have not distinguished between k1 and k2 in our analysis; taking this into account, we assume that k2 includes the time for transferring the operations. Therefore we simplify c · k1 + m · k2 to c · 0 + m · k2 and end up with only m · k2, leaving out the transfer time of operations. This leaves only variables known at evaluation time and allows us to estimate the execution time of operations on the GPU. Now we can simply use these two models and a static analysis of the SDF to determine which platform to target. On our test setup we have only a single core in the CPU, but the model also takes multicore systems into account to some extent. One factor that is not taken into account is the maximum texture size and memory of the GPU; exceeding these limits would force us to split the Accelerator call into two or more calls. Due to the scope of this project, we have not looked further into this.
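A sketch (ours) of how the two models from figures 5.1 and 5.2 could be applied at evaluation time to pick a target; all parameter names are illustrative, and k1 is dropped as described above:

// Simplified cost model from figures 5.1 and 5.2 (k2 absorbs k1).
// Returns true when the GPU estimate beats the CPU estimate.
static bool ShouldUseGpu(long m, double c, long r,
                         int cpuCores, int gpuCores,
                         double k0, double k2)
{
    double cpuTime = m * c / cpuCores;                        // fig. 5.1
    double gpuTime = k0 + m * c / gpuCores + m * k2 + r * k2; // fig. 5.2, simplified
    return gpuTime < cpuTime;
}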

Chapter 6. Implementation of prototype

This chapter describes the implementation of our prototype in CoreCalc. We describe the implementation of the approaches presented in chapter 5 and the problems and limitations of the design.

6.1. Built-in functions

As described in chapter 5, formulas can invoke functions. For example, = SIN(90) is a formula that invokes the sine function, and a formula such as = A1 + A2 invokes the add function. We have chosen to implement a small range of the functions that showed potential to perform better on the GPU. In CoreCalc, a spreadsheet function is represented as an object of the class Function. Function connects the function name, represented as a string, with a delegate called an Applier, which points to the function implementation.

delegate Value Applier(Sheet sheet, Expr[] es, int col, int row);

Listing 6.1: Applier

A function's Applier is invoked when the function is called from a specific Cell. We would like to benchmark the current CPU implementation in CoreCalc against the calculations on the GPU, so we need both a GPU Applier and a CPU Applier for each function. This simple design allows us to choose between GPU and CPU Appliers by changing the target platform on the Function class.

class Function {
    public enum TargetPlatform { CPU, GPU }
    public static TargetPlatform target;

    private Applier applierCPU;
    private Applier applierGPU;

    public Applier Applier
    {
        get {
            return applierGPU == null || target == TargetPlatform.CPU
                ? applierCPU : applierGPU;
        }
    }
}

Listing 6.2: GPU and CPU applier

Because of the static TargetPlatform in the Function class, this implementation does not allow an adaptive implementation where the chosen target is based on the context of the invocation; however, it should be possible to choose the target platform based on an estimated execution time (described in chapter 5). In the current design of CoreCalc, Appliers are simply returned when a function is called, and the Applier is then invoked by the callee. This means the context is only available outside the Function class, or inside the implemented Appliers. Because of this, the function being evaluated has to either choose the platform itself, or the choice has to be moved elsewhere. We have not looked further into this, as very few of the built-in functions are potentially faster on the GPU. By overloading the constructor of the Function class, it is now easy to tie two Appliers to a Function by simply changing:

new Function("MMULT",
    MakeFunction((Fun<Value[], Value>)MMult));

Listing 6.3: Original code for creating MMULT

into:

new Function("MMULT",
    MakeFunction((Fun<Value[], Value>)MMult),
    MakeFunction((Fun<Value[], Value>)MMultGPU));

Listing 6.4: Modified code for creating MMULT

6.2. Sheet defined function

Sheet defined functions are functions defined within a spreadsheet. A sheet defined function is defined by input cells and an output cell. In our tests we found that, due to the extra latency of transferring data to the GPU, a certain complexity of the operation and a certain amount of data are needed. Sheet defined functions solve the problem of operation complexity by allowing the user to compose new functions from many simple functions. To execute a sheet defined function in Accelerator, our goal must be to build an Accelerator Expression Graph (AEG) that corresponds to the function. Accelerator does not support inserting values into an existing AEG, which means we need to build the AEG when all values of the function, parameters included, are known. Because of this we introduce a middle layer, Accelerator Abstract Syntax (AAS), that can quickly build an AEG given the parameters. The details of this are discussed below.
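As a preview of that design, the intended two-phase flow can be sketched as follows (our own illustration; GenerateFPA matches the interface shown in the next section, while TranslateToAAS and the AccInputInfo constructor are assumed helper shapes, not actual prototype code):

// Phase 1, once per SDF definition: translate the function body
// into Accelerator Abstract Syntax via the visitor described below.
AccExpr root = TranslateToAAS(sdfBody);

// Phase 2, on every call: supply the arguments, build a fresh AEG,
// and evaluate it on the DirectX 9 target.
var info = new AccInputInfo(argumentValues); // assumed constructor
FPA graph = root.GenerateFPA(info, callId);
float[] result;
dx9Target.ToArray(graph, out result);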

Accelerator Abstract Syntax

The basis of the AAS is the abstract class AccExpr:

public abstract class AccExpr
{
    public abstract FPA GenerateFPA(AccInputInfo info, int CallID);
}

Listing 6.5: AccExpr class

A reference to the root AccExpr of an SDF is stored along with the compiled SDF. At evaluation time, this AccExpr's GenerateFPA method is called with the parameters, the AEG is generated, and it is executed on the GPU, after which the result is returned.

Figure 6.1.: Class hierarchy of Expr types[18]

Figure 6.2.: Class hierarchy of AccExpr

In CoreCalc, whenever a cell is changed, the string in the cell is parsed and an Expr AST (see fig. 6.1) is built. CoreCalc SDFs are compiled into .NET bytecode. Before compilation, the function's Expr AST is translated into the CGExpr abstract syntax. This is achieved with a Visitor pattern that visits all child expressions and translates them individually. Converting the Expr abstract syntax (fig. 6.1) into AAS is done the same way, by creating a concrete visitor that visits all leaves of an Expr node and translates them. Some operations possible in CoreCalc are not possible in Accelerator; if these are encountered, an exception is thrown. We do not handle these cases in this project. CoreCalc expressions can be of the following types: NumberConst, TextConst, Error, FunCall, CellRef, CellArea. We will now look into how the translation of these works.

Values

All number values have to be represented as FPAs in Accelerator, so we work with conversion of NumberConsts, CellRefs, and CellAreas from CoreCalc into FPAs. Input arguments have to be represented as FPAs as well, but we discuss this later. NumberConsts (constants in formulas such as 2 + 2) and NumberCells inside the function sheet are known at compile time and can be represented as AccConst, while CellRefs and CellAreas, if pointing outside the function sheet, need to be evaluated at evaluation time and are represented by their own types, AccCellRef and AccCellArea.
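As an illustration of what such a concrete visitor might look like (our own sketch; member names such as n.Value and r.Address, and the constructors of AccConst, AccInput and AccCellRef, are assumptions about the prototype's structure):

// Sketch of the Expr-to-AAS translation. Each Visit method returns
// the AccExpr corresponding to one node of the Expr tree; the
// remaining Expr types follow the same pattern.
class AccExprBuilder
{
    private readonly Dictionary<string, int> inputIndex; // cell address -> argument index

    public AccExprBuilder(Dictionary<string, int> inputIndex)
    {
        this.inputIndex = inputIndex;
    }

    public AccExpr Visit(NumberConst n)
    {
        return new AccConst(n.Value); // value known at SDF compile time
    }

    public AccExpr Visit(CellRef r)
    {
        int index;
        if (inputIndex.TryGetValue(r.Address, out index))
            return new AccInput(index); // argument of the SDF
        return new AccCellRef(r);       // read at evaluation time
    }

    public AccExpr Visit(TextConst t)
    {
        // Not representable in Accelerator (see the note above).
        throw new NotSupportedException();
    }
}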

Representation of numbers

AccNumber allows for creating an FPA with the same size as the other argument of the operation it is used in. The argument will typically be a single float when calling the SDF alone, but as we describe in section 6.2.2 this is often not the case.

    public abstract class AccNumber : AccExpr
    {
        ...
        protected override FPA GenerateFPA(AccInputInfo info, int CallID)
        {
            return new FPA(Value, info.Values[0].GetLength(0),
                                  info.Values[0].GetLength(1));
        }
        ...
    }

Listing 6.6: AccNumber class

As both AccCellRef and AccConst inherit from AccNumber, they only need to define how to return their value, and AccNumber will convert it to an FPA of the correct size as shown above.

Input cells

Input arguments are CellRefs in the Expr tree, but at compile time we match each CellRef against the list of input cells for the SDF and represent the input as the type AccInput, giving the index of the input cell as argument. The input arguments are not known before evaluation time and are sent through every GenerateFPA call in the AccInputInfo object. How this object is built is explained in section 6.2.2, because it is highly dependent on how the function is called. AccInput just returns the value that corresponds to its index.

    protected override FPA GenerateFPA(AccInputInfo info, int CallID)
    {
        return new FPA(info.Values[inputIndex]);
    }

Listing 6.7: GenerateFPA for AccInput

A general problem when using Accelerator in CoreCalc is that every value in CoreCalc has to be cast to a float, sent to the GPU, cast back to a double, and wrapped in a NumberValue object. The NumberValue wrapping/unwrapping is, however, also an issue in SDFs compiled to .NET bytecode. (See Known Problems.)

Function calls

In Expr syntax, function calls are represented by FunCall. This includes both calls to built-in functions such as SIN and operators such as +. These are refined to a whole hierarchy in the CGExpr syntax. Most functions, such as + and SIN, are easily translated to Accelerator and are represented by AccBinaryOp or AccUnaryOp, with the concrete function specified in the constructor. Other functions, such as comparison and conditional functions, that are very simple in CoreCalc have to be refined similarly to how it is done in CGExpr. (Function calls that depend only on AccConsts could themselves be represented as an AccConst, but as this only optimizes SDFs that are poorly designed, we have not implemented it.)
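A sketch of such a node is shown below; the combining delegate is our illustrative stand-in for the Accelerator operation that the prototype selects in the constructor:

    using System;
    using FPA = Microsoft.ParallelArrays.FloatParallelArray;

    // A hedged sketch of a binary operation node in the AAS; op could be
    // ParallelArrays.Add, ParallelArrays.Multiply, or similar.
    public sealed class AccBinaryOp : AccExpr
    {
        private readonly AccExpr left, right;
        private readonly Func<FPA, FPA, FPA> op;

        public AccBinaryOp(AccExpr left, AccExpr right, Func<FPA, FPA, FPA> op)
        {
            this.left = left;
            this.right = right;
            this.op = op;
        }

        public override FPA GenerateFPA(AccInputInfo info, int CallID)
        {
            // Build both operand subgraphs, then add one AEG node on top.
            return op(left.GenerateFPA(info, CallID),
                      right.GenerateFPA(info, CallID));
        }
    }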

Conditional Statements

Expr does not represent boolean expressions separately; conditional statements in Accelerator, however, require that we generate boolean expressions. In CoreCalc any value can always be evaluated to true or false, while floats in Accelerator never can. Just as the IL code generator for SDFs has conditional expressions that require functions to return booleans, Accelerator has special functions for comparison of float values, which return arrays of booleans. In CGExpr, conditional statements are represented by CGIf, CGAnd, CGOr, CGNot and a range of different comparisons inheriting from CGComparison. Due to the scope of our project we have chosen a simple approach where comparison functions are contained in the conditional functions that need them (see the example below). This approach restricts our use of logical operators, but as this is an experimental prototype this is not a problem. We have also chosen not to implement NOT, AND and OR, due to the scope of the project.

    public BPA GenerateBPA(AccInputInfo info, int CallID)
    {
        ...
        switch (type) {
            case Type.EQ:
                return PA.CompareEqual(child1Fpa, child2Fpa);
            case Type.GT:
                return PA.CompareGreater(child1Fpa, child2Fpa);
            ...
        }
        ...
    }

Listing 6.8: GenerateBPA of AccComp

Random numbers

Accelerator has no volatile methods such as random number generation. Monte Carlo simulations, however, use a lot of random data, so we need to generate it on the CPU at evaluation time. As random numbers are the only volatile function we need within the scope of the project, we have contained this in AccRand, which works like AccNumber but generates an array of random numbers corresponding to the size of the other argument of the operation it is used in. This size is available in the AccInputInfo object.
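A sketch of AccRand's generation step, mirroring the AccNumber pattern from listing 6.6 (the surrounding class members are assumptions):

    // A hedged sketch: random samples are produced on the CPU at evaluation
    // time, shaped like the call's first input, and shipped to the GPU as
    // one FPA.
    protected override FPA GenerateFPA(AccInputInfo info, int CallID)
    {
        int rows = info.Values[0].GetLength(0);
        int cols = info.Values[0].GetLength(1);
        float[,] samples = new float[rows, cols];
        Random rng = new Random();
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                samples[r, c] = (float) rng.NextDouble();  // uniform in [0, 1)
        return new FPA(samples);
    }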

Ensuring reuse of AccExprs

Because of the latency of transferring data to the GPU, it is important that we generate as small an AEG as possible. In order to do this we need to make sure that the AAS does not contain the same node more than once. Many cells can have formulas that reference the same input cell or number cell, and every reference to the same cell should point to the same AAS object. This means that when creating the AAS it is important to keep track of whether an AccExpr for the same Expr has already been created. If it has, this earlier object should be referenced instead of creating a new instance. In order to achieve this we create a dictionary inside the already existing visitor that allows lookup of AccExprs by cell address. Another, and maybe performance-wise faster, solution would be to have a Cell (or a decorator) reference its AccExpr. However, as this happens at compile time, we have not looked further into such optimizations.

Whenever an AAS translation is started, a dictionary is created, and whenever a new Expr is needed we check whether it already has a corresponding AccExpr. Because we have not focused on optimizing the generation of AAS, and because a CellRef pointing to an input cell can be represented by the same AccInput object, we simply create the AccExpr object and let TryAccExpr throw it away if it is not needed.

    private static AccExpr TryAccExpr(FullCellAddr addr, AccExpr newexpr)
    {
        if (exprcache.ContainsValue(newexpr))
            return newexpr;
        AccExpr n;
        if (!exprcache.TryGetValue(addr, out n))
        {
            n = newexpr;
            exprcache.Add(addr, n);
        }
        return n;
    }

Listing 6.9: Method for minimizing the number of created AccExpr objects

Due to the high transfer latency we keep a similar dictionary for number constants. If A1 contains the formula = 2 + C1 and B1 has the formula = C1/2, both of these constants would normally have to be transferred to the GPU as separate textures, but we make sure that they point to the same AccConst. For these optimizations to work at evaluation time, the same AAS object should return the same FPA object for each reference in the SDF invocation. To achieve this, we simply make sure that a generated FPA is saved in the AccExpr for each invocation of the SDF (it is only invoked once in a tabulate call; see section 6.2.2). FPAs should of course not be shared between invocations, and to ensure this we send a unique CallID through the call stack. The only public method of an AccExpr that returns an FPA object is GenerateFPAWithCache of the abstract class AccExpr that all others inherit from. This method checks whether we have already generated an FPA for the AccExpr in this invocation and returns the cached FPA; if not, it calls the object's specific GenerateFPA method.
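A sketch of what this caching entry point could look like on AccExpr, with the cache fields as our assumptions about the bookkeeping:

    // A hedged sketch: one cached FPA per AccExpr per SDF invocation,
    // keyed by the unique CallID sent down the call stack.
    private FPA cachedFpa;
    private int cachedCallID = -1;

    public FPA GenerateFPAWithCache(AccInputInfo info, int CallID)
    {
        if (cachedCallID != CallID || cachedFpa == null)
        {
            cachedFpa = GenerateFPA(info, CallID);  // node-specific generation
            cachedCallID = CallID;
        }
        return cachedFpa;
    }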

Higher order functions

As discussed in chapter 5, using sheet defined functions in higher order functions such as Map or Tabulate should improve the potential performance gain if sent as one call to the GPU.

    public static Value Tabulate(Value v0, Value v1, Value v2)
    {
        if (v0 is FunctionValue && v1 is NumberValue && v2 is NumberValue)
        {
            FunctionValue fv = v0 as FunctionValue;
            // (... argument error handling ...)
            int rows = (int) (v1 as NumberValue).value,
                cols = (int) (v2 as NumberValue).value;
            if (0 <= rows && 0 <= cols)
            {
                Value[,] result = new Value[cols, rows];
                for (int c = 0; c < cols; c++)
                    for (int r = 0; r < rows; r++)
                        result[c, r] = fv.Call2(NumberValue.Make(r + 1),
                                                NumberValue.Make(c + 1));
                return new ArrayExplicit(result);
            }
            // (... error handling ...)
        }
    }

Listing 6.10: Original CoreCalc code for Tabulate

The built-in Tabulate function works by taking a binary function and two numbers as arguments. The function is then called rows × cols times, taking the row index and the column index as arguments 1 and 2, respectively. Map, ColMap and RowMap work similarly to Tabulate, but take a CellArea as input and pass the contents of the cells as arguments to the function. We have chosen to explain Tabulate here for simplicity, but the other implementations are similar. As shown in the code sample, Tabulate is implemented with a nested loop calling the method once per argument combination. As this is an embarrassingly parallel problem that can be handled by the GPU, we need to represent it as one Accelerator abstract syntax graph. As Accelerator handles each element independently, all we have to do is generate an FPA corresponding to the input array before sending it to the GPU. This only requires a minor modification of AccExprs such as AccInput, AccConst, and AccRand, as they have to fit the input. As described, a sheet defined function has an AccExpr structure whose FPA GenerateFPA(AccInputInfo info, int CallID) method generates the Accelerator abstract syntax graph that corresponds to the operation, based on the input arguments.

Table 6.1: Original input arguments for the call TABULATE(F, 2, 8)

Table 6.2: Reorganized input arguments for the call TABULATE(F, 2, 8)

It is possible to generate an AAS using the first of the above formats (table 6.1), but as the data has to be transferred as a texture, this format exceeds the maximum texture width or height long before the actual memory limits of the GPU (the maximum texture size of the GT240 has been estimated to 4000 × 8000). To solve this we reorganize the data to fit the texture size as shown in table 6.2. This is done using the GenerateAcceleratorMethod method of CGManager.cs.

    int cols = (int) Math.Ceiling(Math.Sqrt(length));
    while (length % cols != 0)
        cols--;
    int rows = length / cols;

Listing 6.11: Finding a texture-friendly shape for the input

First we find the new format using the algorithm shown in listing 6.11, where we use the fact that our tests showed the texture width to be exceeded before the height. We then reorganize the input into the ArrayList<float[,]> type (each float[,] is a reorganized input argument). The resulting FPAs are sent to the GPU, and the result is reorganized back into the original format before the value is returned. This is done completely transparently to the Tabulate or Map function.
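The reorganisation round trip can be sketched as follows; the helper names are ours, and the shape search is the one from listing 6.11:

    using System;

    static class InputReshaper
    {
        // Reorganize a flat argument vector into a near-square float[,]
        // whose dimensions divide the length evenly (listing 6.11).
        public static float[,] Reshape(float[] flat)
        {
            int length = flat.Length;
            int cols = (int) Math.Ceiling(Math.Sqrt(length));
            while (length % cols != 0)
                cols--;                          // shrink until it divides evenly
            int rows = length / cols;
            float[,] shaped = new float[rows, cols];
            for (int i = 0; i < length; i++)
                shaped[i / cols, i % cols] = flat[i];
            return shaped;
        }

        // Restore the original flat layout after the GPU result comes back.
        public static float[] Flatten(float[,] shaped)
        {
            int cols = shaped.GetLength(1);
            float[] flat = new float[shaped.GetLength(0) * cols];
            for (int i = 0; i < flat.Length; i++)
                flat[i] = shaped[i / cols, i % cols];
            return flat;
        }
    }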

Another solution would be to determine the maximum texture size of the current machine through some initial tests, and simply wrap around the maximum texture width. This solution would let us use data sizes close to the maximum texture size, but if we exceed it, the data has to be split in two. As the operations are data-parallel and element-wise when calling sheet defined functions, it does not matter how the data is partitioned. We have not implemented this in our prototype (see Future work).

Evaluation leafs on the CPU

Expr contains a lot of things that we cannot or have not translated to AAS. A simple approach would be to just evaluate these expressions on the CPU using the normal evaluation in CoreCalc, but this would lead to potential recursive calls to the same version and introduce further uncertainties. Even though this is very simple to implement, we have chosen not to, and simply throw an exception if an expression cannot be translated fully.

Chapter 7. Performance test of prototype

The implemented prototype shows performance gains in some areas and performance losses in others. We document the test results and look into possible conclusions.

7.1. How the tests were executed

Each of our benchmarks has run 100 recalculations of a workbook and calculated the average time. The workbooks use the TABULATE(function, number, number) function, and each benchmark is executed on a range of linearly or quadratically growing data sizes.

Floating point precision and performance

As noted previously, Accelerator uses single precision floating point numbers, while CoreCalc uses double precision. On modern CPUs there is no difference in performance between operating on floats and doubles, except for division operations. Most current GPUs only support single precision floating point numbers. NVIDIA has earlier implemented double precision on GPUs with the NVIDIA G80, which worked at 1/10th of the speed of single precision operations, and the new Fermi will support double precision at half the speed of single precision[15]. As this will change drastically in the near future, we have decided not to look at how float-to-double casting affects our test results.

Hardware

All tests have been run on the same hardware setup as our earlier tests, using the NVIDIA GT240 graphics card.
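A minimal sketch of such a benchmark loop, with Recalculate standing in for CoreCalc's workbook recalculation entry point:

    using System;
    using System.Diagnostics;

    static class Benchmark
    {
        // Run the workbook recalculation a fixed number of times and
        // report the average time per run in nanoseconds.
        public static double AverageRecalcTimeNs(Action recalculate, int runs = 100)
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < runs; i++)
                recalculate();                   // one full recalculation
            sw.Stop();
            return sw.Elapsed.TotalMilliseconds * 1e6 / runs;
        }
    }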

7.2. Results

We have seen performance gains in our tests and simulations; this shows that spreadsheet calculations can be optimised using the GPU if relatively large data sizes are provided. In this section we go through the factors we believe to be important when doing such an implementation, and look at how much data is needed. The results of this benchmark can be found in appendix A.

Built-in function

For the built-in functions we found that, given a sufficiently large input array and a sufficiently complex function, it impacts performance positively to do the calculations on the GPU. However, as shown in our analysis of Accelerator, very few operations are sufficiently complex or take enough arguments to actually have a positive impact. Our tests showed that arrays of 96² elements were needed for a matrix multiplication to show a performance gain. If this were a common scenario for calculations in spreadsheets, it would make sense to spend more time on this kind of operation. However, it seems very tedious to work with this many cells in a spreadsheet. It should be noted that the upper limit for the calculations is also relatively close to the lower bound where the CPU is faster. In our example it is possible to optimise in the range [96²; 240²], after which the values have to be split into two arrays and so forth.

Sheet defined function

We tested sheet defined functions in several scenarios, building both more and less complicated SDFs. We converted real-life examples of Monte Carlo simulations from Excel into sheet defined functions and ran them on CoreCalc. Performance gains were found by using GPUs in this way. However, we found several factors that influence the performance of this implementation:

Aggregating is slow on the GPU

In Monte Carlo simulations, aggregating functions are often used. Aggregating values is slower on the GPU than on the CPU[17], and should be used with great caution. For many simulations it might not make sense to create and calculate the sampling data on the GPU, transfer it back to the CPU, and do the aggregation there. This is also described in chapter 4.

With the NVIDIA Fermi, one would expect the performance of aggregating functions to improve. NVIDIA Fermi promises more shared memory and much faster atomic operations for accessing shared memory, which gives a better foundation for reduction operations and thereby aggregate functions.[3]

Random data needed to be transferred

When doing Monte Carlo simulations, we transfer random data from the CPU to the GPU. This increases the total time spent because of the larger transfer time. It would probably improve the performance of Monte Carlo simulations if random numbers were simply generated on the GPU instead of being generated on the CPU and then transferred. A pseudo-random number generator can be created on the GPU, and future releases of Accelerator 2.0 are expected to support this[17].

Reducing the amount of constants

In our implementation we worked on reducing the amount of data that needs to be transferred to the GPU. Looking at the derived slopes of the benchmarking results for the different Heron implementations, we see that the intersection between the time functions of the GPU and the CPU computations occurs at smaller data sizes the less data needs to be transferred.
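Given two fitted linear time models, the crossover point can be computed directly; a small sketch (the values would come from the regressions in appendix A):

    // Solve cpuSlope*x + cpuIntercept = gpuSlope*x + gpuIntercept for x:
    // the data size beyond which the GPU is expected to win.
    static double Crossover(double cpuSlope, double cpuIntercept,
                            double gpuSlope, double gpuIntercept)
    {
        return (gpuIntercept - cpuIntercept) / (cpuSlope - gpuSlope);
    }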

Chapter 8. Perspective

8.1. The future of GPUs

The test results in this project have all been based on an NVIDIA GT240 graphics card, which is not state of the art, but has been sufficient within the scope of this project. The tests may very well produce more promising performance gains on modern GPUs with around 250 cores, as seen in many gaming PCs today. State-of-the-art GPUs with 512 cores would improve this further. In the near future NVIDIA will support double operations at half the speed of float operations[14], more cores will be added to GPUs, and memory limits will be raised. Considering this, the idea of using the GPU to optimize spreadsheet evaluation becomes even more interesting.

8.2. Other targets of parallelism using Accelerator

Accelerator already supports multicore targets to some extent, but ideas for other targets such as FPGAs (Field-Programmable Gate Arrays) and distributed networks are also mentioned in the documentation[17]. In theory we could use the FPAs generated by our current Accelerator abstract syntax and simply change the target of evaluation. Multicore CPUs are interesting targets as they may show great performance gains on the lower data sizes where the GPU struggles with latency. Hamann[9] describes performance gains close to the theoretical maximum when using the .NET Task Parallel Library. Our approach has not been aimed at task parallelism, and we may not be able to parallelize the same problems as Hamann, but one should expect a comparable performance gain on the data-parallel operations. Using information about the number of cores in the worst case execution time estimate for the CPU, it would be possible to estimate when not to parallelize the evaluation at all, when to use the multicore CPU, and when to use the GPU.

8.3. Known problems

Converting Values

A general problem when constructing Accelerator abstract syntax in CoreCalc is that datatypes such as ArrayValue and NumberValue, which contain double values, have to be cast to float. As mentioned earlier, current GPUs do not support doubles, and neither does Accelerator. The only solution here is simply to convert the doubles of the CoreCalc datatypes into float arrays before creating the FPA object. Likewise, when Accelerator returns a result to us, we need to convert it back to doubles and wrap it in ArrayValues or NumberValues. This gives an overhead proportional to the size of the input arrays, which is unfortunate but very hard to avoid. CoreCalc's sheet defined functions are compiled to .NET bytecode and need to do the same wrapping of values. We have not looked further into this. However, Poul Brønnum focuses on this problem in his Master's thesis, Type Analysis for Sheet-defined Functions[5]. He states that by implementing a set-based type system influenced by soft typing, a performance gain of 20% is possible. With further improvements he documents that performance gains of up to 65% compared to the original code are possible for some functions.
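A sketch of the conversion step, using plain arrays to stand in for CoreCalc's Value types:

    // The double <-> float round trip described above: precision is lost
    // on the way to the GPU, and the overhead is proportional to the
    // array size.
    static float[,] ToFloats(double[,] values)
    {
        int rows = values.GetLength(0), cols = values.GetLength(1);
        float[,] result = new float[rows, cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                result[r, c] = (float) values[r, c];
        return result;
    }

    static double[,] ToDoubles(float[,] values)
    {
        int rows = values.GetLength(0), cols = values.GetLength(1);
        double[,] result = new double[rows, cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                result[r, c] = values[r, c];
        return result;
    }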

Limitations in Accelerator abstract syntax

Due to the scope of this project we have not matched the whole CGExpr tree in Accelerator Abstract Syntax. Here is a short overview of what we do not implement:

- Time functions and other volatile functions
- Choose
- Lookup
- AND, OR, NOT and nested parentheses
- Aggregation functions such as Average, Percentile and others
- Strings
- ... and many other functions that have not been used in the scope of this project

Exceeding memory and texture limits

By reorganising our data we avoid, to some extent, exceeding the maximum texture size, but when the data does not fit within the maximum texture size, the operation should be split in two. We have not implemented this, as it has not been needed for our prototype; doing so would result in a constant increase in the GPU computation time. We have neither been able to predict when the GPU runs out of memory, nor found a way of handling this apart from catching the exceptions that Accelerator throws. While we might be able to predict when GPUs run out of memory, this might as well be handled on a lower level where more information about the state of the memory is known.
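Within the prototype's constraints, the pragmatic response is a CPU fallback; a sketch, with EvaluateOnGpu and EvaluateOnCpu as illustrative stand-ins for the two evaluation paths:

    // If the GPU evaluation fails, for instance because a texture or
    // memory limit was exceeded, fall back to CoreCalc's normal CPU
    // evaluation. We do not predict Accelerator's exception types here.
    static Value EvaluateWithFallback(AccExpr root, AccInputInfo info, int callID)
    {
        try
        {
            return EvaluateOnGpu(root, info, callID);
        }
        catch (Exception)
        {
            return EvaluateOnCpu(info);
        }
    }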

Worst Case Execution Time estimation

We have proposed a model for estimating the worst case execution time of a function on the GPU, but it is not part of our prototype. It would, however, be possible to implement it on top of our Accelerator Abstract Syntax. Moreover, the current model does not take maximum texture sizes, memory limits, or other possibly unknown factors into account. Our current estimates of the model's values are also incomplete, as we do not distinguish between the time cost of transferring an operation and that of transferring its arguments.

Chapter 9. Conclusion

The main objective of this project was to investigate whether CoreCalc could be extended to use the GPU for parallelizing the evaluation of functions. This has indeed been proven possible, and we have implemented an experimental prototype that allows a subset of the CoreCalc operations to be evaluated on the GPU. We have investigated methods for parallelizing spreadsheet applications using the GPU, and based on our analysis we have chosen to focus on sheet defined functions and their usage in higher order functions such as Tabulate. Our prototype shows that it is possible to achieve a performance gain on spreadsheet operations, given enough data and enough arithmetic complexity. However, except for sheet defined functions, only very few built-in spreadsheet operations use enough data or have the required complexity. Only matrix operations, such as the built-in matrix multiplication function, have displayed potential for performance gains on the GPU. We have analysed Microsoft Accelerator and documented its limitations related to the purpose of this prototype. In order to construct Microsoft Accelerator Expression Graphs at evaluation time, we have designed a simple intermediate abstract syntax based on the Expr abstract syntax from CoreCalc. For complex simulations or extremely large amounts of data, such as in Monte Carlo simulations, it is indeed possible to optimise the spreadsheet calculation using the GPU. However, there is little or no performance gain when using the GPU to evaluate light spreadsheets on current hardware. Taking this into account, it is questionable whether this should be implemented in mainstream software. The future development of GPUs, however, looks promising.

Bibliography

[1]
[2] GPU Gems.
[3] NVIDIA's next generation CUDA compute architecture: Fermi. Fermi_Compute_Architecture_Whitepaper.pdf.
[4] Dan Bricklin. VisiCalc information. Webpage. visicalc.htm.
[5] Poul Brønnum. Type analysis for sheet-defined functions. Master's thesis, IT University of Copenhagen.
[6] Alan Dang. Exclusive interview: NVIDIA's Ian Buck talks GPGPU, September.
[7] David Tarditi, Sidd Puri, Jose Oglesby. Brute force attack on UNIX passwords with SIMD computer.
[8] David Tarditi, Sidd Puri, Jose Oglesby. Accelerator: Using data parallelism to program GPUs for general-purpose uses. Technical Report MSR-TR.
[9] Jens Hamann. Parallelization of spreadsheet computations. Master's thesis, IT University of Copenhagen.
[10] Mark Harris. Technical report.
[11] Intel. Intel roadmap directions 2010. irdonline/pdf/ird_q2_2010_roadmap_all.pdf.
[12] NVIDIA internet forum. CUDA, NVIDIA GPUs and Microsoft Excel, May. http://forums.nvidia.com/lofiversion/index.php?t67720.html.
[13] Mark J. Harris, Greg Coombe, Thorsten Scheuermann, and Anselmo Lastra. Physically-based visual simulation on graphics hardware. Technical report, University of North Carolina.
[14] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Fermi.
[15] David Patterson. The top 10 innovations in the new NVIDIA Fermi architecture, and the top 3 next challenges. Technical report, NVIDIA. D.Patterson_Top10InnovationsInNVIDIAFermi.pdf.
[16] Microsoft Research. Accelerator v2 Programming Guide.
[17] Microsoft Research. An Introduction to Accelerator v2.
[18] Peter Sestoft. Spreadsheet technology. Draft manuscript.
[19] Andrew P. Wack. Partitioning dependency graphs for concurrent execution: A parallel spreadsheet on a realistic modeled message passing environment. PhD thesis, University of Delaware.

Appendix A. Test setups and results from prototype benchmarking

A.1. Built-in functions

Most built-in functions showed few performance gains in our earlier tests. Matrix multiplication was an exception and showed a potential performance gain, so we have tested it in our prototype.

Figure A.1: Matrix multiplication test setup

As can be seen from the figure, matrix multiplication was implemented to take two equally sized quadratic arrays of random numbers. The CPU is initially faster, at an array size of 8² (64). The GPU starts getting faster at size 96² (9216). For the sample we have made, the time spent on the GPU seems to grow linearly as a function of the data size, while the CPU's growth has a higher complexity.
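For reference, a sketch of the GPU side of this benchmark, assuming Accelerator v2's InnerProduct operation for matrix multiplication and the DX9Target used earlier:

    using Microsoft.ParallelArrays;
    using FPA = Microsoft.ParallelArrays.FloatParallelArray;
    using PA = Microsoft.ParallelArrays.ParallelArrays;

    static class MatrixGpu
    {
        // Multiply two equally sized quadratic matrices on the GPU.
        public static float[,] MMult(float[,] a, float[,] b)
        {
            FPA product = PA.InnerProduct(new FPA(a), new FPA(b));
            return new DX9Target().ToArray2D(product);
        }
    }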

Figure A.2: Matrix multiplication performance results (time in ns against data size, CPU vs. GPU)

The GPU is five times faster than the CPU at a size of 240² (57600), which is the largest possible dataset the GPU can handle during matrix multiplication. As mentioned, we have no indicator of when this maximum is reached other than our test results.

A.2. Sheet defined functions

Heron's formula

Heron's formula has been implemented as a sheet defined function and tested on data sizes from 1000 floats and up. This implementation has been used as a solid base for further tests, to see how different changes affect the performance.
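As a sketch of the element-wise graph the prototype builds for this SDF, using Heron's area formula sqrt(s(s-a)(s-b)(s-c)) with s = (a+b+c)/2, and assuming Accelerator v2's operator overloads, Sqrt, and the uniform-value FPA constructor quoted in listing 6.6:

    using Microsoft.ParallelArrays;
    using FPA = Microsoft.ParallelArrays.FloatParallelArray;
    using PA = Microsoft.ParallelArrays.ParallelArrays;

    static class HeronGpu
    {
        // Element-wise Heron's formula for the side arrays a, b, c,
        // all shaped rows x cols.
        public static FPA Heron(FPA a, FPA b, FPA c, int rows, int cols)
        {
            FPA half = new FPA(0.5f, rows, cols);  // uniform constant
            FPA s = (a + b + c) * half;            // semi-perimeter
            return PA.Sqrt(s * (s - a) * (s - b) * (s - c));
        }
    }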

Figure A.3: Different variations of Heron's formula in CoreCalc

Random data

In this implementation all three variables (A, B, C) in Heron's formula are random data generated by RAND() for each invocation. This means three unique FPAs are generated and sent to the GPU. Tests are run on data sizes from 1000 floats and up, and an intersection was not found. Using linear regression we can derive the slopes of the two time functions:

    HeronRandomCPU: 452.4939 ns per element
    HeronRandomGPU: 393.2681 ns per element

Though our tests do not show an intersection, the slopes imply that it will be hit well within the limits of the possible data sizes on the GPU.

Figure A.4: Heron's formula with random data, performance results (time in ns against data size, CPU vs. GPU)

Notice that with random values for A, B and C, not all generated examples will be able to form a triangle. We might end up with a negative value inside the SQRT, which might impact speeds on both the CPU and the GPU, making the results harder to compare with the two other Heron implementations.

Param data

Figure A.5: Heron's formula with parameter data, performance results (time in ns against data size, CPU vs. GPU)

In this implementation A and B are set as parameters and C is set to A + B/2. This way only two FPAs are sent to the GPU, along with an operation and a single constant number. An intersection was not found. Using linear regression we can derive the slopes of the two time functions:

    HeronParamCPU: 462.4716 ns per element
    HeronParamGPU: 457.2948 ns per element

Though our tests do not show an intersection, the slopes imply that it will be hit well within the limits of the possible data sizes on the GPU.

Constant data

Figure A.6: Heron's formula with constant data, performance results (time in ns against data size, CPU vs. GPU)

In this implementation all values are the same constant, so only one constant is transferred to the GPU. An intersection was not found. Using linear regression we can derive the slopes of the two time functions:

    HeronConstantCPU: 456.901 ns per element
    HeronConstantGPU: 440.3515 ns per element

Though our tests do not show an intersection, the slopes imply that it will be hit well within the limits of the possible data sizes on the GPU.

A.2.1. Fibonacci sequence

A given number in the Fibonacci sequence can be calculated using this formula:

    F_n = \frac{1}{\sqrt{5}}\left(\left(\frac{1+\sqrt{5}}{2}\right)^{n} - \left(\frac{1-\sqrt{5}}{2}\right)^{n}\right)

Figure A.7: Performance test of the Fibonacci sequence (time in ns against data size, CPU vs. GPU)

We implemented this as a simple sheet defined function in CoreCalc. The SDF uses 4 constants and one input value. The test is run from 1000 floats and up. The GPU surpasses the CPU at the crossover size visible in figure A.7.

Monte Carlo simulations

Monte Carlo simulations are used in a wide variety of industries to estimate probable outcomes where deterministic algorithms would take too long to compute or simply be too complex. They rely on random sample data and statistics, and are often implemented in Microsoft Excel. While we have implemented the data sampling in the simulations below, we have not spent time on calculating the actual results, because of the relatively few aggregate functions implemented in CoreCalc. Aggregate functions become reduction operations on the GPU, which are often not very efficient.

The lack of actual results might skew the numbers compared to a real simulation; however, we believe these results give good guidance, since the aggregation will often be faster to do on the CPU, and a user might want to analyse the result in many ways.

Greeting card estimation

This is straight out of an example from Microsoft of how Monte Carlo simulation can be used to make business decisions, simulating different types of demand scenarios and outputting the risk of failing. We converted this example into an SDF and ran the same simulation in CoreCalc.

Figure A.8: The valentine simulation SDF defined in CoreCalc

This function was benchmarked from 500 floats and up. The intersection point between the GPU and the CPU is around 5000 floats.

Figure A.9: The results of the valentine simulation (time in ns against data size, CPU vs. GPU)

Approximation of π

A Monte Carlo simulation can be used to approximate the value of π. This is done by generating a number of points within the square from (0, 0) to (1, 1), and afterwards counting the number of points that fall within the inscribed circle of this square. The ratio between the counted points and the generated points should be π/4. We created the SDF = IF(RAND()^2 + RAND()^2 <= 1, 1, 0). Running this n times, dividing the sum of the results by n, and multiplying by 4 should approximate π.
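For comparison, a sketch of the same estimate written directly against Accelerator, assuming v2's Cond and CompareLessEqual operations and the uniform-value FPA constructor from listing 6.6; the sampling happens on the CPU, the comparison on the GPU, and the aggregation back on the CPU:

    using System;
    using Microsoft.ParallelArrays;
    using FPA = Microsoft.ParallelArrays.FloatParallelArray;
    using PA = Microsoft.ParallelArrays.ParallelArrays;

    static class PiEstimate
    {
        public static double Run(int n)
        {
            Random rng = new Random();
            float[,] xs = new float[1, n], ys = new float[1, n];
            for (int i = 0; i < n; i++)
            {
                xs[0, i] = (float) rng.NextDouble();
                ys[0, i] = (float) rng.NextDouble();
            }
            FPA x = new FPA(xs), y = new FPA(ys);
            FPA one = new FPA(1f, 1, n), zero = new FPA(0f, 1, n);
            // IF(RAND()^2 + RAND()^2 <= 1, 1, 0), element-wise.
            FPA hit = PA.Cond(PA.CompareLessEqual(x * x + y * y, one), one, zero);
            float[,] hits = new DX9Target().ToArray2D(hit);

            double sum = 0;
            foreach (float h in hits) sum += h;
            return 4.0 * sum / n;   // the hit ratio approximates pi/4
        }
    }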

Figure A.10: The performance results of the π approximation simulation (time in ns against data size, CPU vs. GPU)

This was benchmarked from 2000 floats and up; the intersection point can be seen in figure A.10.

Commute time

Figure A.11: The commute time simulation SDF defined in CoreCalc

This test case uses Monte Carlo simulation to predict the commute time to work. As seen in the above image, we have two road segments and a traffic light. At the first road segment we have a 10% chance of hitting a traffic jam, and at the traffic light there is


More information

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology Data-Parallel Algorithms on GPUs Mark Harris NVIDIA Developer Technology Outline Introduction Algorithmic complexity on GPUs Algorithmic Building Blocks Gather & Scatter Reductions Scan (parallel prefix)

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Exotic Methods in Parallel Computing [GPU Computing]

Exotic Methods in Parallel Computing [GPU Computing] Exotic Methods in Parallel Computing [GPU Computing] Frank Feinbube Exotic Methods in Parallel Computing Dr. Peter Tröger Exotic Methods in Parallel Computing FF 2012 Architectural Shift 2 Exotic Methods

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing

More information

ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING. A Thesis. Presented to. the Faculty of

ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING. A Thesis. Presented to. the Faculty of ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING A Thesis Presented to the Faculty of California Polytechnic State University, San Luis Obispo In Partial Fulfillment of the Requirements for

More information

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

Parallelization of K-Means Clustering Algorithm for Data Mining

Parallelization of K-Means Clustering Algorithm for Data Mining Parallelization of K-Means Clustering Algorithm for Data Mining Hao JIANG a, Liyan YU b College of Computer Science and Engineering, Southeast University, Nanjing, China a hjiang@seu.edu.cn, b yly.sunshine@qq.com

More information

Abstract. Introduction. Kevin Todisco

Abstract. Introduction. Kevin Todisco - Kevin Todisco Figure 1: A large scale example of the simulation. The leftmost image shows the beginning of the test case, and shows how the fluid refracts the environment around it. The middle image

More information

GPGPU. Peter Laurens 1st-year PhD Student, NSC

GPGPU. Peter Laurens 1st-year PhD Student, NSC GPGPU Peter Laurens 1st-year PhD Student, NSC Presentation Overview 1. What is it? 2. What can it do for me? 3. How can I get it to do that? 4. What s the catch? 5. What s the future? What is it? Introducing

More information

Speeding up MATLAB Applications Sean de Wolski Application Engineer

Speeding up MATLAB Applications Sean de Wolski Application Engineer Speeding up MATLAB Applications Sean de Wolski Application Engineer 2014 The MathWorks, Inc. 1 Non-rigid Displacement Vector Fields 2 Agenda Leveraging the power of vector and matrix operations Addressing

More information

Effective Learning and Classification using Random Forest Algorithm CHAPTER 6

Effective Learning and Classification using Random Forest Algorithm CHAPTER 6 CHAPTER 6 Parallel Algorithm for Random Forest Classifier Random Forest classification algorithm can be easily parallelized due to its inherent parallel nature. Being an ensemble, the parallel implementation

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay Introduction to CUDA Lecture originally by Luke Durant and Tamas Szalay Today CUDA - Why CUDA? - Overview of CUDA architecture - Dense matrix multiplication with CUDA 2 Shader GPGPU - Before current generation,

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Parallel Execution of Kahn Process Networks in the GPU

Parallel Execution of Kahn Process Networks in the GPU Parallel Execution of Kahn Process Networks in the GPU Keith J. Winstein keithw@mit.edu Abstract Modern video cards perform data-parallel operations extremely quickly, but there has been less work toward

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information