Spreadsheet optimisation on GPUs

Spreadsheet optimisation on GPUs
Using Microsoft Accelerator

Bachelor thesis
The IT University of Copenhagen
May 18th, 2010

Authors: Tim Garbos, Kasper Videbæk
Supervisor: Peter Sestoft

Abstract

The objective of this project is to investigate whether it is possible to use the GPU for parallelizing evaluations in spreadsheets. We develop an experimental prototype based on the CoreCalc[18] spreadsheet implementation and interface with the GPU using Microsoft Accelerator[8]. CoreCalc introduces the concept of sheet defined functions, which are functions that users can define inside a spreadsheet. The project presents ideas for how to use the GPU efficiently in spreadsheets, with a focus on implementing sheet defined functions for the GPU. Benchmarking is performed to compare the performance to the original approach of compiling sheet defined functions to .NET bytecode. Methods of estimating the execution time of a function on the GPU and the CPU are also discussed. Based on our experiments and benchmarks we conclude that it is possible to use the GPU for evaluating functions in spreadsheets, but rather large amounts of data are needed to obtain performance gains. On current GPUs, such gains are realistic in complex, data-heavy cases such as Monte Carlo simulations.

Contents

1. Introduction
   1.1. Context and motivation
   1.2. Problem statement
   1.3. Goals and methods
   1.4. Thesis overview
2. Background
   2.1. Spreadsheet technology
        Introduction to spreadsheets
        CoreCalc spreadsheet implementation
   2.2. Parallel computing and GPGPU
        Parallel Programming Methods
        Stream processing
        Programming for the GPU
        Bottlenecks
        Problems
   2.4. Previous work on parallelism and GPGPU in spreadsheets
3. Microsoft Accelerator
   3.1. Accelerator and the GPU
   3.2. The C# interface
        Operations
        Programming
   3.4. Other findings
        Behavior of booleans
4. Analysis of Microsoft Accelerator
   4.1. Hardware setup
   4.2. Constructing the tests
   4.3. Test results
5. GPGPU approaches for spreadsheets
   5.1. Single normal built-in functions
   5.2. Sheet defined functions
   5.3. Higher order Map function
   5.4. When to use the GPU?
        Estimating execution time
6. Implementation of prototype
   6.1. Built-in functions
   6.2. Sheet defined function
        Accelerator Abstract Syntax
        Higher order functions
        Evaluation leafs on the CPU
7. Performance test of prototype
   How the tests were executed
   Floating point precision and performance
   Hardware
   Results
        Built-in function
        Sheet defined function
8. Perspective
   The future of GPUs
   Other targets of parallelism using Accelerator
   Known problems
        Converting Values
        Limitations in Accelerator abstract syntax
        Exceeding memory and texture limits
        Worst Case Execution Time estimation
9. Conclusion
Bibliography
Appendix
A. Test setups and results from prototype benchmarking
   A.1. Built-in functions
   A.2. Sheet defined functions
        A.2.1. Fibonacci sequence

List of Figures

2.1. Heron's Formula as a Sheet Defined Function in CoreCalc
4.1. Time from sending a data input to getting it back again
4.2. Performance tests of single simple operations on the GPU compared to the CPU
4.3. Performance tests of matrix multiplication on the GPU compared to the CPU
4.4. Graph of nested addition operations
4.5. Performance tests on nested mixed operations on the GPU compared to the CPU
5.1. Formula for estimated computation time using the CPU
5.2. Formula for estimated computation time using the GPU
6.1. Class hierarchy of Expr types[18]
6.2. Class hierarchy of AccExpr
A.1. Matrix multiplication test setup
A.2. Matrix multiplication performance results
A.3. Different variations of Heron's formula in CoreCalc
A.4. Heron's Formula with random data, performance results
A.5. Heron's Formula with parameter data, performance results
A.6. Heron's Formula with constant data, performance results
A.7. Performance test of the Fibonacci sequence
A.8. The valentine simulation SDF defined in CoreCalc
A.9. The results of the valentine simulation
A.10. The performance results of the π approximation simulation
A.11. The commute time simulation SDF defined in CoreCalc
A.12. The performance results of the commute example

Code samples

2.1. An embarrassingly parallel problem
3.1. Accelerator code for addition
3.2. Heron's formula in FPA
3.3. Booleans in Accelerator
4.1. evalTarget code
4.2. FloatParallelArray constructors
4.3. Add operation
6.1. Applier
6.2. GPU and CPU applier
6.3. Original code for creating MMULT
6.4. Modified code for creating MMULT
6.5. AccExpr class
6.6. AccNumber class
6.7. Generate FPA for AccInput
6.8. GenerateBPA of AccComp
Method for minimizing count of created AccExpr objects
Original CoreCalc code for Tabulate
Modified CoreCalc code for Tabulate

List of Tables

4.1. Hardware specification
4.2. Test results from single operations
Original input arguments for the call TABULATE(F, 2, 8)
Reorganized input arguments for the call TABULATE(F, 2, 8)

Foreword

This thesis was written by Tim Garbos and Kasper Videbæk in the period from February 2010 to May 2010 at the IT University of Copenhagen. It is part of a 15 ECTS Bachelor's thesis project supervised by Peter Sestoft. This report, including additional documents, files and source code, is available for download.

Chapter 1. Introduction

1.1. Context and motivation

Spreadsheets are used in almost every business, for everything from keeping track of work hours, project planning, and simple lists to financial calculations and research simulations. While one might not even notice the time it takes to recalculate a spreadsheet that keeps track of work hours, the time needed to calculate a complex Monte Carlo based financial simulation is indeed noticeable. The focus of this thesis is primarily on optimizing this kind of calculation.

1.2. Problem statement

The objective of this project is to investigate whether using the GPU for parallelizing the evaluation of functions in spreadsheets is possible and whether this can be done with a performance gain. Within this objective we also want to investigate whether sheet defined functions (described in section 2.1) can be efficiently restructured to fit the GPU target.

1.3. Goals and methods

To do this we develop an experimental prototype based on the CoreCalc[18] spreadsheet engine. This experimental prototype also includes testing of the Microsoft Accelerator[8] framework v2 preview release for interfacing with the GPU. Our investigation includes searching for and relating to literature about spreadsheet optimisation in relation to parallelism and GPGPU (General Purpose computation on Graphics Processing Units).

We will analyse Microsoft Accelerator and look into its strengths, limitations, and weaknesses in relation to parallelizing spreadsheet evaluations. Furthermore we will benchmark different types of operations and look at the performance difference between a GPU and a CPU. With Microsoft Accelerator's strengths and limitations in mind, and based on the current literature, we will discuss different approaches to parallelizing spreadsheets. We will also discuss a model for estimating, at evaluation time, the execution time of a specific function on the GPU. The prototype implements different approaches to parallel evaluation in spreadsheets; we document and discuss the design of these implementations. Finally, we benchmark our prototype against the original CoreCalc implementation, using different spreadsheet based simulations and real life examples, to determine whether or not it makes sense to evaluate spreadsheet calculations on a GPU.

1.4. Thesis overview

This thesis is structured as follows. Chapter 2 provides background on spreadsheet technology, on parallel computing in relation to graphics processing units, and on the previous research and work done in this area. Chapter 3 gives a technical overview of Microsoft Accelerator, which claims to provide an easy to use interface to parallel computing and GPUs. In chapter 4 we analyse and test Microsoft Accelerator in order to determine what possibilities it provides that may be relevant for parallel evaluation of spreadsheets. The results of chapter 4 are used in chapter 5 to discuss different approaches to parallelizing spreadsheets using the GPU. Chapter 6 describes our experimental prototype; we document and discuss its design and structure. In chapter 7 we document our benchmarks of the experimental prototype against the original CoreCalc implementation and analyse the test results. Chapter 8 relates our results to the future of graphics processors and discusses known problems in our implementation. Finally, we summarize our results and conclude whether or not it is possible to efficiently evaluate spreadsheets in parallel using the GPU.

Chapter 2. Background

2.1. Spreadsheet technology

Introduction to spreadsheets

The most popular spreadsheet application today is Microsoft Excel, but computerized spreadsheets as we know them have been in use since the first WYSIWYG (What You See Is What You Get) spreadsheet application, VisiCalc, developed for the Apple II computer by Dan Bricklin and Bob Frankston [4]. In common spreadsheet software a spreadsheet consists of a workbook that can contain several sheets. Each sheet has multiple cells that together form a grid of rows and columns. Each cell can contain a text value, a number value or a formula. A formula describes how the value of the cell can be calculated from the values of other cells. When a cell is updated, the values of all cells referencing that cell are recalculated. This ensures that the cells of the entire sheet automatically stay up to date.

CoreCalc spreadsheet implementation

In this thesis we base our prototype on CoreCalc [18]. CoreCalc is an open source implementation of core spreadsheet functionality in C#. It is developed at the IT University of Copenhagen and is intended only as a platform for experiments with new technology and functionality. As the documentation states, it is not a replacement for Microsoft Excel, Gnumeric or Open Office Calc, but a research prototype. It might have been possible to base our prototype on Open Office Calc or Gnumeric, as they are both open source, but they are also far more complex and feature rich than CoreCalc, and it might not have been possible to implement our prototype without rewriting parts of their spreadsheet engines. CoreCalc, however, is built as a platform for new experiments, and it features sheet defined functions. Sheet defined functions are separate from the normal spreadsheet evaluation, which gives more possibilities for optimisation. Neither Gnumeric nor Open Office Calc has sheet defined functions.

Spreadsheet programs

The following overview of how a spreadsheet program works is based on A Spreadsheet Core Implementation in C# [18]. Spreadsheet programs are dynamically typed functional programs that are programmed through simple formulas in cells. Spreadsheets handle data types such as strings, numbers, logical expressions, and matrices, but they are handled dynamically. This makes formulas very dynamic, and one can easily introduce an error such as =SQRT(IF(A1 < 0; "Hello world"; 25)), which returns 5 if A1 >= 0 and otherwise returns an error, because SQRT only takes a number as its argument. Functional programming is a paradigm that resembles the evaluation of mathematical functions; for example, the evaluation of spreadsheet cells avoids state and mutable data. One cell cannot change the value of another cell unless that cell somehow depends on it. In functional languages one distinguishes between strict (eager) and non-strict (lazy) evaluation. In eager evaluation all expressions are evaluated regardless of whether they are used or not. In lazy evaluation an expression is only evaluated when there is a demand for it, and the result is then cached so that other demands can use the cached value. Spreadsheets have a similar concept in that a cell is only re-evaluated when one of the cells it references is updated.

Sheet defined functions

CoreCalc introduces a new concept called Sheet Defined Functions (or SDFs for short). Sheet defined functions allow spreadsheet users to define functions which can be used just like normal built-in functions throughout the entire workbook.

Figure 2.1.: Heron's Formula as a Sheet Defined Function in CoreCalc

In fig. 2.1 Heron's formula is implemented. The green cells are input cells that act as arguments for the function, and the blue cell is the output cell, whose formula depends on the input cells. The user is able to use normal formulas and cell references when defining the function. In this figure Heron's formula is implemented with A4 and B4 as input cells, while the last side of the triangle simply calls a random function. In CoreCalc, SDFs are compiled into .NET bytecode for optimised calculations.

2.2. Parallel computing and GPGPU

Parallel computing has been around since the 1950s and has been employed in high-performance computing (computer clusters). It has mainly been used for research, and the first dual-core processors reached the public in the mid-2000s. The reason for the growth in public interest is the physical constraints preventing further increases in the number of operations a single CPU core can perform (frequency scaling). Instead of frequency scaling, more cores are now added to CPUs, and according to Intel[11], an 8-core Xeon CPU has been announced. However, multicore CPUs are not the only modern method of parallelism in computers. General Purpose computation on Graphics Processing Units (GPGPU) means using the GPU, which is designed for computing graphics, for purposes that are normally handled by the CPU. Graphics cards were originally designed to be parallel in order to process each vertex in a 3D model independently. The term GPGPU was first coined in the 2002 paper Physically-Based Visual Simulation on Graphics Hardware [13]; however, research in using GPUs for general purpose computation had been around for a while. In 1999 the PixelFlow SIMD graphics computer was used to crack UNIX password ciphers[7] using a brute-force attack. Graphics card manufacturers have lately added more precise arithmetic to the GPU, making it more suitable for non-graphics work such as scientific computing[14]. In 2006 both NVIDIA's CUDA SDK and ATI's CTM SDK were made public, thereby making GPGPU possible without detailed expert knowledge of the graphics API. Because GPUs are designed for graphics processing they are very limited in terms of programming possibilities and are only efficient when the problems can be computed using stream processing. They can only process single vertices, but can process multiple vertices in parallel. Modern GPUs typically have more than 100 cores (processors), and the new Fermi model from NVIDIA introduces 512 cores[14]. This makes GPUs ideal for operations that should be applied to every value of a large dataset. When writing parallel programs for a 4-core CPU, the theoretical maximum performance gain is 400% (minus some overhead). As GPUs work very differently from normal CPUs and have far more cores, the theoretical maximum performance gain is far better.

Parallel Programming Methods

There are two main parallel programming methods: task-parallelism, where each task can be separated and executed on another processor while still communicating with other tasks, and data-parallelism, where the data for one task can be partitioned and processed individually. GPUs do not fully support synchronization between processors, so classical task-parallel programming methods cannot be used: a thread cannot spawn a new thread or send results to other threads. This leads to data-parallelism.

for (int i = 0; i < N; i++)
{
    r[i] = a[i] + b[i];
}

Listing 2.1: An embarrassingly parallel problem

Given the above C# code, the array r is sequentially filled with the results of the add operation. This can be optimised using data-parallelism: the arrays a and b can be divided into smaller chunks, and given several processors, the chunks can be processed independently and later collected. The example can be generalized by substituting a[i] + b[i] with any non-volatile function; a sketch of such a chunked, data-parallel evaluation is shown below.
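To illustrate the idea, here is a minimal sketch of ours (not part of CoreCalc); it uses Parallel.For from .NET 4, which partitions the index range across the available cores, and the delegate f plays the role of the non-volatile function:

using System;
using System.Threading.Tasks;

static class DataParallelDemo
{
    // Applies a side-effect-free function element-wise; the runtime
    // splits the index range into chunks and runs them on all cores.
    static float[] MapParallel(float[] a, float[] b, Func<float, float, float> f)
    {
        var r = new float[a.Length];
        Parallel.For(0, a.Length, i => { r[i] = f(a[i], b[i]); });
        return r;
    }

    static void Main()
    {
        float[] a = { 1f, 2f, 3f }, b = { 4f, 5f, 6f };
        float[] sum = MapParallel(a, b, (x, y) => x + y);
        Console.WriteLine(string.Join(", ", sum)); // 5, 7, 9
    }
}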

Problems such as this, which can easily be executed in parallel, are classified as embarrassingly parallel problems. Graphics cards solve many data-parallel problems when doing graphical computations, as in games, and modern GPUs typically have over 100 cores[14].

Stream processing

Stream processing is a programming paradigm that uses data-parallelism. In stream processing an algorithm is constructed by defining a kernel, which is a function that processes data and returns an output. A kernel is applied to a set of data called a stream. This works well for processing images and video, but it is not designed for general purpose processing using random data access, control flow or database lookups. GPUs, however, are designed to be efficient at stream processing. Stream processing is intended for applications that have a high number of arithmetic operations per I/O and where every element of the stream should have the same function applied to it. This means that an optimal application for the GPU has a large data set, a high degree of parallelism, a kernel of high arithmetic complexity to be applied to every element, and minimal dependency between operations. In sequential programming for the CPU it is common to control the flow of the program using loops or conditions such as if/then/else. Such flow control has until recently not been possible on the GPU and is still quite limited: some recent GPUs allow branching (if/then/else), but not without a performance loss[2].

Programming for the GPU

Some low level APIs allow for general purpose GPU programming. This section gives a short overview of the different possibilities.

CUDA

Compute Unified Device Architecture, or CUDA, is NVIDIA's architecture for communicating with the GPU from standard programming languages, in this case C, though wrappers for other languages exist. It shares a large part of its interface with both OpenCL and DirectCompute, but is only available for NVIDIA hardware.

OpenCL

The Open Computing Language is an open standard in the spirit of OpenGL and OpenAL (3D graphics and audio) for writing data-based and task-based parallel applications. It shares a range of interfaces with CUDA and DirectCompute, but is managed by the non-profit technology consortium Khronos Group. OpenCL is not bound to specific hardware, and AMD has decided to support OpenCL instead of its now deprecated Close to Metal API (AMD's alternative to CUDA).

DirectCompute

As part of the DirectX framework, DirectCompute is a low level API for programming the GPU. Naturally this too shares a range of interfaces with OpenCL and CUDA.

Microsoft Accelerator

For all of the above frameworks, higher level wrappers exist, and you can program in Python, .NET, Java, or any other popular mainstream language. Other, more specialized frameworks that avoid GPU specific programming are also being developed. Microsoft Accelerator [17] is a high level framework that allows data-based parallelism on the GPU through DirectCompute. It was first introduced in the 2005 technical report Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses [8] by David Tarditi, Sidd Puri, and Jose Oglesby. Their problem statement is that GPUs are difficult to program for general-purpose uses: programmers must either convert their programs to use graphics pipeline operations or use APIs for stream processing. The result of their project is Microsoft Accelerator, a library that uses data-parallelism to program GPUs. The idea is that programmers can use a normal imperative language together with a high level API for data-parallel operations, without worrying about the GPU. The Microsoft Accelerator library compiles the high level data-parallel operations to optimised pixel shaders on the fly. Their benchmarks show that the speed of some compiled operations is comparable to hand-written pixel shaders, with performance typically within 50% of the hand-written shader.

Bottlenecks

When programming for the GPU, the data has to be transferred to the GPU before the result can be computed, and the result has to be transferred back as well. This creates a latency between the CPU and the GPU. The bandwidth and memory limitations of the GPU also differ from those of the CPU. Later in the project we identify these bottlenecks.

Problems

Today integer and double operations are only supported on the newest NVIDIA Fermi cards using CUDA, but according to NVIDIA[15] this problem will be solved in the near future. The same goes for floating point precision, which does not yet match the IEEE 754 floating point standard[10].

2.4. Previous work on parallelism and GPGPU in spreadsheets

We have not been able to find any research or products that explore the use of the GPU to optimize heavy spreadsheets. The topic has been mentioned several times in online forums and in interviews. When it comes to parallelism and spreadsheets in general, only a limited amount of research and products exist. In Andrew P. Wack's PhD dissertation from 1996, Partitioning dependency graphs for concurrent execution: A parallel spreadsheet on a realistically modeled message passing environment [19], he describes how to partition the spreadsheet graph in order to split the computation across multiple computers on a network. Jens Hamann's Master's thesis from 2010 on Parallelization of spreadsheet computations [9] investigates the possibilities of evaluating spreadsheet calculations in parallel and adapts some of Wack's theories to modern multicore CPUs using CoreCalc. Microsoft Excel 2007 has an option for enabling a multithreaded calculation engine that is described as being able to partition groups of cells that can be parallelized. There is no technical documentation of the design of Excel's multithreaded calculation engine or of how it partitions the graph. Hamann mentions the possibility of using the parallel power of the GPU in spreadsheets, and further mentions Microsoft Accelerator as a library that would enable elegant implementations of such an approach. As a consequence of there being very little information on GPGPU and spreadsheets, we have broadened our search to include internet forums. In an interview about GPGPU, Ian Buck from NVIDIA says: "We think your spreadsheet might already be fast enough. While video processing was an obvious application to accelerate (...)". This quote is questioned several times in the comments. Dr. Drey writes: "NVIDIA, saying that spreadsheet is already fast enough may be misleading. Business users have the money. Spreadsheets are already installed (huge existing user base). Many financial spreadsheets are very complicated 24 layers, 4,000 lines, with built in Monte Carlo simulations. Making all these users instantly benefit from faster computing may be the road for success for NVIDIA." Other comments support this idea. On the NVIDIA CUDA forum, in the topic CUDA, NVIDIA GPUs and Microsoft Excel [12], different approaches to and reasons for using CUDA in spreadsheets are discussed. A specific computationally heavy sheet is discussed, and it is concluded that it is not the kind of job for CUDA or any other GPGPU approach. This is based on the fact that GPUs have an SPMD (single program, multiple data) approach, which means they work best when a single kernel (operation) is applied to a big data set. The argument continues, and a theoretical CUDA-accelerated spreadsheet is discussed.

This theoretical spreadsheet would have to be sliceable into series that run the exact same formula on a big data set, such as a whole column. That formula could then be compiled into a kernel and the data uploaded as a stream. It is stated in the forum thread that the sheet would have to be outrageously big (hundreds of thousands of lines) for a simple function to be optimized using the GPU. This statement might have been correct at the time, due to the high transfer latency between the GPU and the CPU, and because GPUs at the time of writing (2008) were not as fast and as well suited for GPGPU as modern GPUs. Moreover, when using complex Monte Carlo simulations, thousands of lines are not unrealistic. Challenges such as float accuracy and rounding problems are also mentioned. In the same forum thread, Hyun-Gon Ryu from Yonsei University discusses different approaches for integrating CUDA with Microsoft Excel using VBA; that is not within the scope of this project, though.

Chapter 3. Microsoft Accelerator

Microsoft Accelerator (referred to as Accelerator from here on) was first developed as a research project aiming to create a GPGPU framework for C#[8]. In its second version it has turned into a general framework for solving data-parallel problems. The current version supports calculations both on the GPU and on multiple processors, and later versions might implement other targets, for example one for FPGAs[17]. This project primarily investigates the DX9Target of Accelerator v2, and this chapter gives a quick overview of how Accelerator works as a middle layer between C# programmers and the GPU.

3.1. Accelerator and the GPU

Accelerator's DirectX 9 target solves data-parallel problems by translating the data into textures and the operations into texture shaders. This allows us to do calculations on the GPU. The procedure for translating the function and data in Accelerator is described in Introduction to Accelerator [17]:

1. Translate the processing code into a form suitable for a GPU by converting it to a DirectX 9 pixel shader.
2. Translate the data into a format that is suitable for the processor by converting it to a DirectX 9 texture.
3. Transfer the shader and textures to the processor and run the operation. DirectX 9 and the associated drivers partition the data and schedule execution on the various pixel shaders. With other processors, your application might have to handle some or all of these tasks.
4. When the operation is complete, retrieve the texture containing the results and convert it back to an array.

3.2. The C# interface

Accelerator works with datatypes called parallel arrays, represented by the classes FloatParallelArray (FPA), BoolParallelArray (BPA) and IntParallelArray (IPA). All of these inherit from the class ParallelArray (PA). Operations on parallel arrays are functional in nature, with no state or mutable data: they have no side effects, operations do not modify their arguments, and results are returned in new arrays.

Operations

Before we go further we need to quickly classify the operation types. Accelerator classifies operations in six categories, described in Introduction to Accelerator [17]:

Construction: The framework provides methods for creating parallel arrays from a System.Array containing the same elements.

Conversion: It also provides methods for converting a parallel array result back into the System.Array type.

Element-wise operations: Most operations are element-wise; take for example the add operation. It takes the nth element of the first array and adds it to the nth element of the second array, resulting in a new array of the same size.

Reductions: Reduction operations reduce the size of the array. A sum operation on an n × m array may compute the sum of each row and return a 1 × m array, or it may even compute the sum of all values and return a 1 × 1 array.

Transformations: Operations that transform the organization of the elements, such as matrix transpose.

Linear algebra: Accelerator provides binary matrix operations such as scalar product, matrix multiplication and outer product.

Programming

The Accelerator Programmers Guide[16] describes the following general procedure for working with Accelerator:

1. Create input arrays.
2. Load each array from step 1 into an Accelerator data-parallel array object.
3. Process the input data by applying Accelerator operations to the data-parallel array objects.
4. Evaluate the results of the operation on a target processor, which returns an array containing the processed data.

All operations, except construction operations, are created by calling static member functions on the PA class; an add operation is created by calling PA.Add, and so forth. Parallel arrays with values are created by constructing FPA objects. The following code (listing 3.1) shows an example where two arrays are created and, using the element-wise Add operation, each pair of elements is added together and returned in the result array.

float[] AddTwoNumbers(float[] x, float[] y) {
    var inputX = new FPA(x);
    var inputY = new FPA(y);
    var add = PA.Add(inputX, inputY);
    float[] result = new float[x.Length];
    evalTarget.ToArray(add, out result);
    return result;
}

Listing 3.1: Accelerator code for addition

The following example is an implementation of Heron's formula, which gives the area of a triangle from its three sides:

s = (a + b + c) / 2
A = sqrt(s(s - a)(s - b)(s - c))

The formula can be implemented in Accelerator like this:

float[] HeronGPU(float[] x, float[] y, float[] z)
{
    var fpaX = new FPA(x);
    var fpaY = new FPA(y);
    var fpaZ = new FPA(z);

    var s = PA.Divide(PA.Add(PA.Add(fpaX, fpaY), fpaZ), 2);

    var sx = PA.Subtract(s, fpaX);
    var sy = PA.Subtract(s, fpaY);
    var sz = PA.Subtract(s, fpaZ);

    var area = PA.Sqrt(PA.Multiply(s, PA.Multiply(PA.Multiply(sx, sy), sz)));

    float[] result = new float[x.Length];
    evalTarget.ToArray(area, out result);
    return result;
}

Listing 3.2: Heron's formula in FPA

Notice in this example (listing 3.2) that we create only one FPA per float array, and that we reuse these several times while building up the Accelerator operations. If we created several FPAs for a single array, we would transfer the same data several times. This becomes important during the implementation in CoreCalc. Further, notice that PA.Divide takes both an FPA and an ordinary float, meaning that each element in the FPA is divided by two. Most element-wise functions allow this.
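For comparison with the CPU benchmarks later in the thesis, a plain sequential C# version of the same computation could look as follows (our own baseline sketch, not taken from the CoreCalc sources; it requires using System for Math.Sqrt):

// Sequential CPU version of Heron's formula, element-wise over
// three arrays of triangle side lengths.
static float[] HeronCPU(float[] x, float[] y, float[] z)
{
    var result = new float[x.Length];
    for (int i = 0; i < x.Length; i++)
    {
        float s = (x[i] + y[i] + z[i]) / 2f; // semi-perimeter
        result[i] = (float)Math.Sqrt(s * (s - x[i]) * (s - y[i]) * (s - z[i]));
    }
    return result;
}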

3.4. Other findings

While working with Accelerator, we have noticed some behavior that is worth mentioning.

Behavior of booleans

Creating BPAs from booleans and working logically with these seems to generate some puzzling results. Consider the following code:

Target t = new DX9Target();
bool[] x = { false, false, false, false };
bool[] r = new bool[4];
bool[] r2 = new bool[4];
t.ToArray(new BPA(x), out r, ExecutionMode.ExecutionModeNormal);
t.ToArray(PA.Not(new BPA(x)), out r2, ExecutionMode.ExecutionModeNormal);

Listing 3.3: Booleans in Accelerator

After executing this (listing 3.3), r holds the values {true, false, false, false} and r2 holds {false, false, false, false}, neither of which is what one would expect from an all-false input. Conditional statements, PA.Cond(BPA, FPA, FPA), seem to work as intended in Accelerator, and compare functions such as PA.CompareLessThan(FPA, FPA) also seem to return true or false as logic would prescribe. Applying further logical operations such as PA.Not to a BPA returned by a compare function also seems to give correct results. Our conclusion is that it is possible to work with conditional statements in Accelerator; however, you should never construct BPAs yourself - instead you should rely on Accelerator to create them for you.

Floating point numbers

The Accelerator API works only with floating point numbers of the float type, most likely because at the time of writing (and development) of Accelerator most GPUs only supported 32-bit floats. Further, many graphics cards are not IEEE 754-compliant[10], which means that results calculated on the GPU might differ from results calculated on the CPU, even at the same nominal precision. NVIDIA Fermi will support double precision in the future[14].

Parameters

Values in FPAs cannot be substituted once the FPA object has been created. This means we cannot build up an expression tree and later substitute the values inside it, which would be ideal for constructing function calls in Accelerator syntax and later substituting the parameters. The Accelerator developers may include this feature in the future[17].

Random numbers

Accelerator does not implement any way to generate random numbers. If random data is needed, it has to be generated on the CPU and transferred to the GPU. Future versions of Accelerator are expected to implement this[17].
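Returning to the boolean findings above, the recommended pattern is to let a compare operation produce the BPA that feeds a conditional. A small sketch of ours, using only operations named in this chapter:

// Element-wise max(a, 0). The BPA is produced by CompareGreater,
// never constructed by hand from a bool[], per the advice above.
static float[] ClampToZero(Target evalTarget, float[] a)
{
    var fpa = new FPA(a);
    var zero = new FPA(0f, a.Length);              // constant array of zeros
    var isPositive = PA.CompareGreater(fpa, zero); // BPA created by Accelerator
    var clamped = PA.Cond(isPositive, fpa, zero);  // element-wise if/then/else
    var result = new float[a.Length];
    evalTarget.ToArray(clamped, out result);
    return result;
}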

Chapter 4. Analysis of Microsoft Accelerator

In this chapter we look into calculations in Accelerator and benchmark them against C# implementations. We identify operations from spreadsheets that may be possible to optimize using Accelerator, and look into the limitations of the framework. We will try to answer the following five questions:

Maximum data: What is the maximum amount of data that we can send to the GPU?

Transfer time: What is the minimum transfer time to the GPU, and how does the amount of data affect the transfer time?

Single operations: What is the performance impact on single operations?

Complex operations: What is the performance impact on complex operations?

Value creation: Is there any performance impact of the different ways to create values?

4.1. Hardware setup

All tests have been run on one specific machine; different hardware setups will of course yield different results in the following performance tests. The machine is what would normally be classified as a gaming machine. It is from Hewlett Packard and the model number is Z400. The GPU is an NVIDIA GT240 and the CPU is an Intel Xeon W3505. It runs Windows XP 32-bit and has DirectX 9 installed. The following table summarizes the hardware specifications.

NVIDIA GT240
  CUDA Cores: 96
  Graphics Clock: 550 MHz
  Processor Clock: 1340 MHz
  Memory Clock: 1700 MHz GDDR5
  Memory: 1 GB
  Memory Interface Width: 128-bit
  Memory Bandwidth: 54.4 GB/sec
  Bus Support: PCI-E 2.0

Intel Xeon W3505
  Cores: 2
  Threads: 2
  Clock speed: 2.53 GHz
  Intel Smart Cache: 4 MB
  Instruction set: 64-bit

Table 4.1.: Hardware specification

The machine further has 4096 MB of DDR3 RAM installed.

4.2. Constructing the tests

In order to ensure stable results in the performance tests, each test case has been constructed to be executed and timed 100 times with randomly generated input data. The results of the GPU and CPU versions of the test cases have been compared and verified to be approximately correct (taking floating point precision problems on the GPU into consideration). The tests have all been built using Visual Studio 2010 release settings and executed outside the Visual Studio environment to ensure that no unnecessary monitoring was done.

4.3. Test results

Maximum data

There are two important factors to consider when looking at the amount of data we can transfer to and process on the GPU: the maximum texture size, and the amount of memory available. The maximum texture size defines the maximum width and height of an array of floats that we are able to send to the GPU. The memory further limits how many textures, and how complex shaders, can be stored. To find the maximum texture size, we simply sent textures of increasing size to the GPU until it returned an error, which happened at around 8000 × 8000. We were not able to find a method for determining the maximum complexity of operations; however, we did find that matrix multiplication is only possible on sizes smaller than two 240 × 240 arrays.

Transfer time

Transfer time can be divided into two components: latency and transfer speed. We define latency as the initial time it takes to transfer data to the GPU. Transfer speed is defined as the time it takes to transfer a single float value. This model is simplified in relation to the hardware architecture, but it suits our purposes:

transferTime(x) = latency + x · speed

To measure the transfer time, we ran the following code on different sizes of x. Note that even though we use the term transfer time, it might be more accurately described as the overhead of any Accelerator evaluation on a GPU target.

evalTarget.ToArray(new FloatParallelArray(x), out result);

Listing 4.1: evalTarget code

Figure 4.1.: Time from sending a data input to getting it back again

Using regression on the data depicted in figure 4.1, we arrived at a latency of 2.1 ms and a speed of 1.68E-05 ms per float. Note that this is the sum of the time for transferring the data to the GPU and transferring it back.
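Plugging the measured constants into the simple model above gives a quick estimator. The sketch below is ours, with the latency and per-float cost hard-coded from the regression:

// Estimated round-trip time (ms) for evaluating an Accelerator
// expression over n floats, using the constants measured above.
static double EstimateTransferMs(long n)
{
    const double latencyMs = 2.1;      // fixed overhead per evaluation
    const double msPerFloat = 1.68e-5; // marginal cost per float, both ways
    return latencyMs + n * msPerFloat;
}

For example, EstimateTransferMs(1000000) gives about 18.9 ms for a round trip of one million floats.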

Single operations

In this section we look into the performance impact of a few single operations in Accelerator and compare them with C# implementations that do the same job. We further derive the actual cost of an operation by doing linear regression on the data and looking at the difference between the overall time spent and the transfer time examined in the previous section.

Figure 4.2.: Performance tests of single simple operations on the GPU compared to the CPU

Using linear regression, we looked at the slopes of both the C# versions and the Accelerator counterparts of the functions above (If, Add, Div, Mul, Sub, Square, and Sum). Delta Slope in the table below means the difference between the slope for a whole operation and the slope for transferring the data. Note that it has not been possible to measure the slope for transferring two constants; we simply multiplied the value for transferring one constant by two. Also note that the Sum operation only has half the transfer slope: for a sum operation, while a lot of data is transferred to the GPU, only a single value is transferred back. These are of course simplifications, and are to some extent inaccurate.

Operation   Slope C#    Slope Transfer   Slope Operation   Delta Slope
Add         6.00E-06    3.36E-05         5.64E-05          2.28E-05
Sub         6.84E-06    3.36E-05         4.57E-05          1.21E-05
Mul         6.82E-06    3.36E-05         4.54E-05          1.18E-05
Div         7.81E-06    3.36E-05         4.57E-05          1.21E-05
Sqr         8.57E-05    1.68E-05         2.00E-05          3.21E-06
Sum         1.69E-05    8.40E-06         2.04E-05          1.20E-05
If          1.67E-05    3.36E-05         2.02E-05          -1.34E-05

Table 4.2.: Test results from single operations

The table gives us an estimate of how well the GPU performs a given operation compared to the CPU, also when the transfer overhead is excluded. Looking only at the slopes, most functions will, given enough data, perform faster on the CPU than on the GPU. However, because of the limits of the GPU, it might not always be possible to reach the amount of data needed. The slope for the Sum operation on the GPU is very close to the slope for the CPU version; this is a general tendency for reduction operations on GPU targets[17]. If we had a slower graphics card or a faster processor, this operation would actually be slower overall, leaving us with no reason at all to transfer such an operation to the GPU.

Figure 4.3.: Performance tests of matrix multiplication on the GPU compared to the CPU

We found good performance gains for a few single operations; matrix multiplication was one. It is a more complex function that requires a series of arithmetic operations, and the test results above show that a performance gain is possible at realistic data sizes.

Complex operations

In this test case we construct complex operations by building graphs of simple operations. This is done to test how the complexity of an operation affects the time spent on the computation, and to mimic possible spreadsheet formulas, for example where A1 has the formula = B1 + C1 and B1 and C1 have formulas pointing to other cells. These graphs have been constructed by nesting simple arithmetic operations as shown in fig. 4.4: each time the graph grows by one, the previously generated graph is used as the left leaf of a new operation, and the right leaf is the same constant FPA as used earlier.

Figure 4.4.: Graph of nested addition operations

Similar graphs for subtraction, multiplication, division, and a graph of mixed operations have been used in the test. The graph of mixed operations switches between multiplication, addition, and subtraction, starting with multiplication.
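As an illustration of how such a graph is built with Accelerator operations (our own sketch; PA and FPA are the abbreviations introduced in chapter 3):

// Builds a left-deep graph of n nested Add operations:
// ((input + c) + c) + ... Reusing the same constant FPA means
// the constant is only transferred to the GPU once.
static FPA BuildNestedAdds(FPA input, FPA constant, int n)
{
    var graph = input;
    for (int i = 0; i < n; i++)
        graph = PA.Add(graph, constant); // previous graph becomes the left leaf
    return graph;
}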

Figure 4.5.: Performance tests on nested mixed operations on the GPU compared to the CPU

As shown in the above chart (fig. 4.5), the CPU time for computing the multiplication and division graphs is low compared to that of the GPU; with only 6-10 nested operations it is possible to outperform Accelerator on large datasets. The Pow2 operation performs very well, probably because of the use of the constant 2. Multiplication also shows good performance. Based on this we conclude: the more complex an operation is, the more potential performance gain there is in running it on the GPU.

Value creation

Values can be created in several different ways in Accelerator. In this section we compare the performance of the different types of array creation.

Creating arrays

FloatParallelArrays can be created in two ways in Accelerator:

public FloatParallelArray(float f, params int[] shape);
public FloatParallelArray(float[,] values);

Listing 4.2: FloatParallelArray constructors

This means that if we want to create an array of constant values, we can either fill a two-dimensional array with the same value, or simply use the first constructor and tell Accelerator the dimensions we wish for. Tests showed that we could get up to 4 times better performance by creating constant arrays with the first method, compared to filling an array in C# and creating the FPA with the second method.

Binary operations

Many binary operations are overloaded to allow easier mass operations with the same constant:

public static FloatParallelArray Add(FloatParallelArray a, float f);
public static FloatParallelArray Add(FloatParallelArray a1, FloatParallelArray a2);

Listing 4.3: Add operation

The two methods give the same result if the array a2 is filled with the value of f. We tested the performance of these and found no difference, provided that either a1 or a2 was a constant array created with the first method for array creation.
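As an illustration (ours) of the two creation paths compared above:

// Fast path: let Accelerator materialize a 1000 x 1000 constant
// array itself; up to 4x faster creation in our tests.
var fast = new FPA(5f, 1000, 1000);

// Slow path: fill a managed two-dimensional array, then copy it in.
var values = new float[1000, 1000];
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        values[i, j] = 5f;
var slow = new FPA(values);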

Chapter 5. GPGPU approaches for spreadsheets

In this chapter we describe different approaches to implementing GPU parallelism in spreadsheets, taking the Accelerator test results into consideration. As described in section 2.4, we have not been able to find any previous work on using the GPU for optimizing spreadsheets, but various articles describe approaches to parallelism in spreadsheets using multicore CPUs or High-Performance Computing (HPC). In this chapter we present our own analysis based on CoreCalc, but also look into how to adapt previous parallelism theories to the GPU.

5.1. Single normal built-in functions

CoreCalc has a range of built-in functions like the ones known from Microsoft Excel. Some of these functions, like matrix multiplication, take one or more matrices as input and should be straightforward to implement on the GPU; if the input is large enough, or the arithmetic operations complex enough, the test results suggest that a performance gain should be possible. Matrix multiplication in particular is more effective on the GPU than on the CPU at relatively low data sizes. Simple functions that do not work with matrices, such as SQRT, SIN, addition, subtraction, division, and multiplication, are also simple to implement for the GPU, but a performance boost is not expected, given the very small input size of 1-2 arguments and the low arithmetic complexity. They are, however, needed in order to use sheet defined functions.

5.2. Sheet defined functions

As described in the introduction, sheet defined functions are functions defined within a spreadsheet using cells.

Due to the high transfer latency of the GPU, the arithmetic complexity of an operation is important in order to benefit from the GPU. This is shown in chapter 4, where we test single operations and nested operations similar to those produced by sheet defined functions. Even though sheet defined functions are more complex, the input data is not necessarily large and will typically be 1-3 arguments, which makes using the GPU questionable. On top of that, all constants have to be transferred to the GPU as well: every time you write a formula such as = C1 * 2 or = C1 * 10, the constant 2 or 10 has to be transferred as a texture to the GPU. The potential performance gain increases, however, when sheet defined functions are used in higher order functions such as tabulate, where the same function is applied to a range of input data.

5.3. Higher order Map function

As we concluded in chapter 4, a rather large amount of data and a complex operation are needed for the GPU to be able to optimise an operation. You therefore need a quite complex sheet defined function for the GPU to be able to optimize the evaluation of a single function call. However, the same function is often used more than once, and in simulations it is not uncommon for the same function to be used thousands of times. If all of these calls could be combined into one single call, sending all the input data to the GPU and processing it with the same operation, we would expect increased performance. CoreCalc includes higher order functions such as Map, RowMap, ColMap and Tabulate, which can all be classified as embarrassingly parallel problems, since there is no dependency between the individual applications of the function; they are thereby also suitable for the GPU. Depending on the complexity of the function and the number of times it is used, it should be possible to obtain a reasonable performance gain.

5.4. When to use the GPU?

As already mentioned several times, it is not always a good idea to send a computation to the GPU. In order to estimate which platform is best suited for a specific computation, we need to estimate the execution time of the operation on both platforms at evaluation time. Both Hamann[9] and Wack[19] work on partitioning the dependency graph of a spreadsheet to limit the parallel execution to where there is a potential performance gain. Both use weighted cells (nodes) in the graph and decide based on the total weight of a partition. As Wack's theory is about distributing the workload to workstations on a network, his model takes network latency, speed, distance and other factors into account. Hamann uses multiple cores on one CPU and simplifies the weighting to simple numbers.

Many of the same principles apply when deciding whether to evaluate an SDF on the CPU or the GPU. Loosely based on their approaches, we first create a simplified model that applies only to SDFs. We use knowledge about the hardware, measured time, input data, and an estimated execution time per operation. To estimate execution time, three major approaches exist: experimental (testing and measuring), probabilistic measurement (based on measurements of small parts), and static analysis, which uses constructed models of processor instructions and timings to predict the result. Execution time estimation of normal programs is non-trivial due to loops and recursive calls that might depend on values not known before runtime, but as we do not allow loops and recursive calls in spreadsheets, the estimation is simplified drastically. Another approach would be to simply run the operation on both platforms the first time it is invoked and remember which performed best. However, as the input parameters might change between calls, and because formulas are easily and often changed in spreadsheets, this approach would not only take more time but would also often be wrong. For estimation on the CPU we use a simplified model that does not take the architecture, cache or any other details of the CPU into account. Given:

m: number of operations
c: computation time of one operation
w: number of cores in the CPU

estimated time = (m · c) / w

Figure 5.1.: Formula for estimated computation time using the CPU

When using this simple model to estimate the execution time of an SDF at evaluation time, w and m are known, but c is unknown, as the SDF can contain many operations and conditions. c can, however, be estimated using the static analysis described later. When using the GPU we have to expand the model with latency and transfer time:

estimated time = k0 + (m · c) / w + c · k1 + m · k2 + r · k2

Given:
k0: initial latency of transferring to the GPU
k1: time to transfer one operation
k2: time to transfer one float
m: number of operations
c: computation time of one operation
w: number of cores in the GPU
r: the result size of the operation

Figure 5.2.: Formula for estimated computation time using the GPU

This formula can be partitioned into (m · c) / w, the time for computing the operations on the GPU; c · k1 + m · k2, the time to transfer the needed data to the GPU; and r · k2, the time to transfer the computed result back. m and r are known at evaluation time of a spreadsheet function, w can be found in the graphics card's specifications, and k0, k1, and k2 can easily be measured. However, c has to be estimated, just as on the CPU.

Estimating execution time

Estimating the execution time (c) can be done by weighting each type of operation with a value, running through all operations to be processed, and adding these values together. For a conditional statement, one estimates both the true leaf and the false leaf; the worst estimate yields the worst case execution time (WCET) and the best estimate yields the best case execution time (BCET). We focus on finding the WCET of a sheet defined function, both for the GPU and the CPU. First we need to assign a weight to each type of operation, for both the GPU and the CPU. As we have benchmarked the different operations, we can derive these weights from the test results. For the CPU this is simply done by using the time of the add operation in the tests, but for the GPU we have to subtract the latency and the transfer time to and from the GPU. For the GPU we also have to find k0, k1, and k2. However, we have not distinguished between k1 and k2 in our analysis; taking this into account, we assume that k2 includes the time for transferring the operations. Therefore we simplify c · k1 + m · k2 to c · 0 + m · k2 and end up with only m · k2, leaving out the transfer time of operations. This leaves only variables known at evaluation time and allows us to estimate the execution time of operations on the GPU. Now we can simply use these two models and a static analysis of the SDF to determine which platform to target. On our test setup we have only a single core in the CPU, but the model also takes multicore systems into account to some extent. One factor that is not taken into account is the maximum texture size and memory of the GPU; exceeding these limits would force us to split the Accelerator call into two or more calls. Due to the scope of this project, we have not looked further into this.
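A sketch (ours) of how the two models from figures 5.1 and 5.2 could be applied at evaluation time to pick a target; all parameter names are illustrative, and k1 is dropped as described above:

// Simplified cost model from figures 5.1 and 5.2 (k2 absorbs k1).
// Returns true when the GPU estimate beats the CPU estimate.
static bool ShouldUseGpu(long m, double c, long r,
                         int cpuCores, int gpuCores,
                         double k0, double k2)
{
    double cpuTime = m * c / cpuCores;                        // fig. 5.1
    double gpuTime = k0 + m * c / gpuCores + m * k2 + r * k2; // fig. 5.2, simplified
    return gpuTime < cpuTime;
}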

Chapter 6. Implementation of prototype

This chapter describes the implementation of our prototype in CoreCalc. We describe the implementation of the approaches presented in chapter 5 and the problems and limitations of the design.

6.1. Built-in functions

As described in chapter 5, formulas can invoke functions. For example, = SIN(90) is a formula that invokes the sine function, and a formula such as = A1 + A2 invokes the add function. We have chosen to implement a small range of the functions that showed potential to perform better on the GPU. In CoreCalc, a spreadsheet function is represented as an object of the class Function. Function connects the function name, represented as a string, with a delegate called an Applier, which points to the function implementation.

delegate Value Applier(Sheet sheet, Expr[] es, int col, int row);

Listing 6.1: Applier

A function's Applier is invoked when the function is called from a specific Cell. We would like to benchmark the current CPU implementation in CoreCalc against the calculations on the GPU, so we need both a GPU Applier and a CPU Applier for each function. This simple design allows us to choose between GPU and CPU Appliers by changing the target platform on the Function class.

class Function {
    public enum TargetPlatform { CPU, GPU }
    public static TargetPlatform target;

    private Applier applierCPU;
    private Applier applierGPU;

    public Applier Applier
    {
        get {
            return applierGPU == null || target == TargetPlatform.CPU
                ? applierCPU : applierGPU;
        }
    }
}

Listing 6.2: GPU and CPU applier

Because of the static TargetPlatform in the Function class, this implementation does not allow an adaptive implementation where the chosen target is based on the context of the invocation; however, it should be possible to choose the target platform based on an estimated execution time (described in chapter 5). In the current design of CoreCalc, Appliers are simply returned when a function is called, and the Applier is then invoked by the callee. This means the context is only available outside the Function class, or inside the implemented Appliers. Because of this, the function being evaluated has to either choose the platform itself, or the choice has to be moved elsewhere. We have not looked further into this, as very few of the built-in functions are potentially faster on the GPU. By overloading the constructor of the Function class, it is now easy to tie two Appliers to a Function by simply changing:

new Function("MMULT",
    MakeFunction((Fun<Value[], Value>)MMult));

Listing 6.3: Original code for creating MMULT

into:

new Function("MMULT",
    MakeFunction((Fun<Value[], Value>)MMult),
    MakeFunction((Fun<Value[], Value>)MMultGPU));

Listing 6.4: Modified code for creating MMULT

6.2. Sheet defined function

Sheet defined functions are functions defined within a spreadsheet. A sheet defined function is defined by input cells and an output cell. In our tests we found that, due to the extra latency of transferring data to the GPU, a certain complexity of the operation and a certain amount of data are needed. Sheet defined functions solve the problem of operation complexity by allowing the user to compose new functions from many simple functions. To execute a sheet defined function in Accelerator, our goal must be to build an Accelerator Expression Graph (AEG) that corresponds to the function. Accelerator does not support inserting values into an existing AEG, which means we need to build the AEG when all values of the function, parameters included, are known. Because of this we introduce a middle layer, Accelerator Abstract Syntax (AAS), that can quickly build an AEG given the parameters. The details of this are discussed below.
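As a preview of that design, the intended two-phase flow can be sketched as follows (our own illustration; GenerateFPA matches the interface shown in the next section, while TranslateToAAS and the AccInputInfo constructor are assumed helper shapes, not actual prototype code):

// Phase 1, once per SDF definition: translate the function body
// into Accelerator Abstract Syntax via the visitor described below.
AccExpr root = TranslateToAAS(sdfBody);

// Phase 2, on every call: supply the arguments, build a fresh AEG,
// and evaluate it on the DirectX 9 target.
var info = new AccInputInfo(argumentValues); // assumed constructor
FPA graph = root.GenerateFPA(info, callId);
float[] result;
dx9Target.ToArray(graph, out result);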

Accelerator Abstract Syntax

The basis of the AAS is the abstract class AccExpr:

public abstract class AccExpr
{
    public abstract FPA GenerateFPA(AccInputInfo info, int CallID);
}

Listing 6.5: AccExpr class

A reference to the root AccExpr of an SDF is stored along with the compiled SDF. At evaluation time, this AccExpr's GenerateFPA method is called with the parameters, the AEG is generated, and it is executed on the GPU, after which the result is returned.

Figure 6.1.: Class hierarchy of Expr types[18]

Figure 6.2.: Class hierarchy of AccExpr

In CoreCalc, whenever a cell is changed, the string in the cell is parsed and an Expr AST (see fig. 6.1) is built. CoreCalc SDFs are compiled into .NET bytecode. Before compilation, the function's Expr AST is translated into the CGExpr abstract syntax. This is achieved with a Visitor pattern that visits all child expressions and translates them individually. Converting the Expr abstract syntax (fig. 6.1) into AAS is done the same way, by creating a concrete visitor that visits all leaves of an Expr node and translates them. Some operations possible in CoreCalc are not possible in Accelerator; if these are encountered, an exception is thrown. We do not handle these cases in this project. CoreCalc expressions can be of the following types: NumberConst, TextConst, Error, FunCall, CellRef, CellArea. We will now look into how the translation of these works.

Values

All number values have to be represented as FPAs in Accelerator, so we work with conversion of NumberConsts, CellRefs, and CellAreas from CoreCalc into FPAs. Input arguments have to be represented as FPAs as well, but we discuss this later. NumberConsts (constants in formulas such as 2 + 2) and NumberCells inside the function sheet are known at compile time and can be represented as AccConst, while CellRefs and CellAreas, if pointing outside the function sheet, need to be evaluated at evaluation time and are represented by their own types, AccCellRef and AccCellArea.
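As an illustration of what such a concrete visitor might look like (our own sketch; member names such as n.Value and r.Address, and the constructors of AccConst, AccInput and AccCellRef, are assumptions about the prototype's structure):

// Sketch of the Expr-to-AAS translation. Each Visit method returns
// the AccExpr corresponding to one node of the Expr tree; the
// remaining Expr types follow the same pattern.
class AccExprBuilder
{
    private readonly Dictionary<string, int> inputIndex; // cell address -> argument index

    public AccExprBuilder(Dictionary<string, int> inputIndex)
    {
        this.inputIndex = inputIndex;
    }

    public AccExpr Visit(NumberConst n)
    {
        return new AccConst(n.Value); // value known at SDF compile time
    }

    public AccExpr Visit(CellRef r)
    {
        int index;
        if (inputIndex.TryGetValue(r.Address, out index))
            return new AccInput(index); // argument of the SDF
        return new AccCellRef(r);       // read at evaluation time
    }

    public AccExpr Visit(TextConst t)
    {
        // Not representable in Accelerator (see the note above).
        throw new NotSupportedException();
    }
}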

Representation of numbers

AccNumber allows for creating an FPA with the same size as the other argument of the operation it is used in. The argument will typically be a single float when calling the SDF alone, but as we describe in section 6.2.2 this is often not the case.

    public abstract class AccNumber : AccExpr
    {
        ...
        protected override FPA GenerateFPA(AccInputInfo info, int CallID)
        {
            return new FPA(Value, info.Values[0].GetLength(0),
                                  info.Values[0].GetLength(1));
        }
        ...
    }

Listing 6.6: AccNumber class

As both AccCellRef and AccConst inherit from AccNumber, they only need to define how to return their value, and AccNumber will convert it to an FPA of the correct size as shown above.

Input cells

Input arguments are CellRefs in the Expr tree, but at compile time we match each CellRef against the list of input cells for the SDF and represent the input as the type AccInput, giving the index of the input cell as argument. The input arguments are not known before evaluation time and are sent through every GenerateFPA call in the AccInputInfo object. How this object is built is explained in section 6.2.2, because it is highly dependent on how the function is called. AccInput just returns the value that corresponds to its index.

    protected override FPA GenerateFPA(AccInputInfo info, int CallID)
    {
        return new FPA(info.Values[inputIndex]);
    }

Listing 6.7: GenerateFPA for AccInput

A general problem when using Accelerator in CoreCalc is that every value in CoreCalc has to be cast to a float, sent to the GPU, cast back to a double, and wrapped in a NumberValue object. The NumberValue wrapping/unwrapping is, however, also an issue in SDFs compiled to .NET bytecode. (See Known Problems.)

Function calls

In Expr syntax, function calls are represented by FunCall. This includes both calls to built-in functions such as SIN and operators such as +. These are refined to a whole hierarchy in the CGExpr syntax. Most functions, such as + and SIN, are easily translated to Accelerator and are represented by AccBinaryOp or AccUnaryOp, with the concrete function specified in the constructor. Other functions, such as comparison and conditional functions, that are very simple in CoreCalc have to be refined similarly to how it is done in CGExpr. (Function calls that depend only on AccConsts could themselves be represented as an AccConst, but as this only optimizes SDFs that are poorly designed, we have not implemented it.)
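A sketch of such a node is shown below; the combining delegate is our illustrative stand-in for the Accelerator operation that the prototype selects in the constructor:

    using System;
    using FPA = Microsoft.ParallelArrays.FloatParallelArray;

    // A hedged sketch of a binary operation node in the AAS; op could be
    // ParallelArrays.Add, ParallelArrays.Multiply, or similar.
    public sealed class AccBinaryOp : AccExpr
    {
        private readonly AccExpr left, right;
        private readonly Func<FPA, FPA, FPA> op;

        public AccBinaryOp(AccExpr left, AccExpr right, Func<FPA, FPA, FPA> op)
        {
            this.left = left;
            this.right = right;
            this.op = op;
        }

        public override FPA GenerateFPA(AccInputInfo info, int CallID)
        {
            // Build both operand subgraphs, then add one AEG node on top.
            return op(left.GenerateFPA(info, CallID),
                      right.GenerateFPA(info, CallID));
        }
    }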

Conditional Statements

Expr does not represent boolean expressions separately; conditional statements in Accelerator, however, require that we generate boolean expressions. In CoreCalc any value can always be evaluated to true or false, while floats in Accelerator never can. Just as the IL code generator for SDFs has conditional expressions that require functions to return booleans, Accelerator has special functions for comparison of float values, which return arrays of booleans. In CGExpr, conditional statements are represented by CGIf, CGAnd, CGOr, CGNot and a range of different comparisons inheriting from CGComparison. Due to the scope of our project we have chosen a simple approach where comparison functions are contained in the conditional functions that need them (see the example below). This approach restricts our use of logical operators, but as this is an experimental prototype this is not a problem. We have also chosen not to implement NOT, AND and OR, due to the scope of the project.

    public BPA GenerateBPA(AccInputInfo info, int CallID)
    {
        ...
        switch (type) {
            case Type.EQ:
                return PA.CompareEqual(child1Fpa, child2Fpa);
            case Type.GT:
                return PA.CompareGreater(child1Fpa, child2Fpa);
            ...
        }
        ...
    }

Listing 6.8: GenerateBPA of AccComp

Random numbers

Accelerator has no volatile methods such as random number generation. Monte Carlo simulations, however, use a lot of random data, so we need to generate it on the CPU at evaluation time. As random numbers are the only volatile function we need within the scope of the project, we have contained this in AccRand, which works like AccNumber but generates an array of random numbers corresponding to the size of the other argument of the operation it is used in. This size is available in the AccInputInfo object.
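A sketch of AccRand's generation step, mirroring the AccNumber pattern from listing 6.6 (the surrounding class members are assumptions):

    // A hedged sketch: random samples are produced on the CPU at evaluation
    // time, shaped like the call's first input, and shipped to the GPU as
    // one FPA.
    protected override FPA GenerateFPA(AccInputInfo info, int CallID)
    {
        int rows = info.Values[0].GetLength(0);
        int cols = info.Values[0].GetLength(1);
        float[,] samples = new float[rows, cols];
        Random rng = new Random();
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                samples[r, c] = (float) rng.NextDouble();  // uniform in [0, 1)
        return new FPA(samples);
    }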

Ensuring reuse of AccExprs

Because of the latency of transferring data to the GPU, it is important that we generate as small an AEG as possible. In order to do this we need to make sure that the AAS does not contain the same node more than once. Many cells can have formulas that reference the same input cell or number cell, and every reference to the same cell should point to the same AAS object. This means that when creating the AAS it is important to keep track of whether an AccExpr for the same Expr has already been created. If it has, this earlier object should be referenced instead of creating a new instance. In order to achieve this we create a dictionary inside the already existing visitor that allows lookup of AccExprs by cell address. Another, and maybe performance-wise faster, solution would be to have a Cell (or a decorator) reference its AccExpr. However, as this happens at compile time, we have not looked further into such optimizations.

Whenever an AAS translation is started, a dictionary is created, and whenever a new Expr is needed we check whether it already has a corresponding AccExpr. Because we have not focused on optimizing the generation of AAS, and because a CellRef pointing to an input cell can be represented by the same AccInput object, we simply create the AccExpr object and let TryAccExpr throw it away if it is not needed.

    private static AccExpr TryAccExpr(FullCellAddr addr, AccExpr newexpr)
    {
        if (exprcache.ContainsValue(newexpr))
            return newexpr;
        AccExpr n;
        if (!exprcache.TryGetValue(addr, out n))
        {
            n = newexpr;
            exprcache.Add(addr, n);
        }
        return n;
    }

Listing 6.9: Method for minimizing the number of created AccExpr objects

Due to the high transfer latency we keep a similar dictionary for number constants. If A1 contains the formula = 2 + C1 and B1 has the formula = C1/2, both of these constants would normally have to be transferred to the GPU as separate textures, but we make sure that they point to the same AccConst. For these optimizations to work at evaluation time, the same AAS object should return the same FPA object for each reference in the SDF invocation. To achieve this, we simply make sure that a generated FPA is saved in the AccExpr for each invocation of the SDF (it is only invoked once in a tabulate call; see section 6.2.2). FPAs should of course not be shared between invocations, and to ensure this we send a unique CallID through the call stack. The only public method of an AccExpr that returns an FPA object is GenerateFPAWithCache of the abstract class AccExpr that all others inherit from. This method checks whether we have already generated an FPA for the AccExpr in this invocation and returns the cached FPA; if not, it calls the object's specific GenerateFPA method.
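A sketch of what this caching entry point could look like on AccExpr, with the cache fields as our assumptions about the bookkeeping:

    // A hedged sketch: one cached FPA per AccExpr per SDF invocation,
    // keyed by the unique CallID sent down the call stack.
    private FPA cachedFpa;
    private int cachedCallID = -1;

    public FPA GenerateFPAWithCache(AccInputInfo info, int CallID)
    {
        if (cachedCallID != CallID || cachedFpa == null)
        {
            cachedFpa = GenerateFPA(info, CallID);  // node-specific generation
            cachedCallID = CallID;
        }
        return cachedFpa;
    }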

Higher order functions

As discussed in chapter 5, using sheet defined functions in higher order functions such as Map or Tabulate should improve the potential performance gain if sent as one call to the GPU.

    public static Value Tabulate(Value v0, Value v1, Value v2)
    {
        if (v0 is FunctionValue && v1 is NumberValue && v2 is NumberValue)
        {
            FunctionValue fv = v0 as FunctionValue;
            // (... argument error handling ...)
            int rows = (int) (v1 as NumberValue).value,
                cols = (int) (v2 as NumberValue).value;
            if (0 <= rows && 0 <= cols)
            {
                Value[,] result = new Value[cols, rows];
                for (int c = 0; c < cols; c++)
                    for (int r = 0; r < rows; r++)
                        result[c, r] = fv.Call2(NumberValue.Make(r + 1),
                                                NumberValue.Make(c + 1));
                return new ArrayExplicit(result);
            }
            // (... error handling ...)
        }
    }

Listing 6.10: Original CoreCalc code for Tabulate

The built-in Tabulate function works by taking a binary function and two numbers as arguments. The function is then called rows × cols times, taking the row index and the column index as arguments 1 and 2, respectively. Map, ColMap and RowMap work similarly to Tabulate, but take a CellArea as input and pass the contents of the cells as arguments to the function. We have chosen to explain Tabulate here for simplicity, but the other implementations are similar. As shown in the code sample, Tabulate is implemented with a nested loop calling the method once per argument combination. As this is an embarrassingly parallel problem that can be handled by the GPU, we need to represent it as one Accelerator abstract syntax graph. As Accelerator handles each element independently, all we have to do is generate an FPA corresponding to the input array before sending it to the GPU. This only requires a minor modification of AccExprs such as AccInput, AccConst, and AccRand, as they have to fit the input. As described, a sheet defined function has an AccExpr structure whose FPA GenerateFPA(AccInputInfo info, int CallID) method generates the Accelerator abstract syntax graph that corresponds to the operation, based on the input arguments.

Table 6.1: Original input arguments for the call TABULATE(F, 2, 8)

Table 6.2: Reorganized input arguments for the call TABULATE(F, 2, 8)

It is possible to generate an AAS using the first of the above formats (table 6.1), but as the data has to be transferred as a texture, this format exceeds the maximum texture width or height long before the actual memory limits of the GPU (the maximum texture size of the GT240 has been estimated to 4000 × 8000). To solve this we reorganize the data to fit the texture size as shown in table 6.2. This is done using the GenerateAcceleratorMethod method of CGManager.cs.

    int cols = (int) Math.Ceiling(Math.Sqrt(length));
    while (length % cols != 0)
        cols--;
    int rows = length / cols;

Listing 6.11: Finding a texture-friendly shape for the input

First we find the new format using the algorithm shown in listing 6.11, where we use the fact that our tests showed the texture width to be exceeded before the height. We then reorganize the input into the ArrayList<float[,]> type (each float[,] is a reorganized input argument). The resulting FPAs are sent to the GPU, and the result is reorganized back into the original format before the value is returned. This is done completely transparently to the Tabulate or Map function.
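The reorganisation round trip can be sketched as follows; the helper names are ours, and the shape search is the one from listing 6.11:

    using System;

    static class InputReshaper
    {
        // Reorganize a flat argument vector into a near-square float[,]
        // whose dimensions divide the length evenly (listing 6.11).
        public static float[,] Reshape(float[] flat)
        {
            int length = flat.Length;
            int cols = (int) Math.Ceiling(Math.Sqrt(length));
            while (length % cols != 0)
                cols--;                          // shrink until it divides evenly
            int rows = length / cols;
            float[,] shaped = new float[rows, cols];
            for (int i = 0; i < length; i++)
                shaped[i / cols, i % cols] = flat[i];
            return shaped;
        }

        // Restore the original flat layout after the GPU result comes back.
        public static float[] Flatten(float[,] shaped)
        {
            int cols = shaped.GetLength(1);
            float[] flat = new float[shaped.GetLength(0) * cols];
            for (int i = 0; i < flat.Length; i++)
                flat[i] = shaped[i / cols, i % cols];
            return flat;
        }
    }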

Another solution would be to determine the maximum texture size of the current machine through some initial tests, and simply wrap around the maximum texture width. This solution would let us use data sizes close to the maximum texture size, but if we exceed it, the data has to be split in two. As the operations are data-parallel and element-wise when calling sheet defined functions, it does not matter how the data is partitioned. We have not implemented this in our prototype (see Future work).

Evaluation leafs on the CPU

Expr contains a lot of things that we cannot or have not translated to AAS. A simple approach would be to just evaluate these expressions on the CPU using the normal evaluation in CoreCalc, but this would lead to potential recursive calls to the same version and introduce further uncertainties. Even though this is very simple to implement, we have chosen not to, and simply throw an exception if an expression cannot be translated fully.

Chapter 7. Performance test of prototype

The implemented prototype shows performance gains in some areas and performance losses in others. We document the test results and look into possible conclusions.

7.1. How the tests were executed

Each of our benchmarks has run 100 recalculations of a workbook and calculated the average time. The workbooks use the TABULATE(function, number, number) function, and each benchmark is executed on a range of linearly or quadratically growing data sizes.

Floating point precision and performance

As noted previously, Accelerator uses single precision floating point numbers, while CoreCalc uses double precision. On modern CPUs there is no difference in performance between operating on floats and doubles, except for division operations. Most current GPUs only support single precision floating point numbers. NVIDIA has earlier implemented double precision on GPUs with the NVIDIA G80, which worked at 1/10th of the speed of single precision operations, and the new Fermi will support double precision at half the speed of single precision[15]. As this will change drastically in the near future, we have decided not to look at how float-to-double casting affects our test results.

Hardware

All tests have been run on the same hardware setup as our earlier tests, using the NVIDIA GT240 graphics card.
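A minimal sketch of such a benchmark loop, with Recalculate standing in for CoreCalc's workbook recalculation entry point:

    using System;
    using System.Diagnostics;

    static class Benchmark
    {
        // Run the workbook recalculation a fixed number of times and
        // report the average time per run in nanoseconds.
        public static double AverageRecalcTimeNs(Action recalculate, int runs = 100)
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < runs; i++)
                recalculate();                   // one full recalculation
            sw.Stop();
            return sw.Elapsed.TotalMilliseconds * 1e6 / runs;
        }
    }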

7.2. Results

We have seen performance gains in our tests and simulations; this shows that spreadsheet calculations can be optimised using the GPU if relatively large data sizes are provided. In this section we go through the factors we believe to be important when doing such an implementation, and look at how much data is needed. The results of this benchmark can be found in appendix A.

Built-in function

For the built-in functions we found that, given a sufficiently large input array and a sufficiently complex function, it impacts performance positively to do the calculations on the GPU. However, as shown in our analysis of Accelerator, very few operations are sufficiently complex or take enough arguments to actually have a positive impact. Our tests showed that arrays of 96² elements were needed for a matrix multiplication to show a performance gain. If this were a common scenario for calculations in spreadsheets, it would make sense to spend more time on this kind of operation. However, it seems very tedious to work with this many cells in a spreadsheet. It should be noted that the upper limit for the calculations is also relatively close to the lower bound where the CPU is faster. In our example it is possible to optimise in the range [96²; 240²], after which the values have to be split into two arrays and so forth.

Sheet defined function

We tested sheet defined functions in several scenarios, building both more and less complicated SDFs. We converted real-life examples of Monte Carlo simulations from Excel into sheet defined functions and ran them on CoreCalc. Performance gains were found by using GPUs in this way. However, we found several factors that influence the performance of this implementation:

Aggregating is slow on the GPU

In Monte Carlo simulations, aggregating functions are often used. Aggregating values is slower on the GPU than on the CPU[17], and should be used with great caution. For many simulations it might not make sense to create and calculate the sampling data on the GPU, transfer it back to the CPU, and do the aggregation there. This is also described in chapter 4.

With the NVIDIA Fermi, one would expect the performance of aggregating functions to improve. NVIDIA Fermi promises more shared memory and much faster atomic operations for accessing shared memory, which gives a better foundation for reduction operations and thereby aggregate functions.[3]

Random data needed to be transferred

When doing Monte Carlo simulations, we transfer random data from the CPU to the GPU. This increases the total time spent because of the larger transfer time. It would probably improve the performance of Monte Carlo simulations if random numbers were simply generated on the GPU instead of being generated on the CPU and then transferred. A pseudo-random number generator can be created on the GPU, and future releases of Accelerator 2.0 are expected to support this[17].

Reducing the amount of constants

In our implementation we worked on reducing the amount of data that needs to be transferred to the GPU. Looking at the derived slopes of the benchmarking results for the different Heron implementations, we see that the intersection between the time functions of the GPU and the CPU computations occurs at smaller data sizes the less data needs to be transferred.
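Given two fitted linear time models, the crossover point can be computed directly; a small sketch (the values would come from the regressions in appendix A):

    // Solve cpuSlope*x + cpuIntercept = gpuSlope*x + gpuIntercept for x:
    // the data size beyond which the GPU is expected to win.
    static double Crossover(double cpuSlope, double cpuIntercept,
                            double gpuSlope, double gpuIntercept)
    {
        return (gpuIntercept - cpuIntercept) / (cpuSlope - gpuSlope);
    }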

Chapter 8. Perspective

8.1. The future of GPUs

The test results in this project have all been based on an NVIDIA GT240 graphics card, which is not state of the art, but has been sufficient within the scope of this project. The tests may very well produce more promising performance gains on modern GPUs with around 250 cores, as seen in many gaming PCs today. State-of-the-art GPUs with 512 cores would improve this further. In the near future NVIDIA will support double operations at half the speed of float operations[14], more cores will be added to GPUs, and memory limits will be raised. Considering this, the idea of using the GPU to optimize spreadsheet evaluation becomes even more interesting.

8.2. Other targets of parallelism using Accelerator

Accelerator already supports multicore targets to some extent, but ideas for other targets such as FPGAs (Field-Programmable Gate Arrays) and distributed networks are also mentioned in the documentation[17]. In theory we could use the FPAs generated by our current Accelerator abstract syntax and simply change the target of evaluation. Multicore CPUs are interesting targets as they may show great performance gains on the lower data sizes where the GPU struggles with latency. Hamann[9] describes performance gains close to the theoretical maximum when using the .NET Task Parallel Library. Our approach has not been aimed at task parallelism, and we may not be able to parallelize the same problems as Hamann, but one should expect a comparable performance gain on the data-parallel operations. Using information about the number of cores in the worst case execution time estimate for the CPU, it would be possible to estimate when not to parallelize the evaluation at all, when to use the multicore CPU, and when to use the GPU.

8.3. Known problems

Converting Values

A general problem when constructing Accelerator abstract syntax in CoreCalc is that datatypes such as ArrayValue and NumberValue, which contain double values, have to be cast to float. As mentioned earlier, current GPUs do not support doubles, and neither does Accelerator. The only solution here is simply to convert the doubles of the CoreCalc datatypes into float arrays before creating the FPA object. Likewise, when Accelerator returns a result to us, we need to convert it back to doubles and wrap it in ArrayValues or NumberValues. This gives an overhead proportional to the size of the input arrays, which is unfortunate but very hard to avoid. CoreCalc's sheet defined functions are compiled to .NET bytecode and need to do the same wrapping of values. We have not looked further into this. However, Poul Brønnum focuses on this problem in his Master's thesis, Type Analysis for Sheet-defined Functions[5]. He states that by implementing a set-based type system influenced by soft typing, a performance gain of 20% is possible. With further improvements he documents that performance gains of up to 65% compared to the original code are possible for some functions.
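A sketch of the conversion step, using plain arrays to stand in for CoreCalc's Value types:

    // The double <-> float round trip described above: precision is lost
    // on the way to the GPU, and the overhead is proportional to the
    // array size.
    static float[,] ToFloats(double[,] values)
    {
        int rows = values.GetLength(0), cols = values.GetLength(1);
        float[,] result = new float[rows, cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                result[r, c] = (float) values[r, c];
        return result;
    }

    static double[,] ToDoubles(float[,] values)
    {
        int rows = values.GetLength(0), cols = values.GetLength(1);
        double[,] result = new double[rows, cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                result[r, c] = values[r, c];
        return result;
    }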

Limitations in Accelerator abstract syntax

Due to the scope of this project we have not matched the whole CGExpr tree in Accelerator Abstract Syntax. Here is a short overview of what we do not implement:

- Time functions and other volatile functions
- Choose
- Lookup
- AND, OR, NOT and nested parentheses
- Aggregation functions such as Average, Percentile and others
- Strings
- ... and many other functions that have not been used in the scope of this project

Exceeding memory and texture limits

By reorganising our data we avoid, to some extent, exceeding the maximum texture size, but when the data does not fit within the maximum texture size, the operation should be split in two. We have not implemented this, as it has not been needed for our prototype; doing so would result in a constant increase in the GPU computation time. We have neither been able to predict when the GPU runs out of memory, nor found a way of handling this apart from catching the exceptions that Accelerator throws. While we might be able to predict when GPUs run out of memory, this might as well be handled on a lower level where more information about the state of the memory is known.
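Within the prototype's constraints, the pragmatic response is a CPU fallback; a sketch, with EvaluateOnGpu and EvaluateOnCpu as illustrative stand-ins for the two evaluation paths:

    // If the GPU evaluation fails, for instance because a texture or
    // memory limit was exceeded, fall back to CoreCalc's normal CPU
    // evaluation. We do not predict Accelerator's exception types here.
    static Value EvaluateWithFallback(AccExpr root, AccInputInfo info, int callID)
    {
        try
        {
            return EvaluateOnGpu(root, info, callID);
        }
        catch (Exception)
        {
            return EvaluateOnCpu(info);
        }
    }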

Worst Case Execution Time estimation

We have proposed a model for estimating the worst case execution time of a function on the GPU, but it is not part of our prototype. It would, however, be possible to implement it on top of our Accelerator Abstract Syntax. Moreover, the current model does not take maximum texture sizes, memory limits, or other possibly unknown factors into account. Our current estimates of the model's values are also incomplete, as we do not distinguish between the time cost of transferring an operation and that of transferring its arguments.

Chapter 9. Conclusion

The main objective of this project was to investigate whether CoreCalc could be extended to use the GPU for parallelizing the evaluation of functions. This has indeed been proven possible, and we have implemented an experimental prototype that allows a subset of the CoreCalc operations to be evaluated on the GPU. We have investigated methods for parallelizing spreadsheet applications using the GPU, and based on our analysis we have chosen to focus on sheet defined functions and their usage in higher order functions such as Tabulate. Our prototype shows that it is possible to achieve a performance gain on spreadsheet operations, given enough data and enough arithmetic complexity. However, except for sheet defined functions, only very few built-in spreadsheet operations use enough data or have the required complexity. Only matrix operations, such as the built-in matrix multiplication function, have displayed potential for performance gains on the GPU. We have analysed Microsoft Accelerator and documented its limitations related to the purpose of this prototype. In order to construct Microsoft Accelerator Expression Graphs at evaluation time, we have designed a simple intermediate abstract syntax based on the Expr abstract syntax from CoreCalc. For complex simulations or extremely large amounts of data, such as in Monte Carlo simulations, it is indeed possible to optimise the spreadsheet calculation using the GPU. However, there is little or no performance gain when using the GPU to evaluate light spreadsheets on current hardware. Taking this into account, it is questionable whether this should be implemented in mainstream software. The future development of GPUs, however, looks promising.

Bibliography

[1]
[2] GPU Gems.
[3] NVIDIA's next generation CUDA compute architecture: Fermi. Fermi_Compute_Architecture_Whitepaper.pdf.
[4] Dan Bricklin. VisiCalc information. Webpage. visicalc.htm.
[5] Poul Brønnum. Type analysis for sheet-defined functions. Master's thesis, IT University of Copenhagen.
[6] Alan Dang. Exclusive interview: NVIDIA's Ian Buck talks GPGPU, September.
[7] David Tarditi, Sidd Puri, Jose Oglesby. Brute force attack on UNIX passwords with SIMD computer.
[8] David Tarditi, Sidd Puri, Jose Oglesby. Accelerator: Using data parallelism to program GPUs for general-purpose uses. Technical Report MSR-TR.
[9] Jens Hamann. Parallelization of spreadsheet computations. Master's thesis, IT University of Copenhagen.
[10] Mark Harris. Technical report.
[11] Intel. Intel roadmap directions 2010. irdonline/pdf/ird_q2_2010_roadmap_all.pdf.
[12] NVIDIA internet forum. CUDA, NVIDIA GPUs and Microsoft Excel, May. http://forums.nvidia.com/lofiversion/index.php?t67720.html.
[13] Mark J. Harris, Greg Coombe, Thorsten Scheuermann, and Anselmo Lastra. Physically-based visual simulation on graphics hardware. Technical report, University of North Carolina.
[14] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Fermi.
[15] David Patterson. The top 10 innovations in the new NVIDIA Fermi architecture, and the top 3 next challenges. Technical report, NVIDIA. D.Patterson_Top10InnovationsInNVIDIAFermi.pdf.
[16] Microsoft Research. Accelerator v2 Programming Guide.
[17] Microsoft Research. An Introduction to Accelerator v2.
[18] Peter Sestoft. Spreadsheet technology. Draft manuscript.
[19] Andrew P. Wack. Partitioning dependency graphs for concurrent execution: A parallel spreadsheet on a realistic modeled message passing environment. PhD thesis, University of Delaware.

Appendix A. Test setups and results from prototype benchmarking

A.1. Built-in functions

Most built-in functions showed few performance gains in our earlier tests. Matrix multiplication was an exception and showed a potential performance gain, so we have tested it in our prototype.

Figure A.1: Matrix multiplication test setup

As can be seen from the figure, matrix multiplication was implemented to take two equally sized quadratic arrays of random numbers. The CPU is initially faster, at an array size of 8² (64). The GPU starts getting faster at size 96² (9216). For the sample we have made, the time spent on the GPU seems to grow linearly as a function of the data size, while the CPU's growth has a higher complexity.
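For reference, a sketch of the GPU side of this benchmark, assuming Accelerator v2's InnerProduct operation for matrix multiplication and the DX9Target used earlier:

    using Microsoft.ParallelArrays;
    using FPA = Microsoft.ParallelArrays.FloatParallelArray;
    using PA = Microsoft.ParallelArrays.ParallelArrays;

    static class MatrixGpu
    {
        // Multiply two equally sized quadratic matrices on the GPU.
        public static float[,] MMult(float[,] a, float[,] b)
        {
            FPA product = PA.InnerProduct(new FPA(a), new FPA(b));
            return new DX9Target().ToArray2D(product);
        }
    }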

Figure A.2: Matrix multiplication performance results (time in ns against data size, CPU vs. GPU)

The GPU is five times faster than the CPU at a size of 240² (57600), which is the largest possible dataset the GPU can handle during matrix multiplication. As mentioned, we have no indicator of when this maximum is reached other than our test results.

A.2. Sheet defined functions

Heron's formula

Heron's formula has been implemented as a sheet defined function and tested on data sizes from 1000 floats and up. This implementation has been used as a solid base for further tests, to see how different changes affect the performance.
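As a sketch of the element-wise graph the prototype builds for this SDF, using Heron's area formula sqrt(s(s-a)(s-b)(s-c)) with s = (a+b+c)/2, and assuming Accelerator v2's operator overloads, Sqrt, and the uniform-value FPA constructor quoted in listing 6.6:

    using Microsoft.ParallelArrays;
    using FPA = Microsoft.ParallelArrays.FloatParallelArray;
    using PA = Microsoft.ParallelArrays.ParallelArrays;

    static class HeronGpu
    {
        // Element-wise Heron's formula for the side arrays a, b, c,
        // all shaped rows x cols.
        public static FPA Heron(FPA a, FPA b, FPA c, int rows, int cols)
        {
            FPA half = new FPA(0.5f, rows, cols);  // uniform constant
            FPA s = (a + b + c) * half;            // semi-perimeter
            return PA.Sqrt(s * (s - a) * (s - b) * (s - c));
        }
    }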

Figure A.3: Different variations of Heron's formula in CoreCalc

Random data

In this implementation all three variables (A, B, C) in Heron's formula are random data generated by RAND() for each invocation. This means three unique FPAs are generated and sent to the GPU. Tests are run on data sizes from 1000 floats and up, and an intersection was not found. Using linear regression we can derive the slopes of the two time functions:

    HeronRandomCPU: 452.4939 ns per element
    HeronRandomGPU: 393.2681 ns per element

Though our tests do not show an intersection, the slopes imply that it will be hit well within the limits of the possible data sizes on the GPU.

Figure A.4: Heron's formula with random data, performance results (time in ns against data size, CPU vs. GPU)

Notice that with random values for A, B and C, not all generated examples will be able to form a triangle. We might end up with a negative value inside the SQRT, which might impact speeds on both the CPU and the GPU, making the results harder to compare with the two other Heron implementations.

Param data

Figure A.5: Heron's formula with parameter data, performance results (time in ns against data size, CPU vs. GPU)

In this implementation A and B are set as parameters and C is set to A + B/2. This way only two FPAs are sent to the GPU, along with an operation and a single constant number. An intersection was not found. Using linear regression we can derive the slopes of the two time functions:

    HeronParamCPU: 462.4716 ns per element
    HeronParamGPU: 457.2948 ns per element

Though our tests do not show an intersection, the slopes imply that it will be hit well within the limits of the possible data sizes on the GPU.

Constant data

Figure A.6: Heron's formula with constant data, performance results (time in ns against data size, CPU vs. GPU)

In this implementation all values are the same constant, so only one constant is transferred to the GPU. An intersection was not found. Using linear regression we can derive the slopes of the two time functions:

    HeronConstantCPU: 456.901 ns per element
    HeronConstantGPU: 440.3515 ns per element

Though our tests do not show an intersection, the slopes imply that it will be hit well within the limits of the possible data sizes on the GPU.

A.2.1. Fibonacci sequence

A given number in the Fibonacci sequence can be calculated using this formula:

    F_n = \frac{1}{\sqrt{5}}\left(\left(\frac{1+\sqrt{5}}{2}\right)^{n} - \left(\frac{1-\sqrt{5}}{2}\right)^{n}\right)

Figure A.7: Performance test of the Fibonacci sequence (time in ns against data size, CPU vs. GPU)

We implemented this as a simple sheet defined function in CoreCalc. The SDF uses 4 constants and one input value. The test is run from 1000 floats and up. The GPU surpasses the CPU at the crossover size visible in figure A.7.

Monte Carlo simulations

Monte Carlo simulations are used in a wide variety of industries to estimate probable outcomes where deterministic algorithms would take too long to compute or simply be too complex. They rely on random sample data and statistics, and are often implemented in Microsoft Excel. While we have implemented the data sampling in the simulations below, we have not spent time on calculating the actual results, because of the relatively few aggregate functions implemented in CoreCalc. Aggregate functions become reduction operations on the GPU, which are often not very efficient.

The lack of actual results might skew the numbers compared to a real simulation; however, we believe these results give good guidance, since the aggregation will often be faster to do on the CPU, and a user might want to analyse the result in many ways.

Greeting card estimation

This is straight out of an example from Microsoft of how Monte Carlo simulation can be used to make business decisions, simulating different types of demand scenarios and outputting the risk of failing. We converted this example into an SDF and ran the same simulation in CoreCalc.

Figure A.8: The valentine simulation SDF defined in CoreCalc

This function was benchmarked from 500 floats and up. The intersection point between the GPU and the CPU is around 5000 floats.

Figure A.9: The results of the valentine simulation (time in ns against data size, CPU vs. GPU)

Approximation of π

A Monte Carlo simulation can be used to approximate the value of π. This is done by generating a number of points within the square from (0, 0) to (1, 1), and afterwards counting the number of points that fall within the inscribed circle of this square. The ratio between the counted points and the generated points should be π/4. We created the SDF = IF(RAND()^2 + RAND()^2 <= 1, 1, 0). Running this n times, dividing the sum of the results by n, and multiplying by 4 should approximate π.
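For comparison, a sketch of the same estimate written directly against Accelerator, assuming v2's Cond and CompareLessEqual operations and the uniform-value FPA constructor from listing 6.6; the sampling happens on the CPU, the comparison on the GPU, and the aggregation back on the CPU:

    using System;
    using Microsoft.ParallelArrays;
    using FPA = Microsoft.ParallelArrays.FloatParallelArray;
    using PA = Microsoft.ParallelArrays.ParallelArrays;

    static class PiEstimate
    {
        public static double Run(int n)
        {
            Random rng = new Random();
            float[,] xs = new float[1, n], ys = new float[1, n];
            for (int i = 0; i < n; i++)
            {
                xs[0, i] = (float) rng.NextDouble();
                ys[0, i] = (float) rng.NextDouble();
            }
            FPA x = new FPA(xs), y = new FPA(ys);
            FPA one = new FPA(1f, 1, n), zero = new FPA(0f, 1, n);
            // IF(RAND()^2 + RAND()^2 <= 1, 1, 0), element-wise.
            FPA hit = PA.Cond(PA.CompareLessEqual(x * x + y * y, one), one, zero);
            float[,] hits = new DX9Target().ToArray2D(hit);

            double sum = 0;
            foreach (float h in hits) sum += h;
            return 4.0 * sum / n;   // the hit ratio approximates pi/4
        }
    }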

Figure A.10: The performance results of the π approximation simulation (time in ns against data size, CPU vs. GPU)

This was benchmarked from 2000 floats and up; the intersection point can be seen in figure A.10.

Commute time

Figure A.11: The commute time simulation SDF defined in CoreCalc

This test case uses Monte Carlo simulation to predict the commute time to work. As seen in the above image, we have two road segments and a traffic light. At the first road segment we have a 10% chance of hitting a traffic jam, and at the traffic light there is


More information

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology Data-Parallel Algorithms on GPUs Mark Harris NVIDIA Developer Technology Outline Introduction Algorithmic complexity on GPUs Algorithmic Building Blocks Gather & Scatter Reductions Scan (parallel prefix)

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Exotic Methods in Parallel Computing [GPU Computing]

Exotic Methods in Parallel Computing [GPU Computing] Exotic Methods in Parallel Computing [GPU Computing] Frank Feinbube Exotic Methods in Parallel Computing Dr. Peter Tröger Exotic Methods in Parallel Computing FF 2012 Architectural Shift 2 Exotic Methods

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing

More information

ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING. A Thesis. Presented to. the Faculty of

ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING. A Thesis. Presented to. the Faculty of ASYNCHRONOUS MATRIX FRAMEWORK WITH PRIORITY-BASED PROCESSING A Thesis Presented to the Faculty of California Polytechnic State University, San Luis Obispo In Partial Fulfillment of the Requirements for

More information

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

Parallelization of K-Means Clustering Algorithm for Data Mining

Parallelization of K-Means Clustering Algorithm for Data Mining Parallelization of K-Means Clustering Algorithm for Data Mining Hao JIANG a, Liyan YU b College of Computer Science and Engineering, Southeast University, Nanjing, China a hjiang@seu.edu.cn, b yly.sunshine@qq.com

More information

Abstract. Introduction. Kevin Todisco

Abstract. Introduction. Kevin Todisco - Kevin Todisco Figure 1: A large scale example of the simulation. The leftmost image shows the beginning of the test case, and shows how the fluid refracts the environment around it. The middle image

More information

GPGPU. Peter Laurens 1st-year PhD Student, NSC

GPGPU. Peter Laurens 1st-year PhD Student, NSC GPGPU Peter Laurens 1st-year PhD Student, NSC Presentation Overview 1. What is it? 2. What can it do for me? 3. How can I get it to do that? 4. What s the catch? 5. What s the future? What is it? Introducing

More information

Speeding up MATLAB Applications Sean de Wolski Application Engineer

Speeding up MATLAB Applications Sean de Wolski Application Engineer Speeding up MATLAB Applications Sean de Wolski Application Engineer 2014 The MathWorks, Inc. 1 Non-rigid Displacement Vector Fields 2 Agenda Leveraging the power of vector and matrix operations Addressing

More information

Effective Learning and Classification using Random Forest Algorithm CHAPTER 6

Effective Learning and Classification using Random Forest Algorithm CHAPTER 6 CHAPTER 6 Parallel Algorithm for Random Forest Classifier Random Forest classification algorithm can be easily parallelized due to its inherent parallel nature. Being an ensemble, the parallel implementation

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay Introduction to CUDA Lecture originally by Luke Durant and Tamas Szalay Today CUDA - Why CUDA? - Overview of CUDA architecture - Dense matrix multiplication with CUDA 2 Shader GPGPU - Before current generation,

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Parallel Execution of Kahn Process Networks in the GPU

Parallel Execution of Kahn Process Networks in the GPU Parallel Execution of Kahn Process Networks in the GPU Keith J. Winstein keithw@mit.edu Abstract Modern video cards perform data-parallel operations extremely quickly, but there has been less work toward

More information

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,

More information