Evaluating the MATLAB Parallel Computing Toolbox Kashif Hussain Computer Science 2012/2013


The candidate confirms that the work submitted is their own and that appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered plagiarism. (Signature of student)

Summary

This study presents research into the Parallel Computing Toolbox (PCT) provided by MATLAB and the potential benefits of parallel programming. The study considers the Shared, Distributed and GPU memory models, and the algorithms include Matrix-Matrix Multiplication, the Jacobi Method and the Mandelbrot Set. The study evaluates approaches to parallel programming to determine which parallel approach is most efficient and under what circumstances, taking into account the trade-off between the simplicity and the performance offered by the toolbox. The conclusions show that there is no single method that guarantees parallel benefits for a given problem, and that some methods actually hinder performance rather than improve it. The study highlights how performance depends not only on the memory model deployed but also on the parallel model and the algorithm being used. Increased performance is not exhibited for all problems considered; however, the PCT does have the potential to provide performance benefits. MATLAB must continue to evolve at a great pace to avoid being left behind as the area of parallel computing unfolds further.

Acknowledgements

For making this project possible and worthwhile, I would like to thank Mark Walkley, my project supervisor, who provided support, advice and feedback which will remain priceless. I also thank Peter Jimack, my project assessor, for his invaluable feedback and great energy, which has driven me to do so well. I could not ask for better friends than Shaf and Adnan, without whom life would just not be the same. I thank Shaf for proof-reading my draft reports and doing such a brilliant job of it, and also for believing in me, providing endless support and making me laugh and smile when feeling down. I also thank Adnan for putting up with me in the toughest of times and keeping me awake during morning lectures; without him University could not have been the grand and unforgettable experience it has been. I would also like to acknowledge my cousin Minhass, who has been a great companion since childhood, and I take this opportunity to wish him every success in life. I dedicate this project to my parents, whose prayers, love and support enabled me to achieve such success in life. I thank them for always being there for me, encouraging me at all stages of life, and teaching me the value of hard work and education, making me into the person I am today. Thank you mum and dad for being the best parents in the world.

Contents

1 Introduction
  1.1 Aim
  1.2 Objectives
  1.3 Minimum Requirements
  1.4 Possible Extensions
2 Background Research
  2.1 Background to the Problem
  2.2 Parallel Computing
  2.3 Parallel Architectures
    2.3.1 Shared Memory
    2.3.2 Distributed Memory
    2.3.3 GPU Memory
  2.4 Parallel Programming
  2.5 MATLAB
    2.5.1 Parallel MATLAB
    2.5.2 History of Parallel MATLAB
    2.5.3 Other Parallel Approaches
    2.5.4 Parallel Computing Toolbox (Shared Memory, Distributed Memory, GPU Memory)
  2.6 Research Summary
3 Project Management
  3.1 Methodology
  3.2 Project Stages
    3.2.1 Background Research and Planning
    3.2.2 Learning
    3.2.3 Implementation and Testing
    3.2.4 Benchmarking and Evaluation
  3.3 Available Environments
  3.4 Assessing Risk
  3.5 Computational Tasks
    3.5.1 Matrix-Matrix Multiplication (Serial, Shared Memory, GPU Memory, Testing)
    3.5.2 The Mandelbrot Set (Serial, Shared Memory, GPU Memory, Testing)
    3.5.3 Jacobi Method (Serial, Shared Memory, GPU Memory, Testing)
  3.6 Reliability
  3.7 Sampling
  3.8 Evaluation Strategy
    3.8.1 Evaluation Criteria
    3.8.2 Measurements
  3.9 Schedule (Changes, Important Deadlines)
4 Implementation
  4.1 Matrix-Matrix Multiplication Implementation (Serial, Shared Memory, GPU Computing, Functionality Testing, Benchmark Testing)
  4.2 The Mandelbrot Set Implementation (Serial, Shared Memory, GPU Computing, Functionality Testing, Benchmark Testing)
  4.3 The Jacobi Method Implementation (Serial, Shared Memory, GPU Computing, Functionality Testing, Benchmark Testing)
5 Evaluation
  5.1 Terminology
  5.2 Matrix-Matrix Multiplication (Evaluating Results, Discussion)
  5.3 The Mandelbrot Set (Evaluating Results, Discussion)
  5.4 The Jacobi Method (Evaluating Results, Discussion)
  5.5 Evaluating Environments
  5.6 Evaluating Effort
  5.7 Summarising Evaluation
6 Conclusion
  6.1 Minimum Requirements
  6.2 Extensions
  6.3 Project Management
  6.4 Future Projects
  6.5 Predictions
Bibliography
A Personal Reflection
B Resources and Extra Material
C Ethical Issues
D Schedule and Data Tables
  D.1 Schedules
  D.2 Data for Matrix-Matrix Multiplication
  D.3 The Mandelbrot Set
  D.4 The Jacobi Method

Chapter 1: Introduction

Visiting the neighbourhood PC retail store provides plentiful evidence that this is the multi-core era. Manufacturer competition has become centred around the number of cores that can be packed onto a single chip. With clock frequencies reaching a limit, and not likely to change significantly for years to come, increasing computing power is not as simple as just using faster processors. Modern multi-core hardware does offer performance enhancements; however, to fully benefit from this, focus must shift to the software infrastructure, a shift to parallel computing [20, 19]. Piotr Luszczek, in his article Parallel Programming in MATLAB [20], said: "The trend so far has been to double the number of cores every few years. This translates into a doubling of computational power. Harnessing that power will require the right software, and writing that software will require the right software tools." Solving computationally intensive problems involving very large data sets or numerous simulations has become a necessity for researchers, scientists and engineers. Solving such complex problems, with running times and data sets that far exceed the capabilities of traditional uni-processor systems, means the requirement for increased performance has become ever more important [24]. Traditional parallel programming has been a low-level, technically challenging task. Languages now attempt to provide a high-level approach to parallel programming, but this may come at the expense of efficiency.

1.1 Aim

The aim of this project is to evaluate the parallel performance of the Parallel Computing Toolbox (PCT) available in MATLAB, determining which model of parallel programming best maximises the speedup and efficiency of a given scientific algorithm using the technologies available in the current MATLAB release, and seeking general guidelines on which parallel methods are best suited to an algorithm and problem size.

1.2 Objectives

The objectives of the project are to:

- Understand the PCT in MATLAB.
- Investigate the use of parallel programming with the PCT, evaluating how effective this can be.
- Evaluate the parallel performance of the PCT available in MATLAB with respect to the Shared, Distributed and GPU memory models.
- Determine the most applicable model of parallel programming for a given scientific algorithm.
- Evaluate the benchmark performance of some standard scientific algorithms with respect to the different memory models through testing.

1.3 Minimum Requirements

The minimum requirements are:

- Implement the parallel version of some scientific algorithm using the Shared and GPU memory models.
- Analyse the performance (speedup and efficiency) of the parallel implementations with respect to GPUs and Shared memory.
- Benchmark the results against the serial implementations, considering performance as a function of problem size and number of processors.

1.4 Possible Extensions

The possible extensions are:

- Implement the Mandelbrot Set using the Shared and GPU memory models.
- Implement the Jacobi Method for sparse and dense matrices using the Shared and GPU memory models.
- Implement the algorithms used with Distributed memory, scaling up to clouds of clusters.
- Implement hybrid solutions to the algorithms used, to evaluate the performance of different parallel methods co-existing.
- Compare the performance of the PCT in MATLAB with the parallel methods provided by GNU Octave.

Chapter 2: Background Research

The research in general builds upon the module Parallel Scientific Computing (TC32) [4] and also relates to the Numerical Computation and Visualisation (MJ21) [3] module. TC32 covered parallel architectures, different parallel programming methods and measures of performance. MJ21 covered numerical computations and how computational performance and numerical accuracy are affected by changes in parameters and problem size. This lays the foundation for the project's understanding. The research consisted primarily of technical reports and articles published by MathWorks about parallel computing with MATLAB and the PCT. Other online resources, including blogs written by MathWorks staff, helped with the understanding of the programming methods, along with articles and research papers discussing the history and development of parallel methods for MATLAB, and books on general parallel computing.

2.1 Background to the Problem

The computational power of computers, and the concept of using computers with multiple internal processors, multiple interconnected computers and the hundreds of cores of a GPU, has led to the demand for increased speed of execution and throughput [34, 39]. Engineers, scientists, researchers and financial analysts are having to deal with computationally intensive problems, large data sets and simulations, which means there is an ever-growing demand for greater computational power than computer systems can currently supply [9]. As the hardware peak is reached, there is only so much performance that can be achieved by depending on hardware alone. The improvement in CPUs has become fairly stable; manufacturers may be putting more of them in each machine, but maximising their use is not so simple, and using several machines simultaneously is harder still. It is therefore important to appreciate that performance is not just about hardware: software architectures are just as necessary. A change is required in the software infrastructure, and the combination of both is the basis of parallel computing [20, 19].

Industry has been able to speed up computationally intensive applications using multi-core machines and hyper-threading technology; however, there is now potential and promise brought by the graphics processing unit (GPU) to provide better computational performance [34].

2.2 Parallel Computing

There are several reasons why parallel computing should be used: it can save time and money, solve larger problems, provide concurrency, make use of non-local resources and limit serial computation [7]. Users of parallel computing algorithms aim either to reduce computation time or to undertake analysis of larger data sets. Analysis with smaller data sets can also be impractical with respect to time, due to the wide parameter sweeps that may be required [6]. As discussed, modern applications require more computing power, far exceeding the offerings of traditional uni-processor systems. Use of parallel processing can improve performance even when optimised code is not fast enough, and the study of parallel algorithms has now become the demand of high-speed computing [23, 33]. Video games have greatly benefited from the development of GPU cards, considering the memory, speed efficiencies and number of cores available. The GPU is best described as a processor designed to provide a high degree of parallelism, and with the core count on GPUs continuing to grow, investigating the GPU has become the norm for parallel computations [38].

2.3 Parallel Architectures

Parallel programming requires memory models, which define how operations on computer memory should be executed.

2.3.1 Shared Memory

In shared memory configurations, each processor can access any memory module in a single address space. It is natural to connect multiple processors with multiple memory modules, with the connection between the processors and memory made through some form of interconnection network [39]. All processors have direct access to a common physical memory, and shared memory is an efficient method of data passing between programs [4]. The global address space provides a user-friendly programming perspective on memory, for which the programmer remains responsible for ensuring synchronisation and correct access. The architecture for shared memory is shown in Figure 2.1.

Figure 2.1: Diagrams to show the architecture of shared memory (left) and distributed memory (right).

2.3.2 Distributed Memory

In distributed memory configurations, each processor has its own local memory associated with it, which only it can access, operating independently. Network-based memory access in place of physical memory is not common [4]. Data must be in local memory for computations to take place, hence where remote data is involved the computational task requires communication with remote processors. The programmer must manage the fine details associated with data communication between processors. The size of memory should increase proportionally to the number of processors, as memory is scalable with the number of processors [7]. The architecture for distributed memory is shown in Figure 2.1.

2.3.3 GPU Memory

The GPU memory model is different: it holds a massively parallel array of integer and floating-point processors, as well as dedicated, high-speed memory [35]. A typical GPU is composed of hundreds of smaller processors that can handle thousands of threads simultaneously. The GPU runs functions called kernels, which organise the threads into blocks to perform the computation on the GPU [16]. GPUs have always been associated with the acceleration of graphics rendering, but they are now being applied to scientific calculations. The GPU provides increased throughput; however, there are limitations. Memory access becomes a concern, as the performance of the entire system may become limited by a single component or resource. The architecture for GPU memory is shown in Figure 2.2. GPUs are attached to the host CPU via the PCI-Express bus, hence they have slower memory access than traditional CPUs. GPU programming requires data to be transferred from the CPU to the GPU before the calculation and then transferred back to the CPU if required. Such memory access is slower, constraining the overall computational speedup achieved [34]. Data transfer overhead can become so significant that it degrades the application's overall performance, especially if data is exchanged repeatedly to execute relatively few computationally intensive operations [38].

Figure 2.2: Diagrams to show the architecture of typical CPU and GPU memory.

2.4 Parallel Programming

Many parallel programming languages have been developed over the years, MPI and OpenMP being two of the more accepted and successful, with CUDA seen as the revolution in parallel programming [30]. NVIDIA have also stated in What is GPU Computing? [31] that GPU computing is growing with great momentum and GPUs have become a major aspect of parallel programming. MPI provides communication and portability for both shared and distributed memory models. It is mostly used for scalable cluster computing; computers in a cluster do not share memory, but it is still possible to run MPI on a shared memory resource. OpenMP is common for shared memory multiprocessors [18]. MPI works with explicit message passing and provides a low-level approach to parallel programming. MPI has had significant success in high performance computing, but making a program parallel with MPI is difficult and can be a time-consuming process; MPI is also harder to debug, and communication overheads can limit the overall performance. OpenMP's performance is limited by thread management overhead and cache coherency. CUDA solves the issue of shared memory for execution on the GPU but struggles with CPU-GPU communication, and although CUDA provides higher scalability and overcomes the limitations of OpenMP discussed, it does not support a wide range of applications. Parallel processing and multiprocessing are also possible with Python. Python libraries exist that allow the programmer to use multiple CPUs or multi-core CPUs, providing a shared memory environment or a cluster of computers with the ability to scale.

Most libraries feature Python's threading API, as this is included in the standard library, alongside support for processes and asynchronous communication for some form of concurrency [12]. With Java it is possible to use threads to achieve parallelism, essentially running competing threads on different CPUs; threads in Java are independent executions, working as sub-processes sharing system resources. More recent is Intel's hyper-threading technology, which attempts to use available resources more efficiently by increasing processor throughput to enhance threaded performance and provide parallelism [14]. Hyper-threading can provide theoretical cores, essentially doubling the processing power available: the total number of cores available on a machine is then the physical cores of the machine alongside as many theoretical cores, which allow the scheduling of simultaneous processes or threads. Another software tool is the Parallel Virtual Machine (PVM), designed to allow networked machines to be used as a single parallel computer. PVM provides support for FORTRAN and C, allowing existing code to be used, and a graphical environment is also available. PVM is known to allow its users to exploit existing computer hardware for parallel performance at least cost, but with no support for shared memory, and with data distribution and scalability issues, it is not as widely used today.

2.5 MATLAB

MATLAB is a high-level technical language that enables scientists and engineers to move quickly from ideas to implementations, providing an interactive environment for algorithm development, data visualisation, data analysis and numerical computation. Currently there are an estimated one million MATLAB users in industry and academia worldwide [27]. Major application areas in which MATLAB is used are signal and image processing, communications, control design, test and measurement, financial modelling and analysis, as well as computational biology. In some application areas, particular classes of problems cannot be solved by MATLAB alone; domain-specific add-on toolboxes provided by MathWorks extend the MATLAB environment [5]. The language is well suited to rapid prototyping and development of technical computing applications. Despite such features there are limitations associated with MATLAB: with algorithmic complexity and data set sizes growing rapidly, performance has become a concern.

2.5.1 Parallel MATLAB

The need to solve increasingly complex problems, with running times and data sets that far exceed the capabilities of the traditional computer system with a single processing unit, has led to the support of parallel computing in MATLAB [5]. With functional constructs and data structures, parallelism has been built into the MATLAB language. Users can program a parallel application without

making major changes to existing code where possible, as the execution environment and resource allocation remain separate from the language itself. MATLAB provides implicit parallelism using shared memory, which is built in and is not a particular feature of the PCT. The MATLAB client now provides multithreaded parallelism by default. Such parallelism allows multiple processors or cores to share the memory of a single computer and execute instruction streams, essentially generating multiple simultaneous instruction streams from one instance of MATLAB automatically [27]. Users can explicitly specify that the client launch in single-threaded mode. MATLAB processes are capable of splitting into threads which can run concurrently, and all the threads have access to the same variables. It is possible to use the maxNumCompThreads function to specify the number of threads MATLAB is to use, but it is likely that the best performance is gained when there are as many threads as cores. In a future release of MATLAB this function will be discontinued, after which MATLAB will provide implicit parallelism automatically. MATLAB automatically tries to use threads where deemed appropriate, though not all MATLAB functions have been predefined for multithreading. Multithreaded parallelism is the sharing of the memory of a single machine, for the execution of instruction streams by multiple processors or cores. MATLAB offers other means of parallelism using the PCT, as explained below:

Distributed Computing - Allows a single program to run numerous times with different parameters. Providing the distributed memory model, multiple instances of MATLAB are run on separate computers, each with its own memory, for the same independent computation [27].

Explicit Parallelism - Allows several instances of MATLAB to run on several processors or computers, often with separate memories, providing simultaneous execution where required. It can be deployed on both shared and distributed memory [27].

GPU Computing - Parallelism made possible using GPU memory, allowing data, operations and computations to be carried out on the GPU. MATLAB provides predefined GPU-enabled functions to help speed up MATLAB operations without low-level CUDA programming; existing CUDA kernels can also be integrated without additional programming [34].

2.5.2 History of Parallel MATLAB

Over 20 years ago MATLAB was not considered a programming language; its main use was for numerical analysis in academia. It had always been a serial program: there were no M-files, no toolboxes, no ODE solvers, no Fourier transforms and no graphics [29]. Cleve Moler of MathWorks in 1995 famously wrote an article, Why there isn't a parallel MATLAB: Our experience has made us skeptical [28], stating that the few experimental versions of Parallel MATLAB had not been effective enough to

justify further development. He based his arguments on the following points:

1. Memory model: Even though MATLAB had grown a lot bigger and parallel computers had advanced, memory distribution was still a fundamental difficulty. Computations that take far longer to distribute the data than to compute could not make effective use of parallel methods. With the most important model for parallel computers being distributed memory, parallel applications were not worthwhile.

2. Granularity: MATLAB spent only a small portion of its time in routines that could actually be made parallel; most of the time was spent in the parser, interpreter and graphics routines, where the potential for parallelism was minimal. For MATLAB to handle such parallelism, fundamental changes in the architecture would be required.

3. Business situation: With the lack of users with access to parallel computers, it was not a logical business decision to make such changes.

Cleve Moler insisted that their efforts would be devoted to improving conventional MATLAB, while promising that, even though MathWorks would not get seriously entangled in developments of parallel computing, an interest in them would remain. In 2007 Cleve Moler himself invalidated his arguments from twelve years before in his article Parallel MATLAB: Multiple Processors and Multiple Cores [27] by stating, "The situation is very different today." MATLAB had by then become a technical computing environment, and hardware capabilities had improved significantly, with further development expected. Likewise, MATLAB users had better access to networks and parallel machines. As single-memory machines were not capable of handling modern scientific problems, the need for parallel computation had become obvious. The increase in problem sizes and the improvement in processor speeds had shown greater potential for parallelism, as the time spent in other routines had seen a reduction. MATLAB was renowned for its user-friendly environment, and with most parallel computer users not being experts in parallel programming, MathWorks saw the potential market for Parallel MATLAB. With such changes, Parallel MATLAB was introduced, providing support for implicit, explicit and multithreaded parallelism.

2.5.3 Other Parallel Approaches

Attempts to make MATLAB parallel have been ongoing since the late 1980s. Cleve Moler at the time sidelined parallelism for MATLAB and continued to focus on MATLAB's original purpose, as discussed in Section 2.5.2.

A popular approach to making MATLAB parallel is extending MATLAB with libraries or even modifying the language itself. Projects which have done this successfully and become widely used by the user community are MatlabMPI and pMatlab, both developed at MIT Lincoln Laboratory [13]. pMatlab is an extension which is still actively developed; relying only on MatlabMPI for message passing capabilities, it can achieve portability. Portability comes at the compromise of performance, which however can be addressed by bcMPI. pMatlab is seen as a more high-level approach and brings simplicity to the process of parallelism, making it adequate for more basic users. MatlabMPI aims to give users the same functionality as MPI, implementing the MPI standard in MATLAB. MatlabMPI has been able to successfully provide speedup; however, writing such parallel programs is still difficult, hence it is preferred only by experienced parallel programmers [11]. Another approach tries to translate MATLAB into a low-level language (C or FORTRAN), generating parallel code from the compiler using annotations and other mechanisms. Such code translations are not easy, and with limited language and library support the difficulty is greater. A downfall of this approach is that users may have to discard their original MATLAB implementation altogether or re-implement their code using the limited functionality provided by such systems.

2.5.4 Parallel Computing Toolbox

In 2004 MathWorks responded to such demand by releasing the Distributed Computing Toolbox, which has since been renamed the Parallel Computing Toolbox (PCT). The PCT, with the use of multicore processors, GPUs and computer clusters, makes solving computationally and data-intensive problems possible [26]. The toolbox provides parallel processing constructs that allow implementing algorithms which are both data and task parallel, with the use of parallel for-loops and code blocks, distributed arrays, parallel numerical algorithms, and message-passing functions. Explicit programming in terms of hardware and network architectures achieves this at a high level [22]. The toolbox also provides a pmode function, which allows the interactive parallel execution of MATLAB commands. Using a command window connected to the workers running in the matlabpool, it is possible to send, receive and process data between the client and the workers. Interactive parallel labs assist the development of parallel algorithms and help in understanding the underlying communication. Those at Virginia Tech claim that "MATLAB's parallelism can be enjoyed by novices and exploited by experts" [15], emphasising that most users will choose high-level constructs, with advanced parallel programmers favouring low-level message passing functions. Key features of the PCT are summarised below:

- Running task-parallel algorithms on multiple processors with the use of parallel for-loops; CUDA-enabled NVIDIA GPUs are also supported.
- The ability to run twelve workers locally on a multi-core desktop.
- The MATLAB Distributed Computing Server supports computer clusters and grids.
- Parallel applications can be executed interactively or in batch.
- For handling large data sets and algorithms which are data parallel, single program multiple data constructs and distributed arrays can be used.

Shared Memory

The PCT gives users the ability to use up to twelve workers (essentially full instances of MATLAB), allowing applications to run locally on a multi-core machine. Previous versions of the PCT allowed only up to eight workers, though the current version allows twelve. This can be seen as local parallel computing. The maximum number of workers on the local machine is twelve; however, one worker per core is recommended. Running more workers than the cores available on the local machine would result in core sharing, which could hinder performance significantly in some cases [10]. Parallel programming has commonly been associated with threads on multi-core systems; however, it is important to understand that multithreading and multi-core processors are not closely associated or suggestive of each other. It is often assumed that for the best performance the number of threads and the number of cores must correspond [19], although problems do exist where fewer threads than cores would be the better choice in an attempt to utilise the full computational power available. The worker model is common for achieving speedup by organising programs into independent tasks [22, 26], which are then run concurrently. The high-level programming possible simplifies the development of parallel code, making it easier to go from serial to parallel. By removing the complexity associated with managing the coordination and distribution of computations and data, MATLAB helps the user exploit the parallelism possible in their program. The toolbox provides constructs such as distributed arrays, in-built parallel functions and spmd (single program multiple data); these can be used by most users and not much experience is required. The parallel constructs themselves handle the inter-worker communication and computations implicitly, taking the burden from the users. It is possible for the worker model to use low-level programming constructs for parallelism, but this comes with increased difficulty. The PCT provides users with the spmd construct, distributed arrays and co-distributed arrays. spmd is a language construct allowing seamless interleaving of serial and parallel programming: the spmd statement defines a block of code to run simultaneously on multiple workers, designating sections of code to run concurrently across the workers participating in the parallel

computation. During program execution, spmd automatically transfers the data and code used within its body to the workers and, once execution is complete, brings the results back to the MATLAB client session. To make use of the worker model, the key constructs are distributed and co-distributed arrays. Distributed arrays are special arrays which store segments of data on MATLAB workers; being able to store data on many workers makes it possible to handle larger data sets than the computer's RAM would allow. Co-distributed arrays partition data into segments and store them on different workers, so each worker has its own portion of the data. It is also important to note that co-distributed arrays only work when used within spmd blocks, otherwise the data will remain in main memory and not in worker memory. Distributed arrays are intended to be easier to use, whereas co-distributed arrays allow access to the underlying data (which allows implementing new communicating algorithms). Also, distributed arrays partition the data amongst the specified number of workers implicitly, whereas co-distributed arrays allow the user to specify the distribution scheme explicitly. The default distribution is along the column dimension. It is possible to distribute data along a single specified dimension in a non-cyclic manner, or to distribute data by blocks using a two-dimensional block-cyclic distribution.

Distributed Memory

Distributed computing in MATLAB allows the user to develop their program using the PCT on a multi-core machine, then scale up to many computers by using the Distributed Computing Server (DCS). The DCS has the capability of running MATLAB programs on clusters of computers, clouds and grids. The server runs on clusters of computers or virtual machines, on a distributed computing resource or cloud service respectively [25]. It can be seen as distributed remote parallel computing. The server offers multiple workers, which are essentially MATLAB computational engines running independently of client sessions, and it also gives users the ability to run applications simultaneously. In general, multiple instances of MATLAB run multiple independent computations on separate computers, each with its own memory [27]. The PCT provides special arrays which can hold extremely large amounts of data compared to a machine's RAM. The DCS can divide and allocate data over many MATLAB worker processes running on a computer cluster [26], which in principle can overcome memory limits and help solve problems involving the manipulation of large data. The toolbox provides functions allowing users to interact with distributed and co-distributed arrays and manipulate data remotely with no requirement for low-level programming. Such functionality works with the worker model discussed above; the shift to the distributed model is seen when the program scales up to clusters of computers, grids and clouds using the DCS.
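To make the worker model described above concrete, the following is a minimal sketch (not taken from the project's code) of how a matlabpool, an spmd block and distributed and co-distributed arrays might be combined in the R2012b release used here; the pool size of four and the matrix size are illustrative assumptions.

    matlabpool('open', 4);                 % start four local workers (R2012b syntax)
    n = 1000;                              % illustrative problem size

    A = distributed.rand(n);               % distributed array, partitioned implicitly by column
    B = A * A;                             % operations on distributed arrays run on the workers
    fprintf('B(1,1) = %f\n', gather(B(1,1)));   % gather brings data back to the client

    spmd
        % inside spmd each worker holds one segment of a co-distributed array
        C = codistributed.rand(n);         % co-distributed array (default column distribution)
        local = getLocalPart(C);           % this worker's own portion of the data
        s = sum(local(:));                 % work on the local segment only
    end
    total = sum([s{:}]);                   % s is a Composite; combine the per-worker sums
    fprintf('sum of all entries: %f\n', total);

    matlabpool close;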

Also, to avoid resource-sharing issues, workers spanning a range of machines are likely to utilise full computational resources. The worker model provided by the PCT can be deployed on both shared and distributed memory; therefore a worker model set up for shared memory can easily be extended to utilise distributed memory. This is achieved by setting up a matlabpool of workers which uses cores from other computers connected via the same network, rather than being limited to resources on the local machine. Also, using distributed computing it is possible not only to utilise the shared memory on each computer but also to utilise the GPU memory on each computer, supporting distributed GPU computing.

GPU Memory

The PCT gives users the ability to accelerate programs by executing some operations directly on the GPU. It is possible to run MATLAB code on a GPU with simple changes to existing code, whereas GPU programming in C or FORTRAN is a complex and time-consuming process [34]. Using the PCT to accelerate a program on the GPU requires it to meet certain conditions; not all applications can benefit from the GPU and performance is not always guaranteed. Sarah Wait Zaranek from MathWorks discussed how GPUs could provide increased performance in her blog Using GPUs in MATLAB [40], and how there are certain criteria to consider before programming on the GPU. Applications not meeting the conditions could actually run more slowly on the GPU. For a program to be eligible, it must be computationally intensive and massively parallel. Best performance is achieved when the GPU's parallel nature is exploited such that all the cores are kept busy, and it is also important that the computation time is significantly greater than the communication time. In situations where the communication is greater than the computation, performance can be worse when using parallel methods [38]. For basic users the toolbox provides a convenient way to increase the performance of their programs by executing them on the GPU: many functions come ready-built for the GPU, therefore basic changes to existing code can take advantage of it. MATLAB functions allow transferring data to and from the GPU and running numerical operations and other methods directly on the GPU. To initialise data directly on the GPU as an array, the gpuArray function is used. These arrays can be operated on using normal MATLAB functions, most of which are now GPU-enabled. It is also possible to apply a function to each element of these arrays, where applicable, using the arrayfun function. GPU-enabled functions work only with full matrices; there is no support for sparse matrices as yet. There is also extensive functionality for experienced programmers, who can use the CUDAKernel interface to run their own CUDA kernels on the GPU from MATLAB. A point to consider is that a compatible GPU is required: CUDA-enabled NVIDIA GPUs work well, but the main requirement is for the GPU to have compute capability of 1.3 or greater, as any GPU with compute capability below this does not support double precision arithmetic [21].
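As a brief illustration of this workflow, the sketch below (an assumption-laden example rather than the project's implementation) transfers a matrix to the GPU, applies a GPU-enabled operation and an element-wise function, and brings the result back; timing the transfer and the computation separately helps expose the communication overhead discussed above.

    g = gpuDevice;                                % query the current CUDA-capable device
    fprintf('Compute capability: %s\n', g.ComputeCapability);

    n = 2000;                                     % illustrative problem size
    A = rand(n);                                  % data initialised in CPU memory

    tic; Ag = gpuArray(A); tTransfer = toc;       % CPU-to-GPU transfer

    tic;
    Bg = Ag * Ag;                                 % GPU-enabled matrix multiplication
    Cg = arrayfun(@(x) x.^2 + 1, Bg);             % element-wise function applied on the GPU
    C  = gather(Cg);                              % copy the result back, forcing completion
    tCompute = toc;

    fprintf('transfer %.4fs, compute+gather %.4fs\n', tTransfer, tCompute);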

The hardware is also a significant factor: better GPUs are much more likely to perform better, provided the problem is a good fit for the GPU (massively parallel and computationally intensive). Multiple GPUs on a single machine, or GPUs on a cluster of machines connected through a network, can be utilised using the DCS. To achieve distributed GPU computing, the worker model is used such that there is one worker per GPU and the matlabpool is set up to utilise the GPU of each computer within the cluster. It is also possible to combine the CPU and GPU within the same matlabpool. However, communication overheads can become very large, therefore it is important to take into account both the network and GPU overheads. The problem must be extremely parallel and data intensive for such overhead to become minimal.

2.6 Research Summary

The background research makes it quite clear that the MathWorks toolset aims to give users the opportunity to take advantage of parallelism without much effort. MATLAB users have been given sufficient choice: novice users may choose the high-level constructs, whereas the experienced user is more likely to use low-level constructs. In 2008 Gaurav Sharma and Jos Martin from MathWorks wrote an article, MATLAB: A Language for Parallel Computing, insisting that even with MATLAB being parallel, the aim is to ensure that programmability trumps performance and the highest degree of programmability remains [11]. The concern, however, remains about the trade-off between simplicity and performance. There are many features provided by the PCT, but it is still not certain which parallel approach is the best, if any; nor whether there are guidelines which could help users determine in advance which parallel approach would be most applicable to their problem and whether there is any potential for improvement in performance. Evaluating the PCT would bring further insight into the parallel programming possible in MATLAB. There are also no books actually published on how to program in parallel in MATLAB using the PCT; only a limited number of research papers and other resources exist that provide information and details on the different aspects of the toolbox and parallel approaches. The current situation can itself be considered a parallel task: the knowledge has been scattered and the learning is complete, but it is now time to gather and put everything together.

Chapter 3: Project Management

Project management is essential to achieving a well-produced study. Nick Efford and Hamish Carr taught the second semesters of the modules Project Management (SS12) [1] and Software Systems Engineering (CR21) [2] respectively, and provided a clear view of the methodologies available and how they differ. The key methodologies considered are the Waterfall Model and the Agile Model. The waterfall method can be costly and inefficient, as it prevents changes to, or repetition of, stages of development already completed. The agile method means breaking the problem down into smaller problems, promoting iteration over each stage of development. Time management is the key to a well-structured project, and planning is essential, as the time available may be reduced or the time taken on a particular task may exceed expectations. Such delays are common in projects; however, being prepared, structured and logical can help minimise this concern.

3.1 Methodology

Multiple implementations, testing and evaluations are best suited to the Iterative Approach from the agile model. For the purposes of this project, this consists of implementing basic serial versions of the algorithms and then building upon them to develop parallel implementations of the same algorithms for the various memory models. Each implementation will have test cases which will be benchmarked. Approaches which prevent re-iterating and making changes would not be suited to this particular project. The iterative approach works towards higher quality and fewer defects, preventing issues before they occur, and also allows better progress monitoring and predictability.

3.2 Project Stages

Various different implementations of the algorithms are required to test the performance of the Parallel Computing Toolbox (PCT), hence being methodical was very important. The project was broken down into four stages, each of which links with the next.

3.2.1 Background Research and Planning

Background research was important to gain a better understanding of the problem and to learn about the domain. Planning the implementations and evaluation criteria allowed working towards a clear goal. With more knowledge of the domain and the problem, it was possible to assess early whether the minimum requirements and objectives of the project were feasible.

3.2.2 Learning

Researching the problem and the domain allowed discovery of the methods and techniques available with the PCT. To understand how these work and can be used, sample implementations of basic algorithms and test functions were created. Understanding how the algorithms worked was essential, as the correctness of an algorithm was important for its performance. Returning to the planning stage was also important, as with knowledge and experience of the domain it became obvious that some changes were required.

3.2.3 Implementation and Testing

Using the iterative model, a serial implementation of each algorithm was developed. Provided the serial implementation was correct, parallel implementations of the same algorithm were developed using the GPU and Shared memory models. Implementations were tested for algorithmic correctness and performance, measured with respect to running times, speedup and efficiency [17]. Once all implementations of the algorithm were completed, they were benchmarked in the next stage.

3.2.4 Benchmarking and Evaluation

Benchmarking the implementations was important to understand how effective the different memory models were and how the parallel techniques performed. A benchmark program was developed which would run the various implementations of an algorithm under the same criteria to obtain accurate and reliable results for evaluation. The main purpose of the evaluation is to discuss and analyse the results gathered from benchmarking, in relation to the background research and the project aims and objectives.

3.3 Available Environments

There are three environments available for the purposes of this project; the environments are similar but have some technical differences that need to be considered when testing and evaluating implementations. Only certain workstations will be used for the project. DEC10 is the primary environment for testing the implementations to evaluate performance. ENIAC is the secondary environment, which will be used for development and prototyping in the event of DEC10 being unavailable; ENIAC machines are not as powerful as those in DEC10 and hence will not be used for testing purposes. A high-performance

research workstation is also used in certain cases as a comparison to DEC10. The details of the workstations to be used for testing and benchmarking follow.

Workstation cslin040:
- Environment: Machine in DEC10 running the CentOS 6.4 operating system.
- CPU: Intel(R) Xeon(R) CPU, quad core, with 16 GiB memory.
- GPU: Quadro 600, 96 CUDA cores with 25.6 GB/s memory bandwidth.
- Software: MATLAB (R2012b) and Parallel Computing Toolbox Version 6.1.

Workstation cslin049:
- Environment: Machine in DEC10 (Final Year Project Room) running the CentOS 6.4 operating system.
- CPU: Intel(R) Core(TM) i7 CPU, 2.93 GHz, quad core (hyper-threading), with 8 GiB memory.
- GPU: No supported GPU device.
- Software: MATLAB (R2012b) and Parallel Computing Toolbox Version 6.1.

Workstation cslin146:
- Environment: Machine in the School of Computing running the CentOS 6.4 operating system.
- CPU: Intel(R) Xeon(R) CPU, 2.4 GHz, 4 quad-core processors, with 24 GiB shared memory.
- GPU: No supported GPU device.
- Software: MATLAB (R2012b) and Parallel Computing Toolbox.

(Intel's hyper-threading technology splits a single core into two logical threads, allowing the operating system to treat each core as two cores.)

3.4 Assessing Risk

Setbacks and problems are expected in this project, and these could be minor or major. The major factor in the project is time: when working with time constraints, any setback can have an effect on the overall time available for the project. It is important to consider as a risk that the PCT is fairly new, and few resources and little literature exist, therefore it is expected that delays may occur. Due to the minimal initial knowledge, creating very basic implementations of small problems will help in understanding the underlying concepts and learning about the fundamental constructs available. With the iterative development chosen and the initial learning of the parallel techniques, development issues should be minimal. Unexpected issues are possible and difficult to prevent, but must be accounted for.
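As a small aside, the suitability of a given workstation (Section 3.3) can be checked from within MATLAB before benchmarking begins; the sketch below is illustrative only and is not part of the project's benchmark code.

    % Report the threading and GPU resources visible to this MATLAB session.
    nThreads = maxNumCompThreads;        % computational threads available to MATLAB
    fprintf('Computational threads: %d\n', nThreads);

    if gpuDeviceCount > 0                % PCT function: number of CUDA devices present
        g = gpuDevice;                   % select and query the default device
        fprintf('GPU: %s, compute capability %s\n', g.Name, g.ComputeCapability);
    else
        fprintf('No supported GPU device found.\n');
    end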

3.5 Computational Tasks

The computational tasks chosen provide a range of problems which can be addressed by the PCT. It is expected that the problems considered can benefit from parallel techniques; the evaluation will assess how effective the parallelism provided by the PCT is for such problems.

3.5.1 Matrix-Matrix Multiplication

Matrix-Matrix Multiplication is simple and ideal for gathering some preliminary results and for establishing further understanding of the programming methods available with the PCT. MATLAB has an in-built function mtimes, defined by Equation (3.1), which multiplies two matrices A and B, therefore only the initialisation of the matrices is required. The computation of an element (i, j) of C is the product of the ith row of A with the jth column of B. Matrix-Matrix Multiplication will be made parallel using the shared and GPU memory models. Testing for this problem will vary the matrix dimension, n, and the number of workers used for the different implementations possible with a particular memory model.

C(i, j) = \sum_{k=1}^{n} A(i, k)\, B(k, j)    (3.1)

Theoretical Running Time - For Matrix-Matrix Multiplication the running time is O(n^3). Theoretically, as the problem size doubles the timings should increase by a factor of 8, which is the correct algorithmic behaviour.

Serial - The serial implementation will be used as the base measurement for comparison. MATLAB will have to be set up to run with a single thread to gain a measure of serial time, because, as discussed in Section 2.5.1, MATLAB provides implicit parallelism by default. Using a single thread will also allow evaluating the performance of implicit parallelism.

Shared Memory - The shared memory implementations will use the serial code as their structure. Firstly, the implicit parallelism provided by multithreading will be measured; this code does not require any changes, as discussed in Section 2.5.1, and just launching MATLAB with its default settings allows the code to run with multithreading. To take advantage of the worker model, spmd, distributed and co-distributed arrays will be necessary. The distribution of data amongst the available workers will be done by MATLAB automatically, using the default distribution scheme of distributed arrays highlighted in Section 2.5.4. A sketch of how the serial and multithreaded baselines might be measured is given below.
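The following sketch shows one way the single-threaded baseline and the default multithreaded run could be timed; it is illustrative rather than the project's actual benchmark code, and the matrix size and the use of a single repetition are assumptions.

    n = 1024;                              % assumed problem size
    A = rand(n);  B = rand(n);             % the same data is reused for both runs

    nOld = maxNumCompThreads(1);           % force a single computational thread
    tic; C1 = A * B; tSerial = toc;        % serial baseline using mtimes

    maxNumCompThreads(nOld);               % restore the default thread count
    tic; C2 = A * B; tThreads = toc;       % implicit multithreaded run

    fprintf('serial %.4fs, multithreaded %.4fs, speedup %.2f\n', ...
            tSerial, tThreads, tSerial / tThreads);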

GPU Memory - The GPU memory implementations will use the serial code. Several implementations are possible to utilise the GPU, but not all are feasible. Transferring data to and from the GPU, initialising the data directly on the GPU, and the cost of the computation only will all be considered. This will provide a better understanding of any overheads and communication costs associated with the GPU. These measurements will be compared against both the serial and multithreaded times, alongside the worker model discussed above.

Testing - Testing the algorithmic behaviour is achieved by making sure the code produces the correct mathematical result. For Matrix-Matrix Multiplication this is simple for smaller matrices, but for larger matrices it can be challenging if done by hand; however, the mtimes function in MATLAB ensures this for the user. Timings for very small matrices are likely to be very quick, hence they will need to be executed many times to obtain accurate mean run times with minimal variation. Timings will be made to measure data transfer, computation time and data overhead. For benchmarking, all the different variations of the algorithm will be executed on the same generated data to get fair results and make multiple testing efficient.

3.5.2 The Mandelbrot Set

The Mandelbrot set is defined by plotting points that satisfy a particular condition; visual pictures are created by plotting thousands of points satisfying this condition. In mathematical terminology it is known as a set of points whose boundary is a distinctive and easily recognisable two-dimensional shape. The Mandelbrot set is a pixel-by-pixel operation, where the algorithm is applied to each pixel, which can be a slow calculation in serial. The algorithm can be seen below in Equation (3.2). For each pixel location a complex number Z_0 is defined, which is iterated on for a fixed number of iterations or until it diverges. The set is infinitely complex, yet the underlying calculation is based on a simple equation involving complex numbers. The nature of the algorithm makes it perfect for parallel programming, as it is embarrassingly parallel [34]. With the PCT, element-wise operations are possible on the GPU with a function already available, as discussed in Section 2.5.4, so it is expected that the GPU will provide good performance. The algorithm for the Mandelbrot Set is:

Z_{(0)} = Z_0, \quad Z_{(i+1)} = Z_{(i)}^2 + Z_{(0)}, \quad i = 1, \dots, n.    (3.2)

Theoretical Running Time - For the Mandelbrot set the running time is O(n^2); the Mandelbrot set is a two-dimensional array of pixels, therefore there are n^2 pixels to operate on. Theoretically, as the problem size doubles the timings should increase by a factor of 4, which is the correct algorithmic behaviour.

Serial - The code for the Mandelbrot set will be adapted from that available in [36]. The code is quite simple: the starting points for each of the calculations will be set up and the number of iterations fixed, with only the grid size varying. Once the limits have been set, the grid will be initialised in CPU memory and the calculation made. The output will have to be set up so that a plot is automatically created.

Shared Memory - Implicit parallelism will not require a separate implementation, and the serial code will be used without any changes. The worker model will be able to use the serial code as its foundation, and will require the spmd construct and co-distributed arrays previously discussed in Section 2.5.4. Distributed arrays cannot be considered here because some of the constructs and methods used for the implementation of the Mandelbrot set are only supported by co-distributed arrays.

GPU Memory - The Mandelbrot set can be implemented in several ways using GPU memory. Implementations will use the same serial algorithm, but the initialisation of data will vary: to identify any communication overheads, data will either be initialised on the CPU and then transferred to the GPU, or initialised directly on the GPU. The Mandelbrot set holds the key property of being an element-wise operation, because it is a pixel-by-pixel calculation. The code being used is already vectorised so that every location is updated at once, therefore element-wise operations performing the calculation on each element independently are possible with the GPU, as discussed in Section 2.5.4. A sketch of this element-wise GPU approach is given after the testing notes below.

Testing - Testing that the Mandelbrot set produces the correct image as output is important to ensure the code is working correctly. The Mandelbrot set is data-intensive and massively parallel and is expected to work well with the GPU; however, due to the memory limits of the GPU, only smaller problem sizes will be possible. The worker model deployed on shared memory is expected to overcome this memory limit whilst providing some performance gains over the serial code. Benchmark code will be designed for the shared memory and GPU memory separately, with the same data set up for all the tests to allow fair comparisons.
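The element-wise GPU approach described above might look roughly like the following sketch; the grid size, iteration count, plotted region and escape radius are assumptions, and the project's actual implementation (adapted from [36]) may differ.

    n = 1000;  maxIter = 500;                          % assumed grid size and iteration count
    x = linspace(-2.0, 1.0, n);                        % assumed region of the complex plane
    y = linspace(-1.5, 1.5, n);
    [X, Y] = meshgrid(x, y);
    Z0 = gpuArray(complex(X, Y));                      % transfer the starting grid to the GPU

    Z = Z0;
    count = gpuArray(zeros(n));                        % per-pixel iteration counts on the GPU
    for k = 1:maxIter
        Z = Z.*Z + Z0;                                 % vectorised update of every pixel at once
        count = count + (abs(Z) <= 2);                 % count only pixels that have not escaped
    end

    count = gather(count);                             % bring the result back for plotting
    imagesc(x, y, count); axis image;                  % visual check that the set looks correct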

3.5.3 Jacobi Method

The Jacobi Method is an iterative method used for solving linear systems. Such methods are an important part of numerical linear algebra and scientific computing. The formula for the algorithm is defined by Equation (3.3) [3]. The matrix A is decomposed into a diagonal component D and a remainder; x^{(k)} is the current approximation, and starting from an initial guess the solution is obtained iteratively. The Jacobi method will be used to evaluate the performance of both dense and sparse matrices: it will be used as an iterative solver for random matrices, considering both sparse and dense cases, and it will also be used to solve a particular sparse system, the Heat Diffusion problem. GPU programming is not possible with sparse matrices, as discussed in Section 2.5.4, but it is important to understand how dense data can be handled by the Jacobi algorithm with respect to shared and GPU memory.

x^{(k+1)} = x^{(k)} + D^{-1}\left(b - A x^{(k)}\right)    (3.3)

Theoretical Running Time - For the Jacobi Method, the running time for full matrices is O(n^2). Theoretically, as the problem size doubles the timings should increase by a factor of 4, which is the correct algorithmic behaviour. However, for sparse matrices this cost can be reduced to O(n). The amount of work required also depends on the number of iterations needed for a satisfactory solution [3].

Serial - For the serial implementation of the Jacobi method both the dense and sparse cases will be considered. For the sparse case a random matrix will be created with the matrix density explicitly specified. The initialised matrix will then be made diagonally dominant, otherwise the algorithm will not behave correctly. Diagonal dominance will be ensured by modifying the randomly generated matrix to meet the condition defined by Equation (3.4): the matrix is diagonally dominant if the magnitude of the diagonal entry a_{ii} in each row is greater than the sum of the magnitudes of all the other entries a_{ij} (denoting the ith row, jth column entry) in that row. A vectorised implementation of the Jacobi method will be used, removing the inner-most for-loop; code vectorisation makes the code more efficient, making use of matrix and array operations from the MATLAB libraries [23]. The initial guess for the solution will be a vector of zeros. For the Heat Diffusion problem, a particular sparse matrix is set up to find the distribution of temperature throughout a square for a given temperature distribution on the boundary. The matrix for the system of equations will be sparse and will be solved using the Jacobi method for a fixed number of iterations.

|a_{ii}| > \sum_{j \neq i} |a_{ij}| \quad \text{for all } i    (3.4)

Shared Memory - Implicit parallelism will require running the same serial code without any modifications. The worker model is implemented by initialising the matrix, right-hand side vector and initial solution using co-distributed arrays within spmd blocks. A sketch of the vectorised Jacobi update on which these implementations are based is given below.
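For concreteness, a minimal sketch of the vectorised dense Jacobi iteration follows; the problem size, the way diagonal dominance is forced and the fixed iteration count are illustrative assumptions rather than the project's code.

    n = 500;  nIter = 100;                     % assumed problem size and fixed iteration count
    A = rand(n);                               % random dense system
    A = A + n * eye(n);                        % crude way to force diagonal dominance (assumption)
    b = rand(n, 1);

    D = diag(A);                               % diagonal entries as a column vector
    x = zeros(n, 1);                           % initial guess of zeros

    for k = 1:nIter
        r = b - A * x;                         % residual of the current approximation
        x = x + r ./ D;                        % vectorised update: x = x + D^-1 (b - A x)
    end

    fprintf('final residual norm: %g\n', norm(b - A * x));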

These shared memory implementations will be similar: the actual algorithm will remain unchanged for the sparse and dense implementations, as well as for the Heat Diffusion problem. For the Heat Diffusion problem the sparse matrix will be defined using co-distributed arrays. It is the actual initialisation of the data and the execution of the iterative solver that will require modifying.

GPU Memory - GPU computing with sparse matrices is not yet possible, therefore the sparse matrix and Heat Diffusion problem cannot be implemented on the GPU and only the dense case will be considered. For the implementation, the shared memory code can be used as the structure. Implementations will consider the cost of initialising the data and the cost of the computation only, to highlight any overheads associated with the GPU. It is expected that as the size of the data increases, the performance will be overwhelmed by the overhead of data transfer between the CPU and the GPU. The Heat Diffusion problem will not be implemented using GPU computing, because the problem uses a sparse matrix.

Testing - To ensure that the implementations are correct it is important to check that the algorithm is behaving correctly: for the Jacobi method the error between the exact values and the computed values should always steadily decrease, regardless of the problem size [3]. Also, for the Jacobi method to work correctly the matrix must be diagonally dominant; using random matrices this will not happen implicitly. It is expected that without diagonal dominance the algorithm will not function correctly, hence this will be enforced explicitly. The number of iterations will be fixed for all the tests and only the problem size will vary, with the same sample being distributed to all possible implementations. For the case considering the Heat Diffusion problem, the actual visual output and solution for a given grid size and fixed boundary conditions should remain the same.

3.6 Reliability

The implementations will be developed in possibly different environments, but the testing of the implementations will be controlled. Development will take place in DEC10 or ENIAC. The testing of the implementations will take place in DEC10, preferably on the workstations discussed in Section 3.3 as well as the high-performance machine also mentioned; if not, then other machines with similar architectures are available within the same environment. Other applications or processes on the machine used must be kept to a bare minimum to ensure that the performance is unaffected, but this has to be considered. For this reason most of the testing and benchmarking is carried out only when the environment is least occupied; the benchmarking tests have therefore been scheduled for early mornings or late evenings. The benchmark program developed will run the tests in a batch to obtain a fair comparison of results.

3.7 Sampling

For the purpose of quantitative evaluation, the average and standard deviation will be used to increase the accuracy and reliability of the results. Multiple test runs will be required, the number of which will vary with the problem size. These runs will be averaged, and the accuracy of the average will be checked using the standard deviation. Smaller problem sizes will be executed a greater number of times, while fewer runs will be made of the longer executions; the number of executions per problem size will be determined by preliminary testing. The number of executions for each problem size will remain the same across all implementations to ensure fair comparisons. Problem sizes will range from reasonably small up to the largest possible for a particular implementation, where the largest possible size depends on the memory limit of the memory model used. Similar sampling will be used for all problems considered, to allow a clear comparison. The number of workers will depend on the environment being used: in the DEC10 environment 4 workers are possible on the standard machines, 8 workers are possible on the i7 machines using hyper-threaded cores, and cslin146 has 16 cores available, although due to limitations of the PCT a maximum of 12 workers is possible. More workers than available cores will be used for some implementations to demonstrate the concept of core sharing.

3.8 Evaluation Strategy

Quantitative evaluation will be used for comparisons of the different implementations of the Matrix-Matrix Multiplication, Mandelbrot Set and Jacobi Method algorithms discussed in Section 3.5. Quantitative timing data will be obtained and recorded from the benchmarking, to then be compared. Preliminary tests implementing Matrix-Matrix Multiplication and the Jacobi Method will help in understanding the environment and the features of the PCT. It is important to take into account that by default MATLAB provides implicit parallelism by multithreading, as explained in Section 2.5.1; comparisons of other implementations will therefore be made both against the default MATLAB times and against MATLAB set up to explicitly use a single thread, to provide a clear comparison. The single-threaded measurements will be used as the serial time of an implementation. Qualitative evaluation will also be used to assess the effort required to implement a particular model using the different memory models. This is important, as it is essential to understand whether or not using the PCT is worthwhile and to assess the ease of parallel programming with MATLAB. In accordance with the iterative methodology, testing of each implementation will be carried out to prevent bugs and to obtain preliminary results that give an understanding of the expected behaviour. A final evaluation of the benchmarking tests will be carried out after all the implementations and measurements are complete.
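As a concrete illustration of this sampling strategy, a benchmark run for one problem size might look like the sketch below; the operation, problem size and number of runs are illustrative assumptions rather than the project's actual benchmark code.

    nRuns = 10;
    times = zeros(nRuns, 1);
    for r = 1:nRuns
        A = rand(2000);  B = rand(2000);    % data re-initialised for each run
        tic;
        C = A*B;                            % the computation being measured
        times(r) = toc;
    end
    % the average and standard deviation are used to judge the reliability of the timing
    fprintf('mean = %.4f s, std = %.4f s\n', mean(times), std(times));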

3.8.1 Evaluation Criteria

It is important to ensure not just the correctness of the implementations but also the expected algorithmic behaviour. The measurements obtained from the evaluation will make it possible to understand the effectiveness of the implementations by making comparisons. Only serial implementations that have been validated as correct will be developed further using the parallel techniques considered. Comparisons will focus on the speedup, efficiency and execution times for a range of different variables. Measuring the scalability is also essential, in terms of both the problem size and the number of workers used. Comparing the different environments used will be important for understanding how the parallel performance is affected. The qualitative evaluation will determine the difficulty of developing each implementation relative to the performance gains.

Measurements

Parallel performance is commonly measured using execution times, speedup and efficiency [17]. For measuring the performance of the PCT, the tic and toc functions available within MATLAB will be used. These use an internal clock to measure the performance of algorithms and functions, and they can be placed so as to specify explicitly which parts of the code or algorithm to measure. The other measurement available within MATLAB was cputime, which measures only the elapsed CPU time and was therefore not appropriate when considering the GPU implementations. All measurements will hence use the tic and toc functions; the timed region will encapsulate any extra code required for the parallel implementations, to obtain a fair comparison. For all implementations, measurements will consider both the computation alone and the computation together with any data transfer involved.

3.9 Schedule

The initial schedule intended to be followed is shown in Appendix D, Figure D.1. During the course of the project some aspects had to be changed; the amended and final schedule can be seen in Appendix D, Figure D.2. Initially it was difficult to plan exactly how much time would be required for certain tasks, especially in terms of implementation, and even whether they would be feasible. After the background research it became clear what was actually possible and roughly how long it would take. The changes made during the project concerned the minimum requirements and the implementations. The schedule consists of tasks arranged over the duration of the project, with each task having some leeway in the event of unexpected issues.

Changes

Upon understanding the PCT better, it became apparent that using the distributed memory model would not be possible due to resource issues. As the licence for the Distributed Computing Server (DCS) was unavailable for the available environments, exploring this route would not be possible.

It was valuable that this issue became apparent through thorough research and initial testing of the environment, as it could have been a significant setback in the latter stages of the project. It is, however, worth noting that MATLAB licences are expensive, and gaining access to both the PCT and the DCS would be costly. Also, due to a lack of domain knowledge the initial research was difficult, and some concepts of the PCT were misunderstood, which led to some false assumptions. However, creating the test environment and running test implementations resolved and rectified this at an early stage. Had this not been the case, some implementations would have been developed that were invalid, making the evaluations untrustworthy. Initially all the implementations were to be completed and then benchmarked and evaluated; however, it was deemed appropriate to integrate these stages to prevent any cascading issues. When considering the algorithms to be used for this project, the Heat Diffusion problem was initially chosen as a problem in its own right. During background research and after setting up a test environment this did not seem feasible; due to time constraints and the level of programming possible, the Heat Diffusion problem was instead used as a specific case for the Jacobi Method rather than as an individual implementation.

Important Deadlines

Each week two meetings with the supervisor are scheduled, for discussion of progress and any guidance that may be required. There are some important deadlines throughout the project; it is essential that the time available is used effectively and the deadlines are not underestimated, because efficient progress is a key requirement.

Friday 25th January - Submit aim and minimum requirements form.
Friday 22nd February - Seminar presentation.
Friday 1st March - Submit mid-project report.
Friday 19th April - Progress meeting (student/supervisor/assessor).
Wednesday 8th May - Submit project report.
Friday 17th May - Presentation at student workshop.

Chapter 4

Implementation

It is now important to describe the implementation process followed for the algorithms discussed previously. All algorithms were implemented using MATLAB and the Parallel Computing Toolbox (PCT). The discussion will highlight any concerns that arose during the implementations, and how they were tested for performance and algorithmic correctness.

4.1 Matrix-Matrix Multiplication

The Matrix-Matrix Multiplication code was taken from MJ21 [3]. The implementation provided a better understanding of, and valuable experience with, the environments for further development. Using the GPU memory and the worker model for shared memory required quite different approaches, and it was necessary to fully understand the PCT to complete them properly.

Implementation

It was possible to implement the algorithm using implicit parallelism, the worker model and GPU computing. Issues arose with regard to performance and testing, which were resolved; such issues were faced due to the initial lack of knowledge and understanding of the PCT. Running preliminary tests on the test environment helped gain a better understanding of the programming constructs required for the implementations.

Serial

The Matrix-Matrix Multiplication was first coded in serial. The serial implementation ensured that the MATLAB client ran with only a single thread and that the default parallelism of multithreading was not used. The actual code did not require much modification to achieve this. The code was set up such that the size of the matrix could be specified as user input, to make preliminary testing a seamless process. Data for the algorithm was initialised using random matrices.

Initially a separate function was used which calculated C = A * B given the matrices A and B, but that was then removed and the MATLAB function mtimes, as initially proposed, was used instead. To run the code using just a single thread, the maxNumCompThreads command was used to specify the number of threads, as discussed in Section. The output of the code was captured separately in a text file, which was used to calculate the performance measures. For multiple executions a loop was introduced to encapsulate the whole code and allow an average to be calculated for reliable timings. Timing different sections of code separately for each programming model was inefficient, therefore a benchmark program was developed. This was only possible once the functionality of the individual implementations had been validated. The program considered all the possible cases for a programming model, for all problem sizes; the number of executions was specified explicitly. The output produced was the average timing and the standard deviation for a particular case. Three separate benchmark implementations were developed, considering implicit parallelism, the worker model for shared memory and the GPU memory model. To confirm that the serial implementation also exhibited the correct algorithmic behaviour, preliminary testing produced timings consistent with the theoretical running time of Matrix-Matrix Multiplication. The results are seen in Table 4.1, which show that as n doubles, the timings increase by approximately a factor of 8. This shows that the algorithm behaves according to the theoretical running time discussed in Section.

[Table 4.1: Preliminary results. Columns: N, Timing (s), Ratio.]

Shared Memory

For the shared memory model, the first implementation was implicit multithreading; as this is done implicitly, only the default MATLAB client was used during testing. For the worker model the existing code was built upon: the matrices were initialised using distributed arrays and a matlabpool was set up to handle the parallel computation. Using A = distributed.rand(n,n), an n-by-n matrix was initialised directly in worker memory, with the distribution of data handled implicitly. The implementation considered both the timing of the computation involved and also the cost of setting up the data, to take into account any overhead of data distribution and setup of the MATLAB workers.
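The worker-model approach just described, together with the GPU variants discussed below, follows the pattern sketched here. The pool size, matrix size and variable names are illustrative, and the timing placement is simplified; this is a sketch of the technique rather than the project's exact code.

    n = 4000;

    % Worker model: matrices initialised directly as distributed arrays.
    matlabpool open 4                   % pool of MATLAB workers (PCT syntax of this release)
    A = distributed.rand(n, n);
    B = distributed.rand(n, n);
    tic;  C = A*B;  tWorkers = toc;
    matlabpool close

    % GPU: either transfer host data or create the data on the device directly.
    GA = gpuArray(rand(n));             % created on the CPU, then transferred
    GB = gpuArray.rand(n, n);           % or created directly on the GPU
    tic;  GC = GA*GB;  wait(gpuDevice);  tGPU = toc;   % wait so the asynchronous GPU work
                                                       % finishes before the timer stops
    Chost = gather(GC);                 % optional: return the result to the CPU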

MATLAB implicitly distributes data when using distributed arrays; if co-distributed arrays are used, the distribution scheme can be specified explicitly, as discussed in Section. Another possible implementation for the worker model was to use the matlabpool, but with an spmd block and co-distributed arrays. Using co-distributed arrays without any explicit distribution scheme specified automatically distributes the data across the workers as evenly as possible. Due to time constraints this route could not be explored in great detail.

GPU Computing

Implementing the algorithm on the GPU was relatively simple; the existing serial code was built upon again. There were a few cases to consider. The first implementation initialised the matrices on the CPU using the A = rand(n,n) command and then transferred them to the GPU with the GA = gpuArray(A) command; the multiplication operation then used the gpuArrays for the calculation. With some functions already predefined for the GPU within MATLAB, it was possible to initialise the matrices directly on the GPU using the GA = gpuArray.rand(n,n) command and avoid the data transfer cost. For the performance measurements all the cases were considered: transferring data from the CPU to the GPU for computation; measuring the computation only, with no data transfer involved; and the cost of returning the computed results back to the CPU.

Functionality Testing

Using small matrices, for which the matrix product could be calculated by hand, was an appropriate way to check that the code was producing the correct results and that the timings were close to theoretical expectations. This was done using two matrices with explicitly specified values for which the result was already known. MATLAB also provides an interactive parallel command window known as pmode, explained in Section 2.5.4, for the worker model. Using the interactive window it was possible to see the workspace of each worker connected to the current job, which gave an understanding of how the distribution schemes split the data and whether the final result was still valid. There were no major concerns with the functionality of the algorithm or the implementations. However, testing highlighted issues with regard to the measurements. The key issue was memory, which restricted the problem sizes and is discussed in greater detail in Section 5, along with the distinction between distributed and co-distributed arrays, discussed in Section.

Benchmark Testing

For the Matrix-Matrix Multiplication, problem sizes up to the largest possible were used. It was important to understand the effect of the memory limits on the different models; larger matrices could be used for the worker model, whereas for the GPU the memory limit was much smaller. Each implementation was executed many times to produce accurate timings.

There were issues with the GPU: the first execution would take longer than the others when running multiple tests. This is because the first invocation of the GPU always carries significant overheads [38], so this was taken into account when considering multiple executions and averages. Preliminary testing allowed the number of executions to be determined beforehand. The standard deviation was used to measure the accuracy of the timings. A greater level of variation was seen for smaller problem sizes and less for the larger problem sizes. As long as the variation was relatively small in comparison to the average timing, typically around 1%, it was accepted. The serial implementation ran on a single thread, measuring both the data setup and the computation costs. The same implementation was then run using implicit parallelism, that is, with multithreading. The worker model was run using 1, 2, 4 and 8 workers on workstation cslin040, with the 8-worker case used to evaluate core sharing; workstations cslin049 and cslin146 were also used with 1, 2, 4 and 8 workers to provide a better measure of scalability. For the GPU memory, both data transfer between the CPU and GPU and the computation alone were considered. Results are presented and discussed in Section. Tables of results for all tests, which include average times and relative standard deviations, are seen in Section D.

4.2 The Mandelbrot Set

The complexity of a Mandelbrot Set implementation can vary; due to the time available, a simple implementation was taken from [36]. The code was set up such that the grid for the Mandelbrot calculation was initialised with the spatial limits and the number of iterations for which the algorithm would iterate at each grid location. The grid remained the same for all the implementations. With some complex functions being used as part of the algorithm, the GPU and shared memory models had to be implemented with care to ensure correct functionality.

Implementation

For the Mandelbrot set all planned implementations were possible, allowing the use of GPU memory and shared memory with both the worker model and implicit parallelism. The benchmark program made testing of all cases a seamless process, and using the same setup for all the implementations kept the comparisons fair. The section of code which plots the output of the Mandelbrot Set algorithm was used for the initial preliminary tests but removed from the benchmark program. This was because producing the output took a considerable amount of time, and with multiple executions the experiments would not have been feasible as the program would have taken much longer to run. Also, for the GPU memory, producing graphical output would require transferring data back to the CPU before plotting, which would slow down the code further.

Serial

The initial implementation measured serial performance; this was done by running MATLAB with a single thread, as discussed in Section. The algorithm and its setup were encapsulated within a function so that it could be run directly from the MATLAB command window. The only variable was the grid size, which was specified in the function call from the command window. After preliminary testing, a benchmark program was developed which executed the code both multiple times and for the different cases. To ensure the serial code not only produced the correct results but also exhibited the correct algorithmic behaviour, preliminary results were compared to the theoretically expected timings. The results are seen in Table 4.2, which show that as the grid size doubles, the timings increase by approximately a factor of 4. The algorithmic behaviour is therefore correct, as discussed in Section.

[Table 4.2: Preliminary results. Columns: Grid Size, Timing (s), Ratio.]

Shared Memory

For the shared memory, the serial implementation was run using the default MATLAB client, where MATLAB threads the program implicitly where appropriate, providing the default parallelism discussed in Section. For the worker model, the existing code was used and developed further. The initialisation of the grid remained the same, however the algorithmic setup was modified. The algorithm uses the functions linspace (which generates linearly spaced vectors), logspace (which generates logarithmically spaced vectors) and meshgrid (which constructs rectangular grids in 2-D and 3-D space); these are compatible with co-distributed arrays for the worker model. Using distributed arrays was not possible with the Mandelbrot set because these functions are not compatible with them. Explicit distribution schemes could have been specified, however this was not relevant because the calculation had been vectorised such that every location is updated at once [36]. The only other change required to the code was setting up the matlabpool and the spmd block for the co-distributed arrays to function correctly.
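The serial, vectorised calculation described above follows the pattern of the sketch below. The spatial limits, grid size and iteration count are illustrative assumptions in the spirit of the cited example [36], not the exact values used in this project.

    maxIterations = 500;
    gridSize = 1000;
    xlim = [-2.0, 1.0];  ylim = [-1.5, 1.5];    % illustrative spatial limits
    x = linspace(xlim(1), xlim(2), gridSize);
    y = linspace(ylim(1), ylim(2), gridSize);
    [xGrid, yGrid] = meshgrid(x, y);
    z0    = xGrid + 1i*yGrid;                   % starting value for every pixel
    z     = z0;
    count = ones(size(z0));                     % per-pixel iteration count
    for k = 1:maxIterations
        z = z.*z + z0;                          % every grid location updated at once
        count = count + (abs(z) <= 2);          % keep counting until the point escapes
    end
    count = log(count);                         % scaled count used when plotting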

38 GPU Computing Implementing the algorithm for a GPU involved a similar structure to the worker model. The gpuarray class also provides functions that can be used to create data arrays. The serial code was modified and the data initialised directly on the GPU. Knowing the overhead associated with data transfer it was decided that the data would be directly initialised on the GPU and not transferred to the GPU for the computation. Bringing the data back to the CPU after computation was not necessary for the benchmarking implementation as the result was not to be plotted, however for the preliminary testing the results were plotted to ensure correctness. With these changes to the initialisation of the data the calculations were performed on the GPU. The algorithm was operating on every element of the grid identically. MATLAB has a function arrayfun, discussed in Section which applies a function to each element of an array. This function has also been enabled for GPU programming by MATLAB. The actual computation of the Mandelbrot set was then encapsulated in a separate function, which was then called using the arrayfun function within the original algorithm after data initialisation. This function uses no predefined features of the GPU and is basic MATLAB code, however when called within a GPU-based algorithm it operates on the GPU. The grid size for the Mandelbrot set was the only variable left to be defined within the code. Another possible implementation was to work with C/C++ and run written CUDA kernels using MATLAB data, as discussed in Section This implementation was not possible due to the time constraints Functionality Testing For the Mandelbrot set the output produced was a useful tool to check that the algorithm was behaving correctly. To test for this correctness, each implementation was tested individually to ensure the output was identical. For benchmarking, the plots would not be produced hence this had to be validated beforehand. For increasing grid size, better resolution and more sharpness was expected. There were no issues with any of the implementations. The memory issue had come to light earlier for the Matrix-Matrix Multiplication as discussed in Section Such an issue was then avoided, however for larger problems on the GPU multiple executions using a for-loop to handle the function was not possible. Therefore in such situations the results for each individual execution were appended to a text file, this meant that the implementation was executed multiple times by repeated function calls from the command window. 31
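The arrayfun-based GPU variant described above moves the per-pixel iteration into its own function file and applies it element-wise to gpuArray inputs. A minimal sketch is given below; the function name mandelElement, the limits and the sizes are illustrative assumptions, and the per-element function is assumed to be saved as its own file on the MATLAB path.

    % Calling code: data created directly on the GPU, computation via arrayfun.
    maxIterations = 500;  gridSize = 1000;
    xlim = [-2.0, 1.0];  ylim = [-1.5, 1.5];
    x = gpuArray.linspace(xlim(1), xlim(2), gridSize);
    y = gpuArray.linspace(ylim(1), ylim(2), gridSize);
    [xGrid, yGrid] = meshgrid(x, y);
    count = arrayfun(@mandelElement, xGrid, yGrid, maxIterations);
    count = gather(count);              % only needed if the result is to be plotted

    % mandelElement.m - plain MATLAB code, compiled for the GPU when called
    % through arrayfun on gpuArray inputs (hypothetical file name).
    function count = mandelElement(x0, y0, maxIterations)
    z0 = complex(x0, y0);
    z  = z0;
    count = 1;
    while count <= maxIterations && abs(z) <= 2
        z = z*z + z0;
        count = count + 1;
    end
    count = log(count);
    end

Because the whole per-pixel function is compiled for the GPU, the escape-time loop runs in a single kernel invocation per call rather than one operation at a time.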

4.2.3 Benchmark Testing

Preliminary testing was used to determine the appropriate problem sizes to consider and the number of executions required per problem size for accurate timings. The memory limits were important to account for, hence the largest possible problem size was determined for both memory models and rounded to the nearest whole thousand. As before, the memory limits showed a similar pattern, but larger problems could be accommodated using the arrayfun function, which seemed to manage memory better; this is discussed later in Section. Multiple executions were used to calculate the mean and standard deviation to ensure that the results were accurate, with minimal variation. For smaller problem sizes there is more variation, due to the underlying scheduling of the operating system, and small computations can also carry a relatively large overhead associated with the setup of the workers or the GPU. As before, if the standard deviation was relatively small compared to the average timing, the results were accepted. For the GPU, larger problems suffered from memory issues; for these, benchmarking was set up without multiple executions.

4.3 The Jacobi Method

The Jacobi Method code was adapted from the MJ21 module [3]. The iterative method was chosen to establish an understanding of how the PCT works with sparse and dense matrices. The Heat Diffusion problem was also considered with respect to the Jacobi method and sparse matrices, with the code taken from the same source [3]. There were issues with the correctness of the algorithm for data initialised using random generation, however this was resolved and the algorithmic correctness validated. Both types of matrices were considered for all implementations.

Implementation

The Jacobi method was encapsulated within a function, which required the matrix A, the right-hand side b, the initial solution estimate u, the dimension of the system and the number of iterations as arguments. A separate function was required to initialise the linear system to be solved by the Jacobi method, from which a call to the iterative method was made. For the dense case the function initialised a random matrix A, a random right-hand side b and an initial solution estimate u of zeros. For the sparse case, the matrix A was initialised using the sprand function, which creates a sparse uniformly distributed random matrix with a specified density. Random generation of the matrix A did not guarantee a diagonally dominant matrix, a key requirement of the Jacobi Method, as discussed in Section. To ensure that the algorithm was functioning correctly, the error between the computed and exact answers for the system of equations was calculated. For all implementations the solver would use a fixed number of 100 iterations.
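A sketch of the set-up function just described is given below; the solver name jacobi and the chosen values are illustrative assumptions matching the interface described above, not the project's actual code.

    n = 2000;
    density = 0.4;                      % fraction of nonzeros, i.e. a 60% sparse matrix
    A = sprand(n, n, density);          % sparse uniformly distributed random matrix
    A = A + n*speye(n);                 % made diagonally dominant after generation
    b = rand(n, 1);
    u = zeros(n, 1);                    % initial solution estimate of zeros
    niters = 100;                       % fixed number of iterations
    u = jacobi(A, b, u, n, niters);     % call to the iterative solver described above

For the dense case the same driver applies with A = rand(n) in place of sprand.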

Serial

The Jacobi Method was already coded in serial; it was initially tested using a system of equations with known solutions, using a single-threaded MATLAB client. The serial implementation did not initially work for randomly generated data, whether dense or sparse: the code would run and produce results, but the correctness of those results was an issue. As discussed in Section, the error would be calculated between the exact and approximate values, and this should always steadily decrease; for the serial code this trend was not seen. The code for the Jacobi method was vectorised to make it more efficient, as discussed in Section. The vectorised code was then tested using the same test as the previous code; the computed answers matched a known solution, hence the code was correct. To ensure diagonal dominance, the randomly generated matrix A was modified such that the magnitude of each diagonal entry in a row was greater than the sum of the magnitudes of all the other, non-diagonal entries in that same row, as explained in Section. To further confirm the correctness of the algorithm, the error was computed and printed at each iteration, which showed a steady decrease as expected. To check that the serial implementation also exhibited the correct algorithmic behaviour, preliminary testing produced timings consistent with the theoretical running times expected for the Jacobi Method. The results seen in Tables 4.3 & 4.4 show that as the problem size grows, the complexity gets closer to O(n^2): for the larger problems the timings increase by almost a factor of 4 as n doubles. The algorithm is therefore behaving correctly, as discussed in Section.

For the Heat Diffusion problem the visual output was used to check the correctness of the implementation. The code used was already written in serial; it was modified to use the Jacobi Method to solve the system of equations and also to preallocate the sparse matrix, using the spalloc function. Preallocating space for the sparse matrix is much more efficient than dynamically allocating memory as the matrix is set up for the problem. This allowed efficient generation of the matrix, which had an average of at most 5 nonzero elements per column. This relates to the five-point stencil approximation used to solve partial differential equations, as each equation in the system to be solved here contains at most 5 unknowns [4].

[Table 4.3: Dense matrix. Columns: N, Timing (s), Ratio.]
[Table 4.4: Sparse matrix. Columns: N, Timing (s), Ratio.]
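The preallocated assembly described above can be sketched as follows for an m-by-m interior grid; the ordering of the unknowns and the treatment of the boundary are illustrative assumptions rather than the project's exact code.

    m = 50;                         % interior grid points in each direction
    N = m*m;                        % number of unknowns
    A = spalloc(N, N, 5*N);         % room for at most 5 nonzeros per row (five-point stencil)
    for j = 1:m
        for i = 1:m
            k = (j-1)*m + i;        % index of the unknown at grid point (i,j)
            A(k, k) = 4;
            if i > 1, A(k, k-1) = -1; end   % west neighbour
            if i < m, A(k, k+1) = -1; end   % east neighbour
            if j > 1, A(k, k-m) = -1; end   % south neighbour
            if j < m, A(k, k+m) = -1; end   % north neighbour
        end
    end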

Shared Memory

The serial implementation was first run using the default MATLAB client, which provides multithreading automatically as discussed in Section. This implicit parallelism worked without any issues for all the cases considered, with separate implementations covering the dense and sparse matrices. The serial code was then used and developed for the worker model. The data was initialised using distributed arrays within a matlabpool of workers, and was distributed implicitly, as equally as possible, using the implicit distribution scheme of distributed arrays discussed in Section. Time constraints did not allow co-distributed arrays to be considered for the Jacobi Method, hence using explicit distribution schemes was not possible. For the Heat Diffusion implementation, the worker model was implemented using co-distributed arrays, which were used to initialise the matrix and the linear system of equations. Using distributed arrays was not possible here, because some of the constructs used to define the Heat Diffusion problem are not compatible with them. The modifications required to set up the Heat Diffusion matrix were made to the co-distributed arrays. For the visual output of the Heat Diffusion problem, which illustrates the distribution of heat on a grid, the computed solutions had to be gathered from the co-distributed arrays. Otherwise there were no errors or problems with the implementations.

GPU Computing

GPU computing with MATLAB requires data to be full rather than sparse, as discussed in Section, hence considering sparse data for the Jacobi Method and the Heat Diffusion problem was not possible. The implementation of the dense case on the GPU used the serial code for its structure. The algorithm was unchanged, however the initialisation of the data required modifying. The data was not initialised directly on the GPU (for example with gpuArray.rand), because after its random generation the matrix required modification for diagonal dominance. The matrix A was therefore initialised on the CPU and transferred to the GPU once, before the iteration, so that the solver itself required no further CPU-GPU communication; the overhead involved is explained in Section 5.4. The computation for the Jacobi Method was then executed on the GPU. Separate implementations were developed to measure the data initialisation cost and the computation cost, which involved modifying the position of the tic & toc functions within the code. Considering the size of the problem, bringing the computed result back to the CPU was not necessary, as no further computations or outputs were required.

Functionality Testing

The code initially ran without any errors, so it was important to ensure that the algorithmic behaviour was correct. The code was set up to output the error at each iteration between the exact answer and the approximate values, with the exact answer calculated using Exact = A\b.

The error should always decrease after each iteration; for larger problem sizes this decrease is expected to be very slow. Initially the error did not behave as expected and showed many oscillations. Print statements were used to output the error at each iteration. A key property of the Jacobi method is that it requires a diagonally dominant matrix, and generating the system of equations randomly does not guarantee this, so diagonal dominance was ensured explicitly: once the matrix had been initialised it was then made diagonally dominant. Further testing showed that the algorithm was then behaving correctly. Setting up the dense matrix was simple; the sparse matrix had to be set up before it was made diagonally dominant, so a sparse matrix was first created for a given density (the fraction of nonzero elements in the matrix). Sparse and dense matrices were both considered for the worker model and implicit parallelism, however for the GPU only dense matrices were possible. For the Heat Diffusion problem, where a particular sparse matrix is used, the visual output and the computed solution were used to test functionality; these should remain the same for a given problem size because the problem is solved for a specific sparse matrix. The serial algorithm from MJ21 [3] was known to run correctly, therefore its solutions and output for different problem sizes were used to validate the parallel implementations.

Benchmark Testing

During functionality testing, the problem sizes to consider and the number of executions required for accuracy were determined. For both the shared and GPU memory models the memory limits were accounted for, and a fair sample was required to gain a clear understanding of the performance involved. The error at each iteration was not required for benchmarking; it was deemed sufficient that the algorithm had functioned correctly during the functionality testing, and outputting the error at each iteration during benchmarking would not have been efficient. However, the data was initialised once, so for the different implementations within the program the converged answer from the Jacobi method was expected to be the same. For the Heat Diffusion problem a benchmark program was used to run the code for different problem sizes, each for an increasing number of workers; the visual output was removed to make the benchmarking more efficient. Separate benchmark programs were developed for the different programming models and the problems being considered. The statistical tests discussed in Section 3.7 were used to check the accuracy of the timings, with the number of executions determined during functionality testing.
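For reference, the dense GPU variant described in Section 4.3 follows the pattern below: the system is built and made diagonally dominant on the CPU, transferred to the GPU once, and iterated there. The sizes and names are illustrative assumptions, not the project's actual code.

    n = 4000;
    A = rand(n);  A = A + n*eye(n);     % diagonal dominance enforced on the CPU
    b = rand(n, 1);
    GA = gpuArray(A);  Gb = gpuArray(b);% single transfer to the GPU before iterating
    Gx = gpuArray.zeros(n, 1);          % initial guess of zeros, created on the GPU
    GDinv = 1 ./ diag(GA);
    for k = 1:100
        Gx = Gx + GDinv .* (Gb - GA*Gx);% same vectorised update as the CPU code
    end
    x = gather(Gx);                     % only if the solution is needed back on the CPU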

Chapter 5

Evaluation

The discussions which follow form an analysis of the Parallel Computing Toolbox (PCT) with respect to the different memory models and programming methods used. It is important to note that the use of execution time, speedup and efficiency graphs varies throughout; the graphs shown are those which illustrate an appropriate aspect of the data. The tabulated data with standard deviations for the graphs can be found in Appendix D. The tests were performed, where possible, using all 3 environments described previously in Section 3.3. Initially tests were performed only on workstations cslin040 & cslin049; later tests were performed on the high-performance research workstation cslin146 for comparison with the former workstations.

5.1 Terminology

Throughout the evaluation and in the legends of the graphs some of the terminology used may seem ambiguous, so more detailed descriptions of the terms are given below.

Computation Only - The time taken to complete the computation within the algorithm or function. The computation time is expected to give the best performance because no data transfer is considered, hence no data overhead is accounted for.

Data Created - The measurement which accounts for all data generation within the algorithm or function; this is data created directly on the CPU or on the GPU. It is assumed that there is minimal overhead when data is created directly.

Data Transfer - The measurement which accounts for data transfer within the algorithm or function; this is data created on the CPU or GPU and then transferred to or from the GPU or CPU respectively. Some communication overhead is expected when transferring the data, hence the performance can suffer.
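For reference, the speedup and efficiency figures quoted below follow the usual definitions, with the single-thread or single-worker time as the serial baseline as stated in Section 3.8. A minimal illustration with assumed timings:

    T1 = 120.0;   Tp = 35.0;   p = 4;   % illustrative timings: serial baseline and p workers
    S  = T1 / Tp;                       % speedup
    E  = S / p;                         % efficiency, plotted as S(p)/p in the figures below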

5.2 Matrix-Matrix Multiplication

Matrix-Matrix Multiplication was used for preliminary testing and to gain a better understanding of the PCT. It has provided valuable insight into the potential of the PCT and laid the foundation for the other implementations. It has also highlighted key issues in development, and helped evaluate the performance of the PCT for the purposes of this project.

Evaluating Results

The parallel methods tested for the Matrix-Matrix Multiplication on workstation cslin040 are seen in Figure 5.1a. The tests show that using the GPU memory can be advantageous, giving a significant performance increase provided that the problem size is within the memory limits. Figure 5.1b shows that, as expected, for small problem sizes the GPU is outperformed by the CPU for both the single-threaded and multithreaded approaches. The initial launch of the GPU kernel required for the code to run carries an overhead which is significant for small data sets, resulting in poor performance. Figure 5.1a shows that for larger problem sizes the GPU provides better performance, because the penalty of the overhead decreases relative to the speedup in the computation [38]. This supports the discussion in Section that the GPU is well suited to problems which are data intensive and massively parallel. For the largest problem on the GPU, where n = 6000, approximately 73x speedup is achieved relative to using a single thread, with the execution time decreasing from seconds to seconds. Relative to implicit parallelism, the GPU provides a speedup of 23x, with the execution time reduced from seconds to seconds. It is expected that for larger problems the GPU would continue to provide the best parallel performance for this algorithm. The results also show that the GPU is restricted by its memory limit and therefore cannot accommodate problems as large as those handled by shared memory. The only compatible GPU available was on workstation cslin040, preventing experiments with other GPUs. Figure 5.1a shows that increasing the number of workers for a fixed problem size provides better performance. The tests have shown that using 4 workers exhibits performance similar to default MATLAB, which provides implicit parallelism using multithreading; similar behaviour is seen when comparing a single worker and a single thread. The speedup achieved using 4 workers rather than implicit parallelism when n = is approximately 0.98x, which shows that implicit parallelism is actually slightly better than the worker model. Considering just the worker model, for n = 17000, utilising 4 workers provides a speedup of 3.6x relative to a single worker, reducing the execution time from seconds to seconds. No benefit is seen from using the worker model relative to implicit parallelism, because essentially each MATLAB worker runs using a single thread. Workstation cslin040 has 4 cores available; the worker model on shared memory therefore utilises 4 workers (one worker per core) and implicit parallelism utilises 4 threads (one thread per core), hence providing no difference in performance [27].

It also shows that the worker model on shared memory is restricted by the memory of the machine. Figure 5.1b demonstrates the performance for smaller problem sizes. It is seen that implicit parallelism is actually slower than a single thread of execution. This behaviour occurs because the overhead associated with setting up multithreading is greater than the computation involved [4]; it is only when more computational work is involved, with the larger problems, that performance improves. Setting up the worker model carries the expense of copying data to and from the workers, making it unsuitable for the smaller problems.

[Figure 5.1: Execution times against different problem sizes for different parallel methods; (a) larger problems, (b) smaller problems. Series: GPU, multithreaded, single-threaded, 1 worker, 2 workers, 4 workers.]

Another experiment considered how the overheads associated with GPU memory affect performance. The experiment considered several cases: creating the data on the CPU and transferring it to the GPU, creating the data directly on the GPU, transferring the results back from the GPU to the CPU, and considering just the computation with no data transfer. Results are shown in Figure 5.2.

[Figure 5.2: Using GPU computing for better performance. Series: data initialised on GPU, data transferred to GPU, GPU computation only, data gathered from GPU.]

Initialising the data directly on the GPU is seen to be the best method of utilising the GPU for practical problem sizes. Transferring the data from the MATLAB workspace to the GPU when n = 5000 is 3x slower than initialising the data on the GPU directly. Transferring the computed result back degrades the overall performance such that it becomes 80x slower, increasing the execution time from seconds to seconds. This is because initialising data directly on the GPU limits the overhead associated with data transfer, which can be significant enough to reduce overall performance [40]. For smaller problem sizes, transferring data to the GPU is better than initialising the data on the GPU directly, because the overhead of transferring the data between main memory and the GPU over the PCI-E bus is less significant than launching the GPU kernel needed to initialise the data directly [40, 38]. Considering just the cost of the computation gives the best performance; however, it is important to take the communication overheads into account when comparing against the CPU [8]. The worst performance is seen for the implementation where computed results are transferred back to the CPU from the GPU, because transferring a full computed array back to the CPU is expensive and the overhead associated with such a transfer is significant. The worker model implementation for shared memory showed that for problems of fixed size, performance gains could be achieved for the Matrix-Matrix Multiplication by using more workers (utilising more cores).

The efficiency graph seen in Figure 5.3a shows the results for workstation cslin040. It demonstrates that as the number of workers increases, efficiency is maintained for the larger problem sizes: when n = 16000, 3% efficiency is lost utilising 2 workers and 10% is lost utilising 4 workers. Utilising more workers shows a greater loss in efficiency because, on a machine with 4 cores, the full potential of all 4 cores cannot be utilised [4, 27]. Smaller problems lose efficiency quite significantly as more workers are used: for problem sizes where n is greater than 2000 at least 80% efficiency is maintained, but anything smaller does not scale very well. The GPU also outperforms the worker model; the results show that it is 76 times faster than a single worker and 22 times faster than using 4 workers. The speedup achieved is similar to that achieved relative to a single thread and multithreading, again highlighting that there is no benefit in using the worker model for shared memory relative to multithreading. The PCT actually allows up to 12 workers per machine, however when the number of workers is greater than the number of cores available the performance diminishes, as the workers have to share the available cores. Testing was done with 8 workers to show how the performance was affected by core sharing; the performance of Matrix-Matrix Multiplication for this case can also be seen in Figure 5.3a. The graph shows that as soon as more than 4 workers are utilised there is a further loss in efficiency, regardless of the problem size. When n = 16000 a loss of almost 50% efficiency is evident, and for the other large problems the efficiency loss is greater than 50%. Such performance is seen because utilising more workers than the cores available essentially leads to core sharing: for workstation cslin040, which has 4 available cores, utilising 8 workers essentially sets up 2 workers per core, so the resources of a single core are shared by 2 workers. The worker model was also considered in the other computing environments to evaluate the scaling. Figure 5.3b shows the efficiency for workstation cslin049. This workstation uses hyper-threading technology, as discussed in Section 2.4, and shows a similar loss in efficiency to that seen for workstation cslin040; this is due to the issue of core sharing. Hyper-threading provides additional logical cores, but no advantage is gained from this by the PCT. Figure 5.3c demonstrates the efficiency for workstation cslin146. This workstation has 16 cores, allowing the PCT to utilise 4 workers with full computational resources and showing better scalability overall. It was also possible to use 8 workers, to compare how the performance would scale if more workers were available. It is seen that for the larger problems efficiency is not lost as soon. For workstation cslin146, where n = , the efficiency lost by utilising 8 workers is 24%, relative to workstation cslin040 where the efficiency lost for the same problem size is 55%. The larger problem sizes take advantage of the computational resources available; problems where n is greater than 4000 exhibit much better scaling. Overall, for problem sizes smaller than n = 2000, efficiency degrades by approximately 75% when utilising 4 workers, regardless of the workstation being used. From the performance exhibited, it is expected that larger problems would scale much better when utilising more workers, provided the resources are available.

[Figure 5.3: Efficiency, S(p)/p, against the number of workers for different problem sizes; (a) cslin040, (b) cslin049, (c) cslin146.]

5.2.2 Discussion

For Matrix-Matrix Multiplication, performance benefits are seen with all the parallel methods considered, provided the problem is large enough. If the problem size is within the memory limit of the GPU and no further computations requiring the GPU result are needed on the CPU, GPU parallelism is ideal. For larger problems the worker model on shared memory is appropriate, as it accommodates larger data sets in main memory; however, it is important to take into account that implicit parallelism has shown almost identical performance to the worker model. The shared memory model (implicit parallelism and the worker model) suffers from memory limits imposed by the hardware, but that is an indication of the need for the distributed memory model. If the Distributed Computing Server (DCS) had been available, the same tests would have been scaled up further to clusters of computers utilising many more workers. This would also have allowed a fair comparison of how distributed memory performs relative to shared memory, considering the implicit parallelism MATLAB provides by default. As has been discussed, there is no single way to implement parallelism for such problems; each approach carries its own benefits. Implicit parallelism can only utilise the computational resources available on the current machine, hence for even larger problems, or for increased performance, the distributed memory model would be appropriate. If the DCS is not available, then it is more than likely that implicit parallelism is sufficient, unless explicit parallelism is to be used to improve upon the current parallel performance. Handling the underlying communication and distribution of the data explicitly, using the worker model, can provide further parallel benefits; this was not possible here due to time constraints.

5.3 The Mandelbrot Set

The Mandelbrot Set was implemented as an extension. It has provided further evidence that the GPU has the potential to deliver parallel performance. In addition it has highlighted how different methods of using the GPU can affect the performance, and it has also helped evaluate the PCT in greater detail.

Evaluating Results

The Mandelbrot Set was implemented using both GPU and shared memory on workstation cslin040; the performance can be seen in Figure 5.4. The tests show that using the GPU provides better performance, as previously seen in Section. Figure 5.4 shows that for large problems the GPU provides the most speedup relative to a single thread. Two different GPU implementations were possible, as discussed in Section.

Figure 5.4a shows that the naive GPU implementation, using gpuArray, achieves a speedup of 2x when n = 3000, relative to a single thread. However, it is slower than implicit parallelism with multithreading, increasing the execution time from seconds to seconds. Using the arrayfun method, which is specific to element-wise operations, a significant increase in performance is exhibited relative to both the gpuArray approach and the CPU. For the same problem size, arrayfun is 28 times faster than using a single thread, 12 times faster than implicit parallelism and 13 times faster than using a gpuArray. The arrayfun method does not suffer as badly from the memory limit of the GPU as the gpuArray function; in fact it can accommodate problems of almost twice the size [8]. For the largest problem on the GPU using arrayfun, where n = 6000, a speedup of almost 30x is achieved relative to a single thread, reducing the execution time from seconds to seconds, and it is 12x faster than implicit parallelism. Comparing the two GPU implementations, using element-wise operations on the GPU gives better performance because the Mandelbrot Set is a pixel-by-pixel calculation, as explained in Section. With the gpuArray approach many individual invocations are made to carry out the calculation, whereas arrayfun compiles the entire function for the GPU, where it is evaluated in a single invocation, significantly reducing the overhead and providing speedup [36]. Very small problems, which did not seem practical, were investigated only because it had been learnt earlier in Section 5.2.1 that speedup with the GPU is achieved when a large amount of data is used; each GPU core is individually slower than a CPU core, so for parallel processing a large number of the cores must be utilised [38]. Figure 5.4c shows the performance for smaller problems with the different parallel methods. Considering implicit parallelism, performance is seen to improve relative to a single thread; results are shown in Figure 5.4b. For the large problem where n = 12000, the execution time is reduced from seconds to seconds, making it twice as fast. Implicit parallelism creates 4 threads and theoretically would be expected to provide a 4 times speedup, however this is not seen because the threads are unable to fully utilise the 4 available cores. It is expected that using threads or workers should provide similar performance, as previously seen in Section. However, utilising 1 worker is much slower than a single thread, increasing the execution time from seconds to seconds. Multithreading is also faster than the use of 4 workers, providing a speedup of 1.6 times. The explicit parallelism provided by the workers cannot compete with single- or multi-threaded execution because of the cost associated with setting up the workers; the communication overhead of setting up the workers and copying the data is significantly large, hence no benefit is seen from using the worker model over implicit parallelism. As the GPU using the arrayfun method was significantly faster than both a single thread and implicit parallelism, the same behaviour is exhibited relative to the worker model: the GPU is 40 times faster than using a single worker, reducing the execution time from seconds to seconds, and the speedup achieved relative to using 4 workers decreases the execution time from seconds to seconds, making it 22 times faster.

Using the worker model for shared memory showed that utilising more cores did not provide perfect efficiency for problems of fixed size, however benefits were seen. The efficiency graph in Figure 5.5a shows the results for workstation cslin040. For the large problems, where the grid size is greater than 1000, 83% efficiency is exhibited when going from a single worker to 2 workers; utilising 4 workers, however, gives an efficiency of about 50%. Considering the largest problem, where the grid size is 12000, using 2 workers is 1.68 times faster than the single worker. Using 4 workers would be expected to be twice as fast as 2 workers, however a speedup of only 1.07 is achieved. Such performance is seen because increasing the number of workers eventually diminishes the benefit of parallelism, due to the communication overhead associated with the workers [9]. In addition, similarly to threads, the workers are unable to utilise the 4 available cores fully because of the other resource requirements of the machine. The worker model was also considered on the other computing environments. Figure 5.5b shows the efficiency achieved using workstation cslin049, which provides hyper-threading technology as explained in Section 2.4. Utilising the hyper-threaded cores essentially creates 2 logical threads per core, aiming to utilise the available resources more efficiently. As seen before for Matrix-Matrix Multiplication in Section 5.2.1, hyper-threading does not provide any benefit to the PCT: the results are similar to the efficiency observed for cslin040, shown in Figure 5.5a. It was expected that with hyper-threading better scaling would be achieved for the problems considered, however the results show this was not the case. Figure 5.5c shows the efficiency for cslin146, where the scalability is much better. For the largest problem, 99% efficiency is achieved when utilising 2 workers and 86% for 4 workers, relative to the 85% and 46% efficiency maintained respectively on cslin040. This shows that better use of computational resources provides better performance. Overall, considering just 4 workers, at least 80% efficiency is maintained for the large problem sizes. For the smallest problem size there is still no benefit, as expected, because the worker model pays off only when there is enough computational work involved. Figure 5.5c also highlights that using 8 workers provides some benefit, with at least 60% efficiency maintained for large problems, relative to the 30% observed for cslin040 in Figure 5.5a. This shows that, provided the computational resources are available, using more workers with the worker model can provide good performance. With the DCS not available and that route unexplorable, it is difficult to say how utilising many more workers on clusters of machines would affect performance and how it would scale. It is clear that there is an improvement in performance using the worker model, so it is likely that the worker model for distributed memory would provide scalable results on larger problems.

[Figure 5.4: Execution times against different problem sizes for different parallel methods; (a) larger problems for GPU computing, (b) larger problems for the worker model, (c) smaller problems. Series: 1 worker, 2 workers, 4 workers, multithreaded, single-threaded, GPUFun, GPUArray.]

[Figure 5.5: Efficiency, S(p)/p, against the number of workers for different grid sizes; (a) cslin040, (b) cslin049, (c) cslin146.]

5.3.2 Discussion

For the Mandelbrot Set the use of the PCT has, in general, shown an increase in performance. The results have shown that for problem sizes within the memory limits of the GPU, using arrayfun provided the best performance. They have also highlighted that using the GPU does not by itself always provide advantageous performance; it is only when it is utilised correctly that performance gains are seen. It is also important to note that, once again, the worker model does not provide better performance than implicit parallelism. Time constraints have not allowed low-level explicit parallelism to be investigated, so it is difficult to comment on how the worker model may behave with such explicit parallelism. However, it is expected that implicit parallelism will generally always beat explicit parallelism using MATLAB workers (on a single machine), for the simple reason that the explicit parallelism copies the data being used to and from the workers for processing. It is only when there are very large amounts of work involved, which implicit parallelism is unable to handle, that explicit parallelism with the workers can do better. On machines with more resources available, implicit parallelism is expected to take advantage by creating the number of threads accordingly, as discussed in Section.

5.4 The Jacobi Method

The Jacobi Method was used as a proof-of-concept for iterative algorithms and is perhaps not the most appropriate algorithm with which to consider the ideas of dense and sparse matrices. However, it has provided enough understanding to predict the behaviour of other algorithms with regard to such matrices.

Evaluating Results

There is no support yet for sparse matrix operations on the GPU, hence implementing the basic Jacobi Method using sparse matrices was not possible without low-level programming. It is also not yet clear whether MATLAB intend to provide such support in the near future, as GPU computing with MATLAB is still a fairly recent development. Not all functions and methods are possible on the GPU, which shows that GPU computing with MATLAB has a long way to go to become a general purpose tool. Considering dense matrices, the results seen in Figure 5.6 show that the GPU does not provide any performance benefit relative to using a single thread or multithreading. The nature of the algorithm requires multiple calls to the GPU memory, and such frequent memory access can carry a significant overhead, hurting the overall performance [38]. For dense matrices at the largest problem size, the GPU is almost 63 times slower than implicit parallelism. However, the benefit of implicit parallelism relative to a single thread is also minimal, reducing the execution time from seconds to seconds, as seen in Figure 5.6. The Heat Diffusion problem also uses a sparse matrix, therefore a GPU implementation was not possible.

5.4 The Jacobi Method

The Jacobi Method was used as a proof-of-concept for iterative algorithms and is perhaps not the most appropriate algorithm with which to consider the ideas of dense and sparse matrices. However, it has provided enough understanding to predict the behaviour of other algorithms with regard to such matrices.

5.4.1 Evaluating Results

There is no support yet for sparse matrix operations on the GPU, hence implementing the basic Jacobi Method using sparse matrices was not possible without low-level programming. It is also not yet clear whether MATLAB intends to provide such support in the near future, as GPU computing with MATLAB is still a fairly recent development. Not all functions and methods are possible on the GPU, which shows that GPU computing with MATLAB has a long way to go to become a general-purpose tool.

Considering dense matrices, the results in Figure 5.6 show that the GPU does not provide any performance benefit relative to using a single thread or multithreading. The nature of the algorithm requires multiple calls to the GPU memory, and such frequent memory access carries a significant overhead, hurting the overall performance [38]. For the largest dense problem size, the GPU is almost 63 times slower than implicit parallelism. However, the benefit of implicit parallelism over a single thread is also minimal, as seen in Figure 5.6. The Heat Diffusion problem also uses a sparse matrix, therefore a GPU implementation was not possible. Figure 5.7 demonstrates the behaviour of sparse and dense matrices in MATLAB: fully dense or fully sparse matrices perform better than matrices which are only partially dense or sparse.

Figure 5.6: Using GPU computing (execution time in seconds against problem size n, for single-thread, multithreaded and GPU runs).
Figure 5.7: Using different matrices (execution time in seconds against problem size n, for dense, 60% sparse and 90% sparse matrices).

Table 5.1 shows the performance of implicit parallelism for both sparse and dense matrices. For sparse matrices, implicit parallelism does not provide any benefit relative to using a single thread; the timings are almost identical. For dense matrices the picture is similar: for the largest problem considered, implicit parallelism provides only minimal speedup over a single thread. This is not a surprise, because MATLAB has not yet enabled all of its functions and operations for implicit parallelism, hence threads will only be created where MATLAB deems it possible [37].

Table 5.1: Timings for different matrices on cslin040 (single-thread and multithreaded times in seconds for dense, 60% sparse and 90% sparse matrices of size N).

The worker model for shared memory was able to handle both sparse and dense matrices; the results are seen in Tables 5.2 & 5.3. There is minimal improvement in performance for dense matrices: for the largest problem size the execution time is reduced only slightly by utilising 2 workers. Utilising 4 workers actually performs poorly relative to 2 workers and provides hardly any speedup relative to 1 worker. Using more workers than cores exhibited core sharing and resulted in poor performance.
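For reference, the kind of iteration being timed in these experiments can be written in a few lines of MATLAB. The sketch below is a minimal vectorised Jacobi solver and is illustrative only; the project's actual implementation, stopping test and tolerances may differ.

function x = jacobiSketch(A, b, tol, maxIter)
% Minimal vectorised Jacobi iteration (illustrative sketch only).
    n = size(A, 1);
    d = full(diag(A));                   % diagonal entries of A
    R = A - spdiags(d, 0, n, n);         % off-diagonal part (stays sparse if A is sparse)
    x = zeros(n, 1);
    for k = 1:maxIter
        xNew = (b - R*x) ./ d;           % simultaneous update of every unknown
        if norm(xNew - x, inf) < tol
            x = xNew;
            return
        end
        x = xNew;
    end
end

Because every operation here is a standard matrix-vector product or element-wise division, the same code runs unchanged for dense and sparse A, which is what makes it a convenient test case for comparing the two storage formats.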

Considering sparse matrices, matrices of different sparsity were used. The experiments showed that for matrices of 60% sparsity there is no benefit in using the worker model. For matrices of 90% sparsity a similar pattern is observed as the number of workers increases; however, the overall execution times are almost twice as fast relative to matrices which are 60% sparse. The results are shown in Appendix D, Section D.4, and follow a similar trend to that observed for implicit parallelism. In comparison, the worker model is slower for all types of matrices because of the overhead of setting up the workers and copying the data to and from them, as discussed earlier. For the largest dense problem considered, using 4 workers actually increases the execution time relative to implicit parallelism, and the overhead is more significant for smaller problems.

Table 5.2: Timings for dense matrices on cslin040 (times in seconds for 1, 2, 4 and 8 workers against problem size N).
Table 5.3: Timings for 60% sparse matrices on cslin040 (times in seconds for 1, 2, 4 and 8 workers against problem size N).

The implementation was also considered on the other available environments, as discussed in Section 3.3. Tables 5.4 & 5.5 show the results for cslin049. As seen previously, hyper-threading has not provided any benefit to the PCT; the results are similar to those observed for cslin040 in Tables 5.2 & 5.3.

Table 5.4: Timings for dense matrices on cslin049 (times in seconds for 1, 2, 4 and 8 workers against problem size N).
Table 5.5: Timings for 60% sparse matrices on cslin049 (times in seconds for 1, 2, 4 and 8 workers against problem size N).

The results for cslin146, shown in Tables 5.6 & 5.7, were slightly better, although the overall timings were significantly slower due to the lower processor speed. For dense matrices, some benefit of going parallel is seen for the larger problem sizes: for the largest problem, utilising 2 workers provides a 1.6 times speedup over a single worker. However, when utilising 4 workers the parallel benefit diminishes, as only a further speedup of 1.17 is achieved by doubling the number of workers again. For sparse matrices, a small increase in performance is observed for the largest problem size when using 4 workers instead of 1, for a matrix of 60% sparsity. Theoretically it is expected that doubling the number of workers should halve the execution time; however, this is not the case for the worker model. The availability of more computational resource has therefore provided some benefit for the largest dense problems, but for sparse matrices there is not much advantage.

Table 5.6: Timings for dense matrices on cslin146 (times in seconds for 1, 2, 4 and 8 workers against problem size N).
Table 5.7: Timings for 60% sparse matrices on cslin146 (times in seconds for 1, 2, 4 and 8 workers against problem size N).
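The worker-model runs summarised in Tables 5.2 to 5.7 all follow the same basic pattern: open a pool, hand the workers their share of the work, then close the pool. The sketch below illustrates that pattern using the jacobiSketch function shown earlier; it is a simplified example, the pool size, loop count and problem size are placeholders, and on the 2012/2013 releases used here the pool is managed with matlabpool (newer releases use parpool instead).

matlabpool('open', 2);                    % start 2 workers (parpool(2) on newer releases)
n = 4000;
A = rand(n);                              % dense test matrix
b = rand(n, 1);
tic
parfor j = 1:8                            % independent solves farmed out to the workers;
    x = jacobiSketch(A, b, 1e-6, 200);    % A and b are copied to every worker
end
toc
matlabpool('close');

The copy of A and b to every worker is exactly the overhead referred to above: for problems that fit comfortably in shared memory, this transfer can outweigh the computation itself.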

Sparse matrices are handled by MATLAB automatically: the sparser the matrix, the faster typical matrix operations will be. However, when attempting to go parallel there is no benefit unless the matrix is extremely large with reasonable density. Results of using the worker model for matrices of 90% sparsity can be seen in Appendix D, Section D.4. For increasing problem sizes, using more workers makes performance worse; the decrease in performance is not radical, but it is still obvious. Due to time constraints the explicit distribution schemes of the co-distributed arrays were only briefly investigated and it was not possible to run any experiments. As seen in the TC32 [4] module, it is possible to exploit sparse matrices using MPI to achieve speedup; with the high-level parallelism considered for this project, however, this has not been the case. It is possible that using low-level parallel constructs with explicit distribution schemes may provide better performance.

For the Heat Diffusion problem, a particular sparse matrix is solved. Timings are shown in Tables 5.8 & 5.9. As observed when using the Jacobi Method with random sparse matrices, implicit parallelism provides no performance benefit, as Table 5.10 shows; for the largest problem size considered, implicit parallelism was slightly slower than a single thread. Using the worker model, no clear benefit is seen on either cslin040 or cslin146. The greater availability of resources on cslin146 has also failed to provide any significant benefit, even for the largest problem, and increasing the number of workers on either workstation gives poor timings. Workstation cslin049 was not considered, because it did not provide any advantage for random sparse matrices, as observed earlier in Table 5.4.

Table 5.8: Timings for the Heat Diffusion problem on cslin040 (times in seconds for 1, 2, 4 and 8 workers against problem size m).
Table 5.9: Timings for the Heat Diffusion problem on cslin146 (times in seconds for 1, 2, 4 and 8 workers against problem size m).
Table 5.10: Timings for the Heat Diffusion problem using implicit parallelism (single-thread and multithreaded times in seconds against problem size m).
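To illustrate why MATLAB's automatic sparse handling matters here, the sketch below builds the kind of banded system that a one-dimensional heat diffusion discretisation produces and compares its storage against a dense copy. It is an illustration only; the project's actual two-dimensional operator and boundary handling are not reproduced.

m = 129;                                 % problem size, matching the largest M used above
e = ones(m, 1);
A = spdiags([-e 2*e -e], -1:1, m, m);    % tridiagonal Laplacian, stored sparse
Afull = full(A);                         % the same matrix stored dense
fprintf('nonzeros: %d of %d entries\n', nnz(A), numel(A));
whos A Afull                             % sparse storage is a small fraction of the dense storage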

Figures 5.8 & 5.9 show the efficiencies achieved for both workstations. The general behaviour is that doubling the number of workers almost halves the efficiency, and this is seen each time the number of workers is doubled: using 2 workers obtains an efficiency of almost 50% and using 4 workers an efficiency of almost 25%, which shows that increasing the number of workers is not beneficial for this problem. It is only when utilising 8 workers that some difference is seen between the environments: cslin040 maintains approximately 5% efficiency with 8 workers, whereas cslin146 maintains 11%. The worker model carries a large overhead for setting up the workers and copying the data to and from them, which is why it fails to compete with implicit parallelism. The Heat Diffusion problem was implemented in parallel using MPI in the TC32 [4] module, where parallel benefits were achieved for the sparse matrix. This demonstrates that parallel benefits are achievable for this sparse matrix, and it may be that the low-level programming available in the PCT could achieve them too; time constraints have not allowed this to be investigated.

Figure 5.8: Heat diffusion on cslin040 (efficiency, S(p)/p, against number of workers, w, for M = 9, 17, 33, 65 and 129, with the ideal efficiency shown).
Figure 5.9: Heat diffusion on cslin146 (efficiency, S(p)/p, against number of workers, w, for the same values of M).
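The efficiency values plotted in Figures 5.8 & 5.9 are derived directly from the measured timings. A small sketch of the calculation, using made-up timings rather than the measured values:

p  = [1 2 4 8];                 % number of workers
tp = [10.0 5.6 3.4 2.9];        % example timings in seconds (not measured values)
S  = tp(1) ./ tp;               % speedup    S(p) = T(1) / T(p)
E  = S ./ p;                    % efficiency E(p) = S(p) / p
disp([p' S' E'])                % ideal efficiency would be 1 for every p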

5.4.2 Discussion

Implementing the Jacobi Method has highlighted the effects of sparse and dense data. It is not the clearest example to use, but it provides insight into the kind of behaviour and performance that can be expected from an iterative algorithm, and it verified that sparse matrices could not be handled by the GPU. For dense data, provided enough computational resources are available, running in parallel with the worker model gives good performance for larger problems. It is expected that for very large data, which cannot be accommodated in shared memory, using the DCS would provide positive parallel performance. Considering sparse matrices, it has been shown that for sparser data the default MATLAB client performs well, but when parallelism is introduced the performance suffers; as the density of a matrix decreases, the benefit of parallelism diminishes. Implicit parallelism has failed to provide any benefit, both for the random sparse matrices and for the sparse matrix of the Heat Diffusion problem. Taking into account that the Jacobi Method is not the most natural source of such matrices, these results nevertheless give enough understanding of the kind of results to expect, and the Heat Diffusion problem has shown similar performance to that observed with random sparse matrices. Due to time constraints it was not possible to investigate the low-level parallelism possible with the PCT, so it is not possible to comment on how such methods would work for these problems.

5.5 Evaluating Environments

The environments used had different architectures. Considering workstation cslin040, where only a quad-core processor was available, good performance and scaling were achieved when 2 workers were utilised. Using 4 workers showed a small benefit in some cases but worse performance in others, due to resource sharing; and where 4 workers did provide increased benefit, it was not as significant as going from serial to parallel with 2 workers. With regard to the GPU, as discussed earlier the PCT requires a compute capability greater than 2.1 for GPU computing to be possible, therefore all of the GPU computing discussed was tested on this environment. There were no other GPUs available which would allow testing of how GPU hardware affects the performance of the PCT.

For workstation cslin049, which holds a better processor alongside Intel's hyper-threading technology, no clear advantage was seen over cslin040. It was assumed initially that, with 4 hyper-threaded cores available, utilising 4 workers would make full use of the 4 physical cores and provide better efficiency. Analysing the system monitor and the activity on the cores confirmed this, yet performance was not significantly improved. Scaling up to more workers to utilise the full logical core count actually hindered performance in all cases except the Mandelbrot Set; this is the familiar issue of performance falling when more workers are utilised than there are cores available, and the same applies to creating more threads than cores. Therefore no advantage was gained from the hyper-threaded machine: the results show that the power of 4 additional cores is not provided, but rather that more threads are implicitly created on the original cores.

The final environment considered was cslin146, an older machine in terms of the CPU itself, but with 4 quad-core CPUs providing 16 cores of computational power. GPU programming was not considered here because the GPU in this machine was not sufficiently new to run the PCT. All algorithms were tested on this machine and promising results were achieved, as has been discussed. The performance benefits were similar to those on cslin040; however, the results scaled much better, especially when 4 workers were considered. With regard to the issue of core sharing, the results for 8 workers were far better, showing that on genuinely multi-core architectures the PCT will scale further.

5.6 Evaluating Effort

From the project it has become clear that, provided a correct serial implementation of an application or algorithm exists, it can be made parallel quite simply given a good understanding of the parallel constructs involved. Across all the implementations considered, no major changes to the existing code were required; by introducing some extra constructs it was possible to introduce parallelism into the code. The performance gains, however, depend entirely on the problem and the parallel method involved. Given the initial experience of MATLAB gained from MJ21 [3] and the knowledge of parallel programming from TC32 [4], the learning curve was not too steep. MATLAB provides both low-level and high-level approaches to parallel programming; for the purposes of this project, and mainly due to time constraints, only the high-level constructs have been evaluated.

The effort required to develop the algorithms in parallel varied with the algorithm and application being considered, and in general more effort produced better results: basic implementations can provide parallel benefits, but more extensive implementations can provide significant improvements. The Matrix-Matrix Multiplication was not a difficult task requiring great effort; on the other hand, implementing the Mandelbrot Set was much more time consuming, although the initial lack of knowledge and experience of the PCT played a key role in that. The Jacobi Method was fairly straightforward; more time was spent correcting the MATLAB implementation of the algorithm than on the actual parallel constructs. The code changes required to take the Matrix-Matrix Multiplication from serial to parallel on the GPU can be seen below:

function serialmxm(n, ntests)
    for i = 1:ntests                  % repeat the timed run several times
        tic;
        A = rand(n, n);               % initialise matrices on the CPU
        B = rand(n, n);
        C = mtimes(A, B);             % perform the multiplication on the CPU
        toc;
    end
end

function gpumxm(n, ntests)
    for i = 1:ntests                  % repeat the timed run several times
        tic;
        gA = gpuArray.rand(n, n);     % initialise matrices directly on the GPU
        gB = gpuArray.rand(n, n);
        gC = mtimes(gA, gB);          % perform the multiplication on the GPU
        C = gather(gC);               % bring the computed result back to the CPU
        toc;
    end
end

With regard to GPU computing, using the parallel constructs was not as much of an issue as the restriction on memory: on the environments available, the GPU had significantly less memory than the CPU. It was more important to understand how data was being transferred between the CPU and GPU, and how this would affect the computation and the performance of the code. The research involved in parallel programming was more time consuming than the actual development, and this was the key part of the learning curve.
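One practical consequence of that memory restriction is that it is worth checking the device's capacity before allocating large arrays. A minimal sketch, assuming a supported GPU is present (the problem size is an arbitrary example):

g = gpuDevice;                          % select and query the current GPU
n = 8000;
bytesNeeded = 8 * n^2;                  % one n-by-n double-precision matrix
fprintf('total %.2f GB, free %.2f GB, needed %.2f GB\n', ...
        g.TotalMemory/2^30, g.FreeMemory/2^30, bytesNeeded/2^30);
if bytesNeeded < g.FreeMemory
    gA = gpuArray.rand(n);              % safe to allocate on the device
end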

5.7 Summarising Evaluation

Promising results have been achieved by using the PCT. For the Matrix-Matrix Multiplication several implementations were possible, which provided a good overview of the potential performance. It has become clear that when the GPU is utilised for problems which are data intensive and highly parallel, and the data is within memory constraints, the performance benefits can be impressive. The Mandelbrot Set demonstrated that there is more than one way to use the GPU and that with naive use performance may not be enhanced; it is only when the GPU is utilised correctly that the performance gains are worthwhile. For the cases considered, the implicit parallelism provided by MATLAB is very good relative to basic single-threaded MATLAB. It is accepted that such parallelism is not provided for all library functions, but where it is possible the gains are good.

Overall, using the worker model in shared memory is not advisable; only with the DCS, where the worker model can be deployed on distributed memory, are significant gains possible. MATLAB's implicit parallelism with multithreading is always as good as, or better than, the worker model in the cases considered here. It is not feasible to use the maximum number of workers on a single machine, as core sharing limits the potential benefits. It is assumed that, provided no other processes are running, 2 workers on a quad-core CPU will work well, and this has been seen in the results. For very large problems, going parallel is only seen to be appropriate for dense matrices; the methods applied to sparse matrices have shown no clear improvement. The GPU is harder to judge: dense matrices with other methods can show good benefits, but for the Jacobi Method the opposite effect is seen.

With respect to environments, the performance of the PCT has been shown to depend on the available computational power. Machine cslin146 had 16 cores available, but due to software restrictions it was not possible to use them all, which highlights that without adequate software, powerful hardware is not as rewarding [20]. For the GPU, the PCT cannot provide GPU computing without an appropriate device; GPU computing was possible on only one of the three environments, so a broader evaluation is not possible.

MATLAB has offered a simple approach to parallel programming, as claimed [11]; however, the performance gains clearly depend on the user. For novice users, high-level constructs can provide performance gains depending on the problem, although, as shown, not all problems, algorithms and applications benefit from parallelism. For the expert user, low-level programming constructs could be used, which require more effort and deeper knowledge, but with that increased effort comes the potential of better performance. Even without the PCT, MATLAB provides implicit parallelism through multithreading. Without the DCS the potential of the worker model has not been fully explored, and implicit parallelism has equalled, if not bettered, the worker model. Considering the GPU, however, the PCT has shown great benefits when utilised correctly, and GPU computing with MATLAB follows a similar trend: the more effort deployed, the better the gains. Simple use of arrays on the GPU can be advantageous, but using low-level CUDA code in C++ with the PCT could provide much greater benefits and solve more challenging problems [34, 35].

Chapter 6

Conclusion

To conclude the project, this chapter evaluates the project as a whole, discussing the requirements, the results and their evaluation, and future work related to the Parallel Computing Toolbox (PCT).

6.1 Minimum Requirements

The results achieved and the evaluations made evidence that the aims and objectives of the project have been fulfilled. The aim of the project, to evaluate the parallel performance of the Parallel Computing Toolbox (PCT), has been met by meeting all the minimum requirements and some extensions. The Matrix-Matrix Multiplication provided preliminary results and invaluable understanding of the techniques and constructs involved in parallel programming with the PCT; implementing it in parallel using different programming models and then analysing the performance fulfilled the minimum requirements. Possible extensions were attempted successfully, implementing the Mandelbrot Set and the Jacobi Method to gain further insight into the potential of the PCT. Proposed extensions not fulfilled, such as comparing the performance of the PCT in MATLAB with the parallel methods provided by GNU Octave, were dropped due to the time constraints.

6.2 Extensions

The extensions to the project were to consider other algorithms in order to achieve a more complete evaluation of the PCT. This highlighted key factors in the parallelism achieved using the PCT. Understanding how MATLAB handles sparse and dense matrices is very important, as this can be taken into consideration when implementing other algorithms in the future. GPU computing is a new route to parallel performance, and the potential of the GPU in terms of performance has been shown by evaluating the Mandelbrot Set as an extension; this provided a different perspective on GPU computing than evaluating the Matrix-Matrix Multiplication alone. Without the extensions being fulfilled, many important conclusions would have been missed and a sufficient evaluation of the PCT would not have been possible.

6.3 Project Management

The methodology devised in Section 3 was followed very precisely and contributed greatly to the completion and success of the project. The initial schedule had to be revised, as shown in Appendix D, Figure D.2, following the changes discussed earlier. Initially there was no actual break scheduled during the project, even though commitments elsewhere would always have required days to be taken off; being flexible allowed the schedule to be followed fairly precisely. During the initial stages of the project work was only done on weekdays, but as the project progressed towards the latter stages the weekends were also used. Another key change related to the Heat Diffusion problem: at the start it was assumed that a separate implementation of this problem would be required, but implementing and evaluating the Jacobi Method allowed the Heat Diffusion problem to be used as a test case for the Jacobi Method instead. Immediately after the mid-project report it was clear what was possible for this project and how realistic the aims and minimum requirements were in terms of what was achievable; revising the schedule early prevented any major setbacks. The availability of the Distributed Computing Server (DCS) was not certain during the initial stages of the project, and it later became known that, due to licensing issues, this route could not be explored. Completing a full draft of the final report early allowed more time for redrafting and careful scrutiny for both grammatical and technical errors.

6.4 Future Projects

Due to the time constraints, some work highlighted during the course of the project could not be investigated further. In terms of the worker model, it is possible to use distributed arrays and co-distributed arrays for the distribution of data. There was only time to understand how the distribution schemes work, not how they affect performance in detail; a brief sketch of such a scheme is given below. It is expected that specific distributions will work better for some algorithms and data than others, as this is a problem-specific concept. However, previous understanding and knowledge of parallel programming [4] has clearly shown that by managing the data and algorithm carefully, and considering distribution and communication, significant speedup is achievable. Because the DCS was not available, it was not possible to evaluate the performance of the worker model on distributed memory; a future project could evaluate using distributed memory with the PCT as an extension of this work. It is expected that with distributed memory the worker model will provide more benefit, as under the shared memory model it is limited by the single machine available. The PCT and DCS are both expensive with respect to licences. During the project it became apparent that GNU Octave [32] is a free programming environment similar to MATLAB which, with extensions, can provide parallel programming in a similar way. It would therefore be interesting to know how justifiable the costs associated with the PCT and DCS are in terms of the performance achieved.
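As promised above, the distribution schemes can be exercised with only a few lines of PCT code; the sketch below is a starting point rather than an evaluated implementation (the array size and distribution dimension are arbitrary, and an open pool of workers is assumed).

spmd
    dist = codistributor1d(2);                  % distribute the array by columns
    D = codistributed.rand(4000, 4000, dist);   % each worker stores its own slice
    localSlice = getLocalPart(D);               % operate on the local columns only
    fprintf('worker %d holds %d columns\n', labindex, size(localSlice, 2));
end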

Finally, the parallel techniques considered in this project were mainly high-level, which suit the novice user. Considering low-level parallel programming, as an expert user would, is also very important to provide further insight into the potential of the PCT; this was not possible in the project due to the time constraints. The results for GPU computing have been promising, and the research has shown that the GPU hardware itself matters greatly: a low-spec GPU may not provide any gains relative to those achieved by a much more powerful device. It would be interesting to see how the results of the algorithms considered scale if a more powerful GPU were used.

6.5 Predictions

The emergence of GPU programming has added to the potential of parallel performance. Hybrid models have been able to achieve good parallelism by combining the advantages of the CPU and the GPU, as discussed in Section 2.3. This combination is powerful because CPUs consist of a few cores optimised for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance: the serial portions of a parallel code can be devoted to the CPU whilst any parallelism is deployed on the GPU. Jack Dongarra, in NVIDIA's article "What Is GPU Computing" [31], said: "GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs."

Currently MATLAB provides support for several parallel methods, and many MATLAB library functions have been enabled to work in parallel. MATLAB has already claimed that support for more parallel methods will continue [20], therefore it is expected that more MATLAB library functions will work in parallel in the future. MATLAB aims to hide the low-level detail behind the parallelism it provides; however, for some problems this can limit the parallel performance. It is felt that by using explicit parallelism the PCT can provide better performance and solve more parallel problems than it can with implicit parallelism alone. MATLAB currently provides a parallel interactive window, as discussed earlier, for the purpose of prototyping. It is expected that this functionality will be extended to allow the user to develop parallel programs not just with the worker model but also with the GPU, interactively allowing users to handle the underlying communication and manage memory without the need for difficult low-level programming.

Bibliography

[1] COMP 1945/SS12: Project Management (University of Leeds), N. Efford and K. Markert.
[2] COMP 2540/CR21: Software Systems Engineering (University of Leeds), H. Carr.
[3] COMP 2647/MJ21: Numerical Computation and Visualisation (University of Leeds), P. Jimack.
[4] COMP 3920/TC32: Parallel Scientific Computing Module (University of Leeds), M. Hubbard.
[5] A. J. Chakravarti, S. Grad-Freilich, E. Laure, M. Jouvin, G. Philippon, C. Loomis and E. Flores. Enhancing e-Infrastructures with Advanced Technical Computing: Parallel MATLAB on the Grid. (91584v00).
[6] A. Krishnamurthy, S. Samsi and V. Gadepally. Parallel MATLAB Techniques. December.
[7] B. Barney. Introduction to Parallel Computing. parallel_comp/#whatis [Online; accessed 25-February-2013].
[8] M. Croucher. MATLAB GPU / CUDA experiences on my laptop: Elementwise operations on the GPU #2. Technical report.
[9] E. Ellis. Improving Optimization Performance with Parallel Computing. Technical Report 91710v00 03/09.
[10] E. Ellis. Solving Large-Scale Linear Algebra Problems Using SPMD and Distributed Arrays. Technical Report 91819v00 05/10.
[11] G. Sharma and J. Martin. MATLAB: A Language for Parallel Computing. Technical report, October.
[12] N. Gift. Practical threaded programming with Python: Threading usage patterns. http:// [Online; accessed 28-February-2013].
[13] H. Kim, J. Mullen and J. Kepner. Introduction to Parallel Programming and pMatlab v2.0.

[14] NVIDIA. CUDA parallel computing platform. new.html [Online; accessed 28-February-2013].
[15] J. Burkardt, G. Cliff and J. Krometis. Parallel MATLAB: Parallel For Loops. Technical report.
[16] P. Kalinova and D. Sykora. Solving large sparse systems of linear equations on GPU.
[17] A. H. Karp and H. P. Flatt. Measuring parallel processor performance. Communications of the ACM, 33.
[18] D. B. Kirk and W. W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, first edition.
[19] P. Luszczek. Enhancing Multicore System Performance Using Parallel Computing with MATLAB. Technical Report 80367v00 09/08.
[20] P. Luszczek. Parallel programming in MATLAB. The International Journal of High Performance Computing Applications, 22.
[21] Jos Martin. CUDA-GPU compute capability restriction - Newsreader - MATLAB Central. [Online; accessed 29-April-2013].
[22] MathWorks. Parallel Computing Toolbox 4: Perform parallel computations on multicore computers and computer clusters. Technical Report 91541v01 10/08.
[23] MathWorks. Speeding Up MATLAB Applications. Technical Report 91991v00 12/11.
[24] MathWorks. Parallel Computing. Technical Report 91787v00 11/09.
[25] MathWorks. MATLAB Distributed Computing Server. Technical report.
[26] MathWorks. Parallel Computing Toolbox: Perform parallel computations on multicore computers, GPUs, and computer clusters. Technical report.
[27] C. Moler. Parallel MATLAB: Multiple Processors and Multiple Cores. (91467v00 06/07).
[28] C. Moler. Why there isn't a parallel MATLAB. Technical report.
[29] C. Moler. The Growth of MATLAB and The MathWorks over Two Decades. Technical report, January.
[30] Intel. Intel hyper-threading technology. architecture-and-technology/hyper-threading/hyper-threading-technology.html [Online; accessed 27-March-2013].

[31] NVIDIA. What is GPU computing? GPGPU, CUDA and Fermi explained. com/object/what-is-gpu-computing.html [Online; accessed 27-February-2013].
[32] GNU Octave. About GNU Octave. [Online; accessed 27-April-2013].
[33] M. J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill International Editions: Computer Science Series, second edition.
[34] J. Reese and S. Zaranek. GPU Programming in MATLAB. Technical Report 91967v01.
[35] L. Shure. Loren on the Art of MATLAB: Using GPUs in MATLAB. Technical report, February.
[36] B. Tordoff and L. Shure. Loren on the Art of MATLAB: A Mandelbrot Set on the GPU. blogs.mathworks.com/loren/2011/07/18/a-mandelbrot-set-on-the-gpu/ [Online; accessed 27-February-2013].
[37] Boston University Information Services & Technology. MATLAB Parallel Computing Toolbox tutorial. [Online; accessed 15-April-2013].
[38] W. Sun, R. Ricci and M. L. Curry. GPUstore: harnessing GPU computing for storage systems in the OS kernel. In Proceedings of the 5th Annual International Systems and Storage Conference, SYSTOR '12, pages 9:1-9:12, New York, NY, USA. ACM.
[39] B. Wilkinson and M. Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Pearson Prentice Hall, second edition.
[40] S. W. Zaranek. Using GPUs in MATLAB. Technical report.

Appendix A

Personal Reflection

Producing a well written, structured report through a great deal of hard work has been deeply satisfying. This is one experience in life that I will recall as among the most enjoyable. Throughout the project there was no stage at which I felt I could not carry on; in fact I did not want it to finish and hoped there was more time available. As the research and implementations developed, many interesting things were highlighted which, time permitting, I would certainly have investigated. For now, I have left them as recommendations for future projects.

Having thoroughly enjoyed studying related modules through the degree programme, I was almost certain of the type of project I wanted: one based around something which was not exactly my strength. Parallel programming was something I had never heard of, and Parallel Scientific Computing [4] was the module where I began to explore it. Prior to this I studied Numerical Computation and Visualisation [3], where I developed a great interest in numerical computation, involving the accuracy and performance of algorithmic methods; it was also a reason why I chose the former module in the final year of my studies. Having struggled to do well in that module's coursework, I was determined to change this. I felt I had missed the opportunity, but fortunately this project gave me the chance to rectify that. Considering the other project choices available, I feel I made the perfect choice, and never during the course of the project have I regretted this decision. The project aimed to explore a route into parallel programming which was unfamiliar; I had never heard of the Parallel Computing Toolbox (PCT) before reading the project proposal. Being given the freedom to engage with something of such interest to me provided great motivation and enthusiasm, rather than just completing a project for the purpose of the degree alone. I have always felt that one should pursue the route which brings the most joy and satisfaction.

Before the project began, I had promised myself not to repeat the mistakes I had made during earlier studies with planning and organisation. Having learnt from past experience, I was not going to risk file corruption or accidental overwrite issues.

I saved all my work on cloud-based storage, a flash drive and my university storage, always ensuring that up-to-date files, papers and other project-related resources were at hand. I decided to keep hard copies of any write-up, notes, code, sketches, preliminary results and graphs organised in separate folders; this helped not only with organisation but also provided handy documentation in the event of a computer not being accessible.

Initially the project did not start very well. I had met my supervisor immediately after my last exam and planned the scope of the project, and I was full of energy and eager to get into the project straight away; however, I was taken seriously ill. Applying for an extension was considered, but as soon as my health started showing signs of improvement I began to work more productively. Initially progress was slow and steady, but within a couple of weeks I had formed a working routine which I managed to follow successfully throughout the project. The initial stages were deemed critical to the success of the project, therefore great attention was paid to the independent reading and learning required. By doing this thoroughly I was able to gain a better understanding of the task ahead and to consider the practical development realistically. If I had not put so much effort in, I believe I would have struggled and most likely missed important details; this turned out to be absolutely vital for the project. Problems occurred throughout the project, but having started early enough I always found myself in a position to manage and resolve them. Had I been struggling for time, some of the issues might not have been resolved, hindering the potential and success of the project. Lacklustre preparation is usually at fault when corners are cut, areas of work are skipped, or something is left so late that it is no longer feasible. I set myself personal deadlines ahead of the actual deadlines, which gave me the opportunity to manage my workload, accommodate any setbacks, and ensure the work done was of a good standard.

During the project I felt that I was constantly jumping to conclusions and making false claims about my results, because initially I failed to check whether my implementations were algorithmically correct. During the evaluation of preliminary results for the mid-project report I realised that the results were invalid: I had initially failed to understand certain programming constructs and how they affected the functionality of the code and the behaviour of the algorithm. I therefore had to run most of my tests again in order to obtain accurate and valid results, and for the remainder of the project I ensured this mistake was not repeated. The initial draft of the final evaluation was also very weak; I had not been specific enough and had not explained my graphs well. It was only after the progress meeting, where I was asked several questions about the results presented, that I realised some mistakes had been made, and this led to a redrafting of the evaluation of the results. If I were to do this project again, I would make sure that my explanations were concise and specific, and to avoid the issue of invalid results I would check preliminary results extensively rather than assume their correctness.

Throughout the degree programme my interest in and desire for programming only grew. Having found it hard to understand during my first year of studies, I was certain I would be able to improve, and being able to do this project, with so much independent work involved, is very pleasing. I agree with claims that small, unexpected and not immediately obvious errors and bugs can cause major delays. My supervisor and assessor were very helpful; I was lucky to be assigned Mark and Peter respectively, both experts in the subject area. Regular supervision, weekly meetings and constant contact provided valuable advice from Mark, and Peter's feedback on the mid-project report and at the progress meeting was very important to the success of the project. The Scientific Computing Seminars held by the Scientific Computing Research Group were a pleasure to attend; relevant or not to the final year project, it was brilliant to be part of such expertise and enthusiasm.

Throughout the project fellow students often asked why I never showed any signs of stress. I believe that stressing and becoming overwhelmed by problems is not the way forward; taking on the challenge and being practical and realistic are what matter. For me, setting earlier deadlines played a key role in the success achieved. For the write-up, I had decided I would finish the final report at least 10 days before the actual deadline to give myself time for redrafting. Fortunately I managed to achieve this target even earlier, allowing me to redraft and have my report proof-read. By not losing motivation and staying focused I was able to produce as many as three drafts of the report, making sure it was technically and grammatically sound. I started an online blog when the project began; initially I updated it twice a week, but the workload increased significantly during the project and did not allow much time for regular updates. The purpose of the blog was to allow me to monitor my progress and for my supervisor to keep track of it.

For future final year project students there are many recommendations for making a project the satisfying and brilliant experience it is meant to be. Final year projects for School of Computing students start immediately after the final exams in January, and I would advise students not to slack after this period, because project time begins as soon as semester 2 begins. Organising the work involved in terms of tasks helps one assess the current situation and progress during the project. Personally, I fixed a set amount of time to be devoted to the project every day and monitored any time lost or extra time gained. Constant contact with the supervisor is of great benefit, and any meetings with the assessor should not be wasted but taken as an opportunity to discuss the project; supervisors and assessors can both spot things which are not immediately obvious to the student, and it is important not to shy away from such opportunities. Something of key benefit for the write-up is to make notes on concepts researched, learnt and understood, for evaluation purposes.

Finally, predefining a structure for the report and filling in the gaps throughout is a great way to begin the report early, and doing so makes the write-up a seamless and less stressful experience. Emotional attachment to any project-related requirement or aim is also unhelpful: the freedom given in the project allows exploration of many different paths, but it is vital to remain realistic and not drift beyond the scope of the project, as this can have a negative effect on it. Something I have found most valuable is LaTeX. Using LaTeX made the write-up a smooth and efficient process; it is more difficult than a standard document editor, but the results are very satisfying, as LaTeX works behind the scenes in setting up and organising the document. If students are considering LaTeX, using it for the mid-project report is advised: that provides the opportunity for initial testing and familiarisation with the environment, and helps decide whether LaTeX is appropriate for the final report, preventing setbacks further into the project. The mid-project report itself should be considered very important, as it not only highlights any issues with the project at an early stage but also forms the foundation of the final report to come.

Appendix B

Resources and Extra Material

The Mandelbrot Set implementation was developed using the code from [36]. This source provided details of the algorithm and programming constructs, and sample code for the GPU. The Jacobi Method and Heat Diffusion implementations were developed using the code from the Numerical Computation and Visualisation [3] module. This source provided the details of the algorithm and sample code for serial implementations.

Appendix C

Ethical Issues

There were no ethical issues for this project.

Appendix D

Schedule and Data Tables

D.1 Schedules

Figure D.1: Planned schedule to be followed.

Figure D.2: Actual schedule followed.

D.2 Data for Matrix-Matrix Multiplication

Table D.1: GPU - Data Initialised (mean execution time in seconds and standard deviation against problem size N).
Table D.2: GPU - Data Transferred (mean execution time in seconds and standard deviation against problem size N).


More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Eight units must be completed and passed to be awarded the Diploma.

Eight units must be completed and passed to be awarded the Diploma. Diploma of Computing Course Outline Campus Intake CRICOS Course Duration Teaching Methods Assessment Course Structure Units Melbourne Burwood Campus / Jakarta Campus, Indonesia March, June, October 022638B

More information

Mapping Vector Codes to a Stream Processor (Imagine)

Mapping Vector Codes to a Stream Processor (Imagine) Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

1 Hardware virtualization for shading languages Group Technical Proposal

1 Hardware virtualization for shading languages Group Technical Proposal 1 Hardware virtualization for shading languages Group Technical Proposal Executive Summary The fast processing speed and large memory bandwidth of the modern graphics processing unit (GPU) will make it

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Going With the (Data) Flow

Going With the (Data) Flow 1 of 6 1/6/2015 1:00 PM Going With the (Data) Flow Publish Date: May 20, 2013 Table of Contents 1. Natural Data Dependency and Artificial Data Dependency 2. Parallelism in LabVIEW 3. Overuse of Flat Sequence

More information

Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics

Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics Moysey Brio & Paul Dostert July 4, 2009 1 / 18 Sparse Matrices In many areas of applied mathematics and modeling, one

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh

Using Java for Scientific Computing. Mark Bul EPCC, University of Edinburgh Using Java for Scientific Computing Mark Bul EPCC, University of Edinburgh markb@epcc.ed.ac.uk Java and Scientific Computing? Benefits of Java for Scientific Computing Portability Network centricity Software

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract The Affinity Effects of Parallelized Libraries in Concurrent Environments FABIO LICHT, BRUNO SCHULZE, LUIS E. BONA, AND ANTONIO R. MURY 1 Federal University of Parana (UFPR) licht@lncc.br Abstract The

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

NETW3005 Operating Systems Lecture 1: Introduction and history of O/Ss

NETW3005 Operating Systems Lecture 1: Introduction and history of O/Ss NETW3005 Operating Systems Lecture 1: Introduction and history of O/Ss General The Computer Architecture section SFDV2005 is now complete, and today we begin on NETW3005 Operating Systems. Lecturers: Give

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

Massive Scalability With InterSystems IRIS Data Platform

Massive Scalability With InterSystems IRIS Data Platform Massive Scalability With InterSystems IRIS Data Platform Introduction Faced with the enormous and ever-growing amounts of data being generated in the world today, software architects need to pay special

More information

Parallel MATLAB at VT

Parallel MATLAB at VT Parallel MATLAB at VT Gene Cliff (AOE/ICAM - ecliff@vt.edu ) James McClure (ARC/ICAM - mcclurej@vt.edu) Justin Krometis (ARC/ICAM - jkrometis@vt.edu) 11:00am - 11:50am, Thursday, 25 September 2014... NLI...

More information

Parallelism. CS6787 Lecture 8 Fall 2017

Parallelism. CS6787 Lecture 8 Fall 2017 Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does

More information

High Performance Computing in C and C++

High Performance Computing in C and C++ High Performance Computing in C and C++ Rita Borgo Computer Science Department, Swansea University WELCOME BACK Course Administration Contact Details Dr. Rita Borgo Home page: http://cs.swan.ac.uk/~csrb/

More information

Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations

Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations Performance Brief Quad-Core Workstation Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations With eight cores and up to 80 GFLOPS of peak performance at your fingertips,

More information

System Design S.CS301

System Design S.CS301 System Design S.CS301 (Autumn 2015/16) Page 1 Agenda Contents: Course overview Reading materials What is the MATLAB? MATLAB system History of MATLAB License of MATLAB Release history Syntax of MATLAB (Autumn

More information

OPERATING- SYSTEM CONCEPTS

OPERATING- SYSTEM CONCEPTS INSTRUCTOR S MANUAL TO ACCOMPANY OPERATING- SYSTEM CONCEPTS SEVENTH EDITION ABRAHAM SILBERSCHATZ Yale University PETER BAER GALVIN Corporate Technologies GREG GAGNE Westminster College Preface This volume

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

Windows Compute Cluster Server 2003 allows MATLAB users to quickly and easily get up and running with distributed computing tools.

Windows Compute Cluster Server 2003 allows MATLAB users to quickly and easily get up and running with distributed computing tools. Microsoft Windows Compute Cluster Server 2003 Partner Solution Brief Image courtesy of The MathWorks Technical Computing Tools Combined with Cluster Computing Deliver High-Performance Solutions Microsoft

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU 1 1 Samara National Research University, Moskovskoe Shosse 34, Samara, Russia, 443086 Abstract.

More information

Parallelism. Parallel Hardware. Introduction to Computer Systems

Parallelism. Parallel Hardware. Introduction to Computer Systems Parallelism We have been discussing the abstractions and implementations that make up an individual computer system in considerable detail up to this point. Our model has been a largely sequential one,

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

Measurement of real time information using GPU

Measurement of real time information using GPU Measurement of real time information using GPU Pooja Sharma M. Tech Scholar, Department of Electronics and Communication E-mail: poojachaturvedi1985@gmail.com Rajni Billa M. Tech Scholar, Department of

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Free upgrade of computer power with Java, web-base technology and parallel computing

Free upgrade of computer power with Java, web-base technology and parallel computing Free upgrade of computer power with Java, web-base technology and parallel computing Alfred Loo\ Y.K. Choi * and Chris Bloor* *Lingnan University, Hong Kong *City University of Hong Kong, Hong Kong ^University

More information

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29 Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Designing and debugging real-time distributed systems

Designing and debugging real-time distributed systems Designing and debugging real-time distributed systems By Geoff Revill, RTI This article identifies the issues of real-time distributed system development and discusses how development platforms and tools

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

High Performance and GPU Computing in MATLAB

High Performance and GPU Computing in MATLAB High Performance and GPU Computing in MATLAB Jan Houška houska@humusoft.cz http://www.humusoft.cz 1 About HUMUSOFT Company: Humusoft s.r.o. Founded: 1990 Number of employees: 18 Location: Praha 8, Pobřežní

More information

THE FASTEST WAY TO CONNECT YOUR NETWORK. Accelerate Multiple Location Connectivity with Ethernet Private Line Solutions FIBER

THE FASTEST WAY TO CONNECT YOUR NETWORK. Accelerate Multiple Location Connectivity with Ethernet Private Line Solutions FIBER THE FASTEST WAY TO CONNECT YOUR NETWORK Accelerate Multiple Location Connectivity with Ethernet Private Line Solutions FIBER In today s competitive business environment, speed is the name of the game.

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information