Matrix Multiplications on Apache Spark through GPUs


DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Matrix Multiplications on Apache Spark through GPUs

ARASH SAFARI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Matrix Multiplications on Apache Spark through GPUs

ARASH SAFARI

Master in Computer Science
Date: June 28, 2017
Supervisor: Per Austrin
Examiner: Hedvig Kjellström
Swedish title: Matrismultiplikationer på Apache Spark med GPU
School of Computer Science and Communication

Abstract

In this report, we consider the distribution of large scale matrix multiplications across a group of systems through Apache Spark, where each individual system utilizes Graphical Processor Units (GPUs) in order to perform the matrix multiplication. The purpose of this thesis is to research whether the GPU's advantage in performing parallel work can be applied to a distributed environment, and whether it scales noticeably better than a CPU implementation in a distributed environment. This question was resolved by benchmarking the different implementations at their peak. Based on these benchmarks, it was concluded that GPUs indeed do perform better as long as single precision support is available in the distributed environment. When single precision operations are not supported, GPUs perform much worse due to the low double precision performance of most GPU devices.

Sammanfattning

I denna rapport betraktar vi fördelningen av storskaliga matrismultiplikationer över ett Apache Spark-kluster, där varje system i klustret delegerar beräkningarna till grafiska processorenheter (GPU). Syftet med denna avhandling är att undersöka huruvida GPU:ns fördel vid parallellt arbete kan tillämpas på en distribuerad miljö, och om det skalar märkbart bättre än en CPU-implementation i en distribuerad miljö. Detta gjordes genom att testa de olika implementationerna i en miljö där optimal prestanda kunde förväntas. Baserat på resultat från dessa tester drogs slutsatsen att GPU-enheter presterar bättre än CPU-enheter så länge ramverket har stöd för single precision-beräkningar. När detta inte är fallet så presterar de flesta GPU-enheterna betydligt sämre på grund av deras låga double precision-prestanda.

Contents

1 Introduction
  1.1 Motivation and Aim
  1.2 Environmental and ethical concerns
  1.3 Problem Definition
  1.4 Previous Studies
    1.4.1 GPU computing
    1.4.2 Spark & GPU
    1.4.3 Delimitation
  1.5 Problem Statement

2 Background
  2.1 Linear Algebra
    2.1.1 Matrix Multiplications
    2.1.2 Partitioned Matrix Multiplication
    2.1.3 Parallel matrix multiplication
    2.1.4 BLAS library
  2.2 Graphical Processing Units
    2.2.1 GPU architecture
    2.2.2 CUDA
    2.2.3 GPU Limitations
  2.3 Spark
    2.3.1 Spark data management
    2.3.2 Spark Resource management
    2.3.3 MLlib
  2.4 Miscellaneous
    2.4.1 Netlib
    2.4.2 Native BLAS Libraries
    2.4.3 Garbage collection
  2.5 Performance Optimization

3 Methodology
  3.1 Testing environment
    3.1.1 Setup
  3.2 Optimization Testing
    3.2.1 Partition Testing
    3.2.2 Executor testing
    3.2.3 Memory Management Testing
    3.2.4 Garbage Collection
  3.3 Scalability Testing
    3.3.1 Spark & Single Precision Operations

4 Results
  4.1 Optimization Test Results
    4.1.1 Data Partitioning
    4.1.2 Cores & Executors
    4.1.3 Memory Management
    4.1.4 JVM options
  4.2 Scalability Testing
    4.2.1 Optimal Environment Evaluation
    4.2.2 OpenBLAS Scaling
    4.2.3 NVBLAS Scaling
    4.2.4 Comparison Results

5 Discussion
  5.1 Speculations and Conclusions
    5.1.1 Performance
    5.1.2 Cluster Scaling
    5.1.3 Comparison
  5.2 Conclusion

6 Summary
  6.1 Resolving Research Questions
  6.2 Methodology and Results Discussion
  6.3 Future work
  6.4 Summary

Bibliography

Appendices
A Installation instructions
B Local Single vs Double Precision

Chapter 1

Introduction

Matrix multiplications are linear algebra computations that are frequently used behind the scenes in many fields. Unfortunately, they are computationally heavy, and can take an unreasonable amount of time to complete for large datasets. The solution to this problem lies in the parallel nature of matrix multiplications. The values of different cells in the resulting matrix can be computed independently of each other. It is this parallel nature that is exploited by Graphical Processing Units (GPUs). Due to matrix multiplications being heavily used in computer graphics [1], GPUs have been optimized to perform these types of operations extremely efficiently when compared to CPUs [2]. However, while GPUs are superior at performing the multiplications, they are much slower when it comes to accessing the main memory, which sometimes offsets the advantage that utilization of a GPU device brings.

Further problems also arise as the size of the matrices grows large and enters the big data realm. When data gets too big for a single system to handle in a reasonable time, it is often distributed across a cluster of systems with the help of frameworks such as Apache Spark. However, this distribution comes with significant overhead costs. Additionally, Spark does not currently have any support for utilization of GPU devices. Therefore, workarounds such as wrappers and interception of calls have to be utilized if one wishes to use GPUs for large scale matrix multiplications in clusters.

1.1 Motivation and Aim

Matrix multiplications are widely used in many industries, such as the previously mentioned graphics industry. While many of these fields usually deal with relatively small matrices, some of them deal with matrices large enough for distribution to be helpful. Machine learning and data queries are examples of instances where large scale matrix multiplications are of use. Data can for example be queried from a database with the help of matrix multiplications by representing the entire data set as a matrix, and a query by another matrix. The resulting matrix of the multiplication between these two matrices would indicate the results of the query. Unfortunately, due to the computational complexity of matrix multiplications, these operations can take unreasonable running times for a single query on large datasets. The aim of this thesis is to find out whether utilization of GPUs could prove useful in speeding this process up and yield more reasonable running times.

1.2 Environmental and ethical concerns

Shorter running times on a distributed system have positive environmental effects by consuming fewer resources. Even if running times are already acceptable, the usage of more efficient hardware could lower the number of nodes needed in a cluster. This would ultimately lower the energy consumption both during use, and by reducing demand for production of additional hardware. However, GPU devices require a considerable amount of electricity in order to be kept cool during continuous use. So even if GPUs prove to multiply matrices faster, it is unlikely that this improved speed would come with an overall reduction in energy consumption.

Furthermore, areas that would benefit from this report, such as machine learning and big data processing, are areas where ethical practice is a topic of ongoing conversations. In the case of machine learning, the prospect of smarter and more capable machines is exciting to some due to the great potential of enhancement to our day to day lives. At the same time, it is concerning to others, who are worried about the consequences of such a change, such as the prospect of mass unemployment caused by machines replacing human workers. In the case of big data processing, there have recently been many instances of large corporations gathering and processing large amounts of personal data from users of their services in hopes to either provide better service, or increase their ad revenue. The general public is mostly less than pleased about databases

containing and processing their personal data, while simultaneously enjoying the fruits of this labour, such as personalized Google search results. In summary, the environmental effects of this thesis are marginal and its social effects controversial.

1.3 Problem Definition

The purpose of this thesis is to figure out whether delegation of distributed matrix multiplications to the GPU scales well despite the penalties that come with the usage of wrappers, interceptors, and the distribution framework. This is done by measuring the running time of distributed matrix multiplications for matrices and clusters of varying sizes. These measurements are made for multiplications performed both on the GPU and the CPU, in order for comparisons to be possible.

1.4 Previous Studies

In this section, we mention a few previous studies related to this subject, and the insight they have provided going into this project.

1.4.1 GPU computing

General-purpose computing on graphics processing units (GPGPU) has been a phenomenon since the early 2000s. The idea is to utilize the massive parallel capacities of the GPU to speed up certain aspects of applications. There have been a high number of studies claiming a significant speedup when utilizing GPUs rather than CPUs [3, 4, 5, 6]. This notion has however been challenged and claimed to be exaggerated by Intel. In a 2011 paper titled Debunking the 100X GPU vs. CPU myth [7], Intel claims that many of the studies compare optimized GPU implementations to unoptimized CPU implementations. It also points out the importance of taking into account the cost of transferring data from host memory to device memory when making comparisons. The paper was in turn criticized by, among others, Nvidia for using a previous generation GPU and a current generation CPU in its measurements [8]. Nevertheless, the points that the Intel paper brought up against previous studies are still important and valid. In this thesis, we therefore try to optimize

our implementation of both the GPU and CPU solution, while also making sure that data transfer costs are taken into consideration.

1.4.2 Spark & GPU

Apache Spark is not GPU-aware, meaning that it does not attempt to utilize any GPU devices on the cluster. There have, however, been a number of studies that have tried to sidestep this limitation. Li et al. [9] proposed one such solution named HeteroSpark. The rather messy solution required GPU implementations of methods to have been pre-compiled and made available on a device with a GPU. The Spark applications are then to utilize these precompiled codes through a combination of Java Remote Method Invocation (RMI), Java Native Interface (JNI) and a Java wrapper for the precompiled GPU code. In late 2016, Yuan et al. managed to achieve a 4.83x speedup in performance of SQL queries by utilizing this solution [10].

Zadeh et al. [11] have also presented a study in which optimized matrix multiplications through Spark's linear algebra library were benchmarked. Matrix multiplication through the GPU was one of the approaches tested and benchmarked by the study. However, the tests were not run on a distributed cluster, but only on a single, powerful node. The study found that, with their hardware, CPU implementations were superior to the GPU implementations for matrices up to a certain dimension; after that point, GPU implementations take the lead. But by only utilizing one single node, the matrices are not distributed and the multiplications are performed locally. These results therefore do not reflect the actual performance of distributed matrix multiplications, but rather the local performance of the Spark engine on a single node.

However, an important takeaway from [11] is the manner in which the GPU was utilized. Instead of the complicated solution proposed by HeteroSpark, a simple wrapper developed by Nvidia was utilized. This solution exploits the fact that Spark performs linear algebra computations by calling a native library on the system. These calls to the system's native library are simply intercepted by the Nvidia wrapper and rerouted to the GPU. This solution, unlike HeteroSpark, only allows the GPU to be utilized for linear algebra operations. However, this is all that we require for this thesis. We therefore utilize this approach in our implementation.

1.4.3 Delimitation

The limiting factors in this thesis are the capabilities of the Spark engine, the limitations of GPU devices and libraries, and the available resources. Due to the lack of a sparse matrix multiplication library for the GPU that is compatible with Spark, the project is limited to dense matrices, and the number of nodes in our cluster is limited to 1 master node and 3 slave nodes due to hardware resource constraints.

1.5 Problem Statement

The main question this thesis attempts to answer is: "How do distributed matrix multiplications performed on Apache Spark scale (with regard to variables such as running time, different input sizes and cluster size) if the multiplications are performed by GPU devices rather than CPU devices?"

In order to be able to answer this question fairly and accurately, however, we need Spark to perform at its peak when evaluating both the CPU and the GPU performance. This leads us to the prerequisite question of: "How can Spark be configured to run matrix multiplications as efficiently as possible?"

Chapter 2

Background

This chapter introduces the concepts and techniques that are used in this report. We first cover matrix multiplications and the challenges they bring. Then, we cover computations through GPUs and how we can distribute large workloads between multiple systems in order to speed up the multiplications.

2.1 Linear Algebra

A matrix is a two-dimensional data structure containing numbers in fixed rows and columns. Matrices are often used as a mathematical representation of some concept or object. Such representations are prevalent in many fields. One such field is the digital graphics field. In digital graphics, any given object viewed from a given perspective is represented by a matrix [1]. The advantage of such representations is that changes in the perspective from which the viewer views the objects can efficiently be simulated with the help of simple linear algebra calculations such as linear transformations, rotations, scaling, projections, and the like [1]. These types of calculations rely heavily on matrix multiplications, which is what this report is focused on.

2.1.1 Matrix Multiplications

Matrix multiplications are computationally heavy work. When multiplying two matrices, A and B, each cell in the resulting matrix C consists of the sum of a series of multiplications between the entries of a row from matrix A and a column from matrix B. Figure 2.1 illustrates this process, which is then to be repeated for all the cells in the matrix C.
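To make this per-cell rule concrete, the following is a minimal, purely illustrative Scala sketch (not the implementation benchmarked in this thesis) that computes each cell of C as such a row-by-column dot product:

```scala
// Naive dense matrix multiplication: C(i)(j) = sum over p of A(i)(p) * B(p)(j).
// Assumes A is m x k and B is k x n, both stored as arrays of rows.
def multiply(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
  val m = a.length        // rows of A
  val k = a(0).length     // columns of A, which must equal the rows of B
  val n = b(0).length     // columns of B
  val c = Array.ofDim[Double](m, n)
  for (i <- 0 until m; j <- 0 until n) {
    var sum = 0.0
    for (p <- 0 until k)
      sum += a(i)(p) * b(p)(j)   // one term of the dot product for cell (i, j)
    c(i)(j) = sum
  }
  c
}
```

The three nested loops are what give the naive algorithm the cubic running time discussed next.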

Figure 2.1: Illustration of Matrix Multiplication. The figure depicts the pattern followed when populating Matrix C (blue) as a product of Matrices A (red) and B (green).

In fact, the time complexity of performing a matrix multiplication naively is an impractical O(n^3). Even when utilizing more advanced algorithms, the results do not get much better. The Strassen algorithm, for example, runs in O(n^2.8) time [12]. And we have yet to even take constant factors into consideration. In practice, this means that the running time of matrix multiplications can take hours for even moderately sized matrices, which is not acceptable in many areas. The running time can however be shortened by exploiting a few properties of the matrix multiplication process, namely their decomposable and parallel nature. These properties allow us to employ the so called divide and conquer and parallel and distributed strategies described in the coming sections.

2.1.2 Partitioned Matrix Multiplication

Matrix multiplications can be decomposed into smaller tasks through what is often referred to as Block Partitioning [13]. This strategy takes advantage of the associative and distributive properties of matrix multiplications in order to divide the two big matrices into groups of smaller sub-matrices. These sub-matrices are then multiplied together in an appropriate manner, before finally compiling the results of these sub-calculations together. See Figure 2.2 for an illustration. This process introduces some additional workload in the form of the initial partitioning, and then later reassembling the pieces. However, it allows us to distribute the sub-multiplications between different systems working in parallel, which is a popular way of dealing with problems arising from too large workloads.
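Written out, this is the standard block-product identity (a sketch, with I, J and K indexing blocks rather than individual entries):

    C_{IJ} = \sum_{K} A_{IK} B_{KJ}

Each term A_{IK} B_{KJ} is itself a smaller matrix multiplication, which is what allows the terms to be computed on different nodes and summed together afterwards.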

Figure 2.2: Block Matrix Partitioning. The figure illustrates how the 3 by 3 matrix A and the 3 by 2 matrix B can be multiplied together by dividing the original matrices into sub-matrices no bigger than 2 by 2.

2.1.3 Parallel matrix multiplication

Another property of matrix multiplications that can be taken advantage of when attempting to speed up the process is that the computations for different cells of the result matrix are independent of each other, as showcased by Figure 2.3. This means that the workload of populating the resulting matrix can be divided amongst several entities, such as processor cores.
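As an illustration of this independence, the rows of C can be filled in concurrently without any coordination between tasks. The sketch below uses Scala parallel collections (illustrative only; it assumes Scala 2.12, where .par is available without an extra import):

```scala
// Each row of C is computed by a separate task; no task reads another task's output,
// so the rows can be processed in any order and in parallel.
def multiplyParallel(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
  val m = a.length
  val k = a(0).length
  val n = b(0).length
  val c = Array.ofDim[Double](m, n)
  (0 until m).par.foreach { i =>       // rows handled in parallel
    for (j <- 0 until n) {
      var sum = 0.0
      for (p <- 0 until k) sum += a(i)(p) * b(p)(j)
      c(i)(j) = sum                    // each (i, j) cell is written exactly once
    }
  }
  c
}
```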

Figure 2.3: Depiction of an instance of matrix multiplication in progress. The figure illustrates the fact that the computation of each individual cell is not dependent on the others.

2.1.4 BLAS library

No matter which strategy one takes towards solving matrix multiplications, it is important that it is implemented efficiently, especially when performing large matrix multiplications on computers, where memory management is a key issue. The BLAS (Basic Linear Algebra Subprograms) library was initially developed in 1979 using Fortran. This library contains efficient implementations of linear algebra subroutines that have been maintained for the last 35 years, and it is implemented by many high profile vendors and libraries such as Intel's Math Kernel Library [14], Nvidia's cublas library [15], and the Netlib BLAS and CBLAS projects [16]. What makes these libraries special is the level of efficiency that they provide in all aspects. Aside from being written in low level languages and utilizing Single Instruction Multiple Data (SIMD) strategies, their main strength over a typical implementation of linear algebra subroutines is their cache efficiency [17]. BLAS libraries partition matrices (in the manner described in Section 2.1.2) by using block sizes that fit perfectly inside the processor's cache memory. By doing this, halting of the multiplication process for the sake of retrieving data from the main memory can be minimized. In order to be able to do this, however, the user needs to compile the library locally. This results in a native and system specific BLAS implementation optimized for the system of the end user [17].
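For reference, this is roughly what a call into such a library looks like from the JVM through the netlib-java interface that Spark itself uses (a hedged sketch; the operands are column-major double arrays, and which backend actually executes the call depends on what is installed on the system):

```scala
import com.github.fommil.netlib.BLAS

// Computes C := alpha * A * B + beta * C for A (m x k), B (k x n), C (m x n),
// all stored in column-major order as flat arrays.
val (m, n, k) = (2, 2, 2)
val a = Array(1.0, 3.0, 2.0, 4.0)   // column-major 2x2 matrix A
val b = Array(5.0, 7.0, 6.0, 8.0)   // column-major 2x2 matrix B
val c = new Array[Double](m * n)
BLAS.getInstance().dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)
// c now holds A * B in column-major order
```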

2.2 Graphical Processing Units

A Graphical Processing Unit (GPU) is a special purpose processor originally designed and optimized specifically for processing 3-dimensional images more efficiently than other forms of processors (such as a CPU). The type of tasks this might entail was briefly touched upon in Section 2.1. Simply put, the tasks that a GPU device specializes in are tasks where great amounts of parallel workload need to be performed with high throughput. Therefore, a GPU device possesses a massively parallel architecture consisting of thousands of small but efficient cores designed for handling multiple tasks simultaneously [2]. Figure 2.4 illustrates the difference between the architecture of a typical CPU device compared to that of a GPU.

Figure 2.4: Depiction illustrating the difference between the architecture of a CPU and a GPU.

2.2.1 GPU architecture

An important distinction between CPUs and GPUs is the architectural hierarchy of GPU devices. A simple program that is intended to run on a CPU typically contains a main function that runs serially on a single thread from start to

finish. Many threads executing different tasks or programs are juggled by a CPU core simultaneously. Modern CPUs contain a number of cores splitting up the workload. A typical GPU program, however, consists of a piece of code called a Kernel. A Kernel is similar to a main function in many ways. The big distinction, however, is that while only one instance of the main function is executed by a typical CPU program, there are usually hundreds of instances of Kernels executed when starting up a single GPU program. Each instance runs on a different thread.

Threads running on the GPU are grouped in what is referred to as blocks. A block can contain up to 512 or 1024 different threads depending on the GPU device. The significance of blocks is that threads within the same block communicate through what is called the shared memory, while communication between blocks must be performed through the global memory. The difference between the two is that the shared memory consists of 16 KB of memory, but can be accessed faster in comparison to the global memory, which usually consists of several GB of memory. The number of threads per block, and the number of blocks, can be configured. They should be chosen carefully in order for the program to run optimally [18].

2.2.2 CUDA

CUDA (Compute Unified Device Architecture) is a driver API that allows end users to write GPU-executable code. Using the CUDA API, a programmer can write a kernel code, which is executed simultaneously by all GPU threads. Nvidia has itself developed a couple of libraries using this API [19]. The most important ones for our purposes are named cublas and NVBLAS.

As described in the last section, on top of writing an efficient kernel code, several other factors such as the thread count and block size must be taken into consideration in order to produce a fast application. However, when it comes to Basic Linear Algebra Subprograms (BLAS), such as matrix multiplication, programmers can simply use the Nvidia developed cublas library. cublas is an implementation of BLAS (see Section 2.1.4) on top of the CUDA runtime. It contains system-optimized BLAS routines, and comes with a simple interface [15]. NVBLAS on the other hand is an even more context aware library built on top of cublas. It is intended to replace native BLAS libraries by intercepting computationally heavy BLAS calls to the CPU, and redirecting them to GPUs that are present in the system. NVBLAS can be further configured to let a certain

percentage of the calls slip through to the CPU, effectively sharing the workload between the CPU and the GPU [20].

2.2.3 GPU Limitations

The major drawback of GPUs is their slow access to the main memory. This introduces a significant penalty to their running time that is only made up for once the matrices reach a certain size, where their fast computation speed can make up for the lost time. Another major drawback of most GPU devices is that their fast computation speed only applies to single precision operations. This increased single precision performance comes at a great cost to their double precision operation speed. As an example, by utilizing the benchmarking application provided by [11] in their report, it was established that the Nvidia Quadro K620 (the device used in this thesis) has 30 times slower double precision performance compared to its single precision performance. Detailed results can be seen in Appendix B.

2.3 Spark

One way of tackling large workloads which consist of smaller, independent tasks is to distribute these smaller tasks amongst a number of separate systems. These systems are to perform their individual tasks and report the results back to some central entity which oversees the process. This type of configuration is commonly referred to as a cluster, and the individual devices that form the cluster are referred to as nodes. Clusters are often a cost-efficient way of increasing performance with higher availability than a single system that offers comparable results [21]. However, they are accompanied by certain complications. Transferring data between the individual nodes, for example, is a necessity that is not trivial to accomplish efficiently. Distributing the tasks among nodes is yet another non-trivial task. But perhaps one of the complications hardest to deal with is the prospect of failure or delays on a given individual node. The cluster needs to be able to detect a failing or straggling node and redirect the workload, perhaps recalculating data that was lost on the failing node.

Apache Spark is an open-source cluster-computing framework which handles all these complications in the background, and instead offers a simple interface for programming the entire cluster. Spark provides implicit data

parallelism and fault tolerance, meaning it automatically distributes tasks and data to nodes where optimal performance is expected, while also being able to handle failures or delays at any given node by rescheduling [22]. Spark clusters usually consist of one master node, and one or more slave nodes. The Master controls resources such as memory or processors available in the worker nodes. When a task is initiated in the cluster, a driver process is created. This Driver service splits the bigger tasks into smaller ones, and delegates these to the slave nodes [23]. Data is primarily stored in the main memory of the nodes in a special data structure called a Resilient Distributed Dataset, or RDD for short. RDDs can efficiently be stored on disk or transferred between nodes when the need arises. They are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators [24].

2.3.1 Spark data management

One of the main use cases of Spark, or clusters in general, is processing large amounts of data (big data). What is considered large can be subjective, but generally, the concept applies to any set of data too big for one system to handle efficiently. This subsection attempts to provide a brief overview of how Spark handles data of different sizes [25].

Let us consider a data set small enough to fit in the main memory of a single system. In this case, distributing the task is not necessary as far as memory management is concerned (although one still might want to do so in order to distribute the workload). Next, consider a data set too big for the main memory of one system, but small enough to fit in the main memory of several systems in the cluster. If we choose not to distribute the data, and instead process it on one system, we would be forced to keep some of the data on the hard drive, reading from or writing to disk when the need arises. If we choose to distribute the data, however, parts of the data will reside inside the main memory of the different systems. When one node requires data that it does not possess, it can request it from the node that does, or alternatively, have the other node execute the task for it. Through this approach, we avoid any transactions between disk and main memory.

In the third case, we consider data too big for the entire main memory of the cluster combined. In this situation, even when we distribute the data across several systems, we are still forced to store some of the data on disk. This causes

additional problems when system A requires data that system B possesses, but has currently placed on the disk. In this situation, system B might need to evict some of the data it is using to the disk in order to be able to process system A's request. This is a worst case scenario that sometimes cannot be avoided, but the damage can be controlled through Spark's storage level option, which dictates what Spark will do when data becomes too big for the main memory [25]. The options are:

MEMORY_ONLY: This storage level dictates that the data is to be stored as plain Java objects in the JVM. Should the system run out of memory, data is dropped from memory entirely, and recalculated anew on the fly later, should the need arise. This is the default storage level.

MEMORY_ONLY_SER: This storage level functions like the previous one, in the sense that data is dropped and then recalculated when needed. The difference between the two is that this option stores data in serialized form. This is generally more space-efficient than storing non-serialized objects, but more CPU-intensive.

MEMORY_AND_DISK: This storage level functions like the MEMORY_ONLY storage level, in the sense that data is stored in a non-serialized state. However, when data gets too big for the main memory, it is written to disk, rather than being dropped entirely.

MEMORY_AND_DISK_SER: This storage level functions like a mix between the MEMORY_ONLY_SER and MEMORY_AND_DISK storage levels. Data is stored in a serialized form, and when it gets too big for the main memory, it is written to disk.

2.3.2 Spark Resource management

The memory management of Spark is not as straightforward as alluded to in the previous section. Different nodes in the cluster may or may not possess different amounts of resources. Having the same task carried out on nodes with different amounts of resources might be problematic and cause complications. In order to avoid this, Spark performs its tasks through what it calls executors.

An executor is a virtual slave instance hosted by a node. Many instances of executors can be hosted by the same node, but all executors across all nodes possess the same amount of resources, such as RAM and CPU cores. The exact amount is specified by the user during execution. By default, each node hosts as many executors as possible, striving for an identical running environment for all executors while still taking advantage of as much of the node's resources as possible [25].

Figure 2.5 illustrates an example where a cluster consists of two nodes, one of them possessing 4 CPU cores and 16 GB of memory, the other 4 CPU cores and 12 GB of memory. By specifying that each executor is to have 2 CPU cores and 8 GB of memory, the user would receive three executors in total, two from the first node, one from the second node. However, 2 CPU cores and 4 GB of memory from the second node would go unused. It is therefore important to be careful when selecting the executor properties, or large parts of the cluster could end up being left unused.

Figure 2.5: Figure describing the process of generating homogeneous executors from resources available on a cluster.

There are some trade-offs to be made when deciding on the size of executors. For example, creating many small executors creates more opportunity for parallelism, but each executor would be weaker and not able to handle as large tasks as a larger executor could. Additionally, each executor comes with a certain amount of overhead, leaving even less resources for the actual tasks. When deciding on the resource division among executors, it is commonly recommended by the Spark community to leave a minimum of 1 core and 1 GB

of RAM untouched by Spark, in order to leave some resources behind for the OS and vital background processes [26]. Once an executor has received a certain amount of memory, it splits it up into different sections for different purposes. The two sections we are interested in are the storage memory and the execution memory. Storage memory is used for caching of data that is to be used, while execution memory is used for the actual computation. The main memory that was referred to in the previous section is in actuality only the storage memory portion of the memory that is allocated to each executor [25].

2.3.3 MLlib

When distributing data across a cluster, the data structure that is used is of importance. It is often a challenge to partition and distribute the data in an efficient manner. In our case, however, we can simply utilize standard data structures present in Spark's Machine Learning library. Spark's Machine Learning Library (MLlib) contains a vast array of procedures and data structures that are often used in machine learning contexts. In this project, we utilize the library's implementation of matrix representations, and its matrix multiplication interface. MLlib possesses several data structures to represent matrices with. Here, we briefly go over those that are designed for distributional purposes [27].

The simplest distributed matrix representations are RowMatrix and IndexedRowMatrix. These are simple collections of rows, where each row is represented by a standard vector. The difference between the two representations is that IndexedRowMatrix has meaningful row indices, while RowMatrix does not. Since the rows are represented by a standard vector, column lengths are limited by the Integer range. A more niche matrix representation is that of CoordinateMatrix. A CoordinateMatrix is a sequence of entries, where each entry consists of the row and column indexes (i, j) and a double value. This representation is intended for matrices that are huge and very sparse.

The final matrix data structure is called BlockMatrix. BlockMatrices are representations of the partitioned matrices mentioned in Section 2.1.2. A BlockMatrix consists of a series of blocks, which in turn consist of a block index (i, j) and a matrix. Figure 2.6 demonstrates how different types of matrices can be partitioned. Each gray block represents a portion of the matrix that can be assigned to a partition. Since a local matrix consists of one big block, it cannot be distributed

between partitions. A row matrix, however, has its rows distributed amongst the partitions, while the BlockMatrix has each block distributed. Of course, the partitioning and distribution of these matrices come at the price of additional overhead.

Figure 2.6: An illustration of how a matrix can be stored by Spark's different data structures. On the left, the matrix is not partitioned and is saved as a giant self contained unit. In the middle, each row of the matrix is stored separately. On the right, the matrix is divided and stored as uniformly smaller blocks. Source: [26]

The trade-off between these different matrix representations boils down to the relation between the increased opportunity for parallelism and the additional overhead that comes with separating the matrix into smaller pieces. Ideally, one wants to avoid a situation where some nodes are idle due to insufficient partitioning, while also avoiding a queue of smaller than necessary tasks on every node due to too much partitioning and overhead. Other factors, such as the memory limits of the nodes and the data transfer speed, should also be considered when deciding on how to best distribute a matrix. A detail about the MLlib library that is particularly important in our case is that MLlib has no support for single precision data structures at all. All the data structures mentioned above are designed to hold double precision values exclusively.
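For illustration, a BlockMatrix can be assembled from an RDD of block-index/local-matrix pairs and multiplied with another BlockMatrix as sketched below (a simplified sketch, not the exact code used in our tests; block counts and block sizes are placeholder values):

```scala
import java.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Builds a square BlockMatrix filled with random dense blocks.
def randomBlockMatrix(sc: SparkContext, blocksPerDim: Int, blockSize: Int): BlockMatrix = {
  val blocks = sc.parallelize(
    for (i <- 0 until blocksPerDim; j <- 0 until blocksPerDim)
      yield ((i, j), Matrices.rand(blockSize, blockSize, new Random()))
  )
  new BlockMatrix(blocks, blockSize, blockSize)
}

// val a = randomBlockMatrix(sc, blocksPerDim = 4, blockSize = 1000)
// val b = randomBlockMatrix(sc, blocksPerDim = 4, blockSize = 1000)
// val c = a.multiply(b)   // distributed block-wise multiplication
// c.blocks.count()        // an action that forces the computation
```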

2.4 Miscellaneous

There are a few miscellaneous libraries and techniques that can be utilized in order to optimize Spark jobs. This section contains a brief rundown of these subjects.

2.4.1 Netlib

The matrix multiplication routine of MLlib is not quite straightforward, due to the involvement of many different libraries and the licensing issues that are brought about as a result. The multiplications are handled by the Netlib-java library, which primarily attempts to perform them through a system specific library such as OpenBLAS, Intel MKL or Nvidia's cublas. Should that fail, a built-in native reference implementation written in Fortran is used. Should both of these approaches fail, a pure-Java fallback implementation is used at a great cost to performance [28]. The system specific libraries, as well as the Fortran Netlib implementation, are not included in Spark's MLlib due to licensing issues, and have to be downloaded, compiled and linked by the end user manually. The advantage of using these libraries is their high level of optimization. Optimized here refers to specialist assembly instructions being combined with compile time profiling and the selection of array alignments for the kernel and CPU combination [29]. All in all, the library is optimized for the machine that it is running on, rather than being generically tuned.

2.4.2 Native BLAS Libraries

As mentioned in Section 2.1.4, the BLAS library has been implemented by a series of optimized, native math libraries. Here, we quickly recap the two libraries most prominently used in this thesis. The cublas library is a device specific, GPU implementation of the BLAS routines. It is available free of charge as a part of the CUDA driver API provided by Nvidia. The NVBLAS library is built on top of the cublas library and dynamically reroutes certain BLAS calls to GPU devices of the system. The OpenBLAS library is a native CPU implementation of the BLAS library, compiled by the end user, resulting in a system optimized library. Being an open source project, it does not require a license and is compatible with any processor.
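A quick way to see which of these implementations netlib-java has actually picked up at runtime is shown below (a small diagnostic sketch; it is easy to end up on the pure-Java fallback by accident, so checking is worthwhile):

```scala
import com.github.fommil.netlib.BLAS

// Prints e.g. "com.github.fommil.netlib.NativeSystemBLAS" when a system library
// (OpenBLAS, MKL, or an intercepting NVBLAS) is loaded, or "...F2jBLAS" when the
// slow pure-Java fallback is in use.
println(BLAS.getInstance().getClass.getName)
```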

2.4.3 Garbage collection

Spark is a Java based framework, so naturally it is impacted by different JVM settings and environment variables. The settings we are particularly interested in are the garbage collection settings. By default, the heap space of a Java application is divided into three generations: the young, old and permanent generation. The young generation is in turn divided into three subsections named Eden, Survivor 1 and Survivor 2. When new data is created, it is initially placed in the Eden section. Every time a minor garbage collection is performed, data from Eden, together with any surviving data from Survivor 1, is moved to Survivor 2. The Survivor 1 and Survivor 2 regions are then switched. Data that has survived a number of minor garbage collections is moved to the Old generation section. Once the Old generation section is completely full, all threads of the application are suspended in order for a major garbage collection to take place.

Figure 2.7: Illustration of the different sections that the heap space of a Java application is divided into. Source: [30]

G1GC is a newer, alternative approach to garbage collection. In this approach, the heap is divided into equal sized heap regions. The regions are initially assigned similar roles as in the traditional approach, but the difference is that the sizes of the different sections are not fixed, but changed dynamically. When data is created, it is allocated in an available Eden area. When a minor garbage collection occurs, live objects are copied from one or more regions of the heap to a single region, and new free regions are assigned as Eden regions. A full GC occurs only when all regions hold live objects and no completely empty region can be found.
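In a Spark application, the collector used by the executors is selected through the executor JVM options. A hedged example of switching to G1GC (the flag is a standard HotSpot option and the config key is Spark's own; leaving it out keeps the JVM's default collector):

```scala
import org.apache.spark.SparkConf

// Request the G1 collector, plus basic GC logging, for all executor JVMs.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc")
```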

Figure 2.8: Illustration of how the G1GC garbage collection strategy divides the heap space into different sections. Source: [30]

2.5 Performance Optimization

As mentioned earlier, what we are interested in finding out in this report is how distributed matrix multiplications performed by the GPU scale compared to when they are performed by the CPU. As described in the previous chapter, when performing multiplications in a distributed environment, the actual multiplication is only a part of the workload. A speedup in the multiplication portion of the workload might not be accurately portrayed if the rest of the workload is performed inefficiently, causing noise in our measurements. In order to accurately measure the improvement that is brought about by the usage of GPUs, we therefore need to optimize the rest of the workload as much as possible.

Our application consists of three main components: the cluster that is managed by Spark, the processing unit that is performing the calculation, and the linear algebra libraries containing the implementation of the multiplications. The processing units, as well as the libraries that are to be used in the project, do not require any tuning, since they are already optimized out of the box. When it comes to Spark, however, the list of parameters that need tuning is rather long, and setting these parameters incorrectly can have a negative impact due to the reasons explained in the previous sections. Here is a quick summary of the parameters that need tuning:

Number of executors per node: As explained in Section 2.3.2, Spark performs its tasks through executors, which are virtual workers residing inside of a node. All executors on all nodes need to possess identical resources, and the user must decide whether to create many small executors, or fewer but bigger ones. Smaller

executors have the potential of speeding up the execution by providing more opportunity for parallelization, but they come at the price of additional overhead and less capability per executor.

Number of cores per executor: This parameter controls the number of cores that are supplied to each executor. With more cores, individual tasks have the potential of being completed faster. However, giving more cores to each executor limits the total number of executors that can be hosted by a single node. Additionally, the usage of CPU cores works slightly differently when the multiplications are being performed by a GPU device. Since Spark is not aware of any GPU devices, it assumes that each CPU core is performing the multiplications assigned to it independently of the other ones. But in actuality, all CPU cores are delegating their matrix multiplication tasks to a single GPU device. This might lead to a queue of tasks incoming to the GPU device, and ultimately means that Spark is making decisions regarding the delegation of workload and the capabilities of its resources on false assumptions. Finally, the usage of additional cores comes with additional overhead in the heap.

Size of each partition: As explained in Sections 2.1.2 and 2.3.3, data needs to be partitioned and distributed across the cluster in specific manners in order for Spark to be able to process it. The size of these partitions is chosen by the user, and has an impact on the performance of the program. Larger partitions mean less overhead but also less parallelism.

Data management strategy: As we multiply bigger and bigger matrices, we expect them to become too big for the main memory of the nodes at some point. As described in Section 2.3.1, Spark offers several alternatives for handling these scenarios:

Keep data in main memory only, drop some data when memory is full.

Keep data in main memory only, but in a serialized form in order to allow for more data to fit. Drop some data when memory is full.

Keep data in main memory, spill to disk when main memory is full.

Keep data in main memory, but in serialized form. Spill to disk when main memory is full.

Different options are expected to have different peaks and valleys when it comes to performance, depending on the size of the matrix.

Memory fraction ratio: This parameter specifies how much of the main memory is set aside for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records [25].

JVM parameters: Since Spark is a Java based application, JVM options such as different garbage collection strategies, the amount of heap space, etc. have an impact on performance.
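To make the list above concrete, these knobs map onto Spark configuration keys and the RDD storage level roughly as follows (an illustrative sketch with placeholder values, not the settings that were finally chosen in this thesis):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .set("spark.executor.memory", "12g")   // memory per executor
  .set("spark.executor.cores", "3")      // cores per executor; together with the worker's
                                         // resources this determines executors per node
  .set("spark.memory.fraction", "0.6")   // share of heap used for execution and storage
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  // JVM / GC parameters

// Partition size is governed by the BlockMatrix block size (rowsPerBlock / colsPerBlock),
// and the data management strategy by the storage level used when persisting, e.g.:
// someBlockMatrix.blocks.persist(StorageLevel.MEMORY_AND_DISK_SER)
```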

Chapter 3

Methodology

The goal of this project is to compare the scaling of GPU centric distributed matrix multiplication to the traditional CPU centric approach. As described in the last chapter, there are a lot of factors other than the multiplication speed that affect the running time when using Spark. If these other factors are sub-optimally configured, we might get inaccurate data when evaluating the scalability. In order to avoid this, the testing portion of the project is divided into two sections. In the first section, we attempt to evaluate the impact of these additional variables in order to find the optimal settings for our purposes. In the second portion, we use the optimal settings found in the first portion to measure the scaling of matrix multiplication more accurately. In this chapter, we first describe the hardware and software used in our testing, before describing the testing process itself.

3.1 Testing environment

The cluster used in our tests consists of 4 identical nodes. One of them is used as the master node in the Spark cluster, while the other 3 act as slaves. The hardware specs of these machines are as follows:

16 GB of RAM

Seagate Desktop HDD 1TB 64MB Cache SATA 6.0Gb/s

Nvidia Quadro K620, released July of 2014 (the master node does not require a GPU device)

Intel(R) Core(TM) i GHz, 4 cores, released May of 2014

All three nodes run CentOS 6, but the testing environment has been successfully installed on Fedora 25 and Ubuntu 16 systems according to the instructions given in Appendix A. Additionally, all nodes are connected together through the same router, minimizing network latency.

3.1.1 Setup

In this section, we go through the preparatory work required for successfully running the tests in the coming sections. While most of these steps might seem straightforward, complications and irregularities should be expected. As there is a lack of detailed instructions on these subjects elsewhere online, detailed instructions on how to replicate the author's testing environment can be found in Appendix A for those interested in replicating the experiments.

Below is an ordered list of actions that are required for setting up the necessary native libraries and Nvidia drivers. They need to be performed on all slave nodes.

Install the latest Nvidia CUDA driver. Full instructions can be found on Nvidia's webpage.

Download the source code for the CBLAS and Native BLAS libraries from netlib.com.

Compile into a shared library (.so) using GCC version 4.8 or higher.

Install the latest Liblapack and OpenBLAS libraries, either through a package manager or compiled from source.

Link the installed libraries together.

Configure NVBLAS to use the correct native BLAS library as fallback.

Spark also needs to be compiled from source; this needs to be done on all nodes.

Clone the source code for Apache Spark from the official repository.

Using Maven, build the Spark engine from the source code using the -Pnetlib-lgpl flag.

Once this has been done, Spark jobs can be submitted as usual. Refer to the Spark Quick Start guide for further instructions.

3.2 Optimization Testing

This section describes the tests performed in order to evaluate the impact that different variables have on the running time of our matrix multiplications. The purpose of these tests is to figure out the optimal environment for matrix multiplication jobs. The tests are run with different parameters such as dimensions, block size etc., but the general structure of the tests is always as follows:

Each test case consists of a number of iterations. In each iteration, two matrices are created and filled with randomized values. In the first iteration, the matrices consist of a given number of blocks. The number, as well as the block size, is specified through a parameter. In each following iteration, the dimensions of the matrices grow by a constant margin specified through a parameter. The test continues until either a single iteration takes longer than 20 minutes, or the program crashes due to hardware limitations (Java heap size).
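A condensed Scala sketch of such a test loop is given below (illustrative only; the actual test program differs in details such as argument parsing and logging), and Algorithm 1 below states the same procedure as pseudocode:

```scala
import java.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Repeatedly multiplies two random square BlockMatrices of growing size and
// reports the average running time over 10 multiplications per size.
def runTest(sc: SparkContext, blockSize: Int, initialBlocks: Int, intervalBlocks: Int): Unit = {
  var blocksPerDim = initialBlocks
  var avgSeconds = 0.0
  while (avgSeconds < 20 * 60) {                   // stop once an iteration exceeds 20 minutes
    def rand() = new BlockMatrix(
      sc.parallelize(for (i <- 0 until blocksPerDim; j <- 0 until blocksPerDim)
        yield ((i, j), Matrices.rand(blockSize, blockSize, new Random()))),
      blockSize, blockSize)
    val (a, b) = (rand(), rand())
    val times = for (_ <- 1 to 10) yield {
      val start = System.nanoTime()
      a.multiply(b).blocks.count()                 // count() forces the lazy multiplication
      (System.nanoTime() - start) / 1e9
    }
    avgSeconds = times.sum / times.length
    println(s"Runtime for size ${blocksPerDim * blockSize}: $avgSeconds s")
    blocksPerDim += intervalBlocks
  }
}
```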

Algorithm 1 Square Matrix Multiplication
1: procedure MULTIPLICATION(blockSize, initialSize, interval)
2:     rows ← initialSize / blockSize
3:     cols ← rows
4:     while r < 20 minutes do
5:         MatrixA ← BlockMatrix(blockSize, rows, cols)
6:         MatrixB ← BlockMatrix(blockSize, rows, cols)
7:         MatrixA.fill(Float.Random())
8:         MatrixB.fill(Float.Random())
9:         r ← 0
10:        for i ← 1 to 10 do
11:            StartTime ← CurrentTime
12:            MatrixA.multiply(MatrixB)
13:            RunTime ← CurrentTime - StartTime
14:            r ← r + RunTime
15:        r ← r / 10
16:        print("Runtime for size " + rows + ": " + r)
17:        rows ← rows + interval
18:        cols ← cols + interval

3.2.1 Partition Testing

The first variable whose impact needs to be tested is the size of the partition. It is suspected that multiplications of matrices consisting of, for example, 4 large blocks will have a different running time than the same matrices divided into 16 smaller blocks. In order to test this hypothesis, we run the test programs described in the previous section. Block sizes of 500, 1000, 2000, 4000 and 8000 will be used when testing GPU multiplication, and 50, 250, 500, 1000, 2000, 4000 and 8000 when testing CPU multiplication. The reasoning behind the choice of the block sizes for the GPU is that the minimum block size deemed big enough for NVBLAS to intercept is 400x400; the rest of the block sizes were picked as they are round multiples on both sides of the default block size and allow for easy comparisons. The Spark executor settings used in these tests are 12 GB of memory and 3 cores per executor, resulting in 1 executor per node for a total of 3 executors. The storage level used is MEMORY_ONLY. The effect of these additional settings is assumed to be equalized between the different tests, as their effect on the

running times is assumed to be similar.

3.2.2 Executor testing

There are two additional variables related to executors that are suspected to affect both the running time, and perhaps the memory usage, of the program.

The number of cores per executor. The more cores an executor has, the more work it can perform in parallel through the CPUs. However, when performing multiplications through the GPU, the main parallel workload of the application is performed by one single GPU device, which is shared between all cores and executors. We should therefore benefit less from an increased number of cores in our case than in most typical CPU-centric programs. The purpose of this test is to find out whether the diminished efficiency boost that additional CPU cores bring in our case is worth the additional overhead of working with several cores.

The number of executors per worker. The more executors per node, the more potential for parallelization. However, creating more executors per slave node would require the resources of the worker to be divided among these executors, which might lead to problems with, for example, memory size. Additionally, there being only 1 GPU device shared among all executors is also a factor to consider.

In order to test the impact of these variables, we run the standard test program described in Section 3.2 with the following executor configurations, using square matrices with the default block size:

1 executor per node, 1 core and 12 GB of memory per executor

1 executor per node, 2 cores and 12 GB of memory per executor

1 executor per node, 3 cores and 12 GB of memory per executor

3 executors per node, 1 core and 4 GB of memory per executor

3.2.3 Memory Management Testing

There are two additional Spark configuration options related to memory management that we suspect affect not only the running time, but also the maximum matrix size we can multiply.

The storage level. As described in Section 2.3.1, different storage levels cause memory spilling to be handled with different strategies. In this test, we attempt to deduce the impact of these different storage level options on both the running time and the size of matrices that we can multiply, by running the previously mentioned test with the default block size. Data gathered from this test is expected to showcase the effect of the different storage levels as the matrices grow too large for the main memory.

Spark.memory.fraction value. This value decides the fraction of memory used for execution and storage. The lower this is, the less memory is set aside for storage purposes, and as a result spills and data evictions will be more frequent. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records [25]. A lower value is expected to cause longer runtimes, but allow bigger matrices to be multiplied. The impact of this variable, and the trade-off made between running time and matrix dimension, is explored for the preferred storage level of the previous test.

3.2.4 Garbage Collection

The final configuration that we are concerned with is the JVM garbage collection strategy. The two different garbage collection strategies described in Section 2.4.3 are tested on square matrices using the so far established optimal settings.

3.3 Scalability Testing

Once the tests described in the previous section have been performed, we evaluate the results of these tests and establish the optimal settings and environment for the matrix multiplication jobs. Once the optimal settings have been established, we evaluate the scalability of both the CPU and GPU implementations. This is done by performing multiplications of different sizes with optimal settings. These series of tests are run twice, once with only 2 nodes in the cluster, once with 3 nodes, giving us data to evaluate the scalability of the application both in terms of input size and cluster size. More details about this phase of the testing are given in Section 4.2.

3.3.1 Spark & Single Precision Operations

As mentioned in Section 2.2.3, most GPU devices have very poor double precision performance. Additionally, at the time of writing this report, Spark does not have any single precision support for its distributed matrix data structures. This makes GPU devices perform very badly with an out of the box MLlib implementation. Spark is an open source framework, however, so a custom version of Spark's MLlib library with single precision support has been created for the purpose of our tests. The source code can be found in the following repository.
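The difference matters already at the BLAS interface: single precision work goes through the sgemm routine on float arrays, double precision through dgemm on double arrays, and it is the former that most GPUs execute quickly. A small local illustration of the two calls through netlib-java (a sketch, independent of the custom MLlib build mentioned above):

```scala
import com.github.fommil.netlib.BLAS
import scala.util.Random

val n = 1024
val blas = BLAS.getInstance()
val rng = new Random()

// Double precision: C := A * B on double[] operands (what MLlib's data structures require).
val ad = Array.fill(n * n)(rng.nextDouble())
val bd = Array.fill(n * n)(rng.nextDouble())
val cd = new Array[Double](n * n)
blas.dgemm("N", "N", n, n, n, 1.0, ad, n, bd, n, 0.0, cd, n)

// Single precision: the same product on float[] operands via sgemm; on most GPUs this is
// the fast path when NVBLAS intercepts the call.
val af = Array.fill(n * n)(rng.nextFloat())
val bf = Array.fill(n * n)(rng.nextFloat())
val cf = new Array[Float](n * n)
blas.sgemm("N", "N", n, n, n, 1.0f, af, n, bf, n, 0.0f, cf, n)
```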

Chapter 4

Results

In this chapter, the results of the tests described in the previous chapter are presented. The first section contains the results of the environment optimization tests, while the second section contains the scalability test results.

4.1 Optimization Test Results

The results of the optimization tests are grouped into four sections and presented below. Section 4.1.1 contains the results of the data partitioning tests, which tested the optimal sizes for partitions. Section 4.1.2 contains the results of the core and executor tests, which test the optimal hardware division between the executors. Section 4.1.3 contains the results of Spark's memory management tests, which indicate the optimal storage level and memory fraction values. Finally, Section 4.1.4 contains the results of the garbage collection tests, which compared the impact of two different garbage collection strategies on performance.

4.1.1 Data Partitioning

As indicated by Figure 4.1, our tests seem to suggest that larger block sizes are considerably faster when compared to smaller block sizes. This applies to both GPU-based multiplications utilizing NVBLAS and CPU-based multiplications utilizing OpenBLAS. However, as can be seen from the running time of the 8000 unit block size, a block size too large to allow proper partitioning causes severe running time penalties. This can be seen best when multiplying with a block size of 8000, which prevents proper partitioning between our three nodes.

Figure 4.1: Graph depicting the results of the partition tests of the OpenBLAS implementation (bottom) and the NVBLAS implementation (top). The different categories represent the different block sizes used when partitioning the matrices.

4.1.2 Cores & Executors

Our tests indicate that additional CPU cores speed up the running time of both Spark's GPU and CPU backed matrix multiplication jobs by approximately 10% (although this value fluctuates rather heavily, as can be seen in Figure 4.2). This speedup comes at the price of a significant heap overhead. As can be seen in Figure 4.2 below, each additional core that is added causes the program to crash earlier through a Java heap space exception.

Figure 4.2: Graph depicting the results of the CPU core tests on the OpenBLAS implementation (bottom) and the NVBLAS implementation (top). Each category shows the running time of utilizing a certain number of CPU cores.

Furthermore, our tests indicate a further speed increase when utilizing a higher number of smaller executors compared to fewer but larger executors. As indicated by Figure 4.3, 3 executors utilizing 1 core and 4 GB of memory each perform around 5% faster than a single executor with 3 cores and 12 GB of memory. This speed increase, once again, comes at the price of additional overhead which causes the program to crash earlier.

Figure 4.3: Graph depicting the results of the Spark Executor tests on the OpenBLAS implementation (left) and the NVBLAS implementation (right). A comparison between utilizing a single executor with 3 cores and 12GB of RAM, and 3 separate executors with 1 core and 4GB of RAM.

4.1.3 Memory Management

The results of our storage level tests, depicted in Figure 4.4, show similar behaviour for both the CPU and the GPU. As can be expected, both implementations show a gradual change from the MEMORY_ONLY option being preferable at lower matrix sizes, to the MEMORY_ONLY_SER option being optimal somewhere in the middle, and the MEMORY_AND_DISK_SER option being optimal at the highest matrix sizes we could test.

Figure 4.4: Graph depicting the results of the storage level tests on the OpenBLAS implementation (bottom) and the NVBLAS implementation (top).

When examining the different memory fraction levels, the results of the GPU and CPU implementations were also quite similar.

In both cases, the lower fraction levels allowed the program to process much bigger matrices; the maximum matrix size grew noticeably when going from the highest fraction level we tested to the lowest. When it comes to the running time, however, the tests showed only vague signs of impact from the different fraction levels on the lower-dimension multiplications. Once the dimension of the matrices reached a high enough point, a clear relation could be seen between the running time and the fraction level chosen. This is best observed in the last three categories of Figure 4.5. The turning point seems to be around the point where the higher fraction levels start to fail.

Figure 4.5: Graph depicting the results of the memory fraction value tests on the OpenBLAS implementation (bottom) and the NVBLAS implementation (top).
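Both settings discussed in this section are plain Spark knobs; the sketch below (illustrative values, not the exact thesis configuration) shows where each one is applied:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    // Lowering spark.memory.fraction shrinks Spark's managed execution/storage
    // region; in our tests this let the cluster handle larger matrices at a small
    // cost to running time.
    val conf = new SparkConf().set("spark.memory.fraction", "0.3") // illustrative value

    // The storage level is chosen per RDD; for the largest matrices the serialized
    // disk-backed variant was preferable.
    def persistBlocks(m: BlockMatrix): BlockMatrix = {
      m.blocks.persist(StorageLevel.MEMORY_AND_DISK_SER)
      m
    }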

4.1.4 JVM options

Our garbage collection tests indicate that the default garbage collection strategy of the JVM is far superior to the G1 GC strategy for both the CPU and the GPU implementation. Additionally, the results indicate that the negative impact of using G1 GC is far greater on the CPU than on the GPU. As can be seen in Figure 4.6, the running time for the final category of the CPU implementation test almost doubled when switching from the default strategy to G1 GC.

Figure 4.6: Graph depicting the results of the garbage collection tests on the OpenBLAS implementation (left) and the NVBLAS implementation (right).
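The garbage collector is switched through the executor JVM options; a minimal sketch of the two variants compared in Figure 4.6 (standard Spark and JVM flags) is:

    import org.apache.spark.SparkConf

    // Default strategy: leave the executor JVM options untouched, so the JVM's
    // default collector is used.
    val defaultCollector = new SparkConf()

    // Alternative tested above: explicitly request the G1 collector on executors.
    val g1Collector = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")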

4.2 Scalability Testing

In this section, we first evaluate the results of the previous tests and describe how the scaling test is to be performed based on these evaluations. Later, we present the results of the scalability testing when running with these settings.

4.2.1 Optimal Environment Evaluation

The following is a summary of our findings during the optimization tests:

Garbage Collection: The results for this test were quite clear; the default garbage collection strategy is preferred by both our implementations.

Memory Fraction: Our tests indicate that the spark.memory.fraction value did not meaningfully impact the smaller matrix multiplications in either the GPU or CPU implementation. However, it was discovered that the limit of what a cluster can handle, in terms of matrix dimensions, can be pushed further by lowering the memory fraction value at a small cost to runtime.

Storage Level: The tests indicated that the storage level should gradually be changed from MEMORY_ONLY, to MEMORY_ONLY_SER, to MEMORY_AND_DISK_SER as the matrices grow.

CPU cores and executors: The tests indicated that the addition of each core and executor marginally increases the speed at which the matrices can be multiplied, but also drastically lowers the maximum size of matrices the cluster can handle.

Block Size: The tests clearly indicated that a large block size is advantageous over smaller ones. The block size should therefore be selected to be as large as possible without leaving idle nodes.

The scalability test is conducted by multiplying matrices with dimensions starting at 5000, incremented by 5000 until further increments are no longer possible. The cluster is configured to run the multiplication as fast as possible based on the observations listed above.

This means that the multiplications are attempted with the configurations yielding the fastest results. If such configurations lead to a crash, we tone down the variables that cause early crashes, such as the number of CPU cores or the memory fraction level. When toning down variables that cause crashes, it seems reasonable to start with the number of CPU cores, then the memory fraction level, and finally the block size. This is because our results indicate that a higher number of CPU cores causes crashes at much earlier stages. The block sizes are lowered only when absolutely necessary, since our results seem to indicate that they have the greatest impact on the running time of the application.

4.2.2 OpenBLAS Scaling

Figure 4.7 depicts the results of the scaling test performed on the OpenBLAS implementation. Using the configurations described in Figure 4.8, we were able to multiply matrices up to the dimensions shown in Figure 4.7 with two nodes, and larger ones with three nodes.

Figure 4.7: The graph showcases the results of the scalability test on clusters consisting of two and three slave nodes utilizing OpenBLAS.

Figure 4.8: Table describing the Spark configuration used when running the scalability test on two (top) and three (bottom) nodes.

The results seem to indicate that the multiplication speed scales linearly with minimal overhead cost. Figure 4.9 visualizes a comparison between the measured running time for the three-node cluster and the expected running time if the two-node running time scaled perfectly, with no overhead cost, when expanding to three nodes. The expected running time for three nodes is calculated as 2/3 of the time it takes for two nodes. Additionally, as Figure 4.7 depicts, the addition of a third node allows the application to process larger matrices.
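Written out, the projection used for the three-node curve is simply

\[ T_3^{\mathrm{expected}}(n) = \tfrac{2}{3}\, T_2(n) \]

where T_2(n) denotes the measured two-node running time for matrices of dimension n.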

Figure 4.9: Graph depicting the similarity between the running time of the scalability test when utilizing three nodes and the running time projected based on the results of the tests run on the cluster containing two nodes.

4.2.3 NVBLAS Scaling

Figure 4.10 depicts the results of the scaling tests performed on the NVBLAS implementation. The configurations used can be found in Figure 4.11.

Figure 4.10: The graph showcases the results of the scaling test on clusters consisting of two and three slave nodes utilizing NVBLAS.

Figure 4.11: Table describing the Spark configuration used when running the scalability test on two (left) and three (right) nodes.

Similar to the results of the OpenBLAS test, the NVBLAS test results showcase the ability to multiply larger matrices, and a linear scaling with minimal overhead cost, when expanding the cluster from two to three nodes. The projection was calculated in the same manner as described in the previous section, and the results are visualized in Figure 4.12.

Figure 4.12: Graph depicting the similarity between the running time of the scalability test when utilizing three nodes and the running time projected based on the results of the tests run on the cluster containing two nodes.

4.2.4 Comparison Results

Based on the results described in Sections 4.2.2 and 4.2.3, the following graph comparing the performance of the two implementations can be produced.

Figure 4.13: Graph comparing the performance of the CPU and the GPU implementation.

Chapter 5

Discussion

In this chapter, the report is concluded by first discussing the results from the previous chapter. We then resolve our research questions, and finally look back at our methodology in hindsight and discuss potential areas of interest for future work.

5.1 Speculations and Conclusions

Based on our test results described in Chapter 4, we can make some speculations and draw some conclusions about the behaviour and scalability of both GPU- and CPU-based matrix multiplication on Spark clusters.

5.1.1 Performance

The two processing units used in our tests, namely the Nvidia Quadro K620 GPU and the Intel CPU in our nodes, were released at roughly the same point in time with similar price tags. A comparison between the performance of these two devices is therefore reasonable. As Figure 4.13 illustrates, the GPU's running time increased at almost half the rate of the CPU's. This indicates that the superior calculation speed that GPUs showcase in local environments also extends to a distributed environment, despite the fact that utilizing GPUs with Spark requires wrappers. Furthermore, our results in Section 4.1.2 showed that exploiting the CPU's full capabilities by using all CPU cores severely lowers the maximum size of matrices that Spark can handle.

It should, however, be noted that the superior performance of the GPU only applies to single precision calculations. It should also be kept in mind that Spark does not support distributed single precision linear algebra out of the box. In order to utilize GPUs with Spark at their full potential, a custom-made version of the MLlib library with single precision support must be created.

5.1.2 Cluster Scaling Comparison

As shown in Chapter 4, both the CPU and the GPU exhibited perfectly linear scaling when extending the cluster from two nodes to three nodes. This suggests that, according to the results at least, there is virtually no difference between how the CPU and GPU implementations scale with the addition of a third node. Whether this perfect scaling continues past the third node is not something we can answer in this thesis based on our results. However, we can assert that our tests indicate that the choice of native BLAS library does not seem to affect the scaling of the cluster in a significant way. Likewise, the cluster's capacity for multiplying larger matrices was affected similarly for both implementations.

5.1.3 Conclusion Summary

The points below summarize the conclusions mentioned in this section:

Most GPU devices are not compatible with Spark out of the box, due to Spark's lack of support for single precision distributed data structures combined with most GPUs' comparatively low double precision throughput.

If single precision support is added to the Spark framework, however, then the superior throughput and better scaling of the GPU can be taken advantage of for faster matrix multiplications even in a distributed environment such as Spark.

The GPU and the CPU implementations of native BLAS libraries did not significantly affect the scalability of our cluster when expanding from two to three nodes, suggesting that one could expect the running time and memory capacity of an application utilizing GPU-based BLAS routines to scale similarly to an application utilizing CPU-based BLAS routines when new nodes are introduced to the cluster.

5.2 Resolving Research Questions

Our research questions were stated in Section 1.4 as "How can Spark be configured to run matrix multiplications as efficiently as possible?" and "How do distributed matrix multiplications performed on Apache Spark scale (with regard to variables such as running time, different input sizes and cluster size), if the multiplications are performed by GPU devices rather than CPU devices?". Based on the conclusions from Section 4.2.1, the answer to the first question is as follows:

The most impactful factor for the speed at which the multiplications are performed is the block size chosen to partition the matrix by. The matrix should preferably be divided into submatrices that are as large as possible, while still allowing the number of submatrices to be a multiple of the number of executors on the cluster.

The running time is also impacted by the number of executors and CPU cores involved in the computation. Our results indicated that many smaller executors have the potential to perform faster than a few larger executors. However, having more executors is more memory intensive and causes the application to crash sooner. The same goes for the number of CPU cores.

The running time of the multiplications is affected by the storage level of the RDDs. The optimal storage level varies depending on matrix size and should be discovered individually for the problem at hand. However, a general rule of thumb seems to be to prefer serialization for larger matrix sizes.

In the event that the cluster is not able to process a matrix of a particular size due to Out Of Memory exceptions, one can increase the capacity of the cluster in three ways: The first and most efficient option is to lower the amount of memory spent on overhead. This is best done by lowering the number of executors or CPU cores. The second option is to lower the workload that needs to be performed during each task. This is done most efficiently by lowering the block size, which gives more opportunities for garbage collection to kick in and free up some memory before the start of the next task.

The final option is to increase the memory allocated for computations at the expense of storage space. This is done by lowering the spark.memory.fraction value. However, this increases the frequency at which data is spilled to disk.

The second question can be answered as follows, based on the conclusions from the previous section:

When it comes to the scaling of the running time as a function of the cluster size, our tests showed the running time of the multiplication being linearly reduced by the addition of a third node. This reduction was virtually identical for both the CPU-based and the GPU-based implementation, indicating that utilization of a GPU-based native BLAS library either does not impact the scalability of a cluster, or affects it at the same rate as the utilization of a CPU-based native BLAS library would.

When it comes to the scaling of the running time as a function of the input size, it was discovered that GPUs do in fact perform better than their CPU counterparts when performing matrix multiplication in a distributed environment, despite the penalties and limitations that such an environment brings. It is once again emphasized, however, that the above statement only applies if single precision support is made available.

5.3 Methodology and Results Discussion

The time and resource limitations of this project have led to shortcomings in the methodology, which in turn have no doubt affected the results. In this section, we try to identify these shortcomings and assess their impact on the results.

Long running times and high variance: Each individual test described in Chapter 3 lasted approximately 11 hours. The long running time is a result of repeating each measurement 10 times, in order to use the average measurement as the final result. However, the individual measurements had a very high variance of sometimes up to 20%.

The variance was not caused by any external factors, but is believed to be caused by a combination of the randomized matrices and any random decisions made by Spark. A solution to this high variance would be to increase the number of times a test is repeated, in order to lessen the potential for skewed averages, but this was not an option due to the long running time of the tests and the limited time with access to the hardware.

Impact of specific hardware used: The CPU and GPU devices used in this project are not the only components with an impact on the results. Since data is often spilled to, and fetched from, disk, the hard drive that is used affects the running time as well. Additionally, since the different nodes communicate over the local network, the router and network settings in general have an impact on the running time of the tests.

Small number of nodes: The hardware limitations only allowed the tests to be performed on a cluster with up to three nodes. This allowed us to gather data on how the tests scaled from a cluster with two nodes to a cluster with three nodes. There is, however, no guarantee that this scaling factor holds for expansion past three nodes.

In summary, the high variance in measurements may have caused slightly exaggerated or skewed results. The specific hardware that was used, together with the low number of nodes in the cluster, might also have caused the results to be too specific to the setup used in this report and not applicable to a different setup using other hardware. It can be argued that it is very unlikely that the vast difference between the approaches depicted in Figure 4.13 would be significantly affected by the slightly skewed mean values. However, the results being too specific to the hardware used in this thesis is a possibility that might be worth looking into.

5.4 Future work

There are a couple of current and upcoming technologies not mentioned in this report that might have an impact on our results.

IBM Spark GPU Enabler:

Spark GPU Enabler, an open-source project by IBM, is an effort to bring GPU capabilities into the Spark framework, which is currently not GPU aware. The project aims to allow users to run their own kernels through the Spark framework, which would eliminate the need for wrappers and interception of calls.

YARN GPU support: YARN is a resource management tool sometimes used in Spark environments. It offers great control and flexibility to Spark jobs by allowing the user to manage devices individually. GPU support for YARN is currently being looked into.

NVLink: NVLink is a work in progress by Nvidia. It is a high-bandwidth interconnect that enables fast communication between CPU and GPU. It is promised to allow the GPU to read from main memory at the same speed as a CPU does, and to allow communication between several GPU devices on the same system at five times that speed.

Kryo serialization: Kryo serialization is a serialization method claimed to be faster than the default Java solution. Utilizing this serialization method might have an impact on the running time of the application, and the level of impact could be investigated; a minimal configuration sketch is given at the end of this section.

Larger scale testing: The number of nodes used in the tests of this project is rather low. A more in-depth and conclusive assessment of the subjects discussed in this report could be attained by performing the tests on a larger cluster.

Single precision MLlib: As explained previously in this report, the latest official version of Spark at the time of writing does not contain any single precision support. A very useful continuation of this project would be the implementation of single precision support for the Apache Spark framework.
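For reference, switching Spark to Kryo is a one-line configuration change; the sketch below uses standard Spark property names, and whether it actually helps for these workloads is exactly what would need to be measured:

    import org.apache.spark.SparkConf

    // Sketch: use Kryo instead of the default Java serialization.  Registering
    // the classes that are serialized most often can improve it further.
    val kryoConf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")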

5.5 Summary

In this report, we attempted to figure out how matrix multiplications distributed on an Apache Spark cluster, and delegated to GPU devices, scale with large inputs and growing clusters. This was done by first providing information about the libraries, frameworks and technologies used in such a setup, before starting the testing process. The testing process consisted of an environmental optimization phase, where the cluster was adjusted to perform optimally, followed by a scalability testing phase, where the actual scaling of the setup was tested.

Through the results of the optimization phase, we found that the size of the partition is arguably the most important variable to fine-tune in applications such as ours. Additionally, variables like the storage level, spark.memory.fraction, and the resources given to executors were found to affect both the running time and the memory capacity of the application to varying degrees.

The results of the scalability tests indicated that the capabilities of the cluster grow similarly for both the CPU and the GPU implementation of native BLAS libraries when additional nodes are introduced to the cluster. Additionally, it was concluded that GPU devices do not perform well with an out-of-the-box instance of the Spark engine, due to factors such as the relatively low double precision throughput of most GPU devices and the lack of support for single precision distributed matrix data structures in Spark. However, if single precision support is implemented for Spark, GPU devices do indeed scale significantly better with the input size when compared to their CPU counterparts.

Bibliography

[1] OpenGL. Tutorial 3: Matrices. www.opengl-tutorial.org/beginners-tutorials/tutorial-3-matrices. [Online; accessed 04-April-2017].
[2] Nvidia. What is GPU computing? what-is-gpu-computing.html. [Online; accessed 04-April-2017].
[3] Zhiyi Yang, Yating Zhu, and Yong Pu. Parallel image processing based on CUDA. In Computer Science and Software Engineering, 2008 International Conference on, volume 3. IEEE.
[4] Jeremy Fowers, Greg Brown, Patrick Cooke, and Greg Stitt. A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '12, pages 47-56, New York, NY, USA. ACM.
[5] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In High Performance Computing, Networking, Storage and Analysis, SC International Conference for. IEEE.
[6] Mikhail Smelyanskiy, David Holmes, Jatin Chhugani, Alan Larson, Douglas M. Carmean, Dennis Hanson, Pradeep Dubey, Kurt Augustine, Daehyun Kim, Alan Kyker, et al. Mapping high-fidelity volume rendering for medical imaging to CPU, GPU and many-core architectures. IEEE Transactions on Visualization and Computer Graphics, 15(6).
[7] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, et al. Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Computer Architecture News, 38(3). ACM.
[8] Andy Keane. GPUs are only up to 14 times faster than CPUs, says Intel. https://blogs.nvidia.com/blog/2010/06/23/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel/. [Online; accessed 04-April-2017].
[9] Peilong Li, Yan Luo, Ning Zhang, and Yu Cao. HeteroSpark: A heterogeneous CPU/GPU Spark platform for machine learning algorithms. In Networking, Architecture and Storage (NAS), 2015 IEEE International Conference on. IEEE.
[10] Yuan Yuan, Meisam Fathi Salmi, Yin Huai, Kaibo Wang, Rubao Lee, and Xiaodong Zhang. Spark-GPU: An accelerated in-memory data processing engine on clusters. In Big Data (Big Data), 2016 IEEE International Conference on. IEEE.
[11] Reza Bosagh Zadeh, Xiangrui Meng, Alexander Ulanov, Burak Yavuz, Li Pu, Shivaram Venkataraman, Evan Sparks, Aaron Staple, and Matei Zaharia. Matrix computations and optimization in Apache Spark. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
[12] Thomas H. Cormen. Introduction to Algorithms. MIT Press.
[13] Howard Eves. Elementary Matrix Theory (reprint ed.).
[14] Intel Math Kernel Library (Intel MKL). en-us/intel-mkl. [Online; accessed 04-April-2017].
[15] Nvidia. [Online; accessed 04-April-2017].
[16] netlib-java. BLAS (Basic Linear Algebra Subprograms). [Online; accessed 04-April-2017].
[17] Robert A. Van De Geijn and Enrique S. Quintana-Ortí. The Science of Programming Matrix Computations.
[18] Nvidia. CUDA basics. www.nvidia.com/docs/IO/ /sc11-cuda-c-basics.pdf. [Online; accessed 04-April-2017].
[19] Nvidia CUDA Documentation. CUDA. index.html#axzz4za8nwykp. [Online; accessed 04-April-2017].
[20] Nvidia. Nvidia NVBLAS documentation. nvblas/#axzz4amwzoku3. [Online; accessed 04-April-2017].
[21] David A. Bader and Robert Pennington. Applications. The International Journal of High Performance Computing Applications, 15(2).
[22] Apache Spark Documentation. Cluster Mode Overview. apache.org/docs/latest/cluster-overview.html. [Online; accessed 04-April-2017].
[23] Apache Spark. Cluster Mode Overview. latest/cluster-overview.html. [Online; accessed 04-April-2017].
[24] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association.
[25] Apache Spark. Spark Documentation: Configuration. apache.org/docs/latest/configuration.html. [Online; accessed 04-April-2017].
[26] Patrick Pisciuneri. How Target Performance Tunes Machine Learning Applications. [Online; accessed 04-April-2017].
[27] Apache Spark Documentation. Data Types - RDD-based API. https://spark.apache.org/docs/latest/mllib-data-types.html. [Online; accessed 04-April-2017].
[28] Benjamin Herta. Improving BLAS library performance for MLlib. [Online; accessed 04-April-2017].
[29] Sam Halliday. netlib-java. [Online; accessed 04-April-2017].
[30] Oracle. Java Garbage Collection Basics. [Online; accessed 04-April-2017].
[31] Spark. Building Spark. building-spark.html. [Online; accessed 04-April-2017].

Appendices

Appendix A

Installation instructions

1. A GCC version of at least 4.8 is required; in this report, version 5.1 was used. Versions prior to 4.8 contain a libgfortran library that is incompatible with the netlib-java wrappers [?]. For distributions such as Ubuntu, a newer GCC version should be installed by default, and if not, downloading a newer version of GCC should be trivial. This report, however, used CentOS version 6.5, where GCC needed to be built from source by entering the following commands in the command prompt [?]:

    sudo yum install svn texinfo-tex flex zip libgcc.i686 glibc-devel.i686
    mkdir ~/sourceInstallations
    cd ~/sourceInstallations
    svn co svn://gcc.gnu.org/svn/gcc/tags/gcc_5_1_0_release/
    cd gcc_5_1_0_release/
    ./contrib/download_prerequisites
    cd ..
    mkdir gcc_5_1_0_release_build/
    cd gcc_5_1_0_release_build/
    ../gcc_5_1_0_release/configure
    make
    sudo make install

Once a newer version of GCC is installed, the path to the new GCC should always be in the system path. It can be added by executing this command:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/new/gcc/lib

2. The latest version of Spark can be cloned from the official repository by entering the following into the command prompt:

    git clone git@github.com:Arash-s/spark-Single-precision-LinAlg.git

Once cloned, it is required to configure Maven to use more than the usual amount of memory by entering the following line into the command prompt [31]:

    export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=2048m"

Finally, build a netlib-enabled implementation of Spark from source using the following command line:

    ./build/mvn -Pnetlib-lgpl -DskipTests clean package

3. Next, we need to download and build our desired implementation of BLAS.

3.1. OpenBLAS

i. The source of OpenBLAS is to be downloaded and compiled following the commands below:

    git clone https://github.com/xianyi/OpenBLAS.git
    cd OpenBLAS
    git checkout v
    wget http://sourceforge.net/projects/slurm-roll/files/addons/ /rpms/pb-binutils- x86_64.rpm
    sudo yum localinstall pb-binutils- x86_64.rpm
    export PATH=/opt/pb/binutils-2.24/bin:$PATH
    make
    sudo make PREFIX=/opt/OpenBLAS install

This will compile and install OpenBLAS in the /opt/OpenBLAS directory.

ii. Enter the following lines in order to link the OpenBLAS libraries so that netlib can detect them:

    sudo cp /opt/OpenBLAS/lib/libopenblas.so /opt/OpenBLAS/lib/liblapack.so.3
    sudo cp /opt/OpenBLAS/lib/libopenblas.so /opt/OpenBLAS/lib/libblas.so.3

iii. Add the following line to the spark-env.sh file located in the conf folder in Spark's directory:

    export LD_LIBRARY_PATH=/usr/local/bin:/opt/OpenBLAS/lib

3.2. Nvidia NVBLAS

i. First, the latest Nvidia CUDA drivers should be downloaded and installed. This project used CUDA version 8.0. Installation steps vary greatly depending on the OS, but they are all well documented. The installation files and instructions can be found on Nvidia's webpage.

ii. Second, the CBLAS and BLAS headers from the Netlib library must be manually compiled, since they are not included in NVBLAS. To do this, enter the following commands into the command prompt:

    wget http://www.netlib.org/blas/blas.tgz
    tar xzvf blas.tgz
    cd BLAS
    make
    wget http://www.netlib.org/blas/blast-forum/cblas.tgz
    tar xzvf cblas.tgz
    cd CBLAS

Open the Makefile.in file and make the following changes:

    BLLIB = /path_to_compiled_blas/blas_linux.a
    CBLIB = ../lib/cblas_$(PLAT).so
    CFLAGS = -O3 -DADD_ -fpic
    FFLAGS = -O3 -fpic
    ARCH = gcc
    ARCHFLAGS = -shared -o

Finally, enter the following commands in order to compile the code and link the libraries in a manner that netlib can access them:

    make
    ln -s /path/to/cblas/lib/cblas_linux.so /path/to/cblas/lib/libblas.so.3

iii. NVBLAS requires a native CPU-based BLAS library to use as a fallback for BLAS calls that are not supported by NVBLAS. If one was installed as described in the previous steps, skip this step. Otherwise, enter the following commands to install a generically tuned version of OpenBLAS and LAPACK. As these will not be used at all in our GPU implementation of matrix multiplication, the generically tuned ones will do fine.

    yum install lapack openblas
    ln -s /lib64/liblapack.so

iv. Create a file named nvblas.conf containing the configuration example found in the NVBLAS documentation. Change the following line so that it correctly points to your native BLAS implementation:

    NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so

v. Add the following lines to the spark-env.sh file located in the conf folder in Spark's directory:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/bin:/home/nvidia/libs/CBLAS/lib:/usr/local/cuda/lib64:/usr/lib64
    export NVBLAS_CONFIG_FILE=/usr/local/cuda/lib64/nvblas.conf
    export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so

Appendix B

Local Single vs Double Precision



More information

The Processor Memory Hierarchy

The Processor Memory Hierarchy Corrected COMP 506 Rice University Spring 2018 The Processor Memory Hierarchy source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved.

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

Operating Systems Design Exam 2 Review: Spring 2011

Operating Systems Design Exam 2 Review: Spring 2011 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu 1 Question 1 CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply Recent desktop computers feature

More information

Technical Paper. Performance and Tuning Considerations for SAS on Dell EMC VMAX 250 All-Flash Array

Technical Paper. Performance and Tuning Considerations for SAS on Dell EMC VMAX 250 All-Flash Array Technical Paper Performance and Tuning Considerations for SAS on Dell EMC VMAX 250 All-Flash Array Release Information Content Version: 1.0 April 2018 Trademarks and Patents SAS Institute Inc., SAS Campus

More information

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 WHITE PAPER Apache Spark: RDD, DataFrame and Dataset API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 Prepared by: Eyal Edelman, Big Data Practice Lead Michael Birch, Big Data and

More information

Disk Scheduling COMPSCI 386

Disk Scheduling COMPSCI 386 Disk Scheduling COMPSCI 386 Topics Disk Structure (9.1 9.2) Disk Scheduling (9.4) Allocation Methods (11.4) Free Space Management (11.5) Hard Disk Platter diameter ranges from 1.8 to 3.5 inches. Both sides

More information

Heap Management. Heap Allocation

Heap Management. Heap Allocation Heap Management Heap Allocation A very flexible storage allocation mechanism is heap allocation. Any number of data objects can be allocated and freed in a memory pool, called a heap. Heap allocation is

More information

CS 416: Opera-ng Systems Design March 23, 2012

CS 416: Opera-ng Systems Design March 23, 2012 Question 1 Operating Systems Design Exam 2 Review: Spring 2011 Paul Krzyzanowski pxk@cs.rutgers.edu CPU utilization tends to be lower when: a. There are more processes in memory. b. There are fewer processes

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

COSC3330 Computer Architecture Lecture 20. Virtual Memory

COSC3330 Computer Architecture Lecture 20. Virtual Memory COSC3330 Computer Architecture Lecture 20. Virtual Memory Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston Virtual Memory Topics Reducing Cache Miss Penalty (#2) Use

More information

Software within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual

Software within building physics and ground heat storage. HEAT3 version 7. A PC-program for heat transfer in three dimensions Update manual Software within building physics and ground heat storage HEAT3 version 7 A PC-program for heat transfer in three dimensions Update manual June 15, 2015 BLOCON www.buildingphysics.com Contents 1. WHAT S

More information