Regression Modelling of Power Consumption for Heterogeneous Processors. Tahir Diop


Regression Modelling of Power Consumption for Heterogeneous Processors

by

Tahir Diop

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto.

© Copyright 2013 by Tahir Diop

Abstract

Regression Modelling of Power Consumption for Heterogeneous Processors
Tahir Diop
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2013

This thesis is composed of two parts, both relating to parallel and heterogeneous processing. The first describes DistCL, a distributed OpenCL framework that allows a cluster of GPUs to be programmed like a single device. It uses programmer-supplied meta-functions that associate work-items with memory. DistCL achieves speedups of up to 29× using 32 peers. By comparing DistCL to SnuCL, we determine that the compute-to-transfer ratio of a benchmark is the best predictor of its performance scaling when distributed. The second part is a statistical power model for the AMD Fusion heterogeneous processor. We present a systematic methodology to create a representative set of compute micro-benchmarks using data collected from real hardware. The power model is created with data from both micro-benchmarks and application benchmarks, and it showed an average predictive error of 6.9% on heterogeneous workloads. The Multi2Sim heterogeneous simulator was modified to support configurable power modelling.

Dedication

To my wife and best friend, Petra.

Contents

1  Introduction
     Contributions
     Organization

2  Background
     GPU Architecture: AMD Evergreen, Nvidia Fermi
     Fusion APU: CPU, GPU
     Programming Models: OpenCL, CUDA
     Simulators: Multi2Sim, GPGPUSim
     SciNet

3  Distributing OpenCL kernels
     Background
     DistCL: Partitioning, Dependencies, Scheduling Work, Transferring Buffers
     Experimental Setup: Linear Compute and Memory, Compute-Intensive, Inter-Node Communication, Cluster, SnuCL
     Results and Discussion
     Performance Comparison with SnuCL
     Conclusion

4  Selecting Representative Benchmarks for Power Evaluation
     Power Measurements
     Micro-benchmark Selection: Memory Benchmarks, Compute Benchmarks
     Conclusion

5  Power Modelling
     Background
     Selecting Benchmarks: Micro-Benchmarks, Application Benchmarks
     Measuring Hardware Performance Counters
     Multi2Sim Simulation
     Modelling
     Conclusion

6  Power Multi2Sim
     Epochs
     Using Power Modelling: Configuration, Runtime Usage, Reports
     Validation
     Conclusion

7  Conclusion and Future Work  82

Bibliography  84

Appendices  92
     A  Clustering Details  93
     B  Multi2Sim CPU Configuration Details  95

List of Tables

2.1  AMD A6-3650 Specification
Benchmark Description
Cluster Specifications
Measured Cluster Performance
Execution Time Spent Managing Dependencies
Execution Time Spent Managing Dependencies
Benchmark Performance Characteristics
Data Acquisition Unit Specifications
ACS711 Current Sensor Specifications
AMD Fusion Cache Specification
Possible Factor Values for Benchmarks
Operation Groupings
Sensitivity Scores for the CPU
Sensitivity Scores for the GPU
Application Benchmarks Used
Instruction Categories
CPU Configuration Summary
Memory Latency Comparison
Memory Configuration
GPU Model Coefficients
CPU Model Coefficients
GPU Model Coefficients
CPU Model Coefficients
APU Model Coefficients
A.1  Most Common Property per Cluster for the CPU
B.1  CPU Configuration Details

List of Figures

2.1  AMD Evergreen based streaming processor
AMD Evergreen based SIMD core
Nvidia Fermi based CUDA core
Vector's 1-dimensional NDRange is partitioned into 4 subranges
The read meta-function is called for buffer a in subrange 1 of vector
Speedup of distributed benchmarks using DistCL
Breakdown of runtime
HotSpot with various pyramid heights
DistCL and SnuCL speedups
DistCL and SnuCL compared relative to compute-to-transfer ratio
Idle power measurements done using the DI
Idle power measurements done using the DI
MSI A75MA-G55 motherboard schematic [63]
Schematic of the measuring setup
A picture of the measuring setup in action
An example of a stack used to store the order of recent memory accesses
Energy consumption of ALU benchmarks on the CPU
Energy consumption of ALU benchmarks on the GPU
Frequency of cluster sizes from the CPU results
Frequency of property being the most common in a cluster
Percentage of benchmarks in a cluster that share the most common property
Steps involved in the modelling process
Comparison of the literal and best memory configurations
The regression process
Fitting error of the training benchmarks for the CPU models
Fitting error of the training benchmarks for the GPU models
Fitting error of the validation benchmarks for the CPU models
Fitting error of the validation benchmarks for the GPU models
Linear regression of workloads at various frequencies
Predicted and true values for the total energy of the Rodinia benchmarks
Measured power consumption of back propagation on real hardware
6.2  Simulated power consumption of back propagation using Multi2Sim

List of Abbreviations

ABI     Application Binary Interface
AGU     Address Generating Unit
ALU     Arithmetic Logic Unit
APU     Accelerated Processing Unit
AMD     Advanced Micro Devices
API     Application Programming Interface
ATX     Advanced Technology Extended
CMP     Chip Multi-Processor
CPU     Central Processing Unit
DAQ     Data AcQuisition
DSE     Design Space Exploration
DSM     Distributed Shared Memory
DSP     Digital Signal Processor
DVFS    Dynamic Voltage and Frequency Scaling
EPI     Energy Per Instruction
FPU     Floating Point Unit
FU      Functional Unit
GND     GrouND
GPU     Graphics Processing Unit
GPGPU   General Purpose Graphics Processing Unit
HPC     High Performance Computing
IPC     Instructions Per Clock
ILP     Instruction-Level Parallelism
ISA     Instruction Set Architecture
ISP     Image Signal Processor
MLP     Memory-Level Parallelism
NOC     Network On-Chip
OoO     Out-of-Order
PCIe    Peripheral Component Interconnect Express
PMU     Performance Monitoring Counters
PSU     Power Supply Unit
OS      Operating System
SATA    Serial Advanced Technology Attachment
SC      SIMD Core
SFU     Special Function Unit
SIMD    Single Instruction Multiple Data
SIMT    Single Instruction Multiple Thread
SMT     Simultaneous Multi-Threading
SoC     System on Chip
SP      Streaming Processor
SPU     Stream Processing Unit
SSE     Streaming SIMD Extensions
TLP     Thread-Level Parallelism
VLIW    Very Long Instruction Word
VRM     Voltage Regulator Module

Chapter 1

Introduction

Since the introduction of microprocessors in the 1970s, their processing power has increased exponentially. Performance increases are due to increasing transistor counts and increasing clock frequency. However, in modern processors, power density constraints have led to a dramatic slowing of frequency increases. To keep power density in check, multicore designs have been introduced. This allowed continued increases in performance, by increasing the number of cores instead of frequency [1]. Graphics processing units (GPUs) are highly parallel processors, and they have recently seen an explosion of use for parallel workloads. This has led to the creation of GPU programming frameworks, such as CUDA [2] and OpenCL [3], which allow GPUs to be programmed with an emphasis on general purpose computation, rather than graphics work. As general purpose GPU (GPGPU) computation has gained broader acceptance, GPUs have started to be included in compute clusters [4]. However, current GPGPU programming frameworks do not assist in programming clusters of GPUs, so multiple programming models must be combined. One programming model manages the cluster environment by transferring data between nodes and assigning work to different GPUs, and another is responsible for programming the GPUs themselves. To take advantage of a GPGPU, an application must be highly parallelizable, which is not always the case. This has led to an increased use of accelerators and the emergence of heterogeneous architectures. These architectures combine multiple specialized processors that are particularly fast for a limited set of applications. Most of the processors shipped by Intel and AMD in the consumer space today are heterogeneous [5][6]. They include central processing unit (CPU) and GPU cores. In the ultra-mobile space, we see even more convergence with entire systems on a chip (SoC) [7]. These systems are highly heterogeneous and contain a CPU, GPU, digital signal processor (DSP), image signal processor (ISP), video decoder/encoder, and/or wireless controllers. As transistor sizes keep shrinking and power limits do not, we are entering the age of dark silicon [8], where a chip contains more transistors than it can reasonably power. This will make heterogeneous architectures even more attractive, since we can spare the area and desperately need the power savings associated with less general-purpose hardware. This thesis is composed of two main parts: DistCL [9], a distributed OpenCL framework that allows a cluster of GPUs to be programmed as if it were a single device, and a power model for a heterogeneous

12 Chapter 1. Introduction 2 processor. While at first glance these two parts may appear unrelated, they are part of a larger project focusing on heterogeneous computing. The project seeks to investigate the best way to schedule work on a heterogeneous processor. One of the difficulties of using heterogeneous processors is that there are very few programming models that are common to the different types of processors. OpenCL is a heterogeneous programming model that aims to make heterogeneous processors programmable using a single framework. Dividing up an OpenCL application so that it may run on a heterogeneous system is in many ways similar to dividing it up to run on a cluster. In both cases, we must ensure that program correctness is preserved even though the work itself is being divided up amongst multiple processors. This involves determining the memory dependencies required for each part of the work and ensuring that this memory is made available to the correct device. In order to do this efficiently, it is important to understand the overheads involved and the trade-offs that can be made. In this thesis, two frameworks that allow OpenCL kernels to be distributed across a cluster, with varying degrees of programmer involvement, are compared. This allows us to gain better insights into how performance scales for distributed OpenCL applications. Before one can determine how to best distribute work on a heterogeneous processor, it is necessary to define a metric for best. Possible metrics include shortest runtime, minimum energy consumption to complete a given task, or maximum performance within a given power envelope. Simulators can easily be used to determine which approach produces the lowest runtime, but since there are no publicly available power models of a heterogeneous processor, it is impossible to assess the other two metrics. A heterogeneous power model must take into account not only the power consumption of individual processors, such as CPUs and GPUs, but also that of shared resources, such as memory controllers. Developing a power model allows architects to investigate what hardware configurations are best using similar metrics. To this end, power modelling capabilities were added to the Multi2Sim [10] heterogeneous architecture simulator. A power model of the AMD Fusion accelerator processing unit (APU) was created and tested within Multi2Sim. A statistical approach was used to create the power model and to determine which micro-benchmarks are necessary to create a valid model. Multi2Sim was then configured using this model and its power modelling capabilities were validated. This approach is not unique to the Fusion and could be used to create similar power models for any desired hardware. 1.1 Contributions The major contributions of this work are: 1. An analysis of performance scaling for distributed OpenCL kernels using two approaches. 2. A systematic methodology to create a representative set of power micro-benchmarks using data collected from real hardware. 3. Creating the first power model for a CPU/GPU heterogeneous processor. 4. Adding configurable power modelling capabilities to the Multi2Sim heterogeneous architecture simulator.

1.2 Organization

The thesis is organized as follows: Chapter 2 provides background that is common to the thesis as a whole. This includes background on GPU architecture, the Fusion heterogeneous APU, OpenCL and competing programming models, current GPGPU simulators, and the computing infrastructure used to conduct experiments. Further chapter-specific background will also be provided at the beginning of each chapter. Chapter 3 describes the DistCL framework and compares it to SnuCL [11], another framework that allows GPU clusters to be programmed. Chapter 4 describes the power measuring setup that was used when measuring power consumption on the Fusion. This chapter also describes how measured power data from over 1600 benchmarks was used to create a representative set of power benchmarks containing fewer than 350 benchmarks. Chapter 5 describes the process used to create a power model for the Fusion APU. This includes describing how Multi2Sim was configured to simulate the APU and explaining the regression analysis used to create the actual model. Chapter 6 explains how Multi2Sim was modified to support power modelling. It also describes the approach used and how users of the simulator can take advantage of this new feature. Finally, Chapter 7 summarizes the conclusions and insights of this work and discusses the future work that it makes possible.

Chapter 2

Background

This chapter provides background to this work as a whole. Where necessary, subsequent chapters will provide chapter-specific background sections. Section 2.1 describes current GPU architectures and contrasts AMD and Nvidia designs. Section 2.2 describes the architecture of the Fusion APU. Section 2.3 introduces OpenCL and CUDA, which are frameworks for writing and executing heterogeneous and GPGPU programs. Section 2.4 discusses the simulators used in GPU architecture research and introduces Multi2Sim, the heterogeneous simulator used in this work. Finally, Section 2.5 describes the SciNet [12] computing clusters that were used for this work.

2.1 GPU Architecture

Modern CPUs are highly versatile processors that are optimized for low-latency, single-threaded computation [13]. Features such as out-of-order (OoO) execution, branch prediction, superscalar designs, and large caches help achieve this goal. While these features increase performance, they come at a price: increased complexity, area, and power consumption. These factors limit the number of cores in a single chip multi-processor (CMP). On the other hand, GPUs focus on maximizing throughput and reducing core area, to fit as many cores as possible onto a single chip. This comes at the expense of architectural efficiency, latency, and per-core memory bandwidth. A GPU core is a single instruction multiple thread (SIMT) pipeline [14]. AMD calls these SIMD cores, while Nvidia calls them CUDA cores. SIMT, a variation on single instruction multiple data (SIMD), groups multiple threads together into wavefronts (AMD terminology) or warps (Nvidia terminology), which execute the same instructions in lock step. SIMT allows threads to take divergent branches. To ensure that the threads keep executing in lock step, they may need to take both branches; to maintain correctness, each thread will only write back the result of the appropriate branch. Such computation of unnecessary results is one way architectural efficiency is reduced. Due to limitations on the number of I/O pins per chip, GPUs have limited per-core memory bandwidth [15]. To make the most of this limited bandwidth, they sacrifice latency to increase throughput: the memory controller will try to group multiple requests into a single larger contiguous request to reduce the number of memory accesses.

Due to the in-order nature of GPUs and the high memory latencies, cores typically run multiple wavefronts simultaneously. A wavefront waiting on a memory request to be filled will be swapped out for one that is ready to run. To get high architectural efficiency, each core generally requires hundreds of simultaneously executing threads. For traditional graphics workloads, where a GPU needs to produce frames for a display at a resolution of, say, 1920x1080, this is not an issue. Each pixel can be an independent thread, meaning that there are over 2 million threads per frame, giving the GPU plenty of wavefronts to choose from. However, extracting this level of parallelism out of general purpose applications is not always trivial. The rest of this section takes a closer look at GPU architectures. Section 2.1.1 describes the AMD Evergreen architecture, which is found in the AMD Fusion APU that was used for this work and can be simulated using Multi2Sim. Section 2.1.2 describes the competing Fermi architecture from Nvidia. Fermi GPUs are available on SciNet and can be simulated using GPGPUSim [16]. Some of the naming conventions of AMD and Nvidia that are introduced in the following sections can be confusing when presented concurrently. The reader should focus on the AMD naming convention, as the remainder of this work will use it.

2.1.1 AMD Evergreen

The AMD Evergreen micro-architecture, or 5000 series, specifies the design of the SIMD core, the associated memory controllers, and rastering hardware. The Evergreen architecture uses a very long instruction word (VLIW) instruction set architecture (ISA) [17], which influences the design. Starting at the lowest level, individual threads are mapped onto streaming processors (SPs). Each SP is composed of five stream processing units (SPUs), named x, y, z, w, and t, as well as a register file, as shown in Figure 2.1. The first four SPUs are able to perform simple integer and floating point operations, including multiplication, and load/store operations. The t, or transcendental, SPU is more complex and has additional functionality: it can perform all the remaining complex operations, such as division, trigonometric operations, and square roots. This design maps very well to pixel shading, as the x, y, z, and w SPUs can calculate the three colour components and the transparency of a pixel, while the t SPU handles the more complex lighting operations. Instruction-level parallelism (ILP) is required to keep all the SPUs busy. In the Evergreen ISA, each instruction clause can contain up to five separate calculations, one for each SPU. If the ILP is less than five, some SPUs will remain idle. Since this ILP must be expressed in the machine code, it must be discoverable at compile time to be taken advantage of.

Figure 2.1: AMD Evergreen based streaming processor.

A SIMD core (SC) is composed of sixteen SPs, as shown in Figure 2.2.

The SIMD core is the smallest unit to which work can be assigned, as all the SPs in the SC will be executing in lock step. The SC can be assigned multiple wavefronts. Each wavefront contains 64 threads, which are split into four groups of sixteen and always run concurrently. The scheduler switches between wavefronts to hide high-latency events. The SIMD core also contains 32 kB of shared memory, which can be used to store data that will be shared among SPs. It is much faster than main memory, and it is used for OpenCL's local memory. The texture unit is normally responsible for applying graphic textures, but in GPGPU computations it is used to make global memory reads. The texture cache is not used as a data cache when performing GPGPU computations, because textures are read-only.

Figure 2.2: AMD Evergreen based SIMD core.

An entire Evergreen GPU consists of one to twenty SCs and one to four memory controllers. There is a global read-only data share, which can be used in GPGPU programming as constant memory. Other shared resources include the thread dispatch scheduler, which assigns wavefronts to SCs.

2.1.2 Nvidia Fermi

A direct competitor to AMD's Evergreen-based GPUs is Nvidia's Fermi-based, or 400 series, line of GPUs. Nvidia does not use a VLIW architecture, so there are significant differences in its design. Again starting from the bottom, individual threads are mapped onto shader processors (SPs).¹ Each SP can handle simple integer and floating point operations, including multiplication, similar to AMD's SPUs, but unlike them it cannot perform load or store operations. There are separate load/store units, as well as special function units to handle more complex operations.

¹ For the remainder of this section, SP will refer to a shader processor. Thereafter, this abbreviation will return to meaning an AMD streaming processor.

These are all combined into a CUDA core (CC).

Figure 2.3: Nvidia Fermi based CUDA core.

Figure 2.3 shows the components of a CC. It is composed of 32 SPs, 16 load/store units, and 4 special function units (SFUs). The heterogeneity in execution resources is seen at the CC level rather than at the streaming-processor level, as in AMD's design. Just like an SC, a CC will execute a single instruction at a time. This means 32 threads for most instructions, 16 threads for load/stores, and 4 threads for complex operations at a time. The Fermi architecture makes no effort to take advantage of ILP and instead focuses on data-parallelism by including more SPs per CC. Nvidia's CC scheduler can also be assigned multiple warps; it interleaves them to hide high-latency operations. A major difference of the CC, compared to the SC, is that it contains both a shared memory and an L1 data cache. The shared memory and cache are 64 kB in total, and can be configured as either a 16 kB shared memory and a 48 kB cache or as a 48 kB shared memory and a 16 kB cache. The texture cache cannot be used for GPGPU computing. Due to the differences in the hardware, AMD and Nvidia GPUs should be programmed differently. When programming the AMD GPU, it is important to ensure that the program contains ILP to take advantage of all the SPUs; if this is not done, it is possible to obtain as little as 20% of the available performance. With the AMD GPU it is also more important to make use of the shared memory, as there is no data cache. However, due to the simpler architecture of AMD's SPs, it is possible to fit more of them into a given area. This means one can get more performance per dollar; the challenge, however, is unlocking it all.
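To make this difference concrete, the sketch below shows the same element-wise computation written with scalar and with vector data-types. It is an illustrative kernel written for this discussion, not one of the benchmarks used later in this work: the int4 version hands the compiler four independent operations per work-item, which the VLIW Evergreen design can pack into a single instruction clause, while the scalar version leaves most of an SP's five SPU slots idle.

    /* Illustrative only: scalar vs. vectorized form of the same computation. */
    __kernel void scale_add_scalar(__global const int *a, __global const int *b,
                                   __global int *out)
    {
        int i = get_global_id(0);
        out[i] = 3 * a[i] + b[i];        /* one operation stream per work-item   */
    }

    __kernel void scale_add_vec4(__global const int4 *a, __global const int4 *b,
                                 __global int4 *out)
    {
        int i = get_global_id(0);        /* each work-item now covers 4 elements */
        out[i] = 3 * a[i] + b[i];        /* four independent lanes per work-item */
    }

On a Fermi GPU, which does not try to exploit ILP within a thread, this rewrite matters far less.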

2.2 Fusion APU

This work presents a power model developed for the AMD Fusion A6-3650, a Llano APU [5][18][19]. The Fusion APU is a heterogeneous processor that contains four CPU cores, four GPU cores, and a shared memory controller. The CPU is based on the family 12h [20] architecture and the GPU is based on the Evergreen architecture. The APU's specifications are summarized in Table 2.1. The rest of this section describes the CPU architecture and the specific GPU configuration that the Fusion employs.

Table 2.1: AMD A6-3650 Specification

Component                  Value
CPU cores                  4
CPU architecture           family 12h
CPU operating frequency    2.6 GHz
L1 I-cache                 64 KB per core
L1 D-cache                 64 KB per core
L2 cache                   1 MB per core
GPU cores                  4
GPU architecture           Evergreen
GPU operating frequency    443 MHz
Streaming processors       16 per core
Stream processing units    5 per streaming processor
Memory controller          Dual-channel DDR3
TDP                        100 W

2.2.1 CPU

The CPU in the A6-3650 is an OoO, x86, four-core family 12h processor, which is based on the K8 64-bit architecture. It is a three-wide architecture, with three integer and three floating point pipelines and support for SSE instructions. Each integer pipeline contains a scheduler, an integer arithmetic-logic unit (ALU), and an address generation unit (AGU). Each pipe also handles one of the following types of instructions: multiplication, division, or branch instructions. The floating point pipelines are all different, but share a single scheduler. While some instructions can be handled by multiple pipes, in general one is responsible for simple arithmetic, another is responsible for complex arithmetic, and the last one is responsible for load and store operations. The floating point unit (FPU) executes x87 and SSE instructions. SSE instructions allow the CPU to take advantage of data parallelism by executing vector instructions. This is done by operating on 128-bit registers. In SSE terminology, these registers are packed with smaller data-types. The number of operands that can be packed depends on the data-type; it is possible to pack two 64-bit values, four 32-bit values, and up to sixteen 8-bit values into an SSE register. A register does not need to be fully packed in order to perform operations; for example, it is possible to use an SSE instruction to operate on three 32-bit floats. The OpenCL vector data-types, such as int4, map directly to packed SSE instructions when executed on the CPU. The OpenCL compiler also uses SSE instructions, instead of x87 instructions, to perform any floating point operations.

19 Chapter 2. Background 9 The processor includes two levels of cache and an integrated memory controller shared with the GPU. Each core has private 64 kb L1 instruction and same size data caches, as well as a private 1 MB unified L2 cache. The caches use an exclusive design so the L2 cache is a victim cache and does not contain any data found in any L1 caches GPU The GPU in the APU is based on the Evergreen micro-architecture. It contains four cores, each made up of sixteen streaming processors. See Section for more details on the Evergreen architecture. When programming Evergreen GPUs with OpenCL, it is possible to obtain much better performance when using vector data-types. This is because the compiler interprets a vector operation as a group of independent operations. Consider the example val = a + b. If the values are of type int, this is a single operation that will be executed by a single SPU in an SP, for 20% of the maximum throughput of the SP. However, if instead they are of type int4, then this can be treated as four independent additions, assigned to four SPUs in a single SP, which means 80% of the SP s throughput is utilized. Therefore, by using vector data-types, it possible to increase the architectural efficiency of the GPU considerably. The Fusion s architectural details are important for two reasons. First, in Chapter 4 we need to understand the code the OpenCL compiler produces, so we know how to write kernels to target specific hardware components. Second, we need to know this information in Chpater 5 so we can configure Multi2Sim to approximate the Fusion as closely as possible. 2.3 Programing Models There are two common frameworks available GPGPUs computing: OpenCL [3] and CUDA [2]. OpenCL is an open heterogeneous programing standard maintained by the Khronos group that can be used not only to program GPUs but many other types of devices as well. This section will explain OpenCL in detail, as it is needed in Chapter 3 to understand how OpenCL kernels can be transparently distributed. This explanation also helps in understanding how OpenCL is used in Chapter 4 to create micro-benchmarks. OpenCL is also briefly contrasted with CUDA, which is used to program Nvidia GPUs. It is also used under the hood by SnuCL [11] and by GPGPUSim [16] OpenCL OpenCL is a framework that allows the programming of heterogeneous processors. OpenCL programs have two major components: the host program and the kernels. The host program is normal C or C++ code and runs on the CPU. It is the code that makes calls to the OpenCL application programming interface (API), manages the devices on which kernels will be executed, and launches kernels. The kernels consist of functions written in OpenCL C, a C-99 derivative, and can run on any device that supports OpenCL. Kernels usually contain algorithms to be run by a GPU or other accelerator, but they can also be run on a CPU.

20 Chapter 2. Background 10 OpenCL represents hardware components in a hierarchy. At the top of the hierarchy we find the OpenCL platform. The platform is made up of host processors and one or more compute devices. Compute devices are hardware accelerators that execute OpenCL kernels. A compute device is composed of one or more compute units. Work is scheduled at the compute unit level, but a compute unit can be further subdivided into one or more processing elements. Each processing element will run a single thread of execution. On x86 CPUs, compute units map to cores, and each compute unit contains a single processing element, the core itself. On Evergreen GPUs compute units map to SCs and processing elements map to SPs. This hardware hierarchy informs both OpenCL s memory and programing models. OpenCL supports three distinct types of memory: global, local, and private. Global memory is shared at the device level. All the compute units in a single device share the same global memory, however, it is not guaranteed to be consistent. Reads and writes to and from global memory can be executed in any order as long as calls from a single compute unit remain ordered. This means that compute units cannot communicate through global memory. OpenCL also supports constant memory, which is essentially read-only global memory. Local memory is per compute unit memory. This allows processing elements within a compute unit to communicate using local memory. Local memory is smaller and faster than global memory. Private memory is per-processing-element memory. This memory can be used by individual threads to store private data. On CPUs there is no distinction between the different types of memory at the hardware level. However, the device driver limits the size of local and private memory such that they fit into the L1 cache. On GPUs the different types of memory usually map to different physical memories. For Evergreen GPUs the register file is used for private memory, the per SC shared memory is used for local memory, the global data share is used for constant memory, and main memory is used for global memory. The host and compute devices do not have a shared address space, so OpenCL provides buffer objects. Buffers are host allocated in either device or host memory. OpenCL provides API calls to manipulate these buffers which handle pointer marshalling and copying when necessary. Copies between buffers and host memories are explicit. Local memory requirements must be static. This is because the local memory requirements may limit how many work-groups can be simultaneously assigned to a compute unit. The OpenCL programming model follows a hierarchy similar to that of the memory. When work is assigned to a compute device, a kernel and NDRange must be specified. The kernel is an OpenCL function. The NDRange specifies the number of threads, or work-items in OpenCL, that will run the kernel. The range can be one, two, or three dimensional and can be thought of as a Cartesian space. Work-items are identified by their position in the range using a unique global-id, which is its coordinates in the space. For example, if we create a two-dimensional NDRange of size sixteen in both the x and y dimensions, it will contain 256 work-items. Each work item will have an x-id ranging from zero to fifteen and a y-id from zero to fifteen, though each combination will be unique. This NDRange is subdivided into work-groups. Work-groups can be thought of as NDRanges which are mapped to compute units. 
Each work-item has a local-id which identifies its position within the work-group. Work-groups have the same number of dimensions as the NDRange and have their own unique multi-dimensional ID. In OpenCL, work is scheduled at the work-group granularity because work-items within a work-group must be able to communicate using shared memory. The work-items themselves are assigned to processing elements. OpenCL uses a SIMT execution model, just like GPUs. The local- and global-ids allow the programmer to express data parallelism; they are usually used to index data in a buffer, but occasionally as a direct operand.

    __kernel void inc(__global int *a)
    {
        a[get_global_id(0)]++;
    }

Listing 2.1: OpenCL kernel to increment array elements.

     1  void increment_array(int *input, int buffer_size, cl_context context,
                             cl_command_queue queue, cl_kernel kernel)
     2  {
     3      cl_int errcode;
     4      cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                           sizeof(int) * buffer_size, NULL, &errcode);
     5      errcode = clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0,
                                           sizeof(int) * buffer_size, input, 0, NULL, NULL);
     6      assert(errcode == CL_SUCCESS);
     7
     8      assert(clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer) == CL_SUCCESS);
     9
    10      size_t global[] = {1, 1, 1};
    11      size_t local[]  = {1, 1, 1};
    12
    13      global[0] = buffer_size;
    14      local[0]  = buffer_size / 4;
    15
    16      assert(clEnqueueNDRangeKernel(queue, kernel, dims, NULL, global, local,
                                          0, NULL, NULL) == CL_SUCCESS);
    17      clFinish(queue);
    18
    19      assert(clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0,
                                       sizeof(int) * buffer_size, input, 0, NULL, NULL) == CL_SUCCESS);
    20  }

Listing 2.2: OpenCL host code to increment array elements.

An example OpenCL kernel can be seen in Listing 2.1 and the associated host code in Listing 2.2. The code in the example increments each element of an array by one. This stripped-down example omits the OpenCL boilerplate required to discover the devices, create a context, compile the kernel code, and create the device's command queue. We can assume the calling function has taken care of these steps and is passing a valid context, command queue, and kernel object as arguments. In this example, the array we want to increment is pointed to by input and is of size buffer_size, and we will assume buffer_size is 100. The kernel is very simple: it accepts an array as an argument and increments the value at index global-id of the array, as seen in Listing 2.1. In order to execute this kernel, the host code must first create a buffer and copy the array values to it, as is done in lines 4 and 5 of Listing 2.2. This buffer is then passed as the argument to the kernel in line 8. The kernel needs a one-dimensional NDRange of size 100, and for this particular example the local size is not important, so we can set it to any divisor of 100, say, 25. This step is done in lines 10 through 14. This will create four work-groups, each with twenty-five work-items. Each work-item will execute identical instructions, but since they each have a different global-id, each work-item will increment a different array element. The kernel is then submitted to the command queue of the device we are using in line 16. To get the results from the OpenCL device, the buffer needs to be copied back to host memory, as is done in line 19. We can see from the example that the kernel code will work for any size of buffer, as the kernel code itself only increments a single element. The host program will run as many instances of the kernel as

22 Chapter 2. Background 12 necessary for a given array. In this example, the number of work-groups will always be four, which limits the problem size to the maximum size work-group the device can run. It would also be possible to create a variable number of work-groups by specifying the local size. If in this example the local size were always four, then we would have created 25 work-groups instead of four. The ideal local size depends on the device being used. For the CPU, it make no difference, since each work-item must be executed separately. On the other hand, on an Evergreen GPU, any work-group size that is not a multiple of 16 will leave some SPs idle when executing, thereby decreasing performance. Also, the larger the work-groups, the more wavefronts the SC will have to interleave during execution. Since OpenCL can be used to program both CPUs and GPUs, it is an ideal framework to program heterogeneous processors such as the AMD Fusion. Both the Fusion s CPU and GPU can be used as OpenCL devices. This allows for the same kernel to be executed on either processor and the work can be more easily shared since there is no need to copy data over an external interface such as PCIe [21]. This is useful for comparing the relative strengths of both processors and to allow them to collaborate on the same workload CUDA The CUDA programming model is very similar to OpenCL. In fact, Nvidia GPUs can be programmed using either OpenCL or CUDA. When it comes to writing kernels, the only differences between CUDA and OpenCL are the names of the IDs a work-item or work-group has. To avoid confusion, I am omitting the CUDA nomenclature as it is not used anywhere in this work. However, there are more differences in the host code. Since CUDA can only be used to program GPUs, it does not need to be as general, which simplifies the host code considerably. The same things are still taking place behind the scenes, but the framework can make more assumptions because it knows an Nvidia GPU is being programmed. The biggest difference is that CUDA gives access to certain Nvidia specific features. For example, the shared memory/l1 cache can only be configured using CUDA. In OpenCL you always get 48 kb of shared memory. There are also CUDA plug-ins for programs like MatLab and many libraries available. This is partially due to the fact that CUDA pre-dates OpenCL, but also because it is easier to write libraries that only need to work on one type of hardware. 2.4 Simulators GPU simulators come in two flavours: those that simulate graphics programs and those that simulate general purpose computation. Simulators such as Qsilver [22] or ATTILA [23] can be used to simulate graphics programs written in OpenGL [24]. GPGPU computations written in CUDA or OpenCL can be run in simulators such as GPGPUSim [16] or Multi2Sim [25][10]. As this work focuses on GPGPU, this section will introduce the GPGPU simulators starting with Multi2Sim, since it was used for this research.

23 Chapter 2. Background Multi2Sim Multi2Sim is an open source heterogeneous architecture simulator. It supports the simulation of much more than just GPGPU programs. When this research started in 2012, it supported the simulation of 32-bit x86 CPUs (i386) and GPUs based on AMD s Evergreen architecture. Since that time, Muli2Sim has added support for ARM, MIPS, AMD s Southern-Island, and Nvidia s Fermi GPUs. Multi2Sim is capable of performing cycle-accurate simulations for both CPUs and GPUs. Multi2Sim is not a full system simulator, meaning that is does not run an operating system (OS), but runs the target application directly. Multi2Sim emulates the desired ISA, program loading, and system calls. Multi2Sim must emulate program loading and system calls directly because these are services an OS would normally provide. Multi2Sim emulates the system calls specified in the Linux application binary interface (ABI) allowing it to run most Linux applications. Multi2Sim is the only GPGPU simulator that simulates AMD GPUs and specifically the Evergreen micro-architecture, which is found in Llano generation APUs. While the x86 CPU model in Multi2Sim is not based on any existing hardware, it is highly configurable. This makes Multi2Sim an ideal platform for work with the Fusion. Originally, Multi2Sim only supported the execution of OpenCL kernels on the GPU. This limited its ability to run a single workload on both the CPU and GPU or to run workloads where both the CPU and GPU were collaborating to run the same OpenCL kernel. Steven Gurfinkel and myself addressed this limitation by writing a CPU OpenCL runtime that was incorporated into Multi2Sim. Since AMD does not publish the specification for its OpenCL runtime, it had to be reverse engineered. Our OpenCL runtime is compatible with binaries produced by the OpenCL compiler in the AMDAPP SDK versions 2.5 through 2.7. The runtime works both in and outside of Multi2Sim. More details on its operation is available in the Multi2Sim documentation [26]. Limitations Multi2sim exhibits a number of limitations that were discovered over the course of this work. The most important limitation is the inaccuracy of the memory model. Originally Multi2Sim simulated the cache hierarchy as being part of a complex network on-chip (NOC), where each cache and core had its own router. This caused very high latency cache accesses, as routing took multiple cycles at each stage. The lowest latency cache access we could originally achieve for the L1 cache was twelve cycles, which was much higher than the three it takes on the Fusion. After bringing this issue to the attention of the Multi2Sim team, a bus model was added. This significantly sped up the communication between the caches. However, there were still some issues with modelling all the cache latencies correctly, as the L2 cache s latency is closely tied to that of main memory and the prefetcher in Multi2Sim does not perform nearly as well the one found in real hardware. Multi2Sim simulates main memory the same way it simulates caches. This means that there is a constant latency associated with main memory accesses. This is not accurate for DRAM where the latency can vary greatly depending on multiple factors. DRAM is organized into multiple banks and only one bank can be active (precharged) at a time. Memory access latency is lowest when we are accessing a bank that

is currently active, and highest when we have to deactivate the current bank and activate another one. This is one reason contiguous memory accesses are faster than sparse memory accesses. The memory latency is also affected by the need to periodically refresh the DRAM's values. If we make a request to an address that is being refreshed, the latency will increase. Another issue is the fact that Multi2Sim simulates an inclusive cache hierarchy while the family 12h CPU we are modelling has an exclusive cache. This means that the size of the L2 cache for each core is simulated as being 128 kB, or 1/8, smaller than in reality, since it must also contain all the data found in both L1 caches. Unfortunately, it was not possible to correct for this by increasing the number of sets or the associativity of the L2 cache, since Multi2Sim only handles powers of two for both of these values. Given the choice between a cache that was 1/8 too small or 7/8 too large, the former option was chosen. The other issue is that Multi2Sim does not support the simultaneous execution of the CPU and the Evergreen GPU. Currently, when a kernel is enqueued, the CPU is suspended while the kernel executes. The new Southern Island based GPU model does not have this limitation, but it uses a different OpenCL runtime. This was not an issue for most of the benchmarks, since they usually have the CPU block until the kernel completes. It was, however, an issue when simulating power in Multi2Sim, since it was impossible to model power consumption that was truly the sum of both processors. It is unlikely that the Evergreen issues will be solved, as development focus has shifted to the more recent Southern Island (SI) micro-architecture. Unfortunately, the SI micro-architecture is a radical departure from previous AMD GPU architectures. The VLIW ISA has been replaced by a non-VLIW one, which focuses primarily on exploiting data-parallelism rather than instruction-level parallelism (ILP). This architecture has more in common with the Fermi architecture, described in Section 2.1.2, than with the Evergreen architecture. Therefore, in spite of its improved features and continued development, it was a poor candidate for emulating Evergreen-based hardware.

2.4.2 GPGPUSim

GPGPUSim [16] is a GPU architectural simulator that can simulate the execution of CUDA or OpenCL kernels. It simulates Nvidia hardware from the 8800GTX [27] up to and including Fermi. GPGPUSim only simulates the execution of the kernel, while the host program executes in real time on a real CPU. Calls to libcudart are intercepted to allow the host program to communicate with GPGPUSim. There have been some efforts to combine GPGPUSim with a CPU simulator to create heterogeneous simulators. Work by Zakharenko et al. [28] combines GPGPUSim with PTLSim [29], but does not include a power model. Work by Wang et al. [30] combines GPGPUSim with gem5 [31]. They also include a power model to study power budgeting, but the power model is very coarse, assuming constant per-core power consumption. Since GPGPUSim was primarily developed to simulate Nvidia hardware, it expects the kernel to be compiled to Nvidia's PTX assembly. It has no support for AMD's Evergreen assembly, nor does it support the simulation of VLIW architectures, so it could not be used to simulate the Fusion's GPU.

2.5 SciNet

Most of the computations in this work were performed on systems belonging to the SciNet HPC Consortium [12]. Two of their clusters were used: GPC and Gravity. GPC is the general purpose cluster; it consists of octo-core Xeon nodes and was used to run Multi2Sim simulations. Without it, it would have taken nearly two months to run all the simulations required for the power modelling. The Gravity cluster is a GPU cluster in which each node contains a dodeca-core Xeon processor and two Tesla GPUs. This cluster was used to run experiments using the SnuCL [11] and DistCL [9] distributed OpenCL runtimes.

26 Chapter 3 Distributing OpenCL kernels GPUs were first used to offload graphics tasks from the CPU. Thanks to the demand of computer gamers for ever higher quality graphics, GPUs evolved from simple fixed function accelerators to fully programmable massively parallel processors [32]. As the level of programmability of GPUs increased, it became possible to run non-graphics workloads on them. Recently, there has been significant interest in using GPUs for general purpose and high performance computing (HPC). Significant speedups have been demonstrated when porting applications to a GPU [33], even in HPC workloads such as linear algebra [34], computational finance [35], and molecular modelling [36]. Therefore, it is no surprise that modern computing clusters such as the TianHe-1A [37] and Titan [4] are incoporating GPUs. However, additional speedups are still possible beyond the computational capabilities afforded by a single GPU. As GPU programing has grown in popularity in the HPC space, there has been much interest in expanding the OpenCL and CUDA [2] programing models to support cluster programming. This chapter introduces DistCL [9], a framework for the distribution of OpenCL kernels across a cluster. To simplify this task, DistCL takes advantage of three insights: 1) OpenCL tasks (called kernels) contain threads (called work-items) that are organized into small groups (called work-groups). Work-items from different work-groups cannot communicate during a kernel invocation. Therefore, work-groups only require that the memory they read be up-to-date as of the beginning of the kernel invocation. Thus, DistCL must know what memory a work-group reads, to ensure that the memory is up-to-date on the device that runs the work-group. DistCL must also know what memory each work-group writes, so that future reads can be satisfied. However, no intra-kernel synchronization is required. 2) Most OpenCL kernels make only data-independent memory accesses; the addresses they access can be predicted using only the immediate values they are passed and the geometry they are invoked with. Their accesses can be efficiently determined before they run. DistCL requires that kernel writes be data-independent. 3) Kernel memory accesses are often contiguous. Contiguous accesses fully harness the wide memory buses of GPUs [33]. DistCL does not require contiguous accesses for correctness, but they improve distributed performance because contiguous accesses made in the same work-group can be treated like a 16

27 Chapter 3. Distributing OpenCL kernels 17 single, large access when tracking writes and transferring data between devices. In OpenCL (and DistCL) memory is divided into large device-resident arrays called buffers. DistCL introduces the concept of meta-functions: simple functions that describe the memory access patterns of an OpenCL kernel. Meta-functions are programmer-written kernel-specific functions that relate a range of work-groups to the parts of a buffer that those work-groups will access. When a meta-function is passed a range of work-groups and a buffer to consider, it divides the buffer into intervals, marking each interval either as accessed or not accessed by the work-groups. DistCL takes advantage of kernels with sequential access patterns, which have fewer (larger) intervals, because it can satisfy their memory accesses with fewer I/O operations. By dividing up buffers, meta-functions allow DistCL to distribute an unmodified kernel across a cluster. To our knowledge, DistCL is the first framework to do so. In addition to describing DistCL, this chapter evaluates the effectiveness of kernel distribution across a cluster based on the kernels memory access patterns and their compute-to-transfer ratio. It also examines how the performance of various OpenCL and network operations affect the distribution of kernels. This work was done in partnership with Steven Gurfinkel. While I did participate in the design process of DistCL and the writing of the first version, development on the two subsequent versions was done primarily by Gurfinkel. My main contributions are: The evaluation of how the properties of different kernels affect their performance when distributed. A performance comparison between DistCL and another framework that distributes OpenCL kernels, SnuCL [11]. The rest of this chapter is organized as follows: It first describes related work in Section 3.1. Then in Section 3.2, it describes DistCL using vector addition as an example, in particular looking at how DistCL handles each step involved with distribution. Focus then shifts to analysis; Section 3.3 introduces the benchmarks, which are grouped into three categories: linear runtime benchmarks, compute intensive benchmarks, and benchmarks that involve inter-node communication. Results for these benchmarks are presented in Section 3.4. A comparison with SnuCL is provided in Section Background Programming GPUs is not a simple task, especially in a cluster environment. Often, a distributed programming model such as MPI [38] is combined with a GPU programming model such as OpenCL or CUDA. This makes memory management difficult because the programmer must manually transfer it not only between nodes in the cluster but also to and from the GPUs in each node. There have been multiple frameworks proposed that allow all the GPUs to be accessed as if they are part of a single platform. rcuda [39] is one such framework for CUDA, while Mosix VCL [40] provides similar functionality for OpenCL. These frameworks are limited by the fact that they work with a single CUDA or OpenCL implementation; that is to say, one cannot mix devices from different vendors. More recent frameworks such as Hybrid OpenCL [41], dopencl [42], Distributed OpenCL [43], and SnuCL [11] address this limitation. These frameworks allow devices from different vendors using different OpenCL implementation to be combined into a single context. This makes memory management much easier, as OpenCL buffer

28 Chapter 3. Distributing OpenCL kernels 18 copy operations can be used to transparently transfer data between nodes. clopencl [44] takes a similar approach but allows the user to specify which nodes to include in contexts using host names. With any of these frameworks, work can still only be dispatched to a single device at a time. Therefore, in order to take advantage of multiple devices, work must be manually broken down and dispatched in smaller pieces. With CUDASA [45] it is possible to launch a single kernel and have it run on multiple devices in a network. However, this is not transparent to the programmer. The CUDA programming model is extended with network and bus layers to represent a cluster and a node respectively. This is an addition to existing kernel(ndrange), block (work-group), and thread (work-item) layers which map to devices, compute units, and processing elements respectively. Unfortunately, this means that kernel code must be modified accordingly if one wants to take advantage of more than a single device. Another drawback of CUDASA is that it does not handle any memory transfers transparently. CUDASA includes a distributed shared memory (DSM) to allow all nodes to share a single address space. When allocating memory, the programmer can specify the desired address range, to ensure the memory is on the right node. Functions are provided to easily copy memory across the DSM using MPI. Single OpenCL kernels are transparently run on multiple devices in work by Kim et al. [46]. This framework is targeted at multiple GPUs in a single computer and has no support for distributing a kernel across a cluster. To determine which device requires which memory, sample runs are used. Before enqueuing work onto any GPU, the work-items at the corners of the NDRange are run on the CPU to determine the memory access pattern. In the event that this analysis is inconclusive, the work is still distributed and the output buffers are diff ed. The fact that in certain cases this framework relies on diff ing entire buffers makes it ill suited for distribution. If distributed, the process involves not only transferring the entire buffer from the GPU, but also from each worker node to the master, and this operation consumes a significant amount of time. SnuCL [11] is another framework that distributes OpenCL kernels across a cluster. SnuCL can create the illusion that all the OpenCL devices in a cluster belong to a local context, and can automatically copy buffers between nodes based on the programmer s placement of kernels. As opposed to SnuCL, DistCL not only abstracts inter-node communication, but also the fact that there are multiple devices in the cluster. A more detailed description of SnuCL is provided in Section In this chapter, the performance of DistCL will be compared to that of SnuCL. 3.2 DistCL DistCL executes on a cluster of networked computers. OpenCL host programs use DistCL by creating one context with one command queue for one device. This device represents the aggregate of all the devices in the cluster. When a program is run with DistCL, identical processes are launched on every node. When the OpenCL context is created, one of those nodes becomes the master. The master is the only node that is allowed to continue executing the host program. All other nodes, called peers, enter an event loop that services requests from the master. Nodes communicate in two ways: messages to and from the master, and raw data transfers that can happen between any pair of nodes. 
To run a kernel, DistCL divides its NDRange into smaller grids called subranges. Kernel execution

gets distributed because these subranges run on different peers. DistCL must know what memory a subrange will access in order to distribute the kernel correctly. This knowledge is provided to DistCL with meta-functions. Meta-functions are programmer-written, kernel-specific callbacks that DistCL uses to determine what memory a subrange will access. DistCL uses meta-functions to divide buffers into arbitrarily-sized intervals. Each interval of a buffer is either accessed or not. DistCL stores the intervals calculated by meta-functions in objects called access-sets. Once all the access-sets have been calculated, DistCL can initiate the transfers needed to allow the peers to run the subranges they have been assigned. Recall the important distinction between subranges, which contain threads, and intervals, which contain data. The remainder of this section describes the execution process in more detail, illustrating each step with a vector addition example, whose kernel source code is given in Listing 3.1.

    __kernel void vector(__global int *a, __global int *b, __global int *out)
    {
        int i = get_global_id(0);
        out[i] = a[i] + b[i];
    }

Listing 3.1: OpenCL kernel for vector addition.

3.2.1 Partitioning

Partitioning divides the NDRange of a kernel execution into smaller grids called subranges. DistCL never fragments work-groups, as that would violate OpenCL's execution model and could lead to incorrect kernel execution. For linear (1D) NDRanges, if the number of work-groups is a multiple of the number of peers, each subrange will be equal in size; otherwise, some subranges will be one work-group larger than others. DistCL partitions a multidimensional NDRange along its highest dimension first, in the same way it would partition a linear NDRange. If the subrange count is less than the peer count, DistCL will continue to partition lower dimensions. Multidimensional arrays are often organized in row-major order, so highest-dimension-first partitioning frequently results in subranges accessing contiguous regions of memory. Transferring fragmented regions of memory requires multiple I/O operations to avoid transferring unnecessary regions, whereas large contiguous regions can be sent all at once. Our vector addition example has a one-dimensional NDRange. Assume it runs with 1M = 2^20 work-items on a cluster with 4 peers. Assuming 1 subrange per peer, the NDRange will be partitioned into 4 subranges, each with a size of 256k work-items, as shown in Figure 3.1.

Figure 3.1: Vector's 1-dimensional NDRange is partitioned into 4 subranges (boundaries at 0, 256k, 512k, 768k, and 1M).
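The following C sketch illustrates the 1-D partitioning rule just described: subranges are equal when the work-group count divides evenly by the peer count, otherwise some subranges are one work-group larger, and work-groups are never fragmented. It is a simplified, hypothetical routine written for this explanation, not DistCL's actual implementation; the names (subrange_t, partition_1d) are invented here.

    #include <stddef.h>

    typedef struct {
        size_t first_group;   /* index of the first work-group in the subrange */
        size_t group_count;   /* number of work-groups in the subrange         */
    } subrange_t;

    /* Split num_groups work-groups among num_peers subranges. */
    void partition_1d(size_t num_groups, size_t num_peers, subrange_t *out)
    {
        size_t base  = num_groups / num_peers;  /* minimum work-groups per subrange     */
        size_t extra = num_groups % num_peers;  /* first 'extra' subranges get one more */
        size_t next  = 0;

        for (size_t p = 0; p < num_peers; p++) {
            out[p].first_group = next;
            out[p].group_count = base + (p < extra ? 1 : 0);
            next += out[p].group_count;
        }
    }

Applied to the vector example's work-groups with 4 peers, this yields the four equal subranges of 256k work-items shown in Figure 3.1.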

3.2.2 Dependencies

The host program allocates OpenCL buffers and can read from or write to them through OpenCL function calls. Kernels are passed these buffers when they are invoked. For example, the three parameters a, b, and out in Listing 3.1 are buffers. DistCL must know what parts of each buffer a subrange will access in order to create the illusion of many compute devices with separate memories sharing a single memory. The sets of addresses in a buffer that a subrange reads and writes are called its read-set and write-set, respectively. DistCL represents these access-sets with concrete data structures and calculates them using meta-functions. Access-sets are calculated every kernel invocation, for every subrange-buffer combination, because the access patterns of a subrange depend on the invocation's parameters, partitioning, and NDRange. In our vector addition example with 4 subranges and 3 buffers, 24 access-sets will be calculated: 12 read-sets and 12 write-sets.

An access-set is a list of intervals within a buffer. DistCL represents addresses in buffers as offsets from the beginning of the buffer; thus an interval is represented with a low and high offset into the buffer. These intervals are half-open: low offsets are part of the intervals, but high offsets are not. For instance, subrange 1 in Figure 3.1 contains global IDs from the interval [256k, 512k). As seen in Listing 3.1, each work-item produces a 4-byte (sizeof(int)) integer, so subrange 1 produces the data for interval [1 MB, 2 MB) of out. Subrange 1 will also read the same 1 MB region from buffers a and b to produce this data. The intervals [0 MB, 1 MB) and [2 MB, 4 MB) of a, b, and out are not accessed by subrange 1.

Calculating Dependencies

To determine the access-sets of a subrange, DistCL uses programmer-written, kernel-specific meta-functions. Each kernel has a read meta-function to calculate read-sets and a write meta-function to calculate write-sets. DistCL passes meta-functions information regarding the kernel invocation's geometry. This includes the invocation's global size (global in Listing 3.2), the current subrange's size (subrange), and the local size (local). DistCL also passes the immediate parameters of the kernel (params) to the meta-function. The subrange being considered is indicated by its starting offset in the NDRange (subrange_offset) and the buffer being considered is indicated by its zero-indexed position in the kernel's parameter list (param_num).

DistCL builds access-sets one interval at a time, progressing through the intervals in order, from the beginning of the buffer to the end. Each call to the meta-function generates a new interval. If and only if the meta-function indicates that this interval is accessed, DistCL includes it in the access-set. To call a meta-function, DistCL passes the low offset of the current interval through start, and the meta-function sets next_start to its end. The meta-function's return value specifies whether the interval is accessed. Initially setting start to zero, DistCL advances through the buffer by setting the start of subsequent calls to the previous value of next_start. When the meta-function sets next_start to the size of the buffer, the buffer has been fully explored and the access-set is complete.
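The calling convention just described can be pictured with a short sketch (hypothetical names; this is not DistCL's actual driver code): DistCL walks the buffer from offset zero, calling the meta-function once per interval and recording only the intervals that are accessed.

#include <stddef.h>

typedef int (*meta_fn_t)(const void *params, const size_t *global,
                         const size_t *subrange, const size_t *local,
                         const size_t *subrange_offset, unsigned int param_num,
                         size_t start, size_t *next_start);

/* Sketch: build the access-set of one subrange/buffer pair by repeatedly
 * calling the meta-function until the whole buffer has been covered. */
static void build_access_set(meta_fn_t meta, const void *params,
                             const size_t *global, const size_t *subrange,
                             const size_t *local, const size_t *subrange_offset,
                             unsigned int param_num, size_t buffer_size)
{
    size_t start = 0;
    while (start < buffer_size) {
        size_t next_start = 0;
        int accessed = meta(params, global, subrange, local,
                            subrange_offset, param_num, start, &next_start);
        if (accessed) {
            /* record the half-open interval [start, next_start) in the access-set */
        }
        start = next_start;
    }
}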

1   int is_buffer_range_read_vector(
2       const void *params, const size_t *global,
3       const size_t *subrange, const size_t *local,
4       const size_t *subrange_offset, unsigned int param_num,
5       size_t start, size_t *next_start)
6   {
7       int ret = 0;
8       *next_start = sizeof(int) * global[0];
9       if (param_num != 2) {
10          start /= sizeof(int);
11          ret = require_region(1, global, subrange_offset, subrange, start, next_start);
12          *next_start *= sizeof(int);
13      }
14      return ret;
15  }

Listing 3.2: Read meta-function.

Meta-Function Verification

DistCL provides a tool that allows meta-functions to be verified. It can be configured to verify the meta-function of a kernel for any desired NDRange. A configuration file instructs the tool which kernel to run, what buffers to create, what subrange to consider, and whether the read or write meta-function should be considered. The kernel code must be modified to write the value 1 to each memory location being accessed. Both the kernel and the meta-function are then run and their outputs are compared. Both a graphical and a text representation are then provided to show any regions the meta-function missed and any regions the meta-function included that were not modified by the kernel. This tool allows meta-functions to be tested independently, which is particularly helpful for multi-kernel benchmarks, where errors in an early meta-function would propagate through and make it difficult to tell which meta-function caused the benchmark to execute incorrectly.

Rectangular Regions

Many OpenCL kernels structure multidimensional arrays into linear buffers using row-major order. When these kernels run, their subranges typically access one or more linear, rectangular, or prism-shaped areas of the array. Though these areas are contiguous in multidimensional space, they are typically made up of many disjoint intervals in the linear buffer. Recognizing this, DistCL has a helper function, called require_region, that meta-functions can use to identify which linear intervals of a buffer constitute any such area.

int require_region(int dim, const size_t *total_size, const size_t *required_start,
                   const size_t *required_size, size_t start, size_t *next_start);

Listing 3.3: require_region helper function.

require_region, whose prototype is shown in Listing 3.3, operates over a hypothetical dim-dimensional grid. Typically, each element of this grid represents an element in a DistCL buffer. The size of this grid in each dimension is specified by the dim-element array total_size. require_region considers a rectangular region of that grid whose size and offset into the grid are specified by the dim-element arrays required_size and required_start, respectively. Given this, require_region calculates the linear intervals that correspond to that region if the elements of this dim-dimensional grid were arranged linearly, in row-major order. Because there may be many such intervals, the return value, start parameter, and next_start parameter of require_region work the same way as in a meta-function, allowing the caller to move linearly through the intervals, one at a time. If a kernel does not access memory in rectangular regions, it does not have to use the helper function.

Even though vector has a one-dimensional NDRange, require_region is still used for its read meta-function in Listing 3.2. This is because require_region not only identifies the interval that will be used, but also identifies the intervals on either side that will not be used. require_region is passed global as the hypothetical grid's size, making each grid element correspond to an integer, the datatype changed by a single work-item. Therefore, lines 10 and 12 of Listing 3.2 translate between elements and bytes, which differ by a factor of sizeof(int).

Figure 3.2: The read meta-function is called for buffer a in subrange 1 of vector.

Figure 3.2 shows what happens when the meta-function is called on buffer a for subrange 1. In Figure 3.2a, the first time the meta-function is called, DistCL passes in 0 as the start of the interval, and the meta-function calculates that the current interval is not in the read-set and that the next interval starts at an offset of 1 MB. Next, in Figure 3.2b, DistCL passes in 1 MB as the start of the interval. The meta-function calculates that this interval is in the read-set and that the next interval starts at 2 MB. Finally, in Figure 3.2c, DistCL passes in 2 MB as the start of the interval. The meta-function calculates that this interval is not in the read-set and that it extends to the end of the buffer, which has a size of 4 MB.

3.2.3 Scheduling Work

The scheduler is responsible for deciding when to run subranges and on which peer to run them. The scheduler runs on the master and broadcasts messages to the peers when it assigns work. DistCL uses a simple scheme for determining where to run subranges. If the number of subranges equals the number of peers, each peer gets one subrange; however, if the number of subranges is fewer, some peers are never assigned work.

3.2.4 Transferring Buffers

When DistCL executes a kernel, the data produced by the kernel is distributed across the peers in the cluster. The way this data is distributed depends on how the kernel was partitioned into subranges, how these subranges were scheduled, and the write-sets of these subranges. DistCL must keep track of how the data in a buffer is distributed, so that it knows when it needs to transfer data between nodes to satisfy subsequent reads, which may not occur on the same peer that originally produced the data. DistCL represents the distribution of a buffer in a similar way to how it represents dependency information. The buffer is again divided into a set of intervals, but this time each interval is associated with the ID of the node that has last written to it. This node is referred to as the owner of that interval.
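A rough sketch of this bookkeeping follows (the names and layout are hypothetical, not DistCL's actual data structures): ownership can be pictured as a sorted list of tagged intervals that is consulted before a subrange runs and updated after it completes.

#include <stddef.h>

/* Sketch: per-buffer ownership information. Each half-open byte interval
 * [low, high) is tagged with the node that last wrote it; initially one
 * interval covering the whole buffer is owned by the master. Any part of a
 * subrange's read-set owned by another node triggers a transfer, and after
 * the subrange runs, its write-set is retagged with the executing peer. */
struct owned_interval {
    size_t low;   /* inclusive byte offset  */
    size_t high;  /* exclusive byte offset  */
    int    owner; /* node ID of last writer */
};

struct buffer_ownership {
    struct owned_interval *intervals; /* sorted, non-overlapping, covers the buffer */
    size_t                 count;
};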

Buffers

Every time the host program creates a buffer, the master allocates a region of host (not device) memory, equal in size to the buffer, which DistCL uses to cache writes that the host program makes to the buffer. Whether the host program initializes the buffer or not, the buffer's dependency information specifies the master as the sole owner of the buffer. Additionally, each peer allocates, but does not initialize, an OpenCL buffer of the specified size. Generally, most peers will never initialize the entire contents of their buffers because each subrange only accesses a limited portion of each buffer. However, this means that using DistCL does not give access to any more memory than can be found in a single GPU.

Satisfying Dependencies

When a subrange is assigned to a peer, before the subrange can execute, DistCL must ensure that the peer has an up-to-date copy of all the memory in the subrange's read-set. For every buffer, DistCL compares the ownership information to the subrange's read-set. If data in the read-set is owned by another node, DistCL initiates a transfer between that node and the assigned node. Once all the transfers have completed, the assigned peer can execute the subrange. When the kernel completes, DistCL also updates the ownership information to reflect the fact that the assigned peer now has the up-to-date copy of the data in the subrange's write-set. DistCL also implements host-enqueued buffer reads and writes using this mechanism.

Transfer Mechanisms

Peer-to-peer data transfers involve both intra-peer and inter-peer operations. For memory reads, data must first be transferred from the GPU into a host buffer. Then, a network operation can transfer that host buffer. For writes, the host buffer is copied back to the GPU. DistCL uses an OpenCL mechanism called mapping to transfer between the host and GPU.

3.3 Experimental Setup

Eleven applications, from the Rodinia benchmark suite v2.3 [47][48], the AMD APP SDK [49], and GNU libgcrypt [50], were used to evaluate our framework. Each benchmark was run three times, and the median time was taken. This time starts when the host initializes the first buffer and ends when it reads back the last buffer containing the results, thereby including all buffer transfers and computations required to make it seem as if the cluster were one GPU with a single memory. The time for each benchmark is normalized against a standard OpenCL implementation using the same kernel running on a single GPU, including all transfers between the host and device. We group the benchmarks into three categories:

1. Linear compute and memory characteristics: nearest neighbor, hash, Mandelbrot;
2. Compute-intensive: binomial option, Monte Carlo;

3. Inter-node communication: n-body, bitonic sort, back propagation, HotSpot, k-means, LU decomposition.

These benchmarks were chosen to represent a wide range of data-parallel applications. They provide insight into what types of workloads benefit from distributed execution. The three categories of problems give a spread of asymptotic complexities. This allows the effect of distribution, which primarily affects memory transfers, to be studied with tasks of varying compute-to-transfer ratios. The important characteristics of the benchmarks are summarized in Table 3.1, and each is described below. For the Rodinia benchmarks, many of the problem sizes are quite small, but they were all run with the largest possible problem size, given the input data distributed with the suite. The worst relative standard deviation in runtime for any benchmark is 11%, with the Rodinia benchmarks' runtimes varying the most due to their smaller problem sizes. For the non-Rodinia benchmarks, it was usually under 1%.

Table 3.1: Benchmark Description

Benchmark | Source | Inputs | Complexity | Work-Items | Kernels | Problem Size per Kernel (bytes)
Nearest neighbor | Rodinia | 24M locations (n) | O(n) | n | 1 | 12n
Mandelbrot | AMD | x: 0 to 0.5, y: to 0.25, max iterations (k): 1000 | O(kn) | n | 1 | 4n
Hash | Libgcrypt | 24M hashes (n) | O(n) | n | 1 | n
Binomial | AMD | samp. (k), 767 iterations (n) | O(kn^2) | k(n+1) | 1 | 32k
Monte Carlo | AMD | 4k sums (n), 1536 samp. (m), 10 steps (k) | O(knm^2) | nm^2/8 | k | 2m^2(2n+1)
n-body | AMD | 768k bodies (n), 1 or 8 iter. (k) | O(kn^2) | n | k | 16n
Bitonic | AMD | 32M elem. (n) | O(n lg^2 n) | n/2 | lg^2 n | 4n
k-means | Rodinia | points (n), 34 features (k), 5 clusters (c) | kern. 1: O(nk); kern. 2: O(nck) | n | 1; var. | 8nk; 4(2nk + c)
Back propagation | Rodinia | 4M input nodes (n), 16 hidden nodes (k) | kern. 1: O(nk); kern. 2: O(nk) | nk | 2 | 4(kn + 3n + 2k + 4); 4(kn + 2n + k + 3)
HotSpot | Rodinia | chip dim. (n): 1k; time-steps: 5 per kernel (x), 60 total (k) | | | k/x |
LUD | Rodinia | matrix dim. (n): 2k; n/16 - 1 iterations (k); current iter. denoted (i) | kern. 1: O(n); kern. 2: O(n^2); kern. 3: O(n^3) | 16; 2n - 32(i+1); (n - 16(i+1))^2 | k+1; k; k | 4n^2

3.3.1 Linear Compute and Memory

All linear benchmarks consist of n work-items and a single kernel invocation. For these benchmarks, the amount of data transferred scales linearly with the problem size. The compute-to-transfer ratio remains constant regardless of problem size.

Nearest neighbor. This benchmark determines the nearest locations to a specified point from a list of available locations. Each work-item calculates the Euclidean distance between a single location and the specified point. The input buffer consists of n coordinates and the output is an n-element buffer of distances. Since 12 bytes are transferred per distance calculation, this benchmark has a very low compute-to-transfer ratio and is therefore poorly suited to distribution.

Hash. The hash benchmark attempts to find the hash collision of a SHA-256 hash, similar to Bitcoin [51] miners. Each work-item hashes its global ID and compares it to the provided hash.

Hash is well-suited to distribution because the only data transmitted is the input hash and a single byte from each work-item that indicates whether a collision was found.

Mandelbrot. This benchmark uses an iterative function to determine whether or not a point is a member of the Mandelbrot set. Each work-item iterates over a single point, which it determines using its global ID. This benchmark is well suited to distribution because it has similar characteristics to the hash benchmark. There are no input buffers and only an n-element buffer that is written back after the kernel execution, giving it a high compute-to-transfer ratio.

3.3.2 Compute-Intensive

Binomial option. Binomial option is used to value American options and is common in the financial industry. It involves creating a recombinant binomial tree that is n levels deep, where n is the number of iterations. This creates a tree with n + 1 leaf nodes, and one work-item calculates each leaf. The n + 1 work-items take the same input and only produce one result. Therefore, as the number of iterations is increased, the amount of computation grows quadratically, as the tree gets both taller and wider, while the amount of data that needs to be transferred remains constant. This benchmark is very well suited to distribution. Since all samples can be valued independently, only a single kernel invocation is required.

Monte Carlo. This benchmark uses the Monte Carlo method to value Asian options. Asian options are far more challenging to value than American options, so a stochastic approach is employed. This benchmark requires mn^2/8 work-items and m kernel invocations, where n is the number of options and m the number of steps used.

3.3.3 Inter-Node Communication

These benchmarks all have inter-node communication between kernel invocations, as opposed to the other benchmarks, where nodes only need to communicate with the master. The inter-node communication allows the full path diversity of the network to be used when data is being updated between kernels. These benchmarks, like the others, require a high compute-to-transfer ratio to see a benefit from distribution.

n-body. This benchmark models the movement of bodies as they are influenced by each other's gravity. For n bodies and k iterations, this benchmark runs k kernels with n work-items each. Each work-item is responsible for updating the position and velocity of a single body, using the position and mass of all other bodies in the problem. Data transfers occur initially (when each peer receives the initial position, mass, and velocity for the bodies it is responsible for), between kernel invocations (when position information must be updated globally), and at the end (when the final positions are sent back to the host). As the number of bodies increases, the amount of computation required increases quadratically, while the amount of data to transfer only increases linearly, meaning that larger problems are better suited to distribution.

Bitonic sort. Bitonic sort is a type of merge sort well suited to parallelization. For an n-element array, it requires (n/2) lg^2 n comparisons, each of which is performed by a work-item, through lg^2 n kernel invocations. Each kernel invocation is a global synchronization point and could potentially involve data transfers.

Bitonic sort divides its input into blocks which it operates on independently. While there are more blocks than peers, no inter-node communication takes place; only when the blocks are split between peers does communication begin.

k-means. This benchmark clusters n points into c clusters using k features. This benchmark contains two kernels. The first, which is only executed once, simply transposes the features matrix. The second kernel is responsible for the clustering. This kernel is executed until the result converges, which varies depending on the input data. For the largest input set available, it took 20 kernel invocations before convergence. Both kernels consist of n work-items. For the first kernel, each work-item reads a row of the input array and writes it to a column of the output array. This results in a non-ideal memory access pattern for the writes. The second kernel reads columns of the features matrix, and the entirety of the cluster matrix, which contains the centroid coordinates of each of the existing clusters. The writes of this kernel are contiguous because each work-item uses its one-dimensional global ID as an index into the array where it writes its answer.

Back propagation. This benchmark consists of the training of a two-layer neural network and contains two kernels. For a network with n input nodes and k hidden nodes, each kernel requires nk work-items. The work is divided such that each work-item is responsible for the connection between an input node and one of the hidden nodes. As the number of input nodes grows, the amount of computation required increases linearly, since the number of hidden nodes is fixed for this benchmark.

HotSpot. This benchmark models processor temperature based on a simulated power dissipation profile. The chip is divided into a grid and there is a work-item responsible for calculating the temperature in each cell of the grid. The temperature depends on the power produced by the chip at that cell, as well as the temperature of the four neighboring cells. To avoid having to transfer data between work-groups at each time-step, this benchmark uses a pyramid approach. Since we need the temperature of all neighboring cells when updating the temperature, we always read the temperature for a larger region than we write. If we read an extra x cells in each direction, we can find the temperature after x time-steps without any memory transfers. For each time-step, we calculate the updated temperature for a region that is smaller by one in each direction, and that region then becomes the input for the next time-step. This creates a pyramid of concentric input regions of height x. While this results in fewer memory transfers, it does mean that some work will be duplicated, as there will be multiple work-groups calculating the temperature of overlapping regions during intermediate time-steps. The total amount of computation performed increases with x, while the amount of memory transferred decreases.

LU decomposition. This benchmark factors a square matrix into unit lower triangular, unit upper triangular, and diagonal matrices. LU decomposition consists of three kernels that calculate the diagonal, perimeter, and remaining values, respectively. These kernels operate over a square region of the matrix, called the area of interest. The problem is solved in 16x16-element blocks, so for a matrix of size n x n, LU decomposition requires n/16 - 1 iterations. At each iteration the region of interest shrinks, losing 16 rows from the top and 16 columns from the left. Each iteration, the diagonal kernel updates a single block; the perimeter kernel updates the top 16 rows and left-most 16 columns of the area of interest; and the internal kernel updates the entire area of interest. After all the iterations, the diagonal kernel is run again to cover the bottom-right block. While the perimeter and internal kernels can scale well, performance is limited by the diagonal kernel, which consists of a single work-group and cannot be parallelized. This benchmark is not well suited to DistCL because of its inter-node communication, complex access pattern, and lack of parallelism.

3.3.4 Cluster

Our framework is evaluated using a cluster with an Infiniband interconnect [12]. The configuration and theoretical performance are summarized in Table 3.2. The cluster consists of 49 nodes. Though there are two GPUs per node, we use only one to focus on distribution between machines. We present results for 1, 2, 4, 8, 16, and 32 nodes.

Table 3.2: Cluster Specifications
Number of nodes: 49
GPUs per node: 2 (1 used)
GPU: NVIDIA Tesla M2090
GPU memory: 6 GB
Shader / memory clock: 1301 / 1848 MHz
Compute units: 16
Processing elements: 512
Network: 4x QDR Infiniband (4 x 10 Gbps)
CPU: Intel E
CPU clock: 2.0 GHz
System memory: 32 GB

We use three microbenchmarks to test the cluster and to aid in understanding the overall performance of our framework. The results of the microbenchmarks are reported in Table 3.3.

Table 3.3: Measured Cluster Performance
Transfer type | Test | 64 MB: ms (Gbps) | 8 B: ms (Mbps)
In-memory | Single-thread memcpy() | 26.5 (20.3) | (21)
Inter-device | OpenCL map for reading | 36.1 (14.9) | 0.62 (0.10)
Inter-node | Infiniband round trip time | 102 (10.5) | (3.0)

We first test the performance of memcpy(), by copying a 64 MB array between two points in host memory. We initialize both arrays to ensure that all the memory is paged in before running the timed portion of the code. The measured memory bandwidth was 20.3 Gbps.

To test OpenCL map performance, a program was written that allocates a buffer, executes a GPU kernel that increments each element of that buffer, and then reads that buffer back with a map. The program executes a kernel to ensure that the GPU is the only device with an up-to-date version of the buffer. Every time the host program maps a portion of the buffer back, it reads that portion, to force it to be paged into host memory. The program reports the total time it took to map and read the updated buffer. To test the throughput of the map operation, the mapping program reads a 64 MB buffer with a single map operation. Only the portion of the program after the kernel execution completes gets timed. We measured 14.9 Gbps of bandwidth between the host and the GPU.

The performance of an 8-byte map was measured to determine its overhead. An 8-byte map takes 620 µs, equivalent to 100 kbps. This shows that small fragmented maps lower DistCL's performance.

The third program tests network performance. It sends a 64 MB message from one node to another and back. The round trip for Infiniband took 102 ms, and each one-way trip took only 51 ms on average, yielding a transfer rate of 10.5 Gbps. Since Infiniband uses 8b/10b encoding, this corresponds to a signalling rate of 13.1 Gbps. This still falls short of the maximum signalling rate of 40 Gbps. Even using high-performance Infiniband, network transfers are slower than maps and memory copies. For this reason, it is important to keep network communication to a minimum to achieve good performance when distributing work. Infiniband is designed to be low-latency, and as such its invocation overhead is lower than that of maps.

3.3.5 SnuCL

We compare the performance of DistCL with SnuCL [11] (version 1.2 beta, downloaded November 15, 2012), another framework that allows OpenCL code to be distributed across a cluster. SnuCL can create the illusion that all OpenCL devices on the cluster belong to a single local context. It is designed with heterogeneous hardware in mind and translates OpenCL code to CUDA when running on Nvidia GPUs and to C when running on a CPU. To distribute a task with SnuCL, the programmer must partition the work into many kernels, ensuring that no two kernels write to the same buffer. If the kernel is expected to run on a variable number of devices, the programmer is responsible for ensuring that their division technique handles all the necessary cases. SnuCL transfers memory between nodes automatically, but requires the programmer to divide their dataset into many buffers to ensure that each buffer is written to by at most one node. For efficiency, the programmer should also divide up the buffers that are being read, to avoid unnecessary data transfers. The more regular a kernel's access pattern, the larger each buffer can be, and the fewer buffers there will be in total. With SnuCL, the programmer uses OpenCL buffer copies to transfer data. SnuCL will determine if a buffer copy is internal to a node, in which case it uses a normal OpenCL copy, or if it is between nodes, in which case it uses MPI. If subsequent kernel invocations require a different memory division, this task again falls to the programmer, who will have to create new buffers and copy the appropriate regions from existing buffers.

The buffers in SnuCL are analogous to the intervals generated by meta-functions in DistCL. However, in SnuCL, buffers must be explicitly created, resized, and redistributed when access patterns change, whereas DistCL manages changes to intervals automatically. SnuCL does not abstract the fact that there are multiple devices; it only automates transfers and keeps track of memory placement. When using SnuCL, the programmer is presented with a single OpenCL platform that contains as many compute devices as are available on the entire cluster. The programmer is then responsible for dividing up work between the compute devices. If SnuCL is linked to existing OpenCL code, all computation would simply happen on a single compute device, as the code must be modified before SnuCL can distribute it. Existing code can be linked to DistCL without any algorithmic modification, reducing the likelihood of introducing new bugs. The meta-functions are written in a separate file that is pointed to by the DistCL configuration file.
This allows the meta-functions file to be managed separately from both the kernel and host files.

We ported four of our benchmarks to SnuCL. We did not compare DistCL and SnuCL using the Rodinia benchmarks: while porting the other benchmarks to DistCL involved only the inclusion of meta-functions, porting the Rodinia benchmarks to SnuCL is a much more involved process, requiring modifications that would alter the characteristics of those benchmarks. We were also unable to compare against the inter-node communication benchmarks due to a (presumably unintentional) limitation in SnuCL's buffer transfer mechanism. The one exception was n-body, which ran correctly for a single iteration, removing the need for inter-node communication and turning it into a compute-intensive benchmark.

Since kernels and buffers are subdivided to run using SnuCL, sometimes kernel arguments or the kernel code itself must be modified to preserve correctness. For example, Mandelbrot requires two arguments that specify the initial x and y values used by the kernel. To ensure work is not duplicated, no two kernels can be passed the same initial coordinates, and the programmer must determine the appropriate value for each kernel. Kernels such as n-body require an additional offset parameter because per-peer buffers can be accessed using the global ID, but globally shared buffers must be accessed with what would be the global ID if the problem were solved using a single NDRange. Hash also required an offset parameter, since it uses a work-item's global ID as the preimage. Similar changes must be made for any kernel that uses the value of its global ID or an input parameter, rather than data from an input buffer, to determine what part of a problem it is working on.

Having ported benchmarks to both DistCL and SnuCL, we find the effort is less for DistCL. Using DistCL, we can focus solely on the memory access patterns of a kernel when writing its meta-functions, and the host code does not need to be understood. With SnuCL, we must understand both the host and kernel code. Debugging was also simpler with DistCL thanks to its meta-function verification tool, which allowed us to write meta-functions one kernel at a time. We did not port any multi-kernel benchmarks to SnuCL, but we did attempt to debug some before we found the buffer transfer issue, and it was quite difficult to find the exact source of error for incorrect benchmarks. Another advantage of DistCL is that once a benchmark has been successfully distributed, we can be confident it will distribute correctly for any number of peers.

While it is simpler to port applications to DistCL than to SnuCL, the latter should experience lower runtime overhead. This is because the work done by meta-functions at runtime has already been done by the programmer. With SnuCL, there is less need to synchronize globally, as buffer ownership does not need to be coherent across nodes. There is also less communication taking place, as nodes need not be informed of buffer ownership unless a transfer is required, in which case the host program explicitly specifies the source and destination devices.

3.4 Results and Discussion

Figure 3.3 shows the speedups obtained by distributing the benchmarks using DistCL, compared to using normal OpenCL on a single node. Compute-intensive benchmarks see significant benefit from being distributed, with binomial achieving a speedup of over 29x when run on 32 peers. The more compute-intensive linear benchmarks, hash and Mandelbrot, also see speedup when distributed.
Of the inter-node communication benchmarks, only n-body benefits from distribution, but it does see almost perfect scaling from 1 to 8 peers and a speedup of just under 15x on 32 peers.

Figure 3.3: Speedup of distributed benchmarks using DistCL.

For the above benchmarks, we see better scaling when the number of peers is low. While the amount of data transferred remains constant, the amount of work per peer decreases, so communication begins to dominate the runtime. The remaining inter-node communication and linear benchmarks actually run slower when distributed versus using a single machine. These benchmarks all have very low compute-to-transfer ratios, so they are not good candidates for distribution. For the Rodinia benchmarks in particular, the problem sizes are very small. Aside from LU decomposition, they took less than three seconds to run. Thus, there is not enough work to amortize the overheads.

Figure 3.4 shows a run-time breakdown of the benchmarks for the 8-peer case. Each run is broken down into five parts:

Buffer: the time taken by host program buffer reads and writes.
Execution: the time during which there was at least one subrange execution but no inter-node transfers.
Transfer: the time during which there was at least one inter-node transfer but no subrange executions.
Overlapped transfer/execution: the time during which both subrange execution and memory transfers took place.
Other/sync: the average time the master waited for other nodes to update their dependency information.

Figure 3.4: Breakdown of runtime.

The benchmarks which saw the most speedup in Figure 3.3 also have the highest proportion of time spent in execution. The breakdowns for binomial, Monte Carlo, and n-body are dominated by execution time, whereas the breakdowns for nearest neighbor, back propagation, and LU decomposition are dominated by transfers and buffer operations, which is why they did not see a speedup. One might wonder why Mandelbrot sees a speedup but bitonic and k-means do not, despite the proportion of time they spend in execution being similar. This is because Mandelbrot and hash are dominated by host buffer operations, which also account for a significant portion of execution with a single GPU. In contrast, bitonic and k-means have higher proportions of inter-node communication, which maps to much faster intra-device communication on a single GPU.

Table 3.4 shows the amount of time spent managing dependencies. This includes running meta-functions, building access-sets, and updating buffer information. Table 3.4 also shows the time spent per kernel invocation, and the time as a proportion of the total runtime. Benchmarks that have fewer buffers, like Mandelbrot and bitonic sort, spend less time applying dependency information per kernel invocation than benchmarks with more buffers. LU decomposition has the most complex access pattern of any benchmark. Its kernels operate over non-coalescable regions that constantly change shape. Further, the fact that none of LU decomposition's kernels update the whole array means that ownership information from previous kernels is passed forward, forcing the ownership information to become more fragmented and take longer to process.

Table 3.4: Execution Time Spent Managing Dependencies (columns: Benchmark; Total Time (µs); Per Kernel Invocation Time (µs); Percent of Runtime. Rows: Mandelbrot, Hash, Nearest neighbor, Binomial, Monte Carlo, n-body, Bitonic Sort, k-means, Back propagation, HotSpot, LUD.)

With the exception of LU decomposition, the time spent managing dependencies is low, demonstrating that the meta-function based approach is intrinsically efficient.

An interesting characteristic of HotSpot is that the compute-to-transfer ratio can be altered by changing the pyramid height. The taller the pyramid, the higher the compute-to-transfer ratio. However, this comes at the price of doing more computation than necessary. Figure 3.5 shows the speedup of HotSpot run with pyramid heights of 1, 2, 3, 4, 5, and 6. The distributed results are for 8 peers. Single-GPU results were acquired using conventional OpenCL. In both cases, the speedups are relative to that framework's performance using a pyramid height of 1. The number of time-steps used was 60, to ensure that each height was a divisor of the number of time-steps. We can see that for a single GPU, the preferred pyramid height is 2. However, when distributed, the preferred height is 5. This is because with 8 peers we have more compute available but the cost of memory transfers is much greater, which shifts the sweet spot toward a configuration that does less transfer per computation.

Benchmarks like HotSpot and LU decomposition that write to rectangular areas of two-dimensional arrays need special attention when being distributed. While the rectangular regions appear contiguous in two-dimensional space, in a linear buffer a square region is, in general, not a single interval. This means that multiple OpenCL map and network operations need to be performed every time one of these areas is transferred. We modified the DistCL scheduler to divide work along the y-axis to fragment the buffer regions transferred between peers. This results in performance that is 204x slower on average across all pyramid heights, for 8 peers. This demonstrates that the overhead of invoking I/O operations on a cluster is a significant performance consideration.

In summary, DistCL exposes important characteristics regarding distributed OpenCL execution. Distribution amplifies the performance characteristics of GPUs. Global memory reads become even more expensive compared to computation, and the aggregate compute power is increased. Further, the performance gain seen by coalesced accesses is not only realized in the GPU's bus, but across the network as well. Synchronization, now a whole-cluster operation, becomes even higher latency. There are also aspects of distributed programming not seen with a single GPU. Sometimes, it is better to transfer more data with few transfers than it is to transfer little data with many transfers.

Figure 3.5: HotSpot with various pyramid heights.

3.5 Performance Comparison with SnuCL

SnuCL's and DistCL's performance was compared using the four benchmarks from the AMD APP SDK and one (hash) from GNU libgcrypt that were ported to SnuCL. Each benchmark was run three times, and the median time was taken. The time for each benchmark is normalized against a standard OpenCL implementation using the same kernel running on a single GPU, including all transfers between the host and device.

Figure 3.6 shows the speedups obtained using SnuCL or DistCL relative to that of normal OpenCL run on a single GPU. For the linear benchmarks, SnuCL outperforms DistCL by up to 3.5x. When using one or two peers, performance is within 10%, but DistCL does not scale as well. While SnuCL keeps benefiting from additional peers, DistCL sees peak performance when using 16 peers in both cases. The story is different for the compute-intensive benchmarks, as both frameworks see improved performance from adding devices all the way up to 32 peers. Performance is also more similar, with near-identical performance from one to eight peers and a maximum difference in performance of 25% with 32 peers. However, even with the compute-intensive benchmarks, it is clear that DistCL does not scale as well as SnuCL.

One might assume that the additional runtime overhead of meta-functions must be responsible. This would also explain why the difference in performance grows with the number of peers, because more peers means more subranges, and therefore more access-sets that must be calculated.

Table 3.5: Execution Time Spent Managing Dependencies (columns: Benchmark; Total Runtime (s); Total Time (µs); Per Kernel Invocation Time (µs); Percent of Runtime. Rows: Mandelbrot, Hash, Binomial, Monte Carlo, n-body.)

Table 3.6: Benchmark Performance Characteristics (columns: Benchmark; DistCL Speedup; Percent of SnuCL's performance; Compute-to-transfer ratio; Compute-to-sync. ratio; Percent of runtime in sync. Rows: Mandelbrot, Hash, Binomial, Monte Carlo, n-body.)

To verify this hypothesis, the amount of time spent running meta-functions was measured for each benchmark distributed across eight peers. The results are presented in Table 3.5. Running meta-functions and managing dependencies accounts for less than 0.1% of runtime in the worst case, so this is clearly not to blame for the reduced performance.

To understand the performance difference, the characteristics of the benchmarks must be better understood. If we refer again to Figure 3.4, we can see a runtime performance breakdown for the benchmarks, again for eight peers. It is no surprise that the benchmarks with the most time spent actually running the kernel see the most speedup. Table 3.6 shows some performance characteristics of interest for the benchmarks. For the purpose of calculating the ratios, compute is the sum of execution and overlapped transfer/execution, and transfer is the sum of buffer, transfer, and overlapped transfer/execution. In this manner, transfer accounts for all the memory transfer necessary to run the kernel, not just transfers between peers. This is a better representation of how much work is truly performed versus how much time is spent shuffling memory around. We can clearly see that benchmarks with high compute-to-transfer ratios see the best speedups. Figure 3.7 shows the speedups achieved with both frameworks against the compute-to-transfer ratio of each benchmark. We see a similar trend in both cases, with low ratios leading to poor speedups and high ratios leading to near-ideal scaling. However, the compute-to-transfer ratio is not a good predictor of performance relative to SnuCL. This is not surprising, considering the same amount of memory transfers and computation take place regardless of which framework is used.

A much better predictor of performance relative to SnuCL is the compute-to-synchronization ratio. Synchronization is not just composed of the time it takes to update the dependency information, which was already shown to be insignificant, but also the time it takes for all the peers to notify the master that they have done so, and any time spent waiting for the last peer. Synchronization time is on average 7x the meta-function execution time, and as high as 16x in the case of binomial. The round-trip latency on the cluster for an 8-byte transfer was measured to be 86 µs, which is always less than the meta-function execution time, so it alone does not account for the difference. This means that most of this time is spent waiting for all the peers to reach the synchronization points.

Figure 3.6: DistCL and SnuCL speedups.

This is a traditional synchronization overhead [52]. It also explains why relative performance degrades as the number of peers is increased. However, the synchronization time itself does not account for the entire performance difference. For example, consider hash, where synchronization accounts for only 1% of the runtime, yet it only manages 67% of SnuCL's performance. The remaining factor to consider is the fact that SnuCL actually translates the OpenCL kernel into a CUDA kernel. Even when running on the same hardware, it has been shown that large performance differences can remain. Work by Fang et al. [53] has shown that synthetic performance is not affected by the choice of CUDA or OpenCL. However, when running real applications, CUDA consistently outperforms OpenCL. Several reasons are listed, including faster kernel launches with CUDA and a better compiler leading to fewer instructions in the intermediate representation. This agrees with our results: the fastest kernels, where launching the kernel is a more significant portion of runtime, have the largest performance deficit. Work by Karimi et al. [54] also shows that transferring data between the host and GPU is faster with CUDA, on average by about 40%. From our tests, we can see that benchmarks that spend more time transferring buffers are further from SnuCL's performance. This difference between CUDA and OpenCL performance also explains why binomial saw slight super-linear scaling from 1 to 16 peers under SnuCL.

Figure 3.7: DistCL and SnuCL compared relative to compute-to-transfer ratio.

Using two different distributed OpenCL frameworks has shown that the compute-to-transfer ratio of a benchmark is the best predictor of performance scaling. SnuCL slightly outperforms DistCL, since it is statically scheduled and uses the CUDA runtime. In contrast, it is easier to port OpenCL applications to DistCL than to SnuCL, since there is no need to modify either the host or kernel code.

3.6 Conclusion

This chapter presented DistCL, a framework for distributing the execution of an OpenCL kernel across a cluster, causing that cluster to appear as if it were a single OpenCL device. DistCL shows that it is possible to efficiently run kernels across a cluster while preserving the OpenCL execution model. To do this, DistCL uses meta-functions that abstract away the details of the cluster and allow the programmer to focus on the algorithm being distributed. We believe the meta-function approach imposes less of a burden than any other OpenCL distribution system to date. Speedups of up to 29x on 32 peers are demonstrated. With a cluster, transfers take longer than they do with a single GPU, so more compute-intense approaches perform better. Also, certain access patterns generate fragmented memory accesses. The overhead of doing many fragmented I/O operations is profound and can be impacted by partitioning.

We also compared DistCL to another open-source framework, SnuCL, using five benchmarks. From a usability standpoint, DistCL has the advantage of being able to distribute unmodified OpenCL applications.

For compute-intensive benchmarks, performance between DistCL and SnuCL is comparable, but otherwise SnuCL has better performance. This difference cannot be fully attributed to the overhead of meta-functions, which account for a very small portion of the runtime. It is mostly due to DistCL requiring tighter synchronization between nodes and to the fact that SnuCL uses CUDA under the hood. The increased synchronization of DistCL also means that it does not scale as well as SnuCL as the number of peers in the cluster increases. Nevertheless, by introducing meta-functions, DistCL opens the door to distributing unmodified OpenCL kernels. DistCL allows a cluster with 2^14 processing elements to be accessed as if it were a single GPU. Using this novel framework, we gain insight into both the challenges and potential of unmodified kernel distribution. In the future, DistCL can be extended with new partitioning and scheduling algorithms to further exploit locality and more aggressively schedule subranges. DistCL is available at

Chapter 4

Selecting Representative Benchmarks for Power Evaluation

Benchmarks play an important role in computer architecture. They are the tools used to measure the performance of various designs. However, the relative performance of architectures can vary depending on the type of benchmark being used, as well as the input sets. This is why benchmark suites such as SPEC [55] contain multiple benchmarks and input data sets. Simulating an entire benchmark suite can be impractically time consuming. In an effort to reduce the number of benchmarks necessary to cover a similar breadth of workloads, statistical methods have been used to compare the similarity of benchmarks. The work by Phansalkar et al. [56] showed that it was possible to obtain similar information running just fourteen out of 29 SPEC benchmarks, when benchmarks were clustered by instruction mix and locality. While application benchmarks, such as SPEC, give an idea of the overall performance of an architecture, micro-benchmarks can be used to obtain information about individual components in a design [57].

Existing benchmarking focuses on performance, but there is also a need to consider power. The energy consumption of various operations and data must be understood in order to create a set of benchmarks that covers a wide range of possible power scenarios. This chapter presents the methodology used to create a representative set of micro-benchmarks that will be used to create a power model for the Fusion APU. Section 4.1 first describes the setup used to measure the power consumption of the Fusion APU. Section 4.2 describes the methodology used to determine which benchmark characteristics are important from a power perspective.

4.1 Power Measurements

In order to build a power model, we need to measure the power consumption of the components we want to model. There have been various methods proposed for measuring power consumption in computer hardware. The methods range in complexity and accuracy, from measuring full-system power at the wall [58] to measuring per-component power in a temperature-controlled environment to account for thermal leakage [59].

In this work, power was measured for the APU at the package level at normal operating temperatures. Benchmarks were kept short, and the APU was allowed to return to idle between benchmarks to prevent heat buildup in the chip. The measured value includes the power consumption of the CPU, GPU, and memory controller simultaneously.

To measure the power consumption of the APU, both the current and voltage delivered to the package were measured. Measurements were made using the DataQ DI-145 [60] and DI-149 [61] data acquisition units (DAQs). The DI-145 was used for all the benchmark clustering experiments and the GPU power measurements. The CPU measurements were made using the DI-149, because there was more variation in the CPU's power consumption and the CPU models were less accurate. Both units can measure differential voltage on up to four and eight separate channels, respectively. More detailed specifications are available in Table 4.1.

Table 4.1: Data Acquisition Unit Specifications
| DATAQ DI-145 | DATAQ DI-149
Channels | 4 | 8
Measurement range | ±10 V | ±10 V
Maximum sample rate | 240 Hz |
Interface | USB | USB

The DI-149 has an additional power overhead: due to the limited on-board storage of the DAQs, data must be periodically read, and since the DI-149 has a higher sampling rate, this happens more often. Figures 4.1 and 4.2 show measured idle power consumption using the DI-145 and DI-149, respectively. The spikes in power consumption occur when the CPU has to wake up and read data from the DAQ. For this benchmark, this only happened once for the DI-145 but multiple times for the DI-149. Due to the higher sampling frequency, the DI-149 also allows us to see the overshoot and undershoot caused by sudden increases and decreases in power consumption, respectively [62]. The power overhead of using the higher sampling rate on the DI-149 was found to be 3.2%.

Four channels were used to measure the APU's power consumption: two were used to measure current, one to measure voltage on the 12 V line, and another to measure voltage on the 5 V line. Current consumption was measured by inserting current sensors in the 12 V line of the 2x2 connector between the PSU and motherboard. In accordance with the ATX power specification [64], only this connector delivers power to the APU's voltage regulators. Figure 4.3 depicts a schematic of the MSI A75MA-G55 motherboard [63] used in our Fusion system. The 2x2 power connector is circled in red, while the APU socket is circled in blue. Since this connector has two wires for both the 12 V and ground (GND) lines, it required two sensors.

Figure 4.1: Idle power measurements done using the DI-145.

Figure 4.2: Idle power measurements done using the DI-149.

We used Allegro's ACS711 current sensors [65], mounted on carrier boards from Pololu Robotics & Electronics. The specifications of the sensor are available in Table 4.2.

Table 4.2: ACS711 Current Sensor Specifications
Input range: ±12.5 A
Sensitivity: 0.033 Vcc V/A (0.167 V/A at Vcc = 5 V)
Minimum logic voltage: 3 V
Maximum logic voltage: 5.5 V
Supply current: 4 mA
Internal resistance: 1.2 mΩ
Bandwidth: 100 kHz
Error: ±5%

To measure voltage on the 12 V line, a voltage divider was required, as the DAQ units can only make readings up to 10 V. A 100 kΩ potentiometer was used to divide the voltage in two. When using the DI-145, the maximum sampling rate of 240 Hz, 60 Hz per channel, was used. For the DI-149, the sampling rate was set to 938 Hz per channel, as this allows for precise power tracking but keeps the monitoring overhead low.

Power consumption is measured before the voltage regulator modules (VRMs). The efficiency of the VRMs used on the motherboard varies between 80 and 92% [66]. With our system, the 80% efficiency is only reached at idle. Under GPU-only loads efficiency ranges between 84 and 88%, while under CPU-only loads it ranges between 90 and 92%. Due to the small variation in efficiency for each device type, the modelling assumes a constant VRM efficiency.

Figure 4.4 shows a schematic of the measuring setup. The same setup was used for both the DI-145 and DI-149, since they are physically almost identical. The sensors and the potentiometer were soldered to a prototype board along with a Molex 8981 connector. The Molex connector was used to supply 12 V, 5 V, and GND to the prototype board reliably, while allowing the measuring setup to be removable. The 12 V lines to the APU were cut and each had both ends soldered to one of the current sensors. Terminal blocks were added to the prototype board for each of the signals that we wanted to measure and for a ground for each channel. A picture of the measuring setup installed in the system can be seen in Figure 4.5. The 2x2 power connector and APU socket are again circled in red and blue respectively, while the prototype board and DI-145 are circled in yellow.
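Putting these pieces together, each set of DAQ samples can be converted into an instantaneous package power figure. The sketch below is illustrative only: it assumes the nominal 0.167 V/A sensitivity at Vcc = 5 V scaled by the measured 5 V supply, a zero-current sensor output of Vcc/2 (typical for bidirectional Hall-effect sensors such as the ACS711), and the 2:1 divider on the 12 V channel; the names are hypothetical.

/* Sketch: convert one set of channel readings into APU package power.
 * v12_half - reading of the 12 V channel after the 2:1 divider (V)
 * vcc      - reading of the 5 V channel, which supplies the sensors (V)
 * s1, s2   - raw outputs of the two ACS711 sensors on the 12 V wires (V) */
static double package_power(double v12_half, double vcc, double s1, double s2)
{
    double sensitivity = 0.167 * (vcc / 5.0);   /* V per A, scaled with Vcc   */
    double i1 = (s1 - vcc / 2.0) / sensitivity; /* current in first wire (A)  */
    double i2 = (s2 - vcc / 2.0) / sensitivity; /* current in second wire (A) */
    return 2.0 * v12_half * (i1 + i2);          /* undo divider: P = V * I    */
}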

Figure 4.3: MSI A75MA-G55 motherboard schematic [63].

The DataQ is attached to the host computer using a USB port. Under Linux it is detected as a terminal device and can be written to or read from as such. A driver program was written that allows easy access to both the DI-145 and DI-149. The important public functions include a constructor and start and stop methods. The constructor detects whether the device is a DI-145 or DI-149 and initializes it, then configures the channels as requested. The start method begins data recording and also creates a new thread that reads data from the DAQ periodically to prevent its buffer from overflowing. The stop method stops recording and returns all the data that was captured in an array, as well as the sample count.

To reduce the impact of applications running on the system contributing to the power consumption, a system with a clean installation of Ubuntu LTS was used. The system was run without a display, so the GPU would have no additional tasks, and an SSH connection without X forwarding was used to access the system. Cool'n'Quiet [67], AMD's P-state [68] dynamic voltage and frequency scaling (DVFS) implementation, was disabled, meaning the processor was always operating at the maximum frequency. This prevents the frequency from varying throughout the course of a benchmark. However, it was not possible to disable C-states [68], so an idle CPU or GPU could still be clock gated.

To further simplify the collection of data, we wrote two benchmarking programs, one to run OpenCL kernels in isolation and another to run entire benchmarks directly. Both these programs reported runtime and performance counter values, as well as the power and energy consumed to run the benchmarks. More details on the performance counters used are available in a later chapter.

Figure 4.4: Schematic of the measuring setup (DATAQ DI-14X channels 1-4, the 5 V and 12 V ATX lines, two current sensors, the CPU power connector, the 100 kΩ divider, and ground).

The program used to run benchmarks directly started a timer and power measurements before immediately forking a process to run the benchmark. To run OpenCL kernels, an XML schema was developed to represent all the information needed to run benchmarks and perform power measurements. A single XML file can contain multiple kernel runs. Information that is common across all kernels only has to be specified once: the device type being used and the configuration of the power sensors. There is also per-kernel information: the kernel name, the kernel source file, and the kernel arguments. For any buffer arguments, the size is also required, as well as the kernel used to initialize the values in the buffer. This makes it possible to run a kernel with different inputs. The program also starts and stops the power measurements, the performance counters, and a timer. The output is an XML file with the same information as the input, but with power and timing information added to each kernel. Using an XML file to specify the benchmark makes it possible to run various kernels without needing to create a corresponding host program for each kernel. This reduced the likelihood of error, which would be high with hundreds of similar host programs. Finally, a parser was created to process the output XML file, convert the current readings into energy and power values, and prepare the data for graphing.
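As an illustration of that final conversion, the sketch below integrates a power trace into total energy and derives average power; the sample values are placeholders, and only the per-channel sampling rate is taken from the DI-149 setting described earlier.

```cpp
// Sketch of the post-processing step: integrating a trace of power samples into
// total energy and an average power figure. The trace values are placeholders;
// the per-channel rate matches the DI-149 setting used for the measurements.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> power_w = {32.1, 33.0, 35.4, 34.8, 33.2};  // watts
    double sample_rate_hz = 938.0;      // per-channel sampling rate
    double dt = 1.0 / sample_rate_hz;   // time between consecutive samples [s]

    double energy_j = 0.0;
    for (double p : power_w) energy_j += p * dt;  // rectangle-rule integration

    double avg_power_w = energy_j / (power_w.size() * dt);
    std::printf("E = %.4f J, average P = %.1f W\n", energy_j, avg_power_w);
    return 0;
}
```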

Figure 4.5: A picture of the measuring setup in action, with the power connector, APU socket, prototype board, and DI-145 labelled.

4.2 Micro-benchmark Selection

A micro-benchmark is designed to exercise one specific component of a processor. By using a representative set of micro-benchmarks, one that exercises every component of the processor, it becomes possible to characterize the entire processor. This can be done not just for performance but also for power. There are many factors that affect the power consumption of each component. Certain factors are static and will therefore be captured by any benchmark that targets the component, while others are dynamic and depend on the benchmark itself. The most obvious static factor is the die area of each component: the larger the component, the more power it can possibly draw. Dynamic factors include the activity ratio of the component and the type of data being operated on. To create a truly representative set of micro-benchmarks, all the factors that affect power must be considered.

Micro-benchmarks are evaluated based on the total energy required to run them. Energy is the ultimate concern, whether because of limited battery life in mobile devices or high power bills in data centres. Accounting for energy also allows the analysis of both power consumption and execution time. This means that a benchmark that consumes 10 W and takes 2 s to perform a certain task (20 J) is correctly identified as more efficient than a benchmark that consumes only 8 W but requires 3 s to complete the same task (24 J). Memory and compute micro-benchmarks were considered separately. Where possible, OpenCL benchmarks
