Characterizing Scalar Opportunities in GPGPU Applications

Zhongliang Chen and David Kaeli
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA

Norman Rubin
NVIDIA Corporation

Abstract

General Purpose computing with Graphics Processing Units (GPGPU) has gained widespread adoption in both the high performance and general purpose communities. Most GPU computation exploits a Single Instruction Multiple Data (SIMD) execution model. However, GPU execution typically pays little attention to whether the data operated upon by the SIMD units is the same or different. When SIMD computation operates on multiple copies of the same data, redundant computations are generated. This provides an opportunity to improve efficiency by broadcasting the result of a single computation to multiple outputs. To better serve these operations, modern GPUs are being equipped with scalar units. SIMD instructions that operate on the same input data operands can then be directed to execute on scalar units, requiring only a single copy of the data and leaving the data-parallel SIMD units available to execute non-scalar operations. In this paper, we first characterize a number of CUDA programs taken from the NVIDIA SDK to quantify the potential for scalar execution. We observe that 38% of static SIMD instructions are recognized by the compiler as operating on the same data, and their dynamic occurrences account for 34% of the total dynamic instruction execution. We then evaluate the impact of scalar units on a heterogeneous scalar-vector GPU architecture. Our results show that scalar units are utilized 51% of the time during execution, though their use places additional pressure on the interconnect and memory, as shown in the results of our study.

I. INTRODUCTION

General Purpose computing with Graphics Processing Units (GPGPU) is an attractive platform for a growing number of applications. GPUs were traditionally designed to be stream processors for 3-D computer graphics, though they can also be effectively used as many-core data-parallel processors capable of high execution throughput and memory bandwidth. Figure 1 compares the single-precision peak performance of GPUs and CPUs [1]-[3]. As shown in the figure, in 2008 GPUs were 13x faster than CPUs. In the past 5 years, this gap has widened; currently GPUs provide 17x greater computational horsepower than CPUs. Today, GPUs are being deployed in a wide range of acceleration roles for general purpose applications.

Fig. 1. Performance comparison between GPUs and CPUs [1]-[3]

The execution model for modern GPUs is based on the Single Instruction Multiple Data (SIMD) model, which allows multiple processing elements to perform the same operation on multiple data simultaneously. In terms of system architecture, a GPU device is an array of multiprocessors (NVIDIA's streaming multiprocessors or AMD's compute units), each of which contains SIMD units and on-chip shared memory (AMD's local data store). A SIMD unit further contains an array of basic processing elements, each containing one ALU. Shared memory provides the GPU with the ability to share data among processing elements. Also, the multiprocessor supports barrier operations to provide synchronization at the thread-block/work-group level. The two most popular GPU programming models are CUDA (Compute Unified Device Architecture) [4] and OpenCL (Open Computing Language) [5]. They support both data-parallel and task-parallel models.
The most commonly exploited model on the GPU is the data-parallel model, which is the focus of this paper. In a data-parallel model, computation is represented by a sequence of instructions that execute on a number of indexed threads (OpenCL's work-items). All of the threads are explicitly or implicitly divided into thread blocks (OpenCL's work-groups). Within a thread block, threads share data through fast memory (CUDA's shared memory or OpenCL's local memory) and can be synchronized. When a GPU program is executed, thread blocks are first scheduled onto multiprocessors, and the individual threads in a block are further scheduled onto SIMD units in the multiprocessor. Each thread is processed by one processing element on a SIMD lane.
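To make the model concrete, the following minimal CUDA sketch (our own illustration, not one of the SDK benchmarks studied later) has each thread block cooperatively reduce a tile of an array in shared memory, with __syncthreads() providing the block-level barrier mentioned above.

```cuda
// Each block reduces BLOCK_SIZE consecutive elements of `in` into one
// partial sum written to out[blockIdx.x]. The shared-memory tile is the
// per-block fast memory; __syncthreads() is the thread-block barrier.
#define BLOCK_SIZE 256

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                 // all loads visible to the block
    // Tree reduction within the block; stride is uniform across the block,
    // so every thread reaches each barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}
```

Such a kernel would be launched as, e.g., blockSum<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n), where d_in and d_out are device pointers; the block size must be a power of two for the tree reduction to be correct.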

While data-parallel processing can achieve high speedups, the standard data-parallel model does not consider the case when the input operands to a SIMD instruction are all the same. Threads are mapped to SIMD units no matter what data they operate on. When the computation is performed with multiple copies of the same data, the parallel operations can be reduced to Single Instruction Single Data (SISD) execution, which we refer to as a scalar opportunity. If we continue to use the SIMD hardware for these SISD operations, we are wasting resources and burning unnecessary power. Instead, we turn to a scalar-vector GPU architecture armed with both scalar and SIMD (i.e., vector) units. On the scalar-vector GPU architecture, scalar opportunities are executed on scalar units so that the SIMD engines can be used to execute true SIMD operations.

A good example of just such an architecture is AMD's Graphics Core Next (GCN) architecture. This design adds a scalar coprocessor to each compute unit. The scalar coprocessor has a fully functional integer ALU, with independent instruction arbitration and decode logic, and also a scalar register file. This new unit helps execute a variety of control flow instructions, including jumps, calls, and returns. The scalar coprocessor presents new opportunities in terms of performance and power efficiency [6].

Fig. 2. CUDA programming model [4]

Our scalar-vector GPU design aims to be more flexible than GCN. We designed our scalar unit to handle both integer and floating-point instructions. Also, the scalar unit does not need to have a scalar register file (the scalar units can use the vector register file in order to incur fewer hardware changes). The proposed architecture should be capable of effectively utilizing on-board scalar units to serve scalar opportunities in applications at low cost.

This paper provides a first glimpse of the scalar opportunities present in GPU applications. Furthermore, we evaluate the potential impact that adding a scalar unit can have on a conventional GPU architecture. We examine the challenges and opportunities for various design alternatives on different microarchitectural components, including the multiprocessor pipeline, interconnection network, and memory subsystem. To the best of our knowledge, this paper is the first attempt to evaluate scalar opportunities in the microarchitecture of a GPU. This paper makes the following contributions. From the perspective of software, our scalar design identifies scalar opportunities in GPU applications using static analysis, and uses this information to guide scalar unit design. From the perspective of hardware, we evaluate the impact of scalar opportunities when run on a scalar-vector GPU architecture. We discuss and also address opportunities and challenges introduced by scalar opportunities.

This paper is organized as follows. Section II presents background on GPU programming models and architecture. Section III introduces and defines scalar opportunities in GPGPU applications. Section IV describes the proposed scalar-vector GPU architecture and various design alternatives, and also discusses implementation details. Section V presents our experimental setup and modeling results. Section VI discusses related work. The paper is concluded in Section VII.

Fig. 3. NVIDIA Fermi GPU architecture [11]

II. BACKGROUND

A. GPGPU Programming Models

There are two popular models widely used today for GPU programming: 1) CUDA and 2) OpenCL. CUDA is a general purpose parallel programming model, introduced by NVIDIA [4].
As shown in Figure 2, CUDA allows the programmer to partition a problem into multiple subproblems that can be solved independently in parallel by blocks of threads. Each subproblem can be further subdivided into finer elements that can be solved cooperatively in parallel by all threads within the block. A CUDA program usually has two parts: 1) host code running on the CPU and 2) device code running on the GPU. The compilation of device code works as follows: it is first extracted by the NVIDIA CUDA C Compiler (NVCC) [7], and then compiled to intermediate PTX (Parallel Thread eXecution) [8] code. The PTX code is further compiled and optimized at run time by the NVIDIA proprietary Optimized Code Generator [9] to native SASS instructions (the NVIDIA ISA) [10].

The OpenCL programming model [5], which is managed by the Khronos Group, is an open standard for general purpose parallel programming across CPUs, GPUs, and other devices, giving programmers a portable language to target a range of heterogeneous processing platforms. In OpenCL, data is mapped to work-items in an index space, and all work-items are explicitly or implicitly divided into work-groups.

B. GPU Architecture

GPUs usually adopt a massively parallel model to achieve high throughput. Most of the device real estate on a GPU is dedicated to computation rather than control logic or cache. The NVIDIA Fermi architecture [11], shown in Figure 3, features up to 16 streaming multiprocessors, each of which has 32 CUDA cores, 16 load/store units, and 4 special function units. Each CUDA core has a fully pipelined ALU, and can execute an ALU instruction per clock for each thread. Each load/store unit allows source and destination addresses to be calculated per thread per clock. Each special function unit can execute a transcendental instruction per thread per clock [11]. When a GPU kernel is launched, the global scheduler distributes thread blocks to the local schedulers in each streaming multiprocessor. Threads are further scheduled onto SIMD units in warps (groups of 32 threads). Each multiprocessor has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Two groups of 16 cores each are used to execute two instructions from two different warps per cycle.

Fig. 4. AMD Graphics Core Next compute unit architecture [6]

The homogeneous SIMD-only architecture described above recently underwent a major microarchitectural change. Figure 4 shows the state-of-the-art AMD Graphics Core Next compute unit architecture [6], where scalar units are integrated into compute units, introducing heterogeneity within the GPU. Unlike standard SIMD units, the scalar units provide fast and efficient integer SISD execution. They are mainly used to expedite address generation and control flow execution in GCN. Furthermore, SIMD units can execute other SIMD instructions at the same time as the scalar units execute SISD operations. This heterogeneous architecture provides more flexibility for GPU applications.

III. SCALAR OPPORTUNITIES IN GPU APPLICATIONS

In GPU programming models, computation is represented by a sequence of SIMD instructions, each of which operates on vector operands across multiple threads. Each component of a vector operand participates in a single computation on one ALU on the GPU. To address thread divergence, an active mask (defined as a bit map) can be used to indicate whether an individual thread is active or not. If a thread is active, its results are confirmed and kept in the updated microarchitectural state. Otherwise, the results are simply discarded.

We define a scalar opportunity as a SIMD instruction operating on the same data in all of its active threads. A typical example of a scalar opportunity is loading a constant value: each active thread loads the same value from memory and stores it in the corresponding component of the destination vector register, so all of those components end up holding the same value.

Scalar opportunity analysis can be performed at different abstraction levels. Compiler-level analysis is more flexible and needs zero hardware modifications, but it can only identify scalar opportunities within a thread block or coarser structure, since intra-thread-block information is dynamic. Also, it may be conservative since the compiler has to consider all possible control flow paths.
Architecture/microarchitecture-level analysis is more informative since it is equipped with run-time information, and can handle scalar opportunities within a finer-grained structure such as a warp, but at a hardware cost. In this paper we work at the compiler level, mainly due to its lower cost and high flexibility.

We carry out the characterization of scalar opportunities on NVIDIA PTX code for the following two reasons. One is that PTX is stable across multiple GPU generations, which makes our approach more general. The other is that there are several existing PTX research tools available in the literature. However, we claim that our analysis is independent of any specific SIMD programming model, and thus applies to other SIMD-based instruction sets besides PTX, including NVIDIA SASS, AMD IL, and AMD ISA.

Fig. 5. CUDA and PTX code of vector addition

To better understand scalar opportunities in GPU applications, consider vector addition as an example, as shown in Figure 5. As seen in the CUDA code, the variable i is first initialized to the global thread index, which is computed using the thread block dimension, thread block index, and local thread index within the thread block. The corresponding PTX code uses three vector registers, r1, r0, and r3, to keep track of those three operands, respectively. The thread block dimension and index are the same for every thread in a thread block, so the first two PTX instructions are scalar opportunities (marked 1 at the beginning of the line). The following multiply instruction uses r0 and r1 to compute an intermediate value, and is also a scalar opportunity. Afterwards, the fourth instruction moves the thread index to r3, which obviously processes different data for each thread. It is therefore a true SIMD instruction (marked 0 at the beginning of the line). Then a conditional branch follows.
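Figure 5 itself is not reproduced in this text, so the following CUDA sketch restates the same vector-addition kernel; the comments mark how the per-thread-block analysis described above would classify each part, with the corresponding Figure 5 registers noted for orientation (the exact PTX the compiler emits may of course order these instructions differently).

```cuda
// Vector addition in the style of the Figure 5 example. "uniform" below
// means: the same value for every active thread in the thread block.
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = blockDim.x      // uniform (r1 in Fig. 5) -> scalar opportunity
          * blockIdx.x      // uniform (r0); the multiply is also a scalar opportunity
          + threadIdx.x;    // per-thread value (r3)  -> true SIMD instruction
    if (i < N) {            // the load of parameter N is uniform as well
        C[i] = A[i] + B[i]; // active threads see different data: not a scalar opportunity
    }
}
```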

Suppose that the size N of the arrays is not evenly divisible by the thread block size N_B, and the arrays can have different data in their elements. Then the last thread block holds the threads [N_B*floor(N/N_B), N_B*ceil(N/N_B) - 1], where the threads [N_B*floor(N/N_B), N - 1] execute the if block (BB_1_3), while the threads [N, N_B*ceil(N/N_B) - 1] execute the else block (BB_1_4). When the if block is executed, the threads [N, N_B*ceil(N/N_B) - 1] are inactive, which explains why the load-parameter operations are scalar opportunities. However, the addition is not a scalar opportunity, since the active threads operate on data that can be different.

IV. SCALAR-VECTOR GPU ARCHITECTURE

Scalar opportunities rely on heterogeneous scalar-vector GPU architectures equipped with SIMD and scalar units to improve resource utilization, performance, and power. The AMD Graphics Core Next architecture is currently the only commercially available example of such an architecture. Its scalar unit design includes an integer ALU to execute arithmetic and logical integer operations, and also a scalar register file to hold the operands for scalar opportunities. We extend the functionality of the AMD scalar unit in our scalar-vector architecture implementation to support any non-transcendental instructions, including integer, floating-point, and others. It can perform general computation in addition to address generation and condition manipulation, which implies that more scalar opportunities can be executed on our scalar unit. Such design choices may result in higher design complexity or added latency. However, some overhead is easily tolerable as long as overall execution is more efficient due to the addition of the scalar units.

Another major modification is that we do not support separate scalar and vector register files; instead, instructions always use a single vector register file. The primary advantage of this scheme is that we avoid expensive data movement between two register files. Consider a SIMD instruction that has a data dependence on a previous scalar opportunity. If the scalar unit and SIMD unit have separate register files, we have to design a mechanism to broadcast the scalar results to the vector source operands required by the SIMD instruction, which could limit the benefits of scalar processing. Another benefit is that the vector portion of the existing instruction set, designed for a traditional SIMD execution model, does not have to be changed in order to benefit from scalar opportunities. Designers only need to focus on the new scalar instructions. The downside of employing a combined scalar-vector register file is that we may need to add extra read ports to the vector registers. Adding read ports can be expensive and incurs additional power. In order to limit the impact of this choice, we employ NVIDIA's operand collector architecture [12] in our design. The operand collector is used to simulate a multi-ported memory using lower port count memory banks. It uses multiple collector units to buffer the operands of instructions, and a bank request arbitration unit to schedule register accesses from the collector units. An instruction can be issued from a collector unit to an execution unit when all of its operands are ready. We add a collector unit for each scalar unit. The collector unit stores the warp identifier, instruction opcode, register identifiers, and operands. The operand field stores the operands for only one thread, since a scalar unit only needs the scalar operands for any one of the active threads.
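The paper does not spell out the layout of this per-scalar-unit collector entry; the C-style sketch below is only our reading of the fields just listed, with made-up names and widths.

```cuda
#include <cstdint>

// Hypothetical layout of one collector-unit entry for a scalar unit.
// Unlike a SIMD collector entry, each source operand slot buffers the
// value for just one active thread, not all 32 lanes of a warp.
struct ScalarCollectorEntry {
    uint8_t  warp_id;        // which warp issued the instruction
    uint16_t opcode;         // instruction opcode
    uint8_t  dst_reg;        // destination register identifier
    uint8_t  src_reg[3];     // source register identifiers
    uint32_t src_value[3];   // operand values for a single active thread
    bool     src_ready[3];   // set as the arbitration unit services each bank read
};
```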
Such a scalar collector unit incurs much less storage overhead than the collector units for SIMD units. Moreover, the scalar unit reads no more than one component of a vector register, and the arbitration unit is able to freely choose the optimal component so that the current read requests in the queue incur fewer bank conflicts. Since some components may correspond to inactive threads, the arbitration unit uses thread divergence information to read a component belonging to an active thread.

We implemented our new scalar unit design and the related microarchitectural components in the GPGPU-Sim simulator [13]. The model is based on an NVIDIA Fermi GPU architecture. GPGPU-Sim is a cycle-level GPU performance simulator, composed of Single Instruction Multiple Thread (SIMT) cores connected via an on-chip interconnection network to memory partitions that interface to graphics DRAM (dynamic random-access memory). A SIMT core models a highly multithreaded pipelined SIMD multiprocessor very similar in design to an NVIDIA streaming multiprocessor or an AMD compute unit. A processing element corresponds to a lane within an ALU pipeline in a SIMT core [13].

As shown in Figure 6, a SIMD instruction is executed on a SIMT core as follows. First, the instruction is fetched from the instruction cache, decoded, and then stored in the instruction buffer. The instruction buffer is statically partitioned so that all warps running on the SIMT core have dedicated storage to place instructions. Then the issue logic checks all the valid instructions, which are decoded but not issued, to establish issue eligibility. A valid instruction can be issued only if the following three requirements are all satisfied: (1) its warp is not waiting at a barrier, (2) it passes the Write After Write (WAW) and Read After Write (RAW) hazard checks in the scoreboard, and (3) the operand access stage of the instruction pipeline is not stalled.
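A minimal sketch of this eligibility test follows; the types and member names are ours (GPGPU-Sim's real classes are organized quite differently), and only the three conditions just listed are modeled.

```cuda
// Illustrative only: the three issue-eligibility conditions from the text,
// not GPGPU-Sim's actual interfaces.
struct WarpState      { bool waiting_at_barrier; };
struct Scoreboard     { bool waw_or_raw_hazard;  };  // evaluated per pending instruction
struct OperandCollect { bool stage_stalled;      };

bool can_issue(const WarpState &warp, const Scoreboard &sb,
               const OperandCollect &oc) {
    if (warp.waiting_at_barrier) return false;  // (1) warp blocked at a barrier
    if (sb.waw_or_raw_hazard)    return false;  // (2) scoreboard detects a WAW/RAW hazard
    if (oc.stage_stalled)        return false;  // (3) operand access stage is stalled
    return true;                                //     otherwise eligible to issue
}
```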

Memory instructions are issued to the memory pipeline. The other instructions always prefer SIMD units over special function units (SFUs), unless they must be executed on the special function units. The pipeline also maintains a SIMT stack per warp to handle branch divergence. Moreover, an operand collector offers a set of buffers and arbitration logic used to provide the appearance of a multi-ported register file using multiple banks of single-ported RAMs. The buffers hold the source operands of instructions in collector units. When all the operands are ready, the instruction is issued to an execution unit.

Fig. 6. Overview of the scalar-vector GPU architecture implemented in GPGPU-Sim

Figure 6 shows the major modifications we made to GPGPU-Sim to model our scalar-vector design. These changes include:

Execution units. We added a configurable number of scalar units, each of which is pipelined and can execute all types of ALU instructions except transcendentals. They have the same speed as the SIMT units (i.e., they execute one instruction per cycle). Each unit has an independent issue port from the operand collector, and shares the same output pipeline register as the other execution units that are connected to a common writeback stage.

Operand collector. We added a configurable number of collector units to each scalar unit. These collector units have a similar structure to those for the SIMD units, but store the operands for only one active thread.

Issue logic. We modified the warp scheduler so that scalar instructions can be issued to scalar units at the same time SIMD instructions are issued to SIMD units; otherwise, the instructions would never run in parallel on both kinds of units. The issue width of our simulator is configurable. Moreover, scalar opportunities should be able to run on SIMD units as well for flexibility, though we may choose to restrict this option when optimizing for power.

Configuration options. The configurable parameters described above are added as new configuration options for GPGPU-Sim.

V. MODELING RESULTS

In this section, we describe our experimental setup for characterizing scalar opportunities. As mentioned before, we first identify scalar opportunities using a compiler pass, and then simulate the program on a modified GPU simulator, using the information gathered by the compiler pass to investigate which microarchitectural components scalar opportunities utilize.
Next, data dependence graph reachability analysis starts from apparent divergent variables (e.g.,thread IDs) and the variables defined by atomic instructions, as shown in Figure 7(a). All the variables reached are marked divergent (black circles); the others are uniform (white circles). When performing variable divergence analysis on a data dependence graph, we add a tag to each variable indicating whether it is uniform or divergent. Then we carry out static scalar opportunity analysis on a control flow graph using the tags previously generated, as illustrated in Figure 7(b). A SIMD instruction is recognized as a scalar opportunity (white box) if and only if all of its source operands are uniform. Static statistics are insufficient to arrive at the best use of the scalar units. For example, assume a program has 10 static instructions, where 5 instructions are non-scalar opportunities in a loop executing 100 iterations, and the others are scalar opportunities out of the loop. Then the percentage of static scalar opportunities is 5/10=50%, while that of dynamic scalar opportunities is 5/(5+5*100)=1%. Scalar units will be underutilized if a program has limited dynamic scalar opportunities. Hence, we also count dynamic occurences of static scalar opportunities. 229

Note that static analysis does not use run-time information, and thus it may be conservative. Specifically, uniform vectors may be recognized as divergent by the variable divergence analysis, so some scalar opportunities are not detected. For instance, an instruction subtracting a divergent vector from itself produces a uniform vector of zeros. However, because the result (i.e., 0) has a data dependence on two divergent source vectors in the data flow graph, it is labeled a divergent vector. If a following instruction adds this result to a uniform vector, it will be identified as a non-scalar opportunity, since the 0 was recognized as a divergent vector. Another example is a conditional branch that is taken (or not taken) by all the threads when the program is executed, i.e., thread divergence may never actually happen. The uniform variables defined between such branches and their immediate post-dominators then stay uniform; however, they are recognized as divergent since the compiler has to consider all possibilities. Dynamic analysis can generate run-time statistics under those circumstances. Nevertheless, dynamic analysis requires hardware modification, which incurs a high cost, and run-time information may depend heavily on program inputs, resulting in statistics that are specific to certain inputs. Thus we do not consider dynamic analysis in this paper.

We added a compiler pass to GPU Ocelot [15] to perform our static analysis. Ocelot is a modular dynamic compilation framework for heterogeneous systems, providing various backend targets for CUDA programs and analysis modules for the PTX virtual instruction set. In the experiments, we first compiled CUDA source code to PTX code, and then used Ocelot to generate flags indicating whether a static instruction is a scalar opportunity and what its instruction type is (e.g., integer, floating-point, memory, etc.). This information is read later during simulation.

Fig. 7. An example of scalar opportunity analysis (vector addition)

B. Results

We ran 20 CUDA benchmarks chosen from the NVIDIA CUDA SDK version 4.0, and followed the methodology presented in the previous subsection to collect our results. As shown in Table I, the benchmarks range from scientific algorithms (e.g., the discrete cosine transform) to financial applications (e.g., binomial option pricing). In this subsection, we first characterize the number of scalar opportunities in these benchmarks, and then discuss their impact on the scalar-vector GPU microarchitecture components.

We count the number of static scalar opportunities using Ocelot, and profile their dynamic occurrences during simulation using GPGPU-Sim. As Figure 8 shows, 38% of static SIMD instructions on average are detected by the compiler as scalar opportunities. These results imply that scalar opportunities are always present in GPGPU applications, even when we use SIMD programming models to write and optimize our programs. We also break down all the static and dynamic scalar opportunities into individual instruction types, which are shown in Figures 9 and 10, respectively. Parallelism instructions include barrier synchronization, reduction operations on global and shared memory, and vote instructions. Special function instructions are transcendental operations running on special function units. We can see that most scalar opportunities are integer, floating-point, or memory operations.

TABLE I. BENCHMARKS

BlackScholes: Evaluation of fair call and put prices for a given set of European options by the Black-Scholes formula
MersenneTwister: Mersenne Twister random number generator and Cartesian Box-Muller transformation
MonteCarlo: Evaluation of the fair call price for a given set of European options using a Monte Carlo approach
SobolQRNG: Sobol quasirandom sequence generator
binomialoptions: Evaluation of the fair call price for a given set of European options under the binomial model
convolutionseparable: A separable convolution filter of a 2D signal with a Gaussian kernel
dct8x8: Discrete Cosine Transform for blocks of 8 by 8 pixels
dwthaar1d: Discrete Haar wavelet decomposition for 1D signals with a length that is a power of 2
eigenvalues: A bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size
fastwalshtransform: Naturally (Hadamard)-ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power of two) lengths
histogram256: 256-bin histogram
histogram64: 64-bin histogram
mergesort: Merge sort algorithm
quasirandomgenerator: Niederreiter quasirandom sequence generator and inverse cumulative normal distribution function for standard normal distribution generation
reduction: Summation of a large array of values
scalarprod: Scalar products of a given set of input vector pairs
scan: Parallel prefix sum (given an array of numbers, compute a new array in which each element is the sum of all the elements before it in the input array)
sortingnetworks: Bitonic sort and odd-even merge sort algorithms
transpose: Matrix transpose
vectoradd: Vector addition

Fig. 8. Percentage of static scalar opportunities and their dynamic occurrences

Fig. 9. Instruction type breakdown of static scalar opportunities

Fig. 10. Instruction type breakdown of dynamic scalar opportunities

Since memory instructions are always executed on the load/store units, we need to enable scalar unit support for at least integer and floating-point instructions in order to obtain most of the benefits available by adding scalar units. Also note that transcendental calculations such as sine and cosine are all normal SIMD instructions across the benchmarks, which suggests that a simple scalar unit is enough (there is no need to add transcendental support). In addition, atomic instructions cannot be executed on scalar units, since they contain memory operands and need to access the load/store units.

In some benchmarks, such as binomialoptions, the percentage of static scalar opportunities is significantly higher than the percentage measured during runtime. This trend implies that scalar opportunities in those benchmarks are likely present in the initialization phase of the code, and thus those programs will not benefit from scalar opportunities in their main loops. On the other hand, in some benchmarks such as histogram256, the percentage of static scalar opportunities is much lower than the percentage executed. In such cases, scalar opportunities are very likely to be present in the main loops of these benchmarks. A similar scenario can be seen for selected types of scalar opportunities. Take BlackScholes for example, where floating-point operations account for 65% of all the static scalar opportunities, but only 7% of all the dynamic scalar opportunities. The main reason is that these instructions are located in the initialization phase of the code.
In contrast, 17% of the static scalar opportunities are integer instructions, while their dynamic occurrences account for 92% of the dynamic scalar opportunities. Therefore, integer scalar opportunities determine the benefits of the scalar units to a large degree.

We evaluate the utilization of four types of execution units on the scalar-vector GPU architecture: scalar units, SIMD units, special function units, and load/store units. As illustrated in Figure 11, scalar units have high utilization, with 51% occupancy on average across all of the benchmarks. Note that the presence of a large number of scalar opportunities does not directly imply higher utilization of the scalar units, since effective exploitation of these units depends on other microarchitecture features. For example, if the load/store units cannot keep up with supplying source operands to the scalar opportunities, execution will be stalled.

Fig. 11. Utilization of four types of execution units

Fig. 12. Utilization of SIMD units and load/store units in homogeneous and scalar-vector GPU architectures

Fig. 13. Stall cycle difference of the scalar-vector GPU architecture over the homogeneous architecture

From the figure, we can also see that the utilization of scalar units is generally lower than that of SIMD units. The reason is that we schedule scalar opportunities on available SIMD units when all scalar units are busy. We schedule a scalar opportunity onto every lane of a SIMD unit for execution. This scheduling policy decreases stalls, in spite of resulting in less efficient execution. By comparing the utilization of SIMD units and load/store units in the homogeneous and scalar-vector GPU architectures, we can see in Figure 12 that the utilization of SIMD units decreases when we introduce the additional scalar units. However, the utilization of the load/store units remains the same. Even so, scalar units can place more pressure on memory, as explained below.

Scalar opportunities can put pressure on the multiprocessor pipeline, interconnection network, and memory subsystem, as shown in Figure 13. Multiprocessor pipeline stalls can be caused by shared memory bank conflicts, non-coalesced memory accesses, or serialized memory accesses. Interconnection network stalls happen when DRAM channels cannot accept requests from the interconnect. Memory stalls result from interconnection network congestion when DRAM channels cannot send packets. In the figure, a positive number implies that stall cycles increase on the scalar-vector GPU architecture over the homogeneous GPU architecture, while a negative number indicates that they decrease.

The benchmarks show varied results. BlackScholes and quasirandomgenerator place more pressure on the multiprocessor pipelines, while MersenneTwister places less. Moreover, for binomialoptions, the scalar units relieve much of the pressure on the interconnect. In contrast, several other benchmarks, including dwthaar1d and scalarprod, experience more interconnect stalls. Additionally, some benchmarks such as MonteCarlo place additional stress on memory. The pressure largely results from the parallel execution of the scalar units and SIMD units: additional source operands have to be read to fill the scalar unit pipeline.

This can result in more traffic on the memory and interconnect than in the SIMD-only architecture. When designing a scalar-vector GPU architecture, we need to keep in mind that when we add scalar units to the microarchitecture, we may need to increase interconnect and memory bandwidth to guarantee data delivery to these units. We need to consider the entire data path so that we do not create another hotspot in the microarchitecture.

VI. RELATED WORK

Previous research on divergence in GPGPU applications offered us helpful ideas. Coutinho et al. proposed variable divergence analysis and optimization algorithms [14]. They introduced a static analysis to determine which vector variables in a program have the same values for every processing element. They also described a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Their analysis is used in our work to determine scalar opportunities.

Collange et al. presented a technique for dynamic detection of uniform and affine vectors in GPGPU computations [16]. They concentrated on two forms of value locality specific to vector computations on GPUs. The first form corresponds to the uniform pattern present when computing conditions, which avoids divergence in sub-vectors. The second form corresponds to the affine pattern used to access memory efficiently. They proposed using both forms of value locality, combined with hardware modifications, to significantly reduce the power required for data transfers between the register file and the functional units. They also looked at how to reduce the power drawn by the SIMD arithmetic units. Their work analyzed variables only, whereas our approach also considers computations. Collange later proposed a mechanism to identify scalar behavior in CUDA kernels [17]. This prior work describes a compiler analysis pass that statically identifies several kinds of regular patterns that can occur between adjacent threads, including common computations, memory accesses to consecutive memory locations, and uniform control flow. While it is of high quality, this prior work did not consider a scalar-vector GPU architecture.

Stratton et al. described a microthreading approach to efficiently compile fine-grained single-program multiple-data threaded programs for multicore CPUs [18]. They enabled redundancy removal in both computation and data storage as a primary optimization, where variance analysis discovers which portions of the code produce the same value for all threads. Our work differs from theirs in that we target scalar-vector GPU architectures rather than multicore CPUs.

Hong and Kim proposed an integrated power and performance modeling system for GPUs [19], which uses an empirical modeling approach to model GPU power. They used the power and timing models to predict performance per watt and also the optimal number of cores to achieve energy savings. Their work is based on a conventional SIMD-only GPU architecture, while we focus on a scalar-vector GPU architecture.

VII. CONCLUSION

In this paper we characterized scalar opportunities in GPU applications using a quantitative approach. The goal was to motivate the need for GPU architectures to evolve from homogeneity (i.e., SIMD only) to heterogeneity (i.e., scalar plus SIMD). Our static-analysis-guided approach is flexible and low cost, requiring zero hardware modifications.
We have designed and implemented a detailed heterogeneous scalar-vector GPU architecture on a cycle-level simulator, and evaluated hardware resource utilization using commonly used GPU benchmarks. We have also evaluated the impact of scalar opportunities on the multiprocessor pipelines, interconnection networks, and memory subsystems. The presence of scalar opportunities in common applications provides us with opportunities to pursue performance improvements and reduce power consumption. In order to achieve those efficiencies, we need to carefully tune the hardware components. Our future work will investigate how to build and optimize scalar-vector GPU architectures to maximize the benefits from scalar opportunities in a wider range of GPGPU applications.

ACKNOWLEDGMENT

The authors would like to thank Yoav Etsion, who helped to improve the quality of our final paper. This work is supported by an NSF EEC Innovations Program award, and by both AMD and NVIDIA. The authors would also like to thank the GPGPU-Sim and Ocelot teams for the use of their toolsets.

REFERENCES

[1] (2012) Comparison of AMD graphics processing units. [Online].
[2] (2012) Comparison of Nvidia graphics processing units. [Online].
[3] Intel Corporation. (2012) Intel microprocessor export compliance metrics. [Online].
[4] NVIDIA Corporation. (2012) NVIDIA CUDA C programming guide, version 4.2.
[5] Khronos OpenCL Working Group. (2011) The OpenCL specification, version 1.2.
[6] Advanced Micro Devices, Inc. (2012) AMD Graphics Cores Next (GCN) architecture whitepaper.
[7] NVIDIA Corporation. (2011) The CUDA Compiler Driver NVCC.
[8] (2011) PTX: Parallel Thread Execution ISA, version 2.3.
[9] M. Murphy. (2011) NVIDIA's experience with Open64.
[10] NVIDIA Corporation. (2011) cuobjdump application note.
[11] (2009) NVIDIA Fermi compute architecture whitepaper, version 1.1. [Online].
[12] S. Liu, E. Lindholm, M. Y. Siu, B. W. Coon, and S. F. Oberman, "Operand collector architecture," U.S. Patent B2, Nov. 16.
[13] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proceedings of the IEEE 2009 International Symposium on Performance Analysis of Systems and Software, Boston, MA, USA, Apr. 2009.

[14] B. Coutinho, D. Sampaio, F. Pereira, and W. Meira, "Divergence analysis and optimizations," in Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, Galveston Island, TX, USA, Oct. 2011.
[15] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, New York, NY, USA, Sep. 2010.
[16] S. Collange, D. Defour, and Y. Zhang, "Dynamic detection of uniform and affine vectors in GPGPU computations," in Proceedings of the 3rd Workshop on Highly Parallel Processing on a Chip, Delft, The Netherlands, Aug. 2009.
[17] S. Collange, "Identifying scalar behavior in CUDA kernels," INRIA, France, Tech. Rep., Jan.
[18] J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy, Z. Hu, and W.-m. W. Hwu, "Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs," in Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, New York, NY, USA, Apr. 2010.
[19] S. Hong and H. Kim, "An integrated GPU power and performance model," in Proceedings of the 37th Annual International Symposium on Computer Architecture, New York, NY, USA, Jun. 2010.


More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

A Compile-Time Managed Multi-Level Register File Hierarchy

A Compile-Time Managed Multi-Level Register File Hierarchy A Compile-Time Managed Multi-Level Register File Hierarchy Mark Gebhart 1 Stephen W. Keckler 1,2 William J. Dally 2,3 mgebhart@cs.utexas.edu skeckler@nvidia.com bdally@nvidia.com 1 The University of Texas

More information

Arquitetura e Organização de Computadores 2

Arquitetura e Organização de Computadores 2 Arquitetura e Organização de Computadores 2 Paralelismo em Nível de Dados Graphical Processing Units - GPUs Graphical Processing Units Given the hardware invested to do graphics well, how can be supplement

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Lecture 27: Multiprocessors Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model

More information

Computer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2017 Review Thread, Multithreading, SMT CMP and multicore Benefits of

More information

Tuning CUDA Applications for Fermi. Version 1.2

Tuning CUDA Applications for Fermi. Version 1.2 Tuning CUDA Applications for Fermi Version 1.2 7/21/2010 Next-Generation CUDA Compute Architecture Fermi is NVIDIA s next-generation CUDA compute architecture. The Fermi whitepaper [1] gives a detailed

More information

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008 Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Is There A Tradeoff Between Programmability and Performance?

Is There A Tradeoff Between Programmability and Performance? Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable

More information

Overview of ROCCC 2.0

Overview of ROCCC 2.0 Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment

More information

Technical Report: GIT-CERCS-09-06

Technical Report: GIT-CERCS-09-06 Technical Report: GIT-CERCS-09-06 A Characterization and Analysis of GPGPU Kernels Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili School of Electrical and Computer Engineering Georgia Institute

More information

Cache Memory Access Patterns in the GPU Architecture

Cache Memory Access Patterns in the GPU Architecture Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 7-2018 Cache Memory Access Patterns in the GPU Architecture Yash Nimkar ypn4262@rit.edu Follow this and additional

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

Caracal: Dynamic Translation of Runtime Environments for GPUs

Caracal: Dynamic Translation of Runtime Environments for GPUs Caracal: Dynamic Translation of Runtime Environments for GPUs Rodrigo Domínguez rdomingu@ece.neu.edu Dana Schaa dschaa@ece.neu.edu Department of Electrical and Computer Engineering Northeastern University

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Measurement of real time information using GPU

Measurement of real time information using GPU Measurement of real time information using GPU Pooja Sharma M. Tech Scholar, Department of Electronics and Communication E-mail: poojachaturvedi1985@gmail.com Rajni Billa M. Tech Scholar, Department of

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels?

Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels? Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels? J. Lobeiras, M. Amor, M. Arenaz, and B.B. Fraguela Computer Architecture Group, University of A Coruña, Spain {jlobeiras,margamor,manuel.arenaz,basilio.fraguela}@udc.es

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Computer Organization and Design, 5th Edition: The Hardware/Software Interface Computer Organization and Design, 5th Edition: The Hardware/Software Interface 1 Computer Abstractions and Technology 1.1 Introduction 1.2 Eight Great Ideas in Computer Architecture 1.3 Below Your Program

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

GPU Programming for Mathematical and Scientific Computing

GPU Programming for Mathematical and Scientific Computing GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices *

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * Nam Sung Kim w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * modern GPU architectures deeply pipelined for efficient resource sharing several buffering

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Execution Scheduling in CUDA Revisiting Memory Issues in CUDA February 17, 2011 Dan Negrut, 2011 ME964 UW-Madison Computers are useless. They

More information

Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems

Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems Yingchao Huang University of California, Merced yhuang46@ucmerced.edu Abstract A heterogeneous memory system (HMS)

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities

Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities Ahmad Lashgar Electrical and Computer Engineering University of Victoria Victoria, BC, Canada Email: lashgar@uvic.ca Amirali Baniasadi

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information