Characterizing Scalar Opportunities in GPGPU Applications

Zhongliang Chen and David Kaeli
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA

Norman Rubin
NVIDIA Corporation

Abstract

General Purpose computing with Graphics Processing Units (GPGPU) has gained widespread adoption in both the high performance and general purpose communities. Most GPU computation exploits a Single Instruction Multiple Data (SIMD) execution model. However, GPU execution typically pays little attention to whether the data operated upon by the SIMD units is the same or different. When SIMD computation operates on multiple copies of the same data, redundant computations are generated. This provides an opportunity to improve efficiency by broadcasting the result of a single computation to multiple outputs. To better serve these operations, modern GPUs are being equipped with scalar units. SIMD instructions that operate on the same input data operands can then be directed to execute on scalar units, requiring only a single copy of the data and leaving the data-parallel SIMD units available to execute non-scalar operations. In this paper, we first characterize a number of CUDA programs taken from the NVIDIA SDK to quantify the potential for scalar execution. We observe that 38% of static SIMD instructions are recognized by the compiler as operating on the same data, and their dynamic occurrences account for 34% of the total dynamic instruction execution. We then evaluate the impact of scalar units on a heterogeneous scalar-vector GPU architecture. Our results show that scalar units are utilized 51% of the time during execution, though their use places additional pressure on the interconnect and memory, as shown in the results of our study.

I. INTRODUCTION

General Purpose computing with Graphics Processing Units (GPGPU) is an attractive platform for a growing number of applications. GPUs were traditionally designed to be stream processors for 3-D computer graphics, though they can also be effectively used as many-core data-parallel processors capable of high execution throughput and memory bandwidth. Figure 1 compares the single-precision peak performance of GPUs and CPUs [1]-[3]. As shown in the figure, in 2008 GPUs were 13x faster than CPUs. In the past 5 years, this gap has widened; currently GPUs provide 17x greater computational horsepower than CPUs. Today, GPUs are being deployed in a wide range of acceleration roles for general purpose applications.

Fig. 1. Performance comparison between GPUs and CPUs [1]-[3]

The execution model for modern GPUs is based on the Single Instruction Multiple Data (SIMD) model, which allows multiple processing elements to perform the same operation on multiple data simultaneously. In terms of system architecture, a GPU device is an array of multiprocessors (NVIDIA's streaming multiprocessors or AMD's compute units), each of which contains SIMD units and on-chip shared memory (AMD's local data store). A SIMD unit further contains an array of basic processing elements, each containing one ALU. Shared memory provides the GPU with the ability to share data among processing elements. Also, the multiprocessor supports barrier operations to provide synchronization at the thread-block/work-group level. The two most popular GPU programming models are CUDA (Compute Unified Device Architecture) [4] and OpenCL (Open Computing Language) [5]. They support both data-parallel and task-parallel models.
The most commonly exploited model on the GPU is the data-parallel model, which is the focus of this paper. In a data-parallel model, computation is represented by a sequence of instructions that execute on a number of indexed threads (OpenCL's work-items). All of the threads are explicitly or implicitly divided into thread blocks (OpenCL's work-groups). Within a thread block, threads share data through fast memory (CUDA's shared memory or OpenCL's local memory) and can be synchronized. When a GPU program is executed, thread blocks are first scheduled onto multiprocessors, and the individual threads in a block are further scheduled onto SIMD units in the multiprocessor. Each thread is processed by one processing element on a SIMD lane.
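To make the model concrete, the following minimal CUDA sketch (our own illustration, not one of the SDK benchmarks studied later) has each thread block cooperatively reduce a tile of an array in shared memory, with __syncthreads() providing the block-level barrier mentioned above.

```cuda
// Each block reduces BLOCK_SIZE consecutive elements of `in` into one
// partial sum written to out[blockIdx.x]. The shared-memory tile is the
// per-block fast memory; __syncthreads() is the thread-block barrier.
#define BLOCK_SIZE 256

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                 // all loads visible to the block
    // Tree reduction within the block; stride is uniform across the block,
    // so every thread reaches each barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}
```

Such a kernel would be launched as, e.g., blockSum<<<(n + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n), where d_in and d_out are device pointers; the block size must be a power of two for the tree reduction to be correct.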

While data-parallel processing can achieve high speedups, the standard data-parallel model does not consider the case when the input operands to a SIMD instruction are all the same. Threads are mapped to SIMD units no matter what data they operate on. When the computation is performed with multiple copies of the same data, the parallel operations can be reduced to Single Instruction Single Data (SISD) execution, which we refer to as a scalar opportunity. If we continue to use the SIMD hardware for these SISD operations, we are wasting resources and burning unnecessary power. Instead, we turn to a scalar-vector GPU architecture armed with both scalar and SIMD (i.e., vector) units. On the scalar-vector GPU architecture, scalar opportunities are executed on scalar units so that the SIMD engines can be used to execute true SIMD operations.

A good example of just such an architecture is AMD's Graphics Core Next (GCN) architecture. This design adds a scalar coprocessor to each compute unit. The scalar coprocessor has a fully functional integer ALU, with independent instruction arbitration and decode logic, and also a scalar register file. This new unit helps execute a variety of control flow instructions, including jumps, calls, and returns. The scalar coprocessor presents new opportunities in terms of performance and power efficiency [6].

Fig. 2. CUDA programming model [4]

Our scalar-vector GPU design aims to be more flexible than GCN. We designed our scalar unit to handle both integer and floating-point instructions. Also, the scalar unit does not need to have a scalar register file (the scalar units can use the vector register file in order to incur fewer hardware changes). The proposed architecture should be capable of effectively utilizing on-board scalar units to serve scalar opportunities in applications at low cost.

This paper provides a first glimpse of the scalar opportunities present in GPU applications. Furthermore, we evaluate the potential impact that adding a scalar unit can have on a conventional GPU architecture. We examine the challenges and opportunities for various design alternatives on different microarchitectural components, including the multiprocessor pipeline, interconnection network, and memory subsystem. To the best of our knowledge, this paper is the first attempt to evaluate scalar opportunities in the microarchitecture of a GPU. This paper makes the following contributions. From the perspective of software, our scalar design identifies scalar opportunities in GPU applications using static analysis, and uses this information to guide scalar unit design. From the perspective of hardware, we evaluate the impact of scalar opportunities when run on a scalar-vector GPU architecture. We discuss and also address opportunities and challenges introduced by scalar opportunities.

This paper is organized as follows. Section II presents background on GPU programming models and architecture. Section III introduces and defines scalar opportunities in GPGPU applications. Section IV describes the proposed scalar-vector GPU architecture and various design alternatives, and also discusses implementation details. Section V presents our experimental setup and modeling results. Section VI discusses related work. The paper is concluded in Section VII.

Fig. 3. NVIDIA Fermi GPU architecture [11]

II. BACKGROUND

A. GPGPU Programming Models

There are two popular models widely used today for GPU programming: 1) CUDA and 2) OpenCL. CUDA is a general purpose parallel programming model, introduced by NVIDIA [4].
As shown in Figure 2, CUDA allows the programmer to partition a problem into multiple subproblems that can be solved independently in parallel by blocks of threads. Each subproblem can be further subdivided into finer elements that can be solved cooperatively in parallel by all threads within the block. A CUDA program usually has two parts: 1) host code running on the CPU and 2) device code running on the GPU. The compilation of device code works as follows: it is first extracted by the NVIDIA CUDA C Compiler (NVCC) [7], and then compiled to intermediate PTX (Parallel Thread eXecution) [8] code. The PTX code is further compiled and optimized at run time by the NVIDIA proprietary Optimized Code Generator [9] to native SASS instructions (the NVIDIA ISA) [10].

The OpenCL programming model [5], which is managed by the Khronos Group, is an open standard for general purpose parallel programming across CPUs, GPUs, and other devices, giving programmers a portable language to target a range of heterogeneous processing platforms. In OpenCL, data is mapped to work-items in an index space, and all work-items are explicitly or implicitly divided into work-groups.

B. GPU Architecture

GPUs usually adopt a massively parallel model to achieve high throughput. Most of the device real estate on a GPU is dedicated to computation rather than control logic or cache. The NVIDIA Fermi architecture [11], shown in Figure 3, features up to 16 streaming multiprocessors, each of which has 32 CUDA cores, 16 load/store units, and 4 special function units. Each CUDA core has a fully pipelined ALU, and can execute an ALU instruction per clock for each thread. Each load/store unit allows source and destination addresses to be calculated per thread per clock. Each special function unit can execute a transcendental instruction per thread per clock [11]. When a GPU kernel is launched, the global scheduler distributes thread blocks to the local schedulers in each streaming multiprocessor. Threads are further scheduled onto SIMD units in warps (groups of 32 threads). Each multiprocessor has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Two groups of 16 cores each are used to execute two instructions from two different warps per cycle.

Fig. 4. AMD Graphics Core Next compute unit architecture [6]

The homogeneous SIMD-only architecture described above recently underwent a major microarchitectural change. Figure 4 shows the state-of-the-art AMD Graphics Core Next compute unit architecture [6], where scalar units are integrated into compute units, introducing heterogeneity within the GPU. Unlike standard SIMD units, the scalar units provide fast and efficient integer SISD execution. They are mainly used to expedite address generation and control flow execution in GCN. Furthermore, SIMD units can execute other SIMD instructions at the same time as the scalar units execute SISD operations. This heterogeneous architecture provides more flexibility for GPU applications.

III. SCALAR OPPORTUNITIES IN GPU APPLICATIONS

In GPU programming models, computation is represented by a sequence of SIMD instructions, each of which operates on vector operands across multiple threads. Each component of a vector operand participates in a single computation on one ALU on the GPU. To address thread divergence, an active mask (defined as a bit map) can be used to indicate whether an individual thread is active or not. If a thread is active, its results are confirmed and kept in the updated microarchitectural state. Otherwise, the results are simply discarded.

We define a scalar opportunity as a SIMD instruction operating on the same data in all of its active threads. A typical example of a scalar opportunity is loading a constant value: each active thread loads the same value from memory and stores it in the corresponding component of the destination vector register, so all of those components end up holding the same value.

Scalar opportunity analysis can be performed at different abstraction levels. Compiler-level analysis is more flexible and needs zero hardware modifications, but it can only identify scalar opportunities within a thread block or coarser structure, since intra-thread-block information is dynamic. Also, it may be conservative since the compiler has to consider all possible control flow paths.
Architecture/microarchitecture-level analysis is more informative since it is equipped with run-time information, and can handle scalar opportunities within a finer-grained structure such as a warp, but at a hardware cost. In this paper we work at the compiler level, mainly due to its lower cost and high flexibility.

We carry out the characterization of scalar opportunities on NVIDIA PTX code for the following two reasons. One is that PTX is stable across multiple GPU generations, which makes our approach more general. The other is that there are several existing PTX research tools available in the literature. However, we claim that our analysis is independent of any specific SIMD programming model, and thus applies to other SIMD-based instruction sets besides PTX, including NVIDIA SASS, AMD IL, and AMD ISA.

Fig. 5. CUDA and PTX code of vector addition

To better understand scalar opportunities in GPU applications, consider vector addition as an example, as shown in Figure 5. As seen in the CUDA code, the variable i is first initialized to the global thread index, which is computed using the thread block dimension, thread block index, and local thread index within the thread block. The corresponding PTX code uses three vector registers, r1, r0, and r3, to keep track of those three operands, respectively. The thread block dimension and index are the same for every thread in a thread block, so the first two PTX instructions are scalar opportunities (marked 1 at the beginning of the line). The following multiply instruction uses r0 and r1 to compute an intermediate value, and is also a scalar opportunity. Afterwards, the fourth instruction moves the thread index to r3, which obviously processes different data for each thread. It is therefore a true SIMD instruction (marked 0 at the beginning of the line). Then a conditional branch follows.
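Figure 5 itself is not reproduced in this text, so the following CUDA sketch restates the same vector-addition kernel; the comments mark how the per-thread-block analysis described above would classify each part, with the corresponding Figure 5 registers noted for orientation (the exact PTX the compiler emits may of course order these instructions differently).

```cuda
// Vector addition in the style of the Figure 5 example. "uniform" below
// means: the same value for every active thread in the thread block.
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = blockDim.x      // uniform (r1 in Fig. 5) -> scalar opportunity
          * blockIdx.x      // uniform (r0); the multiply is also a scalar opportunity
          + threadIdx.x;    // per-thread value (r3)  -> true SIMD instruction
    if (i < N) {            // the load of parameter N is uniform as well
        C[i] = A[i] + B[i]; // active threads see different data: not a scalar opportunity
    }
}
```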

Suppose that the size N of the arrays is not evenly divisible by the thread block size N_B, and the arrays can have different data in their elements. Then the last thread block holds the threads [N_B*floor(N/N_B), N_B*ceil(N/N_B) - 1], where the threads [N_B*floor(N/N_B), N - 1] execute the if block (BB_1_3), while the threads [N, N_B*ceil(N/N_B) - 1] execute the else block (BB_1_4). When the if block is executed, the threads [N, N_B*ceil(N/N_B) - 1] are inactive, which explains why the load-parameter operations are scalar opportunities. However, the addition is not a scalar opportunity, since the active threads operate on data that can be different.

IV. SCALAR-VECTOR GPU ARCHITECTURE

Scalar opportunities rely on heterogeneous scalar-vector GPU architectures equipped with SIMD and scalar units to improve resource utilization, performance, and power. The AMD Graphics Core Next architecture is currently the only commercially available example of such an architecture. Its scalar unit design includes an integer ALU to execute arithmetic and logical integer operations, and also a scalar register file to hold the operands for scalar opportunities. We extend the functionality of the AMD scalar unit in our scalar-vector architecture implementation to support any non-transcendental instructions, including integer, floating-point, and others. It can perform general computation in addition to address generation and condition manipulation, which implies that more scalar opportunities can be executed on our scalar unit. Such design choices may result in higher design complexity or added latency. However, some overhead is easily tolerable as long as overall execution is more efficient due to the addition of the scalar units.

Another major modification is that we do not support separate scalar and vector register files; instead, instructions always use a single vector register file. The primary advantage of this scheme is that we avoid expensive data movement between two register files. Consider a SIMD instruction that has a data dependence on a previous scalar opportunity. If the scalar unit and SIMD unit have separate register files, we have to design a mechanism to broadcast the scalar results to the vector source operands required by the SIMD instruction, which could limit the benefits of scalar processing. Another benefit is that the vector portion of the existing instruction set, designed for a traditional SIMD execution model, does not have to be changed in order to benefit from scalar opportunities. Designers only need to focus on the new scalar instructions. The downside of employing a combined scalar-vector register file is that we may need to add extra read ports to the vector registers. Adding read ports can be expensive and incurs additional power. In order to limit the impact of this choice, we employ NVIDIA's operand collector architecture [12] in our design. The operand collector is used to simulate a multi-ported memory using lower port count memory banks. It uses multiple collector units to buffer the operands of instructions, and a bank request arbitration unit to schedule register accesses from the collector units. An instruction can be issued from a collector unit to an execution unit when all of its operands are ready. We add a collector unit for each scalar unit. The collector unit stores the warp identifier, instruction opcode, register identifiers, and operands. The operand field stores the operands for only one thread, since a scalar unit only needs the scalar operands for any one of the active threads.
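The paper does not spell out the layout of this per-scalar-unit collector entry; the C-style sketch below is only our reading of the fields just listed, with made-up names and widths.

```cuda
#include <cstdint>

// Hypothetical layout of one collector-unit entry for a scalar unit.
// Unlike a SIMD collector entry, each source operand slot buffers the
// value for just one active thread, not all 32 lanes of a warp.
struct ScalarCollectorEntry {
    uint8_t  warp_id;        // which warp issued the instruction
    uint16_t opcode;         // instruction opcode
    uint8_t  dst_reg;        // destination register identifier
    uint8_t  src_reg[3];     // source register identifiers
    uint32_t src_value[3];   // operand values for a single active thread
    bool     src_ready[3];   // set as the arbitration unit services each bank read
};
```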
Such a scalar collector unit incurs much less storage overhead than the collector units for SIMD units. Moreover, the scalar unit reads no more than one component of a vector register, and the arbitration unit is able to freely choose the optimal component so that the current read requests in the queue incur fewer bank conflicts. Since some components may correspond to inactive threads, the arbitration unit uses thread divergence information to read a component belonging to an active thread.

We implemented our new scalar unit design and the related microarchitectural components in the GPGPU-Sim simulator [13]. The model is based on an NVIDIA Fermi GPU architecture. GPGPU-Sim is a cycle-level GPU performance simulator, composed of Single Instruction Multiple Thread (SIMT) cores connected via an on-chip interconnection network to memory partitions that interface to graphics DRAM (dynamic random-access memory). A SIMT core models a highly multithreaded pipelined SIMD multiprocessor very similar in design to an NVIDIA streaming multiprocessor or an AMD compute unit. A processing element corresponds to a lane within an ALU pipeline in a SIMT core [13].

As shown in Figure 6, a SIMD instruction is executed on a SIMT core as follows. First, the instruction is fetched from the instruction cache, decoded, and then stored in the instruction buffer. The instruction buffer is statically partitioned so that all warps running on the SIMT core have dedicated storage to place instructions. Then the issue logic checks all the valid instructions, which are decoded but not issued, to establish issue eligibility. A valid instruction can be issued only if the following three requirements are all satisfied: (1) its warp is not waiting at a barrier, (2) it passes the Write After Write (WAW) and Read After Write (RAW) hazard checks in the scoreboard, and (3) the operand access stage of the instruction pipeline is not stalled.
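A minimal sketch of this eligibility test follows; the types and member names are ours (GPGPU-Sim's real classes are organized quite differently), and only the three conditions just listed are modeled.

```cuda
// Illustrative only: the three issue-eligibility conditions from the text,
// not GPGPU-Sim's actual interfaces.
struct WarpState      { bool waiting_at_barrier; };
struct Scoreboard     { bool waw_or_raw_hazard;  };  // evaluated per pending instruction
struct OperandCollect { bool stage_stalled;      };

bool can_issue(const WarpState &warp, const Scoreboard &sb,
               const OperandCollect &oc) {
    if (warp.waiting_at_barrier) return false;  // (1) warp blocked at a barrier
    if (sb.waw_or_raw_hazard)    return false;  // (2) scoreboard detects a WAW/RAW hazard
    if (oc.stage_stalled)        return false;  // (3) operand access stage is stalled
    return true;                                //     otherwise eligible to issue
}
```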

Memory instructions are issued to the memory pipeline. The other instructions always prefer SIMD units over special function units (SFUs), unless they must be executed on the special function units. The pipeline also maintains a SIMT stack per warp to handle branch divergence. Moreover, an operand collector offers a set of buffers and arbitration logic used to provide the appearance of a multi-ported register file using multiple banks of single-ported RAMs. The buffers hold the source operands of instructions in collector units. When all the operands are ready, the instruction is issued to an execution unit.

Fig. 6. Overview of the scalar-vector GPU architecture implemented in GPGPU-Sim

Figure 6 shows the major modifications we made to GPGPU-Sim to model our scalar-vector design. These changes include:

Execution units. We added a configurable number of scalar units, each of which is pipelined and can execute all types of ALU instructions except transcendentals. They have the same speed as the SIMT units (i.e., they execute one instruction per cycle). Each unit has an independent issue port from the operand collector, and shares the same output pipeline register as the other execution units that are connected to a common writeback stage.

Operand collector. We added a configurable number of collector units to each scalar unit. These collector units have a similar structure to those for the SIMD units, but store the operands for only one active thread.

Issue logic. We modified the warp scheduler so that scalar instructions can be issued to scalar units at the same time SIMD instructions are issued to SIMD units; otherwise, the instructions would never run in parallel on both kinds of units. The issue width of our simulator is configurable. Moreover, scalar opportunities should be able to run on SIMD units as well for flexibility, though we may choose to restrict this option when optimizing for power.

Configuration options. The configurable parameters described above are added as new configuration options for GPGPU-Sim.

V. MODELING RESULTS

In this section, we describe our experimental setup for characterizing scalar opportunities. As mentioned before, we first identify scalar opportunities using a compiler pass, and then simulate the program on a modified GPU simulator, using the information gathered by the compiler pass to investigate which microarchitectural components scalar opportunities utilize.
Next, data dependence graph reachability analysis starts from apparent divergent variables (e.g.,thread IDs) and the variables defined by atomic instructions, as shown in Figure 7(a). All the variables reached are marked divergent (black circles); the others are uniform (white circles). When performing variable divergence analysis on a data dependence graph, we add a tag to each variable indicating whether it is uniform or divergent. Then we carry out static scalar opportunity analysis on a control flow graph using the tags previously generated, as illustrated in Figure 7(b). A SIMD instruction is recognized as a scalar opportunity (white box) if and only if all of its source operands are uniform. Static statistics are insufficient to arrive at the best use of the scalar units. For example, assume a program has 10 static instructions, where 5 instructions are non-scalar opportunities in a loop executing 100 iterations, and the others are scalar opportunities out of the loop. Then the percentage of static scalar opportunities is 5/10=50%, while that of dynamic scalar opportunities is 5/(5+5*100)=1%. Scalar units will be underutilized if a program has limited dynamic scalar opportunities. Hence, we also count dynamic occurences of static scalar opportunities. 229

Note that static analysis does not use run-time information, and thus it may be conservative. Specifically, uniform vectors may be recognized as divergent by the variable divergence analysis, so some scalar opportunities are not detected. For instance, an instruction subtracting a divergent vector from itself produces a uniform vector of zeros. However, because the result (i.e., 0) has a data dependence on two divergent source vectors in the data flow graph, it is labeled a divergent vector. If a following instruction adds this result to a uniform vector, it will be identified as a non-scalar opportunity, since the 0 was recognized as a divergent vector. Another example is a conditional branch that is taken (or not taken) by all the threads when the program is executed, i.e., thread divergence may never actually happen. The uniform variables defined between such branches and their immediate post-dominators then stay uniform; however, they are recognized as divergent since the compiler has to consider all possibilities. Dynamic analysis can generate run-time statistics under those circumstances. Nevertheless, dynamic analysis requires hardware modification, which incurs a high cost, and run-time information may depend heavily on program inputs, resulting in statistics that are specific to certain inputs. Thus we do not consider dynamic analysis in this paper.

We added a compiler pass to GPU Ocelot [15] to perform our static analysis. Ocelot is a modular dynamic compilation framework for heterogeneous systems, providing various backend targets for CUDA programs and analysis modules for the PTX virtual instruction set. In the experiments, we first compiled CUDA source code to PTX code, and then used Ocelot to generate flags indicating whether a static instruction is a scalar opportunity and what its instruction type is (e.g., integer, floating-point, memory, etc.). This information is read later during simulation.

Fig. 7. An example of scalar opportunity analysis (vector addition)

B. Results

We ran 20 CUDA benchmarks chosen from the NVIDIA CUDA SDK version 4.0, and followed the methodology presented in the previous subsection to collect our results. As shown in Table I, the benchmarks range from scientific algorithms (e.g., the discrete cosine transform) to financial applications (e.g., binomial option pricing). In this subsection, we first characterize the number of scalar opportunities in these benchmarks, and then discuss their impact on the scalar-vector GPU microarchitecture components.

We count the number of static scalar opportunities using Ocelot, and profile their dynamic occurrences during simulation using GPGPU-Sim. As Figure 8 shows, 38% of static SIMD instructions on average are detected by the compiler as scalar opportunities. These results imply that scalar opportunities are always present in GPGPU applications, even when we use SIMD programming models to write and optimize our programs. We also break down all the static and dynamic scalar opportunities into individual instruction types, which are shown in Figures 9 and 10, respectively. Parallelism instructions include barrier synchronization, reduction operations on global and shared memory, and vote instructions. Special function instructions are transcendental operations running on special function units. We can see that most scalar opportunities are integer, floating-point, or memory operations.

TABLE I. BENCHMARKS

BlackScholes: Evaluation of fair call and put prices for a given set of European options by the Black-Scholes formula
MersenneTwister: Mersenne Twister random number generator and Cartesian Box-Muller transformation
MonteCarlo: Evaluation of the fair call price for a given set of European options using a Monte Carlo approach
SobolQRNG: Sobol quasirandom sequence generator
binomialoptions: Evaluation of the fair call price for a given set of European options under the binomial model
convolutionseparable: A separable convolution filter of a 2D signal with a Gaussian kernel
dct8x8: Discrete Cosine Transform for blocks of 8 by 8 pixels
dwthaar1d: Discrete Haar wavelet decomposition for 1D signals with a length that is a power of 2
eigenvalues: A bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size
fastwalshtransform: Naturally (Hadamard)-ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power of two) lengths
histogram256: 256-bin histogram
histogram64: 64-bin histogram
mergesort: Merge sort algorithm
quasirandomgenerator: Niederreiter quasirandom sequence generator and inverse cumulative normal distribution function for standard normal distribution generation
reduction: Summation of a large array of values
scalarprod: Scalar products of a given set of input vector pairs
scan: Parallel prefix sum (given an array of numbers, compute a new array in which each element is the sum of all the elements before it in the input array)
sortingnetworks: Bitonic sort and odd-even merge sort algorithms
transpose: Matrix transpose
vectoradd: Vector addition

Fig. 8. Percentage of static scalar opportunities and their dynamic occurrences

Fig. 9. Instruction type breakdown of static scalar opportunities

Fig. 10. Instruction type breakdown of dynamic scalar opportunities

Since memory instructions are always executed on the load/store units, we need to enable scalar unit support for at least integer and floating-point instructions in order to obtain most of the benefits available by adding scalar units. Also note that transcendental calculations such as sine and cosine are all normal SIMD instructions across the benchmarks, which suggests that a simple scalar unit is enough (there is no need to add transcendental support). In addition, atomic instructions cannot be executed on scalar units, since they contain memory operands and need to access the load/store units.

In some benchmarks, such as binomialoptions, the percentage of static scalar opportunities is significantly higher than the percentage measured during runtime. This trend implies that scalar opportunities in those benchmarks are likely present in the initialization phase of the code, and thus those programs will not benefit from scalar opportunities in their main loops. On the other hand, in some benchmarks such as histogram256, the percentage of static scalar opportunities is much lower than the percentage executed. In such cases, scalar opportunities are very likely to be present in the main loops of these benchmarks. A similar scenario can be seen for selected types of scalar opportunities. Take BlackScholes for example, where floating-point operations account for 65% of all the static scalar opportunities, but only 7% of all the dynamic scalar opportunities. The main reason is that these instructions are located in the initialization phase of the code.
In contrast, 17% of the static scalar opportunities are integer instructions, while their dynamic occurrences account for 92% of the dynamic scalar opportunities. Therefore, integer scalar opportunities determine the benefits of the scalar units to a large degree.

We evaluate the utilization of four types of execution units on the scalar-vector GPU architecture: scalar units, SIMD units, special function units, and load/store units. As illustrated in Figure 11, scalar units have high utilization, with 51% occupancy on average across all of the benchmarks. Note that the presence of a large number of scalar opportunities does not directly imply higher utilization of the scalar units, since effective exploitation of these units depends on other microarchitecture features. For example, if the load/store units cannot keep up with supplying source operands to the scalar opportunities, execution will be stalled.

Fig. 11. Utilization of four types of execution units

Fig. 12. Utilization of SIMD units and load/store units in homogeneous and scalar-vector GPU architectures

Fig. 13. Stall cycle difference of the scalar-vector GPU architecture over the homogeneous architecture

From the figure, we can also see that the utilization of scalar units is generally lower than that of SIMD units. The reason is that we schedule scalar opportunities on available SIMD units when all scalar units are busy. We schedule a scalar opportunity onto every lane of a SIMD unit for execution. This scheduling policy decreases stalls, in spite of resulting in less efficient execution. By comparing the utilization of SIMD units and load/store units in the homogeneous and scalar-vector GPU architectures, we can see in Figure 12 that the utilization of SIMD units decreases when we introduce the additional scalar units. However, the utilization of the load/store units remains the same. Even so, scalar units can place more pressure on memory, as explained below.

Scalar opportunities can put pressure on the multiprocessor pipeline, interconnection network, and memory subsystem, as shown in Figure 13. Multiprocessor pipeline stalls can be caused by shared memory bank conflicts, non-coalesced memory accesses, or serialized memory accesses. Interconnection network stalls happen when DRAM channels cannot accept requests from the interconnect. Memory stalls result from interconnection network congestion when DRAM channels cannot send packets. In the figure, a positive number implies that stall cycles increase on the scalar-vector GPU architecture over the homogeneous GPU architecture, while a negative number indicates that they decrease.

The benchmarks show varied results. BlackScholes and quasirandomgenerator place more pressure on the multiprocessor pipelines, while MersenneTwister places less. Moreover, for binomialoptions, the scalar units relieve much of the pressure on the interconnect. In contrast, several other benchmarks, including dwthaar1d and scalarprod, experience more interconnect stalls. Additionally, some benchmarks such as MonteCarlo place additional stress on memory. The pressure largely results from the parallel execution of the scalar units and SIMD units: additional source operands have to be read to fill the scalar unit pipeline.

This can result in more traffic on the memory and interconnect than in the SIMD-only architecture. When designing a scalar-vector GPU architecture, we need to keep in mind that when we add scalar units to the microarchitecture, we may need to increase interconnect and memory bandwidth to guarantee data delivery to these units. We need to consider the entire data path so that we do not create another hotspot in the microarchitecture.

VI. RELATED WORK

Previous research on divergence in GPGPU applications offered us helpful ideas. Coutinho et al. proposed variable divergence analysis and optimization algorithms [14]. They introduced a static analysis to determine which vector variables in a program have the same values for every processing element. They also described a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Their analysis is used in our work to determine scalar opportunities.

Collange et al. presented a technique for dynamic detection of uniform and affine vectors in GPGPU computations [16]. They concentrated on two forms of value locality specific to vector computations on GPUs. The first form corresponds to the uniform pattern present when computing conditions, which avoids divergence in sub-vectors. The second form corresponds to the affine pattern used to access memory efficiently. They proposed using both forms of value locality, combined with hardware modifications, to significantly reduce the power required for data transfers between the register file and the functional units. They also looked at how to reduce the power drawn by the SIMD arithmetic units. Their work analyzed variables only, whereas our approach also considers computations. Collange later proposed a mechanism to identify scalar behavior in CUDA kernels [17]. This prior work describes a compiler analysis pass that statically identifies several kinds of regular patterns that can occur between adjacent threads, including common computations, memory accesses to consecutive memory locations, and uniform control flow. While it is of high quality, this prior work did not consider a scalar-vector GPU architecture.

Stratton et al. described a microthreading approach to efficiently compile fine-grained single-program multiple-data threaded programs for multicore CPUs [18]. They enabled redundancy removal in both computation and data storage as a primary optimization, where variance analysis discovers which portions of the code produce the same value for all threads. Our work differs from theirs in that we target scalar-vector GPU architectures rather than multicore CPUs.

Hong and Kim proposed an integrated power and performance modeling system for GPUs [19], which uses an empirical modeling approach to model GPU power. They used the power and timing models to predict performance per watt and also the optimal number of cores to achieve energy savings. Their work is based on a conventional SIMD-only GPU architecture, while we focus on a scalar-vector GPU architecture.

VII. CONCLUSION

In this paper we characterized scalar opportunities in GPU applications using a quantitative approach. The goal was to motivate the need for GPU architectures to evolve from homogeneity (i.e., SIMD only) to heterogeneity (i.e., scalar plus SIMD). Our static-analysis-guided approach is flexible and low cost, requiring zero hardware modifications.
We have designed and implemented a detailed heterogeneous scalar-vector GPU architecture on a cycle-level simulator, and evaluated hardware resource utilization using commonly used GPU benchmarks. We have also evaluated the impact of scalar opportunities on the multiprocessor pipelines, interconnection networks, and memory subsystems. The presence of scalar opportunities in common applications provides us with opportunities to pursue performance improvements and reduce power consumption. In order to achieve those efficiencies, we need to carefully tune the hardware components. Our future work will investigate how to build and optimize scalar-vector GPU architectures to maximize the benefits from scalar opportunities in a wider range of GPGPU applications.

ACKNOWLEDGMENT

The authors would like to thank Yoav Etsion, who helped to improve the quality of our final paper. This work is supported by an NSF EEC Innovations Program award, and by both AMD and NVIDIA. The authors would also like to thank the GPGPU-Sim and Ocelot teams for the use of their toolsets.

REFERENCES

[1] (2012) Comparison of AMD graphics processing units. [Online].
[2] (2012) Comparison of Nvidia graphics processing units. [Online].
[3] Intel Corporation. (2012) Intel microprocessor export compliance metrics. [Online].
[4] NVIDIA Corporation. (2012) NVIDIA CUDA C programming guide, version 4.2.
[5] Khronos OpenCL Working Group. (2011) The OpenCL specification, version 1.2.
[6] Advanced Micro Devices, Inc. (2012) AMD Graphics Cores Next (GCN) architecture whitepaper.
[7] NVIDIA Corporation. (2011) The CUDA Compiler Driver NVCC.
[8] (2011) PTX: Parallel Thread Execution ISA, version 2.3.
[9] M. Murphy. (2011) NVIDIA's experience with Open64.
[10] NVIDIA Corporation. (2011) cuobjdump application note.
[11] (2009) NVIDIA Fermi compute architecture whitepaper, version 1.1. [Online].
[12] S. Liu, E. Lindholm, M. Y. Siu, B. W. Coon, and S. F. Oberman, "Operand collector architecture," U.S. Patent B2, Nov. 16.
[13] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proceedings of the IEEE 2009 International Symposium on Performance Analysis of Systems and Software, Boston, MA, USA, Apr. 2009.

[14] B. Coutinho, D. Sampaio, F. Pereira, and W. Meira, "Divergence analysis and optimizations," in Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, Galveston Island, TX, USA, Oct. 2011.
[15] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, New York, NY, USA, Sep. 2010.
[16] S. Collange, D. Defour, and Y. Zhang, "Dynamic detection of uniform and affine vectors in GPGPU computations," in Proceedings of the 3rd Workshop on Highly Parallel Processing on a Chip, Delft, The Netherlands, Aug. 2009.
[17] S. Collange, "Identifying scalar behavior in CUDA kernels," INRIA, France, Tech. Rep., Jan.
[18] J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy, Z. Hu, and W.-m. W. Hwu, "Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs," in Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, New York, NY, USA, Apr. 2010.
[19] S. Hong and H. Kim, "An integrated GPU power and performance model," in Proceedings of the 37th Annual International Symposium on Computer Architecture, New York, NY, USA, Jun. 2010.


More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

A Compile-Time Managed Multi-Level Register File Hierarchy

A Compile-Time Managed Multi-Level Register File Hierarchy A Compile-Time Managed Multi-Level Register File Hierarchy Mark Gebhart 1 Stephen W. Keckler 1,2 William J. Dally 2,3 mgebhart@cs.utexas.edu skeckler@nvidia.com bdally@nvidia.com 1 The University of Texas

More information

Arquitetura e Organização de Computadores 2

Arquitetura e Organização de Computadores 2 Arquitetura e Organização de Computadores 2 Paralelismo em Nível de Dados Graphical Processing Units - GPUs Graphical Processing Units Given the hardware invested to do graphics well, how can be supplement

More information

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Lecture 27: Multiprocessors Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model

More information

Computer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2017 Review Thread, Multithreading, SMT CMP and multicore Benefits of

More information

Tuning CUDA Applications for Fermi. Version 1.2

Tuning CUDA Applications for Fermi. Version 1.2 Tuning CUDA Applications for Fermi Version 1.2 7/21/2010 Next-Generation CUDA Compute Architecture Fermi is NVIDIA s next-generation CUDA compute architecture. The Fermi whitepaper [1] gives a detailed

More information

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008 Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Is There A Tradeoff Between Programmability and Performance?

Is There A Tradeoff Between Programmability and Performance? Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable

More information

Overview of ROCCC 2.0

Overview of ROCCC 2.0 Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment

More information

Technical Report: GIT-CERCS-09-06

Technical Report: GIT-CERCS-09-06 Technical Report: GIT-CERCS-09-06 A Characterization and Analysis of GPGPU Kernels Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili School of Electrical and Computer Engineering Georgia Institute

More information

Cache Memory Access Patterns in the GPU Architecture

Cache Memory Access Patterns in the GPU Architecture Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 7-2018 Cache Memory Access Patterns in the GPU Architecture Yash Nimkar ypn4262@rit.edu Follow this and additional

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

Computer Architecture

Computer Architecture Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics

More information

Caracal: Dynamic Translation of Runtime Environments for GPUs

Caracal: Dynamic Translation of Runtime Environments for GPUs Caracal: Dynamic Translation of Runtime Environments for GPUs Rodrigo Domínguez rdomingu@ece.neu.edu Dana Schaa dschaa@ece.neu.edu Department of Electrical and Computer Engineering Northeastern University

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Measurement of real time information using GPU

Measurement of real time information using GPU Measurement of real time information using GPU Pooja Sharma M. Tech Scholar, Department of Electronics and Communication E-mail: poojachaturvedi1985@gmail.com Rajni Billa M. Tech Scholar, Department of

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels?

Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels? Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels? J. Lobeiras, M. Amor, M. Arenaz, and B.B. Fraguela Computer Architecture Group, University of A Coruña, Spain {jlobeiras,margamor,manuel.arenaz,basilio.fraguela}@udc.es

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Computer Organization and Design, 5th Edition: The Hardware/Software Interface Computer Organization and Design, 5th Edition: The Hardware/Software Interface 1 Computer Abstractions and Technology 1.1 Introduction 1.2 Eight Great Ideas in Computer Architecture 1.3 Below Your Program

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

GPU-accelerated Verification of the Collatz Conjecture

GPU-accelerated Verification of the Collatz Conjecture GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

GPU Programming for Mathematical and Scientific Computing

GPU Programming for Mathematical and Scientific Computing GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices *

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * Nam Sung Kim w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * modern GPU architectures deeply pipelined for efficient resource sharing several buffering

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Execution Scheduling in CUDA Revisiting Memory Issues in CUDA February 17, 2011 Dan Negrut, 2011 ME964 UW-Madison Computers are useless. They

More information

Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems

Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems Yingchao Huang University of California, Merced yhuang46@ucmerced.edu Abstract A heterogeneous memory system (HMS)

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities

Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities Ahmad Lashgar Electrical and Computer Engineering University of Victoria Victoria, BC, Canada Email: lashgar@uvic.ca Amirali Baniasadi

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information