Project Final Report: High Performance Pipeline Compiler


Yong He, Yan Gu

1 Introduction

Writing stream processing programs directly in low-level languages such as C++ is tedious and bug-prone. Many systems have been developed to simplify the programming of stream processing applications and to distribute computation to various devices (e.g. GPUs or clusters), such as GRAMPS [3], StreamIt [4], BSGP [1], Storm [2] and DryadLINQ [5]. These systems propose a set of general purpose programming constructs to hide the complexity of buffer management and computation scheduling. However, all of them compromise performance in certain ways to remain general purpose. We recognized a special type of stream processing that can be compiled into much faster code: pipelines. Unlike general purpose stream processing, which models computation as a directed acyclic graph of computation nodes (kernels), pipelines are formed by chaining kernels without introducing complex inter-kernel dataflow. As a result, scheduling and buffer management can be done more efficiently, making it possible to generate high performance code that runs as fast as hand-tuned C++ code.

One example of a pipeline application is the rendering pipeline. In a rendering pipeline, a set of vertices is provided as input to the system. The pipeline computes internal triangle representations from these vertices, clips them against the screen, rasterizes them into pixels and finally shades the pixels to create an image. Traditionally, a rendering pipeline is written by hand and heavily tuned for the highest possible performance. As a result, real-world intricacies such as limited buffer sizes and low-level optimizations make it extremely hard to reshape the pipeline or to make changes in schedules.

We designed a new programming language for writing pipeline applications that compiles to highly parallelized code with performance rivaling hand-tuned code. Our language extends C syntax to express stages (computation nodes in our system), buffers used to synchronize between stages, and pipelines assembled by chaining stages and buffers. The compiler analyzes the stage code and stage connections to figure out the most efficient schedule for the given pipeline, and generates a highly parallelized pipeline implementation with performance comparable to hand-tuned code. Programs written in our pipeline specification language are compact and easy to maintain (our sample rendering pipeline specification contains only 143 lines of code and is compiled to 756 lines of C++ code). Therefore it is very easy to change the pipeline and experiment with various scheduling strategies using our language.

2 Programming Model Design

We start from a simple version of a C-like language without pointers, and extend it with the following constructs:

Stage. A stage is a kernel function that takes one or multiple items from an input stream and emits zero or multiple items to an output stream. The number of input items must be statically defined, while the number of output items can be fully dynamic at runtime. A stage should contain one or more emit statements that push an item to its output stream.

Pipe. A standard pipe is a place where stages take inputs from and write outputs to. Pipes guarantee the correct ordering between items. For example, if a pipe containing items [x1, x2, x3] is passed to a stage computing (y1(xi), ..., yn(xi)) = f(xi) that outputs to another pipe, the resulting pipe will contain items [y1(x1), ..., yn(x1), y1(x2), ..., yn(x2), y1(x3), ..., yn(x3)]. There are two main variants of the standard pipe: source and sink. A source is the input to the whole pipeline, and a sink is the output. When linked to a host application, the user of a pipeline must provide data to each source and may optionally register callbacks to handle data streamed into a sink.

Grid. A grid is treated as a special kind of pipe by the system, except that it represents a fixed-dimension, fixed-size array of cells. A grid can have 1, 2 or 3 dimensions. Only items marked as Element1D, Element2D or Element3D can be piped to grids with the corresponding dimensions. Stages that emit items to a grid must specify the index of each item, and the item will be output to the specified location in the grid. If the specified location already contains an item, a merge function is called to merge the incoming item and the existing item into a new item, and the new item is stored (a C++ sketch of these merge semantics follows this list).

References. Items manipulated by stages can contain references to an item from a previous stage. The existence of references affects the scheduling of a pipeline, because a referenced item stored in the output buffer of a previous stage must be kept alive when its user stage is executed. Most stream processing systems do not allow such references because they make scheduling difficult and make it very hard to generate distributed implementations. However, references are crucial in generating high performance code that runs on a single machine, and we choose to support them in order to reduce bandwidth and memory copying overhead.

Pipeline assembler. When all stages and internal pipes/grids have been defined, a pipeline assembler specifies how to assemble every piece together to form a pipeline.
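To make the grid's merge-on-collision semantics concrete, the following minimal C++ sketch models a 2D grid. It illustrates the semantics only and is not our compiler's implementation; Grid2D and its members are hypothetical names.

#include <functional>
#include <optional>
#include <vector>

// Hypothetical model of 2D grid semantics: emitting to an occupied cell
// invokes the user-provided merge function; otherwise the item is stored.
template <typename T>
class Grid2D
{
public:
    Grid2D(int w, int h, std::function<void(T &, const T &)> merge)
        : width(w), cells(w * h), mergeFn(merge) {}

    // Emit an item to cell (x, y); merge if the cell already holds an item.
    void Emit(int x, int y, const T & item)
    {
        std::optional<T> & cell = cells[y * width + x];
        if (cell.has_value())
            mergeFn(*cell, item);  // merge incoming item with existing item
        else
            cell = item;           // first item at this location is stored
    }

private:
    int width;
    std::vector<std::optional<T>> cells;
    std::function<void(T &, const T &)> mergeFn;
};

For instance, a depth-test merge function such as void zmerge(Fragment & f, const Fragment & n) { if (n.z < f.z) f = n; } keeps the nearest fragment at each pixel, mirroring the zmerge function in the appendix.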

2.1 Example Pipeline

The following example code, written in the pipeline specification language, demonstrates several important constructs. The pipeline takes a stream of integers; for each integer x, the Square stage outputs x*x and 2x to the output stream, which is handled by external code that prints the result.

parallel stage Square() : int >> int
{
    emit in*in;
    emit in*in*2;
}

Source<int> input;
Sink<int> output;

pipe SimplePipe: input >> Square() >> output;

The pipeline compiler takes this code and produces a C++ class that represents the pipeline. The interface of the generated C++ code is shown below.

#ifndef OUT_H
#define OUT_H

#ifndef PIPE_SYSTEM_HEADER
#define PIPE_SYSTEM_HEADER
class SinkHandler
{
public:
    virtual void ProcessBuffer(void * data, int count) = 0;
};
#endif

class SimplePipe
{
public:
    virtual ~SimplePipe() {}
    virtual void Flush() = 0;
    virtual void SetInput_input(void * data, int count) = 0;
    virtual void SetSinkHandler_output(SinkHandler * handler) = 0;
};

SimplePipe * CreateSimplePipe();
void DestroySimplePipe(SimplePipe * obj);
#endif

To use the pipeline, the host application calls CreateSimplePipe() to create an instance of the pipeline. It then calls SimplePipe::SetInput_input() to feed the source with an input stream of integers. Optionally, it can call SimplePipe::SetSinkHandler_output() to register a callback for the sink. Finally, it calls SimplePipe::Flush() to initiate the pipeline. A minimal host program following this convention is sketched below.
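This sketch assumes the generated header above is saved as out.h; PrintSink is an illustrative handler written for this example, not part of the generated code.

#include <cstdio>
#include "out.h" // the generated interface shown above

// Illustrative sink handler that prints every integer streamed into the sink.
class PrintSink : public SinkHandler
{
public:
    void ProcessBuffer(void * data, int count) override
    {
        int * items = static_cast<int *>(data);
        for (int i = 0; i < count; i++)
            printf("%d\n", items[i]);
    }
};

int main()
{
    int input[] = { 1, 2, 3 };
    PrintSink sink;
    SimplePipe * pipe = CreateSimplePipe();
    pipe->SetInput_input(input, 3);     // feed the source
    pipe->SetSinkHandler_output(&sink); // register the sink callback
    pipe->Flush();                      // run: prints 1, 2, 4, 8, 9, 18
    DestroySimplePipe(pipe);
    return 0;
}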

3 Scheduling Pipelines

The compiler schedules pipelines onto current CPU architectures. The resulting schedule is expected to be fully parallelized and as efficient as hand-tuned code. We only exploit intra-stage parallelism, i.e. we do not try to run two different stages at the same time, since proper handling of the producer-consumer relationship would require fine-grained synchronization between threads, which is costly on current hardware.

3.1 Scheduling pipe connections

When two stages (S1 and S2) are connected by a pipe, a fixed-size buffer (referred to as B) is allocated to store outputs from S1. Before S1 can be invoked, we run a pre-pass to determine the maximum number of inputs S1 can consume, so that the outputs from S1 fit in the size of B. This can be done by generating a pre-pass version of the stage kernel that returns only the number of output items for a given input item. The pre-pass kernel of S1 is called in parallel for a segment of the input stream at a time to collect output rates for each element in the segment. If buffer B is large enough to hold the outputs of the entire segment, the next segment is fetched to the pre-pass kernels, until B is full. After the number of consumable input items is determined, we pre-compute the output location for each input item by doing a prefix sum over the output rates computed in the pre-pass. We then run the stage kernel for these items in parallel, generating the actual outputs and storing them in B, as shown in Figure 1.

Figure 1: Scheduling of a pipe connection. Input items X1..X5 flow through the pre-pass (collecting output rates) and a prefix sum, then the stage kernel S1 writes into an output buffer of size 10.

Because the buffer size is limited, a stage may not be able to consume all items in its input stream at once, and the input stream must be split into many batches. In this case, the stage fetches a first batch of input, does the computation and fills its output buffer, calls the rest of the pipeline to drain the output buffer, and then fetches the next batch of input and repeats the process. Since the same strategy applies to all the following stages, the generated code takes the shape of nested loops, with each stage lying at a deeper level. The sketch below illustrates this strategy for a single pipe connection.
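The following C++ sketch is a simplified, sequential rendition of this batching strategy for one pipe connection. OutputRate stands in for the generated pre-pass kernel, RunKernel for the real stage kernel, and DrainRestOfPipeline for the nested inner loops; in the generated code the pre-pass, the prefix sum and the kernel loop all run in parallel.

#include <vector>

// Stand-ins for generated kernels (hypothetical; for illustration only).
int OutputRate(int x) { return x % 3; }           // pre-pass: output count only
void RunKernel(int x, int * out) { for (int k = 0; k < x % 3; k++) out[k] = x; }

void RunPipeConnection(const std::vector<int> & input, int bufferSize)
{
    std::vector<int> buffer(bufferSize);          // fixed-size buffer B
    size_t consumePtr = 0;
    while (consumePtr < input.size())
    {
        // Pre-pass + prefix sum: offsets[i] is where item i's outputs go.
        // Consume items only while their outputs still fit in B.
        // (Sketch assumes B can hold at least one item's outputs.)
        std::vector<int> offsets;
        int total = 0;
        size_t batchEnd = consumePtr;
        while (batchEnd < input.size())
        {
            int rate = OutputRate(input[batchEnd]);
            if (total + rate > bufferSize) break; // B is full
            offsets.push_back(total);
            total += rate;
            batchEnd++;
        }
        // Run the stage kernel for the batch; with precomputed offsets,
        // every iteration writes a disjoint region and can run in parallel.
        for (size_t i = consumePtr; i < batchEnd; i++)
            RunKernel(input[i], buffer.data() + offsets[i - consumePtr]);
        // Drain B through the rest of the pipeline (the deeper nested loops
        // of the generated code), then fetch the next batch.
        // DrainRestOfPipeline(buffer.data(), total);  // hypothetical
        consumePtr = batchEnd;
    }
}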

3.2 Scheduling grid connections

Unlike pipes, grids have different semantics: a grid has the ability to merge two items at the same location, and hence it is not constrained by the number of incoming items. It is wise to accumulate as many items as possible in a grid before continuing to the rest of the pipeline, because accumulating more items into the grid allows more chances for items targeting the same location to get merged, saving computation. In the extreme case, we can delay the execution of the following stages until all initial inputs are drained. However, this is possible only when the rest of the pipeline does not reference any items produced in previous stages. If such a reference exists, then the referenced buffer cannot be flushed to accommodate the next batch of input, and we must schedule the rest of the pipeline in order to drain the referenced buffer. In short, we would like to delay the stages behind a grid as much as possible, but no later than the latest end of lifetime of referenced items.

Figure 2 demonstrates an example of grid scheduling. In this pipeline, stage 3 accesses items in pipe 1 through references, so items in pipe 1 must not be flushed before stage 3 has executed and finished processing all the references. The resulting schedule is (see the sketch after Figure 2):
1. run stage 1 to process a batch of items from the source and fill the buffer of pipe 1;
2. run stage 2 to process a batch of items in pipe 1;
3. repeat step 2 multiple times until everything in pipe 1 is processed;
4. run stage 3 to consume the grid, possibly accessing items in pipe 1;
5. go to step 1 and process the next batch of items in the source.
On the other hand, note that if stage 3 does not use any references into pipe 1, we can keep running stage 1 and stage 2 until every item in the source has been processed, and then run stage 3 only once to finish the pipeline.

Figure 2: An example pipeline containing a grid and references (Source >> Stage 1 >> Pipe 1 >> Stage 2 >> Grid >> Stage 3 >> Sink).
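A C++ sketch of this five-step schedule as generated control flow is shown below; the stub functions are hypothetical stand-ins for the generated per-batch stage code, with trivial bodies so the sketch runs.

// Hypothetical stubs for the generated per-batch stage invocations.
static int sourceBatches = 4, pipe1Batches = 0;
bool SourceHasItems()     { return sourceBatches > 0; }
void Stage1FillPipe1()    { sourceBatches--; pipe1Batches = 3; }
bool Pipe1HasItems()      { return pipe1Batches > 0; }
void Stage2ConsumePipe1() { pipe1Batches--; /* emit into grid, merging */ }
void Stage3ConsumeGrid()  { /* may follow references into pipe 1 */ }

void RunPipelineWithGrid()
{
    while (SourceHasItems())       // step 5: loop back for the next batch
    {
        Stage1FillPipe1();         // step 1: fill pipe 1 from the source
        while (Pipe1HasItems())    // step 3: repeat step 2 until drained
            Stage2ConsumePipe1();  // step 2: process a batch of pipe 1
        Stage3ConsumeGrid();       // step 4: consume grid; pipe 1 still alive
        // pipe 1's buffer may only be reused here, after stage 3 finishes.
    }
}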

The compiler relies on inter-stage dependency analysis to generate such a schedule. In the first step, the compiler analyzes each stage to identify dependent fields in input items. In the second step, inter-stage information containing the source of references is propagated through the pipeline to figure out the actual dependences. We have carefully designed the language semantics to make this analysis easy. First, references must be passed through the pipeline. A stage can produce a structure containing reference fields. Reference fields must be initialized when the structure is defined, and only input items to a stage can be assigned to a reference field. For the example in Figure 2, the only way for stage 3 to get access to pipe 1 is to have stage 2 pass a reference to its input on to stage 3. To do so, stage 2 must return a structure containing a reference field. Stage 2 assigns its input to this reference field and emits the structure to the grid. When stage 3 has fetched the structure from the grid, it can use this reference field to access the item in pipe 1. Since programs can only be written in this way, the compiler knows that when a structure containing a reference field is initialized and emitted, the stage creates a reference to its input stream. By propagating this information through the pipeline, the stage that uses the reference can know where the reference comes from.

4 Optimizations

When scheduling pipe connections, the pre-pass can be greatly simplified if we know the stage's output rate is invariant to its input, i.e. the stage produces the same number of output items regardless of the actual input values. In that case, by running the stage's pre-pass code only once to get the exact output rate, we immediately know the maximum number of consumable inputs by dividing the output buffer size by this output rate. This avoids executing the pre-pass kernel once per input item, reducing the complexity of a pre-pass from O(n) to O(1).

To determine whether a stage has an invariant output rate, we run a dead code elimination style analysis on the stage code. Initially, each emit instruction is marked as alive, but we do not mark the emitted variable. The rest of the analysis is exactly the same as standard dead code elimination, except that we do not actually remove the instructions. If any instruction that reads the input is marked as alive, the analyzer returns output variant; otherwise it returns output invariant. Because the language has arrays and structures (which are compiled into pointer instructions), we run an alias analysis beforehand to get a correct result. A sketch of this liveness test appears at the end of Section 5.

In our language, stages cannot interact with the external context other than by emitting items to their output streams. Therefore, all functions are side-effect free and we can apply more aggressive dead code elimination. We found that the C++ compiler fails to recognize this fact, so we make our compiler perform dead code elimination before generating C++ code. This is particularly useful in generating stage pre-pass code, because in many cases the return values of external functions do not determine the output rate.

5 Implementation

Our implementation of the pipeline compiler contains approximately 11,000 lines of C++ code. The compiler does not rely on any third party projects or libraries. The implementation includes comprehensive semantics checking, our self-designed IL (with a C++ code emitter), control flow graph utilities, pointer alias analysis and a dead code elimination optimizer. We have also developed a control flow graph visualizer for debugging purposes, as shown in Figure 3.
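As a concrete illustration of the invariant-output-rate analysis from Section 4, the following sketch runs the liveness marking over a hypothetical miniature IR. The Inst representation and HasInvariantOutputRate are illustrative simplifications, not our actual IL; control dependences are assumed to be precomputed.

#include <vector>

// Miniature IR: each instruction lists the instructions whose values it uses
// (dataDeps) and the branch instructions that control whether it executes
// (controlDeps). readsInput marks instructions that load the stage input.
struct Inst
{
    bool isEmit = false;
    bool readsInput = false;
    std::vector<int> dataDeps;
    std::vector<int> controlDeps;
};

// Mark emits alive without marking the emitted values, then propagate
// liveness through whatever decides how many emits execute. If anything
// alive reads the input, the output rate depends on the input.
bool HasInvariantOutputRate(const std::vector<Inst> & code)
{
    std::vector<bool> alive(code.size(), false);
    std::vector<int> worklist;
    for (int i = 0; i < (int)code.size(); i++)
        if (code[i].isEmit)
            for (int dep : code[i].controlDeps) // emitted value stays unmarked
                worklist.push_back(dep);
    while (!worklist.empty())
    {
        int i = worklist.back();
        worklist.pop_back();
        if (alive[i]) continue;
        alive[i] = true;
        if (code[i].readsInput) return false;   // output variant
        for (int dep : code[i].dataDeps) worklist.push_back(dep);
        for (int dep : code[i].controlDeps) worklist.push_back(dep);
    }
    return true;                                // output invariant
}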

Figure 3: Our control flow graph visualizer tool.

6 Evaluation

To evaluate the performance of the compiled pipeline, we implemented a simple rendering pipeline using our pipeline specification language. The code of the rendering pipeline in our pipeline specification language is included as an appendix. Figure 4 shows the renderer rendering the Sibenik Cathedral scene, which contains 75,284 triangles.

Our compiler offers an option to generate performance instrumentation code for benchmarking purposes. This data can be used to compute the scheduling overhead. We measured the overheads of all four stages in the rendering pipeline; the performance and scheduling overhead under different optimization settings are shown in Figure 5. We ran the experiments on a PC with an Intel Core quad-core CPU and 16GB of memory.

The renderer generated by our compiler achieves performance comparable to a hand-tuned renderer. Our previous, heavily optimized renderer finishes the frame shown in Figure 4 at 35fps, while the generated renderer achieves 29fps. Note that our hand-tuned renderer exploits SIMD instructions for further parallelism, while the simple rendering pipeline implemented in the pipeline specification language does not leverage SIMD. We expect that a proper SIMD implementation in our language would achieve equal or better performance than our previous renderer.
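The instrumentation option mentioned above wraps stage invocations in cycle counters; a hypothetical sketch (not our generated code) of how per-stage scheduling overhead can be derived from such counters:

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Per-stage counters: total time includes the pre-pass, prefix sum and
// buffer management; core time covers only the stage kernel itself, so the
// difference is the scheduling overhead plotted in Figure 5.
struct StageCounters
{
    uint64_t totalCycles = 0;
    uint64_t coreCycles = 0;
    uint64_t SchedulingOverhead() const { return totalCycles - coreCycles; }
};

// Measure one invocation with the CPU's timestamp counter.
template <typename F>
uint64_t CycleCount(F f)
{
    uint64_t start = __rdtsc();
    f();
    return __rdtsc() - start;
}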

Figure 4: The compiled renderer in action.

7 Surprises

We initially thought that optimizations such as dead code elimination could be handled well by the C++ compiler, so we did not implement this optimization in our compiler. However, when we generated C++ code and inspected the compiled assembly, we found that the C++ compiler failed to recognize that the called external function is side-effect free and did not apply the expected optimizations, and the resulting performance was unsatisfactory. We then decided to implement dead code elimination ourselves, which added another 2,000 lines of code to our system, since we do not depend on any existing compiler framework such as LLVM. Fortunately, we still managed to finish it in time.

We had not planned the invariant output rate analysis and the simplified scheduling for this special type of stage until we compiled our first renderer and discovered that stages with invariant output rates are actually common in many graphics applications.

Figure 5: Scheduling overhead (in cycles) of the four rendering pipeline stages (shade vertex, assemble triangle, rasterize, shade fragment) at different optimization settings. Left: unoptimized; middle: dead-code elimination on stage pre-pass code; right: simplified pre-pass for stages with invariant output rate, plus dead-code elimination on stage pre-pass code. The dark colored region represents time used in core computation, and the entire bar represents total stage time.

8 Conclusion

In this project, we proposed a new programming language for pipeline applications. Our compiler is able to analyze the pipeline and generate a high quality implementation. With our language, it is much easier to experiment with different variations of pipelines and study their performance behavior. For example, the programmer can switch to another schedule simply by changing the type of connection between stages, or by changing the way data is passed through the pipeline (by reference or by value); the compiler does all the rest to generate the best implementation. For future work, we would like to extend the system to support more types of connections, and we also want to study how to schedule pipelines onto next-generation heterogeneous architectures, where CPU and GPU are placed on the same chip and share caches. In that case, being able to schedule different stages to different computation cores at the same time becomes critical.

9 Credit distribution

We believe that we have done more than 125% of our initially expected work, and the total credit should be distributed equally among the authors.

References

[1] Qiming Hou, Kun Zhou, and Baining Guo. BSGP: Bulk-synchronous GPU programming. ACM Trans. Graph., 27(3), August 2008.

[2] Storm.

[3] Jeremy Sugerman, Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, and Pat Hanrahan. GRAMPS: A programming model for graphics pipelines. ACM Trans. Graph., 28(1):4:1-4:11, February 2009.

[4] William Thies, Michal Karczmarek, and Saman P. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction (CC '02), pages 179-196, London, UK, 2002. Springer-Verlag.

[5] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08), pages 1-14, Berkeley, CA, USA, 2008. USENIX Association.

Appendix: source code of a simple rendering pipeline

#include "RendererUtilities.h"

Source<int> Indices;
in float[] Vertices;
in int VertexAttributeSize;
in int VertexShaderOutputSize;
in int RenderState; // pointer to render state
in int ScreenWidth;
in int ScreenHeight;

struct Triangle
{
    float a0, b0, c0;
    float a1, b1, c1;
    float Z0, Z1, Z2;
    float divisor;
    int IsClipped;
    float[3] tricoord0, tricoord1, tricoord2; // clip vertex coordinates
    int MinX, MaxX;
    int MinY, MaxY;
    ref float[] vertex; // reference to vertices
};

struct Fragment : Element2D
{
    float alpha, beta, gamma, z;
    ref Triangle triangle;
};

struct Color : Element2D
{
    float R, G, B, A, Z;
};

extern void RunVertexShader(int renderstate, float[] result, float[] vertex);
extern int ClipTriangle(int renderstate, float[] vertices, Triangle[] triangles);
extern int ClipTriangle_Count(int renderstate, float[] vertices);
extern Color RunFragmentShader(int renderstate, Fragment f);

parallel stage ShadeVertex() : int >> float
{
    float[256] buffer;
    RunVertexShader(RenderState, buffer, Vertices + in * VertexAttributeSize);
    for (int i = 0; i < VertexShaderOutputSize; i++)
        emit buffer[i];
}

parallel stage AssembleTriangle() : float[VertexShaderOutputSize*3] >> Triangle
{
    claim ClipTriangle_Count(RenderState, in);
    Triangle[7] triangles;
    int i;
    for (i = 0; i < 7; i++)
        triangles[i] = Triangle{vertex: in};
    int numtriangles = ClipTriangle(RenderState, in, triangles);
    for (i = 0; i < numtriangles; i++)
        emit triangles[i];
}

parallel stage Rasterize() : Triangle >> Fragment
{
    float invw1 = 1.0f / in.vertex[3];
    float invw2 = 1.0f / in.vertex[3 + VertexShaderOutputSize];
    float invw3 = 1.0f / in.vertex[3 + VertexShaderOutputSize*2];
    float divisor = 1.0f / in.divisor;
    for (int i = in.MinX; i <= in.MaxX; i++)
        for (int j = in.MinY; j <= in.MaxY; j++)
        {
            Fragment f = Fragment{triangle: in};
            f.index0 = i;
            f.index1 = j;
            float x = i + 0.5f;
            float y = j + 0.5f;
            f.beta = in.a0*x + in.b0*y + in.c0;
            if (f.beta < 0.0f) continue;
            f.gamma = in.a1*x + in.b1*y + in.c1;
            if (f.gamma < 0.0f) continue;
            f.alpha = in.divisor - f.beta - f.gamma;
            if (f.alpha < 0.0f) continue;
            f.beta *= divisor;
            f.gamma *= divisor;
            f.alpha *= divisor;
            f.alpha = invw1 * f.alpha;
            f.beta = invw2 * f.beta;
            f.gamma = invw3 * f.gamma;
            float interinvw = 1.0f / (f.alpha + f.beta + f.gamma);
            f.alpha *= interinvw;
            f.beta *= interinvw;
            f.gamma *= interinvw;
            f.z = in.Z0*f.alpha + in.Z1*f.beta + in.Z2*f.gamma;
            emit f;
        }
}

parallel stage ShadeFragment() : Fragment >> Color
{
    Color rs = RunFragmentShader(RenderState, in);
    rs.index0 = in.index0;
    rs.index1 = in.index1;
    rs.Z = in.z;
    emit rs;
}

void zmerge(ref Fragment f, Fragment newfrag)
{
    if (newfrag.z < f.z)
        f = newfrag;
}

Pipe<float> shadedvertices;
Pipe<Triangle> clippedtriangles;
Grid<Fragment> gbuffer(ScreenWidth, ScreenHeight, zmerge);
Sink<Color> image;

pipe RenderPipe:
    Indices >> ShadeVertex() >> shadedvertices >>
    AssembleTriangle() >> clippedtriangles >>
    Rasterize() >> gbuffer >>
    ShadeFragment() >> image;
