Project Final Report: High Performance Pipeline Compiler
Yong He, Yan Gu

1 Introduction

Writing stream processing programs directly in low-level languages such as C++ is tedious and bug-prone. Many systems have been developed to simplify the programming of stream processing applications and to distribute computation across devices (e.g., GPUs or clusters), such as GRAMPS [3], StreamIt [4], BSGP [1], Storm [2], and DryadLINQ [5]. These systems propose general-purpose programming constructs that hide the complexity of buffer management and computation scheduling. However, all of them compromise performance in some way in order to remain general purpose. We recognized a special type of stream processing that can be compiled into much faster code: pipelines. Unlike general-purpose stream processing, which models computation as a directed acyclic graph of computation nodes (kernels), pipelines are formed by chaining kernels together without introducing complex inter-kernel dataflow. As a result, scheduling and buffer management can be done more efficiently, making it possible to generate high-performance code that runs as fast as hand-tuned C++ code.

One example of a pipeline application is the rendering pipeline. In a rendering pipeline, a set of vertices is provided as input to the system. The pipeline computes internal triangle representations from these vertices, clips them against the screen, rasterizes them into pixels, and finally shades the pixels to create an image. Traditionally, a rendering pipeline is written by hand and heavily tuned for the highest possible performance. As a result, real-world intricacies such as limited buffer sizes and low-level optimizations make it extremely hard to reshape the pipeline or change its schedule. We designed a new programming language for writing pipeline applications that compile to highly parallelized code whose performance rivals hand-tuned code.
Our language extends C syntax to express stages (computation nodes in our system), buffers used for synchronization between stages, and pipelines assembled by chaining stages and buffers. The compiler analyzes the stage code and the stage connections to determine the most efficient schedule for the given pipeline, and generates a highly parallelized pipeline implementation with performance comparable to hand-tuned code. Programs written in our pipeline specification language are compact and easy to maintain (our sample rendering pipeline specification contains only 143 lines of code and is compiled to 756 lines of C++ code). It is therefore easy to change the pipeline and experiment with different schedules using our language.
2 Programming Model Design

We start from a simple C-like language without pointers and extend it with the following constructs:

Stage. A stage is a kernel function that takes one or more items from an input stream and emits zero or more items to an output stream. The number of input items must be statically defined, while the number of output items can be fully dynamic at runtime. A stage should contain one or more emit statements that push an item to its output stream.

Pipe. A standard pipe is a place where stages take inputs from and write outputs to. Pipes guarantee the correct ordering of items. For example, if a pipe containing items [x_1, x_2, x_3] is passed to a stage f with f(x_i) = (y_1(x_i), ..., y_n(x_i)) that outputs to another pipe, the resulting pipe will contain the items [y_1(x_1), ..., y_n(x_1), y_1(x_2), ..., y_n(x_2), y_1(x_3), ..., y_n(x_3)]. There are two main variants of the standard pipe: sources and sinks. A source is the input to the whole pipeline, and a sink is the output. When linked to a host application, the user of a pipeline must provide data to each source and may optionally register callbacks to handle data streamed into a sink.

Grid. A grid is treated by the system as a special kind of pipe, except that it represents a fixed-dimension, fixed-size array of cells. A grid can have 1, 2, or 3 dimensions. Only items marked as Element1D, Element2D, or Element3D can be piped to grids of the corresponding dimension. Stages that emit items to a grid must specify the index of each item, and the item is written to the specified location in the grid. If the specified location already contains an item, a merge function is called to merge the incoming item and the existing item into a new item, which is then stored.

References. Items manipulated by stages can contain references to an item from a previous stage.
The existence of references affects the scheduling of a pipeline, because a referenced item stored in the output buffer of a previous stage must be kept alive when the stage that uses it is executed. Most stream processing systems do not allow such references, because they complicate scheduling and make it very hard to generate distributed implementations. However, references are crucial for generating high-performance code that runs on a single machine, and we chose to support them in order to reduce bandwidth and memory copying overhead.

Pipeline assembler. When all stages and internal pipes/grids have been defined, a pipeline assembler specifies how to assemble the pieces into a pipeline.

2.1 Example Pipeline

The following example code, written in the pipeline specification language, demonstrates several important constructs. The pipeline takes a stream of integers; for each integer x, the Square stage emits x*x and 2*x*x to the output stream, which is handled by external code that prints the results.
parallel stage Square() : int >> int
{
    emit in*in;
    emit in*in*2;
}

Source<int> input;
Sink<int> output;

pipe SimplePipe: input >> Square() >> output;

The pipeline compiler takes this code and produces a C++ class that represents the pipeline. The interface of the generated C++ code is shown below.

#ifndef OUT_H
#define OUT_H

#ifndef PIPE_SYSTEM_HEADER
#define PIPE_SYSTEM_HEADER
class SinkHandler
{
public:
    virtual void ProcessBuffer(void * data, int count) = 0;
};
#endif

class SimplePipe
{
public:
    virtual ~SimplePipe() {}
    virtual void Flush() = 0;
    virtual void SetInput_input(void * data, int count) = 0;
    virtual void SetSinkHandler_output(SinkHandler * handler) = 0;
};

SimplePipe * CreateSimplePipe();
void DestroySimplePipe(SimplePipe * obj);
#endif

To use the pipeline, the host application calls CreateSimplePipe() to create an instance of the pipeline. It then calls SimplePipe::SetInput_input() to feed the source with an input stream of integers. Optionally, it can call SimplePipe::SetSinkHandler_output() to register a callback for the sink. Finally, it calls SimplePipe::Flush() to run the pipeline.

3 Scheduling Pipelines

The compiler schedules pipelines onto current CPU architectures. The resulting schedule is expected to be fully parallelized and as efficient as hand-tuned code. We exploit only intra-stage parallelism, i.e., we never run two different stages at the same time, since properly handling the producer-consumer relationship would require fine-grained synchronization between threads, which is costly on current hardware.
3.1 Scheduling pipe connections

When two stages S1 and S2 are connected by a pipe, a fixed-size buffer B is allocated to store the outputs of S1. Before S1 can be invoked, we run a pre-pass to determine the maximum number of inputs S1 can consume such that its outputs fit in B. This is done by generating a pre-pass version of each stage kernel that returns only the number of output items for a given input item. The pre-pass kernel of S1 is called in parallel on one segment of the input stream at a time to collect the output rate of each element in the segment. If buffer B is large enough to hold the outputs of the entire segment, the next segment is fed to the pre-pass kernel, until B is full. After the number of consumable input items is determined, we pre-compute the output location of each input item by running a prefix sum over the output rates collected in the pre-pass. We then run the stage kernel on these items in parallel, generating the actual outputs and storing them in B, as shown in Figure 1.

Figure 1: Scheduling of a pipe connection: the pre-pass collects output rates for inputs X1..X5, a prefix sum assigns output locations, and the stage kernel S1 writes into an output buffer of size 10.

Because the buffer size is limited, a stage may not be able to consume all items in its input stream at once, so the input stream must be split into batches. In this case, the stage fetches the first batch of input, does its computation and fills its output buffer, calls the rest of the pipeline to drain the output buffer, and then fetches the next batch and repeats the process. Since the same strategy applies to all following stages, the generated code has the shape of nested loops, with each stage at a deeper nesting level.

3.2 Scheduling grid connections

Unlike pipes, grids have the ability to merge two items at the same location, and hence a grid is not constrained by the number of incoming items. It is wise to accumulate as many items as possible in a grid before continuing to the rest of the pipeline: by accumulating more items, we give items targeting the same location more chances to be merged, saving computation. In the extreme case, we can delay the execution of the following stages until all initial inputs are drained. However, this is only possible when the rest of the pipeline does not reference any items produced in previous stages. If such a reference exists, the referenced buffer cannot be flushed to accommodate the next batch of input, and we must schedule the rest of the pipeline in order to drain the referenced buffer. In short, we would like to delay the stages behind a grid as much as possible, but no later than the latest end of lifetime of any referenced item.

Figure 2 demonstrates an example of grid scheduling. In this pipeline, stage 3 accesses items in pipe 1 through references, so items in pipe 1 must not be flushed before stage 3 has executed and finished processing all the references. The resulting schedule is:

1. run stage 1 to process a batch of items from the source and fill the buffer of pipe 1;
2. run stage 2 to process a batch of items in pipe 1;
3. repeat step 2 until everything in pipe 1 is processed;
4. run stage 3 to consume the grid, possibly accessing items in pipe 1;
5. go to step 1 and process the next batch of items from the source.

Note that if stage 3 did not hold any references into pipe 1, we could keep running stage 1 and stage 2 until every item in the source had been processed, and then run stage 3 only once to finish the pipeline.

Figure 2: An example pipeline containing a grid and references (Source, Stage 1, Pipe 1, Stage 2, Grid, Stage 3, Sink).

The compiler relies on inter-stage dependency analysis to generate such a schedule. In the first step, the compiler analyzes each stage to identify dependent fields in input items.
In the second step, inter-stage information recording the sources of references is propagated through the pipeline to determine the actual dependences. We carefully designed the language semantics to make this analysis easy. First, references must be passed through the pipeline: a stage can produce a structure containing reference fields, reference fields must be initialized when the structure is constructed, and only the input item of a stage can be assigned to a reference field. For the example in Figure 2, the only way for stage 3 to gain access to pipe 1 is for stage 2 to pass a reference to its input on to stage 3. To do so, stage 2 must return a structure containing a reference field, assign its input to this field, and emit the structure to the grid. When stage 3 fetches the structure from the grid, it can use the reference field to access the item in pipe 1. Since programs can only be written this way, the compiler knows that when a structure containing a reference field is initialized and emitted, the stage is creating a reference to its input stream. By propagating this information through the pipeline, the stage that uses a reference knows where the reference comes from.

4 Optimizations

When scheduling pipe connections, the pre-pass can be greatly simplified if we know that a stage's output rate is invariant to its input, i.e., the stage produces the same number of output items regardless of the actual input values. In that case, running the stage's pre-pass code only once gives the exact output rate, and dividing the output buffer size by this rate immediately yields the maximum number of consumable inputs. This avoids executing the pre-pass kernel once per input item, reducing the complexity of a pre-pass from O(n) to O(1).

To determine whether a stage has an invariant output rate, we run a dead-code-elimination-style analysis on the stage code. Initially, each emit instruction is marked as alive, but the emitted variable is not marked. The rest of the analysis is exactly the same as standard dead code elimination, except that no instructions are actually removed. If any instruction that reads the input ends up marked alive, the analyzer reports the output rate as variant; otherwise it reports it as invariant. Because the language has arrays and structures (which compile into pointer instructions), we run an alias analysis beforehand to obtain correct results.

In our language, stages cannot interact with the external context other than by emitting items to their output streams. Therefore, all functions are side-effect free, and we can apply more aggressive dead code elimination.
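As a sketch of this analysis, the marking step can be phrased as a worklist pass over a toy IR. The Instr layout below is hypothetical, and data and control dependences are folded into one list for brevity; the real compiler runs the analysis on its own IL after alias analysis.

```cpp
#include <cstddef>
#include <vector>

// Toy three-address IR: each instruction lists the instructions it depends on
// (data and control dependences together, for brevity).
struct Instr {
    std::vector<int> deps;  // ids of instructions this one depends on
    bool isEmit;            // emit statements seed the analysis
    int payload;            // for emits: id of the instruction computing the
                            //   emitted value (deliberately NOT marked); -1 if none
    bool readsInput;        // does this instruction read the input stream?
};

// Returns true if the stage's output rate may depend on its input: mark every
// emit alive, propagate aliveness through dependences while skipping the
// emitted payload, then check whether any alive instruction reads the input.
bool outputRateIsVariant(const std::vector<Instr>& code) {
    std::vector<bool> alive(code.size(), false);
    std::vector<int> work;
    for (std::size_t i = 0; i < code.size(); i++)
        if (code[i].isEmit) { alive[i] = true; work.push_back((int)i); }
    while (!work.empty()) {
        int i = work.back(); work.pop_back();
        for (int d : code[i].deps) {
            if (d == code[i].payload) continue;  // payload value stays unmarked
            if (!alive[d]) { alive[d] = true; work.push_back(d); }
        }
    }
    for (std::size_t i = 0; i < code.size(); i++)
        if (alive[i] && code[i].readsInput) return true;
    return false;
}
```

For `emit in*in` with no input-dependent control flow, only the emit is marked and the rate is reported invariant; for an emit guarded by `if (in > 0)`, the comparison that reads the input becomes alive and the rate is reported variant.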
We found that C++ compilers fail to recognize this side-effect freedom, so our compiler performs dead code elimination itself before generating C++ code. This is particularly useful when generating stage pre-pass code, because in many cases the return values of external functions do not affect the output rate.

5 Implementation

Our implementation of the pipeline compiler contains approximately 11,000 lines of C++ code and does not rely on any third-party projects or libraries. The implementation includes comprehensive semantic checking, a self-designed IL (with a C++ code emitter), control flow graph utilities, pointer alias analysis, and a dead code elimination optimizer. We also developed a control flow graph visualizer for debugging purposes, as shown in Figure 3.
Figure 3: Our control flow graph visualizer tool.

6 Evaluation

To evaluate the performance of compiled pipelines, we implemented a simple rendering pipeline in our pipeline specification language; its source code is included in the appendix. Figure 4 shows the renderer rendering the Sibenik Cathedral scene, which contains 75,284 triangles.

Our compiler offers an option to generate performance instrumentation code for benchmarking. This data can be used to compute the scheduling overhead. We measured the overheads of all four stages in the rendering pipeline; the performance and scheduling overhead under different optimization settings are shown in Figure 5. We ran the experiments on a PC with a quad-core Intel Core CPU and 16GB of memory.

The renderer generated by our compiler achieves performance matching a hand-tuned renderer. Our previous, heavily optimized renderer finishes the frame shown in Figure 4 at 35fps, while the generated renderer achieves 29fps. Note that our hand-tuned renderer exploits SIMD instructions for further parallelism, while the simple rendering pipeline implemented in the pipeline specification language does not leverage SIMD. We expect that a proper SIMD implementation in our language would achieve equal or better performance than our previous renderer.

Figure 4: The compiled renderer running in action.

7 Surprises

We initially thought that optimizations such as dead code elimination would be handled well by the C++ compiler, so we did not implement them in our compiler. However, when we generated C++ code and inspected the compiled assembly, we found that the C++ compiler failed to recognize that the called external functions are side-effect free and did not apply the expected optimizations, and the resulting performance was unsatisfactory. We then decided to implement dead code elimination ourselves, which added another 2,000 lines of code to our system, since it does not depend on any existing compiler framework such as LLVM. Fortunately, we still managed to finish it in time.

We had not planned to implement the invariant-output-rate analysis and the simplified scheduling for this special type of stage until we compiled our first renderer and discovered that stages with invariant output rates are actually common in many graphics applications.
Figure 5: Runtime performance (in cycles) of the four stages of the compiled rendering pipeline (stage 1: shade vertex; stage 2: assemble triangle; stage 3: rasterize; stage 4: shade fragment) under different optimization settings. Left: unoptimized; middle: dead-code elimination on stage pre-pass code; right: simplified pre-pass for stages with invariant output rate, plus dead-code elimination on stage pre-pass code. The dark region of each bar is the time spent in core computation; the full bar is the total stage time.

8 Conclusion

In this project, we proposed a new programming language for pipeline applications. Our compiler analyzes the pipeline and generates a high-quality implementation. With our language, it is much easier to experiment with different variations of a pipeline and study their performance behavior. For example, the programmer can switch to another schedule simply by changing the type of connection between stages, or by changing how data is passed through the pipeline (by reference or by value); the compiler does the rest to generate the best implementation. For future work, we would like to extend the system to support more types of connections, and to study how to schedule pipelines onto next-generation heterogeneous architectures, where the CPU and GPU share a chip and its caches. In that setting, the ability to schedule different stages onto different computation cores at the same time becomes critical.

9 Credit distribution

We believe that we have done more than 125% of our initially expected work, and the total credit should be distributed equally among the authors.
References

[1] Qiming Hou, Kun Zhou, and Baining Guo. BSGP: bulk-synchronous GPU programming. ACM Trans. Graph., 27(3), 2008.

[2] Storm: distributed realtime computation system.

[3] Jeremy Sugerman, Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, and Pat Hanrahan. GRAMPS: A programming model for graphics pipelines. ACM Trans. Graph., 28(1):4:1-4:11, February 2009.

[4] William Thies, Michal Karczmarek, and Saman P. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction (CC '02), London, UK, 2002. Springer-Verlag.

[5] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08), pages 1-14, Berkeley, CA, USA, 2008. USENIX Association.
Appendix: source code of a simple rendering pipeline

#include "RendererUtilities.h"

Source<int> Indices;
in float[] Vertices;
in int VertexAttributeSize;
in int VertexShaderOutputSize;
in int RenderState; // pointer to render state
in int ScreenWidth;
in int ScreenHeight;

struct Triangle
{
    float a0, b0, c0;
    float a1, b1, c1;
    float Z0, Z1, Z2;
    float divisor;
    int IsClipped;
    float[3] tricoord0, tricoord1, tricoord2; // clip vertex coordinates
    int MinX, MaxX;
    int MinY, MaxY;
    ref float[] vertex; // reference to vertices
};

struct Fragment : Element2D
{
    float alpha, beta, gamma, z;
    ref Triangle triangle;
};

struct Color : Element2D
{
    float R, G, B, A, Z;
};

extern void RunVertexShader(int renderstate, float[] result, float[] vertex);
extern int ClipTriangle(int renderstate, float[] vertices, Triangle[] triangles);
extern int ClipTriangle_Count(int renderstate, float[] vertices);
extern Color RunFragmentShader(int renderstate, Fragment f);

parallel stage ShadeVertex() : int >> float
{
    float[256] buffer;
    RunVertexShader(RenderState, buffer, Vertices + in * VertexAttributeSize);
    for (int i = 0; i < VertexShaderOutputSize; i++)
        emit buffer[i];
}

parallel stage AssembleTriangle() : float[VertexShaderOutputSize*3] >> Triangle
{
    claim ClipTriangle_Count(RenderState, in);
    Triangle[7] triangles;
    int i;
    for (i = 0; i < 7; i++)
        triangles[i] = Triangle{vertex: in};
    int numTriangles = ClipTriangle(RenderState, in, triangles);
    for (i = 0; i < numTriangles; i++)
        emit triangles[i];
}

parallel stage Rasterize() : Triangle >> Fragment
{
    float invW1 = 1.0f/in.vertex[3];
    float invW2 = 1.0f/in.vertex[3+VertexShaderOutputSize];
    float invW3 = 1.0f/in.vertex[3+VertexShaderOutputSize*2];
    float divisor = 1.0f/in.divisor;
    for (int i = in.MinX; i <= in.MaxX; i++)
        for (int j = in.MinY; j <= in.MaxY; j++)
        {
            Fragment f = Fragment{triangle: in};
            f.index0 = i;
            f.index1 = j;
            float x = i + 0.5f;
            float y = j + 0.5f;
            f.beta = in.a0*x + in.b0*y + in.c0;
            if (f.beta < 0.0f) continue;
            f.gamma = in.a1*x + in.b1*y + in.c1;
            if (f.gamma < 0.0f) continue;
            f.alpha = in.divisor - f.beta - f.gamma;
            if (f.alpha < 0.0f) continue;
            f.beta *= divisor;
            f.gamma *= divisor;
            f.alpha *= divisor;
            f.alpha = invW1*f.alpha;
            f.beta = invW2*f.beta;
            f.gamma = invW3*f.gamma;
            float interInvW = 1.0f/(f.alpha+f.beta+f.gamma);
            f.alpha *= interInvW;
            f.beta *= interInvW;
            f.gamma *= interInvW;
            f.z = in.Z0*f.alpha + in.Z1*f.beta + in.Z2*f.gamma;
            emit f;
        }
}

parallel stage ShadeFragment() : Fragment >> Color
{
    Color rs = RunFragmentShader(RenderState, in);
    rs.index0 = in.index0;
    rs.index1 = in.index1;
    rs.Z = in.z;
    emit rs;
}

void zmerge(ref Fragment f, Fragment newFrag)
{
    if (newFrag.z < f.z)
        f = newFrag;
}

Pipe<float> shadedVertices;
Pipe<Triangle> clippedTriangles;
Grid<Fragment> gbuffer(ScreenWidth, ScreenHeight, zmerge);
Sink<Color> image;

pipe RenderPipe:
    Indices >> ShadeVertex() >> shadedVertices >> AssembleTriangle() >> clippedTriangles
            >> Rasterize() >> gbuffer >> ShadeFragment() >> image;
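The gbuffer declared above merges colliding fragments with zmerge. Its depth-test behavior can be sketched in plain C++ (a minimal sketch with a simplified, hypothetical Fragment layout; as described in Section 2, an emit to an occupied cell invokes the merge function instead of overwriting):

```cpp
#include <vector>

// Hypothetical, simplified fragment: a depth plus a color payload.
struct Fragment {
    float z;
    int color;
};

// The merge function, mirroring zmerge in the appendix: keep the nearer fragment.
void zmerge(Fragment& cell, const Fragment& incoming) {
    if (incoming.z < cell.z) cell = incoming;
}

// A 2D grid of cells: an emit to an empty cell stores the item; an emit to an
// occupied cell merges the incoming item with the stored one.
class Grid2D {
    int w_, h_;
    std::vector<Fragment> cells_;
    std::vector<bool> occupied_;
public:
    Grid2D(int w, int h) : w_(w), h_(h), cells_(w * h), occupied_(w * h, false) {}
    void Emit(int x, int y, const Fragment& f) {
        int i = y * w_ + x;
        if (!occupied_[i]) { cells_[i] = f; occupied_[i] = true; }
        else zmerge(cells_[i], f);
    }
    const Fragment& At(int x, int y) const { return cells_[y * w_ + x]; }
};
```

Emitting fragments with depths 0.5, 0.2, and 0.9 to the same cell leaves only the fragment with depth 0.2, which is exactly the z-buffer behavior the rendering pipeline relies on.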
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationPowerVR Series5. Architecture Guide for Developers
Public Imagination Technologies PowerVR Series5 Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.
More informationLecture 13: Reyes Architecture and Implementation. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)
Lecture 13: Reyes Architecture and Implementation Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) A gallery of images rendered using Reyes Image credit: Lucasfilm (Adventures
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationGPU Architecture. Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)
GPU Architecture Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009
More informationCSE 167: Lecture #5: Rasterization. Jürgen P. Schulze, Ph.D. University of California, San Diego Fall Quarter 2012
CSE 167: Introduction to Computer Graphics Lecture #5: Rasterization Jürgen P. Schulze, Ph.D. University of California, San Diego Fall Quarter 2012 Announcements Homework project #2 due this Friday, October
More informationOptimizing DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most
More informationKampala August, Agner Fog
Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler
More informationGRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
More informationASYNCHRONOUS SHADERS WHITE PAPER 0
ASYNCHRONOUS SHADERS WHITE PAPER 0 INTRODUCTION GPU technology is constantly evolving to deliver more performance with lower cost and lower power consumption. Transistor scaling and Moore s Law have helped
More informationGeForce4. John Montrym Henry Moreton
GeForce4 John Montrym Henry Moreton 1 Architectural Drivers Programmability Parallelism Memory bandwidth 2 Recent History: GeForce 1&2 First integrated geometry engine & 4 pixels/clk Fixed-function transform,
More informationGRAMPS Beyond Rendering. Jeremy Sugerman 11 December 2009 PPL Retreat
GRAMPS Beyond Rendering Jeremy Sugerman 11 December 2009 PPL Retreat The PPL Vision: GRAMPS Applications Scientific Engineering Virtual Worlds Personal Robotics Data informatics Domain Specific Languages
More informationStreamIt on Fleet. Amir Kamil Computer Science Division, University of California, Berkeley UCB-AK06.
StreamIt on Fleet Amir Kamil Computer Science Division, University of California, Berkeley kamil@cs.berkeley.edu UCB-AK06 July 16, 2008 1 Introduction StreamIt [1] is a high-level programming language
More informationReal-Time Graphics Architecture. Kurt Akeley Pat Hanrahan. Ray Tracing.
Real-Time Graphics Architecture Kurt Akeley Pat Hanrahan http://www.graphics.stanford.edu/courses/cs448a-01-fall Ray Tracing with Tim Purcell 1 Topics Why ray tracing? Interactive ray tracing on multicomputers
More informationNext-Generation Graphics on Larrabee. Tim Foley Intel Corp
Next-Generation Graphics on Larrabee Tim Foley Intel Corp Motivation The killer app for GPGPU is graphics We ve seen Abstract models for parallel programming How those models map efficiently to Larrabee
More informationLecture 25: Board Notes: Threads and GPUs
Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel
More informationEXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System
EXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System By Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick
More informationSoftware Occlusion Culling
Software Occlusion Culling Abstract This article details an algorithm and associated sample code for software occlusion culling which is available for download. The technique divides scene objects into
More informationComputer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014
18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationClearSpeed Visual Profiler
ClearSpeed Visual Profiler Copyright 2007 ClearSpeed Technology plc. All rights reserved. 12 November 2007 www.clearspeed.com 1 Profiling Application Code Why use a profiler? Program analysis tools are
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationEfficient and Scalable Shading for Many Lights
Efficient and Scalable Shading for Many Lights 1. GPU Overview 2. Shading recap 3. Forward Shading 4. Deferred Shading 5. Tiled Deferred Shading 6. And more! First GPU Shaders Unified Shaders CUDA OpenCL
More informationParallel Programming for Graphics
Beyond Programmable Shading Course ACM SIGGRAPH 2010 Parallel Programming for Graphics Aaron Lefohn Advanced Rendering Technology (ART) Intel What s In This Talk? Overview of parallel programming models
More informationOptimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing for DirectX Graphics Richard Huddy European Developer Relations Manager Also on today from ATI... Start & End Time: 12:00pm 1:00pm Title: Precomputed Radiance Transfer and Spherical Harmonic
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationRSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog
RSX Best Practices Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog RSX Best Practices About libgcm Using the SPUs with the RSX Brief overview of GCM Replay December 7 th, 2004
More informationScheduling Image Processing Pipelines
Lecture 14: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationProgrammable Graphics Hardware
Programmable Graphics Hardware Outline 2/ 49 A brief Introduction into Programmable Graphics Hardware Hardware Graphics Pipeline Shading Languages Tools GPGPU Resources Hardware Graphics Pipeline 3/ 49
More informationChapter 3: Operating-System Structures
Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines System Design and Implementation System Generation 3.1
More informationA Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function
A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao
More informationFinal Project Writeup
Jitu Das Bertha Lam 15-418 Final Project Writeup Summary We built a framework that facilitates running computations across multiple GPUs and displaying results in a web browser. We then created three demos
More informationA Stream Compiler for Communication-Exposed Architectures
A Stream Compiler for Communication-Exposed Architectures Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali Meli, Andrew Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationFahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County
Accelerating a climate physics model with OpenCL Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou University of Maryland Baltimore County Introduction The demand to increase forecast predictability
More informationReal-Time Graphics Architecture
Real-Time Graphics Architecture Lecture 4: Parallelism and Communication Kurt Akeley Pat Hanrahan http://graphics.stanford.edu/cs448-07-spring/ Topics 1. Frame buffers 2. Types of parallelism 3. Communication
More informationAutomatic Tuning Matrix Multiplication Performance on Graphics Hardware
Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Changhao Jiang (cjiang@cs.uiuc.edu) Marc Snir (snir@cs.uiuc.edu) University of Illinois Urbana Champaign GPU becomes more powerful
More informationCompiling for GPUs. Adarsh Yoga Madhav Ramesh
Compiling for GPUs Adarsh Yoga Madhav Ramesh Agenda Introduction to GPUs Compute Unified Device Architecture (CUDA) Control Structure Optimization Technique for GPGPU Compiler Framework for Automatic Translation
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationCache Aware Optimization of Stream Programs
Cache Aware Optimization of Stream Programs Janis Sermulins, William Thies, Rodric Rabbah and Saman Amarasinghe LCTES Chicago, June 2005 Streaming Computing Is Everywhere! Prevalent computing domain with
More informationA Streaming Virtual Machine for GPUs
A Streaming Virtual Machine for GPUs Kenneth Mackenzie (Reservoir L, Inc) Dan Campbell (Georgia Tech Research Institute) Peter Szilagyi (Reservoir L, Inc) Copyright 2005 Government Purpose Rights, All
More informationGPU Computation Strategies & Tricks. Ian Buck NVIDIA
GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit
More informationShort Notes of CS201
#includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system
More informationCompute-mode GPU Programming Interfaces
Lecture 8: Compute-mode GPU Programming Interfaces Visual Computing Systems What is a programming model? Programming models impose structure A programming model provides a set of primitives/abstractions
More informationHardware-driven visibility culling
Hardware-driven visibility culling I. Introduction 20073114 김정현 The goal of the 3D graphics is to generate a realistic and accurate 3D image. To achieve this, it needs to process not only large amount
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationCS201 - Introduction to Programming Glossary By
CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with
More informationLecture 23: Domain-Specific Parallel Programming
Lecture 23: Domain-Specific Parallel Programming CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Acknowledgments: Pat Hanrahan, Hassan Chafi Announcements List of class final projects
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationShaders. Slide credit to Prof. Zwicker
Shaders Slide credit to Prof. Zwicker 2 Today Shader programming 3 Complete model Blinn model with several light sources i diffuse specular ambient How is this implemented on the graphics processor (GPU)?
More informationParallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
More informationParallel Exact Inference on the Cell Broadband Engine Processor
Parallel Exact Inference on the Cell Broadband Engine Processor Yinglong Xia and Viktor K. Prasanna {yinglonx, prasanna}@usc.edu University of Southern California http://ceng.usc.edu/~prasanna/ SC 08 Overview
More informationX. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1
X. GPU Programming 320491: Advanced Graphics - Chapter X 1 X.1 GPU Architecture 320491: Advanced Graphics - Chapter X 2 GPU Graphics Processing Unit Parallelized SIMD Architecture 112 processing cores
More informationLecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013
Lecture 9: Deferred Shading Visual Computing Systems The course so far The real-time graphics pipeline abstraction Principle graphics abstractions Algorithms and modern high performance implementations
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationBitonic Sorting Intel OpenCL SDK Sample Documentation
Intel OpenCL SDK Sample Documentation Document Number: 325262-002US Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming Pablo Brubeck Department of Physics Tecnologico de Monterrey October 14, 2016 Student Chapter Tecnológico de Monterrey Tecnológico de Monterrey Student Chapter Outline
More informationGraphics Processing Unit Architecture (GPU Arch)
Graphics Processing Unit Architecture (GPU Arch) With a focus on NVIDIA GeForce 6800 GPU 1 What is a GPU From Wikipedia : A specialized processor efficient at manipulating and displaying computer graphics
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More information