Project Final Report: High Performance Pipeline Compiler
Yong He, Yan Gu

1 Introduction

Writing stream processing programs directly in low-level languages such as C++ is tedious and bug-prone. Many systems have been developed to simplify the programming of stream processing applications and to distribute computation across devices (e.g., GPUs or clusters), such as GRAMPS [3], StreamIt [4], BSGP [1], Storm [2], and DryadLINQ [5]. These systems propose general-purpose programming constructs that hide the complexity of buffer management and computation scheduling. However, all of them compromise performance in some way in order to remain general purpose. We recognized a special type of stream processing that can be compiled into much faster code: pipelines. Unlike general-purpose stream processing, which models computation as a directed acyclic graph of computation nodes (kernels), pipelines are formed by chaining kernels together without introducing complex inter-kernel dataflow. As a result, scheduling and buffer management can be done more efficiently, making it possible to generate high-performance code that runs as fast as hand-tuned C++ code.

One example of a pipeline application is the rendering pipeline. In a rendering pipeline, a set of vertices is provided as input to the system. The pipeline computes internal triangle representations from these vertices, clips them against the screen, rasterizes them into pixels, and finally shades the pixels to create an image. Traditionally, a rendering pipeline is written by hand and heavily tuned for the highest possible performance. As a result, real-world intricacies such as limited buffer sizes and low-level optimizations make it extremely hard to reshape the pipeline or change its schedule. We designed a new programming language for writing pipeline applications that compile to highly parallelized code whose performance rivals hand-tuned code.
Our language extends C syntax to express stages (computation nodes in our system), buffers used for synchronization between stages, and pipelines assembled by chaining stages and buffers. The compiler analyzes the stage code and the stage connections to determine the most efficient schedule for the given pipeline, and generates a highly parallelized pipeline implementation with performance comparable to hand-tuned code. Programs written in our pipeline specification language are compact and easy to maintain (our sample rendering pipeline specification contains only 143 lines of code and is compiled to 756 lines of C++ code). It is therefore easy to change the pipeline and experiment with different schedules using our language.
2 Programming Model Design

We start from a simple C-like language without pointers and extend it with the following constructs:

Stage. A stage is a kernel function that takes one or more items from an input stream and emits zero or more items to an output stream. The number of input items must be statically defined, while the number of output items can be fully dynamic at runtime. A stage should contain one or more emit statements that push an item to its output stream.

Pipe. A standard pipe is a place where stages take inputs from and write outputs to. Pipes guarantee the correct ordering of items. For example, if a pipe containing items [x_1, x_2, x_3] is passed to a stage f with f(x_i) = (y_1(x_i), ..., y_n(x_i)) that outputs to another pipe, the resulting pipe will contain the items [y_1(x_1), ..., y_n(x_1), y_1(x_2), ..., y_n(x_2), y_1(x_3), ..., y_n(x_3)]. There are two main variants of the standard pipe: sources and sinks. A source is the input to the whole pipeline, and a sink is the output. When linked to a host application, the user of a pipeline must provide data to each source and may optionally register callbacks to handle data streamed into a sink.

Grid. A grid is treated by the system as a special kind of pipe, except that it represents a fixed-dimension, fixed-size array of cells. A grid can have 1, 2, or 3 dimensions. Only items marked as Element1D, Element2D, or Element3D can be piped to grids of the corresponding dimension. Stages that emit items to a grid must specify the index of each item, and the item is written to the specified location in the grid. If the specified location already contains an item, a merge function is called to merge the incoming item and the existing item into a new item, which is then stored.

References. Items manipulated by stages can contain references to an item from a previous stage.
The existence of references affects the scheduling of a pipeline, because a referenced item stored in the output buffer of a previous stage must be kept alive when the stage that uses it is executed. Most stream processing systems do not allow such references, because they complicate scheduling and make it very hard to generate distributed implementations. However, references are crucial for generating high-performance code that runs on a single machine, and we chose to support them in order to reduce bandwidth and memory copying overhead.

Pipeline assembler. When all stages and internal pipes/grids have been defined, a pipeline assembler specifies how to assemble the pieces into a pipeline.

2.1 Example Pipeline

The following example code, written in the pipeline specification language, demonstrates several important constructs. The pipeline takes a stream of integers; for each integer x, the Square stage emits x*x and 2*x*x to the output stream, which is handled by external code that prints the results.
parallel stage Square() : int >> int
{
    emit in*in;
    emit in*in*2;
}

Source<int> input;
Sink<int> output;

pipe SimplePipe: input >> Square() >> output;

The pipeline compiler takes this code and produces a C++ class that represents the pipeline. The interface of the generated C++ code is shown below.

#ifndef OUT_H
#define OUT_H

#ifndef PIPE_SYSTEM_HEADER
#define PIPE_SYSTEM_HEADER
class SinkHandler
{
public:
    virtual void ProcessBuffer(void * data, int count) = 0;
};
#endif

class SimplePipe
{
public:
    virtual ~SimplePipe() {}
    virtual void Flush() = 0;
    virtual void SetInput_input(void * data, int count) = 0;
    virtual void SetSinkHandler_output(SinkHandler * handler) = 0;
};

SimplePipe * CreateSimplePipe();
void DestroySimplePipe(SimplePipe * obj);
#endif

To use the pipeline, the host application calls CreateSimplePipe() to create an instance of the pipeline. It then calls SimplePipe::SetInput_input() to feed the source with an input stream of integers. Optionally, it can call SimplePipe::SetSinkHandler_output() to register a callback for the sink. Finally, it calls SimplePipe::Flush() to run the pipeline.

3 Scheduling Pipelines

The compiler schedules pipelines onto current CPU architectures. The resulting schedule is expected to be fully parallelized and as efficient as hand-tuned code. We exploit only intra-stage parallelism, i.e., we never run two different stages at the same time, since properly handling the producer-consumer relationship would require fine-grained synchronization between threads, which is costly on current hardware.
3.1 Scheduling pipe connections

When two stages S1 and S2 are connected by a pipe, a fixed-size buffer B is allocated to store the outputs of S1. Before S1 can be invoked, we run a pre-pass to determine the maximum number of inputs S1 can consume such that its outputs fit in B. This is done by generating a pre-pass version of each stage kernel that returns only the number of output items for a given input item. The pre-pass kernel of S1 is called in parallel on one segment of the input stream at a time to collect the output rate of each element in the segment. If buffer B is large enough to hold the outputs of the entire segment, the next segment is fed to the pre-pass kernel, until B is full. After the number of consumable input items is determined, we pre-compute the output location of each input item by running a prefix sum over the output rates collected in the pre-pass. We then run the stage kernel on these items in parallel, generating the actual outputs and storing them in B, as shown in Figure 1.

Figure 1: Scheduling of a pipe connection: the pre-pass collects output rates for inputs X1..X5, a prefix sum assigns output locations, and the stage kernel S1 writes into an output buffer of size 10.

Because the buffer size is limited, a stage may not be able to consume all items in its input stream at once, so the input stream must be split into batches. In this case, the stage fetches the first batch of input, does its computation and fills its output buffer, calls the rest of the pipeline to drain the output buffer, and then fetches the next batch and repeats the process. Since the same strategy applies to all following stages, the generated code has the shape of nested loops, with each stage at a deeper nesting level.

3.2 Scheduling grid connections

Unlike pipes, grids have the ability to merge two items at the same location, and hence a grid is not constrained by the number of incoming items. It is wise to accumulate as many items as possible in a grid before continuing to the rest of the pipeline: by accumulating more items, we give items targeting the same location more chances to be merged, saving computation. In the extreme case, we can delay the execution of the following stages until all initial inputs are drained. However, this is only possible when the rest of the pipeline does not reference any items produced in previous stages. If such a reference exists, the referenced buffer cannot be flushed to accommodate the next batch of input, and we must schedule the rest of the pipeline in order to drain the referenced buffer. In short, we would like to delay the stages behind a grid as much as possible, but no later than the latest end of lifetime of any referenced item.

Figure 2 demonstrates an example of grid scheduling. In this pipeline, stage 3 accesses items in pipe 1 through references, so items in pipe 1 must not be flushed before stage 3 has executed and finished processing all the references. The resulting schedule is:

1. run stage 1 to process a batch of items from the source and fill the buffer of pipe 1;
2. run stage 2 to process a batch of items in pipe 1;
3. repeat step 2 until everything in pipe 1 is processed;
4. run stage 3 to consume the grid, possibly accessing items in pipe 1;
5. go to step 1 and process the next batch of items from the source.

Note that if stage 3 did not hold any references into pipe 1, we could keep running stage 1 and stage 2 until every item in the source had been processed, and then run stage 3 only once to finish the pipeline.

Figure 2: An example pipeline containing a grid and references (Source, Stage 1, Pipe 1, Stage 2, Grid, Stage 3, Sink).

The compiler relies on inter-stage dependency analysis to generate such a schedule. In the first step, the compiler analyzes each stage to identify dependent fields in input items.
In the second step, inter-stage information recording the sources of references is propagated through the pipeline to determine the actual dependences. We carefully designed the language semantics to make this analysis easy. First, references must be passed through the pipeline: a stage can produce a structure containing reference fields, reference fields must be initialized when the structure is constructed, and only the input item of a stage can be assigned to a reference field. For the example in Figure 2, the only way for stage 3 to gain access to pipe 1 is for stage 2 to pass a reference to its input on to stage 3. To do so, stage 2 must return a structure containing a reference field, assign its input to this field, and emit the structure to the grid. When stage 3 fetches the structure from the grid, it can use the reference field to access the item in pipe 1. Since programs can only be written this way, the compiler knows that when a structure containing a reference field is initialized and emitted, the stage is creating a reference to its input stream. By propagating this information through the pipeline, the stage that uses a reference knows where the reference comes from.

4 Optimizations

When scheduling pipe connections, the pre-pass can be greatly simplified if we know that a stage's output rate is invariant to its input, i.e., the stage produces the same number of output items regardless of the actual input values. In that case, running the stage's pre-pass code only once gives the exact output rate, and dividing the output buffer size by this rate immediately yields the maximum number of consumable inputs. This avoids executing the pre-pass kernel once per input item, reducing the complexity of a pre-pass from O(n) to O(1).

To determine whether a stage has an invariant output rate, we run a dead-code-elimination-style analysis on the stage code. Initially, each emit instruction is marked as alive, but the emitted variable is not marked. The rest of the analysis is exactly the same as standard dead code elimination, except that no instructions are actually removed. If any instruction that reads the input ends up marked alive, the analyzer reports the output rate as variant; otherwise it reports it as invariant. Because the language has arrays and structures (which compile into pointer instructions), we run an alias analysis beforehand to obtain correct results.

In our language, stages cannot interact with the external context other than by emitting items to their output streams. Therefore, all functions are side-effect free, and we can apply more aggressive dead code elimination.
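As a sketch of this analysis, the marking step can be phrased as a worklist pass over a toy IR. The Instr layout below is hypothetical, and data and control dependences are folded into one list for brevity; the real compiler runs the analysis on its own IL after alias analysis.

```cpp
#include <cstddef>
#include <vector>

// Toy three-address IR: each instruction lists the instructions it depends on
// (data and control dependences together, for brevity).
struct Instr {
    std::vector<int> deps;  // ids of instructions this one depends on
    bool isEmit;            // emit statements seed the analysis
    int payload;            // for emits: id of the instruction computing the
                            //   emitted value (deliberately NOT marked); -1 if none
    bool readsInput;        // does this instruction read the input stream?
};

// Returns true if the stage's output rate may depend on its input: mark every
// emit alive, propagate aliveness through dependences while skipping the
// emitted payload, then check whether any alive instruction reads the input.
bool outputRateIsVariant(const std::vector<Instr>& code) {
    std::vector<bool> alive(code.size(), false);
    std::vector<int> work;
    for (std::size_t i = 0; i < code.size(); i++)
        if (code[i].isEmit) { alive[i] = true; work.push_back((int)i); }
    while (!work.empty()) {
        int i = work.back(); work.pop_back();
        for (int d : code[i].deps) {
            if (d == code[i].payload) continue;  // payload value stays unmarked
            if (!alive[d]) { alive[d] = true; work.push_back(d); }
        }
    }
    for (std::size_t i = 0; i < code.size(); i++)
        if (alive[i] && code[i].readsInput) return true;
    return false;
}
```

For `emit in*in` with no input-dependent control flow, only the emit is marked and the rate is reported invariant; for an emit guarded by `if (in > 0)`, the comparison that reads the input becomes alive and the rate is reported variant.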
We found that C++ compilers fail to recognize this side-effect freedom, so our compiler performs dead code elimination itself before generating C++ code. This is particularly useful when generating stage pre-pass code, because in many cases the return values of external functions do not affect the output rate.

5 Implementation

Our implementation of the pipeline compiler contains approximately 11,000 lines of C++ code and does not rely on any third-party projects or libraries. The implementation includes comprehensive semantic checking, a self-designed IL (with a C++ code emitter), control flow graph utilities, pointer alias analysis, and a dead code elimination optimizer. We also developed a control flow graph visualizer for debugging purposes, as shown in Figure 3.
Figure 3: Our control flow graph visualizer tool.

6 Evaluation

To evaluate the performance of compiled pipelines, we implemented a simple rendering pipeline in our pipeline specification language; its source code is included in the appendix. Figure 4 shows the renderer rendering the Sibenik Cathedral scene, which contains 75,284 triangles.

Our compiler offers an option to generate performance instrumentation code for benchmarking. This data can be used to compute the scheduling overhead. We measured the overheads of all four stages in the rendering pipeline; the performance and scheduling overhead under different optimization settings are shown in Figure 5. We ran the experiments on a PC with a quad-core Intel Core CPU and 16GB of memory.

The renderer generated by our compiler achieves performance matching a hand-tuned renderer. Our previous, heavily optimized renderer finishes the frame shown in Figure 4 at 35fps, while the generated renderer achieves 29fps. Note that our hand-tuned renderer exploits SIMD instructions for further parallelism, while the simple rendering pipeline implemented in the pipeline specification language does not leverage SIMD. We expect that a proper SIMD implementation in our language would achieve equal or better performance than our previous renderer.

Figure 4: The compiled renderer running in action.

7 Surprises

We initially thought that optimizations such as dead code elimination would be handled well by the C++ compiler, so we did not implement them in our compiler. However, when we generated C++ code and inspected the compiled assembly, we found that the C++ compiler failed to recognize that the called external functions are side-effect free and did not apply the expected optimizations, and the resulting performance was unsatisfactory. We then decided to implement dead code elimination ourselves, which added another 2,000 lines of code to our system, since it does not depend on any existing compiler framework such as LLVM. Fortunately, we still managed to finish it in time.

We had not planned to implement the invariant-output-rate analysis and the simplified scheduling for this special type of stage until we compiled our first renderer and discovered that stages with invariant output rates are actually common in many graphics applications.
Figure 5: Runtime performance (in cycles) of the four stages of the compiled rendering pipeline (stage 1: shade vertex; stage 2: assemble triangle; stage 3: rasterize; stage 4: shade fragment) under different optimization settings. Left: unoptimized; middle: dead-code elimination on stage pre-pass code; right: simplified pre-pass for stages with invariant output rate, plus dead-code elimination on stage pre-pass code. The dark region of each bar is the time spent in core computation; the full bar is the total stage time.

8 Conclusion

In this project, we proposed a new programming language for pipeline applications. Our compiler analyzes the pipeline and generates a high-quality implementation. With our language, it is much easier to experiment with different variations of a pipeline and study their performance behavior. For example, the programmer can switch to another schedule simply by changing the type of connection between stages, or by changing how data is passed through the pipeline (by reference or by value); the compiler does the rest to generate the best implementation. For future work, we would like to extend the system to support more types of connections, and to study how to schedule pipelines onto next-generation heterogeneous architectures, where the CPU and GPU share a chip and its caches. In that setting, the ability to schedule different stages onto different computation cores at the same time becomes critical.

9 Credit distribution

We believe that we have done more than 125% of our initially expected work, and the total credit should be distributed equally among the authors.
References

[1] Qiming Hou, Kun Zhou, and Baining Guo. BSGP: bulk-synchronous GPU programming. ACM Trans. Graph., 27(3), 2008.

[2] Storm: distributed realtime computation system.

[3] Jeremy Sugerman, Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, and Pat Hanrahan. GRAMPS: A programming model for graphics pipelines. ACM Trans. Graph., 28(1):4:1-4:11, February 2009.

[4] William Thies, Michal Karczmarek, and Saman P. Amarasinghe. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction (CC '02), London, UK, 2002. Springer-Verlag.

[5] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08), pages 1-14, Berkeley, CA, USA, 2008. USENIX Association.
Appendix: source code of a simple rendering pipeline

#include "RendererUtilities.h"

Source<int> Indices;
in float[] Vertices;
in int VertexAttributeSize;
in int VertexShaderOutputSize;
in int RenderState; // pointer to render state
in int ScreenWidth;
in int ScreenHeight;

struct Triangle
{
    float a0, b0, c0;
    float a1, b1, c1;
    float Z0, Z1, Z2;
    float divisor;
    int IsClipped;
    float[3] tricoord0, tricoord1, tricoord2; // clip vertex coordinates
    int MinX, MaxX;
    int MinY, MaxY;
    ref float[] vertex; // reference to vertices
};

struct Fragment : Element2D
{
    float alpha, beta, gamma, z;
    ref Triangle triangle;
};

struct Color : Element2D
{
    float R, G, B, A, Z;
};

extern void RunVertexShader(int renderstate, float[] result, float[] vertex);
extern int ClipTriangle(int renderstate, float[] vertices, Triangle[] triangles);
extern int ClipTriangle_Count(int renderstate, float[] vertices);
extern Color RunFragmentShader(int renderstate, Fragment f);

parallel stage ShadeVertex() : int >> float
{
    float[256] buffer;
    RunVertexShader(RenderState, buffer, Vertices + in * VertexAttributeSize);
    for (int i = 0; i < VertexShaderOutputSize; i++)
        emit buffer[i];
}

parallel stage AssembleTriangle() : float[VertexShaderOutputSize*3] >> Triangle
{
    claim ClipTriangle_Count(RenderState, in);
    Triangle[7] triangles;
    int i;
    for (i = 0; i < 7; i++)
        triangles[i] = Triangle{vertex: in};
    int numTriangles = ClipTriangle(RenderState, in, triangles);
    for (i = 0; i < numTriangles; i++)
        emit triangles[i];
}

parallel stage Rasterize() : Triangle >> Fragment
{
    float invW1 = 1.0f/in.vertex[3];
    float invW2 = 1.0f/in.vertex[3+VertexShaderOutputSize];
    float invW3 = 1.0f/in.vertex[3+VertexShaderOutputSize*2];
    float divisor = 1.0f/in.divisor;
    for (int i = in.MinX; i <= in.MaxX; i++)
        for (int j = in.MinY; j <= in.MaxY; j++)
        {
            Fragment f = Fragment{triangle: in};
            f.index0 = i;
            f.index1 = j;
            float x = i + 0.5f;
            float y = j + 0.5f;
            f.beta = in.a0*x + in.b0*y + in.c0;
            if (f.beta < 0.0f) continue;
            f.gamma = in.a1*x + in.b1*y + in.c1;
            if (f.gamma < 0.0f) continue;
            f.alpha = in.divisor - f.beta - f.gamma;
            if (f.alpha < 0.0f) continue;
            f.beta *= divisor;
            f.gamma *= divisor;
            f.alpha *= divisor;
            f.alpha = invW1*f.alpha;
            f.beta = invW2*f.beta;
            f.gamma = invW3*f.gamma;
            float interInvW = 1.0f/(f.alpha+f.beta+f.gamma);
            f.alpha *= interInvW;
            f.beta *= interInvW;
            f.gamma *= interInvW;
            f.z = in.Z0*f.alpha + in.Z1*f.beta + in.Z2*f.gamma;
            emit f;
        }
}

parallel stage ShadeFragment() : Fragment >> Color
{
    Color rs = RunFragmentShader(RenderState, in);
    rs.index0 = in.index0;
    rs.index1 = in.index1;
    rs.Z = in.z;
    emit rs;
}

void zmerge(ref Fragment f, Fragment newFrag)
{
    if (newFrag.z < f.z)
        f = newFrag;
}

Pipe<float> shadedVertices;
Pipe<Triangle> clippedTriangles;
Grid<Fragment> gbuffer(ScreenWidth, ScreenHeight, zmerge);
Sink<Color> image;

pipe RenderPipe:
    Indices >> ShadeVertex() >> shadedVertices >> AssembleTriangle() >> clippedTriangles
            >> Rasterize() >> gbuffer >> ShadeFragment() >> image;
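The gbuffer declared above merges colliding fragments with zmerge. Its depth-test behavior can be sketched in plain C++ (a minimal sketch with a simplified, hypothetical Fragment layout; as described in Section 2, an emit to an occupied cell invokes the merge function instead of overwriting):

```cpp
#include <vector>

// Hypothetical, simplified fragment: a depth plus a color payload.
struct Fragment {
    float z;
    int color;
};

// The merge function, mirroring zmerge in the appendix: keep the nearer fragment.
void zmerge(Fragment& cell, const Fragment& incoming) {
    if (incoming.z < cell.z) cell = incoming;
}

// A 2D grid of cells: an emit to an empty cell stores the item; an emit to an
// occupied cell merges the incoming item with the stored one.
class Grid2D {
    int w_, h_;
    std::vector<Fragment> cells_;
    std::vector<bool> occupied_;
public:
    Grid2D(int w, int h) : w_(w), h_(h), cells_(w * h), occupied_(w * h, false) {}
    void Emit(int x, int y, const Fragment& f) {
        int i = y * w_ + x;
        if (!occupied_[i]) { cells_[i] = f; occupied_[i] = true; }
        else zmerge(cells_[i], f);
    }
    const Fragment& At(int x, int y) const { return cells_[y * w_ + x]; }
};
```

Emitting fragments with depths 0.5, 0.2, and 0.9 to the same cell leaves only the fragment with depth 0.2, which is exactly the z-buffer behavior the rendering pipeline relies on.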
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationPowerVR Series5. Architecture Guide for Developers
Public Imagination Technologies PowerVR Series5 Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.
More informationLecture 13: Reyes Architecture and Implementation. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)
Lecture 13: Reyes Architecture and Implementation Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) A gallery of images rendered using Reyes Image credit: Lucasfilm (Adventures
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationGPU Architecture. Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)
GPU Architecture Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009
More informationCSE 167: Lecture #5: Rasterization. Jürgen P. Schulze, Ph.D. University of California, San Diego Fall Quarter 2012
CSE 167: Introduction to Computer Graphics Lecture #5: Rasterization Jürgen P. Schulze, Ph.D. University of California, San Diego Fall Quarter 2012 Announcements Homework project #2 due this Friday, October
More informationOptimizing DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most
More informationKampala August, Agner Fog
Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler
More informationGRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
More informationASYNCHRONOUS SHADERS WHITE PAPER 0
ASYNCHRONOUS SHADERS WHITE PAPER 0 INTRODUCTION GPU technology is constantly evolving to deliver more performance with lower cost and lower power consumption. Transistor scaling and Moore s Law have helped
More informationGeForce4. John Montrym Henry Moreton
GeForce4 John Montrym Henry Moreton 1 Architectural Drivers Programmability Parallelism Memory bandwidth 2 Recent History: GeForce 1&2 First integrated geometry engine & 4 pixels/clk Fixed-function transform,
More informationGRAMPS Beyond Rendering. Jeremy Sugerman 11 December 2009 PPL Retreat
GRAMPS Beyond Rendering Jeremy Sugerman 11 December 2009 PPL Retreat The PPL Vision: GRAMPS Applications Scientific Engineering Virtual Worlds Personal Robotics Data informatics Domain Specific Languages
More informationStreamIt on Fleet. Amir Kamil Computer Science Division, University of California, Berkeley UCB-AK06.
StreamIt on Fleet Amir Kamil Computer Science Division, University of California, Berkeley kamil@cs.berkeley.edu UCB-AK06 July 16, 2008 1 Introduction StreamIt [1] is a high-level programming language
More informationReal-Time Graphics Architecture. Kurt Akeley Pat Hanrahan. Ray Tracing.
Real-Time Graphics Architecture Kurt Akeley Pat Hanrahan http://www.graphics.stanford.edu/courses/cs448a-01-fall Ray Tracing with Tim Purcell 1 Topics Why ray tracing? Interactive ray tracing on multicomputers
More informationNext-Generation Graphics on Larrabee. Tim Foley Intel Corp
Next-Generation Graphics on Larrabee Tim Foley Intel Corp Motivation The killer app for GPGPU is graphics We ve seen Abstract models for parallel programming How those models map efficiently to Larrabee
More informationLecture 25: Board Notes: Threads and GPUs
Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel
More informationEXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System
EXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System By Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick
More informationSoftware Occlusion Culling
Software Occlusion Culling Abstract This article details an algorithm and associated sample code for software occlusion culling which is available for download. The technique divides scene objects into
More informationComputer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014
18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationClearSpeed Visual Profiler
ClearSpeed Visual Profiler Copyright 2007 ClearSpeed Technology plc. All rights reserved. 12 November 2007 www.clearspeed.com 1 Profiling Application Code Why use a profiler? Program analysis tools are
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationEfficient and Scalable Shading for Many Lights
Efficient and Scalable Shading for Many Lights 1. GPU Overview 2. Shading recap 3. Forward Shading 4. Deferred Shading 5. Tiled Deferred Shading 6. And more! First GPU Shaders Unified Shaders CUDA OpenCL
More informationParallel Programming for Graphics
Beyond Programmable Shading Course ACM SIGGRAPH 2010 Parallel Programming for Graphics Aaron Lefohn Advanced Rendering Technology (ART) Intel What s In This Talk? Overview of parallel programming models
More informationOptimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager
Optimizing for DirectX Graphics Richard Huddy European Developer Relations Manager Also on today from ATI... Start & End Time: 12:00pm 1:00pm Title: Precomputed Radiance Transfer and Spherical Harmonic
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationRSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog
RSX Best Practices Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog RSX Best Practices About libgcm Using the SPUs with the RSX Brief overview of GCM Replay December 7 th, 2004
More informationScheduling Image Processing Pipelines
Lecture 14: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationProgrammable Graphics Hardware
Programmable Graphics Hardware Outline 2/ 49 A brief Introduction into Programmable Graphics Hardware Hardware Graphics Pipeline Shading Languages Tools GPGPU Resources Hardware Graphics Pipeline 3/ 49
More informationChapter 3: Operating-System Structures
Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines System Design and Implementation System Generation 3.1
More informationA Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function
A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao
More informationFinal Project Writeup
Jitu Das Bertha Lam 15-418 Final Project Writeup Summary We built a framework that facilitates running computations across multiple GPUs and displaying results in a web browser. We then created three demos
More informationA Stream Compiler for Communication-Exposed Architectures
A Stream Compiler for Communication-Exposed Architectures Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali Meli, Andrew Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationFahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou. University of Maryland Baltimore County
Accelerating a climate physics model with OpenCL Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou University of Maryland Baltimore County Introduction The demand to increase forecast predictability
More informationReal-Time Graphics Architecture
Real-Time Graphics Architecture Lecture 4: Parallelism and Communication Kurt Akeley Pat Hanrahan http://graphics.stanford.edu/cs448-07-spring/ Topics 1. Frame buffers 2. Types of parallelism 3. Communication
More informationAutomatic Tuning Matrix Multiplication Performance on Graphics Hardware
Automatic Tuning Matrix Multiplication Performance on Graphics Hardware Changhao Jiang (cjiang@cs.uiuc.edu) Marc Snir (snir@cs.uiuc.edu) University of Illinois Urbana Champaign GPU becomes more powerful
More informationCompiling for GPUs. Adarsh Yoga Madhav Ramesh
Compiling for GPUs Adarsh Yoga Madhav Ramesh Agenda Introduction to GPUs Compute Unified Device Architecture (CUDA) Control Structure Optimization Technique for GPGPU Compiler Framework for Automatic Translation
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationCache Aware Optimization of Stream Programs
Cache Aware Optimization of Stream Programs Janis Sermulins, William Thies, Rodric Rabbah and Saman Amarasinghe LCTES Chicago, June 2005 Streaming Computing Is Everywhere! Prevalent computing domain with
More informationA Streaming Virtual Machine for GPUs
A Streaming Virtual Machine for GPUs Kenneth Mackenzie (Reservoir L, Inc) Dan Campbell (Georgia Tech Research Institute) Peter Szilagyi (Reservoir L, Inc) Copyright 2005 Government Purpose Rights, All
More informationGPU Computation Strategies & Tricks. Ian Buck NVIDIA
GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit
More informationShort Notes of CS201
#includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system
More informationCompute-mode GPU Programming Interfaces
Lecture 8: Compute-mode GPU Programming Interfaces Visual Computing Systems What is a programming model? Programming models impose structure A programming model provides a set of primitives/abstractions
More informationHardware-driven visibility culling
Hardware-driven visibility culling I. Introduction 20073114 김정현 The goal of the 3D graphics is to generate a realistic and accurate 3D image. To achieve this, it needs to process not only large amount
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationCS201 - Introduction to Programming Glossary By
CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with
More informationLecture 23: Domain-Specific Parallel Programming
Lecture 23: Domain-Specific Parallel Programming CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Acknowledgments: Pat Hanrahan, Hassan Chafi Announcements List of class final projects
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationShaders. Slide credit to Prof. Zwicker
Shaders Slide credit to Prof. Zwicker 2 Today Shader programming 3 Complete model Blinn model with several light sources i diffuse specular ambient How is this implemented on the graphics processor (GPU)?
More informationParallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
More informationParallel Exact Inference on the Cell Broadband Engine Processor
Parallel Exact Inference on the Cell Broadband Engine Processor Yinglong Xia and Viktor K. Prasanna {yinglonx, prasanna}@usc.edu University of Southern California http://ceng.usc.edu/~prasanna/ SC 08 Overview
More informationX. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1
X. GPU Programming 320491: Advanced Graphics - Chapter X 1 X.1 GPU Architecture 320491: Advanced Graphics - Chapter X 2 GPU Graphics Processing Unit Parallelized SIMD Architecture 112 processing cores
More informationLecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013
Lecture 9: Deferred Shading Visual Computing Systems The course so far The real-time graphics pipeline abstraction Principle graphics abstractions Algorithms and modern high performance implementations
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationBitonic Sorting Intel OpenCL SDK Sample Documentation
Intel OpenCL SDK Sample Documentation Document Number: 325262-002US Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming Pablo Brubeck Department of Physics Tecnologico de Monterrey October 14, 2016 Student Chapter Tecnológico de Monterrey Tecnológico de Monterrey Student Chapter Outline
More informationGraphics Processing Unit Architecture (GPU Arch)
Graphics Processing Unit Architecture (GPU Arch) With a focus on NVIDIA GeForce 6800 GPU 1 What is a GPU From Wikipedia : A specialized processor efficient at manipulating and displaying computer graphics
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More information