GRAPHIC RENDERING APPLICATION PROFILING ON A SHARED MEMORY MPSOC ARCHITECTURE. Matthieu Texier, Raphaël David, Karim Ben Chehida



Matthieu Texier, Raphaël David, Karim Ben Chehida
CEA, LIST, Embedded Computing Lab, PC 94, F-91191 Gif-sur-Yvette Cedex
firstname.name@cea.fr

Olivier Sentieys
University of Rennes 1, IRISA/INRIA, 6, rue de Krampont, BP Lannion
sentieys@irisa.fr

ABSTRACT

This paper describes the implementation of a graphic rendering pipeline on an MPSoC architecture devoted to the dynamic management of static task graphs. It exhibits the highly non-stationary workloads of this application domain and provides initial feedback motivating the design of innovative embedded architectures that have to face heterogeneous computation domains such as graphics and telecommunications. In particular, these experiments stress the need for data-dependent resource allocation strategies.

Index Terms: Multi-core, simulator, graphic rendering, load balancing.

1. INTRODUCTION

Embedded devices have to handle an increasing number of applications. Each has different computing requirements and is dedicated to a specific domain. To handle this variety efficiently, embedded systems usually use different hardware accelerators. Current embedded devices are based on systems on chip made of several cores [1], such as general-purpose processors, multimedia processors (video and audio encoding and decoding) and other IPs that can be dedicated to imaging or telecoms, for example. More and more mobile systems also embed a Graphic Processing Unit (GPU). Well known in desktop computers, GPUs are increasingly used in embedded devices such as mobile phones and tablets for applications such as games and user interfaces. The latest generation of embedded GPUs is able to render millions of vertices per second and is becoming programmable. As in desktop computers, GPUs are becoming General Purpose Graphic Processing Units (GPGPUs), and embedded GPUs are expected to support new application classes such as multimedia.
In this paper we look at a different approach, which consists in extending MPSoCs to support graphic applications. We thus study the ability of standard execution models devoted to multi-domain multi-core architectures to sustain performance for graphic applications. The next section describes the graphic pipeline and Section 3 presents a brief state of the art of graphic architectures. Section 4 describes several implementations of the graphic pipeline on a multi-core architecture modeled in an approximately timed TLM (Transaction-Level Model) simulator, and the results are described in Section 5. The last section presents the main conclusions of the performance analysis and the future work needed to design a multi-purpose embedded parallel architecture.

2. THE GRAPHIC RENDERING PIPELINE

The main job of a graphic processing unit is to render a three-dimensional scene onto a two-dimensional screen [2]. The input data is a set of points (vertices) defined in a three-dimensional space. These points define triangles, and groups of triangles form shapes such as spheres, cubes, etc. A complex mix of triangles can draw all the shapes shown in games. This rendering process is done in three main stages: the geometry, the setup and the fragment stages. Each stage handles a different kind of data.

Figure 1: The graphic rendering pipeline stages.

As shown in Figure 1, the geometry stage modifies the incoming data (a set of vertices) according to the transformations defined by the user (translation, rotation, etc.). It also computes the coordinates related to the different points of view (origins) required for the next computations and calculates the impact of the lights on each incoming vertex (shading). Vertices are finally grouped by three to build triangles. The setup stage verifies whether each vertex is visible from the camera (clipping). The triangles that show their


back-faces to the camera are also removed (culling). It calculates the texture mapping on each triangle and cuts the triangles into tiles. This implies interpolating the vertex parameters (color, position, etc.) for each tile corner and checking whether the tile lies entirely inside the triangle.

The fragment stage works on tiles. It is in charge of computing the final color of each pixel constituting the tile according to the texture, the light and the material colors. It can blend the incoming pixel color with the current value in the framebuffer according to user-defined parameters (adding the colors, the color components, etc.), which makes effects such as transparency possible. The depth of the incoming pixel and that of the pixel in the framebuffer can also be compared (depth test) so that the selected pixel replaces the old one.

Each stage is configurable and each transformation can be modified, enabled or disabled by the user through a set of APIs defined, for example, in OpenGL ES [3]. According to the scene and the expected transformations, the user can modify the scene through small programs called shaders, by creating or destroying vertices or by creating effects such as mechanical simulation, elasticity, etc. This leads to very large variations in the balance of computing requirements between the stages and in the cost of the complete rendering. In the next section, we describe how these stages are handled by current state-of-the-art graphic architectures.

3. EMBEDDED GRAPHIC RENDERING ARCHITECTURES

Due to the high computation demand of graphic applications [4], a wide diversity of embedded GPUs has been designed, from ASIPs (Application Specific Instruction Set Processors) to fully programmable multi-core architectures. For example, the Mobile Unified Shader [5] is an ASIP dedicated to graphic rendering. The programmable core is a SIMD (Single Instruction Multiple Data) processor that computes vertices, shaders, lighting and texturing.
A set of configurable accelerators computes the other pipeline stages. The ARM Mali [6] is a GPU scalable from one to four programmable cores. There are in fact two different kinds of cores, the vertex core and the fragment core, each specific to one kind of shader. The vertex shader typically supports geometry transformations and lighting through dedicated hardware SIMD instructions and dedicated lighting acceleration. The fragment shader focuses on pixel generation by modifying the color components of each pixel; it also applies the texture colors. This is done using specific instructions and dedicated hardware, for texture loading for example. The PowerVR [7] architecture is a multi-core architecture with multithreaded computing cores. The number of cores is variable, and each core is able to compute both vertices and fragments. This is called a unified architecture because the processing elements can be used to compute vertex and fragment shaders alike.

The latest embedded GPU generation has become a highly optimized multi-core and multi-threaded architecture. Like desktop GPUs, embedded GPUs are beginning to support general-purpose computation. The latest generation of embedded GPUs inaugurates the support of GPGPU programming languages like OpenCL [8], [9].

There is a wide diversity of embedded GPUs, which are becoming more and more programmable. This programmability can be used to accelerate other applications [10]. Their execution models are however very constrained, for example in terms of the data's position in the memory hierarchy, because of their initial purpose, which is graphic rendering. They also use dedicated computing elements that are optimized for graphic rendering. This makes these architectures under-optimized for accelerating all kinds of applications whose data sets are organized differently (e.g. augmented reality, content understanding, etc.), especially when the data size can vary dynamically [11].
In the next section, the implementation results of the graphic pipeline on an MPSoC architecture, initially designed for complex image and vision applications, are analyzed in order to extend its execution model to support graphic rendering efficiently.

4. RENDERING PIPELINE PARALLELIZATION

This section presents the software implementation of a rendering pipeline on a multi-core architecture. First, the targeted architecture and its associated simulator are presented. Then a first implementation of the pipeline in a dataflow mode is shown, followed by a second implementation that uses more parallelism.

4.1 The SCMP architecture

Figure 2: SCMP architecture.

The SCMP architecture [12] depicted in Figure 2 is a compute-intensive resource that is seen by a host processor as a coprocessor. It uses a central scheduler that can be based on a dedicated hardware IP or on a programmable processor with hardware acceleration through coprocessors.


It dynamically determines the list of eligible tasks to be executed, based on control and data dependencies. The Memory Configuration and Management Unit (MCMU) allocates the memory for the tasks and loads the instruction code. It also manages memory allocations and the exclusive sharing of a physically distributed and logically shared memory space. SCMP uses heterogeneous computing resources like SPARC or MIPS processors.

The scheduler uses a Control Data Flow Graph (CDFG), stored in a local memory, to describe the control and data dependencies between the tasks of an application (see Figure 3 for an example). In this example, task 0 initiates eight DMA tasks (tasks 1 to 8) that are followed by computing tasks (tasks 9 to 24). The work is distributed in order to obtain the maximum possible acceleration. Tasks 25, 26 and 28 are merge tasks. Finally, task 33 is a DMA task that sends back the results. The SCMP architecture targets applications with dynamic behavior, like vision systems or content understanding.

Figure 3: The Control Data Flow Graph of a labeling application.

4.2 Dataflow Implementation

Starting from a monolithic application code executed on a single processor, we profiled the code in order to partition the application. The profiling results and the pipeline organization led us to a set of five tasks communicating through FIFOs (First-In, First-Out) in a dataflow mode that complies with the SCMP programming model (Figure 4).

Figure 4: Dataflow implementation of the rendering pipeline.

The first stage (A) can be seen as the user interface. It intercepts the OpenGL API calls that draw the scene. All the OpenGL commands are encoded and sent with the vertices (v0, v1) to the second stage of the pipeline through a FIFO.

The second stage (G) is the geometry stage. It receives the commands from the application stage and executes the commands related to geometry; the other commands are forwarded to the next stage.
First, the geometry stage computes the coordinates related to the different origins. Then it performs shading, culling and clipping. Finally, the triangles (t0, t1) are sent to the rasterizer stage.

The rasterizer (or setup) stage (R) is in charge of preparing the rendering of the triangles. Its main job is to divide the triangles into pixel blocks (tiles). This stage also computes the texture mapping on the tiles. Then the tiles (b0, b1) are sent to the fragment stage.

The fragment stage (F) computes the final color of each pixel by interpolating the corner parameters and applying the texture. For tiles crossing the triangle borders, it checks whether each pixel is inside the triangle before rendering it. Finally, it sends the computed tiles to the display stage.

The display stage (D) is in charge of the blend and depth test steps. It is also the only stage that has access to the framebuffer to write the pixel color and depth.

The rendering pipeline has thus been implemented as a five-stage pipeline. Each stage has its own parameters, which can be modified dynamically. The implementation has been done in the SESAM simulator [13] using FIFOs with a size of thirty-two data items between stages, one MIPS (Microprocessor without Interlocked Pipeline Stages) processor for each stage and one DMA (Direct Memory Access) unit to load textures.

4.3 The Parallel and Dataflow Implementation

In order to leverage all the processing resources in SCMP, the application has been parallelized by duplicating either the entire pipeline or only parts of it, for example the geometry or the fragment stages. The application therefore uses two types of parallelism: thread parallelism and data parallelism.

To parallelize a stage, the data have to be dispatched fairly between the parallelized stages. This also requires merging the data after the computation. A key point is that the data need to be kept in order. Moreover, a stage can generate a variable amount of data for each processed data.
The input data are written into the FIFOs in a round-robin manner. Each data is followed by a tag that specifies which FIFO holds the next data (for example the second FIFO), as in a linked list. This allows the merge stage to reorder the generated data by following the tags and fetching each data from the appropriate FIFO.
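The tag-based dispatch and merge described above can be sketched as follows. This is a minimal sequential sketch, not the SCMP implementation: the function names (`dispatch`, `process`, `merge`), the `("data", ...)` / `("tag", ...)` encoding and the use of Python deques in place of hardware FIFOs are all assumptions made for illustration.

```python
from collections import deque


def dispatch(items, n_fifos):
    """Write items round-robin; after each item, append a tag naming
    the FIFO that holds the next item (hypothetical encoding)."""
    fifos = [deque() for _ in range(n_fifos)]
    for i, item in enumerate(items):
        current = i % n_fifos
        following = (i + 1) % n_fifos
        fifos[current].append(("data", item))
        fifos[current].append(("tag", following))
    return fifos


def process(fifo, work):
    """One parallel stage: each input may yield zero or more outputs.
    Tags are forwarded unchanged so the order information survives."""
    out = deque()
    for kind, value in fifo:
        if kind == "data":
            out.extend(("data", result) for result in work(value))
        else:
            out.append((kind, value))
    return out


def merge(fifos, start=0):
    """Merge stage: read the current FIFO until a tag is met, then
    switch to the FIFO named by that tag."""
    current = start
    merged = []
    while fifos[current]:
        kind, value = fifos[current].popleft()
        if kind == "data":
            merged.append(value)
        else:
            current = value
    return merged


def split_into_tiles(triangle):
    """Stand-in for a rasterizer: a variable number of tiles per triangle."""
    return [f"t{triangle}.b{k}" for k in range(triangle % 3)]


if __name__ == "__main__":
    triangles = [1, 2, 3, 4, 5]
    stage_outputs = [process(f, split_into_tiles) for f in dispatch(triangles, 2)]
    print(merge(stage_outputs))
```

Because the tags travel with the data, the merge stage needs no global sequence numbers and still recovers the original order when a stage emits nothing for some inputs. In a concurrent setting the merge would block on an empty FIFO instead of stopping, which this sequential sketch does not model.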


Different pipelines with distinct parallelism levels have been implemented. Figure 5 shows a pipeline with two parallel fragment tasks: the rasterizer task dispatches the data to the fragment tasks as described before. The display task merges the data by following the information provided by the tags inserted by the rasterizer task.

Figure 5: Implementation of the rendering pipeline with two parallel fragment stages.

Figure 6 shows a pipeline with two parallel rasterizer tasks and two parallel fragment tasks. The geometry task dispatches the triangles to the two rasterizers. The rasterizer tasks generate a variable amount of data that are sent to the fragment tasks. They also broadcast the tags in order to keep the information about the order of the data for the display task.

Figure 6: Implementation of the rendering pipeline with two parallel rasterizer and fragment stages.

The next section presents the results of the implementation of the rendering pipeline on the SESAM simulator.

Figure 7: Time to compute each data for the Geometry and the Display stages (for the first hundred data of the cube scene).

5. RESULTS

The SCMP architecture is modeled within the SESAM [13] framework to allow a fast exploration of the execution model with accurate results. The SESAM framework allows the exploration of asymmetric multi-core architectures at the TLM (Transaction-Level Modeling) level. The components can be parameterized to define the memory map, the number of processors, the cache parameters, the number of memories and their sizes, etc. SESAM produces a set of simulation statistics: cache miss rates, memory allocation history, processor occupation rate, number of preemptions, network bandwidth, etc. The implemented applications have been executed on the simulator in order to measure the impact of the input scenes on the pipeline complexity. In the first subsection these scenes are described.
The second subsection details the execution profiling results, which are analyzed in the next two subsections.

Figure 8: Time to compute each data for the rasterizer and fragment stages (for the first hundred data of the cube scene).

5.1 Scenes Descriptions Used as Scenarios

Three different scenes have been chosen as execution scenarios. The first one is a rotating cube made of twelve textured triangles. The cube rotates by 45 degrees around the x and y axes between frames. The second one is a sphere made of one hundred triangles that translates vertically from the bottom to the top. In the first and last frames, the sphere is completely out of the camera's point of view. Between these extremes, the sphere is partially or completely visible. Finally, the third scenario is a set of appearing entities, each made of five triangles, in increasing numbers and with random positions and sizes. This example also uses the blend stage in order to allow transparency between entities.

5.2 Profiling

To identify the slowest part of the application, the time to compute each scene was measured for each stage and for each data. To do so, the processing of each data has been annotated so that it gives precise information about the

varying computing requirements according to the input data set (the scene). Figure 7 and Figure 8 show the time needed to compute the first hundred data for the cube scene. The geometry and display stages are less compute intensive than the other stages. They need from 5000 to cycles, while the rasterizer and fragment stages need from a few to more than cycles.

Moreover, the amount of data between the stages varies depending on the scene. Figure 9 shows the differences in the distribution of the data between the stages for the different scenes. For the cube scene, the rasterizer stage computes six triangles and the fragment stage ninety-five blocks. For the sphere, the rasterizer stage computes one hundred and fourteen triangles and the fragment stage three hundred and thirteen blocks. These profiling results show that the time required to compute each data and the amount of data vary a lot between the stages and also depend on the rendered scene. The communication time is not displayed here.

Figure 9: Amount of data per stage for the different scenes.

Figure 10: Cycles required to render frames for each stage on the cube scene.

Figure 11: Cycles required to render frames for each stage on the sphere scene.

5.3 Dataflow Mode

The scenes have also been benchmarked at the frame level in order to show the computation disparities between the different frames of the same scene. Figure 10 shows the number of cycles per stage required to render each frame, without taking the communication time into account. It shows that the fragment stage is the most compute intensive. This stage takes from four million cycles to more than nine million cycles per frame, while the other stages require up to one and a half million cycles. This is due to the fact that the cube is composed of twelve triangles and each triangle covers a large number of pixels.
The rasterizer stage therefore cuts each triangle into up to fifteen tiles, which produces a large number of tiles that have to be computed by the fragment stage. Furthermore, each triangle is textured, which also adds a lot of work to the fragment stage. The disparity in the number of cycles required to render each frame (see the difference between the rendering times of frames 7, 12 and 13) is due to the variation in the number of triangles visible from the camera's point of view because of the cube rotation.

Figure 11 shows the same results for the sphere scene. The number of cycles required for the geometry stage varies slightly, from seven hundred thousand to one million cycles. This is due to the variation in the number of triangles visible to the user and to the computation needed to render the triangles that cross the borders of the camera's view. The load of the rasterizer and fragment stages also varies according to the number of triangles visible to the user. The numbers of cycles required for these two stages are close because the triangles are very small and there are approximately one or two blocks per triangle. This means that these two stages have almost the same amount of data to compute. The number of cycles for the display stage varies according to the amount of data sent by the fragment stage.

Figure 12 shows the results for the entities scene. As the number of entities rendered for each frame increases, we can

see that the number of cycles required increases linearly with the number of entities on the screen. The complexity of the fragment stage depends on the number of entities, and it is the most compute-intensive stage. The display stage needs up to three million cycles for one frame because the entities scene needs to activate the blending step to render transparency.

Figure 12: Cycles required to render frames for each stage on the entities scene.

Figure 13: Cycles required to render frames for each stage on the cube scene using six fragment tasks.

Figure 14: Cycles required to render frames for each stage on the sphere scene using two rasterizer and four fragment tasks.

Figure 15: Cycles required to render frames for each stage on the entities scene with four fragment stages.

The simulation results show that the load of each stage depends on the rendered scene and that, even within the same scene, the load requirements can evolve from one frame to the next. For example, up to 90 % of the time needed to render the cube scene is spent in the fragment stage. In the case of the sphere scene, the fragment stage uses almost 40 % of the total time. However, the variations are more predictable between the different frames of a given scene than between the different data.

To balance the computation efficiently along the different stages of the pipeline, the slowest stages have to be accelerated. This can be done by adding more computing tasks for one stage. For the cube scene, the fragment stage is clearly the slowest. It therefore needs to be accelerated by using two or more tasks for the fragment stage, so that the work is dispatched across different computing resources. The optimal number of tasks for each pipeline stage depends on the amount of work, and thus on the rendered scene. This is the subject of the next subsection.
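As a rough illustration of the balancing argument above, the number of tasks needed by a stage can be estimated by dividing its per-frame load by the load of the heaviest stage that is left unduplicated. The sketch below is an assumption-laden estimate, not the method used in the paper: the function `tasks_per_stage`, the choice of target and the per-stage cycle counts are hypothetical (the figures are order-of-magnitude values inspired by the cube scene discussion), and the rule assumes that work divides evenly between duplicated tasks and ignores communication time.

```python
def tasks_per_stage(cycles_per_stage, duplicable, target_cycles):
    """Estimate how many parallel tasks each stage needs so that no
    single task exceeds target_cycles per frame (ideal even split)."""
    plan = {}
    for stage, cycles in cycles_per_stage.items():
        if stage in duplicable:
            # Integer ceiling division, at least one task per stage.
            plan[stage] = max(1, (cycles + target_cycles - 1) // target_cycles)
        else:
            plan[stage] = 1
    return plan


if __name__ == "__main__":
    # Hypothetical per-frame loads in cycles (illustrative values only).
    cube_frame = {
        "geometry": 100_000,
        "rasterizer": 1_500_000,
        "fragment": 9_000_000,
        "display": 1_000_000,
    }
    # Assumption: only the fragment stage is duplicated, and the target
    # is the load of the heaviest remaining stage.
    target = max(c for s, c in cube_frame.items() if s != "fragment")
    print(tasks_per_stage(cube_frame, {"fragment"}, target))
```

Since the loads change from frame to frame and from scene to scene, such an estimate would have to be recomputed at run time, which is the data dependence the profiling results point to.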
5.4 Parallel and Dataflow Modes

The simulation results have shown that the optimal parallelization of the pipeline stages differs significantly from one scene to another. To balance the computation efficiently along the pipeline, different versions of the rendering pipeline have been implemented (e.g. the example shown in Figure 6) based on the profiling results. For the cube scene, the fragment stage is clearly the most compute-intensive stage, so a new version with six fragment tasks has been implemented in the same manner as the one shown in Figure 5. The sphere case is more compute intensive for the rasterizer and the fragment stages. These stages therefore need to be duplicated as in Figure 6: the sphere scene needs two rasterizer tasks and four fragment tasks. The entities scene is even more compute intensive for the fragment stage, and needs four fragment tasks.

The results for the cube scene with six fragment tasks are shown in Figure 13. The amount of work is evenly spread across the fragment tasks and along the other pipeline

stages. The fragment tasks take up to two million cycles, whereas the single fragment task of the dataflow implementation took up to ten million cycles. The fragment and rasterizer stages of the pipeline are now balanced. The display task takes from two hundred thousand to one million cycles, and the geometry task takes one hundred thousand cycles and does not vary.

Figure 14 shows the results for the sphere scene using two rasterizer and four fragment tasks. As before, the computation is spread between the fragment tasks and the rasterizer tasks. The second rasterizer task is more compute intensive than the first one, since its load depends on the incoming data from the geometry task. The geometry and display stages remain unchanged.

The results for the entities scene are shown in Figure 15. The tasks of the duplicated stages are now balanced with the geometry and display stages. The display stage is in charge of merging data and cannot be duplicated.

To balance the pipeline efficiently when rendering the cube scene, six fragment stages are needed. The sphere scene needs two rasterizer stages and four fragment stages. Finally, the entities scene needs two rasterizer stages and four fragment stages. A mapping of the rendering application with a static parallelism that supports every scene efficiently is thus clearly infeasible, and static data parallelism is not the solution. Moreover, this mapping is unpredictable, because it is data dependent.

The balanced versions of the pipeline improve the overall performance. Taking the communication times into account, the parallelized cube version renders at 148 fps (frames per second), compared to 91 fps for the dataflow-only version. The parallelized sphere version renders at 142 fps, compared to 71 fps for the dataflow version. Finally, the parallel and dataflow version of the entities scene renders at only 75 fps, compared to 69 fps for the dataflow version.
This is due to FIFO contention problems caused by the data dispatch methodology.

6. CONCLUSIONS AND PERSPECTIVES

This paper presents profiling figures for the implementation of a graphic rendering application on a multi-core architecture. It provides insights into the highly non-stationary workloads of the different pipeline stages through experiments on three distinct rendering scenes. It points out the need for a dynamic adaptation of the pipeline stage parallelism to improve performance. The idea is to link the resource allocation process to the amount of data to be computed in each stage, for example by monitoring buffer usage. At the scene level, the load requirements can be predicted from the preceding frames in order to balance the pipeline for the next frame.

The more global objective of this work is to define an architecture able to support a variety of mobile applications, each having different performance requirements that can evolve dynamically. These mobile applications may require specific accelerators, so the architecture needs to support the mapping of tasks and the dynamic adjustment of parallelism with different computing resources and under low-power constraints. The SCMP architecture uses a static graph that describes the application task dependencies. To support dynamic load balancing, this graph has to be modified to describe the possible adaptation stages. Future work involves the definition of an efficient load balancing algorithm that takes into account the cost of creating and removing tasks and the data transfers in the context of heterogeneous computing resources.

7. REFERENCES

[1] C.H. Van Berkel, Multi-Core for Mobile Phones, Proceedings of the IEEE/ACM Conference on Design, Automation and Test in Europe DATE 09, pp.
[2] Tomas Akenine-Moller, Eric Haines, and Naty Hoffman, Real-Time Rendering, Third Edition, A K Peters/CRC Press, July
[3] OpenGL ES low level API.
[4] Bren C.
Mochocki, Kanishka Lahiri, Srihari Cadambi, X. Sharon Hu, Signature-Based Workload Estimation for Mobile 3D Graphics, Proceedings of the 43rd IEEE/ACM Design Automation Conference DAC, pp.
[5] Jeong-Ho Woo, Ju-Ho Sohn, Hyejung Kim, Hoi-Jun Yoo, A 195 mW, 9.1 Mvertices/s Fully Programmable 3-D Graphics Processor for Low-Power Mobile Devices, IEEE Journal of Solid State Circuits, vol. 43, no. 11, pp.
[6] ARM Mali
[7] PowerVR SGX Series 5
[8] Imagination submits POWERVR SGX cores for OpenCL conformance
[9] ARM Mali-T604
[10] W. Plishker, G. Zaki, S. S. Bhattacharyya, C. Clancy, and J. Kuykendall, Applying graphics processor acceleration in a software defined radio prototyping environment, Proceedings of the International Symposium on Rapid System Prototyping, pp. 67-72, Karlsruhe, Germany, May
[11] Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure, On the limits of GPU acceleration, Proceedings of the 2nd USENIX conference on Hot topics in parallelism, pp. 13, Berkeley, CA, USA,
[12] N. Ventroux, R. David, SCMP Architecture: An Asymmetric Multiprocessor System-on-Chip for Dynamic Applications, Proceedings of the ACM International Forum on Next Generation Multicore/Manycore Technologies (IFMT), pp. 6, Saint-Malo, France,
[13] N. Ventroux, A. Guerre, T. Sassolas, L. Moutaoukil, G. Blanc, C. Bechara and R. David, SESAM: an MPSoC Simulation Environment for Dynamic Application Processing, Proceedings of the IEEE 10th International Conference on Embedded Software and Systems (ICESS), pp., Bradford, UK, 2010.
