Enhancing Visual Rendering on Multicore Accelerators with Explicitly Managed Memories *


JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 28, (2012)

KYUNGHEE CHO 1, SEONGGUN KIM 2 AND HWANSOO HAN 2,+
1 S-Core Corporation, Seongnam, Korea
2 School of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea

Received May 31, 2011; accepted March 31, Communicated by Jiman Hong, Junyoung Heo and Tei-Wei Kuo.
* This work was supported by the Ministry of Education, Science, and Technology, Korea under NRF Grant No. NRF and by the Ministry of Knowledge Economy, Korea under NIPA ITRC program No. NIPA-2012-H
+ Corresponding author.

Recent electronic devices are equipped with processors extended with multicore accelerators to take advantage of the performance of acceleration co-processors. High-end electronic products of this kind are expected to run graphics-rich applications. Scalable acceleration co-processors are frequently designed as multicores with explicitly managed memories. Such multicore architectures require sophisticated data management between the main memory and the local memories to fully exploit their potential performance. Ray tracing is a high quality rendering algorithm in computer graphics and offers abundant parallelism to exploit. On explicitly managed memory hierarchies, however, ray tracing with complex data structures tends to suffer from irregular memory accesses and inefficient data management. Compared to other acceleration structures for ray tracing, the grid structure is simple to manage but is commonly regarded as too slow. However, recent improvements to the grid structure with SIMD optimizations show performance comparable to the kd-tree structure, which is one of the fastest acceleration structures. We introduce a grid-based parallel ray tracer on a processor with a multicore accelerator. We adopt SIMD optimizations and double buffering to enhance the performance of the grid-based ray tracer and propose a macrocell structure over the grid to fully exploit the memory bandwidth. In our experiments, our ray tracing scheme shows performance comparable to a BVH-based ray tracer.

Keywords: ray tracing, multicore accelerator, grid structure, DMA latency hiding, explicitly managed memory

1. INTRODUCTION

Recent advances in microprocessors allow off-the-shelf processors to integrate powerful accelerators on the same chip. For example, one multicore processor is designed with nine processing cores: one regular processing core and eight specialized cores. While applications run on the regular processing core, the parts of the applications that demand high performance are assigned to the specialized cores to accelerate execution. Processors of this kind can provide total processing power of up to hundreds of GFLOPS, which is competitive with powerful GPUs. The processor with such a multicore accelerator was originally developed for accelerating multimedia and vector processing applications. Thus, its main target applications include digital media, image


processing, compression, encryption, DSP, ray tracing, high performance computing, pattern matching, network security, etc. [1-3]. As electronics companies take an increasing interest in high quality applications on their devices, some high-end electronic devices are equipped with this kind of multicore processor. Game consoles with such multicore processors provide realistic game play and high quality graphics. High-end HDTVs, which are also equipped with such multicore processors, are capable of decoding multiple video streams in software.

Ray tracing is one of the most representative rendering algorithms for three-dimensional scenes. Its image quality is far better than that of traditional rasterization schemes. Since the color of each pixel can be calculated without any dependence on other pixels, parallelism in ray tracing is abundant. Moreover, many fast traversal and intersection algorithms for ray tracing have been developed using SIMD optimizations. Ray tracing is a promising way to differentiate the quality of future electronic products [4, 5].

The ray tracing algorithm shoots a ray from the eye through the screen into the 3D space, finds the nearest hit point on an object, and calculates the color of the pixel from the information about objects and lights. After the first hit, we can generate more rays and traverse them recursively for reflection, refraction, and shadow. The object traversal step could test every triangle for intersection and keep the nearest one, but traversing all the triangles in a scene is inefficient. If we can skip the triangles that the ray can never hit, performance improves considerably. Acceleration structures for ray tracing are proposed to implement this idea. Bounding volume hierarchy (BVH), grid, octree, binary space partition (BSP), and kd-tree are such examples [6]. Fig. 1 shows representative acceleration structures for ray tracing.

Fig. 1. Acceleration structures: (a) grid; (b) octree; (c) bounding volume hierarchy (BVH); (d) kd-tree.

The acceleration structures can be classified by how they are built. Spatial subdivision divides triangles according to their location in space. Grid, octree, and kd-tree belong to this category. The advantage of spatial subdivision is that we can exit early, without checking the rest of the triangles, once we find the triangle hit by the ray. The downside of these structures is that intersection tests can be duplicated many times when triangles stretch over two or more sub-spaces. Meanwhile, hierarchical object grouping collects the triangles grouped by objects and hierarchically encloses them within bounding shapes. BVH, skd-tree, and bkd-tree belong to this category. Traversal algorithms for these hierarchical structures start from the root node and proceed to the leaf nodes, checking

the intersection between a ray and a bounding shape. If a bounding shape does not intersect with a ray, we can skip all the triangles that belong to the bounding shape.

Complex acceleration structures, which are built adaptively depending on the distribution of objects, require a large amount of time to initialize for ray tracing. For static scenes, the initial build time can be amortized by rendering the same scene many times during navigation of the scene. As ray tracers get faster, rendering dynamic scenes becomes an important feature [6-11]. One implication for dynamic rendering is that the structure build time must be included in the rendering time. Most acceleration structures are adaptive to geometry and incur heavy build costs. Meanwhile, grid structures are generally very fast to build. Since the grid structure just projects triangles onto uniformly divided cells, its build time is far lower than that of any other acceleration structure.

In addition, we investigate an appropriate acceleration structure for explicitly managed memories. The uniformity of the grid structure is an advantage for managing data between the main memory and the local stores of specialized cores. The grid structure is generally regarded as slower than adaptive acceleration structures, but recent techniques that exploit SIMD instructions favor its uniform and regular shape. In this paper, we additionally investigate techniques for grid-based structures on multicore accelerators with explicitly managed memories. The main contributions of our paper are as follows: (1) we propose a grid-based ray tracer on multicore accelerators with explicitly managed memories; (2) we propose a parallelization technique for ray tracing which can hide the DMA latency; (3) we experimentally show that our grid-based ray tracer is comparable to other hierarchical traversals on multicore accelerators with explicitly managed memories.

2. OVERVIEW OF GRID-BASED RAY TRACING

To implement an efficient ray tracer, several of its components should be taken into account: structure building, structure traversal, and intersection test. The acceleration structure often decides the efficiency of these components. Build time is often ignored because most rendering algorithms focus on static scenes. Thanks to fast processors and advanced ray tracing algorithms, some ray tracers achieve real-time rendering and are capable of handling dynamic scenes. Once we deal with dynamic scenes, building acceleration structures becomes an important issue in ray tracing.

Well-known acceleration structures such as BVH and kd-tree are classified as hierarchical, adaptive structures, whereas grid structures are uniform spatial subdivisions. A brief comparison is shown in Table 1. In terms of traversal time, BVH and kd-tree are faster than grid, since they build trees for fast traversal. Traversal algorithms for grid structures are considered slow, since each ray is traversed cell by cell using the 3D-DDA algorithm. To improve the performance, we use the coherent grid traversal [8]. Instead of calculating each ray individually, the coherent grid traversal processes a packet

of multiple rays together by using SIMD instructions. The coherent grid traversal is about 10 times faster than 3D-DDA, which makes it comparable with kd-tree traversals. Grid-based traversals often suffer from multiple intersection tests for triangles that overlap two or more grid cells. We use mail-boxing [12] and frustum culling [13] to reduce the overhead of these repeated intersection tests.

As for the build time of acceleration structures, BVH and kd-tree take more time than simple spatial subdivision structures such as grids. Special cases such as deformable motion can be handled quickly by BVH or kd-tree through refitting and incremental updates, but these structures are slow to build in general. Grid-based acceleration structures, on the other hand, have considerable potential for real-time ray tracers on modern processor architectures.

Table 1. Characteristics of acceleration structures.
Acceleration Structure   Partition Method   Hierarchy   Traversal Time *   Build Time *
Grid                     Uniform            No          O(∛n)              O(n)
BVH                      Object             Yes         O(log n)           O(n log n)
kd-tree                  Adaptive           Yes         O(log n)           O(n log n)
* n is the number of triangles.

Fig. 2. Parallel programming model for the multicore accelerator.

To parallelize ray tracing on multicore accelerators, we use the single program multiple data (SPMD) programming model. Each core within the accelerator processes different rays to find their intersections with the triangles in grid cells. Since each ray passes through different grid cells, the processing time of each core differs, which may cause load imbalance across the cores. To avoid the load imbalance, we use dynamic scheduling. We divide the screen into tiles of n × n pixels and distribute them to the cores. When a core finishes its work, the master processor gives it another tile. The workload among cores is balanced in this way.

As described in Fig. 2, the master processor constructs the grid-based data structures for ray tracing, initializes the cores within the accelerator, and schedules pixel tiles to render. Each core renders the pixels within its given tiles and returns the resulting pixel colors to the main memory. Since rendering the pixels of disjoint tiles is independent work, each core runs the steps of ray tracing in parallel: ray generation, ray traversal, intersection test, shading, and writing to the frame buffer. Once all the tiles are processed by the cores, the master processor displays the result on the screen.
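The dynamic scheduling described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: Python threads stand in for accelerator cores, a shared queue stands in for the master processor handing out tiles, and the names `make_tiles`, `render_tile`, and `render_frame` are hypothetical. `render_tile` returns a dummy color per pixel instead of tracing rays.

```python
import queue
import threading


def make_tiles(width, height, n):
    """Split a width x height screen into n x n pixel tiles (edge tiles may be smaller)."""
    return [(x, y, min(n, width - x), min(n, height - y))
            for y in range(0, height, n)
            for x in range(0, width, n)]


def render_tile(tile):
    """Placeholder for per-tile ray tracing: one dummy color per pixel (assumption)."""
    x0, y0, w, h = tile
    return {(x0 + dx, y0 + dy): (0, 0, 0) for dy in range(h) for dx in range(w)}


def render_frame(width, height, n, num_cores):
    """Each 'core' repeatedly takes the next unprocessed tile until none remain."""
    work = queue.Queue()
    for tile in make_tiles(width, height, n):
        work.put(tile)

    frame = {}                 # stands in for the frame buffer in main memory
    lock = threading.Lock()

    def core():
        while True:
            try:
                tile = work.get_nowait()   # ask for another tile when idle
            except queue.Empty:
                return                     # no tiles left: this core is done
            pixels = render_tile(tile)
            with lock:
                frame.update(pixels)       # write the tile's colors back

    threads = [threading.Thread(target=core) for _ in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return frame


frame = render_frame(width=10, height=7, n=4, num_cores=6)
```

Because a core only requests a new tile when it has finished the previous one, cores that receive cheap tiles simply process more of them, which is the load-balancing effect the text relies on.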


The memory system on multicore accelerators is composed of multiple local memories, which need explicit management of data movement between the main memory and the local memories. The performance on such a memory system highly depends on how much of the memory latency we can hide in the DMA transfers between the main memory and the local memory of each core. Software cache is a typical solution for BVH or kd-tree, since we cannot determine which triangles will be tested until we traverse down to a leaf node. The performance, however, may suffer from a high miss rate in the software cache. Meanwhile, grid-based structures divide space and map each grid cell to a different data node without overlap. If we know the traversal direction of the ray, we can find the grid cells to be processed with a simple calculation and determine which triangles will be tested.

Fig. 3 briefly describes double buffering on a grid-based structure. Assume that we have a scene with four objects, each consisting of many triangles, as in Fig. 3 (a). We can find which grid cells a ray passes through with a simple calculation before the actual traversal of each grid cell. Those grid cells are shown in grey in Fig. 3 (b). Among the grey cells, we can figure out which grid cells contain triangles. We request the first DMA transfer for the triangles of the dark trapezoid object, as in Fig. 3 (c). Once we get the first set of triangles, we perform the intersection test on it. At the same time, we request the DMA transfer for the white circle object in Fig. 3 (c). We keep alternating between intersection tests and DMA transfers until we find the hit point, as in Fig. 3 (d). In this manner, we overlap the computation of the intersection tests with the DMA transfer for the next object to test and hide the DMA latency.

Fig. 3. Double buffering for grid-based ray tracer; (a) Ray tracing on the scene with 4 objects; (b) Find the cells passed by a ray (those cells are indicated by grey color); (c) Perform double buffering by overlapping the computation for dark trapezoid with the DMA transfer for white circle; (d) Continue to process grid cells one by one until we find the hit point.

3. GRID STRUCTURE

Grid structures are regarded as having a slower traversal time than hierarchical structures, but a faster build time. In terms of data structure, hierarchical structures traverse multiple nodes across non-contiguous memory locations, even when those nodes are spatially close. In grid structures, on the other hand, adjacent grid cells are located contiguously in memory, which allows spatially close grid cells to be processed with better locality. In addition, grid structures are not hierarchically organized. Thus, we can easily predict the location of the next grid cell while we traverse. These characteristics make it easy for grid-based acceleration structures to employ double buffering to hide the latency of the DMA

transfer. Grid structures were once regarded as slow acceleration structures, but they now show traversal performance comparable to other, more complex acceleration structures. Moreover, their fast build times add value, particularly in real-time rendering of dynamic scenes.

3.1 Data Structure of Grid

We use the polygon file format (PLY file format), which includes vertices and faces, for the test set. Each vertex consists of x, y, and z coordinates, and each face refers to some of the vertices; we use only triangles (a face with 3 vertices). For the triangle intersection test with barycentric coordinates [9], we build a pre-computed acceleration structure. Triangles in the acceleration structure are grouped and sorted by grid cell, and triangles crossing the boundaries of grid cells are duplicated for each of those grid cells. Each grid cell holds two pieces of information: the index of its triangle data in the acceleration structure and the number of triangles in the grid cell. By using these, we can easily fetch the data via DMA.

The size of a grid cell is important for the grid-based acceleration structure. If the grid cell size is too small, we perform fewer intersection tests per grid cell, but we have to traverse more grid cells. On the other hand, if the grid cell size is too large, we traverse fewer grid cells, but perform more intersection tests per grid cell. In this paper, we use the following Eq. (1) for the grid resolution [14].

\[ N_x = L_x \sqrt[3]{\frac{\lambda N_{tri}}{V}}, \qquad N_y = L_y \sqrt[3]{\frac{\lambda N_{tri}}{V}}, \qquad N_z = L_z \sqrt[3]{\frac{\lambda N_{tri}}{V}} \tag{1} \]

V is the volume of the bounding box, and L_x, L_y, and L_z are the lengths of the three sides of the bounding box, respectively (i.e. V = L_x L_y L_z). The total number of triangles is N_tri, and the parameter that determines the size of a grid cell is λ. Since the total number of grid cells is N_grid = N_x N_y N_z, the parameter satisfies λ = N_grid / N_tri.
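As a worked illustration of Eq. (1), the following sketch computes the grid resolution from the bounding box side lengths, the triangle count, and λ. The function name `grid_resolution` is hypothetical, and rounding each count to the nearest integer with a minimum of one cell per axis is an assumption made here; the text does not say how fractional values are handled.

```python
def grid_resolution(lx, ly, lz, n_tri, lam):
    """Evaluate Eq. (1): N_axis = L_axis * cbrt(lam * n_tri / V), with V = lx * ly * lz.

    Rounding and the minimum of one cell per axis are assumptions of this sketch.
    """
    volume = lx * ly * lz
    scale = (lam * n_tri / volume) ** (1.0 / 3.0)
    return tuple(max(1, round(side * scale)) for side in (lx, ly, lz))


# A 2 x 1 x 1 bounding box holding 16,000 triangles, with lam = 1.
nx, ny, nz = grid_resolution(2.0, 1.0, 1.0, n_tri=16000, lam=1.0)

# The resulting cell count recovers the parameter: lam = N_grid / N_tri.
effective_lam = (nx * ny * nz) / 16000
```

For this example the cells come out cubic (side length 0.05 along every axis), because each axis count is proportional to its side length; the longer x side simply gets twice as many cells.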
If we assume that all the triangles are uniformly distributed, each grid cell will contain one triangle when λ = 1. In general, we can assign an appropriate number of triangles to a grid cell by adjusting the parameter λ. If we increase λ, grid cells tend to include fewer triangles. If we decrease λ, grid cells are likely to include more triangles.

3.2 Macrocell

When we build a uniform grid structure, some grid cells may contain too many triangles, while others may have none. This imbalance often hurts the performance of the grid traversal, since we need to traverse many grid cells that have no triangles to test. To reduce such overhead, we use the idea of hierarchical grids by constructing macrocells over grid cells [8, 15]. A macrocell covers m × m × m grid cells. As a result, it introduces one level of hierarchy over the grid cells. By using the macrocell structure, we can traverse cells faster when the triangles are sparsely distributed over the grid cells.

Moreover, the macrocell structure provides an advantage on multicore accelerators over other general purpose multicore processors. We can hold the information of many

more grid cells within the local memory. The data structure of the grid cells holds 3-dimensional information, and its size can be very large. The local memory of an accelerator core is only a few hundred kilobytes. Since the accelerator core cannot access the data structures of the grid cells directly, it needs to bring them into its local memory before it can process them. A few hundred kilobytes are too small for a fairly large number of grid cells. In general, the accelerator core prefers to fetch the information for a large number of grid cells at once, as issuing fewer DMA requests reduces the overhead of DMA transfers. By using the macrocell structure, we can fetch the information of a macrocell into the local memory and traverse each of its grid cells without multiple DMA transfers.

For empty grid cells that have no triangles inside, we can skip fetching the detailed information for the grid cells. Without the macrocell, we do not know whether a grid cell includes triangles before bringing the data for that grid cell into the local memory. If the fetched grid cell has no triangles, the DMA request was useless. In addition, empty cells make it difficult to apply double buffering: since there are no triangles on which to compute intersection tests, the next DMA transfer cannot be overlapped with computation. As a result, performance drops. By using the macrocell, we can avoid fetching empty grid cells. We also reorder the grid cells so that the grid cells within the same macrocell are placed adjacently. When the information of the grid cells within the same macrocell is requested, a single DMA request can serve it.

4. GRID-BASED TRAVERSAL AND INTERSECTION

Ray tracing consists of five steps: ray generation, ray traversal, intersection test, shading, and writing to the frame buffer. In this section, we present our traversal and intersection algorithms.
To speed up our algorithms, we adopt the coherent grid traversal, which takes advantage of SIMD instructions for intersection tests [8]. We also apply mail-boxing [8, 12] and frustum culling [8, 13] to overcome the shortcomings of grid structures.

4.1 Coherent Grid Traversal and SIMD Intersection Test

For fast grid traversal, we use the coherent grid traversal algorithm [8]. This algorithm is about 10 times faster than the conventional 3D-DDA algorithm. First, it finds the axis of the ray packet that is aligned to the traversal direction, and computes the bounding frustum of the packet. Then, it traverses the grid along the traversal axis one slice at a time. As it proceeds to the next slice, the overlap of the frustum with the next slice is computed incrementally from its overlap with the current slice.

To make the most of the accelerator cores, we employ SIMD instructions in our intersection test [9]. First, we construct a ray packet of n × n coherent rays. By using SIMD instructions, four rays are tested together at a time. The intersection test with a triangle consists of four individual tests. First, it tests the distance to the embedding plane of the triangle. Then, it tests the three barycentric coordinates of the point where the ray pierces the plane.

4.2 Mail-Boxing and Frustum Culling

Mail-boxing and frustum culling are both very effective at reducing the number of redundant intersection tests, which are a major disadvantage of uniform grid traversals. In

grid structures, a large number of triangles may overlap multiple grid cells. Since the overlapped grid cells are neighbors of one another, it is highly probable that the intersection test for the same triangle is performed multiple times. Repeatedly testing the same triangle can be avoided by mail-boxing [8, 12]. A unique identification number is assigned to each triangle, and accelerator cores record the numbers of the triangles that are already tested. Before performing an intersection test, we check whether the identification number of the triangle is in the recorded list. If it is, the triangle has already been tested and we skip its intersection test.

Since triangles do not fit within the boundaries of grid cells as tightly as they do in a kd-tree, the intersection test on a grid structure has to consider some extra triangles which a kd-tree would avoid. If a triangle lies completely outside the frustum of the ray packet, frustum culling with barycentric coordinates lets us skip the intersection tests for the rays in the packet. Before performing the intersection test, we perform the culling test for the four corner rays of a ray packet [8, 13]. If all four corner rays are outside the triangle, we do not have to perform the intersection tests for the rest of the rays.

5. DMA LATENCY HIDING

Multicore accelerators with local memories typically form a distributed memory system. On such a memory system, accelerator cores cannot access the main memory directly. Instead, we need to move the data from the main memory to the local memories of the accelerator cores by direct memory access (DMA). To reduce the overhead of the DMA latency, software cache [16, 17] is one technique, which keeps the triangle data for future use.
Another technique is double buffering, which is widely used to overlap communication with computation [18]. Software cache is useful when the memory access pattern is irregular, but it can impose quite a large overhead for misses in the software cache. When the access pattern is regular and we can predict the next data to access, double buffering is much more effective and has less overhead. Predicting the index of the next grid cell during the traversal of a grid structure is relatively easy, since grid-based structures show a regular access pattern during the traversal. Thus, we adopt double buffering to hide the DMA latency instead of the software cache.

We apply the double buffering scheme to three levels of data transfers via asynchronous DMA requests. Fig. 4 represents these three levels of double buffering: tile level, macrocell level, and triangle level. Each accelerator core runs the same ray tracing code, but on different tiles. The color of each pixel within a tile is calculated by the rendering algorithm, and the resulting colored tiles are sent to the main memory through DMA transfers. At this tile level, we prepare two buffers, one for the rendering computation and the other for the DMA transfer. The double buffering scheme simultaneously renders the current tile and transfers the previously rendered tile. Within the rendering algorithm for a pixel, we generate a packet of coherent rays and traverse the macrocells that intersect with the ray packet. At the macrocell level, we request the DMA transfer for the grid cells that are contained within the intersected non-empty macrocells. While we transfer the next grid cells to compute, we simultaneously traverse the current grid cells, which were transferred by the previous request. At the triangle level, we traverse the grid cells to find the triangles that intersect with the ray packet. In a similar fashion,

we request the DMA transfer for the next triangles while we perform the intersection test with the current triangles, which have been transferred previously. Once we find the first hit points for the rays in the packet, we perform the shading algorithm to calculate the colors.

Fig. 4. Double buffering for DMA latency hiding: three levels of double buffering are employed for rendering. From the top to the bottom, each represents tile level, macrocell level, and triangle level double buffering, respectively.

6. EXPERIMENTAL RESULTS

To experimentally evaluate our grid-based ray tracing on multicore accelerators with explicitly managed memories, we used a game console which contains a multicore processor. The multicore processor has one general processor and six special cores as an accelerator, and runs at 3.2 GHz. Each accelerator core has 256 KB of local memory. The main memory of the game console is 256 MB. Table 2 shows the rendered scenes and the characteristics of the polygon models we used for our experiments. Four different models contain 36, ,000 vertexes and 70, ,000 triangles. The same conference model is rendered from two different viewpoints to generate scenes similar to those used in other ray tracers. The polygon models were downloaded from the Stanford 3D scanning repository [19].

Table 2. Scenes used in experiments.
Name          Bunny     Horse    Armadillo
#Vertexes     35,947    48,…     …,974
#Triangles    69,451    96,…     …,944

Name          Conference 1   Conference 2
#Vertexes     166,…          …,867
#Triangles    282,…          …,755

6.1 Performance of Grid-Based Ray Tracer

Table 3 shows the performance of the rendered scenes with our grid-based ray tracer on multicore accelerators with two shading variants. When we perform the ray casting without shading, the measured FPS values range from 17 to 56 on a 6 core accelerator. Complex models such as armadillo and conference show lower FPS than bunny and horse, but the results are still competitive. When we add simple shading, the measured FPS drops by 10-20%, but remains high.

Table 3. Performance of grid-based ray tracer (fps). Columns: Bunny, Horse, Armadillo, Conf. 1, Conf. 2. Rows: ray casting + no shading and ray casting + simple shading, each on the multicore accelerator at 3.2 GHz, with 1 core and with multiple cores. [Table values missing from this copy.]

Fig. 5. Scalability on multicore accelerators: FPS increases almost linearly as the number of cores increases. The dotted lines represent projected linear speedups based on 1 core performance.

The graphs in Fig. 5 show how scalable our grid-based ray tracer is. The thin dotted lines show the FPS values projected linearly from the results on the 1 core accelerator. The thick lines represent the FPS values measured while varying the number

of cores used in the accelerator. For all the scenes, the measured FPS is only slightly lower than the projected FPS. Thus, our grid-based ray tracer shows almost linear scalability up to 6 accelerator cores.

6.2 DMA Latency Hiding

Fig. 6 shows the effect of double buffering. The two bars for each scene represent the result without double buffering and the result with double buffering, respectively. The execution time, in seconds, is broken down into initialization, computation, and DMA latency. The initialization part includes the time spent in ray generation and parameter reset. The computation part is the time for traversal and intersection test. The DMA latency is the time spent waiting for DMA transfers to complete at the three levels of DMA requests. The only difference between the two bars for each scene is whether double buffering is applied. Thus, only the DMA latency part (the top component of each bar) is reduced, as shown in Fig. 6.

The rows labeled DMA hide in Table 4 show what percentage of the DMA latency is removed by the double buffering scheme. In all experiments, 60-76% of the DMA latency is hidden by overlapping the DMA transfers with the computation. As a result, the performance of our ray tracer increases by 10-22%. The rows labeled speedup in Table 4 show these performance improvements. The two conference scenes spend a relatively high portion of their time on DMA latency, as these models have many objects inside. Accordingly, these two scenes also gain the most from double buffering.

Fig. 6. Execution time breakdown in seconds: initialization (INIT), computation (COMP), and DMA latency (DMA). Two bars for a scene represent the results without double buffering and the results with double buffering on 1 core and 6 core accelerators.

Table 4. DMA latency hiding and speedup.
                          Bunny    Horse    Armadillo   Conf. 1   Conf. 2
DMA hide        1 core    71.7%    75.2%    75.7%       71.6%     73.5%
DMA hide        6 cores   71.2%    73.8%    75.4%       62.1%     60.1%
speedup (fps)   1 core    15.4%    14.1%    10.3%       18.8%     21.8%
speedup (fps)   6 cores   15.6%    14.7%    10.6%       16.5%     18.5%
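The two row groups of Table 4 can be related with a simple model. The sketch below is an illustration rather than part of the reported measurements. It assumes that double buffering shortens only the DMA wait time (as stated for Fig. 6) and that the speedup is the relative increase in FPS. The function names `implied_dma_fraction` and `fps_speedup` are hypothetical. If the DMA wait accounts for a fraction f of the frame time without double buffering and a fraction h of that wait is hidden, the new frame time is 1 - f·h of the old one, so the FPS gain is 1 / (1 - f·h) - 1.

```python
def fps_speedup(dma_fraction, hidden):
    """Relative FPS gain when `hidden` of the DMA wait is overlapped with computation.

    Assumes only the DMA wait shrinks; `dma_fraction` is the DMA share of the
    frame time without double buffering.
    """
    return 1.0 / (1.0 - dma_fraction * hidden) - 1.0


def implied_dma_fraction(hidden, speedup):
    """Invert the model: the DMA share of frame time implied by a Table 4 column."""
    return (1.0 - 1.0 / (1.0 + speedup)) / hidden


# Table 4, 1 core: Bunny hides 71.7% for a 15.4% speedup,
# Conf. 2 hides 73.5% for a 21.8% speedup.
bunny = implied_dma_fraction(hidden=0.717, speedup=0.154)
conf2 = implied_dma_fraction(hidden=0.735, speedup=0.218)
print(f"implied DMA share: bunny {bunny:.3f}, conf. 2 {conf2:.3f}")
```

Under this model the implied DMA share is roughly 19% for Bunny and roughly 24% for Conf. 2 on 1 core, which agrees with the observation that the conference scenes spend a larger portion of their time on DMA latency.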

12 906 KYUNGHEE CHO, SEONGGUN KIM AND HWANSOO HAN 6.3 Comparison with Other Ray Tracers Fig. 7 shows the performance comparison of our grid-based ray tracer on a 6 core accelerator with other ray tracers on a similar architecture and several general purpose processors. We used the two conference scenes to measure the performance of our gridbased ray tracer on a 6 core accelerator. The performances of other ray tracers are taken from the previously published literature. The graph in Fig. 7 (a) compares the performance of the ray tracing on multicore accelerators with two different acceleration structures: BVH and grid. The performance for the BVH structure is measured on an 8 core accelerator running at 2.4GHz [16]. Meanwhile, the performance of our grid-based ray tracing is measured on a 6 core accelerator running at 3.2GHz. Since the theoretical peak performances of two platforms are actually the same (8 cores 2.4GHz = 6 cores 3.2GHz), direct comparisons are meaningful. Our grid-based ray tracer is slow by half, but still comparable. If we include the build time, our grid-based ray tracer can be more competitive. The graph in Fig. 7 (b) compares the performance of our ray tracer with other ray tracers on general purpose processors from [8]. The performance of our grid-based ray tracer on a 6 core accelerator is quite impressive. Particularly, our grid-based traversal performs almost four times faster than the coherent grid traversal on a general purpose CPU, from which we mainly take the ideas for our grid-based structure on multicore accelerators. (a) (b) Fig. 7. Performance comparison with conference 1 for (a) and conference 2 for (b); (a) BVH vs. grid on multicore accelerators; (b) Grid on multicore accelerator vs. various ray tracers on general CPUs with/without HW multithreading (MT). 7. RELATED WORK Multicore accelerators are appropriate architectures for ray tracing. There have been quite volume of ray tracing studies on multicore accelerator architectures. 
The terrain rendering engine (TRE) was developed as a client-server ray casting system [20]. A client sends the user's rendering parameters and a server performs the rendering. The rendered images are delivered in compressed form between the server and the client. The rendering engine is pipelined and optimized to use SIMD instructions.

Ray tracing with the BVH has been investigated on dual multicore accelerators, exploring a software cache technique for explicitly managed memories [16, 17]. To reduce the cache miss delay, software hyper-threading has also been studied to hide the latency of the DMA transfers for the missed data [17]. To exploit the SIMD architecture of multicore accelerators, efficient SIMD intersection algorithms have been investigated for BVH traversal with packets of rays [9, 21].

The interactive ray tracer (irt) [22] for multicore accelerators was implemented using techniques introduced in previous works [16, 17]. In addition, reflection, transparency, shadows, BRDF lighting, and cubic environment-mapped textures were added to the features of the irt. The ray packet technique is also applied to ambient occlusion rays. The irt was able to render complex scenes with over one million polygons on a cluster of eight accelerators, each of which contains eight special accelerator cores.

8. CONCLUSION

In this paper, we presented a grid-based ray tracer on multicore accelerators. We proposed a parallelization scheme for multicore accelerators with explicitly managed memories. We also introduced double buffering with macrocells over the grid to hide the DMA latency. We experimentally showed that our grid-based ray tracer scales close to linearly on a multicore accelerator. We also showed that our double buffering scheme can hide 60-76% of the DMA latency, which in turn results in a 10-22% speedup in FPS. Compared to ray tracers on various architectures, our ray tracer on a multicore accelerator shows competitive or better performance.
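The double buffering idea summarized above can be illustrated with a minimal Python sketch. It is not the accelerator implementation evaluated in this paper: a worker thread stands in for the asynchronous DMA engine, a list of lists stands in for main memory, and the names `fetch_macrocell`, `process`, and `traverse_double_buffered` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_macrocell(main_memory, index):
    """Stand-in for a DMA transfer: copy one macrocell into a local buffer."""
    return list(main_memory[index])


def process(buffer):
    """Stand-in for the intersection work done on a local buffer."""
    return sum(buffer)


def traverse_double_buffered(main_memory, order):
    """Process macrocells in `order`, prefetching the next one during compute."""
    results = []
    if not order:
        return results
    # One worker thread plays the role of the DMA engine (an assumption).
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch_macrocell, main_memory, order[0])
        for step in range(len(order)):
            current = pending.result()  # wait for the buffer needed now
            if step + 1 < len(order):
                # Start fetching the next macrocell into the other buffer.
                pending = dma.submit(fetch_macrocell, main_memory, order[step + 1])
            results.append(process(current))  # compute overlaps the fetch
    return results
```

Under these assumptions, a transfer is hidden to the extent that processing the current buffer takes at least as long as fetching the next one; the remainder is exposed as waiting time in `pending.result()`.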
Compared to the BVH-based ray tracer on a similar multicore accelerator, our grid-based ray tracer is about two times slower, but the result is still promising because grid structures are much faster to build than BVH structures. For real-time ray tracing of dynamic scenes, grid-based acceleration structures can therefore be a preferred choice. In addition, since more DMA overhead is expected for secondary rays, double buffering and SIMD intersection tests could be extended to handle secondary rays in grid-based structures. In summary, grid-based structures have much potential on modern processor architectures that embed multicore accelerators with explicitly managed memories.

REFERENCES

1. K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, The landscape of parallel computing research: a view from Berkeley, Technical Report No. UCB/EECS, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley.
2. H. P. Hofstee, Power efficient processor architecture and the cell processor, in Proceedings of International Symposium on High-Performance Computer Architecture, 2005.
3. S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick, The potential of the cell processor for scientific computing, in Proceedings of International Conference on Computing Frontiers, 2006.
4. T. J. Purcell, I. Buck, W. R. Mark, and P. Hanrahan, Ray tracing on programmable graphics hardware, ACM Transactions on Graphics, Vol. 21, 2002.
5. J. Madruga, Passive head tracking using cell processor, in International Conference and Exhibition on Computer Graphics and Interactive Techniques, youtube.com/watch?v=ryntiyyijbq.
6. T. Ize, I. Wald, and S. Parker, Asynchronous BVH construction for ray tracing dynamic scenes on parallel multi-core architectures, in Proceedings of Eurographics Symposium on Parallel Graphics and Visualization, 2007.
7. S. Parker, W. Martin, P. P. Sloan, P. Shirley, B. Smits, and C. Hansen, Interactive ray tracing, Interactive 3D Graphics, 1999.
8. I. Wald, T. Ize, A. Kensler, A. Knoll, and S. G. Parker, Ray tracing animated scenes using coherent grid traversal, ACM Transactions on Graphics, Vol. 25, 2006.
9. I. Wald, Realtime Ray Tracing and Interactive Global Illumination, Ph.D. Thesis, Department of Computer Science, Saarland University.
10. I. Wald, W. R. Mark, J. Günther, S. Boulos, T. Ize, W. A. Hunt, S. G. Parker, and P. Shirley, State of the art in ray tracing animated scenes, Computer Graphics Forum, Vol. 28, 2009.
11. T. Akenine-Möller, E. Haines, and N. Hoffman, Real-Time Rendering, 3rd ed., A. K. Peters Ltd.
12. D. Kirk and J. Arvo, Improved ray tagging for voxel-based ray tracing, Graphics Gems II, 1991.
13. K. Dmitriev, V. Havran, and H. P. Seidel, Faster ray tracing with SIMD shaft culling, Research Report No. MPI-I, Max-Planck-Institut für Informatik.
14. J. Cleary, B. Wyvill, G. Birtwistle, and R. Vatti, Design and analysis of a parallel ray tracing computer, in Proceedings of Simula Users Conference, 1984.
15. S. Parker, M. Parker, Y. Livnat, P. P. Sloan, C. Hansen, and P. Shirley, Interactive ray tracing for volume visualization, IEEE Transactions on Visualization and Computer Graphics, Vol. 5, 1999.
16. C. Benthin, I. Wald, M. Scherbaum, and H. Friedrich, Ray tracing on the cell processor, in Proceedings of IEEE Symposium on Interactive Ray Tracing, 2006.
17. J. Sugerman, T. Foley, S. Yoshioka, and P. Hanrahan, Ray tracing on a cell processor with software caching, in IEEE Symposium on Interactive Ray Tracing, 2006.
18. T. Chen, Z. Sura, and K. O'Brien, Optimizing the use of static buffers for DMA on a Cell chip, in Proceedings of International Workshop on Languages and Compilers for Parallel Computing, 2006.
19. Stanford Computer Graphics Laboratory, The Stanford Models, The Stanford 3D Scanning Repository.
20. B. Minor, G. Fossum, and V. To, Terrain rendering engine (TRE): Cell broadband engine optimized real-time ray-caster, IBM White Paper.
21. I. Wald, S. Boulos, and P. Shirley, Ray tracing deformable scenes using dynamic bounding volume hierarchies, ACM Transactions on Graphics, Vol. 26, 2007, Art. 6.
22. B. Minor, M. Nutter, and J. Madruga, irt: An interactive ray tracer for the cell be processor, IBM White Paper.

Kyunghee Cho received the B.S. degree in Electrical Engineering from Hanyang University in 2007 and the M.S. degree in Computer Science from Korea Advanced Institute of Science and Technology (KAIST) in 2009. After graduation, he joined S-Core, Korea as an engineering staff member. His research interests are in the field of compiler optimizations for graphics applications. Currently, he investigates optimization opportunities in the OpenGL runtime libraries on embedded systems.

Seonggun Kim received the B.S. degree in Electrical Engineering and the Ph.D. degree in Computer Science from Korea Advanced Institute of Science and Technology (KAIST) in 2004 and 2010, respectively. He is currently a post-doctoral research associate at Sungkyunkwan University. His research interests are in the field of compiler techniques to automatically generate SIMD code and improve the memory locality for a broad range of applications.

Hwansoo Han received the B.S. and the M.S. degrees in Computer Engineering from Seoul National University, Korea in 1993 and 1995, and the Ph.D. degree in Computer Science from the University of Maryland at College Park. He is currently an Associate Professor at Sungkyunkwan University. Previously, he was with Korea Advanced Institute of Science and Technology (KAIST) and Intel. His research interests include compiler technology for high-performance computing and embedded computing.
