Bandwidth and Memory Efficiency in Real-Time Ray Tracing


Pedro Filipe Vitorino Castro Lousada

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisors: Prof. João António Madeiras Pereira, Dr. Vasco Alexandre da Silva Costa

Examination Committee
Chairperson: Prof. Nuno João Neves Mamede
Supervisor: Prof. João António Madeiras Pereira
Member of the Committee: Prof. Fernando Birra

October 2016


Acknowledgments

I would like to start by thanking my family for all the support, encouragement and motivation they have provided throughout my academic years. I also want to thank Prof. João Madeiras and Dr. Vasco Costa for providing expertise, guidance and insight during the development of this thesis. Last but not least, I want to thank all my friends and colleagues for their words of encouragement during the past months.

Abstract

Real-time ray tracing has received a lot of attention in recent years in the academic and research community. Several novel algorithms have appeared that parallelize different aspects of the ray tracing algorithm through the use of a GPU, among them the construction of bounding volume hierarchies (BVHs). We believe that recent approaches have failed to consider the cost of memory accesses on the GPU and how that cost affects the overall performance of the application. In this work we show that by reducing memory bandwidth and footprint we are able to achieve significant improvements in BVH traversal times. We do this by compressing the BVH and the triangle mesh in parallel after they are created in each frame, and then decompressing them as needed while traversing the BVH.

Keywords: Ray Tracing, Bounding Volume Hierarchy, Parallelization, GPU, Quantization, Memory

Resumo

In recent years, real-time ray tracing has received considerable attention from the academic and research community. New algorithms have emerged that parallelize different aspects of the ray tracing algorithm through the use of a GPU, most notably the construction of BVHs. In our view, these approaches have not considered the importance and the performance impact of accesses to the GPU's global memory. With this work we intend to demonstrate that reducing the use of memory bandwidth yields significant reductions in BVH traversal time. We explore techniques for compressing the BVH and the polygon mesh after they are built in each frame, and decompress them as they are needed to perform intersection tests.

Keywords: Ray Tracing, Bounding Volume Hierarchy, Parallelization, GPU, Quantization, Memory

Contents

Acknowledgments
Abstract
Resumo
Contents
List of Figures
List of Tables
1 Introduction
    1.1 Motivation
    1.2 Problem Description
    1.3 Goals
    1.4 Contributions
    1.5 Work Outline
2 Background and State of the Art
    2.1 Background
        2.1.1 Ray Tracing
        2.1.2 GPU Computing
        2.1.3 Acceleration Structures
        2.1.4 Kd-Trees
        2.1.5 Uniform Grids
        2.1.6 Bounding Volume Hierarchies
    2.2 State of the Art
        2.2.1 Parallel Construction of Bounding Volume Hierarchies
        2.2.2 Random-Accessible Compressed Bounding Volume Hierarchies
3 Approach and Architecture
    3.1 Construction of the BVH
        Binary Radix Tree Properties
        Morton Codes
        Parallel Construction of BVH
        Calculating Bounding Volumes
    3.2 Compression
        BVH Compression
        Mesh Compression
    Rendering
4 Evaluation
    Evaluation Methodology
    Experimental Scenes
    Hypothesis
    Experimental Results and Analysis
    Number of Ray-Box Intersection Tests
    Number of Ray-Triangle Intersection Tests
    Global Memory Reads
    Bounding Volume Hierarchy Total Size
    Kernel Execution Time
    Frames Per Second
5 Conclusions
    Future Work
Bibliography
A Data Structures
B BVH Traversal Kernel Code

List of Figures

2.1 Ray Tracing Diagram
2.2 Difference between rasterization and ray tracing (source: Intel). This figure shows how ray tracing can better simulate soft shadows and reflections
2.3 Example of a typical CUDA processing flow (source: NVIDIA)
2.4 Memory hierarchy in CUDA (source: NVIDIA)
2.5 A visual representation of a kd-tree. P1 through P4 represent planes of space division
2.6 Five rays traversing a grid
2.7 A visual representation of a Bounding Volume Hierarchy
2.8 Example 2-D Morton code ordering of triangles with the first two levels of the hierarchy. Blue and red bits indicate x and y axes, respectively. Source: [1]
3.1 A visual representation of an ordered radix tree. The numbers 0-7 act as keys (and, as such, as leaf nodes) of the tree. Each internal node represents a common range prefix of the binary value of all the leaf nodes below it. Notice we have N − 1 internal nodes for N keys
3.2 Centroid of a triangle
3.3 Interweaving of bits in order to create a Morton code
3.4 Bar representation of the keys covered by each internal node
3.5 Pseudocode for the parallel construction of a BVH (source: [2])
3.6 Resulting BVH created in parallel around Stanford's Bunny
3.7 Ray-box intersection errors caused by not using a quantized parent bounding box
3.8 Mesh deformation caused by quantization of triangle vertices. Notice the small black dots between triangles. Image is a closeup of the tip of the ear of Stanford's Bunny
4.1 CORNELL test scene
4.2 STANFORD BUNNY test scene
4.3 ASIAN DRAGON test scene
4.4 Number of ray-box intersection tests in CORNELL
4.5 Number of ray-box intersection tests in STANFORD BUNNY
4.6 Number of ray-box intersection tests in ASIAN DRAGON
4.7 Number of ray-triangle intersection tests in CORNELL
4.8 Number of ray-triangle intersection tests in STANFORD BUNNY
4.9 Number of ray-triangle intersection tests in ASIAN DRAGON
4.10 Estimated memory read during intersection tests in CORNELL in each frame
4.11 Estimated memory read during intersection tests in STANFORD BUNNY in each frame
4.12 Estimated memory read during intersection tests in ASIAN DRAGON in each frame
4.13 Frames per second in each test scene

List of Tables

4.1 Size of internal node, leaf node and triangle in each algorithm
4.2 Kernel execution time for CORNELL frame
4.3 Kernel execution time for STANFORD BUNNY frame
4.4 Kernel execution time for ASIAN DRAGON frame

Chapter 1
Introduction

1.1 Motivation

Real-time rendering concerns itself with generating synthetic images fast enough that the viewer can interact with a virtual environment. An image appears on screen, the viewer acts or reacts, and this feedback affects what is generated next.

Two of the more popular approaches to synthetic image generation are rasterization and ray tracing. Both have been used in computer graphics for decades, and each allows us to generate 2D images from 3D scenes composed of virtual objects.

Rasterization quickly became the dominant approach for real-time applications. Its initially low computational requirements, its ability to move incrementally from software to hardware, and the ever-increasing performance of dedicated graphics hardware made it the most attractive option for generating images at high frame rates [3]. Ray tracing, by contrast, received little attention outside the academic world: its high computational cost made it a much slower and more expensive approach than rasterization.

Ray tracing nevertheless offers a fairly long list of advantages over rasterization. It can easily simulate nonlocal effects such as shadows, reflections and refraction. In rasterization, reflections and shadows are hard to compute, and refractions are harder still. Due to its mathematical correctness, ray tracing can make generated images look more realistic. Real-time applications such as games, simulations and even medical software can greatly benefit from this. It is only a matter of being able to generate the images at the rate the application requires.

1.2 Problem Description

At its core, the ray tracing algorithm works as follows: for each pixel of the display, cast a ray that propagates in a straight line until it intersects an element of the scene being rendered. The intersection is then used to determine the color of the pixel as a function of the intersected element's surface. For a simple scene with no secondary rays (i.e. reflections, shadows, refractions, etc.), this means performing at least N × M intersection tests (N being the number of rays and M the number of polygons in the scene).

By its nature, ray tracing leads to a high number of expensive floating-point operations. Combine this with an irregular, high-bandwidth memory access pattern and performance becomes an issue when trying to achieve real-time rendering.

One way to optimize the standard ray tracing algorithm is through the use of acceleration structures, which lower the number of intersection tests needed to render an image. Currently, Bounding Volume Hierarchies (BVHs) show the most promise. The state of the art for this kind of structure has focused on generating it fast enough to be usable in a real-time application, typically by creating a new BVH or adapting an existing one at each frame. Most algorithms use a GPU to achieve the performance they need. Despite their ability to perform a high number of floating-point operations per second, GPUs tend to suffer from slow memory accesses. Most algorithms that construct a BVH in parallel overlook this and are careless with memory accesses and bandwidth, resulting in less than optimal overall performance.

1.3 Goals

Our goal with this work is to address the lack of attention given to memory accesses in the recent state of the art. We believe that by concerning ourselves with GPU memory footprint and bandwidth efficiency we are able to improve performance. In this work we aim to develop a ray tracing application that supports one of the latest state-of-the-art algorithms for parallel BVH construction, and to improve it further through memory compression techniques.

1.4 Contributions

Our contribution is a novel algorithm based on existing techniques for real-time BVH construction, with a focus on BVH and triangle mesh compression.
We are able to reduce the total amount of memory used to store both the BVH and the triangle mesh, as well as the memory bandwidth of the application. We see improvements of up to 40% in occupied memory and reductions of up to 20% in memory bandwidth. These improvements in turn decrease the time it takes to generate each frame, namely by accelerating the traversal of the BVH.

1.5 Work Outline

This document is divided into 5 chapters. In Chapter 2 we briefly cover the basis of the ray tracing algorithm. We mention the advantages of GPU computing and provide some insight into how it works. We also give a brief overview of the use of acceleration structures in ray tracing and mention their advantages and disadvantages. Still in that chapter, we look into the state of the art in parallel construction of BVHs and in BVH compression. In Chapter 3 we describe our algorithm and techniques in detail. Chapter 4 presents the evaluation methods we use and showcases our results. Finally, Chapter 5 is where we make our concluding remarks and discuss potential ideas for future work.

Chapter 2
Background and State of the Art

To better contextualize our work, this chapter delves into the topics that this work focuses on. We first look into the standard ray tracing algorithm as presented by T. Whitted [4]. We describe the general idea behind the algorithm and provide pseudocode for its implementation. We then look into GPU computing, exploring the architectural model and the advantages we gain through the use of a GPU. After this we go over the most commonly used acceleration structures, analyze their advantages and disadvantages, and justify our choice of the one we consider most suitable. Finally, we look into the state of the art in the parallel construction of Bounding Volume Hierarchies on GPUs and in their compression.


2.1 Background

2.1.1 Ray Tracing

First introduced by T. Whitted [4], ray tracing is a technique based on global illumination shading. It uses ray casting [5] to intersect eye rays with objects in a computer-simulated scene.

Figure 2.1: Ray Tracing Diagram

As shown in Figure 2.1, ray tracing works by creating a series of Primary Rays with their origin at the Spectator and targeted at a virtual Window X pixels wide and Y pixels high, depending on the resolution of the screen or of the ray tracing application window. Primary Rays traverse the scene until they encounter a Scene Object. When an object is hit by a Primary Ray, two things happen: Shadow Rays are generated from the intersection point to check whether the area is shaded and, if the material of the object is reflective or refractive, Secondary Rays are created to traverse the scene once more in order to produce secondary visual effects such as reflections and refractions. Pseudocode for this algorithm can be found in Algorithm 1.

Algorithm 1 Whitted Ray Tracing Algorithm

    while true do
        for each pixel p do
            ray ← Ray(camera.origin, camera.direction, p)
            pixelcolor ← CastRay(ray)
        end for
    end while

    function CASTRAY(ray)
        if ray.depth > MAX_DEPTH then
            return BLACK
        end if
        intersection ← IntersectRayWithPrimitives(ray)
        if intersection = null then
            return backgroundcolor
        end if
        outputcolor ← Color(intersection)
        if ObjectIsReflective(intersection) then
            reflectionray ← ReflectRay(ray, intersection)
            reflectionray.depth ← ray.depth + 1
            outputcolor ← outputcolor + CastRay(reflectionray) × intersection.reflective_coef
        end if
        if ObjectIsRefractive(intersection) then
            refractionray ← RefractRay(ray, intersection)
            refractionray.depth ← ray.depth + 1
            outputcolor ← outputcolor + CastRay(refractionray) × intersection.refractive_coef
        end if
        return outputcolor
    end function

    function COLOR(intersection)
        color ← black
        for each light l do
            shadowray ← ShadowRay(intersection.intersectionPoint, l)
            if IntersectRayWithPrimitives(shadowray) = null then
                color ← color + diffusecomponent
                color ← color + specularcomponent
            end if
        end for
        return color
    end function
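The Ray(camera.origin, camera.direction, p) step of Algorithm 1 can be made concrete with a small C++ sketch. It is only an illustration under stated assumptions: the pinhole Camera type, the choice of a camera looking down the negative z axis, the vertical field-of-view parameter and the function name primaryRay are all hypothetical and do not come from the thesis.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

struct Ray {
    Vec3 origin;
    Vec3 direction;  // unit length
    int depth;       // recursion depth, 0 for primary rays
};

// Hypothetical pinhole camera: sits at `origin` and looks down -z, with the
// virtual window one unit in front of it.
struct Camera {
    Vec3 origin;
    float verticalFovRadians;
};

// Builds the primary ray through the centre of pixel (px, py) of a
// width x height window. Pixel (0, 0) is the top-left corner.
Ray primaryRay(const Camera& cam, int px, int py, int width, int height) {
    assert(width > 0 && height > 0);
    assert(px >= 0 && px < width && py >= 0 && py < height);

    const float aspect = static_cast<float>(width) / static_cast<float>(height);
    const float halfHeight = std::tan(cam.verticalFovRadians * 0.5f);
    const float halfWidth = halfHeight * aspect;

    // Map the pixel centre to [-1, 1] on both axes (y flipped so +y is up).
    const float u = ((px + 0.5f) / width) * 2.0f - 1.0f;
    const float v = 1.0f - ((py + 0.5f) / height) * 2.0f;

    const Vec3 d{u * halfWidth, v * halfHeight, -1.0f};
    const float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    return Ray{cam.origin, Vec3{d.x / len, d.y / len, d.z / len}, 0};
}
```

Calling primaryRay once per pixel yields the X × Y primary rays described above; in a GPU implementation each thread would typically compute the ray for its own pixel.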


The main difference between rasterization and ray tracing is the ability to calculate effects such as shadows, reflections and refractions. Whereas rasterization engines only approximate these effects through the use of texturing, ray tracing takes a more mathematical approach. Since these effects are highly geometry dependent, the approximations look rather unrealistic when the object position or the viewer position changes. The difference in ability to simulate secondary visual effects between rasterization and ray tracing can be seen in Figure 2.2.

Figure 2.2: Difference between rasterization and ray tracing (source: Intel). This figure shows how ray tracing can better simulate soft shadows and reflections.

2.1.2 GPU Computing

There are two options when it comes to GPU computing: CUDA [6] and OpenCL. We chose to work with CUDA since it provides a vast array of libraries, has the best integration with modern IDEs, and also provides the best profiling and debugging tools.

Modern GPUs are what we call SIMD (Single Instruction Multiple Data) devices, meaning that they can perform the same operation on multiple data points simultaneously. Even though a modern CPU core can dispatch more operations per second than a GPU core, the GPU as a whole can vastly outperform a CPU by having thousands of cores running the same operations versus the 4-8 cores present in the CPU.

The idea behind GPU-accelerated computing is to use a CPU and a GPU together in a heterogeneous co-processing computing model. The sequential part of the application runs on the CPU and the computationally intensive part runs on the GPU [7]. The massive parallelism of programmable GPUs naturally lends itself to inherently parallel problems such as ray tracing.

A GPU-accelerated application consists of two main components: a host and a device. The host refers to the CPU and the system's memory, while the device refers to the GPU and its memory. Code running on the host can manage memory on both the host and the device, and also launches kernel functions, which are executed on the device. A typical sequence of operations for a CUDA program consists of (Figure 2.3):

1. Declare and allocate host and device memory, and initialize the host's data
2. Transfer data from host memory to device memory
3. Execute kernel functions
4. Transfer results from device memory to host memory

A given kernel executes a single program across many parallel threads. Typically, each kernel completes execution before the next kernel begins, with an implicit barrier synchronization between kernels. GPUs support executing multiple independent kernels simultaneously, but many kernels are large enough to fill the entire device.

Threads are decomposed into thread blocks; threads within a given block may efficiently synchronize with each other and have shared access to per-block on-chip memory [1]. Each thread also has access to private memory visible only to that thread. Finally, all threads have access to the same global memory (Figure 2.4).

Figure 2.3: Example of a typical CUDA processing flow (source: NVIDIA)

Figure 2.4: Memory hierarchy in CUDA (source: NVIDIA)

2.1.3 Acceleration Structures

One of the major problems of ray tracing is the sheer number of intersection tests one must perform to check whether a given ray intersects an object of the scene. Acceleration structures reduce the number of intersection tests by organizing the scene geometry into a data structure that can be searched efficiently. Their organization is typically hierarchical, loosely meaning that the topmost level encloses the levels below it, and so on.

An acceleration structure is typically built in a preprocessing step before rendering. This is because construction is rather expensive and, for static scenes, the structure can be reused without modification. Despite this cost, applications tend to benefit from acceleration structures, as queries typically go from O(n) to O(log n).

Historically, ray tracing research has concentrated mostly on the effectiveness of acceleration structures (i.e., reducing the number of primitive operations) and on the efficiency of the traversal and intersection operations. The time needed to build these data structures was typically ignored, since it is usually insignificant in a non-real-time application. As ray tracing started to reach interactive rendering performance, build times could no longer be ignored. This opened up an entirely new research challenge: how to create build algorithms that are fast enough and flexible enough to be used at interactive frame rates [8].

The choice of data structure is strongly correlated with the kind of scene and animation the application has. According to Wald et al. [8], we can classify each scene's animation as follows:

Static - A scene whose geometry does not change at all from frame to frame.
Partially static - A scene in which a certain amount of the primitives are moving, while other parts remain static, such as a few characters moving through an otherwise static game environment.
Hierarchical - The scene's primitives can be partitioned into groups of primitives such that all of the primitives of a given group are subject to the same linear or rigid-body deformation.
Semi-hierarchical - A scene that can be partitioned into objects whose motion is primarily hierarchical, plus some small amount of incoherent motion within each object.
Deformable - Changes can occur to the vertices of a primitive but its connectivity is unchanged.
Arbitrary - Arbitrary changes to the scene topology may occur.

Despite this categorization, applications often mix different kinds of animation. In games, for example, most objects will have hierarchical animation. The motion of the objects themselves may be an example of incoherent motion (e.g. characters appearing and disappearing). In addition, characters often have skeleton animations, providing a good example of semi-hierarchical motion. It is likely that no single technique will handle all kinds of motion equally well.

As one would expect, the choice of an acceleration structure strongly affects traversal performance, and can also facilitate (or inhibit) the choice of certain algorithms for updating or rebuilding the structure. In the current state of the art, grids [9], kd-trees [10][11][12], and BVHs [13][1][2] all feature fast construction and traversal algorithms that support animated scenes. All of them are suitable candidates, so we briefly look into each one and discuss its advantages and disadvantages in order to justify our choice of a BVH.

2.1.4 Kd-Trees

At its core, a kd-tree is a binary tree that partitions space along a series of planes. Each non-leaf node of the tree corresponds to a plane: the points to the left of the plane are sent to the left subtree and the points to the right of the plane are sent to the right subtree. A visual representation of this can be seen in Figure 2.5.

Kd-trees are considered by many to be the best known method for accelerating ray tracing of static scenes, as they have the fastest reported times for primary and shadow rays. Finding the first intersection point along a ray is simpler with a kd-tree than with any other space partitioning data structure. Each volume of space is represented just once, so the traversal algorithm can visit these voxels in strict front-to-back order and can exit early as soon as any intersection is found.

Their primary disadvantage for dynamic scenes is that rebuilding or updating the structure each frame is very costly. Work has been done on methods that allow a quick, parallelized rebuild of kd-trees [10][11][12]. However, since this is a relatively recent problem, these methods are still somewhat rudimentary and lacking in performance.

If the memory used by the acceleration structure is not a problem, then a kd-tree is often the most effective solution for ray tracing large scenes. The problem for us is that GPU memory still is not an abundant resource, so kd-tree-based ray tracers suffer from consuming large amounts of memory. During the construction of a kd-tree each plane is placed on a median point in the scene, more often than not splitting objects and primitives. This splitting increases memory usage, since each split primitive must now be referenced in 2 or more different cells. The more finely a scene is subdivided, the more memory the kd-tree will use.

Figure 2.5: A visual representation of a kd-tree. P1 through P4 represent planes of space division.
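The left/right descent described for kd-trees can be sketched in a few lines of C++. This is a minimal illustration, not the layout used by any of the cited kd-tree builders: the flat KdNode array, the convention that leaves carry axis == -1, and the name locateLeaf are assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

struct Point3 { float c[3]; };  // c[0] = x, c[1] = y, c[2] = z

// Hypothetical flat node layout: inner nodes store an axis-aligned splitting
// plane, leaves are marked with axis == -1.
struct KdNode {
    int axis;     // 0, 1 or 2 for inner nodes, -1 for leaves
    float split;  // plane position along `axis`
    int left;     // index of the child on the lower side of the plane
    int right;    // index of the child on the upper side of the plane
};

// Descends from the root (stored at index 0) to the leaf whose cell contains
// `p` and returns that leaf's index in `nodes`.
int locateLeaf(const std::vector<KdNode>& nodes, const Point3& p) {
    assert(!nodes.empty());
    int current = 0;
    while (nodes[current].axis >= 0) {
        const KdNode& n = nodes[current];
        current = (p.c[n.axis] < n.split) ? n.left : n.right;
    }
    return current;
}
```

Because every point belongs to exactly one cell, the descent visits a single node per level; a primitive that straddles a plane, however, has to be referenced from the cells on both sides, which is the memory cost discussed above.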


2.1.5 Uniform Grids

Grids are the simplest form of acceleration structure: space is divided into evenly sized blocks. While both kd-trees and BVHs are hierarchical structures, grids fall into the category of uniform spatial subdivision.

Grids are very easy to build, as it is simply a matter of directly calculating which cell a point falls into. This makes them well suited to animated scenes, since they can be quickly rebuilt every frame regardless of the kinds of animation present in the scene. The construction process can also be easily parallelized, since there are no hierarchical constraints when it comes to filling in the cells.

Memory bandwidth proves to be a major bottleneck in parallel grid rebuilds, as shown by Ize et al. [9]. When using grids, little computation is performed while a large amount of data needs to be processed. With high bandwidth usage and low computation, we forfeit the advantages gained by using a GPU. And, much like in kd-trees, primitives end up falling into multiple cells as their vertices are divided in the construction process. A visual representation of a uniform grid can be seen in Figure 2.6.

Figure 2.6: Five rays traversing a grid

2.1.6 Bounding Volume Hierarchies

Bounding Volume Hierarchies are tree-like structures that subdivide a scene into smaller segments. Whereas kd-trees and grids divide the scene spatially, a Bounding Volume Hierarchy divides the scene around its objects. All objects are wrapped in individual Bounding Volumes that form the leaf nodes of the tree. Bounding Volumes are then recursively enclosed by larger Bounding Volumes until we are left with a single Bounding Volume that wraps around the entire scene.
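The recursive enclosing of Bounding Volumes can be sketched as follows, assuming axis-aligned boxes as the bounding volume and a binary tree stored in a flat array. The Aabb and BvhNode layouts and the function names merge and fitBounds are hypothetical choices for this illustration; they are not the data structures used later in this work.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Aabb {
    float min[3];
    float max[3];
};

// Smallest axis-aligned box that encloses both inputs.
Aabb merge(const Aabb& a, const Aabb& b) {
    Aabb out;
    for (int i = 0; i < 3; ++i) {
        out.min[i] = std::min(a.min[i], b.min[i]);
        out.max[i] = std::max(a.max[i], b.max[i]);
    }
    return out;
}

// Hypothetical node layout: leaves have left == right == -1 and already carry
// the box of their primitive; inner nodes always have two children.
struct BvhNode {
    Aabb box;
    int left;
    int right;
};

// Recomputes the box of every inner node below `index` from its children,
// bottom-up, without touching the tree topology.
void fitBounds(std::vector<BvhNode>& nodes, int index) {
    assert(index >= 0 && index < static_cast<int>(nodes.size()));
    BvhNode& n = nodes[index];
    if (n.left < 0 && n.right < 0) return;  // leaf: box is already valid
    fitBounds(nodes, n.left);
    fitBounds(nodes, n.right);
    n.box = merge(nodes[n.left].box, nodes[n.right].box);
}
```

Calling fitBounds on the root leaves it holding a box around the whole scene. The same routine can be rerun after leaf boxes change, since it only rewrites boxes and never the parent-child links.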
Bounding Volume Hierarchies were long ignored in real-time ray tracing, as they were thought to be unable to compete with kd-trees and grids with respect to render time, mainly due to slower traversal speeds. However, given continuous research [13][1][2] and their advantages with respect to dynamic scenes, they are now a viable alternative to other acceleration structures.

In a typical object hierarchy data structure, it is easy to update the data structure as an object moves, because the object lives in just one node and the bounds for that node can be updated with relatively simple and localized update operations. For deformable scenes, just re-fitting a Bounding Volume Hierarchy (i.e., recomputing the bounding volumes of the hierarchy's nodes, but not changing the hierarchy itself) is sufficient to produce a valid Bounding Volume Hierarchy for the new frame. Bounding Volume Hierarchies also allow for incremental changes to the hierarchy [8].

In space subdivision each primitive can be referenced from multiple cells, resulting in higher memory usage than one would ideally like. In Bounding Volume Hierarchies, by contrast, each primitive is referenced exactly once, allowing us to save GPU memory and bandwidth. Spatial subdivision often leads to visiting the same object several times along the ray, which cannot happen with an object hierarchy. The same is true for empty cells, which occur frequently in spatial subdivision but simply do not exist in object hierarchies.

The effectiveness of a Bounding Volume Hierarchy depends on the actual hierarchy the build algorithm produces. As with kd-trees, the best results seem to be achieved using top-down builds based on the surface area heuristic (SAH).¹

Figure 2.7: A visual representation of a Bounding Volume Hierarchy.

¹ The SAH provides a method for estimating the cost of a structure for ray tracing. It assumes an infinite set of uniformly distributed rays, under which the probability of a ray hitting a (convex) volume V is proportional to the surface area of that volume.
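To make the SAH footnote concrete, the sketch below computes the surface area of an axis-aligned box and the resulting estimate that a ray which hits a parent box also hits a child box contained in it. The splitCost function follows the commonly used form of the heuristic (a traversal cost plus the expected intersection cost of both children); the two cost constants are hypothetical tuning parameters, and all names here are assumptions made for the example.

```cpp
#include <cassert>

struct Aabb {
    float min[3];
    float max[3];
};

// Surface area of an axis-aligned box.
float surfaceArea(const Aabb& b) {
    const float dx = b.max[0] - b.min[0];
    const float dy = b.max[1] - b.min[1];
    const float dz = b.max[2] - b.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Under the SAH assumptions, the probability that a ray hitting `parent`
// also hits `child` (child contained in parent) is the ratio of their areas.
float hitProbability(const Aabb& child, const Aabb& parent) {
    const float parentArea = surfaceArea(parent);
    assert(parentArea > 0.0f);
    return surfaceArea(child) / parentArea;
}

// Estimated cost of splitting `parent` into two children holding leftCount
// and rightCount primitives. The cost constants are illustrative defaults.
float splitCost(const Aabb& parent,
                const Aabb& left, int leftCount,
                const Aabb& right, int rightCount,
                float traversalCost = 1.0f, float intersectCost = 1.0f) {
    return traversalCost +
           intersectCost * (hitProbability(left, parent) * leftCount +
                            hitProbability(right, parent) * rightCount);
}
```

A top-down builder would evaluate splitCost for several candidate partitions of a node and keep the cheapest one, which is what makes SAH builds expensive compared with the sorting-based approaches discussed next.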

2.2 State of the Art

2.2.1 Parallel Construction of Bounding Volume Hierarchies

Generating an acceleration structure is a necessary step when trying to achieve real-time performance. One can take two approaches: construct a new structure every frame, or reuse the same structure and adjust it at each frame. The latter option has been explored recently [14] but still lacks performance when processing large trees or when large modifications to the data structure are needed.

Most approaches [15] that construct a new hierarchy at each frame have relied on serial algorithms running on the CPU to build the necessary hierarchical acceleration structures. While this was once a necessary restriction due to architectural limitations, modern GPUs provide all the facilities needed to implement hierarchy construction directly. Doing so should provide a strong benefit, as building structures directly on the GPU avoids the relatively expensive latency introduced by frequently copying data structures between CPU and GPU memory spaces.

Lauterbach et al. [1] were the first to explore this approach. They introduced a novel algorithm that uses spatial Morton codes [16] to reduce the construction of a BVH to a simple sorting problem. Morton codes are used to determine a primitive's order along a space-filling curve, and can be computed directly from a primitive's geometric coordinates. Lauterbach et al. expect each input primitive to be represented by an AABB² and the enclosing AABB of the entire input geometry to be known. By taking the barycenter of each primitive's AABB as its representative point, and by quantizing each of the 3 coordinates of each representative point into a k-bit integer, a 3k-bit Morton code is constructed by interleaving the successive bits of these quantized coordinates. Figure 2.8 shows a 2D representation of this. Sorting the Morton codes automatically lays the associated points in order along a Morton curve.
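The quantize-and-interleave step can be sketched in C++ for k = 10, which gives a 30-bit code that fits in a 32-bit integer. The choice of k, the bit order (x occupying the most significant bit of each triple), and the assumption that coordinates are already normalized to [0, 1] relative to the scene's enclosing AABB are made for this example only; the bit-spreading function uses a widely known multiply-and-mask trick.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Spreads the lower 10 bits of v so that two zero bits separate each
// original bit: bit i of the input ends up at bit 3 * i of the result.
std::uint32_t expandBits(std::uint32_t v) {
    assert(v < 1024u);
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point whose coordinates are normalized to [0, 1]
// relative to the scene's enclosing AABB (an assumption of this sketch).
std::uint32_t morton3D(float x, float y, float z) {
    // Quantize each coordinate to a 10-bit integer in [0, 1023].
    x = std::min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    y = std::min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    z = std::min(std::max(z * 1024.0f, 0.0f), 1023.0f);
    const std::uint32_t xx = expandBits(static_cast<std::uint32_t>(x));
    const std::uint32_t yy = expandBits(static_cast<std::uint32_t>(y));
    const std::uint32_t zz = expandBits(static_cast<std::uint32_t>(z));
    // Interleave: x takes the highest bit of each 3-bit group, z the lowest.
    return xx * 4u + yy * 2u + zz;
}
```

Computing morton3D for the barycenter of every primitive and sorting the primitives by the resulting key is the sorting problem that BVH construction is reduced to; each primitive's code is independent of the others, so the codes can be computed by one thread per primitive.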
Sorting by Morton code also orders the corresponding primitives in a spatially coherent way. Because of this, sorting geometric primitives according to their Morton codes has been used to improve cache coherence, as a ray that hits a certain primitive will likely hit the primitives adjacent to it.

The main problem with the work of Lauterbach et al. and others [15][17] is that a series of sequential steps must be taken in order to build the resulting BVH. Lauterbach et al. create their BVH by sequentially examining each bit of each Morton code and grouping the primitives according to the value of that bit. At each level, if the bit has value 0 the primitive is placed in one group; if the bit has value 1 it is placed in the opposite group. Garanzha et al. take a different approach by generating one level of nodes at a time, starting from the root. They process the nodes of the BVH on a given level in parallel, using binary search to partition the primitives contained within each node. The resulting child nodes are then enumerated using an atomic counter and processed in the next round.

Karras [2] developed this line of thought further by providing a way to build a BVH in a fully parallel manner. Karras introduces an in-place algorithm for constructing a binary radix tree (also called a Patricia tree), which can be converted directly into a BVH.

² AABB - Axis-aligned minimum bounding box

He bases his approach on the fact that, for a scene with N primitives, we know we can build a Patricia tree with N − 1 internal nodes to represent it. He analyzes the similarities between each Morton code and its neighbours to determine the position of each internal node in the Patricia tree. The child-parent association is then calculated based on the range of bits shared with the rest of the Morton codes. This is explained in further detail in Chapter 3.

Figure 2.8: Example 2-D Morton code ordering of triangles with the first two levels of the hierarchy. Blue and red bits indicate x and y axes, respectively. Source: [1]

2.2.2 Random-Accessible Compressed Bounding Volume Hierarchies

A worsening trend is that data access speed is growing significantly more slowly than processing speed. This often forces an application to wait a considerable amount of time for data to arrive. Several approaches have been developed to address this problem. Unfortunately, with most of them a compressed BVH can only be accessed or traversed after first decompressing it in its entirety.

Kim et al. [18] developed a novel compact representation of a BVH that makes it possible to access the BVH randomly. The authors create a cluster-based, layout-preserving BVH compression and decompression method. During compression, consecutive BVs are decomposed into a set of clusters, each of which serves as an access point for random access at runtime.

Despite its good compression results (compression factors of up to 12:1), we felt this approach was not suitable for a real-time ray tracing application, as the compression mechanism is a very heavy and time-consuming process. Adding this compression time on top of the construction of the BVH and the intersection tests in each frame would result in a very low frame rate.


Chapter 3

Approach and Architecture

In order to achieve a high frame rate one must reduce the time it takes to generate an image at each frame. For this we make use of a GPU to accelerate the floating point operations required by the intersection process of a ray tracer. We aim to reduce the memory bandwidth and memory footprint of our application. We make use of research in the area of parallel BVH construction on the GPU to reduce the number of intersection tests performed (thus reducing memory bandwidth as well). We then explore BVH and triangle mesh compression techniques. With a compressed BVH and triangle mesh we can expect our rendering kernels to achieve lower rendering times, as less data will be transferred between the GPU's global memory and the kernel's local memory. All of this, which defines our proposed solution model, will be addressed in further detail in the following sections.

Before we delve into the contents of the sections below we must note that we describe in detail only part of the application's algorithm, more specifically what happens inside the GPU's kernels. As referenced earlier, our application consists of both Host and Device code. The Host code was written in C++ for increased performance. It deals with OpenGL setup and is responsible at each frame for displaying the image generated by the ray tracing kernels. In order to achieve a higher performance level, our kernels write the resulting color of each pixel into a texture located in the GPU's texture memory. This texture is then mapped onto a simple quad for rendering. In order to create an interactive application the camera controls are handled in the Host code. A new position is calculated and sent to the rendering kernels at each frame. The Host code is also responsible for loading the scene's primitives into the GPU's global memory at startup. The Device code is therefore only responsible for building a new BVH, performing the necessary intersection tests, and generating a new image at each frame.


3.1 Construction Of The BVH

3.1.1 Binary Radix Tree Properties

A radix tree is a space-optimized trie often used for indexing string data, although it can also be used to index any data divisible into smaller comparable chunks such as characters, binary numbers, etc. For simplicity, assume from now on that our radix tree only contains binary values and that they are in lexicographical order, as in our application each key will correspond to a sorted Morton code. Given a set of keys $k_0, \ldots, k_{n-1}$ represented as bit strings, a radix tree can be seen as a hierarchical representation of the common bits of each key. The keys are represented in the leaf nodes of the tree, and each internal node corresponds to the longest common prefix shared between the keys in its subtree.

Consider the example in Figure 3.1. As one would expect, the root of the tree covers the full range of keys. At each level the keys are partitioned according to their first differing bit. The first difference occurs between keys $k_3$ and $k_4$; thus the left child of the root node contains keys $k_0$ through $k_3$ and the right child contains $k_4$ through $k_7$. We continue this process to obtain a hierarchical representation of the common prefixes between the keys. At the bottom level of the tree each child references a single key.

Figure 3.1: A visual representation of an ordered radix tree. The numbers 0-7 act as keys (and thus as leaf nodes) of the tree. Each internal node represents the common prefix of the binary values of all the leaf nodes below it. Notice we have N − 1 internal nodes for N keys.
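The partitioning by first differing bit described above can be sketched in a few lines of C++. This is only an illustration under assumptions: the function names `commonPrefixLength`, `findSplit` and `exampleKeys` are ours, the split is found with a simple linear scan, and the eight 5-bit keys are illustrative values chosen so that the root splits between the fourth and fifth key, as in the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of leading bits shared by two 32-bit keys (32 if they are equal).
int commonPrefixLength(uint32_t a, uint32_t b) {
    uint32_t diff = a ^ b;
    int length = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (diff & mask) == 0; mask >>= 1) {
        ++length;
    }
    return length;
}

// Index of the last key in [first, last] that still has a 0 at the first bit
// where keys[first] and keys[last] differ. Keys must be sorted and distinct.
int findSplit(const std::vector<uint32_t>& keys, int first, int last) {
    assert(first < last && keys[first] != keys[last]);
    int nodePrefix = commonPrefixLength(keys[first], keys[last]);
    int split = first;
    // Advance while the next key shares more than the node's prefix with the first key.
    while (split + 1 <= last &&
           commonPrefixLength(keys[first], keys[split + 1]) > nodePrefix) {
        ++split;
    }
    return split;
}

// Illustrative sorted keys (assumption): 00001 00010 00100 00101 10011 11000 11001 11110.
std::vector<uint32_t> exampleKeys() {
    return {0x01, 0x02, 0x04, 0x05, 0x13, 0x18, 0x19, 0x1E};
}
```

With these keys, `findSplit(exampleKeys(), 0, 7)` places the root split after the fourth key, so the left child covers the first four keys and the right child the last four.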

A radix tree is considered a compact data structure as it omits nodes which have only one child, thus removing redundant information and decreasing the overall size of the tree in memory. One property of a binary radix tree is that any tree with n keys has n − 1 internal nodes. This allows us to know, even before we construct the tree, how many nodes, and thus how much memory, we will require.

Assuming a lexicographically ordered tree allows us to express the range of keys covered by any given internal node as $[i, j]$. We use $\delta(i, j)$ to denote the length of the longest common prefix between the keys $k_i$ and $k_j$. The ordering of the keys implies that $\delta(i', j') \geq \delta(i, j)$ for any $i', j' \in [i, j]$ [2]. We can therefore determine the common prefix shared by the keys under a given node by comparing the first and last key it covers: all the keys in between are guaranteed to share the same or a longer prefix.

In practice each internal node partitions the keys under it at their first differing bit, the one following the common prefix of length $\delta(i, j)$. In the range $[k_i, k_j]$, some keys will have this bit set to 0 and the others will have it set to 1. Since we are working with an ordered tree, all of the keys with the bit set to 0 appear before the keys with the bit set to 1. We call the position of the last key where this bit has value 0 the split position, denoted by $\gamma \in [i, j - 1]$. Since the split position is where this bit changes within the range, we can say that $\delta(\gamma, \gamma + 1) = \delta(i, j)$. The ranges $[k_i, k_\gamma]$ and $[k_{\gamma+1}, k_j]$ will be subdivided at their next differing bit, so we know that $\delta(i, \gamma) > \delta(i, j)$ and $\delta(\gamma + 1, j) > \delta(i, j)$.

Taking Figure 3.1 once more as reference, we can see that the first differing bit occurs between keys $k_3$ and $k_4$; the range is then split at $\gamma = 3$, resulting in the subranges $[k_0, k_3]$ and $[k_4, k_7]$. The left child then splits its range $[k_0, k_3]$ at the third bit, $\gamma = 1$.
The right child splits the range $[k_4, k_7]$ at the second bit, $\gamma = 4$, and so on.

3.1.2 Morton Codes

We start our process with a series of primitives represented by 3D points. In order to generate our BVH we first decide in which order the leaf nodes will appear in the tree. A good approach is to sort the leaves according to their position: generally we want primitives close to each other in 3D space to appear close to each other in the tree. To do this we sort them along a space-filling curve, more specifically a Z-order curve. For this we take the centroid of each triangle (Figure 3.2) and express it relative to the bounding volume of the entire scene, which is known after loading the scene's data in the Host code. Let $bvh_{min}$ be the minimum point of the scene's bounding volume and $bvh_{max}$ its maximum. If $c$ is the centroid of a given triangle, then $q$, the same point in coordinates relative to the bounding volume of the scene, is given by:

\[
q = \left( \frac{c.x - bvh_{min}.x}{bvh_{max}.x - bvh_{min}.x},\; \frac{c.y - bvh_{min}.y}{bvh_{max}.y - bvh_{min}.y},\; \frac{c.z - bvh_{min}.z}{bvh_{max}.z - bvh_{min}.z} \right) \tag{3.1}
\]

The coordinate values of $q$ now vary between 0 and 1. We can think of it as a point inside the 3D space delimited by the scene's bounding volume. The closer each coordinate is to 0, the closer the point is to the minimum point of the bounding volume, and the closer it is to 1, the closer it is to its maximum point.

Figure 3.2: Centroid of a triangle.

We now want to express this 3D point as a Morton code. The first step is to transform our point from a continuous space into a discrete one. We achieve this by quantizing each floating point coordinate over the range spanned by the minimum and maximum extremes of the scene's bounding volume. Let $2^p$ represent the precision of our quantization, where $p$ is the number of bits available to represent each integer coordinate:

\[
q = \left( \frac{(c.x - bvh_{min}.x)\, 2^p}{bvh_{max}.x - bvh_{min}.x},\; \frac{(c.y - bvh_{min}.y)\, 2^p}{bvh_{max}.y - bvh_{min}.y},\; \frac{(c.z - bvh_{min}.z)\, 2^p}{bvh_{max}.z - bvh_{min}.z} \right) \tag{3.2}
\]

Morton codes are best expressed as a single integer, so to represent a 3D point in a single integer value we have to make some precision sacrifices. We assume 32-bit integers, meaning that to represent 3 distinct values in 32 bits we have 10 bits for each of the 3 Cartesian coordinates. Fixing the value of $p$ at 10, we are left with the following quantization equation:

\[
q = \left( \frac{(c.x - bvh_{min}.x) \cdot 1024}{bvh_{max}.x - bvh_{min}.x},\; \frac{(c.y - bvh_{min}.y) \cdot 1024}{bvh_{max}.y - bvh_{min}.y},\; \frac{(c.z - bvh_{min}.z) \cdot 1024}{bvh_{max}.z - bvh_{min}.z} \right) \tag{3.3}
\]

where $q$ is now a 3D point whose coordinates are 10-bit integers. To correctly represent this point along a Z-order curve we interleave the bits of all three coordinates to form a single binary number. We take the value of each coordinate, expand its bits by inserting 2-bit gaps after each bit, and then interleave them as shown in Figure 3.3. Beyond this point it is simply a matter of calculating the Morton code of each primitive and sorting the codes with a parallel sorting algorithm; we used radix sort for this.

3.1.3 Parallel Construction of BVH

Now that we have a set of ordered keys we can begin the process of building the hierarchy. This section follows the same line of thought described by Karras in [2].
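Before continuing, the Morton code computation of section 3.1.2 can be sketched as follows. This is a minimal C++ sketch rather than our kernel code: the names `expandBits`, `quantize10` and `mortonCode` are illustrative, the bit expansion uses a common multiply-and-mask formulation, and we assume the quantized value is clamped to 1023 so that a centroid lying exactly on the maximum of the scene bounds still fits in 10 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Spreads the lower 10 bits of v so that two zero bits follow each input bit.
uint32_t expandBits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Quantizes one coordinate relative to the scene bounds into [0, 1023] (p = 10).
uint32_t quantize10(float c, float sceneMin, float sceneMax) {
    assert(sceneMax > sceneMin);
    float scaled = (c - sceneMin) * 1024.0f / (sceneMax - sceneMin);
    return static_cast<uint32_t>(std::min(std::max(scaled, 0.0f), 1023.0f));
}

// 30-bit Morton code of a centroid; x takes the highest bit of each bit triple.
uint32_t mortonCode(const float c[3], const float sceneMin[3], const float sceneMax[3]) {
    uint32_t x = expandBits(quantize10(c[0], sceneMin[0], sceneMax[0]));
    uint32_t y = expandBits(quantize10(c[1], sceneMin[1], sceneMax[1]));
    uint32_t z = expandBits(quantize10(c[2], sceneMin[2], sceneMax[2]));
    return (x << 2) | (y << 1) | z;
}
```

Sorting the primitives by the value returned by `mortonCode` yields the Z-order layout used as input to the construction step.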
A naive way of building the BVH is to start at the root, find the first differing bit, create the child nodes, and repeat the process for each child recursively. This approach is inherently sequential: we would be forced to stand idle while a single thread processes the root node, then two threads process its children, and so on. This would be a waste of time and GPU computing power. Ideally we should make use of as much information as we possibly can. We know (see section 3.1.1) that any tree with n keys has n − 1 internal nodes; it is just a matter of finding out how these nodes connect among themselves and to the leaves. The idea is to assign indices to the internal nodes in a way that enables finding their children without depending on earlier results, so that we can fully parallelize the construction of the BVH.

Figure 3.3: Interleaving of bits in order to create a Morton code.

We create two separate arrays to store our nodes: L for the leaf nodes and I for the internal nodes. We define the layout of I as having the root node at index 0, denoted by $I_0$. The indices of its children are assigned according to its split position. In fact, the position of each internal node is defined by the split position of its parent. For any given node the left child is located at $I_\gamma$ if it covers more than one key and at $L_\gamma$ if it doesn't. Similarly, the right child is located either at $I_{\gamma+1}$ or at $L_{\gamma+1}$. This layout has an important property: the index of every internal node coincides with either the first or the last key it covers. Take for example the root node: it covers the entire range of keys $[0, n - 1]$ and is located at position $I_0$. Let $\gamma_0$ be the first split position; since the left child covers the range $[0, \gamma_0]$, it is located at $I_{\gamma_0}$, the end of its range ($I_0$ is already taken by the root). More generally, if a node covers the range $[i, j]$, then its left child is located at the end of the range $[i, \gamma]$ and its right child at the beginning of the range $[\gamma + 1, j]$.

In Figure 3.4 we present a visual example of this property. Node 0 covers the entire range $[k_0, k_7]$ and is therefore located at $I_0$.
Its children, nodes 3 and 4, cover the ranges $[k_0, k_3]$ and $[k_4, k_7]$ and are placed at $I_3$ and $I_4$, respectively. Node 3's children in turn cover the keys $[k_0, k_1]$ and $[k_2, k_3]$, so we place them at $I_1$ and $I_2$. Node 4 is slightly different: while its right child covers the range $[k_5, k_7]$ and is placed at $I_5$, its left child would cover only a single key, $k_4$, so we avoid creating an internal node for it and instead reference a leaf node directly. The same happens for the right child of node 5. Nodes 1, 2 and 6 each reference two leaf nodes. Interestingly, this process never results in gaps or duplicates when populating the internal node array. An advantage of this scheme is that each internal node is conveniently placed next to its sibling in memory. And since the indices of the internal nodes run from 0 to n − 2, we can use them to directly address the nodes in memory.

Figure 3.4: Bar representation of the keys covered by each internal node.

In order to construct the BVH we need to know more than which nodes cover which keys; we need to know how the nodes connect among themselves, i.e. the parent-child relations. We start by analyzing each internal node $I_i$ in parallel. Based on our numbering we know that its range begins or ends at key $k_i$, and therefore extends either to the left, over keys in $[k_0, k_{i-1}]$, or to the right, over keys in $[k_{i+1}, k_{n-1}]$. We also know that, since each node covers at least two keys, either key $k_{i-1}$ or key $k_{i+1}$ belongs to the same node. The other key belongs to a neighboring node.

Say we want to determine the direction in which the range of keys covered by node $I_i$ extends. We call this direction $d$. To determine $d$ we compare the lengths of the common prefixes between the keys $k_{i-1}$, $k_i$ and $k_{i+1}$. If $k_i$ shares a longer common prefix with $k_{i-1}$, the range extends from $k_i$ to the left and $d = -1$. If on the other hand $k_i$ shares a longer common prefix with $k_{i+1}$, the range extends from $k_i$ to the right and $d = +1$. In other words, if $\delta(i - 1, i) > \delta(i, i + 1)$ then $d = -1$, and if $\delta(i - 1, i) < \delta(i, i + 1)$ then $d = +1$. Knowing this, we can say that $k_i$ and $k_{i+d}$ both belong to node $I_i$, and that $k_{i-d}$ belongs to node $I_{i-d}$.

At this point we know the starting point of our range, $k_i$, and the direction $d$ in which it extends. We do not yet know where it stops, i.e. how far it extends. Since $k_{i-d}$ does not belong to the range of keys covered by node $I_i$, the keys that do belong to it share a longer common prefix among themselves than $k_i$ and $k_{i-d}$ do. We call the length of this shorter prefix $\delta_{min}$, so that $\delta_{min} = \delta(i, i - d)$. $\delta_{min}$ is by definition a strict lower bound: all keys in the range covered by $I_i$ share a common prefix longer than $\delta_{min}$ among themselves.
In other words, $\delta(i, j) > \delta_{min}$ for any $k_j$ belonging to $I_i$. All we need to do to find the other end of the range is to search for the largest $l$ that satisfies $\delta(i, i + ld) > \delta_{min}$. The fastest way to do this is to start at $k_i$ and increase the value of $l$ in powers of 2 until it no longer satisfies $\delta(i, i + ld) > \delta_{min}$. Once this happens we know we have gone too far and have left the range of keys covered by $I_i$. Let us call this upper bound $l_{max}$. We know that the correct value of $l$ is somewhere in the range $[l_{max}/2, l_{max} - 1]$. Now it is only a matter of using a binary search to find the value of $l$ for which $\delta(i, i + (l + 1)d) \leq \delta_{min}$. After finding $l$ we can use $j = i + ld$ to specify the other end of the range.

Let $\delta_{node}$ be the length of the common prefix shared by $k_i$ and $k_j$, given by $\delta(i, j)$. We use $\delta_{node}$ to search for the split position $\gamma$ that partitions the keys covered by $I_i$. We perform a binary search for the largest $s \in [0, l - 1]$ that satisfies $\delta(i, i + sd) > \delta_{node}$, i.e. we find the furthest key that shares a longer prefix with $k_i$ than $k_j$ does. Since $\gamma$ always refers to the last key of the left child, the split position is $\gamma = i + s$ when $d = +1$ and $\gamma = i - s - 1$ when $d = -1$. Discovering $\gamma$ allows us to determine the ranges covered by each child. The left child covers $[\min(i, j), \gamma]$ and the right child covers $[\gamma + 1, \max(i, j)]$.

For the final step we analyze the values of $i$, $j$ and $\gamma$. If $\min(i, j) = \gamma$, the left child of $I_i$ is the leaf node $L_\gamma$; otherwise it is the internal node $I_\gamma$. Correspondingly, if $\max(i, j) = \gamma + 1$, the right child of $I_i$ is the leaf node $L_{\gamma+1}$; otherwise it is the internal node $I_{\gamma+1}$. Pseudocode for this algorithm can be found in Figure 3.5.

Figure 3.5: Pseudocode for the parallel construction of a BVH (source: [2]).
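The steps above (direction, range search, split search, child assignment) can be put together in a short CPU-side sketch. It follows the structure of the algorithm in [2] but is not the kernel used in our application: the `InternalNode` layout and the function names are assumptions made for illustration, the loop over `i` stands in for one GPU thread per internal node, and equal keys are disambiguated by comparing their indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct InternalNode {
    int left, right;               // index into the leaf array or the internal array
    bool leftIsLeaf, rightIsLeaf;  // selects which array each index refers to
};

// Leading zero count of a 32-bit value (32 for zero).
int countLeadingZeros(uint32_t x) {
    int n = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) ++n;
    return n;
}

// delta(i, j): common prefix length of keys i and j, or -1 if j is out of range.
int delta(const std::vector<uint32_t>& keys, int i, int j) {
    if (j < 0 || j >= static_cast<int>(keys.size())) return -1;
    if (keys[i] == keys[j]) return 32 + countLeadingZeros(static_cast<uint32_t>(i ^ j));
    return countLeadingZeros(keys[i] ^ keys[j]);
}

std::vector<InternalNode> buildRadixTree(const std::vector<uint32_t>& keys) {
    assert(std::is_sorted(keys.begin(), keys.end()));
    const int n = static_cast<int>(keys.size());
    std::vector<InternalNode> nodes(n > 1 ? n - 1 : 0);
    for (int i = 0; i < n - 1; ++i) {  // iterations are independent of each other
        // Direction of the range and the lower bound delta_min.
        int d = (delta(keys, i, i + 1) > delta(keys, i, i - 1)) ? 1 : -1;
        int deltaMin = delta(keys, i, i - d);
        // Upper bound l_max in powers of two, then binary search for l.
        int lMax = 2;
        while (delta(keys, i, i + lMax * d) > deltaMin) lMax *= 2;
        int l = 0;
        for (int t = lMax / 2; t >= 1; t /= 2) {
            if (delta(keys, i, i + (l + t) * d) > deltaMin) l += t;
        }
        int j = i + l * d;
        // Binary search for the split position gamma.
        int deltaNode = delta(keys, i, j);
        int s = 0;
        int t = l;
        do {
            t = (t + 1) / 2;  // ceil(l/2), ceil(l/4), ..., 1
            if (delta(keys, i, i + (s + t) * d) > deltaNode) s += t;
        } while (t > 1);
        int gamma = i + s * d + std::min(d, 0);
        nodes[i] = {gamma, gamma + 1, std::min(i, j) == gamma, std::max(i, j) == gamma + 1};
    }
    return nodes;
}
```

Run on eight illustrative 5-bit keys (00001, 00010, 00100, 00101, 10011, 11000, 11001, 11110), this yields the layout described for Figure 3.4: the root points to internal nodes 3 and 4, node 4 has leaf 4 as its left child, and node 5 has leaf 7 as its right child.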


3.1.4 Calculating Bounding Volumes

Traditional methods for BVHs calculate the bounding boxes sequentially in a bottom-up fashion. They rely on the fact that the bounding boxes of the level below are known a priori. This method can be partially parallelized by processing the nodes of each level at the same time, but a synchronization point must still exist between levels. An algorithm of this sort is expected to have a time complexity of O(n log n). Karras proposes a better way. We can start a thread at each leaf node and walk up the tree using parent pointers recorded during the radix tree construction. Each internal node maintains an atomic counter to track how many threads have visited it. The first thread to reach a node terminates, while the second thread continues up the tree. This way we are guaranteed to have the bounding boxes of both children before calculating the bounding box of the current node. Using this method each node is processed exactly once, which leads to O(n) time complexity. The result is a fully functioning BVH we can use to perform intersection tests. See Figure 3.6 for a visualization of a BVH constructed around the Stanford Bunny.

Figure 3.6: Resulting BVH created in parallel around the Stanford Bunny.
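The bottom-up pass with atomic counters can be sketched on the CPU as follows. This is an illustration under assumptions, not our CUDA kernel: the outer loop stands in for one thread per leaf, and for brevity we assume a single node array in which indices 0 to numLeaves − 2 are internal nodes and the remaining indices are leaves (the application itself keeps separate arrays). The names `AABB`, `merge` and `fitBounds` are ours.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

struct AABB { float min[3]; float max[3]; };

// Smallest box enclosing both a and b.
AABB merge(const AABB& a, const AABB& b) {
    AABB r;
    for (int k = 0; k < 3; ++k) {
        r.min[k] = std::min(a.min[k], b.min[k]);
        r.max[k] = std::max(a.max[k], b.max[k]);
    }
    return r;
}

// parent[root] must be -1. box[] already holds the leaf boxes on entry;
// the boxes of the internal nodes are filled in here.
void fitBounds(const std::vector<int>& parent, const std::vector<int>& left,
               const std::vector<int>& right, std::vector<AABB>& box, int numLeaves) {
    const int numInternal = numLeaves - 1;
    assert(static_cast<int>(parent.size()) == 2 * numLeaves - 1);
    std::vector<std::atomic<int>> visits(numInternal);
    for (auto& v : visits) v.store(0);
    for (int leaf = numInternal; leaf < 2 * numLeaves - 1; ++leaf) {  // one thread per leaf
        for (int node = parent[leaf]; node != -1; node = parent[node]) {
            // The first visitor stops; the second knows both children are done.
            if (visits[node].fetch_add(1) == 0) break;
            box[node] = merge(box[left[node]], box[right[node]]);
        }
    }
}
```

Each internal node is written exactly once, by whichever walker arrives second, which is what gives the linear running time mentioned above.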

3.2 Compression

Reducing memory footprint and bandwidth has been our goal with this application. A certain amount of bandwidth is saved simply by using a BVH. A brute-force approach would test every ray against every triangle, fetching the entire scene from global memory at each frame. With a BVH we reduce the amount of information fetched from global memory by fetching only small segments of the BVH at a time until we reach a small number of primitives. On the other hand, a BVH is an acceleration structure, and it takes up memory space too. By reducing memory bandwidth we have increased memory footprint. We intend to lessen this effect by post-processing the BVH to reduce its overall size. It should also be noted that by using a BVH we are not simply reducing memory bandwidth. We are also drastically reducing the number of ray-triangle intersection tests, replacing them with ray-box intersection tests, which are much simpler. Pseudocode for our algorithm can be found in Algorithms 2 and 3. We explain BVH compression and mesh compression in detail in the following sections.

Algorithm 2 Our Ray Tracing Algorithm: Part 1

    while true do
        mortonCodes ← GenerateAndSortMortonCodes()
        bvh ← CreateBVH(mortonCodes)
        bvh ← CompressBVH(bvh)
        for each pixel p do
            ray ← Ray()
            ray.origin ← camera.origin
            ray.direction ← camera.direction
            ray.depth ← 0
            pixelColor ← CastRay(bvh, ray)
        end for
    end while

    function COLOR(bvh, intersection)
        color ← black
        for each light l do
            shadowRay ← ShadowRay(intersection.intersectionPoint, l)
            if IntersectRayWithBVH(bvh, shadowRay) = null then
                color ← color + diffuseComponent
                color ← color + specularComponent
            end if
        end for
        return color
    end function

Algorithm 3 Our Ray Tracing Algorithm: Part 2

    function CASTRAY(bvh, ray)
        if ray.depth > MAX_DEPTH then
            return backgroundColor
        end if
        intersection ← IntersectRayWithBVH(bvh, ray)
        if intersection = null then
            return backgroundColor
        end if
        outputColor ← Color(bvh, intersection)
        if ObjectIsReflective(intersection) then
            reflectionRay ← ReflectRay(ray, intersection)
            reflectionRay.depth ← ray.depth + 1
            outputColor ← outputColor + CastRay(bvh, reflectionRay)
        end if
        if ObjectIsRefractive(intersection) then
            refractionRay ← RefractRay(ray, intersection)
            refractionRay.depth ← ray.depth + 1
            outputColor ← outputColor + CastRay(bvh, refractionRay)
        end if
        return outputColor
    end function

    function INTERSECTRAYWITHBVH(node, ray)    (node is the root of the subtree being traversed)
        intersections ← null
        if IntersectRayWithBox(ray, node.boundingBox) then
            if IsLeafNode(node.leftChild) then
                intersections ← intersections + IntersectRayWithPrimitives(ray, node.leftChild.primitives)
            else
                intersections ← intersections + IntersectRayWithBVH(node.leftChild, ray)
            end if
            if IsLeafNode(node.rightChild) then
                intersections ← intersections + IntersectRayWithPrimitives(ray, node.rightChild.primitives)
            else
                intersections ← intersections + IntersectRayWithBVH(node.rightChild, ray)
            end if
        end if
        return intersections
    end function

3.2.1 BVH Compression

Since we rebuild our BVH at every frame, each instance lives only for a few milliseconds. It doesn't make sense to use a heavy compression algorithm mainly targeted at compressing persistent BVHs. We thus chose to use simple quantization to compress our BVH. We take a fully parallelized approach in this process. We start a thread at each internal node and make our way up to the root of the tree. Once we reach the root we follow the same path in the opposite direction, that is, from the root node to the internal node in question. As we descend each level we take the bounding box of the parent and subdivide each dimension into 1024 segments. We treat this 1024x1024x1024 grid as a voxel space and map the minimum and maximum points of the current node's bounding box to the nearest corresponding voxels. We clearly lose some precision as we descend each level, since we go from floating point coordinates to integer indices. Each index is contained in the range [0, 1024[, so that we can store 3 of these indices in a single 32-bit integer. Each bounding box can then be stored as a structure of 2 integers instead of 6 floats, reducing its size from 24 bytes to 8 bytes.

It is clear from this method that our BVH will fit the geometry more loosely, as we need to round when going from a continuous space (world coordinates) to a discrete one (voxel indices). We round up when mapping the maximum point of a bounding volume to a voxel and round down when mapping the minimum, so that the quantized bounding volumes still enclose the original ones. We predict this loss of precision won't have a big impact on our intersection tests: we will most likely intersect a few more bounding volumes, as they will be slightly larger, but we don't expect a significant increase in the number of tests.
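The quantization of one bounding box against its parent can be sketched as below. This is one possible reading of the scheme, not a transcription of our kernel: we assume the minimum is stored as the index of the voxel containing it and decoded to that voxel's lower face, while the maximum is stored as a voxel index and decoded to that voxel's upper face, which keeps every index in [0, 1023]. The names `QuantizedAABB`, `pack3`, `unpack`, `quantize` and `dequantize` are illustrative, and the parent box is assumed to have non-zero extent on every axis.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct AABB { float min[3]; float max[3]; };           // 6 floats, 24 bytes
struct QuantizedAABB { uint32_t min; uint32_t max; };  // 2 integers, 8 bytes

// Three 10-bit indices share one 32-bit integer.
uint32_t pack3(uint32_t x, uint32_t y, uint32_t z) { return x | (y << 10) | (z << 20); }
uint32_t unpack(uint32_t packed, int axis) { return (packed >> (10 * axis)) & 1023u; }

// Expresses child on the 1024^3 voxel grid spanned by parent.
QuantizedAABB quantize(const AABB& child, const AABB& parent) {
    uint32_t lo[3], hi[3];
    for (int k = 0; k < 3; ++k) {
        assert(parent.max[k] > parent.min[k]);
        float scale = 1024.0f / (parent.max[k] - parent.min[k]);
        float a = std::floor((child.min[k] - parent.min[k]) * scale);        // round down
        float b = std::ceil((child.max[k] - parent.min[k]) * scale) - 1.0f;  // round up
        lo[k] = static_cast<uint32_t>(std::min(std::max(a, 0.0f), 1023.0f));
        hi[k] = static_cast<uint32_t>(std::min(std::max(b, 0.0f), 1023.0f));
    }
    return {pack3(lo[0], lo[1], lo[2]), pack3(hi[0], hi[1], hi[2])};
}

// Inverse quantization, as needed before a ray-box test during traversal.
AABB dequantize(const QuantizedAABB& q, const AABB& parent) {
    AABB box;
    for (int k = 0; k < 3; ++k) {
        float cell = (parent.max[k] - parent.min[k]) / 1024.0f;
        box.min[k] = parent.min[k] + unpack(q.min, k) * cell;        // lower face of voxel
        box.max[k] = parent.min[k] + (unpack(q.max, k) + 1) * cell;  // upper face of voxel
    }
    return box;
}
```

Rounding the minimum down and the maximum up means the decoded box can only grow relative to the original, which is why a few extra ray-box hits are expected but no geometry should be missed.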
One thing to take into account is that in each thread we must quantize the entire path from the root to each internal node. If we simply quantize a node relative to the world coordinates of the parent's original bounding box, the decoded bounding volume will not match its true size. This effect has an impact on the rendered image, as it can make us wrongly skip ray-triangle intersection tests. It can be seen in Figure 3.7. This happens because at each level of the tree we slightly change the position of the maximum and minimum points that define the bounding volume of that node. These changes in position accumulate over the levels, so when quantizing the bounding volume of a node we have to take into account that the parent's bounding volume no longer corresponds to its original world position. When traversing the BVH we need to apply the inverse quantization to each node's bounding volume to perform the ray-box intersection tests.

Figure 3.7: Ray-box intersection errors caused by not using a quantized parent bounding box.

3.2.2 Mesh Compression

Karras [2] uses each leaf node as a redirect to a primitive. We can remove a level of indirection by having each leaf node envelop a single triangle and storing that triangle in the node itself. We can also reduce the amount of memory necessary to store a triangle by following the same line of thought as in section 3.2.1. By quantizing the position of each vertex of each triangle relative to the bounding box that encapsulates it, we can go from storing 9 floats to just 3 integers. This will of course have an impact on the model itself, as we lose the original coordinates of each vertex. When converting each vertex back to world coordinates we change the final world position of each triangle, resulting in slight mesh deformation. Interestingly enough, this effect isn't noticeable unless models are examined very closely (see Figure 3.8). In a practical application like a game or simulation, where the camera and the objects are constantly moving, this effect could pass unnoticed. This compression step can easily be done in the same kernel as the quantization of the internal nodes of the tree, thus having little to no effect on the post-processing time of the BVH.

3.3 Rendering

Rendering is simply a process of creating primary rays from the camera and checking for intersections with the BVH. We used the algorithm proposed by Williams et al. in [19] to perform our ray-box intersections. Once we found that a ray intersected a leaf node, we used the method described by Möller et al. in [20] to perform the ray-triangle intersection with the triangle contained in the leaf node. Secondary rays resulting from intersections with objects are also tested against the BVH instead of against the primitives directly. Phong shading and Lambertian reflectance were the methods used in shading.
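To make the ray-box step of traversal concrete, the following is a generic slab test written in C++. It is a sketch in the spirit of the ray-box tests used in ray tracers, not the exact formulation of [19]: the `Ray` and `Box` structs, the precomputed inverse direction and the function name `intersectRayBox` are assumptions for illustration, and direction components equal to zero are assumed to be handled by IEEE infinities.

```cpp
#include <algorithm>
#include <cassert>

struct Ray { float origin[3]; float invDir[3]; };  // invDir[k] = 1 / direction[k]
struct Box { float min[3]; float max[3]; };

// Returns true if the ray overlaps the box somewhere within [tMin, tMax].
bool intersectRayBox(const Ray& ray, const Box& box, float tMin, float tMax) {
    assert(tMin <= tMax);
    for (int axis = 0; axis < 3; ++axis) {
        // Ray parameters at which the ray crosses the two planes of this slab.
        float t0 = (box.min[axis] - ray.origin[axis]) * ray.invDir[axis];
        float t1 = (box.max[axis] - ray.origin[axis]) * ray.invDir[axis];
        if (t0 > t1) std::swap(t0, t1);
        // Shrink the interval to the part that lies inside all slabs so far.
        tMin = std::max(tMin, t0);
        tMax = std::min(tMax, t1);
        if (tMin > tMax) return false;
    }
    return true;
}
```

During traversal of a compressed BVH, the `Box` passed to such a test would be the result of the inverse quantization of the node's stored bounding volume.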


Figure 3.8: Mesh deformation caused by quantization of triangle vertices. Notice the small black dots between triangles. The image is a close-up of the tip of the ear of the Stanford Bunny.

Concluding Remarks

In this chapter we presented the algorithm used in our ray tracing application. We use a novel algorithm to build a new BVH in parallel at every frame. We then perform a series of optimizations and post-processing steps in the hope of increasing the performance of our ray tracing algorithm. We hope to achieve this gain in performance by reducing the number of global memory accesses on the GPU. In the next chapter we perform a series of benchmarks to measure the success of our approach against a vanilla version of Karras' work [2].


Chapter 4

Evaluation

4.1 Evaluation Methodology

We implemented our algorithm in OpenGL/C++ and CUDA/C++. We also implemented the algorithm described by Karras [2] and benchmarked it against our own. Our tests compare three variants: no compression (Karras), internal node compression, and internal node compression together with mesh compression. Tests were run on an NVIDIA GeForce GTX 970 with 4 GB of RAM. Since our algorithm runs mainly on the GPU, the CPU and system memory have no impact on the test results. We chose to render our images at 1280x720 (720p), as this is one of the more commonly used resolutions for multimedia content.

We measure the number of intersection tests performed, namely ray-box and ray-triangle, in order to determine how the loss of precision of the bounding volumes impacts the number of intersection tests needed. We also measure the amount of time spent in each section of the algorithm. We then measure the impact of our algorithm on GPU memory usage and bandwidth consumption.

4.2 Experimental Scenes

We tested our algorithm with three different scenes: ASIAN DRAGON, CORNELL and STANFORD BUNNY.

CORNELL (Figure 4.1, 152 triangles) is representative of highly reflective and refractive scenes. It also represents a low-polygon scene. It consists of an icosphere surrounded by five mirrors and a glass prism. This scene focuses on testing performance gains in scenes with a high number of secondary rays.

The STANFORD BUNNY scene (Figure 4.2, 70K triangles) is what we would call a medium-sized scene. It serves as a middle ground between very low-polygon scenes (CORNELL) and very high-polygon scenes (ASIAN DRAGON). This scene only features eye and shadow rays.

ASIAN DRAGON (Figure 4.3, 7.2M triangles) is representative of complex objects with a high number of triangles. It is a dense model with a great number of triangles in a small space. Our intent with this scene is to measure the performance gains of BVH compression for very dense BVHs. This scene only features eye and shadow rays.

4.3 Hypothesis

We hypothesized that our compression of the BVH would have an impact on several levels. On one hand we expect a greater number of ray-box intersections, as our bounding volumes will be slightly larger. We also expect the number of ray-triangle intersections to increase, although to a lesser extent. Memory footprint should be significantly reduced, as well as the amount of data fetched from memory, as a direct result of the compression. Altogether we expect to have more, but faster, intersection tests, as fewer bytes will be fetched from global memory in each test. Overall we also expect to see an increase in the number of frames per second.

Figure 4.1: CORNELL test scene.

Figure 4.2: STANFORD BUNNY test scene.

Figure 4.3: ASIAN DRAGON test scene.


4.4 Experimental Results and Analysis

4.4.1 Number of Ray-Box Intersection Tests

In this section we measure the number of ray-box intersection tests in each frame. As expected, both the node compression and the node + mesh compression variations of our algorithm present the same number of intersections.

[Bar chart: Cornell, number of ray-box intersections (millions) for Karras, Node Compression, and Node + Mesh Compression]

Figure 4.4: Number of ray-box intersection tests in CORNELL.

Despite this being a small, enclosed scene, we notice that the loss of precision of the bounding volumes causes an increase of 9% in intersection tests. We assume this is likely caused by reflected and refracted rays that would normally pass tangent to the icosphere but are instead tested and categorized as a hit.

[Bar chart: Stanford Bunny, number of ray-box intersections (millions) for Karras, Node Compression, and Node + Mesh Compression]

Figure 4.5: Number of ray-box intersection tests in STANFORD BUNNY.

In STANFORD BUNNY we notice an increase of 7.1% in the number of ray-box intersection tests, making it the least affected of the 3 scenes.

[Bar chart: Asian Dragon, number of ray-box intersections (millions) for Karras, Node Compression, and Node + Mesh Compression]

Figure 4.6: Number of ray-box intersection tests in ASIAN DRAGON.

In ASIAN DRAGON we notice an increase of 12.38% in the number of ray-box intersection tests. Since this scene has a deep tree, we expect rounding errors to accumulate over the levels and lead to this increase in the number of intersection tests.

4.4.2 Number of Ray-Triangle Intersection Tests

In this section we measure the number of ray-triangle intersections. This number is directly affected by the loss of precision of the bounding volumes, since the more leaf bounding volumes are intersected, the more ray-triangle intersections we have to perform. As expected, both variations of our algorithm present the same number of tests.

[Bar chart: Cornell, number of ray-triangle intersections (millions) for Karras, Node Compression, and Node + Mesh Compression]

Figure 4.7: Number of ray-triangle intersection tests in CORNELL.

CORNELL shows an increase of 11.1% in tests performed. We believe this number is greater than in the other test scenes because in this scene almost every ray will hit an object and thus will have to traverse the entire BVH, making it more prone to the effects of the loss of precision of the bounding volumes.


Figure 4.8: Number of ray-triangle intersection tests in STANFORD BUNNY (in millions: Karras 6.246781, Node Compression 6.711506, Node + Mesh Compression 6.711506).

In STANFORD BUNNY we notice an increase of 7.4% in the number of tests performed. This is similar to the increase observed in ray-box intersection tests for the same scene.

Figure 4.9: Number of ray-triangle intersection tests in ASIAN DRAGON (in millions, for Karras, Node Compression and Node + Mesh Compression).

ASIAN DRAGON shows an increase of 10.75% in the number of tests performed. As in STANFORD BUNNY, this number goes hand in hand with the increase in ray-box intersection tests, although it is slightly lower.
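As a worked check, the relative increase for STANFORD BUNNY can be recomputed from the bar values of Figure 4.8, namely 6.246781 million ray-triangle tests for Karras and 6.711506 million for both compressed variations:

```latex
\[
\frac{6.711506 - 6.246781}{6.246781}
  = \frac{0.464725}{6.246781}
  \approx 0.0744
\]
```

This agrees with the reported increase of 7.4%.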

4.4.3 Global Memory Reads

In this section we estimate the total number of bytes read from global memory in each frame by intersection tests alone. The estimate is based on the number of intersection tests measured, on the size of each internal node, leaf node and primitive, and on the following formula:

$$\mathit{TotalMemoryRead} = \mathit{num}_{rb} \cdot \mathit{size}(\mathit{internalNode}) + \mathit{num}_{rt} \cdot \left(\mathit{size}(\mathit{leafNode}) + \mathit{size}(\mathit{triangle})\right) \tag{4.1}$$

Here $\mathit{num}_{rb}$ stands for the number of ray-box intersection tests and $\mathit{num}_{rt}$ for the number of ray-triangle intersection tests. Refer to Appendix A for a list of the relevant data structures. Cache hits are not taken into account. We expect our algorithm to benefit more from them than Karras' approach, as each entity takes up less memory.

Figure 4.10: Estimated memory read during intersection tests in CORNELL in each frame (in MB, for Karras, Node Compression and Node + Mesh Compression).

In CORNELL we achieve a significant reduction (14.5%) in the amount of memory read when we implement node compression, but we see little to no gain from mesh compression. This is because in this scene the number of ray-box intersection tests dwarfs the number of ray-triangle intersection tests, so optimizations that target ray-triangle tests have little impact.
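Equation 4.1 can be sketched in a few lines of host-side code. This is only an illustration: the struct `EntitySizes` and the functions `totalMemoryRead` and `relativeReduction` are hypothetical names, and any sizes passed in are assumptions supplied by the caller, since the actual sizes are those of the data structures in Appendix A.

```cpp
#include <cassert>
#include <cstdint>

// Per-entity sizes in bytes. The real values come from the data structures
// in Appendix A; whatever is stored here is a caller-supplied assumption.
struct EntitySizes {
    std::uint64_t internalNode;
    std::uint64_t leafNode;
    std::uint64_t triangle;
};

// Equation 4.1: bytes read from global memory by intersection tests alone.
// Cache hits are ignored, as in the text.
std::uint64_t totalMemoryRead(std::uint64_t numRayBox,
                              std::uint64_t numRayTriangle,
                              const EntitySizes& s) {
    return numRayBox * s.internalNode +
           numRayTriangle * (s.leafNode + s.triangle);
}

// Fraction of memory traffic saved by a variant with respect to a baseline.
double relativeReduction(std::uint64_t baselineBytes, std::uint64_t variantBytes) {
    assert(baselineBytes > 0);
    return 1.0 - static_cast<double>(variantBytes) /
                 static_cast<double>(baselineBytes);
}
```

Calling `totalMemoryRead` once per variation with the measured test counts, and then `relativeReduction` against the Karras baseline, yields percentages of the kind quoted in this section. Note that a compressed variation performs more tests but reads fewer bytes per test, so the net saving depends on both factors.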

Figure 4.11: Estimated memory read during intersection tests in STANFORD BUNNY in each frame (in MB, for Karras, Node Compression and Node + Mesh Compression).

STANFORD BUNNY reduces its memory accesses by 14% when compressing internal nodes alone and by 22% when compressing the mesh as well. These results are consistent with those obtained in ASIAN DRAGON, which suggests that both high-poly and low-poly models benefit from our improvements.

Figure 4.12: Estimated memory read during intersection tests in ASIAN DRAGON in each frame (in MB, for Karras, Node Compression and Node + Mesh Compression).

In ASIAN DRAGON we see a steady decrease in accessed memory as we implement our changes. We reduce memory accesses by 11.9% when using node compression and by 20.3% when adding mesh compression.

4.4.4 Bounding Volume Hierarchy Total Size

In this section we measure the total memory occupied by both the BVH and the scene's primitives. The following table lists the memory each structure occupies.

Table 4.1: Size in bytes of internal node, leaf node and triangle in each algorithm (Karras, Node Compression and Node + Mesh Compression). In the Node + Mesh Compression variation the triangle is encoded in the leaf node and is listed as 0 bytes.

Remember that for N triangles we have a BVH with N - 1 internal nodes and N leaf nodes.

Cornell

In CORNELL we calculate that the BVH and primitives occupy [...] Kilobytes in Karras' approach. In our node compression variation we calculate 9.71 Kilobytes, and in our node and mesh compression variation 6.06 Kilobytes.

Stanford Bunny

With Karras' algorithm we calculate that STANFORD BUNNY takes a total of 5.57 Megabytes. Our approach with node compression comes in at 4.46 Megabytes, and our approach with node and mesh compression has a total size of 2.79 Megabytes.

Asian Dragon

Karras' approach to ASIAN DRAGON uses an estimated [...] Megabytes. Our approaches, node compression and node plus mesh compression, use [...] Megabytes and [...] Megabytes, respectively. We are thus capable of reducing the memory footprint to 50% of its original value.
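The totals in this section follow from the node counts stated above. As a sketch, writing $s_{int}$, $s_{leaf}$ and $s_{tri}$ for the per-entity sizes of Table 4.1 (these symbols are introduced here for illustration only), the total size for a scene with $N$ triangles is:

```latex
\[
S_{total} = (N - 1)\, s_{int} + N \left( s_{leaf} + s_{tri} \right)
\]
```

The 50% figure can be checked against the STANFORD BUNNY totals quoted above:

```latex
\[
\frac{2.79\ \text{MB}}{5.57\ \text{MB}} \approx 0.50
\]
```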

4.4.5 Kernel Execution Time

In this section we measure the execution time of each of the most relevant kernels. We evaluate:

BVH Creation: step that covers the creation of the Morton codes and their sorting, the creation of the hierarchy and the calculation of the bounding volumes.
Node Compression: compression of the internal nodes of the BVH.
Mesh Compression: encoding of the triangles into the enveloping bounding box.
Rendering: rendering kernel that includes the traversal of the BVH, the intersection tests with primitives and shading. The BVH traversal code can be found in Appendix B.

Cornell

Table 4.2: Kernel execution time for CORNELL frame (time in ms of the BVH Creation, Node Compression, Mesh Compression and Rendering kernels, for Karras, Node Compression and Node + Mesh Compression).

CORNELL does not benefit much from our modifications. We notice a reduction in rendering time of 6.7% when compressing internal nodes and of 10.7% when compressing the mesh as well. Both BVH construction time and compression time are minimal. We believe our modifications do not have a great impact on this scene because it is very small and easily fits in the L2 cache, which mitigates any improvement gained from memory optimization.

Stanford Bunny

Table 4.3: Kernel execution time for STANFORD BUNNY frame (time in ms of the BVH Creation, Node Compression, Mesh Compression and Rendering kernels, for Karras, Node Compression and Node + Mesh Compression).

In this scene, as in CORNELL, most of the time is spent executing the rendering kernel. The creation of the BVH has a slight impact on the time required to render each frame and remains constant across all three variations. The compression kernels themselves take little to no time, yet compression greatly reduces the time it takes to render the scene. Here we see a reduction in rendering time of 20.55% when compressing just the BVH's internal nodes and of 41.01% when compressing both internal nodes and triangles. Overall we are only able to achieve a 37.33% increase in performance when considering the rest of the kernels.

Asian Dragon

Table 4.4: Kernel execution time for ASIAN DRAGON frame (time in ms of the BVH Creation, Node Compression, Mesh Compression and Rendering kernels, for Karras, Node Compression and Node + Mesh Compression).

In ASIAN DRAGON we see a reduction of 21.08% in the duration of the rendering kernel when compressing just the internal nodes and of 34.67% when compressing both internal nodes and triangles. Our algorithm has a significant impact on high-poly models, as we fetch a vast number of BVH nodes in this scene. Unfortunately, the BVH Creation kernel takes up most of the time in each frame when dealing with such a high number of polygons. Overall we are only able to achieve a 6.6% increase in performance when we consider the rest of the kernels that make up the frame.
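The gap between the rendering-kernel gains and the overall gains can be illustrated with a small sketch that sums per-kernel times into a frame time. The struct `FrameTimes` and the functions `frameTimeMs`, `frameTimeReduction` and `framesPerSecond` are hypothetical names, and the sketch assumes that a frame costs exactly the sum of the four kernels listed above, which ignores any other per-frame work.

```cpp
#include <cassert>

// Per-kernel execution times for one frame, in milliseconds.
// Assumption: the frame consists of exactly these four kernels.
struct FrameTimes {
    double bvhCreation;
    double nodeCompression;
    double meshCompression;
    double rendering;
};

// Total time of one frame under the assumption above.
double frameTimeMs(const FrameTimes& t) {
    return t.bvhCreation + t.nodeCompression + t.meshCompression + t.rendering;
}

// Fraction of the baseline frame time saved by a variant.
double frameTimeReduction(const FrameTimes& baseline, const FrameTimes& variant) {
    const double base = frameTimeMs(baseline);
    assert(base > 0.0);
    return 1.0 - frameTimeMs(variant) / base;
}

// Frames per second implied by a frame time in milliseconds.
double framesPerSecond(const FrameTimes& t) {
    const double ms = frameTimeMs(t);
    assert(ms > 0.0);
    return 1000.0 / ms;
}
```

With made-up timings in which BVH creation takes 8 ms and rendering 2 ms, halving the rendering time while adding 0.5 ms of compression shortens the frame by only about 5%. The same reasoning would explain why a large rendering gain translates into a small overall gain when BVH creation dominates the frame.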


4.4.6 Frames Per Second

In this section we make an overall comparison of the frames per second we are able to generate in each scene.

Figure 4.13: Frames per second in each test scene (ASIAN DRAGON, STANFORD BUNNY and CORNELL, for Karras, Node Compression and Node + Mesh Compression).

STANFORD BUNNY is the scene that benefits the most from our approach. ASIAN DRAGON shows no significant change in performance, as it manages a meager 2 frames per second, mainly because of the high cost of constructing the BVH. CORNELL shows slight improvements in performance.
