}w!"#$%&'()+,-./012345<ya

Size: px
Start display at page:

Download "}w!"#$%&'()+,-./012345<ya"

Transcription

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Raytracing on GPU

DIPLOMA THESIS

Marek Vinkler

Brno, spring 2010

Declaration

Hereby I declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during its elaboration are properly cited and listed in complete reference to the due source.

Advisor: doc. Ing. Jiří Sochor, CSc.

Acknowledgement

I would like to thank the advisor of my thesis, doc. Ing. Jiří Sochor, CSc., for his guidance, insight, suggestions and advice.

Abstract

The goal of this work was to create a functional implementation of a ray tracing engine on current mainstream GPUs. Emphasis was put on the performance of the application so that realtime frame rates could be achieved. Because of this requirement and the programming complexity of the ray tracing problem, the C for CUDA language was chosen. The final product is a ray tracer that achieves 50+ fps for primary rays on moderately complex scenes.

Keywords

Ray tracing, GPU, CUDA, visualization

Contents

1 Introduction
    1.1 Ray tracing algorithm
2 History
3 Effects
    3.1 Basic illumination
    3.2 Shadows
    3.3 Recursive ray tracing
        3.3.1 Reflection
        3.3.2 Refraction
    3.4 Monte Carlo ray tracing
        3.4.1 Soft shadows
        3.4.2 Ambient occlusion
        3.4.3 Glossy reflections and refractions
        3.4.4 Diffuse reflections
        3.4.5 Antialiasing
        3.4.6 Depth of field
        3.4.7 Motion blur
        3.4.8 More effects
4 Comparison with rasterization
    4.1 Input data
        4.1.1 Acceleration structures
    4.2 Capabilities
    4.3 Hardware
    4.4 Summary
5 CUDA
6 Eyerays engine
    6.1 Ray computation
    6.2 Traversal methods
        6.2.1 Packet traversal
            Memory loads
        6.2.2 Ray traversal
            Memory loads
        6.2.3 Sequential hybrid traversal
        6.2.4 Interleaved hybrid traversal
    6.3 Shading and saving of data
7 Results
    7.1 Test scenes
    7.2 Primary rays
    7.3 Secondary rays
8 Conclusion

Chapter 1

Introduction

A great part of computer graphics is dedicated to generating photo-realistic images. In this process global illumination plays the crucial role, as it provides humans with information about the layout of the scene. Light and shadow give us a way to evaluate relative distances and orientations between objects. They also reveal the finer properties of objects (material, shape of the surface, ...). All in all, they enable us to get a feeling for the rendered scene. Since the beginning of computer graphics people have been trying to develop an illumination model that would approximate the real world as closely as possible. Ray tracing is one of the algorithm schemes capable of rendering physically correct images.

The main idea of ray tracing is, as the name suggests, to trace (follow the paths of) rays (carriers of light intensity information). In the real world, photons emitted from the light sources move in the direction of emission until they hit some matter. On this event the photon may change its direction and loses some of its energy. This process repeats until the photon arrives at our eye. Then it provides us with the accumulated information of its path. Thus ray tracing can directly simulate the physics of light. The rays are shot from the light source and then we trace their paths until they arrive at our eye. Although this kind of computation is valid, it is also inefficient. Only a small fraction of the rays shot from the light sources arrive at our eye. Reverting the problem is usually a better solution. We shoot the rays from the eye into the scene and trace their paths until a certain criterion is met. Commonly used criteria are a maximum number of bounces or a minimum magnitude of contribution.

Ray tracing has frequently been called embarrassingly parallel. This is because each ray is completely independent of all other rays (data parallelism). This makes it possible to assign one processor per ray. And that is exactly the kind of parallelism which suits modern GPUs.

1.1 Ray tracing algorithm

In one sentence the ray tracing algorithm could be expressed as: shoot a ray through each pixel into the scene and find the intersection of the ray with the scene geometry. However, for complex scenes it would be hopelessly inefficient to test every ray for intersection with every object. We therefore organize the objects in a hierarchy so that a large fraction of the objects can be rejected quickly. This extended approach can be described as follows. At the first stage rays from the camera into the scene are computed. At the second stage a group of objects likely to be hit by the ray is found (usually by traversing some spatial index). At the next stage the exact intersection of the ray with a geometric primitive is computed. If no primitive is hit we return to the traversal stage. Once a hit point is found, we can proceed to the shading stage. At this stage we compute the colour of the returned light.

Not only the light from the first hit contributes to the final colour, so we can follow the light that arrived at our eye further. For example we now have no idea whether our intersection point is in light or in shadow. Nor do we know anything about other paths of light that arrive at the hit location. We can acquire this information by shooting more rays from the hit location into the scene. These subsequent rays are called secondary rays. They can be used to compute shadows, reflection and refraction, etc. For shadow computation we shoot a ray to every light source in the scene. If there is no intersected object between the location and the light source we know that the location is illuminated by this light. This elegantly, accurately and efficiently solves the shadow computation problem. We can accelerate this computation by accepting any intersection with the scene geometry instead of the closest one and by skipping the shading stage for these rays. The reflection and refraction rays are sent to acquire the light coming from that direction. These rays are handled in the same way as the primary rays, as from their hit point (unlike for the shadow rays) a new bunch of rays may be cast. When all of these secondary rays return the incoming colour or the shadow information, the colour of the location can be computed.

We can shoot even more rays to obtain better sampling of the global illumination. Common examples of such rays are short ambient occlusion rays and long diffuse inter-reflection rays. Another possibility is to shoot more rays through the same pixel for antialiasing. We can obtain almost any information by shooting more rays, e.g. the dispersion of white light into the wavelength spectrum. These effects and some more will be described in Chapter 3.

Chapter 2

History

Ray tracing had been used long before the electronic computer was invented. Figure 2.1 shows ray tracing in the year 1525. Back then, the computer was the artist's assistant and the rays were strands of thread. Albrecht Dürer (1471-1528), a German renaissance painter and engraver, used this device to render images with correct perspective projection. Points on the object (a lute) are projected onto the image. Nowadays we would call this process the projection of 3D points onto a 2D image. Ray tracing was also used for lens design for microscopes, telescopes, binoculars, and cameras. Sir Isaac Newton (1643-1727) showed reflection and refraction of rays in his famous 1704 book Opticks.

The ray tracing solution to the global illumination problem was first formulated by Turner Whitted [12] in 1980. At that time FLOPS (FLoating point Operations Per Second) were really expensive in CPUs. Moreover, scenes usually consisted only of a few large polygons and hardware resources were very scarce. Due to its purely local computational model, rasterization is much better suited to such an environment. Since then a complete generation of researchers, developers, and users has grown up exclusively in a rasterization-based world, and nearly everyone has been sure that ray tracing is slow and cannot be implemented in hardware.

In the mid 1990s VLSI graphics chips (NVIDIA, ATI) mainly accelerated the rasterization part of OpenGL, which almost entirely consisted of fixed point arithmetic. FLOPS were still expensive to realize in hardware. Thus rasterization took off on the mass market while ray tracing still could not be realized on a competitive basis. The software research on ray tracing hardly continued due to the lack of progress, and hardware research was still limited by hardware resources; it mainly focused on intersection computation. In the late 1990s the hardware resources became available to perform real time ray tracing on large supercomputers and a few years later also on clusters of PCs. FLOPS became cheap and were available in every PC through SIMD, high clock rates, long CPU pipelines, etc. Yet the new architecture was not well suited for the inner loops of traditional ray tracing algorithms.

Figure 2.1: Mechanical creation of a perspective image (Dürer, 1525)

Only when the algorithms were reimplemented could these new hardware features be exploited effectively. A few years later large numbers of FLOPS also became available in programmable GPUs. However, until now their programming model has been too inflexible to effectively exploit the raw performance for ray tracing. The features and flexibility of GPUs have increased significantly since then. But that was still insufficient to implement fast ray tracing on GPUs except in toy examples. The dominant stream programming model was not well suited for ray tracing type algorithms (limited control flow, no recursion, no stack). All these aspects made ray tracing on GPUs slower than on CPUs despite the much higher raw performance and memory bandwidth.

In the meantime custom hardware has been developed that performs the entire computation highly efficiently. In 2004 the first fixed-function ray tracing chip was presented. A year later the first fully programmable RPU (ray processing unit) was presented at Siggraph. Interestingly enough, the latter architecture is based to a large degree on GPUs but extends them in key locations with dedicated hardware units as well as significantly increased programming flexibility. The development of GPUs has brought even more FLOPS and memory bandwidth, but more importantly programmability. In 2006 NVIDIA released their computing architecture called CUDA.

These two events made GPUs much closer to general purpose parallel supercomputers than to mere graphics chips. Ray tracing maps pretty well to this architecture, allowing for realtime frame rates on static scenes and interactive frame rates on dynamic scenes.

Since the beginning of ray tracing in 1980 people have been trying various acceleration techniques to achieve realtime performance. Over the years many articles have been written, all of them enhancing or accelerating the basic algorithm. The hottest topics have been the construction of acceleration structures, intersection tests and traversal algorithms. This thesis revolves mainly around the last of these.

Chapter 3

Effects

The insatiable demand for photographic image quality has driven the invention of visualization methods for many lighting effects. Some of these methods use trick solutions and cheap hacks to accelerate their rendering, but these illusions often break down in extreme cases. This chapter introduces some of the lighting effects and explains how they can be achieved with ray tracing.

3.1 Basic illumination

A simple illumination model consists of three parts: ambient illumination, diffuse reflection and specular reflection, as can be read in every textbook on computer graphics. The illumination model used in this thesis applies Lambert's cosine law to the computation of the diffuse reflection and the Phong cosine-power formula to the computation of the specular highlights. The ambient term is a constant addition to the illumination approximating light after several bounces. Although other reflection models exist, this one is the most widespread. An example of a simple scene rendered with this illumination model is given in Figure 3.1a. With ray tracing we can achieve this effect by shooting rays from the eye through each pixel into the scene. When the ray hits the closest object we sample the local illumination at this point. The pseudocode for the basic ray tracing engine is in Algorithm 3.1. These rays are called the primary rays as they perform the first (primary) hit.

3.2 Shadows

The images generated with the algorithm from the previous section tell us which objects are visible but not much more. For example we have no information about the relative position of the objects. Shadows are vital for the human eye to evaluate such relations. We can determine whether a point is in shadow or not by shooting a ray from this point to the light source.

Algorithm 3.1 Ray tracing

for each pixel do
    compute ray for that pixel
    for each object in scene do
        if (ray intersects the object and the object is closest so far)
            record intersection distance and object id
    set pixel colour to SHADING(object id)

If the ray hits any opaque object before hitting the light source, the point is in shadow. In the algorithm from the previous section we replace "set pixel colour to SHADING(object id)" with the code from Algorithm 3.2. Figure 3.1b shows a scene with shadows. If there are multiple light sources in the scene we repeat this computation for each of them.

Algorithm 3.2 Shadows

if (ray hit the scene)
    compute ray from the hit point to the light source
    if (ray does not intersect an opaque object)
        add this light's contribution

Figure 3.1: Basic rendering effects from [6]: (a) Lambert diffuse reflection with Phong shading, (b) Shadows
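To make the model of Section 3.1 concrete, a minimal C for CUDA sketch of the ambient + Lambert diffuse + Phong specular evaluation for a single light could look as follows; all names are illustrative and the code is not taken from the engine described later in this thesis.

__device__ float dot3(float3 a, float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

__device__ float3 shadeBasic(float3 n,        // unit surface normal
                             float3 toLight,  // unit direction towards the light
                             float3 toViewer, // unit direction towards the viewer
                             float3 ka, float3 kd, float3 ks, // material coefficients
                             float shininess)
{
    float nDotL = dot3(n, toLight);
    if (nDotL <= 0.0f)                        // light is behind the surface:
        return ka;                            // only the constant ambient term remains
    // Mirror the light direction about the normal for the Phong highlight.
    float3 r = make_float3(2.0f*nDotL*n.x - toLight.x,
                           2.0f*nDotL*n.y - toLight.y,
                           2.0f*nDotL*n.z - toLight.z);
    float spec = powf(fmaxf(dot3(r, toViewer), 0.0f), shininess);
    // Ambient + Lambert's cosine law + Phong cosine-power term.
    return make_float3(ka.x + nDotL*kd.x + spec*ks.x,
                       ka.y + nDotL*kd.y + spec*ks.y,
                       ka.z + nDotL*kd.z + spec*ks.z);
}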

3.3 Recursive ray tracing

In the real world objects have various properties with regard to light. An example of this are surfaces with specular reflection (mirrors, shiny metals) and refraction (dielectric materials). When these objects are hit by light, some of it is reflected or refracted. This light can hit another specular surface and the process continues. To simulate this in a computer we employ recursion.

3.3.1 Reflection

With ray tracing we model this phenomenon by shooting another ray to sample the reflection. The origin of such a ray is the intersection of the primary ray with the closest object in the scene. The direction is computed according to the law of reflection (derived from Fresnel's law).

Figure 3.2: Reflection scheme. The angle between the incident ray direction d and the normal n is the same as between the reflected ray direction r and the normal n.

This leads to the following computation of the reflected ray direction:

r = d + 2(-d · n) n    (3.1)

The reflection ray can hit another specular surface. At this point we can either quit or recurse the reflection tracing. Potentially this tracing might never terminate, so safety switches like a maximal number of bounces or a minimal amount of contribution are enforced. A scene with added reflection is shown in Figure 3.4a.
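A small sketch of this computation (equation (3.1)) in C for CUDA; the function name is illustrative and d, n are assumed to be unit vectors:

__device__ float3 reflectDir(float3 d, float3 n)
{
    float dDotN = d.x*n.x + d.y*n.y + d.z*n.z;   // d · n, negative for a front-facing hit
    return make_float3(d.x - 2.0f*dDotN*n.x,      // r = d + 2(-d · n) n
                       d.y - 2.0f*dDotN*n.y,
                       d.z - 2.0f*dDotN*n.z);
}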

3.3.2 Refraction

This effect is quite similar to the reflection. The only difference is how the ray direction is computed. Snell's law guides this computation. It states that the ratio of the sines of the exterior angle θ and the interior angle φ is equal to the ratio of the light propagation speeds v_θ, v_φ in the corresponding media, which equals the inverted ratio of the media's refractive indices η_θ, η_φ.

sin θ / sin φ = v_θ / v_φ = η_φ / η_θ    (3.2)

Figure 3.3: Refraction scheme. The incident ray with direction d coming from a medium with refractive index η_θ into a medium with refractive index η_φ produces a refraction ray with direction t. The normal of the interface is denoted n.

A detailed derivation of the formula can be found in [10]. The equation for computing t follows:

t = (η_θ / η_φ)(d − n(d · n)) − n √(1 − (η_θ² / η_φ²)(1 − (d · n)²))    (3.3)

If the quantity under the square root is negative there is no refraction. This case is called total internal reflection. An example of a scene rendered with refraction is given in Figure 3.4b. This image is a combination of all the effects described above. The type of ray tracing supporting the effects described up to now is called Whitted-style ray tracing, as it was first introduced by Turner Whitted in 1980 [12].
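A sketch of equation (3.3) in C for CUDA, including the total internal reflection test; the names are illustrative and d, n are assumed to be unit vectors with n pointing towards the incident medium:

__device__ bool refractDir(float3 d, float3 n, float etaTheta, float etaPhi, float3 *t)
{
    float eta   = etaTheta / etaPhi;
    float dDotN = d.x*n.x + d.y*n.y + d.z*n.z;        // negative for an entering ray
    float k     = 1.0f - eta*eta*(1.0f - dDotN*dDotN);
    if (k < 0.0f)
        return false;                                  // total internal reflection
    float c = eta*dDotN + sqrtf(k);
    t->x = eta*d.x - c*n.x;                            // t = eta*(d - n(d·n)) - n*sqrt(k)
    t->y = eta*d.y - c*n.y;
    t->z = eta*d.z - c*n.z;
    return true;
}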

Figure 3.4: Whitted-style rendering effects from [6]: (a) Reflection, (b) Refraction

3.4 Monte Carlo ray tracing

Up to this point all of the described effects were computed deterministically. This leads to a somewhat clean look. From now on we will deal with so-called Monte Carlo (or stochastic) ray tracing. This means that the ray origins or directions are computed using random numbers. There are two main approaches to stochastic ray tracing: distributed ray tracing and path tracing. Distributed ray tracing shoots multiple rays from each surface point to sample some effect. This means that each ray is responsible for a whole tree of secondary rays. This leads to an explosion in the number of rays. To mitigate this we reduce the number of rays after a few bounces. The advantage of this approach is that we can ensure a good distribution of the spawned rays, e.g. by stratifying the directions. On the other hand the path tracing approach shoots a single secondary ray. This avoids the explosion in the number of rays but it induces noise. The solution is to shoot many primary rays in the same direction. This is beneficial for the effects that depend on accumulation of colour over multiple primary rays. However, we cannot ensure a proper distribution of rays as with distributed ray tracing.

3.4.1 Soft shadows

The shadows described in Section 3.2 were hard, which means that a point was either in light or in shadow. This is because the light sources used are infinitesimal points or directions. In the real world light sources have nonzero area (area light sources) and thus a point can be partially visible from the light source.

The set of points that are fully invisible from the light source is called the umbra, while the partially visible set is called the penumbra (Figure 3.5).

Figure 3.5: Soft shadows scheme.

The effect is often computed by shooting N shadow rays from a single point to N random points on the area light source. Each of the rays adds one Nth to the final colour. The resulting shadows are somewhat fuzzy, as can be seen in Figure 3.7a.
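A sketch of this estimate for a rectangular (parallelogram) area light; rnd() and anyHit() are assumed helpers (a per-thread uniform random number generator and an any-hit occlusion test), not functions of the engine:

__device__ float softShadow(float3 p, float3 lightCorner, float3 edgeU, float3 edgeV,
                            int N, unsigned int *rngState)
{
    float visibility = 0.0f;
    for (int i = 0; i < N; ++i) {
        float u = rnd(rngState);                   // uniform in [0, 1)
        float v = rnd(rngState);
        float3 q = make_float3(lightCorner.x + u*edgeU.x + v*edgeV.x,
                               lightCorner.y + u*edgeU.y + v*edgeV.y,
                               lightCorner.z + u*edgeU.z + v*edgeV.z);
        if (!anyHit(p, q))                         // no opaque object between p and the sample
            visibility += 1.0f / N;                // each unblocked ray adds one Nth
    }
    return visibility;                             // scales this light's contribution
}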

3.4.2 Ambient occlusion

Some area light sources can be extremely large. Take the sun and the sky for example. The sun is nearly a directional light source and thus creates hard shadows. But as it hits the atmosphere the light is dispersed and the whole sky acts as a light source. This can be computed similarly to the soft shadows; the only difference is that we shoot random rays into the entire hemisphere above the hit point. An image rendered with ambient occlusion can be seen in Figure 3.7b.

3.4.3 Glossy reflections and refractions

Let us revisit the reflections a little. Previously we shot a single ray in the reflection direction to sample this effect. This works well for ideal mirrors, but not all objects interact with light this way. The light often reflects into a subset of the scene following some distribution. The distribution of the reflected light at opaque surfaces can be modelled with a BRDF (Bidirectional Reflectance Distribution Function). This function returns the ratio of reflected radiance for each pair of incoming and outgoing light directions. Thus if we want to sample the effect we shoot multiple reflection rays in the directions with a high reflected radiance ratio. To prevent artifacts we often use random numbers to perturb the directions a little. The BRDF can be either isotropic or anisotropic. The isotropic function treats all of the incident ray directions the same way, i.e. the distribution is independent of the incident ray direction. On the other hand, for the anisotropic function the distribution changes based on the incident ray direction. This effect can be seen for example on brushed metal surfaces, where the reflection changes with the rotation of the surface. Figure 3.7c shows a scene rendered with glossy reflection. A similar approach can be used for refraction rays. By distributing rays around the refraction direction we can get the appearance of slightly frosted glass.

3.4.4 Diffuse reflections

A point in space is illuminated not only by light coming directly from the light source but also by light bounced off nearby objects (indirect diffuse light). To mimic this effect we need to sample the illumination at points near the point of interest. To do this we shoot a lot of rays from the point to random points on the hemisphere above it. The distribution of the ray directions is cosine-weighted such that more rays are shot towards the pole than towards the equator. It makes sense to consider only object intersections within a certain distance from the point. Objects that are too far away should not affect the point's colour.

Figure 3.6: Diffuse reflection scheme.

This effect is also called colour bleeding as the colour bleeds to nearby areas. The familiar teapot scene rendered with diffuse reflection can be seen in Figure 3.7d.
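A sketch of such cosine-weighted direction sampling; u1 and u2 are uniform random numbers in [0, 1) and buildBasis() is an assumed helper that completes the unit normal into an orthonormal basis:

__device__ float3 cosineSampleHemisphere(float3 n, float u1, float u2)
{
    // Sample a unit disk and project onto the hemisphere; the resulting direction
    // density is proportional to cos(theta), favouring the pole over the equator.
    float r   = sqrtf(u1);
    float phi = 6.2831853f * u2;                       // 2 * pi * u2
    float x = r * cosf(phi);
    float y = r * sinf(phi);
    float z = sqrtf(fmaxf(0.0f, 1.0f - u1));
    float3 tangent, bitangent;
    buildBasis(n, &tangent, &bitangent);               // assumed helper, not the engine's
    return make_float3(x*tangent.x + y*bitangent.x + z*n.x,
                       x*tangent.y + y*bitangent.y + z*n.y,
                       x*tangent.z + y*bitangent.z + z*n.z);
}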

3.4.5 Antialiasing

The images rendered using ray tracing often suffer from aliasing. This is to be expected, as only one sample of the scene (from the centre point of the pixel) is taken per pixel. The straightforward solution is to take more samples per pixel and average the colour over the pixel, thus antialiasing the image. This is done by shooting several rays through the same pixel. There are several possibilities for how to choose the samples within the pixel. We can choose regular sampling, which means that we take n x n samples on a regular grid within the pixel. This, however, often leads to artifacts like Moiré patterns. Another possibility is to take these samples randomly (random sampling) or semirandomly (jittering).

3.4.6 Depth of field

The pursuit of photographic quality of the images means that we sometimes have to simulate phenomena of the acquisition devices (cameras and the eye). Such is the case with the depth of field effect. Real cameras are built with lenses that have a finite aperture and nonzero area. Only objects at a certain distance from the objective are in perfect focus, while objects outside this distance are blurry. With ray tracing this can be achieved by slightly perturbing the origins and directions of the primary rays within the eye (lens) area. The resulting image is in Figure 3.7e.

3.4.7 Motion blur

Real cameras have a parameter called exposure time (the time period during which the light sensitive medium is exposed to the light). If the scene changes during this period, e.g. some of the objects move, these objects are blurred in the final image. To render this effect with ray tracing we shoot the rays at different times during the exposure time. When we test the intersections we shift the moving objects to the appropriate time for the ray. An image rendered with motion blur is in Figure 3.7f.

3.4.8 More effects

There are a lot of other light effects that can be simulated with ray tracing. We can mention caustics [9], volumetric effects [5, 11], anisotropic scattering [7] and wavelength effects [6], just to name a few of them.

Figure 3.7: Some other effects achievable by ray tracing, from [6]: (a) Soft shadows, (b) Ambient occlusion, (c) Glossy reflection, (d) Diffuse reflection, (e) Depth of field, (f) Motion blur

Chapter 4

Comparison with rasterization

Ray tracing is an alternative to rasterization for visibility calculations. For a proper comparison of these two basic algorithms we have to address several key aspects, for example the supported input data, the availability of dedicated hardware and each algorithm's special capabilities. In the following subsections we will discuss both algorithms in these terms.

4.1 Input data

In this subsection we will discuss which types of input data and scene description the two algorithms can handle, for which model sizes they are best suited and how they deal with instantiation.

Rasterization supports a limited set of primitive types. It can handle points, lines, triangles and quads well. However, if our primitive of choice is not among these we are forced to convert it into one of them. For example polygons of higher degree can be triangulated and curves can be approximated with line segments. Ray tracing on the other hand supports non-polygonal primitives in their native form. So in addition to the previously mentioned primitive types we can easily process volume data, splines and subdivision surfaces. It also supports volume data with partial transparency. All of these can be used in a single scene without the need for application-driven context switching.

Ray tracing can also deal with a so-called declarative scene description that specifies what a scene should look like without specifying how to achieve that. This includes fully orthogonal descriptions of the geometry, its appearance (surface shaders), the camera, as well as the lighting environment and any light sources. This scene description can then be rendered fully automatically without the help of the creating application.

Ray tracing is demand driven (output sensitive). This means that it only accesses data hit by a ray and thus potentially contributing to the final image. There can be gigabytes of data hidden behind an occluder (e.g. a wall) and the ray tracer might not even load them into memory.

Figure 4.1: Plant ecosystem from [3]

Experiments show that above approximately one million triangles the rendering time hardly changes. This makes ray tracing a perfect match for massive model visualization. For example, on a dual-processor PC the entire Boeing 777 model (350 million polygons, approximately 30 GB of data) can be rendered at 2-3 fps at video resolution [3]. Another example from [3] shows that ray tracing is also very well suited for object instantiation. Whole objects (geometry, material, texture, etc.) can be defined only once. For multiple instances, only a transformation matrix needs to be stored, and after a hit of a ray with the object's bounding box the ray is transformed into the object's local coordinate system. This makes the engine capable of rendering a complex plant ecosystem model. This model contains more than 365,000 individual plants (68 plant species), resulting in a total of over 1.5 billion potentially visible polygons. The model can be ray traced interactively on a small number of desktop PCs, even including optical effects like shadows, transparency, etc., and without any form of simplification. An example of a resulting image is given in Figure 4.1.

4.1.1 Acceleration structures

No reasonably big data file can be rendered in realtime when drawing all of its input primitives. This statement holds true for both ray tracing and rasterization. An acceleration structure has to be used to create images in time sublinear in the number of primitives.

The performance of ray tracing is highly influenced by the intersection algorithm. Typically a ray only intersects a few of the scene's objects and it would be extremely inefficient to test every ray for intersection with every object.

An ideal intersection algorithm would therefore only test the nearest object hit by the ray. To avoid the abundant ray-primitive intersection tests, a ray tracing engine employs a spatial index called an acceleration structure. This structure quickly identifies regions not entered by the ray. This leads to culling of a large number of the scene's primitives. The proposed methods are octrees, binary space partitioning trees (BSP trees), grids, bounding volume hierarchies (BVHs), k-dimensional trees (kd-trees) and bounding interval hierarchies (BIHs), just to name the more common ones. Ray tracing also uses the acceleration structure directly during the rendering of each frame, making it better suited for huge data sets.

Rasterization already uses spatial indices similar to those used by ray tracing to lower the amount of drawn geometry. Despite its lower time constant, even rasterization would be prohibitively slow on moderately complex models without such an index. With rasterization the steps of finding the geometry to render and rendering it can be separated. For example the former part can be done only periodically, every several frames, because of temporal coherence, leading to an even lower computational cost. In contrast to ray tracing it usually prefers less complicated structures like grids or quadtrees. However, if the application wants to do e.g. collision detection, it needs another acceleration structure for that. With ray tracing only one robust structure is sufficient for both tasks. But the acceleration structures used with ray tracing usually take longer to build. This leads to problems with dynamic scenes due to the need for spatial index updates. Currently the construction of acceleration structures in realtime is a hot research topic.

We can compare the currently used acceleration structures for ray tracing in terms of construction time, memory use, and ray traversal time. BVHs and kd-trees are among the most popular acceleration structures for ray tracing. The advantage of kd-trees is that they are superior in traversal performance for static scenes. On the other hand their construction is also the most time-consuming and they take a lot of memory. These aspects make them unsuitable for dynamic scenes. For BVHs we usually use AABBs (axis-aligned bounding boxes) as the bounding volume, although other primitives like spheres, OBBs (oriented bounding boxes) or k-DOPs are also possible. However, spheres are not generally used because they have the worst fit. The problem with OBBs and k-DOPs is that they are more complicated to build and the intersection test is also slower. In general BVHs offer a good traversal time at a much lower building cost. They also consume significantly less memory. Recently BIHs have started to be used as well. They are a sort of compromise between kd-trees and AABB BVHs. They use two planes to partition the space in a node instead of one plane as in the kd-trees and six in the AABB BVH.

Some implementations have also tried using grids or quadtrees as the acceleration structure. These structures have a low memory footprint and are easy to use, but they are not suitable for scenes with a nonuniform object distribution.

4.2 Capabilities

There is a large palette of effects which can be produced by rasterization and new ones are added on a regular basis. Nevertheless the effects often rely on tricks and are physically incorrect. Examples would be most shadow computations, reflection and refraction, or ambient occlusion. On the other hand ray tracing can naturally support these advanced visual effects and more. It also has built-in visibility culling: visibility is determined on a per pixel basis and typical algorithms such as back-face culling, view-frustum culling and occlusion culling are handled implicitly. Another advantage of ray tracing is that it can be frameless; pixel values are updated as soon as they are computed. That enables iterative rendering of more advanced effects. We compute the primary rays quickly (the same result as basic rasterization) and then continue computing the more sophisticated effects if the camera does not move.

4.3 Hardware

The disadvantages of ray tracing compared to rasterization are a higher time constant and the lack of special purpose hardware. These two aspects are growing less important because of the exponential growth in the number of rendered primitives and the developments in CPU and GPU hardware. Unlike rasterization, which has seen twenty years of hardware development, ray tracing hardware is only just appearing. Ray tracing also demands greater flexibility from the hardware architecture. The efficiency of rasterization derives from processing a single primitive (e.g. a triangle) at a time. The graphics chip does not have to access the entire scene. This prevents it from rendering some basic optical effects faithfully, and fakes such as shadow maps and reflection maps must be used. However, these fakes are inaccurate and generally inefficient.

4.4 Summary

The main advantage of ray tracing is that it directly simulates the physics of light. Moreover, all operations are performed per pixel, including occlusion culling, interpolations, illumination, etc. This improves the quality of the image and enables photographic image quality. The entire rendering can be done by the ray tracing engine without the support of the application. That simplifies application programming and allows complete hardware acceleration. Another advantage is that there is no need for object sorting (e.g. for transparency rendering).

Chapter 5

CUDA

In recent years the GPU (graphics processing unit) has seen tremendous development in both performance and programmability. It has evolved into a highly parallel, multithreaded, manycore processor with huge raw power and high memory bandwidth. These characteristics are natural, as GPUs are specialized for the compute-intensive, highly parallel computations needed in graphics rendering. On the hardware level this means that more transistors are devoted to data processing rather than to data caching and flow control.

The class of problems that is well suited for this architecture is described as data-parallel computations, meaning that the same program is executed on many data elements in parallel. It is also beneficial if the ratio between arithmetic and memory operations is high. With these prerequisites fulfilled there is less need for sophisticated flow control, as the same program code is executed for each data element. Also, memory access latency can be hidden with calculations instead of big caches because of the high arithmetic intensity.

The data-parallel programming model, as the name suggests, is especially well suited for processing large data sets. That is no surprise, as in 3D rendering large sets of vertices and fragments are processed. Large sets of data can also be found in image postprocessing, video encoding and decoding, stereo vision and pattern recognition. But we need not stick to the computer graphics field; fitting problems can be found in general signal processing, physics simulation, computational finance, computational biology and many other fields.

The CUDA general purpose parallel computing architecture was released in November 2006 by NVIDIA. It supports a variety of languages such as NVIDIA's own C for CUDA, FORTRAN, OpenCL and DirectX Compute.
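As a concrete, if tiny, illustration of this programming model (unrelated to the ray tracer itself), the classic SAXPY kernel in C for CUDA runs the same few arithmetic instructions for every element of a large array, one thread per element:

#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique element index of this thread
    if (i < n)
        y[i] = a * x[i] + y[i];                      // identical, independent work per element
}

// Host side: cover all n elements with blocks of 256 threads, e.g.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);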

Figure 5.1: Development in floating point operations per second

Figure 5.2: Development in bandwidth

Chapter 6

Eyerays engine

When writing a ray tracing engine one must first decide which features to implement. I have decided to write a ray tracer for static scenes as they are the best understood. Moreover, only primary, shadow, reflection and refraction rays are supported. This allows the ray tracing engine to run at realtime frame rates while supporting effects rasterization is unable to render. Advantageously, an extension to other ray types would only consist of writing different ray generation code.

The next step is to choose an acceleration structure and to decide on the memory layout. An AABB BVH has been chosen as the acceleration structure for my engine for several reasons. First, it is faster to build and update than most of the other structures. This favours a potential extension of the engine to support dynamic scenes as well. On the other hand it is not quite as fast as a kd-tree for static scenes. This is because a kd-tree partitions space better and its ray-node intersection test is also faster. However, rendering only static scenes has a very limited use, so it is not wise to focus entirely on rendering them as fast as possible. Second, the AABB BVH has a lower memory footprint compared to a kd-tree. Thus the AABB BVH is sort of a compromise between memory footprint and intersection test speed. The low memory consumption is particularly useful on GPUs as they usually have less RAM than CPUs. Third, it is pretty straightforward to implement the traversal of this structure.

It would be very time consuming to write my own BVH construction code. Instead I have decided to reuse one already written. I use the code from Jacco Bikker's Arauna engine [2] for data loading and AABB BVH construction. The construction code ensures that the number of triangles per leaf is less than six. The acceleration structure is then transformed into my internal format suitable for the GPU. It is unclear how good the created hierarchy is because there is no proper metric for such a measurement. The resulting FPS counts are, however, reasonably high, so the quality of the tree structure should not be a major drawback.

Now that the acceleration structure has been chosen it is time to come up with the data layout.

The decision to support only static scenes simplifies things a little. We can copy all of the data to the GPU before the rendering starts and free it once it is done. We need to copy the acceleration structure, the triangle data and a normal for every triangle vertex. Pointers to the arrays in GPU memory are passed to the kernel via a structure in constant memory. Also, a texture reference is bound to the array holding the BVH to take advantage of the texture cache. The data for the children of the current node are saved in a contiguous block of 64 bytes, which will prove useful later. The triangle data have no special layout. The only condition is that the data pointed to by a leaf node are saved in a contiguous block. The memory layout is shown in Figure 6.1.

Figure 6.1: Memory layout (BVH data: min, max; triangle data: vertex1, vertex2, vertex3; normal data: normal1, normal2, normal3)

Once the data layout is in place we can focus on writing the ray tracing code optimized for the GPU. In the remainder of this section we will take a look at the key aspects of the implementation. As mentioned earlier, the engine should be capable of computing Whitted-style ray tracing: primary, shadow and secondary rays (reflection and refraction). GPUs are currently unable to perform dynamic allocations, thus we cannot generate the shadow and secondary rays as soon as we finish traversing the parent primary ray. To overcome this we trace each ray type individually and share intermediate data through the global memory. We first trace all of the primary rays; then, if we want to, we can trace all of the shadow rays. If we decide so, the secondary rays can be traced after that. Currently there is no way to trace the primary rays and then the secondary rays while omitting the shadow rays. To put that in one sentence, we can trace primary rays only, primary + shadow rays, or full primary + shadow + secondary rays.

The easiest and most efficient way to achieve such rendering is to trace each ray type in a separate kernel. Thus, depending on the kernels we launch, we render different effects. In order for this to work the kernels have to interact by sharing some data through the global memory. This is described below and shown in Figure 6.2.
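The layout just described (the children of a node packed into one 64-byte block, device pointers handed to the kernels through a structure in constant memory) might be declared roughly as follows. The names and the use of the float4 .w components are assumptions of this sketch, not the engine's actual declarations, and the texture reference bound to the BVH array is omitted.

struct BVHChildPair {            // 64 bytes: the AABBs of both children of a node
    float4 leftMin,  leftMax;    // the .w fields are free to carry a child index or
    float4 rightMin, rightMax;   // a triangle count, whichever is convenient
};

struct SceneData {               // pointers to the arrays copied to the GPU
    BVHChildPair *bvh;           // acceleration structure
    float4       *triangles;     // triangle vertex data, contiguous per leaf
    float4       *normals;       // one normal for every triangle vertex
};

__constant__ SceneData g_scene;  // filled once from the host with cudaMemcpyToSymbol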

To define which effects should be rendered (which kernels should be launched) we can use macros. These macros switch on/off the launching of the kernels for shadow and secondary rays. The application is written in such a way that if a kernel is not turned on, then the code for it and the code for saving values used by it in other kernels is not compiled at all. This way we avoid the overhead of saving data to global memory if they are not used later. There are also other macros for code adjustment. They generally fit into one of two categories. The macros from the first category define the method used for traversing the acceleration structure. In Figure 6.2 these macros correspond to the particular kernel box. As can be seen, there are several traversal methods for the primary and secondary rays but only one for the shadow rays. This is because the shadow rays are very similar to the primary rays. Thus the fps counts are also similar and there is no need to benchmark various traversal methods. The macros from the second category switch on/off some optional processing. For example we can choose whether to use back-face culling or whether a texture reference should be used for fetching the triangle data.

Figure 6.2: Kernel interaction diagram (primary kernels: packet, ray, hybrid; shadow kernel; secondary kernels: packet, ray, hybrid; data shared through global memory). Macros are used to define the path.

Now that we have decided on the interaction between the different kernels we need to write the kernels themselves. All of them share the same structure. They first compute or load the ray data (ray origin and direction). Then they trace the ray through the acceleration structure. After this is finished and the closest triangle hit is found (if any), they do the shading and/or save the data for processing in the subsequent kernel. The middle part, tracing the ray through the acceleration structure, can be done in multiple ways.
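As an illustration of the macro-driven configuration described above, the host-side launch code could be wrapped roughly as follows; the macro and kernel names are made up for this sketch and do not come from the engine.

#define TRACE_SHADOW_RAYS    1     // switch the shadow kernel (and the data saved for it) on/off
#define TRACE_SECONDARY_RAYS 0     // switch the reflection/refraction kernel on/off

void renderFrame(dim3 grid, dim3 block)
{
    primaryKernel<<<grid, block>>>();       // always launched
#if TRACE_SHADOW_RAYS
    shadowKernel<<<grid, block>>>();        // consumes the hit data saved by primaryKernel
#if TRACE_SECONDARY_RAYS
    secondaryKernel<<<grid, block>>>();     // secondary rays only ever follow the shadow rays
#endif
#endif
}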

The algorithms that I have implemented for this phase are described in Section 6.2. The following three sections describe these three stages of each kernel.

6.1 Ray computation

As suggested in [1], I have used the persistent warps concept in my engine. There was about a 30 percent speed-up of this scheduling method over the default scheduler in my ray tracer. To explain this speed-up, descriptions of the default scheduler and the persistent scheduler follow.

A default scheduler for the GPU launches as many threads (grouped into thread blocks) as are needed for the computation. The blocks (all of the warps in the block) are then sequentially assigned to free multiprocessors. On kernel start each multiprocessor is assigned the maximum number of blocks that it can hold. When a block finishes (all threads in all of its warps are done with their computation) a new block can be assigned to the multiprocessor. This continues until all of the launched blocks are processed. The problem is that before a new block can be assigned to the multiprocessor, all warps in some currently assigned block must finish. If the computational time varies greatly among the warps inside the currently assigned block, the finished warps idle until the other warps in this block are finished as well. This reduces the number of active warps on the multiprocessor and thus compromises the multiprocessor's ability to hide latency with computation.

With the persistent warps concept only as many thread blocks are launched as the graphics chip can hold. This equals the number of blocks a multiprocessor can hold multiplied by the number of multiprocessors on the chip. When a warp finishes its computation (all threads in the warp are done with their computation) new data are loaded into the warp and the computation restarts. This continues until there are no more data to load. This approach is faster for kernels with a heterogeneous workload such as ray tracing kernels. The advantage is that the warps in the block are independent and no idling (until other warps in the same block finish) occurs. They can start new work as soon as they finish the previous one.

Another improvement mentioned in [1] is using the Morton code (a space-filling Z-curve) for ray index assignment. This should allow blocks processing pixels in close 2D locality to use the same texture cache. Subsequently it should lead to a higher hit/miss ratio on coherent scenes. However, the computation of the Morton code consists of many instructions and slows down the kernel.
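Returning to the persistent warps scheme described at the beginning of this section, a minimal sketch of it could look as follows. The counter, kernel and traceRay() names are assumptions of the sketch, not the engine's code; the warp-synchronous read of the shared slot is valid on the GT200-class hardware targeted here, while newer architectures would need an explicit warp synchronization.

__device__ unsigned int g_nextRay;                 // reset to 0 from the host before each frame

__device__ void traceRay(int rayIndex)
{
    // Hypothetical per-ray work: ray generation, traversal and shading would go here.
}

__global__ void persistentTraceKernel(int numRays)
{
    __shared__ volatile unsigned int batchStart[32];    // one slot per warp of the block
    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;

    while (true) {
        if (lane == 0)                                   // one thread per warp grabs a batch of 32 rays
            batchStart[warp] = atomicAdd(&g_nextRay, 32u);
        unsigned int first = batchStart[warp];           // warp-synchronous read of the slot
        if (first >= (unsigned int)numRays)
            break;                                       // no more work left for this warp
        if (first + lane < (unsigned int)numRays)
            traceRay((int)(first + lane));
    }
}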

A simpler approach to ray index computation showed superior performance. The ray indices are assigned to threads sequentially. This takes only a few multiplications and additions to compute.

The final ray origin and direction computation is fairly straightforward. The ray origin is at the viewpoint, thus we only need to copy values from the modelview matrix. The direction is the vector from the viewpoint to the corresponding pixel position. It is computed by multiplying the vector (x, y, 2) with the modelview matrix. The variables x and y are coordinates in [-1, 1] defining the relative pixel position in the output image. These coordinates are computed from the ray index. For the shadow and secondary ray kernels the ray computation is replaced with data loading from the global memory.

6.2 Traversal methods

6.2.1 Packet traversal

Packet traversal is the first method that I have implemented. It is based on the algorithm described in [4]. The main advantage of this method is that it maps very well to CUDA. All loads are fully coalesced, promising high performance. It also wastes no bandwidth when loading the bounding box data, as all threads participate in the load and then use the result. We cannot avoid wasting some bandwidth when loading the triangle data, as the triangle count varies across the leaf nodes. However, we can still load it coalescently. Another advantage is that there is no divergence in the BVH traversal, as all of the rays follow the same path by definition. This leads to high performance on coherent rays such as primary or shadow rays. The drawback of this method is that a ray is sometimes forced to follow a path that it does not want to. This happens mostly on incoherent secondary rays and leads to a much higher number of intersected bounding boxes and primitives, reducing performance. This method is referred to as packet in the results section. The pseudocode for this algorithm is in Algorithm 6.1.

Memory loads

As mentioned before, the children of the current bounding box are saved in a contiguous block of 64 bytes. This makes it possible to load them both in one instruction. All of the threads within the warp copy one float value from the global memory to the shared memory. This is the preferred pattern for accessing global memory as described in [8].
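As an illustration (with assumed names, and reading through a raw pointer rather than the texture reference the engine binds to the BVH array), the coalesced fetch of a 64-byte child block into shared memory could look roughly like this; the exact mapping of threads to words is an assumption of the sketch.

__device__ void loadChildPair(const float *bvh,          // BVH array in global memory
                              int firstWord,              // offset of the 64-byte block, in floats
                              volatile float *s_node,     // 16-float staging area in shared memory
                              int lane)                    // index of the thread within its warp
{
    if (lane < 16)                                         // 16 consecutive floats = one 64-byte
        s_node[lane] = bvh[firstWord + lane];              // block, fetched fully coalesced
}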

With the triangle data saved in the leaf nodes things are more complicated. Their size is not a multiple of 32 and their number varies across the leaves. I have implemented three ways of handling their loading. The first one loads 128-byte chunks of memory until all of the needed data are loaded. There is a lot of overhead in this method, mainly because of the loop handling. Another method loads the first chunk and then decides whether to load the second chunk based on the triangle count. This method was quite a bit faster than the first one, mainly because of eliminating the loop. The third method is the simplest one. It issues two loading instructions no matter how many triangles we actually want to load. This may lead to higher bandwidth consumption on leaves with a low number of triangles but pays off because of the lower instruction count.

6.2.2 Ray traversal

This method is based on the so-called if-if traversal from [1]. It is a simple traversal method where every ray follows its own path. This results in the lowest number of performed intersection tests among all of the implemented methods. That also imposes a performance penalty, as there is divergence inside the warps. Loads from the global memory are not coalesced and therefore waste memory bandwidth, which negatively affects the performance as well. These aspects make the ray traversal the slowest of all the methods for coherent primary rays. On the other hand, for incoherent secondary rays the significantly lower number of intersection tests (compared to packet traversal) leads to better performance. If not for the slowdown caused by a worse texture cache hit/miss ratio, this method would be as fast for secondary rays as it is for primary rays. The pseudocode for this algorithm can be seen in Algorithm 6.2. With the ray traversal one bounding box or one triangle is intersected per iteration. Notice that both these intersections can be done in one iteration if the next loaded node is a leaf node.

Memory loads

As mentioned above, all loads of this method are uncoalesced. However, as the size of the children's bounding box data is a multiple of four, we can at least load the data as float4. This reduces the number of loading instructions by a factor of four, leading to slightly improved performance. For loading the triangle data this approach is useless, as their size is not a multiple of four. Another source of slowdown for this method is its handling of the traversal stack. Whereas the packet traversal method uses the fast shared memory as the location for the stack, this method is forced to use the slow local memory. The shared memory on the current generation of GPUs is too small to hold the stack for all of the rays. As stated in [8], the local memory has the same latency as the global memory. Also, in the case that different rays access different stack indices, the loads and stores are uncoalesced.

35 6. EYERAYS ENGINE Algorithm 6.1 Packet traversal while(shared stack not empty) { if(leaf node) { Load triangles Intersect the ray with every triangle Load next node from shared stack } else { Load children visitleft = RayBoxIntersection(left child) visitright = RayBoxIntersection(right child) } } // Visit both children if(any(visitleft) && any(visitright)) { Decide which child to traverse first Copy the other child onto shared stack Set the chosen child as next node } else if(any(visitleft)) // Visit left child only { Set the left child as next node } else if(any(visitright)) // Visit right child only { Set the right child as the next node } else // Do not visit any child { Load next node from shared stack } 30

6.2.3 Sequential hybrid traversal

This is one of the methods I have developed myself. It is a combination of the two methods described above. As stated, packet traversal works well for coherent nodes whereas ray traversal is better suited for incoherent nodes. The idea behind this method is to traverse the acceleration structure using packet traversal as long as some coherency condition is satisfied. When the condition no longer holds, it switches to ray traversal. Thus it should benefit from the advantages of both. There are two most promising conditions for this one-time-only switching variant. One simply ends the packet traversal as soon as a small stack in shared memory is full. I call this method the sequential stack max. The other condition is more complex. It counts the number of rays wanting to enter a particular node. If this number is too low, it puts the node onto the local stack of those rays which want to enter it and proceeds with the node which is entered by the majority of the rays. I call this method the sequential cut. Counting how many rays want to visit a node is very expensive on the current hardware. This significantly slows down the method. However, if a new warp vote function is added in the next generation of hardware, this method could become the fastest one. The generic pseudocode for the sequential traversal methods is in Algorithm 6.3.

6.2.4 Interleaved hybrid traversal

The idea behind this method is the same as for the sequential hybrid traversal. Once again we try to divide the acceleration structure tree into coherent and incoherent parts. The difference is that we do this for each branch separately. Thus we start tracing with packet traversal, then switch to the if-if traversal, and when all of the rays finish this branch we load the next node from the shared stack and the process continues until the shared stack is empty. This approach did not seem promising at first. I thought it was because the rays that finished fast would spend a lot of time waiting for the slower ones. However, this turned out not to be the reason for the main slowdown. A lot of unnecessary instructions were executed in the older version.

Algorithm 6.2 Ray traversal

while(local stack not empty)
{
    if(!leaf node)
    {
        Load children
        visitleft = RayBoxIntersection(left child)
        visitright = RayBoxIntersection(right child)

        if(visitleft && visitright) // Visit both children
        {
            Decide which child to traverse first
            Copy the other child onto local stack
            Set the chosen child as next node
        }
        else if(visitleft) // Visit left child only
        {
            Set the left child as next node
        }
        else if(visitright) // Visit right child only
        {
            Set the right child as the next node
        }
        else // Do not visit any child
        {
            Load next node from local stack
        }
    }
    if(leaf node)
    {
        Load next triangle
        Intersect the ray with loaded triangle
        if(all triangles processed)
            Load next node from local stack
    }
}

Algorithm 6.3 Sequential hybrid traversal

while(shared stack not empty)
{
    ...
    if(any(visitleft) && any(visitright))
    {
        Decide which child to traverse first
        Copy the other child onto shared stack
        Set the chosen child as next node
    }
    if(switching condition met)
        End the packet traversal
}
...
Perform the ray traversal as usual

This slowed down the computation-limited kernel beyond expectation. After the reimplementation the FPS counts increased significantly. As with the sequential hybrid traversal, there is a variety of switching conditions. In the results section these methods are referred to as interleaved stack max and interleaved cut. The generic pseudocode for the interleaved traversal methods is given in Algorithm 6.4.

6.3 Shading and saving of data

As the final step, the computed information is either used for shading or saved for processing by the following kernel. The first case always applies to the shadow rays, and to the primary and secondary rays if their kernel is the final step. The Phong illumination model is used. The second case applies whenever the kernel is not the final step. For example, for the primary rays we save the ray description, the hit triangle index, the barycentric coordinates of the intersection point within the triangle, the pixel index and the contribution coefficient for shading. This information is saved in such a format that both its saving and loading are fully coalesced and thus as fast as possible.

The only interesting part of the shading function is the actual saving of values to the output image. The pixel information is saved in 32 bits in an ARGB format.

Algorithm 6.4 Interleaved hybrid traversal

while(shared stack not empty)
{
    ...
    if(any(visitleft) && any(visitright))
    {
        Decide which child to traverse first
        Copy the other child onto shared stack
        Set the chosen child as next node
    }
    if(switching condition met)
    {
        Perform the ray traversal as usual
        Load next node from shared stack
    }
    ...
}

So when saving the pixel colour we must enforce that its components are in range and then compose the final value using binary operations. The only trouble is with the secondary rays, when two of the threads want to update the same pixel. This is handled by summing the final colour in shared memory.
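A small sketch of the described clamping and packing (the function name is illustrative, not the engine's):

__device__ unsigned int packARGB(float r, float g, float b)
{
    // Clamp every component to [0, 1], scale to 8 bits and compose the 32-bit value.
    unsigned int R = (unsigned int)(fminf(fmaxf(r, 0.0f), 1.0f) * 255.0f);
    unsigned int G = (unsigned int)(fminf(fmaxf(g, 0.0f), 1.0f) * 255.0f);
    unsigned int B = (unsigned int)(fminf(fmaxf(b, 0.0f), 1.0f) * 255.0f);
    return (255u << 24) | (R << 16) | (G << 8) | B;   // opaque alpha in the top byte
}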

Chapter 7

Results

The performance of the described methods was benchmarked on an NVIDIA GTX 280. The numbers are given in frames per second. With the output resolution used, this number also corresponds to mega rays per second (Mrays/s) for the primary rays. For the secondary rays the frame rate corresponds to one half of the Mrays/s figure, as we shoot twice as many rays to compute the effect. The timings include the whole kernel execution (ray generation, traversal, shading, etc.) but exclude the time spent on buffer switching and CPU code. This auxiliary code has a constant workload and decreases the performance by 1-2 fps.

7.1 Test scenes

To measure the traversal performance we need some scenes to run our algorithm on. These should be as different as possible to evaluate the ray tracing performance in different environments. Among the popular scenes for computer graphics algorithm testing I have chosen the Conference room (7.1a), Fairy forest (7.1b) and Sibenik cathedral (7.1c) models. The Conference room was created by Anat Grynberg and Greg Ward (…gov/mgf/scenes.html), the Fairy forest model was made by Ingo Wald, and finally the Sibenik model is the work of Marko Dabrovic. In each of the scenes I have randomly selected five different viewpoints. From these viewpoints most of the scene's triangles are usually visible, and except for one viewpoint in the Fairy forest scene no ray hits the background. This way meaningful benchmarking can be done.

The Conference room model is basically a box with a regular layout. Thus neighbouring rays often follow coherent paths. No background is visible from inside the model, so every ray, primary or secondary, hits some triangle. The Fairy forest model is a round open space with a visible sky. The layout is heavily irregular, with the majority of the triangles concentrated in the fairy figure. The presence of the open sky causes some of the secondary rays to miss the scene entirely.

Figure 7.1: Images of the test scenes: (a) Conference, (b) Fairy, (c) Sibenik.

Scene        Triangles   Split triangles   Nodes   Leaves   Max depth
Conference   283k        424k              190k    95k      25
Fairy        174k        261k              117k    59k      28
Sibenik      80k         121k              56k     28k      24

Table 7.1: Triangle counts, triangle counts after the split, number of nodes (both inner and leaf), number of leaves and maximal depth of the tree for the three test scenes.

The last scene, the Sibenik cathedral, is a model of several separate buildings. The most important building (the cathedral) is the most detailed one, and all of the selected viewpoints are located inside it. This model also has a regular layout (as most architectural models do), but a lot of triangles are occluded by walls. This is the main difference from the Conference room model.

7.2 Primary rays

As can be seen in Table 7.2, the if-if traversal is the slowest of all methods for the primary rays. This is interesting since it is reported to be faster than the packet traversal [1]. The reason for this discrepancy is unknown to me. Possible candidates are a worse BVH tree construction algorithm or simply a poor implementation of the mentioned method. It is important to realize that the poor performance of the if-if traversal is reflected also in the performance numbers for the hybrid methods. The sequential stack max traversal is sometimes slightly faster than the packet traversal. When rendering the conference scene the constant for the
