Design and Implementation of Parallel Light Transport Algorithms based on Quasi-Monte Carlo Ray Tracing


AGH University of Science and Technology
Faculty of Electrical Engineering, Automatics, Computer Science and Electronics
Institute of Computer Science

Dissertation for the degree of Doctor of Philosophy in Computer Science

Design and Implementation of Parallel Light Transport Algorithms based on Quasi-Monte Carlo Ray Tracing

mgr inż. Michał Radziszewski

Supervisor: dr hab. inż. Krzysztof Boryczko, prof. n. AGH

Kraków, June 2010


Abstract

Photorealistic rendering is a part of computer graphics which concentrates on creating images and animations based on 3D models. Its goal is the creation of pictures that are indistinguishable from real world scenes. This work is dedicated to a particular class of photorealistic rendering algorithms: global illumination based on quasi-Monte Carlo ray tracing. Global illumination is a very useful concept for creating realistically lit images of artificial 3D scenes. By computing a vast diversity of optical phenomena automatically and correctly, it enables the creation of rendering software which allows specification of what is to be rendered instead of a detailed description of how to render a given scene. Its current applications range from many CAD systems to special effects in movies. In the future, when computers become sufficiently powerful, real time global illumination may be the best choice for computer games and virtual reality. Currently, only Monte Carlo and quasi-Monte Carlo ray tracing based algorithms are general enough to support full global illumination. Unfortunately, they are very slow compared to other techniques, e.g. hardware accelerated rasterization. The main purpose of this thesis is an improvement of the efficiency of physically correct rendering. The thesis concentrates on enhancing the robustness of rendering algorithms, as well as on their parallel realization. These two elements together can substantially increase the applicability of global illumination, and are a step towards the ultimate goal of being able to run true global illumination in real time.

Acknowledgements

I am deeply indebted to my supervisor, Prof. Krzysztof Boryczko. Without his inspiration, advice and encouragement this work would not have been completed. I would like to sincerely thank Dr Witold Alda from the AGH University of Science and Technology. During work on this dissertation we have spent many hours in conversations, and the expertise I gathered working with him contributed substantially to the dissertation. I would like to acknowledge the financial support from the AGH University of Science and Technology: two one-year scholarships for PhD students. Finally, I would also like to acknowledge ZPORR (Zintegrowany Program Operacyjny Rozwoju Regionalnego) for the scholarship Małopolskie Stypendium Doktoranckie, co-funded by the European Union.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Global Illumination and its Applications
  1.2 Thesis Purpose and Original Contributions
  1.3 Thesis Organization

2 Light Transport Theory
  2.1 Geometric Optics
    2.1.1 Assumptions
    2.1.2 Radiometric Quantities
  2.2 Light Transport Equation
    2.2.1 Surface Only Scattering
    2.2.2 Volumetric Scattering Extension
    2.2.3 Properties of Scattering Functions
    2.2.4 Analytic Solutions
    2.2.5 Simplifications
  2.3 Image Formation
    2.3.1 Importance
    2.3.2 Integral Formulation
    2.3.3 Image Function

3 Monte Carlo Methods
  3.1 Monte Carlo Integration
    3.1.1 Statistical Concepts
    3.1.2 Estimators of Integrals
    3.1.3 Biased and Unbiased Methods
  3.2 Variance Reduction Techniques
    3.2.1 Importance and Multiple Importance Sampling
    3.2.2 Russian Roulette and Splitting
    3.2.3 Uniform Sample Placement
  3.3 Quasi-Monte Carlo Integration
    3.3.1 Desired Properties and Quality of Sample Sequences
    3.3.2 Low Discrepancy Sequences
    3.3.3 Randomized Quasi-Monte Carlo Sampling
    3.3.4 Comparison of Monte Carlo and Quasi-Monte Carlo Integration
    3.3.5 Quasi-Monte Carlo Limitations

4 Light Transport Algorithms
  4.1 Ray Tracing vs. Other Algorithms
    4.1.1 View Dependent vs. View Independent Algorithms
    4.1.2 Ray Tracing Algorithms
    4.1.3 Hardware Accelerated Rasterization
    4.1.4 Radiosity Algorithms
  4.2 Light Transport Paths
    4.2.1 Classification of Paths
    4.2.2 Construction of Paths
    4.2.3 Local Path Sampling Limitation
  4.3 Full Spectral Rendering
    4.3.1 Necessity of Full Spectrum
    4.3.2 Representing Full Spectra
    4.3.3 Efficient Sampling of Spectra
    4.3.4 Results and Discussion
  4.4 Analysis of Selected Light Transport Algorithms
    4.4.1 Path Tracing
    4.4.2 Bidirectional Path Tracing
    4.4.3 Metropolis Light Transport
    4.4.4 Irradiance and Radiance Caching
    4.4.5 Photon Mapping
  4.5 Combined Light Transport Algorithm
    4.5.1 Motivation
    4.5.2 Merging of an Unbiased Algorithm with Photon Mapping
    4.5.3 Results and Conclusion

5 Parallel Rendering
  5.1 Stream Processing
    5.1.1 Stream Processing Basics
    5.1.2 Extended Stream Machines with Cache
    5.1.3 Stream Monte Carlo Integration
  5.2 Parallel Ray Tracing Algorithm
    5.2.1 Initialization and Scene Description
    5.2.2 Frame Buffer as an Output Stream
    5.2.3 Multipass Rendering
    5.2.4 Ray Tracing on an Extended Stream Machine
    5.2.5 Results and Conclusions
  5.3 Choice of Optimal Hardware
    5.3.1 Shared Memory vs. Clusters of Individual Machines
    5.3.2 Multiprocessor Machines vs. Graphics Processors
    5.3.3 Future-proof Choice
  5.4 Interactive Visualization of Ray Tracing Results
    5.4.1 Required Server Output
    5.4.2 Client and Server Algorithms
    5.4.3 MIP-mapping Issues
    5.4.4 Results and Discussion

6 Rendering Software Design and Implementation
  6.1 Core Functionality
    6.1.1 Interface
    6.1.2 Quasi-Monte Carlo Sampling
    6.1.3 Ray Intersection Computation
    6.1.4 Spectra and Colors
    6.1.5 Extension Support
  6.2 Procedural Texturing Language
    6.2.1 Functional Languages
    6.2.2 Syntax and Semantic
    6.2.3 Execution Model and Virtual Machine
    6.2.4 API
    6.2.5 Results and Conclusion
  6.3 New Glossy Reflection Models
    6.3.1 Related Work
    6.3.2 Properties of Reflection Functions
    6.3.3 Derivation of Reflection Function
    6.3.4 Results and Conclusions

7 Results
  7.1 Image Comparison
  7.2 Full Spectral Rendering
  7.3 Comparison of Rendering Algorithms

8 Conclusion
  8.1 Contributions Summary
  8.2 Final Thoughts and Future Work

Bibliography
Index

List of Symbols

$\Phi$ - radiant flux
$E$ - irradiance
$L$ - radiance
$W$ - importance
$\lambda$ - wavelength
$f_r$ - bidirectional reflection distribution function (BRDF)
$f_t$ - bidirectional transmission distribution function (BTDF)
$f_s$ - bidirectional scattering distribution function (BSDF)
$f_p$ - phase function
$N$ - normal direction
$\omega$ - unit directional vector
$\omega_i$ - incident ray direction
$\omega_o$ - outgoing ray direction
$\theta$ - angle between $\omega$ and $N$
$x, y$ - points either on a surface $A$ or in a volume $V$
$\widehat{x_2 - x_1}$ - normalized vector pointing from $x_1$ to $x_2$, $x_1 \neq x_2$
$\bar{x}$, $\bar{x}^{[k]}$ - a light transport path; a path with $k$ segments and $k + 1$ vertexes
$\Lambda$ - space of all visible wavelengths, $\Lambda = [\lambda_{min}, \lambda_{max}]$
$\Omega$ - space of all unit directional vectors $\omega$
$\Omega^+$ - space of all unit directional vectors such that $\omega \cdot N \ge 0$
$A$ - space of all points on scene surfaces
$V$ - space of all points in scene volume, without surfaces
$X$ - space of all light transport paths
$\mu$ - arbitrary measure
$\sigma(\omega)$ - solid angle measure
$\sigma^\perp(\omega)$ - projected solid angle measure
$A(x)$ - area measure
$V(x)$ - volumetric measure
$\sigma_a$ - absorption coefficient
$\sigma_s$ - scattering coefficient
$\sigma_e$ - extinction coefficient
pdf - probability density function
cdf - cumulative distribution function
$\xi$ - canonical uniform random variable
$\delta$ - Dirac delta distribution

List of Figures

2.1 Radiance definition
2.2 Results of simplifications of light transport equation
3.1 Uniform sample placement
3.2 Quasi-Monte Carlo sampling patterns
3.3 Comparison of Monte Carlo and quasi-Monte Carlo integration error
3.4 Undesired correlations between QMC sequences
3.5 Rendering with erroneous QMC sampling
4.1 An example light path
4.2 Extended light path notation
4.3 Difficult light path
4.4 Local path sampling limitation
4.5 Full spectral and RGB reflection
4.6 Full spectral and RGB refraction on prism
4.7 Different methods for wavelength dependent scattering
4.8 Selection of optimal number of spectral samples
4.9 Various methods of sampling spectra
4.10 Imperfect refraction with dispersion
4.11 Analysis of behaviour of spectral sampling
4.12 A path generated by Path Tracing algorithm
4.13 Results of Path Tracing
4.14 Simplified Path Tracing
4.15 Batch of paths generated with BDPT
4.16 Batch of paths generated with optimized BDPT
4.17 Comparison of BDPT and Photon Mapping
4.18 One pass versus two pass Photon Mapping
4.19 Quickly generated images with one pass Photon Mapping
4.20 Light transport algorithms comparison
4.21 Light transport algorithms comparison
5.1 Stream machine basic architecture
5.2 Extended stream machine architecture
5.3 Test scenes for parallel rendering
5.4 Parallel rendering run times
5.5 Main loop of visualization client process
5.6 Glare effect applied as a postprocess on the visualization client
5.7 Comparison of MIP-mapping and custom filtering based blur quality
5.8 Results of Interactive Path Tracing
5.9 Results of Interactive Photon Mapping
5.10 Noise reduction based on variance analysis of Path Tracing
6.1 Ray-primitives interaction
6.2 Semi-transparent surfaces intersection optimization
6.3 Comparison between different gamut mapping techniques
6.4 Textured and untextured models
6.5 Sample procedural texture
6.6 Images generated using noise primitive
6.7 Procedurally defined materials
6.8 Mandelbrot and Julia Fractals
6.9 Comparison of different glossy BRDFs with little gloss
6.10 Latitudal scattering only
6.11 Longitudal scattering only
6.12 Product of latitudal and longitudal scattering
6.13 Scattering with perpendicular and grazing illumination
6.14 Complex dragon model rendered with our glossy material
7.1 Comparison of spectral rendering algorithms
7.2 Full spectral rendering of a scene with imperfect refraction
7.3 Rendering of indirectly visible caustics

List of Tables

4.1 Numerical error of spectral sampling
4.2 Spectral functions for RGB colors
6.1 Texturing language grammar
6.2 Notation used in BRDF derivation
7.1 Convergence of spectral sampling techniques
7.2 Convergence of selected rendering algorithms

List of Algorithms

4.1 Optimized Metropolis sampling
4.2 Construction of photon maps
5.1 Kernel for Monte Carlo integration
5.2 Single- and multipass rendering on extended stream machine
5.3 Rasterization of point samples by Visualization client
5.4 Visualization client repaint processing
6.1 Gamut mapping by desaturation
6.2 Sample usage of procedural texturing

Chapter 1

Introduction

Photorealistic rendering is a part of computer graphics which concentrates on creating still images and animations based on 3D models. Its goal is the creation of pictures that are indistinguishable from real world scenes. This work is dedicated to a particular class of photorealistic rendering algorithms: ray tracing based global illumination. In the rest of this chapter, a brief description of global illumination is presented, followed by an outline of the most interesting original contributions, and finally, the thesis organization.

1.1 Global Illumination and its Applications

The term global illumination is used to name two distinct types of capabilities of rendering algorithms; there are two commonly used and substantially different definitions of it. According to the first definition, global illumination effects are simply opposed to local illumination effects, where only direct lighting is accounted for while rendering any part of the scene. Thus, any phenomenon which depends on knowledge of other scene parts while rendering a given primitive is a global illumination effect. Such effects are, for example, shadow casting and environment mapping. On the other hand, according to the second definition, any rendering algorithm capable of global illumination must be able to simulate all possible interreflections of light between scene primitives. Since the second definition is much more useful and precise, it is used throughout the rest of the thesis.

Global illumination is a very useful concept for creating realistically lit images of artificial 3D scenes. By computing a vast diversity of optical phenomena automatically and correctly, it creates a solid basis for rendering software which allows specification of what is to be rendered instead of a detailed description of how to render a given scene. Its current applications range from many CAD systems to special effects in movies. Global illumination algorithms are responsible for the rendering process. Within the framework presented later they are easy to implement; however, the design of an effective, robust and physically correct global illumination algorithm is a difficult and still not fully solved task. By physical correctness we understand complete support of geometric optics based phenomena. Currently, only ray tracing based algorithms are general enough to render all geometric optics effects. Unfortunately, the price to pay for such automatization is that the evaluation of global illumination is slow compared to other techniques, e.g. hardware accelerated rasterization. However, it is widely believed that, when computers become fast enough, global illumination is likely to replace other, less physically accurate techniques. This statement is often supported by the fact that a similar breakthrough is already happening in the modeling domain, where physical models compete with more traditional approaches with good effect. For example, nowadays it is possible to model a cloth animation either as a mesh with a given mass, stiffness, etc., letting the model calculate its appearance over time, or as a sequence of keyframes, each frame laboriously

created by an animator. Currently both models can be run in real time, while the simulation was not plausible a few years ago.

1.2 Thesis Purpose and Original Contributions

The main purpose of this thesis is to improve the efficiency of physically correct rendering. The idea is to provide working software with a rigorous theoretical basis. Our assumption is to avoid two extremes: overly theoretical work, without care about algorithm implementation, and algorithms created by the method of trial and error, designed to produce good enough visual effects, which sometimes work and sometimes do not. The latter approach is, unfortunately, surprisingly common in real time graphics, even though it does not provide any information about algorithm correctness. Moreover, accounting for hardware development trends is very important for us, since this can significantly affect the performance of algorithms.

The thesis concentrates on enhancement of the robustness of rendering algorithms, as well as their parallel realization. These two elements together can substantially increase global illumination applicability, and are a step towards the ultimate goal of being able to run true global illumination in real time. In the effort to realize it, this dissertation provides several original contributions to the field of computer graphics. This section enumerates the most important and interesting of them, listed in the order in which they appear in the thesis.

Light transport theory. Typically, light transport theory assumes creation of an image from pixels by convolution of luminance with a filter function associated with each pixel. There is nothing incorrect with this concept, but it limits the generality of rendering algorithms. Instead, we represent the image as a 3D function defined over $[0, 1]^2 \times \Lambda$, where the unit square represents the image film surface and $\Lambda$ is the space of light wavelengths. This generalization allows using much more sophisticated post-processing techniques; it makes invalid, however, any rendering algorithm dependent on the pixel abstraction. This approach is explained in detail in Section 2.3.

Full spectral rendering. Despite being proven incorrect, a lot of global illumination algorithms are designed to use an RGB color model. Only a few of the most advanced approaches attempt to accurately simulate visually pleasing full spectral phenomena. We have designed and implemented improved full spectral support, based on Multiple Importance Sampling. This new technique is much more efficient at simulating non-idealized wavelength dependent phenomena, and fits elegantly into Monte Carlo and quasi-Monte Carlo ray tracing algorithms. The novel spectral sampling technique is defined in Section 4.3.

Light transport algorithms. First, we have adapted Photon Mapping to be a one pass technique. The modified algorithm starts by storing just a few photons, and later the photon map size is increased. This allows rendering images with progressive quality improvement, a feature impossible to obtain in the original, two pass variants of this technique. Surprisingly, the dynamic photon map comes at little performance penalty for typical rendered scenes. Second, we have provided a detailed potential error prediction technique for some of the most significant ray tracing algorithms. This feature is then used in the presented new rendering algorithm, which tries to select the most appropriate method to render any part of an image. Both algorithms are presented in Sections 4.4 and 4.5.

Parallel rendering.
We provide an extension to the stream processor model to support read-write memory. The extension guarantees coherency of all pieces of written data, but the order of different reading and writing operations is not preserved. Therefore, the correctness of any algorithm must not depend on the content of this kind of memory, but the algorithm may use it to accelerate its operations. This mechanism is used in the parallel implementation of the one pass version of photon mapping, as well as in our new rendering algorithm. The extended stream machine is described in Chapter 5. Moreover, we have designed and implemented an interactive viewer of

ray tracing results, based on the processing power of GPUs. The viewer works in parallel with the CPU based renderer, which generates new data while the previous is displayed. This concept is explained in Section 5.4.

Sampling oriented interface. We have designed an interface between 3D objects, cameras and rendering algorithms based entirely on sampling. This interface provides a clear abstraction layer, general enough to express the majority of ray tracing algorithms. It is designed to simplify the implementation of bidirectional light transport methods. Furthermore, the interface provides support for spectral rendering and a carefully designed quasi-Monte Carlo sequence generation infrastructure. Additionally, we have developed a technique for storing 2D surfaces and 3D participating media in the same ray intersection acceleration structure, using the sampling interface. If a rendered scene contains a roughly similar number of different 2D and 3D entities, the improved algorithm is nearly twice as fast as an algorithm using two separate structures. The design of this interface is explained in Section 6.1.

Materials. We have provided a shading language optimized for ray tracing. The new concept is based on the usage of a functional language for this purpose. The language is computationally complete and enables easy creation of complex material scripts. The script style resembles mathematical notation much more than classic imperative programming languages do. The language is presented in Section 6.2. Moreover, we have developed a new glossy reflection model, which is both symmetric and energy preserving. The derivation of its formulae is given in Section 6.3.

1.3 Thesis Organization

The second and third chapters provide a theoretical basis used in the rest of the thesis. The second chapter presents a brief introduction to light transport theory under the assumption of geometric optics applicability. It explains how illumination in a 3D scene can be described mathematically as an integral equation (the so-called Light Transport Equation), and how to use its solution to create images. The third chapter describes Monte Carlo integration as a general purpose numerical technique, giving details on selected approaches to improve its efficiency.

The fourth chapter shows how to apply Monte Carlo integration to solve the Light Transport Equation, which leads to a variety of so-called non-deterministic ray tracing algorithms. The main point of this chapter is, however, the analysis of strong and weak points of major existing light transport algorithms, as well as the proposal of a new algorithm, designed to efficiently cope with many of their issues. Finally, this chapter describes how to efficiently incorporate full spectral rendering into the presented methods.

The fifth chapter illustrates the potential of parallel rendering. Furthermore, it gives the idea of how to express ray tracing as an algorithm dedicated to a slightly extended streaming machine, and describes a variety of hardware as potential candidates for the implementation of the streaming machine. This chapter also presents an interactive previewer of ray tracing results.

The sixth chapter presents the design of rendering software for developing and evaluating light transport algorithms, and some of the most interesting implementation details. The seventh chapter provides the most important results, and contains a detailed comparison of the efficiency of all presented light transport algorithms.
The last chapter summarizes the original contributions, and finally, it presents some ideas for future work dedicated to rendering.

Chapter 2

Light Transport Theory

The light transport theory provides a theoretical basis for the image creation process. The theory used in this thesis is based on the assumption of geometric optics applicability. It describes the equilibrium radiance distribution over an entire scene, and additionally, a method for the calculated radiance to form a final image. In computer graphics, light transport was first described formally by Kajiya [Kajiya, 1986]. More recent works which cover this area are [Veach, 1997] and [Pharr & Humphreys, 2004]. Arvo's course [Arvo, 1993] and Pauly's thesis [Pauly, 1999] provide a detailed description of light transport in volumes.

This chapter starts with an explanation of the assumptions of geometric optics. Next, it presents the equation describing radiance distribution, in both forms: for surfaces placed in vacuum only, and for volumetric effects as well. Because all these equations are commonly known, they are discussed briefly. Finally, it is shown how computed radiance is used to form a final image. The presented image formation theory is a modified approach, designed to remove the assumption that the image is built from pixels.

2.1 Geometric Optics

There are a few physical theories describing the behaviour of light. The most important of them are: geometric optics, wave optics and quantum optics. In simplification, geometric and wave optics explain transport of light, and quantum optics describes its interaction with matter. Each of these theories predicts selected real-world phenomena with a certain degree of accuracy, and likely none of them is completely correct. The choice of a physical theory for simulation is a tradeoff between the desired accuracy of a solution and its computational cost. An interesting fact is that, for computer graphics needs, the simplest theory, geometric optics, typically is sufficient to provide a high degree of realism. Geometric optics based rendering is perfectly fine at capturing phenomena such as soft shadows, indirect lighting, dispersion, etc. However, despite its physical simplicity, very few rendering systems provide full support of geometric optics based rendering, and any application which attempts to do so is far too slow to be used in real time.

2.1.1 Assumptions

The theoretical model of geometric optics is based on the following simplifying assumptions, which significantly improve the computation efficiency and still allow simulation of the majority of commonly seen phenomena:

- the number of photons is huge while the photon energies are extremely small, so any distribution of photons may be treated as a continuous value;
- photons do not interact with each other, thus effects such as interference cannot be simulated;

- photon collisions with surfaces and with particles of non transparent volumes (e.g. fog or dust) are elastic, which means that photons cannot change wavelength during scattering;
- diffraction, continuously varying refractive indexes and all other phenomena which could affect the movement of photons are neglected, so between collisions photons travel along straight lines;
- the speed of photons is infinitely large, so the scene is assumed to be always in an equilibrium state;
- optical properties of materials do not depend on illumination power, therefore illumination is linear, i.e. it can be computed independently for each light source and summed to form the final result.

Using all these assumptions, light transport can be described by means of radiometric quantities, such as flux or radiance. Some phenomena which require solving the wave equation of light transport, like diffraction or interference, cannot be simulated using these quantities at all. However, some other selected non-geometric effects can be easily simulated by a simple extension of these quantities. For example, spectral radiance is used to simulate wavelength dependent effects. In a similar way radiance can be extended to support polarization. Moreover, scattering can be extended to support fluorescence. If spectral radiance is represented as a vector, then reflectivity can be represented as a matrix and a scattering event as a matrix-vector multiplication. The simplified case of elastic photon scattering is then represented with diagonal matrices. The implemented software extends geometric optics to support spectral radiance only.

2.1.2 Radiometric Quantities

Under the assumption of applicability of geometric optics it is enough to use a few radiometric quantities to fully describe light transport in any 3D scene. Each quantity is defined by measuring the distribution of radiation energy with respect to some parameters. Any of these quantities can be defined in a standard and a spectral (denoted by dependence on $\lambda$) version.

Radiant Flux

Radiant flux is defined as radiated energy per time:

$$\Phi(t) = \frac{dQ(t)}{dt}, \qquad \Phi_\lambda(t, \lambda) = \frac{d^2 Q(t, \lambda)}{dt\,d\lambda}. \quad (2.1)$$

Radiant flux is measured in watts. This quantity is useful for the description of total emission from a 3D object or a single light source.

Irradiance

Irradiance at a point x is defined as radiant flux per area:

$$E(x) = \frac{d\Phi(x)}{dA(x)}, \qquad E_\lambda(x, \lambda) = \frac{d^2 \Phi_\lambda(x, \lambda)}{dA(x)\,d\lambda}. \quad (2.2)$$

Irradiance is measured in watts per square meter. It is used to describe how strong the illumination on a given surface is.
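Each spectral density defined here integrates over the visible wavelength range back to its standard counterpart, which follows directly from the definitions; for flux and irradiance:

$$\Phi(t) = \int_\Lambda \Phi_\lambda(t, \lambda)\,d\lambda, \qquad E(x) = \int_\Lambda E_\lambda(x, \lambda)\,d\lambda.$$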

Radiance

Radiance is considered to be the most basic quantity in radiometry, and is defined as power per area per solid angle (see Figure 2.1):

$$L(x, \omega) = \frac{d^2\Phi(x, \omega)}{dA^\perp(x)\,d\sigma(\omega)}, \qquad L_\lambda(x, \omega, \lambda) = \frac{d^3\Phi_\lambda(x, \omega, \lambda)}{dA^\perp(x)\,d\sigma(\omega)\,d\lambda}, \quad (2.3)$$

where $dA^\perp(x)$ is a projected area measure. Radiance is measured in watts per square meter per steradian. It may be rewritten in a more convenient form:

$$L(x, \omega) = \frac{d^2\Phi(x, \omega)}{|\omega \cdot N(x)|\,dA(x)\,d\sigma(\omega)} = \frac{d^2\Phi(x, \omega)}{dA(x)\,d\sigma^\perp(\omega)}, \quad (2.4)$$

which uses the standard area measure on surfaces. Light transport equations are based on radiance, which has a useful property: it is constant when light travels along straight lines in vacuum. To render an image it is enough to know the radiance on the camera lens; however, some techniques try to compute radiance everywhere. During scattering of photons it is important to distinguish between incident and outgoing radiance. These quantities (defined at the same x) are often marked as $L_i$ and $L_o$.

Figure 2.1: Radiance of a conical bundle of rays is defined as the derivative of bundle power with respect to the angular divergence $d\omega$ of the cone and the perpendicular area $dA$ of its base. The radiance can be measured on a surface non-perpendicular to the bundle direction by projecting the measured area.

Radiant Intensity

Radiant intensity is defined as power per unit solid angle:

$$I(\omega) = \frac{d\Phi(\omega)}{d\sigma(\omega)}, \qquad I_\lambda(\omega, \lambda) = \frac{d^2\Phi_\lambda(\omega, \lambda)}{d\sigma(\omega)\,d\lambda}. \quad (2.5)$$

Radiant intensity is measured in watts per steradian. It is used to describe how strong the illumination in a particular direction is. In strictly physically based systems, the radiant intensity from any particular point x is always zero; however, the $I(x, \omega)$ quantity is useful in the description of emission from point-based light sources, which are commonly used in modeling.

Volume Emittance

Volume emittance is similar to radiance, but is defined with respect to volume instead of surface:

$$L_v(x, \omega) = \frac{d^2\Phi(x, \omega)}{dV(x)\,d\sigma(\omega)}, \qquad L_{v,\lambda}(x, \omega, \lambda) = \frac{d^3\Phi_\lambda(x, \omega, \lambda)}{dV(x)\,d\sigma(\omega)\,d\lambda}, \quad (2.6)$$

where $V(x)$ is the volume measure. Volume emittance is measured in watts per cubic meter per steradian. This quantity is used to describe volumetric phenomena, for example emission from correctly modeled fire.
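A short worked example ties these measures together: for radiance constant over the hemisphere above a point, $L(x, \omega) \equiv L$, definitions 2.2 and 2.4 give

$$E(x) = \int_{\Omega^+} L\,|\omega \cdot N(x)|\,d\sigma(\omega) = L \int_{\Omega^+} d\sigma^\perp(\omega) = \pi L,$$

i.e. a surface under uniform illumination receives an irradiance of $\pi$ times the incident radiance.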

2.2 Light Transport Equation

All rendering algorithms have to solve the light transport problem. In 1986 J. Kajiya [Kajiya, 1986] first noticed that his theoretically correct (under the assumption of geometric optics applicability) approach solves the equation described in his paper, and that all other available algorithms make simplifications of some kind, trading the accuracy of the solution for speed. This section presents in short the derivation of this equation and its extension to support light transport in volumes. Next, it explains the properties of scattering functions. Then, it presents analytical solutions to the simplest cases of this equation, and finally, it describes the simplifications of the light transport equation made by various accelerated rendering algorithms.

2.2.1 Surface Only Scattering

The following derivation is due to [Veach, 1997]. The global light transport equation is based on a formal definition of local surface scattering. Whenever a beam of light from a direction $\omega_i$ hits a surface at a point x, it generates irradiance equal to:

$$dE(x, \omega_i) = L_i(x, \omega_i)\,|\omega_i \cdot N(x)|\,d\sigma(\omega_i) = L_i(x, \omega_i)\,d\sigma^\perp(\omega_i). \quad (2.7)$$

It can be observed that the radiance reflected from a particular point on a surface, $L_o$, is proportional to the irradiance at that point: $dL_o(x, \omega_o) \propto dE(x, \omega_i)$. The Bidirectional Scattering Distribution Function (BSDF), called the scattering function for short, is, by definition, the constant of this proportionality:

$$f_s(x, \omega_i, \omega_o) = \frac{dL_o(x, \omega_o)}{dE(x, \omega_i)} = \frac{dL_o(x, \omega_o)}{L_i(x, \omega_i)\,d\sigma^\perp(\omega_i)}. \quad (2.8)$$

The local surface scattering equation is defined as:

$$L_s(x, \omega_o) = \int_\Omega f_s(x, \omega_o, \omega_i)\,L_i(x, \omega_i)\,d\sigma^\perp(\omega_i). \quad (2.9)$$

This equation describes how much light is reflected from a surface point x in a particular direction $\omega_o$, knowing the incident illumination $L_i$. The total light outgoing from a particular surface point x in a particular direction $\omega_o$ is a sum of scattered light $L_s$ and emitted light $L_e$:

$$L_o(x, \omega_o) = L_e(x, \omega_o) + \int_\Omega f_s(x, \omega_o, \omega_i)\,L_i(x, \omega_i)\,d\sigma^\perp(\omega_i). \quad (2.10)$$

Incident radiance at a particular surface can be computed using outgoing radiance from another surface:

$$L_i(x, \omega_i) = L_o(T(x, \omega_i), -\omega_i). \quad (2.11)$$

The ray casting operator $T(x, \omega)$ finds the nearest ray-surface intersection point for a ray starting from x in direction $\omega$. In order to avoid the special case when a ray escapes to infinity, the whole scene may be enclosed in a huge, ideally black sphere. Equation 2.11 holds because radiance does not change as light travels along straight lines in vacuum. Substituting 2.11 into 2.10 leads to the final form of the rendering equation:

$$L(x, \omega_o) = L_e(x, \omega_o) + \int_\Omega f_s(x, \omega_o, \omega_i)\,L(T(x, \omega_i), -\omega_i)\,d\sigma^\perp(\omega_i). \quad (2.12)$$

The incident radiance $L_i$ appears no more in this equation, so the subscript from $L_o$ is dropped. This equation is valid for spectral radiance $L_\lambda$ as well.

2.2.2 Volumetric Scattering Extension

The classic rendering equation cannot handle light transport in so-called participating media. These media affect radiance when light travels between surfaces. The assumption of surfaces placed in

vacuum is a good approximation when rays travel short distances in clean air. It fails, however, in the simulation of phenomena such as dusty or foggy atmosphere, emission from fire, or large open environments, where light paths may be many kilometres long. The vacuum version of the rendering equation (2.12) can be extended to support volumetric effects by modification of the ray casting operator T to account for radiance changes. The extensions are due to [Arvo, 1993], [Pauly, 1999] and [Pharr & Humphreys, 2004].

Participating media may affect a ray by increasing or decreasing its radiance while it travels. The increase is due to in-scattering and emission, and the decrease due to out-scattering and absorption. The whole participating medium may be described by the definition of three coefficients (absorption, emission and scattering) at every point of 3D space. Moreover, at every point at which the scattering coefficient is larger than zero, there must also be provided a phase function, which plays a role similar to the BSDF in the classic equation. The emission from a volumetric medium is described by the following expression:

$$\frac{dL(x + t\omega, \omega)}{dt} = L_{ve}(x + t\omega, \omega), \quad (2.13)$$

where $L_{ve}(x, \omega)$ is volume emittance. The absorption coefficient is defined as follows:

$$\frac{dL(x + t\omega, \omega)}{dt} = -\sigma_a\,L(x + t\omega, \omega). \quad (2.14)$$

Intuitively, the absorption coefficient describes how many times radiance is decreased when light travels a unit distance. The absorption coefficient is measured in $[\mathrm{m}^{-1}]$, and can take any nonnegative real value. Out-scattering decreases ray radiance in a similar way as absorption, but uses the scattering coefficient $\sigma_s$ instead of $\sigma_a$. The total decrease of ray radiance is expressed by the extinction coefficient $\sigma_e = \sigma_a + \sigma_s$. The fraction of light which is transmitted between points x and $x + s\omega$ (beam transmittance) is given by the following formula:

$$tr(x, x + s\omega) = \exp\left(-\int_0^s \sigma_e(x + t\omega, \omega)\,dt\right). \quad (2.15)$$

Beam transmittance has two useful properties:

$$tr(x_1, x_2) = tr(x_2, x_1), \quad (2.16)$$
$$tr(x_1, x_3) = tr(x_1, x_2)\,tr(x_2, x_3). \quad (2.17)$$

The scattering in the participating medium is described by:

$$L_{vs}(x, \omega) = \sigma_s(x) \int_\Omega f_p(x, \omega_o, \omega_i)\,L_i(x, \omega_i)\,d\sigma(\omega_i). \quad (2.18)$$

The radiance added to a ray per unit distance due to in-scattering and emission can be expressed as:

$$L_{vo}(x, \omega) = L_{ve}(x, \omega) + L_{vs}(x, \omega). \quad (2.19)$$

Assuming that a ray travels through the participating medium infinitely long, i.e. it never hits a surface, the total ray radiance change due to the participating medium is:

$$L_i(x, \omega) = \int_0^\infty tr(x, x + t\omega)\,L_{vo}(x + t\omega, \omega)\,dt. \quad (2.20)$$

When participating media are mixed with surfaces, similarly as in surface-only rendering, the ray casting operator T can be used to find the nearest ray-surface intersection. Let $y = T(x, \omega_i)$ and

$s = \|x - y\|$. The radiance $L_i$ incoming at a surface can then be expressed in terms of the radiance outgoing from another surface, modified by the encountered participating media:

$$L_i(x, \omega_i) = tr(x, y)\,L_o(y, -\omega_i) + \int_0^s tr(x, x + t\omega_i)\,L_{vo}(x + t\omega_i, \omega_i)\,dt. \quad (2.21)$$

Expression 2.21 can be substituted into 2.10, which leads to the light transport equation generalized to support participating media mixed with surfaces:

$$L(x, \omega_o) = L_e(x, \omega_o) + \int_\Omega f_s(x, \omega_o, \omega_i)\left[tr(x, y)\,L(y, -\omega_i) + \int_0^s tr(x, x + t\omega_i)\,L_{vo}(x + t\omega_i, \omega_i)\,dt\right] d\sigma^\perp(\omega_i). \quad (2.22)$$

Similarly as in the surface-only version, the subscript from $L_o$ is dropped, and the equation holds for spectral radiance as well. The generalized equation is substantially more complex than the surface-only version, and therefore it may be expected that rendering participating media dramatically hurts performance. Because of that, many existing volumetric rendering algorithms make simplifications of some kind to this general form of the volumetric rendering equation.

2.2.3 Properties of Scattering Functions

The domain of the scattering function $f_s$, which describes scattering from all surfaces, is the whole $\Omega$. This function is often defined as a union of simpler functions: reflection $f_r$ (Bidirectional Reflection Distribution Function, BRDF) and transmission $f_t$ (Bidirectional Transmission Distribution Function, BTDF). Reflection happens when both $\omega_i$ and $\omega_o$ are on the same side of a surface. Transmission happens when $\omega_i$ and $\omega_o$ are on opposite sides of the surface. Transmission is typically modeled as one two-directional function. Therefore, for reflection only the direction of the surface normal N is important, while for transmission the direction as well as the sign of N has to be accounted for.

Scattering functions have several important properties. First, to conform to the laws of physics, BRDFs must be symmetric, i.e. swapping incident and outgoing directions must not change the BRDF value:

$$\forall\,\omega_i, \omega_o \in \Omega^+ \quad f_r(\omega_i, \omega_o) = f_r(\omega_o, \omega_i). \quad (2.23)$$

However, when a surface transmits light, the BTDF typically is not symmetric, but the asymmetry is strictly defined as a function of the refraction coefficients of the media on the opposite sides of the surface. The same rule applies to phase functions, but obviously in this case there is no difference between reflection and transmission:

$$\forall\,\omega_i, \omega_o \in \Omega \quad f_p(\omega_i, \omega_o) = f_p(\omega_o, \omega_i). \quad (2.24)$$

Moreover, all BRDFs and BTDFs (and therefore BSDFs) must be energy conserving, i.e. surfaces cannot reflect more light than they receive:

$$\forall\,\omega_o \in \Omega \quad R(\omega_o) = \int_\Omega f_s(\omega_i, \omega_o)\,d\sigma^\perp(\omega_i) \le 1. \quad (2.25)$$

An analogous relationship holds for phase functions:

$$\forall\,\omega_o \in \Omega \quad R(\omega_o) = \int_\Omega f_p(\omega_i, \omega_o)\,d\sigma(\omega_i) = 1. \quad (2.26)$$

Equation 2.26 differs from 2.25 in two ways. First, the integration is done with respect to the ordinary solid angle, and second, there is a strict requirement that the phase function is a probability distribution (i.e. it must integrate to one).
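The bound 2.25 can be verified numerically for a concrete BRDF with the Monte Carlo machinery of Chapter 3. The following minimal sketch (not part of the thesis software; the Lambertian test BRDF and all names are illustrative assumptions) estimates the reflectivity $R(\omega_o)$ of an isotropic BRDF by uniform hemisphere sampling:

    #include <cmath>
    #include <cstdio>
    #include <random>

    const double kPi = 3.14159265358979323846;

    // Estimate R(omega_o) = integral over the hemisphere of f_s * cos(theta_i)
    // (eq. 2.25), sampling directions uniformly by solid angle, pdf = 1/(2*pi).
    // For uniform hemisphere sampling cos(theta_i) is itself uniform on [0, 1),
    // and for an isotropic BRDF the azimuth integrates out.
    double estimate_reflectivity(double (*brdf)(double cos_i), int n) {
        std::mt19937 gen(42);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        const double pdf = 1.0 / (2.0 * kPi);
        double sum = 0.0;
        for (int i = 0; i < n; ++i) {
            double cos_i = u(gen);
            sum += brdf(cos_i) * cos_i / pdf;  // f_s * cos / pdf, one MC sample
        }
        return sum / n;
    }

    int main() {
        // Hypothetical test BRDF: Lambertian with albedo 0.8, f_r = 0.8 / pi.
        // Energy conservation requires the estimate to stay at or below 1.
        double r = estimate_reflectivity([](double) { return 0.8 / kPi; }, 1000000);
        std::printf("R = %.4f (analytic value 0.8)\n", r);
    }

For the matte BRDF $f_r \equiv k/\pi$ used by the radiosity algorithms discussed below, the estimate converges to k, confirming that such a BRDF conserves energy for any $0 < k < 1$.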

2.2.4 Analytic Solutions

The light transport equation can be solved analytically only in trivial cases, useless in practice. However, these analytical solutions, despite being unable to produce any interesting image, can provide a valuable tool for testing light transport algorithms. Obviously, such tests cannot definitely prove that an algorithm which passes them is correct, but nevertheless they provide an aid in removing algorithm errors and evaluating their speed of convergence.

A very simple test scene is a unit sphere, with constant emission $L_e(x, \omega) \equiv \frac{1}{2\pi}$ and constant BRDF $f_r(x, \omega_i, \omega_o) \equiv \frac{1}{2\pi}$ for each point on the sphere and each direction inside the sphere. Since the scene is symmetric (neither geometry nor emission or scattering can break the symmetry), the radiance L measured along each ray inside the sphere must be identical; the BRDF integrates to a reflectivity of $R = \frac{1}{2}$, so the equilibrium radiance satisfies $L = L_e + RL$, i.e. $L = 2L_e = \frac{1}{\pi}$. This scene can be made a bit more complex with a BRDF which is not necessarily constant, but still with a constant reflectivity of R = 0.5. The sphere can be filled with a homogenous participating medium with an absorption coefficient $\sigma_a \equiv 0$ and arbitrary scattering coefficient $\sigma_s$. These modifications must not change the result returned by the tested light transport algorithms.

2.2.5 Simplifications

A variety of rendering algorithms, especially those which run in real time, do not attempt to solve the full light transport equation. These algorithms may be described theoretically by the (substantially) simplified equations which they actually solve. Figure 2.2 shows the results of simplifications of the light transport equation compared with a full global illumination solution. The simplified results are given without any additional terms, like ambient light; they just show what the simplified equations actually describe.

First, one-pass hardware accelerated rasterization implemented in commonly available libraries, such as OpenGL or DirectX, computes only one scattering event and supports only point light sources. Therefore rasterization solves the following equation:

$$L(x, \omega_o) = \sum_{i=1}^{n} f_r\!\left(x, \omega_o, \widehat{y_i - x}\right) \frac{I_i}{\|x - y_i\|^2}, \quad (2.27)$$

where n is the number of light sources, $y_i$ is the position of the ith light source, and $I_i$ is its radiant intensity. Only recently, due to advancements in programmable graphics hardware [Rost & Licea-Kane, 2009], the $f_r$ function can be arbitrarily complex, and non-point lights can be approximated with reasonable precision. Such dramatic simplifications result in the possibility of real time rendering, but at the cost of very poor illumination quality.

The classic Whitted ray tracer [Whitted, 1980] (see Section 4.1.2) handles multiple reflections, but only from ideal specular surfaces and for point light sources. This may be seen as replacing the integral with a sum of radiances of ideally reflected and transmitted rays:

$$L(x, \omega_o) = \sum_{i=1}^{n} f_r\!\left(x, \omega_o, \widehat{y_i - x}\right) \frac{I_i}{\|x - y_i\|^2} + \alpha L(T(x, \omega_r), -\omega_r) + \beta L(T(x, \omega_t), -\omega_t), \quad (2.28)$$

where $\omega_r$ is the reflected ray direction, $\omega_t$ is the transmitted ray direction, and $\alpha$ and $\beta$ are color coefficients such that $0 \le \alpha < 1$, $0 \le \beta < 1$ and $\alpha + \beta < 1$. An improved approach, Cook's Distributed Ray Tracing [Cook et al., 1984], computes the right hand integral, but only once for area light sources or twice for glossy reflections, so it also does not support global illumination. On the other hand, the radiosity algorithms [Cohen & Wallace, 1993] (see Section 4.1.4) handle multiple reflections, but are limited to matte surfaces only.
They solve the full light transport equation, but with the assumption that $f_r \equiv \frac{k}{\pi}$, $0 < k < 1$. There exist less restrictive radiosity algorithms, but they are impractical due to excessive memory consumption. Similarly, the volumetric version of the light transport equation is often simplified. In the rasterization approach, this equation is typically ignored completely: the color of all scene elements exponentially

fades to an arbitrarily chosen fog color with the distance from the viewer. This is a very poor approximation, and it cannot simulate the majority of volumetric effects, like visible beams of light. However, that is the high price of the ability to run rasterization in real time. On the other hand, physically based rendering systems use less drastic simplifications of volumetric effects. For example, Pharr and Humphreys [Pharr & Humphreys, 2004] implemented a single scattering approximation, which seems to be a reasonable trade-off between rendering speed and image quality.

Figure 2.2: Difference between results of the hardware rasterization algorithm (left), classic ray tracing (middle) and full global illumination (right). All guessed lighting terms (e.g. ambient light) were disabled deliberately, to show what a given technique actually computes.

2.3 Image Formation

This section shows how to create an image from computed radiance. It starts by describing a camera as a device emitting importance. Next, it explains the conversion of the light transport equation domain to points only, from points and unit direction vectors. Finally, it shows how the equation can be rewritten as an integral over the space of all light transport paths, which sometimes is easier to solve. These formulae are based on [Veach, 1997], [Pauly, 1999] and [Pharr & Humphreys, 2004], with the modification that the image is a function defined on real values instead of an array of numbers.

2.3.1 Importance

Cameras may be formally described as devices emitting importance. Importance might be thought of as hypothetical particles, like photons, which propagate from the camera, against the direction of light flow. Intuitively, tracing importance approximates how important for a rendered image various scene parts are. That is, the importance distribution alone is enough to define a camera, and different cameras have different importance distributions. The creation of an image is defined as the integral of the product of emitted importance (from the camera) and radiance (from light sources):

$$I^i = \int_{A_{lens}} \int_\Omega W_e^i(x_{lens}, \omega)\,L_i(x_{lens}, \omega)\,d\sigma^\perp(\omega)\,dA(x_{lens}), \quad (2.29)$$

where $I^i$ is the ith pixel, $W_e^i$ is its emitted importance, and the lens is treated as a special surface in the scene: an idealized measuring device which is able to record radiance while not interfering with light transport. The pixel dependence can be removed in the following way. Let the image domain be a unit square, i.e. $[0, 1]^2$, and u, v image coordinates: $(u, v) \in [0, 1]^2$. The importance is defined as a function of the point $x_{lens}$ and direction $\omega$, as well as the location on the image plane (u, v). The image function I is then evaluated using the expression:

$$I(u, v) = \int_{A_{lens}} \int_\Omega W_e(u, v, x_{lens}, \omega)\,L_i(x_{lens}, \omega)\,d\sigma^\perp(\omega)\,dA(x_{lens}). \quad (2.30)$$

The importance $W_e$ can be obtained from $W_e^i$ using a filter function:

$$W_e(u, v) = \sum_i W_e^i f_i(u, v). \quad (2.31)$$

The image equation 2.29 uses importance $W_e^i$ based on a particular filter during the image creation process. This seriously limits the available image post-processing algorithms. On the other hand, the basic modification in equation 2.30 removes this flaw. The modification seems very simple in the theoretical formulation of light transport, but it has significant implications for the design of rendering algorithms and their parallelization, described later. Obviously, both these equations are correct for spectral radiance as well. The functional representation of the image, however, can cause potential problems when emitted radiance $L_e$ described with a $\delta$ distribution is directly visible. For example, point light sources rendered with pinhole cameras cannot be represented with finite I(u, v) values. Therefore, all algorithms implemented for the purpose of this thesis explicitly omit directly visible lights described with $\delta$ distributions.

2.3.2 Integral Formulation

The light transport equation can be reformulated as an integral over all possible light transport paths. The following transformation is based on [Veach, 1997] and [Pharr & Humphreys, 2004]. First, radiance and scattering functions can be expressed in different domains:

$$L(x_1, \omega) = L(x_1 \to x_2), \quad \text{where } \omega = \widehat{x_2 - x_1}, \quad (2.32)$$

$$f_s(x_2, \omega_o, \omega_i) = f_s(x_3 \to x_2 \to x_1), \quad \text{where } \omega_o = \widehat{x_1 - x_2} \text{ and } \omega_i = \widehat{x_3 - x_2}. \quad (2.33)$$

The projected solid angle measure $\sigma^\perp(\omega)$ is transformed to an area measure A(x) as follows:

$$d\sigma^\perp(\omega) = V(x_1 \leftrightarrow x_2)\,\frac{|\cos\theta_1|\,|\cos\theta_2|}{\|x_1 - x_2\|^2}\,dA(x_2) = G(x_1 \leftrightarrow x_2)\,dA(x_2), \quad (2.34)$$

where $\omega = \widehat{x_2 - x_1}$, $\theta_1$ and $\theta_2$ are the angles between $\omega$ and $N(x_1)$ or $N(x_2)$, respectively, and $V(x_1 \leftrightarrow x_2)$ is the visibility factor, which is equal to 1 if $x_1$ and $x_2$ are mutually visible, and 0 otherwise. Substituting 2.33, 2.32 and 2.34 into 2.12 leads to a light transport equation defined over all scene surfaces instead of over all direction vectors:

$$L(x_2 \to x_1) = L_e(x_2 \to x_1) + \int_A f_s(x_3 \to x_2 \to x_1)\,L(x_3 \to x_2)\,G(x_3 \leftrightarrow x_2)\,dA(x_3), \quad (2.35)$$

and, assuming $x_0 \equiv x_{lens}$ and $W_e \equiv W_e(u, v)$, the image creation equation:

$$I(u, v) = \int_{A^2} W_e(x_0 \to x_1)\,L(x_1 \to x_0)\,G(x_1 \leftrightarrow x_0)\,dA(x_1)\,dA(x_0). \quad (2.36)$$

Next, recursive substitution of the right hand expression of 2.35 into L in 2.36 yields:

$$I(u, v) = \int_{A^2} W_e(x_0 \to x_1)\,G(x_1 \leftrightarrow x_0)\,L_e(x_1 \to x_0)\,dA(x_1)\,dA(x_0)\; +$$
$$+ \int_{A^3} W_e(x_0 \to x_1)\,G(x_1 \leftrightarrow x_0)\,f_s(x_2 \to x_1 \to x_0)\,G(x_2 \leftrightarrow x_1)\,L_e(x_2 \to x_1)\,dA(x_2)\,dA(x_1)\,dA(x_0)\; +$$
$$+ \int_{A^4} W_e(x_0 \to x_1)\,G(x_1 \leftrightarrow x_0)\,f_s(x_2 \to x_1 \to x_0)\,G(x_2 \leftrightarrow x_1)\,f_s(x_3 \to x_2 \to x_1)\,G(x_3 \leftrightarrow x_2)\,L_e(x_3 \to x_2)\,dA(x_3) \cdots dA(x_0)\; + \cdots$$
$$= \sum_{i=0}^{\infty} \int_{A^{i+2}} W_e(x_0 \to x_1)\,\alpha_i\,L_e(x_{i+1} \to x_i)\,d\mu(\bar{x}_i), \quad (2.37)$$

where

$$\alpha_i = G(x_1 \leftrightarrow x_0) \prod_{j=1}^{i} f_s(x_{j+1} \to x_j \to x_{j-1})\,G(x_{j+1} \leftrightarrow x_j),$$
$$d\mu(\bar{x}_i) = dA(x_0)\,dA(x_1) \cdots dA(x_{i+1}).$$

The expressions 2.12, 2.35 and 2.37 for evaluating radiance L are obviously equivalent, but different rendering algorithms often prefer one particular form over another.

The volumetric version of the light transport equation written as an integral over light paths was formulated by Pauly [Pauly, 1999]. The main idea of this concept is integration over paths built from any combination of surface and volume scattering. Let $b_i$ be the ith bit of the binary representation of b, $b \in \mathbb{N}$. Let $\bar{x} = x_0 x_1 \ldots x_k$ be a light transport path with k + 1 vertexes. The integration domain is defined as:

$$\Psi_b^k = \psi_0 \times \psi_1 \times \cdots \times \psi_k, \qquad \psi_i = \begin{cases} A, & \text{if } b_i = 0, \\ V, & \text{if } b_i = 1. \end{cases} \quad (2.38)$$

The integration measure is:

$$d\mu_b^k(\bar{x}) = d\psi_0(x_0)\,d\psi_1(x_1) \cdots d\psi_k(x_k), \qquad d\psi_i(x_i) = \begin{cases} dA(x_i), & \text{if } b_i = 0, \\ dV(x_i), & \text{if } b_i = 1. \end{cases} \quad (2.39)$$

The geometric factor is redefined as:

$$G_x(x_1 \leftrightarrow x_2) = V(x_1 \leftrightarrow x_2)\,tr(x_1, x_2)\,\frac{c(x_1)\,c(x_2)}{\|x_1 - x_2\|^2}, \qquad c(x) = \begin{cases} |\cos\theta|, & \text{if } x \in A, \\ 1, & \text{if } x \in V. \end{cases} \quad (2.40)$$

The emitted radiance is

$$L_{ex}(x_2 \to x_1) = \begin{cases} L_e(x_2 \to x_1), & \text{if } x_2 \in A, \\ L_{ve}(x_2 \to x_1), & \text{if } x_2 \in V, \end{cases} \quad (2.41)$$

and the scattering function is

$$f_x(x_2 \to x_1 \to x_0) = \begin{cases} f_s(x_2 \to x_1 \to x_0), & \text{if } x_1 \in A, \\ \sigma_s f_p(x_2 \to x_1 \to x_0), & \text{if } x_1 \in V. \end{cases} \quad (2.42)$$

Finally, the integral formulation of the volumetric light transport equation is:

$$I(u, v) = \sum_{i=0}^{\infty} \sum_{b=1}^{2^{i+1}-1} \int_{\Psi_{2b}^{i+1}} W_e(x_0 \to x_1)\,\alpha_{ix}\,L_{ex}(x_{i+1} \to x_i)\,d\mu_{2b}^{i+1}(\bar{x}), \quad (2.43)$$

where $\alpha_{ix}$ is similar to $\alpha_i$, but is defined with $G_x$ and $f_x$ instead of G and $f_s$. This equation implicitly makes the common sense assumption that $x_0$, which is a point on the camera lens, is always a surface point, i.e. no volumetric sensors are allowed and importance is always emitted from the lens surface. However, radiance emission from a volume is accounted for properly.

2.3.3 Image Function

In the basic version, the image contains only intensity values. However, we found that the depth of the first scattering event (relative to the camera) is potentially very useful in many postprocessing techniques. Thus, the image function obtained during the rendering process is:

$$I : [0, 1]^2 \times \Lambda \to \mathbb{R}^+, \quad (2.44)$$
$$I_d : [0, 1]^2 \to \mathbb{R}^+ \cup \{\infty\} \cup \{\tilde{V}\}, \quad (2.45)$$

where $\infty$ means that the ray escapes to infinity and $\tilde{V}$ means that the ray was scattered in a participating medium. In the latter case there is no way to reliably define the depth of the first intersection.
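The codomain $\mathbb{R}^+ \cup \{\infty\} \cup \{\tilde{V}\}$ of $I_d$ maps naturally onto a tagged union. A minimal sketch of how one sample of these image functions could be stored (the type and field names are illustrative assumptions, not the thesis implementation):

    #include <variant>

    // Depth record of one image sample (eq. 2.45): a finite first-hit depth,
    // a ray escaped to infinity, or scattering inside a participating medium,
    // where no surface depth can be defined reliably.
    struct EscapedToInfinity {};
    struct ScatteredInMedium {};
    using Depth = std::variant<double, EscapedToInfinity, ScatteredInMedium>;

    // One sample of the pixel-free image function I(u, v, lambda) (eq. 2.44).
    struct ImageSample {
        double u, v;      // film coordinates in the unit square [0, 1]^2
        double lambda;    // sampled wavelength from the visible range
        double value;     // spectral radiance estimate
        Depth  depth;     // depth of the first scattering event, I_d(u, v)
    };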

Chapter 3

Monte Carlo Methods

The light transport equation has an analytical solution only for very simple scenes, which are useless in practice, thus appropriate numerical algorithms are necessary to solve it. Classic quadrature rules, e.g. Newton-Cotes or Gaussian quadratures, are not well suited to light transport. Light transport integrals are very high-dimensional and the integrated functions are discontinuous, which results in poor convergence of quadrature rules based on regular grids. However, algorithms based on non-deterministic sampling of the integrated functions give much better results.

Non-deterministic algorithms use random numbers to compute the result. According to [Pharr & Humphreys, 2004], they can be grouped into two broad classes: Las Vegas algorithms, where random numbers are used only to accelerate computations in the average case, with a final deterministic result (e.g. Quick Sort with a randomly selected pivot), and Monte Carlo algorithms, which give the correct result on average. For example, the result of Monte Carlo integration is not certainly correct, but nevertheless has strict probabilistic bounds on its error. All non-deterministic rendering algorithms, from a mathematical point of view, can be seen as variants of Monte Carlo integration.

The purpose of this chapter is the explanation of statistical methods used in rendering algorithms. Since these methods are well-known, the discussion is brief. Mathematical statistics is explained in detail in [Plucińska & Pluciński, 2000]. Good reference books on general Monte Carlo methods are [Fishman, 1999] and [Gentle, 2003], and [Niederreiter, 1992] on quasi-Monte Carlo techniques. Applications of Monte Carlo methods in computer graphics are presented in [Veach, 1997] and [Pharr & Humphreys, 2004]. The rest of this chapter starts with a short review of statistical terms and basic Monte Carlo integration techniques. Next, the distinction between biased and unbiased integration algorithms is explained. This is followed by a description of the most useful variance (i.e. error) reduction techniques. Finally, some quasi-Monte Carlo methods are presented.

3.1 Monte Carlo Integration

This section starts with a description of concepts employed in statistics. These ideas are then used to construct estimators which approximate integrals of arbitrary functions. Finally, there is a brief analysis of the error and convergence rate of Monte Carlo integration, based on variance.

3.1.1 Statistical Concepts

Let $\Psi$ be a set (called the sample space).¹ Let $\mathcal{A}$ be a $\sigma$-algebra on $\Psi$ and $\mathcal{B}$ be the $\sigma$-algebra of Borel sets on $\mathbb{R}$. A measure P defined on $\mathcal{A}$ is a probability if $P(\Psi) = 1$. The function $X : \Psi \to \mathbb{R}$

¹ Typically the sample space is denoted by $\Omega$. However, in this work $\Omega$ is reserved for the space of unit directional vectors, thus the sample space is denoted by $\Psi$ to avoid confusion.

is a single-dimensional random variable if:

$$\forall x \in \mathbb{R} \quad X^{-1}((-\infty, x)) = \{\psi : X(\psi) < x\} \in \mathcal{A}. \quad (3.1)$$

The probability distribution of a random variable is defined as:

$$P_X(S) = P(\{\psi : X(\psi) \in S\}), \quad S \in \mathcal{B}. \quad (3.2)$$

The cumulative distribution function (cdf) is:

$$cdf(x) = P_X((-\infty, x)), \quad x \in \mathbb{R}. \quad (3.3)$$

The cumulative distribution function cdf(x) may be interpreted as the probability of the event that the randomized value of X happens to be less than or equal to the given x:

$$cdf(x) = Pr\{X \le x\}. \quad (3.4)$$

The corresponding probability density function (pdf or p) is:

$$pdf(x) = \frac{d\,cdf(x)}{dx}. \quad (3.5)$$

Let $X_1, X_2, \ldots, X_n$ be random variables and $\mathcal{B}^n$ be the $\sigma$-algebra of Borel sets on $\mathbb{R}^n$. The vector $X = (X_1, X_2, \ldots, X_n)$ is a multidimensional random variable if:

$$P_X(S) = P(\{\psi : X(\psi) \in S\}), \quad S \in \mathcal{B}^n. \quad (3.6)$$

The cdf of a multidimensional random variable is:

$$cdf(x) = P_X((-\infty, x)), \quad x \in \mathbb{R}^n, \quad (3.7)$$

and its corresponding pdf is:

$$pdf(x) = \frac{\partial^n cdf(x)}{\partial x_1\,\partial x_2 \cdots \partial x_n}. \quad (3.8)$$

The relationship between pdf and cdf can be expressed in a more general way using measure theory:

$$pdf(x) = \frac{d\,cdf(x)}{d\mu(x)} \qquad \text{and} \qquad cdf(x) = \int_D pdf(y)\,d\mu(y). \quad (3.9)$$

The expected value of a random variable (single- or multidimensional) Y = f(X) is defined as:

$$E[Y] = \int_\Psi f(x)\,pdf(x)\,d\mu(x), \quad (3.10)$$

and its variance is:

$$V[Y] = E\left[(Y - E[Y])^2\right]. \quad (3.11)$$

Standard deviation $\sigma$, which is useful in error estimation, is defined as the square root of variance:

$$\sigma[X] = \sqrt{V[X]}. \quad (3.12)$$

Expected value and variance have the following properties for each $\alpha \in \mathbb{R}$:

$$E[\alpha X] = \alpha E[X], \quad (3.13)$$
$$V[\alpha X] = \alpha^2 V[X]. \quad (3.14)$$

The expected value of a sum of random variables is the sum of the expected values:

$$E\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} E[X_i]. \quad (3.15)$$

A similar equation holds for variance if and only if the random variables are independent. Using these expressions and some algebraic manipulation, variance can be reformulated as:

$$V[X] = E\left[(X - E[X])^2\right] = E\left[X^2 - 2XE[X] + E[X]^2\right] = E\left[X^2\right] - E[X]^2. \quad (3.16)$$
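As a concrete instance of definitions 3.10, 3.11 and 3.16, consider the canonical uniform random variable $\xi$ on [0, 1], whose pdf is identically 1:

$$E[\xi] = \int_0^1 x\,dx = \frac{1}{2}, \qquad E[\xi^2] = \int_0^1 x^2\,dx = \frac{1}{3}, \qquad V[\xi] = E[\xi^2] - E[\xi]^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.$$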

3.1.2 Estimators of Integrals

Let I be the integral to evaluate:

$$I = \int_\Psi f(x)\,d\mu(x). \quad (3.17)$$

The basic Monte Carlo estimator of this integral is:

$$I \approx F_N = \frac{1}{N} \sum_{i=1}^{N} \frac{f(X_i)}{pdf(X_i)}, \quad (3.18)$$

where $\forall x : f(x) \neq 0 \Rightarrow pdf(x) > 0$. Using the definition of expected value it may be found that the expected value of estimator 3.18 is equal to integral 3.17:

$$E[F_N] = E\left[\frac{1}{N} \sum_{i=1}^{N} \frac{f(X_i)}{pdf(X_i)}\right] = \frac{1}{N}\,N \int_\Psi \frac{f(x)}{pdf(x)}\,pdf(x)\,d\mu(x) = \int_\Psi f(x)\,d\mu(x), \quad (3.19)$$

thus the estimator 3.18 produces the correct result on average. The variance of such an estimator can be expressed as:

$$V[F_N] = V\left[\frac{1}{N} \sum_{i=1}^{N} X_i\right] = \frac{1}{N^2}\,V\left[\sum_{i=1}^{N} X_i\right] = \frac{1}{N^2} \sum_{i=1}^{N} V[X_i] = \frac{1}{N}\,V[F]. \quad (3.20)$$

This expression for variance is valid if and only if the $X_i$ are independent. The variance of a single sample V[F] is equal to:

$$V[F] = E[F^2] - E[F]^2 = \int_\Psi \frac{f^2(x)}{pdf(x)}\,d\mu(x) - \left(\int_\Psi f(x)\,d\mu(x)\right)^2. \quad (3.21)$$

The convergence rate of the estimator $F_N$ can be obtained from Chebyshev's inequality:

$$Pr\left\{|X - E[X]| \ge \sqrt{\frac{V[X]}{\delta}}\right\} \le \delta, \quad (3.22)$$

which holds for any fixed threshold $\delta > 0$ and any random variable X whose variance $V[X] < \infty$. Substitution of the estimator $F_N$ into Chebyshev's inequality yields:

$$Pr\left\{|F_N - I| \ge \frac{1}{\sqrt{N}} \sqrt{\frac{V[F]}{\delta}}\right\} \le \delta, \quad (3.23)$$

thus for any fixed threshold $\delta$ the error decreases at the rate $O(1/\sqrt{N})$.

3.1.3 Biased and Unbiased Methods

All non-deterministic integration algorithms can be divided into two fundamental groups: unbiased and biased. Bias $\beta$ is defined in statistics as the difference between the true value of the estimated quantity Q and the expected value of the estimator $F_N$:

$$\beta[F_N] = E[F_N] - Q. \quad (3.24)$$

Thus, unbiased algorithms produce correct results on average, without any systematic errors. However, the shortcoming of these algorithms is variance, which is likely to be larger than in biased ones. The variance appears as noise in the rendered image. Non-trivial scenes require a lot of computation to reduce this distracting artifact to an acceptable level. On the other hand, biased methods exhibit systematic error. For example, a given point in a rendered image can be

regularly too bright, regardless of the number of samples N evaluated. Biased methods can still be consistent (asymptotically unbiased), as long as the error decreases to zero with increasing amount of computation:

$$\lim_{N \to \infty} E[F_N] = Q. \quad (3.25)$$

The bias is usually difficult to estimate, and even if a rendered image does not appear noisy, it still can have substantial inaccuracies, typically seen as excessive blurring or illumination artifacts on fine geometrical details. However, despite non-zero bias, biased methods tend to converge substantially faster (they have lower variance) than unbiased ones.

3.2 Variance Reduction Techniques

The basic Monte Carlo estimator potentially can have high variance, which directly leads to poor efficiency. This section briefly reviews general methods which substantially reduce the variance of this estimator without introducing bias. On the other hand, biased methods used in computer graphics are often dedicated to particular light transport algorithms, and therefore are described in Chapter 4.

3.2.1 Importance and Multiple Importance Sampling

The variance of the Monte Carlo estimator 3.18 can be decreased when the pdf(x) is made more proportional to f(x). Intuitively, this approach tends to place relatively more samples wherever the integrand is large, therefore reducing the integration error. In particular, when $pdf(x) \propto f(x)$ is satisfied exactly, the variance is zero:

$$pdf(x) = c f(x), \qquad c = \left(\int_\Psi f(x)\,d\mu(x)\right)^{-1}, \quad (3.26)$$

and therefore

$$V[F] = \int_\Psi \frac{f^2(x)}{pdf(x)}\,d\mu(x) - \left(\int_\Psi f(x)\,d\mu(x)\right)^2 = c^{-1} \int_\Psi f(x)\,d\mu(x) - \left(\int_\Psi f(x)\,d\mu(x)\right)^2 = 0.$$

However, in order to achieve zero variance, the function must be integrated analytically to obtain the normalization constant c. This is impossible, since otherwise there would be no necessity of using Monte Carlo integration at all. Fortunately, using a pdf which is proportional, or almost proportional, to at least one of the factors of f typically decreases variance. This technique is called Importance Sampling.

Suppose that f(x) can be decomposed into a product $f(x) = f_1(x) f_2(x) \cdots f_n(x)$ and there exist probability densities $pdf_i$ proportional (or roughly proportional) to each factor $f_i$. If standard Importance Sampling is used, the $pdf_i$ used for sampling f(x) has to be chosen at algorithm design time. This can have disastrous consequences for algorithm efficiency if the chosen $pdf_i$ poorly matches the overall f(x) shape. In this case, Importance Sampling can actually increase variance over sampling with uniform probability. In such situations Multiple Importance Sampling [Veach & Guibas, 1995] can be used. This technique was designed to improve the reliability of Importance Sampling when the appropriate pdf cannot be chosen at design time. The main idea of this method is to define more than one pdf (each of them potentially a good candidate for importance sampling) and let the algorithm choose the best one at runtime, when the actual shape of the integrand is known. The algorithm does this by computing appropriate weights and returning the estimator as a weighted sum of samples from these pdfs:

$$F_{nm} = \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} w_i(X_{ij}) \frac{f(X_{ij})}{pdf_i(X_{ij})}, \qquad \forall x \ \sum_{i=1}^{n} w_i(x) = 1. \quad (3.27)$$
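A minimal sketch of estimator 3.27 for n = 2 strategies on a one-dimensional toy integrand (the integrand and both pdfs are illustrative assumptions; the weights use the balance heuristic defined next):

    #include <cmath>
    #include <cstdio>
    #include <random>

    // Toy integrand on [0, 1]; the exact value of the integral is 1.
    static double f(double x) { return 3.0 * x * x; }

    // Two candidate strategies: pdf1 is uniform, pdf2 = 2x roughly follows f.
    static double pdf1(double)   { return 1.0; }
    static double pdf2(double x) { return 2.0 * x; }

    int main() {
        std::mt19937 gen(7);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        const int m = 500000;                // samples per strategy
        double sum = 0.0;
        for (int j = 0; j < m; ++j) {
            double x1 = u(gen);              // sample from pdf1
            double x2 = std::sqrt(u(gen));   // sample from pdf2 by cdf inversion
            // balance heuristic weights, w_i = pdf_i / (pdf1 + pdf2) (eq. 3.28)
            double w1 = pdf1(x1) / (pdf1(x1) + pdf2(x1));
            double w2 = pdf2(x2) / (pdf1(x2) + pdf2(x2));
            sum += w1 * f(x1) / pdf1(x1) + w2 * f(x2) / pdf2(x2);
        }
        std::printf("estimate = %.4f (exact 1.0)\n", sum / m);
    }

The estimator remains unbiased for any weights summing to one, so the weight choice only affects variance.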

However, Multiple Importance Sampling raises an issue that has a large impact on the design and implementation of sampling routines. Standard Importance Sampling requires just two operations: sampling points x_i with a given probability pdf(x_i), and evaluating f(x_i). Multiple Importance Sampling additionally requires computing pdf(x) for an arbitrary argument x. Intuitively, this operation may be interpreted as "compute the hypothetical probability of returning the given x". It is usually more difficult than computing the probability during the sampling of x_i, since the algorithm has no knowledge of the random choices necessary to select an arbitrary value, as it has in the sampling procedure. Fortunately, when approximate probabilities are computed, the Multiple Importance Sampling estimator is still correct, but a crude approximation hurts performance.

3.2.2 Russian Roulette and Splitting

Russian roulette and splitting are designed to adaptively change the sampling density without introducing bias. These two techniques were introduced to computer graphics by Arvo and Kirk [Arvo & Kirk, 1990]. Suppose that an estimator F is a sum of estimators, F = F_1 + F_2 + \ldots + F_n. Russian roulette allows randomly skipping the evaluation of these terms:

    F_i' = \begin{cases} \dfrac{F_i - (1 - q_i) c}{q_i}, & \text{with probability } q_i, \\ c, & \text{with probability } 1 - q_i, \end{cases}    (3.29)

where c is an arbitrary constant, typically c = 0. When the estimator is a sum of an infinite number of terms, S = F_1 + F_2 + \ldots + F_i + \ldots, Russian roulette can still be applied, provided that the sum S is finite. Let S_i = F_i + S_{i+1} be the partial sum; then S can be re-expressed as S = S_1, S_1 = F_1 + S_2, S_2 = F_2 + S_3, and so on. Each sum S_i is evaluated with probability q_i, and set to 0 otherwise. Provided that at most a finite number of the q_i are equal to 1, and that for some \varepsilon > 0 we have q_i < 1 - \varepsilon for almost all q_i, the evaluation of the sum S is randomly terminated, with probability 1, after some finite number n of terms. This leads to the expression:

    S = \frac{1}{q_1} \left( F_1 + \frac{1}{q_2} \left( F_2 + \ldots + \frac{1}{q_n} F_n \ldots \right) \right) = \sum_{i=1}^{n} F_i \prod_{j=1}^{i} \frac{1}{q_j}.    (3.30)

Russian roulette, however, increases the variance of the estimator. Nevertheless, since it reduces its computational cost, Russian roulette can improve the estimator efficiency (the product of variance and cost) if the probabilities q_i are chosen carefully. Moreover, Russian roulette can be used to terminate the computation of an infinite series without introducing statistical bias.

Splitting works in the opposite direction to Russian roulette: it increases the number of samples in order to reduce variance. Splitting increases computation time but, if performed carefully, can nevertheless improve sampling efficiency. Following [Veach, 1997], the splitting technique replaces an estimator F_i by:

    F_i' = \frac{1}{n} \sum_{j=1}^{n} F_{ij},    (3.31)

where the F_{ij} are independent samples from F_i.
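The Russian roulette termination of equation 3.30 can be sketched on a toy infinite series; the constant continuation probability used below (q_1 = 1, q_i = 0.75 for i > 1) is an illustrative choice, not a recommendation.

    // A sketch of Russian roulette (3.29, 3.30) applied to the geometric
    // series S = sum of 0.5^(i-1), whose exact value is 2.
    #include <cstdio>
    #include <random>

    int main() {
        std::mt19937 rng(7);
        std::uniform_real_distribution<double> u01(0.0, 1.0);
        const int N = 200000;
        const double q = 0.75;  // continuation probability for i > 1
        double sum = 0.0;
        for (int i = 0; i < N; ++i) {
            double term = 1.0;    // F_i, here 0.5^(i-1)
            double weight = 1.0;  // inverse of accumulated survival probability
            double s = 0.0;
            while (true) {
                s += weight * term;        // F_i / (q_2 * ... * q_i)
                if (u01(rng) >= q) break;  // terminate with probability 1 - q
                weight /= q;               // compensate surviving partial sums
                term *= 0.5;
            }
            sum += s;  // unbiased despite the random truncation
        }
        std::printf("estimate = %f (exact = 2)\n", sum / N);
        return 0;
    }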

3.2.3 Uniform Sample Placement

Random samples tend to clump together, leaving large portions of the domain relatively empty. Such uneven distribution of samples leads to increased variance of estimators and therefore to low sampling efficiency. More advanced sampling techniques try to spread samples as evenly as possible over the entire integration domain, which is typically assumed to be a unit s-dimensional hypercube.

Stratified Sampling

Stratified sampling splits the integration domain \Psi into k non-overlapping smaller subdomains \Psi_1, \ldots, \Psi_k, called strata, and draws n_i samples from each \Psi_i. The total number of samples is not modified, but the samples are better distributed. Provided that no n_i is equal to zero, the result is still correct, and provided that each n_i is proportional to the relative volume of the respective \Psi_i, stratified sampling never increases variance. According to [Veach, 1997], stratified sampling works most efficiently if the mean values of the integrand in different strata are as different as possible; if the mean values are identical, stratified sampling does not help at all. An example stratified pattern of 16 samples is presented in Figure 3.1; each of the 16 strata contains exactly one sample.

Unfortunately, stratified sampling has two major drawbacks. First, when the dimensionality s of the integration domain \Psi is high, the number of subdomains k quickly becomes prohibitive. For example, suppose that s = 10 and stratification splits the domain into four parts along each dimension; then k = 4^10 = 1048576, which typically is far too much. Second, in order to optimally stratify the integration domain, the number of drawn samples should be known in advance. This is a major limitation if an algorithm is designed to draw as many samples as are necessary to achieve the desired accuracy of the solution.

Latin Hypercube Sampling

Latin hypercube sampling stratifies the projections of the integration domain onto each of the axes. The cost of this technique does not increase with the dimensionality s of the integration domain; however, Latin hypercube sampling does not provide multidimensional stratification. An example Latin hypercube pattern of 16 samples is presented in Figure 3.1; each of the 16 horizontal and each of the 16 vertical stripes contains exactly one sample. Latin hypercube sampling can be implemented very efficiently using the following formula:

    X_i^j = \frac{\pi_j(i) - \xi_i^j}{N},    (3.32)

where i is the sample number, j is the dimension number, N is the number of samples, \pi_j is the jth random permutation of the sequence of numbers \{1, 2, \ldots, N\}, and all \xi_i^j are independent canonical random numbers. Latin hypercube sampling works best when the one-dimensional components of the integrand are much more important than the others, i.e. f(x_1, x_2, \ldots, x_s) = f_1(x_1) + f_2(x_2) + \ldots + f_s(x_s) + f_{res}(x_1, x_2, \ldots, x_s) with f_{res}(x_1, x_2, \ldots, x_s) \ll f(x_1, x_2, \ldots, x_s). Nevertheless, the variance of Latin hypercube sampling is never much worse than the variance of common unstratified sampling:

    V'[F_N] \leq \frac{N}{N-1} V[F_N],    (3.33)

where V[F_N] is the variance of the estimator F_N using unstratified sampling and V'[F_N] is the variance of the same estimator with Latin hypercube sampling. Thus, in the worst case, Latin hypercube sampling results in the variance of standard sampling with one observation less. Since N has to be known in advance, Latin hypercube sampling, similarly to stratified sampling, does not allow an adaptive choice of the number of samples.
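A Latin hypercube pattern following equation 3.32 can be generated as sketched below; the sample count and dimensionality are illustrative, and a zero-based permutation with additive jitter is used, which is equivalent to the one-based form with subtraction.

    // A sketch of Latin hypercube sampling (3.32): one random permutation
    // per dimension, one jittered sample per stripe.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        std::mt19937 rng(13);
        std::uniform_real_distribution<double> u01(0.0, 1.0);
        const int N = 16, dims = 2;
        // Independent random permutations pi_j of {0, ..., N-1}.
        std::vector<std::vector<int>> perm(dims, std::vector<int>(N));
        for (int j = 0; j < dims; ++j) {
            std::iota(perm[j].begin(), perm[j].end(), 0);
            std::shuffle(perm[j].begin(), perm[j].end(), rng);
        }
        // X_i^j = (pi_j(i) + xi_i^j) / N with canonical jitter xi_i^j.
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < dims; ++j)
                std::printf("%.4f ", (perm[j][i] + u01(rng)) / N);
            std::printf("\n");
        }
        return 0;
    }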

Figure 3.1: Comparison of uniform sample placement techniques. Left image: reference pure random sampling. Middle image: stratified sampling. Right image: Latin hypercube sampling.

3.3 Quasi-Monte Carlo Integration

Almost all implementations of Monte Carlo algorithms use random numbers in theory and some kind of pseudorandom number generator in practice. This is not strictly correct, but if a sequence of pseudorandom numbers satisfies some constraints, the convergence of a Monte Carlo algorithm driven by a pseudorandom number generator does not differ much from purely random Monte Carlo. This approach may be pushed further: so-called quasi-Monte Carlo methods sample the integrated functions with carefully designed deterministic sequences of numbers, which, in practice, provides somewhat better convergence than random Monte Carlo.

This section starts with a brief explanation of selected methods of evaluating the quality of sequences of quasi-Monte Carlo samples. Next, it lists the properties which such sequences should have in order to be applicable to integrating the Light Transport Equation. Then, there is a short description of a few particularly useful methods of generating them, and finally, a comparison of Monte Carlo and quasi-Monte Carlo integration and a summary of quasi-Monte Carlo limitations and potential issues related to this approach.

3.3.1 Desired Properties and Quality of Sample Sequences

A sequence of sample points which is to be used to solve the Light Transport Equation by means of quasi-Monte Carlo integration must have at least three properties. First, the number of samples necessary to obtain a desired accuracy of the solution is not known in advance; therefore, the sequence must be infinite. That is, when a given method is used to obtain two sequences with n and n + 1 sample points, the n sample points of the first sequence must be a prefix of the second. Second, each individual light scattering event increases the dimensionality of the integration domain by two. Since, at least in theory, light can bounce indefinitely, the dimensionality of the sequence must be unlimited, too. Additionally, since individual samples are evaluated in parallel, it is desirable to be able to compute the ith sample point without prior evaluation of points 0, \ldots, i - 1. True random numbers clearly satisfy these properties, but with some drawbacks: they may be obtained only by means of external devices attached to a computer, the sequence obtained in one algorithm run is unrepeatable, and the integration error resulting from true random numbers is higher than when carefully designed deterministic sequences are used. Thus, infinite pseudorandom sequences which minimize the integration error have to be designed. The sequence domain is an s-dimensional unit hypercube, where s may be infinite. Sample points defined in the hypercube are then transformed to the necessary domain independently of the sample generation algorithm.

The quality of a low discrepancy sequence determines how large the mean error of integrating functions with sample points from the given sequence is. Intuitively, the more evenly the sample points are spread in the s-dimensional hypercube and in its lower dimensional projections, the better the sequence quality. Unfortunately, there is no perfect, precise measure of sequence quality which is directly related to the integration error. The commonly used measure is discrepancy.

Discrepancy considers axis-aligned boxes inside the unit hypercube and compares the volume of each box with the fraction of sample points falling into it; the discrepancy is the supremum of the absolute differences over all such boxes. The star discrepancy restricts the considered boxes to those which include the origin of the hypercube. The star discrepancy of N samples is then defined by the following formula:

    D_N^* = \sup_{b \in B} \left| \frac{\#\{x_i \in b\}}{N} - \mu(b) \right|,    (3.34)

where B is the set of all origin-anchored boxes inside the unit hypercube, b is such a box, \mu(b) is its volume and x_i is the ith sample point. The star discrepancy of true random numbers is, with probability one, asymptotically equal to:

    D_N^* = O\left( \sqrt{\frac{\log \log N}{N}} \right) \approx O\left( \frac{1}{\sqrt{N}} \right).    (3.35)

The best known low discrepancy sequences in s-dimensional space have a discrepancy of:

    D_N^* = O\left( \frac{(\log N)^s}{N} \right),    (3.36)

while regular grids have the following discrepancy:

    D_N^* = O\left( \frac{1}{\sqrt[s]{N}} \right),    (3.37)

which explains why grids behave so poorly when the dimensionality is large. Low discrepancy sequences obviously have the lowest discrepancy but, in practice, when s is large, the number of samples N at which these sequences start to outperform random numbers is far too large to be useful.

3.3.2 Low Discrepancy Sequences

There is a number of techniques for generating low discrepancy sequences. Simple and popular are methods based on radical inverses. Let the sample number be i. A radical inverse in a base b is found by evaluating the digits of the representation of i in the base b, and then reflecting the string of these digits around the decimal point. If i is written as:

    i = d_{n-1} d_{n-2} \ldots d_0 = \sum_{k=0}^{n-1} d_k b^k,    (3.38)

then the radical inverse r is:

    r = 0.d_0 d_1 \ldots d_{n-1} = \sum_{k=0}^{n-1} d_k b^{-k-1}.    (3.39)

The radical inverse in base 2 is called the van der Corput sequence. This sequence is one-dimensional, and cannot be converted to multiple dimensions by simply taking subsequent samples: e.g., if every even sample point is used as an x coordinate and every odd point as a y coordinate, all sample points fall onto diagonal lines across the 2D domain. The solution to this problem is the Halton sequence [Halton, 1960], which is built from radical inverses with a different base for each dimension, under the restriction that all bases must be relatively prime. Typically, the first s primes are used for sampling in an s-dimensional space. Unfortunately, the sequence quality, and therefore the integration accuracy, degrades quickly with increasing base.
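The radical inverse 3.39 and a Halton point built from radical inverses in relatively prime bases can be sketched as follows; bases 2 and 3 are the usual choice for two dimensions.

    // A sketch of the radical inverse (3.39); the 2D Halton point for index i
    // is (radicalInverse(i, 2), radicalInverse(i, 3)).
    #include <cstdio>

    double radicalInverse(unsigned i, unsigned base) {
        double r = 0.0;
        double digitValue = 1.0 / base;
        while (i > 0) {
            r += (i % base) * digitValue;  // digit d_k goes to position -(k+1)
            i /= base;
            digitValue /= base;
        }
        return r;
    }

    int main() {
        for (unsigned i = 0; i < 8; ++i)  // first points of a 2D Halton sequence
            std::printf("%u: (%f, %f)\n", i,
                        radicalInverse(i, 2), radicalInverse(i, 3));
        return 0;
    }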

A partial solution to this issue is the Faure sequence [Faure, 1982]. Suppose that the sampling domain is s-dimensional. The Faure sequence is constructed over a finite field of a prime order p not less than s. Since sequences built over the smallest possible primes have the best properties, the smallest such prime is typically chosen. This requirement is also an obvious drawback of the Faure sequence: s has to be known in advance, or a crude overestimation of it is necessary. The first coordinate of a point from a Faure sequence is a radical inverse in the base p. Subsequent coordinates are evaluated by multiplying the digit vector used to construct the radical inverse by a matrix defined as the Pascal triangle modulo p; the modified digit vector is then reflected around the decimal point. All operations are performed in the p-element finite field. Such multiplications can be performed p - 1 times before the results start to repeat. For example, the digit vector of the ith coordinate of a point P from a Faure sequence in base 3 is constructed as follows:

    P_i = C^i (d_0, d_1, d_2, d_3, d_4, d_5, \ldots)^T \bmod 3, \qquad C_{jk} = \binom{k}{j} \bmod 3, \qquad i = 0, 1, 2,    (3.40)

where C is the upper triangular Pascal matrix modulo 3 and (d_0, d_1, \ldots)^T is the digit vector.

There is an alternative concept for evaluating the quality of low discrepancy sequences: (t, m, s)-nets and (t, s)-sequences. For example, suppose that the sampling domain is two-dimensional and there are n^2 points in a sequence. Stratified sampling would then result in exactly one sample in each block [i/n, (i+1)/n] \times [j/n, (j+1)/n]. On the other hand, Latin hypercube sampling would place exactly one sample in each block [0, 1] \times [i/n^2, (i+1)/n^2] and [j/n^2, (j+1)/n^2] \times [0, 1]. It is desirable to satisfy stratified sampling, Latin hypercube sampling, and any other similar domain partitioning together; techniques for generating (t, m, s)-nets do exactly this. Let B be an elementary interval in the base b, which is an axis-aligned box inside the unit s-dimensional hypercube C:

    B = \prod_{j=1}^{s} \left[ \frac{t_j}{b^{k_j}}, \frac{t_j + 1}{b^{k_j}} \right),    (3.41)

where 0 \leq t_j < b^{k_j}. A (t, m, s)-net in a base b is then a set of N = b^m points placed in the hypercube C such that each elementary interval with volume b^{t-m} contains exactly b^t points, for each m \geq t \geq 0. Intuitively, t is a quality parameter, and the best nets are those with t = 0; in this case, each elementary interval contains exactly one sample, which results in a sample distribution as uniform as possible. A (t, s)-sequence in a base b is an infinite sequence in which, for each k > 0, every subsequence:

    x_{k b^m}, \ldots, x_{(k+1) b^m - 1}    (3.42)

forms a (t, m, s)-net in the base b. It is worth noting that the Faure sequence in a base b is actually a (0, s)-sequence in that base. Additionally, any (t, s)-sequence is a low discrepancy sequence. Unfortunately, the good properties of (t, m, s)-nets, and therefore of (t, s)-sequences in base b, are obtained only for sample counts N = b^m, m = 0, 1, 2, \ldots; thus, when N samples are not enough to obtain the desired accuracy, N should be increased b times. If b is large (from a practical point of view: larger than 2), (t, s)-sequences are of little usefulness in rendering. Moreover, base 2 is also convenient due to very efficient implementations of (t, s)-sequence generators, which perform logic operations on individual bits instead of relatively costly integer multiplication, division and modulus. Algorithms constructing (t, s)-sequences in base 2 are among the best choices for generating quasi-Monte Carlo sample points.

Unfortunately, for a given s there may exist no sequence with the desired quality t. In particular, it has been proven that the minimal possible t grows linearly with s, that is, t_min = O(s). For example, the best (t = 0) infinite sequence in base 2 exists only up to s = 2; if the quality requirement is relaxed a bit, to t = 1, the highest available s is 4. The most common solution, and the one chosen in our implementation, is to use a (t, s)-sequence for the first few dimensions of the s-dimensional space, and then fill the remaining dimensions with pseudorandom numbers, for example based on hashing functions. If the sampling space is defined so that the first few dimensions affect most of the integration error, which is typically satisfied by the integrals in the Light Transport Equation, a good quality, infinitely dimensional sequence is achievable in this way.
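The bit-level efficiency argument above is easy to illustrate: in base 2 the radical inverse is a plain reversal of the bits of the index, so one sample costs a handful of logic operations instead of integer division and modulus. The sketch below shows the classic 32-bit reversal.

    // A sketch of the base-2 van der Corput value computed by bit reversal.
    #include <cstdint>
    #include <cstdio>

    double vanDerCorputBase2(uint32_t i) {
        i = (i << 16) | (i >> 16);                               // swap halves
        i = ((i & 0x00ff00ffu) << 8) | ((i & 0xff00ff00u) >> 8);
        i = ((i & 0x0f0f0f0fu) << 4) | ((i & 0xf0f0f0f0u) >> 4);
        i = ((i & 0x33333333u) << 2) | ((i & 0xccccccccu) >> 2);
        i = ((i & 0x55555555u) << 1) | ((i & 0xaaaaaaaau) >> 1);
        return i * (1.0 / 4294967296.0);  // reversed bits as fraction in [0,1)
    }

    int main() {
        for (uint32_t i = 0; i < 8; ++i)
            std::printf("%u: %f\n", i, vanDerCorputBase2(i));
        return 0;
    }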

3.3.3 Randomized Quasi-Monte Carlo Sampling

Quasi-Monte Carlo methods use deterministic, carefully designed sample sequences in order to minimize the integration error. However, when these methods are used for rendering, regular sampling patterns are sometimes visible in the resulting images (see Figure 3.2, for example). What is more, all methods for estimating the random error by means of variance are invalid with deterministic quasi-Monte Carlo sampling. Nevertheless, despite these drawbacks, quasi-Monte Carlo sampling tends to produce less error, even when the integrands are discontinuous and high dimensional, which is common in graphics. These issues can be removed by randomized quasi-Monte Carlo methods [Hong & Hickernell, 2003]. High quality results are obtained by using randomly scrambled (t, m, s)-nets and (t, s)-sequences. The integration error of these algorithms can be analyzed by means of variance, and they still retain the good sample distribution properties of deterministic quasi-Monte Carlo samples.

Figure 3.2: Patterns resulting from quasi-Monte Carlo sampling. Top image: visible regular patterns due to the Niederreiter-Xing sequence. Bottom image: pseudorandom numbers do not produce noticeable patterns. Both images have been rendered at low quality to show the error more clearly.
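As a sketch of the idea, the snippet below randomizes a deterministic Halton pattern with a toroidal random shift, one of the simplest randomizations; the digit scrambling of [Hong & Hickernell, 2003] is more elaborate, but the principle is the same: independent randomized copies preserve the distribution quality while enabling variance estimation across copies.

    // A sketch of randomized quasi-Monte Carlo via a toroidal shift: each
    // randomized copy adds one random offset per dimension, modulo 1.
    #include <cmath>
    #include <cstdio>
    #include <random>

    double radicalInverse(unsigned i, unsigned base) {
        double r = 0.0, v = 1.0 / base;
        for (; i > 0; i /= base, v /= base) r += (i % base) * v;
        return r;
    }

    int main() {
        std::mt19937 rng(99);
        std::uniform_real_distribution<double> u01(0.0, 1.0);
        const double shiftX = u01(rng), shiftY = u01(rng);  // one shift per copy
        for (unsigned i = 0; i < 8; ++i) {
            double x = std::fmod(radicalInverse(i, 2) + shiftX, 1.0);
            double y = std::fmod(radicalInverse(i, 3) + shiftY, 1.0);
            std::printf("(%f, %f)\n", x, y);
        }
        return 0;
    }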

3.3.4 Comparison of Monte Carlo and Quasi-Monte Carlo Integration

It is interesting to compare the behaviour of Monte Carlo sequences and various techniques of quasi-Monte Carlo sampling. The comparison here is based on the integration of two functions. The first is a case well suited for QMC methods: a smooth 2D function given by f(x, y) = arctan(x + y), integrated over the square [-1, 1] \times [-1, 1]. The second is a more difficult example: a discontinuous 4D function g(x, y, z, w) = sign(x) sign(y) sign(z) sign(w) exp(x^2 + y^2 + z^2 + w^2), integrated over the analogous 4D cube. Both functions can be integrated analytically in an obvious way; the result in both cases is 0. True random numbers for Monte Carlo integration are simulated by a Mersenne Twister pseudorandom number generator [Matsumoto & Nishimura, 1998], while QMC is based on the Halton sequence and the Faure sequence in base 2 in the 2D case, and on the Niederreiter-Xing sequence [Niederreiter & Xing, 1996] in the 4D case. The error of integrating these functions with respect to the number of samples taken is shown in Figure 3.3.

Figure 3.3: Comparison of Monte Carlo and quasi-Monte Carlo integration error with respect to the number of samples taken (both plots show the absolute integration error versus the number of samples N, on logarithmic scales). Left image: integration error of the smooth 2D function f(x, y) = arctan(x + y) over [-1, 1]^2. Right image: integration error of the discontinuous 4D function g(x, y, z, w) = sign(x) sign(y) sign(z) sign(w) exp(x^2 + y^2 + z^2 + w^2) over [-1, 1]^4. In both cases QMC converges substantially faster, yet ordinary MC is better at small sample counts.

3.3.5 Quasi-Monte Carlo Limitations

Quasi-Monte Carlo sampling behaves substantially differently from classic random Monte Carlo sampling. If quasi-Monte Carlo is used, there are pitfalls which must be avoided; we found two of them worth mentioning. First, it is a serious error to select every nth sample from a QMC sequence (see Figure 3.4), while doing so is perfectly fine with random numbers. For example, if only the samples with even indexes are chosen from a van der Corput sequence, exactly half of the domain is sampled, and the other half contains no sample points. Second, samples from different sequences may correlate. For example, if Faure sequences in bases 2 and 3 are mixed, some 2D projections exhibit visible correlations, see Figure 3.4. Therefore, a single, well designed and provably correct multidimensional sequence must be used for integration. It is a serious error if, for example, camera rays are generated using one sequence and light sources are sampled with another. This important aspect influenced the design of our rendering software. Figure 3.5 presents what may happen if this is not assured.
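The first pitfall is easy to demonstrate, as the sketch below shows: taking only the even-indexed points of the base-2 van der Corput sequence leaves the whole interval [0.5, 1) unsampled.

    // A sketch of the every-nth-sample pitfall: even indexes end with a zero
    // digit in base 2, so every reflected value falls into [0, 0.5).
    #include <cstdio>

    double radicalInverse(unsigned i, unsigned base) {
        double r = 0.0, v = 1.0 / base;
        for (; i > 0; i /= base, v /= base) r += (i % base) * v;
        return r;
    }

    int main() {
        for (unsigned i = 0; i < 16; i += 2)  // even indexes only
            std::printf("x_%u = %f\n", i, radicalInverse(i, 2));
        return 0;
    }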

Figure 3.4: Undesired correlations between quasi-Monte Carlo samples. Left image: 2D projection of an erroneous 3D sequence generated from a van der Corput sequence, using the 3n, 3n + 1 and 3n + 2 sequence points as the coordinates of the nth 3D point. Right image: 2D projection of an erroneous 5D sequence constructed from two Faure sequences in bases 2 and 3. The projection in which the two dimensions come from sequences in different bases forms a 4 × 3 grid with differing sampling density between its individual rectangles.

Figure 3.5: Results of rendering with erroneous quasi-Monte Carlo sampling. The scene contains uniform gray matte walls and a sphere, illuminated by a rectangular light source with uniform intensity. All the conspicuous patterns are results of correlation between camera rays and light source sampling.

Chapter 4

Light Transport Algorithms

Light transport algorithms provide numerical solutions to the light transport equation; these algorithms are responsible for the rendering process. Currently, only Monte Carlo or quasi-Monte Carlo based ray tracing techniques are capable of solving the light transport equation in its original form, without any simplifications. This chapter concentrates on an analysis of inefficiencies and difficult input scenes for different ray tracing algorithms, as well as on the design of an improved algorithm which is not prone to the majority of these flaws.

Ray tracing and global illumination algorithms are not new. The earliest well known ray tracer was created by Whitted [Whitted, 1980]. Despite being a significant improvement over earlier methods, this algorithm still creates artificial-looking images. Its main drawbacks are the lack of global illumination, sharp shadows, sharp reflections, and support for point light sources only. An improved version of this algorithm is Distributed Ray Tracing [Cook et al., 1984], which can create smooth shadows and reflections, but still cannot simulate global illumination. The first physically correct solution (under the assumption of the applicability of geometric optics) is Path Tracing [Kajiya, 1986]. There is an earlier interesting algorithm for global illumination [Goral et al., 1984], but it is based on the radiosity technique and supports only ideal matte (Lambertian) reflection, and thus is not general enough to be considered a full global illumination solution. Two important rendering techniques, Russian roulette and Particle Tracing, were introduced by Arvo and Kirk [Arvo & Kirk, 1990]. Later, Particle Tracing led to bidirectional rendering techniques, and Russian roulette became one of the main probabilistic algorithms used in virtually every ray tracer.

The original Path Tracing method, despite being correct, is usually not particularly efficient. The improved algorithm of [Pharr & Humphreys, 2004] is somewhat better, but still cannot cope with many common lighting conditions. Most notably, all variants of Path Tracing fail when simulating caustics or strong indirect illumination. Since the appearance of Path Tracing, a lot of research has been dedicated to making this class of algorithms more effective. The important unbiased methods are Bidirectional Path Tracing [Veach, 1997] and Metropolis Light Transport [Veach, 1997]. The bidirectional algorithm handles indirect illumination and caustics much better than ordinary Path Tracing, due to its ability to trace light paths from the camera as well as from the light sources. The Metropolis algorithm is designed to handle cases in which much of the light is transported by relatively few paths, while the majority of paths have negligible or zero contribution. Its ability to create new paths by mutating previously found ones, together with carefully evaluated mutation acceptance probabilities, ensures the high efficiency of this algorithm. A major drawback of the Metropolis method, however, is that it cannot compute absolute image brightness. Moreover, the Metropolis algorithm does not stratify samples well. Kelemen et al. [Kelemen et al., 2002] proposed mutations defined over a unit cube, which increases the mutation acceptance rate. A more recent algorithm, Energy Redistribution Path Tracing [Cline et al., 2005], also uses a mutation scheme, but is free of some of the defects of Metropolis sampling.

The well known biased methods are Irradiance Caching [Ward et al., 1988, Ward & Heckbert, 1992] and Photon Mapping [Jensen, 2001].

Irradiance Caching is well suited only for diffuse scattering, while Photon Mapping works well with arbitrary reflection functions. Irradiance caching was recently extended to radiance caching [Křivánek, 2005, Křivánek et al., 2006, Jarosz et al., 2008]; the latter algorithm is capable of handling effects similar to those of Photon Mapping. The major limitation of Photon Mapping is its excessively large memory consumption, which makes the technique difficult to use in complex environments.

Since the appearance of the original Photon Mapping, a lot of work has been dedicated to improving or modifying it. The paper of Fradin et al. [Fradin et al., 2005] describes how Photon Mapping can be modified to effectively use external memory in order to render huge scenes, far beyond the capabilities of the original implementation. Fan et al. [Fan et al., 2005] illustrate how to incorporate the advantages of Metropolis sampling into the Photon Mapping algorithm. The original Photon Mapping fails when a scene contains difficult visibility conditions between the light sources and the camera, because many of the stored photons can potentially be invisible and therefore useless. The improved technique, on the other hand, tries to take the viewer position into account while building the photon map. This enhancement seems to be more reliable than the three pass version of Photon Mapping [Jensen, 2001]. The original final gathering was designed to substantially improve rendered image quality by reducing the blurriness resulting from using the photon map directly. However, this technique causes a lot of additional rays to be traced and therefore drastically hurts Photon Mapping rendering speed. Havran et al. [Havran et al., 2005] and Arikan et al. [Arikan et al., 2005] improve the efficiency of final gathering. Herzog et al. [Herzog et al., 2007] show a different algorithm for estimating irradiance when rendering the photon map; the technique is used to improve density estimation in diffuse or moderately glossy environments.

The rest of this chapter starts with a brief comparison of ray tracing with alternative rendering algorithms. Next, the concept of light transport paths and the local path sampling technique are described, together with the limitations of local sampling, which happen to be a major handicap for a variety of unbiased ray tracing algorithms. Later, the chapter explains a novel approach to full spectral rendering, which provides more reliability and integrates well with Monte Carlo and quasi-Monte Carlo ray tracing. This is followed by a detailed analysis of the strengths and weaknesses of important existing light transport algorithms. Finally, an improved technique, designed to reduce the impact of some of the flaws of current light transport algorithms, is proposed.

4.1 Ray Tracing vs. Other Algorithms

Ray tracing algorithms are one of a few well-known popular rendering methods. Based on point sampling of the scene radiance distribution, they are substantially different from other approaches. These algorithms are not always the best choice; however, when light transport has to be simulated exactly and the scene contains complex materials and complex geometry, there is currently no alternative to ray tracing. This section starts with the fundamental distinction between view dependent and view independent algorithms. Next, the principles of ray tracing are examined. This is followed by a brief description of selected alternative approaches: hardware accelerated rasterization and radiosity. Finally, some advice is given on when ray tracing is the best choice for rendering.
4.1.1 View Dependent vs. View Independent Algorithms

View dependent techniques compute a solution that is valid only for a particular view (i.e. camera location). The output of these algorithms is usually an image that can be displayed immediately. Their general advantage is that the majority of these techniques require a very small amount of additional memory and are capable of rendering huge scenes without using external storage. Some algorithms, however, are based on storing and interpolating already computed results, which accelerates rendering. Nevertheless, this storage is not necessary for the algorithm output, and if the memory costs become too high, it is always possible to switch to a different algorithm which does not impose extra memory costs.

On the other hand, view independent methods compute the solution for all views simultaneously. The output is a kind of intermediate data which requires additional processing to be displayed. The most common representation is light maps. Light maps are simply additional grayscale textures for all scene primitives which, when mixed with the ordinary textures, give the appearance of illuminated polygons without using any lights. The advantage of this approach is that light maps can be quickly rendered by graphics hardware, allowing real-time walkthroughs in globally illuminated scenes. Unfortunately, a scene textured with light maps must be static: any change to the scene, even the smallest, makes the entire solution invalid. What is more, the size of the intermediate solution of the full, unsimplified light transport equation is unacceptably huge. If a scene contains only matte, flat surfaces, the dimensionality of the entire solution is two (the solution domain is the union of all scene surfaces), but the full solution requires a 6D domain (3D participating media instead of surfaces, 2D directions and 1D spectral data, S = V \times \Omega \times \Lambda).

4.1.2 Ray Tracing Algorithms

Ray tracing is a class of view dependent techniques which generate images by tracing paths of light and accounting for their intersections with scene objects. In nature, light is emitted from light sources and, after scattering, may hit the surface of a sensor. The earliest ray tracing algorithm [Whitted, 1980] reverses this process and follows light paths from the sensor, through the scene objects, to the light sources. Nowadays, much more complex ray tracing algorithms have been introduced. This broad class of techniques consists of methods using the basic ray tracing operations: ray casting, which looks for the nearest intersection of a ray with the scene geometry, and ray scattering, which generates a new, bounced ray after an intersection has been found. From a mathematical point of view, ray tracing is a way of point sampling the scene radiance distribution; the image is formed by integrating radiance over many rays. Simplified versions of ray tracing can render only specific phenomena, e.g. perfect specular reflection and refraction, point light sources and sharp shadows. In ray tracing based full global illumination, on the other hand, any radiance distribution can be sampled by means of casting and scattering rays, albeit at a high computational cost.

4.1.3 Hardware Accelerated Rasterization

Rasterization is an algorithm for rendering which draws individual primitives pixel-by-pixel into a raster image. Like ray tracing, rasterization is view dependent. Typically, rasterization assumes that models are built from triangles and vertexes. In general, rasterization consists of three steps. First, model vertexes are transformed from model space to screen space by the model, view and projection matrices. Then, primitives are assembled, as simple triangles, or as more complex polygons if a triangle happens to be projected onto the edge of the screen. Eventually, the assembled polygons are rasterized (converted to fragments, which are then written into the raster frame buffer). For decades these steps were frozen in graphics hardware designs, and only recently has a degree of programmability appeared; as of 2009, vertex, geometry and fragment processing is partially programmable.

Obviously, rasterization is far inferior to ray tracing when high quality images are to be generated.
Rasterization is not based on the physics of light transport and thus is not capable of correctly calculating many important illumination features. In rasterization, individual pixels can be colored by local illumination algorithms. Attempts at global illumination use various tricks based on multipass rendering, using the results of initial rendering passes as textures for subsequent passes. These tricks make rasterization-based designs particularly unclean; moreover, the resulting effects are merely approximations, sometimes very poor ones. The tradeoffs between ray tracing and rasterization are pertinently described by Luebke and Parker [Luebke & Parker, 2008]: rasterization is fast but needs cleverness to support complex visual effects, while ray tracing supports complex visual effects but needs cleverness to be fast.

Figure 4.1: An example light path.

4.1.4 Radiosity Algorithms

Radiosity algorithms [Cohen & Wallace, 1993] are radically different from ray tracing and rasterization. The most fundamental difference is their view independent approach: radiosity algorithms calculate the radiance distribution L(x, ω) over the entire scene. The radiance distribution, however, is not represented exactly; a linear combination L̃ of carefully selected basis functions is used. The dependence 2.12 can be concisely written using the light transport operator A as L = L_e + A L. The light transport operator A is not represented exactly either; typically a sparse matrix Ã, which approximates the original equation, is used: L̃ = L̃_e + Ã L̃. Modern radiosity methods solve the approximated rendering equation iteratively:

    L̃^(1) = L̃_e, \qquad L̃^(n) = L̃_e + Ã L̃^(n-1),

although the first approaches used Gaussian elimination instead [Goral et al., 1984]. Since radiosity is well suited only for matte materials, more advanced algorithms have been designed to cope with this flaw. For example, multi-pass techniques [Wallace et al., 1987], which mix ray tracing and radiosity, are designed to incorporate reflections into scenes rendered with radiosity. However, these hybrid approaches are barely satisfactory when applied to the general light transport equation: the radiosity component remains a serious handicap, and the ray-traced reflections are added in a view dependent way, effectively killing the main advantage of radiosity.
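A minimal sketch of this iteration on a toy discretized scene is given below; the 3 × 3 transport matrix and the emission vector are made-up illustration data, not the result of a real form factor computation.

    // A sketch of the iterative radiosity solution L(1) = Le,
    // L(n) = Le + A L(n-1), for three mutually visible patches.
    #include <cstdio>

    int main() {
        const int n = 3;
        // A[i][j]: fraction of the radiosity of patch j that reaches patch i,
        // already scaled by the reflectivity of patch i (rows sum below 1,
        // so the transport operator is a contraction and iteration converges).
        const double A[3][3] = {{0.0, 0.3, 0.2},
                                {0.3, 0.0, 0.3},
                                {0.2, 0.3, 0.0}};
        const double Le[3] = {1.0, 0.0, 0.0};  // only patch 0 emits
        double L[3] = {0.0, 0.0, 0.0}, next[3];
        for (int iter = 0; iter < 50; ++iter) {
            for (int i = 0; i < n; ++i) {
                next[i] = Le[i];
                for (int j = 0; j < n; ++j) next[i] += A[i][j] * L[j];
            }
            for (int i = 0; i < n; ++i) L[i] = next[i];
        }
        std::printf("L = (%f, %f, %f)\n", L[0], L[1], L[2]);
        return 0;
    }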

4.2 Light Transport Paths

Light transport paths are the basic concept of ray tracing based global illumination algorithms. A light transport path is a sequence of rays which connect a point on a light source with a point on the camera lens. Any point of interaction (i.e. scattering) of light is represented as a vertex of the light transport path. A ray segment connects two adjacent vertexes and represents radiance transport between them along a straight line. An example light transport path is shown in Figure 4.1.

Virtually all unbiased ray tracing algorithms generate images by sampling the space of all possible light paths. The computed image is then defined as the contribution of a sampled path divided by the probability of its generation, averaged over a large number of such samples. This is a direct application of the Monte Carlo formula (Equation 3.18) to the image formation formula (Equation 2.37 or 2.43). Different algorithms use different approaches to sample the path space, and therefore sample paths with different probability densities. The quality of the sampling, and therefore of the light transport algorithm, can be estimated by the proportionality of its path sampling probability density to the contributions of the paths; this is a consequence of importance sampling (Section 3.2.1). The rest of this section starts with a classification of light paths. Then the methods of constructing light paths are described, and finally, an inherent limitation of the local path sampling technique is explained.

4.2.1 Classification of Paths

Light transport paths can be classified according to the types of scattering at the path vertexes. Scattering can be specular, if all possible scattering directions form a less than two-dimensional set, or diffuse otherwise. Examples of specular scattering are an ideal mirror (with a unique possible scattering direction) and an idealized anisotropic reflection (where the scattered directions form a 1D set). In the case of specular scattering, the scattering function f_s is infinite and its values cannot be represented as real numbers; a δ distribution with an appropriate coefficient is used instead.

The distinction between types of scattering based on ideal specular versus non-specular scattering events is somewhat suboptimal. If f_s represents a highly polished, yet non-ideal, material, f_s is always finite, but the scattering behaviour is almost as in the specular case. It is far better to use an appropriate threshold to determine the scattering type: if f_s for a given scattering event is greater than the threshold, the scattering is considered glossy; otherwise, the scattering is matte.

Light paths can be described as sequences of scattering events. Heckbert [Heckbert, 1990] developed a regular expression notation for light paths. In this notation, paths are described by expressions of the form L(S|D)*E, where L is a vertex on a light source, E is a vertex on the camera lens, and S or D represent scattering, specular or diffuse respectively. To avoid confusion, in the rest of this thesis the L(S|D)*E notation is used for the ideal vs. non-ideal classification, and the L(G|M)*E notation, where G means glossy and M matte scattering, for the threshold based classification.

The regular expression based description of light paths was extended by Veach [Veach, 1997]. In this approach, light and importance emission is split into spatial and directional components. A light path is then extended by two vertexes at the side of the light source and two additional vertexes at the side of the camera lens. The first and last vertexes of the path represent the spatial component of emission; this component is specular in the case of pinhole cameras and point light sources, while area or volumetric emitters and cameras with a finite aperture are considered diffuse. The second and one-before-last vertexes represent the directional component of emission; this component is specular in the case of an orthographic camera or an idealized laser light source. In the extended path notation the special cases, i.e. the light vertex L and the camera vertex E, no longer exist; these symbols indicate only the direction of light flow through the path. Paths are described by expressions like L(S|D)(S|D)(S|D)*(S|D)(S|D)E or L(G|M)(G|M)(G|M)*(G|M)(G|M)E. The extended path notation is presented more intuitively in Figure 4.2.

4.2.2 Construction of Paths

All known unbiased methods are based on a local sampling scheme. Local sampling is a technique of constructing light transport paths by adding one ray segment at a time.
Thus the path can be constructed using only the following operations (sketched as an interface below):

- random selection of a path vertex, typically on a light source or the camera lens, but possibly also on an ordinary surface or in a volume;
- addition of the next vertex to an already constructed subpath; the addition is typically performed by scattering a ray in a random direction from the last path vertex and searching for the nearest intersection;
- deterministic connection of two subpaths.
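A sketch of how these three operations might be exposed as an interface is shown below; all type names and signatures are illustrative, not the actual API of the renderer described in this thesis.

    // An illustrative interface for the three local path sampling operations.
    #include <vector>

    struct Vertex { /* position, scattering function, creation pdf, ... */ };
    struct Subpath { std::vector<Vertex> vertices; };

    class LocalPathSampler {
    public:
        virtual ~LocalPathSampler() = default;
        // Operation 1: randomly select an initial vertex, typically on a
        // light source or on the camera lens.
        virtual Vertex sampleInitialVertex() = 0;
        // Operation 2: extend a subpath by scattering a ray from its last
        // vertex in a sampled direction and casting it to the nearest
        // intersection; returns false if the ray leaves the scene.
        virtual bool extend(Subpath& path) = 0;
        // Operation 3: deterministically connect the endpoints of two
        // subpaths; returns false if the connecting segment is occluded.
        virtual bool connect(const Subpath& eye, const Subpath& light) = 0;
    };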

Figure 4.2: Extended light path notation. A pinhole camera can be seen as an additional specular scattering and an area light source as an additional diffuse scattering.

On the other hand, an operation like calculating the point of reflection R on a mirror in such a way that light travels from A to B through R is not local sampling, because it adds two segments at once. Non-local sampling is prohibitively complex to integrate into any ray tracing algorithm, and therefore is not used.

4.2.3 Local Path Sampling Limitation

A light transport path can be constructed with non-zero probability using local path sampling if and only if it contains two subsequent diffuse scattering events, i.e. its expression contains the DD substring [Veach, 1997]. In particular, with point light sources and pinhole cameras, local sampling cannot generate light paths whose specular reflections are separated by single diffuse reflections. An example of such a path is a caustic on a seabed, caused by a specular refraction on the water surface, viewed indirectly through the same water surface (another specular refraction, with both specular refractions separated by only one diffuse scattering event at the seabed), presented in Figure 4.3.

This fact becomes more intuitive when all possible ways of constructing a light transport path with local path sampling are considered. First, a light source or a camera may be intersected at random. This can happen if and only if they occupy a finite area and the light or importance emission is not specular; in this case, the path begins or ends with a DD substring. Second, a subpath starting from the camera can be connected to a subpath starting from a light. This connection can be made only between two adjacent diffuse vertexes.

Figure 4.3: A difficult light path.

Even if the light source and the camera are not point based, yet are relatively small, and the reflections are highly glossy, an image created with any unbiased method can have huge error. A method for estimating such error, based on the threshold path classification, is used in the algorithm presented in Section 4.5. On the other hand, the most advanced biased methods, which do not rely on the concept of light paths, do not have this flaw. The illumination in difficult lighting conditions can be excessively blurry, but important features are never missing completely; moreover, increasing the rendering time progressively reduces the blurring. Results of local path sampling and of a biased method (our implementation of Photon Mapping) in a highly glossy, yet not specular, environment are presented in Figure 4.4.

Figure 4.4: Top left image: reference rendering with local path sampling. Top right image: rendering using only light paths which do not cause excessive noise. Bottom left image: rendering with problematic paths only. Bottom right image: biased method applied to problematic paths.

4.3 Full Spectral Rendering

The color phenomenon is caused by a spectral mixture of light, perceived by the human visual system. However, the human visual system cannot distinguish between arbitrary spectral light distributions; different spectra which are indistinguishable by human observers are called metamers. The space of colors recognizable by human observers contains only three independent values, hence the popularity of three component color models. There are many color models in computer graphics, however most are designed for a specific purpose only. The most common are RGB, designed for displaying images, CMYK for printing, and HSV for easy color selection by the user. All of these models are to some degree hardware dependent.

There is, however, a standard model based on the XYZ color space, which is independent of any hardware and can represent all the colors recognizable by a human observer. It was defined by the CIE (Commission Internationale de l'Eclairage) as three weighting functions used to obtain the x, y and z components from arbitrary spectra. Nevertheless, none of these models is well suited for rendering, where direct calculations on spectra are the only way to produce correct results [Evans & McCool, 1999, Johnson & Fairchild, 1999]. A general description of many popular color models can be found in Stone [Stone, 2003]. Devlin et al. [Devlin et al., 2002] provide references related to data structures for full spectral rendering and to algorithms for displaying spectral data.

There are several works dedicated to the simulation of particular spectral phenomena. Wilkie et al. [Wilkie et al., 2000] simulated dispersion by means of classic (deterministic) ray tracing. Rendering of optical effects based on interference has attracted a fair amount of attention: reflection from optical disks is presented in Stam [Stam, 1999] and Sun et al. [Sun et al., 2000], and algorithms for accurate light reflection from thin layers can be found in Gondek et al. [Gondek et al., 1994] and Durikovic and Kimura [Durikovic & Kimura, 2006]. The latter paper also shows how this algorithm can be run on contemporary GPUs.

Many papers present methods for representing and operating on spectral data. Peercy [Peercy, 1993] designed a spectral color representation as a linear combination of basis functions, chosen in a scene dependent manner. A different algorithm using basis functions is described by Rougeron and Peroche [Rougeron & Peroche, 1997]; it uses adaptive projection of spectra onto hierarchical basis functions. Sun et al. [Sun et al., 2001] proposed a decomposition of spectra into smooth functions and a set of spikes. Evans and McCool [Evans & McCool, 1999] used clusters of many randomly selected spectral point samples. Johnson and Fairchild [Johnson & Fairchild, 1999] extended OpenGL hardware rasterization to support full spectra. Dong [Dong, 2006] points out that typically only a part of the scene needs a full spectral simulation, and that using RGB together with full spectra can accelerate rendering at the cost of only a slight quality loss. Ward and Eydelberg-Vileshin [Ward & Eydelberg-Vileshin, 2002] designed a three-component model optimized for rendering, which typically produces images of acceptable yet imperfect quality; the model, however, is not general enough, and cannot simulate wavelength dependent phenomena like dispersion.

The rest of this section starts with an explanation of why rendering with full spectra is necessary. Next, random point sampling as a method of representing spectra is presented. This is followed by a detailed description of our novel sampling technique, designed to substantially reduce the variance of rendering many wavelength dependent phenomena. Finally, a method for combining the sampling of light transport paths with spectrum sampling is explained.

4.3.1 Necessity of Full Spectrum

The RGB model is often used for rendering color images. However, this is an abuse, since RGB based rendering has no physical justification. The model was designed for storage and effective display of images on a monitor screen, not for physically accurate rendering. Under the assumption of elastic photon scattering, the light reflection computation is performed by multiplying a spectrum that represents the illumination with a spectrum describing the surface reflectance.
This multiplication must actually be performed on spectral distribution functions, not on RGB triplets, in order to get correct results. An RGB based reflection of white light, or of light with a smoothly varying spectrum, from a surface with smoothly varying reflectance typically does not produce substantial inaccuracies. However, when at least one of the spectra varies strongly, a simulation using the RGB model becomes visibly incorrect (see Figure 4.5, for example). Moreover, in global illumination, due to multiple light scattering, even white light becomes colorful, causing the scattering inaccuracies to accumulate. This makes RGB based global illumination rendering unable to accurately capture the physical phenomena.

In addition, the most visually distracting error of the RGB model appears in the simulation of phenomena like dispersion. Whenever RGB based light from a light source with almost parallel output rays hits a prism, it is scattered into three bands instead of a continuous full spectrum, and the rest of the image remains dark (see Figure 4.6), which looks unrealistic. Using a full spectrum representation gives a continuous rainbow of colors. Good-looking results may, however, be obtained with an RGB representation if the angular distribution of the light source is conical and divergent enough. A similar trick is the basis of a simple Nvidia shader demo [Fernando & Kilgard, 2003]: the address of the texture on a surface seen through glass is offset independently for each channel, and if the texture data is blurred enough, the resulting spectrum is smooth. Nevertheless, both of these methods have no physical significance and are obviously incorrect, but, under some conditions, they can look convincing.

Figure 4.5: Left image: copper sphere illuminated by a D65 white light. Right image: copper sphere illuminated by a triangular spectral distribution stretched from 535nm to 595nm. Top left half: an RGB model with 645nm, 526nm and 444nm wavelengths. Bottom right half: our full spectral model. For clarity, only diffuse reflection is calculated.

Figure 4.6: Dispersion on a prism. Top row: RGB model with 645nm, 526nm and 444nm wavelengths. Bottom row: physically correct full spectrum. The light collimation is controlled by a Phong-like function I cos^n(φ), with the exponent n decreased four times in each subsequent column, and the intensity I doubled to compensate for the light scattering.

4.3.2 Representing Full Spectra

Full spectral rendering requires an efficient method for representing spectral data. The most common techniques are based on linear combinations of carefully selected basis functions [Peercy, 1993, Rougeron & Peroche, 1997, Sun et al., 2001] and on point sampled continuous functions [Evans & McCool, 1999].

The efficiency of the linear combination approach strongly depends on the actual basis functions and on their match to the spectral distributions of the scene. In a Monte Carlo based rendering system, however, the natural solution is random point sampling.

Random Point Sampling

Random point sampling produces noise at low sampling rates, but well-designed variants of this technique converge quickly. Point sampling can effectively handle smooth light distributions (like tungsten bulbs) and very narrow spikes (like neon bulbs) in the same scene. The two greatest strengths of this technique are randomly selected wavelengths and a well defined wavelength value for each spectral sample. The former ensures correctness: as more samples are computed, more different wavelengths are explored, and due to the law of large numbers the rendering result converges to the true value. The latter allows simulating wavelength dependent effects like dispersion, at the cost of additional color noise. It is worth noting that wavelength dependent phenomena cannot be simulated correctly by algorithms based on linear combinations of basis functions with non-zero extent in wavelength space. Even if spectra are represented by unique non-zero coefficients, the corresponding basis functions still have some finite extent, which prevents the exact computations with an explicit wavelength that such phenomena require.

The simplest approach to point sampled spectra is the generation of a single spectral sample per light transport path. However, according to Evans and McCool [Evans & McCool, 1999], this technique is inefficient, since it causes a lot of color noise. They proposed using a fixed number of spectral samples (called a cluster of samples) traced simultaneously along a single light path, which substantially reduces variance with minimal computational overhead.

Basic Operations

The implementation of the multiplication, addition, minimum, etc. operators is obvious, since it is enough to perform the appropriate calculation per component, as in the RGB model. However, when using full spectra, computing luminance is a bit more difficult. In particular, the luminance of a spectrum which describes the reflectivity of a surface must, by definition, be in the [0, 1] range, but computing luminance as a Monte Carlo quadrature of the product of the reflectance spectrum r(λ) and the scaled CIE y weighting function may randomly lead to numerical errors causing the luminance to exceed the 1.0 threshold. The equation:

    l \approx \sum_{i=1}^{n} \frac{r(\lambda_i) y(\lambda_i)}{p(\lambda_i)} \Bigg/ \sum_{i=1}^{n} \frac{y(\lambda_i)}{p(\lambda_i)},    (4.1)

where r(λ) is the reflectance, y(λ) is the CIE y weight and p(λ_i) is the probability of selecting the given λ_i, solves the issue: it guarantees that the luminance is in the [0, 1] range, provided that r(λ) is also in that range.
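For one cluster, the guarded luminance computation 4.1 can be sketched as below; the reflectance and the stand-in for the scaled CIE y curve are illustrative, and because the same wavelength density appears in both sums, a uniform density simply cancels.

    // A sketch of the ratio estimator (4.1); the result stays in [0,1]
    // whenever the reflectance does, regardless of the sampled wavelengths.
    #include <cstdio>
    #include <random>

    double yWeight(double lambda) {  // crude stand-in for the CIE y curve;
        double d = (lambda - 555.0) / 60.0;  // its scale cancels in the ratio
        return (d > -1.0 && d < 1.0) ? 1.0 - d * d : 0.0;
    }
    double reflectance(double lambda) { return lambda > 550.0 ? 0.9 : 0.1; }

    int main() {
        std::mt19937 rng(5);
        std::uniform_real_distribution<double> lam(360.0, 830.0);
        const int n = 16;  // cluster size
        double num = 0.0, den = 0.0;
        for (int j = 0; j < n; ++j) {
            double lambda = lam(rng);  // uniform pdf cancels in the ratio
            num += reflectance(lambda) * yWeight(lambda);
            den += yWeight(lambda);
        }
        std::printf("luminance = %f\n", den > 0.0 ? num / den : 0.0);
        return 0;
    }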
Wavelength dependent effects can be handled as proposed by Evans and McCool [Evans & McCool, 1999] for specular dispersion: by dropping all but one spectral sample from a cluster. This is done by randomly selecting, with uniform probability, a sample to preserve. All the samples except the selected one are then set to zero, and the power of the chosen one is multiplied by the cluster size. The wavelength parameter then becomes well defined, and further computations are performed using its actual value. However, when the simulated phenomenon is not optically perfect, as in Phong-based glossy refraction, it may be more efficient to trace the whole cluster, scaling the power of each sample independently. We examine this approach in detail in the next section.

Efficient Sampling of Spectra

Evans and McCool [Evans & McCool, 1999] simulate wavelength dependent phenomena by tracing only one spectral sample per path. This approach is always correct, and is necessary when a phenomenon is optically perfect, such as refraction in idealized glass. However, when the scattering is not ideal, dropping all but one spectral sample from a cluster, while still correct, may be extremely wasteful and inefficient. In this section we propose a substantially improved technique.

Single Scattering Model

For testing purposes, a refraction model with an adjustable, wavelength dependent refraction and with imperfection introduced by Phong-based scattering [Phong, 1975] with controllable glossiness is used. Perhaps an extension of the microfacet based refraction of Walter et al. [Walter et al., 2007] to support dispersion would give better results, but their model is much more complicated and would therefore make the evaluation of spectral sampling difficult. Nonetheless, since we have made no assumptions about the scattering model, our results are general and applicable to any wavelength dependent phenomena. For clarity, all tests are based on a single scattering simplification (i.e. light is refracted only once, when it enters the glass). The x component in the CIE XYZ space in the outgoing direction ω_o is then described by the following formula:

    I_{CIEx}(\omega_o) = \int_\Lambda \int_\Omega f_s(\omega_i, \omega_o, \lambda) L_\lambda(\omega_i, \lambda) w_{CIEx}(\lambda) \, d\sigma^\perp(\omega_i) \, d\lambda,    (4.2)

where Λ is the space of all visible wavelengths, Ω is the space of all direction vectors, L_λ(ω_i, λ) is the radiance incoming from direction ω_i, w_CIEx is the CIE weight for the x component, and σ^⊥(ω_i) is the projected solid angle measure. The y and z components can be evaluated in a similar way. In the rest of this section, Formula 4.2 is written in a simplified, yet still unambiguous, form:

    I = \int_\Lambda \int_\Omega f(\omega, \lambda) L(\omega, \lambda) w(\lambda) \, d\sigma^\perp(\omega) \, d\lambda.    (4.3)

Basic and Cluster Based Monte Carlo Estimators

The Monte Carlo method (Equation 3.18) can be applied to evaluate the two integrals of Formula 4.3, which leads to the following estimator:

    I \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(\omega_i, \lambda_i) \, w(\lambda_i)}{pdf_\sigma(\omega_i, \lambda_i) \, pdf_\lambda(\lambda_i)} L(\omega_i, \lambda_i),    (4.4)

where pdf_σ is the probability of selecting a given ω_i, evaluated with respect to the σ^⊥(ω) measure on Ω, and pdf_λ is the probability of selecting a given λ_i. The quality of this estimator, and of all further estimators in this section, relies on the assumption that the scattering model offers proper importance sampling (Section 3.2.1), i.e. f(ω, λ) \propto pdf_σ(ω, λ) is roughly satisfied. However, this basic estimator is inefficient, because it forces the numbers of spectral and directional samples to be equal. Each directional sample requires additional rays to be traced, which is computationally expensive, while spectral samples are almost free. This explains the advantage of clusters of spectral samples over the single spectral sample approach.

The main improvement over the Evans and McCool method is tracing a full cluster of spectral samples even when a wavelength dependent phenomenon is encountered. Wavelength dependence can be defined precisely as the dependence of pdf_σ on λ; if scattering is not wavelength dependent, directional sampling is not wavelength dependent either, i.e. pdf_σ(ω, λ) \equiv pdf_σ(ω). In our method, a particular spectral sample λ_i^s is selected at random from the cluster, and its value is used for sampling ω_i^s.

In our method, a particular spectral sample λ_i^s is selected at random from the cluster, and its value is used for sampling ω_i^s. This leads to a color estimator of the form:

$$ I = \frac{1}{NC} \sum_{i=1}^{N} \frac{1}{pdf_\sigma(\omega_i^s, \lambda_i^s)} \sum_{j=1}^{C} \frac{f(\omega_i^s, \lambda_i^j)\, w(\lambda_i^j)}{pdf_\lambda(\lambda_i^j)}\, L(\omega_i^s, \lambda_i^j), \qquad (4.5) $$

where N is the number of traced clusters, C is the number of samples in each cluster, and pdf_σ is the probability density of selecting the scattering direction, calculated for the selected wavelength λ_i^s.

The estimator (4.5) can be more efficient than estimator (4.4), since it traces C spectral samples at minimal additional cost. On the other hand, it may deteriorate the importance sampling quality significantly. This happens because all samples, with potentially wildly different values f(ω_i^s, λ_i^j), are traced, yet only the one probability pdf_σ(ω_i^s, λ_i^s), which matches the shape of f for λ_i^s only, is used. Whenever a direction ω_i^s with low probability pdf_σ(ω_i^s, λ_i^s) is chosen at random and at least one of the f(ω_i^s, λ_i^j) has a relatively large value in that direction, the value is no longer cancelled by the probability, leading to excessively high variance in the rendered image. Moreover, the estimator (4.5) is incorrect whenever ∃ λ_i^s, ω_i^s: pdf_σ(ω_i^s, λ_i^s) = 0 and ∃ λ_i^j: f(ω_i^s, λ_i^j) > 0, in particular when a wavelength dependent phenomenon is optically perfect, i.e. its f is described by a δ distribution. Thus, the initial version of our new approach is not always better than the traditional technique of tracing only one spectral sample, and the question is when the new technique exhibits lower variance and when it does not.

Multiple Importance Sampling Estimator

Fortunately, the variance issue can be solved automatically. A simple modification of the estimator (4.5), which incorporates Multiple Importance Sampling [Veach & Guibas, 1995] (see Section 3.2.1), gives a better estimator with variance as low as possible under a variety of conditions. The new, improved estimator is constructed from the estimator (4.5) by multiplying each cluster by C and a weight W_i^s equal to:

$$ W_i^s = \frac{pdf_\sigma(\omega_i^s, \lambda_i^s)}{\sum_{j=1}^{C} pdf_\sigma(\omega_i^s, \lambda_i^j)}, \qquad (4.6) $$

where pdf_σ(ω_i^s, λ_i^s) is the probability with which the scattering direction was actually selected, and the values pdf_σ(ω_i^s, λ_i^j) are the hypothetical probabilities of selecting the sampled direction if the λ_i^j value had been used instead. This leads to the final estimator:

$$ I = \frac{1}{N} \sum_{i=1}^{N} \frac{C\, W_i^s}{pdf_\sigma(\omega_i^s, \lambda_i^s)} \cdot \frac{1}{C} \sum_{j=1}^{C} \frac{f(\omega_i^s, \lambda_i^j)\, w(\lambda_i^j)}{pdf_\lambda(\lambda_i^j)}\, L(\omega_i^s, \lambda_i^j) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sum_{j=1}^{C} pdf_\sigma(\omega_i^s, \lambda_i^j)} \sum_{j=1}^{C} \frac{f(\omega_i^s, \lambda_i^j)\, w(\lambda_i^j)}{pdf_\lambda(\lambda_i^j)}\, L(\omega_i^s, \lambda_i^j). \qquad (4.7) $$

Assuming that the scattering model provides proper importance sampling, the estimator (4.7) leads to a low variance result. Moreover, the estimator (4.7) is correct whenever the scattering model is correct, i.e. whenever ∀ ω, λ: f(ω, λ) > 0 ⟹ pdf_σ(ω, λ) > 0, so it is applicable even to optically perfect wavelength dependent phenomena; in this case, however, it provides no benefit over estimator (4.4).

The comparison between the new estimators (4.5) and (4.7) and the previous single sample estimator (4.4) is presented in Figure 4.7. The glass sphere has an index of refraction varying linearly from 1.35 at 360 nm to 1.2 at 830 nm and uses Phong-based scattering with n = 1000. Images are created using only two 16-sample clusters, to show the error more clearly.
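To make the structure of estimator (4.7) concrete, the following C++ sketch evaluates the contribution of a single cluster. It is a minimal illustration under assumptions made here, not the dissertation's implementation: the Bsdf interface, the Dir type, the fixed cluster size, and all names are placeholders.

    #include <array>
    #include <cstddef>

    constexpr std::size_t C = 8;                 // spectral samples per cluster
    using Spectrum = std::array<double, C>;      // one value per cluster wavelength

    struct Dir { double x, y, z; };              // placeholder direction type

    // Assumed BSDF interface: value and sampling density per cluster wavelength.
    struct Bsdf {
        double f(const Dir& omega, std::size_t j) const;         // f(omega, lambda_j)
        double pdfSigma(const Dir& omega, std::size_t j) const;  // pdf_sigma(omega, lambda_j)
    };

    // One term of estimator (4.7). 'omegaS' is assumed to have been sampled
    // using the cluster wavelength with index s; 'L' is the incoming radiance,
    // 'w' the CIE weight and 'pdfLambda' the spectral density, all tabulated
    // per cluster wavelength. The caller averages the results over N clusters.
    Spectrum clusterEstimate(const Bsdf& bsdf, const Dir& omegaS,
                             const Spectrum& L, const Spectrum& w,
                             const Spectrum& pdfLambda) {
        double pdfSum = 0.0;                     // denominator of weight (4.6)
        for (std::size_t j = 0; j < C; ++j)
            pdfSum += bsdf.pdfSigma(omegaS, j);
        Spectrum I{};
        if (pdfSum <= 0.0) return I;             // direction cannot be sampled
        // Multiplying (4.5) by C * W_i^s cancels pdf_sigma(omegaS, lambda_s),
        // leaving the single factor 1/pdfSum: exactly estimator (4.7).
        for (std::size_t j = 0; j < C; ++j)
            I[j] = bsdf.f(omegaS, j) * w[j] * L[j] / (pdfLambda[j] * pdfSum);
        return I;
    }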
Generation of Clusters

In order to generate clusters efficiently, two issues have to be solved: how many samples a single cluster should contain, and how to generate them.

Figure 4.7: Comparison between the new initial estimator (left), the new improved estimator (middle) and the previous method (right). The new initial estimator exhibits more variance due to lack of proper importance sampling. The color noise from the single sample approach makes the rightmost image barely legible.

Figure 4.8: Selection of the optimal number of spectral samples for a single cluster: 4 samples (left), 8 samples (middle), 12 samples (right). All images were rendered at 640x480 with 200k image samples (i.e. spectral clusters).

The number of spectral samples in a cluster is an important decision for achieving the best possible performance. Unfortunately, the optimal number of such samples is highly scene dependent: the more variation the emission and reflectance spectra have, the more spectral samples a single cluster should contain. Assuming that a scene contains rather smoothly varying spectra (an assumption that is typically satisfied), it is possible to balance excessive color noise against computational overhead. After a few tests we have found that eight spectral samples are optimal.¹ Four samples cause significant noise, and twelve give a barely visible improvement (see Figure 4.8). Rendering time differences between these images were below 1%, which confirms the efficiency of the cluster approach.

The efficient generation of spectral samples proves to be more difficult. Spectra should be importance sampled, but there are at least three factors which should affect the choice of pdf_λ: the sensitivity of the sensor (camera, human eye, etc.), the spectral distribution of the light sources, and the reflectance properties of the materials. Often only the sensor is taken into account, and its sensitivity is assumed to be well described by the CIE y weighting function. Unfortunately, despite producing good quality grayscale images, importance sampling the wavelength space with respect to the y function causes excessive color noise and, contrary to common knowledge, is suboptimal. Ideally, a sampling probability should take into account all three x, y and z components. After some experiments, we found that the following probability gives good results:

$$ pdf_\lambda(\lambda) = N^{-1} f(\lambda), \qquad f(\lambda) = \frac{1}{\cosh^2\!\big(A(\lambda - B)\big)}, \qquad (4.8) $$

¹ Due to Intel SSE instruction set optimization, our implementation requires the number of samples to be divisible by four.

where A (expressed in nm⁻¹) and B = 538.0 nm are empirically evaluated constants, and

$$ N = \int_\Lambda f(\lambda)\, d\lambda = \frac{1}{A}\Big(\tanh\!\big(A(\lambda_{max} - B)\big) - \tanh\!\big(A(\lambda_{min} - B)\big)\Big) \qquad (4.9) $$

is a normalization factor. Results of this improved technique are presented in Figure 4.9.

Figure 4.9: Various methods of sampling spectra. Top row: 2000 K blackbody radiation. Bottom row: D65 spectrum. Left column: spectra sampled using random numbers and our importance sampling, with various numbers of samples. Middle column: comparison of luminance based importance sampling (top halves) with our p_Λ (bottom halves), using 128 spectral samples. Right column: spectra sampled using the Sobol low discrepancy sequence and our p_Λ, using 4 and 8 spectral samples.

Moreover, since spectra are typically smooth, sampling them with quasi-Monte Carlo (QMC) low discrepancy sequences instead of random numbers further improves the results. However, care must be taken when QMC sampling is applied to cluster based spectra. When a wavelength dependent effect is to be simulated, a single sample from the cluster has to be chosen, and this choice is tricky due to the peculiarities of QMC sampling. With true random numbers, selecting the first sample from a cluster always works correctly; with a low discrepancy sequence, on the other hand, selecting every nth sample is a serious error. In the latter case, we assign a separate (pseudo)random sequence for the selection of the spectral sample, in addition to the sequence used for randomizing cluster samples. Results of QMC sampling are presented in Figure 4.9.
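The density (4.8) can be sampled by inverting its CDF: since the integral of sech²(A(λ − B)) is tanh(A(λ − B))/A, the inverse uses atanh. The sketch below illustrates this under assumptions made here; in particular, the numeric value of A is a placeholder, not the empirically fitted constant.

    #include <cmath>

    // Illustrative constants: B and the visible range follow the text;
    // the value of A is a placeholder, not the fitted constant.
    constexpr double A = 0.01;                   // nm^-1 (placeholder)
    constexpr double B = 538.0;                  // nm
    constexpr double lambdaMin = 360.0, lambdaMax = 830.0;

    // Sample a wavelength from pdf_lambda (4.8) by CDF inversion. 'xi' is a
    // uniform [0,1) number, e.g. one component of a low discrepancy point.
    double sampleWavelength(double xi, double* pdf) {
        const double t0 = std::tanh(A * (lambdaMin - B));
        const double t1 = std::tanh(A * (lambdaMax - B));
        const double t  = t0 + xi * (t1 - t0);          // uniform in CDF range
        const double lambda = B + std::atanh(t) / A;    // invert tanh
        const double c = std::cosh(A * (lambda - B));
        *pdf = A / ((t1 - t0) * c * c);                 // f(lambda)/N, cf. (4.9)
        return lambda;
    }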

Results and Discussion

Some more comparisons between the single spectral sample approach and the improved technique are presented in Figure 4.10. Images in the top row use the previous settings (index of refraction from 1.35 at 360 nm to 1.2 at 830 nm, glossiness coefficient n = 1000). Images in the bottom row use much sharper settings (index of refraction from 1.5 at 360 nm to 1.2 at 830 nm, glossiness coefficient n = 4000). Images in the first and second columns are rendered to approximately the same quality, and images in the second and third columns are rendered with the same number of samples (i.e. traced rays). The average numerical error for various numbers of rays for the scene from Figure 4.10 is summarized in Table 4.1.

Figure 4.10: Imperfect refraction with dispersion. The top left image uses the previous approach with a massive 900 samples per pixel. The top middle image uses the new technique with just 50 samples per pixel, yet has similar quality. The top right image again uses the previous approach, but with 50 samples per pixel. Gains from the new technique are less spectacular when glossiness or dispersion is increased: the bottom row images use 900, 100, and 100 samples, respectively.

Table 4.1: Comparison of the error of our method (MIS) and the single spectral sample approach (SSS), for C 8-sample spectral clusters per pixel, under the settings n = 1000, η ∈ [1.35, 1.20] and n = 4000, η ∈ [1.50, 1.20]. The error is evaluated as the difference between the tested image and the reference image, averaged over all pixels and color components; pixel values are normalized to the [0, 1] range.

Limit Cases

The analysis of two limit cases gives more insight into how the new technique works and when it is most effective. The analysis is based on the assumption that f(ω, λ) ∝ pdf_σ(ω, λ) approximately holds; otherwise the multiple importance sampling approach cannot help much in reducing variance.

First, when wavelength dependence is weak and the reflection is fairly matte, all the scattering probabilities become more and more independent of λ: pdf_σ(ω_i^s, λ_i^j) ≈ pdf_σ(ω_i^s). The weight W_i^s then becomes:

$$ W_i^s = \frac{pdf_\sigma(\omega_i^s, \lambda_i^s)}{\sum_{j=1}^{C} pdf_\sigma(\omega_i^s, \lambda_i^j)} \approx \frac{pdf_\sigma(\omega_i^s)}{C\, pdf_\sigma(\omega_i^s)} = \frac{1}{C}, \qquad (4.10) $$

and the estimator becomes:

$$ I \approx \frac{1}{N} \sum_{i=1}^{N} \frac{W_i^s}{pdf_\sigma(\omega_i^s, \lambda_i^s)} \sum_{j=1}^{C} \frac{f(\omega_i^s, \lambda_i^j)\, w(\lambda_i^j)}{pdf_\lambda(\lambda_i^j)}\, L(\omega_i^s, \lambda_i^j) \approx \frac{1}{NC} \sum_{i=1}^{N} \sum_{j=1}^{C} \frac{f(\omega_i^s, \lambda_i^j)\, w(\lambda_i^j)}{pdf_\sigma(\omega_i^s)\, pdf_\lambda(\lambda_i^j)}\, L(\omega_i^s, \lambda_i^j), \qquad (4.11) $$

which is an estimator of simple, wavelength independent scattering.

Second, when scattering becomes more and more glossy and wavelength dependence is significant, with probability close to one f becomes close to zero for all directions except those preferred by the selected λ_i^s. The rare cases when f(ω_i^s, λ_i^j) is large for j ≠ s have a low weight W_i^s, and therefore they cannot affect the estimator significantly. Moreover, all the probabilities but the selected one go to zero, so the weight W_i^s for directions preferred by f for λ_i^s goes to one, which leads to the estimator:

$$ I \approx \frac{1}{N} \sum_{i=1}^{N} \frac{W_i^s}{pdf_\sigma(\omega_i^s, \lambda_i^s)} \sum_{j=1}^{C} \frac{f(\omega_i^s, \lambda_i^j)\, w(\lambda_i^j)}{pdf_\lambda(\lambda_i^j)}\, L(\omega_i^s, \lambda_i^j) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(\omega_i^s, \lambda_i^s)\, w(\lambda_i^s)}{pdf_\sigma(\omega_i^s, \lambda_i^s)\, pdf_\lambda(\lambda_i^s)}\, L(\omega_i^s, \lambda_i^s), \qquad (4.12) $$

which is equivalent to the one sample estimator. This behaviour of estimator (4.7) is presented in Figure 4.11.

Figure 4.11: Analysis of the behaviour of estimator (4.7) with increasing glossiness and wavelength dependence of scattering. Wavelength independent scattering (leftmost image); optically perfect wavelength dependent scattering (rightmost image); intermediate cases (middle). All the images are rendered with just four clusters.

The former approach to spectral rendering separates scattering into two artificial cases: standard wavelength independent scattering, and costly simulation of wavelength dependent phenomena using the single spectral sample estimator. Our method, on the other hand, does not depend on such a classification. Due to the automatically computed weights, it adjusts itself to these two limit cases, and to the broad spectrum of intermediate cases in which scattering is wavelength dependent but imperfect. The computational cost of our method depends on the strength of wavelength dependence and the optical perfection of the material. These factors increase the computational cost, but it never exceeds the cost of the single spectral sample estimator.

Sampling of Light Transport Paths

Our spectral sampling was derived for a single scattering model. However, it is easy to generalize it to light transport path sampling, i.e. the case in which more than one wavelength dependent scattering event can be encountered on the same light path. The wavelength λ_i^s is selected once for the whole path and reused at each scattering event. The weight W_i^s is therefore computed for the whole path, using products of probabilities instead of probabilities of single scattering events.

For example, assuming that the sampled path is built by recursively sampling f_s and tracing rays in the sampled directions, W_i^s is given by the following expression:

$$ W_i^s = \frac{\prod_{k=1}^{m} pdf_\sigma(\omega_{ki}^s, \lambda_i^s)}{\sum_{j=1}^{C} \prod_{k=1}^{m} pdf_\sigma(\omega_{ki}^s, \lambda_i^j)}, \qquad (4.13) $$

where k is the index of a scattering event and m is the length of the sampled path. Intuitively, the weight W_i^s is the ratio of the probability of generating the whole path using the selected wavelength λ_i^s to the sum of the probabilities of generating the same path using each wavelength from the cluster. If a light transport algorithm generates paths in a different way, or does not use the concept of light transport paths at all, the weight W_i^s has to be computed in a different manner. The integration of our spectral sampling with individual light transport algorithms is presented in detail in Section 4.4.

4.4 Analysis of Selected Light Transport Algorithms

There is no perfect rendering algorithm. Each is well suited to particular input scenes, but inferior when rendering others. This section gives a detailed analysis of the strengths and weaknesses of selected rendering algorithms. The analysis is based, among others, on the classification of light paths described earlier. Moreover, methods for integrating full spectral sampling (Section 4.3) with the presented algorithms are given. The described algorithms are: Path Tracing, Bidirectional Path Tracing, Metropolis Light Transport, Energy Redistribution Path Tracing, Irradiance and Radiance Caching, and Photon Mapping. Additionally, we propose optimizations for some of these algorithms, and test Path Tracing and Photon Mapping in restricted versions which are potential candidates for real time global illumination.

Path Tracing

The Path Tracing algorithm, historically the first mathematically correct solution to the light transport problem (Equation 2.12), was given by Kajiya [Kajiya, 1986]; Kajiya was also the first to formulate Equation 2.12, in the same paper. Today, however, the original formulation of Path Tracing is considered ineffective, and our analysis is based on the Pharr and Humphreys version [Pharr & Humphreys, 2004]. This version of Path Tracing improves convergence and provides support for volumetric rendering (Equation 2.22). The algorithm is based on the integral over paths formulation (Equation 2.43) of light transport.

The Path Tracing method generates light paths from the camera towards the light sources. First, a vertex on the camera lens and a sampled direction are chosen at random. Then the path is constructed incrementally: in a loop, the nearest intersection is found, and the ray is scattered from the intersection point in a random direction. Light sources can either be intersected at random or their illumination can be accounted for directly (Figure 4.12). In the second case, a point y_{i+1} on a light source is chosen at random, and a visibility test between it and a point x_i on the light transport path is performed. The loop stops when either a ray exits the scene and escapes to infinity or absorption happens. The absorption is a variant of Russian roulette (Equation 3.30), which terminates light path evaluation with some probability, and therefore finishes the loop after a finite number of steps with probability one.
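The loop just described can be summarized by the following structural sketch. All interfaces (Scene, Ray, Intersection, Rng, the arithmetic on the Spectrum placeholder, and the helper routines) are assumptions made here for illustration, not the dissertation's actual API.

    // Schematic camera-path loop of the described Path Tracing variant.
    Spectrum tracePath(const Scene& scene, Rng& rng) {
        Ray ray = scene.camera().sampleRay(rng);       // lens vertex + direction
        Spectrum throughput = Spectrum::one();
        Spectrum radiance   = Spectrum::zero();
        for (;;) {
            Intersection its;
            if (!scene.intersect(ray, &its))
                break;                                 // escaped to infinity
            // Explicit light sampling: pick y on a light, test visibility.
            radiance += throughput * directLight(scene, its, rng);
            // Scatter into a random direction (projected solid angle pdf).
            double pdf;
            Dir wo = its.bsdf().sample(ray.dir, rng, &pdf);
            Spectrum f = its.bsdf().eval(ray.dir, wo);
            double q = russianRoulette(f, pdf);        // continuation probability
            if (rng.uniform() >= q)
                break;                                 // absorbed
            throughput = throughput * f / (pdf * q);
            ray = Ray(its.point(), wo);
        }
        return radiance;
    }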
Radiance Estimator

According to Equation 2.43, the image is defined as an integral over all possible light transport paths x̄. In order to evaluate the standard Monte Carlo estimator (Equation 3.18), or better the Multiple Importance Sampling estimator (Section 3.2.1), of the integral 2.43, the path contribution f(x̄) and the probability density pdf(x̄) have to be evaluated. In path tracing, a light transport path x̄_j

Figure 4.12: A path generated by the Path Tracing algorithm. The solid lines represent rays traced towards the nearest intersection points. The dashed lines represent visibility tests.

of length j can be generated in two ways, with probabilities equal to:

$$ pdf_1(\lambda, \bar{x}_{[j]}) = pdf_W(x_0)\, pdf_W(x_0 \to x_1)\, pdf_{tr}(x_0 \to x_1)\, G_x(x_0 \leftrightarrow x_1) \prod_{i=1}^{j-1} \Big[ pdf_{\sigma x}(x_{i-1} \to x_i \to x_{i+1})\, pdf_{tr}(x_i \to x_{i+1})\, G_x(x_i \leftrightarrow x_{i+1})\, pdf_c(x_{i-1} \to x_i \to x_{i+1}) \Big], \qquad (4.14) $$

$$ pdf_2(\lambda, \bar{x}_{[j]}) = pdf_W(x_0)\, pdf_W(x_0 \to x_1)\, pdf_{tr}(x_0 \to x_1)\, G_x(x_0 \leftrightarrow x_1) \prod_{i=1}^{j-2} \Big[ pdf_{\sigma x}(x_{i-1} \to x_i \to x_{i+1})\, pdf_{tr}(x_i \to x_{i+1})\, G_x(x_i \leftrightarrow x_{i+1})\, pdf_c(x_{i-1} \to x_i \to x_{i+1}) \Big]\, pdf_{Lx}(y_j), \qquad (4.15) $$

where pdf_1 is the probability of generating paths with randomly intersected light sources, pdf_2 is the probability of generating paths with special treatment of light sources,

$$ pdf_{\sigma x}(x_{i-1} \to x_i \to x_{i+1}) = \begin{cases} pdf_{\sigma^\perp}(x_{i-1} \to x_i \to x_{i+1}), & \text{if } x_i \in A \\ pdf_\sigma(x_{i-1} \to x_i \to x_{i+1}), & \text{if } x_i \in V \end{cases} \qquad (4.16) $$

is the scattering probability, measured with respect to projected solid angle in the case of surface scattering or ordinary solid angle in the case of volumetric scattering, pdf_L(x) and pdf_W(x) are the probabilities of selecting a point on a light source or on the camera lens, pdf_L(x → y) and pdf_W(x → y) are the probabilities of emission in a given direction, and

$$ pdf_c(x_{i-1} \to x_i \to x_{i+1}) = \min\left(1, \frac{\big\| tr(x_i \to x_{i+1})\, f_x(x_{i-1} \to x_i \to x_{i+1}) \big\|_\lambda}{pdf_{tr}(x_i \to x_{i+1})\, pdf_{\sigma x}(x_{i-1} \to x_i \to x_{i+1})} \right) \qquad (4.17) $$

is the Russian roulette continuation probability after scattering at vertex x_i (i.e. 1 − pdf_c is the absorption probability). The tr factor is defined by Equation 2.15 and G_x by Equation 2.40; f_x is the scattering function defined in Chapter 2. All these probabilities may depend on the wavelength λ. The norm ‖·‖_λ converts a function of wavelength into a real value. Typically, ‖f(λ)‖_λ = luminance(f(λ)) is used, but we found that using

$$ \| f(\lambda) \|_\lambda = f(\lambda^s), \qquad (4.18) $$

where λ^s is the wavelength chosen from a cluster for generating the light transport path, together with spectral Multiple Importance Sampling (pdf_c used with norm (4.18) is wavelength dependent), slightly reduces color noise in rendered images. The error reduction of applying (4.18) over using luminance for this purpose, calculated as the L₁ norm of the difference from a reference image, is about 7% ± 2%. This is not substantial, but the resulting images have fewer and less intense spots of wrong color, and the technique does not increase rendering time, so it is worth applying.
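A minimal sketch of the continuation probability (4.17) with the wavelength-selective norm (4.18) follows; tr_f stands for the product tr · f_x at the vertex, tabulated per cluster wavelength, and s is the index of the path's selected wavelength. The names are illustrative assumptions.

    #include <algorithm>

    // Russian roulette continuation probability (4.17) using norm (4.18):
    // the spectrum is collapsed to its value at the selected wavelength.
    double continuationProbability(const Spectrum& tr_f,   // tr * f_x per lambda
                                   double pdfTr, double pdfSigma,
                                   std::size_t s) {        // selected index
        double norm = tr_f[s];                             // ||.||_lambda, (4.18)
        return std::min(1.0, norm / (pdfTr * pdfSigma));
    }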

The path contribution is given by the following equation:

$$ f(\lambda, \bar{x}_j) = W_e(x_0 \to x_1)\, G_x(x_0 \leftrightarrow x_1) \prod_{i=1}^{j-1} \Big[ f_x(x_{i-1} \to x_i \to x_{i+1})\, G_x(x_i \leftrightarrow x_{i+1}) \Big]\, L_{ex}(x_{j-1} \to x_j), \qquad (4.19) $$

where W_e is the camera sensitivity and L_ex is the emitted radiance (Equation 2.40). The path contribution f(λ, x̄_j) has the same form as the integrand of the integral form of the Light Transport Equation (Equation 2.43).

These equations omit the case of short light paths in open environments, but the fix is obvious: if a camera ray escapes to infinity (a one vertex path), its contribution to the image is assumed to be zero. Due to the finite absorption probability 1 − q_i at each ith scattering event, the path terminates with probability one after some finite number k of scattering events.

Both techniques used for generating paths are combined using Multiple Importance Sampling. The spectral radiance estimator along a ray x_0 → x_1 is then the sum:

$$ L_{\lambda_i}(x_0 \to x_1) = \sum_{j=2}^{k} \left( C\, W_j^1\, \frac{f(\lambda_i, \bar{x}_j^1)}{pdf_1(\lambda^s, \bar{x}_j^1)} + C\, W_j^2\, \frac{f(\lambda_i, \bar{x}_j^2)}{pdf_2(\lambda^s, \bar{x}_j^2)} \right), \quad i = 1, 2, \ldots, C, \qquad (4.20) $$

$$ W_j^\alpha = \frac{pdf_\alpha(\lambda^s, \bar{x}_j)}{\sum_{c=1}^{C} pdf_1(\lambda^c, \bar{x}_j) + \sum_{c=1}^{C} pdf_2(\lambda^c, \bar{x}_j)}, $$

where W_j^α is the Multiple Importance Sampling weight. The cluster of C spectral samples is evaluated at once, using a path constructed with a randomly chosen λ^s value. It is not necessary for the numbers of samples taken from techniques 1 and 2 to be equal, but in practice this approach gives good results. When Equations 4.14, 4.15 and 4.19 are substituted into 4.20, many factors cancel between the numerators and denominators, so the actual implementation is much simpler than it appears.
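The weight W_j^α of estimator (4.20) can be sketched as follows: pdf1 and pdf2 hold the hypothetical densities of both path generation techniques for every cluster wavelength, and pdfUsed is pdf_α(λ^s, x̄_j) for the technique that actually produced the path. This is an illustration under assumed interfaces, not the implementation.

    // MIS weight C * W_j^alpha from estimator (4.20).
    double misWeight(double pdfUsed,            // pdf_alpha(lambda_s, path)
                     const Spectrum& pdf1,      // pdf_1(lambda_c, path), all c
                     const Spectrum& pdf2) {    // pdf_2(lambda_c, path), all c
        double denom = 0.0;
        for (std::size_t c = 0; c < C; ++c)
            denom += pdf1[c] + pdf2[c];
        return denom > 0.0 ? double(C) * pdfUsed / denom : 0.0;
    }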
Algorithm Flaws

Unfortunately, Path Tracing is not a particularly efficient approach. Besides the local path sampling limitation, which affects virtually all unbiased light transport algorithms, it exhibits a few major flaws. The most common scene configurations which cause Path Tracing to fail are:

- Dominant indirect illumination: the directly illuminated surface area is small compared to the total scene surface area.
- Caustics, especially from relatively small light sources.
- Inappropriate local sampling: directions favored by local scattering do not match light transport over whole paths.

The first two problems can be attributed to the asymmetry of Path Tracing path generation. Light paths are started from the sensor, while this is not always optimal: mirrors placed near the camera are handled efficiently, but ones placed near light sources are not. The difficult case of dominant indirect illumination is presented in Figure 4.13. In 1990 Arvo and Kirk [Arvo & Kirk, 1990] presented an algorithm which represents light as a stream of particles, photons, traced from the light sources. After photons hit surfaces, they are scattered in random directions, while every hit (if visible) is recorded by the camera. This technique handles mirrors placed near light sources efficiently (thus allowing effective rendering of caustics), but it fails when mirrors appear next to the camera, which is an easy case for Path Tracing. In fact, it does not solve the asymmetry problem, but inverts it. Moreover, particle transport suffers from very poor performance when the scene contains many light sources and the majority of them are invisible to the camera: scenes are rendered with one camera but may contain many light sources, which is a potential issue for any algorithm that traces rays from lights towards the camera. The issue with inappropriate local sampling is discussed more thoroughly in later sections.

Figure 4.13: Comparison of Path Tracing results on scenes with different illumination. Left image: a mostly directly illuminated scene with matte surfaces. Right image: moving the sphere under the light source increases the role of indirect illumination, which causes Path Tracing to fail utterly. Both images were rendered with 4M samples at 640x480 resolution.

Figure 4.14: Results of simplified path tracing. Top left: direct illumination only, 2M samples, 55 s. Top right: one indirect bounce, 2M samples, 1 min 20 s. Bottom left: two indirect bounces, 2M samples, 2 min. Bottom right: reference full global illumination, 8M samples, 13 min.

Towards Real Time Global Illumination

Despite its poor quality when simulating sophisticated lighting phenomena, Path Tracing is fairly good at rendering simple global illumination effects, especially when rendering time is more important than final image quality. Path Tracing does not employ any complex analysis of gathered samples, and ray casting is the only time consuming procedure in this algorithm. If illumination is restricted to a direct component plus, say, at most two indirect ray bounces, a scene can be rendered at reasonable quality in a reasonable amount of time. What counts as reasonable is a matter of personal taste; one can assess the quality of the images presented in Figure 4.14, since this method never provides meaningful and objective error bounds.

Moreover, if the scene contains caustic illumination, it will be missing from the image, and the direct component of the illumination should be more important than the indirect one. In order to avoid black patches in images, an ambient light term must be introduced manually, even if two indirect bounces are computed. Nevertheless, the ambient light has a much smaller influence on the illumination than when only direct illumination is evaluated.

Figure 4.15: Batch of light transport paths generated by a Bidirectional Light Transport Algorithm. Each pair of vertexes from the camera and light subpaths is connected, forming a full path.

Bidirectional Path Tracing

Nowadays it is a well known fact that neither Path Tracing nor Particle Tracing is robust: sometimes it is most efficient to trace paths from the camera, and sometimes from a light source. The Bidirectional Path Tracing algorithm does exactly this. This algorithm is far more reliable, since it handles well all scenes which are easy for Path Tracing or Particle Tracing, and some other configurations besides. Moreover, if Bidirectional Path Tracing happens to fail, Path Tracing and Particle Tracing would also fail on such a scene.

Radiance Estimator

The algorithm works by generating one subpath starting from the camera and another subpath starting from a randomly selected light source. These paths are then connected in a deterministic step. In the original algorithm [Veach, 1997], full paths are generated by connecting every pair of vertexes from both subpaths (see Figure 4.15). Therefore, a path of length k can be generated in k different ways, varying the numbers of vertexes of the camera and light subpaths. Path contributions are then weighted by Multiple Importance Sampling. The probability of generating a camera subpath of length m is given by the following formula:

$$ pdf_E(\lambda, \bar{x}_{[0]}) = 1, \qquad pdf_E(\lambda, \bar{x}_{[1]}) = pdf_A(x_0), $$

$$ pdf_E(\lambda, \bar{x}_{[m]}) = pdf_A(x_0)\, pdf_W(x_0 \to x_1)\, pdf_{tr}(x_0 \to x_1)\, G_x(x_0 \leftrightarrow x_1) \prod_{i=1}^{m-2} \Big[ pdf_{\sigma x}(x_{i-1} \to x_i \to x_{i+1})\, pdf_{tr}(x_i \to x_{i+1})\, G_x(x_i \leftrightarrow x_{i+1})\, pdf_c(x_{i-1} \to x_i \to x_{i+1}) \Big], \qquad (4.21) $$

where pdf_σx is defined by Equation 4.16 and G_x by Equation 2.40. The formula for the probability of generating a light subpath of length n is defined in a similar way:

$$ pdf_L(\lambda, \bar{y}_{[0]}) = 1, \qquad pdf_L(\lambda, \bar{y}_{[1]}) = pdf_{A \cup V}(y_0), $$

$$ pdf_L(\lambda, \bar{y}_{[n]}) = pdf_{A \cup V}(y_0)\, pdf_L(y_0 \to y_1)\, pdf_{tr}(y_0 \to y_1)\, G_x(y_0 \leftrightarrow y_1) \prod_{i=1}^{n-2} \Big[ pdf_{\sigma x}(y_{i-1} \to y_i \to y_{i+1})\, pdf_{tr}(y_i \to y_{i+1})\, G_x(y_i \leftrightarrow y_{i+1})\, pdf_c(y_{i-1} \to y_i \to y_{i+1}) \Big]. \qquad (4.22) $$

The Bidirectional Path Tracing radiance estimator is then the sum:

$$ L_{\lambda_i} = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} C\, W_{m,n}\, \frac{f(\lambda_i, \overline{xy}_{[m+n+1]})}{pdf_E(\lambda^s, \bar{x}_{[m]})\, pdf_L(\lambda^s, \bar{y}_{[n]})}, \quad i = 1, 2, \ldots, C, \qquad (4.23) $$

where

$$ W_{m,n} = \frac{pdf_E(\lambda^s, \bar{x}_{[m]})\, pdf_L(\lambda^s, \bar{y}_{[n]})}{T_1 + T_2}, $$
$$ T_1 = \sum_{i=0}^{m} \sum_{j=1}^{C} pdf_E(\lambda^j, x_0 \ldots x_i)\, pdf_L(\lambda^j, x_{i+1} \ldots x_m y_n \ldots y_0), \qquad (4.24) $$
$$ T_2 = \sum_{i=0}^{n} \sum_{j=1}^{C} pdf_E(\lambda^j, x_0 \ldots x_m y_n \ldots y_{i+1})\, pdf_L(\lambda^j, y_i \ldots y_0) $$

is the Multiple Importance Sampling weight, and

$$ \overline{xy}_{[m+n+1]} = x_0 x_1 \ldots x_m y_n y_{n-1} \ldots y_0 \qquad (4.25) $$

is the concatenated light transport path, whose contribution f(λ, x̄ȳ) is defined by Equation 4.19. The infinite sum is terminated by the Russian roulette based absorption. As in Path Tracing, with probability one there exist finite subpath lengths s and t; each term of the sum in estimator 4.23 is assumed to be zero for m > s or n > t. Again, the C spectral samples are evaluated at once, using subpaths generated with a randomly chosen λ^s value.

The Bidirectional Path Tracing radiance estimator 4.23 contributes to radiance at several points on the image plane at once, not only at the point defined by the camera ray x_0 → x_1. In fact, each evaluated path with just one camera vertex provides an additional, different camera ray. These additional radiance samples are stored at seemingly random positions on the image plane. The image consists of two different, independently stored subimages: one for the initial camera rays x_0 → x_1, and one for the additional rays of the form x_0 → y_i, i = 0, 1, ..., t. These two subimages are reconstructed with different algorithms, and the final image is their sum [Veach, 1997].

Optimizations

Due to the connections between each pair of vertexes from the light and camera subpaths, the Bidirectional Path Tracing estimator requires a lot of rays to be traced: s + t rays are necessary for generating the subpaths and st rays are required for the visibility tests. This can hurt application performance, especially in highly glossy environments where early absorption is unlikely. Veach proposed a technique called efficiency optimized Russian roulette to solve this issue [Veach, 1997]. Unfortunately, Veach's approach assumes that the partially rendered image is stored in a pixel array, so it cannot be used in our implementation.
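The structure just described, two subpaths plus pairwise connections, is sketched below. Every name is an assumed placeholder, and the n = 0 and m = 1 special cases are omitted for brevity; this is an illustration of the estimator's shape, not the dissertation's implementation.

    #include <vector>

    // One bidirectional sample: build both subpaths, connect vertex pairs,
    // accumulate MIS-weighted contributions per estimator (4.23).
    void bdptSample(const Scene& scene, Rng& rng, Film& film) {
        std::vector<Vertex> cam   = traceSubpath(scene, rng, Origin::Camera);
        std::vector<Vertex> light = traceSubpath(scene, rng, Origin::Light);
        for (std::size_t m = 1; m <= cam.size(); ++m) {
            for (std::size_t n = 1; n <= light.size(); ++n) {
                if (!scene.visible(cam[m - 1], light[n - 1]))
                    continue;                               // connection blocked
                Spectrum f = contribution(cam, m, light, n);     // f(lambda, xy)
                double   W = misWeight(cam, m, light, n);        // (4.24)
                double  pE = pdfE(cam, m), pL = pdfL(light, n);  // (4.21), (4.22)
                film.splat(pixelOf(cam, m, light, n), (W / (pE * pL)) * f);
            }
        }
        // Paths with n = 0 (the camera subpath hits an emitter) and m = 1
        // paths that land on other image pixels are handled separately.
    }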

However, it is enough to generate a path of a given length just once to obtain an estimator with low variance. The actual number of camera and light subpath vertexes used to form a full path of length k is chosen at random, which leads to an unbiased estimator with somewhat higher variance, but nevertheless better efficiency due to the significantly reduced computational cost. Our technique requires exactly s + t + max(s, t) rays to be traced if the subpaths have lengths s and t.

The basic Bidirectional Path Tracing algorithm uses just one vertex at a light source for each camera subpath. This is inefficient, since specialized direct lighting techniques provide much less variance while increasing rendering time only slightly. We have implemented such an optimization exactly as suggested by Veach [Veach, 1997]. Both of these optimizations are presented in Figure 4.16.

Figure 4.16: Batch of light transport paths generated by an optimized Bidirectional Light Transport Algorithm. The algorithm uses specialized direct lighting and a reduced number of visibility tests. Direct illumination points y_i are used instead of y_0, and paths of a given length [k] are concatenated just once.

Open Issues

While far superior to Path Tracing and Particle Tracing, even when fine tuned for performance, Bidirectional Path Tracing still is not the algorithm of choice for rendering complex scenes. Besides the local path sampling limitation, it exposes some issues. The most notable are:

- Mainly invisible illumination: the majority of light subpath to camera subpath connections pass through opaque surfaces.
- Inappropriate local sampling: directions favored by local scattering do not match light transport over whole paths.

Due to the first flaw, Bidirectional Path Tracing is unsuitable for large open environments. It is very unlikely that a light subpath started somewhere in the sky, and generated independently of the camera subpath, is visible by the camera. Inappropriate local sampling occurs when, for example, a directly visible highly polished surface reflects most camera rays into an unlit part of the scene. The majority of illumination is then transported by paths which are unlikely to be generated, which results in highly noisy images. A similar case is a sparse participating medium illuminated by a very bright light source. Since the medium is not dense, almost all rays pass through it and escape into a darker area of the scene, while few of them interact with the medium. Because interaction occurs with low probability, and interaction points receive relatively strong direct illumination, the resulting image is dark with white spots. The main image feature is the illuminated participating medium, yet the algorithm prefers tracing rays into the far darker surroundings, effectively wasting computational resources.

Metropolis Light Transport

Path Tracing and Bidirectional Path Tracing generate independent light transport paths. If a path with a relatively high radiance contribution is found, it is dropped after being sampled, and a new path is generated from scratch. If at least some of the important paths happen to be difficult to sample at random (for example, due to inappropriate local sampling), the resulting image has large variance. Metropolis Light Transport [Veach, 1997], on the other hand, is capable of reusing previously generated paths when sampling new ones.

The Metropolis algorithm generates samples from a function f: X → R, where X is the space of all light transport paths. After the initial sample x_0 is constructed, the algorithm creates a sequence of samples x_1, x_2, ..., x_n. A sample x_i is obtained as a random mutation of the sample x_{i−1}. The mutation is then randomly accepted or rejected, and x_i is set to either the mutation result or the unmodified x_{i−1}, respectively. The mutation acceptance probability is chosen in such a way that the pdf of sampling the function f becomes, in the limit, proportional to f itself.

Path Mutations

In order to obtain the desired sampling density over X, the mutation acceptance probability must be appropriately evaluated. Suppose that a mutation transforms a path x into y. The condition which ensures that Metropolis sampling reaches equilibrium regardless of the initial sample x_0 is called detailed balance:

$$ f(x)\, pdf_T(x \to y)\, pdf_a(x \to y) = f(y)\, pdf_T(y \to x)\, pdf_a(y \to x), \qquad (4.26) $$

where pdf_T(x → y) is the conditional probability of generating the mutated path y given that the current path is x, and pdf_a is the mutation acceptance probability. This equation leaves some freedom in choosing an appropriate pdf_a, and since a high mutation acceptance probability improves convergence of the algorithm, the following expression provides the maximal possible acceptance probability:

$$ pdf_a(x \to y) = \min\left\{ 1, \frac{f(y)\, pdf_T(y \to x)}{f(x)\, pdf_T(x \to y)} \right\}. \qquad (4.27) $$

Additionally, in order to properly explore the light transport path space X, the mutation scheme must be ergodic: the algorithm has to converge to the equilibrium state no matter how x_0 is chosen. To ensure ergodicity, it is enough to have a mutation with pdf_T(x → y) > 0 for every x and y such that f(x) > 0 and f(y) > 0.

Mutation Strategies

Mutations of light transport paths are designed to efficiently capture a variety of illumination phenomena. A good mutation strategy should have the following properties: large changes to the light transport path, high acceptance probability, stratification over the image plane, and low cost. All these properties are difficult to obtain with a single mutation algorithm, so a proper mutation strategy offers a set of individual mutations addressed at different goals. Veach proposed using bidirectional mutations and lens, caustic, and multi-chain perturbations [Veach, 1997]. Pauly extended this mutation set with a propagation mutation [Pauly, 1999], which enhances the algorithm's robustness in the presence of participating media.
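The acceptance test of Equation (4.27) is small enough to sketch directly; fX and fY are the scalar contributions g = ‖f‖_λ of the current and mutated paths, and pdfXY, pdfYX are the transition densities pdf_T. The interfaces are illustrative.

    #include <algorithm>

    // Accept or reject one mutation x -> y per Equation (4.27);
    // 'xi' is a uniform [0,1) random number.
    bool acceptMutation(double fX, double fY,
                        double pdfXY, double pdfYX, double xi) {
        if (fX <= 0.0 || pdfXY <= 0.0)
            return true;                   // degenerate current state
        double a = std::min(1.0, (fY * pdfYX) / (fX * pdfXY));
        return xi < a;
    }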

A particular mutation, to be used to modify the current path, can be selected at random from the mutation set. The optimal probabilities of mutation selection are somewhat scene dependent. Assigning them roughly equal values produces a strategy which is fairly robust and less prone to excessive error on difficult scene configurations; however, the strategy can be suboptimal for simple scenes without sophisticated illumination phenomena.

Algorithm Initialization and Radiance Estimator

The Metropolis Light Transport algorithm exhibits two issues which have to be addressed: it is prone to the start-up bias phenomenon, and by itself it can compute only the relative brightness of parts of the image. Fortunately, both problems can be solved using a clever way of initializing the algorithm, due to Veach [Veach, 1997].

The start-up bias is caused by the arbitrary choice of the initial light transport path x_0. Although the algorithm samples paths with probability proportional to f(x), it does so only in the limit n → ∞; the initial path affects the sampling probability for any finite n. To avoid start-up bias, it is enough to multiply each sample x_i by a weight W_0 = f(x_0)/pdf(x_0), where x_0 is a path generated by an arbitrary capable light transport algorithm with probability pdf(x_0). The image generated by such weighted Metropolis sampling is unbiased and provides absolute brightness information as well, yet the algorithm is still unusable in practice. For example, if the path x_0 happens to have zero contribution, f(x_0) = 0, the resulting image is a black square. To solve this issue, the Metropolis algorithm can be initialized using m samples generated by a different light transport algorithm. The weight is then evaluated as:

$$ W_0 = \frac{1}{m} \sum_{j=1}^{m} \frac{f(x^j)}{pdf(x^j)}. \qquad (4.28) $$

The initial path of the Metropolis algorithm is then chosen at random from the generated set of m paths, x_0 ← x^j, where the jth path is chosen with probability proportional to f(x^j)/pdf(x^j). The sampling algorithm can be run several times to generate the total of n required mutations, each time starting from a different, randomly chosen initial path x_0. This feature enables efficient parallelization of the algorithm (see Section 5.2.4), as well as a variance based error estimation [Veach, 1997].

The Metropolis algorithm generates samples according to a function g: X → R. Since the path contribution f(x) is a spectrum, the spectrum has to be converted to a real value. A good solution is to set g(x) = ‖f(x)‖_λ, where the norm ‖·‖_λ of Equation 4.18 can be used. The image generated by the Metropolis algorithm is formed by storing spectral samples f(x) at the locations (u, v) on the image plane defined by the directions of the camera rays of the paths x. The samples are weighted by W_0 (Equation 4.28).
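The initialization procedure can be sketched as follows: m seed paths from an auxiliary sampler give the weight W_0 of Equation (4.28), and the initial state is resampled proportionally to f(x)/pdf(x). All types and helper names are placeholder assumptions.

    #include <vector>

    // Start-up phase sketch for Metropolis Light Transport.
    Path initializeMetropolis(Sampler& aux, std::size_t m, Rng& rng, double* W0) {
        std::vector<Path>   seeds(m);
        std::vector<double> ratio(m);
        double sum = 0.0;
        for (std::size_t j = 0; j < m; ++j) {
            seeds[j] = aux.samplePath(rng);
            ratio[j] = contribution(seeds[j]) / aux.pdf(seeds[j]);  // f/pdf
            sum += ratio[j];
        }
        *W0 = sum / double(m);                    // Equation (4.28)
        double u = rng.uniform() * sum;           // resample seed ~ ratio[j]
        for (std::size_t j = 0; j + 1 < m; ++j)
            if ((u -= ratio[j]) <= 0.0) return seeds[j];
        return seeds[m - 1];
    }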
Optimizations

Metropolis Light Transport tends to concentrate sampling effort on the relatively brighter parts of the image. While this is a welcome feature when bright and dark parts are mixed at a scale smaller than or comparable to the image resolution, dark areas which span several pixels receive few samples as well. It is even possible that some parts of an image are not sampled at all. The Metropolis algorithm can be optimized to address this issue, though [Veach, 1997]. First, it is possible to record rejected mutations as well, if samples are weighted appropriately, using Algorithm 4.1.

Moreover, the function f can be modified to directly account for differences in image brightness. Using the samples from the Metropolis algorithm initialization, a tentative image I of tiny resolution (say, 32x32) can be generated. The Metropolis algorithm then samples according to a function g defined as the ratio of f to the value of I at the given sample location. The magnitude of such a g typically does not exhibit substantial variation over different image regions.

Algorithm 4.1: Metropolis sampling optimized for recording both accepted and rejected mutations.

    for i = 1 to n do
        y ← mutate(x_{i-1})
        α ← pdf_a(x_{i-1} → y)
        record(α · f(y))
        record((1 − α) · f(x_{i-1}))
        ξ ← random()
        if ξ < α then
            x_i ← y
        else
            x_i ← x_{i-1}
        end if
    end for

When a scene does not contain sophisticated lighting phenomena and the majority of illumination is simple direct illumination, Metropolis Light Transport loses to basic Path Tracing. Veach [Veach, 1997] proposed excluding the direct illumination component from the Metropolis algorithm and evaluating it with a more efficient approach. However, as he noted, if a scene contains a lot of invisible light sources, this optimization may in fact turn out to be a serious handicap.

Comparison of MLT and BDPT Algorithms

Metropolis Light Transport is substantially more robust than Bidirectional Path Tracing. It solves the issues related to inappropriate local sampling and to difficult visibility between light sources and the camera. Moreover, the MLT algorithm reduces the impact of the local path sampling limitation; this flaw is still not completely solved, however: when light sources are sufficiently small and scene materials glossy, MLT is bound to fail as well. Unfortunately, the mutation based sampling scheme is less efficient on simpler scenes, which can be rendered more efficiently with BDPT.

Energy Redistribution Path Tracing

Energy Redistribution Path Tracing [Cline et al., 2005] is a relatively recent light transport algorithm based on light path mutations, similar to Metropolis Light Transport. The initial step of this algorithm generates light transport paths using simple path tracing; the number of such paths necessary to reduce noise to an acceptable level is much smaller than in pure path tracing, however. The generated paths are then mutated, possibly passing through image areas associated with neighboring pixels, thereby redistributing the energy of a path tracing sample over a larger image area.

Irradiance and Radiance Caching

The Irradiance and Radiance Caching algorithms are biased solutions to the light transport equation, based on the assumption that indirect illumination is likely to change slowly across the rendered scene, and can often be interpolated from already computed nearby irradiance/radiance values. All algorithms which use this kind of interpolation must solve three tasks: decide when to compute new values and when to interpolate among nearby ones, provide a structure for storage of the cached data, and efficiently evaluate new samples when interpolation happens to be unwanted. Images produced by irradiance/radiance caching typically are not noisy (only the direct illumination component can introduce some noise), but indirect illumination tends to be blurred.

The irradiance caching approach assumes that the cached value is, as the name suggests, irradiance, and therefore caching and interpolation can take place only on perfectly diffuse surfaces,

making the algorithm severely limited. The caching scheme is nevertheless often used on other matte surfaces [Pharr & Humphreys, 2004]; however, this inevitably causes error which cannot be reduced by simply taking more samples. The radiance caching technique is an improvement which does not exhibit such a major flaw: it allows caching and interpolation on diffuse and moderately glossy surfaces. The radiance caching approach caches the full directional light distribution, using spherical harmonics basis functions for its representation.

Since both irradiance and radiance caching trace rays only in the direction from the camera to the light sources, they are prone to substantial error when rendering phenomena like caustics. It has long been known that reliable rendering of such effects requires tracing rays in the opposite direction [Veach, 1997, Jensen, 2001]. Nowadays irradiance and radiance caching techniques are rarely used alone; instead they serve as a fairly important optimization of the final gathering step of the Photon Mapping approach (described in detail in the following section).

Photon Mapping

Photon Mapping is a biased, two pass algorithm for solving the light transport equation. In the first pass, light subpaths are traced from the light sources, and the carried energy is stored as so called photons at the points where rays intersect the scene geometry; then, a structure specialized for fast search (the so called photon map) is built. In the second pass, an image is rendered using flux estimation from the photon map as a source of indirect illumination. The key innovation in Photon Mapping is the use of a specialized data structure for photon storage, independent of the scene geometry. Photon mapping is particularly effective at rendering caustics, and it is not prone to the local path sampling limitation. Since Photon Mapping is biased, the method's error is not only random noise: in general it produces images with parts which are consistently too bright or too dark. Fortunately, the method is at least consistent: by increasing the number of stored photons, the algorithm's error can be decreased to an arbitrarily small value.

Photon Maps Building

In the original algorithm [Jensen, 2001], photons are emitted from light sources with probability proportional (or roughly proportional) to the emitted radiance. The photons are then scattered through the scene using BSDF sampling and Russian roulette based absorption. In particular, photon tracing is equivalent to the construction of light subpaths in Bidirectional Path Tracing. At each intersection of a ray with scene geometry, if f_s at the intersection point contains a diffuse part, the photon is stored. A stored photon contains the three pieces of information necessary for rendering from a photon map: the point of intersection, the incoming photon direction, and the photon weight. Additionally, photons are flagged as direct, caustic or indirect; this flag is also stored. For efficiency reasons, caustic photons are stored both in the normal map and in a separate caustic map. Moreover, since photons stored on scene surfaces are treated differently from ones stored in the scene volume, these two kinds of photons are kept in different maps. Thus, four maps are constructed: surface global and caustic, and volumetric global and caustic. Some of these maps may be empty if, for example, the scene contains no specular surfaces capable of generating caustics. The method for tracing photons is presented in Algorithm 4.2.
Photons are initially stored in arrays. After the necessary number of photons has been traced, the appropriate photon maps are built. Jensen [Jensen, 2001] proposed a kd-tree structure for storing photons. This structure is very well suited to the purpose, since it imposes no substantial storage overhead over a simple array, can be constructed in O(n log n) time, and offers O(log n) expected time for photon searches, where n is the number of photons stored in the tree.
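A stored photon record and a fixed-count gather query might look as sketched below; the exact field types and the KdTree interface are assumptions mirroring the description above, not the dissertation's data layout.

    #include <cstdint>
    #include <vector>

    // Stored photon: position, incoming direction, weight, and path flag.
    struct Photon {
        float    position[3];     // intersection point
        float    theta, phi;      // incoming direction (spherical)
        Spectrum weight;          // carried flux
        uint8_t  flag;            // direct / caustic / indirect
    };

    // Expand a sphere around x until M photons are found or maxRadius is
    // reached; fewer photons are returned in unlit regions of the scene.
    std::vector<const Photon*> gather(const KdTree& map, const float x[3],
                                      std::size_t M, float maxRadius) {
        return map.nearestWithin(x, M, maxRadius);   // assumed query method
    }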

Algorithm 4.2: Tracing photons for construction of photon maps.

    N ← 0    // number of shot photons
    S ← 0    // number of stored photons
    while S < required_photons do
        N ← N + 1
        emit(photon, position, direction)
        flag(photon, as_direct)
        while true do
            position ← nearest_intersection(position, direction)
            if position = ∅ then break    // ray escaped from scene
            if material(position) has matte part then
                if position ∈ A then store(photon, surface_global_map)
                else store(photon, volume_global_map)
                if flagged(photon, as_caustic) then
                    if position ∈ A then store(photon, surface_caustic_map)
                    else store(photon, volume_caustic_map)
                end
                S ← S + 1
            end
            direction ← scatter(position, direction, f_s, pdf)
            absorption_prob ← 1 − min(1, luminance(f_s)/pdf)
            if random() < absorption_prob then break
            else scale(photon, f_s / (pdf · (1 − absorption_prob)))
            if not flagged(photon, as_indirect) and glossy(f_s) then
                flag(photon, as_caustic)
            else
                flag(photon, as_indirect)
        end
        if N > maximum_shots then break    // give up
    end

Rendering of Photon Maps

When all the necessary photon maps are constructed, the scene is rendered with specialized ray tracing. The light transport paths to be accounted for are partitioned into a few sets:

- direct illumination: paths of the form L M? G* E, i.e. a directly illuminated matte surface or a light source, seen through zero or more glossy reflections,
- caustics: paths of the form L G+ M G* E, i.e. a matte surface illuminated through one or more glossy reflections, seen through zero or more glossy reflections,
- indirect illumination: paths which contain at least two matte scattering events, e.g. of the form L (M|G)* M (M|G)* M G* E.

Each set is evaluated in a different manner. Direct illumination (and illumination visible through an arbitrary number of glossy reflections) is always evaluated by sampling light sources, without relying on photon maps. In this case, subpaths are traced from the camera through an arbitrary number of glossy (G) reflections until either a matte (M) reflection occurs or an absorption happens. Directly visible caustics (and caustics visible through an arbitrary number of glossy reflections) are rendered using the separate caustic maps. The indirect illumination (the rest of the possible light transport paths) is rendered using the global photon maps.

When a photon map is decided to be used, the incoming radiance is based on the incoming flux estimated from the map content, instead of being calculated exactly. The reflected radiance is evaluated from the incoming radiance using formula (2.9) if x ∈ A or (2.18) if x ∈ V. The

incoming radiance can be expressed in terms of flux:

$$ L_i(x, \omega) = \begin{cases} \dfrac{d^2\Phi_i(x, \omega)}{dA(x)\, d\sigma^\perp(\omega)}, & \text{if } x \in A, \\[1ex] \dfrac{d^2\Phi_i(x, \omega)}{\sigma_s(x)\, dV(x)\, d\sigma(\omega)}, & \text{if } x \in V, \end{cases} \qquad (4.29) $$

which, after substitution into Equations (2.9) or (2.18), respectively, gives the following expression for the reflected radiance:

$$ L_r(x, \omega_o) = \begin{cases} \displaystyle\int_\Omega f_s(x, \omega_i, \omega_o)\, \frac{d\Phi_i(x, \omega)}{dA}, & \text{if } x \in A, \\[1ex] \displaystyle\int_\Omega f_p(x, \omega_i, \omega_o)\, \frac{d\Phi_i(x, \omega)}{dV}, & \text{if } x \in V. \end{cases} \qquad (4.30) $$

The incoming flux can be approximated from the stored photons:

$$ L_r(x, \omega_o) \approx \begin{cases} \dfrac{1}{A} \displaystyle\sum_{i=1}^{M} f_s(x, \omega_{pi}, \omega_o)\, \Phi^A_{pi}(x, \omega_{pi}), & \text{if } x \in A, \\[1ex] \dfrac{1}{V} \displaystyle\sum_{i=1}^{M} f_p(x, \omega_{pi}, \omega_o)\, \Phi^V_{pi}(x, \omega_{pi}), & \text{if } x \in V, \end{cases} \qquad (4.31) $$

where M is the number of photons used in the flux estimate and Φ_pi is the flux associated with the ith photon. Photons stored on surfaces cannot be used to estimate volumetric flux and vice versa, hence the superscripts Φ^A and Φ^V. The estimation is performed over an area A or a volume V centered around the point of interest x. The simplest way of estimating the flux is to expand a sphere around x until it contains the required number M of photons or a prespecified maximum search radius is reached (in the latter case M is reduced, possibly even to zero in unlit regions of the scene). If a sphere of radius r is used for photon selection, then V = (4/3)πr³ and A = πr² (the intersection of the sphere with a surface, which is assumed to be locally flat within a small radius around x). Jensen [Jensen, 2001] proposed a variety of optimizations for more effective flux approximation, the majority of which we have implemented.

The flux estimation using photon maps is the source of bias in the Photon Mapping algorithm. The method is consistent, however, because under some circumstances the estimation error can be made arbitrarily small by increasing the number of emitted photons N [Jensen, 2001]:

$$ \forall \alpha \in (0,1): \quad \lim_{N \to \infty} \sum_{i=1}^{N^\alpha} f_s(x, \omega_{pi}, \omega_o)\, \Phi_{pi}(x, \omega_{pi}) = L_r(x, \omega_o). \qquad (4.32) $$

The number of photons M = N^α used in the radiance estimate increases to infinity together with N. Because of α, the two infinities are of different order, so the photon search radius r becomes arbitrarily small. Equation (4.32) is valid if f contains no δ distributions and, if the point x lies on a surface, the surface is locally flat around x. This formula is the key to ensuring convergence of the one-pass variant of Photon Mapping described later in this section.
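The surface case of the flux estimate (4.31) can be sketched as follows; the gather helper, the Spectrum arithmetic, and all other interfaces are placeholder assumptions.

    // Gather up to M photons in a sphere around x; divide by the disc area
    // pi * r^2 of the actual search radius used (locally flat assumption).
    Spectrum estimateRadiance(const KdTree& map, const float x[3],
                              const Dir& wo, const Bsdf& bsdf,
                              std::size_t M, float maxRadius) {
        float r = 0.0f;                            // actual radius found
        auto photons = gatherWithRadius(map, x, M, maxRadius, &r);
        Spectrum Lr = Spectrum::zero();
        for (const Photon* p : photons)
            Lr += bsdf.eval(p->incomingDir(), wo) * p->weight;  // f_s * flux
        float area = 3.14159265f * r * r;          // A = pi r^2
        return area > 0.0f ? Lr / area : Spectrum::zero();
    }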

Full Spectral Rendering Support

Multiple Importance Sampling based spectral rendering (see Section 4.3) integrates flawlessly with Path Tracing and Bidirectional Path Tracing. The notable case where spectral sampling causes difficulties is Photon Mapping, designed by Jensen to work optimally with RGB triplets only [Jensen, 2001]. There are two issues to solve: first, there are no complete light transport paths connecting a light source and the camera, and second, millions of individual photons have to be stored, causing excessive memory consumption if a full spectrum is used to describe them. A recent work [Lai & Christensen, 2007] addresses the memory issue. Unfortunately, this algorithm converts photon spectra to RGB prior to storing them in a map, and converts RGB back to spectra when searching through the photons. Our approach, on the other hand, is designed to always converge to the true result as the number of photons increases, and therefore significant compression of spectral data is unsuitable. We trace and store clusters of photons with different wavelengths, instead of describing them by RGB triplets.

First, in order to explore the wavelength space properly, each emitted photon cluster must have individually chosen wavelengths. An obvious place for optimization is that one emitted photon cluster typically corresponds to several stored photon clusters, so the cluster wavelengths are stored once per emission. Moreover, for storing energy one can experiment with a nonstandard floating point format instead of IEEE single precision. Using 8-sample clusters requires 32 B of data per stored photon, not to mention an additional 32 B per emission, which is far more than the 12 B required by an RGB based implementation. If a compact float format with a shared exponent is used, the latter can be compressed even to 4 B, however with potential loss of image quality. We have left this for further research.

When a photon is about to be stored, its energy is multiplied by the weight given by Equation (4.13), which accounts for all wavelength dependent scattering events encountered. In the second pass, rendering of the photon map is performed; camera rays should be weighted similarly prior to photon map lookups. In classic Photon Mapping, photons are searched within a sphere centered at the intersection point, whose radius must be chosen carefully: too small causes noise, too large blurriness. We extend this approach to the wavelength domain as well. If a photon cluster is selected for a flux estimate by the sphere test, additional tests are performed on its individual photons (with their associated wavelengths) using a spectral search distance in wavelength space. As with the spatial radius, the spectral distance must be chosen carefully in order to avoid noise and blurriness.
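The wavelength test just described might look as sketched below; here each photon of the spectral variant is assumed to carry its own wavelength and scalar energy, and all names are illustrative placeholders.

    #include <cmath>
    #include <vector>

    struct SpectralPhoton { float lambda; float energy; };  // assumed record

    // Spectral extension of the gather: a photon accepted by the spatial
    // sphere test contributes to the estimate at wavelength 'lambda' only
    // if its own wavelength lies within 'spectralRadius'.
    double spectralFlux(const std::vector<const SpectralPhoton*>& found,
                        float lambda, float spectralRadius) {
        double flux = 0.0;
        for (const SpectralPhoton* p : found)
            if (std::fabs(p->lambda - lambda) <= spectralRadius)
                flux += p->energy;
        return flux;
    }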

Optimizations

The basic Photon Mapping structure can be optimized in numerous ways. First, final gathering [Jensen, 2001] is added to reduce the blurriness resulting from using the global photon map directly to estimate indirect flux. Since final gathering is computationally expensive, its cost is often reduced with irradiance or radiance caching (see Section 4.4.4). Apart from these two most important improvements, there are several other useful Photon Mapping optimizations, e.g. flagging all specular surfaces and explicitly shooting photons onto them, or using shadow photons for accelerated direct illumination evaluation (see [Jensen, 2001]), among many others. The results of Photon Mapping compared with Bidirectional Path Tracing are presented in Figure 4.17.

Figure 4.17: Comparison of rendering results between Photon Mapping (left image) and Bidirectional Path Tracing (right image), at 800x450 resolution. Both images were rendered in approximately the same time, using 256k image samples with 32k photons, and 512k image samples, respectively. BDPT tends to produce a somewhat noisy image, while PM samples are more expensive to evaluate, so fewer of them can be generated in the same time.

Figure 4.18: Comparison of rendering results between one pass Photon Mapping (left image) and two pass Photon Mapping (right image). The direct illumination is rendered using the photon map to show the rendering error more clearly. Both images were rendered at 800x600 resolution, in approximately the same time, using 256k image samples. The left image was rendered while progressively increasing the number of photons from 1k to 128k, while the right one uses 128k photons from the start. The most notable artifact in the left image is the noise in the shadowed area on the left side of the ring, a remnant of the blurriness of rendering with too few photons.

One Pass Photon Mapping

Photon Mapping can actually be done in one pass, with only minor losses in efficiency compared to the original approach (see Figure 4.18). This approach is very useful when an interactive preview of rendering results is required (described in more detail in Section 5.4). The new algorithm uses a linear function of the number of image samples n to estimate the minimal photon count needed in the photon map to obtain an image of the quality determined by n. The photon map is therefore no longer a static structure: new photons are added while new image samples are rendered. Two issues immediately have to be solved: synchronization of read and write accesses to the photon map structure in parallel photon mapping, and balancing the kd-tree. Synchronization can be performed with simple read-write locks (the classic readers-writers problem). Kd-tree balancing, on the other hand, requires a significant algorithm modification. We have chosen to balance the scene space instead of the photons. The original algorithm starts with the bounding box of all photons (unknown in our approach) and at each iteration places a splitting plane at a position such that half of the photons remain on one side of the plane. Our algorithm instead starts with the bounding box of the entire scene, and at each iteration splits it in half across the dimension in which the box is longest. Splitting stops when every node contains fewer photons than a certain threshold (5-6 seems to be optimal) or a maximum recursion depth is reached. Adding new photons then requires splitting only those nodes which come to hold too many photons.

The idea is somewhat similar to the Irradiance Caching algorithm [Ward et al., 1988]. As in that method, our approach starts with an empty structure and fills it during rendering. However, Irradiance Caching calculates irradiance samples when they are needed by camera rays, while our modified Photon Mapping traces photons in a view independent manner.
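The scene-space splitting rule of the one pass variant can be sketched as follows; a leaf splits in half along the longest axis of its own bounding box once it holds more than leafCapacity photons, so insertion never requires rebalancing. Types and names are illustrative, with leafCapacity of 5-6 following the text.

    // Insert one photon into the scene-space kd-tree of the one pass variant.
    void insert(KdNode* node, const Photon& p, int depth) {
        if (node->isLeaf()) {
            node->photons.push_back(p);
            if (node->photons.size() > leafCapacity && depth < maxDepth) {
                int axis = node->bounds.longestAxis();
                node->split(axis, node->bounds.center(axis));  // halve the box
            }
            return;
        }
        KdNode* child = p.position[node->axis] < node->splitPos
                      ? node->left : node->right;
        insert(child, p, depth + 1);
    }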

Figure 4.19: Quickly generated images with one pass Photon Mapping. The left image was generated in 8 seconds, while the progressively refined right image took 30 seconds. Both images were rendered in 640x480 resolution, in parallel, on a 2.4GHz Intel Core 2 Duo.

Metropolis Light Transport is significantly more reliable. Moreover, we have found that Photon Mapping has problems with direct illumination from strongly directional light sources, like car headlights or torches. In such cases, it is better to evaluate illumination by sampling paths starting from the light sources, which is precisely what Bidirectional Path Tracing does.

Towards Real Time Global Illumination: Second Approach

The Photon Mapping algorithm is a good candidate for real time global illumination. The obvious approach of using relatively few photons in the photon map causes blurry indirect illumination, yet it does not adversely affect the direct illumination component and renders quickly. If the one pass variant is used, the image can be progressively refined as more photons are added to the photon map. The results are presented in Figure 4.19.

4.5 Combined Light Transport Algorithm

Section 4.4 gave a brief description of the most important currently used ray tracing algorithms for global illumination evaluation. The important observation is that some of these algorithms can be better than others when rendering a particular scene, while the situation can be the opposite for a different scene. Therefore, the best algorithm cannot be chosen without detailed knowledge of what the scene contains, and the user has to decide which one to use, a situation which should be avoided if fully automatic global illumination evaluation is the purpose. The rest of this section starts with the motivation behind the proposed light transport algorithm. Next, there is a description of the choice of the unbiased part and of combining it with Photon Mapping. The main difficulty here is to decide when a light transport path should be sampled by an unbiased technique and when it should be estimated using the photon map. Then, results of the combined approach are presented, and finally, possible future improvements are described.

4.5.1 Motivation

Our idea is to combine more than one algorithm in order to achieve the best possible result. Using one of the unbiased techniques, e.g. Bidirectional Path Tracing, is a good idea for many scenes, due to their modest memory requirements, ease of parallelization, and easily estimated error. The combined algorithm, however, should detect difficult cases resulting primarily from the local path sampling limitation, and when the estimated error exceeds some threshold, switch to
one of the Photon Mapping variants. In this way, Photon Mapping is used only when it is strictly necessary. Thus, the proposed combined algorithm can reduce the risk of excessive memory requirements and the risk of unpredictable output due to bias. The combined light transport algorithm has the potential of being much more reliable than any unbiased technique alone.

On the other hand, the basic Photon Mapping technique is often the algorithm of choice, used no matter what the rendered scene represents. We argue that this approach can lead to poor results, even on simple scenes. For example, the scene presented in Figure 4.20 contains strong direct illumination from a highly directional light source, and caustics due to a metallic ring. The direct illumination component is evaluated inefficiently using a photon map, and even more poorly using the direct lighting strategy recommended for Photon Mapping. In fact, the best possible way to account for this kind of illumination is to trace rays from the light source and record their intersections with scene geometry directly by the camera, exactly what Bidirectional Path Tracing does. Moreover, caustics are not always created by focusing illumination (compare the caustics inside and outside of the ring). If a caustic is created without a focusing effect, the photon density is far lower, and BDPT is able to render such a caustic far more efficiently than Photon Mapping, which is considered the algorithm of choice for rendering caustics. On the other hand, BDPT fails at rendering indirectly visible caustics (i.e. their reflections in mirrors). Considering the examples mentioned above, it is clear that neither Bidirectional Path Tracing nor Photon Mapping working alone can be reliable on a wide variety of real world scenes. In fact, they cannot efficiently render all parts of even the very simple test scene from Figure 4.20. This is the main reason for developing an algorithm which contains both BDPT and PM components and selects one of them at runtime, on a per light transport path basis.

4.5.2 Merging of an Unbiased Algorithm with Photon Mapping

The combined algorithm contains two parts: Bidirectional Path Tracing and Photon Mapping. In order to render scenes properly, the algorithm has to decide which part is used for which light transport paths. No path may be skipped, since this would make the resulting image too dark, and similarly, no path may be accounted for twice. The main component of the algorithm, which drives the computation, is the Bidirectional Path Tracing part. Having constructed a light transport path, it decides whether to add its contribution to the result, or to pass it for evaluation by Photon Mapping. The BDPT algorithm can generate a light transport path with k vertices using k + 1 techniques, by varying the number of vertices generated by tracing rays from the camera and from the light sources. These k + 1 techniques are assigned weights in such a way that the weighted sum of their contributions produces an image with variance as low as possible. That is, a technique which can be a source of high variance is assigned a low weight, while for the algorithm to remain correct, the weights are normalized to add up to one. Therefore, if every technique generates a given light transport path with low probability, the weighting does not help at all, and BDPT works very inefficiently, producing an image with overly high variance.
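A common choice for these weights is the balance heuristic of Veach [Veach, 1997]. As a sketch, if technique s generates a path \bar{x} with probability density p_s(\bar{x}), then

    w_s(\bar{x}) = \frac{p_s(\bar{x})}{\sum_{t=0}^{k} p_t(\bar{x})}, \qquad \sum_{s=0}^{k} w_s(\bar{x}) = 1.

When every p_t(\bar{x}) is small, each sampled contribution f(\bar{x})/p_s(\bar{x}) is large regardless of the weights, which is exactly the high variance case that the combined algorithm hands over to Photon Mapping.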
Moreover, if the scene contains point light sources and perfectly specular materials, and is rendered with a pinhole camera, BDPT is likely to miss certain illumination features completely. The task of the combined light transport algorithm is therefore to detect such cases, omit evaluation of these paths by BDPT (if they are not missed anyway), and estimate their contribution using a photon map instead.

The modified BDPT part of the combined algorithm is designed to look for special cases of light transport paths: paths of the form LG+XG+E, where X -> D | DG+XG+D (L denotes a light source, E the eye, D a diffuse and G a glossy scattering event, and + denotes one or more occurrences). Such paths have at least five vertices, have all matte scattering events separated by glossy ones, and have glossy scattering events next to the light source and the camera. Thorough testing of the BDPT algorithm shows that such paths have low weights for all BDPT sampling techniques, and are the main source of error in images produced by this algorithm. If such a path is detected, it is immediately omitted from further processing by the BDPT component.

The Photon Mapping component starts with an empty photon map. If the BDPT component detects glossy surfaces next to the camera and light sources, it orders the PM component to start filling
Figure 4.20: Comparison of results of various light transport algorithms. Top left image: Bidirectional Path Tracing, 4M samples. Top right image: Photon Mapping, 4M samples, 0.5M photons. Bottom image: Combined algorithm, 4M samples, 0.5M photons, which uses either BDPT or PM, estimating which one is likely to perform better. BDPT is roughly 2.5 times faster than PM and the proposed combined algorithm.

a photon map, because it is then highly likely that BDPT is about to generate a high variance light transport path, or even to miss some illumination features completely. Samples from the photon map are then added to the BDPT generated image. These samples are restricted to contain illumination only from paths omitted by the BDPT component. Additionally, the combined algorithm employs one important optimization: the same subpaths traced from light sources are used both for BDPT samples and for filling the photon map. This optimization significantly reduces the number of traced rays, at the cost of having the photon map highly correlated with BDPT sampling. During our tests, however, this correlation appeared to be harmless.

4.5.3 Results and Conclusion

The comparison of images generated by the newly proposed algorithm with Path Tracing, Bidirectional Path Tracing, and Photon Mapping is given in Figure 4.21. A detailed numerical comparison of the convergence of these algorithms is presented in Chapter 7, together with a reference image in Figure 7.3. As expected, the local path sampling limitation does not cause the combined algorithm to fail, and the bias from Photon Mapping is not a serious issue either. On the other hand, the algorithm cannot cope well with scenes where the light sources are separated from the camera by difficult visibility blockers. This is not surprising, since such scenes cannot be efficiently rendered by either Photon Mapping or Bidirectional Path Tracing.
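As a concrete illustration of the detection rule from Section 4.5.2, a classifier for this path family (glossy events adjacent to the light and the camera, every diffuse event surrounded by glossy ones) might be sketched in C++ as follows; the names are hypothetical and this is not the thesis implementation:

    #include <vector>

    enum class Vertex { Light, Diffuse, Glossy, Eye };

    // Matches paths of the form L G+ (D G+)+ E, i.e. the LG+XG+E family
    // which BDPT samples poorly.
    bool isPhotonMapPath(const std::vector<Vertex>& path) {
        const std::size_t n = path.size();
        if (n < 5) return false;                     // at least L, G, D, G, E
        if (path.front() != Vertex::Light || path.back() != Vertex::Eye)
            return false;
        if (path[1] != Vertex::Glossy || path[n - 2] != Vertex::Glossy)
            return false;
        bool sawDiffuse = false;
        for (std::size_t i = 1; i + 1 < n; ++i) {
            if (path[i] == Vertex::Diffuse) {
                // every diffuse event must be surrounded by glossy ones
                if (path[i - 1] != Vertex::Glossy || path[i + 1] != Vertex::Glossy)
                    return false;
                sawDiffuse = true;
            } else if (path[i] != Vertex::Glossy) {
                return false;                        // interior vertices are D or G
            }
        }
        return sawDiffuse;                           // X produces at least one D
    }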

Figure 4.21: Comparison of results of various light transport algorithms. Top left: Path Tracing, 2M samples. Top right: Bidirectional Path Tracing, 1M samples. Bottom left: Photon Mapping, 256k samples, 128k photons. Bottom right: Combined algorithm, 1M samples, 32k photons, which uses either BDPT or PM, estimating which one is likely to perform better. All images were rendered in approximately the same time.

Chapter 5

Parallel Rendering

Currently, to achieve the best possible performance, rendering must be run in parallel. Due to recent advancements in microprocessor technology, significant performance improvements come from multiplying computational cores rather than from substantially increasing the efficiency of sequential processing. Fortunately, the majority of image synthesis algorithms exhibit a very high degree of inherent parallelism. In the simplest case, each of millions of image fragments can be evaluated independently of the others. However, more advanced techniques often introduce some dependencies in order to accelerate computations, sometimes significantly.

Parallel rendering is not a new idea, though, and a lot of investigation has been dedicated to this area. Research has been conducted into parallelization of well-known sequential algorithms. Jensen [Jensen, 2000] shows how to run his Photon Mapping in a parallel environment. Debattista et al. [Debattista et al., 2006] parallelized Irradiance Caching. Parallelization also allows effective rendering of huge data sets. Dietrich et al. [Dietrich et al., 2005] and Stephens et al. [Stephens et al., 2006] described techniques for interactive rendering of massive models using a shared memory approach. Some important advancements allow parallelized ray tracing algorithms to run in real time. Parker et al. [Parker et al., 1999] showed a method of running a classic ray tracer in real time. Dmitriev et al. [Dmitriev et al., 2002], Wald et al. [Wald et al., 2002], Benthin et al. [Benthin et al., 2003] and Dachsbacher et al. [Dachsbacher et al., 2007], on the other hand, implemented more advanced algorithms which still yield interactive frame rates. Benthin [Benthin, 2006] also proposed using coherent rays in addition to parallelization. These algorithms are variants of severely restricted global illumination, though, and do not try to solve the light transport equation correctly.

A recently popular model for parallelization is the stream processor [Purcell, 2004, Gummaraju & Rosenblum, 2005, Gummaraju et al., 2007]. Purcell [Purcell, 2004] describes how this model works, expresses ray tracing based algorithms with it, and finally implements a stream processor on a DirectX 9 class GPU. Unfortunately, this design and implementation is inflexible and suboptimal, due to the quirks and peculiarities of the GPU programming model of that time. Currently (2008/2009), it seems that technologies such as OpenCL (Open Computing Language) [Munshi, 2009, Trevett, 2008] or Intel Ct (C for throughput computing) [Ghuloum et al., 2007] have a lot of potential to replace GPU shading languages, which still offer insufficient programmability. They are, however, still at a development stage.

The rest of this chapter starts with a detailed description of the concept of stream processing. Next, there is an explanation of how to express ray tracing in terms of stream computation. Then, there is a discussion of the choice of optimal hardware for implementing an extended stream machine, and finally, a technique for interactive visualization of ray traced results using a CPU cooperating with a GPU is presented.

5.1 Stream Processing

Stream processing is a constrained form of general parallel computation. The main idea behind this concept is a simplification of an arbitrary parallel program and of the hardware necessary to execute it, which results in increased performance at the cost of limited flexibility. Thus, not every parallel algorithm can be run on stream machines. Fortunately, some ray tracing techniques can easily be expressed as stream algorithms. Stream processing is best suited for compute-intensive tasks which exhibit no dependencies between operations on different data, together with locality of memory accesses. It is also possible to offload the machine's central processor by running such parts of an application on specialized streaming hardware. This section starts with a detailed description of the principles of stream processing, followed by our novel extension to the stream machine concept: a software cache, which is very handy for some of the more advanced ray tracing algorithms. Finally, there is a description of quasi-Monte Carlo integration as a stream computation.

5.1.1 Stream Processing Basics

The stream processing concept is based on so called streams and kernels. Streams are sets of records, while operations on them are performed by kernels. A kernel is an arbitrary function which takes exactly one stream as input and one or more streams as output. A kernel operates on individual records of the input stream independently, thus many such records can be processed in parallel. When a record is processed, the kernel writes zero or more result records into any combination of output streams. The kernel is executed once for each input record, but due to the parallel nature of the processing, the order of output records does not necessarily match the order of the input ones. In some definitions, however, this order is forced to be kept [Purcell, 2004]. Kernel execution must be stateless, that is, a kernel cannot store static variables of any kind. Additionally, kernels have access to arbitrary constant data structures, which can be read during processing of the input stream. The concept of stream processing is presented in Figure 5.1.

Figure 5.1: Basic stream machine architecture: a kernel reads an input stream and read-only memory, and writes output streams.

Kernels and streams can be combined into more complex designs. For example, the output of one kernel can be the input of another one. Additionally, more than one kernel can write to the same stream. Kernels can be organized into feedback loops, where one of the outputs is written into the input stream. However, there is no possibility of having more than one input stream, since the kernel would not know how to match input records for processing.

5.1.2 Extended Stream Machines with Cache

The major issue, seriously limiting the variety of algorithms suitable for stream machines, is the total lack of any possibility of data transfer between executions of a kernel on different records of
the input stream. We argue that adding an additional read-write memory can significantly improve the performance and potential of stream machines. We refer to this extra memory as a cache. Basically, the presented cache design guarantees atomicity of read and write operations, although the order of individual operations is not guaranteed. That is, if processing of an earlier element in a stream by a kernel induces a cache write operation, and processing of a later element causes a cache read, then during the read operation the cache will contain the previously written portion of data either completely or not at all. In the actual stream machine design, an algorithm can use an arbitrary number of independently synchronized caches (limited only by the capacity of machine memory). The extended stream machine is presented in Figure 5.2.

Figure 5.2: Extended stream machine architecture: in addition to read-only memory, the kernel has access to a read-write cache.

Motivation

The motivation behind this extension is, among others, to enable expressing, in a stream style, algorithms which adjust themselves to the input data at runtime. Such algorithms could gather some data while processing an initial part of the input stream, and then use the gathered information to accelerate processing of the rest of the stream. Such an algorithm can never fail: even if a cache write operation is slow and a cache read comes before the write is completed, finding no data at all, the algorithm gains nothing but is still correct. An example of such an algorithm is presented in Section 5.1.3.

Cache Synchronization

Cache synchronization in an extended stream machine is a classic readers-writers problem in the domain of concurrent programming [Ben-Ari, 2006]. This problem is known to have a universal, provably correct solution, which unfortunately cannot be tuned to be optimal in all cases from a performance point of view. Three generally known solutions are worth mentioning:

Favour of readers. Whenever a reader already has access to the shared resource, each newly arriving reader is also granted access. Typically this is the best performing solution, because readers never wait unnecessarily (they wait if and only if access is granted to a writer). However, the solution is incorrect because of the risk of starvation of writers.

Favour of writers. Whenever a writer waits for access to the shared resource, no reader is granted access. This solution is typically far from optimal, because readers wait unnecessarily often. Moreover, it is possible that at least one writer is waiting at all times, which causes starvation of readers.

Fair lock. Readers and writers wait in one common queue. A newly arriving reader gains access if and only if the resource is not locked for writing and no writer is waiting in the queue. A writer gains access if and only if the resource is unlocked. Since there is one common
queue, neither readers nor writers are prone to starvation. However, blocking a reader just because a writer is waiting can potentially deteriorate the algorithm's performance.

The fair lock algorithm can be improved using priority queues for readers and writers. By adjusting priorities, priority queues can be tuned for near optimal performance, but this tuning is highly sensitive to the parameters of a particular application: how many read and write requests the application generates, and how long they last. There are also possibilities involving temporary priority boosts. For example, whenever a reader has already been granted access, all newly arriving readers get a slight temporary boost in priority. Similarly, when the resource becomes free and there are few waiting readers, a writer may have its priority boosted, in order to gather more readers in the queue and grant them access simultaneously. Moreover, asynchronous writes can also be useful. In this case the writing process is always released immediately and never waits: the data are written into the cache if the cache is unlocked; otherwise they are copied into a private temporary buffer, and the write operation is executed at the next opportunity. All these concepts are evaluated with selected parallel ray tracing algorithms, and the results are presented in Section 5.2.5.

5.1.3 Stream Monte Carlo Integration

This section presents a high level description of a simple algorithm which can potentially perform better on a stream machine with the cache extension. Assume that a standard Monte Carlo estimator (Equation 3.18) has to be computed. Let the data necessary to calculate f(x) be placed in the constant memory of a stream machine. The input stream may contain successive values of a canonical random variable ξ_i. The kernel should then map the values ξ_i to values of the desired random variable X_i with the requested probability density, evaluate pdf(X_i) and f(X_i), and finally write the ratio f(X_i)/pdf(X_i) into the output stream. The main processor then has the trivial task of summing all the numbers from the output stream and dividing the sum by the number of records in the stream. The algorithm is simple, however it is inflexible and cannot adapt itself to the shape of f(x). In some cases such adaptivity can be extremely important.

Monte Carlo Integration with Cache

If the stream machine offers some form of read-write memory, there is much more flexibility in algorithm design. The pseudocode of the kernel function is:

Algorithm 5.1: Kernel for Monte Carlo integration on an extended stream machine
Check the cache for information about the shape of f(x);
if something was found then
  use the adjusted pdf(x);
else
  use the basic (hard-coded) pdf(x);
end
Evaluate X_i from ξ_i using the resulting adaptive pdf(x);
Evaluate f(X_i);
Update the cache according to the value of f(X_i);
Write f(X_i)/pdf(X_i) to the output;

This algorithm has a chance to adapt itself to the behaviour of the function f(x). This concept is the basis for expressing some advanced ray tracing algorithms as extended stream computations (see Section 5.2.4 for details).

Errors in Numerical Algorithms

There is one caveat with this method, however. Numerical errors can be roughly grouped into three categories: input data errors (not relevant to this discussion), roundoff errors and
method errors. When the integration is performed with quasi-random (repeatable) values and in a sequential way, errors from all three categories will be identical in each subsequent run of the algorithm, and therefore the results will be identical down to the last bit. A parallel algorithm, on the other hand, causes some issues. First, when the cache is not used, the ratios in the output stream can be in a different order in each run. Since typically (a + b) + c ≠ a + (b + c) in computer arithmetic, the results are then equivalent only up to roundoff errors. If the results are to be compared, bit-for-bit comparison of outputs no longer works. Moreover, the cache makes things even worse. In a parallel algorithm, the content of the cache can be different in each run at the moment when, say, the i-th element of the input stream is processed. Therefore the results of any two runs of a parallel algorithm with a cache typically differ by roundoff and method errors, even if quasi-random numbers are used, which makes comparison of results even more difficult. However, if the algorithm is mathematically correct, the results inevitably converge to the true result (up to roundoff errors), independently of the cache content, as the number of stream elements increases without bound.

5.2 Parallel Ray Tracing

Running ray tracing in parallel seems easy at first, but in fact there are some issues to solve in order to make this approach efficient. This section starts with a description of ray tracing initialization and the trade-offs offered by different ray intersection acceleration structures. Next, techniques for managing a frame buffer during parallel rendering are presented, followed by a description of streaming multipass rendering. Finally, selected ray tracing algorithms are targeted at the extended stream machine, and the efficiency of the design is evaluated in a multi-threaded environment.

5.2.1 Algorithm Initialization and Scene Description

A major task to be performed during ray tracing initialization is building the ray intersection acceleration structure. Other work, e.g. loading textures, typically consumes much less time. It is not uncommon for this time to reach a few minutes for large scenes with millions of primitives. If parallel rendering is used to obtain interactive frame rates, such an initialization cost is unacceptable. Some research has been carried out on this topic. Wald et al. [Wald et al., 2006] proposed using grids instead of kd-trees as an acceleration structure for dynamic scenes. However, if the initialization cost is unimportant, kd-trees are the structures with the best performance [Havran, 2001, Popov et al., 2007]. Popov et al. [Popov et al., 2006] also investigated a streaming kd-tree building approach. An algorithm for parallel kd-tree construction has also been proposed [Shevtsov et al., 2007], which seems to fit best into already parallelized ray tracing.

The cache is particularly well suited for lazy kd-tree building. Typically, in a streaming ray tracer, the scene description is static and is prepared before the actual stream computations. In the presented design, however, the scene description can simply be read from a file, with additional processing postponed and performed lazily, placing results into the cache. Lazy kd-tree building does not seem to be a big win over parallel kd-tree construction, though.
We suspect, although we have not verified this, that the lazy approach can offer more substantial performance gains if a scene contains a lot of invisible and unlit geometry, which is not the case in our tests.

5.2.2 Frame Buffer as an Output Stream

The output streams of the presented rendering machine contain image samples randomly scattered over the image plane. Samples from such a stream can either be immediately converted to a raster image, or stored for further processing. The first option is used by the GPU based interactive previewer presented in Section 5.4. This approach is effective from a memory usage point of view, but somewhat limits image post-processing capabilities. Alternatively, if all samples could be stored individually, for example in a file in external memory, then the image could be formed by sophisticated external tools. This approach allows, for example, adaptive or non-linear filtering in order to remove noise [Rushmeier & Ward, 1994, Suykens & Willems, 2000]. Animations
could be generated in a frameless mode [Bishop et al., 1994, Dayal et al., 2005], which could be more efficient. We leave these aspects for further research.

The previewer can generate onscreen images, as well as offscreen ones, in arbitrary resolution. The processing power of the GPU is enough to justify transferring the stream of samples to it, and the final image back to main memory and then to a file in external storage. The only limitation is the GPU memory size, which in inexpensive consumer level hardware is currently (2010) typically 512MB or 1GB. The memory size limits the maximum offscreen buffer size to roughly 10M or 20M pixels.

The output stream needs synchronization, since it is written by the rendering stream machine and read by the interactive previewer. Double buffering is a necessary and sufficient solution to avoid rendering stalls when output stream data is sent to the GPU previewer. The previewer can be a bottleneck only when rendering very simple scenes with top CPUs (e.g. Intel Core i7). The problem is, however, not stream synchronization, but the insufficient speed of input processing by the GPU. These issues are examined in detail in Section 5.4.

5.2.3 Multipass Rendering

Multipass rendering cannot be executed in just one run of a stream machine. Examples of algorithms for which multipass rendering is necessary include, among others, the initialization of Metropolis Light Transport, and the photon map building step in the original Photon Mapping algorithm. The presented extended stream machine design provides this feature by splitting rendering into two functions, preprocess and render, which have to be implemented as the definition of a rendering algorithm. The preprocess function runs before a stream rendering pass. It can be performed sequentially or, if necessary, manually parallelized. After preprocessing, the stream of image samples is evaluated in parallel. Preprocess signals whether another pass is necessary; in that case, the number of necessary samples is defined by the preprocess routine, and these samples are then evaluated in a streaming way. The output streams are ignored, and the only means of communication is the extended stream machine cache. When stream evaluation is finished, preprocess is executed again. Only the final rendering pass can be run indefinitely, without prior specification of the required number of samples. However, if an interactive preview of partially rendered results is required, preprocessing must be performed quickly, since it increases the latency between the start of rendering and the availability of an initial result. The sequence of rendering procedures is presented below:

Algorithm 5.2: Single- and multipass rendering on an extended stream machine
{Initial multipass rendering passes.}
repeat
  MultipassSamples <- preprocess()
  if MultipassSamples > 0 then
    render(MultipassSamples) {Rendering without output streams.}
    {At this point all cache writes are guaranteed to be finalized.}
  end if
until MultipassSamples = 0
{Final rendering pass.}
render(until aborted) {Rendering with output streams.}
cleanup()

5.2.4 Ray Tracing on an Extended Stream Machine

Expressing ray tracing as a stream computation gives a lot of freedom in design. Purcell [Purcell, 2004] proposed splitting a full application into several small kernels, like ray-triangle intersection,
grid traversal, ray scattering and generation of new rays, and so on. This design choice is well suited for graphics chips with limited programmability. We argue that it is possible, and justified performance-wise, to express the whole ray tracing algorithm with just one kernel. When the extended stream machine is used, Photon Mapping is suitable for this design, too. Unfortunately, this particular approach limits the choice of plausible hardware platforms for implementing such a stream machine. In fact, only general purpose CPUs are suitable; even OpenGL 4.0/DirectX 11 class GPUs are not programmable enough. This issue is discussed in detail in Section 5.3.

The single kernel design is very simple yet extremely efficient. The input stream contains just sample numbers, implemented as a continuous range of integers. In other words, a task description reads: evaluate samples from s_1 to s_2. All the work is performed in the kernel. The work consists of converting sample numbers into quasi-Monte Carlo sequences, generating and tracing rays, and finally returning radiance values along the traced rays. The output side uses two streams and is a bit more complex, but not excessively so. If an algorithm traces rays from the camera towards the scene only, each input number corresponds to one camera ray and one radiance value along it, which is written into the first output stream. If an algorithm traces rays from the lights towards the camera, each input number corresponds to zero or more camera rays, with radiance values written into the second output stream. Additionally, it is necessary to store the image plane location for each calculated radiance value. Due to the mathematics underlying light transport algorithms (discussed in Section 4.4), the contents of the two streams cannot be mixed. Sample density in the first stream does not affect the local brightness of an image (it affects only the precision of the output), while in the second stream it does. In fact, the data from each stream is used to form an individual image, with a slightly different algorithm, and the final image is the sum of these two. Note that the final image formation is performed after the stream ray tracing, but its impact on performance is minimal, since its computational cost is substantially smaller.

Path Tracing and Bidirectional Path Tracing

These algorithms are ideal candidates for parallelization, and therefore near linear speedups can be expected. All the samples generated by them are independent, so the only potential bottleneck is algorithm initialization, which includes the costly construction of the acceleration structure. The Path Tracing algorithm uses only the first output stream, so it is even better suited for parallelization. If real time rendering is to be achieved through massive parallelization, a simplified version of Path Tracing (see Section 4.4.1) seems to be the algorithm of choice. The parallelization efficiency and speedup of these algorithms are discussed in Section 5.2.5.

Photon Mapping and Combined Approach

The one pass variant of Photon Mapping starts with an empty cache. When the initial image samples are to be generated, the algorithm finds no photons in the photon map, and switches to emitting and tracing a batch of photons and storing them in the map. The minimum required number of photons is expressed as a linear function f(n) = an + b of the number n of evaluated samples, where a and b are adjustable constants. For performance reasons, somewhat more photons than required by the function f are generated.
Eventually, the map contains enough photons for rendering image samples in the range n_1 to n_2. When a sample n_i > n_2 is to be evaluated, the photon map is filled with additional photons. The algorithm run time, and therefore image quality, is limited only by the machine memory size. The combined approach is implemented in a similar way, with the difference that not all image samples are evaluated using the photon map. In some cases, when the predicted error of all bidirectional samples never exceeds a prespecified threshold, the photon map is never filled and remains empty through the entire algorithm run.
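A sketch in C++ of the photon map growth policy described above (the names and the safety margin are illustrative assumptions):

    #include <cstdint>

    // Adjustable constants of the linear growth function f(n) = a*n + b.
    struct GrowthPolicy {
        double a = 0.5;             // photons per image sample (assumed value)
        std::uint64_t b = 1024;     // initial photon count (assumed value)
        double margin = 1.25;       // emit somewhat more than strictly required
    };

    // Number of photons which must be present in the map before sample n
    // may be evaluated; the map is grown whenever the current count is lower.
    std::uint64_t requiredPhotons(const GrowthPolicy& p, std::uint64_t n) {
        const double f = p.a * static_cast<double>(n) + static_cast<double>(p.b);
        return static_cast<std::uint64_t>(p.margin * f);
    }

A rendering thread evaluating sample n would compare requiredPhotons(policy, n) with the current map size under the read lock, and trigger photon emission under the write lock when the map turns out to be too small.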

Figure 5.3: Test scenes for parallel rendering. Left image: scene a) with roughly a million primitives. Right image: scene b) with a few primitives.

Figure 5.4: Parallel Path Tracing, Bidirectional Path Tracing, and Photon Mapping scaling: speedup versus number of threads, with the ideal speedup shown for reference. Left image: results for scene a). Right image: results for scene b).

5.2.5 Results and Conclusions

For test purposes, a machine with an Intel Core i7 CPU was used. The Core i7 is a quad core processor capable of running eight threads simultaneously thanks to SMT technology (each core can run up to two independent threads). The two test scenes are presented in Figure 5.3. Scene a) is complex, made of about a million primitives. Scene b) is much simpler, built from a few primitives which occupy just one node of the kd-tree. Both scenes are rendered with Path Tracing, Bidirectional Path Tracing and one pass Photon Mapping (Figure 5.4), forcing the stream machine to run on various numbers of threads.

The streaming approach to ray tracing is a very convenient way of expressing the algorithms, and it exhibits their potential for parallelization. Our approach differs from previous ones mainly because we have assumed that a general purpose processor is available, and that just one kernel is enough for the whole ray tracing. The two substantial benefits of streaming are high scalability
and memory latency hiding. The scalability is likely to be crucial in the near future, when CPUs with more cores are introduced. In our case, memory latency hiding is not the result of data locality. In fact, our input stream is virtually empty (the integers representing ranges of samples to evaluate consume a few kilobytes at most), while the read-only memory containing the scene description typically occupies hundreds of megabytes or even gigabytes. Access patterns to the scene description depend on the ray generation algorithms. The selected algorithms prefer to scatter rays into the scene as much as possible, since this improves the statistical properties of the generated samples. That is, mixed and scattered rays provide much more information than coherent ones. Unfortunately, this approach is not especially cache friendly: the presented implementation reads data in chunks of several bytes from seemingly random locations. However, when CPUs provide technologies like SMT (Simultaneous Multi-Threading), memory latencies are not an issue. When one thread is stalled waiting for data, the processor core can continue executing another thread. Creating many additional threads is very easy with the employed streaming model. Therefore, only memory bandwidth can become a potential bottleneck.

Memory latency hiding is clearly visible in the parallel rendering efficiency of scene a). The total speedup obtained for Path Tracing and Bidirectional Path Tracing is roughly 5.5 on the quad core CPU. The speedup using four threads is far from ideal though, being roughly 3.2 instead of the expected 4.0 (Intel Turbo Boost is insignificant here, increasing the CPU frequency by approx. 5% when only one core is used). This seems to be a hardware issue: independent cores compete for access to the shared L3 cache and the memory controller. Both above mentioned algorithms use only read-only data structures with no synchronization, so the obtained suboptimal speedup is not a software problem. To further support this claim, we present the different scene b), containing exactly seven non-textured primitives. Bidirectional Path Tracing produces two streams of output data, while Path Tracing produces just one. In fact, BDPT generates roughly 4-5 times more data, measured in bytes, than PT over the same time period. When rendering scene b), the necessary transfer of ray traced samples to the GPU (see Section 5.4) is the bottleneck for BDPT. In the case of scene a), both PT and BDPT exhibit almost identical speedups, so the transfer of samples to the GPU does not affect them.

Completely different results are obtained with the Photon Mapping algorithm; in the tests, our one pass variant of the method was used. The photon map is a read-write structure, so it must be properly locked, which limits scalability. The most efficient synchronization solution appears to be the simplest fair readers-writer lock without any extra features; the efficiency gain from more elaborate schemes is questionable, since their parameter tuning is highly scene dependent. In fact, using more cores allows rendering more complex scenes in the same time, rather than rendering the same scene in a shorter time. This is clearly visible for scene b), where the maximum obtained speedup is roughly 1.4 for two threads; adding more threads only makes matters worse. The one pass Photon Mapping can be run efficiently in parallel for real world scenes, though.
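For illustration, a fair readers-writer lock of the kind described in Section 5.1.2 can be sketched in C++ as follows (a ticket based variant under our own naming; not necessarily the exact lock used in the tests):

    #include <condition_variable>
    #include <mutex>

    // Fair readers-writer lock: arrivals are served in ticket order, so a
    // waiting writer blocks readers arriving after it (no starvation either way).
    class FairRWLock {
        std::mutex m;
        std::condition_variable cv;
        unsigned long next = 0, serving = 0;  // one ticket queue for both roles
        int activeReaders = 0;
        bool activeWriter = false;
    public:
        void lockRead() {
            std::unique_lock<std::mutex> lk(m);
            const unsigned long ticket = next++;
            cv.wait(lk, [&]{ return ticket == serving && !activeWriter; });
            ++serving;              // admit the next queued thread
            ++activeReaders;        // readers admitted in order may overlap
            cv.notify_all();
        }
        void unlockRead() {
            std::lock_guard<std::mutex> lk(m);
            if (--activeReaders == 0) cv.notify_all();
        }
        void lockWrite() {
            std::unique_lock<std::mutex> lk(m);
            const unsigned long ticket = next++;
            cv.wait(lk, [&]{ return ticket == serving && !activeWriter
                                    && activeReaders == 0; });
            ++serving;
            activeWriter = true;
        }
        void unlockWrite() {
            std::lock_guard<std::mutex> lk(m);
            activeWriter = false;
            cv.notify_all();
        }
    };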
The speedup of the combined algorithm can be anywhere between that of Bidirectional Path Tracing and that of Photon Mapping, depending on the rendered scene, due to the fact that it chooses one of these methods depending on the currently sampled light transport path.

During the tests, the Intel Core i7 platform showed very good rendering times compared to the previous Intel architecture, whose memory controller was separated from the CPU by an inefficient FSB (Front Side Bus), yet it did not provide the expected speedups in parallel rendering. We suspect that the L2/L3 cache or memory throughput is too low. Moreover, it would be interesting to see how a CPU would perform in which any core can process any thread, a solution known from GPUs. In the current Core i7 architecture, each core can process one of just two threads, and if both of these threads are waiting for data, the core is stalled.

5.3 Choice of Optimal Hardware

The choice of appropriate hardware for high performance rendering is crucial, especially when parallel implementations are necessary for achieving reasonable computation times, since it affects the design and performance of the application. Currently, single processor sequential machines
do not have enough computational power for full global illumination. This will probably remain true in the future, because modern advancements in microprocessor technology pay substantially more attention to placing more computational cores in a single machine than to producing ultra fast single processor cores. The rest of this section starts with a comparison of two models of parallel programming: shared memory on multiprocessor machines, and explicit message passing on clusters of individual machines. Next, there is a comparison of multiprocessor machines and GPUs, both of which are programmed in a shared memory style. Finally, there is a justification of our choice of optimal hardware for the implementation of an extended stream machine.

5.3.1 Shared Memory vs. Clusters of Individual Machines

From the programmer's perspective, there are two popular basic parallel techniques: shared memory and message passing. The shared memory model is used on multiprocessor machines, whereas message passing is used in distributed environments. Shared memory tends to be faster, but lacks scalability. On the other hand, explicit message passing allows massive parallelization, at the cost of delays introduced even when only a few machines are used. In ray tracing, neither technique provides additional available memory. In the shared memory model this fact is obvious. Ray tracing implemented by means of message passing in a cluster environment requires data replication on each machine to avoid substantial performance degradation.

With the recent development of multi core processors, shared memory gains an advantage. Today, twelve processor cores in a single workstation are not uncommon. The recent introduction of six core Intel Xeon (Dunnington) CPUs together with four socket boards results in 24 core machines. Unfortunately, machines with more than two CPU sockets on a motherboard are substantially more expensive, while two socket machines roughly double the cost over single socket ones. Considering the price-to-performance ratio, two processor machines with twelve cores in total currently seem to be optimal for ray tracing.

5.3.2 Multiprocessor Machines vs. Graphics Processors

Contemporary rendering algorithms are written for either CPUs or GPUs. The important differences between them are the degree of parallelism and the flexibility of the programming model. For example, as of 2010 the Nvidia architecture contains up to 480 processing cores [Nvidia Corporation, 2009b], while popular Intel CPU based servers contain two CPUs with six cores each. Despite these numbers, however, GPU based global illumination ray tracers are not significantly faster than CPU based ones. The computational power of a modern GPU can be fully utilized only in a narrow class of algorithms, due to the severe limitations imposed on GPU programs. The most significant are, among others, severe performance penalties for data dependent branching, the lack of virtual function calls, the lack of memory management techniques, and inefficient double precision processing, which suffers a performance drop of about 8x, not to mention the very limited amount of available memory. Ghuloum [Ghuloum, 2007] described the majority of these cases. This is obviously a major handicap for any GPU based ray tracer, limiting its features drastically. The direct comparison of the computational power of GPUs and CPUs presented in [Nvidia Corporation, 2009a] is not trustworthy, because of these restrictions and because a powerful GPU also needs a powerful CPU to run the driver.
There are technologies for installing multiple graphics cards in a single machine: ATI CrossFire allows joining up to four units, and Nvidia SLI up to three. Both technologies offer a questionable price-to-performance ratio, however. First, despite the fact that all GPUs are placed in a single machine, the shared memory model does not work; that is, a GPU cannot directly access the memory of another GPU. In fact, memory management is not exposed and is performed indirectly by the graphics driver, which forces data replication between individual units. Moreover, to make things worse, two physical GPUs are not seen as one logical unit with a doubled number of cores and the same amount of memory. That is, software which is well suited for a standard GPU may not work at all in SLI or CrossFire mode. This is not an uncommon issue, and there exist a few computer
games which cannot take advantage of these technologies. Since multisocket motherboards do not exhibit such design flaws, and because the low level details of graphics cards are not published, we have never seriously considered SLI and CrossFire as technologies well suited for implementing more powerful stream machines.

In the author's opinion, the term GPU (Graphics Processing Unit) is seriously misleading. In fact, there are many graphics algorithms which cannot be run efficiently on any contemporary GPU. Unfortunately, GPGPU is not general purpose enough for a flexible and efficient implementation of ray tracing. The major handicap is the architecture of GPUs. The 480 core Nvidia Fermi consists of only 15 truly independent cores, each of them capable of processing 32 threads. The caveat is that all of the 32 threads must execute the same instruction at the same time. If some threads take the if branch, and others choose the else branch, these branches are executed sequentially, forcing part of the threads to stall. It is important to realize which class of algorithms can be run effectively on GPUs. For example, matrix multiplication or the FFT are good candidates, but sorting, tree building and tree search are not. The two latter operations are the major time consuming processes in ray tracing, and therefore work has been dedicated to avoiding these difficulties. Purcell [Purcell, 2004] does not use a tree data structure at all, using a less flexible grid as the ray intersection accelerator, and Popov et al. [Popov et al., 2007] modify tree searching into a stackless algorithm, at the cost of a large memory overhead. It is not certain whether GPU performance benefits can outweigh the penalties of using suboptimal algorithms just to make ray tracing compatible with contemporary GPUs. The presented implementation does not try to achieve this compatibility, focusing solely on optimization for Intel CPU architectures. However, since modern GPUs process regular data structures very efficiently, they are very useful for the visualization of ray tracing results. This topic is investigated in detail in Section 5.4.

5.3.3 Future-proof Choice

Today the future of rendering is unclear. Recently, GPGPU programs have become significantly more popular. Nvidia designed a new API for this purpose, called CUDA [Nvidia Corporation, 2009a]. This API is much better suited to non-graphics computations than OpenGL or DirectX shaders, but is still not as general purpose as traditional CPU programming tools. GPGPU obviously has a lot of potential, see for example [Harris, 2005], but CPU vendors also improve their products significantly. There are also ideas of producing a processor from a massive number of x86 cores [Seiler et al., 2008]. Moreover, a chip based on FPGA technology, dedicated to ray tracing operations, has even been proposed [Woop et al., 2005]. Therefore, targeting a rendering application at a hardware platform whose peculiarities require serious algorithmic restrictions, and which is likely to become obsolete in a few years' time, is not a good idea. Instead, we have expressed the ray tracing computations for a virtual stream machine model, and have implemented this machine on the best fully programmable platform currently available. Today, the best platform seems to be one of the Intel workstations, but if the situation changes in the near future, implementing the stream machine on a more flexible, future version of today's immature GPGPU should not be very difficult.
Perhaps a new revision of our rendering software will be rewritten in the Intel Ct language, which is currently under development.

5.4 Interactive Visualization of Ray Tracing Results

Soon after the appearance of the first programmable DirectX 9 class graphics processors, there were attempts to use them for ray tracing [Purcell et al., 2002]. Nowadays, the vast majority of real time global illumination algorithms are based on the computational power of modern GPUs, e.g. [McGuire & Luebke, 2009, Wang et al., 2009]. Unfortunately, they still impose restrictions, often quite severe ones, on scene content (a limited range of material and geometry representations), scene size, and the illumination phenomena which can be captured.

True, unrestricted global illumination algorithms, which solve the Rendering Equation [Kajiya, 1986], are not well suited to OpenGL 4.0 class GPUs. Such an implementation is possible, as has been shown numerous times, but it is severely restricted compared with classic multicore CPU solutions, because GPUs cannot process irregular data structures effectively [Ghuloum, 2007]. However, this is not the only way to obtain interactivity: nowadays multi-CPU workstations can perform interactive ray tracing [Pohl, 2009], yet true global illumination is still unachievable this way. Interactivity can also be obtained using clusters of machines with CPU rendering [Benthin, 2006]. The approach presented here, on the other hand, is substantially different from those above: placing absolutely no restrictions on scene and illumination effects, it uses a GPU based visualization client just to display and postprocess an image made from CPU ray traced point samples, at a resolution dynamically adjusted for real time performance. Our renderer, designed for the flexibility of CPUs, is based on significantly modified Bidirectional Path Tracing and Photon Mapping with a quasi-Monte Carlo (QMC) approach (see Chapters 3 and 4). Such traditionally CPU based algorithms are very difficult to port to GPUs, and when, despite all the problems, they eventually are ported, the performance benefits of GPUs over multicore CPUs are questionable [Ghuloum, 2007].

Pure ray tracing algorithms are based on point sampling of scene primitives, not using scan line rasterization at all. This gives much freedom in how the samples are chosen; however, QMC ray tracing algorithms produce a huge number of samples which do not fit into a raster RGB grid. Converting these data to a 3x8 bit integer RGB image at interactive frame rates may be impossible even for multi-core CPUs, especially when the image resolution has to be adjusted dynamically to the server rendering speed and scene complexity, with some non-trivial post-processing added. As we will show, the conversion of ray tracing output to a displayable image, as well as many post-processing effects, can be expressed purely by rasterization operations, in which GPUs excel. The main idea behind the presented approach is therefore to use the most suitable processor for a given algorithm, instead of porting everything to GPUs.

The rest of this section starts with a characterization of the rendering server output which is required for compatibility with the presented visualization client. Then, there are detailed descriptions of a wrapper for the rendering server and of the algorithms used in the visualization client. Finally, the obtained results are discussed.

5.4.1 Required Server Output

In general the server may run any point sampling algorithm, but in this project we rely on QMC ray tracing. The visualization client assumes a specific format of the server's output. The following subsections describe in detail the conditions which must be fulfilled to make the client work properly.

Color Model

With further processing in mind, it may be useful to output full spectral images (see Section 4.3 for a detailed description of full spectral rendering). However, a full spectral representation requires a huge amount of memory. For example, a full HD spectral image in 16 bit floating point precision, with 3nm wavelength sampling from 400nm to 700nm, needs as much as 1920 x 1080 x 101 x 2B ≈ 400MB, while an RGB one requires 1920 x 1080 x 3 x 2B ≈ 12MB.
The standard CIE XYZ space [Fraser et al., 2005] seems to be the best option instead, since an RGB space, which depends on particular display hardware, is not a plausible choice. For this reason the visualization client accepts CIE XYZ color samples. The rendering server natively generates full spectral data, and a server wrapper internally converts it from the full spectrum to the three component color space.
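A sketch in C++ of the conversion performed by the server wrapper (the tabulated CIE color matching functions and the sampling layout are assumptions for illustration):

    #include <array>
    #include <cstddef>

    // Hypothetical tables of the CIE 1931 color matching functions, sampled
    // at the same 101 wavelengths (400nm to 700nm, 3nm step) as the spectra.
    extern const std::array<float, 101> CIE_X, CIE_Y, CIE_Z;

    struct XYZ { float x, y, z; };

    // Riemann sum approximation of X = integral of S(lambda) * xbar(lambda),
    // and analogously for Y and Z; normalization constants are omitted.
    XYZ spectrumToXYZ(const std::array<float, 101>& spectrum) {
        const float step = 3.0f;   // nm between samples
        XYZ out{0.0f, 0.0f, 0.0f};
        for (std::size_t i = 0; i < spectrum.size(); ++i) {
            out.x += spectrum[i] * CIE_X[i] * step;
            out.y += spectrum[i] * CIE_Y[i] * step;
            out.z += spectrum[i] * CIE_Z[i] * step;
        }
        return out;
    }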

Format of Output Streams

The most advanced ray tracing algorithms trace rays in both directions: from the camera towards the lights (camera rays), and in the opposite direction (light rays). Such approaches produce two kinds of samples, which must be processed differently in order to produce displayable images [Veach, 1997]. The client accepts two input streams. The format of samples is identical in both streams: ([u, v], [x, y, z, w]), where [u, v] are screen space coordinates in the [0, 1]^2 range (or, perhaps, with slight overscan to avoid post-process filtering edge artifacts), [x, y, z] is the sample color value in the CIE standard, and w is the sample weight. The two streams differ only in the interpretation of sample density. The pixels of the image formed from camera rays are evaluated by averaging local samples using any suitable filter: the sum of weighted samples is divided by the sum of weights. The pixels of the light image, on the other hand, are formed using a suitable density estimation technique: samples are filtered and summed, but not divided by the sum of weights. Therefore, sample density affects only the quality of the camera image, while it affects both the quality and the brightness of the light image. The final, displayable image is the sum of the camera and light images, the latter divided by the number of traced light paths. Unfortunately, the samples of the light image can potentially be scattered very nonuniformly over screen space. This, however, is not an issue when the sample density is roughly proportional to image brightness. Obviously, not all ray tracing algorithms need both the camera and the light output stream. For example, Path Tracing [Kajiya, 1986] and Photon Mapping [Jensen, 2001] produce camera samples only, while Particle Tracing [Arvo & Kirk, 1990] needs only the light image. Therefore, the visualization client employs an obvious optimization: it skips processing of a stream into which no samples were generated.

Stream Uniformity

The server should provide a stream of color values scattered uniformly at random locations in screen space. The uniformity of sampling ensures acceptable image quality even at the low sampling rates which are typical due to the high computational cost of ray tracing. Additionally, the output stream should be generated roughly uniformly in time; otherwise the client might fail to maintain interactive refresh rates. The original, two pass variant of Photon Mapping is therefore unsuitable for interactive visualization. This is the main motivation for the development of the one pass version of this technique, described in detail in Section 4.4. Strictly speaking, the new approach does not generate batches of samples in exactly uniform time. Due to the computational complexity of kd-tree lookups, together with the linear dependence between the number of photons in the kd-tree and the number of samples computed, the average time to calculate the n-th sample is of the order of O(log n). The logarithm, however, grows slowly, which is acceptable for the client.

Coherent vs. Non-Coherent Rays

For some time now it has often been claimed that it is beneficial to trace camera rays in a coherent way, because this can significantly accelerate rendering [Wald et al., 2001, Benthin, 2006]. This is true, but only for primary rays (sent directly from the camera or a light source). Unfortunately, rays which are scattered through the scene do not follow any coherent pattern, and caching does not help much.
Since true global illumination algorithms typically trace paths of several rays, these algorithms do not benefit much from coherent ray tracing. What is more, coherent ray tracing tends to provide new image data in tiles, which makes progressive improvement of image quality difficult. We have therefore chosen to spread even the primary rays as evenly as possible, using a carefully designed Niederreiter-Xing QMC sequence [Niederreiter & Xing, 1996] as the source of pseudorandom numbers. Therefore, it can
be expected that even very few traced rays provide a reasonable estimate of the color of the entire image, and that subsequently traced rays improve image quality evenly.

5.4.2 Client and Server Algorithms

The GPU task, finally, is the conversion of point samples into a raster image. The conversion is done at a resolution dynamically adjusted to the number and variance of the point samples. During image formation, a color conversion from XYZ to the RGB space of the current monitor is performed, together with gamut mapping, tone mapping, gamma correction and other post-processing effects. As the target platform we have chosen a GPU compatible with OpenGL 3.2 [Segal & Akeley, 2010, Shreiner, 2009] and GLSL 1.5 [Kessenich et al., 2010, Rost & Licea-Kane, 2009]. The major part of the algorithm is coded as a GLSL shader, which suits our needs very well. Recent technologies such as Nvidia CUDA, ATI Stream, or OpenCL [Munshi, 2009] are not necessary for this kind of algorithm.

The rendering task is split into two processes (or threads in one process, if a single application is used as both client and server) running in parallel: a server wrapper process and a visualization process. The rendering process may be further split into independent threads if multicore CPUs or multi-CPU machines are used.

Server Wrapper Process

Ray tracing can produce a virtually unlimited number of samples, limited only theoretically by the machine numerical precision (our implementation can generate as many as 2^64 samples before sample locations eventually start to overlap). Therefore, the ray tracing process is reset only immediately after user input which modifies the scene. Otherwise, it runs indefinitely, progressively improving image quality. The server wrapper runs on a separate thread, processing commands. The wrapper recognizes four commands: term, abort, render and sync. The term command causes the wrapper to exit its command loop, and is used to terminate the application. The abort command aborts the current rendering, and is used to reset the server after new user input (for example, a camera position change). The render command orders the server to perform rendering. Rendering is aborted when either an abort or a term command is issued; the maximum time to abort rendering is the time necessary to generate just one sample. Any algorithm capable of generating the specified output (see Section 5.4.1) can be used. In our server implementation, rendering is performed in parallel on multicore CPUs.

The wrapper allows registering asynchronous finish and sync events. The finish event is generated when rendering is finished (either the prespecified number of samples was generated, or an abort was issued). The sync command, when executed, triggers a sync event, passing to it any data specified in the sync message. When a sync event is triggered, all commands sent before the sync command are guaranteed to have finished. These events can be used to synchronize the visualization client with the rendering server. Apart from sending asynchronous messages, the wrapper can be queried synchronously for already rendered samples. Since this query just copies the data to the provided buffer, blocking the server for the necessary synchronization takes little time.

Client Process

The client is responsible for visualizing the samples generated by the server; additionally, it processes GUI window system messages. The client stores its internal data in two screen-aligned, two layer texture arrays, in the IEEE 32 bit floating point format.
Client Process

The client is responsible for visualizing the samples generated by the server; additionally, it processes GUI window system messages. The client stores its internal data in two screen-aligned, two-layer texture arrays, in IEEE 32-bit floating point format. A 4-channel [X, Y, Z, W] texture and a two-component variance [Var, W] texture are stored, with one layer each for the camera and light input streams. Therefore, the client stores 48 bytes of data per screen pixel, apart from the standard integer front and back buffers. The details of the client main loop are presented in Figure 5.5.

Figure 5.5: Main loop of the visualization client process (start → process input → rasterize samples → repaint back buffer → swap buffers with vsync → get new samples → quit).

When all GUI messages are processed, the client rasterizes new samples, generated by the server, into its internal textures. This task is performed using the render-to-texture feature of a Framebuffer Object (FBO). The client uses an almost empty vertex program, which only passes data through. The geometry program is equivalent to the fixed-functionality rendering of textured point sprites; additionally, it routes input samples to the appropriate texture layer. The input is a stream of the following elements: a two-component screen position (u, v), a four-component color (x, y, z, w) and a flag, which encodes whether the sample belongs to the camera or the light stream. The input is placed in a Vertex Buffer Object (VBO), and is then rendered with a GL point drawing command. Points are rendered with the blending mode set to perform addition, ensuring that all samples add up instead of overwriting previous texture content. An additional input is a monochromatic HDR filter texture, used to draw the point sprites. The texture is normalized (all the texel values add up to one) and the texture border value is set to zero. The filter texture is applied without rescaling and with bilinear filtering, thus preserving filter normalization, which is crucial for algorithm correctness. We have found that a 5x5 texel windowed Gaussian blur gives good results.

The rendering is performed in two passes. First, the color texture array is updated. In the second pass, using the already up-to-date color texture, the variance texture array is updated. In both passes, the same samples are rendered. The variance is updated using the following formula:

V_j = V_{j−1} + Σ_i (Y_i − Ȳ_j)²,   (5.1)

for the jth batch of samples, where the sum runs over the samples i of the batch. The formula does not give the best theoretically possible results, since the mean Ȳ is approximated using only the already evaluated samples. The alternative formula:

V_j = S_j − Ȳ_j²,   S_j = S_{j−1} + Σ_i Y_i²,   (5.2)

which requires storing the sum of squares S instead of the variance, should be avoided due to poor numerical stability (even negative variance results are possible). In both formulas the division by the n − 1 factor, where n is the total number of samples in a given stream, is omitted. This division is performed when the variance data is read from its texture.
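The numerical difference between the two update rules can be seen in a small C++ sketch (ours, for illustration only):

    #include <vector>

    struct Stats { double sum = 0, sumSq = 0, devSq = 0; long n = 0; };

    // Accumulate one batch of luminance samples.
    void addBatch(Stats& s, const std::vector<double>& batch) {
        for (double y : batch) { s.sum += y; ++s.n; }
        double mean = s.sum / s.n;              // mean over samples seen so far
        for (double y : batch) {
            s.devSq += (y - mean) * (y - mean); // style of (5.1): stable
            s.sumSq += y * y;                   // style of (5.2)
        }
    }

    double varStable(const Stats& s) { return s.devSq / (s.n - 1); }

    // Style of (5.2): a difference of two large, nearly equal numbers;
    // in floating point this can even become negative.
    double varUnstable(const Stats& s) {
        double mean = s.sum / s.n;
        return (s.sumSq - s.n * mean * mean) / (s.n - 1);
    }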

The details of rasterizing new samples are presented in Algorithm 5.3.

Algorithm 5.3: Rasterization of point samples by the visualization client
1. The content of the client sample buffer (triples [u, v], [x, y, z, w], flag) is loaded into a VBO; there is one buffer for both streams — camera and light samples are distinguished by the flag.
2. The monochromatic float texture with the filter image is selected and a point draw command is issued; the texture is used as a sprite image for the emulated point sprites.
3. The geometry program routes the samples to the appropriate texture layer.
4. The fragment program multiplies the color attribute [X, Y, Z, W] by the texture value.
5. After rasterization, the color texture array is detached from the FBO and a GPU MIP-map build command is issued.
6. Texture LoDs (used by the repaint back buffer processing) for both streams are evaluated as LoD_i = log₄(P/S_i), where P is the number of pixels on the screen and S_i is the number of samples from the ith stream computed so far.
7. A second draw is issued, this time with the variance textures as output. The variance is evaluated only for the luminance (Y) component, since a three-component variance typically does not help much and substantially complicates the algorithm. The variance output for each stream is (Y_avg − Y)², where Y_avg is read from the previously generated color texture, and Y is the luminance of the currently processed sample, multiplied by the filter texture.
8. Similarly to the color textures, the variance texture array is detached from the FBO and a GPU MIP-map build command is issued.

In order to repaint the back buffer, the client draws a screen-sized quad, using the four textures as input. The screen is filled by a custom fragment program. The program accepts the following control parameters: level of detail (LoD) for both streams, light image weight (Lw), image brightness (B), contrast (C), gamma (G), color profile matrix (P), and variance masking strength (Vm). The level of detail is already evaluated during rasterization; now the LoD values are used by the fragment program to blur the texture data if not enough samples have been computed. The light image weight is obtained from the server along with the samples, and its value is equal to the number of paths traced from the light sources. This parameter is used to scale the light image texture appropriately, such that it can be summed with the camera image texture. Image brightness, contrast, gamma and color profile are set by the user, and their values adjust the image appearance. Additionally, the visualization client is able to add a glare effect (see Figure 5.6) as an extra post-process, implemented as a convolution with an HDR glare texture generated according to [Spencer et al., 1995]. However, sufficiently large glare filters are far beyond the computational power of contemporary GPUs at real-time screen refresh rates. Since these parameters are defined only for the client, and do not affect server rendering at all, their values can be modified freely without resetting the server rendering process.

The variance of samples is estimated only for luminance (the CIE Y channel), using the standard variance estimator V ≈ 1/(N−1) Σ_i (E(Y) − Y_i)², where N is the number of samples, Y_i are luminance values, and E(Y) is the luminance value estimated from the samples computed so far. The client is able to increase blurriness according to local changes in the estimated variance, hence slightly masking the noise produced by stochastic ray tracing. The noise to blurriness ratio can be controlled by the Vm parameter.
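For illustration, the two LoD rules — the initial LoD from Algorithm 5.3 and the variance-based adjustment used later in Algorithm 5.4 — can be written as follows (a hypothetical C++ sketch; function names are ours):

    #include <algorithm>
    #include <cmath>

    // Initial level of detail: LoD_i = log4(P / S_i), where P is the number
    // of screen pixels and S_i the number of samples computed so far.
    double initialLoD(double pixels, double samples) {
        return std::max(0.0, std::log(pixels / samples) / std::log(4.0));
    }

    // Variance masking: increase blurriness where the variance estimate
    // is high; Vm controls the noise to blurriness ratio. For variance
    // below one the logarithm is negative and the LoD is reduced.
    double maskedLoD(double lod, double variance, double Vm) {
        return std::max(0.0, lod + Vm * std::log(variance) / std::log(4.0));
    }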
The blurriness is created by a low-pass filter, or by bilateral filtering [Paris et al., 2008] guided by the variance estimate, which can potentially be much better at preserving image features than a simple low-pass filter. However, bilateral filtering works correctly only if the noise is less intense than the image features. When the image is heavily undersampled, this assumption may not be satisfied, and a low-pass filter remains the only viable option. For example, in Figure 5.8, the two leftmost images cannot be enhanced by bilateral filtering; on the other hand, this technique does a good job improving the quality of the middle image of the same figure. Unfortunately, the noise masking feature can hide only the random error which is the result of variance. It cannot hide (in fact, it cannot even detect) the other kind of error, resulting from bias.

Figure 5.6: Glare effect applied as a postprocess on the visualization client. The effect is not generated in real time — it took roughly 10 seconds to render the image and 1 second to visualize it on an Nvidia 9800 GT in 512x512 resolution.

The variance is the only source of error in Bidirectional Path Tracing, while the Photon Mapping error is dominated by bias. The details of back buffer repaint processing are presented in Algorithm 5.4.

Algorithm 5.4: Visualization client repaint processing
1. The program reads data from both variance maps, using the requested LoDs through hardware MIP-mapping.
2. LoDs for both streams are evaluated from the initial LoDs, the variance and Vm; for the ith stream: LoD_i ← LoD_i + Vm · log₄([Var]).
3. The [X, Y, Z, W] textures of both streams are sampled, this time using the just-evaluated LoD and a custom filtering technique (hardware MIP-mapping produces very poor results; see the discussion of MIP-mapping issues below).
4. Texture samples of both streams are normalized, i.e. [X, Y, Z, W] → [X/W, Y/W, Z/W, 1] (if W = 0, the sample is considered to be [0, 0, 0, 1]). Then the light texture sample, divided by Lw, is added to the camera texture sample, producing a single result for further processing.
5. Optionally, the glare effect is applied here. Our glare texture is generated to be applied in the XYZ color space instead of the RGB one.
6. Tone mapping of luminance (Y) is performed, using a very simple yet effective procedure: Y′ ← 1 − exp(−(B·Y)^C), while the X and Z components are scaled by the Y′/Y ratio. If Y = 0, the image is black at that point and X′Y′Z′ ← (0, 0, 0) is used.
7. The resulting X′Y′Z′ is multiplied by the matrix P, and a basic gamut mapping is performed (see Section 6.1.3). The output is now in RGB format, normalized to the [0, 1] range.
8. Gamma correction using G is performed.
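Step 6 of the algorithm corresponds to the following fragment (a hypothetical C++ sketch under our naming; the actual implementation is a GLSL shader):

    #include <cmath>

    struct XYZ { double x, y, z; };

    // Tone mapping: Y' = 1 - exp(-(B*Y)^C); X and Z are scaled by Y'/Y.
    XYZ toneMap(XYZ c, double B, double C) {
        if (c.y <= 0.0) return {0.0, 0.0, 0.0};   // black at this point
        double yMapped = 1.0 - std::exp(-std::pow(B * c.y, C));
        double s = yMapped / c.y;
        return {c.x * s, yMapped, c.z * s};
    }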

Next, the client swaps the front and back buffers, in synchronization with the screen refresh period. This guarantees a constant frame rate (typically 60 Hz for common LCDs). (Footnote: The GPU class must be properly matched to the monitor resolution; if the GPU is too weak, interactivity is not obtained. We found that the best contemporary single processor GPU — an Nvidia GTX 285, at the time of testing — is enough for a refresh rate of 30 Hz in full HD. Such an issue, however, does not slow down the server: the same number of samples is still rendered in the same amount of time; they are just displayed more rarely, in larger batches.) Finally, the client reads new samples from the server. The reading is performed with synchronization, blocking the server for a moment. However, the client does not display the samples immediately; it blocks the server only for copying this portion of data to its internal buffer for later processing.

MIP-mapping Issues

Images produced by rasterizing ray traced samples are created as screen-sized textures. Should enough samples be generated, these images could be used immediately, without any resampling. Unfortunately, contemporary CPUs are far too slow to generate at least as many samples as there are screen pixels in, say, 1/30 s, which is what real-time performance requires. Therefore, some blurring of the texture data, according to the fraction of the necessary samples actually generated and the local sample variance, has to be performed.

While MIP-mapping is reasonably good at filtering out texture details which would otherwise cause aliasing, it cannot be used reliably to blur the texture image. Blurring by means of the LoD bias parameter of the texture sampling function produces an extremely conspicuous and distracting square pattern, with severe bilinear filtering artifacts (see Figure 5.7 for details). This is not surprising, since a GPU uses a box filter to generate MIP-maps and linear interpolation between texels to evaluate the texture value at the sampled point. Moreover, MIP-mapping with polynomial reconstruction instead of a linear one fails as well; we used custom texture sampling with Catmull-Rom spline interpolation for this test. Visually good results can be obtained by using a Gaussian blur:

I(u, v) = Σ_ij T_ij g_ij(u, v) / Σ_ij g_ij(u, v),

where I(u, v) is the texture sample, (u, v) is the sample position, T_ij are texel values, and g_ij = exp(−σ d_ij²) is the filter kernel, with σ controlling blurriness and d_ij being the distance between the position (u, v) and the texel T_ij. Unfortunately, a direct implementation of the Gaussian blur requires sampling the entire texture to evaluate any single texture sample, which is far beyond the computational capabilities of contemporary GPUs. The weight of the Gaussian filter, however, quickly drops to zero with increasing distance from the evaluated sample. Truncating the filter to a fixed size window containing a limited number of samples is a commonly used practice. Simple truncation is not always optimal, since the quality of a truncated Gaussian filter depends strongly on the σ parameter — to obtain similar quality with different sigmas, an O(σ⁻¹) number of texels has to be summed. That is, if a Gaussian filter is truncated too much, it starts to resemble a box filter. In our case σ varies substantially, and therefore a more advanced technique is needed. We may notice that decreasing the resolution of the original image twice, while increasing σ four times, approximates the original filter on the original image. Eventually, the following algorithm is employed: the initial MIP-map level is set to zero, and while σ is smaller than a threshold t, σ is multiplied by four and the MIP-map level is increased by one. The threshold t and the number of summed texels have been adjusted empirically to balance blur quality and computational cost. We have found that a truncation range R of roughly 2.5 is the maximum value which ensures reasonable performance; for such truncation, setting t ≈ 1 is reasonable.
Additionally, when truncation is used, it is better to use a product of g and a smooth windowing function w instead of the original g. The window w = (1 − smoothstep(0, R, d))^E, where E controls how quickly w drops to zero with distance, works quite well; the value E = 8 yields good results. What is more, the transition between MIP-map levels is noticeable and decreases image quality. This is especially distracting if σ varies across the image, which is the case here because the blur is adjusted to the locally estimated variance. Therefore, similarly to trilinear filtering, the Gaussian blur is performed on the two most appropriate MIP-map levels, and the results are linearly interpolated, avoiding sudden pops when the MIP-map level changes.
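A sketch of the resulting filter parameters (hypothetical C++; again, the actual implementation is a GLSL shader):

    #include <cmath>

    // Move to a coarser MIP-map level while sigma is below the threshold t;
    // halving the resolution multiplies sigma by four.
    int chooseMipLevel(double& sigma, double t) {
        int level = 0;
        while (sigma < t) { sigma *= 4.0; ++level; }
        return level;
    }

    double smoothstep(double a, double b, double x) {
        double u = (x - a) / (b - a);
        u = u < 0.0 ? 0.0 : (u > 1.0 ? 1.0 : u);
        return u * u * (3.0 - 2.0 * u);
    }

    // Windowed Gaussian weight: g = exp(-sigma*d^2) multiplied by
    // w = (1 - smoothstep(0, R, d))^E, with R ~ 2.5 and E = 8.
    double filterWeight(double d, double sigma, double R, double E) {
        return std::exp(-sigma * d * d)
             * std::pow(1.0 - smoothstep(0.0, R, d), E);
    }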

With truncation to the range 2.5, the blur uses 2·(2·2.5)² = 50 texture fetches on average, which is costly, yet acceptable on contemporary GPUs. This sophisticated filtering scheme is used only for the [X, Y, Z, W] textures. The variance [Var] textures, not being displayed directly, do not have to be sampled with anything more complicated than basic MIP-mapping. This saves some GPU computational power, yet does not produce noticeable visual artifacts.

Figure 5.7: Comparison of MIP-mapping and custom filtering based blur quality. From left: reference image, hardware MIP-mapping, custom reconstruction based on Catmull-Rom polynomials, windowed Gaussian blur.

Results and Discussion

The quality of the rendered images obviously depends mostly on the rendering algorithm used. We have tested the visualization client in cooperation with Path Tracing (Figure 5.8) and Photon Mapping (Figure 5.9). Both figures present the initial image rendered after 1/30 s and show the speed of image quality improvement. All the tests were performed on an Intel Core i7 CPU and an Nvidia 9800 GT GPU, in 512x512 resolution.

Figure 5.8: Results of Path Tracing (from left: after 1/30 s, 1/3 s, 3 s, 30 s). The Path Tracing error appears as noise; the blur in the first two images is caused by undersampling (far less than 1 sample per pixel was evaluated).

The client is responsible merely for visualization and postprocessing, assuming that it is provided with a stream of point samples scattered roughly evenly over the entire image. The only algorithm for image quality improvement is noise reduction based on variance analysis. The error due to variance (seen as high frequency noise) is much more prominent in the results of Path Tracing than in Photon Mapping, so the noise reduction has been tested on the former algorithm. The results are presented in Figure 5.10.

When multiple processors are used in the same application, good load balancing is important. While it is well known how to balance ray tracing work between multiple CPUs, in our application it is impossible to balance loads between the visualization client and the ray tracing server.

Figure 5.9: Results of Photon Mapping (from left: after 1/30 s, 1/3 s, 3 s, 30 s). Photon Mapping does not produce much noise, but due to the overhead caused by photon tracing and final gathering, fewer image samples were computed than with Path Tracing, which causes some blurriness.

Figure 5.10: Noise reduction based on variance analysis of a Path Tracing image (from left: no noise reduction, with noise reduction, variance image). The difference is not large, but noticeable, especially in the shadowed area beneath the sphere and on the indirectly illuminated ceiling.

The subtasks performed by the CPUs and the GPU are substantially different and suited to the different architectures of these two processors, so work cannot be moved to the less busy unit as needed. In fact, on contemporary machines the rendering server is always at full load, while the GPU may not be fully utilized, especially when low resolution images are displayed. However, it is good to have some reserve of GPU power, to ensure real-time client response.

We have presented an interactive GUI visualization client for displaying ray traced images online, written mainly in GLSL. Apart from visualization, the client can hide noise in the input data by means of variance analysis. Moreover, the client can apply a glare effect as a postprocessing technique, which is performed quite efficiently on the GPU. The client is able to remain interactive regardless of the ray tracing speed; the price to pay is the blurriness of images rendered at interactive rates. Nevertheless, the image quality improves quickly with time whenever the rendered scene is not changed. Additionally, we have modified the Photon Mapping algorithm to be a one-pass technique, with the photon map being updated interactively during the whole rendering process. This enables using Photon Mapping with the presented visualization client, which can then ensure progressive image quality improvement, without any latencies resulting from the construction of the photon map structure.

Our approach scales well with an increasing number of CPU cores for ray tracing, as well as with an increasing number of shader processors on a GPU. Moreover, the program never reads results back from the GPU, so it does not cause synchronization bottlenecks, and should be friendly to multi-GPU technologies like SLI or Crossfire.

Our visualization client has a lot of potential for future upgrades. The adaptive filtering technique [Suykens & Willems, 2000] seems to be a good approach to significantly reducing image noise on the side of the visualization client.

Moreover, the client can be extended to support frameless rendering [Bishop et al., 1994, Dayal et al., 2005]. This very interesting and promising technique can improve image quality substantially by using samples from previous frames, provided that subsequent images do not differ too much. In the future we plan to add stereo capability to our client, using the OpenGL quad-buffered stereo technology. Ray tracing algorithms can easily be converted to render images from two cameras at once, and many of them can do this even more efficiently than rendering two images sequentially (for example, Photon Mapping can employ one photon map for both cameras, and similarly, Bidirectional Path Tracing can generate one light subpath for two camera subpaths). Unfortunately, stereo rendering doubles the load on the GPU shaders, as well as on the GPU memory. However, it seems that interactive stereo can be obtained by a slight decrease of the custom texture filtering quality.

Chapter 6

Rendering Software Design and Implementation

Global illumination algorithms alone are not enough to create realistic images. Equally important is the input data, which must properly describe real world geometry, materials, and light sources. Without sufficiently complex artificial scenes, even globally illuminated images look plain, dull and unbelievable. It is a good design idea to separate rendering algorithms from the input data management functions by a well designed layer of abstraction. All the classes and functions available to rendering algorithms in this thesis are called environment functions. In order to achieve a satisfactory degree of realism, environment functions must operate on huge amounts of data. Efficient implementation of this task is very difficult, because of the storage limitations of contemporary machines and the required performance of the functions. In fact, the requirements of low memory consumption and high execution speed are contradictory, and any implementation must seek a reasonable compromise. The efficiency of a given rendering algorithm decides how many environment function calls are required to render an image of the requested quality. The final rendering speed, therefore, cannot be satisfactory if a good rendering algorithm is paired with slow environment functions.

Apart from the environment functions themselves, a clear and effective interface between them and the rendering algorithms is also a necessity. A well designed interface can make the implementation of the software easy, while a poor one can even rule out the implementation of certain algorithms. The extended stream machine and parallelism are also hidden behind a specialized interface; this interface, however, is not fully transparent.

The rest of this chapter starts with a description of the framework for the core functionality and its interfaces. These interfaces define the communication between 3D objects and rendering algorithms. Next, an integrated texturing language, specialized for ray tracing and global illumination, is presented; finally, a new reflection model, also optimized for global illumination algorithms, is derived.

6.1 Core Functionality Interface

The framework mainly defines interfaces between rendering algorithms, cameras and 3D objects, plus some auxiliary functions and classes. Careful design of the framework is an extremely important task in creating a high quality rendering system. The framework decides what can and what cannot ever be implemented. Interface modifications typically cause tedious and time consuming changes to large parts of the system. Our idea is to define the framework as an interface in the object oriented sense, with the addition of a few auxiliary classes. The framework contains only a variety of sampling methods. That is, algorithms can be implemented only by using sampling routines, and it is enough to define sampling operations to define new 3D objects or cameras.

What is more, the design exhibits the symmetry of light transport, which enables giving 3D objects and cameras very similar and consistent interfaces. This feature substantially simplifies the implementation of bidirectional ray tracing algorithms.

The idea of a modular rendering system, supporting a variety of ray tracing concepts, is not new. Some early works on this topic are [Kirk & Arvo, 1988] and [Shirley et al., 1991]. Kirk and Arvo [Kirk & Arvo, 1988] described a framework for classic ray tracing, whereas Shirley et al. [Shirley et al., 1991] presented a design of a global illumination system. However, Shirley's framework is designed for the zonal method, incorporates a lot of details of particular algorithms, and therefore cannot easily be modified to support different rendering techniques. A more recent work in this area [Ureña et al., 1997] presents a very general object oriented design for not only ray tracing, but also z-buffer rasterization and radiosity. Unfortunately, the approach is overcomplicated and inconvenient if only quasi-Monte Carlo ray tracing is to be supported. Greenberg et al. [Greenberg et al., 1997] describe the rendering techniques and research areas current as of 1997, rather than defining an object oriented framework. This paper divides the image synthesis task into light reflection models, light transport simulation and perceptual issues, which allows independent research in each of these domains. Moreover, it pays a lot of attention to the correctness of rendered images, obtained by using carefully measured reflection data and by validating images of artificial scenes through comparison with photographs of real ones. Geimer and Müller's [Geimer & Müller, 2003] main point is interactivity and support of Intel's SSE* instruction sets [Intel Corporation, 2009]. Leeson et al. [Leeson et al., 2000] designed an interface around mathematical concepts like functions and integration, besides the rendering ones. This system also provides debugging elements — independent testing of implementations of particular interfaces and a graphical viewer of traced rays. Wald et al. [Wald et al., 2005] present a system suitable for both geometric design accuracy and realistic visualization based on ray tracing, executed in real time.

All the previously mentioned projects assume that rays travel along straight lines in 3D space. However, there are some exotic approaches which break this assumption. Hanson and Weiskopf [Hanson & Weiskopf, 2001] do not assume that the speed of light is infinite, and visualize relativistic effects using a 4D (3D + time) space. A novel and interesting approach to ray tracing [Ihrke et al., 2007] allows rendering of volumetric objects with varying refractive indexes; because of that, rays no longer travel along straight lines — instead, their trajectories are evaluated by solving partial differential equations. These methods require a substantially different approach to tracing rays.

The most similar to ours is the approach of Pharr and Humphreys [Pharr & Humphreys, 2004]. They designed a framework for ray tracing based global illumination and implemented a few well known algorithms within it. However, their framework mixes some aspects of 3D object representation with the implementation of rendering algorithms.
While this can sometimes be useful for fine tuning the rendering for the best achievable performance, it can be very time consuming and error prone when new concepts are integrated into the rendering platform.

The interface is based on a few structures: a photon event, a spectral color and sample data. These structures, with some other elements, are arguments of the sampling routines. A photon event describes an intersection of a ray with object geometry and photon scattering. It is filled by the intersection function, and can be read by subsequent function calls. A spectral color describes, depending on the routine, an emission spectrum, a camera sensitivity or a scattering factor. The sample data is used for generating quasi-Monte Carlo sample points.

6.1.1 Quasi-Monte Carlo Sampling

The sampling interface provides a very elegant and concise way of specifying what can be done with cameras and 3D objects. The sampling methods are general enough to implement the majority of ray tracing algorithms. Basic sampling of a function means selecting arguments of the function y = f(x) at random, with probability depending on its shape. The argument of the function is a variable in some space Ω₁. Similarly, y = f(x) is not necessarily a real number; in general, f maps x ∈ Ω₁ to y ∈ Ω₂. In our sampling, three basic operations can be distinguished, listed below.

- Sampling the argument x (e.g. a direction of scattering).
- Evaluation of the sampling probability density (e.g. what is the probability of scattering a ray in a given direction ω). The result is often difficult to compute exactly and can be approximated; however, a crude approximation hurts performance.
- Evaluation of the value y = f(x) for a given argument x (e.g. what is the scattering factor in a given direction ω).

All the sampling routines can be grouped into the four categories listed below.

- Queries about a point on a surface or in a volume. In scene objects these queries randomize emission points, while in cameras they select a point on the lens. The y value denotes a spectrum, which describes spatial emission or sensitivity, respectively. In the case of point lights or pinhole cameras, y is a δ distribution coefficient. The probability is measured with respect to the lens or light source area, or a light source volume in the case of volumetric emission. These query functions are used by ray tracing algorithms to start a light transport path.
- Queries about transmission, such as finding the nearest ray intersection with an object or lens, or checking if and where a ray interacts with a medium if the object contains volumetric effects. The argument x represents a point in a volume or a point on a surface, while the y value is a spectrum describing attenuation along the ray path. The probability is measured with respect to distance.
- Queries about emission/sensitivity, such as what is the emission or camera sensitivity at a point in a given direction. Points are generated by the point generating group of routines. The argument x is a direction vector, and the y value is a spectrum representing emission or sensitivity. The probability is measured with respect to solid angle for volumetric emission, or projected solid angle for surface emission and cameras.
- Queries about scattering — the direction in which a ray scatters after a collision with a scene object. The argument x is a direction vector, while the y value is a BSDF or a phase function represented as a spectrum. The probability is measured with respect to solid angle for volumetric scattering, projected solid angle for surface scattering, or is a δ distribution for perfectly specular surfaces. These functions are used only with points acquired by one of the two previous methods — either surface or volume sampling, or transmission.

Eventually, the basic interface contains twelve functions altogether. However, to make the implementation more efficient, the query about transmission attenuation is divided into two functions. One tests a ray for the nearest intersection with an opaque surface, returning a binary value telling whether the ray hits a surface, and a distance if a hit occurred; the other calculates the attenuation of ray radiance due to participating media and semi-transparent surfaces. Finally, the interface contains one additional routine for freeing temporary memory, since most implementations require storing data between the calls which generate points and the queries about scattering.

Quasi-Monte Carlo numbers are used instead of truly random ones. Despite not being strictly correct without modification of the theory behind Monte Carlo integration, this approach allows reproduction of any sequence of operations, and often yields better convergence. There are two kinds of random number generators.
The first kind are generators which preserve state, with an initialization routine and a next routine. These generators can produce numbers only sequentially, so random access to the middle of the sequence is extremely inefficient. The second kind of generators are so-called explicit ones — hash functions or low discrepancy sequences. They allow immediate access to any pseudorandom number. Low discrepancy sequences give better convergence of MC integration, but only for low-dimensional functions; an algorithm generating a good quality, high-dimensional low discrepancy sequence is still an unsolved task. The presented system needs a random access sequence with theoretically unbounded dimensionality. It uses for this purpose a compound generator made from two basic ones. The first four dimensions are generated by a Niederreiter (t, s)-sequence in base 2, with s = 4 and quality t = 1 [Niederreiter & Xing, 1996] (see the brief description of (t, s)-sequences given earlier).

Further dimensions are generated by a pseudorandom number generator. Its 64-bit seed is formed as the sample number xor-ed with the dimension number multiplied by a large prime. The generator consists of three 64-bit congruential RNG steps, separated by translations of high order bits so that they mix with the low ones. The Mersenne Twister generator [Matsumoto & Nishimura, 1998], known for its good statistical properties, cannot be used here, because it does not offer random access to an arbitrary element of the sequence, providing a parameterless next routine instead. Nevertheless, our simple function works quite well and does not exhibit significant flaws.

Each sample point is described by a 64-bit sample number, and each coordinate of a point is described by a separate 32-bit coordinate number. The light transport algorithm provides these numbers to the sampling routines implemented by 3D objects and cameras. Implementations of these entities must actually use the provided QMC sample point generator, instead of defining independent sample point generating routines, due to the peculiarities of QMC sampling (see the earlier, more detailed discussion). Each time an object or a camera uses a pseudorandom number, the coordinate number is increased by one. This ensures that all object intersections generated by subsequent rays on the same path use the appropriate coordinates of one sample point. On the other hand, each individual path is assigned a different sample point, so no accidental correlation between sample numbers can occur.
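In the spirit of this construction, a random access generator may look as follows (a hypothetical C++ sketch: the constants are ours, chosen for illustration, and the first four dimensions would come from the Niederreiter-Xing sequence instead):

    #include <cstdint>

    // Returns coordinate `dimension` of sample `sampleNumber` in [0, 1).
    double randomAccessSample(uint64_t sampleNumber, uint64_t dimension) {
        // Seed: sample number xor-ed with the dimension number
        // multiplied by a large prime (here 2^64 - 59).
        uint64_t s = sampleNumber ^ (dimension * 0xFFFFFFFFFFFFFFC5ull);
        for (int i = 0; i < 3; ++i) {
            // 64-bit congruential step followed by mixing high order
            // bits into the low ones.
            s = s * 6364136223846793005ull + 1442695040888963407ull;
            s ^= s >> 33;
        }
        return (s >> 11) * (1.0 / 9007199254740992.0);  // 53-bit mantissa
    }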
6.1.2 Ray Intersection Computation

Efficient computation of intersections of rays with scene geometry is crucial from the performance point of view. A lot of so-called acceleration structures have been proposed, which provide, on average, logarithmic complexity of this computation with respect to the number of primitives in the scene. However, we are not aware of an efficient implementation of an acceleration structure that can keep surface and volumetric primitives together. Typically, two separate structures are used, or, even worse, volumetric primitives are just kept in an array. In this section we propose a substantially better method, which is significantly faster if a scene contains a lot of volumetric primitives. Moreover, our approach is capable of handling semi-transparent surfaces, such as thin stained glass, with no light path vertex for a ray scattering on such a surface.

Due to their best known performance, we chose kd-trees as the acceleration structure. Volumetric primitives are inserted into the tree similarly to common surfaces — they can be bounded by boxes, and tests whether such a volume has a common part with a tree node are easily performed. Almost any well known tree construction algorithm can therefore be immediately adapted to insert volumetric primitives into the tree alongside surfaces. More tricky is the efficient traversal of such a tree. The effect of a volumetric primitive on the radiance of a ray depends, among others, on how long the ray travels through the primitive. This quantity, however, is initially unknown, because the surface eventually hit by the ray may be found only after a ray-volume interaction has already been computed. Figure 6.1 shows an example ray interacting with various primitives.

Whenever a ray intersection search is performed, a reference to, and the distance of, the nearest opaque surface encountered so far are kept. Meanwhile, references and distances to the encountered semi-transparent surfaces and volumetric primitives are stored in a heap. The key used in the heap is the distance — in the case of volumetric primitives, the distance to the initial intersection of the ray with the primitive boundary. If a nearer opaque surface is found, references to primitives intersected farther away are removed from the heap. When the tree traversal is finished, the ray radiance is modified according to each semi-transparent primitive stored in the heap. Therefore, independent traversals for surface and volumetric primitives are unnecessary.
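The bookkeeping can be sketched as follows (hypothetical C++ using std::priority_queue as the heap; the kd-tree walk itself is omitted, and out-of-range entries are removed lazily here):

    #include <limits>
    #include <queue>
    #include <vector>

    struct Primitive;                  // semi-transparent surface or volume

    struct HeapEntry {
        double t;                      // distance to the (initial) intersection
        const Primitive* prim;
        bool operator<(const HeapEntry& o) const { return t > o.t; } // min-heap
    };

    struct TraversalState {
        double tOpaque = std::numeric_limits<double>::infinity();
        const Primitive* opaqueHit = nullptr;
        std::priority_queue<HeapEntry> pending;   // keyed by distance

        void foundOpaque(double t, const Primitive* p) {
            if (t < tOpaque) { tOpaque = t; opaqueHit = p; }
        }
        void foundNonOpaque(double tEntry, const Primitive* p) {
            if (tEntry < tOpaque) pending.push({tEntry, p});
        }
        // After traversal: attenuate radiance along entries in front of the
        // opaque hit; entries behind it are simply discarded.
        template <class Attenuate>
        void finish(Attenuate&& attenuate) {
            while (!pending.empty()) {
                HeapEntry e = pending.top(); pending.pop();
                if (e.t < tOpaque) attenuate(e.prim, e.t, tOpaque);
            }
        }
    };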

Moreover, semi-transparent surfaces which affect ray radiance but do not scatter rays into different directions — like thin stained glass, where the offset due to refraction is negligible — can be added for free, without the necessity of generating extra intersections with these primitives. Effectively, this optimization shortens light transport paths by one vertex for each such intersection. Such paths are faster to generate and less prone to high variance.

Figure 6.1: An interaction of a ray with surface and volumetric primitives (a volumetric primitive with its bounding box, semi-transparent surfaces, and an opaque surface). The ray is terminated when it hits an opaque surface. The interaction with a volume takes place only up to the point of termination. Semi-transparent surfaces which do not scatter the ray may be accounted for without generating additional ray-surface intersections.

Figure 6.2: Optimization of intersections of rays with semi-transparent surfaces. Left image: standard ray intersection. Right image: omission of explicit generation of ray intersections with such surfaces. Both techniques obviously converge to the same result, but much faster with this optimization.

Typically, a semi-transparent surface combines transparent and opaque elements; for example, a thin glass surface may either reflect a ray or let it through, attenuating it but not changing its direction. The implemented software assigns a degree of transparency coefficient to each surface, which describes how much light is able to pass through the surface. This coefficient is analogous to the scattering coefficient σ_s used to describe participating media, and it may depend on several variables, e.g. the cosine of the angle between the ray direction and the surface normal. Whenever such a surface is intersected, a random test is performed to determine whether the ray passes through the surface or is scattered on it. The effect of this improvement is presented in Figure 6.2, where semi-transparent stained glass is modeled with the described technique. The difference is striking, especially in the coloured shadow viewed indirectly through the glass.
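A minimal sketch of the random transparency test (our names; the coefficient function is a placeholder):

    // Degree of transparency, analogous to the scattering coefficient
    // sigma_s of participating media; it may depend e.g. on the cosine of
    // the angle between the ray direction and the surface normal.
    struct SemiTransparent {
        double transparency(double cosTheta) const {
            return 0.3 + 0.5 * cosTheta;   // placeholder dependence
        }
    };

    // u is a QMC sample in [0, 1). If true, the ray passes through with
    // attenuated radiance and no light path vertex is generated;
    // otherwise it is scattered according to the surface BSDF.
    bool passesThrough(const SemiTransparent& s, double cosTheta, double u) {
        return u < s.transparency(cosTheta);
    }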

6.1.3 Spectra and Colors

An efficient technique for sampling spectra (described in Section 4.3) is not enough for full spectral rendering. Typically, obtaining an input scene description with spectral data is quite an issue — the spectral data is often unavailable, and the physical simulations necessary to obtain it are far too complicated. In this case, a conversion from an RGB color representation has to be performed. One popular algorithm is described by Smits [Smits, 1999]; we have provided an alternative approach for this purpose, however. Additionally, the result of any spectral simulation is a spectrum, which cannot be displayed directly. It has to be converted to an RGB model, taking into account the properties of the human visual system as well as the color gamut of the target hardware. Since the implemented rendering core outputs image samples in the CIE XYZ format, the conversion to an RGB model can easily be replaced by a more sophisticated approach.

Acquiring spectral data

Spectral data can typically be obtained by employing physically based calculations, by using measured data, or by conversion from RGB images. Common examples of physically based calculations are Planck's blackbody radiation formula and reflection from metals described by the Fresnel equations. These formulae are not computationally expensive when compared to the full cost of global illumination, and give physically plausible results, so they should be applied whenever possible. If there are no simple physical equations describing a given phenomenon, measured data can be applied. An example of this approach is the CIE illuminant D65, which is a tabulated spectrum of daylight illumination. Typically, however, there exist neither physical formulae nor measured spectral data. In this case the only solution is to convert RGB colors to a full spectral representation. This conversion is obviously not well-defined — there are infinitely many spectra for any one RGB color, and the implementation must arbitrarily choose one of them. Compared to the previously created conversion algorithm [Smits, 1999], our approach is simpler, produces a smoother output spectrum, and is applicable to a point sampled spectra representation instead of piecewise constant basis functions. Because such a conversion is not well defined, it cannot be said which approach is more accurate in general.

Plausible RGB to spectrum conversions for material reflectivity and for light sources are actually different. An idealized white material reflects all the received light and absorbs nothing, so the conversion for materials maps the RGB triplet (1, 1, 1) to a spectrum with the constant value of one. If this conversion were used for a light source, the effect would be a reddish illumination. This is because white light is not light with a constant spectrum, but daylight, which is a complex result of the scattering of the Sun's rays in the Earth's atmosphere. The human visual system adapts to these conditions and perceives this as neutral, colorless light.

The basic conversion implemented in our model is defined for reflectances. A plausible conversion means that the resulting spectra satisfy some requirements. First, triplets of the form (c, c, c) should be mapped to appropriate constant valued spectra. Second, the perceived hue of an RGB color should be preserved — when a textured surface is illuminated with white light, the output image should match the texture color. This can be precisely expressed as:

XYZ→RGB( spectrum((r, g, b)) · D65 ) = c₁ · (r, g, b),   (6.1)

where c₁ is an arbitrary constant, using the sRGB profile for the XYZ→RGB transform. Moreover, in most cases the resulting spectrum should be smooth. The basis of the conversion are three almost arbitrarily chosen functions: r(λ), g(λ) and b(λ).
For simplicity, in our model these functions are spline based, and the spectral curve for the blue component is dependent: b(λ) = 1 − (r(λ) + g(λ)), which guarantees that the functions sum to one. The converted spectrum is calculated as s(λ) = R·r(λ) + G·g(λ) + B·b(λ), where (R, G, B) is the given color triplet. The conversions for light sources are defined as the product of a daylight illumination spectrum (given as the standard CIE illuminant D65) and a converted (R, G, B) triplet; thus the RGB data acts as a modulator of the D65 white light. One possible choice of r(λ), g(λ) and b(λ) is presented in Table 6.1. The precision of the algorithm cannot be perfect, because it involves a lot of measured physiological and hardware data. In particular, there is no unique RGB standard, and the conversion of the D65 light to one of these standards does not necessarily result in an ideal gray color.
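A sketch of the conversion (hypothetical C++; r and g here stand for the basis values r(λ) and g(λ) defined in Table 6.1 below):

    double clamp01(double x) { return x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x); }

    // The smooth step polynomial from Table 6.1: P(a,b,x) = t^2 (3 - 2t).
    double P(double a, double b, double x) {
        double t = clamp01((x - a) / (b - a));
        return t * t * (3.0 - 2.0 * t);
    }

    // s(lambda) = R r(lambda) + G g(lambda) + B b(lambda), with the blue
    // basis b(lambda) = 1 - (r(lambda) + g(lambda)), so the basis sums to one.
    double reflectanceSpectrum(double R, double G, double B,
                               double r, double g) {
        double b = 1.0 - (r + g);
        return R * r + G * g + B * b;
    }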

Table 6.1: Piecewise polynomial spectral functions for RGB colors. The functions R(λ) and G(λ) are given piecewise over wavelength intervals a ≤ λ < b (bounds a, b in nm), with pieces that are constants or the expressions 0.165(1 − P(a, b, λ)), P(a, b, λ) and 1 − P(a, b, λ). The function B(λ) is defined as B(λ) = 1 − (R(λ) + G(λ)). The polynomial P(a, b, x) is the function P(a, b, x) = t²(3 − 2t), where t = clamp((x − a)/(b − a), 0, 1).

Conversion to RGB from XYZ

For the XYZ to RGB transformation, our idea was to examine a simple mapping approach which can still produce good quality results. The presented technique does not handle luminance and chromatic adaptation, but it leaves a few parameters to be adjusted arbitrarily. The algorithm presented here contains two independent parts: the first is luminance (tone) mapping, and the second is the actual color conversion. Luminance mapping allows setting sensitivity and contrast parameters, while the color conversion takes a color profile matrix.

The luminance mapping is necessary because common display hardware has very limited contrast in comparison with the contrasts that appear in nature. Usually, mapping algorithms map the computed luminance from the range [0, ∞) to the range [0, 1], which is then quantized linearly (typically to 256 levels) and displayed. The simplest solutions are to clamp all too large values to unity, or to scale linearly by the brightest value. These methods, however, produce too many white patches or display dark regions far too dark. Effective conversion needs a nonlinear mapping. In our model we use the function y′ = 1 − 2^(−(σy)^C) for this purpose, where y is the computed luminance, y′ is the mapped luminance, σ is a brightness scaling parameter and C is the contrast. In more advanced approaches, local contrast preserving techniques are used. They apply different scaling factors in different parts of the image, based on the local average brightness. The result is better contrast over the whole image, but at the price of a non-monotonic mapping.

After mapping the luminance, the RGB output has to be calculated. Due to the limited color gamut of the RGB model, the exact result cannot be produced. We assume that the mapped luminance (y′) must be preserved exactly in order to keep the whole mapping monotonic, and that the hue must also be roughly preserved (it cannot be changed e.g. from red to blue during mapping); the only parameter that may be modified by a large amount is saturation. In this technique, all problems that emerge during mapping result in a fade-to-white effect. In particular, when sufficiently strong lighting is used, each color becomes pure white, i.e. (1, 1, 1) in the RGB model. This should not be seen as a harmful effect, since it is similar to overexposure in photography. Careful adjustment of the σ and C parameters gives reasonably good results in almost all lighting conditions.

The color conversion algorithm requires a color profile matrix P. The matrix multiplication can produce out of gamut colors; however, simple desaturation of such colors, to render them displayable, works reasonably well. The desaturation algorithm uses the color profile matrix P and the second row of its inverse P⁻¹, which determines the y component in the expression XYZ = P⁻¹·RGB. These elements must be nonnegative and must sum to one — a requirement satisfied by the sRGB color matrix. The algorithm assumes that the y component of the input is in the [0, 1] range.
The idea of this approach, presented in Algorithm 6.1, is to compute a clamped RGB color first, then check how the clamping affected luminance, and finally adjust the RGB color to compensate for the luminance change. Figure 6.3 shows a comparison between desaturation and clamping, using the sRGB color profile. The colors are defined by blackbody radiation (1500 K for red and 6600 K for gray) with carefully chosen mapping sensitivity.

Each subsequent color patch has its sensitivity set to a value 1.25 times larger than the previous one. The first row is computed with luminance correction — it exposes the fade-to-white effect. The saturated images in the second row, on the other hand, cannot approach full luminance, no matter how high the sensitivity is.

Algorithm 6.1: Gamut mapping by desaturation
    RGB ← P · XYZ
    clRGB ← clamp(RGB, 0, 1)
    Δy ← y − (P⁻¹ · clRGB).y
    if Δy ≤ 0 then ΔRGB ← clRGB else ΔRGB ← 1 − clRGB
    result ← clRGB + ΔRGB · Δy / (P⁻¹ · ΔRGB).y

Figure 6.3: Comparison between different gamut mapping techniques. The upper row is generated with desaturation, while the lower one is due to clamping. The red color is defined as 1500 K blackbody radiation, and the gray has a temperature of 6600 K.

6.1.4 Extension Support

Extension support is a very handy feature if the ray tracer implementation is used for experiments with a variety of light transport algorithms. Unfortunately, the C++ language does not support loading machine code at runtime, so the rendering software uses the Windows DLL mechanism [Richter, 1999] for this purpose. Due to the lack of a C++ language standard for importing pure virtual functions from dynamic libraries, arrays of virtual functions are created manually from standard C-like functions, with an explicitly passed this pointer. This tedious work is necessary to provide compatibility when different compilers are used for the platform core and for individual plugins.

The majority of the rendering functionality is defined by abstract interfaces, and therefore can be implemented as plugins. The most important objects which can be implemented as plugins are listed below:

- light transport algorithms,
- 3D objects,
- cameras.

Each plugin provides an independent part of the rendering functionality, which can be modified and experimented with independently of the rest of the system, possibly by an independent group of people. Typically, the implementation of the necessary functions of the above mentioned objects requires so much computational resources that the additional layer of virtual function calls has little effect on performance. On the other hand, the functionality listed below is not available as plugins:

- spectrum representation,
- quasi-Monte Carlo sample generation.

These functions are much simpler and much more frequently executed, so making them virtual would noticeably hurt performance. As a consequence of being integrated into the core functionality, any modification to them requires recompilation of the rendering core and all plugins.

6.2 Procedural Texturing Language

Texturing is a technique used in rendering; its purpose is the improvement of image quality without increasing the geometrical complexity of 3D objects. A simple application of a texture, presented in Figure 6.4, shows how much in contemporary computer graphics is modeled using textures.

Figure 6.4: Left image: human model with the head built from geometric primitives only. Right image: the result of applying a texture onto the geometrical model.

Classic (non-procedural) texturing, presented in Figure 6.4, is a painting of 2D maps onto 3D models. In spite of being able to define an arbitrary appearance of any 3D model, this technique exhibits some flaws. First, texture maps have to be created, which is often time consuming. Second, storing maps during rendering consumes a lot of machine memory. Moreover, 3D textures are sometimes also useful, and these require far more storage still. As a consequence, complex scenes are frequently constructed using the same texture several times, or using textures with reduced resolution. A partial solution to this problem is rendering with compressed textures [Beers et al., 1996, Wang et al., 2007, Radziszewski & Alda, 2008]. However, there are a lot of objects which can be textured procedurally. Procedural texturing is based on executing typically short programs which evaluate the material properties of textured points. An example result of procedural texturing is presented in Figure 6.5.

Procedural texturing, despite its ability to generate a wide variety of object appearances with minimal storage, is not always an ideal solution. First, it is not always applicable — take the human face in Figure 6.4. Second, more complicated programs can significantly increase rendering time; a simple array lookup, even with sophisticated filtering, is typically faster than executing a general program. Because of this, the most common solution is a hybrid approach — a procedural texturing language enhanced with commands supporting classic texture images.

Rendering, especially in real time, used to be based on a hard-coded shading formula, defined as a sum of a matte reflection with the texture color and a white highlight with adjustable glossiness [Shreiner, 2009]. More modern approaches allowed using a few textures combined by simple mathematical operations, like sum or product. Obviously, these methods were totally inadequate for simulating the vast diversity of real life phenomena. The first well known approach to flexible shading were shade trees [Cook, 1984]. Since that time, many far more sophisticated languages for procedurally defining the appearance of surfaces have been created. Many useful techniques for procedural texturing are presented in [Ebert et al., 2002]. A popular shading language for off-line rendering is part of the RenderMan software [Cortes & Raghavachary, 2008]. Typical shading languages used in modern, hardware accelerated rendering are GLSL [Rost & Licea-Kane, 2009], HLSL [Luna, 2006], and Cg [Fernando & Kilgard, 2003].

Figure 6.5: Sample procedural texture.

These languages have a syntax similar to C, but they are compiled for specialized graphics hardware. Currently, these languages impose serious restrictions on programs. Nevertheless, these restrictions are forced by the deficiencies of contemporary GPUs, rather than being an inherent limit of the languages themselves; therefore, it is widely expected that they will become far more powerful in the near future.

As a part of the rendering software, we have designed and implemented a texturing language optimized for ray tracing, with a functional language syntax. The rest of this section starts with a description of the functional programming aspects. Next, our language syntax and semantics are explained. This is followed by the language execution model and its virtual machine API, targeted at integration with the rendering software. Finally, example results are presented.

Functional Languages

Functional programming is a programming technique which treats computation as the evaluation of mathematical functions. This evaluation is emphasized in favour of state management, which is a major part of the traditional, imperative programming style. In practice, the difference between a mathematical function and the concept of a function used in imperative programming is that imperative functions can have side effects, reading and writing state variables besides their formal input parameters and output result. Because of this, the same language expression can result in different values at different times, depending on the state of the executing program. In functional code, on the other hand, the output value of a function depends only on the arguments that are input to the function, so calling a function f multiple times with the same value of the argument x will produce the same result f(x) every time. Eliminating side effects can make it easier to understand the behavior of a program. Moreover, it makes compiler optimizations easier to implement, improving the performance of functional programs. This is one of the key motivations for the development of functional languages; one such language is Haskell [Hudak et al., 2007].

The main idea of imperative programming is the description of subsequent tasks to perform. These languages are good as general purpose languages — they can describe typical mathematical functions, as well as arbitrary other tasks, e.g. a web server or a computer game. However, in some programming tasks programs always evaluate a function of already defined input variables and constants, and return the result of this function. In such cases, functional programming seems to be much more convenient and less error prone. This is in fact the case for programmable texturing and shading, and therefore we argue that functional languages can be better suited for this task than the general purpose, C-like languages so often used for it.

Functional languages have been used in computer graphics before [Karczmarczuk, 2002, Elliott, 2004]. Karczmarczuk used the functional language Clean to implement a library with image synthesis and processing tools. This library is not integrated with any rendering application; the toolset can just create 2D images, display and save them. Elliott, on the other hand, created the Vertigo language, used for programming graphics processors.

104 92 CHAPTER 6. RENDERING SOFTWARE DESIGN AND IMPLEMENTATION The language compiler outputs assembler in the DirectX format, so the language can be treated as HLSL replacement. To our knowledge, there is no functional language designed to cooperate with ray tracing based global illumination, which is the motivation of our research in this area Syntax and Semantic The language design is largely affected by the language purpose. In this case, the language is designed for texturing and cooperation with ray tracing based global illumination software. Therefore, the language is not intended to be used as a tool for creating standalone applications. Its usage model should be similar with languages like GLSL, HLSL or Cg the programs are compiled at runtime and executed by rendering application in order to define objects appearance. The rest of this section starts with a distinction between texturing and shading. This is followed by the description of presenter language grammar and semantic of grammatical constructions. Texturing vs. Shading Texturing and shading are in fact substantially different operations, despite the fact that these terms are frequently used interchangeably. Shading operation is an evaluation of final color of image fragment, which is about to be placed in frame buffer. Shading takes into account material properties and illumination as well. On the other hand, texturing operation defines in textured point material properties only. This differentiation is important, because different types of rendering algorithm can use either texturing or shading. Shading is used in rasterization algorithms supporting local illumination, where shaders are expected to produce final fragment color. Local illumination uses information about shaded point and a tiny set of constants, like light sources description. This information obviously is not enough for full global illumination calculations. On the other hand, global illumination algorithms automatically calculate illumination simulating laws of physics. Any intervention into these procedures by shaders is harmful, so global illumination algorithms are designed to cooperate with texturing only. Since the thesis concentrates on the latter approach only, the presented language is designed for texturing purposes only. Language Grammar and Semantic The simplified language grammar is presented in Figure 6.2. At the highest level, the program contains surface and volume descriptors, with function, constants and type definitions. The descriptors specify particular outputs, which have to be defined. The outputs are script-defined expressions returning values of particular type. There are listed all possible outputs of surface descriptors, with required types given in the brackets: BSDF (material, default value is matte), scattering coefficient (spectrum), absorption coefficient (spectrum), emission function (material), emission coefficient (spectrum), surface height (real), surface height gradient (real[2]). The volume descriptors have similar fields, except height and gradient: phase function (material, default value is isotropic),

The scattering coefficients in both descriptors describe how likely the primitive is to scatter rays (see Section 6.1.2 for an explanation of the volume scattering coefficient). The scattering coefficient in surfaces can be used to create transparent or semi-transparent surfaces (see Section 6.1.2). Whenever the scattering coefficient is non-zero, scattering may occur. The scattering is described by BSDFs for surfaces and by phase functions for volumes (see Section 2.2.3). Whenever no scattering occurs, the ray radiance is attenuated according to the absorption coefficient. If a surface or a volume has a non-zero emission coefficient, emission may occur. Emission is defined by materials, similarly to scattering. The surface descriptors have two additional outputs: height, used for displacement mapping, and gradient, which can be used for bump mapping if the surface primitive does not support displacement.

Apart from the descriptors, all non-nested functions and constants are visible from outside the script. The alias construction can be used to explicitly expose elements outside the script under different names.

The materials are opaque handles describing the interaction of light with scene primitives: a surface being matte, glossy, and so on. Currently, materials cannot be defined from scratch in scripts; the appropriate standard library functions which return them have to be used instead. The spectra are opaque handles for the internal full spectral representation of colors. There are many standard library functions, as well as operators, for operating on spectra.

At the lowest level, a script is made of function definitions and function calls. Functions have only formal input parameters and have to return exactly one value. A function definition is given by any number of assignments to temporary variables (which are assigned a value once; the value cannot be redefined), followed by an expression defining the returned value. Therefore, iterative constructions cannot be used, and are replaced by recursion. Expressions are created from operators, function calls, constants and environment variables. These variables, accessible with the $ident construction, provide access to the ray-primitive intersection data, such as the input and output directions, surface normal, intersection point and so on.

The language offers an abundance of features for making programs more concise and programming easier. To name a few: generic type support is similar to templates known from C++; function types and anonymous functions provide functionality similar to lambda expressions from the new C++200x standard; and library functions for vector and matrix operations and complex numbers enable concise representation of many mathematical concepts.

6.2.3 Execution Model and Virtual Machine API

The texturing programs, which are part of the rendered scene description, are compiled at runtime, during preprocessing just before rendering. The compiler outputs code targeted at a specialized stack based virtual machine. Although this approach provides inferior performance compared to CPU native code, we have chosen it for hardware independence. That is, if, in the future, the software is to be moved to another platform, it is enough to recompile its code, which would not be the case if native CPU instructions were explicitly generated at runtime.
During rendering, texturing program functions may be executed whenever a ray intersects the scene geometry, in order to evaluate material properties at the intersection point. The rest of this section describes two sets of functions: the language standard library, intended for use by texturing programs, and the virtual machine API, targeted at integration with ray tracing software.
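The general style of such a stack based virtual machine can be sketched as follows. This is our own illustration: the instruction set, names and structure are hypothetical, since the thesis does not specify its actual opcodes:

    #include <cmath>
    #include <vector>

    enum class Op { PushConst, Add, Mul, Sin, Ret };

    struct Instr {
        Op op;
        double value; // used by PushConst only
    };

    // Executes a compiled expression; no error handling for brevity.
    double run(const std::vector<Instr>& code) {
        std::vector<double> stack;
        for (const Instr& in : code) {
            switch (in.op) {
            case Op::PushConst:
                stack.push_back(in.value);
                break;
            case Op::Add: {
                double b = stack.back(); stack.pop_back();
                stack.back() += b;
                break;
            }
            case Op::Mul: {
                double b = stack.back(); stack.pop_back();
                stack.back() *= b;
                break;
            }
            case Op::Sin:
                stack.back() = std::sin(stack.back());
                break;
            case Op::Ret:
                return stack.back();
            }
        }
        return stack.back();
    }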

program    → (import | function | typeid | const | surface | volume | alias)*
import     → import string ;*
function   → type ident? ( formals? ) { nested definition } ;*
typeid     → typeid ident type ;*
type       → basic | enum | structure | array | functype | generic | ident
basic      → logic | integer | real | complex | spectrum | material
enum       → enum ident? { ident (, ident)* }
structure  → struct ident? { (type ident (, ident)* ;)+ }
array      → type [ ] | type [ n ] | type [ n, m ]
functype   → function formals -> type
generic    → generic ({ ident })?
formals    → formal (, formal)*
formal     → type ident (, ident)*
nested     → (function | typeid | const)*
const      → const ident = expression ;*
definition → (ident = expression ;)* return expression ;
expression → expression op expression | prefix expression | expression suffix | term
term       → ( expression ) | call | selection | constant | $ident | #ident
call       → ident (( arguments ))?
arguments  → (expression (, expression)*)?
selection  → select (expression , if expression ;)* expression otherwise ; end
constant   → true | false | integer | real | real i | string
surface    → surface (ident | string) { (ident = expression ;)* } ;*
volume     → volume (ident | string) { (ident = expression ;)* } ;*
alias      → (ident | string) alias ident ;*

Table 6.2: Simplified texturing language grammar. In this rendering of the table, the name left of → is a non-terminal symbol; keywords and punctuation on the right are terminal symbols written literally; ident, string, n and m denote complex terminal symbols, such as identifiers or real numbers. The symbol | means selection, ? means optional, * means zero or more occurrences, and + means one or more occurrences.

Standard Library

The standard library provides a rich set of functions to assist program development. These functions provide, among others, materials, images from external files, noise generation, physical equations, and mathematical functions like sin or exp.

Basic materials available in the library include, among others: matte reflective and matte translucent materials, glossy reflective and refractive materials based on microfacet and Phong models, and ideal specular reflective and refractive materials. The materials are parametrized by inputs such as color or glossiness (if applicable). Materials can be combined to form more complex ones using complex materials, available in the standard library as well. A complex material takes two materials and a combination factor as input.

The standard library provides exhaustive support for access to images from external files. These images can be stored in memory either directly or in compressed form (saving storage at the price of slower access). The images can be read with just a reconstruction filter, or with a low pass, blurring filter. Additionally, gradients of image functions can be evaluated. The images can be either monochromatic or in color; in the latter case, the result is implicitly converted to the full spectral representation. Additionally, the library provides optional conversion from the sRGB color space before the transform to full spectrum.

Because of their usefulness, the library offers a variety of noise primitives [Perlin, 1985, Ebert et al., 2002]. These functions produce real valued n-dimensional noise and gradients of noise. The noise is guaranteed to be restricted to the [−1, 1] range, to have a smooth first derivative, and to have limited frequency content. Sample results generated using the noise primitive are presented in Figure 6.6.
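To illustrate the recursion-instead-of-iteration style on the grammar of Table 6.2, here is a hypothetical fractal-sum texture function. The library function noise and the exact operator set (comparison, scalar-array multiplication) are assumptions on our part:

    real fbm(real[2] p, integer octaves) {
        return select
            noise(p), if octaves <= 1;
            noise(p) + 0.5 * fbm(2.0 * p, octaves - 1) otherwise;
        end;
    }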

Figure 6.6: Images generated using the noise primitive. From left: simple noise, sum of six noise octaves, noise as an argument of the sine function, noise as an argument of the exponential function. Each image is generated by varying some of its input as a function of image location.

Moreover, the library defines a few functions based on physical equations, namely: blackbody radiation as a function of temperature, and Fresnel equations for dielectrics and metals as functions of the indices of refraction and the angles between the incoming and outgoing rays and the surface normal. These functions can be used to improve the realism of generated images.

Virtual Machine API

In order to use texturing programs, the rendering software must compile the appropriate scripts. The compilation is divided into two parts. First, the program source is compiled into an internal graph representation; at this stage, only syntax checking is performed. An arbitrary number of scripts can be compiled independently. Second, linking of selected compiled scripts is performed to generate programs. During linking, semantic checks are performed, name dependences are resolved, and various code optimizations take place. After being successfully linked, a program is guaranteed to be error free. Finally, individual surface and volume descriptions can be generated from the program. An example of generating a surface description from scripts is presented in Algorithm 6.2.

Algorithm 6.2: Sample usage of procedural texturing.

    MMsGraph* g1 = mmsMkGraph();
    MMsGraph* g2 = mmsMkGraph();
    if (!mmsParseString(*g1, script))
        throw Error(mmsGetGraphLog(*g1));
    if (!mmsParseFile(*g2, "marble.txt"))
        throw Error(mmsGetGraphLog(*g2));
    MMsProgram* p = mmsMkProgram();
    mmsAddGraph(*p, *g1);
    mmsAddGraph(*p, *g2);
    if (!mmsLinkProgram(*p, mms::O2))
        throw Error(mmsGetProgramLog(*p));
    Surface* s = mmsGetSurface(*p, "mySurf");
    if (!s)
        throw Error("No surface description \"mySurf\" exists.");
    // ... constructed surface can be used here
    mmsDeleteProgram(p);
    mmsDeleteGraph(g2);
    mmsDeleteGraph(g1);

6.2.4 Results and Conclusion

The presented language is a very convenient tool for procedural texturing. It employs a lot of functionality dedicated to making the texturing task easier. The language is designed and optimized for cooperation with ray tracing based global illumination algorithms.

One thing to consider is whether it is better to have the programs executed by a virtual machine, or to provide a compiler for, say, x86/x64 SSEx unit native code. Compilation to native code would ensure significantly better performance; however, the large effort required to write a good compiler could be wasted if, in the future, the rendering software is moved to a different platform. Obviously, all material scripts would remain unchanged regardless of whether the language is compiled to native code or not.

Figure 6.7: Procedurally defined materials.

Figure 6.8: Mandelbrot and Julia fractals.

Figure 6.7 presents some rendering results of spheres with surfaces described by the presented language. All these scripts are simple, with just a few calls of standard library functions and mathematical operations. The Mandelbrot and Julia fractals presented in Figure 6.8, on the other hand, use some more advanced features of the language: support for complex numbers and recursion.

6.3 New Glossy Reflection Models

Modeling the reflection properties of surfaces is very important for rendering. Traditionally, in global illumination, the fraction of light which is reflected from a surface is described by the BRDF (Bidirectional Reflectance Distribution Function) abstraction. This function is defined over all scene surface points, as well as two light directions: incident and outgoing. As the name suggests, to conform to the laws of physics, all BRDFs must be symmetric, i.e. swapping the incident and outgoing directions must not change the BRDF value. Moreover, the function must be energy preserving: it cannot reflect more light than it receives.
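Restated for convenience (the exact forms are Equations 2.23 and 2.25 in Chapter 2), these two constraints are commonly written as:

    f_r(x, \omega_i, \omega_o) = f_r(x, \omega_o, \omega_i)
    \qquad \text{and} \qquad
    \int_{\Omega} f_r(x, \omega_i, \omega_o) \cos\theta_i \, \mathrm{d}\omega_i \le 1 .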

To achieve the best results when rendering with global illumination, the energy preservation of a BRDF should satisfy a stricter requirement: it is desirable that the basic BRDF model reflects exactly all the light that arrives at a surface. The actual value of reflection is then modeled by a texture. If a BRDF is unable to reflect all incident light, even a white texture appears to absorb some part of it. In local illumination algorithms this can be corrected somewhat by making the reflection value greater than one, but in global illumination such a trick can have fatal consequences due to multiple light scattering. Our model is strictly energy preserving, while it still maintains the other desirable properties.

The subsequent section gives a brief description of former research related to the BRDF concept. Next, the requirements which should be satisfied by a plausible reflection model are presented. Then the derivation of our reflection function is explained, followed by a comparison of our results with previous ones. Finally, we present a summary describing what was achieved during our research and what is left for future development.

6.3.1 Related Work

The first well known attempt to create glossy reflection is the Phong model [Phong, 1975]. This model is, however, neither symmetric nor energy conserving. An improved version of it was created by Neumann et al. [Neumann et al., 1999a, Neumann et al., 1999c]. Lafortune et al. [Lafortune et al., 1997] used a combination of generalized Phong reflection functions to fit a scattering model to measured data.

There are popular reflection models based on microfacets. Blinn [Blinn, 1977] and Cook et al. [Cook & Torrance, 1982] assumed that scattering from each individual microfacet is specular, while Oren and Nayar [Oren & Nayar, 1994] used diffuse reflection instead.

A lot of work was dedicated to anisotropic scattering models. The first well known approach is Kajiya's [Kajiya, 1985], which uses a physical model of surface reflection. Ward [Ward, 1992] presented a new technique for modeling anisotropic reflection, together with a method to measure real-world material reflectances. Walter's technical report [Walter, 2005] describes how to efficiently implement Ward's model in a Monte Carlo renderer. Ashikhmin and Shirley [Ashikhmin & Shirley, 2000] showed how to modify the Phong reflection model to support anisotropy.

Some approaches are based on physical laws. He et al. [He et al., 1991] developed a model that supports many different types of surface reflection well. Stam [Stam, 1999] used wave optics to accurately model diffraction of light. Westin et al. [Westin et al., 1992] used a different approach to obtain this goal: they employed Monte Carlo simulation of the scattering of light from surface microgeometry to obtain coefficients to be fitted into their BRDF representation. Schlick's model [Schlick, 1993], on the other hand, is purely phenomenological. It accounts for diffuse and glossy reflection, in isotropic and anisotropic versions, through a small set of intuitive parameters. Pellacini et al. [Pellacini et al., 2000] used a physically based model of reflection and modified its parameters in a way which makes them perceptually meaningful. A novel approach of Edwards et al. [Edwards et al., 2006] is designed to preserve all energy while scattering, however at the cost of a non-symmetric scattering function. A different approach was taken by Neumann et al. [Neumann et al., 1999b].
They modified the Phong model to increase its reflectivity at grazing angles as much as possible while still satisfying energy conservation and symmetry. Some general knowledge on light reflection models can be found in Lawrence's thesis [Lawrence, 2006]. More information on this topic is in the Siggraph course [Ashikhmin et al., 2001] and in the technical report of Westin et al. [Westin et al., 2004b]. Westin et al. [Westin et al., 2004a] also provided a detailed comparison of different BRDF models. Stark et al. [Stark et al., 2005] showed that many BRDFs can be expressed in a more convenient space of fewer than four dimensions (two directional vectors). Shirley et al. [Shirley et al., 1997] described some general issues which are encountered when reflection models are created.

6.3.2 Properties of Reflection Functions

In order to create visually plausible images, all reflection functions should satisfy some well defined basic requirements.

Energy conservation. In global illumination it is not enough to ensure that no surface scatters more light than it receives; it is desirable to have a function which scatters exactly all light. We are aware of only the work of Edwards et al. [Edwards et al., 2006] which satisfies this requirement, but at the high price of a lack of symmetry. Neumann et al. [Neumann et al., 1999b] improved the reflectivity of the Phong model, but its energy preservation is still not ideal.

Symmetry. The symmetry of the BRDF (see Equation 2.23) is very important when bidirectional methods (which trace rays from the viewer and from the light as well) are used. When a BRDF is not symmetric, appropriate corrections, similar to those described in [Veach, 1996], must be made in order to get proper rendering results. Since these corrections are not part of the BRDF model itself, BRDF sampling may turn out to be extremely inefficient. Obviously, the best option is to minimize the usage of non-symmetric BRDFs in Monte Carlo renderings. This is reasonable, since the majority of currently used basic BRDF models are symmetric.

Everywhere positive. If a reflection function happens to be equal to zero on part of its domain, the respective surface may potentially render to black, no matter how strong the illumination is. Having a black body on the scene is a severe artifact, which is typically mitigated by a complex BRDF with an additional additive diffuse component. However, this option produces a dull matte color and is not visually plausible. All reflection functions based on the Phong model contain a factor equal to max(cos θ, 0), where θ is the angle between the viewing direction and the ideal reflection direction. If the illumination is not perpendicular, all these BRDFs are prone to exhibit black patches.

Everywhere smooth. The human eye happens to be particularly sensitive to discontinuities in the first derivative of illumination, especially on smooth, curved surfaces. This artifact occurs in any BRDF which uses functions such as min or max. In particular, many microfacet based models use a so-called geometric attenuation factor with the min function, and look unpleasant at low glossiness values.

Limit #1 diffuse. It is very helpful in modeling if a glossy BRDF can be made just a bit more glossy than a matte surface. That is, a good quality reflection model should be arbitrarily close to matte reflection when glossiness is near zero. Surprisingly, few BRDF models satisfy this useful and easy to achieve property.

Limit #2 specular. Similarly, it is convenient if a glossy BRDF becomes close to ideal specular reflection when glossiness approaches infinity. Unfortunately, this property is much more difficult to achieve than Limit #1. First, all glossy BRDFs are able to scatter light near the ideal reflection direction, which is correct. Second, energy preservation typically is not satisfied: while at perpendicular illumination the majority of popular BRDFs are fine, whenever grazing angles are encountered, these reflection functions tend to absorb more and more light.

Ease of sampling. Having a probability distribution proportional (or almost proportional) to the BRDF value, which can be integrated and then inverted analytically, allows efficient BRDF sampling in Monte Carlo rendering. This feature is roughly satisfied by the majority of popular BRDF models.
Numerical stability. Numerical stability is an extremely important property of any computational algorithm, yet it is rarely mentioned in BRDF related works. In particular, any reflection function which denormalizes over any part of its domain is a potential source of significant inaccuracy. An example of such a case are microfacet models based on so-called halfangle vectors. The halfangle vector is the normalized componentwise sum of the viewing and illumination directions. In some cases, the halfangle vector is calculated as ω_h = [0, 0, 0]/||[0, 0, 0]||, causing a serious error (a minimal illustration is given just below).

Our work is an attempt to create a BRDF which satisfies all these conditions together.
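The degenerate halfangle case can be sketched in a few lines of C++ (our illustration; the fallback choice is arbitrary and not prescribed by any of the cited models):

    #include <cmath>

    struct Vec3 { double x, y, z; };

    // When wi == -wo, the componentwise sum is the zero vector and the
    // normalization divides by zero; a robust implementation must detect it.
    Vec3 halfangle(const Vec3& wi, const Vec3& wo) {
        Vec3 h { wi.x + wo.x, wi.y + wo.y, wi.z + wo.z };
        double len = std::sqrt(h.x * h.x + h.y * h.y + h.z * h.z);
        if (len < 1e-12)
            return Vec3 { 1.0, 0.0, 0.0 }; // arbitrary fallback direction
        return Vec3 { h.x / len, h.y / len, h.z / len };
    }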

6.3.3 Derivation of Reflection Function

In this section a detailed derivation of the new reflection model is presented. Since this model is purely phenomenological, all mathematical functions chosen for it are selected just because of the desirable properties they have. This particular choice has no physical basis and, of course, is not unique. Throughout the rest of this section the notation presented in Table 6.3 is used.

Symbol   Meaning
f_r      Reflection function (BRDF)
R        Reflectivity of BRDF
ω_i      Direction of incident light
ω_o      Direction of outgoing light
ω_r      Ideal reflection direction of outgoing light
N        Surface normal
u, v     Arbitrary orthogonal tangent directions
θ_i      Angle between ω_i and N
θ_r      Angle between ω_r and N
φ_i      Angle between ω_i and u
φ_r      Angle between ω_r and u
Ω        Hemisphere above surface, BRDF domain

Table 6.3: Notation used in BRDF derivation.

By convention, all direction vectors are in Ω, i.e. the cosine of the angle between any of them and N is non-negative. Moreover, these vectors are of unit length.

Energy Conservation

Energy conservation requires that the reflectivity of the BRDF must not be greater than one (see Equation 2.25), and it is desirable for it to be equal to one. The reflectivity can be expressed in a different domain; the following expression is used throughout the rest of this section:

    R(\theta_o, \phi_o) = \int_0^{\pi/2} \int_0^{2\pi} f_r(\theta_i, \phi_i, \theta_o, \phi_o) \cos\theta_i \sin\theta_i \, d\phi_i \, d\theta_i .    (6.2)

It is very useful if the reflection function f_r can be separated into a product:

    f_r(\theta_i, \phi_i, \theta_o, \phi_o) = f_\theta(\theta_i, \theta_o) \, f_\phi(\theta_i, \phi_i, \theta_o, \phi_o) ,    (6.3)

where f_θ is the latitudal reflection and f_φ the longitudal reflection. If f_φ integrates to one regardless of θ_i and θ_o, this separation significantly simplifies the reflectivity evaluation, which now can be re-expressed as:

    R(\theta_o, \phi_o) = \int_0^{\pi/2} \left[ \int_0^{2\pi} f_\phi(\theta_i, \phi_i, \theta_o, \phi_o) \, d\phi_i \right] f_\theta(\theta_i, \theta_o) \cos\theta_i \sin\theta_i \, d\theta_i ,    (6.4)

and energy conservation as:

    \int_0^{2\pi} f_\phi(\theta_i, \phi_i, \theta_o, \phi_o) \, d\phi_i \le 1
    \quad \text{and} \quad
    \int_0^{\pi/2} f_\theta(\theta_i, \theta_o) \cos\theta_i \sin\theta_i \, d\theta_i \le 1 .    (6.5)

Due to this feature, the latitudal and longitudal reflection functions can be treated separately.

Latitudal Reflection Function

The domain of the latitudal function is very inconvenient due to the sine and cosine factors in the integrand:

    R_\theta(\theta_o) = \int_0^{\pi/2} f_\theta(\theta_i, \theta_o) \cos\theta_i \sin\theta_i \, d\theta_i .    (6.6)

However, substituting x = cos²θ_i and y = cos²θ_o, so that dx = −2 sinθ_i cosθ_i dθ_i, leads to a much simpler expression for the reflectivity (the constant factor is absorbed into f_θ):

    R_y(y) = \int_0^1 f_\theta(x, y) \, dx .    (6.7)

Despite being much simpler, this space is still not well suited for developing a reflection function, mainly because of the necessity of symbolic integration. Using a final transformation we obtain:

    F_\theta(x, y) = \int_0^y \int_0^x f_\theta(s, t) \, ds \, dt
    \quad \text{and} \quad
    f_\theta(x, y) = \frac{\partial^2 F_\theta(x, y)}{\partial x \, \partial y} .    (6.8)

Designing a function F_θ is much easier than f_θ. The requirements that F_θ must satisfy are the following:

    \forall_{x,y} \; F_\theta(x, y) = F_\theta(y, x)    (6.9)
    \forall_x \; F_\theta(x, 1) = x    (6.10)
    x_1 \le x_2 \;\Rightarrow\; F_\theta(x_1, y) \le F_\theta(x_2, y)    (6.11)

The requirement (6.10) can be relaxed a bit. If it is not satisfied, it is enough if F_θ(1, 1) = 1 and F_θ(0, 1) = 0 are satisfied instead. In the latter case, applying:

    x' = F^{-1}(x, 1) \quad \text{and} \quad y' = F^{-1}(1, y)    (6.12)

guarantees that F_θ(x', y') satisfies the original requirements (6.9-6.11). A matte BRDF in this space is expressed as F_θ = xy. We have found that the following (unnormalized) function is a plausible initial choice for latitudal glossy reflection:

    f_\theta(x, y) = \operatorname{sech}^2\!\big( n(x - y) \big) .    (6.13)

Transforming this equation into F_θ space leads to:

    F_\theta(x, y) = \frac{\ln\cosh(nx) + \ln\cosh(ny) - \ln\cosh\big(n(x - y)\big)}{2 \ln\cosh n} .    (6.14)

This function satisfies only the relaxed requirements, so it is necessary to substitute:

    x' = \frac{1}{n} \operatorname{artanh}\!\left( \frac{1 - e^{-2\ln(\cosh n)\,x}}{\tanh n} \right)    (6.15)

for x, and the analogous expression for y. After the substitution and transformation back to f_θ space we obtain:

    f_\theta(x, y) = \frac{m \tanh^2\! n \; e^{-m(x+y)}}{\big( \tanh^2\! n - (1 - e^{-mx})(1 - e^{-my}) \big)^2} ,    (6.16)

where m = 2 ln cosh n. Finally, x = cos²θ_i and y = cos²θ_r should be substituted. Considering how complex the final expression is, it is clear why it is difficult to guess the form of a plausible reflection function directly, and how useful these auxiliary spaces are.
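The step from (6.13) to (6.14) is easy to verify; the following check is ours, not part of the original text. Since the single-variable terms of (6.14) vanish under the mixed partial derivative of (6.8), differentiation recovers (6.13) up to its normalization constant:

    \frac{\partial^2 F_\theta(x, y)}{\partial x \, \partial y}
    = \frac{\partial^2}{\partial x \, \partial y}
      \left[ \frac{-\ln\cosh\big(n(x - y)\big)}{2 \ln\cosh n} \right]
    = \frac{n^2 \, \operatorname{sech}^2\big(n(x - y)\big)}{2 \ln\cosh n} .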

Longitudal Reflection Function

The longitudal reflection function should be a function of cos(φ_i − φ_r). It has to integrate to one over [−π, π], so it is reasonable to choose a function that can be integrated analytically:

    f_\phi(\phi_i, \phi_r) = \frac{C_n}{\big[ n(1 - \cos(\phi_i - \phi_r)) + 1 \big]^6} ,
    \quad \text{where} \quad
    C_n = \frac{(2n + 1)^{5.5}}{2\pi P_5(n)} ,    (6.17)

and P_5(n) = 7.875n⁵ + 21.875n⁴ + 25n³ + 15n² + 5n + 1. When n = 0, the function becomes constant. When n increases, the function is largest when φ_i = φ_r. In the limit, when n approaches infinity, the function converges to δ(φ_i − φ_r). There is still one issue: whenever either ω_i or ω_r is almost parallel to N, φ_i or φ_r is poorly defined. In fact, in these cases, the function should progressively become constant. The simple substitution n' = n sin θ_i sin θ_r works fine.

Reflection Model

Combining the latitudal and longitudal scattering functions leads to the final BRDF:

    f_r(\theta_i, \phi_i, \theta_o, \phi_o) =
    \frac{(2 n_\phi \sin\theta_i \sin\theta_r + 1)^{5.5}}
         {2\pi P_5(n_\phi \sin\theta_i \sin\theta_r) \,
          \big( n_\phi \sin\theta_i \sin\theta_r (1 - \cos(\phi_i - \phi_r)) + 1 \big)^6}
    \cdot
    \frac{m_\theta \tanh^2\! n_\theta \; e^{-m_\theta(\cos^2\theta_i + \cos^2\theta_r)}}
         {\big( \tanh^2\! n_\theta - (1 - e^{-m_\theta \cos^2\theta_i})(1 - e^{-m_\theta \cos^2\theta_r}) \big)^2} .    (6.18)

The parameters n_θ and n_φ do not have to satisfy n_θ = n_φ = n. Using various functions of the form n_θ = f₁(n) and n_φ = f₂(n) leads to a variety of different single parameter glossy scattering models. The reflection angles (θ_r and φ_r) may be computed from the outgoing angles θ_o and φ_o in a few ways: for example, ideal reflection, ideal refraction or backward scattering can be used for this purpose, leading to a variety of useful BRDFs.

The reflection model is strictly energy preserving, so the cosine weighted BRDF forms the probability density functions (pdfs) from which θ_i and φ_i are sampled. Obviously, both pdfs are integrable analytically, which is very helpful. Care must be taken, since the probability of selecting a given direction vector ω_i is then defined over the projected solid angle around N, instead of the ordinary solid angle. However, a probability density defined over the projected solid angle is often more convenient for use in ray tracing algorithms than one over the ordinary solid angle.

6.3.4 Results and Conclusions

The following results are generated using a white sphere and a complex dragon model illuminated by a point light source. The proportions of latitudal and longitudal gloss are set to n_θ = n and n_φ = 0.75n (with n_φ additionally scaled by sin θ_i sin θ_r, as in Equation 6.18).

Figure 6.9 examines how selected well-known scattering models cope with little glossiness. Both Phong-based models expose zero reflectivity for certain directions, while the max-Phong and microfacet models have shading discontinuities due to the usage of max or min functions. Neither of these models is fully energy conserving. Figure 6.10 examines the latitudal component of our reflection model; the scattering is increased at grazing angles to achieve full energy conservation. Similarly, Figure 6.11 presents longitudal scattering only. Figures 6.12 and 6.13 present our BRDF model, defined as a product of latitudal and longitudal scattering. Figure 6.12 shows how the BRDF behaves when glossiness is increased, while Figure 6.13 varies the illumination angle at the same glossiness.

Figure 6.9: Comparison of different glossy BRDFs with gloss just a bit more than matte. From left: diffuse reference, reciprocal Phong, max-Phong, microfacet.

Figure 6.10: Latitudal scattering only. From left: glossiness just a bit more than matte, medium glossiness, large glossiness, similarity between θ_i and θ_r.

Figure 6.11: Longitudal scattering only with varying glossiness.

Figure 6.12: Product of latitudal and longitudal scattering with increasing glossiness.

The BRDF exhibits some anisotropy at non-perpendicular illumination, but this is not a problem with complex models. Figure 6.14 shows a dragon model rendered with our BRDF at two different glossiness settings.

We have presented a novel approach to creating BRDFs, for which we have designed an energy preserving, symmetric reflection function. Energy conservation allows improved rendering results.

Figure 6.13: Scattering with perpendicular (left) and grazing (right) illumination.

Figure 6.14: Complex dragon model rendered with glossiness n = 2 (left) and n = 4 (right).

For example, when a model rendered with our function is placed into an environment with uniform illumination, it vanishes. The majority of other models, on the other hand, lose some energy, especially at grazing angles; in this case, models have dark borders, which are impossible to control. Moreover, our BRDF behaves intuitively: when glossiness is decreased, it progressively becomes matte. The function, however, still has some flaws. Most notably, it has difficulty in controlling anisotropy at non-perpendicular illumination, visible on very smooth surfaces. Secondly, it becomes numerically unstable when glossiness is increased. Minimizing the impact of these disadvantages requires further research in this area.
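To make the formulas concrete, here is a minimal, unoptimized evaluation sketch of Equation 6.18. This is our own illustration, not the thesis' actual implementation; in particular, the numerically difficult regions (very small or very large glossiness) are not handled:

    #include <cmath>

    static const double PI = 3.14159265358979323846;

    // P5 polynomial from the longitudal normalization (Equation 6.17).
    static double P5(double n) {
        return ((((7.875 * n + 21.875) * n + 25.0) * n + 15.0) * n + 5.0) * n + 1.0;
    }

    // thetaI/phiI: incident angles; thetaR/phiR: ideal reflection angles.
    double evalBrdf(double nTheta, double nPhi,
                    double thetaI, double phiI, double thetaR, double phiR) {
        // Longitudal factor, with n_phi scaled by sin(theta_i) sin(theta_r).
        double ns = nPhi * std::sin(thetaI) * std::sin(thetaR);
        double c = std::pow(2.0 * ns + 1.0, 5.5) / (2.0 * PI * P5(ns));
        double fPhi = c / std::pow(ns * (1.0 - std::cos(phiI - phiR)) + 1.0, 6.0);

        // Latitudal factor (Equation 6.16) in x = cos^2, y = cos^2 space.
        double m = 2.0 * std::log(std::cosh(nTheta));
        double t2 = std::tanh(nTheta) * std::tanh(nTheta);
        double x = std::cos(thetaI) * std::cos(thetaI);
        double y = std::cos(thetaR) * std::cos(thetaR);
        double d = t2 - (1.0 - std::exp(-m * x)) * (1.0 - std::exp(-m * y));
        double fTheta = m * t2 * std::exp(-m * (x + y)) / (d * d);

        return fPhi * fTheta;
    }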

Chapter 7
Results

This chapter summarizes the most important results of the research on which the thesis is based. First, selected aspects of image quality assessment and various numerical metrics are discussed. These metrics are used later in this chapter to compare images generated by the algorithms presented in the thesis with alternative, commonly available methods. The comparison is provided for the novel full spectral rendering (see Section 4.3) and the combined light transport algorithm (described in Section 4.5). Finally, since the parallelization of light transport algorithms does not affect image quality, there is no need to provide an image comparison for it; the parallelization efficiency is given in the section devoted to parallel implementation.

7.1 Image Comparison

Before performing any reasonable assessment of rendering algorithm quality, and therefore of the quality of images produced by the algorithm, the method used to compare images and the reference image must be precisely defined. In the case of geometric optics simulation, the best possible reference image would be a photograph of a real world scene. Unfortunately, to make the comparison of rendering output with photography meaningful, the input data for rendering, e.g. scene geometry, surface reflectance, light source radiance and so on, have to be precisely measured. Without a laboratory very well equipped with specialized measuring devices, this task is impossible. In the majority of research related to rendering, a much simpler, yet still reasonable approach is used. The reference image is produced with a well-known and proven correct reference algorithm (for example, Bidirectional Path Tracing), with an extremely long rendering time, to minimize potential errors in the reference image. Since BDPT is unbiased, when the image exhibits no visible variance, one can be extremely confident that the image is correct. Then the tested algorithms are run on the same input, and the resulting images are compared with the reference one. This method is able to catch all rendering errors, except those that arise from the assumption that geometric optics is applicable.

The method used to compute the difference between the test image and the reference image is very important. There has been some research into measuring image differences, e.g. [Wilson et al., 1997]. These methods take into account various aspects of the human visual system. However, since there is no widely used and accepted sophisticated metric for image comparison, we have employed the standard RMS (Root Mean Square) norm:

    d(I_1, I_2) = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \big( p_i^1 - p_i^2 \big)^2 } ,    (7.1)

where d(I_1, I_2) is the distance between images I_1 and I_2, N is the number of pixels in the images, and p_i^j is the value of the i-th pixel of the j-th image, normalized to the [0, 1] range. The norm (7.1) is, obviously, imperfect, but it seems to be the most often used. The norm is defined for grayscale images. For RGB ones, the norm can be evaluated for each channel separately and summed using sRGB-to-luminance weights, i.e. d_RGB = 0.21 d_R + 0.72 d_G + 0.07 d_B.
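A minimal sketch of Equation 7.1 (our illustration; it assumes both images are given as arrays of n pixel values normalized to [0, 1]):

    #include <cmath>
    #include <cstddef>

    // RMS distance between two grayscale images of n pixels each.
    double rmsDistance(const double* a, const double* b, std::size_t n) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return std::sqrt(sum / static_cast<double>(n));
    }

For RGB images the same function would be applied per channel and the three distances combined with the luminance weights given above.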

7.2 Full Spectral Rendering

This section provides a numerical comparison of the convergence of Bidirectional Path Tracing equipped with the standard and with the Multiple Importance Sampling based full spectral rendering. The reference image of the test scene is presented in Figure 7.2. The numerical comparison using the norm (7.1) is performed on a rectangle containing the glass figure (see Figure 7.1), and its results are presented in Table 7.1.

Figure 7.1: Comparison of spectral rendering algorithms. Left: Multiple Importance Sampling based full spectrum rendering, 1 sample/pixel. Middle: single spectral sample per cluster rendering, 1 sample/pixel. Right: reference image, 256 samples/pixel.

            Number of samples
Algorithm   1M   2M   4M   8M   16M
MIS
basic

Table 7.1: Comparison of convergence of spectral sampling techniques used with Bidirectional Path Tracing. The table contains the differences between the reference image and images rendered by a given algorithm with a given number of samples.

7.3 Comparison of Rendering Algorithms

This section provides a numerical comparison of the convergence of four light transport algorithms: Path Tracing, Bidirectional Path Tracing, Photon Mapping and our new combined algorithm.

The reference image of the test scene is presented in Figure 7.3. The results of the comparison using the norm (7.1) are presented in Table 7.2.

            Rendering time
Algorithm   15 sec   1 min   4 min   16 min
PT
BDPT
PM
Combined

Table 7.2: Comparison of convergence of selected rendering algorithms. The table contains the differences between the reference image and images rendered by a given algorithm in a given time. The difference is evaluated using the RMS norm; the results are scaled by a factor f = 10².

Figure 7.2: Full spectral rendering of a scene with imperfect refraction on glass. The image is rendered with Bidirectional Path Tracing, using 0.5G samples, at 1920x1080 resolution, in 3h 27min on a 2.93GHz Intel Core i7 CPU. In the bottom right corner of the image there is a miniature showing the estimated relative variance of various parts of the main image.

Figure 7.3: Scene containing indirectly visible caustics, rendered with the combined light transport algorithm presented here, using 64M samples, up to 8M photons, at 1920x1080 resolution, in 1h 50min on a 2.93GHz Intel Core i7 CPU. In the bottom right corner of the image there is a miniature showing the estimated relative variance of various parts of the main image.


More information

- Volume Rendering -

- Volume Rendering - Computer Graphics - Volume Rendering - Pascal Grittmann, Jaroslav Křivánek Using pictures from: Monte Carlo Methods for Physically Based Volume Rendering; SIGGRAPH 2018 Course; Jan Novák, Iliyan Georgiev,

More information

Introduction. Lighting model Light reflection model Local illumination model Reflectance model BRDF

Introduction. Lighting model Light reflection model Local illumination model Reflectance model BRDF Shading Introduction Affine transformations help us to place objects into a scene. Before creating images of these objects, we ll look at models for how light interacts with their surfaces. Such a model

More information

Korrigeringar: An introduction to Global Illumination. Global Illumination. Examples of light transport notation light

Korrigeringar: An introduction to Global Illumination. Global Illumination. Examples of light transport notation light An introduction to Global Illumination Tomas Akenine-Möller Department of Computer Engineering Chalmers University of Technology Korrigeringar: Intel P4 (200): ~42M transistorer Intel P4 EE (2004): 78M

More information

Biased Monte Carlo Ray Tracing:

Biased Monte Carlo Ray Tracing: Biased Monte Carlo Ray Tracing: Filtering, Irradiance Caching and Photon Mapping Dr. Henrik Wann Jensen Stanford University May 24, 2001 Unbiased and consistent Monte Carlo methods Unbiased estimator:

More information

Global Illumination. CSCI 420 Computer Graphics Lecture 18. BRDFs Raytracing and Radiosity Subsurface Scattering Photon Mapping [Ch

Global Illumination. CSCI 420 Computer Graphics Lecture 18. BRDFs Raytracing and Radiosity Subsurface Scattering Photon Mapping [Ch CSCI 420 Computer Graphics Lecture 18 Global Illumination Jernej Barbic University of Southern California BRDFs Raytracing and Radiosity Subsurface Scattering Photon Mapping [Ch. 13.4-13.5] 1 Global Illumination

More information

Recollection. Models Pixels. Model transformation Viewport transformation Clipping Rasterization Texturing + Lights & shadows

Recollection. Models Pixels. Model transformation Viewport transformation Clipping Rasterization Texturing + Lights & shadows Recollection Models Pixels Model transformation Viewport transformation Clipping Rasterization Texturing + Lights & shadows Can be computed in different stages 1 So far we came to Geometry model 3 Surface

More information

Philpot & Philipson: Remote Sensing Fundamentals Interactions 3.1 W.D. Philpot, Cornell University, Fall 12

Philpot & Philipson: Remote Sensing Fundamentals Interactions 3.1 W.D. Philpot, Cornell University, Fall 12 Philpot & Philipson: Remote Sensing Fundamentals Interactions 3.1 W.D. Philpot, Cornell University, Fall 1 3. EM INTERACTIONS WITH MATERIALS In order for an object to be sensed, the object must reflect,

More information

Global Illumination and the Rendering Equation

Global Illumination and the Rendering Equation CS294-13: Special Topics Lecture #3 Advanced Computer Graphics University of California, Berkeley Handout Date??? Global Illumination and the Rendering Equation Lecture #3: Wednesday, 9 September 2009

More information

Illumination & Shading: Part 1

Illumination & Shading: Part 1 Illumination & Shading: Part 1 Light Sources Empirical Illumination Shading Local vs Global Illumination Lecture 10 Comp 236 Spring 2005 Computer Graphics Jargon: Illumination Models Illumination - the

More information

Physically Realistic Ray Tracing

Physically Realistic Ray Tracing Physically Realistic Ray Tracing Reading Required: Watt, sections 10.6,14.8. Further reading: A. Glassner. An Introduction to Ray Tracing. Academic Press, 1989. [In the lab.] Robert L. Cook, Thomas Porter,

More information

Virtual Spherical Lights for Many-Light Rendering of Glossy Scenes

Virtual Spherical Lights for Many-Light Rendering of Glossy Scenes Virtual Spherical Lights for Many-Light Rendering of Glossy Scenes Miloš Hašan Jaroslav Křivánek * Bruce Walter Kavita Bala Cornell University * Charles University in Prague Global Illumination Effects

More information

Today. Participating media. Participating media. Rendering Algorithms: Participating Media and. Subsurface scattering

Today. Participating media. Participating media. Rendering Algorithms: Participating Media and. Subsurface scattering Today Rendering Algorithms: Participating Media and Subsurface Scattering Introduction Rendering participating media Rendering subsurface scattering Spring 2009 Matthias Zwicker Participating media Participating

More information

Lighting and Materials

Lighting and Materials http://graphics.ucsd.edu/~henrik/images/global.html Lighting and Materials Introduction The goal of any graphics rendering app is to simulate light Trying to convince the viewer they are seeing the real

More information

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager INTRODUCTION WHO WE ARE 3 Highly Parallel Computing in Physics-based

More information

6. Illumination, Lighting

6. Illumination, Lighting Jorg s Graphics Lecture Notes 6. Illumination, Lighting 1 6. Illumination, Lighting No ray tracing in OpenGL! ray tracing: direct paths COP interreflection: soft shadows, color bleeding. umbra, penumbra,

More information

Ray Tracing: Special Topics CSCI 4239/5239 Advanced Computer Graphics Spring 2018

Ray Tracing: Special Topics CSCI 4239/5239 Advanced Computer Graphics Spring 2018 Ray Tracing: Special Topics CSCI 4239/5239 Advanced Computer Graphics Spring 2018 Theoretical foundations Ray Tracing from the Ground Up Chapters 13-15 Bidirectional Reflectance Distribution Function BRDF

More information

Interactive Methods in Scientific Visualization

Interactive Methods in Scientific Visualization Interactive Methods in Scientific Visualization GPU Volume Raycasting Christof Rezk-Salama University of Siegen, Germany Volume Rendering in a Nutshell Image Plane Eye Data Set Back-to-front iteration

More information

Capturing light. Source: A. Efros

Capturing light. Source: A. Efros Capturing light Source: A. Efros Review Pinhole projection models What are vanishing points and vanishing lines? What is orthographic projection? How can we approximate orthographic projection? Lenses

More information

Radiance. Radiance properties. Radiance properties. Computer Graphics (Fall 2008)

Radiance. Radiance properties. Radiance properties. Computer Graphics (Fall 2008) Computer Graphics (Fall 2008) COMS 4160, Lecture 19: Illumination and Shading 2 http://www.cs.columbia.edu/~cs4160 Radiance Power per unit projected area perpendicular to the ray per unit solid angle in

More information

Global Illumination. Global Illumination. Direct Illumination vs. Global Illumination. Indirect Illumination. Soft Shadows.

Global Illumination. Global Illumination. Direct Illumination vs. Global Illumination. Indirect Illumination. Soft Shadows. CSCI 420 Computer Graphics Lecture 18 Global Illumination Jernej Barbic University of Southern California BRDFs Raytracing and Radiosity Subsurface Scattering Photon Mapping [Angel Ch. 11] 1 Global Illumination

More information

Sung-Eui Yoon ( 윤성의 )

Sung-Eui Yoon ( 윤성의 ) CS380: Computer Graphics Ray Tracing Sung-Eui Yoon ( 윤성의 ) Course URL: http://sglab.kaist.ac.kr/~sungeui/cg/ Class Objectives Understand overall algorithm of recursive ray tracing Ray generations Intersection

More information

Computer Graphics. Lecture 13. Global Illumination 1: Ray Tracing and Radiosity. Taku Komura

Computer Graphics. Lecture 13. Global Illumination 1: Ray Tracing and Radiosity. Taku Komura Computer Graphics Lecture 13 Global Illumination 1: Ray Tracing and Radiosity Taku Komura 1 Rendering techniques Can be classified as Local Illumination techniques Global Illumination techniques Local

More information

Overview. Radiometry and Photometry. Foundations of Computer Graphics (Spring 2012)

Overview. Radiometry and Photometry. Foundations of Computer Graphics (Spring 2012) Foundations of Computer Graphics (Spring 2012) CS 184, Lecture 21: Radiometry http://inst.eecs.berkeley.edu/~cs184 Overview Lighting and shading key in computer graphics HW 2 etc. ad-hoc shading models,

More information

Intro to Ray-Tracing & Ray-Surface Acceleration

Intro to Ray-Tracing & Ray-Surface Acceleration Lecture 12 & 13: Intro to Ray-Tracing & Ray-Surface Acceleration Computer Graphics and Imaging UC Berkeley Course Roadmap Rasterization Pipeline Core Concepts Sampling Antialiasing Transforms Geometric

More information

CPSC GLOBAL ILLUMINATION

CPSC GLOBAL ILLUMINATION CPSC 314 21 GLOBAL ILLUMINATION Textbook: 20 UGRAD.CS.UBC.CA/~CS314 Mikhail Bessmeltsev ILLUMINATION MODELS/ALGORITHMS Local illumination - Fast Ignore real physics, approximate the look Interaction of

More information

Lighting. To do. Course Outline. This Lecture. Continue to work on ray programming assignment Start thinking about final project

Lighting. To do. Course Outline. This Lecture. Continue to work on ray programming assignment Start thinking about final project To do Continue to work on ray programming assignment Start thinking about final project Lighting Course Outline 3D Graphics Pipeline Modeling (Creating 3D Geometry) Mesh; modeling; sampling; Interaction

More information

To Do. Advanced Computer Graphics. Course Outline. Course Outline. Illumination Models. Diffuse Interreflection

To Do. Advanced Computer Graphics. Course Outline. Course Outline. Illumination Models. Diffuse Interreflection Advanced Computer Graphics CSE 163 [Spring 017], Lecture 11 Ravi Ramamoorthi http://www.cs.ucsd.edu/~ravir To Do Assignment due May 19 Should already be well on way. Contact us for difficulties etc. This

More information

Irradiance Gradients. Media & Occlusions

Irradiance Gradients. Media & Occlusions Irradiance Gradients in the Presence of Media & Occlusions Wojciech Jarosz in collaboration with Matthias Zwicker and Henrik Wann Jensen University of California, San Diego June 23, 2008 Wojciech Jarosz

More information

Spectral Color and Radiometry

Spectral Color and Radiometry Spectral Color and Radiometry Louis Feng April 13, 2004 April 13, 2004 Realistic Image Synthesis (Spring 2004) 1 Topics Spectral Color Light and Color Spectrum Spectral Power Distribution Spectral Color

More information

Reflection and Shading

Reflection and Shading Reflection and Shading R. J. Renka Department of Computer Science & Engineering University of North Texas 10/19/2015 Light Sources Realistic rendering requires that we model the interaction between light

More information