Large fused GPU volume rendering


LiU-ITN-TEK-A--08/108--SE

Stefan Lindholm

Master's thesis in Scientific Visualization at the Institute of Technology, Linköping University
Supervisor: Gianluca Paladini
Examiner: Anders Ynnerman

Department of Science and Technology (Institutionen för teknik och naturvetenskap)
Linköping University, SE-601 74 Norrköping, Sweden

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page.

Stefan Lindholm

Abstract

This master's thesis describes the underlying theory and implementation of a fused GPU volume rendering pipeline. The open source framework of XIP, largely developed at Siemens Corporate Research, is extended with fusion capabilities through a Binary Space Partitioning approach. Regions in the intersection pattern of multiple volumes are identified and subsequently rendered using either Texture Slicing or Raycasting in a cell-based fashion. Results demonstrate interactive frame rates for reasonable scenes and are encouraging, as the implementation can be extended by several key acceleration methods.

Acknowledgments

I would like to thank my supervisor Gianluca Paladini and the team at Imaging and Visualization at Siemens Corporate Research for their support and expertise. Special thanks also go to Anders Ynnerman, my examiner, as well as Andres Sievert, Veronica Giden and Patric Ljung for their help in the creation of this thesis.

Contents

1 Introduction
    Introduction
    Aim
    Outline
    Limitations
    Background
    Introduction
    The GPU
    Direct Volume Rendering
    Fusion in DVR
    Large Data Sets in DVR and Fusion
2 Theory and Design
    Problem Description
    Approach
    Binary Space Partitioning
    Geometrical Homogeneity and Complete Cells
    Storage Structures
    Design
    Volume Representation
    Fusion Overview
3 Implementation
    Fusion Module
    Generating Region Representations
    On Plane Threshold
    Partition Plane Selection
    Render Module
    Render Queue
    Cell Based Rendering
    Rendering Methods
    Instantiated Shaders
    Extra Functionality
    Clip Planes and Early Ray Termination
    Fusion with Simple Convex Mesh Geometry
4 Results
    Storage Structures
    Bsp-tree Generation and Complexity
    Partition Plane Selection Methods
    Real Application Impact
    Proxy Geometry
    Fusion Module Results
    Render Module Results
5 Conclusion and Possibilities
    Conclusion
    Known Limitations
    Future Research
Bibliography
A Vocabulary
B Partitioning

Chapter 1 Introduction

This thesis concerns the rendering of multiple overlapping volumetric data sets in a computer graphics (CG) environment. The first sections below provide an introduction to the thesis and to the field of medical visualization. They are followed by a background section that gives an overview of existing methods and techniques for accomplishing this visualization.

1.1 Introduction

Medical imaging and the visualization of medical data are an important part of many medical fields, ranging from medical trials to surgical planning. Data can be acquired by a wide selection of methods, such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT), while ever more intricate techniques are constantly being developed for visualizing this data. Independent of specific visualization techniques, more refined approaches are also being devised that involve multiple data sources at the same time. From a medical point of view, this simultaneous visualization of multiple volumes can contribute to a greater sense of context or highlight relational aspects of different data types. Examples include x-ray scans of specific organs placed within a transparent body, blood flow visualized through a model of a heart, and regional activity visualized in the context of a human brain.

1.1.1 Aim

The aim of this thesis is to account for the functionality required to perform fused visualization of multiple volumes and to describe the implementation of this functionality done at Siemens Corporate Research (SCR) in Princeton, USA. It also aims to highlight difficulties in, and possible solutions for, fused rendering, both in a general sense and in relation to specific rendering methods.

1.1.2 Outline

The first sections of chapter 1 are dedicated to ensuring a necessary level of knowledge regarding visualization in general and the specifics of fused rendering. A full problem description, presentations of certain approaches and a design overview are given in chapter 2. Chapter 3 deals with the implementation done at SCR, while results and a subsequent discussion can be found in chapters 4 and 5. A vocabulary of abbreviations and terms can be found in appendix A.

1.1.3 Limitations

This thesis is not intended as a complete review of, or comparison between, sampling methods or their variations, since not enough time has been available to perform the necessary testing. The fact that such reviews are bound to be application specific also puts them out of scope for this thesis. Decisions regarding rendering specifics, such as the choice of sampling scheme, have been made on a general basis or for debugging purposes during implementation. No rendering methods other than Texture Slicing and Raycasting are discussed, as they are dominant within DVR and no other methods were desired by Siemens.

1.2 Background

This section aims to present some of the cornerstones in the areas of GPUs, DVR and fusion. The intention is not to provide a complete knowledge base but to highlight those parts that are required to understand the problem and the solution in this thesis. The presentation starts at the level of GPUs and volume rendering. Readers unfamiliar with the underlying fields of computer science and CG might benefit from additional background reading, while experienced readers can focus on the later sections of this chapter. Real Time Volume Graphics [Engel et al., 2006] is highly recommended as a source of knowledge on a wide selection of topics concerning DVR. For the work discussed in this thesis and the following text, relevant parts include the underlying theoretical background, the introduction to the GPU, key algorithms and higher level optimizations (chapters 1, 2, 3, 7, 8 and 17 in particular). For the underlying knowledge in CG the reader is referred to Computer Graphics: Principles and Practice [Foley et al., 1996].

1.2.1 Introduction

For further discussion about the possibilities and shortcomings of fusion in computer graphics it is important to understand a few key things: the general problem that is being solved, the platform used for the solution, and how the problem is represented on that platform. The following subsections contain an overview of the hardware, followed by a theoretical presentation of the rendering problem and its role in volume visualization. Several key components that are vital or necessary in performing this hardware rendering are presented as either general (e.g. rendering methods and transfer functions) or fusion specific (e.g. segmented rays and sampling schemes).

1.2.2 The GPU

While initially created to offload the CPU by taking over most graphics related operations, the GPU is nowadays utilized as a secondary processing unit in many applications. It constitutes the platform of operation, and its architecture affects how the problem of volume rendering is solved. In short, the GPU can be thought of as a cluster of processing cores. These are called multiprocessors and operate independently and in parallel, solving identical problems over a series of inputs. Within each multiprocessor, several execution units work in unison, instruction by instruction, on a single input per unit. See figure 1.1.

Figure 1.1. Multiprocessors work independently and in parallel. Execution units within a multiprocessor work in unison and in parallel.

The GPU pipeline can be programmed through small user defined programs commonly called shaders. The main inputs and outputs of these shaders are passed via textures and render buffers, whereas some meta data can be made available in the shaders through so-called uniform variables, or simply uniforms.

While enjoying the speedups made possible by highly specialized hardware and the parallel design described above, GPUs also have their disadvantages. One highly limiting factor is the inefficient execution of if-clauses. In a worst case scenario within a single processing core, all if-clause branches are executed for all pixel fragments [Fernando, 2004]. This behavior is inherent in the architecture of the GPU and is a result of the high level of parallel optimization, which requires units within the same core to execute the same instruction simultaneously. The resulting time penalty can be avoided if the triggering of fragments and the usage of conditionals are such that the fragments of all units within a single core are guaranteed to take the same branch. Another shader related drawback is that some advanced algorithms have to be split over two separate shaders, possibly resulting in costly state changes. Raycasting on arbitrary convex polyhedra is an example of such an algorithm, where different steps in the algorithm depend on each other.

1.2.3 Direct Volume Rendering

Direct Volume Rendering (DVR) is volume rendering without intermediate surface representations. Other methods of visualization, such as standard surface rendering and 3D graphics, contain stages where geometrical representations are computed as an initial step and visual properties are calculated at surface level [Engel et al., 2006]. The difference is that the proxy geometry used for DVR exists only to trigger the rendering rather than as a representation in itself, and that visual properties are calculated in the interior of the volumetric data.

Original Problem of DVR

Figure 1.2. Stages of the information flow in DVR, from patient to pixel values.

Regardless of what rendering method is applied, the literature [Preim and Bartz, 2007], [Engel et al., 2006], [Hansen and Johnson, 2005] states that while the origins of DVR lie in physical models of light transportation, in practice it corresponds to an accumulation of color contributions from a series of volumetric data samples. In theory, the underlying function is a continuous integral describing the transportation of light through a medium along a certain path, see figure 1.2(a). In CG this integral is approximated with a discrete sum, evaluated as an iterative sequence of small contribution calculations along the view direction through the scene, figure 1.2(e). To account for absorption, the color contributions have to be composited sequentially with appropriate blending, using equation 1.1 or 1.2 depending on traversal direction. With the GPU as the platform, the accumulation is typically performed into an off screen buffer or directly into the framebuffer, with one integral being approximated per pixel. These integrals are the general problem we are trying to solve.

Front to back:

$$
\begin{cases}
C_{out} \leftarrow C_{in} + (1 - \alpha_{in})\, C_i \\
\alpha_{out} \leftarrow \alpha_{in} + (1 - \alpha_{in})\, \alpha_i
\end{cases}
\tag{1.1}
$$

Back to front:

$$
\begin{cases}
C_{out} \leftarrow (1 - \alpha_i)\, C_{in} + C_i \\
\alpha_{out} \leftarrow (1 - \alpha_i)\, \alpha_{in} + \alpha_i
\end{cases}
\tag{1.2}
$$

Volumes as 3D Signals

Viewing the volume data as discrete three dimensional signals opens up an entire field of techniques and tools related to Digital Signal Processing. From this viewpoint, as discussed in [Engel et al., 2006] chapter 9, DVR can be seen as the reconstruction of an original signal, as illustrated in figure 1.2(a-c). This reconstruction will seldom be perfect in a real world case. (In fact, for perfect reconstruction the signal must be a continuous piecewise linear function with constant derivatives between sample points.) The reason is that the choice of interpolation technique during data sampling in CG directly translates to a reconstruction filter in signal processing. Furthermore, there is no interpolation scheme available that matches the only filter that would give a perfect reconstruction, the sinc filter. Standard tri-linear interpolation implies a tent shaped filter, and even though it is known to introduce errors it is used in all work covered by this thesis as it is supported by GPU hardware.

Rendering Methods

Two of the most widely spread methods of DVR on modern GPUs are Raycasting and Texture Slicing, figure 1.3(a) and 1.3(b).
These methods present algorithmic and practical solutions to the objectives of DVR as discussed above. In Raycasting, a hull of the desired sample area is used as proxy geometry and rendered once or twice to trigger the entry and exit points of all sample rays simultaneously. The actual sampling along these rays then takes place in the shader program executed per pixel fragment. Texture Slicing, on the other hand, relies on proxy geometry for each slice of samples, where one sample per ray is triggered by a polygonal slice of the desired sample area. If perspective projection is used without compensation, the sampling grid will be distorted. For Raycasting the grid becomes spherical with a consistent step length, while for Texture Slicing, since the slicing polygons are still planar, it is the step length that varies with screen position.

Figure 1.3. Two frequently discussed methods of using GL proxy geometry for sample acquisition [Engel et al., 2004]. (a) Raycasting. (b) Texture Slicing.

Transfer Functions and Lookup Tables

While a light transportation integral requires color contributions, most scientific data holds little information about actual optical properties or color contributions at sample points. Rather, a mapping is performed to convert the sampled scalar values (most often density values) into color contributions. This mapping is done with a transfer function (TF). Prior to the mapping, values are said to be pre-classification and are here called data samples. Post-classification is accordingly the label for values after the mapping has been performed, and these values are called color samples. It is these post-classification color samples that are accumulated in the composition equations 1.1 and 1.2.

Figure 1.4. Typical one-dimensional TF used to convert scalar data samples to color samples.

A TF is in general a continuous function that often cannot easily be represented analytically. Instead, a TF is sampled at discrete points, creating a discrete representation called a Lookup Table (LUT) that can be used for the mapping. Simply put, the data sample is used as a coordinate to access the LUT and extract a color sample. Figure 1.4 illustrates a one-dimensional LUT where values in the range [0.0, 1.0] are mapped to different colors. Typically, in CG a LUT is represented as a texture and the actual lookup is performed as a texture fetch in the GL.

Alpha Correction

$$
\tilde{\alpha}_i = 1 - (1 - \alpha_i)^{\Delta\tilde{x} / \Delta x}
\tag{1.3}
$$

Here $\Delta x$ is the reference sample distance for which the opacity $\alpha_i$ was defined and $\Delta\tilde{x}$ is the sample distance actually used. If a constant sampling frequency is maintained, then typically no inter sample weighting other than the TF lookup is performed. If the frequency is variable, or if the overall intensity should be maintained through a series of images rendered with different frequencies, a per sample weight is introduced. (This should follow from figure 1.2(e), e.g. if half of the integral was sampled twice as often.) In the literature this weight is called alpha correction and is computed as in equation 1.3. While it maintains a fairly constant overall intensity, this correction by no means removes the errors associated with discrete approximations, figure 1.2(e).

1.2.4 Fusion in DVR

Aside from the general aspects of DVR, fusion introduces additional criteria. When fusing multiple overlapping volumes in DVR, the requirements in equations 1.1 and 1.2 regarding sequential color sample composition along the viewing direction still apply. Rendering the volumes one by one and compositing the results afterwards would yield a sum whose samples are not in sequence, which invalidates the iterative discrete approximation. Fusion is the art of allowing multiple volumes to be rendered without violating these requirements.

Segmented Rays

As discussed for the software Raycaster in [Grimm et al., 2004], different segments of a sampling ray can be processed separately. This approach is here called segmented rays. The segments are blended together while care is taken not to violate the blending order of the segments themselves or of their samples. The problem thus becomes one of how to find and sample all segments along the view direction that exhibit different combinations of overlapping volumes. Although these segments can be chosen arbitrarily, in fused DVR they typically conform to the regions in the intersection pattern of the volumes. (Higher granularity can be applied for optimizations.) As illustrated in figure 1.5, this pattern and its regions correspond well to set theory. As an example, in a rendering of scene (a), with the camera in the top right corner and front to back composition, regions (b), (c) and (d) can be rendered separately in any order, while blending has to take place in the order (d), (c), (b). If composition is integrated in the rendering, as it is in most hardware CG to avoid temporary storage, then the rendering order also has to be (d), (c), (b).

Figure 1.5. To avoid wasted samples a scene can be split down to its intersections [Commons, 2008]. (a) Cu ∪ Sp. (b) Cu \ Sp. (c) Cu ∩ Sp. (d) Sp \ Cu.

Increased Complexity

As opposed to the single region of regular DVR, rendering and blending several small regions introduces an overhead in the overall rendering. This overhead stems from the necessity to keep track of the segments and their blending, and it introduces additional computational and storage requirements. Since these issues are highly method specific, further discussion is withheld until chapter 3.

Sampling Schemes

Under the assumption that sampling is costly and should be kept to a minimum, volumes should ideally be sampled according to their respective data resolution and not outside of their individual boundaries. Brute force sampling, figure 1.6(a), is hardly satisfactory in this context, where no assumptions are made about aligned volumes or related volume resolutions. A globally selected frequency, figure 1.6(b), also carries an overhead as it forces oversampling of low resolution volumes. A per region uniform sampling frequency, figure 1.6(c), limits this overhead but introduces a variable sample frequency within individual volumes. Since such a variation alters the approximation error illustrated in figure 1.2(f), artifacts such as incorrect and abrupt color changes between neighboring pixels can appear. Opacity correction alone is not sufficient to avoid this problem. Interleaved sampling, figure 1.6(d), was discarded by [Rösler et al., 2006] for introducing unspecified artifacts and was likewise deemed limited in [Plate et al., 2007] due to opacity errors.
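The segmented-ray composition and the alpha correction discussed above can be illustrated with a small CPU-side sketch. It is not the implementation described in this thesis: colors are reduced to a single opacity-weighted scalar, the sample values are made up for illustration, and the function names are hypothetical. It shows that blending segments in front to back order with equation 1.1 reproduces the result for the unsegmented ray, that a wrong segment order changes the color, and that equation 1.3 keeps the accumulated opacity of a homogeneous stretch unchanged when the sampling frequency is doubled.

```python
def composite_front_to_back(samples, c_in=0.0, a_in=0.0):
    """Accumulate (color, alpha) samples front to back as in equation 1.1.

    Colors are scalar, opacity-weighted intensities (an assumption made
    to keep the sketch short).
    """
    c, a = c_in, a_in
    for c_i, a_i in samples:
        c = c + (1.0 - a) * c_i
        a = a + (1.0 - a) * a_i
    return c, a


def correct_alpha(alpha, step, reference_step):
    """Alpha correction as in equation 1.3."""
    return 1.0 - (1.0 - alpha) ** (step / reference_step)


# Made-up color samples along one ray, ordered front to back.
ray = [(0.10, 0.2), (0.05, 0.1), (0.30, 0.5),
       (0.20, 0.4), (0.00, 0.0), (0.15, 0.3)]
whole = composite_front_to_back(ray)

# The same ray split into three segments, e.g. one per intersection region.
segments = [ray[0:2], ray[2:5], ray[5:6]]


def blend_segments(ordered_segments):
    """Blend segments in the given order, carrying the accumulated state."""
    c, a = 0.0, 0.0
    for segment in ordered_segments:
        c, a = composite_front_to_back(segment, c, a)
    return c, a


segmented = blend_segments(segments)        # correct order
reordered = blend_segments(segments[::-1])  # violates the blending order

# A homogeneous stretch sampled at the reference step and at half the step.
coarse = composite_front_to_back([(0.0, 0.4)] * 4)
fine_alpha = correct_alpha(0.4, step=0.5, reference_step=1.0)
fine = composite_front_to_back([(0.0, fine_alpha)] * 8)
```

Here `segmented` equals `whole`, while `reordered` ends up with a different accumulated color. The accumulated opacities of `coarse` and `fine` agree up to rounding, which covers only the homogeneous case; as stated above, the correction does not remove the approximation error for varying data.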

Figure 1.6. Sampling schemes for multiple overlapping volumes. The colored curves indicate sampled density for the different volumes along the view axis.
(a) Brute force. Sample all volumes everywhere according to the highest resolution of any single volume.
(b) Global frequency. Sample volumes within their own boundaries according to the highest resolution of any single volume.
(c) Per region frequency. Sample volume intersection regions according to the highest resolution among the volumes present in each region.
(d) Interleaved sampling. Sample all volumes independently according to their specific resolution and interleave the samples between volumes.

Composition Schemes

Figure 1.7. Composition schemes for multiple overlapping volumes. (a) Independent TF lookups per data sample. (b) Combined TF lookup for data samples.

An interesting area of research concerning the use of TFs for mapping data samples to color samples is how to use more than one data value for each lookup. Such TFs are called multidimensional TFs and can be used when fusion is present to investigate new possibilities in how to composite volumes. One such composition is to use the overlap of two volumes and display the difference between samples rather than visualizing the volume data itself. This is sometimes used with heat or flow field visualizations. Independent TF lookups are however more commonly used and are the only ones used in this work.
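The two composition schemes can be contrasted in a short sketch. Everything in it is an illustrative assumption rather than the method of this work: the two transfer functions are invented ramps, the LUT uses nearest-entry lookup instead of a filtered texture fetch, and co-located color samples are blended one after the other with equation 1.1.

```python
def build_lut(transfer_function, size=256):
    """Sample a continuous TF over [0, 1] into a table of (color, alpha)."""
    return [transfer_function(i / (size - 1)) for i in range(size)]


def classify(lut, data_sample):
    """Use the data sample as a coordinate into the LUT (nearest entry)."""
    clamped = min(max(data_sample, 0.0), 1.0)
    return lut[round(clamped * (len(lut) - 1))]


def tf_density(v):
    """Hypothetical TF: opacity rises linearly with the data value."""
    alpha = v
    return (0.8 * alpha, alpha)  # opacity-weighted scalar color


def tf_activity(v):
    """Hypothetical TF: only values above 0.5 are visible."""
    alpha = 0.5 if v > 0.5 else 0.0
    return (0.2 * alpha, alpha)


lut_density = build_lut(tf_density)
lut_activity = build_lut(tf_activity)


def blend(c_in, a_in, c_i, a_i):
    """One front to back composition step, equation 1.1."""
    return c_in + (1.0 - a_in) * c_i, a_in + (1.0 - a_in) * a_i


def fuse_independent(sample_a, sample_b):
    """Independent lookups: classify each data sample with its own LUT,
    then blend the resulting color samples (assumed order: a before b)."""
    c, a = 0.0, 0.0
    for lut, value in ((lut_density, sample_a), (lut_activity, sample_b)):
        c_i, a_i = classify(lut, value)
        c, a = blend(c, a, c_i, a_i)
    return c, a


def fuse_difference(sample_a, sample_b):
    """Combined lookup: a single lookup driven by both data samples,
    here the absolute difference between them."""
    return classify(lut_density, abs(sample_a - sample_b))
```

With these definitions, `fuse_independent(0.0, 1.0)` yields only the contribution of the second volume, while `fuse_difference(0.7, 0.7)` is fully transparent because the two samples agree.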

1.2.5 Large Data Sets in DVR and Fusion

Figure 1.8. Rubik's cube is an example of a 3x3x3 bricking.

Any data to be visualized through the GPU has to be present in VRAM at the time of rendering. In some cases, such as large volumes or a large number of volumes, this presents a problem as the required data takes up more space than is available on the GPU. One solution to this problem is to divide the volume into smaller parts, commonly called bricks, that are rendered separately so that only one such brick has to be stored on the GPU at any time. (A volume divided into bricks is commonly called a bricked volume.) Ordinary rendering methods still apply, although additional considerations have to be made regarding boundary constraints and interpolation. In fusion, it is often desired to identify intersection regions between volumes. When dealing with fusion and multiple bricked volumes, this region identification has to be carried out all the way down to brick level to ensure that no more than one brick from each overlapping volume has to be present in VRAM at any time. The significance and implications of this are discussed in chapter 2 (Partitioning Alignment).
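A minimal sketch of bricking, under stated assumptions: the brick size, the 2 bytes per voxel and the function names are chosen for illustration, and the overlap between neighboring bricks that correct interpolation at brick boundaries would require is ignored.

```python
import math
from itertools import product


def brick_layout(volume_shape, brick_shape):
    """Split a volume into bricks.

    Returns the number of bricks per axis and a list of
    (origin, size) pairs in voxels; bricks at the far boundary
    are clipped to the volume.
    """
    counts = tuple(math.ceil(v / b) for v, b in zip(volume_shape, brick_shape))
    bricks = []
    for index in product(*(range(n) for n in counts)):
        origin = tuple(i * b for i, b in zip(index, brick_shape))
        size = tuple(min(b, v - o)
                     for b, v, o in zip(brick_shape, volume_shape, origin))
        bricks.append((origin, size))
    return counts, bricks


def brick_bytes(brick_shape, bytes_per_voxel=2):
    """Memory footprint of one brick (2 bytes per voxel is an assumption)."""
    return math.prod(brick_shape) * bytes_per_voxel


# A 512 x 512 x 300 volume in bricks of 256^3 voxels.
counts, bricks = brick_layout((512, 512, 300), (256, 256, 256))

# The Rubik's cube of figure 1.8: a 3x3x3 bricking.
rubik_counts, rubik_bricks = brick_layout((3, 3, 3), (1, 1, 1))

# With region identification down to brick level, fusing three such
# volumes needs at most one brick per volume resident at a time.
resident_bytes = 3 * brick_bytes((256, 256, 256))
```

In this example the volume splits into 2 x 2 x 2 bricks, the last of which is clipped to 256 x 256 x 44 voxels, and three fused volumes would need 96 MiB of brick data resident at once instead of three full volumes.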

Chapter 2 Theory and Design

This chapter is mainly concerned with fusion, as those parts of the solution were less known beforehand. The rendering, on the other hand, was known to be implemented as Texture Slicing and Raycasting, both well documented throughout the literature. It required substantially less design work and is therefore treated directly in the implementation chapter.

2.1 Problem Description

As a part of the caBIG™ initiative, the open source eXtensible Imaging Platform (XIP) has been developed and includes a pipeline for volume rendering. In short, the task is to extend and/or rebuild this existing rendering pipeline to support fusion of multiple volumes. Modules in the pipeline should support large volumes through bricking and include the two rendering methods of Texture Slicing and Raycasting. Since XIP is open source and developed using a plug-in methodology, key demands such as modularity, simplicity and extensibility are a priority (described in more detail below). Furthermore, memory consumption and rendering speed are also prioritized, and interactive frame rates are desired.

Modularity, Simplicity and Extensibility

Volumes should be represented in such a way that more than one fusion module can operate on the same data, with volumes being included or excluded at will on a per module basis, figure 2.1 (Fusions). The same modularity should also apply to the rendering step, with different renderers operating independently against a single fusion module, figure 2.1 (Renderings). Such a shared pipeline is a priority for producing split views or piecewise outputs from the same scene. The modularity also implies simplicity through clearly defined parts, and extensibility is simplified by the fact that single pieces of the pipeline can be replaced.

Figure 2.1. Modularity in pipeline.

Confinements

Under the requirement of interaction, this thesis does not cover any fusion schemes at data level, such as pre-process re-sampling or fusion of data on disk. The idea is instead to fuse individual contributions calculated from different data sets into a single image directly at render time, rather than fusing the data itself.

2.2 Approach

Following from the problems of large data and the desired support for volume bricking, regions in the intersection pattern between volumes have to be identified. This implies some scheme to partition space. This section presents Binary Space Partitioning (BSP) and a few other key elements as a way to reach the goals and solve the stated problem.

2.2.1 Binary Space Partitioning

In short, BSP recursively subdivides space into pairs of subspaces by means of partitioning planes. All such conceptual spaces are called cells. Although the initial space can be bounded, such as a polyhedron, it does not have to be. (In fact, in this thesis the initial space is unbounded and the root node represents the full world space in CG.) The process is recursive, with all resulting subspaces in turn subdivided further, and proceeds until a specified criterion is met for the resulting cells. One of the most prominent features of BSP compared to other partitioning schemes is that the orientation of the partitioning planes can be chosen arbitrarily and does not have to be axis aligned. The following text explains why simpler schemes are not sufficient and also describes how BSP can generate a tree structure.

Partitioning Alignment

Axis aligned methods are often simple and efficient but cannot by themselves produce accurate representations of all regions unless the volumes themselves are axis aligned. As illustrated in figure 2.2, regions are often approximated rather than represented using such methods. Such inaccurate representations can lead to situations where region intersections end up completely inside an approximation. With region subdivision down to brick level, this means that several bricks per volume have to be present in VRAM at the time of rendering. (If bricks are created using Octrees, the required number can be as high as eight bricks per volume.) The Binary Space Partitioning (BSP) used in this work is a non axis aligned scheme capable of producing accurate representations of the intersections of multiple convex polyhedra.

Tree Representation

Each subspace in BSP can be represented by a node in a binary tree structure called a Bsp-tree, where each node corresponds to a cell. In this tree, all internal nodes have one positive and one negative child, named after the half spaces of the partitioning plane that they represent. The only nodes that are not associated with planes and have no children are the leaf nodes of the tree. Consequently, the space represented by any node in the tree is defined by the initial space and all planes stored in its direct ancestors, applied in top down order. It also means that for any given node, the cells of all its descendants are proper subsets of the cell in that node. See figure 2.4 (Bsp-tree) for an example and [Foley et al., 1996] for further reading.
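The tree representation described above can be sketched as follows. This is only an illustration of the data structure, not the fusion module of this thesis: the class and function names are hypothetical, planes are stored as a normal and an offset, and the partitioning planes are inserted by hand instead of being selected from polyhedra.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# A plane as (normal, offset): points x with n.x + d = 0.
Plane = Tuple[Tuple[float, float, float], float]


@dataclass
class Node:
    """A Bsp-tree node; leaf nodes carry no plane and no children."""
    plane: Optional[Plane] = None
    positive: Optional["Node"] = None  # half space with n.x + d >= 0
    negative: Optional["Node"] = None  # half space with n.x + d < 0

    def is_leaf(self):
        return self.plane is None


def signed_distance(plane, point):
    """Evaluate n.x + d (a true distance only if n has unit length)."""
    (nx, ny, nz), d = plane
    x, y, z = point
    return nx * x + ny * y + nz * z + d


def split(leaf, plane):
    """Subdivide a leaf cell: it becomes an internal node with two leaves."""
    leaf.plane = plane
    leaf.positive, leaf.negative = Node(), Node()
    return leaf.positive, leaf.negative


def locate(root, point):
    """Return the leaf cell containing the point, together with the
    ancestor planes and sides that define that cell in top down order."""
    node, path = root, []
    while not node.is_leaf():
        on_positive_side = signed_distance(node.plane, point) >= 0.0
        path.append((node.plane, "+" if on_positive_side else "-"))
        node = node.positive if on_positive_side else node.negative
    return node, path


# The root represents the full, unbounded world space.
root = Node()
pos, neg = split(root, ((1.0, 0.0, 0.0), 0.0))      # plane x = 0
pos_pos, pos_neg = split(pos, ((1.0, 1.0, 0.0), -1.0))  # non axis aligned

leaf_a, path_a = locate(root, (2.0, 2.0, 0.0))
leaf_b, path_b = locate(root, (0.25, 0.25, 0.0))
leaf_c, path_c = locate(root, (-1.0, 0.0, 0.0))
```

The length of each returned path equals the depth of the leaf, which mirrors the statement that a cell is defined by the initial space and the planes of its direct ancestors.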

14 Theory and Design (a) Uniform axis aligned partitioning (b) Non uniform axis aligned partitioning (c) Non axis aligned partitioning Figure 2.

an additional, or extended, definition of homogeneity is required for its use in the work of this thesis. Figure 2.3. Geometrical homogeneity.

Homogenous qualities in cells is an important aspect in BSP and often constitute the condition for breaking the recursive subdivision. Homogeneity as described in [Foley et al.

However, in this implementation, the subdivision is driven with the objective to find intersecting regions amongst multiple convex polyhedra.

(a) Uniform axis-aligned partitioning. (b) Non-uniform axis-aligned partitioning. (c) Non-axis-aligned partitioning.

Figure 2.2. (a) and (b) are region approximations while (c) can be said to be a representation.

Geometrical Homogeneity and Complete Cells

In addition to the common terms of space partitioning, an additional, or extended, definition of homogeneity is required for the work of this thesis.

Figure 2.3. Geometrical homogeneity. Cells (a), (b) and (c) are geometrically homogeneous and thus complete, while (d) is incomplete. None of the cells are strictly homogeneous.

Homogeneous qualities in cells are an important aspect of BSP and often constitute the condition for ending the recursive subdivision. Homogeneity as described in [Foley et al., 1996] occurs when there are no boundaries inside a cell, including boundaries towards empty space. However, in this implementation the subdivision is driven by the objective of finding intersecting regions amongst multiple convex polyhedra (footnote 3). No requirement exists to find any boundaries between these polyhedra and empty space, only between the polyhedra themselves. Strict homogeneity is therefore not required. Hence, the concept of geometric homogeneity is introduced. A cell is said to be geometrically homogeneous if all polyhedra within the cell are equal and occupy the same space, i.e. their geometrical representations, including position, are equal. This means that while a cell is non-homogeneous in a strict sense, by spanning both occupied and non-occupied space, it can still be geometrically homogeneous. It also follows that cells occupied by fewer than two polyhedra are geometrically homogeneous by definition. In figure 2.3, cells (a), (b) and (c) are all geometrically homogeneous (two of them by definition) while cell (d) requires further subdivision.

A geometrically homogeneous cell is called complete if it fulfills the criteria and is not subdivided further; thus all leaf nodes in a tree represent complete cells. Cells represented by internal nodes are always incomplete. When discussing a Bsp-tree where each cell is represented by a node, the completeness of the cell directly determines the completeness of the node.

Footnote 3: The function of these polyhedra is described further in section 3.1.

Storage Structures

As a direct implication of the desire for minimal memory usage, the two data structures Object Pools and Segmented Arrays are introduced. These structures are available for main RAM only and are not applicable to GPU memory.

Object Pools

An object pool (or resource pool) is basically a stack that stores objects which are no longer in use. Instead of having objects allocated and destroyed on demand, this scheme keeps objects passive for future use and thus avoids the cost associated with frequent allocations (footnote 4). A slight overhead is generated from managing the stack, but it is often negligible in relation to the performance gained from avoiding frequent allocations. For more information see [Kircher and Jain, 2002].

Segmented Arrays

Segmented arrays are beneficial in the same way as object pools in that they limit the number of allocations while keeping the memory footprint low, in a classic memory versus performance tradeoff. The idea is simply to segment the allocation of memory so that every time more memory is needed an entire chunk is allocated.
The memory layout can be thought of as a two-dimensional array where one row at a time is allocated. The size of these allocations can be selected depending on what kind of data the structure will hold, with an additional speed-up if the size is kept as a power of two (footnote 5). In applications with a fairly time-invariant memory footprint the allocated chunks can remain allocated for fast direct access, or otherwise be released when they are no longer used.

Footnote 4: The actual cost is OS and runtime dependent.
Footnote 5: Using bit-shift operations for index calculation on the 2D array.
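The chunked layout and the bit-shift index calculation can be illustrated with a minimal sketch. This is not the thesis implementation: the class name `SegmentedArray`, the template parameter `ChunkBits` and all member names are assumptions made for illustration, and the element type is assumed to be default constructible and copy assignable.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch of a segmented array. Storage grows one fixed-size
// chunk ("row") at a time, so existing elements never move in memory.
// The chunk size is a power of two, so an index splits into a row and a
// column with a shift and a mask instead of a division and a modulo.
template <typename T, std::size_t ChunkBits = 8>
class SegmentedArray {
public:
    static constexpr std::size_t kChunkSize = std::size_t{1} << ChunkBits;
    static constexpr std::size_t kChunkMask = kChunkSize - 1;

    // Appends a value; a whole new chunk is allocated only when needed.
    std::size_t push_back(const T& value) {
        const std::size_t row = size_ >> ChunkBits;
        if (row == chunks_.size()) {
            chunks_.push_back(std::make_unique<T[]>(kChunkSize));
        }
        chunks_[row][size_ & kChunkMask] = value;
        return size_++;
    }

    T& operator[](std::size_t i) {
        assert(i < size_);
        return chunks_[i >> ChunkBits][i & kChunkMask];
    }

    std::size_t size() const { return size_; }
    std::size_t chunk_count() const { return chunks_.size(); }

    // Forgets all elements but keeps the chunks allocated for reuse,
    // matching the "remain allocated" option for stable footprints.
    void clear() { size_ = 0; }

private:
    std::vector<std::unique_ptr<T[]>> chunks_;
    std::size_t size_ = 0;
};
```

With `SegmentedArray<int, 2>` (chunks of four elements), ten insertions allocate three chunks, and the address of the first element stays valid as further elements are appended.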

2.3 Design

Algorithm 1 describes the main steps for solving the problem of fusion. It starts with the objective of finding spatial representations of all regions in the intersection pattern of the volumes through the use of BSP. Rendering can then be performed using these region representations, involving only those volumes that occupy each region.

Algorithm 1 Fusion Pipeline
Require: Representations for all involved volumes
    Insert all volumes as original fragments in Bsp-tree root node
    Apply BSP until all leaf nodes are complete and the Bsp-tree is built
    Create proxy geometry according to camera position
    Render proxy geometry with settings according to represented volumes

Volume Representation

Before any fusion can be initiated the volumes must be represented within the framework. The solution presented uses two main descriptions, one in the pipeline and one called Volume Fragments (footnote 6) for the fusion process.

In the Pipeline

This description contains only two things per volume: a pipeline-specific index and a transformation matrix. Other volume-specific information such as resolution, transformation, storage, GL access point (footnote 7) and LUT information is maintained per volume but is not part of any encapsulating structure. The index and matrix pair populates a list used by fusion modules in the environment, while all other information is made available directly in the rendering shaders. No need was identified for more comprehensive representations in the framework. This is in contrast to the more complex representations found in [Rösler et al., 2006], called V-objects, and in [Plate et al., 2007], under the name of lenses.

Volume Fragment

The second representation is used within fusion modules and consists of a purely geometrical description of each volume, called a volume fragment or simply a fragment. These fragments are what drive the space partitioning, as described later.

Footnote 6: Not to be confused with pixel fragments in CG.
Footnote 7: Texture unit in OpenGL.

The fusion process begins with one fragment per volume, see figure 2.4 (Original Fragments). The fragments are then subdivided such that, in the end, several smaller fragments combine to form a complete volume, see figure 2.4 (Resulting Fragments). While volumes in the pipeline can be shared by multiple fusion modules, each such module has a unique set of fragments.

Volume Fragment Boundaries

Each fragment holds a list of boolean flags, with one entry per original boundary of the represented volume, signifying whether the boundary is open or closed. Initially, all boundaries of a fragment are considered open. Once a boundary polygon of a fragment ends up on, or completely outside, a plane defining the cell to which the fragment belongs, that boundary is closed. This gives a way to control a cell's geometrical homogeneity and completeness as defined in section 2.2.2: if all boundaries of all fragments in a cell are closed, the cell is complete. Consider the gray fragment in the middle of figure 2.3, belonging to the light blue square. This fragment has a list of four boundaries (top, bottom, right, left) with all but the top boundary marked closed, as those original edges are either no longer part of the fragment or positioned on partition planes.

Fusion Overview

Figure 2.4. Fusion solution with the main goal of finding spatial representations for the individual regions separated to the left.

All volumes are initially represented in the root node of a Bsp-tree with one volume fragment each, see figure 2.4 (Original Fragments). BSP is then applied to the scene one plane at a time. This recursive procedure continues until all space is divided into complete cells, with a multitude of subdivided volume fragments representing each volume, see figure 2.4 (Resulting Fragments). Since space is divided in a binary way and the plane normals are known, a view dependent order of the cells can be extracted from the Bsp-tree without any need for sorting in the render module.
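The sort-free, view dependent ordering can be sketched as a recursive traversal that always visits the half space containing the camera first. The types `Vec3`, `Plane` and `BspNode`, the `cellId` field and the function names are hypothetical and chosen for this illustration only. A front-to-back order is assumed here; swapping the two recursive calls gives back-to-front.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Vec3 { double x, y, z; };

// Plane in the form n . p + d = 0 (hypothetical representation).
struct Plane { Vec3 n; double d; };

// Hypothetical Bsp-tree node: internal nodes carry a partitioning plane,
// leaves carry the id of a complete cell.
struct BspNode {
    Plane plane{};
    int cellId = -1;
    std::unique_ptr<BspNode> front;  // positive half space
    std::unique_ptr<BspNode> back;   // negative half space
    bool isLeaf() const { return !front && !back; }
};

inline double signedDistance(const Plane& p, const Vec3& v) {
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d;
}

// Appends the leaf cells to 'out' in front-to-back order as seen from
// 'eye'. At every internal node the child on the camera's side of the
// plane is visited first, so no explicit depth sorting is required.
void collectFrontToBack(const BspNode* node, const Vec3& eye,
                        std::vector<int>& out) {
    if (!node) return;
    if (node->isLeaf()) {
        out.push_back(node->cellId);
        return;
    }
    const bool eyeInFront = signedDistance(node->plane, eye) >= 0.0;
    const BspNode* nearSide = eyeInFront ? node->front.get() : node->back.get();
    const BspNode* farSide = eyeInFront ? node->back.get() : node->front.get();
    collectFrontToBack(nearSide, eye, out);
    collectFrontToBack(farSide, eye, out);
}
```

The resulting list of cell ids corresponds to what the later chapters call a render queue: one entry per leaf, already in depth order for the given camera position.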

Chapter 3 Implementation

The two main modules, fusion and rendering, implemented in XIP are described here. All implementation is done in C++ using the Open Inventor and OpenGL APIs, while the rendering execution is shared between OpenGL and GLSL. Object Pools and Segmented Arrays are used for the management of primitives such as Bsp-tree nodes and volume fragments. In particular, the pools are used as caches for the population of primitives within specific modules, while lists of primitives and storage of geometrical data are implemented using the segmented arrays. The VTune performance analyzer was used extensively for tracing bottlenecks during development of the structures.

3.1 Fusion Module

The fusion module is a conceptual grouping of pipeline functionality that concerns the camera independent aspects of fusion and their underlying structures. Its main assignment is to divide space and to generate a representation of the resulting intersection pattern of the volumes.

Generating Region Representations

Once all volumes are represented in the root node as fragments, fusion is initiated and the recursive BSP method described previously is carried out as in algorithm 2 (footnote 1). In short, an initial check is performed for each node regarding its completeness. If the node is complete it is marked as a leaf and the recursive branch is closed. If the node exhibits fragments with open boundaries, i.e. it is incomplete, a plane is retrieved and the node is split before algorithm 2 is executed on its children.

Algorithm 2 Build node
Require: node to start recursive subdivision
    if (node complete) then
        mark node as leaf
    else
        find best partitioning plane for node
        create children and split node
        recursively run Build Node on children
    end if

Figure 3.1. Splitting a node by a specified plane comes down to a sorting and, if necessary, partitioning of all fragments within the node.

When splitting a node by a given plane, all fragments that do not directly intersect the plane are sorted amongst the child nodes, while those fragments that do intersect the plane are in turn split by the same plane. See figure 3.1 for an illustration where the center fragment is split while the two fragments to the left and right are sorted. This scheme is repeated on every level, from fragments through polygons down to individual lines and vertices, see appendix B algorithms 9 and 10.

For efficient polygon clipping the implementation is based on the Sutherland-Hodgman algorithm as described in [Foley et al., 1996]. This algorithm traverses the vertices of a polygon in CW or CCW order while keeping track of all vertices on the positive side of the plane, and creates new vertices at any intersections between an edge and the plane. The algorithm is also extended to support splitting of polygons, storing both resulting halves. Furthermore, the requirement on all levels that primitives be convex is fulfilled by definition, as all entities resulting from a split of any convex entity by a plane are known to be convex. However, care must be taken to close the geometrical representations of the pieces created in the partitioning. In the case of polyhedra this means adding a polygon at the place of the intersection to close the hull.

Footnote 1: For a detailed version see appendix B algorithm 11.
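The extension of Sutherland-Hodgman clipping that keeps both halves can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code from appendix B: the names `Vec3`, `Plane`, `Polygon3`, `interpolate` and `splitPolygon` are hypothetical, the polygon is assumed convex with vertices in consistent winding order, and no on-plane threshold is applied.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };
struct Plane { Vec3 n; double d; };  // n . p + d = 0
using Polygon3 = std::vector<Vec3>;  // convex, vertices in CW or CCW order

inline double signedDistance(const Plane& p, const Vec3& v) {
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d;
}

inline Vec3 interpolate(const Vec3& a, const Vec3& b, double t) {
    return {a.x + t * (b.x - a.x), a.y + t * (b.y - a.y), a.z + t * (b.z - a.z)};
}

// Splits a convex polygon by a plane and stores both halves. Vertices
// lying exactly on the plane are added to both halves. Callers are
// expected to split only polygons that really straddle the plane;
// otherwise one of the outputs is empty or degenerate.
void splitPolygon(const Polygon3& poly, const Plane& plane,
                  Polygon3& front, Polygon3& back) {
    front.clear();
    back.clear();
    const std::size_t n = poly.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Vec3& a = poly[i];
        const Vec3& b = poly[(i + 1) % n];
        const double da = signedDistance(plane, a);
        const double db = signedDistance(plane, b);
        if (da >= 0.0) front.push_back(a);
        if (da <= 0.0) back.push_back(a);
        // The edge a-b crosses the plane: create one new vertex at the
        // intersection and append it to both halves.
        if ((da > 0.0 && db < 0.0) || (da < 0.0 && db > 0.0)) {
            const Vec3 p = interpolate(a, b, da / (da - db));
            front.push_back(p);
            back.push_back(p);
        }
    }
}
```

Splitting a square by a plane through its middle yields two quadrilaterals that share the two newly created vertices, and both halves remain convex as noted above.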


Figure 3.2. An on-plane threshold (area within dashed lines) is introduced in all intersection tests to counter the effects of numerical inaccuracies. Green geometries can be sorted to the top half space without being split, since their primitives are all considered on or above the partitioning plane.

On Plane Threshold

During intersection tests between planes and geometry, numerical errors occur due to the limited precision of floating point numbers. Situations where at least one vertex is positioned close to a plane can wrongly trigger a split of the polyhedron it belongs to. To avoid such situations an on-plane threshold is introduced, giving the plane a thickness; points within this thickness are said to be on the plane and thus belong to both half spaces. Without a threshold, testing a single polygon is done by counting positive and negative vertices. With the threshold, this is expanded to also count the vertices positioned on the plane, i.e. within the threshold. If all vertices lie either strictly on one side of the plane or within the threshold, the polygon is deemed not to intersect the plane. This extends to all types of geometry, as seen in figure 3.2, where all green geometry is handled as lying completely inside the upper half space while the blue geometry intersects the plane and needs to be split. The thickness of the threshold is kept small, within the range of the precision of the float data type.

Partition Plane Selection

If a Bsp-tree node is not complete and should be subdivided further, a partitioning plane has to be found. As described earlier, the completeness of a cell is determined through the boundary states of all fragments in that cell, where each closed boundary brings the subdivision closer to completion. Any plane has the potential to close one or more boundaries, but only a limited set of planes is guaranteed to do so. This limited set is defined as all planes that coincide with an open boundary of any of the fragments in the cell. Choosing partitioning planes from this set provides a way to reach a completed subdivision, as in algorithm 3.
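The vertex counting with an on-plane threshold, described in the On Plane Threshold section, can be sketched as below. The enum `Side`, the function `classifyPolygon` and the parameter `eps` are hypothetical names, the plane normal is assumed to be of unit length so that `eps` is a distance, and the concrete threshold value is an assumption rather than the one used in the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Plane { Vec3 n; float d; };  // n . p + d = 0, n assumed unit length

enum class Side { Front, Back, OnPlane, Straddling };

inline float signedDistance(const Plane& p, const Vec3& v) {
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d;
}

// Classifies a polygon against a plane of thickness 2 * eps. Vertices
// within the threshold count as "on the plane" and belong to both half
// spaces, so they can never force a split on their own.
Side classifyPolygon(const std::vector<Vec3>& verts, const Plane& plane,
                     float eps) {
    assert(eps >= 0.0f);
    std::size_t positive = 0, negative = 0, onPlane = 0;
    for (const Vec3& v : verts) {
        const float dist = signedDistance(plane, v);
        if (dist > eps) ++positive;
        else if (dist < -eps) ++negative;
        else ++onPlane;
    }
    if (positive > 0 && negative > 0) return Side::Straddling;
    if (positive > 0) return Side::Front;
    if (negative > 0) return Side::Back;
    return Side::OnPlane;  // every vertex lies within the threshold
}
```

A triangle with one vertex clearly above the plane and two vertices a tiny distance on either side of it is classified as `Front` with a threshold, but as `Straddling` when `eps` is zero, which is exactly the spurious split the threshold is meant to prevent.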


Algorithm 3 Find partitioning plane
Notation: F_x is fragment x in node such that x ∈ [0, N_F]
Notation: B_xk is boundary k for fragment x such that k ∈ [0, N_Bx]
Require: node T for list of fragments
    for all (F_i and F_j in T such that 0 ≤ i < j ≤ N_F) do
        for all (open B_ik in F_i) do
            if (B_ik separates F_i and F_j) then
                store boundary B_ik as partition plane in T
                close boundary B_ik in fragment F_i
                break /* jump to if (plane found) */
            end if
        end for
    end for
    if (plane found) then
        split T with plane
    else
        close any remaining open boundaries in all fragments in T
        mark T as leaf
    end if

(a) 6 cells, 5 planes. (b) 10 cells, 9 planes.

Figure 3.3. The complexity of the subdivision depends on how the planes are chosen.

Selection Scheme

While the set of boundary closing planes provides a good selection of relevant planes, the internal order in which they are applied must be defined. One of the main goals in determining the order of the planes is to minimize the complexity of the Bsp-tree. Although only planes from the same limited set were chosen in both figure 3.3(a) and figure 3.3(b), the complexity is almost doubled for the latter. As a direct result of this it can be argued that choosing a plane that completely separates as many fragments as possible, while keeping intersections to a minimum, is preferable. This is demonstrated in the choice of initial planes in figures 3.3(b) (long double arrowed diagonal line) and 3.3(a) (long double arrowed vertical line), where the latter sets up a much better position for closing multiple boundaries and keeping the number of cells to a minimum.

However, spending resources on finding the best plane comes with a penalty. All open boundary planes of all fragments have to be tested against all other fragments, vertex by vertex, before any selection can take place. Thus, there is a tradeoff between Bsp-tree complexity and performance, and which solution is optimal will depend on the situation. If a scene consists of few volumes but behaves such that the tree must be generated often, a low complexity tree might not be worth the extra time it costs to compute. On the other hand, in a more static environment with high complexity, the extra generation time can be negligible and well worth the reduction in tree complexity. In this work, two schemes of different complexity are implemented in accordance with the two situations described above.

3.2 Render Module

The second module of the pipeline performs the actual rendering and conceptually begins after the creation of the Bsp-tree in the fusion module. The main parts include the creation of the render queue and its traversal, where the selection of instantiated shaders is a central step.

Render Queue

To keep the pipeline consistent regardless of rendering scheme, and to further emphasize the separation of the fusion and render modules, the idea of a render queue is introduced. The linear queue is created by a traversal of a Bsp-tree where each leaf cell generates a queue entry, see figure 2.4 (Render queue) for an illustration. Proxy geometry to be used in the rendering, as discussed per rendering method in section 1.2.3, is stored in each entry along with additional information such as the volumes present. Each render module maintains a private queue.
As all cells to be rendered are leaves and known to be geometrically complete, the polygonal hull of any fragment within a cell can be used as a geometrical description of all occupied space within that cell. In Texture Slicing, the intersection points between this description and the desired slicing planes are used to create slicing polygons. For Raycasting, the polygons of the hull itself are used directly as proxy geometry. Additional polygons are inserted in case the fragment is clipped by the front or back clip planes of the GL, since clipping otherwise creates holes in the rendered geometry.

Cell Based Rendering

With an inherent depth sort among the cells, rendering becomes the task of iteratively rendering all cell entries in the queue. The combination of volumes in each entry is used to select a shader according to one of the instantiation schemes described later in this chapter. To avoid artifacts due to misplaced samples between adjacent rays on opposite sides of a cell border, all sample positions in all cells must be forced to follow a global pattern. In Texture Slicing using view dependent slicing polygons (footnote 2), this comes down to enforcing a z-offset in the creation of each plane such that planes in adjacent cells always end up edge to edge. For Raycasting the same type of offset is enforced in the shader as a manipulation of the ray entry point, dependent on the camera position.

Rendering Methods

Two hardware renderers are implemented using GLSL programs, one based on Texture Slicing and the other on Raycasting. Both renderers share common features in shader selection and render queue execution, as described in previous sections. As stated in section 3.2.1, each renderer initially fills its queue entries with appropriate proxy geometry before rendering is carried out according to algorithm 4 or 5. This step is slightly more complex in cell based rendering than in regular DVR, since the areas are formed by arbitrary convex polyhedra as opposed to cuboids. Furthermore, since rendering is performed on hardware using a GL API, the state of this API must be set according to the desired functionality. In particular, the blending is set according to the method of choice and its implementation, such that the requirements of sequentiality stated earlier are not violated. The choice of which sampling schemes to implement for each rendering method was made based on availability and time, as no requests were made from SCR.
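The global sample pattern required under Cell Based Rendering can be illustrated with a small sketch in which every cell snaps its first sample to a common grid. This is only one possible way to realize the offset: the function names, the choice of a grid measured as distance from the camera, and the half-open interval convention are all assumptions, and the actual implementation performs the Raycasting variant inside the shader.

```cpp
#include <cassert>
#include <cmath>

// Distance of the first sample at or after 'entry', snapped to a global
// grid of spacing 'step' whose origin is assumed to be at the camera.
// Because every cell uses the same grid, samples in adjacent cells line
// up regardless of where the cell borders fall.
inline double firstAlignedSample(double entry, double step) {
    assert(step > 0.0);
    return std::ceil(entry / step) * step;
}

// Number of grid samples that fall in the half-open interval
// [entry, exit). A sample exactly on a shared border belongs to the
// cell behind the border, so it is never taken twice.
inline long countAlignedSamples(double entry, double exit, double step) {
    assert(step > 0.0 && exit >= entry);
    const long first = static_cast<long>(std::ceil(entry / step));
    const long end = static_cast<long>(std::ceil(exit / step));
    return end - first;
}
```

With this convention a ray that passes through two adjacent cells takes exactly the same samples as a ray through a single cell covering both, which is the consistency property the global pattern is meant to guarantee.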
Texture Slicing

Creating slice polygons for regular DVR can be done analytically by calculating the intersections of planes with the bounding cube. However, the complexity of the plane intersection calculations is increased in the fusion case due to the non-cuboid polyhedra defining the rendering areas. Also, the total number of slices increases with an increasing number of cells, which in turn adds more complexity. The implications of this are noticeable in the results of chapter 4 and are discussed in chapter 5. The rendering process of the Texture Slicer implemented in XIP follows algorithm 4, where the sample point manipulation discussed above is already present in the location of the proxy geometry and thus does not have to be considered in the shader programs. The only sampling scheme implemented in the Texture Slicer is Global Frequency.

Algorithm 4 Texture Slicing
Notation: E_x is entry x in queue such that x ∈ [0, N_E]
Require: queue Q
    bind render buffer as render target
    enable accumulative blending
    for all (entries E_i in Q such that 0 ≤ i < N_E) do
        pick shader according to present volumes in E_i and set uniforms
        render front facing geometry in E_i
    end for

Raycasting

Raycasters for standard DVR can be implemented as single-pass shaders, avoiding costly state changes in the GL, where exit points for triggered rays are calculated directly in the shader from the planes of the bounding cuboid (footnote 3). The polyhedral nature of the rendering areas in cell based rendering requires an additional pass, in which the exit points are rendered into a texture for access in the main rendering step. This two-pass approach is further extended so that the resulting alpha after the rendering of one cell is accessible in the rendering of the next (footnote 4). A ping-pong like swapping of render targets is thus performed twice for each cell, in accordance with the segmented rays discussed earlier. Both Global Frequency and Interleaved Sampling variants are implemented for Raycasting, and the manipulation of the entry points for inter-cell consistency is implemented directly in the shader programs. An additional sampling point manipulation is also introduced as a conversion from the native spherical sampling grid of Raycasting to a uniform grid matching the one found in Texture Slicing.

Instantiated Shaders

In a naive shader implementation the execution of specific volumes would be governed by conditionals, as seen in example 3.1 (top part). However, poor hardware support for conditionals in shader programs on the GPU prevents a single shader from effectively skipping volumes in this manner.

Footnote 2: Polygons parallel with the screen, facing the camera.
Footnote 3: This is sometimes reversed so that exit points are triggered and entry points are calculated.
Footnote 4: It is stored in the alpha channel of the exit point texture.

Algorithm 5 Raycasting
Notation: E_x is entry x in queue such that x ∈ [0, N_E]
Require: queue Q
    for all (entries E_i in Q such that 0 ≤ i < N_E) do
        bind render buffer as texture
        bind exit point buffer as render target
        bind exit point shader and set uniforms
        enable replacing blending
        render back facing geometry in E_i
        bind exit point buffer as texture
        bind render buffer as render target
        pick shader according to present volumes in E_i and set uniforms
        enable accumulative blending
        render front facing geometry in E_i
    end for

Instead, a more advanced scheme is introduced where one shader is compiled for every number of volumes to be fused. This way five shaders are used for a five volume setup, with the first shader sampling only a single volume, the second sampling two volumes, and so on. While this removes the overhead of poor conditional execution, the programmer no longer has the freedom of arbitrary blending for specific volumes, although the volumes can still be manipulated using specific TFs. If two sequential cell renderings share the same number of volumes, no state change has to be made, as the same shader is kept active and only uniform variables are updated.

Yet another scheme is also implemented that takes specific volumes into account, at the cost of a growing number of compiled shaders. The baseline this time is that one shader is compiled for each unique combination of volumes, and thus a five volume setup would result in 32 unique shaders. Freedom in blending for specific volumes is restored, while an overhead is created as a state change is performed for virtually every rendered cell.

Writing 32, or even five, shader source files for a five volume setup is hardly an option, for obvious reasons such as code duplication. This problem is solved by using precompilation macros to include or exclude volume contributions, see example 3.1 (bottom part), as opposed to a naive if-clause, see example 3.1 (top part).
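The instantiation of one source text per unique combination of volumes can be sketched on the host side as below. The function names and the macro naming scheme `VOLUME_0`, `VOLUME_1`, ... are assumptions for illustration (the thesis example uses `VOLUME_X`), and the shader body is assumed not to contain a `#version` directive, since GLSL requires that directive to precede everything but comments and whitespace.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the macro preamble for one combination of volumes. Bit k of
// 'mask' is set when volume k contributes to the instantiated shader.
std::string makePreamble(unsigned mask, unsigned volumeCount) {
    std::string preamble;
    for (unsigned k = 0; k < volumeCount; ++k) {
        if (mask & (1u << k)) {
            preamble += "#define VOLUME_" + std::to_string(k) + "\n";
        }
    }
    return preamble;
}

// Instantiates the single shader body once per combination of volumes,
// giving 2^volumeCount sources (32 for five volumes, counting the empty
// combination). The index into the result is the combination's bit mask.
std::vector<std::string> instantiateAll(const std::string& body,
                                        unsigned volumeCount) {
    assert(volumeCount < 32);
    const unsigned count = 1u << volumeCount;
    std::vector<std::string> sources;
    sources.reserve(count);
    for (unsigned mask = 0; mask < count; ++mask) {
        sources.push_back(makePreamble(mask, volumeCount) + body);
    }
    return sources;
}
```

At render time the bit mask of the volumes present in a queue entry can then serve directly as the lookup index for the compiled shader. Instead of concatenating, the preamble and the body could also be passed as two separate strings to `glShaderSource`, which accepts an array of source strings.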
Source code is written only once before being instantiated multiple times using different macro setups. Both non-naive schemes described above utilize this in the implementation.

Example 3.1: If-clause dependent shader code vs. compiler macros

    {
        ...
        if (usevolumex)
            color += (1.0 - color.a) * samplevol( volumex, pos );
        ...
    #ifdef VOLUME_X
        color += (1.0 - color.a) * samplevol( volumex, pos );
    #endif
        ...
    }

3.3 Extra Functionality

In addition to the core functionality of DVR, which is to render volumes, there exist several performance optimizations and functional tools that increase the value of the visualization. Some of these additions have been included in the pipeline and are presented here.

Clip Planes and Early Ray Termination

The importance and usage of volume clipping is thoroughly discussed in [Engel et al., 2006], chapter 15. In this work volumes can be clipped by the insertion of clip planes on a per fusion module basis, thus clipping all volumes within the Bsp-tree of that module. A potential acceleration of the generation of the Bsp-tree is also noticeable for every clip plane that is added, thanks to a lowered initial complexity. That is, some volumes, or at least parts of volumes, can be cut away, reducing both the overall number of primitives and the number of potential partitioning planes.

Standard Early Ray Termination (ERT) is available for Raycasting in such a way that rays are terminated once a certain threshold is reached for the saturation of the pixel associated with that ray. Although an already saturated ray can be re-triggered as additional cells are processed, the rendering is aborted before the stepping along the ray is initiated. This topic is discussed in both [Krüger and Westermann, 2003] and [Engel et al., 2006].

Fusion with Simple Convex Mesh Geometry

Originally implemented for crude representations of medical instruments integrated in DVR, there exists support in the implementation for correct fusion between simple convex polygon geometry, called mesh geometry, and the volumetric data. This way, standard polygon models can be present in DVR, see algorithm 6, with correct blending and with the use of textures or lighting made to the likeness of instruments or tools. The requirement of simplicity lies in the fact that a mesh is inserted as any other volume and thus affects the complexity of the tree accordingly. A polygon mesh would, for example, cause subdivisions if present inside a volume. Below, a scheme is presented for Raycasting to address this issue.

Algorithm 6 Fusion with simple mesh geometry
    Add mesh as a volume to the tree
    Generate Bsp-tree and Render queue
    for all (cells containing the geometry) do
        render back facing mesh polygons
        render volume data
        render front facing mesh polygons
    end for

Proposed Scheme for Fusion with Complex Mesh Geometry

For more complex geometry, one approach is to add a simple bounding box as a volume in the Bsp-tree and have the fusion take place as in algorithm 7. This scheme will only work with Raycasting and is not yet implemented in the pipeline.

Algorithm 7 Fusion with advanced mesh geometry
    for all (cells containing the mesh bounding box) do
        render surrounding volume data (footnote 5)
        render back volume data (footnote 6)
        render back facing mesh polygons
        render intermediate volume data (footnote 7)
        render front facing mesh polygons
        render front volume data (footnote 8)
    end for

Footnote 5: Entry and exit points come from front and back facing hull polygons respectively.
Footnote 6: Entry points come from front facing hull polygons or back facing mesh polygons, whichever is closest; exit points come from back facing hull polygons.
Footnote 7: Entry and exit points come from front and back facing mesh polygons respectively.
Footnote 8: Entry and exit points come from front facing hull and mesh polygons respectively.


Chapter 4 Results

The results presented in this chapter include comparisons of partition plane selection schemes and the impact of volume placement on tree generation time and complexity. Certain aspects of the fusion module and the rendering modules are also presented with illustrations. All benchmark timing was performed using VTune Performance Analyzer (VTune for short), while application frame rates were measured directly in XIP. All tests were run on a 2.8 GHz Pentium 4 machine equipped with 1 GB RAM and a GeForce 8800 GT graphics card with 512 MB VRAM. The head data set used for rendering has a 256^3 resolution at 12 bits.

(a) Dense proxy geometry. (b) Dense scene. (c) Sparse proxy geometry. (d) Sparse scene.

Figure 4.1. Two different scenes are used in all performance testing: one dense worst case scenario, the other a sparse situation. The sparse geometry in (c) also includes a big enclosing volume that is left out in this figure for illustrative purposes.

All final results are directly dependent on the complexity of the Bsp-trees, as highlighted earlier. This complexity is in turn highly dependent on the relative placement of the involved volumes, and testing is performed using two different scene setups. As seen in figure 4.1, a dense scene represents a worst case scenario and a sparse scene depicts a distributed scenario where several small volumes are rendered inside a big encapsulating volume.

4.1 Storage Structures

Figure 4.2. Relative VTune benchmark results for three tested allocation methods.

The developed Segmented Arrays and Object Pools presented earlier are compared against standard fixed-size C++ arrays and a naive allocate-on-demand scheme. A dense and computationally expensive scene was used during the experiments. A direct result of the Object Pools and Segmented Arrays is, as seen in figure 4.2, that relatively little speed has to be sacrificed while the memory footprint is considerably smaller compared to standard C++ arrays. While being large enough for the specific test cases, the C++ arrays in these charts cannot handle arbitrary memory demands. The segmented arrays, on the other hand, handle all cases. The third pair of columns, belonging to the New/Delete memory scheme, is mostly present as a reference. Timing was performed with VTune in a test environment that was executed between ten thousand and one million times.

4.2 Bsp-tree Generation and Complexity

Tree complexity, here measured in number of primitives (nodes, fragments, polygons), grows roughly as the cube of the number of volumes (nodes ∝ volumes^3), as seen in figure 4.3(a), and the storage requirements for polygons follow the same pattern. This growth rate is only apparent in scenes with dense volume placement, i.e. worst case scenarios where all volumes overlap each other without relative alignment. In a comparison of VTune Minimum Time performance between dense and sparse scenes in charts 4.3(c) and 4.3(d), it is apparent that a sparse placement directly translates to shorter tree generation times, with a time gain between 40% and 300%+ for an increasing number of involved volumes.

(a) Complexity of primitives. (b) Storage requirements. (c) Generation measurements in VTune (benchmarking) and XIP (application) for a dense scene (worst case). The complexities for the Min Time and Min Complexity schemes are in this case almost identical due to the scene density. (d) Generation times (columns) and complexities (lines) for a sparse scene (good-natured).
Figure 4.3. Time, complexity and storage charts for the generation of Bsp-trees with a varying number of volumes. Noticeable differences appear depending on partition plane selection scheme and scene density.

4.3 Partition Plane Selection Methods

Generation times for Bsp-trees in figure 4.3 are of course directly proportional to their complexity. Even so, since the complexity in turn depends on the placement of volumes as well as on how the partitioning planes are selected, some interesting remarks can be made. As discussed in section the complexity of the tree can possibly be lowered if more resources are spent on partition plane selection. Hence a tradeoff between complexity and speed arises. This is illustrated by the difference in tree complexity in figure 4.4 and confirmed by the time charts in figure 4.3(d), which even indicate a speed-up for the complexity minimizing strategy, derived directly from its lowered complexity. On the downside, lowered complexity is only obtained as described above if the volume placement is sparse, e.g. figure 4.1(c). If the volume placement is dense, on the other hand, the extra resources spent on partition plane selection become futile and result in an overall penalty. The Bsp-tree in figure 4.7(a) is an example of this, as its resulting structure is independent of which selection scheme is used. Also, figure 4.3(c) shows a time penalty for the minimum complexity scheme in XIP while the complexities (not illustrated) for both schemes are similar.

(a) Minimum Time. (b) Minimum Complexity.
Figure 4.4. Comparison of the two implemented ways to choose partitioning planes and their effect on the complexity of a simple scene. The performance oriented method in (a) spends less time per plane selection but results in higher complexity.

4.4 Real Application Impact

So far, all performance measurements discussed have been done in VTune. However, there is a close relation between real application measurements done in XIP and VTune benchmarking. This is apparent in figure 4.3(c), where real application tests closely follow the synthetic estimations. This relation can be observed as long as no other major bottlenecks (typically rendering) appear in the pipeline for the measured scene.

4.5 Proxy Geometry

The proxy geometry for Texture Slicing and Raycasting is shown in figures 4.6(a) and 4.6(b), with the final rendering in 4.6(c). As can be seen in the charts of figure 4.5, the amount of geometry needed for Texture Slicing grows drastically with increased scene complexity. For a dense scene during interaction, roughly polygons need to be recomputed each frame for Texture Slicing. The same number for Raycasting is a few thousand. An increase from 256 to 512 in sampling depth for Texture Slicing results in an 88% increase of proxy geometry in a sparse scene, while in a dense scene the increase is 60%. Raycasting proxy geometry does not depend on the sampling depth.

(a) The queue makes up the greater part of the proxy geometry consumption for Texture Slicing. (b) Raycasting carries little interaction overhead, with low amounts of queue geometry.
Figure 4.5. Three points on proxy geometry: the relative amount needed for Texture Slicing and Raycasting, the portion of it that is queue geometry (which needs to be redrawn every frame during interaction), and the growth of both with increased scene density.

(a) Proxy geometry for Texture Slicing. (b) Proxy geometry for Raycasting. (c) Final rendering.
Figure 4.6. Screenshots of a fused rendering of three skulls. Rendering at 25 and 15 Fps for Texture Slicing and Raycasting respectively on a 512 square viewport with 512 slices on a unit volume.


(a) Bsp-tree. (b) Parts. (c) Queue.
Figure 4.7. Screenshots of a fused rendering of two volumes. Grey polyhedrons represent incomplete cells (internal nodes) while colored ones represent complete cells (leaves).

4.6 Fusion Module Results

Figure 4.7 illustrates a small Bsp-tree and its resulting queue for a dual volume intersection. 1 For a setup with the volumes aligned in one dimension, the tree exhibits a total of 13 fragments in 9 nodes, resulting in a queue with 5 entries. The non axis alignment of the partitioning, required to support bricking as discussed in chapter 2, is apparent in figure 4.7(b).

1 This is a direct real world equivalent of the situation illustrated in figure 2.4.

4.7 Render module results

Overall application performance is highly dependent on several factors. The influences of interaction, scene complexity and sampling density on the rendering speed have been isolated and are illustrated in figures 4.8(a), 4.8(b) and 4.8(c) respectively. Earlier in this chapter the relation between scene volume placement and Bsp-tree generation times was demonstrated. In line with this relation, a comparison of the dense and sparse placement measurements in figure 4.8(b) with those for doubled sampling density in 4.8(c) shows that volume placement has the higher impact on the overall frame rate.

Raycasting exhibits invariance in rendering speed with respect to interaction, while Texture Slicing demonstrates a considerable drop in frame rates once camera interaction is present. The bottom blue line in figure 4.8(a) is in fact two almost identical lines, with and without camera interaction for Raycasting. The drop for Texture Slicing originates in the amount of proxy geometry slices that need to be recalculated every frame as the camera changes position. In figure 4.5(a) this corresponds to the render queue (red) parts of the columns. The effect also grows if the complexity of the tree is increased for any reason.

(a) Texture Slicing (TS) displays a considerable drop in frame rates during interaction while Raycasting (RC) numbers remain identical. (b) Comparison of rendering speed regarding scene complexity when rendering with a density of 512 samples for a unit volume. (c) Effects on rendering speed caused by increased sampling density in a sparse scene.
Figure 4.8. Frame rate charts of isolated factors regarding rendering speed.

(a) 8 bit buffer. (b) 16 bit floating point buffer. (c) 32 bit floating point buffer. (d) Rendering.
Figure 4.9. Difference for identical rendering between Raycasting and Texture Slicing. Simple RGB difference multiplied by 100.

Tests were performed on the relative difference between images rendered with Texture Slicing and Raycasting at various texture and buffer precisions on the GPU. This difference is calculated as $\mathrm{raycasting}(x_i, y_j) - \mathrm{slicing}(x_i, y_j)$ in a per channel RGB fashion. As seen in figure 4.9, the difference decreases with increased buffer precision. The noticeable colored artifacts in 4.9(a) are due to Raycasting entry and exit points being faulty if low precision buffers are used. No distinguishing differences in rendering speed were detected during the comparisons of different buffer precisions. This experiment also emphasizes the modularity of the pipeline, as the renderings of Texture Slicing and Raycasting as well as the comparison calculations are done in real time using a single fusion module. On top of this, the proxy geometry is also rendered before all parts are visualized in a split viewport, as seen in figure 4.10.

Figure 4.10. Four rendering modules operate on a single fusion module to produce a comparison test between Texture Slicing and Raycasting.
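The difference image of figure 4.9 can be illustrated with a minimal CPU-side sketch. The thesis computes it on the GPU inside the pipeline; the function name `difference_image`, the nested-list image layout, the use of an absolute value and the clamping to the displayable range are all assumptions made here for illustration only.

```python
def difference_image(raycast, sliced, gain=100.0, max_value=1.0):
    """Amplified per-channel RGB difference between two renderings.

    Both images are lists of rows, each row a list of (r, g, b) tuples
    with channel values in [0, max_value]. The absolute value and the
    clamp are assumptions; the gain of 100 follows the caption of
    figure 4.9.
    """
    if len(raycast) != len(sliced):
        raise ValueError("images must have the same height")
    result = []
    for row_rc, row_ts in zip(raycast, sliced):
        if len(row_rc) != len(row_ts):
            raise ValueError("images must have the same width")
        result.append([
            # Per channel: amplify the difference, then clamp for display.
            tuple(min(max_value, gain * abs(a - b)) for a, b in zip(p_rc, p_ts))
            for p_rc, p_ts in zip(row_rc, row_ts)
        ])
    return result
```

With a gain of 100, a channel difference of only 0.01 already saturates the output, which is why small precision errors such as those in the 8 bit buffer become clearly visible.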

Chapter 5

Conclusion and Possibilities

The results presented in the previous chapter are discussed in relation to the problems, solutions and goals stated in the beginning of this thesis. Known limitations and future possibilities are also highlighted.

5.1 Conclusion

In general, the results presented in the previous chapter agree well with the effects predicted earlier in this thesis. The framework can utilize Texture Slicing and Raycasting to fuse multiple arbitrarily aligned volumes with user defined composition schemes. However, all results are directly dependent on the applied user scenario, and the amount of proxy geometry, the Bsp-tree benchmarks and the actual frame rates are all connected to and reflect different aspects of the overall complexity. As long as the number of volumes is kept low, scenes remain sparse and the sampling rate is reasonable, interactive frame rates are achieved.

Both Texture Slicing and Raycasting deliver acceptable results for limited scenes, with Texture Slicing being slightly faster. In their present form, however, their respective implementations are far from optimized, with Raycasting for example being limited by an unnecessarily large overhead from state changes. If this overhead can be removed, Raycasting may very well become the preferred method for fused DVR, since acceleration methods such as Empty Space Skipping/Leaping or Multi resolution Data can be implemented in ways compatible with a fusion environment. Due to the high penalty of increased proxy geometry, these accelerations are less suited for Texture Slicing when the scenario involves fusion.

As seen among the results, the pipeline supports a large variety of scenes, as no assumptions are made about volume placement or numbers. If the pipeline is used for a specific purpose, such assumptions should affect the choice of partitioning scheme, and case specific optimizations can be performed, such as introducing user specified partitioning planes or generating several smaller Bsp-trees for performance load distribution.

In line with the confinements for this thesis, no comparisons are presented of the different sampling or composition schemes. However, these areas of research and their impact on image quality should be investigated further, along with the signal processing aspects of different interpolation kernels in volume and TF sampling.

Goal Achievement and Problems Solved

In short, the main goal of fusion support in XIP is met. Furthermore, the design with separated fusion and render modules and their respective modularity, as illustrated in figure 2.1 and used for the rendering in figure 4.10, accounts for much of the simplicity and extendability requirements originating in XIP. For example, if an additional rendering method were to be included in the framework, only the last part of the pipeline would have to be replaced, while the volume representations and fusion module could remain intact with the Bsp-tree as a simple interface. The instantiation of shaders also emphasizes simplicity, as only one shader source needs to be written while the instantiations remain specific for each combination of volumes.

5.2 Known Limitations

High complexity translates to low frame rates and ultimately loss of interactivity. Even though the framework theoretically supports an arbitrary number of volumes, there are limitations as to how many volumes can be fused at the same time with reasonable speed. One major bottleneck is the complexity of the generated Bsp-tree, which grows drastically as shown in chapter 4. Also, the number of texture units on the GPU sets a limit on the number of volumes, since each volume will occupy one such unit. 1 While the pipeline and its core BSP algorithm theoretically support bricking, it is not implemented at this point, which limits the framework to scenes where all volume data fit in VRAM simultaneously.

1 With the use of volume specific LUTs two units are occupied per volume.
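The texture unit limit amounts to a simple budget calculation, sketched below. The function name `max_fusable_volumes` and the `reserved_units` parameter (units assumed to be needed for something other than volume data) are hypothetical and not part of the framework; the unit count of a given GPU has to be queried from the driver.

```python
def max_fusable_volumes(texture_units, per_volume_luts=False, reserved_units=0):
    """Upper bound on simultaneously fused volumes given the texture unit count.

    Each volume occupies one texture unit, or two when a volume specific
    LUT is used. `reserved_units` is a hypothetical allowance for units
    that are needed elsewhere in the pipeline.
    """
    available = texture_units - reserved_units
    if available < 0:
        raise ValueError("more units reserved than available")
    units_per_volume = 2 if per_volume_luts else 1
    return available // units_per_volume
```

For a hypothetical GPU with 16 texture units this gives at most 16 volumes, or 8 when every volume carries its own LUT. This is independent of the VRAM and Bsp-tree complexity limits, which typically become restrictive earlier.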

5.3 Future Research

Some of the following discussions are aimed at conclusions drawn from the results presented in the previous chapter. Others are directed more towards future possibilities. Both types center on overall performance, either to limit bottlenecks or to introduce accelerations.

Geometry

Calculating, storing and working with geometry is an essential part of generating complete cells for a Bsp-tree. Some experiments were carried out trying to complete the subdivision using more theoretical plane and line descriptions rather than polygon based geometry. This did not succeed due to numerous cases where intersection decisions were ambiguous. By maintaining geometry per volume, however, all ambiguities are avoided. If some approach other than BSP is used for space partitioning, such as an axis aligned method 2, the simplified geometry solution could be worth investigating.

State Changes

At the time of writing, the two-pass ping pong implementation of the Raycaster needs to be optimized. The actual flipping of buffers is done using two OpenGL FBOs, because previously existing code made it the easiest choice available. Other options exist, such as flipping between two buffers in the same FBO or using a conditional shader, which can have a positive effect in terms of higher frame rates with less overhead from needless GL state changes. Even if conditional shaders have limitations, the coherency between pixels in this case will probably limit the otherwise costly penalty of conditional execution.

Bsp-tree Generation

Two methods for the selection of partitioning planes in Bsp-tree generation were presented in section. Results in chapter 4 show the additional scheme to be advantageous under certain conditions. However, only one additional scheme was investigated apart from the standard minimum time option. More tests could be performed to explore other schemes and their respective impact on different scene setups, i.e. number of volumes and placement density. Ultimately, a few key schemes can be exposed as options for the user during runtime to ensure maximum performance for each scene.

2 Although such a method probably would not support bricking as discussed in 2.2.1
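The idea of exposing selection schemes as runtime options can be sketched as follows. The actual selection criteria of the thesis are defined in a section not reproduced here, so both strategies are assumptions: `select_min_time` simply takes the first candidate plane, and `select_min_complexity` picks the candidate that splits the fewest polygons. Planes are represented as `(normal, d)` with points on the plane satisfying `normal · p = d`; all names are hypothetical.

```python
EPS = 1e-9  # tolerance for points lying on the plane

def signed_distance(plane, point):
    """Signed distance (scaled by |normal|) from point to plane (normal, d)."""
    (nx, ny, nz), d = plane
    x, y, z = point
    return nx * x + ny * y + nz * z - d

def is_split(plane, polygon):
    """True if the polygon has vertices strictly on both sides of the plane."""
    dists = [signed_distance(plane, p) for p in polygon]
    return any(v > EPS for v in dists) and any(v < -EPS for v in dists)

def select_min_time(candidates, polygons):
    """Assumed 'minimum time' scheme: no evaluation, take the first candidate."""
    return candidates[0]

def select_min_complexity(candidates, polygons):
    """Assumed 'minimum complexity' scheme: fewest split polygons wins."""
    return min(candidates,
               key=lambda plane: sum(is_split(plane, poly) for poly in polygons))

# Schemes exposed by name so one can be chosen per scene at runtime.
SCHEMES = {
    "min_time": select_min_time,
    "min_complexity": select_min_complexity,
}
```

The sketch also shows where the tradeoff comes from: the second scheme inspects every polygon for every candidate plane, so it only pays off when the candidates actually differ in how many polygons they split, which is less likely in a dense scene.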

Partial Bsp-tree Update

Changing the position of any volume relative to any other volume is the sole operation, save for adding or removing volumes, that invalidates an existing tree and forces it to be regenerated. Although recursive tree methods are typically well suited for partial updates, no scheme other than full regeneration has been implemented. If the applied change consists of moving one volume, for example, a proposed partial updating scheme can be found in algorithm 8. The gain lies in the fact that if a change only applies to a limited part far down in the tree structure, the computational costs could be drastically reduced by recomputing only specific parts of the tree.

Algorithm 8 Update tree due to moved volume(s)
Require: node to start recursive update (root node)
if (node is leaf) then
    run Build Node on leaf to perform full regeneration
else
    if (change affects both half spaces of partition plane) then
        run Build Node on active node to perform full regeneration
    else if (change affects positive half space only) then
        recursively run Update Tree on positive child
    else if (change affects negative half space only) then
        recursively run Update Tree on negative child
    end if
end if
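A minimal sketch of algorithm 8 is given below, under stated assumptions. The `Node` class, the `(normal, d)` plane representation and the `build_node` callback (standing in for Build Node) are hypothetical. The change is assumed to be described by a set of points, for instance the corners of the moved volume before and after the move; a change lying exactly on a partition plane is conservatively treated as affecting both half spaces.

```python
class Node:
    """Bsp-tree node; a node without a partition plane is a leaf."""
    def __init__(self, plane=None, positive=None, negative=None):
        self.plane = plane        # (normal, d), or None for a leaf
        self.positive = positive  # child in the positive half space
        self.negative = negative  # child in the negative half space

    @property
    def is_leaf(self):
        return self.plane is None

def affected_side(plane, points, eps=1e-9):
    """Classify the changed points against a plane: 'positive', 'negative' or 'both'."""
    (nx, ny, nz), d = plane
    dists = [nx * x + ny * y + nz * z - d for x, y, z in points]
    pos = any(v > eps for v in dists)
    neg = any(v < -eps for v in dists)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return "both"  # straddling, or lying entirely on the plane

def update_tree(node, changed_points, build_node):
    """Regenerate only the smallest subtree that contains the change."""
    if node.is_leaf:
        build_node(node)          # full regeneration of the leaf
        return
    side = affected_side(node.plane, changed_points)
    if side == "both":
        build_node(node)          # full regeneration from this node down
    elif side == "positive":
        update_tree(node.positive, changed_points, build_node)
    else:
        update_tree(node.negative, changed_points, build_node)
```

In the best case only a single leaf is rebuilt. In the worst case the change straddles the root plane and the update degenerates to the full regeneration that is currently implemented.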

Figure 5.1. Screenshots of two fused heads.
