Scalable Parallel Volume Raycasting for Nonrectilinear Computational Grids

Judy Challinger
Computer Engineering & Information Sciences
University of California, Santa Cruz
Santa Cruz, CA USA
judy@cse.ucsc.edu

Abstract

A scalable approach to parallel volume raycasting of structured and unstructured computational grids is presented. The algorithm is general enough to handle non-convex grids and cells, grids with voids, grids constructed from multiple grids, and embedded geometrical primitives. The algorithm is designed for a highly parallel MIMD architecture which features both local memory and shared memory with non-uniform access times. It has been implemented on a BBN TC2000 and benchmarked on several datasets. A variation of the algorithm which provides fast image updates for a changing transfer function is also presented. A distributed approach to controlling the execution of the volume renderer is used, and the graphical user interface designed for this purpose is briefly described.

Keywords: volume rendering, parallel processing, scientific visualization.

INTRODUCTION

There is a major shift in paradigm underway in the area of supercomputing for the computational sciences. The powerful vector processors epitomized by the Cray series of supercomputers will gradually be replaced with massively parallel systems of hundreds or thousands of processors and extremely large, scalable memories. This trend towards a new architectural approach to achieving high performance is being driven from above by the performance requirements of the so-called grand challenge problems. With the advent of extremely powerful supercomputers and massively parallel systems, numerical simulations of physical systems are being done in three spatial dimensions, and at increasingly higher levels of resolution and complexity. Tools which provide for the visual analysis of the results of such simulations are extremely important.
A volumetric dataset is a collection of scalar data in which each datum has an associated location in three-dimensional space. Many numerical models of physical systems are based upon such scalar fields. Direct volume rendering is a powerful, but computationally intensive, computer graphics technique for rendering volumetric datasets that has been shown to be useful for the visual analysis of the results of scientific computations.

The extremely large size of the results of such simulations can make it difficult and time consuming to extract useful visualizations of the data. Typically the scientist producing the result is located remotely from the massively parallel machine. A common approach has been to move the data set to a local graphics workstation and render it there. This can be problematic for very large data sets, and especially so if the simulation of interest is unsteady (time-varying).

One motivation for this research is to provide a powerful interactive tool for the visual analysis of the results of simulations. In this context, it is desirable for the analysis tool (volume renderer) to be made available on the same machine that the simulation is running on. This will facilitate closer coupling of scientific simulations and visualization tools in the future, ultimately leading to the ability to interactively steer scientific computations based on visual feedback. As the trend towards putting large scientific simulation applications on massively parallel machines continues, it is essential that similar research is conducted on the parallelization of visualization techniques.

DISTRIBUTED GRAPHICAL USER INTERFACE

The user interface was motivated by the desire to have a practical, portable system for interactively controlling volume rendering software running on a remote host. X Windows was chosen for its portability and widespread use; in particular, Motif widgets are used wherever possible.
A distributed approach was chosen in order to make use of image compression in the transfer of rendered images from the host to the local workstation, and in order to get good performance in highly interactive operations such as transfer function and viewing parameter modification. X Windows is not particularly suitable for the transmission of images over today's networks; a distributed approach allows the images to be intelligently compressed before being transmitted over the socket connection to be displayed. In addition, the operating systems of today's massively parallel hosts are not really suitable for highly interactive tasks such as rubberbanding a line in an X window. It is more efficient, and a better use of the host's resources, to perform such highly interactive tasks locally and send the necessary information to the host as it is needed.

All highly interactive tasks are performed locally on the workstation, giving better performance and freeing up the massively parallel host for more intensive computations. The near-interactive, intensive task of volume rendering a scalar field is done on the host; the resulting image is compressed and sent back to the local workstation for display.

© 1993 IEEE

The concept of using a distributed approach as described here was inspired by a system using a similar approach for

grid generation (also a computationally intensive process), presented at the 1992 Computational Aerosciences Conference at NASA Ames Research Center [39]. In this work a graphical user interface and plotting software runs on a local workstation, and an elliptic multi-block grid generation program runs on a remote supercomputer.

Color Fig. 8 shows the main components of the graphical user interface. In the upper left corner is a main window which provides menus for saving and restoring rendering scripts, for establishing a host connection, for specifying a computational grid type and file name, and for specifying the rendering method. The computational grids are stored in Plot3D files [31] on the host. Additional buttons on the main window allow the user to pop up windows providing other functionality, and to request an image to be rendered. Buttons along the right side are used to specify which scalar field from the solution file is desired, and the histogram of the scalar field is displayed in a window on the left.

In addition to the main window, separate windows are provided for transfer function specification, view specification, and image viewing. These are popup windows that can be iconified or closed independently. When the host reads in a grid, it sends the external nodes back to the workstation to be displayed in the view window. These are used to provide feedback to the user as the view is manipulated. The bounding box of these nodes is also used to compute an initial zoom and translation which will display the entire volume in the desired image size.

The simple image compression scheme used is lossless and generally has been found to compress images by 30% to 60%. Images can be enlarged locally (on the workstation, rather than being rerendered on the host) using bilinear interpolation, and can be saved or restored from a local file.
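The initial zoom and translation computed from the bounding box of the external nodes can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from the original implementation; the function name `fit_view`, the square image, and the assumption that the nodes are already expressed in view-aligned coordinates under an orthographic projection are all hypothetical.

```python
def fit_view(nodes, image_size):
    """Return (zoom, tx, ty) so the x/y bounding box of `nodes` fits a square image.

    Assumption: `nodes` are (x, y, z) tuples already in view-aligned coordinates.
    """
    xs = [p[0] for p in nodes]
    ys = [p[1] for p in nodes]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    # Scale so the larger of the two extents spans the whole image.
    extent = max(max_x - min_x, max_y - min_y)
    zoom = image_size / extent if extent > 0 else 1.0

    # Translate so the bounding-box centre lands on the image centre.
    tx = image_size / 2 - zoom * (min_x + max_x) / 2
    ty = image_size / 2 - zoom * (min_y + max_y) / 2
    return zoom, tx, ty
```

A pixel coordinate is then obtained as `zoom * x + tx` and `zoom * y + ty`, so every node of the bounding box falls inside the image.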
RELATED WORK

Volume rendering algorithms can be classified as being based on raycasting, cell projection, splatting, or shear transformations. Raycasting is an image-space approach in which a ray from the eyepoint is cast through each pixel of the image, intersected with the volume, and sampled along its length [21, 23, 41, 33, 15]. Cell projection and splatting are both object-space algorithms in which cells or nodes are projected to the screen [41, 44, 45, 20, 48, 38, 27]. These methods require that the cells or nodes be sorted into a visibility ordering and projected either front-to-back or back-to-front. Methods using shear transformations first rotate the volume in memory so that it is view aligned. Compositing can then be done by simply striding through the volume [8].

Sequential Approaches to Efficiency

The primary drawback of direct volume rendering is the amount of computation needed to produce a result. To achieve its ultimate usefulness as an exploratory tool, volume rendering must run at interactive speeds. There have been numerous approaches to speeding up the volume rendering process using sequential algorithms [44, 45, 20, 24, 48, 7, 42]. Many of these algorithms trade image quality for speed, typically under user control. Several also gain speed through the use of hardware-renderable primitives.

Complex Grids

Researchers have recently begun to address volume rendering algorithms for more complex computational grids. Many of the object-space approaches for rectilinear grids can be extended to handle more general grids [27, 38, 50, 51, 49, 25, 42]. Raycasting is also fairly easily extended to handle more complex grids. An important aspect is how to deal with the computational complexity of the ray-cell intersection testing requirements. Approaches taken include interpolating the grid to a rectilinear one [47], finding the first intersection and then stepping through the cells [13, 19], and techniques related to scan-line algorithms [5, 14].
Parallel Approaches

A few researchers have presented algorithms for parallel volume rendering on single-instruction, multiple-data (SIMD) architectures [36, 35, 43, 18]. All of the SIMD approaches to volume rendering have addressed rectilinear grids only.

More work has been done on parallel direct volume rendering for MIMD architectures. A wide variety of architectures have been addressed. These can be categorized as to whether they are highly parallel and scalable, or limited to a small number of processors. Implementations for highly parallel systems have been described for the Pixel-Planes 5, Stanford DASH multiprocessor, nCUBE 2, Fujitsu AP1000, and the BBN TC2000. Work on smaller systems has been reported for multiprocessor Silicon Graphics systems.

Most of the research conducted so far has been directed towards direct volume rendering of rectilinear datasets. A variety of algorithms have been investigated, with the raycasting approach receiving the most attention [22, 30, 28, 6]. Two researchers have reported on parallel splatting algorithms [29, 9] and a few have addressed implementations of parallel projection algorithms on smaller multiprocessor computer graphics workstations [34, 50, 49, 25]. We have conducted a comparison of parallel image-space versus object-space rendering algorithms and the problems inherent in the two approaches [4]. The results indicate that image-space algorithms may be easier to parallelize with high efficiency on highly parallel architectures. Very recently researchers have been investigating parallel direct volume rendering algorithms for datasets that are nonrectilinear [5, 50, 49, 25].

Parallel Computer Graphics

Related efforts in the parallelization of computer graphics algorithms are surveyed by Burke and Leler [3]. Two studies that particularly influenced this work are mentioned here.
Parallelization of the ray tracing algorithm on a distributed-memory MIMD architecture using an image-space decomposition is presented by Badouel [1]. This algorithm depends on an implementation of shared virtual memory with local caching. Whitman explores several image-space decompositions and scheduling strategies on a shared-memory MIMD machine [46]. A detailed analysis of the overhead incurred in the parallelization is presented.

ARCHITECTURE OF THE BBN TC2000

The machine used for this research is a BBN TC2000 located at the National Energy Research Supercomputing Center at Lawrence Livermore National Laboratory. This particular machine is configured with 128 processors and 2 GB of main memory. Although this work has been done on a specific architecture, it is anticipated that it will be possible to extrapolate the results to several other MIMD architectures. There are two architectural features of this particular machine that promote this capability. First, it provides both local memory and globally-shared memory. Second, accesses to memory are non-uniform in that the latency for access to

a remote shared-memory location is greater than for a local (on-board) memory reference, and the time required for a remote memory access will vary based on switch contention.

The BBN TC2000 is a multiprocessor architecture with a distributed shared memory [2]. The TC2000 processors access the shared memory through an interconnection network called the Butterfly switch. The architecture is modular and scalable and can be configured to contain between 1 and 512 function boards. The main components of each function board include a Motorola RISC processor, a 16 kilobyte instruction cache and cache/memory management unit (CMMU), a 16 kilobyte data cache and CMMU, 4 to 16 megabytes of main memory, a switch interface, and a VMEbus interface.

The processor has one read-modify-write instruction, xmem, which exchanges the contents of a register with the contents of a memory location. The TC2000 is designed to honor the xmem instruction, thus providing the capability of atomic operations even across the switch. When shared memory is allocated by an application, it can be specified as uncachable, cachable with copy back, or cachable with write through. The TC2000 also supports interleaving of shared memory to reduce switch contention. References made to a contiguous shared address space by a processor will be spread over several function boards by mapping hardware. The basic clump size is 16 bytes, which is also the maximum switch message data size.

SCALABLE PARALLEL VOLUME RAYCASTING

Most parallel direct volume rendering algorithms previously presented for MIMD architectures have been for rectilinear datasets, and none has produced the level of scalability that is desired. The research presented in this paper focuses on the efficient (scalable) parallel implementation of a volume raycasting algorithm on highly parallel, multiple-instruction, multiple-data (MIMD) architectures with virtual shared memory.
As the architectures and operating systems of massively parallel systems mature, virtual shared memory with non-uniform memory access times will almost certainly become a supported feature. It is likely that many applications will take advantage of this feature due to the increased programmer productivity it provides.

The raycasting approach to volume rendering has been selected for this research due to the high quality of the images it produces, and because initial studies indicated it would be easier to parallelize in an efficient and scalable manner [4]. Although sequential projection and splatting approaches are inherently faster than raycasting, especially for small volumes in a large image, the cells or nodes of the volume must be ordered according to visibility and rendered either front-to-back or back-to-front. These visibility ordering requirements complicate the parallelization and introduce the need for synchronization between processors, lowering the efficiency of the parallel algorithm. In addition, many of the fastest object-space methods gain much of their speed through the use of hardware-renderable primitives. It is not clear how these algorithms will perform on a highly parallel architecture that does not contain built-in rendering hardware.

Grid Data Structures and Distribution

The algorithm presented here is designed to handle computational grids that are nonrectilinear. We will call a single data point that has been sampled or computed a node. Two neighboring nodes may be said to define an edge, and three or more define a face. In the case where a face is defined by more than three nodes, it is possible that the face will be non-planar. A cell is the space in R^3 defined by four or more nodes. A face that is shared by two cells is an internal face; otherwise it is an external face. Grids may be curved to match the simulation geometry, as in the curvilinear grids commonly used in CFD.
A curvilinear grid is defined by a rectilinear computational grid that has been shaped, resulting in cells with non-planar faces in physical space [10]. These array-organized grids are also called structured grids [40]. Computational grids in R^3 made up of tetrahedral or hexahedral cells that have been shaped are also common in computational fluid dynamics and finite element analysis applications [53]. Typically the definition of these grids is given as a list of cells, defined by pointers into a list of nodes. Thus information on shared faces and neighboring cells is not inherent in the data structure. These grids are sometimes called unstructured grids [40].

The algorithm presented here has been designed to be general enough to handle structured or unstructured grids. The grids and cells may be non-convex, and grids may contain voids or holes. Grids may have been constructed from multiple grid definitions (multi-block grids). Information on which cells share faces is not required, which is useful in applying the algorithm to unstructured datasets. Embedded geometrical primitives can also be rendered by the algorithm.

Linear arrays of cachable interleaved shared memory are used to store the grid nodes and scalar values. In addition, each face is represented in a separate data structure called a cell face. The cell face data structure contains, for each of its four vertices, indices into the grid node and scalar value arrays. It also contains a byte indicating which grid it belongs to (required for multi-block grid solutions) and a byte indicating whether the face is internal or external (required for handling non-convexity and voids in the grid).

The cell faces are divided up into sections called face groups by the dimensions of the grid. This division is motivated by the need to identify parallel tasks based on groups of faces. In this implementation for curvilinear grids, an xd x yd x zd grid will have xd + yd + zd face groups.
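The cell face record and the face-group decomposition for a curvilinear grid can be sketched as follows. This is an illustrative Python sketch rather than the original C++ code; the names `CellFace`, `face_groups`, and `cell_count` are hypothetical, and a single index per vertex is assumed to address both the node array and the scalar array.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CellFace:
    vertex_indices: Tuple[int, int, int, int]  # assumed: same index for node and scalar arrays
    grid_id: int      # which block of a multi-block grid the face belongs to
    external: bool    # True for an external face, False for an internal one


def face_groups(xd: int, yd: int, zd: int) -> List[Tuple[str, int, int]]:
    """List (axis, slice index, number of cell faces) for every face group."""
    groups = [("x", i, (yd - 1) * (zd - 1)) for i in range(xd)]
    groups += [("y", j, (xd - 1) * (zd - 1)) for j in range(yd)]
    groups += [("z", k, (xd - 1) * (yd - 1)) for k in range(zd)]
    return groups


def cell_count(xd: int, yd: int, zd: int) -> int:
    """Number of hexahedral cells in an xd x yd x zd node grid."""
    return (xd - 1) * (yd - 1) * (zd - 1)
```

For a 40 x 32 x 32 grid this gives 104 face groups and 37,479 cells.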
These face groups may be different sizes; for example, there will be xd face groups containing (yd - 1) x (zd - 1) cell faces each, yd face groups containing (xd - 1) x (zd - 1) cell faces, and zd face groups containing (xd - 1) x (yd - 1) cell faces. An array of xd + yd + zd pointers to cell faces is allocated. Each of the face groups is stored in cachable interleaved shared memory; the array containing the pointers to the face groups is propagated to each processor to be stored locally.

Curvilinear grids have a natural decomposition into this form since the grid is rectilinear in computational space. Unstructured grids such as those used in finite element analysis would need to be divided into groups of faces. This could be accomplished using a spatial decomposition of the original grid. The size of each group could be chosen so as to enhance load balancing of parallel functions which operate on one group at a time. Shared faces need only be represented once, and no information is required concerning which cells share a given face.

Overview of Basic Algorithm

An earlier approach operated on the cells of the volume, rather than the faces, and utilized a task decomposition based on scanlines [5]. The approach described here is faster

and exhibits better scalability. The basic algorithm represents the image as a set of square image tiles. The image tiles form the basis of the task decomposition. Processors dynamically acquire an image tile for rendering and perform raycasting for each pixel in the tile. Two techniques are used to reduce the ray-face intersection testing requirements. First, a parallel viewing sort creates for each tile a list of pointers to cell faces which project into that tile. This is done prior to rendering whenever the view has changed. Second, within each tile-rendering task a local active list is incrementally maintained. This means that for each pixel in the tile a list of cell faces which intersect the ray through that pixel is available. The cell faces on this active list are processed, each generating an intersection with the ray.

The design is object-oriented and the code is written in C++. Classes may be defined for all types of objects that may be rendered. These include curvilinear grids, unstructured grids, rectilinear grids, geometrical primitives, etc. Each of these classes provides virtual functions which maintain active lists if required (as in the curvilinear grids), perform intersection testing, and do the shading computations for resulting intersections. This paper describes only the class implementation for curvilinear grids.

The algorithm proceeds in three distinct phases: processing changes to the grid, processing changes to the view, and rendering the image. Not all phases need to be executed for each new image. For example, if the view has changed but the grid has not, then only the viewing sort and rendering phases are executed. If the transfer function is the only thing that has changed, then only the rendering phase is required. Figure 1 shows the main rendering loop.

[Figure 1: Overview of main functions.]

A master processor communicates with the user interface via a socket connection to receive updates to grid, scalar, view, and transfer function specifications. It is also responsible for compressing and sending images back to the user interface. In addition to these special tasks, the master participates in parallel tasks. The system has been designed in such a way (although this has not been implemented) that the master could also be communicating with an executing simulation (for example, to render images of specified time steps in an unsteady computation).

Volume Initialization

When a new grid is specified by the user, the grid and solution files are read into shared memory. Initialization of the cell faces is done in parallel. The task decomposition is by face group and these tasks are dynamically generated. This phase generates xd + yd + zd tasks during which indices into the grid and scalar arrays are computed and stored for each vertex of each cell face in the group. The most time consuming part of initializing a new grid is reading the data in from a file.

Parallel View Sort

The objective of the view sort is to create a list of pointers to pertinent cell faces for each image tile. A second important function of the view sort is to eliminate from further consideration any cell face whose bounding box falls entirely between two pixels, or entirely out of the image. An array of data structures called buckets is allocated, one for each tile in the image. Each bucket will contain an integer specifying the number of pointers in the bucket, and a pointer to an array of cell face pointers. The buckets are allocated in interleaved shared memory, but are declared to be uncachable. This is because atomic increments to the counts will be made as different processors add pointers to the shared lists. The address of the array of buckets is propagated to all the processors.

The parallel view sort consists of three phases. In the first phase, the grid nodes are multiplied by a matrix representing the viewing transformation.
This is done in parallel with each row of grid nodes constituting a task. There will be yd * zd tasks of length xd dynamically generated. Due to the inefficiency of memory management functions for shared memory, sorting of the cell faces into buckets has been split into counting and initialization phases with sequential allocation of the shared memory for the buckets in between. In particular, allocate and free commands are very slow for shared memory, and there is no reallocate function. In order to minimize use of these time-consuming functions, new memory for a bucket is allocated only if the new size required is greater than the existing size of the bucket. The counting phase involves computing the view-space bounding box for each cell face and counting the number of cell faces in each bucket. The task decomposition is by face group. Each task allocates a local array of integers called counts, one per image tile. Each cell face in the group computes its bounding box and increments the count for each image tile that it projects to. Once every cell face in the group has been processed, any non-zero local counts are atomically added to the shared bucket counts. When all the parallel tasks of the second phase are finished, the shared buckets each contain a total count of the number of pointers that will need to be stored in any given bucket. A sequential portion of the code allocates the necessary amount of cachable interleaved shared memory for each bucket that is not currently large enough, and stores the pointer to it in the bucket. The bucket counts are then set to zero. The third phase of the view sort initializes the list of pointers in each shared bucket. This is done in parallel by face group using the same approach as that used to generate counts for the buckets in phase two, except that now each task accumulates local lists of pointers for the buckets. 
When the entire face group has been processed, the shared bucket counts are atomically incremented and the lists of pointers accumulated locally for each bucket are copied to the shared lists.
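The counting, allocation, and initialization steps of the view sort can be illustrated with a sequential sketch. It is written in Python and makes several assumptions that are not in the original: each cell face is given as a hypothetical `(face_id, bbox)` pair with `bbox = (xmin, ymin, xmax, ymax)` in pixel coordinates, the per-group loops stand in for the parallel tasks, and plain additions stand in for the atomic updates of the shared counts. Culling of faces that fall between two pixels is omitted.

```python
def tiles_for_bbox(bbox, tile, tiles_x, tiles_y):
    """Indices of the image tiles overlapped by a view-space bounding box."""
    xmin, ymin, xmax, ymax = bbox
    tx0 = max(0, int(xmin // tile))
    tx1 = min(tiles_x - 1, int(xmax // tile))
    ty0 = max(0, int(ymin // tile))
    ty1 = min(tiles_y - 1, int(ymax // tile))
    # Empty when the box lies entirely outside the image.
    return [ty * tiles_x + tx
            for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]


def view_sort(face_groups, tile, tiles_x, tiles_y):
    ntiles = tiles_x * tiles_y

    # Counting phase: one task per face group, local counts merged at the end.
    counts = [0] * ntiles
    for group in face_groups:
        local = [0] * ntiles
        for _face_id, bbox in group:
            for t in tiles_for_bbox(bbox, tile, tiles_x, tiles_y):
                local[t] += 1
        for t, c in enumerate(local):
            if c:
                counts[t] += c  # atomic add in the parallel version

    # Sequential step: size every bucket once, then reset the fill counts.
    buckets = [[None] * c for c in counts]
    fill = [0] * ntiles

    # Initialization phase: local lists per group, then one block copy per bucket.
    for group in face_groups:
        local_lists = {}
        for face_id, bbox in group:
            for t in tiles_for_bbox(bbox, tile, tiles_x, tiles_y):
                local_lists.setdefault(t, []).append(face_id)
        for t, ids in local_lists.items():
            start = fill[t]
            fill[t] += len(ids)  # atomic increment reserves the slots
            buckets[t][start:start + len(ids)] = ids
    return buckets
```

Reserving a block of slots with a single increment per bucket and group, rather than one increment per face, mirrors the reason given above for accumulating counts and pointer lists locally first.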

Parallel Rendering

In the rendering phase, task decomposition is by image tiles. Since each tile contains a varying number of cell faces to be rendered, dynamic task generation is essential for good load balancing. In addition, the tasks are sorted by size and the largest tasks are allocated first. The size of a task is defined to be the number of cell faces in the bucket for that task.

Within each tile, the approach taken to reduce the number of intersection calculations required for each ray utilizes the idea of a bucket sort and scanline algorithm from computer graphics [11, 16]. The algorithm presented here uses a y-bucket sort followed by an x-bucket sort to create an active list of cell faces for each ray. At the beginning of each new scanline, the intersections of the scanline with edges of each cell face active on that scanline are computed using linear interpolation and stored. Each cell face is represented as two triangles for the purposes of interpolation because the faces of the curved hexahedra are not necessarily planar. It would also be possible to estimate the closest planar polygon to represent the cell face; however, representation of each cell face as two triangles has the advantage that interpolation is then rotationally invariant. At each pixel, intersections along the scanline are computed using linear interpolation of the edge intersections of the cell face. Each computed intersection is put on a depth-sorted intersection list.

After all cell faces have been processed, the intersection list is traversed to compute the color and opacity for the pixel. For each intersection on the list, the shading calculation uses the scalar value for the current intersection, the scalar value for the next intersection on the list, and the distance between them. The volume density optical model proposed by Williams and Max [52] is used to compute the contribution from each cell.
The optical density is assumed to be independent of wavelength, thus one exponential is evaluated for each cell intersected by the ray.

Given an image tile to be rendered and the bucket from the view sort, the algorithm proceeds as follows:

• Set up a y-bucket sort based on the cell faces in the bucket for this tile.
• Process each scanline in the tile. For every scanline in the tile:
    • Update the y-active list using the y-bucket sort to incrementally maintain a list of cell faces active on this scanline. Allocate local edge intersection storage for newly active cells.
    • Compute and store edge intersections of cell faces active on this scanline.
    • Set up the x-bucket sort based on cell faces in the y-active list (those that are active somewhere on this scanline).
    • Process each pixel in the scanline. For every pixel in a scanline:
        • Update the x-active list using the x-bucket sort to incrementally maintain a list of cell faces active for this pixel. The x-active list contains pointers to all of the cell faces that are intersected by the ray through this pixel.
        • Create an empty pixel and an intersection list.
        • For each cell face on the x-active list:
            - Compute the intersection by linear interpolation of edge intersections, add to depth-sorted intersection list.
        • For each intersection on the depth-sorted intersection list (or until the pixel is opaque):
            - Calculate shading contribution and composite the computed color and opacity into the current pixel.
        • Store the pixel in the image.

Intersection List Caching

A second algorithm has been implemented in which the intersection lists are explicitly cached in local memory. A similar approach has previously been utilized to speed computation of successive ray-traced images with changing lighting conditions and surface properties [37]. The basic rendering algorithm proceeds in the same way, but sampling and compositing have been split into two phases.
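The compositing half of this split can be sketched as a function that walks a cached, depth-sorted intersection list and applies whatever transfer function is current. The Python sketch below is illustrative only. It assumes each cached entry is a hypothetical `(depth, scalar)` pair, that the transfer function returns `(r, g, b, density)`, and that color and density are averaged between consecutive intersections. It uses a generic exponential absorption term with one exponential per segment; it is not claimed to reproduce the Williams and Max model [52], and it ignores the internal/external face flag needed to skip voids.

```python
import math


def composite(intersections, transfer, opaque=0.999):
    """Front-to-back compositing of one pixel from a cached intersection list."""
    color = [0.0, 0.0, 0.0]
    alpha = 0.0
    # Each consecutive pair of intersections is treated as one cell segment.
    for (d0, s0), (d1, s1) in zip(intersections, intersections[1:]):
        r0, g0, b0, rho0 = transfer(s0)
        r1, g1, b1, rho1 = transfer(s1)
        rho = 0.5 * (rho0 + rho1)
        # One exponential per segment (density independent of wavelength).
        seg_alpha = 1.0 - math.exp(-rho * (d1 - d0))
        weight = (1.0 - alpha) * seg_alpha
        color[0] += weight * 0.5 * (r0 + r1)
        color[1] += weight * 0.5 * (g0 + g1)
        color[2] += weight * 0.5 * (b0 + b1)
        alpha += weight
        if alpha >= opaque:
            break  # the pixel is effectively opaque
    return tuple(color), alpha
```

Because the cached list depends only on the grid and the view, calling `composite` again with a different `transfer` updates the pixel without repeating any intersection work.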
The motivation for doing this is to attain fast image updates for a changing transfer function. Finding a transfer function which brings out features of interest in the data can be a time-consuming process, and this approach helps to alleviate that.

The algorithm begins by dynamically allocating tasks for the sampling phase, keeping track of which tiles are stored on a given processor. The sampling phase is only necessary if the grid or viewing transformation has changed. Task decomposition is by image tile, and these tasks are dynamically allocated. Each processor begins by allocating intersection lists for every pixel in the tile that has been assigned to it. It then executes just the sampling portion of the algorithm for that tile.

The compositing phase will be executed following a sampling phase, or alone if just the transfer function has changed. Static task generation is used (one task per processor), with each processor computing pixels from the stored intersection lists for the tiles assigned to it during the sampling phase. For each pixel of each tile resident on a given processor the intersection list is processed as described in the previous section.

BENCHMARKING RESULTS

The basic parallel volume raycasting algorithm described above has been benchmarked on several volumes and for multiple views of these volumes. Four curvilinear data sets obtained from NASA Ames Research Center were used in this study. The first is the blunt fin dataset [17]. This dataset represents a CFD simulation of air flow past a blunt fin on a grid resolution of 40 x 32 x 32, or 37,479 cells. An image of this dataset is presented in Color Fig. 4. The second is the post dataset [32], which was obtained from a numerical study of three-dimensional incompressible flow around multiple posts and has a grid resolution of 38 x 76 x 38 giving 102,675 cells (see Color Fig. 5). The third dataset is the delta wing dataset, taken from a study of vortical flow over a delta wing [12].
It contains 91 x 51 x 51 grid nodes, or 225,000 cells (Color Fig. 6). The fourth is the shuttle dataset [26], a multi-block grid consisting of 9 grids with a total of 885,898 cells. This dataset represents flow computations of the space shuttle ascent aerodynamics. Images of the shuttle dataset are given in Color Figures 7 & 8.

The tile size was set to 8 x 8 pixels for 256 x 256 images and 16 x 16 for 512 x 512 images. This fixes the number of rendering

tasks (tiles) at 1024, since both configurations give 32 x 32 tiles.

Speedup studies were done only on the blunt fin dataset because the other datasets cause excessive paging when rendered on a single processing node. Figure 2 shows the execution times for the blunt fin dataset. In the view rendered, the volume is rotated about both the X and Y axes (Color Fig. 4). Figure 3 shows the speedup graph. The results show that the rendering phase alone exhibits good scalability with efficiency at 90% for n = 100. Scalability of the combined view sort and rendering phases is lower, indicating diminishing returns for parallelization of the view sort.

[Figure 2: Execution Times for the Blunt Fin Dataset. View sort and render: 376 seconds with n = 1, 5 seconds with n = …. Render only: 353 seconds with n = 1, 4 seconds with n = 100.]

[Figure 3: Speedup for the Blunt Fin Dataset.]

Table 1 gives representative execution times for the view sort using 100 processors. It was found to be much more difficult to obtain consistent measurements for the view sort than for the rendering phase. This is probably due to three things: the small task size, the overhead of going parallel three times, and the atomic updates required for initializing the buckets.

[Table 1: Execution Times in Seconds for View Sort.]

Table 2 shows the execution times of the rendering phase for the post, delta wing, and shuttle datasets. The results indicate near-linear speedup for the post and delta wing datasets, with less scalability exhibited on the shuttle dataset.

[Table 2: Rendering Phase Execution Times in Seconds.]

Both task generation overhead and load imbalance were explicitly measured. The percentages given for task generation and load imbalance are percentages of the total execution time. Each processor keeps track of how much time it spends on any given task. Let n be the number of processors and t_i be the time spent on all the tasks allocated to processor i.
Let t,,, be the maximum of all the ti. Then the total rendering time is n x t,,, and the load imbalance can be shown as a percentage of this. The percentage increase in ELI ti as n increases shows the overhead due to switch contention for remote memory accesses. An analysis of the overheads involved shows load imbalance to be the primary inhibiting factor for the shuttle dataset. A taskadaptive approach to task generation [46, 301 may improve the efficiency on such a complex and widely-varying grid. Table 3 gives execution times in seconds for the twophase approach in which the intersection lists are cached. Since this algorithm is memory intensive and is only intended to be used on many processors, testing was performed using 100 processors. Since the goal is the fastest possible image updates, only 256 images were generated. Blunt Fin Post Delta Wing Shuttle 2 Sample Composite Table 3: Execution Times for Intersection Caching 86
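The scalability measures used in this section can be illustrated with a short Python sketch. It is not the benchmarking code used for the paper. The function names are hypothetical, and the load imbalance formula is an assumption: the text only states that load imbalance is shown as a percentage of $n \times t_{max}$, so the sketch takes it to be the idle time $n \times t_{max} - \sum_{i=1}^{n} t_i$ divided by that total. The per-processor times in the example are invented for illustration; the 353 and 4 second values are the render-only timings quoted with Figure 2.

```python
from typing import Sequence, Tuple


def speedup_and_efficiency(t_one: float, t_n: float, n: int) -> Tuple[float, float]:
    """Return (speedup, efficiency) for a run on n processors.

    t_one is the execution time on one processor, t_n the time on n.
    """
    speedup = t_one / t_n
    return speedup, speedup / n


def load_imbalance_fraction(t: Sequence[float]) -> float:
    """Fraction of the total rendering time n * t_max that is idle time.

    t[i] is the time processor i spent on all tasks allocated to it.
    Assumed definition: (n * t_max - sum(t)) / (n * t_max).
    """
    n = len(t)
    t_max = max(t)
    total = n * t_max
    return (total - sum(t)) / total


if __name__ == "__main__":
    # Render-only timings for the blunt fin dataset, as quoted with Figure 2.
    s, e = speedup_and_efficiency(353.0, 4.0, 100)
    print(f"speedup {s:.2f}, efficiency {e:.1%}")

    # Hypothetical per-processor busy times, for illustration only.
    busy = [4.0, 3.0, 2.0, 3.0]
    print(f"load imbalance {load_imbalance_fraction(busy):.1%}")
```

Because the quoted timings are rounded to whole seconds, the efficiency computed from them is only approximate, and it is broadly in line with the 90% figure reported for n = 100.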

CONCLUSIONS

A scalable approach to parallel volume raycasting of structured and unstructured computational grids has been presented. The algorithm is general enough to handle non-convex grids and cells, grids with voids, grids constructed from multiple grids (multi-block grids), and embedded geometrical primitives. The algorithm is designed for a highly parallel MIMD architecture which features both local memory and shared memory with non-uniform access times. A variation of the algorithm which provides fast image updates for a changing transfer function has also been presented.

The approach was found to be generally efficient, with a minimal amount of overhead from task generation and remote memory accesses. The parallel view sort, required whenever either the grid or the viewing specification has changed, was determined to be the least scalable portion of the algorithm. The rendering phase of the algorithm exhibits high scalability for a majority of the grids benchmarked. On the most complex grid, the shuttle, load imbalance was found to limit the scalability of the rendering phase. Currently under investigation are the further use of coherence and local caching to speed up the rendering process within a given task, more scalable approaches to the parallel view sort, and techniques for load balancing extremely complex grids.

A distributed graphical user interface which is used to control the remotely executing volume renderer has also been presented. The design of the entire system is such that the volume renderer could communicate with a simultaneously executing simulation on the massively parallel host. This approach could lead to the ability to steer a simulation based on visual feedback of its progress.

ACKNOWLEDGEMENTS

I would like to acknowledge the support and encouragement of Jane Wilhelms, Nelson Max, and Charlie McDowell. Thanks also to the staff of the National Energy Research Supercomputing Center at LLNL for use of the BBN TC2000.
Funds for the support of this study have been allocated by a cooperative agreement with NASA-Ames Research Center, Moffett Field, California, under Interchange No. NCA2-430, and by the National Science Foundation, Grant Number ASC

REFERENCES

[1] BADOUEL, D., AND PRIOL, T. An Efficient Parallel Ray Tracing Scheme for Highly Parallel Architectures. In Proceedings of the Fifth Eurographics Workshop on Graphics Hardware (September 1990).
[2] BBN ADVANCED COMPUTERS, INC. Inside the TC2000 Computer, preliminary ed., August.
[3] BURKE, A., AND LELER, W. Parallelism and Graphics: an Introduction and Annotated Bibliography. In SIGGRAPH Course Notes: Parallel Algorithms and Architectures for 3D Image Generation (1990).
[4] CHALLINGER, J. Parallel Volume Rendering on a Shared-Memory Multiprocessor. Tech. Rep. UCSC-CRL-91-23, University of California, Santa Cruz.
[5] CHALLINGER, J. Parallel Volume Rendering for Curvilinear Volumes. In Proceedings of the Scalable High Performance Computing Conference (April 1992), IEEE Computer Society Press.
[6] CORRIE, B., AND MACKERRAS, P. Parallel Volume Rendering and Data Coherence on the Fujitsu AP1000. Tech. Rep. TR-CS, Department of Computer Science, The Australian National University.
[7] DANSKIN, J., AND HANRAHAN, P. Fast Algorithms for Volume Ray Tracing. In 1992 Workshop on Volume Visualization (1992), ACM.
[8] DREBIN, R. A., CARPENTER, L., AND HANRAHAN, P. Volume Rendering. Computer Graphics 22, 4 (1988), 65-74. Proceedings of SIGGRAPH '88.
[9] ELVINS, T. T. Volume Rendering on a Distributed Memory Parallel Computer. In Visualization '92 (October 1992), IEEE.
[10] FLETCHER, C. A. J. Computational Techniques for Fluid Dynamics. Springer-Verlag.
[11] FOLEY, J., AND DAM, A. V. Fundamentals of Interactive Computer Graphics. Addison-Wesley Publishing Company.
[12] FUJII, K., GAVALI, S., AND HOLST, T. Vortical Flow over a Delta Wing Computation. In 5th International Conf. on Numerical Methods in Laminar and Turbulent Flow (July 1987). Montreal, Quebec.
[13] GARRITY, M. P. Raytracing Irregular Volume Data. Computer Graphics 24, 5 (November 1990). Proceedings of the San Diego Workshop on Volume Visualization.
[14] GIERTSEN, C. Volume Visualization of Sparse Irregular Meshes. IEEE Computer Graphics and Applications 12, 2 (March 1992).
[15] GOODSELL, D. S., AND OLSON, A. J. Molecular Applications of Volume Rendering and 3-D Texture Maps. In Proceedings of the Chapel Hill Workshop on Volume Visualization (1989), Department of Computer Science, University of North Carolina at Chapel Hill.
[16] HEARN, D., AND BAKER, M. P. Computer Graphics. Prentice-Hall, Inc.
[17] HUNG, C. H., AND BUNING, P. G. Simulation of Blunt-Fin Induced Shock Wave and Turbulent Boundary Layer Interaction. Journal of Fluid Mechanics 154 (1985).
[18] KABA, J., MATEY, J., STOLL, G., TAYLOR, H., AND HANRAHAN, P. Interactive Terrain Rendering and Volume Visualization on the Princeton Engine. In Visualization '92 (October 1992), IEEE.
[19] KOYAMADA, K. Fast Traversal of Irregular Volumes. In Visual Computing - Integrating Computer Graphics and Computer Vision, T. L. Kunii, Ed. Springer Verlag, 1992.
[20] LAUR, D., AND HANRAHAN, P. Hierarchical Splatting: A Progressive Refinement Algorithm for Volume Rendering. Computer Graphics 25, 4 (1991). Proceedings of SIGGRAPH '91.
[21] LEVOY, M. Design for a Real-Time High-Quality Volume Rendering Workstation. In Proceedings of the Chapel Hill Workshop on Volume Visualization (1989), Department of Computer Science, University of North Carolina at Chapel Hill.
[22] LEVOY, M. Display of Surfaces from Volume Data. IEEE Computer Graphics and Applications 8, 3 (1988).
[23] LEVOY, M. Display of Surfaces From Volume Data. PhD thesis, The University of North Carolina at Chapel Hill.
[24] LEVOY, M. Efficient Ray Tracing of Volume Data. ACM Transactions on Graphics 9, 3 (1990).
[25] LUCAS, B. A Scientific Visualization Renderer. In Visualization '92 (October 1992), IEEE.
[26] MARTIN, JR., F. W., AND SLOTNICK, J. P. Flow Computations for the Space Shuttle in Ascent Mode Using Thin-Layer Navier-Stokes Equations. In Progress in Astronautics and Aeronautics, Vol. 125, P. Henne, Ed. American Institute of Aeronautics and Astronautics, Washington, D.C., 1990.
[27] MAX, N., HANRAHAN, P., AND CRAWFIS, R. Area and Volume Coherence for Efficient Visualization of 3D Scalar Functions. Computer Graphics 24, 5 (1990). Proceedings of the San Diego Workshop on Volume Visualization.
[28] MONTANI, C., PEREGO, R., AND SCOPIGNO, R. Parallel Volume Visualization on a Hypercube Architecture. In 1992 Workshop on Volume Visualization (1992), ACM.
[29] NEUMANN, U. Interactive Volume Rendering on a Multicomputer. In 1992 Symposium on Interactive 3D Graphics (March 1992), ACM.
[30] NIEH, J., AND LEVOY, M. Volume Rendering on Scalable Shared-Memory MIMD Architectures. In 1992 Workshop on Volume Visualization (1992), ACM.
[31] PLOT3D User's Manual. National Aeronautics and Space Administration, Fluid Dynamics Division, NASA Ames Research Center.
[32] ROGERS, S. E., KWAK, D., AND KAUL, U. K. A Numerical Study of Three-Dimensional Incompressible Flow Around Multiple Posts. AIAA Paper, Reno, Nevada.
[33] SABELLA, P. A Rendering Algorithm for Visualizing 3D Scalar Fields. Computer Graphics 22, 4 (1988). Proceedings of SIGGRAPH '88.
[34] SAKAS, G., AND HARTIG, J. Interactive Visualization of Large Scalar Voxel Fields. In Visualization '92 (October 1992), IEEE.
[35] SCHRÖDER, P., AND STOLL, G. Data Parallel Volume Rendering as Line Drawing. In 1992 Workshop on Volume Visualization (1992), ACM.
[36] SCHROEDER, P., AND SALEM, J. B. Fast Rotation of Volume Data on Data Parallel Architectures. In Course Notes 8: State of the Art in Volume Visualization (1991), ACM SIGGRAPH '91 Conference.
[37] SEQUIN, C. H., AND SMYRL, E. K. Parameterized Ray Tracing. Computer Graphics 23, 3 (July 1989). Proceedings of SIGGRAPH '89.
[38] SHIRLEY, P., AND TUCHMAN, A. A Polygonal Approximation to Direct Scalar Volume Rendering. Computer Graphics 24, 5 (November 1990). Proceedings of the San Diego Workshop on Volume Visualization.
[39] SORENSON, R. L., AND MCCANN, K. Grapevine: Grids About Anything by Poisson's Equation in a Visually Interactive Networking Environment. In Computational Aerosciences Conference Compendium of Abstracts (1992). NASA Ames Research Center.
[40] SPERAY, D., AND KENNON, S. Volume Probes: Interactive Data Exploration on Arbitrary Grids. Computer Graphics 24, 5 (November 1990). Proceedings of the San Diego Workshop on Volume Visualization.
[41] UPSON, C., AND KEELER, M. VBUFFER: Visible Volume Rendering. Computer Graphics 22, 4 (1988). Proceedings of SIGGRAPH '88.
[42] VAN GELDER, A., AND WILHELMS, J. Rapid Exploration of Curvilinear Grids Using Direct Volume Rendering. In Proceedings of Visualization '93 (October 1993), IEEE. To appear.
[43] VÉZINA, G., FLETCHER, P. A., AND ROBERTSON, P. K. Volume Rendering on the MasPar MP-1. In 1992 Workshop on Volume Visualization (1992), ACM.
[44] WESTOVER, L. Interactive Volume Rendering. In Conference Proceedings of the Chapel Hill Workshop on Volume Visualization (1989), Department of Computer Science, University of North Carolina at Chapel Hill.
[45] WESTOVER, L. Footprint Evaluation for Volume Rendering. Computer Graphics 24, 4 (1990). Proceedings of SIGGRAPH '90.
[46] WHITMAN, S. Multiprocessor Methods for Computer Graphics Rendering. Jones and Bartlett.
[47] WILHELMS, J., CHALLINGER, J., ALPER, N., RAMAMOORTHY, S., AND VAZIRI, A. Direct Volume Rendering of Curvilinear Volumes. Computer Graphics 24, 5 (November 1990). Proceedings of the San Diego Workshop on Volume Visualization.
[48] WILHELMS, J., AND VAN GELDER, A. A Coherent Projection Approach for Direct Volume Rendering. Computer Graphics 25, 4 (1991). Proceedings of SIGGRAPH '91.
[49] WILLIAMS, P. L. Interactive Direct Volume Rendering of Curvilinear and Unstructured Data. PhD thesis, University of Illinois at Urbana-Champaign.
[50] WILLIAMS, P. L. Interactive Splatting of Nonrectilinear Volumes. In Visualization '92 (October 1992), IEEE.
[51] WILLIAMS, P. L. Visibility Ordering Meshed Polyhedra. ACM Transactions on Graphics 11, 2 (April 1992).
[52] WILLIAMS, P. L., AND MAX, N. A Volume Density Optical Model. In 1992 Workshop on Volume Visualization (1992), ACM.
[53] ZIENKIEWICZ, O. C., AND TAYLOR, R. L. The Finite Element Method. McGraw-Hill Book Company.

Figure 4: Blunt Fin
Figure 5: Post View 2
Figure 6: Delta Wing View 1
Figure 7: Shuttle View 2
Figure 8: GUI with Shuttle View 3
