A Parallel Framework for Simplification of Massive Meshes


Dmitry Brodsky
Department of Computer Science
University of British Columbia
dima@cs.ubc.ca

Jan Bækgaard Pedersen
School of Computer Science, Howard R. Hughes College of Engineering
University of Nevada, Las Vegas
matt@cs.unlv.edu

Abstract

As polygonal models rapidly grow to sizes orders of magnitude larger than the memory of commodity workstations, parallel mesh simplification algorithms become a viable approach to simplifying such models. A naïve approach that divides the model into a number of equally sized chunks and distributes them to a number of potentially heterogeneous workstations is bound to fail; in severe cases the computation becomes virtually impossible because memory thrashing slows it down dramatically. We present a general parallel framework for simplification of very large meshes. The framework ensures near-optimal utilization of the computational resources in a cluster of workstations by partitioning the model intelligently. This partitioning ensures high-quality output, low runtime through intelligent load balancing, and high parallel efficiency through full memory utilization on each machine, thus guaranteeing that the virtual memory system does not thrash. To test the usability of our framework we implemented a parallel version of R-Simp [Brodsky and Watson 2000].

1 Introduction

When considering parallel model simplification algorithms, two key factors come into play. First, the computation should be as fast as possible; this is often one of the main reasons for parallelizing an application. Second, since the model is substantially simplified, guaranteeing high-quality output is of utmost importance. In an attempt to optimize both of these factors we present a parallel framework for simplifying large polygonal meshes. To demonstrate its usability, we used the R-Simp [Brodsky and Watson 2000] algorithm within the framework to create a parallel model simplification algorithm; note that any model simplification algorithm could have been used instead of R-Simp.

To increase speed, simply dividing the model into as many parts as there are available workstations does not suffice. If the model is so large that even one of its parts cannot fit into the physical memory of a workstation, performance is seriously penalized by thrashing of the virtual memory system. Thus, naïvely partitioning the model into N equally large chunks, one for each machine, is not a viable solution. One of the main objectives of our framework is to automatically partition the model into a number of equally sized chunks and then group these into bundles of various sizes, each bundle tailored to fit the performance characteristics of the machine for which it is targeted. The size of a chunk is determined by the size of the smallest memory of any machine in the cluster. When performing the initial partitioning and assigning chunks to a machine, we cluster the chunks as much as possible in order to maintain the quality of the output; this matters because each machine might process more than one of these initial chunks at a time. If the total physical memory of the cluster exceeds the size of the model, not all machines are needed to simplify the model; enough machines are chosen such that each makes optimal use of its memory. The choice of chunk size guarantees that thrashing does not occur. The speed of each machine's CPU is also taken into account: more chunks are assigned to machines with faster CPUs. By combining these two performance measures we guarantee the following: The algorithm is not slowed down by thrashing of the virtual memory system.
The total execution time is kept as low as possible by utilizing each machine's memory to the fullest and by assigning more work to faster machines. The quality of the output is as high as possible because as many chunks are processed at the same time as memory allows. The algorithm is not slowed down because too much work was assigned to a slow machine, which would stall the entire computation.

The framework is responsible for partitioning the original model and reassembling the simplified model. Thus, any simplification algorithm can be used with our framework, since each instance of the simplification algorithm is solely responsible for simplifying the portion of the model assigned to it by the framework. Any number of machines with various CPU speeds and memory sizes can be utilized. The intelligent partitioning guarantees (1) no memory thrashing, since each chunk is guaranteed to fit into core memory, (2) optimal memory usage, since multiple chunks are coalesced to maximize memory use, and (3) good load balancing, since work is assigned in proportion to CPU speed and memory size.

2 Related work

Over the last decade the focus of mesh simplification has been on output quality and execution time. Recently, the focus has shifted to

issues pertaining to the simplification of massive meshes. The polygon count of models is steadily increasing, for example in The Digital Michelangelo Project [Levoy et al. 2000], producing models that conventional mesh simplification algorithms [Garland and Heckbert 1997; Hoppe et al. 1993; Hinker and Hansen 1993; Kalvin and Taylor 1996; Schroeder et al. 1992; Turk 1992] are simply unable to handle. Conventional algorithms access the original mesh throughout the simplification process; thus, implicitly, the model must fit into core memory to achieve reasonable execution time. This is not possible with today's billion-polygon meshes. Algorithms that perform simple uniform clustering of vertices [Low and Tan 1997; Rossignac and Borrel 1993] can handle these large meshes, but their output quality tends to be low.

The majority of the research has focused on performing simplification sequentially. In doing so, two main issues have to be addressed when dealing with models that are too big to fit into core memory. First, the size of the in-core data structures should be independent of the input model size. Second, the representation of out-of-core data should allow for efficient access. Ensuring that in-core data structures remain smaller than core memory can be done in several ways. Prince [Prince 2000] developed a version of Hoppe's [Hoppe 1996] progressive mesh algorithm that handles large models. The algorithm spatially partitions the data set into smaller chunks; each chunk is simplified by a slightly modified version of the original algorithm, and the simplified parts are then stitched back together. Shaffer and Garland [Shaffer and Garland 2001] take an approach similar to Brodsky and Watson [Brodsky and Watson 2000], except that they use BSP trees and only collect summary quadric information as they perform a linear pass through the model. Lindstrom [Lindstrom and Turk 1998] employs quadric [Garland and Heckbert 1997] information to reduce the size of in-core data structures.
Garland and Shaffer [Garland and Shaffer 2002] and Lindstrom [Lindstrom 2000; Lindstrom and Silva 2001] employ uniform vertex clustering [Rossignac and Borrel 1993] to increase the size-handling capabilities of their algorithms and to decrease simplification time. Finally, Choudhury and Watson [Choudhury and Watson 2002] took the novel approach of examining data access patterns and the behaviour of the virtual memory system; data is rearranged to reduce the number of page faults and to enable the system to prefetch data more effectively.

A large amount of complementary research [Cignoni et al. 2003; El-Sana and Chiang 2000; Guéziec et al. 1999; Isenburg and Gumhold 2003; Rossignac 1999] has been done on model data representation and compression to allow efficient access from disk. The increase in processor speed is far outstripping the increase in disk speed; hence, as models grow larger and processors become faster, the disk eventually becomes the bottleneck. Compression techniques [Guéziec et al. 1999; Isenburg and Gumhold 2003; Rossignac 1999] allow the data to be read quickly, in a sequential manner, provided that the decompressor is fast. If the goal is to access the data randomly, then rewriting the data for more efficient access [Cignoni et al. 2003; El-Sana and Chiang 2000] is the better choice.

3 Design

The framework is designed to work in a cluster environment composed of a set of nodes (machines). The framework provides mechanisms to (1) partition the model data into chunks, (2) distribute the chunks to the nodes in the cluster, (3) execute the simplification algorithm on each node, and (4) gather the results and compose them into a single simplified model. We believe that any existing mesh simplification algorithm can operate within this framework. We use the term guest to refer to an instance of some simplification algorithm.
To operate within this framework the guest algorithm requires a few minor changes (see Section 3.2) but, on the whole, remains unchanged. All communication and distribution mechanisms are the responsibility of the framework.

3.1 Partitioning

The model is partitioned by spatially subdividing the surface. The number and size of the chunks are dictated by the characteristics of the cluster: the number of nodes, the amount of core memory on each node, and the processor speed of each node. We have two main goals when partitioning the model: first, to use the resources in the cluster optimally, and second, to maintain good output quality. The size of a chunk is dictated by the smallest core memory. A node receives a bundle, which consists of one or more chunks; the size of a node's core memory and its processor speed dictate the size of the bundle. If a chunk is too big and does not fit into core memory, thrashing ensues and hurts performance; this does not occur within our framework. Additionally, we want to reduce the number of nodes used in order to minimize communication overhead, which translates to shorter simplification times. Thus, we create chunks that fit exactly into the core memory of the nodes.

To maintain good output quality we also need to minimize the number of chunks. The simplification steps in most algorithms are globally ordered by some criterion, such as distortion. When the simplification algorithm is run on many chunks rather than on one big model, this global ordering is lost, which can reduce output quality. Thus, we limit the number of chunks to reduce the loss in quality.

3.2 Algorithm modification

The guest algorithm requires some minor modifications to operate within our framework. The framework passes the guest algorithm a bundle of chunks of the model and a set B of boundary vertices.
These boundary vertices define the faces that span chunks, and their treatment depends on the guest algorithm. Once the chunks are simplified they are stitched together by the framework into the simplified model. To accomplish this, the framework must know how chunks relate to one another; this relationship is tracked through the vertices in B. The framework requires the guest algorithm to provide a mapping from the vertices in B to a corresponding set B' of vertices in the simplified chunk. The size of B depends on the size of the model, the size of the polygons, and the number of chunks the model is partitioned into. On average we have found that the size of B is between 0.1% and 1% of the number of vertices in the original model.

4 Implementation

In this section we describe the implementation of our framework and of one guest simplification algorithm; we chose to implement R-Simp [Brodsky and Watson 2000]. We start by briefly describing the R-Simp algorithm. For the remainder of the section, unless explicitly stated otherwise, we are describing the framework implementation. The framework organizes the processing in a master/slave configuration. We use the LAM [LAM 2003] implementation of MPI for communication and management of nodes within the cluster. The master partitions the mesh and sends the bundles of chunks to the slaves. The slaves simplify the chunks and send the simplified chunks back to the master. The master stitches the resulting chunks into the final simplified model.
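The chunk-and-bundle policy of Section 3.1 can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, the node-list format, and the greedy assignment order are our assumptions. Chunks are sized to the smallest core memory so that every chunk fits on every node, and a node's bundle holds as many chunks as its own memory allows, so not all machines need be used.

```python
def make_bundles(n_vertices, bytes_per_vertex, nodes):
    """Sketch of the partition-and-bundle policy. `nodes` is a list of
    (name, core_memory_bytes) pairs; names and structure are illustrative."""
    smallest = min(mem for _, mem in nodes)
    chunk_vertices = smallest // bytes_per_vertex   # every chunk fits every node
    n_chunks = -(-n_vertices // chunk_vertices)     # ceiling division
    bundles, next_chunk = {}, 0
    for name, mem in nodes:
        # a node can hold as many chunks as smallest-memory units fit in it
        take = min(mem // smallest, n_chunks - next_chunk)
        bundles[name] = list(range(next_chunk, next_chunk + take))
        next_chunk += take
        if next_chunk == n_chunks:
            break                                   # not all machines are needed
    return chunk_vertices, bundles
```

With two hypothetical nodes of 256 MB and 512 MB and R-Simp's ~360 bytes per vertex, a two-million-vertex model splits into three chunks, one for the small node and two for the large one.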

Figure 1: Stages of partitioning the model. (a) Project the vertices onto the sides of the bounding box; based on the density of the vertices, determine the number of chunks for the side. (b) Based on the width of the side, compute the partitioning in one direction. (c) Based on the length of the bounding box, compute the partitioning in the other direction. This gives a partitioning for two sides, plus initial conditions for the other four sides. (d) Compute the partitioning for the other four sides based on the density of the vertices.

4.1 R-Simp

The R-Simp algorithm begins by reading the original model into main memory. The model is contained within a single cluster, which forms the root node of an n-ary tree; a cluster is a collection of vertices and faces from the original model. The initial cluster is subdivided into eight sub-clusters, and these sub-clusters are then iteratively subdivided until the required number of clusters is reached. The decision to subdivide a cluster is based on the amount of variation in the orientation of the faces in the cluster (i.e., curvature). The final set of clusters represents vertices that lie on the simplified surface. Next, these vertices are triangulated to form the faces of the new surface. Finally, the new vertices and faces are written to an output file.

4.2 Partitioning in the framework

The master begins by reading in an axis-aligned bounding box for the model. The bounding box is pre-computed but could easily be computed as part of the initial step. Next the master requests the characteristics of the slaves. Each slave sends a pair of values consisting of its processor speed and the size of its core memory. Using this information the master computes the initial partitioning of the model. The initial partitioning is based on the number of vertices in the model, the smallest core memory, and the area of the model's bounding box.
The guest simplification algorithm must specify its memory requirement, m, in bytes per vertex; for R-Simp it is approximately 360 bytes per vertex. Thus a model of n vertices requires m * n bytes of memory. We then compute the chunk size. The chunk size is computed for the lowest common denominator, the node with the smallest core memory, to ensure that all chunks fit into the core memory of all nodes. Let vn_c = (smallest core memory) / m be the number of vertices in a chunk. Once vn_c has been computed, we proceed to compute the partitioning. The partitioning is a uniform subdivision of the model's bounding box with axis-aligned planes; we compute the number of planes in the x, y, and z directions. We assume that the vertices are evenly distributed on the surface of the bounding box. This is not true in reality, but it provides a good approximation and allows us to reason about the partitioning in terms of surface area. We compute vps_i, the number of vertices per side, based on the density of vertices that lie on a side, s_i, of the bounding box (Figure 1a); the bounding box has six sides s_1...s_6. Thus vps_i is computed as follows:

    vps_i = n * (area of s_i) / (total area), for a side s_i.

The number of chunks for a side s_i is then

    cps_i = vps_i / vn_c.

Once the number of chunks on a side is known, we must determine the actual partitioning for the side. Let l_i and w_i be the length and width, respectively, of side s_i, with w_i <= l_i. Since the number of chunks is proportional to the area, the number of partitions along w_i (Figure 1b) is

    p_w = sqrt(cps_i * w_i / l_i).

Similarly, the number of partitions along l_i (Figure 1c) is

    p_l = p_w * l_i / w_i.

To obtain the complete partitioning of the bounding box, we determine the partitioning for the side defined by the x and y axes (Figure 1c). We then take the resulting partitioning along x and determine the partitioning for the side defined by the x and z axes (Figure 1d).
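In code, the per-side plane counts work out as follows. This is a sketch of the formulas above; the function and parameter names are ours, and rounding the real-valued p_w and p_l to whole counts is an assumption the paper does not spell out.

```python
import math

def side_partitions(n_vertices, side_areas, side_dims, chunk_vertices):
    """For each bounding-box side s_i with area a_i and dimensions (w_i, l_i),
    w_i <= l_i: vertices are assumed uniform over the box surface, so
    vps_i = n * a_i / total_area, cps_i = vps_i / vn_c, and the side's chunk
    grid is p_w x p_l with p_w = sqrt(cps_i * w_i / l_i), p_l = p_w * l_i / w_i."""
    total = sum(side_areas)
    grids = []
    for a, (w, l) in zip(side_areas, side_dims):
        vps = n_vertices * a / total        # vertices expected on this side
        cps = vps / chunk_vertices          # chunks needed for this side
        p_w = math.sqrt(cps * w / l)
        p_l = p_w * l / w                   # p_w * p_l == cps by construction
        grids.append((max(1, round(p_w)), max(1, round(p_l))))
    return grids
```

On a unit cube with 2,400 vertices and 100-vertex chunks, each side gets 400 vertices, hence 4 chunks in a 2 x 2 grid.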
This gives us the complete partitioning of the bounding box.

4.3 Vertex processing

Once the initial partitioning is determined, the master starts processing the vertices. The master performs a linear scan of the vertices and, for each vertex, determines the partition it belongs to. The vertex and its location are then sent to slave S_1, and S_1 sends the vertices on to the other slaves. By having S_1 take on the responsibility for distributing the vertices, we overlap the processing of faces with the distribution of vertices to the other slaves. To reduce network I/O and message-creation overheads, the vertices are sent to S_1 in batches. As the master processes the vertices it builds a list of vertices, a vertex map, that is used in the processing of the faces. The vertex map maps each vertex to its corresponding partition. During

partitioning, the vertices are represented as a list V of points, and faces are triples of indices into this vertex list. The partitioning of vertices creates sublists V_p0...V_pn. The vertex map maps a vertex in V_pi to the corresponding vertex in V. Once all the vertices have been processed, the master assigns the partitions to the slaves; this assignment is stored in the processor map.

4.4 Processor assignment

The master proceeds to assign the initial partitions to the slaves in the cluster. First, we compute the unit of processing, P_u, which is the product of the slowest CPU speed and the smallest core memory. Next we compute the number of processing units in the cluster,

    P_n = (1 / P_u) * sum_i (mem_i * cpu_i),

where mem_i is the core memory size and cpu_i is the processor speed of slave S_i. From this we compute the number of partitions per unit of processing, np = N / P_n, where N is the total number of partitions. Finally, the number of partitions, Sn_i, for a slave S_i is

    Sn_i = np * (mem_i * cpu_i) / P_u.

We assign contiguous partitions to slaves that have more core memory, because they can coalesce several partitions before performing the simplification, thus decreasing fragmentation. Once the partitions have been assigned, the master sends the assignment to S_1. Upon receiving the assignments, S_1 starts propagating the vertices in the partitions to the slaves. In the meantime, the master starts processing the faces; in this way the work of sending out the vertices and processing the faces is distributed and pipelined.

4.5 Face processing

For each face in the model, the master determines its location with respect to the partitions and slaves. This is accomplished through the vertex and processor maps created in the previous step. The face is then sent to the corresponding set of slaves; as with the vertices, we send the faces in batches to reduce network I/O. We call a face whose vertices lie in multiple partitions a spanning face; spanning faces may belong to two or three slaves.
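The processor-assignment rule of Section 4.4 can be sketched as follows; the paper gives no code, so the dictionary format, rounding, and contiguous hand-out below are illustrative assumptions.

```python
def assign_partitions(n_partitions, slaves):
    """Sketch of the assignment rule. `slaves` maps name -> (cpu_speed, core_mem).
    P_u is the slowest CPU times the smallest memory; the cluster holds
    P_n = sum(mem_i * cpu_i) / P_u processing units; np partitions go to each
    unit, so slave i receives Sn_i = np * mem_i * cpu_i / P_u partitions."""
    p_u = min(c for c, _ in slaves.values()) * min(m for _, m in slaves.values())
    p_n = sum(c * m for c, m in slaves.values()) / p_u   # units in the cluster
    np_ = n_partitions / p_n                             # partitions per unit
    counts = {name: round(np_ * c * m / p_u) for name, (c, m) in slaves.items()}
    # hand out contiguous runs so neighbouring chunks can later be coalesced
    assignment, start = {}, 0
    for name, k in counts.items():
        assignment[name] = range(start, start + k)
        start += k
    return assignment
```

Note that the Sn_i sum to N by construction, so every partition is assigned exactly once (up to rounding).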
Our main data structure, the vertex map, grows linearly with the size of the model. Given that it is used extensively, access to the vertex map has to be efficient. An entry in the vertex map consists of two integers, and there is an entry for each vertex; this means that for large models the map cannot be stored in core memory. We have created two approaches for handling the vertex map efficiently when it does not fit into core memory. For large vertex maps we store the map as a file and then memory map it. For extremely large models, the vertex map is sufficiently large that it is not possible to memory map the entire file on a 32-bit architecture; thus, we use a sliding memory-mapped window to access the file. For clusters that are connected by a high-speed network (gigabit and faster), we have developed a simple distributed shared memory (DSM [Li and Hudak 1989]) system for the vertex map. The slaves each store a piece of the vertex map, and the master keeps a cache of vertex-map pages. To look up a vertex, the master first looks in the cache; if the page is not in the cache, the master requests it from the slave holding that portion of the map. Since the vertices that a face refers to tend to be clustered closely together, there is a good chance that the other vertices of the face can be looked up from the cache. We have tested this DSM on a regular 100 Mb/s network with mixed results; we believe that on a gigabit or faster network the DSM approach will be faster than memory mapping the vertex map.

4.6 The slave

Once slave S_1 has finished sending the vertices and the master has finished sending the faces, the slaves start their own processing. The slaves reassemble the chunks to create the final partitioning of the model. Recall that a single chunk must fit into the smallest core memory; if a slave has more memory, it coalesces several small chunks into a single large chunk. This has several benefits, as described in Section 3.1.
When two chunks are merged, the vertices that define the faces spanning the two chunks are removed from the set B (Section 3.2). Finally, the guest simplification algorithm is invoked to simplify the chunk or chunks. If the model is large, each node may be required to simplify several chunks serially. The output from the guest simplification algorithm is a simplified chunk and a set B' of vertices. The vertices in B' lie on the new simplified surface and potentially define new faces that span the simplified chunks; a vertex in B' may represent several vertices in B.

4.7 Stitching

Once all the chunks have been processed by the guest algorithm, they have to be stitched back together. We use a divide-and-conquer approach to create the simplified surface. Initially, each node stitches together all the chunks that were assigned to it. The stitching process is similar to the retriangulation process in the original R-Simp algorithm. A vertex on the simplified surface represents one or more vertices on the original surface. We iterate through the original faces: if the vertices that define a face correspond to three distinct vertices on the simplified surface, then that face is retained and is defined by the new vertices; otherwise the face has degenerated into a line or a point and is discarded. To stitch two chunks, we take the union of their B' sets and examine the faces that were defined by the vertices in B. If the vertices that define such a face correspond to three distinct vertices in B', then that face is retained and is defined by the vertices in B'. The advantage of this approach to stitching is that there is no loss of output quality. Once each node has stitched together its chunks, adjacent nodes are paired off: node p_i receives the simplified chunk from node p_i+1 and the corresponding set B' of vertices. The surfaces are stitched together in the same manner, and then the remaining nodes are paired off again.
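The face-filtering rule and the pairwise stitching schedule can be sketched as follows; `simplified_of` stands in for the guest's B-to-B' mapping, and `merge` for the actual surface stitch (both hypothetical names for routines the paper describes in prose).

```python
def stitch(faces, simplified_of):
    """Keep an original face only if its three vertices map to three
    distinct simplified vertices; otherwise it has degenerated into an
    edge or a point and is discarded."""
    kept = []
    for a, b, c in faces:
        sa, sb, sc = simplified_of[a], simplified_of[b], simplified_of[c]
        if len({sa, sb, sc}) == 3:
            kept.append((sa, sb, sc))
    return kept

def pairwise_rounds(chunks, merge):
    """The pairing schedule: node p_i absorbs p_(i+1), the survivors pair
    off again, and one surface remains after ceil(log2 n) rounds."""
    rounds = 0
    while len(chunks) > 1:
        nxt = [merge(chunks[i], chunks[i + 1])
               for i in range(0, len(chunks) - 1, 2)]
        if len(chunks) % 2:
            nxt.append(chunks[-1])      # odd node out waits for the next round
        chunks, rounds = nxt, rounds + 1
    return chunks[0], rounds
```

For five nodes, for example, the schedule takes three rounds, matching the ceil(log2 n) bound discussed below.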
Given n nodes, there are log n iterations to stitch the surface back together; after log n iterations, the entire surface resides on the master node. Finally, the simplified surface is written to disk by the master.

5 Discussion

The motivation for our framework is the realization that the naïve approach, which uniformly divides the model into a number of chunks and distributes them to a number of potentially heterogeneous workstations, is bound to fail. Uniform partitioning only works when all workstations can efficiently perform computation on a chunk, which means that all workstations must have sufficient processing power and memory; the computational power of the cluster is limited by its weakest workstation. A workstation that is processor bound slows down the entire simplification process, and a workstation that has too little memory slows down the simplification process even further due to thrashing within the virtual memory

system. In addition, for the naïve approach to simplify a model of size s, one must always satisfy s <= m * n, where n is the number of available workstations and m is the smallest core memory of a workstation in the cluster; determining the partitioning such that a chunk is smaller than m may also be a non-trivial task. Our framework handles the non-homogeneity among workstations by querying them for their characteristics and then creating appropriately sized chunks that enable each workstation to perform its computation optimally. This approach has two benefits. First, simplification proceeds at optimal speed, since workstations are appropriately loaded; conditions such as thrashing of the virtual memory system do not occur. Second, if there are more chunks than workstations, our framework assigns multiple chunks per workstation, and these are processed sequentially on each workstation. Thus, our framework can simplify a model of any size on any number of workstations.

Table 1: Execution times for our framework with R-Simp as the guest algorithm and for the standalone version of R-Simp, for the models Bunny, Dragon, Buddha, Blade, St. Matthew's Face, David, Lucy, and St. Matthew (columns: Total (s), Simp. (s), CPUs, Speedup, and Efficiency for the framework; Total (s) and Simp. (s) for R-Simp). For the framework the simplification time is per processor. For the last four models, the timings for R-Simp were done on a large Sparc Server with 32 GB of core memory. The total time includes all file I/O; the simplification time excludes all file I/O.

6 Results

To evaluate the effectiveness of our framework we used R-Simp as the guest simplification algorithm. We executed the framework with this guest algorithm on a number of models that ranged in size from 70,000 to 370 million polygons. Table 2 summarizes the models that we used; the last four models are from The Digital Michelangelo Project [Levoy et al. 2000].

    Model                 Size (polys)    Size (MB)
    Bunny                      69,...          ...
    Dragon                    871,...          ...
    Buddha                  1,087,...          ...
    Blade                   1,765,...          ...
    St. Matthew's Face      6,755,412        1,159
    David                   8,253,996        1,416
    Lucy                   28,045,920        4,814
    St. Matthew           372,315,310       63,912

Table 2: The models used for evaluation.

The numbers in the MB column give the size of the model in memory, for R-Simp, without any instantiated auxiliary data structures. Each face in the original model takes up approximately 180 bytes (based on the number of faces being approximately twice the number of vertices and the average degree of a vertex being 6). This could be reduced to approximately 130 bytes per face (by not storing normals and mid-points of faces), but doing so would dramatically increase non-localized memory accesses, as these values are used often. To compute the total amount of memory used, the size of the output model (the number of vertices in the final output times 73 bytes) and the intermediate data structures, such as priority queues (estimated at 10-20% of the input model in size), must be added. For example, for R-Simp to simplify Lucy to 20,000 vertices, between 5.8 GB and 6.3 GB of memory is required; this amount cannot even be addressed by a 32-bit processor.

We report two kinds of results in this section: (1) the execution times for our framework and for the original R-Simp algorithm, together with the speedup and efficiency achieved; and (2) the quality of the simplified models produced by our framework compared to the original, and the amount of quality lost due to the partitioning. The R-Simp algorithm was modified to keep track of the chunk boundary vertices and to output a mapping between the new and the old vertices; about 50 lines had to be added or modified. The experiments were run on a non-dedicated cluster of 20 Pentium III 1 GHz PCs, each with 256 MB of RAM, connected by 100 Mb/s switched Ethernet.
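The memory accounting above can be written out as a small helper. The per-item constants (approximately 180 bytes per input face, 73 bytes per output vertex, and 10-20% of the input model for intermediate structures) are the paper's; the function itself is our sketch of how they combine.

```python
def rsimp_memory_bytes(input_faces, output_vertices):
    """Back-of-the-envelope memory estimate for R-Simp: input model plus
    output vertices plus 10-20% of the input for intermediate structures.
    Returns a (low, high) range in bytes."""
    model = 180 * input_faces        # ~180 bytes per face in the input model
    output = 73 * output_vertices    # 73 bytes per vertex in the output
    low = model + model // 10 + output    # +10% intermediates
    high = model + model // 5 + output    # +20% intermediates
    return low, high
```

Plugging in Lucy's 28 million faces and a 20,000-vertex output reproduces the several-gigabyte requirement discussed above, well beyond a 32-bit address space.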
6.1 Speed

We ran our framework with R-Simp as the guest algorithm, and the original R-Simp, on all the models in Table 2, and recorded the total wall-clock simplification time as well as the simplification time excluding all file I/O. Table 1 presents the timing results. We see two trends in the results. First, file and network I/O dominate the simplification process as the models grow in size, which is expected. The model data was in a conventional representation and we used standard system calls to access the data; we believe that by employing more intelligent data representations and/or compression techniques, such as [Isenburg and Gumhold 2003], we can significantly reduce the time spent doing file I/O. Even with our naïve approach to accessing out-of-core data, we found that our framework provided up to a threefold improvement in performance. PR-Simp [Brodsky and Pedersen 2002], a parallel version of R-Simp that used a naïve partitioning scheme, simplified Lucy in 2,105 seconds, while the framework simplified Lucy in 742 seconds; PR-Simp simplified David in 209 seconds, while the framework required 144 seconds. The amount of improvement depends on the size of the input model: as models grow, simplification resources become scarce, and the better these resources are managed, the more smoothly the simplification process proceeds. Our framework provides this management and thus attains better performance than PR-Simp. Garland and Shaffer's Multiphase Approach [Garland and Shaffer 2002] took 2,820 seconds, not including file I/O, to simplify St. Matthew. If we exclude file I/O

we were able to simplify St. Matthew in 254 seconds.

Figure 2: Michelangelo's St. Matthew: the original model and simplifications to 20,000, 10,000, and 5,000 vertices.

To compute the speedup and efficiency achieved by our framework we use the simplification time only. One must be careful when interpreting the speedup and efficiency results, because the sequential and parallel algorithms were run on different architectures. The Buddha model was run on similar architectures, and we obtain an efficiency of over 100%: the Buddha does not fit into the core memory of a single workstation, causing the virtual memory system to thrash and thus increasing the overall sequential simplification time. The efficiencies for the other models are also above 100%. This may be due to the different architectures and the fact that the Sparc Server is not a dedicated machine but supports hundreds of users. We were unable to obtain a speedup or efficiency measure for Michelangelo's St. Matthew model because we had no single system with a sufficient amount of core memory.

6.2 Quality

We measured output quality both quantitatively and qualitatively. For the quantitative analysis we used Metro [Cignoni et al. 1997] to compute the error between the original and the simplified surface; we report the error as a percentage of the model's bounding box diagonal. Metro is limited in the size of the models it can compare; thus, for the large models we only provide qualitative results, see Figures 2 and 3. Since the smaller models required only one processor, their output quality remained unchanged from the simplifications created by R-Simp [Brodsky and Watson 2000]. Visual inspection of the models in Figures 2 and 3 shows no significant artifacts.
If one requires high output quality, one could use a multipass approach: use our framework with R-Simp as the guest algorithm to simplify the model to a manageable number of vertices, and then use an algorithm that generates high-quality output to produce the final simplification. We used Metro to determine the relationship between output quality and the number of processors used (Table 3); we ran our framework on the Bunny and Dragon models on 1, 2, 4, 8, 12, 16, and 20 processors, and simplified them to 5,000 vertices. Increasing the number of processors perturbs the global ordering of the simplification operations. Surprisingly, the results showed no significant difference. For the Bunny, across all node configurations the median maximum error reported by Metro was 2.37% and the average maximum error was 2.31%, with a small standard deviation. For the Dragon, the median maximum error was 1.72% and the average maximum error was 1.41%, again with a small standard deviation.

Table 3: Total maximum error, as reported by Metro, for the Stanford Bunny and the Dragon models when simplified to 5,000 vertices on 1, 2, 4, 8, 12, and 20 processors. The error is a percentage of the model's bounding box diagonal.

Although we found no correlation between the number of processors used and output quality, we believe that if one uses a large number of processors to simplify a small model, then the quality will be affected.

7 Conclusions

We presented a general framework for supporting efficient implementation of parallel model simplification algorithms in cluster environments. To demonstrate the usability of this framework we implemented a parallel version of the R-Simp algorithm. The framework takes into account the heterogeneity, with respect to CPU speed and memory size, of the machines in a cluster to provide support for efficiently executing any mesh simplification algorithm.
Using intelligent load balancing and partitioning, resources are utilized to their fullest, achieving significant speedups while preserving the quality of the model. To show how this framework improves the performance of a naïve implementation, we ported PR-Simp, an existing parallel implementation of R-Simp, to work with the framework. For a model with 28 million faces the efficiency increased threefold, showing that with intelligent partitioning and load balancing we can achieve significant gains in efficiency without compromising quality.

Acknowledgements

We would like to thank Alex Brodsky, Chamath Kappitiyagama, and Joon Suan Ong for proofreading the paper and for helping clarify a number of points. We would also like to thank Norm Hutchinson and the DSG Lab for their support.

References

BRODSKY, D., AND PEDERSEN, J. B. Parallel model simplification of very large polygonal meshes. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications.
BRODSKY, D., AND WATSON, B. Model simplification through refinement. In Graphics Interface '00.
CHOUDHURY, P., AND WATSON, B. Completely adaptive simplification of massive meshes. Technical Report NWU-CS-02-09, Northwestern University.
CIGNONI, P., ROCCHINI, C., AND SCOPIGNO, R. Metro: measuring error on simplified surfaces. Tech. rep., Istituto per l'Elaborazione dell'Informazione - Consiglio Nazionale delle Ricerche.
CIGNONI, P., ROCCHINI, C., MONTANI, C., AND SCOPIGNO, R. External memory management and simplification of huge meshes. To appear in IEEE Transactions on Visualization and Computer Graphics.
EL-SANA, J., AND CHIANG, Y.-J. External memory view-dependent simplification. Computer Graphics Forum 19, 3.
GARLAND, M., AND HECKBERT, P. S. Surface simplification using quadric error metrics. In SIGGRAPH 1997 Conference Proceedings.
GARLAND, M., AND SHAFFER, E. A multiphase approach to efficient surface simplification. In Proceedings of IEEE Visualization 2002.
GUÉZIEC, A. P., BOSSEN, F., TAUBIN, G., AND SILVA, C. T. Efficient compression of non-manifold polygonal meshes. In Proceedings of IEEE Visualization 1999.
HINKER, P., AND HANSEN, C. Geometric optimization. In Proceedings of IEEE Visualization 1993.
HOPPE, H., DEROSE, T., DUCHAMP, T., MCDONALD, J., AND STUETZLE, W. Mesh optimization. In SIGGRAPH 1993 Conference Proceedings, vol. 27.
HOPPE, H. Progressive meshes. In SIGGRAPH 1996 Conference Proceedings.
ISENBURG, M., AND GUMHOLD, S. Out-of-core compression for gigantic polygon meshes. To appear in SIGGRAPH 2003 Conference Proceedings.
KALVIN, A. D., AND TAYLOR, R. H. Superfaces: Polygonal mesh simplification with bounded error. IEEE Computer Graphics and Applications 16, 3 (May).
LAM-MPI.
LEVOY, M., PULLI, K., CURLESS, B., RUSINKIEWICZ, S., KOLLER, D., PEREIRA, L., GINZTON, M., ANDERSON, S., DAVIS, J., GINSBERG, J., SHADE, J., AND FULK, D. The Digital Michelangelo Project: 3D scanning of large statues. In SIGGRAPH 2000 Conference Proceedings.
LI, K., AND HUDAK, P. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems 7, 4.
LINDSTROM, P., AND SILVA, C. A memory insensitive technique for large model simplification. In Proceedings of IEEE Visualization.
LINDSTROM, P., AND TURK, G. Fast and memory efficient polygonal simplification. In Proceedings of IEEE Visualization 1998.
LINDSTROM, P. Out-of-core simplification of large polygonal models. In SIGGRAPH 2000 Conference Proceedings.
LOW, K.-L., AND TAN, T.-S. Model simplification using vertex-clustering. In 1997 Symposium on Interactive 3D Graphics.
PRINCE, C. Progressive meshes for large models of arbitrary topology. Master's thesis, University of Washington.
ROSSIGNAC, J., AND BORREL, P. Multi-resolution 3D approximations for rendering complex scenes. In Modeling in Computer Graphics: Methods and Applications.
ROSSIGNAC, J. Edgebreaker: Connectivity compression for triangle meshes. IEEE Transactions on Visualization and Computer Graphics 5, 1.
SCHROEDER, W. J., ZARGE, J. A., AND LORENSEN, W. E. Decimation of triangle meshes. Computer Graphics 26, 2 (July).
SHAFFER, E., AND GARLAND, M. Efficient adaptive simplification of massive meshes. In Proceedings of IEEE Visualization 2001.
TURK, G. Re-tiling polygonal surfaces. Computer Graphics 26, 2 (July).

Figure 3: Simplifications of several large models to 20,000, 10,000, and 5,000 vertices: (a) Lucy, (b) Michelangelo's David, (c) St. Matthew's face.


More information

Real-Time Graphics Architecture

Real-Time Graphics Architecture Real-Time Graphics Architecture Kurt Akeley Pat Hanrahan http://www.graphics.stanford.edu/courses/cs448a-01-fall Geometry Outline Vertex and primitive operations System examples emphasis on clipping Primitive

More information

External Memory Simplification of Huge Meshes

External Memory Simplification of Huge Meshes External Memory Simplification of Huge Meshes P. Cignoni Λ, C. Montani y, C. Rocchini z, Istituto di Elaborazione dell Informazione Consiglio Nazionale delle Ricerche Via S. Maria 46-56126 Pisa, ITALY

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

Processing 3D Surface Data

Processing 3D Surface Data Processing 3D Surface Data Computer Animation and Visualisation Lecture 15 Institute for Perception, Action & Behaviour School of Informatics 3D Surfaces 1 3D surface data... where from? Iso-surfacing

More information

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A

More information

Level-of-Detail Triangle Strips for Deforming. meshes

Level-of-Detail Triangle Strips for Deforming. meshes Level-of-Detail Triangle Strips for Deforming Meshes Francisco Ramos 1, Miguel Chover 1, Jindra Parus 2 and Ivana Kolingerova 2 1 Universitat Jaume I, Castellon, Spain {Francisco.Ramos,chover}@uji.es 2

More information

Finding Shortest Path on Land Surface

Finding Shortest Path on Land Surface Finding Shortest Path on Land Surface Lian Liu, Raymond Chi-Wing Wong Hong Kong University of Science and Technology June 14th, 211 Introduction Land Surface Land surfaces are modeled as terrains A terrain

More information

3/1/2010. Acceleration Techniques V1.2. Goals. Overview. Based on slides from Celine Loscos (v1.0)

3/1/2010. Acceleration Techniques V1.2. Goals. Overview. Based on slides from Celine Loscos (v1.0) Acceleration Techniques V1.2 Anthony Steed Based on slides from Celine Loscos (v1.0) Goals Although processor can now deal with many polygons (millions), the size of the models for application keeps on

More information

APPROACH FOR MESH OPTIMIZATION AND 3D WEB VISUALIZATION

APPROACH FOR MESH OPTIMIZATION AND 3D WEB VISUALIZATION APPROACH FOR MESH OPTIMIZATION AND 3D WEB VISUALIZATION Pavel I. Hristov 1, Emiliyan G. Petkov 2 1 Pavel I. Hristov Faculty of Mathematics and Informatics, St. Cyril and St. Methodius University, Veliko

More information

1 Course Title. 2 Course Organizer. 3 Course Level. 4 Proposed Length. 5 Summary Statement. 6 Expanded Statement

1 Course Title. 2 Course Organizer. 3 Course Level. 4 Proposed Length. 5 Summary Statement. 6 Expanded Statement 1 Course Title Out-Of-Core Algorithms for Scientific Visualization and Computer Graphics. 2 Course Organizer Cláudio T. Silva, AT&T Labs. 3 Course Level Intermediate The course is intended for those who

More information

Report on Cache-Oblivious Priority Queue and Graph Algorithm Applications[1]

Report on Cache-Oblivious Priority Queue and Graph Algorithm Applications[1] Report on Cache-Oblivious Priority Queue and Graph Algorithm Applications[1] Marc André Tanner May 30, 2014 Abstract This report contains two main sections: In section 1 the cache-oblivious computational

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Surface Simplification Using Quadric Error Metrics

Surface Simplification Using Quadric Error Metrics Surface Simplification Using Quadric Error Metrics Michael Garland Paul S. Heckbert Carnegie Mellon University Abstract Many applications in computer graphics require complex, highly detailed models. However,

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

Simulation in Computer Graphics Space Subdivision. Matthias Teschner

Simulation in Computer Graphics Space Subdivision. Matthias Teschner Simulation in Computer Graphics Space Subdivision Matthias Teschner Outline Introduction Uniform grid Octree and k-d tree BSP tree University of Freiburg Computer Science Department 2 Model Partitioning

More information

Progressive Mesh. Reddy Sambavaram Insomniac Games

Progressive Mesh. Reddy Sambavaram Insomniac Games Progressive Mesh Reddy Sambavaram Insomniac Games LOD Schemes Artist made LODs (time consuming, old but effective way) ViewDependentMesh (usually used for very large complicated meshes. CAD apps. Probably

More information

1 Overview, Models of Computation, Brent s Theorem

1 Overview, Models of Computation, Brent s Theorem CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 1, 4/3/2017. Scribed by Andreas Santucci. 1 Overview,

More information