Hardware-Accelerated Free-Form Deformation

Clint Chua and Ulrich Neumann
Computer Science Department
Integrated Media Systems Center
University of Southern California

Abstract

Hardware acceleration for geometric deformation is developed in the framework of an extension to the OpenGL specification. The method requires an addition to the front-end of the OpenGL rendering pipeline and an appropriate OpenGL primitive. Our approach is to implement general geometric deformations so the system supports additional layers of abstraction, including physically based simulations. This approach would support a wide range of users with an accelerated implementation of a well-understood deformation method, reducing the need for software deformation engines and the execution time penalty associated with them.

CR Categories and Subject Descriptors: I.3.1 [Computer Graphics]: Hardware Architecture --- Graphics Processors

Additional Keywords: Free-Form Deformation, OpenGL

1 INTRODUCTION

Many animations are limited to rigid body transformations. While these approximate many ordinary object motions, rigid body animations preclude the class of elastic deformations found in soft or organic objects such as cloth or human limbs. In order to achieve these elastic deformations, programmers have had to construct software deformation engines, often coupled to physical simulations such as mass-spring models [11]. Despite the large body of research literature on deformations, interactive systems tend to avoid deformation techniques due to their software implementation cost. Instead, application developers often use approximations based on precomputed keyframes or rigid body animations that hopefully mask any artifacts that arise. Often, in spite of these efforts, seams in the model and inter-penetrations are visible, especially in complex human or animal models.
With the continuing increase of rendering quality provided by new graphics cards and gaming consoles [10], it becomes important to consider the addition of high-resolution model deformations that allow single-skin models to deform seamlessly and naturally. Although single-skin deformations are available on a next-generation gaming console, deformation does not have widespread support among most graphics cards or APIs. Deformable single-skin models allow arbitrary interactively controlled deformations not easily approximated by key frames and rigid transformations.

* {chua uneumann}@graphics.usc.edu

The hardware and software API proposed in this paper provides these deformations in the framework of an incremental modification to OpenGL. Our approach utilizes the well-known Extended Free-Form Deformation (EFFD) method [2]. We show how to integrate the EFFD system into the OpenGL API by extending the current eval functions already within OpenGL. This approach yields several benefits. The EFFD method is a widely used and intuitive approach that programmers can easily use for creating and controlling deformable objects. The EFFD deformation operations are simple and regular, therefore allowing efficient implementation as hardware (or optimized firmware) in the OpenGL rendering pipeline.

The remainder of the paper is arranged as follows. The next section provides background on EFFD. Section 3 details the proposed hardware additions. Section 4 discusses several possible hardware organizations. Section 5 describes the OpenGL API functions. The last section summarizes the utility of the system.

2 BACKGROUND

Free-Form Deformation (FFD) is a popular technique for creating geometric deformations. There are several types of FFDs, ranging from the standard FFD [8] and the EFFD [2] to FFDs with arbitrary topology [4]. We chose to work with the EFFD in our approach since it preserves the mathematical simplicity of the standard FFD yet covers a wider variety of control lattices.
The intuition behind all types of FFDs is embedding an object into a piece of Jell-O: as the Jell-O deforms, the embedded object deforms with it. The different FFDs mentioned above only vary the initial shape of the Jell-O in which the object is embedded. A standard FFD works solely with rectangular parallelepipeds. Arbitrary topology FFDs allow any control lattice. The EFFD, while not as versatile as the arbitrary topology FFD, can work with cylinders, spheres and other regular shapes more complex than a rectangular parallelepiped, while still using the efficient deformation equations of the standard FFD.

All FFD techniques share some common properties, briefly summarized here. FFDs can deform nearly any type of geometric model, ranging from simple triangle patches to parametric surfaces. They can be applied hierarchically to perform both local and global space deformations. In addition, FFDs can control surface continuity as well as volume preservation. The reader is directed to [,4,8] for more details on these properties.

EFFD operation is divided into three steps. The first step embeds the object in the initial control lattice (Plates 1, 2, 5, 6, 9) and computes the parameterized coordinates of the object. The next step moves the control lattice points to new locations, thus deforming the enclosed region of space. The last step calculates the deformed positions of every object point based on the new locations of the control points (Plates 3, 4, 7, 8, 10). The deformed object is then ready for rendering.

The embedding process starts with a lattice of regularly spaced control points. For now, we can assume that the control points are arranged in a rectangular parallelepiped. An object is embedded within the lattice by computing a transformation of coordinate systems from the object coordinate system to the local lattice coordinate system. For a rectangular parallelepiped control lattice, the embedding or freezing of the object is accomplished by a simple affine coordinate transformation [8]. Since more complex control lattice shapes are possible with the EFFD, the parameterization also becomes more complex, requiring a solution to a system of non-linear equations [2]. This system of non-linear equations is defined by the equations of deformation, so we will first introduce these equations and then show how the embedding is done.

Assuming that each object point X(x,y,z) has a parameterized local coordinate (s,t,u), then with the set of control points P, we calculate the deformed position q with

\[ q_{i,j,k}(s,t,u) = \sum_{l,m,n=0}^{3} P_{i+l,\,j+m,\,k+n}\, B_l(s)\, B_m(t)\, B_n(u) \tag{1} \]

where P_{i,j,k} is the i-th, j-th, k-th control point and the B's are the uniform cubic B-spline blending functions shown below.

\[ B_0(u) = \frac{(1-u)^3}{6}, \qquad B_1(u) = \frac{3u^3 - 6u^2 + 4}{6}, \qquad B_2(u) = \frac{-3u^3 + 3u^2 + 3u + 1}{6}, \qquad B_3(u) = \frac{u^3}{6} \]

Thus, given a set of parameterized local coordinates, we evaluate the above equation to get the set of deformed points based on the new positions of the control points. For non-rectangular lattices, we need to derive the system of non-linear equations to determine the parameterized local coordinates.
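As an illustration only, the following Python sketch evaluates the deformation sum above for a single block of 4 x 4 x 4 control points and then solves the inverse problem, recovering (s,t,u) for a given object point by Newton-Raphson iteration from the initial guess 0.5. The function names, the analytic Jacobian, the use of Cramer's rule for the 3 x 3 solve, and the mildly bent test lattice are all assumptions made for this sketch; they are not taken from the paper.

```python
def basis(u):
    """Uniform cubic B-spline blending values B_0(u)..B_3(u)."""
    return ((1 - u) ** 3 / 6.0,
            (3 * u**3 - 6 * u**2 + 4) / 6.0,
            (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
            u**3 / 6.0)


def dbasis(u):
    """Derivatives of the four blending functions with respect to u."""
    return (-((1 - u) ** 2) / 2.0,
            (9 * u**2 - 12 * u) / 6.0,
            (-9 * u**2 + 6 * u + 3) / 6.0,
            u**2 / 2.0)


def weighted_sum(P, ws, wt, wu):
    """Sum P[l][m][n] * ws[l] * wt[m] * wu[n] over the 64 control points."""
    q = [0.0, 0.0, 0.0]
    for l in range(4):
        for m in range(4):
            for n in range(4):
                w = ws[l] * wt[m] * wu[n]
                for d in range(3):
                    q[d] += w * P[l][m][n][d]
    return q


def deform(P, s, t, u):
    """Deformed position of a point with local coordinates (s, t, u)."""
    return weighted_sum(P, basis(s), basis(t), basis(u))


def det3(a):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def embed(P, X, guess=0.5, tol=1e-12, max_iter=50):
    """Newton-Raphson solve of deform(P, s, t, u) - X = 0 for (s, t, u)."""
    s = t = u = guess
    for _ in range(max_iter):
        bs, bt, bu = basis(s), basis(t), basis(u)
        F = [q - x for q, x in zip(weighted_sum(P, bs, bt, bu), X)]
        if max(abs(f) for f in F) < tol:
            break
        # Jacobian columns: partial derivatives with respect to s, t and u.
        cols = (weighted_sum(P, dbasis(s), bt, bu),
                weighted_sum(P, bs, dbasis(t), bu),
                weighted_sum(P, bs, bt, dbasis(u)))
        J = [[cols[c][r] for c in range(3)] for r in range(3)]
        D = det3(J)  # assumed non-zero, i.e. the lattice does not fold space
        step = []
        for c in range(3):  # Cramer's rule for J * step = -F
            Jc = [row[:] for row in J]
            for r in range(3):
                Jc[r][c] = -F[r]
            step.append(det3(Jc) / D)
        s, t, u = s + step[0], t + step[1], u + step[2]
    return s, t, u


# Hypothetical, mildly bent lattice used only to exercise the code.
P = [[[(l - 1 + 0.3 * (m - 1), m - 1 + 0.2 * (l - 1) * (n - 1), n - 1.0)
       for n in range(4)] for m in range(4)] for l in range(4)]
X = deform(P, 0.2, 0.7, 0.4)   # a point whose local coordinates are known
s, t, u = embed(P, X)          # recover them from the initial guess 0.5
```

The forward evaluation is the part that has to run for every vertex and every frame, whereas the Newton iteration is only needed when the object is first embedded.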
From Equation (1), we can derive the system of non-linear equations for the (s,t,u) parameterization of an object point X. The intuition behind the embedding procedure for the EFFD is that it is the inverse of the deformation of the object in the initial control lattice to a rectangular parallelepiped control lattice. Hence, the initial EFFD control lattice shape is constrained by the existence of a mapping or morph between it and a standard rectangular parallelepiped. The morph must not incur any space folding, that is, it must be invertible. From this we can see why the EFFD is not appropriate for lattices of arbitrary topology, since the morph from the arbitrary lattice to the rectangular parallelepiped may not exist.

To derive the system of non-linear equations, we set the object point X equal to q, the deformed point in Equation (1). The goal is then to find a parameterization (s,t,u) that satisfies this equation. In order to find (s,t,u), we rearrange Equation (1) to obtain the following equation.

\[ \sum_{l,m,n=0}^{3} P_{i+l,\,j+m,\,k+n}\, B_l(s)\, B_m(t)\, B_n(u) - X_{i,j,k} = 0 \tag{2} \]

where X_{i,j,k} is the object point within the deformable region of P_{i,j,k} to P_{i+3,j+3,k+3}. Since this is a 3-D vector equation, we have one equation from each dimension, and 3 unknowns, namely (s,t,u). We use a Newton-Raphson root-finding method [7] with initial guess 0.5 to find (s,t,u).

While the calculation of parameterized coordinates could be done multiple times, this process is typically done only once. In other words, once the initial control lattice has been selected and the parameterization done for it, the control lattice is not changed. In this case, the embedding procedure is done as a pre-processing step for each model and its initial control lattice. At run time, the control lattice points P are moved (between key frames or based on user input) and the deformed model coordinates are evaluated by using Equation (1). It is this evaluation that requires efficiency and hardware acceleration.

3 HARDWARE ARCHITECTURE
3.1 OpenGL System

The hardware architecture illustrates how we accelerate the evaluation of Equation (1). We take advantage of the similarity between the form of Equation (1) and the evaluation of spline-based curves and surfaces already handled by OpenGL. Let us first take a look at OpenGL's curve and surface rendering system. Fundamentally, OpenGL's curve and surface capability is based on the evaluator system [5]. This system advocates (but does not require) the use of hardware accelerated polynomial evaluators [9]. The evaluations of the currently supported curves and surfaces all take the form of Equation (1). By changing the blending functions, we obtain different families of curves ranging from Bezier and B-spline to Hermite. To simplify the system, OpenGL chose to fix the blending functions to the Bernstein functions. By doing so, hardware implementers are able to optimize their designs for a single polynomial evaluator using the Bernstein blending functions shown below.

\[ B_{a,b}(c) = \binom{b}{a}\, c^{a}\, (1-c)^{b-a} \tag{3} \]

The choice of the Bernstein blending functions does not limit OpenGL's capability for rendering other families of curves. In fact, the GLU library fully implements a NURBS renderer on top of OpenGL's evaluator system. This is achieved by finding a transformation between families of curves. In order to represent one type of curve with another, there must exist a mapping between the curve's original control points and the resulting family's control points such that both sets of control points represent the same curve. This well-known transformation is described in [3]. Thus, OpenGL can render a wide variety of curves and surfaces by using this transformation.

3.2 Proposed System

OpenGL does not constrain implementations, so the hardware designer can choose how many hardware polynomial evaluators to include. Clearly, a single evaluator is sufficient to increase the speed for evaluating Equation (1), but a second evaluator allows parallel evaluations in bivariate hyperpatches (surfaces). This design freedom leads to our proposed architecture, shown in Figure 2.

Figure 1. Data Flow of the Pipeline with optional components. This shows how the data flows through the pipeline with the general polynomial evaluators set to the uniform cubic B-spline blending functions.

Our conceptual design (Fig. 2) shows a three-stage pipelined trivariate EFFD evaluator. The first stage uses three polynomial evaluators to evaluate each blending function in parallel. The next stage is a four-input floating-point vector multiplier that multiplies the results of the blending functions with the control point vector. The last stage is the vector accumulator that computes the summations for each dimension in Equation (1). A 6-bit counter controls the asynchronous reset of the accumulators after every 64 additions. The system also shows two optional components: the register file for polynomial coefficients and the control point cache. Their functions are discussed in later sections.

The base system without optional components works as follows. First, the parameterized coordinates (s,t,u) for a model vertex and a control point P are fed into the polynomial evaluators. Once the blending functions are evaluated, all the results are multiplied together and this product is then multiplied with the control point vector. The final vertex coordinate is accumulated for each dimension after every 64 additions. With a pipeline architecture, vertex coordinates are fed into the pipeline continuously and removed at the same rate after a fixed processing latency. These deformed vertices are then passed to the transformation engine for rendering. A summary of the data flow is shown in Figure 1.
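The behaviour of the three stages can be mimicked in software. The sketch below is a functional model only, not a description of the actual hardware: it assumes that each pipeline input carries the local coordinate, the index (l, m, n) selecting the blending functions, and one control point, and it models the 6-bit counter as an integer that wraps after 64 additions. All names are hypothetical.

```python
# Cubic coefficients (a, b, c, d) of a*x^3 + b*x^2 + c*x + d for B_0..B_3,
# the uniform cubic B-spline blending functions.
BLEND = (
    (-1 / 6, 3 / 6, -3 / 6, 1 / 6),
    (3 / 6, -6 / 6, 0.0, 4 / 6),
    (-3 / 6, 3 / 6, 3 / 6, 1 / 6),
    (1 / 6, 0.0, 0.0, 0.0),
)


def polynomial_evaluator(coeffs, x):
    """Evaluate a cubic in nested form."""
    a, b, c, d = coeffs
    return ((a * x + b) * x + c) * x + d


def run_pipeline(stream):
    """Consume (local, index, control_point) items and emit deformed vertices.

    local is (s, t, u), index is (l, m, n), control_point is (x, y, z).
    """
    accumulator = [0.0, 0.0, 0.0]
    counter = 0                       # stands in for the 6-bit counter
    vertices = []
    for (s, t, u), (l, m, n), point in stream:
        # First stage: three polynomial evaluators (parallel in hardware).
        bs = polynomial_evaluator(BLEND[l], s)
        bt = polynomial_evaluator(BLEND[m], t)
        bu = polynomial_evaluator(BLEND[n], u)
        # Second stage: multiply blending results with the control point.
        weight = bs * bt * bu
        product = [weight * component for component in point]
        # Third stage: accumulate each dimension.
        for d in range(3):
            accumulator[d] += product[d]
        counter = (counter + 1) & 0x3F    # wraps to 0 after 64 additions
        if counter == 0:
            vertices.append(tuple(accumulator))
            accumulator = [0.0, 0.0, 0.0]  # reset for the next vertex
    return vertices


def vertex_stream(local, lattice):
    """Yield the 64 pipeline inputs needed for one vertex."""
    for l in range(4):
        for m in range(4):
            for n in range(4):
                yield local, (l, m, n), lattice[l][m][n]


# Undeformed regular lattice: every vertex should map to its own (s, t, u).
lattice = [[[(l - 1.0, m - 1.0, n - 1.0) for n in range(4)]
            for m in range(4)] for l in range(4)]
stream = list(vertex_stream((0.25, 0.5, 0.75), lattice))
stream += list(vertex_stream((0.9, 0.1, 0.3), lattice))
deformed = run_pipeline(stream)
```

Because the accumulator is cleared by the counter rather than by an explicit end-of-vertex marker, vertices can be streamed back to back, which is the property the pipelined design relies on.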
Figure 2. Block-level design of the OpenGL Evaluator Sub-block. (a) PE refers to a Polynomial Evaluator. (b) Stage 2 is a floating-point vector multiplier. (c) The output stage is a vector accumulator, whose result goes to OpenGL vertex rendering. (d) The outputs of the 6-bit counter go through a NOR gate that is attached to the asynchronous reset of the accumulator's register.

3.3 Other Features

To optimize the throughput of this system, we must push the parameterized coordinates (s,t,u) and the control points into the pipeline as fast as possible. The addition of three features will help address this problem.

3.3.1 Control Point Cache

This optional component ensures that the control points are available to the pipeline as required. The control points are first loaded into the control point cache prior to sending the parameterized coordinates (s,t,u) for any vertices in a model. Once the control points are loaded, vertices in the model can be deformed efficiently. Since the model is embedded in the control lattice as a preprocess, a synchronized sequence of model vertices and control points can be computed and stored for all or any part of a model (like a vertex array) to optimize cache performance and minimize its required size. Although the control point positions change to obtain deformations, their number and neighborhood relationships within the control lattice and the model vertices do not change. It is clearly possible to maintain either separate model and control point arrays or a single unified model and control point array. In either case, cache performance can be statically determined by the degree of array synchronization.

3.3.2 Data Interleaving and Direct Memory Access

Another optimization to evaluating the FFD is DMA. The parameterized object points (and their synchronized control lattice points) can reside in a fixed block of memory. A DMA controller can copy the model and control data directly to the OpenGL pipeline. This can ensure that the evaluator pipeline is fully utilized while the CPU is free to execute the application.
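One way to lay out such a fixed memory block is to follow each local coordinate with the control points that deform it. The Python sketch below is a hypothetical packing routine written for illustration; the flat list-of-floats layout, the (i, j, k) block index attached to each vertex, and the function name are assumptions, not part of the proposed hardware.

```python
FLOATS_PER_VERTEX = 3 + 64 * 3   # (s, t, u) followed by 64 control points


def interleave(vertices, lattice):
    """Build a flat stream: each local coordinate, then its control points.

    vertices: list of ((s, t, u), (i, j, k)), where (i, j, k) is the first
              control point of the 4 x 4 x 4 block that deforms the vertex.
    lattice:  lattice[i][j][k] is an (x, y, z) control point.
    """
    stream = []
    for local, (i, j, k) in vertices:
        stream.extend(local)
        for l in range(4):
            for m in range(4):
                for n in range(4):
                    stream.extend(lattice[i + l][j + m][k + n])
    return stream


# A 5 x 4 x 4 lattice: two deformable cells side by side along the first axis.
lattice = [[[(float(i), float(j), float(k)) for k in range(4)]
            for j in range(4)] for i in range(5)]
vertices = [((0.25, 0.5, 0.75), (0, 0, 0)),
            ((0.10, 0.2, 0.30), (1, 0, 0))]
stream = interleave(vertices, lattice)

# Control points that appear in the stream more than once.
sent = 64 * len(vertices)
unique = 5 * 4 * 4
redundant = sent - unique
```

Even in this two-vertex case, 48 of the 128 control points sent are repeats, which suggests why the ordering of vertices and the cache size matter for a layout of this kind.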

One method that takes advantage of DMA is interleaving parameterized local coordinates and their corresponding 64 control points. As data is streamed into the hardware, local coordinates are sent directly to the polynomial evaluator while the control points are sent to the cache (Figure 3). This method is inefficient due to the redundancy of control points in the data stream. Currently, we are investigating caching strategies and optimized arrangements of the data in order to reduce redundancy.

Figure 3: The Interleaved Data Stream. We see how the control points associated with each parameterized local coordinate are loaded into the cache while the local coordinate is sent to the proposed basic system for evaluation.

3.3.3 Polynomial Coefficient Register File

Although predetermining the polynomial evaluator has the advantage of simplifying the hardware design, it forces programmers to convert from one family of curves to another. This additional conversion process could be seen as another point of optimization. Instead of using CPU cycles in computing this transformation, we could add specialized hardware just for this transformation. This is, however, not optimal since the transformation involves matrix multiplication. Duplicating the existing multiplication hardware is wasteful, and rerouting the existing OpenGL matrix multiplication hardware disrupts the pipeline. By modifying the behavior of the polynomial evaluators in our system, we can bypass the transformation step. The only modification is to implement a more general polynomial evaluator. This means that the polynomial evaluator can evaluate a degree-three polynomial with arbitrary coefficients. Thus, our general polynomial evaluators are of the following form.

\[ \mathrm{GPE}(x) = ax^3 + bx^2 + cx + d \]

With this general polynomial evaluator, we can change the blending functions by changing the coefficients and degree of the polynomial evaluator. This is where the polynomial coefficient register file is useful. In order to maintain state for the family of curves we are currently rendering, we set the current coefficients and polynomial degree in the register file.

3.4 Further Enhancements

Another possible enhancement is the addition of a combination generator. From Equation (1), we see that the summation goes over all the 64 possible combinations of the blending functions for each of the three inputs s, t, and u. This means that if we did a straight evaluation of all possible combinations of the blending functions, we would be repeating 89 evaluations. One idea is to generate each blending function with each input s, t, and u and then forward the results to a combination generator that will then feed into the Stage 2 multiplier of Figure 2. This introduces additional memory and logic into the pipeline but it is worth considering, as shown in the next section.

Revisiting Equation (1) reveals that the multiply-accumulate operation is used often. It is natural to consider DSP floating-point multiply-accumulate (FMAC) structures. Although not shown here, we can replace the Multiplier and the Accumulator in Figure 2 with FMAC hardware.

3.5 Other Issues

For numerical representation, single-precision floating point suffices, but the efficiency of fixed-point is desirable. The concerns with fixed-point calculations are precision and range. Since the local coordinates (s,t,u) all range between 0 and 1, fixed-point representations appear feasible. We plan to investigate a fixed-point optimization in the future.

4 DISCUSSION

The system described so far is the core hardware unit. There are several ways to attain better performance. In order to gauge performance, we need to calculate the upper bound time complexity of each hardware organization. Let m be the time it takes to calculate a floating-point multiplication. We can also use m as an upper bound on addition and other operations. For the polynomial evaluator, we designed a simple iterative nested-form evaluator (Figure 4). This design comes from the transformation shown below.

\[ ax^3 + bx^2 + cx + d = d + x\,(c + x\,(b + x\,a)) \]

Figure 4: Polynomial Evaluator. Given the parameter and the coefficients of a cubic polynomial, we are able to evaluate the result in 3 loops.

Coefficient a is first multiplexed into the loop and multiplied with the parameter. Then the result is added to the next coefficient. The result of the addition is then looped back into the multiplier and the process is continued two more times. The whole process takes 3 loops and performs 2 operations in each loop for a total of 6m time.
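The nested-form loop and the idea of a coefficient register file can be sketched together in software. The class below is a hypothetical model written for illustration: the register file is a dictionary keyed by an arbitrary family name, and an operation counter stands in for the cost in units of m. None of these names come from the paper.

```python
class GeneralPolynomialEvaluator:
    """Nested-form evaluator for a*x^3 + b*x^2 + c*x + d."""

    def __init__(self):
        self.register_file = {}   # family name -> list of (a, b, c, d)
        self.current = None       # coefficient sets of the selected family
        self.operations = 0       # multiplications and additions performed

    def load(self, family, coefficient_sets):
        """Store the coefficient sets of one family of blending functions."""
        self.register_file[family] = coefficient_sets

    def select(self, family):
        """Make a previously loaded family the current state."""
        self.current = self.register_file[family]

    def evaluate(self, index, x):
        """Evaluate blending function number index at x."""
        a, b, c, d = self.current[index]
        result = a                          # coefficient a enters the loop first
        for coefficient in (b, c, d):       # three loops
            result = result * x             # one multiplication
            result = result + coefficient   # one addition
            self.operations += 2
        return result


gpe = GeneralPolynomialEvaluator()
gpe.load("uniform cubic B-spline", [
    (-1 / 6, 3 / 6, -3 / 6, 1 / 6),
    (3 / 6, -6 / 6, 0.0, 4 / 6),
    (-3 / 6, 3 / 6, 3 / 6, 1 / 6),
    (1 / 6, 0.0, 0.0, 0.0),
])
gpe.load("cubic Bernstein", [
    (-1.0, 3.0, -3.0, 1.0),   # (1 - x)^3
    (3.0, -6.0, 3.0, 0.0),    # 3x(1 - x)^2
    (-3.0, 3.0, 0.0, 0.0),    # 3x^2(1 - x)
    (1.0, 0.0, 0.0, 0.0),     # x^3
])

gpe.select("uniform cubic B-spline")
bspline_1 = gpe.evaluate(1, 0.5)

gpe.select("cubic Bernstein")
bernstein_1 = gpe.evaluate(1, 0.5)
```

Switching families only changes which coefficients are read; the loop itself, and therefore its cost of six operations per evaluation, stays the same.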

4.1 Basic System

The first approach is to use the core unit by itself. This is the simplest implementation and the slowest. From Figure 2, we see that Stage a takes 6m time, Stage b takes 3m time, and Stage c takes 1m time. This loops 64 times, which gives us a total of 64(6+3+1) = 640m time.

4.2 Fully Parallel System

At the other extreme, we can parallelize the whole operation by duplicating the basic system 64 times. Then we replace all the Accumulators of each basic system with a single tree adder for each dimension (Figure 5). This calculates all 64 basis function combinations in 9m time (6m for the polynomial evaluator and 3m for the Multiplier) while the tree adder takes log2(64) m = 6m time to add all the 64 results. The fully parallel system takes a total of 15m time.

Figure 5: Fully Parallel System. The vectors are fed into a row of 64 parallel units and then their results are all added together by a tree adder. The block is the basic system without the Accumulators.

4.3 Iterative Tree System

Instead of building all 64 units, we can build n units and loop several times to accumulate the calculation. The organization is the same as the Fully Parallel System (Figure 5) except we replace the tree adder with a tree accumulator. Each unit will take 9m time to calculate the basis functions and then it will take log2(n) m time for the tree accumulator to add the results. This organization will then loop ceil(64/n) times. The total time for this organization is shown below.

\[ \text{time} = \left\lceil \frac{64}{n} \right\rceil \left( 9 + \log_2 n \right) m \]

4.4 Basic System with a Combination Generator

In Section 3.4, we pointed out that a lot of the calculations were repeated. In order to enhance the performance of the system, a new hardware unit called the Combination Generator (CG) could be used. This unit (Figure 6) is 3 banks of register files that contain 4 floating-point registers each. The CG stores unique basis function calculations. Once stored, the CG generates all possible combinations and feeds them into the Multiplier. The organization that uses the CG (Fig. 7) is not different from the basic system. We simply insert a CG in between the Polynomial Evaluator and the Multiplier in Figure 2.

Figure 6: Combination Generator. It is composed of 3 register banks, each with 4 locations. The Polynomial Evaluator will calculate all values and store the results in the appropriate location. The data is then selected from each bank and sent to the vector multiplier.

Figure 7: System with Combination Generator. We add in a Combination Generator between the polynomial evaluator and the vector multiplier of the basic system.

To load the CG with values, assuming we only have 1 polynomial evaluator, takes 12*6 = 72m time. Then, assuming that the CG takes no more than 1m time to generate a combination, it takes 5m time (CG + Multiplier + Accumulator) to perform one loop. The total time for this organization is 72+64*5 = 392m time. Compared to the basic system, this amounts to a 39% savings. To gain an additional 7% savings, we can use three Polynomial Evaluators in parallel. This takes 4*6 = 24m time to fill the CG. Using three Polynomial Evaluators only affects the time it takes to fill the CG. We get a total of 24+64*5 = 344m time, which is a 46% savings.

5 OPENGL COMMAND EXTENSIONS

The OpenGL extensions are described in two variations. The first set contains the extensions needed if the general polynomial evaluator is not used, while the other set is for use with the general polynomial evaluator.

5.1 Basic System

In the basic system, we need to provide the OpenGL API that allows an application to interface with the hardware without the general polynomial evaluator. Since the GLU libraries already have a well-known interface for rendering NURBS, we utilize the same approach to render FFDs. Below we list the necessary OpenGL API calls.

gluNewFFDRenderer: generates a new FFD object to refer to when rendering.
gluFFDProperty: changes common properties like line width, etc.
gluFFDCallback: callback used to monitor the progress of the rendering. Also used for error monitoring.

gluFFDVolume: used to generate the surface based on the control points and the model. This is the main call that accepts the array of parameterized local coordinates and control points. The programmer does not need to explicitly interleave the data but does need to provide a fourth dimension that indexes into the control point array. The OpenGL implementation can then repackage the data by interleaving parameterized local coordinates with the appropriate control points.

This extension of the GLU library is similar to the GLU NURBS interface. In fact, it is used in exactly the same way [5].

5.2 System with Optional Components

By adding the general polynomial evaluator, we no longer have to work in the GLU library interface. With the general polynomial evaluator hardware, we can access the evaluators directly. Here are the corresponding OpenGL API calls for this system [5].

glMap3: similar to glMap2, it is used to specify the array of control points.
glEvalCoord3: used to evaluate a parameterized object point. This and glMap3 are used to evaluate a single parameterized local coordinate.
glBlendCoefficient: used to specify the blending polynomial degree and coefficients.
glFFD: this call mimics the behavior of gluFFDVolume. It accepts an interleaved stream and sends it directly to the hardware unit.

Notice that calls similar to glMapGrid or glEvalMesh do not exist. This is because it does not make sense to evaluate a regularly spaced grid, since most of the time the coordinates to be evaluated will come from a complex model. An implementation may include them but they would be of limited use.

[Plates 1 through 8]

6 EXAMPLES

All of the models were parameterized within a single cube of deformation consisting of a 64-point control lattice. The cat and the second chair (Plates 9 and 10) were embedded inside the deformable region while the office chair (Plates 5, 6, 7, 8) had its stem and wheels embedded in the deformable region.
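A 64-point lattice of this kind can be generated from its eight corner points. The Python sketch below is an illustrative reconstruction rather than the authors' code: it assumes a cube spanning [-1, 1] on each axis, trilinear interpolation of the corners to fill in the remaining control points, and a twist made by rotating the four top corners by 90 degrees about the y-axis. All function names are hypothetical.

```python
import math


def rotate_y(point, degrees):
    """Rotate a point about the y-axis by the given angle."""
    angle = math.radians(degrees)
    x, y, z = point
    return (x * math.cos(angle) + z * math.sin(angle),
            y,
            -x * math.sin(angle) + z * math.cos(angle))


def lattice_from_corners(corners, size=4):
    """Fill a size^3 lattice by trilinear interpolation of eight corners.

    corners maps (a, b, c), each 0 or 1, to an (x, y, z) position.
    lattice[i][j][k] runs along x, y and z respectively.
    """
    lattice = []
    for i in range(size):
        fx = i / (size - 1)
        plane = []
        for j in range(size):
            fy = j / (size - 1)
            row = []
            for k in range(size):
                fz = k / (size - 1)
                p = [0.0, 0.0, 0.0]
                for (a, b, c), corner in corners.items():
                    w = ((fx if a else 1 - fx)
                         * (fy if b else 1 - fy)
                         * (fz if c else 1 - fz))
                    for d in range(3):
                        p[d] += w * corner[d]
                row.append(tuple(p))
            plane.append(row)
        lattice.append(plane)
    return lattice


# Eight corners of a cube spanning [-1, 1] in each axis (an assumption).
corners = {(a, b, c): (2.0 * a - 1, 2.0 * b - 1, 2.0 * c - 1)
           for a in (0, 1) for b in (0, 1) for c in (0, 1)}

# Twist: rotate only the four top corners (b == 1) about the y-axis.
twisted = {key: (rotate_y(p, 90.0) if key[1] == 1 else p)
           for key, p in corners.items()}

lattice = lattice_from_corners(twisted)
point_count = sum(len(row) for plane in lattice for row in plane)
```

Only eight positions have to be animated in this scheme; the other 56 control points follow from the interpolation, which keeps the user-facing controls small.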
Most deformations shown here were obtained by rotating the top 4 corner control points by 90 degrees about the y-axis. All other control points are linearly interpolated based on the position of the 8 corner control points. The shearing deformation on the office chair was obtained by lowering the bottom left corner control points while lifting the bottom right corner control points (see Plates 6 and 8).

Plates 1 and 2 show views of a cat being embedded in a control lattice. Plates 3 and 4 show the cat being twisted after the control points are rotated 90 degrees around the y-axis. Plates 5 and 6 show an office chair being embedded in a control lattice. Plates 7 and 8 show the legs of the office chair being sheared. Plate 9 shows another chair being embedded in a control lattice. Plate 10 shows the chair being twisted.

[Plates 9 and 10]

7 CONCLUSIONS

We show how to integrate free-form deformation, a popular geometric deformation technique, into OpenGL by adding a few OpenGL API calls. In addition, we propose a block-level design of the OpenGL evaluator sub-system in order to support FFD evaluation in hardware. By implementing these changes, programmers would have access to a standard geometric deformation programming interface as well as a hardware accelerated deformation system.

References

[1] D. Bechmann. Space Deformation Models Survey. Computers & Graphics 1994, 18(4), pages 571-586.
[2] S. Coquillart. Extended Free-Form Deformation: A Sculpturing Tool for 3D Geometric Modeling. SIGGRAPH 1990, volume 24, pages 187-196.
[3] J.D. Foley, A. van Dam, S.K. Feiner, J.F. Hughes. Computer Graphics: Principles and Practice, 2nd edition in C. Addison-Wesley, 1996, pp. 50-5.
[4] R. MacCracken and K.I. Joy. Free-Form Deformations With Lattices of Arbitrary Topology. SIGGRAPH 1996, pages 181-188.
[5] J. Neider, T. Davis, M. Woo. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Release 1. Addison-Wesley, 1993.
[6] OpenGL Architecture Review Board. OpenGL Reference Manual: The Official Reference Document for OpenGL, Release 1. Addison-Wesley, 1992.
[7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. 2nd Edition, Cambridge University Press, 1992, pages 379-383.
[8] T. W. Sederberg and S.R. Parry. Free-Form Deformation of Solid Geometric Models. SIGGRAPH 1986, volume 20, pages 151-160.
[9] M. Segal, K. Akeley. The OpenGL Graphics Pipeline. 99. http://trant.sgi.com/opengl/docs/white_papers/oglgraphsys/opengl.html
[10] Sony Computer Entertainment of America. PlayStation 2 Technical Specifications and Technical Demos. http://www.playstation.com/news/ps.asp
[11] A. Witkin and D. Baraff. Physically-Based Modeling. SIGGRAPH 1997 Course Notes. Course 19, Differential Equation Basics.