GPU-based Parallel Construction of Compact Visual Hull Meshes

Byungjoon Chang, Sangkyu Woo, Insung Ihm

Abstract Building a visual hull model from multiple two-dimensional images provides an effective way of understanding the three-dimensional geometries inherent in the images. In this paper, we present a GPU accelerated algorithm for volumetric visual hull reconstruction that aims to harness the full compute power of the many-core processor. From a set of binary silhouette images with respective camera parameters, our parallel algorithm directly outputs the triangular mesh of the resulting visual hull in the indexed face set format for a compact mesh representation. Unlike previous approaches, the presented method extracts a smooth silhouette contour on the fly from each binary image, which markedly reduces the bumpy artifacts on the visual hull surface due to a simple binary in/out classification. In addition, it applies several optimization techniques that allow an efficient CUDA implementation. We also demonstrate that the compact mesh construction scheme can easily be modified to produce a time- and space-efficient GPU implementation of the marching cubes algorithm.

Keywords Visual hull, volumetric approach, compact mesh, GPU algorithm, CUDA implementation, marching cubes algorithm

B. Chang, S. Woo, I. Ihm (Corresponding author)
Department of Computer Science and Engineering, Sogang University, Seoul, Korea
Tel.: +82-2-705-8493, Fax: +82-2-704-8273
E-mail: jerrun@sogagn.ac.kr, coldnght.w@gmail.com, hm@sogang.ac.kr

1 Introduction

Since it was introduced to the computer vision community, the idea of reconstructing three-dimensional (3D) shapes from object silhouettes in two-dimensional (2D) images [1] has been applied to model static or dynamic 3D objects effectively in scenes.
Given silhouette images from multiple camera views along with their viewing parameters, a visual hull can be constructed by intersecting the silhouette cones they respectively define [8], thus representing the maximal volume implied by the silhouettes. Volumetric methods employ a fixed or adaptive volume grid representation to produce a visual hull. Either a set of small 3D cells that approximate the visual hull region is generated, or its boundary surface is polygonized using a surface extraction technique such as the marching cubes algorithm [10] (refer to, for instance, [15] for a quick review of volumetric visual hull methods).

The volumetric approach, while numerically robust, has sometimes been regarded as less accurate than the polyhedral approach, e.g. [11, 5], that attempts to compute the exact intersection of silhouette cones via explicit geometry processing. Current hardware systems, however, cope easily with high-resolution grids to increase the precision of the resulting visual hull. Furthermore, thanks to efforts to sample the volume space adaptively and estimate its boundary surface more accurately, e.g. [4, 9], high-quality visual hull meshes are now routinely generated by the volume-based methods.

Important advantages of the volumetric approach are the simplicity of its algorithm and its inherent parallelism in computation, which allows an efficient parallel implementation with current hardware (refer to [7] to see some previous implementations in various parallel environments). In particular, it is well suited to implementation on current GPUs, which are highly parallel, multithreaded, many-core processors. This observation naturally led to several GPU-based implementations of volumetric visual hull algorithms [7, 14, 16], particularly using the compute unified device architecture (CUDA) API from NVIDIA [12].

In this paper, we present a GPU accelerated parallel algorithm for volumetric visual hull reconstruction that, from a sequence of multiple binary silhouette images, constructs exact visual hull models effectively by fully exploiting the compute power of the many-core processor. Unlike previous marching cubes-based techniques that simply list extracted triangles with the same vertices repeated in the representation, our GPU algorithm removes such duplication and produces a triangular mesh in compact form using the indexed face set method so that the resulting mesh is instantly available for efficient applications. Our method extracts smooth piecewise-linear silhouette contours on the fly for a sophisticated voxel classification, which leads to a significant reduction of the bumpy artifacts that often occur on the visual hull surface due to a simple binary in/out classification. For a time- and space-efficient CUDA implementation, we apply several optimization and data-parallel programming techniques including parallel prefix sum [6] and parallel radix sort [13]. In particular, as in [9], our method also estimates the exact locations of vertices on the visual hull surface using input silhouette images. However, ours is based on the concept of perspective correction, which requires fewer floating-point operations to implement. Last but not least, the presented GPU technique for the direct construction of a compact mesh can easily be modified to produce an efficient GPU implementation of the marching cubes algorithm, as demonstrated in the paper.

2 Our methods

Our GPU computation framework has three main phases, which are explained in the following subsections.
Throughout this work, we assume that the 3D computational volume is discretized into a regular grid of given resolution, where its grid points and cubes made of eight neighboring grid points are called voxels and cells, respectively. Recall that the input to our scheme is a set of multiple binary silhouette images with respective camera viewing parameters, which is repeatedly produced for every time frame (see Fig. 1 for some example input images). In particular, the object to be reconstructed is represented as black pixels in the images, and the background as white pixels.

Fig. 1 Input binary silhouette images and the generated visual hull model: (a) camera view 0, (b) camera view 1, (c) camera view 2, (d) camera view 3, (e) the resulting model. Our test dataset consists of 20 binary silhouette images of 1,280 × 720 pixels per time frame with respective camera calibration parameters. Figures (a) to (d) show four selected input images for a given time frame, and (e) displays the created visual hull model.

2.1 Phase 1: Extraction of smooth piecewise-linear silhouette contours

Unlike the previous interactive visual hull techniques that are based on a simple binary classification of projected voxels, our method initially constructs smooth silhouette contours on the fly from the input binary silhouette images for a more refined classification that markedly reduces bumpy artifacts on the surface of the resulting visual hull. The first step in our GPU accelerated scheme is to apply a Gaussian filter of given size to each input binary image, where we use the recursive filtering technique [3] that is easily implemented with separable horizontal and vertical convolutions. To improve the filtering efficiency, the binary image is partitioned into tiles of m × m pixels (for our test images having resolution of 1,280 × 720 pixels, the best performance was observed in our parallel implementation when 32 × 32 tiles were used). A CUDA kernel is then executed over the tiles to see if a given tile is on the border, i.e. if it contains both black and white pixels, marking the boundary tile and its eight neighboring tiles as valid.
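The border-tile test described above can be illustrated with a short serial sketch. It is only a CPU stand-in for the first CUDA kernel: the function name `mark_valid_tiles`, the list-of-rows image layout, and the handling of partial tiles at the image edge are assumptions made here, not details taken from the implementation.

```python
def mark_valid_tiles(image, m):
    """Return a 2D grid of booleans, one per m x m tile.

    image: list of rows holding 0 (black, object) or 1 (white, background).
    A tile that contains both values is a border tile; it and its (up to)
    eight neighbors are marked valid, i.e. selected for Gaussian filtering.
    """
    height, width = len(image), len(image[0])
    rows = (height + m - 1) // m  # partial tiles at the edge are kept (assumption)
    cols = (width + m - 1) // m
    valid = [[False] * cols for _ in range(rows)]
    for tr in range(rows):
        for tc in range(cols):
            # Collect the distinct pixel values that occur inside this tile.
            seen = {image[y][x]
                    for y in range(tr * m, min((tr + 1) * m, height))
                    for x in range(tc * m, min((tc + 1) * m, width))}
            if len(seen) > 1:  # both black and white pixels: border tile
                for nr in range(max(tr - 1, 0), min(tr + 2, rows)):
                    for nc in range(max(tc - 1, 0), min(tc + 2, cols)):
                        valid[nr][nc] = True
    return valid
```

On the GPU each tile would be handled by its own thread or block; the loops here only make the marking rule explicit.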
Then, the actual Gaussian filtering is performed tile by tile by a second kernel, where the convolution operations are carried out only with respect to the valid tiles. In our experimentation, this border-tile-only filtering strategy resulted in a significant speedup over a simple CUDA implementation that performs the convolutions against all image pixels, because the input binary images often possess a high degree of spatial coherence, as do those in Fig. 1(a) to (d). As a result of the smoothing process, the input binary images are converted to grayscale images whose pixel values now vary from zero (inside) to one (outside).

For efficient GPU processing in later stages, this first phase produces a 2D array for every input image, each of whose elements corresponds to a square region formed by four adjacent pixels of the filtered silhouette image (note that the square in the screen space is the 2D version of the cell in the 3D volume space). For this, another group of parallel threads is spawned, one for each square, where each thread classifies the four corners of the assigned square using a given threshold value, i.e. a given iso-value. When all four intensity values are less than (greater than) the iso-value, the square is simply marked inner (outer). Otherwise, it is marked boundary, and stored with a line segment (or segments) extracted through linear interpolation using a simple 2D version of the marching cubes algorithm (see Fig. 2).

Fig. 2 Square classification and extraction of piecewise-linear silhouette contour (iso-value = 0.5). Each square region, made of four adjacent pixels of a Gaussian-filtered silhouette image, is classified as inner (I), outer (O), or boundary (B) according to the pixels' intensities and a given threshold value. For a boundary square, an oriented line segment (or segments) is additionally stored so that the 3D voxel classification in the second phase of our method can be made efficiently.

The extracted line segments together form a smooth piecewise-linear silhouette contour.
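As an illustration of the per-square work, the following serial sketch classifies one square and extracts its oriented segment(s) by linear interpolation. Several choices are assumptions made for this sketch only: corners are given counter-clockwise in a unit square with the y axis pointing up, values equal to the iso-value are treated as outer, the ambiguous case of two diagonal inner corners is resolved by isolating each inner corner, and the names `classify_square` and `orient` are hypothetical.

```python
# Unit-square corners in counter-clockwise order (y axis pointing up: assumption).
CORNERS = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]


def orient(p, q, inner_ref):
    """Order the segment so that inner_ref lies on its right-hand side."""
    cross = (q[0] - p[0]) * (inner_ref[1] - p[1]) - (q[1] - p[1]) * (inner_ref[0] - p[0])
    return (p, q) if cross < 0 else (q, p)


def classify_square(values, iso):
    """values[k] is the filtered intensity at CORNERS[k].

    Returns ("I" | "O" | "B", list of oriented segments).
    """
    inner = [v < iso for v in values]  # values equal to iso count as outer (assumption)
    if all(inner):
        return "I", []
    if not any(inner):
        return "O", []
    crossings = {}  # edge index -> interpolated crossing point
    for e in range(4):
        a, b = e, (e + 1) % 4
        if inner[a] != inner[b]:
            t = (iso - values[a]) / (values[b] - values[a])
            crossings[e] = (CORNERS[a][0] + t * (CORNERS[b][0] - CORNERS[a][0]),
                            CORNERS[a][1] + t * (CORNERS[b][1] - CORNERS[a][1]))
    if len(crossings) == 2:
        p, q = crossings.values()
        return "B", [orient(p, q, CORNERS[inner.index(True)])]
    # Four crossings: cut off each inner corner with its own segment.
    segments = [orient(crossings[(k - 1) % 4], crossings[k], CORNERS[k])
                for k in range(4) if inner[k]]
    return "B", segments
```

With the interior on the right of every stored segment, the sign of a single cross product is enough to decide whether a point that falls in a boundary square is inside the contour.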
However, it should be emphasized that the connection information between line segments is not recorded explicitly while they are independently generated by parallel threads on the GPU. Instead, we store each line segment with an orientation in the boundary square such that the interior region is always located on the right side, which allows an easy in/out classification for the boundary square region. Once this 2D inner/outer/boundary classification is over for each input silhouette image, in the next stage, the decision whether a voxel in the 3D volume space is contained in a silhouette cone, generated by a given input image, can be made efficiently by projecting the voxel onto the corresponding silhouette image and checking, using the simple 2D classification data, if the projected voxel resides in the interior area of the extracted silhouette contour. Fig. 3 shows three example sets of a binary silhouette, a Gaussian-filtered silhouette, and an extracted contour (from left to right, respectively), where it is demonstrated that the Gaussian filter successfully removed the bumpy silhouettes in the original binary images to generate smooth but feature-preserving silhouette contours.

2.2 Phase 2: Construction of compact mesh structure

Given the 2D classification data for extracted silhouette contours, we start to build a visual hull model. We aim to represent the model compactly in the indexed face set format, where the triangular mesh structure consists of a simple list of vertex coordinates and a list of triangles that index the vertices they use. In this phase, the mesh structure is constructed only partially using temporary vertex information (refer to Fig. 8), and the actual vertex coordinates of the model are computed in the final phase.

The first step in this phase is to identify the boundary cells with which the visual hull surface intersects. For efficient GPU computation, we first linearize the 3D voxel grid into a one-dimensional (1D) array called a voxel array (VA) with a simple address calculation, in which chunks of contiguous voxel elements form CUDA blocks of threads (see Fig. 4).
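The address calculation is not spelled out in the text. A minimal sketch, assuming an x-fastest ordering and a hypothetical block size (neither is stated in the paper), could look as follows.

```python
def voxel_index(x, y, z, nx, ny):
    """Linear VA address of voxel (x, y, z) in an nx * ny * nz grid (x varies fastest)."""
    return x + nx * (y + ny * z)


def voxel_coords(index, nx, ny):
    """Inverse mapping: recover (x, y, z) from a linear VA address."""
    z, rest = divmod(index, nx * ny)
    y, x = divmod(rest, nx)
    return x, y, z


def block_and_thread(index, block_size):
    """Which chunk of contiguous voxels (block) an address falls in, and its slot there."""
    return index // block_size, index % block_size
```

For example, with a hypothetical block size of 256, address 300 falls in block 1 at slot 44, so neighboring threads of a block touch neighboring VA elements.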
Each thread executes a voxel classification kernel which projects the corresponding voxel onto input silhouette images and uses the 2D inner/outer/boundary classification information to check whether it resides inside of the respective silhouette contours. Since the visual hull is the intersection of all silhouette cones, the voxel is marked inner only if it is found to exist within all silhouette contours.

Fig. 3 Construction of smooth piecewise-linear silhouette contours for (a) Example 1, (b) Example 2, and (c) Example 3 (left and center: binary and filtered silhouette images, right: extracted contour). We find that, for input binary images of 1,280 × 720 pixels, a 7 × 7 Gaussian filter with threshold value 0.5 usually generates satisfactory contours.

Fig. 4 Inner/outer classification of voxels. In the beginning of the second phase, the voxels in the 3D volume space, linearized into a 1D array on the GPU memory, are classified as inner (1) or outer (0) according to whether they are inside of the visual hull, i.e. whether they are inside of all the silhouette cones, generated by the extracted silhouette contours. Example voxel array (VA): 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1.

Similarly, the 3D cells in the volume space are enumerated linearly into a 1D array called a boundary cell array (BCA), which is for marking the boundary cells. We also allocate another array of the same dimension, called a triangle count array (TCA), for remembering the number of triangles generated in each boundary cell. A cell classification kernel is then executed by each thread that, using the information in the VA, classifies the cell it handles and marks the corresponding BCA element if it intersects the visual hull surface, i.e. if the inner/outer classifications of the eight incident voxels do not coincide. In that case, the thread additionally calculates how many triangles are created in the cell, and stores the count in the corresponding TCA element. At this moment, only the triangle count is calculated quickly by referring to the marching cubes table without generating the actual triangles. Fig. 5 illustrates an example in which the second, fourth, fifth, and sixth cells (counting from zero) are classified as boundary cells and their triangle counts are accordingly stored (see the second row).

Fig. 5 Extraction of boundary cell information. Using the voxel classification information in the VA, each cell of the 3D grid, initially linearized into a 1D array on the GPU memory, is first checked if it is a boundary cell. If it is, the number of triangles that will be created in the cell by the marching cubes algorithm is also stored (the two arrays in the second row). Then, through the help of the exclusive-scan operation, only the boundary cells are packed into an array (BCIDA). At the same time, the accumulated number of triangles created before a current boundary cell is also recorded in an extra array (TOA) for a proper address calculation in a later stage. Example data shown in the figure:

    Cell index                        0 1 2 3 4 5 6 7
    Boundary Cell Array (BCA)         0 0 1 0 1 1 1 0
    Triangle Count Array (TCA)        0 0 2 0 1 3 1 0
    BCA after scan                    0 0 0 1 1 2 3 4
    TCA after scan                    0 0 0 2 2 3 6 7
    Boundary Cell ID Array (BCIDA)    [2] [4] [5] [6]
    Triangle Offset Array (TOA)       0 2 3 6 7

We next perform a data-parallel prefix sum (scan) operation [6] on both the BCA and the TCA. Note that an application of the exclusive scan operation on a sequence $(a_0, a_1, a_2, \ldots)$ returns another sequence $(0, b_1, b_2, \ldots)$ such that $b_i = \sum_{j=0}^{i-1} a_j$, $i = 1, 2, \ldots$. So, for a boundary cell, the scanned BCA contains the number of boundary cells that precede it in the cell enumeration, and hence the offset for storing the boundary cell's ID in the compacted boundary cell ID array (BCIDA). Likewise, the corresponding element of the scanned TCA indicates the number of all triangles generated in the preceding boundary cells. By storing this information using the same offset in another array called a triangle offset array (TOA), the triangles from the boundary cell can be generated and stored in the proper location by a parallel CUDA thread in a later stage. See the example in Fig. 5, where the ID of the second boundary cell (counting from zero) in the BCA is 5, and its three triangles should be stored in the triangle list of the mesh structure, starting from the third element.

We are now prepared to fill in the triangle and the vertex lists of the triangular mesh structure. Fig. 6 illustrates a list of triangles, called a triangle list (TL), that initially contains a sequence of triples, one per triangle, in which each (i, j), encoded in a 32-bit unsigned integer, indicates the jth vertex of the ith triangle. In addition, we use another list called an edge ID list (EIDL); its elements, mapped one to one with those of the TL, will hold temporary vertex information. Here, a triangle's vertex that exists along an edge of a boundary cell is temporarily denoted by a pair made of the ID of the inner voxel and the axis direction from the base voxel. For instance, the triangle in Fig. 7 is represented by a triple of vertices: ((16, -z), (17, -z), (16, +y)).

Fig. 6 GPU-based removal of duplicate vertices generated by the marching cubes algorithm. Here, the on flag of the eighth element of the extra array (marked by (A)) indicates that the second vertex of the third triangle (B) is (9, +x) (C), and should be stored in the third slot (D) of the redundancy-free vertex list of the compact triangular mesh structure. The 0/1 array below indicates the head vertex of each redundant vertex group. Note that the index of a temporary vertex to the vertex list can be calculated by subtracting one from the sum of the corresponding values in the 0/1 array and its scanned sequence. For instance, the index of the vertex (19, 0) in the twelfth slot of the sorted TL (E) is 4 (= 0 + 5 - 1). Before sorting, the TL holds (0, 0), (0, 1), (0, 2), ..., (6, 2) and the EIDL holds the matching temporary vertices, e.g. (16, -z), (17, -z), (16, +y) for triangle 0. After the radix sort with respect to the EIDL elements, the figure shows (slots 0 to 20):

    Sorted TL:    (17,1) (5,1) (4,0) (19,2) (26,1) (57,1) (29,0) (4,2) (3,2) (35,0) (88,0) (87,2) (19,0) (47,2) (37,2) (57,0) (16,0) (55,0) (29,0) (67,1) (15,1)
    Sorted EIDL:  (3,-x) (3,-x) (3,+x) (3,+x) (3,+x) (3,+x) (3,-y) (3,-y) (9,+x) (9,+x) (9,-y) (9,-y) (9,-y) (9,-y) (12,+x) (12,+x) (12,+x) (12,+y) (12,+y) (12,+y) (12,+z)
    0/1 array:    1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1 0 0 1
    After scan:   0 1 1 2 2 2 2 3 3 4 4 5 5 5 5 6 6 6 7 7 7

Fig. 7 Temporary vertex representation. A vertex along an edge of a boundary cell, whose exact location is not known yet, is temporarily denoted by the ID of the inner voxel and the axis direction from the voxel. Its actual coordinates are calculated in the third phase.

Then, each CUDA thread, launched one per boundary cell in the BCIDA, again refers to the marching cubes table to compute the temporary vertex pairs of all triangles that are created from the cell, storing them in the proper place in the EIDL using the offset information found in the TOA. After that, the TL and EIDL are simultaneously sorted by the EIDL's values, which have been encoded in a 32-bit unsigned integer, using the parallel radix sort algorithm [13]. As a result, the duplicate vertices, shared by adjacent triangles, are placed in a contiguous region of the EIDL with corresponding vertex identifications in the TL.

Now, a CUDA thread, generated one per sorted EIDL element, checks whether the assigned element's vertex is the first one in the same vertex group (this can be done easily by comparing with the preceding element in the EIDL), and marks the result in an extra array. This 0/1 array, together with an additional sequence obtained through an exclusive scan, provides the offset information with which the locations of (temporary) vertices in the redundancy-free vertex list of the mesh structure are easily determined (refer to Fig. 6 again to see how to decide the index of an arbitrary vertex to the compacted vertex list). Finally, a CUDA kernel is executed on the threads spawned with respect to the sorted TL elements, where, for the corresponding vertex (i, j), each thread calculates the index, id, to the vertex list of the triangular mesh (Vertex list in Fig. 8), and records id as the jth index of the ith triangle in the triangle list (Triangle list). Also, to reduce memory bandwidth consumption, only the thread for the head vertex, i.e. the thread corresponding to the first one in the same vertex group, stores the temporary vertex information of the vertex (i, j) in the x-coordinate field of the id-th vertex in the vertex list. For example, in Fig. 6, the index of the vertex (3, 2) in the eighth slot of the sorted TL (B) is 3 (= 1 + 3 - 1). Hence, the second index (counting from zero) of the third triangle becomes 3 (see Fig. 8). Also, since the vertex is a head vertex, the corresponding thread stores its temporary vertex information (9, +x) (C) in the x component of the third vertex of the vertex list.

Fig. 8 The compact triangular mesh structure in the indexed face set representation after the second phase of our method. In the following third stage, a thread is spawned for each x-component value of Vertex list to compute the actual xyz coordinates, and store them in the corresponding slot, completing the construction of a compact triangular mesh structure. Note that Triangle list in this figure has only been partially filled with the example data from Fig. 6. The Vertex list shown holds, in slots 0 to 7, the temporary vertices (3, -x), (3, +x), (3, -y), (9, +x), (9, -y), (12, +x), (12, +y), (12, +z) in the x field, with the y and z fields still unset.

2.3 Phase 3: Efficient generation of exact vertex coordinates using the idea of perspective correction

To complete the construction of the compact mesh structure, a parallel thread, spawned with respect to each temporary vertex information in the x component of the vertex list, finds the actual location of the corresponding vertex, and overwrites the temporary vertex information with its xyz coordinate vector. In this process, we exploit the idea of perspective correction, which has been effectively applied in 3D graphics for correct texture mapping, to efficiently estimate the exact intersection in the world space, i.e. in the volume space, between an edge of a boundary cell and a silhouette cone formed by an extracted silhouette contour.

For the edge containing the temporary vertex of a current thread, let $p^i_c = (x^i_c \; y^i_c \; z^i_c)^t$ and $p^o_c = (x^o_c \; y^o_c \; z^o_c)^t$ be the camera space coordinates of its two end voxels $p^i_w$ and $p^o_w$ in the world space, between which the intersection $p^s_w = \alpha \, p^o_w + (1 - \alpha) \, p^i_w$ is to be computed (see Fig. 9). For each camera view, the thread projects $p^i_w$ and $p^o_w$ into the screen space where the input silhouette image exists, and finds the exact intersection $p^s_s$ between the projected edge $(p^i_s, p^o_s)$ and the piecewise-linear silhouette curve using an extended version of Bresenham's line-drawing algorithm [2], which reveals the ratio $t = \| p^s_s - p^i_s \| / \| p^o_s - p^i_s \|$. Because of the perspective projection, $t$ is in general different from the needed ratio $\alpha$. In fact, it can be shown that

$$\alpha = \frac{t / z^o_c}{t / z^o_c + (1 - t) / z^i_c} = \frac{t \, z^i_c}{t \, z^i_c + (1 - t) \, z^o_c},$$

which requires only two additions, two multiplications, and one division to calculate from $t$ (refer to Appendix for the correctness of the formula and Fig. 10 for a pseudocode for computing the $\alpha$ value).

Fig. 9 Efficient computation of exact intersection in the world space. For a given camera view, the intersection $p^s_w$ in the world space between an edge of a boundary cell, $(p^i_w, p^o_w)$, and the corresponding silhouette cone can be calculated by finding the intersection $p^s_s$ in the screen space between the projected edge $(p^i_s, p^o_s)$ and the piecewise-linear silhouette contour (the curve on the plane), extracted in the first phase of our method.

    Input:  two voxel coordinates Pi_w and Po_w in volume space, and a camera ID.
    Output: alpha, the distance ratio from Pi_w to the point Ps_w on the silhouette cone.
    Begin
      S    := the current silhouette image;
      Pi_s := the projection of Pi_w onto S;
      Po_s := the projection of Po_w onto S;
      March from Pi_s to Po_s in S until a boundary square BS is met;
      Ps_s := the intersection between the BS's line segment(s) and (Pi_s, Po_s);
      t    := the distance ratio from Pi_s to Ps_s;
      zi_c := the z coordinate of Pi_w in camera space;
      zo_c := the z coordinate of Po_w in camera space;
      tmp  := t*zi_c;
      alpha := tmp/(tmp + (1-t)*zo_c);
    End

Fig. 10 Pseudocode for the computation of the α value for a given camera view.
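The last two lines of the pseudocode in Fig. 10 can be checked numerically with a few lines of Python. The sketch below is illustrative only: it assumes a pinhole projection with unit focal length and positive depths, and the function names are hypothetical. The contour marching that produces t is not reproduced here.

```python
def alpha_from_screen_ratio(t, zi_c, zo_c):
    """Perspective-correct the screen-space ratio t into the world-space ratio alpha.

    zi_c, zo_c: camera-space depths of the inner and outer end voxels (assumed > 0).
    Mirrors Fig. 10: tmp := t*zi_c; alpha := tmp/(tmp + (1-t)*zo_c).
    """
    tmp = t * zi_c
    return tmp / (tmp + (1.0 - t) * zo_c)


def project(p):
    """Pinhole projection with unit focal length (an assumption of this sketch)."""
    x, y, z = p
    return (x / z, y / z)


def hull_alpha(alphas_per_view):
    """Across camera views, the smallest alpha locates the point on the visual hull."""
    return min(alphas_per_view)
```

As a worked case, take camera-space end points (0, 0, 2) and (1, 0, 4) and a true α of 0.25. The intersection (0.25, 0, 2.5) projects to x = 0.1 on a screen segment running from 0 to 0.25, so t = 0.4, and the formula gives 0.8 / (0.8 + 2.4) = 0.25, recovering α.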
Then, taking the smallest of the α values from all camera views, we can locate the intersection point on the visual hull boundary. It should be mentioned that the exact intersection was also proposed in previous work based on a matrix computation [9]. Our technique finds the same intersection with fewer floating-point operations, which allows a marked performance enhancement for nontrivial numbers of cameras and volume resolutions.

3 Implementation results

3.1 Construction of visual hull meshes from input silhouette images

We implemented our GPU algorithm using the CUDA API [12], and evaluated its performance using several example images, generated by the real-time 3D modeling system at the Electronics and Telecommunications Research Institute in Korea. Fig. 11 shows the visual hull models constructed from three representative test datasets, named Woo, Bboy, and Girl, respectively. Each dataset consists of 20 binary silhouette images of 1,280 × 720 pixels with the corresponding camera calibration parameters.

Fig. 11 Visual hull models created from three test datasets: Woo, Bboy, and Girl (from left to right).

Table 1 shows statistics measured on an NVIDIA GeForce GTX 580 GPU with 1.5 GB of graphics memory with respect to three different volume resolutions, where each row reveals the size of the produced triangular mesh, represented in the indexed face set, and the total GPU time along with the relative overhead of the three computational phases. Here, the figure in parentheses in the Vertices column denotes the value of three times the number of faces divided by the number of vertices, which indicates the degree of vertex redundancy in the simple mesh representation that simply enumerates the vertices of triangles. Interestingly, the observed ratios, including those from our marching cubes implementation (refer to Table 3), were quite consistent, and the removal of vertex duplication in the mesh representation resulted in the time- and space-efficient GPU implementation. The timing results show that the computation time taken by our method greatly depends on the voxel resolution.
When a lower-resolution volume grid was selected, the first phase for extracting a smooth silhouette contour took a relatively significant portion of the time. However, as the volume resolution increased, the GPU implementation became dominated by the second phase since the numbers of voxels and cells to be processed increased with the cube of the volume resolution. On the other hand, the third phase of finding the exact vertex locations consumed only a moderate amount of time because of our efficient calculation framework that only processed a relatively small number of boundary cells. In particular, launching one CUDA thread per unique temporary vertex in the compacted vertex list avoided unnecessary thread divergence, leading to an increase in the GPU occupancy during the vertex coordinate calculation.

Recall that the two major computations carried out in the first phase are the application of the Gaussian filter and the extraction of the piecewise-linear silhouette contour, which implies that the computation time of this phase is basically dependent on the image resolution and the complexity of the contour curve. As can be seen in the timings in the Phase I columns of Tables 1 and 2, relatively small amounts of time were spent applying the 7 × 7 Gaussian filter to 20 binary images of 1,280 × 720 pixels thanks to our GPU accelerated filtering scheme that applies the separated convolution only to the pixels neighboring the silhouette contour. In particular, when a nontrivial volume resolution (e.g. 256 × 256 × 256) was chosen, the additional cost for the Gaussian filtering was small compared to the entire computation time. It should be mentioned that, while the visual hull construction speed was slightly improved without the Gaussian filtering, the gain only came with unsightly bumpy artifacts on the visual hull surfaces. When no smoothing filter was applied, the extracted silhouette contour almost coincided with the boundary of the binary silhouette.
Hence, the aliases on the surfaces became unavoidable, although the exact intersections between the respective silhouette cones and the edges of the 3D cells were computed in the third phase. Fig. 12 compares the outputs produced for the volume resolution of 256 × 256 × 256 voxels without and with the Gaussian filter applied, in which we clearly observed that the undesirable surface effects were nicely smoothed out through the refined voxel classification. We also observed that the elaborate extraction of the piecewise-linear silhouette contours created fewer visual artifacts when the reconstructed objects were rendered with various texture images.

Table 1 Performance statistics on our GPU accelerated visual hull construction method. Three different datasets, each made of 20 binary silhouette images of 1,280 × 720 pixels, were tested with respect to three volume resolutions, where a 7 × 7 Gaussian filter was applied in Phase I. The numbers of vertices and faces of the generated triangular meshes represented in the indexed face set format are shown. The figures in parentheses in the Vertices column denote the ratios 3*(# of faces)/(# of vertices), indicating the degree of redundancy of vertices in the simple mesh representation, in which the vertex coordinates of extracted triangles are simply enumerated with the same vertex repeated. The phase-by-phase dissection of computation times required by the GPU is also provided. Times are in ms.

    Dataset  Resolution  Vertices         Faces     Phase I         Phase II        Phase III     Total
    Woo      64^3        3,222 (6.004)    6,448     12.03 (74.9%)   3.12 (19.4%)    0.92 (5.7%)   16.07
    Woo      128^3       13,118 (6.000)   26,236    12.12 (60.7%)   6.78 (34.0%)    1.06 (5.3%)   19.96
    Woo      256^3       53,194 (5.999)   106,376   12.04 (29.5%)   26.83 (65.7%)   1.97 (4.8%)   40.84
    Bboy     64^3        2,986 (5.996)    5,968     12.43 (75.9%)   2.96 (18.1%)    0.99 (6.0%)   16.38
    Bboy     128^3       12,302 (6.000)   24,604    12.40 (61.5%)   6.65 (33.0%)    1.12 (5.5%)   20.17
    Bboy     256^3       50,010 (6.000)   100,016   12.41 (29.7%)   27.44 (65.7%)   1.92 (4.6%)   41.77
    Girl     64^3        2,290 (6.000)    4,580     11.55 (78.8%)   2.25 (15.4%)    0.85 (5.8%)   14.65
    Girl     128^3       9,368 (6.000)    18,736    11.48 (61.9%)   6.12 (33.0%)    0.95 (5.1%)   18.55
    Girl     256^3       38,104 (6.000)   76,208    11.53 (29.8%)   25.70 (66.4%)   1.47 (3.8%)   38.70

Table 2 Performance statistics without a Gaussian filter applied. As can be seen from the computation times taken by the first phases of the two implementations with and without the application of the Gaussian filter, the computational burden of the Gaussian filtering was alleviated significantly through the parallel convolution computation only on the border regions of the silhouette contours. Note that, when no Gaussian filter was applied, somewhat smaller triangular meshes were produced in slightly less time, but only at the expense of ugly-looking artifacts on the surfaces of the visual hull objects. Times are in ms.

    Dataset  Resolution  Vertices         Faces     Phase I         Phase II        Phase III     Total
    Woo      64^3        2,972 (6.004)    5,948     10.98 (74.3%)   2.99 (20.2%)    0.81 (5.5%)   14.78
    Woo      128^3       12,476 (6.000)   24,952    10.98 (60.2%)   6.26 (34.3%)    1.01 (5.5%)   18.25
    Woo      256^3       51,520 (5.998)   103,012   10.97 (27.9%)   26.48 (67.2%)   1.93 (4.9%)   39.38
    Bboy     64^3        2,816 (5.979)    5,612     10.97 (74.7%)   2.90 (19.7%)    0.82 (5.6%)   14.69
    Bboy     128^3       11,660 (5.990)   23,280    10.96 (58.6%)   6.64 (35.5%)    1.1 (5.9%)    18.70
    Bboy     256^3       47,858 (5.996)   95,660    10.97 (27.6%)   26.86 (67.7%)   1.86 (4.7%)   39.69
    Girl     64^3        2,136 (6.000)    4,272     11.00 (78.5%)   2.27 (16.2%)    0.75 (5.3%)   14.02
    Girl     128^3       8,864 (6.001)    17,732    11.00 (60.6%)   6.15 (33.9%)    1.01 (5.5%)   18.16
    Girl     256^3       36,574 (6.000)   73,144    11.01 (29.3%)   25.11 (66.9%)   1.44 (3.8%)   37.56

3.2 Application to the marching cubes algorithm

The presented GPU technique for directly generating compact triangular meshes in the form of an indexed face set can easily be modified to implement the frequently used marching cubes algorithm [10] on the GPU. Given a volumetric dataset and an iso-value, the second phase of our method builds, as before, the triangle index list and the temporary vertex list, initially containing the vertex offset information.
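The triangle index list mentioned here comes out of the sort-and-scan step of Fig. 6. As an illustration only, the following serial Python sketch mimics that step on the CPU. The helper names, the six-way direction ordering used to encode keys, and the use of Python's built-in sort in place of the parallel radix sort are assumptions made for this sketch, not details of the actual implementation.

```python
DIRECTIONS = ["-x", "+x", "-y", "+y", "-z", "+z"]  # assumed ordering, for illustration only


def encode(voxel_id, direction):
    """Pack a temporary vertex (inner voxel ID, axis direction) into one integer key."""
    return voxel_id * len(DIRECTIONS) + DIRECTIONS.index(direction)


def exclusive_scan(values):
    """Serial exclusive prefix sum: out[k] = values[0] + ... + values[k-1]."""
    out, total = [], 0
    for v in values:
        out.append(total)
        total += v
    return out


def build_indexed_face_set(corner_keys):
    """corner_keys[3*i + j] is the key of vertex j of triangle i.

    Returns (vertex_keys, triangles, head, scanned).
    """
    # Sort the corner slots by key (stand-in for the parallel radix sort).
    order = sorted(range(len(corner_keys)), key=lambda c: corner_keys[c])
    sorted_keys = [corner_keys[c] for c in order]
    # Flag the first element (head vertex) of every run of equal keys.
    head = [1 if k == 0 or sorted_keys[k] != sorted_keys[k - 1] else 0
            for k in range(len(sorted_keys))]
    scanned = exclusive_scan(head)
    # Only head vertices contribute an entry to the redundancy-free vertex list.
    vertex_keys = [key for key, h in zip(sorted_keys, head) if h]
    triangles = [[None] * 3 for _ in range(len(corner_keys) // 3)]
    for k, corner in enumerate(order):
        index = head[k] + scanned[k] - 1  # index rule stated in the Fig. 6 caption
        triangles[corner // 3][corner % 3] = index
    return vertex_keys, triangles, head, scanned
```

Fed with the sorted EIDL sequence of Fig. 6, the sketch yields the same 0/1 array and scanned sequence as the figure, and an eight-entry vertex list as in Fig. 8.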
Then, in the following stage, the actual vertex coordinates (optionally with normal coordinates) are generated by simple linear interpolation along the edges of boundary cells, storing them in the corresponding locations of the compact vertex array.

The experimental results in Table 3, obtained with respect to four volumetric datasets (see Fig. 13) with two different volume resolutions, compare our GPU implementation with a conventional implementation which simply classifies boundary cells and lists the vertex and normal coordinates of each triangle extracted from them. Obviously, our implementation is more complicated than the simple implementation because ours must go through an additional GPU stage that builds the indexed face set structure. Interestingly, however, our method turned out to be significantly faster on the NVIDIA GeForce GTX 580 GPU, as demonstrated in the table.

Table 3 Statistics on the two GPU implementations for the marching cubes algorithm. In this table, Ours represents the implementation produced based on the presented GPU technique, while Conventional corresponds to the classic implementation that simply lists the vertex and normal coordinates of extracted triangles with the same vertex information repeated in the representation. Again, the figures in parentheses in the Vertices column indicate the degree of vertex redundancy (refer to Table 1 for an explanation of these values). Time is in ms and memory in MB.

    Dataset       Resolution    Vertices          Faces     Ours: Time  Ours: Memory  Conv.: Time  Conv.: Memory
    Bunny         128^3         33,392 (5.974)    66,491    4.84        1.53          7.22         4.57
    Bunny         256^3         132,883 (5.987)   265,194   15.58       6.08          29.20        18.21
    Armadillo     128^3         23,528 (5.999)    47,052    4.36        1.08          7.18         3.23
    Armadillo     256^3         95,004 (6.000)    190,004   15.50       4.35          28.77        13.05
    Dragon        128^2 × 256   41,473 (5.979)    82,659    7.14        1.90          12.50        5.68
    Dragon        256^2 × 512   167,684 (5.990)   334,796   26.62       7.67          55.94        22.99
    Happy Buddha  128^2 × 256   31,833 (5.990)    63,565    6.87        1.46          12.44        4.37
    Happy Buddha  256^2 × 512   129,815 (5.994)   259,361   26.12       5.94          55.56        17.81

Fig. 12 Effect of the sophisticated voxel classification: (a) Woo without smoothing, (b) Woo with smoothing, (c) Girl without smoothing, (d) Girl with smoothing. To clearly see the effect of the smooth contour extraction carried out in the first phase, the triangular meshes are rendered with flat shading such that the individual triangles are visible.

Fig. 13 Triangular meshes produced by the marching cubes algorithm from four test datasets: (a) Bunny, (b) Armadillo, (c) Dragon, (d) Happy Buddha.

Notice that, in NVIDIA's Fermi architecture, memory operations are issued per warp (32 threads), and it is critical to the performance of CUDA applications to have each warp access global memory in as coalesced a manner as possible. The major difference between the two GPU implementations is the amount and locality of the mesh data that each warp must write in parallel into the global memory of the GPU. In the final stage of the conventional implementation, each thread of a warp writes the coordinate data of all triangles extracted from a boundary cell assigned to it.
Whle the most effcent stuaton for the Ferm archtecture s that each warp requests 32 algned, consecutve 4-byte words wthn a sngle, algned 128 byte-long segment of global memory, there s a hgh probablty that the memory access from the warps are very scattered n the conventonal mplementaton, leadng to a marked performance decrease, snce each trangle consumes 72 bytes (6 floats per vertex and 3 vertces per trangle). n contrast, n the new mplementaton, the global memory access s relatvely less scattered as only the coherent vertex array regon s accessed n ths stage, enablng a better tmng performance n spte of the addtonal computaton for buldng the ndexed face set structure. Ths expermental result strongly mples that constructng compact vsual hull meshes s equally mportant for effcent GPU processng.
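The memory argument above can be checked with a few lines of arithmetic. The sketch below assumes 4-byte floats, 4-byte vertex indices, and 1 MB = $2^{20}$ bytes (none of which is stated explicitly in the text), and it takes the degree of vertex redundancy to be $3F/V$, which is an inference from the tabulated values rather than a quoted definition. Under these assumptions, the Bunny $128^3$ figures of Table 3 (1.53 MB versus 4.57 MB) are reproduced. The helper `segments_touched` is a simplified, hypothetical model of one warp-wide store: it merely counts how many 128-byte segments a set of byte addresses falls into.

```python
BYTES_PER_WORD = 4      # one float or one vertex index (index width is an assumption)
FLOATS_PER_VERTEX = 6   # position plus normal, as stated in the text
SEGMENT_BYTES = 128     # aligned global memory segment size on Fermi


def triangle_soup_bytes(num_faces):
    """Conventional layout: every triangle stores its three vertices (72 bytes)."""
    return num_faces * 3 * FLOATS_PER_VERTEX * BYTES_PER_WORD


def indexed_face_set_bytes(num_vertices, num_faces):
    """Compact layout: shared vertex array plus three indices per triangle."""
    vertex_bytes = num_vertices * FLOATS_PER_VERTEX * BYTES_PER_WORD
    index_bytes = num_faces * 3 * BYTES_PER_WORD
    return vertex_bytes + index_bytes


def vertex_redundancy(num_vertices, num_faces):
    """Average number of triangle corners that share one vertex (assumed 3F/V)."""
    return 3 * num_faces / num_vertices


def to_mb(num_bytes):
    """Convert bytes to MB, assuming 1 MB = 2**20 bytes."""
    return num_bytes / 2**20


def segments_touched(byte_addresses):
    """Number of distinct 128-byte segments hit by one warp-wide store."""
    return len({address // SEGMENT_BYTES for address in byte_addresses})


# Bunny at 128^3 (Table 3): 33,392 vertices and 66,491 faces.
ours_mb = round(to_mb(indexed_face_set_bytes(33392, 66491)), 2)
conventional_mb = round(to_mb(triangle_soup_bytes(66491)), 2)

# 32 threads writing consecutive 4-byte words versus 32 threads each writing
# the first word of its own 72-byte triangle record.
coalesced = segments_touched([4 * lane for lane in range(32)])
strided = segments_touched([72 * lane for lane in range(32)])
```

In this toy model the consecutive pattern stays within a single segment, while the 72-byte stride spreads one store over 18 segments. Real access patterns in the conventional implementation depend on how many triangles each boundary cell emits, so the strided case should be read only as an indication of the trend.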

4 Concluding remarks

We have presented an effective parallel volumetric visual hull construction algorithm that employs a novel technique for creating a compact triangular mesh of the reconstructed visual hull model as well as several optimization techniques for producing a time- and space-efficient GPU implementation. To the best of our knowledge, our computation scheme is the first parallel algorithm that, fully run on the GPU, generates smooth high-resolution visual hull meshes in compact form, based on a refined voxel classification. We have also shown that the presented GPU technique allows an easy modification for another important problem, that is, the GPU implementation of the marching cubes algorithm. Through our experiments, it was demonstrated that, with the current GPU architecture, it is undoubtedly worthwhile to develop GPU schemes that facilitate compact data representation by reducing wasteful data redundancy. Of course, this statement is also true when visual hull meshes are to be constructed on the GPU.

Acknowledgements This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (grant no. 2012R1A1A2008958), and by the strategic technology development program of MCST/MKE/KEIT (Development of Full 3D Reconstruction Technology for Broadcasting Communication Fusion (KI001798)).

References

1. Baumgart, B.: Geometric modeling for computer vision. Ph.D. thesis, Stanford University (1974)
2. Bresenham, J.: Algorithm for computer control of a digital plotter. IBM Systems Journal 4(1), 25–30 (1965)
3. Deriche, R.: Recursively implementing the Gaussian and its derivatives. Unité de Recherche INRIA-Sophia Antipolis, Tech. Rep. No. 1893 (1993)
4. Erol, A., Bebis, G., Boyle, R., Nicolescu, M.: Visual hull construction using adaptive sampling. In: Proc. of the 7th IEEE Works. on Application of Computer Vision, vol. 1, pp. 234–241 (2005)
5. Franco, J.S., Boyer, E.: Exact polyhedral visual hulls. In: Proc. of British Machine Vision Conf., pp. 329–338 (2003)
6. Harris, M.: Parallel prefix sum (scan) with CUDA. In: H. Nguyen (ed.) GPU Gems 3, chap. 39, pp. 851–876. Addison Wesley (2008)
7. Ladikos, A., Benhimane, S., Navab, N.: Efficient visual hull computation for real-time 3D reconstruction using CUDA. In: Proc. of the Conf. on Computer Vision and Pattern Recognition Works., pp. 1–8 (2008)
8. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. PAMI 16(2), 150–162 (1994)
9. Liang, C., Wong, K.Y.: Exact visual hull from marching cubes. In: Proc. of the 3rd Int. Conf. on Computer Vision Theory and Applications, vol. 2, pp. 597–604 (2008)
10. Lorensen, W., Cline, H.: Marching Cubes: A high resolution 3D surface construction algorithm. Proc. of ACM SIGGRAPH 21, 163–169 (1987)
11. Matusik, W., Buehler, C., McMillan, L.: Polyhedral visual hulls for real-time rendering. In: Proc. of the 12th Eurographics Works. on Rendering Techniques, pp. 115–126 (2001)
12. NVIDIA: NVIDIA CUDA C Programming Guide (Version 3.2) (2010)
13. Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for manycore GPUs. In: Proc. of the 2009 IEEE Int. Symp. on Parallel & Distributed Processing, pp. 1–10 (2009)
14. Shujun, Z., Cong, W., Xuqiang, S., Wei, W.: Dream World: CUDA-accelerated real-time 3D modeling system. In: Proc. of the IEEE Int. Conf. on Virtual Environments, Human-Computer Interfaces and Measurement Systems, pp. 168–173 (2009)
15. Slabaugh, G., Culbertson, B., Malzbender, T., Schafer, R.: A survey of methods for volumetric scene reconstruction from photographs. In: Proc. of Int. Works. on Volume Graphics, pp. 81–100 (2001)
16. Waizenegger, W., Feldmann, I., Eisert, P., Kauff, P.: Parallel high resolution real-time visual hull on GPU. In: Proc. of the 16th IEEE Int. Conf. on Image Processing, pp. 4301–4304 (2009)

Appendix: Proof of perspective-corrected ratios

We want to reveal the relation between the points $p_w$, $p^s_w$, $p^o_w$ in the world space and the mapped points $p_s$, $p^s_s$, $p^o_s$ in the screen space (see Fig. 9 again).
Note that the transformation from the normalized image space to the screen space is an affine transformation because it involves only translation, scaling, and possibly shearing. So is the view transformation that converts points from the world space to the camera space. Since affine transformations preserve the ratios of distances along a line, it is enough to consider the mapping between the corresponding points $p_c$, $p^s_c$, $p^o_c$ in the camera space and $p_n$, $p^s_n$, $p^o_n$ in the normalized image space (see Fig. 14).

[Figure: the camera space with axes $x_c$, $y_c$, $z_c$ and points $p_c$, $p^s_c$, $p^o_c$; the normalized image space with axes $x_n$, $y_n$ at distance $f$ and points $p_n$, $p^s_n$, $p^o_n$ separated in the ratios $t$ and $1-t$.]

Fig. 14 Mapping between the camera space and the normalized image space.

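Assume that $p_c = (x_c \;\, y_c \;\, z_c)^t$, $p^s_c = (x^s_c \;\, y^s_c \;\, z^s_c)^t$, and $p^o_c = (x^o_c \;\, y^o_c \;\, z^o_c)^t$, where

\[
p^s_c = \alpha \begin{pmatrix} x^o_c \\ y^o_c \\ z^o_c \end{pmatrix} + (1-\alpha) \begin{pmatrix} x_c \\ y_c \\ z_c \end{pmatrix}.
\]

Similarly, let $p_n = (x_n \;\, y_n)^t$, $p^s_n = (x^s_n \;\, y^s_n)^t$, and $p^o_n = (x^o_n \;\, y^o_n)^t$. When $p^s_c$ is transformed into the normalized image space by multiplying it by the perspective transformation matrix, we get

\[
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} p^s_c =
\begin{pmatrix} f\alpha x^o_c + f(1-\alpha) x_c \\ f\alpha y^o_c + f(1-\alpha) y_c \\ \alpha z^o_c + (1-\alpha) z_c \end{pmatrix},
\]

which, via perspective division, leads to

\[
p^s_n = \frac{f\alpha}{\alpha z^o_c + (1-\alpha) z_c} \begin{pmatrix} x^o_c \\ y^o_c \end{pmatrix}
      + \frac{f(1-\alpha)}{\alpha z^o_c + (1-\alpha) z_c} \begin{pmatrix} x_c \\ y_c \end{pmatrix}.
\]

By the same perspective transformation, we have $(x_n \;\, y_n)^t = \left( \frac{f x_c}{z_c} \;\, \frac{f y_c}{z_c} \right)^t$ and $(x^o_n \;\, y^o_n)^t = \left( \frac{f x^o_c}{z^o_c} \;\, \frac{f y^o_c}{z^o_c} \right)^t$. From these, we obtain

\[
p^s_n = \frac{\alpha z^o_c}{\alpha z^o_c + (1-\alpha) z_c} \begin{pmatrix} x^o_n \\ y^o_n \end{pmatrix}
      + \frac{(1-\alpha) z_c}{\alpha z^o_c + (1-\alpha) z_c} \begin{pmatrix} x_n \\ y_n \end{pmatrix}
      = \frac{\alpha z^o_c}{\alpha z^o_c + (1-\alpha) z_c}\, p^o_n + \frac{(1-\alpha) z_c}{\alpha z^o_c + (1-\alpha) z_c}\, p_n.
\]

With $t$ denoting the weight of $p^o_n$ in $p^s_n = t\, p^o_n + (1-t)\, p_n$ (see Fig. 14), this implies that

\[
t = \frac{\alpha z^o_c}{\alpha z^o_c + (1-\alpha) z_c},
\]

from which we are led to

\[
\alpha = \frac{t z_c}{t z_c + (1-t) z^o_c}.
\]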
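The relation between the camera-space ratio $\alpha$ and the image-space ratio $t$ derived in this appendix can be checked numerically. The following Python sketch uses arbitrary example points and an arbitrary focal length $f$; the function names are hypothetical, and the check covers only the pinhole projection $(x, y, z) \mapsto (fx/z, fy/z)$ assumed in the derivation.

```python
def project(point, f):
    """Pinhole projection of a camera-space point to the normalized image space."""
    x, y, z = point
    return (f * x / z, f * y / z)


def blend(weight, p_far, p_near):
    """Return weight * p_far + (1 - weight) * p_near, component by component."""
    return tuple(weight * a + (1.0 - weight) * b for a, b in zip(p_far, p_near))


def t_from_alpha(alpha, z_c, z_o):
    """Image-space ratio t that corresponds to the camera-space ratio alpha."""
    return alpha * z_o / (alpha * z_o + (1.0 - alpha) * z_c)


def alpha_from_t(t, z_c, z_o):
    """Camera-space ratio alpha that corresponds to the image-space ratio t."""
    return t * z_c / (t * z_c + (1.0 - t) * z_o)


# Arbitrary example values (not taken from the paper).
f = 2.0
p_c = (1.0, 2.0, 4.0)      # plays the role of p_c
p_o = (-3.0, 0.5, 10.0)    # plays the role of p^o_c
alpha = 0.3

p_s = blend(alpha, p_o, p_c)             # p^s_c interpolated in camera space
t = t_from_alpha(alpha, p_c[2], p_o[2])  # predicted ratio in image space

projected_directly = project(p_s, f)
interpolated_in_image = blend(t, project(p_o, f), project(p_c, f))
recovered_alpha = alpha_from_t(t, p_c[2], p_o[2])
```

Here `projected_directly` and `interpolated_in_image` agree up to floating-point rounding, and `recovered_alpha` returns the original value 0.3, which is the round trip between the two ratio formulas.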