Accelerating 3D convolution using streaming architectures on FPGAs

Haohuan Fu, Robert G. Clapp, Oskar Mencer, and Oliver Pell

ABSTRACT

We investigate FPGA architectures for accelerating applications whose dominant cost is 3D convolution, such as modeling and Reverse Time Migration (RTM). We explore different design options, such as using different stencils, fitting multiple stencil operators into the FPGA, processing multiple time steps in one pass, and customizing the computation precisions. The exploration reveals constraints and tradeoffs between different design parameters and metrics. The experiment results show that the FPGA streaming architecture provides great potential for accelerating 3D convolution, and can achieve up to two orders of magnitude speedup.

INTRODUCTION

The oil industry has always been one of the leading consumers of high performance computing systems. With the increase of CPU clock frequencies coming to an end, we can no longer double our computation speed by purchasing updated computers every eighteen months, and we need to adapt to new computation architectures, such as multi-core processors, General Purpose Graphic Processing Units (GPGPUs), and Field Programmable Gate Arrays (FPGAs).

Recent research work has shown that FPGAs can provide a customized solution for a specific application and achieve more than two orders of magnitude speedup compared to a single-core software implementation. Examples include cryptology applications (Cheung et al. 2005), finance and physics simulations (Zhang et al. 2005; Gokhale et al. 2004) as well as seismic computations (Nemeth et al. 2008).

The major difference between an FPGA and other computation platforms is the reconfigurability of the processing and storage units in the device, which enables an FPGA to be configured into arbitrary processing units and circuit structures. The reconfigurability of the FPGA leads to two major advantages over other computation platforms:

(1) A streaming computation architecture.
While CPUs and GPGPUs take in a sequence of instructions that operate on corresponding data in memory, in FPGAs the instructions are mapped into circuit units along the path from input to output. The FPGA then performs the computation by streaming the data items through the circuit units. The streaming architecture makes efficient use of the computation device, as every part of the circuit is performing an operation on one corresponding data item in the data stream.

(2) Customizable number representations. While CPUs and GPGPUs can only handle 8-, 16-, 32- or 64-bit variables, FPGAs support an arbitrary bit width for each variable in the design. By adjusting the bit widths according to the precision requirement, we can often achieve a significant reduction in the silicon area cost of arithmetic units and in the bandwidth requirement between different hardware modules, thus improving the overall throughput of the entire system.

To investigate the FPGA's capability for solving the convolution problem, we explore design options such as: (1) using different stencils; (2) fitting multiple stencil operators into the FPGA; (3) processing multiple time steps in one pass; (4) customizing the computation precisions. The exploration demonstrates constraints and tradeoffs between different design parameters and metrics. Experiment results show that the streaming computation architecture of FPGAs can provide up to two orders of magnitude speedup compared to a single-core software implementation.

STREAMING ARCHITECTURE FOR CONVOLUTION

Target Application

Our target application is a 512 by 512 by 512 finite difference problem, with 6th to 8th order in space and 2nd order in time accuracy. Each time step of the computation takes the current wave-field state, the wave-field state from the previous time step and the velocity model as inputs, and produces the next wave-field state as the output.

FPGA Platform

Current Xilinx FPGAs contain three major categories of resources: (1) reconfigurable logic slices with 6-input lookup tables (LUTs) and flip flops (FFs); (2) DSP48E arithmetic units that can perform 18 × 25-bit multiplications; (3) 36-Kbit Block RAMs (BRAMs) used as local storage or FIFOs.
In our work, we use the Maxeler MAX2 acceleration card, which contains two Virtex-5 LX330T FPGA chips, 12 GB of onboard memory, and a PCI-Express x16 interface to the host PC. Tables 1 and 2 show the resource summary of our current FPGAs and the recently released Virtex-6 SX475T FPGA, and the basic cost for implementing single-precision floating-point units on FPGAs.

FPGA      #LUTs     #FFs      #DSP48Es   #BRAMs
LX330T    207,360   207,360   196        324
SX475T    287,600   595,200   2,016      1,064

Table 1: Resource summary of the Virtex-5 LX330T and Virtex-6 SX475T FPGAs.

Operation   #LUTs   #FFs   #DSP48Es   #BRAMs
+/-         425     557    0          0
×           122     173    2          0

Table 2: Costs for single-precision floating-point units.

Streaming Architectures

Finite difference based convolution operators normally perform multiplications and additions on a number of adjacent points. While the points are neighbors to each other in a 3D geometric perspective, they are often stored relatively far apart in memory. For example, in the 7-point 2D convolution performed on the 2D array shown in Figure 1, data items (0, 3) and (1, 3) are neighbors in the y direction. However, if the array uses row-major storage and has a row size of 512, the storage locations of (0, 3) and (1, 3) will be 512 items (one line) apart. For a 3D array of size 512 × 512 × 512, the neighbors in the z direction will be 512 × 512 items apart. In software implementations, this memory storage pattern can incur a lot of cache misses when the domain gets larger, which decreases the efficiency of the computation.

Figure 1: A streaming example of 2D convolution. [NR]

In an FPGA implementation, we use a streaming architecture that computes one result per cycle. As shown in Figure 1, when we apply the stencil to the data item (3, 3), the circuit requires 13 different values (solid, dark-color), two of which ((0, 3) and (6, 3)) are three lines away from the current data item. As the data items are streamed in one by one, in order to make the values of (0, 3) and (6, 3) available to the circuit, we add a memory buffer that stores all six lines of values from (0, 3) to (6, 3) (illustrated by the checkerboard pattern on the grid). For a row size of 512, this incurs a storage cost of 512 × 6 data items. Similarly, for a 7-point 3D convolution on a 512 × 512 × 512 array, the design requires a buffer for 512 × 512 × 6 data items. Assuming each data item is a single-precision floating-point number, the buffer size amounts to 6 MB for the 512 × 512 × 512 example. The FPGA chip we currently use provides 1.4 MB of potential buffer size, which is not enough to store all the streamed-in values.

We solve this problem by 3D blocking, i.e. dividing the original 3D array into smaller 3D arrays and performing convolution on them separately. 3D blocking reduces the buffer requirement for an FPGA convolution implementation, but at a cost. Given a convolution stencil with ns non-zero lags in each direction, we must send in a (nx + 2·ns) × (ny + 2·ns) × (nz + 2·ns) block to produce an nx × ny × nz output block. As nx, ny, and nz become small, the blocking overhead can dominate. Meanwhile, the initialization cost for setting up the memory address registers and starting the streaming process also increases, as we need to stream multiple blocks.

EXPLORATION OF DESIGN OPTIONS

Different Stencils

Our target application uses a 7-point star stencil (Figure 2(a)) to perform the 8th order finite difference. In our exploration, besides the star stencil, we also consider a 3-by-3-by-3 cube stencil (Figure 2(b)), which performs a 6th order finite difference (Spotz and Carey 1996). In software implementations, the cube and the star stencils provide similar performance. For the FPGA implementations, the resource costs for the star and the cube stencils are different. The upper part of Table 3 shows the straightforward implementations of the star and cube stencils for a 120 × 120 × 120 array.
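As an illustration of how the two stencils differ, the following Python sketch enumerates their offsets and counts the points, the slices that must be buffered along z, and the resulting buffer size. It assumes the star has three lags on each side of the centre along every axis (the 3D analogue of the 13-value 2D example in Figure 1) and that z is the slowest-varying axis. The function names are hypothetical and the sketch is not taken from the FPGA design itself.

```python
from itertools import product


def star_offsets(radius=3):
    """(dz, dy, dx) offsets of a 3D star: the centre plus `radius` lags each way per axis."""
    offsets = {(0, 0, 0)}
    for axis in range(3):
        for d in range(-radius, radius + 1):
            offset = [0, 0, 0]
            offset[axis] = d
            offsets.add(tuple(offset))
    return sorted(offsets)


def cube_offsets(radius=1):
    """(dz, dy, dx) offsets of a dense cube, 3-by-3-by-3 for radius 1."""
    return sorted(product(range(-radius, radius + 1), repeat=3))


def buffered_slices(offsets):
    """Number of z slices between the oldest and newest value the stencil touches."""
    zs = [o[0] for o in offsets]
    return max(zs) - min(zs)


def buffer_bytes(offsets, ny, nx, bytes_per_item=4):
    """Streaming buffer size, assuming whole ny-by-nx slices are held on chip."""
    return buffered_slices(offsets) * ny * nx * bytes_per_item


def coefficient_classes(offsets):
    """Distinct coefficients if a coefficient depends only on |offset| along each axis
    (an assumption made for this sketch)."""
    return len({tuple(abs(c) for c in o) for o in offsets})


for name, offs in (("star", star_offsets()), ("cube", cube_offsets())):
    mib = buffer_bytes(offs, 512, 512) / 2**20
    print(name, len(offs), "points,", buffered_slices(offs), "slices,",
          mib, "MiB,", coefficient_classes(offs), "coefficient classes")
```

Under these assumptions the star touches 19 points and buffers 6 slices (6 MB for 512 × 512 single-precision slices), while the cube touches 27 points and buffers 2 slices. Grouping offsets by their absolute value per axis gives 10 coefficient classes for the star and 8 for the cube, which matches the multiplication counts this section reports for the symmetry-optimized stencils.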
The cube consumes 20% more DSP48E arithmetic units than the star, as it involves more multiplications. Meanwhile, the memory cost (BRAM) of the cube is about one third of that of the star, as the data buffering requirement decreases from 6 slices to 2 slices.

For the FPGA designs, we can reduce the count of arithmetic operations by exploiting the symmetry of the coefficients. For example, in the cube stencil shown in Figure 2(b), the stencil coefficients are the same for the points marked with the same letters, as both the Laplace derivatives and the scaling ratio determined by the sampling rate of the different axes are the same for these points. Therefore, instead of computing a1·c + a2·c + a3·c, we compute (a1 + a2 + a3)·c. Applying this technique, the computation for the cube stencil reduces from 27 multiplications and 26 additions to 8 multiplications and 26 additions, while the star stencil reduces from 19 multiplications and 18 additions to 10 multiplications and 18 additions.

The lower part of Table 3 shows the resource costs for the multiplication-reduced cube and star. While the cost of BRAMs remains the same, the number of DSP48Es reduces significantly for the cube. After the multiplication reduction, the cube consumes much less than the star for both DSP48Es and BRAMs. The star consumes fewer logic slices as it involves fewer additions.

Figure 2: Different 3D stencils: (a) star vs. (b) cube. [NR]

Normal stencils       star    cube
  #slices             5618    7072
  #BRAMs              87      30
  #DSP48Es            50      60

Optimized stencils    star    cube
  #slices             5207    6256
  #BRAMs              87      30
  #DSP48Es            32      18

Table 3: FPGA resource costs of star and cube 3D stencils for 120 × 120 × 120 3D arrays.

As the stencil operator only consumes 8 or 10 multiplications and 26 or 18 additions, the FPGA has the capacity for multiple copies of the stencil operators. Therefore, we have two different ways to improve the performance of the FPGA: (1) using multiple stencil operators to work on multiple data items in parallel; (2) processing multiple time steps in one pass. The following sections discuss these two options in more detail.

Multiple Stencil Operators

To make full use of all the units on an FPGA, we can try to fit as many stencil operators as possible into the chip. For the example shown in Figure 1, instead of processing only (3,3), we can process consecutive data items (such as (3,2), (3,3), and (3,4)) in parallel. However, increasing the number of stencil units does not always improve the overall performance, due to the constraint of the bandwidth between the FPGA and the onboard memories, which is approximately 13 GB/s on our platform. Considering the control overheads, the bandwidth for pure input and output data is around 8 GB/s. When the input streams for the multiple stencil operators approach the saturation point of the memory bandwidth, increasing the number of stencil operators may not improve the performance any more.

Using measured experiment results, we built a software tool that models the costs and performance of various FPGA designs. Figure 3 shows the estimated performance for processing a 512 × 512 × 512 3D convolution using different numbers of computation cores on an FPGA.
The FPGA circuit is running at 125 MHz. The speedup is calculated against a single-core software implementation running on an Intel Xeon 2.0 GHz. Due to the constraint of logic slices, the FPGA can fit six concurrent cube stencils or eight concurrent star stencils. For all the different numbers of stencil operators, the cube provides slightly better performance than the star. Both the cube and the star arrive at the saturation point of around 25x speedup with four stencil operators.
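The saturation behaviour can be approximated with a simple bandwidth model. The Python sketch below assumes each stencil operator reads three single-precision values per cycle (current wave field, previous wave field, velocity) and writes one, and it ignores blocking overlap and any overhead beyond the 8 GB/s usable-bandwidth figure. The function names are hypothetical, and this is not the software tool mentioned above.

```python
def operator_bandwidth_gbs(clock_hz, bytes_per_item=4, streams_in=3, streams_out=1):
    """Memory traffic (GB/s) of one operator producing one result per cycle.

    The stream counts are an assumption of this sketch, based on the inputs and
    output of one time step described under Target Application.
    """
    return clock_hz * (streams_in + streams_out) * bytes_per_item / 1e9


def effective_operators(n_ops, clock_hz, usable_gbs, bytes_per_item=4):
    """Number of operators that can be kept busy before memory bandwidth saturates."""
    per_op = operator_bandwidth_gbs(clock_hz, bytes_per_item)
    return min(n_ops, usable_gbs / per_op)


# 125 MHz clock and about 8 GB/s of usable bandwidth, as stated in the text.
for n_ops in range(1, 9):
    print(n_ops, "operators fitted ->", effective_operators(n_ops, 125e6, 8.0), "kept busy")
```

With these assumptions each operator needs 2 GB/s at 125 MHz, so 8 GB/s of usable bandwidth supports about four operators, in line with the saturation point reported above. In the same model, halving `bytes_per_item` doubles the number of operators that can be fed.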

Figure 3: Speedups for processing a 512 × 512 × 512 3D convolution using multiple stencil operators. [NR]

Figure 4: Basic circuit structure for processing multiple time steps. The input (time step 0) feeds unit 1, which computes slice n of time step 1 from slices n+1, n, and n-1 of time step 0; unit 2 computes slice n-1 of time step 2 from slices n, n-1, and n-2 of time step 1; unit 3 computes slice n-2 of time step 3 from slices n-1, n-2, and n-3 of time step 2 and produces the output. [NR]

Processing Multiple Time Steps

Instead of adding concurrent cores, another strategy is to process multiple time steps in one pass. Figure 4 shows the basic structure of a circuit that processes three time steps in one pass. The three units process three time steps separately, with the output of each unit as the input of the next unit. The example in the figure uses a 3-by-3-by-3 cube stencil. In general, the computation of wave-field data in slice n requires the wave-field data in slices (n + 1), n, and (n - 1) of the previous time step. Therefore, when unit 1 starts processing slice n, unit 2 can start processing slice (n - 1). Meanwhile, unit 2 needs intermediate buffers to store the results for slices (n - 1) and n from unit 1.

An advantage of processing multiple time steps over using multiple stencil operators is that the performance is not constrained by the memory bandwidth, as the unit for each time step gets its inputs from the previous time step and does not consume the memory bandwidth of the FPGA. However, on the data side, as we are doing 3D blocking of the array, processing multiple time steps requires extra data items to start with. Given a convolution stencil with ns non-zero lags in each direction, to process n time steps in one pass for an nx × ny array, we need to start with an array of size (nx + 2·n·ns) × (ny + 2·n·ns). For 10 time steps on a 100 × 100 block, the data overhead is 44% for the cube and 156% for the star. Meanwhile, as the unit at each time step needs to store the results of the previous time step, this approach also increases the requirement for BRAM resources. Therefore, to increase the number of time steps, we need to reduce the blocking size, which increases the cost of streaming overlapping data items and of handling a larger number of streams.

Another advantage of this multiple-time-step architecture is that we can improve the order of time accuracy at relatively small cost. For example, for unit 3 in Figure 4, instead of only getting the previous wave-field data (time step 2) from unit 2, we can take in the wave-field data of time steps 2 and 1 from both units 2 and 1 to achieve 4th order in time accuracy. The cost of improving the time order is the extra buffer to store the wave-field data from unit 1 and the increased number of adders and multipliers.

Figure 5 shows the estimated performance for FPGA convolution designs that process multiple time steps in one pass. The star and the 2nd and 4th order cubes are compared here. For this approach, the cube stencil shows much better performance than the star stencil due to its smaller requirement for BRAM resources (the star needs to buffer six slices for the convolution operation, while the cube only needs to buffer two). Due to the constraint of logic slices, the FPGA can fit eight time steps for the star, and six and five steps for the 2nd and 4th order cubes. The star reaches its peak performance of 11x speedup with four time steps. After that, the performance becomes worse with more time steps.
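The data overhead quoted above follows directly from the padded block size. The short Python sketch below evaluates (nx + 2·n·ns) × (ny + 2·n·ns) relative to nx × ny. The function name is hypothetical, and ns = 1 for the cube and ns = 3 for the star are read off the stencil shapes.

```python
def input_overhead(nx, ny, time_steps, ns):
    """Fraction of extra input items needed to produce an nx-by-ny output block
    after `time_steps` steps of a stencil with `ns` lags on each side."""
    halo = 2 * time_steps * ns
    return (nx + halo) * (ny + halo) / (nx * ny) - 1.0


# Overhead for a 100 x 100 block as the number of time steps per pass grows.
for steps in (1, 2, 4, 6, 8, 10):
    cube = 100 * input_overhead(100, 100, steps, ns=1)
    star = 100 * input_overhead(100, 100, steps, ns=3)
    print(f"{steps:2d} steps: cube {cube:6.1f}%  star {star:6.1f}%")
```

For a 100 × 100 block and 10 time steps this gives 44% for the cube and 156% for the star. For a fixed block size the overhead grows faster than linearly with the number of time steps, which is one reason adding time steps can eventually hurt performance.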
The 2nd order cube stencil increases all the way to 29x speedup with 6 time steps. The 4th order cube achieves 25x speedup with 5 time steps.

Different Precisions

As mentioned above, one of the FPGA's advantages is the support for customizable number representations. Our previous work (Fu et al. 2008) has shown that, in certain cases of seismic computations, reduced precision provides equivalent results within acceptable tolerances. For FPGA designs, reduced precision can significantly reduce the area cost and I/O bandwidth of the design, and multiply the performance by fitting more computation units on the FPGA. Figure 6 shows the performance we can achieve using reduced floating-point precision. With 16-bit floating-point precision, the multiple-stencil approach provides 49x speedup and the multiple-time-step approach provides 46x speedup.
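The effect of a narrower number format can be explored in software before committing to hardware. The Python sketch below uses IEEE 754 half precision (via the `struct` format code `e`) as a stand-in for a 16-bit floating-point format. The format actually used on the FPGA may split exponent and mantissa bits differently, and the sample values and coefficients here are made up for illustration.

```python
import struct


def to_half(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round a Python float to IEEE 754 single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def weighted_sum(values, coeffs, rnd):
    """Multiply-accumulate with every operand and intermediate rounded by `rnd`."""
    acc = 0.0
    for v, c in zip(values, coeffs):
        acc = rnd(acc + rnd(rnd(v) * rnd(c)))
    return acc


# Hypothetical wave-field samples and stencil coefficients.
values = [0.731, -0.215, 0.048, 0.402, -0.119]
coeffs = [-2.5, 1.3333, -0.0833, 1.3333, -0.0833]

reference = weighted_sum(values, coeffs, float)  # double precision, no extra rounding
for name, rnd, fmt in (("single", to_single, "<f"), ("half", to_half, "<e")):
    approx = weighted_sum(values, coeffs, rnd)
    rel_err = abs(approx - reference) / abs(reference)
    print(name, struct.calcsize(fmt), "bytes per item, relative error", rel_err)
```

Whether the resulting error is acceptable depends on the application, as the text notes. Halving the item width also halves the memory traffic per data item, which is the I/O bandwidth benefit referred to above.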

Figure 5: Speedups for different numbers of time steps processed in one pass. [NR]

Figure 6: Speedups for different floating-point precisions. [NR]

ACCELERATION RESULTS

We have implemented the 2nd order cube with 6 time steps and the 4th order cube with 5 time steps on the Maxeler acceleration card. The 2nd order cube processes 6 time steps in 1.383 seconds, and the 4th order cube processes 5 time steps in 1.346 seconds. Compared to the 6.36 seconds needed to process one time step in 2nd order, the 2nd and 4th order cube designs provide 27.5x and 23.5x speedups, slightly lower than our estimated performance.

The speedup discussed so far is achieved by using one FPGA of the acceleration card. The acceleration card contains two FPGAs with the same settings. There is also an inter-FPGA link, which can update the overlapping boundaries between the FPGAs in parallel with the computation performed on the FPGAs. Therefore, by dividing the array into two parts and computing on two FPGAs concurrently, we can get another 2x and achieve up to 55x and 47x speedup in total.

Note that the FPGAs we are using are Xilinx Virtex-5 LX330T chips released several years ago. Projecting our designs onto the recently announced Xilinx Virtex-6 SX475T FPGAs (shown in Table 1), we can fit up to 13 time steps in one FPGA and achieve up to 55x speedup. With two FPGAs working concurrently on an acceleration card, we can achieve up to 110x speedup compared to a single-core CPU version.

CONCLUSIONS

Our exploration of FPGA convolution designs shows that the cube stencil fits the FPGA streaming architecture much better than the star stencil. We especially investigate the architecture that processes multiple time steps in one pass. This approach removes the constraints of the memory bandwidth, and improves the performance at the cost of extra data buffering and streaming overhead. Experiment results show that the FPGA streaming architecture provides great potential for accelerating 3D convolution, and can achieve up to two orders of magnitude speedup.

ACKNOWLEDGMENTS

We would like to thank Maxeler Technologies for providing the hardware device and the Center for Computational Earth and Environmental Science at Stanford University for funding this research.

REFERENCES

Cheung, R., N. Telle, W. Luk, and P. Cheung, 2005, Customisable elliptic curve cryptosystems: IEEE Transactions on VLSI Systems, 13, 1048-1059.

Fu, H., W. Osborne, R. Clapp, and O. Pell, 2008, Accelerating seismic computations on FPGAs: From the perspective of number representations: Presented at the.

Gokhale, M., J. Frigo, C. Ahrens, J. Tripp, and R. Minnich, 2004, Monte Carlo radiative heat transfer simulation on a reconfigurable computer: Proc. FPL, LNCS 3203, 95-104.

Nemeth, T., J. Stefani, W. Liu, R. Dimond, O. Pell, and R. Ergas, 2008, An implementation of the acoustic wave equation on FPGAs: Presented at the.

Spotz, W. and G. Carey, 1996, A high-order compact formulation for the 3D Poisson equation: Numerical Methods for Partial Differential Equations.

Zhang, G., P. Leong, C. Ho, K. Tsoi, C. Cheung, D. Lee, R. Cheung, and W. Luk, 2005, Reconfigurable Acceleration for Monte Carlo based Financial Simulation: Proc. FPT, 215-222.