Research Article Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU


Hindawi Scientific Programming, Volume 2017, Article ID 1205892, 16 pages, https://doi.org/10.1155/2017/1205892

Research Article: Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU

Nhat-Phuong Tran, Myungho Lee, and Sugwon Hong
Department of Computer Science and Engineering, Myongji University, 116 Myongji-ro, Cheoin-gu, Yongin, Gyeonggi-do, Republic of Korea
Correspondence should be addressed to Myungho Lee; myunghol@mju.ac.kr
Received 9 June 2016; Accepted 23 October 2016; Published 16 January 2017
Academic Editor: Basilio B. Fraguela
Copyright 2017 Nhat-Phuong Tran et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Lattice Boltzmann Method (LBM) is a powerful numerical simulation method for fluid flow. With its data-parallel nature, it is a promising candidate for a parallel implementation on a GPU. The LBM, however, is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming computation phase incurs a lot of uncoalesced accesses on the GPU, which affects the overall performance. Furthermore, the main computation kernels of the LBM use a large number of registers per thread, which limits the thread parallelism available at run time due to the fixed number of registers on the GPU. In this paper, we develop a high performance parallelization of the LBM on a GPU by minimizing the overheads associated with the uncoalesced memory accesses while improving the cache locality using the tiling optimization with a data layout change. Furthermore, we aggressively reduce the register uses of the LBM kernels in order to increase the run-time thread parallelism. Experimental results on the Nvidia Tesla K20 GPU show that our approach delivers impressive throughput performance: 1210.63 Million Lattice Updates Per Second (MLUPS).

1. Introduction

The Lattice Boltzmann Method (LBM) is a powerful numerical simulation method for fluid flow, originating from the lattice gas automata methods [1]. The LBM models the fluid flow as consisting of particles moving with random motions. Such particles exchange momentum and energy through the streaming and the collision processes over the discrete lattice grid in discrete time steps. At each time step, the particles move into adjacent cells, which causes collisions with the existing particles in the cells. The intrinsic data-parallel nature of the LBM makes this class of applications a promising candidate for parallel implementation on various High Performance Computing (HPC) architectures, including many-core accelerators such as the Graphic Processing Unit (GPU) [2], the Intel Xeon Phi [3], and the IBM Cell BE [4]. Recently, the GPU is becoming increasingly popular for the HPC server market and in the Top 500 list, in particular. The architecture of the GPU has gone through a number of innovative design changes in the last decade. It is integrated with a large number of cores and multiple threads per core, levels of cache hierarchies, and a large amount (>5 GB) of on-board memory. The peak floating-point throughput performance (flops) of the latest GPU has drastically increased to surpass 1 Tflops for double precision arithmetic [5]. In addition to the architectural innovations, user-friendly programming environments have been developed recently, such as CUDA [5] from Nvidia, OpenCL [6] from the Khronos Group, and OpenACC [7] from a subgroup of the OpenMP Architecture Review Board (ARB). The advanced GPU architecture and the flexible programming environments have made possible innovative performance improvements in many application areas.
In this paper, we develop a high performance parallelization of the LBM on a GPU. The LBM is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming phase of the LBM incurs a lot of uncoalesced accesses on the GPU and affects the overall performance. Previous research focused on utilizing the shared memory of the GPU to deal with the problem [1, 8, 9]. In this paper, we use the tiling algorithm along with a data layout change in order to minimize the overheads

of the uncoalesced accesses and improve the cache locality as well. The computation kernels of the LBM involve a large number of floating-point variables, thus using a large number of registers per thread. This limits the available thread parallelism generated at run time, as the total number of registers on the GPU is fixed. We developed techniques to aggressively reduce the register uses of the kernels in order to increase the available thread parallelism and the occupancy on the GPU. Furthermore, we developed techniques to remove the branch divergence. Our parallel implementation using CUDA shows impressive performance results. It delivers up to 1210.63 Million Lattice Updates Per Second (MLUPS) throughput performance and a 136-time speedup on the Nvidia Tesla K20 GPU compared with a serial implementation.

The rest of the paper is organized as follows: Section 2 introduces the LBM algorithm. Section 3 describes the architecture of the latest GPU and its programming model. Section 4 explains our techniques for minimizing the uncoalesced accesses and improving the cache locality and the thread parallelism, along with the register usage reduction and the branch divergence removal. Section 5 shows the experimental results on the Nvidia Tesla K20 GPU. Section 6 explains the previous research on parallelizing the LBM. Section 7 wraps up the paper with conclusions.

2. Lattice Boltzmann Method

The Lattice Boltzmann Method (LBM) is a powerful numerical simulation method for fluid flow. It is derived as a special case of the lattice gas cellular automata (LGCA) to simulate fluid motion. The fundamental idea is that a fluid can be regarded as consisting of a large number of small particles moving with random motions. These particles exchange momentum and energy through particle streaming and particle collision. The physical space of the LBM is discretized into a set of uniformly spaced nodes (lattice). At each node, a discrete set of velocities is defined for the propagation of the fluid molecules. The velocities are referred to as microscopic velocities, denoted by \( \vec{e}_i \). The LBM model which has n dimensions and q velocity vectors at each lattice point is represented as DnQq. Figure 1 shows a typical lattice node of the most common model in 2D (D2Q9), which has 9 two-dimensional velocity vectors. In this paper, however, we consider the D3Q19 model, which has 19 three-dimensional velocity vectors. Figure 2 shows a typical lattice node of the D3Q19 model with the 19 velocities \( \vec{e}_i \) defined by

\[
\vec{e}_i =
\begin{cases}
(0, 0, 0), & i = 0 \\
(\pm 1, 0, 0), (0, \pm 1, 0), (0, 0, \pm 1), & i = 2, 4, 6, 8, 9, 14 \\
(\pm 1, \pm 1, 0), (0, \pm 1, \pm 1), (\pm 1, 0, \pm 1), & i = 1, 3, 5, 7, 10, 11, 12, 13, 15, 16, 17, 18.
\end{cases}
\tag{1}
\]

Each particle on the lattice is associated with a discrete distribution function, called the particle distribution function (pdf), \( f_i(\vec{x}, t), i = 0, \ldots, 18 \). The LB equation is discretized as follows:

\[
f_i(\vec{x} + c\,\vec{e}_i\,\Delta t,\ t + \Delta t) = f_i(\vec{x}, t) - \frac{1}{\tau}\left[ f_i(\vec{x}, t) - f_i^{(eq)}\big(\rho(\vec{x}, t), \vec{u}(\vec{x}, t)\big) \right]
\tag{2}
\]

where c is the lattice speed and \( \tau \) is the relaxation parameter.

The macroscopic quantities are the density \( \rho \) and the velocity \( \vec{u}(\vec{x}, t) \). They are defined as

\[
\rho(\vec{x}, t) = \sum_{i=0}^{18} f_i(\vec{x}, t)
\tag{3}
\]

\[
\vec{u}(\vec{x}, t) = \frac{1}{\rho} \sum_{i=0}^{18} c\, f_i\, \vec{e}_i
\tag{4}
\]

The equilibrium function \( f_i^{(eq)}(\rho(\vec{x}, t), \vec{u}(\vec{x}, t)) \) is defined as

\[
f_i^{(eq)}\big(\rho(\vec{x}, t), \vec{u}(\vec{x}, t)\big) = \omega_i\,\rho + \rho\, s_i\big(\vec{u}(\vec{x}, t)\big)
\tag{5}
\]

where

\[
s_i(\vec{u}) = \omega_i \left[ \frac{3}{c^2}\,(\vec{e}_i \cdot \vec{u}) + \frac{9}{2 c^4}\,(\vec{e}_i \cdot \vec{u})^2 - \frac{3}{2 c^2}\,\vec{u} \cdot \vec{u} \right]
\tag{6}
\]

and the weighting factor \( \omega_i \) has the following values:

\[
\omega_i =
\begin{cases}
\dfrac{1}{3}, & i = 0 \\
\dfrac{1}{18}, & i = 2, 4, 6, 8, 9, 14 \\
\dfrac{1}{36}, & i = 1, 3, 5, 7, 10, 11, 12, 13, 15, 16, 17, 18.
\end{cases}
\tag{7}
\]

Algorithm 1 summarizes the LBM algorithm, which executes a loop over a number of time steps.
At each iteration, two computation steps are applied:

(i) Streaming (or propagation) phase: the particles move according to the pdf into the adjacent cells.

(ii) Collision phase: the particles collide with other particles streaming into this cell from different directions.

Algorithm 1 (Algorithm of the LBM):
(1) Step 1: Initialize the macroscopic quantities: density ρ, velocity u, the distribution function f_i, and the equilibrium function f_i^(eq) in the direction of e_i
(2) Step 2: Streaming phase: move the f_i values to the adjacent cells
(3) Step 3: Calculate the density ρ and the velocity u from f_i using Equations (3) and (4)
(4) Step 4: Calculate the equilibrium function f^(eq) using Equation (5)
(5) Step 5: Collision phase: calculate the updated distribution function f_i = f_i − (1/τ)(f_i − f_i^(eq)) using Equation (2)
(6) Repeat Steps 2 to 5 timeSteps times

Figure 1: Lattice cell with 9 discrete directions in the D2Q9 model.
Figure 2: Lattice cell with 19 discrete directions in the D3Q19 model.

Depending on whether the streaming phase precedes or follows the collision phase, we have the pull or the push scheme in the update process [10]. The pull scheme (Figure 3) pulls the post-collision values of the previous time step from lattice A and then performs the collision on these to produce the new pdfs, which are stored in lattice B. In the push scheme (Figure 4), on the other hand, the pdfs of one node (square with black arrows) are read from lattice A; then the collision step is performed first. The post-collision values are propagated to the neighbor nodes in the streaming step to lattice B (red arrows). Table 1 compares the computation steps of these schemes.

Table 1: Pull and push schemes.
Pull scheme: read the distribution functions from the adjacent cells, f_i(x − c e_i Δt, t − Δt); calculate ρ, u, and f^(eq); update the values to the current cell, f_i(x, t).
Push scheme: read the distribution functions from the current cell, f_i(x, t); calculate ρ, u, and f^(eq); update the values to the adjacent cells, f_i(x + c e_i Δt, t + Δt).

Figure 3: Illustration of the pull scheme with the D2Q9 model [11].
Figure 4: Illustration of the push scheme with the D2Q9 model [11].
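Before turning to the implementation skeleton, the following is a minimal sketch of the per-cell arithmetic of Equations (3)-(7) for the D3Q19 model. The function and array names are illustrative assumptions, not the paper's actual code; e and w are the discrete velocities and weights of Equations (1) and (7), pdf holds the 19 f_i of one cell.

/* Minimal per-cell sketch of Equations (3)-(7), D3Q19 (illustrative only). */
void collide_cell(double pdf[19], const double e[19][3],
                  const double w[19], double c, double tau)
{
    double rho = 0.0, ux = 0.0, uy = 0.0, uz = 0.0;

    for (int i = 0; i < 19; i++) {              /* Equations (3) and (4) */
        rho += pdf[i];
        ux  += c * pdf[i] * e[i][0];
        uy  += c * pdf[i] * e[i][1];
        uz  += c * pdf[i] * e[i][2];
    }
    ux /= rho;  uy /= rho;  uz /= rho;

    double usq = ux * ux + uy * uy + uz * uz;
    for (int i = 0; i < 19; i++) {              /* Equations (5) and (6) */
        double eu  = e[i][0] * ux + e[i][1] * uy + e[i][2] * uz;
        double si  = w[i] * (3.0 * eu / (c * c)
                             + 4.5 * eu * eu / (c * c * c * c)
                             - 1.5 * usq / (c * c));
        double feq = w[i] * rho + rho * si;
        pdf[i] -= (pdf[i] - feq) / tau;         /* collision, Equation (2) */
    }
}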

In Algorithm 2, we list the skeleton of the LBM algorithm, which consists of the collision phase and the streaming phase. In the function LBM, the collide function and the stream function are called timesteps times. At the end of each time step, source_grid and dest_grid are swapped to interchange the values between the two grids.

Algorithm 2 (Basic skeleton of the LBM algorithm):
(1) void LBM(double *source_grid, double *dest_grid, int grid_size, int timesteps)
(2) {
(3)   int i;
(4)   double *temp_grid;
(5)   for (i = 0; i < timesteps; i++)
(6)   {
(7)     collide(source_grid, temp_grid, grid_size);
(8)     stream(temp_grid, dest_grid, grid_size);
(9)     swap_grid(source_grid, dest_grid);
(10)  }
(11) }

3. Latest GPU Architecture

Recently, many-core accelerator chips are becoming increasingly popular for HPC applications. The GPU chips from Nvidia and AMD are representative ones, along with the Intel Xeon Phi. The latest GPU architecture is characterized by a large number of uniform fine-grain programmable cores or thread processors, which have replaced the separate processing units for shader, vertex, and pixel in the earlier GPUs. Also, the clock rate of the latest GPU has ramped up significantly. These have drastically improved the floating-point performance of the GPUs, far exceeding that of the latest CPUs. The fine-grain cores (or thread processors) are distributed in multiple streaming multiprocessors (SMX) (or thread blocks) (see Figure 5). Software threads are divided into a number of thread groups (called WARPs), each of which consists of 32 threads. Threads in the same WARP are scheduled and executed together on the thread processors in the same SMX in the SIMD (Single Instruction Multiple Data) mode. Each thread executes the same instruction, directed by the common Instruction Unit, on its own data streaming from the device memory to the on-chip cache memories and registers. When a running WARP encounters a cache miss, for example, the context is switched to a new WARP while the cache miss is serviced for the next few hundred cycles; thus the GPU executes in a multithreaded fashion as well.

Figure 5: Architecture of a latest GPU (Nvidia Tesla K20): thread blocks (streaming multiprocessors) containing thread processors, registers, shared memory, L1 cache, and read-only data cache; an on-chip L2 cache shared by all thread blocks; and the off-chip device memory (global, local, texture, and constant memory).

The GPU is built around a sophisticated memory hierarchy, as shown in Figure 5. There are registers and local memories belonging to each thread processor or core. The local memory is an area in the off-chip device memory. Shared memory, level-1 (L1) cache, and read-only data cache are integrated in a thread block of the GPU. The shared memory is a fast (as fast as registers) programmer-managed memory. The level-2 (L2) cache is integrated on-chip and is used among all the thread blocks. Global memory is an area in the off-chip device memory accessed from all the thread blocks, through which the GPU can communicate with the host CPU. Data in the global memory get cached directly in the shared memory by the programmer, or they can be cached through the L2 and L1 caches automatically as they get accessed. There are also constant memory and texture memory regions in the device memory. Data in these regions are read-only. They can be cached in the L2 cache and the read-only data cache. On the Nvidia Tesla K20, read-only data from the global memory can be loaded through the same cache used by the texture pipeline via a standard pointer, without the need to bind to a texture beforehand. This read-only cache is used automatically by the compiler as long as certain conditions are met; the __restrict__ qualifier should be used when a variable is declared to help the compiler detect the conditions [5].
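As an illustration of the read-only cache path just described, the following is a minimal sketch that marks a kernel's input pointer with const and __restrict__ so the compiler may route its loads through the read-only data cache on the K20. The kernel itself is a toy example of our own, not part of the paper's solver.

// Toy kernel (assumed example): 'in' is read-only for the whole kernel, so the
// 'const ... __restrict__' declaration lets the compiler use the read-only data cache.
__global__ void axpy_kernel(float *out, const float * __restrict__ in,
                            float a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = a * in[idx] + out[idx];
}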
In order to efficiently utilize the latest advanced GPU architectures, programming environments such as CUDA [5] from Nvidia, OpenCL [6] from the Khronos Group, and OpenACC [7] from a subgroup of the OpenMP Architecture Review Board (ARB) have been developed. Using these environments, users can have more direct control over the

large number of GPU cores and their sophisticated memory hierarchy. The flexible architecture and the programming environments have led to a number of innovative performance improvements in many application areas, and many more are still to come.

4. Optimizing Cache Locality and Thread Parallelism

In this section, we first introduce, in Section 4.1, some preliminary steps we employed in our parallelization and optimization of the LBM algorithm. They are mostly borrowed from previous research, such as combining the collision phase and the streaming phase, a GPU architecture friendly data organization scheme (the SoA scheme), an efficient data placement in the GPU memory hierarchy, and using the pull scheme for avoiding and minimizing the uncoalesced memory accesses. Then, we describe our key optimization techniques for improving the cache locality and the thread parallelism, such as the tiling with the data layout change and the aggressive reduction of the register uses per thread, in Sections 4.2 and 4.3. Optimization techniques for removing the branch divergence are presented in Section 4.4. Our key optimization techniques presented in this section have been improved from our earlier work in [13].

4.1. Preliminaries

4.1.1. Combination of Collision Phase and Streaming Phase. As shown in the description of the LBM algorithm in Algorithm 2, the LBM consists of two main computing kernels: collision_kernel for the collision phase and streaming_kernel for the streaming phase. In collision_kernel, threads load the particle distribution functions from the source grid (source_grid) and then calculate the velocity, the density, and the collision product. The post-collision values are stored to the temporary grid (temp_grid). In streaming_kernel, the post-collision values from temp_grid are loaded and updated to the appropriate neighbor grid cells in the destination grid (dest_grid). At the end of each time step, source_grid and dest_grid are swapped for the next time step. This implementation (see Algorithm 3) needs extra loads/stores from/to temp_grid, which is stored in the global memory, and affects the global memory bandwidth [2]. In addition, some extra cost is incurred by the global synchronization between the two kernels (cudaThreadSynchronize), which affects the overall performance.

Algorithm 3 (Two separate CUDA kernels for the different phases of the LBM):
(1) int i;
(2) for (i = 0; i < timesteps; i++)
(3) {
(4)   collision_kernel<<<GRID, BLOCK>>>(source_grid, temp_grid, xdim, ydim, zdim, cell_size, grid_size);
(5)   cudaThreadSynchronize();
(6)   streaming_kernel<<<GRID, BLOCK>>>(temp_grid, dest_grid, xdim, ydim, zdim, cell_size, grid_size);
(7)   cudaThreadSynchronize();
(8)   swap_grid(source_grid, dest_grid);
(9) }

In order to reduce these overheads, we can combine collision_kernel and streaming_kernel into one kernel, lbm_kernel, where the collision product is streamed to the neighbor grid cells directly after calculation (see Algorithm 4). Compared with Algorithm 3, storing to and loading from temp_grid are removed and the global synchronization cost is reduced.

Algorithm 4 (Single CUDA kernel after combining the two phases of the LBM):
(1) int i;
(2) for (i = 0; i < timesteps; i++)
(3) {
(4)   lbm_kernel<<<GRID, BLOCK>>>(source_grid, dest_grid, xdim, ydim, zdim, cell_size, grid_size);
(5)   cudaThreadSynchronize();
(6)   swap_grid(source_grid, dest_grid);
(7) }

4.1.2. Data Organization. In order to represent the 3-dimensional grid of cells, we use a 1-dimensional array which has N_x × N_y × N_z × Q elements, where N_x, N_y, N_z are the width, height, and depth of the grid and Q is the number of directions of each cell [2, 14]. For example, if the model is D3Q19 with N_x = 16, N_y = 16, and N_z = 16, we have a 1D array of 16 × 16 × 16 × 19 = 77824 elements.
We use two separate arrays for storing the source grid and the destination grid. There are two common data organization schemes for storing the arrays:

(i) Array of Structures (AoS): grid cells are arranged in a 1D array; the 19 distributions of each cell occupy 19 consecutive elements of the 1D array (Figure 6).

Figure 6: AoS scheme — the 19 distributions (C, N, S, ..., WT, WB) of cell 0 are stored first, then those of cell 1, and so on, up to cell (16 × 16 × 16) − 1.
Figure 7: SoA scheme — the values of one distribution (e.g., C) of all cells are stored first, then the next distribution (e.g., N), and so on.

(ii) Structure of Arrays (SoA): the values of one distribution of all cells are arranged consecutively in memory (Figure 7). This scheme is more suitable for the GPU architecture, as we will show in the experimental results (Section 5).

4.1.3. Data Placement. In order to efficiently utilize the memory hierarchy of the GPU, the placement of the major data structures of the LBM is crucial [5]. In our implementation, we use the following arrays: src_grid, dst_grid, types_arr, lc_arr, and nb_arr. We use the following data placements for these arrays:

(i) src_grid and dst_grid are used to store the input grid and the result grid. They are swapped at the end of each time step by exchanging their pointers instead of explicitly storing to and loading from the memory through a temporary array. Since src_grid and dst_grid are very large arrays with a lot of data stores and loads, we place them in the global memory.

(ii) The types_arr array stores the types of the grid cells. We use the Lid-Driven Cavity (LDC) as the test case in this paper. The LDC consists of a cube filled with the fluid. One side of the cube serves as the acceleration plane by sliding constantly. The acceleration is implemented by assigning the cells in the acceleration area a constant velocity. This requires three types of cells: regular fluid, acceleration cells, or boundary. Thus, we also need a 1D array, types_arr, in order to store the type of each cell in the grid. The size of this array is N_x × N_y × N_z elements. For example, if the model is D3Q19 with N_x = 16, N_y = 16, and N_z = 16, the size of the array is 16 × 16 × 16 = 4096 elements. Thus, types_arr is also a large array, and it contains constant values that are not modified throughout the execution of the program. For these reasons, the texture memory is the right place for this array.

(iii) lc_arr and nb_arr are used to store the base indices for accesses to the 19 directions of the current cell and the neighbor cells, respectively. There are 19 indices corresponding to the 19 directions of the D3Q19 model. These indices are calculated at the start of the program and used until the end of the program execution. Thus, we use the constant memory to store them. Standing at any cell, we use the following formulas to define the position in the 1D array of any of the 19 cell directions: curr_dir_pos_in_arr = cell_pos + lc_arr[direction] (for the current cell) and nb_dir_pos_in_arr = nb_pos + nb_arr[direction] (for the neighbor cells). A small indexing sketch is given below.
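The following sketch contrasts the AoS and SoA addressing just described and shows how the precomputed base offsets in lc_arr might be applied. The macro names and the direction-major SoA stride are illustrative assumptions, not the paper's exact code.

// Illustrative indexing sketch (assumed names); example sizes from Section 4.1.2.
enum { NX = 16, NY = 16, NZ = 16 };
// AoS: the 19 pdfs of a cell are contiguous.
#define AOS_IDX(cell, dir)  ((cell) * 19 + (dir))
// SoA: all cells' values of one direction are contiguous.
#define SOA_IDX(cell, dir)  ((dir) * (NX * NY * NZ) + (cell))

// With SoA, lc_arr[dir] can simply hold dir * NX * NY * NZ, computed once on the
// host and copied to constant memory, so a pdf is addressed as cell_pos + lc_arr[dir].
__constant__ int lc_arr[19];

__device__ float load_pdf(const float * __restrict__ src_grid,
                          int cell_pos, int dir)
{
    return src_grid[cell_pos + lc_arr[dir]];   // coalesced across a warp of adjacent cells
}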

4.1.4. Using the Pull Scheme to Reduce the Costs of Uncoalesced Accesses. Coalescing the global memory accesses can significantly reduce the memory overheads on the GPU. Multiple global memory loads whose addresses fall within a 128-byte range are combined into one request and sent to the memory. This saves a lot of memory bandwidth and improves the performance. In order to reduce the costs of the uncoalesced accesses, we use the pull scheme [12]. Choosing the pull scheme comes from the observation that the cost of uncoalesced reading is smaller than the cost of uncoalesced writing.

Algorithm 5 shows the LBM algorithm using the push scheme. At the first step, the pdfs are copied directly from the current cell. These pdfs are used to calculate the pdfs at the new time step (collision phase). The new pdfs are then streamed to the adjacent cells (streaming phase).

Algorithm 5 (Kernel using the push scheme):
(1) __global__ void soa_push_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2) {
(3)   Gather 19 pdfs from the current cell
(4)
(5)   Apply boundary conditions
(6)
(7)   Calculate the mass density ρ and the velocity u
(8)
(9)   Calculate the local equilibrium distribution functions f^(eq) using ρ and u
(10)
(11)  Calculate the pdfs at the new time step
(12)
(13)  Stream 19 pdfs to the adjacent cells
(14) }

Algorithm 6 (Kernel using the pull scheme):
(1) __global__ void soa_pull_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2) {
(3)   Stream 19 pdfs from the adjacent cells to the current cell
(4)
(5)   Apply boundary conditions
(6)
(7)   Calculate the mass density ρ and the velocity u
(8)
(9)   Calculate the local equilibrium distribution functions f^(eq) using ρ and u
(10)
(11)
(12)  Calculate the pdfs at the new time step
(13)
(14)  Save the 19 pdf values to the current cell
(15) }

In the streaming phase of the push scheme, the distribution values are updated to the neighbors after they are calculated. All distribution values which do not move in the east or west direction (x-direction value equal to 0) can be updated to the neighbors (written to the device memory) directly without any misalignment. However, the other distribution values (x-direction values equal to +1 or −1) need to be considered carefully because of their misaligned update positions. The update positions are shifted to memory locations that do not belong to the 128-byte segment, while the thread indexes are not shifted correspondingly. So misaligned accesses occur and the performance can degrade significantly.

If we use the pull scheme, on the other hand, the order of the collision phase and the streaming phase in the LBM kernel is reversed (see Algorithm 6). At the first step of the pull scheme, the pdfs from the adjacent cells are gathered to the current cell (streaming phase) (Lines 3-5). Next, these pdfs are used to calculate the pdfs at the new time step, and these new pdfs are then stored to the current cell directly (collision phase). Thus, in the pull scheme, the uncoalesced accesses occur when the data is read from the device memory, whereas they occur when the data is written in the push scheme. As a result, the cost of the uncoalesced accesses is smaller with the pull scheme.
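The difference can be pictured for the east-moving pdf (x-offset +1) in a short sketch; the direction index E, the value f_new_E, and the grids are assumptions carried over from the earlier indexing sketch, and in a real kernel only one of the two lines appears (push or pull variant).

// Sketch only: where the shifted access lands for the east-moving pdf.
__device__ void east_pdf_example(float *dest_grid,
                                 const float * __restrict__ source_grid,
                                 int cell_pos, int E, float f_new_E)
{
    // Push scheme: shifted STORE - the warp's writes straddle two 128-byte segments.
    dest_grid[lc_arr[E] + cell_pos + 1] = f_new_E;

    // Pull scheme: shifted LOAD - the misalignment moves to the read path, which the
    // paper observes to be cheaper than uncoalesced writes (Section 4.1.4).
    float f_E = source_grid[lc_arr[E] + cell_pos - 1];
    (void)f_E;   // illustration only
}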

4.2. Tiling Optimization with Data Layout Change. In the D3Q19 model of the LBM, as the computations for the streaming and the collision phases are conducted for a certain cell, 19 distribution values which belong to the 19 surrounding cells are accessed. Figure 8 shows the data accesses to the 19 cells when a thread performs the computations for the orange colored cell.

Figure 8: Data accesses for the orange cell in conducting the computations for the streaming and collision phases (planes P+1, P, and P−1 of an N_x × N_y grid).

The 19 cells (18 directions (green cells) plus the current cell in the center (orange cell)) are distributed on three different planes. Let P be the plane containing the current computing (orange) cell, and let P−1 and P+1 be the lower and upper planes, respectively. The P plane contains 9 cells. The P−1 and P+1 planes contain 5 cells each. When the computations for the cell, for example, (x, y, z) = (1, 1, 1), are performed, the following cells are accessed:

(i) P0 plane: (0, 1, 0), (1, 0, 0), (1, 1, 0), (1, 2, 0), (2, 1, 0)
(ii) P1 plane: (0, 0, 1), (0, 1, 1), (0, 2, 1), (1, 0, 1), (1, 1, 1), (1, 2, 1), (2, 0, 1), (2, 1, 1), (2, 2, 1)
(iii) P2 plane: (0, 1, 2), (1, 0, 2), (1, 1, 2), (1, 2, 2), (2, 1, 2)

The 9 accesses for the P1 plane are divided into three groups: {(0,0,1), (0,1,1), (0,2,1)}, {(1,0,1), (1,1,1), (1,2,1)}, and {(2,0,1), (2,1,1), (2,2,1)}. Each group accesses consecutive memory locations belonging to the same row. Accesses of different groups are separated apart and lead to uncoalesced accesses on the GPU when N_x is sufficiently large. In each of the P0 and P2 planes, there are three groups of accesses each. Here, too, the accesses of the same group touch consecutive memory locations, while accesses of different groups are separated apart in the memory, which also leads to uncoalesced accesses. Accesses to the data elements in the different planes (P0, P1, and P2) are further separated apart and also lead to uncoalesced accesses when N_y is sufficiently large.

As the computations proceed, three rows in the y-dimension of the P0, P1, P2 planes will be accessed sequentially for x = 0 ... N_x − 1, y = 0, 1, 2, followed by x = 0 ... N_x − 1, y = 1, 2, 3, ..., up to x = 0 ... N_x − 1, y = N_y − 3, N_y − 2, N_y − 1. When the complete P0, P1, P2 planes have been swept, similar data accesses will continue for the P1, P2, and P3 planes, and so on. Therefore, there are a lot of data reuses in the x-, y-, and z-dimensions. As explained in Section 4.1.2, the 3D lattice grid is stored in a 1D array. The 19 cells for the computations belonging to the same plane are stored ±1 or ±N_x ± 1 cells away. The cells in different planes are stored ±N_x·N_y ± N_x ± 1 cells away. The data reuse distance along the x-dimension is short: +1 or +2 loop iterations apart. The data reuse distance along the y- and z-dimensions is ±N_x ± 1 or ±N_x·N_y ± N_x ± 1 iterations apart. If we can make the data reuse occur sooner by reducing the reuse distances, for example, using the tiling optimization, we can greatly improve the cache hit ratio. Furthermore, we can reduce the overheads of the uncoalesced accesses, because many global memory accesses can be removed by the cache hits.

Therefore, we tile the 3D lattice grid into smaller 3D blocks. We also change the data layout in accordance with the data access patterns of the tiled code in order to store the data elements of different groups closer in the memory. Thus we can remove a lot of uncoalesced memory accesses, because the accessed elements can be stored within a 128-byte boundary. In Sections 4.2.1 and 4.2.2, we describe our tiling and data layout change optimizations.

4.2.1. Tiling. Let us assume the following:

(i) N_x, N_y, and N_z are the sizes of the grid in the x-, y-, and z-dimensions.
(ii) n_x, n_y, and n_z are the sizes of the 3D block in the x-, y-, and z-dimensions.
(iii) An xy-plane is a subplane which is composed of (n_x × n_y) cells.
We tile the grid into small 3D blocks with the tile sizes of n_x, n_y, and n_z (the yellow block in Figure 9(a)), where

\[
n_x = \frac{N_x}{x_c}, \qquad
n_y = \frac{N_y}{y_c}, \qquad
n_z = \frac{N_z}{z_c}, \qquad
x_c, y_c, z_c \in \{1, 2, 3, \ldots\}.
\tag{8}
\]

We let each CUDA thread block process one 3D tiled block. Thus n_z xy-planes need to be loaded for each thread block. In each xy-plane, each thread of the thread block executes the computations for one grid cell. Thus each thread deals with a column containing n_z cells (the red column in Figure 9(b)). If z_c = 1, each thread processes N_z cells, and if z_c = N_z, each thread processes only one cell. The tile size can be adjusted by changing the constants x_c, y_c, and z_c. These constants need to be selected carefully to optimize the performance. Using the tiling, the number of created threads is reduced by z_c times. A sketch of the resulting thread-to-cell mapping is given below.

Figure 9: Tiling optimization for the LBM: (a) the grid divided into 3D blocks; (b) a block containing xy-subplanes, where the red column contains all the cells one thread processes.
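A minimal sketch of how such a tiled kernel could be launched, with each thread walking its column of n_z cells, follows. The kernel body, parameter names, and launch constants are illustrative assumptions in the notation of Equation (8), not the paper's exact kernel.

// Illustrative tiled kernel: each thread block covers an (n_x x n_y) tile of one
// 3D block; each thread walks a column of n_z cells in the z-direction.
__global__ void lbm_tiled_kernel(const float * __restrict__ src_grid,
                                 float *dst_grid, int Nx, int Ny, int nz)
{
    int x  = blockIdx.x * blockDim.x + threadIdx.x;   // cell coordinates in x and y
    int y  = blockIdx.y * blockDim.y + threadIdx.y;
    int z0 = blockIdx.z * nz;                         // first z-plane of this tile

    for (int dz = 0; dz < nz; ++dz) {                 // the thread's column of nz cells
        int z    = z0 + dz;
        int cell = x + y * Nx + z * Nx * Ny;          // 1D cell index (Section 4.1.2)
        dst_grid[cell] = src_grid[cell];              // placeholder for stream/collide work
    }
}

// Host-side launch sketch; n_x = 32, n_y = 16, n_z = Nz/4 gave the best results in Section 5.3:
//   dim3 block(32, 16, 1);
//   dim3 grid(Nx / 32, Ny / 16, 4);
//   lbm_tiled_kernel<<<grid, block>>>(src, dst, Nx, Ny, Nz / 4);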

4.2.2. Data Layout Change. In order to further improve the benefits of the tiling and reduce the overheads associated with the uncoalesced accesses, we propose to change the data layout. Figure 10 shows one xy-plane of the grid with and without the layout change. With the original layout (Figure 10(a)), the data is stored in the row-major fashion: the entire first row is stored, followed by the second row, and so on. In the proposed new layout, the cells in the tiled first row of Block 0 are stored first. Then the second tiled row of Block 0 is stored, instead of the first row of Block 1 (Figure 10(b)).

Figure 10: Different data layouts for the blocks: (a) the original data layout; (b) the proposed data layout.

With the layout change, the data cells accessed in consecutive iterations of the tiled code are placed sequentially. This places the data elements of the different groups closer together. Thus, it increases the possibility that the memory accesses to the different groups become coalesced if the tiling factor and the memory layout factor are adjusted appropriately. This can further improve the performance beyond the tiling. The data layout can be transformed using the following formula:

\[
index_{new} = x_d + y_d \cdot N_x + z_d \cdot N_x \cdot N_y
\tag{9}
\]

where x_d and y_d are the cell indexes in the x- and y-dimensions on the plane of the grid and z_d is a value in the range 0 to n_z − 1. x_d and y_d can be calculated as follows:

\[
\begin{aligned}
x_d &= (\text{block index in the } x\text{-dimension}) \times (\text{threads per thread block in the } x\text{-dimension}) + (\text{thread index in the thread block in the } x\text{-dimension}) \\
y_d &= (\text{block index in the } y\text{-dimension}) \times (\text{threads per thread block in the } y\text{-dimension}) + (\text{thread index in the thread block in the } y\text{-dimension})
\end{aligned}
\tag{10}
\]

In our implementation, we use the changed input data layout stored offline before the program starts. (The original input is changed to the new layout and stored to the input file.) Then, the input file is used while conducting the experiments.
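In CUDA terms, Equations (9) and (10) amount to the following small device helper; the function name is ours and the snippet is only a sketch of the index computation, assuming a 2D thread block laid over one xy-plane of the tile.

// Index of the cell handled by the calling thread in the transformed layout,
// for the z_d-th plane of its tile (0 <= z_d < n_z). Follows Equations (9) and (10).
__device__ int transformed_index(int z_d, int Nx, int Ny)
{
    int x_d = blockIdx.x * blockDim.x + threadIdx.x;   // Equation (10)
    int y_d = blockIdx.y * blockDim.y + threadIdx.y;
    return x_d + y_d * Nx + z_d * Nx * Ny;             // Equation (9)
}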

4.3. Reduction of Register Uses per Thread. The D3Q19 model is more precise than models with fewer distributions, such as D2Q9 or D3Q13, and thus it uses more variables. This leads to more register uses in the main computation kernels. On the GPU, the register use of the threads is one of the factors limiting the number of active WARPs on a streaming multiprocessor (SMX). Higher register use can lead to lower parallelism and occupancy (see Figure 11 for an example), which results in overall performance degradation. The Nvidia compiler provides ways to limit the register use to a certain limit, such as the maxrregcount flag or the __launch_bounds__() qualifier [5]. The maxrregcount switch sets a maximum on the number of registers used by each thread. These can help increase the occupancy by reducing the register uses per thread. However, our experiments show that the overall performance goes down, because they lead to a lot of register spills/refills to/from the local memory. The increased memory traffic to/from the local memory and the increased instruction count for accessing the local memory hurt the performance.

Figure 11: Sharing 2048 registers among (a) a larger number of threads with smaller register uses (eight blocks, 64 threads per block, and 4 registers per thread) versus (b) a smaller number of threads with larger register uses (eight blocks, 32 threads per block, and 8 registers per thread).

In order to reduce the register uses per thread while avoiding the register spill/refill to/from the local memory, we used the following techniques:

(i) Calculate the indexing addresses of the distributions manually. Each cell has 19 distributions; thus we would need 38 variables for storing the indexes (19 distributions × 2 memory accesses for load and store) in the D3Q19 model. However, each index variable is used only one time in each execution phase. Thus, we can use only two variables instead of 38: one for calculating the loading indexes and one for calculating the storing indexes.

(ii) Use the shared memory for the variables commonly used among threads, for example, to store the base addresses.

(iii) Cast multiple small-size variables into one large variable: for example, we combined 4 char-type variables into one integer variable (see the sketch at the end of this subsection).

(iv) For simple operations which can be easily calculated, we do not store the results in memory variables. Instead, we recompute them later.

(v) We use only one array to store the distributions instead of using 19 separate arrays.

(vi) In the original LBM code, a lot of variables are declared to store the FP computation results, which increases the register uses. In order to reduce the register uses, we attempt to reuse variables whose lifetime ended earlier in the preceding code. This may lower the instruction-level parallelism of the kernel. However, it helps increase the thread-level parallelism, as more threads can be active at the same time with the reduced register uses per thread.

(vii) In the original LBM code, there are some complicated floating-point (FP) intensive computations used in a number of nearby statements. We aggressively extract these computations as common subexpressions. This frees the registers involved in the common subexpressions, thus reducing the register uses. It also reduces the dynamic instruction count.

Applying the above techniques in our implementation, the number of registers in each kernel is greatly reduced from 70 registers to 40 registers. This leads to higher occupancy of the SMXs and significant performance improvements.
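As an isolated illustration of techniques (i) and (iii), the following fragment shows the two register-saving patterns; the variable names are made up for the example, cell_pos and lc_arr follow the earlier indexing sketch, and this is not the solver's actual code.

// Illustrative register-saving patterns (assumed names, sketch only).
__device__ float register_saving_example(const float * __restrict__ src_grid,
                                         int cell_pos,
                                         unsigned char t0, unsigned char t1,
                                         unsigned char t2, unsigned char t3)
{
    // (iii) Pack four 8-bit flags into one 32-bit register.
    unsigned int packed = (unsigned int)t0
                        | ((unsigned int)t1 << 8)
                        | ((unsigned int)t2 << 16)
                        | ((unsigned int)t3 << 24);     // one register instead of four
    unsigned char t2_back = (unsigned char)((packed >> 16) & 0xFFu);  // unpack on use
    (void)t2_back;

    // (i) One index variable reused for all 19 loads instead of 19 named indexes.
    float rho = 0.0f;
    int idx;
    for (int d = 0; d < 19; ++d) {
        idx = cell_pos + lc_arr[d];   // recomputed per direction; a single live register
        rho += src_grid[idx];         // Equation (3)
    }
    return rho;
}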

4.4. Removing Branch Divergence. Flow control instructions on the GPU cause the threads of the same WARP to diverge; the resulting different execution paths get serialized. This can significantly affect the performance of the application program on the GPU, so branch divergence should be avoided as much as possible. In the LBM code, there are two main problems which can cause branch divergence:

(i) Solving the streaming at the boundary positions
(ii) Defining the actions for the corresponding cell types

In order to avoid using IF-statements while streaming at the boundary positions, a ghost layer is attached in the y- and z-dimensions. If N_x, N_y, and N_z are the width, height, and depth of the original grid, then NN_x = N_x, NN_y = N_y + 1, and NN_z = N_z + 1 are the new width, height, and depth of the grid with the ghost layer (Figure 12). With the ghost layer, we can regard the computations at the boundary positions as normal ones without worrying about running out of the index bound.

Figure 12: Illustration of the lattice with a ghost layer (ghost layer, boundary, and other cells).

As explained in Section 4.1.2, the cells of the grid belong to three types: regular fluid, acceleration cells, or boundary. The LBM kernel contains conditions to define the actions for each type of cell. The boundary cell type can be covered in the above-mentioned way using the ghost layer. This leaves the other two different conditions in the same half WARP of the GPU. Thus, in order to remove the IF-conditions, we combine the conditions into computational statements. Using this technique, the IF-statement in Algorithm 7 is rewritten as in Algorithm 8.

Algorithm 7 (Skeleton of an IF-statement used in the LBM kernel):
(1) IF (cell_type == FLUID)
(2)   x = A;
(3) ELSE
(4)   x = B;

Algorithm 8 (Code with the IF-statement removed):
(1) is_fluid = (cell_type == FLUID);
(2) x = A * is_fluid + B * (!is_fluid);

5. Experimental Results

In this section, we first describe the experimental setup. Then we show the performance results with analyses.

5.1. Experimental Setup. We implemented the LBM in the following five ways:

(i) Serial implementation using a single CPU core (Serial), using the source code from SPEC CPU 2006 470.lbm [15] to make sure it is reasonably optimized
(ii) Parallel implementation on a GPU using the AoS data scheme (AoS)
(iii) Parallel implementation using the SoA data scheme and the push scheme (SoA Push Only)
(iv) Parallel implementation using the SoA data scheme and the pull scheme (SoA Pull Only)
(v) SoA using the pull scheme with our various optimizations, including the tiling with the data layout change (the SoA Pull variants listed in Table 2)

We summarize our implementations in Table 2. We used the D3Q19 model for the LBM algorithm. The domain grid sizes are scaled in the range of 64³, 128³, 192³, and 256³. The numbers of time steps are 1000, 5000, and 10000. In order to measure the performance of the LBM, the Million Lattice Updates Per Second (MLUPS) unit is used, which is calculated as follows:

\[
\text{MLUPS} = \frac{N_x \, N_y \, N_z \, N_{ts}}{10^6 \, T}
\tag{11}
\]

where N_x, N_y, and N_z are the domain sizes in the x-, y-, and z-dimensions, N_ts is the number of time steps used, and T is the run time of the simulation.

Our experiments were conducted on a system incorporating an Intel multicore processor (6-core 2.0 GHz Intel Xeon E5-2650) with a 20 MB level-3 cache and an Nvidia Tesla K20 GPU based on the Kepler architecture with 5 GB of device memory. The OS is CentOS 5.5. In order to validate the effectiveness of our approach over the previous approaches, we have also conducted further experiments on another GPU, the Nvidia GTX285.
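As a worked example of Equation (11) (our own arithmetic, using the peak number reported in Section 5.3): a 256³ domain advanced for 10000 time steps performs 256³ × 10⁴ ≈ 1.678 × 10¹¹ lattice updates, so a throughput of 1210.63 MLUPS corresponds to a run time of roughly

\[
T = \frac{256^3 \times 10^4}{1210.63 \times 10^6} \approx 138.6 \text{ seconds.}
\]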

5.2. Results Using Previous Approaches. The average performances of the Serial and the AoS implementations are shown in Figure 13.

Table 2: Summary of experiments.
Serial: serial implementation on a single CPU core.
Parallel AoS: AoS scheme.
SoA Push Only: SoA scheme + push data scheme.
SoA Pull Only: SoA scheme + pull data scheme.
SoA Pull BR: SoA scheme + pull data scheme + branch divergence removal.
SoA Pull RR: SoA scheme + pull data scheme + register usage reduction.
SoA Pull Full: SoA scheme + pull data scheme + branch divergence removal + register usage reduction.
SoA Pull Full Tiling: SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with data layout change.

Figure 13: Performance (MLUPS) comparison of the Serial and the AoS with different domain sizes.
Figure 14: Performance (MLUPS) comparison of the AoS and the SoA (SoA Push Only) with different domain sizes.
Figure 15: Performance (MLUPS) comparison of the SoA using the push scheme and the pull scheme with different domain sizes.

With the various domain sizes of 64³, 128³, 192³, and 256³, the MLUPS numbers for the Serial implementation are 9.82, 7.42, 9.57, and 8.93 MLUPS. The MLUPS numbers for the AoS are 112.06, 78.69, 76.86, and 74.99 MLUPS, respectively. With these numbers as the baseline, we also measured the SoA performance for the various domain sizes. Figure 14 shows that the SoA significantly outperforms the AoS scheme. The SoA is faster than the AoS by 6.63, 9.91, 10.28, and 10.49 times for the domain sizes 64³, 128³, 192³, and 256³. Note that in this experiment we applied only the SoA scheme without any other optimization techniques.

Figure 15 compares the performance of the pull scheme and the push scheme. For a fair comparison, we did not apply any other optimization techniques to either of the implementations. The pull scheme performs at 797.3 MLUPS, 838.4 MLUPS, 849.8 MLUPS, and 848.37 MLUPS, whereas the push scheme performs at 743.4 MLUPS, 780 MLUPS, 790.16 MLUPS, and 787.13 MLUPS for domain sizes 64³, 128³, 192³, and 256³, respectively. Thus, the pull scheme is better than the push scheme by 6.75%, 6.97%, 7.02%, and 7.2%, respectively. The number of global memory transactions observed shows that the total transactions (loads and stores) of the pull and push schemes are roughly equal. However, the number of store transactions of the pull scheme is 56.2% smaller than that of the push scheme. This leads to the performance improvement of the pull scheme compared with the push scheme.

5.3. Results Using Our Optimization Techniques. In this subsection, we show the performance improvements of our optimization techniques compared with the previous approach, based on the SoA Pull implementation:

(i) Figure 16 compares the average performance of the SoA with and without removing the branch divergences in the kernel code, as explained in Section 4.4. Removing the branch divergence improves the performance by 4.37%, 4.45%, 4.69%, and 5.19% for domain sizes 64³, 128³, 192³, and 256³, respectively.

(ii) Reducing the register uses as described in Section 4.3 improves the performance by 12.07%, 12.44%, 11.98%, and 12.58%, as Figure 17 shows.

Figure 16: Performance (MLUPS) comparison of the SoA with and without branch removal (SoA Pull Only versus SoA Pull BR) for different domain sizes.
Figure 17: Performance (MLUPS) comparison of the SoA with and without reducing the register uses (SoA Pull Only versus SoA Pull RR) for different domain sizes.
Figure 18: Performance (MLUPS) comparison of the SoA with and without the optimization techniques (SoA Pull Only versus SoA Pull Full) for different domain sizes.
Figure 19: Performance (MLUPS) comparison of the SoA Pull Full and the SoA Pull Full Tiling with different domain sizes.

(iii) Figure 18 compares the performance of the SoA using the pull scheme with the optimization techniques of branch divergence removal and register usage reduction described in Sections 4.3 and 4.4 (SoA Pull Full) against the SoA Pull Only. The optimized SoA Pull implementation is better than the SoA Pull Only by 16.44%, 16.89%, 16.68%, and 17.77% for the domain sizes 64³, 128³, 192³, and 256³, respectively.

(iv) Figure 19 shows the performance comparison of the SoA Pull Full and the SoA Pull Full Tiling. The SoA Pull Full Tiling performance is better than the SoA Pull Full by 11.78% to 13.6%. The domain size 128³ gives the best performance improvement of 13.6%, while the domain size 256³ gives the lowest improvement of 11.78%. The experimental results show that the tiling size for the best performance is n_x = 32, n_y = 16, and n_z = N_z/4.

(v) Figure 20 presents the overall performance of the SoA Pull Full Tiling implementation compared with the SoA Pull Only. With all our optimization techniques described in Sections 4.2, 4.3, and 4.4, we obtained a 28% overall performance improvement compared with the previous approach.

(vi) Table 3 compares the performance of four implementations (Serial, AoS, SoA Pull Only, and SoA Pull Full Tiling) with different domain sizes. As shown, the peak performance of 1210.63 MLUPS is achieved by the SoA Pull Full Tiling with domain size 256³, where the speedup of 136 is also achieved.

(vii) Table 4 compares the performance of our work with the previous work conducted by Mawson and Revell [12]. Both implementations were conducted on the same K20 GPU. Our approach performs better than [12] by 14% to 19%. Our approach incorporates

more optimization techniques, such as the tiling optimization with the data layout change and the branch divergence removal, among others, compared with [12].

Table 3: Performance (MLUPS) comparisons of four implementations (Serial, AoS, SoA Pull Only, SoA Pull Full Tiling).
Domain size 64³ — 1000 steps: 9.89, 111.73, 759.52, 1034; 5000 steps: 9.75, 112.24, 814.01, 1115.63; 10000 steps: 9.82, 111.73, 818.36, 1129.32; average: 9.82, 112.41, 797.30, 1092.99.
Domain size 128³ — 1000 steps: 7.64, 78.58, 798.21, 1115.2; 5000 steps: 7.65, 78.74, 855.55, 1189.15; 10000 steps: 6.98, 78.74, 861.42, 1199.33; average: 7.42, 78.69, 838.39, 1167.69.
Domain size 192³ — 1000 steps: 9.47, 76.79, 811.35, 1114.96; 5000 steps: 9.69, 76.89, 866.76, 1185.95; 10000 steps: 9.56, 76.91, 871.39, 1205.48; average: 9.57, 76.86, 849.83, 1168.8.
Domain size 256³ — 1000 steps: 8.99, 74.74, 787.76, 1113.77; 5000 steps: 8.91, 75.09, 873.8, 1182.33; 10000 steps: 8.9, 75.14, 883.56, 1210.63; average: 8.93, 74.99, 848.37, 1168.91.

Table 4: Performance (MLUPS) comparison of our work with previous work [12].
64³ — Mawson and Revell (2014): 914; our work: 1129.
128³ — Mawson and Revell (2014): 990; our work: 1199.
192³ — Mawson and Revell (2014): 1036; our work: 1205.
256³ — Mawson and Revell (2014): 1020; our work: 1210.

Figure 20: Performance (MLUPS) comparison of the SoA Pull Only and the SoA Pull Full Tiling with different domain sizes.
Figure 21: Performance (MLUPS) on the GTX285 with different domain sizes (Serial, AoS, SoA Push Only, SoA Pull Only, SoA Pull RR, SoA Pull BR, SoA Pull Full, and SoA Pull Full Tiling).

(viii) In order to validate the effectiveness of our approach, we conducted more experiments on the other GPU, the Nvidia GTX285. Table 5 and Figure 21 show the average performance of our implementations with domain sizes 64³, 128³, and 160³. (Grid sizes larger than 160³ cannot be accommodated in the device memory of the GTX285.) As shown, our optimization technique, SoA Pull Full Tiling, is better than the previous SoA Pull Only by up to 22.85%. Also, we obtained a 46-time speedup compared with the serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX285 compared with the K20.

Table 5: Performance (MLUPS) on the GTX285.
64³ — Serial 9.82, AoS 45.22, SoA Push Only 240.97, SoA Pull Only 257.36, SoA Pull BR 268.45, SoA Pull RR 283.24, SoA Pull Full 296.73, SoA Pull Full Tiling 328.87.
128³ — Serial 7.42, AoS 49.85, SoA Push Only 242.12, SoA Pull Only 259.04, SoA Pull BR 270.58, SoA Pull RR 285.68, SoA Pull Full 299.79, SoA Pull Full Tiling 335.77.
160³ — Serial 9.57, AoS 50.15, SoA Push Only 237.58, SoA Pull Only 254.18, SoA Pull BR 266.25, SoA Pull RR 279.59, SoA Pull Full 294.26, SoA Pull Full Tiling 328.62.

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. For the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalescing technique is incorporated. Thus, most implementations of the LBM on the GPU use the SoA as the main data organization.

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first approach uses the shared memory. Tölke in [9] used this approach and implemented the D2Q9 model. Habich et al. [8] followed the same approach for the D3Q19 model. Bailey et al. in [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme was used instead of the push scheme, without using the shared memory. As observed, the main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving in the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers. This lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active WARPs (occupancy) of the kernels, thereby hurting the performance. Using the pull scheme, instead, there is no extra synchronization cost incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme leads to generating a larger number of threads, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.

Besides the above approaches, in [12] the new feature of the Tesla K20 GPU, the shuffle instruction, was applied to avoid the misalignment in the streaming phase. However, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory instead of using the shared memory in order to solve the misaligned memory access problem.

There have been some approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. in [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag. However, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using a base index, which forces the compiler to reuse the same register again.

A few different implementations of the LBM have been attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme.
They also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al. [17] suggested an approach based on a single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA Pull Only implementation shown in Section 5.

7. Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. For reducing the high register pressure of the LBM kernels and improving the available thread parallelism generated at run time, we developed techniques for aggressively reducing the register uses of the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU friendly data organization scheme (the so-called SoA scheme), efficient data placement of the major data structures in the GPU memory hierarchy, and adopting a data update scheme (the pull scheme) to further reduce the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.2 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results. It delivers up to 1210.63 MLUPS throughput performance and achieves up to a 136-time speedup compared with a serial implementation running on a single CPU core.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (NRF-2015M3C4A7065662). This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (no. 20154030200770).

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK, 2005.
[2] J. Tölke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, pp. 443-456, 2008.
[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS '13), vol. 18, pp. 551-560, Barcelona, Spain, June 2013.
[4] M. Stürmer, J. Götz, G. Richter, A. Dörfler, and U. Rüde, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5, pp. 1062-1070, 2009.
[5] NVIDIA, CUDA Toolkit Documentation, September 2015, http://docs.nvidia.com/cuda/index.html.
[6] Khronos Group, OpenCL, 2015, https://www.khronos.org/opencl/.
[7] OpenACC-standard.org, OpenACC, March 2012, http://www.openacc.org/.
[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5, pp. 266-272, 2011.
[9] J. Tölke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29-39, 2010.
[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, pp. 910-919, 2006.
[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6, pp. 924-935, 2013.
[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, pp. 2566-2574, 2014.
[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC '15), pp. 315-324, IEEE, Bangalore, India, December 2015.
[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl für Informatik, Würzburg, Germany, 2003.
[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1-17, 2006.
[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP '09), pp. 550-557, IEEE, Vienna, Austria, September 2009.
[17] P. R. Rinaldi, E. A. Dari, M. J. Vénere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, pp. 163-171, 2012.
[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, pp. 3628-3638, 2011.
[19] M. Astorino, J. B. Sagredo, and A. Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53-78, 2012.
