Research Article Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU


Hindawi Scientific Programming, Volume 2017, 16 pages

Research Article: Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU

Nhat-Phuong Tran, Myungho Lee, and Sugwon Hong
Department of Computer Science and Engineering, Myongji University, 116 Myongji-ro, Cheoin-gu, Yongin, Gyeonggi-do, Republic of Korea
Correspondence should be addressed to Myungho Lee; myunghol@mju.ac.kr
Received 9 June 2016; Accepted 23 October 2016; Published 16 January 2017
Academic Editor: Basilio B. Fraguela
Copyright 2017 Nhat-Phuong Tran et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Lattice Boltzmann Method (LBM) is a powerful numerical simulation method for fluid flow. With its data-parallel nature, it is a promising candidate for parallel implementation on a GPU. The LBM, however, is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming computation phase incurs a lot of uncoalesced accesses on the GPU, which affects the overall performance. Furthermore, the main computation kernels of the LBM use a large number of registers per thread, which limits the thread parallelism available at run time because the number of registers on the GPU is fixed. In this paper, we develop a high performance parallelization of the LBM on a GPU by minimizing the overheads associated with the uncoalesced memory accesses while improving the cache locality using the tiling optimization with a data layout change. Furthermore, we aggressively reduce the register uses of the LBM kernels in order to increase the run-time thread parallelism. Experimental results on the Nvidia Tesla K20 GPU show that our approach delivers impressive throughput performance, measured in Million Lattice Updates Per Second (MLUPS).

1. Introduction

The Lattice Boltzmann Method (LBM) is a powerful numerical simulation method for fluid flow, originating from the lattice gas automata methods [1]. The LBM models the fluid flow as consisting of particles moving with random motions. Such particles exchange momentum and energy through the streaming and the collision processes over a discrete lattice grid in discrete time steps. At each time step, particles move into adjacent cells, which causes collisions with the existing particles in those cells. The intrinsic data-parallel nature of the LBM makes this class of applications a promising candidate for parallel implementation on various High Performance Computing (HPC) architectures, including manycore accelerators such as the Graphic Processing Unit (GPU) [2], Intel Xeon Phi [3], and the IBM Cell BE [4]. Recently, the GPU has become increasingly popular in the HPC server market and in the Top 500 list, in particular. The architecture of the GPU has gone through a number of innovative design changes in the last decade. It integrates a large number of cores with multiple threads per core, levels of cache hierarchies, and a large amount (>5 GB) of on-board memory. The peak floating-point throughput (flops) of the latest GPUs has drastically increased, surpassing 1 Tflops for double precision arithmetic [5]. In addition to the architectural innovations, user friendly programming environments have been developed, such as CUDA [5] from Nvidia, OpenCL [6] from the Khronos Group, and OpenACC [7] from a subgroup of the OpenMP Architecture Review Board (ARB). The advanced GPU architecture and the flexible programming environments have made possible innovative performance improvements in many application areas. In this paper, we develop a high performance parallelization of the LBM on a GPU.
The LBM is heavily data intensive and memory bound. In particular, moving the data to the adjacent cells in the streaming phase of the LBM incurs a lot of uncoalesced accesses on the GPU and affects the overall performance. Previous research focused on utilizing the shared memory of the GPU to deal with this problem [1, 8, 9]. In this paper, we use a tiling algorithm along with a data layout change in order to minimize the overheads

of the uncoalesced accesses and improve the cache locality as well. The computation kernels of the LBM involve a large number of floating-point variables and therefore use many registers per thread. This limits the thread parallelism generated at run time, as the total number of registers on the GPU is fixed. We developed techniques to aggressively reduce the register uses of the kernels in order to increase the available thread parallelism and the occupancy on the GPU. Furthermore, we developed techniques to remove the branch divergence. Our parallel implementation using CUDA shows impressive performance results: it delivers high Million Lattice Updates Per Second (MLUPS) throughput and a 136-time speedup on the Nvidia Tesla K20 GPU compared with a serial implementation.

The rest of the paper is organized as follows: Section 2 introduces the LBM algorithm. Section 3 describes the architecture of the latest GPU and its programming model. Section 4 explains our techniques for minimizing the uncoalesced accesses and improving the cache locality and the thread parallelism, along with the register usage reduction and the branch divergence removal. Section 5 shows the experimental results on the Nvidia Tesla K20 GPU. Section 6 reviews previous research on parallelizing the LBM. Section 7 concludes the paper.

2. Lattice Boltzmann Method

The Lattice Boltzmann Method (LBM) is a powerful numerical simulation method for fluid flow. It is derived as a special case of the lattice gas cellular automata (LGCA) to simulate fluid motion. The fundamental idea is that a fluid can be regarded as consisting of a large number of small particles moving with random motions. These particles exchange momentum and energy through particle streaming and particle collision. The physical space of the LBM is discretized into a set of uniformly spaced nodes (the lattice). At each node, a discrete set of velocities is defined for the propagation of the fluid molecules. These velocities are referred to as microscopic velocities and are denoted by $\vec{e}_i$. An LBM model with $n$ dimensions and $q$ velocity vectors at each lattice point is denoted DnQq. Figure 1 shows a typical lattice node of the most common model in 2D (D2Q9), which has 9 two-dimensional velocity vectors. In this paper, however, we consider the D3Q19 model, which has 19 three-dimensional velocity vectors. Figure 2 shows a typical lattice node of the D3Q19 model with the 19 velocities $\vec{e}_i$ defined by

$$\vec{e}_i = \begin{cases} (0,0,0), & i = 0 \\ (\pm 1,0,0), (0,\pm 1,0), (0,0,\pm 1), & i = 2,4,6,8,9,14 \\ (\pm 1,\pm 1,0), (0,\pm 1,\pm 1), (\pm 1,0,\pm 1), & i = 1,3,5,7,10,11,12,13,15,16,17,18 \end{cases} \quad (1)$$

(i) Each particle on the lattice is associated with a discrete distribution function, called the particle distribution function (pdf), $f_i(\vec{x}, t)$, $i = 0, \ldots, 18$. The LB equation is discretized as

$$f_i(\vec{x} + c\vec{e}_i \Delta t,\ t + \Delta t) = f_i(\vec{x}, t) - \frac{1}{\tau}\left[f_i(\vec{x}, t) - f_i^{(eq)}\big(\rho(\vec{x}, t), \vec{u}(\vec{x}, t)\big)\right] \quad (2)$$

where $c$ is the lattice speed and $\tau$ is the relaxation parameter.

(ii) The macroscopic quantities are the density $\rho$ and the velocity $\vec{u}(\vec{x}, t)$, defined as

$$\rho(\vec{x}, t) = \sum_{i=0}^{18} f_i(\vec{x}, t) \quad (3)$$

$$\vec{u}(\vec{x}, t) = \frac{1}{\rho} \sum_{i=0}^{18} c\, f_i\, \vec{e}_i \quad (4)$$

(iii) The equilibrium function $f_i^{(eq)}(\rho(\vec{x}, t), \vec{u}(\vec{x}, t))$ is defined as

$$f_i^{(eq)}\big(\rho(\vec{x}, t), \vec{u}(\vec{x}, t)\big) = \omega_i \rho + \rho\, s_i\big(\vec{u}(\vec{x}, t)\big) \quad (5)$$

where

$$s_i(\vec{u}) = \omega_i \left[ \frac{3}{c^2}(\vec{e}_i \cdot \vec{u}) + \frac{9}{2c^4}(\vec{e}_i \cdot \vec{u})^2 - \frac{3}{2c^2}\, \vec{u} \cdot \vec{u} \right] \quad (6)$$

and the weighting factor $\omega_i$ takes the values

$$\omega_i = \begin{cases} 1/3, & i = 0 \\ 1/18, & i = 2,4,6,8,9,14 \\ 1/36, & i = 1,3,5,7,10,11,12,13,15,16,17,18. \end{cases} \quad (7)$$
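To make equations (5)-(7) concrete, the sketch below evaluates the equilibrium function for one direction of the D3Q19 model. The function and variable names are illustrative rather than taken from the paper's code, and the lattice speed c is assumed to be 1, as is common in LBM implementations.

/* Equilibrium pdf for direction i, following Eqs. (5)-(6) with c = 1. */
double feq(int i, double rho, const double u[3],
           const double e[19][3], const double omega[19])
{
    /* e_i . u and u . u */
    double eu = e[i][0]*u[0] + e[i][1]*u[1] + e[i][2]*u[2];
    double uu = u[0]*u[0] + u[1]*u[1] + u[2]*u[2];
    /* s_i(u) from Eq. (6): 3(e.u) + 4.5(e.u)^2 - 1.5(u.u), scaled by omega_i */
    double s = omega[i] * (3.0*eu + 4.5*eu*eu - 1.5*uu);
    /* f_i^(eq) = omega_i * rho + rho * s_i(u) from Eq. (5) */
    return omega[i]*rho + rho*s;
}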
Algorithm 1 summarizes the LBM algorithm, which executes a loop over a number of time steps. At each iteration, two computation steps are applied:

(i) Streaming (or propagation) phase: the particles move according to the pdf into the adjacent cells.

(ii) Collision phase: the particles collide with other particles streaming into this cell from different directions.

(1) Step 1: Initialize the macroscopic quantities, density ρ and velocity u, the distribution function f_i, and the equilibrium function f_i^(eq) in the direction of e_i
(2) Step 2: Streaming phase: move the pdfs f_i to the adjacent cells
(3) Step 3: Calculate the density ρ and the velocity u from f_i using equations (3) and (4)
(4) Step 4: Calculate the equilibrium function f_i^(eq) using equation (5)
(5) Step 5: Collision phase: calculate the updated distribution function f_i = f_i - (1/τ)(f_i - f_i^(eq)) using equation (2)
(6) Repeat Steps 2 to 5 timeSteps times

Algorithm 1: Algorithm of the LBM.

Figure 1: Lattice cell with 9 discrete directions in the D2Q9 model.

Figure 2: Lattice cell with 19 discrete directions in the D3Q19 model.

Depending on whether the streaming phase precedes or follows the collision phase, we have the pull or the push scheme in the update process [10]. The pull scheme (Figure 3) pulls the post-collision values of the previous time step from lattice A and then performs the collision on these values to produce the new pdfs, which are stored in lattice B. In the push scheme (Figure 4), on the other hand, the pdfs of one node (the square with black arrows) are read from lattice A and the collision step is performed first; the post-collision values are then propagated to the neighbor nodes in the streaming step into lattice B (red arrows). Table 1 compares the computation steps of the two schemes.

Table 1: Pull and push schemes.

Pull scheme:
+ Read the distribution functions from the adjacent cells: f_i(x, t) <- f_i(x - c e_i Δt, t - Δt)
+ Calculate ρ, u, f^(eq)
+ Update the values to the current cell: f_i(x, t)

Push scheme:
+ Read the distribution functions from the current cell: f_i(x, t)
+ Calculate ρ, u, f^(eq)
+ Update the values to the adjacent cells: f_i(x + c e_i Δt, t + Δt)

Figure 3: Illustration of the pull scheme with the D2Q9 model [11].

Figure 4: Illustration of the push scheme with the D2Q9 model [11].

In Algorithm 2, we list the skeleton of the LBM algorithm, which consists of the collision phase and the streaming phase. In the function LBM, the collide function and the stream function are called timeSteps times. At the end of each time step,

source_grid and dest_grid are swapped to interchange the values between the two grids.

(1) void LBM(double *source_grid, double *dest_grid, int grid_size, int timesteps)
(2) {
(3)   int i;
(4)   double *temp_grid;
(5)   for (i = 0; i < timesteps; i++)
(6)   {
(7)     collide(source_grid, temp_grid, grid_size);
(8)     stream(temp_grid, dest_grid, grid_size);
(9)     swap_grid(source_grid, dest_grid);
(10)  }
(11) }

Algorithm 2: Basic skeleton of the LBM algorithm.

3. Latest GPU Architecture

Recently, many-core accelerator chips have become increasingly popular for HPC applications. The GPU chips from Nvidia and AMD are representative ones, along with the Intel Xeon Phi. The latest GPU architecture is characterized by a large number of uniform fine-grain programmable cores or thread processors, which have replaced the separate processing units for shader, vertex, and pixel operations in earlier GPUs. The clock rate of the latest GPUs has also ramped up significantly. These changes have drastically improved the floating-point performance of the GPUs, far exceeding that of the latest CPUs. The fine-grain cores (or thread processors) are distributed over multiple streaming multiprocessors (SMX) (see Figure 5). Software threads are divided into a number of thread groups (called warps), each of which consists of 32 threads. Threads in the same warp are scheduled and executed together on the thread processors of the same SMX in SIMD (Single Instruction Multiple Data) mode. Each thread executes the same instruction, directed by the common instruction unit, on its own data streaming from the device memory into the on-chip cache memories and registers. When a running warp encounters a cache miss, for example, the context is switched to a new warp while the cache miss is serviced for the next few hundred cycles, so the GPU executes in a multithreaded fashion as well.

Figure 5: Architecture of a latest GPU (Nvidia Tesla K20).

The GPU is built around a sophisticated memory hierarchy, as shown in Figure 5. There are registers and local memories belonging to each thread processor or core; the local memory is an area in the off-chip device memory. Shared memory, level-1 (L1) cache, and a read-only data cache are integrated in each streaming multiprocessor. The shared memory is a fast (as fast as registers) programmer-managed memory. The level-2 (L2) cache is integrated on-chip and shared among all the streaming multiprocessors. Global memory is an area in the off-chip device memory accessible from all the streaming multiprocessors, through which the GPU can communicate with the host CPU. Data in the global memory can be cached in the shared memory explicitly by the programmer, or cached through the L2 and L1 caches automatically as it is accessed. There are also constant memory and texture memory regions in the device memory; data in these regions is read-only and can be cached in the L2 cache and the read-only data cache. On the Nvidia Tesla K20, read-only data from the global memory can be loaded through the same cache used by the texture pipeline via a standard pointer, without the need to bind to a texture beforehand. This read-only cache is used automatically by the compiler as long as certain conditions are met; the __restrict__ qualifier should be used when a variable is declared to help the compiler detect these conditions [5].
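As a brief illustration of this point, a kernel can mark its read-only inputs so the compiler may route their loads through the read-only data cache. The kernel name, parameters, and flag encoding below are illustrative only, not the paper's actual kernel.

__global__ void copy_fluid_cells(const float* __restrict__ src,
                                 float* dst,
                                 const unsigned char* __restrict__ flags,
                                 int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    /* Reads of src and flags are eligible for the read-only (texture)
       cache on Kepler because the pointers are const __restrict__;
       stores go to dst through the normal path. */
    if (idx < n && flags[idx] == 0)   /* 0 = regular fluid cell (assumed encoding) */
        dst[idx] = src[idx];
}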
In order to efficiently utilize the latest advanced GPU architectures, programming environments such as CUDA [5] from Nvidia, OpenCL [6] from the Khronos Group, and OpenACC [7] from a subgroup of the OpenMP Architecture Review Board (ARB) have been developed. Using these environments, users can have more direct control over the

large number of GPU cores and the sophisticated memory hierarchy. The flexible architecture and the programming environments have led to a number of innovative performance improvements in many application areas, and many more are still to come.

4. Optimizing Cache Locality and Thread Parallelism

In this section, we first introduce some preliminary steps employed in our parallelization and optimization of the LBM algorithm in Section 4.1. They are mostly borrowed from previous research: combining the collision phase and the streaming phase, a GPU architecture friendly data organization scheme (the SoA scheme), an efficient data placement in the GPU memory hierarchy, and using the pull scheme to avoid and minimize the uncoalesced memory accesses. Then, we describe our key optimization techniques for improving the cache locality and the thread parallelism, namely the tiling with the data layout change and the aggressive reduction of the register uses per thread, in Sections 4.2 and 4.3. Optimization techniques for removing the branch divergence are presented in Section 4.4. The key optimization techniques presented in this section have been improved from our earlier work in [13].

4.1. Preliminaries

4.1.1. Combination of Collision Phase and Streaming Phase. As shown in the description of the LBM algorithm in Algorithm 2, the LBM consists of two main computing kernels: collision_kernel for the collision phase and streaming_kernel for the streaming phase. In collision_kernel, threads load the particle distribution functions from the source grid (source_grid) and then calculate the velocity, the density, and the collision product. The post-collision values are stored to a temporary grid (temp_grid). In streaming_kernel, the post-collision values are loaded from temp_grid and updated to the appropriate neighbor cells in the destination grid (dest_grid). At the end of each time step, source_grid and dest_grid are swapped for the next time step.

(1) int i;
(2) for (i = 0; i < timesteps; i++)
(3) {
(4)   collision_kernel<<<GRID, BLOCK>>>(source_grid, temp_grid, xdim, ydim, zdim, cell_size, grid_size);
(5)   cudaThreadSynchronize();
(6)   streaming_kernel<<<GRID, BLOCK>>>(temp_grid, dest_grid, xdim, ydim, zdim, cell_size, grid_size);
(7)   cudaThreadSynchronize();
(8)   swap_grid(source_grid, dest_grid);
(9) }

Algorithm 3: Two separate CUDA kernels for the different phases of the LBM.

This implementation (see Algorithm 3) needs extra loads/stores from/to temp_grid, which is stored in the global memory and consumes global memory bandwidth [2]. In addition, some extra cost is incurred by the global synchronization between the two kernels (cudaThreadSynchronize), which affects the overall performance. In order to reduce these overheads, we combine collision_kernel and streaming_kernel into one kernel, lbm_kernel, where the collision product is streamed to the neighbor cells directly after it is calculated (see Algorithm 4). Compared with Algorithm 3, the stores to and loads from temp_grid are removed and the global synchronization cost is reduced.

(1) int i;
(2) for (i = 0; i < timesteps; i++)
(3) {
(4)   lbm_kernel<<<GRID, BLOCK>>>(source_grid, dest_grid, xdim, ydim, zdim, cell_size, grid_size);
(5)   cudaThreadSynchronize();
(6)   swap_grid(source_grid, dest_grid);
(7) }

Algorithm 4: Single CUDA kernel after combining the two phases of the LBM.

4.1.2. Data Organization. In order to represent the 3-dimensional grid of cells, we use a 1-dimensional array of N_x * N_y * N_z * Q elements, where N_x, N_y, N_z are the width, height, and depth of the grid and Q is the number of directions of each cell [2, 14]. For example, for the D3Q19 model with N_x = 16, N_y = 16, and N_z = 16, we have a 1D array of 16 * 16 * 16 * 19 = 77,824 elements.
We use two separate arrays for storing the source grid and the destination grid. There are two common data organization schemes for storing the arrays:

(i) Array of structures (AoS): the grid cells are arranged in a 1D array, and the 19 distributions of each cell occupy 19 consecutive elements of the array (Figure 6).

(ii) Structure of arrays (SoA): the values of one distribution of all cells are arranged consecutively in memory (Figure 7). This scheme is more suitable for the GPU architecture, as we will show in the experimental results (Section 5).

Figure 6: AoS scheme.

Figure 7: SoA scheme.
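The addressing difference between the two schemes can be written down directly; the index helpers below are a minimal sketch (the macro names are ours, not the paper's), assuming the 1D layout of Section 4.1.2 with num_cells = N_x * N_y * N_z and Q = 19 directions.

/* AoS: the 19 distributions of one cell are contiguous. */
#define AOS_INDEX(cell, dir)             ((cell) * 19 + (dir))

/* SoA: all cells' values of one distribution are contiguous, so
   consecutive threads (consecutive cells) touch consecutive addresses
   and their global memory accesses can coalesce. */
#define SOA_INDEX(cell, dir, num_cells)  ((dir) * (num_cells) + (cell))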

4.1.3. Data Placement. In order to efficiently utilize the memory hierarchy of the GPU, the placement of the major data structures of the LBM is crucial [5]. In our implementation, we use the following arrays: src_grid, dst_grid, types_arr, lc_arr, and nb_arr, with the following placements:

(i) src_grid and dst_grid store the input grid and the result grid. They are swapped at the end of each time step by exchanging their pointers instead of explicitly storing to and loading from the memory through a temporary array. Since src_grid and dst_grid are very large arrays with many loads and stores, we place them in the global memory.

(ii) The types_arr array stores the types of the grid cells. We use the Lid Driven Cavity (LDC) as the test case in this paper. The LDC consists of a cube filled with the fluid; one side of the cube serves as the acceleration plane by sliding constantly. The acceleration is implemented by assigning the cells in the acceleration area a constant velocity. This setup requires three types of cells: regular fluid, acceleration cells, or boundary. Thus, we need a 1D array, types_arr, of N_x * N_y * N_z elements to store the type of each cell. For example, for the D3Q19 model with N_x = 16, N_y = 16, and N_z = 16, the size of this array is 16 * 16 * 16 = 4,096 elements. types_arr is therefore a large array as well, and it contains constant values that are not modified throughout the execution of the program. For these reasons, the texture memory is the right place for this array.

(iii) lc_arr and nb_arr store the base indices for accesses to the 19 directions of the current cell and the neighbor cells, respectively. There are 19 indices corresponding to the 19 directions of the D3Q19 model. These indices are calculated at the start of the program and used until the end of the program execution, so we store them in the constant memory. For any cell, the position in the 1D array of any of the 19 cell directions is computed as curr_dir_pos_in_arr = cell_pos + lc_arr[direction] (for the current cell) and nb_dir_pos_in_arr = nb_pos + nb_arr[direction] (for the neighbor cells).
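As a small illustration of how these offset tables can be set up and used, the sketch below declares the two 19-entry tables in constant memory and computes the two positions with the formulas above. The identifiers follow the array names in the text, but the initialization mechanism and helper functions are assumptions for a SoA layout, not the paper's exact code.

/* 19 per-direction offsets, computed once on the host and copied with
   cudaMemcpyToSymbol(); the actual values depend on the chosen layout. */
__constant__ int lc_arr[19];
__constant__ int nb_arr[19];

__device__ int current_dir_pos(int cell_pos, int direction)
{
    return cell_pos + lc_arr[direction];   /* position of this cell's pdf */
}

__device__ int neighbor_dir_pos(int nb_pos, int direction)
{
    return nb_pos + nb_arr[direction];     /* position of the neighbor's pdf */
}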

4.1.4. Using the Pull Scheme to Reduce the Cost of Uncoalesced Accesses. Coalescing the global memory accesses can significantly reduce the memory overheads on the GPU. Multiple global memory loads whose addresses fall within a 128-byte range are combined into one request and sent to the memory. This saves a lot of memory bandwidth and improves the performance. In order to reduce the cost of the uncoalesced accesses, we use the pull scheme [12]. Choosing the pull scheme comes from the observation that the cost of uncoalesced reading is smaller than the cost of uncoalesced writing.

Algorithm 5 shows the LBM kernel using the push scheme. In the first step, the pdfs are read directly from the current cell. These pdfs are used to calculate the pdfs at the new time step (collision phase). The new pdfs are then streamed to the adjacent cells (streaming phase). In the streaming phase, the distribution values are updated to the neighbors after they are calculated. All distribution values which do not move to the east or west direction (x-direction component equal to 0) can be updated to the neighbors (written to the device memory) directly without any misalignment. The other distribution values (x-direction component equal to +1 or -1), however, need to be considered carefully because of their misaligned update positions: the update positions are shifted to memory locations that do not belong to the same 128-byte segment, while the thread indexes are not shifted correspondingly. Misaligned accesses therefore occur on the writes, and the performance can degrade significantly.

(1) __global__ void soa_push_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2) {
(3)   Gather the 19 pdfs from the current cell
(4)
(5)   Apply the boundary conditions
(6)
(7)   Calculate the mass density ρ and the velocity u
(8)
(9)   Calculate the local equilibrium distribution functions f^(eq) using ρ and u
(10)
(11)  Calculate the pdfs at the new time step
(12)
(13)  Stream the 19 pdfs to the adjacent cells
(14) }

Algorithm 5: Kernel using the push scheme.

(1) __global__ void soa_pull_kernel(float *source_grid, float *dest_grid, unsigned char *flags)
(2) {
(3)   Stream the 19 pdfs from the adjacent cells to the current cell
(4)
(5)   Apply the boundary conditions
(6)
(7)   Calculate the mass density ρ and the velocity u
(8)
(9)   Calculate the local equilibrium distribution functions f^(eq) using ρ and u
(10)
(11)  Calculate the pdfs at the new time step
(12)
(13)  Save the 19 pdf values to the current cell
(14) }

Algorithm 6: Kernel using the pull scheme.

If we use the pull scheme, on the other hand, the order of the collision phase and the streaming phase in the LBM kernel is reversed (see Algorithm 6). In the first step of the pull scheme, the pdfs of the adjacent cells are gathered into the current cell (streaming phase) (lines 3-5). Next, these pdfs are used to calculate the pdfs at the new time step, and the new pdfs are stored to the current cell directly (collision phase). Thus, in the pull scheme the uncoalesced accesses occur when the data is read from the device memory, whereas in the push scheme they occur when the data is written. As a result, the cost of the uncoalesced accesses is smaller with the pull scheme.

4.2. Tiling Optimization with Data Layout Change. In the D3Q19 model of the LBM, when the computations for the streaming and the collision phases are conducted for a certain cell, 19 distribution values belonging to the 19 surrounding cells are accessed. Figure 8 shows the data accesses to the 19 cells when

a thread performs the computations for the orange colored cell. The 19 cells (18 direction cells (green) plus the current cell in the center (orange)) are distributed over three planes. Let P_i be the plane containing the current computing (orange) cell, and let P_{i-1} and P_{i+1} be the lower and upper planes, respectively. The P_i plane contains 9 of the cells, and the P_{i-1} and P_{i+1} planes contain 5 cells each. When the computations for the cell (x, y, z) = (1, 1, 1) are performed, for example, the following cells are accessed:

(i) P_0 plane: (0, 1, 0), (1, 0, 0), (1, 1, 0), (1, 2, 0), (2, 1, 0)
(ii) P_1 plane: (0, 0, 1), (0, 1, 1), (0, 2, 1), (1, 0, 1), (1, 1, 1), (1, 2, 1), (2, 0, 1), (2, 1, 1), (2, 2, 1)
(iii) P_2 plane: (0, 1, 2), (1, 0, 2), (1, 1, 2), (1, 2, 2), (2, 1, 2)

Figure 8: Data accesses for the orange cell in conducting the computations for the streaming and collision phases.

The 9 accesses in the P_1 plane are divided into three groups, {(0,0,1), (0,1,1), (0,2,1)}, {(1,0,1), (1,1,1), (1,2,1)}, and {(2,0,1), (2,1,1), (2,2,1)}. Each group accesses consecutive memory locations belonging to the same row, while the accesses of different groups are separated apart and lead to uncoalesced accesses on the GPU when N_x is sufficiently large. In each of the P_0 and P_2 planes, there are likewise three groups of accesses: accesses within the same group touch consecutive memory locations, and accesses of different groups are separated apart in memory, which also leads to uncoalesced accesses. Accesses to data elements in the different planes (P_0, P_1, and P_2) are separated even further apart and also lead to uncoalesced accesses when N_y is sufficiently large.

As the computations proceed, three rows in the y-dimension of the P_0, P_1, P_2 planes are accessed sequentially for x = 0..N_x - 1, y = 0, 1, 2, followed by x = 0..N_x - 1, y = 1, 2, 3, and so on, up to x = 0..N_x - 1, y = N_y - 3, N_y - 2, N_y - 1. When the complete P_0, P_1, P_2 planes have been swept, similar data accesses continue for the P_1, P_2, P_3 planes, and so on. Therefore, there is a lot of data reuse in the x-, y-, and z-dimensions. As explained in Section 4.1.2, the 3D lattice grid is stored in a 1D array. The 19 cells involved in the computations that belong to the same plane are stored ±1 or ±N_x ± 1 cells away, and the cells in different planes are stored ±N_x N_y ± N_x ± 1 cells away. The data reuse distance along the x-dimension is short: +1 or +2 loop iterations apart. The data reuse distance along the y- and z-dimensions is ±N_x ± 1 or ±N_x N_y ± N_x ± 1 iterations apart. If we can make the data reuse occur sooner by reducing the reuse distances, for example with the tiling optimization, we can greatly improve the cache hit ratio. Furthermore, this reduces the overheads of the uncoalesced accesses, because many global memory accesses are removed by cache hits. Therefore, we tile the 3D lattice grid into smaller 3D blocks. We also change the data layout in accordance with the data access patterns of the tiled code in order to store the data elements of different groups closer in memory, so that many of the accesses fall within a 128-byte boundary and the corresponding uncoalesced accesses are removed. In Sections 4.2.1 and 4.2.2, we describe our tiling and data layout change optimizations.

4.2.1. Tiling. Let us assume the following:

(i) N_x, N_y, and N_z are the sizes of the grid in the x-, y-, and z-dimension.
(ii) n_x, n_y, and n_z are the sizes of the 3D block in the x-, y-, and z-dimension.
(iii) An xy-plane is a subplane composed of (n_x * n_y) cells.

We tile the grid into small 3D blocks with tile sizes n_x, n_y, and n_z (the yellow block in Figure 9(a)), where

$$n_x = \frac{N_x}{x_c}, \quad n_y = \frac{N_y}{y_c}, \quad n_z = \frac{N_z}{z_c}, \quad x_c, y_c, z_c \in \{1, 2, 3, \ldots\}. \quad (8)$$
We let each CUDA thread block process one 3D tiled block. Thus n_z xy-planes need to be loaded for each thread block. In each xy-plane, each thread of the thread block executes the computations for one grid cell, so each thread deals with a column containing n_z cells (the red column in Figure 9(b)). If z_c = 1, each thread processes N_z cells, and if z_c = N_z, each thread processes only one cell. The tile size can be adjusted by changing the constants x_c, y_c, and z_c; these constants need to be selected carefully to optimize the performance. With the tiling, the number of created threads is reduced by a factor of z_c.

Figure 9: Tiling optimization for the LBM. (a) Grid divided into 3D blocks. (b) Block containing xy-subplanes; the red column contains all cells one thread processes.
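A minimal sketch of this thread mapping is given below: one CUDA thread block per 3D tile, one thread per (x, y) position of the tile, each thread walking its column of N_z/z_c cells. The kernel name, signature, and placeholder update are ours for illustration, not the paper's kernel.

__global__ void tiled_kernel(float* grid_data, int Nx, int Ny, int col_len)
{
    int x  = blockIdx.x * blockDim.x + threadIdx.x;  // global x of this thread
    int y  = blockIdx.y * blockDim.y + threadIdx.y;  // global y of this thread
    int z0 = blockIdx.z * col_len;                   // first z of this tile
    for (int k = 0; k < col_len; k++) {
        int z = z0 + k;
        int cell = x + y * Nx + z * Nx * Ny;         // 1D cell index (Section 4.1.2)
        grid_data[cell] += 0.0f;                     // placeholder for the real LBM update
    }
}

// Launch with tile sizes n_x = Nx/xc, n_y = Ny/yc, n_z = Nz/zc (Eq. (8)):
//   dim3 block(Nx / xc, Ny / yc);   // one thread per cell of an xy-subplane
//   dim3 grid(xc, yc, zc);          // one thread block per 3D tile
//   tiled_kernel<<<grid, block>>>(d_grid, Nx, Ny, Nz / zc);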

4.2.2. Data Layout Change. In order to further improve the benefits of the tiling and reduce the overheads associated with the uncoalesced accesses, we propose to change the data layout. Figure 10 shows one xy-plane of the grid with and without the layout change. With the original layout (Figure 10(a)), the data is stored in row-major fashion: the entire first row is stored, followed by the second row, and so on. In the proposed new layout, the cells of the tiled first row of Block 0 are stored first, followed by the second tiled row of Block 0 instead of the first row of Block 1 (Figure 10(b)). With this layout change, the data cells accessed in consecutive iterations of the tiled code are placed sequentially in memory. This places the data elements of the different access groups closer together and increases the possibility that the memory accesses of different groups are coalesced, provided that the tiling factor and the memory layout factor are adjusted appropriately. This can further improve the performance beyond the tiling alone.

Figure 10: Different data layouts for blocks. (a) Original data layout. (b) Proposed data layout.

The data layout can be transformed using the following formula:

$$index_{new} = x_d + y_d N_x + z_d N_x N_y \quad (9)$$

where x_d and y_d are the cell indexes in the x- and y-dimension on the plane of the grid and z_d is a value in the range 0 to n_z - 1. x_d and y_d are calculated as follows:

x_d = (block index in x-dimension) * (number of threads per thread block in x-dimension) + (thread index in the thread block in x-dimension)
y_d = (block index in y-dimension) * (number of threads per thread block in y-dimension) + (thread index in the thread block in y-dimension)   (10)

In our implementation, we use the changed input data layout stored offline before the program starts. (The original input is converted to the new layout and stored to the input file.) The converted input file is then used while conducting the experiments.
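Equations (9) and (10) map directly onto CUDA's built-in index variables; a minimal device-side sketch (ours, not the paper's code):

__device__ int layout_index(int Nx, int Ny, int z_d)
{
    int x_d = blockIdx.x * blockDim.x + threadIdx.x;   // Eq. (10), x-dimension
    int y_d = blockIdx.y * blockDim.y + threadIdx.y;   // Eq. (10), y-dimension
    return x_d + y_d * Nx + z_d * Nx * Ny;             // Eq. (9)
}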

4.3. Reduction of Register Uses per Thread. The D3Q19 model is more precise than models with fewer distributions, such as D2Q9 or D3Q13, and thus uses more variables. This leads to more register uses in the main computation kernels. On the GPU, the register use of the threads is one of the factors limiting the number of active warps on a streaming multiprocessor (SMX). Higher register use can lead to lower parallelism and occupancy (see Figure 11 for an example), which results in overall performance degradation. The Nvidia compiler provides ways to limit the register use, such as the maxrregcount flag or the __launch_bounds__() qualifier [5]; the maxrregcount switch sets a maximum on the number of registers used by each thread. These can help increase the occupancy by reducing the register uses per thread. However, our experiments show that the overall performance then goes down, because they lead to a lot of register spills/refills to/from the local memory. The increased memory traffic to/from the local memory and the increased instruction count for accessing the local memory hurt the performance.

Figure 11: Sharing 2048 registers among (a) a larger number of threads with smaller register use (eight blocks, 64 threads per block, and 4 registers per thread) versus (b) a smaller number of threads with larger register use (eight blocks, 32 threads per block, and 8 registers per thread).

In order to reduce the register uses per thread while avoiding register spills/refills to/from the local memory, we used the following techniques:

(i) Calculate the indexing addresses of the distributions manually. Each cell has 19 distributions, so we would need 38 variables for storing indexes (19 distributions times 2 memory accesses for load and store) in the D3Q19 model. However, each index variable is used only once in each execution phase. Thus, we can use only two variables instead of 38: one for calculating the loading indexes and one for calculating the storing indexes.

(ii) Use the shared memory for variables commonly used among threads, for example, to store the base addresses.

(iii) Cast multiple small variables into one larger variable: for example, we combined four char variables into one integer variable.

(iv) For simple expressions which can be easily calculated, do not store the results in variables; instead, recompute them later.

(v) Use only one array to store the distributions instead of 19 separate arrays.

(vi) In the original LBM code, many variables are declared to store floating-point computation results, which increases the register uses. In order to reduce the register uses, we reuse variables whose lifetime ended earlier in the preceding code. This may lower the instruction-level parallelism of the kernel, but it helps increase the thread-level parallelism, as more threads can be active at the same time with the reduced register use per thread.

(vii) In the original LBM code, some complicated floating-point intensive computations appear in a number of nearby statements. We aggressively extract these computations as common subexpressions. This frees the registers involved in the common subexpressions, thus reducing the register uses, and also reduces the dynamic instruction count.

Applying the above techniques in our implementation, the number of registers used by each kernel is greatly reduced from 70 to 40. This leads to higher occupancy of the SMXs and significant performance improvements.

4.4. Removing Branch Divergence. Flow control instructions on the GPU cause the threads of the same warp to diverge, and the resulting different execution paths are serialized. This can significantly affect the performance of an application program on the GPU, so branch divergence should be avoided as much as possible. In the LBM code, there are two main sources of branch divergence:

(i) Handling the streaming at the boundary positions
(ii) Selecting the actions for the corresponding cell types

In order to avoid IF-statements when streaming at the boundary positions, a ghost layer is attached in the y- and z-dimensions. If N_x, N_y, and N_z are the width, height, and depth of the original grid, then NN_x = N_x, NN_y = N_y + 1, and NN_z = N_z + 1 are the new width, height, and depth of the grid with the ghost layer (Figure 12). With the ghost layer, the computations at the boundary positions can be treated as normal ones, without the risk of running out of the index bounds.

Figure 12: Illustration of the lattice with a ghost layer.

As explained in Section 4.1.3, the cells of the grid belong to one of three types: regular fluid, acceleration cells, or boundary. The LBM kernel contains conditions selecting the action for each cell type. The boundary cell type can be covered in the above-mentioned way using the ghost layer, but this still leaves the two other conditions inside the same half warp of the GPU. In order to remove these IF-conditions, we fold the conditions into the computational statements. Using this technique, the IF-statement in Algorithm 7 is rewritten as in Algorithm 8.

(1) IF (cell_type == FLUID)
(2)   x = a;
(3) ELSE
(4)   x = b;

Algorithm 7: Skeleton of an IF-statement used in the LBM kernel.

(1) is_fluid = (cell_type == FLUID);
(2) x = a * is_fluid + b * (!is_fluid);

Algorithm 8: Code with the IF-statement removed.

5. Experimental Results

In this section, we first describe the experimental setup and then show the performance results with analyses.

5.1. Experimental Setup. We implemented the LBM in the following five ways:

(i) Serial implementation using a single CPU core (Serial), using the source code from SPEC CPU2006 lbm [15] to make sure it is reasonably optimized
(ii) Parallel implementation on a GPU using the AoS data scheme (AoS)
(iii) Parallel implementation using the SoA data scheme and the push scheme (SoA_Push_Only)
(iv) Parallel implementation using the SoA data scheme and the pull scheme (SoA_Pull_Only)
(v) SoA using the pull scheme with our various optimizations, including the tiling with the data layout change (SoA_Pull_Full_Tiling)

We summarize our implementations in Table 2. We used the D3Q19 model for the LBM algorithm. The domain grid sizes are scaled in the range of 64^3, 128^3, 192^3, and 256^3, and the numbers of time steps used include 1000 and 5000. In order to measure the performance of the LBM, the Million Lattice Updates Per Second (MLUPS) unit is used, calculated as

$$\text{MLUPS} = \frac{N_x N_y N_z N_{ts}}{10^6\, T} \quad (11)$$

where N_x, N_y, and N_z are the domain sizes in the x-, y-, and z-dimension, N_ts is the number of time steps used, and T is the run time of the simulation.

Our experiments were conducted on a system incorporating an Intel multicore processor (6-core 2.0 GHz Intel Xeon E5-2650) with 20 MB level-3 cache and an Nvidia Tesla K20 GPU based on the Kepler architecture with 5 GB device memory. The OS is CentOS 5.5. In order to validate the effectiveness of our approach over the previous approaches, we have also conducted further experiments on another GPU, the Nvidia GTX285.
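The MLUPS figure in equation (11) can be computed directly from the measured run time; a small helper function (ours, for illustration):

/* MLUPS = (Nx * Ny * Nz * N_ts) / (1e6 * T), with T in seconds (Eq. (11)). */
double mlups(int nx, int ny, int nz, int n_ts, double t_seconds)
{
    return ((double)nx * ny * nz * n_ts) / (1.0e6 * t_seconds);
}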

5.2. Results Using Previous Approaches. The average performance of the Serial and the AoS implementations is shown in Figure 13. With the domain sizes of 64^3, 128^3, 192^3, and 256^3, the MLUPS numbers for the Serial implementation are 9.82, 7.42, 9.57, and 8.93 MLUPS; the corresponding numbers for the AoS implementation are given in Figure 13. With these numbers as the baseline, we also measured the SoA performance for the various domain sizes.

Figure 13: Performance (MLUPS) comparison of the Serial and the AoS with different domain sizes.

Table 2: Summary of experiments.

Experiment | Description
Serial | Serial implementation on a single CPU core
AoS | Parallel implementation using the AoS scheme
SoA_Push_Only | SoA scheme + push data scheme
SoA_Pull_Only | SoA scheme + pull data scheme
SoA_Pull_BR | SoA scheme + pull data scheme + branch divergence removal
SoA_Pull_RR | SoA scheme + pull data scheme + register usage reduction
SoA_Pull_Full | SoA scheme + pull data scheme + branch divergence removal + register usage reduction
SoA_Pull_Full_Tiling | SoA scheme + pull data scheme + branch divergence removal + register usage reduction + tiling with data layout change

Figure 14 shows that the SoA scheme significantly outperforms the AoS scheme: the SoA is faster than the AoS by factors of 6.63, 9.91, and 10.28 for the domain sizes 64^3, 128^3, and 192^3, respectively (see Figure 14 for 256^3). Note that in this experiment we applied only the SoA scheme without any other optimization techniques.

Figure 14: Performance (MLUPS) comparison of the AoS and the SoA with different domain sizes.

Figure 15 compares the performance of the pull scheme and the push scheme. For a fair comparison, we did not apply any other optimization techniques to either implementation. The pull scheme is better than the push scheme by 6.75%, 6.97%, 7.02%, and 7.2% for the domain sizes 64^3, 128^3, 192^3, and 256^3, respectively (for example, the push scheme reaches about 780 MLUPS at 128^3; Figure 15 gives the full numbers). The number of global memory transactions observed shows that the total transactions (loads and stores) of the pull and push schemes are roughly equal; however, the number of store transactions of the pull scheme is 56.2% smaller than that of the push scheme. This leads to the performance improvement of the pull scheme over the push scheme.

Figure 15: Performance (MLUPS) comparison of the SoA using the push scheme and the pull scheme with different domain sizes.

5.3. Results Using Our Optimization Techniques. In this subsection, we show the performance improvements of our optimization techniques compared with the previous approach, based on the SoA_Pull implementation:

(i) Figure 16 compares the average performance of the SoA with and without removing the branch divergences in the kernel code, as explained in Section 4.4. Removing the branch divergence improves the performance by 4.37%, 4.45%, 4.69%, and 5.19% for the domain sizes 64^3, 128^3, 192^3, and 256^3, respectively.

(ii) Reducing the register uses as described in Section 4.3 improves the performance by 12.07%, 12.44%, 11.98%, and 12.58%, as Figure 17 shows.

Figure 16: Performance (MLUPS) comparison of the SoA with and without branch removal for different domain sizes.

Figure 17: Performance (MLUPS) comparison of the SoA with and without reduced register use for different domain sizes.

Figure 18: Performance (MLUPS) comparison of the SoA with and without our optimization techniques for different domain sizes.

Figure 19: Performance (MLUPS) comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling with different domain sizes.

(iii) Figure 18 compares the performance of the SoA using the pull scheme with the optimization techniques of Sections 4.3 and 4.4, that is, the branch divergence removal and the register usage reduction (SoA_Pull_Full), against the SoA_Pull_Only. The optimized SoA_Pull_Full implementation is better than the SoA_Pull_Only by 16.44%, 16.89%, 16.68%, and 17.77% for the domain sizes 64^3, 128^3, 192^3, and 256^3, respectively.

(iv) Figure 19 shows the performance comparison of the SoA_Pull_Full and the SoA_Pull_Full_Tiling. The SoA_Pull_Full_Tiling performs better than the SoA_Pull_Full by 11.78% to 13.6%, depending on the domain size. The experimental results show that the tiling size giving the best performance is n_x = 32, n_y = 16, and n_z = N_z/4.

(v) Figure 20 presents the overall performance of the SoA_Pull_Full_Tiling implementation compared with the SoA_Pull_Only. With all our optimization techniques described in Sections 4.2, 4.3, and 4.4, we obtained a 28% overall performance improvement compared with the previous approach.

(vi) Table 3 compares the performance of four implementations (Serial, AoS, SoA_Pull_Only, and SoA_Pull_Full_Tiling) with different domain sizes. As shown, the peak MLUPS performance is achieved by the SoA_Pull_Full_Tiling with the domain size 256^3, where the speedup of 136 over the serial implementation is also achieved.

(vii) Table 4 compares the performance of our work with the previous work by Mawson and Revell [12]. Both implementations were run on the same K20 GPU. Our approach performs better than [12] by 14% to 19%, as it incorporates more optimization techniques, such as the tiling optimization with the data layout change and the branch divergence removal, among others, compared with [12].

Table 3: Performance (MLUPS) comparison of the four implementations (Serial, AoS, SoA_Pull_Only, and SoA_Pull_Full_Tiling) for different domain sizes and numbers of time steps.

Figure 20: Performance (MLUPS) comparison of the SoA_Pull_Only and the SoA_Pull_Full_Tiling with different domain sizes.

Table 4: Performance (MLUPS) comparison of our work with the previous work [12] for different domain sizes.

(viii) In order to further validate the effectiveness of our approach, we conducted more experiments on another GPU, the Nvidia GTX285. Table 5 and Figure 21 show the average performance of our implementations with the domain sizes 64^3, 128^3, and 160^3. (Larger grid sizes cannot be accommodated in the device memory of the GTX 285.) As shown, our optimized version, SoA_Pull_Full_Tiling, is better than the previous SoA_Pull_Only by up to 22.85%, and we obtained a 46-time speedup compared with the serial implementation. The level of the performance improvement and the speedup are, however, lower on the GTX 285 than on the K20.

Figure 21: Performance (MLUPS) on the GTX285 with different domain sizes.

Table 5: Performance (MLUPS) on the GTX285 for the Serial, AoS, SoA_Push_Only, SoA_Pull_Only, SoA_Pull_BR, SoA_Pull_RR, SoA_Pull_Full, and SoA_Pull_Full_Tiling implementations.

6. Previous Research

Previous parallelization approaches for the LBM algorithm focused on two main issues: how to efficiently organize the data and how to avoid the misalignment in the streaming (propagation) phase of the LBM. For the data organization, the AoS and the SoA are the two most commonly used schemes. While the AoS scheme is suitable for the CPU architecture, the SoA is a better scheme for the GPU architecture when the global memory access coalescing technique is incorporated. Thus, most implementations of the LBM on the GPU use the SoA as the main data organization.

In order to avoid the misalignment in the streaming phase of the LBM, there are two main approaches. The first approach uses the shared memory. Tölke [9] used this approach and implemented the D2Q9 model. Habich et al. [8] followed the same approach for the D3Q19 model. Bailey et al. [16] also used the shared memory to achieve 100% coalescence in the propagation phase for the D3Q13 model. In the second approach, the pull scheme is used instead of the push scheme, without using the shared memory. The main aim of using the shared memory is to avoid the misaligned accesses caused by the distribution values moving to the east and west directions [8, 9, 17]. However, the shared memory implementation needs extra synchronizations and intermediate registers, which lowers the achieved bandwidth. In addition, using the shared memory limits the maximum number of threads per thread block because of the limited size of the shared memory [17], which reduces the number of active warps (the occupancy) of the kernels, thereby hurting the performance. With the pull scheme, instead, no extra synchronization cost is incurred and no intermediate registers are needed. In addition, the better utilization of the registers in the pull scheme allows a larger number of threads to be generated, as the total number of registers is fixed. This leads to better utilization of the GPU's multithreading capability and higher performance. The latest results in [12] confirm the higher performance of the pull scheme compared with using the shared memory.

Besides the above approaches, in [12] the shuffle instruction, a new feature of the Tesla K20 GPU, was applied to avoid the misalignment in the streaming phase; however, the obtained results were worse. In [18], Obrecht et al. focused on choosing careful data transfer schemes in the global memory, instead of using the shared memory, in order to solve the misaligned memory access problem.

There have also been approaches to maximize the GPU multiprocessor occupancy by reducing the register uses per thread. Bailey et al. [16] showed a 20% improvement in maximum performance compared with the D3Q19 model in [8]. They set the number of registers used by the kernel below a certain limit using the Nvidia compiler flag; however, this approach may spill the register data to the local memory. Habich et al. [8] suggested a method to reduce the number of registers by using a base index, which forces the compiler to reuse the same register.

A few different implementations of the LBM have also been attempted. Astorino et al. [19] built a GPU implementation framework for the LBM valid for both two- and three-dimensional problems. The framework is organized in a modular fashion and allows for easy modification. They used the SoA scheme and the semidirect approach as the addressing scheme, and they also adopted the swapping technique to save the memory required for the LBM implementation. Rinaldi et al.
[17] suggested an approach based on a single-step algorithm with a reversed collision-propagation scheme. They used the shared memory as the main computational memory instead of the global memory. In our implementation, we adopted these approaches for the SoA_Pull_Only implementation shown in Section 5.

7. Conclusion

In this paper, we developed a high performance parallelization of the LBM algorithm with the D3Q19 model on the GPU. In order to improve the cache locality and minimize the overheads associated with the uncoalesced accesses in moving the data to the adjacent cells in the streaming phase of the LBM, we used the tiling optimization with the data layout change. To reduce the high register pressure of the LBM kernels and improve the thread parallelism available at run time, we developed techniques for aggressively reducing the register uses of the kernels. We also developed optimization techniques for removing the branch divergence. Other already-known techniques were also adopted in our parallel implementation, such as combining the streaming phase and the collision phase into one phase to reduce the memory overhead, a GPU friendly data organization scheme (the SoA scheme), efficient placement of the major data structures in the GPU memory hierarchy, and a data update scheme (the pull scheme) that further reduces the overheads of the uncoalesced accesses. Experimental results on the 6-core 2.0 GHz Intel Xeon processor and the Nvidia Tesla K20 GPU using CUDA show that our approach leads to impressive performance results: it delivers high MLUPS throughput and achieves up to a 136-time speedup compared with a serial implementation running on a single CPU core.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (NRF-2015M3C4A). This work was also supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea.

References

[1] J.-P. Rivet and J. P. Boon, Lattice Gas Hydrodynamics, vol. 11, Cambridge University Press, Cambridge, UK.
[2] J. Tölke and M. Krafczyk, "TeraFLOP computing on a desktop PC with GPUs for 3D CFD," International Journal of Computational Fluid Dynamics, vol. 22, no. 7, 2008.
[3] G. Crimi, F. Mantovani, M. Pivanti, S. F. Schifano, and R. Tripiccione, "Early experience on porting and running a Lattice Boltzmann code on the Xeon-Phi co-processor," in Proceedings of the 13th Annual International Conference on Computational Science (ICCS'13), vol. 18, Barcelona, Spain, June 2013.
[4] M. Stürmer, J. Götz, G. Richter, A. Dörfler, and U. Rüde, "Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method," Computers & Mathematics with Applications, vol. 58, no. 5.
[5] NVIDIA, CUDA Toolkit Documentation, September 2015.
[6] Khronos Group, OpenCL, 2015.
[7] OpenACC-standard.org, OpenACC, March 2012.
[8] J. Habich, T. Zeiser, G. Hager, and G. Wellein, "Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA," Advances in Engineering Software, vol. 42, no. 5.
[9] J. Tölke, "Implementation of a Lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA," Computing and Visualization in Science, vol. 13, no. 1, pp. 29-39.
[10] G. Wellein, T. Zeiser, G. Hager, and S. Donath, "On the single processor performance of simple lattice Boltzmann kernels," Computers and Fluids, vol. 35, no. 8-9, 2006.
[11] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, "Comparison of different propagation steps for lattice Boltzmann methods," Computers and Mathematics with Applications, vol. 65, no. 6.
[12] M. J. Mawson and A. J. Revell, "Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs," Computer Physics Communications, vol. 185, no. 10, 2014.
[13] N. Tran, M. Lee, and D. H. Choi, "Memory-efficient parallelization of 3D lattice Boltzmann flow solver on a GPU," in Proceedings of the IEEE 22nd International Conference on High Performance Computing (HiPC'15), IEEE, Bangalore, India, December 2015.
[14] K. Iglberger, Cache Optimizations for the Lattice Boltzmann Method in 3D, vol. 10, Lehrstuhl für Informatik, Würzburg, Germany.
[15] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1-17.
[16] P. Bailey, J. Myre, S. D. C. Walsh, D. J. Lilja, and M. O. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Proceedings of the 38th International Conference on Parallel Processing (ICPP'09), IEEE, Vienna, Austria, September 2009.
[17] P. R. Rinaldi, E. A. Dari, M. J. Vénere, and A. Clausse, "A Lattice-Boltzmann solver for 3D fluid simulation on GPU," Simulation Modelling Practice and Theory, vol. 25, 2012.
[18] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux, "A new approach to the lattice Boltzmann method for graphics processing units," Computers & Mathematics with Applications, vol. 61, no. 12, 2011.
[19] M. Astorino, J. B. Sagredo, and A.
Quarteroni, "A modular lattice Boltzmann solver for GPU computing processors," SeMA Journal, vol. 59, no. 1, pp. 53-78, 2012.


More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

The Shortest Path of Touring Lines given in the Plane

The Shortest Path of Touring Lines given in the Plane Send Orders for Reprnts to reprnts@benthamscence.ae 262 The Open Cybernetcs & Systemcs Journal, 2015, 9, 262-267 The Shortest Path of Tourng Lnes gven n the Plane Open Access Ljuan Wang 1,2, Dandan He

More information

Wavefront Reconstructor

Wavefront Reconstructor A Dstrbuted Smplex B-Splne Based Wavefront Reconstructor Coen de Vsser and Mchel Verhaegen 14-12-201212 2012 Delft Unversty of Technology Contents Introducton Wavefront reconstructon usng Smplex B-Splnes

More information

GPU Accelerated Blood Flow Computation using the Lattice Boltzmann Method

GPU Accelerated Blood Flow Computation using the Lattice Boltzmann Method GPU Accelerated Blood Flow Computaton usng the Lattce Boltmann Method Cosmn Nţă, Lucan Mha Itu, Constantn Sucu Department of Automaton Translvana Unversty of Braşov Braşov, Romana Constantn Sucu Corporate

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to:

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to: 4.1 4.2 Motvaton EE 457 Unt 4 Computer System Performance An ndvdual user wants to: Mnmze sngle program executon tme A datacenter owner wants to: Maxmze number of Mnmze ( ) http://e-tellgentnternetmarketng.com/webste/frustrated-computer-user-2/

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

Fast Computation of Shortest Path for Visiting Segments in the Plane

Fast Computation of Shortest Path for Visiting Segments in the Plane Send Orders for Reprnts to reprnts@benthamscence.ae 4 The Open Cybernetcs & Systemcs Journal, 04, 8, 4-9 Open Access Fast Computaton of Shortest Path for Vstng Segments n the Plane Ljuan Wang,, Bo Jang

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data

Type-2 Fuzzy Non-uniform Rational B-spline Model with Type-2 Fuzzy Data Malaysan Journal of Mathematcal Scences 11(S) Aprl : 35 46 (2017) Specal Issue: The 2nd Internatonal Conference and Workshop on Mathematcal Analyss (ICWOMA 2016) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES

More information

Scheduling Remote Access to Scientific Instruments in Cyberinfrastructure for Education and Research

Scheduling Remote Access to Scientific Instruments in Cyberinfrastructure for Education and Research Schedulng Remote Access to Scentfc Instruments n Cybernfrastructure for Educaton and Research Je Yn 1, Junwe Cao 2,3,*, Yuexuan Wang 4, Lanchen Lu 1,3 and Cheng Wu 1,3 1 Natonal CIMS Engneerng and Research

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation Loop Transformatons for Parallelsm & Localty Last week Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Scalar expanson for removng false dependences Loop nterchange Loop

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016)

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016) Technsche Unverstät München WSe 6/7 Insttut für Informatk Prof. Dr. Thomas Huckle Dpl.-Math. Benjamn Uekermann Parallel Numercs Exercse : Prevous Exam Questons Precondtonng & Iteratve Solvers (From 6)

More information

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont)

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont) Loop Transformatons for Parallelsm & Localty Prevously Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Loop nterchange Loop transformatons and transformaton frameworks

More information

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.15 No.10, October 2015 1 Evaluaton of an Enhanced Scheme for Hgh-level Nested Network Moblty Mohammed Babker Al Mohammed, Asha Hassan.

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

A fast algorithm for color image segmentation

A fast algorithm for color image segmentation Unersty of Wollongong Research Onlne Faculty of Informatcs - Papers (Arche) Faculty of Engneerng and Informaton Scences 006 A fast algorthm for color mage segmentaton L. Dong Unersty of Wollongong, lju@uow.edu.au

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Memory and I/O Organization

Memory and I/O Organization Memory and I/O Organzaton 8-1 Prncple of Localty Localty small proporton of memory accounts for most run tme Rule of thumb For 9% of run tme next nstructon/data wll come from 1% of program/data closest

More information

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management.

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management. //7 Prnceton Unversty Computer Scence 7: Introducton to Programmng Systems Goals of ths Lecture Storage Management Help you learn about: Localty and cachng Typcal storage herarchy Vrtual memory How the

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

BIN XIA et al: AN IMPROVED K-MEANS ALGORITHM BASED ON CLOUD PLATFORM FOR DATA MINING

BIN XIA et al: AN IMPROVED K-MEANS ALGORITHM BASED ON CLOUD PLATFORM FOR DATA MINING An Improved K-means Algorthm based on Cloud Platform for Data Mnng Bn Xa *, Yan Lu 2. School of nformaton and management scence, Henan Agrcultural Unversty, Zhengzhou, Henan 450002, P.R. Chna 2. College

More information

Solving Planted Motif Problem on GPU

Solving Planted Motif Problem on GPU Solvng Planted Motf Problem on GPU Naga Shalaja Dasar Old Domnon Unversty Norfolk, VA, USA ndasar@cs.odu.edu Ranjan Desh Old Domnon Unversty Norfolk, VA, USA dranjan@cs.odu.edu Zubar M Old Domnon Unversty

More information

Speedup of Type-1 Fuzzy Logic Systems on Graphics Processing Units Using CUDA

Speedup of Type-1 Fuzzy Logic Systems on Graphics Processing Units Using CUDA Speedup of Type-1 Fuzzy Logc Systems on Graphcs Processng Unts Usng CUDA Durlabh Chauhan 1, Satvr Sngh 2, Sarabjeet Sngh 3 and Vjay Kumar Banga 4 1,2 Department of Electroncs & Communcaton Engneerng, SBS

More information

Security Enhanced Dynamic ID based Remote User Authentication Scheme for Multi-Server Environments

Security Enhanced Dynamic ID based Remote User Authentication Scheme for Multi-Server Environments Internatonal Journal of u- and e- ervce, cence and Technology Vol8, o 7 0), pp7-6 http://dxdoorg/07/unesst087 ecurty Enhanced Dynamc ID based Remote ser Authentcaton cheme for ult-erver Envronments Jun-ub

More information

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT

APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT 3. - 5. 5., Brno, Czech Republc, EU APPLICATION OF MULTIVARIATE LOSS FUNCTION FOR ASSESSMENT OF THE QUALITY OF TECHNOLOGICAL PROCESS MANAGEMENT Abstract Josef TOŠENOVSKÝ ) Lenka MONSPORTOVÁ ) Flp TOŠENOVSKÝ

More information

Analysis on the Workspace of Six-degrees-of-freedom Industrial Robot Based on AutoCAD

Analysis on the Workspace of Six-degrees-of-freedom Industrial Robot Based on AutoCAD Analyss on the Workspace of Sx-degrees-of-freedom Industral Robot Based on AutoCAD Jn-quan L 1, Ru Zhang 1,a, Fang Cu 1, Q Guan 1 and Yang Zhang 1 1 School of Automaton, Bejng Unversty of Posts and Telecommuncatons,

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization What s a Computer Program? Descrpton of algorthms and data structures to acheve a specfc ojectve Could e done n any language, even a natural language lke Englsh Programmng language: A Standard notaton

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Research Article GPU Acceleration of Melody Accurate Matching in Query-by-Humming

Research Article GPU Acceleration of Melody Accurate Matching in Query-by-Humming e Scentfc World Journal, Artcle ID 614193, 7 pages http://dx.do.org/10.1155/2014/614193 Research Artcle GPU Acceleraton of Melody Accurate Matchng n Query-by-Hummng Lmn Xao, 1,2 Yao Zheng, 1,2,3 Wenq Tang,

More information

IP Camera Configuration Software Instruction Manual

IP Camera Configuration Software Instruction Manual IP Camera 9483 - Confguraton Software Instructon Manual VBD 612-4 (10.14) Dear Customer, Wth your purchase of ths IP Camera, you have chosen a qualty product manufactured by RADEMACHER. Thank you for the

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

High-Boost Mesh Filtering for 3-D Shape Enhancement

High-Boost Mesh Filtering for 3-D Shape Enhancement Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden

Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden Optmzng for Speed Er Hagersten Uppsala Unversty, Sweden eh@t.uu.se What s the potental gan? Latency dfference L$ and mem: ~5x Bandwdth dfference L$ and mem: ~x Repeated TLB msses adds a factor ~-3x Execute

More information

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach Data Representaton n Dgtal Desgn, a Sngle Converson Equaton and a Formal Languages Approach Hassan Farhat Unversty of Nebraska at Omaha Abstract- In the study of data representaton n dgtal desgn and computer

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Security Vulnerabilities of an Enhanced Remote User Authentication Scheme

Security Vulnerabilities of an Enhanced Remote User Authentication Scheme Contemporary Engneerng Scences, Vol. 7, 2014, no. 26, 1475-1482 HIKARI Ltd, www.m-hkar.com http://dx.do.org/10.12988/ces.2014.49186 Securty Vulnerabltes of an Enhanced Remote User Authentcaton Scheme Hae-Soon

More information

Video Proxy System for a Large-scale VOD System (DINA)

Video Proxy System for a Large-scale VOD System (DINA) Vdeo Proxy System for a Large-scale VOD System (DINA) KWUN-CHUNG CHAN #, KWOK-WAI CHEUNG *# #Department of Informaton Engneerng *Centre of Innovaton and Technology The Chnese Unversty of Hong Kong SHATIN,

More information

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap Int. Journal of Math. Analyss, Vol. 8, 4, no. 5, 7-7 HIKARI Ltd, www.m-hkar.com http://dx.do.org/.988/jma.4.494 Emprcal Dstrbutons of Parameter Estmates n Bnary Logstc Regresson Usng Bootstrap Anwar Ftranto*

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Obstacle-Aware Routing Problem in. a Rectangular Mesh Network

Obstacle-Aware Routing Problem in. a Rectangular Mesh Network Appled Mathematcal Scences, Vol. 9, 015, no. 14, 653-663 HIKARI Ltd, www.m-hkar.com http://dx.do.org/10.1988/ams.015.411911 Obstacle-Aware Routng Problem n a Rectangular Mesh Network Norazah Adzhar Department

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Lecture 3: Computer Arithmetic: Multiplication and Division

Lecture 3: Computer Arithmetic: Multiplication and Division 8-447 Lecture 3: Computer Arthmetc: Multplcaton and Dvson James C. Hoe Dept of ECE, CMU January 26, 29 S 9 L3- Announcements: Handout survey due Lab partner?? Read P&H Ch 3 Read IEEE 754-985 Handouts:

More information

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain AMath 483/583 Lecture 21 May 13, 2011 Today: OpenMP and MPI versons of Jacob teraton Gauss-Sedel and SOR teratve methods Next week: More MPI Debuggng and totalvew GPU computng Read: Class notes and references

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

Modeling, Manipulating, and Visualizing Continuous Volumetric Data: A Novel Spline-based Approach

Modeling, Manipulating, and Visualizing Continuous Volumetric Data: A Novel Spline-based Approach Modelng, Manpulatng, and Vsualzng Contnuous Volumetrc Data: A Novel Splne-based Approach Jng Hua Center for Vsual Computng, Department of Computer Scence SUNY at Stony Brook Talk Outlne Introducton and

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier

Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier Floatng-Pont Dvson Algorthms for an x86 Mcroprocessor wth a Rectangular Multpler Mchael J. Schulte Dmtr Tan Carl E. Lemonds Unversty of Wsconsn Advanced Mcro Devces Advanced Mcro Devces Schulte@engr.wsc.edu

More information

Optimized Resource Scheduling Using Classification and Regression Tree and Modified Bacterial Foraging Optimization Algorithm

Optimized Resource Scheduling Using Classification and Regression Tree and Modified Bacterial Foraging Optimization Algorithm World Engneerng & Appled Scences Journal 7 (1): 10-17, 2016 ISSN 2079-2204 IDOSI Publcatons, 2016 DOI: 10.5829/dos.weasj.2016.7.1.22540 Optmzed Resource Schedulng Usng Classfcaton and Regresson Tree and

More information

Using Fuzzy Logic to Enhance the Large Size Remote Sensing Images

Using Fuzzy Logic to Enhance the Large Size Remote Sensing Images Internatonal Journal of Informaton and Electroncs Engneerng Vol. 5 No. 6 November 015 Usng Fuzzy Logc to Enhance the Large Sze Remote Sensng Images Trung Nguyen Tu Huy Ngo Hoang and Thoa Vu Van Abstract

More information

Parallel Branch and Bound Algorithm - A comparison between serial, OpenMP and MPI implementations

Parallel Branch and Bound Algorithm - A comparison between serial, OpenMP and MPI implementations Journal of Physcs: Conference Seres Parallel Branch and Bound Algorthm - A comparson between seral, OpenMP and MPI mplementatons To cte ths artcle: Luco Barreto and Mchael Bauer 2010 J. Phys.: Conf. Ser.

More information

Resource and Virtual Function Status Monitoring in Network Function Virtualization Environment

Resource and Virtual Function Status Monitoring in Network Function Virtualization Environment Journal of Physcs: Conference Seres PAPER OPEN ACCESS Resource and Vrtual Functon Status Montorng n Network Functon Vrtualzaton Envronment To cte ths artcle: MS Ha et al 2018 J. Phys.: Conf. Ser. 1087

More information

AADL : about scheduling analysis

AADL : about scheduling analysis AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

2-Dimensional Image Representation. Using Beta-Spline

2-Dimensional Image Representation. Using Beta-Spline Appled Mathematcal cences, Vol. 7, 03, no. 9, 4559-4569 HIKARI Ltd, www.m-hkar.com http://dx.do.org/0.988/ams.03.3359 -Dmensonal Image Representaton Usng Beta-plne Norm Abdul Had Faculty of Computer and

More information

Dynamic wetting property investigation of AFM tips in micro/nanoscale

Dynamic wetting property investigation of AFM tips in micro/nanoscale Dynamc wettng property nvestgaton of AFM tps n mcro/nanoscale The wettng propertes of AFM probe tps are of concern n AFM tp related force measurement, fabrcaton, and manpulaton technques, such as dp-pen

More information

An Efficient Garbage Collection for Flash Memory-Based Virtual Memory Systems

An Efficient Garbage Collection for Flash Memory-Based Virtual Memory Systems S. J and D. Shn: An Effcent Garbage Collecton for Flash Memory-Based Vrtual Memory Systems 2355 An Effcent Garbage Collecton for Flash Memory-Based Vrtual Memory Systems Seunggu J and Dongkun Shn, Member,

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information