Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units


Invited Article, Computer Science & Technology, March 2012, Vol. 57, No. 7: 707-715, doi: 10.1007/s11434-011-4908-y

Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units

XIONG QinGang 1,2, LI Bo 1,2, XU Ji 1,2, FANG XiaoJian 1,2, WANG XiaoWei 1*, WANG LiMin 1*, HE XianFeng 1 & GE Wei 1

1 State Key Laboratory of Multiphase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100190, China; 2 Graduate University of Chinese Academy of Sciences, Beijing 100049, China

Received May 23, 2011; accepted October 19, 2011

Many-core processors, such as graphic processing units (GPUs), are promising platforms for intrinsically parallel algorithms such as the lattice Boltzmann method (LBM). Although tremendous speedup has been obtained on a single GPU compared with mainstream CPUs, the performance of the LBM on multiple GPUs has not been studied extensively and systematically. In this article, we carry out LBM simulation on a GPU cluster with many nodes, each having multiple Fermi GPUs. Asynchronous execution with CUDA stream functions, OpenMP and non-blocking MPI communication are incorporated to improve efficiency. The algorithm is tested for two-dimensional Couette flow and the results are in good agreement with the analytical solution. For both one- and two-dimensional decompositions of space, the algorithm performs well, as most of the communication time is hidden. Direct numerical simulation of a two-dimensional gas-solid suspension containing more than one million solid particles and one billion gas lattice cells demonstrates the potential of this algorithm in large-scale engineering applications. The algorithm can be directly extended to three-dimensional decompositions of space and to other modeling methods, including explicit grid-based methods.

Keywords: asynchronous execution, compute unified device architecture, graphic processing unit, lattice Boltzmann method, non-blocking message passing interface, OpenMP

Citation: Xiong Q G, Li B, Xu J, et al. Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units. Chin Sci Bull, 2012, 57: 707-715, doi: 10.1007/s11434-011-4908-y

High-performance computing (HPC) on general-purpose graphical processing units (GPGPUs) has emerged as a competitive approach for demanding computations such as those of computational fluid dynamics (CFD) [1,2] and discrete particle simulations [3-5]. This is, on one hand, due to the computational capacity of graphical processing units (GPUs), which is almost one order of magnitude higher than that of mainstream central processing units (CPUs) in terms of both peak performance and memory bandwidth, and, on the other hand, due to the introduction of effective and convenient programming interfaces such as the Compute Unified Device Architecture (CUDA).

*Corresponding authors (email: xwwang@home.ipe.ac.cn; lmwang@home.ipe.ac.cn)

The lattice Boltzmann method (LBM) [6] is a numerical method well suited to GPGPUs owing to its explicit numerical scheme, localized communication mode and the inherent additivity of its numerical operations. It is therefore a powerful alternative to CFD methods such as the finite difference and finite volume methods. Implementations of the LBM on a single GPU have been reported [7-10], with speedup ratios ranging from tens to above 100 relative to a single CPU core. For multi-GPU implementations, Li et al. [11] performed LBM simulation of lid-driven cavity flow on an HPC system comprising both Nvidia and AMD GPUs, using CUDA and Brook+, respectively, combined via the Message Passing Interface (MPI). Myre et al. [12] implemented single-phase, multiphase and multicomponent LBMs on GPU clusters using Open Multi-Processing (OpenMP).

In these implementations, however, data communication between GPUs is either trivial or the GPUs are installed at the same node, so the real performance was almost unaffected by communication. This is not typical in engineering practice. In fact, the data in a GPU cannot be accessed by the network directly and have to be copied, from the GPU to the CPU before sending and from the CPU to the GPU after receiving, through a PCIe bus whose bandwidth is currently about 10 GB/s (Gen 2), much lower than that of the GPU global memory. Communication between the CPU and GPU can therefore be a bottleneck in some applications.

In this article, we integrate asynchronous computation-communication overlap via the CUDA 3.1 framework [13], shared-memory parallelization using OpenMP, and inter-node parallelization using non-blocking MPI to improve the performance of multi-GPU LBM simulations. The performance for both one- and two-dimensional decompositions is analyzed, and our implementation is found to be very efficient. The consistency of our implementation on HPC systems with multiple GPUs at one node is emphasized.

1 The lattice Boltzmann method

The lattice BGK model [14] is one of the most frequently used schemes for the LBM. Depending on the dimensionality (D) and the number of discrete lattice velocities (Q), there are different variants, such as D2Q9, D3Q13 and D3Q19. The formulation of the lattice BGK model is

f_i(x + e_i, t + 1) = f_i(x, t) - \frac{1}{\tau}\left( f_i(x, t) - f_i^{eq}(x, t) \right),   (1)

where f_i(x, t) is the density distribution function of the i-th direction at position x and time t, and τ is the relaxation time, related to the fluid dynamic viscosity μ. The equilibrium term f_i^{eq}(x, t) is approximated to second order as

f_i^{eq}(x, t) = w_i \rho \left( 1 + 3\frac{e_i \cdot u}{c^2} + 4.5\frac{(e_i \cdot u)^2}{c^4} - 1.5\frac{u \cdot u}{c^2} \right),   (2)

where

\rho = \sum_i f_i, \qquad \rho u = \sum_i e_i f_i.   (3)

The D2Q9 scheme is illustrated in Figure 1, and further details were given by Qian et al. [14].

Figure 1 D2Q9 model, with w_i = 4/9 for i = 0, w_i = 1/9 for i = 1, 2, 3, 4, and w_i = 1/36 for i = 5, 6, 7, 8.

To reduce the compressibility effect of the original lattice BGK model, He et al. [15] proposed revisions of the DdQq schemes, named DdQqi. The evolution rule is the same, but with a different equilibrium distribution:

f_i^{eq}(x, t) = \lambda_i p + w_i \rho_0 \left( 3\frac{e_i \cdot u}{c^2} + 4.5\frac{(e_i \cdot u)^2}{c^4} - 1.5\frac{u \cdot u}{c^2} \right),   (4)

where \lambda_0 = -3\frac{1 - w_0}{c^2}, \lambda_i = 3\frac{w_i}{c^2} for i \neq 0, and \rho_0 is the reference fluid density of the initial state. The pressure p and velocity u are expressed as

p = \frac{c^2}{3(1 - w_0)}\left( \sum_{i \neq 0} f_i - 1.5 w_0 \rho_0 \frac{u \cdot u}{c^2} \right), \qquad u = \frac{1}{\rho_0}\sum_{i \neq 0} e_i f_i.   (5)

The DdQqi schemes introduce no further computational cost, and for GPU implementation the zeroth direction can be omitted, which makes them faster than the corresponding DdQq schemes. However, for the DdQqi schemes the hiding of data communication is more important, since the communication-to-computation ratio is higher than for DdQq while the size of the data to be transferred among GPUs is the same.

2 Multi-GPU implementation of the D2Q9 scheme

The implementation of the LBM for a single GPU has been discussed extensively in [7,16]. We emphasize one point here. As the GPU is suited to data-independent, computation-intensive tasks, the memory access pattern is critical to performance. For this reason, the storage of the LBM grid data must be aligned and accessed in a coalesced manner to make full use of the memory bandwidth. As long as global memory access is optimized, the performance of different implementations on the same single GPU varies little. For multi-GPU implementations, however, GPU-CPU data transfer and CPU-CPU communication may take a large portion of the wall time, and they have to be optimized as well.
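The paper does not list its kernels, but the point about aligned, coalesced access can be illustrated with a minimal sketch of a D2Q9 BGK collision step (eqs. (1)-(3)) that stores the distributions in a direction-major (structure-of-arrays) layout, so that consecutive threads touch consecutive addresses for every direction. The layout, names and lattice units with c = 1 are illustrative assumptions, not the authors' implementation:

```c
#include <cuda_runtime.h>

#define Q 9

/* Direction-major layout: f[i*nx*ny + y*nx + x]. Threads with consecutive x
   therefore read/write consecutive addresses for each direction i (coalesced). */
__device__ __forceinline__ int idx(int i, int x, int y, int nx, int ny)
{
    return i * nx * ny + y * nx + x;
}

__constant__ float w[Q]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                            1.f/36, 1.f/36, 1.f/36, 1.f/36};
__constant__ int   ex[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int   ey[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

/* BGK collision (without the streaming shift) for one cell, in lattice units (c = 1). */
__global__ void collide(float *f, int nx, int ny, float tau)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    float fi[Q], rho = 0.f, ux = 0.f, uy = 0.f;
    for (int i = 0; i < Q; ++i) {
        fi[i] = f[idx(i, x, y, nx, ny)];
        rho += fi[i];
        ux  += ex[i] * fi[i];
        uy  += ey[i] * fi[i];
    }
    ux /= rho;  uy /= rho;

    float usq = ux * ux + uy * uy;
    for (int i = 0; i < Q; ++i) {
        float eu  = ex[i] * ux + ey[i] * uy;   /* e_i . u */
        float feq = w[i] * rho * (1.f + 3.f * eu + 4.5f * eu * eu - 1.5f * usq);
        f[idx(i, x, y, nx, ny)] = fi[i] - (fi[i] - feq) / tau;
    }
}
```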
In CUDA 3.1, the launch of a GPU kernel is asynchronous: control returns to the host as soon as the kernel is launched, before the kernel completes its computation. This feature enables the host CPU to perform other jobs while waiting for the GPU kernel to finish, e.g., copying data between the GPU and CPU, and carrying out inter-CPU communication and arithmetic operations.

Figure 2 Schematic map of the overlapping of GPU computation and data communication (boundary cells and inner cells together make up the entire grid, which is executed in stream[1]).

For LBM simulations, this implies that collision and propagation of the density functions can be run in parallel with copying the boundary grid information to the CPU and then transferring it to neighboring CPUs. As shown in Figure 2, this is realized using the stream functions and portable pinned memory of CUDA 3.1, OpenMP, and the non-blocking communication provided by MPI. A flowchart of the parallel implementation of the LBM on the GPU cluster is given in Figure 3.

At the beginning of each iteration, the collision operation on the boundary cells is launched asynchronously by the kernel Boundary_Collision in stream[0]. In this kernel, the boundary cells are only subject to collision, not propagation, and the post-collision boundary information is written to sending buffers in GPU global memory. Collision and propagation on the entire grid are launched by the kernel Collision_Propagation in stream[1] as soon as Boundary_Collision returns. The host can return before these asynchronous kernels complete, but kernels in the same stream are executed in series. We therefore launch the GPU-to-CPU copy, cudaMemcpyAsync, in stream[0], which ensures that the copy starts only after Boundary_Collision has completed. Although the operations in stream[0] are serialized, they can proceed while Collision_Propagation is executing. To use the asynchronous cudaMemcpyAsync, the buffers on the host must be allocated as pinned memory.

After the GPU-to-CPU copy, the communication between CPUs is ready to be carried out. To confirm that the GPU-to-CPU copy into host memory has finished, cudaStreamSynchronize(stream[0]) is called, ensuring that all boundary information has been copied to the sending buffers in host memory. Non-blocking MPI_Isend and MPI_Irecv are then launched if the neighboring processes do not belong to the same node. Because these two MPI functions are non-blocking, other CPU operations can proceed while data are being sent or received; MPI_Wait is needed to wait until the data have been received. If neighboring processes are located on the same node, data can be transferred through the portable pinned memory of CUDA. This design reduces the amount of data passed through MPI and achieves a higher data transfer speed. The idea is realized using OpenMP for data communication within a node [17]: OpenMP threads control the GPU devices and make the portable pinned memory visible to all GPU devices at the same node. Furthermore, GPUDirect [18], a technology for Tesla and Fermi GPUs, is adopted to improve communication performance; the improvement is achieved by removing the step of copying data from GPU-dedicated host memory to host memory available to the InfiniBand devices that execute the RDMA communication. After the communication, the received data are copied back to the GPU with cudaMemcpyAsync. Finally, the boundary information is updated with the data from the receiving buffers in GPU global memory.
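The call sequence described above can be summarized in host code. The following is a minimal sketch of one time step for a single GPU in a one-dimensional decomposition, assuming the kernels Boundary_Collision, Collision_Propagation and Update_Boundary, the buffer pointers and the neighbor ranks are defined elsewhere; it is an illustration of the ordering (two CUDA streams, cudaMemcpyAsync with pinned host buffers, non-blocking MPI), not the authors' code:

```c
#include <cuda_runtime.h>
#include <mpi.h>

/* Kernels assumed to be defined elsewhere (illustrative signatures). */
__global__ void Boundary_Collision(float *f, float *send_dev, int nx, int ny, float tau);
__global__ void Collision_Propagation(const float *f, float *f_new, int nx, int ny, float tau);
__global__ void Update_Boundary(float *f_new, const float *recv_dev, int nx, int ny);

/* One LBM step with computation-communication overlap; "up" and "down" are the MPI
   ranks of the neighboring subdomains, halo_count is the number of floats per halo. */
void lbm_step(float *f, float *f_new, float *send_dev, float *recv_dev,
              float *send_host, float *recv_host,   /* pinned (cudaMallocHost) buffers */
              size_t halo_count, int nx, int ny, float tau,
              int up, int down, cudaStream_t stream0, cudaStream_t stream1)
{
    dim3 block(128, 1);
    dim3 grid_bnd((nx + 127) / 128, 2);   /* two boundary rows */
    dim3 grid_all((nx + 127) / 128, ny);  /* whole subdomain    */

    /* stream[0]: collide boundary rows only and pack them into the device send buffer. */
    Boundary_Collision<<<grid_bnd, block, 0, stream0>>>(f, send_dev, nx, ny, tau);

    /* stream[1]: collide and propagate the entire grid; this overlaps with all work below. */
    Collision_Propagation<<<grid_all, block, 0, stream1>>>(f, f_new, nx, ny, tau);

    /* stream[0]: copy the packed boundary data to pinned host memory; it starts only after
       Boundary_Collision, because operations within one stream are serialized. */
    cudaMemcpyAsync(send_host, send_dev, 2 * halo_count * sizeof(float),
                    cudaMemcpyDeviceToHost, stream0);
    cudaStreamSynchronize(stream0);   /* boundary data are now in the host send buffers */

    /* Non-blocking halo exchange with the two neighbors. */
    MPI_Request req[4];
    MPI_Irecv(recv_host,              (int)halo_count, MPI_FLOAT, up,   0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_host + halo_count, (int)halo_count, MPI_FLOAT, down, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(send_host,              (int)halo_count, MPI_FLOAT, down, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(send_host + halo_count, (int)halo_count, MPI_FLOAT, up,   1, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* Copy the received halos back to the device and update the boundary cells. */
    cudaMemcpyAsync(recv_dev, recv_host, 2 * halo_count * sizeof(float),
                    cudaMemcpyHostToDevice, stream0);
    Update_Boundary<<<grid_bnd, block, 0, stream0>>>(f_new, recv_dev, nx, ny);

    /* Both streams must finish before f and f_new are swapped for the next iteration. */
    cudaStreamSynchronize(stream0);
    cudaStreamSynchronize(stream1);
}
```

In the paper's configuration, each GPU of a node is driven by one OpenMP thread of the same process, so the send and receive buffers can be allocated as portable pinned memory visible to all devices of the node and the MPI traffic is only needed between nodes.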

Figure 3 Flowchart of the hybrid implementation of the LBM on multiple GPUs [20].

3 Results and discussion

In the following, the algorithm is validated and its performance tested on our GPU cluster Mole-8.5 (cf. http://www.top500.org/list/2011/11/100), which consists of 362 nodes connected with Quad Data Rate InfiniBand. Most of the computing nodes are equipped with two quad-core CPUs and six Nvidia Tesla C2050 GPUs; the whole system is therefore configured with more than 2000 GPUs, giving a peak performance of 2 petaflops in single precision.

3.1 Validation

Numerical validation is important in GPU computing, although many authors [7,19] have reported that the results are insensitive to the use of single precision. We use the analytical solution of the classical two-dimensional Couette flow to evaluate the accuracy of our GPU implementation. The domain size is 2048 × 2048 and the Reynolds number Re is 400. The simulation is run in parallel on four GPUs. The simulation results and the analytical solution are illustrated in Figure 4; the computational results of our GPU implementation agree very well with the analytical solution, with a maximum error of about 1.5%.

Figure 4 Velocity profiles at steady state for a two-dimensional Couette flow simulation with grid size 2048 × 2048 (Reynolds number Re = UH/ν = 400).
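The 1.5% figure corresponds to a maximum-deviation check against the linear steady-state Couette profile. A minimal host-side sketch of such a check, assuming the simulated streamwise velocity profile has been copied back to the host (all names are illustrative, not the authors' code):

```c
#include <math.h>

/* Maximum relative deviation of a simulated Couette profile from the analytical
   steady state u_x(y) = U * y / H, normalized by the wall velocity U. */
double couette_max_rel_error(const double *u_sim, int ny, double U, double H)
{
    double max_err = 0.0;
    for (int j = 0; j < ny; ++j) {
        double y = (j + 0.5) * H / ny;      /* cell-centre coordinate */
        double u_exact = U * y / H;         /* linear steady-state profile */
        double err = fabs(u_sim[j] - u_exact) / U;
        if (err > max_err) max_err = err;
    }
    return max_err;
}
```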

3.2 Performance

Five cases of Couette flow are simulated, with per-GPU grid sizes of 512 × 512 (A), 512 × 1024 (B), 1024 × 1024 (C), 1024 × 2048 (D) and 2048 × 2048 (E). The whole computation domain is partitioned in either one or two dimensions. All cases were run 10 times with 10000 iteration steps each, and the wall times were arithmetically averaged. In the following, unless otherwise specified, each node runs six GPUs concurrently.

The time costs of GPU computation, data transfer between the GPU and CPU, and communication between neighboring CPUs, for runs using 12 GPUs in one- and two-dimensional decomposition with synchronous execution and blocking MPI, are plotted in Figures 5 and 6, respectively. The time portions of GPU-CPU data transfer and CPU-CPU communication increase as the domain size per GPU decreases. In addition, as expected, the time percentage of GPU-CPU and CPU-CPU data transfer in two-dimensional decomposition is higher than in one-dimensional decomposition, and sometimes even exceeds the GPU computing time, which means there is more room to improve efficiency by hiding the data transfer between the GPU and CPU and the communication between CPUs.

Figure 5 (a) Time components of the algorithm with synchronous execution and blocking MPI but without OpenMP in one-dimensional decomposition; (b) time percentages of GPU-CPU data transfer and CPU-CPU communication.

Figure 6 (a) Time components of the algorithm with synchronous execution and blocking MPI in two-dimensional decomposition; (b) time percentages of GPU-CPU data transfer and CPU-CPU communication.
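The higher communication share of the two-dimensional decomposition follows from a rough halo count. Assuming each GPU holds an n × n block and q_b distribution values must be exchanged per boundary cell, a one-dimensional decomposition exchanges two edges per GPU while a two-dimensional one exchanges four (plus corners), so the communication volume roughly doubles while the computation stays at n^2 cells:

V_{1D} \approx 2 n q_b, \qquad V_{2D} \approx 4 n q_b, \qquad \frac{V_{2D}}{V_{1D}} \approx 2.

This is also why hiding the transfers pays off more in the two-dimensional case, as shown next.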

Simulations deploying the proposed computation-communication overlapping algorithm in both one- and two-dimensional decomposition were then carried out. The time costs for all cases are illustrated in Figures 7 and 8. The figures show that most of the time for data copying and communication is successfully hidden by overlapping it with GPU computation, leading to an obvious reduction in the total time. In two-dimensional decomposition, the performance improvement is even greater than in one-dimensional decomposition, since more of the time for GPU-CPU data transfer and communication is hidden.

To describe the performance improvement clearly, we take case E in one-dimensional decomposition using 12 GPUs as an example and compare the time components of five algorithms: (a) synchronous execution and blocking MPI without OpenMP; (b) synchronous execution and blocking MPI with OpenMP; (c) asynchronous execution and blocking MPI with OpenMP; (d) synchronous execution and non-blocking MPI with OpenMP; (e) asynchronous execution and non-blocking MPI with OpenMP. The results are listed in Table 1. Because of the non-serial character of asynchronous execution and non-blocking MPI, the time for asynchronous GPU execution and non-blocking MPI is difficult to separate; the GPU computation time was therefore assumed to be the same for the asynchronous cases. Table 1 shows that the time required for data delivery between the GPU and CPU is reduced by about 60%-70% and the time required for inter-CPU communication by 70%-80%, which gives a performance of 1192 million lattice updates per second for each GPU card in the multi-node, multi-GPU implementation.

Table 1 Comparison of time components for the five algorithms in case E
Algorithm   GPU computation (s)   GPU-CPU data transfer (s)   CPU-CPU communication (s)   Total (s)
(a)         33.90231              1.89775                     2.65466                     38.45473
(b)         33.91365              1.88922                     1.13562                     36.93849
(c)         33.90231              0.63391                     1.1479                      35.68412
(d)         33.89173              1.90276                     0.6431                      36.43759
(e)         33.90231              0.63391                     0.6431                      35.17932

Figure 7 (a) Time components of the algorithm with asynchronous execution, OpenMP and non-blocking MPI in one-dimensional decomposition; (b) time percentages of GPU-CPU copy and CPU-CPU communication.

Figure 8 (a) Time components of the algorithm with asynchronous execution, OpenMP and non-blocking MPI in two-dimensional decomposition; (b) time percentages of GPU-CPU copy and CPU-CPU communication.
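As a consistency check of the quoted throughput, and assuming the total wall times in Table 1 cover the full 10000-step run, the figure follows directly from the case E parameters (a 2048 × 2048 grid per GPU) and the total time of algorithm (e):

\frac{2048 \times 2048 \times 10^4}{35.17932\ \text{s}} \approx 1.19 \times 10^9 \ \text{lattice updates per second per GPU},

i.e. about 1192 million lattice updates per second.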

To investigate the scalability of the implementation further, we vary the number of GPUs used for case E from 12 to 1728. The corresponding communication time costs are shown in Figure 9. The computation-communication overlapping algorithm still performs better than the original algorithm with blocking MPI as the number of GPUs increases, which shows that the optimization can be applied to hundreds or thousands of GPUs with good scalability.

Figure 9 Comparison of communication time between blocking and non-blocking MPI in large-scale LBM simulations.

3.3 Performance balance for multi-GPU nodes

In addition to the above performance discussion, we also run our GPU implementation with 12 GPUs for case E but with a varying number (one, two, three, four or six) of GPUs at each node, to test the balance between performance and economy for computing nodes integrating multiple GPUs. Since the bandwidth of the PCI-E bus is usually a bottleneck for data transfer between the GPU and CPU compared with the GPU computing itself, performance deteriorates when multiple GPUs at one node are engaged in a parallel computation because they contend for PCI-E bandwidth. Owing to the use of CUDA portable pinned memory and OpenMP, the communication load of the processes within a node is theoretically equal, irrespective of how many GPUs are employed concurrently at a node; we can therefore expect negligible differences in the CPU-CPU communication time among the five configurations. The performance of our implementation is summarized in Table 2. Although the number of GPUs used at each node increases from one to six, the increase in the total computation time is almost negligible, as most of the time for communication and data transfer is hidden by the asynchronous execution. The remaining difference is mainly due to GPU-CPU data transfer, as more data pass through the PCI-E bus when more GPUs run on the same node. We therefore believe that nodes integrating more GPUs, as in Mole-8.5, achieve a good balance between performance and economy for some applications with an efficient algorithm, considering the hardware cost and space occupation.

3.4 Application

Because of CUDA's interoperability with OpenGL, we couple the efficient GPU implementation of the LBM with a visualization framework developed by our group [20] to realize large-scale simulations. In this section, we conduct a direct numerical simulation of gas flowing up through 1166400 suspended solid particles under two-dimensional doubly periodic boundary conditions. The simulation domain is 11.5 cm × 46 cm, discretized by about one billion lattice cells. We simulate the gas-solid flow using 576 GPUs at 96 nodes with two-dimensional domain decomposition. In Figure 10, distinct regions of particle aggregation, which are called clusters in the chemical community, are reproduced. This large-scale simulation confirms that efficient multi-GPU parallel LBM simulation on a powerful GPU cluster is a promising tool for scientific and industrial modeling.

4 Conclusions and prospects

A hybrid parallel GPU implementation of LBM simulation was proposed. Asynchronous GPU execution was applied to achieve overlap between GPU-CPU data transfer and GPU computation, showing that a large portion of the time for GPU-CPU copies can be hidden. Data transfer between CPUs is realized with MPI; to hide this inter-CPU communication cost, non-blocking MPI was used to enable concurrent execution of GPU computing and MPI sending and receiving.
A shared-memory model, OpenMP, was applied to improve the performance of nodes integrating multiple GPUs. In our test cases, the time required for GPU-CPU data transfer and inter-CPU communication was reduced by up to about 70% for one-dimensional decomposition and 80% for two-dimensional decomposition. These results show that the hybrid multi-GPU LBM implementation is a feasible way to improve efficiency. Large-scale direct numerical simulation of an 11.5 cm × 46 cm two-dimensional doubly periodic gas-solid suspension was demonstrated by coupling the implementation with a visualization framework. The hybrid mode is easy to implement and can be extended to three-dimensional decomposition. Although our implementation is based on the LBM, other CFD methods, such as the finite difference and finite volume methods, can be incorporated into this hybrid mode, and we believe that they will also perform well.

Table 2 Time costs for GPU-CPU data transfer and CPU-CPU communication with a varying number of GPUs at each node in case E
Number of GPUs in a node   GPU computation (s)   GPU-CPU data transfer (s)   CPU-CPU communication (s)   Total (s)
1                          33.90231              0.4678                      0.6431                      35.01321
2                          33.90231              0.5307                      0.6431                      35.07611
3                          33.90231              0.5735                      0.6431                      35.11891
4                          33.90231              0.61142                     0.6431                      35.15683
6                          33.90231              0.63391                     0.6431                      35.17932

Figure 10 Large-scale direct numerical simulation of a two-dimensional gas-solid suspension containing more than one million particles [20].

This work was supported by the National Natural Science Foundation of China (20221603 and 20906091). We are grateful to Prof. Aibing Yu of the University of New South Wales for illuminating discussions. Two anonymous reviewers who gave valuable comments and suggestions that helped improve the quality of this article are gratefully acknowledged. Support from Nvidia through the CUDA Center of Excellence Program is also appreciated.

1 Kampolis I C, Trompoukis X S, Asouti V G, et al. CFD-based analysis and two-level aerodynamic optimization on graphics processing units. Comput Method Appl M, 2010, 199: 712-722
2 Wang J, Xu M, Ge W, et al. GPU accelerated direct numerical simulation with SIMPLE arithmetic for single-phase flow. Chin Sci Bull, 2010, 55: 1979-1986
3 Anderson J A, Lorenz C D, Travesset A. General purpose molecular dynamics simulations fully implemented on graphics processing units. J Comput Phys, 2008, 227: 5342-5359
4 Chen F, Ge W, Li J. Molecular dynamics simulation of complex multiphase flow on a computer cluster with GPUs. Sci China Ser B: Chem, 2009, 52: 372-380
5 Xiong Q, Li B, Chen F, et al. Direct numerical simulation of sub-grid structures in gas-solid flow: GPU implementation of macro-scale pseudo-particle modeling. Chem Eng Sci, 2010, 65: 5356-5365
6 McNamara G R, Zanetti G. Use of the Boltzmann equation to simulate lattice-gas automata. Phys Rev Lett, 1988, 61: 2332-2335
7 Tolke J, Krafczyk M. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. Int J Comput Fluid D, 2008, 22: 443-456
8 Ge W, Chen F, Meng F, et al. Multi-scale Discrete Simulation Parallel Computing Based on GPU (in Chinese). Beijing: Science Press, 2009
9 Bernaschi M, Fatica M, Melchionna S, et al. A flexible high-performance lattice Boltzmann GPU code for the simulations of fluid flows in complex geometries. Concurr Comp-Pract E, 2010, 22: 1-14
10 Kuznik F, Obrecht C, Rusaouen G, et al. LBM based flow simulation using GPU computing processor. Comput Math Appl, 2010, 59: 2380-2392
11 Li B, Li X, Zhang Y, et al. Lattice Boltzmann simulation on Nvidia and AMD GPUs (in Chinese). Chin Sci Bull (Chin Ver), 2009, 54: 3177-3184

12 Myre J, Walsh S, Lilja D, et al. Performance analysis of single-phase, multiphase, and multicomponent lattice-Boltzmann fluid flow simulations on GPU clusters. Concurr Comp-Pract E, 2010, 23: 332-350
13 NVIDIA. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 3.1, 2010
14 Qian Y, d'Humieres D, Lallemand P. Lattice BGK models for Navier-Stokes equation. Europhys Lett, 1992, 17: 479-484
15 He N, Wang N, Shi B. A unified incompressible lattice BGK model and its application to three-dimensional lid-driven cavity flow. Chin Phys, 2004, 13: 40-46
16 Obrecht C, Kuznik F, Tourancheau B, et al. A new approach to the lattice Boltzmann method for graphics processing units. Comput Math Appl, 2011, 61: 3628-3638
17 Yang C, Huang C, Lin C. Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters. Comput Phys Commun, 2011, 182: 266-269
18 Mellanox. NVIDIA GPUDirect Technology: Accelerating GPU-based Systems. 2010
19 Komatitsch D, Erlebacher G, Goddeke D, et al. High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster. J Comput Phys, 2010, 229: 7692-7714
20 Ge W, Wang W, Yang N, et al. Meso-scale oriented simulation towards virtual process engineering (VPE): The EMMS paradigm. Chem Eng Sci, 2011, 66: 4426-4458

Open Access: This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.