
Computational Science and Engineering (Int. Master's Program)
Technische Universität München

Master's Thesis

MPI Parallelization of GPU-based Lattice Boltzmann Simulations

Author: Arash Bakhtiari
1st examiner: Prof. Dr. Hans-Joachim Bungartz
2nd examiner: Prof. Dr. Michael Bader
Assistant advisors: Dr. rer. nat. Philipp Neumann, Dipl.-Inf. Christoph Riesinger, Dipl.-Inf. Martin Schreiber
Thesis handed in on: October 7, 2013


I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references.

October 7, 2013
Arash Bakhtiari


Acknowledgments

I would like to express my gratitude to Prof. Dr. Hans-Joachim Bungartz for giving me the great opportunity to work on this project. I wish to thank Dr. rer. nat. Philipp Neumann for his scientific support. I would like to express my great appreciation to my advisors, Christoph Riesinger and Martin Schreiber, for their ongoing support of my work, for helpful discussions and for their encouragement throughout this time.


Contents

Acknowledgments
1. Introduction
2. Lattice Boltzmann Method
   2.1. Boltzmann Equation
   2.2. Lattice Boltzmann Method
   2.3. Lattice Boltzmann and HPC
3. Turbulent LBM
   3.1. Turbulence Modeling
   3.2. Overview of Simulation Approaches
   3.3. BGK-Smagorinsky Model
4. GPU Architecture
   4.1. CPU vs. GPU
   4.2. OpenCL
        Platform Model
        Execution Model
        Memory Model
        Programming Model
        Advanced OpenCL Event Model Usage
5. Single-GPU LBM
   5.1. OpenCL Implementation
        Memory Layout
        Data Storage
6. Multi-GPU LBM
   6.1. Parallelization Models
        Domain Decomposition
        Ghost Layer Synchronization
        CPU/GPU Communication
   6.2. Software Design
        Modules
   6.3. Basic Implementation
   6.4. Overlapping Work and Communication
        SBK-SCQ Method
        MBK-SCQ Method
        MBK-MCQ Method
   Validation
        Validation Setup
        Multi-GPU Validation
   Performance Optimization
        K1DD Approach
        K19DD Approach
   Performance Evaluation
        Simulation Platform and Setup
        Weak Scaling
        Strong Scaling
7. Multi-GPU Turbulent LBM
   OpenCL Implementation
8. Conclusion
Appendix
A. Configuration File Example
Bibliography

1. Introduction

Fluid flow plays an important role in our lives; examples range from the blood flow in our bodies to the air flow around a space shuttle. Therefore, the field of computational fluid dynamics (CFD) has always been an area of interest for numerical simulations. Most CFD simulations require the modeling of complex physical effects via numerical algorithms with high computational demands. Hence, these simulations have to be executed on supercomputers. The lattice Boltzmann method (LBM) is a popular class of CFD methods for fluid simulation which is suitable for massively parallel simulations because of its local memory access pattern. Therefore, it is the method of choice in this thesis. Nowadays, GPGPU (general-purpose graphics processing unit) computing is becoming more and more popular in the High Performance Computing (HPC) field, due to the massively parallel compute power that GPUs provide at relatively low hardware cost. Since the LBM exposes a high degree of parallelism and has minimal dependencies between data elements, it is a perfect candidate for implementation on GPU platforms. Due to the restricted memory available on GPUs, performing simulations with high memory demands requires utilising multiple GPUs.

The major contribution of this thesis is the design and implementation of several methods for an efficient MPI-parallelized LBM on a GPU cluster. The primary goal is to exploit advanced features of modern GPUs in order to obtain an efficient and scalable massively parallel Multi-GPU LBM code. To achieve this goal, several sophisticated overlapping techniques are designed and implemented. In addition, software design has been an important aspect throughout this thesis; the development of extensible and maintainable software was therefore a high priority. Furthermore, the implementation of these methods is optimized to achieve better overall performance of the simulation software. To demonstrate that the primary goal has been met, the performance of each method is evaluated. In addition, the GPU code is extended to simulate turbulent flows. The Large Eddy Simulation (LES) approach with the Smagorinsky subgrid-scale (SGS) model was adopted for the turbulence simulation.

The thesis starts with an introduction to and discussion of the theory of the LBM. In chapter three, turbulence and the equations required to implement the LES with the Smagorinsky model are given. The next chapter is devoted to the GPU architecture and an introduction to the OpenCL programming model and its advanced features. The Single-GPU implementation and its memory layout are presented in chapter five. Chapter six covers the parallelization models, the software design and the advanced overlapping techniques developed in this thesis. At the end of that chapter, the validation strategy, performance optimization and evaluation of the Multi-GPU code are given. In chapter seven, the extension of the Multi-GPU code to turbulent fluid simulations is discussed.


2. Lattice Boltzmann Method

The Lattice Boltzmann Method (LBM) is a well-known approach for fluid simulation. The method has its origin in a molecular description of a fluid. Since most of the calculations in this method are performed locally, an optimized implementation normally achieves linear scalability in parallel computing. In addition, complex geometries and physical phenomena can easily be represented by this method. Depending on the underlying simulation requirements, the LBM can yield beneficial properties compared to Navier-Stokes-based methods. However, instead of considering incompressible flows, lattice Boltzmann schemes simulate weakly compressible flows. An introduction to the LBM and its history can be found in [20, 22, 19, 21] and the references therein.

2.1. Boltzmann Equation

The LBM is based on the Boltzmann equation:

    \frac{\partial f}{\partial t} + \vec{v} \cdot \nabla f = \Omega(f, f^{eq})    (2.1)

where f(\vec{x}, \vec{v}, t) denotes the probability density for finding fluid particles in an infinitesimal volume around \vec{x} at time t having the velocity \vec{v}. The right-hand side of Eq. 2.1 is called the collision operator and represents changes due to intermolecular collisions in the fluid. f^{eq} is the equilibrium distribution and is given by Eq. 2.2. It is also known as the Maxwell-Boltzmann distribution.

    f^{eq}(v) = \rho \left( \frac{m}{2\pi\kappa_b T} \right)^{d/2} e^{-\frac{m(v-u)^2}{2\kappa_b T}}    (2.2)

This equation describes the density of particles which have a specific velocity v within an area with bulk velocity u in an infinitesimally small volume. Here, \kappa_b is the Boltzmann constant, d the number of dimensions, m the mass of a single particle, T the temperature and \rho the mass density.

Solving Eq. 2.1 analytically is very challenging and can only be done for special cases. Prabhu L. Bhatnagar, Eugene P. Gross and Max Krook noticed that the main effect of the collision term is to bring the velocity distribution function closer to the equilibrium distribution. Based on this observation, they proposed the BGK approximation. The collision operator in the Boltzmann equation can be approximated by the BGK model:

    \Omega(f, f^{eq}) \approx -\frac{1}{\tau}(f - f^{eq})    (2.3)

where \tau is the relaxation time of the system, i.e. a characteristic time for the relaxation process of f towards equilibrium.

Equations 2.1 and 2.3 provide information on the flow using statistical methods, while the Navier-Stokes equations are based on a set of continuity equations. However, the quantities known from the Navier-Stokes equations such as the velocity \vec{u} or the density \rho can be computed at a certain point \vec{x} by integrating the probability density f over the velocity space:

    \rho(\vec{x}, t) = \int_{\mathbb{R}^D} f \, dv_1 \ldots dv_D    (2.4)

    \rho(\vec{x}, t)\, \vec{u}(\vec{x}, t) = \int_{\mathbb{R}^D} f\, \vec{v} \, dv_1 \ldots dv_D    (2.5)

2.2. Lattice Boltzmann Method

The LBM originated from the lattice gas automata (LGA) method. LGA is a simplified molecular dynamics model in which quantities like space, time and particle velocities are discrete. In this model, each lattice node is connected to its neighbors by six lattice velocities. In each time interval, as demonstrated in Fig. 2.1, particles in each node move to the neighboring nodes in the direction of one of the six lattice velocities.

Figure 2.1.: LGA model lattice vectors. From [2].

When particles from different directions arrive at the same node, they collide according to certain collision rules and, as a result, change their velocities. More in-depth information regarding the LGA method is provided in [12].

The transition from the LGA model to the LBM is accomplished by switching from a quantitative description to a probabilistic description of the particles in phase space. The LBM simulates fluid flow by tracking the particle distribution function. Accomplishing this in a continuous phase space is impossible; therefore, the LBM tracks the particle distribution along a limited number of directions. The particle collisions are computed based on the density values of the cells with a so-called collision operator. The movement of the particles within one timestep is then simulated by propagating the density values in the direction of the corresponding velocity vector to an adjacent cell. This is called the streaming step.

LBM models can operate on different discretization schemes, which determine the dimension and the adjacent cells for data exchange along the lattice vectors. To classify the different methods, the DdQq notation is used, where d is the number of dimensions and q is the number of density distributions per cell. E.g., D3Q19 is a 3D model with 19 density distributions (Fig. 2.2), which is the model used in this work.

Figure 2.2.: Lattice vectors for the D3Q19 model: dd0-dd3, dd16, dd17 are distributions with lattice speed 1, dd4-dd15 are distributions with lattice speed sqrt(2), and dd18 represents the particles at rest. From [17].

Based on this LBM model, the discrete representation of the lattice Boltzmann update is

    f_i(\vec{x} + \vec{c}_i\, dt, t + dt) = f_i(\vec{x}, t) - \frac{1}{\tau}\left(f_i(\vec{x}, t) - f_i^{eq}\right)    (2.6)

where f_i^{eq} is the discretized equilibrium distribution function

    f_i^{eq}(\vec{u}) = \omega_i \rho \left[ 1 + 3(\vec{e}_i \cdot \vec{u}) + \frac{9}{2}(\vec{e}_i \cdot \vec{u})^2 - \frac{3}{2}\vec{u}^2 \right]    (2.7)

with \omega_i = 1/18 for i \in [0; 3] \cup \{16, 17\}, \omega_i = 1/36 for i \in [4; 15] and \omega_i = 1/3 for i = 18. In order to compute the density and velocity of each cell, the equations 2.4 and 2.5 can be rewritten as

    \rho = \sum_{i=0}^{18} f_i, \qquad \rho\, \vec{v} = \sum_{i=0}^{18} f_i\, \vec{e}_i    (2.8)

2.3. Lattice Boltzmann and HPC

Since the discrete probability distribution functions used in the lattice Boltzmann model require more memory for their storage than the variables of the Navier-Stokes equations, this method might at first sight seem quite resource-consuming, but on modern computers this is rarely an issue. Since the collision operation of each cell can be computed independently and the data dependencies are simple, the lattice Boltzmann method is particularly well suited for computations on parallel architectures. This advantage also applies to other types of high-performance hardware like general-purpose graphics processing units (GPGPUs), which are investigated in particular in this thesis.
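As a concrete illustration of Eqs. 2.7 and 2.8, the following minimal C sketch computes the macroscopic moments and the discrete equilibrium distributions for a single D3Q19 cell. It is not taken from the thesis code; the lattice vectors and weights are passed in as parameters so that no particular ordering of the 19 directions is assumed.

#include <stddef.h>

/* Moments and equilibrium for one D3Q19 cell (Eqs. 2.7 and 2.8).
 * f   : 19 density distribution values of the cell
 * e   : 19 lattice velocity vectors (ordering as in Fig. 2.2)
 * w   : 19 lattice weights (1/18, 1/36 or 1/3, cf. Eq. 2.7)
 * feq : output, 19 equilibrium values                                  */
static void d3q19_equilibrium(const float f[19], const float e[19][3],
                              const float w[19], float feq[19])
{
    float rho = 0.0f, ux = 0.0f, uy = 0.0f, uz = 0.0f;

    /* zeroth and first moments, Eq. 2.8 */
    for (size_t i = 0; i < 19; ++i) {
        rho += f[i];
        ux  += f[i] * e[i][0];
        uy  += f[i] * e[i][1];
        uz  += f[i] * e[i][2];
    }
    ux /= rho;  uy /= rho;  uz /= rho;

    const float u_sqr = ux * ux + uy * uy + uz * uz;

    /* discrete equilibrium, Eq. 2.7 */
    for (size_t i = 0; i < 19; ++i) {
        const float eu = e[i][0] * ux + e[i][1] * uy + e[i][2] * uz;
        feq[i] = w[i] * rho * (1.0f + 3.0f * eu + 4.5f * eu * eu - 1.5f * u_sqr);
    }
}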


3. Turbulent LBM

Turbulent flow is characterised by irregular and stochastic changes in fluid properties like pressure and velocity. Many complex flows, ranging from the smoke rising from a cigarette, which after some time shows a completely disordered structure in the air, to an exhausting jet, show chaotic and irregular flow disturbances. In contrast to laminar flows, turbulent flows exhibit a wide range of length scales. In order to measure the strength of the turbulence in a flow, the Reynolds number

    Re = \frac{v L}{\nu}    (3.1)

is introduced, where v is the fluid velocity, \nu the fluid viscosity and L the characteristic length scale. Turbulent flows typically have a Reynolds number above 5000, while flows with a Reynolds number below 1500 are typically laminar. Table 3.1 provides a comparison of the characteristic properties of laminar and turbulent flows.

Laminar Flows                              | Turbulent Flows
Highly orderly flow                        | Chaotic fluid flow
No stochastic irregularity                 | Irregular in space and time
Stable against interference from outside   | Unsteady
Occur at low Reynolds numbers              | Occur only at high Reynolds numbers

Table 3.1.: Comparison of characteristic properties of turbulent and laminar flows. From [4]

In the next section, the modeling of turbulent fluids is investigated.

3.1. Turbulence Modeling

This section is intended to give a brief introduction to three established turbulence models, namely Direct Numerical Simulation (DNS), Reynolds-averaged Navier-Stokes (RANS) and Large-Eddy Simulation (LES), for a better understanding of the implementation, without going into the mathematical and physical details of these models. At the end, the BGK-Smagorinsky model, which is implemented in this thesis, is investigated.

3.2. Overview of Simulation Approaches

An overview of turbulence models is provided in this section. Interested readers can find more in-depth material regarding turbulence models in [4, 11].

Direct Numerical Simulation (DNS): This model is based on solving the three-dimensional, unsteady Navier-Stokes equations. The DNS model is the most accurate method to simulate turbulent flows, since in this model the whole range of spatial and temporal scales of the turbulence must be resolved. Therefore, the computational cost of DNS is very high, even at low Reynolds numbers. The only error source of this model is the approximation error of the numerical method, which can be reduced by choosing an appropriate numerical method. For most industrial applications, the computational resources required by a DNS would exceed the capacity of the most powerful computers currently available.

Reynolds-Averaged Navier-Stokes (RANS): This model is based on the statistical observation of a turbulent fluid. The instantaneous values of a turbulent flow field are divided into two parts: a temporal (ensemble) average and a fluctuating part. By inserting these values into the Navier-Stokes equations and applying a temporal average, the Reynolds-averaged Navier-Stokes (RANS) equations are obtained.

Large-Eddy Simulation (LES): This model was initially proposed in 1963 by Joseph Smagorinsky. The basic idea behind the LES model is to apply a low-pass filter to the Navier-Stokes equations in order to eliminate the small scales of the solution. This leads to transformed equations for a filtered velocity field. In order to reduce the computational cost, the LES model resolves the large scales of the solution and models the smaller scales. This makes the LES model suitable for industrial simulations with complex geometries.

In this thesis, the BGK-Smagorinsky model, which is an LES model, is implemented. The next section is devoted to this model.

3.3. BGK-Smagorinsky Model

As described in the previous section, the LES model is based on applying a low-pass filter to the Navier-Stokes equations. In order to understand the concept of LES simulations, we investigate the characteristic properties of turbulent fluids. A rough comparison of the characteristic properties of large and small scales in turbulent flows is given in Table 3.2.

Large Scales                          | Small Scales
Produced from average fluid flows     | Originate in large-scale movements
Demonstrate coherent structures       | Chaotic, stochastic
Last longer and energy-rich           | Last shorter and low-energy
Hard to model                         | Easier to model

Table 3.2.: Comparison of characteristic properties of large and small scales in turbulent flows. From [4]

In the RANS model, both large and small scales are modeled with an approximate statistical model. In contrast, in the DNS model, both scales are computed through the direct solution of the Navier-Stokes equations. The LES model can be considered a compromise between the RANS and DNS methods: the large-scale values are obtained by direct solution of the Navier-Stokes equations and the small scales are approximated with a model. Interested readers can find more detailed information about these methods in [11].

The following discrete lattice Boltzmann equation is obtained by applying the filter (see [9]):

    \bar{f}_\alpha(\vec{x} + \vec{e}_\alpha \delta t, t + \delta t) = \bar{f}_\alpha(\vec{x}, t) - \frac{1}{\tau_{total}}\left[\bar{f}_\alpha(\vec{x}, t) - \bar{f}_\alpha^{(eq)}(\vec{x}, t)\right] + \left(1 - \frac{1}{2\tau_{total}}\right)\bar{S}_\alpha(\vec{x}, t)\,\delta t    (3.2)

where overbars indicate filtered values. The major difference between this equation and the original lattice Boltzmann equation (see Eq. 2.6) is that the density distribution functions are replaced with filtered values. The relaxation time is replaced with the total relaxation time \tau_{total} = \tau + \tau_t. Consequently, the total viscosity is defined as follows:

    \nu_{total} = \nu + \nu_t = \frac{1}{3}\left(\tau_{total} - \frac{1}{2}\right) c^2 \delta t = \frac{1}{3}\left(\tau + \tau_t - \frac{1}{2}\right) c^2 \delta t    (3.3)

where the turbulent viscosity \nu_t is defined in Eq. 3.4 and \tau_t is the turbulent relaxation time.

    \nu_t = \frac{1}{3} \tau_t c^2 \delta t    (3.4)

Smagorinsky Model

In the eddy viscosity subgrid-scale (SGS) model, the turbulent stress is given by Eq. 3.5:

    \tau^t_{ij} - \frac{1}{3}\delta_{ij}\tau^t_{kk} = -2\nu_t \bar{S}_{ij}    (3.5)

In addition, the turbulent eddy viscosity is computed by the following formula:

    \nu_t = (C_s)^2 |\bar{S}|    (3.6)

where C_s is the Smagorinsky constant. In this thesis, C_s is set to 0.1. Eq. 3.7 defines the filtered strain rate:

    |\bar{S}| = \sqrt{2\bar{S}_{ij}\bar{S}_{ij}}    (3.7)

where the filtered strain rate tensor is

    \bar{S}_{ij} = \frac{1}{2}\left(\frac{\partial \bar{u}_i}{\partial x_j} + \frac{\partial \bar{u}_j}{\partial x_i}\right)    (3.8)

The computation of the filtered strain rate tensor as formulated in Eq. 3.8 requires finite difference calculations. In the LBM, this can be avoided by using the second moment of the filtered non-equilibrium density distribution functions. In our case, the filtered non-equilibrium flux tensor is applied:

    \bar{S}_{ij} = \frac{1}{2\,\delta t\, \tau_{total}\, \rho\, c_s^2}\, \bar{Q}_{ij}    (3.9)

with

    \bar{Q}_{ij} \equiv \bar{\Pi}^{(neq)}_{ij} - \left(\bar{u}_i F_j + \bar{u}_j F_i\right)    (3.10)

where \bar{\Pi}^{(neq)}_{ij} is the filtered non-equilibrium momentum flux tensor and can be computed as follows:

    \bar{\Pi}^{(neq)}_{ij} = \sum_\alpha e_{\alpha i}\, e_{\alpha j} \left(\bar{f}_\alpha - \bar{f}_\alpha^{(eq)}\right)    (3.11)

Finally, the total relaxation time can be computed directly with the following formula:

    \tau_{total} = \tau + \tau_t = \frac{1}{2}\left(\tau + \sqrt{\tau^2 + 18\sqrt{2}\,\frac{[C_s]^2}{\rho\,\delta t^2\, c^4}\,\sqrt{\bar{Q}_{ij}\bar{Q}_{ij}}}\right)    (3.12)

In chapter 7, this method is implemented on a Multi-GPU platform.
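To make the local subgrid-scale update concrete, the following minimal sketch combines Eqs. 3.4 and 3.6: given a precomputed filtered strain-rate magnitude (obtained e.g. via Eq. 3.9 or by finite differences as in Eq. 3.8), it returns the total relaxation time. It is only an illustration under these assumptions, not the GPU kernel code of this thesis, and the function and parameter names are chosen freely.

/* Smagorinsky relaxation-time update (Eqs. 3.4 and 3.6).
 * s_mag   : filtered strain-rate magnitude |S| (assumed to be precomputed)
 * tau     : molecular (laminar) relaxation time
 * c, dt   : lattice speed and time step
 * c_smago : Smagorinsky constant, 0.1 in this thesis
 * returns : total relaxation time tau_total = tau + tau_t                */
static double total_relaxation_time(double s_mag, double tau,
                                    double c, double dt, double c_smago)
{
    /* turbulent eddy viscosity, Eq. 3.6 */
    double nu_t = c_smago * c_smago * s_mag;

    /* turbulent relaxation time from Eq. 3.4: nu_t = (1/3) tau_t c^2 dt */
    double tau_t = 3.0 * nu_t / (c * c * dt);

    return tau + tau_t;
}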

4. GPU Architecture

Programmable GPUs (graphics processing units) have advanced into highly parallel processors with tremendous computational power. Since GPUs are specialized for compute-intensive, highly parallel computation, they are designed such that more transistors are devoted to data processing rather than to data caching and flow control. As a result, the floating-point capability of GPUs has attracted a lot of attention in the scientific computing community. Especially for problems where the same program is executed on many data elements in parallel (data-parallel computation) or problems with high memory requirements, e.g. the LBM, it has been shown that a great performance boost can be achieved by using GPU architectures [15, 18]. In the next section, the CPU and GPU architectures are compared in more depth.

4.1. CPU vs. GPU

CPUs are optimized for sequential code performance. Therefore, they use more sophisticated control logic and provide large cache memories, but neither control logic nor cache memories contribute to the peak calculation speed. GPUs were originally designed as accelerators for rendering and graphics computations, e.g. for computer games. In most graphical tasks, such as visualization, the same operations are executed for all data elements. As a result, the design of GPUs is optimized for data-parallel throughput computations. Hence, on GPUs much more chip area is dedicated to floating-point calculations. Compared to CPUs, GPUs offer a higher throughput of floating-point operations due to this specialization. The schematic layout of CPUs and GPUs is illustrated in Fig. 4.1. In GPUs, smaller flow control units and cache memories are provided to help control the bandwidth requirements. Algorithms with heavy flow control are therefore a worse fit for GPUs than for CPUs, due to the smaller flow control units provided by their architecture.

Figure 4.1.: The GPU devotes more transistors to data processing. From [5].

The main advantages of GPUs are the massive parallelism and wide memory bandwidth that their architecture provides. This is achieved by providing thousands of smaller, more efficient cores designed for parallel performance, whereas multicore CPUs consist of a few cores optimized for serial processing. Most graphics processing units (GPUs) are designed as single instruction, multiple data (SIMD) architectures. SIMD computers have multiple processing elements that perform the same operation on multiple data points simultaneously. Multicore CPUs are optimized for coarse, heavyweight threads which provide better performance per thread, while GPUs create fine, lightweight threads with relatively poor single-thread performance. For instance, the NVIDIA Tesla M2090 provides 512 processor cores with a 1.3 GHz processor core clock. Each NVIDIA Tesla M2090 has a main memory, or so-called global memory, of 6 GB. In Table 4.1 the specifications of the NVIDIA Tesla M2090 and the older model M2050 are given.

NVIDIA Tesla              | M2050 | M2090
Compute Capability        | 2.0   | 2.0
Code Name                 | Fermi | Fermi
CUDA Cores                | 448   | 512
Main Memory [GB]          | 3     | 6
Memory Bandwidth [GB/s]   | 148   | 177
Peak DP FLOPS [GFLOPS]    | 515   | 665

Table 4.1.: Comparison of NVIDIA Tesla M2050 and M2090. From [5].

In Table 4.2 a comparison of the NVIDIA Tesla M2090 and the Intel Core i7-3770K is provided.

                          | Intel Core i7-3770K | NVIDIA Tesla M2090
Cores                     | 4                   | 512
Main Memory [GB]          | max. 32             | 6
Memory Bandwidth [GB/s]   | 25.6                | 177
Peak DP FLOPS [GFLOPS]    | 112                 | 665

Table 4.2.: Comparison of NVIDIA Tesla M2090 and Intel Core i7-3770K.

Programming models like CUDA and OpenCL make GPUs accessible for general computation like CPUs. OpenCL is currently the dominant open general-purpose GPU computing language. In the following, a brief introduction to OpenCL is given to the extent required for the understanding of this thesis.

4.2. OpenCL

In older graphics cards, the computing elements were specialized to process independent vertices and fragments. With the introduction of direct GPGPU programming interfaces like CUDA and OpenCL, this specialization was removed, allowing programs written in a C-like language to run without using the 3D graphics API.

OpenCL is an open industry standard for programming a heterogeneous collection of CPUs and GPUs organized into a single platform. OpenCL includes a language, an API, libraries and a runtime system to support software development. OpenCL is the first industry standard that directly addresses the needs of heterogeneous computations. It was first released in December 2008, and the early products became available in the fall of 2009. With OpenCL, one can write a single program that can run on a wide range of systems, from cell phones to nodes in massive supercomputers. This is one of the reasons why OpenCL is so important. At the same time, this is also the source of much of the criticism launched at OpenCL. OpenCL is based on four models:

- Platform Model
- Memory Model
- Execution Model
- Programming Model

In the following sections, each of these models is explained in more detail. The content of the following sections is based on the OpenCL specification [10]. A more in-depth explanation of the OpenCL standard can also be found in [6, 16, 13].

Platform Model

The platform model is illustrated in Fig. 4.2. The model consists of a single host and one or more OpenCL devices that are connected to the host. The host performs tasks like I/O or the interaction with a program's user. An OpenCL device can be a CPU, a GPU, a digital signal processor (DSP) or another type of processor. An OpenCL device consists of one or more compute units (CUs). Each CU provides one or more processing elements (PEs) which perform the actual computations on the device.

Figure 4.2.: OpenCL platform model. From [10].
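As a brief illustration of these platform-model objects, the following sketch queries a platform, selects a GPU device and creates a context and a command-queue for it. It is a minimal example and not the initialization code of this thesis (which is described in chapter 6); error handling is omitted.

#include <stdio.h>
#include <CL/cl.h>

/* Query the first platform, pick a GPU device on it, print its name and
 * create a context plus an in-order command-queue.                        */
int init_opencl(cl_context *ctx, cl_command_queue *queue)
{
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("using OpenCL device: %s\n", name);

    *ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    *queue = clCreateCommandQueue(*ctx, device, 0, NULL);
    return 0;
}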

Execution Model

An OpenCL application consists of two parts: the host program and one or more kernels. The host program runs on the host. It creates the context for the kernels and manages their execution. The kernels execute on the OpenCL devices. Kernels are simple functions that perform the real work of an OpenCL application. These functions are written in the OpenCL C programming language and compiled with the OpenCL compiler for the target device. The kernels are defined on the host. In order to submit the kernels for execution on a device, the host program issues a command. Afterwards, an integer index space is instantiated by the OpenCL runtime system. For each element in this index space, an instance of the kernel is created and launched. Each instance is called a work-item. In order to identify the work-items, their coordinates in the index space are used. These coordinates are called global IDs and are unique for each work-item. While each work-item executes the same instructions, the data processed by each work-item can vary through the use of the global ID. Work-items are organized into work-groups. Work-groups are assigned a unique ID with the same dimensionality as the index space used for the work-items. Work-items are assigned a unique local ID within their work-group. As a result, a single work-item can be uniquely identified in two ways: (1) by its global ID, or (2) by a combination of its local ID and its work-group ID.

Memory Model

The OpenCL standard offers two types of memory objects: buffer objects and image objects. A buffer object is a contiguous block of memory which is available to the kernels. A programmer can initialise the buffers with any type of information from the host and access the buffers through pointers. Image objects are restricted to holding images. A summary of the OpenCL memory model and its interaction with the platform model is depicted in Fig. 4.3.

Figure 4.3.: OpenCL memory model. From [13].

OpenCL defines four distinct memory regions that executing work-items have access to:

- Global Memory: All work-items in all work-groups have read/write access to this memory region. Depending on the capabilities of the device, reads and writes to this memory region may be cached [10].

- Constant Memory: This memory remains constant during the execution of a kernel. Constant memory is allocated in the global memory region. The initialization of constant memory is done by the host [10].

- Local Memory: This memory region is local to a work-group. It can be used to allocate variables that are shared by all work-items of that work-group. Depending on the device, local memory can be mapped onto the global memory or it can have its own dedicated memory [10].

- Private Memory: Variables defined in one work-item's private memory are not visible to another work-item [10].

Programming Model

This section describes how a programmer can map parallel algorithms onto OpenCL using a programming model. OpenCL supports two different programming models: task parallelism and data parallelism, as well as hybrids of these two models. The primary model behind the design of OpenCL is data parallelism.

Data Parallelism

The data-parallel programming model is the main idea behind OpenCL's execution model. In this parallel programming model, the same sequence of instructions is applied to multiple elements of a memory object. Normally, access to the memory object is accomplished via the index space associated with the OpenCL execution model. The programmer needs to align the data structures of the problem with the index space in order to access the correct data in the memory object.

Task Parallelism

Although OpenCL is designed for data parallelism, it can also be used to achieve task parallelism. In this case, a single instance of a kernel is executed independently of any index space. Since task parallelism is not used in this thesis, the details are not described here.

Advanced OpenCL Event Model Usage

OpenCL Usage Models

In this section, various usage models of the OpenCL standard are explained. In order to perform an operation on OpenCL objects like memory, kernel and program objects, a command-queue is used. In addition, command-queues are used to submit work to a device. The commands queued in a command-queue can execute in-order or out-of-order. Having multiple command-queues allows an application to queue multiple independent commands. Note that sharing objects across multiple command-queues or using an out-of-order command-queue requires explicit synchronisation of the commands. Based on these choices, various OpenCL usage models can be defined [8]. In the following, a few of these models are described:

- Single Device In-Order Usage Model (SDIO): This usage model is composed of a simple in-order queue. All commands execute on a single device and all memory operations occur in a single memory pool.

- Single Device Out-of-Order Usage Model (SDOO): This model is the same as the SDIO model, but with an out-of-order queue. As a result, there are no guarantees on the execution order and the device starts executing commands as soon as possible. It is the responsibility of the developer to ensure program correctness by analyzing the command dependencies.

- Single Device Multi-Command Queue Usage Model (SDMC): In this model, multiple command-queues are employed to queue commands to a single device. The model can be applied in order to overlap the execution of different commands or to overlap commands and host/device communication. Depending on the GPU's compute capacity, the SDMC model makes it possible to launch several kernels concurrently on the same device.

Depending on the parallelization algorithm, each OpenCL usage model can lead to different performance and scalability results. Some of these models are applied when implementing the more sophisticated overlapping techniques.

Synchronization Mechanisms

In order to ensure that changes to the state of a shared object (such as a command-queue object or a memory object) occur in the correct order (for instance, when multiple command-queues in multiple threads are making changes to the state of a shared object), the application needs to implement appropriate synchronisation across the threads on the host processor by constructing a task graph. The OpenCL event model provides the ability to construct complicated task graphs for the tasks enqueued in any of the command-queues associated with an OpenCL context. In addition, OpenCL events can be used to interact with functions on the host through the callback mechanism defined in OpenCL 1.1. These OpenCL features and their application in implementing the Multi-GPU LBM simulation with overlapping techniques are described in the following sections.

Events

An event is an object that can be used to determine the status of commands in OpenCL. Events can be generated by commands in a command-queue, and other commands can use these events to synchronise themselves. Hence, event objects can be used as synchronisation points. All clEnqueue*() functions can return event objects. An event can be passed as the final argument to the enqueue functions. In addition, a list of events can be passed to an enqueue function to specify a dependence list. Based on this dependence list, which in OpenCL terminology is called the event wait list, the command will not start its execution until all events in the list have completed. The following code demonstrates an example of OpenCL event-based synchronization:

Listing 4.1: OpenCL event based synchronization.

cl_uint num_events_in_waitlist = 2;
cl_event event_waitlist[2];

/* size and the host pointers denote the transfer size and destinations */
err = clEnqueueReadBuffer(queue, buffer0, CL_FALSE /* non-blocking */, 0,
                          size, host_ptr0, 0, NULL, &event_waitlist[0]);
err = clEnqueueReadBuffer(queue, buffer1, CL_FALSE /* non-blocking */, 0,
                          size, host_ptr1, 0, NULL, &event_waitlist[1]);

/* the last read buffer waits on the previous two read-buffer events */
err = clEnqueueReadBuffer(queue, buffer2, CL_FALSE /* non-blocking */, 0,
                          size, host_ptr2, num_events_in_waitlist,
                          event_waitlist, NULL);

User Events

Events can also be used to synchronize the commands running within a command-queue and functions executing on the host. This can be done by creating so-called user events. A user event can be used in OpenCL enqueue functions like any other event; in this case, the execution status of the event is set explicitly. Creating a user event on the host is accomplished with the clCreateUserEvent function.

Callback Events

Callbacks are functions that are invoked asynchronously when the associated event reaches a specific state. A programmer can associate a callback with an arbitrary event. Using the OpenCL callback mechanism can be beneficial especially for applications in which the host CPU would otherwise have to wait while the device is executing, which can lead to worse system efficiency. In such cases, by attaching a callback to a host function, the CPU can do other work instead of spinning while waiting on the GPU. The clSetEventCallback function is used to register a callback for an event.

Using Events for Profiling

Performance analysis is a crucial part of any HPC programming effort. Hence, the mechanism to profile OpenCL programs uses events to collect the profiling data. To make this functionality available, the command-queue should be created with the flag CL_QUEUE_PROFILING_ENABLE. Then, by using the function clGetEventProfilingInfo(), the timing data can be extracted from event objects. A sample code of this process is shown in Listing 4.2.

Listing 4.2: Extracting profiling information with OpenCL events.

cl_event event;
cl_ulong start;     // start time in nanoseconds
cl_ulong end;       // end time in nanoseconds
cl_float duration;  // duration time in milliseconds

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
duration = (end - start) * 1.0e-6f;  // nanoseconds to milliseconds

CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END are flags used to return the value of the device time counter in nanoseconds for the start and the end of the command associated with the event, respectively.
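To complement the listings above, the following minimal sketch shows how a user event and a callback can be combined: a kernel launch is gated by a user event that the host completes later, and a callback is registered on a device-to-host transfer. It is only an illustration; the function and variable names (e.g. on_transfer_done) are not taken from the thesis code.

#include <stdio.h>
#include <CL/cl.h>

/* invoked asynchronously once the associated command reaches CL_COMPLETE */
static void CL_CALLBACK on_transfer_done(cl_event ev, cl_int status, void *user_data)
{
    printf("transfer finished: %s\n", (const char *)user_data);
}

void user_event_example(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                        cl_mem buffer, void *host_ptr, size_t size)
{
    /* user event: the kernel will not start until the host marks it complete */
    cl_event gate = clCreateUserEvent(ctx, NULL);

    size_t global = 1024;
    cl_event kernel_done, read_done;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           1, &gate, &kernel_done);

    /* read back the result once the kernel has finished */
    clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0, size, host_ptr,
                        1, &kernel_done, &read_done);
    clSetEventCallback(read_done, CL_COMPLETE, on_transfer_done,
                       (void *)"device buffer");

    /* ... other host-side work could be done here ... */

    clSetUserEventStatus(gate, CL_COMPLETE);  /* release the gated kernel */
    clFinish(queue);
}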


5. Single-GPU LBM

By providing advantages like massively parallel processing and wide memory bandwidth, GPU computing allows LBM simulations to achieve high performance. The fact that the computation of LBM cells can be performed independently makes this method well suited for parallelization on the GPU. This chapter is intended to give a brief introduction to the implementation of the Single-GPU LBM with the OpenCL API. For more in-depth information regarding the Single-GPU implementation, see [17]. The Multi-GPU implementation in chapter 6 uses this work as the initial framework.

5.1. OpenCL Implementation

This section describes the implementation of the LBM using the OpenCL standard. One of the most important design aspects of simulation software accelerated with GPUs is the memory access pattern used in the software. In the following sections, the memory layout used in the Single-GPU implementation for storing and accessing the simulation data, i.e. the density distribution values, is discussed.

Memory Layout

Since the global memory on GPUs is still restricted, designing an efficient and frugal memory layout for LBM simulations on GPUs plays a crucial role in achieving optimized software. Several memory layouts to store the density distributions have been developed by different research groups. In the following, two of these memory layouts, namely the A-B pattern and the A-A pattern, are discussed. For the Single-GPU LBM used in this thesis, the A-A pattern memory layout is used, which consumes the lowest amount of memory [3].

A-B Pattern

In this memory layout, after applying the collision operator, the density distribution values are stored in the same memory location. The propagation operator reads the density distribution value of the adjacent cell in the opposite direction of the current lattice vector and saves the value in the corresponding lattice vector direction of the current cell. The A-B memory layout is illustrated in Fig. 5.1.

Figure 5.1.: Memory layout of the A-B pattern, showing the density distributions of the model and the data storage of the implementation for the collision and propagation steps. From [17].

In order to avoid race conditions when reading from and writing to the same cells in a parallel implementation, an additional density distribution buffer is required. As a result, the A-B pattern doubles the memory consumption due to the additional buffer. This issue is addressed by the A-A memory layout.

A-A Pattern

The main design goal of the A-A pattern is to reduce the memory demand while maintaining almost the same performance. The A-A pattern achieves this goal by using two different kernels for odd and even time steps; in the following, these kernels are referred to as alpha and beta kernels (see [17]).

The alpha kernel reads the distribution values in the same way as the A-B pattern. After applying the collision operator, the new values are stored in the opposite lattice vector direction of the current cell. Hence, the alpha kernel does not change the values of other cells at all and only accesses the values of the current cell.

In contrast, the beta kernel reads and writes density distribution values only of adjacent cells. The values are read from the adjacent cells in the opposite direction of the current lattice vector. After applying the collision operator, the new values are stored in the adjacent cell in the direction of the current lattice vector. Figure 5.2 demonstrates the A-A memory access pattern.

Figure 5.2.: Memory layout of the A-A pattern, showing the combined collision and propagation of the beta and alpha steps for the density distributions of the model and the data storage of the implementation. From [17].

Together with the alpha kernel, this access rule implicitly implements the propagation operator.

Data Storage

The density distributions of each lattice vector are stored linearly in the global memory of the GPU. To achieve coalesced memory access, first all x components are stored, then the y and z components. The density distribution values of the next lattice vector are stored consecutively afterwards. A schematic illustration of this memory layout is given in Fig. 5.3.

Figure 5.3.: Density distribution memory layout: for each of the lattice vectors 0 to 18, the values of the cells (0,0,0), (0,0,1), (0,0,2), ..., (3,3,2), (3,3,3) are stored contiguously. From [17].

The velocity vectors of each cell are stored in the same way as the density distribution values. In addition, the density values and cell type flags are stored consecutively in separate buffers. More information about the memory layout of the Single-GPU LBM can be found in [17].
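The following minimal sketch shows an index computation that is consistent with the layout sketched in Fig. 5.3, assuming that the x component varies fastest within the block of one lattice vector; the function name and the ordering details are illustrative rather than taken from the thesis code.

#include <stddef.h>

/* Linear index of the density distribution value of lattice vector i for the
 * cell (x, y, z) in a structure-of-arrays layout: all cells of distribution 0
 * first (x varying fastest), then all cells of distribution 1, and so on.
 * nx, ny, nz are the domain sizes in cells.                                  */
static size_t dd_index(size_t i, size_t x, size_t y, size_t z,
                       size_t nx, size_t ny, size_t nz)
{
    const size_t cells = nx * ny * nz;
    return i * cells + (z * ny + y) * nx + x;
}

With this layout, work-items that process neighboring cells along the x direction access consecutive memory addresses, which is what enables the coalesced memory accesses mentioned above.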


6. Multi-GPU LBM

The simulation of real-world scenarios is usually very compute-intensive. In addition, the main memory of one compute device is commonly not sufficient to meet the memory demands (e.g., 6 GB on an NVIDIA M2090 GPU). Using multiple GPUs efficiently for the LBM can help to fulfill the memory requirements and, as a result, makes it possible to run simulations with a higher number of unknowns (weak scaling). However, the use of multiple GPUs demands more sophisticated memory management, communication and synchronization techniques in order to avoid communication overhead in a distributed and even in a shared memory system.

To overcome the previously stated challenges in a Multi-GPU LBM simulation, sophisticated optimization techniques are required. In the following section, the parallelization paradigm used in this thesis to proceed from the Single-GPU to a Multi-GPU implementation is described. Additionally, the main components of the software, which are crucial for a good software design, are explained comprehensively. In order to achieve good performance and overcome the communication overhead, various techniques for overlapping computation and communication are implemented and their efficiency is investigated. At the end, the benchmark results of the software on the GPU MAC cluster are presented.

6.1. Parallelization Models

There are two parallelization models: shared memory and distributed memory parallelization. In a shared memory model, as the name implies, different processes share a global address space and asynchronously read and write data to it. In a distributed memory model, the processes exchange data by passing messages to each other in an asynchronous or synchronous way. The most common standard message-passing system is MPI (Message Passing Interface). For this thesis, the distributed memory model is adopted and the data exchange between GPUs is accomplished via MPI.

An essential aspect of any parallelization paradigm is problem decomposition. The two types of problem decomposition are Task Parallelism and Data Parallelism. Task Parallelism focuses on the processes of execution. In contrast, with Data Parallelism, a set of tasks operates on a data set, but independently on separate partitions. Since the same computations are applied to each domain cell in the LBM and the cells are completely independent of each other, the LBM is a perfect candidate for Data Parallelism. In the next section, these parallelization paradigms are explained specifically for LBM simulations.

Domain Decomposition

Domain decomposition is the most common way to parallelize LBM codes. The term domain decomposition means that the computational domain is divided into several smaller domain parts which are distributed to several computational units. Each domain partition is assigned to one MPI process and one GPU, which is responsible for the computation of that partition. As a result of data dependencies, the MPI processes need to exchange data with each other. The required data, based on the LBM data dependencies, is the outer layer of the process-local simulation domain. Therefore, each MPI process extracts the data that is required by other processes and sends it to the receiving process. The received data is saved in the so-called ghost layer subregion of the corresponding simulation data.

Ghost Layer Synchronization

Normally, the ghost layer data is only used during the local computation of the subregion and can safely be overwritten by new data in the next simulation step. As described in section 5.1, the α-kernel only accesses the values of the current cell for collision and propagation. The general approach of ghost layer data exchange can therefore be applied to the data synchronization after an α time step. In this thesis, we call this α-synchronization.

Figure 6.1.: α-synchronization. In α-synchronization, all lattice vector values are exchanged.

In contrast to the α-synchronization, the β-kernel reads the density distribution values from the adjacent cells and, after performing the computation, writes the results to the adjacent cells in the direction of the lattice vector of the computed density distribution.

As a consequence, the GPU threads computing the cells neighboring the ghost layers write their computation results into the ghost layer data. This data is required for the next simulation step of the process that the ghost layer data originally comes from. Thus, the new data in the ghost layer has to be sent back to the original process. This procedure is demonstrated in Fig. 6.2. We call this procedure β-synchronization.

Figure 6.2.: β-synchronization. In the β-synchronization, only the red lattice vectors are exchanged.

CPU/GPU Communication

The simulation data of each subdomain is stored in the global memory of the corresponding GPU. Therefore, before performing MPI communication, the data has to be transferred to the host memory. The data is sent over the host-device bus system (e.g. PCI Express). In contrast to MPI communication between CPUs only, Multi-GPU communication requires an additional step to transfer data from the GPU memory to the host memory. This process is presented in Fig. 6.3.

Figure 6.3.: Activity diagram of MPI communication for a Multi-GPU simulation: GPU-CPU copy (PCI Express transfer), send via MPI (InfiniBand), receive via MPI (InfiniBand), CPU-GPU copy (PCI Express transfer).
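The following minimal sketch illustrates the three-step exchange of Fig. 6.3 for one ghost layer, combining OpenCL device/host transfers with a blocking MPI exchange. Buffer names and the use of float data are assumptions for the illustration and do not correspond to the actual Comm/Controller implementation described later.

#include <mpi.h>
#include <CL/cl.h>

/* Three-step ghost-layer exchange: device-to-host copy, MPI exchange with
 * the neighboring rank, host-to-device copy.  send_buf and recv_buf hold
 * one ghost layer of 'count' values.                                        */
void exchange_ghost_layer(cl_command_queue queue, cl_mem d_send, cl_mem d_recv,
                          float *send_buf, float *recv_buf, size_t count,
                          int neighbor_rank, MPI_Comm comm)
{
    /* (1) copy the boundary layer from GPU global memory to host memory */
    clEnqueueReadBuffer(queue, d_send, CL_TRUE, 0, count * sizeof(float),
                        send_buf, 0, NULL, NULL);

    /* (2) exchange the layers between the two MPI processes */
    MPI_Sendrecv(send_buf, (int)count, MPI_FLOAT, neighbor_rank, 0,
                 recv_buf, (int)count, MPI_FLOAT, neighbor_rank, 0,
                 comm, MPI_STATUS_IGNORE);

    /* (3) copy the received ghost layer back to GPU global memory */
    clEnqueueWriteBuffer(queue, d_recv, CL_TRUE, 0, count * sizeof(float),
                         recv_buf, 0, NULL, NULL);
}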

To get a better intuition of the Multi-GPU MPI communication process, a geometric overview of the data communication in a Multi-GPU LBM simulation is provided in Fig. 6.4.

Figure 6.4.: A geometric overview of Multi-GPU MPI communication. In this figure, the three forms of communication, namely GPU-GPU, GPU-CPU and MPI communication, are shown. In this thesis, local communication is not implemented. From [5].

In the case that a host manages more than one GPU, no MPI communication is required for the data exchange and the communication can be performed locally, since the GPUs can access each other's global memory directly, or indirectly through the host. However, this thesis focuses on MPI communication only.

6.2. Software Design

From the beginning of this thesis, the software design has been an important aspect. In the design process, the following aspects are considered:

- Extensibility: Adding new capabilities to the software can be accomplished effortlessly, without restructuring major parts of the software components and their interrelations.

- Modularity: The software comprises independent modules, which promotes better maintainability. In addition, the components can be tested in isolation before being integrated into the software.

- Maintainability: As a consequence of modularity and extensibility, bug localization is simplified.

- Efficiency: The software is optimized in many aspects. Data structures are chosen in a way that consumes less memory and provides the performance needed for massively parallelized large-scale simulation software.

- Scalability: Scalability is the major design goal of the software developed in this thesis. Sophisticated techniques like the overlapping of work and communication are implemented in order to achieve good scalability for large-scale simulations.

Apart from these HPC and design aspects, the software developed in this thesis also offers:

- Storing visualization data in VTK format (legacy format and XML binary format)
- Automatic profiling of different code modules with tools like Scalasca [7], Vampir and Valgrind [14]

In the following, a detailed overview of the available modules and their underlying design strategy is provided.

Modules

In this section, all the modules developed in this thesis and their interoperability are discussed. Figure 6.5 presents a general overview of the simulation flow. The figure was generated using Callgrind. In addition, the caller/callee relationships between the most essential functions of the simulation software are shown.

Figure 6.5.: Simulation callgraph. Each node in the graph represents a function, and each edge represents calls. The cost shown per function is the cost spent while that function is running.

In the following, a detailed overview of the software modules is provided:

Manager Class: The Manager class is responsible for the management of general tasks in the simulation process that are not specific to one particular subdomain, e.g. the domain decomposition and the assignment of subdomains to different processes. These tasks are normally carried out during the initialization phase and need to be done only once during the simulation.

The class features one template parameter, T, which determines the data type of the simulation data stored in memory. Using this template parameter makes it possible to run the simulation software on GPUs with single or double precision support. This template parameter is also used in most of the other software modules developed in this thesis.

Simulation Parameters: The Manager class stores the grid information such as the domain length and the number of lattice cells in each direction. Furthermore, the class also saves the number of subdomains for the domain decomposition, which is specified by the user via the configuration file or the command line arguments. The grid information and the number of subdomains are passed as constructor arguments when instantiating the Manager class. Using this information, the Manager instance computes the size of each subdomain and its location in the entire grid.

Domain Partitioning: The Manager class is also responsible for assigning the tasks (the computation of each subdomain) to the MPI processes. Distributing the workload across multiple processes can be performed using various strategies. An optimal strategy is the one that causes the least amount of communication between the MPI processes.

Partition Boundaries: In addition, based on the location of each subdomain in the simulation grid, the class determines appropriate boundary conditions for each subdomain. For instance, for subdomains that share a domain face with neighboring subdomains, the boundary condition for that face is set to FLAG_GHOST_LAYER. This flag specifies that the computation related to this face can only be accomplished after the corresponding ghost layer data has been fetched from the neighboring subdomain.

Communication: After detecting a ghost layer face, the Manager class initiates a Comm object. The Comm class contains the information required for the communication with the neighbors, like the size and origin of the data to send and receive. The Comm class is explained in more detail in the following sections.

Simulation Geometry: The Manager class is also responsible for setting the geometry of the simulation domain. This is achieved by assigning appropriate flags to each cell of the subdomain. FLAG_OBSTACLE and FLAG_FLUID are examples of currently available flags.

Controller: The Manager class, as previously stated, is responsible for the tasks related to the initialization phase of the simulation. Two strategies are available: in the first strategy, the initialization is done on the root process and the results are sent to the other processes via MPI send commands, while in the second strategy each process performs the initialization phase separately. Which strategy leads to better performance depends on the initialization/communication time ratio. In this thesis, since the initialization operations are not compute-intensive, each process performs the initialization of its own subdomain individually in order to avoid communication between the MPI processes. As a consequence, every MPI process instantiates its own Manager class, which carries out the initialization operations of the corresponding subdomain. In addition, every Manager instance aggregates a Controller instance which controls the simulation procedure of that subdomain.

Controller Class: The Controller class provides an interface to control the simulation steps on the corresponding subdomain. For every subdomain, one Controller instance is created. Each Controller instance has a unique ID which is used in the MPI communication as the identifier of the subdomain. Each Controller instance needs to communicate with the neighboring Controller instances in order to synchronize the ghost layer data. This task is accomplished by the syncAlpha and syncBeta functions, which send the alpha and beta ghost layer data, respectively. An overloaded version of these functions, with two additional arguments of types MPI_Request and MPI_Status, provides the ability to perform non-blocking MPI communication. The required MPI communication information is stored in the Comm instance assigned to the corresponding communication. For subdomains with more than one neighbor, the Comm instances are stored in a C++ std::map container with the position of the neighboring domain (the direction of the communication, a C++ enumeration) as the key of the map. This makes it possible to access the communication information of each direction independently. This feature is used later in order to synchronize the ghost layer data of each direction separately (see section 6.4).

Before sending the ghost layer data, the functions storeDataAlpha and storeDataBeta load the data into the send buffers. The setDataAlpha and setDataBeta functions place the data received from neighboring subdomains at the proper positions in the local simulation data of the current subdomain. These functions are overloaded twice. The overloads with one argument of type MPI_COMM_DIRECTION perform their task only in the direction specified by the argument. There is also another overloaded version of these functions which, in addition to the direction, provides the ability to exploit the event-based OpenCL synchronization features by adding three more OpenCL event arguments.

The strategy behind the different optimization methods is implemented in the simulationStepAlpha and simulationStepBeta functions. Basically, these functions encapsulate all operations needed to perform one alpha or beta simulation step, respectively. For example, which part of the computational domain is computed first and the order in which the computation and communication operations are performed can be encapsulated in these functions (see section 6.4). The Controller class performs the alpha and beta simulation steps by utilizing the lbm_solver attribute, which is of type LbmSolver. This class is explained in the following section. The function initLbmSolver, which is called in the constructor of the Controller class, queries the local GPUs and creates the OpenCL platforms, contexts and command-queues. It also chooses an appropriate available GPU device for performing the computations. In order to apply the modularity principle in the software design, the Controller class does not directly enqueue OpenCL kernels into the command-queues; the LbmSolver class is employed for this purpose. Therefore, the lbm_solver attribute is initialized in this function, and the OpenCL context and device values are passed as constructor arguments of the LbmSolver class when instantiating this attribute. Furthermore, the Controller class aggregates a visualization class instance in order to visualize the simulation data in each time step.
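As an illustration of the non-blocking synchronization described above, the following minimal sketch exchanges one ghost layer with one neighbor using MPI_Isend/MPI_Irecv. The buffer handling and names are assumptions for the example and do not reproduce the actual syncAlpha/syncBeta code.

#include <mpi.h>

/* Non-blocking ghost-layer exchange for one communication direction.
 * send_buf must already be filled (e.g. by storeDataAlpha); recv_buf is
 * consumed (e.g. by setDataAlpha) after the requests have completed.       */
void sync_direction_nonblocking(float *send_buf, float *recv_buf, int count,
                                int neighbor_rank, MPI_Comm comm,
                                MPI_Request requests[2])
{
    MPI_Irecv(recv_buf, count, MPI_FLOAT, neighbor_rank, 0, comm, &requests[0]);
    MPI_Isend(send_buf, count, MPI_FLOAT, neighbor_rank, 0, comm, &requests[1]);
    /* the caller overlaps GPU work here and later calls
     * MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);                        */
}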

LbmSolver Class: This class works as a wrapper around all available OpenCL kernels. It provides interface functions for enqueueing computation and device/host communication kernels. The function reload allocates the OpenCL memory objects which contain the local computation results. Additionally, it creates the OpenCL kernels, compiles them and assigns the kernel arguments. Enqueueing OpenCL commands into the command-queues is done only through the interface provided by this class. For instance, it implements the functions simulationStepAlpha and simulationStepBeta, which enqueue the OpenCL kernels that perform the alpha and beta computations on the entire local domain. In addition, the simulationStepAlphaRect and simulationStepBetaRect functions give the user the capability to run the alpha and beta kernels on a rectangular part of the domain. To use these functions, the user needs to provide the origin and size of the rectangular part. The rectangular functionality is also available for the functions which store and set the data from and to the device. With the help of these functions, it is possible to modify the data of a specific rectangular part of the whole domain. This feature is used for getting and setting the ghost layer data. Most of the functions provided in this class are overloaded to exploit the OpenCL event synchronization mechanism. The usage of these functions is explained in later sections.

Comm Class: The Comm class encapsulates the information required for the MPI communication between Controller instances. The public interface of the class offers functions for accessing the MPI destination rank as well as the origin and size of the data to send to and receive from that rank. The Comm class also allocates the receive and send buffers in its constructor, so the buffers are allocated only once for all simulation steps.

Configuration Class: This class is in charge of parsing the general settings of the simulation process and providing global access to these settings. The simulation configuration can be set through command line arguments or by providing an XML file. An example of the XSD schema is available in appendix A. The configuration file name can be passed as a constructor argument when instantiating the Configuration class, or it can be specified as the argument of the loadFile function. The simulation settings are divided into four categories: physics, grid, simulation and device settings. The settings of each category are available under the corresponding XML element in the configuration file (see appendix A). The class should provide a global point of access to the settings. This is achieved via the singleton software design pattern. The singleton pattern ensures that a class has only one instance, which is globally available through the function Instance(). In order to be able to turn any class into a singleton, a C++ singleton template class is implemented. By using the Configuration class as the template parameter of the Singleton class, it can easily be used as a singleton.

LbmVisualizationVTK Class: All visualization classes have to inherit from the abstract base class ILbmVisualization, which defines the interface of a visualization class compatible with the software developed in this thesis.

The ILbmVisualization interface supplies two functions, namely setup() and render(). The first function is used for the initialization of the visualization process. The render() function should be called whenever the simulation data needs to be visualized; its functionality depends on the implementing class. In the case of the LbmVisualizationVTK class, render() saves the simulation data to a VTK file.

6.3. Basic Implementation

To develop the software in this thesis, the following simple, disciplined and pragmatic approach to software engineering, which has been attributed to Kent Beck, is applied: Make It Work, Make It Right, Make It Fast. According to this approach, first a version of the software is developed which works correctly and fulfills the basic goals of the thesis. In the next step, various methods to improve this basic design are implemented and verified. Finally, in the last step of development, the software is optimized in many respects to gain the best performance in terms of execution time and scalability for large-scale simulations.

This section describes the implementation of the basic method. The basic method computes all subdomains by employing one GPU per subdomain. After all computations of a time step are completed, the boundary regions of the subdomains are exchanged between the GPUs. A GPU cannot directly access the global memory of other GPUs; as a result, the host CPUs are used as a bridge for the data exchange between GPUs. The data exchange is composed of the following three steps: (1) data transfer from GPU to CPU, (2) data exchange between CPUs via MPI, (3) data transfer back from CPU to GPU [18]. A schematic overview of this three-step communication is illustrated in Fig. 6.6.

Figure 6.6.: Schematic timeline for the basic method. From [18].

The code developed for this approach does not require sophisticated synchronization techniques, since the MPI communication is performed only after the computation of all subdomain cells is accomplished. A sample of an α-iteration is shown in Listing 6.1.

Listing 6.1: Implementation of the non-optimized basic method for computing one alpha simulation step.

    // First the alpha computation of all cells is performed.
    cLbmPtr->simulationStepAlpha();
    // The communication phase starts after the computation phase
    // has completely finished.
    syncAlpha();
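The communication phase of the basic method thus combines the GPU-to-host copy, the MPI exchange, and the host-to-GPU copy described in section 6.2. A minimal sketch of such a three-step exchange for a single boundary is shown below; the buffer names, the use of blocking calls, and the wrapping into one helper function are assumptions for illustration and do not mirror the thesis code exactly.

    // Sketch of the three-step ghost-layer exchange of the basic method
    // (assumed buffer names and blocking calls, for illustration only).
    #include <vector>
    #include <mpi.h>
    #include <CL/cl.h>

    void exchangeGhostLayer(cl_command_queue queue,
                            cl_mem d_send, cl_mem d_recv,   // device-side ghost-layer buffers
                            size_t bytes, int neighbor_rank)
    {
        std::vector<char> h_send(bytes), h_recv(bytes);

        // (1) GPU -> CPU: copy the ghost layer into a host buffer
        clEnqueueReadBuffer(queue, d_send, CL_TRUE, 0, bytes, h_send.data(), 0, NULL, NULL);

        // (2) CPU <-> CPU: exchange the ghost layers with the neighboring process
        MPI_Sendrecv(h_send.data(), static_cast<int>(bytes), MPI_BYTE, neighbor_rank, 0,
                     h_recv.data(), static_cast<int>(bytes), MPI_BYTE, neighbor_rank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // (3) CPU -> GPU: copy the received ghost layer back to device memory
        clEnqueueWriteBuffer(queue, d_recv, CL_TRUE, 0, bytes, h_recv.data(), 0, NULL, NULL);
    }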

Although the implementation of this method is straightforward, the communication overhead introduced by the three-step communication scheme described above can reduce the performance of the simulation. We expect its impact to become even more significant when more GPUs are used. In this method, the computation of the whole domain has to be accomplished before the boundary data can be exchanged. Therefore, the GPU/CPU transfer operations and the MPI communication stay idle for a long time. In the next sections, techniques which avoid these communication costs by overlapping the communication and computation parts are investigated.

6.4. Overlapping Work and Communication

Increasing the performance becomes more and more difficult, primarily because the inter-node communication cannot keep up with the performance increase of massively parallel GPUs. Achieving good parallel efficiency on distributed-memory machines therefore requires more advanced programming techniques which hide the communication overhead by overlapping methods. A possibility to avoid the communication overhead is to perform the communication in parallel to the actual simulation. This requires two OpenCL LBM kernels: one updates the outer layer of the block, and one updates the inner region. In this way, the computation of the outer boundary is accomplished first. Next, the computation of the inner part as well as the extraction, insertion and MPI communication of the boundary data are executed asynchronously. Hence, the time spent for all PCIe transfers and the MPI communication can be hidden by the computation of the inner part, provided that the inner kernel takes longer than the communication; if this is not the case, only part of the communication is hidden [5].

Implementing such overlapping techniques requires the asynchronous execution of GPU kernels, GPU-CPU copy operations and MPI communication commands on the CPU. Both the OpenCL and the MPI standard provide advanced synchronization mechanisms for this purpose; the OpenCL event mechanism is discussed in section 4.2.5. The operations used in a Multi-GPU LBM simulation fall into the following four categories:

GPU Computation: the operations that perform the actual computation of the collision operators. These operations run solely on the GPU.

GPU-CPU Communication: the operations that are responsible for transferring updated values from GPU memory to host memory and vice versa.

CPU Computation: in a hybrid model, the CPU is utilized in addition to the GPU to perform a part of the computations. However, this is not exploited in our implementation.

MPI Communication: all MPI commands which are used to synchronize the ghost layer data.

The primary idea of the techniques designed in this thesis is to overlap the operations of these four categories, reducing the critical path length of the dependency graph.

To overlap the GPU computation and the GPU-CPU communication operations, the advanced OpenCL event synchronization techniques described in section 4.2.5 can be exploited. In addition, if several kernels are used to compute different areas of the subdomain, the device should be capable of running multiple kernels simultaneously. In the CUDA programming model, this can be achieved by exploiting the CUDA stream concept. Achieving the same result with OpenCL is theoretically possible by using several OpenCL command-queues associated with the same device. The current OpenCL specification does not mandate this capability, so its availability depends entirely on the vendor providing the OpenCL driver. Overlapping GPU operations with the CPU computation category can be achieved by taking advantage of the OpenCL callback mechanism; it will be shown in later sections that the overhead introduced by the callback mechanism significantly degrades the performance and scaling of the software. By using non-blocking MPI commands, CPU operations and MPI communication can run asynchronously. Based on these overlapping techniques for the Multi-GPU LBM, several approaches are designed and implemented in this thesis, and their performance results are compared, in the following sections.

SBK-SCQ Method

The Single Boundary Kernel Single Command-Queue (SBK-SCQ) method consists of one boundary kernel that computes all boundary values of every direction at once. The main idea behind this approach is to utilize this kernel to first update only the boundary values; the communication process and the computation of the inner cells are then performed asynchronously. Fig. 6.7 illustrates a schematic timeline of the operations performed in this method.

Figure 6.7.: Schematic timeline for the SBK-SCQ method (rows: MPI communication, CPU computation, GPU-CPU communication and GPU computation for the six boundary directions and the inner part).

After the boundary values have been computed, the data transfer operations are triggered. During the communication phase, the boundary values of each direction are first transferred separately from the GPU memory to the host memory. This is achieved by enqueueing OpenCL data transfer commands to the single command-queue. In the next step, the ghost layer data are exchanged between the MPI processes by using non-blocking MPI send/receive functions. Since in this method no concurrent OpenCL kernel execution can be performed on the device, GPU computation and GPU-CPU transfer operations cannot overlap. However, as soon as the transfer operation of one boundary is finished, the corresponding MPI communication can be executed on the host while the transfer operations of the other boundaries are executed on the GPU, asynchronously. Finally, each simulation step concludes with receiving the data from the neighboring subdomains. This data needs to be transferred from the host memory to the correct locations in the GPU memory. Again, separate transfer operations are executed for each boundary in this phase. With only one OpenCL command-queue, overlapping the GPU computation and the GPU-CPU transfer operations is not possible. As a result, the computation of the inner part of the domain can be triggered only after the data transfer operations are accomplished.

A schematic sample code of this method is given in Listing 6.2. simulationStepAlphaBoundaries() is the function that enqueues the boundary kernel, which computes only the boundary values. The functions storeDataAlpha() and setDataAlpha() are responsible for transferring the data from the GPU memory to the host memory and vice versa; their single argument specifies the intended boundary location. syncAlpha() performs non-blocking MPI send and receive operations. In addition to the direction of communication, this function has two output arguments of type MPI_Request for the send and receive operations. These arguments are used later to query the status of the MPI operations. While the communication operations are in flight, the computation of the inner part of the domain is triggered by the simulationStepAlphaRect() function.

Listing 6.2: Implementation of the SBK-SCQ method for computing one alpha simulation step.

    //////////////////////////////////////////////////////
    // --> One kernel computing the entire boundary cells
    //////////////////////////////////////////////////////
    cLbmPtr->simulationStepAlphaBoundaries(0, NULL, NULL);

    storeDataAlpha(MPI_COMM_DIRECTION_X_0);
    syncAlpha(MPI_COMM_DIRECTION_X_0, &req_send_x0, &req_recv_x0);

    storeDataAlpha(MPI_COMM_DIRECTION_X_1);
    syncAlpha(MPI_COMM_DIRECTION_X_1, &req_send_x1, &req_recv_x1);

    // same for y and z directions.

    // --> Computation of inner part
    cLbmPtr->simulationStepAlphaRect(inner_origin, inner_size, 0, NULL, NULL);

    MPI_Status stat_recv;
    MPI_Wait(&req_recv_x0, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_0);

    MPI_Wait(&req_recv_x1, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_1);

    // same for y and z directions.

As shown in Fig. 6.7, there are many gaps between the four kinds of operations. In order to improve the overall performance, the timeline should be as tight as possible. This can be achieved by overlapping more independent operations. The identification of independent tasks requires a more advanced dependency analysis of the simulation process. One way to achieve this is to divide the boundary kernel into six separate kernels, one for each boundary region. Hence, the computation of each boundary region can be accomplished independently, which provides more overlapping opportunities. This method is explained in the next section.
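Before turning to these variants, note that Listing 6.2 and the multi-kernel methods below all rely on rectangular kernel launches such as simulationStepAlphaRect(). The following sketch shows one way such a launch could be expressed in OpenCL; it assumes the rectangle is mapped onto a 3D NDRange via a global work offset, whereas the actual kernel interface in the thesis code may differ (e.g. passing origin and size as kernel arguments).

    // Sketch of a rectangular kernel launch as used by simulationStepAlphaRect()
    // (assumed interface, for illustration only).
    #include <CL/cl.h>

    cl_int enqueueAlphaRect(cl_command_queue queue, cl_kernel alpha_kernel,
                            const size_t origin[3], const size_t size[3],
                            cl_uint num_wait, const cl_event* wait_list, cl_event* done)
    {
        // Launch one work-item per cell of the rectangular subregion. The global
        // work offset shifts the NDRange so that get_global_id() yields the
        // absolute cell coordinates inside the subdomain.
        return clEnqueueNDRangeKernel(queue, alpha_kernel,
                                      3,        // three-dimensional NDRange
                                      origin,   // global work offset = origin of the region
                                      size,     // global work size   = extent of the region
                                      NULL,     // let the driver pick the work-group size
                                      num_wait, wait_list, done);
    }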

MBK-SCQ Method

In the Multiple Boundary Kernels Single Command-Queue (MBK-SCQ) method, in contrast to the SBK-SCQ method, each boundary region is computed separately. Therefore, once the computation of one boundary region is finished, the GPU-CPU transfer operation of that region is triggered. Figure 6.8 illustrates a schematic timeline for the MBK-SCQ method.

Figure 6.8.: Schematic timeline for the MBK-SCQ method (rows: MPI communication, CPU computation, GPU-CPU communication and GPU computation, with one boundary kernel per direction followed by the inner computation).

Although decomposing the boundary computation provides more flexibility in organizing the operations, it requires more advanced synchronization techniques. The GPU-CPU transfer operation of each boundary region should be executed directly after the computation of the corresponding region has finished. In addition, each boundary region should be computed only after the previous boundary region has successfully completed its task. A sample code of this method is provided in Listing 6.3. Here, instead of executing one kernel to compute all boundary regions of the partition, each boundary is computed separately by using the simulationStepAlphaRect() function. As explained in section 6.2, this function performs the collision and streaming operations on a rectangular subregion of the domain; the origin and size of the subregion are given as function arguments.

Listing 6.3: Implementation of the MBK-SCQ method for computing one alpha simulation step.

    ///////////////////////////////////////
    // --> Simulation step alpha x boundary
    ///////////////////////////////////////
    // performing the alpha time step on the x0 boundary
    cLbmPtr->simulationStepAlphaRect(x0_origin, x_size, 0, NULL, &ev_ss_x0);
    // performing the alpha time step on the x1 boundary
    cLbmPtr->simulationStepAlphaRect(x1_origin, x_size, 1, &ev_ss_x0, &ev_ss_x1);

    // --> Store x boundary
    storeDataAlpha(MPI_COMM_DIRECTION_X_0, 1, &ev_ss_x0, &ev_store_x0);
    storeDataAlpha(MPI_COMM_DIRECTION_X_1, 1, &ev_ss_x1, &ev_store_x1);

    // --> Sync x boundary
    syncAlpha(MPI_COMM_DIRECTION_X_0, &req_send_x0, &req_recv_x0);
    syncAlpha(MPI_COMM_DIRECTION_X_1, &req_send_x1, &req_recv_x1);

    // ...

    // receiving and setting the boundary data from the neighbors
    MPI_Status stat_recv;
    MPI_Wait(&req_recv_x0, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_0);
    MPI_Wait(&req_recv_x1, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_1);

Although decomposing bigger tasks into smaller ones provides more flexibility in overlapping independent computation parts, the overhead introduced by the implementation can degrade the performance. In this method, no concurrent kernel execution on one device is possible. Hence, the implementation overhead can dominate the performance results. This is discussed in more depth in section 6.7; some techniques to overcome this problem are introduced in the following sections.

MBK-SCQ Method with OpenCL Callback Mechanism

Another technique investigated in this thesis is to take advantage of the OpenCL callback mechanism to execute the MPI communication commands. The callback mechanism, which is introduced in section 4.2.5, provides the ability to invoke a function on the host when an OpenCL event has reached a specific status. A typical usage scenario for OpenCL callbacks are applications in which the host would otherwise have to wait while the device is executing, which reduces efficiency. In the Multi-GPU implementation, the callback mechanism can be used as follows: when the event associated with a GPU-CPU transfer operation changes its status to CL_COMPLETE, a callback function is triggered which invokes the MPI communication commands to exchange the transferred data with the other processes. In this way, the host process can continue enqueueing the next GPU kernels without being interrupted by MPI commands. Although this technique is promising in theory, it is shown in section 6.7 that the overhead introduced by the OpenCL callback mechanism drastically degrades the performance and scalability of the software.
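A minimal sketch of how such a callback could be registered is shown below. The helper struct and the way the MPI arguments are passed through user_data are assumptions for illustration, not the thesis implementation; note also that MPI would have to be initialized with a sufficient thread support level, since the callback may be executed by a driver thread.

    // Sketch: trigger the MPI exchange of a ghost layer from an OpenCL event
    // callback once the device-to-host transfer has completed (illustrative only).
    #include <mpi.h>
    #include <CL/cl.h>

    struct GhostLayerMsg {            // hypothetical helper carrying the MPI arguments
        void*       buffer;
        int         count;
        int         neighbor_rank;
        MPI_Request request;
    };

    static void CL_CALLBACK onTransferComplete(cl_event event, cl_int status, void* user_data)
    {
        if (status != CL_COMPLETE) return;          // only react to successful completion
        GhostLayerMsg* msg = static_cast<GhostLayerMsg*>(user_data);
        // Start the non-blocking send of the ghost layer that has just arrived on the host.
        MPI_Isend(msg->buffer, msg->count, MPI_BYTE, msg->neighbor_rank,
                  /*tag=*/0, MPI_COMM_WORLD, &msg->request);
    }

    // Registration: ev_store is the event of the GPU-to-CPU transfer command.
    //   clSetEventCallback(ev_store, CL_COMPLETE, onTransferComplete, &msg);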

MBK-MCQ Method

The fundamental shortcoming of the previous methods is the lack of concurrent execution of GPU computation and GPU-CPU transfer operations on one device. Although the OpenCL specification does not require this capability from the vendors providing the OpenCL drivers, the feature is available on some NVIDIA graphics cards when using the CUDA programming model. In order to execute GPU-CPU transfer operations and GPU computation simultaneously under the OpenCL standard, these commands have to be enqueued to separate OpenCL command-queues associated with the same device. This is the basic idea behind the Multiple Boundary Kernels Multiple Command-Queue (MBK-MCQ) method. By using multiple command-queues, i.e. two command-queues, one of them can be devoted to the GPU-CPU transfer operations and the other one is utilized for the GPU computation. As a result, an overlap of these two command types can be achieved. Figure 6.9 shows the schematic timeline of this method: after computing the x0 boundary region, the transfer operation for this region is enqueued to the GPU-CPU transfer command-queue and, simultaneously, the computation of the x1 boundary region is enqueued to the GPU-computation command-queue.

Figure 6.9.: Schematic timeline for the MBK-MCQ method (rows: MPI communication, CPU computation, GPU-CPU communication and GPU computation, with boundary kernels, transfers and the inner computation overlapping).

If the GPU is capable of concurrent kernel execution, the computation of the inner region can additionally be triggered at the beginning of the simulation step, in parallel with the computation of the boundary regions. By taking advantage of concurrent kernel execution and overlapped transfer and computation operations on the GPU, the timeline in Figure 6.9 is much denser than those of the previous methods, which theoretically results in a large performance boost. Whether the hardware features required by this method can be used depends not only on the capabilities of the hardware but also on whether the vendors have implemented these features in their OpenCL drivers. In addition, the overhead introduced by the simultaneous scheduling of several kernels on the GPU depends on the OpenCL driver implementation. As a result, although this method promises better results in theory, in practice the stated difficulties can dominate the performance results. In section 6.7, the performance results of this method are compared with the previous methods.
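The setup with two command-queues on the same device could look like the following sketch, using the OpenCL 1.x clCreateCommandQueue API and illustrative names; events link the transfer on the copy queue to the corresponding boundary kernel on the compute queue, while everything else on the two queues may run concurrently. In the real code, the queues would of course be created once during initialization rather than per call.

    // Sketch: two command-queues on one device, one for kernels and one for
    // GPU-CPU transfers, linked via an event (error handling omitted).
    #include <CL/cl.h>

    void enqueueX0BoundaryOverlapped(cl_context ctx, cl_device_id dev,
                                     cl_kernel boundary_kernel_x0,
                                     cl_mem d_ghost_x0, void* h_ghost_x0, size_t bytes,
                                     const size_t origin[3], const size_t size[3])
    {
        cl_int err;
        cl_command_queue compute_queue  = clCreateCommandQueue(ctx, dev, 0, &err);
        cl_command_queue transfer_queue = clCreateCommandQueue(ctx, dev, 0, &err);

        // Boundary computation on the compute queue.
        cl_event ev_comp_x0;
        clEnqueueNDRangeKernel(compute_queue, boundary_kernel_x0, 3,
                               origin, size, NULL, 0, NULL, &ev_comp_x0);

        // Non-blocking transfer of the x0 ghost layer on the transfer queue; it
        // waits only for the kernel event, so both queues can otherwise proceed
        // concurrently.
        cl_event ev_store_x0;
        clEnqueueReadBuffer(transfer_queue, d_ghost_x0, CL_FALSE, 0, bytes,
                            h_ghost_x0, 1, &ev_comp_x0, &ev_store_x0);

        // Further boundary kernels can now be enqueued on compute_queue while
        // the transfer proceeds on transfer_queue.
    }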

6.5. Validation

This section describes the validation process of the Multi-GPU LBM simulation developed in this thesis. Numerical validation is an important issue, especially in GPU computing where calculations may be performed in single precision. The Multi-GPU implementation of this thesis is based on the Single-GPU code developed in [17]. A physical validation of the Single-GPU code is also provided in [17].

Validation Setup

To validate the LBM software, a lid-driven cavity scenario is used. In this scenario, a cubic domain is created with a velocity in x direction on the top wall and no-slip conditions on every other wall. The validation is performed for different Reynolds numbers and various domain resolutions.

Multi-GPU Validation

In order to verify the correctness of the MPI parallelization, the same simulation scenario is computed on one GPU as well as on multiple GPUs. Afterwards, the density distribution values of each cell in both cases are compared against each other. Since the fluid velocity and pressure are computed from the density distribution values, a comparison of these quantities is not necessary. In order to automate the validation process, a validation module is developed which runs the same scenario that is computed on multiple GPUs on one additional GPU without decomposing the domain. The simulation data of each subdomain (excluding the boundary values) are sent via MPI to the process that is responsible for the Single-GPU simulation. The validation process receives these data and, based on the ID of the sending process, identifies the location of the received data in the total domain. In the next step, the corresponding Single-GPU data are transferred from the memory of the validation GPU to the host of the validation process. Once the results of the Single-GPU and Multi-GPU simulations are available on the validation process, they are compared against each other and the number of non-matching values is counted. The validation is reported as successful when no non-matching value is found. In order to activate the validation process, the value of the validate element in the XML configuration file (see appendix A) has to be set to 1. In this case, the simulation has to be launched with one additional process devoted to the computation of the Single-GPU data. The validation can only be performed for scenarios whose domain resolution fits into the memory of one GPU. The basic method as well as all overlapping methods are successfully validated in this way.
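The per-cell comparison on the validation process could be as simple as the following sketch; the buffer layout and the exact matching criterion (bitwise equality as described, or optionally a small tolerance) are assumptions for illustration.

    // Sketch of the comparison step on the validation process: count density
    // distribution values that differ between the Multi-GPU and the Single-GPU
    // run (illustrative only).
    #include <cstddef>
    #include <cmath>

    size_t countMismatches(const float* multi_gpu, const float* single_gpu,
                           size_t num_values, float tolerance = 0.0f)
    {
        size_t mismatches = 0;
        for (size_t i = 0; i < num_values; ++i) {
            if (std::fabs(multi_gpu[i] - single_gpu[i]) > tolerance)
                ++mismatches;
        }
        return mismatches;   // validation succeeds if this is zero
    }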

6.6. Performance Optimization

One of the necessary parts of any HPC software development is performance optimization. The first step in the optimization process is profiling the software. A common profiling strategy is to find out how much time is spent in the different functions in order to identify hot spots. These hot spots are subsequently analyzed for possible optimization opportunities. To identify the hot spots of the software developed in this thesis, the first tool adopted is Callgrind. Callgrind is an extension of Cachegrind which produces information about the call graph of a program. The data collected by Callgrind consist of the number of instructions executed, their relationship to source lines, the caller/callee relationships between functions, and the number of such calls. The result of the analysis of the software for the SBK-SCQ method after 100 iterations is given in Table 6.1, which lists the seven functions with the highest exclusive execution time.

    Incl.   Self   Called   Function
                            enqueueCopyRectKernel
                            storeDensityDistribution
                            setDataAlpha
                            setDataBeta
                            initSimulation
                            simulationStepBeta
                            simulationStepAlpha

Table 6.1.: Called functions of the Multi-GPU LBM simulation with the highest exclusive execution time, for a run with 100 iterations.

As shown in the table, the function enqueueCopyRectKernel has the highest execution time and is called more often than the other functions. This function enqueues a kernel to the command-queue which copies a rectangular part of the domain from the density distribution buffer to a temporary buffer. This buffer is used in the next step to transfer the data from the GPU memory to the host memory. The call graph of the enqueueCopyRectKernel() function is illustrated in Fig. 6.10, which shows that this function is used by the functions that store the boundary values from the GPU memory to the host as well as by those setting the values in the GPU memory.

Figure 6.10.: Rectangle copy kernel call graph.

As discussed in section 5, the density distributions for a specific lattice vector are stored linearly in the global memory of the GPU. In this memory layout, first the density distribution values of lattice vector f_0 for all domain cells are stored in a row, with the domain cells ordered in x, y, z order. Subsequently, the density distribution values of the next lattice vector are stored for all domain cells, and so on. Although this way of storing the data provides an optimized memory access pattern for the Single-GPU implementation (see [17]), in the Multi-GPU case, where the boundary region data have to be exchanged in every simulation step, it leads to inefficient GPU memory accesses. Since the density distribution data are stored linearly, accessing the memory locations of a rectangular boundary region requires several read and write operations on discontiguous locations in the GPU memory. Fig. 6.11 illustrates a sample of the memory layout for a 3x3x3 domain; in this sample, the x0 boundary cells are highlighted in red. For the β-synchronization the problem becomes even more complicated, since some of the lattice vector values have to be skipped. All these challenges lead to uncoalesced memory accesses.

Figure 6.11.: Memory access pattern for boundary regions in the density distribution buffer.

In GPU architectures, memory optimizations are the most important technique to gain better performance, and the most effective memory access optimization in programming for the CUDA architecture is coalescing global memory accesses. Moreover, the exchange of boundary regions happens in every simulation step.
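The linear layout described above can be summarized by a small index function. The sketch below, with assumed variable names, shows why the cells of an x0 boundary layer are not contiguous: consecutive boundary cells are separated by a stride of the domain size in x, and each lattice vector adds a further offset of the whole domain size.

    // Sketch of the structure-of-arrays indexing described above (assumed names).
    // All values of lattice vector i are stored consecutively for the whole
    // subdomain, with the cells ordered x-first, then y, then z.
    #include <cstddef>

    size_t densityIndex(size_t i,                  // lattice vector index (e.g. 0..18 for D3Q19)
                        size_t x, size_t y, size_t z,
                        size_t nx, size_t ny, size_t nz)
    {
        const size_t cells = nx * ny * nz;         // cells per lattice vector
        return i * cells + (z * ny + y) * nx + x;  // linear address of f_i at cell (x, y, z)
    }

    // For the x0 boundary (x == 0), two neighboring boundary cells at y and y+1
    // are nx elements apart, so reading one boundary layer touches many
    // non-contiguous memory locations -- the uncoalesced access pattern
    // discussed above.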


More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information