
Computational Science and Engineering (Int. Master's Program)
Technische Universität München

Master's Thesis

MPI Parallelization of GPU-based Lattice Boltzmann Simulations

Author: Arash Bakhtiari
1st examiner: Prof. Dr. Hans-Joachim Bungartz
2nd examiner: Prof. Dr. Michael Bader
Assistant advisors: Dr. rer. nat. Philipp Neumann, Dipl.-Inf. Christoph Riesinger, Dipl.-Inf. Martin Schreiber
Thesis handed in on: October 7, 2013


I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references.

October 7, 2013
Arash Bakhtiari


Acknowledgments

I would like to express my gratitude to Prof. Dr. Hans-Joachim Bungartz for giving me the great opportunity to work on this project. I wish to thank Dr. rer. nat. Philipp Neumann for his scientific support. I would like to express my great appreciation to my advisors, Christoph Riesinger and Martin Schreiber, for their ongoing support of my work, for helpful discussions and for their encouragement throughout this time.


Contents

Acknowledgments
1. Introduction
2. Lattice Boltzmann Method
   2.1. Boltzmann Equation
   2.2. Lattice Boltzmann Method
   2.3. Lattice Boltzmann and HPC
3. Turbulent LBM
   3.1. Turbulence Modeling
   3.2. Overview of Simulation Approaches
   3.3. BGK-Smagorinsky Model
4. GPU Architecture
   4.1. CPU vs. GPU
   4.2. OpenCL
        Platform Model
        Execution Model
        Memory Model
        Programming Model
        Advanced OpenCL Event Model Usage
5. Single-GPU LBM
   5.1. OpenCL Implementation
        Memory Layout
        Data Storage
6. Multi-GPU LBM
   6.1. Parallelization Models
        Domain Decomposition
        Ghost Layer Synchronization
        CPU/GPU Communication
   6.2. Software Design
        Modules
   6.3. Basic Implementation
   6.4. Overlapping Work and Communication
        SBK-SCQ Method
        MBK-SCQ Method
        MBK-MCQ Method
   Validation
        Validation Setup
        Multi-GPU Validation
   Performance Optimization
        K1DD Approach
        K19DD Approach
   Performance Evaluation
        Simulation Platform and Setup
        Weak Scaling
        Strong Scaling
7. Multi-GPU Turbulent LBM
   OpenCL Implementation
8. Conclusion
Appendix
A. Configuration File Example
Bibliography

1. Introduction

Fluid flow plays an important role in our lives; examples range from the blood flow in our bodies to the air flow around a space shuttle. Therefore, the field of computational fluid dynamics (CFD) has always been an area of interest for numerical simulations. Most CFD simulations require the modeling of complex physical effects via numerical algorithms with high computational demands. Hence, these simulations have to be executed on supercomputers. The lattice Boltzmann method (LBM) is a popular class of CFD methods for fluid simulation which is suitable for massively parallel simulations because of its local memory access pattern. Therefore, it is the method of choice in this thesis. Nowadays, GPGPU (general-purpose graphics processing unit) computing is becoming more and more popular in the High Performance Computing (HPC) field, due to the massively parallel compute power that GPUs provide at relatively low hardware cost. Since the LBM exposes a high degree of parallelism and has minimal dependencies between data elements, it is a perfect candidate for implementation on GPU platforms. Due to the restricted memory available on GPUs, performing simulations with high memory demands requires utilising multiple GPUs.

The major contribution of this thesis is the design and implementation of several methods for an efficient MPI-parallelized LBM on a GPU cluster. The primary goal is to exploit advanced features of modern GPUs in order to obtain an efficient and scalable massively parallel Multi-GPU LBM code. To achieve this goal, several sophisticated overlapping techniques are designed and implemented. In addition, software design has been an important aspect throughout this thesis; the development of extensible and maintainable software was therefore a high priority. Furthermore, the implementation of these methods is optimized to achieve better overall performance of the simulation software. To demonstrate that the primary goal has been met, the performance of each method is evaluated. In addition, the GPU code is extended to simulate turbulent flows. The Large Eddy Simulation (LES) approach with the Smagorinsky subgrid-scale (SGS) model was adopted for the turbulence simulation.

The thesis starts with an introduction to and discussion of the theory of the LBM. In chapter three, turbulence and the equations required to implement the LES with the Smagorinsky model are given. The next chapter is devoted to the GPU architecture and an introduction to the OpenCL programming model and its advanced features. The Single-GPU implementation and its memory layout are presented in chapter five. Chapter six covers the parallelization models, the software design and the advanced overlapping techniques developed in this thesis. At the end of that chapter, the validation strategy, performance optimization and evaluation of the Multi-GPU code are given. In chapter seven, the extension of the Multi-GPU code to turbulent fluid simulations is discussed.


2. Lattice Boltzmann Method

The Lattice Boltzmann Method (LBM) is a well-known approach for fluid simulation. The method has its origin in a molecular description of a fluid. Since most of the calculations in this method are performed locally, an optimized implementation normally achieves linear scalability in parallel computing. In addition, complex geometries and physical phenomena can easily be represented by this method. Depending on the underlying simulation requirements, the LBM can yield beneficial properties compared to Navier-Stokes-based methods. However, instead of considering incompressible flows, lattice Boltzmann schemes simulate weakly compressible flows. An introduction to the LBM and its history can be found in [20, 22, 19, 21] and the references therein.

2.1. Boltzmann Equation

The LBM is based on the Boltzmann equation:

    \frac{\partial f}{\partial t} + \vec{v} \cdot \nabla f = \Omega(f, f^{eq})    (2.1)

where f(\vec{x}, \vec{v}, t) denotes the probability density for finding fluid particles in an infinitesimal volume around \vec{x} at time t having the velocity \vec{v}. The right-hand side of Eq. 2.1 is called the collision operator and represents changes due to intermolecular collisions in the fluid. f^{eq} is the equilibrium distribution and is given by Eq. 2.2. It is also known as the Maxwell-Boltzmann distribution.

    f^{eq}(v) = \rho \left( \frac{m}{2\pi\kappa_b T} \right)^{d/2} e^{-\frac{m(v-u)^2}{2\kappa_b T}}    (2.2)

This equation describes the density of particles which have a specific velocity v within an area with bulk velocity u in an infinitesimally small volume. Here, \kappa_b is the Boltzmann constant, d the number of dimensions, m the mass of a single particle, T the temperature and \rho the mass density.

Solving Eq. 2.1 analytically is very challenging and can only be done for special cases. Prabhu L. Bhatnagar, Eugene P. Gross and Max Krook noticed that the main effect of the collision term is to bring the velocity distribution function closer to the equilibrium distribution. Based on this observation, they proposed the BGK approximation. The collision operator in the Boltzmann equation can be approximated by the BGK model:

    \Omega(f, f^{eq}) \approx -\frac{1}{\tau}(f - f^{eq})    (2.3)

where \tau is the relaxation time of the system, i.e. a characteristic time for the relaxation process of f towards equilibrium.

Equations 2.1 and 2.3 provide information on the flow using statistical methods, while the Navier-Stokes equations are based on a set of continuity equations. However, the quantities known from the Navier-Stokes equations such as the velocity \vec{u} or the density \rho can be computed at a certain point \vec{x} by integrating the probability density f over the velocity space:

    \rho(\vec{x}, t) = \int_{\mathbb{R}^D} f \, dv_1 \ldots dv_D    (2.4)

    \rho(\vec{x}, t)\, \vec{u}(\vec{x}, t) = \int_{\mathbb{R}^D} f\, \vec{v} \, dv_1 \ldots dv_D    (2.5)

2.2. Lattice Boltzmann Method

The LBM originated from the lattice gas automata (LGA) method. LGA is a simplified molecular dynamics model in which quantities like space, time and particle velocities are discrete. In this model, each lattice node is connected to its neighbors by six lattice velocities. In each time interval, as demonstrated in Fig. 2.1, particles in each node move to the neighboring nodes in the direction of one of the six lattice velocities.

Figure 2.1.: LGA model lattice vectors. From [2].

When particles from different directions arrive at the same node, they collide according to certain collision rules and, as a result, change their velocities. More in-depth information regarding the LGA method is provided in [12].

The transition from the LGA model to the LBM is accomplished by switching from a quantitative description to a probabilistic description of the particles in phase space. The LBM simulates fluid flow by tracking the particle distribution function. Accomplishing this in a continuous phase space is impossible; therefore, the LBM tracks the particle distribution along a limited number of directions. The particle collisions are computed based on the density values of the cells with a so-called collision operator. The movement of the particles within one timestep is then simulated by propagating the density values in the direction of the corresponding velocity vector to an adjacent cell. This is called the streaming step.

LBM models can operate on different discretization schemes, which determine the dimension and the adjacent cells for data exchange along the lattice vectors. To classify the different methods, the DdQq notation is used, where d is the number of dimensions and q is the number of density distributions per cell. E.g., D3Q19 is a 3D model with 19 density distributions (Fig. 2.2), which is the model used in this work.

Figure 2.2.: Lattice vectors for the D3Q19 model: dd0-dd3, dd16, dd17 are distributions with lattice speed 1, dd4-dd15 are distributions with lattice speed sqrt(2), and dd18 represents the particles at rest. From [17].

Based on this LBM model, the discrete representation of the lattice Boltzmann update is

    f_i(\vec{x} + \vec{c}_i\, dt, t + dt) = f_i(\vec{x}, t) - \frac{1}{\tau}\left(f_i(\vec{x}, t) - f_i^{eq}\right)    (2.6)

where f_i^{eq} is the discretized equilibrium distribution function

    f_i^{eq}(\vec{u}) = \omega_i \rho \left[ 1 + 3(\vec{e}_i \cdot \vec{u}) + \frac{9}{2}(\vec{e}_i \cdot \vec{u})^2 - \frac{3}{2}\vec{u}^2 \right]    (2.7)

with \omega_i = 1/18 for i \in [0; 3] \cup \{16, 17\}, \omega_i = 1/36 for i \in [4; 15] and \omega_i = 1/3 for i = 18. In order to compute the density and velocity of each cell, the equations 2.4 and 2.5 can be rewritten as

    \rho = \sum_{i=0}^{18} f_i, \qquad \rho\, \vec{v} = \sum_{i=0}^{18} f_i\, \vec{e}_i    (2.8)

2.3. Lattice Boltzmann and HPC

Since the discrete probability distribution functions used in the lattice Boltzmann model require more memory for their storage than the variables of the Navier-Stokes equations, this method might at first sight seem quite resource-consuming, but on modern computers this is rarely an issue. Since the collision operation of each cell can be computed independently and the data dependencies are simple, the lattice Boltzmann method is particularly well suited for computations on parallel architectures. This advantage also applies to other types of high-performance hardware like general-purpose graphics processing units (GPGPUs), which are investigated in particular in this thesis.
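As a concrete illustration of Eqs. 2.7 and 2.8, the following minimal C sketch computes the macroscopic moments and the discrete equilibrium distributions for a single D3Q19 cell. It is not taken from the thesis code; the lattice vectors and weights are passed in as parameters so that no particular ordering of the 19 directions is assumed.

#include <stddef.h>

/* Moments and equilibrium for one D3Q19 cell (Eqs. 2.7 and 2.8).
 * f   : 19 density distribution values of the cell
 * e   : 19 lattice velocity vectors (ordering as in Fig. 2.2)
 * w   : 19 lattice weights (1/18, 1/36 or 1/3, cf. Eq. 2.7)
 * feq : output, 19 equilibrium values                                  */
static void d3q19_equilibrium(const float f[19], const float e[19][3],
                              const float w[19], float feq[19])
{
    float rho = 0.0f, ux = 0.0f, uy = 0.0f, uz = 0.0f;

    /* zeroth and first moments, Eq. 2.8 */
    for (size_t i = 0; i < 19; ++i) {
        rho += f[i];
        ux  += f[i] * e[i][0];
        uy  += f[i] * e[i][1];
        uz  += f[i] * e[i][2];
    }
    ux /= rho;  uy /= rho;  uz /= rho;

    const float u_sqr = ux * ux + uy * uy + uz * uz;

    /* discrete equilibrium, Eq. 2.7 */
    for (size_t i = 0; i < 19; ++i) {
        const float eu = e[i][0] * ux + e[i][1] * uy + e[i][2] * uz;
        feq[i] = w[i] * rho * (1.0f + 3.0f * eu + 4.5f * eu * eu - 1.5f * u_sqr);
    }
}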


3. Turbulent LBM

Turbulent flow is characterised by irregular and stochastic changes in fluid properties like pressure and velocity. Many complex flows, ranging from the smoke rising from a cigarette, which after some time shows a completely disordered structure in the air, to an exhausting jet, show chaotic and irregular flow disturbances. In contrast to laminar flows, turbulent flows exhibit a wide range of length scales. In order to measure the strength of the turbulence in a flow, the Reynolds number

    Re = \frac{v L}{\nu}    (3.1)

is introduced, where v is the fluid velocity, \nu the fluid viscosity and L the characteristic length scale. Turbulent flows typically have a Reynolds number above 5000, while flows with a Reynolds number below 1500 are typically laminar. Table 3.1 provides a comparison of the characteristic properties of laminar and turbulent flows.

Laminar Flows                              | Turbulent Flows
Highly orderly flow                        | Chaotic fluid flow
No stochastic irregularity                 | Irregular in space and time
Stable against interference from outside   | Unsteady
Occur at low Reynolds numbers              | Occur only at high Reynolds numbers

Table 3.1.: Comparison of characteristic properties of turbulent and laminar flows. From [4]

In the next section, the modeling of turbulent fluids is investigated.

3.1. Turbulence Modeling

This section is intended to give a brief introduction to three established turbulence models, namely Direct Numerical Simulation (DNS), Reynolds-averaged Navier-Stokes (RANS) and Large-Eddy Simulation (LES), for a better understanding of the implementation, without going into the mathematical and physical details of these models. At the end, the BGK-Smagorinsky model, which is implemented in this thesis, is investigated.

3.2. Overview of Simulation Approaches

An overview of turbulence models is provided in this section. Interested readers can find more in-depth material regarding turbulence models in [4, 11].

Direct Numerical Simulation (DNS): This model is based on solving the three-dimensional, unsteady Navier-Stokes equations. The DNS model is the most accurate method to simulate turbulent flows, since in this model the whole range of spatial and temporal scales of the turbulence must be resolved. Therefore, the computational cost of DNS is very high, even at low Reynolds numbers. The only error source of this model is the approximation error of the numerical method, which can be reduced by choosing an appropriate numerical method. For most industrial applications, the computational resources required by a DNS would exceed the capacity of the most powerful computers currently available.

Reynolds-Averaged Navier-Stokes (RANS): This model is based on the statistical observation of a turbulent fluid. The instantaneous values of a turbulent flow field are divided into two parts: a temporal (ensemble) average and a fluctuating part. By inserting these values into the Navier-Stokes equations and applying a temporal average, the Reynolds-averaged Navier-Stokes (RANS) equations are obtained.

Large-Eddy Simulation (LES): This model was initially proposed in 1963 by Joseph Smagorinsky. The basic idea behind the LES model is to apply a low-pass filter to the Navier-Stokes equations in order to eliminate the small scales of the solution. This leads to transformed equations for a filtered velocity field. In order to reduce the computational cost, the LES model resolves the large scales of the solution and models the smaller scales. This makes the LES model suitable for industrial simulations with complex geometries.

In this thesis, the BGK-Smagorinsky model, which is an LES model, is implemented. The next section is devoted to this model.

3.3. BGK-Smagorinsky Model

As described in the previous section, the LES model is based on applying a low-pass filter to the Navier-Stokes equations. In order to understand the concept of LES simulations, we investigate the characteristic properties of turbulent fluids. A rough comparison of the characteristic properties of large and small scales in turbulent flows is given in Table 3.2.

Large Scales                          | Small Scales
Produced from average fluid flows     | Originate in large-scale movements
Demonstrate coherent structures       | Chaotic, stochastic
Last longer and energy-rich           | Last shorter and low-energy
Hard to model                         | Easier to model

Table 3.2.: Comparison of characteristic properties of large and small scales in turbulent flows. From [4]

In the RANS model, both large and small scales are modeled with an approximate statistical model. In contrast, in the DNS model, both scales are computed through the direct solution of the Navier-Stokes equations. The LES model can be considered a compromise between the RANS and DNS methods: the large-scale values are obtained by direct solution of the Navier-Stokes equations and the small scales are approximated with a model. Interested readers can find more detailed information about these methods in [11].

The following discrete lattice Boltzmann equation is obtained by applying the filter (see [9]):

    \bar{f}_\alpha(\vec{x} + \vec{e}_\alpha \delta t, t + \delta t) = \bar{f}_\alpha(\vec{x}, t) - \frac{1}{\tau_{total}}\left[\bar{f}_\alpha(\vec{x}, t) - \bar{f}_\alpha^{(eq)}(\vec{x}, t)\right] + \left(1 - \frac{1}{2\tau_{total}}\right)\bar{S}_\alpha(\vec{x}, t)\,\delta t    (3.2)

where overbars indicate filtered values. The major difference between this equation and the original lattice Boltzmann equation (see Eq. 2.6) is that the density distribution functions are replaced with filtered values. The relaxation time is replaced with the total relaxation time \tau_{total} = \tau + \tau_t. Consequently, the total viscosity is defined as follows:

    \nu_{total} = \nu + \nu_t = \frac{1}{3}\left(\tau_{total} - \frac{1}{2}\right) c^2 \delta t = \frac{1}{3}\left(\tau + \tau_t - \frac{1}{2}\right) c^2 \delta t    (3.3)

where the turbulent viscosity \nu_t is defined in Eq. 3.4 and \tau_t is the turbulent relaxation time.

    \nu_t = \frac{1}{3} \tau_t c^2 \delta t    (3.4)

Smagorinsky Model

In the eddy viscosity subgrid-scale (SGS) model, the turbulent stress is given by Eq. 3.5:

    \tau^t_{ij} - \frac{1}{3}\delta_{ij}\tau^t_{kk} = -2\nu_t \bar{S}_{ij}    (3.5)

In addition, the turbulent eddy viscosity is computed by the following formula:

    \nu_t = (C_s)^2 |\bar{S}|    (3.6)

where C_s is the Smagorinsky constant. In this thesis, C_s is set to 0.1. Eq. 3.7 defines the filtered strain rate:

    |\bar{S}| = \sqrt{2\bar{S}_{ij}\bar{S}_{ij}}    (3.7)

where the filtered strain rate tensor is

    \bar{S}_{ij} = \frac{1}{2}\left(\frac{\partial \bar{u}_i}{\partial x_j} + \frac{\partial \bar{u}_j}{\partial x_i}\right)    (3.8)

The computation of the filtered strain rate tensor as formulated in Eq. 3.8 requires finite difference calculations. In the LBM, this can be avoided by using the second moment of the filtered non-equilibrium density distribution functions. In our case, the filtered non-equilibrium flux tensor is applied:

    \bar{S}_{ij} = \frac{1}{2\,\delta t\, \tau_{total}\, \rho\, c_s^2}\, \bar{Q}_{ij}    (3.9)

with

    \bar{Q}_{ij} \equiv \bar{\Pi}^{(neq)}_{ij} - \left(\bar{u}_i F_j + \bar{u}_j F_i\right)    (3.10)

where \bar{\Pi}^{(neq)}_{ij} is the filtered non-equilibrium momentum flux tensor and can be computed as follows:

    \bar{\Pi}^{(neq)}_{ij} = \sum_\alpha e_{\alpha i}\, e_{\alpha j} \left(\bar{f}_\alpha - \bar{f}_\alpha^{(eq)}\right)    (3.11)

Finally, the total relaxation time can be computed directly with the following formula:

    \tau_{total} = \tau + \tau_t = \frac{1}{2}\left(\tau + \sqrt{\tau^2 + 18\sqrt{2}\,\frac{[C_s]^2}{\rho\,\delta t^2\, c^4}\,\sqrt{\bar{Q}_{ij}\bar{Q}_{ij}}}\right)    (3.12)

In chapter 7, this method is implemented on a Multi-GPU platform.
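To make the local subgrid-scale update concrete, the following minimal sketch combines Eqs. 3.4 and 3.6: given a precomputed filtered strain-rate magnitude (obtained e.g. via Eq. 3.9 or by finite differences as in Eq. 3.8), it returns the total relaxation time. It is only an illustration under these assumptions, not the GPU kernel code of this thesis, and the function and parameter names are chosen freely.

/* Smagorinsky relaxation-time update (Eqs. 3.4 and 3.6).
 * s_mag   : filtered strain-rate magnitude |S| (assumed to be precomputed)
 * tau     : molecular (laminar) relaxation time
 * c, dt   : lattice speed and time step
 * c_smago : Smagorinsky constant, 0.1 in this thesis
 * returns : total relaxation time tau_total = tau + tau_t                */
static double total_relaxation_time(double s_mag, double tau,
                                    double c, double dt, double c_smago)
{
    /* turbulent eddy viscosity, Eq. 3.6 */
    double nu_t = c_smago * c_smago * s_mag;

    /* turbulent relaxation time from Eq. 3.4: nu_t = (1/3) tau_t c^2 dt */
    double tau_t = 3.0 * nu_t / (c * c * dt);

    return tau + tau_t;
}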

4. GPU Architecture

Programmable GPUs (graphics processing units) have advanced into highly parallel processors with tremendous computational power. Since GPUs are specialized for compute-intensive, highly parallel computation, they are designed such that more transistors are devoted to data processing rather than to data caching and flow control. As a result, the floating-point capability of GPUs has attracted a lot of attention in the scientific computing community. Especially for problems where the same program is executed on many data elements in parallel (data-parallel computation) or problems with high memory requirements, e.g. the LBM, it has been shown that a great performance boost can be achieved by using GPU architectures [15, 18]. In the next section, the CPU and GPU architectures are compared in more depth.

4.1. CPU vs. GPU

CPUs are optimized for sequential code performance. Therefore, they use more sophisticated control logic and provide large cache memories, but neither control logic nor cache memories contribute to the peak calculation speed. GPUs were originally designed as accelerators for rendering and graphics computations, e.g. for computer games. In most graphical tasks, such as visualization, the same operations are executed for all data elements. As a result, the design of GPUs is optimized for data-parallel throughput computations. Hence, on GPUs much more chip area is dedicated to floating-point calculations. Compared to CPUs, GPUs offer a higher throughput of floating-point operations due to this specialization. The schematic layout of CPUs and GPUs is illustrated in Fig. 4.1. In GPUs, smaller flow control units and cache memories are provided to help control the bandwidth requirements. Algorithms with heavy flow control are therefore a worse fit for GPUs than for CPUs, due to the smaller flow control units provided by their architecture.

Figure 4.1.: The GPU devotes more transistors to data processing. From [5].

The main advantages of GPUs are the massive parallelism and wide memory bandwidth that their architecture provides. This is achieved by providing thousands of smaller, more efficient cores designed for parallel performance, whereas multicore CPUs consist of a few cores optimized for serial processing. Most graphics processing units (GPUs) are designed as single instruction, multiple data (SIMD) architectures. SIMD computers have multiple processing elements that perform the same operation on multiple data points simultaneously. Multicore CPUs are optimized for coarse, heavyweight threads which provide better performance per thread, while GPUs create fine, lightweight threads with relatively poor single-thread performance. For instance, the NVIDIA Tesla M2090 provides 512 processor cores with a 1.3 GHz processor core clock. Each NVIDIA Tesla M2090 has a main memory, or so-called global memory, of 6 GB. In Table 4.1 the specifications of the NVIDIA Tesla M2090 and the older model M2050 are given.

NVIDIA Tesla              | M2050 | M2090
Compute Capability        | 2.0   | 2.0
Code Name                 | Fermi | Fermi
CUDA Cores                | 448   | 512
Main Memory [GB]          | 3     | 6
Memory Bandwidth [GB/s]   | 148   | 177
Peak DP FLOPS [GFLOPS]    | 515   | 665

Table 4.1.: Comparison of NVIDIA Tesla M2050 and M2090. From [5].

In Table 4.2 a comparison of the NVIDIA Tesla M2090 and the Intel Core i7-3770K is provided.

                          | Intel Core i7-3770K | NVIDIA Tesla M2090
Cores                     | 4                   | 512
Main Memory [GB]          | max. 32             | 6
Memory Bandwidth [GB/s]   | 25.6                | 177
Peak DP FLOPS [GFLOPS]    | 112                 | 665

Table 4.2.: Comparison of NVIDIA Tesla M2090 and Intel Core i7-3770K.

Programming models like CUDA and OpenCL make GPUs accessible for general computation like CPUs. OpenCL is currently the dominant open general-purpose GPU computing language. In the following, a brief introduction to OpenCL is given to the extent required for the understanding of this thesis.

4.2. OpenCL

In older graphics cards, the computing elements were specialized to process independent vertices and fragments. With the introduction of direct GPGPU programming interfaces like CUDA and OpenCL, this specialization was removed, allowing programs written in a C-like language to run without using the 3D graphics API.

OpenCL is an open industry standard for programming a heterogeneous collection of CPUs and GPUs organized into a single platform. OpenCL includes a language, an API, libraries and a runtime system to support software development. OpenCL is the first industry standard that directly addresses the needs of heterogeneous computations. It was first released in December 2008, and the early products became available in the fall of 2009. With OpenCL, one can write a single program that can run on a wide range of systems, from cell phones to nodes in massive supercomputers. This is one of the reasons why OpenCL is so important. At the same time, this is also the source of much of the criticism launched at OpenCL. OpenCL is based on four models:

- Platform Model
- Memory Model
- Execution Model
- Programming Model

In the following sections, each of these models is explained in more detail. The content of the following sections is based on the OpenCL specification [10]. A more in-depth explanation of the OpenCL standard can also be found in [6, 16, 13].

Platform Model

The platform model is illustrated in Fig. 4.2. The model consists of a single host and one or more OpenCL devices that are connected to the host. The host performs tasks like I/O or the interaction with a program's user. An OpenCL device can be a CPU, a GPU, a digital signal processor (DSP) or another type of processor. An OpenCL device consists of one or more compute units (CUs). Each CU provides one or more processing elements (PEs) which perform the actual computations on the device.

Figure 4.2.: OpenCL platform model. From [10].
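As a brief illustration of these platform-model objects, the following sketch queries a platform, selects a GPU device and creates a context and a command-queue for it. It is a minimal example and not the initialization code of this thesis (which is described in chapter 6); error handling is omitted.

#include <stdio.h>
#include <CL/cl.h>

/* Query the first platform, pick a GPU device on it, print its name and
 * create a context plus an in-order command-queue.                        */
int init_opencl(cl_context *ctx, cl_command_queue *queue)
{
    cl_platform_id platform;
    cl_device_id device;
    char name[256];

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("using OpenCL device: %s\n", name);

    *ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    *queue = clCreateCommandQueue(*ctx, device, 0, NULL);
    return 0;
}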

Execution Model

An OpenCL application consists of two parts: the host program and one or more kernels. The host program runs on the host. It creates the context for the kernels and manages their execution. The kernels execute on the OpenCL devices. Kernels are simple functions that perform the real work of an OpenCL application. These functions are written in the OpenCL C programming language and compiled with the OpenCL compiler for the target device. The kernels are defined on the host. In order to submit the kernels for execution on a device, the host program issues a command. Afterwards, an integer index space is instantiated by the OpenCL runtime system. For each element in this index space, an instance of the kernel is created and launched. Each instance is called a work-item. In order to identify the work-items, their coordinates in the index space are used. These coordinates are called global IDs and are unique for each work-item. While each work-item executes the same instructions, the data processed by each work-item can vary through the use of the global ID. Work-items are organized into work-groups. Work-groups are assigned a unique ID with the same dimensionality as the index space used for the work-items. Work-items are assigned a unique local ID within their work-group. As a result, a single work-item can be uniquely identified in two ways: (1) by its global ID, or (2) by a combination of its local ID and its work-group ID.

Memory Model

The OpenCL standard offers two types of memory objects: buffer objects and image objects. A buffer object is a contiguous block of memory which is available to the kernels. A programmer can initialise the buffers with any type of information from the host and access the buffers through pointers. Image objects are restricted to holding images. A summary of the OpenCL memory model and its interaction with the platform model is depicted in Fig. 4.3.

Figure 4.3.: OpenCL memory model. From [13].

OpenCL defines four distinct memory regions that executing work-items have access to:

- Global Memory: All work-items in all work-groups have read/write access to this memory region. Depending on the capabilities of the device, reads and writes to this memory region may be cached [10].

- Constant Memory: This memory remains constant during the execution of a kernel. Constant memory is allocated in the global memory region. The initialization of constant memory is done by the host [10].

- Local Memory: This memory region is local to a work-group. It can be used to allocate variables that are shared by all work-items of that work-group. Depending on the device, local memory can be mapped onto the global memory or it can have its own dedicated memory [10].

- Private Memory: Variables defined in one work-item's private memory are not visible to another work-item [10].

Programming Model

This section describes how a programmer can map parallel algorithms onto OpenCL using a programming model. OpenCL supports two different programming models: task parallelism and data parallelism, as well as hybrids of these two models. The primary model behind the design of OpenCL is data parallelism.

Data Parallelism

The data-parallel programming model is the main idea behind OpenCL's execution model. In this parallel programming model, the same sequence of instructions is applied to multiple elements of a memory object. Normally, access to the memory object is accomplished via the index space associated with the OpenCL execution model. The programmer needs to align the data structures of the problem with the index space in order to access the correct data in the memory object.

Task Parallelism

Although OpenCL is designed for data parallelism, it can also be used to achieve task parallelism. In this case, a single instance of a kernel is executed independently of any index space. Since task parallelism is not used in this thesis, the details are not described here.

Advanced OpenCL Event Model Usage

OpenCL Usage Models

In this section, various usage models of the OpenCL standard are explained. In order to perform an operation on OpenCL objects like memory, kernel and program objects, a command-queue is used. In addition, command-queues are used to submit work to a device. The commands queued in a command-queue can execute in-order or out-of-order. Having multiple command-queues allows an application to queue multiple independent commands. Note that sharing objects across multiple command-queues or using an out-of-order command-queue requires explicit synchronisation of the commands. Based on these choices, various OpenCL usage models can be defined [8]. In the following, a few of these models are described:

- Single Device In-Order Usage Model (SDIO): This usage model is composed of a simple in-order queue. All commands execute on a single device and all memory operations occur in a single memory pool.

- Single Device Out-of-Order Usage Model (SDOO): This model is the same as the SDIO model, but with an out-of-order queue. As a result, there are no guarantees on the execution order and the device starts executing commands as soon as possible. It is the responsibility of the developer to ensure program correctness by analyzing the command dependencies.

- Single Device Multi-Command Queue Usage Model (SDMC): In this model, multiple command-queues are employed to queue commands to a single device. The model can be applied in order to overlap the execution of different commands or to overlap commands and host/device communication. Depending on the GPU's compute capacity, the SDMC model makes it possible to launch several kernels concurrently on the same device.

Depending on the parallelization algorithm, each OpenCL usage model can lead to different performance and scalability results. Some of these models are applied when implementing the more sophisticated overlapping techniques.

Synchronization Mechanisms

In order to ensure that changes to the state of a shared object (such as a command-queue object or a memory object) occur in the correct order (for instance, when multiple command-queues in multiple threads are making changes to the state of a shared object), the application needs to implement appropriate synchronisation across the threads on the host processor by constructing a task graph. The OpenCL event model provides the ability to construct complicated task graphs for the tasks enqueued in any of the command-queues associated with an OpenCL context. In addition, OpenCL events can be used to interact with functions on the host through the callback mechanism defined in OpenCL 1.1. These OpenCL features and their application in implementing the Multi-GPU LBM simulation with overlapping techniques are described in the following sections.

Events

An event is an object that can be used to determine the status of commands in OpenCL. Events can be generated by commands in a command-queue, and other commands can use these events to synchronise themselves. Hence, event objects can be used as synchronisation points. All clEnqueue*() functions can return event objects. An event can be passed as the final argument to the enqueue functions. In addition, a list of events can be passed to an enqueue function to specify a dependence list. Based on this dependence list, which in OpenCL terminology is called the event wait list, the command will not start its execution until all events in the list have completed. The following code demonstrates an example of OpenCL event-based synchronization:

Listing 4.1: OpenCL event based synchronization.

cl_uint num_events_in_waitlist = 2;
cl_event event_waitlist[2];

/* size and the host pointers denote the transfer size and destinations */
err = clEnqueueReadBuffer(queue, buffer0, CL_FALSE /* non-blocking */, 0,
                          size, host_ptr0, 0, NULL, &event_waitlist[0]);
err = clEnqueueReadBuffer(queue, buffer1, CL_FALSE /* non-blocking */, 0,
                          size, host_ptr1, 0, NULL, &event_waitlist[1]);

/* the last read buffer waits on the previous two read-buffer events */
err = clEnqueueReadBuffer(queue, buffer2, CL_FALSE /* non-blocking */, 0,
                          size, host_ptr2, num_events_in_waitlist,
                          event_waitlist, NULL);

User Events

Events can also be used to synchronize the commands running within a command-queue and functions executing on the host. This can be done by creating so-called user events. A user event can be used in OpenCL enqueue functions like any other event; in this case, the execution status of the event is set explicitly. Creating a user event on the host is accomplished with the clCreateUserEvent function.

Callback Events

Callbacks are functions that are invoked asynchronously when the associated event reaches a specific state. A programmer can associate a callback with an arbitrary event. Using the OpenCL callback mechanism can be beneficial especially for applications in which the host CPU would otherwise have to wait while the device is executing, which can lead to worse system efficiency. In such cases, by attaching a callback to a host function, the CPU can do other work instead of spinning while waiting on the GPU. The clSetEventCallback function is used to register a callback for an event.

Using Events for Profiling

Performance analysis is a crucial part of any HPC programming effort. Hence, the mechanism to profile OpenCL programs uses events to collect the profiling data. To make this functionality available, the command-queue should be created with the flag CL_QUEUE_PROFILING_ENABLE. Then, by using the function clGetEventProfilingInfo(), the timing data can be extracted from event objects. A sample code of this process is shown in Listing 4.2.

Listing 4.2: Extracting profiling information with OpenCL events.

cl_event event;
cl_ulong start;     // start time in nanoseconds
cl_ulong end;       // end time in nanoseconds
cl_float duration;  // duration time in milliseconds

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
duration = (end - start) * 1.0e-6f;  // nanoseconds to milliseconds

CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END are flags used to return the value of the device time counter in nanoseconds for the start and the end of the command associated with the event, respectively.
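To complement the listings above, the following minimal sketch shows how a user event and a callback can be combined: a kernel launch is gated by a user event that the host completes later, and a callback is registered on a device-to-host transfer. It is only an illustration; the function and variable names (e.g. on_transfer_done) are not taken from the thesis code.

#include <stdio.h>
#include <CL/cl.h>

/* invoked asynchronously once the associated command reaches CL_COMPLETE */
static void CL_CALLBACK on_transfer_done(cl_event ev, cl_int status, void *user_data)
{
    printf("transfer finished: %s\n", (const char *)user_data);
}

void user_event_example(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                        cl_mem buffer, void *host_ptr, size_t size)
{
    /* user event: the kernel will not start until the host marks it complete */
    cl_event gate = clCreateUserEvent(ctx, NULL);

    size_t global = 1024;
    cl_event kernel_done, read_done;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           1, &gate, &kernel_done);

    /* read back the result once the kernel has finished */
    clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0, size, host_ptr,
                        1, &kernel_done, &read_done);
    clSetEventCallback(read_done, CL_COMPLETE, on_transfer_done,
                       (void *)"device buffer");

    /* ... other host-side work could be done here ... */

    clSetUserEventStatus(gate, CL_COMPLETE);  /* release the gated kernel */
    clFinish(queue);
}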


5. Single-GPU LBM

By providing advantages like massively parallel processing and wide memory bandwidth, GPU computing allows LBM simulations to achieve high performance. The fact that the computation of LBM cells can be performed independently makes this method well suited for parallelization on the GPU. This chapter is intended to give a brief introduction to the implementation of the Single-GPU LBM with the OpenCL API. For more in-depth information regarding the Single-GPU implementation, see [17]. The Multi-GPU implementation in chapter 6 uses this work as the initial framework.

5.1. OpenCL Implementation

This section describes the implementation of the LBM using the OpenCL standard. One of the most important design aspects of simulation software accelerated with GPUs is the memory access pattern used in the software. In the following sections, the memory layout used in the Single-GPU implementation for storing and accessing the simulation data, i.e. the density distribution values, is discussed.

Memory Layout

Since the global memory on GPUs is still restricted, designing an efficient and frugal memory layout for LBM simulations on GPUs plays a crucial role in achieving optimized software. Several memory layouts to store the density distributions have been developed by different research groups. In the following, two of these memory layouts, namely the A-B pattern and the A-A pattern, are discussed. For the Single-GPU LBM used in this thesis, the A-A pattern memory layout is used, which consumes the lowest amount of memory [3].

A-B Pattern

In this memory layout, after applying the collision operator, the density distribution values are stored in the same memory location. The propagation operator reads the density distribution value of the adjacent cell in the opposite direction of the current lattice vector and saves the value in the corresponding lattice vector direction of the current cell. The A-B memory layout is illustrated in Fig. 5.1.

Figure 5.1.: Memory layout of the A-B pattern, showing the density distributions of the model and the data storage of the implementation for the collision and propagation steps. From [17].

In order to avoid race conditions when reading from and writing to the same cells in a parallel implementation, an additional density distribution buffer is required. As a result, the A-B pattern doubles the memory consumption due to the additional buffer. This issue is addressed by the A-A memory layout.

A-A Pattern

The main design goal of the A-A pattern is to reduce the memory demand while maintaining almost the same performance. The A-A pattern achieves this goal by using two different kernels for odd and even time steps; in the following, these kernels are referred to as alpha and beta kernels (see [17]).

The alpha kernel reads the distribution values in the same way as the A-B pattern. After applying the collision operator, the new values are stored in the opposite lattice vector direction of the current cell. Hence, the alpha kernel does not change the values of other cells at all and only accesses the values of the current cell.

In contrast, the beta kernel reads and writes density distribution values only of adjacent cells. The values are read from the adjacent cells in the opposite direction of the current lattice vector. After applying the collision operator, the new values are stored in the adjacent cell in the direction of the current lattice vector. Figure 5.2 demonstrates the A-A memory access pattern.

Figure 5.2.: Memory layout of the A-A pattern, showing the combined collision and propagation of the beta and alpha steps for the density distributions of the model and the data storage of the implementation. From [17].

Together with the alpha kernel, this access rule implicitly implements the propagation operator.

Data Storage

The density distributions of each lattice vector are stored linearly in the global memory of the GPU. To achieve coalesced memory access, first all x components are stored, then the y and z components. The density distribution values of the next lattice vector are stored consecutively afterwards. A schematic illustration of this memory layout is given in Fig. 5.3.

Figure 5.3.: Density distribution memory layout: for each of the lattice vectors 0 to 18, the values of the cells (0,0,0), (0,0,1), (0,0,2), ..., (3,3,2), (3,3,3) are stored contiguously. From [17].

The velocity vectors of each cell are stored in the same way as the density distribution values. In addition, the density values and cell type flags are stored consecutively in separate buffers. More information about the memory layout of the Single-GPU LBM can be found in [17].
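The following minimal sketch shows an index computation that is consistent with the layout sketched in Fig. 5.3, assuming that the x component varies fastest within the block of one lattice vector; the function name and the ordering details are illustrative rather than taken from the thesis code.

#include <stddef.h>

/* Linear index of the density distribution value of lattice vector i for the
 * cell (x, y, z) in a structure-of-arrays layout: all cells of distribution 0
 * first (x varying fastest), then all cells of distribution 1, and so on.
 * nx, ny, nz are the domain sizes in cells.                                  */
static size_t dd_index(size_t i, size_t x, size_t y, size_t z,
                       size_t nx, size_t ny, size_t nz)
{
    const size_t cells = nx * ny * nz;
    return i * cells + (z * ny + y) * nx + x;
}

With this layout, work-items that process neighboring cells along the x direction access consecutive memory addresses, which is what enables the coalesced memory accesses mentioned above.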


6. Multi-GPU LBM

The simulation of real-world scenarios is usually very compute-intensive. In addition, the main memory of one compute device is commonly not sufficient to meet the memory demands (e.g., 6 GB on an NVIDIA M2090 GPU). Using multiple GPUs efficiently for the LBM can help to fulfill the memory requirements and, as a result, makes it possible to run simulations with a higher number of unknowns (weak scaling). However, the use of multiple GPUs demands more sophisticated memory management, communication and synchronization techniques in order to avoid communication overhead in a distributed and even in a shared memory system.

To overcome the previously stated challenges in a Multi-GPU LBM simulation, sophisticated optimization techniques are required. In the following section, the parallelization paradigm used in this thesis to proceed from the Single-GPU to a Multi-GPU implementation is described. Additionally, the main components of the software, which are crucial for a good software design, are explained comprehensively. In order to achieve good performance and overcome the communication overhead, various techniques for overlapping computation and communication are implemented and their efficiency is investigated. At the end, the benchmark results of the software on the GPU MAC cluster are presented.

6.1. Parallelization Models

There are two parallelization models: shared memory and distributed memory parallelization. In a shared memory model, as the name implies, different processes share a global address space and asynchronously read and write data to it. In a distributed memory model, the processes exchange data by passing messages to each other in an asynchronous or synchronous way. The most common standard message-passing system is MPI (Message Passing Interface). For this thesis, the distributed memory model is adopted and the data exchange between GPUs is accomplished via MPI.

An essential aspect of any parallelization paradigm is problem decomposition. The two types of problem decomposition are Task Parallelism and Data Parallelism. Task Parallelism focuses on the processes of execution. In contrast, with Data Parallelism, a set of tasks operates on a data set, but independently on separate partitions. Since the same computations are applied to each domain cell in the LBM and the cells are completely independent of each other, the LBM is a perfect candidate for Data Parallelism. In the next section, these parallelization paradigms are explained specifically for LBM simulations.

Domain Decomposition

Domain decomposition is the most common way to parallelize LBM codes. The term domain decomposition means that the computational domain is divided into several smaller domain parts which are distributed to several computational units. Each domain partition is assigned to one MPI process and one GPU, which is responsible for the computation of that partition. As a result of data dependencies, the MPI processes need to exchange data with each other. The required data, based on the LBM data dependencies, is the outer layer of the process-local simulation domain. Therefore, each MPI process extracts the data that is required by other processes and sends it to the receiving process. The received data is saved in the so-called ghost layer subregion of the corresponding simulation data.

Ghost Layer Synchronization

Normally, the ghost layer data is only used during the local computation of the subregion and can safely be overwritten by new data in the next simulation step. As described in section 5.1, the α-kernel only accesses the values of the current cell for collision and propagation. The general approach of ghost layer data exchange can therefore be applied to the data synchronization after an α time step. In this thesis, we call this α-synchronization.

Figure 6.1.: α-synchronization. In α-synchronization, all lattice vector values are exchanged.

In contrast to the α-synchronization, the β-kernel reads the density distribution values from the adjacent cells and, after performing the computation, writes the results to the adjacent cells in the direction of the lattice vector of the computed density distribution.

As a consequence, the GPU threads computing the cells neighboring the ghost layers write their computation results into the ghost layer data. This data is required for the next simulation step of the process that the ghost layer data originally comes from. Thus, the new data in the ghost layer has to be sent back to the original process. This procedure is demonstrated in Fig. 6.2. We call this procedure β-synchronization.

Figure 6.2.: β-synchronization. In the β-synchronization, only the red lattice vectors are exchanged.

CPU/GPU Communication

The simulation data of each subdomain is stored in the global memory of the corresponding GPU. Therefore, before performing MPI communication, the data has to be transferred to the host memory. The data is sent over the host-device bus system (e.g. PCI Express). In contrast to MPI communication between CPUs only, Multi-GPU communication requires an additional step to transfer data from the GPU memory to the host memory. This process is presented in Fig. 6.3.

Figure 6.3.: Activity diagram of MPI communication for a Multi-GPU simulation: GPU-CPU copy (PCI Express transfer), send via MPI (InfiniBand), receive via MPI (InfiniBand), CPU-GPU copy (PCI Express transfer).
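The following minimal sketch illustrates the three-step exchange of Fig. 6.3 for one ghost layer, combining OpenCL device/host transfers with a blocking MPI exchange. Buffer names and the use of float data are assumptions for the illustration and do not correspond to the actual Comm/Controller implementation described later.

#include <mpi.h>
#include <CL/cl.h>

/* Three-step ghost-layer exchange: device-to-host copy, MPI exchange with
 * the neighboring rank, host-to-device copy.  send_buf and recv_buf hold
 * one ghost layer of 'count' values.                                        */
void exchange_ghost_layer(cl_command_queue queue, cl_mem d_send, cl_mem d_recv,
                          float *send_buf, float *recv_buf, size_t count,
                          int neighbor_rank, MPI_Comm comm)
{
    /* (1) copy the boundary layer from GPU global memory to host memory */
    clEnqueueReadBuffer(queue, d_send, CL_TRUE, 0, count * sizeof(float),
                        send_buf, 0, NULL, NULL);

    /* (2) exchange the layers between the two MPI processes */
    MPI_Sendrecv(send_buf, (int)count, MPI_FLOAT, neighbor_rank, 0,
                 recv_buf, (int)count, MPI_FLOAT, neighbor_rank, 0,
                 comm, MPI_STATUS_IGNORE);

    /* (3) copy the received ghost layer back to GPU global memory */
    clEnqueueWriteBuffer(queue, d_recv, CL_TRUE, 0, count * sizeof(float),
                         recv_buf, 0, NULL, NULL);
}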

To get a better intuition of the Multi-GPU MPI communication process, a geometric overview of the data communication in a Multi-GPU LBM simulation is provided in Fig. 6.4.

Figure 6.4.: A geometric overview of Multi-GPU MPI communication. In this figure, the three forms of communication, namely GPU-GPU, GPU-CPU and MPI communication, are shown. In this thesis, local communication is not implemented. From [5].

In the case that a host manages more than one GPU, no MPI communication is required for the data exchange and the communication can be performed locally, since the GPUs can access each other's global memory directly, or indirectly through the host. However, this thesis focuses on MPI communication only.

6.2. Software Design

From the beginning of this thesis, the software design has been an important aspect. In the design process, the following aspects are considered:

- Extensibility: Adding new capabilities to the software can be accomplished effortlessly, without restructuring major parts of the software components and their interrelations.

- Modularity: The software comprises independent modules, which promotes better maintainability. In addition, the components can be tested in isolation before being integrated into the software.

- Maintainability: As a consequence of modularity and extensibility, bug localization is simplified.

- Efficiency: The software is optimized in many aspects. Data structures are chosen in a way that consumes less memory and provides the performance needed for massively parallelized large-scale simulation software.

- Scalability: Scalability is the major design goal of the software developed in this thesis. Sophisticated techniques like the overlapping of work and communication are implemented in order to achieve good scalability for large-scale simulations.

Apart from these HPC and design aspects, the software developed in this thesis also offers:

- Storing visualization data in VTK format (legacy format and XML binary format)
- Automatic profiling of different code modules with tools like Scalasca [7], Vampir and Valgrind [14]

In the following, a detailed overview of the available modules and their underlying design strategy is provided.

Modules

In this section, all the modules developed in this thesis and their interoperability are discussed. Figure 6.5 presents a general overview of the simulation flow. The figure was generated using Callgrind. In addition, the caller/callee relationships between the most essential functions of the simulation software are shown.

Figure 6.5.: Simulation callgraph. Each node in the graph represents a function, and each edge represents calls. The cost shown per function is the cost spent while that function is running.

In the following, a detailed overview of the software modules is provided:

Manager Class: The Manager class is responsible for the management of general tasks in the simulation process that are not specific to one particular subdomain, e.g. the domain decomposition and the assignment of subdomains to different processes. These tasks are normally carried out during the initialization phase and need to be done only once during the simulation.

The class features one template parameter, T, which determines the data type of the simulation data stored in memory. Using this template parameter makes it possible to run the simulation software on GPUs with single or double precision support. This template parameter is also used in most of the other software modules developed in this thesis.

Simulation Parameters: The Manager class stores the grid information such as the domain length and the number of lattice cells in each direction. Furthermore, the class also saves the number of subdomains for the domain decomposition, which is specified by the user via the configuration file or the command line arguments. The grid information and the number of subdomains are passed as constructor arguments when instantiating the Manager class. Using this information, the Manager instance computes the size of each subdomain and its location in the entire grid.

Domain Partitioning: The Manager class is also responsible for assigning the tasks (the computation of each subdomain) to the MPI processes. Distributing the workload across multiple processes can be performed using various strategies. An optimal strategy is the one that causes the least amount of communication between the MPI processes.

Partition Boundaries: In addition, based on the location of each subdomain in the simulation grid, the class determines appropriate boundary conditions for each subdomain. For instance, for subdomains that share a domain face with neighboring subdomains, the boundary condition for that face is set to FLAG_GHOST_LAYER. This flag specifies that the computation related to this face can only be accomplished after the corresponding ghost layer data has been fetched from the neighboring subdomain.

Communication: After detecting a ghost layer face, the Manager class initiates a Comm object. The Comm class contains the information required for the communication with the neighbors, like the size and origin of the data to send and receive. The Comm class is explained in more detail in the following sections.

Simulation Geometry: The Manager class is also responsible for setting the geometry of the simulation domain. This is achieved by assigning appropriate flags to each cell of the subdomain. FLAG_OBSTACLE and FLAG_FLUID are examples of currently available flags.

Controller: The Manager class, as previously stated, is responsible for the tasks related to the initialization phase of the simulation. Two strategies are available: in the first strategy, the initialization is done on the root process and the results are sent to the other processes via MPI send commands, while in the second strategy each process performs the initialization phase separately. Which strategy leads to better performance depends on the initialization/communication time ratio. In this thesis, since the initialization operations are not compute-intensive, each process performs the initialization of its own subdomain individually in order to avoid communication between the MPI processes. As a consequence, every MPI process instantiates its own Manager class, which carries out the initialization operations of the corresponding subdomain. In addition, every Manager instance aggregates a Controller instance which controls the simulation procedure of that subdomain.

Controller Class: The Controller class provides an interface to control the simulation steps on the corresponding subdomain. For every subdomain, one Controller instance is created. Each Controller instance has a unique ID which is used in the MPI communication as the identifier of the subdomain. Each Controller instance needs to communicate with the neighboring Controller instances in order to synchronize the ghost layer data. This task is accomplished by the syncAlpha and syncBeta functions, which send the alpha and beta ghost layer data, respectively. An overloaded version of these functions, with two additional arguments of types MPI_Request and MPI_Status, provides the ability to perform non-blocking MPI communication. The required MPI communication information is stored in the Comm instance assigned to the corresponding communication. For subdomains with more than one neighbor, the Comm instances are stored in a C++ std::map container with the position of the neighboring domain (the direction of the communication, a C++ enumeration) as the key of the map. This makes it possible to access the communication information of each direction independently. This feature is used later in order to synchronize the ghost layer data of each direction separately (see section 6.4).

Before sending the ghost layer data, the functions storeDataAlpha and storeDataBeta load the data into the send buffers. The setDataAlpha and setDataBeta functions place the data received from neighboring subdomains at the proper positions in the local simulation data of the current subdomain. These functions are overloaded twice. The overloads with one argument of type MPI_COMM_DIRECTION perform their task only in the direction specified by the argument. There is also another overloaded version of these functions which, in addition to the direction, provides the ability to exploit the event-based OpenCL synchronization features by adding three more OpenCL event arguments.

The strategy behind the different optimization methods is implemented in the simulationStepAlpha and simulationStepBeta functions. Basically, these functions encapsulate all operations needed to perform one alpha or beta simulation step, respectively. For example, which part of the computational domain is computed first and the order in which the computation and communication operations are performed can be encapsulated in these functions (see section 6.4). The Controller class performs the alpha and beta simulation steps by utilizing the lbm_solver attribute, which is of type LbmSolver. This class is explained in the following section. The function initLbmSolver, which is called in the constructor of the Controller class, queries the local GPUs and creates the OpenCL platforms, contexts and command-queues. It also chooses an appropriate available GPU device for performing the computations. In order to apply the modularity principle in the software design, the Controller class does not directly enqueue OpenCL kernels into the command-queues; the LbmSolver class is employed for this purpose. Therefore, the lbm_solver attribute is initialized in this function, and the OpenCL context and device values are passed as constructor arguments of the LbmSolver class when instantiating this attribute. Furthermore, the Controller class aggregates a visualization class instance in order to visualize the simulation data in each time step.
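As an illustration of the non-blocking synchronization described above, the following minimal sketch exchanges one ghost layer with one neighbor using MPI_Isend/MPI_Irecv. The buffer handling and names are assumptions for the example and do not reproduce the actual syncAlpha/syncBeta code.

#include <mpi.h>

/* Non-blocking ghost-layer exchange for one communication direction.
 * send_buf must already be filled (e.g. by storeDataAlpha); recv_buf is
 * consumed (e.g. by setDataAlpha) after the requests have completed.       */
void sync_direction_nonblocking(float *send_buf, float *recv_buf, int count,
                                int neighbor_rank, MPI_Comm comm,
                                MPI_Request requests[2])
{
    MPI_Irecv(recv_buf, count, MPI_FLOAT, neighbor_rank, 0, comm, &requests[0]);
    MPI_Isend(send_buf, count, MPI_FLOAT, neighbor_rank, 0, comm, &requests[1]);
    /* the caller overlaps GPU work here and later calls
     * MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);                        */
}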

LbmSolver Class: This class works as a wrapper around all available OpenCL kernels. It provides interface functions for enqueueing computation and device/host communication kernels. The function reload allocates the OpenCL memory objects which contain the local computation results. Additionally, it creates the OpenCL kernels, compiles them and assigns the kernel arguments. Enqueueing OpenCL commands into the command-queues is done only through the interface provided by this class. For instance, it implements the functions simulationStepAlpha and simulationStepBeta, which enqueue the OpenCL kernels that perform the alpha and beta computations on the entire local domain. In addition, the simulationStepAlphaRect and simulationStepBetaRect functions give the user the capability to run the alpha and beta kernels on a rectangular part of the domain. To use these functions, the user needs to provide the origin and size of the rectangular part. The rectangular functionality is also available for the functions which store and set the data from and to the device. With the help of these functions, it is possible to modify the data of a specific rectangular part of the whole domain. This feature is used for getting and setting the ghost layer data. Most of the functions provided in this class are overloaded to exploit the OpenCL event synchronization mechanism. The usage of these functions is explained in later sections.

Comm Class: The Comm class encapsulates the information required for the MPI communication between Controller instances. The public interface of the class offers functions for accessing the MPI destination rank as well as the origin and size of the data to send to and receive from that rank. The Comm class also allocates the receive and send buffers in its constructor, so the buffers are allocated only once for all simulation steps.

Configuration Class: This class is in charge of parsing the general settings of the simulation process and providing global access to these settings. The simulation configuration can be set through command line arguments or by providing an XML file. An example of the XSD schema is available in appendix A. The configuration file name can be passed as a constructor argument when instantiating the Configuration class, or it can be specified as the argument of the loadFile function. The simulation settings are divided into four categories: physics, grid, simulation and device settings. The settings of each category are available under the corresponding XML element in the configuration file (see appendix A). The class should provide a global point of access to the settings. This is achieved via the singleton software design pattern. The singleton pattern ensures that a class has only one instance, which is globally available through the function Instance(). In order to be able to turn any class into a singleton, a C++ singleton template class is implemented. By using the Configuration class as the template parameter of the Singleton class, it can easily be used as a singleton.

LbmVisualizationVTK Class: All visualization classes have to inherit from the abstract base class ILbmVisualization, which defines the interface of a visualization class compatible with the software developed in this thesis.

The ILbmVisualization interface supplies two functions, namely setup() and render(). The first function is used for the initialization of the visualization process. The render() function should be called whenever the simulation data needs to be visualized; its functionality depends on the implementing class. In the case of the LbmVisualizationVTK class, render() saves the simulation data to a VTK file.

6.3. Basic Implementation

To develop the software in this thesis, the following simple, disciplined and pragmatic approach to software engineering, which has been attributed to Kent Beck, is applied: Make It Work, Make It Right, Make It Fast. According to this approach, first a version of the software is developed which works correctly and fulfills the basic goals of the thesis. In the next step, various methods to improve this basic design are implemented and verified. Finally, in the last step of development, the software is optimized in many respects to gain the best performance in terms of execution time and scalability for large-scale simulations.

This section describes the implementation of the basic method. The basic method computes all subdomains by employing one GPU per subdomain. After all computations of a time step are completed, the boundary regions of the subdomains are exchanged between the GPUs. A GPU cannot directly access the global memory of other GPUs; as a result, the host CPUs are used as a bridge for the data exchange between GPUs. The data exchange is composed of the following three steps: (1) data transfer from GPU to CPU, (2) data exchange between CPUs via MPI, (3) data transfer back from CPU to GPU [18]. A schematic overview of this three-step communication is illustrated in Fig. 6.6.

Figure 6.6.: Schematic timeline for the basic method. From [18].

The code developed for this approach does not require sophisticated synchronization techniques, since the MPI communication is performed only after the computation of all subdomain cells is accomplished. A sample of an α-iteration is shown in Listing 6.1.

Listing 6.1: Implementation of the non-optimized basic method for computing one alpha simulation step.

    // First the alpha computation of all cells is performed.
    cLbmPtr->simulationStepAlpha();
    // The communication phase starts after the computation phase
    // has completely finished.
    syncAlpha();
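The communication phase of the basic method thus combines the GPU-to-host copy, the MPI exchange, and the host-to-GPU copy described in section 6.2. A minimal sketch of such a three-step exchange for a single boundary is shown below; the buffer names, the use of blocking calls, and the wrapping into one helper function are assumptions for illustration and do not mirror the thesis code exactly.

    // Sketch of the three-step ghost-layer exchange of the basic method
    // (assumed buffer names and blocking calls, for illustration only).
    #include <vector>
    #include <mpi.h>
    #include <CL/cl.h>

    void exchangeGhostLayer(cl_command_queue queue,
                            cl_mem d_send, cl_mem d_recv,   // device-side ghost-layer buffers
                            size_t bytes, int neighbor_rank)
    {
        std::vector<char> h_send(bytes), h_recv(bytes);

        // (1) GPU -> CPU: copy the ghost layer into a host buffer
        clEnqueueReadBuffer(queue, d_send, CL_TRUE, 0, bytes, h_send.data(), 0, NULL, NULL);

        // (2) CPU <-> CPU: exchange the ghost layers with the neighboring process
        MPI_Sendrecv(h_send.data(), static_cast<int>(bytes), MPI_BYTE, neighbor_rank, 0,
                     h_recv.data(), static_cast<int>(bytes), MPI_BYTE, neighbor_rank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // (3) CPU -> GPU: copy the received ghost layer back to device memory
        clEnqueueWriteBuffer(queue, d_recv, CL_TRUE, 0, bytes, h_recv.data(), 0, NULL, NULL);
    }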

Although the implementation of this method is straightforward, the communication overhead introduced by the three-step communication scheme described above can reduce the performance of the simulation. We expect its impact to become even more significant when more GPUs are used. In this method, the computation of the whole domain has to be accomplished before the boundary data can be exchanged. Therefore, the GPU/CPU transfer operations and the MPI communication stay idle for a long time. In the next sections, techniques which avoid these communication costs by overlapping the communication and computation parts are investigated.

6.4. Overlapping Work and Communication

Increasing the performance becomes more and more difficult, primarily because the inter-node communication cannot keep up with the performance increase of massively parallel GPUs. Achieving good parallel efficiency on distributed-memory machines therefore requires more advanced programming techniques which hide the communication overhead by overlapping methods. A possibility to avoid the communication overhead is to perform the communication in parallel to the actual simulation. This requires two OpenCL LBM kernels: one updates the outer layer of the block, and one updates the inner region. In this way, the computation of the outer boundary is accomplished first. Next, the computation of the inner part as well as the extraction, insertion and MPI communication of the boundary data are executed asynchronously. Hence, the time spent for all PCIe transfers and the MPI communication can be hidden by the computation of the inner part, provided that the inner kernel takes longer than the communication; if this is not the case, only part of the communication is hidden [5].

Implementing such overlapping techniques requires the asynchronous execution of GPU kernels, GPU-CPU copy operations and MPI communication commands on the CPU. Both the OpenCL and the MPI standard provide advanced synchronization mechanisms for this purpose; the OpenCL event mechanism is discussed in section 4.2.5. The operations used in a Multi-GPU LBM simulation fall into the following four categories:

GPU Computation: the operations that perform the actual computation of the collision operators. These operations run solely on the GPU.

GPU-CPU Communication: the operations that are responsible for transferring updated values from GPU memory to host memory and vice versa.

CPU Computation: in a hybrid model, the CPU is utilized in addition to the GPU to perform a part of the computations. However, this is not exploited in our implementation.

MPI Communication: all MPI commands which are used to synchronize the ghost layer data.

The primary idea of the techniques designed in this thesis is to overlap the operations of these four categories, reducing the critical path length of the dependency graph.

To overlap the GPU computation and the GPU-CPU communication operations, the advanced OpenCL event synchronization techniques described in section 4.2.5 can be exploited. In addition, if several kernels are used to compute different areas of the subdomain, the device should be capable of running multiple kernels simultaneously. In the CUDA programming model, this can be achieved by exploiting the CUDA stream concept. Achieving the same result with OpenCL is theoretically possible by using several OpenCL command-queues associated with the same device. The current OpenCL specification does not mandate this capability, so its availability depends entirely on the vendor providing the OpenCL driver. Overlapping GPU operations with the CPU computation category can be achieved by taking advantage of the OpenCL callback mechanism; it will be shown in later sections that the overhead introduced by the callback mechanism significantly degrades the performance and scaling of the software. By using non-blocking MPI commands, CPU operations and MPI communication can run asynchronously. Based on these overlapping techniques for the Multi-GPU LBM, several approaches are designed and implemented in this thesis, and their performance results are compared, in the following sections.

SBK-SCQ Method

The Single Boundary Kernel Single Command-Queue (SBK-SCQ) method consists of one boundary kernel that computes all boundary values of every direction at once. The main idea behind this approach is to utilize this kernel to first update only the boundary values; the communication process and the computation of the inner cells are then performed asynchronously. Fig. 6.7 illustrates a schematic timeline of the operations performed in this method.

Figure 6.7.: Schematic timeline for the SBK-SCQ method (rows: MPI communication, CPU computation, GPU-CPU communication and GPU computation for the six boundary directions and the inner part).

After the boundary values have been computed, the data transfer operations are triggered. During the communication phase, the boundary values of each direction are first transferred separately from the GPU memory to the host memory. This is achieved by enqueueing OpenCL data transfer commands to the single command-queue. In the next step, the ghost layer data are exchanged between the MPI processes by using non-blocking MPI send/receive functions. Since in this method no concurrent OpenCL kernel execution can be performed on the device, GPU computation and GPU-CPU transfer operations cannot overlap. However, as soon as the transfer operation of one boundary is finished, the corresponding MPI communication can be executed on the host while the transfer operations of the other boundaries are executed on the GPU, asynchronously. Finally, each simulation step concludes with receiving the data from the neighboring subdomains. This data needs to be transferred from the host memory to the correct locations in the GPU memory. Again, separate transfer operations are executed for each boundary in this phase. With only one OpenCL command-queue, overlapping the GPU computation and the GPU-CPU transfer operations is not possible. As a result, the computation of the inner part of the domain can be triggered only after the data transfer operations are accomplished.

A schematic sample code of this method is given in Listing 6.2. simulationStepAlphaBoundaries() is the function that enqueues the boundary kernel, which computes only the boundary values. The functions storeDataAlpha() and setDataAlpha() are responsible for transferring the data from the GPU memory to the host memory and vice versa; their single argument specifies the intended boundary location. syncAlpha() performs non-blocking MPI send and receive operations. In addition to the direction of communication, this function has two output arguments of type MPI_Request for the send and receive operations. These arguments are used later to query the status of the MPI operations. While the communication operations are in flight, the computation of the inner part of the domain is triggered by the simulationStepAlphaRect() function.

Listing 6.2: Implementation of the SBK-SCQ method for computing one alpha simulation step.

    //////////////////////////////////////////////////////
    // --> One kernel computing the entire boundary cells
    //////////////////////////////////////////////////////
    cLbmPtr->simulationStepAlphaBoundaries(0, NULL, NULL);

    storeDataAlpha(MPI_COMM_DIRECTION_X_0);
    syncAlpha(MPI_COMM_DIRECTION_X_0, &req_send_x0, &req_recv_x0);

    storeDataAlpha(MPI_COMM_DIRECTION_X_1);
    syncAlpha(MPI_COMM_DIRECTION_X_1, &req_send_x1, &req_recv_x1);

    // same for y and z directions.

    // --> Computation of inner part
    cLbmPtr->simulationStepAlphaRect(inner_origin, inner_size, 0, NULL, NULL);

    MPI_Status stat_recv;
    MPI_Wait(&req_recv_x0, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_0);

    MPI_Wait(&req_recv_x1, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_1);

    // same for y and z directions.

As shown in Fig. 6.7, there are many gaps between the four kinds of operations. In order to improve the overall performance, the timeline should be as tight as possible. This can be achieved by overlapping more independent operations. The identification of independent tasks requires a more advanced dependency analysis of the simulation process. One way to achieve this is to divide the boundary kernel into six separate kernels, one for each boundary region. Hence, the computation of each boundary region can be accomplished independently, which provides more overlapping opportunities. This method is explained in the next section.
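Before turning to these variants, note that Listing 6.2 and the multi-kernel methods below all rely on rectangular kernel launches such as simulationStepAlphaRect(). The following sketch shows one way such a launch could be expressed in OpenCL; it assumes the rectangle is mapped onto a 3D NDRange via a global work offset, whereas the actual kernel interface in the thesis code may differ (e.g. passing origin and size as kernel arguments).

    // Sketch of a rectangular kernel launch as used by simulationStepAlphaRect()
    // (assumed interface, for illustration only).
    #include <CL/cl.h>

    cl_int enqueueAlphaRect(cl_command_queue queue, cl_kernel alpha_kernel,
                            const size_t origin[3], const size_t size[3],
                            cl_uint num_wait, const cl_event* wait_list, cl_event* done)
    {
        // Launch one work-item per cell of the rectangular subregion. The global
        // work offset shifts the NDRange so that get_global_id() yields the
        // absolute cell coordinates inside the subdomain.
        return clEnqueueNDRangeKernel(queue, alpha_kernel,
                                      3,        // three-dimensional NDRange
                                      origin,   // global work offset = origin of the region
                                      size,     // global work size   = extent of the region
                                      NULL,     // let the driver pick the work-group size
                                      num_wait, wait_list, done);
    }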

MBK-SCQ Method

In the Multiple Boundary Kernels Single Command-Queue (MBK-SCQ) method, in contrast to the SBK-SCQ method, each boundary region is computed separately. Therefore, once the computation of one boundary region is finished, the GPU-CPU transfer operation of that region is triggered. Figure 6.8 illustrates a schematic timeline for the MBK-SCQ method.

Figure 6.8.: Schematic timeline for the MBK-SCQ method (rows: MPI communication, CPU computation, GPU-CPU communication and GPU computation, with one boundary kernel per direction followed by the inner computation).

Although decomposing the boundary computation provides more flexibility in organizing the operations, it requires more advanced synchronization techniques. The GPU-CPU transfer operation of each boundary region should be executed directly after the computation of the corresponding region has finished. In addition, each boundary region should be computed only after the previous boundary region has successfully completed its task. A sample code of this method is provided in Listing 6.3. Here, instead of executing one kernel to compute all boundary regions of the partition, each boundary is computed separately by using the simulationStepAlphaRect() function. As explained in section 6.2, this function performs the collision and streaming operations on a rectangular subregion of the domain; the origin and size of the subregion are given as function arguments.

Listing 6.3: Implementation of the MBK-SCQ method for computing one alpha simulation step.

    ///////////////////////////////////////
    // --> Simulation step alpha x boundary
    ///////////////////////////////////////
    // performing the alpha time step on the x0 boundary
    cLbmPtr->simulationStepAlphaRect(x0_origin, x_size, 0, NULL, &ev_ss_x0);
    // performing the alpha time step on the x1 boundary
    cLbmPtr->simulationStepAlphaRect(x1_origin, x_size, 1, &ev_ss_x0, &ev_ss_x1);

    // --> Store x boundary
    storeDataAlpha(MPI_COMM_DIRECTION_X_0, 1, &ev_ss_x0, &ev_store_x0);
    storeDataAlpha(MPI_COMM_DIRECTION_X_1, 1, &ev_ss_x1, &ev_store_x1);

    // --> Sync x boundary
    syncAlpha(MPI_COMM_DIRECTION_X_0, &req_send_x0, &req_recv_x0);
    syncAlpha(MPI_COMM_DIRECTION_X_1, &req_send_x1, &req_recv_x1);

    // ...

    // receiving and setting the boundary data from the neighbors
    MPI_Status stat_recv;
    MPI_Wait(&req_recv_x0, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_0);
    MPI_Wait(&req_recv_x1, &stat_recv);
    setDataAlpha(MPI_COMM_DIRECTION_X_1);

Although decomposing bigger tasks into smaller ones provides more flexibility in overlapping independent computation parts, the overhead introduced by the implementation can degrade the performance. In this method, no concurrent kernel execution on one device is possible. Hence, the implementation overhead can dominate the performance results. This is discussed in more depth in section 6.7; some techniques to overcome this problem are introduced in the following sections.

MBK-SCQ Method with OpenCL Callback Mechanism

Another technique investigated in this thesis is to take advantage of the OpenCL callback mechanism to execute the MPI communication commands. The callback mechanism, which is introduced in section 4.2.5, provides the ability to invoke a function on the host when an OpenCL event has reached a specific status. A typical usage scenario for OpenCL callbacks are applications in which the host would otherwise have to wait while the device is executing, which reduces efficiency. In the Multi-GPU implementation, the callback mechanism can be used as follows: when the event associated with a GPU-CPU transfer operation changes its status to CL_COMPLETE, a callback function is triggered which invokes the MPI communication commands to exchange the transferred data with the other processes. In this way, the host process can continue enqueueing the next GPU kernels without being interrupted by MPI commands. Although this technique is promising in theory, it is shown in section 6.7 that the overhead introduced by the OpenCL callback mechanism drastically degrades the performance and scalability of the software.
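A minimal sketch of how such a callback could be registered is shown below. The helper struct and the way the MPI arguments are passed through user_data are assumptions for illustration, not the thesis implementation; note also that MPI would have to be initialized with a sufficient thread support level, since the callback may be executed by a driver thread.

    // Sketch: trigger the MPI exchange of a ghost layer from an OpenCL event
    // callback once the device-to-host transfer has completed (illustrative only).
    #include <mpi.h>
    #include <CL/cl.h>

    struct GhostLayerMsg {            // hypothetical helper carrying the MPI arguments
        void*       buffer;
        int         count;
        int         neighbor_rank;
        MPI_Request request;
    };

    static void CL_CALLBACK onTransferComplete(cl_event event, cl_int status, void* user_data)
    {
        if (status != CL_COMPLETE) return;          // only react to successful completion
        GhostLayerMsg* msg = static_cast<GhostLayerMsg*>(user_data);
        // Start the non-blocking send of the ghost layer that has just arrived on the host.
        MPI_Isend(msg->buffer, msg->count, MPI_BYTE, msg->neighbor_rank,
                  /*tag=*/0, MPI_COMM_WORLD, &msg->request);
    }

    // Registration: ev_store is the event of the GPU-to-CPU transfer command.
    //   clSetEventCallback(ev_store, CL_COMPLETE, onTransferComplete, &msg);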

MBK-MCQ Method

The fundamental shortcoming of the previous methods is the lack of concurrent execution of GPU computation and GPU-CPU transfer operations on one device. Although the OpenCL specification does not require this capability from the vendors providing the OpenCL drivers, the feature is available on some NVIDIA graphics cards when using the CUDA programming model. In order to execute GPU-CPU transfer operations and GPU computation simultaneously under the OpenCL standard, these commands have to be enqueued to separate OpenCL command-queues associated with the same device. This is the basic idea behind the Multiple Boundary Kernels Multiple Command-Queue (MBK-MCQ) method. By using multiple command-queues, i.e. two command-queues, one of them can be devoted to the GPU-CPU transfer operations and the other one is utilized for the GPU computation. As a result, an overlap of these two command types can be achieved. Figure 6.9 shows the schematic timeline of this method: after computing the x0 boundary region, the transfer operation for this region is enqueued to the GPU-CPU transfer command-queue and, simultaneously, the computation of the x1 boundary region is enqueued to the GPU-computation command-queue.

Figure 6.9.: Schematic timeline for the MBK-MCQ method (rows: MPI communication, CPU computation, GPU-CPU communication and GPU computation, with boundary kernels, transfers and the inner computation overlapping).

If the GPU is capable of concurrent kernel execution, the computation of the inner region can additionally be triggered at the beginning of the simulation step, in parallel with the computation of the boundary regions. By taking advantage of concurrent kernel execution and overlapped transfer and computation operations on the GPU, the timeline in Figure 6.9 is much denser than those of the previous methods, which theoretically results in a large performance boost. Whether the hardware features required by this method can be used depends not only on the capabilities of the hardware but also on whether the vendors have implemented these features in their OpenCL drivers. In addition, the overhead introduced by the simultaneous scheduling of several kernels on the GPU depends on the OpenCL driver implementation. As a result, although this method promises better results in theory, in practice the stated difficulties can dominate the performance results. In section 6.7, the performance results of this method are compared with the previous methods.
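The setup with two command-queues on the same device could look like the following sketch, using the OpenCL 1.x clCreateCommandQueue API and illustrative names; events link the transfer on the copy queue to the corresponding boundary kernel on the compute queue, while everything else on the two queues may run concurrently. In the real code, the queues would of course be created once during initialization rather than per call.

    // Sketch: two command-queues on one device, one for kernels and one for
    // GPU-CPU transfers, linked via an event (error handling omitted).
    #include <CL/cl.h>

    void enqueueX0BoundaryOverlapped(cl_context ctx, cl_device_id dev,
                                     cl_kernel boundary_kernel_x0,
                                     cl_mem d_ghost_x0, void* h_ghost_x0, size_t bytes,
                                     const size_t origin[3], const size_t size[3])
    {
        cl_int err;
        cl_command_queue compute_queue  = clCreateCommandQueue(ctx, dev, 0, &err);
        cl_command_queue transfer_queue = clCreateCommandQueue(ctx, dev, 0, &err);

        // Boundary computation on the compute queue.
        cl_event ev_comp_x0;
        clEnqueueNDRangeKernel(compute_queue, boundary_kernel_x0, 3,
                               origin, size, NULL, 0, NULL, &ev_comp_x0);

        // Non-blocking transfer of the x0 ghost layer on the transfer queue; it
        // waits only for the kernel event, so both queues can otherwise proceed
        // concurrently.
        cl_event ev_store_x0;
        clEnqueueReadBuffer(transfer_queue, d_ghost_x0, CL_FALSE, 0, bytes,
                            h_ghost_x0, 1, &ev_comp_x0, &ev_store_x0);

        // Further boundary kernels can now be enqueued on compute_queue while
        // the transfer proceeds on transfer_queue.
    }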

6.5. Validation

This section describes the validation process of the Multi-GPU LBM simulation developed in this thesis. Numerical validation is an important issue, especially in GPU computing where calculations may be performed in single precision. The Multi-GPU implementation of this thesis is based on the Single-GPU code developed in [17]. A physical validation of the Single-GPU code is also provided in [17].

Validation Setup

To validate the LBM software, a lid-driven cavity scenario is used. In this scenario, a cubic domain is created with a velocity in x direction on the top wall and no-slip conditions on every other wall. The validation is performed for different Reynolds numbers and various domain resolutions.

Multi-GPU Validation

In order to verify the correctness of the MPI parallelization, the same simulation scenario is computed on one GPU as well as on multiple GPUs. Afterwards, the density distribution values of each cell in both cases are compared against each other. Since the fluid velocity and pressure are computed from the density distribution values, a comparison of these quantities is not necessary. In order to automate the validation process, a validation module is developed which runs the same scenario that is computed on multiple GPUs on one additional GPU without decomposing the domain. The simulation data of each subdomain (excluding the boundary values) are sent via MPI to the process that is responsible for the Single-GPU simulation. The validation process receives these data and, based on the ID of the sending process, identifies the location of the received data in the total domain. In the next step, the corresponding Single-GPU data are transferred from the memory of the validation GPU to the host of the validation process. Once the results of the Single-GPU and Multi-GPU simulations are available on the validation process, they are compared against each other and the number of non-matching values is counted. The validation is reported as successful when no non-matching value is found. In order to activate the validation process, the value of the validate element in the XML configuration file (see appendix A) has to be set to 1. In this case, the simulation has to be launched with one additional process devoted to the computation of the Single-GPU data. The validation can only be performed for scenarios whose domain resolution fits into the memory of one GPU. The basic method as well as all overlapping methods are successfully validated in this way.
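The per-cell comparison on the validation process could be as simple as the following sketch; the buffer layout and the exact matching criterion (bitwise equality as described, or optionally a small tolerance) are assumptions for illustration.

    // Sketch of the comparison step on the validation process: count density
    // distribution values that differ between the Multi-GPU and the Single-GPU
    // run (illustrative only).
    #include <cstddef>
    #include <cmath>

    size_t countMismatches(const float* multi_gpu, const float* single_gpu,
                           size_t num_values, float tolerance = 0.0f)
    {
        size_t mismatches = 0;
        for (size_t i = 0; i < num_values; ++i) {
            if (std::fabs(multi_gpu[i] - single_gpu[i]) > tolerance)
                ++mismatches;
        }
        return mismatches;   // validation succeeds if this is zero
    }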

6.6. Performance Optimization

One of the necessary parts of any HPC software development is performance optimization. The first step in the optimization process is profiling the software. A common profiling strategy is to find out how much time is spent in the different functions in order to identify hot spots. These hot spots are subsequently analyzed for possible optimization opportunities. To identify the hot spots of the software developed in this thesis, the first tool adopted is Callgrind. Callgrind is an extension of Cachegrind which produces information about the call graph of a program. The data collected by Callgrind consist of the number of instructions executed, their relationship to source lines, the caller/callee relationships between functions, and the number of such calls. The result of the analysis of the software for the SBK-SCQ method after 100 iterations is given in Table 6.1, which lists the seven functions with the highest exclusive execution time.

    Incl.   Self   Called   Function
                            enqueueCopyRectKernel
                            storeDensityDistribution
                            setDataAlpha
                            setDataBeta
                            initSimulation
                            simulationStepBeta
                            simulationStepAlpha

Table 6.1.: Called functions of the Multi-GPU LBM simulation with the highest exclusive execution time, for a run with 100 iterations.

As shown in the table, the function enqueueCopyRectKernel has the highest execution time and is called more often than the other functions. This function enqueues a kernel to the command-queue which copies a rectangular part of the domain from the density distribution buffer to a temporary buffer. This buffer is used in the next step to transfer the data from the GPU memory to the host memory. The call graph of the enqueueCopyRectKernel() function is illustrated in Fig. 6.10, which shows that this function is used by the functions that store the boundary values from the GPU memory to the host as well as by those setting the values in the GPU memory.

Figure 6.10.: Rectangle copy kernel call graph.

As discussed in section 5, the density distributions for a specific lattice vector are stored linearly in the global memory of the GPU. In this memory layout, first the density distribution values of lattice vector f_0 for all domain cells are stored in a row, with the domain cells ordered in x, y, z order. Subsequently, the density distribution values of the next lattice vector are stored for all domain cells, and so on. Although this way of storing the data provides an optimized memory access pattern for the Single-GPU implementation (see [17]), in the Multi-GPU case, where the boundary region data have to be exchanged in every simulation step, it leads to inefficient GPU memory accesses. Since the density distribution data are stored linearly, accessing the memory locations of a rectangular boundary region requires several read and write operations on discontiguous locations in the GPU memory. Fig. 6.11 illustrates a sample of the memory layout for a 3x3x3 domain; in this sample, the x0 boundary cells are highlighted in red. For the β-synchronization the problem becomes even more complicated, since some of the lattice vector values have to be skipped. All these challenges lead to uncoalesced memory accesses.

Figure 6.11.: Memory access pattern for boundary regions in the density distribution buffer.

In GPU architectures, memory optimizations are the most important technique to gain better performance, and the most effective memory access optimization in programming for the CUDA architecture is coalescing global memory accesses. Moreover, the exchange of boundary regions happens in every simulation step.
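The linear layout described above can be summarized by a small index function. The sketch below, with assumed variable names, shows why the cells of an x0 boundary layer are not contiguous: consecutive boundary cells are separated by a stride of the domain size in x, and each lattice vector adds a further offset of the whole domain size.

    // Sketch of the structure-of-arrays indexing described above (assumed names).
    // All values of lattice vector i are stored consecutively for the whole
    // subdomain, with the cells ordered x-first, then y, then z.
    #include <cstddef>

    size_t densityIndex(size_t i,                  // lattice vector index (e.g. 0..18 for D3Q19)
                        size_t x, size_t y, size_t z,
                        size_t nx, size_t ny, size_t nz)
    {
        const size_t cells = nx * ny * nz;         // cells per lattice vector
        return i * cells + (z * ny + y) * nx + x;  // linear address of f_i at cell (x, y, z)
    }

    // For the x0 boundary (x == 0), two neighboring boundary cells at y and y+1
    // are nx elements apart, so reading one boundary layer touches many
    // non-contiguous memory locations -- the uncoalesced access pattern
    // discussed above.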


More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information