Solution of the Transport Equation Using Graphical Processing Units


Gil Gonçalves Brandão

October

1 Introduction

Computational Fluid Dynamics (CFD) has always struggled for faster computing resources to solve its problems. Until a few years ago, it was possible to simply wait a few months and buy a new, faster machine. Unfortunately, current technology has hit a wall: the ever-decreasing die size with increasing transistor density has made heat problems unbearable. The path to growth is now, more than ever, parallel computing. The CPU industry has already turned multicore, but, depending on the problem, the speedup of parallel computing can range from negligible to linear scaling with the number of processing units. Meanwhile, the graphics card industry was pursuing its own path, using parallel, dedicated hardware from the start and driven by the rich market demand of gamers. The Graphics Processing Unit (GPU) gained more capabilities and, in the last couple of years, toolkits and dedicated frameworks appeared, allowing the true start of the exploration of GPU power for general computing. This work focuses on dissecting and exploring the GPU for solving partial differential equation problems.

1.1 GPU Technologies

On the hardware side, there are mainly two device makers, NVIDIA and ATI, and their successive generations of devices. On the software side there are two kinds of technology to compute data on the GPUs: mapping the algorithms to the graphics pipeline or using the more recent, general-purpose languages. In the first approach there are currently two big families: NVIDIA Cg and the OpenGL Shading Language. In the general language field, there are the NVIDIA Compute Unified Device Architecture (CUDA), OpenCL and the Brook family. With the release of the CUDA framework, the devices were completely opened to programmers and, since then, all kinds of computational applications have been released: video encoding, weather prediction, molecular simulation, fluid dynamics, computer vision, cryptography, etc. In the CFD context, lattice Boltzmann methods have been studied [11, 7, 15]. Apart from this, the Navier-Stokes equation has been solved using finite volume [2] and finite element [6] codes. In the linear algebra domain, several codes have been developed.

NVIDIA released a CUDA version of the standard BLAS library (routines that compute simple operations, such as addition and multiplication, at the vector and matrix level). At a higher level (linear system solvers), two main orientations currently seem to exist. On one side, there is strong interest in keeping the old LAPACK library interface while using the GPU as a high-performance resource: the factorization algorithms (such as the LU, QR or Cholesky factorizations) are being implemented [13] and hybrid CPU-GPU approaches are being studied [12]. On the other side, a new generic approach to algebra algorithms is being developed [4] and GPUs are being used as a test framework for this new approach [1]. Besides these two approaches, there is also work on sparse matrix algebra [5].

1.2 Objectives

The main objectives of the present work are:

- To investigate the concepts behind the technology, their implementation, and which key mechanisms can lead to the best performance.
- To investigate the performance of GPU-based computing in a class of CFD problems: solving the advection-diffusion transport equation using finite difference methods.

To achieve the stated objectives, the following was done:

- Port a state-of-the-art benchmark to the CUDA environment to understand the programming paradigm and compare it with the CPU environment.
- Implement tests that quantify the relative performance of the different memory access methods.
- Implement equivalent programs that solve the one-dimensional advection-diffusion equation in both the CPU and GPU environments, using finite-difference compact schemes to compute the spatial derivatives and the Runge-Kutta method to perform the time integration. The resulting systems of algebraic equations are solved using two direct dense solvers, compared for each environment: an LU-based solver and an inverse-matrix-based solver.

2 CUDA Programming Overview

2.1 GPU Hardware Model

Each GPU is an aggregate of multi-core processors (multiprocessors, MPs) sharing a global memory; the GPU (as a parallel system) is essentially a shared memory system. Each MP is composed of: a number of scalar cores which perform the computations (these scalar cores are specialized in arithmetic instructions); an instruction unit responsible for delivering instructions to the scalar cores; and on-chip shared memory that can be used for scalar core communication (this memory is not accessible by any other MP in the GPU). The memory system of current NVIDIA GPUs is complex. There are two main groups of memory: on-chip memory (located inside each MP) and off-chip or global memory (located on the GPU board and accessible by all MPs).

Global memory is organized into four types: linear memory, texture memory, constant memory and local memory. The main implication of using each type is how an MP accesses the memory: any access to linear memory means using the shared bus; texture and constant memories are cached but read-only. Because the bus to global memory is shared and accesses are serialized, the GPU has the ability to coalesce (see footnote 7) some access patterns. On-chip memory has two additional types: the shared memory, which is directly accessible by any scalar core inside each MP, and the local registers, which are private to each scalar core.

In terms of execution in a GPU environment, the minimal computing task is the thread. These threads are created, managed (scheduled) and destroyed by the GPU, i.e., the threads live in hardware space. This is one of the major differences from other common parallel environments. In GPUs the threads are grouped into sets of up to 32 threads, called warps, which are the scheduling unit.

Figure 1: The GPU anatomy

2.2 CUDA Programming Model

The CUDA software model is an extension of the C/C++ programming languages that reflects the underlying GPU hardware. The main extensions are [9, sec. 4.1]: function and variable qualifiers that specify whether the function or variable refers to the host or to the device, and a directive to configure and launch a kernel. A CUDA program is no more than a usual C/C++ program that makes calls to CUDA kernels. A function to be run on the device must have the __device__ or __global__ qualifier: the former defines functions to be called by code running on the device, the latter defines kernels (see footnote 8). By default, functions with no qualifier are considered host functions. In terms of variables, the environment defines their scope, i.e., in device functions the variables belong to the device memory space and in host functions the variables belong to the host memory space; in the other cases, qualifiers (similar to the function ones) are used. The device cannot directly access host variables, nor can the host directly access the device's; the only direct interface is the kernel call, where the kernel parameters are automatically copied to the device's memory. Memory management (allocation, deallocation and copies) is done by the host using dedicated functions. The host holds the locations of the device's data in its own memory using traditional pointers.

Footnote 7: An access is said to be coalesced if several requests are fulfilled with a single memory transaction.
Footnote 8: Additionally, kernel functions must be void-typed, i.e., they cannot return any value.
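To make the qualifiers, the memory management calls and the execution configuration concrete, the following minimal sketch (illustrative only; the kernel name and sizes are arbitrary and this is not the code developed in this work) adds two vectors on the device:

    #include <cstdlib>
    #include <cuda_runtime.h>

    // __global__ marks a kernel: callable from the host, executed on the device.
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in index variables
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // The host allocates device memory and keeps ordinary pointers to it.
        float *d_a, *d_b, *d_c;
        cudaMalloc((void**)&d_a, bytes); cudaMalloc((void**)&d_b, bytes); cudaMalloc((void**)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Execution configuration directive: grid of blocks, blocks of threads.
        int threads = 256, blocks = (n + threads - 1) / threads;
        vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
        cudaDeviceSynchronize();           // host waits until all device threads finish

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }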

Figure 2: Asynchronous execution of device-unrelated code on the host while the device is processing the data.

Regarding the execution configuration, the threads are organized in a matrix-like structure called a block; each block is assigned to an MP. The blocks are also organized in a matrix-like structure called a grid. Within an MP, each thread has built-in variables that can be used to assign tasks to particular threads. After the kernel is launched, the mapping of each thread to an MP and its scalar cores is done automatically by the hardware. The execution of the device's threads is asynchronous with respect to the host, i.e., the host can execute other, unrelated code while the device is processing data. Figure 2 shows the thread organization as well as the asynchronous execution feature. Synchronization is done with a function call on which the host program waits until all threads on the device have finished their work.

3 CUDA Environment Tests

Some tests were done to understand the capabilities of the system used. The tests are based on two NVIDIA benchmarks and, following [7], on a port of the Stream benchmark.

3.1 Peak Throughput

In order to know the peak throughput achievable with the GPU, an NVIDIA test is used. It consists of an unrolled loop with a total of 2048 FMAD instructions (see footnote 10). The maximum value achieved is 617 GFLOPS. This maximum is not attained from the start for block sizes of 32, 64 and 128; looking at the data, it is found that full performance is only achieved if the scheduler has 32 or more threads to take care of.

3.2 Bandwidth: Host - Device Transfers

A benchmark that copies blocks of data from the host to the device using all the methods the API provides was implemented. The total time is evaluated as a function of the number of elements transferred. There is a high performance penalty when transferring small quantities of data. For large transfers (more than 10^6 elements) a globally stable value was reached.

3.3 Bandwidth: Device - Device Transfers

The performance of the main intra-device transfers is evaluated: different versions of a vector copy operation are implemented, using read operations from each one of the available memory spaces and a write to global memory. Because the texture and constant memories are cached, a vector form of the sum reduction operation (a(i) = \sum_{j=1}^{M} b(j), i = 1, ..., N) is also implemented to take advantage of them.

Footnote 10: An FMAD instruction is a hardware instruction that computes a multiplication and an addition in one step, e.g., a*b + c.
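As an illustration of how such a host-to-device measurement can be made, the sketch below times a single cudaMemcpy with CUDA events (an assumed, simplified setup using pageable host memory; the actual benchmark exercised all transfer methods provided by the API, e.g. pinned memory allocated with cudaMallocHost):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Measure the effective bandwidth of one host-to-device copy of n floats.
    static float h2d_bandwidth_gbs(size_t n)
    {
        size_t bytes = n * sizeof(float);
        float *h = (float*)malloc(bytes);          // pageable host buffer
        float *d; cudaMalloc((void**)&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(d); free(h);
        return (float)(bytes / 1e9) / (ms / 1e3f); // GB/s
    }

    int main(void)
    {
        for (size_t n = 1 << 10; n <= (1 << 24); n <<= 2)
            printf("%10zu elements: %6.2f GB/s\n", n, h2d_bandwidth_gbs(n));
        return 0;
    }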

Because global memory is not cached, a software-based cache (see footnote 11) is implemented with shared memory. The most important observation is that device memory performance collapses for small vector dimensions: the theoretical bandwidth is 102 GB/s but, for small sizes (N < 16k), less than 7 GB/s was achieved. The test scales in performance and, for sufficiently large vector sizes (N > 2^20), significant performance was attained (B > 60 GB/s). In [14] it is reported that greater performance was achieved (89%), but the devices used are not the same. In terms of relative performance between access types, direct access to global memory and access through the texture cache perform identically, while access to constant memory is slower. In the cached-access test the results are significantly different from the previous test: even with small sets, and for all access types, the performance exceeded the global memory's theoretical bandwidth. The implemented software cache using shared memory presents by far the best results, attaining for large sizes a peak of 287 GB/s (280%). Access through texture memory also outperformed the global memory limit, by a factor of 178%.

4 Stream Benchmark

The Stream benchmark consists of 4 vector operations: vector copy, product by a scalar, vector sum, and vector sum plus scalar product (triad). The performance is evaluated as a function of the number of blocks and the block configuration. The previous results showed completely different performance for small and large problems, so two different vector sizes are evaluated. The phenomenon of bad performance for small sizes is maintained and has one common parameter constant for all configurations: 256 threads per MP (which corresponds to the size of the warp). This result is identical to the FLOPS test, but now with memory transactions in play. For large vector dimensions the performance is approximately constant and equal to 75 GB/s. Finally, a comparison between the host CPU performance and the device performance (not considering host-device transfers) is done. The results are presented in Table 1.

Table 1: Stream benchmark results

                       N = 2E3                              N = 2E6
    operation   CPU (MB/s)  GPU (MB/s)  speedup   CPU (MB/s)  GPU (MB/s)  speedup
    copy
    scale
    add
    triad

Footnote 11: In fact, no true cache mechanism [8] was implemented; the code takes advantage of knowing beforehand which memories will be used with higher frequency.
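For reference, the triad operation of the Stream benchmark, a(i) = b(i) + q * c(i), maps naturally onto one device thread per element; the kernel below is a minimal sketch of such a port (names and launch parameters are illustrative, not the implementation used in this work):

    // STREAM triad: a(i) = b(i) + q * c(i), one thread per vector element.
    __global__ void stream_triad(float *a, const float *b, const float *c,
                                 float q, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] = b[i] + q * c[i];   // consecutive threads touch consecutive
                                      // addresses, so the accesses can coalesce
    }

    // Typical launch for n elements:
    //   int threads = 256;
    //   stream_triad<<<(n + threads - 1) / threads, threads>>>(d_a, d_b, d_c, 3.0f, n);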

5 Solution of the 1D Transport Equation

The model used to test the computational speedup of GPGPU programming is a linearized version of the Burgers equation, i.e., the uni-directional transport equation,

    \frac{\partial u}{\partial t} + U_0 \frac{\partial u}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2}    (1)

where U_0 and \nu are real constants and u is a continuous field. To solve this equation for u, equation 1 is first transformed into an explicit form (equation 2):

    F(t, x) = \frac{\partial u}{\partial t} = \nu \frac{\partial^2 u}{\partial x^2} - U_0 \frac{\partial u}{\partial x}    (2)

As shown by equation 2, this is an initial value problem. To solve it, the classic 4th-order Runge-Kutta method [3] is used. The spatial derivatives (first and second) are calculated using 4th-order compact schemes [10]. These methods are a particular case of central difference schemes (eq. 3) and the derivatives are calculated by solving a linear system Ax = b, where A is an N x N matrix and x and b are N-element vectors:

    \beta(u'_{i-2} + u'_{i+2}) + \alpha(u'_{i-1} + u'_{i+1}) + u'_i
        = \frac{1}{\Delta x}\left( a\,\frac{u_{i+1} - u_{i-1}}{2} + b\,\frac{u_{i+2} - u_{i-2}}{4} + c\,\frac{u_{i+3} - u_{i-3}}{6} \right)    (3)

In matrix form, the problem is now formulated as:

    \frac{\partial u}{\partial t} = \nu A_2^{-1} B_2 u - U_0 A_1^{-1} B_1 u    (4)

The computational domain of the problem is defined by the following constraints: the spatial coordinate is normalized, x \in [0, 1]; x is a uniform mesh of N points (hence \Delta x = 1/(N-1)); the time step \Delta t is constant and given by the Courant number; the simulation has NT time steps; U_0 is normalized (U_0 = 1); and the viscosity coefficient \nu is calculated as a function of the Fourier number (see footnote 12). As stated, the compact scheme methods used are a particular case of central differences, so special considerations have to be taken at both boundaries of the spatial domain, u(t, x = 0) and u(t, x = 1). On the right side, a Dirichlet condition is imposed, i.e., u(t, x = 1) = 0. On the opposite side, the model order reduction presented in [10] was implemented, and on this boundary the problem is represented by a forward 3rd-order scheme.

5.1 Implementation

The implementation consists of two versions of the code: a C-based serial (single-processor) version and a CUDA-based version. The code of each version is as identical as possible. The core of the simulation is a simple sequential loop in the time variable. In each iteration the velocity field u is updated by the Runge-Kutta integration. All data is logged into memory and dumped to a file at the end.

Footnote 12: Even if calculating the physical constant is not natural practice in fluid mechanics problems - where the objective is to calculate the flow for a given fluid - the focus of the current work is computational, so we compute the problem's parameters as a function of computationally significant quantities (such as the number of points and iterations) and numerical stability.
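For reference, the classic 4th-order Runge-Kutta update applied to the semi-discrete system du/dt = F(t, u) of equation 4 can be written as follows (standard textbook form, reproduced here for clarity rather than taken from the thesis):

    k_1 = F(t_n, u_n)
    k_2 = F(t_n + \tfrac{\Delta t}{2},\; u_n + \tfrac{\Delta t}{2} k_1)
    k_3 = F(t_n + \tfrac{\Delta t}{2},\; u_n + \tfrac{\Delta t}{2} k_2)
    k_4 = F(t_n + \Delta t,\; u_n + \Delta t\, k_3)
    u_{n+1} = u_n + \frac{\Delta t}{6}\,(k_1 + 2 k_2 + 2 k_3 + k_4)

Each evaluation of F requires solving the compact-scheme linear systems for the first and second derivatives, which is where the dense solvers compared below enter the time loop.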

In the CUDA version, after the problem initialization, all the necessary data is copied to the GPU memory; only at the end is the data copied back to main memory. For the linear system solver, two direct methods are considered: an LU solver (with partial pivoting) and a matrix inversion solver. In both methods, all constant data is computed during initialization using a serial CPU method: the LU solver computes the pivots, L and U, during initialization, and the inverse-based solver computes the A^{-1} matrices.

Solving an LU-factorized linear system on the GPU

The only routine used inside the main loop that wasn't implemented by any CUDA-based package is the equivalent of the LAPACK sgetrs routine. This routine forms a pair with the sgetrf routine: sgetrf computes the LU factorization with partial pivoting of a general M x N matrix, and sgetrs solves a linear system AX = B with that previously factorized A matrix. The netlib version of the sgetrs function is used as a guide to port the function to the CUDA architecture. The routine makes calls to BLAS routines that are available in the CuBLAS package, so those were used. It also calls a LAPACK internal routine, slaswp, which had to be implemented. slaswp applies a given permutation (in the form of row interchanges) to a matrix: it receives an integer array with the indices of the rows that need to be permuted and the matrix to operate on. The algorithm, as implemented by LAPACK, is inherently sequential because the order of the row interchanges matters: in the LAPACK standard, the indices in the pivot array returned by the sgetrf routine may be repeated, which leads to different results if the interchanges are applied in a different order. If the predetermined (sequential) order isn't followed, the final output will differ, leading to an erroneous solution. However, the columns of the solution matrix are completely independent, so a decomposition may be done by mapping each task to a column, as sketched below.

6 Metrics

The essential metric in this work is the time ratio between the serial version and the CUDA version (G = t_serial / t_cuda). Since the initialization is always done on the CPU, another important measure in the CUDA version is the ratio between the initialization time and the total time (R = t_init / t_total).

6.1 Simulation Results

All the simulations were made with C = 0.3, F = 0.1 and 500 time steps. Performance is evaluated as a function of the problem size. Figure 3a shows the evolution of the speedup of both methods; the left-side axis holds the scale for the inverse method and the right-side axis the scale for the LU method. The results are significantly different for each method. The speedup obtained with the CUDA version of the inverse method is always greater than unity: the lowest value is 6.5 (N = 500) and the maximum is 15.2 (N = 2000). With the LU method, a speedup of 1 was achieved in only one case; in all other cases, the performance was poorer than the serial version. Another important measure is the comparison of the best method on each platform, i.e., the inverse method on the GPU against the LU method on the CPU (T_C,lu / T_G,inv), represented in figure 3a by the blue dashed line and referring to the left-side scale. The computation is now 3.5 to 10.8 times faster.
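The column-parallel decomposition mentioned above can be illustrated with the following sketch (a hypothetical kernel written for this summary, not the routine implemented in the thesis): each thread owns one column of the solution matrix and applies the pivot sequence to it in the prescribed order, so the ordering constraint holds within each column while the columns proceed in parallel.

    // Sketch of a column-parallel row-interchange (slaswp-like) kernel.
    // B is an n x nrhs matrix stored column-major with leading dimension ldb;
    // ipiv holds the 1-based LAPACK pivot indices produced by sgetrf.
    __global__ void swap_rows_per_column(float *B, int ldb, int n, int nrhs,
                                         const int *ipiv)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= nrhs) return;

        float *b = B + (size_t)col * ldb;   // this thread's column
        for (int i = 0; i < n; ++i) {       // apply interchanges in the fixed order
            int p = ipiv[i] - 1;            // convert to 0-based row index
            if (p != i) {
                float tmp = b[i];
                b[i] = b[p];
                b[p] = tmp;
            }
        }
    }

When the right-hand side has many columns (as in the matrix inversion case, where nrhs = n), this mapping gives the device enough independent work to hide memory latency.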

Figure 3: Performance. (a) Absolute speedup; (b) Initialization ratio.

The evolution of the time ratio between the initialization part and the total simulation (done on the GPU) is represented in figure 3b; the left axis holds the scale for the inverse method and the right side the scale for the LU method. As can be seen, as the number of points grows, the initial inversion becomes very significant: with N = 500 the ratio is about 30% of the total; with N = 2000 (the speedup peak) the initialization takes 57% of the total; and for N = 5000 it is 82%. With the LU method the evolution is completely different, as the initialization fraction is negligible (< 3%). The same figure presents the ratio between the initialization part of the inverse method done on the CPU and done on the GPU (T_C,init,inv / T_G,init,inv), on the left-side axis. This relation essentially represents the weight of the data transfer: the distance of the curve from 1 represents the difference between each version's initialization runtime (in the code, the only differences are the memory allocation and the transfer to the device). As seen in the figure, for N = 500 it takes around 25% of the time; for N >= 2000 it represents no more than 3%. This clearly explains the disruption in performance shown in figure 3a: as the initialization fraction becomes the predominant one, and because only a very small fraction of the time is spent on bus transfers, the inversion becomes the main computational problem.

It is known that the LU method should perform better on systems in which the ratio of rows to columns of the solution matrix B is 1 or less. Computing an inverse matrix is a particular linear system for which that ratio is exactly 1. The implemented LU solver is therefore now used to compute the inverse matrices. Figure 4 presents the resulting global speedup: on the left-side axis is the scale for the speedup obtained when compared with the case of computing the inverse matrix on the CPU (black continuous line).

Figure 4: Speedup with the inverse computed on the GPU.

There are two remarks to make (both a consequence of the initialization being a considerable fraction): 1) the performance gain is always greater than 1, which means that, in the end, a performance boost was achieved for every problem size; 2) the boost is continuously increasing, even if at a slow rate, which means that the initialization burden was somewhat mitigated. The updated best-method comparison is presented in the same figure with the blue dashed line and the right-side axis scale: the results were boosted, as the blue line had shown. The speedups now range from 4.3 to 18.8.

7 Conclusion

The main objective of this work was to investigate whether a class of problems in the computational fluid dynamics domain could benefit from the possibilities opened by GPU-based computing. This objective was accomplished with success, as speedups between 4 and 18 were obtained. The study of the parallel computing paradigm and its influence on the device's model emerges from the fact that the use of GPUs for scientific problem solving is a novelty. Because of the device's design, the best performance is generally obtained with massive problem sizes. This size factor means that communication between the host and the device (with the hardware used) becomes continually less significant when compared with the traditional sequential access pattern to the variables. Regarding the device's memory system, major differences from the host memory were observed: in current (multi-core) computers the system's memory bus is shared by 4 cores, while in the GPU the equivalent bus is shared among at least 8 scalar cores (as a concrete example, the GPU used in this work has 240 scalar cores). This implies that knowing how to exploit the device's memory system has a significant impact on the results obtained: when repeated access to a vector variable is needed, there are several approaches to obtain cached (thus faster) access to it.

To achieve the main goal of the present work, a one-dimensional convection-diffusion transport equation solver was implemented. Two direct methods were compared: an LU method and the inverse method. The implemented LU method performs poorly on the device for this problem, as its solution is just one column. The inverse method outperforms the LU method, which clearly shows that an algorithm's suitability for sequential computation is uncorrelated with its performance in parallel computing. The fact that the matrix inversion on the CPU is an expensive computation implies that the method's scalability is doomed for larger problem sizes; this fact also completely hides the eventual problem of the latency added by data transfers to the device. Finally, the matrix inversion step was done on the GPU (using the LU solver developed), and this led to a performance increase of a factor of approximately 2 when compared with the previous strategy of computing the inverse matrix on the CPU.

7.1 Future Work

The use of GPUs in scientific computing will have dramatic consequences that may change the numerical methods used today. Future work should be developed in two directions: i) continue to evaluate the platform as a computational resource (by investigating CPU/GPU cooperation, comparing the platform with traditional cluster platforms and studying clusters of GPUs), and ii) apply the acquired knowledge to engineering CFD applications.

References

[1] S. Barrachina, M. Castillo, F. D. Igual, R. Mayo, E. S. Quintana-Ortí, and G. Quintana-Ortí. Exploiting the capabilities of modern GPUs for dense matrix computations. Technical report, Universidad Jaime I.
[2] J. Cohen and M. Garland. Solving computational problems with GPU computing. Computing in Science and Engineering, 11(5):58-63, 2009.
[3] J. H. Ferziger and M. Perić. Computational Methods for Fluid Dynamics. Springer, 2nd edition.
[4] F. G. Van Zee, E. Chan, R. van de Geijn, E. S. Quintana-Ortí, and G. Quintana-Ortí. Introducing: the libflame library for dense matrix computations. CiSE, page 9.
[5] M. Garland. Sparse matrix computations on manycore GPUs. In DAC '08: Proceedings of the 45th Annual Design Automation Conference, pages 2-6, New York, NY, USA, 2008. ACM.
[6] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, H. Wobker, C. Becker, and S. Turek. Using GPUs to improve multigrid solver performance on a cluster. Int. J. Comput. Sci. Eng., 4(1):36-55.
[7] J. Habich. Performance evaluation of numeric compute kernels on NVIDIA GPUs. Master's thesis, Friedrich-Alexander-Universität.
[8] L. Null and J. Lobur. Essentials of Computer Organization and Architecture. Jones and Bartlett Publishers, Inc., USA.
[9] NVIDIA. CUDA Programming Guide.
[10] S. K. Lele. Compact finite difference schemes with spectral-like resolution. Journal of Computational Physics, 103:16-42, 1992.
[11] J. Tölke and M. Krafczyk. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. Int. J. Comput. Fluid Dyn., 22(7), 2008.
[12] S. Tomov, J. Dongarra, and M. Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Technical Report 210, LAPACK Working Note, Oct. 2008.
[13] V. Volkov and J. Demmel. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical report, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2008.
[14] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1-11, Piscataway, NJ, USA, 2008. IEEE Press.
[15] Y. Zhao. Lattice Boltzmann based PDE solver on the GPU. Vis. Comput., 24(5), 2008.
