Terascale on the desktop: Fast Multipole Methods on Graphical Processors


1 Terascale on the desktop: Fast Multipole Methods on Graphical Processors. Nail A. Gumerov, Fantalgo, LLC, and Institute for Advanced Computer Studies, University of Maryland (joint work with Ramani Duraiswami). This work has been supported by NASA. Presented on September 25, 2007 at the Numerical Analysis Seminar, UMD, College Park.

2 Outline Introduction General purpose programming on graphical processors (GPU) Fast Multipole Method (FMM) FMM on GPU Conclusions

3 Introduction Large computational tasks Moore's law Graphical processors (GPU)

4 Large problems. Example 1: Sound scattering from complex shapes. Problem: boundary value problem for the Helmholtz equation in a complex 3D domain for a range of frequencies (e.g., 200 frequencies from 20 Hz to 20 kHz). Mesh: ... elements, ... vertices. KEMAR mesh: ... elements, ... vertices.

5 Large problems. Example 2: Stellar dynamics. Problem: compute the dynamics of a star cluster (solve a large system of ODEs). Info: a galaxy like the Milky Way has on the order of 100 billion stars and evolves over billions of years.

6 Large problems. Example 3: Imaging (medical, geo, weather, etc.), computer vision, and graphics. Problems: 3D, 4D (and higher-dimensional) interpolation of scattered data; discrete transforms; data compression and representation; and much more.

7 Moore's law In 1965, Intel cofounder Gordon Moore saw the future. His prediction, now popularly known as Moore's Law, states that the number of transistors on a chip doubles about every two years. Other versions: every 18 months, every X months.

8 Some other laws Wirth's law: software is decelerating faster than hardware is accelerating. Gates's law: the speed of commercial software generally slows by fifty percent every 18 months (never formulated explicitly by Bill Gates, but rather referring to Microsoft products).

9 Graphical processors (GPU) The RSX GPU (PlayStation 3) at 550 MHz: 1.8 teraflops claimed floating-point performance.

10 Graphical processors (GPU)
NVIDIA GeForce 8800 GTX, July 2007, ~$500.

              Memory    Peak GFLOPS
  8800 GTX    768 MB    330
  Tesla C870  1.5 GB    500

The NVIDIA Tesla C870 GPU computing processor and the D870 deskside supercomputer will be available in October 2007.

11 General purpose programming on GPU (GPGPU) Challenges Programming languages NVIDIA and CUDA Math libraries (CUBLAS, CUFFT) Our middleware library Examples of programming using the middleware library

12 Challenges SIMD (single instruction, multiple data) semantics: common algorithms must be mapped to SIMD; heavy degradation of performance for operations on structured data; optimal sizing depends on the GPU architecture (e.g., 256 threads per block, 32n blocks, 8 processors per multiprocessor, 16 multiprocessors). Lack of high-level language compilers and environments friendly to scientific programming: the scientific programmer must take care of many issues at a low programming level; debugging is difficult. Elementary computing is uncommon for scientific programming: accuracy and error handling are not compliant with the IEEE standards; computations are natively single-precision float; many basic math functions are lacking. Substantial differences in access speed among memory types: little fast local (shared/cache) memory; algorithms must be redesigned to take into account the several memory types (size and access speed). Skillful programming is needed to realize the full potential of the hardware (learning, training, experience, etc.).

13 Programming languages OpenGL, ActiveX, Direct3D, ...; Cg; Cu.

14 Compute Unified Device Architecture (CUDA) from NVIDIA. Ideology: the main flow of the algorithm is realized on the CPU with conventional C or C++ programming; high-performance functions are implemented in Cu and precompiled into objects (an analog of dynamically linked libraries); an API is used to communicate with the GPU: allocate/deallocate GPU global memory, transfer data between the GPU and CPU, manage global variables on the GPU, and call global functions implemented in Cu. Several high-performance functions, including FFT and BLAS routines, are implemented and callable as library functions.
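
A minimal sketch of this CPU-driven flow in CUDA C (the kernel, sizes, and names below are illustrative placeholders, not taken from the presentation or the middleware):

    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Hypothetical device kernel: scales a vector by a constant.
    __global__ void scale(float *d_x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_x[i] *= a;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_x = (float *)malloc(bytes);                      // host data
        for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

        float *d_x;
        cudaMalloc((void **)&d_x, bytes);                         // allocate GPU global memory
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);      // CPU -> GPU transfer

        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);            // call a global function on the GPU
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);      // GPU -> CPU transfer

        cudaFree(d_x);                                            // deallocate GPU memory
        free(h_x);
        return 0;
    }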

15 Math libraries (CUBLAS and CUFFT) Callable from C and C++; CUBLAS is supplied with a Fortran wrapper; CUFFT can also be easily wrapped, e.g., to be called from Fortran.
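
For illustration, a minimal CUFFT call sequence (a sketch using the standard cufftPlan1d/cufftExecC2C interface; the wrapper name and in-place usage are choices made here, not taken from the slides):

    #include <cufft.h>

    // Forward 1D complex-to-complex FFT of length nx, executed in place on data
    // that already resides in GPU global memory.
    void fft_forward_inplace(cufftComplex *d_data, int nx)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, nx, CUFFT_C2C, 1);                // one transform of length nx
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward transform
        cufftDestroy(plan);
    }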

16 Our middleware library (main concepts) Global algorithm instructions should be executed on the CPU, while the data should reside on the GPU (minimize data exchange between the CPU and GPU); variables allocated in global GPU memory should be transparent to the CPU (easy dereferencing, access to parts of variables); easy handling of different types, sizes, and shapes of variables; short function headers, similar to those in high-level languages (C, Fortran); full use of CUBLAS, CUFFT, and other future math libraries; the possibility of easily writing new custom functions and including them in the library.

17 Realization Use structures available in C and Fortran 9x (encapsulation); use CUDA; wrap some CUDA and CUBLAS/CUFFT functions; use customizable modules in Cu for easy modification of existing functions or writing of new ones.
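
A rough sketch of the kind of device-array descriptor such a middleware layer keeps on the CPU side (hypothetical names and layout in CUDA C; the actual library is built on Fortran 9x structures and differs in detail):

    #include <stddef.h>
    #include <cuda_runtime.h>

    // Hypothetical descriptor for an array that lives in GPU global memory but is
    // handled transparently from CPU code (easy dereferencing, shape information).
    typedef struct {
        float  *d_ptr;     // device pointer into GPU global memory
        int     rank;      // number of dimensions
        int     shape[4];  // extents (up to 4D in this sketch)
        size_t  n;         // total number of elements
    } DeviceArray;

    // Allocate a 1D device array and optionally upload host data.
    DeviceArray device_array_create(const float *h_data, size_t n)
    {
        DeviceArray a;
        a.rank = 1; a.shape[0] = (int)n; a.n = n;
        cudaMalloc((void **)&a.d_ptr, n * sizeof(float));
        if (h_data)
            cudaMemcpy(a.d_ptr, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        return a;
    }

    void device_array_free(DeviceArray *a) { cudaFree(a->d_ptr); a->d_ptr = NULL; }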

18 Example of fast GPU-ization of a Fortran 90 subroutine using the middleware library. The original Fortran subroutine is from the pseudospectral plasma simulation program of William Dorland.

19 Example: Performance of a code generated using the middleware library. (Figure: log-log plot of run time (s) vs. N for 2D MHD simulations, 100 time steps of a pseudospectral method on an equivalent N x N grid; both CPU and GPU times scale as N^2 (y = ax^2 and y = bx^2), with the GPU about 25 times faster, a/b = 25.) CPU: 2.67 GHz Intel Core 2 Extreme QX 7400 (2 GB RAM, one of four cores employed). GPU: NVIDIA GeForce 8800 GTX.

20 Fast Multipole Method (FMM) About Algorithm Data structures Translation theory Complexity and optimizations Example applications

21 About the FMM Introduced by Rokhlin & Greengard (1987, 1988) for the computation of 2D and 3D fields for the Laplace equation; reduces the complexity of the matrix-vector product from O(N^2) to O(N) or O(N log N) (depending on the data structure); hundreds of publications for various 1D, 2D, and 3D problems (Laplace, Helmholtz, Maxwell, Yukawa potentials, etc.); we taught the first course in the country on FMM fundamentals & applications at the University of Maryland (2002); our reports on the fundamentals of the FMM and our lectures are available online (visit our web pages).

22 About the FMM Problem: compute the matrix-vector product v_j = sum_{i=1..N} Phi(y_j, x_i) u_i, j = 1, ..., M. Some kernels. Laplace 3D: Phi(y, x) = 1/|y - x|. Helmholtz 3D: Phi(y, x) = exp(ik|y - x|)/|y - x|. Gaussian (n-D): Phi(y, x) = exp(-|y - x|^2/sigma^2).
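
For reference, the direct O(NM) evaluation that the FMM accelerates, written for the 3D Laplace kernel in plain C (a sketch; the array layout and names are illustrative):

    #include <math.h>

    // Direct evaluation v[j] = sum_i u[i] / |y_j - x_i| for the 3D Laplace kernel.
    // Points are stored as [x0,y0,z0, x1,y1,z1, ...].
    void direct_sum(const float *x, const float *u, int n,
                    const float *y, float *v, int m)
    {
        for (int j = 0; j < m; ++j) {
            float s = 0.0f;
            for (int i = 0; i < n; ++i) {
                float dx = y[3*j]   - x[3*i];
                float dy = y[3*j+1] - x[3*i+1];
                float dz = y[3*j+2] - x[3*i+2];
                float r2 = dx*dx + dy*dy + dz*dz;
                if (r2 > 0.0f) s += u[i] / sqrtf(r2);   // skip coincident points
            }
            v[j] = s;
        }
    }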

23 Major principle: use expansions. Theorem 1: the field of a single source located at x_0 can be locally expanded about a center x_* into an absolutely and uniformly convergent series in the domain |y - x_*| < r < R < |x_0 - x_*| (local, or R-expansion). Corollary: the same holds for the field of s sources located outside the sphere |x_i - x_*| > R. Theorem 2: the field of a single source located at x_0 can be expanded about a center x_* into an absolutely and uniformly convergent series in the domain |y - x_*| > R > r > |x_0 - x_*| (multipole, or S-expansion). Corollary: the same holds for the field of s sources located inside the sphere |x_i - x_*| < r. Theorem 3: R- and S-expansions can be translated (change of the basis and/or the expansion center, subject to geometric constraints), e.g., for the Laplace kernel.
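
As a concrete illustration of Theorem 2 for the 3D Laplace kernel, here is the classical Legendre-series form of the S-expansion (a standard textbook identity, supplied here as an example rather than transcribed from the slide): with r = |y - x_*|, \rho = |x_0 - x_*|, and \gamma the angle between y - x_* and x_0 - x_*,

    \[
      \frac{1}{|y - x_0|} \;=\; \sum_{n=0}^{\infty} \frac{\rho^{\,n}}{r^{\,n+1}}\, P_n(\cos\gamma),
      \qquad r > \rho,
    \]

and truncating the series after p terms gives an S-expansion whose error decays like (\rho/r)^p.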

24 FMM algorithm The computational domain (an n-dimensional cube) is partitioned by a quadtree (2D) or octree (3D). Upward pass: get S-expansions for all boxes, skipping empty ones (an S-expansion is a singular, multipole, or far-field expansion). 1. Get S-expansions for the maximum level; 2. Get S-expansions for the other levels (use S|S translations).

25 FMM algorithm Downward pass: get R-expansions for all boxes, skipping empty ones (an R-expansion is a regular, local, or near-field expansion). 1. Get R-expansions from the S-expansions of the boxes in the neighborhood (use S|R translations); 2. Get R-expansions from the parent (use R|R translations).

26 FMM algorithm Final evaluation: evaluate the R-expansions for the boxes at the maximum level and directly sum the contributions of sources in the neighborhood of the receivers. 1. Evaluate R-expansions; 2. Direct summation.

27 Data structures Binary trees, quadtrees, or octrees (1D, 2D, and 3D); the maximum level is determined from a clustering parameter; lists of neighbors and a neighborhood data structure are required (e.g., for fast determination of the boxes in the neighborhood of the parent box); indexing of sources and receivers in boxes is required. We do all of this on a serial CPU using a bit-interleaving technique, sorting, and operations on sets (union, intersection, etc.); the overall complexity of our algorithm is O(N log N). Normally the complexity of generating the data structure is lower than that of the run part of the algorithm. Additional amortization: in many problems the data structure is set up once, while the run part is executed many times (iterative solution of a linear system). Parallelizing this algorithm is a non-trivial task, which we expect to undertake in the near future.
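
The bit-interleaving step mentioned above (Morton ordering of the boxes) can be sketched as follows; this is a generic 3D Morton encoder in C, not the authors' exact routine:

    #include <stdint.h>

    // Spread the lower 21 bits of v so that they occupy every third bit position.
    static uint64_t spread_bits_3d(uint64_t v)
    {
        v &= 0x1fffff;                                // keep 21 bits per coordinate
        v = (v | v << 32) & 0x1f00000000ffffULL;
        v = (v | v << 16) & 0x1f0000ff0000ffULL;
        v = (v | v << 8)  & 0x100f00f00f00f00fULL;
        v = (v | v << 4)  & 0x10c30c30c30c30c3ULL;
        v = (v | v << 2)  & 0x1249249249249249ULL;
        return v;
    }

    // Interleaved (Morton) index of the box with integer coordinates (ix, iy, iz);
    // sorting boxes by this key groups children under their parents at every level.
    uint64_t morton3d(uint32_t ix, uint32_t iy, uint32_t iz)
    {
        return spread_bits_3d(ix) | (spread_bits_3d(iy) << 1) | (spread_bits_3d(iz) << 2);
    }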

28 Translation theory Standard translation method: apply a p^2 x p^2 matrix to a p^2 vector of expansion coefficients, at O(p^4) complexity. There exist O(p^3) methods: currently we use the RCR decomposition of the translation operators (rotation - coaxial translation - back rotation); sparse matrix decompositions (a bit slower, but less local memory). There exist O(p^2) or O(p^2 log p) methods based on diagonal forms of the translation operators: Greengard-Rokhlin exponential forms for a truncated conical domain (require some O(p^3) transforms); FFT-based methods (large asymptotic constants); our own O(p^2) diagonal-form method (problems with numerical stability, especially in single precision). Diagonal forms require larger function representations (samplings on a grid) than spectral expansions, and their effective asymptotic constants are larger than for the RCR. For the relatively low p sufficient for single precision (p = 4, ..., 12), the RCR method is comparable in speed with the fastest O(p^2) methods.

29 Translation theory (RCR decomposition) It follows from group theory that a general translation can be reduced to a rotation, a coaxial translation along the rotated z-axis, and a back rotation, replacing one O(p^4) operation by three O(p^3) operations. (Figure: coordinate-axis diagram of the three steps.)
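
In operator notation (a standard way to write the RCR factorization; the symbols are chosen here for illustration and are not from the slide), a translation along a vector t factors as

    \[
      T(t) \;=\; R^{-1}(\hat{t})\; T_{\mathrm{coax}}(|t|)\; R(\hat{t}),
    \]

where R(\hat{t}) rotates the z-axis onto the direction of t (O(p^3) work on the p^2 coefficients), T_coax translates along the rotated z-axis (O(p^3), since it couples only coefficients of equal order m), and R^{-1} rotates back (O(p^3)), replacing the O(p^4) dense matrix-vector product.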

30 Translation theory Reduction of the translation complexity by using translation stencils and a variable truncation number: e.g., S-expansions from the shaded boxes can be translated to the center of the parent box with the same error bound as from the white box to the child. Also, each box can use its own truncation number.

31 Complexity of the FMM Operations that do not depend on the number of boxes (generation of S-expansions and evaluation of R-expansions): complexity ~ AN (assuming the numbers of sources and receivers are of the same order). Translation operations, which do not depend on the number of sources/receivers (only on the number of boxes): complexity ~ B N_boxes ~ BN/s (s is the clustering parameter). Direct summation, which depends on the number of sources/receivers and the number of sources in the neighborhood of the receivers: complexity ~ CNs. Total complexity: Cost = Cost_exp + Cost_trans + Cost_dir ~ AN + BN/s + CNs.

32 Optimization of the FMM (uniform distributions): Total complexity: Cost(s) = AN + BN/s + CNs, minimized at s_opt = (B/C)^{1/2}. Optimal max level of the octree: l_max = log_8(N/s_opt).
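
A small helper illustrating this optimization in plain C (the constants B and C are assumed to come from timing measurements on the target machine; names are illustrative):

    #include <math.h>

    // Optimal cluster size from the per-box translation constant B and the
    // per-pair direct-summation constant C: s_opt = sqrt(B / C).
    double optimal_cluster_size(double B, double C) { return sqrt(B / C); }

    // Optimal octree depth for N sources: l_max = round(log_8(N / s_opt)).
    int optimal_max_level(double N, double B, double C)
    {
        double s_opt = optimal_cluster_size(B, C);
        double l = log(N / s_opt) / log(8.0);
        int lmax = (int)(l + 0.5);
        return lmax > 0 ? lmax : 0;          // never go below level 0
    }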

33 Example Applications Boundary element method accelerated by a GMRES/FMM solver; potential external Dirichlet and Neumann problems; 1000 randomly oriented ellipsoids; 488,000 vertices and 972,000 elements.

34 Example Applications (Figure: total CPU time (s) vs. number of vertices N, log-log, for the BEM Dirichlet problem for ellipsoids (3D Laplace, FMM with p = 8); GMRES+FMM scales roughly linearly (y = ax), GMRES with low-memory and high-memory direct evaluation scales as y = bx^2, LU decomposition scales as y = cx^3; the O(N^2) memory threshold is marked.)

35 Example Applications (Figure: CPU time (s) of a single matrix-vector multiplication vs. number of vertices N (3D Laplace, FMM with p = 8), together with the number of GMRES iterations; direct evaluation with on-the-fly computation of matrix entries and multiplication of the stored matrix scale as y = bx^2, while the FMM scales as y = ax.)

36 Example Applications Sound pressure for kd = 0.96 (250 Hz), kd = 9.6 (2.5 kHz), and kd = 96 (25 kHz). BEM for the Helmholtz equation (FGMRES/FMM). Mesh: 132,072 elements, 65,539 vertices.

37 FMM on GPU Challenges Effect of GPU architecture on FMM complexity and optimization Accuracy Performance

38 Challenges Complex FMM data structure; the problem is not native to SIMD semantics; non-uniformity of the data causes problems with efficient work load (given the large number of threads); serial algorithms use recursive computations; existing libraries (CUBLAS) and the middleware approach are not sufficient, so high-performing FMM functions must be redesigned and written in Cu; little fast (shared/constant) memory for efficient implementation of the translation operators; absence of good debugging tools for the GPU.

39 High performance direct summation on GPU (total) (Figure: log-log plot of run time (s) vs. number of sources for computations of the potential (3D Laplace); both CPU and GPU direct summation scale as N^2, with the CPU about 600 times slower asymptotically; at small N the GPU time follows a quadratic-plus-linear-plus-constant fit.) CPU: 2.67 GHz Intel Core 2 Extreme QX 7400 (2 GB RAM, one of four cores employed); serial code, no partial caching, no loop unrolling (simple execution of the nested loops). GPU: NVIDIA GeForce 8800 GTX (peak 330 GFLOPS); estimated achieved rate: 190 GFLOPS.
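
A minimal CUDA kernel for this brute-force potential sum (a generic sketch in the spirit of standard N-body examples, not the authors' tuned code, which reached the quoted ~190 GFLOPS presumably with shared-memory tiling and loop unrolling):

    // v[j] = sum_i u_i / |y_j - x_i| (3D Laplace potential), one thread per receiver.
    // Sources are packed as float4 with the strength u_i stored in the w component.
    __global__ void direct_potential(const float4 *src, int n,
                                     const float4 *recv, float *v, int m)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= m) return;

        float4 yj = recv[j];
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {
            float4 xi = src[i];                       // xi.w holds the strength u_i
            float dx = yj.x - xi.x, dy = yj.y - xi.y, dz = yj.z - xi.z;
            float r2 = dx*dx + dy*dy + dz*dz;
            if (r2 > 0.0f)
                sum += xi.w * rsqrtf(r2);             // u_i / r, skipping coincident points
        }
        v[j] = sum;
    }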

40 Direct summation on GPU (final step in the FMM) (Figure: run time (s) vs. number of sources for the sparse matrix-vector product (3D Laplace) at settings optimal for the CPU: maximum cluster size s_max = 60, l_max = 5; the CPU time follows Time = CNs with s = 8^{-l_max} N, the GPU curve is fitted by y = 8^{-l_max} d x^2 + e x + 8^{l_max} f, and the CPU is about 16 times slower at these settings.) On the GPU the time is modeled as Time = A_1 N + B_1 N/s + C_1 N s, with terms corresponding to read/write and float computations and to access to box data; these parameters depend on the hardware.

41 Direct summation on GPU (final step in the FMM) Compare the GPU final-summation complexity, Cost = A_1 N + B_1 N/s + C_1 N s, with the total FMM complexity, Cost = AN + BN/s + CNs. The optimal cluster size for the direct-summation step of the FMM is s_opt = (B_1/C_1)^{1/2}, and this can only increase for the full algorithm, since its complexity is Cost = (A + A_1)N + (B + B_1)N/s + C_1 N s, so that s_opt = ((B + B_1)/C_1)^{1/2}.

42 Direct summation on GPU (final step in the FMM) (Figure: run time (s) vs. number of sources for the sparse matrix-vector product (3D Laplace) at settings optimal for the GPU: maximum cluster size s_max = 320, l_max = 4; the GPU curve is again fitted by y = 8^{-l_max} d x^2 + e x + 8^{l_max} f, and the CPU is now about 300 times slower.)

43 Direct summation on GPU (final step in the FMM) (Table: computations of the potential at settings optimal for the CPU and the GPU; columns are N, serial CPU time (s), GPU time (s), and the CPU/GPU time ratio, with ratios reaching about 69.)

44 Important conclusion: since the optimal max level of the octree when using the GPU is lower than that for the CPU, the importance of optimizing the translation subroutines diminishes.

45 Other steps of the FMM on GPU Accelerations in the range 5-60; the effective accelerations for N = 1,048,576 are higher once the reduction of the max level is taken into account.

46 Accuracy Relative L2-norm error measure; CPU single-precision direct summation was taken as "exact"; 100 sampling points were used.
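
The error measure referred to here is presumably the usual relative L2 norm over the sampling points, which in plain C reads (illustrative sketch):

    #include <math.h>

    // Relative L2 error of v_approx against v_exact over m sampling points:
    // err = sqrt( sum_j (v_approx[j] - v_exact[j])^2 / sum_j v_exact[j]^2 ).
    double relative_l2_error(const float *v_approx, const float *v_exact, int m)
    {
        double num = 0.0, den = 0.0;
        for (int j = 0; j < m; ++j) {
            double d = (double)v_approx[j] - (double)v_exact[j];
            num += d * d;
            den += (double)v_exact[j] * (double)v_exact[j];
        }
        return den > 0.0 ? sqrt(num / den) : sqrt(num);
    }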

47 What is more accurate for the solution of large problems on the GPU: direct summation or the FMM? (Figure: relative L2 error in computations of the potential vs. number of sources, for the FMM with p = 4, 8, 12 and for direct summation, on the GPU (filled symbols) and the CPU (empty symbols).) The error is computed over a grid of 729 sampling points, relative to the exact solution, taken to be direct summation in double precision. A possible reason why the GPU error in direct summation grows: systematic roundoff error in the computation of the function 1/sqrt(x) (still an open question).

48 Performance
N = 1,048,576, potential only:
            serial CPU    GPU      Ratio
  p = 4       ... s       ... s     33
  p = 8       ... s       ... s     56
  p = 12      ... s       ... s     48
N = 1,048,576, potential + forces (gradient):
            serial CPU    GPU      Ratio
  p = 4       ... s       ... s     ...
  p = 8       ... s       ... s     ...
  p = 12      ... s       ... s     ...

49 Performance (Figure: three log-log panels of run time (s) vs. number of sources for the 3D Laplace kernel at p = 4, 8, and 12, each showing CPU direct (y = ax^2), GPU direct (y = bx^2), CPU FMM (y = cx), and GPU FMM (y = dx); the direct-summation speedup is a/b = 600 in all panels, and the FMM speedup c/d is 30 for p = 4 and 50 for p = 8 and p = 12.)

50 Performance Computations of the potential and forces: the peak performance of the GPU for direct summation is 290 Gigaflops, while for the FMM on the GPU effective rates in the Teraflops range are observed (using the effective-flop-count convention of the citation below). M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda, T. Sterling & G.S. Winckelmans, "Pentium Pro inside: I. A treecode at 430 Gigaflops on ASCI Red," Bell prize winning paper at SC'97, 1997. (Chart: effective rates for GPU direct, FMM direct-sum part, FMM, and CPU.)
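
The "effective rate" bookkeeping used in treecode papers counts the flops of the equivalent all-pairs evaluation and divides by the fast method's run time; a sketch of that calculation in plain C (the 38 flops-per-interaction figure is a convention common in the N-body literature, assumed here rather than quoted from the slide):

    #include <stdio.h>

    // Effective flop rate of a fast summation: flops of the equivalent direct
    // O(N^2) evaluation divided by the observed run time of the fast method.
    double effective_gflops(double n_sources, double flops_per_pair, double seconds)
    {
        return n_sources * n_sources * flops_per_pair / seconds / 1.0e9;
    }

    int main(void)
    {
        // Hypothetical example: N = 1,048,576 sources, 38 flops per pairwise
        // interaction, and a one-second FMM run time.
        printf("effective rate: %.1f GFLOPS\n",
               effective_gflops(1048576.0, 38.0, 1.0));
        return 0;
    }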

51 Conclusions What do we have What is next

52 What do we have Some insight into, and methods for, programming on the GPU; a high-performance FMM for the 3D Laplace kernel running in a 1 CPU - 1 GPU configuration; results encouraging enough to continue.

53 What is next Applications of the FMM matrix-vector multiplier to the solution of physics-based and engineering problems (particle/molecular dynamics, boundary element methods, RBF interpolation, etc.); mapping the algorithms that generate the FMM data structures onto the GPU; developing the FMM for larger CPU/GPU clusters to hit big problems; further research to adapt the algorithms to the GPU, particularly more efficient use of shared/constant memory; FMM on GPU for different kernels (we see a lot of applications); continued work toward simplifying GPU programming for scientists; upgrades in hardware (double precision, larger memory, etc.) are expected.

54 Thank you!

55 Outline Introduction Large computational tasks Moore's law Graphical processors (GPU) General purpose programming on GPU Challenges NVIDIA and CUDA Math libraries (CUBLAS, CUFFT) Middleware libraries Fast Multipole Method (FMM) Algorithm Data structures Translation theory Complexity and optimizations FMM on GPU Challenges Effect of GPU architecture on FMM complexity and optimization Accuracy Performance Conclusions What do we have What is next
