Terascale on the desktop: Fast Multipole Methods on Graphical Processors
1 Terascale on the desktop: Fast Multipole Methods on Graphical Processors. Nail A. Gumerov, Fantalgo, LLC and Institute for Advanced Computer Studies, University of Maryland (joint work with Ramani Duraiswami). This work has been supported by NASA. Presented on September 25, 2007 at the Numerical Analysis Seminar, UMD, College Park.
2 Outline: Introduction; General purpose programming on graphical processors (GPU); Fast Multipole Method (FMM); FMM on GPU; Conclusions.
3 Introduction: Large computational tasks; Moore's law; Graphical processors (GPU).
4 Large problems. Example 1: Sound scattering from complex shapes. Problem: Boundary value problem for the Helmholtz equation in a complex 3D domain for a range of frequencies (e.g. 200 frequencies from 20 Hz to 20 kHz). Mesh: elements, vertices. KEMAR mesh: elements, vertices.
5 Large problems. Example 2: Stellar dynamics. Problem: Compute the dynamics of a star cluster (solve a large system of ODEs). Info: A galaxy like the Milky Way has on the order of 100 billion stars and evolves over billions of years.
6 Large problems. Example 3: Imaging (Medical, Geo, Weather, etc.), Computer Vision and Graphics. Problems: 3D, 4D (and more D) interpolation of scattered data; Discrete transforms; Data compression and representation. Much more
7 Moore's law. In 1965, Intel cofounder Gordon Moore saw the future. His prediction, now popularly known as Moore's Law, states that the number of transistors on a chip doubles about every two years. Other versions: every 18 months, every X months.
8 Some other laws. Wirth's law: Software is decelerating faster than hardware is accelerating. Gates' law: The speed of commercial software generally slows by fifty percent every 18 months (never formulated explicitly by Bill Gates; the name rather relates to Microsoft products).
9 Graphical processors (GPU). RSX GPU at 550 MHz: 1.8 teraflops floating-point performance.
10 Graphical processors (GPU). NVIDIA GeForce 8800 GTX (July 2007, $500). The NVIDIA Tesla C870 GPU Computing processor and D870 Deskside Supercomputer will be available in October.
Model | Memory | GFLOPS
8800 GTX | 768 MB | 330
C870 Tesla | 1.5 GB | 500
11 General purpose programming on GPU (GPGPU): Challenges; Programming languages; NVIDIA and CUDA; Math libraries (CUBLAS, CUFFT); Our middleware library; Examples of programming using the middleware library.
12 Challenges
- SIMD (single instruction, multiple data) semantics: common algorithms must be mapped to SIMD; performance degrades heavily for operations on structured data; optimal sizing depends on the GPU architecture (e.g. 256 threads per block, 32n blocks, 8 processors, 16 multiprocessors);
- Lack of high-level language compilers and environments friendly to scientific programming: the scientific programmer must handle many low-level programming issues; debugging is difficult;
- Elementary computing that is unusual for scientific programming: accuracy and error handling not compliant with the IEEE standards; native single-precision float computations; many basic math functions are missing;
- Substantial differences in access speed between memory types and little local (cache or fast shared) memory: algorithms must be redesigned to account for several memory types that differ in size and access speed;
- Skillful programming is needed to realize the full potential of the hardware (learning, training, experience, etc.).
13 Programming languages: OpenGL, ActiveX, Direct3D, ...; Cg; Cu.
14 Compute Unified Device Architecture (CUDA) of NVIDIA. Ideology: The main flow of the algorithm is realized on the CPU with conventional C or C++ programming; High-performance functions are implemented in Cu and precompiled to objects (an analog of dynamically linked libraries); An API is used to communicate with the GPU: allocate/deallocate GPU global memory; transfer data between the GPU and CPU; manage global variables on the GPU; call global functions implemented in Cu; Several high-performance functions, including FFT and BLAS functions, are implemented and callable as library functions.
15 Math libraries (CUBLAS and CUFFT). Callable from C and C++; CUBLAS is supplied with a Fortran wrapper; CUFFT can also be wrapped easily, e.g. to be called from Fortran.
16 Our middleware library (main concepts). Global algorithm instructions should be executed on the CPU, while data should reside on the GPU (minimize data exchange between the CPU and GPU); Variables allocated in global GPU memory should be transparent to the CPU (easy dereferencing, access to parts of variables); Easy handling of different types, sizes, and shapes of variables; Short function headers, similar to those in high-level languages (C, Fortran); Full use of CUBLAS, CUFFT, and other future math libraries; Easy writing of new custom functions and their inclusion in the library.
17 Realization. Use structures available in C and Fortran 9x (encapsulation); Use CUDA; Wrap some CUDA and CUBLAS/CUFFT functions; Use customizable modules in Cu for easy modification of existing functions or writing of new ones.
18 Example of fast GPU-ization of a Fortran 90 subroutine using the middleware library. The original Fortran subroutine is from the pseudospectral plasma simulation program of William Dorland.
19 Example: Performance of a code generated using the middleware library. [Figure: 2D MHD simulations (100 time steps of the pseudospectral method); time (s) versus N (equivalent N x N grid), log-log axes; CPU and GPU curves with fits y = ax^2 and y = bx^2; a/b = 25.] CPU: 2.67 GHz Intel Core 2 Extreme QX 7400 (2 GB RAM, one of four CPUs employed). GPU: NVIDIA GeForce 8800 GTX.
20 Fast Multipole Method (FMM): About; Algorithm; Data structures; Translation theory; Complexity and optimizations; Example applications.
21 About the FMM. Introduced by Rokhlin & Greengard (1987, 1988) for computation of 2D and 3D fields for the Laplace equation; Reduces the complexity of the matrix-vector product from O(N^2) to O(N) or O(N log N) (depending on the data structure); Hundreds of publications for various 1D, 2D, and 3D problems (Laplace, Helmholtz, Maxwell, Yukawa potentials, etc.); We taught the first course in the country on FMM fundamentals & applications at the University of Maryland (2002); Our reports on the fundamentals of the FMM and our lectures are available online (visit our web pages).
22 About the FMM. Problem: Compute matrix-vector product. Some kernels: Laplace 3D; Helmholtz 3D; Gaussian nD.
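For orientation, the matrix-vector product and the three named kernels are commonly written as below. This is a sketch in assumed notation: the symbols $q_i$ (source strengths), $\Phi$ (kernel), $k$ (wavenumber), $\sigma$ (Gaussian width), and $M$ (number of receivers) are not taken from the slide, and normalization constants such as $1/(4\pi)$ are omitted.

```latex
\begin{align*}
\phi(\mathbf{y}_j) &= \sum_{i=1}^{N} q_i\, \Phi(\mathbf{y}_j, \mathbf{x}_i),
  \qquad j = 1, \dots, M, \\
\text{Laplace 3D:}\quad \Phi(\mathbf{y}, \mathbf{x}) &= \frac{1}{|\mathbf{y} - \mathbf{x}|}, \\
\text{Helmholtz 3D:}\quad \Phi(\mathbf{y}, \mathbf{x}) &= \frac{e^{ik|\mathbf{y} - \mathbf{x}|}}{|\mathbf{y} - \mathbf{x}|}, \\
\text{Gaussian } n\text{D:}\quad \Phi(\mathbf{y}, \mathbf{x}) &= e^{-|\mathbf{y} - \mathbf{x}|^2 / \sigma^2}.
\end{align*}
```

Evaluating the sum directly for all $M$ receivers costs $O(NM)$ operations, which is the $O(N^2)$ cost that the FMM reduces.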
23 Major principle: Use expansions. Theorem 1: The field of a single source located at $x_0$ can be locally expanded about center $x_*$ into an absolutely and uniformly convergent series in the domain $|y - x_*| < r < R < |x_0 - x_*|$ (local, or R-expansion). Corollary: the same holds for the field of $s$ sources located outside the sphere, $|x_i - x_*| > R$. Theorem 2: The field of a single source located at $x_0$ can be expanded about center $x_*$ into an absolutely and uniformly convergent series in the domain $|y - x_*| > R > r > |x_0 - x_*|$ (multipole, or S-expansion). Corollary: the same holds for the field of $s$ sources located inside the sphere, $|x_i - x_*| < r$. Theorem 3: R- and S-expansions can be translated (change of the basis and/or the expansion center, subject to geometric constraints). E.g. for Laplace kernel:
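The convergence claim of Theorem 2 can be checked numerically for a single source of the 3D Laplace kernel. The sketch below is an illustration under stated assumptions, not the expansion basis used in the talk: it takes the expansion center at the origin and uses the classical Legendre series 1/|y - x0| = sum_n |x0|^n / |y|^(n+1) P_n(cos g), where g is the angle between x0 and y. The function names `legendre` and `s_expansion_point_source` are hypothetical.

```python
import math

def legendre(p, t):
    """Return [P_0(t), ..., P_{p-1}(t)] using the three-term recurrence."""
    vals = [1.0, t]
    for n in range(1, p - 1):
        vals.append(((2 * n + 1) * t * vals[n] - n * vals[n - 1]) / (n + 1))
    return vals[:p]

def s_expansion_point_source(x0, y, p):
    """p-term far-field series for 1/|y - x0| about the origin.

    Assumes 0 < |x0| < |y|; the truncation error decays like (|x0|/|y|)^p.
    """
    r0 = math.sqrt(sum(c * c for c in x0))
    r = math.sqrt(sum(c * c for c in y))
    cos_gamma = sum(a * b for a, b in zip(x0, y)) / (r0 * r)
    P = legendre(p, cos_gamma)
    return sum(r0 ** n / r ** (n + 1) * P[n] for n in range(p))

x0 = (0.3, -0.2, 0.1)   # source close to the expansion center
y = (2.0, 1.0, -1.5)    # evaluation point far from it
exact = 1.0 / math.dist(y, x0)
for p in (4, 8, 12):
    print(p, abs(s_expansion_point_source(x0, y, p) - exact))
```

The printed error shrinks as the truncation number p grows, which is the practical content of the theorem: a fixed p gives a fixed accuracy for all well-separated evaluation points.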
24 FMM algorithm. The computational domain (nD cube) is partitioned by a quadtree (2D) or octree (3D). Upward pass: get S-expansions for all boxes (skip empty ones); S-expansion means singular, or multipole, or far-field expansion. 1. Get S-expansions for Max Level. 2. Get S-expansions for other levels (use S|S-translations).
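The two steps of the upward pass can be sketched on a deliberately simplified model: the 1D kernel 1/(y - x) on a binary tree, for which the S-expansion about a center c is sum_k a_k / (y - c)^(k+1) with a_k = sum_i q_i (x_i - c)^k. This kernel, the helper names, and the sample data are assumptions made for illustration; the talk itself treats the 3D Laplace kernel.

```python
from math import comb

def s_coefficients(sources, strengths, center, p):
    """S-expansion coefficients a_k = sum_i q_i (x_i - center)^k, k = 0..p-1."""
    return [sum(q * (x - center) ** k for x, q in zip(sources, strengths))
            for k in range(p)]

def translate_ss(a, old_center, new_center):
    """S|S-translation: re-center the coefficients from old_center to new_center."""
    t = old_center - new_center
    return [sum(comb(m, k) * t ** (m - k) * a[k] for k in range(m + 1))
            for m in range(len(a))]

def evaluate_s(a, center, y):
    """Evaluate sum_k a_k / (y - center)^(k+1); valid far from the box."""
    return sum(a_k / (y - center) ** (k + 1) for k, a_k in enumerate(a))

# Two child boxes [0, 0.5) and [0.5, 1) of the parent box [0, 1).
p = 10
left = ([0.05, 0.20, 0.45], [1.0, -0.5, 2.0])   # (positions, strengths)
right = ([0.55, 0.90], [0.7, 1.3])

# Step 1: S-expansions at the finest level, about the child centers.
a_left = s_coefficients(*left, center=0.25, p=p)
a_right = s_coefficients(*right, center=0.75, p=p)

# Step 2: the parent S-expansion is the sum of the translated child expansions.
a_parent = [u + v for u, v in zip(translate_ss(a_left, 0.25, 0.5),
                                  translate_ss(a_right, 0.75, 0.5))]
```

For this kernel the S|S-translation is a lower-triangular operation on the coefficients, so the parent coefficients agree with those computed directly from the sources; the parent never needs to revisit individual sources.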
25 FMM algorithm. Downward pass: get R-expansions for all boxes (skip empty ones); R-expansion means regular, or local, or near-field expansion. 1. Get R-expansions from the S-expansions of the boxes in the neighborhood (use S|R-translations). 2. Get R-expansions from the parent (use R|R-translations).
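The two translations of the downward pass can be illustrated on the same kind of simplified model, the 1D kernel 1/(y - x), where an R-expansion about a center c is the polynomial sum_m b_m (y - c)^m. The kernel, the function names, and the box layout below are assumptions chosen for a compact sketch rather than the 3D operators of the talk.

```python
from math import comb

def translate_sr(a, s_center, r_center, p):
    """S|R-translation: far-field coefficients about s_center become
    p local coefficients about r_center (boxes must be well separated)."""
    t = r_center - s_center
    return [sum((-1) ** m * comb(k + m, m) * a_k / t ** (k + m + 1)
                for k, a_k in enumerate(a))
            for m in range(p)]

def translate_rr(b, old_center, new_center):
    """R|R-translation: re-center a local polynomial expansion (parent to child)."""
    t = new_center - old_center
    p = len(b)
    return [sum(comb(m, k) * t ** (m - k) * b[m] for m in range(k, p))
            for k in range(p)]

def evaluate_r(b, center, y):
    """Evaluate the local expansion sum_m b_m (y - center)^m."""
    return sum(b_m * (y - center) ** m for m, b_m in enumerate(b))

p = 12
sources, strengths = [0.05, 0.20, 0.45], [1.0, -0.5, 2.0]
# S-expansion of the source box [0, 0.5) about its center 0.25.
a = [sum(q * (x - 0.25) ** k for x, q in zip(sources, strengths))
     for k in range(p)]

# Step 1: S|R-translation to the receiver box [2, 3), centered at 2.5.
b_parent = translate_sr(a, 0.25, 2.5, p)
# Step 2: R|R-translation from that box to its child [2, 2.5), centered at 2.25.
b_child = translate_rr(b_parent, 2.5, 2.25)
```

In a full FMM each box would add the S|R contributions of all boxes in its interaction neighborhood to the expansion inherited from its parent; the sketch shows one contribution of each kind.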
26 FMM algorithm. Final evaluation: evaluate R-expansions for boxes at Max Level and directly sum the contributions of sources in the neighborhood of receivers. 1. Evaluate R-expansions. 2. Direct summation.
27 Data structures
- Binary trees, quadtrees, or octrees (1D, 2D, and 3D);
- MaxLevel is determined from the clustering parameter;
- Lists of neighbors and a neighborhood data structure must be built (e.g. for fast determination of boxes in the neighborhood of the parent box);
- Sources and receivers must be indexed by box;
- We do all this serially on the CPU using the bit-interleaving technique, sorting, and operations on sets (union, intersection, etc.);
- Overall complexity of our algorithm is O(N log N);
- Normally the cost of generating the data structure is lower than that of the run part of the algorithm;
- Additional amortization: in many problems the data structure is set once, while the run part is executed many times (e.g. iterative solution of a linear system);
- Parallelizing this algorithm is a non-trivial task, which we expect to address in the near future.
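Bit interleaving can be sketched as follows for an octree over the unit cube. This is a minimal illustration, assuming points in [0, 1)^3 and a particular bit order (x highest, z lowest); the function names are hypothetical and the authors' actual indexing scheme may differ.

```python
def interleave3(ix, iy, iz, level):
    """Interleave the low `level` bits of three integer box coordinates."""
    index = 0
    for bit in range(level):
        index |= ((ix >> bit) & 1) << (3 * bit + 2)
        index |= ((iy >> bit) & 1) << (3 * bit + 1)
        index |= ((iz >> bit) & 1) << (3 * bit)
    return index

def box_index(point, level):
    """Index of the level-`level` octree box containing a point of [0, 1)^3."""
    n = 1 << level                      # boxes per dimension at this level
    ix, iy, iz = (min(int(c * n), n - 1) for c in point)
    return interleave3(ix, iy, iz, level)

def parent(index):
    """Parent box index: drop the last three interleaved bits."""
    return index >> 3

def children(index):
    """Indices of the eight children of a box."""
    return [(index << 3) | c for c in range(8)]

# Sorting by box index groups the points box by box.
points = [(0.3, 0.6, 0.8), (0.9, 0.1, 0.1), (0.31, 0.61, 0.79), (0.05, 0.9, 0.4)]
order = sorted(range(len(points)), key=lambda i: box_index(points[i], 3))
```

With this indexing the parent and children of a box follow from shifts alone, and a single sort of the indices yields the per-box lists of sources and receivers, which is consistent with the O(N log N) cost quoted above.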
28 Translation theory
- Standard translation method: apply a $p^2 \times p^2$ matrix to a vector of $p^2$ expansion coefficients, $O(p^4)$ complexity;
- There exist $O(p^3)$ methods: currently we use the RCR-decomposition of the translation operators (Rotation, Coaxial translation, back Rotation); sparse matrix decomposition (a bit slower, but needs less local memory);
- There exist $O(p^2)$ or $O(p^2 \log p)$ methods based on diagonal forms of translation operators: Greengard-Rokhlin exponential forms for truncated conical domains (require some $O(p^3)$ transforms); FFT-based methods (large asymptotic constants); our own $O(p^2)$ diagonal form method (problems with numerical stability, especially in single precision);
- Diagonal forms require larger function representations (samplings on a grid) than spectral expansions, and their effective asymptotic constants are larger than for the RCR;
- For relatively low p, which is sufficient for single precision (p = 4, ..., 12), the RCR method is comparable in speed with the fastest $O(p^2)$ methods.
29 Translation theory (RCR-decomposition). From group theory it follows that a general translation, of cost $O(p^4)$, can be reduced to three steps of cost $O(p^3)$ each: rotation, coaxial translation, and back rotation. [Figure: x, y, z coordinate frames for each step.]
30 Translation theory. Reduction of translation complexity by using translation stencils and a variable truncation number: e.g. S-expansions from the shaded boxes can be translated to the center of the parent box with the same error bound as from the white box to the child. Also, each box can use its own truncation number.
31 Complexity of the FMM. Operations that do not depend on the number of boxes (generation of S-expansions and evaluation of R-expansions): complexity ~ $AN$ (assuming that the numbers of sources and receivers are of the same order). Translation operations, which depend only on the number of boxes and not on the number of sources/receivers: complexity ~ $B N_{boxes} \sim BN/s$ ($s$ is the clustering parameter). Direct summation, which depends on the number of sources/receivers and the number of sources in the neighborhood of receivers: complexity ~ $CNs$. Total complexity: $Cost = Cost_{exp} + Cost_{trans} + Cost_{dir} \sim AN + BN/s + CNs$.
32 Optimization of the FMM (uniform distributions). Total complexity: $Cost(s) = AN + BN/s + CNs$. $s_{opt} = (B/C)^{1/2}$. Optimal Max Level of the octree: $l_{max} = \log_8(N/s_{opt})$.
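The optimum follows from setting dCost/ds = -BN/s^2 + CN to zero, which gives s_opt = (B/C)^(1/2). The sketch below evaluates the cost model numerically; the constants A, B, C are hypothetical values chosen only for illustration, and rounding l_max to the nearest integer is an assumption, since a tree level must be a whole number.

```python
import math

def fmm_cost(N, s, A, B, C):
    """Cost model from the slide: expansions + translations + direct summation."""
    return A * N + B * N / s + C * N * s

def optimal_cluster_size(B, C):
    """Minimizer of the cost model over s: s_opt = sqrt(B / C)."""
    return math.sqrt(B / C)

def optimal_max_level(N, s_opt):
    """l_max = log_8(N / s_opt), rounded to the nearest non-negative integer."""
    return max(0, round(math.log(N / s_opt, 8)))

# Hypothetical hardware-dependent constants (not measured values).
A, B, C = 100.0, 4000.0, 10.0
N = 1_048_576

s_opt = optimal_cluster_size(B, C)
l_max = optimal_max_level(N, s_opt)
print(s_opt, l_max, fmm_cost(N, s_opt, A, B, C))
```

Because only the ratio B/C enters s_opt, making direct summation cheaper relative to translations (smaller C) pushes the optimum toward larger clusters and a shallower tree.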
33 Example Applications. Boundary Element Method accelerated by a GMRES/FMM solver. Potential external Dirichlet and Neumann problems. 1000 randomly oriented ellipsoids; 488,000 vertices and 972,000 elements.
34 Example Applications. [Figure: BEM Dirichlet problem for ellipsoids (3D Laplace), FMM with p = 8; total CPU time (s) versus number of vertices N, log-log axes; curves: GMRES+FMM, GMRES+Low Mem Direct, GMRES+High Mem Direct, LU-decomposition; reference fits y = ax, y = bx^2, y = cx^3; the O(N^2) memory threshold is marked.]
35 Example Applications. [Figure: single matrix-vector multiplication (3D Laplace), FMM with p = 8; CPU time (s) versus number of vertices N, log-log axes; curves: Direct + Matrix Entries Computation, Multiplication of Stored Matrix, FMM; reference fits y = ax and y = bx^2; number of GMRES iterations also shown.]
36 Example Applications. BEM for the Helmholtz equation (FGMRES/FMM). Mesh: 132,072 elements, 65,539 vertices. [Figure: sound pressure for kD = 0.96 (250 Hz), kD = 9.6 (2.5 kHz), kD = 96 (25 kHz).]
37 FMM on GPU: Challenges; Effect of GPU architecture on FMM complexity and optimization; Accuracy; Performance.
38 Challenges. Complex FMM data structure; The problem is not native to SIMD semantics: non-uniformity of data causes problems with efficient work load distribution (given the large number of threads); serial algorithms use recursive computations; existing libraries (CUBLAS) and the middleware approach are not sufficient; high-performance FMM functions must be redesigned and written in Cu; Little fast (shared/constant) memory for efficient implementation of translation operators; Absence of good debugging tools for GPU.
39 High performance direct summation on GPU (total). [Figure: computation of the potential (3D Laplace); time (s) versus number of sources, log-log axes; curves: CPU Direct and GPU Direct; fits y = ax^2, y = bx^2, and y = bx^2 + cx + d; b/a = 600.] CPU: 2.67 GHz Intel Core 2 Extreme QX 7400 (2 GB RAM, one of four CPUs employed). CPU direct: serial code; no use of partial caching; no loop unrolling (simple execution of a nested loop). GPU: NVIDIA GeForce 8800 GTX (peak 330 GFLOPS). Estimated achieved rate: 190 GFLOPS.
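The "simple execution of a nested loop" used as the CPU baseline can be sketched as below for the 3D Laplace potential. This is an illustrative Python version, not the benchmarked code; the function name and the convention of skipping coincident source and receiver points are assumptions.

```python
import math

def direct_potential(sources, strengths, receivers):
    """O(N*M) direct summation of phi(y_j) = sum_i q_i / |y_j - x_i|."""
    potentials = []
    for y in receivers:                      # outer loop over receivers
        total = 0.0
        for x, q in zip(sources, strengths): # inner loop over sources
            r = math.dist(y, x)
            if r > 0.0:                      # skip the singular self-interaction
                total += q / r
        potentials.append(total)
    return potentials

sources = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
strengths = [1.0, 2.0]
receivers = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
print(direct_potential(sources, strengths, receivers))
```

Every receiver is independent of the others in the outer loop, which is why this computation maps naturally onto a large number of GPU threads.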
40 Direct summation on GPU (final step in the FMM). [Figure: computation of the potential (3D Laplace), optimal settings for CPU: sparse matrix-vector product with s_max = 60, l_max = 5; time (s) versus number of sources, log-log axes; CPU and GPU curves; fits y = ax^2, y = bx, y = cx, and y = 8^{-l_max} dx^2 + ex + 8^{l_max} f; b/c = 16.] CPU: Time = CNs, s = 8^{-l_max} N. GPU: Time = A_1 N + B_1 N/s + C_1 N s, with the three terms corresponding to read/write, access to box data, and float computations, respectively. These parameters depend on the hardware.
41 Direct summation on GPU (final step in the FMM). Compare the GPU final summation complexity, $Cost = A_1 N + B_1 N/s + C_1 N s$, and the total FMM complexity, $Cost = AN + BN/s + CNs$. The optimal cluster size for the direct summation step of the FMM is $s_{opt} = (B_1/C_1)^{1/2}$, and this can only increase for the full algorithm, since its complexity is $Cost = (A + A_1)N + (B + B_1)N/s + C_1 N s$, which gives $s_{opt} = ((B + B_1)/C_1)^{1/2}$.
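The step behind both optimal cluster sizes is the same one-variable minimization. Written out as a short derivation, using only the constants defined on the slide and assuming they are non-negative:

```latex
\begin{align*}
\frac{d}{ds}\left[ A_1 N + \frac{B_1 N}{s} + C_1 N s \right]
  = -\frac{B_1 N}{s^2} + C_1 N = 0
  \;&\Longrightarrow\; s_{opt} = \left( \frac{B_1}{C_1} \right)^{1/2}, \\
\frac{d}{ds}\left[ (A + A_1) N + \frac{(B + B_1) N}{s} + C_1 N s \right] = 0
  \;&\Longrightarrow\; s_{opt} = \left( \frac{B + B_1}{C_1} \right)^{1/2}
  \;\ge\; \left( \frac{B_1}{C_1} \right)^{1/2}.
\end{align*}
```

The inequality holds because $B \ge 0$: adding the translation cost $BN/s$ only increases the weight of the $1/s$ term, so the full algorithm favors clusters at least as large as those optimal for direct summation alone.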
42 Direct summation on GPU (final step in the FMM). [Figure: computation of the potential (3D Laplace), optimal settings for GPU: sparse matrix-vector product with s_max = 320, l_max = 4; time (s) versus number of sources, log-log axes; CPU and GPU curves; fits y = ax^2, y = bx, y = cx, and y = 8^{-l_max} dx^2 + ex + 8^{l_max} f; b/c = 300.]
43 Direct summation on GPU (final step in the FMM). Computations of potential, optimal settings for CPU and GPU. [Table: columns N, Serial CPU (s), GPU (s), Time Ratio; the only legible entry is the final Time Ratio, 69.]
44 Important conclusion: Since the optimal max level of the octree is lower on the GPU than on the CPU, optimization of the translation subroutines becomes less important.
45 Other steps of the FMM on GPU. Accelerations in the range 5-60; Effective accelerations for N = 1,048,576 (taking into account max level reduction):
46 Accuracy. Relative $L_2$-norm error measure: CPU single-precision direct summation was taken as "exact"; 100 sampling points were used.
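A relative L2-norm error over a set of sampling points is usually defined as the norm of the difference divided by the norm of the reference. The sketch below uses that common definition as an assumption, since the slide's own formula is not reproduced here; the function name and the sample values are hypothetical.

```python
import math

def relative_l2_error(approx, exact):
    """sqrt( sum_j (approx_j - exact_j)^2 / sum_j exact_j^2 )."""
    num = sum((a - e) ** 2 for a, e in zip(approx, exact))
    den = sum(e ** 2 for e in exact)
    return math.sqrt(num / den)

# Reference values at the sampling points and a perturbed approximation.
reference = [3.0, 4.0]
approximation = [3.0, 4.5]
print(relative_l2_error(approximation, reference))
```

Normalizing by the reference makes errors comparable across problem sizes, where the magnitude of the potential itself grows with the number of sources.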
47 What is more accurate for the solution of large problems on GPU: direct summation or FMM? [Figure: error in computation of the potential; relative L2 error versus number of sources, log-log axes; curves: Direct and FMM with p = 4, 8, 12; filled markers = GPU, empty markers = CPU.] Error computed over a grid of 729 sampling points, relative to the "exact" solution, which is direct summation in double precision. Possible reason why the GPU error in direct summation grows: systematic roundoff error in computation of the function 1/sqrt(x) (still an open question).
48 Performance. N = 1,048,576 (potential only): [Table: columns p, serial CPU (s), GPU (s), Ratio; the legible Ratio entries of the three rows are 33, 56, and 48.] N = 1,048,576 (potential + forces (gradient)): [Table: columns p = 4, p = 8, p = 12; rows serial CPU (s), GPU (s), Ratio.]
49 Performance. [Figure, three panels (p = 4, p = 8, p = 12): run time (s) versus number of sources for 3D Laplace, log-log axes; curves: CPU Direct, CPU FMM, GPU Direct, GPU FMM; fits y = ax^2, y = bx^2, y = cx, y = dx; a/b = 600 in all panels; c/d = 30 for p = 4, c/d = 50 for p = 8 and p = 12.]
50 Performance. Computations of the potential and forces: peak performance of the GPU for direct summation is 290 Gigaflops, while for the FMM on GPU effective rates in the Teraflops range are observed (following the citation below). M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda, T. Sterling & G.S. Winckelmans, "Pentium Pro inside: I. A treecode at 430 Gigaflops on ASCI Red", Gordon Bell Prize winning paper at SC'97.
51 Conclusions: What do we have; What is next.
52 What do we have. Some insight into and methods of programming on GPU; A high-performance FMM for the 3D Laplace kernel running in a 1 CPU - 1 GPU configuration; Results that encourage us to continue.
53 What is next
- Applications of the FMM matrix-vector multiplier to the solution of physics-based and engineering problems (particle/molecular dynamics, boundary element methods, RBF interpolation, etc.);
- Mapping the algorithms that generate FMM data structures onto the GPU;
- Developing the FMM for larger CPU/GPU clusters and tackling big problems;
- Some research to adjust the algorithms for GPU, particularly more efficient use of shared/constant memory;
- FMM on GPU for different kernels (we see a lot of applications);
- Continuing to work towards simplification of programming on GPU for scientists;
- Upgrades in hardware (double precision, larger memory, etc.) are expected.
54 Thank you!
55 Outline. Introduction: Large computational tasks; Moore's law; Graphical processors (GPU). General purpose programming on GPU: Challenges; NVIDIA and CUDA; Math libraries (CUBLAS, CUFFT); Middleware libraries. Fast Multipole Method (FMM): Algorithm; Data structures; Translation theory; Complexity and optimizations. FMM on GPU: Challenges; Effect of GPU architecture on FMM complexity and optimization; Accuracy; Performance. Conclusions: What do we have; What is next.
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationReal-Time Ray Tracing Using Nvidia Optix Holger Ludvigsen & Anne C. Elster 2010
1 Real-Time Ray Tracing Using Nvidia Optix Holger Ludvigsen & Anne C. Elster 2010 Presentation by Henrik H. Knutsen for TDT24, fall 2012 Om du ønsker, kan du sette inn navn, tittel på foredraget, o.l.
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More informationScalable Distributed Fast Multipole Methods
Scalable Distributed Fast Multipole Methods Qi Hu, Nail A. Gumerov, Ramani Duraiswami University of Maryland Institute for Advanced Computer Studies (UMIACS) Department of Computer Science, University
More informationGPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS
GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge
More informationGPUML: Graphical processors for speeding up kernel machines
GPUML: Graphical processors for speeding up kernel machines http://www.umiacs.umd.edu/~balajiv/gpuml.htm Balaji Vasan Srinivasan, Qi Hu, Ramani Duraiswami Department of Computer Science, University of
More informationFast Multipole Methods. Linear Systems. Matrix vector product. An Introduction to Fast Multipole Methods.
An Introduction to Fast Multipole Methods Ramani Duraiswami Institute for Advanced Computer Studies University of Maryland, College Park http://www.umiacs.umd.edu/~ramani Joint work with Nail A. Gumerov
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationWhat Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others
What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University * slides thanks to Kavita Bala & many others Final Project Demo Sign-Up: Will be posted outside my office after lecture today.
More informationIntel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager
Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationData parallel algorithms, algorithmic building blocks, precision vs. accuracy
Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel
More informationThe meshfree computation of stationary electric current densities in complex shaped conductors using 3D boundary element methods
Boundary Elements and Other Mesh Reduction Methods XXXVII 121 The meshfree computation of stationary electric current densities in complex shaped conductors using 3D boundary element methods A. Buchau
More informationWindowing System on a 3D Pipeline. February 2005
Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationGPU ARCHITECTURE Chris Schultz, June 2017
GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationIntroduction to CUDA
Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations
More informationFlux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters
Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationMotivation Hardware Overview Programming model. GPU computing. Part 1: General introduction. Ch. Hoelbling. Wuppertal University
Part 1: General introduction Ch. Hoelbling Wuppertal University Lattice Practices 2011 Outline 1 Motivation 2 Hardware Overview History Present Capabilities 3 Programming model Past: OpenGL Present: CUDA
More informationCurrent Trends in Computer Graphics Hardware
Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)
More informationPLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters
PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters IEEE CLUSTER 2015 Chicago, IL, USA Luis Sant Ana 1, Daniel Cordeiro 2, Raphael Camargo 1 1 Federal University of ABC,
More informationA Multi-Tiered Optimization Framework for Heterogeneous Computing
A Multi-Tiered Optimization Framework for Heterogeneous Computing IEEE HPEC 2014 Alan George Professor of ECE University of Florida Herman Lam Assoc. Professor of ECE University of Florida Andrew Milluzzi
More informationRadial Basis Function-Generated Finite Differences (RBF-FD): New Opportunities for Applications in Scientific Computing
Radial Basis Function-Generated Finite Differences (RBF-FD): New Opportunities for Applications in Scientific Computing Natasha Flyer National Center for Atmospheric Research Boulder, CO Meshes vs. Mesh-free
More informationAlgorithms and Architecture. William D. Gropp Mathematics and Computer Science
Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?
More informationIntel Math Kernel Library
Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationPerformance and accuracy of hardware-oriented. native-, solvers in FEM simulations
Robert Strzodka, Stanford University Dominik Göddeke, Universität Dortmund Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations Number of slices
More informationGPU Programming Using NVIDIA CUDA
GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics
More informationInternational Supercomputing Conference 2009
International Supercomputing Conference 2009 Implementation of a Lattice-Boltzmann-Method for Numerical Fluid Mechanics Using the nvidia CUDA Technology E. Riegel, T. Indinger, N.A. Adams Technische Universität
More informationLecture 5. Performance programming for stencil methods Vectorization Computing with GPUs
Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationCFD Solvers on Many-core Processors
CFD Solvers on Many-core Processors Tobias Brandvik Whittle Laboratory CFD Solvers on Many-core Processors p.1/36 CFD Backgroud CFD: Computational Fluid Dynamics Whittle Laboratory - Turbomachinery CFD
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationGPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.
Basics of s Basics Introduction to Why vs CPU S. Sundar and Computing architecture August 9, 2014 1 / 70 Outline Basics of s Why vs CPU Computing architecture 1 2 3 of s 4 5 Why 6 vs CPU 7 Computing 8
More informationTurbostream: A CFD solver for manycore
Turbostream: A CFD solver for manycore processors Tobias Brandvik Whittle Laboratory University of Cambridge Aim To produce an order of magnitude reduction in the run-time of CFD solvers for the same hardware
More information