CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline
|
|
- Thomas Andrews
- 6 years ago
- Views:
Transcription
1 CMSC 858M/AMSC 698R Fast Multipole Methods Nail A. Gumerov & Ramani Duraiswami Lecture 20 Outline Two parts of the FMM Data Structures FMM Cost/Optimization on CPU Fine Grain Parallelization for Multicore Nodes Graphics Processors (GPUs) Fast Matrix-Vector Multiplication on GPU with On-Core Computable Kernel Change of the Cost/Optimization scheme on GPU compared to CPU Other steps Test Results Profile Accuracy Overall Performance Heterogeneous algorithm 1
2 Publication The major part of this work has been published: N.A. Gumerov and R. Duraiswami Fast multipole methods on graphics processors. Journal of Computational Physics, 227, , Q. Hu, N.A. Gumerov, and R. Duraiswami Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures, Proceedings of the Supercomputing 11 Conference, Seattle, November 12-18, (The paper is nominated for the best student paper award). Two parts of the FMM Generate data structure (define the matrix); Perform Matrix-vector product (run); Two different types of problems: Static matrix (e.g. for iterative solution of large linear system): Data structure should be generated once, while run should be performed many times with different input vectors; Dynamic matrix (e.g. N-body problem) Data structure should be generated each time, when matrixvector multiplication is needed. 2
3 Data Structures We need O(NlogN) procedure to index all the boxes, find children/parent and neighbors; Implemented on CPU using bit-interleaving (Morton-Peano indexing in octree) and sorting; We generate lists of Children/Parent/Neighbors/E4-neighbors and store in global memory; Recently, graduate students Michael Lieberman and Adam O Donovan implemented basic data structures on GPU (prefix sort, etc.) and found of order 10 speedups compared to the CPU for large data sets. The most recent work on data structures on GPU is performed by Hu Qi (up to 100 times speed up compared to the standard CPU). In the present lecture we will focus on the run algorithm, assuming that the necessary lists are generated and stored in the global memory. FMM on CPU 3
4 Flow Chart of FMM Run Level l max Levels l max -1,, 2 Level 2 Levels 3,,l max 1. Get S-expansion coefficients (directly) 2. Get S-expansion coefficients from children (S S translation) 3. Get R-expansion coefficients from far neighbors (S R translation) 4. Get R-expansion coefficients from far neighbors (S R translation) Start End Level l max 7. Sum sources in close neighborhood (directly) Level l max 6. Evaluate R-expansions (directly) 5. Get R-expansion coefficients from parents (R R translation) Direct Sparse Matrix- Vector Multiplication Approximate Dense Matrix- Vector Multiplication FMM Cost (Operations) Points are clustered with clustering parameter s, means not more that s points in the smallest octree box. Expansions have length p 2 (number of coefficients). There are O(N) points and O(N boxes )=O(N/s) in the octree. O(Np 2 ) O(N boxes p 3 ) O(p 3 ) O(N boxes p 3 ) Level l max Levels l max -1,, 2 Level 2 Levels 3,,l max 1. Get S-expansion coefficients (directly) 2. Get S-expansion coefficients from children (S S translation) 3. Get R-expansion coefficients from far neighbors (S R translation) 4. Get R-expansion coefficients from far neighbors (S R translation) Start End Level l max 7. Sum sources in close neighborhood (directly) Level l max 6. Evaluate R-expansions (directly) 5. Get R-expansion coefficients from parents (R R translation) O(Ns) O(Np 2 ) O(N boxes p 3 ) 4
5 FMM Cost/Optimization (operations) Since p=o(1), we have TotalCost = AN+BN noxes +CNs =N(A+B/s+Cs) Total Cost Cs s opt =(B/C) 1/2 B/s s opt s Fine Grain Parallelization for Multicore PC s Serial Algorithm Constructor Step 1 Loop 1 Step n Allocate Global Memory Parallel Loop 1 Parallel Algorithm Allocate Constructor 1 Step Step n Global Memory Fork parallelization Open MP C++, Fortran 9x Also auto vectorization of loops (SSE) Loop n Destructor Deallocate Parallel Loop n Destructor Deallocate Cannot write to the same memory address! 5
6 Some test results for OMP parallelization (4 cores) GPUs 6
7 A Quick Introduction to the GPU Graphics processing unit (GPU) is a highly parallel, multithreaded, many-core processor with high computation power and memory bandwidth GPU is designed for single instruction multiple data (SIMD) computation; more transistors for processing rather than data caching and flow control A few cores Hundreds cores Control ALU ALU ALU ALU Cache DRAM CPU DRAM GPU NVIDIA Tesla C2050: 1.25 Tflops single 0.52 Tflops double 448 cores Is It Expensive? Any PC has GPU which probably performs faster than the CPU GPUs with Teraflops performance are used in game stations Tens of millions of GPUs are produced each year Price for 1 good GPU in range $ Prices for the most advanced NVIDIA GPUs for general purpose computing (e.g. Tesla C2050) are in the range $1K-$2K Modern research supercomputer with several GPUs can be purchased for a few thousand dollars GPUs provide the best Gflops/$ ratio They also provide the best Gflops/watt 7
8 Floating-Point Operations for CPU and GPU Is It Easy to Program A GPU? For inexperienced GPU programmers Matlab Parallel Toolbox Middleware to program GPU from Fortran Relatively easy to incorporate to existing codes Developed by the authors at UMD Free (available online) For advanced users CUDA: a C-like programming language Math libraries are available Custom functions can be implemented Requires careful memory management Free (available online) Local memory ~50 kb GPU global memory ~1-4 GB Host memory ~4-128 GB 8
9 Availability Fortran-9X version is released for free public use. It is called FLAGON. FMM on GPU 9
10 Challenges Complex FMM data structure; Problem is not native for SIMD semantics; non-uniformity of data causes problems with efficient work load (taking into account large number of threads); serial algorithms use recursive computations; existing libraries (CUBLAS) and middleware approach are not sufficient; high performing FMM functions should be redesigned and written in CU; Low fast (shared/constant) memory for efficient implementation of translation operators; Absence of good debugging tools for GPU. Implementation Main algorithm is in Fortran 95 Data structures are computed on CPU Interface between Fortran and GPU is provided by our middleware (Flagon), which is a layer between Fortran and C++/CUDA High performance custom subroutines written using Cu 10
11 Time (s) 11/10/2011 Fast Kernel Matrix-vector multiplication on GPU High performance direct summation on GPU (total) 1.E+04 1.E+03 Computations of potential y=ax 2 CPU: 2.67 GHz Intel Core 2 extreme QX 7400 (2GB RAM and one of four CPUs employed). 1.E+02 1.E+01 CPU, Direct y=bx 2 GPU: NVIDIA GeForce 8800 GTX (peak 330 GFLOPS). GPU, Direct 1.E+00 1.E-01 1.E-02 y=bx 2 +cx+d b/a=600 1.E-03 3D Laplace 1.E-04 1.E+03 1.E+04 1.E+05 1.E+06 Number of Sources Estimated achieved rate: 190 GFLOPS. CPU direct: serial code; no use of partial caching; no loop unrolling; (simple execution of nested loop) 11
12 Receiver Data Thread registers Algorithm for direct summation on GPU (Kernel matrix multiplication) 1 2 n Global GPU memory Block of Data Shared Coalesced (fast) read to thread register (fast memory) Source Data Global GPU memory Coalesced (fast) copy to the shared (fast) memory Compute and add contribution of the j-th source to the sum handled by the i-th thread (performed by the i-th thread) i-th thread Write result to global GPU memory Block of Data Block of threads Change of Cost/Optimization on GPU 12
13 Time (s) 11/10/2011 Algorithm modification for direct summation in the FMM 1. Each block of threads executes computations for one receiver box (Blockper-box parallelization). 2. Each block of threads accesses the data structure. 3. Loop over all sources contributing to the result requires partially non-coalesced read of the source data (box by box). Direct summation on GPU (final step in the FMM) Computations of potential, optimal settings for CPU 1.E+02 1.E+01 Sparse Matrix-Vector Product s max =60 y=bx CPU: Time=CNs, s=8 -lmax N 1.E+00 1.E-01 y=ax 2 CPU 4 l max =5 y=cx GPU: Time=A 1 N+B 1 N/s+C 1 Ns 1.E-02 1.E GPU b/c=16 y=8 -lmax dx 2 +ex+8 lmax f read/write float computations 3D Laplace 1.E-04 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 Number of Sources access to box data These parameters depend on the hardware 13
14 Optimization Sparse M-V product Time Dense M-V product Time Time CPU: + = 0 s 0 s 0 s opt s Time Time Time GPU: + = 0 s1 0 0 s s s 1 s opt s Direct summation on GPU (final step in the FMM) Compare GPU final summation complexity: and total FMM complexity: Cost =A 1 N+B 1 N/s+C 1 Ns. Cost = AN+BN/s+CNs. Optimal cluster size for direct summation step of the FMM s opt = (B 1 /C 1 ) 1/2, and this can be only increased for the full algorithm, since its complexity Cost =(A+A 1 )N+(B+B 1 )N/s+C 1 Ns, and s opt = ((B+B 1 )/C 1 ) 1/2. 14
15 Time (s) 11/10/2011 Direct summation on GPU (final step in the FMM) Computations of potential, optimal settings for GPU 1.E+03 1.E+02 Sparse Matrix-Vector Product s max =320 y=bx 1.E+01 y=ax 2 CPU l max =4 y=cx 1.E E-01 1.E-02 2 GPU b/c=300 1.E-03 y=8 -lmax dx 2 +ex+8 lmax f 3D Laplace 1.E-04 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 Number of Sources Direct summation on GPU (final step in the FMM) Computations of potential, optimal settings for CPU and GPU N Serial CPU (s) GPU(s) Time Ratio E E E E E E E E E E E E E E E E E E
16 Other steps of the FMM Important conclusion: Since the optimal max level of the octree when using GPU is lesser than that for the CPU, the importance of optimization of translation subroutines diminishes. 16
17 Other steps of the FMM on GPU Accelerations in range 5-60; Effective accelerations for N=1,048,576 (taking into account max level reduction): Test results 17
18 Profile Accuracy Relative L 2 norm error measure: CPU single precision direct summation was taken as exact ; 100 sampling points were used. 18
19 L2-relative error 11/10/2011 What is more accurate for solution of large problems on GPU: direct summation or FMM? 1.E-02 1.E-03 1.E-04 1.E-05 Error in computations of potential p=4 FMM p=8 FMM FMM Error computed over a grid of 729 sampling points, relative to exact solution, which is direct summation with double precision. 1.E-06 p=12 Possible reason why GPU the GPU error in direct 1.E-07 Direct summation grows: CPU systematic roundoff 1.E-08 error in computation of function 1/sqrt(x). Filled = GPU, Empty = CPU (still a question). 1.E-09 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 Number of Sources Performance N=1,048,576 (potential only) serial CPU GPU Ratio p= s s 33 p= s s 56 p= s s 48 N=1,048,576 p=4 p=8 p=12 (potential+forces (gradient)) serial CPU GPU s s s s s s Ratio
20 Run Time (s) Run Time (s) Run Time (s) 11/10/2011 Performance p=4 p=8 p=12 1.E+02 3D Laplace y=ax 2 y=bx 2 y=cx 1.E+02 3D Laplace y=ax 2 y=bx 2 y=cx 1.E+02 3D Laplace y=ax 2 y=cx 1.E+01 1.E+00 CPU, Direct CPU, FMM y=dx 1.E+01 1.E+00 CPU, Direct CPU, FMM GPU, Direct y=dx 1.E+01 1.E+00 CPU, Direct CPU, FMM y=bx 2 y=dx GPU, FMM GPU, Direct 1.E-01 1.E-01 1.E-01 1.E-02 GPU, FMM a/b = 600 c/d = 30 1.E-02 GPU, FMM a/b = 600 c/d = 50 1.E-02 GPU, Direct a/b = 600 c/d = 50 p=4, FMM error ~ E-03 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 Number of Sources p=8, FMM error ~ E-03 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 Number of Sources p=12, FMM error ~ E-03 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 Number of Sources Performance FMM GPU Computations of the potential and forces: Peak performance of GPU for direct summation 290 Gigaflops, while for the FMM on GPU effective rates in range Teraflops are observed (following the citation below). dir CPU FMM M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda, T. Sterling & G.S. Winckelmans. Pentium Pro inside: I. a treecode at 430 Gigaflops on ASCI Red, Bell price winning paper at SC 97, direct 20
21 Heterogeneous Algorithm ODE solver: source receiver update GPU work Single node algorithm CPU work particle positions, source strength data structure (octree and neighbors) source S-expansions translation stencils local direct sum upward S S downward S R R R receiver R-expansions final sum of far-field and near-field interactions time loop 21
22 Advantages CPU and GPU are tasked with the most efficient jobs Faster translations: CPU code can be better optimized which requires complex translation stencil data structures Faster local direct sum: many cores on GPU; same kernel evaluation but on multiple data (SIMD) The CPU is not idle during the GPU-based computations High accuracy translation without much cost penalty on CPU Easy visualization: all data reside in GPU RAM Smaller data transfer between the CPU and GPU Single node tests 22
23 The billion size test case Using all 32 Chimera nodes Timing excludes the partition 1 billion runs potential kernel in 21.6 s Actual performance: TFs To compute without FMM for the same time: 1 EFs performance 23
FMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E)
FMM implementation on CPU and GPU Nail A. Gumerov (Lecture for CMSC 828E) Outline Two parts of the FMM Data Structure Flow Chart of the Run Algorithm FMM Cost/Optimization on CPU Programming on GPU Fast
More informationTerascale on the desktop: Fast Multipole Methods on Graphical Processors
Terascale on the desktop: Fast Multipole Methods on Graphical Processors Nail A. Gumerov Fantalgo, LLC Institute for Advanced Computer Studies University of Maryland (joint work with Ramani Duraiswami)
More informationFast Multipole and Related Algorithms
Fast Multipole and Related Algorithms Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani Joint work with Nail A. Gumerov Efficiency by exploiting symmetry and A general
More informationScientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics
Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics Ramani Duraiswami Computer Science & UMIACS University of Maryland, College Park Joint work with Nail
More informationGPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging
GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani With Nail A. Gumerov,
More informationEfficient O(N log N) algorithms for scattered data interpolation
Efficient O(N log N) algorithms for scattered data interpolation Nail Gumerov University of Maryland Institute for Advanced Computer Studies Joint work with Ramani Duraiswami February Fourier Talks 2007
More informationScalable Fast Multipole Methods on Distributed Heterogeneous Architectures
Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures Qi Hu huqi@cs.umd.edu Nail A. Gumerov gumerov@umiacs.umd.edu Ramani Duraiswami ramani@umiacs.umd.edu Institute for Advanced Computer
More informationUsing GPUs to compute the multilevel summation of electrostatic forces
Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationCenter for Computational Science
Center for Computational Science Toward GPU-accelerated meshfree fluids simulation using the fast multipole method Lorena A Barba Boston University Department of Mechanical Engineering with: Felipe Cruz,
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationAccelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX
Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX David Pfander*, Gregor Daiß*, Dominic Marcello**, Hartmut Kaiser**, Dirk Pflüger* * University of Stuttgart ** Louisiana State
More informationA Kernel-independent Adaptive Fast Multipole Method
A Kernel-independent Adaptive Fast Multipole Method Lexing Ying Caltech Joint work with George Biros and Denis Zorin Problem Statement Given G an elliptic PDE kernel, e.g. {x i } points in {φ i } charges
More informationFMM Data Structures. Content. Introduction Hierarchical Space Subdivision with 2 d -Trees Hierarchical Indexing System Parent & Children Finding
FMM Data Structures Nail Gumerov & Ramani Duraiswami UMIACS [gumerov][ramani]@umiacs.umd.edu CSCAMM FAM4: 4/9/4 Duraiswami & Gumerov, -4 Content Introduction Hierarchical Space Subdivision with d -Trees
More informationN-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo
N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationDevice Memories and Matrix Multiplication
Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the
More informationAccelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University
Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationIterative methods for use with the Fast Multipole Method
Iterative methods for use with the Fast Multipole Method Ramani Duraiswami Perceptual Interfaces and Reality Lab. Computer Science & UMIACS University of Maryland, College Park, MD Joint work with Nail
More informationAdaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics
More informationOn the limits of (and opportunities for?) GPU acceleration
On the limits of (and opportunities for?) GPU acceleration Aparna Chandramowlishwaran, Jee Choi, Kenneth Czechowski, Murat (Efe) Guney, Logan Moon, Aashay Shringarpure, Richard (Rich) Vuduc HotPar 10,
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationAccelerating Molecular Modeling Applications with Graphics Processors
Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference
More informationScalable Distributed Fast Multipole Methods
Scalable Distributed Fast Multipole Methods Qi Hu, Nail A. Gumerov, Ramani Duraiswami University of Maryland Institute for Advanced Computer Studies (UMIACS) Department of Computer Science, University
More informationGPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.
Basics of s Basics Introduction to Why vs CPU S. Sundar and Computing architecture August 9, 2014 1 / 70 Outline Basics of s Why vs CPU Computing architecture 1 2 3 of s 4 5 Why 6 vs CPU 7 Computing 8
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationAuto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters
Auto-Generation and Auto-Tuning of 3D Stencil s on GPU Clusters Yongpeng Zhang, Frank Mueller North Carolina State University CGO 2012 Outline Motivation DSL front-end and Benchmarks Framework Experimental
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationLecture 5. Performance programming for stencil methods Vectorization Computing with GPUs
Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,
More informationEffect of memory latency
CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable
More informationCS560 Lecture Parallel Architecture 1
Parallel Architecture Announcements The RamCT merge is done! Please repost introductions. Manaf s office hours HW0 is due tomorrow night, please try RamCT submission HW1 has been posted Today Isoefficiency
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationDi Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio
Di Zhao zhao.1029@osu.edu Ohio State University MVAPICH User Group (MUG) Meeting, August 26-27 2013, Columbus Ohio Nvidia Kepler K20X Intel Xeon Phi 7120 Launch Date November 2012 Q2 2013 Processor Per-processor
More informationHYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE
HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationStudy and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou
Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm
More informationSEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi
SEASHORE SARUMAN Summary 1 / 24 SEASHORE / SARUMAN Short Read Matching using GPU Programming Tobias Jakobi Center for Biotechnology (CeBiTec) Bioinformatics Resource Facility (BRF) Bielefeld University
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationX10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management
X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large
More informationAn Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos
An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm Martin Burtscher Department of Computer Science Texas State University-San Marcos Mapping Regular Code to GPUs Regular codes Operate on
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationParallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units
Parallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units Khor Shu Heng Engineering Science Programme National University of Singapore Abstract
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationAn Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture
An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS
More informationAdvanced CUDA Optimization 1. Introduction
Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines
More informationGPU Cluster Computing. Advanced Computing Center for Research and Education
GPU Cluster Computing Advanced Computing Center for Research and Education 1 What is GPU Computing? Gaming industry and high- defini3on graphics drove the development of fast graphics processing Use of
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationNumerical Simulation on the GPU
Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationFast Multipole Method on the GPU
Fast Multipole Method on the GPU with application to the Adaptive Vortex Method University of Bristol, Bristol, United Kingdom. 1 Introduction Particle methods Highly parallel Computational intensive Numerical
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationFMM CMSC 878R/AMSC 698R. Lecture 13
FMM CMSC 878R/AMSC 698R Lecture 13 Outline Results of the MLFMM tests Itemized Asymptotic Complexity of the MLFMM; Optimization of the Grouping (Clustering) Parameter; Regular mesh; Random distributions.
More informationTree-based methods on GPUs
Tree-based methods on GPUs Felipe Cruz 1 and Matthew Knepley 2,3 1 Department of Mathematics University of Bristol 2 Computation Institute University of Chicago 3 Department of Molecular Biology and Physiology
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationExperts in Application Acceleration Synective Labs AB
Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg
More informationWindowing System on a 3D Pipeline. February 2005
Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April
More informationAccelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies
Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationData parallel algorithms, algorithmic building blocks, precision vs. accuracy
Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationJ. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst
Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationLarge-scale Deep Unsupervised Learning using Graphics Processors
Large-scale Deep Unsupervised Learning using Graphics Processors Rajat Raina Anand Madhavan Andrew Y. Ng Stanford University Learning from unlabeled data Classify vs. car motorcycle Input space Unlabeled
More informationProgramming in CUDA. Malik M Khan
Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement
More informationKrishnan Suresh Associate Professor Mechanical Engineering
Large Scale FEA on the GPU Krishnan Suresh Associate Professor Mechanical Engineering High-Performance Trick Computations (i.e., 3.4*1.22): essentially free Memory access determines speed of code Pick
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationThe Fast Multipole Method on NVIDIA GPUs and Multicore Processors
The Fast Multipole Method on NVIDIA GPUs and Multicore Processors Toru Takahashi, a Cris Cecka, b Eric Darve c a b c Department of Mechanical Science and Engineering, Nagoya University Institute for Applied
More informationNvidia Tesla The Personal Supercomputer
International Journal of Allied Practice, Research and Review Website: www.ijaprr.com (ISSN 2350-1294) Nvidia Tesla The Personal Supercomputer Sameer Ahmad 1, Umer Amin 2, Mr. Zubair M Paul 3 1 Student,
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz
More information