Applications of Berkeley's Dwarfs on Nvidia GPUs
1 Applications of Berkeley's Dwarfs on Nvidia GPUs
Seminar: Topics in High-Performance and Scientific Computing
Team N2: Yang Zhang, Haiqing Wang
2 Overview
- CUDA
- The Dwarfs
  - Dynamic Programming
  - Sparse Linear Algebra
  - Unstructured Grids
  - Combinational Logic
  - Graphical Model
- Summary
3 CUDA
- Parallel computing platform and programming model for GPGPU
- Supports various languages, including C/C++ and Fortran
- Many libraries available (e.g. cuSPARSE, cuBLAS, NPP)
4 CUDA: Execution Model
- Each thread gets an ID
- A group of threads forms a block
- A group of blocks forms a grid
- Each thread is executed by a core
- Each block is executed by a streaming multiprocessor (SM)
- A block is further split into warps
- Blocks are independent of each other
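The thread, block, and grid hierarchy can be illustrated with a small host-side simulation. This is a minimal sketch in Python rather than CUDA. The function name `global_thread_ids` and the example sizes are assumptions for illustration. The index formula mirrors the usual CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`, and the warp size of 32 is the value used on Nvidia GPUs.

```python
WARP_SIZE = 32  # threads per warp on Nvidia GPUs


def global_thread_ids(grid_dim, block_dim):
    """Enumerate (block, thread, warp, global_id) for a 1-D grid of 1-D blocks."""
    ids = []
    for block_idx in range(grid_dim):        # blocks are independent of each other
        for thread_idx in range(block_dim):  # threads inside one block
            warp = thread_idx // WARP_SIZE   # a block is split into warps
            gid = block_idx * block_dim + thread_idx
            ids.append((block_idx, thread_idx, warp, gid))
    return ids


ids = global_thread_ids(grid_dim=2, block_dim=64)
print(ids[0])   # first thread of block 0
print(ids[-1])  # last thread of block 1
```

With two blocks of 64 threads, each block contains two warps and the global IDs run from 0 to 127 without gaps.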
5 CUDA: Memory Model
- Each thread has a private local memory
- Each block has a shared memory, which allows communication between its threads
- All threads can access the global memory
- Constant memory is read-only
7 Dynamic Programming [1]: Matrix Chain Product
Goal: minimize the total number of multiplications
An example:
- ((A1 A2 A3 A4) (A5 A6)): 2*9*3 + 2*3*1 + 2*1*4 + 4*11*5 + 2*4*5 = 328
- (A1 (A2 A3) (A4 A5) A6): 9*3*1 + 2*9*1 + 1*4*11 + 2*1*11 + 2*11*5 = 221
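The two totals can be checked mechanically. The sketch below is in Python. The dimension list `p` is not stated on the slide; it is inferred from the products (A_i has size p[i-1] x p[i]). The nested-pair encoding of a parenthesization is a hypothetical convention chosen for illustration, and groups written without inner parentheses on the slide are read left to right.

```python
# Dimensions inferred from the slide's products: A_i is p[i-1] x p[i].
p = [2, 9, 3, 1, 4, 11, 5]


def chain_cost(expr):
    """Return (rows, cols, multiplications) for a fully parenthesized product.

    expr is either an int i (the matrix A_i) or a pair (left, right).
    """
    if isinstance(expr, int):
        return p[expr - 1], p[expr], 0
    left, right = expr
    r1, c1, cost1 = chain_cost(left)
    r2, c2, cost2 = chain_cost(right)
    assert c1 == r2, "inner dimensions must agree"
    return r1, c2, cost1 + cost2 + r1 * c1 * c2


first = ((((1, 2), 3), 4), (5, 6))   # (((A1 A2) A3) A4) (A5 A6)
second = (((1, (2, 3)), (4, 5)), 6)  # ((A1 (A2 A3)) (A4 A5)) A6
print(chain_cost(first)[2])   # 328
print(chain_cost(second)[2])  # 221
```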
8 Dynamic Programming [1]: Algorithm
9 Dynamic Programming [1]: Algorithm (n=6)
[Figure: table m and table s]
10 Dynamic Programming [1]: Implementation
Table m (n=8):
- Entries on the same diagonal are computed independently of each other
- They can therefore be computed in parallel
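The independence of entries on one diagonal can be seen in a sequential sketch of the dynamic program. The Python code below is an illustration, not the kernels from [1]. It fills table m one chain length l at a time, and every entry for a given l reads only entries of shorter chains, so the loop over i is the part a GPU could run concurrently. The dimension list `p` is inferred from the example products on the Matrix Chain Product slide, and the function name is an assumption.

```python
def matrix_chain_order(p):
    """Fill tables m (minimum cost) and s (best split) for A_1..A_n.

    A_i has size p[i-1] x p[i]; both tables are indexed from 1.
    """
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]  # m[i][i] = 0
    s = [[0] * (n + 1) for _ in range(n + 1)]
    for l in range(2, n + 1):                  # chain length = one diagonal of m
        # All (i, j) on this diagonal depend only on shorter chains,
        # so these iterations are independent of each other.
        for i in range(1, n - l + 2):
            j = i + l - 1
            m[i][j], s[i][j] = min(
                (m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j], k)
                for k in range(i, j)
            )
    return m, s


p = [2, 9, 3, 1, 4, 11, 5]
m, s = matrix_chain_order(p)
print(m[1][6], s[1][6])  # 154 3
```

For these dimensions the minimum is 154 multiplications, obtained by splitting after A3, which is lower than both example parenthesizations.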
11 Dynamic Programming [1]: Implementation
The amount of computation for each l depends on:
- the number of (i,j) for each l
- the number of k for each (i,j) of each l
The performance therefore depends on various factors, so three different kernels are used:
- OneThreadPerOneEntry
- OneBlockPerOneEntry
- BlocksPerOneEntry
12 Dynamic Programming [1]: Implementation
OneThreadPerOneEntry
- Allocates one thread to compute one entry, e.g. m_{1,5}, m_{2,6}, m_{3,7}, m_{4,8}
- Each entry is computed concurrently
- All use previously computed entries
- The memory mapping is changed
[Figure: memory mapping direction]
13 Dynamic Programming [1]: Implementation
OneThreadPerOneEntry
- Allocates one thread to compute one entry, e.g. m_{1,5}, m_{2,6}, m_{3,7}, m_{4,8}
- Each entry is computed concurrently by one core
- All use previous entries in shared memory
- Results are stored in global memory after computing
[Figures: entries stored in global memory; CUDA architecture]
14 Dynamic Programming [1]: Implementation
OneBlockPerOneEntry
- Allocates one block to compute one entry
- e.g. m_{1,5} = min_{1 ≤ k < 5} (m_{1,k} + m_{k+1,5} + p_0 p_k p_5) is computed by one streaming multiprocessor
- Each term (m_{1,k} + m_{k+1,5} + p_0 p_k p_5) is computed by one core
- Another core is used for the selection of the minimum
[Figure: CUDA architecture]
15 Dynamic Programming [1]: Implementation
BlocksPerOneEntry
- Allocates multiple blocks to compute one entry
- e.g. m_{1,5} = min_{1 ≤ k < 5} (m_{1,k} + m_{k+1,5} + p_0 p_k p_5) is computed by a few streaming multiprocessors
- Each term (m_{1,k} + m_{k+1,5} + p_0 p_k p_5) is computed by one core, possibly on different streaming multiprocessors
- Another core on any streaming multiprocessor is used for the selection of the minimum
[Figure: CUDA architecture]
16 Dynamic Programming [1]: Evaluation
GPU: Nvidia GeForce GTX 480 with 480 processing cores (15 streaming multiprocessors, each with 32 processing cores), 1.4 GHz, 3 GB memory
[Figure: total time of each kernel for different numbers of threads and blocks (n = 16384)]
17 Dynamic Programming [1]: Evaluation
GPU: Nvidia GeForce GTX 480 with 480 processing cores (15 streaming multiprocessors, each with 32 processing cores), 1.4 GHz, 3 GB memory
[Figures: fastest kernel for different l; running time of each kernel (OneThreadPerOneEntry, OneBlockPerOneEntry, BlocksPerOneEntry) as a function of l]
18 Dynamic Programming [1]: Evaluation GPU vs. CPU
- GPU: Nvidia GeForce GTX 480 with 480 processing cores (15 streaming multiprocessors, each with 32 processing cores), 1.4 GHz, 3 GB memory (combination of the three kernels, using the fastest kernel for each l)
- CPU: Intel Core i7 870, 2.93 GHz, 8 GB memory (sequential program in C)
- Total computing time for n =
- The speedup factor is unfair (the CPU baseline is a sequential program)
20 Sparse Linear Algebra [2]
- Goal: accelerate the sparse matrix-matrix (SpMM) product on the GPU
- SpMM product: compute C = AB, where A is a sparse matrix and B is a dense matrix
21 Sparse Linear Algebra [2]: FastSpMM
- Approach: FastSpMM, an extension of the ELLR-T kernel
- Relies on the ELLPACK-R storage format
- Outperforms common libraries for SpMM (e.g. cuSPARSE)
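To make the storage format concrete, here is a minimal sketch in plain Python of ELLPACK-R and an SpMM product built on it. ELLPACK-R stores, for every row, its nonzero values and their column indices padded to a common width, plus an array holding the actual length of each row. The function names and the row-major list layout are assumptions for illustration (a GPU implementation would typically lay the arrays out column-wise for coalesced access). This is not the FastSpMM kernel from [2].

```python
def to_ellpack_r(dense):
    """Convert a dense row-major matrix to ELLPACK-R arrays (values, cols, row_len)."""
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    row_len = [len(r) for r in rows]
    width = max(row_len, default=0)
    # Pad every row to the same width; padding entries are never read.
    values = [[v for _, v in r] + [0] * (width - len(r)) for r in rows]
    cols = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    return values, cols, row_len


def spmm_ellpack_r(values, cols, row_len, B):
    """Compute C = A B with A in ELLPACK-R form and B dense."""
    n_cols_b = len(B[0])
    C = [[0] * n_cols_b for _ in values]
    for i in range(len(values)):      # rows of A are independent of each other
        for t in range(row_len[i]):   # row_len lets the loop skip the padding
            a, k = values[i][t], cols[i][t]
            for j in range(n_cols_b):
                C[i][j] += a * B[k][j]
    return C


A = [[1, 0, 2],
     [0, 0, 3],
     [4, 5, 6]]
B = [[1, 2],
     [3, 4],
     [5, 6]]
values, cols, row_len = to_ellpack_r(A)
C = spmm_ellpack_r(values, cols, row_len, B)
print(row_len)  # [2, 1, 3]
print(C)        # [[11, 14], [15, 18], [49, 64]]
```

The row-length array is what distinguishes ELLPACK-R from plain ELLPACK: each row stops after its own nonzeros instead of iterating over the padded width.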
22 Sparse Linear Algebra [2]: Evaluation SpMM
- Three versions of SpMM routines evaluated on two Nvidia GPUs (GTX480 and Tesla C2050): FastSpMM vs. ELLR-T (ELLPACK-R storage format) vs. cuSPARSE (CRS storage format)
- Test matrices: NxN sparse matrices
23 Sparse Linear Algebra [2]: Evaluation GPU vs. CPU
- GTX480 and Tesla C2050 using FastSpMM vs. Intel Xeon E5640 with 4 cores using the MKL library
- [Table: runtimes (in seconds) on test matrices]
- Speedups compared to CPU: GTX480: 2,8 6,2; Tesla C2050: 1,7 3,
25 Unstructured Grids [3]: Compressible Flows
- Simulation of compressible flows on 3-D unstructured grids
- Compressible flow: the branch of fluid mechanics that deals with flows having significant changes in fluid density
- An example: subsonic flow past a sphere
26 Unstructured Grids [3]: DG Method
- Discontinuous Galerkin (DG) methods: a class of numerical methods for solving differential equations
- The DG method can be implemented in parallel
- An example: subsonic flow past a sphere
27 Unstructured Grids [3]: Evaluation GPU vs. CPU
- GPU: NVIDIA Tesla K20c with 2496 CUDA cores (OpenACC-based program)
- CPU: AMD Opteron 6128 containing 16 cores (MPI-based parallel program)
- [Table: timing measurements for subsonic flow past a sphere; Nelem: number of elements, Ntime: number of time steps]
29 Combinational Logic [4]: Parallel AES
- Goal: efficient encryption/decryption of data streams in web server applications
- Approach: design of a parallel AES on the GPU
- Two design choices:
  - Fine-grained: focus on thread-level parallelism; requires a lot of communication and synchronization
  - Coarse-grained: focus on higher-level parallelism, i.e. blocks
30 Combinational Logic [4]: Evaluation
Comparison: fine-grained vs. coarse-grained on an Nvidia 8800 GT (112 cores)
31 Combinational Logic [4]: Evaluation GPU vs. CPU
- Throughput (in Mbps) comparisons on two Nvidia GPUs and two high-end CPUs (as of 2009)
- CPU implementation from the OpenSSL toolkit
33 Graphical Model [5]: Speech Recognition System
- ANN: Artificial Neural Network; HMM: Hidden Markov Model
- ANN model: recognizes the acoustic content of a time frame (a word or a phoneme)
- HMM model: warps and adjusts the whole acoustic signal, combining the words or phonemes from the ANN
34 Graphical Model [5]: ANN Training
- Input: a vector representing the acoustics in a time frame
- Output: a vector representing the most probable corresponding word or phoneme
- Hidden vector = Input vector × weight vector 1 (inner product)
- Output vector = Hidden vector × weight vector 2 (inner product)
- Training is the process of adjusting weight vector 1 and weight vector 2
35 Graphical Model [5]: Block ANN Training
- Input: a matrix made up of many input vectors
- Output: a matrix made up of many output vectors
- Hidden matrix = Input matrix × weight vector 1
- Output matrix = Hidden matrix × weight vector 2
- Training can be solved by linear algebra
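The step from per-vector products to one matrix product can be sketched as follows. This is plain Python under stated assumptions: the weights are treated as matrices `W1` and `W2`, activation functions are omitted because the slides do not mention them, and all names and numbers are illustrative. The sketch only shows that stacking input vectors into a matrix gives the same outputs as processing them one by one, which is what lets a BLAS matrix-matrix routine do the work.

```python
def matmul(X, W):
    """Dense matrix product of two row-major lists of lists."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]


def forward_one(x, W1, W2):
    """Per-vector forward pass: two vector-matrix products."""
    hidden = matmul([x], W1)
    return matmul(hidden, W2)[0]


def forward_block(X, W1, W2):
    """Block forward pass: two matrix-matrix products over all inputs at once."""
    return matmul(matmul(X, W1), W2)


W1 = [[1, 0, 2], [0, 1, 1]]    # 2 inputs -> 3 hidden units
W2 = [[1, 1], [2, 0], [0, 3]]  # 3 hidden units -> 2 outputs
X = [[1, 2], [3, 4], [5, 6]]   # three input vectors stacked as rows

block = forward_block(X, W1, W2)
single = [forward_one(x, W1, W2) for x in X]
print(block == single)  # True
```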
36 Graphical Model [5]: Evaluation GPU vs. CPU
- GPU: 1600 MHz FSB, 8 GB RAM, NVIDIA GTX280 GPU (cuBLAS library)
- CPU: a quad-core 3.0 GHz CPU (Intel MKL library)
- [Table: training time and relative speed-up for the WSJ0 corpus]
- a speedup factor of
37 Summary
What is it good for?
- Provides extremely high parallelism
- Accelerates scientific computations by a considerable factor
- Reduces CPU workload
- Achieves high performance at low cost
Learning curve?
- Rather smooth, since languages like C/C++ are supported
- But: precise knowledge of the hardware architecture is necessary
Easy to implement?
- Example: given a scalar α and two vectors x and y, the operation x = αx + y
- Fairly easy: basically a C implementation with some added keywords and CPU/GPU memory management
Disclaimer: some comparisons to the CPU are not really representative or not clearly specified
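The operation named on this slide is the classic AXPY update. As a reference for what a GPU kernel would compute, here is a minimal sequential sketch in Python; the function name and sample values are assumptions. In a CUDA version each loop iteration would become one thread, with the element index derived from the block and thread IDs.

```python
def axpy_inplace(alpha, x, y):
    """Overwrite x with alpha * x + y, element by element."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):  # iterations are independent: one GPU thread each
        x[i] = alpha * x[i] + y[i]
    return x


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
print(axpy_inplace(2.0, x, y))  # [12.0, 24.0, 36.0]
```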
38 References 1
[1] K. Nishida, Y. Ito, K. Nakano. Accelerating the Dynamic Programming for the Matrix Chain Product on the GPU. Networking and Computing (ICNC), 2011 Second International Conference on, pp , Nov Dec
[2] F. Vazquez, G. Ortega, J. J. Fernandez, I. Garcia and E. M. Garzon. Fast sparse matrix matrix product based on ELLR-T and GPU computing. Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on, pp , July
[3] Y. Xia, H. Luo, L. Luo, J. Edwards, J. Lou and F. Mueller. OpenACC-based GPU Acceleration of a 3-D Unstructured Discontinuous Galerkin Method. 52nd Aerospace Sciences Meeting. January
39 References 2
[4] A. di Biagio, A. Barenghi, G. Agosta, G. Pelosi. Design of a Parallel AES for Graphics Hardware using the CUDA framework. Parallel & Distributed Processing, IPDPS IEEE International Symposium on, pp. 1-8, May
[5] S. Scanzio, S. Cumani, R. Gemello, F. Mana, P. Laface. Parallel implementation of artificial neural network training. Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pp , March 2010
40 Credits
- Yang Zhang: CUDA, Sparse Linear Algebra, Combinational Logic, Summary
- Haiqing Wang: Dynamic Programming (in detail), Unstructured Grids, Graphical Model
Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference
More informationFast Tridiagonal Solvers on GPU
Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based
More informationCUDA. Matthew Joyner, Jeremy Williams
CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel
More informationSpeed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU
Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Ke Ma 1, and Yao Song 2 1 Department of Computer Sciences 2 Department of Electrical and Computer Engineering University of Wisconsin-Madison
More informationParallelising Pipelined Wavefront Computations on the GPU
Parallelising Pipelined Wavefront Computations on the GPU S.J. Pennycook G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group Department of Computer Science University of Warwick
More informationA GPU Implementation of Dynamic Programming for the Optimal Polygon Triangulation
2596 IEICE TRANS. INF. & SYST., VOL.E96 D, NO.12 DECEMBER 2013 PAPER Special Section on Parallel and Distributed Computing and Networking A GPU Implementation of Dynamic Programming for the Optimal Polygon
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationSampling Using GPU Accelerated Sparse Hierarchical Models
Sampling Using GPU Accelerated Sparse Hierarchical Models Miroslav Stoyanov Oak Ridge National Laboratory supported by Exascale Computing Project (ECP) exascaleproject.org April 9, 28 Miroslav Stoyanov
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationAccelerating Linpack Performance with Mixed Precision Algorithm on CPU+GPGPU Heterogeneous Cluster
th IEEE International Conference on Computer and Information Technology (CIT ) Accelerating Linpack Performance with Mixed Precision Algorithm on CPU+GPGPU Heterogeneous Cluster WANG Lei ZHANG Yunquan
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationCUDA 6.0 Performance Report. April 2014
CUDA 6. Performance Report April 214 1 CUDA 6 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random
More informationCSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591: GPU Programming Programmer Interface Klaus Mueller Computer Science Department Stony Brook University Compute Levels Encodes the hardware capability of a GPU card newer cards have higher compute
More informationProgrammable Graphics Hardware (GPU) A Primer
Programmable Graphics Hardware (GPU) A Primer Klaus Mueller Stony Brook University Computer Science Department Parallel Computing Explained video Parallel Computing Explained Any questions? Parallelism
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationCURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS
CURRENT STATUS OF THE PROJECT TO ENABLE GAUSSIAN 09 ON GPGPUS Roberto Gomperts (NVIDIA, Corp.) Michael Frisch (Gaussian, Inc.) Giovanni Scalmani (Gaussian, Inc.) Brent Leback (PGI) TOPICS Gaussian Design
More information