A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu, University of Illinois
1 A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu, University of Illinois
2 How to Build a gtsv for CUSPARSE 2013: A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu, University of Illinois
3 Material in this Session. This talk is based on our SC'12 paper: Chang, Li-Wen; Stratton, John; Kim, Hee-Seok; Hwu, Wen-mei, "A Scalable, Numerically Stable, High-performance Tridiagonal Solver using GPUs," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 (SC'12). But it contains more: details not shown in the paper due to the page limit, and the extension developed with the NVIDIA CUSPARSE team.
4 Comparison among Tridiagonal Solvers

    Solver                  Numerical Stability
    Matlab (backslash)      Yes
    Intel MKL (gtsv)        Yes
    Intel SPIKE             Yes
    CUSPARSE gtsv (2012)    No
    Our gtsv                Yes
    Our heterogeneous gtsv  Yes

5 Comparison among Tridiagonal Solvers

    Solver                  Numerical Stability  CPU Performance
    Matlab (backslash)      Yes                  Poor
    Intel MKL (gtsv)        Yes                  Good
    Intel SPIKE             Yes                  Good
    CUSPARSE gtsv (2012)    No                   Not supported
    Our gtsv                Yes                  Not supported
    Our heterogeneous gtsv  Yes                  Good

6 Comparison among Tridiagonal Solvers

    Solver                  Numerical Stability  CPU Performance  GPU Performance
    Matlab (backslash)      Yes                  Poor             Not supported
    Intel MKL (gtsv)        Yes                  Good             Not supported
    Intel SPIKE             Yes                  Good             Not supported
    CUSPARSE gtsv (2012)    No                   Not supported    Good
    Our gtsv                Yes                  Not supported    Good
    Our heterogeneous gtsv  Yes                  Good             Good

7 Comparison among Tridiagonal Solvers

    Solver                  Numerical Stability  CPU Performance  GPU Performance  Cluster Scalability
    Matlab (backslash)      Yes                  Poor             Not supported    Not supported
    Intel MKL (gtsv)        Yes                  Good             Not supported    Not supported
    Intel SPIKE             Yes                  Good             Not supported    Supported
    CUSPARSE gtsv (2012)    No                   Not supported    Good             Not supported
    Our gtsv                Yes                  Not supported    Good             Supported
    Our heterogeneous gtsv  Yes                  Good             Good             Supported
8 Numerical Stability on GPUs. All previous related works for GPUs used unstable algorithms, like the Thomas algorithm, cyclic reduction (CR), or parallel cyclic reduction (PCR), with no pivoting. Why is pivoting important?
9 CUSPARSE gtsv (2012): CR (+ PCR). But when the b_i's are 0s, the reduction breaks down, since each CR step divides by diagonal entries. [Slide shows the tridiagonal entries a_i, b_i, c_i and right-hand sides e_i across successive CR reduction steps; the matrix illustrations are omitted.]
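The breakdown on zero diagonal entries can be seen in a single cyclic-reduction step. The sketch below is a hypothetical minimal version of one CR elimination (not CUSPARSE's implementation); `cr_eliminate_row` is an illustrative name, and the arrays follow the convention that `a[i]` multiplies x[i-1] and `c[i]` multiplies x[i+1]:

```python
def cr_eliminate_row(a, b, c, d, i):
    """One cyclic-reduction step for interior row i: eliminate rows i-1 and i+1.

    Returns the row's coefficients (a', b', c', d') in the half-sized
    reduced system. Note the divisions by b[i-1] and b[i+1]: when a
    neighbouring diagonal entry is zero, CR breaks down, which is the
    instability the slide refers to.
    """
    alpha = a[i] / b[i - 1]   # raises ZeroDivisionError if b[i-1] == 0
    beta = c[i] / b[i + 1]    # raises ZeroDivisionError if b[i+1] == 0
    return (-alpha * a[i - 1],
            b[i] - alpha * c[i - 1] - beta * a[i + 1],
            -beta * c[i + 1],
            d[i] - alpha * d[i - 1] - beta * d[i + 1])
```

On a diagonally dominant row the step is harmless, but a single zero on a neighbouring diagonal kills it, with no pivot available to recover.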
10 Why Numerical Stability is Difficult on GPUs. Why didn't people apply pivoting on GPUs? They worried about performance: pivoting does not seem to fit the GPU. Pivoting may serialize computation, and pivoting requires data-dependent control flow, while GPUs like regular computation and regular memory accesses; branch divergence may hurt performance.
11 Our gtsv. For parallelization, the SPIKE algorithm is applied to decompose the problem, and an optimization technique, data layout transformation, is applied to achieve high memory efficiency. For the data-dependent control flow, diagonal pivoting is chosen, and an optimization technique, dynamic tiling, is proposed to achieve high memory efficiency.
12 Part 1: SPIKE Algorithm. The SPIKE algorithm decomposes a tridiagonal matrix A into several blocks.
13 SPIKE Algorithm. D and S can be defined so that A = DS. AX = F can then be solved by solving DY = F and SX = Y.
14 A Small Example. [Slide shows a small tridiagonal matrix A partitioned into diagonal blocks A1, A2 and coupling blocks B, C, with right-hand side F. David Kirk/NVIDIA and Wen-mei W. Hwu.]
15 A Small Example. [Slide shows the corresponding spike matrix S: identity blocks plus the spike columns w and v that couple the blocks; matrix illustration omitted.]
16 SPIKE Algorithm. How to build S? Solve DY = F by solving several independent tridiagonal matrices, the A_i's.
17 SPIKE Algorithm. How to solve SX = Y? Solve the collection of the first and last rows of all blocks, a reduction* (problem size: 4L -> 6 on the slide), then recover the remaining unknowns by backward substitution. [Slide shows the spike entries v_1L, w_2, w_2L, v_2, v_2L, w_3, v_3, w_3L, v_3L, w_4 in the reduced system; illustration omitted.] *E. Polizzi and A. H. Sameh, "A parallel hybrid banded system solver: the SPIKE algorithm," Parallel Computing, vol. 32, no. 2, pp. 177-194, 2006.
18 Part 2: Diagonal Pivoting for Tridiagonal Matrices. How to solve each block A_i in a numerically stable way? Diagonal pivoting*: each A_i can be solved sequentially by one thread. Why diagonal pivoting? It has better data-dependent control flow, which we can handle on GPUs. *J. B. Erway, R. F. Marcia, and J. Tyson, "Generalized diagonal pivoting methods for tridiagonal systems without interchanges," IAENG International Journal of Applied Mathematics, vol. 40, no. 4, 2010.
19 Diagonal Pivoting. A tridiagonal matrix A can be decomposed into LBM^T instead of LDU. L and M are unit lower triangular matrices; B is a block diagonal matrix with 1-by-1 or 2-by-2 blocks. The criterion for choosing 1-by-1 or 2-by-2 blocks is asymmetric Bunch-Kaufman pivoting.
20 LBM^T Decomposition. B_d is a 1-by-1 or 2-by-2 block. A_s is also a tridiagonal matrix: A_s is updated by modifying the leading elements of T_22, and A_s can be decomposed recursively. For a 2-by-2 pivot, the determinant is Delta = b1*b2 - a2*c1.
21 Diagonal Pivoting. A can be solved by solving L, B, and then M^T. It has data-dependent control flow, since B contains 1-by-1 or 2-by-2 blocks. It is better than other pivoting schemes because it only accesses nearby rows. It requires dynamic tiling to perform efficiently on GPUs.
22 More Optimization. L and M^T are not stored; they are computed on the fly. We store only the pivot conditions and the leading elements of B. For d = 1 (a 1-by-1 pivot b1), the first step stores b1 and the multipliers c1/b1 and a2/b1, leaving a smaller L2 B2 M2^T factorization. For d = 2 (a 2-by-2 pivot on b1, c1, a2, b2 with determinant Delta = b1*b2 - a2*c1), the corresponding multipliers are stored, again leaving L2 B2 M2^T. [Slide shows both factorizations on the entries b1..b4, a2..a4, c1..c3; the matrix illustrations are omitted.]
23 An Example. [Slide walks through a small numeric matrix A, showing what is actually stored: the leading elements of B (e.g. the value 2.5) and the pivot condition (d = 1 or 2) at each step; the numeric illustration is garbled in this transcription and omitted.]
24 Pivoting Criteria. Bunch-Kaufman algorithm adapted for unsymmetric cases: κ = (√5 - 1)/2, σ = max(|c1|, |a2|, |b2|, |c2|, |a3|). If |b1|·σ ≥ κ·|c1|·|a2|, do 1-by-1 pivoting; else do 2-by-2 pivoting.
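The pivot decision above is a few comparisons on nearby entries, which is exactly the "mild" data-dependent control flow the talk argues a GPU thread can afford. A minimal sketch, assuming the criterion as reconstructed on this slide (following Erway et al.); `pivot_size` is an illustrative name:

```python
from math import sqrt

# kappa = (sqrt(5) - 1) / 2, the constant from the slide's criterion.
KAPPA = (sqrt(5.0) - 1.0) / 2.0

def pivot_size(b1, c1, a2, b2, c2, a3):
    """Return the size (1 or 2) of the leading diagonal-pivot block.

    sigma is the largest magnitude among the entries adjacent to b1;
    a 1-by-1 pivot on b1 is accepted only when b1 is large enough
    relative to the off-diagonal product c1*a2.
    """
    sigma = max(abs(c1), abs(a2), abs(b2), abs(c2), abs(a3))
    if abs(b1) * sigma >= KAPPA * abs(c1) * abs(a2):
        return 1   # 1-by-1 pivot on b1
    return 2       # 2-by-2 pivot on [[b1, c1], [a2, b2]]
```

Note that b1 = 0 always fails the test, so a zero diagonal entry forces a 2-by-2 pivot; that is why the zero-diagonal matrices in the later benchmark slides always take 2-by-2 pivots.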
25 Our gtsv Algorithm. Solving each A_i dominates the runtime, using diagonal pivoting: one A_i is solved sequentially, and all A_i's are solved in parallel. This requires data layout transformation to perform efficiently on GPUs.
26 Data Layout. Observation: the GPU requires stride-one memory accesses to fully utilize memory bandwidth. Contradiction: consecutive elements of a diagonal are stored in consecutive memory in the gtsv interface, yet each block is processed by one thread. Solution: data layout transformation.
27 Data Layout Transformation: local transpose. The b_i's are elements of a diagonal, arranged as 16 4-element blocks (the blocks in SPIKE). [Slide shows the address mapping before and after the local transpose; illustration omitted.]
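The local transpose can be expressed as a pure index permutation. A hypothetical sketch, where `L` (elements per block, 4 on the slide) and `G` (blocks per tile, i.e. threads that should coalesce) are illustrative parameters:

```python
def local_transpose(diag, L, G):
    """Reorder one diagonal, tile by tile, for coalesced access.

    Before: thread i owns the contiguous run diag[i*L : (i+1)*L].
    After:  within each tile of G blocks, element j of block i sits at
            position j*G + i, so at step j the G threads of a tile
            touch G consecutive addresses (stride-one access).
    """
    out = []
    for t in range(0, len(diag), L * G):
        tile = diag[t:t + L * G]
        out.extend(tile[i * L + j] for j in range(L) for i in range(G))
    return out
```

The same permutation is applied to all three diagonals and the right-hand side, which is the "data marshaling overhead" measured on the next slide.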
28 Data Layout Transformation: runtime (ms). The proposed layout is 4-5x faster than the old layout (the gtsv interface), including the data marshaling overhead. Matrix types: random (1-by-1 or 2-by-2 pivoting), diagonally dominant (always 1-by-1 pivoting), zero diagonal (always 2-by-2 pivoting). [Chart omitted.]
29 Dynamic Tiling. Observation: memory accesses with a compact footprint are handled well by the L1 cache, even when branch divergence exists, while a scattered footprint dramatically reduces memory efficiency. Solution: insert barriers to regularize the memory access footprint. [Slide shows threads T1-T4 with compact vs. scattered footprints over the address space; illustration omitted.]
30 Dynamic Tiling. [Slide shows threads T1-T4 progressing at different rates, the estimated tiling boundaries, and the real barriers inserted at those boundaries; illustration omitted.]
31 Dynamic Tiling: runtime (ms). Dynamic tiling (with the data layout) further speeds up the data-layout-only version on the random and zero-diagonal matrices. [Chart omitted.]
32 Dynamic Tiling: performance counters (%). Global memory load efficiency, global memory store efficiency, L1 hit rate, and warp execution efficiency, with and without tiling, for the random, diagonally dominant, and zero-diagonal matrices. Tiling substantially improves the memory counters (the slide marks roughly 1.8x and 3x gains), while warp execution efficiency stays low because of branch divergence. [Chart omitted.]
33 Final Evaluation. Three kinds of evaluation: numerical stability (a backward analysis, relative backward error ||Ax - b|| / ||b||, on 16 selected types of matrices*); one-GPU performance; and cluster scalability (multiple GPUs, and multiple GPUs + multiple CPUs). *J. B. Erway, R. F. Marcia, and J. Tyson, "Generalized diagonal pivoting methods for tridiagonal systems without interchanges," IAENG International Journal of Applied Mathematics, vol. 40, no. 4, 2010.
34 Numerical Stability: relative backward error for the 16 matrix types, comparing our gtsv, our dtsvb, CUSPARSE, MKL, Intel SPIKE, and Matlab. Our gtsv stays comparable to the stable CPU solvers on every type, while CUSPARSE gtsv loses accuracy badly on several ill-conditioned types, with errors growing by many orders of magnitude up to NaN and Inf on the last two types, where some CPU solvers also fail. [The table's individual values are garbled in this transcription and omitted.]
35 GPU Performance. Runtime of solving an 8M-sized matrix (ms): our dgtsv (GPU), our ddtsvb (GPU), CUSPARSE dgtsv (GPU), data transfer (pageable), data transfer (pinned), and MKL dgtsv (sequential, CPU), on random and diagonally dominant matrices. [Chart omitted.]
36 Our Heterogeneous gtsv. SPIKE algorithm: OpenMP for multicore within one node, CUDA streams for multiple GPUs, MPI for multiple nodes. MKL gtsv for the CPUs and our gtsv for the GPUs.
37 Cluster Scalability (GPUs). Strong scaling runtime (ms, log scale) of our gtsv and our gtsv with predistributed data, on 1, 2, 4, 8, and 16 GPUs. [Chart omitted.]
38 Cluster Scalability (GPUs). Weak scaling runtime (ms, log scale) of our gtsv and our gtsv with predistributed data: 1 GPU with a 2M-sized matrix, 2 GPUs with 4M, 4 GPUs with 8M, 8 GPUs with 16M, 16 GPUs with 32M. [Chart omitted.]
39 Cluster Scalability (GPUs+CPUs). Strong scaling, strong scaling with predistributed data, and weak scaling. [Charts omitted.]
40 Short Summary

    Solver                  Numerical Stability  CPU Performance  GPU Performance  Cluster Scalability
    Matlab (backslash)      Yes                  Poor             Not supported    Not supported
    Intel MKL (gtsv)        Yes                  Good             Not supported    Not supported
    Intel SPIKE             Yes                  Good             Not supported    Supported
    CUSPARSE gtsv (2012)    No                   Not supported    Good             Not supported
    Our gtsv                Yes                  Not supported    Good             Supported
    Our heterogeneous gtsv  Yes                  Good             Good             Supported
41 More Features of Our gtsv. Supports 4 data types (in CUSPARSE 2013): float (S), double (D), complex (C), double complex (Z). Supports arbitrary sizes. Supports multiple right-hand-side vectors. Supports both general matrices (gtsv) and diagonally dominant matrices (dtsvb).
42 More Details. 4 data types: CUSPARSE built-in operators. dtsvb: SPIKE + the Thomas algorithm. Arbitrary sizes: padding. Pad 1s onto the main diagonal, and 0s onto the lower and upper diagonals.
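The padding trick amounts to appending decoupled rows "1 * x_i = 0" until the size reaches a multiple of the internal block size. A minimal sketch, with `pad_system` and `block` as illustrative names:

```python
def pad_system(a, b, c, d, block):
    """Pad an n-sized tridiagonal system (a: sub-, b: main, c: super-
    diagonal, d: RHS) up to a multiple of `block`.

    The appended rows have 1 on the main diagonal and 0 on the
    off-diagonals and RHS, so they are fully decoupled from the real
    rows (c[n-1] is already 0 at the boundary) and the first n unknowns
    of the padded solution equal the original solution.
    """
    n = len(b)
    pad = (-n) % block   # elements needed to reach the next multiple
    return (a + [0.0] * pad,
            b + [1.0] * pad,
            c + [0.0] * pad,
            d + [0.0] * pad)
```

The padded unknowns all solve to 0 and can simply be dropped from the result.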
43 More Details. Multiple right-hand-side vectors: the Y_i's have multiple columns, but the W_i's and V_i's only have one column.
44 More Details. Solve the V_i's, the W_i's, and the first column of the Y_i's while building L, B, and M^T. Then solve the remaining columns of the Y_i's using the pre-built L, B, and M^T.
45 Summary. The first numerically stable tridiagonal solver for GPUs: comparable numerical stability with Intel MKL, comparable speed with NVIDIA CUSPARSE 2012, and support for large matrices. In CUSPARSE gtsv 2013, cluster support is removed. Source code for a prototype is available with a BSD-like license.
46 Something We Forgot. What about the batch version? The batch version means multiple matrices of the same size. Currently, you can simply merge them into one large matrix; this even works for multiple matrices of different sizes.
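The merging trick works because the boundary off-diagonal entries of each system are zero, so stacking the diagonals yields one large tridiagonal matrix that is block-decoupled: a single gtsv call then solves every system at once, regardless of their individual sizes. A hypothetical sketch (`merge_batch` is an illustrative name):

```python
def merge_batch(systems):
    """Merge tridiagonal systems into one large block-decoupled system.

    systems: list of (a, b, c, d) tuples per system, with the usual
    boundary convention a[0] == 0 and c[-1] == 0. Concatenating the
    diagonals keeps those zeros as the coupling entries between
    consecutive systems, so the merged matrix stays block diagonal.
    """
    A, B, C, D = [], [], [], []
    for a, b, c, d in systems:
        A += a
        B += b
        C += c
        D += d
    return A, B, C, D
```

After one solve of the merged system, the solution is sliced back into per-system segments at the original sizes.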
47 A Case Study: Empirical Mode Decomposition (EMD), an adaptive time- (or spatial-) frequency analysis. Applications: climate research, orbit research, structural health monitoring, water wave analysis, biomedical signal analysis.
48 Empirical Mode Decomposition. [Slide diagrams the sifting procedure (an extrema detector feeding spline interpolation of the maxima and minima envelopes, each via a tridiagonal solver, followed by the mean and a vector subtraction) and the IMF procedure, which repeatedly sifts x(t) into IMFs c_1(t), c_2(t), ..., c_N(t) and residues r_1(t), ..., r_N(t); diagram omitted.]
49 Characteristics of Tridiagonal Matrices in EMD. Large sizes. Different numbers of matrices, set by the dimensions or channels of the signals: simultaneous tridiagonal matrices for a 1D (1-channel) signal, 1D multiple-channel signals, or 2D signals. Variations of EMD: Ensemble EMD (EEMD), which adds noise and performs EMD several times, and multi-dimensional EEMD.
50 Benefits of Our gtsv. Large-size matrices: some previous GPU EMD works used B-splines to approximate the spline because they could not solve large systems efficiently; our gtsv fits perfectly. Multiple matrices of different sizes: our gtsv also fits perfectly.
51 Short Summary. This is still ongoing work; new GPU EMD source code is coming soon. A joint project with Norden Huang's group.
52 Q & A. Thank you. Li-Wen Chang at SC'12.
More informationA Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography
1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography
More informationHYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE
HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More information2/17/10. Administrative. L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies. Administrative, cont.
Administrative L7: Memory Hierarchy Optimization IV, Bandwidth Optimization and Case Studies Next assignment on the website Description at end of class Due Wednesday, Feb. 17, 5PM Use handin program on
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture
More informationKSTAR tokamak. /
KSTAR tokamak / spinhalf@nfri.re.kr !!! Data parallelism CUDA programming python! pycuda GPU Development tools Python 2.6+ Scientific libraries as python package for interactive computing (Numpy, Scipy..)
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationOptimization by Run-time Specialization for Sparse-Matrix Vector Multiplication
Optimization by Run-time Specialization for Sparse-Matrix Vector Multiplication Maria J. Garzaran University of Illinois at Urbana-Champaign Joint work with Sam Kamin (UIUC) and Baris Aktemur (Ozyegin
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationCS/EE 217 Midterm. Question Possible Points Points Scored Total 100
CS/EE 217 Midterm ANSWER ALL QUESTIONS TIME ALLOWED 60 MINUTES Question Possible Points Points Scored 1 24 2 32 3 20 4 24 Total 100 Question 1] [24 Points] Given a GPGPU with 14 streaming multiprocessor
More informationParallel Data Mining on a Beowulf Cluster
Parallel Data Mining on a Beowulf Cluster Peter Strazdins, Peter Christen, Ole M. Nielsen and Markus Hegland http://cs.anu.edu.au/ Peter.Strazdins (/seminars) Data Mining Group Australian National University,
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationAddressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer
Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2
More informationAccelerating Molecular Modeling Applications with Graphics Processors
Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 13 Parallelism in Software I
EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 13 Parallelism in Software I Mattan Erez The University of Texas at Austin N EE382N: Parallelilsm and Locality, Spring
More informationModule 12 Floating-Point Considerations
GPU Teaching Kit Accelerated Computing Module 12 Floating-Point Considerations Lecture 12.1 - Floating-Point Precision and Accuracy Objective To understand the fundamentals of floating-point representation
More informationswsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu
swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu 1 2 3 Outline 1. Background 2. Sunway architecture
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationUsing R for HPC Data Science. Session: Parallel Programming Paradigms. George Ostrouchov
Using R for HPC Data Science Session: Parallel Programming Paradigms George Ostrouchov Oak Ridge National Laboratory and University of Tennessee and pbdr Core Team Course at IT4Innovations, Ostrava, October
More informationIMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign
SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory
More informationExploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology
Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation
More informationCUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration
More information