A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu, University of Illinois

1 A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs Li-Wen Chang, Wen-mei Hwu, University of Illinois

2 A Scalable, Numerically Stable, High-Performance Tridiagonal Solver for GPUs: How to Build a gtsv for CUSPARSE 2013 Li-Wen Chang, Wen-mei Hwu, University of Illinois

3 Material in this Session This talk is based on our SC'12 paper: Chang, Li-Wen; Stratton, John; Kim, Hee-Seok; Hwu, Wen-mei, "A Scalable, Numerically Stable, High-performance Tridiagonal Solver using GPUs," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 (SC'12). But it contains more: details not shown in the paper due to the page limit, and an extension developed with the NVIDIA CUSPARSE team.

4 Comparison among Tridiagonal Solvers
Solver                  Numerical Stability
Matlab (backslash)      Yes
Intel MKL (gtsv)        Yes
Intel SPIKE             Yes
CUSPARSE gtsv (2012)    No
Our gtsv                Yes
Our heterogeneous gtsv  Yes

5 Comparison among Tridiagonal Solvers
Solver                  Numerical Stability  CPU Performance
Matlab (backslash)      Yes                  Poor
Intel MKL (gtsv)        Yes                  Good
Intel SPIKE             Yes                  Good
CUSPARSE gtsv (2012)    No                   Not supported
Our gtsv                Yes                  Not supported
Our heterogeneous gtsv  Yes                  Good

6 Comparison among Tridiagonal Solvers
Solver                  Numerical Stability  CPU Performance  GPU Performance
Matlab (backslash)      Yes                  Poor             Not supported
Intel MKL (gtsv)        Yes                  Good             Not supported
Intel SPIKE             Yes                  Good             Not supported
CUSPARSE gtsv (2012)    No                   Not supported    Good
Our gtsv                Yes                  Not supported    Good
Our heterogeneous gtsv  Yes                  Good             Good

7 Comparison among Tridiagonal Solvers
Solver                  Numerical Stability  CPU Performance  GPU Performance  Cluster Scalability
Matlab (backslash)      Yes                  Poor             Not supported    Not supported
Intel MKL (gtsv)        Yes                  Good             Not supported    Not supported
Intel SPIKE             Yes                  Good             Not supported    Supported
CUSPARSE gtsv (2012)    No                   Not supported    Good             Not supported
Our gtsv                Yes                  Not supported    Good             Supported
Our heterogeneous gtsv  Yes                  Good             Good             Supported

8 Numerical Stability on GPUs All previous related work for GPUs used unstable algorithms, like the Thomas algorithm, cyclic reduction (CR), or parallel cyclic reduction (PCR), with no pivoting. Why is pivoting important?

9 CUSPARSE gtsv (2012) CR (+ PCR). But when the b_i's (the diagonal entries) are 0s, CR divides by zero. [Figure: a tridiagonal matrix with b's, a's, and c's whose CR elimination breaks down on zero diagonal entries.]
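To make the breakdown concrete, here is a minimal sketch (not the CUSPARSE source) of one cyclic-reduction elimination step; the scaling factors divide by the neighbors' diagonal entries, so a zero (or tiny) b produces Inf/NaN, and this formulation offers no pivoting to rescue it:

```cpp
// One cyclic-reduction step for interior equation i with stride s, on a
// tridiagonal system a[i]*x[i-s] + b[i]*x[i] + c[i]*x[i+s] = d[i].
void cr_step(double* a, double* b, double* c, double* d, int i, int s) {
    double k1 = a[i] / b[i - s];   // breaks down when b[i-s] == 0
    double k2 = c[i] / b[i + s];   // breaks down when b[i+s] == 0
    a[i] = -k1 * a[i - s];         // new coupling to x[i-2s]
    c[i] = -k2 * c[i + s];         // new coupling to x[i+2s]
    b[i] -= k1 * c[i - s] + k2 * a[i + s];
    d[i] -= k1 * d[i - s] + k2 * d[i + s];
}
```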

10 Why Numerical Stability is Difficult on GPUs Why didn't people apply pivoting on GPUs? They worried about performance: pivoting does not seem to fit GPUs, since pivoting may serialize computation and requires data-dependent control flow, while GPUs like regular computation and regular memory access. Branch divergence may hurt performance.

11 Our gtsv For parallelization, the SPIKE algorithm is applied to decompose the problem, and an optimization technique is applied to achieve high memory efficiency: data layout transformation. For the data-dependent control flow, diagonal pivoting is chosen, and an optimization technique is proposed to achieve high memory efficiency: dynamic tiling.

12 Part 1: SPIKE Algorithm The SPIKE algorithm decomposes a tridiagonal matrix A into several blocks.

13 SPIKE Algorithm A can be redefined as A = DS for suitable D and S; AX = F can then be solved by solving DY = F and SX = Y.

14 A Small Example [Figure: a tridiagonal A partitioned into diagonal blocks A1 and A2 with single-element coupling blocks B and C, and the right-hand side F.]

15 A Small Example [Figure: the corresponding spike matrix S, with identity diagonal blocks and spike columns v and w coupling the partitions.]
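In standard SPIKE notation, this two-partition example (a reconstruction of what the two figures show) is:

```latex
A=\begin{bmatrix}A_1 & B\\ C & A_2\end{bmatrix}
 =\underbrace{\begin{bmatrix}A_1 & 0\\ 0 & A_2\end{bmatrix}}_{D}
  \underbrace{\begin{bmatrix}I & V\\ W & I\end{bmatrix}}_{S},
\qquad A_1 V = B,\quad A_2 W = C.
```

For a tridiagonal A, the coupling blocks B and C each hold a single nonzero corner element, so V and W reduce to single spike columns v and w.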

16 SPIKE Algorithm How to build S? Solve DY = F: solve several independent tridiagonal matrices A_i's.

17 SPIKE Algorithm How to solve SX = Y? Solve the collection of the first and last rows of all blocks (the reduction step*): the problem size shrinks from 4L to 6, then backward substitution recovers the rest. [Figure: the reduced system formed from the spike elements v_i and w_i at the block boundaries.] *E. Polizzi and A. H. Sameh, "A parallel hybrid banded system solver: the SPIKE algorithm," Parallel Computing, vol. 32, no. 2, pp. 177-194, 2006.

18 Part 2: Diagonal Pivoting for Tridiagonal Matrices How to solve each block A_i in a numerically stable way? Diagonal pivoting.* Each A_i can be solved sequentially by one thread. Why diagonal pivoting? It has better data-dependent control flow, which we can handle on GPUs. *J. B. Erway, R. F. Marcia, and J. Tyson, "Generalized diagonal pivoting methods for tridiagonal systems without interchanges," IAENG International Journal of Applied Mathematics, vol. 40, no. 4, 2010.

19 Diagonal Pivoting A tridiagonal matrix A can be decomposed into LBM^T instead of LDU. L and M are unit lower triangular matrices; B is a block diagonal matrix with 1-by-1 or 2-by-2 blocks. The criterion for choosing 1-by-1 or 2-by-2 blocks is asymmetric Bunch-Kaufman pivoting.

20 LBM^T Decomposition B_d is a 1-by-1 or 2-by-2 block. A_s (the Schur complement) is also a tridiagonal matrix; A_s is obtained by modifying the leading elements of T_22, so A_s can be decomposed recursively. For a 2-by-2 pivot, the determinant is Δ = b1*b2 - a2*c1.

21 Diagonal Pivoting A can be solved by solving L, B, and then M^T. It has data-dependent control flow: B contains 1-by-1 or 2-by-2 blocks. It is better than other pivoting methods because it only accesses nearby rows, but it requires dynamic tiling to perform efficiently on GPUs.

22 More Optimization L, B, and M^T are not stored; they are computed on the fly. We store only the conditions (1-by-1 vs. 2-by-2) and the leading elements of B. [Figure: a 4-by-4 tridiagonal example (b1, c1, a2, b2, c2, a3, b3, c3, a4, b4) factored for the d = 1 (1-by-1 pivot) and d = 2 (2-by-2 pivot) cases, showing the resulting L2, B2, and M2^T factors of the remaining block.]

23 An Example [Figure: a numeric 4-by-4 tridiagonal matrix A, what we really store (the leading elements of B), and the stored condition flags (1 for a 1-by-1 pivot, 2 for a 2-by-2 pivot).]

24 Pivoting Criteria Bunch-Kaufman algorithm for unsymmetric cases: κ = (√5 - 1)/2 ≈ 0.62 and σ = max(|c1|, |a2|, |b2|, |c2|, |a3|). If |b1|·σ ≥ κ·|c1·a2|, use 1-by-1 pivoting; else use 2-by-2 pivoting.
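A minimal sequential sketch of the factorization loop a thread would run, combining this criterion with the Schur-complement updates from slides 20 and 22; the array names and bookkeeping are illustrative, not the CUSPARSE-internal code:

```cpp
#include <cmath>
#include <algorithm>

// Conventions: a = sub-diagonal (a[0] unused), b = main diagonal,
// c = super-diagonal (c[n-1] unused). Outputs: condition flags, plus the
// leading elements of B (B1, and B2 holding the 2-by-2 determinant).
void factor_block(const double* a, const double* b, const double* c,
                  int n, int* cond, double* B1, double* B2) {
    const double kappa = (std::sqrt(5.0) - 1.0) / 2.0;  // ~0.62
    double bk = b[0];  // leading diagonal entry of the remaining block
    int k = 0;
    while (k < n) {
        if (k == n - 1) { cond[k] = 1; B1[k] = bk; break; }
        double sigma = std::max(std::abs(c[k]), std::abs(a[k + 1]));
        sigma = std::max(sigma, std::abs(b[k + 1]));
        if (k + 2 < n) {
            sigma = std::max(sigma, std::abs(c[k + 1]));
            sigma = std::max(sigma, std::abs(a[k + 2]));
        }
        if (std::abs(bk) * sigma >= kappa * std::abs(c[k] * a[k + 1])) {
            cond[k] = 1;  // 1-by-1 pivot
            B1[k] = bk;
            bk = b[k + 1] - a[k + 1] * c[k] / bk;  // Schur update
            k += 1;
        } else {
            cond[k] = 2;  // 2-by-2 pivot, determinant = b1*b2 - a2*c1
            double delta = bk * b[k + 1] - a[k + 1] * c[k];
            B1[k] = bk;
            B2[k] = delta;
            if (k + 2 < n)  // Schur update past the 2-by-2 block
                bk = b[k + 2] - a[k + 2] * bk * c[k + 1] / delta;
            k += 2;
        }
    }
}
```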

25 Our gtsv Algorithm Solving each A_i dominates the runtime, using diagonal pivoting. One A_i is solved sequentially by one thread, and all A_i's are solved in parallel. This requires data layout transformation to perform efficiently on GPUs.

26 Data Layout Observation: GPUs require stride-one memory access to fully utilize memory bandwidth. Contradiction: consecutive elements of a diagonal are stored in consecutive memory in the gtsv interface, but each block is processed by one thread. Solution: data layout transformation.

27 Data Layout Transformation Local transpose. The b_i's are elements of one diagonal, split into 16 4-element blocks (the blocks in SPIKE). [Figure: memory addresses before and after the local transpose, which interleaves the blocks so neighboring threads access consecutive addresses.]
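A minimal sketch of the index remapping, assuming nblk partitions of L diagonal elements each and tiles of T threads (names are illustrative):

```cpp
// Before: element j of block i lives at  i*L + j          (gtsv layout)
// After:  it lives at  (i/T)*T*L + j*T + (i%T)            (local transpose)
// so the T threads of a tile touch consecutive addresses at each step j.
__global__ void local_transpose(const double* in, double* out,
                                int nblk, int L, int T) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // partition (block) id
    if (i >= nblk) return;
    int tile = i / T, lane = i % T;
    for (int j = 0; j < L; ++j)
        out[tile * T * L + j * T + lane] = in[i * L + j];
}
```

In this naive form the reads are themselves strided; a production marshaling kernel would stage the transpose through shared memory so both sides coalesce, and that cost is the "data marshaling overhead" measured on the next slide.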

28 Data Layout Transformation [Bar chart: runtime (ms) of the old layout (gtsv interface) vs. the proposed layout, a 4-5x improvement even including the data-marshaling overhead, for three matrix types.] Random: 1-by-1 or 2-by-2 pivoting. Diagonally dominant: always 1-by-1 pivoting. Zero diagonal: always 2-by-2 pivoting.

29 Dynamic Tiling Observation: memory accesses with a compact footprint can be handled well by the L1 cache, even though branch divergence exists; a scattered footprint dramatically reduces memory efficiency. Solution: insert barriers to regularize the memory access footprint. [Figure: threads T1-T4 with a compact footprint vs. a scattered footprint across addresses.]

30 Dynamic Tiling [Figure: threads T1-T4 advancing at different rates; real barriers inserted at the estimated tiling boundaries keep their footprints aligned.]
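A minimal sketch of how dynamic tiling might be layered on the per-thread solver loop; TILE, L_PER_THREAD, and pivot_step are illustrative names, with pivot_step standing in for one 1-by-1/2-by-2 elimination that returns how many rows it consumed:

```cpp
#define TILE 64            // estimated rows per tile (illustrative)
#define L_PER_THREAD 1024  // rows owned by one thread (illustrative)

__device__ int pivot_step(int k);  // one elimination step; returns 1 or 2

__device__ void solve_block_tiled() {
    // Every thread executes the same number of barriers, so the
    // __syncthreads() below is safe even though threads consume 1 or 2
    // rows per step and would otherwise drift apart.
    const int num_tiles = (L_PER_THREAD + TILE - 1) / TILE;
    int k = 0;
    for (int t = 0; t < num_tiles; ++t) {
        // advance until this thread crosses the estimated tile boundary
        while (k < (t + 1) * TILE && k < L_PER_THREAD)
            k += pivot_step(k);
        __syncthreads();  // real barrier: re-compacts the footprint
    }
}
```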

31 Dynamic Tiling [Bar chart: runtime (ms) with the data layout only vs. dynamic tiling (with the data layout), for random, diagonally dominant, and zero-diagonal matrices.]

32 Dynamic Tiling [Performance counters (%): global memory load efficiency, global memory store efficiency, L1 hit rate, and warp execution efficiency, with and without tiling, for random, diagonally dominant, and zero-diagonal matrices. Warp execution efficiency remains low because of branch divergence.]

33 Final Evaluation 3 kinds of evaluation. Numerical stability: a backward analysis (relative backward error ||Ax - b|| / ||b||) on 16 selected types of matrices.* One-GPU performance. Cluster scalability: multiple GPUs, and multiple GPUs + multiple CPUs. *J. B. Erway, R. F. Marcia, and J. Tyson, "Generalized diagonal pivoting methods for tridiagonal systems without interchanges," IAENG International Journal of Applied Mathematics, vol. 40, no. 4, 2010.
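As a concrete reading of the metric (assuming the infinity norm; the paper's exact norm choice may differ), the relative backward error can be evaluated straight from the three diagonals without forming A:

```cpp
#include <cmath>
#include <algorithm>

// Relative backward error ||A*x - rhs||_inf / ||rhs||_inf for a
// tridiagonal A given as sub-diagonal a (a[0] unused), main diagonal b,
// and super-diagonal c (c[n-1] unused).
double backward_error(const double* a, const double* b, const double* c,
                      const double* rhs, const double* x, int n) {
    double rmax = 0.0, bmax = 0.0;
    for (int i = 0; i < n; ++i) {
        double r = b[i] * x[i] - rhs[i];
        if (i > 0)     r += a[i] * x[i - 1];
        if (i < n - 1) r += c[i] * x[i + 1];
        rmax = std::max(rmax, std::abs(r));
        bmax = std::max(bmax, std::abs(rhs[i]));
    }
    return rmax / bmax;
}
```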

34 Numerical Stability [Table: relative backward error for 16 matrix types, comparing our gtsv, our dtsvb, CUSPARSE, MKL, Intel SPIKE, and Matlab. Our gtsv's errors are comparable to MKL's and Matlab's across all 16 types, while CUSPARSE gtsv loses accuracy or fails outright (NaN/Inf) on several ill-conditioned types.]

35 GPU Performance [Bar chart: runtime (ms) of solving an 8M matrix: our dgtsv (GPU), our ddtsvb (GPU), CUSPARSE dgtsv (GPU), data transfer (pageable), data transfer (pinned), and MKL dgtsv (sequential, CPU), for random and diagonally dominant matrices.]

36 Our Heterogeneous gtsv SPIKE algorithm: OpenMP for multicore within one node, CUDA streams for multiple GPUs, and MPI for multiple nodes; MKL gtsv on the CPU side and our gtsv on the GPU side.
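A minimal orchestration sketch under these choices; Partition, gpu_gtsv_async, and mkl_gtsv_wrapper are hypothetical helpers, and the MPI layer across nodes is omitted:

```cpp
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical partition descriptor and backend wrappers (not real APIs):
struct Partition { bool on_gpu; int device; cudaStream_t stream; /* data */ };
void gpu_gtsv_async(Partition* p);    // our gtsv, launched on p->stream
void mkl_gtsv_wrapper(Partition* p);  // calls MKL's ?gtsv on the CPU

void hetero_gtsv(Partition* parts, int nparts) {
    // OpenMP threads drive both backends concurrently within one node.
    #pragma omp parallel for schedule(dynamic)
    for (int p = 0; p < nparts; ++p) {
        if (parts[p].on_gpu) {
            cudaSetDevice(parts[p].device);  // one CUDA stream per GPU
            gpu_gtsv_async(&parts[p]);
        } else {
            mkl_gtsv_wrapper(&parts[p]);
        }
    }
    // ... then synchronize streams, solve the SPIKE reduced system,
    // and back-substitute.
}
```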

37 Cluster Scalability (GPUs) [Log-scale chart: strong scaling runtime (ms) of our gtsv and our gtsv with predistributed data, on 1, 2, 4, 8, and 16 GPUs.]

38 Cluster Scalability (GPUs) [Log-scale chart: weak scaling runtime (ms) of our gtsv and our gtsv with predistributed data: 1 GPU on a 2M-sized matrix, 2 GPUs on 4M, 4 GPUs on 8M, 8 GPUs on 16M, and 16 GPUs on 32M.]

39 Cluster Scalability (GPUs+CPUs) [Charts: strong scaling, strong scaling with predistributed data, and weak scaling.]

40 Short Summary
Solver                  Numerical Stability  CPU Performance  GPU Performance  Cluster Scalability
Matlab (backslash)      Yes                  Poor             Not supported    Not supported
Intel MKL (gtsv)        Yes                  Good             Not supported    Not supported
Intel SPIKE             Yes                  Good             Not supported    Supported
CUSPARSE gtsv (2012)    No                   Not supported    Good             Not supported
Our gtsv                Yes                  Not supported    Good             Supported
Our heterogeneous gtsv  Yes                  Good             Good             Supported

41 More Features of Our gtsv Supports 4 data types (in CUSPARSE 2013): float (S), double (D), complex (C), double complex (Z). Supports arbitrary sizes. Supports multiple right-hand-side vectors. Supports both general matrices (gtsv) and diagonally dominant matrices (dtsvb).

42 More Details 4 data types: CUSPARSE built-in operators. dtsvb: SPIKE + the Thomas algorithm. Arbitrary sizes: padding. Pad 1s for the main diagonal and 0s for the lower and upper diagonals.
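A minimal sketch of that padding, assuming the usual gtsv convention that dl[0] and du[n-1] are unused; the padded rows become identity equations, so the original solution is unchanged:

```cpp
// Extend a tridiagonal system from n to n_pad rows. Arrays dl (lower
// diagonal), d (main diagonal), du (upper diagonal), and rhs must be
// allocated with n_pad entries.
void pad_system(double* dl, double* d, double* du, double* rhs,
                int n, int n_pad) {
    for (int i = n; i < n_pad; ++i) {
        dl[i]  = 0.0;  // lower diagonal: 0
        d[i]   = 1.0;  // main diagonal: 1 (identity row)
        du[i]  = 0.0;  // upper diagonal: 0
        rhs[i] = 0.0;  // padded unknowns solve to 0
    }
    du[n - 1] = 0.0;   // decouple the last real row from the padding
    // dl[n] is already 0, so the padding is fully decoupled.
}
```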

43 More Details Multiple right-hand-side vectors: the Y_i's have multiple columns, but the W_i's and V_i's have only one column each.

44 More Details Solve the V_i's, W_i's, and the first column of the Y_i's, building L, B, and M^T along the way. Then solve the remaining columns of the Y_i's using the pre-built L, B, and M^T.

45 Summary The first numerically stable tridiagonal solver for GPUs: comparable numerical stability with Intel MKL, comparable speed with NVIDIA CUSPARSE 2012, and support for large matrices. In CUSPARSE gtsv 2013, cluster support is removed. Source code for a prototype is available, with a BSD-like license.

46 Something We Forgot What about a batch version? A batch version means multiple matrices of the same size. Currently, you can simply merge them into one large matrix; this even works for multiple matrices of different sizes.
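A minimal host-side sketch of that merge: concatenate the diagonals and zero the couplings at each seam so the systems stay independent (the struct and function names are illustrative):

```cpp
#include <vector>
#include <cstring>

struct Tri { const double *dl, *d, *du, *rhs; int n; };  // one system

// Pack several tridiagonal systems into one large system of total size
// sum(t.n); the output arrays must be preallocated to that size.
void merge(const std::vector<Tri>& sys,
           double* dl, double* d, double* du, double* rhs) {
    int off = 0;
    for (const Tri& t : sys) {
        std::memcpy(dl  + off, t.dl,  t.n * sizeof(double));
        std::memcpy(d   + off, t.d,   t.n * sizeof(double));
        std::memcpy(du  + off, t.du,  t.n * sizeof(double));
        std::memcpy(rhs + off, t.rhs, t.n * sizeof(double));
        dl[off] = 0.0;            // no coupling to the previous system
        du[off + t.n - 1] = 0.0;  // no coupling to the next system
        off += t.n;
    }
}
```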

47 A Case Study Empirical Mode Decomposition (EMD): an adaptive time- (or spatial-) frequency analysis. Applications: climate research, orbit research, structural health monitoring, water wave analysis, and biomedical signal analysis.

48 Empirical Mode Decomposition [Flowchart: the sifting procedure runs an extrema detector for maxima and minima, feeds each into spline interpolation (a tridiagonal solver plus an interpolation step), takes the mean of the two envelopes, and subtracts it from the signal; repeated sifting on x(t) extracts IMFs c_1(t), c_2(t), ..., c_N(t) and residues r_i(t).]
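For reference, the tridiagonal system behind the spline-interpolation box: a natural cubic spline through the detected extrema (x_i, y_i) requires second derivatives m_i satisfying a tridiagonal system. Below is the standard construction (a sketch, not the project's actual kernel):

```cpp
// Build the natural-cubic-spline system for knots (x[0..n-1], y[0..n-1]),
// n >= 3. Row i (interior): h0*m[i-1] + 2*(h0+h1)*m[i] + h1*m[i+1]
//   = 6*((y[i+1]-y[i])/h1 - (y[i]-y[i-1])/h0), with h = knot spacings.
void spline_system(const double* x, const double* y, int n,
                   double* dl, double* d, double* du, double* rhs) {
    for (int i = 1; i + 1 < n; ++i) {
        double h0 = x[i] - x[i - 1], h1 = x[i + 1] - x[i];
        dl[i]  = h0;
        d[i]   = 2.0 * (h0 + h1);
        du[i]  = h1;
        rhs[i] = 6.0 * ((y[i + 1] - y[i]) / h1 - (y[i] - y[i - 1]) / h0);
    }
    d[0] = d[n - 1] = 1.0;   // natural boundary: m[0] = m[n-1] = 0
    du[0] = dl[n - 1] = 0.0;
    rhs[0] = rhs[n - 1] = 0.0;
}
```

One such system arises per envelope per sifting iteration, which is why a fast, stable gtsv sits on EMD's critical path.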

49 Characteristics of Tridiagonal Matrices in EMD Large sizes. Different numbers of matrices, depending on the dimensions or channels of the signals: simultaneous tridiagonal matrices for a 1D (1-channel) signal, 1D multi-channel signals, or 2D signals. Variations of EMD: ensemble EMD (EEMD), which adds noise and performs EMD several times, and multi-dimensional EEMD.

50 Benefits of Our gtsv Large matrices: some previous GPU EMD works used B-splines to approximate the spline because they could not solve large systems efficiently; our gtsv fits perfectly. Multiple matrices of different sizes: our gtsv fits perfectly.

51 Short Summary This is still ongoing work; new GPU EMD source code is coming soon. It is a joint project with Norden Huang's group.

52 Q & A Thank you. Li-Wen Chang at SC'12
