TANGRAM Efficient Kernel Synthesis for Performance Portable Programming


1 TANGRAM Efficient Kernel Synthesis for Performance Portable Programming Li-Wen Chang 1, Izzat El Hajj 1, Christopher Rodrigues 2, Juan Gómez-Luna 3, Wen-Mei Hwu 1 1 University of Illinois, 2 Huawei America Research Lab, 3 Universidad de Córdoba lchang20@illinois.edu

2 Performance Portability: Maintaining optimized programs for different devices is costly. Ideally, programs written once should run on different devices with high performance.

3 (Figure: a program written for a CPU)

4 (Figure: the CPU program is ported separately to Accelerator A and Accelerator B)

5 (Figure: each new device generation — CPU 2.0, Accelerator A 2.0, Accelerator B 2.0 — requires re-optimizing and re-porting every version)

6 Performance Portability: OpenCL SGEMM. (Chart: SGEMM performance in GFLOPS on a Tesla GPU (GTX 280), a Fermi GPU (C2050), and a Sandy Bridge CPU (i7-3820), comparing Parboil's default naïve OpenCL version, Parboil's OpenCL version optimized for the Tesla GPU, and reference libraries (MKL for CPU, CUBLAS for GPU). The Tesla-optimized version drops to roughly 60% and 30% of its relative performance when moved to the other devices.)

7 (Figure: a single source program targets all devices and generations — CPU, CPU 2.0, Accelerator A, Accelerator A 2.0, Accelerator B, Accelerator B 2.0.)

8 Composition-based Programming Languages: NESL, Sequoia, PetaBricks. Highly adaptive to hierarchies through composition; usually scale well. However, performance relies on base-rule implementations/libraries.

9 Performance Sensitivity in Base Rule: DGEMM. (Chart: normalized DGEMM performance on a Sandy Bridge CPU, higher is better: MKL, TANGRAM (small loss vs. MKL), PetaBricks with MKL, and PetaBricks without MKL, which reaches only 11% of MKL.)

10 TANGRAM: a composition-based language. Focuses on high-performance code synthesis within a node. Removes the dependence on high-performance base-rule implementations/libraries. Provides a representation for better SIMD utilization, and an architectural hierarchy model to guide composition.

11 TANGRAM Language

(a) Atomic autonomous codelet:
codelet
int accum = 0;
for (unsigned i = 0; i < len; ++i) {
    accum += in[i];
}
return accum;

(b) Atomic cooperative codelet:
codelet coop tag(kog)
shared int tmp[coopdim()];
unsigned id = coopidx();
tmp[id] = (id < len) ? in[id] : 0;
for (unsigned s = 1; s < coopdim(); s *= 2) {
    if (id >= s) tmp[id] += tmp[id - s];
}
return tmp[coopdim()-1];

(c) Compound codelet using adjacent tiling:
codelet tag(asso_tiled)
tunable unsigned p;
unsigned tile = (len + p - 1) / p;
return sum( map( sum,
    partition(in, p, sequence(0, tile, len), sequence(1), sequence(tile, tile, len+1))));

(d) Compound codelet using strided tiling:
codelet tag(stride_tiled)
tunable unsigned p;
unsigned tile = (len + p - 1) / p;
return sum( map( sum,
    partition(in, p, sequence(0, 1, p), sequence(p), sequence((p-1)*tile, 1, len+1))));
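The behavior these codelets specify can be sketched in plain C. This is an illustrative model only (the function names are hypothetical, and TANGRAM's sum/map/partition primitives are expanded directly into loops): all three variants compute the same sum, differing only in how the input is tiled.

```c
/* Plain sequential sum: the semantics of the atomic autonomous codelet (a). */
static int sum(const int *in, unsigned len) {
    int accum = 0;
    for (unsigned i = 0; i < len; ++i)
        accum += in[i];
    return accum;
}

/* Adjacent tiling, codelet (c): split [0, len) into p contiguous tiles,
 * sum each tile, then sum the partial results. */
static int sum_adjacent_tiled(const int *in, unsigned len, unsigned p) {
    unsigned tile = (len + p - 1) / p;
    int total = 0;
    for (unsigned t = 0; t < p; ++t) {
        int acc = 0;
        for (unsigned i = t * tile; i < (t + 1) * tile && i < len; ++i)
            acc += in[i];
        total += acc;
    }
    return total;
}

/* Strided tiling, codelet (d): tile t owns elements t, t+p, t+2p, ... */
static int sum_strided_tiled(const int *in, unsigned len, unsigned p) {
    int total = 0;
    for (unsigned t = 0; t < p; ++t) {
        int acc = 0;
        for (unsigned i = t; i < len; i += p)
            acc += in[i];
        total += acc;
    }
    return total;
}
```

Adjacent tiling favors sequential devices with caches; strided tiling maps naturally onto SIMD lanes or GPU threads, which is why having both variants available matters for portability.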


18 Rule Extraction
Pipeline: TANGRAM Language Codelets → TANGRAM parser (Clang 3.5 with a customized TANGRAM AST builder) → Rule Extraction → Rule Specialization (guided by the Architectural Hierarchy Model) → Device-specific Codegen, producing Program Rules, Specialized Rules, Plans, and Kernel Versions. The parser outputs a set of TANGRAM ASTs.

Program Composition Rules (sum):
Rule 1: compose(sum, L) → S_L, devolve(l_L), compose(sum, l_L)
Rule 2: compose(sum, L) → compute(c_a, SE_L)
Rule 3: compose(sum, L) → compute(c_b, VE_L)
Rule 4: compose(sum, L) → S_L, regroup(p_c, L), distribute(l_L), compose(sum, l_L), compose(sum, L)
Rule 5: compose(sum, L) → S_L, regroup(p_d, L), distribute(l_L), compose(sum, l_L), compose(sum, L)

Example of deriving composition rules from a compound codelet (codelet c):
compose(sum, L)
→ compose(c_c, L)
→ compose(sum(map(sum, partition(…, p_c))), L)
→ compose(map(sum, partition(…, p_c)), L), compose(sum, L)
→ compose(partition(…, p_c), L), compose(map(sum, …), L), compose(sum, L)
→ S_L, regroup(p_c, L), distribute(l_L), compose(sum, l_L), compose(sum, L)

19 Architectural Hierarchy Model
A level is defined by its computational capability (scalar or vector execution) and by its capability to synchronize across its subordinate level.

Device Specification (GPU):
G := C_G = none, (l_G, S_G) = (B, terminate/launch)  // G: grid
B := C_B = VE_B, (l_B, S_B) = (T, __syncthreads())   // B: block
T := C_T = SE_T, (l_T, S_T) = none                   // T: thread

Extensible: CPU SIMD, GPU warp, ILP, even GPU dynamic parallelism.
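One way to picture this specification is as a small table that a compiler could consult. The following is a minimal sketch under assumed names (the Level record and sync_across helper are hypothetical, not TANGRAM's actual internal representation):

```c
#include <string.h>

/* Hypothetical encoding of the GPU hierarchy model from the slide: each
 * level records its compute capability, its subordinate level, and the
 * primitive used to synchronize across that subordinate level. */
typedef struct {
    const char *name;        /* level name: grid, block, thread */
    const char *capability;  /* "none", "VE" (vector exec.), "SE" (scalar exec.) */
    const char *sub_level;   /* subordinate level, or NULL at the leaf */
    const char *sync;        /* sync primitive across the subordinate level */
} Level;

static const Level gpu_model[] = {
    { "grid",   "none", "block",  "terminate/launch" },
    { "block",  "VE",   "thread", "__syncthreads()"  },
    { "thread", "SE",   NULL,     NULL               },
};

/* Look up how a level synchronizes across its subordinate level. */
static const char *sync_across(const char *level) {
    for (unsigned i = 0; i < sizeof gpu_model / sizeof gpu_model[0]; ++i)
        if (strcmp(gpu_model[i].name, level) == 0)
            return gpu_model[i].sync;
    return NULL;
}
```

Extending the model to CPU SIMD or GPU warps would just mean adding rows, which is the sense in which the hierarchy model is extensible.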

20 Rule Specialization
TANGRAM analyzer (an AST traverser). Outputs a lookup table of the legal codelets for each level, and also prioritizes them.

Specialized Composition Rules:
G rules:
G1: compose(sum, G) → S_G, devolve(B), compose(sum, B)
G4: compose(sum, G) → S_G, regroup(p_c, G), distribute(B), compose(sum, B), compose(sum, G)
G5: compose(sum, G) → S_G, regroup(p_d, G), distribute(B), compose(sum, B), compose(sum, G)
B rules:
B1: compose(sum, B) → S_B, devolve(T), compose(sum, T)
B3: compose(sum, B) → compute(c_b, VE_B)
B4: compose(sum, B) → S_B, regroup(p_c, B), distribute(T), compose(sum, T), compose(sum, B)
B5: compose(sum, B) → S_B, regroup(p_d, B), distribute(T), compose(sum, T), compose(sum, B)
T rules:
T2: compose(sum, T) → compute(c_a, SE_T)

21 Planner
TANGRAM planner (an AST traverser/builder): selects codelets and map policies, prunes the search space, and outputs ASTs for codegen. (Figure: example plan trees composing the partitioning codelets p_c and p_d with the compute codelets c_a and c_b, separated by synchronizations S and a terminate/launch step.)

22 Codegen
TANGRAM codegen (AST traversers): applies conventional optimizations and outputs C/CUDA source code.

23 GPU Codegen Example (plan: the grid distributes adjacent tiles to blocks; within a block, each thread reduces a strided sub-tile with the autonomous codelet c_a, then the cooperative codelet c_b combines the per-thread results):

tile = (len + gridDim.x - 1) / gridDim.x;
sub_tile = (tile + blockDim.x - 1) / blockDim.x;
accum = 0;
#pragma unroll
for (unsigned i = 0; i < sub_tile; ++i) {
    accum += in[blockIdx.x*tile + i*blockDim.x + threadIdx.x];
}
tmp[threadIdx.x] = accum;
__syncthreads();
for (unsigned s = 1; s < blockDim.x; s *= 2) {
    if (threadIdx.x >= s) tmp[threadIdx.x] += tmp[threadIdx.x - s];
    __syncthreads();
}
partial[blockIdx.x] = tmp[blockDim.x - 1];
return; // Launch a new kernel to sum up the partial results
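As a sanity check, the two-kernel reduction this slide synthesizes can be modeled sequentially in plain C. This is a sketch, not generated code: GRID_DIM and BLOCK_DIM are illustrative constants standing in for gridDim.x and blockDim.x, and the parallel Kogge-Stone scan is simulated by iterating downward so that each step reads pre-step values.

```c
#define GRID_DIM  4
#define BLOCK_DIM 8

/* Sequential model of the synthesized two-kernel GPU reduction:
 * kernel 1: each block reduces its adjacent tile (threads take strided
 * sub-tiles, then a Kogge-Stone-style scan over "shared memory" combines
 * the per-thread accumulators); kernel 2: sum the per-block partials. */
static int reduce_model(const int *in, unsigned len) {
    int partial[GRID_DIM];
    unsigned tile = (len + GRID_DIM - 1) / GRID_DIM;
    unsigned sub_tile = (tile + BLOCK_DIM - 1) / BLOCK_DIM;

    for (unsigned b = 0; b < GRID_DIM; ++b) {          /* one "block" */
        int tmp[BLOCK_DIM];                            /* "shared" memory */
        for (unsigned t = 0; t < BLOCK_DIM; ++t) {     /* one "thread" */
            int accum = 0;
            for (unsigned i = 0; i < sub_tile; ++i) {
                unsigned idx = b * tile + i * BLOCK_DIM + t;
                if (idx < len) accum += in[idx];       /* bounds guard */
            }
            tmp[t] = accum;
        }
        /* Kogge-Stone inclusive scan; downward iteration preserves the
         * parallel semantics (each step reads pre-step neighbors). */
        for (unsigned s = 1; s < BLOCK_DIM; s *= 2)
            for (int t = BLOCK_DIM - 1; t >= (int)s; --t)
                tmp[t] += tmp[t - s];
        partial[b] = tmp[BLOCK_DIM - 1];               /* block sum */
    }

    int total = 0;                                     /* second "kernel" */
    for (unsigned b = 0; b < GRID_DIM; ++b)
        total += partial[b];
    return total;
}
```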

24 Experimental Results
TANGRAM delivers 70% or higher performance compared to highly-optimized libraries, such as Intel MKL, NVIDIA CUBLAS, CUSPARSE, or Thrust, or expert-optimized benchmarks (Rodinia). (Chart: normalized performance, higher is better, of reference vs. TANGRAM versions on Kepler, Fermi, and CPU for Scan, SGEMV-TS, SGEMV-SF, DGEMM, SpMV, KMeans, and BFS.)

25 FAQ 1: Why is TANGRAM better than other composition-based languages?
TANGRAM provides an architectural hierarchy model to guide composition, and a representation of cooperative codelets for better SIMD utilization — especially shuffle instructions and scratchpad memory.

26 FAQ 2: Where do optimizations happen?
Selection of codelets and map policies in the planner; conventional optimizations in codegen; further optimizations in the backend compilers.

27 FAQ 3: What? Multiple versions?
We did NOT ask users to write multiple versions of kernels. Codelets can be used to synthesize different versions of kernels, and can be reused multiple times: within one kernel, across kernels in a device, and across kernels for different devices.

28 Takeaways of TANGRAM
Performance portability: 70% or higher performance compared to highly-optimized libraries. Extensible architectural hierarchy model: supports CPU SIMD, GPU warp, ILP, even GPU dynamic parallelism. Native description of the algorithmic design space: well suited for domain users.

29 Questions


More information

1/25/12. Administrative

1/25/12. Administrative Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:

More information

Porting Performance across GPUs and FPGAs

Porting Performance across GPUs and FPGAs Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE

More information

Lecture 6. Programming with Message Passing Message Passing Interface (MPI)

Lecture 6. Programming with Message Passing Message Passing Interface (MPI) Lecture 6 Programming with Message Passing Message Passing Interface (MPI) Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Finish CUDA Today s lecture Programming with message passing 2011

More information

GPU Computing with CUDA. Part 2: CUDA Introduction

GPU Computing with CUDA. Part 2: CUDA Introduction GPU Computing with CUDA Part 2: CUDA Introduction Dortmund, June 4, 2009 SFB 708, AK "Modellierung und Simulation" Dominik Göddeke Angewandte Mathematik und Numerik TU Dortmund dominik.goeddeke@math.tu-dortmund.de

More information

CUDA 6.0 Performance Report. April 2014

CUDA 6.0 Performance Report. April 2014 CUDA 6. Performance Report April 214 1 CUDA 6 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

HPVM: Heterogeneous Parallel Virtual Machine

HPVM: Heterogeneous Parallel Virtual Machine HPVM: Heterogeneous Parallel Virtual Machine Maria Kotsifakou Department of Computer Science University of Illinois at Urbana-Champaign kotsifa2@illinois.edu Prakalp Srivastava Department of Computer Science

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

Adding CUDA Support to Cling: JIT Compile to GPUs

Adding CUDA Support to Cling: JIT Compile to GPUs Published under CC BY-SA 4.0 DOI: 10.5281/zenodo.1412256 Adding CUDA Support to Cling: JIT Compile to GPUs S. Ehrig 1,2, A. Naumann 3, and A. Huebl 1,2 1 Helmholtz-Zentrum Dresden - Rossendorf 2 Technische

More information

Lecture 5. Performance Programming with CUDA

Lecture 5. Performance Programming with CUDA Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy

More information

OpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2

OpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2 2014@San Jose Shanghai Jiao Tong University Tokyo Institute of Technology OpenACC2 vs.openmp4 he Strong, the Weak, and the Missing to Develop Performance Portable Applica>ons on GPU and Xeon Phi James

More information

CUDA. Sathish Vadhiyar High Performance Computing

CUDA. Sathish Vadhiyar High Performance Computing CUDA Sathish Vadhiyar High Performance Computing Hierarchical Parallelism Parallel computations arranged as grids One grid executes after another Grid consists of blocks Blocks assigned to SM. A single

More information

High Performance Linear Algebra on Data Parallel Co-Processors I

High Performance Linear Algebra on Data Parallel Co-Processors I 926535897932384626433832795028841971693993754918980183 592653589793238462643383279502884197169399375491898018 415926535897932384626433832795028841971693993754918980 592653589793238462643383279502884197169399375491898018

More information

PARALLEL MERGE FOR MANY-CORE ARCHITECTURES JIE LV THESIS

PARALLEL MERGE FOR MANY-CORE ARCHITECTURES JIE LV THESIS c 2016 Jie Lv PARALLEL MERGE FOR MANY-CORE ARCHITECTURES BY JIE LV THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura

Support Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview

More information

Module 3: CUDA Execution Model -I. Objective

Module 3: CUDA Execution Model -I. Objective ECE 8823A GPU Architectures odule 3: CUDA Execution odel -I 1 Objective A more detailed look at kernel execution Data to thread assignment To understand the organization and scheduling of threads Resource

More information

Cartoon parallel architectures; CPUs and GPUs

Cartoon parallel architectures; CPUs and GPUs Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

CUDA Performance Considerations (2 of 2)

CUDA Performance Considerations (2 of 2) Administrivia CUDA Performance Considerations (2 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Friday 03/04, 11:59pm Assignment 4 due Presentation date change due via email Not bonus

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU

More information

ECE 408 / CS 483 Final Exam, Fall 2014

ECE 408 / CS 483 Final Exam, Fall 2014 ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

Stanford University. NVIDIA Tesla M2090. NVIDIA GeForce GTX 690

Stanford University. NVIDIA Tesla M2090. NVIDIA GeForce GTX 690 Stanford University NVIDIA Tesla M2090 NVIDIA GeForce GTX 690 Moore s Law 2 Clock Speed 10000 Pentium 4 Prescott Core 2 Nehalem Sandy Bridge 1000 Pentium 4 Williamette Clock Speed (MHz) 100 80486 Pentium

More information

Unrolling parallel loops

Unrolling parallel loops Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:

More information

Dense Linear Algebra. HPC - Algorithms and Applications

Dense Linear Algebra. HPC - Algorithms and Applications Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Object Support in an Array-based GPGPU Extension for Ruby

Object Support in an Array-based GPGPU Extension for Ruby Object Support in an Array-based GPGPU Extension for Ruby ARRAY 16 Matthias Springer, Hidehiko Masuhara Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology June 14, 2016 Object

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

Lab 1 Part 1: Introduction to CUDA

Lab 1 Part 1: Introduction to CUDA Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using

More information

HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS

HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or

More information

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA HPC COMPUTING WITH CUDA AND TESLA HARDWARE Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or High Throughput?

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Merge

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Merge CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Merge Data Parallelism / Data-Dependent Execution Data-Independent Data-Dependent Data Parallel Stencil Histogram SpMV Not Data Parallel

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel

More information

GPU Particle-Grid Methods: Electrostatics

GPU Particle-Grid Methods: Electrostatics GPU Particle-Grid Methods: Electrostatics John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign Research/gpu/

More information