TANGRAM: Efficient Kernel Synthesis for Performance Portable Programming
1 TANGRAM Efficient Kernel Synthesis for Performance Portable Programming Li-Wen Chang 1, Izzat El Hajj 1, Christopher Rodrigues 2, Juan Gómez-Luna 3, Wen-Mei Hwu 1 1 University of Illinois, 2 Huawei America Research Lab, 3 Universidad de Córdoba lchang20@illinois.edu
2 Performance Portability Maintaining optimized programs for different devices is costly. Ideally, programs written once should run on different devices with high performance. 2
3 (Diagram: a program running on a CPU)
4 (Diagram: the CPU program is ported to Accelerator A and Accelerator B)
5 (Diagram: each port is then re-optimized for every new device generation: CPU 2.0, Accelerator A 2.0, Accelerator B 2.0)
6 Performance Portability: OpenCL SGEMM Performance (GFLOPS). Devices: Tesla GPU (GTX 280), Fermi GPU (C2050), Sandy Bridge CPU (i7-3820). Versions compared: Parboil (default naive OpenCL version), Parboil (OpenCL version optimized for the Tesla GPU), Reference (MKL for CPU, CUBLAS for GPU). Chart annotations: →60%, →30%. 6
7 (Diagram: a single source program is compiled into optimized versions for CPU/CPU 2.0, Accelerator A/Accelerator A 2.0, and Accelerator B/Accelerator B 2.0)
8 Composition-based Programming Languages NESL, Sequoia, Petabricks. Highly adaptive to hierarchies through composition; usually scale well. Performance relies on base-rule implementations/libraries. 8
9 Performance Sensitivity in Base Rule: DGEMM. Normalized Performance (higher is better), with % loss annotated: MKL, TANGRAM, Petabricks (with MKL), Petabricks (no MKL, at 11% of MKL). Measured on a Sandy Bridge i7. 9
10 TANGRAM A composition-based language. Focuses on high-performance code synthesis within a node. Removes the dependence on high-performance base-rule implementations/libraries. Provides a representation for better SIMD utilization. Provides an architectural hierarchy model to guide composition. 10
11 TANGRAM Language

(a) Atomic autonomous codelet:
codelet
int accum = 0;
for (unsigned i = 0; i < len; ++i) {
    accum += in[i];
}
return accum;

(b) Atomic cooperative codelet:
codelet coop tag(kog)
shared int tmp[coopdim()];
unsigned id = coopidx();
tmp[id] = (id < len) ? in[id] : 0;
for (unsigned s = 1; s < coopdim(); s *= 2) {
    if (id >= s) tmp[id] += tmp[id - s];
}
return tmp[coopdim() - 1];

(c) Compound codelet using adjacent tiling:
codelet tag(asso_tiled)
tunable unsigned p;
unsigned tile = (len + p - 1) / p;
return sum(map(sum,
    partition(in, p, sequence(0, tile, len), sequence(1), sequence(tile, tile, len + 1))));

(d) Compound codelet using strided tiling:
codelet tag(stride_tiled)
tunable unsigned p;
unsigned tile = (len + p - 1) / p;
return sum(map(sum,
    partition(in, p, sequence(0, 1, p), sequence(p), sequence((p - 1) * tile, 1, len + 1))));

11
17 TANGRAM Language (same codelet examples, annotated in the figure with the labels c_a and c_b for the atomic codelets, p_c and p_d for the adjacent and strided partitionings, and S nodes) 17
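The two compound codelets differ only in how partition slices the input. A minimal C++ sketch of the same idea (the helper names sum_adjacent/sum_strided are hypothetical; TANGRAM's partition/sequence semantics are paraphrased here, not reproduced exactly) shows that both tilings compose partial sums into the same result:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Adjacent tiling (codelet c): p contiguous tiles of ceil(len/p) elements.
int sum_adjacent(const std::vector<int>& in, unsigned p) {
    unsigned len = in.size();
    unsigned tile = (len + p - 1) / p;           // ceil(len / p), as in the codelet
    int total = 0;
    for (unsigned t = 0; t < p; ++t) {           // map(sum, partition(...))
        int partial = 0;
        for (unsigned i = t * tile; i < std::min((t + 1) * tile, len); ++i)
            partial += in[i];
        total += partial;                        // outer sum over the partials
    }
    return total;
}

// Strided tiling (codelet d): tile t takes elements t, t+p, t+2p, ...
int sum_strided(const std::vector<int>& in, unsigned p) {
    unsigned len = in.size();
    int total = 0;
    for (unsigned t = 0; t < p; ++t) {
        int partial = 0;
        for (unsigned i = t; i < len; i += p)
            partial += in[i];
        total += partial;
    }
    return total;
}
```

Because sum is associative, either tiling is a legal decomposition; which one wins depends on the target (e.g., strided access maps to coalesced loads on GPUs, adjacent access to cache-friendly streaming on CPUs).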
18 Rule Extraction
Pipeline: TANGRAM Language Codelets + Architectural Hierarchy Model → TANGRAM parser → Rule Extraction → Rule Specialization → Device-specific code gen, producing Program Rules → Specialized Rules → Plans → Kernel Versions. Built on Clang 3.5 with a customized TANGRAM AST builder; outputs a set of TANGRAM ASTs.

Program Composition Rules (sum):
Rule 1: compose(sum, L) → S_L, devolve(l_L), compose(sum, l_L)
Rule 2: compose(sum, L) → compute(c_a, SE_L)
Rule 3: compose(sum, L) → compute(c_b, VE_L)
Rule 4: compose(sum, L) → S_L, regroup(p_c, L), distribute(l_L), compose(sum, l_L), compose(sum, L)
Rule 5: compose(sum, L) → S_L, regroup(p_d, L), distribute(l_L), compose(sum, l_L), compose(sum, L)

Example of deriving composition rules from compound codelets (codelet c):
compose(sum, L)
→ compose(c_c, L)
→ compose(sum(map(sum, partition(…, p_c))), L)
→ compose(map(sum, partition(…, p_c)), L), compose(sum, L)
→ compose(partition(…, p_c), L), compose(map(sum, …), L), compose(sum, L)
→ S_L, regroup(p_c, L), distribute(l_L), compose(sum, l_L), compose(sum, L)
18
19 Architectural Hierarchy Model
Defines, for each level: its computational capability (scalar or vector execution), and its capability to synchronize across the subordinate level.

Device Specification:
G := C_G = none, (l_G, S_G) = (B, terminate/launch) // G: grid
B := C_B = VE_B, (l_B, S_B) = (T, __syncthreads()) // B: block
T := C_T = SE_T, (l_T, S_T) = none // T: thread

Extensible: CPU SIMD, GPU warp, ILP, even GPU dynamic parallelism.
19
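The device specification above is just structured data: each level names its compute capability and the primitive that synchronizes its children. A hypothetical C++ representation (this struct is an illustration, not TANGRAM's actual internal data structure) of the GPU specification from the slide:

```cpp
#include <cassert>
#include <string>
#include <vector>

// One level of the architectural hierarchy model: its computational
// capability (scalar "SE", vector "VE", or "none") and the primitive
// used to synchronize across the subordinate level ("" for a leaf).
struct Level {
    std::string name;        // e.g. "grid", "block", "thread"
    std::string capability;  // "none", "VE", or "SE"
    std::string sync;        // sync over children, or "" if none
};

// The G -> B -> T GPU specification, top level first.
std::vector<Level> gpu_spec() {
    return {
        {"grid",   "none", "terminate/launch"},   // G: can only relaunch
        {"block",  "VE",   "__syncthreads()"},    // B: vector exec, barrier
        {"thread", "SE",   ""},                   // T: scalar exec, leaf
    };
}
```

Extending the model (CPU SIMD lanes, a GPU warp level between block and thread, and so on) then amounts to inserting additional Level entries rather than changing the compiler.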
20 Rule Specialization
TANGRAM analyzer: an AST traverser that outputs a lookup table of the legal codelets for each level, and also prioritizes them.

Specialized Composition Rules:
G rules:
G1: compose(sum, G) → S_G, devolve(B), compose(sum, B)
G4: compose(sum, G) → S_G, regroup(p_c, G), distribute(B), compose(sum, B), compose(sum, G)
G5: compose(sum, G) → S_G, regroup(p_d, G), distribute(B), compose(sum, B), compose(sum, G)
B rules:
B1: compose(sum, B) → S_B, devolve(T), compose(sum, T)
B3: compose(sum, B) → compute(c_b, VE_B)
B4: compose(sum, B) → S_B, regroup(p_c, B), distribute(T), compose(sum, T), compose(sum, B)
B5: compose(sum, B) → S_B, regroup(p_d, B), distribute(T), compose(sum, T), compose(sum, B)
T rules:
T2: compose(sum, T) → compute(c_a, SE_T)
20
21 Planner
TANGRAM planner: an AST traverser/builder that selects codelets or map policies, prunes the search space, and outputs ASTs for code gen. (The figure shows candidate plan trees built from p_c, p_d, c_a, c_b, and S nodes, terminating in terminate/launch.)
21
22 Code gen
TANGRAM codegen: AST traversers that apply conventional optimizations and output C/CUDA source code.
22
23 GPU Code gen Example
(The figure maps the chosen plan onto the GPU hierarchy: 1. Grid, 2. Block, 3. Thread.)

tile = (len + gridDim.x - 1) / gridDim.x;
sub_tile = (tile + blockDim.x - 1) / blockDim.x;
accum = 0;
#pragma unroll
for (unsigned i = 0; i < sub_tile; ++i) {
    accum += in[blockIdx.x * tile + i * blockDim.x + threadIdx.x];
}
tmp[threadIdx.x] = accum;
__syncthreads();
for (unsigned s = 1; s < blockDim.x; s *= 2) {
    if (threadIdx.x >= s) tmp[threadIdx.x] += tmp[threadIdx.x - s];
    __syncthreads();
}
partial[blockIdx.x] = tmp[blockDim.x - 1];
return;
// Launch new kernel to sum up partial
23
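The structure of the synthesized kernel (a per-block tile, strided per-thread sub-tiles, a scan over per-thread accumulators, one partial per block, then a final pass over the partials) can be checked on the CPU. The sketch below serializes that structure (grid and block sizes are plain loop bounds here; on the GPU every b and t iteration runs in parallel, and the tmp copy stands in for the barriers):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Serialized model of the two-level synthesized reduction.
int reduce_two_level(const std::vector<int>& in, unsigned grid, unsigned block) {
    unsigned len  = in.size();
    unsigned tile = (len + grid - 1) / grid;          // per-"block" tile
    std::vector<int> partial(grid, 0);
    for (unsigned b = 0; b < grid; ++b) {             // one iteration per block
        std::vector<int> tmp(block, 0);               // stands in for shared memory
        unsigned sub_tile = (tile + block - 1) / block;
        for (unsigned t = 0; t < block; ++t) {        // one iteration per thread
            int accum = 0;
            for (unsigned i = 0; i < sub_tile; ++i) { // strided (coalesced) loads
                unsigned idx = b * tile + i * block + t;
                if (idx < len && idx < (b + 1) * tile) accum += in[idx];
            }
            tmp[t] = accum;
        }
        // Scan over tmp; the copy emulates a barrier between scan steps.
        for (unsigned s = 1; s < block; s *= 2) {
            std::vector<int> prev = tmp;
            for (unsigned t = s; t < block; ++t) tmp[t] = prev[t] + prev[t - s];
        }
        partial[b] = tmp[block - 1];                  // block total
    }
    // "Launch new kernel": sum the per-block partials.
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

For any grid and block sizes the result matches a plain sequential sum, which is exactly the property the planner relies on when it swaps one composition plan for another.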
24 Experimental Results TANGRAM delivers 70% or higher performance compared to highly-optimized libraries, such as Intel MKL, NVIDIA CUBLAS, CUSPARSE, or Thrust, and expert-optimized benchmarks (Rodinia). Normalized Performance (higher is better), reference vs. TANGRAM on Kepler, Fermi, and CPU, across: Scan, SGEMV-TS, SGEMV-SF, DGEMM, SpMV, KMeans, BFS. 24
25 FAQ1 Why is TANGRAM better than other composition-based languages? TANGRAM provides an architectural hierarchy model to guide composition. TANGRAM provides a representation of cooperative codelets for better SIMD utilization, especially shuffle instructions and scratchpad. 25
26 FAQ2 Where do optimizations happen? Selection of codelets or map policies in the planner. Conventional optimizations in code gen. Optimizations in backend compilers. 26
27 FAQ3 What? Multiple versions? We did NOT ask users to write multiple versions of kernels. Codelets can be used to synthesize different versions of kernels. Codelets can be reused multiple times: within one kernel, across kernels in a device, across kernels for different devices. 27
28 Takeaways of TANGRAM Performance portability: 70% or higher performance compared to highly-optimized libraries. Extensible architectural hierarchy model: supports CPU SIMD, GPU warp, ILP, even GPU dynamic parallelism. Native description of the algorithmic design space: perfect for domain users. 28
29 Questions 29
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014
More informationSupport Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura
Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview
More informationModule 3: CUDA Execution Model -I. Objective
ECE 8823A GPU Architectures odule 3: CUDA Execution odel -I 1 Objective A more detailed look at kernel execution Data to thread assignment To understand the organization and scheduling of threads Resource
More informationCartoon parallel architectures; CPUs and GPUs
Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationCUDA Performance Considerations (2 of 2)
Administrivia CUDA Performance Considerations (2 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Friday 03/04, 11:59pm Assignment 4 due Presentation date change due via email Not bonus
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION
April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU
More informationECE 408 / CS 483 Final Exam, Fall 2014
ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationStanford University. NVIDIA Tesla M2090. NVIDIA GeForce GTX 690
Stanford University NVIDIA Tesla M2090 NVIDIA GeForce GTX 690 Moore s Law 2 Clock Speed 10000 Pentium 4 Prescott Core 2 Nehalem Sandy Bridge 1000 Pentium 4 Williamette Clock Speed (MHz) 100 80486 Pentium
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationDense Linear Algebra. HPC - Algorithms and Applications
Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationObject Support in an Array-based GPGPU Extension for Ruby
Object Support in an Array-based GPGPU Extension for Ruby ARRAY 16 Matthias Springer, Hidehiko Masuhara Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology June 14, 2016 Object
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software
More informationLab 1 Part 1: Introduction to CUDA
Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using
More informationHIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS
HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or
More informationHPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA
HPC COMPUTING WITH CUDA AND TESLA HARDWARE Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or High Throughput?
More informationn N c CIni.o ewsrg.au
@NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU
More informationCSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Merge
CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Merge Data Parallelism / Data-Dependent Execution Data-Independent Data-Dependent Data Parallel Stencil Histogram SpMV Not Data Parallel
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationGPU Particle-Grid Methods: Electrostatics
GPU Particle-Grid Methods: Electrostatics John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign Research/gpu/
More information