GPU & Reproducibility, Predictability, Repeatability, Determinism
1 GPU & Reproducibility, Predictability, Repeatability, Determinism. David Defour, DALI/LIRMM, UPVD
2 PhD thesis proposal. Subject: study the {numerical} predictability of GPUs. Why? A fuzzy definition, but useful for bounding execution in real-time systems, debugging, building proofs, and more...
3 FPS Battlezone 2 used a lockstep networking model requiring absolutely identical results on every client, down to the least significant bit of the mantissa, or the simulations would start to diverge. While this was difficult to achieve, it meant we only needed to send user input across the network; all other game state could be computed locally. During development, we discovered that AMD and Intel processors produced slightly different results for transcendental functions (sin, cos, tan, and their inverses), so we had to wrap them in non-optimized function calls to force the compiler to leave them at single precision. That was enough to make AMD and Intel processors consistent, but it was definitely a learning experience. (Ken Miller, Pandemic Studios) In FSW1, when a desync was detected, the player would be instantly killed by a magic sniper. All that stuff was fixed in FSW2. We just ran precise FP and used the Havok FPU libs instead of SIMD on PC. Also, integer modulo is a problem too, because the C++ standard says it is implementation-defined (in the case where multiple compilers/platforms are used). In general I liked the tools for lockstep we developed; finding desyncs in code on FSW2 was trivial. (Branimir Karadžić, Pandemic Studios)
4 CUDA tutorial in... 1 min. Software model: a grid of blocks of threads. Hardware model: a set of CTAs on SMs of CUs, working in SIMT manner. No assumptions on execution order for blocks, warps, or threads within a warp. Communication with a synchronization barrier, __syncthreads(), and atomics such as atomicAdd(). A minimal sketch of this model follows.
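To make the one-minute model concrete, here is a minimal sketch (mine, not from the talk): a grid of blocks of threads, a block-wide barrier with __syncthreads(), and atomicAdd() both inside a block and across the grid.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void countEvens(const int* in, int n, int* total) {
        __shared__ int blockCount;              // one accumulator per block
        if (threadIdx.x == 0) blockCount = 0;
        __syncthreads();                        // barrier across the block

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n && in[gid] % 2 == 0)
            atomicAdd(&blockCount, 1);          // atomic within the block
        __syncthreads();

        if (threadIdx.x == 0)
            atomicAdd(total, blockCount);       // atomic across the whole grid
    }

    int main() {
        const int n = 1 << 20;
        int *dIn, *dTotal, total = 0;
        cudaMalloc(&dIn, n * sizeof(int));
        cudaMalloc(&dTotal, sizeof(int));
        cudaMemset(dIn, 0, n * sizeof(int));    // all zeros, so every element is even
        cudaMemset(dTotal, 0, sizeof(int));
        countEvens<<<(n + 255) / 256, 256>>>(dIn, n, dTotal);
        cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
        printf("evens: %d (expected %d)\n", total, n);
        cudaFree(dIn);
        cudaFree(dTotal);
        return 0;
    }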
5 Hardware view: block scheduler, warp scheduler, instruction scheduler, clock tree, dynamic frequency scaling, ECC.
6 Your computer's configuration. For each tested card the slide tabulates: GPU chip, CUDA compute capability, #MP/GPC, cores/MP, warp schedulers/MP, GPU clock (MHz), and memory clock (MHz) [the numeric values were lost in transcription]. The cards: C870 (G80), 9800GX2 (G92), GTX480 (GF100), GTX560 (GF114), GTX680 (GK104).
7 First work. Are GPUs predictable, regarding block scheduling? Regarding thread scheduling? Can predictability be improved, by resetting the initial state of the GPU? By changing the frequency?
8 Measure of predictability. How do we measure predictability in the context of data-parallel processing? Solution (statistical mode): for one problem, one program, one input dataset, and one processor, look at the output over several runs and take the most probable output. Example: runs producing the outputs x y z w x x y x z t give a predictability of 40% (the mode x appears in 4 of 10 runs).
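A sketch of this measure on the host side (a hypothetical helper, not from the talk): serialize each run's output to a comparable key and return the frequency of the mode.

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    // Predictability = frequency of the statistical mode of the output.
    // Each entry of `outputs` is one run's result, serialized to a comparable key.
    double predictability(const std::vector<std::string>& outputs) {
        std::map<std::string, int> freq;
        for (const auto& o : outputs) ++freq[o];
        int best = 0;
        for (const auto& kv : freq) best = std::max(best, kv.second);
        return 100.0 * best / outputs.size();
    }
    // The slide's example: {x,y,z,w,x,x,y,x,z,t} -> mode "x", 4 times out of 10 -> 40%.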
9 Block vs. warp? Test block scheduling: launch 1 to 32 blocks with 1 warp per block. Test warp scheduling: launch #MP blocks with 1 to maxthreads per block.

    __global__ void TestOrder(int* dmem) {
        const unsigned int gid = blockDim.x * blockIdx.x + threadIdx.x;
        int cl;
        SYNC;               // placeholder for the synchronization variant under test
        cl = clock();
        dmem[gid] = cl;
    }

What is the most predictable, warp or block?
10 Warp scheduling: [chart: predictability vs. number of warps (2 to 16) for C870, GTX480, GTX560, GTX680]
11 Block scheduling: [chart: predictability vs. number of blocks (5 to 25+) for C870, GTX480, GTX560, GTX680]
12 What is next? Choice 1: I don't trust clock(), let's try global memory access. Choice 2, let's optimize it by: (2.a) synchronizing the warps of a block before starting; (2.b) resetting the device before starting; (2.c) reducing the clock frequency.
13 I don't trust clock(), let's use global memory access: atomicAdd(&cpt, 1);
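A sketch of what the slide's atomicAdd(&cpt, 1) amounts to (an assumed reconstruction, not the talk's exact code): each thread records its global arrival rank instead of a raw timestamp.

    // cpt lives in global memory; zero it from the host before every launch, e.g.
    //   int zero = 0; cudaMemcpyToSymbol(cpt, &zero, sizeof(int));
    __device__ int cpt;

    __global__ void TestOrderAtomic(int* dmem) {
        const unsigned int gid = blockDim.x * blockIdx.x + threadIdx.x;
        dmem[gid] = atomicAdd(&cpt, 1);   // global arrival rank of this thread
    }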
14-16 Global memory access: [charts comparing clock() and atomicAdd() for warp scheduling (2 to 16 warps) and block scheduling (5 to 25 blocks), on C870, GTX480, GTX560, GTX680]
17 Synchronize the warps: asm volatile("bar.sync 0;");
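A sketch of how this barrier can slot into the TestOrder kernel as the SYNC step (my assumption of the intended use): bar.sync 0 is the PTX barrier that __syncthreads() compiles to.

    __global__ void TestOrderSynced(int* dmem) {
        const unsigned int gid = blockDim.x * blockIdx.x + threadIdx.x;
        asm volatile("bar.sync 0;");      // block-wide barrier, aligns all warps
        dmem[gid] = (int)clock();         // timestamp only after the barrier
    }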
18-19 One synchronization: [charts comparing clock() and a single __syncthreads(), for warps and blocks, on C870, GTX480, GTX560, GTX680]. Improvement for the GTX680.
20-21 32 synchronizations: [charts comparing clock() and 32 __syncthreads() calls, for warps and blocks]. Improvement for every GPU except the G80.
22 Play with clock frequency
23 Clock frequency. GTX480: default clocks GPU 701 MHz, memory 1848 MHz, shader 1401 MHz; set each clock, for each of the 4 performance levels, to 900 MHz. GTX560: default clocks memory 2004 MHz, shader 1620 MHz; set each clock, for each performance level, to 1 GHz.
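For reference, one modern way to pin clocks is through NVML application clocks. This is a hedged sketch, not what the talk used (GTX-class boards of that era needed vendor overclocking tools, and application clocks require a supported board plus admin rights); the target values mirror the slide's GTX560 setting.

    #include <cstdio>
    #include <nvml.h>                     // link with -lnvidia-ml

    int main() {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);
        nvmlReturn_t r = nvmlDeviceSetApplicationsClocks(dev, 2004 /* mem MHz */,
                                                         1000 /* SM MHz */);
        if (r != NVML_SUCCESS)
            printf("setting clocks failed: %s\n", nvmlErrorString(r));
        nvmlShutdown();
        return 0;
    }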
24 Underclocking: [charts comparing clock() with and without underclocking, for 1 to 28 blocks and 1 to 16 warps, on GTX480, GTX560, GTX680]
25 Reset the device: cudaDeviceReset();
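A sketch of the measurement protocol this implies (my assumption): reset the device before every run, so each launch starts from a freshly initialized context.

    #include <cuda_runtime.h>
    #include <vector>

    // The slide's kernel, with the SYNC step omitted for brevity.
    __global__ void TestOrder(int* dmem) {
        const unsigned int gid = blockDim.x * blockIdx.x + threadIdx.x;
        dmem[gid] = (int)clock();
    }

    std::vector<int> oneRun(int blocks, int threads) {
        cudaDeviceReset();                // wipe context, allocations, cached state
        const int n = blocks * threads;
        int* dmem = nullptr;
        cudaMalloc(&dmem, n * sizeof(int));
        TestOrder<<<blocks, threads>>>(dmem);
        std::vector<int> out(n);
        cudaMemcpy(out.data(), dmem, n * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(dmem);
        return out;
    }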
26-27 Reset: [charts comparing clock() and atomicAdd() with and without cudaDeviceReset(), for warps and blocks, on C870, 9800GX2, GTX480, GTX560, GTX680]
28 Impact of software. Two examples: FP summation, and tree operations (Rootfix, Leaffix).
29 A simple problem: FP summation. «Technology Challenges in Achieving Exascale Systems», DARPA report, 2008.
30 Solutions for reduction. Three variants, differing in where the partial sums accumulate: Solution 1, every thread adds its value straight into global memory with atomicAdd(); Solution 2, each block reduces in shared memory, then issues one atomicAdd() per block into global memory; Solution 3, each thread first accumulates privately in registers, then the block reduces in shared memory before the final atomicAdd() into global memory.
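Hedged sketches of the first two variants (assuming the diagram reads as described above; float atomicAdd() requires Fermi-class hardware or newer):

    // V1: every thread issues one atomicAdd() straight into global memory.
    __global__ void sumV1(const float* x, int n, float* total) {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n) atomicAdd(total, x[gid]);          // n atomics, order varies per run
    }

    // V2: reduce inside the block in shared memory, then one atomicAdd() per block.
    // Launch with sharedBytes = threadsPerBlock * sizeof(float), power-of-two block size.
    __global__ void sumV2(const float* x, int n, float* total) {
        extern __shared__ float s[];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = (gid < n) ? x[gid] : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0) atomicAdd(total, s[0]);   // one atomic per block
    }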
31 Predictability: [chart: predictability (0 to 100%) of V1, V2, V3 vs. number of warps (1 to 16), on the GTX480]
32 Optimized predictability, with GPU and memory at 1 GHz, device reset, and __syncthreads(): [chart: V1, V2, V3 vs. number of warps (1 to 16), on the GTX480]
33 CUDA SDK
34 It's time to graduate. Three easy factors improve predictability: reset the device; align the clock frequencies (and lower them, by the way); synchronize the warps before starting. However, software plays an important role too. But... all of this remains subject to floating-point non-associativity, the chosen algorithm, and the compiler's optimizations.
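A tiny demonstration of the floating-point non-associativity (host-side; the same effect appears on the GPU, where the schedulers pick the summation order):

    #include <cstdio>

    int main() {
        float a = 1e8f, b = -1e8f, c = 1.0f;
        printf("(a + b) + c = %f\n", (a + b) + c);   // 1.000000
        printf("a + (b + c) = %f\n", a + (b + c));   // 0.000000: c is absorbed by b
        return 0;
    }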
35 Two sweep operations. Rootfix: scan a tree from top to bottom, writing the sum of the parents into every child, down to the leaves. Leaffix: scan a tree from bottom to top, writing the sum of the children into every parent, up to the root.
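A sequential sketch of Rootfix (my assumption: nodes stored with parents before children, parent[i] < i, and the root's parent is -1); the GPU versions parallelize this scan:

    #include <vector>

    // Rootfix, sequentially: out[i] is the sum of the values of all ancestors of i.
    std::vector<int> rootfix(const std::vector<int>& val,
                             const std::vector<int>& parent) {
        std::vector<int> out(val.size(), 0);          // the root receives 0
        for (size_t i = 1; i < val.size(); ++i)
            out[i] = out[parent[i]] + val[parent[i]]; // parent's prefix plus its value
        return out;
    }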