PyCUDA and PyUblas: Hybrid HPC in Python made easy
PyCUDA and PyUblas: Hybrid HPC in Python made easy. Applied Mathematics, Brown University. March 5, 2009.
Thanks: Jan Hesthaven (Brown), Tim Warburton (Rice), Lucas Wilcox (UT Austin), Akil Narayan (Brown), the PyCUDA contributors, and the Nvidia Corporation.
Outline
1 PyCUDA: What and Why? (Scripting for GPUs; (Very) brief GPU 101)
2 PyCUDA: Overview (Whetting your Appetite; Working with PyCUDA)
3 PyUblas (A Toolset for Building Hybrid Codes)
4 Conclusions
Scripting for GPUs. PyCUDA: Why do scripting for GPUs? GPUs are everything that Python is not:
Highly parallel
Very architecture-sensitive
Built for maximum FP/memory throughput
The two therefore complement each other. The CPU is largely restricted to control tasks (on the order of 1000 per second), and for those, Python is fast enough.
Realize a promise: use Python from first prototype to full-scale production code.
Scripting for GPUs. Python: Speed. The usual answer to the speed question: hybrid ("mixed") code.
Plays to the strengths of each language.
But: introduces (some) complexity.
Observation: GPU code is already hybrid.
Consequence: hybrid code adds no extra complexity.
(Very) brief GPU 101. GPUs: What?
Design target for CPUs: make a single thread very fast; hide latency through large caches; predict and speculate.
Design target for GPUs: throughput matters, single threads do not; hide latency through massive parallelism; let the programmer deal with the raw storage hierarchy.
(Very) brief GPU 101. What is CUDA? CUDA is Nvidia's proprietary compute abstraction. Its main merit: a well-balanced model of GPU computing, abstract enough not to be hardware-specific, yet concrete enough to expose most hardware features. It is a (very) close semantic relative of OpenCL. Python + CUDA = PyCUDA.
(Very) brief GPU 101. GPUs: Execution Model. [Figure: a computational grid of blocks; each block is itself a grid of threads. Image credit: Johan S. Seland, Sintef.]
Multi-tiered parallelism: grid and block.
Only threads within a block can communicate.
Each block is assigned to a physical execution unit.
Grids and blocks replace the outer loops in an algorithm.
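The last point can be made concrete with a pure-Python sketch (no GPU required): the two serial loops below are exactly what the grid and block tiers replace on the device, where every (block, thread) pair runs the kernel body concurrently. The names mirror CUDA's built-in indices, but the emulation itself is only an illustration, not PyCUDA API.

```python
import numpy

def launch_emulated(kernel, grid_dim, block_dim, *args):
    # On the GPU, both of these loops disappear: each (block, thread)
    # pair executes the kernel body in parallel.
    for block_idx in range(grid_dim):          # replaced by the grid
        for thread_idx in range(block_dim):    # replaced by the block
            kernel(block_idx, block_dim, thread_idx, *args)

def double_kernel(block_idx, block_dim, thread_idx, a, out):
    i = block_idx * block_dim + thread_idx     # the usual global index
    out[i] = 2 * a[i]

a = numpy.arange(12, dtype=numpy.float32)
out = numpy.empty_like(a)
launch_emulated(double_kernel, 3, 4, a, out)   # grid of 3 blocks of 4 threads
assert (out == 2 * a).all()
```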
Whetting your Appetite. Whetting your appetite:

import pycuda.driver as cuda
import pycuda.autoinit
import numpy

a = numpy.random.randn(4, 4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]
Whetting your Appetite. Whetting your appetite, continued. The compute kernel is handed to SourceModule as a string of CUDA C:

mod = cuda.SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y * 4;
    a[idx] *= 2;
}
""")

func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a
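To see exactly what the doublify kernel computes, here is a CPU re-enactment of its index arithmetic in plain numpy (not PyCUDA API): the 4x4 block of threads covers the 16 elements of the flattened array, each thread doubling one entry.

```python
import numpy

a = numpy.random.randn(4, 4).astype(numpy.float32)
expected = 2 * a

flat = a.ravel()  # view onto a's memory, like the GPU's linear buffer
for ty in range(4):            # threadIdx.y
    for tx in range(4):        # threadIdx.x
        idx = tx + ty * 4      # threadIdx.x + threadIdx.y * 4
        flat[idx] *= 2         # a[idx] *= 2, one element per thread

assert numpy.allclose(a, expected)
```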
Whetting your Appetite. Whetting your appetite, Part II. Did somebody say "abstraction is good"?
Whetting your Appetite. Whetting your appetite, Part II:

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a_gpu = gpuarray.to_gpu(
    numpy.random.randn(4, 4).astype(numpy.float32))
a_doubled = (2 * a_gpu).get()
print a_doubled
print a_gpu
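The point of the abstraction is that, apart from the to_gpu()/.get() boundary calls, the gpuarray version reads exactly like numpy. The pure-numpy equivalent (runnable without a GPU) is:

```python
import numpy

# Same computation as the gpuarray snippet, minus the device transfer:
# the arithmetic expression is identical on both sides.
a = numpy.random.randn(4, 4).astype(numpy.float32)
a_doubled = 2 * a

assert a_doubled.dtype == numpy.float32
assert numpy.allclose(a_doubled, a + a)
```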
Working with PyCUDA. PyCUDA Philosophy:
Provide complete access.
Automatically manage resources.
Provide abstractions.
Allow interactive use.
Check for and report errors automatically.
Integrate tightly with numpy.
Working with PyCUDA. PyCUDA: Completeness. PyCUDA exposes all of CUDA. For example: arrays and textures, page-locked host memory, memory transfers (asynchronous, structured), streams and events, device queries, (GL interop). PyCUDA also supports every OS that CUDA supports: Linux, Windows, OS X.
Working with PyCUDA. PyCUDA: Documentation.
Working with PyCUDA. PyCUDA: Workflow. The edit-run cycle works as follows: you edit your script and run it; a SourceModule("...") call hands the kernel source to PyCUDA; PyCUDA checks its compiler cache; on a miss it invokes nvcc, which produces a .cubin binary that is then cached; the binary is uploaded to the GPU and run there.
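The cache step above can be sketched in a few lines of Python. This is a hypothetical model, not PyCUDA's internals: fake_nvcc, source_module, and the md5-keyed dictionary are stand-ins chosen to show why an unchanged kernel source skips the compile step on the second run.

```python
import hashlib

_cache = {}
compile_count = [0]

def fake_nvcc(source):
    # Stand-in for invoking nvcc; returns a pretend .cubin binary.
    compile_count[0] += 1
    return b"CUBIN:" + source.encode()

def source_module(source):
    key = hashlib.md5(source.encode()).hexdigest()
    if key not in _cache:            # Cache? -> miss: compile with nvcc
        _cache[key] = fake_nvcc(source)
    return _cache[key]               # Cache! -> reuse the stored binary

src = "__global__ void doublify(float *a) { /* ... */ }"
bin1 = source_module(src)            # compiles
bin2 = source_module(src)            # cache hit, no second compile
assert bin1 == bin2
assert compile_count[0] == 1
```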
Working with PyCUDA. Automatic Cleanup.
Reachable objects (memory, streams, ...) are never destroyed.
Once unreachable, they are released at an unspecified future time.
Scarce resources (memory) can be explicitly freed (obj.free()).
Cleanup correctly deals with multiple contexts and dependencies.
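A toy illustration of that contract (not PyCUDA's internals; the Buffer class here is invented for the sketch): an object stays valid while reachable, may be released implicitly some time after becoming unreachable, and a scarce resource can be handed back deterministically with an explicit free().

```python
released = []

class Buffer:
    def __init__(self, name):
        self.name, self.freed = name, False
    def free(self):                 # explicit release, like obj.free()
        if not self.freed:
            self.freed = True
            released.append(self.name)
    def __del__(self):              # implicit release, timing unspecified
        self.free()

b = Buffer("a_gpu")
b.free()                            # deterministic: memory is scarce
assert released == ["a_gpu"]

c = Buffer("tmp")
del c                               # CPython happens to free promptly;
                                    # in general, "at an unspecified
                                    # future time" is all that is promised
assert "tmp" in released
```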
Working with PyCUDA. gpuarray: Simple Linear Algebra. pycuda.gpuarray is meant to look and feel just like numpy.
gpuarray.to_gpu(numpy_array)
numpy_array = gpuarray.get()
Not yet supported: indexing, slicing, etc.
Supported: +, -, *, /, fill, sin, exp, rand, take, ...
Mixed types (int32 + float32 = float64).
print gpuarray for debugging.
The memory behind a gpuarray is available as its .gpudata attribute, for use as kernel arguments, textures, etc.
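Because gpuarray mimics numpy, the operations listed above can be previewed in pure numpy (runnable without a GPU); the gpuarray version would read the same, modulo to_gpu()/.get() at the boundaries. The mixed-type rule quoted on the slide is numpy's own promotion rule, checked below.

```python
import numpy

a = numpy.linspace(0, 1, 8).astype(numpy.float32)
b = numpy.full(8, 2.0, dtype=numpy.float32)

c = a + b - a * b / 2          # elementwise +, -, *, /
s = numpy.sin(a) + numpy.exp(b)

# Mixed-type promotion as stated above: int32 + float32 -> float64
i = numpy.arange(8, dtype=numpy.int32)
assert (i + a).dtype == numpy.float64
assert c.shape == s.shape == (8,)
```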
Working with PyCUDA. gpuarray: Elementwise Expressions. Avoiding extra store-fetch cycles for elementwise math:

from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
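A CPU reference for that kernel (pure numpy, no GPU needed) makes the fusion explicit: the kernel body z[i] = a*x[i] + b*y[i] runs once per element in a single pass, instead of the separate scale-scale-add passes the unfused expression would take. The loop below plays the role of one GPU thread per index i.

```python
import numpy
from numpy import linalg as la

def lin_comb_cpu(a, x, b, y):
    # Same fused single pass as the ElementwiseKernel above.
    z = numpy.empty_like(x)
    for i in range(len(x)):        # one GPU thread per i
        z[i] = a * x[i] + b * y[i]
    return z

x = numpy.random.rand(50).astype(numpy.float32)
y = numpy.random.rand(50).astype(numpy.float32)
z = lin_comb_cpu(5, x, 6, y)
assert la.norm(z - (5 * x + 6 * y)) < 1e-5
```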
Working with PyCUDA. PyCUDA: Vital Information. Available at software/pycuda. X Consortium License (no warranty, free for all use). Requires: numpy, Boost C++, Python. Support via mailing list.
Working with PyCUDA. PyCUDA: Metaprogramming. In PyCUDA, GPU code does not need to be a compile-time constant (unlike with the native C interface).
Working with PyCUDA. Shameless Plug. How would you use this stuff to write a Discontinuous Galerkin solver? How well does that work? MS134: Large-Scale Computing with High-Order Mth. Tomorrow (Friday), 10:00-10:25, Concerto A.
A Toolset for Building Hybrid Codes. Making C++ and Python talk to each other. Making C++ and Python a workable combination: Boost.Python bridges the two languages; numpy provides arrays on the Python side and Boost.Ublas on the C++ side; PyUblas ties numpy and Boost.Ublas together across that bridge.
A Toolset for Building Hybrid Codes. PyUblas: Features. PyUblas exploits the pluggable storage backends in Ublas:
numpy_vector<T>, numpy_matrix<T>: normal citizens of the Ublas universe (they take part in expression templates etc.).
Type-safe, zero-copy language transition.
OK anywhere: argument, return value, data member.
Strided storage: numpy_strided_vector<T>.
Not OK: non-contiguous storage.
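"Zero-copy" means both languages operate on the same underlying buffer; no data is duplicated at the boundary. The idea can be illustrated entirely within numpy (no C++ involved here; this is a conceptual sketch, not PyUblas itself): a view shares the original storage, so a write on one side is immediately visible on the other.

```python
import numpy

a = numpy.zeros(5)
view = a[:]                     # shares a's storage, like numpy_vector
                                # wrapping the same buffer on the C++ side
view[0] = 3.0                   # "C++ side" writes...
assert a[0] == 3.0              # ...the "Python side" sees it immediately
assert view.base is a           # no copy was ever made
```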
A Toolset for Building Hybrid Codes. Example. C++:

#include <pyublas/numpy.hpp>

pyublas::numpy_vector<double> triple(pyublas::numpy_vector<double> x)
{
    return 3 * x;
}

BOOST_PYTHON_MODULE(sample_ext)
{
    boost::python::def("triple", triple);
}

Python:

import numpy
import sample_ext
import pyublas

vec = numpy.ones((5,), dtype=float)
print sample_ext.triple(vec)
Conclusions. GPUs and Python: a good combination.
Been considering playing with GPUs? PyCUDA makes that easier.
It allows full access to the low-level bits, if needed.
It provides many nice abstractions to make life easier.
It enables just-in-time GPU code generation. (See tomorrow's talk.)
Questions? Thank you for your attention! MS134: Tomorrow (Friday), 10:00-10:25, Concerto A.
Image Credits. C870 GPU: Nvidia Corp. Snail: flickr.com/hadi fooladi. GT200 die shot: Nvidia Corp. Old Books: flickr.com/ppdigital. Adding Machine: flickr.com/thomashawk. Floppy disk: flickr.com/ethanhein.
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationParallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer
Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing
More informationDense Linear Algebra. HPC - Algorithms and Applications
Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:
More informationInformation Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY
Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Introduction to memory spaces and memory access Shared memory Matrix multiplication example Lecture
More informationiennacl GPU-accelerated Linear Algebra at the Convenience of the C++ Boost Libraries Karl Rupp
GPU-accelerated Linear Algebra at the Convenience of the C++ Boost Libraries Karl Rupp Mathematics and Computer Science Division Argonne National Laboratory based on previous work at Technische Universität
More informationHigh Performance and GPU Computing in MATLAB
High Performance and GPU Computing in MATLAB Jan Houška houska@humusoft.cz http://www.humusoft.cz 1 About HUMUSOFT Company: Humusoft s.r.o. Founded: 1990 Number of employees: 18 Location: Praha 8, Pobřežní
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationSupporting Data Parallelism in Matcloud: Final Report
Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationCUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17
CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de December 15, 2015 CUDA Programming Fundamentals CUDA
More informationGPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum
GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute
More informationIntroduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator
Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming
More informationCUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)
CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationNVIDIA Fermi Architecture
Administrivia NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Assignment 4 grades returned Project checkpoint on Monday Post an update on your blog beforehand Poster
More informationCUDA GPGPU Workshop CUDA/GPGPU Arch&Prog
CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance
More informationKSTAR tokamak. /
KSTAR tokamak / spinhalf@nfri.re.kr !!! Data parallelism CUDA programming python! pycuda GPU Development tools Python 2.6+ Scientific libraries as python package for interactive computing (Numpy, Scipy..)
More informationMotivation Hardware Overview Programming model. GPU computing. Part 1: General introduction. Ch. Hoelbling. Wuppertal University
Part 1: General introduction Ch. Hoelbling Wuppertal University Lattice Practices 2011 Outline 1 Motivation 2 Hardware Overview History Present Capabilities 3 Programming model Past: OpenGL Present: CUDA
More informationGPU Programming for Mathematical and Scientific Computing
GPU Programming for Mathematical and Scientific Computing Ethan Kerzner and Timothy Urness Department of Mathematics and Computer Science Drake University Des Moines, IA 50311 ethan.kerzner@gmail.com timothy.urness@drake.edu
More informationLecture 11: GPU programming
Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!
More informationJosef Pelikán, Jan Horáček CGG MFF UK Praha
GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing
More informationIntroduction to CUDA (1 of n*)
Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationGPU Computing: A VFX Plugin Developer's Perspective
.. GPU Computing: A VFX Plugin Developer's Perspective Stephen Bash, GenArts Inc. GPU Technology Conference, March 19, 2015 GenArts Sapphire Plugins Sapphire launched in 1996 for Flame on IRIX, now works
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationMemory. Lecture 2: different memory and variable types. Memory Hierarchy. CPU Memory Hierarchy. Main memory
Memory Lecture 2: different memory and variable types Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Key challenge in modern computer architecture
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationTechnische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics
GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth
More informationGPU Programming Using CUDA
GPU Programming Using CUDA Michael J. Schnieders Depts. of Biomedical Engineering & Biochemistry The University of Iowa & Gregory G. Howes Department of Physics and Astronomy The University of Iowa Iowa
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More information