High Performance Computing - Introduction to GPGPU Programming. Prof Matt Probert

Size: px
Start display at page:

Download "High Performance Computing - Introduction to GPGPU Programming. Prof Matt Probert"

Transcription

1 High Performance Computing - Introduction to GPGPU Programming Prof Matt Probert

2 Overview What are GPUs? GPU Architecture Memory Programming in CUDA The Future

3 What is a GPU? A GPU is a massive vector processor Hundreds of processing units Large memory bandwidth Processing units can coaborate Some probems can be soved more efficienty on vector processors (remember Crays). GPUs have some big advantages. Cheap and very fast for vector probems Computation is asynchronous with the CPU Hardware tricks such as texture fitering (zero overhead!) Other acceerators (e.g. Xeon Phi) avaiabe

4 Vaue for Money (2016 prices) P100 ~ 9k Ebor 40k K80 5k Xeon E7 5k However, Ebor is a distributed data machine with 16 nodes with 16 Xeon cores and tota 256 GB of RAM etc. These are different toos for different probems.

5 When is GPU Programming usefu? The GPU devotes more transistors to data processing (nvidia C Programming Guide) Good at doing ots of numeric cacuations simutaneousy. - This is actuay required for the GPU to be efficient - There must be many more numeric operations than memory operations to break even. Usefu as a co-processor. - Can offoad GPU efficient cacuations whie the CPU continues with the rest. Brute forcing! - Sometimes a brute force method is more efficient with many processors than an eegant soution on just one

6 Foating Point Standard nvidia architectures 2.x onwards (i.e. Fermi onwards) are IEEE 754 compiant. Od nvidia architectures were mosty compiant. Generay exceptions were handed in a noncompiant way. Aso some mathematics e.g. FMAD, division and sqrt were not standard. Standard compiant intrinsic functions are avaiabe but at a arge computationa penaty (software impemented).

7 nvidia Architectures V1 = Tesa (2008) introduced CUDA with performance of 0.5 GFLOP/watt, V2 = Fermi (2010) 64-bit foating point and performance = 2 GFLOP/W V3 = Keper (2012) - dynamic paraeism V5 = Maxwe (2014) faster Keper 3.8 TFLOP DP, 24 GB GDRAM, 2 GPU, 500 GB/s, 300 W V6 = Pasca (2016) unified memory, stacked DRAM, direct interconnect between GPU and main RAM to eiminate common botteneck V7 = Vota (2018) new = tensor cores with haf-precision math for machine earning now 7.8 TFLOP DP, 15.7 TFLOP SP, 125 TFLOP hp

8 Lots of arithmetic units sharing sma caches. Execution bocks are schedued across processing cores. GPU Architecture Large memory bandwidth but sow memory access. Fermi onwards have fu cache hierarchy.

9 Each thread runs one instance of a kerne. Threads Threads are organised into bocks (which can be up to 3D), bocks are schedued on and off the processors. Execution of a bock on the processors is caed a warp. Grids (which can be 2D) contain many bocks which are executed on the device.

10 Texture Write via CPU Aows hardware interpoation 2D Locaity of arrays Constant Memory Mode Write via CPU Sma, used for random access instructions Goba Write via CPU and GPU Shared Loca to Bock Low atency Fasted comms between threads Loca Per thread ony memory Registers Thread ony, Fastest memory avaiabe Limited space

11 Registers are oca to thread and has thread ifetime. Shared memory shared between threads in a bock and has bock ifetime Goba memory is accessibe everywhere and is persistent. Memory Scope

12 Tensor Cores One of the key operations when training a neura network or doing ML is matrix-matrix mutipication (DGEMM) Each tensor core has a 4x4x4 matrix processing array Can do 64 mixed-precision OPs/cock with haf-precision inputs and either HP or SP output

13 Programming Technoogies (I) CUDA (Compute Unified Device Architecture) Nvidia's proprietary patform. Offers both API (extensions to C/C++) and a driver eve programming. Proprietary PGI Fortran compier aows Fortran deveopment both directive and kerne modes. OpenCL (Open Computing Language) Genera patform for computing patforms (GPUs and CPUs). Impemented on a driver eve e.g. buit into MacOS Specification is manufacturer (and device) independent write once, run anywhere.

14 OpenACC Programming Technoogies (II) Open standard version of the directives approach of PGI Spec v2.0 Juy 2013 has better support for contro of data movement, caing externa functions, and separate compiation for host & device so can buid ibraries Led by Cray, nvidia, PGI and CAPS with support for Fortran, C/C++ Now in gcc and gfortran (since v6.1) OpenMP v4.0 Pushed by Inte to support Xeon Phi etc Subset of OpenACC functionaity at present In gcc/gfortran since v4.9.1 Supposed to incude a of OpenACC in future but difficuties with Inte vs nvidia

15 CUDA Programme Structure Seria (host) code Kerne (device) code - Grid Bocks - Threads Must remember to aocate data on both host and device Kerne executes asynchronousy

16 Starting CUDA Headers must be incuded. In this introduction ony the API wi be demonstrated. use cudafor #incude <cuda.h>; #incude <cuda_runtime.h>; C/C++ fies with CUDA kernes must have the extension.cu Fortran CUDA fies must have the extension.cuf See references at end for more compete exampes...

17 Device Kernes A Kerne is a goba subroutine (static function) which runs on the device. One instance of the kerne wi be executed by every thread which is invoked. attributes(goba) subroutine my_kerne( a, b, c) static goba void my_kerne( foat * a, foat * b, foat * c); Kernes can ony address device memory spaces. Executing kernes behaviour differs by using the thread index which uniquey identifies the thread. tx = threadidx%x + (bockidx%x * bockdim%x) int tx = threadidx.x + (bockidx.x * bockdim.x);

18 Host Code Device memory must be aocated in advance by host: rea,device,aocatabe :: Adev(:)... aocate(adev(m,n),stat=istat) static device doube * devptr = NULL;... cudamaoc((void**)&devptr, N*sizeof(doube)); Data must be copied (over the bus) to the device and if necessary copied back to the host ater. This is very sow! (Host memory can be aocated in non-paged (pinned) memory to avoid one copy operation). Adev = A! Host to device cudamemcpy(devptr, hostptr, N*sizeof(doube), cudamemcpyhosttodevice);

19 Kerne Execution ca my_kerne<<<grid,bock>>>(arg1,arg2,...) my_kerne<<<grid,bock>>>(arg1,arg2...); grid specifies the dimensions of the grid (i.e. the number of bocks aunched wi be the product of the dimensions). bock specifies the dimensions of each bock (i.e. the number of threads per bock is the product of the dimensions). dim3 derived type which has three members. type(dim3) :: grid Integer :: x=5,y=5,z=1 grid = (x,y,z) int x=5,y=5,z=1; dim3 grid(x,y,z);

20 Kerne Compiation and Run CUDA kernes must be compied with a CUDA compier (pgfortran or nvcc). At runtime the kerne wi be copied to the device on first execution note that for accurate profiing, a kerne shoud be executed at east once before timing. OpenCL kernes are usuay compied at runtime. This aows the kerne to be optimised for the running context but has a arger initia overhead.

21 Advanced Features Streaming Data can be moved across the system bus whie computation is happening. This is good for arge data structures which don't fit in device memory. Texture/Surface memory Memory can be accessed via non integer or surface coordinate! Linear interpoation can aso be performed. Fast intrinsic function Some specia functions exist which aow certain operations to be performed very quicky athough are not IEEE compiant. For exampe rsqrt reciproca square root. These can be usefu for areas of the code where precision is ess important.

22 V6 (2014+) Drop-in repacement for BLAS etc with auto offoad from CPU to GPU so free speed-up! CUBLAS BLAS ibrary - Incudes a S,D,C,Z eve 1-3 BLAS routines CUFFT FFT ibrary (free) CUDA Libraries - 1D, 2D, 3D compex and rea - Stream enabed for parae data movement & computation CUSPARSE Sparse matrix ibrary - BLAS stye routines between sparse & dense matrices CURAND Random number ibrary Now aso XT versions for muti-gpu support (non-free)

23 CUBLAS Fortran Interfacing Interfacing the CUBLAS ibrary (written in C) with the cudafortran from PGI requires an interface to be written using the C-interoperabiity part of F2003, e.g. Modue cubas Interface cuda_gemm Subroutine cuda_sgemm(cta, ctb, m, n, k apha, A, & & da, B, db,beta, c, dc) bind(c,name='cubassgemm') Use iso_c_binding character(1,c_char),vaue :: m,n,k,da,db,dc rea(c_foat),vaue :: apha,beta rea(c_foat),device,dimension(da,*) :: A rea(c_foat),device,dimension(db,*) :: B rea(c_foat),device,dimension(dc,*) :: C End subroutine cuda_sgemm End interface cuda_gemm End modue cubas

24 Fortran Exampe subroutine mmu( A, B, C ) use cudafor rea, dimension(:,:) :: A, B, C integer :: N, M, L rea, device, aocatabe, dimension(:,:) :: Adev,Bdev,Cdev type(dim3) :: dimgrid, dimbock N = size(a,1) ; M = size(a,2) ; L = size(b,2) aocate( Adev(N,M), Bdev(M,L), Cdev(N,L) ) Adev = A(1:N,1:M) ; Bdev = B(1:M,1:L) dimgrid = dim3( N/16, L/16, 1 ) dimbock = dim3( 16, 16, 1 ) ca mmu_kerne<<<dimgrid,dimbock>>>( Adev,Bdev,Cdev,N,M,L ) C(1:N,1:M) = Cdev deaocate( Adev, Bdev, Cdev ) end subroutine

25 Fortran Kerne attributes(goba) subroutine MMUL_KERNEL( A,B,C,N,M,L) rea,device :: A(N,M),B(M,L),C(N,L) integer,vaue :: N,M,L integer :: i,j,kb,k,tx,ty rea,shared :: Ab(16,16), Bb(16,16) rea :: Cij tx = threadidx%x ; ty = threadidx%y i = (bockidx%x-1) * 16 + tx ; j = (bockidx%y-1) * 16 + ty Cij = 0.0 do kb = 1, M, 16! Fetch one eement each into Ab and Bb NB 16x16 = 256! threads in this thread-bock are fetching separate eements of Ab and Bb Ab(tx,ty) = A(i,kb+ty-1) Bb(tx,ty) = B(kb+tx-1,j)! Wait unti a eements of Ab and Bb are fied ca syncthreads() do k = 1, 16 Cij = Cij + Ab(tx,k) * Bb(k,ty) enddo! Wait unti a threads in the thread-bock finish with this Ab and Bb ca syncthreads() enddo C(i,j) = Cij end subroutine

26 OpenMP v4.5 Spec finaized in Nov 2015 (C/C++/Fortran) Extensions to SIMD and TASK Array reduction now aowed Extended TARGET attributes for better acceerator performance Lots of new stuff for DEVICE Avaiabe in GNU v6.1 and Inte v18.0 V5.0 due any day now...

27 OpenMP v4.5 On CPU: #pragma omp parae for On GPU: #pragma omp target teams distribute parae for IBM XL compier with nvidia offoading (Dec 16): LULESH benchmark from OpenMP.org

28 OpenACC (I) Compier can auto-generate kernes for a section of code (e.g. mutipe oops and/or Fortran array operations) Maximum fexibiity Auto-detect dependencies:!$acc kernes fortran oop(s) to be executed on device!$acc end kernes #pragma acc kernes { c oop(s) to be executed on device }

29 OpenACC (II) Or user can assert a singe oop is dependencyfree and hence safe to be paraeized:!$acc parae oop For i=1,n!singe fortran oop!nb no matching end-parae #pragma acc parae oop For (i=0;i<n;i++) { //singe c oop to be executed on device } // NB no bock braces NB data may be copied back from host between successive parae sections but stays on host for duration of a kernes section

30 Further Reading CUDA Deveoper ZONE OpenCL Standard: OpenACC standard: OpenMP v4.5 openmp-4.5.pdf

SC13 GPU Technology Theater. Accessing New CUDA Features from CUDA Fortran Brent Leback, Compiler Manager, PGI

SC13 GPU Technology Theater. Accessing New CUDA Features from CUDA Fortran Brent Leback, Compiler Manager, PGI SC13 GPU Technology Theater Accessing New CUDA Features from CUDA Fortran Brent Leback, Compiler Manager, PGI The Case for Fortran Clear, straight-forward syntax Successful legacy in the scientific community

More information

Functions. 6.1 Modular Programming. 6.2 Defining and Calling Functions. Gaddis: 6.1-5,7-10,13,15-16 and 7.7

Functions. 6.1 Modular Programming. 6.2 Defining and Calling Functions. Gaddis: 6.1-5,7-10,13,15-16 and 7.7 Functions Unit 6 Gaddis: 6.1-5,7-10,13,15-16 and 7.7 CS 1428 Spring 2018 Ji Seaman 6.1 Moduar Programming Moduar programming: breaking a program up into smaer, manageabe components (modues) Function: a

More information

Pre-exascale Architectures: OpenPOWER Performance and Usability Assessment for French Scientific Community

Pre-exascale Architectures: OpenPOWER Performance and Usability Assessment for French Scientific Community Pre-exascae Architectures: OpenPOWER Performance and Usabiity Assessment for French Scientific Community G. Hautreux (GENCI), E. Boyer (GENCI) Technoogica watch group, GENCI-CEA-CNRS-INRIA and French Universities

More information

GPU Programming. Ringberg Theorie Seminar 2010

GPU Programming. Ringberg Theorie Seminar 2010 or How to tremendously accelerate your code? Michael Kraus, Christian Konz Max-Planck-Institut für Plasmaphysik, Garching Ringberg Theorie Seminar 2010 Introduction? GPU? GPUs can do more than just render

More information

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Lecture 4: Threads

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Lecture 4: Threads CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Lecture 4: Threads Announcement Project 0 Due Project 1 out Homework 1 due on Thursday Submit it to Gradescope onine 2 Processes Reca that

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Why Learn to Program?

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Why Learn to Program? Intro to Programming & C++ Unit 1 Sections 1.1-3 and 2.1-10, 2.12-13, 2.15-17 CS 1428 Spring 2018 Ji Seaman 1.1 Why Program? Computer programmabe machine designed to foow instructions Program a set of

More information

A Comparison of a Second-Order versus a Fourth- Order Laplacian Operator in the Multigrid Algorithm

A Comparison of a Second-Order versus a Fourth- Order Laplacian Operator in the Multigrid Algorithm A Comparison of a Second-Order versus a Fourth- Order Lapacian Operator in the Mutigrid Agorithm Kaushik Datta (kdatta@cs.berkeey.edu Math Project May 9, 003 Abstract In this paper, the mutigrid agorithm

More information

As Michi Henning and Steve Vinoski showed 1, calling a remote

As Michi Henning and Steve Vinoski showed 1, calling a remote Reducing CORBA Ca Latency by Caching and Prefetching Bernd Brügge and Christoph Vismeier Technische Universität München Method ca atency is a major probem in approaches based on object-oriented middeware

More information

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel

More information

CUDA Fortran Brent Leback The Portland Group

CUDA Fortran Brent Leback The Portland Group CUDA Fortran 2013 Brent Leback The Portland Group brent.leback@pgroup.com Why Fortran? Rich legacy in the scientific community Semantics easier to vectorize/parallelize Array descriptors Modules Fortran

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN Massively Parallel Computing with CUDA Carlos Alberto Martínez Angeles Cinvestav-IPN What is a GPU? A graphics processing unit (GPU) The term GPU was popularized by Nvidia in 1999 marketed the GeForce

More information

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Hardware Components Illustrated

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Hardware Components Illustrated Intro to Programming & C++ Unit 1 Sections 1.1-3 and 2.1-10, 2.12-13, 2.15-17 CS 1428 Fa 2017 Ji Seaman 1.1 Why Program? Computer programmabe machine designed to foow instructions Program instructions

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

Introduction to Parallel Computing with CUDA. Oswald Haan

Introduction to Parallel Computing with CUDA. Oswald Haan Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum GPU & High Performance Computing (by NVIDIA) CUDA Compute Unified Device Architecture 29.02.2008 Florian Schornbaum GPU Computing Performance In the last few years the GPU has evolved into an absolute

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

Mobile App Recommendation: Maximize the Total App Downloads

Mobile App Recommendation: Maximize the Total App Downloads Mobie App Recommendation: Maximize the Tota App Downoads Zhuohua Chen Schoo of Economics and Management Tsinghua University chenzhh3.12@sem.tsinghua.edu.cn Yinghui (Catherine) Yang Graduate Schoo of Management

More information

Nearest Neighbor Learning

Nearest Neighbor Learning Nearest Neighbor Learning Cassify based on oca simiarity Ranges from simpe nearest neighbor to case-based and anaogica reasoning Use oca information near the current query instance to decide the cassification

More information

Practical Introduction to CUDA and GPU

Practical Introduction to CUDA and GPU Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing

More information

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016 OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators

More information

file://j:\macmillancomputerpublishing\chapters\in073.html 3/22/01

file://j:\macmillancomputerpublishing\chapters\in073.html 3/22/01 Page 1 of 15 Chapter 9 Chapter 9: Deveoping the Logica Data Mode The information requirements and business rues provide the information to produce the entities, attributes, and reationships in ogica mode.

More information

Introduction to OpenMP

Introduction to OpenMP MPSoC Architectures OpenMP Aberto Bosio, Associate Professor UM Microeectronic Departement bosio@irmm.fr Introduction to OpenMP What is OpenMP? Open specification for Muti-Processing Standard API for defining

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

A Memory Grouping Method for Sharing Memory BIST Logic

A Memory Grouping Method for Sharing Memory BIST Logic A Memory Grouping Method for Sharing Memory BIST Logic Masahide Miyazai, Tomoazu Yoneda, and Hideo Fuiwara Graduate Schoo of Information Science, Nara Institute of Science and Technoogy (NAIST), 8916-5

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

Sample of a training manual for a software tool

Sample of a training manual for a software tool Sampe of a training manua for a software too We use FogBugz for tracking bugs discovered in RAPPID. I wrote this manua as a training too for instructing the programmers and engineers in the use of FogBugz.

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

High-Performance Computing Using GPUs

High-Performance Computing Using GPUs High-Performance Computing Using GPUs Luca Caucci caucci@email.arizona.edu Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy

More information

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include 3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI

More information

Special Edition Using Microsoft Excel Selecting and Naming Cells and Ranges

Special Edition Using Microsoft Excel Selecting and Naming Cells and Ranges Specia Edition Using Microsoft Exce 2000 - Lesson 3 - Seecting and Naming Ces and.. Page 1 of 8 [Figures are not incuded in this sampe chapter] Specia Edition Using Microsoft Exce 2000-3 - Seecting and

More information

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Scheduling

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Scheduling CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Scheduing Announcement Homework 2 due on October 25th Project 1 due on October 26th 2 CSE 120 Scheduing and Deadock Scheduing Overview In discussing

More information

An Optimizing Compiler

An Optimizing Compiler An Optimizing Compier The big difference between interpreters and compiers is that compiers have the abiity to think about how to transate a source program into target code in the most effective way. Usuay

More information

Multi-MANO interworking for the management of multi-domains networks and network slicing Functionality & Demos

Multi-MANO interworking for the management of multi-domains networks and network slicing Functionality & Demos Muti-MANO interworking for the management of muti-domains networks and network sicing Functionaity & Demos Acknowedgement & Open Source Soutions NECOS project: NFVi Sicing http://aurabaea.com/necos/ SONATA

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

Introducing a Target-Based Approach to Rapid Prototyping of ECUs

Introducing a Target-Based Approach to Rapid Prototyping of ECUs Introducing a Target-Based Approach to Rapid Prototyping of ECUs FEBRUARY, 1997 Abstract This paper presents a target-based approach to Rapid Prototyping of Eectronic Contro Units (ECUs). With this approach,

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth

More information

CSC573: TSHA Introduction to Accelerators

CSC573: TSHA Introduction to Accelerators CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures

More information

MCSE Training Guide: Windows Architecture and Memory

MCSE Training Guide: Windows Architecture and Memory MCSE Training Guide: Windows 95 -- Ch 2 -- Architecture and Memory Page 1 of 13 MCSE Training Guide: Windows 95-2 - Architecture and Memory This chapter wi hep you prepare for the exam by covering the

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion.

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion. Lecture outine 433-324 Graphics and Interaction Scan Converting Poygons and Lines Department of Computer Science and Software Engineering The Introduction Scan conversion Scan-ine agorithm Edge coherence

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

Overview: Graphics Processing Units

Overview: Graphics Processing Units advent of GPUs GPU architecture Overview: Graphics Processing Units the NVIDIA Fermi processor the CUDA programming model simple example, threads organization, memory model case study: matrix multiply

More information

Straight-line code (or IPO: Input-Process-Output) If/else & switch. Relational Expressions. Decisions. Sections 4.1-6, , 4.

Straight-line code (or IPO: Input-Process-Output) If/else & switch. Relational Expressions. Decisions. Sections 4.1-6, , 4. If/ese & switch Unit 3 Sections 4.1-6, 4.8-12, 4.14-15 CS 1428 Spring 2018 Ji Seaman Straight-ine code (or IPO: Input-Process-Output) So far a of our programs have foowed this basic format: Input some

More information

Computing using GPUs

Computing using GPUs Computing using GPUs Introduction, architecture overview CUDA and OpenCL programming models (recap) Programming tools: CUDA Fortran, debuggers, profilers, libraries Software and applications: Mathematica,

More information

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

GPU Implementation of Parallel SVM as Applied to Intrusion Detection System

GPU Implementation of Parallel SVM as Applied to Intrusion Detection System GPU Impementation of Parae SVM as Appied to Intrusion Detection System Sudarshan Hiray Research Schoar, Department of Computer Engineering, Vishwakarma Institute of Technoogy, Pune, India sdhiray7@gmai.com

More information

A Petrel Plugin for Surface Modeling

A Petrel Plugin for Surface Modeling A Petre Pugin for Surface Modeing R. M. Hassanpour, S. H. Derakhshan and C. V. Deutsch Structure and thickness uncertainty are important components of any uncertainty study. The exact ocations of the geoogica

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Microsoft Visual Studio 2005 Professional Tools. Advanced development tools designed for professional developers

Microsoft Visual Studio 2005 Professional Tools. Advanced development tools designed for professional developers Microsoft Visua Studio 2005 Professiona Toos Advanced deveopment toos designed for professiona deveopers If you re a professiona deveoper, Microsoft has two new ways to fue your deveopment efforts: Microsoft

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Oliver Meister November 7 th 2012 Tutorial Parallel Programming and High Performance Computing, November 7 th 2012 1 References D. Kirk, W. Hwu: Programming Massively Parallel Processors,

More information

Genetic Algorithms for Parallel Code Optimization

Genetic Algorithms for Parallel Code Optimization Genetic Agorithms for Parae Code Optimization Ender Özcan Dept. of Computer Engineering Yeditepe University Kayışdağı, İstanbu, Turkey Emai: eozcan@cse.yeditepe.edu.tr Abstract- Determining the optimum

More information

3.1 The cin Object. Expressions & I/O. Console Input. Example program using cin. Unit 2. Sections 2.14, , 5.1, CS 1428 Spring 2018

3.1 The cin Object. Expressions & I/O. Console Input. Example program using cin. Unit 2. Sections 2.14, , 5.1, CS 1428 Spring 2018 Expressions & I/O Unit 2 Sections 2.14, 3.1-10, 5.1, 5.11 CS 1428 Spring 2018 Ji Seaman 1 3.1 The cin Object cin: short for consoe input a stream object: represents the contents of the screen that are

More information

On-Chip CNN Accelerator for Image Super-Resolution

On-Chip CNN Accelerator for Image Super-Resolution On-Chip CNN Acceerator for Image Super-Resoution Jung-Woo Chang and Suk-Ju Kang Dept. of Eectronic Engineering, Sogang University, Seou, South Korea {zwzang91, sjkang}@sogang.ac.kr ABSTRACT To impement

More information

VSC Users Day 2018 Start to GPU Ehsan Moravveji

VSC Users Day 2018 Start to GPU Ehsan Moravveji Outline A brief intro Available GPUs at VSC GPU architecture Benchmarking tests General Purpose GPU Programming Models VSC Users Day 2018 Start to GPU Ehsan Moravveji Image courtesy of Nvidia.com Generally

More information

JOINT IMAGE REGISTRATION AND EXAMPLE-BASED SUPER-RESOLUTION ALGORITHM

JOINT IMAGE REGISTRATION AND EXAMPLE-BASED SUPER-RESOLUTION ALGORITHM JOINT IMAGE REGISTRATION AND AMPLE-BASED SUPER-RESOLUTION ALGORITHM Hyo-Song Kim, Jeyong Shin, and Rae-Hong Park Department of Eectronic Engineering, Schoo of Engineering, Sogang University 35 Baekbeom-ro,

More information

Lecture 11: GPU programming

Lecture 11: GPU programming Lecture 11: GPU programming David Bindel 4 Oct 2011 Logistics Matrix multiply results are ready Summary on assignments page My version (and writeup) on CMS HW 2 due Thursday Still working on project 2!

More information

An Introduc+on to OpenACC Part II

An Introduc+on to OpenACC Part II An Introduc+on to OpenACC Part II Wei Feinstein HPC User Services@LSU LONI Parallel Programming Workshop 2015 Louisiana State University 4 th HPC Parallel Programming Workshop An Introduc+on to OpenACC-

More information

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh.

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh. GPGPU Alan Gray/James Perry EPCC The University of Edinburgh a.gray@ed.ac.uk Contents Introduction GPU Technology Programming GPUs GPU Performance Optimisation 2 Introduction 3 Introduction Central Processing

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Introduction to USB Development

Introduction to USB Development Introduction to USB Deveopment Introduction Technica Overview USB in Embedded Systems Recent Deveopments Extensions to USB USB as compared to other technoogies USB: Universa Seria Bus A seria bus standard

More information

ACCELERATING HPC APPLICATIONS ON NVIDIA GPUS WITH OPENACC

ACCELERATING HPC APPLICATIONS ON NVIDIA GPUS WITH OPENACC ACCELERATING HPC APPLICATIONS ON NVIDIA GPUS WITH OPENACC Doug Miles, PGI Compilers & Tools, NVIDIA High Performance Computing Advisory Council February 21, 2018 PGI THE NVIDIA HPC SDK Fortran, C & C++

More information

Performance Diagnosis for Hybrid CPU/GPU Environments

Performance Diagnosis for Hybrid CPU/GPU Environments Performance Diagnosis for Hybrid CPU/GPU Environments Michael M. Smith and Karen L. Karavanic Computer Science Department Portland State University Performance Diagnosis for Hybrid CPU/GPU Environments

More information

A case study of performance portability with OpenMP 4.5

A case study of performance portability with OpenMP 4.5 A case study of performance portability with OpenMP 4.5 Rahul Gayatri, Charlene Yang, Thorsten Kurth, Jack Deslippe NERSC pre-print copy 1 Outline General Plasmon Pole (GPP) application from BerkeleyGW

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

Intel Architecture: Features & Futures

Intel Architecture: Features & Futures Inte Architecture: Features & Futures For Servers & Workstations Stephen L. Smith Corporate Vice President, Microprocessor Products Group Genera Manager, Santa Cara Processor Division Inte Corporation

More information

Chapter 3: Introduction to the Flash Workspace

Chapter 3: Introduction to the Flash Workspace Chapter 3: Introduction to the Fash Workspace Page 1 of 10 Chapter 3: Introduction to the Fash Workspace In This Chapter Features and Functionaity of the Timeine Features and Functionaity of the Stage

More information

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code

More information

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Advanced Memory Management

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Advanced Memory Management CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Advanced Memory Management Advanced Functionaity Now we re going to ook at some advanced functionaity that the OS can provide appications using

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

An Introduction to Design Patterns

An Introduction to Design Patterns An Introduction to Design Patterns 1 Definitions A pattern is a recurring soution to a standard probem, in a context. Christopher Aexander, a professor of architecture Why woud what a prof of architecture

More information

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

GpuWrapper: A Portable API for Heterogeneous Programming at CGG GpuWrapper: A Portable API for Heterogeneous Programming at CGG Victor Arslan, Jean-Yves Blanc, Gina Sitaraman, Marc Tchiboukdjian, Guillaume Thomas-Collignon March 2 nd, 2016 GpuWrapper: Objectives &

More information

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA

More information

Advanced OpenMP Features

Advanced OpenMP Features Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Vectorization 2 Vectorization SIMD =

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm Outine Parae Numerica Agorithms Chapter 8 Prof. Michae T. Heath Department of Computer Science University of Iinois at Urbana-Champaign CS 554 / CSE 512 1 2 3 4 Trianguar Matrices Michae T. Heath Parae

More information

CSE120 Principles of Operating Systems. Architecture Support for OS

CSE120 Principles of Operating Systems. Architecture Support for OS CSE120 Principes of Operating Systems Architecture Support for OS Why are you sti here? You shoud run away from my CSE120! 2 CSE 120 Architectura Support Announcement Have you visited the web page? http://cseweb.ucsd.edu/casses/fa18/cse120-a/

More information

Register Allocation. Consider the following assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x

Register Allocation. Consider the following assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x Register Aocation Consider the foowing assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x Assume that two registers are avaiabe. Starting from the eft a compier woud generate

More information

COS 318: Operating Systems. Virtual Memory Design Issues: Paging and Caching. Jaswinder Pal Singh Computer Science Department Princeton University

COS 318: Operating Systems. Virtual Memory Design Issues: Paging and Caching. Jaswinder Pal Singh Computer Science Department Princeton University COS 318: Operating Systems Virtua Memory Design Issues: Paging and Caching Jaswinder Pa Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Virtua Memory:

More information

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 What is Cray Libsci_acc? Provide basic scientific

More information

S Comparing OpenACC 2.5 and OpenMP 4.5

S Comparing OpenACC 2.5 and OpenMP 4.5 April 4-7, 2016 Silicon Valley S6410 - Comparing OpenACC 2.5 and OpenMP 4.5 James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 April 7, 2016 History of OpenMP & OpenACC AGENDA Philosophical Differences Technical

More information

GPU ARCHITECTURE Chris Schultz, June 2017

GPU ARCHITECTURE Chris Schultz, June 2017 GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA

More information

HPCSE II. GPU programming and CUDA

HPCSE II. GPU programming and CUDA HPCSE II GPU programming and CUDA What is a GPU? Specialized for compute-intensive, highly-parallel computation, i.e. graphic output Evolution pushed by gaming industry CPU: large die area for control

More information

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Synchronization: Semaphore

CSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Synchronization: Semaphore CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Synchronization: Synchronization Needs Two synchronization needs Mutua excusion Whenever mutipe threads access a shared data, you need to worry

More information