High Performance Computing - Introduction to GPGPU Programming. Prof Matt Probert
1 High Performance Computing - Introduction to GPGPU Programming Prof Matt Probert
2 Overview What are GPUs? GPU Architecture Memory Programming in CUDA The Future
3 What is a GPU?
A GPU is a massive vector processor:
- Hundreds of processing units
- Large memory bandwidth
- Processing units can collaborate
Some problems can be solved more efficiently on vector processors (remember Crays).
GPUs have some big advantages:
- Cheap and very fast for vector problems
- Computation is asynchronous with the CPU
- Hardware tricks such as texture filtering (zero overhead!)
Other accelerators (e.g. Xeon Phi) are available.
4 Value for Money (2016 prices)
- P100: ~9k
- Ebor: 40k
- K80: 5k
- Xeon E7: 5k
However, Ebor is a distributed data machine with 16 nodes with 16 Xeon cores and a total of 256 GB of RAM etc. These are different tools for different problems.
5 When is GPU Programming useful?
"The GPU devotes more transistors to data processing" (NVIDIA CUDA C Programming Guide)
Good at doing lots of numeric calculations simultaneously.
- This is actually required for the GPU to be efficient.
- There must be many more numeric operations than memory operations to break even.
Useful as a co-processor.
- Can offload GPU-efficient calculations while the CPU continues with the rest.
Brute forcing!
- Sometimes a brute force method is more efficient with many processors than an elegant solution on just one.
6 Floating Point Standard
NVIDIA architectures 2.x onwards (i.e. Fermi onwards) are IEEE 754 compliant.
Old NVIDIA architectures were mostly compliant.
- Generally, exceptions were handled in a non-compliant way.
- Also some mathematics, e.g. FMAD, division and sqrt, were not standard.
- Standard-compliant intrinsic functions are available but at a large computational penalty (software implemented).
7 NVIDIA Architectures
- V1 = Tesla (2008): introduced CUDA, with performance of 0.5 GFLOP/W
- V2 = Fermi (2010): 64-bit floating point and performance = 2 GFLOP/W
- V3 = Kepler (2012): dynamic parallelism
- V5 = Maxwell (2014): faster Kepler; 3.8 TFLOP DP, 24 GB GDRAM, 2 GPU, 500 GB/s, 300 W
- V6 = Pascal (2016): unified memory, stacked DRAM, direct interconnect between GPU and main RAM to eliminate a common bottleneck
- V7 = Volta (2018): new = tensor cores with half-precision math for machine learning; now 7.8 TFLOP DP, 15.7 TFLOP SP, 125 TFLOP HP
8 GPU Architecture
- Lots of arithmetic units sharing small caches.
- Execution blocks are scheduled across processing cores.
- Large memory bandwidth but slow memory access.
- Fermi onwards have a full cache hierarchy.
9 Threads
- Each thread runs one instance of a kernel.
- Threads are organised into blocks (which can be up to 3D); blocks are scheduled on and off the processors.
- Within a block, threads are executed in groups of 32 called warps.
- Grids (2D on early devices, up to 3D on compute capability 2.0 and later) contain many blocks which are executed on the device.
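The grid, block and warp hierarchy above can be made concrete with a small CUDA C sketch. It is illustrative only and not part of the original slides: the kernel name recordWarp and the helper demoThreadHierarchy are hypothetical, and the use of managed memory (cudaMallocManaged) assumes a Kepler-or-later device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread records which warp of its own block it belongs to.
__global__ void recordWarp(int *warpInBlock, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the grid
    if (tid < n)
        warpInBlock[tid] = threadIdx.x / warpSize;    // warpSize is a built-in variable
}

// Launches 2 blocks of 64 threads; returns the number of unexpected entries
// (or -1 if a CUDA call failed). Assumes a warp size of 32.
int demoThreadHierarchy()
{
    const int threadsPerBlock = 64, blocks = 2, n = threadsPerBlock * blocks;
    int *w = NULL;
    if (cudaMallocManaged((void **)&w, n * sizeof(int)) != cudaSuccess) return -1;

    recordWarp<<<blocks, threadsPerBlock>>>(w, n);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(w); return -1; }

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int expected = (i % threadsPerBlock) / 32;    // warp number inside the block
        if (w[i] != expected) ++bad;
        if (i % 32 == 0)
            printf("global thread %3d: block %d, warp %d of that block\n",
                   i, i / threadsPerBlock, w[i]);
    }
    cudaFree(w);
    return bad;
}

int main() { return demoThreadHierarchy() == 0 ? 0 : 1; }
```

With 64 threads per block, each block is split into two warps, so the four printed lines cover both warps of both blocks.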
10 Memory Model
- Texture: written via CPU; allows hardware interpolation; 2D locality of arrays
- Constant: written via CPU; small; used for random access instructions
- Global: written via CPU and GPU
- Shared: local to block; low latency; fastest comms between threads
- Local: per-thread only memory
- Registers: thread only; fastest memory available; limited space
11 Memory Scope
- Registers are local to a thread and have thread lifetime.
- Shared memory is shared between threads in a block and has block lifetime.
- Global memory is accessible everywhere and is persistent.
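As a rough illustration of the three scopes, the CUDA C sketch below sums an array block by block. It is not taken from the slides: blockSum, sumOnDevice and the block size of 256 are assumptions chosen for the example, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block; must be a power of two for this reduction

__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float buf[BLOCK];               // shared memory: one copy per block
    int tid = threadIdx.x;                     // tid, i, v live in registers (per thread)
    int i = blockIdx.x * blockDim.x + tid;
    float v = (i < n) ? in[i] : 0.0f;          // 'in' lives in global memory
    buf[tid] = v;
    __syncthreads();                           // all of buf is filled before anyone reads it

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s]; // tree reduction within the block
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0]; // result persists in global memory
}

// Sums n floats on the device; the per-block partial sums are added on the host.
float sumOnDevice(const float *h_in, int n)
{
    const int blocks = (n + BLOCK - 1) / BLOCK;
    float *d_in = NULL, *d_partial = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(d_in, d_partial, n);

    std::vector<float> h_partial(blocks);
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += h_partial[b];
    return total;
}

int main()
{
    std::vector<float> ones(1000, 1.0f);
    printf("sum = %.1f\n", sumOnDevice(ones.data(), 1000));
    return 0;
}
```

The shared buffer disappears when the block finishes, which is why each block writes its partial sum to global memory before exiting.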
12 Tensor Cores
- One of the key operations when training a neural network or doing ML is matrix-matrix multiplication (GEMM).
- Each tensor core has a 4x4x4 matrix processing array.
- Can do 64 mixed-precision ops/clock with half-precision inputs and either HP or SP output.
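The figure of 64 operations per clock follows from the 4x4x4 shape. The block below writes this out, assuming the fused multiply-accumulate form D = AB + C that NVIDIA uses to describe the operation; the symbols A, B, C and D are introduced here for illustration and do not appear on the slide.

```latex
% One tensor-core operation on 4x4 tiles (A, B half precision; C, D half or single precision)
D = A B + C, \qquad
d_{ij} = c_{ij} + \sum_{k=1}^{4} a_{ik}\, b_{kj}, \quad i, j = 1, \dots, 4

% 4 rows x 4 columns x 4 terms in each sum
4 \times 4 \times 4 = 64 \ \text{multiply-add operations per clock}
```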
13 Programming Technologies (I)
CUDA (Compute Unified Device Architecture)
- NVIDIA's proprietary platform.
- Offers both an API (extensions to C/C++) and driver-level programming.
- The proprietary PGI Fortran compiler allows Fortran development in both directive and kernel modes.
OpenCL (Open Computing Language)
- General platform for computing platforms (GPUs and CPUs).
- Implemented at driver level, e.g. built into MacOS.
- Specification is manufacturer (and device) independent: write once, run anywhere.
14 Programming Technologies (II)
OpenACC
- Open standard version of the directives approach of PGI.
- Spec v2.0 (July 2013) has better support for control of data movement, calling external functions, and separate compilation for host and device, so libraries can be built.
- Led by Cray, NVIDIA, PGI and CAPS, with support for Fortran and C/C++.
- Now in gcc and gfortran (since v6.1).
OpenMP v4.0
- Pushed by Intel to support Xeon Phi etc.
- Subset of OpenACC functionality at present.
- In gcc/gfortran since v4.9.1.
- Supposed to include all of OpenACC in future, but there are difficulties with Intel vs NVIDIA.
15 CUDA Programme Structure
- Serial (host) code
- Kernel (device) code, organised as Grid, Blocks, Threads
- Must remember to allocate data on both host and device
- Kernel executes asynchronously
16 Starting CUDA
Headers must be included. In this introduction only the API will be demonstrated.
Fortran:
    use cudafor
C:
    #include <cuda.h>
    #include <cuda_runtime.h>
C/C++ files with CUDA kernels must have the extension .cu
Fortran CUDA files must have the extension .cuf
See references at end for more complete examples...
17 Device Kernels
A kernel is a global subroutine (static function) which runs on the device. One instance of the kernel will be executed by every thread which is invoked.
Fortran:
    attributes(global) subroutine my_kernel( a, b, c )
C:
    static __global__ void my_kernel( float *a, float *b, float *c );
Kernels can only address device memory spaces.
The behaviour of each executing kernel instance differs by using the thread index, which uniquely identifies the thread. Note that the Fortran indices start at 1 and the C indices at 0.
Fortran:
    tx = threadidx%x + ((blockidx%x - 1) * blockdim%x)
C:
    int tx = threadIdx.x + (blockIdx.x * blockDim.x);
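A minimal complete kernel in CUDA C might look like the sketch below, which adds two vectors using the same thread-index formula. The names addVectors and demoAddVectors are hypothetical, and managed memory (cudaMallocManaged) is used only to keep the host side short; it assumes a Kepler-or-later device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One instance of this kernel runs in every thread; each handles one element.
__global__ void addVectors(const float *a, const float *b, float *c, int n)
{
    int tx = threadIdx.x + blockIdx.x * blockDim.x;  // unique thread index
    if (tx < n)                                      // the last block may be partly idle
        c[tx] = a[tx] + b[tx];
}

// Adds two n-element vectors on the device; returns the number of wrong results
// (or -1 if a CUDA call failed).
int demoAddVectors(int n)
{
    float *a = NULL, *b = NULL, *c = NULL;
    if (cudaMallocManaged((void **)&a, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged((void **)&b, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged((void **)&c, n * sizeof(float)) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    addVectors<<<blocks, threads>>>(a, b, c, n);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;  // kernel launch is asynchronous

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (c[i] != 3.0f * i) ++bad;

    cudaFree(a); cudaFree(b); cudaFree(c);
    return bad;
}

int main()
{
    int bad = demoAddVectors(1000);
    printf("%d incorrect elements\n", bad);
    return bad == 0 ? 0 : 1;
}
```

The bounds check matters because the number of launched threads (4 x 256 here) is usually larger than the number of elements.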
18 Host Code
Device memory must be allocated in advance by the host.
Fortran:
    real, device, allocatable :: Adev(:,:)
    ...
    allocate( Adev(m,n), stat=istat )
C:
    double *devPtr = NULL;
    ...
    cudaMalloc( (void**)&devPtr, N*sizeof(double) );
Data must be copied (over the bus) to the device and, if necessary, copied back to the host later. This is very slow! (Host memory can be allocated in non-paged (pinned) memory to avoid one copy operation.)
Fortran:
    Adev = A   ! Host to device
C:
    cudaMemcpy( devPtr, hostPtr, N*sizeof(double), cudaMemcpyHostToDevice );
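The allocate, copy, compute, copy back cycle can be sketched in CUDA C as follows. This is an illustration under stated assumptions rather than the lecture's own code: the CHECK macro, the scale kernel and demoHostDeviceCopy are hypothetical names, and the host buffer is pinned with cudaMallocHost to show the option mentioned above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e = (call);                                         \
        if (e != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                    cudaGetErrorString(e), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

__global__ void scale(double *x, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Doubles n values on the device; returns the number of wrong results.
int demoHostDeviceCopy(int n)
{
    double *hostPtr = NULL, *devPtr = NULL;
    size_t bytes = n * sizeof(double);

    CHECK(cudaMallocHost((void **)&hostPtr, bytes));   // pinned (non-paged) host memory
    for (int i = 0; i < n; ++i) hostPtr[i] = (double)i;

    CHECK(cudaMalloc((void **)&devPtr, bytes));        // device memory
    CHECK(cudaMemcpy(devPtr, hostPtr, bytes, cudaMemcpyHostToDevice));

    scale<<<(n + 255) / 256, 256>>>(devPtr, 2.0, n);
    CHECK(cudaGetLastError());                         // catches launch errors

    // A blocking copy on the default stream waits for the kernel to finish first.
    CHECK(cudaMemcpy(hostPtr, devPtr, bytes, cudaMemcpyDeviceToHost));

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hostPtr[i] != 2.0 * i) ++bad;

    CHECK(cudaFree(devPtr));
    CHECK(cudaFreeHost(hostPtr));
    return bad;
}

int main()
{
    int bad = demoHostDeviceCopy(1 << 20);
    printf("%d incorrect elements\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Replacing cudaMallocHost/cudaFreeHost with malloc/free gives the pageable-memory variant, which still works but typically transfers more slowly.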
19 Kernel Execution
Fortran:
    call my_kernel<<<grid,block>>>(arg1,arg2,...)
C:
    my_kernel<<<grid,block>>>(arg1,arg2,...);
grid specifies the dimensions of the grid (i.e. the number of blocks launched will be the product of the dimensions).
block specifies the dimensions of each block (i.e. the number of threads per block is the product of the dimensions).
dim3 is a derived type which has three members.
Fortran:
    type(dim3) :: grid
    integer :: x=5, y=5, z=1
    grid = dim3(x,y,z)
C:
    int x=5, y=5, z=1;
    dim3 grid(x,y,z);
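One common way to choose grid and block is to fix the block shape and round the grid up to cover the data. The CUDA C sketch below does this for a 2D array; fillCoords, demoLaunch2D and the 16x16 block shape are assumptions made for the example, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles one (x, y) position of a width x height array.
__global__ void fillCoords(int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // threads outside the array do nothing
        out[y * width + x] = x + y;
}

// Launches a 2D grid over the array; returns the number of wrong entries.
int demoLaunch2D(int width, int height)
{
    dim3 block(16, 16, 1);                                 // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,            // round up in x
              (height + block.y - 1) / block.y, 1);        // round up in y
    printf("grid: %u x %u blocks, %u threads per block\n",
           grid.x, grid.y, block.x * block.y * block.z);

    int *d_out = NULL;
    size_t bytes = (size_t)width * height * sizeof(int);
    cudaMalloc((void **)&d_out, bytes);
    fillCoords<<<grid, block>>>(d_out, width, height);

    std::vector<int> h_out((size_t)width * height);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    int bad = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h_out[(size_t)y * width + x] != x + y) ++bad;
    return bad;
}

int main() { return demoLaunch2D(100, 40) == 0 ? 0 : 1; }
```

For a 100 x 40 array this gives a 7 x 3 grid, so 21 blocks of 256 threads are launched and the surplus threads are masked out by the bounds check.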
20 Kernel Compilation and Run
- CUDA kernels must be compiled with a CUDA compiler (pgfortran or nvcc).
- At runtime the kernel will be copied to the device on first execution. Note that for accurate profiling, a kernel should be executed at least once before timing.
- OpenCL kernels are usually compiled at runtime. This allows the kernel to be optimised for the running context but has a larger initial overhead.
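The advice to run a kernel once before timing it can be sketched with CUDA events. The code below is an assumption-laden illustration: busyKernel and timeKernelMs are hypothetical names, the file name in the compile command is arbitrary, and the measured times depend entirely on the hardware.

```cuda
// Compile with, for example:  nvcc -o timing timing.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 0.5f;
}

// Times one launch of busyKernel using CUDA events; returns milliseconds.
float timeKernelMs(float *d_x, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                      // recorded on the default stream
    busyKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 22;
    float *d_x = NULL;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    float first  = timeKernelMs(d_x, n);   // includes any one-off start-up cost
    float second = timeKernelMs(d_x, n);   // the figure to quote

    printf("first launch:  %.3f ms\nsecond launch: %.3f ms\n", first, second);
    cudaFree(d_x);
    return 0;
}
```

In practice one would average several launches after the warm-up rather than rely on a single measurement.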
21 Advanced Features
Streaming
- Data can be moved across the system bus while computation is happening. This is good for large data structures which don't fit in device memory.
Texture/Surface memory
- Memory can be accessed via non-integer or surface coordinates! Linear interpolation can also be performed.
Fast intrinsic functions
- Some special functions exist which allow certain operations to be performed very quickly, although they are not IEEE compliant. For example rsqrt, the reciprocal square root. These can be useful for areas of the code where precision is less important.
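Streaming and a fast math function can be combined in one short CUDA C sketch: the data is processed in chunks on two streams so that copies and kernels can overlap, and the kernel uses rsqrtf. The names invSqrt and demoStreams, the chunk size and the number of streams are assumptions for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// rsqrtf is CUDA's single-precision reciprocal square root.
__global__ void invSqrt(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = rsqrtf(in[i]);
}

// Processes the data chunk by chunk on two streams so that copies for one chunk
// can overlap with computation on another. Returns the largest relative error
// against 1/sqrtf computed on the host.
float demoStreams()
{
    const int nStreams = 2, nChunks = 8, chunk = 1 << 16, n = nChunks * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_in = NULL, *h_out = NULL;
    cudaMallocHost((void **)&h_in,  n * sizeof(float));  // async copies need pinned memory
    cudaMallocHost((void **)&h_out, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f + i;

    cudaStream_t stream[nStreams];
    float *d_in[nStreams], *d_out[nStreams];             // only two chunks live on the device
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc((void **)&d_in[s],  chunkBytes);
        cudaMalloc((void **)&d_out[s], chunkBytes);
    }

    for (int c = 0; c < nChunks; ++c) {
        int s = c % nStreams, off = c * chunk;
        // Work within one stream runs in order, so reusing its buffers is safe.
        cudaMemcpyAsync(d_in[s], h_in + off, chunkBytes, cudaMemcpyHostToDevice, stream[s]);
        invSqrt<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_in[s], d_out[s], chunk);
        cudaMemcpyAsync(h_out + off, d_out[s], chunkBytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    float maxRel = 0.0f;
    for (int i = 0; i < n; ++i) {
        float ref = 1.0f / sqrtf(h_in[i]);
        maxRel = fmaxf(maxRel, fabsf(h_out[i] - ref) / ref);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(d_in[s]); cudaFree(d_out[s]); cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return maxRel;
}

int main()
{
    printf("max relative error of rsqrtf: %g\n", demoStreams());
    return 0;
}
```

Because each stream only ever holds one chunk, the same pattern works when the full array is larger than device memory.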
22 CUDA Libraries
V6 (2014+): drop-in replacement for BLAS etc. with auto offload from CPU to GPU, so a free speed-up!
CUBLAS: BLAS library
- Includes all S,D,C,Z level 1-3 BLAS routines
CUFFT: FFT library (free)
- 1D, 2D, 3D complex and real
- Stream enabled for parallel data movement and computation
CUSPARSE: sparse matrix library
- BLAS-style routines between sparse and dense matrices
CURAND: random number library
Now also XT versions for multi-GPU support (non-free)
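Calling CUBLAS directly from C or C++ is straightforward. The sketch below multiplies two 2x2 matrices with cublasSgemm through the cublas_v2.h interface; the helper name demoCublasSgemm and the test matrices are made up for the example, and it needs to be linked with -lcublas.

```cuda
// Compile with, for example:  nvcc -o gemm gemm.cu -lcublas
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Computes C = A * B for 2x2 matrices with cuBLAS; returns the number of wrong
// entries (or -1 if a cuBLAS call failed). cuBLAS uses column-major storage.
int demoCublasSgemm()
{
    const int n = 2;
    const float hA[4] = {1, 3, 2, 4};        // A = [1 2; 3 4] stored column by column
    const float hB[4] = {5, 7, 6, 8};        // B = [5 6; 7 8]
    const float expected[4] = {19, 43, 22, 50};  // A*B = [19 22; 43 50]
    float hC[4] = {0, 0, 0, 0};

    float *dA = NULL, *dB = NULL, *dC = NULL;
    cudaMalloc((void **)&dA, sizeof(hA));
    cudaMalloc((void **)&dB, sizeof(hB));
    cudaMalloc((void **)&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return -1;

    const float alpha = 1.0f, beta = 0.0f;   // C = alpha*A*B + beta*C
    cublasStatus_t st = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    if (st != CUBLAS_STATUS_SUCCESS) { cublasDestroy(handle); return -1; }

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    int bad = 0;
    for (int i = 0; i < 4; ++i)
        if (hC[i] != expected[i]) ++bad;
    printf("C = [%g %g; %g %g]\n", hC[0], hC[2], hC[1], hC[3]);
    return bad;
}

int main() { return demoCublasSgemm() == 0 ? 0 : 1; }
```

The leading dimensions passed to cublasSgemm (the three n arguments after each matrix pointer) play the same role as lda, ldb and ldc in standard BLAS.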
23 CUBLAS Fortran Interfacing
Interfacing the CUBLAS library (written in C) with CUDA Fortran from PGI requires an interface to be written using the C-interoperability part of F2003, e.g.
    module cublas
      interface cuda_gemm
        subroutine cuda_sgemm(cta, ctb, m, n, k, alpha, A, &
          & lda, B, ldb, beta, C, ldc) bind(c,name='cublasSgemm')
          use iso_c_binding
          character(1,c_char), value :: cta, ctb
          integer(c_int), value :: m, n, k, lda, ldb, ldc
          real(c_float), value :: alpha, beta
          real(c_float), device, dimension(lda,*) :: A
          real(c_float), device, dimension(ldb,*) :: B
          real(c_float), device, dimension(ldc,*) :: C
        end subroutine cuda_sgemm
      end interface cuda_gemm
    end module cublas
24 Fortran Example
    subroutine mmul( A, B, C )
      use cudafor
      real, dimension(:,:) :: A, B, C
      integer :: N, M, L
      real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
      type(dim3) :: dimGrid, dimBlock
      N = size(A,1) ; M = size(A,2) ; L = size(B,2)
      allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
      Adev = A(1:N,1:M) ; Bdev = B(1:M,1:L)
      dimGrid = dim3( N/16, L/16, 1 )
      dimBlock = dim3( 16, 16, 1 )
      call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
      C(1:N,1:L) = Cdev
      deallocate( Adev, Bdev, Cdev )
    end subroutine
25 Fortran Kernel
    attributes(global) subroutine mmul_kernel( A, B, C, N, M, L )
      real, device :: A(N,M), B(M,L), C(N,L)
      integer, value :: N, M, L
      integer :: i, j, kb, k, tx, ty
      real, shared :: Ab(16,16), Bb(16,16)
      real :: Cij
      tx = threadidx%x ; ty = threadidx%y
      i = (blockidx%x-1) * 16 + tx ; j = (blockidx%y-1) * 16 + ty
      Cij = 0.0
      do kb = 1, M, 16
        ! Fetch one element each into Ab and Bb. NB 16x16 = 256
        ! threads in this thread-block are fetching separate elements of Ab and Bb
        Ab(tx,ty) = A(i,kb+ty-1)
        Bb(tx,ty) = B(kb+tx-1,j)
        ! Wait until all elements of Ab and Bb are filled
        call syncthreads()
        do k = 1, 16
          Cij = Cij + Ab(tx,k) * Bb(k,ty)
        enddo
        ! Wait until all threads in the thread-block finish with this Ab and Bb
        call syncthreads()
      enddo
      C(i,j) = Cij
    end subroutine
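For comparison, the same tiled algorithm can be sketched in CUDA C. This is a translation made for illustration, not code from the slides: it uses 0-based indices and row-major storage, assumes (like the Fortran version) that N, M and L are multiples of 16, and the names mmulKernel and demoTiledMatmul are hypothetical. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16

// C (N x L) = A (N x M) * B (M x L), all row-major; N, M, L multiples of TILE.
__global__ void mmulKernel(const float *A, const float *B, float *C, int N, int M, int L)
{
    __shared__ float Ab[TILE][TILE];   // tile of A shared by the block
    __shared__ float Bb[TILE][TILE];   // tile of B shared by the block
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;  // row of C computed by this thread
    int col = blockIdx.x * TILE + tx;  // column of C computed by this thread
    float cij = 0.0f;

    for (int kb = 0; kb < M; kb += TILE) {
        // Each of the 256 threads fetches one element of each tile.
        Ab[ty][tx] = A[row * M + (kb + tx)];
        Bb[ty][tx] = B[(kb + ty) * L + col];
        __syncthreads();               // wait until both tiles are filled
        for (int k = 0; k < TILE; ++k)
            cij += Ab[ty][k] * Bb[k][tx];
        __syncthreads();               // wait before the tiles are overwritten
    }
    C[row * L + col] = cij;
}

// Multiplies an all-ones A by an all-twos B; every entry of C should be 2*M.
int demoTiledMatmul(int N, int M, int L)
{
    std::vector<float> hA((size_t)N * M, 1.0f), hB((size_t)M * L, 2.0f), hC((size_t)N * L);
    float *dA = NULL, *dB = NULL, *dC = NULL;
    cudaMalloc((void **)&dA, hA.size() * sizeof(float));
    cudaMalloc((void **)&dB, hB.size() * sizeof(float));
    cudaMalloc((void **)&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(TILE, TILE, 1);
    dim3 dimGrid(L / TILE, N / TILE, 1);
    mmulKernel<<<dimGrid, dimBlock>>>(dA, dB, dC, N, M, L);

    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    int bad = 0;
    for (size_t i = 0; i < hC.size(); ++i)
        if (hC[i] != 2.0f * M) ++bad;
    return bad;
}

int main()
{
    int bad = demoTiledMatmul(32, 48, 64);
    printf("%d incorrect elements\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Each element of A and B is read from global memory once per tile rather than once per multiply, which is the point of staging the tiles in shared memory.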
26 OpenMP v4.5
- Spec finalized in Nov 2015 (C/C++/Fortran)
- Extensions to SIMD and TASK
- Array reduction now allowed
- Extended TARGET attributes for better accelerator performance
- Lots of new stuff for DEVICE
- Available in GNU v6.1 and Intel v18.0
- V5.0 due any day now...
27 OpenMP v4.5
On CPU:
    #pragma omp parallel for
On GPU:
    #pragma omp target teams distribute parallel for
IBM XL compiler with NVIDIA offloading (Dec 16): LULESH benchmark from OpenMP.org
28 OpenACC (I)
The compiler can auto-generate kernels for a section of code (e.g. multiple loops and/or Fortran array operations).
- Maximum flexibility
- Auto-detects dependencies
Fortran:
    !$acc kernels
    Fortran loop(s) to be executed on device
    !$acc end kernels
C:
    #pragma acc kernels
    {
      C loop(s) to be executed on device
    }
29 OpenACC (II)
Or the user can assert that a single loop is dependency-free and hence safe to be parallelized.
Fortran:
    !$acc parallel loop
    do i=1,n
      ! single Fortran loop
    enddo
    ! NB no matching end-parallel
C:
    #pragma acc parallel loop
    for (i=0;i<n;i++) {
      // single C loop to be executed on device
    }
    // NB no block braces
NB data may be copied back to the host between successive parallel sections but stays on the device for the duration of a kernels section.
30 Further Reading
- CUDA Developer Zone
- OpenCL standard
- OpenACC standard
- OpenMP v4.5: openmp-4.5.pdf
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationIntel Architecture: Features & Futures
Inte Architecture: Features & Futures For Servers & Workstations Stephen L. Smith Corporate Vice President, Microprocessor Products Group Genera Manager, Santa Cara Processor Division Inte Corporation
More informationChapter 3: Introduction to the Flash Workspace
Chapter 3: Introduction to the Fash Workspace Page 1 of 10 Chapter 3: Introduction to the Fash Workspace In This Chapter Features and Functionaity of the Timeine Features and Functionaity of the Stage
More informationPorting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation
Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code
More informationCSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Advanced Memory Management
CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Advanced Memory Management Advanced Functionaity Now we re going to ook at some advanced functionaity that the OS can provide appications using
More informationOverview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
More informationAn Introduction to Design Patterns
An Introduction to Design Patterns 1 Definitions A pattern is a recurring soution to a standard probem, in a context. Christopher Aexander, a professor of architecture Why woud what a prof of architecture
More informationGpuWrapper: A Portable API for Heterogeneous Programming at CGG
GpuWrapper: A Portable API for Heterogeneous Programming at CGG Victor Arslan, Jean-Yves Blanc, Gina Sitaraman, Marc Tchiboukdjian, Guillaume Thomas-Collignon March 2 nd, 2016 GpuWrapper: Objectives &
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationAdvanced OpenMP Features
Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Vectorization 2 Vectorization SIMD =
More informationCUDA Architecture & Programming Model
CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New
More informationOutline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm
Outine Parae Numerica Agorithms Chapter 8 Prof. Michae T. Heath Department of Computer Science University of Iinois at Urbana-Champaign CS 554 / CSE 512 1 2 3 4 Trianguar Matrices Michae T. Heath Parae
More informationCSE120 Principles of Operating Systems. Architecture Support for OS
CSE120 Principes of Operating Systems Architecture Support for OS Why are you sti here? You shoud run away from my CSE120! 2 CSE 120 Architectura Support Announcement Have you visited the web page? http://cseweb.ucsd.edu/casses/fa18/cse120-a/
More informationRegister Allocation. Consider the following assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x
Register Aocation Consider the foowing assignment statement: x = (a*b)+((c*d)+(e*f)); In posfix notation: ab*cd*ef*++x Assume that two registers are avaiabe. Starting from the eft a compier woud generate
More informationCOS 318: Operating Systems. Virtual Memory Design Issues: Paging and Caching. Jaswinder Pal Singh Computer Science Department Princeton University
COS 318: Operating Systems Virtua Memory Design Issues: Paging and Caching Jaswinder Pa Singh Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Virtua Memory:
More informationPortable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.
Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 What is Cray Libsci_acc? Provide basic scientific
More informationS Comparing OpenACC 2.5 and OpenMP 4.5
April 4-7, 2016 Silicon Valley S6410 - Comparing OpenACC 2.5 and OpenMP 4.5 James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 April 7, 2016 History of OpenMP & OpenACC AGENDA Philosophical Differences Technical
More informationGPU ARCHITECTURE Chris Schultz, June 2017
GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA
More informationHPCSE II. GPU programming and CUDA
HPCSE II GPU programming and CUDA What is a GPU? Specialized for compute-intensive, highly-parallel computation, i.e. graphic output Evolution pushed by gaming industry CPU: large die area for control
More informationCSE120 Principles of Operating Systems. Prof Yuanyuan (YY) Zhou Synchronization: Semaphore
CSE120 Principes of Operating Systems Prof Yuanyuan (YY) Zhou Synchronization: Synchronization Needs Two synchronization needs Mutua excusion Whenever mutipe threads access a shared data, you need to worry
More information