Lecture 11: OpenCL and Altera OpenCL. James C. Hoe Department of ECE Carnegie Mellon University
|
|
- Marianna Simpson
- 5 years ago
- Views:
Transcription
1 Lecture 11: OpenCL and Altera OpenCL James C. Hoe Department of ECE Carnegie Mellon University F17 L11 S1, James C. Hoe, CMU/ECE/CALCM, 2017
2 Housekeeping Your goal today: understand Altera s interpretation of OpenCL on FPGA Notices Handout #4: lab 2, due noon, 10/6 3 weeks to project proposal Readings skim Ch 10, Reconfigurable Computing for concrete reference: Altera SDK for OpenCL: Programming Guide F17 L11 S2, James C. Hoe, CMU/ECE/CALCM, 2017
3 F17 L11 S3, James C. Hoe, CMU/ECE/CALCM, 2017 Project Proposal Planning Review Lecture 6; talk to TAs; talk to me Rolling start In class proposal presentation (worth 25%) on wk 9 presentation slides (PDF) due 10/20 noon Peer grading of proposal your peers will evaluate you too 10% them; 15% me 15 minutes + 5 minutes Q&A to convince us it is worth doing 6 wks worth of work (for a 12 unit load for 2 or 3) reasonable chance of meeting milestones
4 Required Proposal Outline 1. Introduction what is the problem and metric 2. Motivation why is it important/interesting 3. Overview of approach what do you do exactly 4. Overview of expected results 5. Weekly testable milestones 6. Risk assessment (what are the known unknowns? do you have everything you need to start? do you know how to do every step? do you know already what will happen?) F17 L11 S4, James C. Hoe, CMU/ECE/CALCM, 2017
5 F17 L11 S5, James C. Hoe, CMU/ECE/CALCM, 2017 Khronos OpenCL
6 1. Platform model Two Parts to OpenCL host (processor & memory) 1 or more accelerator devices + device side mem hierarchy: global/local/private APIs for host thread to interact with devices launch compute kernels to devices prepare (load/unload) device memory 2. Kernel programming language perfect triply nested loops no loop carried dependence F17 L11 S6, James C. Hoe, CMU/ECE/CALCM, 2017 OpenCL terms introduced in bold
7 OpenCL Platform Model host CPU global mem compute device host mem global mem compute device compute device compute device F17 L11 S7, James C. Hoe, CMU/ECE/CALCM, 2017 What are these compute devices???
8 Basic Host Program Example main ( ) {... get device handle and queue allocate memory buf objects......get kernel object... while ( ) {... initialize memory buf data bind buf objects to kernel arguments add kernel and buf objects to device queue wait for kernel to finish retrieve memory buf object for result... } What are these kernels??? F17 L11 S8, James C. Hoe, CMU/ECE/CALCM, 2017
9 What are these kernels? F17 L11 S9, James C. Hoe, CMU/ECE/CALCM, 2017 Specifically talking about OpenCL C
10 Conceptually... for (int i=0; i < R 0 ; i++) for (int j=0; j < R 1 ; j++) for (int k=0; k < R 2 ; k++) { << local variable declarations >> << arbitrary C-code with access to global memory >> } Loop body must be data parallel local variables limited to scope disallow loop carried dependencies through global memory ==> statements from different iterations can interleave in any order (using disambiguated local variables) F17 L11 S10, James C. Hoe, CMU/ECE/CALCM, 2017
11 Concretely... Only specify loop body as a kernel function kernel foo(<< pointers to global mem buf>>) { } int i=get_global_id(2), j=get (1), k=get (0); << local variable declarations >> << arbitrary C-code with access to global memory >> Triply nested loops hardwired as NDRange specified as 3 integer constants, i.e., the loop bounds (R 0, R 1, R 2 ) 1 execution of kernel function is a work item work item has private memory for local var s 1 kernel execution is R 0 xr 1 xr 2 work items F17 L11 S11, James C. Hoe, CMU/ECE/CALCM, 2017
12 Example: N by N MMM kernel mmm( global float *A, *B, *C) { int i=get_global_id(1); int j=get_global_id(0); } for(int k=0; k<n; k++) C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j]; NDRange=(N, N, 1) kernel function executed by NxNx1 work items each work item sees a different combination of dimension 0 and dimension 1 global id s no assumption about work items relative progress F17 L11 S12, James C. Hoe, CMU/ECE/CALCM, 2017
13 (For Your Reference: N by N MMM in C) float A[N][N], B[N][N], C[N][N]; for(int i=0; i<n; i++) for(int j=0; j<n; j++) { for(in k=0; k<n; k++) C[i][j]=C[i][j]+A[i][k]*B[k][j] } Note: Loop body of the inner most loop is not dataparallel dependency through C[i][j] Loop body of the second inner most loop is F17 L11 S13, James C. Hoe, CMU/ECE/CALCM, 2017
14 Work Group Partition NDRange of R 0 xr 1 xr 2 work items into 3D work groups of G 0 xg 1 xg 2 work items G 0/1/2 must divide R 0/1/2 evenly get_local_id(dimen): id within group NDRange=(9, 3, 6) get_group_id(dimen): id of group Group Size=(3, 3, 3) Group Range=(3,1,2) Work group signifies locality btwn work items execute together by a processing element can share per group local memory can synchronize by barrier() F17 L11 S14, James C. Hoe, CMU/ECE/CALCM, 2017 Why do we need this?
15 OpenCL Kernels on GPGPUs Work items is a CUDA thread Work group executes as a thread block broken down into 32 work item SIMD Warps Work groups from same and different kernels are interleaved on a Streaming Processor 128 SIMD lanes, 1 INT+1 FMA per lane, 1.73GHz 1 kernel can execute on upto 20 StrmProc s (as 1 compute device), peak 8,873 GFLOPS Global=GDDR; local=shared memory 96KB SRAM; private=register file 256KB SRAM Nvidia terms in italic bold; numbers for GTX F17 L11 S15, James C. Hoe, CMU/ECE/CALCM, 2017
16 Aside: Barrel Processor [HEP, Smith] Each cycle, select a ready thread from scheduling pool only one instruction per thread in flight at once on a long latency stall, remove the thread from scheduling Simpler and faster pipeline implementation since no data dependence, hence no stall or forwarding no penalty in making pipeline deeper warp Inst T1 A B C D E E G H Inst T2 A B C D E F G H Inst T3 A B C D E F G H Inst T4 A B C D E F G H Inst T5 A B C D E F G H Inst T6 A B C D E F G H Inst T7 A B C D E F G H Inst T8 A B C D E F G H Inst T1 A B C D E F G H F17 L11 S16, James C. Hoe, CMU/ECE/CALCM, 2017
17 To get 8,873 GFLOPS... # work items 128 x StrmProc pipeline depth x 20 Computation entirely of Fused Multiply Add Interleaved warps so no worries about RAW stalls No if then else (branch divergence) BTW: ld s and st s take up inst. issue BW of the top 320 GB/sec DRAM BW AI > 108 SP FP / float only certain access pattern can sustain peak BW SIMD ld s and st s in a warp has to go to the same memory line (memory coalescing) F17 L11 S17, James C. Hoe, CMU/ECE/CALCM, 2017
18 F17 L11 S18, James C. Hoe, CMU/ECE/CALCM, 2017 Altera s Take on OpenCL
19 OpenCL FPGA Platform Model host CPU host mem global mem (DRAM) global Mem (DRAM) global Mem (DRAM) FPGA custom Kernel custom device kernel custom device kernel device F17 L11 S19, James C. Hoe, CMU/ECE/CALCM, 2017 Compute devices synthesized from kernel functions
20 Example: N by N MM Add kernel mma( global float *A, *B, *C) { int i=get_global_id(1); int j=get_global_id(0); } C[i*N+j]=A[i*N+j]+B[i*N+j]; NDRange=(N, N, 1) Note in this example: data parallel kernel function no loop in kernel function F17 L11 S20, James C. Hoe, CMU/ECE/CALCM, 2017
21 Fully pipelined Kernel Datapath NDRange = (N,N,1) (gid0,gid1) stream =(0,1), (0,N 1),(1,0) (1,N 1) (N 1,0) (N 1,N 1) A gid0 gid1 B C addr calc load &A[i*N+j] addr calc load &B[i*N+j] addr calc &C[i*N+j] A[i*N+j] add B[i*N+j] store F17 L11 S21, James C. Hoe, CMU/ECE/CALCM, 2017 Fully unrolled loop in kernel fxn also okay
22 What about MMM? kernel mmm( global float *A, *B, *C) { int i=get_global_id(1); int j=get_global_id(0); } for(int k=0; k<n; k++) C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j]; NDRange=(N, N, 1) Can t pipeline work items like before PE pipelines the k iterations dependency on C[i*N+j] kernel function scope limits the tricks we can play F17 L11 S22, James C. Hoe, CMU/ECE/CALCM, 2017
23 Single Work Item Kernel kernel mmm( global float *A, *B, *C) { for(int i=0; j+i<n; i++) for(int j=0; j<n; +) for(in k=0; k<n; k++) C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j] NDRange=(1, 1, 1) never do this on GPU!! Arbitrary control flow (loops, if s) and dependencies Becomes just regular C to HW synthesis pipeline and parallelize loops schedule for initiation interval, resource, etc. Only want OpenCL s platform model and API; work group & work item not too meaningful F17 L11 S23, James C. Hoe, CMU/ECE/CALCM, 2017
24 Altera Kernel Kernel Channels GPU multi kernel OpenCL program kernel computes from global mem to global mem next kernel continues from last kernel s output buf producer and consumer kernels fully serialized For streaming processing on FPGA, connect kernels with streaming channels to bypass DRAM concurrent producer and consumer kernels reduce DRAM and PCIe bandwidth requirements host PCIe global mem (DRAM) global load kernel A chnl kernel B chnl krnl C store mem PCIe (DRAM) host F17 L11 S24, James C. Hoe, CMU/ECE/CALCM, 2017 Also supports OpenCL 2.0 Pipes
25 Parting Thoughts OpenCL = platform model/api + SIMD language kernel language forces regular and explicit parallelism SIMD parallelism manifest differently on GPUs vs FPGAs Altera s OpenCL = platform model/api + regular HLS single work item kernel unlocks parallelism style kernel kernel channels alleviate DRAM and PCIe bottleneck for streaming use cases develop/debug/analysis tools integral to appeal GPUs great at SIMD; FPGAs good for more than SIMD F17 L11 S25, James C. Hoe, CMU/ECE/CALCM, 2017
Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationLecture 8: Abstractions for HW. James C. Hoe Department of ECE Carnegie Mellon University
18 643 Lecture 8: Abstractions for HW James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L08 S1, James C. Hoe, CMU/ECE/CALCM, 2017 18 643 F17 L08 S2, James C. Hoe, CMU/ECE/CALCM, 2017
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationAdministrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.
Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationCUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN
CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school
More informationECE 574 Cluster Computing Lecture 15
ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements
More informationLecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs
Lecture 27: Multiprocessors Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationGPUs have enormous power that is enormously difficult to use
524 GPUs GPUs have enormous power that is enormously difficult to use Nvidia GP100-5.3TFlops of double precision This is equivalent to the fastest super computer in the world in 2001; put a single rack
More informationLecture 10: Vivado C to IP HLS. James C. Hoe Department of ECE Carnegie Mellon University
18 643 Lecture 10: Vivado C to IP HLS James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L10 S1, James C. Hoe, CMU/ECE/CALCM, 2017 Housekeeping Your goal today: learn how to tell Vivado
More informationLecture 10: Vivado C to IP HLS. Housekeeping
18 643 Lecture 10: Vivado C to IP HLS James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L10 S1, James C. Hoe, CMU/ECE/CALCM, 2017 Housekeeping Your goal today: learn how to tell Vivado
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationOpenACC Course. Office Hour #2 Q&A
OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationECE 574 Cluster Computing Lecture 17
ECE 574 Cluster Computing Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 March 2019 HW#8 (CUDA) posted. Project topics due. Announcements 1 CUDA installing On Linux
More informationLearn CUDA in an Afternoon. Alan Gray EPCC The University of Edinburgh
Learn CUDA in an Afternoon Alan Gray EPCC The University of Edinburgh Overview Introduction to CUDA Practical Exercise 1: Getting started with CUDA GPU Optimisation Practical Exercise 2: Optimising a CUDA
More informationExotic Methods in Parallel Computing [GPU Computing]
Exotic Methods in Parallel Computing [GPU Computing] Frank Feinbube Exotic Methods in Parallel Computing Dr. Peter Tröger Exotic Methods in Parallel Computing FF 2012 Architectural Shift 2 Exotic Methods
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationLecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability
Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood
More informationFrom Application to Technology OpenCL Application Processors Chung-Ho Chen
From Application to Technology OpenCL Application Processors Chung-Ho Chen Computer Architecture and System Laboratory (CASLab) Department of Electrical Engineering and Institute of Computer and Communication
More informationLecture 14: Memory Hierarchy. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 14: Memory Hierarchy James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L14 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping understand memory system
More informationEfficient Hardware Acceleration on SoC- FPGA using OpenCL
Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA
More informationSpring Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim 2 Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture) Warp Execution Inst 1 Inst 2 Inst 3 Sources ready T T T T One warp
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationAdvanced CUDA Optimizations. Umar Arshad ArrayFire
Advanced CUDA Optimizations Umar Arshad (@arshad_umar) ArrayFire (@arrayfire) ArrayFire World s leading GPU experts In the industry since 2007 NVIDIA Partner Deep experience working with thousands of customers
More informationNumerical Simulation on the GPU
Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationLecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)
Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material
More informationLecture 5. Performance programming for stencil methods Vectorization Computing with GPUs
Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationGraphics Processor Acceleration and YOU
Graphics Processor Acceleration and YOU James Phillips Research/gpu/ Goals of Lecture After this talk the audience will: Understand how GPUs differ from CPUs Understand the limits of GPU acceleration Have
More informationgem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood
gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance
More informationKernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow
Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationFundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA
Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationHigh Performance Computing and GPU Programming
High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationHardware/Software Co-Design
1 / 27 Hardware/Software Co-Design Miaoqing Huang University of Arkansas Fall 2011 2 / 27 Outline 1 2 3 3 / 27 Outline 1 2 3 CSCE 5013-002 Speical Topic in Hardware/Software Co-Design Instructor Miaoqing
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationCSE 160 Lecture 24. Graphical Processing Units
CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC
More informationParallel and Distributed Programming Introduction. Kenjiro Taura
Parallel and Distributed Programming Introduction Kenjiro Taura 1 / 21 Contents 1 Why Parallel Programming? 2 What Parallel Machines Look Like, and Where Performance Come From? 3 How to Program Parallel
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationParallel Programming Concepts. GPU Computing with OpenCL
Parallel Programming Concepts GPU Computing with OpenCL Frank Feinbube Operating Systems and Middleware Prof. Dr. Andreas Polze Agenda / Quicklinks 2 Recapitulation Motivation History of GPU Computing
More informationGPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA
GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationCOMP 322: Fundamentals of Parallel Programming. Flynn s Taxonomy for Parallel Computers
COMP 322: Fundamentals of Parallel Programming Lecture 37: General-Purpose GPU (GPGPU) Computing Max Grossman, Vivek Sarkar Department of Computer Science, Rice University max.grossman@rice.edu, vsarkar@rice.edu
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationComputer Architecture and Engineering. CS152 Quiz #5. April 23rd, Professor Krste Asanovic. Name: Answer Key
Computer Architecture and Engineering CS152 Quiz #5 April 23rd, 2009 Professor Krste Asanovic Name: Answer Key Notes: This is a closed book, closed notes exam. 80 Minutes 8 Pages Not all questions are
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION
April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics
More informationLecture 1: Welcome, why are you here? James C. Hoe Department of ECE Carnegie Mellon University
18 643 Lecture 1: Welcome, why are you here? James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L01 S1, James C. Hoe, CMU/ECE/CALCM, 2017 18 643 F17 L01 S2, James C. Hoe, CMU/ECE/CALCM,
More informationAN 831: Intel FPGA SDK for OpenCL
AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents 1 Intel FPGA SDK for OpenCL Host Pipelined Multithread...3 1.1
More informationOpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania
OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming
More informationCS377P Programming for Performance GPU Programming - II
CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline
More informationS WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018
S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationEfficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory
Institute of Computational Science Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Juraj Kardoš (University of Lugano) July 9, 2014 Juraj Kardoš Efficient GPU data transfers July 9, 2014
More informationMemory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model
Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model Sunpyo Hong Hyesoon Kim ECE School of Computer Science Georgia Institute of Technology {shong9, hyesoon}@cc.gatech.edu
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationEXAMINATION PAPER. Exam in: INF-3201 Parallel Programming Date: Wednesday December 2, 2015 Time: 09:00-13:00 Place: Åsgårdveien 9. Approved aids: None
EXAMINATION PAPER Exam in: INF-3201 Parallel Programming Date: Wednesday December 2, 2015 Time: 09:00-13:00 Place: Åsgårdveien 9 Approved aids: None The exam contains 5 pages, including this cover page.
More informationFundamental Optimizations
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationPersistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL
(stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s
More informationSupercomputing, Tutorial S03 New Orleans, Nov 14, 2010
Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access
More informationFrom Shader Code to a Teraflop: How Shader Cores Work
From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University This talk 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs NVIDIA
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More informationExploring GPU Architecture for N2P Image Processing Algorithms
Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly
More informationParallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010
Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory
More informationCS 179 Lecture 4. GPU Compute Architecture
CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at
More informationFrom Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)
From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real
More informationOptimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
More informationSimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels
SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels Ahmed Sanaullah, Chen Yang, Daniel Crawley and Martin C. Herbordt Department of Electrical and Computer Engineering, Boston University The Intel
More informationGPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh
GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA
More informationParallelization Techniques for Implementing Trellis Algorithms on Graphics Processors
1 Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors Qi Zheng*, Yajing Chen*, Ronald Dreslinski*, Chaitali Chakrabarti +, Achilleas Anastasopoulos*, Scott Mahlke*, Trevor
More information1-D Time-Domain Convolution. for (i=0; i < outputsize; i++) { y[i] = 0; for (j=0; j < kernelsize; j++) { y[i] += x[i - j] * h[j]; } }
Introduction: Convolution is a common operation in digital signal processing. In this project, you will be creating a custom circuit implemented on the Nallatech board that exploits a significant amount
More informationThe University of Texas at Austin
EE382 (20): Computer Architecture - Parallelism and Locality Lecture 4 Parallelism in Hardware Mattan Erez The University of Texas at Austin EE38(20) (c) Mattan Erez 1 Outline 2 Principles of parallel
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationAMath 483/583, Lecture 24, May 20, Notes: Notes: What s a GPU? Notes: Some GPU application areas
AMath 483/583 Lecture 24 May 20, 2011 Today: The Graphical Processing Unit (GPU) GPU Programming Today s lecture developed and presented by Grady Lemoine References: Andreas Kloeckner s High Performance
More informationA Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware
A Code Merging Optimization Technique for GPU Ryan Taylor Xiaoming Li University of Delaware FREE RIDE MAIN FINDING A GPU program can use the spare resources of another GPU program without hurting its
More informationCUDA Memories. Introduction 5/4/11
5/4/11 CUDA Memories James Gain, Michelle Kuttel, Sebastian Wyngaard, Simon Perkins and Jason Brownbridge { jgain mkuttel sperkins jbrownbr}@cs.uct.ac.za swyngaard@csir.co.za 3-6 May 2011 Introduction
More information