Multi-Core/Many-Core Architectures and Programming. Prof. Huiyang Zhou


ST: CDA 6938 Multi-Core/Many-Core Architectures and Programming
http://csl.cs.ucf.edu/courses/cda6938/
Prof. Huiyang Zhou
School of Electrical Engineering and Computer Science, University of Central Florida

Outline
- Administration
- Motivation: why multi-core/many-core processors? Why GPGPU?
- CPU vs. GPU
- An overview of Nvidia G80 and CUDA

Description (Syllabus)
- High performance computing on multi-core/many-core architectures
- Focus: data-level parallelism and thread-level parallelism; how to express them in various programming models; architectural features with high impact on performance
- Prerequisites: CDA5106 (Advanced Computer Architecture I) and C programming

Description (cont.)
- Textbook: no required textbook; four optional ones, plus papers and notes
- Tentative grading policy (a +/- policy will be used):
  - Homework: 25%
  - In-class presentation: 10%
  - Participation in discussion: 5% (not applicable to FEEDS students)
  - Project: 60% (including another in-class presentation)
  - A: 90~100, B+: 85~90, B: 80~85, B-: 75~80

Who Am I
- Assistant Professor at the School of EECS, UCF
- Research areas: computer architecture, back-end compilers, embedded systems
- High-performance, power/energy-efficient, fault-tolerant microarchitectures; multi-core/many-core architectures (e.g., GPGPU); architectural support for software debugging; architectural support for information security

Topics
- Introduction to multi-core/many-core architecture
- Introduction to multi-core/many-core programming
- NVidia GPU architectures and the programming model for GPGPU (CUDA)
- AMD/ATI GPU architectures and the programming model for GPGPU (CTM or Brook+) (4 guest lectures from AMD)
- IBM Cell BE architecture and its programming model
- CPU/GPU trade-offs
- Stream processors
- Vector processors
- Data-level parallelism and the associated programming patterns
- Thread-level parallelism and the associated programming patterns
- Future multi-core/many-core architectures
- Future programming support for multi-core/many-core processors

Assignments
- Homework #0: "Hello world!" on an emulator of Nvidia G80 processors
- Homework #1: a small program, parallel reduction, on both an emulator and the actual graphics processor (a minimal kernel sketch follows below)
- Homework #2: matrix multiplication
- Homework #3: prefix-sum computation
- Presentation: an in-depth presentation based on research papers (either on a particular processor or on GPGPU in general)
- Project: select one processor model from the Nvidia G80, ATI streaming processors, and IBM Cell processors; select (or find your own) an application; try to improve its performance using the processor you selected; cross-platform comparison

Lab: HEC 238
- For experiments, get the keys to HEC 238 from:
  - Martin Dimitrov (in charge of the experimental environment for AMD/ATI GPUs), dimitrov@cs.ucf.edu
  - Hongliang Gao (in charge of the experimental environment for IBM Cell processors), hgao@cs.ucf.edu
  - Jingfei Kong (in charge of the experimental environment for Nvidia G80 GPUs), jfkong@cs.ucf.edu
- Schedule the time within the group
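
To give a flavor of Homework #1, here is a minimal sketch of a block-level parallel reduction in CUDA, the programming model introduced later in these slides. The kernel name, the 256-thread block size, and the strategy of summing the per-block partials afterward are illustrative assumptions, not the required solution.

```cuda
// Each block reduces a 256-element chunk of the input in shared memory
// and writes one partial sum; the partials are then summed on the host
// or by re-launching the kernel on the partial-sum array.
__global__ void reduce(const float *in, float *out, int n)
{
    __shared__ float sdata[256];   // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread (0 if past the end of the array).
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
```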

Acknowledgement
Some material, including lecture notes, is based on the lecture notes of the following courses:
- Programming Massively Parallel Processors (UIUC)
- Multicore Programming Primer: Learn and Compete Programming for the PS3 Cell Processor (MIT)
- Multicore and GPU Programming for Video Games (GaTech)

Computer Science at a Crossroads (D. Patterson)
- Old conventional wisdom: uniprocessor performance doubles every 1.5 years
- New conventional wisdom: Power Wall + ILP Wall + Memory Wall = Brick Wall; uniprocessor performance now doubles every 5(?) years
- Sea change in chip design: multiple cores (2X processors per chip every ~2 years); more, simpler processors are more power efficient
- "The free (performance) lunch is over": a fundamental turn toward concurrency in software; "The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency" (Herb Sutter)

Problems with the Sea Change
- Algorithms, programming languages, compilers, operating systems, architectures, libraries, etc. are not ready to supply thread-level or data-level parallelism for 1000 CPUs per chip, and architectures are not ready for 1000 CPUs per chip
- Unlike instruction-level parallelism, this cannot be solved by computer architects and compiler writers alone, but neither can it be solved without their participation
- Modern GPUs already run hundreds or thousands of threads per chip
- The shift from instruction-level parallelism to thread-level / data-level parallelism: GPGPU is one such example

GPU at a Glance
- Originally designed for graphics applications
- Trend: converging the different functions into a programmable model
- To suit graphics applications: high memory bandwidth, 86.4 GB/s (GPU) vs. 8.4 GB/s (CPU); high FP processing power, 400~500 GFLOPS (GPU) vs. 30~40 GFLOPS (CPU)
- Can we utilize this processing power to perform computing besides graphics? GPGPU

GPU vs. CPU
[Die photos: G80 (90 nm technology) vs. IBM Power 6 (65 nm technology)]

IBM Power 6
- Outstanding features: 4.7 GHz; 2 cores with symmetric multiprocessing (SMP) support; 8 MB L2 cache

Power 5 Die
[Die photo: inside the CPU core (covered in CDA5106)]

NVidia G80
- Some outstanding features: 16 highly threaded SMs, >128 FPUs
- Shared memory per SM: 16 KB
- Constant memory: 64 KB
[Block diagram: host, input assembler, and thread execution manager feed an array of parallel data caches with texture units and load/store paths to global memory]

GPU vs. CPU
- The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about)
- So, more transistors can be devoted to data processing rather than data caching and flow control
[Diagram: a CPU devotes a large fraction of its die to control logic, caches, and DRAM interface; a GPU devotes most of its die to ALUs]

GPU vs. CPU (cont.)
- CPU: all of this on-chip real estate is used to achieve performance improvements that are transparent to software developers; sequential programming model; moving toward multi-core and many-core
- GPU: more on-chip resources are used for floating-point computation; requires a data-parallel programming model; exposes architectural features to software developers, and software needs to explicitly take advantage of those features to achieve high performance

Things to Know for a GPU Processor
- Execution model: how the threads are executed and how to synchronize threads; how the instructions in each thread (and across multiple threads) are executed
- Memory model: how the memory is organized; speed and size considerations for the different types of memories; shared or private memory, and, if shared, how to ensure memory ordering
- Control-flow handling
- Instruction set architecture
- Support and programming environment: compiler, debugger, emulator, etc.

HW and SW Support for GPGPU
- Nvidia GeForce 8800 GTX vs. GeForce 7800 (slides from the Nvidia talk given at Stanford University)
- Programming models (candidates for course presentations): CUDA, Brook+, PeakStream, RapidMind

GeForce-8 Series HW Overview
[Diagram: a Streaming Processor Array of TPCs; each Texture Processor Cluster contains two SMs plus a TEX unit; each Streaming Multiprocessor contains an instruction L1, a data L1, instruction fetch/dispatch, shared memory, SPs, and two SFUs]

CUDA Processor Terminology
- SPA: Streaming Processor Array (size varies across the GeForce 8 series; 8 TPCs in the GeForce 8800)
- TPC: Texture Processor Cluster (2 SMs + TEX)
- SM: Streaming Multiprocessor (8 SPs); a multi-threaded processor core; the fundamental processing unit for a CUDA thread block
- SP: Streaming Processor; a scalar ALU for a single CUDA thread

Streaming Multiprocessor (SM)
- 8 Streaming Processors (SPs)
- 2 Super Function Units (SFUs)
- Multi-threaded instruction dispatch: 1 to 768 threads active; shared instruction fetch per 32 threads; covers the latency of texture/memory loads
- 20+ GFLOPS
- 16 KB shared memory
- DRAM texture and memory access
[Diagram: SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs]

G80 Computing Pipeline
- The future of GPUs is programmable processing, so build the architecture around the processor
- Processors execute computing threads: an alternative operating mode specifically for computing
- Host, input assembler, and thread execution manager (which generates grids based on kernel calls) replace the graphics front end
[Diagram: the graphics stages (vertex, geometry, and pixel issue; setup/raster/ZCull) give way to parallel data caches with texture L1s, load/store units, L2s, and the frame-buffer partitions acting as global memory]

CUDA: Compute Unified Device Architecture
- A general-purpose programming model: the user kicks off batches of threads on the GPU; the GPU becomes a dedicated, super-threaded, massively data-parallel coprocessor
- Targeted software stack: compute-oriented drivers, language, and tools
- Driver for loading computation programs onto the GPU: a standalone driver optimized for computation; an interface designed for compute (a graphics-free API); data sharing with OpenGL buffer objects; guaranteed maximum download and readback speeds; explicit GPU memory management

Extended C
- Declspecs: __global__, __device__, __shared__, __local__, __constant__
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads()
- Runtime API: memory, symbol, and execution management
- Function launch, as in the slide's sketch:

```cuda
__device__ float filter[N];

__global__ void convolve(float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory (the slide's shorthand; the real cudaMalloc
// takes a pointer-to-pointer and returns a status code)
void *myimage = cudaMalloc(bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
```
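
The slide's sketch elides the surrounding program. Below is a complete, minimal CUDA program (not from the slides) showing the same host-side flow end to end: allocate device memory, copy input over, launch a kernel, and copy the result back. The kernel and variable names are made up for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
          *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2 * i; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);

    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // 4 blocks of 256 threads cover all 1024 elements.
    vecAdd<<<4, 256>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f\n", hc[100]);  // expect 300.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```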

Extended C: Compilation Flow
- The integrated source (foo.cu) is processed by cudacc (an EDG C/C++ frontend)
- Device path: Open64 global optimizer → GPU assembly (foo.s) → OCG → G80 SASS (foo.sass)
- Host path: CPU host code (foo.cpp) → gcc / cl

CUDA Programming Model: A Highly Multithreaded Coprocessor
- The GPU is viewed as a compute device that: is a coprocessor to the CPU, or host; has its own DRAM (device memory); runs many threads in parallel
- Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
- Differences between GPU and CPU threads: GPU threads are extremely lightweight, with very little creation overhead; the GPU needs 1000s of threads for full efficiency, whereas a multi-core CPU needs only a few

Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks; all threads in a grid share the data memory space
- A thread block is a batch of threads that can cooperate with each other by: synchronizing their execution, for hazard-free shared memory accesses; efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, and each block a 2D array of threads. Courtesy: NVIDIA]

Thread and Block IDs
- Threads and blocks have IDs, so each thread can decide what data to work on (see the indexing sketch below)
- Block ID: 1D or 2D; thread ID: 1D, 2D, or 3D
- This simplifies memory addressing when processing multidimensional data, e.g., image processing or solving PDEs on volumes
[Diagram: a 2D grid of blocks, with 2D thread coordinates inside each block. Courtesy: NVIDIA]
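
As an illustration (not from the slides; the kernel and names are made up), here is how a thread typically derives the matrix element it owns from 2D block and thread IDs:

```cuda
// Scale a width x height matrix stored in row-major order.
__global__ void scale2D(float *m, int width, int height, float k)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x runs across columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y runs across rows

    if (row < height && col < width)
        m[row * width + col] *= k;  // row-major 2D addressing
}

// Host-side launch: 16x16 threads per block, with enough blocks to tile
// the whole matrix (the +15 rounds the division up).
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale2D<<<grid, block>>>(d_m, width, height, 2.0f);
```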

Programming Model: Memory Spaces
- Each thread can: read/write per-thread registers; read/write per-thread local memory; read/write per-block shared memory; read/write per-grid global memory; read per-grid constant memory (read-only); read per-grid texture memory (read-only)
- The host can read/write global, constant, and texture memory
[Diagram: each thread has its own registers and local memory; each block has a shared memory; the whole grid shares global, constant, and texture memory, all reachable from the host]

Memory Access Times
- Registers: dedicated HW, single cycle
- Shared memory: dedicated HW, single cycle
- Local memory: DRAM, no cache, *slow*
- Global memory: DRAM, no cache, *slow*
- Constant memory: DRAM, cached; 1, 10s, or 100s of cycles, depending on cache locality
- Texture memory: DRAM, cached; 1, 10s, or 100s of cycles, depending on cache locality
- Instruction memory (invisible to the programmer): DRAM, cached
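
A brief sketch (not from the slides; TILE, coeff, and smooth are made-up names) of how these memory spaces appear in source code, and why the speed differences matter: each global-memory element is loaded once into fast shared memory and then reused by neighboring threads.

```cuda
#define TILE 256

__constant__ float coeff[3];   // per-grid, read-only, cached; set from the
                               // host with cudaMemcpyToSymbol

__global__ void smooth(const float *in, float *out, int n)  // in/out: global
{
    __shared__ float tile[TILE];          // per-block, on-chip, single cycle

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // held in a register

    if (i < n)
        tile[threadIdx.x] = in[i];        // one slow global load per thread
    __syncthreads();                      // wait until the whole tile is in

    // Reuse neighbors from fast shared memory instead of re-reading global
    // memory (interior threads only, to keep the sketch short).
    if (i < n && threadIdx.x > 0 && threadIdx.x < TILE - 1)
        out[i] = coeff[0] * tile[threadIdx.x - 1]
               + coeff[1] * tile[threadIdx.x]
               + coeff[2] * tile[threadIdx.x + 1];
}
```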

Compilation
- Any source file containing CUDA language extensions must be compiled with nvcc
- nvcc is a compiler driver: it works by invoking all the necessary tools and compilers, such as cudacc, g++, cl, ...
- nvcc can output either C code, which must then be compiled with the rest of the application using another tool, or object code directly

Linking
- Any executable with CUDA code requires two dynamic libraries: the CUDA runtime library (cudart) and the CUDA core library (cuda)
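
As a concrete illustration (not from the slides), a typical build might look like the following; exact flags depend on the installed CUDA toolkit version, and -deviceemu is the emulation switch discussed on the next slide.

```sh
# Compile and link a CUDA source file in one step; nvcc invokes the
# device-side tools and the host compiler (g++ or cl) behind the scenes.
nvcc -o convolve convolve.cu

# Separate compile and link; the executable links against cudart.
nvcc -c convolve.cu -o convolve.o
nvcc convolve.o -o convolve -lcudart

# Device emulation build (see the next slide).
nvcc -deviceemu -o convolve_emu convolve.cu
```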

Debugging Using the Device Emulation Mode
- An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime: no device or CUDA driver is needed, and each device thread is emulated with a host thread
- When running in device emulation mode, one can: use the host's native debug support (breakpoints, inspection, etc.); access any device-specific data from host code and vice versa; call any host function from device code (e.g., printf) and vice versa; detect deadlock situations caused by improper usage of __syncthreads()

Device Emulation Mode Pitfalls
- Emulated device threads execute sequentially, so simultaneous accesses to the same memory location by multiple threads can produce different results on real hardware (a concrete example follows below)
- Dereferencing device pointers on the host, or host pointers on the device, can produce correct results in device emulation mode but will generate an error in device execution mode
- The results of floating-point computations will differ slightly because of: different compiler outputs and instruction sets; the use of extended precision for intermediate results (there are various options to force strict single precision on the host)
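
To make the first pitfall concrete, here is a small kernel (not from the slides; the names are made up) whose result differs between emulation and hardware:

```cuda
// Every thread writes the same location: a data race. The emulator runs
// the 256 emulated threads sequentially, so result[0] ends up holding a
// predictable value (typically the last thread's ID, 255). On real
// hardware the writes race, and whichever write lands last wins.
__global__ void racy(int *result)
{
    result[0] = threadIdx.x;
}

// Launch: racy<<<1, 256>>>(d_result);
```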