Advanced Architectures, Master Informatics Eng., 2018/19, A.J. Proença
Data Parallelism 3 (GPU/CUDA, Neural Nets, ...) (most slides are borrowed)

The CUDA programming model
CUDA (Compute Unified Device Architecture) is a programming model designed for a multicore CPU host coupled to a many-core device, where:
- devices have wide SIMD/SIMT parallelism, and
- the host and the device do not share memory.
CUDA provides:
- a thread abstraction to deal with SIMD
- synchronization & data sharing between small groups of threads
CUDA programs are written in C with extensions.
OpenCL was inspired by CUDA but is hardware- and software-vendor neutral; its programming model is essentially identical.
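To make the "C with extensions" point concrete, here is a minimal sketch (not from the slides; names are illustrative): the __global__ qualifier marks code that runs on the device, and the <<<blocks, threads>>> syntax configures a kernel launch.

```cuda
// Minimal CUDA C sketch: __global__ marks a kernel that runs on the
// device; <<<blocks, threads>>> configures its launch from the host.
#include <cuda_runtime.h>

__global__ void scale(float *data)
{
    data[threadIdx.x] *= 2.0f;                 // each thread handles one element
}

int main(void)
{
    float h[32], *d;
    for (int i = 0; i < 32; i++) h[i] = i;
    cudaMalloc((void **)&d, sizeof h);         // host and device memories are separate,
    cudaMemcpy(d, h, sizeof h, cudaMemcpyHostToDevice);  // so data is copied explicitly
    scale<<<1, 32>>>(d);                       // 1 block of 32 threads (one warp)
    cudaMemcpy(h, d, sizeof h, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```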

CUDA Devices and Threads
A compute device:
- is a coprocessor to the CPU (the host)
- has its own DRAM (device memory)
- runs many threads in parallel
- is typically a GPU, but can also be another type of parallel processing device
Data-parallel portions of an application are expressed as device kernels, which run on many threads (SIMT).
Differences between GPU and CPU threads:
- GPU threads are extremely lightweight: very little creation overhead, but a LARGE register bank is required
- a GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
(David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign)

CUDA basic model: Single-Program Multiple-Data (SPMD)
A CUDA application is an integrated CPU + GPU C program:
- serial C code executes on the CPU
- parallel kernel C code executes on GPU thread blocks
Execution alternates between CPU code and kernel grids: KernelA<<< nblk, ntid >>>(args); launches Grid 0 on the GPU, the CPU resumes, and KernelB<<< nblk, ntid >>>(args); launches Grid 1.
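A sketch of this alternation under illustrative assumptions (the KernelA/KernelB names and launch parameters come from the slide; the kernel bodies are invented for the example):

```cuda
// SPMD pattern from the slide: serial CPU code, then KernelA runs as
// Grid 0 across nblk blocks of ntid threads, more serial CPU code,
// then KernelB runs as Grid 1. Kernel bodies are illustrative only.
__global__ void KernelA(float *x) { x[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
__global__ void KernelB(float *x) { x[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }

void run(float *d_x, int nblk, int ntid)
{
    /* ... serial CPU code ... */
    KernelA<<<nblk, ntid>>>(d_x);     // Grid 0
    /* ... serial CPU code ... */
    KernelB<<<nblk, ntid>>>(d_x);     // Grid 1
    cudaDeviceSynchronize();          // grids execute sequentially in time
}
```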

Programming Model: SPMD + SIMT/SIMD
Hierarchy: Device => Grids; Grid => Blocks; Block => Warps; Warp => Threads.
- A single kernel runs on multiple blocks (SPMD)
- Threads within a warp are executed in lock-step, a scheme called single-instruction multiple-thread (SIMT)
- A single instruction is executed on multiple threads (SIMD); the warp size defines the SIMD granularity (32 threads)
- Synchronization within a block uses shared memory
(Courtesy NVIDIA)

The Computational Grid: Block IDs and Thread IDs
A kernel runs on a computational grid of thread blocks, and all threads share global memory. Each thread uses its IDs to decide what data to work on:
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D
A thread block is a batch of threads that can cooperate by:
- synchronizing their execution with a barrier
- efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate.
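For instance, here is a hedged sketch (not from the slides) of how a thread combines its 2D block ID and 2D thread ID into the coordinates of the matrix element it owns:

```cuda
// Each thread maps its (blockIdx, threadIdx) pair to one matrix element.
__global__ void scale2d(float *m, int width, int height, float f)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)    // guard: the grid may overhang the matrix
        m[row * width + col] *= f;
}

// launch with 16x16-thread blocks covering the whole matrix:
//   dim3 threads(16, 16);
//   dim3 blocks((width + 15) / 16, (height + 15) / 16);
//   scale2d<<<blocks, threads>>>(d_m, width, height, 2.0f);
```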

Example
Multiply two vectors of length 8192:
- Code that works over all elements is the grid
- Thread blocks break this down into manageable sizes: 512 threads per block
- A SIMD instruction executes 32 elements at a time
- Thus the grid size = 16 blocks
- A block is analogous to a strip-mined vector loop with a vector length of 32
- A block is assigned to a multithreaded SIMD processor by the thread block scheduler
- Current-generation GPUs (Fermi) have 7-16 multithreaded SIMD processors
(Copyright 2012, Elsevier Inc. All rights reserved.)
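In CUDA terms this example might look as follows (a sketch; array names are illustrative): 8192 elements at 512 threads per block gives 8192 / 512 = 16 blocks in the grid.

```cuda
// Element-wise multiply of two 8192-element vectors, 512 threads/block.
__global__ void vmul(const double *a, const double *b, double *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i];
}

// host side: grid of 8192/512 = 16 blocks, each a strip of 512 elements
//   vmul<<<8192 / 512, 512>>>(d_a, d_b, d_c, 8192);
```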

Terminology (and in NVidia)
- Threads of SIMD instructions (warps): each has its own IP (up to 48/64 per SIMD processor on Fermi/Kepler)
- The thread scheduler uses a scoreboard to dispatch: no data dependencies between threads!
- Threads are organized into blocks and executed in groups of 32 threads (thread block)
- Blocks are organized into a grid; the thread block scheduler schedules blocks to SIMD processors (Streaming Multiprocessors)
- Within each SIMD processor: 32 SIMD lanes (thread processors), wide and shallow compared to vector processors

CUDA Thread Block
The programmer declares a (thread) Block:
- block size: 1 to 512 concurrent threads
- block shape: 1D, 2D, or 3D
- block dimensions given in threads
All threads in a block execute the same thread program; they share data and synchronize while doing their share of the work. Each thread has a thread id within its block, which the thread program uses to select work and address shared data, e.g. for threads 0..7:

```cuda
float x = input[threadid];
float y = func(x);
output[threadid] = y;
```
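Completing that fragment into a runnable kernel (a sketch: func is a stand-in for any per-element computation, here arbitrarily sqrtf):

```cuda
// The slide's per-thread program as a full kernel; 'func' is any
// per-element computation (sqrtf is an arbitrary stand-in).
__device__ float func(float x) { return sqrtf(x); }

__global__ void apply(const float *input, float *output)
{
    int threadid = threadIdx.x;      // id within the block, as on the slide
    float x = input[threadid];
    float y = func(x);
    output[threadid] = y;
}

// matching the slide's figure of threads 0..7: apply<<<1, 8>>>(d_in, d_out);
```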

Parallel Memory Sharing
- Local memory: per-thread; private to each thread; holds auto variables and register spill
- Shared memory: per-block; shared by the threads of the same block; used for inter-thread communication
- Global memory: per-application; shared by all threads; used for inter-grid communication (grids run sequentially in time)

CUDA Model Overview
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- read per-grid constant memory (read-only)
- read per-grid texture memory (read-only)
The host can R/W the global, constant, and texture memories.
[Figure: the device memory hierarchy, with per-thread local memory, per-block shared memory, and grid-wide global, constant, and texture memories accessible from the host.]
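These memory spaces map directly onto CUDA C declarations; a hedged sketch (identifiers are illustrative):

```cuda
// Where CUDA variables live (illustrative names throughout).
__constant__ float coeff[16];          // per-grid constant memory, read-only in kernels
__device__   float checksum;           // global memory, shared by all threads

__global__ void smooth(const float *in, float *out)    // in/out point to global memory
{
    __shared__ float tile[256];        // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];                   // 'v' lives in a per-thread register
    tile[threadIdx.x] = v * coeff[0];
    __syncthreads();                   // barrier: the block's threads sync here
    out[i] = tile[threadIdx.x];
}
```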

Hardware Implementation: Architecture
- Device memory (DRAM): slow (200~300 cycles); holds local, global, constant, and texture memory
- On-chip memory: fast (1 cycle); shared memory and the constant/texture caches
[Figure, courtesy NVIDIA: a device with N multiprocessors; each multiprocessor holds M processors, an instruction unit, shared memory, a constant cache, and a texture cache, all backed by device memory.]

Example
[Figure: Copyright 2012, Elsevier Inc. All rights reserved.]

Vector Processor versus CUDA core
[Figure: Copyright 2012, Elsevier Inc. All rights reserved.]

Conditional Branching
Like vector architectures, GPU branch hardware uses internal masks. It also uses:
- a branch synchronization stack: entries consist of masks for each SIMD lane, i.e. which threads commit their results (all threads execute)
- instruction markers to manage when a branch diverges into multiple execution paths: push on a divergent branch, and, when the paths converge, act as barriers and pop the stack
- a per-thread-lane 1-bit predicate register, specified by the programmer
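As an illustration (a sketch, not from the slides), the kernel below makes threads of the same warp take different paths; the hardware masks off lanes and serializes the two sides until they reconverge:

```cuda
// Warp divergence: lanes where the condition holds run the 'then' side
// while the others are masked off, then the roles swap for the 'else'
// side; the paths reconverge at the end of the if/else.
__global__ void clamp_or_double(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (v[i] < 0.0f)
            v[i] = 0.0f;          // taken by some lanes of the warp
        else
            v[i] *= 2.0f;         // taken by the remaining lanes
    }                             // reconvergence point (stack pop)
}
```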

Beyond Vector/SIMD architectures
Vector/SIMD-extended architectures are hybrid approaches:
- they mix (super)scalar + vector op capabilities on a single device
- a highly pipelined approach reduces the memory access penalty
- tightly-coupled access to shared memory: lower latency
Evolution of Vector/SIMD-extended architectures:
- PU (Processing Unit) cores with wider vector units
  - x86 many-core: Intel MIC / Xeon KNL (more slides later)
  - other many-core: IBM Power BlueGene/Q Compute, ShenWay 260
- coprocessors (require a host scalar processor):
  - accelerator devices on disjoint physical memories (e.g., Xeon KNC with PCI-Express, PEZY-SC)
  - ISA-free architectures, code compiled to silicon: FPGA
  - focus on SIMT/SIMD to hide memory latency: the GPU-type approach
  - focus on tensor/neural-net cores: NVidia, IBM, Intel NNP, Google TPU
- heterogeneous PUs in a SoC: multicore PUs with GPU-cores

Machine learning with neural nets & deep learning
The key algorithms to train & classify use matrix products, but require lower-precision numbers!

NVidia Volta Architecture: the new Tensor Cores
For each SM: 8 Tensor Cores x 64 FMA ops/cycle = 512 FMAs; at 2 FLOPs per FMA, that is 1k FLOPS/cycle!
http://www.nvidia.com/content/gated-pdfs/volta-architecture-whitepaper-v1.1.pdf

NVidia competitors with neural net features: IBM TrueNorth chip array (August 2014)
The TrueNorth chip has 4096 neurosynaptic cores; each core has:
- 256 inputs (axons)
- 256 outputs (neurons)
- RAM with the data for each neuron
- a router (any neuron to any axon)

NVidia competitors with neural net features: the IBM TrueNorth architecture
[Figure: the TrueNorth chip array and core architecture.]

NVidia competitors with neural net features: Intel Nervana Neural Network Processor, NNP
History: the Nervana Engine was announced in May '16. Key features:
- an ASIC chip focused on matrix multiplications, convolutions, ... (for neural nets)
- HBM2: 4x 8 GB of in-package storage & 1 TB/sec memory access bandwidth
- no hardware-managed cache hierarchy (saves die area, higher compute density)
- built-in networking (6 bi-directional high-bandwidth links)
- separate pipelines for computation and data management
- a proprietary numeric format, Flexpoint: in between floating-point and fixed-point precision
Nervana was acquired by Intel in August 2016, which renamed the project to Lake Crest and later to Nervana NNP, launched in October '17. The Loihi test chip with self-learning capabilities was announced in Sept '17, to be launched in 2018.
https://www.top500.org/news/intel-will-ship-first-neural-network-chip-this-year/

NVidia competitors with neural net features: Google Tensor Processing Unit, TPU (April '17)
[Figure: chip floor plan.]
TPUs are intensively used by Google, namely in RankBrain, StreetView & Google Translate.
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

NVidia competitors with neural net features: Google TPUv2 (September '17)

Beyond Vector/SIMD architectures (recap)
The same taxonomy as before, now expanding the last item, heterogeneous PUs in a SoC (multicore PUs with GPU-cores):
- x86 multicore coupled with SIMT/SIMD cores: Intel i5/i7
- ARMv8 cores coupled with SIMT/SIMD cores: NVidia Tegra