Lecture 1: Introduction and Basics

CS 515 Programming Languages and Compilers I, Lecture 1: Introduction and Basics. Zheng (Eddy) Zhang, Rutgers University, Fall 2017 (9/5/2017).

Class Information
Instructor: Zheng (Eddy) Zhang
Email: eddyzhengzhang@gmail.com
Office: CoRE 310
Office Hour: 1pm - 2pm, Wednesdays
Lectures (including 2-3 guest lectures from industry)
- Time: Tuesday, Noon - 3pm
- Location: TIL-253 LIV (tentative)
Course Webpage: Coming Soon
Grade Basis
- In-class presentation: 10%
- Three projects: 60%
- Five homework assignments: 30%

Topics Covered in Class
Compiler Construction
- Lexical and Syntax Analysis
- Data Scopes and Intermediate Representation
- Instruction Scheduling and Resource Scheduling
Program Behavior Analysis
- Data Flow Analysis
- Control Flow Analysis
- Memory and Cache Locality Optimization
- Correctness, Safety and Liveness
Parallelization
- Task Decomposition
- Task Scheduling
- Parallel Algorithm Primitives
- Static and Dynamic Performance Tuning

What is a programming language? From the user's point of view, a language is a way of thinking and a way of expressing algorithms. From the implementor's point of view, a language is an abstraction of a virtual machine: a way of specifying what you want the hardware to do without getting down into the bits.

What is a compiler? It is a program that translates a program in one language into a program in another language. "It was our belief that if FORTRAN, during its first months, were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger. I believe that had we failed to produce efficient programs, the widespread use of languages like FORTRAN would have been seriously delayed." - John Backus

Three-Pass Compiler: Program -> Front End -> IR -> Middle End -> IR -> Back End -> Machine code (errors may be reported at any stage). A small illustration follows this list.
Front End
- Maps legal source code to an intermediate representation (IR)
- Allows multiple front ends
- Typically the front-end problem is O(n) or O(n log n)
Middle End
- Performs language/architecture-independent analysis and optimization
- Maps from one IR to one IR, or to multiple IRs
- Allows multiple passes
Back End
- Maps the IR into target machine code
- Allows multiple back ends
- Back-end problems are typically NP-complete; however, practical solutions exist
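
As a concrete illustration of the three stages (a hypothetical sketch, not the output of any particular compiler), consider what each pass might produce for a one-line C function:

/* Source program -- what the front end parses and lowers into an IR: */
int scale(int x) { return 4 * x + 3; }

/* A possible three-address IR produced by the front end:
 *     t1 = 4 * x
 *     t2 = t1 + 3
 *     return t2
 * The middle end might strength-reduce the multiply:
 *     t1 = x << 2
 *     t2 = t1 + 3
 *     return t2
 * The back end then maps the IR onto target registers and instructions,
 * e.g. (illustrative pseudo-assembly):
 *     shl r1, r0, 2
 *     add r1, r1, 3
 *     ret
 */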

Phases of Compilation. (Figure source: Programming Language Pragmatics, 2nd Edition, Michael Scott.)

Compilation versus Interpretation. Pure compilation: the compiler translates the high-level source program into an equivalent target program (typically in machine language), and then goes away. Pure interpretation: the interpreter stays around for the execution of the program and is the locus of control during execution. Compilation and interpretation are not opposites, and the distinction is not clear-cut.

What You Will Learn From This Class. In addition to fundamental topics in program compilation and analysis: how to write efficient parallel programs; how to perform performance debugging; and how to read, review, and present research papers. You are encouraged to combine your own research interests/ideas/projects with the parallel programming techniques learned in this class. This offering of CS 515 has an emphasis on the principles and practice of parallel programming.

Why parallel programming? Frequency Scaling Wall - frequency scaling has stalled since around 2004. Power Wall - it is difficult to maintain the same power density as transistors get smaller. ILP Wall - instruction-level parallelism has already been heavily exploited, and sophisticated multi-stage pipelines lead to design complexity. Source: cpudb.stanford.edu

Why parallel programming? And more dark silicon papers: Esmaeilzadeh, Hadi, et al. "Dark silicon and the end of multicore scaling." 38th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2011. Pedram, Ardavan, et al. "Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era." 2016. Taylor, Michael, UCSD. "Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse."

Dennard Scaling & Moore's Law. Dennard scaling: as transistors get smaller, the power density can be kept constant (Robert Dennard). Example: if a transistor's linear size is reduced by 2, the power it uses can fall by 8 (with voltage and current both halving) if frequency remains the same. Dynamic power is proportional to C * F * V^2, where C = capacitance, F = frequency, V = voltage. Dennard, Robert H., et al. "Design of ion-implanted MOSFET's with very small physical dimensions." IEEE Journal of Solid-State Circuits 9.5 (1974): 256-268. Moore's Law: the number of transistors in a dense integrated circuit doubles approximately every two years (Gordon Moore). Moore, Gordon E. "Cramming more components onto integrated circuits." Electronics Magazine, p. 4 (1965).
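
Spelling out the slide's factor-of-8 example as a worked calculation (assuming capacitance scales linearly with feature size, as the slide does):

\[
P_{\text{dyn}} \propto C\,F\,V^2, \qquad
C \to \tfrac{C}{2},\; V \to \tfrac{V}{2},\; F \to F
\;\Longrightarrow\;
P_{\text{dyn}}' \propto \tfrac{C}{2}\cdot F \cdot \left(\tfrac{V}{2}\right)^2 = \tfrac{1}{8}\,C\,F\,V^2 .
\]

Under full Dennard scaling the frequency also doubles, giving a factor-of-4 drop in power that matches the factor-of-4 drop in transistor area, which is why the power density stays constant.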

Why are Moore's Law and Dennard scaling both coming to an end? 1. Frequency has scaled faster than transistor size. 2. Voltage stopped scaling in the 80s-90s. 3. Power density can't remain the same: the threshold voltage cannot keep dropping without the leakage current growing. Dynamic power is proportional to C * F * V^2, where C = capacitance, F = frequency, V = voltage. (Figure: a projection made in 1999.)

What to do with more and more transistors? 1. Multi-core scaling. 2. Thread-level parallelism. 3. Cache- and memory-level parallelism. (Figure: ITRS projections for gate lengths (nm) from the 2005, 2008, 2011, and 2013 editions.)

Parallel Computing is Ubiquitous - from miniature devices to supercomputers.
- GPU: supercomputers, desktops, mobile devices, self-driving cars
- Raspberry Pi: basic computing science education in developing countries, home automation, ...
- Intel Edison: wearable devices, Internet of Things (IoT), etc.

Challenges in Parallel Computing. Writing a parallel program is not easy:
- Task decomposition
- Dependence analysis
- Load balancing
- Resource allocation
- Not to mention writing an efficient parallel program: interactions with the algorithm, with the software/hardware interface, and with the microarchitecture
(Figure: Yale Patt's levels of transformation - Problem, Algorithm, Program, SW/HW Interface, Microarchitecture, Circuits, Electrons.)
Language and programming system support:
- Heavily optimized for sequential programs or limited-parallelism programs over the past few decades
- Exploring the best parallel programming and execution model is an ongoing research topic
Source: Yale Patt, "Requirements, Bottlenecks, and Good Fortune: Agents for Microprocessor Evolution," Proceedings of the IEEE, 2001.


GPU Computing

GPU Programming. GPUs:
- A good example of multi-core/many-core scaling
- Widely used to accelerate large-scale applications: machine learning, scientific simulation, weather prediction, computer vision, bioinformatics, etc.
- We will use GPUs as our major programming platform to learn how to compile, develop, and optimize parallel programs
All three projects will be implemented using GPU programming language(s).

GPU Throughput vs. CPU Throughput. Source: CUDA Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/

GPU vs. CPU Memory Throughput. Source: CUDA Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/

GPU Programming Languages/Models
- CUDA (Compute Unified Device Architecture) - NVIDIA
- OpenCL (Open Computing Language)
- C++ AMP - from Microsoft
- BrookGPU
- OpenGL / DirectX
- Stream (Brook+) - AMD


Programming in CUDA. Let's look at a sequential program in C first: add vector A and vector B to vector C.

void add_vector(float *A, float *B, float *C, int N) {
    for (int index = 0; index < N; index++)
        C[index] = A[index] + B[index];
}

int main() {
    /* ... declare and initialize A, B, C, N ... */
    add_vector(A, B, C, N);
}

Programming in CUDA. Now let's look at the parallel implementation in CUDA: add vector A and vector B to vector C.

__global__ void add_vector(float *A, float *B, float *C, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (tid < N)
        C[tid] = A[tid] + B[tid];
}

int main() {
    /* ... allocate A, B, C on the device and copy the inputs over ... */
    add_vector<<<dimgrid, dimblock>>>(A, B, C, N);
}

Full example: /ilab/users/zz124/cs515_2017/samples/0_simple/vectoradd

What is the difference? Both programs add vector A and vector B to vector C.

CPU program:
void add_vector(float *A, float *B, float *C, int N) {
    for (int index = 0; index < N; index++)
        C[index] = A[index] + B[index];
}
int main() {
    add_vector(A, B, C, N);
}

GPU program:
__global__ void add_vector(float *A, float *B, float *C, int N) {
    if (tid < N)
        C[tid] = A[tid] + B[tid];
}
int main() {
    add_vector<<<dimgrid, dimblock>>>(A, B, C, N);
}

The loop over elements is replaced by one thread per element, and the call site specifies the launch configuration (<<<dimgrid, dimblock>>>). A fuller host-side sketch follows.
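
For completeness, here is a minimal host-side sketch of what the slides' main() elides (memory allocation, transfers, and the launch configuration). The problem size and block size below are illustrative assumptions, not part of the original slide.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add_vector(float *A, float *B, float *C, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) C[tid] = A[tid] + B[tid];
}

int main(void) {
    const int N = 1 << 20;                      /* illustrative problem size */
    size_t bytes = N * sizeof(float);

    /* host buffers */
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    /* device buffers and host-to-device copies */
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    /* launch configuration: one thread per element */
    int threadsPerBlock = 256;                  /* assumed block size */
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    add_vector<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, N);

    /* copy the result back and spot-check it */
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);               /* expect 3.0 */

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}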

Thread View. A kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by:
- Synchronizing their execution
- Efficiently sharing data through a low-latency shared memory (scratch-pad memory)
- Two threads from two different blocks cannot cooperate through shared memory
Concurrent kernel execution is possible:
- Multiple kernels can run simultaneously
- Computation can also overlap memory transfers
Threads in a block are organized into warps (32 threads/warp). A warp is a SIMD thread group. A small shared-memory sketch follows.
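
As a hypothetical illustration (not from the slides) of how threads in one block cooperate, the kernel below stages data in __shared__ memory and uses the __syncthreads() barrier before reading an element loaded by a neighboring thread:

__global__ void shift_sum(const float *in, float *out, int N) {
    __shared__ float tile[256];                 /* assumes blockDim.x == 256 */
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < N) tile[threadIdx.x] = in[tid];   /* each thread loads one element */
    __syncthreads();                            /* barrier: the whole tile is now visible */

    if (tid < N) {
        /* read the element loaded by the neighboring thread in the same block */
        float right = (threadIdx.x + 1 < blockDim.x && tid + 1 < N)
                          ? tile[threadIdx.x + 1] : 0.0f;
        out[tid] = tile[threadIdx.x] + right;
    }
}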

Single Instruction Multiple Data (SIMD). C[index] = A[index] + B[index]; becomes: thread 0: C[0] = A[0] + B[0]; thread 1: C[1] = A[1] + B[1]; ... thread 15: C[15] = A[15] + B[15]; all executed at the same time. Flynn's taxonomy for modern processors: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), Multiple Instruction Multiple Data (MIMD). GPUs use SIMT (Single Instruction Multiple Thread): simultaneous execution of many SIMD thread groups.

An Example of GPU Architecture. NVIDIA GeForce 8800 GTX:
- 128 streaming processors (SPs), each at 575 MHz
- Each streaming multiprocessor (SM) has 16 KB of shared memory

Thread Life Cycle in Hardware. Thread blocks are distributed to all the SMs, typically more than 1 block per SM. Each streaming multiprocessor schedules and executes the warps that are ready. As a thread block completes, its resources are freed and more thread blocks are distributed to the SMs.

Thread Blocks on an SM. Threads are assigned to SMs at block granularity:
- The number of blocks per SM is limited by resource constraints
- An SM in the C2075 can host 1536 threads: 256 threads/block * 6 blocks, or 128 threads/block * 12 blocks, etc.
Threads run concurrently: the SM assigns and maintains thread IDs, and the SM manages/schedules thread execution.

Thread Scheduling and Execution. Warps are the scheduling units within an SM. If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there? At any point in time, only a limited number of those warps will be selected for execution on one SM.
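
A quick worked answer to the warp-count question, using the 32-thread warp size defined earlier:

\[
\frac{3 \text{ blocks} \times 256 \text{ threads/block}}{32 \text{ threads/warp}} = 24 \text{ warps}.
\]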

SM Warp Scheduling. SM hardware implements zero-overhead warp scheduling:
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution using a prioritized scheduling policy (round-robin/age)
- All threads in a warp execute the same instruction when it is selected
Typically 4 clock cycles are needed to dispatch the same instruction for all threads in a warp (on G80, each SM has 8 scalar processors, so a 32-thread warp takes 32/8 = 4 cycles).
(Figure: the SM's multithreaded warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96.)

SM Warp Scheduling. SM hardware implements zero-overhead warp scheduling. Question: Assume one global memory access is needed for every 4 instructions, only one instruction can be issued at a time, and it takes 4 cycles to dispatch an instruction for one warp. What is the minimum number of warps needed to tolerate the 320-cycle global memory latency?

Review: SM Warp Scheduling. (Figure: warps 1, 2, 3, ... overlapping execution to hide the 320-cycle global memory latency.) Question: Assume one global memory access is needed for every 4 instructions, only one instruction can be issued at a time, and it takes 4 cycles to dispatch an instruction for one warp. What is the minimum number of warps needed to tolerate the 320-cycle global memory latency?
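
One way to work the numbers under the stated assumptions (a sketch, not necessarily the intended accounting): between two global memory accesses a warp issues 4 instructions at 4 cycles each, i.e. 16 cycles of useful issue, so covering the 320-cycle latency requires roughly

\[
\frac{320 \text{ cycles}}{4 \text{ instructions} \times 4 \text{ cycles/instruction}} = 20 \text{ warps},
\]

give or take one depending on whether the warp that is waiting on memory is itself counted.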

GPU Computing Resources at ilab. 1 multi-GPU machine: "atlas.cs.rutgers.edu"
- 1x Tesla K40: 12 GB memory, 2880 CUDA cores, CUDA capability 3.5
- 2x Quadro K4000: 3 GB memory, 768 CUDA cores, CUDA capability 3.0
20 single-GPU machines (listed on the right)
- Each has one GT 630: 2 GB memory, 192 CUDA cores, CUDA capability 3.0
Check http://report.cs.rutgers.edu/mrtg/systems/ilab.html for the load and availability of different GPU machines.

Suggested Reading: CUDA Programming Guide, Chapter 1 to Chapter 5, https://docs.nvidia.com/cuda/cuda-c-programming-guide/