CS516 Programming Languages and Compilers II


CS516 Programming Languages and Compilers II, Zheng Zhang, Rutgers University. Spring 2015, Jan 22: Overview and GPU Programming I.

CS516 Course Information
Staff
* Instructor: Zheng Zhang (eddy.zhengzhang@cs.rutgers.edu), website: www.cs.rutgers.edu/~zz124/, office hours: 3pm-4pm Wednesday @ Core 310
* Grader: Ari Hayes (arihayes@cs.rutgers.edu), office hours: TBD @ Core 314
Important facts
* Lectures: Thursdays noon - 3:00pm, location: Hill 005
* Course page: www.cs.rutgers.edu/~zz124/cs516_spring2015/
Basis for grades (tentative)
* In-class presentations: 20%
* Project I code: 15%
* Project II proposal: 5%
* Project II code: 30%
* Project II design review: 10%
* Project II report: 20%

CS516 Class-taking Techniques
A mix of "I talk" and "you talk"
* Not only learning, but also researching
* You think critically; there is no single correct answer
* Participating and presenting papers
No exams!!!
* Instead, we will have intensive programming projects
* Reading and presenting good paper(s)
Projects
* A common project: a warm-up parallelization project
* A research project: one of three research problems to solve; you will be assigned one based on your background and preference
* Solution to the research problem: I will suggest a couple of solutions, and you will choose the final design and implementation. You will perform a literature study and compare your solution against past work. Your outcome will be written up as a conference-style paper.

What you will learn in this class
Overview: principal topics in understanding and transforming programs at the code and behavior levels.
What are the objectives?
* Learn the key techniques in modern compiler construction (compiler engineers)
* Understand the rationale of program optimization and be able to write better code (programmers)
* Build the foundation for research in compilers, programming languages, and so on (researchers)

What is a compiler? It is a program that translates a program in one language into a program in another language. And it should improve the program, in some way. "Contrary to common belief, performance evaluation is an art." (Raj Jain, 1991)

Three-pass Compiler (the focus of this course): Front End -> IR -> Middle End -> IR -> Back End -> Machine code, with errors reported along the way.
Implications
* Use an intermediate representation (IR)
* Front end maps legal source code into IR
* Middle end performs language/architecture-independent analysis and optimization
* Back end maps IR into target machine code
* Admits multiple front ends & multiple passes
* Typically the front end runs in O(n) or O(n log n), while many middle-end and back-end problems are NP-complete

We will focus on the compilation and optimization of parallel programs. Why parallel? Why now?

Frequency Scaling Wall (figure; data from cpudb.stanford.edu)

Michael Taylor, UCSD, "Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse"

Dennard Scaling Rule: as transistors get smaller, power density can be kept constant. Example: if a transistor's linear size shrinks by 2x, the power it uses falls by 4x (with voltage and current both halving).
Power = Voltage * Current
PowerDensity = Power / Area
Frequency ∝ 1 / (device size)
Robert H. Dennard: American electrical engineer and inventor, best known for the invention of DRAM (Dynamic Random Access Memory) and the development of scaling principles for the miniaturization of MOS (Metal Oxide Semiconductor) transistors.
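Worked out explicitly, the example above is standard Dennard-scaling arithmetic (filled in here for clarity): halving the linear size quarters the area, and with voltage and current both halved,

P' = \frac{V}{2} \cdot \frac{I}{2} = \frac{P}{4}, \qquad \text{PowerDensity}' = \frac{P/4}{A/4} = \frac{P}{A},

so power density stays constant even as frequency rises with 1/(device size).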

Microprocessor Power Density. CPU Power = Capacitance x Voltage^2 x Frequency. (Figure: a projection made in 1999.)

What do we do if we can't supply enough power or cool down the chip? CPU Power = Capacitance x Voltage^2 x Frequency, so: no more frequency scaling! Multi-core scaling instead of frequency scaling.

Frequency-scaling Wall (figure; data from cpudb.stanford.edu). "We will dedicate all of our future product designs to multicore environments." (Paul S. Otellini, CEO of Intel, 2004)

What's happening now? Parallel computing is ubiquitous: the Top500 list, GPUs, FPGAs, HoloLens, Intel MIC, GreenArrays.

Implications: parallelize or... perish. How do we write *efficient* parallel programs? Writing parallel programs is hard:
* Understand performance tradeoffs in multi/many-core
* Understand the interplay between compiler, OS, and hardware
What do we do as language and compiler designers?
* Automate parallelization and optimization
* Provide efficient programming models and language support
It is a multi-facet problem involving resource allocation, scheduling, locality optimization, etc.

Topics and Schedule in CS 516 (tentative) GPU programming (our target parallel language) Parsing and modern parser generator tools (front-end) Dataflow and inter-procedure analysis (middle-end) Shared-mem program locality enhancement (middle-end, runtime support) Resource allocation and instruction/thread scheduling (back-end, runtime support) Domain-specific code generation and optimization (compiler, runtime) Your presentations will be interleaved

20 min break

GPGPU Computing: Computation + Rendering on the Same Card (source: https://www.youtube.com/watch?v=hjizognkczs)

Throughput: GPU vs. CPU (source: CUDA Programming Guide)

Memory Bandwidth: GPU vs. CPU (source: CUDA Programming Guide)

GPU Programming Languages/Libraries/Models
* Compute Unified Device Architecture (CUDA) - NVIDIA
* Open Computing Language (OpenCL)
* C++ AMP - Microsoft
* BrookGPU
* OpenGL / DirectX
* Stream (Brook+) - AMD
* ...

Programming with CUDA: The Example in C. Add vector A and vector B to vector C.

void add_vector(float *A, float *B, float *C, int N) {
    for (int index = 0; index < N; index++)
        C[index] = A[index] + B[index];
}

int main() { ... add_vector(A, B, C, N); ... }

Programming with CUDA: An Example in CUDA. Add vector A and vector B to vector C.

/* kernel function, runs on the GPU */
__global__ void add_vector(float *A, float *B, float *C, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (tid < N)
        C[tid] = A[tid] + B[tid];
}

/* main function, runs on the CPU */
int main() { ... add_vector<<<dimGrid, dimBlock>>>(A, B, C, N); ... }
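The "..." in main elides the host-side setup. Below is a minimal sketch of what it typically contains, assuming N float elements; the h_/d_ buffer names and the 256-thread block size are illustrative choices, not from the slide:

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    int N = 1 << 20;                       /* one million elements (assumption) */
    size_t bytes = N * sizeof(float);

    /* Host buffers (initialization of A and B omitted). */
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);

    /* Device buffers. */
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);

    /* Copy inputs to the GPU. */
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    /* Launch: one thread per element, rounding the grid size up. */
    int dimBlock = 256;
    int dimGrid = (N + dimBlock - 1) / dimBlock;
    add_vector<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, N);

    /* Copy the result back and release everything. */
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}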

Thread View. A kernel is executed as a grid of thread blocks. All threads share the off-chip memory space. A thread block is a batch of threads that can cooperate with each other by:
* Synchronizing their execution
* Efficiently sharing data through a low-latency shared memory (scratch-pad memory)
Two threads from two different blocks cannot cooperate through shared memory.
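As an illustration of that cooperation, here is a hedged sketch (not from the slides) of a kernel in which threads of one block share data through the scratch-pad: each thread loads one element, __syncthreads() guarantees every load has completed, and only then do threads read their neighbors' entries:

#define BLOCK_SIZE 256

/* Each output element becomes its right neighbor's input value. */
__global__ void shift_left(const float *in, float *out, int N) {
    __shared__ float tile[BLOCK_SIZE];      /* per-block scratch-pad memory */
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        tile[threadIdx.x] = in[tid];        /* cooperative load, one element per thread */
    __syncthreads();                        /* barrier: the whole block has finished loading */
    if (tid < N - 1)
        out[tid] = (threadIdx.x < BLOCK_SIZE - 1)
                 ? tile[threadIdx.x + 1]    /* neighbor's value, from shared memory */
                 : in[tid + 1];             /* block edge: neighbor is in another block */
}

Note how the last thread of each block must fall back to global memory: its neighbor lives in a different block, whose shared memory it cannot see.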

Thread View (cont.). Threads in a block are organized into warps (32 threads per warp). A (half-)warp is a SIMD thread group.

SIMD (Single Instruction Multiple Data): every thread executes the same statement, C[index] = A[index] + B[index], each on its own data and on its own SP (streaming processor):
thread 0: C[0] = A[0] + B[0];
thread 1: C[1] = A[1] + B[1];
...
thread 15: C[15] = A[15] + B[15];
All execute at the same time.
SIMT (Single Instruction Multiple Thread): simultaneous execution of many SIMD thread groups.

Thread Life Cycle in HW
* Thread blocks are serially distributed to all the streaming multiprocessors (SMs), potentially more than one thread block per SM
* The SM schedules and executes warps that are ready
* As a thread block completes, its resources are freed and more thread blocks are distributed to SMs

GPU Architecture: NVIDIA Tesla C2075. 14 streaming multiprocessors (SMs), 32 cores each @ 1.15 GHz. Each SM has 48 KB shared memory and 32768 registers.
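The figures quoted above can be read back at runtime. A small sketch using the CUDA runtime API (querying device 0 is an assumption; a machine may have several GPUs):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      /* query device 0 */
    printf("SMs:                     %d\n", prop.multiProcessorCount);
    printf("max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:     %d\n", prop.regsPerBlock);
    return 0;
}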

Blocks on an SM
* Threads are assigned to SMs at block granularity
* Limited number of blocks per SM, as resources allow
* An SM in the C2075 can host 1536 threads: 256 threads/block * 6 blocks, or 128 threads/block * 12 blocks, etc.
* Threads run concurrently: the SM assigns and maintains thread IDs, and manages and schedules thread execution

Thread Scheduling/Execution. Warps are the scheduling units in an SM. If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? (Each block contributes 256/32 = 8 warps, so 3 * 8 = 24 warps.) At any point in time, only one of those warps will be selected for instruction fetch and execution on the SM.

SM Warp Scheduling. (Figure: over time, the SM's multithreaded warp scheduler issues warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, then warp 8 instruction 12 and warp 3 instruction 96.)
* SM hardware implements zero-overhead warp scheduling
* Warps whose next instruction has its operands ready for consumption are eligible for execution
* Eligible warps are selected for execution on a prioritized scheduling (round-robin/age) policy
* All threads in a warp execute the same instruction when selected
Quiz: if 4 clock cycles are needed to dispatch the same instruction for all threads in a warp, and one global memory access is needed for every 4 instructions, a minimum of how many warps is needed to fully tolerate a 200-cycle memory latency?
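A worked answer, filled in here (the cycle counts come from the slide; the arithmetic is ours, not the instructor's official solution): between two global memory accesses a warp does 4 instructions at 4 cycles each, i.e. 16 cycles of useful work, so hiding a 200-cycle latency takes about

\left\lceil \frac{200}{16} \right\rceil = 13 \text{ warps}.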

GPU @ Rutgers
1 CUDA GPU machine: atlas.cs.rutgers.edu
* 1x Tesla K40: 12 GB memory, 2880 CUDA cores, CUDA capability 3.5
* 2x Quadro K4000: 3 GB memory, 768 CUDA cores, CUDA capability 3.0
21 CUDA GPU machines (list "A"), each with 1x GT 630: 2 GB memory, 192 CUDA cores, CUDA capability 2.1
8 CUDA GPU machines (list "B"), each with 1x GT 425M: 1 GB memory, 96 CUDA cores, CUDA capability 2.1
11 ATI GPU machines (list "C"), each with 1x FirePro V7900: 1 GB memory, OpenCL capable
List "A": adapter.cs.rutgers.edu builder.cs.rutgers.edu command.cs.rutgers.edu composite.cs.rutgers.edu decorator.cs.rutgers.edu design.cs.rutgers.edu facade.cs.rutgers.edu factory.cs.rutgers.edu flyweight.cs.rutgers.edu interpreter.cs.rutgers.edu mediator.cs.rutgers.edu null.cs.rutgers.edu patterns.cs.rutgers.edu prototype.cs.rutgers.edu singleton.cs.rutgers.edu specification.cs.rutgers.edu state.cs.rutgers.edu strategy.cs.rutgers.edu template.cs.rutgers.edu utility.cs.rutgers.edu visitor.cs.rutgers.edu
List "B": cpp.cs.rutgers.edu pascal.cs.rutgers.edu java.cs.rutgers.edu perl.cs.rutgers.edu lisp.cs.rutgers.edu basic.cs.rutgers.edu prolog.cs.rutgers.edu assembly.cs.rutgers.edu
List "C": cd.cs.rutgers.edu cp.cs.rutgers.edu grep.cs.rutgers.edu kill.cs.rutgers.edu less.cs.rutgers.edu ls.cs.rutgers.edu man.cs.rutgers.edu pwd.cs.rutgers.edu rm.cs.rutgers.edu top.cs.rutgers.edu vi.cs.rutgers.edu

CUDA Demo: the matrix/vector add example. /ilab/users/zz124/cs516_2015/samples/0_simple/vectoradd
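A hedged note on building and running the demo (only the directory above comes from the slide; the .cu filename is an assumption):

nvcc -o vectoradd vectorAdd.cu    # compile with NVIDIA's CUDA compiler
./vectoradd                       # run on a CUDA-capable machine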

Next Class: more CUDA programming examples; parallel primitives on GPUs. The first project will be posted by tomorrow.