Lecture 6. Programming with Message Passing: Message Passing Interface (MPI). Scott B. Baden, CSE 262, Spring 2011.


Announcements

Today's lecture
- Finish CUDA
- Programming with message passing

SM Constraints
- Up to 8 resident blocks per SM, not more than 1024 threads, up to 32 warps
- All threads in a warp execute the same instruction: all branches are followed, with instructions disabled on the untaken path (divergence causes serialization)
- Grid: 1- or 2-dimensional, up to 64K-1 blocks per dimension
- Blocks: 1-, 2-, or 3-dimensional, up to 512 threads, max dimensions 512 x 512 x 64
- Registers are subdivided over threads
- Synchronization among all threads in the block; synchronization only within a block
[Figure: a device running grids of thread blocks; each SM has an MT issue unit, SPs, and shared memory; each block is a 2-D array of threads]
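As a concrete illustration of these limits, here is a minimal launch-configuration sketch; the kernel and its name are hypothetical, not from the lecture.

__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) a[i] *= 2.0f;                         // guard against running past n
}

void launch(float *d_a, int n) {
    dim3 block(256);                           // 256 threads per block, under the 512 limit
    dim3 grid((n + block.x - 1) / block.x);    // 1-D grid sized to cover n elements
    scale<<<grid, block>>>(d_a, n);
}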

Fermi
Running incrementArray on cseclass: Device 0 has 15 cores; the device is a GeForce GTX 570, capability 2.0.
Improvements:
- Caches
- Higher peak double-precision performance (on some models)
- More cores (448 to 512)
- Dual thread schedulers
- Concurrent kernel execution (some models)
- Reduced kernel launch overhead (25 µs)
- Improved predication

Fermi Block Diagram
[Figure: Fermi block diagram, NVIDIA]

Shared memory banks
- Any memory load or store of n addresses that spans n distinct memory banks can be serviced simultaneously, yielding an effective bandwidth n times that of a single bank.
- If multiple addresses of a memory request map to the same memory bank, the accesses are serialized: the hardware splits a request with bank conflicts into as many separate conflict-free requests as necessary.
- Exception: if all threads in a half warp address the same shared memory location, the result is a broadcast.
- Devices of compute capability 2.x can additionally multicast shared memory accesses.
- See the CUDA C Best Practices Guide, Version 3.2.
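A small sketch (illustrative, not from the slides) contrasting a conflict-free row access with a column access that maps a warp's threads onto the same bank:

#define TILE 32

// Assumes a 32 x 32 thread block (valid on compute capability 2.x).
__global__ void bank_demo(float *out) {
    __shared__ float tile[TILE][TILE];
    int tx = threadIdx.x, ty = threadIdx.y;

    tile[ty][tx] = (float)tx;     // consecutive threads hit consecutive banks: no conflict
    __syncthreads();

    float v = tile[tx][ty];       // stride of 32 words: a warp's threads hit one bank, serialized
    out[ty * TILE + tx] = v;
}

// Declaring the tile as tile[TILE][TILE + 1] pads each row by one word,
// shifting rows across banks and removing the column-access conflict.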

Shared memory design
- Successive 32-bit words are assigned to successive banks.
- Devices of compute capability 1.x: 16 banks; bandwidth is 32 bits per bank per clock cycle; a shared memory request for a warp is split in two; no conflict occurs if only one memory location per bank is accessed by a half warp of threads.
- Devices of compute capability 2.x [Fermi]: 32 banks; bandwidth is 32 bits per bank per 2 clock cycles; a shared memory request for a warp is not split, so there is increased susceptibility to conflicts, but no conflicts when threads access bytes within the same 32-bit word.
- Unlike 1.x, there are no bank conflicts here:
  __shared__ char shared[32];
  char data = shared[BaseIndex + tid];

Shared Memory/Cache
- An L1 cache for each vector unit; an L2 cache shared by all vector units.
- Caches accesses to local or global memory, including temporary register spills.
- Cache inclusion (is L1 contained in L2?) is partially configurable on a per-access basis with memory-reference instruction modifiers.
- On-chip memory is split between shared memory and L1: 16 KB shared memory + 48 KB L1, or 48 KB shared memory + 16 KB L1.
- 128-byte cache line size.
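The shared-memory/L1 split can be requested per kernel through the CUDA runtime; a minimal sketch, assuming a shared-memory-heavy kernel named stencil_kernel (the kernel is hypothetical):

__global__ void stencil_kernel(float *out, const float *in, int n);   // hypothetical kernel

void configure_cache(void) {
    // Request the 48 KB shared memory / 16 KB L1 split for this kernel.
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
    // cudaFuncCachePreferL1 would request 16 KB shared memory / 48 KB L1 instead.
}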

Global Memory
- 128-byte cache line size.
- If the accessed word is larger than 4 bytes, a warp's memory request is split into separate, independently issued 128-byte memory requests.
- For non-atomic concurrent writes to the same location within a warp, which thread's write lands is not defined.
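A sketch (not from the slides) of the access pattern this favors: consecutive threads touching consecutive 4-byte words, so a warp's accesses fall on as few 128-byte lines as possible.

__global__ void copy_coalesced(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];        // a warp reads 32 adjacent floats: one 128-byte line
}

__global__ void copy_strided(float *dst, const float *src, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) dst[i] = src[i];        // accesses scatter across many lines: many more transactions
}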

Shared memory on Fermi
- Successive 32-bit words are assigned to successive banks; the bandwidth of shared memory is 32 bits per bank per 2 clock cycles.
- Devices of compute capability 1.x: 16 banks; a shared memory request for a warp is split in two; no conflict occurs if only one memory location per bank is accessed by a half warp of threads.
- Devices of compute capability 2.x: 32 banks; a shared memory request for a warp is not split, so there is increased susceptibility to conflicts.

Vector units
Each vector unit has:
- 32 CUDA cores for integer and floating-point arithmetic
- 4 special function units for single-precision transcendentals
- FMA without truncation (32 or 64 bits)
For devices of compute capability 2.1:
- 48 CUDA cores for arithmetic operations
- 8 special function units for single precision
See the CUDA C Programming Guide, Version 3.2, Section 4.3.

Instruction issue
- Each vector unit has 32 CUDA cores for integer and floating-point arithmetic and 4 special function units for single-precision transcendentals.
- 2 warp schedulers; each scheduler issues 1 instruction per issue on capability 2.0 (2 on 2.1).
- One scheduler is in charge of odd warp IDs, the other of even warp IDs.
- Only 1 scheduler at a time can issue a double-precision floating-point instruction.
- A warp scheduler can issue an instruction to only half the CUDA cores, so it must issue an integer or floating-point arithmetic instruction over 2 clock cycles.

Concurrency (Host & Device)
Nonblocking operations:
- Kernel launch
- Device <-> {Host, Device} asynchronous memory copies
- Multiple kernel invocations: certain capability 2.x devices
- CUDA streams (Section 3.2.6.5), with limitations

  cudaStream_t stream[2];
  for (int i = 0; i < 2; ++i)
      cudaStreamCreate(&stream[i]);
  for (int i = 0; i < 2; ++i)
      Kernel<<<100, 512, 0, stream[i]>>>( );
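A slightly fuller sketch of the same pattern (the buffers, their sizes, and the kernel arguments are illustrative, and the host buffers are assumed to be pinned): each stream gets its own asynchronous copy and kernel launch, so the copy in one stream can overlap computation in the other on devices that support it.

void run_two_streams(float *d_in[2], float *h_in[2], int N) {
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d_in[i], h_in[i], N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);   // copy queued in stream i
        Kernel<<<100, 512, 0, stream[i]>>>(d_in[i], N);       // runs after the copy in stream i
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);                     // wait for each stream to drain
        cudaStreamDestroy(stream[i]);
    }
}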

Predication on Fermi
- All instructions support predication on compute capability 2.x.
- Each thread has a condition code or predicate set to true or false; the instruction executes only if the predicate is true.
- The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is not too large: the threshold is 7 if the compiler predicts many divergent warps, otherwise 4.

Reduction

Warp scheduling
- Blocks are divided into warps of 32 (SIMD) threads, which are subdivided into schedulable units of 16 threads (32 on Fermi).
- Warps are scheduled with zero overhead in hardware.
- The scheduler finds an eligible warp, one whose operands are all ready (scoreboarding, priorities).
- Branches serialize execution within a warp.
[Figure: an SM (MT issue unit, SPs, shared memory) interleaving instructions from different warps, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96. David Kirk/NVIDIA & Wen-mei Hwu/UIUC]

Thread Divergence
- Blocks are divided into warps of 32 threads; all the threads in a warp execute the same instruction.
- Different control paths are serialized.
- Divergence occurs when the predicate is a function of the thread ID:
  if (threadIdx.x < 2) { }
- No divergence if all threads in a warp follow the same path:
  if (threadIdx.x / WARP_SIZE < 2) { }
- Consider reduction, e.g. the summation sum_i x_i.
[Figure: interleaved warp instructions, as on the previous slide]

A naïve reduction
[Figure: tree summation of the values 3 1 7 0 4 1 6 3 ..., producing partial sums 4 7 5 9, then 11 14, ...; at each level threads 0, 2, 4, 6, 8, 10 add pairs at doubling strides (0+1, 2+3, ..., then 0..3, 4..7, 8..11, then 0..7, 8..15). David Kirk/NVIDIA & Wen-mei Hwu/UIUC]

The naïve code

__global__ void reduce(int *input, unsigned int N, int *total) {
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int x[BSIZE];
    x[tid] = (i < N) ? input[i] : 0;
    __syncthreads();
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (tid % (2 * stride) == 0)
            x[tid] += x[tid + stride];
    }
    if (tid == 0)
        atomicAdd(total, x[tid]);
}
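A sketch of how this kernel might be driven from the host; the value chosen for BSIZE, the buffer names, and the helper are assumptions, not from the slides.

#define BSIZE 256   // threads per block; must match x[]; in a real source file this precedes the kernel

void sum_on_device(int *d_input, unsigned int N, int *d_total) {
    cudaMemset(d_total, 0, sizeof(int));             // the kernel accumulates into *d_total
    unsigned int blocks = (N + BSIZE - 1) / BSIZE;   // enough blocks to cover N elements
    reduce<<<blocks, BSIZE>>>(d_input, N, d_total);
}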

Reducing divergence and avoiding bank conflicts
[Figure: threads 0..15 add elements half a block apart (0+16, 1+17, ..., 15+31), then halve the stride each step. David Kirk/NVIDIA & Wen-mei Hwu/UIUC]

The Code

unsigned int tid = threadIdx.x;
__shared__ int x[BSIZE];
for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {   // run down to s == 1 so the final pair is combined
    __syncthreads();
    if (tid < s)
        x[tid] += x[tid + s];
}
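For comparison with the naïve version, a complete-kernel sketch built around this fragment; the load, guard, and final atomicAdd are carried over from the earlier slide.

__global__ void reduce_v2(int *input, unsigned int N, int *total) {
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int x[BSIZE];
    x[tid] = (i < N) ? input[i] : 0;

    // Active threads stay contiguous (tid < s), so whole warps drop out early
    // and neighbouring threads read neighbouring banks.
    for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
        __syncthreads();
        if (tid < s)
            x[tid] += x[tid + s];
    }
    if (tid == 0)
        atomicAdd(total, x[tid]);
}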

Message Passing

Programming with Message Passing
- The primary model for implementing parallel applications.
- Programs execute as a set of P processes; we specify P when we run the program.
- Each process is initialized with the same code but has private state: SPMD = Same Program, Multiple Data.
- Each process executes instructions at its own rate, has an associated rank (a unique integer in the range 0:P-1), and may or may not be assigned a different physical processor (we usually assume it is).
- The sequence of instructions each process executes depends on its rank and on the messages it sends and receives.
- Program execution is often called bulk synchronous or loosely synchronous.
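A minimal SPMD skeleton in MPI, the interface named in the lecture title; this is a generic sketch rather than code from the slides.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int P, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &P);      // number of processes, fixed at launch
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank, 0..P-1

    // Every process runs the same code; behaviour is a function of rank.
    printf("Hello from rank %d of %d\n", rank, P);

    MPI_Finalize();
    return 0;
}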

Message Passing
- Messages are like email: to send one, we specify a destination and a message body (which can be empty).
- To receive a message we need similar information, including a receptacle to hold the incoming data.
- Requires a sender and an explicit recipient that must be aware of one another.
- Message passing performs two events: a memory-to-memory block copy, and a synchronization signal on the receiving end ("the data has arrived").
- Message buffers.

A minimal interface
- Query functions: nproc( ) = # processors; myrank( ) = this process's rank.
- Point-to-point communication is the simplest form of communication.
- Send a message to another process: Send(Object, Destination process ID)
- Receive a message from another process: Receive(Object) or Receive(Source process, Object)
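In MPI terms, this minimal point-to-point interface might look like the following sketch; the tag value and the int payload are arbitrary choices.

#include <mpi.h>

// Rank 0 sends one integer to rank 1; other ranks do nothing.
void ping(int rank) {
    int value = 42;
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1 /*dest*/, 0 /*tag*/, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0 /*source*/, 0 /*tag*/,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}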

Send and Recv
- When Send( ) returns, the message is in transit somewhere in the system; the return doesn't tell us whether the message has been received, only that it is safe to overwrite the send buffer.
- Receive( ) blocks until the message has been received; it is then safe to use the data in the buffer.

An unsafe program
- A Send( ) may or may not complete before a Recv( ) has been posted; which happens depends on the implementation.
- Some programs may therefore deadlock on certain message passing implementations.

  Both processes send first (unsafe; relies on buffering):
  Process 0: Send(x,1); Recv(y,1)
  Process 1: Send(y,0); Recv(x,0)

  Process 1 receives first (safe):
  Process 0: Send(x,1); Recv(y,1)
  Process 1: Recv(x,0); Send(y,0)
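The safe ordering, written as an MPI sketch (the buffer names are illustrative); MPI_Sendrecv packages the same exchange in a single call, which is another way to avoid the hazard.

#include <mpi.h>

// Ranks 0 and 1 exchange one double: 0 sends x and receives y, 1 does the reverse.
void exchange(int rank, double *x, double *y) {
    if (rank == 0) {
        MPI_Send(x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(y, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  // receive first
        MPI_Send(y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
}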

Buffering
- Where does the data go when you send it? It might be buffered.
- Preferable to avoid the extra copy.
[Figure: user data on Process 0 is copied to a local buffer, across the communication network, into a local buffer on Process 1, and finally into user data. E. Lusk]

Causality
- If a process sends multiple messages to the same destination, the messages will be received in the order sent.
- If different processes send messages to the same destination, the order of receipt isn't defined across processes.
[Figure: messages A, B, C from one sender arrive at the destination in the order they were sent]

Causality (continued)
If different processes send messages to the same destination:
- The order of receipt is defined from a single source.
- The order of receipt is not defined across multiple sources.
[Figure: two senders deliver A, B and X, Y to one destination; A precedes B and X precedes Y, but the interleaving of the two streams is arbitrary]

Asynchronous, non-blocking communication
- Immediate return; does not wait for completion.
- Required to express certain algorithms, and used to optimize performance (message flow problems).
- Split-phase:
  Phase 1: initiate communication with the immediate "I" variant of the point-to-point call: IRecv( ), ISend( )
  Phase 2: synchronize with Wait( )
  Perform unrelated computations between the two phases.
- Building a blocking call: Recv( ) = IRecv( ) + Wait( )
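The split-phase pattern in MPI might look like this sketch; the partner rank and the unrelated-work routine are placeholders.

#include <mpi.h>

void compute_something_else(void);   // unrelated work that does not touch buf

void split_phase(double *buf, int n, int partner) {
    MPI_Request req;

    // Phase 1: post the immediate receive; the call returns right away.
    MPI_Irecv(buf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req);

    compute_something_else();

    // Phase 2: synchronize; only after the Wait is buf safe to read.
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}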

Restrictions on non-blocking communication
- The message buffer may not be accessed between an IRecv( ) (or ISend( )) and its accompanying Wait( ).
[Figure/timeline: ISend(data, destination); the user data may pass through a local buffer; only after the Wait( ) on the ISend( ) may the data be used again.]
- Each pending IRecv( ) must have a distinct buffer.

Overlap behavior

No overlap:
  Recv(x)
  Send( )
  Compute(x)
  Compute(y)

Overlap:
  IRecv(x, req)
  Send( )
  Compute(y)
  Wait(req)
  Compute(x)

A message buffer may not be accessed between an IRecv( ) (or ISend( )) and its accompanying Wait( ).
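The overlap column written as an MPI sketch; the neighbor rank, the tag, and the compute routine are placeholders.

#include <mpi.h>

void compute_on(double *a, int n);   // placeholder for the Compute( ) steps

void step(double *x, double *y, int n, int neighbor) {
    MPI_Request req;
    MPI_Irecv(x, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);   // post receive early
    MPI_Send(y, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD);

    compute_on(y, n);                    // work that does not need x overlaps the transfer

    MPI_Wait(&req, MPI_STATUS_IGNORE);   // x may not be touched before this completes
    compute_on(x, n);
}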