
Lecture 27: Multiprocessors

Today's topics:
- Shared memory vs. message-passing
- Simultaneous multi-threading (SMT)
- GPUs

Shared-Memory vs. Message-Passing

Shared-memory:
- Well-understood programming model
- Communication is implicit and the hardware handles protection
- Hardware-controlled caching

Message-passing:
- No cache coherence, hence simpler hardware
- Explicit communication makes it easier for the programmer to restructure code
- Software-controlled caching
- The sender can initiate the data transfer

Ocean Kernel

procedure Solve(A)
begin
  diff = done = 0;
  while (!done) do
    diff = 0;
    for i ← 1 to n do
      for j ← 1 to n do
        temp = A[i,j];
        A[i,j] ← 0.2 * (A[i,j] + neighbors);
        diff += abs(A[i,j] - temp);
      end for
    end for
    if (diff < TOL) then done = 1;
  end while
end procedure

(Figure: the grid, with Row 1, Row k, Row 2k, and Row 3k marked as the boundaries of the blocks of rows assigned to successive processes.)

Shared Address Space Model

int n, nprocs;
float **A, diff;
LOCKDEC(diff_lock);
BARDEC(bar1);

main()
begin
  read(n); read(nprocs);
  A ← G_MALLOC();
  initialize(A);
  CREATE(nprocs, Solve, A);
  WAIT_FOR_END(nprocs);
end main

procedure Solve(A)
  int i, j, pid, done = 0;
  float temp, mydiff = 0;
  int mymin = 1 + (pid * n/nprocs);
  int mymax = mymin + n/nprocs - 1;
  while (!done) do
    mydiff = diff = 0;
    BARRIER(bar1, nprocs);
    for i ← mymin to mymax do
      for j ← 1 to n do
        ...   (grid update as in the sequential kernel, accumulating into mydiff)
      endfor
    endfor
    LOCK(diff_lock);
    diff += mydiff;
    UNLOCK(diff_lock);
    BARRIER(bar1, nprocs);
    if (diff < TOL) then done = 1;
    BARRIER(bar1, nprocs);
  endwhile

Message Passing Model

main()
  read(n); read(nprocs);
  CREATE(nprocs-1, Solve);
  Solve();
  WAIT_FOR_END(nprocs-1);

procedure Solve()
  int i, j, pid, nn = n/nprocs, done = 0;
  float temp, tempdiff, mydiff = 0;
  myA ← malloc( );
  initialize(myA);
  while (!done) do
    mydiff = 0;
    if (pid != 0)        SEND(&myA[1,0], n, pid-1, ROW);
    if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
    if (pid != 0)        RECEIVE(&myA[0,0], n, pid-1, ROW);
    if (pid != nprocs-1) RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
    for i ← 1 to nn do
      for j ← 1 to n do
        ...   (grid update on the local rows, accumulating into mydiff)
      endfor
    endfor
    if (pid != 0)
      SEND(mydiff, 1, 0, DIFF);
      RECEIVE(done, 1, 0, DONE);
    else
      for i ← 1 to nprocs-1 do
        RECEIVE(tempdiff, 1, *, DIFF);
        mydiff += tempdiff;
      endfor
      if (mydiff < TOL) done = 1;
      for i ← 1 to nprocs-1 do
        SEND(done, 1, i, DONE);
      endfor
    endif
  endwhile

Multithreading Within a Processor

- Until now, we have executed multiple threads of an application on different processors. Can multiple threads execute concurrently on the same processor?
- Why is this desirable?
  - It is inexpensive: one CPU, no external interconnects
  - No remote or coherence misses (though more capacity misses)
- Why does this make sense?
  - Most processors can't find enough work: peak IPC is 6, while average IPC is 1.5!
  - Threads can share resources, so we can increase the number of threads without a corresponding linear increase in area

How are Resources Shared?

(Figure: issue-slot occupancy over cycles for a superscalar processor, fine-grained multithreading, and simultaneous multithreading; each box represents an issue slot for a functional unit, filled by one of Threads 1-4 or left idle. Peak throughput is 4 IPC.)

- The superscalar processor has high under-utilization: it cannot find enough work every cycle, especially when there is a cache miss.
- Fine-grained multithreading can only issue instructions from a single thread in a given cycle; it cannot find the maximum amount of work every cycle, but cache misses can be tolerated.
- Simultaneous multithreading can issue instructions from any thread every cycle, so it has the highest probability of finding work for every issue slot.

Performance Implications of SMT

- Single-thread performance is likely to go down because caches, branch predictors, registers, etc. are shared; this effect can be mitigated by trying to prioritize one thread.
- With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x.

SIMD Processors

- Single instruction, multiple data
- Such processors offer energy efficiency because a single instruction fetch can trigger many data operations
- Such data parallelism may be useful for many image/sound and numerical applications

GPUs

- Initially developed as graphics accelerators; now viewed as one of the densest compute engines available
- Many on-going efforts aim to run non-graphics workloads on GPUs, i.e., to use them as general-purpose GPUs (GPGPUs)
- C/C++-based programming platforms enable wider use of GPGPUs: CUDA from NVIDIA and OpenCL from an industry consortium
- A heterogeneous system has a regular host CPU and a GPU that handles (say) the CUDA code; the two can be on the same chip

The GPU Architecture

- SIMT: single instruction, multiple thread. A GPU has many SIMT cores.
- A large data-parallel operation is partitioned into many thread blocks (one per SIMT core); a thread block is partitioned into many warps (one warp running at a time in the SIMT core); a warp is partitioned across many in-order pipelines (each called a SIMD lane).
- A SIMT core can have multiple active warps at a time, i.e., it stores the registers for each warp; warps can be context-switched at low cost; a warp scheduler keeps track of runnable warps and schedules a new warp if the currently running warp stalls.
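To make this decomposition concrete (and to show the host-CPU/GPU split mentioned on the previous slide), here is a minimal CUDA sketch; the kernel name, array size, and block size are illustrative assumptions, not taken from the lecture. Each thread handles one element; the hardware groups every 32 consecutive threads of a block into a warp, and the thread blocks are distributed across the SIMT cores.

#include <cstdio>
#include <cuda_runtime.h>

// One thread per data element. Hardware groups each block's threads into
// warps of 32; each warp executes in lockstep across the SIMD lanes.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this lane's element
    if (i < n)
        x[i] = a * x[i];
}

int main() {
    const int N = 1 << 20;                        // one million data elements
    float *x;
    cudaMallocManaged(&x, N * sizeof(float));     // memory visible to host and GPU
    for (int i = 0; i < N; i++) x[i] = 1.0f;

    const int threadsPerBlock = 256;              // one block = 256/32 = 8 warps
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // thread blocks
    scale<<<blocks, threadsPerBlock>>>(x, 2.0f, N);   // host CPU launches the kernel on the GPU
    cudaDeviceSynchronize();                      // wait for the GPU to finish

    printf("x[0] = %f\n", x[0]);                  // expect 2.0
    cudaFree(x);
    return 0;
}

With N = 2^20 and 256 threads per block, this launch creates 4096 thread blocks of 8 warps each; the warp scheduler on each SIMT core multiplexes its resident warps to hide stalls.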

The GPU Architecture (figure)

Architecture Features

- Simple in-order pipelines that rely on thread-level parallelism to hide long latencies
- Many registers (~1K) per in-order pipeline (lane) to support many active warps
- When a branch is encountered, some of the lanes proceed along the "then" case depending on their data values; later, the other lanes evaluate the "else" case; a branch cuts the data-level parallelism by half (branch divergence)
- When a load/store is encountered, the requests from all lanes are coalesced into a few 128 B cache-line requests; each request may return at a different time (memory divergence)
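The hedged sketch below shows how these two effects surface in kernel code; the kernels, data, and stride value are hypothetical examples rather than anything from the lecture.

#include <cuda_runtime.h>

// Branch divergence: lanes in a warp disagree on a data-dependent branch, so
// the warp executes the "then" path (with the "else" lanes masked off) and
// then the "else" path, halving data-level parallelism.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f)
            x[i] = 2.0f * x[i];   // only the lanes with positive data are active
        else
            x[i] = -x[i];         // then the remaining lanes run this path
    }
}

// Coalesced access: adjacent lanes read adjacent addresses, so a warp's 32
// loads collapse into a few 128 B cache-line requests.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided access: lanes in a warp touch addresses far apart, so one warp's
// loads spread over many cache lines, and the resulting requests can return
// at different times (memory divergence).
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    for (int i = 0; i < n; i++) a[i] = (i % 2) ? 1.0f : -1.0f;

    const int tpb = 256, blocks = (n + tpb - 1) / tpb;
    divergent<<<blocks, tpb>>>(a, n);
    coalesced<<<blocks, tpb>>>(a, b, n);
    strided<<<blocks, tpb>>>(a, b, n, 32);
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    return 0;
}

In divergent(), a warp whose lanes disagree on the sign of their data runs both arms of the branch one after the other; in strided(), with a stride of 32 floats (128 bytes), each lane of a warp touches a different cache line, instead of the few lines the coalesced() pattern needs.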

GPU Memory Hierarchy

- Each SIMT core has a private L1 cache (shared by the warps on that core)
- A large L2 is shared by all SIMT cores; each L2 bank services a subset of all addresses
- Each L2 partition is connected to its own memory controller and memory channel
- The GDDR5 memory system runs at higher frequencies, and uses chips with more banks, wide I/O, and better power-delivery networks
- A portion of GDDR5 memory is private to the GPU, and the rest is accessible to the host CPU (the GPU performs the copies)
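The last point corresponds to the explicit-copy style of CUDA host code: the host allocates a buffer in the GPU's private memory and requests copies into and out of it around the kernel launch. A minimal hedged sketch follows (the kernel, names, and sizes are illustrative assumptions):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;   // accesses go through the SIMT core's L1 and the shared L2
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *h_x = (float *)malloc(bytes);              // host (CPU) memory
    for (int i = 0; i < N; i++) h_x[i] = 0.0f;

    float *d_x;
    cudaMalloc((void **)&d_x, bytes);                 // GPU-private GDDR memory
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // explicit copy in

    addOne<<<(N + 255) / 256, 256>>>(d_x, N);         // kernel works on the GPU copy
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // explicit copy back

    printf("h_x[0] = %f\n", h_x[0]);                  // expect 1.0
    cudaFree(d_x);
    free(h_x);
    return 0;
}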
