Lecture 5: Performance programming for stencil methods; Vectorization; Computing with GPUs


Announcements. Forge accounts: set up your ssh public key and tcsh. Turnin is enabled for Programming Lab #1: due at 9pm today, report by 9am tomorrow. Don't wait until the last minute to turn in, and be sure you have all required files (teameval.txt, etc.). A2: GPU programming, in 2 parts, each with its own turnin.

Today's lecture: performance programming for stencil methods; vectorization; computing with Graphical Processing Units (GPUs).

Motivating application: solve Laplace's equation in 3-D with Dirichlet boundary conditions: Δu = 0, with u = f on ∂Ω. The building block is an iterative solver using Jacobi's method (7-point stencil):

for (i,j,k) in 1:N x 1:N x 1:N
    u'[i,j,k] = ( u[i-1,j,k] + u[i+1,j,k] + u[i,j-1,k] + u[i,j+1,k] + u[i,j,k-1] + u[i,j,k+1] ) / 6.0
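A minimal C sketch of one such sweep, assuming an (N+2)^3 row-major array with a one-cell ghost shell holding the boundary values (the array names and the IDX macro are illustrative, not the course's assignment code):

    // One Jacobi sweep of the 7-point Laplacian stencil.
    #define IDX(i,j,k) ((((i) * (N+2)) + (j)) * (N+2) + (k))
    void jacobi_sweep(int N, const double *u, double *unew) {
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                for (int k = 1; k <= N; k++)
                    unew[IDX(i,j,k)] =
                        ( u[IDX(i-1,j,k)] + u[IDX(i+1,j,k)]
                        + u[IDX(i,j-1,k)] + u[IDX(i,j+1,k)]
                        + u[IDX(i,j,k-1)] + u[IDX(i,j,k+1)] ) / 6.0;
    }

The solver alternates the roles of u and unew between sweeps until the iterates converge.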

Memory representation and access. [Figure: the 3-D grid and its layout in a linear array; a stencil update at (i,j,k) touches the top plane (i,j,k+1), the four in-plane neighbors (i-1,j,k), (i+1,j,k), (i,j-1,k), (i,j+1,k), and the bottom plane (i,j,k-1).]

Processor Geometry

AMD Phenom's memory hierarchy: 6MB of L3 on cseclass01 and 02. http://developer.amd.com/media/gpu_assets/develop_brighton_justin_boggs-2.pdf

NUMA awareness. When we allocate memory, a NUMA processor uses the first-touch policy unless told otherwise, so be sure to use parallel for when initializing data:

double *x = new double[n], *y = new double[n], *z = new double[n];
#pragma omp parallel for
for (i = 0; i < n; i++) { x[i] = 0; y[i] = ...; z[i] = ...; }
#pragma omp parallel for
for (i = 0; i < n; i++) { x[i] = y[i] + z[i]; }

www.benjaminathawes.com/blog
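A self-contained sketch of the same first-touch idiom in C with OpenMP (sizes and initial values are illustrative; compile with OpenMP enabled, e.g. -fopenmp):

    #include <stdlib.h>
    int main(void) {
        int n = 1 << 24;
        double *x = malloc(n * sizeof(double));
        double *y = malloc(n * sizeof(double));
        double *z = malloc(n * sizeof(double));
        // First touch: each thread initializes the pages it will later use,
        // so the OS places those pages in that thread's local NUMA domain.
        #pragma omp parallel for
        for (int i = 0; i < n; i++) { x[i] = 0.0; y[i] = 1.0; z[i] = 2.0; }
        // The same loop schedule here means mostly local memory accesses.
        #pragma omp parallel for
        for (int i = 0; i < n; i++) x[i] = y[i] + z[i];
        free(x); free(y); free(z);
        return 0;
    }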

Today's lecture: performance programming for stencil methods; vectorization; computing with GPUs.

Streaming SIMD Extensions. Each operation is independent:

for i = 0:N-1 { p[i] = a[i] * b[i]; }

Implementations: SSE (SSE4 on Intel Nehalem), Altivec. Short vectors: 128 bits (AVX: 256 bits), i.e. 4 floats or 2 doubles in 16 bytes. [Figure: four SIMD lanes computing p = a * b elementwise.] (Jim Demmel)
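For concreteness, a hedged sketch of that loop written with SSE intrinsics (assumes N is a multiple of 4 and the arrays are 16-byte aligned; function and variable names are illustrative):

    #include <xmmintrin.h>   // SSE
    void vmul(int N, const float *a, const float *b, float *p) {
        for (int i = 0; i < N; i += 4) {       // four floats per 128-bit register
            __m128 va = _mm_load_ps(&a[i]);    // aligned vector load
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&p[i], _mm_mul_ps(va, vb));  // p[i:i+3] = a[i:i+3] * b[i:i+3]
        }
    }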

Fused Multiply/Add: r[0:3] = c[0:3] + a[0:3] * b[0:3]. [Figure: four SIMD lanes, each multiplying a and b and adding c in one fused operation.] Courtesy of Mercury Computer Systems, Inc.
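Altivec exposes a true fused multiply-add (vec_madd); SSE of this generation does not, so on x86 the same computation is written as a multiply followed by an add. A minimal sketch:

    #include <xmmintrin.h>
    __m128 madd4(__m128 a, __m128 b, __m128 c) {
        // c + a*b on four floats; two instructions (not fused) under SSE
        return _mm_add_ps(c, _mm_mul_ps(a, b));
    }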

How do we use the SSE instructions? Low level: assembly language or libraries. Higher level: a vectorizing compiler:

icc -O3 -ftree-vectorizer-verbose=2 -c s35.c

int N;
float a[N], b[N], c[N] = ...;
for (int i = 0; i < N; i++)
    a[i] = b[i] + c[i];

s35.c(17): (col. 5) remark: LOOP WAS VECTORIZED.
s35.c(20): (col. 20) remark: LOOP WAS VECTORIZED.
s35.c(7): (col. 5) remark: LOOP WAS VECTORIZED.

How does non-vectorized code compare? Disable vectorization and measure:

#pragma novector
for (int i = 0; i < N; i++)   // N = 2048 * 1024
    a[i] = b[i] + c[i];

Single precision, running on a Nehalem processor using pgcc: 7.693 sec with vectorization, 10.17 sec without. Double precision: 11.88 sec with vectorization, 11.85 sec without.

How does the vectorizer work? Transformed code:

for (i = 0; i < 1024; i += 4)
    a[i:i+3] = b[i:i+3] + c[i:i+3];

Vector instructions:

for (i = 0; i < 1024; i += 4) {
    vb = vec_ld( &b[i] );
    vc = vec_ld( &c[i] );
    va = vec_add( vb, vc );
    vec_st( va, &a[i] );
}

What prevents vectorization? Data dependencies: each iteration needs the result of the previous one,

for (int i = 1; i < N; i++)
    b[i] = b[i-1] + 2;

i.e. b[1] = b[0] + 2; b[2] = b[1] + 2; b[3] = b[2] + 2; ...

flow.c(10): warning #592: variable "b" is used before its value is set
b[i] = b[i-1] + 2; /* data dependence cycle */

Also, only inner loops are vectorized:

for (int j = 0; j < reps; j++)
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

What prevents vectorization? Interrupted flow out of the loop:

for (i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
    maxval = (a[i] > maxval ? a[i] : maxval);
    if (maxval > 1000.0) break;
}

t2.c(6): (col. 3) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

Without the early exit, this loop will vectorize:

for (i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
    maxval = (a[i] > maxval ? a[i] : maxval);
}

Today's lecture: performance programming for stencil methods; vectorization; computing with GPUs.

Heterogeneous processing. [Figure: a shared memory (MEM) with caches C0, C1, C2, each feeding a processor P0, P1, P2.]

NVIDIA GeForce GTX 280: hierarchically organized clusters of streaming multiprocessors. 240 cores @ 1.296 GHz; peak performance 933.12 Gflops/s; SIMT parallelism. 1 GB device memory (frame buffer) behind a 512-bit memory interface @ 132 GB/s. For scale, the GTX 280 has 1.4B transistors, vs. Intel Penryn (dual core) at 410M (110 mm²) and Nehalem at 731M (263 mm²).

Streaming processor cluster. The GTX-280 GPU comprises 10 clusters of 3 streaming multiprocessors ("vector cores") each. Each vector core contains 8 scalar cores (each a 32-bit fused multiply-adder plus a multiplier, truncating the intermediate result), 1 64-bit fused multiply-adder, 2 super function units (each holding 2 fused multiply-adders), and shared memory (16KB) plus registers (16K × 32 bits = 64KB). At 1 FMA + 1 multiply per cycle, that is 3 flops/cycle/core × 240 cores = 720 flops/cycle; @ 1.296 GHz: 933 GFLOPS. [Figure: an SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs.] (David Kirk/NVIDIA and Wen-mei Hwu/UIUC)

Streaming Multiprocessor. [Figure: SM block diagram, H. Goto.]

Memory Hierarchy. [Figure: the device grid, with per-thread registers, per-block shared memory, per-thread local memory, and device-wide global, constant, and texture memories reachable from the host.]

Name      Latency (cycles)    Cached
Global    DRAM, 100s          No
Local     DRAM, 100s          No
Constant  1s to 100s          Yes
Texture   1s to 100s          Yes
Shared    1                   --
Register  1                   --

Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC
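A small illustrative CUDA C kernel showing where these memory spaces appear in source code (a sketch, assuming blockDim.x <= 256; the names are made up):

    __constant__ float coeff;                 // constant memory: cached, read-only on device
    __global__ void scale(const float *in, float *out, int n) {
        __shared__ float tile[256];           // shared memory: on-chip, one copy per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // index lives in a register
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // global memory -> shared memory
        __syncthreads();                      // every thread in the block reaches this point
        if (i < n) out[i] = coeff * tile[threadIdx.x];   // shared -> global
    }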

CUDA: a programming environment with extensions to C. Under control of the host, we invoke sequences of multithreaded kernels on the device (GPU), each kernel running many lightweight threads:

KernelA<<<4,8>>>()
KernelB<<<4,8>>>()
KernelC<<<4,8>>>()
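A minimal hedged sketch of the host/device split (the kernel and sizes are illustrative, not from the course):

    #include <cuda_runtime.h>
    __global__ void add1(float *x, int n) {      // runs on the device, one thread per element
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }
    int main(void) {                             // runs on the host
        int n = 1 << 20;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float));
        add1<<<(n + 255) / 256, 256>>>(d_x, n);  // launch many lightweight threads
        cudaDeviceSynchronize();                 // kernel launches are asynchronous
        cudaFree(d_x);
        return 0;
    }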

Thread execution model. A kernel call spawns virtualized, hierarchically organized threads. The hardware handles dispatching with zero overhead, and the compiler rearranges loads to hide latencies. Global synchronization happens only at kernel invocation boundaries.

Hierarchical thread organization: Grid > Block > Thread. We specify the number and geometry of threads in a block, and similarly for blocks in a grid. Thread blocks are the unit of workload assignment: they subdivide a global index domain, and the threads within a block cooperate and synchronize, with access to fast on-chip shared memory. Threads in different blocks communicate only through slow global memory. Example: KernelA<<<2,3>,<3,5>>>() (slide shorthand for a 2×3 grid of 3×5 blocks; see the dim3 sketch below). [Figure: a host launching Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2-D array of blocks, each block a 2-D array of threads.] (David Kirk/NVIDIA & Wen-mei Hwu/UIUC)
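In real CUDA C that shorthand is expressed with dim3 launch parameters; a sketch (KernelA is hypothetical):

    __global__ void KernelA();   // hypothetical kernel

    dim3 grid(2, 3);             // 2 x 3 blocks in the grid
    dim3 block(3, 5);            // 3 x 5 threads in each block
    KernelA<<<grid, block>>>();  // 6 blocks x 15 threads = 90 threads in flight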

Threads, blocks, grids and data. Threads all execute the same instruction (SIMT). A block may have a different number of dimensions (1-D, 2-D, or 3-D) than a grid (1-D or 2-D). Each thread is uniquely specified by its block and thread IDs, and the programmer determines the mapping from virtual thread IDs to global memory locations, i.e. from the thread index space onto the 2-D or 3-D data domain. [Figure: an SM (MT issue unit, SPs, shared memory) with threads t0..tm, backed by global memory.]
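The standard expression of that mapping: each thread computes a unique global index from its block and thread IDs and addresses memory with it. A hedged sketch for 2-D data (names illustrative):

    __global__ void touch(float *u, int nx, int ny) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
        if (x < nx && y < ny)
            u[y * nx + x] = 1.0f;   // thread ID -> linear global memory location
    }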

SM constraints. Up to 8 resident blocks per SM, not more than 1024 threads, and up to 32 warps. All threads in a warp execute the same instruction: all taken branch paths are followed, with the non-participating threads' instructions disabled, so divergence causes serialization. A grid is 1- or 2-dimensional (up to 64K-1 per dimension); a block is 1-, 2-, or 3-dimensional, with at most 512 threads and maximum dimensions 512 × 512 × 64. Registers are subdivided over the threads. Synchronization is possible only among the threads within a block, not across blocks. [Figure: an SM (MT issue unit, SPs, shared memory) and a grid of 2-D thread blocks.]

Parallel speedup. How much did our GPU implementation improve over the traditional processor? Speedup S = (running time of the fastest program on conventional processors) / (running time of the accelerated program). The baseline is a multithreaded program.

What limits speedup? Avoid algorithms that present intrinsic barriers to utilizing the hardware. Hide the latency of host-device memory transfers. Reduce global memory accesses by turning them into fast on-chip accesses, and coalesce memory transfers (see the sketch below). Avoid costly branches, or render them harmless. Minimize serial sections.
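Coalescing in a nutshell: the threads of a warp should touch consecutive addresses so the hardware can merge them into a few wide transactions. A hedged sketch contrasting the two patterns (kernels are illustrative):

    // Coalesced: thread i reads a[i]; a warp covers one contiguous segment.
    __global__ void copy_coalesced(const float *a, float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) b[i] = a[i];
    }

    // Strided: thread i reads a[i*stride]; a warp scatters across many lines.
    __global__ void copy_strided(const float *a, float *b, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) b[i] = a[i * stride];
    }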

Performance issues. The processor design is simplified, but the user gets more control over the hardware resources, so rethink the problem-solving technique: hide the latency of host-device memory transfers, turn global memory accesses into fast on-chip accesses, coalesce memory transfers, and avoid costly branches or render them harmless. If we don't use the parallelism, we lose it: Amdahl's law (serial sections), the von Neumann bottleneck (data transfer costs), and workload imbalances.

Fermi. All the cseclass machines have Fermi processors: private L1 cache/shared memory (64KB per vector unit) and a shared L2 cache (768 KB). CSEClass 01, 02: GeForce GTX 580 [Capability 2.0, GF100], 15 vector units @ 32 cores/unit (480 cores), 4 SFUs, 1.25 GB device memory. CSEClass 03-07: GeForce GTX 460 [Capability 2.1, GF104], 7 vector units @ 48 cores/unit (384 cores total), 8 SFUs, 1.0 GB device memory.
www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2
http://icrontic.com/

Fin