Review for Midterm, 3/28/11
Topics: Administrative; Parts of Exam; Midterm Exam (Monday, April 4); Design Review; Final Projects


Administrative
- Midterm: in class April 4, open notes
  - Review notes, readings, and the review lecture (before break)
  - Will post prior exams
- Design Review
  - Intermediate assessment of progress on project, oral and short
  - Tentatively April 11 and 13
- Final projects
  - Poster session, April 27 (dry run April 25)
  - Final report, May 4

Midterm Exam, Monday, April 4
- Goal is to reinforce understanding of CUDA and the NVIDIA architecture
- Material will come from lecture notes and assignments
- In class, should not be difficult to finish
- Open notes, but no computers

Parts of Exam
I. Definitions
- A list of 5 terms you will be asked to define
II. Short Answer (4 questions, 20 points)
- Understand basic GPU architecture: processors and memory hierarchy
- High-level questions on more recent pattern and application lectures
III. Problem Solving
- Analyze data dependences and data reuse in code and use this to guide CUDA parallelization and memory hierarchy mapping
- Given some CUDA code, indicate whether global memory accesses will be coalesced and whether there will be bank conflicts in shared memory
- Given some CUDA code, add synchronization to derive a correct implementation
- Given some CUDA code, provide an optimized version that will have fewer divergent branches
IV. (Brief) Essay Question
- Pick one from a set of 4
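For the Part III question on coalescing, the distinction being tested looks roughly like the following minimal sketch. The kernel names, array size, and stride here are placeholders chosen for illustration, not exam code.

    #include <cuda_runtime.h>

    #define N      1024
    #define STRIDE 32

    // Coalesced: threads of a half-warp access consecutive addresses, so the
    // accesses combine into a small number of wide memory transactions.
    __global__ void copy_coalesced(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Not coalesced: threads of a half-warp read addresses STRIDE floats
    // apart, so each read falls in a different memory segment.
    __global__ void copy_strided(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * STRIDE];
    }

    int main(void)
    {
        float *in, *out;
        cudaMalloc((void **)&in,  N * STRIDE * sizeof(float));
        cudaMalloc((void **)&out, N * sizeof(float));
        copy_coalesced<<<N / 256, 256>>>(out, in);
        copy_strided<<<N / 256, 256>>>(out, in);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }

Roughly speaking, on compute capability below 1.2 the strided kernel serializes into one transaction per thread of a half-warp, and on 1.2 and above the hardware still fetches one mostly wasted segment per thread, so the strided version is bandwidth-bound either way.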

L1: Introduction and CUDA Overview
- Not much there

L2: Hardware Execution Model
- Difference between a parallel programming model and a hardware execution model
- SIMD, MIMD, SIMT, SPMD
- Performance impact of a fine-grain multithreaded architecture
- What happens during the execution of a warp?
- How are warps selected for execution (scoreboarding)?

L3 & L4: Memory Hierarchy: Locality and Data Placement
- Memory latency and bandwidth optimizations
- Reuse and locality
- What are the different memory spaces on the device, and who can read/write them?
- How do you tell the compiler that something belongs in a particular space?
- Tiling transformation (to fit data into constrained storage): safety and profitability

L5 & L6: Memory Hierarchy III: Memory Bandwidth Optimization
- Tiling (for registers)
- Bandwidth: maximize the utility of each cycle
- Memory accesses in scheduling (half-warp)
- Understanding global memory coalescing (for compute capability below 1.2 and 1.2 or above)
- Understanding shared memory bank conflicts

L7: Writing Correct Programs
- Race conditions, dependences
- What is a reduction computation and why is it a good match for a GPU?
- What does __syncthreads() do? (barrier synchronization)
- Atomic operations
- Memory fence instructions
- Device emulation mode

L8: Control Flow
- Divergent branches
- Execution model
- Warp vote functions

L9: Floating Point
- Single precision versus double precision
- IEEE compliance: which operations are compliant?
- Intrinsics vs. arithmetic operations: which is more precise?
- Which operations can be performed in 4 cycles, and which operations take longer?

L10: Dense Linear Algebra on GPUs
- What are the key ideas contributing to CUBLAS 2.0 performance?
- Concepts: high thread count vs. coarse-grain threads; when to use each?
- Transpose in shared memory plus the padding trick

L11: Sparse Linear Algebra on GPUs
- Different sparse matrix representations
- Stencil computations using sparse matrices

L12 & L13: Application Case Studies
- Host tiling for constant cache (plus data structure reorganization)
- Replacing trig function intrinsic calls with hardware implementations
- Global synchronization for MPM/GIMP

L14: Dynamic Scheduling
- Task queues
- Static queues, dynamic queues
- Wait-free synchronization

L15: Tree-Based Algorithms
- Flattening tree data structures
- Scheduling on a portion of the architecture
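Several of these topics come together in the shared-memory reduction pattern: a reduction computation and __syncthreads() as a barrier (L7), plus an access pattern chosen to limit divergence and bank conflicts (L5/L6, L8). The kernel below is a minimal sketch for reference, not code from the lecture notes; it assumes the input length is a multiple of the block size.

    // One partial sum per thread block, computed with a tree reduction in
    // shared memory. Sequential addressing keeps the active threads
    // contiguous and the shared-memory accesses conflict-free.
    __global__ void reduce_sum(const float *in, float *block_sums)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = in[i];
        __syncthreads();                    // all loads visible before reducing

        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();                // barrier between reduction steps
        }

        if (tid == 0)
            block_sums[blockIdx.x] = sdata[0];
    }

A launch such as reduce_sum<<<numBlocks, threads, threads * sizeof(float)>>>(d_in, d_partial) produces one partial sum per block; the partial sums can then be combined on the host or with a second launch.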

Exam: Problem III.a
a. Managing bandwidth
Given the following CUDA code, how would you rewrite it to improve bandwidth to global memory and, if applicable, shared memory? Explain your answer for partial credit. Assume c is stored in row-major order, so c[i][j] is adjacent to c[i][j+1].

    N = 512;
    NUMBLOCKS = 512/64;

    for (j = bx*64; j < (bx*64)+64; j++)
        a[tx] = a[tx] - c[tx][j] * b[j];

How to solve?
- Copy c to shared memory in coalesced order
- Tile in shared memory
- Copy a to a register
- Copy b to shared, constant, or texture memory

One possible rewrite (sketch):

    N = 512;
    NUMBLOCKS = 512/64;
    float tmpa;
    __shared__ float ctmp[1024+32];              // let's use 32x32 tiles
    tmpa = a[tx];
    pad1 = tx/32;  pad2 = j/32;                  // padding to avoid bank conflicts

    for (jj = bx*64; jj < (bx*64)+64; jj += 32) {
        for (j = jj; j < jj+2; j++)
            ctmp[j*512 + tx + pad1] = c[j][tx];  // coalesced loads of c
        __syncthreads();
        tmpa = tmpa - ctmp[tx*512 + j + pad2] * b[j];
    }
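The staging idea in that sketch is essentially the transpose-in-shared-memory-plus-padding trick from L10: read global memory in coalesced order and pay for the strided access pattern in shared memory instead, where a padding column removes the bank conflicts. In isolation, and independent of the exam code, a minimal version looks like the following; it assumes a 32x32 thread block and that n is a multiple of 32.

    #define TILE 32

    // Transpose one 32x32 tile per block. Global reads and writes are both
    // coalesced; the column-order access happens only in shared memory,
    // where the extra padding column makes it conflict-free.
    __global__ void transpose_tile(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE][TILE + 1];     // +1 column of padding

        int x = blockIdx.x * TILE + threadIdx.x;   // column in 'in'
        int y = blockIdx.y * TILE + threadIdx.y;   // row in 'in'
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;       // swapped block indices
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }

Launched with dim3 block(TILE, TILE) and dim3 grid(n/TILE, n/TILE), each block reads its tile row by row and writes the transposed tile row by row, so no half-warp ever touches global memory with a large stride.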

Exam: Problem III.b
b. Divergent branch
Given the following CUDA code, describe how you would modify it to derive an optimized version that will have fewer divergent branches.

    main() {
        float h_a[1024], h_b[1024];
        /* assume appropriate cudaMalloc calls create d_a and d_b, and d_a is */
        /* initialized from h_a using an appropriate call to cudaMemcpy */
        dim3 dimBlock(256);
        dim3 dimGrid(4);
        compute<<<dimGrid, dimBlock, 0>>>(d_a, d_b);
        /* assume d_b is copied back from the device using a call to cudaMemcpy */
    }

    __global__ void compute(float *a, float *b) {
        float a[4][256], b[4][256];
        bx = blockIdx.x;
        if (tx % 16 == 0)
            (void) starting_kernel(a[bx][tx], b[bx][tx]);
        else /* (tx % 16 > 0) */
            (void) default_kernel(a[bx][tx], b[bx][tx]);
    }

Key idea: separate the threads whose tx is a multiple of 16 from the others.

Problem III.b approach: renumber threads to concentrate the cases where tx is not divisible by 16:

    if (tx < 240)
        t = tx + (tx/15) + 1;     /* the 240 logical IDs not divisible by 16 */
    else
        t = (tx - 240) * 16;      /* the 16 logical IDs divisible by 16 */

Now replace tx with t in the code. Only the last warp has divergent branches.
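As a concrete but unofficial rendering of that renumbering, a kernel along the following lines could be used. Here starting_kernel and default_kernel are trivial placeholder device functions standing in for the two code paths, and the arrays are indexed as one 256-element slice per block; the mapping from tx to t is the point of the sketch.

    __device__ float starting_kernel(float a, float b) { return a + b; }
    __device__ float default_kernel (float a, float b) { return a - b; }

    __global__ void compute_renumbered(float *a, float *b)
    {
        int tx = threadIdx.x;              // 256 threads per block assumed
        int bx = blockIdx.x;
        int t;

        // Logical IDs that are multiples of 16 (0, 16, ..., 240) are handled
        // by physical threads 240..255, so only the last warp diverges.
        // Each group of 16 logical IDs contains 15 non-multiples, hence /15.
        if (tx < 240)
            t = tx + tx / 15 + 1;          // logical IDs not divisible by 16
        else
            t = (tx - 240) * 16;           // logical IDs divisible by 16

        int idx = bx * 256 + t;
        if (t % 16 == 0)
            b[idx] = starting_kernel(a[idx], b[idx]);
        else
            b[idx] = default_kernel(a[idx], b[idx]);
    }

In warps 0 through 6 every thread takes the else branch, so the divergence that was previously spread across all warps is confined to the last warp of each block.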

Exam: Problem III.c
c. Tiling
The following sequential image correlation computation compares a region of an image to a template. Show how you would tile the image and threshold data to fit in 128 MB of global memory and the template data to fit in 16 KB of shared memory. Explain your answer for partial credit.

    TEMPLATE_NROWS = TEMPLATE_NCOLS = 64;
    IMAGE_NROWS = IMAGE_NCOLS = 5192;
    int image[IMAGE_NROWS][IMAGE_NCOLS], th[IMAGE_NROWS][IMAGE_NCOLS];
    int template[TEMPLATE_NROWS][TEMPLATE_NCOLS];

    for (m = 0; m < IMAGE_NROWS - TEMPLATE_NROWS + 1; m++) {
        for (n = 0; n < IMAGE_NCOLS - TEMPLATE_NCOLS + 1; n++) {
            for (i = 0; i < TEMPLATE_NROWS; i++) {
                for (j = 0; j < TEMPLATE_NCOLS; j++) {
                    if (abs(image[i+m][j+n] - template[i][j]) < threshold)
                        th[m][n] += image[i+m][j+n];
                }
            }
        }
    }

View of computation: perform the correlation of the template with one portion of the image, then move the window horizontally and downward across the image and repeat.

Problem III.c solution:
i. How big are the image and template data?
- Image = 5192^2 * 4 bytes/int, roughly 100 MBytes; th is the same size
- Template = 64^2 * 4 bytes/int = exactly 16 KBytes
- Total data set size is roughly 200 MBytes, so it fits in global memory and no tiling is needed at this level
- The template data does not fit in shared memory because other things are placed there as well
ii. Partitioning to support tiling for shared memory
- Hint: exploit reuse on the template by copying it to shared memory
- Could also exploit reuse on a portion of the image
- Dependences are only on th (a reduction)
iii. Tiling for the template
- Can copy it into shared memory in coalesced order
- Copy half of it (or less) at a time

Exam: Problem III.d
d. Parallel partitioning and synchronization (LU decomposition)
Without writing out the CUDA code, consider a CUDA mapping of the LU decomposition sequential code below. The answer should be in three parts, providing opportunities for partial credit:
(i) Where are the data dependences in this computation?
(ii) How would you partition the computation across threads and blocks?
(iii) How would you add synchronization to avoid race conditions?

    float a[1024][1024];
    for (k = 0; k < 1023; k++) {
        for (i = k+1; i < 1024; i++)
            a[i][k] = a[i][k] / a[k][k];
        for (i = k+1; i < 1024; i++)
            for (j = k+1; j < 1024; j++)
                a[i][j] = a[i][j] - a[i][k] * a[k][j];
    }

Key features of the solution:
i. Dependences
- True dependences <a[i][j], a[k][k]> and <a[i][j], a[i][k]>, carried by k
- True dependences <a[i][j], a[i][j]> and <a[i][j], a[k][j]>, carried by k
ii. Partition
- Merge the i loops, interchange with j, and partition j across blocks/threads (is there sufficient parallelism?), or
- Partition the i dimension across threads, using the trick from III.a
- Load balance? Repartition on the host
iii. Synchronization: on the host
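One way to realize parts (ii) and (iii) is sketched below, assuming one pair of kernel launches per outer k iteration, so that the required global synchronization happens between launches on the host side. The kernel names and launch shapes are illustrative, not the official solution.

    #include <cuda_runtime.h>

    #define N 1024

    // Scale column k of the trailing submatrix: rows i > k are partitioned
    // across blocks and threads.
    __global__ void scale_column(float *a, int k)
    {
        int i = k + 1 + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            a[i * N + k] = a[i * N + k] / a[k * N + k];
    }

    // Rank-1 update of the trailing submatrix: one row per blockIdx.y,
    // columns j > k partitioned across blockIdx.x and threads.
    __global__ void update_submatrix(float *a, int k)
    {
        int i = k + 1 + blockIdx.y;
        int j = k + 1 + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N && j < N)
            a[i * N + j] = a[i * N + j] - a[i * N + k] * a[k * N + j];
    }

    // Host loop: d_a holds the N x N matrix on the device. Launches on the
    // same stream execute in order, so each k iteration sees the results of
    // the previous one; the "synchronization on the host" is this loop.
    void lu_on_device(float *d_a)
    {
        for (int k = 0; k < N - 1; k++) {
            int rem = N - 1 - k;                      // remaining rows/columns
            scale_column<<<(rem + 255) / 256, 256>>>(d_a, k);
            dim3 grid((rem + 255) / 256, rem);
            update_submatrix<<<grid, 256>>>(d_a, k);
        }
        cudaDeviceSynchronize();
    }

This sketch leaves the dependence carried by k serialized by the host loop while exposing the independent i and j iterations to threads and blocks, which matches parts (ii) and (iii) of the outline above.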