CMPSCI 691AD General Purpose Computation on the GPU, Spring 2009. Lecture 5: Quantitative Analysis of Parallel Algorithms. Rui Wang

(cont. from last lecture) Device Management, Context Management, Module Management, Execution Control, Memory Management, Texture Reference Management, Interoperability with OpenGL and D3D.

Texture Reference Management. A texture is 1D, 2D, or 3D pixel (texel) data, accessed through a texture reference: texture<Type, Dim, ReadMode> texRef; Type must be an integer or single-precision FP type, or any 1-, 2-, or 4-component vector of these two types. Dim is 1, 2, or 3. ReadMode is either cudaReadModeNormalizedFloat or cudaReadModeElementType.

Host Runtime for Textures. Texture reference structure:

    struct textureReference {
        int normalized;
        enum cudaTextureFilterMode filterMode;
        enum cudaTextureAddressMode addressMode[3];
        struct cudaChannelFormatDesc channelDesc;
    };

Examples:

    texRef.normalized = true;
    texRef.filterMode = cudaFilterModeLinear;
    texRef.addressMode[0] = cudaAddressModeWrap;

Host Runtime for Textures. filterMode selects point vs. linear filtering; addressMode controls how out-of-range coordinates are handled; normalized selects texture coordinates in [0, 1) rather than [0, N). Bind a texture reference to a CUDA array with cudaBindTextureToArray(texRef, cuArray); a texture reference can also be bound to linear memory, but with several limitations.

Device Runtime for Textures. Texturing from CUDA arrays uses tex fetches:

    Type tex1D(texRef, float x);
    Type tex2D(texRef, float x, float y);
    Type tex3D(texRef, float x, float y, float z);
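Putting the pieces together, here is a minimal sketch of the whole path (declare, allocate, configure, bind, fetch) using the legacy texture-reference API described above; the kernel and helper names are illustrative, not from the lecture:

    #include <cuda_runtime.h>

    // 2D float texture reference, declared at file scope.
    texture<float, 2, cudaReadModeElementType> texRef;

    __global__ void copyFromTexture(float* out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            // Fetch through the texture cache; +0.5f addresses texel centers.
            out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
    }

    void setupTexture(const float* hostData, int width, int height)
    {
        // Allocate a CUDA array matching the texel format and upload the data.
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray* cuArray;
        cudaMallocArray(&cuArray, &desc, width, height);
        cudaMemcpyToArray(cuArray, 0, 0, hostData,
                          width * height * sizeof(float),
                          cudaMemcpyHostToDevice);

        // Configure the reference fields shown on the previous slide.
        texRef.normalized = false;                 // coordinates in [0, N)
        texRef.filterMode = cudaFilterModePoint;   // no interpolation
        texRef.addressMode[0] = cudaAddressModeClamp;
        texRef.addressMode[1] = cudaAddressModeClamp;

        // Bind; kernels can now fetch via tex2D(texRef, ...).
        cudaBindTextureToArray(texRef, cuArray);
    }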

Texture Reference Management. Advantages of using textures: cached (high bandwidth if locality is preserved); not subject to memory coalescing requirements; 'free' linear interpolation (valid only for FP textures); normalized texture coordinates (resolution independent); addressing modes (automatic handling of out-of-bounds accesses).

Example: Parallel Reduction. Compute $S = \sum_{i=1}^{N} a_i$. Put the data in shared memory: it is fast (remember that shared memory can be treated as a user-managed L1 cache) and allows block-wise synchronization.

Example: Parallel Reduction. But there is no global synchronization across blocks, so split the computation into blocks and let a single block aggregate the final result (see the host-side sketch below).
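A host-side sketch of that structure, assuming a reduce_kernel that writes one partial sum per block (any of the kernel versions sketched on the following slides fits this signature); names here are illustrative:

    #include <cuda_runtime.h>

    // Forward declaration: one partial sum per block, defined later.
    __global__ void reduce_kernel(const float* in, float* out, int n);

    // Reduce n floats on the device by repeated kernel launches, since
    // there is no global synchronization within a single launch.
    // d_data and d_partial ping-pong; the input buffer is overwritten.
    float reduceOnDevice(float* d_data, float* d_partial, int n, int threads)
    {
        while (n > 1) {
            int blocks = (n + threads - 1) / threads;
            reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>(
                d_data, d_partial, n);
            float* tmp = d_data; d_data = d_partial; d_partial = tmp;
            n = blocks;   // this pass's outputs are the next pass's inputs
        }
        float result;
        cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);
        return result;
    }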

Example: Parallel Reduction. First try (consider a single block): [figure: the elements in shared memory are summed pairwise over successive iterations]. Problem! The branch is divergent within a warp, and the access pattern causes shared memory bank conflicts.
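A sketch of this first try, in the style of NVIDIA's well-known reduction example (interleaved addressing; assumes blockDim.x is a power of two; names are illustrative):

    __global__ void reduce_interleaved(const float* in, float* out, int n)
    {
        extern __shared__ float sdata[];   // user-managed L1 cache
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                   // block-wise synchronization

        // Interleaved addressing: the stride doubles each iteration.
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2 * s) == 0)        // divergent within a warp!
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
    }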

Example: Parallel Reduction. Second try (consider a single block): [figure: reduction over successive iterations with a revised addressing pattern]. Remaining problem: idling threads (half of the active threads drop out after each iteration).
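A sketch of the second try with sequential addressing, which keeps the active threads contiguous so warps stay non-divergent and conflict-free, at the price of idle threads (names are illustrative):

    __global__ void reduce_sequential(const float* in, float* out, int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Sequential addressing: active threads stay contiguous while the
        // stride halves, avoiding warp divergence and bank conflicts...
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)                   // ...but half the threads go idle
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }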

Quantitative Analysis of Parallel Algorithms How much speedup do I gain by going parallel? Does the parallel program fully utilize resources? Efficiency vs. work size...

Sources of Overhead in Parallel Algorithms. Throwing N processors at a problem doesn't mean your program runs N times faster: interprocess communication; idling due to load imbalance, synchronization, and serial sub-components; excess computation (sometimes you are forced to use a poorer but more easily parallelizable algorithm).

Performance Metrics. Execution time: the time elapsed between the beginning and the end of execution. $T_S$: serial runtime. $T_P$: parallel runtime on $p$ processors. Total parallel overhead: $T_O = p\,T_P - T_S$; the overhead is zero if perfect linear speedup is achieved.

Performance Metrics. Speedup: $S = T_S / T_P$. Example: what is the speedup of parallel reduction? $T_S = O(n)$. With $p$ multiprocessors, one round over $p$ elements takes $T_P = O(\log p)$, so overall $T_P = O\!\left(\frac{n}{p}\log p\right)$.

Performance Metrics. Theoretical maximum speedup: $S = p$; in practice it is much less (refer to Amdahl's law). Is it possible to exceed $p$? Yes: superlinear speedup. Sources of superlinearity: hardware limitations that put the serial version at a disadvantage (e.g., the working set fits in the parallel machine's aggregate cache but not in a single processor's); superlinearity due to exploratory decomposition.

Performance Metrics. Efficiency: $E = \frac{S}{p} = \frac{T_S}{p\,T_P}$. Efficiency measures how well an algorithm utilizes all the processors; ideal efficiency is $E = 1$. Efficiency of parallel reduction? $E = O\!\left(\frac{1}{\log p}\right)$. Hmm, not very efficient...
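As a concrete worked instance of these metrics (the numbers are chosen for illustration, not from the lecture), take $n = 2^{20}$ elements and $p = 2^8$ processors with unit cost per addition:

    T_S \approx n = 2^{20},
    \qquad
    T_P \approx \frac{n}{p}\log_2 p = \frac{2^{20}}{2^{8}} \cdot 8 = 2^{15},
    \\
    S = \frac{T_S}{T_P} = \frac{2^{20}}{2^{15}} = 32,
    \qquad
    E = \frac{S}{p} = \frac{32}{256} = \frac{1}{8} = \frac{1}{\log_2 p}.

As expected, the efficiency comes out to exactly $1/\log_2 p$: adding processors makes the reduction faster but proportionally less efficient.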

Performance Metrics. Cost: $C = p\,T_P$. Cost reflects the total time the processors collectively spend solving the problem. Compare the cost with the execution time $T_S$ of the fastest known sequential algorithm: the parallel algorithm is cost-optimal if the two are asymptotically equal. Cost of parallel reduction? $p\,T_P = O(n \log p)$ while $T_S = O(n)$: NOT cost-optimal!

Performance Metrics. Revisit parallel reduction: how about we let each processor serially add $n/p$ elements, then sum up the resulting $p$ partial results? $T_P = O\!\left(\frac{n}{p} + \log p\right)$, so $p\,T_P = O(n + p\log p)$, which is approximately $O(n)$ when $p\log p \ll n$. Cost-optimal!
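A sketch of this cost-optimal variant: each thread first accumulates its share of the input serially (a grid-stride loop), then the partial sums are combined with the same tree reduction as before (names are illustrative; a final pass over the per-block results is still needed, e.g. via the host-side driver sketched earlier):

    __global__ void reduce_twophase(const float* in, float* out, int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int stride = blockDim.x * gridDim.x;   // total threads = p

        // Phase 1: O(n/p) serial accumulation per thread.
        float sum = 0.0f;
        for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += stride)
            sum += in[i];
        sdata[tid] = sum;
        __syncthreads();

        // Phase 2: O(log p) tree reduction of the partial sums.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];       // one sum per block
    }

Launching a fixed, modest number of blocks (so that $p\log p \ll n$) is exactly what makes this version cost-optimal.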