NVIDIA GTX200: TeraFLOPS Visual Computing
John Tynefield, August 26, 2008


2 Outline
- Execution Model
- Architecture
- Demo

3 Execution Model

4 Software Architecture
[Figure: software stack]
- Applications: DX10, OpenGL, OpenCL, CUDA C
- Host: OS and device driver, running on a scalar CPU
- Device: SIMT GPU hardware with hardware thread scheduling

5 SIMT Multithreaded Execution
[Figure: the multithreaded instruction scheduler interleaving instructions from several warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, then warp 8 instruction 12, warp 3 instruction 96, ...]
- SIMT (Single-Instruction, Multi-Thread) applies one instruction to many independent threads
- SIMT provides easy single-thread scalar programming with SIMD efficiency
- Warp: the set of 32 parallel threads that execute a SIMT instruction
- Hardware implements zero-overhead warp and thread scheduling
- Each thread processor manages state for up to 128 threads
- SIMT threads execute independently
- A SIMT warp diverges and reconverges when its threads branch independently
- Best efficiency and performance when the threads of a warp execute together (see the sketch below)
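Divergence is easiest to see in code. Below is a minimal CUDA sketch (a hypothetical kernel, not from the talk) in which the two halves of each 32-thread warp take different branches, so the SIMT hardware serializes the two paths and reconverges afterwards.

    __global__ void divergent(int *out)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x & 31;     // lane index within the 32-thread warp
        // The two half-warps take different paths; the hardware executes
        // each side in turn, then the warp reconverges after the if/else.
        if (lane < 16)
            out[tid] = tid * 2;          // first half-warp
        else
            out[tid] = tid + 100;        // second half-warp
        // Branching at warp granularity (e.g. on threadIdx.x / 32) would
        // avoid divergence entirely, which is the fast case the slide names.
    }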

6 Execution Model - CUDA
[Figure: thread/block/grid memory hierarchy - register file and local memory per thread, shared memory per block, global memory per application]
- Local memory: per thread. Private per thread; holds auto variables and register spill.
- Shared memory: per block. Shared by the threads of a CTA; used for inter-thread communication.
- Global memory: per application. Shared by all threads; used for inter-grid communication.
- Grids (Grid 0, Grid 1, ...) execute sequentially in time over global memory.
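To make the three memory spaces concrete, here is a hedged CUDA sketch (the kernel name and the 256-thread block size are illustrative assumptions, not from the talk): auto variables live in per-thread registers/local memory, a __shared__ array is per-block, and the pointer arguments refer to per-application global memory.

    __global__ void memory_spaces(const float *g_in, float *g_out)
    {
        __shared__ float s_buf[256];     // per-block shared memory (one CTA);
                                         // assumes a 256-thread block

        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        float v = g_in[gid];             // global memory, visible to all threads;
                                         // v is an auto variable (register/local)
        s_buf[tid] = v;                  // stage the value in shared memory
        __syncthreads();                 // make the block's writes visible

        // Inter-thread communication within the CTA via shared memory:
        float neighbor = s_buf[(tid + 1) % blockDim.x];
        g_out[gid] = v + neighbor;       // result back to global memory
    }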

7 Execution Model - Graphics
[Figure: pixel thread / pixel warp / pixel grid hierarchy - register file per pixel, plane equations and textures per warp, target images per grid]
- Local registers: per pixel. Private per pixel.
- Planes/textures: per warp. Plane equations define surface-space inputs; textures are arrays of 1D/2D/3D data arrays.
- Target images: per grid. An array of 2D surfaces.
- Pixel grids (Pixel Grid 0, Pixel Grid 1, ...) render sequentially to their target images.

8 Architecture

9 Implementation
- TSMC 65 nm process, 1.4 billion transistors
- 24.3 x 24.0 mm die, 2236-ball BGA package
- 1 TeraFLOPS single precision / 84 GigaFLOPS double precision
- 1.4 GHz processor clock
- 140 GB/s memory bandwidth at a 1.1 GHz memory clock
- Up to 4 GB of on-board memory

10 GTX200 Unified Visual Computing
[Figure: chip block diagram - fixed-function acceleration, processor array, communication fabric, memory & I/O]
- Tesla unified graphics and computing architecture
- Scales parallel performance 2X beyond G80
- 240 thread processor cores, 30K threads in flight
- Double precision 64-bit IEEE 754 floating point

11 SM Multithreaded Multiprocessor
[Figure: SM block diagram - I-cache, MT issue, C-cache, 8 thread processors, 2 SFUs, 1 DP unit, shared memory]
- 8 thread processors: IEEE 754 32-bit floating point; 32-bit and 64-bit integer; 2K 32-bit registers per thread processor, with a variable 4-128 registers per thread
- 2 SFU special function units
- 1 DP double precision unit: IEEE 754R 64-bit floating point with fused multiply-add
- Scalar register-based ISA
- Multithreaded instruction unit: 1024 threads, hardware multithreaded; 32 SIMT warps of 32 threads; independent thread execution; mixed concurrent thread types
- 16KB shared memory: concurrent threads share data; low-latency load/store
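As a concrete tie-in to these limits, here is a hypothetical host-side launch (sizes are illustrative, reusing the memory_spaces sketch from slide 6): a 256-thread block is 8 warps of 32 threads, so four resident blocks reach the SM's 32-warp / 1024-thread capacity.

    // Hypothetical launch configuration; d_in/d_out are assumed to be
    // device pointers allocated elsewhere.
    int n = 1 << 20;                          // one million elements
    int threadsPerBlock = 256;                // = 8 warps of 32 threads
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    // Up to 4 such blocks (4 x 8 = 32 warps = 1024 threads) can be
    // resident on one SM, matching the limits listed above.
    memory_spaces<<<blocks, threadsPerBlock>>>(d_in, d_out);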

12 Double Precision Fused Multiply-Add
- The revised IEEE 754R standard specifies FMA
- FMA has lower latency than FMAD
- FMA improves the accuracy of many computations: dot products, matrix multiplication
- Enables fast mixed-precision convergence algorithms
- Enables higher-performance algorithms for extended-precision arithmetic
- Enables efficient implementation of exactly-rounded division and square root (a sketch of the underlying trick follows)
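The single rounding of FMA is what makes the extended-precision algorithms above possible. A standard illustration (a sketch, not code from the talk) is the error-free product: because fma(a, b, -p) rounds only once, it recovers exactly the rounding error of p = a * b.

    // Error-free product using the DP unit's fused multiply-add:
    // p + err equals a*b exactly, because fma introduces no
    // intermediate rounding of the product a*b.
    __device__ void two_prod(double a, double b, double *p, double *err)
    {
        *p   = a * b;            // rounded product
        *err = fma(a, b, -*p);   // exact residual: a*b - p
    }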

13 Coalesced Memory I/O
- When 16 threads (a half-warp) access a contiguous region of device memory, the 16 data elements are loaded in one instruction
- int, float: 64-byte segment (fastest); int2, float2: 128 bytes; int4, float4: 256 bytes (2 transactions)
- Regions must be aligned to a multiple of the segment size
- Coalescing scales gracefully for partially contiguous accesses (contrasted in the sketch below)
[Figure: threads 0-15 accessing consecutive 4-byte words at addresses 128-188, all falling within one aligned 64B segment]
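A hedged sketch of the difference (kernel names are illustrative): in the first kernel each half-warp's 16 loads fall in one aligned segment; in the second, a stride spreads them across many segments.

    // Coalesced: consecutive threads read consecutive 4-byte words, so a
    // half-warp's 16 loads fall in a single aligned 64B segment.
    __global__ void copy_coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Uncoalesced: the stride spreads the half-warp across many segments,
    // costing up to one memory transaction per thread.
    __global__ void copy_strided(const float *in, float *out, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];
    }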

14 Performance: Single Precision BLAS
[Figure: SGEMM GFLOPS versus matrix size, 0-350 GFLOPS. CUBLAS (CUDA 2.0b2 on a Tesla C1060) versus ATLAS 3.81 with 1 and 4 threads on a dual 2.8 GHz dual-core Opteron.]
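For reference, the measured routine is CUBLAS's SGEMM. A minimal host-side call using the legacy CUBLAS interface of the CUDA 2.0 era might look like the following sketch (the helper name and square matrix size are illustrative assumptions).

    #include <cublas.h>

    // C = A * B on the GPU; n x n, column-major, single precision.
    void gemm_gpu(const float *A, const float *B, float *C, int n)
    {
        float *dA, *dB, *dC;
        cublasInit();
        cublasAlloc(n * n, sizeof(float), (void **)&dA);
        cublasAlloc(n * n, sizeof(float), (void **)&dB);
        cublasAlloc(n * n, sizeof(float), (void **)&dC);
        cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
        cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);
        cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
        cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);
        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
    }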

15 Performance: Double Precision BLAS
[Figure: DGEMM GFLOPS versus matrix size. CUBLAS (CUDA 2.0b2 on a Tesla C1060) versus single-threaded and parallel ATLAS 3.81 on a quad-core 2.83 GHz Intel Xeon E5440.]

16 Performance: Scaling
[Figure: oil and gas computing, reverse time migration - hand-optimized SSE on an x86 CPU versus CUDA C on an NVIDIA GPU.]

17 Summary
- Rebalanced the architecture to match workload trends; scaled from 128 to 240 processors
- Hardware manages thousands of threads: zero software overhead, hides huge latencies, high achieved utilization
- Natively scalar: no swizzling or vectorization overhead
- Coalescing for high-bandwidth memory I/O
- The software architecture allows 2X scaling on customer C code with no modification

18 Demo

19 More Information
Erik Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, March-April 2008.
http://www.nvidia.com/cuda