NVIDIA Fermi Architecture

Transcription:

NVIDIA Fermi Architecture
Patrick Cozzi, University of Pennsylvania, CIS 565 - Spring 2011

Administrivia
Assignment 4 grades returned
Project checkpoint on Monday; post an update on your blog beforehand
Poster session: 04/28, three weeks from tomorrow

G80, GT200, and Fermi
November 2006: G80
June 2008: GT200
March 2010: Fermi (GF100)

New GPU Generation
What are the technical goals for a new GPU generation?

New GPU Generation
What are the technical goals for a new GPU generation?
Improve existing application performance. How?
Advance programmability. In what ways?

Fermi: What's More?
More total cores (SPs), though not more SMs
More registers: 32K per SM
More shared memory: up to 48 KB per SM
More Special Function Units (SFUs)

Fermi: What's Faster?
Faster double precision: 8x over GT200
Faster atomic operations: 5-20x. What for? (see the sketch below)
Faster context switches: 10x between applications, and between graphics and compute, e.g., OpenGL and CUDA
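
Fast atomics matter for data-dependent updates such as histograms, counters, and reductions, where many threads hit the same memory locations. A minimal CUDA sketch (kernel and buffer names are illustrative, not from the slides):

    // Each thread reads one byte of input and bumps the matching bin.
    // Heavily contended atomicAdd traffic is exactly where Fermi's 5-20x
    // atomic speedup over GT200 shows up.
    __global__ void histogram256(const unsigned char *data, int n, unsigned int *bins)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u);
    }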

Fermi: What's New?
L1 and L2 caches. For compute or graphics?
Dual warp scheduling
Concurrent kernel execution (sketch below)
C++ support
Full IEEE 754-2008 support in hardware
Unified address space
Error Correcting Code (ECC) memory support
Fixed-function tessellation for graphics

G80, GT200, and Fermi (comparison figure)

GT200 and Fermi (comparison figure)
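
Concurrent kernel execution means independent kernels launched into different non-default streams may run at the same time if resources allow. A minimal sketch, with illustrative kernel names:

    #include <cuda_runtime.h>

    __global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

    void launchConcurrently(float *d_x, float *d_y)
    {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);
        kernelA<<<1, 256, 0, s0>>>(d_x);   // independent work in separate streams
        kernelB<<<1, 256, 0, s1>>>(d_y);   // may overlap on Fermi
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }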

Fermi Block Diagram
GF100: 16 SMs, each with 32 cores, for 512 total cores
Each SM hosts up to 48 warps, or 1,536 threads (see the launch sketch below)
In flight: up to 24,576 threads

Fermi SM
Why 32 cores per SM instead of 8? Why not more SMs?
G80: 8 cores per SM
GT200: 8 cores per SM
GF100: 32 cores per SM

Fermi SM
Dual warp scheduling. Why?
32K registers
32 cores, each with a floating-point and an integer unit
16 load/store units
4 SFUs

Fermi SM
16 SMs * 32 cores/SM = 512 floating-point operations per cycle
Why not in practice?
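
To reach that 48-warp, 1,536-thread residency per SM, the launch has to supply enough blocks and threads. A hedged sketch with an illustrative saxpy kernel (not from the slides):

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    void launchSaxpy(int n, float *d_x, float *d_y)
    {
        // 256 threads = 8 warps per block; six such blocks resident on one SM
        // reach the 48-warp / 1,536-thread limit quoted above.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
    }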

Fermi SM
Each SM has 64 KB of on-chip memory, configurable by the CUDA developer (see the sketch below):
48 KB shared memory / 16 KB L1 cache, or
16 KB shared memory / 48 KB L1 cache

Fermi Dual Warp Scheduling
Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/sc09_cuda_luebke_intro.pdf

Fermi Caches
Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
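
The split is chosen per kernel through the CUDA runtime. A minimal sketch, assuming a placeholder kernel named myKernel:

    __global__ void myKernel(float *data) { data[threadIdx.x] *= 2.0f; }   // placeholder kernel

    void preferSharedForMyKernel()
    {
        // Request the 48 KB shared / 16 KB L1 split for this kernel;
        // cudaFuncCachePreferL1 would request 16 KB shared / 48 KB L1 instead.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    }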

Fermi Caches (figure)
Fermi: Unified Address Space (figure)
Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

Fermi: Unified Address Space
64-bit virtual addresses
40-bit physical addresses (currently)
CUDA 4: shared address space with the CPU. Why?
No explicit CPU/GPU copies (see the sketch below)
Direct GPU-to-GPU copies
Direct I/O device to GPU copies
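
A minimal sketch of what the unified address space enables under CUDA 4 (buffer names are illustrative): the runtime infers the copy direction from the addresses, and pinned, mapped host memory becomes directly addressable from the GPU.

    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = 1 << 20;
        float *h_buf = 0, *d_buf = 0;
        cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocMapped);  // pinned host memory, GPU-visible under UVA
        cudaMalloc((void**)&d_buf, bytes);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);         // direction inferred from the pointers
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }

Direct GPU-to-GPU transfers between two Fermi boards work similarly once peer access has been enabled with cudaDeviceEnablePeerAccess.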

Fermi ECC
ECC-protected: register file, L1, L2, DRAM
Uses redundancy to ensure data integrity against cosmic rays flipping bits
For example, 64 bits are stored as 72 bits
Fixes single-bit errors, detects multiple-bit errors
What are the applications? (see the query sketch below)

Fermi Tessellation
Fixed-function hardware on each SM for graphics:
Texture filtering
Texture cache
Tessellation
Vertex fetch / attribute setup
Stream output
Viewport transform
Why?
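
ECC trades some memory capacity and bandwidth for integrity, so it mainly pays off for long-running scientific and cluster workloads. A program can check the current setting through the runtime API, as in this sketch:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("%s: ECC %s\n", prop.name, prop.ECCEnabled ? "enabled" : "disabled");
        return 0;
    }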

Observations
Becoming easier to port CPU code to the GPU: recursion (sketched below), fast atomics, L1/L2 caches, faster global memory
In fact, GPUs are starting to look like CPUs: beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics
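
One concrete example of that convergence is device-side recursion, which Fermi (compute capability 2.0) allows. A minimal sketch with illustrative names, compiled with something like nvcc -arch=sm_20:

    __device__ int factorial(int n)
    {
        return (n <= 1) ? 1 : n * factorial(n - 1);   // recursive device call, legal on sm_20 and later
    }

    __global__ void fillFactorials(int *out)
    {
        out[threadIdx.x] = factorial(threadIdx.x + 1);
    }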