NVIDIA'S KEPLER ARCHITECTURE. Tony Chen, 2015


Overview
1. Fermi
2. Kepler
   a. SMX Architecture
   b. Memory Hierarchy
   c. Features
3. Improvements
4. Conclusion
5. Brief look at Maxwell

Fermi
- ~2010
- 40 nm, TSMC (some mobile parts used 28 nm)
- 16 Streaming Multiprocessors, each with:
  - 32 CUDA cores
  - 16 load/store units
  - 4 Special Function Units (SFUs): sine, cosine, reciprocal, square root
- CUDA core: one integer ALU and one floating-point unit (FPU)

Kepler
- ~2012-2014
- 28 nm technology, TSMC
- Used in most GeForce 600, 700, and 800M series parts
- Designed with energy efficiency in mind
  - 2 Kepler cores use about 90% of the power of one Fermi core
- Unified GPU clock

SMX Architecture
- 15 SMX units (Next Generation Streaming Multiprocessor), each with:
  - 192 single-precision CUDA cores
  - 64 double-precision units
  - 32 load/store units
  - 32 SFUs
  - 16 texture units
  - 65,536 32-bit registers
  - 4 warp schedulers
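
The per-SMX counts above can be multiplied out to chip-level totals. The sketch below is only arithmetic on the slide's numbers; it assumes all 15 SMX units are enabled (shipping products may enable fewer), and the names `PER_SMX`, `chip_totals`, and `register_file_kib` are made up for illustration.

```python
# Per-SMX resources as listed on the slide (assumed to describe a full chip).
SMX_COUNT = 15
PER_SMX = {
    "single_precision_cores": 192,
    "double_precision_units": 64,
    "load_store_units": 32,
    "sfus": 32,
    "texture_units": 16,
    "registers_32bit": 65_536,
}


def chip_totals(smx_count=SMX_COUNT, per_smx=PER_SMX):
    """Multiply each per-SMX count by the number of SMX units."""
    return {name: count * smx_count for name, count in per_smx.items()}


def register_file_kib(registers=PER_SMX["registers_32bit"], bytes_per_register=4):
    """Size of one SMX register file in KiB (a 32-bit register is 4 bytes)."""
    return registers * bytes_per_register // 1024


totals = chip_totals()
print(totals["single_precision_cores"])  # 2880
print(totals["double_precision_units"])  # 960
print(register_file_kib())               # 256
```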

Feature Overview
- Quad Warp Scheduler
- Shuffle Instructions
- Texture Improvements
- Atomic Operations
- Memory Hierarchy
- Dynamic Parallelism
- Hyper-Q
- Grid Management Unit
- GPU Direct
- NVENC
- General improvements/features

Quad Warp Scheduler
- A warp is 32 parallel threads
- Each SMX contains 4 warp schedulers
- Each scheduler has 2 instruction dispatch units, allowing 2 independent instructions per warp to be dispatched each cycle
- Allows double-precision operations alongside other operations (Fermi did not allow this)

Quad Warp Scheduler (Cont.)
- Removes the complex hardware that prevented data hazards:
  - Multi-port register scoreboard
  - Dependency checker block
- Uses the compiler to determine possible hazards
- A simple hardware block provides this pre-determined information to the instruction
- Replaces a power-expensive hardware stage with a simple hardware block
- Frees up die space

Shuffle Instructions
- Allow threads within a warp to share data
- Previously, passing data between threads needed separate store and load operations through shared memory
- Instead, a thread can read directly from another thread's register
- The store and load are carried out in a single step
- Reduces the amount of shared memory needed
- 6% performance gain in FFT using shuffle
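
As a rough illustration of the idea, the following pure-Python model treats one warp as a list of 32 register values and sums them with a shuffle-down pattern. It is a sketch of the data movement only, not GPU code: `shfl_down` and `warp_reduce_sum` are hypothetical helper names, and the rule that out-of-range lanes keep their own value is an assumption of this model.

```python
WARP_SIZE = 32


def shfl_down(values, delta):
    """Model of a shuffle-down: lane i reads the register of lane i + delta.

    Assumption of this model: a lane whose source would fall outside the
    warp simply keeps its own value.
    """
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the total."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected one value per lane of a 32-thread warp")
    regs = list(values)
    delta = WARP_SIZE // 2
    while delta >= 1:
        # Every lane adds the value it read from the lane `delta` above it.
        incoming = shfl_down(regs, delta)
        regs = [own + other for own, other in zip(regs, incoming)]
        delta //= 2
    return regs[0]


print(warp_reduce_sum(list(range(32))))  # 496
```

The reduction needs five exchange steps (offsets 16, 8, 4, 2, 1) and no shared-memory buffer, which is the saving the slide describes.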

Texture Improvements
- Texture state is now saved in memory
- Fermi used a fixed-size binding table
  - An entry was assigned when the GPU needed to reference a texture
  - Basically resulted in a 128-texture limit
- On Kepler, texture state is obtained on demand
- Reduces CPU overhead and improves GPU access efficiency

Atomic Operations
- Read-modify-write operations performed without interruption from other threads
- Important for parallel programming
- Added atomicMin, atomicMax, atomicAnd, atomicOr, atomicXor operations
- Native support for 64-bit atomic ops
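
To show why an uninterrupted read-modify-write matters, here is a small CPU-side sketch using Python threads. It is only an analogy for the GPU atomics named above: `AtomicCell` and `count_in_parallel` are hypothetical names, and a lock stands in for the hardware guarantee.

```python
import threading


class AtomicCell:
    """One shared integer whose updates are indivisible (hypothetical helper)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def atomic_add(self, amount):
        # The lock makes read, modify, and write one uninterruptible step.
        with self._lock:
            old = self._value
            self._value = old + amount
            return old  # return the previous value, as CUDA atomics do

    def atomic_min(self, candidate):
        with self._lock:
            old = self._value
            self._value = min(old, candidate)
            return old

    def load(self):
        with self._lock:
            return self._value


def count_in_parallel(num_threads=8, increments=1000):
    """Have several threads bump one counter; no update may be lost."""
    counter = AtomicCell(0)

    def worker():
        for _ in range(increments):
            counter.atomic_add(1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.load()


print(count_in_parallel())  # 8000
```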

Memory Hierarchy
- 64 KB of configurable on-chip memory per SMX, split between L1 cache and shared memory:
  - 16/32/48 KB L1 cache
  - 48/32/16 KB shared memory
- 48 KB read-only cache
- 1536 KB L2 cache
- Protected by a Single Error Correct, Double Error Detect (SECDED) ECC code
- More bandwidth at each level compared to the previous generation

Dynamic Parallelism
- Allows the GPU to generate, synchronize, and control new work for itself
- Traditionally, the CPU issues work to the GPU
- Does not need to involve the CPU for new work

Hyper-Q
- Fermi had 16 concurrent work streams, but all were multiplexed into 1 hardware work queue
  - This created false dependencies
- Kepler increases the number of hardware-managed connections (work queues) to 32
- Each CUDA stream is managed internally, and inter-stream dependencies are optimized
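
The effect of false dependencies can be sketched with a deliberately simplified timing model. The assumptions are loud ones: work sharing a hardware queue runs strictly in order, different queues run fully in parallel, and SMX resource limits are ignored. The function name `makespan` and the example durations are invented for illustration.

```python
def makespan(streams, hardware_queues):
    """Finish time when streams are multiplexed onto a fixed number of queues.

    `streams` is a list of streams, each a list of task durations.
    Simplifying assumption: streams are assigned to queues round-robin,
    a queue runs its work strictly in order, and queues run in parallel.
    """
    if hardware_queues < 1:
        raise ValueError("need at least one hardware queue")
    queue_time = [0] * hardware_queues
    for index, durations in enumerate(streams):
        queue_time[index % hardware_queues] += sum(durations)
    return max(queue_time)


# Four independent streams with made-up task durations.
streams = [[3, 1], [2, 2], [4], [1, 1, 1]]

print(makespan(streams, 1))   # 15: one queue serializes unrelated streams
print(makespan(streams, 32))  # 4: each stream has its own queue
```

In this model the single-queue case takes as long as all streams combined, while with enough queues the finish time is set by the longest stream alone.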

Grid Management Unit (GMU)
- Grid = group of blocks; block = group of threads
- Manages and prioritizes grids that are to be passed into the CWD (CUDA Work Distributor) to be sent to the SMX units for execution
- Keeps the GPU efficiently utilized
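
The grid/block/thread hierarchy can be made concrete with a little index arithmetic. This sketch mirrors the usual one-dimensional CUDA pattern (block index times block size plus thread index) in plain Python; the function names are illustrative and nothing here models the GMU's scheduling itself.

```python
def global_thread_ids(grid_dim, block_dim):
    """Yield (block, thread, global_id) for every thread of a 1D grid."""
    for block in range(grid_dim):
        for thread in range(block_dim):
            # Same formula as blockIdx.x * blockDim.x + threadIdx.x in CUDA.
            yield block, thread, block * block_dim + thread


def blocks_needed(n_items, block_dim):
    """Ceiling division: smallest grid that gives every item its own thread."""
    return (n_items + block_dim - 1) // block_dim


print(blocks_needed(1000, 256))        # 4
print(list(global_thread_ids(2, 3)))
# [(0, 0, 0), (0, 1, 1), (0, 2, 2), (1, 0, 3), (1, 1, 4), (1, 2, 5)]
```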

GPU Direct
- Allows direct access to GPU memory from third-party devices (NICs, SSDs, etc.)
- Remote Direct Memory Access (RDMA)
- Does not need to involve the CPU

NVENC
- New hardware-based H.264 video encoder
- Previous models used CUDA cores
- 4 times faster while using less power
- Up to 4096x4096 encode
- A 16-minute-long 1080p, 30 fps video will take approximately 2 minutes
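
The last bullet implies an encode rate that is easy to work out. The sketch below only does that arithmetic on the slide's approximate figures; `nvenc_estimate` is a made-up helper name and the result is no more precise than the "approximately 2 minutes" it starts from.

```python
def nvenc_estimate(video_minutes=16, fps=30, encode_minutes=2):
    """Derive frame count, encode rate, and speed relative to real time."""
    frames = video_minutes * 60 * fps
    encode_seconds = encode_minutes * 60
    return {
        "frames": frames,
        "encode_fps": frames / encode_seconds,
        "realtime_factor": video_minutes / encode_minutes,
    }


estimate = nvenc_estimate()
print(estimate["frames"])           # 28800
print(estimate["encode_fps"])       # 240.0
print(estimate["realtime_factor"])  # 8.0
```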

Improvements of Kepler
- Access to up to 255 registers per thread (compared to 63 for Fermi)
- Removal of the shader clock
  - Fermi used a shader clock, typically 2x the GPU clock
    - Achieves higher throughput
    - Uses more power
  - Kepler runs off the GPU clock
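
A higher per-thread register limit trades against how many warps fit in the register file at once. The sketch below estimates that bound for one Kepler SMX from the 65,536-register figure given earlier in the talk; it assumes registers are the only limit and ignores allocation granularity and any cap on resident warps, so treat the numbers as upper bounds. The function name is hypothetical.

```python
def warps_limited_by_registers(regs_per_thread, register_file=65_536, warp_size=32):
    """Whole warps that fit in one register file at a given per-thread usage.

    Assumption: registers are the only constraint, with no allocation
    rounding and no separate limit on resident warps.
    """
    if regs_per_thread < 1:
        raise ValueError("a thread uses at least one register")
    return register_file // (regs_per_thread * warp_size)


for regs in (32, 63, 255):
    print(regs, warps_limited_by_registers(regs))
# 32 64
# 63 32
# 255 8
```

A kernel that uses all 255 registers per thread leaves room for far fewer warps than one that uses 63, so the larger limit is a tool for register-hungry kernels rather than a free gain.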

Improvements of Kepler (Cont.)
- Up to 4 displays on one card
- 4K support
- GPU Boost
  - Dynamically scales the GPU clock based on operating conditions
- Adaptive V-sync
  - Turns off v-sync when frames per second drop below 60
  - Turns on v-sync when above 60 fps
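
The Adaptive V-sync rule amounts to a threshold test on the current frame rate. This is a minimal sketch of that rule only: the 60 Hz default follows the slide, treating exactly 60 fps as "on" is an assumption (the slide does not say), and `vsync_enabled` and `vsync_trace` are invented names rather than any driver API.

```python
def vsync_enabled(rendered_fps, refresh_hz=60):
    """True when v-sync should be on under the slide's rule.

    Assumption: a frame rate exactly equal to the refresh rate counts as on.
    """
    return rendered_fps >= refresh_hz


def vsync_trace(frame_rates, refresh_hz=60):
    """Map a sequence of measured frame rates to 'on'/'off' decisions."""
    return ["on" if vsync_enabled(fps, refresh_hz) else "off" for fps in frame_rates]


print(vsync_trace([75, 62, 48, 55, 90]))  # ['on', 'on', 'off', 'off', 'on']
```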

Improvements of Kepler (Cont.)
- FXAA (Fast Approximate Anti-Aliasing)
  - Comparable sharpness to MSAA (Multisample Anti-Aliasing)
  - Uses less computation power
  - Smooths edges using pixels rather than the 3D model

Improvements of Kepler (Cont.)
- TXAA (Temporal Anti-Aliasing)
  - Mix of hardware anti-aliasing and a custom CG film style AA resolve
  - High-quality resolve filter designed to work with the HDR-correct post-processing pipeline
  - TXAA 1 offers visual quality on par with 8xMSAA with the performance hit of 2xMSAA, while TXAA 2 offers image quality that is superior to 8xMSAA, but with performance comparable to 4xMSAA

Benchmarks

In Conclusion
- Improve performance
- Improve energy efficiency
- Many hands make light work

Maxwell
- 28 nm, TSMC
- Early 2014 (version 1)
- Late 2014 (version 2, current version)
  - GTX 980, 970
- New SM architecture (SMM)
  - Efficiency: more active threads per SMM
- Larger shared memory
- Larger L2 cache

Questions?