NVIDIA'S KEPLER ARCHITECTURE
  Tony Chen
  2015
Overview
  1. Fermi
  2. Kepler
    a. SMX Architecture
    b. Memory Hierarchy
    c. Features
  3. Improvements
  4. Conclusion
  5. Brief look at Maxwell
Fermi
  ~2010
  40 nm TSMC (some mobile parts used 28 nm)
  16 Streaming Multiprocessors (SMs), each with:
    32 CUDA cores
    16 load/store units
    4 Special Function Units (SFUs): sine, cosine, reciprocal, square root
  CUDA core: one integer ALU + one floating-point unit (FPU)
Kepler
  ~2012-2014
  28 nm technology (TSMC)
  Used in most GeForce 600, 700, and 800M series cards
  Designed with energy efficiency in mind
    Two Kepler cores use about 90% of the power of one Fermi core
  Unified GPU clock
SMX Architecture
  15 SMX units (Next Generation Streaming Multiprocessor), each with:
    192 single-precision CUDA cores
    64 double-precision units
    32 load/store units
    32 SFUs
    16 texture units
    65,536 32-bit registers
    4 warp schedulers
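
The per-SMX figures above can be multiplied out to chip-level totals. The short Python sketch below does only that arithmetic; the constant names are illustrative and not part of any NVIDIA API.

```python
# Per-SMX figures quoted on the SMX Architecture slide.
SMX_COUNT = 15
SP_CORES_PER_SMX = 192
DP_UNITS_PER_SMX = 64
REGISTERS_PER_SMX = 65_536
BYTES_PER_REGISTER = 4  # each register is 32 bits wide

# Chip-level totals for a fully enabled part.
total_sp_cores = SMX_COUNT * SP_CORES_PER_SMX
total_dp_units = SMX_COUNT * DP_UNITS_PER_SMX

# Size of one SMX register file in KiB.
register_file_kib = REGISTERS_PER_SMX * BYTES_PER_REGISTER // 1024

print(total_sp_cores)     # 2880
print(total_dp_units)     # 960
print(register_file_kib)  # 256
```
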
Feature Overview
  Quad Warp Scheduler
  Shuffle Instructions
  Texture Improvements
  Atomic Operations
  Memory Hierarchy
  Dynamic Parallelism
  Hyper-Q
  Grid Management Unit
  GPU Direct
  NVENC
  General improvements/features
Quad Warp Scheduler
  A warp is a group of 32 parallel threads
  Each SMX contains 4 warp schedulers
    Each scheduler has 2 instruction dispatch units, allowing 2 independent instructions per warp to be dispatched each cycle
  Allows double-precision operations to be issued alongside other operations (Fermi did not allow this)
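
As a rough illustration of the numbers on this slide, the sketch below computes the peak number of warp instructions one SMX can dispatch per cycle. It is plain arithmetic on the slide's figures and is an upper bound only; real issue rates depend on instruction mix and available execution units.

```python
# Figures from the Quad Warp Scheduler slide.
SCHEDULERS_PER_SMX = 4
DISPATCH_UNITS_PER_SCHEDULER = 2

# Each scheduler selects one warp per cycle.
warps_selected_per_cycle = SCHEDULERS_PER_SMX

# Each selected warp can have two independent instructions dispatched.
peak_warp_instructions_per_cycle = (
    SCHEDULERS_PER_SMX * DISPATCH_UNITS_PER_SCHEDULER
)

print(warps_selected_per_cycle)          # 4
print(peak_warp_instructions_per_cycle)  # 8
```
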
Quad Warp Scheduler (Cont.)
  Removal of the complex hardware that prevented data hazards
    A multi-port register scoreboard and dependency checker block
  Uses the compiler to determine possible hazards
    A simple hardware block provides this pre-determined information to the instruction
  Replaces a power-expensive hardware stage with a simple hardware block
  Frees up die space
Shuffle Instructions
  Allow threads within a warp to share data
  Previously, separate store and load operations were needed to pass data through shared memory
  Instead, a thread can read a value directly from another thread's register
  The store and load are carried out in a single step
  Reduces the amount of shared memory needed
  6% performance gain in FFT using shuffle
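
To make the data movement concrete, here is a small Python model of a warp-wide sum built on a shuffle-down exchange. It is a simulation of the idea, not GPU code: `shfl_down` and `warp_reduce_sum` are hypothetical helper names, and the rule that an out-of-range lane keeps its own value is a modelling assumption.

```python
WARP_SIZE = 32

def shfl_down(registers, delta):
    """Model a shuffle-down: lane i reads the register held by lane i + delta.

    Lanes whose source lane would fall outside the warp keep their own value.
    """
    return [
        registers[lane + delta] if lane + delta < WARP_SIZE else registers[lane]
        for lane in range(WARP_SIZE)
    ]

def warp_reduce_sum(values):
    """Sum one value per lane using log2(32) = 5 shuffle steps, no shared memory."""
    regs = list(values)
    assert len(regs) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shfl_down(regs, offset)
        # Every lane adds the value it received to its own register.
        regs = [own + other for own, other in zip(regs, received)]
        offset //= 2
    return regs[0]  # lane 0 ends up holding the full sum

print(warp_reduce_sum(range(32)))  # 496
```

Only lane 0 is guaranteed to hold the full sum at the end; the upper lanes hold partial results, which is the usual trade-off for this halving pattern.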
Texture Improvements
  Texture state is now saved in memory
    Fermi used a fixed-size binding table
      An entry was assigned when the GPU needed to reference a texture
      Effectively resulted in a 128-texture limit
    Obtained on demand
  Reduces CPU overhead and improves GPU access efficiency
Atomic Operations
  Read-modify-write operations performed without interruption from other threads
  Important for parallel programming
  Expanded native support for 64-bit atomic ops
    Added 64-bit atomicMin, atomicMax, atomicAnd, atomicOr, atomicXor
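
The read-modify-write guarantee can be modelled on the CPU. The sketch below uses Python threads and a lock to imitate an atomic max; `AtomicCell` is a hypothetical class written for this illustration, and the lock stands in for what the GPU does in hardware.

```python
import threading

class AtomicCell:
    """A value whose read-modify-write updates cannot be interleaved."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def atomic_max(self, candidate):
        # The read, compare, and write happen as one indivisible step.
        with self._lock:
            old = self._value
            if candidate > old:
                self._value = candidate
            return old  # like CUDA atomics, hand back the previous value

    def load(self):
        with self._lock:
            return self._value

cell = AtomicCell(0)
threads = [
    threading.Thread(target=cell.atomic_max, args=(v,))
    for v in (7, 42, 3, 19)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(cell.load())  # 42
```

Whatever order the threads run in, the final value is the maximum, because no update can be lost between another thread's read and write.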
Memory Hierarchy
  64 KB of configurable on-chip memory per SMX, split between L1 cache and shared memory
    16/32/48 KB L1 cache
    48/32/16 KB shared memory
  48 KB read-only cache
  1536 KB L2 cache
  Protected by Single-Error Correct Double-Error Detect (SECDED) ECC code
  More bandwidth at each level compared to the previous generation
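
The three L1/shared-memory splits listed above always add up to 64 KB. The sketch below enumerates them and picks the split that leaves the most L1 cache for a given shared-memory requirement; `pick_split` is a hypothetical helper for illustration, not a CUDA runtime call.

```python
ON_CHIP_KB = 64
L1_OPTIONS_KB = (16, 32, 48)

# Each split is (L1 cache KB, shared memory KB).
splits = [(l1, ON_CHIP_KB - l1) for l1 in L1_OPTIONS_KB]
print(splits)  # [(16, 48), (32, 32), (48, 16)]

def pick_split(shared_needed_kb):
    """Return the split with the largest L1 that still fits the shared-memory need."""
    candidates = [s for s in splits if s[1] >= shared_needed_kb]
    if not candidates:
        raise ValueError("request exceeds the largest shared-memory configuration")
    return max(candidates, key=lambda s: s[0])

print(pick_split(20))  # (32, 32)
print(pick_split(40))  # (16, 48)
```
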
Dynamic Parallelism
  Allows the GPU to generate, synchronize, and control new work for itself
  Traditionally the CPU issues work to the GPU
  Does not need to involve the CPU for new work
Hyper-Q
  Fermi had 16 concurrent work streams, but all were multiplexed into 1 hardware work queue
    Created false dependencies
  Increased the number of hardware-managed connections (work queues) to 32
  Each CUDA stream is managed within its own hardware work queue, and inter-stream dependencies are optimized
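
A deliberately simplified model shows why the single queue hurts. Assume kernels in the same hardware queue run strictly one after another, while separate queues run fully in parallel. This ignores SMX resource limits and the partial overlap Fermi could still achieve; the stream contents and the `makespan` helper are hypothetical.

```python
def makespan(streams, hardware_queues):
    """Total time to drain all streams under the simplified queue model.

    Streams are assigned to hardware queues round-robin. Work within one
    queue is serialized; different queues proceed independently.
    """
    queue_time = [0] * hardware_queues
    for index, kernels in enumerate(streams):
        queue_time[index % hardware_queues] += sum(kernels)
    return max(queue_time)

# Four independent streams, each a list of kernel durations (arbitrary units).
streams = [[2, 2], [1, 3], [4], [2, 1, 1]]

print(makespan(streams, 1))   # 16  (one shared queue: everything serializes)
print(makespan(streams, 32))  # 4   (one queue per stream: streams overlap)
```
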
Grid Management Unit (GMU)
  Grid = group of blocks; block = group of threads
  Manages and prioritizes grids that are to be passed to the CWD (CUDA Work Distributor) and sent to the SMX units for execution
  Keeps the GPU efficiently utilized
GPU Direct
  Allows direct access to GPU memory from third-party devices (NICs, SSDs, etc.)
  Remote Direct Memory Access (RDMA)
  Does not need to involve the CPU
NVENC
  New hardware-based H.264 video encoder
    Previous models used CUDA cores
    4 times faster while using less power
  Up to 4096x4096 encode
  A 16-minute 1080p, 30 fps video takes approximately 2 minutes to encode
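
The encode-time claim can be turned into a throughput figure. The sketch below is arithmetic on the slide's numbers only; the variable names are illustrative.

```python
# Figures from the NVENC slide.
VIDEO_MINUTES = 16
VIDEO_FPS = 30
ENCODE_MINUTES = 2

total_frames = VIDEO_MINUTES * 60 * VIDEO_FPS
encode_fps = total_frames / (ENCODE_MINUTES * 60)
realtime_factor = VIDEO_MINUTES / ENCODE_MINUTES

print(total_frames)     # 28800
print(encode_fps)       # 240.0
print(realtime_factor)  # 8.0
```

So the quoted example corresponds to encoding 1080p at roughly 240 frames per second, or about 8 times faster than real time.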
Improvements of Kepler
  Access to up to 255 registers per thread (compared to 63 for Fermi)
  Removal of the shader clock
    Fermi used a shader clock, typically 2x the GPU clock
      Achieves higher throughput
      Uses more power
    Kepler runs off the GPU clock
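
More registers per thread come at a cost: the register file is shared, so fewer threads can be resident at once. The sketch below estimates that limit for one SMX. The 2048-thread cap is an assumption not stated in these slides, and real hardware allocates registers at a coarser granularity, so treat the results as approximate.

```python
REGISTERS_PER_SMX = 65_536   # from the SMX Architecture slide
MAX_THREADS_PER_SMX = 2048   # assumption: resident-thread cap, not from the slides
WARP_SIZE = 32

def resident_threads(registers_per_thread):
    """Estimate how many threads fit on one SMX for a given register budget."""
    by_registers = REGISTERS_PER_SMX // registers_per_thread
    # Threads are scheduled in whole warps, so round down to a multiple of 32.
    whole_warps = by_registers // WARP_SIZE * WARP_SIZE
    return min(whole_warps, MAX_THREADS_PER_SMX)

print(resident_threads(32))   # 2048
print(resident_threads(63))   # 1024
print(resident_threads(255))  # 256
```
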
Cont.
  Up to 4 displays on one card
  4K support
  GPU Boost
    Dynamically scales the GPU clock based on operating conditions
  Adaptive V-sync
    Turns off V-sync when frames per second drop below 60
    Turns on V-sync when above 60 fps
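
The Adaptive V-sync rule above reduces to a single comparison. In this sketch, `vsync_enabled` is a hypothetical function name, and treating exactly 60 fps as "on" is an assumption, since the slide does not say what happens at the threshold.

```python
def vsync_enabled(current_fps, threshold_fps=60):
    """Return True when V-sync should be on under the adaptive rule."""
    # Assumption: a frame rate exactly at the threshold keeps V-sync on.
    return current_fps >= threshold_fps

print(vsync_enabled(45))  # False
print(vsync_enabled(75))  # True
```
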
Cont.
  FXAA (Fast Approximate Anti-Aliasing)
    Comparable sharpness to MSAA (Multisample Anti-Aliasing)
    Uses less computation power
    Smooths edges using pixels rather than the 3D model
Cont.
  TXAA (Temporal Anti-Aliasing)
    Mix of hardware anti-aliasing, a custom CG film-style AA resolve, and a high-quality resolve filter designed to work with the HDR-correct post-processing pipeline
    TXAA 1 offers visual quality on par with 8xMSAA with the performance hit of 2xMSAA
    TXAA 2 offers image quality superior to 8xMSAA, with performance comparable to 4xMSAA
Benchmarks
In Conclusion
  Improve performance
  Improve energy efficiency
  Many hands make light work
Maxwell
  28 nm TSMC
  Early 2014 (version 1)
  Late 2014 (version 2, current version)
    GTX 980, 970
  New SM architecture (SMM)
    Efficiency: more active threads per SMM
  Larger shared memory
  Larger L2 cache
Questions?