Early 3D Graphics. NVIDIA Corporation Perspective study of a chalice Paolo Uccello, circa 1450


Damon Davidson
5 years ago

3 Early 3D Graphics Perspective study of a chalice Paolo Uccello, circa 1450

4 Early Graphics Hardware Artist using a perspective machine Albrecht Dürer, 1525 Perspective study of a chalice Paolo Uccello, circa 1450

5 Early Electronic Graphics Hardware SKETCHPAD: A Man-Machine Graphical Communication System Ivan Sutherland, 1963

6 The Graphics Pipeline The Geometry Engine: A VLSI Geometry System for Graphics Jim Clark, 1982

7 The Graphics Pipeline: Vertex Transform & Lighting → Triangle Setup & Rasterization → Texturing & Pixel Shading → Depth Test & Blending

12 The Graphics Pipeline (Vertex → Rasterize → Pixel → Test & Blend). Key abstraction of real-time graphics. Hardware used to look like this: one chip/board per stage, fixed data flow through pipeline.

13 SGI RealityEngine (1993): Vertex → Rasterize → Pixel → Test & Blend. Kurt Akeley. RealityEngine Graphics, SIGGRAPH 93.

14 SGI InfiniteReality (1997): Vertex → Rasterize → Pixel → Test & Blend. Montrym et al. InfiniteReality: A real-time graphics system, SIGGRAPH 97.

15 The Graphics Pipeline (Vertex → Rasterize → Pixel → Test & Blend). Remains a useful abstraction. Hardware used to look like this.

16 The Graphics Pipeline (Vertex → Rasterize → Pixel → Test & Blend). Hardware used to look like this. Vertex and pixel processing became programmable. Code listing (partially recoverable):
// Each thread performs one pair-wise addition
...
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}

17 The Graphics Pipeline (Vertex → Geometry → Rasterize → Pixel → Test & Blend). Hardware used to look like this. Vertex and pixel processing became programmable. New stages added. [Code listing as on slide 16]

18 The Graphics Pipeline (Vertex → Tessellation → Geometry → Rasterize → Pixel → Test & Blend). Hardware used to look like this. Vertex and pixel processing became programmable. New stages added. GPU architecture increasingly centers around shader execution. [Code listing as on slide 16]

19 Modern GPUs: Unified Design. Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core. [Diagram labels: Discrete Design with Shader A, Shader B, Shader C, Shader D; Unified Design with a single Shader Core; ibuffer and obuffer blocks]

20 GeForce 8: Modern GPU Architecture [Block diagram labels: Host, Input Assembler, Setup / Rasterize, Geom Thread Issue, Pixel Thread Issue, TF, L1, L2]

22 Modern GPU Architecture: GT200 [Block diagram labels: Setup / Rasterize, Geom Thread Issue, Pixel Thread Issue, Thread Scheduler, TF, L1, L2]

23 Next-Gen GPU Architecture: Fermi. NVIDIA next-gen "Fermi" architecture. [Block diagram labels: Host I/F, Giga Thread, DRAM I/F (x6), L2]

24 GPUs Today. Lessons from Graphics Pipeline: Throughput is paramount. Create, run, & retire lots of threads very rapidly. Use multithreading to hide latency. [Chart, 1995 to 2010: RIVA 128, 3M xtors; GeForce 256, 23M xtors; GeForce 3, 60M xtors; GeForce FX, 125M xtors; GeForce 8800, 681M xtors; "Fermi", 3B xtors]

26 Perspective 1 TeraFLOP in 1993

27 The "New" Moore's Law. Computers no longer get faster, just wider. You must re-think your algorithms to be parallel! Data-parallel computing is most scalable solution.

28 Why GPU Computing? Throughput architecture → massive parallelism. Massive parallelism → computational horsepower. Fact: nobody cares about theoretical peak. Challenge: harness GPU power for real application performance. [Chart: GFLOPS, GPU vs. CPU]

29 CUDA Successes: 146X Medical Imaging (U of Utah); 36X Molecular Dynamics (U of Illinois); 18X Video Transcoding (Elemental Tech); 50X Matlab Computing (AccelerEyes); 100X Astrophysics (RIKEN); 149X Financial simulation (Oxford); 47X Linear Algebra (Universidad Jaime); 20X 3D Ultrasound (Techniscan); 130X Quantum Chemistry (U of Illinois); 30X Gene Sequencing (U of Maryland)

30 Accelerating Insight. CPU only → heterogeneous with Tesla GPU: 4.6 days → 27 minutes; 2.7 days → 30 minutes; 8 hours → 13 minutes; 3 hours → 16 minutes

32 GPU Computing 1.0 (Ignoring prehistory: Ikonas, Pixel Machine, Pixel-Planes…). GPU Computing 1.0: compute pretending to be graphics. Disguise data as textures or geometry. Disguise algorithm as render passes. → Trick graphics pipeline into doing your computation! Term GPGPU coined by Mark Harris.

33 Typical GPGPU Constructs

34 Typical GPGPU Constructs

35 GPGPU Hardware & Algorithms. GPUs get progressively more capable: fixed-function → register combiners → shaders; fp32 pixel hardware greatly extends reach. Algorithms get more sophisticated: cellular automata → PDE solvers → ray tracing; clever graphics tricks. High-level shading languages emerge.

36 GPU Computing Enter CUDA GPU Computing 2.0: direct compute Program GPU directly, no graphics-based restrictions GPU Computing supplants graphics-based GPGPU November 2006: NVIDIA introduces CUDA

37 CUDA In One Slide. Thread: per-thread local memory. Block: per-block shared memory, local barrier. Kernel foo() … global barrier … Kernel bar() …: per-device global memory.
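The hierarchy on this slide (thread, block, kernel, each with its own memory and barrier) can be illustrated with a small block-level sum. This is a minimal sketch, not code from the deck: the kernel name `block_sum`, the helper `sum_of_ones`, and the block size of 256 are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block; assumed to be a power of two

// Each block sums up to THREADS consecutive inputs into one partial result.
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float tile[THREADS];              // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;            // per-thread local value
    tile[threadIdx.x] = v;
    __syncthreads();                             // local barrier within the block

    // Tree reduction in shared memory; every thread reaches each barrier.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];           // per-device global memory
}

// Hypothetical host helper: sums n ones on the GPU and returns the total.
float sum_of_ones(int n)
{
    int blocks = (n + THREADS - 1) / THREADS;
    std::vector<float> h_in(n, 1.0f), h_partial(blocks);
    float *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, THREADS>>>(d_in, d_partial, n);
    // The blocking copy waits for the kernel to finish, so kernel
    // completion plays the role of the global barrier here.
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : h_partial) total += p;        // finish the sum on the host
    return total;
}

int main()
{
    printf("sum = %.1f\n", sum_of_ones(1000));
    return 0;
}
```

Blocks cannot synchronize with each other inside a kernel in this sketch, which is why the per-block partial sums are combined only after the kernel has completed.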

38 CUDA C Example

Serial C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)  y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
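A parallel kernel such as the one on this slide also needs host-side setup that a single slide leaves out: device allocation, copies in each direction, and cleanup. The following is a minimal sketch of that surrounding code for SAXPY (y = a*x + y); the wrapper name `saxpy_on_gpu` and the test values are assumptions, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical host wrapper: runs y = a*x + y on the GPU, updating hy in place.
void saxpy_on_gpu(float a, const std::vector<float> &hx, std::vector<float> &hy)
{
    int n = (int)hx.size();
    size_t bytes = n * sizeof(float);
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;               // round up to cover all n elements
    saxpy_parallel<<<nblocks, 256>>>(n, a, dx, dy);

    // Blocking copy back; it also waits for the kernel to finish.
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}

int main()
{
    std::vector<float> x(1000, 1.0f), y(1000, 3.0f);
    saxpy_on_gpu(2.0f, x, y);                    // every element becomes 2*1 + 3
    printf("y[0] = %.1f, y[999] = %.1f\n", y[0], y[999]);
    return 0;
}
```

The `if (i < n)` guard matters because the grid is rounded up to a whole number of 256-thread blocks, so the last block may contain threads with no element to process.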

39 Heterogeneous Programming. Use the right processor for the right job. Serial Code → Parallel Kernel: foo<<< nBlk, nTid >>>(args); → Serial Code → Parallel Kernel: bar<<< nBlk, nTid >>>(args); …

40 GPU Computing: An Ecosystem. GPU Computing 3.0: an emerging ecosystem. Hardware & product lines; Education & research; Algorithmic sophistication; Consumer applications; Cross-platform standards; High-level languages. GPU begins to "vanish" behind codes, platforms, apps.

41 Products. CUDA is in products from laptops to supercomputers.

42 Emerging HPC Products. New class of hybrid GPU-CPU servers. SuperMicro 1U GPU Server: 2 Tesla M1060 GPUs. Bull Bullx Blade Enclosure: up to 18 Tesla M1060 GPUs.

43 Algorithmic Sophistication: Sort; Sparse matrix; Hash tables; Fast multipole method; Ray tracing (parallel tree traversal). CPU results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al., Supercomputing 2007

44 Cross-Platform Standards. GPU Computing Applications: CUDA C, OpenCL™, Direct Compute, CUDA Fortran, Java, Python, .NET, MATLAB … NVIDIA GPU with the CUDA Parallel Computing Architecture. OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

45 CUDA Momentum. Over 250 universities teach CUDA. Over 1000 research papers. CUDA-powered TSUBAME: 29th fastest supercomputer in the world. 180 million CUDA GPUs. 100,000 active developers. Over 500 CUDA apps.

46 thrust. Data structures: thrust::device_vector, thrust::host_vector, thrust::device_ptr, etc. Algorithms: thrust::sort, thrust::reduce, thrust::exclusive_scan, etc. thrust: a library of data-parallel algorithms & data structures with an interface similar to the C++ Standard Template Library, for CUDA. C++ template metaprogramming automatically chooses the fastest code path at compile time.

47 thrust::sort (sort.cu)

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;

    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}
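The other algorithms named on the previous slide follow the same iterator-based pattern as the sort example. As a minimal sketch (the helper names `sum_1_to_n` and `last_exclusive_offset` are hypothetical, chosen for illustration), a reduction and an exclusive prefix sum over device data might look like this:

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

// Hypothetical helper: sum of 1..n, computed on the device.
int sum_1_to_n(int n)
{
    thrust::device_vector<int> d(n);
    thrust::sequence(d.begin(), d.end(), 1);          // d = 1, 2, ..., n
    return thrust::reduce(d.begin(), d.end(), 0);     // default operator is plus
}

// Hypothetical helper: last element of the exclusive prefix sum of 1..n,
// which equals 1 + 2 + ... + (n - 1).
int last_exclusive_offset(int n)
{
    thrust::device_vector<int> d(n), offsets(n);
    thrust::sequence(d.begin(), d.end(), 1);
    thrust::exclusive_scan(d.begin(), d.end(), offsets.begin());
    return offsets[n - 1];                            // copies one int back to the host
}

int main()
{
    printf("reduce: %d\n", sum_1_to_n(8));
    printf("last scan offset: %d\n", last_exclusive_offset(8));
    return 0;
}
```

Reading a single element with `offsets[n - 1]` triggers a small device-to-host transfer, so in real code it is usually better to copy whole ranges at once.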

48 GPU Begins to Vanish. Ever-increasing number of codes & commercial. Some bio/chem codes available or porting: NAMD / VMD, GROMACS (alpha), HOOMD, GPU HMMER, MUMmerGPU, AutoDock3, LAMMPS, CHARMM, Q-Chem, Gaussian, AMBER… Consumer applications, e.g. Photoshop, Badaboom, vreveal, Nero MoveIt. OS: Microsoft Windows 7, MacOS Snow Leopard.

50 Next-Gen GPU Architecture: Fermi 3 billion transistors Over 2x the cores (512 total) ~2x the memory bandwidth L1 and L2 caches 8x the peak DP performance ECC C++

51 SM Microarchitecture. Objective: optimize for GPU computing. New ISA. Revamp issue / control flow. New CUDA core architecture. 32 cores per SM (512 total). 64KB of configurable L1$ / shared memory. [Block diagram labels: Instruction Cache; Scheduler and Dispatch (x2); Register File; Core (x32); Load/Store Units x 16; Special Func Units x 4; Interconnect Network; 64K Configurable Cache/Shared Mem; Uniform Cache] [Table header: FP32, FP64, INT, SFU, LD/ST; Ops / clk]

52 SM Microarchitecture. New IEEE 754-2008 arithmetic standard. Fused Multiply-Add (FMA) for SP & DP. New integer ALU optimized for 64-bit and extended precision ops. [Diagram labels: CUDA Core with Dispatch Port, Operand Collector, FP Unit, INT Unit, Result Queue; SM with Instruction Cache, Scheduler and Dispatch (x2), Register File, Core (x32), Load/Store Units x 16, Special Func Units x 4, Interconnect Network, 64K Configurable Cache/Shared Mem, Uniform Cache]
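The practical effect of a fused multiply-add is that a*b + c is rounded once instead of twice. The sketch below is an illustration under stated assumptions rather than anything from the deck: the input values are picked so that the exact product needs one more bit than a float holds, and the names `fma_demo` and `run_fma_demo` are hypothetical. It uses the CUDA intrinsics `__fmul_rn` and `__fadd_rn` for the unfused path because the compiler does not merge them into an FMA.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void fma_demo(float a, float c, float *out)
{
    out[0] = fmaf(a, a, c);                    // fused: a*a + c, rounded once
    out[1] = __fadd_rn(__fmul_rn(a, a), c);    // unfused: product rounded, then sum rounded
}

// Hypothetical host helper: result[0] is the fused value, result[1] the unfused one.
void run_fma_demo(float result[2])
{
    // a = 1 + 2^-12, so the exact a*a = 1 + 2^-11 + 2^-24.
    // The 2^-24 term does not fit in a float and is lost when the
    // product is rounded on its own.
    float a = 1.0f + ldexpf(1.0f, -12);
    float c = -(1.0f + ldexpf(1.0f, -11));
    float *d_out;
    cudaMalloc(&d_out, 2 * sizeof(float));
    fma_demo<<<1, 1>>>(a, c, d_out);
    cudaMemcpy(result, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

int main()
{
    float r[2];
    run_fma_demo(r);
    printf("fused   = %g\n", r[0]);            // keeps the 2^-24 term
    printf("unfused = %g\n", r[1]);            // the term was rounded away
    return 0;
}
```

With these inputs the fused result is exactly 2^-24 while the unfused result is 0, which shows how FMA can preserve low-order bits that a separate multiply discards.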

53 Hardware Thread Scheduling. Concurrent kernel execution + faster context switch. [Timeline diagram: under Serial Kernel Execution, Kernels 1 to 5 run one after another; under Parallel Kernel Execution, they overlap in time]
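From the programmer's side, concurrent kernel execution is requested by launching independent kernels into different CUDA streams. The following is a minimal sketch; the kernel `fill` and the helper `run_two_streams` are hypothetical names. Streams only permit overlap: whether the two kernels actually run at the same time depends on the hardware and on available resources.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void fill(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical helper: launches two independent kernels in separate streams
// and verifies both results. Returns true on success.
bool run_two_streams(int n)
{
    cudaStream_t stream[2];
    int *d[2];
    for (int k = 0; k < 2; ++k) {
        cudaStreamCreate(&stream[k]);
        cudaMalloc(&d[k], n * sizeof(int));
    }

    int blocks = (n + 255) / 256;
    for (int k = 0; k < 2; ++k)                 // 4th launch parameter selects the stream
        fill<<<blocks, 256, 0, stream[k]>>>(d[k], n, k + 1);

    cudaDeviceSynchronize();                    // wait for both streams

    bool ok = true;
    std::vector<int> h(n);
    for (int k = 0; k < 2; ++k) {
        cudaMemcpy(h.data(), d[k], n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i)
            if (h[i] != k + 1) ok = false;
        cudaFree(d[k]);
        cudaStreamDestroy(stream[k]);
    }
    return ok;
}

int main()
{
    printf("two streams %s\n", run_two_streams(1 << 16) ? "ok" : "FAILED");
    return 0;
}
```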

54 More Fermi Goodness. ECC protection for DRAM, L2, L1, RF. Unified 40-bit address space for local, shared, global. 5-20x faster atomics. Dual DMA engines for CPU ↔ GPU transfers. ISA extensions for C++ (e.g. virtual functions).
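Atomic operations matter for kernels in which many threads update the same memory location. A histogram is a common case, sketched here under assumptions: the four-bin layout, the kernel name `histogram`, and the helper `histogram_on_gpu` are illustrative choices, not taken from the deck.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 4   // assumed bin count for illustration

// Grid-stride loop: each thread handles several elements and uses atomicAdd
// so that concurrent increments of the same bin are not lost.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&bins[data[i] % NUM_BINS], 1u);
}

// Hypothetical host helper: returns the NUM_BINS counts for h_data.
std::vector<unsigned int> histogram_on_gpu(const std::vector<unsigned char> &h_data)
{
    int n = (int)h_data.size();
    unsigned char *d_data;
    unsigned int *d_bins;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histogram<<<8, 128>>>(d_data, n, d_bins);

    std::vector<unsigned int> h_bins(NUM_BINS);
    cudaMemcpy(h_bins.data(), d_bins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bins);
    return h_bins;
}

int main()
{
    std::vector<unsigned char> data(4096);
    for (int i = 0; i < 4096; ++i) data[i] = (unsigned char)(i % NUM_BINS);
    std::vector<unsigned int> bins = histogram_on_gpu(data);
    for (int b = 0; b < NUM_BINS; ++b) printf("bin %d: %u\n", b, bins[b]);
    return 0;
}
```

With only four bins, contention on each counter is high, which is exactly the situation where faster hardware atomics help most.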

56 Key CUDA Challenges. Express other programming models elegantly. Producer-consumer: persistent thread blocks reading and writing work queues. Task-parallel: kernel, thread block or warp as parallel task. Both common "power user" CUDA patterns. Foster more high-level languages & platforms. Improve & mature development environment.
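The persistent-thread pattern mentioned above can be sketched in its simplest form: a small, fixed set of threads stays resident and repeatedly claims work items from a shared counter instead of being assigned one item each. This is only an illustration of the consumer side under assumptions; the queue here is just an index range, and the names `next_item`, `persistent_workers`, and `run_persistent_workers` are hypothetical. A full producer-consumer queue would also need threads to append work safely.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ unsigned int next_item;   // head of the implicit work queue 0..n-1

// Each thread loops, atomically claiming the next unprocessed index,
// until the queue is exhausted.
__global__ void persistent_workers(unsigned int n, unsigned int *out)
{
    while (true) {
        unsigned int i = atomicAdd(&next_item, 1u);   // claim one work item
        if (i >= n) break;                            // nothing left to do
        out[i] = i * i;                               // stand-in for real work
    }
}

// Hypothetical host helper: processes n items with 256 resident threads
// and checks every result. Returns true on success.
bool run_persistent_workers(unsigned int n)
{
    unsigned int zero = 0, *d_out;
    cudaMemcpyToSymbol(next_item, &zero, sizeof(zero));   // reset the queue head
    cudaMalloc(&d_out, n * sizeof(unsigned int));

    persistent_workers<<<4, 64>>>(n, d_out);              // far fewer threads than items

    std::vector<unsigned int> h(n);
    cudaMemcpy(h.data(), d_out, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    for (unsigned int i = 0; i < n; ++i)
        if (h[i] != i * i) return false;
    return true;
}

int main()
{
    printf("persistent workers %s\n", run_persistent_workers(10000) ? "ok" : "FAILED");
    return 0;
}
```

Because items are claimed dynamically, threads that finish cheap items early simply pick up more, which is the load-balancing benefit this pattern is used for.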

57 Key GPU Workloads: Computational graphics; Scientific and numeric computing; Image processing (video & images); Computer vision; Speech & natural language; Machine learning

58 The Future of Computing? Forward-looking statements: All future interesting problems are throughput problems. GPUs will evolve to be the general-purpose throughput processors. CPUs as we know them will become (already are?)

59 Final Thoughts: Education. We should teach parallel computing in CS 1 or CS 2. Computers don't get faster, just wider now. Manycore is the future of computing. Heapsort and mergesort: both O(n lg n); one parallel-friendly, one not. Students need to understand this early.

61 Miscellaneous Thoughts

62 NVIDIA Resources Available. NVIDIA enables heterogeneous computing research and teaching: Grants for leading researchers in GPU Computing; Ph.D. Fellowship program; Professor Partnership program; Discounted hardware; GPU Computing Ventures (venture fund focused on GPU computing); Technical Training and Assistance
