The End of Denial Architecture and The Rise of Throughput Computing

Size: px

Start display at page:

Download "The End of Denial Architecture and The Rise of Throughput Computing"

Peter Lane
5 years ago
Views:

1 The End of Denial Architecture and The Rise of Throughput Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University

2 Outline! Performance = Parallelism Efficiency = Locality! Single-thread processors are in denial about these two facts! We are entering the era of Throughput Computing! Stream Processors! Parallel arithmetic units! Exposed storage hierarchy! Stream Programming! Bulk operations! Scaling to ExaScale

3 Single-threaded processor performance is no longer scaling

4 Moore s Law! In 1965 Gordon Moore predicted the number of transistors on an integrated circuit would double every year.! Later revised to 18 months! Also predicted L 3 power scaling for constant function! No prediction of processor performance! Advanceds in architecture turn device scaling into performance scaling Moore, Electronics 38(8) April 19, 1965

5 Architecture Applications

6 Discontinuity 1 The End of ILP Scaling 1e+7 1e+6 Perf (ps/inst) 1e+5 1e+4 1e+3 1e+2 1e+1 19%/year 1e Dally et al. The Last Classical Computer, ISAT Study, 2001

7 Future potential of novel architecture is large (1000 vs 30) 1e+7 1e+6 1e+5 1e+4 Perf (ps/inst) Linear (ps/inst) 1e+3 1e+2 1e+1 1e+0 30:1 19%/year 1,000:1 1e-1 1e-2 1e-3 30,000:1 1e Dally et al. The Last Classical Computer, ISAT Study, 2001

8 Single-Thread Processor Performance vs Calendar Year Source: Hennessy & Patterson, CAAQA, 4th Edition

9 Single-threaded processor performance is no longer scaling Performance = Parallelism

10 Chips are power limited and most power is spent moving data

11 Discontinuity 2 The End of L 3 Power Scaling Gordon Moore, ISSCC 2003

12 In the past we had constant-field scaling L = L/2 V = V/2 E = CV 2 = E/8 f = 2f D = 1/L 2 = 4D P = P Halve L and get 8x the capability for the same power

13 Now voltage is held nearly constant L = L/2 V = V E = CV 2 = E/2 f = 2f D = 1/L 2 = 4D P = 4P Halve L and get 2x the capability for the same power in ¼ the area

14 CMOS Chip is our Canvas 20mm

15 4,000 64b FPUs fit on a chip 64b FPU 20mm 0.1mm 2 50pJ/op 1.5GHz

16 Moving a word across die = 10FMAs Moving a word off chip = 20FMAs 64b FPU 0.1mm 2 50pJ/op 1.5GHz 20mm 64b 1mm Channel 25pJ/word 10mm 250pJ, 4cycles 64b Off-Chip Channel 1nJ/word 64b Floating Point

17 Chips are power limited Most power is spent moving data Efficiency = Locality

18 Performance = Parallelism Efficiency = Locality

Requires efficient gather/scatter! Increasingly complex models!

19 Scientific Applications! Large data sets! Lots of parallelism! Increasingly irregular (AMR)! Irregular and dynamic data structures! Requires efficient gather/scatter! Increasingly complex models! Lots of locality! Global solution sometimes bandwidth limited! Less locality in these phases

20 Embedded Applications! Codecs, modems, image processing, etc! Lots of data (pixels, samples, etc )! Lots of parallelism! Increasingly complex! Lots of parallelism! Lots of data dependent control! High arithmetic intensity! Lots of locality

21 Performance = Parallelism Efficiency = Locality Fortunately, most applications have lots of both. Amdahl s law doesn t apply to most future applications.

22 Exploiting parallelism and locality requires: Many efficient processors (To exploit parallelism) An exposed storage hierarchy (To exploit locality) A programming system that abstracts this Denial is very inefficient

23 Tree-structured machines L3 Net L2 Net L1 L1 L1 L1 P P P P

Reduces demand, increases utilization Read-Only Table Lookup Data (Master Element) Compute Flux States

24 Optimize use of scarce bandwidth! Provide rich, exposed storage hierarchy! Explicitly manage data movement on this hierarchy! Reduces demand, increases utilization Read-Only Table Lookup Data (Master Element) Compute Flux States Compute Numerical Flux Gather Cell Compute Cell Interior Advance Cell Element Faces Face Geometry Numerical Flux Cell Geometry Elements (Current) Elements (New) Gathered Elements Cell Orientations

25 Predictability enables efficient static scheduling One iteration SW Pipeline ComputeCellInt kernel from StreamFem3D Over 95% of peak with simple hardware Depends on explicit communication to make delays predictable

26 4/6/09 Some Example Stream Processors

Imagine Cell 4/6/09 Forward ECC 16 RDRAM

$ bank $ bank $ bank Cluster Cluster Cluster

Cluster Cluster Cluster Mips64 20kc Cluster

Gen Addres s Gen Addres s Gen Reorder Buf fer

27 Imagine Cell 4/6/09 Forward ECC 16 RDRAM Interfaces $ bank $ bank $ bank $ bank $ bank $ bank $ bank $ bank Cluster Cluster Cluster Cluster Cluster Cluster Network Cluster Microcontroller 10.2 mm Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Mips64 20kc Cluster Mips64 20kc Mem switc h Addres s Gen Addres s Gen Addres s Gen Addres s Gen Reorder Buf fer Reorder Buf fer Example Stream Processors Merrimac 12.5 mm Nvidia GT200 Storm-1

28 Merrimac ! 64 64b FP MADD units! 16 clusters of 4 units each! Stream cache! Shared memory Chip Crossing(s) 10kχ switch 1kχ switch LRF 100χ wire! Via network! Scatter-add DRAM Bank Cache Bank SRF Lane CL SW LRF 16GB/s Chip Pins and Router 64GB/s M SW 512GB/s 3,840GB/s LRF DRAM Bank Cache Bank SRF Lane CL SW LRF

29 Merrimac Application Results Application Developed Compiled Sustained GFLOP/s StreamFEM Brook StreamC 69 Speedup over Pentium4 Efficiency vs. Pentium4 StreamMD C StreamC StreamFLO Brook Metacompile d 13 StreamCDP Fortran StreamC 8 DGEMM StreamC StreamC CONV2D StreamC StreamC 79 FFT3D StreamC StreamC StreamSPAS C StreamC

30 Merrimac Bandwidth Hierarchy Application Sustained GFLOPS FP Ops / Mem Ref LRF Refs SRF Refs Mem Refs StreamFEM3D (Euler, quadratic) StreamFEM3D (MHD, constant) M (95.0%) M (99.4%) GROMACS 50.1 * * 108M (StreamMD) (95.0%) StreamFLO 12.9 * 7.4 * 234.3M (95.7%) StreamSPAS 6.2 (80% BW) M (57.6%) Dense mat-mul 116 (90% FLOPS) M (93.1%) StreamCDP M (58.9%) 6.3M (3.9%) 7.7M (0.4%) 4.2M (2.9%) 7.2M (2.9%) 1.9M (28.5%) 17.1M (6.6%) 1.9M (25.8%) 1.8M (1.1%) 2.8M (0.2%) 1.5M (1.3%) 3.4M (1.4%) 0.93M (13.9%) 0.6M (0.2%) 1.1M (15.3%) * Dominated by divide and square-root operations

31 Fermi

32 Fermi is a throughput computer! 512 efficient cores Giga Thread! Rich storage hierarchy! Shared memory! L1! L2! GDDR5 DRAM

33 Exploiting parallelism and locality requires: Many efficient processors (To exploit parallelism) An exposed storage hierarchy (To exploit locality) A programming system that abstracts this

34 4/6/09 Stream Programming

35 Stream Programming: Parallelism, Locality, and Predictability! Parallelism! Data parallelism across stream elements! Task parallelsm across kernels! ILP within kernels! Locality! Producer/consumer! Within kernels! Predictability! Enables scheduling K2 K1 K4 K3 4/6/09

36 Evolution of Stream Programming 1997 StreamC/KernelC Break programs into kernels and streams Kernels operate only on input/output streams and locals Communication scheduling and stream scheduling 2002 Brook Continues the construct of streams and kernels Hides underlying details Too one-dimensional 2007 Sequoia Generalizes kernels to tasks Tasks operate on local data Local data gathered in an arbitrary way Inner tasks subdivide, leaf tasks compute Machine-specific details factored out

37 CUDA as a Stream Language! Explicit control of the memory hierarchy with shared memory shared float a[size] ;! Also enables communication between threads of a CTA! Transfer data up and down the hierarchy! Operate on data with kernels! But does allow access to arbitrary data within a kernel

38 Examples 146X 36X 19X 17X 100X 149X 47X 20X 24X 30X

39 Ease of Programming Source: Nicolas Pinto, MIT

40 Stream Programming (Sequoia view)! Express computation as tree of tasks! Inner tasks (streams) move data to subdivide the problem! Leaf tasks (kernels) solve local instance of the problem! Makes leaf tasks completely predictable! Enables static optimization! Enables strategic optimization! Stream scheduling! Tuning of data movement Node memory matmul inner Aggregate LS matmul inner LS 0 matmul leaf FU LS 7 matmul leaf FU

41 Results Horizontal portability Scalar SMP Disk Cluster Cell PS3 SAXPY SGEMV SGEMM CONV2D FFT3D * GRAVITY HMMER * *Problem size reduced to fit * Reduced dataset size to fit in memory

42 System Utilization 100 SAXPY SGEMV SGEMM CONV2D FFT3D GRAVITY HMMER Percentage of Runtime 0 SMP Disk Cluster Cell PS3 42

43 Resource Utilization IBM Cell 100 Bandwidth utilization Compute utilization 0 Resource Utilization (%)

44 Performance of Auto-tuner Conv2D SGEMM FFT3D SUmb Cell Auto Hand Cluster Auto Hand Cluster of PS3s Auto Hand Measured Raw Performance of Benchmarks: auto-tuner vs. hand-tuned version in GFLOPS. For FFT3D, performances is with fusion of leaf tasks. SUmb is too complicated to be hand-tuned.

45 Throughput computing must evolve to meet the challenges of ExaScale computing * This section is a projection based on Moore s law and does not represent a committed roadmap

46 CPU scaling ends, GPU continues Source: Hennessy & Patterson, CAAQA, 4th Edition 2016

47 DARPA Study Indentifies four challenges for ExaScale Computing Report published September 28, 2008:! Four Major Challenges Energy and Power challenge Memory and Storage challenge Concurrency and Locality challenge Resiliency challenge! Number one issue is power Extrapolations of current architectures and technology indicate over 100MW for an Exaflop! Power also constrains what we can put on a chip Available at

48 Energy and Power Challenge! Heterogeneous architecture! A few latency-optimized processors! Many (100s-1,000s) throughput-optimized processors! Which are optimized for ops/j! Efficient processor architecture! Simple control in-order multi-threaded! SIMT execution to amortize overhead! Agile memory system to capture locality! Keeps data and instruction access local! Optimized circuit design! Minimize energy/op! Minimize cost of data movement * This section is a projection based on Moore s law and does not represent a committed roadmap

49 Concurrency and Locality Challenge! To exploit locality! Agile memory hierarchy! Exposed storage hierarchy! Can manage automatically (as a cache) or explicitly (as a scratchpad)! Virtualizable! Efficient communication and synchronization! To move data up and down the storage hierarchy! Move computation to data! Fast active messages! Programming system support! Shared memory! Explicit hierarchy! Affinity and places * This section is a projection based on Moore s law and does not represent a committed roadmap

50 Agile Memory Minimizes Communication L3 Net L2 Net L1 L1 L1 L1 P P P P

51 Vertical locality is far more critical than horizontal locality * This section is a projection based on Moore s law and does not represent a committed roadmap

52 Concurrency and Locality Challenge! To expose billion-fold parallelism! Fine-grain threads! Hardware thread launch! Cooperative thread arrays! Nested parallelism! Efficient synchronization and communication! Shared memory! Agile memory hierarchy! Fast message-driven computing! Atomic operations! Productive, portable, parallel programming! CUDA and its evolution * This section is a projection based on Moore s law and does not represent a committed roadmap

53 An NVIDIA ExaScale Machine in 2017! 2017 GPU Node 300W! 2,400 throughput cores (7,200 FPUs), 16 CPUs single chip! 40TFLOPS (SP) 13TFLOPS (DP)! Deep, explicit on-chip storage hierarchy! Fast communication and synchronization! Node Memory! 128GB DRAM, 2TB/s bandwidth! 512GB Phase-change/Flash for checkpoint and scratch! Cabinet 100kW! 384 Nodes 15.7PFLOPS (SP), 50TB DRAM! Dragonfly network 1TB/s per node bandwidth! System 10MW! RAS! 128 Cabinets 2 EFLOPS (SP), 6.4 PB DRAM! Distributed EB disk array for file system! Dragonfly network with active optical links! ECC on all memory and links! Option to pair cores for self-checking (or use application-level checking)! Fast local checkpoint * This section is a projection based on Moore s law and does not represent a committed roadmap

54 Conclusion

55 Conclusion! Single-thread performance is no longer scaling! Performance = Parallelism! Efficiency = Locality! Applications have lots of both! Machines need lots of cores (parallelism) and an exposed storage hierarchy (locality)! A programming system must abstract this! Reaching an ExaScale requires evolving throughput computing! Agile memory! Energy efficient cores and communications! Efficient parallel mechanisms

The Future of GPU Computing

The Future of GPU Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University November 18, 2009 The Future of Computing Bill Dally Chief Scientist