The End of Denial Architecture and The Rise of Throughput Computing
1 The End of Denial Architecture and The Rise of Throughput Computing. Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA; Bell Professor of Engineering, Stanford University. May 19, 2009, Async Conference, UNC.
2 Outline: Performance = Parallelism; Efficiency = Locality; single-thread processors are in denial about these two facts; we are entering the era of Throughput Computing. Stream Processors (parallel arithmetic units, exposed storage hierarchy); Stream Programming (bulk operations); GPU Computing.
3 Moore's Law: In 1965 Gordon Moore predicted the number of transistors on an integrated circuit would double every year (later revised to 18 months). He also predicted L³ power scaling for constant function, but made no prediction of processor performance. Advances in architecture turn device scaling into performance scaling. Moore, Electronics 38(8), April 19, 1965.
4 The Computer Value Chain: Moore's law gives us more transistors; architects turn these transistors into more performance; applications turn this performance into value for the user.
5 Discontinuity 1: The End of ILP Scaling. [Chart: performance (ps/inst) vs. calendar year on a log scale, 1e-0 to 1e+7, flattening to a 19%/year trend.] Dally et al., The Last Classical Computer, ISAT Study, 2001.
6 Future potential of novel architecture is large (1,000:1 vs 30:1). [Chart: extrapolations of performance (ps/inst) on a log scale, with trend lines marked 30:1, 1,000:1, and 30,000:1 against the 19%/year line.] Dally et al., The Last Classical Computer, ISAT Study, 2001.
7 Single-Thread Processor Performance vs Calendar Year Source: Hennessy & Patterson, CAAQA, 4th Edition
8 Technology Constraints
9 A CMOS Chip is our Canvas: a 20mm × 20mm die.
10 4,000 64b FPUs fit on a chip: a 64b FPU occupies 0.1 mm² and costs 50 pJ/op at 1.5 GHz on a 20mm die.
11 200,000 16b MACs fit on a chip: a 16b MAC occupies 0.002 mm² and costs 1 pJ/op at 1.5 GHz (vs. 0.1 mm² and 50 pJ/op for the 64b FPU).
12 Moving a word across the die costs the energy of 124 MACs or 10 FMAs; moving a word off chip costs 250 MACs or 20 FMAs. On-chip channels: 16b, 6 pJ/word per mm (62 pJ, 4 cycles per 10 mm); 64b, 25 pJ/word per mm (250 pJ, 4 cycles per 10 mm). Off-chip channels: 16b, 250 pJ/word; 64b, 1 nJ/word. (16b fixed point vs. 64b floating point on a 20mm die.)
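The slide's energy accounting can be reproduced as arithmetic. This is a sketch: all pJ figures are the slide's 2009-era numbers for a 20 mm die, and the 16b channel cost is taken as 6.2 pJ/mm (62 pJ per 10 mm on the slide).

```python
DIE_MM = 20
MAC16_PJ = 1.0            # 16b multiply-accumulate
FMA64_PJ = 50.0           # 64b floating-point op
CH16_PJ_PER_MM = 6.2      # 16b on-chip channel, per word per mm
CH64_PJ_PER_MM = 25.0     # 64b on-chip channel, per word per mm
OFF16_PJ = 250.0          # 16b off-chip transfer, per word
OFF64_PJ = 1000.0         # 64b off-chip transfer, per word

def ops_per_word_across_die():
    # how many ops cost the same energy as moving one word across the die
    macs = DIE_MM * CH16_PJ_PER_MM / MAC16_PJ     # ~124 MACs
    fmas = DIE_MM * CH64_PJ_PER_MM / FMA64_PJ     # 10 FMAs
    return macs, fmas

def ops_per_word_off_chip():
    # and how many for one word off chip
    return OFF16_PJ / MAC16_PJ, OFF64_PJ / FMA64_PJ   # 250 MACs, 20 FMAs
```

The ratios, not the absolute numbers, are the point: communication, not arithmetic, dominates the energy budget.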
13 Discontinuity 2: The End of L³ Power Scaling. Gordon Moore, ISSCC 2003.
14 Performance = Parallelism Efficiency = Locality
15 The End of Denial Architecture: single-thread processors are in denial about parallelism and locality. They provide two illusions: serial execution, which denies parallelism (they try to exploit parallelism with ILP, which has limited scalability); and flat memory, which denies locality (they try to provide the illusion with caches, which are very inefficient when the working set doesn't fit in the cache).
16 We are entering the era of Throughput Computing
17 Throughput computing is what matters now. Latency-optimized processors (most CPUs) are improving very slowly; little value is being delivered from the evolution of these processors. Throughput-optimized processors (like GPUs) are still improving at >70% per year, which drives new throughput applications that convert this performance to value. Going forward, throughput processors matter, not latency processors.
18 Applications
19 Scientific Applications: large data sets and lots of parallelism. Increasingly irregular (AMR), with irregular and dynamic data structures that require efficient gather/scatter. Increasingly complex models. Lots of locality, though the global solution is sometimes bandwidth-limited, with less locality in those phases.
20 Embedded Applications: codecs, modems, image processing, etc. Lots of data (pixels, samples, etc.) and lots of parallelism. Increasingly complex, with lots of data-dependent control. High arithmetic intensity and lots of locality.
21 Performance = Parallelism; Efficiency = Locality. Fortunately, most applications have lots of both. Amdahl's law doesn't apply to most future applications.
22 Stream Processor Architecture
23 Organize computation to optimize use of scarce bandwidth: minimize expensive data movement, keep scarce bandwidth resources busy, take advantage of plentiful arithmetic, and operate in parallel when data is local. Avoid denial architecture: don't hide parallelism or locality; a flat, serial model inhibits optimization.
24 Optimize use of scarce bandwidth: provide a rich, exposed storage hierarchy. [Diagram: Global Memory → switch → LM and CM → switches → RMs → switches → registers (R) → arithmetic units (A).]
25 Optimize use of scarce bandwidth: provide a rich, exposed storage hierarchy and explicitly manage data movement on this hierarchy, which reduces demand and increases utilization. [Dataflow diagram: kernels Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, and Advance Cell, connected by streams of element faces, face geometry, numerical flux, cell geometry, gathered elements, cell orientations, and current/new elements, plus read-only table lookup data (master element).]
26 Optimize use of scarce bandwidth: provide a rich, exposed storage hierarchy and explicitly manage data movement. This makes execution predictable and enables more arithmetic per unit bandwidth.
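The explicit-management idea on these slides can be sketched in a few lines: software stages a tile into a small local store, computes entirely out of it, and writes it back, so every DRAM transfer is a deliberate bulk operation that is trivial to count. The `Scratchpad` class and `TILE` size are illustrative names, not from the talk.

```python
TILE = 4

class Scratchpad:
    """A software-managed local store with explicit bulk transfers."""
    def __init__(self):
        self.buf = [0] * TILE
        self.dram_words_moved = 0   # explicit transfers are easy to count

    def load(self, dram, base):
        self.buf[:] = dram[base:base + TILE]
        self.dram_words_moved += TILE

    def store(self, dram, base):
        dram[base:base + TILE] = self.buf
        self.dram_words_moved += TILE

def scale_in_place(dram, k):
    """Multiply every element of 'dram' by k, tile by tile."""
    sp = Scratchpad()
    for base in range(0, len(dram), TILE):
        sp.load(dram, base)                  # bulk transfer in
        sp.buf = [k * x for x in sp.buf]     # all compute hits the local store
        sp.store(dram, base)                 # bulk transfer out
    return sp.dram_words_moved
```

Because every word moved is an explicit decision, demand on memory bandwidth is both minimized and predictable, which is the property the slide is after.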
27 Summary so far: arithmetic is cheap, bandwidth is expensive. Performance = parallelism; efficiency = locality; and most applications have lots of both. Optimize use of scarce bandwidth with a rich, exposed storage hierarchy and explicit, bulk transfers: this reduces demand, increases utilization, and enables more arithmetic and predictable execution.
28 What is a Stream Processor? Lots of arithmetic units, to exploit parallelism. A rich, exposed storage hierarchy, which makes frequent communication inexpensive and kernel execution predictable. Explicit bulk transfers (streams) and bulk operations (kernels): a load/store machine for streams. Simple control (SIMD × VLIW × threads) and simple synchronization yield low overhead. [Diagram: lanes 0-15, each with operand register files (ORFs), ALU0-ALU4, a COM unit, load/store units, a local switch, and a 16-KByte lane register file, connected by an inter-lane switch to a stream load/store unit and external DRAM.]
29 4/6/09 Some Example Stream Processors
30 Example Stream Processors: Imagine, Cell, Merrimac (12.5 mm die), NVIDIA GT200, and Storm-1. [Die photos and floorplans; the Imagine floorplan (10.2 mm) shows 16 clusters, a microcontroller, cache banks, 16 RDRAM interfaces with forward ECC, MIPS64 20kc host cores, a memory switch, address generators, and reorder buffers.]
31 Imagine: 48 32b FPUs (8 lanes of 6 32b FPUs each). Exposed storage hierarchy: LRFs, SRF, DRAM. Bulk operations, including gather/scatter and conditional streams. Applications: graphics (OpenGL, RTSL, Reyes, ray tracing), image processing, codecs, modems, network processing. 20× the performance/W of efficient DSPs in the same technology; parity with GPUs in the same technology.
32 Imagine Architecture: a host processor drives the Imagine processor's stream controller and microcontroller; SRF banks with ALU clusters connect to SDRAM memory systems and, via a network interface, to the Imagine network. [Block diagram.]
33 Imagine Architecture (cont.): each ALU cluster contains three adders (ADD), two multipliers (MUL), a divide/square-root unit (DSQ), a scratchpad (SP), and a COMM unit to/from other clusters, connected by an intra-cluster switch to/from the SRF. [Block diagram, expanding one cluster of the previous slide.]
34 Producer-Consumer Locality in the Depth Extractor: rows of pixels flow from memory through the SRF (streams) to the clusters (kernels) — Gaussian convolution, Laplacian convolution, then SAD — with partial sums, blurred and sharpened rows, and filtered row segments passed producer-to-consumer without returning to memory. Bandwidth ratio, memory : SRF : LRF = 1 : 23 : 317.
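The chaining pattern on this slide can be sketched as kernels composed row by row, with intermediates living in local storage rather than going back to memory. The two filters below are simple stand-ins for the depth extractor's Gaussian/Laplacian kernels (illustrative coefficients, not the real ones).

```python
def blur(row):
    # 2-tap average, edge sample repeated (Gaussian stand-in)
    n = len(row)
    return [(row[i] + row[min(i + 1, n - 1)]) / 2.0 for i in range(n)]

def diff(row):
    # 2-tap difference (Laplacian stand-in)
    n = len(row)
    return [row[min(i + 1, n - 1)] - row[i] for i in range(n)]

def depth_pipeline(rows):
    out = []
    for row in rows:        # stream of rows read from memory
        b = blur(row)       # intermediate: stays at the SRF/LRF level
        d = diff(b)         # consumed immediately by the next kernel
        out.append(d)       # only the final row returns to memory
    return out
```

Only the input and output rows ever touch "memory"; every intermediate is produced and consumed locally, which is exactly what the 1 : 23 : 317 bandwidth ratio reflects.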
35 Applications match the bandwidth hierarchy. [Chart: bandwidth (GB/s, log scale up to 1000) consumed at the LRF, SRF, and DRAM levels for DEPTH, MPEG, QRD, and RTSL, against machine peak.]
36 Merrimac: 64 64b FP MADD units (16 clusters of 4 units each), stream cache, shared memory, scatter-add. Bandwidth hierarchy: chip pins and router, 16 GB/s; stream cache banks, 64 GB/s; SRF lanes, 512 GB/s; LRFs, 3,840 GB/s. [Diagram: DRAM banks → cache banks → memory switch → SRF lanes → cluster switches → LRFs, with chip crossings via 10kχ and 1kχ switches and 100χ wires, and a network path.]
37 Merrimac Application Results [table; cells marked — were not captured in the transcription, and the speedup and efficiency vs. Pentium 4 columns are lost]:
Application | Developed in | Compiled by | Sustained GFLOP/s
StreamFEM | Brook | StreamC | 69
StreamMD | C | StreamC | —
StreamFLO | Brook | Metacompiled | 13
StreamCDP | Fortran | StreamC | 8
DGEMM | StreamC | StreamC | —
CONV2D | StreamC | StreamC | 79
FFT3D | StreamC | StreamC | —
StreamSPAS | C | StreamC | —
38 Merrimac Bandwidth Hierarchy [table; counts marked — were not captured in the transcription]:
Application | Sustained GFLOPS | FP Ops / Mem Ref | LRF Refs | SRF Refs | Mem Refs
StreamFEM3D (Euler, quadratic) | — | — | — (95.0%) | 6.3M (3.9%) | 1.8M (1.1%)
StreamFEM3D (MHD, constant) | — | — | — (99.4%) | 7.7M (0.4%) | 2.8M (0.2%)
GROMACS (StreamMD) | 50.1 * | — | 108M (95.0%) | 4.2M (2.9%) | 1.5M (1.3%)
StreamFLO | 12.9 * | 7.4 * | 234.3M (95.7%) | 7.2M (2.9%) | 3.4M (1.4%)
StreamSPAS | 6.2 (80% of BW) | — | — (57.6%) | 1.9M (28.5%) | 0.93M (13.9%)
Dense mat-mul | 116 (90% of peak FLOPS) | — | — (93.1%) | 17.1M (6.6%) | 0.6M (0.2%)
StreamCDP | — | — | — (58.9%) | 1.9M (25.8%) | 1.1M (15.3%)
* Dominated by divide and square-root operations.
39 A CUDA-Enabled GPU is the Ultimate Throughput Computer: GeForce GTX 280 / Tesla T10, 240 scalar processors (30 SMs × 8 each), >1 TFLOPS peak, exposed hierarchical memory, and a multiple of the performance and efficiency of a latency-optimized processor.
40 GPU Throughput Computer: communication fabric, memory & I/O, fixed-function acceleration. [Annotated die photo: GeForce GTX 280 / Tesla T10.]
41 Stream Programming
42 Stream Programming: Parallelism, Locality, and Predictability. Parallelism: data parallelism across stream elements, task parallelism across kernels, ILP within kernels. Locality: producer/consumer, and within kernels. Predictability: enables scheduling. [Diagram: kernel dependence graph over K1-K4.]
43 Evolution of Stream Programming. 1997, StreamC/KernelC: break programs into kernels and streams; kernels operate only on input/output streams and locals; communication scheduling and stream scheduling. 2002, Brook: continues the constructs of streams and kernels; hides underlying details; too one-dimensional. 2007, Sequoia: generalizes kernels to tasks; tasks operate on local data, gathered in an arbitrary way; inner tasks subdivide, leaf tasks compute; machine-specific details are factored out.
44 Stream Programming (Sequoia view): express computation as a tree of tasks. Inner tasks (streams) move data to subdivide the problem; leaf tasks (kernels) solve a local instance of the problem. This makes leaf tasks completely predictable, enabling static optimization and strategic optimization: stream scheduling and tuning of data movement. [Diagram: matmul_inner at node memory, subdividing across aggregate and local stores LS 0..LS 7, with matmul_leaf on each FU.]
45 Sequoia: Generalize Kernels into Leaf Tasks. Leaf tasks perform the actual computation (analogous to kernels) and have a small working set.

void task matmul::leaf( in float A[M][P], in float B[P][N], inout float C[M][N] )
{
  for (int i=0; i<M; i++)
    for (int j=0; j<N; j++)
      for (int k=0; k<P; k++)
        C[i][j] += A[i][k] * B[k][j];
}
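For reference, the leaf task above rendered as plain Python. In Sequoia the A, B, and C blocks are assumed resident in the local store; here they are ordinary lists of lists.

```python
def matmul_leaf(A, B, C):
    # Triple loop over a block whose whole working set is "local";
    # accumulates A*B into C, exactly as the Sequoia leaf does.
    M, P, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            for k in range(P):
                C[i][j] += A[i][k] * B[k][j]
```

The key property is that the leaf touches only its small, already-staged operands, so its running time is completely predictable.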
46 Inner tasks decompose to smaller subtasks, recursively, and have larger working sets.

void task matmul::inner( in float A[M][P], in float B[P][N], inout float C[M][N] )
{
  tunable unsigned int U, X, V;
  blkset Ablks = rchop(A, U, X);
  blkset Bblks = rchop(B, X, V);
  blkset Cblks = rchop(C, U, V);
  mappar (int i=0 to M/U, int j=0 to N/V)
    mapreduce (int k=0 to P/X)
      matmul(Ablks[i][k], Bblks[k][j], Cblks[i][j]);
}
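A Python sketch of the inner/leaf decomposition: an rchop-style 1-D blocking helper plus a mappar/mapreduce-shaped loop nest. U, X, and V play the role of Sequoia's tunables; the leaf is inlined here for brevity.

```python
def chop(n, step):
    # rchop-style blocking: [(0, step), (step, 2*step), ...]
    return [(i, min(i + step, n)) for i in range(0, n, step)]

def matmul_inner(A, B, C, U=1, X=1, V=1):
    M, P, N = len(A), len(B), len(B[0])
    for (i0, i1) in chop(M, U):              # mappar over blocks of C rows
        for (j0, j1) in chop(N, V):          # mappar over blocks of C cols
            for (k0, k1) in chop(P, X):      # mapreduce over the k dimension
                # leaf task, inlined: the block working set is "local"
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        for k in range(k0, k1):
                            C[i][j] += A[i][k] * B[k][j]
```

Because the result is the same for any blocking, U, X, and V can be tuned to match each level of a machine's storage hierarchy without touching the algorithm, which is Sequoia's central design choice.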
47 CUDA as a Stream Language: explicit control of the memory hierarchy with shared memory — __shared__ float a[SIZE]; — which also enables communication between the threads of a CTA. Transfer data up and down the hierarchy; operate on data with kernels. But CUDA does allow access to arbitrary data within a kernel.
48 Examples: application speedups of 146×, 36×, 19×, 17×, 100×, 149×, 47×, 20×, 24×, and 30×. [Montage of CUDA application results.]
49 Stream loads/stores (bulk operations) hide latency, with 1000s of words in flight. [Pipeline diagram: a gather → fn1 → fn2 → scatter stream program executing as LD Cells 0-3, Flux 0-2, Update 0-1, and ST Cells 0-1, overlapped across DRAM, the SRFs, and the LRFs.]
50 Explicit storage enables simple, efficient execution: all needed data and instructions are on-chip — no misses. [Same pipeline: LD Cells 0-3 overlapped with Flux 0-2, Update 0-1, and ST Cells 0-1.]
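The double-buffered schedule in the pipeline slides above can be sketched with a toy timing model: while the kernel computes on one buffer of cells, the stream unit loads the next, so all but the first load hides behind compute. The cycle counts below are illustrative, not from the talk.

```python
LOAD_CYCLES = 40      # time to stream one block of cells in
COMPUTE_CYCLES = 50   # time for the Flux/Update kernels on one block

def serial_time(blocks):
    # no overlap: every load is fully exposed
    return blocks * (LOAD_CYCLES + COMPUTE_CYCLES)

def double_buffered_time(blocks):
    # only the first load is exposed; each later load overlaps the
    # previous block's compute, so the longer phase sets the rate
    return LOAD_CYCLES + blocks * max(COMPUTE_CYCLES, LOAD_CYCLES)
```

Once the pipeline fills, throughput is set entirely by the slower of load and compute, which is why keeping thousands of words in flight makes memory latency irrelevant.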
51 Caches lack predictability (controlled via a "wet noodle"). [Pipeline comparison: with explicit storage, the LD/Flux/Update/ST schedule runs without stalls; with a cache, misses split Flux 0 and Update 0 into A/B halves and delay the whole schedule.]
52 Caches are controlled via a wet noodle: even at a 99% hit rate, one miss costs 100s of cycles — 10,000s of operations on a throughput machine. [Same pipeline comparison as the previous slide.]
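The slide's claim, worked as arithmetic: with a 99% hit rate and a miss penalty of a few hundred cycles, the average access cost is still several cycles, and each miss idles a wide throughput machine for tens of thousands of operations. The 300-cycle penalty and 100-ops/cycle width below are illustrative orders of magnitude matching the slide's "100s of cycles" and "10,000s of ops".

```python
def avg_access_cycles(hit_rate, hit_cycles=1, miss_cycles=300):
    # expected cost of one memory access under a simple two-level model
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

def ops_lost_per_miss(miss_cycles=300, ops_per_cycle=100):
    # work a wide throughput machine forgoes while stalled on one miss
    return miss_cycles * ops_per_cycle
```

Even a "good" cache therefore quadruples the average access cost, and, worse, does so unpredictably, which is what defeats static scheduling.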
53 Predictability enables efficient static scheduling. [Software-pipelined schedule of one iteration of the ComputeCellInt kernel from StreamFEM3D.] Over 95% of peak with simple hardware; depends on explicit communication to make delays predictable.
54 Bulk operations enable strategic optimization: the StreamFEM application. [Dataflow diagram, as on slide 25: kernels Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, and Advance Cell, connected by streams of element faces, face geometry, numerical flux, cell geometry, gathered elements, cell orientations, and current/new elements, plus read-only table lookup data (master element).]
55 Results: horizontal portability across Scalar, SMP, Disk, Cluster, Cell, and PS3 targets for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D*, GRAVITY, and HMMER*. (* Dataset size reduced to fit in memory.)
56 System Utilization. [Chart: percentage of runtime, 0-100, for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on SMP, Disk, Cluster, Cell, and PS3.]
57 Resource Utilization: IBM Cell. [Chart: bandwidth utilization vs. compute utilization, 0-100%.]
58 Search Spaces. T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle. Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation. In IEEE PACT, 2000.
59 Performance of Auto-tuner: measured raw performance (GFLOPS) of the auto-tuned vs. hand-tuned versions of Conv2D, SGEMM, FFT3D, and SUmb on Cell, Cluster, and Cluster of PS3s. [Table values not captured in the transcription.] For FFT3D, performance is with fusion of leaf tasks. SUmb is too complicated to be hand-tuned.
60 Conclusion
61 The Road Ahead: 20× improvement in GPU performance by 2015, but <2× improvement in CPU performance by 2015; most of the value delivered to the end user comes from the GPU, the throughput computer. But not all things scale at the same rate: memory bandwidth scales by <5×, energy per bit-mm on chip is nearly constant, and energy per op improves by only 5×. Each of these presents an interesting challenge.
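The mismatch on this slide can be made concrete with a ratio or two: if compute grows faster than memory bandwidth, the FLOPs delivered per byte of memory traffic must grow by their ratio; and at constant power, any performance growth beyond the energy-per-op improvement must come from architecture and locality. The growth factors below are the slide's numbers.

```python
def required_intensity_growth(compute_growth=20.0, bandwidth_growth=5.0):
    # factor by which arithmetic intensity (FLOPs per byte of memory
    # traffic) must rise for applications to track compute scaling
    return compute_growth / bandwidth_growth

def architecture_gap(perf_growth=20.0, energy_per_op_gain=5.0):
    # at constant power, the factor architecture and locality must supply
    return perf_growth / energy_per_op_gain
```

Both ratios come out to 4×, which is why the talk treats locality, not raw arithmetic, as the scaling problem for the coming decade.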
62 Conclusion: parallelism and locality are the keys to efficient computation, and denial architecture is at an end. We are entering an era of Throughput Computing on heterogeneous computers: value comes from throughput applications running on throughput processors (like a GPU). Performance = parallelism, efficiency = locality, and applications have lots of both. Stream processing: many ALUs exploit parallelism; a rich, exposed storage hierarchy enables locality; simple control and synchronization reduce overhead. Stream programming: explicit movement and bulk operations enable strategic optimization and expose parallelism and locality. The result: performance and efficiency — TOPs on a chip at 20-30× the efficiency of conventional processors — and performance portability. GPUs are the ultimate stream processors.
63 Backup
64 Application Performance (cont.). [Chart: execution-time breakdown, 0-100%, for DEPTH, MPEG, QRD, RTSL, and their average, split into host bandwidth stalls, stream controller overhead, memory stalls, cluster stalls, kernel non-main-loop time, kernel main-loop overhead, and operations.]
More informationThe Implementation and Analysis of Important Symmetric Ciphers on Stream Processor
2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore The Implementation and Analysis of Important Symmetric Ciphers on Stream Processor
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationEE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems
EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationExploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationAnalysis and Performance Results of a Molecular Modeling Application on Merrimac
Analysis and Performance Results of a Molecular Modeling Application on Merrimac Mattan Erez Jung Ho Ahn Ankit Garg William J. Dally Eric Darve Department of Electrical Engineering Department of Mechanical
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationVector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks
Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis Stanford University David Patterson U.C. Berkeley http://csl.stanford.edu/~christos Motivation Ideal processor
More informationGRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
More informationEE282 Computer Architecture. Lecture 1: What is Computer Architecture?
EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer
More informationLocality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez and Juan Touriño 7th International Symposium on High-Level Parallel
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationSOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS
SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power
More informationPerformance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationCPU Pipelining Issues
CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! L17 Pipeline Issues & Memory 1 Pipelining Improve performance by increasing instruction throughput
More informationComputer Architecture s Changing Definition
Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationEECS4201 Computer Architecture
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationLecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
More information