Stream Programming: Explicit Parallelism and Locality. Bill Dally Edge Workshop May 24, 2006


1 Stream Programming: Explicit Parallelism and Locality Bill Dally Edge Workshop May 24, 2006 Edge: 1 May 24, 2006

2 Outline
- Technology constraints
- Architecture
- Stream programming
- Imagine and Merrimac
- Other stream processors
- Future directions

3 ILP is mined out: end of superscalar processors. Time for a new architecture
[Figure: processor performance (ps/inst) on a log scale with a linear trend line; annotated growth rates of 52%/year, 74%/year, and 19%/year, and gaps of 30:1, 1,000:1, and 30,000:1. Source: Dally et al., The Last Classical Computer, ISAT Study, 2001]

4 Performance = Parallelism
Efficiency = Locality

5 Arithmetic is cheap, communication is expensive
Arithmetic
- Can put 100s of FPUs on a chip
- $0.50/GFLOPS, 50mW/GFLOPS
- Exploit with parallelism
Communication
- Dominates cost
- $8/GW/s, 2W/GW/s (off-chip)
- BW decreases (and cost increases) with distance
- Power increases with distance
- Latency increases with distance
  - But can be hidden with parallelism
- Need locality to conserve global bandwidth
[Figure: a 64-bit FPU (to scale, 0.5mm) on a 12mm, 90nm, $200, 1GHz chip, with the 1-clock reach marked; BW decreases and power increases with distance]

6 Cost of data access varies by 1000x
From                    Energy   Cost*   Time
Local Register          10pJ     $0.50   1ns
Chip Region (2mm)       50pJ     $2      4ns
Global on Chip (15mm)   200pJ    $10     20ns
Off chip (node mem)     1nJ      $50     200ns
Global                  5nJ      $500    1us
*Cost of providing 1GW/s of bandwidth. All numbers approximate.
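The spread claimed on this slide can be checked directly from its own table. The sketch below only restates the slide's approximate figures and divides each row by the local-register row; the dictionary layout and function name are illustrative choices, not part of the talk.

```python
# Figures copied from the slide's table (all approximate).
# Tuple order: energy in pJ, cost in dollars per 1 GW/s of bandwidth, time in ns.
ACCESS_COST = {
    "local register":        (10,   0.50, 1),
    "chip region (2mm)":     (50,   2.0,  4),
    "global on chip (15mm)": (200,  10.0, 20),
    "off chip (node mem)":   (1000, 50.0, 200),
    "global":                (5000, 500.0, 1000),
}


def ratios_vs_register(table):
    """Express every row as a multiple of the local-register row."""
    base = table["local register"]
    return {
        name: tuple(value / ref for value, ref in zip(values, base))
        for name, values in table.items()
    }


RATIOS = ratios_vs_register(ACCESS_COST)

for name, (energy, cost, time) in RATIOS.items():
    print(f"{name:24s} energy x{energy:g}  cost x{cost:g}  time x{time:g}")
```

With these figures, a global access costs 1000 times a register access in both dollars and time, and 500 times in energy.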

7 So we should build chips that look like this

8 An abstract view
[Diagram: Global Memory at the top, a switch, LM and CM, a switch, then three RM blocks, each with its own switch and three R blocks, above nine A units]

9 Real question is: How to orchestrate movement of data

10 Conventional Wisdom: Use caches
[Diagram: same hierarchy as slide 8: Global Memory, a switch, LM and CM, a switch, then three RM blocks, each with its own switch and three R blocks, above nine A units]

11 Caches squander bandwidth, our scarce resource
- Unnecessary data movement
- Poorly scheduled data movement
  - Idles expensive resources waiting on data
- More efficient to map programs to an explicit memory hierarchy

12 Example: Simplified Finite-Element Code
    loop over cells
        flux[i] = ...
    loop over cells
        ... = f(flux[i], ...)

13 Explicitly block into SRF
    loop over cells
        flux[i] = ...
    loop over cells
        ... = f(flux[i], ...)
Flux passed through SRF, no memory traffic

14 Explicitly block into SRF
    loop over cells
        flux[i] = ...
    loop over cells
        ... = f(flux[i], ...)
Explicit re-use of Cells, no misses
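A toy model can make the blocking argument concrete. The sketch below is not the talk's code: `two_pass`, `blocked`, the strip size, and the one-word-per-element traffic count are all assumptions made for illustration. It compares two full passes over the cells against a strip-mined version in which each strip of cells and its flux stay in a small buffer standing in for the SRF.

```python
def two_pass(cells, flux_fn, update_fn):
    """Two full loops: every flux value goes out to memory and comes back."""
    traffic = 0
    flux = []
    for c in cells:
        flux.append(flux_fn(c))
        traffic += 2          # read one cell, write one flux value
    out = []
    for c, fl in zip(cells, flux):
        out.append(update_fn(c, fl))
        traffic += 3          # re-read the cell, read its flux, write the result
    return out, traffic


def blocked(cells, flux_fn, update_fn, strip=4):
    """Strip-mined: cells and flux for one strip live in an on-chip buffer."""
    traffic = 0
    out = []
    for start in range(0, len(cells), strip):
        block = cells[start:start + strip]
        traffic += len(block)                 # one bulk load of the strip
        flux = [flux_fn(c) for c in block]    # flux never leaves the buffer
        out.extend(update_fn(c, fl) for c, fl in zip(block, flux))
        traffic += len(block)                 # one bulk store of the results
    return out, traffic


# Hypothetical stand-ins for the per-cell computations.
CELLS = list(range(10))
FLUX_FN = lambda c: 2 * c
UPDATE_FN = lambda c, fl: c + fl

RESULT_TWO_PASS = two_pass(CELLS, FLUX_FN, UPDATE_FN)
RESULT_BLOCKED = blocked(CELLS, FLUX_FN, UPDATE_FN)
print(RESULT_TWO_PASS[1], RESULT_BLOCKED[1])
```

Under this counting model the two-pass version moves five words per cell and the blocked version two, with identical results.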

15 Stream loads/stores (bulk operations) hide latency (1000s of words in flight)
[Diagram: Cells in DRAM -> gather -> Cells -> fn1 -> Flux -> fn2 -> Cells -> scatter -> Cells in DRAM; storage levels labeled DRAM, SRFs, and LRFs]
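The "1000s of words in flight" figure is what Little's law gives for a memory system of this class: the data that must be outstanding equals bandwidth times latency. The sketch below borrows the 64 GBytes/s node memory bandwidth from the Merrimac slide and the 200ns off-chip access time from the cost table; the 8-byte word size is an assumption based on the 64-bit FPUs mentioned earlier.

```python
def words_in_flight(bandwidth_bytes_per_ns, latency_ns, word_bytes=8):
    """Little's law: outstanding data = bandwidth x latency, expressed in words."""
    return bandwidth_bytes_per_ns * latency_ns // word_bytes


# 64 GBytes/s is 64 bytes per nanosecond; 200 ns is the off-chip access time.
IN_FLIGHT = words_in_flight(bandwidth_bytes_per_ns=64, latency_ns=200)
print(IN_FLIGHT)
```

A single scalar load keeps one word in flight, so only bulk stream loads and stores can cover the latency at full bandwidth.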

16 Explicit storage enables simple, efficient execution
All needed data and instructions on-chip: no misses

17 Caches lack predictability (controlled via a "wet noodle")

18 Caches are controlled via a "wet noodle"
99% hit rate, 1 miss costs 100s of cycles, 10,000s of ops
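The arithmetic behind this slide can be sketched as follows. Only the 99% hit rate comes from the slide; the 200-cycle miss penalty and the 100 FPUs are assumed values picked from the slide's "100s of cycles" and the earlier "100s of FPUs on a chip".

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Expected cycles per access for a given hit rate."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


def ops_lost_per_miss(miss_cycles, fpus):
    """Arithmetic operations the chip could have issued while stalled on one miss."""
    return miss_cycles * fpus


HIT_RATE = 0.99      # from the slide
MISS_CYCLES = 200    # assumption: "100s of cycles"
FPUS = 100           # assumption: "100s of FPUs on a chip"

AVERAGE = average_access_cycles(HIT_RATE, hit_cycles=1, miss_cycles=MISS_CYCLES)
LOST = ops_lost_per_miss(MISS_CYCLES, FPUS)
print(f"average access: {AVERAGE:.2f} cycles, ops lost per miss: {LOST}")
```

Under these assumptions the 1% of accesses that miss roughly triple the average access time, and each miss idles about 20,000 potential operations, which is the slide's "10,000s of ops".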

19 So how do we program an explicit hierarchy?

20 Stream Programming: Parallelism, Locality, and Predictability
Parallelism
- Data parallelism across stream elements
- Task parallelism across kernels
- ILP within kernels
Locality
- Producer/consumer
- Within kernels
Predictability
- Enables scheduling
[Diagram: kernels K1, K2, K3, K4]

21 Evolution of Stream Programming
1997 StreamC/KernelC
- Break programs into kernels
- Kernels operate only on input/output streams and locals
- Communication scheduling and stream scheduling
2001 Brook
- Continues the construct of streams and kernels
- Hides underlying details
- Too "one-dimensional"
2005 Sequoia
- Generalizes kernels to tasks
  - Tasks operate on local data
  - Local data gathered in an arbitrary way
  - Inner tasks subdivide, leaf tasks compute
- Machine-specific details factored out

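22 StreamC/KernelC
[Diagram: Image 0 -> convolve -> convolve -> SAD -> Depth Map; Image 1 -> convolve -> convolve -> SAD]

    STREAMPROG(depth) {
        im_stream<pixels> in, tmp;
        // Stream level: filter each image row with two convolve kernels
        for (i=0; i<rows; i++) {
            convolve(in, tmp, ...);
            convolve(tmp, conv_row, ...);
        }
        // Compare the filtered rows to produce the depth map
        for (i=0; i<rows; i++) {
            SAD(conv_row, depth_row, ...);
        }
    }

    KERNEL convolve(istream<int> a, ostream<int> y) {
        // Kernel level: touches only its streams and locals
        loop_stream(a) {
            int ai, out;
            a >> ai;                     // pop the next input element
            out = dotproduct(ai, ...);
            y << out;                    // push the result
        }
    }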
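For readers unfamiliar with the kernel/stream split, here is a rough Python analogue of the depth pipeline. It is not StreamC/KernelC: generators stand in for streams, the filter taps and input rows are made-up values, and `sad` reduces two rows to a single sum of absolute differences, which simplifies whatever the real SAD kernel computes.

```python
def convolve(stream, taps):
    """Kernel: consume one input stream and produce one output stream."""
    window = [0] * len(taps)
    for a in stream:                      # plays the role of loop_stream(a) { a >> ai; ... }
        window = window[1:] + [a]         # slide the newest element into the window
        yield sum(w * t for w, t in zip(window, taps))   # plays the role of y << out


def sad(left, right):
    """Kernel: sum of absolute differences between two streams."""
    return sum(abs(l - r) for l, r in zip(left, right))


def depth_score(row0, row1, taps):
    """Stream-level program: chain the kernels; intermediates are never stored."""
    filtered0 = convolve(convolve(row0, taps), taps)
    filtered1 = convolve(convolve(row1, taps), taps)
    return sad(filtered0, filtered1)


# Hypothetical image rows and filter taps.
SCORE = depth_score([1, 2, 3, 4], [1, 2, 4, 4], taps=[1, 1])
print(SCORE)
```

Each intermediate stream is consumed element by element as it is produced, which is the producer/consumer locality that a stream register file is meant to capture in hardware.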

23 Explicit storage enables simple, efficient execution unit scheduling
[Figure: schedule of one iteration and the resulting software pipeline for the ComputeCellInt kernel from StreamFEM3D]
- Over 95% of peak with simple hardware
- Depends on explicit communication to make delays predictable

24 Stream scheduling exploits explicit storage to reduce bandwidth demand
[Diagram: StreamFEM application. Kernels: Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, Advance Cell. Data: Read-Only Table Lookup Data (Master Element), Element Faces, Gathered Elements, Face Geometry, Numerical Flux, Cell Orientations, Cell Geometry, Elements (Current), Elements (New)]
Prefetching, reuse, use/def, limited spilling

25 Sequoia
Generalize kernels into leaf tasks
- Perform actual computation
- Analogous to kernels
- Small working set

    void task matmul::leaf( in float A[M][P],
                            in float B[P][N],
                            inout float C[M][N] )
    {
        // Leaf task: plain triple loop on data already in local storage
        for (int i=0; i<M; i++) {
            for (int j=0; j<N; j++) {
                for (int k=0; k<P; k++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }

[Diagram: node memory above an aggregate LS level; LS 0 through LS 7, each with an FU; matmul leaf runs at the LS/FU level]

26 Inner tasks
- Decompose to smaller subtasks
  - Recursively
- Larger working sets

    void task matmul::inner( in float A[M][P],
                             in float B[P][N],
                             inout float C[M][N] )
    {
        // Block sizes are tunables, chosen per machine level
        tunable unsigned int U, X, V;
        blkset Ablks = rchop(A, U, X);
        blkset Bblks = rchop(B, X, V);
        blkset Cblks = rchop(C, U, V);
        // Blocks of C are independent; the k dimension is a reduction
        mappar (int i=0 to M/U, int j=0 to N/V)
            mapreduce (int k=0 to P/X)
                matmul(Ablks[i][k], Bblks[k][j], Cblks[i][j]);
    }

[Diagram: matmul inner runs at node memory / aggregate LS level; matmul leaf runs on LS 0 through LS 7, each with an FU]
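The inner/leaf split can be imitated in ordinary Python to show the control structure. This is a sketch, not Sequoia: the `block` helper copies sub-matrices where Sequoia's `rchop` describes them, the tunables U, X, and V are simply set to half of each dimension, and `leaf_limit` is a hypothetical stand-in for "fits in the local store".

```python
def matmul_leaf(A, B, C):
    """Leaf task: C += A * B on blocks assumed small enough for local storage."""
    M, P, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            for k in range(P):
                C[i][j] += A[i][k] * B[k][j]


def block(X, r0, r1, c0, c1):
    """Copy a sub-matrix, standing in for moving a block one level down."""
    return [row[c0:c1] for row in X[r0:r1]]


def matmul_inner(A, B, C, leaf_limit=2):
    """Inner task: chop the operands into blocks and recurse on each block triple."""
    M, P, N = len(A), len(B), len(B[0])
    if max(M, P, N) <= leaf_limit:
        matmul_leaf(A, B, C)
        return
    # Assumed tunable choice: halve every dimension (rounding up).
    U, X, V = (M + 1) // 2, (P + 1) // 2, (N + 1) // 2
    for i in range(0, M, U):              # independent C blocks (the mappar loops)
        for j in range(0, N, V):
            c_blk = block(C, i, i + U, j, j + V)
            for k in range(0, P, X):      # reduction over k (the mapreduce loop)
                matmul_inner(block(A, i, i + U, k, k + X),
                             block(B, k, k + X, j, j + V),
                             c_blk, leaf_limit)
            for di, row in enumerate(c_blk):   # write the finished block back
                C[i + di][j:j + V] = row


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
B = [[(i + j) % 3 for j in range(5)] for i in range(4)]
C = [[0] * 5 for _ in range(3)]
matmul_inner(A, B, C)
print(C)
```

The copy-in and copy-out around each recursive call mark where data crosses a level of the memory hierarchy.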

27 Stream processors make communication explicit
Enables optimization

28 Stream architecture makes communication explicit, exploits parallelism and locality
[Diagram: bandwidth hierarchy with labels LRF, CL SW, SRF Lane, M SW, Cache Bank, DRAM Bank, and Chip Pins and Router; interconnect annotated 100χ wire, 1kχ switch, 10kχ switch, and Chip Crossing(s)]
ALU and cluster arrays are shown 1D here; they may be laid out as 2D arrays

29 Imagine VLSI Implementation
Chip details
- 2.56cm^2 die, 0.15um process, 21M transistors, 792-pin BGA
- Collaboration with TI ASIC
- Chips arrived on April 1, 2002
Dual-Imagine test board

30 Application Performance (cont.)
[Figure: stacked bars of execution time (0% to 100%) for DEPTH, MPEG, QRD, RTSL, and Average, broken down into host bandwidth stalls, stream controller overhead, memory stalls, cluster stalls, kernel non main loop, kernel main loop overhead, and operations]

31 Applications match the bandwidth hierarchy
[Figure: bandwidth (GB/s, log scale from 0.1 to 1000) at the LRF, SRF, and DRAM levels for Peak, DEPTH, MPEG, QRD, and RTSL]


32 Merrimac Streaming Supercomputer
- Node: stream processor (64 FPUs, 128 GFLOPS) with 16 x XDR-DRAM, 2GBytes, 64GBytes/s
- Board (Node 1 to Node 16): 16 nodes, 1K FPUs, 2TFLOPS, 32GBytes
- Backplane (Board 1 to Board 32): 32 boards, 512 nodes, 32K FPUs, 64TFLOPS, 1TBytes
- System: Backplane 1 to Backplane 32
- Network labels, in slide order (pair counts lost in extraction): 12GBytes/s pairs; On-Board Network; 48GBytes/s pairs; 6 Teradyne GbX; E/O and O/E; Intra-Cabinet Network; 768GBytes/s, 2K+2K links, ribbon fiber; Inter-Cabinet Network; bisection 24TBytes/s
- Scalable from 2-TFLOP workstation to 2-PFLOP supercomputer
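The slide's totals follow from the per-node figures by multiplication. The sketch below redoes that arithmetic; the constant and function names are illustrative, and the slide rounds 1024 to "1K" and 2048 GFLOPS to "2TFLOPS" in the same way.

```python
# Per-node figures from the slide.
NODE_FPUS = 64
NODE_GFLOPS = 128
NODE_MEM_GBYTES = 2

NODES_PER_BOARD = 16
BOARDS_PER_BACKPLANE = 32
BACKPLANES_PER_SYSTEM = 32


def scale(nodes):
    """Aggregate FPUs, peak GFLOPS, and memory for a given node count."""
    return {
        "nodes": nodes,
        "fpus": nodes * NODE_FPUS,
        "gflops": nodes * NODE_GFLOPS,
        "mem_gbytes": nodes * NODE_MEM_GBYTES,
    }


BOARD = scale(NODES_PER_BOARD)
BACKPLANE = scale(NODES_PER_BOARD * BOARDS_PER_BACKPLANE)
SYSTEM = scale(NODES_PER_BOARD * BOARDS_PER_BACKPLANE * BACKPLANES_PER_SYSTEM)

for name, level in (("board", BOARD), ("backplane", BACKPLANE), ("system", SYSTEM)):
    print(name, level)
```

One board is the "2-TFLOP workstation" end of the range, and 32 backplanes (16,384 nodes) give roughly 2.1 PFLOPS peak, the "2-PFLOP supercomputer" end.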

33 Merrimac Application Results
Application                      Sustained GFLOPS   FP Ops / Mem Ref   LRF Refs           SRF Refs      Mem Refs
StreamFEM3D (Euler, quadratic)   [lost]             [lost]             [lost]M (95.0%)    6.3M (3.9%)   1.8M (1.1%)
StreamFEM3D (MHD, constant)      [lost]             [lost]             [lost]M (99.4%)    7.7M (0.4%)   2.8M (0.2%)
StreamMD (grid algorithm)        14.2*              12.1*              90.2M (97.5%)      1.6M (1.7%)   0.7M (0.8%)
GROMACS                          38.8*              9.7*               108M (95.0%)       4.2M (2.9%)   1.5M (1.3%)
StreamFLO                        12.9*              7.4*               234.3M (95.7%)     7.2M (2.9%)   3.4M (1.4%)
Values marked [lost] did not survive extraction.
Simulated on a machine with 64GFLOPS peak performance and no fused MADD.
* The low numbers are a result of many divide and square-root operations.
Applications achieve high performance and make good use of the bandwidth hierarchy.

34 Other Stream Processors

35 Other Stream Processors
- GPUs: [lost] GFLOP/s, 10-30W
- ClearSpeed CSX: [lost] GFLOP/s, 10W
- STI Cell: ~200 GFLOP/s, 100W

36 Other Stream Processors
- Technology pushing many to build stream processors
  - GPUs (Nvidia, ATI), Game Processors (Cell), Physics Processors (Ageia), Accelerators (ClearSpeed)
- Many (10s-100s) FPUs
- Distributed local storage
- Latency hiding on access to external memory
  - Block access or deeply multithreaded
- All benefit from stream programming
- But the right architecture makes it easier and more efficient

37 Architecture Issues
On-chip memory
- Read and write access to on-chip storage
  - Producer-consumer locality demands write access
- Data movement between on-chip memories
  - Without going off chip
Off-chip memory
- No substitute for bandwidth
- Efficient gather and scatter required

38 What's Next?

39 ILP is mined out: end of superscalar processors. Time for a new architecture
[Figure: processor performance (ps/inst) on a log scale with a linear trend line; annotated growth rates of 52%/year, 74%/year, and 19%/year, and gaps of 30:1, 1,000:1, and 30,000:1. Source: Dally et al., The Last Classical Computer, ISAT Study, 2001]

40 Computing landscape is changing
- Many function units
- Deep, distributed storage hierarchy
- Communication limited
Research is needed to understand how to architect and program these processors
- Not an incremental fix: fundamental rethinking of basic architecture, programming model, and compilers is required

41 Software Topics
- Exposed communication programming models
  - Abstract storage hierarchy and communication costs
  - Portable codes with predictable performance
- Compiling bulk operations
  - Strategic (vs tactical) program reorganization
  - Scheduling bulk data transfers
  - Size and shape of blocking
- Irregular computations
  - Localization of shared neighbors
  - Variable size results
- Applications
  - Communication-efficient algorithms

42 Hardware Topics
- Close efficiency gap with hard-wired engines
  - Gap is [lost]x today (10 for stream processors)
  - Efficient data movement is first step
  - Other overheads remain to be removed
- Storage hierarchies that can be abstracted
- Balancing parallelism
  - ILP x DLP x TLP
- On-chip networks
  - To connect within and between levels of the hierarchy
- Communication and synchronization mechanisms
  - Drives granularity, which in turn determines available parallelism
- Mechanisms for reuse of irregular data

43 Summary
- Communication is expensive, arithmetic is cheap
  - Parallelism to exploit arithmetic
  - Locality to conserve bandwidth
- Architectures evolving toward a deep, broad storage hierarchy
  - Storage to hide latency, cover bandwidth taper
- Stream processors: >10x efficiency of conventional CPUs
  - Explicitly manage this hierarchy
  - Makes efficient use of scarce, expensive resources
  - Enables optimization
- Generalized stream programming
  - Bulk operations: data movement and kernels
  - Parallelism, Locality, and Predictability

