Stream Processor Architecture. William J. Dally Stanford University August 22, 2003 Streaming Workshop


1 Stream Processor Architecture William J. Dally Stanford University August 22, 2003 Streaming Workshop Stream Arch: 1 August 22, 2003

2 Some Definitions
A Stream Program expresses a computation as streams flowing through kernels.
[Figure: stream program for depth extraction. Labels: Image 0, Image 1, convolve (four instances), SAD, Depth Map]
A Stream Processor exploits the locality and concurrency in a stream program to use lots of ALUs with little communication.
[Figure: SRF lanes, each paired with a cluster (CL) and a switch (SW), joined by a Global Switch. Labels: 10kχ switch, 1kχ switch, 100χ wire]
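The definition of a stream program above can be made concrete with a small sketch. The Python below is illustrative only and is not the Stream-C/Kernel-C code used with Imagine: the function names, the 1-D filter taps, the SAD window, and the disparity range are all assumptions chosen to keep the example short. Each kernel consumes one or two streams of rows and produces a stream of rows, and the program wires the kernels together in the shape of the depth-extraction figure.

```python
def convolve(rows, taps):
    """Kernel: apply a 1-D filter to every row of the input stream."""
    half = len(taps) // 2
    for row in rows:
        n = len(row)
        out = []
        for i in range(n):
            acc = 0
            for k, tap in enumerate(taps):
                j = min(max(i + k - half, 0), n - 1)  # clamp at row edges
                acc += tap * row[j]
            out.append(acc)
        yield out


def sad(left_rows, right_rows, max_disparity, window=1):
    """Kernel: per pixel, keep the disparity with the smallest
    sum of absolute differences over a small window."""
    for left, right in zip(left_rows, right_rows):
        n = len(left)
        depth = []
        for i in range(n):
            best_d, best_cost = 0, None
            for d in range(max_disparity + 1):
                cost = 0
                for w in range(-window, window + 1):
                    a = left[min(max(i + w, 0), n - 1)]
                    b = right[min(max(i + w - d, 0), n - 1)]
                    cost += abs(a - b)
                if best_cost is None or cost < best_cost:
                    best_d, best_cost = d, cost
            depth.append(best_d)
        yield depth


def depth_extractor(image0, image1, max_disparity=3):
    """Stream program: two convolutions per image, then SAD."""
    blur = [1, 2, 1]       # hypothetical Gaussian-like taps
    sharpen = [-1, 2, -1]  # hypothetical Laplacian-like taps
    s0 = convolve(convolve(iter(image0), blur), sharpen)
    s1 = convolve(convolve(iter(image1), blur), sharpen)
    return sad(s0, s1, max_disparity)


if __name__ == "__main__":
    img = [[0, 1, 4, 9, 4, 1, 0, 0], [3, 3, 7, 2, 2, 8, 1, 0]]
    for row in depth_extractor(img, img):
        print(row)
```

Because the kernels are generators, each intermediate row is handed directly from producer to consumer and never collected into a full intermediate image, which is the producer-consumer pattern a stream processor is built to exploit.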

3 Producer-Consumer Locality in the Depth Extractor
[Figure: data flow across three columns: Memory/Global Data, SRF/Streams, Clusters/Kernels.
  Kernels: Convolution (Gaussian), Convolution (Laplacian), SAD.
  Streams: row of pixels, previous partial sums, new partial sums, blurred row, sharpened row, filtered row segment, depth map row segment.
  Ratio shown across the columns: 1 : 23 : 317]

4 A Bandwidth Hierarchy exploits kernel and producer-consumer locality
[Figure: four SDRAMs feed the Stream Register File, which feeds the ALU Clusters. Bandwidth labels: 2 GB/s, 32 GB/s, 544 GB/s]

                     Memory BW    Global RF BW   Local RF BW
  Depth Extractor    0.80 GB/s         GB/s           GB/s
  MPEG Encoder       0.47 GB/s    2.46 GB/s           GB/s
  Polygon Rendering  0.78 GB/s    4.06 GB/s           GB/s
  QR Decomposition   0.46 GB/s    3.67 GB/s           GB/s
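As a worked illustration of the hierarchy, the sketch below normalizes the three bandwidth labels to the memory bandwidth and computes, for each application row that still has both numbers, how many bytes move through the global register file per byte fetched from memory. It assumes the labels 2, 32, and 544 GB/s correspond to the Memory, Global RF, and Local RF columns in that order. The dictionary keys and function names are hypothetical, and values missing from this copy of the slide are left as `None` rather than guessed.

```python
# Bandwidth labels from the slide's figure, in GB/s (assumed column order).
PEAK_GBPS = {"memory": 2.0, "global_rf": 32.0, "local_rf": 544.0}

# (memory BW, global RF BW) per application; None marks a missing value.
APP_GBPS = {
    "Depth Extractor": (0.80, None),
    "MPEG Encoder": (0.47, 2.46),
    "Polygon Rendering": (0.78, 4.06),
    "QR Decomposition": (0.46, 3.67),
}


def hierarchy_ratio(peak):
    """Express each level's bandwidth as a multiple of memory bandwidth."""
    base = peak["memory"]
    return tuple(peak[k] / base for k in ("memory", "global_rf", "local_rf"))


def srf_reuse(memory_gbps, global_rf_gbps):
    """Global RF traffic per unit of memory traffic, or None if unknown."""
    if memory_gbps is None or global_rf_gbps is None:
        return None
    return global_rf_gbps / memory_gbps


if __name__ == "__main__":
    print("memory : global RF : local RF =", hierarchy_ratio(PEAK_GBPS))
    for name, (mem, grf) in APP_GBPS.items():
        ratio = srf_reuse(mem, grf)
        shown = "n/a" if ratio is None else f"{ratio:.1f}x"
        print(f"{name}: global RF / memory = {shown}")
```

Under that assumption the hierarchy works out to 1 : 16 : 272, and the three complete rows each move roughly five to eight times more data through the global register file than through memory.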

5 Bandwidth demand of stream programs fits bandwidth hierarchy of architecture

6 Prototype HW and SW
Prototype of Imagine architecture
  Proof-of-concept 2.56 cm² die in 0.15 µm TI process, 21M transistors
  Collaboration with TI ASIC
Dual-Imagine development board
  Platform for rapid application development
  Test & debug building blocks of a 64-node system
  Collaboration with ISI-East
Software tools based on Stream-C/Kernel-C
  Stream scheduler
  Communication scheduling
Many Applications
  3 Graphics pipelines
  Image-processing apps: depth, MPEG
  3G Cellphone (Rice)
  STAP

7 Stream Processor Roadmap
[Figure: roadmap across the 130nm, 90nm, and 65nm nodes, with arrows labeled "ALUs + Shrink" and "Tech Shrink"]
M1 (130nm): Fixed-point, 100 mm², 480 GOPS, 10 W (Low Voltage: 320 GOPS, 2.5 W), 8 pJ/op
M2 (90nm): 256 SP FP MADDs, 144 mm², 256 GFLOPS, 5 W, 20 pJ/FLOP
M3 (90nm): Fixed-point, 100 mm², 600 GOPS, 2 W, 3.4 pJ/op
M4 (65nm): 512 SP FP MADDs, 144 mm², 750 GFLOPS, 10 W, 14 pJ/FLOP
M5 (65nm): Fixed-point, 100 mm², 1.2 TOPS, 2 W, 1.7 pJ/op
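The energy-per-operation figures on the roadmap follow from dividing power by throughput. The sketch below redoes that arithmetic from the throughput and power numbers on the slide. The pairing of each M-label with its description is an assumption based on the order in which they appear, and the function and dictionary names are hypothetical.

```python
def picojoules_per_op(ops_per_second, watts):
    """Energy per operation in pJ: (J/s) / (op/s), scaled by 1e12."""
    return watts / ops_per_second * 1e12


# (operations per second, watts) as read from the slide.
ROADMAP = {
    "M1": (480e9, 10.0),
    "M1 low voltage": (320e9, 2.5),
    "M2": (256e9, 5.0),
    "M3": (600e9, 2.0),
    "M4": (750e9, 10.0),
    "M5": (1.2e12, 2.0),
}

if __name__ == "__main__":
    for name, (rate, power) in ROADMAP.items():
        print(f"{name}: {picojoules_per_op(rate, power):.2f} pJ/op")
```

Computed this way, the low-voltage M1 point comes to about 7.8 pJ/op, which matches the slide's 8 pJ/op far better than the 10 W operating point (about 20.8 pJ/op). The other entries land within a few percent of the slide's figures.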

8 Streaming Scientific Applications
Columns: Application | GFLOPS (out of 64 [1]) | FLOPs/Mem ref | Refs | SRF Refs | Mem Refs
  StreamFEM (Euler, quad):  Refs ,505,648 (93.6%) | SRF Refs 10,299,776 (5.7%) | Mem Refs 1,354,448 (0.7%)
  StreamFEM (MHD, cubic):   Refs ,294,080 (94.0%) | SRF Refs 43,762,752 (5.6%) | Mem Refs 3,165,280 (0.4%)
  StreamMD (gridded):       Refs ,743,216 (96.5%) | SRF Refs 9,505,088 (2.1%) | Mem Refs 5,978,848 (1.4%)
  StreamFLO (key kernels [3]): 50 | (96%) | (2%) | (2%)
[1] Simulations run on version of simulator with 64 GFLOPS nodes.
[2] StreamMD performance limited by false dependency.
[3] Estimated from key kernels.

9 Streaming in Time and Space
[Figure: kernels K1, K2, K3 mapped in space and in time]
Space Multiplexing
  + Little storage required
  + Exploits control parallelism
  - Load imbalance
  - MIMD control
  - Requires IPC
Time Multiplexing
  + Perfectly load balanced
  + Exploits data parallelism
  + SIMD control (power & area)
  - Requires storage (SRF)
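The load-balance tradeoff listed above can be illustrated with a toy cost model. This is a sketch under simplifying assumptions, not a model of Imagine or Merrimac: kernel costs are fixed cycles per element, space multiplexing gives each kernel exactly one cluster with unbounded buffering between stages, and time multiplexing runs each kernel over the whole stream on all clusters before starting the next. The function names and the example costs are hypothetical.

```python
from math import ceil


def space_multiplexed(costs, n_elements):
    """One cluster per kernel; elements are pipelined through the kernels.
    Returns (total cycles, average cluster utilization)."""
    bottleneck = max(costs)
    # Pipeline fill for the first element, then one bottleneck
    # interval for each remaining element.
    total = sum(costs) + (n_elements - 1) * bottleneck
    busy = sum(costs) * n_elements
    return total, busy / (len(costs) * total)


def time_multiplexed(costs, n_elements, clusters):
    """Every cluster runs the same kernel; kernels run one after another.
    Returns (total cycles, utilization, elements buffered between kernels)."""
    per_cluster = ceil(n_elements / clusters)
    total = sum(per_cluster * c for c in costs)
    busy = sum(costs) * n_elements
    # Each kernel's whole output stream is held until the next kernel runs.
    return total, busy / (clusters * total), n_elements


if __name__ == "__main__":
    costs = [2, 6, 4]  # hypothetical cycles per element for K1, K2, K3
    print("space:", space_multiplexed(costs, 1200))
    print("time: ", time_multiplexed(costs, 1200, clusters=3))
```

With these example costs the pipelined mapping is limited by K2 and leaves the other two clusters partly idle, while the time-multiplexed mapping keeps all three clusters busy at the price of buffering the full intermediate stream between kernels.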

10 Load Imbalance in OpenGL Pipeline vs Scene

11 Some Interesting Questions & Topics
Streamifying compiler
  Automatically convert C or Fortran to kernels and streams
Locality enhancement
  Program transformations to enhance use of SRF
What applications do and don't stream well?
  All applications with data parallelism do stream well (dependence distance)
  For those that don't, why don't they (no DP, dependences, control)
Aspect ratio
  How much DP vs ILP vs TLP
Storage architecture
  SRF indexing, switching, virtualization partitioning, switching
Conditionals
  How much MIMD is needed?

12 Conclusion
Stream programs expose locality and concurrency
Stream processors exploit these properties
  Concurrency uses lots of ALUs and hides latency
  Locality reduces communication and makes it explicit
  Partition kernels in time, space, or both
Imagine demonstrates stream processing for media applications
  Many applications demonstrated
  pJ/op can be made <2x that of special-purpose systems
Merrimac exploring stream processing for scientific applications
  1/0.3 TFLOPS peak/sustained per node vs. 10/0.5 GFLOPS
  Global memory bandwidth is still an issue
Many challenging questions and topics remain
  Compilation, Architecture, and Applications

13 My project is a stream processor too
