IMAGINE: Signal and Image Processing Using Streams

1 IMAGINE: Signal and Image Processing Using Streams
  Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles
  Concurrent VLSI Architecture Group, Computer Systems Laboratory, Stanford University

2 Imagine: A Programmable Signal and Image Processor
  Motivation
    Applications poorly matched to conventional architectures
  Key stream architecture features
    High computational bandwidth (Imagine: 48 on-chip ALUs)
    Stream register organization
    Data bandwidth hierarchy
  Performance density of a special-purpose processor
    0.59 cm² CMOS chip, 0.13 µm standard cell, 500 MHz
    20 GFLOPS peak performance (40 GOPS fixed point)
    10 GFLOPS sustained on several apps
    > 2 GFLOPS/W, > 5 GOPS/W

3 Representative Applications
  Stereo Depth Extraction
  Polygon Rendering
  MPEG Encoding/Decoding
  [Figure labels: Render; Encode/Decode; 2D Video Stream; Encoded 2D Data]

4 Stream Processing
  [Figure: stream dataflow graph. Nodes: Image 0, convolve, convolve, Image 1, convolve, convolve, SAD, Depth Map. Legend: Input Data, Kernel, Stream, Output Data]
  Little data reuse (pixels never revisited)
  Highly data parallel (output pixels not dependent on other output pixels)
  Compute intensive (60 arithmetic operations per memory reference)
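The SAD (sum of absolute differences) step in the graph above can be illustrated with a minimal sketch. It is not the Imagine kernel: it assumes a single image row, unsigned 8-bit pixels, and a search over leftward shifts only, and the names `sad` and `depth_row` and their parameters are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Sum of absolute differences between a window of the left row starting at x
// and the window of the right row shifted left by `disparity` pixels.
// The caller guarantees disparity <= x and x + window <= row length.
int sad(const std::vector<std::uint8_t>& left,
        const std::vector<std::uint8_t>& right,
        std::size_t x, std::size_t disparity, std::size_t window) {
    int total = 0;
    for (std::size_t i = 0; i < window; ++i) {
        total += std::abs(static_cast<int>(left[x + i]) -
                          static_cast<int>(right[x + i - disparity]));
    }
    return total;
}

// For each window position, pick the disparity with the smallest SAD.
// Every output depends only on input pixels, never on another output,
// which is the data parallelism the slide points to.
std::vector<std::size_t> depth_row(const std::vector<std::uint8_t>& left,
                                   const std::vector<std::uint8_t>& right,
                                   std::size_t window,
                                   std::size_t max_disparity) {
    std::vector<std::size_t> out;
    if (left.size() != right.size() || left.size() < window) return out;
    for (std::size_t x = 0; x + window <= left.size(); ++x) {
        int best_cost = std::numeric_limits<int>::max();
        std::size_t best_d = 0;
        for (std::size_t d = 0; d <= max_disparity && d <= x; ++d) {
            const int cost = sad(left, right, x, d, window);
            if (cost < best_cost) {
                best_cost = cost;
                best_d = d;
            }
        }
        out.push_back(best_d);
    }
    return out;
}
```

Because each output position is computed independently, the outer loop could be split across parallel units without any synchronization between outputs.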

5 Characteristics of Media Applications
  Poorly matched to conventional architectures
    Instruction-Level Parallelism
    Caches
    Few arithmetic units
  Well matched to modern VLSI technology
    Lots (100s) of ALUs fit on a single chip
    Communication bandwidth is the scarce resource

6 Communication Bandwidth: Care and Feeding of ALUs
  Special-purpose processors: ALUs fed by dedicated wires/memories
  General-purpose processors: feeding structure dwarfs ALUs
  [Figure labels: IP, Instr. Cache, IR, Regs]

7 Stream Architecture Provides Data Bandwidth Hierarchy
  [Figure labels: SIMD/VLIW Control, Stream Register File]
  Peak BW: 2 GB/s (memory), 32 GB/s (global RF, the stream register file), 544 GB/s (local RFs)
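A small worked calculation may help relate these peak figures to the "60 arithmetic operations per memory reference" quoted for stereo depth extraction. It is a sketch under one stated assumption that is not on the slides: each memory reference moves one 32-bit (4-byte) word. The function names are hypothetical.

```cpp
// Memory bandwidth in GB/s needed to sustain `gops` billion operations per
// second, when each memory reference supports `ops_per_ref` operations and
// moves `bytes_per_ref` bytes (4 bytes is an assumption, not a slide value).
double required_memory_gbps(double gops, double ops_per_ref,
                            double bytes_per_ref) {
    return gops / ops_per_ref * bytes_per_ref;
}

// How many times faster one level of the hierarchy is than another.
double bandwidth_ratio(double level_gbps, double base_gbps) {
    return level_gbps / base_gbps;
}

// Values taken from the slides.
const double kMemoryGBps = 2.0;     // memory system peak
const double kGlobalRfGBps = 32.0;  // stream register file peak
const double kLocalRfGBps = 544.0;  // local register files peak
const double kPeakGflops = 20.0;    // peak floating-point rate
const double kOpsPerRef = 60.0;     // operations per memory reference
```

Under that assumption, `required_memory_gbps(kPeakGflops, kOpsPerRef, 4.0)` comes to about 1.33 GB/s, which fits under the 2 GB/s memory peak. The ratios between levels are 16x from memory to the stream register file and 17x from there to the local register files, so 272x overall.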

8 Application Data Bandwidth Usage
  [Figure labels: Stream Register File]

                       Memory BW    Global RF BW   Local RF BW
  Peak                 2 GB/s       32 GB/s        544 GB/s
  Depth Extractor      0.80 GB/s    [missing]      [missing]
  MPEG Encoder         0.47 GB/s    2.46 GB/s      [missing]
  Polygon Rendering    0.78 GB/s    4.06 GB/s      [missing]
  QR Decomposition     0.46 GB/s    3.67 GB/s      [missing]

9 Stream Register File: Details
  SRF: single-ported 128 KB SRAM (1024 x 32W), 32W/cycle
  Stream buffers to/from arithmetic clusters
  [Figure labels: Arbiter, SRF, Stream buffers]

10 Arithmetic Cluster: Details
  [Figure labels: Local Register File, To SRF, From SRF, *, *, /, CU, Intercluster Network, Cross Point]
  Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
  4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
  17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
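To make the "dual 16-bit / quad 8-bit" modes concrete, here is a software sketch of subword addition on values packed into one 32-bit word. It illustrates the idea only and says nothing about how the Imagine units implement it. Per-lane wraparound on overflow is an assumption (the slide does not say whether the hardware wraps or saturates), and `add_quad8` and `add_dual16` are hypothetical names.

```cpp
#include <cstdint>

// Add four unsigned 8-bit lanes packed into 32-bit words.
// Each lane wraps around independently on overflow (an assumption).
std::uint32_t add_quad8(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t low7 = 0x7F7F7F7Fu;  // low 7 bits of every lane
    const std::uint32_t top = 0x80808080u;   // top bit of every lane
    // Adding only the low 7 bits keeps any carry inside its own lane.
    const std::uint32_t sum = (a & low7) + (b & low7);
    // Fold the top bits back in with XOR, which discards the lane's carry-out.
    return sum ^ ((a ^ b) & top);
}

// Same technique for two unsigned 16-bit lanes.
std::uint32_t add_dual16(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t low15 = 0x7FFF7FFFu;
    const std::uint32_t top = 0x80008000u;
    const std::uint32_t sum = (a & low15) + (b & low15);
    return sum ^ ((a ^ b) & top);
}
```

For example, `add_quad8(0x01FF7F10, 0x0101017F)` yields `0x0200808F`: the lane holding `0xFF + 0x01` wraps to `0x00` without disturbing its neighbor.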

11 Programming Environment
  [Figure labels: Compile-time, Run-time, Host, StreamC, C++ compiler, stream scheduler, KernelC, kernel scheduler, microcode]

  StreamC (stream-level program):
    StereoDepthExtraction( ) {
      // Load Input Images
      ...
      // Run Kernels
      convolve7x7(RawImage, ConvImage);
      convolve3x3(ConvImage, Conv2Image);
      ...
      // Store Output
    }

  KernelC (kernel):
    Convolve7x7( ) {
      ...
      while (!in.empty()) {
        ...
        p0 = k0 * in10;
        p12 = k21 * in32;
        p34 = k43 * in54;
        p56 = k65 * in76;
        sum = (p0 + p12) + (p34 + p56);
        ...
      }
    }
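The split between a stream-level program and kernels can be mimicked in ordinary C++. The sketch below is not StreamC or KernelC: the `Stream` class, the 3-tap `convolve3` kernel, the filter coefficients, and `run_pipeline` are all hypothetical stand-ins chosen to keep the example short. It only mirrors the shape of the slide's code, where a kernel loops `while (!in.empty())` and the stream program chains kernels through intermediate streams.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for a stream: a FIFO consumed front to back.
template <typename T>
class Stream {
public:
    Stream() = default;
    explicit Stream(const std::vector<T>& data)
        : q_(data.begin(), data.end()) {}
    bool empty() const { return q_.empty(); }
    std::size_t size() const { return q_.size(); }
    T pop() {
        T v = q_.front();
        q_.pop_front();
        return v;
    }
    void push(const T& v) { q_.push_back(v); }

private:
    std::deque<T> q_;
};

// Kernel: 3-tap convolution. Each input element is read exactly once and
// the two previous elements are kept in local variables, so no input is
// revisited. History before the first element is taken as zero (assumption).
void convolve3(Stream<float>& in, Stream<float>& out, const float k[3]) {
    float prev1 = 0.0f;
    float prev2 = 0.0f;
    while (!in.empty()) {
        const float x = in.pop();
        out.push(k[0] * x + k[1] * prev1 + k[2] * prev2);
        prev2 = prev1;
        prev1 = x;
    }
}

// Stream-level program: chain two kernels through an intermediate stream,
// in the spirit of convolve7x7 followed by convolve3x3 on the slide.
Stream<float> run_pipeline(const std::vector<float>& raw) {
    const float smooth[3] = {0.25f, 0.5f, 0.25f};
    const float diff[3] = {1.0f, -1.0f, 0.0f};
    Stream<float> in(raw);
    Stream<float> mid;
    Stream<float> out;
    convolve3(in, mid, smooth);
    convolve3(mid, out, diff);
    return out;
}
```

Calling `run_pipeline({4, 4, 4, 4})` produces the stream 1, 2, 1, 0: the smoothing kernel emits 1, 3, 4, 4 and the difference kernel then subtracts each element's predecessor.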

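12 Performance
  [Figure: bar chart of performance in GOPS for depth, mpeg, qrd, dct, convolve, fft. Legend: "... bit applications", "floating-point application", "... bit kernels", "floating-point kernel". Surviving axis and data labels: 0, 5, 20 GOPS, 19.8]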

13 Sustained Application Performance
  Stereo Depth Extraction
    320x240 8-bit grayscale
    200 frames/second
  Polygon Rendering
    4.5 million vertices/sec
    5.1 million pixels/sec
    SPECviewperf ADVS benchmark (unlit)
  MPEG Encoding
    720x[...] [...]-bit color
    120 frames/second
  [Figure labels: Render, Encode, 2D Video Stream, Encoded 2D Data]

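14 Power Estimates
  [Figure: bar chart of estimated power in watts for depth, mpeg, qrd, dct, convolve, fft, and average, broken down into Clock, Clusters, SRF, Pins, Mem Sys, and Other. Surviving percentage labels: 1%, 2%, 6%, 23%. GOPS/W: [values missing]]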

15 The Stream Processor
  [Figure: block diagram. Labels: Host Processor, Stream Controller, Streaming Memory System, Stream Register File: 32kW SRAM, Microcontroller: 2K VLIW Instrs, Network Interface, Network, Stream Processor]

16 Floorplan
  22 million transistors, 500 MHz
  TI GS30KA: 0.15 µm L drawn, 0.13 µm L eff CMOS process
  [Figure labels: Stream Controller, SRF Control, Network Interface, Micro-Controller, Memory System, Streambuffers, SRF, Streambuffers; die dimensions in mm missing]

17 VLSI Implementation: 22M Transistors with 7 grad students
  Stream architecture reduces VLSI design complexity
    Modularity / Replication
    Long wire delays converted to explicit communications
      Exposed to microarchitecture, software
  Design methodology
    Standard ASIC flow with forced placement of datapaths
    Bitslice Verilog
      Improved area, delay
    Pre-placement wire length estimates
      Reduce design iterations

18 Status
  Imagine team accomplishments
    Cycle-accurate simulator
    Software tools
    Completed synthesizable Verilog
    Arithmetic units implemented in standard cells
  Industrial partners
    Texas Instruments: Fab
    Intel
  Future work
    Circuits/Logic: expected completion 9/15/00
    Tapeout: expected Q4/

19 Summary
  Key stream architecture features
    Stream register organization
    Data bandwidth hierarchy
  Performance density of a special-purpose processor
    10 GFLOPS sustained on several apps
    > 2 GFLOPS/W, > 5 GOPS/W
  VLSI Implementation
    Validate architectural concepts
    Develop experimental prototype

EE482S Lecture 1 Stream Processor Architecture

April 4, 2002. William J. Dally, Computer Systems Laboratory, Stanford University, billd@csl.stanford.edu
Today's Class Meeting: What is EE482C? Material covered