Vector IRAM: A Microprocessor Architecture for Media Processing


Christoforos E. Kozyrakis
kozyraki@cs.berkeley.edu
CS252 Graduate Computer Architecture
February 10, 2000

Outline
- Motivation for IRAM
  - technology trends
  - design trends
  - application trends
- Vector IRAM
  - instruction set
  - prototype architecture
  - performance

Processor-DRAM Gap (latency)
[Figure: performance vs. time. CPU performance ("Moore's Law") improves 60%/yr.; DRAM performance improves 7%/yr.]
- Processor-Memory Performance Gap grows 50% / year

Processor-DRAM Tax
[Figure: millions of transistors devoted to logic vs. memory for Intel PIII Xeon, MIPS R12000, HP PA-8500, Sun Ultra-2, PowerPC G4, IBM Power3, AMD Athlon, and Alpha.]
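
The 50% / year figure follows from the two growth rates on the slide: the gap is the ratio of the two curves, so it compounds at (1 + 0.60) / (1 + 0.07) per year. A minimal sketch of that arithmetic; the function names are illustrative, not from the slides.

```python
def relative_gap_growth(cpu_rate: float, dram_rate: float) -> float:
    """Yearly growth of the CPU/DRAM performance ratio, as a fraction."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


def gap_after(years: int, cpu_rate: float = 0.60, dram_rate: float = 0.07) -> float:
    """Size of the gap after `years`, relative to a gap of 1.0 at year 0."""
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


growth = relative_gap_growth(0.60, 0.07)
print(f"gap grows {growth:.1%} per year")        # roughly 50% per year
print(f"gap after 10 years: {gap_after(10):.0f}x")
```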

Power Consumption
[Figure: performance (Spec95FP) vs. power (W) for Alpha, AMD Athlon, IBM Power3, PowerPC G4, Sun Ultra-2, HP PA-8500, MIPS R12000, and Intel PIII Xeon.]

Other Design Challenges
- Interconnect scaling problems
  - multiple cycles to go across the chip
  - difficult to achieve single cycle result forwarding
  - need to add extra pipeline stages at the cost of power, complexity, branch and load-use latency
- Design complexity of high-end CPUs
  - 4 to 5 years from scratch to chips for new superscalar architectures
  - >100 engineers
  - >50% of resources to design verification

Complexity Vs. Performance Gains

                          R5000      R10000        R10K/R5K
Clock Rate                200 MHz    195 MHz       1.0x
On-Chip Caches            32K/32K    32K/32K       1.0x
Instructions/Cycle        1 (+ FP)   4             4.0x
Pipe stages                                        x
Model                     In-order   Out-of-order  ---
Die Size (mm2)                                     x
  w/o cache, TLB                                   x
Development (man years)                            x
SPECint_base                                       x

Future microprocessor applications
- Multimedia applications
  - image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption etc.
  - narrow data types, streaming data, real-time requirements
- Mobile and embedded environments
  - notebooks, PDAs, digital cameras, cellular phones, pagers, game consoles, cars etc.
  - small devices, limited chip-count, limited power/energy budget
- Significantly different environment from the desktop/workstation model

Requirements on microprocessors (1)
- High performance for multimedia:
  - real-time performance guarantees
  - support for continuous media data-types
  - exploit fine-grain parallelism
  - exploit coarse-grain parallelism
  - exploit high instruction reference locality
  - code density
  - high memory bandwidth

Average vs. real time performance
[Figure: percentage of inputs (0% to 45%) vs. performance, from worst case to best case, for three designs A, B, and C, with the average marked.]
- Which one is the best?
  - Statistical (average): C
  - Real time (worst case): A
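
The distinction between the two criteria can be sketched in a few lines: ranking by mean picks one design, ranking by worst case picks another. The sample values below are hypothetical, chosen only so that the outcome matches the slide (C wins on average, A wins on worst case); they are not data from the presentation.

```python
from statistics import mean

# Hypothetical per-input performance samples (higher is better).
samples = {
    "A": [48, 50, 52, 51, 49],   # narrow spread, good worst case
    "B": [30, 55, 60, 58, 45],
    "C": [20, 70, 75, 72, 68],   # best on average, poor worst case
}


def best_by(metric, data):
    """Return the design name whose samples maximize `metric`."""
    return max(data, key=lambda name: metric(data[name]))


best_average = best_by(mean, samples)      # statistical criterion
best_worst_case = best_by(min, samples)    # real-time criterion
print("best on average:", best_average)
print("best worst case:", best_worst_case)
```

A real-time system has to budget for the slowest input, which is why the worst-case ranking is the relevant one for media workloads with deadlines.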

Requirements on microprocessors (2)
- Low power and energy consumption
  - energy efficiency for long battery life
  - power efficiency for system cost reduction (cooling system, packaging etc.)
- Design scalability
  - performance scalability
  - physical design scalability
  - design complexity, verification complexity
  - immunity to interconnect scaling problems: locality of interconnect, tolerance to latency
- System-on-a-chip (SoC)
  - highly integrated system
  - low system chip-count

The IRAM vision statement
Microprocessor & DRAM on a single chip:
- on-chip memory latency 5-10X, bandwidth X
- improve energy efficiency 2X-4X (no off-chip bus)
- serial I/O 5-10X vs. buses
- smaller board area/volume
- adjustable memory size/width
[Figure: a conventional system (processor with caches and L2 cache from a logic fab, DRAM chips from a DRAM fab, connected by buses to I/O) next to a single-chip processor-plus-DRAM system.]

Vector IRAM
- Vector processing
  - high-performance for media processing
  - low power/energy for processor control
  - modularity, low complexity
  - scalability
  - well understood software development
- Embedded DRAM
  - high bandwidth for vector processing
  - low power/energy for memory accesses
  - modularity, scalability
  - small system size

Vector IRAM ISA summary
- Full vector instruction set with
  - 32 vector registers, 32 vector flag registers
  - support for multiple data types (64b, 32b, 16b, 8b)
  - support for strided and indexed memory accesses
  - support for auto-increment addressing
  - support for DSP operations (multiply-add, saturation etc.)
  - support for conditional execution
  - support for software speculation
  - support for fast reductions and butterfly permutations
  - support for virtual memory
  - restartable arithmetic (FP & integer) exceptions
- Implemented as a coprocessor extension to MIPS64 ISA (coprocessor 2)
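
Three of the ISA features listed above (strided access, indexed access, and conditional execution under a flag register) can be illustrated with a small software model. This is a sketch of the general vector-architecture semantics, not of the VIRAM encoding: the function names are made up for illustration, and addresses are counted in elements rather than bytes to keep the example short.

```python
def strided_load(memory, base, stride, vl):
    """Load `vl` elements starting at `base`, `stride` elements apart."""
    return [memory[base + i * stride] for i in range(vl)]


def indexed_load(memory, base, indices):
    """Gather: load one element per entry of an index vector."""
    return [memory[base + idx] for idx in indices]


def masked_add(a, b, flags, old):
    """Conditional execution: write a+b only where the flag bit is set."""
    return [x + y if f else o for x, y, f, o in zip(a, b, flags, old)]


memory = list(range(100, 132))                  # memory[i] == 100 + i
column = strided_load(memory, base=2, stride=4, vl=4)
gathered = indexed_load(memory, base=0, indices=[7, 1, 7, 30])
result = masked_add([1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0], [0, 0, 0, 0])
print(column, gathered, result)
```

A strided load is what walks a column of a row-major image; an indexed load covers table lookups and sparse data.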

Vector architectural state
[Figure: virtual processors VP 0, VP 1, ..., VP $vlr-1, each $vpw wide, with 32 general purpose registers (vr 0 to vr 31) and 32 1b flag registers (vf 0 to vf 31); 32 control registers (vcr 0 to vcr 31); 32 64b scalar registers (vs 0 to vs 31).]

Fixed-point Multiply-add
[Figure: inputs x and y (n/2 bits each) are multiplied into an n-bit product, which is shifted right and rounded to give z (n bits); z is added to a (n bits) with saturation to give w (n bits).]
- Multiply halves & shift instruction provides support for any fixed-point format
- Precision is equal to the datatype width; multiplier's inputs have half the width
- Uniform, simple support for all datatypes
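
The multiply, shift-and-round, then add-and-saturate sequence described on this slide can be sketched as follows. Two details are assumptions, since the slide does not state them: operands are treated as signed two's-complement values, and rounding is round-to-nearest (add half, then shift). The function names are illustrative.

```python
def saturate(value: int, n: int) -> int:
    """Clamp `value` to the signed n-bit range."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, value))


def fixed_mul_add(x: int, y: int, a: int, n: int, shift: int) -> int:
    """w = sat(a + round((x * y) >> shift)) for an n-bit datatype.

    x and y must fit in n/2 bits (signed, by assumption); the product
    then fits in n bits. `shift` selects the fixed-point format.
    """
    half = n // 2
    lo, hi = -(1 << (half - 1)), (1 << (half - 1)) - 1
    if not (lo <= x <= hi and lo <= y <= hi):
        raise ValueError("multiplier inputs must fit in n/2 bits")
    product = x * y
    rounding = (1 << (shift - 1)) if shift > 0 else 0   # assumed rounding mode
    z = (product + rounding) >> shift
    return saturate(a + z, n)


print(fixed_mul_add(100, 50, 1000, n=16, shift=4))      # no saturation
print(fixed_mul_add(100, 50, 32700, n=16, shift=4))     # clamps at the 16-bit maximum
```

Because the shift amount is an operand rather than a property of the datatype, the same operation serves any position of the binary point, which is the point the slide makes about supporting any fixed-point format.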

VIRAM-1 prototype

Design Overview
- 64b MIPS scalar core
  - coprocessor interface
  - 16KB I/D caches
- Vector unit
  - 8KByte vector register file
  - support for 64b, 32b, and 16b data-types
  - 2 arithmetic (1 FP), 2 flag processing, 1 load-store units
  - 4 64-bit datapaths per unit
  - DRAM latency included in vector pipeline
  - 4 addresses/cycle for strided/indexed accesses
  - 2-level TLB
- Memory system
  - 8 2MByte eDRAM banks
  - single sub-bank per bank
  - 256-bit synchronous interface, separate I/O signals
  - 20ns cycle time, 6.6ns column access
  - crossbar interconnect for 12.8 GB/sec per direction
  - no caches
- Network interface
  - user-level message passing
  - dedicated DMA engines
  - 4 100MByte/s links

Vector Unit Pipeline Structure
- Single-issue, in-order pipeline
  - each instruction can specify up to 128 operations and occupy a functional unit for 8 cycles
- DRAM latency is included in the execution pipeline (delayed pipeline)
  - deep pipeline design, but no caches needed to avoid stalls
  - worst case DRAM latency does not cause pipeline stalls
- Address decoupling buffer
  - buffers memory addresses in the presence of conflicts (indexed/strided accesses)
  - memory conflicts do not stall pipeline

Non-Delayed Pipeline
[Figure: pipeline diagram. Scalar stages F D X M W; VLOAD stages A T ... VW with DRAM latency >= 20ns; VALU stages VR X1 X2 ... XN VW; VSTORE stages A T VR. Instruction sequence vld, vadd, vst, repeated, with a long load -> ALU RAW hazard.]
- Load -> ALU exposes full DRAM latency (long)
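
The "128 operations" and "8 cycles" figures can be derived from the design overview numbers (8KByte register file, 32 vector registers, 4 64-bit datapaths per unit). The sketch below assumes that each 64-bit datapath is subdivided for narrower elements (two 32b or four 16b operations per cycle); that assumption is not stated on this slide, but it is consistent with the figures it quotes.

```python
REGISTER_FILE_BYTES = 8 * 1024   # 8KByte vector register file
NUM_VECTOR_REGISTERS = 32
DATAPATHS_PER_UNIT = 4
DATAPATH_BITS = 64


def max_vector_length(element_bits: int) -> int:
    """Elements held by one vector register at the given element width."""
    register_bits = REGISTER_FILE_BYTES * 8 // NUM_VECTOR_REGISTERS
    return register_bits // element_bits


def cycles_per_instruction(element_bits: int) -> int:
    """Cycles a full-length vector instruction occupies one functional unit.

    Assumes each 64-bit datapath processes 64 / element_bits elements per cycle.
    """
    elements_per_cycle = DATAPATHS_PER_UNIT * (DATAPATH_BITS // element_bits)
    return max_vector_length(element_bits) // elements_per_cycle


for bits in (64, 32, 16):
    print(bits, max_vector_length(bits), cycles_per_instruction(bits))
```

Under these assumptions a register holds 32, 64, or 128 elements for 64b, 32b, and 16b data, and a full-length instruction takes 8 cycles at every width.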

Tolerating Memory Latency: Delayed Pipeline
[Figure: pipeline diagram. Scalar stages F D X M W; VLOAD stages A T ... VW with DRAM latency > 20ns; VALU stages DELAY VR X1 ... XN VW; VSTORE stages A T VR. Instruction sequence vld, vadd, vst, repeated, with a short load -> ALU RAW hazard.]
- Load -> ALU sees functional unit latency (short)

Clustered VLSI Design
[Figure: shared control plus four identical 64b clusters (256b total), each containing Datapath 0, Registers, Flag Regs. & Datapath, FP Datapaths, and Datapath 1.]

VIRAM-1 Floorplan
[Figure: floorplan showing the NI, MIPS, CTL, and IO blocks and Lane 0 through Lane 3.]

Prototype Summary
- Technology: 0.18um eDRAM CMOS process (IBM)
  - 6 layers of copper interconnect
  - 1.2V and 1.8V power supply
- Memory: 16 MBytes
- Clock frequency: 200MHz
- Power: 2 W for vector unit and memory
- Transistor count: ~140 million
- Peak performance:
  - GOPS with multiply-add: 3.2 (64b), 6.4 (32b), 12.8 (16b)
  - GOPS without multiply-add: 1.6 (64b), 3.2 (32b), 6.4 (16b)
  - GFLOPS: 1.6 (32b)
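
The peak figures can be reproduced from the unit counts given earlier (2 arithmetic units of which 1 handles FP, 4 64-bit datapaths per unit, 200MHz). The sketch below is one reading that matches the listed numbers, under two assumptions: each 64-bit datapath is subdivided for narrower elements, and a multiply-add counts as two operations. The slides do not say how the GFLOPS figure is counted, so that line in particular is an interpretation.

```python
CLOCK_HZ = 200e6
ARITHMETIC_UNITS = 2
FP_UNITS = 1
DATAPATHS_PER_UNIT = 4
DATAPATH_BITS = 64


def peak_gops(element_bits: int, units: int = ARITHMETIC_UNITS,
              ops_per_element: int = 1) -> float:
    """Peak giga-operations per second at the given element width."""
    elements_per_cycle = units * DATAPATHS_PER_UNIT * (DATAPATH_BITS // element_bits)
    return elements_per_cycle * ops_per_element * CLOCK_HZ / 1e9


for bits in (64, 32, 16):
    plain = peak_gops(bits)
    with_madd = peak_gops(bits, ops_per_element=2)   # multiply-add counted as 2 ops
    print(f"{bits}b: {plain:.1f} GOPS, {with_madd:.1f} GOPS with multiply-add")

print(f"32b FP: {peak_gops(32, units=FP_UNITS):.1f} GFLOPS")
```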

Kernels Performance

                     Peak Perf.    Sustained Perf.   % of Peak
Image Composition    6.4 GOPS      6.40 GOPS         100.0%
iDCT                 6.4 GOPS      1.97 GOPS         30.7%
Color Conversion     3.2 GOPS      3.07 GOPS         96.0%
Image Convolution    3.2 GOPS      3.16 GOPS         98.7%
MV Multiply          3.2 GOPS      2.77 GOPS         86.5%
VM Multiply          3.2 GOPS      3.00 GOPS         93.7%
FP MV Multiply       1.6 GFLOPS    1.40 GFLOPS       87.5%
FP VM Multiply       1.6 GFLOPS    1.59 GFLOPS       99.6%
AVERAGE                                              86.6%

Note: simulations did not include memory optimizations (address decoupling, small strides optimizations, address hashing), or fixed-point multiply-add integer datapaths

Comparisons
[Table: cycles/pixel for VIRAM, MMX, VIS, and TMS320C82 on Image Composition, iDCT, Color Conversion, and Image Convolution. Surviving entries, in extraction order: Image Composition: (17.0x) -; iDCT: (3.2x) - -; Color Conversion and Image Convolution: (10.2x) (7.6x) (4.5x) 6.19 (5.1x) 6.50 (5.3x).]
- All numbers in cycles/pixel
- MMX, VIS, and TMS results assume all data in L1 cache
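
The AVERAGE row of the kernels table is the arithmetic mean of the eight "% of Peak" entries. A short check, using the percentages exactly as listed; the dictionary name is illustrative.

```python
# "% of Peak" values as listed in the kernels table.
percent_of_peak = {
    "Image Composition": 100.0,
    "iDCT": 30.7,
    "Color Conversion": 96.0,
    "Image Convolution": 98.7,
    "MV Multiply": 86.5,
    "VM Multiply": 93.7,
    "FP MV Multiply": 87.5,
    "FP VM Multiply": 99.6,
}

average = sum(percent_of_peak.values()) / len(percent_of_peak)
print(f"average: {average:.1f}% of peak")

# The one clear outlier is the kernel furthest below the mean.
lowest = min(percent_of_peak, key=percent_of_peak.get)
print("lowest:", lowest)
```

Excluding iDCT, every kernel in the table sustains more than 86% of peak.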

FFT Performance
[Figure: FFT time (microseconds) vs. size (256, 512, 1024 points), fixed point (16 bit) and floating point (32 bit). Labeled points: Pentium/200: 151 us; PPC604e: 87 us; TMS320C67x: 124 us; TigerSHARC: 41 us; VIRAM: 37 us; CRI Pulsar: 27.9 us; Wildstar: 25 us; CRI Pathfinder-1: 22.3 us.]

Note: simulations performed with unscheduled fixed-point code

Motion Estimation Performance

Size                QCIF (176x144)      CIF (352x288)
VIRAM-1 (cycles)    7.1x10^6 (4.6x)     2.8x10^7 (5.0x)
MMX (cycles)        3.3x                x10^8

Note: MMX results assume all data in L1 cache

Overall Performance of H.263

Sequence    Bit rate         Encoding speed
Akiyo       12.95 kbit/s     23.5 fps
Mom         16.25 kbit/s     22.7 fps
Hall        20.47 kbit/s     22.7 fps
Foreman     65.52 kbit/s     20.9 fps

Average encoding speed for H.263 on VIRAM, for standard MPEG test sequences, using exhaustive search for motion estimation and LLM for DCT.

Note: simulations did not include memory optimizations (address decoupling, small strides optimizations, address hashing), or fixed-point multiply-add integer datapaths

Summary

Class Project Suggestions
- Architecture comparisons & applications
  - information retrieval
  - signal processing apps
  - neural nets training
- Multimedia application analysis
  - operand reuse patterns
  - branch behavior
  - data/value locality and memory access patterns
- Low power/energy architectures
  - energy-exposed ISA design
  - compilation for low energy
  - speculation use for power reduction
