Short vector extensions to state-of-the-art superscalar processors: Sun VIS, HP MAX-2, SGI MDMX, Digital MVI, Intel MMX, Intel Katmai, Motorola AltiVec


1 Simple Vector Microprocessors for MultiMedia Applications
Corinna G. Lee and Mark G. Stoodley
University of Toronto
Paper:

2 Current Trends
multimedia applications are growing in importance
current hardware trend to support multimedia is to add short vector extensions to state-of-the-art superscalar processors
  Sun VIS, HP MAX-2, SGI MDMX, Digital MVI, Intel MMX, Intel Katmai, Motorola AltiVec
BUT control logic for complex superscalar processor is difficult to implement
  over past 2 years, shippings have been delayed repeatedly to meet target speeds
  late shippings attributed to complex out-of-order designs
  ref: Linley Gwennap, MPR articles, Feb, Oct, Dec 1997

3 Alternative Hardware Solution
use simple vector processor design:
  2-way, in-order
  vector length of 64
  vector width of 8 (i.e., has 8 lanes)
for this study, focus on multimedia applications
  important emerging applications area
  others have demonstrated effectiveness of vector architectures on floating-point applications and SPECint programs
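A minimal sketch of what a vector length of 64 on 8 lanes implies for timing. It assumes each lane completes one element per cycle and ignores pipeline startup latency; neither assumption comes from the slides, and the function name `vector_occupancy_cycles` is hypothetical.

```python
def vector_occupancy_cycles(vector_length: int, lanes: int) -> int:
    """Cycles one vector instruction keeps the vector unit busy.

    Assumption (not from the slides): each lane completes one element
    per cycle, and pipeline startup latency is ignored.
    """
    if vector_length < 0 or lanes <= 0:
        raise ValueError("vector_length must be >= 0 and lanes must be > 0")
    # Ceiling division: a partially filled last group still costs a cycle.
    return (vector_length + lanes - 1) // lanes


if __name__ == "__main__":
    # Long vector configuration from the slide: VL = 64, 8 lanes.
    print(vector_occupancy_cycles(64, 8))
    # A full 8-element vector on the same 8 lanes, for comparison.
    print(vector_occupancy_cycles(8, 8))
```

Under these assumptions a single long vector instruction occupies the vector unit for several cycles, which is why a narrow in-order issue stage may be enough to keep the datapath busy.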

4 Outline
Motivation
Processor Configurations
Die Area Estimates
Performance Results
Summary and Conclusions

5 Superscalar and Vector Processors
Feature | OOO Superscalar | OOO Short Vector | Simple Long Vector
ISA | 64b MIPS | 64b MIPS with vector extensions | 64b MIPS with vector extensions
issue order | out of order | out of order | in order
fetch width | 4 instructions | 4 instructions | 2 instructions
issue width | 4 instructions | 4 instructions | 2 instructions
re-order buffer size | 56 instructions | 56 instructions | -
#physical registers | 64 int, 64 FP | 64 int, 64 FP, 32 8-element Vreg | 32 int, 32 FP, [missing] 64-element Vreg
datapath | 2 IUs, 1 LSU | 2 IUs, 1 LSU, 1 VU with 8 IUs | 2 IUs, 1 LSU, 1 VU with 8 IUs
memory system | 64-bit data bus, 64-bit address bus, R10000-based 2-level cache memory (all three processors)
compiler | SGI C V5.3 O2 | SGI C V5.3 O2 and VSUIF V1.1.0 | SGI C V5.3 O2 and VSUIF V1.1.0

6 T0-Based Vector Processor
main differences:
  vector length of 64, not 32
  1 VU, not 2 VUs
  64-bit data bus, not 128 bits
  64-bit datapath, not 32 bits
  more powerful scalar core
  R10000-like latencies for operations
  fully pipelined multiply for narrow data types (≤ 16 bits)

7 Areas for Two Implementations of the Long Vector Processor
Component Area in mm^2, Scaled to 0.25µm
Component | Area Efficient Implementation | High Performance Implementation
Vector Datapath: 64b integer units, load/store unit | [missing] | [missing]
Vector Register File: 64b 64-element vector registers | [missing] | [missing]
MIPS R: 64b integer and FP datapath, scalar integer and FP register file, scalar instruction issue | [missing] | [missing]
Clocking and Overhead | [missing] | [missing]
Total | [missing] | [missing]

8 64b Scalar Integer Unit
area-efficient implementation: 3mm^2
  2 x area of 32b scalar datapath + 4 x area for 16x16 multiplier array
  very conservative estimate: area for IU is < 1mm^2
high-performance implementation: 4.5mm^2
  based on OOO integer unit in [missing] (=21mm :25mm 2)

9 Load/Store Unit
scalar and vector memory instructions use same memory interface
  only one address bus
  address processing handled by scalar portion of processor
need 64x512 cross-bar to transfer data between 64b memory data bus and 8 64b register buses
area estimate: area for 128x256 crossbar + 2 x area for shifting/aligning 32b data
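A quick arithmetic check on the two crossbar sizes mentioned on this slide. It assumes crossbar area scales with the number of crosspoints, which the slide does not state; the helper name `crosspoints` is hypothetical.

```python
def crosspoints(inputs: int, outputs: int) -> int:
    """Number of switch points in an inputs-by-outputs crossbar."""
    return inputs * outputs


# 8 register buses of 64 bits each give the 512-bit side of the crossbar.
register_side_bits = 8 * 64

needed = crosspoints(64, register_side_bits)   # 64x512 crossbar
reference = crosspoints(128, 256)              # 128x256 crossbar

print(needed, reference, needed == reference)
```

Under that assumption the two crossbars have the same crosspoint count, so the area of one is a plausible stand-in for the other.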

10 Vector Register File
based on layout details given in Asanovic's Ph.D. thesis
includes area for overhead circuitry
  read sense amplifiers, data latches for writes and reads, multiplexors, drivers, etc.
area-efficient implementation: time-multiplexes word and bit lines
high-performance implementation: uses extra buses and ports to avoid time-multiplexing

11 Area Comparison with Existing Superscalar Processors
areas based on actual VLSI implementations
die areas scaled to a 0.25µm process to eliminate areal differences due to differences in line size
areas are for processor components only; areas for cache and TLB structures, external interface logic, and the pad ring excluded
want area differences to be due to parallel-specific features

12 Breakdown of Processor Die Areas
[Figure: stacked bars of Datapath, Registers, Instruction Issue, and Other area for IO 2-way superscalar (MIPS R), OOO 4-way superscalar (MIPS R, Alpha), OOO 4-way short vector (HP PA), and simple long vector (area efficient, high performance). Axis: Processor Die Area (in mm^2, scaled to 0.25µ). One surviving data label: 52.]

13 Highly Vectorizable Benchmark Programs
Benchmark | Description | Data Width | Input
chroma | Merges two images on the basis of a "whiteness" threshold. | 8 bit | 320x240 24-bit color image(s)
colorspace | Converts an image in RGB colorspace to YUV values. | 8 bit | 320x240 24-bit color image(s)
composite | Blends two images together by a blend factor. | 8 bit | 320x240 24-bit color image(s)
convolve | Convolves an image with a 3x3 16-bit kernel. | 8,16 bit | 320x240 24-bit color image(s)
decrypt.unroll | Unrolled version of IDEA decryption. | 16 bit | 16,000-byte message
decrypt.inter | Loop-interchanged version of IDEA decryption. | 16 bit | 16,000-byte message

14 VIVACE Compiler/Simulation Infrastructure
[missing]

15 Processor Performance
[Figure: Speedup over OOO superscalar for OOO superscalar, OOO short vector, and simple long vector on chroma, colorspace, composite, convolve, decrypt.inter, decrypt.unroll, Arithmetic Average, and Geometric Mean.]

16 CPI and Dynamic Instruction Count
[Figure, top panel: Cycles per Instruction for OOO superscalar, OOO short vector, and simple long vector on chroma, colorspace, composite, convolve, decrypt.inter, decrypt.unroll, Arithmetic Average, and Geometric Mean. One surviving data label: 4.25.]
[Figure, bottom panel: Number of Instructions (in millions), split into scalar instructions, vl8 vector instructions, and vl64 vector instructions, for the same benchmarks.]

17 Stripmining = Implicitly Loop Unrolling
Loop Version | Static Instructions | Dynamic Instruction Count
rolled | load r3,0(r2); add r2,r2,4 | 128
explicitly unrolled 64 times | load r3,0(r2); load r3,4(r2); ...; add r2,r2,256 | 65
stripmined with VL=64 | vload v3,(r2); add r2,r2,256 | 2
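The dynamic instruction counts on this slide can be reproduced with a short sketch. It assumes 64 elements are processed and, like the slide, counts only the loads and the pointer-increment add (no compare or branch instructions). The function names are hypothetical.

```python
def rolled_count(n_elements: int) -> int:
    """Rolled loop: one load and one add per element."""
    return 2 * n_elements


def unrolled_count(n_elements: int, factor: int) -> int:
    """Explicitly unrolled loop: `factor` loads and one add per iteration."""
    if factor <= 0 or n_elements % factor != 0:
        raise ValueError("n_elements must be a positive multiple of factor")
    iterations = n_elements // factor
    return iterations * (factor + 1)


def stripmined_count(n_elements: int, vector_length: int) -> int:
    """Stripmined loop: one vector load and one add per strip."""
    if vector_length <= 0:
        raise ValueError("vector_length must be positive")
    strips = (n_elements + vector_length - 1) // vector_length
    return 2 * strips


if __name__ == "__main__":
    n = 64  # assumption: 64 elements, matching the unroll factor and VL
    print("rolled:    ", rolled_count(n))
    print("unrolled:  ", unrolled_count(n, 64))
    print("stripmined:", stripmined_count(n, 64))
```

The stripmined loop gets the dynamic-count reduction of unrolling without replicating the load in the static code.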

18 Average Speedup Deconstructed Using CPI Equation
Processor | Cycles per Instruction | Dynamic Instruction Count | Cycle Count | Speedup over OOO Superscalar
OOO superscalar | CPI_ss | NI_ss | CPI_ss NI_ss | 1.00
OOO short vector | 2.50 CPI_ss | 0.24 NI_ss | 0.60 CPI_ss NI_ss | 1.67
simple long vector | 8.19 CPI_ss | 0.045 NI_ss | 0.37 CPI_ss NI_ss | 2.70
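The decomposition on this slide follows from cycle count = CPI x dynamic instruction count, so the speedup over the superscalar baseline is the reciprocal of the product of the two relative factors. A minimal sketch, with hypothetical function names and the factors copied from the slide:

```python
def relative_cycles(rel_cpi: float, rel_count: float) -> float:
    """Cycle count relative to the baseline: (CPI ratio) x (count ratio)."""
    return rel_cpi * rel_count


def speedup(rel_cpi: float, rel_count: float) -> float:
    """Speedup over the baseline is the inverse of the relative cycle count."""
    return 1.0 / relative_cycles(rel_cpi, rel_count)


# (relative CPI, relative dynamic instruction count), as given on the slide.
PROCESSORS = {
    "OOO superscalar": (1.00, 1.00),
    "OOO short vector": (2.50, 0.24),
    "simple long vector": (8.19, 0.045),
}

if __name__ == "__main__":
    for name, (rel_cpi, rel_ni) in PROCESSORS.items():
        print(f"{name}: cycles x{relative_cycles(rel_cpi, rel_ni):.2f}, "
              f"speedup {speedup(rel_cpi, rel_ni):.2f}")
```

Because the slide's factors are rounded, the computed speedups may differ from the tabulated ones in the last digit.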

19 Effectiveness at Using Parallelism?
usually low CPI means efficient use of hardware
no longer true because amount of work carried out by an instruction varies tremendously
instruction not an appropriate unit of work for determining effective use of hardware
use operation instead
  amount of work carried out by a functional unit

20 CPO and Dynamic Operation Count
[Figure, top panel: Cycles per Operation for OOO superscalar, OOO short vector, and simple long vector on chroma, colorspace, composite, convolve, decrypt.inter, decrypt.unroll, Arithmetic Average, and Geometric Mean.]
[Figure, bottom panel: Number of Operations (in millions), split into scalar operations and vector operations, for the same benchmarks.]

21 Effective OLP and ILP in Vector Processors
Type of Parallelism | OOO short | simple long | Compiler Assistance
operation-level | √ | √ |
scalar-scalar ILP | √ | | use vector-scalar ILP instead
vector-scalar ILP | √ | √ | use simple list scheduling to enable
vector-vector ILP | √ (column unclear in source) | | use loop unrolling or software pipelining to enable

22 Average Speedup Deconstructed Using CPO Equation
Processor | Cycles per Operation | Dynamic Operation Count | Cycle Count | Speedup over OOO Superscalar
OOO superscalar | CPO_ss | NO_ss | CPO_ss NO_ss | 1.00
OOO short vector | 0.67 CPO_ss | 0.88 NO_ss | 0.59 CPO_ss NO_ss | 1.70
simple long vector | 0.48 CPO_ss | 0.77 NO_ss | 0.37 CPO_ss NO_ss | 2.70
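Since the CPI and CPO decompositions describe the same cycle count, their products should agree up to rounding, and the ratio of the relative operation count to the relative instruction count indicates how many more operations each instruction carries than on the superscalar baseline. A sketch under those assumptions, with hypothetical names; the CPI factors are copied from the earlier CPI slide and the CPO factors from this one:

```python
# (relative cycles-per-unit, relative dynamic count) versus OOO superscalar.
CPI_FACTORS = {
    "OOO short vector": (2.50, 0.24),
    "simple long vector": (8.19, 0.045),
}
CPO_FACTORS = {
    "OOO short vector": (0.67, 0.88),
    "simple long vector": (0.48, 0.77),
}


def relative_cycles(rel_rate: float, rel_count: float) -> float:
    """Cycle count relative to baseline from either decomposition."""
    return rel_rate * rel_count


def ops_per_instruction_ratio(rel_ni: float, rel_no: float) -> float:
    """Operations per instruction, relative to the superscalar baseline."""
    return rel_no / rel_ni


if __name__ == "__main__":
    for name in CPI_FACTORS:
        rel_cpi, rel_ni = CPI_FACTORS[name]
        rel_cpo, rel_no = CPO_FACTORS[name]
        print(f"{name}: cycles via CPI x{relative_cycles(rel_cpi, rel_ni):.2f}, "
              f"via CPO x{relative_cycles(rel_cpo, rel_no):.2f}, "
              f"ops/instruction x{ops_per_instruction_ratio(rel_ni, rel_no):.1f}")
```

The high CPI of the long vector processor comes with far more operations per instruction, so its lower CPO is consistent with the point that CPI alone misjudges hardware use.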

23 Summary: Simple Long Vector Processor
configured like traditional vector architectures with two major enhancements:
  much wider vectors
  slightly wider instruction issue
benefits
  lower complexity and area cost than those for 4-way OOO superscalar implementation
  greater performance: 2.7x faster than OOO, 1.6x faster than OOO short vector
performance gains obtained with R10000-like 2-level caches
conservative area and performance estimates for long vector processor
