The T0 Vector Microprocessor. Talk Outline

Similar documents
Vector IRAM: A Microprocessor Architecture for Media Processing

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

The Nios II Family of Configurable Soft-core Processors

Chapter 13 Reduced Instruction Set Computers

W network supercomputer. This computer is targeted for a

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

Krste Asanovic. James Beck

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

Digital Semiconductor Alpha Microprocessor Product Brief

From T0 and CNS-1 to RISC-V and AI Hardware

Topics in computer architecture

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

Lecture 14: Multithreading

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

Design Objectives of the 0.35µm Alpha Microprocessor (A 500MHz Quad Issue RISC Microprocessor)

Jim Keller. Digital Equipment Corp. Hudson MA

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

Midterm #2 Solutions April 23, 1997

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

Asanovic/Devadas Spring Vector Computers. Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology

ARM Processors for Embedded Applications

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

PREPARED BY S.RAVINDRAKUMAR, SENIOR ASSISTANT PROFESSOR/ECE,CHETTINAD COLLEGE OF ENGINEERING AND TECHNOLOGY Page 1

ECE 3055: Final Exam

A Media-Enhanced Vector Architecture for Embedded Memory Systems

Advanced Computer Architecture

Portland State University ECE 588/688. Cray-1 and Cray T3E

CS 152 Computer Architecture and Engineering. Lecture 16: Vector Computers

EECS 322 Computer Architecture Superpipline and the Cache

PowerPC 740 and 750

Processor (IV) - advanced ILP. Hwansoo Han

Advanced Computer Architecture

SH4 RISC Microprocessor for Multimedia

PACE: Power-Aware Computing Engines

CS654 Advanced Computer Architecture. Lec 1 - Introduction

EECS 452 Lecture 9 TLP Thread-Level Parallelism

INTEL Architectures GOPALAKRISHNAN IYER FALL 2009 ELEC : Computer Architecture and Design

Memory Systems IRAM. Principle of IRAM

Universität Dortmund. ARM Architecture

CS 152, Spring 2011 Section 8

Advanced processor designs

ECE 4750 Computer Architecture, Fall 2014 T01 Single-Cycle Processors

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

The ARM10 Family of Advanced Microprocessor Cores

CPE300: Digital System Architecture and Design

MIPS R5000 Microprocessor. Technical Backgrounder. 32 kb I-cache and 32 kb D-cache, each 2-way set associative

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

Hardware and Software Architecture. Chapter 2

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

POWER3: Next Generation 64-bit PowerPC Processor Design

HW1 Solutions. Type Old Mix New Mix Cost CPI

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

Keywords and Review Questions

How to Write Fast Numerical Code

Portland State University ECE 588/688. Cray-1 and Cray T3E

General Purpose Signal Processors

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Introduction to Microprocessor

Advanced cache optimizations. ECE 154B Dmitri Strukov

Dynamic Scheduling. CSE471 Susan Eggers 1

Hardware/Compiler Codevelopment for an Embedded Media Processor

The Memory Hierarchy & Cache

Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University

Programmable Logic Design Grzegorz Budzyń Lecture. 15: Advanced hardware in FPGA structures

COSC 6385 Computer Architecture - Pipelining

Systems Architecture II

Enhancing Energy Efficiency of Processor-Based Embedded Systems thorough Post-Fabrication ISA Extension

The Processor: Instruction-Level Parallelism

Processors. Young W. Lim. May 12, 2016

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

Vertex Shader Design I

Pipelining and Vector Processing

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System?

Uniprocessor Computer Architecture Example: Cray T3E

A brief History of INTEL and Motorola Microprocessors Part 1

ECE 468 Computer Architecture and Organization Lecture 1

Pipelining, Instruction Level Parallelism and Memory in Processors. Advanced Topics ICOM 4215 Computer Architecture and Organization Fall 2010

Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy

NAME: Problem Points Score. 7 (bonus) 15. Total

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Structure of Computer Systems

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 8 Instruction-Level Parallelism Part 1

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

Intel released new technology call P6P

vector extensions to state-of-the-art superscalar processors short Sun VIS, HP MAX-2, SGI MDMX, Digital MVI, Intel Intel Katmai, Motorola AltiVec MMX,

Lecture 13 - VLIW Machines and Statically Scheduled ILP

UltraSparc-3 Aims at MP Servers

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

EE 354 Fall 2015 Lecture 1 Architecture and Introduction

Transcription:

Slides from presentation at the Hot Chips VII conference, 15 August 1995.. The T0 Vector Microprocessor Krste Asanovic James Beck Bertrand Irissou Brian E. D. Kingsbury Nelson Morgan John Wawrzynek University of California at Berkeley and the International Computer Science Institute {krste,johnw}@cs.berkeley.edu Primary support for this work was from the ONR, URI Grant N00014-92-J-1617, the NSF, grants MIP-8922354/MIP-9311980, and ARPA, contract number N0001493-C0249. Additional support was provided by ICSI. Talk Outline Why Vector Microprocessors? Torrent Instruction Set Architecture (ISA) T0 (Torrent-0) Microarchitecture T0 Implementation and Packaging Status Summary

Goal: High-Performance, Programmable, Scalable DSP Architecture. Many new applications require high performance DSP, for example, multimedia and human-machine interface. Algorithms are constantly changing, and not all applications warrant custom hardware development, so need programmable DSP engine. Software development is a major expense. Desire object code compatibility while scaling parallelism up for performance, or scaling parallelism down for cost. Solution: Vector Instruction Set Architecture Many compute-intensive DSP operations are vectorizable and vector architectures are the most efficient way to run vector code: Low control complexity. High throughput with multiple parallel and pipelined functional units. Sustain high off-chip memory bandwidth with vector memory operations. Add to conventional scalar instruction set to preserve software investment. Scale implementations between low cost and high performance, with same object code.

Torrent User Programming Model CPU General Purpose Registers 31 0 r31 r30 r1 r0 Program Counter 31 0 pc Multiply/Divide Registers 31 0 hi lo VU (COP2) 16 Vector Registers, each holding 32 x 32-bit elements. vr15[0] vr15[1] vr15[31] vr1[0] vr0[0] vr1[1] vr0[1] vr1[31] vr0[31] Vector Length Register vlr Cycle Counter 31 0 vcount Vector Flag Registers 31 0 vcond vovf vsat T0 Block Diagram Address Bus I-Cache Scalar Unit Scalar Bus VP0 Conditional Move Clip Shift Right Add Multiply Shift Left Logic 28 VMP 32 128 Vector Memory Pipeline Vector Registers VP1 Data Bus SIP 8 8 Scan Chains Logic Shift Left Add Shift Right Clip Conditional Move

T0 I-Cache and Scalar Unit Instruction Cache 1 KB, direct-mapped, 16 byte lines. Cache line prefetch if memory otherwise idle: 2 cycle miss penalty with prefetch, 3 cycle miss penalty without prefetch. Service misses in parallel with interlocks. System Coprocessor 0 Exception handling registers. Host communication registers. 32-bit counter/timer. MIPS-II 32-bit Integer RISC CPU One instruction/cycle in 6 stage pipeline. Single architected branch delay slot. Annulling branch likelies. Interlocked load delay slots. 3 cycle load latency (no data cache). 18 cycle 32-bit integer multiply. 33 cycle 32-bit integer divide. T0 Vector Memory Operations Unit-stride with address post-increment lbai.v vv1, t0, t1 # t1 holds post-increment. Eight 8-bit elements per cycle. Eight 16-bit elements per cycle. Four 32-bit elements per cycle. +1 cycle if first element not aligned to 16 byte boundary. Strided operations lwst.v vv3, t0, t1 # t1 holds byte stride. One 8-bit, 16-bit, or 32-bit element per cycle. Indexed operations (scatter/gather) shx.v vv1, t0, vv3 # vv3 holds byte offsets. One 8-bit, 16-bit, or 32-bit element per cycle. + 3 cycle startup for first index. Indexed stores need 1 extra cycle every 8 elements.

T0 Vector Arithmetic Pipelines Full set of 32-bit integer vector instructions: add, shift, logical. Vector fixed-point instructions perform a complete scaled, rounded, and clipped fixed-point arithmetic operation in one pass through pipeline. Multiplier in VP0 provides 16-bit x 16-bit -> 32-bit pipelined multiplies. Scale results by any shift amount. Provides 4 rounding modes including round-to-nearest-even. Clip results to 8-bit, 16-bit, or 32-bit range. VP0 and VP1 each produce up to 8 results per cycle. Vector arithmetic operations have 3 cycle latency. T0 Vector Conditional Operations Vectorize loops containing conditional statements. Executed in either arithmetic pipeline. Vector compare: # vv2[i] = (vv5[i] < vv6[i]) slt.vv vv2, vv5, vv6 Vector conditional move: # if (vv2[i] > 0) then vv1[i] = vv3[i] cmvgtz.vv vv1, vv2, vv3 Vector condition flag register: # vcond[i] = (vv1[i] < vv2[i]) flt.vv vv1, vv2 # Set flag bits. cfc2 r1, vcond # Read into scalar reg.

T0 Vector Editing Instructions Executed in vector memory unit. Scalar insert/extract to/from vector register element. Vector extract supports reduction operations: vext.v vv2, t1, vv1 vv1 0 1 2 3 4 5 6 7 8 9 101112131415 vv2 0 1 2 3 4 5 6 7 8 9 101112131415 Avoids multiple memory accesses. Separates data movement from arithmetic operations. Software can schedule component instructions within reduction. T0 Pipeline Structure CPU X M N W VMP R R M W R M W M F D VP0 R R X1 X1 X2 R X1 X2 W X1 X2 X2 W VP1 R R X1 X1 X2 R X1 X2 W X1 X2 X2 W

Code Example lhai.v vv1, t0, t1 # Vector load. hmul.vv vv4, vv2, vv3 # Vector mul. sadd.vv vv7, vv5, vv7 # Vector add. addu t2, -1 # Scalar add. lhai.v vv2, t0, t1 # Vector load. hmul.vv vv5, vv1, vv3 # Vector mul. sadd.vv vv8, vv4, vv8 # Vector add. addu t7, t4 # Scalar add. (taken from matrix-vector multiply routine) lhai.v hmul.vv sadd.vv addu lhai.v hmul.vv sadd.vv addu Execution of Code Example CPU VMP VP0 VP1 time Operations Instruction issue

T0 Vector Unit Hazards All vector unit hazards fully interlocked in hardware. Vector instruction startup fully pipelined to eliminate stripmining overhead. Each functional unit has independent ports into vector register file so no chain slot time and no vector register access conflicts. All RAW, WAR, and WAW hazards on vector registers fully chained to reduce latency and decrease vector register pressure. Philosophy: Trade small amount of extra control logic for increased utilization of multiple, expensive datapaths. T0 External Interfaces External Memory Interface Supports up to 4 GB of SRAM with 720 MB/s bandwidth. SRAM access wave-pipelined over 1.5 cycles. Industry standard 17ns asynchronous SRAM for 45 MHz. Serial Interface Port Based on JTAG, but with 8 bit datapaths. Provides chip testing and processor single-step. Supports 30 MB/s host-t0 I/O bandwidth. Hardware Performance Monitoring Eight pins give cycle by cycle CPU and VU status. Fast External Interrupts Two prioritized fast interrupt pins with dedicated interrupt vectors.

T0 Die Statistics Technology: 1.0 µm MOSIS scalable CMOS rules, 2 metal, 1 poly. Contacted M1 pitch 3.25 µm Contacted M2 pitch 3.75 µm Fabbed in HP s CMOS 26G process. Die Size: 16.75 mm x 16.75 mm Transistor Count: 730,701 Clock Frequency: 45 MHz Power Supply: 5 V Power Dissipation: <12 W T0 Status First silicon received 3 April 1995. No bugs. SPERT-II systems running applications. Measured performance on neural net training: 20x tuned code on Sun Sparcstation-20/61, 5x tuned code on IBM RS/6000-590.

Summary With a fully programmable, scalable, vector-register instruction set architecture, T0 sustains up to 720 million arithmetic ops/s while accessing up to 360 million operands/s from main memory, and with up to 30 MB/s of concurrent I/O,...with roughly 3/4 million transistors (1 G 2 ),...at 45 MHz in a 1 µm 2-Al CMOS process,...in less than 10 person years from no ISA to running applications.