SH4 RISC Microprocessor for Multimedia

Similar documents
ECE 154A Introduction to. Fall 2012

General Purpose Signal Processors

SH-5: A First 64-bit SuperH Core with Multimedia Extension. Fumio Arakawa Hitachi, Ltd.

The ARM10 Family of Advanced Microprocessor Cores

EECS 322 Computer Architecture Superpipline and the Cache

Divide: Paper & Pencil

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

Module 2. Embedded Processors and Memory. Version 2 EE IIT, Kharagpur 1

The Nios II Family of Configurable Soft-core Processors

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

SH-X3 Flexible SuperH Multi-core for High-performance and Low-power Embedded Systems

EC 513 Computer Architecture

CS425 Computer Systems Architecture

Jim Keller. Digital Equipment Corp. Hudson MA

PowerPC TM 970: First in a new family of 64-bit high performance PowerPC processors

This course provides an overview of the SH-2 32-bit RISC CPU core used in the popular SH-2 series microcontrollers

Lecture 7 Pipelining. Peng Liu.

COMP 4211 Seminar Presentation

COSC4201. Prof. Mokhtar Aboelaze York University

CS152 Computer Architecture and Engineering. Complex Pipelines

VLSI Signal Processing

PowerPC 740 and 750

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

INTEL MILLION, TRANSISTOR 64-BIT MICROPROCESSOR. Design Charter. Target New Market Segments with. Maximize Performance. Supercomputer Concepts

Course Introduction. Purpose: Objectives: Content: Learning Time:

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy

SAE5C Computer Organization and Architecture. Unit : I - V

Number Systems and Computer Arithmetic

The T0 Vector Microprocessor. Talk Outline

ARM Processors for Embedded Applications

Chapter 4. The Processor

CO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19

SH-Mobile3: Application Processor for 3G Cellular Phones on a Low-Power SoC Design Platform

Advanced issues in pipelining

Vertex Shader Design I

Advanced processor designs

Computer Organization MIPS Architecture. Department of Computer Science Missouri University of Science & Technology

Lecture 9: Multiple Issue (Superscalar and VLIW)

E0-243: Computer Architecture

Complications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?

Instruction Set extensions to X86. Floating Point SIMD instructions

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA

EITF20: Computer Architecture Part2.2.1: Pipeline-1

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Exploration and Evaluation of PLX Floating-point Instructions and Implementations for 3D Graphics

CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. Complications With Long Instructions

Floating Point Arithmetic

A Superscalar RISC Processor with 160 FPRs for Large Scale Scientific Processing

Chapter 4. The Processor

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

Design Objectives of the 0.35µm Alpha Microprocessor (A 500MHz Quad Issue RISC Microprocessor)

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 8 Instruction-Level Parallelism Part 1

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

Complex Pipelining. Motivation

CPE300: Digital System Architecture and Design

ISSCC 2003 / SESSION 14 / MICROPROCESSORS / PAPER 14.5

Chapter 2. Instructions: Language of the Computer. HW#1: 1.3 all, 1.4 all, 1.6.1, , , , , and Due date: one week.

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Architecture of a Graphics Floating-Point Processing Unit with Multi-Word Load and Selective Result Store Mechanisms

Execution/Effective address

Pipelining and Vector Processing

Case Study IBM PowerPC 620

HW1 Solutions. Type Old Mix New Mix Cost CPI

Instruction Set Architecture. "Speaking with the computer"

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Computer organization by G. Naveen kumar, Asst Prof, C.S.E Department 1

Processors. Young W. Lim. May 9, 2016

AT Arithmetic. Integer addition

COMPUTER ORGANIZATION AND. Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Single-Cycle Examples, Multi-Cycle Introduction

Pipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160 Spring Jon Kuhl 1

AMULET3i an Asynchronous System-on-Chip

Limitations of Scalar Pipelines

Computer Architecture

LOW-COST SIMD. Considerations For Selecting a DSP Processor Why Buy The ADSP-21161?

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

A Quadruple Precision and Dual Double Precision Floating-Point Multiplier

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines

COMPUTER ORGANIZATION AND DESI

Chapter 13 Reduced Instruction Set Computers

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

REAL TIME DIGITAL SIGNAL PROCESSING

Computer Systems Architecture Spring 2016

Advanced Computer Architecture

PROJECT 4 Architecture Review of the Nintendo GameCube TM CPU. Ross P. Davis CDA5155 Sec /06/2003

Vector Floating-point Coprocessor VFP11. Technical Reference Manual. for ARM1136JF-S processor r1p5

Introducing the Superscalar Version 5 ColdFire Core

A 1-GHz Configurable Processor Core MeP-h1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Chapter 2. Instruction Set Design. Computer Architectures. software. hardware. Which is easier to change/design??? Tien-Fu Chen

Chapter 3. Arithmetic Text: P&H rev

GAISLER. IEEE-STD-754 Floating Point Unit GRFPU Lite / GRFPU-FT Lite CompanionCore Data Sheet

GRE Architecture Session

Topics in computer architecture

Chapter 4 The Processor 1. Chapter 4A. The Processor

Introduction to Pipelined Datapath

Transcription:

SH4 RISC Microprocessor for Multimedia Fumio Arakawa, Osamu Nishii, Kunio Uchiyama, Norio Nakagawa Hitachi, Ltd. 1 Outline 1. SH4 Overview 2. New Floating-point Architecture 3. Length-4 Vector Instructions 4. 3DCG Performance 5. Double Precision Support 6. Conclusions 3DCG: Three Dimension Computer Graphics 2

SH4 Overview - Hitachi's SuperH Series Family - For Consumer Multimedia Systems Home Video Game, Handheld PC - Excellent Performance with Consumer Price 300 VAX MIPS - Excellent 3DCG-Performance with Consumer Price 5.0 M Polygons/sec* - IEEE 754 Standard Floating-point Architecture Double-precision with Hardware Emulations * measured with an original simple geometry benchmark 3 SH4 Specifications Technology Voltage Frequency Performance Cache TLB Interfaces Peripherals 0.25 µm CMOS, 5 Layer Metals 1.8 V (I/O: 3.3 V) 167 MHz (internal) / 83,55, etc. MHz (I/O) 300 MIPS (Dhrystone), 1.17 GFLOPS (peak) 8/16 KB (Inst./Data) Direct-mapped 4/64-entry (Inst./Unified) Fully-associative SRAM, DRAM, SDRAM, burst ROM, PCMCIA DMAC, SCI, RTC, Timer 4

Pipeline Stages - Simple Five-stage Pipelines - Two-way Superscalar Integer Inst. Dec. Reg. Read Exec. --- Write Back Floating Point Inst. Fetch Inst. Dec. Reg. Read 1st Exec. 2nd Exec. 3rd Exec. Write Back Load / Store Inst. Dec. Reg. Read Addr. Gen. Memory Access Write Back Branch Inst. Dec. Addr. Gen. Target Inst. Fetch 5 Superscalar Issue Combinations Integer (INT) Floating Point (FP) Load / Store (LS) Branch (BR) Both INT & LS (BO) Not Superscalar (NS) INT FP LS BR BO NS X O O O O X O X O O O X O O X O O X O O O X O X O O O O O X X X X X X X INT: Add, Subtract, Shift, etc. FP: Floating-point Add, Subtract, Multiply, Divide, etc. LS: Load/Store/Transfer from/to Integer/Floating-point Register, etc. BR: Branch Always/Conditionally, etc. BO: Move between Integer Registers, Integer Compare, etc. NS: Load to Control Register, etc. 6

Floating-point Arch. Enhancement - Two Sets of 16 Single Precision Registers - The extra set fits 4 by 4 matrix storage - Length-4 Vector Instructions - Inner Product - Transform Vector - Register Pair Load/Store/Transfer Instructions - Enough bandwidth for vector operations - Double Precision Format Mode 7 Floating-point Instructions - Common - FADD (add) - FSUB (subtract) - FMUL (multiply) - FDIV (divide) - FSQRT (Square Root) - FCMP (Compare) - FNEG (Negate) - FABS (Absolute Value) - FLOAT (Convert Integer to float) - FTRC (Convert float to Integer) - FMOV (Move from/to Register) - Single Precision Mode Only - FMAC (multiply-accumulate) - FIPR (Inner Product) - FTRV (Transform Vector) - Double Precision Mode Only - FCNVDS (Convert Double to Single) - FCNVSD (Convert Single to Double) 8

Inner Product Instruction - 16 Registers = 4 (Length-4) Vector Registers fv0 = (fr0,fr1,fr2,fr3 ) fv4 = (fr4,fr5,fr6,fr7 ) fv8 = (fr8,fr9,fr10,fr11) fv12 = (fr12,fr13,fr14,fr15) - Operation frn' = (fvm,fvn) - 1 cycle Pitch - 4 cycle Latency - Vector Normalization, Intensity Calculation, Surface Judgment m,n: 0,4,8,12 n'= n+3 9 Inner Product Hardware (Mantissa) - Calculating fr7 = (fv0,fv4) Floating-point Register File fr0 fr4 fr1 fr5 fr2 fr6 fr3 fr7 8 read 1 write Multiplier Multiplier Multiplier Multiplier fr0 * fr4 fr1 * fr5 fr2 * fr6 fr3 * fr7 Shifter Shifter Shifter Shifter 4-input Adder fr0 * fr4 + fr1 * fr5 + fr2 * fr6 + fr3 * fr7 4-to-2 Compressor & 2-input Adder Normalizer Rounder 10

Inner Product Hardware (Exponent) - Calculating fr7 = (fv0,fv4) Floating-point Register File fr0 fr4 fr1 fr5 fr2 fr6 fr3 fr7 Adder fr0 + fr4 Adder fr1 + fr5 Adder fr2 + fr6 Adder fr3 + fr7 Maximum Exponent Selector Subtracter Subtracter Subtracter Subtracter Normalizer Rounder Shift Count Shift Count Shift Count Shift Count 11 Inner Product Accuracy - Inner Product Inst. is an Approximate Inst. - No Accurate Intermediate Value (the width is too wide to implement) - More Accurate than the Worst Order Multiply and Add Inst. Combinations - Maximum Error: (Maximum Product x 2-25 ) + (Result x 2-23 ) - If source operands are rounded values, this is enough accuracy. 26 26 - For example: 4 Products are 2, -2, 1, 0. Accurate Result = 1. Inner Product Inst. Result = 0. 12

Inner Product v.s. SIMD Multiply-Add Peak Performance Latency Register Port (Read/Write) Normalizer & Rounder Floating-point Hardware Inner Product 1 4 8/1 1 1 4 Multiply-Add 1) 1 12 12/4 4 2 2) 3) 4) 1) Inner Product Inst. and 4 Multiply-Add achieve the same peak performance, which is one inner product per cycle. 2) SH4 takes 3 cycles for Multiply-Add. 4 x 3 = 12. 3) Eight more cycles must be filled with independent insts. for the peak performance with SIMD architecture. 4) Twice more hardware is necessary for SIMD architecture. 13 Elastic Pipeline - Pipeline stages become 4 cycles for vector Inst. - In-Order Issue, Out-of-Order Completion FX F1 F2 AG MA WB F3 WB Floating-point Pipeline (Vector) Load/Store Pipeline - Only Floating-point Non-Vector Arithmetic Inst. right after Vector Inst. is interlocked. FX F1 F2 F1 F3 WB : Instruction Decode, : Register Read, FX,F1,F2,F3: Floating-point Execution, AG: Address Generation, MA: Memory Access, WB: Register Write Back F2 Interlock with Resource Conflict Floating-point Pipeline (Vector) F3 Floating-point Pipeline WB (Non-Vector) 14

Transform Vector Instruction - Extra 16 Registers = 4 by 4 Matrix matrix = - Operation fvn = matrix fvn - 4 cycle Pitch, 7 cycle Latency - Coordinate Transformation, Coordinate Transformation Matrix Generation - No Work Registers xf0 xf4 xf8 xf12 xf1 xf5 xf9 xf13 xf2 xf6 xf10 xf14 xf3 xf7 xf11 xf15 n: 0,4,8,12 xf: extra floating-point register 15 Why Transform Vector Instruction? - Transform Vector Operation = 4 Inner Product Insts.? NO!! - Modification for Transform Vector: frn' = (xvm,fvn) m: 0,1,2,3, n: 0,4 n'= n+m+8 "fv8 = matrix fv0" is divided into 4 Inner Products: - 4 More Work Registers - Complicated and More Operands - No Generality (Just for Transform Vector) - Transform Vector Inst. is Better. xv0 = (xf0, xf4, xf8,xf12) xv1 = (xf1, xf5, xf9,xf13) xv2 = (xf2, xf6, xf10,xf14) xv3 = (xf3, xf7, xf11,xf15) fr8 = (xv0,fv0) fr9 = (xv1,fv0) fr10 = (xv2,fv0) fr11 = (xv3,fv0) 16

Transform Vector Implementation fv0 = Matrix fv0 FX F1 F2 Reg. Read F3 WB FX F1 F2 fr0 = (xv0,fv0) F3 WB FX F1 F2 F3 WB FX F1 F2 Reg. Write fr1 = (xv1,fv0) fr2 = (xv2,fv0) F3 WB fr3 = (xv3,fv0) - All reg. reads complete before first reg. write. No work regs. are necessary. : Instruction Decode, : Register Read, WB: Register Write Back FX,F1,F2,F3: Floating-point Execution 17 Pair Load/Store/Transfer Mode - Normal Mode: - 4-bit Reg. Field represents 16 Regs. of one set. - Set specifier must be changed for another set access. - Pair Mode: - 4-bit Reg. Field represents 16 Pair Regs. of all sets. - All Regs. can be accessed. - Transform Vector Throughput: 1 vector / 4 cycles - Load/Store Throughput: 2 vectors (4 pairs) / 4 cycles - Enough for Storing Previous Result Vector and Loading Next Vector during Transform Vector 18

Simple 3DCG Geometry Benchmark y screen (z=1) x (2) normal vector (1) (3) z ray vector triangle polygon (1) Coordinate Transformation (2) Perspective Transformation (3) Intensity Calculation 19 3DCG Geometry Performance M polygons/sec 6 5.0 M 4 2 1.2 M 0 without New Arch. with New Arch. 20

Double Precision Support - Floating-point Libraries for WindowsCE - Double Precision - ANSI/IEEE 754 Standard - Emulation with Single Precision Hardware - Best cost-performance way - Peak performance is 27.8 MFLOPS. - Software emulation is 20 times slower. - Double precision hardware is 6 times faster but 2.5 times more. - New Double Precision Mode - Single's code becomes double's code. 21 Double Precision Hardware - Mantissa Part - Add/Subtract/Multiply/Convert: Add Feedback Path Multiplier (32b) Multiplicand (32b) Addend (32b/64b) Addend Aligner Multiply- Add Block Single : 24b x 24b + 73b Integer: 32b x 32b + 64b Aligned Addend (73b) Feedback Path for Double Precision Emulation (65b) Normalizer Rounder Pre-normalizing Result (73b) Result (32b) - Divide/Square Root: Extend from 24 to 53 bits - Exponent Part: Extend from 8 to 11 bits 22

Double Precision Multiply - Four multiply-adds generate product - Sticky bit is generated from partial products - required width: 106b --> 55b Mantissa (53b) Higher (21b) Lower (32b) Lower x Lower (64b) + = + = + = Lower x Higher (53b) Higher x Lower (53b) Higher x Higher (42b) Product (53b + guard/round) sticky bit 23 Conclusions - Excellent Performance with Consumer Price - 300 VAX MIPS - Excellent 3DCG-Performance with Consumer Price - New Inner Product & Vector Transformation Insts. - 5.0 M Polygons/sec - 1.17 GFLOPS (peak with the new insts.) - IEEE 754 Standard Floating-point Architecture - Double-precision with Hardware Emulations - 27.8 MFLOPS (peak) 24