vector extensions to state-of-the-art superscalar processors short Sun VIS, HP MAX-2, SGI MDMX, Digital MVI, Intel Intel Katmai, Motorola AltiVec MMX,

Size: px
Start display at page:

Download "vector extensions to state-of-the-art superscalar processors short Sun VIS, HP MAX-2, SGI MDMX, Digital MVI, Intel Intel Katmai, Motorola AltiVec MMX,"

Transcription

1 Vector Microprocessors Simple MultiMedia Applications for G. Lee and Mark G. Stoodley Corinna of Toronto University Paper:

2 vector extensions to state-of-the-art superscalar processors short Sun VIS, HP MAX-2, SGI MDMX, Digital MVI, Intel Intel Katmai, Motorola AltiVec MMX, to implement cult over past 2 years, shippings have been delayed repeat- to meet target speeds edly late shippings attributed to complex out-of-order de- signs ref: Linley Gwennap, MPR articles, Feb,Oct,Dec 1997 Current Trends multimedia applications are growing in importance current hardware trend to support multimedia is to add BUT control logic for complex superscalar processor is dif- 2

3 2-way, in-order vector length of 64 vector width of 8 (i.e., has 8 lanes) for this study, focus on multimedia applications important emerging applications area others have demonstrated eectiveness of vector ar- on oating-point applications and SPECint chitectures programs Alternative Hardware Solution use simple vector processor design: 3

4 Congurations Processor Area Estimates Die Results Performance and Conclusions Summary Outline Motivation 4

5 Short Vector Long Vector Superscalar 64b MIPS 64b MIPS with vector extensions ISA Superscalar and Vector Processors Processors OOO OOO Simple Feature order out of order in order issue width 4 instructions 2 instructions issue width 4 instructions 2 instructions fetch buer size 56 instructions re-order registers 64 int 64 int 32 int #physical FP 64 FP 32 FP 64 8-element Vreg element Vreg 32 2 IUs 2 IUs 2 IUs datapath LSU 1 LSU 1 LSU 1 VU with 8 IUs 1 VU with 8 IUs 1 memory system data bus, 64-bit address bus 64-bit 2-level cache memory R10000-based compiler SGI V5.3 SGI V5.3 O2 and C VSUIF V1.1.0 O2 5

6 dierences: main vector length of 64, not 32 T0-Based Vector Processor 1 VU, not 2 VUs 64-bit data bus, not 128 bits 64-bit datapath, not 32 bits more powerful scalar core R10000-like latencies for operations fully pipelined multiply for narrow data types ( 16 bits) 6

7 Areas for Two Implementations Component the Long Vector Processor of in mm 2 Area to 0.25m Scaled Area Ecient High Performance Processor Component Implementation Implementation Vector Datapath 64b integer units unit load/store Vector Register File 64b 64-element vector registers MIPS R b integer and FP datapath scalar integer and FP register le scalar issue instruction and Overhead Clocking Total

8 =21mm :25mm 2 Scalar Integer Unit 64b area-ecient implementation: 3mm 2 area of 32b scalar datapath 2 4 area for 16x16 multiplier array + conservative estimate: area for IU is < 1mm 2 very high-performance implementation: 4.5mm 2 based on OOO integer unit in

9 data bus and 8 64b register buses ory area estimate: only one address bus scalar and vector memory instructions use same memory Load/Store Unit need 64x512 cross-bar to transfer data between 64b mem- for 128x256 crossbar area 2 area for shifting/aligning 32b data + address processing handled by scalar portion of processor interface 9

10 drivers, etc. multiplexors, area-ecient implementation: time-multiplexes word and lines bit high-performance implementation: Vector Register File based on layout details given in Asanovic's Ph.D. thesis includes area for overhead circuitry read sense ampliers, data latches for writes and reads, uses extra buses and ports to avoid time-multiplexing 10

11 Comparison Area Existing Superscalar Processors with area dierences to be due to parallel-specic want features due to dierences in line size dierences areas are for processor components only; areas for cache TLB structures, external interface logic, and the pad and excluded ring die areas scaled to a 0.25m process to eliminate areal areas based on actual VLSI implementations 11

12 Breakdown of Processor Die Areas IO 2-way superscalar MIPS R Datapath Registers Instruction Other Issue OOO 4-way superscalar MIPS R Alpha OOO 4-way short vector HP PA simple long vector area efficient 52 high performance Processor Die Area (in mm 2 scaled to 0.25µ) 12

13 threshold. 8 bit 24-bit Converts an image in RGB colorspace YUV values. to 8 bit color Blends two images together composite a blend factor. by 8,16 bit image(s) Convolves an image with a convolve 16-bit kernel. 3x3 16 bit 16,000-byte Unrolled version of IDEA decrypt.unroll decryption. 16 bit message Loop-interchanged version decrypt.inter Highly Vectorizable Benchmark Programs Data Width Input Description Benchmark 8 bit 320x240 Merges two images on chroma basis of a \whiteness" the of IDEA decryption. 13

14 VIVACE Compiler/Simulation Infrastructure [missing] 14

15 Speedup over OOO superscalar OOO superscalar OOO short vector simple long vector Processor Performance 0 chroma colorspace composite convolve decrypt.inter decrypt.unroll Arithmetic Geometric Average Mean 15

16 Cycles per Instruction OOO superscalar OOO short vector simple long vector chroma colorspace composite convolve decrypt.inter decrypt.unroll Arithmetic Geometric Average Mean 4.25 CPI and Dynamic Instruction Count (in millions) Number of Instructions scalar instructions vl8 vector instructions vl64 vector instructions chroma colorspace composite convolve decrypt.inter decrypt.unroll Arithmetic Geometric Average Mean 16

17 Count load r3,0(r2) 128 rolled. r2,r2,256 add with VL=64 vload v3,(r2) 2 stripmined r2,r2,256 add Stripmining = Implicitly Loop Unrolling Dynamic Version Static Instructions Instruction Loop r2,r2,4 add unrolled 64 times load r3,0(r2) 65 explicitly load r3,4(r2) 17

18 Speedup Average Using CPI Equation Deconstructed superscalar CPIss NIss CPIssNIss 1.00 OOO short vector 2.50CPIss 0.24NIss 0.60CPIssNIss 1.67 OOO Cycles per Speedup over Cycle Count Processor Dynamic Instruction Count Instruction Superscalar simple long vector 8.19CPIss 0.045NIss 0.37CPIssNIss

19 varies tremendously instruction instruction not an appropriate unit of work for determining use of hardware eective use operation instead Eectiveness at Using Parallelism? usually low CPI means ecient use of hardware no longer true because amount of work carried out by an amount of work carried out by a functional unit 19

20 Cycles per Operation OOO superscalar CPO and Dynamic Operation Count OOO short vector simple long vector chroma colorspace composite convolve decrypt.inter decrypt.unroll Arithmetic Geometric Average Mean (in millions) Number of Operations scalar operations vector operations vector operations chroma colorspace composite convolve decrypt.inter decrypt.unroll Arithmetic Geometric Average Mean 20

21 use loop unrolling p software pipelining or Eective OLP and ILP in Vector Processors Type of Vector Processors Parallelism OOO short simple long Compiler Assistance operation-level p p simple list use to enable scheduling scalarscalar ILP p vectorscalar ILP instead vectorscalar ILP p p vectorvector ILP to enable 21

22 Speedup Average Using CPO Equation Deconstructed Speedup over superscalar CPOss NOss CPOssNOss 1.00 OOO short vector 0.67CPOss 0.88NOss 0.59CPOssNOss 1.70 OOO Cycles per Processor Cycle Count Dynamic Operation Operation Count Superscalar simple long vector 0.48CPOss 0.77NOss 0.37CPOssNOss

23 superscalar implementation OOO greater performance: 2.7x faster than OOO, 1.6x faster OOO short vector than congured like traditional vector architectures with two enhancements: major much wider vectors slightly wider instruction issue performance gains obtained with R10000-like 2-level Summary: Simple Long Vector Processor benets lower complexity and area cost than those for 4-way caches conservative area and performance estimates for long vector processor 23

Vector IRAM: A Microprocessor Architecture for Media Processing

Vector IRAM: A Microprocessor Architecture for Media Processing IRAM: A Microprocessor Architecture for Media Processing Christoforos E. Kozyrakis kozyraki@cs.berkeley.edu CS252 Graduate Computer Architecture February 10, 2000 Outline Motivation for IRAM technology

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis Stanford University David Patterson U.C. Berkeley http://csl.stanford.edu/~christos Motivation Ideal processor

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

Hardware/Compiler Codevelopment for an Embedded Media Processor

Hardware/Compiler Codevelopment for an Embedded Media Processor Hardware/Compiler Codevelopment for an Embedded Media Processor CHRISTOFOROS KOZYRAKIS, STUDENT MEMBER, IEEE, DAVID JUDD, JOSEPH GEBIS, STUDENT MEMBER, IEEE, SAMUEL WILLIAMS, DAVID PATTERSON, FELLOW, IEEE,

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

More information

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction

More information

EE382 Processor Design. Concurrent Processors

EE382 Processor Design. Concurrent Processors EE382 Processor Design Winter 1998-99 Chapter 7 and Green Book Lectures Concurrent Processors, including SIMD and Vector Processors Slide 1 Concurrent Processors Vector processors SIMD and small clustered

More information

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville

Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

EECS 322 Computer Architecture Superpipline and the Cache

EECS 322 Computer Architecture Superpipline and the Cache EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:

More information

Multiple Issue ILP Processors. Summary of discussions

Multiple Issue ILP Processors. Summary of discussions Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware

More information

GRE Architecture Session

GRE Architecture Session GRE Architecture Session Session 2: Saturday 23, 1995 Young H. Cho e-mail: youngc@cs.berkeley.edu www: http://http.cs.berkeley/~youngc Y. H. Cho Page 1 Review n Homework n Basic Gate Arithmetics n Bubble

More information

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor David Johnson Systems Technology Division Hewlett-Packard Company Presentation Overview PA-8500 Overview uction Fetch Capabilities

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

CSE 141 Summer 2016 Homework 2

CSE 141 Summer 2016 Homework 2 CSE 141 Summer 2016 Homework 2 PID: Name: 1. A matrix multiplication program can spend 10% of its execution time in reading inputs from a disk, 10% of its execution time in parsing and creating arrays

More information

Lec 25: Parallel Processors. Announcements

Lec 25: Parallel Processors. Announcements Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza

More information

Advanced Instruction-Level Parallelism

Advanced Instruction-Level Parallelism Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu

More information

Designing for Performance. Patrick Happ Raul Feitosa

Designing for Performance. Patrick Happ Raul Feitosa Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance

More information

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533 Lecture 2: Computer Performance Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533 Performance and Cost Purchasing perspective given a collection of machines, which has the - best performance?

More information

The T0 Vector Microprocessor. Talk Outline

The T0 Vector Microprocessor. Talk Outline Slides from presentation at the Hot Chips VII conference, 15 August 1995.. The T0 Vector Microprocessor Krste Asanovic James Beck Bertrand Irissou Brian E. D. Kingsbury Nelson Morgan John Wawrzynek University

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?) Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Chapter 4 The Processor 1. Chapter 4D. The Processor

Chapter 4 The Processor 1. Chapter 4D. The Processor Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola 1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

ECE232: Hardware Organization and Design

ECE232: Hardware Organization and Design ECE232: Hardware Organization and Design Lecture 2: Hardware/Software Interface Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Basic computer components How does a microprocessor

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Math 230 Assembly Programming (AKA Computer Organization) Spring MIPS Intro

Math 230 Assembly Programming (AKA Computer Organization) Spring MIPS Intro Math 230 Assembly Programming (AKA Computer Organization) Spring 2008 MIPS Intro Adapted from slides developed for: Mary J. Irwin PSU CSE331 Dave Patterson s UCB CS152 M230 L09.1 Smith Spring 2008 MIPS

More information

Case Study IBM PowerPC 620

Case Study IBM PowerPC 620 Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Number Representation 09212011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Logic Circuits for Register Transfer

More information

LECTURE 10. Pipelining: Advanced ILP

LECTURE 10. Pipelining: Advanced ILP LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction

More information

Lecture 9: Multiple Issue (Superscalar and VLIW)

Lecture 9: Multiple Issue (Superscalar and VLIW) Lecture 9: Multiple Issue (Superscalar and VLIW) Iakovos Mavroidis Computer Science Department University of Crete Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order

More information

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds? Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. 2. Define throughput. 3. Describe why using the clock rate of a processor is a bad way to measure performance. Provide

More information

Chapter 4 Data-Level Parallelism

Chapter 4 Data-Level Parallelism CS359: Computer Architecture Chapter 4 Data-Level Parallelism Yanyan Shen Department of Computer Science and Engineering Shanghai Jiao Tong University 1 Outline 4.1 Introduction 4.2 Vector Architecture

More information

Chapter 4 The Processor (Part 4)

Chapter 4 The Processor (Part 4) Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline

More information

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002, Chapter 1 Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Course Goals Introduce you to design principles, analysis techniques and design options in computer architecture

More information

The Von Neumann Computer Model

The Von Neumann Computer Model The Von Neumann Computer Model Partitioning of the computing engine into components: Central Processing Unit (CPU): Control Unit (instruction decode, sequencing of operations), Datapath (registers, arithmetic

More information

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]

More information

Metodologie di Progettazione Hardware-Software

Metodologie di Progettazione Hardware-Software Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction CS252 Graduate Computer Architecture Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue March 23, 01 Prof. David A. Patterson Computer Science 252 Spring 01 Review Tomasulo Reservations

More information

CISC 360. Computer Architecture. Seth Morecraft Course Web Site:

CISC 360. Computer Architecture. Seth Morecraft Course Web Site: CISC 360 Computer Architecture Seth Morecraft (morecraf@udel.edu) Course Web Site: http://www.eecis.udel.edu/~morecraf/cisc360 Overview Intro to Computer Architecture About the Course Organization

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

Exploiting Idle Floating-Point Resources for Integer Execution

Exploiting Idle Floating-Point Resources for Integer Execution Exploiting Idle Floating-Point Resources for Integer Execution, Subbarao Palacharla, James E. Smith University of Wisconsin, Madison Motivation Partitioned integer and floating-point resources on current

More information

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen

Uniprocessors. HPC Fall 2012 Prof. Robert van Engelen Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures

More information

VLIW/EPIC: Statically Scheduled ILP

VLIW/EPIC: Statically Scheduled ILP 6.823, L21-1 VLIW/EPIC: Statically Scheduled ILP Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind

More information

Lecture 4: Instruction Set Architectures. Review: latency vs. throughput

Lecture 4: Instruction Set Architectures. Review: latency vs. throughput Lecture 4: Instruction Set Architectures Last Time Performance analysis Amdahl s Law Performance equation Computer benchmarks Today Review of Amdahl s Law and Performance Equations Introduction to ISAs

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Performance Characteristics. i960 CA SuperScalar Microprocessor

Performance Characteristics. i960 CA SuperScalar Microprocessor Performance Characteristics of the i960 CA SuperScalar Microprocessor s. McGeady Intel Corporation Embedded Microprocessor Focus Group i960 CA - History A highly-integrated microprocessor for embedded

More information

EE382 Processor Design. Class Objectives

EE382 Processor Design. Class Objectives EE382 Processor Design Stanford University Winter Quarter 1998-1999 Instructor: Michael Flynn Teaching Assistant: Steve Chou Administrative Assistant: Susan Gere Lecture 1 - Introduction Slide 1 Class

More information

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations

More information

Modern Computer Architecture (Processor Design) Prof. Dan Connors

Modern Computer Architecture (Processor Design) Prof. Dan Connors Modern Computer Architecture (Processor Design) Prof. Dan Connors dconnors@colostate.edu Computer Architecture Historic definition Computer Architecture = Instruction Set Architecture + Computer Organization

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

CS 61C: Great Ideas in Computer Architecture. Multiple Instruction Issue, Virtual Memory Introduction

CS 61C: Great Ideas in Computer Architecture. Multiple Instruction Issue, Virtual Memory Introduction CS 61C: Great Ideas in Computer Architecture Multiple Instruction Issue, Virtual Memory Introduction Instructor: Justin Hsia 7/26/2012 Summer 2012 Lecture #23 1 Parallel Requests Assigned to computer e.g.

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

MOM: a Matrix SIMD Instruction Set Architecture for Multimedia. Applications. Jesus Corbal Roger Espasa Mateo Valero

MOM: a Matrix SIMD Instruction Set Architecture for Multimedia. Applications. Jesus Corbal Roger Espasa Mateo Valero MOM: a Matrix SIMD Instruction Set Architecture for Multimedia Applications Jesus Corbal Roger Espasa Mateo Valero Departament d'arquitectura de Computadors, Universitat Politecnica de Catalunya{Barcelona,

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Advanced Computer Architecture

Advanced Computer Architecture ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle

More information

INSTRUCTION LEVEL PARALLELISM

INSTRUCTION LEVEL PARALLELISM INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,

More information

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 Name: Solutions (please print) 1-3. 11 points 4. 7 points 5. 7 points 6. 20 points 7. 30 points 8. 25 points Total (105 pts):

More information

Performance of computer systems

Performance of computer systems Performance of computer systems Many different factors among which: Technology Raw speed of the circuits (clock, switching time) Process technology (how many transistors on a chip) Organization What type

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

These slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information.

These slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information. 11 1 This Set 11 1 These slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information. Text covers multiple-issue machines in Chapter 4, but

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Lecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture

Lecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture Lecture Topics ECE 486/586 Computer Architecture Lecture # 5 Spring 2015 Portland State University Quantitative Principles of Computer Design Fallacies and Pitfalls Instruction Set Principles Introduction

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

LSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3

LSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3 PR 0,0 ID:incmb PR ID:St: C,X LSU EE 4720 Dynamic Scheduling Study Guide Fall 2005 1.1 Introduction David M. Koppelman The material on dynamic scheduling is not covered in detail in the text, which is

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP CS252 Graduate Computer Architecture Lecture 18: Branch Prediction + analysis resources => ILP April 2, 2 Prof. David E. Culler Computer Science 252 Spring 2 Today s Big Idea Reactive: past actions cause

More information

620 Fills Out PowerPC Product Line

620 Fills Out PowerPC Product Line 620 Fills Out PowerPC Product Line New 64-Bit Processor Aimed at Servers, High-End Desktops by Linley Gwennap MICROPROCESSOR BTAC Fetch Branch Double Precision FPU FP Registers Rename Buffer /Tag Predict

More information

Asanovic/Devadas Spring Vector Computers. Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology

Asanovic/Devadas Spring Vector Computers. Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology Vector Computers Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology Supercomputers Definition of a supercomputer: Fastest machine in world at given task Any machine costing

More information

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Performance CS 3410 Computer System Organization & Programming [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Performance Complex question How fast is the processor? How fast your application runs?

More information

Review of instruction set architectures

Review of instruction set architectures Review of instruction set architectures Outline ISA and Assembly Language RISC vs. CISC Instruction Set Definition (MIPS) 2 ISA and assembly language Assembly language ISA Machine language 3 Assembly language

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA)... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information