Computer Architecture (CSC-3501), Lecture 25 (24 April 2008): Chapter 9 Objectives; 9.2 RISC Machines


Announcement
Computer Architecture (CSC-3501), Lecture 25 (24 April 2008)
Seung-Jong Park (Jay), http://www.csc.lsu.edu/~sjpark

Chapter 9 Objectives
Learn the properties that often distinguish RISC from CISC architectures.
Understand how multiprocessor architectures are classified.
Appreciate the factors that create complexity in multiprocessor systems.
Become familiar with the ways in which some architectures transcend the traditional von Neumann paradigm.

9.1 Introduction
We have so far studied only the simplest models of computer systems: classical single-processor von Neumann systems. This chapter presents a number of different approaches to computer organization and architecture. Some of these approaches are in place in today's commercial systems; others may form the basis for the computers of tomorrow.

9.2 RISC Machines
The underlying philosophy of RISC machines is that a system is better able to manage program execution when the program consists of only a few different instructions that are the same length and require the same number of clock cycles to decode and execute. RISC systems access memory only with explicit load and store instructions. In CISC systems, many different kinds of instructions access memory, making instruction length variable and fetch-decode-execute time unpredictable.

The difference between CISC and RISC becomes evident through the basic computer performance equation:

    time/program = (instructions/program) × (cycles/instruction) × (seconds/cycle)

RISC systems shorten execution time by reducing the clock cycles per instruction. CISC systems improve performance by reducing the number of instructions per program.
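To make the trade-off concrete, here is a worked application of the equation with purely illustrative numbers (not taken from the slides): suppose a CISC program needs 50 instructions at an average of 4 cycles each on a 2 ns clock, while the RISC version of the same program needs 120 instructions at 1 cycle each on a 1 ns clock. Then

    CISC: 50 × 4 × 2 ns = 400 ns
    RISC: 120 × 1 × 1 ns = 120 ns

so the RISC version finishes sooner even though it executes more than twice as many instructions.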

The simple instruction set of RISC machines enables control units to be hardwired for maximum speed. The more complex, variable instruction set of CISC machines requires microcode-based control units that interpret instructions as they are fetched from memory; this translation takes time. With fixed-length instructions, RISC lends itself to pipelining and speculative execution.

Consider the following program fragments:

    CISC:
        mov ax, 10
        mov bx, 5
        mul bx, ax

    RISC:
        mov ax, 0
        mov bx, 10
        mov cx, 5
    Begin:
        add ax, bx
        loop Begin

The total clock cycles for the CISC version might be:
    (2 movs × 1 cycle) + (1 mul × 30 cycles) = 32 cycles
while the clock cycles for the RISC version are:
    (3 movs × 1 cycle) + (5 adds × 1 cycle) + (5 loops × 1 cycle) = 13 cycles
With the RISC clock cycle also being shorter, RISC gives us much faster execution speeds. (A C-level view of these two fragments is sketched after the comparison tables below.)

Because of their load-store ISAs, RISC architectures require a large number of CPU registers. These registers provide fast access to data during sequential program execution. They can also be employed to reduce the overhead typically caused by passing parameters to subprograms: instead of pulling parameters off of a stack, the subprogram is directed to use a subset of registers.

It is becoming increasingly difficult to distinguish RISC architectures from CISC architectures. Some RISC systems provide more extravagant instruction sets than some CISC systems, and some systems combine both approaches. The following two tables summarize the characteristics that traditionally typify the differences between these two architectures.

    RISC                                         CISC
    Multiple register sets                       Single register set
    Three operands per instruction               One or two register operands per instruction
    Parameter passing through register windows   Parameter passing through memory
    Single-cycle instructions                    Multiple-cycle instructions
    Hardwired control                            Microprogrammed control
    Highly pipelined                             Less pipelined

    RISC                                         CISC
    Simple instructions, few in number           Many complex instructions
    Fixed-length instructions                    Variable-length instructions
    Complexity in the compiler                   Complexity in the microcode
    Only LOAD/STORE instructions access memory   Many instructions can access memory
    Few addressing modes                         Many addressing modes
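As a rough illustration of what the two assembly fragments compute, here is a minimal C sketch (mine, not from the slides) of the same 10 × 5 product: once as a single multiply, the way the CISC fragment does it, and once as the repeated addition performed by the RISC loop.

    #include <stdio.h>

    int main(void) {
        /* "CISC-style": one complex multiply instruction does all the work. */
        int product_mul = 10 * 5;

        /* "RISC-style": only simple adds inside a counted loop, mirroring
           mov ax,0 / mov bx,10 / mov cx,5 / Begin: add ax,bx / loop Begin. */
        int ax = 0, bx = 10;
        for (int cx = 5; cx > 0; cx--) {
            ax += bx;
        }

        printf("%d %d\n", product_mul, ax);  /* both print 50 */
        return 0;
    }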

Many attempts have been made to come up with a way to categorize computer architectures. Flynn's Taxonomy has been the most enduring of these, despite having some limitations. Flynn's Taxonomy takes into consideration the number of processors and the number of data paths incorporated into an architecture: a machine can have one or many processors that operate on one or many data streams. The four combinations of single or multiple instruction streams and single or multiple data streams are described by Flynn as SISD, SIMD, MISD, and MIMD.

Flynn's Taxonomy falls short in a number of ways. First, there appears to be no need for MISD machines. Second, it assumes that parallelism is homogeneous, which ignores the contribution of specialized processors. Third, it provides no straightforward way to distinguish architectures within the MIMD category. One idea is to divide these systems into those that share memory and those that don't, as well as whether the interconnections are bus-based or switch-based.

Symmetric multiprocessors (SMP) and massively parallel processors (MPP) are MIMD architectures that differ in how they use memory: SMP systems share the same memory, and MPP systems do not. An easy way to distinguish SMP from MPP is:
    MPP = many processors + distributed memory + communication via network
    SMP = fewer processors + shared memory + communication via memory

Other examples of MIMD architectures are found in distributed computing, where processing takes place collaboratively among networked computers. A network of workstations (NOW) uses otherwise idle systems to solve a problem. A collection of workstations (COW) is a NOW where one workstation coordinates the actions of the others. A dedicated cluster parallel computer (DCPC) is a group of workstations brought together to solve a specific problem. A pile of PCs (POPC) is a cluster of (usually) heterogeneous systems that form a dedicated parallel system.

Flynn's Taxonomy has been expanded to include SPMD (single program, multiple data) architectures. Each SPMD processor has its own data set and program memory, and different nodes can execute different instructions within the same program using instructions similar to: if myNodeNum == 1 do this, else do that. (A minimal SPMD sketch follows below.) Yet another idea missing from Flynn's Taxonomy is whether the architecture is instruction driven or data driven. The next slide provides a revised taxonomy.
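For concreteness, here is a minimal SPMD sketch in C using MPI (MPI is my choice of illustration; the slide does not name any particular library). Every node runs the same program, but each branches on its own node number, exactly like the pseudocode above.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int my_node_num;
        MPI_Init(&argc, &argv);                       /* every node runs this same program   */
        MPI_Comm_rank(MPI_COMM_WORLD, &my_node_num);  /* but each learns its own node number */

        if (my_node_num == 1) {
            printf("node 1: do this\n");              /* one node takes one branch...        */
        } else {
            printf("node %d: do that\n", my_node_num);/* ...all the others take the other    */
        }

        MPI_Finalize();
        return 0;
    }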

[Slides: the current Top500 supercomputer list; the BlueGene/L custom processor chip; the BlueGene/L (a) chip, (b) card, (c) board, (d) cabinet, (e) system; and a chart of the growth of high-performance computing, from the Babbage Difference Engine through the Harvard Mark 1, EDSAC, UNIVAC 1, IBM 7094, CDC 6600, Cray 1, Cray X-MP, Cray Y-MP, Intel Delta, T3E, ASCI Red, Earth Simulator, and Cray X1, spanning one OPS to PetaOPS (by Thomas Sterling, Caltech & JPL).]

9.4 Parallel and Multiprocessor Architectures
Parallel processing is capable of economically increasing system throughput while providing better fault tolerance. The limiting factor is that no matter how well an algorithm is parallelized, there is always some portion that must be done sequentially; additional processors sit idle while the sequential work is performed. Thus, it is important to keep in mind that an n-fold increase in processing power does not necessarily result in an n-fold increase in throughput.
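The slide does not name it, but this observation is usually formalized as Amdahl's Law: if a fraction f of a program can be parallelized across n processors, the best possible speedup is

    speedup = 1 / ((1 - f) + f/n)

With illustrative numbers f = 0.9 and n = 10, the speedup is 1 / (0.1 + 0.09) ≈ 5.3, far short of 10-fold, and even with unlimited processors it can never exceed 1 / (1 - f) = 10.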

9.4 Superpipelining
Recall that pipelining divides the fetch-decode-execute cycle into stages that each carry out a small part of the process on a set of instructions. Ideally, an instruction exits the pipeline during each tick of the clock. Superpipelining occurs when a pipeline has stages that require less than half a clock cycle to complete; the pipeline is equipped with a separate clock running at a frequency that is at least double that of the main system clock. Superpipelining is only one aspect of superscalar design.

9.4 Superscalar (dynamic multiple-issue processors)
Superscalar architectures include multiple execution units such as specialized integer and floating-point adders and multipliers. A critical component of this architecture is the instruction fetch unit, which can simultaneously retrieve several instructions from memory. A decoding unit determines which of these instructions can be executed in parallel and combines them accordingly. This architecture also requires compilers that make optimum use of the hardware.

[Slide: Pentium 4 block diagram.]

9.4 VLIW (static multiple-issue processors)
Very long instruction word (VLIW) architectures differ from superscalar architectures in that the VLIW compiler, instead of a hardware decoding unit, packs independent instructions into one long instruction that is sent down the pipeline to the execution units. The compiler takes greater responsibility for exploiting parallelism. One could argue that this is the best approach, because the compiler can better identify instruction dependencies; however, compilers tend to be conservative and do not have a view of the run-time code. Examples are the Intel Itanium and Itanium 2 for the IA-64 ISA, also known as EPIC (Explicitly Parallel Instruction Computing). (A small example of independent versus dependent instructions follows the comparison table below.)

[Slide: Example of VLIW on IA-64. The compiler views a wider scope of the original source code and compiles it into parallel machine code for multiple functional units, making more efficient use of execution resources. A 128-bit bundle holds three instructions, e.g. Memory (M), Memory (M), Integer (I), plus a 5-bit template (MMI).]

CISC vs. RISC vs. Superscalar vs. VLIW:

                      CISC                    RISC                      Superscalar               VLIW
    Instr size        variable                fixed                     fixed                     fixed (but large)
    Instr format      variable                fixed                     fixed                     fixed
    Registers         few, some special       many GP                   GP and rename (RUU)       many, many GP
    Memory reference  embedded in many        load/store                load/store                load/store
                      instructions
    Key issues        decode complexity       data forwarding,          hardware dependency       (compiler)
                                              hazards                   resolution                code scheduling
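To see what a superscalar decoder or a VLIW compiler is looking for, here is a tiny C sketch (mine, not from the slides) with two statements that are independent of each other, and so can be issued in the same cycle or packed into one VLIW bundle, followed by one that depends on both and must wait.

    #include <stdio.h>

    int main(void) {
        int x = 3, y = 4, p = 5, q = 6;

        int a = x + y;   /* independent of the next statement...                  */
        int b = p * q;   /* ...so a superscalar CPU can issue both in one cycle,
                            or a VLIW compiler can pack them into the same bundle */
        int c = a - b;   /* depends on both a and b, so it must issue later       */

        printf("%d\n", c);   /* 7 - 30 = -23 */
        return 0;
    }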