Processor Organization and Performance


Chapter 6 Processor Organization and Performance

6.1 The three-address format gives the addresses required by most operations: two addresses for the two input operands and one address for the result. However, some processors like the Pentium compromise by using the two-address format, because operands in these processors can be located in memory (leading to longer addresses). This is not a problem for modern RISC processors, as they use the load/store architecture: in these processors, most instructions find their operands in registers, and the result is also placed in a register. Since a register can be identified with a much shorter address, using the three-address format does not unduly increase the instruction length. The following figure shows the difference in instruction sizes when we use register-based versus memory-based operands. We assume that there are 32 registers and that a memory address is 32 bits long.

    Register format:  Opcode (8 bits) | Rdest (5) | Rsrc1 (5) | Rsrc2 (5)                      = 23 bits
    Memory format:    Opcode (8 bits) | dest address (32) | src1 address (32) | src2 address (32) = 104 bits

6.2 Yes, the Pentium's use of the two-address format is justified, for the following reason: operands in the Pentium can be located in memory, which implies longer addresses for these operands. Comparing the following figures, we see that moving from the three-address to the two-address format reduces the instruction length from 104 bits to 72 bits.
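The bit counts above can be verified with a quick arithmetic sketch (the helper function below is ours, for illustration only; the field widths are those assumed in the figures):

```python
def instruction_bits(opcode_bits, operand_bits, num_operands):
    """Total instruction width: one opcode field plus equal-width operand fields."""
    return opcode_bits + operand_bits * num_operands

# Three-address formats: 8-bit opcode, 5-bit register ids, 32-bit memory addresses
reg3 = instruction_bits(8, 5, 3)    # register operands
mem3 = instruction_bits(8, 32, 3)   # memory operands
# Two-address memory format
mem2 = instruction_bits(8, 32, 2)

print(reg3, mem3, mem2)  # 23 104 72
```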

Three-address format:
    Register format:  Opcode (8 bits) | Rdest (5) | Rsrc1 (5) | Rsrc2 (5)                      = 23 bits
    Memory format:    Opcode (8 bits) | dest address (32) | src1 address (32) | src2 address (32) = 104 bits

Two-address format:
    Register format:  Opcode (8 bits) | Rdest (5) | Rsrc (5)                  = 18 bits
    Memory format:    Opcode (8 bits) | dest address (32) | src address (32)  = 72 bits

A further reason is that most instructions end up using an address twice. Here is the example we discussed in Section 6.2.1: using the three-address format, the C statement

    A = B + C * D - E + F + A

is converted to the following code:

    mult  T,T,C,D   ; T = C*D
    add   T,T,B     ; T = B + C*D
    sub   T,T,E     ; T = B + C*D - E
    add   T,T,F     ; T = B + C*D - E + F
    add   A,T,A     ; A = B + C*D - E + F + A

Notice that every instruction except the first uses an address twice: in the middle three instructions it is the temporary T, and in the last one it is A. This also supports using two addresses.

6.3 In the load/store architecture, all instructions except load and store get their operands from registers; the results produced by these instructions also go into registers. This results in several advantages. The main ones discussed in this chapter are the following:

1. Since the operands come from internal registers and the results are stored in registers, the load/store architecture speeds up instruction execution.
2. The load/store architecture also reduces the instruction length, as addressing a register takes far fewer bits than addressing a memory location.
3. Reduced processor complexity allows these processors to have a large number of registers, which improves performance.

There are other advantages (such as fixed instruction length) that are discussed in Chapter 14.

6.4 In Section 6.2.5, we assumed that a stack operation (push or pop) does not require a memory access. Thus, we used two memory accesses for each push/pop instruction (one to read the instruction and the other to get the value to be pushed or popped). If the push/pop operations themselves require a memory access, we need one additional memory access for each push/pop instruction. This means 7 more memory accesses, leading to 19 + 7 = 26 memory accesses.

6.5 RISC processors use the load/store architecture, which assumes that the operands required by most instructions are in internal registers; load and store, which move data between memory and registers, are the only exceptions. If we have few registers, we cannot keep operands and results around for use by later instructions, because we will frequently overwrite them with data from memory. This fails to exploit the basic feature of the load/store architecture. With more registers, we can keep data in the registers longer (e.g., a result produced by an arithmetic instruction that is required by another instruction), which reduces the number of memory accesses. Otherwise, we end up repeatedly reading and writing data with load and store instructions, and we lose the main advantage of the load/store architecture.
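The effect of the register count in 6.5 can be illustrated with a toy model (entirely our own construction, not from the text): treat the register file as a small least-recently-used cache of variable values and count how many memory loads a given access sequence needs.

```python
from collections import OrderedDict

def count_loads(num_regs, accesses):
    """Memory loads needed when num_regs registers hold recently used
    variables, with least-recently-used replacement (a toy model)."""
    regs = OrderedDict()                    # variables currently in registers
    loads = 0
    for var in accesses:
        if var in regs:
            regs.move_to_end(var)           # operand already in a register: free
        else:
            loads += 1                      # miss: must load from memory
            if len(regs) == num_regs:
                regs.popitem(last=False)    # evict the least recently used
            regs[var] = True
    return loads

# Each variable is touched twice; with ample registers the second use is free,
# while a single register forces a reload on every access.
sequence = ["B", "C", "D", "E", "F", "A", "B", "C"]
print(count_loads(8, sequence), count_loads(1, sequence))  # 6 8
```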
6.6 In normal branch execution, shown in the figure below, when the branch instruction is executed, control is transferred to the target immediately. The Pentium, for example, uses this type of branching. In delayed branch execution, control is transferred to the target after executing the instruction that follows the branch instruction. In the figure below, before control is transferred, instruction y (shown shaded in the original figure) is executed. This instruction slot is called the delay slot. The SPARC, for example, uses delayed branch execution; in fact, it also uses delayed execution for procedure calls. Why does this help? By the time the processor decodes the branch instruction, the next instruction has already been fetched. Instead of throwing it away, we improve efficiency by executing it. This strategy requires reordering of some instructions.
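The difference between the two schemes can be sketched with a tiny interpreter (purely illustrative; the instruction encoding and the run function are invented for this sketch, not part of the chapter's machine):

```python
def run(program, delayed_branch):
    """Execute a toy program of (op, arg) pairs. 'jump' transfers control to a
    label; with delayed branching, the instruction in the delay slot (right
    after the jump) executes before the transfer takes effect."""
    labels = {arg: i for i, (op, arg) in enumerate(program) if op == "label"}
    trace, pc = [], 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "exec":
            trace.append(arg)                      # "execute" by recording it
            pc += 1
        elif op == "label":
            pc += 1                                # labels take no action
        else:                                      # "jump"
            if delayed_branch:
                trace.append(program[pc + 1][1])   # delay-slot instruction runs
            pc = labels[arg]
    return trace

prog = [
    ("exec", "x"),
    ("jump", "target"),
    ("exec", "y"),      # delay slot: executes only under delayed branching
    ("exec", "z"),
    ("exec", "a"),
    ("label", "target"),
    ("exec", "b"),
    ("exec", "c"),
]

print(run(prog, delayed_branch=False))  # ['x', 'b', 'c']
print(run(prog, delayed_branch=True))   # ['x', 'y', 'b', 'c']
```

Filling the delay slot with something useful is the compiler's job; when no instruction can be moved into it, a nop goes there instead.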

            instruction x
            jump target
            instruction y      <- delay slot
            instruction z
            instruction a
    target: instruction b
            instruction c

    (a) Normal branch execution: after the jump, control goes directly to target (instruction b).
    (b) Delayed branch execution: instruction y in the delay slot executes before control reaches target.

6.7 In the set-then-jump design, condition testing and branching are separated (the Pentium, for example, uses this design). A condition code register communicates the test result to the branch instruction. The test-and-jump design, on the other hand, combines testing and branching into a single instruction. The first design is more general-purpose in the sense that all branching can be handled using this separation. The disadvantage is that two separate instructions must be executed: in the Pentium, for example, a cmp (compare) and a conditional jump instruction are used to implement a conditional branch. Furthermore, this design needs condition code registers to carry the test result. The test-and-jump design is useful only for branches where the test can be made part of the instruction. There are situations, however, where the test cannot be done as part of the branch instruction. For example, consider the overflow condition that results from an add operation: the status of the addition must be stored in something like a condition code register or a flag for use by a later branch instruction. Processors like the MIPS, which follow the test-and-jump design, must handle such scenarios; the MIPS processor, for example, uses exceptions to flag these conditions.

6.8 The main advantage of storing the return address in a register is that simple procedure calls do not have to access memory. Thus, the overhead associated with a procedure invocation is reduced compared to processors like the Pentium that store the return address on the stack. However, the stack-based mechanism used by the Pentium is more general-purpose in that it can handle any type of procedure call.
In contrast, the register-based scheme can handle only simple procedure invocations; recursive procedures, for example, cause problems for the register-based scheme.

6.9 The size of an instruction depends on the number of addresses it carries and on whether those addresses identify registers or memory locations. Since RISC processors use register-based instructions and simple addressing modes, there is little variation in the information carried from instruction to instruction, which leads to fixed-size instructions. The Pentium, a CISC processor, encodes instructions that vary from one byte to several bytes. Part of the reason for the variable-length instructions is that CISC processors tend to provide complex addressing modes. In the Pentium, for example, a register-based operand needs just 3 bits to identify a register, whereas a memory-based operand needs up to 32 bits. In addition, an immediate operand needs a further 32 bits to encode its value into the instruction. Thus, an instruction that uses a memory address and an immediate operand needs 8 bytes just for these two components. As this description shows, providing flexibility in specifying an operand leads to dramatic variations in instruction size.

6.10 There are two main reasons:
1. Allowing both operands to be in memory leads to even greater variation in instruction length. A register in the Pentium can typically be identified with 3 bits, whereas a memory address takes 32 bits. This complicates instruction encoding and decoding further.
2. No one would want to work with all memory-based operands anyway: registers are used extensively by compilers to optimize code. By not allowing both operands to be in memory, the architecture avoids executing such inefficient code.

6.11 If PC and IR are not connected to the system bus, we have to move the contents of PC to MAR using the A bus. Similarly, the instruction read from memory is placed in the MDR register and must then be moved to the IR register. In both cases, one additional cycle is needed, which degrades processor performance. The amount of added overhead depends on the instruction being executed. For the instruction fetch discussed in Section 6.5.2 (page 226), for example, we need two additional cycles for the data movement between PC and MAR and between MDR and IR, an increase of 50%.

6.12 We assume that shl works on the B input of the ALU and shifts left by one bit position. To implement shl4, we execute shl four times. This is shown in the following table:

    Instruction      Step  Control signals
    shl4 %G7,%G5     S1    G5out: ALU=shl: Cin;
                     S2    Cout: ALU=shl: Cin;
                     S3    Cout: ALU=shl: Cin;
                     S4    Cout: ALU=shl: Cin;
                     S5    Cout: G7in: end;

6.13 We use add to perform multiplication by 10.
Our algorithm to multiply X by 10 is given below:

    X + X   = 2X    (store this result; we need it in the last step)
    2X + 2X = 4X
    4X + 4X = 8X
    8X + 2X = 10X

This algorithm is implemented as shown in the following table:

    Instruction       Step  Control signals
    mul10 %G7,%G5     S1    G5out: Ain;
                      S2    G5out: ALU=add: Cin;
                      S3    Cout: Ain: G5in;
                      S4    Cout: ALU=add: Cin;
                      S5    Cout: Ain;
                      S6    Cout: ALU=add: Cin;
                      S7    G5out: Ain;
                      S8    Cout: ALU=add: Cin;
                      S9    Cout: G7in: end;

As shown in this table, we need 9 cycles.

6.14 The implementation is shown below:

    Instruction       Step  Control signals
    mov %G7,%G5       S1    G5out: ALU=BtoC: G7in;

6.15 MIPS stands for millions of instructions per second. Although it is a simple metric, it is practically useless for expressing the performance of a system. Because instructions vary widely among processors, a raw instruction execution rate tells us little about the system. For example, complex instructions take more clocks than simple ones, so an execution rate measured on complex instructions will be lower than one measured on simple instructions, yet the MIPS metric does not capture the actual work done by those instructions. The MIPS metric is perhaps useful for comparing various versions of processors derived from the same instruction set.

6.16 Synthetic benchmarks are programs written specifically for performance testing; the Whetstone and Dhrystone benchmarks are examples. Real benchmarks, on the other hand, use actual programs of the intended application to capture system performance. Therefore, they capture the system performance more accurately.

6.17 Whetstone is a synthetic benchmark whose performance is expressed in MWIPS, millions of Whetstone instructions per second. This benchmark is a small program, which may not measure the system performance for all applications. Another drawback is that it encouraged compilers to apply excessive optimizations that distorted the performance results.

6.18 Computer systems are no longer limited to number crunching. Modern computer systems are more complex, and they run a variety of different applications (3D rendering, string processing, number crunching, and so on).
Performance measured for one type of application may be inappropriate for another. Thus, it is important to measure the performance of the various components for different types of applications.
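The weakness of the MIPS metric discussed in 6.15 can be made concrete with a back-of-the-envelope calculation (the clock rate and cycles-per-instruction values below are made-up numbers for illustration):

```python
def mips_rating(clock_hz, cycles_per_instruction):
    """MIPS = millions of instructions completed per second."""
    return clock_hz / (cycles_per_instruction * 1e6)

# A 100 MHz processor executing simple (1-cycle) instructions versus one
# executing complex (4-cycle) instructions: the second machine scores far
# lower MIPS even though each of its instructions may accomplish more work.
print(mips_rating(100e6, 1))  # 100.0
print(mips_rating(100e6, 4))  # 25.0
```

This is why MIPS figures are comparable only across processors that implement the same instruction set, where "one instruction" means the same amount of work.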