EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts Prof. Sherief Reda School of Engineering Brown University S. Reda EN2910A FALL'15 1

Classical concepts (prerequisite)
1. Instruction set architecture (ISA)
2. Pipelining
3. Cache memory
4. Virtual memory
5. DRAM
This set of lectures is only a refresher. You should already know this material; it is a prerequisite for enrolling in this class.

Steps of program execution
1. Fetch instruction @ PC
2. Decode instruction
3. Fetch operands
4. Execute instruction
5. Store result
6. Update PC
ISA design choices:
- What is the instruction format / size? How is it decoded?
- Where are the operands located? What are their sizes?
- What is the size of the register file?
- How is memory accessed?
- What operations are supported?
- How is the successor instruction determined?

Instruction types
1. Memory transfer: loads and stores
2. Arithmetic and logic: arithmetic (e.g., add, mult) → could be integer or floating point; logic (e.g., or, and)
3. Control instructions: jumps, conditional branches, jump and link
CISC versus RISC instruction sets:
CISC example: ADD [R1], [R2], [R3]. How many RISC instructions are needed to implement this CISC instruction?
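As a sketch of the answer (assuming a MIPS-like load/store ISA), the memory-to-memory add expands into two loads, one register add, and one store, i.e., four RISC instructions. A minimal Python model of that expansion (the dictionaries standing in for memory and registers are illustrative, not from the slides):

```python
# Hypothetical model: the CISC instruction ADD [R1],[R2],[R3] must be
# split into load/compute/store steps on a load/store RISC ISA.
def cisc_mem_add(mem, regs):
    """Simulate ADD [R1],[R2],[R3] as a 4-instruction RISC sequence."""
    t1 = mem[regs["R2"]]          # LW  Rt1, 0(R2)
    t2 = mem[regs["R3"]]          # LW  Rt2, 0(R3)
    result = t1 + t2              # ADD Rt1, Rt1, Rt2
    mem[regs["R1"]] = result      # SW  Rt1, 0(R1)
    return 4                      # number of RISC instructions needed

mem = {0x100: 0, 0x200: 5, 0x300: 7}
regs = {"R1": 0x100, "R2": 0x200, "R3": 0x300}
count = cisc_mem_add(mem, regs)
# mem[0x100] is now 12; the expansion used 4 RISC instructions
```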

Examples of ISA operations
Data transfers:
- LB, LH, LW, LD: LW R1,#20(R2) means R1 <= MEM[(R2)+20]; loads for bytes, half-words, words, and double words
- SB, SH, SW, SD: SW R1,#20(R2) means MEM[(R2)+20] <= (R1); corresponding stores
- L.S, L.D: L.S F0,#20(R2) means F0 <= MEM[(R2)+20]; single/double float load
- S.S, S.D: S.S F0,#20(R2) means MEM[(R2)+20] <= (F0); single/double float store
ALU operations:
- ADD, SUB, ADDU, SUBU: ADD R1,R2,R3 means R1 <= (R2)+(R3); add/sub, signed or unsigned
- ADDI, SUBI, ADDIU, SUBIU: ADDI R1,R2,#3 means R1 <= (R2)+3; add/sub immediate, signed or unsigned
- AND, OR, XOR: AND R1,R2,R3 means R1 <= (R2).AND.(R3); bitwise logical AND, OR, XOR
- ANDI, ORI, XORI: ANDI R1,R2,#4 means R1 <= (R2).AND.4; bitwise AND, OR, XOR immediate
- SLT, SLTU: SLT R1,R2,R3 means R1 <= 1 if (R2)<(R3) else R1 <= 0; test on R2,R3, outcome in R1, signed or unsigned comparison
- SLTI, SLTUI: SLTI R1,R2,#4 means R1 <= 1 if (R2)<4 else R1 <= 0; test on R2, outcome in R1, signed or unsigned comparison

Examples of ISA operations (continued)
Branches/jumps:
- BEQZ, BNEZ: BEQZ R1,label means PC <= label if (R1)=0; conditional branch on equal/not equal to zero
- BEQ, BNE: BNE R1,R2,label means PC <= label if (R1)≠(R2); conditional branch on equal/not equal
- J: J target means PC <= target; target is an immediate field
- JR: JR R1 means PC <= (R1); target is in a register
- JAL: JAL target means R31 <= (PC)+4; PC <= target; jump to target after saving the return address in R31
Floating point:
- ADD.S, SUB.S, MUL.S, DIV.S: ADD.S F1,F2,F3 means F1 <= (F2)+(F3); float arithmetic, single precision
- ADD.D, SUB.D, MUL.D, DIV.D: ADD.D F0,F2,F4 means F0 <= (F2)+(F4); float arithmetic, double precision

Memory addressing modes
- Register: ADD R4,R3 means Reg[R4] <- Reg[R4] + Reg[R3]
- Immediate: ADD R4,#3 means Reg[R4] <- Reg[R4] + 3
- Displacement: ADD R4,100(R1) means Reg[R4] <- Reg[R4] + Mem[100 + Reg[R1]]
- Register indirect: ADD R4,(R1) means Reg[R4] <- Reg[R4] + Mem[Reg[R1]]
- Indexed: ADD R3,(R1+R2) means Reg[R3] <- Reg[R3] + Mem[Reg[R1] + Reg[R2]]
- Direct or absolute: ADD R1,(1001) means Reg[R1] <- Reg[R1] + Mem[1001]
- Memory indirect: ADD R1,@R3 means Reg[R1] <- Reg[R1] + Mem[Mem[Reg[R3]]]
- Post-increment: ADD R1,(R2)+ means ADD R1,(R2) then R2 <- R2+d
- Pre-decrement: ADD R1,-(R2) means R2 <- R2-d then ADD R1,(R2)
- PC-relative branch: BEQZ R1,100 means if (R1)=0, PC <- PC+100
- PC-relative jump: JUMP 200 concatenates bits of the PC with the offset
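A minimal Python sketch of how a few of these modes compute an effective address (the function name and the register/memory layout are illustrative, not from the slides):

```python
# Illustrative effective-address computation for a few addressing modes.
# regs is a toy stand-in for the register file.
def effective_address(mode, regs, disp=0, base=None, index=None):
    if mode == "displacement":        # e.g., 100(R1)
        return disp + regs[base]
    if mode == "register_indirect":   # e.g., (R1)
        return regs[base]
    if mode == "indexed":             # e.g., (R1+R2)
        return regs[base] + regs[index]
    if mode == "absolute":            # e.g., (1001)
        return disp
    raise ValueError(mode)

regs = {"R1": 0x1000, "R2": 0x20}
ea1 = effective_address("displacement", regs, disp=100, base="R1")  # 0x1064
ea2 = effective_address("indexed", regs, base="R1", index="R2")     # 0x1020
```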

Example of ISA (MIPS) encoding
LW Rt, displacement(Rs)
SW Rt, displacement(Rs)
ADDI Rt, Rs, immediate
BEQ Rt, Rs, offset
ADD Rd, Rt, Rs
J target
JAL target
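These formats can be sketched as bit-field packing. The Python encoder below is a simplified model of the standard MIPS R-type and I-type layouts (not code from the course):

```python
# Simplified MIPS encoders: R-type is op|rs|rt|rd|shamt|funct
# (field widths 6,5,5,5,5,6); I-type is op|rs|rt|imm16 (6,5,5,16).
def encode_r(op, rs, rt, rd, shamt, funct):
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# ADD $t0, $t1, $t2 -> rd=8 ($t0), rs=9 ($t1), rt=10 ($t2), funct=0x20
add_word = encode_r(0, 9, 10, 8, 0, 0x20)    # 0x012A4020
# LW $t0, 32($s3)   -> op=0x23, rs=19 ($s3), rt=8 ($t0), imm=32
lw_word = encode_i(0x23, 19, 8, 32)          # 0x8E680020
```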

Architectural state
Determines everything about a processor:
- PC
- 32 registers
- Memory
(Figure: single-cycle datapath with the PC, 32-bit instruction memory, register file with ports A1/A2/A3, WD3, WE3, RD1/RD2, and data memory.)

Typical datapath (MIPS)

Pipelining (IF/ID/EX/MEM/WB)
Pipelining is a form of temporal parallelism.
Shortening the critical path enables a faster clock → the ideal speedup is achieved when the pipeline stages are balanced.
Ideal CPI of 1; however, stalls, branches, and cache misses increase CPI beyond 1.
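As a worked sketch (the numbers are my own, not from the slides): with k balanced stages, a program of n instructions takes roughly n + k - 1 short cycles once hazards are ignored, versus n long cycles unpipelined:

```python
# Toy pipeline timing model: k balanced stages, per-stage latency t,
# hazards ignored. Unpipelined cycle time is k*t; pipelined is t.
def unpipelined_time(n_instr, stages, stage_latency):
    return n_instr * stages * stage_latency

def pipelined_time(n_instr, stages, stage_latency):
    # The first instruction fills the pipeline; then one finishes per cycle.
    return (n_instr + stages - 1) * stage_latency

n, k, t = 1000, 5, 1.0            # 1000 instructions, 5 stages, 1 ns/stage
speedup = unpipelined_time(n, k, t) / pipelined_time(n, k, t)
# speedup approaches the ideal factor of 5 as n grows
```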

Pipeline operation abstraction

Pipeline hazards
Structural
  Problem: not enough read/write ports for the register file or caches.
  Solution: stall, or add more hardware resources.
Data
  Problem: a data dependency where the input of one instruction depends on a preceding instruction that has not yet written its result; aka a read-after-write (RAW) hazard.
  Solution: reorder instructions in the compiler, stall, or forward.
Control
  Problem: deciding which instruction to fetch next depends on the results of preceding instructions in the pipeline.
  Solution: stall, or branch prediction + speculative execution.

Resolving data hazards by forwarding
(Figure: pipeline diagram showing the dependencies and the forwarding paths.)
Write to the register file in the 1st half of the cycle; read in the 2nd half.

Forwarding might not be possible all the time
(Figure: an example where forwarding is not possible.)

Hazard avoidance by stalling and forwarding
How to stall the pipeline?

Circuit for forwarding
Condition for forwarding: forward from either the EX/MEM or MEM/WB pipeline register if the destination register held in that pipeline register matches one of the sources of the ALU.
Hazard detection and forwarding can reduce the clock frequency.
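The forwarding condition above can be sketched as mux-select logic for one ALU source operand (a simplified model in the spirit of a textbook forwarding unit; the signal names are illustrative):

```python
# Simplified forwarding-unit logic for one ALU source operand.
# Returns which value the ALU input mux should select.
def forward_select(id_ex_rs, ex_mem_rd, ex_mem_regwrite,
                   mem_wb_rd, mem_wb_regwrite):
    # EX hazard: the most recent producer wins, so check EX/MEM first.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"       # forward the ALU result from EX/MEM
    # MEM hazard: an older producer sitting in MEM/WB.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"       # forward from MEM/WB
    return "REGFILE"          # no hazard: read the register file normally

# ADD R1,R2,R3 followed by SUB R4,R1,R5: R1 sits in EX/MEM when SUB executes
sel = forward_select(id_ex_rs=1, ex_mem_rd=1, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False)
# sel selects the EX/MEM pipeline register
```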

Control hazards
Results of branch evaluation are available at the end of cycle 3. Which instruction should be fetched in the second cycle?
Solution: either stall, or predict not-taken and flush if necessary.
More on branch prediction + speculative execution later in the class.

Memory hierarchy
Technology, cost per GB, and access time by level:
- Cache: SRAM, ~$10,000/GB, ~1 ns
- Main memory: DRAM, ~$100/GB, ~100 ns
- Virtual memory: hard disk, ~$1/GB, ~10,000,000 ns
Ideal memory: the access time of SRAM with the capacity and cost/GB of disk.
Exploit locality to make memory accesses fast:
- Temporal locality: if data was used recently, it is likely to be used again soon.
- Spatial locality: if data was used recently, nearby data is likely to be used soon.

Cache memory
The level of the memory hierarchy closest to the CPU.
Fast (typically ~1 cycle access time); made of 6T SRAM cells.
If data is present in the cache → hit; otherwise → miss → data must be copied in blocks (i.e., possibly multiple words) from main memory or lower cache levels.
Design goal: maximize the cache hit ratio subject to latency and area constraints.
Design issues:
- Total size: number of blocks and block size
- Designs: direct-mapped, fully associative, and N-way associative
- Write policies
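One standard way to quantify this design goal (not stated on the slide, but a common textbook metric) is average memory access time, AMAT = hit time + miss rate × miss penalty:

```python
# Average memory access time (AMAT) for a single cache level.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
t = amat(hit_time=1.0, miss_rate=0.05, miss_penalty=100.0)
# t is 6.0 cycles on average
```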

Direct-mapped cache memory
Memory address = tag (27 bits) | set (3 bits) | byte offset (2 bits, 00 for word accesses).
(Figure: 8-entry x (1+27+32)-bit SRAM holding valid bit, tag, and data; the tag comparison produces the hit signal and the data output.)
Location is determined by the address. Direct mapped: only one choice,
  set = (block address) modulo (number of blocks in the cache)
The number of blocks is a power of 2.
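The address split on the slide (27-bit tag, 3-bit set, 2-bit byte offset for an 8-block cache with 4-byte blocks) can be sketched as:

```python
# Split a 32-bit address for a direct-mapped cache with 8 one-word blocks:
# 2 offset bits (4-byte blocks), 3 set bits (8 sets), 27 tag bits.
def split_address(addr, offset_bits=2, set_bits=3):
    offset = addr & ((1 << offset_bits) - 1)
    set_index = (addr >> offset_bits) & ((1 << set_bits) - 1)
    tag = addr >> (offset_bits + set_bits)
    return tag, set_index, offset

tag, set_index, offset = split_address(0x00000024)
# 0x24 = 0b100100 -> offset 0, set 0b001 = 1, tag 1
```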

Fully associative cache
(Figure: 8 ways, each with valid bit, tag, and data; a tag comparison on every way drives the hit/selection lines of an 8x1 mux.)
Any block can be placed anywhere → no conflict misses.
Requires many tag comparators → expensive to build.

N-way associative cache
Memory address = tag (28 bits) | set (2 bits) | byte offset (2 bits, 00 for word accesses).
(Figure: 2-way design; each way holds valid bit, tag, and data, two 28-bit comparators produce Hit1/Hit0, and a mux selects the data of the hitting way.)
Aim: strike a balance between the hardware simplicity of a direct-mapped cache and the flexibility of a fully associative cache.

Cache memory issues
Write policies:
- Write-through
- Write-back
Replacement policies for associative cache designs:
- Random
- Least recently used (LRU)
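A minimal sketch of LRU replacement for one cache set (an illustrative model, not the course's implementation), keeping tags ordered from most to least recently used:

```python
# LRU replacement for a single N-way set: keep tags ordered from
# most recently used (front) to least recently used (back).
def access(set_tags, tag, ways=2):
    if tag in set_tags:               # hit: move the tag to the MRU position
        set_tags.remove(tag)
        set_tags.insert(0, tag)
        return "hit"
    if len(set_tags) == ways:         # miss with a full set: evict the LRU tag
        set_tags.pop()
    set_tags.insert(0, tag)
    return "miss"

s = []
outcomes = [access(s, t) for t in [0xA, 0xB, 0xA, 0xC, 0xB]]
# 0xA miss, 0xB miss, 0xA hit, 0xC miss (evicts 0xB), 0xB miss (evicts 0xA)
```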

Multi-level caches
- Primary cache attached to the CPU: small but fast.
- Level-2 cache services misses from the primary cache: larger and slower, but still faster than main memory.
- Main memory services L2 cache misses.
- Some high-end systems include an L3 cache.

Virtual memory
- Each program uses virtual addresses.
- The entire virtual address space is stored on a hard disk; a subset of the virtual address data is in DRAM.
- The CPU translates virtual addresses into physical addresses; data not in DRAM is fetched from the hard disk.
- Each program has its own virtual-to-physical mapping:
  - Two programs can use the same virtual address for different data.
  - Programs don't need to be aware that others are running.
  - One program (or virus) can't corrupt the memory used by another. This is called memory protection.

Virtual to physical address translation
Each application has its own page table, whose address is stored in the page table register.
A page can be in physical memory or on disk. If a page on disk is accessed, an exception is raised; an OS handler transfers the page to physical memory and updates the page table.

Translation lookaside buffer (TLB)
TLB: a small cache of the most recent translations.
- Small: accessed in < 1 cycle
- Typically 16-512 entries
- Fully associative
- > 99% hit rates are typical
- Reduces memory access cycles for most loads and stores from 2 to 1
(Figure: a 2-entry TLB translating virtual address 0x00002 47C, i.e., a 19-bit virtual page number 0x00002 and a 12-bit page offset 0x47C, through the entries 0x7FFFD -> 0x0000 and 0x00002 -> 0x7FFF, to physical address 0x7FFF 47C.)
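The translation in the figure can be sketched directly (using the slide's 12-bit page offset and the mapping 0x00002 → 0x7FFF; the page-table dictionary is a stand-in for the TLB/page-table lookup):

```python
# Page-table translation sketch: 4 KB pages (12-bit offset).
PAGE_OFFSET_BITS = 12

def translate(vaddr, page_table):
    vpn = vaddr >> PAGE_OFFSET_BITS                # virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    ppn = page_table[vpn]                          # KeyError models a page fault
    return (ppn << PAGE_OFFSET_BITS) | offset

page_table = {0x00002: 0x7FFF, 0x7FFFD: 0x0000}    # mappings from the slide
paddr = translate(0x0000247C, page_table)
# paddr is 0x7FFF47C, matching the figure
```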

Virtual memory + cache chaining
(Figure: a virtual address with a 20-bit virtual page number and 12-bit page offset, for a 4 KB page size, is translated by the TLB into a 20-bit physical page number; the resulting physical address is split into an 18-bit tag, 12-bit index, and 2-bit offset for a 16 KB direct-mapped cache with 4-byte blocks, i.e., 4096 blocks, producing hit and data.)

Main memory: DRAM
DRAM is usually a resource shared among multiple processors, the GPU, and I/O devices → a controller (the Northbridge in x86 systems) is needed to coordinate access.

DRAM organization
- 1T bit cells → compact and few fabrication steps → enable cheap, large memories.
- Reads are destructive; contents must be restored after reading.
- Capacitors are leaky, so they must be refreshed periodically → contributes to the slow access of DRAMs.
- Board buses from the processor to the DRAM are slow.

Summary of background
- ISA design: instruction types, memory addressing modes, encoding choices.
- Pipelining: provides speedup; complications: structural, data, and control hazards; solutions: hazard detection with forwarding and/or stalling.
- Cache memory: size; designs (direct-mapped, fully associative, N-way).
- Virtual memory: advantages; translation from virtual to physical.
- DRAM: slow → latency must be hidden by the cache hierarchy.