EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts

Prof. Sherief Reda
School of Engineering, Brown University
S. Reda EN2910A FALL'15

Classical concepts (prerequisite)

1. Instruction set architecture (ISA)
2. Pipelining
3. Cache memory
4. Virtual memory
5. DRAM

These lectures are a refresher only. You should already know this material; it is a prerequisite for enrolling in this class.

Steps of program execution

1. Fetch instruction at PC
2. Decode instruction
3. Fetch operands
4. Execute instruction
5. Store result
6. Update PC

ISA design choices:
- What is the instruction format and size? How is it decoded?
- Where are the operands located? What are their sizes? What is the size of the register file? How is memory accessed?
- What operations are supported?
- How is the successor instruction determined?
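
The six steps above can be sketched as a small interpreter loop. This is only an illustration: the tuple-based instruction format, the opcode names, and the `run` function are hypothetical and do not correspond to any real ISA encoding.

```python
# Toy fetch/decode/execute loop. The instruction tuples and opcode names are
# hypothetical, chosen only to mirror the six steps listed on the slide.

def run(program, regs, mem):
    pc = 0
    while pc < len(program):
        instr = program[pc]                  # fetch instruction at PC
        op, *args = instr                    # decode instruction
        next_pc = pc + 1                     # default successor instruction
        if op == "addi":                     # rd <- rs + immediate
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm        # fetch operand, execute, store result
        elif op == "add":                    # rd <- rs + rt
            rd, rs, rt = args
            regs[rd] = regs[rs] + regs[rt]
        elif op == "sw":                     # mem[rs + immediate] <- rt
            rt, imm, rs = args
            mem[regs[rs] + imm] = regs[rt]
        elif op == "bnez":                   # branch to target if rs != 0
            rs, target = args
            if regs[rs] != 0:
                next_pc = target
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc = next_pc                         # update PC
    return regs, mem


# Accumulate 3 + 2 + 1 in r2, then store the sum at address 0.
program = [
    ("addi", 1, 0, 3),    # r1 = 3 (loop counter)
    ("add", 2, 2, 1),     # r2 += r1
    ("addi", 1, 1, -1),   # r1 -= 1
    ("bnez", 1, 1),       # loop back to instruction 1 while r1 != 0
    ("sw", 2, 0, 0),      # mem[r0 + 0] = r2
]
regs, mem = run(program, [0] * 4, {})
print(regs[2], mem[0])    # prints "6 6"
```

Each ISA design question on the slide shows up as a choice in this loop: how `instr` is decoded, where operands live (`regs` versus `mem`), and how `next_pc` is chosen.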

Instruction types

1. Memory transfer: loads and stores.
2. Arithmetic and logic:
   - Arithmetic (e.g., add, mult); could be integer or floating point.
   - Logic (e.g., or, and).
3. Control instructions: jumps, conditional branches, jump and link.

CISC versus RISC instruction sets:
- CISC example: ADD [R1], [R2], [R3]. How many RISC instructions are needed to implement this CISC instruction?
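
One plausible answer to the question above, sketched under an assumption the slide does not state: that the first operand is the destination, so `ADD [R1], [R2], [R3]` means Mem[R1] <- Mem[R2] + Mem[R3]. Under that reading a load/store machine needs two loads, one add, and one store. The scratch registers R4 and R5 and the function names are hypothetical.

```python
# Assumed semantics (not stated on the slide): ADD [R1], [R2], [R3] performs
# Mem[R1] <- Mem[R2] + Mem[R3], i.e. the first operand is the destination.

def cisc_add(regs, mem):
    """One memory-to-memory CISC instruction."""
    mem[regs["R1"]] = mem[regs["R2"]] + mem[regs["R3"]]


def risc_add(regs, mem):
    """The same effect as a load/store sequence; R4 and R5 are scratch registers."""
    regs["R4"] = mem[regs["R2"]]            # LW  R4, #0(R2)
    regs["R5"] = mem[regs["R3"]]            # LW  R5, #0(R3)
    regs["R4"] = regs["R4"] + regs["R5"]    # ADD R4, R4, R5
    mem[regs["R1"]] = regs["R4"]            # SW  R4, #0(R1)
    return 4                                # number of RISC instructions used


def fresh_state():
    regs = {"R1": 100, "R2": 104, "R3": 108, "R4": 0, "R5": 0}
    mem = {100: 0, 104: 7, 108: 35}
    return regs, mem


regs_a, mem_a = fresh_state()
cisc_add(regs_a, mem_a)
regs_b, mem_b = fresh_state()
count = risc_add(regs_b, mem_b)
print(mem_a[100], mem_b[100], count)        # prints "42 42 4"
```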

Examples of ISA operations

| Types | Opcode | Assembly code | Meaning | Comments |
|---|---|---|---|---|
| Data transfers | LB, LH, LW, LD | LW R1,#20(R2) | R1<=MEM[(R2)+20] | for bytes, half-words, words, and double words |
| | SB, SH, SW, SD | SW R1,#20(R2) | MEM[(R2)+20]<=(R1) | for bytes, half-words, words, and double words |
| | L.S, L.D | L.S F0,#20(R2) | F0<=MEM[(R2)+20] | single/double float load |
| | S.S, S.D | S.S F0,#20(R2) | MEM[(R2)+20]<=(F0) | single/double float store |
| ALU operations | ADD, SUB, ADDU, SUBU | ADD R1,R2,R3 | R1<=(R2)+(R3) | add/sub signed or unsigned |
| | ADDI, SUBI, ADDIU, SUBIU | ADDI R1,R2,#3 | R1<=(R2)+3 | add/sub immediate signed or unsigned |
| | AND, OR, XOR | AND R1,R2,R3 | R1<=(R2).AND.(R3) | bitwise logical AND, OR, XOR |
| | ANDI, ORI, XORI | ANDI R1,R2,#4 | R1<=(R2).AND.4 | bitwise AND, OR, XOR immediate |
| | SLT, SLTU | SLT R1,R2,R3 | R1<=1 if R2<R3 else R1<=0 | test on R2,R3 outcome in R1, signed or unsigned comparison |
| | SLTI, SLTUI | SLTI R1,R2,#4 | R1<=1 if R2<4 else R1<=0 | test R2 outcome in R1, signed or unsigned comparison |

Examples of ISA operations

| Types | Opcode | Assembly code | Meaning | Comments |
|---|---|---|---|---|
| Branches/Jumps | BEQZ, BNEZ | BEQZ R1,label | PC<=label if (R1)=0 | conditional branch-equal 0/not equal 0 |
| | BEQ, BNE | BNE R1,R2,label | PC<=label if (R1)!=(R2) | conditional branch-equal/not equal |
| | J | J target | PC<=target | target is an immediate field |
| | JR | JR R1 | PC<=(R1) | target is in register |
| | JAL | JAL target | R31<=(PC)+4; PC<=target | jump to target after saving the return address in R31 |
| Floating point | ADD.S, SUB.S, MUL.S, DIV.S | ADD.S F1,F2,F3 | F1<=(F2)+(F3) | float arithmetic single precision |
| | ADD.D, SUB.D, MUL.D, DIV.D | ADD.D F0,F2,F4 | F0<=(F2)+(F4) | float arithmetic double precision |

Memory addressing modes

| MODE | EXAMPLE | MEANING |
|---|---|---|
| REGISTER | ADD R4,R3 | reg[r4] <- reg[r4] + reg[r3] |
| IMMEDIATE | ADD R4, #3 | reg[r4] <- reg[r4] + 3 |
| DISPLACEMENT | ADD R4, 100(R1) | reg[r4] <- reg[r4] + Mem[100 + reg[r1]] |
| REGISTER INDIRECT | ADD R4, (R1) | reg[r4] <- reg[r4] + Mem[reg[r1]] |
| INDEXED | ADD R3, (R1+R2) | reg[r3] <- reg[r3] + Mem[reg[r1] + reg[r2]] |
| DIRECT OR ABSOLUTE | ADD R1, (1001) | reg[r1] <- reg[r1] + Mem[1001] |
| MEMORY INDIRECT | ADD R1, @R3 | reg[r1] <- reg[r1] + Mem[Mem[reg[r3]]] |
| POST INCREMENT | ADD R1, (R2)+ | ADD R1, (R2) then R2 <- R2+d |
| PREDECREMENT | ADD R1, -(R2) | R2 <- R2-d then ADD R1, (R2) |
| PC-RELATIVE | BEZ R1, 100 | if R1==0, PC <- PC+100 |
| PC-RELATIVE | JUMP 200 | Concatenate bits of PC and offset |
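
The memory-referencing rows of the table differ only in how the effective address is formed. A minimal sketch of that calculation follows; the `effective_address` function, its parameter names, and the mode strings are hypothetical, and only five of the modes are covered.

```python
# Effective-address calculation for five of the addressing modes in the table.
# Function name, mode strings, and register/memory contents are illustrative.

def effective_address(mode, regs, mem, base=None, index=None, disp=0):
    if mode == "displacement":         # 100(R1)  -> 100 + reg[r1]
        return disp + regs[base]
    if mode == "register_indirect":    # (R1)     -> reg[r1]
        return regs[base]
    if mode == "indexed":              # (R1+R2)  -> reg[r1] + reg[r2]
        return regs[base] + regs[index]
    if mode == "direct":               # (1001)   -> 1001
        return disp
    if mode == "memory_indirect":      # @R3      -> Mem[reg[r3]]
        return mem[regs[base]]
    raise ValueError(f"unsupported mode: {mode}")


regs = {"r1": 100, "r2": 8, "r3": 200}
mem = {200: 1001}

print(effective_address("displacement", regs, mem, base="r1", disp=100))   # 200
print(effective_address("indexed", regs, mem, base="r1", index="r2"))      # 108
print(effective_address("memory_indirect", regs, mem, base="r3"))          # 1001
```

Memory indirect is the only mode here that needs a memory read just to compute the address, which is one reason load/store ISAs tend to omit it.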

Example of ISA (MIPS) encoding

- LW Rt, displacement(Rs)
- SW Rt, displacement(Rs)
- ADDI Rt, Rs, immediate
- BEQ Rt, Rs, offset
- ADD Rd, Rt, Rs
- J target
- JAL target
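
The slide's encoding figure did not survive extraction, so the sketch below uses the standard MIPS32 field layouts (R-type, I-type, J-type) rather than anything recoverable from the slide itself. The helper names `encode_r`, `encode_i`, and `encode_j` are hypothetical; the opcode and funct values are the usual MIPS32 ones.

```python
# Standard MIPS32 field layouts (assumed; the slide's figure is not available):
#   R-type: opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6)
#   I-type: opcode(6) rs(5) rt(5) immediate(16)
#   J-type: opcode(6) target(26)

OP_LW, OP_SW, OP_ADDI, OP_BEQ, OP_J, OP_JAL = 0x23, 0x2B, 0x08, 0x04, 0x02, 0x03
FUNCT_ADD = 0x20


def encode_r(rs, rt, rd, funct, shamt=0, opcode=0):
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(opcode, rs, rt, imm):
    # Mask the immediate to 16 bits so negative values are stored in two's complement.
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(opcode, target):
    return (opcode << 26) | (target & 0x3FFFFFF)


# add $8, $17, $18  (rd=8, rs=17, rt=18)
print(hex(encode_r(rs=17, rt=18, rd=8, funct=FUNCT_ADD)))   # 0x2324020
# lw $8, 32($19)    (rt=8, rs=19)
print(hex(encode_i(OP_LW, rs=19, rt=8, imm=32)))            # 0x8e680020
```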

Architectural state

Determines everything about a processor:
- PC
- 32 registers
- Memory

[Figure: the PC register, instruction memory (A, RD), register file (A1, A2, A3, WD3, WE3, RD1, RD2), and data memory (A, RD, WD, WE), with 32-bit data paths and 5-bit register addresses.]

Typical datapath (MIPS)

Pipelining (IF/ID/EX/MEM/WB)

- Pipelining is a form of temporal parallelism.
- Shortening the critical path enables a faster clock → the ideal speedup is achieved when the pipeline is balanced.
- The ideal CPI is 1; however, stalls, branches, and cache misses increase the CPI beyond 1.
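
A rough worked example of these two effects, under simplifying assumptions: the unpipelined machine takes one long cycle equal to the sum of the stage delays, the pipelined clock is set by the slowest stage, and all hazards are lumped into an average number of stall cycles per instruction. The function name and the delay values are illustrative, not from the slides.

```python
# Speedup of a pipeline over a single-cycle design, assuming:
#   - unpipelined time per instruction = sum of stage delays
#   - pipelined cycle time = slowest stage (+ optional latch overhead)
#   - pipelined CPI = 1 + average stall cycles per instruction

def pipeline_speedup(stage_delays_ns, stall_cycles_per_instr=0.0, latch_overhead_ns=0.0):
    unpipelined_time = sum(stage_delays_ns)
    cycle_time = max(stage_delays_ns) + latch_overhead_ns
    cpi = 1.0 + stall_cycles_per_instr
    return unpipelined_time / (cycle_time * cpi)


balanced = [2, 2, 2, 2, 2]      # IF, ID, EX, MEM, WB, all equal
unbalanced = [1, 2, 3, 2, 2]    # same total delay, EX is the bottleneck

print(pipeline_speedup(balanced))                               # 5.0
print(round(pipeline_speedup(unbalanced), 2))                   # 3.33
print(pipeline_speedup(balanced, stall_cycles_per_instr=0.25))  # 4.0
```

With five balanced stages the speedup reaches the stage count; imbalance or a CPI above 1 pulls it below that.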

Pipeline operation abstraction

Pipeline hazards

Structural
- Problem: not enough read/write data ports for the register file or caches.
- Solution: stall or add more hardware resources.

Data
- Problem: a data dependency where the input of one instruction depends on a preceding instruction that has not yet written its result; also known as a read-after-write (RAW) hazard.
- Solution: reorder instructions in the compiler, stall, or forward.

Control
- Problem: deciding which instruction to fetch next depends on the results of preceding instructions still in the pipeline.
- Solution: stall, or use branch prediction with speculative execution.
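
A small sketch of how RAW hazards could be found in an instruction sequence. The `Instr` record and `raw_hazards` function are hypothetical. The default window of 2 is an assumption: it corresponds to a 5-stage pipeline without forwarding whose register file is written in the first half of a cycle and read in the second half, so only the next two instructions can read a stale value.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]         # register written, or None
    srcs: Tuple[str, ...]       # registers read


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each RAW
    dependency whose consumer is within `window` instructions of the producer."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].srcs:
                hazards.append((i, j, producer.dest))
            if program[j].dest == producer.dest:
                break           # register overwritten; later readers depend on j
    return hazards


program = [
    Instr("ADD R1,R2,R3", "R1", ("R2", "R3")),
    Instr("SUB R4,R1,R5", "R4", ("R1", "R5")),
    Instr("AND R6,R1,R7", "R6", ("R1", "R7")),
    Instr("OR  R8,R1,R9", "R8", ("R1", "R9")),
]
print(raw_hazards(program))     # [(0, 1, 'R1'), (0, 2, 'R1')]
```

The OR also reads R1, but it is three instructions behind the ADD, so under the stated assumption it reads the register file after the write has completed.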

Resolving data hazards by forwarding

[Figure: pipeline diagram marking dependencies and forward paths.]

The register file is written in the first half of the cycle and read in the second half.

Forwarding might not be possible all the time

[Figure: pipeline diagram with a dependency labeled "Forwarding not possible here".]

Hazard avoidance by stalling and forwarding

How to stall the pipeline?

Circuit for forwarding

Condition for forwarding: forward from the EX/MEM or MEM/WB pipeline register if the destination register held in that pipeline register matches one of the ALU's source registers.

Hazard detection and forwarding logic can reduce the clock frequency.
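
The forwarding condition can be sketched as a selector for one ALU input. Two details below are assumptions taken from the usual textbook MIPS design rather than from the slide: the producing instruction must actually write a register (`regwrite`), and register 0 is never forwarded because it is hard-wired to zero. EX/MEM is checked first because it holds the more recent result. The function and parameter names are hypothetical.

```python
# Forwarding selector for one ALU source operand.
# Assumptions (textbook MIPS, not stated on the slide): a regwrite control bit
# accompanies each pipeline register, and register 0 is hard-wired to zero.

def forward_select(src_reg, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"      # most recent producer wins
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"         # no hazard: use the value read in ID


# ADD R1,R2,R3 is in MEM and SUB R4,R1,R5 is in EX: forward R1 from EX/MEM.
print(forward_select(src_reg=1, ex_mem_rd=1, ex_mem_regwrite=True,
                     mem_wb_rd=9, mem_wb_regwrite=True))    # EX/MEM
```

In hardware this is a pair of comparators and a 3-input mux per ALU operand, sitting on the EX-stage path, which is how forwarding can lengthen the critical path.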

Control hazards

- The result of branch evaluation is available at the end of cycle 3. Which instruction should be fetched in the second cycle?
- Solution: either stall, or predict not taken and flush if necessary.
- More on branch prediction and speculative execution later in class.

Memory hierarchy

| Technology | Cost / GB | Access time |
|---|---|---|
| SRAM | ~ $10,000 | ~ 1 ns |
| DRAM | ~ $100 | ~ 100 ns |
| Hard disk | ~ $1 | ~ 10,000,000 ns |

[Figure: hierarchy levels Cache, Main Memory, and Virtual Memory, with Speed and Size axes.]

Ideal memory: the access time of SRAM with the capacity and cost/GB of disk.

Exploit locality to make memory accesses fast:
- Temporal locality: if data was used recently, it is likely to be used again soon.
- Spatial locality: if data was used recently, nearby data is likely to be used soon.

Cache memory

- The level of the memory hierarchy closest to the CPU.
- Fast (typically ~1 cycle access time).
- Made of 6T SRAM cells.
- If data is present in the cache → hit; otherwise → miss, and the data must be copied in blocks (possibly several words each) from main memory or lower cache levels.
- Design goal: maximize the cache hit ratio subject to latency and area constraints.
- Design issues:
  - Total size: number of blocks and block size.
  - Designs: direct-mapped, fully associative, and N-way associative.
  - Write policies.

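Direct-mapped cache memory

- Location is determined by the address.
- Direct mapped: only one choice, (block address) modulo (#blocks in cache).
- #blocks is a power of 2.

[Figure: the memory address is split into a 27-bit tag, a 3-bit set index, and a 2-bit byte offset (00). The cache is an 8-entry x (1+27+32)-bit SRAM holding V, Tag, and Data; a comparator on the 27-bit tag produces Hit, and the 32-bit Data is read out.]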
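
A minimal hit/miss model of the cache in the figure (8 blocks of 4 bytes), tracking only valid bits and tags, not data. The class and method names are hypothetical. The set index is (block address) modulo (#blocks), as on the slide.

```python
# Hit/miss model of an 8-block direct-mapped cache with 4-byte blocks.
# Only valid bits and tags are modeled; data storage is omitted.

class DirectMappedCache:
    def __init__(self, num_blocks=8, block_bytes=4):
        self.num_blocks = num_blocks
        self.block_bytes = block_bytes
        self.valid = [False] * num_blocks
        self.tags = [0] * num_blocks

    def split(self, addr):
        block_addr = addr // self.block_bytes     # drop the byte offset
        index = block_addr % self.num_blocks      # (block address) mod (#blocks)
        tag = block_addr // self.num_blocks       # remaining upper bits
        return tag, index

    def access(self, addr):
        """Return True on a hit; on a miss, install the block and return False."""
        tag, index = self.split(addr)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache()
trace = [0x04, 0x24, 0x04, 0x08, 0x08]
results = [cache.access(a) for a in trace]
print(results)    # [False, False, False, False, True]
```

Addresses 0x04 and 0x24 map to the same set (index 1) with different tags, so they keep evicting each other. These are conflict misses, the weakness of having only one placement choice.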

Fully associative cache

- Any block can be placed anywhere → no conflict misses.
- Requires many tag comparators → expensive to build.

[Figure: eight V/Tag/Data entries, each compared against the address tag; the hit signals drive the selection lines of an 8x1 MUX.]

N-way associative cache

[Figure: 2-way example. The memory address is split into a 28-bit tag, a 2-bit set index, and a 2-bit byte offset (00). Way 1 and Way 0 each hold V, Tag (28 bits), and Data (32 bits); two comparators produce Hit 1 and Hit 0, which select the 32-bit data and combine into Hit.]

Aim: strike a balance between the hardware simplicity of a direct-mapped cache and the flexibility of a fully associative cache.

Cache memory issues

- Write policies:
  - Write through
  - Write back
- Replacement policies for associative cache designs:
  - Random
  - Least recently used (LRU)
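
To illustrate LRU replacement, here is a hit/miss sketch of a 2-way set-associative cache with 4 sets and 4-byte blocks. The class name is hypothetical, and using an `OrderedDict` per set to track recency is just one convenient software model; hardware would typically keep a few LRU bits per set instead. Write policies are not modeled.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Hit/miss model with LRU replacement; tags only, no data."""

    def __init__(self, num_sets=4, ways=2, block_bytes=4):
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        # One OrderedDict per set: keys are tags, ordered oldest to newest use.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        block_addr = addr // self.block_bytes
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)     # evict the least recently used tag
        lines[tag] = None
        return False


cache = SetAssociativeCache()
trace = [0x04, 0x24, 0x04, 0x44, 0x24]    # all five map to set 1
results = [cache.access(a) for a in trace]
print(results)    # [False, False, True, False, False]
```

The third access hits because two ways let 0x04 and 0x24 coexist in the same set. The fourth access (0x44) then evicts 0x24, the least recently used block, so the final access to 0x24 misses.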

Multi-level caches

- Primary cache attached to CPU
  - Small, but fast
- Level-2 cache services misses from primary cache
  - Larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include L-3 cache
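
The benefit of a second level can be estimated with the standard average memory access time (AMAT) formula, which is not stated in the slides: AMAT = hit time + miss rate x miss penalty, applied recursively from the outermost level inward. The function name and all latencies and miss rates below are made-up illustrative numbers.

```python
# AMAT = hit_time + miss_rate * (time to service the miss at the next level).
# `levels` lists (hit_time, local_miss_rate) from L1 outward; all numbers
# used in the example are illustrative, not measurements.

def amat(levels, memory_time):
    time = memory_time
    for hit_time, miss_rate in reversed(levels):
        time = hit_time + miss_rate * time
    return time


l1 = (1, 0.05)      # 1-cycle hit, 5% miss rate
l2 = (10, 0.20)     # 10-cycle hit, 20% of L1 misses also miss in L2
memory = 100        # cycles to main memory

print(round(amat([l1], memory), 3))        # 6.0  (L1 only)
print(round(amat([l1, l2], memory), 3))    # 2.5  (L1 + L2)
```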

Virtual memory

- Each program uses virtual addresses.
  - The entire virtual address space is stored on a hard disk.
  - A subset of the virtual address data is in DRAM.
  - The CPU translates virtual addresses into physical addresses.
  - Data not in DRAM is fetched from the hard disk.
- Each program has its own virtual-to-physical mapping.
  - Two programs can use the same virtual address for different data.
  - Programs don't need to be aware that others are running.
  - One program (or virus) can't corrupt the memory used by another.
  - This is called memory protection.

Virtual to physical address translation

- Each application has its own page table, the address of which is stored in the page table register.
- A page can be in physical memory or on disk. If a page on disk is accessed, an exception is raised and an OS handler transfers the page to physical memory and updates the page table.
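
A sketch of this translation path, assuming 4 KB pages (12 offset bits) and a single-level page table modeled as a Python dict. The `PageFault` exception, the `access` wrapper that plays the role of the OS handler, and the `free_frames` list are all hypothetical simplifications.

```python
PAGE_BITS = 12    # assumed 4 KB pages


class PageFault(Exception):
    """Raised when the page is not resident in physical memory."""


def translate(vaddr, page_table):
    vpn = vaddr >> PAGE_BITS                     # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)      # offset is not translated
    entry = page_table.get(vpn)
    if entry is None or not entry["valid"]:
        raise PageFault(vpn)
    return (entry["ppn"] << PAGE_BITS) | offset


def access(vaddr, page_table, free_frames):
    """Translate, letting a toy 'OS handler' bring the page in on a fault."""
    try:
        return translate(vaddr, page_table)
    except PageFault as fault:
        vpn = fault.args[0]
        # Stand-in for copying the page from disk into a free physical frame.
        page_table[vpn] = {"valid": True, "ppn": free_frames.pop()}
        return translate(vaddr, page_table)      # retry after the update


page_table = {
    0x00002: {"valid": True, "ppn": 0x7FFF},
    0x00005: {"valid": False, "ppn": None},      # page currently on disk
}
print(hex(translate(0x0000247C, page_table)))            # 0x7fff47c
print(hex(access(0x00005010, page_table, [0x0001])))     # 0x1010
```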

Translation lookaside buffer (TLB)

TLB: small cache (access time 1 cycle) of the most recent translations.
- Small: accessed in < 1 cycle
- Typically 16-512 entries
- Fully associative
- > 99% hit rates typical
- Reduces the number of memory accesses for most loads and stores from 2 to 1

[Figure: 2-entry TLB example. The virtual address is a 19-bit virtual page number (0x00002) plus a 12-bit page offset (0x47C). Entry 1: V=1, virtual page number 0x7FFFD, physical page number 0x0000. Entry 0: V=1, virtual page number 0x00002, physical page number 0x7FFF. Entry 0 hits, giving the physical address: 15-bit physical page number 0x7FFF plus page offset 0x47C.]
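
The figure's example can be reproduced with a small software model. The `TLB` class is hypothetical; the page table is simplified to a dict from virtual page number to physical page number, and on a miss the oldest entry is evicted (FIFO), which is an arbitrary choice for this sketch and not something the slide specifies.

```python
PAGE_BITS = 12    # 12-bit page offset, as in the figure


class TLB:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = {}    # vpn -> ppn; dicts preserve insertion order

    def lookup(self, vaddr, page_table):
        """Return (hit, physical_address); refill from page_table on a miss."""
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        hit = vpn in self.entries
        if not hit:
            if len(self.entries) >= self.capacity:
                oldest = next(iter(self.entries))    # FIFO eviction (assumed)
                del self.entries[oldest]
            self.entries[vpn] = page_table[vpn]      # extra memory access
        return hit, (self.entries[vpn] << PAGE_BITS) | offset


tlb = TLB()
tlb.entries = {0x7FFFD: 0x0000, 0x00002: 0x7FFF}     # contents from the figure

hit, paddr = tlb.lookup(0x0000247C, page_table={})
print(hit, hex(paddr))       # True 0x7fff47c

hit2, paddr2 = tlb.lookup(0x00003010, page_table={0x00003: 0x0042})
print(hit2, hex(paddr2))     # False 0x42010
```

On a hit the page table in memory is never consulted, which is where the reduction from two memory accesses to one comes from.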

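Virtual memory + cache chaining

[Figure: the virtual address is a 20-bit virtual page number plus a 12-bit page offset (4 KB page size). The TLB translates the page number to a 20-bit physical page number; the resulting physical address is split into an 18-bit tag, a 12-bit index, and a 2-bit byte offset for a 16 KB direct-mapped cache with 4 bytes/block (4096 blocks), which returns hit and data.]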
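
The field widths in this example follow from the cache parameters. A short sketch of the arithmetic, assuming power-of-two sizes and a 32-bit physical address; the function name `cache_fields` is hypothetical.

```python
# Derive (tag, index, offset) bit widths from cache parameters.
# Assumes every size is a power of two.

def cache_fields(cache_bytes, block_bytes, paddr_bits=32, ways=1):
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = (block_bytes - 1).bit_length()    # log2(block size)
    index_bits = (num_sets - 1).bit_length()        # log2(number of sets)
    tag_bits = paddr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# 16 KB direct-mapped cache, 4 bytes/block: 4096 blocks.
print(cache_fields(16 * 1024, 4))          # (18, 12, 2)
# 8-block direct-mapped cache with 4-byte blocks.
print(cache_fields(32, 4))                 # (27, 3, 2)
# Same 32 bytes organized as 2-way set associative: 4 sets.
print(cache_fields(32, 4, ways=2))         # (28, 2, 2)
```

The last two calls match the 27/3/2 and 28/2/2 address splits used in the direct-mapped and 2-way cache examples earlier in these slides.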

Main memory: DRAM

DRAM is usually a shared resource among multiple processors, GPUs, and I/O devices → a controller (the Northbridge in x86 systems) is needed to coordinate access.

DRAM organization

- 1T bit cells → compact and few steps to fabricate → enables cheap, large memory.
- Reads are destructive; content must be restored after reading.
- Capacitors are leaky, so they must be periodically refreshed → contributes to the slow access of DRAMs.
- Board buses from the processor to the DRAM are slow.

Summary of background

- ISA design: instruction types, memory access modes, encoding choices.
- Pipelining: provides speedup; complications: structural, data, and control hazards; solutions: hazard detection with forwarding and/or stalling.
- Cache memory: size; designs (direct mapped, fully associative, N-way).
- Virtual memory: advantages; translation from virtual to physical.
- DRAM: slow → latency must be hidden by the cache hierarchy.