Cache Organizations for Multi-cores

Similar documents
Lecture 14: Recap. Today s topics: Recap for mid-term No electronic devices in midterm

Lecture 5: Procedure Calls

Lecture 4: MIPS Instruction Set

Chapter 2. Computer Abstractions and Technology. Lesson 4: MIPS (cont )

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Do-While Example. In C++ In assembly language. do { z--; while (a == b); z = b; loop: addi $s2, $s2, -1 beq $s0, $s1, loop or $s2, $s1, $zero

Programming at different levels

Lecture: Out-of-order Processors

ECE331: Hardware Organization and Design

Branch Addressing. Jump Addressing. Target Addressing Example. The University of Adelaide, School of Computer Science 28 September 2015

Lecture 5: Procedure Calls

Chapter 2. Instructions: Language of the Computer. Adapted by Paulo Lopes

Chapter 2A Instructions: Language of the Computer

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Control Instructions

CENG3420 Lecture 03 Review

ECE 30 Introduction to Computer Engineering

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Control Instructions. Computer Organization Architectures for Embedded Computing. Thursday, 26 September Summary

Lecture 8: Addition, Multiplication & Division

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA MIPS ISA. In a CPU. (vonneumann) Processor Organization

Introduction to the MIPS. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

MIPS Assembly Programming

MIPS R-format Instructions. Representing Instructions. Hexadecimal. R-format Example. MIPS I-format Example. MIPS I-format Instructions

ECE331: Hardware Organization and Design

COMPSCI 313 S Computer Organization. 7 MIPS Instruction Set

Instructions: Assembly Language

Rui Wang, Assistant professor Dept. of Information and Communication Tongji University.

Computer Science and Engineering 331. Midterm Examination #1. Fall Name: Solutions S.S.#:

CS61C Machine Structures. Lecture 12 - MIPS Procedures II & Logical Ops. 2/13/2006 John Wawrzynek. www-inst.eecs.berkeley.

NUMBER OPERATIONS. Mahdi Nazm Bojnordi. CS/ECE 3810: Computer Organization. Assistant Professor School of Computing University of Utah

COMP 303 Computer Architecture Lecture 3. Comp 303 Computer Architecture

McGill University Faculty of Engineering FINAL EXAMINATION Fall 2007 (DEC 2007)

Instructions: MIPS ISA. Chapter 2 Instructions: Language of the Computer 1

Thomas Polzer Institut für Technische Informatik

CSCE 212: FINAL EXAM Spring 2009

Computer Architecture V Fall Practice Exam Questions

Computer Organization MIPS Architecture. Department of Computer Science Missouri University of Science & Technology

Computer Architecture I Midterm I

Lecture 10: Floating Point, Digital Design

Computer Architecture CS372 Exam 3

ECE331: Hardware Organization and Design

Chapter 2. Instructions:

CSE Lecture In Class Example Handout

CS 341l Fall 2008 Test #2

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

Number Systems and Computer Arithmetic

Format. 10 multiple choice 8 points each. 1 short answer 20 points. Same basic principals as the midterm

s complement 1-bit Booth s 2-bit Booth s

COMPUTER ORGANIZATION AND DESIGN

Architecture II. Computer Systems Laboratory Sungkyunkwan University

Instructor: Randy H. Katz hap://inst.eecs.berkeley.edu/~cs61c/fa13. Fall Lecture #7. Warehouse Scale Computer

CS 61C: Great Ideas in Computer Architecture More MIPS, MIPS Functions

CS 110 Computer Architecture Lecture 6: More MIPS, MIPS Functions

ECE260: Fundamentals of Computer Engineering

ENGN1640: Design of Computing Systems Topic 03: Instruction Set Architecture Design

Computer Organization MIPS ISA

Final Exam Fall 2007

CS / ECE 6810 Midterm Exam - Oct 21st 2008

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

ECE 331 Hardware Organization and Design. Professor Jay Taneja UMass ECE - Discussion 3 2/8/2018

OPEN BOOK, OPEN NOTES. NO COMPUTERS, OR SOLVING PROBLEMS DIRECTLY USING CALCULATORS.

The University of Alabama in Huntsville Electrical & Computer Engineering Department CPE Test II November 14, 2000

Levels of Programming. Registers

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

CS 351 Exam 2 Mon. 11/2/2015

Chapter 2. Instructions: Language of the Computer

CS 61C: Great Ideas in Computer Architecture. MIPS Instruction Formats

CS/COE1541: Introduction to Computer Architecture

CSE 141 Computer Architecture Summer Session Lecture 3 ALU Part 2 Single Cycle CPU Part 1. Pramod V. Argade

2B 52 AB CA 3E A1 +29 A B C. CS120 Fall 2018 Final Prep and super secret quiz 9

Lecture 2. Instructions: Language of the Computer (Chapter 2 of the textbook)

CS146 Computer Architecture. Fall Midterm Exam

ELEC / Computer Architecture and Design Fall 2013 Instruction Set Architecture (Chapter 2)

Lecture 7: Examples, MARS, Arithmetic

Computer Organisation CS303

Computer System Architecture Midterm Examination Spring 2002

Course Administration

Compiling Techniques

CSCI 402: Computer Architectures. Instructions: Language of the Computer (3) Fengguang Song Department of Computer & Information Science IUPUI.

CS 61C: Great Ideas in Computer Architecture More RISC-V Instructions and How to Implement Functions

CS433 Midterm. Prof Josep Torrellas. October 16, Time: 1 hour + 15 minutes

CS252 Graduate Computer Architecture Midterm 1 Solutions

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

231 Spring Final Exam Name:

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

ECE232: Hardware Organization and Design. Computer Organization - Previously covered

Lectures 5. Announcements: Today: Oops in Strings/pointers (example from last time) Functions in MIPS

Computer Architecture. The Language of the Machine

UNIT I BASIC STRUCTURE OF COMPUTERS Part A( 2Marks) 1. What is meant by the stored program concept? 2. What are the basic functional units of a

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

ECE 15B Computer Organization Spring 2010

MIPS Functions and Instruction Formats

Prof. Kavita Bala and Prof. Hakim Weatherspoon CS 3410, Spring 2014 Computer Science Cornell University. See P&H 2.8 and 2.12, and A.

Lecture 24: Virtual Memory, Multiprocessors

Math 230 Assembly Programming (AKA Computer Organization) Spring 2008

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

ECE Sample Final Examination

Transcription:

Lecture 26: Recap Announcements: Assgn 9 (and earlier assignments) will be ready for pick-up from the CS front office later this week Office hours: all day next Tuesday Final exam: Wednesday 13 th, 7:50-10am, EMCB 101 Same rules as mid-term, except no laptops (open book, open notes/slides/assignments) (print pages from the textbook CD if necessary) 20% pre-midterm, 80% post-midterm Advanced course in Spring: CS 7820 Parallel Computer Architecture more on multi-cores, multi-thread programming, cache coherence and synchronization, interconnection networks 1

Cache Organizations for Multi-cores L1 caches are always private to a core L2 caches can be private or shared which is better? P1 P2 P3 P4 P1 P2 P3 P4 L1 L1 L1 L1 L1 L1 L1 L1 L2 L2 L2 L2 L2 2

Cache Organizations for Multi-cores L1 caches are always private to a core L2 caches can be private or shared Advantages of a shared L2 cache: efficient dynamic allocation of space to each core data shared by multiple cores is not replicated every block has a fixed home hence, easy to find the latest copy Advantages of a private L2 cache: quick access to private L2 good for small working sets private bus to private L2 less contention 3

View from 5,000 Feet 4

5-Stage Pipeline and Bypassing Must worry about data, control, and structural hazards Some data hazard stalls can be eliminated: bypassing 5

Example lw $1, 8($2) lw $4, 8($1) 6

Example lw $1, 8($2) sw $1, 8($3) 7

Branch Delay Slots 8

Pipeline with Branch Predictor PC IF (br) Branch Predictor Reg Read Compare Br-target 9

Bimodal Predictor 14 bits Branch PC Table of 16K entries of 2-bit saturating counters 10

An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3 R1+R2 R1 R3+R2 Instr Fetch Queue Decode & Rename Instr 1 Instr 2 Instr 3 Instr 4 Instr 5 Instr 6 T1 T2 T3 T4 T5 T6 T1 R1+R2 T2 T1+R3 BEQZ T2 T4 T1+T2 T5 T4+T2 Register File R1-R32 ALU ALU ALU Results written to ROB and tags broadcast to IQ Issue Queue (IQ) 11

Cache Organization Byte address 10100000 How many offset/index/tag bits if the cache has 64 sets, each set has 64 bytes, 4 ways Tag Way-1 Way-2 Tag array Compare Data array 12

Virtual Memory The virtual and physical memory are broken up into pages 8KB page size Virtual address virtual page number 13 page offset Translated to physical page number Physical address 13

TLB Since the number of pages is very high, the page table capacity is too large to fit on chip A translation lookaside buffer (TLB) caches the virtual to physical page number translation for recent accesses A TLB miss requires us to access the page table, which may not even be found in the cache two expensive memory look-ups to access one word of data! A large page size can increase the coverage of the TLB and reduce the capacity of the page table, but also increases memory wastage 14

Cache and TLB Pipeline Virtual page number TLB Physical page number Virtual address Virtual index Tag array Offset Data array Physical tag comparion Physical tag Virtually Indexed; Physically Tagged Cache 15

I/O Hierarchy CPU Cache Memory Bus Disk Memory I/O Controller I/O Bus Network USB DVD 16

RAID 3 Data is bit-interleaved across several disks and a separate disk maintains parity information for a set of bits For example: with 8 disks, bit 0 is in disk-0, bit 1 is in disk-1,, bit 7 is in disk-7; disk-8 maintains parity for all 8 bits For any read, 8 disks must be accessed (as we usually read more than a byte at a time) and for any write, 9 disks must be accessed as parity has to be re-calculated High throughput for a single request, low cost for redundancy (overhead: 12.5%), low task-level parallelism 17

RAID 4 and RAID 5 Data is block interleaved this allows us to get all our data from a single disk on a read in case of a disk error, read all 9 disks Block interleaving reduces thruput for a single request (as only a single disk drive is servicing the request), but improves task-level parallelism as other disk drives are free to service other requests On a write, we access the disk that stores the data and the parity disk parity information can be updated simply by checking if the new data differs from the old data 18

RAID 5 If we have a single disk for parity, multiple writes can not happen in parallel (as all writes must update parity info) RAID 5 distributes the parity block to allow simultaneous writes 19

Example P1 reads X: not found in cache-1, request sent on bus, memory responds, X is placed in cache-1 in shared state P2 reads X: not found in cache-2, request sent on bus, everyone snoops this request, cache-1does nothing because this is just a read request, memory responds, X is placed in cache-2 in shared state P1 P2 Cache-1 Cache-2 Main Memory P1 writes X: cache-1 has data in shared state (shared only provides read perms), request sent on bus, cache-2 snoops and then invalidates its copy of X, cache-1 moves its state to modified P2 reads X: cache-2 has data in invalid state, request sent on bus, cache-1 snoops and realizes it has the only valid copy, so it downgrades itself to shared state and responds with data, X is placed in cache-2 in shared state 20

Directory-Based Example Memory Directory Processor & Caches I/O Memory Directory X Processor & Caches I/O Memory Directory Y Processor & Caches I/O A: Rd X B: Rd X C: Rd X A: Wr X A: Wr X C: Wr X B: Rd X A: Rd X A: Rd Y B: Wr X B: Rd Y B: Wr X B: Wr Y Interconnection network 21

Basic MIPS Instructions lw $t1, 16($t2) add $t3, $t1, $t2 addi $t3, $t3, 16 sw $t3, 16($t2) beq $t1, $t2, 16 blt is implemented as slt and bne j 64 jr $t1 sll $t1, $t1, 2 Convert to assembly: while (save[i] == k) i += 1; i and k are in $s3 and $s5 and base of array save[] is in $s6 Loop: sll $t1, $s3, 2 add $t1, $t1, $s6 lw $t0, 0($t1) bne $t0, $s5, Exit addi $s3, $s3, 1 j Loop 22 Exit:

Registers The 32 MIPS registers are partitioned as follows: Register 0 : $zero always stores the constant 0 Regs 2-3 : $v0, $v1 return values of a procedure Regs 4-7 : $a0-$a3 input arguments to a procedure Regs 8-15 : $t0-$t7 temporaries Regs 16-23: $s0-$s7 variables Regs 24-25: $t8-$t9 more temporaries Reg 28 : $gp global pointer Reg 29 : $sp stack pointer Reg 30 : $fp frame pointer Reg 31 : $ra return address 23

Memory Organization Stack Proc A s values High address Dynamic data (heap) Static data (globals) Text (instructions) Proc B s values $gp Proc C s values Stack grows this way $fp $sp Low address 24

Procedure Calls/Returns proca { int j; j = ; call procb(j); = j; } proca: $s0 = # value of j $t0 = # some tempval $a0 = $s0 # the argument jal procb = $v0 procb (int j) { int k; = j; k = ; return k; } procb: $t0 = # some tempval = $a0 # using the argument $s0 = # value of k $v0 = $s0; jr $ra 25

Saves and Restores Caller saves: $ra, $a0, $t0, $fp Callee saves: $s0 As every element is saved on stack, the stack pointer is decremented If the callee s values cannot remain in registers, they will also be spilled into the stack (don t have to create space for them at the start of the proc) proca: $s0 = # value of j $t0 = # some tempval $a0 = $s0 # the argument jal procb = $v0 procb: $t0 = # some tempval = $a0 # using the argument $s0 = # value of k $v0 = $s0; jr $ra 26

Recap Numeric Representations Decimal 35 10 = 3 x 10 1 + 5 x 10 0 Binary 00100011 2 = 1 x 2 5 + 1 x 2 1 + 1 x 2 0 Hexadecimal (compact representation) 0x 23 or 23 hex = 2 x 16 1 + 3 x 16 0 0-15 (decimal) 0-9, a-f (hex) Dec Binary Hex 0 0000 00 1 0001 01 2 0010 02 3 0011 03 Dec Binary Hex 4 0100 04 5 0101 05 6 0110 06 7 0111 07 Dec Binary Hex 8 1000 08 9 1001 09 10 1010 0a 11 1011 0b Dec Binary Hex 12 1100 0c 13 1101 0d 14 1110 0e 15 1111 0f 27

2 s Complement 0000 0000 0000 0000 0000 0000 0000 0000 two = 0 ten 0000 0000 0000 0000 0000 0000 0000 0001 two = 1 ten 0111 1111 1111 1111 1111 1111 1111 1111 two = 2 31-1 1000 0000 0000 0000 0000 0000 0000 0000 two = -2 31 1000 0000 0000 0000 0000 0000 0000 0001 two = -(2 31 1) 1000 0000 0000 0000 0000 0000 0000 0010 two = -(2 31 2) 1111 1111 1111 1111 1111 1111 1111 1110 two = -2 1111 1111 1111 1111 1111 1111 1111 1111 two = -1 Note that the sum of a number x and its inverted representation x always equals a string of 1s (-1). x + x = -1 x + 1 = -x hence, can compute the negative of a number by -x = x + 1 inverting all bits and adding 1 This format can directly undergo addition without any conversions! Each number represents the quantity x 31-2 31 + x 30 2 30 + x 29 2 29 + + x 1 2 1 + x 0 2 0 28

Multiplication Example Multiplicand Multiplier Product 1000 ten x 1001 ten --------------- 1000 0000 0000 1000 ---------------- 1001000 ten In every step multiplicand is shifted next bit of multiplier is examined (also a shifting step) if this bit is 1, shifted multiplicand is added to the product 29

Division 1001 ten Quotient Divisor 1000 ten 1001010 ten Dividend -1000 10 101 1010-1000 10 ten Remainder At every step, shift divisor right and compare it with current dividend if divisor is larger, shift 0 as the next bit of the quotient if divisor is smaller, subtract to get new dividend and shift 1 as the next bit of the quotient 30

Division 1001 ten Quotient Divisor 1000 ten 1001010 ten Dividend 0001001010 0001001010 0000001010 0000001010 100000000000 0001000000 0000100000 0000001000 Quo: 0 000001 0000010 000001001 At every step, shift divisor right and compare it with current dividend if divisor is larger, shift 0 as the next bit of the quotient if divisor is smaller, subtract to get new dividend and shift 1 as the next bit of the quotient 31

Binary FP Numbers 20.45 decimal =? Binary 20 decimal = 10100 binary 0.45 x 2 = 0.9 (not greater than 1, first bit after binary point is 0) 0.90 x 2 = 1.8 (greater than 1, second bit is 1, subtract 1 from 1.8) 0.80 x 2 = 1.6 (greater than 1, third bit is 1, subtract 1 from 1.6) 0.60 x 2 = 1.2 (greater than 1, fourth bit is 1, subtract 1 from 1.2) 0.20 x 2 = 0.4 (less than 1, fifth bit is 0) 0.40 x 2 = 0.8 (less than 1, sixth bit is 0) 0.80 x 2 = 1.6 (greater than 1, seventh bit is 1, subtract 1 from 1.6) and the pattern repeats 10100.011100110011001100 Normalized form = 1.0100011100110011 x 2 4 32

IEEE 754 Format Final representation: (-1) S x (1 + Fraction) x 2 (Exponent Bias) Represent -0.75 ten in single and double-precision formats Single: (1 + 8 + 23) 1 0111 1110 1000 000 Double: (1 + 11 + 52) 1 0111 1111 110 1000 000 What decimal number is represented by the following single-precision number? 1 1000 0001 01000 0000-5.0 33

FP Addition Consider the following decimal example (can maintain only 4 decimal digits and 2 exponent digits) 9.999 x 10 1 + 1.610 x 10-1 Convert to the larger exponent: 9.999 x 10 1 + 0.016 x 10 1 Add 10.015 x 10 1 Normalize 1.0015 x 10 2 Check for overflow/underflow Round 1.002 x 10 2 Re-normalize 34

Performance Measures Performance = 1 / execution time Speedup = ratio of performance Performance improvement = speedup -1 Execution time = clock cycle time x CPI x number of instrs Program takes 100 seconds on ProcA and 150 seconds on ProcB Speedup of A over B = 150/100 = 1.5 Performance improvement of A over B = 1.5 1 = 0.5 = 50% Speedup of B over A = 100/150 = 0.66 (speedup less than 1 means performance went down) Performance improvement of B over A = 0.66 1 = -0.33 = -33% or Performance degradation of B, relative to A = 33% If multiple programs are executed, the execution times are combined into a single number using AM, weighted AM, or GM 35

Boolean Algebra A + B = A. B A. B = A + B A B C E 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 1 0 1 1 1 1 0 Any truth table can be expressed as a sum of products (A. B. C) + (A. C. B) + (C. B. A) Can also use product of sums Any equation can be implemented with an array of ANDs, followed by an array of ORs 36

Adder Implementations Ripple-Carry adder each 1-bit adder feeds its carry-out to next stage simple design, but we must wait for the carry to propagate thru all bits Carry-Lookahead adder each bit can be represented by an equation that only involves input bits (a i, b i ) and initial carry-in (c 0 ) -- this is a complex equation, so it s broken into sub-parts For bits a i, b i,, and c i, a carry is generated if a i.b i = 1 and a carry is propagated if a i + b i = 1 C i+1 = g i + p i. C i Similarly, compute these values for a block of 4 bits, then for a block of 16 bits, then for a block of 64 bits.finally, the carry-out for the 64 th bit is represented by an equation such as this: C 4 = G 3 + G 2.P 3 + G 1.P 2.P 3 + G 0.P 1.P 2.P 3 + C 0.P 0.P 1.P 2.P 3 Each of the sub-terms is also a similar expression 37

Title Bullet 38