Pipelining Exercises, Continued

Spot all data dependencies (including ones that do not lead to stalls). Draw arrows from the stages where data is made available, directed to where it is needed. Circle the involved registers in the instructions. Assume no forwarding. One dependency has been drawn for you.

    time ->
    addi $t0, $t1, 100
    lw   $t2, 4($t0)
    add  $t3, $t1, $t2
    sw   $t3, 8($t0)
    lw   $t4, 0($t6)
    or   $t5, $t0, $t3

Without forwarding, the register values become available in the write-back phase, and are needed in the decode phase.

Redraw the arrows for the above question assuming that our hardware provides forwarding.

    time ->
    addi $t0, $t1, 100
    lw   $t2, 4($t0)
    add  $t3, $t1, $t2
    sw   $t3, 8($t0)
    lw   $t4, 0($t6)
    or   $t5, $t0, $t3

With forwarding, the register values become available as soon as they are computed/retrieved, and are needed as late as possible in the computation. Notice that arithmetic operations with forwarding do not cause stalls, but load word still does.
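The stall counts implied by these two answers can be checked with a small timing model. The following Python sketch (not part of the original worksheet; the cycle offsets are assumptions for a classic 5-stage pipeline with a write-then-read register file) tracks, per register, the earliest cycle a consumer may reach its execute stage:

```python
def count_stalls(instrs, forwarding=True):
    """instrs: list of (op, dest_reg_or_None, source_regs).
    Returns total stall cycles for an in-order 5-stage pipeline."""
    avail = {}   # reg -> earliest EX cycle a consumer may occupy
    ex = 0       # EX cycle of the next instruction if no stall
    total = 0
    for op, dest, srcs in instrs:
        earliest = max([avail.get(r, 0) for r in srcs], default=0)
        stall = max(0, earliest - ex)
        total += stall
        ex += stall
        if dest:
            if forwarding:
                # ALU results forward EX->EX (1 cycle); loads forward
                # MEM->EX (2 cycles), the load-use hazard.
                avail[dest] = ex + (2 if op == "lw" else 1)
            else:
                # Value is only readable via the register file after WB.
                avail[dest] = ex + 3
        ex += 1
    return total

# The worksheet's instruction sequence.
seq = [
    ("addi", "$t0", ["$t1"]),
    ("lw",   "$t2", ["$t0"]),
    ("add",  "$t3", ["$t1", "$t2"]),
    ("sw",   None,  ["$t3", "$t0"]),
    ("lw",   "$t4", ["$t6"]),
    ("or",   "$t5", ["$t0", "$t3"]),
]
```

With forwarding, only the load-use dependence (`lw $t2` into `add`) stalls, for 1 cycle; without forwarding, each of the three back-to-back dependences costs 2 cycles.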

Instruction Scheduling

Suppose we have an array of structs of this form:

    struct point { int x; int y; };

We wish to square each member of point and add them to another array:

    sum[i] = p[i].x*p[i].x + p[i].y*p[i].y;

Suppose the number of points in p is in $a0, the base of p is in $a1, and the base of sum is in $a2. Then we can perform the operation with this MIPS code:

    compiledata: beq  $a0, $0, exit
                 lw   $t0, 0($a1)
                 lw   $t1, 4($a1)
                 mul  $t0, $t0, $t0
                 mul  $t1, $t1, $t1
                 add  $t3, $t0, $t1
                 sw   $t3, 0($a2)
                 addi $a1, $a1, 8
                 addi $a2, $a2, 4
                 addi $a0, $a0, -1
                 j    compiledata
    exit:

Exercises: Assume that you have a dual-issue machine wherein one ALU/branch operation can be scheduled in parallel with a load/store operation. Can you schedule the instructions in the above loop to improve performance?

There are 3 load/stores and 8 other instructions. Even with forwarding, loads cannot provide the data from memory in time for the next instruction. Therefore the data dependencies occur in the following places:

- mul $t0, $t0, $t0 must happen at least 2 cycles after lw $t0, 0($a1)
- mul $t1, $t1, $t1 must happen at least 2 cycles after lw $t1, 4($a1)

In addition, we must also preserve the following orderings:

- add $t3, $t0, $t1 must happen after the two multiplies
- sw $t3 must happen after the add and before addi $a2, $a2, 4
- addi $a1, $a1, 8 must happen after the loads

Below is one potential fastest ordering:

        ALU/branch               Load/store
    1   beq  $a0, $0, exit       lw   $t0, 0($a1)
    2   addi $a1, $a1, 8         lw   $t1, 4($a1)    (this avoids load-use data hazards)
    3   mul  $t0, $t0, $t0
    4   mul  $t1, $t1, $t1
    5   add  $t3, $t0, $t1
    6   addi $a0, $a0, -1        sw   $t3, 0($a2)
    7   addi $a2, $a2, 4
    8   j    compiledata
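The constraints above can be verified mechanically. This Python sketch (an illustration, not part of the worksheet; the packet encoding and 2-cycle load-use gap are assumptions matching the text) checks that every consumer issues late enough after its producer, and that writes in a packet take effect after that packet's reads, which is exactly why an addi and a lw of the same register may share a packet:

```python
def check_schedule(packets, load_use=2):
    """packets: list of (alu_slot, mem_slot) per cycle; each slot is
    None or (op, dest_reg_or_None, source_regs).
    Returns True iff no dependence is violated."""
    written = {}  # reg -> (cycle it was written, producing op)
    for cycle, (alu, mem) in enumerate(packets, start=1):
        # Reads happen first: same-cycle writers are not yet visible.
        for ins in (alu, mem):
            if ins is None:
                continue
            op, dest, srcs = ins
            for r in srcs:
                if r in written:
                    wc, wop = written[r]
                    gap = load_use if wop == "lw" else 1
                    if cycle - wc < gap:
                        return False
        # Writes take effect at the end of the cycle.
        for ins in (alu, mem):
            if ins is None:
                continue
            op, dest, srcs = ins
            if dest:
                written[dest] = (cycle, op)
    return True

# The 8-packet schedule from the answer above.
schedule = [
    (("beq",  None,  ["$a0"]),         ("lw", "$t0", ["$a1"])),
    (("addi", "$a1", ["$a1"]),         ("lw", "$t1", ["$a1"])),
    (("mul",  "$t0", ["$t0"]),         None),
    (("mul",  "$t1", ["$t1"]),         None),
    (("add",  "$t3", ["$t0", "$t1"]),  None),
    (("addi", "$a0", ["$a0"]),         ("sw", None, ["$t3", "$a2"])),
    (("addi", "$a2", ["$a2"]),         None),
    (("j",    None,  []),              None),
]

# A schedule with a load-use violation: mul consumes $t0 one cycle
# after the lw that produces it.
bad = [
    (None,                       ("lw", "$t0", ["$a1"])),
    (("mul", "$t0", ["$t0"]),    None),
]
```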

Here is another potential ordering:

        ALU/branch               Load/store
    1   beq  $a0, $0, exit
    2   addi $a2, $a2, 4         lw   $t0, 0($a1)    (this avoids load-use data hazards)
    3   addi $a1, $a1, 8         lw   $t1, 4($a1)    (this avoids load-use data hazards)
    4   mul  $t0, $t0, $t0
    5   mul  $t1, $t1, $t1
    6   add  $t3, $t0, $t1
    7   addi $a0, $a0, -1
    8   j    compiledata         sw   $t3, -4($a2)

It is okay to be executing addi $a1, $a1, 8 and lw $t1, 4($a1) at the same time because the register write of addi is done later in the pipeline, so they will both read the correct value of $a1.

Unroll the loop by a factor of 2, apply register renaming, and schedule again (you may assume $a0 is even). How much improvement can be obtained?

When we unroll the loop, we only need to double the instructions that do the real work (i.e. not the incrementing or the loop comparison instructions). This means that we now have 6 load/stores and 11 other instructions. By using register renaming, we now have many new instructions that don't have data dependencies with each other, and we can more easily schedule full issue packets. One potential ordering is below:

        ALU/branch               Load/store
    1   beq  $a0, $0, exit       lw   $t0, 0($a1)
    2   addi $a1, $a1, 16        lw   $t1, 4($a1)
    3   mul  $t0, $t0, $t0       lw   $t4, -8($a1)
    4   mul  $t1, $t1, $t1       lw   $t5, -4($a1)
    5   addi $a0, $a0, -2
    6   mul  $t4, $t4, $t4
    7   mul  $t5, $t5, $t5
    8   add  $t6, $t4, $t5
    9   add  $t2, $t0, $t1       sw   $t6, 4($a2)
    10  addi $a2, $a2, 8         sw   $t2, 0($a2)
    11  j    compiledata

The original code took 11 single-issue packets per loop (22 per 2 loops). With a dual-issue machine and no loop unrolling, the code would take 8 packets per loop (16 per 2 loops). With loop unrolling, the code takes 11 packets per 2 loops!
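The improvement claims in the last paragraph are just packet arithmetic; a quick Python check (illustrative only, using the packet counts stated above):

```python
# Packet counts taken from the answers above.
single_issue = 11   # packets per loop iteration, single issue
dual_issue = 8      # packets per iteration, dual issue, no unrolling
unrolled = 11       # packets per TWO iterations, dual issue, unrolled x2

# Normalize everything to packets per two iterations.
per_two = {
    "single": 2 * single_issue,   # 22
    "dual": 2 * dual_issue,       # 16
    "unrolled": unrolled,         # 11
}

speedup_vs_single = per_two["single"] / per_two["unrolled"]  # 2.0
speedup_vs_dual = per_two["dual"] / per_two["unrolled"]      # ~1.45
```

So unrolling plus renaming doubles throughput relative to single issue, and gains roughly another 45% over dual issue without unrolling.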

Virtual Memory Overview

Virtual address (VA): what your program uses.

    [ Virtual Page Number | Page Offset ]

Physical address (PA): what actually determines where in memory to go.

    [ Physical Page Number | Page Offset ]

With 4 KiB pages and byte addresses, 2^(page offset bits) = 4096, so page offset bits = 12.

The Big Picture: Logical Flow

Translate the VA to a PA using the TLB and page table. Then use the PA to access memory as the program intended.

Pages

A chunk of memory or disk with a set size. Addresses in the same virtual page get mapped to addresses in the same physical page. The page table determines the mapping.

The Page Table

Index = Virtual Page Number (not stored)

    Index | Page Valid | Page Dirty | Permission Bits (read, write, ...) | Physical Page Number
    0
    1
    2
    ...
    (max virtual page number)

Each stored row of the page table is called a page table entry (the grayed section is the first page table entry). The page table is stored in memory; the OS sets a register telling the hardware the address of the first entry of the page table. The processor updates the page dirty bit in the page table: page dirty bits are used by the OS to know whether updating a page on disk is necessary. Each process gets its own page table.

Protection Fault: the page table entry for a virtual page has permission bits that prohibit the requested operation.

Page Fault: the page table entry for a virtual page has its valid bit set to false. The entry is not in memory.
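The VA split described above is pure bit arithmetic. A minimal Python sketch (not from the worksheet; it assumes the 4 KiB pages used in the example):

```python
PAGE_SIZE = 4096                           # 4 KiB pages, as above
OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # log2(4096) = 12

def split_va(va):
    """Split a virtual address into (VPN, page offset)."""
    return va >> OFFSET_BITS, va & (PAGE_SIZE - 1)

# Example: the top bits index the page table, the low 12 bits
# pass through translation unchanged.
vpn, offset = split_va(0x12345)
```

Here `0x12345` splits into VPN `0x12` and offset `0x345`; translation replaces the VPN with a PPN and keeps the offset.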

The Translation Lookaside Buffer (TLB)

A cache for the page table. Each block is a single page table entry. If an entry is not in the TLB, it's a TLB miss. Assuming fully associative, each TLB entry holds:

    Valid | Tag = Virtual Page Number | Page Table Entry (Page Dirty, Permission Bits, Physical Page Number)

The Big Picture Revisited

Exercises

What are three specific benefits of using virtual memory? [there are many]

- Bridges memory and disk in the memory hierarchy.
- Simulates a full address space for each process.
- Enforces protection between processes.

What should happen to the TLB when a new value is loaded into the page table address register?

The valid bits of the TLB should all be set to 0. The page table entries in the TLB corresponded to the old page table, so none of them are valid once the page table address register points to a different page table.

x86 has an "accessed" bit in each page table entry, which is like the dirty bit but set whenever a page is used (load or store). Why is this helpful when using memory as a cache for disk?

It allows smarter replacements. We naturally want fewer misses (page faults), so if possible, we would want to replace a page table entry that hasn't been used. The accessed bit is one way of giving us enough information to implement this.

Fill this table out!

    Virtual Address Bits | Physical Address Bits | Page Size | VPN Bits | PPN Bits | Bits per row of PT (4 extra bits)
    32                   | 32                    | 16 KB     | 18       | 18       | 22
    32                   | 26                    | 8 KB      | 19       | 13       | 17
    36                   | 32                    | 32 KB     | 21       | 17       | 21
    40                   | 36                    | 32 KB     | 25       | 21       | 25
    64                   | 40                    | 64 KB     | 48       | 24       | 28
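Every row of the table follows the same three formulas: VPN bits = VA bits - offset bits, PPN bits = PA bits - offset bits, and bits per page table row = PPN bits + the 4 extra bits (valid, dirty, permissions). A Python sketch to reproduce the answers (illustrative; `pt_row` is a hypothetical helper, not from the worksheet):

```python
import math

def pt_row(va_bits, pa_bits, page_size_bytes, extra_bits=4):
    """Return (VPN bits, PPN bits, bits per page table row)."""
    offset_bits = int(math.log2(page_size_bytes))
    vpn = va_bits - offset_bits
    ppn = pa_bits - offset_bits
    return vpn, ppn, ppn + extra_bits

# Reproduce the filled-in table.
rows = [
    pt_row(32, 32, 16 * 1024),
    pt_row(32, 26, 8 * 1024),
    pt_row(36, 32, 32 * 1024),
    pt_row(40, 36, 32 * 1024),
    pt_row(64, 40, 64 * 1024),
]
```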