EECS 213 Introduction to Computer Systems (Dinda, Spring)
Homework 3: Memory and Cache


1. Reorder the fields in this structure so that the structure will (a) consume the most space and (b) consume the least space on an IA32 machine running Linux.

   struct foo {
       double a;
       char   b;
       short *c;
       char  *d;
       long   e;
       int    f;
       short  g;
   };

2. Consider a processor that uses 20-bit addresses and can therefore address 2^20 = 1M bytes of memory. Suppose that it has one level of cache. As in Figure 6.27 of your textbook, the address is split into a t-bit tag, an s-bit set index, and a b-bit block offset. The cache consists of 8192 bytes, with a block size of 128 bytes. Answer each of the following for direct-mapped, 4-way set associative, and fully associative versions of the cache.

   a. How many cache lines are there?
   b. What is b?
   c. What is s?
   d. What is t?

3. For the cache in problem 2, draw the cache given that it is structured as follows. You can elide replicated components, but annotate your drawing with how many components there are.

   a. Direct-mapped
   b. 2-way set associative
   c. Fully associative
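For problem 1, one way to sanity-check a candidate ordering is to compile it as 32-bit code and print the size and field offsets. Below is a minimal sketch, assuming a Linux machine where gcc -m32 works; the file name check.c and the program itself are illustrative aids, not part of the assignment.

   /* check.c -- build with: gcc -m32 -o check check.c */
   #include <stdio.h>
   #include <stddef.h>

   /* Paste a candidate field ordering here and rebuild. */
   struct foo {
       double a;
       char   b;
       short *c;
       char  *d;
       long   e;
       int    f;
       short  g;
   };

   int main(void)
   {
       printf("sizeof(struct foo) = %zu\n", sizeof(struct foo));
       printf("offset of b = %zu, offset of g = %zu\n",
              offsetof(struct foo, b), offsetof(struct foo, g));
       return 0;
   }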
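Likewise, for problem 2, the textbook relationships are: number of sets S = C/(E*B), b = log2(B), s = log2(S), and t = m - s - b, where C is the cache size, B the block size, E the associativity, and m the number of address bits. A small sketch (again just a checking aid, not part of the handout) that evaluates these for any E:

   #include <stdio.h>

   /* Integer log base 2; assumes x is a power of two. */
   static unsigned lg(unsigned x)
   {
       unsigned n = 0;
       while (x > 1) { x >>= 1; n++; }
       return n;
   }

   int main(void)
   {
       unsigned m = 20;           /* address bits                     */
       unsigned C = 8192;         /* cache size in bytes              */
       unsigned B = 128;          /* block size in bytes              */
       unsigned E = 1;            /* associativity: try 1, 4, and C/B */
       unsigned S = C / (E * B);  /* number of sets                   */
       printf("lines = %u, sets = %u, b = %u, s = %u, t = %u\n",
              C / B, S, lg(B), lg(S), m - lg(S) - lg(B));
       return 0;
   }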

4. Our company wants to optimize the performance of the following code

   void vector_add(int n, int *a, int *b, int *c)
   {
       int i;
       for (i = 0; i < n; i++) {
           c[i] = a[i] + b[i];
       }
   }

run on the same processor and cache as described in problem 2. The cache is write-back, write-allocate, and has an LRU replacement policy. Integers are 32 bits.

   a. Suppose the cache is direct-mapped. Let n=4096, a=0x0a000, b=0x0c000, c=0x0e000. On average, how many times per loop iteration will you load a cache block from main memory? How many times per loop iteration will you flush a cache block back to main memory?

   b. While we're all fired up to buy ultra-cool mega-associative cache hardware (which comes only in machined aluminum), a smart-alec programmer claims that we can get the same effect by having a=0x0a000, b=0x0c080, and c=0x0e100. Is he right? Why or why not?

5. You should now be coming to realize just how important leveraging locality of access is for performance. Modern processors take advantage of this locality in hardware using many mechanisms, but one important one is called prefetching. The prefetch unit of a CPU will monitor memory accesses and try to predict accesses that might follow soon after. That way, when the CPU issues a load or store to the next address, the cache will have already been populated with it, and the CPU will not need to wait for an expensive access to main memory to complete. However, hardware prefetch units have limited foresight, and sometimes it may be better for the programmer to provide some hints as to what memory will be accessed in the near future. Consider the following piece of code that simply sums up a large 2D array.

   #define _prefetch(x) __builtin_prefetch((x), 0, 3)

   #define ROWS (1024*1024*16)
   #define COLS 16

   int array2d_sum()
   {
       int sum = 0;
       int i, j;
       int *a = malloc(sizeof(int) * ROWS * COLS);

       /* skip array initialization code */

       for (i = 0; i < ROWS; i++) {
           for (j = 0; j < COLS; j++) {
               sum += *(a + i*COLS + j);
           }
           _prefetch(a + (i+1)*COLS);
       }
       return sum;
   }

This code loops through a 2D array using pointer arithmetic and sums the elements. The interesting thing we've added here is the call to _prefetch. GCC provides a built-in function named __builtin_prefetch that we've wrapped with a macro, because for our purposes we only care about the first argument. _prefetch takes a memory address and instructs the hardware to preload the cache with one line starting at that address. For the following questions, assume that our data cache size is 32KB and the block size is 64 bytes.

   a. How big is a? Will it fit in the cache?

   b. Why did we choose to prefetch at the address a+(i+1)*COLS, and why did we place the call in the outer loop? Explain your reasoning. (Hint: it's no coincidence that COLS is 16.)

   c. Running the above code on our class machine with the prefetch produces a speedup of about 56% over the case without the prefetch. Suppose we make this same comparison for various array sizes. When would you expect the prefetch to help us more: with much smaller arrays or with larger arrays? Why?
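An aside on the builtin (not part of the original handout): the two arguments that the _prefetch macro fixes at (0, 3) are GCC's rw and locality hints. Here is a minimal sketch of using __builtin_prefetch directly on a 1D sum, where the lookahead distance of 16 ints (one 64-byte line) is an assumption chosen to match the cache above:

   /* __builtin_prefetch(addr, rw, locality): rw is 0 for an expected
      read and 1 for a write; locality runs from 0 (no temporal reuse)
      to 3 (keep in all cache levels). Prefetching an invalid address
      never faults, so running past the end of the array is harmless. */
   long sum_with_prefetch(const int *a, long n)
   {
       long i, sum = 0;
       for (i = 0; i < n; i++) {
           __builtin_prefetch(&a[i + 16], 0, 3);  /* one line ahead */
           sum += a[i];
       }
       return sum;
   }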

6. Your CPU performs best when it executes straight-line code, i.e., a sequential stream of instructions, one following immediately after the other in the program text. This instruction locality makes it easy for the CPU to prefetch subsequent instructions, just as it does for data. Control-flow instructions like jle and, to a lesser extent, jmp disrupt these sequential streams and force the program counter to change in a less predictable manner, making the instruction prefetcher less effective. One goal of an optimizing compiler is to make these regions of straight-line code (called basic blocks) as large as possible to improve performance. Consider the following function and the assembly it generates.

foo.c:

   int foo(int x, int y)
   {
       if (x > y)
           return (x + y) * 4;
       else
           return 0;
   }

foo.s:

   <foo>:
   8048394:  55                  push   %ebp
   8048395:  89 e5               mov    %esp,%ebp
   8048397:  83 ec 10            sub    $0x10,%esp
   804839a:  8b 45 08            mov    0x8(%ebp),%eax
   804839d:  3b 45 0c            cmp    0xc(%ebp),%eax
   80483a0:  7e 15               jle    80483b7 <foo+0x23>
   80483a2:  8b 45 0c            mov    0xc(%ebp),%eax
   80483a5:  8b 55 08            mov    0x8(%ebp),%edx
   80483a8:  8d 04 02            lea    (%edx,%eax,1),%eax
   80483ab:  89 45 fc            mov    %eax,-0x4(%ebp)
   80483ae:  c1 65 fc 02         shll   $0x2,-0x4(%ebp)
   80483b2:  8b 45 fc            mov    -0x4(%ebp),%eax
   80483b5:  eb 05               jmp    80483bc <foo+0x28>
   80483b7:  b8 00 00 00 00      mov    $0x0,%eax
   80483bc:  c9                  leave
   80483bd:  c3                  ret

Notice how GCC orders the jumps. Consider the case when x is less than or equal to y: the CPU has to break sequential execution and jump to address 80483b7. This will negatively affect the performance of the instruction prefetcher. But what if we know that x will be less than y 99 out of 100 times? That performance hit would be unnecessary. Assume the following logical block layout for this code:

   prelude
   cmp
   (x > y) code    (no branch)
   (x <= y) code   (take branch)
   the rest

   a. CPUs can deal with conditional branches like this by predicting which way they will go using a hardware branch predictor. Assume that we have a dumb branch predictor in our CPU that always predicts that a branch is not taken; in other words, it always predicts that execution will continue at the next instruction. How often will our branch predictor be wrong (i.e., what is its misprediction rate)?

   b. Given our knowledge of x and y, how would you reorder the blocks to maintain straight-line code? Now what is the misprediction rate using the same branch predictor?

   c. With your blocks reordered, what will the cmp and the subsequent conditional jump look like in assembly? (Will it still be a jle?)

   d. GCC provides a built-in branch-hint function named __builtin_expect that will do this reordering for you. It is typically wrapped in a pair of macros so that it can be used like this:

   if (likely(cond)) {
       // do something
   }

   if (unlikely(cond)) {
       // do something
   }

Using likely tells the compiler that it should order the block of the if statement first, as it is likely to be executed more often; unlikely does the reverse. These hints are used to increase performance in real-world projects like the Linux kernel. Again, given your knowledge of x and y, how would you rewrite the C code for foo using these functions?
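For reference (this definition is not in the handout, but it is the conventional one, used for example in the Linux kernel), likely and unlikely are ordinarily defined as thin macros over __builtin_expect:

   /* !!(x) normalizes any truthy value to exactly 1 or 0, so it can be
      compared against the expected value. */
   #define likely(x)    __builtin_expect(!!(x), 1)   /* expect true  */
   #define unlikely(x)  __builtin_expect(!!(x), 0)   /* expect false */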