CS 3330 Exam 2 Fall 2017 Computing ID:

Similar documents
CS 3330 Exam 2 Spring 2017 Name: EXAM KEY Computing ID: KEY

CS 3330 Exam 3 Fall 2017 Computing ID:

CS 3330 Exam 1 Fall 2017 Computing ID:

CS 3330 Exam 2 Spring 2016 Name: EXAM KEY Computing ID: KEY

CS 3330 Final Exam Spring 2016 Computing ID:

CS3330 Fall 2014 Exam 2 Page 1 of 6 ID:

CS3330 Exam 3 Spring 2015

minor HCL update Pipe Hazards (finish) / Caching post-quiz Q1 B post-quiz Q1 D new version of hclrs.tar:

SEQ without stages. valc register file ALU. Stat ZF/SF. instr. length PC+9. R[srcA] R[srcB] srca srcb dstm dste. Data in Data out. Instr. Mem.

CS3330 Spring 2015 Exam 2 Page 1 of 6 ID:

Computer Organization: A Programmer's Perspective

CSC 252: Computer Organization Spring 2018: Lecture 11

Pipelining 4 / Caches 1

Intel x86-64 and Y86-64 Instruction Set Architecture

Recitation #6. Architecture Lab

CS 261 Fall Mike Lam, Professor. CPU architecture

Computer Architecture I: Outline and Instruction Set Architecture. CENG331 - Computer Organization. Murat Manguoglu

Pipelining 3: Hazards/Forwarding/Prediction

Changes made in this version not seen in first lecture:

Recitation #6 Arch Lab (Y86-64 & O3 recap) October 3rd, 2017

CPSC 313, 04w Term 2 Midterm Exam 2 Solutions

THE UNIVERSITY OF BRITISH COLUMBIA CPSC 261: MIDTERM 1 February 14, 2017

CS 3330: SEQ part September 2016

CS3330 Exam 1 Spring Name: CS3330 Spring 2015 Exam 1 Page 1 of 6 ID: KEY

bitwise (finish) / SEQ part 1

1. Truthiness /8. 2. Branch prediction /5. 3. Choices, choices /6. 5. Pipeline diagrams / Multi-cycle datapath performance /11

Topics of this Slideset. CS429: Computer Organization and Architecture. Instruction Set Architecture. Why Y86? Instruction Set Architecture

CS429: Computer Organization and Architecture

CS 230 Practice Final Exam & Actual Take-home Question. Part I: Assembly and Machine Languages (22 pts)

CSE351 Spring 2018, Midterm Exam April 27, 2018

Simple Machine Model. Lectures 14 & 15: Instruction Scheduling. Simple Execution Model. Simple Execution Model

Comprehensive Exams COMPUTER ARCHITECTURE. Spring April 3, 2006

OPEN BOOK, OPEN NOTES. NO COMPUTERS, OR SOLVING PROBLEMS DIRECTLY USING CALCULATORS.

Assembly Language: Function Calls

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

Q1: Finite State Machine (8 points)

cmovxx ra, rb 2 fn ra rb irmovq V, rb 3 0 F rb V rmmovq ra, D(rB) 4 0 ra rb mrmovq D(rB), ra 5 0 ra rb OPq ra, rb 6 fn ra rb jxx Dest 7 fn Dest

Unit 8: Programming Languages CS 101, Fall 2018

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name:

Princeton University Computer Science 217: Introduction to Programming Systems. Assembly Language: Function Calls

Changelog. Performance. locality exercise (1) a transformation

Changelog. Changes made in this version not seen in first lecture: 26 October 2017: slide 28: remove extraneous text from code

EXAM #1. CS 2410 Graduate Computer Architecture. Spring 2016, MW 11:00 AM 12:15 PM

CS356: Discussion #15 Review for Final Exam. Marco Paolieri Illustrations from CS:APP3e textbook

SEQ part 3 / HCLRS 1

x86-64 Programming III & The Stack

CPSC 261 Midterm 1 Tuesday February 9th, 2016

Malloc Lab & Midterm Solutions. Recitation 11: Tuesday: 11/08/2016

last time out-of-order execution and instruction queues the data flow model idea

CS / ECE 6810 Midterm Exam - Oct 21st 2008

1 Tomasulo s Algorithm

Machine-Level Programming (2)

CS429: Computer Organization and Architecture

Computer Systems Architecture

Data Hazard vs. Control Hazard. CS429: Computer Organization and Architecture. How Do We Fix the Pipeline? Possibilities: How Do We Fix the Pipeline?

CPSC 313, Summer 2013 Final Exam Solution Date: June 24, 2013; Instructor: Mike Feeley

This simulated machine consists of four registers that will be represented in your software with four global variables.

Hardware-based Speculation

Instruc(on Set Architecture. Computer Architecture Instruc(on Set Architecture Y86. Y86-64 Processor State. Assembly Programmer s View

CSE 131 Introduction to Computer Science Fall 2016 Exam I. Print clearly the following information:

Changelog. More Performance. loop unrolling (ASM) exam graded

last time Exam Review on the homework on office hour locations hardware description language programming language that compiles to circuits

Foundations of Computer Systems

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

x86 Programming I CSE 351 Winter

Final Exam Fall 2008

ECE550 PRACTICE Final

Structure of Computer Systems

Lecture 3 CIS 341: COMPILERS

The Stack & Procedures

CS2100 Computer Organisation Tutorial #10: Pipelining Answers to Selected Questions

ECE 331 Hardware Organization and Design. UMass ECE Discussion 10 4/5/2018

CS 251, Winter 2019, Assignment % of course mark

CSC 252: Computer Organization Spring 2018: Lecture 6

1 Tomasulo s Algorithm

Optimization part 1 1

Changelog. Changes made in this version not seen in first lecture: 13 Feb 2018: add slide on constants and width

Changes made in this version not seen in first lecture:

THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY Computer Organization (COMP 2611) Spring Semester, 2014 Final Examination

Computer Hardware Engineering

CS 2150 Exam 2, Spring 2018 Page 1 of 6 UVa userid:

CS 2506 Computer Organization II Test 2

ISA Instruction Operation

EECS 470 Midterm Exam Winter 2015

Machine Organization & Assembly Language

CS 251, Winter 2018, Assignment % of course mark

CSE 351 Midterm - Winter 2015 Solutions

Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition. Carnegie Mellon

CS 261 Fall Machine and Assembly Code. Data Movement and Arithmetic. Mike Lam, Professor

CSEE W3827 Fundamentals of Computer Systems Homework Assignment 5 Solutions

CSE 378 Final Exam 3/14/11 Sample Solution

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

Changelog. Virtual Memory (2) exercise: 64-bit system. exercise: 64-bit system

Assembly Programming IV

Machine-Level Programming III: Procedures

University of California, Berkeley College of Engineering

CSE 401/M501 Compilers

Course on Advanced Computer Architectures

Part 7. Stacks. Stack. Stack. Examples of Stacks. Stack Operation: Push. Piles of Data. The Stack

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

Transcription:

S 3330 Fall 2017 Exam 2 Variant page 1 of 8 Email I: S 3330 Exam 2 Fall 2017 Name: omputing I: Letters go in the boxes unless otherwise specified (e.g., for 8 write not 8 ). Write Letters clearly: if we are unsure of what you wrote you will get a zero on that problem. ubble and Pledge the exam or you will lose points. ssume unless otherwise specified: little-endian 64-bit architecture %rsp points to the most recently pushed value, not to the next unused stack address. questions are single-selection unless identified as select-all Variable Weight: point values per question are marked in square brackets. Mark clarifications: If you need to clarify an answer, do so, and also add a * to the top right corner of your answer box..................................................................................................

S 3330 Fall 2017 Exam 2 Variant page 2 of 8 Email I: Information for questions 1 2 For these questions consider a pipelined processor where the execute stage is split into three stages in order to support instructions with complex LU operations. The resulting stages in this processor are: Fetch ecode Execute 1 Execute 2 Execute 3 Memory Writeback The new split execute stages require the values for the LU operation near the start of the execute 1 stage and only produces results near the end of end of the execute 3 stage, regardless of the complexity of the LU operation. The other stages work as in the five-stage processor described in lecture. Question 1 [2 pt]: (see above) To execute the following assembly snippet on this processor with minimum stalling: addq %rax, %rbx mrmovq 4(%rax), %r13 subq %rbx, %r13 the processor needs to stall for some number of cycles in addition to forwarding register values. How many cycles of stalling are required? Write your answer as a base-10 number. Question 2 [2 pt]: without stalling: (see above) To execute the following assembly snippet on this processor addq %rax, %rbx xorq %rcx, %rdx rrmovq %r8, %r9 rmmovq %r11, 4(%r12) subq %rbx, %r13 the processor needs to forward the value of %rbx from addq to subq. Which forwarding path can the processor use to do this? E F G the end of addq s execute 2 stage to the end of subq s decode stage the end of addq s memory stage to the end of subq s execute 1 stage the end of addq s execute 3 stage to the end of subq s execute 1 stage the end of addq s execute 3 stage to the end of subq s decode stage the end of addq s memory stage to the end of the subq s decode stage the end of addq s writeback stage to the end of subq s execute 1 stage none of the above

S 3330 Fall 2017 Exam 2 Variant page 3 of 8 Email I: Information for questions 3 4 onsider a 4 K (2 12 byte) 2-way set-associative write-back, write-allocate cache with 256-byte (2 8 byte) cache blocks, an LRU replacement policy. Question 3 [2 pt]: (see above) How many sets does this cache have? Write your answer as a base-10 number. Question 4 [2 pt]: (see above) When accessing this cache, which of the following addresses will map to the same cache set as 0x12345? Place a in each box that will map to the same set. Leave other boxes blank. 0x23FF 0x01234 0x12000 0x123FF Question 5 [2 pt]: Suppose a program reads every byte of a 2M array starting with the lowest address and proceeding in order to the highest address, then repeats this process 1000 times. On which type of data cache will the program experience the fewest cache misses on average? (ssume memory accesses other than those to the array are negilible and that 1M represents 2 20 bytes.) a 1M 4-way set associative cache with a random replacement policy and 64 cache blocks a 1.75M fully associative cache with an FIFO (first-in, first-out) replacement policy and 64 cache blocks a 0.5M direct-mapped cache with 32 cache blocks a 1M 8-way set associative cache with an LRU (least recently used) replacement policy and 64 cache blocks

S 3330 Fall 2017 Exam 2 Variant page 4 of 8 Email I: Information for questions 6 7 Suppose a program reads a single byte at each of the following addresses, in the following order: 0x06 0x07 0x16 0x00 0x17 0x03 0x07 Question 6 [2 pt]: (see above) onsider a 16 byte direct-mapped cache with a 4 byte block size. ssume the cache is initially empty. How many of these accesses will be hits? Write your answer as a base-10 number. Question 7 [2 pt]: (see above) onsider a 8 byte 2-way set asssociative cache with a 2 byte block size and an LRU replacement policy. ssume the cache is initially empty. How many of these accesses will be hits? Write your answer as a base-10 number. Question 8 [2 pt]: Suppose we have two processor designs, with different numbers of pipeline stages. On the first processor 10% of the instructions in a benchmark program stall for exactly 1 cycle and 5% stall for exactly 2 cycles. The second processor has more pipeline stages, so 40% of the instructions in the benchmark program stall for an additional cycle (compared to the first processor), but it has half the cycle time. If the benchmark program took 10 seconds to run on the first processor, then how long did take on the second processor? E F more than 10 seconds more than 9 and less than or equal to 10 seconds more than 8 and less than or equal to 9 seconds more than 7 and less than or equal to 8 seconds more than 6 and less than or equal to 7 seconds 6 or less seconds

S 3330 Fall 2017 Exam 2 Variant page 5 of 8 Email I: Information for questions 9 10 onsider the following code: unsigned char array[1024 * 1024]; /*... */ int sum = 0; for (int x = 0; x < 10; ++x) { for (int i = 0; i < 16; ++i) { sum += array[i * 128] * array[i * 128 + i]; } } ssume that: only array is kept in memory (all other variables are stored in registers); array is stored at an address that is a multiple of 4096 (2 12 ); that the compiler does not remove or reorder any accesses to the array; unsigned chars are 1 byte; the cache is empty upon entry to the outer for loop above Note that 1024/8 = 128. Question 9 [2 pt]: (see above) If the above nested for loops are run on a system with a 1K (1024 byte) direct-mappped cache with 1-byte cache blocks, how many data cache misses will it experience? 16 24 31 160 E 175 F 240 G 320 H none of the above Question 10 [2 pt]: (see above) If the above nested for loops are run on a system with an 4K (1024 byte) 16-way set associative cache with 64-byte cache blocks and an LRU replacement policy, how many data cache misses will it experience? 31 240 320 16 E 175 F 160 G 24 H none of the above

S 3330 Fall 2017 Exam 2 Variant page 6 of 8 Email I: Information for questions 11 13 Suppose we are running the following assembly snippet in the five-stage pipelined processor with forwarding and branch prediction discussed in the lecture and in the textbook: xorq %rax, %rax jne after_add addq %rax, %rsp after_add: subq %rax, %rsp andq %rsp, %rax rmmovq %rsp, 0(%rax) popq %rax Recall that this processor predicts all branches as taken and computes the actual result of branches by the end of the execute stage of the conditional jump instruction and fetches the corrected next instruction during the conditional jump s memory stage. The conditional jump is not taken. Question 11 [3 pt]: (see above) In order to execute this assembly snippet correctly, which of the following forwarding operations will the processor need to perform? Place a next to each correct answer. Leave all other boxes blank. E F %rax from xorq to addq %rsp from addq to subq %rax from andq to rmmovq %rsp from subq to rmmovq %rsp from subq to popq %rax from andq to popq Question 12 [2 pt]: (see above) When the andq instruction is in the fetch stage, what pipeline stages will be running a pipeline bubble (nop inserted to handle a pipeline hazard)? Place a in each box that will have a pipeline bubble. Leave all other boxes blank. decode memory execute writeback Question 13 [2 pt]: (see above) When the subq instruction is in the writeback stage, the rmmovq instruction is in what stage? fetch decode execute memory E writeback F none of the above; it is not in the pipeline

S 3330 Fall 2017 Exam 2 Variant page 7 of 8 Email I: Information for questions 14 16 onsider the following two pieces code: /* Version */ for (int i = 0; i < N; ++i) for (int j = 0; j < N; ++j) [i * N + j] += [i + j * 10] * [j]; and: /* Version */ for (int j = 0; j < N; ++j) for (int i = 0; i < N; ++i) [i * N + j] += [i + j * 10] * [j]; ssume that N is a large integer constant and that,, and are large arrays of 4-byte ints. ssume that all values except the accesses to, and are stored in registers. Question 14 [1 pt]: (see above) Which version has better spatial locality in its accesses to? version version they are about the same Question 15 [1 pt]: (see above) Which version has better temporal locality in its accesses to? version version they are about the same Question 16 [1 pt]: (see above) Which version has better temporal locality in its accesses to? version version they are about the same Question 17 [2 pt]: Which of the following statements about loop unrolling are true? Put a next to each box that is true. Leave the other boxes blank. loop unrolling is likely to increase the size of an executable loop unrolling improves the temporal locality of instruction accesses as long as a loop has enough iterations, loop unrolling will decrease the number of instructions executed loop unrolling is not practical if the loop body manipulates two pointers that may alias

S 3330 Fall 2017 Exam 2 Variant page 8 of 8 Email I: Information for questions 18 20 Suppose the components of our Y86 processor take the following amounts of time to process their inputs and either produce a corresponding output or (in the case of register file or data memory writes) be ready to store a value if a rising clock edge happens. component time required instruction memory 250 picoseconds register file reading 200 picoseconds register file writing 200 picoseconds LU 350 picoseconds data memory (reading or writing) 300 picoseconds P increment 30 picoseconds In addition, assume the register delay for any pipeline registers is 25 picoseconds. ssume that other components of the processor take a negligible amount of time. Question 18 [2 pt]: (see above) If we use these components to build a five-stage pipelined Y86 processor, like the one described in lecture and our textbook, what is the minimum cycle time this processor can have? Write your answer as a base-10 number of picoseconds. Question 19 [2 pt]: (see above) If we use these components to build a single-cycle Y86 processor, what is the minimum cycle time this processor can have? o not include any register delay for the P register. Write your answer as a base-10 number of picoseconds. Question 20 [2 pt]: (see above) Suppose we split the LU into two halves, each of which takes 175 picoseconds (half of the original 350 picoseconds). Using this, we split the execute stage into two stages to modify our five-stage pipelined into a six-stage pipelined Y86 processor. y how much will this reduce the cycle time of our Y86 processor? Write your answer as a base-10 number of picoseconds.................................................................................................. Pledge: On my honor as a student, I have neither given nor received aid on this exam. Your signature here