ENCM 501 Winter 2017 Assignment 3 for the Week of January 30

Steve Norman
Department of Electrical & Computer Engineering
University of Calgary
January 2017

Assignment instructions and other documents for ENCM 501 can be found at http://people.ucalgary.ca/~norman/encm501winter2017/

Administrative details

Group work is permitted

Here are the options:

- You may do your work entirely individually.
- A group of two or three students may hand in a single assignment for the whole group.
- Collaboration at the level of individual exercises is acceptable. In that case, submissions of complete, individual assignments are required, with explicit acknowledgments given as needed on an exercise-by-exercise basis.

Informal discussion of assignment exercises between students is encouraged, and does not need to be acknowledged. Please be aware that all students are expected to understand all assignment exercises! Collaboration is of course not allowed on quizzes, the midterm test, and the final exam.

Due Dates

The Due Date for this assignment is 3:30pm, Thursday, Feb. 2. The Late Due Date is 3:30pm, Friday, Feb. 3. The penalty for handing in an assignment after the Due Date but before the Late Due Date is 3 marks. In other words, X/Y becomes (X - 3)/Y if the assignment is late. There will be no credit for assignments turned in after the Late Due Date; they will be returned unmarked.

Marking scheme

Exercise   Marks
A          3
B          2
C          3
D          8
E          3
F          2
total      21

How to package and hand in your assignments

Please see the instructions in Assignment 1. And, if you are submitting a group assignment, please make sure all group members' names are clear and complete on the cover page.

Exercise A

Do Exercise A.1 on page A-47 of the textbook. Count load imm (load immediate) within ALU instructions, not loads/stores. Despite the name, load immediate does not require a read of data memory.

What to hand in: Details of how you calculated an effective CPI. If you use a spreadsheet (which is one reasonable approach), do your best to document the formulas in the spreadsheet.

Exercise B: Alignment

For this exercise, there are four answers required: size of a struct foo object for 32- and 64-bit machines, with and without reordering of variables within a struct object. Note, however, that real C and C++ compilers are not allowed to reorder fields within struct and class objects.

Do Exercise A.11 on page A-50 of the textbook. Assume 8-byte alignment for fields 8 bytes in size, 4-byte alignment for 4-byte fields, and 2-byte alignment for 2-byte fields. Assume that the size for bool is 1 byte.

What to hand in: Answers, stating any assumptions you had to make, and showing how you did your calculations.

Exercise C: Instruction lengths

It's pretty unlikely that any new ISA would consider using a messy mix of instruction lengths, but this exercise is still good food for thought about the tradeoffs involved in choosing instruction formats.

Do Exercise A.20 on pages A-52 to A-53 of the textbook, part a only. For load imm assume a fixed instruction size of 32 bits.
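For Exercise B, the stated alignment rules can be applied mechanically. The C sketch below computes the padded size of a struct from a list of field sizes. It assumes, as the exercise does, that each field is aligned to its own size, and it additionally assumes that the whole struct is padded out to a multiple of its largest field alignment. The function names and the example field lists are hypothetical; they are not the fields of the textbook's struct foo.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (align must be > 0). */
static size_t round_up(size_t offset, size_t align)
{
    assert(align > 0);
    return (offset + align - 1) / align * align;
}

/*
 * Hypothetical helper: size of a struct whose fields have the given sizes,
 * laid out in the given order. Assumes every field is aligned to its own
 * size (1, 2, 4, or 8 bytes) and that the struct as a whole is padded to a
 * multiple of its largest field alignment.
 */
size_t padded_struct_size(const size_t *field_sizes, size_t n_fields)
{
    size_t offset = 0;
    size_t max_align = 1;
    for (size_t i = 0; i < n_fields; i++) {
        size_t align = field_sizes[i];
        offset = round_up(offset, align);   /* padding before this field */
        offset += field_sizes[i];
        if (align > max_align)
            max_align = align;
    }
    return round_up(offset, max_align);     /* tail padding */
}
```

With made-up field sizes of 1, 8, 1, and 4 bytes in that order, the function returns 24; with the same fields sorted largest-first (8, 4, 1, 1) it returns 16, which illustrates why reordering can shrink an object.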

A note about textbook Figure A.31

Here are some examples of what "cumulative" means in the table:

- 30.4% of data references (that is, loads and stores) don't need an offset at all.
- 33.5% of data references either don't need an offset or need an offset with one magnitude bit. That implies that 33.5% - 30.4% = 3.1% of data references have offsets of either +1 or -1. (Presumably these would be load-byte or store-byte instructions.)
- 85.2% of branch instructions require 7 or fewer magnitude bits for their offsets. Taking the sign bit into account, that means that 85.2% of branch offsets would fit within an 8-bit field.

What to hand in: An answer, stating any assumptions you had to make, and showing how you did your calculations.

Exercise D: CPI estimation

This is the Processor Performance Equation:

    CPU time = IC × CPI × clock period

And here is a quote from a lecture slide: "CPI is processor-dependent and also program-dependent, so this equation by itself is not very powerful."

In this exercise we'll look at CPI estimates from a few variants of a simple program. We'll see that average CPI can vary a lot even between programs with identical or very similar C source code, and that for CPUs in the modern era, CPI can be significantly less than 1.

Exercise D, Part I

You will need to do this on one of the Optiplex FIXME machines in ICT 320. You can find the source files you need for all four parts using links on the ENCM 501 Assignments Page on the Web. (Among the files you'll need are ts_funcs.h and ts_funcs.c from Assignment 2.)

Have a look at ArrayV2.c. It is quite similar to Array.c from Assignment 2, but the array size has been changed, and main has been changed so that it measures only the time spent adding up array elements; the measured CPU time no longer includes time spent filling the array.
Translate the C file to assembly language with this command:

    gcc -S ArrayV2.c -o ArrayV2-plain-save.s

(You're giving the output assembly language a fancy name to distinguish it from other assembly language files you will generate later.)

Inspect the assembly language file using the less command, or by loading the file into a text editor. (If you do the latter, be careful not to inadvertently edit the file.) It should be pretty easy to find the assembly language code for sum_array. The instructions for the for loop run from the one for label .L6 to the instruction jb .L6.

(Because you didn't ask for optimization, all of the function arguments and local variables are in memory, not GPRs, so almost all of the instructions in the loop read memory, write memory, or do both.)

Make an executable called V2-plain with this command:

    gcc ArrayV2-plain-save.s ts_funcs.c -o V2-plain -lrt

Run the executable repeatedly; throw out any unusually long running times from your data, and calculate an average CPU time for the good runs. (You will likely find that you sometimes get two or more measurements that are exactly the same to a weirdly large number of decimal places. That has to do with how clock_gettime is implemented in Cygwin. That function makes a call to a Windows service that gives a time measurement with resolution of about 15.6 milliseconds. All the measurements you get will be multiples of that resolution.)

Use the Processor Performance Equation to estimate an average CPI for the time spent between the two calls to clock_gettime in main. Let's neglect all instructions except those within the loop in sum_array. That's reasonable, since the loop runs 600 million times. It also makes the calculation easy, because you won't have to count very many instructions. The processor chips in the Optiplex 980 machines are Intel Core i7 870 models, which run at 2.93 GHz when under load.

Exercise D, Part II

You might expect that using compiler optimization could substantially speed up such a simple loop, and you would be right about that. You might also guess that choosing different instructions for the loop might result in a significantly different CPI; that's what we're going to look at here.

Use these two commands to get an assembly language file and an executable:

    gcc -S -O2 ArrayV2.c -o ArrayV2-O2-save.s
    gcc -O2 ArrayV2-O2-save.s ts_funcs.c -o V2-O2 -lrt

Inspect the assembly language file. The translation of the for loop in sum_array can be found from the label .L10 to the instruction jne .L10.
However, that loop will never be used by the executable! Look at the instructions for main. You will find two calls to clock_gettime, but you won't find a call to sum_array! Instead you'll see a loop from .L15 to jne .L15 that is very similar to the one in sum_array. This is an example of a compiler optimization called inlining: instructions were inserted into main to do the work of sum_array without actually calling sum_array.

Calculate the average CPI, again neglecting all instructions outside the loop that adds up the array elements.

Exercise D, Part III

This is a digression away from the average CPI calculation that is the main theme of this exercise. It asks for some insight about how the toolchain (compiler, assembler, linker) works for C development.

With -O2 optimization in Part II, gcc inlined the function sum_array into the definition of main. Why did gcc also generate separate assembly language code for sum_array, even though that code would never be called by main? Answer the question in a few short but precise sentences.
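The CPI estimates asked for in Parts I and II come from rearranging the Processor Performance Equation as CPI = CPU time × clock rate / IC, with IC approximated by the number of instructions in the loop body times the number of iterations. A minimal C sketch of that arithmetic follows; the function name estimate_cpi is hypothetical and is not part of ts_funcs.c.

```c
#include <assert.h>

/*
 * Rearranged Processor Performance Equation:
 *   CPI = CPU time / (IC * clock period) = CPU time * clock rate / IC
 * where IC is approximated as (instructions per loop iteration) times
 * (number of iterations), neglecting everything outside the loop.
 */
double estimate_cpi(double cpu_time_s, double clock_hz,
                    double instr_per_iter, double iterations)
{
    assert(instr_per_iter > 0.0 && iterations > 0.0);
    double cycles = cpu_time_s * clock_hz;
    double instruction_count = instr_per_iter * iterations;
    return cycles / instruction_count;
}
```

As a purely made-up illustration (not a measurement), a loop of 5 instructions that runs 600 million times and takes 2.0 s on a 3.0 GHz clock would give estimate_cpi(2.0, 3.0e9, 5.0, 600.0e6), which is 2.0. For the assignment, the clock rate and iteration count come from the text above, while the CPU time and the instruction count per iteration must come from your own runs and your own reading of the assembly file.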

Exercise D, Part IV

The array operated on by the program ArrayV2.c is almost 2.4 GB in size, much, much larger than the caches within the processor chip. It would be fair to suspect that some of the time measured in Parts I and II is time spent waiting for data to be copied from DRAM to processor caches.

To get some idea of whether that is true, I created ArrayV3.c, which does almost exactly the same amount of work as ArrayV2.c, but sums a single 3,000-element array 200,000 times. That array will fit easily within an L1 data cache.

Repeat the work you did in Parts I and II, using ArrayV3.c instead of ArrayV2.c. Note the following:

- Neglect all instructions except for those in the loop that sums array elements. This inner loop runs 3,000 times for every pass through the outer loop, so the approximation should be reasonable.
- Without compiler optimization, you'll find that the loop in sum_array is the same as it was in Part I.
- With compiler optimization, look at the loop starting with label .L16 in main.

Exercise D, Part V

To check whether neglecting outer-loop instructions caused a significant error in Part IV (for the case of -O2 optimization only, because it's the easier case), adjust your CPI calculations to include instructions in the outer loop, which starts at label .L15 in main. Assume that this mysterious-looking assembler directive

    .p2align 4,,10

generates a single no-op instruction.

What to hand in: For Parts I, II, IV, and V, describe your average CPI calculations in enough detail that a reader would have no doubts at all about how you obtained any of your numbers. For Part III, hand in an answer to the question asked in that part.

Exercise E: x86-64 micro-ops

Consider the circuits of an Intel processor chip that actually execute x86-64 instructions; for the purposes of this exercise, let's call these the execution circuits.
As stated in a lecture, the execution circuits do not deal directly with the variable-length CISC-style machine code instructions corresponding to the x86-64 ISA. Instead there are translation circuits dedicated to converting CISC-style instructions into sequences of fixed-width, RISC-like micro-operations, often called micro-ops or uops.

This translation is done as a program runs. Instructions sitting in DRAM, L3 caches and L2 caches would all be in the x86-64 ISA format. In some chip designs, the L1 I-cache is replaced by a trace cache that holds micro-ops. In other designs the L1 I-cache holds x86-64 machine code, and a separate uop cache holds micro-ops. [1]

[1] See http://www.realworldtech.com/sandy-bridge/3/ for more (much more!) detail.

A simple instruction like an ADD with one register as a source and another register as a source-and-destination would be translated into a single micro-op. But a more CISC-like operation with one operand in a register and the other in memory would likely generate two or more micro-ops.

As far as I know, Intel does not publish micro-op formats. And it seems likely that the format may change slightly from one microarchitecture to the next. For the purposes of this exercise, let's use the following reasonable guesses, which may or may not be close to reality:

1-micro-op instructions. This would include:
  - all move instructions [2]
  - simple arithmetic/logic instructions with either two register operands, or one register operand and one immediate operand
  - jumps and branches

2-micro-op instructions. These would be arithmetic/logical instructions that have memory as a source, such as

    addq -24(%rbp), %rax

3-micro-op instructions. These would be arithmetic/logical instructions that have memory as both a source and a destination, such as

    addq $1, -8(%rbp)

Consider the program ArrayV3.c, which you worked with in Exercise D, Part IV. Determine the cycles-per-micro-op for the inner loop with and without -O2 optimization. You may assume that in both cases essentially all of the measured time is spent in the inner loop.

What to hand in: Cycles-per-micro-op calculations, showing clearly and precisely how you obtained your answers.

Exercise F: Endianness

Some computers access memory in a little-endian way, and others access memory in a big-endian way. Therefore it is sometimes necessary to reverse the order of the bytes within a multi-byte chunk of data.

Exercise F, Part I

Copy the file reverse-endi.c from the ENCM 501 Assignments page. Read the file, then, without editing the source, build an executable and run it. Explain why the output tells you whether the machine you are using is little-endian or big-endian.
(Base your explanation only on the output you see, not any prior knowledge of the endianness of the machine.)

[2] Memory-to-memory moves would probably need two micro-ops, but there are no such instructions in x86-64.
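For Exercise F, Part I, the underlying idea is that the byte stored at the lowest address of a multi-byte integer reveals the byte order. The C sketch below illustrates that general idea only; it is not the contents of reverse-endi.c, and the function name is_little_endian is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Returns 1 if the machine stores the least significant byte of a
 * multi-byte integer at the lowest address (little-endian), 0 if it
 * stores the most significant byte there (big-endian).
 */
int is_little_endian(void)
{
    uint32_t word = 0x01020304u;
    unsigned char bytes[sizeof word];
    memcpy(bytes, &word, sizeof word);   /* view the integer as raw bytes */
    /* This sketch only handles pure little- or big-endian machines. */
    assert(bytes[0] == 0x04 || bytes[0] == 0x01);
    return bytes[0] == 0x04;
}
```

On a little-endian machine the lowest-addressed byte of 0x01020304 is 0x04; on a big-endian machine it is 0x01.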

Exercise F, Part II

Edit the source code to provide correct implementations of reverse_32 and reverse_64. Do it using shift, AND, and OR operations, using reverse_16 as a model.

What to hand in: Explanation for Part I, edited source code for Part II.
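The shift, AND, and OR technique mentioned in Part II can be sketched as follows. This is a generic illustration under assumed names (swap16, swap32, check_round_trip); it is not taken from reverse-endi.c, and the actual signatures of reverse_16 and reverse_32 in that file may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit byte swap: exchange the high and low bytes. */
uint16_t swap16(uint16_t x)
{
    return (uint16_t)(((x & 0x00FFu) << 8) | ((x & 0xFF00u) >> 8));
}

/* Hypothetical 32-bit byte swap: byte 0 <-> byte 3, byte 1 <-> byte 2. */
uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}

/* Sanity check: swapping twice must restore the original value. */
void check_round_trip(uint32_t x)
{
    assert(swap32(swap32(x)) == x);
}
```

Each AND isolates one byte, each shift moves it to its mirrored position, and the ORs merge the moved bytes. For example, swap16(0x1234) yields 0x3412 and swap32(0x12345678) yields 0x78563412. The same pattern extends to 64 bits with eight masks.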