page 1 of 7

ENCM 501 Winter 2017 Assignment 3
for the Week of January 30

Steve Norman
Department of Electrical & Computer Engineering
University of Calgary

January 2017

Assignment instructions and other documents for ENCM 501 can be found at
http://people.ucalgary.ca/~norman/encm501winter2017/

Administrative details

Group work is permitted. Here are the options:

1. You may do your work entirely individually.
2. A group of two or three students may hand in a single assignment for the whole group.
3. Collaboration at the level of individual exercises is acceptable. In that case, submissions of complete, individual assignments are required, with explicit acknowledgments given as needed on an exercise-by-exercise basis.

Informal discussion of assignment exercises between students is encouraged, and does not need to be acknowledged. Please be aware that all students are expected to understand all assignment exercises! Collaboration is of course not allowed on quizzes, the midterm test, and the final exam.

Due Dates

The Due Date for this assignment is 3:30pm, Thursday, Feb. 2. The Late Due Date is 3:30pm, Friday, Feb. 3.

The penalty for handing in an assignment after the Due Date but before the Late Due Date is 3 marks. In other words, X/Y becomes (X - 3)/Y if the assignment is late. There will be no credit for assignments turned in after the Late Due Date; they will be returned unmarked.

Marking scheme

  Exercise:  A  B  C  D  E  F  total
  Marks:     3  2  3  8  3  2     21
How to package and hand in your assignments

Please see the instructions in Assignment 1. And, if you are submitting a group assignment, please make sure all group members' names are clear and complete on the cover page.

Exercise A

Do Exercise A.1 on page A-47 of the textbook. Count load imm (load immediate) within ALU instructions, not loads/stores. Despite the name, load immediate does not require a read of data memory.

What to hand in: Details of how you calculated an effective CPI. If you use a spreadsheet (which is one reasonable approach), do your best to document the formulas in the spreadsheet.

Exercise B: Alignment

For this exercise, there are four answers required: the size of a struct foo object for 32- and 64-bit machines, with and without reordering of variables within a struct object. Note, however, that real C and C++ compilers are not allowed to reorder fields within struct and class objects.

Do Exercise A.11 on page A-50 of the textbook. Assume 8-byte alignment for fields 8 bytes in size, 4-byte alignment for 4-byte fields, and 2-byte alignment for 2-byte fields. Assume that the size of bool is 1 byte.

What to hand in: Answers, stating any assumptions you had to make, and showing how you did your calculations.

Exercise C: Instruction lengths

It's pretty unlikely that any new ISA would consider using a messy mix of instruction lengths, but this exercise is still good food for thought about the tradeoffs involved in choosing instruction formats.

Do Exercise A.20 on pages A-52 to A-53 of the textbook, part (a) only. For load imm assume a fixed instruction size of 32 bits.
A note about textbook Figure A.31

Here are some examples of what "cumulative" means in the table:

- 30.4% of data references (that is, loads and stores) don't need an offset at all.
- 33.5% of data references either don't need an offset or need an offset with one magnitude bit. That implies that 33.5% - 30.4% = 3.1% of data references have offsets of either +1 or -1. (Presumably these would be load-byte or store-byte instructions.)
- 85.2% of branch instructions require 7 or fewer magnitude bits for their offsets. Taking the sign bit into account, that means that 85.2% of branch offsets would fit within an 8-bit field.

What to hand in: An answer, stating any assumptions you had to make, and showing how you did your calculations.

Exercise D: CPI estimation

This is the Processor Performance Equation:

    CPU time = IC x CPI x clock period

And here is a quote from a lecture slide: "CPI is processor-dependent and also program-dependent, so this equation by itself is not very powerful."

In this exercise we'll look at CPI estimates from a few variants of a simple program. We'll see that average CPI can vary a lot even between programs with identical or very similar C source code, and that for CPUs in the modern era, CPI can be significantly less than 1.

Part I

You will need to do this on one of the Optiplex 980 machines in ICT 320. You can find the source files you need for all four parts using links on the ENCM 501 Assignments Page on the Web. (Among the files you'll need are ts_funcs.h and ts_funcs.c from Assignment 2.)

Have a look at ArrayV2.c. It is quite similar to Array.c from Assignment 2, but the array size has been changed, and main has been changed so that it measures only the time spent adding up array elements; the measured CPU time no longer includes time spent filling the array.
Translate the C file to assembly language with this command:

    gcc -S ArrayV2.c -o ArrayV2-plain-save.s

(You're giving the output assembly language file a fancy name to distinguish it from other assembly language files you will generate later.)

Inspect the assembly language file using the less command, or by loading the file into a text editor. (If you do the latter, be careful not to inadvertently edit the file.) It should be pretty easy to find the assembly language code for sum_array. The instructions for the for loop run from the one for label .L6 to the instruction jb .L6.
(Because you didn't ask for optimization, all of the function arguments and local variables are in memory, not GPRs, so almost all of the instructions in the loop read memory, write memory, or do both.)

Make an executable called V2-plain with this command:

    gcc ArrayV2-plain-save.s ts_funcs.c -o V2-plain -lrt

Run the executable repeatedly; throw out any unusually long running times from your data, and calculate an average CPU time for the good runs. (You will likely find that you sometimes get two or more measurements that are exactly the same to a weirdly large number of decimal places. That has to do with how clock_gettime is implemented in Cygwin. That function makes a call to a Windows service that gives a time measurement with a resolution of about 15.6 milliseconds. All the measurements you get will be multiples of that resolution.)

Use the Processor Performance Equation to estimate an average CPI for the time spent between the two calls to clock_gettime in main. Let's neglect all instructions except those within the loop in sum_array. That's reasonable, since the loop runs 600 million times. It also makes the calculation easy, because you won't have to count very many instructions. The processor chips in the Optiplex 980 machines are Intel Core i7 870 models, which run at 2.93 GHz when under load.

Part II

You might expect that using compiler optimization could substantially speed up such a simple loop, and you would be right about that. You might also guess that choosing different instructions for the loop might result in a significantly different CPI; that's what we're going to look at here.

Use these two commands to get an assembly language file and an executable:

    gcc -S -O2 ArrayV2.c -o ArrayV2-O2-save.s
    gcc -O2 ArrayV2-O2-save.s ts_funcs.c -o V2-O2 -lrt

Inspect the assembly language file. The translation of the for loop in sum_array can be found from the label .L10 to the instruction jne .L10.
However, that loop will never be used by the executable! Look at the instructions for main. You will find two calls to clock_gettime, but you won't find a call to sum_array! Instead you'll see a loop from .L15 to jne .L15 that is very similar to the one in sum_array. This is an example of a compiler optimization called inlining: instructions were inserted into main to do the work of sum_array without actually calling sum_array.

Calculate the average CPI, again neglecting all instructions outside the loop that adds up the array elements.

Part III

This is a digression away from the average CPI calculation that is the main theme of this exercise. It asks for some insight about how the toolchain (compiler, assembler, linker) works for C development.

With -O2 optimization in Part II, gcc inlined the function sum_array into the definition of main. Why did gcc also generate separate assembly language code for sum_array, even though that code would never be called by main? Answer the question in a few short but precise sentences.
Part IV

The array operated on by the program ArrayV2.c is almost 2.4 GB in size, much, much larger than the caches within the processor chip. It would be fair to suspect that some of the time measured in Parts I and II is time spent waiting for data to be copied from DRAM to processor caches.

To get some idea of whether that is true, I created ArrayV3.c, which does almost exactly the same amount of work as ArrayV2.c, but sums a single 3,000-element array 200,000 times. That array will fit easily within an L1 data cache.

Repeat the work you did in Parts I and II, using ArrayV3.c instead of ArrayV2.c. Note the following:

- Neglect all instructions except for those in the loop that sums array elements. This inner loop runs 3,000 times for every pass through the outer loop, so the approximation should be reasonable.
- Without compiler optimization, you'll find that the loop in sum_array is the same as it was in Part I.
- With compiler optimization, look at the loop starting with label .L16 in main.

Part V

To check whether neglecting outer-loop instructions caused a significant error in Part IV (for the case of -O2 optimization only, because it's the easier case), adjust your CPI calculations to include instructions in the outer loop, which starts at label .L15 in main. Assume that the mysterious-looking assembler directive

    .p2align 4,,10

generates a single no-op instruction.

What to hand in: For Parts I, II, IV, and V, describe your average CPI calculations in enough detail that a reader would have no doubts at all about how you obtained any of your numbers. For Part III, hand in an answer to the question asked in that part.

Exercise E: x86-64 micro-ops

Consider the circuits of an Intel processor chip that actually execute x86-64 instructions; for the purposes of this exercise, let's call these the execution circuits.
As stated in a lecture, the execution circuits do not deal directly with the variable-length CISC-style machine code instructions corresponding to the x86-64 ISA. Instead there are translation circuits dedicated to converting CISC-style instructions into sequences of fixed-width, RISC-like micro-operations, often called micro-ops or uops. This translation is done as a program runs. Instructions sitting in DRAM, L3 caches, and L2 caches would all be in the x86-64 ISA format. In some chip designs, the L1 I-cache is replaced by a trace cache that holds micro-ops. In other designs the L1 I-cache holds x86-64 machine code, and a separate uop cache holds micro-ops. [1]

[1] See http://www.realworldtech.com/sandy-bridge/3/ for more (much more!) detail.
A simple instruction like an ADD with one register as a source and another register as a source-and-destination would be translated into a single micro-op. But a more CISC-like operation with one operand in a register and the other in memory would likely generate two or more micro-ops.

As far as I know, Intel does not publish micro-op formats. And it seems likely that the format may change slightly from one microarchitecture to the next. For the purposes of this exercise, let's use the following reasonable guesses, which may or may not be close to reality:

- 1-micro-op instructions. This would include: all move instructions [2]; simple arithmetic/logic instructions with either two register operands, or one register operand and one immediate operand; jumps and branches.
- 2-micro-op instructions. These would be arithmetic/logical instructions that have memory as a source, such as addq -24(%rbp), %rax
- 3-micro-op instructions. These would be arithmetic/logical instructions that have memory as both a source and a destination, such as addq $1, -8(%rbp)

Consider the program ArrayV3.c, which you worked with in Exercise D, Part IV. Determine the cycles-per-micro-op for the inner loop with and without -O2 optimization. You may assume that in both cases essentially all of the measured time is spent in the inner loop.

What to hand in: Cycles-per-micro-op calculations, showing clearly and precisely how you obtained your answers.

Exercise F: Endianness

Some computers access memory in a little-endian way, and others access memory in a big-endian way. Therefore it is sometimes necessary to reverse the order of the bytes within a multi-byte chunk of data.

Part I

Copy the file reverse-endi.c from the ENCM 501 Assignments page. Read the file, then, without editing the source, build an executable and run it. Explain why the output tells you whether the machine you are using is little-endian or big-endian.
(Base your explanation only on the output you see, not any prior knowledge of the endianness of the machine.)

[2] Memory-to-memory moves would probably need two micro-ops, but there are no such instructions in x86-64.
Part II

Edit the source code to provide correct implementations of reverse_32 and reverse_64. Do it using shift, AND, and OR operations, using reverse_16 as a model.

What to hand in: Your explanation for Part I, and your edited source code for Part II.