ENCM 501 Winter 2016 Assignment 1 for the Week of January 25

Steve Norman
Department of Electrical & Computer Engineering
University of Calgary
January 2016

Assignment instructions and other documents for ENCM 501 can be found at http://people.ucalgary.ca/~norman/encm501winter2016/

1 Administrative details

1.1 Each student must hand in his or her own assignment

Later in the course, you will be allowed to work in pairs on some assignments.

1.2 Due Dates

The Due Date for this assignment is 3:30pm, Thursday, Jan. 28.

The Late Due Date is 3:30pm, Friday, Jan. 29.

The penalty for handing in an assignment after the Due Date but before the Late Due Date is 3 marks. In other words, X/Y becomes (X - 3)/Y if the assignment is late. There will be no credit for assignments turned in after the Late Due Date; they will be returned unmarked.

1.3 Marking scheme

    A          B          C          D          E          total
    4 marks    5 marks    6 marks    3 marks    4 marks    22 marks

1.4 How to package and hand in your assignments

Please see the instructions in Assignment 1.

2 Exercise A: More about MIPS64 instructions

2.1 Read This First

This exercise extends the loop programming example presented in the tutorial of Wednesday, January 20 and solved on slide 9 of the Thursday, January 21 lecture. We are going to look at some simple optimizations that a C compiler for MIPS64 would likely make.

Two instructions useful for an optimizing compiler are the conditional move instructions MOVN and MOVZ:

    MOVN dest, src1, src2    if src2 ≠ 0, copy src1 to dest; otherwise do nothing.
    MOVZ dest, src1, src2    if src2 = 0, copy src1 to dest; otherwise do nothing.

2.2 What to Do

Rewrite the assembly language loop from slide 9 of Slide Set 2A so that it does the same job but contains no jump instructions and only one branch instruction. If possible, eliminate all NOP instructions.

(Remark: Using conditional moves instead of branches to implement short if statements can really help performance, because branches can cause pipeline stalls.)

2.3 What to Hand In

Hand in typed or neatly hand-written assembly language.

3 Exercise B: Loop unrolling

3.1 Read This First

Loop unrolling is a relatively simple and sometimes effective compiler optimization.

3.2 What to Do, Part I

Do a Web search for "loop unrolling", then write a few short paragraphs to explain what loop unrolling is. Put it in your own words; do not simply copy-and-paste.

3.3 What to Do, Part II

Rewrite your assembly code from Exercise A, unrolling the loop by a factor of 4, so that there are a total of 25 passes through the loop and there are 4 LD instructions among the instructions in the loop body.

3.4 What to Do, Part III

Briefly describe the changes you would need to make to unroll the loop by a factor of 8. Note that 12 × 8 is 96, and 13 × 8 is 104.

3.5 What to Do, Part IV

Suppose that you have a program with hundreds of C functions, and that most of those functions contain loops that are easy for a compiler to unroll. Why could it be a bad idea to ask the compiler to unroll all the loops? Specifically, what could go wrong if the program is for an embedded system with a very small memory? What could go wrong if the program is for a desktop computer with 3 levels of cache and a huge amount of DRAM?
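A companion sketch for Exercise A: the MOVN and MOVZ definitions above can be modeled in C as functions that return the new value of dest. The loop here is hypothetical (it is not the loop from slide 9), and the MIPS64 lines in its comments assume textbook-style register names R0 to R31, with R0 always zero. The ?: inside movn and movz only models what the hardware does; whether a C compiler actually emits a conditional move for such code is up to the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of MOVN dest, src1, src2: returns the new value of dest.
   If src2 != 0, dest gets src1; otherwise dest is unchanged. */
int64_t movn(int64_t dest, int64_t src1, int64_t src2)
{
    return (src2 != 0) ? src1 : dest;
}

/* Model of MOVZ dest, src1, src2: dest gets src1 only if src2 == 0. */
int64_t movz(int64_t dest, int64_t src1, int64_t src2)
{
    return (src2 == 0) ? src1 : dest;
}

/* Hypothetical loop: add up a[0..n-1], treating negative elements as 0.
   The branchy C form of the loop body would be
       if (x < 0) x = 0;
   The version below mirrors an if-converted sequence such as
       SLT  R3, R2, R0     # R3 = (R2 < 0) ? 1 : 0
       MOVN R2, R0, R3     # if R3 != 0, R2 = 0
   so the only branch left is the one that closes the loop. */
int64_t sum_nonnegative(const int64_t *a, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t x = a[i];
        int64_t is_negative = (x < 0);   /* like SLT: 1 or 0 */
        x = movn(x, 0, is_negative);     /* like MOVN with R0 as src1 */
        sum += x;
    }
    return sum;
}
```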

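The transformation asked for in Parts II and III of Exercise B can be illustrated at the C level. This is a minimal sketch assuming a hypothetical loop that adds up an array of 64-bit integers (again, not the slide 9 loop). Unrolling by 4 when the trip count is 100 gives 25 passes with 4 loads each; unrolling by 8 has to deal with the 100 - 96 = 4 leftover elements, and a short cleanup loop after the main loop is one common way to do that.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled loop: one load per pass, n passes. */
int64_t sum_rolled(const int64_t *a, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by a factor of 4. Assumes n is a multiple of 4:
   with n = 100 there are 25 passes, each containing 4 loads. */
int64_t sum_unrolled4(const int64_t *a, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    return sum;
}

/* Unrolled by a factor of 8. n need not be a multiple of 8, so the
   main loop covers the first n - n % 8 elements and a cleanup loop
   covers the rest (n = 100: 12 passes over 96 elements, then 4 more). */
int64_t sum_unrolled8(const int64_t *a, size_t n)
{
    int64_t sum = 0;
    size_t i = 0;
    size_t main_end = n - n % 8;
    for (; i < main_end; i += 8) {
        sum += a[i];     sum += a[i + 1];
        sum += a[i + 2]; sum += a[i + 3];
        sum += a[i + 4]; sum += a[i + 5];
        sum += a[i + 6]; sum += a[i + 7];
    }
    for (; i < n; i++)   /* cleanup: at most 7 elements */
        sum += a[i];
    return sum;
}
```

The cost side of unrolling shows up even in this small sketch: sum_unrolled8 is several times longer than sum_rolled, which is the kind of code-size growth Part IV asks you to think about.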
3.6 What to Hand In

Typed or neatly hand-written assembly language for Part II; well-explained answers for the other parts.

4 Exercise C: Comparing run times in a SPEC-like framework

4.1 Read This First

This exercise is designed to give some insight into the structure of the SPEC CPU benchmarks: how performance is reported for a suite of programs running on a number of different systems.

The tiny programs you will work with here are convenient, but contrary to the goals of SPEC in a couple of ways:

- The work they do (filling up data structures with integers, then traversing the data structures to add up those same integers) is not at all like the work done by the real applications in the SPEC suites.
- To avoid making students wait and wait, the programs in this exercise run for just a few seconds, not minutes or hours. Longer runs would result in less measurement error.

4.2 Attention

You must do this exercise on one of the machines labeled "Optiplex 755" in ICT 320. (Most but not quite all of the boxes in that room have that label.) This is because (a) I want all students to do the work with identical hardware and software and (b) supporting this assignment for various compilers and libraries on various operating systems is too much work.

4.3 Cygwin64

The reference platform for programming exercises in this course is Cygwin64, which is installed on the machines in ICT 320. Cygwin64 brings two important capabilities to a Windows box:

- a command-line interface and set of utility programs that is very similar to what you find on a typical Linux or other Unix-like system;
- a compiler, linker and libraries that allow you to build C and C++ programs that rely on calls to functions typically found in Unix-like libraries.

If you've never worked with Cygwin before, you can learn the basics needed for ENCM 501 by trying some of the lab exercises from Lab 1 of the Fall 2015 version of ENCM 339. I've posted the relevant documents and C source files on the ENCM 501 Assignments page.

(Cygwin64 is fairly easy to install on 64-bit Windows 7 and Windows 8.1, if you would like to put it on your own machine. I imagine that it works well on Windows 10, but I haven't tried that. Go to https://www.cygwin.com/ for more about downloading and installing Cygwin64.)

4.4 What to Do, Part I

There are four source files you need to copy. They should be easy to find by following links from the ENCM 501 home page. Read the files to get a rough idea of what the code does. Don't worry if some of the details are unfamiliar.

There are two programs in the suite, Array and Set. Instead of benchmarking different hardware, you're going to benchmark the same hardware several times, with a variety of optimization options presented to the compiler. The reference machine data will come from running the programs compiled without optimization.

You will likely find that the running times for a given executable, run many times, are all slightly different from each other. I recommend that you run each executable about 10 times, throw out the worst 5 run times, and take the average of the best 5. The rationale is that other programs running on the computer at the same time can sometimes interfere to make a run time unusually bad, but can't do anything to make a run time unusually good.

Here are the four systems to test, with each of the two programs:

- Reference: no compiler optimizations at all.
- O2: with -O2 optimization.
- O2-unroll: with -O2 optimization, plus loop unrolling.
- O3: with -O3 optimization.

The command to build Array for the Reference system is

    gcc Array.c ts_funcs.c -o Array -lrt

(On Cygwin, the executable will be called Array.exe; to run it, the commands ./array and ./array.exe both work.)

To build Array for O2-unroll, it's

    gcc -O2 -funroll-loops Array.c ts_funcs.c -o Array -lrt

To build Set for O3, use

    g++ -O3 Set.cpp ts_funcs.c -o Set -lrt

I hope those three examples are enough to let you figure out the remaining cases. (The option -lrt is needed so that the linker can find the library with clock_gettime.)

Get run time data for both programs on all four systems, then determine SPEC-like scores for O2, O2-unroll, and O3, using Reference as the reference machine.
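The averaging rule and the scoring step above can be written out in C. This is a minimal sketch assuming the usual SPEC convention: for each program, the ratio is (run time on the reference system) / (run time on the system being scored), and a system's score is the geometric mean of its ratios, so Reference itself scores 1.0. The names mean_of_best, nth_root and spec_like_score are hypothetical and are not part of the supplied source files; the n-th root uses Newton's method only so that the sketch builds without -lm.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int compare_doubles(const void *p, const void *q)
{
    double x = *(const double *)p;
    double y = *(const double *)q;
    return (x > y) - (x < y);
}

/* Average of the `keep` smallest of n run times (e.g. best 5 of 10).
   Sorts a copy, so the caller's array is not reordered.
   Returns -1.0 on bad arguments or allocation failure. */
double mean_of_best(const double *times, size_t n, size_t keep)
{
    if (n == 0 || keep == 0 || keep > n)
        return -1.0;
    double *sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return -1.0;
    memcpy(sorted, times, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, compare_doubles);
    double total = 0.0;
    for (size_t i = 0; i < keep; i++)
        total += sorted[i];
    free(sorted);
    return total / (double)keep;
}

/* n-th root of p > 0 by Newton's method. By Bernoulli's inequality the
   starting guess is >= the root, so the iterates decrease toward it. */
static double nth_root(double p, size_t n)
{
    double g = 1.0 + (p - 1.0) / (double)n;
    for (int iter = 0; iter < 10000; iter++) {
        double g_pow = 1.0;                   /* g^(n-1) */
        for (size_t k = 1; k < n; k++)
            g_pow *= g;
        double next = ((double)(n - 1) * g + p / g_pow) / (double)n;
        if (next == g)
            break;
        g = next;
    }
    return g;
}

/* Geometric mean over all programs of ref_times[j] / sys_times[j].
   All times must be positive. Returns -1.0 if n_programs is 0. */
double spec_like_score(const double *ref_times, const double *sys_times,
                       size_t n_programs)
{
    if (n_programs == 0)
        return -1.0;
    double product = 1.0;
    for (size_t j = 0; j < n_programs; j++)
        product *= ref_times[j] / sys_times[j];
    return nth_root(product, n_programs);
}
```

For this exercise, n_programs would be 2 (Array and Set), and each entry of ref_times and sys_times would come from mean_of_best applied to about 10 runs with keep = 5.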
4.5 What to Hand In

Write a brief report explaining exactly how you collected all your data and how you computed your SPEC-like scores.

4.6 Optional extra part (no marks)

It's interesting to look at the instructions chosen by the compiler with different optimization settings. For example, here are a couple of commands to generate assembly language from Array.c:

    gcc -S -O2 Array.c -o ArrayO2.s
    gcc -S -O2 -funroll-loops Array.c -o ArrayO2unroll.s

If you do that, look at the two .s files to see how relatively large and messy the code for fill_array and sum is when you ask for loop unrolling.

5 Exercise D

5.1 What to Do

Exercise 1.15 on page 67 of the textbook.

5.2 What to Hand In

Solutions, showing clearly how you obtained your answers.

6 Exercise E

6.1 Read This First

One of the points of this exercise is that Amdahl's law can be applied at a fine-grained level: thinking about what happens if there are speedups for some kinds of instructions but not for others.

6.2 What to Do

Exercise 1.16 on pages 67–68 of the textbook. In part (b), assume that the 10% number refers to the processor without the floating-point enhancement.

6.3 What to Hand In

Solutions, showing clearly how you obtained your answers.
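In the fine-grained form of Amdahl's law, if fraction f_i of the original execution time is spent on work of class i (for example, one kind of instruction) and that work is sped up by a factor s_i, the overall speedup is 1 / (f_1/s_1 + f_2/s_2 + ...); with a single enhanced class this reduces to the familiar 1 / ((1 - f) + f/s). The fractions must be measured on the unenhanced machine, which is why the note about part (b) matters. A minimal C sketch (the function name amdahl_speedup is hypothetical):

```c
#include <stddef.h>

/* Amdahl's law over several classes of work.
   fraction[i]: fraction of the ORIGINAL (unenhanced) execution time spent
                in class i; the fractions should add up to 1.
   speedup[i]:  factor by which class i gets faster (1.0 = not enhanced).
   Returns old_time / new_time. */
double amdahl_speedup(const double *fraction, const double *speedup,
                      size_t n_classes)
{
    double new_time = 0.0;   /* new time, relative to an old time of 1.0 */
    for (size_t i = 0; i < n_classes; i++)
        new_time += fraction[i] / speedup[i];
    return 1.0 / new_time;
}
```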