Introduction to Computer Architecture

ECE 154A Introduction to Computer Architecture Fall 2013 Dmitri istrukov Software Interface

Agenda Procedures and stack Memory mapping Arrays vs. linked lists Memory management Program compilation, linking, loading and execution

Big Idea Architecture should be convenient for programmers HW support for programming language constructions Debugging, security etc.

Why Subroutines (Procedures) Important? t? Better structure Fewer bugs, i.e. faster and cheaper development More compact code Fewer bugs Very important when memory is limited, e.g. early days Even for today s computers will typically lead in better performance Fewer misses (memory hierarchy) but could have also negative effects if overhead (i.e. control instructions) is significant

Implementing Subroutines Can implement with existing instructions What if procedure is written by somebody else and already compiled (e.g. library) Still doable to patch binaries Procedures are very frequent so let s have special instructions to support it JAL and JR cont: proc: j proc xxx.. xxx j cont

Instructions for Accessing Procedures MIPS procedure call instruction: jal ProcedureAddress #jump and link Saves PC+4 in register $ra to have a link to the next instruction for the procedure return Machine format (J format): 0x03 26 bit address Then can do procedure return with a jr $ra #return Instruction format (R format): 0 0x08

Illustrating a Procedure Call main PC jal proc Prepare to call Prepare to continue proc Save, etc. jr $ra Restore Relationship between the main program and a procedure.

More Issues with Procedures Q1: How to pass to and return from a procedure the data? Would like to use as many as possible register inside procedure (callee) to better utilize temporal locality but some register may be utilized by caller Solution: Spill registers (move RF content to main memory and thenrestore) restore). What is the exact mechanismfor that, in particular Q2: Which registers to spill? Q3: Who is responsible saving (callee vs. caller)? Q4: Where to spill? Solution: There are certain rules enforced in a software which helps such implementation

Typical Use of Registers $0 0 $zero $1 $at Reserved for assembler use $2 $v0 $3 $v1 Procedure results $4 $a0 $5 $a1 Procedure $6 $a2 arguments Saved $7 $a3 $8 $t0 $9 $t1 $10 $t2 $11 $t3 Temporary $12 $t4 values $13 $t5 $14 $t6 $15 $t7 $16 $s0 $17 $s1 $18 $s2 Saved $19 $s3 across Operands $20 $s4 procedure $21 $s5 calls $22 $s6 $23 $s7 $24 $t8 More $25 $t9 temporaries $26 $k0 $27 $k1 Reserved for OS (kernel) $28 $gp Global pointer $29 $sp Stack pointer $30 $fp Frame pointer Saved $31 $ra Return address A4-byte word sits in consecutive memory addresses according to the big-endian order (most significant byte has the lowest address) 3 2 1 0 Answer to Q1 Byte numbering: 3 2 1 0 When loading a byte into a register, it goes in the low end Byte In principle, p one can use registers as he/she likes without sticking to these guidelines Doublew ord Word (one exception: In MIPS kernel registers might be rewritten by hardware on special occasions (exceptions) so it is better not to use them ) However, if the program is supposed to A be doubleword run together with ihothers (e.g. sits in consecutive under certain OS and/or if it uses registers or subroutines memory locations written by other people) according to the then big-endian it is order a good idea to stick to these (most significant rules word comes first)

A Simple MIPS Procedure Procedure to find the absolute value of an integer. $v0 ($a0) Solution The absolute value of x is x if x < 0 and x otherwise. abs: sub $v0,$zero,$a0 # put -($a0) in $v0; # in case ($a0) < 0 bltz $a0,done # if ($a0)<0 then done add $v0,$a0,$zero # else put ($a0) in $v0 done: jr $ra # return to calling program In practice, we seldom use such short procedures because of the overhead that they entail. In this example, we have 3-4 instructions of overhead for 3 instructions of useful computation. No register spilling here -- see next example

Typical Use of Registers $0 0 $zero $1 $at Reserved for assembler use $2 $v0 $3 $v1 Procedure results $4 $a0 $5 $a1 Procedure $6 $a2 arguments Saved $7 $a3 $8 $t0 $9 $t1 $10 $t2 $11 $t3 Temporary $12 $t4 values $13 $t5 $14 $t6 $15 $t7 $16 $s0 $17 $s1 $18 $s2 Saved $19 $s3 across Operands $20 $s4 procedure $21 $s5 calls $22 $s6 $23 $s7 $24 $t8 More $25 $t9 temporaries $26 $k0 $27 $k1 Reserved for OS (kernel) $28 $gp Global pointer $29 $sp Stack pointer $30 $fp Frame pointer Saved $31 $ra Return address A4-byte word sits in consecutive memory addresses according to the big-endian order (most significant byte has the lowest address) 3 2 1 0 Answer to Q2 Byte numbering: 3 2 1 0 When loading a byte into a register, it goes in the low end Byte In principle, p one can use registers as he/she likes without sticking to these guidelines Doublew ord Word (one exception: In MIPS kernel registers might be rewritten by hardware on special occasions (exceptions) so it is better not to use them ) However, if the program is supposed to A be doubleword run together with ihothers (e.g. sits in consecutive under certain OS and/or if it uses registers or subroutines memory locations written by other people) according to the then big-endian it is order a good idea to stick to these (most significant rules word comes first)

Six Steps in Execution of a Procedure (Answer to Q3) 1. Main routine (caller) places parameters in a place where the procedure (callee) can access them $a0 $a3: four argumentregisters 2. Caller transfers control to the callee 3. Callee acquires the storage resources needed 4. Callee performs the desired task 5. Callee places the result value in a place where the caller can access it $v0 $v1: two value registers it for result values 6. Callee returns control to the caller $ra: one return address register to return to the point of origin

Illustrating a Procedure Call main PC jal proc Prepare to call Prepare to continue proc Save, etc. jr $ra Restore Relationship between the main program and a procedure.

Spilling Registers (Answer to Q4) What if the callee needs to use more registers than allocated to argument and return values? callee uses a stack a last in first out queue high addr top of stack $sp One of the general registers, $sp ($29), is used to address the stack (which grows from high address to low address) add data dt onto the stack push $sp = $sp 4 data on stack at new $sp remove data from the stack pop data from stack at $sp $ $ 4 low addr $sp = $sp + 4

high addr low addr Allocating Space on the Stack Saved argument regs (if any) Saved return addr Saved local regs (if any) $fp The segment of the stack containing a procedure s savedregisters andlocal variables is its procedure frame (aka activation record) The frame pointer ($fp) Local arrays & points to the first word of structures (if the frame of a procedure any) $sp providing a stable base register for the procedure $fp is initialized using $sp on a call and $sp is restored using $fp on a return

Example: Parameters and Results low addr $sp $fp c b a. Frame for current procedure $sp Local variables Saved registers $fp z y. Old ($fp) c b a. Frame for current procedure Frame for previous procedure high addr Before calling After calling Use of the stack by a procedure.

More on Procedures Prolog Body spill all register to stack used by procedure expect for $t0 $t9 and the one used for returning values advance stack pointer ($sp) first then write to stack code of the procedure Epilog restore all used registers adjust stack pointer at the end ($sp)

Example of Using the Stack Saving $fp, $ra, and $s0 onto the stack and restoring them at the end of the procedure proc: sw $fp,-4($sp) # save the old frame pointer addi $fp,$sp,0 $sp # save ($sp) into $fp addi $sp,$sp, 12 # create 3 spaces on top of stack sw $ra,-8($fp) # save ($ra) in 2nd stack element sw $s0,-12($fp) # save ($s0) in top stack element $sp. ($s0) ($ra). ($fp). $sp lw $s0,-12($fp) # put top stack element in $s0 $fp lw $ra, -8($fp) # put 2nd stack element in $ra addi $sp,$fp, 0 # restore $sp to original state $fp lw $fp,-4($sp) # restore $fp to original state jr $ra # return from procedure Could be a good idea to modify the stack pointer first in epilog (before writing to stack) and last in prolog. Why?

Typical Use of Registers $0 0 $zero $1 $at Reserved for assembler use $2 $v0 $3 $v1 Procedure results $4 $a0 $5 $a1 Procedure $6 $a2 arguments Saved $7 $a3 $8 $t0 $9 $t1 $10 $t2 $11 $t3 Temporary $12 $t4 values $13 $t5 $14 $t6 $15 $t7 $16 $s0 $17 $s1 $18 $s2 Saved $19 $s3 across Operands $20 $s4 procedure $21 $s5 calls $22 $s6 $23 $s7 $24 $t8 More $25 $t9 temporaries $26 $k0 $27 $k1 Reserved for OS (kernel) $28 $gp Global pointer $29 $sp Stack pointer $30 $fp Frame pointer Saved $31 $ra Return address A4-byte word sits in consecutive memory addresses according to the big-endian order (most significant byte has the lowest address) 3 2 1 0 Byte numbering: 3 2 1 0 When loading a byte into a register, it goes in the low end Byte In principle, p one can use registers as he/she likes without sticking to these guidelines Doublew ord Word (one exception: In MIPS kernel registers might be rewritten by hardware on special occasions (exceptions) so it is better not to use them ) However, if the program is supposed to A be doubleword run together with ihothers (e.g. sits in consecutive under certain OS and/or if it uses registers or subroutines memory locations written by other people) according to the then big-endian it is order a good idea to stick to these (most significant rules word comes first)

Nested Procedure Calls main PC jal abc Prepare to call Prepare to continue abc Procedure abc Save xyz Procedure xyz jal xyz jr $ra Restore jr $ra Example of nested procedure calls.

Fibonacci numbers (Similar problem in HW4) F(n) = F(n 1)+F(n 2) ( F(1) = 1 F(2) = 1 n = 1 2 3 4 5 6 F(n) = 1 1 2 3 5 8 /* Recursive function in c */ int fib(int n) { } If (n==1 n==2) return 1; return fib(n 1)+fib(n 2);

Memory mapping

Big Picture More complicated picture for modern processors. Many details are missing Complication #1: IM and DM are caches: Fast but small memory Complication #2: Program are mapped to virtual address space: the mapping for the program and data in question should be aware of other programs and data (i.e. O/S) each program (process) is mapped to its own virtual address space Additional mechanism (implemented in SW and HW) are taking care of that (will be discussed later) Main memory Virtual memory Add HW + SW HW + SW 4 PC Read Address Instruction Memory Instruction Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Data 2 Write Data ALU Address Data Memory Write Data Read Data Sign Extend 16 32

Big Picture Assume that there is only one program mapped to physical memory Questions to answer Where to store code and where to store date? dt? Would stack structure be enough to keep all the data? What kind of data are typically present? A related question: How to pass more than one A related question: How to pass more than one parameter to procedure?

Address space (language and OS specific) A program s address space contains 4 ~ FFFF FFFF hex stack regions: stack: local variables, grows downward dynamic data (heap): space requested for pointers via malloc() ; resizes dynamically, grows upward static data: variables declared outside main, does not grow or shrink code: loaded when program code starts, ~ 0 does not change hex dynamic data static data For now, OS somehow Why stack grows from top to bottom? prevents accesses between stack and heap (gray hash lines). Wait for virtual memory

Memory Map in MIPS Hex address 00000000 00400000 Reserved Program 1 M words Text segment 63 M words Addressable with 16-bit signed offset 10000000 10008000 1000ffff Static data Dynamic data Data segment $28 $29 $30 $gp $fp $sp Stack 448 M words Stack segment 7ffffffc 80000000 Second half of address space reserved for memory-mapped I/O Overview of the memory address space in MiniMIPS.

Linked Lists vs. Arrays

Pointers (1/4) Sometimes you want to have a procedure increment a variable? What gets printed? void main() { int y = 5; AddOne( y); printf( y = %d\n, y); } $a0 void AddOne(int x) { x = x + 1; } y = 5 frame pointer for main $sp $fp lw $a0, 12($fp) jal AddOne AddOne: addi $t0, $a0, 1 jr $ra y

Pointers (2/4) Solved by passing in a pointer to our subroutine. Now what gets printed? void main() { int y = 5; AddOne(&y); printf( y = %d\n, y); } $a0 void AddOne(int *p) { *p = *p + 1; } y = 6 $sp $fp addi $a0, $fp, 12 jal AddOne AddOne: lw $t0, 0($a0) addi $t0, $t0,1 sw $t0, 0($a0) jr $ra y

Pointers (2.5/4) another way of correcting it Sometimes you want to have a procedure increment a variable? What gets printed? $sp y $fp void main() { int y = 5; y = AddOne( y); printf( y = %d\n, y); } $a0 int AddOne(int x) { x = x + 1; return x;} y = 6 lw $a0, 12($fp) jal AddOne sw $v0, 12($fp) AddOne: addi $v0, $a0, 1 jr $ra

Pointers (3/4) But what htif what htyou want changed is a pointer? What gets printed? $sp $fp q A[0] A[1] A[2] void main() { int A[3] = {50, 60, 70}; int *q = A; IncrementPtr( q); printf( *q = %d\n, *q); *q = 50 A q } void IncrementPtr(int $a0 *p) lw $0 $a0, 20($fp) ) jal IncPtr IncPtr: addi $t0, $a0, 1 jr $ra { p = p + 1; } 50 60 70

Pointers (4/4) Solution! Pass a pointer to a pointer, declared as **h Now what gets printed? void main() { int A[3] = {50, 60, 70}; int *q = A; IncrementPtr(&q); printf( *q = %d\n, *q); * 60 } $a0 $sp $fp *q = 60 A q q q A[0] A[1] A[2] addi $a0, $fp, 20 jal IncPtr IncPtr: lw $t0, 0($a0) addi $t0, $t0,4 sw $t0, 0($a0) jr $ra Note +4! void IncrementPtr(int **h) { *h = *h + 1; } 50 60 70

Arrays example void foo() { int *p, *q, x; int a[4]; p = (int *) malloc (sizeof(int)); q = &x; *p = 1; // p[0] would also work here printf("*p:%u, p:%u, &p:%u\n", *p, p, &p); *q = 2; // q[0] would also work here printf("*q:%u, q:%u, &q:%u\n", *q, q, &q); *a = 3; // a[0] would also work here printf("*a:%u, a:%u, &a:%u\n", *a, a, &a); 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60...... 40? 20? 2? 3?? 1... } p q x unnamed malloc space? 24 a *p:1, p:40, &p:12 *q:2, q:20, &q:16 *a:3, a:24, &a:24 An array name is not a variable

Example of array MIPS pseudocode Example of array C code int a[100]; main: void main () { int b[10]; int size; int *p; **** p = (int *)malloc(sizeof(int)*size); **** free(p); **** }.data a:.word 100.text addi $sp, $sp, 10*4 8 #reg to spill * 4; addi $fp, $sp, 10*4 + 8 + #reg to spill * 4; add $t0, $fp, 10*4 (address of base of array b) **** add $a0, $0, $t1 ($t1 has value of size*4) jal malloc (malloc returning memory address to $v0) **** sw $v0, 44($fp) (modify *p) add $a0, $v0, $0 jal free **** addi $sp, $sp, +10*4 + 8 + #reg to spill * 4; jr $ra malloc and malloc and free are an OS procedures

C structures A struct is a data structure composed from simpler data types. Like a class in Java/C++ but without methods or inheritance. i struct point { /* type definition */ int x; int y; }; void PrintPoint(struct point p) { printf( (%d,%d), p.x, p.y); } As always in C, the argument is passed by value a copy is made. struct point p1 = {0,10}; /* x=0, y=10 */ PrintPoint(p1);

C structures: Pointers to them Usually, more efficient to pass a pointer to the struct. The C arrow operator (->) dereferences and extracts a structure field with a single operator. The following are equivalent: struct point *p; /* code to assign to pointer */ printf( x is %d\n, (*p).x); printf( x is %d\n, p->x);

How big are structs? Recall C operator sizeof() which gives size in bytes (of type or variable) How big is sizeof(p)? struct p { char x; int y; }; 5 bytes? 8 bytes? Compiler may word align integer y

Array vs. Linked list Slowly changing size, order Quickly changing size, order Could be allocated More often dynamically y dynamically or statically (rarely statically) Contiguous location in Could be contiguous (when memory static) but most Fast traversal / no memory often not overhead but fixed structure Slower traversal / additional memory for storing pointers but flexible structure

Example of linked list C code Struct mylist { int value; struct mylist *next; struct mylist *prev; } In principle i can do this (can be allocated in any type of memory): struct mylist *list[100]; Most typically: void main(){ struct mylist *p p, *cur; ***** p = malloc(sizeof(struct mylist)*1); add(cur, p); ***** } delete(cur); ***** Linked list example Example of linked list MIPS pseudocode main: **** addi $a0, $0, 12 jal malloc (malloc returning memory address to $v0) add $a0, $0, $t1 ($t1 has address cur) add $a1, $0, $v0 jal addelement **** jal delete add $a0, $0, $t1 jal free **** jr $ra static dynamic stack

Deleting from doubly linked list example I

Deleting from doubly linked list example II

Memory Management

Memory Management How do we manage memory? Code, Static storage are easy: they never grow or shrink Stack space is also easy: stack frames are created and destroyed in last in, first out (LIFO) order Managing the heap is tricky: memory can be allocated / deallocated at any time

Heap Management Requirements Want malloc() and free() to run quickly. Want minimal memory overhead Want to avoid fragmentation* when most of our free memory is in many small chunks In this case, we might have many free bytes but not be able to satisfy a large request since the free bytes are not contiguous in memory. * This is technically called external fragmention

Heap Management An example Request R1 for 100 bytes Request R2 for 1 byte Memory from R1 is freed Request R3 for 50 bytes R2 (1 byte) R1 (100 bytes)

Heap Management An example Request R1 for 100 bytes Request R2 for 1 byte Memory from R1 is freed Request R3 for 50 bytes R2 (1 byte) R3? R3?

Example (K&R) Malloc/Free Implementation Each block of memory is preceded by a header that has two fields: size of the block and a pointer to the next block All free blocks are kept in a circular linked list, the pointer field is unused in an allocated block

Example Implementation malloc() searches the free list for a block that is big enough. If none is found, more memory is requested from the operating system. If what it gets can t satisfy the request, it fails. free() checks if the blocks adjacent to the freed block are also free If so, adjacent free blocks are merged (coalesced) into a single, larger free block Otherwise, the freed block is just added to the free list

Choosing a block in malloc() If there are multiple free blocks of memory that are big enough for some request, how do we choose which one to use? best fit: choose the smallest block that is big enough for the request first fit: choose the first block we see that is big enough next fit: like first fit but remember where we finished searching and resume searching from there

Tradeoffs of allocation policies Best fit: Tries to limit fragmentation but at the cost of time (must examine all free blocks for each malloc). Leaves lots of small blocks (why?) First fit: Quicker thanbest fit (why?) but potentially more fragmentation. Tends to concentrate small blocks at the beginning of the free list (why?) Next fit: Does not concentrate small blocks at front like first fit, should be faster as a result.

Compiling, Linking, and Loading Programs

The C Code Translation Hierarchy C program compiler assembly code assembler object code library routines linker machine code executable loader memory

Compiler Benefits Comparing performance for bubble (exchange) sort To sort 100,000 words with the array initialized to random values on a Pentium 4 with a 306clock 3.06 rate, a 533 MHz system bus, with 2 GB of DDR SDRAM, using Linux version 2.4.20 gcc opt Relative Clockcycles cycles Instrcount CPI performance (M) (M) None 1.00 158,615 114,938 1.38 O1 (medium) 237 2.37 66,990 37,470 179 1.79 O2 (full) 2.38 66,521 39,993 1.66 O3 (proc mig) 2.41 65,747 44,993 1.46 The unoptimized code has the best CPI, the O1 version has the lowest instruction count, but the O3 version is the fastest. Why?

Assembler Input: Assembly Language Code (e.g., foo.s for MIPS) Output: Object Code, information tables (e.g., foo.oo for MIPS) Reads and Uses Directives Replace Pseudoinstructions Produce Machine Language g Creates Object File

Assembler Directives Give directions to assembler, but do not produce machine instructions.text: Subsequent items put in user text segment (machine code).data: Subsequent items put in user data segment (binary rep of data in source file).globl sym: declares sym global land can be referenced from other files.asciiz str: Store the string str in memory and null terminate it.word w1 wn: Store the n 32 bit quantities in successive memory words

Producing Machine Language What about jumps (j and jal)? Jumps require absolute address. So, forward or not, still can t generate machine instruction without knowing the position of instructions in memory. What about references to data? la gets broken up into lui and ori These will require the full 32 bit address of the dt data. These can t be determined yet, so we create two tables

Symbol Table List of items in this file that may be used by other files. What are they? Labels: function calling Dt Data: anything in the.data section; variables ibl which may be accessed across files

Relocation Table List of items this file needs the address later. What are they? Any label jumped to: j or jal internal external (including lib files) Any piece of data such as the la instruction

Object File Format object file header: size and position of the other pieces of the object file text segment: the machine code data segment: binary representation of the data in the source file relocation information: identifies lines of code that need to be handled symbol table: list of this file s labels and data that can be referenced debugging information

Linker (1/3) Input: Object Code files, information tables (e.g., foo.o,libc.o for MIPS) Output: Executable Code (e.g., a.out for MIPS) Combines several object (.o) files into a single executable ( linking ) Enable Separate Compilation of files Changes to one file do not require recompilation of whole program Windows NT source was > 40 M lines of code! Old name Link Editor from editing the links in jump Old name Link Editor from editing the links in jump and link instructions

Linker (2/3).o file 1 text t 1 data 1 info 1.o file 2 text 2 data 2 info 2 Linker a.out Relocated text 1 Relocated text 2 Relocated data 1 Relocated data 2

Linker (3/3) Step 1: Take text segment from each.o file and put them together. Step 2: Take data segment from each.o file, put them together, and concatenate this onto end of text segments. Step 3: Resolve References Go through Relocation Table; handle each entry That is, fill in all absolute addresses

Acknowledgments Some of the slides contain material developed and copyrighted by M.J. Irwin (Penn state), B. Parhami (UCSB), D. Garcia (UCB) and instructor material for the textbook

Extra Material