COMP 506, Rice University, Spring 2018

Arrays and Functions

source code -> Front End -> IR -> Optimizer -> IR -> Back End -> target code

Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 506 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

Chapters 6 & 7 in EaC2e.
Expanding the Expression Grammar

The Classic, Left-Recursive, Expression Grammar

 1  Goal   -> Expr
 2  Expr   -> Expr + Term
 3          | Expr - Term
 4          | Term
 5  Term   -> Term * Factor
 6          | Term / Factor
 7          | Factor
 8  Factor -> ( Expr )
 9          | number
10          | ident

The grammar is great for x - 2 * y
Not all programs operate over the domain of ident and number
  More complex programs require more complex expressions
  No mention of array references, structure references, or function calls

COMP 412, Fall 2018
Expanding the Expression Grammar

The Classic, Left-Recursive, Expression Grammar

 1  Goal   -> Expr
 2  Expr   -> Expr + Term
 3          | Expr - Term
 4          | Term
 5  Term   -> Term * Factor
 6          | Term / Factor
 7          | Factor
 8  Factor -> ( Expr )
 9          | number
10          | ident
11          | ident [ Exprs ]     (array reference)
12          | ident ( Exprs )     (function invocation)

References to aggregates & functions
  First, the compiler must recognize expressions that refer to arrays, structures, and function calls
  Second, the compiler must have schemes to generate code to implement those abstractions
COMP 506, Lab 2: What Code Should The Compiler Generate for A[i,j]?

The compiler must generate code to compute the runtime address of A[i,j]

The compiler needs to know where A begins
  A was assigned a data area and an offset during storage layout (Lecture 9)
  Compile-time knowledge is tied to runtime behavior
  Assume that @A is the address of A's first element

The compiler must have a plan for A's internal layout
  The programming language usually dictates array element layout
  Three common choices: row-major order, column-major order, & indirection vectors

The compiler needs a formula to calculate the address of an array element
  General scheme: compute the element address, then issue a load (rvalue) or a store (lvalue)

The Concept: A is a 2 x 4 array with elements A[1,1] ... A[2,4]

Note: The demo documents do not specify how arrays are laid out. It is your choice.
Array Layout: Row-Major Order

Lay out the array as a sequence of consecutive rows; the rightmost subscript varies fastest:
  A[1,1], A[1,2], A[1,3], A[2,1], A[2,2], A[2,3]

Storage layout (row-major): 1,1  1,2  1,3  1,4  2,1  2,2  2,3  2,4

Stride-one access: successive references in the loop are to adjacent locations in virtual memory. Stride-one access maximizes spatial reuse & the effectiveness of hardware prefetch units. In row-major order, stride one is along rows.

Stride-one access:
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      A[i][j] = 0;

Row-major order is used for declared arrays in C (and most languages).
Array Layout: Column-Major Order

Lay out the array as a sequence of columns; the leftmost subscript varies fastest:
  A[1,1], A[2,1], A[1,2], A[2,2], A[1,3], A[2,3]

Storage layout (column-major): 1,1  2,1  1,2  2,2  1,3  2,3  1,4  2,4

In column-major order, stride-one access is along the columns.

Stride-one access:
  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      A[i][j] = 0;

Column-major order is used for declared arrays in FORTRAN.
Array Layout: Indirection Vectors

A vector of pointers to pointers to ... to values
  Takes much more space; trades arithmetic for indirection
  Not amenable to analysis

Storage layout: A is a vector of row pointers; each row (1,1 1,2 1,3 1,4 and 2,1 2,2 2,3 2,4) is a separate 1-d array.

Stride-one access: no reference pattern guarantees stride-one access in cache, unless the rows are contiguous.

Used for arrays in Java, and for int **array; in C.
Welcome to high-school algebra: Computing an Address for an Array Element

The general scheme for an array reference: compute an address, then issue a load.

A[i]: @A + (i - low) x sizeof(A[1])
In general: base(A) + (i - low) x sizeof(A[1])

@A is the base address of A. Depending on how A is declared, @A may be an offset from the pointer to the procedure's data area, an offset from some global label, or an arbitrary address (e.g., a pointer). The first two are compile-time constants.

low is the lower bound on the index.
Computing an Address for an Array Element

A[i]: @A + (i - low) x sizeof(A[1])
In general: base(A) + (i - low) x sizeof(A[1])

int A[1:10] => low is 1
Make low be 0 for faster access (saves a subtraction)
sizeof(A[1]) is often a power of 2, known at compile time => use a shift for speed

Take-away point: vector access costs a couple of operations.
Assume sizeof(A[1]) is 1, for simplicity.

Computing an Address for an Array Element

A[i]: @A + (i - low) x sizeof(A[1])
In general: base(A) + (i - low) x sizeof(A[1])

What about A[i1,i2] in A[1:5,1:4]?
  low1:  lower bound on row index (1)
  high1: upper bound on row index (5)
  low2:  lower bound on column index (1)
  high2: upper bound on column index (4)

Row-major order, two dimensions:
  @A + ((i1 - low1) x (high2 - low2 + 1) + i2 - low2) x sizeof(A[1])

Example: A[4,3], with A laid out as 5 rows of 4 elements
  (i1 - low1) is 3
  (high2 - low2 + 1) is 4
  3 x 4 is 12
  (i2 - low2) is 2
  Combined, the two terms take us to the start of A[4,3]

Address computation took 3 adds, 3 subtracts, 1 multiply, and 1 shift.
Array Layout: Row-Major Order

All three schemes generalize to higher dimensions.

Lay out as a sequence of consecutive rows; the rightmost subscript varies fastest:
  A[1,1], A[1,2], A[1,3], A[2,1], A[2,2], A[2,3]

Storage layout: 1,1  1,2  1,3  1,4  2,1  2,2  2,3  2,4

Address polynomial for A[i,j]:
  @A + ((i1 - low1) x (high2 - low2 + 1) + i2 - low2) x sizeof(A[1])

Why does C start array indices at zero? To avoid those subtractions.

This polynomial follows Horner's rule for evaluation.
Array Layout: Column-Major Order

Lay out as a sequence of columns; the leftmost subscript varies fastest:
  A[1,1], A[2,1], A[1,2], A[2,2], A[1,3], A[2,3]

Storage layout: 1,1  2,1  1,2  2,2  1,3  2,3  1,4  2,4

Address polynomial for A[i,j]:
  @A + ((i2 - low2) x (high1 - low1 + 1) + i1 - low1) x sizeof(A[1])
Array Layout: Indirection Vectors

A vector of pointers to pointers to ... to values
  Takes much more space; trades arithmetic for indirection
  Not amenable to analysis

Storage layout: A is a vector of row pointers; each row is a separate 1-d array.

Address "polynomial" for A[i,j]: *(A[i1])[i2], where each of the [i]'s is, itself, a 1-d array reference.

Sketch of the code:
  A[i1]:  sub  i1, low => r1
          add  @A, r1  => r2
  *[i2]:  load r2      => r3
          sub  i2, low => r4
          add  r3, r4  => r5
          load r5      => r6

Back when systems supported 1- or 2-cycle indirect load operations, this scheme was efficient. It replaces a multiply & an add with an indirect load.
Array Address Calculations

In scientific codes, array address calculations are a major cost
  Each additional dimension adds more arithmetic
  Efficiency in address calculation is a critical issue for performance
  A[i+1,j], A[i,j], and A[i,j+1] should all have some common terms
    Horner's-rule evaluation hides the common terms
  Improving these calculations is a matter of algebra

Improving the efficiency of array address calculations has been a major focus of code optimization (& hardware design) for the last 40 years
  Generate better code
    Transform it to fit the context
  Design memory systems that support common array access patterns
    Optimize hardware for stride-one access
    Include sophisticated prefetch engines that recognize access patterns
Array Address Calculations: Better Code for A[i,j]

Distribute x over + and -. In row-major order:
  @A + ((i - low1) x (high2 - low2 + 1) + j - low2) x w
can be re-distributed to
  @A + (i - low1) x (high2 - low2 + 1) x w + (j - low2) x w
which can be re-distributed to
  @A + i x (high2 - low2 + 1) x w + j x w - (low1 x (high2 - low2 + 1) x w) - (low2 x w)

If low_i, high_i, and w are known, the last terms form a constant.
Define @A0 as @A - (low1 x (high2 - low2 + 1) x w + low2 x w), and len2 as (high2 - low2 + 1).

Then the address expression becomes
  @A0 + (i x len2 + j) x w,  where w = sizeof(A[1,1])

If @A is known, @A0 is a known compile-time constant.
Just a couple of operations: 2 adds, 1 multiply, and 1 shift.
Does Layout Matter? Which loop is faster?

All data collected on a quiescent, multiuser Intel T9600 @ 2.8 GHz, using code compiled with gcc 4.1 -O3, on a 10,000 x 10,000 array. All three loops have distinct performance.

  for (x=0; x<n; x++)          0.51 seconds  (~5x the pointer loop)
    for (y=0; y<n; y++)
      A[x][y] = 0;

  for (y=0; y<n; y++)          1.65 seconds  (~15x the pointer loop)
    for (x=0; x<n; x++)
      A[x][y] = 0;

  p = &A[0][0];                0.11 seconds
  t = n * n;
  for (x=0; x<t; x++)
    *p++ = 0;

Conventional wisdom suggests using
  bzero((void *) &A[0][0], (size_t) n * n * sizeof(int));
which took 0.52 seconds on the same 10,000 x 10,000 array.
Expanding the Expression Grammar

The Classic, Left-Recursive, Expression Grammar

 1  Goal   -> Expr
 2  Expr   -> Expr + Term
 3          | Expr - Term
 4          | Term
 5  Term   -> Term * Factor
 6          | Term / Factor
 7          | Factor
 8  Factor -> ( Expr )
 9          | number
10          | ident
11          | ident [ Exprs ]     (array reference)
12          | ident ( Exprs )     (function invocation)

References to aggregates & functions
  First, the compiler must recognize expressions that refer to arrays, structures, and function calls
  Second, the compiler must have schemes to generate code to implement those abstractions
Function Calls

Another possible leaf in an expression tree is a function call:  r <- f(x) + c

What code does the compiler generate for f(x)?
  The compiler may not have access to the code for f() (it may not be written yet)
  Another compilation, or another compiler, may have translated f()
  f(x) might be a system call, such as mmap() or malloc()
  Programmers expect such calls to just work

Functions and procedures are a fundamental unit of abstraction in PLs
  A call creates a new, known, and controlled environment (names, access)
  Compilers & the system need a convention for how procedures interact
  The linkage convention is the social contract that makes it all work
Linkage Convention

The linkage convention plays a critical role in the construction of systems
  Makes separate compilation possible
    The compiler can generate code to the convention's specifications
    Adherence to the convention means that things just work
  Divides responsibility for the runtime environment & behavior among the caller, the callee, & the system
    Naming and address-space management
    Control flow among functions & procedures (&, to some extent, processes)

The linkage convention is typically a system-wide convention
  Allows interoperability among languages & compilers
    Code compiled with gcc can link with code from clang
  Admits only limited opportunities for customization & optimization
Linkage Convention: High-Level View

The linkage must:
  Preserve & protect the caller's environment from the callee (& its callees)
    For example, it must preserve values in registers that are live across the call
  Create a new environment for the callee (a new name space)
    At runtime, it creates a local data area for the callee & ties it to the context
  Map names & resources from the caller into the callee, as needed

To accomplish these goals:
  The convention divides responsibility between caller & callee
  Caller & callee must use the same set of conventions
    Implies limited opportunities for customization & optimization
  We expect compiled code to work even when created with different compilers, perhaps in different languages

(Call-graph example on the slide: fee, fie, & foe — foe must operate in multiple distinct calling contexts.)
Linkage Convention

A procedure or function has two distinct representations:

Code that implements the procedure
  Code for all the statements
  Code to implement the linkage convention
  Labels on entry points and return points
    Use name mangling to create consistent labels

Data that represents an invocation
  Activation record (AR)
    Local data + saved context of the caller
    Control information
    Storage to preserve the caller's environment
    Chain ARs together to create context
  Static data area (name mangling)

(Diagram on the slide: code for fee & foe with mangled labels — @fee_ and @foe_ on the prologs, @fee_r1 on a post-return point, #fee_ and #foe_ on the static data areas — plus fee's AR and foe's AR, with the ARP pointing at the current AR.)
The Meta Issue

One of the most difficult aspects of understanding how compilers work is keeping track of what happens at compiler-design time, compiler-build time, compile time, and run time.

At compile time, the compiler emits code for the application program, including all of its procedures (and their prologs, epilogs, & call sequences)
  The compiler uses symbol tables to keep track of names
  Those tables are compile-time data structures

At runtime, those sequences (prolog, epilog, pre-call, & post-return) create, populate, use, and destroy ARs for invocations of procedures
  The running code stores data (both program data and control information) in ARs
  Those ARs are run-time data structures

To confuse matters further, the compiler may preserve the symbol tables so that the runtime debugging environment can locate variables and values, and locate the various data areas and the ARs.
Linkage Convention

A procedure or function has two distinct representations:

Generated at compile time, executed at runtime (managed by the linkage code):
Code that implements the procedure
  Code for all the statements
  Code to implement the linkage convention
  Labels on entry points and return points
    Use name mangling to create consistent labels

Created & used at runtime:
Data that represents an invocation
  Activation record (AR)
    Local data + saved context of the caller
    Control information
    Storage to preserve the caller's environment
    Chain ARs together to create context
  Static data area (name mangling)

(Diagram on the slide: code for fee & foe with mangled labels, plus fee's AR and foe's AR reached through the ARP.)
The Code: Procedure Linkages

Standard procedure linkage:

  procedure main (caller)        function sqrt (callee)
    prolog                         prolog
    z = a*a + b*b                  ...
    pre-call                       epilog
    x = sqrt(y, z)
  L: post-return
    ...
    epilog

Each procedure has
  A standard prolog
  A standard epilog
Each call uses
  A pre-call sequence
  A post-return sequence

The code for the pre-call and post-return sequences is completely predictable from the call site. It depends on the number & the types of the actual parameters.

Note: The gap between pre-call & post-return code is artistic license.
Digression: Name Mangling

Compilers use assembly labels to bridge the gap across separate compile steps, and from compile time to runtime.

How can this possibly work? It is a social contract among the compilers on a given system
  Agreement on how to form a label from a procedure or variable name
  The label is a direct function of the name and its purpose

For a procedure foo:
  Code entry point might be #_foo
  Static data area might be #_foo_
  Profile data might be stored at #_@foo_
For a global variable baz:
  Start of the variable might be at #_%baz_

Details:
  Must use characters that are not valid in identifiers
  Must be consistent across all implementations that are linked together
    System-wide agreement is typical
The Code: Procedure Linkages

(Standard linkage, as on the previous slide: caller main with prolog, pre-call, post-return, & epilog; callee sqrt with prolog & epilog.)

These four code sequences:
  Allocate and initialize a new AR for the callee
  Populate that AR with needed values & context
  Preserve parts of the caller's environment that might be modified by the callee
    Contents of registers, pointer to its activation record, ...
  Handle parameter passing
    Evaluate parameters in the caller
    Make them available in the callee
  Transfer control from caller to callee & back (LIFO behavior)
The Data: Activation Records

A procedure's activation record (AR) combines the control information needed to manage procedure calls (return address, register save area, addressability, & return value) with the procedure's local data area(s).

Each time a procedure is invoked, or activated, it needs an AR
  Allocated on a runtime stack or in the runtime heap
  The pre-call & post-return sequences manage the ARs
    Pre-call creates the AR & initializes it
    Post-return cleans up the AR & deallocates it
  The prolog & epilog sequences manipulate environment data in the AR
    Create & destroy parts of the new environment
    Preserve & restore parts of the old environment (depends on needs)

ARs are runtime structures, laid out at compile time. The code cannot refer to ARs with mangled labels, because each invocation of a procedure creates a unique AR. Thus, the runtime environment has an AR pointer, the ARP.

Meta-Issue: Activation records are created, used, and destroyed at runtime. The code to maintain them is emitted at compile time.