CSC 467 Lecture 13-14: Semantic Analysis

Recall
    Parsing translates the token stream into a parse tree.
Today
    How to build trees: syntax-directed translation.
    How to add information to trees: semantic analysis.

On Tree Traversals

Trees are classic data structures. Trees have nodes and edges, so they are a special case of graphs. Tree edges are directional, with the roles "parent" and "child" attributed to the source and destination of the edge. A tree has the property that every node has zero or one parent. A node with no parent is called a root. A node with no children is called a leaf. A node that is neither a root nor a leaf is an internal node. Trees have a size (total number of nodes), a height (maximum count of nodes on a path from the root to a leaf), and an "arity" (maximum number of children of any one node). Parse trees are k-ary: each node has a variable number of children, bounded by a value k determined by the grammar. You may wish to consult your old data structures book, or look at some books from the library, to learn more about trees if you are not totally comfortable with them.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdarg.h>

    struct tree {
       short label;            /* what production rule this came from */
       short nkids;            /* how many children it really has */
       struct tree *child[1];  /* array of children, size varies 0..k */
    };

    struct tree *alctree(int label, int nkids, ...)
    {
       int i;
       va_list ap;
       struct tree *ptr = malloc(sizeof(struct tree) +
                                 (nkids-1) * sizeof(struct tree *));
       if (ptr == NULL) {
          fprintf(stderr, "alctree out of memory\n");
          exit(1);
       }
       ptr->label = label;
       ptr->nkids = nkids;
       va_start(ap, nkids);
       for (i = 0; i < nkids; i++)
          ptr->child[i] = va_arg(ap, struct tree *);
       va_end(ap);
       return ptr;
    }

Besides a function to allocate trees, you need to write one or more recursive functions to visit each node in the tree, either top to bottom (preorder) or bottom to top (postorder). You might perform many different traversals of the tree in order to write a whole compiler: check types, generate machine-independent intermediate code, analyze the code to make it shorter, and so on. You can write four or more different traversal functions, or you can write one traversal function that does different work at each node, determined by passing in a function pointer to be called for each node.

    void postorder(struct tree *t, void (*f)(struct tree *))
    {
       /* postorder means visit each child, then do work at the parent */
       int i;
       if (t == NULL) return;

       /* visit each child */
       for (i = 0; i < t->nkids; i++)
          postorder(t->child[i], f);

       /* do work at parent */
       f(t);
    }

You would then be free to write as many little helper functions as you want, for different tree traversals, for example:

    void printer(struct tree *t)
    {
       if (t == NULL) return;
       printf("%p: %d, %d children\n", (void *)t, t->label, t->nkids);
    }

Semantic Analysis

Semantic ("meaning") analysis refers to a phase of compilation in which the input program is studied in order to determine what operations are to be carried out. The two primary components of a classic semantic analysis phase are variable reference analysis and type checking. These components both rely on an underlying symbol table.

What we have at the start of semantic analysis is a syntax tree that corresponds to the source program as parsed using the context-free grammar. Semantic information is added by annotating grammar symbols with semantic attributes, which are defined by semantic rules. A semantic rule is a specification of how to calculate a semantic attribute that is to be added to the parse tree. So the input is a syntax tree... and the output is the same tree, only "fatter" in the sense that nodes carry more information.
Another output of semantic analysis is a set of error messages reporting many kinds of semantic errors. Two typical examples of semantic analysis include:
variable reference analysis
    The compiler must determine, for each use of a variable, which variable declaration corresponds to that use. This depends on the semantics of the source language being translated.
type checking
    The compiler must determine, for each operation in the source code, the types of the operands and resulting value, if any.

Notations used in semantic analysis:

syntax-directed definitions
    high-level (declarative) specifications of semantic rules
translation schemes
    semantic rules and the order in which they get evaluated

In practice, attributes get stored in parse tree nodes, and the semantic rules are evaluated either (a) during parsing (for easy rules) or (b) during one or more (sub)tree traversals.

Two Types of Attributes:

synthesized attributes
    computed from information contained within one's children. These are generally easy to compute, even on the fly during parsing.
inherited attributes
    computed from information obtained from one's parent or siblings. These are generally harder to compute. Compilers may be able to jump through hoops to compute some inherited attributes during parsing, but depending on the semantic rules this may not be possible in general. Compilers resort to tree traversals to move semantic information around the tree to where it will be used.

Attribute Examples

Isconst and Value

Not all expressions have constant values; the ones that do may allow various optimizations.

    CFG              Semantic Rules
    E1 : E2 + T      E1.isconst = E2.isconst && T.isconst
                     if (E1.isconst) E1.value = E2.value + T.value
    E : T            E.isconst = T.isconst
                     if (E.isconst) E.value = T.value
    T1 : T2 * F      T1.isconst = T2.isconst && F.isconst
                     if (T1.isconst) T1.value = T2.value * F.value
    T : F            T.isconst = F.isconst
                     if (T.isconst) T.value = F.value
    F : ( E )        F.isconst = E.isconst
                     if (F.isconst) F.value = E.value
    F : ident        F.isconst = FALSE
    F : intlit       F.isconst = TRUE
                     F.value = intlit.ival

Symbol Table Module

Symbol tables are used to resolve names within name spaces. Symbol tables are generally organized hierarchically according to the scope rules of the language. Although initially concerned with simply storing the names of the various symbols that are visible in each scope, symbol tables take on additional roles in the remaining phases of the compiler. In semantic analysis, they store type information. For code generation, they store memory addresses and sizes of variables.

mktable(parent)
    creates a new symbol table whose scope is local to (or inside) parent
enter(table, symbolname, type, offset)
    inserts a symbol into a table
lookup(table, symbolname)
    looks up a symbol in a table; returns a structure pointer that includes the symbol's type and offset. lookup operations are often chained together, progressively from the most local scope on out to the global scope.
addwidth(table)
    sums the widths of all entries in the table ("widths" = #bytes; sum of widths = #bytes needed for an "activation record" or "global data section"). Do not worry about this method until you wish to implement code generation.
enterproc(table, name, newtable)
    enters the local scope of the named procedure

Variable Reference Analysis

The simplest use of a symbol table would check:
    for each use of a variable, has it been declared? (undeclared error)
    for each declaration, is it already declared? (redeclared error)
Reading Tree Leaves

In order to work with your tree, you must be able to tell, preferably trivially easily, which nodes are tree leaves and which are internal nodes, and for the leaves, how to access the lexical attributes.

Options:
1. encode in the parent what the types of the children are
2. encode in each child what its own type is (better)

How do you do option #2 here? Perhaps the best approach to all this is to unify the tokens and parse tree nodes with something like the following, where perhaps an nkids value of -1 is treated as a flag that tells the reader to use lexical information instead of pointers to children:

    struct node {
       int code;    /* terminal or nonterminal symbol */
       int nkids;
       union {
          struct token *leaf;    /* lexical attributes, if a leaf */
          struct node *kids[9];
       } u;
    };

There are actually nonterminal symbols with zero children (a nonterminal with a righthand side with zero symbols), so you don't necessarily want to use an nkids of 0 as your flag to say that you are a leaf.

Type Checking

Perhaps the primary component of semantic analysis in many traditional compilers is the type checker. In order to check types, one first must have a representation of those types (a type system), and then one must implement comparison and composition operators on those types using the semantic rules of the source language being compiled. Lastly, type checking will involve adding (mostly) synthesized attributes through those parts of the language grammar that involve expressions and values.

Type Systems

Types are defined recursively according to rules defined by the source language being compiled. A type system might start with rules like:

    Base types (int, char, etc.) are types
    Named types (via typedef, etc.) are types
    Types composed using other types are types, for example:
    o array(T, indices) is a type. In some languages indices always start with 0, so array(T, size) works.
    o T1 x T2 is a type (specifying, more or less, the tuple or sequence T1 followed by T2; x is a so-called cross-product operator)
    o record((f1 x T1) x (f2 x T2) x ... x (fn x Tn)) is a type
    o in languages with pointers, pointer(T) is a type
    o (T1 x ... x Tn) -> Tn+1 is a type, denoting a function mapping parameter types to a return type

In some languages, type expressions may contain variables whose values are types. In addition, a type system includes rules for assigning these types to the various parts of the program; usually this will be performed using attributes assigned to grammar symbols.

Representing C (C++, Java, etc.) Types

The type system is represented using data structures in the compiler's implementation language. In the symbol table and in the parse tree attributes used in type checking, there is a need to represent and compare source language types. You might start by trying to assign a numeric code to each type, kind of like the integers used to denote each terminal symbol and each production rule of the grammar. But what about arrays? What about structs? There are an infinite number of types; any attempt to enumerate them will fail. Instead, you should create a new data type to explicitly represent type information. This might look something like the following:

    struct c_type {
       int base_type;    /* 1 = int, 2 = float, ... */
       union {
          struct array {
             int size;
             struct c_type *elemtype;
          } a;
          struct c_type *p;    /* pointer target type */
          struct struc {
             char *label;
             struct field **f;
          } s;
       } u;
    };

    struct field {
       char *name;
       struct c_type *elemtype;
    };

Given this representation, how would you initialize a variable to represent each of the following types?

    int [10][20]
    struct foo { int x; char *s; }

Example Semantic Rules for Type Checking

    grammar rule         semantic rule
    E1 : E2 PLUS E3      E1.type = check_types(PLUS, E2.type, E3.type)
Here check_types() returns a (struct c_type *) value; one of the values it should be able to return is Error. The operator (PLUS) is included in the check_types function because behavior may depend on the operator -- the result type for array subscripting works differently than the result type for the arithmetic operators, which may work differently (in some languages) than the result type for logical operators that return booleans.

Type Promotion and Type Equivalence

When is it legal to perform an assignment x = y? When x and y are identical types, sure. Many languages such as C have automatic promotion rules for scalar types such as shorts and longs. The results of type checking may include not just a type attribute; they may also include a type conversion, which is best represented by inserting a new node in the tree to denote the promoted value. Example:

    int x;
    long y;
    y = y + x;

For records/structures, some languages use name equivalence, while others use structure equivalence. Features like typedef complicate matters. If you have a new type name MY_INT that is defined to be an int, is it compatible to pass as a parameter to a function that expects regular ints? Object-oriented languages also get interesting during type checking, since subclasses usually are allowed anyplace their superclass would be allowed.

Implementing Structs

1. Storing and retrieving structs by their label -- the struct label is how structs are identified. You do not have to do typedefs and such. The labels can be keys in a separate hash table, similar to the global symbol table. You can put them in the global symbol table so long as you can tell the difference between them and variable names.
2. You have to store field names and their types, from where the struct is declared. You could use a hash table for each struct, but a linked list is OK as an alternative.
3. You have to use the struct information to check the validity of each dot operator, as in rec.foo. To do this you'll have to look up rec in the symbol table, where you store rec's type. rec's type must be a struct type for the dot to be legal, and that struct type should include a hash table or linked list that gives the names and types of the fields -- where you can look up the name foo to find its type.