Compiler Design Concepts. Syntax Analysis


Introduction The first task is to break the source text into meaningful words called tokens. For example, the source line newval = oldval + 12 is converted by Lexical Analysis into the token stream id = id + num. The order of the tokens is not important at this stage; a line such as 12 + oldval = newval will also be accepted, because the Lexical Analyzer's purpose is simply to extract the tokens. The lexemes are recorded in the symbol table:

Token   Lexeme
id      newval
id      oldval
num     12

The only requirement is that there be no character combination that cannot pass as a token, e.g., 12oldval.

Syntax After verifying that there is no lexical error, it is time to check the order of the tokens. The Syntax Analysis phase receives the token stream id = id + num and should be able to say whether this arrangement is valid or not. Observe that the actual lexemes are not used here: Syntax Analysis is not interested in whether the statement was oldval = newval + 12 or newval = oldval + 12. Only the structure is important, just as Lexical Analysis was not interested in the order of the tokens.

Syntax But the compiler must not forget the lexemes; they will be used later. Each token in the stream id = id + num therefore carries a pointer to its symbol table entry:

Token   Lexeme
id      oldval
id      newval
num     12

Syntax Okay, now, how do we check whether the syntax is correct or not? There must be some rules defined which specify which combinations are valid. These rules are written down as formulas called productions, for example

S → id = id + num

This means that if the combination id = id + num occurs, it can be called a statement, symbolized as S. So to check the token stream, Syntax Analysis sees whether id = id + num matches such a production, and then whether S fits into the total scheme.

Syntax Most constructs in programming languages are easily expressed by Context Free Grammars (CFGs). Under a CFG, a program is seen as built from syntactic categories arranged in a proper order, much as natural languages are built from parts of speech. Examples of syntactic categories are expressions, statements, declarations, etc. Each syntactic category is a valid arrangement of tokens; a syntactic category can also be made of other syntactic categories, bottoming out in tokens. Syntactic categories are designated as non terminals. Recall that a non terminal can be derived into any combination of terminals and non terminals, but eventually it should be all tokens.

Syntax The entire source program can be considered one syntactic category, i.e., a non terminal, say P. A statement (of whatever type) can also be considered a syntactic category, i.e., a non terminal, say S. So, as a rule, we can write

P → S; S;

Now S, i.e., a statement, can have various expansions. For example, an assignment statement can look like

S → id := id + id * number ;

Syntax Let's take another string: myval = newval * 10. It will be converted to the token stream id = id * num. If there is another production

S → id = id * num

then this combination will also be considered valid.

Syntax The source code

newval=oldval+12; myval=newval*10;

is converted by Lexical Analysis into the token stream

id = id + num ; id = id * num ;

With the productions S → id = id + num and S → id = id * num, the stream will be reduced to S;S;. We can then check whether S;S; is valid: it will be, if there is a production P → S;S;. But combinations like S+S or S*S will not be valid.

Symbol Table:
Token   Lexeme
id      newval
id      oldval
num     12
id      myval
num     10

Syntax So, any combination of tokens that can be reduced, meaning one that exists on the right hand side of a production, is valid. But there are infinitely many valid combinations, e.g.,

id = id - id
id = id * id
id = id + id - id
id = id + id - num
id = id * id - id
...

It is impossible to list them all. We must have a limited set of rules from which all valid combinations can be generated. Just like English grammar: a finite number of words, but infinitely many combinations, that is, infinitely many sentences.

Syntax
This is the house that Jack built.
This is the malt that lay in the house that Jack built.
This is the rat that ate the malt that lay in the house that Jack built.
This is the cat that killed the rat that ate the malt that lay in the house that Jack built.
This is the dog that chased the cat that killed the rat that ate the malt that lay in the house that Jack built.

Syntax There are limited types of tokens, but their combinations are infinite. Take for example arithmetic expressions:

E → E + E
E → E - E
E → E * E
E → E / E
E → id
E → num

Using the above productions, we can validate any arithmetic expression containing variables, numbers, add, sub, mult and div. This is a context free grammar. E is a non terminal: it has to stay on the LHS of at least one production, and it can also stay on the RHS of some productions. id, num, +, -, *, /, = are terminals, which are tokens; they stay only on the RHS of productions.

Syntax Example derivations (in each step, one non terminal is chosen and an appropriate production is applied to it):

E ⇒ E + E ⇒ E + id ⇒ id + id
E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
E ⇒ E + E ⇒ E + E - E ⇒ E + E - id ⇒ E + id - id ⇒ id + id - id
E ⇒ E + E ⇒ E + E - E ⇒ E + E - num ⇒ E + id - num ⇒ id + id - num
E ⇒ E * E ⇒ E * E - E ⇒ E * E - id ⇒ E * id - id ⇒ id * id - id
E ⇒ E * E ⇒ E * E - E ⇒ E * E - E / E ⇒ id * E - E / E ⇒ id * id - E / E ⇒ id * id - id / E ⇒ id * id - id / id

One has to choose the appropriate production at each step.

Syntax Recursive use of productions over terminals and non terminals results in valid statements.

Defining a grammar: a Context Free Grammar consists of
1. A set of terminals (T)
2. A set of non terminals (V)
3. A set of productions (P)
4. A start symbol, which is a non terminal (S)

The start symbol is the non terminal from which the chain of derivations starts; there can be only one. In the example, E is the start symbol. A production is of the form

N → w

where N is a non terminal and w is a string of terminals and non terminals.
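The four components above can be written down directly as data. A minimal sketch in Python for the expression grammar from the previous slide (the variable names are illustrative, not from any particular library):

```python
# A CFG as plain Python data: terminals T, non terminals V,
# productions P (LHS -> list of alternative RHS symbol lists),
# and the start symbol S.
terminals = {"id", "num", "+", "-", "*", "/"}
non_terminals = {"E"}
productions = {
    "E": [["E", "+", "E"], ["E", "-", "E"], ["E", "*", "E"],
          ["E", "/", "E"], ["id"], ["num"]],
}
start_symbol = "E"

# Sanity check: every LHS is a non terminal, and every symbol on a
# right hand side is either a terminal or a non terminal.
for lhs, alternatives in productions.items():
    assert lhs in non_terminals
    for rhs in alternatives:
        assert all(s in terminals or s in non_terminals for s in rhs)
```

Representing the RHS as a list of symbols (rather than one string) keeps multi-character tokens like id and num unambiguous.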

Syntax A derivation happens when a non terminal is replaced by a string of terminals and non terminals, as defined by some production:

E ⇒ E + E ⇒ E + E - E ⇒ E + E - num ⇒ E + id - num ⇒ id + id - num

The combination of terminals and non terminals at each stage of a derivation is called a sentential form.

Let's get a little cryptic. Let N be a non terminal and let α, β, γ be strings of terminals and non terminals. If there exists a production N → γ, then in a sentential form N can be replaced by γ. So αNβ can be rewritten as αγβ.
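The αNβ ⇒ αγβ rule is small enough to execute directly. A sketch of one derivation step in Python, with sentential forms as lists of symbols (the function name is illustrative):

```python
# One derivation step: rewrite the first occurrence of non terminal n
# in the sentential form with gamma, the body of a production n -> gamma.
def derive(sentential_form, n, gamma):
    i = sentential_form.index(n)  # position of N (raises ValueError if absent)
    return sentential_form[:i] + gamma + sentential_form[i + 1:]

# E => E + E => id + E => id + num
form = ["E"]
form = derive(form, "E", ["E", "+", "E"])
form = derive(form, "E", ["id"])
form = derive(form, "E", ["num"])
print(" ".join(form))  # id + num
```

Because `index` finds the first occurrence, this sketch always rewrites the leftmost copy of N; a real parser may pick any occurrence, as the following slides discuss.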

Derivation Definition: Given a context-free grammar G with start symbol S, terminal symbols T and productions P, the language L(G) that G generates is defined to be the set of strings of terminal symbols that can be obtained by derivation from S using the productions in P, i.e., the set

L(G) = { w ∈ T* | S ⇒* w }

As an example, look at the grammar

T → R
T → aTc
R → ε
R → RbR

This grammar generates the string aabbbcc by the derivation shown next. For clarity, at each step the non terminal that is rewritten in the following step is the one indicated by the production applied.

Derivation One possible derivation of the string aabbbcc using the given grammar:

Production applied    Sentential form
                      T
1. T → aTc            aTc
2. T → aTc            aaTcc
3. T → R              aaRcc
4. R → RbR            aaRbRcc
5. R → ε              aabRcc
6. R → RbR            aabRbRcc
7. R → RbR            aabRbRbRcc
8. R → ε              aabbRbRcc
9. R → ε              aabbbRcc
10. R → ε             aabbbcc

In this derivation, we have applied derivation steps sometimes to the leftmost non terminal, sometimes to the rightmost, and sometimes to a non terminal that was neither.

Derivation - Parsing The Syntax Analysis phase checks the structure of the source code statements. This is called parsing. There are two common methods:

1. Trying to generate the statement from the start symbol by applying production rules. This is called top down parsing. We have generated the string aabbbcc from the start symbol T: T ⇒* aabbbcc.

2. Taking the string and applying productions in reverse to arrive at the start symbol. This is called bottom up parsing: aabbbcc is reduced step by step back to T.

Derivation However, since derivation steps are local, the order does not matter. So we might as well decide to always rewrite the leftmost non terminal:

Production applied    Sentential form
                      T
1. T → aTc            aTc
2. T → aTc            aaTcc
3. T → R              aaRcc
4. R → RbR            aaRbRcc
5. R → RbR            aaRbRbRcc
6. R → ε              aabRbRcc
7. R → RbR            aabRbRbRcc
8. R → ε              aabbRbRcc
9. R → ε              aabbbRcc
10. R → ε             aabbbcc

A derivation that always rewrites the leftmost non terminal is called a leftmost derivation. Similarly, a derivation that always rewrites the rightmost non terminal is called a rightmost derivation.
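The leftmost derivation above can be replayed mechanically: at each step, find the leftmost non terminal and apply the next production in the list. A small Python sketch (function and variable names are illustrative):

```python
# Replay a leftmost derivation of aabbbcc with the grammar
# T -> R | aTc,  R -> epsilon | RbR.
non_terminals = {"T", "R"}

def leftmost_rewrite(form, lhs, rhs):
    # Find the leftmost non terminal and check it matches the production.
    i = next(i for i, s in enumerate(form) if s in non_terminals)
    assert form[i] == lhs
    return form[:i] + rhs + form[i + 1:]

steps = [("T", list("aTc")), ("T", list("aTc")), ("T", ["R"]),
         ("R", list("RbR")), ("R", list("RbR")), ("R", []),   # R -> epsilon
         ("R", list("RbR")), ("R", []), ("R", []), ("R", [])]

form = ["T"]
for lhs, rhs in steps:
    form = leftmost_rewrite(form, lhs, rhs)
print("".join(form))  # aabbbcc
```

An epsilon production is simply an empty replacement list, which is why ε disappears from the derived string.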

Derivation - Trees Drawing the tree from production rules: we can draw a derivation as a tree. The root of the tree is the start symbol. For each derivation step, the symbols on the RHS of the chosen production are added as children below the rewritten non terminal. For example, when applying T → aTc, the symbols a, T and c are drawn as children below T:

    T
  / | \
 a  T  c

The leaves of the tree are terminals which, when read from left to right, form the derived string. ε is ignored.

Derivation - Trees The order of derivation does not matter; only the choice of rules does. The same syntax tree is obtained for the string aabbbcc irrespective of the order of derivation (in the figure, the three b leaves appear as the first, second and third b from the left).

Ambiguity But we may have an alternate tree for the same string: the choice of production matters. When a different rule is applied, a different tree can result. When a grammar permits several different syntax trees for some strings, we call the grammar ambiguous.

Ambiguity Ambiguity is not a problem for validating syntax: both parse trees show that aabbbcc is a valid string. The problem is elsewhere, namely when we evaluate the string. Let's take the example of an expression grammar:

E → E + E
E → E * E
E → num

Two different derivations of 2 + 3 * 4:

E ⇒ E + E ⇒ E + E * E ⇒* num + num * num
E ⇒ E * E ⇒ E + E * E ⇒* num + num * num

Ambiguity Derivation 1: E ⇒ E + E ⇒ E + E * E ⇒* num + num * num, i.e., 2 + (3 * 4).
Evaluation: 3 * 4 = 12; 2 + 12 = 14.

Derivation 2: E ⇒ E * E ⇒ E + E * E ⇒* num + num * num, i.e., (2 + 3) * 4.
Evaluation: 2 + 3 = 5; 5 * 4 = 20.

NOTE: THE SUBTREES ARE EVALUATED FIRST.
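The two parse trees above can be sketched as nested structures and evaluated bottom-up, which makes the 14-versus-20 discrepancy concrete. An illustrative sketch (tuples stand in for tree nodes):

```python
# Evaluate an expression tree bottom-up: subtrees first, then the operator.
def evaluate(tree):
    if isinstance(tree, tuple):
        op, left, right = tree
        l, r = evaluate(left), evaluate(right)
        return l + r if op == "+" else l * r
    return tree  # a num leaf

# Tree from the derivation starting E => E + E: 2 + (3 * 4)
tree_plus_first = ("+", 2, ("*", 3, 4))
# Tree from the derivation starting E => E * E: (2 + 3) * 4
tree_times_first = ("*", ("+", 2, 3), 4)

print(evaluate(tree_plus_first))   # 14
print(evaluate(tree_times_first))  # 20
```

Same token string, two trees, two answers: this is why ambiguity must be resolved before evaluation.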

Ambiguity Resolution A parser cannot be built for an ambiguous grammar: the parser must build one tree while processing the token string. So ambiguity must be resolved, either by

1) using disambiguating / precedence rules while parsing, or
2) rewriting the grammar to make it unambiguous (with the language unchanged).

(i) Associativity
a - b - c will be processed as (a - b) - c : left associative
a ** b ** c will be processed as a ** (b ** c) : right associative
a > b > c will be invalid : non associative

Note: each of + and * could be treated as either right associative or left associative, but by convention they are made left associative (the parser has to follow one rule).

(ii) Precedence
a + b * c will be treated as a + (b * c).

Ambiguity Detection Ambiguity exists in a grammar if there exists a string which can result in two distinct parse trees. In general this is very hard, almost impossible, to detect. In many cases, however, it is not difficult to spot by looking at the grammar, e.g., a production of the form

N → NαN

Note: parsers can be built only from unambiguous grammars. Most ambiguity occurs in expression grammars such as

E → E op E
E → num    (num is a numeric literal)

Rewriting ambiguous grammar Expression grammar, rewritten as follows:

(a) For left associative operators (e.g., a - b - c): introduce a new non terminal E'

E → E op E'
E → E'
E' → num

Isolate the rightmost non terminal first, pushing it into a subtree. Derivation example:

E ⇒ E - E' ⇒ (E - E') - E' ⇒ (num - num) - num

There is an implicit parenthesization.

Rewriting ambiguous grammar (b) For right associative operators (e.g., a ** b ** c): introduce a new non terminal E'

E → E' op E
E → E'
E' → num

Derivation example:

E ⇒ E' ^ E ⇒ num ^ E ⇒ num ^ (E' ^ E) ⇒ num ^ (num ^ E) ⇒ num ^ (num ^ E') ⇒ num ^ (num ^ num)

There is an implicit parenthesization.
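Why the left/right distinction changes the computed value can be seen by folding the same operand list both ways. An illustrative sketch, not tied to any particular parser (function names are assumptions):

```python
# Fold a list of numbers with '-' left-to-right versus right-to-left,
# mirroring left associative (a - b) - c and right associative a - (b - c).
from functools import reduce

def fold_left_sub(nums):
    return reduce(lambda acc, x: acc - x, nums)

def fold_right_sub(nums):
    result = nums[-1]
    for x in reversed(nums[:-1]):
        result = x - result
    return result

print(fold_left_sub([8, 2, 1]))   # (8 - 2) - 1 = 5
print(fold_right_sub([8, 2, 1]))  # 8 - (2 - 1) = 7
```

The left recursive grammar of case (a) produces the first shape; the right recursive grammar of case (b) produces the second.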

Rewriting ambiguous grammar (c) For non associative operators (e.g., a < b):

E → E' op E'
E → E'
E' → num

Here a < b is allowed, but a < b < c is not.

Rewriting ambiguous grammar So far we have only handled the cases where an operator interacts with itself. This is easily extended to the cases where several operators with the same precedence and associativity interact:

E → E + E'
E → E - E'
E → E'
E' → num

+ and - are both left associative, hence a left recursive grammar is required.

Rewriting ambiguous grammar But if we mix left recursion with right recursion, the grammar will be ambiguous again:

E → E + E'
E → E' ^ E
E → E'
E' → num

As an example, we cannot even represent 2 + 3 ^ 4 using this grammar.

Rewriting ambiguous grammar Mixing operators with different precedence but equal associativity: we must know the precedence of the operators, and the higher precedence operator needs to be worked out first. Use a different non terminal for each precedence level:

E → E + E2
E → E - E2
E → E2
E2 → E2 * E3
E2 → E2 / E3
E2 → E3
E3 → num
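A grammar layered like this maps directly onto an evaluator with one function per precedence level; the left recursion becomes a loop. A runnable sketch (token handling is simplified to a list of strings; the structure, not the names, is the point):

```python
# One function per non terminal: e handles + and -, e2 handles * and /,
# e3 handles num.  Left recursion (E -> E + E2) becomes a while loop.
def evaluate(tokens):
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def e3():
        num = tokens[pos[0]]          # E3 -> num
        pos[0] += 1
        return float(num)

    def e2():
        value = e3()                  # E2 -> E2 * E3 | E2 / E3 | E3
        while peek() in ("*", "/"):
            op = tokens[pos[0]]; pos[0] += 1
            value = value * e3() if op == "*" else value / e3()
        return value

    def e():
        value = e2()                  # E -> E + E2 | E - E2 | E2
        while peek() in ("+", "-"):
            op = tokens[pos[0]]; pos[0] += 1
            value = value + e2() if op == "+" else value - e2()
        return value

    return e()

print(evaluate(["2", "+", "3", "*", "4"]))  # 14.0
print(evaluate(["8", "-", "2", "-", "1"]))  # 5.0
```

Because e calls e2 for its operands and e2 calls e3, higher precedence operators are worked out first, and the loops make + - * / left associative, exactly what the rewritten grammar specifies.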

Other sources of ambiguity Example: if P then if Q then S1 else S2

The ambiguity is: which if is the else connected to? It might mean

if P then ( if Q then S1 else S2 )
or
if P then ( if Q then S1 ) else S2

Note: the else clause is optional; otherwise the construct would have been unambiguous.

Other sources of ambiguity Let's see why. The grammar is

stmt → <id> := <exp>
stmt → <stmt> ; <stmt>
stmt → if <exp> then <stmt> else <stmt>
stmt → if <exp> then <stmt>

According to this grammar, the single else can equally well match either if.

Other sources of ambiguity Two parse trees, indicating ambiguous grammar

Other sources of ambiguity Usual convention: an else matches the closest preceding if. We can enforce this rule by rewriting the grammar, introducing two new non terminals:

stmt → <matched>
stmt → <unmatched>
matched → if <exp> then <matched> else <matched>
matched → <id> := <exp>
unmatched → if <exp> then <matched> else <unmatched>
unmatched → if <exp> then <stmt>
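A tiny recursive-descent sketch shows the effect of the closest-if convention that the rewritten grammar enforces. This parser implements the convention directly by attaching an else greedily to the innermost open if (token handling and statement forms are heavily simplified; all names are illustrative):

```python
# Parse a simplified if/then/else language into nested tuples,
# attaching each 'else' to the nearest unmatched 'if'.
def parse(tokens):
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat(expected=None):
        tok = tokens[pos[0]]; pos[0] += 1
        assert expected is None or tok == expected
        return tok

    def stmt():
        if peek() == "if":
            eat("if"); cond = eat(); eat("then")
            then_branch = stmt()
            if peek() == "else":      # greedy: nearest if claims the else
                eat("else")
                return ("if", cond, then_branch, stmt())
            return ("if", cond, then_branch)
        name = eat(); eat(":="); value = eat()   # <id> := <exp>
        return (":=", name, value)

    return stmt()

tree = parse(["if", "P", "then", "if", "Q", "then",
              "x", ":=", "1", "else", "y", ":=", "2"])
print(tree)
# ('if', 'P', ('if', 'Q', (':=', 'x', '1'), (':=', 'y', '2')))
```

The outer if node has no else branch while the inner one does, i.e., the string parses as if P then (if Q then S1 else S2), matching the matched/unmatched grammar above.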