Left to right design

Left to right design

The left to right design method suggests that the structure of the program should closely follow the structure of the input. The method is effective when the structure of the input dominates the problem. Many problems in practice have complex input structure; even when that structure doesn't dominate the whole problem, the subproblem of handling the input can be solved using left-to-right design. Any program that reads input is a language recognizer, or parser.

The problem

Write a program to act as a simple calculator. Users type in arithmetic expressions, one per line; the program should print the value of each expression. Expressions may involve the four basic arithmetic operators and parentheses. For simplicity, assume all numbers are integers. Unix comes with two calculator programs, one (dc) for postfix expressions and the other (bc) for infix expressions; bc is built on top of dc.

Input Expression - Examples

3 + 5
5*3 - 7/2
2 + 5*2 - 3
5 * (4*2 - 2/3) - (2+3) * (4+5)
(2+4) / (3*5) - 4 + 5

Structure Description

Formal notation:

x,y    x then y
x*     zero or more repetitions of x
x+     one or more repetitions of x
x|y    either x or y
[x]    either x or nothing
x: y   x is defined as y

Data Description (grammar)

file:    line*, end
line:    newline | (expr, newline)
expr:    term, ((add|sub), term)*
term:    factor, ((mul|div), factor)*
factor:  number | (lparen, expr, rparen)
number:  digit+
add:     '+'
sub:     '-'
mul:     '*'
div:     '/'
lparen:  '('
rparen:  ')'
newline: '\n'
end:     EOF

White Space

This data description does not say where white space may appear in the input, because that would make the description unnecessarily complicated. Most programs accept white space in some places and not in others. In standard terminology, a token is a unit of input such that any spaces between tokens are not significant, and any spaces within a token (if they are allowed at all) are significant.

Tokens

We must decide what the tokens of our grammar are. Tokens are the smallest elements of the grammar. They must be defined:
- without reference to other tokens
- without recursion
- independent of preceding tokens
The tokens in this program are: number, add, sub, mul, div, lparen, rparen, newline, end.

Two stages

Traditionally, the task of recognizing the structure of the input has been done in two stages. (The option of using only one stage is discussed in a later section.) The first stage, called lexical analysis, scanning, or tokenizing, groups characters into tokens while ignoring white space and comments (both may appear anywhere and neither is significant).

Two stages

The second stage, syntactic analysis or parsing, groups tokens into higher-level entities such as expressions. In technical language, tokens are called terminal symbols, while the entities recognized by parsers are called nonterminal symbols.

Tokenizer operations

The tokenizer, lexical analyser, or scanner is a function. Each time it is called, it should read the next token and return an indication of which kind of token it is (number, add, sub, etc.); and, if there is more than one token of that kind, an indication of the one that was seen. For example, all plus signs are alike, but when the calculator reads in a number, it must know which number it is.

Indicators

The standard way to indicate the kind of a token is via an enumerated type:

typedef enum {
    ADD, SUB, MUL, DIV, LPAREN, RPAREN, NL, END, NUMBER
} TokenKind;

Indicators

In this case only one kind of token, NUMBER, needs an indicator that says which token of that kind was seen, so the value of the token can be put into an integer (we are not concerned with real numbers in this exercise).

Using unions

In general, more than one kind of token may have an associated value, and these values may be of different types. For example, some tokenizers must be able to recognize both integers and identifiers. The solution is to use a union:

typedef union {
    int number;
    char *ident;
} TokenValue;

Every value of type TokenValue will have enough storage to hold either an int or a char*, but not both.

Token representation

Conceptually, a token is a kind/value pair, and should be represented as a structure with two fields:

typedef struct {
    TokenKind kind;
    TokenValue value;
} Token;

Token representation

Token token;
token.kind == NUMBER          => value is in token.value.number
token.kind == IDENT           => value is in token.value.ident
token.kind is something else  => token has no associated value

However, for simplicity people often use two separate variables for kind and value.

Tokenizer structure

Tokenizer functions start with code that gets rid of nonsignificant white space and comments, if they are allowed:

c = getc(stdin);
while (c != EOF && c != '\n' && isspace(c))
    c = getc(stdin);

The first character left in the input is then often sufficient to find out what kind of token is next. (If it isn't, we must use techniques usually used for parsing.)

Consider all the rules for tokens:

number:  digit+
add:     '+'
sub:     '-'
mul:     '*'
div:     '/'
lparen:  '('
rparen:  ')'
newline: '\n'
end:     EOF

Each token begins with a different character, so we can switch on the first non-space character to decide the token kind.

Identifiers

Many grammars have some kind of identifier token. For our calculator, we might want to allow identifiers for variable names. Identifiers usually have a structure like:

ident: letter, (letter|digit)*

precisely to distinguish them from numbers by the first character.

The rest of the token

Once the tokenizer has found out what kind of token is next, it must read in the rest of the token. The structure of the code that does this should follow the structure of the data description of the rest of the token.

ident: letter, (letter|digit)*

/* c is known to be a letter */
buf[i++] = c;
c = getc(stdin);
while (isalpha(c) || isdigit(c)) {
    buf[i++] = c;
    c = getc(stdin);
}
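The slide leaves buf, i, and the surrounding function implicit. As a sketch (our reconstruction, not from the slides; the name read_ident is hypothetical), the fragment could be wrapped into a self-contained helper like this:

#include <stdio.h>
#include <ctype.h>

/* Reads the rest of an identifier whose first letter is c,
 * NUL-terminates it in buf, and pushes back the character
 * that follows the identifier. */
void read_ident(int c, char buf[], int size) {
    int i = 0;
    buf[i++] = c;                 /* c is known to be a letter */
    c = getc(stdin);
    while ((isalpha(c) || isdigit(c)) && i < size - 1) {
        buf[i++] = c;
        c = getc(stdin);
    }
    buf[i] = '\0';
    ungetc(c, stdin);             /* belongs to the next token */
}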

Tokenizer

TokenKind do_get_token(int *token_value) {
    int c, val;

    c = getc(stdin);
    while (c != EOF && c != '\n' && isspace(c))
        c = getc(stdin);
    switch (c) {
    case '+':  return ADD;
    case '-':  return SUB;
    case '*':  return MUL;
    case '/':  return DIV;
    case '(':  return LPAREN;
    case ')':  return RPAREN;
    case '\n': return NL;
    case EOF:  return END;

(continued)

Tokenizer (2)

    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        val = c - '0';
        c = getc(stdin);
        while (c != EOF && isdigit(c)) {
            val = val * 10 + c - '0';
            c = getc(stdin);
        }
        ungetc(c, stdin);
        *token_value = val;
        return NUMBER;
    default:
        /* handle the error */
    }
}
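To see the tokenizer in action, a minimal test driver could call do_get_token in a loop and print each token (this driver is our sketch, not part of the slides; it assumes the TokenKind enum and do_get_token defined above, with the names array matching the enum's order):

#include <stdio.h>

int main(void) {
    static const char *names[] = {
        "ADD", "SUB", "MUL", "DIV", "LPAREN",
        "RPAREN", "NL", "END", "NUMBER"
    };
    int value;
    TokenKind kind;
    do {
        kind = do_get_token(&value);
        if (kind == NUMBER)
            printf("NUMBER(%d)\n", value);   /* value is only set for NUMBER */
        else
            printf("%s\n", names[kind]);
    } while (kind != END);
    return 0;
}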

Pushback

do_get_token must remove exactly one token from the input, together with its preceding white space. We cannot find out whether a digit is the last character in a number or not until we have read the next character. This character may be, e.g., '+', which represents a token, so we must make sure that the next invocation of do_get_token processes it. Our code does this by calling ungetc, which arranges for the next call to getc on the same file to read the character pushed back by ungetc.
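A standalone illustration of the pushback idiom (our sketch, separate from the calculator's code): read one integer, then push back the first non-digit so that later reads still see it. For input "42+", this leaves '+' in the stream for the next getc.

#include <stdio.h>
#include <ctype.h>

int read_int(FILE *fp) {
    int c, val = 0;
    while ((c = getc(fp)) != EOF && isdigit(c))
        val = val * 10 + (c - '0');
    if (c != EOF)
        ungetc(c, fp);   /* the character belongs to the next token */
    return val;
}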

Recursive Descent Parsing

Recursive Descent Parsing

The parser has a function for each nonterminal in the grammar. The structure of this function is derived from the nonterminal's definition in the grammar. The translation scheme is:

grammar rule            -> function
nonterminal             -> function call
terminal                -> check token and consume
sequence (,)            -> sequence of statements
repetition (* and +)    -> while or do statement based on next token kind
alternative (| and [])  -> if or switch statement based on next token kind

Data Description (grammar)

file:    line*, end
line:    newline | (expr, newline)
expr:    term, ((add|sub), term)*
term:    factor, ((mul|div), factor)*
factor:  number | (lparen, expr, rparen)
number:  digit+
add:     '+'
sub:     '-'
mul:     '*'
div:     '/'
lparen:  '('
rparen:  ')'
newline: '\n'
end:     EOF

Fixed one-token lookahead

This scheme maintains the following invariant: when the function for a nonterminal is called, the global variables hold information about the first token that may be part of that nonterminal; and when the function returns, the global variables hold information about the first token beyond that nonterminal. As soon as a token is recognized, it should be consumed by a call to get_token, which sets the global variables according to the next token. Lookahead is an alternative to pushback.
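The slides use two global variables for the lookahead but never show their declarations; the following is a plausible sketch (the names are taken from how the later code uses them):

TokenKind next_token_kind;    /* kind of the current lookahead token */
int       next_token_value;   /* its value, meaningful when kind == NUMBER */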

The top-level function

To implement the lookahead, we must begin our program by looking ahead. Next we handle the top-level nonterminal in our grammar: file. We handle a nonterminal with a function call.

int main(void) {
    /* recognizes file: line*, end */
    get_token();
    get_file();
    return 0;
}

void get_token(void) {
    next_token_kind = do_get_token(&next_token_value);
}

file

file: line*, end

We translate the file grammar rule to a get_file() function whose body is the translation of the RHS of the rule. We translate a nonterminal, such as line, to a call to the function for that nonterminal, such as get_line(). We translate a * repetition into a while loop whose condition tests that the next token could be the first token of what is repeated, in this case NUMBER or LPAREN.

file

file:    line*, end
line:    newline | (expr, newline)
expr:    term, ((add|sub), term)*
term:    factor, ((mul|div), factor)*
factor:  number | (lparen, expr, rparen)

file

void get_file(void) {
    /* recognizes file: line*, end */
    while (next_token_kind == NUMBER || next_token_kind == LPAREN)
        get_line();
    if (next_token_kind != END)
        ... handle the error ...
    /* no need to get a token after END */
}

Error Conditions

We must consider what happens on invalid input. With this definition, if a line begins with, say, ADD, we get an error message and get_file() returns. It would usually be better to ignore the erroneous line and keep processing.

void get_file(void) {
    /* recognizes file: line*, end */
    while (next_token_kind != END) {
        if (next_token_kind == NUMBER || next_token_kind == LPAREN)
            get_line();
        else
            ... print error message and skip line ...
    }
    /* no need to get a token after END */
}
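One way to implement the "skip line" action is a small helper (our sketch; the name skip_line is hypothetical) that discards tokens up to and including the next newline, so parsing can resume at the start of the following line:

void skip_line(void) {
    while (next_token_kind != NL && next_token_kind != END)
        get_token();
    if (next_token_kind == NL)
        get_token();   /* consume the newline itself */
}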

line

line: newline | (expr, newline)

The alternative construct translates to an if or switch on the next token kind. A terminal is handled by checking it and getting the next token.

line cont.

void get_line(void) {
    if (next_token_kind == NL)
        get_token();
    else {
        get_expr();
        if (next_token_kind == NL)
            get_token();
        else
            ... error ...
    }
}

Consuming tokens

Code like

if (next_token_kind == something)
    get_token();
else
    handle a syntax error

is common enough that it's often worth writing a function or macro to handle it:

void consume(TokenKind tok) {
    if (next_token_kind == tok)
        get_token();
    else
        ... handle syntax error ...
}

Consuming tokens (2)

Using this function simplifies the get_line() function and makes its similarity to the grammar rule more apparent:

line: newline | (expr, newline)

void get_line(void) {
    if (next_token_kind == NL)
        get_token();
    else {
        get_expr();
        consume(NL);
    }
}

Recognizing an expression

expr: term, ((add|sub), term)*

void get_expr(void) {
    get_term();
    while (next_token_kind == ADD || next_token_kind == SUB) {
        get_token();   /* ADD or SUB */
        get_term();
    }
}

Code for get_term() is very similar; a sketch follows.
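Since the slide only states that get_term() is very similar, here is a sketch under that claim (our reconstruction, following term: factor, ((mul|div), factor)* and the same globals):

void get_term(void) {
    get_factor();
    while (next_token_kind == MUL || next_token_kind == DIV) {
        get_token();   /* MUL or DIV */
        get_factor();
    }
}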

Recognizing a factor

factor: number | (lparen, expr, rparen)

void get_factor(void) {
    switch (next_token_kind) {
    case NUMBER:
        get_token();
        break;
    case LPAREN:
        get_token();
        get_expr();
        consume(RPAREN);
        break;
    default:
        ... error ...
    }
}

Actions

This code does nothing but check the syntax of the input stream. But it is easy to extend it to perform whatever actions are required, for example:
- The action can compute the value of the expression.
- The action can create a tree structure to represent the expression.
- The action can generate code to evaluate the expression.

Actions (2)

We extend get_expr() to return the value of the expression:

int get_expr(void) {
    int val = get_term();
    while (next_token_kind == ADD || next_token_kind == SUB) {
        TokenKind op = next_token_kind;
        get_token();
        if (op == ADD)
            val += get_term();
        else
            val -= get_term();
    }
    return val;
}
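For this to work, get_term() and get_factor() must also return values. The slides do not show these versions; the following sketches are our reconstruction, using next_token_value and consume() from earlier slides:

int get_expr(void);   /* from the slide above */

int get_term(void) {
    int val = get_factor();
    while (next_token_kind == MUL || next_token_kind == DIV) {
        TokenKind op = next_token_kind;
        get_token();
        if (op == MUL)
            val *= get_factor();
        else
            val /= get_factor();   /* no divide-by-zero check in this sketch */
    }
    return val;
}

int get_factor(void) {
    int val = 0;
    switch (next_token_kind) {
    case NUMBER:
        val = next_token_value;    /* the value stored by the tokenizer */
        get_token();
        break;
    case LPAREN:
        get_token();
        val = get_expr();
        consume(RPAREN);
        break;
    default:
        /* ... error ... */
        break;
    }
    return val;
}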

Grammar Manipulation

Suppose we had defined expr this way:

expr: number
    | (lparen, expr, rparen)
    | (expr, add, expr)
    | (expr, sub, expr)
    | (expr, mul, expr)
    | (expr, div, expr)

This description is correct, but we cannot decide which alternative to apply just by looking at the first token of an expression. Therefore we cannot derive a working parser from it using the techniques of recursive descent parsing; we must transform the grammar first.

Left Factoring

Left factoring uses the rule that a,(b|c) = (a,b)|(a,c) to pull out a common initial part of several alternatives, so that it is not repeated. This gives us:

expr: number
    | (lparen, expr, rparen)
    | (expr, ((add, expr) | (sub, expr) | (mul, expr) | (div, expr)))

Left Factoring

We write this more manageably as:

expr: number | (lparen, expr, rparen) | (expr, rest)
rest: (add, expr) | (sub, expr) | (mul, expr) | (div, expr)

Left recursion

expr: number | (lparen, expr, rparen) | (expr, rest)

We cannot derive a working parser from this data description either. The problem is that one of the alternatives for expr starts with expr. If we wrote get_expr() following this grammar, then when the token was anything other than NUMBER or LPAREN, we would immediately call get_expr(). Since we would not have consumed any tokens, the current token would still not be NUMBER or LPAREN, so we would again immediately call get_expr(), and so on.

Left recursion elimination

Consider what our grammar rule will recognize:

NUMBER
or LPAREN expr RPAREN
or NUMBER rest
or LPAREN expr RPAREN rest
or NUMBER rest rest
or LPAREN expr RPAREN rest rest
or ...

We see a pattern here: each form begins with either NUMBER or LPAREN expr RPAREN, and follows with any number of repetitions of rest. So we can rewrite our rule as:

expr: factor, rest*
factor: number | (lparen, expr, rparen)

Left recursion elimination (2)

The general rule is to invent a new nonterminal for the non-left-recursive alternatives:

factor: number | (lparen, expr, rparen)

Then define another new nonterminal as all of the left-recursive alternatives, with the left-recursive nonterminal removed. In this case it's just rest.

Left recursion elimination (3)

Finally, replace the left-recursive rule with one that starts with the new non-left-recursive nonterminal (factor) and ends with zero or more repetitions of the other new nonterminal (just rest in this case). This gives us:

expr: factor, rest*
factor: number | (lparen, expr, rparen)
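Stated in general form, using the slides' own notation (this restatement is ours): a left-recursive rule

a: b | (a, r)

where b stands for the non-left-recursive alternatives and r for what follows the leading a in the left-recursive ones, is replaced by

a: b, r*

For expr, b is number | (lparen, expr, rparen), i.e. factor, and r is rest, which yields expr: factor, rest*.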

Precedence

This data description divides up the input 2 + 3 * 4 as 2, followed by + 3, followed by * 4; that is, (2 + 3) * 4. This would be fine if + and * had the same precedence, but they don't. We want the parser to treat 3 * 4 as a unit. In general, we want any sequence of factors with multiplicative operators between them to be treated as a unit. We call these units terms.

Fixing precedence

We must separate the multiplicative from the additive operators:

term: factor, restterm*
restterm: (mul, term) | (div, term)
expr: term, restexpr*
restexpr: (add, expr) | (sub, expr)

After substituting the definitions of restterm and restexpr for their uses, and some factoring:

term: factor, ((mul|div), term)*
expr: term, ((add|sub), expr)*
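To see that this gets the precedence right, consider 2 + 3 * 4 again: get_expr() first calls get_term(), which consumes the 2 and stops at the +, since + is not a multiplicative operator; get_expr() then consumes the + and calls get_term() again, which consumes all of 3 * 4. The multiplication is therefore grouped as a unit, and the value is 2 + 12 = 14 rather than (2 + 3) * 4 = 20.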

Associativity

When matching an input such as 10 - 1 + 2 against

expr: term, ((add|sub), expr)*

we don't want 1 + 2 to be considered an expr, because that would lead to evaluating 10 - (1 + 2), when what we want is (10 - 1) + 2. We can fix this by changing the grammar to:

term: factor, ((mul|div), factor)*
expr: term, ((add|sub), term)*

Compiler technology

Scanning and parsing are the best understood aspects of compiler technology. They have a large body of theory, much of it developed in the sixties and seventies. Many tools exist for the automatic creation of tokenizers and parsers; two of the best known are the scanner generator lex and the parser generator yacc, which are standard on Unix systems. The theories of scanning and parsing are covered in some detail in 433-255, and may be explored further in 433-361. These units should also introduce tools such as lex and yacc.

Parsing without tokenizing

A separate tokenizer is helpful if parts of the input are to be ignored (e.g. white space, comments) and if the code to check for and parse those parts would otherwise have to be repeated at several points in the program. If all of the input is significant, or if there are only a few places in the grammar where the parts to be ignored occur, we need not have a tokenizer; the parser can view each character as a token.

file: line*
line: name, colon, pw, colon, number, colon, users, nl
users: [user, (comma, user)*]
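A sketch of such a tokenizer-free parser for this grammar (our reconstruction, not from the slides; all function names are assumptions) keeps a one-character lookahead in the same style as next_token_kind above:

#include <stdio.h>

int next_char;                        /* one-character lookahead */

void advance(void) { next_char = getc(stdin); }

void expect(int c) {
    if (next_char == c)
        advance();
    /* else: handle syntax error */
}

void get_name(void) {                 /* run of characters up to a delimiter */
    while (next_char != ':' && next_char != ',' &&
           next_char != '\n' && next_char != EOF)
        advance();
}

void get_users(void) {                /* users: [user, (comma, user)*] */
    if (next_char == '\n') return;    /* the user list may be empty */
    get_name();
    while (next_char == ',') { advance(); get_name(); }
}

void get_pw_line(void) {              /* line: name,colon,pw,colon,number,colon,users,nl */
    get_name();  expect(':');         /* name */
    get_name();  expect(':');         /* pw */
    while (next_char >= '0' && next_char <= '9')   /* number: digit+ */
        advance();
    expect(':');
    get_users();
    expect('\n');
}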