Lexical Analysis (ASU Ch 3, Fig 3.1)
  Implementation
    - by hand
    - automatically ((F)Lex)
        - Lex generates a finite automaton recogniser
        - uses regular expressions
  Tasks
    - remove white space (ws)
    - display source program
    - line numbers (error display)
  [Figure: lexical analyser, parser and symbol table]
  Issues (why split lexical analysis (LA) from syntax analysis (SA))
    - simple: LA/SA split
    - more efficient
    - flexible / portable

Terminology (ASU Ch 3.1)
  pattern - rule describing the set of input strings associated with a token
  lexeme  - sequence of chars in the input matched by a pattern
  token   - syntactic object generated by a pattern match from the input
          - tokens are the terminal symbols in the grammar G

  token      lexemes
  const      const
  if         if
  relation   < <= = <> > >=
  id         pi count D2
  num        3.14 22 0
  literal    "this is a string"

  lexemes correspond to attributes for tokens

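To make the three terms concrete, here is a minimal sketch in Python. The table PATTERNS and the function token_for are hypothetical names chosen for this illustration, and the regular expressions are assumed approximations of the patterns behind the table above, not definitions taken from ASU.

```python
import re

# Hypothetical pattern table: (token, pattern). Keywords come before id
# so that "if" and "const" are not classified as identifiers.
PATTERNS = [
    ("const", r"const"),
    ("if", r"if"),
    ("relation", r"<=|<>|>=|<|=|>"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("num", r"[0-9]+(\.[0-9]+)?"),
    ("literal", r'"[^"]*"'),
]

def token_for(lexeme):
    """Return the first token whose pattern matches the whole lexeme, else None."""
    for token, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return token
    return None

for lexeme in ["const", "if", ">=", "pi", "D2", "3.14", '"this is a string"']:
    print(lexeme, "->", token_for(lexeme))
```

Each string in the loop is a lexeme, the regular expression that accepts it is the pattern, and the name returned is the token.
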
Attributes, errors & recovery (ASU Ex. 3.1)
  E.g. E = m*c**2
    <id, E>
    <assign_op, >
    <id, m>
    <mul_op, >
    <id, c>
    <exp_op, >
    <num, 2>
  Errors & recovery
    - delete extraneous characters
    - insert missing characters
    - replace incorrect character
    - transpose characters (fi -> if)
    - resynchronise with the input stream and find the next token
  error distance - number of steps needed to transform the erroneous program

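The error distance can be illustrated on a single lexeme. The sketch below assumes that each of the four repairs listed above (delete, insert, replace, transpose adjacent characters) counts as one step; the function name error_distance is hypothetical, and the method is the standard optimal string alignment variant of edit distance, swapped in here as one plausible reading of the slide.

```python
def error_distance(wrong, right):
    """Fewest single-character deletions, insertions, replacements and
    adjacent transpositions that turn `wrong` into `right`."""
    rows, cols = len(wrong) + 1, len(right) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every character of wrong
    for j in range(cols):
        d[0][j] = j                      # insert every character of right
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if wrong[i - 1] == right[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete an extraneous character
                d[i][j - 1] + 1,         # insert a missing character
                d[i - 1][j - 1] + cost,  # replace an incorrect character
            )
            if (i > 1 and j > 1
                    and wrong[i - 1] == right[j - 2]
                    and wrong[i - 2] == right[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[rows - 1][cols - 1]

print(error_distance("fi", "if"))  # 1: a single transposition
```
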
Strings & languages (ASU Ch 3.3)
  alphabet (char set / class) - any finite set of symbols
    - e.g. Binary { 1, 0 }, ASCII, English, Swedish
  a string over an alphabet - a finite sequence of symbols drawn from that alphabet
  string length |s| - e.g. |foobar| = 6, |ε| = 0 (ε is the empty string)
  language - any set of strings over some fixed alphabet
    - includes {ε} and ∅, the empty set
  concatenation - xy where x and y are strings
  product notation - s^0 = ε, s^1 = s, s^2 = ss, s^n = sss...sss (n times)

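The string notions above map directly onto Python strings. In this small sketch the names EPSILON and power are hypothetical, introduced only for illustration.

```python
EPSILON = ""  # the empty string

def power(s, n):
    """s^n: s concatenated with itself n times; s^0 is the empty string."""
    result = EPSILON
    for _ in range(n):
        result = result + s
    return result

print(len("foobar"))   # 6
print(len(EPSILON))    # 0
print(power("ab", 3))  # ababab
```
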
Strings & languages (ASU Ch 3.3), continued
  prefix of s - string obtained by removing 0 or more trailing symbols of s
    - e.g. ban from banana
  suffix of s - string obtained by removing 0 or more leading symbols of s
    - e.g. nana from banana
  sub-string of s - string obtained by deleting a prefix & a suffix
    - e.g. nan from banana
    - every prefix / suffix is a sub-string of s
  proper prefix / suffix / sub-string - non-empty string x s.t. x != s
  sub-sequence of s - string obtained by deleting 0 or more not necessarily contiguous symbols from s
    - e.g. baaa from banana

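These definitions can be checked mechanically on the banana example. The helper names below (prefixes, suffixes, substrings, proper, is_subsequence) are hypothetical; this is a sketch, not an efficient implementation.

```python
def prefixes(s):
    """All strings obtained by removing 0 or more trailing symbols."""
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    """All strings obtained by removing 0 or more leading symbols."""
    return {s[i:] for i in range(len(s) + 1)}

def substrings(s):
    """All strings obtained by deleting a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(parts, s):
    """Keep only the non-empty members that differ from s itself."""
    return {x for x in parts if x and x != s}

def is_subsequence(x, s):
    """True if x can be obtained from s by deleting symbols, keeping order."""
    remaining = iter(s)
    return all(ch in remaining for ch in x)

print("ban" in prefixes("banana"))         # True
print("nana" in suffixes("banana"))        # True
print("nan" in substrings("banana"))       # True
print(is_subsequence("baaa", "banana"))    # True
```
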
Operations on Languages (ASU Fig 3.8)
  union of L and M:   L ∪ M = { s | s in L or s in M }
  concatenation LM:   LM = { st | s in L and t in M }
  Kleene closure:     L* = ∪_{i=0}^{∞} L^i   (zero or more concatenations of L)
  positive closure:   L+ = ∪_{i=1}^{∞} L^i   (one or more concatenations of L)

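For finite languages these operations can be tried out with Python sets. Since L* and L+ are infinite whenever L contains a non-empty string, the sketch below truncates the union at a bound max_i; that bound and all function names are assumptions made for illustration.

```python
def union(L, M):
    """L ∪ M."""
    return L | M

def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i, with L^0 = {""} (the language holding only the empty string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure(L, max_i, positive=False):
    """Union of L^i for i from 0 (or 1 if positive) up to max_i only."""
    result = set()
    for i in range(1 if positive else 0, max_i + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))         # ['0', '1', 'a', 'b']
print(sorted(concat(L, M)))        # ['a0', 'a1', 'b0', 'b1']
print(sorted(closure({"a"}, 3)))   # ['', 'a', 'aa', 'aaa']
```
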
Regular Expressions (RE) (ASU Ch 3.3)
  E.g. Pascal <id> ::= letter ( letter | digit )*
  each RE r denotes a language L(r)
  built from simpler REs using a set of defining rules
    - ε is an RE that denotes {ε}
    - for a in alphabet A, a is an RE that denotes {a}, the set containing the string a
      (a can mean symbol a / string a / RE a)
    - let r, s be REs denoting L(r), L(s) respectively; then
        (r) | (s) is an RE denoting L(r) ∪ L(s)
        (r)(s)    is an RE denoting L(r)L(s)
        (r)*      is an RE denoting (L(r))*
        (r)       is an RE denoting L(r) (extra pairs of parentheses are allowed)

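The defining rules translate almost line for line into a recursive function. The sketch below represents an RE as a nested tuple (an encoding assumed for this example only) and lists the strings of L(r) up to a length bound, since the full language may be infinite.

```python
def language(r, max_len):
    """Strings of L(r) no longer than max_len, for an RE given as a nested tuple:
    ("eps",), ("sym", a), ("alt", r, s), ("cat", r, s) or ("star", r)."""
    kind = r[0]
    if kind == "eps":                      # ε denotes {ε}
        return {""}
    if kind == "sym":                      # a denotes {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":                      # (r)|(s) denotes L(r) ∪ L(s)
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":                      # (r)(s) denotes L(r)L(s)
        return {s + t
                for s in language(r[1], max_len)
                for t in language(r[2], max_len)
                if len(s + t) <= max_len}
    if kind == "star":                     # (r)* denotes (L(r))*
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:                    # stop once no new string appears
            frontier = {s + t for s in frontier for t in base
                        if len(s + t) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown RE node: {kind}")

# letter ( letter | digit )* over a deliberately tiny alphabet
letter = ("alt", ("sym", "a"), ("sym", "b"))
digit = ("alt", ("sym", "0"), ("sym", "1"))
ident = ("cat", letter, ("star", ("alt", letter, digit)))
print(sorted(language(ident, 2)))
```
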
Regular Expressions - examples (ASU Ch 3.3 Ex. 3.3)
  (a) | ((b)*(c)) == a | b*c
    - a, or 0 or more b's followed by c
  for an alphabet A = {a, b}
    - a | b denotes the set {a, b}
    - (a | b)(a | b) denotes the set {aa, ab, ba, bb}
    - a* denotes the set {ε, a, aa, aaa, ...}
    - (a | b)* denotes the set of all strings of a's and b's
    - what does a | a*b denote?

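One way to explore these examples, including the closing question, is to list every short string that a pattern accepts. The sketch uses Python's re module, whose syntax for |, * and parentheses agrees with the notation here; the function name matches and the length bound are assumptions for illustration.

```python
import re
from itertools import product

def matches(pattern, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len that the whole
    pattern matches, shortest first."""
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if re.fullmatch(pattern, s):
                found.append(s)
    return found

print(matches("a|b*c", "abc", 3))       # ['a', 'c', 'bc', 'bbc']
print(matches("(a|b)(a|b)", "ab", 3))   # ['aa', 'ab', 'ba', 'bb']
print(matches("a*", "ab", 3))           # ['', 'a', 'aa', 'aaa']
print(matches("a|a*b", "ab", 3))        # ['a', 'b', 'ab', 'aab']
```

The last line suggests the answer: the string a, together with every string of zero or more a's followed by a single b.
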
Regular Expressions - Equivalence & Axioms (ASU Fig 3.9)
  Equivalence - if REs r & s denote the same language L, then r = s
  axioms (algebraic laws)
    r | s = s | r               | is commutative
    r | (s | t) = (r | s) | t   | is associative
    (rs)t = r(st)               concatenation is associative
    r(s | t) = rs | rt          concatenation distributes over |
    εr = rε = r                 ε is the identity element for concatenation
    r* = (r | ε)*               relationship between * and ε
    r** = r*                    * is idempotent

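An axiom cannot be proved by testing, but a brute-force comparison over all short strings is a quick sanity check for particular instances. The function same_language, its default alphabet and its length bound are assumptions for this sketch; it uses Python's re module with concrete symbols standing in for r, s and t.

```python
import re
from itertools import product

def same_language(r, s, alphabet="abc", max_len=4):
    """True if r and s accept exactly the same strings over `alphabet`
    up to length max_len. Evidence of equivalence only, not a proof."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            w = "".join(chars)
            if bool(re.fullmatch(r, w)) != bool(re.fullmatch(s, w)):
                return False
    return True

print(same_language("a|b", "b|a"))           # | is commutative
print(same_language("a|(b|c)", "(a|b)|c"))   # | is associative
print(same_language("(ab)c", "a(bc)"))       # concatenation is associative
print(same_language("a(b|c)", "ab|ac"))      # concatenation distributes over |
print(same_language("ab", "ba"))             # False: concatenation is not commutative
```
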
Regular Definitions (RD) (ASU Ch 3.3)
  A regular definition is a sequence of definitions
    d_1 => r_1, d_2 => r_2, ..., d_n => r_n
    - each d_i is a distinct name
    - each r_i is an RE over the symbols in A ∪ {d_1, d_2, ..., d_{i-1}}
      (NB i-1: a definition may only use names defined before it)
  example: Pascal identifiers
    letter => A | B | ... | Z | a | b | ... | z
    digit  => 0 | 1 | ... | 9
    id     => letter ( letter | digit )*
  definition names: letter, digit, id

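The rule that r_i may only mention d_1 to d_{i-1} means a regular definition can be expanded in a single pass. The sketch below does this with Python string formatting; the {name} placeholder syntax and the function name expand are assumptions for illustration, and bodies must not contain literal braces.

```python
import re

def expand(definitions):
    """Expand (name, body) pairs in order into plain Python regexes.
    A body refers to an earlier definition as {name}; a reference to a
    later or unknown name raises KeyError, which enforces the i-1 rule."""
    expanded = {}
    for name, body in definitions:
        expanded[name] = "(?:" + body.format(**expanded) + ")"
    return expanded

PASCAL_ID = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "{letter}(?:{letter}|{digit})*"),
]

id_pattern = expand(PASCAL_ID)["id"]
for lexeme in ["pi", "count", "D2", "2D"]:
    print(lexeme, bool(re.fullmatch(id_pattern, lexeme)))
```
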
Regular Expressions - Notational Shorthand (ASU Ch 3.3)
  + denotes 1 or more instances
    - if r denotes L(r), then (r)+ denotes (L(r))+
    - a+ denotes the set of all strings of one or more a's
  * denotes 0 (zero) or more instances
    - r* = r+ | ε
    - r+ = r r*
  ? denotes 0 (zero) or one instance
    - r? = r | ε
    - (r)? denotes L(r) ∪ {ε}
  [abc] denotes a | b | c where a, b, c in A (alphabet)
  [a-z] denotes a | b | ... | z
  id => [A-Za-z][A-Za-z0-9]*

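Each shorthand is an abbreviation for a longer RE, which the following sketch spot-checks with Python's re module on a handful of sample strings. The helper expand_range and the sample list are assumptions introduced for illustration; an empty alternative such as "a|" is how ε is written in a Python pattern.

```python
import re

def expand_range(first, last):
    """Spell out the class [first-last] as an explicit alternation."""
    return "|".join(chr(c) for c in range(ord(first), ord(last) + 1))

def agree(r, s, samples):
    """True if r and s accept exactly the same strings among the samples."""
    return all(bool(re.fullmatch(r, w)) == bool(re.fullmatch(s, w))
               for w in samples)

samples = ["", "a", "aa", "aaa", "b", "ab", "e", "f"]

print(expand_range("a", "e"))                             # a|b|c|d|e
print(agree("a+", "aa*", samples))                        # r+ = r r*
print(agree("a*", "a+|", samples))                        # r* = r+ | ε
print(agree("a?", "a|", samples))                         # r? = r | ε
print(agree("[a-e]", expand_range("a", "e"), samples))    # [a-e] = a|b|c|d|e
```
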
Regular Expressions - Limitations (ASU Ch 3.3)
  Non-regular sets
    - REs cannot describe balanced / nested constructs
        - e.g. all strings of balanced parentheses
        - this requires a Context Free Grammar (CFG)
    - REs cannot describe repeating strings
        - e.g. { wcw | w is a string of a's and b's }
  this leads to the next stage in compiling - the syntax analyser
    - based on CFGs
    - based on token recognition

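Both examples are easy to recognise with ordinary code, which shows what an RE is missing: memory that grows with the input. In the sketch below (function names are hypothetical) the first check needs a counter with no upper bound and the second must remember all of w, neither of which fits in the fixed, finite set of states of a finite automaton.

```python
def balanced(s):
    """True if the parentheses in s are balanced. The depth counter can
    grow without limit, which a finite automaton cannot track."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ')' with no matching '('
                return False
    return depth == 0

def is_wcw(s):
    """True if s has the form wcw with w a string of a's and b's.
    The whole of w has to be remembered to compare the two halves."""
    w, sep, rest = s.partition("c")
    return sep == "c" and w == rest and set(w) <= {"a", "b"}

print(balanced("(()())"), balanced("(()"))   # True False
print(is_wcw("abcab"), is_wcw("abcba"))      # True False
```
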
Example Grammar + Regular Expressions (ASU Ex 3.6)
  Grammar
    stmt => if expr then stmt
          | if expr then stmt else stmt
    expr => term relop term
          | term
    term => id
          | num
  ====================================
  Regular definitions
    if    => if
    then  => then
    else  => else
    relop => < | <= | = | <> | > | >=
    id    => letter ( letter | digit )*
    num   => digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
    delim => blank | tab | newline
    ws    => delim+

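The definitions for num and ws can be transcribed into Python regular expressions to see which lexemes they accept. The constant names below are assumptions for this sketch; note that the dot has to be escaped in Python, where an unescaped . matches any character.

```python
import re

DIGIT = r"[0-9]"
# num => digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
NUM = rf"{DIGIT}+(?:\.{DIGIT}+)?(?:E[+-]?{DIGIT}+)?"
# ws => delim+   with   delim => blank | tab | newline
WS = r"[ \t\n]+"

for lexeme in ["22", "3.14", "6E23", "6.02E-23", "3.", ".5", "E5"]:
    print(repr(lexeme), bool(re.fullmatch(NUM, lexeme)))
```

The last three lexemes are rejected because the definition requires at least one digit on both sides of the decimal point and before the exponent.
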
Token Recognition (ASU Ch 3.4)

  RE      token   attribute
  ws      --      --
  if      if      --
  then    then    --
  else    else    --
  id      id      reference to symbol table entry
  num     num     reference to table entry
  <       relop   LT
  <=      relop   LE
  etc. (=, <>, >, >= become EQ, NE, GT, GE respectively)

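The table can be turned into a small scanner. The sketch below is one possible arrangement rather than the method prescribed by ASU: it recognises keywords by looking up id lexemes in a reserved-word set instead of using separate patterns, and it models the symbol table as a plain list whose index serves as the attribute. All names (TOKEN_RE, tokenize, KEYWORDS, RELOPS) are hypothetical.

```python
import re

RELOPS = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}
KEYWORDS = {"if", "then", "else"}

# Two-character operators are listed before their one-character prefixes.
TOKEN_RE = re.compile(r"""
    (?P<ws>[ \t\n]+)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<num>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)
  | (?P<relop><=|<>|>=|<|=|>)
""", re.VERBOSE)

def tokenize(text):
    """Return (tokens, symbols): tokens is a list of (token, attribute)
    pairs, symbols is the table that id and num attributes index into."""
    symbols, tokens, pos = [], [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                               # no token for white space
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))          # keyword: no attribute
        elif kind == "relop":
            tokens.append(("relop", RELOPS[lexeme]))
        else:                                      # id or num
            if lexeme not in symbols:
                symbols.append(lexeme)
            tokens.append((kind, symbols.index(lexeme)))
    return tokens, symbols

tokens, symbols = tokenize("if x1 <= 10 then y else x1")
print(tokens)
print(symbols)
```
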
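Transition Diagrams (ASU Fig 3.12)
  Lexical analysers can be represented by transition diagrams:
  circles are states, edges are labelled by a char

  [Figure: transition diagram for relop]
    start -> state 0
    state 0: '<' -> state 1,  '=' -> state 5,  '>' -> state 6
    state 1: '=' -> state 2,  '>' -> state 3,  other -> state 4 *
    state 6: '=' -> state 7,  other -> state 8 *
    state 2: return(relop, LE)
    state 3: return(relop, NE)
    state 4: return(relop, LT)
    state 5: return(relop, EQ)
    state 7: return(relop, GE)
    state 8: return(relop, GT)
    * = retract forward pointer
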
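A transition diagram can be coded as a loop over a state variable. The sketch below follows the relop diagram state by state; the function name relop, its return convention and the use of an empty string for end of input are assumptions made for this illustration.

```python
def relop(text, start=0):
    """Run the relop diagram from text[start]. Return ((token, attribute),
    forward), where forward is the position just after the lexeme, or None
    if no relational operator starts at that position."""
    state = 0
    forward = start
    while True:
        ch = text[forward] if forward < len(text) else ""  # "" = end of input
        forward += 1
        if state == 0:
            if ch == "<":
                state = 1
            elif ch == "=":
                return ("relop", "EQ"), forward            # state 5
            elif ch == ">":
                state = 6
            else:
                return None
        elif state == 1:
            if ch == "=":
                return ("relop", "LE"), forward            # state 2
            if ch == ">":
                return ("relop", "NE"), forward            # state 3
            return ("relop", "LT"), forward - 1            # state 4 *: retract
        elif state == 6:
            if ch == "=":
                return ("relop", "GE"), forward            # state 7
            return ("relop", "GT"), forward - 1            # state 8 *: retract

print(relop("<= 10"))   # (('relop', 'LE'), 2)
print(relop("<x"))      # (('relop', 'LT'), 1)
```

In the second call the character x is read in order to decide that the operator is < and not <= or <>, and the retraction hands it back to be scanned as part of the next token.
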
Summary
  Lexical Analysis
    - automatic ((F)Lex)
    - uses regular expressions and regular definitions
    - pattern / lexeme / token
  Strings and Languages
    - alphabet / string / language
    - operations: union / concatenation / closure
    - L ∪ M, LM, L*, L+
  Regular Expressions
    - equivalence & axioms
    - regular definitions
    - shorthand: r*, r+, r?, [a-z]
  Tokens
    - produced by LA
    - used by SA (syntax analysis)
    - have attributes (e.g. relop, >=)
  Transition Diagrams
    - representation of an LA