Today Quiz 2 Lexer design Syntax Analysis: Context-Free Grammars Assignments HW2 (out, due Tues) S. Bowers 1 of 15
Implementing a Lexer for MyPL (HW 2) Similar in spirit to HW 1 We ll create three classes: MyPLError for representing all MyPL errors Token token type, lexeme, line and column numbers Lexer mainly a next_token() function (mypl_error.py) (mypl_token.py) (mypl_lexer.py) We ll use Exceptions to handle errors In file mypl_error.py class MyPLError(Exception): def init (self, message, line, column): self.message = message self.line = line self.column = column def str (self): To throw an exception raise Error('oops!', 1, 10) S. Bowers 2 of 15
To handle an exception try: code that could cause an exception except Error as e: print(e) also see Lecture 2 S. Bowers 3 of 15
Representing Tokens (HW 2) Define a set of token-type constants (in mypl_token.py) SEMICOLON = 'SEMICOLON' LPAREN = 'LPAREN' RPAREN = 'RPAREN' EOS = 'EOS' # end of file/stream ($$) Define a Token class (in mypl_token.py) class Token(object): def init (self, tokentype, lexeme, line, column): def str (self): A token object just holds the information for one token A lexer object converts the program text to a stream of tokens Each token is returned by the lexer as a token object for example: the_token = the_lexer.next_token() S. Bowers 4 of 15
Lexical Analyzer (HW 2) We ll define a Lexer class (in mypl_lexer.py) the lexer is initialized with the file to scan the next_token() method returns the next token in the input for example: file_stream = open(filename, `r') the_lexer = lexer.lexer(file_stream) the_token = the_lexer.next_token() # open source file # create a lexer # get first token The basic structure of the Lexer class class Lexer(object): def init (self, input_stream): self.line = 1 self.column = 0 self.input_stream = input_stream def read(self): return self.input_stream.read(1) def peek(self): pos = self.input_stream.tell() symbol = self. read() self.input_stream.seek(pos) return symbol # stores current line # stores current column # returns one character # current stream position # read 1 character # return to old position def next_token(self): S. Bowers 5 of 15
Some hints for implementing next_token() Can check if end of file using peek if self. peek() == '': return Token(token.EOS, '', self.line, self.column) Use read() to build next token one charactor symbol at a time symbol = self. read() self.column += 1 Check if a character is a digit using the isdigit() function if symbol.isdigit(): build up curr_lexeme to hold number return Token(token.INT, curr_lexeme, self.line, self.column) Check if a character is a letter using the isalpha() function if symbol.isalpha(): Check if a character is whitespace using the isspace() function if symbol.isspace(): return self.next_token() S. Bowers 6 of 15
Use the symbol as the lexeme if the lexeme is unimportant if symbol == ';': return Token(token.SEMICOLON, symbol, self.line, self.column) Raise an exception if token can t be created s = 'unexpected symbol "%s"' % symbol raise MyPLError(s, self.line, self.column) You don t need to modularize your next_token() function it is okay in this case to have a relatively long function body you can modularize it if you want, but it isn t required for HW 2 S. Bowers 7 of 15
Compilation Steps (Syntax Analysis) Source Program Lexical Analysis Token Sequence Syntax Analysis Abstract Syntax Tree Front End Semantic Analysis Input Abstract Syntax Tree Computer Machine Code Machine Code Generation Intermediate Code Optimization Intermediate Code Intermediate Code Generation Output Back End Syntax Analysis: The Plan 1. Start with defining a language s syntax using formal grammars 2. Describe how to check syntax using recursive descent parsing 3. Extend parsing by adding abstract syntax trees (parse trees) S. Bowers 8 of 15
Formal Grammars Grammars define rules that specify a language s syntax a language here means a set of allowable strings Grammars can be used within Lexers (lexical analysis) Parsers (syntax analysis) e.g., numbers, strings, comments check if syntax is correct We ll look at context-free grammars won t get into all the details typically studied in a compiler or theory (formal languages) class S. Bowers 9 of 15
Grammar Rules Grammar rules define productions ( rewritings ) s a Here we say s produces (or yields ) a s is a non-terminal symbol (LHS of a rule) sometimes as s or S a is a terminal symbol terminal and non-terminal symbols are disjoint set of terminals is the alphabet of the language often there is a distinguished start symbol Concatenation s ab s yields the string a followed by the string b t uv u a v b t yields the same exact string as s S. Bowers 10 of 15
Alternation s a b s yields the string a or b the same as: s a s b The empty string s a ɛ ɛ denotes the special empty terminal s yields either the string a or '' (empty string) Kleene Star (Closure) s a s yields the strings with zero or more a s S. Bowers 11 of 15
Recursion s (s) () s yields the strings of well balanced parentheses note that this is not possible to express using (closure) the opposite is true through can be implemented using recursion (w/ the empty string ) Q: How can we represent s aa using recursion? s a as sometimes denoted as s a + Exercise: Try to define some languages (exercise sheet) S. Bowers 12 of 15
MyPL Constants Using grammar rules to define constant values BOOLVAL true false INTVAL pdigit digit 0 FLOATVAL INTVAL. digit digit STRINGVAL " character " ID letter ( letter digit _ ) letter a z A Z pdigit 1 9 digit 0 pdigit where character is any symbol (letter, number, etc.) except " S. Bowers 13 of 15
Parsing: An example grammar Simple list of assignment statements stmt_list stmt stmt ';' stmt_list stmt var '=' expr var 'A' 'B' 'C' expr var var '+' var var '-' var Note: many possible grammars for our language! We can use grammars to generate strings (derivations) 1. choose a production (start symbol on left-hand side) 2. replace with right-hand side 3. pick a non-terminal N and production with N on left side 4. replace N with production s right-hand side 5. continue until only terminals remain S. Bowers 14 of 15
Example derivation of A = B + C; B = A stmt_list stmt ; stmt_list var = expr ; stmt_list A = expr ; stmt_list A = var + var ; stmt A = B + var ; stmt_list A = B + C ; stmt_list A = B + C ; stmt A = B + C ; var = expr A = B + C ; B = expr A = B + C ; B = var A = B + C ; B = C This is a left-most derivation derived the string by replacing left-most non-terminals The opposite is a right-most derivation stmt_list stmt ; stmt_list var = expr ; stmt_list var = expr ; stmt var = expr ; var = expr... Sometimes write for a multi-step derivation e.g.: stmt_list var = expr ; var = expr S. Bowers 15 of 15