Compilation 2014 Warm-up project Aslan Askarov aslan@cs.au.dk Revised from slides by E. Ernst
Straight-line Programming Language Toy programming language: no branching, no loops Skip lexing and parsing issues Focus on the meaning interpretation Syntax Stm Stm; Stm Stm id := Exp Stm print ( ExpList ) Exp id Exp num Exp Exp BinOp Exp Exp ( Stm, Exp ) (CompoundStm) (AssignStm) (PrintStm) (IdExp) (NumExp) (OpExp) (EseqExp) ExpList Exp, ExpList ExpList Exp Binop + Binop Binop Binop / (PairExpList) (LastExpList) (Plus) (Minus) (Times) (Div)
Straight-line program Source: CompoundStm a := 5 + 3; b := (print (a, a - 1),10 * a); print (b) Corresponding syntax tree: NumExp 5 a AssignStm OpExp BinOp NumExp Plus 3 CompoundStm AssignStm b EseqExp PrintStm OpExp PrintStm LastExpList IdExp PairExpList NumExp BinOp IdExp b IdExp LastExpList 10 Times a a OpExp IdExp BinOp NumExp a Minus 1
SLP syntax representation datatype SML declaration type id = string datatype binop = Plus Minus Times Div datatype stm = CompoundStm of stm * stm AssignStm of id * exp PrintStm of exp list and exp = IdExp of id NumExp of int OpExp of exp * binop * exp EseqExp of stm * exp Stm Stm; Stm Stm id := Exp Stm print ( ExpList ) Exp id Exp num Exp Exp BinOp Exp Exp ( Stm, Exp ) ExpList Exp, ExpList ExpList Exp Binop + Binop Binop Binop / (CompoundStm) (AssignStm) (PrintStm) (IdExp) (NumExp) (OpExp) (EseqExp) (PairExpList) (LastExpList) (Plus) (Minus) (Times) (Div)
SLP syntax representation Source program CompoundStm a := 5 + 3; b := (print (a, a - 1),10 * a); print (b) a AssignStm OpExp CompoundStm AssignStm PrintStm NumExp BinOp NumExp b EseqExp LastExpList SML value: val prog = CompoundStm ( AssignStm ( a", OpExp ( NumExp 5, Plus, NumExp 3)), CompoundStm ( AssignStm ("b", EseqExp ( PrintStm [IdExp "a", OpExp ( )], OpExp (NumExp 10, ))), PrintStm [IdExp "b"])) 5 Plus IdExp a IdExp a 3 PairExpList LastExpList OpExp BinOp Minus PrintStm NumExp 10 NumExp 1 OpExp BinOp Times IdExp a IdExp b
Project assignment Follow descriptions p10-12 in MCIML Modularity principles p9-10: discussed on Friday, may be ignored at first
Lexical analysis
Lexical analysis High-level source code Lexing Parsing Elaboration Low-level target code
Lexical analysis First phase in the compilation Input: stream of characters i f ( x > 0 ) \n \t t h e n 1 \n \t e l s e 0 IF LPAREN ID ( x ) GE INT (0) RPAREN THEN INT (1) ELSE INT (0) Output: stream of tokens in our language Discards comments, whitespace, newline, tab characters, preprocessor directives
Tokens Type ID Examples foo n14 a my-fun INT 73 0 070 REAL 0.0.5 10. IF if COMMA, LPAREN ( ASGMT :=
Non-tokens Type Examples comments /* dead code */ // comment (* nest (*ed*) *) preprocessor directives #define N 10 #include <stdio.h> whitespace
Token data structure Many tokens need no associated data, e.g.: IF, COMMA, LPAREN, RPAREN, ASGMT Some tokens carry an associated string: ID ( my-fun ) Some tokens carry associated data of other types: INT (73), INT (1), FLOAT (IEEE754, 1001111100 ) Tokens may include useful additional information: start/end pos in input file (line number + column, or charpos)
Q/A Consider source program var δ := 0.0 Language: case sensitive, ASCII How to report error of using δ? FileName:Line.Col: Illegal character δ
Regular expressions We can use regular expressions to specify programming language tokens Regular expressions: Expected to be well-known Syntax: symbol a choice x y concat x y empty ε repeat x*
Regular expressions used for scanning Examples if (IF); [a-z][a-z0-9]* (ID); [0-9]* (NUM); ([0-9]+. [0-9]*) ([0-9]*. [0-9]+) (REAL); ( -- [a-z]* \n ) ( \t ) (continue());. (error (); continue());
Resolving ambiguities Rule: when a string can match multiple tokens, the longest matching token wins i f x > 0 ID ( ifx ) We also need to specify priorities if we match several tokens of the same length. Usual rule: earliest declaration wins if (IF); [a-z][a-z0-9]* (ID); i f IF ID ( if )
Lexical analysis Specification: Tokens as regular exps +longest-matching rule +priorities Formalism: NFA DFA Implementation: Simulate NFA Simulate DFA linear complexity Output: Program that translates raw text into stream of tokens
Total NFA for ID,IF,NUM,REAL a-e,g-z,0-9 blank etc. ID f 0-9,a-z 2 3 4 5 i 0-9 1 blank etc. - other IF ID error REAL a-h,j-z NUM 12 13 9 error. 0-9,a-z 7. 0-9 error REAL 8 0-9 0-9 6 0-9 - \n 10 a-z whitespace 11
ML-Lex Lexer generator, built-in part of SML/NJ Accepts lexical specification, produces a scanner Example specification (* SML declarations *) type lexresult = Tokens.token fun eof() = Tokens.EOF(0,0) %% (* Lex definitions *) digits=[0-9]+ %% (* Regular Expressions and Actions *) if => (Tokens.IF(yypos,yypos+2)); [a-z][a-z0-9]* => (Tokens.ID(yytext,yypos,yypos + size yytext)); {digits} => (Tokens.NUM( Int.fromString yytext, yypos, yypos + size yytext); ({digits}. [0-9]*) ([0-9]*. {digits}) => (Tokens.REAL( Real.fromString yytext, yypos, yypos + size yytext)); ( -- [a-z]* \n ) ( \n \t )+ => (continue()); => ( ErrorMsg.error yypos Illegal character ; continue());
Lexer states Helpful when handling different kinds of tokens For ex.: use state INITIAL in general lexing (automatic) STRING when scanning the contents of a string COMMENT when scanning a comment Point: keep different concerns apart simpler Syntax:... (* Regular Expressions and Actions *) <INITIAL>if => (Tokens.IF(yypos,yypos+2)); <INITIAL>[a-z][a-z0-9]* => (Tokens.ID(yytext,yypos,yypos + size yytext));... <INITIAL> \ => (YYBEGIN STRING; continue());... <STRING>. => (continue());...
Summary Warm-up project: Program in SML Straight-line programming language, no lexing/parsing involved Express programs: use abstract syntax tree datatype Project specified on website, essentially as in the book Lexical analysis Avoid complexity in grammar. Use lexer Based on regular expressions. Implementation via NFA/DFA Theory assumed known Tools: ML-Lex Scanner generator, outputs SML code from spec Note lexer states