COS 301 Programming Languages
Lexical and Syntax Analysis
Sebesta, Ch. 4

Syntax analysis
  Programming languages may be compiled, interpreted, or hybrid
  All of them have to do syntax analysis
  For a compiled language: produces parse trees

Overall Syntax Analysis
  Program (string of characters)
    → Lexical Analyzer → tokens & lexemes
    → Syntax Analyzer → parse trees (decorated)

Why separate phases?
  Different difficulties:
    Lexical analysis: simple, so use a simple approach
      Optimize it, since a lot of time is spent here
    Syntax analysis: more complex, so it needs a more complex approach
  Portability:
    Syntax analyzer: portable
    Lexical analyzer: maybe not
  But: they may not really be totally separate phases

Lexical and syntax analysis
  Lexical analysis:
    Low-level analysis: looking for identifiers, constants
    Needs a regular grammar
    Recognized by a finite state machine (automaton)
  Syntax analysis:
    Needs a context-free (or attribute) grammar
    Recognized by a pushdown automaton (recursive transition network)

Lexical Analysis
  Pattern matching
  Lexical analyzer (LA): a pattern matcher
  Input: string of characters
  Look for patterns: lexemes (e.g., myarray)
  Also determine the categories of lexemes:
    Categories = tokens (e.g., identifier)
    Often represented by a numeric code
  Output: tokens + lexemes
  Strips out comments, whitespace

Tokens
  Identifiers
  Literals:
    Numbers: 2, 3, 5.7, 3E4
    Characters: 'x'
    Strings: "foo"
    Booleans: TRUE
  Keywords/reserved words: while, if, etc.
  Operators: +, -, *, /, **, ^, etc.
  Punctuation: ; ( ) { } [ ]

Non-token strings
  Whitespace (space, tab, ...)
    Sometimes not just discarded (e.g., Python)
  Comments
  EOL
    Some operating systems: EOL+newline
    Sometimes whitespace (C, C++, Java, Lisp, ...)
    Sometimes a statement separator (FORTRAN, Basic)
  EOF

Example output
  foo = foo * PI / 2;

  Token      Lexeme
  IDENT      foo
  ASSIGN_OP  =
  IDENT      foo
  MULT_OP    *
  IDENT      PI
  DIV_OP     /
  INT_LIT    2
  SEMICOLON  ;

Building a lexical analyzer
  One way:
    Write a regular grammar of the tokens
    Give it to lex, flex, flex++, etc. → table-driven lexical analyzer
  Another way:
    Draw a state transition diagram for the tokens
    Write a custom program to implement it
  Third way:
    Draw a state transition diagram
    Construct a table-driven implementation

Review: Chomsky hierarchy
  Four levels of languages (grammars)
  Each can be recognized/generated by an automaton (formal machine):

  Language (grammar)        Automaton
  Regular                   Finite-state automaton
  Context-free              Pushdown automaton
  Context-sensitive         Linear-bounded automaton
  Recursively-enumerable    Turing machine

  CFGs are needed for syntax
  Regular grammars are sufficient for lexical analysis
    So the state diagram for the LA should represent an FSA

Grammars
  Regular grammars:
    LHS: single nonterminal
    RHS: at most 1 nonterminal, rightmost/leftmost
  Context-free grammars: only one nonterminal on LHS
  Context-sensitive grammars:
    LHS: any number of terminals, nonterminals
    Sentential form cannot shrink in a derivation
  Recursively-enumerable (unrestricted) grammars

Regular grammars
  Tuple {P, T, N, S}
    P = productions
    T = terminals
    N = nonterminals
    S = start symbol(s)
  Must be right- or left-regular

Right regular grammars
  RHS contains at most 1 nonterminal
  The nonterminal must be the rightmost symbol
  Let ω ∈ T*, and A, B ∈ N; productions:
    A → ωB
    A → ω
  E.g.: let a = an alphanumeric character, and n = numeral:
    S → aR
    R → aR
    R → nR

Left regular grammars
  Same, except the nonterminal is on the left:
    A → Bω
    A → ω

Linear grammars
  Linear grammars: both kinds of rules (right- and left-regular)
  Not strictly a regular grammar: more powerful
  E.g.: balancing (), {}, begin/end
    Regular grammar: no
    Linear grammar: yes
  E.g.: {a^n b^n | n ≥ 1}
    S → aAb          or     S → aA
    A → S | ε               A → Sb | b
  Regular languages ⊂ linear languages ⊂ CF languages

Example regular grammar: Integers
  Right-regular grammar for whole numbers:
    <num>  → 0 | 1 <num2> | 2 <num2> | ... | 9 <num2>
    <num2> → 0 <num2> | 1 <num2> | 2 <num2> | ... | 9 <num2> | ε
  As EBNF:
    <num> → (0 | (1 | ... | 9) {(0 | 1 | ... | 9)})

Finite state automata (machines)
  Automaton = abstract machine
  Two types:
    nondeterministic FSA (NFSA)
    deterministic FSA (DFSA)
  Only DFSAs are useful for our purposes
  Equivalent in power: any NFSA can be converted to an equivalent DFSA

DFSA
  DFSA: formal machine with a finite number of states
  Accepts input from a tape
  State + input symbol → unique next state
  Start state, accepting (end) state(s)
  Transitions: each one consumes (reads) a symbol
  Accepts a string when it reaches an accepting state and no more input is left
  Else: error

Uses of FSAs
  Language recognition
  Describe other things
  Control things (i.e., represent simple programs)

FSA as graph
  FSAs can be represented as directed graphs
  Nodes ↔ states
  Input alphabet + end-of-input symbol
  State transition function: represented by directed edges in the graph, labeled with symbols or sets of symbols
  Unique start state
  One or more final (accepting) states

Example: Vending Machine
  [Figure: state diagram of a vending machine]
  Adapted from Wulf, Shaw, Hilfinger, Flon, Fundamental Structures of Computer Science, p. 17.

Example: Battery Charger
  [Figure: state diagram of a battery charger]
  From http://www.jcelectronica.com/articles/state_machines.htm

Regular expressions
  Regular expressions:
    Alternative to regular grammars
    Specify a language at the lexical level
  Also used in text processing and web applications
  Built-in support in many languages: e.g., Perl, Ruby, Java, JavaScript, Python, .NET languages

Regular expression conventions

  Regex    Meaning
  x        a character x (stands for itself)
  \x       an escaped character, e.g., \n
  M | N    M or N
  M N      M followed by N

  Note: the meaning of \ varies with software. Typical usage:
    certain non-printable characters (e.g., \n = newline and \t = tab)
    ASCII hex (\xff) or Unicode hex (\uffff)
    shorthand character classes (\w = word, \s = whitespace, \d = digit)
    escaping a literal, e.g., \* or \.

Meta-symbols

  Regex    Meaning
  M+       One or more occurrences of M
  M?       Zero or one occurrence of M
  M*       Zero or more occurrences of M
  [ ]      Surrounds a range or set: one of these
             E.g., [aeiou]: the set of vowels
             E.g., [0-9]: the set of digits
             E.g., [A-Za-z0-9]: the set of alphanumeric characters
  .        Any single character
  ( )      Grouping

Regex example
  Let Σ = { a, b, c }
  r = (a | b)*c
  This regex specifies repetition (0, 1, 2, etc. occurrences) of either a or b, followed by c.
  Strings that match this regular expression include:
    c
    ac
    bc
    abc
    aabbaabbc

Regex example
  Let Σ = { a, b, c }
  r = (a | c)*b(a | c)*
  This regular expression specifies repetition of either a or c, followed by b, followed by repetition of either a or c.
  Strings that match include:
    b
    ab
    bcccc
    abc
    aaccaab
    aacabccca

Regex example
  Signed integers
    Leading +/- (optional)
    At least 1 digit in 0..9
  Regex: (\+|\-)?[0-9]+
  Matches include +1, 0, -0, 827356, -98686, ...

Regex example
  Create a regular expression to represent a signed floating point number.
  There is an optional leading sign (+ or -), followed by 1 or more digits in the range 0..9, optionally followed by a decimal point and then 1 or more digits in the range 0..9.
  The \. symbol indicates that . is the literal period and not the . symbol for any character.
    1. (\+|\-)?[0-9]+(\.[0-9]+)?
    2. [-+]?([0-9]+\.[0-9]+|[0-9]+)
    3. [-+]?[0-9]+\.?[0-9]*
       This one also allows a trailing decimal point with no digits after it, e.g., 9.
  This illustrates how complex regexes can be!
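
The signed-number regexes above can be tried out directly in C. The following is a minimal sketch, assuming a POSIX system that provides <regex.h>; the helper name full_match and the constants SIGNED_INT and SIGNED_FLOAT are hypothetical names chosen for this example. The patterns are the slide regexes with ^ and $ added so that the whole string has to match, and with - written unescaped, since POSIX extended syntax does not define an escaped hyphen.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Signed integer and signed float patterns from the slides, anchored with
   ^ and $ so that the entire string must match (an addition for this sketch). */
const char *const SIGNED_INT   = "^(\\+|-)?[0-9]+$";
const char *const SIGNED_FLOAT = "^[-+]?[0-9]+(\\.[0-9]+)?$";

/* Returns 1 if s matches the POSIX extended regex pattern, 0 otherwise.
   full_match is a hypothetical helper, not part of any library. */
int full_match(const char *pattern, const char *s)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && s != NULL);

    /* REG_EXTENDED enables ?, + and | ; REG_NOSUB skips capture bookkeeping. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                /* treat a bad pattern as "no match" */

    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, full_match(SIGNED_FLOAT, "9.") is expected to fail, because this pattern requires at least one digit after the decimal point, whereas regex 3 above would accept that string.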

DFSA for regular grammar
  E.g.: a DFSA that accepts binary strings with an even number of 1 bits
  Right regular grammar:
    A → 0A | 1B | ε
    B → 0B | 1A
  Regex: 0*(10*10*)*
  [Figure: DFSA with states A (start, accepting) and B; each state loops on 0; a 1 moves from A to B and from B back to A]

Regex libraries
  Many available online
  See for example http://regexlib.com/default.aspx

Lexical analysis state transition diagram
  For recognizing/generating regular languages
  A DFSA
  Nodes ↔ states
  Arcs ↔ transitions between states
  Labels: input characters
    Actions (optional)
  Labels can be classes of characters (e.g., 0-9, [A-Z, a-z], etc.)
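
The even-number-of-1-bits DFSA above can be written as a small table-driven recognizer. This is a sketch under stated assumptions: the state names follow the grammar's nonterminals A and B, while the table name next_state and the function name accepts_even_ones are made up for this example. Rejecting any character other than 0 or 1 is also a choice made here, not something the slide specifies.

```c
#include <assert.h>
#include <stddef.h>

/* States named after the nonterminals of the right regular grammar. */
enum { STATE_A = 0, STATE_B = 1, NUM_STATES = 2 };

/* Transition table: next_state[current state][input bit]. */
static const int next_state[NUM_STATES][2] = {
    /* on 0     on 1   */
    { STATE_A, STATE_B },   /* from A: even number of 1s so far */
    { STATE_B, STATE_A },   /* from B: odd number of 1s so far  */
};

/* Returns 1 if s is a binary string with an even number of 1 bits. */
int accepts_even_ones(const char *s)
{
    int state = STATE_A;            /* A is the start state */

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1')
            return 0;               /* symbol outside the alphabet */
        state = next_state[state][*s - '0'];
    }
    return state == STATE_A;        /* A is the only accepting state */
}
```

The empty string is accepted, matching the production A → ε.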

An FSA for identifiers
  [Figure: start state S goes to state 1 on Letter; state 1 loops on Letter, Digit; an ε arc leads from state 1 to F, an explicit accepting state]
  Could also draw as:
  [Figure: start state S goes to state 1 on L; state 1 loops on L, D]

What language is this?
  What language is described by this diagram?
  [Figure: state diagram with start state S and arcs labeled a, m, and d]
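
The identifier FSA above, with its arcs labeled by character classes rather than single characters, might be coded along these lines. This is a sketch with assumptions: the names classify, is_identifier, and the enum constants are hypothetical, and the explicit accepting state F is folded into state 1 by treating end of input as the move to acceptance.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Character classes used as arc labels. */
enum char_class { CC_LETTER, CC_DIGIT, CC_OTHER };

/* ST_START is S in the diagram, ST_IN_ID is state 1. */
enum id_state { ST_START, ST_IN_ID, ST_ERROR };

static enum char_class classify(int ch)
{
    if (isalpha(ch)) return CC_LETTER;
    if (isdigit(ch)) return CC_DIGIT;
    return CC_OTHER;
}

/* Returns 1 if s is a letter followed by any number of letters or digits. */
int is_identifier(const char *s)
{
    enum id_state state = ST_START;

    assert(s != NULL);
    for (; *s != '\0' && state != ST_ERROR; s++) {
        enum char_class cc = classify((unsigned char)*s);

        switch (state) {
        case ST_START:              /* first character must be a letter */
            state = (cc == CC_LETTER) ? ST_IN_ID : ST_ERROR;
            break;
        case ST_IN_ID:              /* loop on Letter, Digit */
            state = (cc == CC_OTHER) ? ST_ERROR : ST_IN_ID;
            break;
        default:
            break;
        }
    }
    return state == ST_IN_ID;       /* accept only if input ended in state 1 */
}
```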

Lexical syntax for a simple C-like language
  anychar     [ -~]      Note: space (0x20) to tilde (0x7e)
  Letter      [a-zA-Z]
  Digit       [0-9]
  Whitespace  [ \t]      Again note the literal space (0x20)
  EOL         \n
  EOF         \004

Lexical syntax for a simple C-like language
  Keyword     bool | char | else | false | float | if | int | main | true | while
  Identifier  {Letter}({Letter} | {Digit})*
  integerlit  {Digit}+
  floatlit    {Digit}+\.{Digit}+
  charlit     '{anychar}'
  Operator    = && == != < <= > >= + - * / ! [ ]
  Separator   : . { } ( )
  Comment     // ({anychar} | {Whitespace})* {EOL}

Some common FSA conventions
  Unlabeled arc: any other valid input symbol.
  Recognition of a token ends in a final state.
  Recognition of a non-token (e.g., whitespace, comment) transitions back to the start state.
  Recognition of the end symbol (end of file) ends in a final state.

FSA
  The automaton must be deterministic.
  Drop keywords; handle them separately with a lookup table
  We must consider all sequences with a common prefix together, e.g.:
    Floats and ints
    Comments and division

DFSA for a small C-like language
  ws = whitespace, l = letter, d = digit, eoln = \n, eof = end of input
  All others are literal
  [Figure: DFSA fragments for whitespace, // comments, the division op, and identifiers]

DFSAs for a small C-like language
  [Figure: DFSA fragments for ints and floats, single & double quotes, assignment & comparison, addition, and logical and bitwise AND]
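
The advice above to drop keywords from the automaton and handle them with a lookup table could look like the following sketch. It recognizes a lexeme as an identifier first and then checks it against a sorted table. The keyword list is the one from the lexical syntax slide; the names KEYWORDS, compare_keyword, and is_keyword are hypothetical, and using the standard library's bsearch is one option among several (a linear scan would also do for ten entries).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Keywords of the C-like language, kept in strcmp order for bsearch. */
static const char *const KEYWORDS[] = {
    "bool", "char", "else", "false", "float",
    "if", "int", "main", "true", "while"
};
#define NUM_KEYWORDS (sizeof KEYWORDS / sizeof KEYWORDS[0])

/* bsearch comparator: key is a string, elem points at a table entry. */
static int compare_keyword(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns 1 if lexeme is a keyword, 0 if it is an ordinary identifier. */
int is_keyword(const char *lexeme)
{
    assert(lexeme != NULL);
    return bsearch(lexeme, KEYWORDS, NUM_KEYWORDS,
                   sizeof KEYWORDS[0], compare_keyword) != NULL;
}
```

Keeping keywords out of the DFSA means the identifier states do not need to be split for every keyword prefix; the cost is one table lookup per identifier.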

Lexical Rules
  <id>    ::= <letter> | <letter> <id2>
  <id2>   ::= <letter> <id2> | <digit> <id2> | <letter> | <digit>
  <int>   ::= <digit> | <digit> <int>
  <other> ::= + | - | * | / | ( | )

State Diagram Implementation: Lexical Analyzer from Text
  front.c (pp. 176-181)
  Following is the output of the lexical analyzer of front.c when used on (sum + 47) / total

  Next token is: 25, Next lexeme is (
  Next token is: 11, Next lexeme is sum
  Next token is: 21, Next lexeme is +
  Next token is: 10, Next lexeme is 47
  Next token is: 26, Next lexeme is )
  Next token is: 24, Next lexeme is /
  Next token is: 11, Next lexeme is total
  Next token is: -1, Next lexeme is EOF

Program Structure
  The program is a DFSA with global variables
  Utility routines:
    getChar: gets the next character of input, puts it in nextChar, determines its class, and puts the class in charClass
    getNonBlank: advances over whitespace to the first character of a token
    addChar: appends the character in nextChar to lexeme, the place where the lexeme is being accumulated
    lookup: in general, determines whether the string in lexeme is a reserved word and returns a code; in front.c it returns the token code for operators and parentheses

front.c

    #include <stdio.h>
    #include <ctype.h>

    /* Global declarations */
    /* Variables */
    int charClass;
    char lexeme[100];
    int nextChar;             /* int rather than char, so it can hold EOF */
    int lexLen;
    int nextToken;
    FILE *in_fp;

    /* Function declarations */
    void addChar(void);
    void getChar(void);       /* not "getchar": that name belongs to <stdio.h> */
    void getNonBlank(void);
    int lex(void);
    int lookup(char ch);

    /* Character classes */
    #define LETTER 0
    #define DIGIT 1
    #define UNKNOWN 99

    /* Token codes */
    #define INT_LIT 10
    #define IDENT 11
    #define ASSIGN_OP 20
    #define ADD_OP 21
    #define SUB_OP 22
    #define MULT_OP 23
    #define DIV_OP 24
    #define LEFT_PAREN 25
    #define RIGHT_PAREN 26

    /* main driver */
    int main(void) {
        /* Open the input data file and process its contents */
        if ((in_fp = fopen("front.in", "r")) == NULL)
            printf("ERROR - cannot open front.in \n");
        else {
            getChar();
            do {
                lex();
            } while (nextToken != EOF);
        }
        return 0;
    }

    /* lookup - a function to look up operators and parentheses
       and return the token */
    int lookup(char ch) {
        switch (ch) {
            case '(': addChar(); nextToken = LEFT_PAREN;  break;
            case ')': addChar(); nextToken = RIGHT_PAREN; break;
            case '+': addChar(); nextToken = ADD_OP;      break;
            case '-': addChar(); nextToken = SUB_OP;      break;
            case '*': addChar(); nextToken = MULT_OP;     break;
            case '/': addChar(); nextToken = DIV_OP;      break;
            default:  addChar(); nextToken = EOF;         break;
        }
        return nextToken;
    }

    /* addChar - a function to add nextChar to lexeme */
    void addChar(void) {
        if (lexLen <= 98) {
            lexeme[lexLen++] = (char)nextChar;
            lexeme[lexLen] = 0;
        } else {
            printf("Error - lexeme is too long \n");
        }
    }

    /* getChar - a function to get the next character of input
       and determine its character class */
    void getChar(void) {
        if ((nextChar = getc(in_fp)) != EOF) {
            if (isalpha(nextChar))
                charClass = LETTER;
            else if (isdigit(nextChar))
                charClass = DIGIT;
            else
                charClass = UNKNOWN;
        } else
            charClass = EOF;
    }

    /* getNonBlank - a function to call getChar until it returns
       a non-whitespace character */
    void getNonBlank(void) {
        while (isspace(nextChar))
            getChar();
    }

    /* lex - a simple lexical analyzer for arithmetic expressions */
    int lex(void) {
        lexLen = 0;
        getNonBlank();
        switch (charClass) {
            case LETTER:      /* parse identifiers */
                addChar();
                getChar();
                while (charClass == LETTER || charClass == DIGIT) {
                    addChar();
                    getChar();
                }
                nextToken = IDENT;
                break;
            case DIGIT:       /* parse integer literals */
                addChar();
                getChar();
                while (charClass == DIGIT) {
                    addChar();
                    getChar();
                }
                nextToken = INT_LIT;
                break;
            case UNKNOWN:     /* parentheses and operators */
                lookup(nextChar);
                getChar();
                break;
            case EOF:         /* EOF */
                nextToken = EOF;
                lexeme[0] = 'E';
                lexeme[1] = 'O';
                lexeme[2] = 'F';
                lexeme[3] = 0;
                break;
        } /* end of switch */
        printf("Next token is: %d, Next lexeme is %s\n", nextToken, lexeme);
        return nextToken;
    }

Example output
  Input: (sum + 47) / total

  Next token is: 25, Next lexeme is (
  Next token is: 11, Next lexeme is sum
  Next token is: 21, Next lexeme is +
  Next token is: 10, Next lexeme is 47
  Next token is: 26, Next lexeme is )
  Next token is: 24, Next lexeme is /
  Next token is: 11, Next lexeme is total
  Next token is: -1, Next lexeme is EOF

Quiz
  1. Draw a DFSA that recognizes binary strings that start with 1 and end with 0
  2. Draw a DFSA that recognizes binary strings with at least three consecutive 1s
  3. Below is a BNF grammar for fractional numbers:
       S  -> -FN | FN
       FN -> DL | DL.DL
       DL -> D | D DL
       D  -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
     (a) Rewrite as EBNF
     (b) Now draw a corresponding DFSA

Quiz Answers
  1. Draw a DFSA that recognizes binary strings that start with 1 and end with 0
  [Figure: DFSA for question 1, with start state S and arcs labeled 1, 1, 0, 1, 0]

DFSA for q2
  2. Draw a DFSA that recognizes binary strings with at least three consecutive 1s
  [Figure: DFSA for question 2, with start state S and arcs labeled 0, 0, 1, 1, 1, 0, and a final loop on 1,0]

Quiz Answers
  3. Below is a BNF grammar for fractional numbers. Rewrite as EBNF:
       <s>  → -<fn> | <fn>
       <fn> → <dl> | <dl>.<dl>
       <dl> → <d> | <d> <dl>
       <d>  → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
     EBNF:
       <s>  → [-]<fn>
       <fn> → <dl>[.<dl>]
       <dl> → <d>{<d>}
     And as a DFSA:
  [Figure: DFSA for question 3, with start state S and arcs labeled -, 0,1,...,9, and .]
     Could also have had another state to handle -
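
One possible DFSA for the fractional-number grammar in question 3 can be checked by coding it. This is a sketch, not the quiz's official answer: it takes the variant mentioned above with a separate state for the leading -, and the state names and the functions step and accepts_fraction are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* States: start, after '-', in integer part, after '.', in fraction part. */
enum frac_state { F_START, F_SIGN, F_INT, F_DOT, F_FRAC, F_ERROR };

/* Transition function: state + input symbol gives a unique next state. */
static enum frac_state step(enum frac_state st, char ch)
{
    int digit = isdigit((unsigned char)ch);

    switch (st) {
    case F_START:
        if (ch == '-') return F_SIGN;
        return digit ? F_INT : F_ERROR;
    case F_SIGN:
        return digit ? F_INT : F_ERROR;
    case F_INT:
        if (ch == '.') return F_DOT;
        return digit ? F_INT : F_ERROR;
    case F_DOT:
        return digit ? F_FRAC : F_ERROR;
    case F_FRAC:
        return digit ? F_FRAC : F_ERROR;
    default:
        return F_ERROR;             /* the error state is a trap state */
    }
}

/* Returns 1 if s matches <s> -> [-]<fn>, <fn> -> <dl>[.<dl>], <dl> -> <d>{<d>}. */
int accepts_fraction(const char *s)
{
    enum frac_state st = F_START;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        st = step(st, *s);
    return st == F_INT || st == F_FRAC;   /* the two accepting states */
}
```

Strings such as "3." and ".5" are rejected, since the grammar requires at least one digit on each side of the decimal point.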