Cunning Plan Informal Sketch of Lexical Analysis LA identifies tokens from input string lexer : (char list) (token list) Issues in Lexical Analysis Lookahead Ambiguity Specifying Lexers Regular Expressions Examples #2
One-Slide Summary Lexical analysis turns a stream of characters into a stream of tokens. Regular expressions are a way to specify sets of strings. We use them to describe tokens. #3
Fold Batter Lightly... fold_left f a [1;...;n] == f (... (f (f a 1) 2)) n fold_left (fun a e -> e :: a) [] [1;2;3] fold_left (fun a e -> a @ [e]) [] [1;2;3] = [3;2;1] = [1;2;3] fold_right f [1;...;n] b == f 1 (f 2 (... (f n b))) fold_right (fun a e -> e :: a) [1;2;3] [] = [1;2;3] fold_right (fun e a -> a @ [e]) [1;2;3] [] = [3;2;1] #4
Interpreter and Compiler Structure Source Optimization (Interpreter) Lexical Analysis Run It! Abstract Syntax Tree (Compiler) List of Tokens Parsing Code Generation Assembly Code #5
Detail: Modern Reality PA2 Source PA3 Regular Expressions Context-Free Grammar Lexical Analyzer Generator Parser Generator Lexical Analysis Token List Parsing Syntax Tree #6
Lexical Analysis What do we want to do? Example: if (i == j) z = 0; else z = 1; The input is just a sequence of characters: if (i == j)\n\tz = 0;\nelse\n\tz = 1; Goal: partition input strings into substrings And classify them according to their role #7
What's a Token? Output of lexical analysis is a list of tokens A token is a syntactic category In English: noun, verb, adjective,... In a programming language: Identifier, Integer, Keyword, Whitespace,... Parser relies on token distinctions: e.g., identifiers are treated differently than keywords #8
Tokens Tokens correspond to sets of strings. Identifier: strings of letters or digits, starting with a letter Integer: a non-empty string of digits Keyword: else or if or begin or... Whitespace: a non-empty sequence of blanks, newlines, and/or tabs OpenPar: a left-parenthesis #9
Lexical Analyzer: Build It! An implementation must do two things: Recognize substrings corresponding to tokens Return the value or lexeme of the token The lexeme is the substring #10
Example Recall: if (i == j)\n\tz = 0;\nelse\n\tz = 1; Token-lexeme pairs returned by the lexer: <Keyword, <Whitespace, <OpenPar, <Identifier, <Whitespace, <Relation, <Whitespace,... if > > ( > i > > == > > #11
Lookahead The goal is to partition the string. That is implemented by reading left-to-right, recognizing one token at a time. Lookahead may be required to decide where one token ends and the next token begins Even our simple example has lookahead issues i vs. if = vs. == #13
Still Needed A way to describe the lexemes of each token Recall: lexeme = the substring corresponding to the token A way to resolve ambiguities Is if two variables i and f? Is == two equal signs = =? #14
Languages Definition. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ. Σ is called the alphabet. #15
Examples of Languages Alphabet = English Characters Language = English Sentences Note: Not every string of English characters is an English sentence. Example: xayenb sbe' Alphabet = ASCII characters Language = C Programs Note: ASCII character set is different from English character set. #16
Notation Languages are sets of strings We need some notation for specifying which sets we want that is, which strings are in the set For lexical analysis we care about regular languages, which can be described using regular expressions. #17
Base Regular Expression Single character: 'c' L('c') = { c } (for any c Σ ) Concatenation: AB A and B are other regular expressions L(AB) = { ab a L(A) and b L(B) } Example: L('i' 'f') = { if } We abbreviate 'i' 'f' as 'if' #19
Compound Regular Expressions Union L(A B) = { s s L(A) or s L(B) } Examples: L('if' 'then' 'else') = { if, then, else } L('0' '1' '2' '3' '4' '5' '6' '7' '8' '9') = what? Fun Example: L( ('0' '1') ('0' '1') ) = { 00, 01, 10, 11 } #20
Q: Advertising (810 / 842) The United States Forest Service's ursine mascot first appeared in 1944. Give his catchphrase safety message.
Natural Languages These languages, of which there are about 250, are often mutually intelligible and constitute a major branch of the Niger-Congo languages. They are spoken largely in central, east and southern Africa. Popular examples include Swahili, with 80 million speakers, Shona, with 11 million, and Zulu, with 10 million. They commonly use words such as muntu or mutu for person. Words such as bongos, chimpanzee, gumbo, jumbo, mambo, rumba and safari come from these languages.
Q: Music (150 / 842) In this 1958 Sheb Wooley song the pigeon-toed title character wears short shorts and wants to get a job in a rock'n'roll band playing the horn, but is perhaps best known for his skin tone and non-standard diet.
Example: Keyword Keyword: else or if or begin or... 'else' 'if' 'begin'... (Recall: 'else' abbreviates 'e' 'l' 's' 'e') #25
Example: Integers Integer: a non-empty string of digits digit = '0' '1' '2' '3' '4' '5' '6' '7' '8' '9' number = digit digit* Abbreviation: A+ = A A* #26
Example: Identifier Identifier: string of letters or digits, starting with a letter letter = 'A'... 'Z' 'a'... 'z' ident = letter ( letter digit )* Is (letter* digit*) the same? #27
Example: Whitespace Whitespace: a non-empty sequence of blanks, newlines, and tabs (' ' '\t' '\n') + or (' ' '\t' '\n' '\r') + #28
Example: Phone Numbers Regexps are everywhere! Consider: (434) 924-1021 Σ area exch phone number = {0, 1, 2, 3,..., 9, (, ), -} = digit digit digit = digit digit digit = digit digit digit digit = '(' area ')' exch '-' phone #29
Example: Email Addresses Consider weimer@cs.virginia.edu Σ = {a, b,..., z,., @} name = letter+ address = name '@' name ('.' name)* #30
Regexp Summary Regular expressions describe many useful languages Next: Given a string s and a regexp R, is s L(R) But a yes/no answer is not enough! Instead: partition the input into lexemes We will adapt regular expression to this goal #31
Subsequent Outline Specifying lexical structure using regexps Finite Automata Deterministic Finite Automata (DFAs) Non-deterministic Finite Automata (NFAs) Implementation of Regular Expressions Regexp -> NFA -> DFA -> Tables The tables are the heart of the lexer, which is just a while loop that takes in the current input character and looks up the new state in the transition table. #32
Lexical Specification (1) Select a set of tokens Number, Keyword, Identifier,... Write a regexp for the lexemes of each token Number = digit+ Keyword = 'if' 'else'... Identifier = letter ( letter digit ) * OpenPar = '('... #33
Lexical Specification (2) Construct R, matching all lexemes for all tokens: R = Keyword Identifier Number... R = R1 R2 R3... Fact: if s L(R) then s is a lexeme Furthermore, s L(Rj) for some j This j determines the token that is reported #34
Lexical Specification (3) Let the input be x1... xn Each xi is in the alphabet Σ For 1 i n, check x1... xi L(R) If so, it must be that x1... xi L(Rj) for some j Remove x1... xi from the input and restart #35
Lexing Example R = Whitespace Integer Identifer Plus Parse f +3 +g f matches R, more precisely Identifier matches R, more precisely Whitespace + matches R, more precisely Plus... The token-lexeme pairs are In the future, we'll just <Identifier, f > drop whitespace. <Whitespace, > <Plus, + >... #36
Ambiguities Our algorithm is ambiguous! Example: R = Whitespace Integer Identifier Plus Parse foo+3 f matches R, more precisely Identifier But also fo matches R, and foo, but not foo+ How much input is used? Maximal Munch rule: Pick the longest possible substring that matches R #37
Ambiguities (2) R = Whitespace 'new' Integer Identifier Parse new foo new matches R, more precisely 'new' but also Identifier which one do we pick? In general, use the rule listed first. No, really. So we must list 'new' (and other keywords) before Identifier. #38
Error Handling R = Whitespace Integer Identifier '+' Parse =56 No prefix matches R: not =, nor =5, nor =56 Problem: we can't just get stuck and die Solution: New rule matches all bad strings Put it last Lexer tools allow the writing of: R = R1 R2... Rn Error #39
Summary Cool Demo? Regular expressions provide a concise notation for string patterns Their use in lexical analysis requires small extensions To resolve ambiguities To handle errors Good algorithms known (next) Requiring only a single pass over the input And few operations per character (table lookup) #40