CS 1622 Lecture 2: The Front End and Lexical Analysis

CS 1622 Lecture 2: Lexical Analysis

Lecture 2
- Review of last lecture and finish up overview
- The first compiler phase: lexical analysis
- Reading: Chapter 2 in text (by 1/18)

The Front End
[Diagram: Source code -> Scanner -> tokens -> Parser]
Responsibilities:
- Recognize legal and illegal programs
- Report errors meaningfully
- Produce IR and an initial storage map
- Shape the code for the back end
Typically automatically constructed:
- From a lexical specification
- Based on finite automata (meet theory)
- Very well understood

The Front End - scanner
[Diagram: Source code -> Scanner -> tokens -> Parser]
- Maps characters into tokens, the basic lexical units
  x = y + z becomes <id> <assign> <id> <binop> <id>
- Lexeme = string that matches the token
  x, y, and z are lexemes that match <id>
- Some tokens have attributes: <id, x> or <binop, plus>
- Eliminates whitespace
- In some languages performs preprocessing (in C this is done by the preprocessor)

The Front End - parser
[Diagram: Source code -> Scanner -> tokens -> Parser]
- Recognizes syntactic structure and errors
- Directs semantic analysis (type checking)
- Builds IR for the source program
- For some languages (more precisely: grammars) can be easily built by hand
- More flexible: use parser generators
  - Can change the language more easily
  - Typically very fast
  - Well understood theory (push-down automata)

Grammars
- A concise and precise way to specify languages
- For context-free grammars (CFGs) we can build efficient parsers
- Can typically write a CFG for a programming language
- Tool of choice for specifying syntactic structure

Grammars
Formally, a grammar G = (S, N, T, P), where:
- S is the start symbol
- N is a set of non-terminal symbols
- T is a set of terminal symbols or words
- P is a set of productions or rewrite rules (P : N → (N ∪ T)*)

CFG Example
  1. goal → expr
  2. expr → expr op term
  3.      | term
  4. term → number
  5.      | id
  6. op   → +
  7.      | -
S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7 }

Deriving Sentences
  Production   Result
  (start)      goal
  1            expr
  2            expr op term
  5            expr op y
  7            expr - y
  2            expr op term - y
  4            expr op 2 - y
  6            expr + 2 - y
  3            term + 2 - y
  5            x + 2 - y
To recognize a valid sentence for some CFG, we reverse this process and build up a parse.
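The derivation table above can be replayed mechanically. The following is a minimal sketch, not part of the original slides: the names PRODUCTIONS, apply_rightmost, and derive are made up for illustration, and the terminals are the token categories id and number rather than the lexemes x, 2, and y shown in the table. Each step rewrites the rightmost non-terminal, which is the order the table follows.

```python
# Productions numbered as in the CFG example: number -> (left-hand side, right-hand side).
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op", ["+"]),
    7: ("op", ["-"]),
}
NONTERMINALS = {"goal", "expr", "term", "op"}


def apply_rightmost(form, rule):
    """Rewrite the rightmost non-terminal in `form` using production `rule`."""
    lhs, rhs = PRODUCTIONS[rule]
    for i in range(len(form) - 1, -1, -1):
        if form[i] in NONTERMINALS:
            if form[i] != lhs:
                raise ValueError(f"rule {rule} does not apply to {form[i]}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no non-terminal left to rewrite")


def derive(rules):
    """Return every sentential form produced by applying `rules` in order."""
    form = ["goal"]
    steps = [form]
    for rule in rules:
        form = apply_rightmost(form, rule)
        steps.append(form)
    return steps


# The production sequence from the table: 1, 2, 5, 7, 2, 4, 6, 3, 5.
for step in derive([1, 2, 5, 7, 2, 4, 6, 3, 5]):
    print(" ".join(step))
```

The last line printed is the token-level form of x + 2 - y. A parser works in the opposite direction, starting from the tokens and discovering which productions were used.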

Parse Tree
Parse tree for x + 2 - y:
  goal
    expr
      expr
        expr
          term
            <id,x>
        op
          +
        term
          <number,2>
      op
        -
      term
        <id,y>
Lots of superfluous detail.
Grammar:
  1. goal → expr
  2. expr → expr op term
  3.      | term
  4. term → number
  5.      | id
  6. op   → +
  7.      | -

Abstract Syntax Tree (AST)
  -
    +
      <id,x>
      <number,2>
    <id,y>
- The AST summarizes grammatical structure, without including detail about the derivation
- This is much more concise
- ASTs are one form of intermediate representation (IR)

The Back End - instruction selection
[Diagram: Instruction Selection -> Instruction Scheduling -> Register Allocation -> Machine code]
Responsibilities:
- Translates IR to target code
- Selects target instructions for the IR (trivial for RISC)
- Allocates machine resources (registers, memory)
Typically implemented manually:
- For CISC, some automated pattern-matching approaches exist
- Lots of hand-crafting is done for good back ends; you must know the target architecture well!

The Back End - instruction scheduling
[Diagram: Instruction Selection -> Instruction Scheduling -> Register Allocation -> Machine code]
- Avoid hardware stalls and interlocks
- Use all functional units productively
- Can increase the lifetime of variables
Optimal scheduling is NP-Complete in nearly all cases, but good heuristic techniques are well understood.

The Back End - register allocation
[Diagram: Instruction Selection -> Instruction Scheduling -> Register Allocation -> Machine code]
- Have each value in a register when it is used
- Manage a limited set of resources
- Can change instruction choices and insert LOADs and STOREs
Optimal allocation is NP-Complete, so compilers approximate it.

Traditional Three-pass Compiler
[Diagram: Source Code -> Front End -> Middle End -> Back End -> Machine code]
The middle end:
- Analyzes IR and rewrites (or transforms) it
- Primary goal is to reduce the running time of the compiled code
- May also improve space, power consumption, ...
- Must preserve the meaning of the code

The Optimizer
[Diagram: Opt 1 -> Opt 2 -> Opt 3 -> ... -> Opt n]
Typical transformations:
- Discover and propagate some constant value
- Move a computation to a less frequently executed place
- Specialize some computation based on context
- Discover a redundant computation and remove it
- Remove useless or unreachable code
- Encode an idiom in some particularly efficient form

The Scanner: Overview
Task: translate the sequence of characters into a corresponding sequence of tokens
- essentially grouping characters into words
- removing irrelevant characters, e.g., white space
Each time the scanner is called, it should find the longest sequence of characters in the input, starting with the current character, that corresponds to a token, and return that token.

How to write a scanner?
- Write it from scratch, or
- Automatically generate it with a scanner generator: lex or flex (produce C code), or jlex (produces Java code)
  - Input to a scanner generator: one regular expression for each token
  - Output of a scanner generator: a finite state machine
So, you need to understand:
- regular expressions
- finite automata
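The longest-match rule and the "one regular expression for each token" input can be illustrated with a small hand-written scanner. This is a minimal sketch using Python's re module rather than lex, flex, or jlex; the token names and patterns in TOKEN_SPEC are assumptions chosen to match the earlier x = y + z example, and ties between patterns go to the one listed first.

```python
import re

# Hypothetical token specification: one regular expression per token kind.
TOKEN_SPEC = [
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("number", re.compile(r"[0-9]+")),
    ("assign", re.compile(r"=")),
    ("binop", re.compile(r"[+\-]")),
]
WHITESPACE = re.compile(r"\s+")


def next_token(text, pos):
    """Return (kind, lexeme, new_pos) for the longest token starting at pos."""
    best = None
    for kind, pattern in TOKEN_SPEC:
        m = pattern.match(text, pos)
        # Keep the candidate that consumes the most characters.
        if m and (best is None or m.end() > best[2]):
            best = (kind, m.group(), m.end())
    if best is None:
        raise SyntaxError(f"illegal character {text[pos]!r} at position {pos}")
    return best


def scan(text):
    """Group characters into tokens, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()  # white space is recognized and ignored
            continue
        kind, lexeme, pos = next_token(text, pos)
        tokens.append((kind, lexeme))
    return tokens


print(scan("x = y + z"))
print(scan("tmp2=27"))
```

On "tmp2=27" the scanner returns tmp2 as a single id rather than tmp followed by a number, because it always takes the longest sequence that forms a token.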

Lexical analyzers
Goals:
- To simplify specification and implementation of scanners
- To understand the underlying techniques and technologies
[Diagram: source code -> Scanner -> parts of speech; specifications -> Scanner Generator -> tables or code, which feed the Scanner]

Regular Expressions to Finite Automata
Generating a scanner:
[Diagram: Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA]

Recognizing words
Example: begin
[Diagram: s0 -b-> s1 -e-> s2 -g-> s3 -i-> s4 -n-> s5]
  c = next char; if c != 'b' then error;
  c = next char; if c != 'e' then error;
  c = next char; if c != 'g' then error;
  ...
Transition diagrams
- serve as abstractions for the code that would be written
- are finite automata
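The chain of "c = next char; if c != ... then error" tests collapses into a loop once the states are numbered. A minimal sketch, assuming the states s0 to s5 are represented by the integers 0 to 5 (the function name recognize_begin is made up for illustration):

```python
def recognize_begin(text):
    """Accept exactly the word "begin" by walking states s0..s5."""
    word = "begin"
    state = 0  # s0 is the start state
    for c in text:
        if state < len(word) and c == word[state]:
            state += 1    # follow the edge labeled with c
        else:
            return False  # no transition on c: error
    return state == len(word)  # s5 is the only accepting state


print(recognize_begin("begin"), recognize_begin("begi"), recognize_begin("begins"))
```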

Finite Automata
- A compiler recognizes legal programs in some (source) language.
- A finite-state machine recognizes legal strings in some language.
Example: identifiers, which are sequences of one or more letters or digits, starting with a letter:
[Diagram: S -letter-> A; A -letter, digit-> A; A is the accepting state]

Finite-Automata State Graphs
[Legend: a state; the start state; an accepting/final state; a transition labeled a]

Finite Automata
The transition s1 -a-> s2 is read: in state s1, on input a, go to state s2.
If at end of input or no transition is possible:
- If in an accepting state => accept
- Otherwise => reject

Language defined by FSM
The language defined by an FSM is the set of strings accepted by the FSM.
- In the language of the FSM on the previous slide: x, tmp2, XyZzy, position27
- Not in the language of the FSM on the previous slide: 123, a?, 13apples

Example: Integer Literals
FA that accepts integer literals with an optional + or - sign:
[Diagram: S -digit-> B; S -+, -  -> A; A -digit-> B; B -digit-> B; B is the accepting state]

Formal FSA Definition
A finite automaton is a 5-tuple (Σ, S, δ, s0, S_F) where:
- Σ is an input alphabet
- S is a set of states
- s0 is a start state
- S_F ⊆ S is a set of accepting states
- δ is the state transition function: S × Σ → S (i.e., it encodes the transitions state, input -> state)
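The 5-tuple can be written down directly as data. The sketch below is illustrative only: the DFA class, the classify helper, and the use of character-class names ("letter", "digit") as alphabet symbols are assumptions, not something the slides prescribe. It encodes the identifier FSM and the integer-literal FA and applies the accept/reject rule.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    alphabet: frozenset   # Σ, here a set of character-class names
    states: frozenset     # S
    delta: dict           # δ as a dict: (state, symbol) -> state
    start: str            # s0
    accepting: frozenset  # S_F, a subset of S

    def accepts(self, text, classify):
        state = self.start
        for ch in text:
            state = self.delta.get((state, classify(ch)))
            if state is None:  # no transition possible: reject
                return False
        return state in self.accepting  # end of input


def classify(ch):
    """Map a character to its alphabet symbol (hypothetical helper)."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return ch


identifier = DFA(
    alphabet=frozenset({"letter", "digit"}),
    states=frozenset({"S", "A"}),
    delta={("S", "letter"): "A", ("A", "letter"): "A", ("A", "digit"): "A"},
    start="S",
    accepting=frozenset({"A"}),
)

integer_literal = DFA(
    alphabet=frozenset({"digit", "+", "-"}),
    states=frozenset({"S", "A", "B"}),
    delta={("S", "digit"): "B", ("S", "+"): "A", ("S", "-"): "A",
           ("A", "digit"): "B", ("B", "digit"): "B"},
    start="S",
    accepting=frozenset({"B"}),
)

for s in ["x", "tmp2", "123", "a?", "13apples"]:
    print(s, identifier.accepts(s, classify))
```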

FA for the integer-literal example
- Σ = {digit, +, -}
- A set of states S = {S, A, B}
- A start state s0 = S
- A set of accepting states S_F = {B}
- δ, the state transition function:
    (S, digit) -> B
    (S, +) -> A
    (S, -) -> A
    (B, digit) -> B
    (A, digit) -> B

Two kinds of Automata
- Deterministic (DFA): no state has more than one outgoing edge with the same label.
- Non-Deterministic (NFA): states may have more than one outgoing edge with the same label. Edges may be labeled with ε (epsilon), the empty string. The automaton can take an ε transition without looking at the current input character.

Example of NFA
The integer-literal example:
[Diagram: S -ε, +, -  -> A; A -digit-> B; B -digit-> B; B is the accepting state]
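One way to run an NFA is to track the set of states it could be in, following ε-transitions without consuming input. This is a minimal sketch of that idea for the integer-literal NFA above; the names NFA_DELTA, epsilon_closure, and nfa_accepts are made up, and the empty string is used here as the label for ε edges.

```python
EPSILON = ""  # label used in this sketch for ε-transitions

# (state, label) -> set of possible next states
NFA_DELTA = {
    ("S", EPSILON): {"A"},
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("A", "digit"): {"B"},
    ("B", "digit"): {"B"},
}
START, ACCEPTING = "S", {"B"}


def epsilon_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in NFA_DELTA.get((s, EPSILON), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_accepts(text):
    current = epsilon_closure({START})
    for ch in text:
        label = "digit" if ch in "0123456789" else ch
        moved = set()
        for s in current:
            moved |= NFA_DELTA.get((s, label), set())
        current = epsilon_closure(moved)
    # Accept if at least one of the possible states is accepting.
    return bool(current & ACCEPTING)


print(nfa_accepts("+75"), nfa_accepts("75"), nfa_accepts("+"))
```

On "+75" the set of possible states goes from {S, A} to {A} to {B} to {B}, so the string is accepted.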

Non-deterministic automata (NFA)
- Often simpler (e.g., smaller) than a DFA
- Can be in multiple states at the same time
- An NFA accepts a string if there exists a sequence of moves, starting in the start state and ending in a final state, that consumes the entire string
- Think about it as pursuing all choices in parallel, or as having an oracle that says what to do
[Diagram: the integer-literal NFA on input "+75"]

Equivalence of DFA and NFA
Theorem: For every non-deterministic finite-state machine M, there exists a deterministic machine M' such that M and M' accept the same language.
- Why is the theorem important for scanner generation?
- The theorem is not enough: what do we need for automatic scanner generation?

How to Implement a FSM
A table-driven approach:
- Table: one row for each state in the machine, and one column for each possible character.
- Table[j][k] gives the state to go to from state j on character k.
- An empty entry corresponds to the machine getting stuck.
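A minimal sketch of the table-driven approach for the integer-literal DFA follows. To keep the table small it uses one column per input class (digit, +, -) rather than one per character, which is a simplification made here and not something the slide specifies; None plays the role of an empty entry.

```python
# Column index for each input class; rows are the states S, A, B.
COLUMNS = {"digit": 0, "+": 1, "-": 2}
S, A, B = 0, 1, 2
FINAL = {B}

# TABLE[j][k]: next state from state j on input class k; None is an empty entry.
TABLE = [
    [B, A, A],        # from S
    [B, None, None],  # from A
    [B, None, None],  # from B
]


def run(text):
    state = S
    for ch in text:
        k = COLUMNS.get("digit" if ch in "0123456789" else ch)
        if k is None:
            return False  # character outside the alphabet
        state = TABLE[state][k]
        if state is None:
            return False  # empty entry: the machine got stuck
    return state in FINAL  # end of input: accept only in a final state


print(run("+75"), run("75"), run("+"), run("7+"))
```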

The table-driven program for a DFA
  state = S  // S is the start state
  repeat {
      k = next character from the input
      if k == EOF then  // end of input
          if state is a final state then accept else reject
      state = T[state, k]
      if state == empty then reject  // got stuck
  }

Generating a scanner
[Diagram: Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA]

Regular Expressions
- FAs are not a good way to specify tokens: diagrams are hard to write down
- Regular expressions are another specification technique: a compact way to define a language that can be accepted by an automaton
- Used as the input to a scanner generator
  - define each token, and
  - define white space, comments, etc.; these do not correspond to tokens, but must be recognized and ignored

Example: Simple identifier
English: a letter, followed by zero or more letters or digits.
RE: letter.(letter | digit)*
Operators:
- | means "or"
- . means "followed by" (usually just use position)
- * means zero or more instances
- ( ) are used for grouping

Operands of a regular expression
Operands are the same as labels on the edges of an FSM:
- single characters, or
- the special character ε (the empty string)
"letter" is a shorthand for a | b | c | ... | z | A | ... | Z
"digit" is a shorthand for 0 | 1 | ... | 9
Sometimes we put the characters in quotes; this is necessary when denoting the characters | . *

Precedence of | . * operators
  Regular Expression Operator   Analogous Arithmetic Operator   Precedence
  |                             plus                            lowest
  .                             times                           middle
  *                             exponentiation                  highest
Consider the regular expressions:
  letter.letter | digit*
  letter.(letter | digit)*
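The effect of precedence on the two expressions above can be checked with Python's re module, whose |, concatenation, and * follow the same precedence order. This is a sketch under the assumption that letter means [A-Za-z] and digit means [0-9]; fullmatch is used so that the whole string must match.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# letter.letter | digit*  parses as  (letter.letter) | (digit*)
ungrouped = re.compile(f"{LETTER}{LETTER}|{DIGIT}*")
# letter.(letter | digit)*  applies * to the whole parenthesized choice
grouped = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

for s in ["ab", "123", "", "a1", "tmp2"]:
    print(repr(s), bool(ungrouped.fullmatch(s)), bool(grouped.fullmatch(s)))
```

The first expression describes "two letters, or any number of digits", while the second describes identifiers: a letter followed by any mix of letters and digits.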

Examples
Describe (in English) the language defined by each of the following regular expressions:
  letter (letter digit*)
  digit digit* "." digit digit*

Example: Integer Literals
An integer literal with an optional sign can be defined in English as:
  (nothing or + or -) followed by one or more digits
The corresponding regular expression is:
  ("+" | "-" | epsilon).(digit.digit*)
A new convenient operator, +:
  digit.digit* is the same as digit+, which means "one or more digits"

Language Defined by a Regular Expression
Recall: language = set of strings
- Language defined by an automaton: the set of strings accepted by the automaton
- Language defined by an RE: the set of strings that match the expression
  Regular Exp.     Corresponding Set of Strings
  epsilon          {""}
  a                {"a"}
  a.b.c            {"abc"}
  a | b | c        {"a", "b", "c"}
  (a | b | c)*     {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}
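The integer-literal expression and its digit+ shorthand can be compared directly. A small sketch in Python's re syntax, which is an assumption of this example: the empty alternative in (\+|-|) stands in for epsilon, and ? (not introduced on the slides) marks the optional sign in the shorter form.

```python
import re

# ("+" | "-" | epsilon).(digit.digit*), with an empty alternative for epsilon
explicit = re.compile(r"(\+|-|)[0-9][0-9]*")
# The same language written with the + operator and an optional sign
shorthand = re.compile(r"[+-]?[0-9]+")

for s in ["+75", "-3", "42", "", "+", "7-", "--1"]:
    print(repr(s), bool(explicit.fullmatch(s)), bool(shorthand.fullmatch(s)))
```

Both patterns agree on every sample, which is what "digit.digit* is the same as digit+" predicts.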

REs describe regular languages
Patterns form a regular language
*** any finite language is regular ***
Regular Expression (RE) (over alphabet Σ):
- ε is an RE denoting the set {ε}
- If a is in Σ, then a is an RE denoting {a}
- If x and y are REs denoting L(x) and L(y), then:
  - x | y is an RE denoting L(x) ∪ L(y)
  - xy is an RE denoting L(x)L(y)
  - x* is an RE denoting L(x)*
REs can be combined to form other REs.
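The inductive definition translates almost line for line into code that computes the language an RE denotes. Because L(x)* is infinite, the sketch below only enumerates strings up to a given length; the class names Eps, Sym, Alt, Cat, Star and the function lang are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eps:
    pass


@dataclass(frozen=True)
class Sym:
    ch: str


@dataclass(frozen=True)
class Alt:  # x | y
    x: object
    y: object


@dataclass(frozen=True)
class Cat:  # xy
    x: object
    y: object


@dataclass(frozen=True)
class Star:  # x*
    x: object


def lang(r, n):
    """Return the strings of length <= n in L(r)."""
    if isinstance(r, Eps):
        return {""}
    if isinstance(r, Sym):
        return {r.ch} if n >= 1 else set()
    if isinstance(r, Alt):  # L(x) ∪ L(y)
        return lang(r.x, n) | lang(r.y, n)
    if isinstance(r, Cat):  # L(x)L(y)
        return {u + v for u in lang(r.x, n) for v in lang(r.y, n)
                if len(u + v) <= n}
    if isinstance(r, Star):  # L(x)*
        base = lang(r.x, n) - {""}
        result = {""}
        frontier = {""}
        while frontier:  # each round appends one more non-empty piece
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")


a, b, c = Sym("a"), Sym("b"), Sym("c")
print(sorted(lang(Cat(Cat(a, b), c), 3)))
print(sorted(lang(Star(Alt(Alt(a, b), c)), 2)))
```

The second call lists the empty string, the three one-character strings, and the nine two-character strings over {a, b, c}, matching the start of the set shown for (a | b | c)* earlier.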