CS 441 Fall 2018 Notes

Compiler - software that translates a program written in a source file into a program stored in a target file, reporting errors when found.
- Source: a program written in a High Level Language.
- Target: the program translated into a Low Level Language, usually Assembly, Machine Code or Virtual Machine Code.

Interpreter - software that translates the statements of a source (usually) one statement at a time into Machine Code, then executes the translated statement before translating/executing the next.

Categories of Languages:
Machine Code
- binary representation
- simple commands
- differs by CPU model
- knowledge of the CPU/HW required to program in this language
Assembly
- text representation (easy for a programmer to read/write)
- simple commands (usually 1:1 with Machine Code)
- differs by CPU model
- knowledge of the CPU/HW required to program in this language
High Level Language
- has more-complex commands, each translating 1:many into Machine Code instructions
- same language across CPU models and Operating Systems (with some minor differences)
- in-depth knowledge of the CPU/HW not required to program in these languages

Structure of a Compiler:
Scanner - reads a source, character by character, extracting lexemes that are then represented by tokens.
Parser - obtains tokens from the Scanner and builds an Abstract Syntax Tree (AST) of the tokens, based on the rules of the language.
Constrainer - adds context-sensitive and other information to the tokens of the AST produced by the Parser, producing a Decorated Abstract Syntax Tree (DAST). This DAST represents the full meaning (Semantics) of the source program being translated.
Code Generator - given the DAST produced by the Constrainer, writes code in the Target Language with the same meaning (semantics) as the source program.
String Table - stores the exact spelling of identifiers and literals discovered by the Scanner.
Symbol Table - stores semantic information on tokens.
Lexeme - the string representation of a single word or symbol extracted from the source.
Token - a simplified (integer) representation of a Lexeme. It may instead be an object/structure containing members/fields describing a single word or symbol from the source.
Optimization - changes to the DAST and/or the generated Target Code that make the Target program more efficient.
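To make the Scanner, Lexeme, Token and String Table entries concrete, here is a minimal sketch of a character-by-character scanner. The token kinds (IDENTIFIER, NUMBER, SYMBOL), the set of one-character symbols, and the function name scan are assumptions chosen for illustration; they are not taken from the course. Each token is a pair (kind, index), where index points at the lexeme's exact spelling in the string table.

```python
from enum import IntEnum


class TokenKind(IntEnum):
    """Hypothetical integer token codes (not from the course notes)."""
    IDENTIFIER = 1
    NUMBER = 2
    SYMBOL = 3


# Assumed set of one-character symbols for this toy language.
SYMBOLS = set("+-*/=();")


def scan(source):
    """Read source character by character; return (tokens, string_table)."""
    string_table = []   # exact spelling of every distinct lexeme
    index_of = {}       # lexeme -> position in string_table
    tokens = []         # list of (TokenKind, string-table index)
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():            # whitespace only separates lexemes
            i += 1
            continue
        start = i
        if ch.isalpha() or ch == "_":
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            kind = TokenKind.IDENTIFIER
        elif ch.isdigit():
            while i < len(source) and source[i].isdigit():
                i += 1
            kind = TokenKind.NUMBER
        elif ch in SYMBOLS:
            i += 1
            kind = TokenKind.SYMBOL
        else:
            raise ValueError(f"illegal character {ch!r} at position {i}")
        lexeme = source[start:i]
        if lexeme not in index_of:  # store each spelling only once
            index_of[lexeme] = len(string_table)
            string_table.append(lexeme)
        tokens.append((kind, index_of[lexeme]))
    return tokens, string_table


tokens, table = scan("x = x + 42;")
print(table)    # ['x', '=', '+', '42', ';']
```

Note that both occurrences of x share string-table entry 0, so the tokens carry only small integers while the spelling is stored once.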
Grammar: defines the correct forms of sentences (programs) of a language.
- Lexical Grammar: defines the correct forms for Lexemes in terms of characters.
- Phrase/Structure Grammar: defines the correct formation of tokens into sentences (programs) for the language.

Mathematically defined as (specific types of Grammars may have extensions/modifications):
G = { Σ, N, P, S } where
- Σ - The Alphabet: a set of all possible terminals (individual words/characters) in a language
- N - Non-Terminals: a set of symbols that represent possible combinations of terminals/non-terminals.
- P - Productions: a set of rules where each Production specifies a string of terminals/non-terminals that can substitute for another string of terminals/non-terminals.
  Ex: aXa → bbYb means aXa may be substituted with bbYb
- S - Goal Symbol: a special, single Non-Terminal (S ∈ N) that represents all possible valid sentences (programs) in a language.

Derivation: a proof that a given sentence can be generated, starting with the Goal Symbol and applying productions until the given sentence is produced.

Example:
Σ = { a, b, c, d }    (traditionally, Terminals are lower case letters)
N = { S, X, Y, Z }    (and Non-Terminals are upper case letters in the mathematical model)
P = { S → aX
      S → bY
      X → cX
      Z → cZ
      X → dZ
      Y → bYb
      Y → dZ
      Z → a }
S = S

Example: Derive: accda
[Figure: derivation tree for accda]
Current Sent.    Production Used
S                Start
aX               S → aX
acX              X → cX
accX             X → cX
accdZ            X → dZ
accda            Z → a

Terminology:
Deterministic: when there is more than one production that can be used for a substitution, exactly one can be chosen by examining the sentence being derived for matching tokens. This usually implies no empty productions (N → ε).
Non-Deterministic: when a grammar is NOT Deterministic by the above definition.
Accepts: a grammar accepts a sentence when a derivation can be found for the sentence.
Rejects: a grammar rejects a sentence when a derivation can NOT be found for the sentence.
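The idea of Accepts/Rejects can be sketched as a search for a derivation. The code below is only an illustration, not the course's algorithm. It assumes the example productions read as S → aX, S → bY, X → cX, X → dZ, Z → cZ, Z → a, Y → bYb, Y → dZ, with upper case letters as Non-Terminals. The function name derive is hypothetical. It always expands the leftmost Non-Terminal and abandons a branch as soon as the terminals produced so far stop matching the target sentence.

```python
# Example grammar, keyed by Non-Terminal (assumed reading of the notes).
PRODUCTIONS = {
    "S": ["aX", "bY"],
    "X": ["cX", "dZ"],
    "Z": ["cZ", "a"],
    "Y": ["bYb", "dZ"],
}


def derive(target, goal="S"):
    """Return the list of sentential forms from goal to target, or None."""

    def search(form):
        # Locate the leftmost Non-Terminal (upper case letter).
        for i, symbol in enumerate(form):
            if symbol.isupper():
                break
        else:
            # Only terminals remain: the derivation is finished.
            return [form] if form == target else None
        # No production here shrinks a form, so a form that is already too
        # long, or whose terminal prefix differs, can never reach target.
        if len(form) > len(target) or form[:i] != target[:i]:
            return None
        for rhs in PRODUCTIONS[symbol]:
            rest = search(form[:i] + rhs + form[i + 1:])
            if rest is not None:
                return [form] + rest
        return None

    return search(goal)


print(derive("accda"))  # ['S', 'aX', 'acX', 'accX', 'accdZ', 'accda']
print(derive("abc"))    # None, so the grammar rejects abc
```

The pruning step relies on this particular grammar having no empty productions; a grammar with N → ε would need a different stopping rule.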
Chomsky Hierarchy: a method for classifying Languages/Grammars by their complexity, indicating the type of automata (machine) that can recognize each level.

Class   Grammar             Recognizer
0       Unrestricted        Turing Machine
1       Context Sensitive   Linear Bounded Automata
2       Context Free        Push Down Automata
3       Regular             Finite State Automata

Classifying Grammars:
- here "symbols" means terminals and/or non-terminals
- LHS means Left Hand Side of the arrow of a Production
- RHS means Right Hand Side of the arrow of a Production
- |LHS| means the count of the number of symbols on the LHS
- For the example productions: Σ = {a,b,c,d,e}, N = {S,W,X,Y,Z}, S is the Goal/Start Symbol

Name / Restrictions:
Unrestricted
- Productions may have any number of symbols on the LHS and RHS
- Ex: abc → de
Context Sensitive
- At least 1 Non-Terminal on the LHS
- |LHS| <= |RHS|, implying no N → ε allowed (except maybe S → ε)
- Ex: aWXb → YcdeZ
Context Free
- LHS is 1 Non-Terminal, nothing else
- RHS can be anything
- (N → ε always allowed, but makes it Non-Deterministic)
- Ex: X → aYZb
Regular
- LHS is 1 Non-Terminal, nothing else
- RHS is 1 terminal, optionally followed by 1 Non-Terminal
- implying no N → ε allowed (except maybe S → ε)
- Ex: X → aY (X → Ya is invalid)

Automata: in general, consists of:
- a tape on which symbols may be written or read
- a read/write head that accesses one position on the tape at any given time
- Possible actions: Read, Write, Advance 1, Rewind 1
- A controller: consists of an implementation of a Grammar that controls the actions of the machine. The machine has a current state, which is modified with the application of a Grammar Production.
- Added options and restrictions define the particular type of Automata
- Purpose: To Accept or Reject a sentence (string of terminals) as valid or invalid for the language.
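The restrictions listed under Classifying Grammars can be checked mechanically for a single production. The sketch below is an illustration only: the function name classify is hypothetical, upper case letters are assumed to be Non-Terminals and lower case letters terminals, and an empty string stands for ε. It returns the most restrictive class a production fits and deliberately ignores the "except maybe S → ε" special cases.

```python
def classify(lhs, rhs):
    """Return the most restrictive Chomsky class that one production fits."""
    single_non_terminal = len(lhs) == 1 and lhs.isupper()

    # Regular: 1 Non-Terminal on the LHS; RHS is 1 terminal,
    # optionally followed by 1 Non-Terminal.
    if (single_non_terminal
            and len(rhs) in (1, 2)
            and rhs[0].islower()
            and (len(rhs) == 1 or rhs[1].isupper())):
        return "Regular"

    # Context Free: 1 Non-Terminal on the LHS; RHS can be anything.
    if single_non_terminal:
        return "Context Free"

    # Context Sensitive: at least 1 Non-Terminal on the LHS, |LHS| <= |RHS|.
    if any(symbol.isupper() for symbol in lhs) and len(lhs) <= len(rhs):
        return "Context Sensitive"

    # Anything else is only an Unrestricted production.
    return "Unrestricted"


for lhs, rhs in [("abc", "de"), ("aWXb", "YcdeZ"), ("X", "aYZb"),
                 ("X", "aY"), ("X", "Ya")]:
    print(lhs, "->", rhs, ":", classify(lhs, rhs))
```

A whole grammar is then no more restricted than its least restricted production, so one Context Free rule in an otherwise Regular grammar makes the grammar Context Free.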
Turing Machine:
- Has an infinite tape
- Read/Write/Advance/Rewind operations
- Uses an Unrestricted Grammar
- Not commonly used, as they are inefficient for computer languages and, having an infinite tape, may get into an infinite loop.

Linear Bounded Automata:
- Has a finite tape
- Read/Write/Advance/Rewind operations
- Uses a Context Sensitive Grammar

Push Down Automata:
- Finite tape
- Read, Advance, Rewind (but no Write)
- Uses a Stack of symbols, starting with Push(S)
- Uses a Context Free Grammar

Finite State Automata:
- Finite tape
- Read only (and can read the tape only once)
- Limited Memory (usually just Current State and Current Terminal, no stack)
- Uses a Regular Grammar

FSA Representations:
- Formal (Mathematical): { Σ, Q, δ, q0, F } where
    Σ  - set of Terminals (called the Alphabet)
    Q  - set of States (Non-Terminals)
    δ  - set of Productions, where each represents a transition from one state to the next based on the current input Terminal. The formal form is: δ(currentstate, currentterminal) = nextstate
    q0 - start state, q0 ∈ Q
    F  - set of Halt States, F ⊆ Q
- Graphical:
    o Circles: States (double circle means Halt State)
    o Arrows between states, labeled with terminals, represent productions
- Table:
- Regular Expression Meta-Symbols:
    ab         - means a followed immediately by b
    c*         - means 0 or more c's
    c+         - means 1 or more c's
    c | b      - means c or b
    (a | b)*c  - parentheses can be used for grouping. Means a or b repeated 0 or more times, followed by a single c.
- Algorithm: using two variables, with States, Terminals and Productions somehow encoded:
    state = startstate
    c = readnextchar()
    while state != error and not end-of-file {
        state = get next state depending on state and c
        c = readnextchar()
    }
    if end-of-file and state is a halt state
        ACCEPT
    else
        REJECT

Also: See Hash Table notes posted on the course website
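One possible encoding of the FSA algorithm is a transition table keyed by (state, terminal). The sketch below is an assumption-laden illustration for the regular expression (a | b)*c, meaning a or b repeated 0 or more times followed by a single c. The state numbers, the ERROR value and the function name accepts are made up for this example. A Python loop over the characters stands in for readnextchar() and the end-of-file test.

```python
ERROR = -1          # assumed marker for "no valid transition"
START = 0           # q0, the start state
HALT_STATES = {1}   # F, the set of Halt States

# Transition function for (a | b)*c: (current state, terminal) -> next state.
TRANSITIONS = {
    (0, "a"): 0,    # a or b keeps the machine in the start state
    (0, "b"): 0,
    (0, "c"): 1,    # a single c moves to the Halt State
}


def accepts(sentence):
    """Return True if the FSA ends in a Halt State after reading sentence."""
    state = START
    for c in sentence:                      # the tape is read once, left to right
        state = TRANSITIONS.get((state, c), ERROR)
        if state == ERROR:                  # no production applies: REJECT
            return False
    return state in HALT_STATES             # end of input: ACCEPT only if halted


print(accepts("abbac"))  # True
print(accepts("ab"))     # False, the input ended outside a Halt State
print(accepts("acc"))    # False, no transition leaves state 1
```

Only the current state and the current terminal are kept, which matches the limited memory of a Finite State Automaton: no stack and no rewinding of the tape.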