Program Analysis (Software Source Code Analysis Techniques) ZHENG LI (李征) lizheng@mail.buct.edu.cn
Lexical and Syntax Analysis
Topics Covered Today: Compilation, Lexical Analysis, Semantic Analysis
Compilation
Translating from a high-level language to machine code is organized into several phases, or passes. In the early days passes communicated through files, but this is no longer necessary.
Language Specification
We must first describe the language in question by giving its specification.
- Syntax: defines symbols (vocabulary) and programs (sentences).
- Semantics: gives meaning to sentences.
The formal specifications are often the input to tools that build translators automatically.
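Compiler passes
[Figure: source program -> lexical scanner -> parser -> semantic analyzer -> translator -> optimizer -> final assembly -> target program. The lexical scanner, parser, and semantic analyzer are labeled "front end"; the remaining passes are labeled "back end". The symbol table manager and the error handler sit beside the passes.]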
Symbol Table Management
The symbol table is a data structure used by all phases of the compiler to keep track of user-defined symbols and keywords.
- During early phases (lexical and syntax analysis), symbols are discovered and put into the symbol table.
- During later phases, symbols are looked up to validate their usage.
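The insert-then-look-up usage described above can be sketched as follows. This is a minimal illustration, not how any particular compiler does it: the class name `SymbolTable`, the entry layout (a dict with a `kind` field), and the pre-loaded keyword set are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SymbolTable:
    """Hypothetical symbol table: maps each name to one entry."""
    entries: dict = field(default_factory=dict)

    def __post_init__(self):
        # Keywords are pre-loaded so that they are never mistaken for identifiers.
        for keyword in ("if", "then", "else"):
            self.entries[keyword] = {"kind": "keyword"}

    def insert(self, name):
        # Early phases: create the entry on first sight, otherwise reuse it.
        return self.entries.setdefault(name, {"kind": "id"})

    def lookup(self, name):
        # Later phases: None signals a symbol that was never inserted.
        return self.entries.get(name)


table = SymbolTable()
table.insert("oldsum")
print(table.lookup("oldsum"))
print(table.lookup("if"))
print(table.lookup("missing"))
```

Returning the existing entry from `insert` means every occurrence of a name shares one entry, which is what lets later phases attach type or scope information in a single place.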
Symbol Tables
Regular Expression | Token | Attribute-Value
ws                 | -     | -
if                 | if    | -
then               | then  | -
else               | else  | -
id                 | id    | pointer to table entry
num                | num   | pointer to table entry
<                  | relop | LT
<=                 | relop | LE
=                  | relop | EQ
<>                 | relop | NE
>                  | relop | GT
>=                 | relop | GE
Note: Each token has a unique token identifier that defines its category of lexemes.
Error Management
- Errors can occur at all phases in the compiler: invalid input characters, syntax errors, semantic errors, etc.
- Good compilers will attempt to recover from errors and continue.
Lexical analyzer
- Also called a scanner or tokenizer.
- Converts a stream of characters into a stream of tokens.
- Tokens are:
  - Keywords such as for, while, and class.
  - Special characters such as +, -, (, and <.
  - Variable name occurrences.
  - Constant occurrences such as 1, 0, true.
Lexical analyzer
- The lexical analyzer is usually a subroutine of the parser.
- Each token is a single entity.
- A numerical code is usually assigned to each type of token.
Lexical analyzer
Lexical analyzers perform:
- Line reconstruction: delete comments, delete white space, perform text substitution.
- Lexical translation: translation of lexemes -> tokens.
Often additional information is associated with a token.
Token Definitions
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Shorthand Notation:
- + : one or more. r* = r+ | ε and r+ = r r*
- ? : zero or one. r? = r | ε
- [range] : set range of characters (replaces "|"). [A-Z] = A | B | C | … | Z
With the shorthand: id → [A-Za-z][A-Za-z0-9]*
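As a quick check, the two ways of writing the `id` definition can be compared with Python's `re` module. This is only a sketch: the names `LETTER`, `DIGIT`, `ID_LONG`, `ID_SHORT`, and `is_id` are made up for the example, and the test words are arbitrary.

```python
import re

LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"

# id -> letter ( letter | digit )*
ID_LONG = re.compile(rf"{LETTER}({LETTER}|{DIGIT})*")
# id -> [A-Za-z][A-Za-z0-9]*  (the character-class shorthand)
ID_SHORT = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_id(pattern, text):
    # fullmatch requires the whole string to be an identifier.
    return pattern.fullmatch(text) is not None


for word in ["oldsum", "x1", "1x", ""]:
    # Both definitions should agree on every input.
    assert is_id(ID_LONG, word) == is_id(ID_SHORT, word)
    print(repr(word), is_id(ID_SHORT, word))
```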
Example of extracting lexemes and producing the corresponding tokens
sum = oldsum - value / 100;
Token       | Lexeme
IDENT       | sum
ASSIGN_OP   | =
IDENT       | oldsum
SUBTRACT_OP | -
IDENT       | value
DIVISION_OP | /
INT_LIT     | 100
SEMICOLON   | ;
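A scanner that produces this token stream could look roughly like the sketch below. It reuses the token names from the example; the regular expressions, the `WS` category, and the `tokenize` function are assumptions for illustration, and the input is written with the subtraction operator that the token list implies.

```python
import re

# (token name, regular expression) pairs; names follow the slide's example.
TOKEN_SPEC = [
    ("INT_LIT", r"[0-9]+"),
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN_OP", r"="),
    ("SUBTRACT_OP", r"-"),
    ("DIVISION_OP", r"/"),
    ("SEMICOLON", r";"),
    ("WS", r"\s+"),  # assumed category for white space, dropped below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Convert a character stream into a list of (token, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # Lexical error: no token definition matches at this position.
            raise SyntaxError(f"invalid character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token, lexeme in tokenize("sum = oldsum - value / 100;"):
    print(token, lexeme)
```

Each token name doubles as the name of a regex group, so `match.lastgroup` reports which definition matched. A production scanner would more likely use a generated automaton than a chain of alternatives.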
Parser
- Performs syntax analysis.
- Imposes syntactic structure on a sentence.
- Parse trees are used to expose the structure. These trees are often not built explicitly; simpler representations of them are often used.
- A parser accepts a string of tokens and builds a parse tree representing the program.
Parser
The collection of all the programs in a given language is usually specified using a list of rules known as a context-free grammar.
Parser
A grammar has four components:
- A set of tokens, known as terminal symbols.
- A set of variables, or non-terminals.
- A set of productions, where each production consists of a non-terminal, an arrow, and a sequence of tokens and/or non-terminals.
- A designation of one of the non-terminals as the start symbol.
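The four components can be written down directly as data. The grammar below is an invented toy grammar for assignment statements (it is not taken from the slides), and `parse` is a naive recursive-descent sketch that tries each production in order and keeps the first that fits; real parsers are usually table-driven or generated.

```python
# 1. Terminal symbols (token names).
TERMINALS = {"IDENT", "INT_LIT", "ASSIGN_OP", "SUBTRACT_OP", "DIVISION_OP", "SEMICOLON"}
# 2. Non-terminals.
NONTERMINALS = {"stmt", "expr", "term", "factor"}
# 3. Productions: non-terminal -> list of alternative right-hand sides.
#    Right-recursive to keep the parser simple, so operators group to the right.
PRODUCTIONS = {
    "stmt": [["IDENT", "ASSIGN_OP", "expr", "SEMICOLON"]],
    "expr": [["term", "SUBTRACT_OP", "expr"], ["term"]],
    "term": [["factor", "DIVISION_OP", "term"], ["factor"]],
    "factor": [["IDENT"], ["INT_LIT"]],
}
# 4. Start symbol.
START = "stmt"


def parse(symbol, tokens, pos):
    """Return (parse_tree, next_pos), or None if symbol cannot be derived at pos."""
    if symbol in TERMINALS:
        if pos < len(tokens) and tokens[pos] == symbol:
            return symbol, pos + 1
        return None
    for body in PRODUCTIONS[symbol]:
        children, cur = [], pos
        for part in body:
            result = parse(part, tokens, cur)
            if result is None:
                break  # this alternative fails; try the next one
            child, cur = result
            children.append(child)
        else:
            return (symbol, children), cur
    return None


def accepts(tokens):
    # A sentence is accepted only if the start symbol consumes every token.
    result = parse(START, tokens, 0)
    return result is not None and result[1] == len(tokens)


sentence = ["IDENT", "ASSIGN_OP", "IDENT", "SUBTRACT_OP",
            "IDENT", "DIVISION_OP", "INT_LIT", "SEMICOLON"]
print(accepts(sentence))
print(accepts(["IDENT", "ASSIGN_OP", "SEMICOLON"]))
```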
Abstract Syntax Tree
The parse tree is used to recognize the components of the program and to check that the syntax is correct. As the parser applies productions, it usually generates the components of a simpler tree, known as the Abstract Syntax Tree (AST). The meaning of a component is derived from the way the statement is organized in a subtree.
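To make "meaning is derived from the subtree" concrete, here is a hand-built AST for the statement `sum = oldsum - value / 100;` together with a small evaluator. The node classes, the `evaluate` function, the use of integer division, and the variable values are all assumptions chosen for the example.

```python
import operator
from dataclasses import dataclass


@dataclass
class Num:
    value: int


@dataclass
class Name:
    ident: str


@dataclass
class BinOp:
    op: str
    left: object
    right: object


@dataclass
class Assign:
    target: str
    value: object


# Integer division is an assumption; the slides do not define "/".
OPS = {"-": operator.sub, "/": operator.floordiv}


def evaluate(node, env):
    """Compute the meaning of a node from the meanings of its children."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Name):
        return env[node.ident]
    if isinstance(node, BinOp):
        return OPS[node.op](evaluate(node.left, env), evaluate(node.right, env))
    if isinstance(node, Assign):
        env[node.target] = evaluate(node.value, env)
        return env[node.target]
    raise TypeError(f"unknown node type: {type(node).__name__}")


# The division sits deeper in the tree, so it is evaluated before the subtraction.
tree = Assign("sum", BinOp("-", Name("oldsum"), BinOp("/", Name("value"), Num(100))))
env = {"oldsum": 50, "value": 300}
evaluate(tree, env)
print(env["sum"])  # 47
```

Note that the AST carries no `ASSIGN_OP` or `SEMICOLON` nodes: punctuation that only guided the parser is dropped once the structure is known.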
Comparison with Lexical Analysis
Phase  | Input                  | Output
Lexer  | Sequence of characters | Sequence of tokens
Parser | Sequence of tokens     | Parse tree
Semantic Analyzer
- The semantic analyzer completes the symbol table with information on the characteristics of each identifier.
- The symbol table is usually initialized during parsing.
- One entry is created for each identifier and constant.
- Scope is taken into account: two different variables with the same name will have different entries in the symbol table.
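One common way to give same-named variables separate entries is a stack of per-scope tables. The sketch below assumes that design; the class `ScopedSymbolTable`, its method names, and the `type`/`depth` fields are hypothetical.

```python
class ScopedSymbolTable:
    """Hypothetical symbol table with one dict per open scope."""

    def __init__(self):
        self.scopes = [{}]  # index 0 is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_name):
        scope = self.scopes[-1]
        if name in scope:
            # Semantic error: redeclaration within the same scope.
            raise NameError(f"{name!r} is already declared in this scope")
        scope[name] = {"type": type_name, "depth": len(self.scopes) - 1}
        return scope[name]

    def lookup(self, name):
        # Search from the innermost scope outwards.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None


table = ScopedSymbolTable()
outer_x = table.declare("x", "int")
table.enter_scope()
inner_x = table.declare("x", "float")  # same name, separate entry
print(table.lookup("x"))               # the inner entry
table.exit_scope()
print(table.lookup("x"))               # the outer entry again
```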
Translator
The lexical scanner, parser, and semantic analyzer are collectively known as the front end of the compiler. The second part, the back end, starts by generating low-level code from the (possibly optimized) AST.
Translator
Rather than generate code for a specific architecture, most compilers generate an intermediate language. Three-address code is popular:
- It is really a flattened tree representation.
- It is simple.
- It is flexible (it captures the essence of many target architectures).
- It can be interpreted.
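The "flattened tree" idea can be sketched by walking an expression tree and giving each interior node its own temporary. The tuple encoding of the tree, the `t1`, `t2` temporary names, and the `t := a op b` notation are assumptions for illustration; actual intermediate languages differ in detail.

```python
import itertools


def to_three_address(expr):
    """Flatten a nested (op, left, right) tuple into three-address instructions."""
    code = []
    temps = (f"t{i}" for i in itertools.count(1))

    def emit(node):
        if not isinstance(node, tuple):
            return str(node)  # a leaf: variable name or constant
        op, left, right = node
        left_addr = emit(left)
        right_addr = emit(right)
        temp = next(temps)
        # Each instruction names at most three addresses: result and two operands.
        code.append(f"{temp} := {left_addr} {op} {right_addr}")
        return temp

    result = emit(expr)
    return code, result


# oldsum - value / 100
code, result = to_three_address(("-", "oldsum", ("/", "value", 100)))
for line in code:
    print(line)
# t1 := value / 100
# t2 := oldsum - t1
```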
Optimizers
Intermediate code is examined and improved.
- Can be simple: changing a:=a+1 to increment a, or changing 3*5 to 15.
- Can be complicated: reorganizing data and data accesses for cache efficiency.
Optimization can improve running time by orders of magnitude, often also decreasing program size.
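The simple end of that spectrum, turning 3*5 into 15, is known as constant folding. A minimal sketch over a tuple-encoded expression tree follows; the tree encoding, the `fold` function, and the extra "multiply by 1" rule are assumptions added for the example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Return an equivalent expression with constant subexpressions precomputed."""
    if not isinstance(node, tuple):
        return node  # a leaf: variable name or constant
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)  # both operands known at compile time
    if op == "*" and left == 1:
        return right                 # algebraic identity: 1 * x == x
    if op == "*" and right == 1:
        return left                  # algebraic identity: x * 1 == x
    return (op, left, right)


print(fold(("*", 3, 5)))              # 15
print(fold(("+", "a", ("*", 3, 5))))  # ('+', 'a', 15)
print(fold(("*", 1, "d")))            # d
```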
Code Generation
Generation of real executable code for a particular target machine. It is completed by the Final Assembly phase.
Final output can be either:
- assembly language for the target machine, or
- object code ready for linking.
The target machine can be a virtual machine (such as the Java Virtual Machine, JVM), in which case the executable code is virtual code (such as Java bytecode).
Compiler Overview
Source Program:
  IF (a<b) THEN c=1*d;
Lexical Analyzer -> Token Sequence:
  IF ( ID a < ID b ) THEN ID c = CONST 1 * ID d
Syntax Analyzer -> Syntax Tree:
  IF_stmt
    cond_expr: <
      a
      b
    list
      assign_stmt
        lhs: c
        rhs: *
          1
          d
Semantic Analyzer -> 3-Address Code:
  GE a, b, L1
  MULT 1, d, c
  L1:
Code Optimizer -> Optimized 3-Addr. Code:
  GE a, b, L1
  MOV d, c
  L1:
Code Generation -> Assembly Code:
  loadi R1,a
  cmpi R1,b
  jge L1
  loadi R1,d
  storei R1,c
  L1:
Exercise: Abstract Syntax Tree
x := a + b;
y := a * b;
while (y > a) {
  a := a + 1;
  x := a + b;
}
Email: lizheng@mail.buct.edu.cn Web: http://cist.buct.edu.cn/staff/zheng/ Office: 科 510