Processadors de Llenguatge II Rafael Ramirez rafael.ramirez@upf.edu 55.316
Topics: Syntax; Semantics; Formal verification; Program compilation; Logic Programming; Functional Programming; Concurrent Programming. If time permits: Imperative Programming; Object-Oriented Programming.
Assessment: Theory exam: 60%. Labs/seminars: 40%, where Labs = (lab work + lab exam)/2. Participation. *Both the exam (theory and labs) and the lab work must be passed.
Books: Programming Languages: Design and Implementation, Terrence Pratt and Marvin Zelkowitz, Prentice Hall (2001). Concepts of Programming Languages, Robert Sebesta, Addison Wesley (2002). Modern Compiler Implementation in ML, Andrew W. Appel, Cambridge University Press (1997).
Contact: In class, in the labs, or by email (rafael.ramirez@upf.edu). Office hours: Tuesdays 12-2:30pm. Announcements and documentation: http://www.dtic.upf.edu/~rramirez/pl2/
Labs/seminars: Seminars (3): groups of 2-4. Labs (6): groups of 2.
Processadors de Llenguatge II Introduction Pratt 1.2, 1.3.1, 1.3.2
Programming Languages Main issues in language design, implementation and usage: Syntax -- what a program looks like. Semantics -- what a program means (programming paradigm). Implementation -- how a program executes. Verification -- is the program 100% correct?
Programming Languages A programming paradigm is a fundamental style of computer programming for solving specific software engineering problems. Paradigms differ in the concepts and abstractions used to represent the elements of a program (such as objects, functions, variables, constraints, ...) and in the steps that compose a computation (assignment, evaluation, ...). A programming language can support multiple paradigms: for example, programs written in C++ can be purely procedural, purely object-oriented, or contain elements of both paradigms. E.g., in object-oriented programming, programmers can think of a program as a collection of interacting objects, while in functional programming a program can be thought of as a sequence of stateless function evaluations.
Programming Languages Which language for a specific problem? Pascal, C -- procedural, statement oriented C++, Java, Smalltalk -- Object oriented ML, Lisp -- Functional Prolog -- Logic-based
Language expressiveness Is there a program that can be written in one language for which there is really no equivalent program in some of the other languages? E.g., is there a program that can be written in Prolog for which there is no equivalent in Pascal?
Language expressiveness Church's thesis: any computable function can be computed by a Turing machine. So why do we need all these different languages?
Language Goals 1950s-1960s: Compile programs to execute efficiently. Programmers cheap; machines expensive. Direct connection between language features and hardware -- goto statements, ... Today: CPU power and memory very cheap; programmers expensive. Design languages so that programs can be built efficiently. Direct connection between language features and design concepts -- encapsulation, inheritance, ...
Attributes of a good language Clarity, simplicity -- a language should provide a clear, simple and unified set of concepts to be used as primitives to develop algorithms. Orthogonality -- every combination of features is meaningful; easier to remember if there are few exceptions. Naturalness for the application -- program structure reflects the logical structure of the algorithm. Support for abstraction -- program data reflects the problem being solved.
Attributes of a good language (cont) Ease of program verification -- verifying that a program correctly performs its required function. Programming environment -- external support for the language. Portability of programs -- transportability of the resulting programs from the computer on which they are developed to other computer systems. Cost of use -- program execution, program translation, program creation, and program maintenance.
Language paradigms Imperative programming Basic concept is machine state. A program consists of a sequence of statements: Stat1; Stat2; ... Executing each statement changes the state, i.e. changes the value of one or more locations in memory. Functional programming The idea is to find a function that produces the answer, i.e. what is the function that we must apply to the initial state to get the answer. Function composition is the major operation: f1(f2(f3(x))). Programming consists of building the function that computes the answer.
Language paradigms (cont) Logic programming Based on predicate logic. State what the problem is, as opposed to how to solve the problem. Execution is deduction: father(X,Y), father(Y,Z) → grandfather(X,Z). Object-oriented programming Data objects are built; a limited set of functions operates on these data. Complex objects are designed as extensions of simpler objects (inheritance). Syntax: a set of objects (classes) containing data and methods.
Language paradigms (cont) Concurrent programming Program = set of processes. Each process executes sequentially; processes execute concurrently. Processes communicate and synchronize their execution. Involves safety and progress properties.
Processadors de Llenguatge II Language Syntax Pratt 3.1, 3.3.1
Program structure Language = Syntax + Semantics The syntax of a language is concerned with the form of a program: how expressions, commands, declarations etc. are put together to result in the final program. The semantics of a language is concerned with the meaning of a program: how the programs behave when executed on computers.
Program structure Just like in natural language: Semantics (meaning): what you have in your mind, to express a conditional: when true do something, when false do something else. Syntax (form/structure): C: if (condition) { part1 } else { part2 } Pascal: if condition then begin part1 end else begin part2 end LISP: (if (condition) (part1) (part2))
Program structure Syntax What a program looks like. Useful notations for describing a PL's syntax: BNF grammars (context-free grammars), regular grammars. Semantics What the program means. Static semantics -- semantics determined at compile time: var A: integer; (type and storage for A). Dynamic semantics -- semantics determined during execution: X = "ABC" (X a string; value of X). Useful notations: informal: English documents (e.g. reference manuals, etc.); formal: operational semantics, axiomatic semantics, etc.
BNF grammars Definition (some preliminaries): 1. An alphabet is a set whose elements are called symbols. 2. A word (or string) w over an alphabet A is a finite sequence of symbols from the alphabet, i.e. w ∈ Aⁿ where n ∈ ℕ. When n = 0, the word is called the empty word/string and is denoted ε. 3. The set of words of length n over an alphabet A is denoted Aⁿ. 4. The set of all words over an alphabet A is denoted A*, where A* = ∪ { Aⁿ : n ∈ ℕ }.
BNF grammars Definition: 1. A context-free grammar G is a quadruple (V, Σ, R, S) where: a. V is an alphabet. b. Σ ⊆ V is the set of terminals (V − Σ is known as the set of non-terminals). c. R is a set of production rules such that R ⊆ (V − Σ) × V*. d. S ∈ (V − Σ) is the start symbol. 2. The language generated by grammar G (denoted L(G)) is the set of words over Σ which can be produced by R.
BNF grammars Nonterminal: A finite set of symbols: <sentence> <subject> <predicate> <verb> <article> <noun> Terminal: A finite set of symbols: the, boy, girl, ran, ate, cake Start symbol: One of the nonterminals: <sentence>
Backus-Naur form (BNF) grammars Rules (productions): a finite set of replacement rules:
<sentence> ::= <subject> <predicate>
<subject> ::= <article> <noun>
<predicate> ::= <verb> <article> <noun>
<verb> ::= ran | ate
<article> ::= the
<noun> ::= boy | girl | cake
Replacement operator: replace any nonterminal by a right-hand-side value using any rule (written ⇒)
Example BNF sentences <sentence> ⇒ <subject> <predicate> (first rule) ⇒ <article> <noun> <predicate> (second rule) ⇒ the <noun> <predicate> (fifth rule) ⇒ ... ⇒ the boy ate the cake. Also from <sentence> you can derive: the cake ate the boy -- syntax does not imply correct semantics. Note: the BNF rule <A> ::= <B><C> is also written with the equivalent (CFG) syntax A → BC.
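The replacement process above is purely mechanical, so it can be sketched in a few lines of Python. The dictionary encoding of the rules and the `derive` function below are my illustration, not part of the slides:

```python
import random

# The toy grammar from the slides: each nonterminal maps to a list of
# right-hand sides; anything not in the dictionary is a terminal.
RULES = {
    "<sentence>":  [["<subject>", "<predicate>"]],
    "<subject>":   [["<article>", "<noun>"]],
    "<predicate>": [["<verb>", "<article>", "<noun>"]],
    "<verb>":      [["ran"], ["ate"]],
    "<article>":   [["the"]],
    "<noun>":      [["boy"], ["girl"], ["cake"]],
}

def derive(symbol):
    """Expand a symbol: a nonterminal is replaced using one of its
    rules; a terminal is returned as-is."""
    if symbol not in RULES:                 # terminal symbol
        return [symbol]
    rhs = random.choice(RULES[symbol])      # pick any applicable rule
    words = []
    for s in rhs:
        words.extend(derive(s))
    return words

print(" ".join(derive("<sentence>")))  # e.g. "the boy ate the cake"
```

Because replacement ignores meaning, "the cake ate the boy" comes out just as readily as "the boy ate the cake": the grammar only constrains form.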
Languages Any string derived from the start symbol is a sentential form. A language generated by grammar G (written L (G) ) is the set of all strings over the terminal alphabet (i.e., sentences) derived from start symbol. That is, a language is the set of sentential forms containing only terminal symbols.
Derivations A derivation is a sequence of sentential forms starting from the start symbol. Grammar: B → 0B | 1B | 0 | 1. Derivation: B ⇒ 0B ⇒ 01B ⇒ 010. Each derivation step is the application of a production rule.
Derivations Derivations may not be unique. Grammar: S → SS | (S) | (). S ⇒ SS ⇒ (S)S ⇒ (())S ⇒ (())() and S ⇒ SS ⇒ S() ⇒ (S)() ⇒ (())() are different derivations, but they give the same parse tree.
Parse tree A parse tree is a hierarchical syntactic structure: internal nodes denote non-terminals, leaf nodes denote terminals. Grammar: B → 0B | 1B | 0 | 1. Derivation: B ⇒ 0B ⇒ 01B ⇒ 010. From the derivation we obtain the corresponding parse tree.
[Slide diagram: DERIVATION takes a grammar (V, Σ, R, S) and produces a tree; PARSING takes a string and produces a tree.]
Ambiguity But from some grammars you can get 2 different parse trees for the same string ()()(), each corresponding to a distinct derivation, e.g. S ⇒ SS ⇒ SSS ⇒ ()SS ⇒ ()()S ⇒ ()()(). A grammar is ambiguous if some sentence has 2 distinct parse trees. We desire unambiguous grammars so that the semantics is well defined.
Why Ambiguity is a problem The most common cause is infix binary operations. Grammar: <expr> ::= <num> | <expr> - <expr>. String: 1 - 2 - 3. Parse: two parse trees are possible, one grouping the string as (1 - 2) - 3 and one as 1 - (2 - 3). Which one?
Why Ambiguity is a problem It might happen that (1-2)-3 = 1-(2-3), in which case the different parse trees wouldn't matter. But such cases are not true in general, and we must take the more conservative position: different parse trees have different meanings. Grammar: <expr> ::= <num> | <expr> - <expr>. String: 1 - 2 - 3. The two parse trees correspond to (1-2)-3 and 1-(2-3). Different parse trees, different meaning!
Why Ambiguity is a problem <expr> ::= <num> | <expr> - <expr> is ambiguous. To make it unambiguous, I decide what the - should be: Left associative: <expr> ::= <num> | <expr> - <num>. Right associative: <expr> ::= <num> | <num> - <expr>.
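The two associativity choices really do compute different values for 1 - 2 - 3. A small Python sketch (my own illustration, assuming the operands are already available as a list of numbers):

```python
from functools import reduce

nums = [1, 2, 3]   # the operands of "1 - 2 - 3"

# Left-associative grammar <expr> ::= <num> | <expr> - <num>
# groups the string as (1 - 2) - 3:
left = reduce(lambda a, b: a - b, nums)          # -4

# Right-associative grammar <expr> ::= <num> | <num> - <expr>
# groups the string as 1 - (2 - 3):
def right_assoc(ns):
    return ns[0] if len(ns) == 1 else ns[0] - right_assoc(ns[1:])

right = right_assoc(nums)                        # 2

print(left, right)  # -4 2
```

The recursion in `right_assoc` mirrors the grammar rule directly: the first <num> minus the value of the remaining <expr>.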
Why Ambiguity is a problem Several programming languages allow 2 forms of conditional statements: 1. if boolean_exp then statement1 2. if boolean_exp then statement1 else statement2. Then in if boolean_exp1 then if boolean_exp2 then S1 else S2 it is ambiguous which if the else belongs to (the dangling-else problem).
Exercise 1 Is the following grammar (of arithmetic expressions) an ambiguous grammar? Try to construct a string with two different parse trees.
<E> ::= <E> + <E>
<E> ::= <E> * <E>
<E> ::= ( <E> )
<E> ::= I
Exercise 1 - Answer Yes: with <E> ::= <E> + <E> | <E> * <E> | I, the string 2 + 3 * 4 has two parse trees, one grouping it as (2 + 3) * 4 and the other as 2 + (3 * 4).
To make it unambiguous (Case 1): + is given higher precedence than *:
<expr> ::= <expr2> | <expr2> * <expr>
<expr2> ::= <num> | <num> + <expr2>
Now 1+2*3 can only be parsed as (1+2)*3.
Design of Rules From semantics (what you want it to mean), e.g. I want to express the idea of addition. To abstract syntax (what components you need to capture the meaning), e.g. I need an operator and two operands. To concrete syntax: <expr> ::= <num> | <expr> + <expr>. To unambiguous concrete syntax: <expr> ::= <num> | <expr> + <num>. Unambiguous concrete syntax is essential for a parser (a parser takes a string and returns a syntax tree).
Implementation of Rules Divide the BNF rules into 2 groups: lexical rules and syntactic rules. Lexical rules deal with grouping individual symbols into meaningful words. Syntactic rules deal with the structure of the overall sentence.
Example
<E> ::= <E> + <T> | <E> - <T> | <T>
<T> ::= <T> * <F> | <T> / <F> | <F>
<F> ::= <name> | <num> | (<E>)
<name> ::= <alpha> | <alpha><alphanum>
<alphanum> ::= <alpha><alphanum> | <digit><alphanum> | <alpha> | <digit>
<num> ::= <digit> | <digit><num>
<alpha> ::= a | b | c | ... | A | B | C | ... | Z
<digit> ::= 0 | 1 | 2 | ... | 9
Example (same grammar) Syntactic rules:
<E> ::= <E> + <T> | <E> - <T> | <T>
<T> ::= <T> * <F> | <T> / <F> | <F>
<F> ::= <name> | <num> | (<E>)
Lexical rules:
<name> ::= <alpha> | <alpha><alphanum>
<alphanum> ::= <alpha><alphanum> | <digit><alphanum> | <alpha> | <digit>
<num> ::= <digit> | <digit><num>
<alpha> ::= a | b | c | ... | A | B | C | ... | Z
<digit> ::= 0 | 1 | 2 | ... | 9
Implementation of Rules Lexical rules get implemented as a lexical analyzer: input: string of characters; output: list of tokens. Syntactic rules get implemented as a syntactic analyzer: input: list of tokens; output: parse tree. Parser = lexical analyzer + syntactic analyzer. (Very often, "parser" is used in a loose sense to refer to the syntactic analyzer.)
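A lexical analyzer of this kind can be sketched with regular expressions. The token names and patterns below are illustrative assumptions of mine, not taken from the slides:

```python
import re

# One named group per token class; whitespace is recognized but skipped.
TOKEN_SPEC = [
    ("NUM",  r"\d+"),                  # numbers
    ("NAME", r"[A-Za-z][A-Za-z0-9]*"), # identifiers
    ("OP",   r"[+\-*/()]"),            # operators and parentheses
    ("SKIP", r"\s+"),                  # whitespace
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """String of characters in, list of (kind, lexeme) tokens out."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("x1 + 42 * y"))
# [('NAME', 'x1'), ('OP', '+'), ('NUM', '42'), ('OP', '*'), ('NAME', 'y')]
```

The token list this produces is exactly the input the syntactic analyzer expects; the lexer has already collapsed individual characters into meaningful words.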
Extended BNF This is a shorthand notation for BNF rules. It adds no power to the syntax, only a shorthand way to write productions: [ ] -- grouping from which one must be chosen: Binary_E → T [+ | -] T. { }* -- repetition, 0 or more: E → T { [+ | -] T }*.
Extended BNF { }+ -- repetition, 1 or more; usage similar to { }*. { }opt -- optional: I → if E then S | if E then S else S can be written in EBNF as I → if E then S { else S }opt.
Extended BNF Example: identifier -- a letter followed by 0 or more letters or digits. Extended BNF: I → L { L | D }*, L → a | b | ..., D → 0 | 1 | ...
Extended BNF Example: identifier -- a letter followed by 0 or more letters or digits. Extended BNF: I → L { L | D }*, L → a | b | ..., D → 0 | 1 | ... Regular BNF: I → L | LM, M → CM | C, C → L | D, L → a | b | ..., D → 0 | 1 | ...
Classes of grammars BNF: Backus-Naur Form -- context free -- Type 2 -- already described. Regular grammars: subclass of BNF -- Type 3. BNF rules are restricted to A → tN | t, where N = nonterminal, t = terminal. Examples: Binary numbers: B → 0B | 1B | 0 | 1. Identifiers: I → aL | bL | ... | zL | a | ... | z; L → aL | ... | zL | 0L | 1L | ... | 9L | a | ... | z | 0 | 1 | ... | 9.
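Regular grammars generate exactly the regular languages, so both examples can equivalently be checked with regular expressions. A Python sketch (the `\Z` end-of-string anchoring and the lowercase-only alphabet are my simplifications):

```python
import re

# B -> 0B | 1B | 0 | 1   <=>  one or more binary digits
binary = re.compile(r"[01]+\Z")

# Identifier: a letter followed by zero or more letters or digits
ident = re.compile(r"[a-z][a-z0-9]*\Z")

print(bool(binary.match("0110")), bool(binary.match("012")))
print(bool(ident.match("x9y")), bool(ident.match("9xy")))
```

This equivalence is precisely why lexical rules (identifiers, numbers) are usually Type 3: they can be recognized by a finite automaton, with no stack needed.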
Other classes of grammars The context-free and regular grammars are important for programming language design. Other classes have theoretical importance, but not in this course: Context-sensitive grammars: Type 1. Rules: α → β where |α| ≤ |β|, i.e. the length of α is at most the length of β; α: string of non-terminals, β: string of terminals & non-terminals. Unrestricted, recursively enumerable: Type 0. Rules: α → β, with no restrictions on α and β.
Parsing BNF and extended BNF are notations for formally describing program syntax. Given the BNF grammar for the syntax of a programming language (say Java), how do we determine that a given Java program obeys all the grammar rules? This is achieved by parsing. Two main methods of parsing: top-down parsing (recursive descent) and bottom-up parsing.
Implementation of Rules Grammar:
<num> ::= <int> | <real>
<int> ::= <digit> | <digit><int>
<real> ::= <int-part> . <fraction>
<int-part> ::= <digit> | <int-part><digit>
<fraction> ::= <digit> | <digit><fraction>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Is 12.34 accepted by the grammar with <num> as the starting non-terminal? Top-down parse: => <num> => <int> (try first choice) => <digit> (try first choice) => 1 (backtrack: 2.34 not parsed)
Implementation of Rules (grammar and question as above) Top-down parse: => <num> => <int> (try first choice) => <digit><int> (try second choice) => 1<int> => 1<digit> (try first choice) => 12 (backtrack: .34 not parsed)
Implementation of Rules (grammar and question as above) Top-down parse: => <num> => <real> (try second choice) => <int-part>.<fraction> => ... => 12.34 (success!)
Implementation of Rules (grammar and question as above) Bottom-up parse: 12.34 => <digit>2.34 => <int>2.34 => <num>2.34 (backtrack)
Implementation of Rules (grammar and question as above) Bottom-up parse: 12.34 => <digit>2.34 => <digit><digit>.34 => <digit><int>.34 => <int>.34 => <num>.34 (backtrack)
Implementation of Rules (grammar and question as above) Bottom-up parse: 12.34 => <digit>2.34 => ... => <int-part>.34 => ... => <int-part>.<fraction> => <real> => <num> (success!)
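The backtracking in these traces can be made concrete. Below is a minimal Python recognizer sketch for the <num> grammar, where each nonterminal yields every input position it can reach, so failed choices backtrack automatically. Treating <int-part> and <fraction> like <int> is my simplification (they generate the same strings), and the whole encoding is my illustration:

```python
DIGITS = "0123456789"

def digit(s, i):                    # <digit> ::= 0 | 1 | ... | 9
    if i < len(s) and s[i] in DIGITS:
        yield i + 1

def int_(s, i):                     # <int> ::= <digit> | <digit><int>
    for j in digit(s, i):
        yield j                     # first choice: stop here
        yield from int_(s, j)       # second choice: keep consuming digits

def real(s, i):                     # <real> ::= <int-part> . <fraction>
    for j in int_(s, i):            # <int-part> generates the same strings
        if j < len(s) and s[j] == ".":
            yield from int_(s, j + 1)   # <fraction> likewise

def num(s):                         # <num> ::= <int> | <real>
    return any(j == len(s) for f in (int_, real) for j in f(s, 0))

print(num("12.34"), num("12."), num("1.2.3"))  # True False False
```

For "12.34" the <int> choice stops after "12" exactly as in the trace, and the generator then falls through to the <real> choice, which succeeds.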
Recursive descent parsing overview A simple parsing algorithm. Shows the relationship between the formal description of a programming language and the ability to generate executable code. Each non-terminal becomes a procedure; recursively call the procedures to check the different parts. (+) Simple translation from BNF. (-) Not efficient.
Recursive descent parsing Example of top-down parsing. <expression> ::= <term> + <term> | <term> - <term> (some details are left out; just get the idea)
procedure Expression;
begin
  Term;  /* Call Term to find first term */
  while ((nextchar='+') or (nextchar='-')) do
  begin
    nextchar := getchar;  /* Skip operator */
    Term
  end
end
procedure Term;
begin ... end
First recognize a Term; as long as the next symbol is either + or -, recognize another Term.
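A runnable Python counterpart of the Pascal-style sketch above, assuming single-digit terms for simplicity (that assumption is mine, not the slide's):

```python
class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_char(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def term(self):
        # <term> is simplified to a single digit here
        if not self.next_char().isdigit():
            raise SyntaxError(f"digit expected at position {self.pos}")
        self.pos += 1

    def expression(self):
        self.term()                             # first recognize a Term
        while self.next_char() in ("+", "-"):   # as long as + or - follows...
            self.pos += 1                       # skip the operator
            self.term()                         # ...recognize another Term

def accepts(text):
    p = Parser(text)
    try:
        p.expression()
        return p.pos == len(text)               # all input consumed?
    except SyntaxError:
        return False

print(accepts("1+2-3"), accepts("1+"), accepts("1 2"))  # True False False
```

As in the Pascal version, the non-terminal <expression> becomes a procedure (`expression`) that calls the procedure for <term>; the `while` loop plays the role of the repeated [+|-] <term> part.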
Structure of a compiler
Attribute Grammars Productions: E → E + T | T; T → T * P | P; P → I | (E)
Attribute Grammars
Production: E → E + T    Attribute: value(E1) = value(E2) + value(T)
Production: E → T        Attribute: value(E) = value(T)
Production: T → T * P    Attribute: value(T1) = value(T2) * value(P)
Production: T → P        Attribute: value(T) = value(P)
Production: P → I        Attribute: value(P) = value(I)
Production: P → (E)      Attribute: value(P) = value(E)
(E1 is the first E in the production and E2 is the second.) This technique is often useful for passing data type information within a parse tree.
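These value attributes are synthesized bottom-up, so they can be computed while the tree is being built. A Python sketch of this for the productions above (replacing the left recursion by iteration, letting single digits stand in for I, and combining parsing with evaluation; all of these choices are my illustration):

```python
def parse_E(s, i):                 # E -> E + T | T  (as iteration: T {+ T}*)
    v, i = parse_T(s, i)
    while i < len(s) and s[i] == "+":
        w, i = parse_T(s, i + 1)
        v += w                     # value(E1) = value(E2) + value(T)
    return v, i

def parse_T(s, i):                 # T -> T * P | P
    v, i = parse_P(s, i)
    while i < len(s) and s[i] == "*":
        w, i = parse_P(s, i + 1)
        v *= w                     # value(T1) = value(T2) * value(P)
    return v, i

def parse_P(s, i):                 # P -> I | ( E )
    if s[i] == "(":
        v, i = parse_E(s, i + 1)
        return v, i + 1            # value(P) = value(E); skip ")"
    return int(s[i]), i + 1        # value(P) = value(I)

value, _ = parse_E("(1+2)*3", 0)
print(value)  # 9
```

Each function returns the value attribute of its nonterminal together with the input position reached, so by the time the start symbol's call returns, every attribute has been computed, just as the slide describes.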
Example attributed tree
Use of attribute grammars First we need to develop the parse tree: attribute grammars assume you already have the derivation; this is not a parsing algorithm. Functions for attributes can be arbitrary. The process of building attributes is mostly manual. Attribute grammars may be used along with parsing to generate intermediate code: as the parse tree is built, information from one nonterminal is passed to another nonterminal higher in the tree. When the parse is completed, all attribute values have already been computed.
Summary We need a formalism for describing the set of all allowed programs in a programming language. BNF and EBNF grammars Given a program P in a programming language L and the BNF grammar for L, we can find out whether P is a syntactically correct program in language L. This activity is called parsing. The Recursive Descent Parsing technique is one such parsing technique.