CSCI 2041: Advanced Language Processing


Chris Kauffman
Last Updated: Wed Nov 28 13:25:47 CST 2018

Logistics

Reading
- OSM: Ch 17 The Debugger
- OSM: Ch 13 Lexer and Parser Generators (optional)
- Practical OCaml: Ch 10 Exception Handling (next)

Goals
- Parsing Left/Right Associativity
- Lexer/Parser Generators

A5: Calculon
- Arithmetic language interpreter
- 2X credit for assignment
- 5 Required Problems 100pts
- 5 Optional Problems 50pts
- Milestone deadline Wed 12/5
- Final deadline Tue 12/11

Exercise: Subtraction Trees

Consider these two parse trees for the given expression:

    let parsetree = parse_tokens (lex_string "10-2-3") in ...;;

    (* TREE A *)
    Sub(Const 10,
        Sub(Const 2,
            Const 3));;

    (* TREE B *)
    Sub(Sub(Const 10,
            Const 2),
        Const 3);;

1. What are the arithmetic results of evaluating each of them?
2. Which do you expect to result from our previous parsers?
3. Which gives the "correct" result according to standard rules of arithmetic?

Answers: Subtraction Trees

For the expression "10-2-3", Tree A is Sub(Const 10, Sub(Const 2, Const 3)) and Tree B is Sub(Sub(Const 10, Const 2), Const 3).

1. What are the arithmetic results of evaluating each of them?
   A = 10-(2-3) = 11, B = (10-2)-3 = 5
2. Which do you expect to result from our previous parsers?
   A has been the standard behavior of parsers from lecture and lab
3. Which gives the "correct" result according to standard rules of arithmetic?
   B is the standard interpretation for arithmetic with left-to-right evaluation
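The two results can be checked with a small evaluator. This is a minimal sketch: the constructor names Const, Add, and Sub follow the slides, but the exact type declaration and the eval function here are assumptions, not the course's code.

```ocaml
(* Assumed expression type; constructor names follow the slides. *)
type expr =
  | Const of int
  | Add of expr * expr
  | Sub of expr * expr

(* Evaluate a tree bottom-up. *)
let rec eval (e : expr) : int =
  match e with
  | Const n -> n
  | Add (l, r) -> eval l + eval r
  | Sub (l, r) -> eval l - eval r

(* TREE A: right-heavy, reads as 10-(2-3) *)
let tree_a = Sub (Const 10, Sub (Const 2, Const 3))

(* TREE B: left-heavy, reads as (10-2)-3 *)
let tree_b = Sub (Sub (Const 10, Const 2), Const 3)

let () =
  Printf.printf "A = %d, B = %d\n" (eval tree_a) (eval tree_b)
  (* prints "A = 11, B = 5" *)
```

The same tokens give different answers only because of the tree shape, which is why the parser's associativity matters.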

"Right-heavy" vs "Left-heavy" Parse Trees

- Chained operators like 1+2+3+4 have so far yielded "right-heavy" trees
- Doesn't matter for some operators such as addition, but matters a lot for subtraction, where "left-heavy" trees match the standard rules better
- Sometimes called the left associative interpretation
- A parser that handles left associative operators looks a little different from the original

Exercise: Compare Right/Left Associative Parsers

Right Heavy

    and parse_addsub toks =
      let (lexpr, rest) = parse_muldiv toks in     (* try higher prec first *)
      match rest with
      | Plus :: tail ->                            (* + is first *)
        let (rexpr,rest) = parse_addsub tail in    (* recursively generate right-hand *)
        (Add(lexpr,rexpr), rest)                   (* add left / right *)
      | Minus :: tail ->                           (* - is first *)
        let (rexpr,rest) = parse_addsub tail in    (* recursively generate right-hand *)
        (Sub(lexpr,rexpr), rest)                   (* subtract left / right *)
      | _ -> (lexpr, rest)                         (* not add/sub *)

Left Heavy

    and parse_addsub toks =
      let (lexpr, rest) = parse_muldiv toks in     (* create the initial left expr *)
      let rec iter lexpr toks =                    (* loop through adjacent exprs *)
        match toks with
        | Plus :: rest ->                          (* found + *)
          let (rexpr,rest) = parse_muldiv rest in  (* consume a higher-prec expr *)
          iter (Add(lexpr,rexpr)) rest             (* create Add and iterate again *)
        | Minus :: rest ->                         (* found - *)
          let (rexpr,rest) = parse_muldiv rest in  (* consume a higher-prec expr *)
          iter (Sub(lexpr,rexpr)) rest             (* create Sub and iterate again *)
        | _ -> (lexpr, toks)                       (* not add/sub: done *)
      in
      iter lexpr rest                              (* start iterating *)

Answers: Compare Right/Left Associative Parsers

- Right-associative recurses deeply to the right to generate the right-hand expression
- Left-associative iterates, consuming add/sub expressions in a (tail recursive) loop
- Left-associative creates a left-heavy tree by combining the left and right expressions in an Add/Sub, then passing it forward in the iteration to become the left branch
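To see both strategies run side by side, here is a self-contained sketch. The token and expr types are assumed, and parse_atom is a hypothetical stand-in for parse_muldiv that accepts only integer constants, so this is a cut-down illustration rather than the lecture's full parser.

```ocaml
(* Assumed token and tree types for this sketch. *)
type token = Int of int | Plus | Minus
type expr = Const of int | Add of expr * expr | Sub of expr * expr

(* Hypothetical stand-in for parse_muldiv: integer constants only. *)
let parse_atom toks =
  match toks with
  | Int n :: rest -> (Const n, rest)
  | _ -> failwith "expected an integer"

(* Right-heavy: recurse on the remaining tokens to build the right branch. *)
let rec parse_right toks =
  let (lexpr, rest) = parse_atom toks in
  match rest with
  | Plus :: tail ->
    let (rexpr, rest) = parse_right tail in
    (Add (lexpr, rexpr), rest)
  | Minus :: tail ->
    let (rexpr, rest) = parse_right tail in
    (Sub (lexpr, rexpr), rest)
  | _ -> (lexpr, rest)

(* Left-heavy: loop, folding each new operand into the left branch. *)
let parse_left toks =
  let (first, rest) = parse_atom toks in
  let rec iter lexpr toks =
    match toks with
    | Plus :: rest ->
      let (rexpr, rest) = parse_atom rest in
      iter (Add (lexpr, rexpr)) rest
    | Minus :: rest ->
      let (rexpr, rest) = parse_atom rest in
      iter (Sub (lexpr, rexpr)) rest
    | _ -> (lexpr, toks)
  in
  iter first rest

(* Tokens for "10-2-3" *)
let toks = [Int 10; Minus; Int 2; Minus; Int 3]
let (right_tree, _) = parse_right toks  (* Sub(Const 10, Sub(Const 2, Const 3)) *)
let (left_tree, _) = parse_left toks    (* Sub(Sub(Const 10, Const 2), Const 3) *)
```

The only structural difference is where the recursion happens: parse_right calls itself for the right operand, while parse_left calls the higher-precedence parser and carries the growing tree in the lexpr argument.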

Token Streams and Buffering

- So far have assumed that the lexer tokenizes the entire input string prior to starting the parser
- This works for small inputs, but for large files may be inefficient
  - May need to store the entire input (file) in memory during lexing
  - Must store the entire token list in memory during parsing
- Real-world lexers/parsers make this more efficient via a lexing buffer
- A lexing buffer stores only part of the file and provides a lexing stream API to see the next() token and consume() it
- Frequently seen in interpreter and compiler tutorials

    // imperative pseudocode for
    // parsing add/sub expressions
    // uses a lexing buffer
    global lexbuf;
    function parse_addsub(){
      var lexpr := parse_muldiv()
      while lexbuf.next() = "+" or lexbuf.next() = "-"
        var op := lexbuf.next()
        lexbuf.consume()
        var rexpr := parse_muldiv()
        lexpr := make_tree(op,lexpr,rexpr)
      return lexpr
    }
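The next()/consume() idea can be sketched in OCaml as well. Everything here is hypothetical and for illustration only: the lexbuf record, the of_list helper, and parse_atom (a stand-in for parse_muldiv) are not the standard Lexing module and not the lecture's code. The buffer holds a single token of lookahead and asks a producer function for more only when needed.

```ocaml
type token = Int of int | Plus | Minus | Eof
type expr = Const of int | Add of expr * expr | Sub of expr * expr

(* Hypothetical lexing buffer: a token producer plus one token of lookahead. *)
type lexbuf = {
  produce : unit -> token;
  mutable lookahead : token option;
}

(* Look at the next token without removing it. *)
let next lb =
  match lb.lookahead with
  | Some t -> t
  | None ->
    let t = lb.produce () in
    lb.lookahead <- Some t;
    t

(* Discard the current token so the following call to next fetches a new one. *)
let consume lb =
  ignore (next lb);
  lb.lookahead <- None

(* Demo producer that hands out tokens from a list, then Eof forever. *)
let of_list toks =
  let remaining = ref toks in
  let produce () =
    match !remaining with
    | [] -> Eof
    | t :: rest -> remaining := rest; t
  in
  { produce; lookahead = None }

(* Stand-in for parse_muldiv: integer constants only. *)
let parse_atom lb =
  match next lb with
  | Int n -> consume lb; Const n
  | _ -> failwith "expected an integer"

(* Loop-based add/sub parsing in the style of the pseudocode above. *)
let parse_addsub lb =
  let lexpr = ref (parse_atom lb) in
  let more = ref true in
  while !more do
    match next lb with
    | Plus -> consume lb; lexpr := Add (!lexpr, parse_atom lb)
    | Minus -> consume lb; lexpr := Sub (!lexpr, parse_atom lb)
    | _ -> more := false
  done;
  !lexpr
```

A real producer would read characters from a file on demand instead of walking a list; the parser code would not need to change.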

Lexing and Parsing Tools

- Generally do NOT want to write large-scale programs in assembly language: too many things can go wrong
- Generally do NOT want to write lexers/parsers by hand for large-scale languages: too many things can go wrong
- High-level programming languages improve over assembly through a compiler or interpreter: translate high-level code to low-level code
- Lexer/Parser Generators improve over hand-written lexers/parsers: translate high-level grammars to low-level code
- Lex and Yacc [1] are the classic tools to generate lexers/parsers
- Usually involves two input files
  1. Parser input to Yacc describes token kinds, grammar, actions
  2. Lexer input to Lex describes how characters translate to tokens
- Results in compilable code with a built-in lexing buffer and efficient grammar recognition through finite automata

[1] Yacc is short for "Yet Another Compiler Compiler", as it is often used to generate the front-end of a compiler

OCaml Lex and Yacc

- OCaml comes with standard tools for language processing
  - ocamllex: lexer generator
  - ocamlyacc: parser generator
- Input has special syntax, not all normal OCaml
- Will briefly survey these to get a flavor for them
- Lex/Yacc studied more thoroughly in
  - CSCI 4011: Formal Languages and Automata Theory
  - CSCI 5161: Introduction to Compilers
- Couch this in a discussion of a calculator language, arith, which is part of the code pack

    > cd arith/
    > make
    ...
    > ./arithmain
    arithmain> 1+1
    2
    arithmain> 5*9-2
    43
    arithmain> 10-3-2
    5

OCaml Lex Input

- Simple structure mainly used to set up a rule for token kinds
- Depends on arithparse.mli for the token kinds

    (* arithlex.mll : OCaml lex source file *)

    (* First section is raw ocaml between curlies *)
    {
      open Arithparse;;               (* bring in token types from arithparse.mli *)
      exception Eof;;                 (* declare exception type for end of file *)
    }

    (* second section defines how the lexer works *)
    rule token = parse
        [' ' '\t']         { token lexbuf }              (* skip whitespace by recursing *)
      | ['\n']             { EOL }
      | ['0'-'9']+ as lxm  { INT(int_of_string lxm) }    (* regex for numbers *)
      | '+'                { PLUS }
      | '-'                { MINUS }
      | '*'                { TIMES }
      | '/'                { SLASH }
      | '('                { OPAREN }
      | ')'                { CPAREN }
      | eof                { raise Eof }                 (* end of file *)
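For comparison, the same token rules can be written by hand as an ordinary OCaml function. This is only a sketch of what the rules mean, not what ocamllex generates (its output is a table-driven automaton). The token type is redeclared locally as an assumption; in the arith code it comes from Arithparse. Unlike the rule above, this version lexes a whole string at once and simply stops at the end of input instead of raising Eof.

```ocaml
(* Local copy of the token kinds, assumed to match the %token declarations. *)
type token =
  | INT of int
  | PLUS | MINUS | TIMES | SLASH
  | OPAREN | CPAREN | EOL

(* Hand-written lexer: turn a string into a list of tokens. *)
let lex_string (s : string) : token list =
  let len = String.length s in
  let rec loop pos acc =
    if pos >= len then List.rev acc
    else
      match s.[pos] with
      | ' ' | '\t' -> loop (pos + 1) acc              (* skip whitespace *)
      | '\n' -> loop (pos + 1) (EOL :: acc)
      | '+' -> loop (pos + 1) (PLUS :: acc)
      | '-' -> loop (pos + 1) (MINUS :: acc)
      | '*' -> loop (pos + 1) (TIMES :: acc)
      | '/' -> loop (pos + 1) (SLASH :: acc)
      | '(' -> loop (pos + 1) (OPAREN :: acc)
      | ')' -> loop (pos + 1) (CPAREN :: acc)
      | '0' .. '9' ->
        (* scan forward over the run of digits, like ['0'-'9']+ *)
        let stop = ref pos in
        while !stop < len && s.[!stop] >= '0' && s.[!stop] <= '9' do
          incr stop
        done;
        let digits = String.sub s pos (!stop - pos) in
        loop !stop (INT (int_of_string digits) :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %c" c)
  in
  loop 0 []
```

Each character case corresponds to one line of the rule; the digit case is the only one that needs a loop, which is the work the regular expression does in the ocamllex version.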

OCaml Yacc Input 1

- Two main sections, first is shown
- Declares token types and main entry into parser

    /* arithparse.mly: ocaml yacc source file defining a parser. Note the
       C-style comments rather than OCaml style */

    /* first section defines token types used by parser using % directives */
    %token <int> INT
    %token PLUS MINUS TIMES SLASH
    %token OPAREN CPAREN EOL

    %type <int> main                  /* type returned by production main */
    %start main                       /* entry production for parser */

    /* end first section */
    %%

OCaml Yacc Input 2

- Second section shows grammar productions
- Curlies to the right hold the actions associated with productions
- Dollar variables correspond to the results of the grammar elements on the right-hand side: $1 is the first, $3 the third

    ...
    %%
    /* second section which shows expressions */
    main:                                       /* initial production */
        plusminus EOL             { $1 }        /* $1 is result of plusminus */
      ;
    plusminus:                                  /* addition and subtraction */
        muldiv                    { $1 }        /* could be just mul/div */
      | plusminus PLUS muldiv     { $1 + $3 }   /* or an addition */
      | plusminus MINUS muldiv    { $1 - $3 }   /* or a subtraction */
      ;
    muldiv:                                     /* multiplication and division */
        ident                     { $1 }        /* could be just an ident */
      | muldiv TIMES ident        { $1 * $3 }   /* or a multiplication */
      | muldiv SLASH ident        { $1 / $3 }   /* or a division */
      ;
    ident:                                      /* identifier */
        INT                       { $1 }        /* integer constant */
      | OPAREN plusminus CPAREN   { $2 }        /* parenthesized expression */
      ;
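As a rough guide to what the productions and their actions compute, here is a hand-written sketch with one function per nonterminal. It is not what ocamlyacc generates (that is a table-driven parser), the token type is redeclared locally as an assumption, and the left-recursive productions are rendered as loops so that operators group to the left. Each function returns the computed integer together with the unconsumed tokens.

```ocaml
(* Local copy of the token kinds, assumed to match the %token declarations. *)
type token = INT of int | PLUS | MINUS | TIMES | SLASH | OPAREN | CPAREN

(* plusminus: muldiv, then any number of (+ muldiv) or (- muldiv) *)
let rec plusminus toks =
  let (first, rest) = muldiv toks in
  let rec iter acc toks =
    match toks with
    | PLUS :: rest -> let (v, rest) = muldiv rest in iter (acc + v) rest
    | MINUS :: rest -> let (v, rest) = muldiv rest in iter (acc - v) rest
    | _ -> (acc, toks)
  in
  iter first rest

(* muldiv: ident, then any number of ( * ident) or (/ ident) *)
and muldiv toks =
  let (first, rest) = ident toks in
  let rec iter acc toks =
    match toks with
    | TIMES :: rest -> let (v, rest) = ident rest in iter (acc * v) rest
    | SLASH :: rest -> let (v, rest) = ident rest in iter (acc / v) rest
    | _ -> (acc, toks)
  in
  iter first rest

(* ident: an integer constant or a parenthesized plusminus *)
and ident toks =
  match toks with
  | INT n :: rest -> (n, rest)
  | OPAREN :: rest ->
    (match plusminus rest with
     | (v, CPAREN :: rest) -> (v, rest)
     | _ -> failwith "expected )")
  | _ -> failwith "expected an integer or ("

(* Tokens for "10-3-2"; evaluates as (10-3)-2 *)
let (result, _) = plusminus [INT 10; MINUS; INT 3; MINUS; INT 2]
let () = Printf.printf "%d\n" result   (* prints 5 *)
```

The layering of plusminus over muldiv over ident is what gives multiplication and division higher precedence than addition and subtraction.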

A Main Function

    (* arithmain.ml: main routine for lexing/parsing and interpreting an
       arithmetic language. This version directly interprets the language
       rather than building an expression tree. *)
    open Printf;;

    let _ =
      try
        (* Lexing is an OCaml standard module for lexer support. Next line
           creates a lexing buffer. *)
        let lexbuf = Lexing.from_channel stdin in

        while true do                       (* loop over input until end of file *)
          printf "arithmain> %!";           (* print prompt *)

          (* Arithlex.token is a function that produces a token.
             Arithparse.main function takes a token producer and a lexbuf.
             The next line lexes and parses an expression. *)
          let result = Arithparse.main Arithlex.token lexbuf in

          printf "%d\n%!" result;           (* print integer result *)

        done;                               (* end of input loop *)

      with Arithlex.Eof ->                  (* eof exception pops out of loop *)
        printf "That's all folks!\n";
    ;;

Compiling Gets Complicated

- Compiling with lex/yacc is tricky, as several functions like Arithparse.main are generated from the grammar production rules
- Compile order is also tricky; best to put the build sequence into a Makefile or other build system

     1 > make
     2 ocamllex arithlex.mll          # creates arithlex.ml
     3 11 states, 267 transitions, table size 1134 bytes
     4 ocamlyacc arithparse.mly       # creates arithparse.ml / arithparse.mli
     5 ocamlc -g -c arithparse.mli    # required by arithlex.ml
     6 ocamlc -g -c arithlex.ml       # required by arithparse.ml
     7 ocamlc -g -c arithparse.ml
     8 ocamlc -g -c arithmain.ml      # requires arithlex.cmo and arithparse.cmo
     9 ocamlc -g -o arithmain arithlex.cmo arithparse.cmo arithmain.cmo
    10
    11 > ./arithmain
    12 arithmain> 1+3*2-4
    13 3

- Note the report on line 3: lexing statistics for the finite automata generated to recognize tokens
- arithlex.ml and arithparse.ml: valid OCaml but machine-generated code, not meant for human eyes

Generating Parse Trees in Lex/Yacc

- The arith/ system directly interprets input during parsing through grammar actions, as in

    plusminus:                                  /* addition and subtraction */
        muldiv                    { $1 }        /* could be just mul/div */
      | plusminus PLUS muldiv     { $1 + $3 }   /* or an addition */
      | plusminus MINUS muldiv    { $1 - $3 }   /* or a subtraction */
      ;

- This is typical of interpreters that perform no further transformations or optimizations on the code
- The code pack includes arith-tree/, which changes this to create a data structure instead, via code like

    plusminus:                                      /* addition and subtraction */
        muldiv                    { $1 }            /* could be just mul/div */
      | plusminus PLUS muldiv     { Add($1,$3) }    /* or an addition */
      | plusminus MINUS muldiv    { Sub($1,$3) }    /* or a subtraction */
      ;

- The resulting parse tree is captured in a main routine for printing, transformation, and evaluation
- Typical of a compiler, or at least a more sophisticated interpreter
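Once the parser returns a tree rather than a number, later passes can inspect and rewrite it. The sketch below shows printing and one simple transformation, constant folding. It is an illustration under stated assumptions: the Var constructor is a hypothetical addition so that folding has something it cannot reduce, and to_string and fold_constants are made-up names, not the contents of arith-tree/.

```ocaml
type expr =
  | Const of int
  | Var of string            (* hypothetical: not part of the arith language *)
  | Add of expr * expr
  | Sub of expr * expr

(* Print a tree in constructor notation. *)
let rec to_string e =
  match e with
  | Const n -> string_of_int n
  | Var name -> name
  | Add (l, r) -> Printf.sprintf "Add(%s,%s)" (to_string l) (to_string r)
  | Sub (l, r) -> Printf.sprintf "Sub(%s,%s)" (to_string l) (to_string r)

(* Transformation: replace any subtree made only of constants by its value. *)
let rec fold_constants e =
  match e with
  | Const _ | Var _ -> e
  | Add (l, r) ->
    (match (fold_constants l, fold_constants r) with
     | (Const a, Const b) -> Const (a + b)
     | (l', r') -> Add (l', r'))
  | Sub (l, r) ->
    (match (fold_constants l, fold_constants r) with
     | (Const a, Const b) -> Const (a - b)
     | (l', r') -> Sub (l', r'))

let tree = Add (Var "x", Sub (Const 10, Const 3))
let folded = fold_constants tree

let () =
  Printf.printf "%s => %s\n" (to_string tree) (to_string folded)
  (* prints "Add(x,Sub(10,3)) => Add(x,7)" *)
```

With direct interpretation in the grammar actions there is no tree to hand to a pass like this, which is the practical difference between the two approaches.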

How do other languages do it?

- OCaml and Lisp excel at symbolic computation: manipulating data like expression trees and token sequences
- OCaml makes it easy to declare new types of data that are algebraic with variants: very well suited for symbolic processing
- Lisp has untyped symbols built in, as easy as quoting: in the code 'add, add is a symbol with name "add"
- Languages like C, Java, Python are a bit clunkier for symbolic processing
  - Symbols aren't innate in any of them: string constants, enumerations, or classes can emulate them
  - Takes more work and more lines of code than the OCaml/Lisp mechanisms
  - Also, none of these have standard lexer/parser generators (though many libraries exist for them)

Contrast: Symbolic Data in Java vs OCaml

- As a sample, today's code pack contains equivalent versions of the arithmetic language in OCaml and Java
- Both of these
  - Accept the same language, like 1+2*3-12/4
  - Use lexer/parser generators to specify the high-level language
  - Accept user input on command line or interactively
  - Create an expression tree data structure
  - Print the data structure to the screen
- OCaml version uses ocamllex / ocamlyacc
  - Build: 6 files -> 17 files
- Java version uses the ANTLR4 parser generator library
  - Build: 5 files -> 35 files

Contrast Stats: Symbolic Data in Java vs OCaml

OCaml arith-tree/

    File              LOC  Purpose
    arithlex.mll       15  Lexer definition
    arithparse.mly     26  Grammar definition
                       41  Subtotal
    arithexpr.ml       33  Tree data type and printing
    arithmain.ml       13  Main function for interactive input loop, printing
                       88  Total Lines of Code

Java arith-java/

    File              LOC  Purpose
    TokenType.java     15  Declare token types with string names
    Arith.g            34  Grammar file for ANTLR4
                       49  Subtotal
    ArithMain.java    123  Main routine, tree data, interface code, printing
                      172  Total Lines of Code

- Not interactive: just parses command line arg and prints tree as ...
- Most of the code is interface glue matching classes to parse tree types via Visitor Pattern implementations
- Mostly due to Java classes not fitting expression trees as well as algebraic variants: classes are the only way to represent data in Java

Summary

- Writing lexers/parsers is hard, riddled with issues like left/right associativity
- Make life easier by employing a lexer/parser generator
- OCaml is well-suited for symbolic data processing via data type mechanisms and built-in data structures

"Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."
(Greenspun's Tenth Rule)