Iteratees, Parsec, and Monoids: A Parsing Trifecta
Edward Kmett

Overview

- The Problem
- Deriving Iteratees à la Oleg
- Buffered Iteratees
- Layering Parsec Over Iteratees
- Local Context-Sensitivity
- Monoids From Invariants
- Layering Monoids (Layout, Parsing)
- Conclusion

The Problem

- Monadic parsing is convenient.
- Monoids are parallel and incremental.
- How do we include both feature sets?

Parser Combinators

Parser combinators make it possible to express parsers directly in Haskell, using an embedded domain-specific language to encode your grammar. Parsec is the best-known parser combinator library in use in Haskell.
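For instance, a minimal Parsec grammar for comma-separated identifiers might look like the following sketch (illustrative, not from the talk):

import Text.Parsec
import Text.Parsec.String (Parser)

-- One or more identifiers separated by commas, with optional spaces:
-- the grammar is just ordinary Haskell code.
identifier :: Parser String
identifier = (:) <$> letter <*> many alphaNum

identList :: Parser [String]
identList = identifier `sepBy1` (spaces *> char ',' <* spaces)

-- > parse identList "" "foo, bar,baz"
-- Right ["foo","bar","baz"]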

Parser Combinator Libraries

- Parsec 3: monadic, depth-first, backtracking
- Iteratees: monadic, resumable, non-backtracking
- UU-ParsingLib: monadic, breadth-first, backtracking
- Parsimony (coming soon): monoidal parsing of Alternative grammars

The Problem in More Detail

- Monadic parsing is convenient; monoids are parallel and incremental. How do we include both feature sets?
- People know how to use Parsec effectively, but Parsec needs the full input and can't be resumed.
- An efficient parsing monoid can't be monadic, because of context-sensitivity.

Monoidal Parsing Issues

As a Monoid, we need to parse input chunks with no contextual information about them:

- We don't know how far along in the file we are.
- We don't know if we're in the middle of parsing a string.
- We don't know what lexemes are valid in our current state.

What Are Our Options?

- Walk an Applicative grammar top-down or bottom-up in Earley or CYK style and merge subparses. (Parsimony)
- For smaller languages, extract a small representation of the state transitions represented by the values spanned by our monoid (e.g. Bicyclic, Regex...; the bicyclic monoid is sketched below).
- Hunt for frequently occurring invariants in the lexer where context-sensitivity vanishes, start new monadic parsers at those locations, and fold their results together.
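As an illustration of the second option, here is a minimal sketch of the bicyclic monoid (not the talk's code, written in the talk's pre-Semigroup Monoid style). It summarizes the state transitions of balanced parentheses: any chunk of input reduces to "close c parens, then open o parens", and chunks combine associatively.

-- A chunk of parentheses reduces to (closers, openers): it first closes
-- c outstanding parens, then opens o new ones.
data Bicyclic = Bicyclic !Int !Int deriving (Eq, Show)

instance Monoid Bicyclic where
  mempty = Bicyclic 0 0
  Bicyclic c o `mappend` Bicyclic c' o'
    | o >= c'   = Bicyclic c (o - c' + o')   -- left openers absorb right closers
    | otherwise = Bicyclic (c + c' - o) o'

paren :: Char -> Bicyclic
paren '(' = Bicyclic 0 1
paren ')' = Bicyclic 1 0
paren _   = mempty

-- foldMap paren "(()" == Bicyclic 0 1 (one unmatched open paren).
-- Balanced input reduces to mempty, and because mappend is associative
-- the fold can be chunked, parallelized, and recomputed incrementally.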

Monoid-Wrapped Parsec Issues

We need Parsec to be able to run up to a point and block waiting for input in order for it to work monoidally. But Parsec offers getInput, which can be used to obtain the unparsed portion of the input Stream. We need to be able to resume the same Parsec parser with multiple future parses to deal with incremental parsing, so this at first seems paradoxical.

Inverting Parsec: Iteratees

An Iteratee is a monad that consumes input and performs computation until it needs more, then asks for more input. Iteratees are perhaps best understood by first thinking about resumable computations as a monad.

Partiality as a Monad

data Partial a
  = Return a
  | Cont (Partial a)

instance Monad Partial where
  return = Return
  Return a >>= f = f a            -- the first monad law!
  Cont k   >>= f = Cont (k >>= f)

runPartial :: Partial a -> Int -> Either (Partial a) a
runPartial (Return a) _ = Right a
runPartial c@(Cont _) 0 = Left c
runPartial (Cont k)  !n = runPartial k (n - 1)
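As a usage sketch (not from the talk, and assuming the Functor/Applicative boilerplate that modern GHC also requires), a computation that pauses at each step can be run with a bounded amount of fuel and resumed later:

-- A hypothetical helper: do nothing, but yield control once.
pause :: Partial ()
pause = Cont (Return ())

steps :: Partial Int
steps = pause >> pause >> return 42

-- runPartial steps 1 == Left <suspended computation, resumable later>
-- runPartial steps 2 == Right 42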

Iteratees à la Oleg

data Input = Chunk ByteString | EOF

data Iteratee a
  = Return a Input
  | Cont (Input -> Iteratee a)

instance Monad Iteratee where
  return a = Return a (Chunk empty)
  Return a (Chunk e) >>= f | null e = f a
  Return a i >>= f = case f a of
    Return b _ -> Return b i
    Cont k     -> k i
  Cont k >>= f = Cont (k >=> f)
  -- (>=>) :: (a -> m b) -> (b -> m c) -> a -> m c

Reading Input in an Iteratee

next :: Iteratee (Maybe Char)
next = Cont next'

next' :: Input -> Iteratee (Maybe Char)
next' EOF = Return Nothing EOF
next' (Chunk i)
  | null i    = next
  | otherwise = Return (Just (head i)) (Chunk (tail i))

But once it consumes part of the input, it is gone!
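To see it run, here is a hypothetical driver (not from the talk) that hand-feeds one piece of Input to a suspended Iteratee:

import Data.ByteString.Char8 (pack)

-- Feed one Input to a suspended Iteratee; finished ones pass through.
feed :: Input -> Iteratee a -> Iteratee a
feed i (Cont k) = k i
feed _ done     = done

-- > feed (Chunk (pack "hi")) next
-- Return (Just 'h') (Chunk "i"): the leftover chunk rides along
-- > feed EOF next
-- Return Nothing EOF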

Backtracking Iteratees

The problem with the default Iteratee design is that it conflates current-position info with the contents of the buffer. We need to retain the full input-buffer history for backtracking purposes, but this requires an efficient means of concatenating ByteStrings and indexing into the buffer. ByteStrings alone, lazy or not, won't cut it asymptotically!

Monoids to the Rescue

newtype Cursor = Cursor { getCursor :: Int }
  deriving (Eq, Ord, Num, Show, Read, Enum)

instance Monoid Cursor where
  mempty  = 0
  mappend = (+)

Efficient Buffers

newtype Chunk = Chunk { getChunk :: S.ByteString }
  deriving (Eq, Ord, Show, Read)

instance Measured Cursor Chunk where
  measure (Chunk xs) = Cursor (S.length xs)

type Buffer = FingerTree Cursor Chunk

index :: Cursor -> Buffer -> Word8
index !i !t = S.index a $ getCursor (i - measure l)
  where
    (l, r) = F.split (> i) t
    Chunk a :< _ = F.viewl r

-- O(log (min (n, m - n))) for m chunks and an offset into the nth chunk
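A quick illustration of the idea (a sketch; the imports and chunk contents are assumptions, with S as Data.ByteString.Char8 and F as Data.FingerTree):

import qualified Data.ByteString.Char8 as S
import qualified Data.FingerTree as F

-- Build a buffer from several chunks; the finger tree caches the
-- cumulative length (our Cursor monoid) at each node.
buf :: Buffer
buf = F.fromList [Chunk (S.pack "hello"), Chunk (S.pack "world")]

-- > index (Cursor 7) buf
-- 114 (the byte for 'r'): F.split uses the cached Cursor measures to
-- find the right chunk in O(log n), then S.index finishes within it.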

Buffered Iteratee

data Iteratee a
  = Done a !Buffer !Bool
  | Cont (Buffer -> Bool -> Iteratee a)

instance Monad Iteratee where
  return a = Done a F.empty False
  Done a h False >>= f | F.null h = f a
  Done a h eof >>= f = case f a of
    Done b _ _ -> Done b h eof
    Cont k     -> k h eof
  Cont k >>= f = Cont (\h eof -> k h eof >>= f)

Care and Feeding of Iteratees

type Enumerator a = Iteratee a -> Iteratee a

supplyBuffer :: Buffer -> Enumerator a
supplyBuffer b (Cont k) = k b False
supplyBuffer _ other    = other

supplyStrictByteString :: S.ByteString -> Enumerator a
supplyStrictByteString = supplyBuffer . F.singleton . Chunk

Generic Supplies

class Supply t where
  supply     :: t -> Enumerator a
  supplyList :: [t] -> Enumerator a
  supplyList = foldr (andThen . supply) id
  -- andThen presumably sequences Enumerators left-to-right: andThen f g = g . f

instance Supply a => Supply [a] where
  supply = supplyList

instance Supply Buffer where
  supply = supplyBuffer

instance Supply Strict.ByteString where ...
instance Supply Char where ...

Supplying EOF

data EOF = EOF

instance Supply EOF where
  supply _ (Cont k)     = k F.empty True
  supply _ (Done a h _) = Done a h True

A Digression: Failure

data Iteratee a
  = Done a !Buffer !Bool
  | Fail String !Buffer !Bool
  | Cont (Buffer -> Bool -> Iteratee a)

instance Monad Iteratee where
  return a = Done a F.empty False
  Done a h False >>= f | F.null h = f a
  Done a h eof >>= f = case f a of
    Done b _ _ -> Done b h eof
    Cont k     -> k h eof
    Fail s _ _ -> Fail s h eof
  Cont k >>= f = Cont (\h eof -> k h eof >>= f)
  Fail s h eof >>= _ = Fail s h eof
  fail s = Fail s F.empty False

instance MonadPlus Iteratee where ...
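The MonadPlus instance is elided above. As a hedged guess (a sketch of the idea, not Kmett's actual code), backtracking can exploit the fact that Fail carries the full buffer history, so the alternative can simply be restarted on everything the failed branch saw:

instance MonadPlus Iteratee where
  mzero = Fail "mzero" F.empty False
  Done a h eof `mplus` _ = Done a h eof
  Fail _ h eof `mplus` n
    | eof       = supply EOF (supplyBuffer h n)  -- replay history, then EOF
    | otherwise = supplyBuffer h n               -- replay history so far
  Cont k `mplus` n = Cont (\h eof -> k h eof `mplus` n)

-- Reads go through absolute Cursors (next slide), so the replayed branch
-- re-reads from its own starting point rather than from the failure point.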

Reading from Buffered Iteratees

getWord8 :: Cursor -> Iteratee Word8
getWord8 !n = Cont get' where
  get' :: Buffer -> Bool -> Iteratee Word8
  get' h eof
    | n < measure h = Done (index n h) h eof
    | eof           = Fail "EOF" h eof
    | otherwise     = Cont (\h' eof' -> get' (h >< h') eof')

Iteratee-Based Parsec Streams

instance Stream Cursor Iteratee Word8 where
  uncons !n = (do w <- getWord8 n
                  return $ Just (w, n + 1))
          <|> return Nothing

-- I actually use a Char Stream that does UTF-8 decoding, but it won't fit on a slide!

A UTF-8 Iteratee Parsec Stream

instance Stream Cursor Iteratee Char where
  uncons !n = (getWord8 n >>= uncons') <|> return Nothing
    where
      uncons' c
        | c <= 0x7f = return $ Just (toEnum (fromEnum c), n + 1)
        | c >= 0xc2 && c <= 0xdf = do
            d <- getWord8 (n + 1)
            return $ Just (b2 c d, n + 2)
        | c >= 0xe0 && c <= 0xef = do
            d <- getWord8 (n + 1)
            e <- getWord8 (n + 2)
            return $ Just (b3 c d e, n + 3)
        | c >= 0xf0 && c <= 0xf4 = do
            d <- getWord8 (n + 1)
            e <- getWord8 (n + 2)
            f <- getWord8 (n + 3)
            return $ Just (b4 c d e f, n + 4)
        | otherwise = return $ Just (replacementChar, n + 1)

-- I lied. It will.

Putting It Together

type P = ParsecT Cursor () Iteratee

parser = string "hello" <|> string "goodbye"

example = supply EOF $ supply "bye" $ supply "good" $
          runParser parser () "-" (Cursor 0)
-- Done "goodbye" (Chunk (pack "goodbye"))
-- Note the successful backtrack.
-- You can obtain the current cursor location with getInput.

Slicing w/ Sharing

sliceIt :: Cursor -> Cursor -> Iteratee S.ByteString
sliceIt !i !j = Cont slice' where
  slice' :: Buffer -> Bool -> Iteratee S.ByteString
  slice' h eof
    | j <= measure h || eof = Done (sliceBuffer h) h eof
    | otherwise = Cont $ \h' eof' -> slice' (h >< h') eof'

  sliceBuffer :: Buffer -> S.ByteString
  sliceBuffer !t
    | req <= rmn = Strict.take req first
    | otherwise  = Strict.concat $ Lazy.toChunks $
                   Lazy.take (fromIntegral req) $
                   Lazy.fromChunks $ first : map getChunk (toList r')
    where
      (l, r) = F.split (> i) t
      Chunk a :< r' = F.viewl r
      first = Strict.drop (getCursor (i - measure l)) a
      req   = getCursor $ j - i
      rmn   = Strict.length first

Slicing in a Parsec Parser

slice :: Cursor -> Cursor -> P ByteString
slice mark release = lift (sliceIt mark release)

sliced :: P a -> P ByteString
sliced parse = do
  mark <- getInput
  parse
  release <- getInput
  slice mark release

Sliced Recognizers

data Token
  = Ident ByteString
  | Symbol ByteString

ident  = Ident  <$> sliced (lower >> skipMany isIdent)
symbol = Symbol <$> sliced (skipMany isSymbol)

An Iteratee-Based Monoid

Assume we have a language where we spot the invariant in the lexer that after any newline not preceded by a backslash, any lexeme is valid. We will scan input ByteStrings for non-backslashed newlines and start an iteratee after each one. Since this comes from an actual language, in which whitespace is used for Haskell-style layout, we'll have to track the whitespace on the edges of our result.
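A minimal sketch of that scan (the function name and chunk handling are assumptions, not the talk's code): find the byte offsets within a chunk where a fresh monadic lexer may safely be started. The cross-chunk cases, such as a backslash at the end of the previous chunk, are presumably what the edge flags on LexRaw in the next slides account for.

import qualified Data.ByteString.Char8 as S

-- Offsets just past each newline that is not escaped by a backslash;
-- each is a position where a new lexer may be started.
safeStarts :: S.ByteString -> [Int]
safeStarts bs =
  [ i + 1
  | i <- S.elemIndices '\n' bs
  , i == 0 || S.index bs (i - 1) /= '\\'
  ]

-- > safeStarts (S.pack "a \\\n b\nc")
-- [7]: the first newline is backslashed, the second is a safe start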

Local Context Sensitivity

Many programming-language grammars are only locally context-sensitive. Real compilers usually use error productions at the next newline or semicolon to resume parsing after a syntax error. So, scan the input for those points, start a Parsec lexer that feeds tokens to a Reducer, then merge those results monoidally.
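Reducer is the class from Kmett's monoids/reducers packages, roughly class Monoid m => Reducer c m where unit :: c -> m. A toy sketch of a token Reducer follows; the instance is purely illustrative:

-- A Reducer turns individual elements into a monoid; here we merely
-- count tokens, but the same shape can build token sequences,
-- highlighting runs, symbol tables, and so on.
newtype TokenCount = TokenCount Int deriving Show

instance Monoid TokenCount where
  mempty = TokenCount 0
  TokenCount a `mappend` TokenCount b = TokenCount (a + b)

instance Reducer Token TokenCount where
  unit _ = TokenCount 1

-- Tokens lexed from separate chunks reduce independently, and the partial
-- results merge with mappend: that is what "merge those results,
-- monoidally" buys us.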

Recall: File Position Monoid

data Delta
  = Pos !ByteString !Int !Int
  | Lines !Int !Int
  | Cols !Int
  | Tab !Int !Int
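The Monoid instance itself isn't shown on the slide. Here is a hedged sketch of plausible semantics (the Pos and Tab cases especially are simplified guesses, not the talk's definitions): a pure column move extends whatever precedes it, while anything carrying a newline resets the column.

instance Monoid Delta where
  mempty = Cols 0
  Cols c    `mappend` Cols c'   = Cols (c + c')
  Lines l c `mappend` Cols c'   = Lines l (c + c')
  Pos f l c `mappend` Cols c'   = Pos f l (c + c')
  Lines l _ `mappend` Lines m c = Lines (l + m) c
  Pos f l _ `mappend` Lines m c = Pos f (l + m) c
  _         `mappend` p@Pos{}   = p    -- an absolute position wins outright
  d         `mappend` d'        = d'   -- Tab handling elided from this sketch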

Introducing Lex

data LexResult t ts
  = LexSegment !Delta ts !Delta
  | LexChunk !Delta
  deriving (Eq, Show, Data, Typeable)

data Lex t ts
  = LexEmpty
  | LexRaw !Bool !Buffer !Bool
  | LexCooked !Buffer (Iteratee (LexResult t ts))

class Spaceable t where
  space :: Delta -> t

class (Spaceable t, Reducer t ts) => Lexable t ts | ts -> t where
  lexer :: Iteratee (LexResult t ts)

Making Lex into a Monoid

(<-<) :: Supply c => Iteratee a -> c -> Iteratee a
(<-<) = flip supply

instance Lexable t ts => Monoid (Lex t ts) where
  mempty = LexEmpty
  LexEmpty `mappend` b = b
  a `mappend` LexEmpty = a
  LexRaw _ xs False `mappend` LexRaw True ys _ = LexCooked xs $ lexer <-< tailBuffer ys
  LexRaw a xs _     `mappend` LexRaw _ ys d    = LexRaw a (xs `mappend` ys) d
  LexRaw _ xs _     `mappend` LexCooked ys m   = LexCooked (xs `mappend` ys) m
  LexCooked xs m    `mappend` LexRaw _ ys _    = LexCooked xs (m <-< ys)
  LexCooked xs m    `mappend` LexCooked ys n   = LexCooked xs $ merge (m <-< ys <-< EOF) n

merge :: Lexable t ts
      => Iteratee (LexResult t ts)
      -> Iteratee (LexResult t ts)
      -> Iteratee (LexResult t ts)
merge = ... an exercise for the reader ...

Applying Lex

newtype LexSource t ts = LexSource { getLexSource :: ByteString }

instance Lexable t ts => Measured (Lex t ts) (LexSource t ts) where
  measure = unit . getLexSource

type Source t ts = FingerTree (Lex t ts) (LexSource t ts)

Just build up a Source, and it can be measured as a reduction of the token stream.
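Hypothetically, with a Lexable instance for some token type Tok (both Tok and the chunk contents are illustrative assumptions, with S as Data.ByteString.Char8 and F as Data.FingerTree), using a Source looks like:

-- Build a Source from raw chunks; measuring it runs the monoidal lexer.
source :: Source Tok [Tok]
source = F.fromList
  [ LexSource (S.pack "main = do\n")
  , LexSource (S.pack "  putStrLn \"hi\"\n")
  ]

tokens :: Lex Tok [Tok]
tokens = F.measure source
-- Editing one chunk invalidates only the measures along a finger-tree
-- spine, so the token stream is recomputed incrementally.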

Going Deeper

We can feed tokens to any token Reducer. Use another monoidal layer, with an Iteratee-based Parsec parser, and scan the lexemes for a nice resumption point, such as layout-introducing keywords. Then run a simple left-to-right monadic calculation that balances parentheses/brackets/braces and layout.

Bitonic Layout

Process layout monoidally by keeping track of a monoid of unfinished closing contexts, reduced results, and opening contexts for layout, parens, and bracket-like keyword pairs such as case..of and if..else.

data Bitonic m = Bitonic (Seq (Closing m)) m (Seq (Opening m))

-- ...)...)... (...(...  +  ...)...)... (...  =  ...)...)... (...
-- the left's trailing openers cancel against the right's leading closers

Layout typically connotes the detail parser to use inside, so we can finish parsing here: declarations after where/let, statements after do, etc.

Conclusion

We can parse monoidally using Parsec parsers if we plan ahead. This permits incremental and parallel parsing, which is useful for modern compiler demands:

- Interactive type checking
- Intellisense and tab completion
- Exact syntax highlighting as you type

These same patterns are useful at multiple levels of the parser (lexing, layout, etc.). Buffered Iteratees can invert control for Parsec effectively and, with slicing, improve sharing between tokens and the source tree.

Slides and source will be available online via http://comonad.com/