Deductive Parsing with Sequentially Indexed Grammars


Jan van Eijck

May 25, 2005

Abstract

This paper extends the Earley parsing algorithm for context free languages [3] to the case of sequentially indexed languages. Sequentially indexed languages are related to indexed languages [1, 2]. The difference is that parallel processing of index stacks is replaced by sequential processing [4]. This paper contains the full code of an implementation in Haskell [6], in literate programming style [7], of an algorithm for deductive parsing based on [8], focussing on the case of an Earley style parsing algorithm for sequentially indexed languages.

Keywords: Deductive parsing, context free grammars, indexed languages, nested stack automata, Earley parsing algorithm, Haskell, literate programming.

1 Introduction

Indexed grammars [1] are quadruples $G = (N, T, P, S)$, where $N$ is a finite nonterminal alphabet, $T$ is a finite terminal alphabet with $N \cap T = \emptyset$, $P$ is a finite set of productions of the form $(X, \alpha)$ with $X \in N \cup \hat{N}$ and $\alpha \in (N \cup \hat{N} \cup T)^*$, where $\hat{N}$ is the set of all $X_Y$ with $X, Y \in N$, and $S \in N$ is the start symbol. A production $(X, \alpha)$ is written as $X \to \alpha$.

Let $G = (N, T, P, S)$ be an indexed grammar. A pair $(X, [X_1, \ldots, X_n])$, with $X, X_1, \ldots, X_n \in N$, is called an indexed nonterminal. Indexed nonterminals are written as $X^{X_1 \cdots X_n}$. Let $N^\sigma$ be the set of all $X^{X_1 \cdots X_n}$ with $X, X_1, \ldots, X_n \in N$. Then a sentential form for $G$ is a string $\alpha$ in $(N^\sigma \cup T)^*$.

To define the one step derivation relation, we need a preliminary definition:

Definition 1 If $\delta \in (N \cup \hat{N} \cup T)^*$ and $\zeta \in N^*$, then $\delta^\zeta$ is given by the following recursion:

$\epsilon^\zeta = \epsilon$
$(w : \delta)^\zeta = w : \delta^\zeta$ if $w \in T$,
$(Y : \delta)^\zeta = Y^\zeta : \delta^\zeta$ if $Y \in N$,
$(Y_Z : \delta)^\zeta = Y^{(Z:\zeta)} : \delta^\zeta$ if $Y_Z \in \hat{N}$.
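To make Definition 1 concrete, here is a minimal standalone Haskell sketch. It is not part of the implementation developed below; the types Sym and ISym and the function super are ad hoc names for this illustration, with indices fixed to Char.

-- Ad-hoc symbol types: terminals, plain nonterminals Y, singly indexed
-- nonterminals Y_Z, and stack-indexed nonterminals Y^zeta.
data Sym  = Tm Char | Nt Char | NtI Char Char deriving Show
data ISym = ITm Char | INt Char [Char]        deriving Show

-- super delta zeta computes delta^zeta of Definition 1:
-- the stack zeta is copied to every nonterminal in delta.
super :: [Sym] -> [Char] -> [ISym]
super []              _ = []
super (Tm w    : ds)  z = ITm w       : super ds z
super (Nt y    : ds)  z = INt y z     : super ds z
super (NtI y x : ds)  z = INt y (x:z) : super ds z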

Note that as a special case of Definition 1, we have that $(Y_Z)^\theta = Y^{Z:\theta}$.

Using the definition of $\delta^\zeta$, we can define one step derivations:

Definition 2 Let $\alpha, \beta$ be sentential forms for indexed grammar $G$. Then $\alpha \Rightarrow_G \beta$ iff

1. $\alpha = \gamma_1 X^\zeta \gamma_2$, $X \to \delta$ is a production of the grammar, and $\beta = \gamma_1 \delta^\zeta \gamma_2$, or
2. $\alpha = \gamma_1 X^{(Y:\zeta)} \gamma_2$, $X_Y \to \delta$ is a production of the grammar, and $\beta = \gamma_1 \delta^\zeta \gamma_2$.

In terms of this, $\alpha \Rightarrow^*_G \beta$ is defined in the usual way. This definition is equivalent to the definition in [1].

Sequentially indexed grammars use indices that get pushed to an arbitrary nonterminal in the righthand side of a production. Sequentially indexed grammars look just like indexed grammars, but the definition of derivation is different. The following definition uses list concatenation: if $\zeta$ is the result of concatenating $\zeta_1$ and $\zeta_2$, we denote this as $\zeta = \zeta_1 \mathbin{+\!\!+} \zeta_2$.

Definition 3 If $\delta \in (N \cup \hat{N} \cup T)^*$ and $\zeta \in N^*$, then $(\delta)^\zeta$ is the subset of $(N^\sigma \cup T)^*$ defined recursively as:

$(\epsilon)^{[]} = \{\epsilon\}$
$(\epsilon)^\zeta = \emptyset$ if $\zeta \neq []$,
$(w : \delta)^\zeta = \{w : \delta' \mid \delta' \in (\delta)^\zeta\}$ if $w \in T$,
$(C : \delta)^\zeta = \{C^{\zeta_1} : \delta' \mid \zeta = \zeta_1 \mathbin{+\!\!+} \zeta_2,\ \delta' \in (\delta)^{\zeta_2}\}$ if $C \in N$,
$(C_Y : \delta)^\zeta = \{C^{Y:\zeta_1} : \delta' \mid \zeta = \zeta_1 \mathbin{+\!\!+} \zeta_2,\ \delta' \in (\delta)^{\zeta_2}\}$ if $C_Y \in \hat{N}$.

The relation of one-step derivation is defined in terms of $(\delta)^\zeta$, as follows:

Definition 4 Let $\alpha, \beta$ be sentential forms for indexed grammar $G$. Then $\alpha \Rightarrow_G \beta$ iff

1. $\alpha = \gamma_1 B^\zeta \gamma_2$, $B \to \delta$ is a production of the grammar, and $\beta = \gamma_1 \delta' \gamma_2$, where $\delta' \in (\delta)^\zeta$, or
2. $\alpha = \gamma_1 B^{(Y:\zeta)} \gamma_2$, $B_Y \to \delta$ is a production of the grammar, and $\beta = \gamma_1 \delta' \gamma_2$, where $\delta' \in (\delta)^\zeta$.

In derivations with sequentially indexed grammars, stacks are never allowed to disappear, and stacks are never allowed to get duplicated. In particular, a production $B \to \epsilon$ will not allow a one-step derivation like $B^{YYX} \Rightarrow \epsilon$, and a production $B \to CD$ will not allow a one-step derivation like $B^{YYX} \Rightarrow C^{YYX} D^{YYX}$ (but it will allow $B^{YYX} \Rightarrow C^{YYX} D$, $B^{YYX} \Rightarrow C^{YY} D^{X}$, $B^{YYX} \Rightarrow C^{Y} D^{YX}$, and $B^{YYX} \Rightarrow C D^{YYX}$). A production like $B_X \to \epsilon$ can lead to a one-step derivation $B^X \Rightarrow \epsilon$. This effectively treats $X$ as a trace.

Sequentially indexed grammars are different from an earlier proposal for a restricted form of indexed grammars, in [5]. Gazdar proposed to use index lists that get copied to a single nonterminal in the righthand side of a production, but in such a way that this heir-nonterminal has to be indicated in the rule.
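To make Definition 3 concrete as well, here is a minimal sketch of the set $(\delta)^\zeta$, reusing the ad-hoc Sym and ISym types from the sketch above (again not part of the implementation below; the names splits and dist are ad hoc):

-- All ways of splitting a list in two:
-- splits "XY" = [("","XY"),("X","Y"),("XY","")].
splits :: [a] -> [([a],[a])]
splits []     = [([],[])]
splits (x:xs) = ([],x:xs) : [ (x:us,vs) | (us,vs) <- splits xs ]

-- dist delta zeta enumerates the set (delta)^zeta of Definition 3:
-- the stack zeta is distributed over the nonterminals of delta.
dist :: [Sym] -> [Char] -> [[ISym]]
dist []              []  = [[]]
dist []              _   = []                 -- stacks may not disappear
dist (Tm w    : ds)  z   = [ ITm w : d' | d' <- dist ds z ]
dist (Nt c    : ds)  z   = [ INt c z1     : d' | (z1,z2) <- splits z, d' <- dist ds z2 ]
dist (NtI c y : ds)  z   = [ INt c (y:z1) : d' | (z1,z2) <- splits z, d' <- dist ds z2 ]

For instance, dist [Nt 'C', Nt 'D'] "YYX" yields exactly the four distributions of the stack $YYX$ over $C$ and $D$ listed above.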

2 General Data Structures

module DPS where

import List
import Char
import System.IO.Unsafe (unsafePerformIO)

Terminal and nonterminal symbols:

data Symbol a b = T a | N b | D b | I b b deriving (Eq,Ord,Read)

The T constructor is for terminals and the N constructor for plain nonterminals. The D nonterminal is useful for extending a grammar with a new start symbol. The I constructor indicates a nonterminal indexed with another nonterminal.

Given show functions for the types a and b, we define a show function for Symbol a b as follows:

instance (Show a, Show b) => Show (Symbol a b) where
  show (T x)   = show x
  show (N x)   = show x
  show (D x)   = '#' : show x
  show (I x y) = show x ++ "[" ++ show y ++ "]"

The property of being a nonterminal:

nonterm :: Symbol a b -> Bool
nonterm (T _) = False
nonterm _     = True

Category of a nonterminal:

ntcat :: Symbol a b -> [b]
ntcat (N x)   = [x]
ntcat (I x _) = [x]
ntcat _       = []

Index of a nonterminal:

ntidx :: Symbol a b -> [b]
ntidx (N x)   = []
ntidx (I _ y) = [y]
ntidx _       = []

The property of being a dummy symbol:

dummy :: Symbol a b -> Bool
dummy (D _) = True
dummy _     = False

The property of being an indexed symbol:

indexed :: Symbol a b -> Bool
indexed (I _ _) = True
indexed _       = False

Grammar rules:

data Rule a b = Rule (Symbol a b) [Symbol a b] deriving Eq

A show function for grammar rules:

instance (Show a, Show b) => Show (Rule a b) where
  show (Rule y zs) = show y ++ "-->" ++ show zs
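As a quick check of these definitions, an interpreter session might look as follows (a sketch; the output is what we would expect from the Show instances above):

DPS> show (I 'S' 'X')
"'S'['X']"
DPS> ntcat (I 'S' 'X')
"S"
DPS> show (Rule (N 'S') [T 'a', N 'S', T 'a'])
"'S'-->['a','S','a']"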

Reading a grammar rule:

instance (Read a, Read b) => Read (Rule a b) where
  readsPrec p = \ r ->
    [ (Rule symbol rhs, u) | (symbol, s) <- reads r,
                             ("-->", t)  <- lex s,
                             (rhs, u)    <- reads t ]

Example:

DPS> read "N 'S' --> [T 'a', N 'S', T 'a']" :: Rule Char Char
'S'-->['a','S','a']

Functions for accessing the left- and righthand sides of a rule:

lhs :: Rule a b -> Symbol a b
lhs (Rule x ys) = x

rhs :: Rule a b -> [Symbol a b]
rhs (Rule x ys) = ys

A function for counting the number of nonterminals in the righthand side of a rule:

ntc :: [Symbol a b] -> Int
ntc []             = 0
ntc (N _ : rest)   = 1 + ntc rest
ntc (I _ _ : rest) = 1 + ntc rest
ntc (_ : rest)     = ntc rest

A grammar is a list of rules:

type Grammar a b = [Rule a b]

When specifying a grammar we adopt the convention that the lefthand side symbol of the first grammar rule is the start symbol.

start :: Grammar a b -> Symbol a b
start grammar = lhs (head grammar)

Converting a list of strings into a grammar:

readGrammar :: (Read a, Read b) => [String] -> Grammar a b
readGrammar ls = map read ls'
  where ls'      = filter nonempty ls
        nonempty = \ s -> dropWhile isSpace s /= []

A function for reading a grammar from a file:

getGrammar :: (Read a, Read b) => FilePath -> IO (Grammar a b)
getGrammar filename = do
  str <- readFile filename
  return (readGrammar (lines str))

Same, avoiding the IO monad:

getGr :: (Read a, Read b) => FilePath -> Grammar a b
getGr filename = unsafePerformIO (getGrammar filename)

3 Example Grammars for CF Languages

For concreteness' sake, let us assume that terminal and nonterminal symbols are of type Char. Here is an example grammar, read in from the file grammar0 (it is assumed that the file grammar0 is in the current directory):

DPS> getGr "grammar0" :: Grammar String String
["S"-->["a","S","b"],"S"-->["a","b"]]
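The file grammar0 itself is not reproduced here; given the output above and the Read instance for rules, its contents are presumably something like:

N "S" --> [T "a", N "S", T "b"]
N "S" --> [T "a", T "b"]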

Here is another example grammar:

grammar1 :: Grammar Char Char
grammar1 = [Rule (N 'S') [T 'a', N 'S', T 'a'],
            Rule (N 'S') [T 'b', N 'S', T 'b'],
            Rule (N 'S') [T 'a'],
            Rule (N 'S') [T 'b'] ]

An example of a grammar with epsilon rules:

grammar2 :: Grammar Char Char
grammar2 = [Rule (N 'S') [T 'a', N 'S', T 'a'],
            Rule (N 'S') [T 'b', N 'S', T 'b'],
            Rule (N 'S') [T 'a'],
            Rule (N 'S') [T 'b'],
            Rule (N 'S') [] ]

A grammar for balanced parentheses:

grammar3 :: Grammar Char Char
grammar3 = [Rule (N 'S') [T '(', N 'S', T ')', N 'S'],
            Rule (N 'S') [] ]

4 Grammars for Non-CF Languages

A grammar for the language $\{a^n b^n c^n \mid n \geq 0\}$:

grammar4 :: Grammar Char Char
grammar4 = [Rule (N 'S') [T 'a', I 'S' 'X'],
            Rule (N 'S') [N 'A'],
            Rule (I 'A' 'X') [T 'b', N 'A', T 'c'],
            Rule (N 'A') [] ]
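For example, grammar4 derives the string aabbcc as follows (writing the rules in indexed notation: $S \to a\,S_X$, $S \to A$, $A_X \to b\,A\,c$, $A \to \epsilon$):

$$S \Rightarrow a\,S^X \Rightarrow aa\,S^{XX} \Rightarrow aa\,A^{XX} \Rightarrow aab\,A^X c \Rightarrow aabb\,A\,cc \Rightarrow aabbcc.$$

The index stack built up by the $S$ rules is consumed, one index per $b \ldots c$ pair, by the $A_X$ rule, and the $\epsilon$ rule for $A$ applies only once the stack is empty.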

grammar5 :: Grammar Char Char
grammar5 = [Rule (N 'S') [T 'a', I 'S' 'X'],
            Rule (N 'S') [T 'b', I 'S' 'Y'],
            Rule (N 'S') [N 'A'],
            Rule (I 'A' 'X') [N 'A', T 'a'],
            Rule (I 'A' 'Y') [N 'A', T 'b'],
            Rule (N 'A') [] ]

grammar6 :: Grammar Char Char
grammar6 = [Rule (N 'A') [I 'A' 'X'],
            Rule (N 'A') [N 'B'],
            Rule (I 'B' 'X') [T 'a', N 'B'],
            Rule (N 'B') [] ]

Grammar grammar5 generates the copy language $\{ww \mid w \in \{a,b\}^*\}$. Grammar grammar6 generates $a^*$; note that its rule $A \to A_X$ can push indices without consuming any input, a possibility that the parser will have to guard against (see the side condition on the third prediction rule below).

5 Derivation Trees

Here is a data type for derivation trees:

data Tree a b = Leaf a | Node b [b] [Tree a b] deriving (Eq,Ord,Show)

Here is an example:

tree0 = Node 'S' [] [Leaf 'a', Leaf 'b']

Displaying a tree on the screen:

displayTree :: (Show a, Show b) => Tree a b -> IO ()
displayTree tr = mapM_ putStrLn (showTree 0 tr)
  where
    showTree :: (Show a, Show b) => Int -> Tree a b -> [String]
    showTree i (Leaf x) = [map (\ _ -> ' ') [1..i] ++ show x]
    showTree i (Node x [] ts) =
      (map (\ _ -> ' ') [1..i] ++ show x)
      : concat (map (showTree (i+5)) ts)
    showTree i (Node x xs ts) =
      (map (\ _ -> ' ') [1..i] ++ show x ++ show xs)
      : concat (map (showTree (i+5)) ts)

The example tree gets displayed as follows:

DPS> displayTree tree0
'S'
     'a'
     'b'

Displaying a tree list:

displayTrees :: (Show a, Show b) => [Tree a b] -> IO ()
displayTrees trees = sequence_ (map displayTree trees)

6 Earley Items, Axioms, Goals, Consequences

Earley items

Earley items for context free parsing are of the form $[i, A \to \alpha \bullet \beta, j]$. They consist of a rule $A \to \alpha\beta$ with a dot in its righthand side to indicate the part of the righthand side that was recognized so far, a pointer $i$ to the parent node where the rule was invoked, and a pointer $j$ to the position in the input that recognition has reached. For parsing indexed languages, we will use three extra components:

1. a stack of the indices at the point where the rule was invoked,
2. a stack of indices for the first nonterminal to the right of the dot,
3. a stack of indices for the tail of the nonterminal list to the right of the dot.

We will use Greek letters $\eta$, $\zeta$, $\theta$ for index stacks.

The item format now becomes:

$$[i, \theta, A \to \alpha \bullet \beta, \eta, \zeta, j]$$

where $\theta, \eta, \zeta$ are stacks of indices (nonterminals). The item indicates the following:

- grammar rule $A \to \alpha\beta$ was invoked at point $i$,
- at the point of invocation, the top node $A$ has associated stack $\theta$,
- at point $j$, part $\alpha$ of the righthand side of the rule has been successfully recognized,
- $\eta$ is the stack for the first nonterminal among $\beta$ (if $\beta$ has no nonterminals, then $\eta$ is empty),
- $\zeta$ is the stack for the remainder of the nonterminals in $\beta$ (if $\beta$ has less than two nonterminals, then $\zeta$ is empty).

For good measure, we also include a derivation tree component, by putting a list of derivation trees as the last component of an Earley item.

data Item a b = Item Int [b] (Symbol a b) [Symbol a b] [Symbol a b] [b] [b] Int [Tree a b]
  deriving (Eq,Ord)

A show function for items, using * for the dot, and suppressing the derivation tree component:

instance (Show a, Show b) => Show (Item a b) where
  show (Item i theta b symbols symbols' eta zeta j ts) =
    "(" ++ show i ++ "," ++ show theta ++ "," ++ show b ++ "==>"
        ++ show symbols ++ "*" ++ show symbols' ++ ","
        ++ show eta ++ "," ++ show zeta ++ "," ++ show j ++ ")"

A function for extracting the list of derivation trees from an Earley item:

getTrees :: Item a b -> [Tree a b]
getTrees (Item i theta b symbols symbols' eta zeta j ts) = ts

Axiom

In the case of Earley parsing with CF grammars, there is one axiom. It has the form $[0, S' \to \bullet S, 0]$, where $S$ is the start symbol of the grammar and $S'$ is a new start symbol. Adapting this to the case of parsing with sequentially indexed grammars, the axiom takes the shape $[0, [], S' \to \bullet S, [], [], 0]$, indicating that at the beginning of the parse, there is one pending nonterminal, and all stack components are empty.

axioms :: Grammar a b -> [Item a b]
axioms grammar = [Item 0 [] (D x) [] [N x] [] [] 0 []]
  where (N x) = start grammar

Goal

In the case of Earley parsing with CF grammars, there is one goal. It has the form $[0, S' \to S \bullet, n]$, where $S$ is the start symbol of the grammar, $S'$ is the new start symbol used in the axiom, and $n$ is the length of the input. For the case of Earley style parsing with indexed grammars, we also require that the index stack components are empty at the end of the parse, so the goal shape becomes: $[0, [], S' \to S \bullet, [], [], n]$.
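Concretely, for grammar1 (start symbol 'S') the axioms function yields the single item

Item 0 [] (D 'S') [] [N 'S'] [] [] 0 []

and a goal item for an input of length n must have the dummy symbol on the left, recognized part [N 'S'], an empty remainder, empty stacks, and end position n.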

Here is a function for recognizing goals:

goal :: (Eq a, Eq b) => Grammar a b -> [a] -> Item a b -> Bool
goal grammar tokens (Item i theta symbol symbols symbols' eta zeta k trees) =
  i == 0 && theta == [] && dummy symbol
         && symbols == [start grammar] && symbols' == []
         && eta == [] && zeta == []
         && k == length tokens

Consequences

As in the case of Earley parsing with CF grammars, there are three kinds of consequences, for scanning, prediction and completion.

consequences :: (Eq a, Eq b) => Grammar a b -> [a] -> Item a b -> [Item a b] -> [Item a b]
consequences grammar tokens trigger stored =
  scan tokens trigger
  ++ predict tokens grammar trigger
  ++ complete grammar trigger stored

Scanning

The scanning rule for Earley parsing with CF grammars is the rule that shifts the bullet across a terminal. It has the form (derivation tree component omitted):

$$\frac{[i, A \to \alpha \bullet w\beta, j]}{[i, A \to \alpha w \bullet \beta, j+1]}$$

For parsing sequentially indexed languages, three index stack components are added to this. Scanning does not change the index stacks $\theta, \eta, \zeta$:

$$\frac{[i, \theta, A \to \alpha \bullet w\beta, \eta, \zeta, j]}{[i, \theta, A \to \alpha w \bullet \beta, \eta, \zeta, j+1]}$$

scan :: (Eq a, Eq b) => [a] -> Item a b -> [Item a b]
scan tokens (Item i theta a alpha [] eta zeta j ts) = []
scan tokens (Item i theta a alpha (symbol:beta) eta zeta j ts)
  | j >= length tokens = []
  | otherwise          =
      [ Item i theta a (alpha ++ [symbol]) beta eta zeta (j+1)
             (ts ++ [Leaf (tokens !! j)])
      | symbol == T (tokens !! j) ]

Prediction

The prediction rule for Earley parsing is the rule that initializes a new rule $B \to \gamma$ on the basis of a premise indicating that $B$ is expected at the current point in the input. In the CF grammar case it has the following form (derivation tree component omitted):

$$\frac{[i, A \to \alpha \bullet B\beta, j]}{[j, B \to \bullet\gamma, j]} \quad B \to \gamma$$

In the case of Earley-style parsing with sequentially indexed grammars this splits into four rules. The rules split the first index stack. For this we need some terminology. If $\gamma$ is a list of grammar symbols and $\eta, \eta', \eta''$ are index stacks, then $c(\gamma)$ is the number of nonterminals in $\gamma$, and $C(\eta, \eta', \eta'', \gamma)$ is the following constraint:

$$\eta = \eta' \mathbin{+\!\!+} \eta'' \;\wedge\; (c(\gamma) = 0 \to \eta = []) \;\wedge\; (c(\gamma) = 1 \to \eta'' = []).$$

Splitting a list in two sublists:

split :: [a] -> [([a],[a])]
split []     = [([],[])]
split (x:xs) = ([],x:xs) : map (\ (us,vs) -> (x:us,vs)) (split xs)

For example, split "XY" yields [("","XY"),("X","Y"),("XY","")].

Implementation of the constraint:

constraint :: (Eq a, Eq b) => ([b],[b],[Symbol a b]) -> Bool
constraint (stack1,stack2,symbols) =
  (ntc symbols /= 0 || (stack1 == [] && stack2 == [])) &&
  (ntc symbols /= 1 || stack2 == [])

The first prediction rule covers the case of an expected nonterminal $B$ matched against a rule with head $B$. The rule distributes the appropriate stack over the new item, in accordance with the constraint imposed by the number of nonterminals in the righthand side of the grammar rule used in the prediction.

$$\frac{[i, \theta, A \to \alpha \bullet B\beta, \eta, \zeta, j]}{[j, \eta, B \to \bullet\gamma, \eta', \eta'', j]} \quad B \to \gamma,\; C(\eta, \eta', \eta'', \gamma)$$

The second rule covers the case of an expected nonterminal $B$ matched against a rule with head $B_X$. This rule pops the index stack associated with $B$.

$$\frac{[i, \theta, A \to \alpha \bullet B\beta, (X{:}\eta), \zeta, j]}{[j, \eta, B_X \to \bullet\gamma, \eta', \eta'', j]} \quad B_X \to \gamma,\; C(\eta, \eta', \eta'', \gamma)$$

The third rule covers the case of an expected nonterminal $B_Y$ matched against a rule $B \to \gamma$:

$$\frac{[i, \theta, A \to \alpha \bullet B_Y\beta, \eta, \zeta, j]}{[j, (Y{:}\eta), B \to \bullet\gamma, \eta', \eta'', j]} \quad B \to \gamma,\; C(Y{:}\eta, \eta', \eta'', \gamma),\; n - j > |\eta|$$

Note the side condition on the rule ($n$ is the length of the input). The side condition prevents unlimited growth of the stack. This is needed to prevent a rule like $A \to A_Y$ from causing an unbounded number of pushes.

The fourth rule covers the case of an expected nonterminal $B_Y$ matched against a rule $B_Y \to \gamma$:

$$\frac{[i, \theta, A \to \alpha \bullet B_Y\beta, \eta, \zeta, j]}{[j, \eta, B_Y \to \bullet\gamma, \eta', \eta'', j]} \quad B_Y \to \gamma,\; C(\eta, \eta', \eta'', \gamma)$$

If no further symbols are expected, nothing is predicted:

predict :: (Eq a, Eq b) => [a] -> Grammar a b -> Item a b -> [Item a b]
predict tokens grammar (Item i theta a alpha [] eta zeta j ts) = []

If a nonterminal without index is expected, we get:

predict tokens grammar (Item i theta a alpha (N x:beta) eta zeta j ts) =
  [ Item j eta (N x) [] gamma eta' eta'' j []
  | Rule (N z) gamma <- grammar,
    (eta',eta'') <- split eta,
    x == z,
    constraint (eta',eta'',gamma) ]
  ++
  [ Item j (tail eta) (I x y) [] gamma eta' eta'' j []
  | Rule (I x' y) gamma <- grammar,
    x == x',
    eta /= [], head eta == y,
    (eta',eta'') <- split (tail eta),
    constraint (eta',eta'',gamma) ]

If a nonterminal with an index is expected, we get:

predict tokens grammar (Item i theta a alpha (I x y:beta) eta zeta j ts) =
  [ Item j (y:eta) (N x) [] gamma eta' eta'' j []
  | Rule (N x') gamma <- grammar,
    (eta',eta'') <- split (y:eta),
    x == x',
    constraint (eta',eta'',gamma),
    length tokens - j > length eta ]
  ++
  [ Item j eta (I x y) [] gamma eta' eta'' j []
  | Rule (I x' y') gamma <- grammar,
    x == x', y == y',
    (eta',eta'') <- split eta,
    constraint (eta',eta'',gamma) ]

Finally, we need a catch-all clause to indicate that these are all the predict consequences. This covers the case where the next expected symbol is a terminal.

predict tokens grammar (Item i theta a alpha beta eta zeta j ts) = []

Completion

The completion rule for Earley parsing is the rule that shifts the bullet across a nonterminal. It has two premises, and it is of the following form (derivation tree component

omitted):

$$\frac{[i, A \to \alpha \bullet B\beta, k] \qquad [k, B \to \gamma \bullet, j]}{[i, A \to \alpha B \bullet \beta, j]}$$

For the case of Earley-style parsing with sequentially indexed grammars, this splits into four rules, as follows. The first rule checks that the lefthand tail index stack of the first premise matches the head index stack of the second premise, for the case of a match of expected symbol $B$ against completed rule $B \to \gamma$:

$$\frac{[i, \theta, A \to \alpha \bullet B\beta, \eta, \zeta, k] \qquad [k, \eta, B \to \gamma \bullet, [], [], j]}{[i, \theta, A \to \alpha B \bullet \beta, \zeta', \zeta'', j]} \quad C(\zeta, \zeta', \zeta'', \beta)$$

The second rule covers the case of a match of expected symbol $B$ against completed rule $B_Y \to \gamma$:

$$\frac{[i, \theta, A \to \alpha \bullet B\beta, (Y{:}\eta), \zeta, k] \qquad [k, \eta, B_Y \to \gamma \bullet, [], [], j]}{[i, \theta, A \to \alpha B \bullet \beta, \zeta', \zeta'', j]} \quad C(\zeta, \zeta', \zeta'', \beta)$$

The third rule covers the case of a match of expected symbol $B_Y$ against completed rule $B \to \gamma$:

$$\frac{[i, \theta, A \to \alpha \bullet B_Y\beta, \eta, \zeta, k] \qquad [k, (Y{:}\eta), B \to \gamma \bullet, [], [], j]}{[i, \theta, A \to \alpha B_Y \bullet \beta, \zeta', \zeta'', j]} \quad C(\zeta, \zeta', \zeta'', \beta)$$

The fourth rule covers the case of a match of expected symbol $B_Y$ against completed rule $B_Y \to \gamma$:

$$\frac{[i, \theta, A \to \alpha \bullet B_Y\beta, \eta, \zeta, k] \qquad [k, \eta, B_Y \to \gamma \bullet, [], [], j]}{[i, \theta, A \to \alpha B_Y \bullet \beta, \zeta', \zeta'', j]} \quad C(\zeta, \zeta', \zeta'', \beta)$$

In the implementation this is handled by distinguishing four cases:

- Trigger of the form $[i, \theta, A \to \alpha \bullet B\beta, \eta, \zeta, k]$: look for a completed item with head $B$ or $B_Y$ on the chart.
- Trigger of the form $[i, \theta, A \to \alpha \bullet B_Y\beta, \eta, \zeta, k]$: look for a completed item with head $B$ or $B_Y$ on the chart.
- Trigger of the form $[k, \eta, B \to \gamma \bullet, [], [], j]$: look for an item with expected symbol $B$ or $B_Y$ on the chart.
- Trigger of the form $[k, \eta, B_Y \to \gamma \bullet, [], [], j]$: look for an item with expected symbol $B$ or $B_Y$ on the chart.

complete :: (Eq a, Eq b) => Grammar a b -> Item a b -> [Item a b] -> [Item a b]
complete grammar (Item i theta a alpha (N x:beta) eta zeta k ts) stored =
  -- first completion rule: expected B, completed rule with head B
  [ Item i theta a (alpha ++ [N x]) beta zeta' zeta'' j
         (ts ++ [Node x eta ts'])
  | (Item k' eta' symbol gamma [] [] [] j ts') <- stored,
    (zeta',zeta'') <- split zeta,
    constraint (zeta',zeta'',beta),
    k == k', eta == eta', symbol == N x ]
  ++
  -- second completion rule: expected B, completed rule with head B_Y
  [ Item i theta a (alpha ++ [N x]) beta zeta' zeta'' j
         (ts ++ [Node x eta ts'])
  | (Item k' eta' (I x' y) gamma [] [] [] j ts') <- stored,
    k == k', x == x',
    eta /= [], head eta == y, tail eta == eta',
    (zeta',zeta'') <- split zeta,
    constraint (zeta',zeta'',beta) ]

complete grammar (Item i theta a alpha (I x y:beta) eta zeta k ts) stored =
  -- third completion rule: expected B_Y, completed rule with head B
  [ Item i theta a (alpha ++ [I x y]) beta zeta' zeta'' j
         (ts ++ [Node x eta' ts'])
  | (Item k' eta' symbol gamma [] [] [] j ts') <- stored,
    eta' /= [], head eta' == y, tail eta' == eta,
    (zeta',zeta'') <- split zeta,
    constraint (zeta',zeta'',beta),
    k == k', symbol == N x ]
  ++
  -- fourth completion rule: expected B_Y, completed rule with head B_Y
  [ Item i theta a (alpha ++ [I x y]) beta zeta' zeta'' j
         (ts ++ [Node x (y:eta) ts'])
  | (Item k' eta' symbol gamma [] [] [] j ts') <- stored,
    k == k', symbol == I x y, eta == eta',
    (zeta',zeta'') <- split zeta,
    constraint (zeta',zeta'',beta) ]

complete grammar (Item k eta (N x) gamma [] [] [] j ts) stored =
  -- first completion rule, triggered from the completed item
  [ Item i theta a (alpha ++ [N x]) beta zeta' zeta'' j
         (ts' ++ [Node x eta ts])
  | (Item i theta a alpha (symbol:beta) eta' zeta k' ts') <- stored,
    k == k', eta == eta', symbol == N x,
    (zeta',zeta'') <- split zeta,
    constraint (zeta',zeta'',beta) ]
  ++
  -- third completion rule, triggered from the completed item
  [ Item i theta a (alpha ++ [I x' y]) beta zeta' zeta'' j
         (ts' ++ [Node x eta ts])
  | (Item i theta a alpha (I x' y:beta) eta' zeta k' ts') <- stored,
    k == k',
    eta /= [], head eta == y, tail eta == eta',
    x == x',
    (zeta',zeta'') <- split zeta,
    constraint (zeta',zeta'',beta) ]

complete grammar (Item k eta (I x y) gamma [] [] [] j ts) stored =
  -- second completion rule, triggered from the completed item
  [ Item i theta a (alpha ++ [N x]) beta zeta' zeta'' j
         (ts' ++ [Node x (y:eta) ts])
  | (Item i theta a alpha (symbol:beta) (y':eta') zeta k' ts') <- stored,
    k == k', eta == eta', y == y', symbol == N x,
    (zeta',zeta'') <- split zeta,
    constraint (zeta',zeta'',beta) ]
  ++
  -- fourth completion rule, triggered from the completed item
  [ Item i theta a (alpha ++ [I x y]) beta zeta' zeta'' j
         (ts' ++ [Node x (y:eta) ts])
  | (Item i theta a alpha (I x' y':beta) eta' zeta k' ts') <- stored,
    k == k', x == x', y == y', eta == eta',
    (zeta',zeta'') <- split zeta,
    constraint (zeta',zeta'',beta) ]

In the implementation, we also have to specify what happens to premises of the form $[i, \theta, A \to \alpha \bullet w\beta, \eta, \zeta, k]$. This is the final case of the catch-all pattern:

complete grammar item stored = []

This completes the Earley-specific part of the story.

7 Chart and Agenda

A chart plus agenda is a pair of item lists. Call this datatype a store.

type Store a b = ([Item a b],[Item a b])

The idea is to use the agenda for those items that have been proved, but whose direct consequences have not yet been derived, and the chart for the proved items the consequences of which have also been computed. We start out with an empty chart and with a list of all axioms on the agenda.

initStore :: (Eq a, Eq b) => Grammar a b -> [a] -> Store a b
initStore grammar tokens = ([], axioms grammar)

Next, we tackle the items on the agenda one by one:

- add their consequences to the agenda,
- move them from the agenda to the chart (as their consequences have been computed).

exhaustAgenda :: (Eq a, Eq b) => Grammar a b -> [a] -> Store a b -> Store a b
exhaustAgenda grammar tokens (chart,[]) = (chart,[])
exhaustAgenda grammar tokens (chart,agenda@(trigger:rest)) =
  exhaustAgenda grammar tokens (newchart,newagenda)
  where newchart  = chart ++ [trigger]
        store     = chart ++ agenda
        conseq    = consequences grammar tokens trigger chart
        new       = conseq \\ store
        newagenda = rest ++ new

Check whether a goal item has been found, and return the list of goal items:

goalFound :: (Eq a, Eq b) => Grammar a b -> [a] -> [Item a b] -> [Item a b]
goalFound grammar tokens store = filter gl store
  where gl = goal grammar tokens

If a parse is successful, it is nice to display the chart:

display :: Show a => [a] -> IO ()
display []     = return ()
display (x:xs) = do print x
                    display xs

Rather than displaying the whole chart, we will display only the records of the nodes that have been successfully created. To that end, we prune the chart using the following filter:

pruned :: (Eq a, Eq b) => [Item a b] -> [Item a b]
pruned = filter (\ (Item i theta s symbols symbols' eta zeta j ts) -> symbols' == [])

As output of a parse we allow either a parse tree or a chart, depending on an output flag.

data OutputKind = Tree | Chart deriving Eq

Parsing is now a matter of initializing the store, exhausting the agenda, and checking whether a goal item has been found in the chart.

parse :: (Eq a, Show a, Eq b, Show b) => Grammar a b -> [a] -> OutputKind -> IO ()
parse grammar tokens output =
  if goals /= []
    then if output == Tree
           then displayTrees ptrees
           else display (pruned chart)
    else putStrLn "no parse"
  where goals  = goalFound grammar tokens chart
        ptrees = getTrees (head goals)
        init   = initStore grammar tokens
        result = exhaustAgenda grammar tokens init
        chart  = fst result

Incomplete parses (for debugging):

iparse :: (Eq a, Show a, Eq b, Show b) => Grammar a b -> [a] -> IO ()
iparse grammar tokens = display chart
  where init   = initStore grammar tokens
        result = exhaustAgenda grammar tokens init
        chart  = fst result

Parsing with a grammar read from a file:

prs :: String -> [String] -> OutputKind -> IO ()
prs string tokens output = do
  grammar <- getGrammar string :: IO (Grammar String String)
  parse grammar tokens output
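For example, with the grammars defined above, a session might look like this (a sketch; output elided, as the exact rendering depends on the Show instances):

DPS> parse grammar4 "aabbcc" Tree    -- displays a derivation tree
DPS> parse grammar4 "aabbc" Tree
no parse
DPS> prs "grammar0" ["a","a","b","b"] Chart   -- displays the pruned chart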

8 Testing

parseTest :: (Eq a, Eq b) => Grammar a b -> [a] -> Bool
parseTest grammar tokens = goals /= []
  where goals  = goalFound grammar tokens chart
        init   = initStore grammar tokens
        result = exhaustAgenda grammar tokens init
        chart  = fst result

test :: (Eq a, Show a, Eq b, Show b) => (Grammar a b, [a]) -> String
test (grammar, tokens) =
  if parseTest grammar tokens
    then show grammar ++ " " ++ show tokens ++ " succeeds"
    else show grammar ++ " " ++ show tokens ++ " fails"

suite1 :: [(Grammar Char Char, [Char])]
suite1 = [ (grammar1, ""),
           (grammar1, "abba"),
           (grammar1, "aba"),
           (grammar2, ""),
           (grammar2, "aba"),
           (grammar2, "abba"),
           (grammar2, "aaabbaaa"),
           (grammar3, ""),
           (grammar3, "(()())"),
           (grammar3, "(()()"),
           (grammar3, "((((())))()"),
           (grammar3, "((((())))())"),
           (grammar4, ""),
           (grammar4, "aabbcc"),
           (grammar4, "aabbbcc"),
           (grammar4, "aabbbccc"),
           (grammar4, "aaaaabbbbbccccc"),
           (grammar5, ""),
           (grammar5, "aabaaab"),
           (grammar5, "aabaab"),
           (grammar5, "aaaaabbaaaaabb"),
           (grammar6, ""),
           (grammar6, "a"),
           (grammar6, "ab") ]

runTests :: IO ()
runTests = sequence_ (map (putStrLn . test) suite1)
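For reference: grammar1 generates the odd-length palindromes over {a, b}, grammar2 all palindromes over {a, b}, grammar3 the balanced parenthesis strings, and, as noted above, grammar4, grammar5 and grammar6 generate $a^n b^n c^n$, $ww$, and $a^*$ respectively. So of the suite1 tests we expect the following to fail: (grammar1, "") and (grammar1, "abba"); (grammar3, "(()()") and (grammar3, "((((())))()"); (grammar4, "aabbbcc") and (grammar4, "aabbbccc"); (grammar5, "aabaaab"); and (grammar6, "ab"). All other tests should report success.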

9 Function for Stand-alone Use

Module declaration:

module Main where

import DPS
import System

Definition of the main function:

main :: IO ()
main = do
  args <- getArgs
  prs (args !! 0) (words (args !! 1)) Tree

This allows:

[jve@water sig]$ more grammar6
N "S" --> [N "NP", N "VP"]
N "VP" --> [N "TV", N "NP"]
N "VP" --> [T "talked"]
N "VP" --> [T "smiled"]
N "NP" --> [N "Det", N "CN"]
N "NP" --> [T "John"]
N "NP" --> [T "Mary"]
N "TV" --> [T "loved"]
N "TV" --> [T "hated"]
N "Det" --> [T "the"]
N "Det" --> [T "some"]
N "CN" --> [T "man"]
N "CN" --> [T "woman"]
N "CN" --> [N "CN", T "that", I "S" "NP"]
I "NP" "NP" --> []
[jve@water sig]$ runhugs Main grammar6 "John hated the man that loved Mary"
"S"
     "NP"
          "John"
     "VP"
          "TV"
               "hated"
          "NP"
               "Det"
                    "the"
               "CN"
                    "CN"
                         "man"
                    "that"
                    "S"["NP"]
                         "NP"["NP"]
                         "VP"
                              "TV"
                                   "loved"
                              "NP"
                                   "Mary"
[jve@water sig]$

References

[1] Aho, A. V. Indexed grammars - an extension of context-free grammars. Journal of the ACM 15, 4 (1968), 647-671.

[2] Aho, A. V. Nested stack automata. Journal of the ACM 16, 3 (1969), 383-406.

[3] Earley, J. An efficient context-free parsing algorithm. Communications of the ACM 13, 2 (1970), 94-102.

[4] Eijck, J. van. Sequentially indexed grammars. Manuscript, Centre for Mathematics and Computer Science, Amsterdam.

[5] Gazdar, G. Applicability of indexed grammars to natural languages. In Natural Language Parsing and Linguistic Theories, U. Reyle and C. Rohrer, Eds. Reidel, Dordrecht, 1988, pp. 69-94.

[6] Jones, S. P., Hughes, J., et al. Report on the programming language Haskell 98. Available from the Haskell homepage.

[7] Knuth, D. Literate Programming. CSLI Lecture Notes, no. 27. CSLI, Stanford, 1992.

[8] Shieber, S., Schabes, Y., and Pereira, F. Principles and implementation of deductive parsing. Journal of Logic Programming 24 (1995), 3-36.
