CYK- and Earley's Approach to Parsing


CYK- and Earley's Approach to Parsing

Anton Nijholt and Rieks op den Akker
Faculty of Computer Science, University of Twente
P.O. Box 217, 7500 AE Enschede, The Netherlands

1. Introduction

In this paper we discuss versions of the Cocke-Younger-Kasami (CYK) parsing algorithm and Earley's parsing algorithm. In the early 1960s parsing theory flourished under the umbrella of machine translation research. Backtrack algorithms were used for parsing natural languages. Due to the exponential blow-up no widespread application of these algorithms could be expected. It was even thought that a polynomial parsing algorithm for context-free languages was not possible. Greibach [1979] presents the following view on the situation in the early 1960s:

    "We were very much aware of the problem of exponential blow-up in the number of iterations (or paths), though we felt that this did not happen in real natural languages; I do not think we suspected that a polynomial parsing algorithm was possible."

However, at that time a method developed by John Cocke at Rand Corporation for parsing a context-free grammar for English was already available. The method required the grammar to be in Chomsky Normal Form (CNF). An analysis of the algorithm showing that it required O(n^3) steps was presented some years later by D.H. Younger. An algorithm similar to that of Cocke was developed by T. Kasami and by some others (see the bibliographic notes at the end of this paper). Presently it has become customary to use the name Cocke-Younger-Kasami algorithm (or some other permutation of these three names).

Parsing methods for context-free languages require a bookkeeping structure which at each moment during parsing provides the information about the possible next actions of the parsing algorithm. The structure should give some account of previous actions in order to prevent unnecessary steps. In computational linguistics a simple and flexible bookkeeping structure called a chart was introduced in the early 1970s. A chart consists of two lists of items, both of which are continually updated. One list contains the pending items, i.e. items that still have to be tried in order to recognize or parse a sentence. Because the items on this list will in general be processed in a particular order, the list is often called an agenda. The second list, called the work area, contains the employed items, i.e. things that are in the process of being tried. The exact form of the lists, their contents and their use differ, and they can have different graphical representations.

In the following section CYK versions of charts will be discussed. Then, in section 3, we present a general recognition scheme of which Earley's recognizer and the CYK recognizer are particular instances. In section 4 we discuss Earley's method and in section 5 a generalized CYK recognition method for general context-free grammars. Finally, in section 6 another variant of Earley's method is presented.

2. Cocke-Younger-Kasami-like parsing

The usual condition of CYK parsing is that the input grammar is in CNF; hence, each rule is of the form A → BC or A → a. It is possible to relax this condition. With slight modifications of the algorithms we can allow rules of the form A → XY and A → X, where X and Y are in V. In section 5 we give a more general version of the CYK approach.

CYK Algorithm (Basic Chart Version)

The first algorithm we discuss uses a basic chart with items of the form [i, A, j], where A is a nonterminal symbol and i and j are position markers such that if a_1...a_n is the input string which has to be parsed, then 0 ≤ i < j ≤ n. Number i is the start (or left) and number j is the end (or right) position marker of the item. For any input string x = a_1...a_n a basic chart will be constructed. We start with an empty agenda and an empty list of employed items.

(1) Initially, items [i-1, A, i], 1 ≤ i ≤ n, with A → a_i in P are added to the agenda. Now keep repeating the following step until the agenda is empty.

(2) Remove any member of the agenda and put it in the work area. Assume the chosen item is of the form [j, Y, k]. For any [i, X, j] in the work area such that A → XY is in P, add [i, A, k] (provided it has not been added before) to the agenda. Similarly, for any [k, Z, l] in the work area such that A → YZ is in P, add [j, A, l] (provided it has not been added before) to the agenda.

If [0, S, n] appears in the work area then the input string has been recognized. Obviously, for any item [i, A, j] which is introduced as a pending item we have A ⇒* a_{i+1}...a_j.

Notice that in the algorithm we have not prescribed the order in which items are removed from the agenda. This gives freedom to choose a particular strategy. For example, always take the oldest item (first-in, first-out strategy) or always take the most recent item (last-in, first-out strategy) from the agenda. Another strategy may give priority to items [i, X, j] such that for any other item [k, Y, l] on the pending list we have i ≤ k. Similarly, a strategy may give priority to items [i, X, j] such that for any other item [k, Y, l] on the agenda we have l ≤ j. It is not difficult to see that with a properly chosen initialization and strategy we can give the algorithm a left-to-right or a right-to-left nature. It is also possible to give preference to items which contain interesting nonterminals (comparable with the interesting corner chart parsing method discussed in Allport [1988]) or to use the chart in such a way that so-called island and bi-directional parsing methods can be formulated (cf. Stock et al. [1988]).

As an example of the basic chart algorithm consider our usual example grammar with rules

    S → NP VP       NP → *n          PP → *prep NP
    S → S PP        NP → *det *n     VP → *v NP
                    NP → NP PP

and with input string "the man saw Mary". In the algorithm the symbols *det, *n, *prep and *v will be viewed as nonterminal symbols. The rules of the form *det → a, *det → the, *n → man, etc. are not shown. Because of rule NP → *n the grammar is not in CNF. Step (1) of the algorithm is therefore modified such that items [i-1, A, i] for 1 ≤ i ≤ n and A ⇒* a_i are added to the agenda. See Fig. 1 for a further explanation of this example.

As soon as a new item is constructed, information about its constituents disappears. This information can be carried along with the items.
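To make steps (1) and (2) concrete, the following is a minimal Python sketch of the basic chart recognizer (ours, not from the paper). The grammar encoding is an illustrative assumption: a dict unary mapping a terminal to the nonterminals that derive it (directly, or via chain rules for the modified step (1)), and a dict binary mapping a pair (X, Y) to the set of A with A → XY.

    from collections import deque

    def basic_chart_recognize(words, unary, binary, start="S"):
        n = len(words)
        # Step (1): items [i-1, A, i] for every A deriving a_i.
        agenda = deque((i, A, i + 1) for i in range(n)
                       for A in unary.get(words[i], ()))
        seen = set(agenda)          # every item ever placed on the agenda
        work_area = set()

        def propose(item):          # "provided it has not been added before"
            if item not in seen:
                seen.add(item)
                agenda.append(item)

        # Step (2): repeatedly move an item to the work area and combine it.
        while agenda:
            (j, Y, k) = agenda.popleft()       # oldest first (FIFO strategy)
            work_area.add((j, Y, k))
            for (i, X, j2) in list(work_area):
                if j2 == j:                    # [i,X,j] + [j,Y,k] -> [i,A,k]
                    for A in binary.get((X, Y), ()):
                        propose((i, A, k))
            for (k2, Z, l) in list(work_area):
                if k2 == k:                    # [j,Y,k] + [k,Z,l] -> [j,A,l]
                    for A in binary.get((Y, Z), ()):
                        propose((j, A, l))
        return (0, start, n) in work_area

For the example grammar one would set, e.g., unary["man"] = {"*n", "NP"} (reflecting NP ⇒* man) and binary[("NP", "VP")] = {"S"}; replacing popleft by pop switches from the first-in first-out to the last-in first-out strategy mentioned above.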

    AGENDA                                    WORK AREA

    [0,*det,1] [1,*n,2] [1,NP,2]              (empty)
    [2,*v,3] [3,*n,4] [3,NP,4]

    [1,NP,2] [2,*v,3] [3,*n,4]                [0,*det,1] [1,*n,2]
    [3,NP,4] [0,NP,2]

    [0,NP,2] [2,VP,4]                         [0,*det,1] [1,*n,2] [1,NP,2]
                                              [2,*v,3] [3,*n,4] [3,NP,4]

    [1,S,4] [0,S,4]                           [0,*det,1] [1,*n,2] [1,NP,2]
                                              [2,*v,3] [3,*n,4] [3,NP,4]
                                              [0,NP,2] [2,VP,4]

    (empty)                                   [0,*det,1] [1,*n,2] [1,NP,2]
                                              [2,*v,3] [3,*n,4] [3,NP,4]
                                              [0,NP,2] [2,VP,4] [1,S,4] [0,S,4]

Initially, [0,*det,1], [1,*n,2], [1,NP,2], [2,*v,3], [3,*n,4] and [3,NP,4] are put on the agenda (first row of this figure) in this order. The strategy we choose consists of always moving the oldest item first from the agenda to the work area. Hence, [0,*det,1] is removed and appears in the work area. It cannot be combined with another item in the work area; therefore the next item, [1,*n,2], is removed from the agenda and put in the work area. It is possible to combine this item with a previously entered item in the work area. The result, [0,NP,2], is entered on the agenda. The present situation is shown in the second row. Item [1,NP,2] is moved from the agenda and put in the work area. This does not lead to further changes. The same holds for the next two items, [2,*v,3] and [3,*n,4], that move from the agenda to the work area. With item [3,NP,4] entered in the work area we can construct item [2,VP,4] and enter it on the agenda. We now have the situation depicted in the third row. Item [0,NP,2] is now added to the work area, but it does not result in changes. Next, item [2,VP,4] is added to the work area and now items [1,S,4] and [0,S,4] are obtained and entered on the agenda. We now have the situation of the fourth row. In the next steps we first remove [1,S,4] from the agenda, enter it in the work area and conclude that no combinations are possible; then we remove [0,S,4] from the agenda, enter it in the work area and conclude that we are finished.

Fig. 1 Basic chart parsing of "the man saw Mary".

For the example it means that instead of the items in the left column below we have those of the right column.

    [0,NP,2]      [0, NP(*det *n), 2]
    [2,VP,4]      [2, VP(*v NP(*n)), 4]
    [1,S,4]       [1, S(NP(*n) VP(*v NP(*n))), 4]
    [0,S,4]       [0, S(NP(*det *n) VP(*v NP(*n))), 4]

In this way the parse trees become part of the items and there is no need to construct them during a second pass through the items. Clearly, in this way partial parse trees will be constructed that will not become part of a full parse tree of the sentence. Worse, we can have an enormous

blow-up in the number of items that are distinguished this way. Indeed, for long sentences the unbounded ambiguity of our example grammar will cause a Catalan explosion. In general the number of items that are syntactically constructed can be reduced considerably if, before entering these extended items on a chart (see also the next versions of the CYK algorithm), they are subjected to a semantic analysis.

The name "chart" is derived from the graphical representation which is usually associated with this method. This representation uses vertices to depict the positions between the words of a sentence and edges to depict the grammar symbols. With our example we can start with the simple chart of Fig. 2.a.

    0 --the-- 1 --man-- 2 --saw-- 3 --Mary-- 4

Fig. 2.a The initial CYK-like chart.

By spanning edges from vertex to vertex according to the grammar rules and working in a bottom-up way, we end up with the chart of Fig. 2.b. Working in a bottom-up way means that an edge with a particular label will be spanned only after the edges for its constituents have been spanned. Apart from this bottom-up order the basic chart version does not enforce a particular order of spanning edges. Any strategy can be chosen as long as we confine ourselves to the trammels of step (1) and step (2) of the algorithm. In the literature on computational linguistics chart parsing is usually associated with a more elegant strategy of combining items or constructing spanning edges. This strategy will be discussed in a twin paper on Earley-like parsing algorithms. It should be noted that although in the literature chart parsing is often explained with the help of these graphical representations, any practical implementation of a chart parser (a "spanning jenny") requires a program which manipulates items according to one of the methods presented in this paper.

[Fig. 2.b The complete CYK-like chart for "the man saw Mary": the chart of Fig. 2.a with spanning edges *det (0-1), *n (1-2), NP (1-2), *v (2-3), *n (3-4), NP (3-4), NP (0-2), VP (2-4), S (1-4) and S (0-4).]

We now give a global analysis of the (theoretical) time complexity of the algorithm. On the lists O(n^2) items are introduced. Each item that is constructed with step (1) or (2) of the algorithm is entered on and removed from the agenda. For each of the O(n^2) items that is removed from the agenda and entered on the list of employed items the following is done. The list of O(n^2) employed items is checked to see whether a combination is possible. There are at most O(n) items which allow a combination. For each of these possible combinations we have to check O(n^2) pending items to see whether it has been introduced before. Hence, for each of the O(n^2) items that are introduced O(n^3) steps have to be taken. Therefore the time complexity of the algorithm is O(n^5).

The algorithm recognizes. More has to be done if a parse tree is requested. Starting with an item [0, S, n] we can in a recursive top-down manner try to find items whose nonterminals can be concatenated to right-hand sides of productions. For each of the O(n) nonterminals which are used in a parse tree for a sentence of a CNF grammar it will be necessary to do O(n^3) comparisons. Therefore a parse tree of a recognized sentence can be obtained in O(n^4) time.

The algorithm which we introduced is simply adaptable to grammars with right-hand sides of productions that have a length greater than two. However, since the number of possible combinations corresponding with a right-hand side increases with the length of the right-hand sides of the productions, this has negative consequences for the time complexity of the algorithm. If p is the length of the longest right-hand side then recognition requires O(n^{p+3}) comparisons.

We are not particularly satisfied with the time bounds obtained here. Nevertheless, despite the naivety of the algorithm its bounds are polynomial. Improvements of the algorithm can be obtained by giving more structure to the basic chart. This will, however, decrease the flexibility of the method. It is rather unsatisfactory that each time a new item is created the algorithm cycles through all its items to see whether it is already present. Any programmer who is asked to implement this algorithm would try (and succeed) to give more structure to the chart in order to avoid this. Moreover, a more refined structure can make it unnecessary to consider all items when combinations corresponding to right-hand sides are tried. For example, when an item of the form [j, Y, k] is removed from the agenda, then only items with a right position marker j and items with a left position marker k have to be considered for combination. Rather than having O(n^3) comparisons for each item it could be sufficient to have O(n^2) comparisons for each item, giving the algorithm a time complexity of O(n^4) for grammars in CNF. Below such a version of the CYK algorithm is given. We will return to the algorithm given here when we discuss parallel implementations of the CYK algorithms.

CYK Algorithm (String Chart Version)

Let a_1...a_n be a string of terminal symbols. The string will be read from left to right. For each symbol a_j an item set I_j will be constructed. Each item set consists of elements of the form [i, A], where A is a nonterminal symbol and i, with 0 ≤ i ≤ n-1, is a (start) position marker. For any input string x = a_1...a_n item sets I_1,..., I_n will be constructed.

(1) I_1 = { [0, A] | A → a_1 is in P }.

(2) Having constructed I_1,..., I_{j-1}, construct I_j in the following way. Initially, I_j = ∅.

(2.1) Add [j-1, A] to I_j for any production A → a_j in P. Now perform step (2.2) until no new items will be added to I_j.

(2.2) Let [k, C] be an item in I_j. For each item of the form [i, B] in I_k such that there is a production A → BC in P, add [i, A] (provided it has not been added before) to I_j.

It is not difficult to see that the algorithm satisfies the following property: [i, A] ∈ I_j if and only if A ⇒* a_{i+1}...a_j. A string will be recognized as soon as [0, S] appears in the item set I_n.

As an example consider the sentence "John saw Mary with Linda" generated by our example grammar. We again allow a slight modification of our algorithm in handling the rule NP → *n. In Fig. 3 the five constructed item sets are displayed. Notice that each of the five item sets in Fig. 3 is in fact a basic chart, i.e., a data structure consisting of an agenda, a work area (the employed items) and rules for constructing items and moving items from the agenda to the work area. Except for item [3, PP], each application of step (2.2) in this example yields one element only, which is put in the item set and immediately employed in the next step.
In general, looking for a combination may produce several new items, and items in an item set may therefore have to wait on an internal agenda before they are employed to construct new items. Therefore we can say that the algorithm uses a string of basic charts, which explains the name we have chosen for this CYK version.
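The following Python sketch (ours; the unary/binary grammar encoding is the same illustrative assumption as before) shows the string chart version, with the internal agenda made explicit.

    def string_chart_recognize(words, unary, binary, start="S"):
        n = len(words)
        I = [set() for _ in range(n + 1)]          # I[1] .. I[n] are used
        for j in range(1, n + 1):
            # step (2.1): items [j-1, A] for every A deriving a_j
            I[j] = {(j - 1, A) for A in unary.get(words[j - 1], ())}
            agenda = list(I[j])                    # the internal agenda of I_j
            # step (2.2): combine with earlier item sets until nothing is new
            while agenda:
                (k, C) = agenda.pop()
                if k == 0:
                    continue                       # no item set ends at position 0
                for (i, B) in list(I[k]):
                    for A in binary.get((B, C), ()):
                        if (i, A) not in I[j]:
                            I[j].add((i, A))
                            agenda.append((i, A))
        return (0, start) in I[n]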

    I_1        I_2       I_3       I_4          I_5

    [0,*n]     [1,*v]    [2,*n]    [3,*prep]    [4,*n]
    [0,NP]               [2,NP]                 [4,NP]
                         [1,VP]                 [3,PP]
                         [0,S]                  [2,NP]
                                                [0,S]
                                                [1,VP]

Fig. 3 String chart parsing of "John saw Mary with Linda".

In a practical implementation it is possible to attach a tag to each item to mark whether it is waiting or not, or to put waiting items indeed on a separate list.

In the string chart version we have chosen a left-to-right order for constructing item sets. We could as well have chosen a right-to-left order corresponding with a right-to-left reading of the sentence. The reader will have no difficulties in writing down this string version of the CYK algorithm. With the sentence "John saw Mary with Linda" generated by our example grammar this version should produce the five item sets of Fig. 4 (first I_4, then I_3, etc.).

    I_0        I_1       I_2       I_3          I_4

    [*n,1]     [*v,2]    [*n,3]    [*prep,4]    [*n,5]
    [NP,1]     [VP,3]    [NP,3]    [PP,5]       [NP,5]
    [S,3]      [VP,5]    [NP,5]
    [S,5]

Fig. 4 Right-to-left string chart parsing of "John saw Mary with Linda".

With the graphical representation usually associated with chart parsing, the (left-to-right) string chart version of the CYK algorithm enforces an order on the construction of the spanning edges. In each item set I_j we collect all the items whose associated edges terminate in vertex j. Hence, a corresponding construction of the chart of Fig. 5 would start at vertex 1, where all edges which may come from the left and terminate in 1 are drawn; then we go to vertex 2 and draw all the edges coming from the left and terminating in 2, etc.

[Fig. 5 The CYK chart for "John saw Mary with Linda": vertices 0-5 connected by the words John, saw, Mary, with, Linda, with spanning edges *n (0-1), NP (0-1), *v (1-2), *n (2-3), NP (2-3), VP (1-3), S (0-3), *prep (3-4), *n (4-5), NP (4-5), PP (3-5), NP (2-5), VP (1-5) and S (0-5).]

As we noted above, whenever with step (2.2) of the algorithm an item [i, A] in item set I_j leads to the introduction of more than one item in I_j, then there is no particular order prescribed for processing them.

For the graphical representation of Fig. 5 this means that, when we draw an edge from vertex i to vertex j with label A, there can be more than one edge terminating in i with which new spanning edges terminating in j can be drawn, and the order in which these edges will be drawn is not yet prescribed.

Before discussing the time complexity of the algorithm we show some different ways of constructing the item sets. The string chart version discussed above can be said to be, in a modest way, item driven. That is, in order to add a new item to an item set we consider an item which is already there and then we visit a previously constructed item set to look for a possible combination of items. The position marker of an item determines which of the previously constructed sets has to be visited. Without loss of correctness the string chart version can be presented such that in order to construct a new item the previously constructed item sets are visited in a strict right-to-left order. This will reduce the (slightly) available freedom in the algorithm to choose a particular item in the set that is under construction and see whether it gives rise to the introduction of a new item. Below this revised string chart version of the CYK algorithm is presented.

For any input string x = a_1...a_n item sets I_1,..., I_n will be constructed.

(1) I_1 = { [0, A] | A → a_1 is in P }.

(2) Having constructed I_1,..., I_{j-1}, construct I_j in the following way. Initially, I_j = ∅.

(2.1) Add [j-1, A] to I_j for any production A → a_j in P. Now perform step (2.2), first for k = j-1 and then for decreasing k until this step has been performed for k = 1.

(2.2) Let [k, C] be an item in I_j. For each item of the form [i, B] in I_k such that there is a production A → BC in P, add [i, A] (provided it has not been added before) to I_j.

A further reduction of the freedom which is still available in the algorithm is obtained if we demand that in step (2.2) the items of the form [i, B] in I_k are tried in a particular order of i. In a decreasing order first the items with i = k-1 are tried, then those with i = k-2, and so on. A second alternative of the string chart version is obtained if we demand that in I_j first all items with position marker j-1, then all items with position marker j-2, etc., are computed. This leads to the following algorithm.

For any input string x = a_1...a_n item sets I_1,..., I_n will be constructed.

(1) I_1 = { [0, A] | A → a_1 is in P }.

(2) Having constructed I_1,..., I_{j-1}, construct I_j in the following way. Initially, I_j = ∅.

(2.1) Add [j-1, A] to I_j for any production A → a_j in P. Now perform step (2.2), first for i = j-2 and then for decreasing i until this step has been performed for i = 0.

(2.2) Add [i, A] to I_j if, for some k such that i < k < j, [i, B] ∈ I_k, [k, C] ∈ I_j, A → BC is a production rule, and [i, A] is not already present in I_j.

We will return to this version when we introduce a tabular chart version of the CYK algorithm.

We turn to a global analysis of the time complexity of the algorithm. We confine ourselves to the first algorithm that has been presented in this subsection. The essential difference between this CYK version and the basic chart version is of course that rather than checking an unstructured set of O(n^2) items we can now always restrict ourselves to sets of O(n) items. That is, each item set has a number of items proportional to n.
In order to add an item to an item set I_j it is necessary to consider O(n) items in a previously constructed item set. Each previously constructed item set will be asked to contribute to I_j a bounded number of times (always less than or equal to the number of nonterminals of the grammar).

Hence, the total number of combinations that are tried is O(n^2). For each of these attempts to construct a new item we have to check the O(n) items that are available in I_j in order to see whether it has been added before. Hence, constructing an item set requires O(n^3) steps and recognition of the input takes O(n^4) steps. Parse trees of a recognized sentence can be constructed by starting with item [0, S] in I_n and checking the item sets for its constituents. This process can be repeated recursively for the constituents that are found. In this way each parse tree can be constructed in a top-down manner in O(n^3) time. As demonstrated in the example, the CYK version discussed here is simply adaptable to grammars that do not fully satisfy the CNF condition or to grammars whose productions have right-hand sides with lengths greater than two.

Although the time bounds for this algorithm are better than for the previous CYK version, we still can do better by further refining the data structure that is used. This can be done by introducing some list processing techniques in the algorithm, but it can be shown more clearly by introducing a tabular chart parsing version of the CYK method. This will be done in the next subsection. We will return to the string chart version when we discuss parallel implementations of the CYK algorithm.

CYK Algorithm (Tabular Chart Version)

It is usual to present the CYK algorithm in the following form (Graham and Harrison [1976], Harrison [1978]). For any string x = a_1 a_2...a_n to be parsed an upper-triangular (n+1) × (n+1) recognition table T is constructed. Each table entry t_{i,j} with i < j will contain a subset of N (the set of nonterminal symbols) such that A ∈ t_{i,j} if and only if A ⇒* a_{i+1}...a_j. Hence, in this version the nonterminals are the items and each table entry is an item set. String x belongs to L(G) if and only if S is in t_{0,n} when the construction of the table is completed. Fig. 6 may be helpful in understanding the algorithm.

For any input string x = a_1...a_n an upper-triangular (n+1) × (n+1) table T will be constructed. Initially, table entries t_{i,j} with i < j are empty. Assume that the input string, if desired terminated with an endmarker, is available on the matrix diagonal (cf. Fig. 6).

(1) Compute t_{i,i+1}, as i ranges from 0 to n-1, by placing A in t_{i,i+1} exactly when there is a production A → a_{i+1} in P.

(2) In order to compute t_{i,j}, j-i > 1, assume that all entries t_{k,l} with l ≤ j, k ≥ i and (k,l) ≠ (i,j) have been computed. Add A to t_{i,j} if, for some k such that i < k < j, B ∈ t_{i,k}, C ∈ t_{k,j}, A → BC is a production rule, and A is not already present in t_{i,j}.

Unlike the previous versions of the CYK algorithm and unlike an algorithm of Yonezawa and Ohsawa [1988], which needed explicit representations of positional information, here this information is available in the indices of the table entries. For example, t_{1,4} consists of all nonterminals that generate the substring of the input between positions 1 and 4. Notice that after step (1) the computation of the entries can be done diagonal by diagonal until entry t_{0,n} has been completed. However, the exact order in which the entries are computed is not prescribed. Rather than proceeding diagonal by diagonal it is possible to proceed column by column such that each column is computed from the bottom to the top. For each entry of a diagonal only entries of preceding diagonals are used to compute its value.
More specifically, in order to compute whether a nonterminal should be included in an entry t_{i,j} it is necessary to compare t_{i,k} and t_{k,j} for each k between i and j. The order in which this may be done is, provided these entries are all available, arbitrary.
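A Python sketch of the tabular chart version (ours, with the same illustrative grammar encoding as in the earlier sketches) makes the systematic, diagonal-by-diagonal control structure explicit.

    def cyk_table(words, unary, binary):
        n = len(words)
        t = {(i, j): set() for i in range(n + 1) for j in range(i + 1, n + 1)}
        for i in range(n):                        # step (1): the first diagonal
            t[(i, i + 1)] = set(unary.get(words[i], ()))
        for d in range(2, n + 1):                 # remaining diagonals, d = j - i
            for i in range(n - d + 1):
                j = i + d
                for k in range(i + 1, j):         # step (2): compare t[i,k], t[k,j]
                    for B in t[(i, k)]:
                        for C in t[(k, j)]:
                            t[(i, j)] |= binary.get((B, C), set())
        return t

The input is recognized if and only if the start symbol is in cyk_table(words, unary, binary)[(0, n)].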

    a_1   0,1   0,2   0,3   0,4   0,5
          a_2   1,2   1,3   1,4   1,5
                a_3   2,3   2,4   2,5
                      a_4   3,4   3,5
                            a_5   4,5
                                  $

Fig. 6 The upper-triangular CYK-table.

Due to the systematic construction of each entry the need for an internal agenda has gone. The item driven aspect from the previous versions has disappeared completely and what remains is a control structure in which even empty item sets have to be visited. In Fig. 7 we have displayed the CYK table for example sentence "John saw Mary with Linda" produced by our running example grammar.

    John   *n,NP   -        S        -        S
           saw     *v       VP       -        VP
                   Mary     *n,NP    -        NP
                            with     *prep    PP
                                     Linda    *n,NP

Fig. 7 Tabular chart parsing of "John saw Mary with Linda" (empty entries are marked "-").

We leave it to the reader to provide a tabular chart version which computes the entries column by column (each column bottom-up) from right to left. Since the tabular CYK chart is a direct representation of the parse tree(s) for a sentence, the strictly upper-triangular tables produced by this right-to-left version do not differ from those produced by the left-to-right version. Also left to the reader is the construction of the graphical representation according to the rules of this tabular chart version. For that construction it is necessary to distinguish between the column by column and the diagonal by diagonal computation of the entries.

In order to discuss the time complexity of this version, observe that the number of elements in each entry is limited by the number of nonterminal symbols in the grammar and not by the length of the sentence. Therefore, the amount of storage that is required by this method is only proportional to n^2. Notice that for each of the O(n^2) entries O(n) comparisons have to be made; therefore the number of elementary operations is proportional to n^3. An implementation can be given which accesses the table entries in such a way that the time bound of the algorithm is indeed O(n^3). Starting at the upper right entry we can descend to the first diagonal and construct a parse tree in O(n^2) time. This can be reduced to linear time if during the construction of the table with each nonterminal in t_{i,j} pointers are added to the nonterminals in t_{i,k} and t_{k,j} that caused it to be placed in t_{i,j}. Provided that the grammar has a bounded ambiguity this can be done for each possible realization of a nonterminal in an entry without increasing the O(n^3) time bound for recognition.
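The pointer technique just described can be sketched as follows (our illustration, not the paper's implementation): during table construction we record, for each nonterminal placed in t_{i,j}, one pair of constituents that caused it, and afterwards we descend along these pointers.

    def cyk_with_pointers(words, unary, binary):
        n = len(words)
        back = {(i, j): {} for i in range(n + 1) for j in range(i + 1, n + 1)}
        for i in range(n):
            for A in unary.get(words[i], ()):
                back[(i, i + 1)][A] = None                 # leaf: A -> a_{i+1}
        for d in range(2, n + 1):
            for i in range(n - d + 1):
                j = i + d
                for k in range(i + 1, j):
                    for B in back[(i, k)]:
                        for C in back[(k, j)]:
                            for A in binary.get((B, C), ()):
                                # keep one realization per nonterminal
                                back[(i, j)].setdefault(A, ((i, k, B), (k, j, C)))
        return back

    def extract_tree(back, words, i, j, A):
        ptr = back[(i, j)][A]
        if ptr is None:
            return (A, words[i])
        left, right = ptr
        return (A, extract_tree(back, words, *left),
                   extract_tree(back, words, *right))

Since every pointer is followed exactly once, extract_tree indeed delivers a parse tree in time linear in the size of the tree.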

3. A recognition scheme

Given a context-free grammar G = (N, Σ, P, S) and a sentence w ∈ Σ*, the recognition problem for G is the problem of evaluating the predicate RECOGNIZE(G, w), which is true if S ⇒*_G w, and false otherwise. A recognizer for a CFG G is a predicate REC_G such that REC_G(w) is true if w is derivable from S using productions in G, and false otherwise. A parser is an implementation of a parse function, i.e. a function that not only recognizes a sentence but, in addition, delivers a representation of the (or all) derivations of the sentence.

Let G be a context-free grammar and w = a_1...a_n a sentence to be recognized. Our algorithm works according to a recognition scheme of which Earley's algorithm is also a particular instance. For every symbol a_j of the input sentence we construct a set of parse objects I_j. (The parse objects are sets of items in Earley's algorithm. There, an item has the form [A → α•β, i], where A → αβ is a production and i is some number with 0 ≤ i ≤ j.) The initial step of the algorithm is the construction of the set of parse objects I_0. The main part is the construction of the set I_j from the previously constructed sets I_0, I_1,..., I_{j-1}. The final step of the algorithm checks whether the finally constructed set I_n satisfies a particular condition. (In Earley's algorithm, the final step tests whether an item [S → α•, 0] is a member of I_n.) If this is true, the sentence is accepted. The general scheme is shown in Figure 8.

    begin
        Initialize;                    { I_0 is computed }
        for j := 1 to n do Compute(I_j) end;
                                       { I_0,..., I_n are computed }
        Answer(I_n)
    end

Fig. 8 The recognition scheme

The version of Earley's algorithm that works according to this scheme is sometimes called the string version. There is also a tabular version, which uses a recognition matrix. (The tabular and the string version of Earley's method are explained in detail in section 4.) The tabular version of our method can best be seen as a variant of the CYK method. Let G = (N, Σ, P, S) be a CFG in Chomsky normal form. In the CYK method, nonterminal A ∈ N is an element of the entry T_{i,j} of the recognition matrix T (with 0 ≤ i, j ≤ n) if and only if A ⇒* a_{i+1}...a_j. Hence, the sentence a_1...a_n is accepted if S ∈ T_{0,n}, and rejected otherwise. The well-known CYK recognition algorithm for the construction of the recognition matrix T is shown in Figure 9. The operator * in this algorithm is defined as follows:

    X * Y = { A | A → BC ∈ P, B ∈ X and C ∈ Y }

    begin CYK-recognizer;
        for i := 0 to n-1 do
            T_{i,i+1} := { A | A → a_{i+1} ∈ P }
        end;
        for d := 2 to n do
            for i := 0 to n-d do
                j := i+d;
                T_{i,j} := ∪_{k=i+1}^{j-1} ( T_{i,k} * T_{k,j} )
            end
        end;
        if S ∈ T_{0,n} then accept else reject end
    end CYK-recognizer.

Fig. 9 The CYK-recognizer

4. Earley's recognition method

Earley's algorithm can be applied to arbitrary CFGs. An arbitrary CFG may contain ε-rules and may be cyclic. A CFG G = (N, Σ, P, S) is cyclic if for some nonterminal A ∈ N: A ⇒+ A. Using Earley's method we obtain a result, the parse objects, from which we can derive the parse tree of the input sentence. In this method the parse objects are called item sets. An item [A → X_1...X_k • X_{k+1}...X_m, i] is a tuple consisting of a production of G with a dot (that marks a position) in the right-hand side (a dotted rule), and a number. Item [A → X_1...X_k • X_{k+1}...X_m, i] is an item for a_1...a_n if A → X_1...X_m ∈ P, 0 ≤ k ≤ m and 0 ≤ i ≤ n.

Suppose that we want to recognize the sentence a_1 a_2...a_n. We define sets I_0, I_1,..., I_n of items for a_1...a_n such that they have the following characteristic property:

    [A → α•β, i] ∈ I_j if and only if A → αβ ∈ P, S ⇒* a_1...a_i Aδ, and α ⇒* a_{i+1}...a_j.

From this property it follows immediately that [S → α•, 0] ∈ I_n if and only if there is a production S → α in P and α ⇒* a_1...a_n; hence, if and only if a_1...a_n is in L(G), the language generated by G. Intuitively, [A → α•β, i] ∈ I_j means that the production A → αβ can be applied in the left-most derivation, such that the prefix a_1...a_i has already been recognized, and that the substring a_{i+1}...a_j can be generated from α. The part β behind the dot in this item is only predicted. Earley's algorithm is a method to compute the item sets I_j such that the characteristic property holds. In this method all possible left-most derivations of the prefixes of a sentence are put together in item sets. Since left-most derivations are considered simultaneously (i.e. in a breadth-first way), the problem with left-recursive grammars that we encounter if we first completely work out one possible choice in a left-most derivation (as in a depth-first method) is avoided.

We present the original (string) version of Earley's algorithm for constructing the item sets.

Earley's Recognition Algorithm

Construct I_0, following steps (1), (2) and (3) below.

(1) I_0 := ∅; if S → α is in P, then add [S → •α, 0] to I_0. Apply steps (2) and (3) as long as new items can be added to I_0.

(2) If [B → γ•, 0] is in I_0, then add item [A → αB•β, 0] to I_0 for all items of the form [A → α•Bβ, 0] ∈ I_0.

(3) If [A → α•Bβ, 0] ∈ I_0, then add, for all productions B → γ in P, [B → •γ, 0] to I_0, if this item is not already in the set.

Construct item set I_j from the already constructed sets I_0,..., I_{j-1}.

(4) SCANNER: If [B → α•aβ, i] ∈ I_{j-1} and a = a_j, then add [B → αa•β, i] to I_j. Apply steps (5) and (6) as long as new items can be added to I_j.

(5) COMPLETER: For all items in I_j of the form [A → α•, k] for which an item in I_k of the form [B → δ•Aβ, i] exists, add item [B → δA•β, i] to I_j.

(6) PREDICTOR: If [A → α•Bβ, i] ∈ I_j, then add, for each production B → γ in P, item [B → •γ, j] to I_j (if it has not already been added to this set).

The names SCANNER, COMPLETER and PREDICTOR denote what happens in each step. The scanner operation reads a new symbol from the sentence and introduces new items from those that expected this symbol: the dot is placed just behind the terminal symbol in the new items. The completer operation is applicable if a production has been completely recognized: the dot is placed at the end in the item. The number k in the item [A → γ•, k] ∈ I_j denotes that item [A → •γ, k] was introduced in item set I_k; hence, in this case, γ ⇒* a_{k+1}...a_j. The predictor operation introduces (predicts) productions that are possibly applied in the left-most derivations. The construction of the item sets for a particular CFG and an input sentence is shown in the following example.

Example 1. Let the CFG G_1 be given by the following productions.

    S → aBD     B → HCa | ε     H → ε     D → bb     C → F     F → B

If we apply Earley's algorithm to the sentence aaabb, we obtain the sets of items as shown in Figure 10.

In his thesis, Earley offers the following suggestions for an efficient implementation of his method.

I.   For the predictor operation it is efficient to have for each nonterminal A ∈ N a list containing the A-productions (i.e. productions of the form A → α).

II.  The items of a set are placed in a pointer list in order to have an easy and efficient implementation for the scanner operation.

III. At the time item set I_j is constructed, make a vector V_j of pointers, in which V_j(i) points at the list of items of which the second component is i. Then, if we want to check whether an item [A → α•β, i] has already been added to I_j, we just have to check V_j(i).

IV.  For the completer operation, we keep for each set I_j and each nonterminal B a list of those items that have the form [A → α•Bβ, k].
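The four suggestions above concern constant factors; the algorithm itself can be sketched quite compactly. The following Python version (ours) encodes an item [A → α•β, i] as a tuple (A, rhs, dot, i), with the grammar given as a dict from each nonterminal to a list of right-hand sides (tuples over terminals and nonterminals); all encodings are illustrative assumptions. Steps (5) and (6) are simply repeated until nothing new can be added, exactly as prescribed.

    def earley_recognize(words, grammar, start="S"):
        n = len(words)
        nonterms = set(grammar)
        I = [set() for _ in range(n + 1)]
        I[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}   # step (1)
        for j in range(n + 1):
            if j > 0:                                           # (4) SCANNER
                I[j] = {(A, rhs, d + 1, i) for (A, rhs, d, i) in I[j - 1]
                        if d < len(rhs) and rhs[d] == words[j - 1]}
            changed = True
            while changed:                  # apply (5) and (6) to a fixpoint
                changed = False
                for (A, rhs, d, i) in list(I[j]):
                    if d == len(rhs):                           # (5) COMPLETER
                        new = {(B, r2, d2 + 1, k)
                               for (B, r2, d2, k) in I[i]
                               if d2 < len(r2) and r2[d2] == A}
                    elif rhs[d] in nonterms:                    # (6) PREDICTOR
                        new = {(rhs[d], r2, 0, j)
                               for r2 in grammar.get(rhs[d], ())}
                    else:
                        new = set()
                    if not new <= I[j]:
                        I[j] |= new
                        changed = True
        return any(A == start and d == len(rhs) and i == 0
                   for (A, rhs, d, i) in I[n])

With grammar = {"S": [("a","B","D")], "B": [("H","C","a"), ()], "H": [()], "D": [("b","b")], "C": [("F",)], "F": [("B",)]} this accepts the sentence aaabb of Example 1.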

    I_0:  [S → •aBD, 0]

    I_1:  SCANNER:   [S → a•BD, 0]
          PREDICTOR: [B → •HCa, 1], [B → •, 1], [H → •, 1]
          COMPLETER: [B → H•Ca, 1]
          PREDICTOR: [C → •F, 1], [F → •B, 1]
          COMPLETER: [S → aB•D, 0], [F → B•, 1], [C → F•, 1], [B → HC•a, 1]
          PREDICTOR: [D → •bb, 1]

    I_2:  SCANNER:   [B → HCa•, 1]
          COMPLETER: [S → aB•D, 0], [F → B•, 1], [C → F•, 1], [B → HC•a, 1]
          PREDICTOR: [D → •bb, 2]

    I_3:  SCANNER:   [B → HCa•, 1]
          COMPLETER: [S → aB•D, 0], [F → B•, 1], [C → F•, 1], [B → HC•a, 1]
          PREDICTOR: [D → •bb, 3]

    I_4:  SCANNER:   [D → b•b, 3]

    I_5:  SCANNER:   [D → bb•, 3]
          COMPLETER: [S → aBD•, 0]

Fig. 10 The item sets for Example 1

We now consider the complexity of Earley's algorithm. For arbitrary CFGs, the time complexity of this recognition method is O(n^3), in which n is the length of the input sentence. Hence, there is a constant C, not depending on n, such that for sufficiently long input the time needed for recognizing the input does not exceed C·n^3. The constant C depends on the size of the grammar. T(n), the time needed for an input of length n, is about |P|^3 · m^2 · n^3, in which m is the maximal length of the right-hand side of a production in the grammar. From this estimate we see that, if we apply Earley's method to small sentences (more common in natural languages than in programming languages), we definitely have to take the constant C into account.

For unambiguous grammars, the time complexity is O(n^2), and the same remarks concerning the constant factor hold. Dealing with an unambiguous grammar, it is not necessary to check whether an item [A → α•β, i], in which α ≠ ε, has already been added to the item set I_j. In this case, an item can only be added by one combination of items.

Therefore we gain a factor n in the time complexity if we compare this with that for the general case. For deterministic context-free languages (i.e. the languages generated by LR(1) grammars), the time needed is linear in the length of the input sentence. It may be better, in this case, to use a recognition or parsing method especially appropriate for LR grammars or some subclass of this grammar class.

The amount of space needed for the item sets is O(n^2). This also holds for unambiguous grammars. The number of items in a set I_j is O(j). The completer operation claims most of the space. For each item [A → γ•, i] in I_j this operation can add at most q·i new items, where q only depends on the grammar, not on n and j.

We now present a way to obtain the parse tree, or a representation of this tree, from the item sets constructed during recognition. Let I_0,..., I_n be the item sets constructed in processing the sentence a_1...a_n. The parse algorithm, which delivers a right-most derivation, works correctly for cycle-free CFGs only. If there is no item of the form [S → α•, 0] in I_n, then the sentence is not accepted. If there is such an item in set I_n, then the parse algorithm, shown in Figure 11, is activated. The initial call is Parse([S → α•, 0], n).

    Parse([A → X_1...X_m •, i], j);
    { k and q are local variables }
    begin
        { assume A → X_1...X_m has production number p }
        write p to output;
        k := m; q := j;
        while k > 0 do
            if X_k ∈ Σ then
                k := k-1; q := q-1
            else
                search for an item [X_k → γ•, r] in I_q for which
                    [A → X_1...X_{k-1} • X_k...X_m, i] is in I_r;
                Parse([X_k → γ•, r], q);
                k := k-1; q := r
            end
        end
    end.

Fig. 11 The parse algorithm for Earley's method

A CFG is cyclic if and only if it is infinitely ambiguous, i.e. it generates a sentence with an unbounded number of parse trees. It is not possible to deliver, one by one, all parse trees of an infinitely ambiguous sentence in bounded time. We need some finite representation of this infinite set of trees. Such a representation is a shared forest, a finite graph that represents all possible parse trees of a sentence. We refer to the literature on shared forests for more details. For cycle-free grammars, the parsing algorithm can be augmented in such a way that it delivers all parse trees of a sentence, one by one. Also for CFGs that are not cyclic, the number of parse trees can be quite large. This number may even grow exponentially with the length of the input. This is, for example, the case with the grammar having productions S → SS, S → a. For this grammar, the time complexity of the all-parse parser will, of course, also be exponential.
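A Python rendering of the parse procedure of Fig. 11 (ours), built on the item encoding of the recognizer sketch above, collects the productions of a right-most derivation in a list instead of writing production numbers.

    def parse(I, nonterms, item, j, out):
        (A, rhs, _, i) = item          # item is [A -> X_1...X_m ., i] in I_j
        out.append((A, rhs))           # "write p to output"
        k, q = len(rhs), j
        while k > 0:
            X = rhs[k - 1]
            if X not in nonterms:      # terminal: step one position left
                k, q = k - 1, q - 1
            else:                      # search a completed item for X
                for (B, r2, d2, r) in I[q]:
                    if (B == X and d2 == len(r2)
                            and (A, rhs, k - 1, i) in I[r]):
                        parse(I, nonterms, (B, r2, d2, r), q, out)
                        k, q = k - 1, r
                        break
        return out

The initial call is parse(I, nonterms, final_item, n, []) for a final item (S, α, len(α), 0) found in I[n]; as in the text, termination is only guaranteed for cycle-free grammars.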

We now discuss tabular variants of Earley's method (Earley [1968]). In this tabular recognition method, for a sentence w = a_1 a_2...a_n, an (n+1) × (n+1) recognition matrix T is constructed. The elements of T are sets of dotted rules A → α•β, which we will also call items here. T_{i,j} denotes the element in the j-th column and the i-th row of T. The j-th column of this matrix coincides with the item set I_j in the string version of Earley's method. Hence, if item [A → α•β, k] is an element of item set I_j, then A → α•β will be in T_{k,j}. Because of the characteristic property satisfied by the item sets in the string version, the following will hold: A → α•β is an element of T_{i,j} if and only if there is a derivation S ⇒* a_1...a_i Aδ and, moreover, α ⇒* a_{i+1}...a_j. And, again, as a particular case of this characteristic property, S → α• is in T_{0,n} for some S-production S → α in P if and only if S ⇒ α ⇒* a_1...a_n.

We define the following operations on sets of items. These operations are used in the tabular recognizer, shown in Figure 12. Let G = (N, Σ, P, S) be a CFG, Q a set of items of G, and M a subset of V.

    Q ⊗ M = { A → αBβ•γ | A → α•Bβγ ∈ Q, β ⇒* ε, B ∈ M }

We now extend this product operation. Q and R now both are sets of items of G:

    Q × R = { A → αBβ•γ | A → α•Bβγ ∈ Q, β ⇒* ε, B → η• ∈ R }

and

    Q * R = { A → αBβ•γ | A → α•Bβγ ∈ Q, β ⇒* ε, B ⇒* C, C → η• ∈ R }.

Notice that Q × R ⊆ Q * R.

Let M be a subset of N. We define the function Predict as follows:

    Predict(M) = { C → γ•ξ | C → γξ ∈ P, γ ⇒* ε, B ⇒* Cη, B ∈ M, η ∈ V* }

Predict satisfies the following properties, which can be used for its evaluation when it is applied to a nonterminal or a set of nonterminals. Let X and Y be sets; then Predict(X ∪ Y) = Predict(X) ∪ Predict(Y). If A ⇒ Bα, then Predict(B) ⊆ Predict(A). Hence, Predict({A}) = { A → α•β | A → αβ ∈ P, α ⇒* ε } ∪ Predict(X), in which X = { B | A → Bγ ∈ P }. Notice that Predict only depends on the grammar G and can therefore be computed before recognition of a particular sentence. We extend the definition of Predict so that it can be applied to item sets R, as follows:

    Predict(R) = Predict({ B | A → α•Bβ ∈ R })

Notice that the operators * and × comprise the predictor and completer operations of Earley's method. Furthermore, the predictor part in this algorithm only computes the diagonal elements of the recognition matrix T.
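Because Predict depends only on G, it can be tabulated in advance. The following Python sketch (ours) does so via the left-corner relation B ⇒* Cη; it assumes a precomputed set nullable of nonterminals deriving ε and the grammar encoding of the earlier sketches, and it represents a dotted rule C → γ•ξ as a triple (C, rhs, dot).

    def left_corners(grammar, nullable):
        """lc[B] = all nonterminals C with B =>* C eta."""
        lc = {B: {B} for B in grammar}
        changed = True
        while changed:
            changed = False
            for B in grammar:
                for C in list(lc[B]):
                    for rhs in grammar.get(C, ()):
                        for X in rhs:             # walk over a nullable prefix
                            if X in grammar and X not in lc[B]:
                                lc[B].add(X)
                                changed = True
                            if X not in nullable:
                                break
        return lc

    def predict(M, grammar, nullable, lc):
        """Dotted rules C -> gamma . xi with gamma =>* eps, B =>* C eta, B in M."""
        items = set()
        for B in M:
            for C in lc.get(B, ()):
                for rhs in grammar.get(C, ()):
                    d = 0
                    items.add((C, rhs, d))        # gamma = epsilon
                    while d < len(rhs) and rhs[d] in nullable:
                        d += 1                    # extend gamma over nullables
                        items.add((C, rhs, d))
        return items

The identity Predict(X ∪ Y) = Predict(X) ∪ Predict(Y) then reduces Predict of an item set to a union of precomputable per-nonterminal tables.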

    begin RECOGNIZER;
        T_{0,0} := Predict({S});
        for j := 1 to n do
            SCANNER:
                for i := 0 to j-1 do
                    T_{i,j} := T_{i,j-1} ⊗ {a_j}
                end;
            COMPLETER:
                for k := j-1 downto 0 do
                    T_{k,j} := T_{k,j} ∪ ( T_{k,k} * T_{k,j} );
                    for i := k-1 downto 0 do
                        T_{i,j} := T_{i,j} ∪ ( T_{i,k} × T_{k,j} )
                    end
                end;
            PREDICTOR:
                T_{j,j} := Predict( ∪_{0 ≤ i ≤ j-1} T_{i,j} )
        end;
        if S → α• ∈ T_{0,n} for some production S → α then accept else reject end
    end RECOGNIZER.

Fig. 12 The tabular recognizer

Example 2. Let G_2 be a CFG given by the following set of rules.

    S → aBC     B → bBH | b     C → c     H → d

If we apply the tabular recognizer of Figure 12 to the sentence abbbddc, we obtain the recognition matrix in Figure 13. From the fact that S → aBC• is an element of T_{0,7} we conclude that abbbddc is in the language L(G_2).

The difference between this tabular version and the original string version of Earley's method becomes apparent if we apply them to grammars that contain ε-rules. To see this, consider the grammar with the productions S → AB, A → ε, B → CDC, C → ε, D → a. This grammar has only one tree, namely the one for the sentence a. According to the string version of Earley's method, item [C → •, 1] is added to I_1 before item [B → CDC•, 0] is added to I_1. If we apply the tabular algorithm, however, C → • is added at the end. Here, item B → CDC• is already added by the completer operation, i.e. before item C → • is added.

The nonempty entries of the recognition matrix are the following.

    T_{0,0} = { S → •aBC }      T_{0,1} = { S → a•BC }      T_{0,2} = { S → aB•C }
    T_{0,6} = { S → aB•C }      T_{0,7} = { S → aBC• }
    T_{1,1} = { B → •bBH, B → •b }
    T_{1,2} = { B → b•BH, B → b• }
    T_{1,3} = { B → bB•H }      T_{1,5} = { B → bB•H }      T_{1,6} = { B → bBH• }
    T_{2,2} = { B → •bBH, B → •b, C → •c }
    T_{2,3} = { B → b•BH, B → b• }
    T_{2,4} = { B → bB•H }      T_{2,5} = { B → bBH• }
    T_{3,3} = { B → •bBH, B → •b, H → •d }
    T_{3,4} = { B → b•BH, B → b• }
    T_{4,4} = { B → •bBH, B → •b, H → •d }
    T_{4,5} = { H → d• }
    T_{5,5} = { H → •d }        T_{5,6} = { H → d• }
    T_{6,6} = { C → •c }        T_{6,7} = { C → c• }

Fig. 13 The recognition matrix for Example 2

We call the algorithm in Figure 12 the column-by-column variant, and we present the entry-by-entry variant of the tabular recognition algorithm in Figure 14. In this algorithm the elements T_{i,j} are computed completely, element by element. Elements within one column are computed from the bottom to the top, except for the diagonal elements; they are computed last, as in the column-by-column variant. In the column-by-column variant as well as in the entry-by-entry variant the columns are computed from left to right. This is necessary because the diagonal elements must be computed before the elements in the next column can be computed. The diagonal sets in turn depend on the elements in the rows above them. This dependency is particularly bad for parallel computation of the elements. This can also be seen from the condition that S ⇒* a_1...a_i Aγ should hold in order for an item A → α•β to be an element of T_{i,j}. Intuitively, this means that the left context plays a role in determining the contents of the following entries in the recognition matrix. Especially this aspect, sometimes called top-down filtering, distinguishes Earley's method from the CYK method, which is purely bottom-up.

In order to remove the dependency on the elements in the main diagonal, we can fill the entries in this diagonal with the set Predict(N), N being the set of all nonterminal symbols of the grammar. Now we can compute the other elements of the matrix diagonal by diagonal, working from the main diagonal in right-upward direction. This method is followed in the diagonal-by-diagonal variant of the tabular recognition method, shown in Figure 15. First, the elements T_{i,j} are computed for which d = j - i, the difference between the column index j and the row index i, is 0. Then this difference d is incremented each time by one, and the elements of the corresponding diagonal are computed. The elements in each diagonal are computed from the bottom upwards.

    begin RECOGNIZER;
        T_{0,0} := Predict({S});
        for j := 1 to n do
            for i := j-1 downto 0 do
                SCANNER:
                    T_{i,j} := T_{i,j-1} ⊗ {a_j};
                COMPLETER:
                    for all k with i < k < j do
                        T_{i,j} := T_{i,j} ∪ ( T_{i,k} × T_{k,j} )
                    end;
                    T_{i,j} := T_{i,j} ∪ ( T_{i,i} * T_{i,j} )
            end;
            PREDICTOR:
                T_{j,j} := Predict( ∪_{0 ≤ i ≤ j-1} T_{i,j} )
        end;
        if S → α• ∈ T_{0,n} for some production S → α then accept else reject end
    end RECOGNIZER.

Fig. 14 The entry-by-entry variant

5. A recognition method without dots

In this section we present our general recognition method. It is based on the following simple property:

    S ⇒* w if and only if there is some α ∈ V* such that α ⇒* w and S ∈ GLS(α),

where for β ∈ V*,

    GLS(β) = { A | for some B ∈ N: A ⇒* B and B → β ∈ P }.

GLS is an abbreviation of Generalized Left-Hand Side.

At first, we restrict ourselves to ε-free CFGs, i.e. CFGs without ε-rules. We define the following predicates on strings in V*:

    Compl(α) = true if α is the right-hand side of some production of G;
    Extend(α) = true if α is a prefix of the right-hand side of some production of G.

If Extend(αX) holds, we say that α is extendible by X. Notice that since every string is a prefix of itself, Compl(α) implies Extend(α), and since ε is a prefix of every string, Extend(ε) holds.

The parse objects have the form [α, i], where α ∈ V* and 0 ≤ i ≤ n.

    begin RECOGNIZER;
        for i := 0 to n do
            T_{i,i} := Predict(N)
        end;
        for d := 1 to n do
            for i := 0 to n-d do
                COMPUTE(T_{i,i+d})
            end
        end;
        if S → α• ∈ T_{0,n} for some production S → α then accept else reject end
    end RECOGNIZER.

The procedure COMPUTE is the following.

    COMPUTE(T_{i,j});
    begin
        SCANNER:
            T_{i,j} := T_{i,j-1} ⊗ {a_j};
        COMPLETER:
            for k := j-1 downto i+1 do
                T_{i,j} := T_{i,j} ∪ ( T_{i,k} × T_{k,j} )
            end;
            T_{i,j} := T_{i,j} ∪ ( T_{i,i} * T_{i,j} )
    end COMPUTE.

Fig. 15 The diagonal-by-diagonal variant

We now specify for our method the procedures Initialize, Compute and Answer in the Recognition Scheme (Figure 8). Initialize sets I_0 to contain the object [ε, 0]. Compute(I_j) is defined as follows.

(1) Start with I_j = { [ε, j] }.

(2) SCANNER: For all objects [α, i] in I_{j-1} for which α is extendible by a_j, add [αa_j, i] to I_j. Repeat the following step until no more objects can be added to I_j.

(3) COMPLETER: If [α, i] ∈ I_j is such that Compl(α), then add [γB, k] to I_j for all [γ, k] in I_i for which γ is extendible by B, such that B ∈ GLS(α). (Notice that γ is possibly ε.)

Answer checks whether [S, 0] is a member of I_n.
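A compact Python sketch of the whole method (ours; for ε-free grammars, with the dict-of-right-hand-sides grammar encoding of the earlier sketches) precomputes the prefixes, the right-hand sides and the GLS sets, and then runs Initialize, Compute and Answer. The explicit special case for the start symbol is our addition; it makes [S, 0] constructible even when S occurs in no right-hand side (cf. Theorem 3 below).

    def dotless_recognize(words, grammar, start="S"):
        n = len(words)
        nonterms = set(grammar)
        rhss = {rhs for A in grammar for rhs in grammar[A]}
        prefixes = {rhs[:k] for rhs in rhss for k in range(len(rhs) + 1)}
        # derives[A] = {B | A =>* B}; for eps-free G only unit chains qualify
        derives = {A: {A} for A in nonterms}
        changed = True
        while changed:
            changed = False
            for A in nonterms:
                for B in list(derives[A]):
                    for rhs in grammar.get(B, ()):
                        if (len(rhs) == 1 and rhs[0] in nonterms
                                and rhs[0] not in derives[A]):
                            derives[A].add(rhs[0])
                            changed = True

        def gls(alpha):    # GLS(alpha) = {A | A =>* B, B -> alpha in P}
            return {A for A in nonterms for B in derives[A]
                    if alpha in grammar.get(B, ())}

        I = [set() for _ in range(n + 1)]
        I[0] = {((), 0)}                                   # Initialize
        for j in range(1, n + 1):                          # Compute(I_j)
            I[j] = {((), j)}
            for (alpha, i) in I[j - 1]:                    # (2) SCANNER
                if alpha + (words[j - 1],) in prefixes:
                    I[j].add((alpha + (words[j - 1],), i))
            changed = True
            while changed:                                 # (3) COMPLETER
                changed = False
                for (alpha, i) in list(I[j]):
                    if alpha in rhss:                      # Compl(alpha)
                        for B in gls(alpha):
                            for (gamma, k) in list(I[i]):
                                ok = (gamma + (B,) in prefixes
                                      or (gamma == () and B == start))
                                if ok and (gamma + (B,), k) not in I[j]:
                                    I[j].add((gamma + (B,), k))
                                    changed = True
        return ((start,), 0) in I[n]                       # Answer

Applied to the grammar and sentence of Example 4 below, this returns True.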

The correctness of this recognition algorithm follows from Theorem 3.

Theorem 3. A parse object [α, i] is a member of I_j if and only if α ⇒* a_{i+1}...a_j and, moreover, Extend(α) or α = S.

This theorem implies that after all sets I_i have been computed, the sentence a_1...a_n can be accepted if [S, 0] ∈ I_n, and should be rejected otherwise.

Example 4. Consider the grammar with the following productions:

    S → bAC     A → a | Aa | aAa     C → c

When we apply the algorithm to the input sentence baaaac, we obtain the following sets of parse objects.

    I_0: [ε,0]
    I_1: [ε,1], [b,0]
    I_2: [ε,2], [a,1], [A,1], [bA,0]
    I_3: [ε,3], [a,2], [Aa,1], [A,2], [aA,1], [A,1], [bA,0]
    I_4: [ε,4], [a,3], [Aa,2], [aAa,1], [Aa,1], [A,3], [aA,2], [A,2], [aA,1], [A,1], [bA,0]
    I_5: [ε,5], [a,4], [Aa,3], [aAa,2], [Aa,2], [aAa,1], [Aa,1], [A,4], [aA,3], [A,3], [aA,2], [A,2], [aA,1], [A,1], [bA,0]
    I_6: [ε,6], [c,5], [C,5], [bAC,0], [S,0]

Since [S, 0] is an element of I_6, the string is accepted.

The time complexity of the algorithm is O(n^3), where n is the length of the string to be recognized. The number of objects [α, i] in a set I_j is O(j). The completer operation (see above) may take O(j) time for each element in I_j. This means that the time complexity is the same as that of Earley's recognizer. In general, the space complexity is quadratic, which is also comparable with Earley's method. Since we do not store complete items, but only extendible parts of right-hand sides of production rules in the parse objects, this method does, in general, not need as much space as Earley's. Obviously, the predicates Compl and Extend and the GLS sets can be computed, for a given CFG G, before processing a sentence.

We presented the string version of our recognition method. In the tabular version the j-th column of the triangular recognition matrix T coincides with the set I_j in the string version: T_{i,j} contains the strings α for which [α, i] ∈ I_j. In Earley's method an item [A → α•β, i] is a member of the item set I_j if not only α ⇒* a_{i+1}...a_j but, in addition, a derivation S ⇒* a_1...a_i Aδ exists. The predictor operation in Earley's method causes this extra condition to hold.

We compare the tabular version of our method to the CYK method. In the CYK method, A ∈ N is an element of the entry T_{i,j} of the recognition matrix T, where 0 ≤ i, j ≤ n, if and only if A ⇒* a_{i+1}...a_j. In the present method the following will hold: α ∈ T_{i,j} if and only if α ⇒* a_{i+1}...a_j.

Moreover, we put no restrictions on the grammar, except that we assume the grammar to be ε-free. We will show later that this restriction is not essential. Our method can be derived from the tabular CYK method by a simple reinterpretation of the constructors used in that algorithm. We define the operator Closure on subsets W of V* as follows:

    Closure(W) = W ∪ ⋃_{α ∈ W} GLS(α)

The predicate Sub(α) holds iff α is a substring of a right-hand side of a production. Using this predicate, we can define the binary operator ⊙ on sets of strings as follows:

    X ⊙ Y = Closure( { α | α = γδ, γ ∈ X, δ ∈ Y, Sub(α) } )

With these operators we can present the recognition algorithm for our method. It is shown in Figure 16.

    begin Recognizer;
        for i := 0 to n-1 do
            T_{i,i+1} := Closure({a_{i+1}})
        end;
        for d := 2 to n do
            for i := 0 to n-d do
                j := i+d;
                T_{i,j} := ∪_{k=i+1}^{j-1} ( T_{i,k} ⊙ T_{k,j} )
            end
        end;
        if S ∈ T_{0,n} then accept else reject end
    end Recognizer.

Fig. 16 The recognizer without dots

The correctness of the algorithm is a consequence of the following theorem.

Theorem 5. The set T_{i,j} contains α if and only if α ⇒* a_{i+1}...a_j and, moreover, Sub(α) or α = S.

Example 6. If we apply the recognition algorithm to the grammar of Example 4 and the string baaaac, we obtain the recognition matrix shown in Figure 17.

The operator Closure can easily be redefined to handle grammars with ε-rules as well. In that case we should have: if H ⇒* ε and Sub(αH), respectively Sub(Hα), then include αH, respectively Hα, in a set T_{i,j} if α is already in this set.

If we add a pair of indices to each item in every item set T_{i,j}, indicating from which matrix element(s) the item is derived, then we can easily construct the parse tree(s) from this augmented recognition matrix by following the trace(s) given by these indices. Notice that the graph given by these indices is already a representation of the parse tree(s).

6. Items with two dots: another Earley variant


More information

Syntactic Analysis. Top-Down Parsing

Syntactic Analysis. Top-Down Parsing Syntactic Analysis Top-Down Parsing Copyright 2017, Pedro C. Diniz, all rights reserved. Students enrolled in Compilers class at University of Southern California (USC) have explicit permission to make

More information

Syntax Analysis. Amitabha Sanyal. (www.cse.iitb.ac.in/ as) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay

Syntax Analysis. Amitabha Sanyal. (www.cse.iitb.ac.in/ as) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay Syntax Analysis (www.cse.iitb.ac.in/ as) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay September 2007 College of Engineering, Pune Syntax Analysis: 2/124 Syntax

More information

UNIT III & IV. Bottom up parsing

UNIT III & IV. Bottom up parsing UNIT III & IV Bottom up parsing 5.0 Introduction Given a grammar and a sentence belonging to that grammar, if we have to show that the given sentence belongs to the given grammar, there are two methods.

More information

Ling 571: Deep Processing for Natural Language Processing

Ling 571: Deep Processing for Natural Language Processing Ling 571: Deep Processing for Natural Language Processing Julie Medero January 14, 2013 Today s Plan Motivation for Parsing Parsing as Search Search algorithms Search Strategies Issues Two Goal of Parsing

More information

Parsing partially bracketed input

Parsing partially bracketed input Parsing partially bracketed input Martijn Wieling, Mark-Jan Nederhof and Gertjan van Noord Humanities Computing, University of Groningen Abstract A method is proposed to convert a Context Free Grammar

More information

Syntax Analysis. The Big Picture. The Big Picture. COMP 524: Programming Languages Srinivas Krishnan January 25, 2011

Syntax Analysis. The Big Picture. The Big Picture. COMP 524: Programming Languages Srinivas Krishnan January 25, 2011 Syntax Analysis COMP 524: Programming Languages Srinivas Krishnan January 25, 2011 Based in part on slides and notes by Bjoern Brandenburg, S. Olivier and A. Block. 1 The Big Picture Character Stream Token

More information

Assignment 4 CSE 517: Natural Language Processing

Assignment 4 CSE 517: Natural Language Processing Assignment 4 CSE 517: Natural Language Processing University of Washington Winter 2016 Due: March 2, 2016, 1:30 pm 1 HMMs and PCFGs Here s the definition of a PCFG given in class on 2/17: A finite set

More information

Computer Science 160 Translation of Programming Languages

Computer Science 160 Translation of Programming Languages Computer Science 160 Translation of Programming Languages Instructor: Christopher Kruegel Top-Down Parsing Parsing Techniques Top-down parsers (LL(1), recursive descent parsers) Start at the root of the

More information

Normal Forms and Parsing. CS154 Chris Pollett Mar 14, 2007.

Normal Forms and Parsing. CS154 Chris Pollett Mar 14, 2007. Normal Forms and Parsing CS154 Chris Pollett Mar 14, 2007. Outline Chomsky Normal Form The CYK Algorithm Greibach Normal Form Chomsky Normal Form To get an efficient parsing algorithm for general CFGs

More information

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice. Parsing Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice. Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students

More information

The CKY Parsing Algorithm and PCFGs. COMP-599 Oct 12, 2016

The CKY Parsing Algorithm and PCFGs. COMP-599 Oct 12, 2016 The CKY Parsing Algorithm and PCFGs COMP-599 Oct 12, 2016 Outline CYK parsing PCFGs Probabilistic CYK parsing 2 CFGs and Constituent Trees Rules/productions: S VP this VP V V is rules jumps rocks Trees:

More information

Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5

Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5 Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5 1 Not all languages are regular So what happens to the languages which are not regular? Can we still come up with a language recognizer?

More information

Parsing Part II (Top-down parsing, left-recursion removal)

Parsing Part II (Top-down parsing, left-recursion removal) Parsing Part II (Top-down parsing, left-recursion removal) Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit

More information

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7 Top-Down Parsing and Intro to Bottom-Up Parsing Lecture 7 1 Predictive Parsers Like recursive-descent but parser can predict which production to use Predictive parsers are never wrong Always able to guess

More information

Syntax Analysis, III Comp 412

Syntax Analysis, III Comp 412 COMP 412 FALL 2017 Syntax Analysis, III Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp

More information

Computational Linguistics

Computational Linguistics Computational Linguistics CSC 2501 / 485 Fall 2018 3 3. Chart parsing Gerald Penn Department of Computer Science, University of Toronto Reading: Jurafsky & Martin: 13.3 4. Allen: 3.4, 3.6. Bird et al:

More information

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7 Top-Down Parsing and Intro to Bottom-Up Parsing Lecture 7 1 Predictive Parsers Like recursive-descent but parser can predict which production to use Predictive parsers are never wrong Always able to guess

More information

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University CS415 Compilers Syntax Analysis These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University Limits of Regular Languages Advantages of Regular Expressions

More information

CS 321 Programming Languages and Compilers. VI. Parsing

CS 321 Programming Languages and Compilers. VI. Parsing CS 321 Programming Languages and Compilers VI. Parsing Parsing Calculate grammatical structure of program, like diagramming sentences, where: Tokens = words Programs = sentences For further information,

More information

Ambiguity, Precedence, Associativity & Top-Down Parsing. Lecture 9-10

Ambiguity, Precedence, Associativity & Top-Down Parsing. Lecture 9-10 Ambiguity, Precedence, Associativity & Top-Down Parsing Lecture 9-10 (From slides by G. Necula & R. Bodik) 9/18/06 Prof. Hilfinger CS164 Lecture 9 1 Administrivia Please let me know if there are continued

More information

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill Syntax Analysis Björn B. Brandenburg The University of North Carolina at Chapel Hill Based on slides and notes by S. Olivier, A. Block, N. Fisher, F. Hernandez-Campos, and D. Stotts. The Big Picture Character

More information

3.7 Denotational Semantics

3.7 Denotational Semantics 3.7 Denotational Semantics Denotational semantics, also known as fixed-point semantics, associates to each programming language construct a well-defined and rigorously understood mathematical object. These

More information

Vorlesung 7: Ein effizienter CYK Parser

Vorlesung 7: Ein effizienter CYK Parser Institut für Computerlinguistik, Uni Zürich: Effiziente Analyse unbeschränkter Texte Vorlesung 7: Ein effizienter CYK Parser Gerold Schneider Institute of Computational Linguistics, University of Zurich

More information

Parsers. Xiaokang Qiu Purdue University. August 31, 2018 ECE 468

Parsers. Xiaokang Qiu Purdue University. August 31, 2018 ECE 468 Parsers Xiaokang Qiu Purdue University ECE 468 August 31, 2018 What is a parser A parser has two jobs: 1) Determine whether a string (program) is valid (think: grammatically correct) 2) Determine the structure

More information

2.2 Syntax Definition

2.2 Syntax Definition 42 CHAPTER 2. A SIMPLE SYNTAX-DIRECTED TRANSLATOR sequence of "three-address" instructions; a more complete example appears in Fig. 2.2. This form of intermediate code takes its name from instructions

More information

CS143 Handout 20 Summer 2011 July 15 th, 2011 CS143 Practice Midterm and Solution

CS143 Handout 20 Summer 2011 July 15 th, 2011 CS143 Practice Midterm and Solution CS143 Handout 20 Summer 2011 July 15 th, 2011 CS143 Practice Midterm and Solution Exam Facts Format Wednesday, July 20 th from 11:00 a.m. 1:00 p.m. in Gates B01 The exam is designed to take roughly 90

More information

A Characterization of the Chomsky Hierarchy by String Turing Machines

A Characterization of the Chomsky Hierarchy by String Turing Machines A Characterization of the Chomsky Hierarchy by String Turing Machines Hans W. Lang University of Applied Sciences, Flensburg, Germany Abstract A string Turing machine is a variant of a Turing machine designed

More information

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones Parsing III (Top-down parsing: recursive descent & LL(1) ) (Bottom-up parsing) CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones Copyright 2003, Keith D. Cooper,

More information

Compilerconstructie. najaar Rudy van Vliet kamer 140 Snellius, tel rvvliet(at)liacs(dot)nl. college 3, vrijdag 22 september 2017

Compilerconstructie. najaar Rudy van Vliet kamer 140 Snellius, tel rvvliet(at)liacs(dot)nl. college 3, vrijdag 22 september 2017 Compilerconstructie najaar 2017 http://www.liacs.leidenuniv.nl/~vlietrvan1/coco/ Rudy van Vliet kamer 140 Snellius, tel. 071-527 2876 rvvliet(at)liacs(dot)nl college 3, vrijdag 22 september 2017 + werkcollege

More information

Formal Languages and Compilers Lecture V: Parse Trees and Ambiguous Gr

Formal Languages and Compilers Lecture V: Parse Trees and Ambiguous Gr Formal Languages and Compilers Lecture V: Parse Trees and Ambiguous Grammars Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/

More information

1. [5 points each] True or False. If the question is currently open, write O or Open.

1. [5 points each] True or False. If the question is currently open, write O or Open. University of Nevada, Las Vegas Computer Science 456/656 Spring 2018 Practice for the Final on May 9, 2018 The entire examination is 775 points. The real final will be much shorter. Name: No books, notes,

More information

Introduction to Parsing. Comp 412

Introduction to Parsing. Comp 412 COMP 412 FALL 2010 Introduction to Parsing Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make

More information

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata Formal Languages and Compilers Lecture IV: Regular Languages and Finite Automata Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/

More information

Lecture 25 Spanning Trees

Lecture 25 Spanning Trees Lecture 25 Spanning Trees 15-122: Principles of Imperative Computation (Fall 2018) Frank Pfenning, Iliano Cervesato The following is a simple example of a connected, undirected graph with 5 vertices (A,

More information

Monday, September 13, Parsers

Monday, September 13, Parsers Parsers Agenda Terminology LL(1) Parsers Overview of LR Parsing Terminology Grammar G = (Vt, Vn, S, P) Vt is the set of terminals Vn is the set of non-terminals S is the start symbol P is the set of productions

More information

CMPSCI 311: Introduction to Algorithms Practice Final Exam

CMPSCI 311: Introduction to Algorithms Practice Final Exam CMPSCI 311: Introduction to Algorithms Practice Final Exam Name: ID: Instructions: Answer the questions directly on the exam pages. Show all your work for each question. Providing more detail including

More information

Job-shop scheduling with limited capacity buffers

Job-shop scheduling with limited capacity buffers Job-shop scheduling with limited capacity buffers Peter Brucker, Silvia Heitmann University of Osnabrück, Department of Mathematics/Informatics Albrechtstr. 28, D-49069 Osnabrück, Germany {peter,sheitman}@mathematik.uni-osnabrueck.de

More information

LL(1) predictive parsing

LL(1) predictive parsing LL(1) predictive parsing Informatics 2A: Lecture 11 John Longley School of Informatics University of Edinburgh jrl@staffmail.ed.ac.uk 13 October, 2011 1 / 12 1 LL(1) grammars and parse tables 2 3 2 / 12

More information

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs Chapter 2 :: Programming Language Syntax Programming Language Pragmatics Michael L. Scott Introduction programming languages need to be precise natural languages less so both form (syntax) and meaning

More information

COP 3402 Systems Software Syntax Analysis (Parser)

COP 3402 Systems Software Syntax Analysis (Parser) COP 3402 Systems Software Syntax Analysis (Parser) Syntax Analysis 1 Outline 1. Definition of Parsing 2. Context Free Grammars 3. Ambiguous/Unambiguous Grammars Syntax Analysis 2 Lexical and Syntax Analysis

More information

Compilation 2012 Context-Free Languages Parsers and Scanners. Jan Midtgaard Michael I. Schwartzbach Aarhus University

Compilation 2012 Context-Free Languages Parsers and Scanners. Jan Midtgaard Michael I. Schwartzbach Aarhus University Compilation 2012 Parsers and Scanners Jan Midtgaard Michael I. Schwartzbach Aarhus University Context-Free Grammars Example: sentence subject verb object subject person person John Joe Zacharias verb asked

More information

Algorithms for NLP. Chart Parsing. Reading: James Allen, Natural Language Understanding. Section 3.4, pp

Algorithms for NLP. Chart Parsing. Reading: James Allen, Natural Language Understanding. Section 3.4, pp -7 Algorithms for NLP Chart Parsing Reading: James Allen, Natural Language Understanding Section 3.4, pp. 53-6 Chart Parsing General Principles: A Bottom-Up parsing method Construct a parse starting from

More information

University of Nevada, Las Vegas Computer Science 456/656 Fall 2016

University of Nevada, Las Vegas Computer Science 456/656 Fall 2016 University of Nevada, Las Vegas Computer Science 456/656 Fall 2016 The entire examination is 925 points. The real final will be much shorter. Name: No books, notes, scratch paper, or calculators. Use pen

More information

Homework & NLTK. CS 181: Natural Language Processing Lecture 9: Context Free Grammars. Motivation. Formal Def of CFG. Uses of CFG.

Homework & NLTK. CS 181: Natural Language Processing Lecture 9: Context Free Grammars. Motivation. Formal Def of CFG. Uses of CFG. C 181: Natural Language Processing Lecture 9: Context Free Grammars Kim Bruce Pomona College pring 2008 Homework & NLTK Review MLE, Laplace, and Good-Turing in smoothing.py Disclaimer: lide contents borrowed

More information

LL(1) predictive parsing

LL(1) predictive parsing LL(1) predictive parsing Informatics 2A: Lecture 11 Mary Cryan School of Informatics University of Edinburgh mcryan@staffmail.ed.ac.uk 10 October 2018 1 / 15 Recap of Lecture 10 A pushdown automaton (PDA)

More information

Syntax Analysis. Martin Sulzmann. Martin Sulzmann Syntax Analysis 1 / 38

Syntax Analysis. Martin Sulzmann. Martin Sulzmann Syntax Analysis 1 / 38 Syntax Analysis Martin Sulzmann Martin Sulzmann Syntax Analysis 1 / 38 Syntax Analysis Objective Recognize individual tokens as sentences of a language (beyond regular languages). Example 1 (OK) Program

More information

Context Free Grammars. CS154 Chris Pollett Mar 1, 2006.

Context Free Grammars. CS154 Chris Pollett Mar 1, 2006. Context Free Grammars CS154 Chris Pollett Mar 1, 2006. Outline Formal Definition Ambiguity Chomsky Normal Form Formal Definitions A context free grammar is a 4-tuple (V, Σ, R, S) where 1. V is a finite

More information

Parser Generation. Bottom-Up Parsing. Constructing LR Parser. LR Parsing. Construct parse tree bottom-up --- from leaves to the root

Parser Generation. Bottom-Up Parsing. Constructing LR Parser. LR Parsing. Construct parse tree bottom-up --- from leaves to the root Parser Generation Main Problem: given a grammar G, how to build a top-down parser or a bottom-up parser for it? parser : a program that, given a sentence, reconstructs a derivation for that sentence ----

More information

CS 314 Principles of Programming Languages

CS 314 Principles of Programming Languages CS 314 Principles of Programming Languages Lecture 5: Syntax Analysis (Parsing) Zheng (Eddy) Zhang Rutgers University January 31, 2018 Class Information Homework 1 is being graded now. The sample solution

More information

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

MIT Specifying Languages with Regular Expressions and Context-Free Grammars MIT 6.035 Specifying Languages with Regular essions and Context-Free Grammars Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology Language Definition Problem How to precisely

More information

Table-driven using an explicit stack (no recursion!). Stack can be viewed as containing both terminals and non-terminals.

Table-driven using an explicit stack (no recursion!). Stack can be viewed as containing both terminals and non-terminals. Bottom-up Parsing: Table-driven using an explicit stack (no recursion!). Stack can be viewed as containing both terminals and non-terminals. Basic operation is to shift terminals from the input to the

More information

Iterative CKY parsing for Probabilistic Context-Free Grammars

Iterative CKY parsing for Probabilistic Context-Free Grammars Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST

More information

Context Free Languages and Pushdown Automata

Context Free Languages and Pushdown Automata Context Free Languages and Pushdown Automata COMP2600 Formal Methods for Software Engineering Ranald Clouston Australian National University Semester 2, 2013 COMP 2600 Context Free Languages and Pushdown

More information

Types of parsing. CMSC 430 Lecture 4, Page 1

Types of parsing. CMSC 430 Lecture 4, Page 1 Types of parsing Top-down parsers start at the root of derivation tree and fill in picks a production and tries to match the input may require backtracking some grammars are backtrack-free (predictive)

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Syntax Analysis. Prof. James L. Frankel Harvard University. Version of 6:43 PM 6-Feb-2018 Copyright 2018, 2015 James L. Frankel. All rights reserved.

Syntax Analysis. Prof. James L. Frankel Harvard University. Version of 6:43 PM 6-Feb-2018 Copyright 2018, 2015 James L. Frankel. All rights reserved. Syntax Analysis Prof. James L. Frankel Harvard University Version of 6:43 PM 6-Feb-2018 Copyright 2018, 2015 James L. Frankel. All rights reserved. Context-Free Grammar (CFG) terminals non-terminals start

More information

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;}

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;} Compiler Construction Grammars Parsing source code scanner tokens regular expressions lexical analysis Lennart Andersson parser context free grammar Revision 2012 01 23 2012 parse tree AST builder (implicit)

More information

Formal Languages and Compilers Lecture VI: Lexical Analysis

Formal Languages and Compilers Lecture VI: Lexical Analysis Formal Languages and Compilers Lecture VI: Lexical Analysis Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/ artale/ Formal

More information

LL(k) Parsing. Predictive Parsers. LL(k) Parser Structure. Sample Parse Table. LL(1) Parsing Algorithm. Push RHS in Reverse Order 10/17/2012

LL(k) Parsing. Predictive Parsers. LL(k) Parser Structure. Sample Parse Table. LL(1) Parsing Algorithm. Push RHS in Reverse Order 10/17/2012 Predictive Parsers LL(k) Parsing Can we avoid backtracking? es, if for a given input symbol and given nonterminal, we can choose the alternative appropriately. his is possible if the first terminal of

More information

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence.

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence. Bottom-up parsing Recall For a grammar G, with start symbol S, any string α such that S α is a sentential form If α V t, then α is a sentence in L(G) A left-sentential form is a sentential form that occurs

More information

Advanced PCFG Parsing

Advanced PCFG Parsing Advanced PCFG Parsing Computational Linguistics Alexander Koller 8 December 2017 Today Parsing schemata and agenda-based parsing. Semiring parsing. Pruning techniques for chart parsing. The CKY Algorithm

More information

Outline. The strategy: shift-reduce parsing. Introduction to Bottom-Up Parsing. A key concept: handles

Outline. The strategy: shift-reduce parsing. Introduction to Bottom-Up Parsing. A key concept: handles Outline Introduction to Bottom-Up Parsing Lecture Notes by Profs. Alex Aiken and George Necula (UCB) he strategy: -reduce parsing A key concept: handles Ambiguity and precedence declarations CS780(Prasad)

More information

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen COP4020 Programming Languages Syntax Prof. Robert van Engelen Overview Tokens and regular expressions Syntax and context-free grammars Grammar derivations More about parse trees Top-down and bottom-up

More information

Lexical Analysis. Chapter 2

Lexical Analysis. Chapter 2 Lexical Analysis Chapter 2 1 Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexers Regular expressions Examples

More information

Wednesday, September 9, 15. Parsers

Wednesday, September 9, 15. Parsers Parsers What is a parser A parser has two jobs: 1) Determine whether a string (program) is valid (think: grammatically correct) 2) Determine the structure of a program (think: diagramming a sentence) Agenda

More information

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs:

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs: What is a parser Parsers A parser has two jobs: 1) Determine whether a string (program) is valid (think: grammatically correct) 2) Determine the structure of a program (think: diagramming a sentence) Agenda

More information

Wednesday, August 31, Parsers

Wednesday, August 31, Parsers Parsers How do we combine tokens? Combine tokens ( words in a language) to form programs ( sentences in a language) Not all combinations of tokens are correct programs (not all sentences are grammatically

More information

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions.

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions. CS 787: Advanced Algorithms NP-Hardness Instructor: Dieter van Melkebeek We review the concept of polynomial-time reductions, define various classes of problems including NP-complete, and show that 3-SAT

More information

The CYK Parsing Algorithm

The CYK Parsing Algorithm The CYK Parsing Algorithm Lecture 19 Section 6.3 Robb T. Koether Hampden-Sydney College Fri, Oct 7, 2016 Robb T. Koether (Hampden-Sydney College) The CYK Parsing Algorithm Fri, Oct 7, 2016 1 / 21 1 The

More information

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen COP4020 Programming Languages Syntax Prof. Robert van Engelen Overview n Tokens and regular expressions n Syntax and context-free grammars n Grammar derivations n More about parse trees n Top-down and

More information

Context-free grammars

Context-free grammars Context-free grammars Section 4.2 Formal way of specifying rules about the structure/syntax of a program terminals - tokens non-terminals - represent higher-level structures of a program start symbol,

More information

Top down vs. bottom up parsing

Top down vs. bottom up parsing Parsing A grammar describes the strings that are syntactically legal A recogniser simply accepts or rejects strings A generator produces sentences in the language described by the grammar A parser constructs

More information

Introduction to Lexical Analysis

Introduction to Lexical Analysis Introduction to Lexical Analysis Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexical analyzers (lexers) Regular

More information

General Overview of Compiler

General Overview of Compiler General Overview of Compiler Compiler: - It is a complex program by which we convert any high level programming language (source code) into machine readable code. Interpreter: - It performs the same task

More information

CS502: Compilers & Programming Systems

CS502: Compilers & Programming Systems CS502: Compilers & Programming Systems Top-down Parsing Zhiyuan Li Department of Computer Science Purdue University, USA There exist two well-known schemes to construct deterministic top-down parsers:

More information