CYK- and Earley's Approach to Parsing


CYK- and Earley's Approach to Parsing

Anton Nijholt and Rieks op den Akker
Faculty of Computer Science, University of Twente
P.O. Box 217, 7500 AE Enschede, The Netherlands

1. Introduction

In this paper we discuss versions of the Cocke-Younger-Kasami (CYK) parsing algorithm and Earley's parsing algorithm. In the early 1960s parsing theory flourished under the umbrella of machine translation research. Backtrack algorithms were used for parsing natural languages. Due to the exponential blow-up no widespread application of these algorithms could be expected. It was even thought that a polynomial parsing algorithm for context-free languages was not possible. Greibach [1979] presents the following view on the situation in the early 1960s:

    "We were very much aware of the problem of exponential blow-up in the number of iterations (or paths), though we felt that this did not happen in real natural languages; I do not think we suspected that a polynomial parsing algorithm was possible."

However, at that time a method developed by John Cocke at Rand Corporation for parsing a context-free grammar for English was already available. The method required the grammar to be in Chomsky Normal Form (CNF). An analysis of the algorithm showing that it required O(n^3) steps was presented some years later by D.H. Younger. An algorithm similar to that of Cocke was developed by T. Kasami and by some others (see the bibliographic notes at the end of this paper). Presently it has become customary to use the name Cocke-Younger-Kasami algorithm (or some other permutation of these three names).

Parsing methods for context-free languages require a bookkeeping structure which at each moment during parsing provides the information about the possible next actions of the parsing algorithm. The structure should give some account of previous actions in order to prevent unnecessary steps. In computational linguistics a simple and flexible bookkeeping structure called a chart was introduced in the early 1970s. A chart consists of two lists of items, both of which are continually updated. One list contains the pending items, i.e. items that still have to be tried in order to recognize or parse a sentence. Because the items on this list will in general be processed in a particular order, the list is often called an agenda. The second list, called the work area, contains the employed items, i.e. things that are in the process of being tried. The exact form of the lists, their contents and their use differ, and they can have different graphical representations.

In the following section CYK versions of charts will be discussed. Then, in section 3, we present a general recognition scheme of which Earley's recognizer and the CYK recognizer are particular instances. In section 4 we discuss Earley's method and in section 5 a generalized CYK recognition method for general context-free grammars. Finally, in section 6 another variant of Earley's method is presented.

2. Cocke-Younger-Kasami-like parsing

The usual condition of CYK parsing is that the input grammar is in CNF; hence, each rule is of the form A → BC or A → a. It is possible to relax this condition. With slight modifications of the algorithms we can allow rules of the form A → XY and A → X, where X and Y are in V. In section 5 we give a more general version of the CYK approach.

CYK Algorithm (Basic Chart Version)

The first algorithm we discuss uses a basic chart with items of the form [i, A, j], where A is a nonterminal symbol and i and j are position markers such that if a_1...a_n is the input string which has to be parsed, then 0 ≤ i < j ≤ n. Number i is the start (or left) and number j is the end (or right) position marker of the item. For any input string x = a_1...a_n a basic chart will be constructed. We start with an empty agenda and an empty list of employed items.

(1) Initially, items [i-1, A, i], 1 ≤ i ≤ n, with A → a_i in P are added to the agenda. Now keep repeating the following step until the agenda is empty.

(2) Remove any member of the agenda and put it in the work area. Assume the chosen item is of the form [j, Y, k]. For any [i, X, j] in the work area such that A → XY is in P, add [i, A, k] (provided it has not been added before) to the agenda. Similarly, for any [k, Z, l] in the work area such that A → YZ is in P, add [j, A, l] (provided it has not been added before) to the agenda.

If [0, S, n] appears in the work area then the input string has been recognized. Obviously, for any item [i, A, j] which is introduced as a pending item we have A ⇒* a_{i+1}...a_j.

Notice that in the algorithm we have not prescribed the order in which items are removed from the agenda. This gives freedom to choose a particular strategy. For example, always take the oldest item (first-in, first-out strategy) or always take the most recent item (last-in, first-out strategy) from the agenda. Another strategy may give priority to items [i, X, j] such that for any other item [k, Y, l] on the pending list we have i ≤ k. Similarly, a strategy may give priority to items [i, X, j] such that for any other item [k, Y, l] on the agenda we have l ≤ j. It is not difficult to see that with a properly chosen initialization and strategy we can give the algorithm a left-to-right or a right-to-left nature. It is also possible to give preference to items which contain interesting nonterminals (comparable with the interesting corner chart parsing method discussed in Allport [1988]) or to use the chart in such a way that so-called island and bi-directional parsing methods can be formulated (cf. Stock et al. [1988]).

As an example of the basic chart algorithm consider our usual example grammar with rules

    S → NP VP       NP → *n          PP → *prep NP
    S → S PP        NP → *det *n     VP → *v NP
                    NP → NP PP

and with input string "the man saw Mary". In the algorithm the symbols *det, *n, *prep and *v will be viewed as nonterminal symbols. The rules of the form *det → a, *det → the, *n → man, etc. are not shown. Because of rule NP → *n the grammar is not in CNF. Step (1) of the algorithm is therefore modified such that items [i-1, A, i] for 1 ≤ i ≤ n and A ⇒* a_i are added to the agenda. See Fig. 1 for a further explanation of this example.

As soon as a new item is constructed, information about its constituents disappears. This information can be carried along with the items.
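To make steps (1) and (2) concrete, the following is a minimal Python sketch of the basic chart recognizer (ours, not from the paper). The grammar encoding is an illustrative assumption: a dict unary mapping a terminal to the nonterminals that derive it (directly, or via chain rules for the modified step (1)), and a dict binary mapping a pair (X, Y) to the set of A with A → XY.

    from collections import deque

    def basic_chart_recognize(words, unary, binary, start="S"):
        n = len(words)
        # Step (1): items [i-1, A, i] for every A deriving a_i.
        agenda = deque((i, A, i + 1) for i in range(n)
                       for A in unary.get(words[i], ()))
        seen = set(agenda)          # every item ever placed on the agenda
        work_area = set()

        def propose(item):          # "provided it has not been added before"
            if item not in seen:
                seen.add(item)
                agenda.append(item)

        # Step (2): repeatedly move an item to the work area and combine it.
        while agenda:
            (j, Y, k) = agenda.popleft()       # oldest first (FIFO strategy)
            work_area.add((j, Y, k))
            for (i, X, j2) in list(work_area):
                if j2 == j:                    # [i,X,j] + [j,Y,k] -> [i,A,k]
                    for A in binary.get((X, Y), ()):
                        propose((i, A, k))
            for (k2, Z, l) in list(work_area):
                if k2 == k:                    # [j,Y,k] + [k,Z,l] -> [j,A,l]
                    for A in binary.get((Y, Z), ()):
                        propose((j, A, l))
        return (0, start, n) in work_area

For the example grammar one would set, e.g., unary["man"] = {"*n", "NP"} (reflecting NP ⇒* man) and binary[("NP", "VP")] = {"S"}; replacing popleft by pop switches from the first-in first-out to the last-in first-out strategy mentioned above.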

    AGENDA                                    WORK AREA

    [0,*det,1] [1,*n,2] [1,NP,2]              (empty)
    [2,*v,3] [3,*n,4] [3,NP,4]

    [1,NP,2] [2,*v,3] [3,*n,4]                [0,*det,1] [1,*n,2]
    [3,NP,4] [0,NP,2]

    [0,NP,2] [2,VP,4]                         [0,*det,1] [1,*n,2] [1,NP,2]
                                              [2,*v,3] [3,*n,4] [3,NP,4]

    [1,S,4] [0,S,4]                           [0,*det,1] [1,*n,2] [1,NP,2]
                                              [2,*v,3] [3,*n,4] [3,NP,4]
                                              [0,NP,2] [2,VP,4]

    (empty)                                   [0,*det,1] [1,*n,2] [1,NP,2]
                                              [2,*v,3] [3,*n,4] [3,NP,4]
                                              [0,NP,2] [2,VP,4] [1,S,4] [0,S,4]

Initially, [0,*det,1], [1,*n,2], [1,NP,2], [2,*v,3], [3,*n,4] and [3,NP,4] are put on the agenda (first row of this figure) in this order. The strategy we choose consists of always moving the oldest item first from the agenda to the work area. Hence, [0,*det,1] is removed and appears in the work area. It cannot be combined with another item in the work area; therefore the next item, [1,*n,2], is removed from the agenda and put in the work area. It is possible to combine this item with a previously entered item in the work area. The result, [0,NP,2], is entered on the agenda. The present situation is shown in the second row. Item [1,NP,2] is moved from the agenda and put in the work area. This does not lead to further changes. The same holds for the next two items, [2,*v,3] and [3,*n,4], that move from the agenda to the work area. With item [3,NP,4] entered in the work area we can construct item [2,VP,4] and enter it on the agenda. We now have the situation depicted in the third row. Item [0,NP,2] is now added to the work area, but it does not result in changes. Next, item [2,VP,4] is added to the work area and now items [1,S,4] and [0,S,4] are obtained and entered on the agenda. We now have the situation of the fourth row. In the next steps we first remove [1,S,4] from the agenda, enter it in the work area and conclude that no combinations are possible; then we remove [0,S,4] from the agenda, enter it in the work area and conclude that we are finished.

Fig. 1 Basic chart parsing of "the man saw Mary".

For the example it means that instead of the items in the left column below we have those of the right column.

    [0,NP,2]      [0, NP(*det *n), 2]
    [2,VP,4]      [2, VP(*v NP(*n)), 4]
    [1,S,4]       [1, S(NP(*n) VP(*v NP(*n))), 4]
    [0,S,4]       [0, S(NP(*det *n) VP(*v NP(*n))), 4]

In this way the parse trees become part of the items and there is no need to construct them during a second pass through the items. Clearly, in this way partial parse trees will be constructed that will not become part of a full parse tree of the sentence. Worse, we can have an enormous

blow-up in the number of items that are distinguished this way. Indeed, for long sentences the unbounded ambiguity of our example grammar will cause a Catalan explosion. In general the number of items that are syntactically constructed can be reduced considerably if, before entering these extended items on a chart (see also the next versions of the CYK algorithm), they are subjected to a semantic analysis.

The name "chart" is derived from the graphical representation which is usually associated with this method. This representation uses vertices to depict the positions between the words of a sentence and edges to depict the grammar symbols. With our example we can start with the simple chart of Fig. 2.a.

    0 --the-- 1 --man-- 2 --saw-- 3 --Mary-- 4

Fig. 2.a The initial CYK-like chart.

By spanning edges from vertex to vertex according to the grammar rules and working in a bottom-up way, we end up with the chart of Fig. 2.b. Working in a bottom-up way means that an edge with a particular label will be spanned only after the edges for its constituents have been spanned. Apart from this bottom-up order the basic chart version does not enforce a particular order of spanning edges. Any strategy can be chosen as long as we confine ourselves to the trammels of step (1) and step (2) of the algorithm. In the literature on computational linguistics chart parsing is usually associated with a more elegant strategy of combining items or constructing spanning edges. This strategy will be discussed in a twin paper on Earley-like parsing algorithms. It should be noted that although in the literature chart parsing is often explained with the help of these graphical representations, any practical implementation of a chart parser (a "spanning jenny") requires a program which manipulates items according to one of the methods presented in this paper.

[Fig. 2.b The complete CYK-like chart for "the man saw Mary": the chart of Fig. 2.a with spanning edges *det (0-1), *n (1-2), NP (1-2), *v (2-3), *n (3-4), NP (3-4), NP (0-2), VP (2-4), S (1-4) and S (0-4).]

We now give a global analysis of the (theoretical) time complexity of the algorithm. On the lists O(n^2) items are introduced. Each item that is constructed with step (1) or (2) of the algorithm is entered on and removed from the agenda. For each of the O(n^2) items that is removed from the agenda and entered on the list of employed items the following is done. The list of O(n^2) employed items is checked to see whether a combination is possible. There are at most O(n) items which allow a combination. For each of these possible combinations we have to check O(n^2) pending items to see whether it has been introduced before. Hence, for each of the O(n^2) items that are introduced O(n^3) steps have to be taken. Therefore the time complexity of the algorithm is O(n^5).

The algorithm recognizes. More has to be done if a parse tree is requested. Starting with an item [0, S, n] we can in a recursive top-down manner try to find items whose nonterminals can be concatenated to right-hand sides of productions. For each of the O(n) nonterminals which are used in a parse tree for a sentence of a CNF grammar it will be necessary to do O(n^3) comparisons. Therefore a parse tree of a recognized sentence can be obtained in O(n^4) time.

The algorithm which we introduced is simply adaptable to grammars with right-hand sides of productions that have a length greater than two. However, since the number of possible combinations corresponding with a right-hand side increases with the length of the right-hand sides of the productions, this has negative consequences for the time complexity of the algorithm. If p is the length of the longest right-hand side then recognition requires O(n^{p+3}) comparisons.

We are not particularly satisfied with the time bounds obtained here. Nevertheless, despite the naivety of the algorithm its bounds are polynomial. Improvements of the algorithm can be obtained by giving more structure to the basic chart. This will, however, decrease the flexibility of the method. It is rather unsatisfactory that each time a new item is created the algorithm cycles through all its items to see whether it is already present. Any programmer who is asked to implement this algorithm would try (and succeed) to give more structure to the chart in order to avoid this. Moreover, a more refined structure can make it unnecessary to consider all items when combinations corresponding to right-hand sides are tried. For example, when an item of the form [j, Y, k] is removed from the agenda, then only items with a right position marker j and items with a left position marker k have to be considered for combination. Rather than having O(n^3) comparisons for each item it could be sufficient to have O(n^2) comparisons for each item, giving the algorithm a time complexity of O(n^4) for grammars in CNF. Below such a version of the CYK algorithm is given. We will return to the algorithm given here when we discuss parallel implementations of the CYK algorithms.

CYK Algorithm (String Chart Version)

Let a_1...a_n be a string of terminal symbols. The string will be read from left to right. For each symbol a_j an item set I_j will be constructed. Each item set consists of elements of the form [i, A], where A is a nonterminal symbol and i, with 0 ≤ i ≤ n-1, is a (start) position marker. For any input string x = a_1...a_n item sets I_1,..., I_n will be constructed.

(1) I_1 = { [0, A] | A → a_1 is in P }.

(2) Having constructed I_1,..., I_{j-1}, construct I_j in the following way. Initially, I_j = ∅.

(2.1) Add [j-1, A] to I_j for any production A → a_j in P. Now perform step (2.2) until no new items will be added to I_j.

(2.2) Let [k, C] be an item in I_j. For each item of the form [i, B] in I_k such that there is a production A → BC in P, add [i, A] (provided it has not been added before) to I_j.

It is not difficult to see that the algorithm satisfies the following property: [i, A] ∈ I_j if and only if A ⇒* a_{i+1}...a_j. A string will be recognized as soon as [0, S] appears in the item set I_n.

As an example consider the sentence "John saw Mary with Linda" generated by our example grammar. We again allow a slight modification of our algorithm in handling the rule NP → *n. In Fig. 3 the five constructed item sets are displayed. Notice that each of the five item sets in Fig. 3 is in fact a basic chart, i.e., a data structure consisting of an agenda, a work area (the employed items) and rules for constructing items and moving items from the agenda to the work area. Except for item [3, PP], each application of step (2.2) in this example yields one element only, which is put in the item set and immediately employed in the next step.
In general, looking for a combination may produce several new items, and items in an item set may therefore have to wait on an internal agenda before they are employed to construct new items. Therefore we can say that the algorithm uses a string of basic charts, which explains the name we have chosen for this CYK version.
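The following Python sketch (ours; the unary/binary grammar encoding is the same illustrative assumption as before) shows the string chart version, with the internal agenda made explicit.

    def string_chart_recognize(words, unary, binary, start="S"):
        n = len(words)
        I = [set() for _ in range(n + 1)]          # I[1] .. I[n] are used
        for j in range(1, n + 1):
            # step (2.1): items [j-1, A] for every A deriving a_j
            I[j] = {(j - 1, A) for A in unary.get(words[j - 1], ())}
            agenda = list(I[j])                    # the internal agenda of I_j
            # step (2.2): combine with earlier item sets until nothing is new
            while agenda:
                (k, C) = agenda.pop()
                if k == 0:
                    continue                       # no item set ends at position 0
                for (i, B) in list(I[k]):
                    for A in binary.get((B, C), ()):
                        if (i, A) not in I[j]:
                            I[j].add((i, A))
                            agenda.append((i, A))
        return (0, start) in I[n]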

    I_1        I_2       I_3       I_4          I_5

    [0,*n]     [1,*v]    [2,*n]    [3,*prep]    [4,*n]
    [0,NP]               [2,NP]                 [4,NP]
                         [1,VP]                 [3,PP]
                         [0,S]                  [2,NP]
                                                [0,S]
                                                [1,VP]

Fig. 3 String chart parsing of "John saw Mary with Linda".

In a practical implementation it is possible to attach a tag to each item to mark whether it is waiting or not, or to put waiting items indeed on a separate list.

In the string chart version we have chosen a left-to-right order for constructing item sets. We could as well have chosen a right-to-left order corresponding with a right-to-left reading of the sentence. The reader will have no difficulties in writing down this string version of the CYK algorithm. With the sentence "John saw Mary with Linda" generated by our example grammar this version should produce the five item sets of Fig. 4 (first I_4, then I_3, etc.).

    I_0        I_1       I_2       I_3          I_4

    [*n,1]     [*v,2]    [*n,3]    [*prep,4]    [*n,5]
    [NP,1]     [VP,3]    [NP,3]    [PP,5]       [NP,5]
    [S,3]      [VP,5]    [NP,5]
    [S,5]

Fig. 4 Right-to-left string chart parsing of "John saw Mary with Linda".

With the graphical representation usually associated with chart parsing, the (left-to-right) string chart version of the CYK algorithm enforces an order on the construction of the spanning edges. In each item set I_j we collect all the items whose associated edges terminate in vertex j. Hence, a corresponding construction of the chart of Fig. 5 would start at vertex 1, where all edges which may come from the left and terminate in 1 are drawn; then we go to vertex 2 and draw all the edges coming from the left and terminating in 2, etc.

[Fig. 5 The CYK chart for "John saw Mary with Linda": vertices 0-5 connected by the words John, saw, Mary, with, Linda, with spanning edges *n (0-1), NP (0-1), *v (1-2), *n (2-3), NP (2-3), VP (1-3), S (0-3), *prep (3-4), *n (4-5), NP (4-5), PP (3-5), NP (2-5), VP (1-5) and S (0-5).]

As we noted above, whenever with step (2.2) of the algorithm an item [i, A] in item set I_j leads to the introduction of more than one item in I_j, then there is no particular order prescribed for processing them.

For the graphical representation of Fig. 5 this means that, when we draw an edge from vertex i to vertex j with label A, there can be more than one edge terminating in i with which new spanning edges terminating in j can be drawn, and the order in which these edges will be drawn is not yet prescribed.

Before discussing the time complexity of the algorithm we show some different ways of constructing the item sets. The string chart version discussed above can be said to be, in a modest way, item driven. That is, in order to add a new item to an item set we consider an item which is already there and then we visit a previously constructed item set to look for a possible combination of items. The position marker of an item determines which of the previously constructed sets has to be visited. Without loss of correctness the string chart version can be presented such that in order to construct a new item the previously constructed item sets are visited in a strict right-to-left order. This will reduce the (slightly) available freedom in the algorithm to choose a particular item in the set that is under construction and see whether it gives rise to the introduction of a new item. Below this revised string chart version of the CYK algorithm is presented.

For any input string x = a_1...a_n item sets I_1,..., I_n will be constructed.

(1) I_1 = { [0, A] | A → a_1 is in P }.

(2) Having constructed I_1,..., I_{j-1}, construct I_j in the following way. Initially, I_j = ∅.

(2.1) Add [j-1, A] to I_j for any production A → a_j in P. Now perform step (2.2), first for k = j-1 and then for decreasing k until this step has been performed for k = 1.

(2.2) Let [k, C] be an item in I_j. For each item of the form [i, B] in I_k such that there is a production A → BC in P, add [i, A] (provided it has not been added before) to I_j.

A further reduction of the freedom which is still available in the algorithm is obtained if we demand that in step (2.2) the items of the form [i, B] in I_k are tried in a particular order of i. In a decreasing order first the items with i = k-1 are tried, then those with i = k-2, and so on. A second alternative of the string chart version is obtained if we demand that in I_j first all items with position marker j-1, then all items with position marker j-2, etc., are computed. This leads to the following algorithm.

For any input string x = a_1...a_n item sets I_1,..., I_n will be constructed.

(1) I_1 = { [0, A] | A → a_1 is in P }.

(2) Having constructed I_1,..., I_{j-1}, construct I_j in the following way. Initially, I_j = ∅.

(2.1) Add [j-1, A] to I_j for any production A → a_j in P. Now perform step (2.2), first for i = j-2 and then for decreasing i until this step has been performed for i = 0.

(2.2) Add [i, A] to I_j if, for some k such that i < k < j, [i, B] ∈ I_k, [k, C] ∈ I_j, A → BC is a production rule, and [i, A] is not already present in I_j.

We will return to this version when we introduce a tabular chart version of the CYK algorithm.

We turn to a global analysis of the time complexity of the algorithm. We confine ourselves to the first algorithm that has been presented in this subsection. The essential difference between this CYK version and the basic chart version is of course that rather than checking an unstructured set of O(n^2) items we can now always restrict ourselves to sets of O(n) items. That is, each item set has a number of items proportional to n.
In order to add an item to an item set I_j it is necessary to consider O(n) items in a previously constructed item set. Each previously constructed item set will be asked to contribute to I_j a bounded number of times (always less than or equal to the number of nonterminals of the grammar).

Hence, the total number of combinations that are tried is O(n^2). For each of these attempts to construct a new item we have to check the O(n) items that are available in I_j in order to see whether it has been added before. Hence, constructing an item set requires O(n^3) steps and recognition of the input takes O(n^4) steps. Parse trees of a recognized sentence can be constructed by starting with item [0, S] in I_n and checking the item sets for its constituents. This process can be repeated recursively for the constituents that are found. In this way each parse tree can be constructed in a top-down manner in O(n^3) time. As demonstrated in the example, the CYK version discussed here is simply adaptable to grammars that do not fully satisfy the CNF condition or to grammars whose productions have right-hand sides with lengths greater than two.

Although the time bounds for this algorithm are better than for the previous CYK version, we still can do better by further refining the data structure that is used. This can be done by introducing some list processing techniques in the algorithm, but it can be shown more clearly by introducing a tabular chart parsing version of the CYK method. This will be done in the next subsection. We will return to the string chart version when we discuss parallel implementations of the CYK algorithm.

CYK Algorithm (Tabular Chart Version)

It is usual to present the CYK algorithm in the following form (Graham and Harrison [1976], Harrison [1978]). For any string x = a_1 a_2...a_n to be parsed an upper-triangular (n+1) × (n+1) recognition table T is constructed. Each table entry t_{i,j} with i < j will contain a subset of N (the set of nonterminal symbols) such that A ∈ t_{i,j} if and only if A ⇒* a_{i+1}...a_j. Hence, in this version the nonterminals are the items and each table entry is an item set. String x belongs to L(G) if and only if S is in t_{0,n} when the construction of the table is completed. Fig. 6 may be helpful in understanding the algorithm.

For any input string x = a_1...a_n an upper-triangular (n+1) × (n+1) table T will be constructed. Initially, table entries t_{i,j} with i < j are empty. Assume that the input string, if desired terminated with an endmarker, is available on the matrix diagonal (cf. Fig. 6).

(1) Compute t_{i,i+1}, as i ranges from 0 to n-1, by placing A in t_{i,i+1} exactly when there is a production A → a_{i+1} in P.

(2) In order to compute t_{i,j}, j-i > 1, assume that all entries t_{k,l} with l ≤ j, k ≥ i and (k,l) ≠ (i,j) have been computed. Add A to t_{i,j} if, for some k such that i < k < j, B ∈ t_{i,k}, C ∈ t_{k,j}, A → BC is a production rule, and A is not already present in t_{i,j}.

Unlike the previous versions of the CYK algorithm and unlike an algorithm of Yonezawa and Ohsawa [1988], which needed explicit representations of positional information, here this information is available in the indices of the table entries. For example, t_{1,4} consists of all nonterminals that generate the substring of the input between positions 1 and 4. Notice that after step (1) the computation of the entries can be done diagonal by diagonal until entry t_{0,n} has been completed. However, the exact order in which the entries are computed is not prescribed. Rather than proceeding diagonal by diagonal it is possible to proceed column by column such that each column is computed from the bottom to the top. For each entry of a diagonal only entries of preceding diagonals are used to compute its value.
More specifically, in order to compute whether a nonterminal should be included in an entry t_{i,j} it is necessary to compare t_{i,k} and t_{k,j} for each k between i and j. The order in which this may be done is, provided these entries are all available, arbitrary.
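A Python sketch of the tabular chart version (ours, with the same illustrative grammar encoding as in the earlier sketches) makes the systematic, diagonal-by-diagonal control structure explicit.

    def cyk_table(words, unary, binary):
        n = len(words)
        t = {(i, j): set() for i in range(n + 1) for j in range(i + 1, n + 1)}
        for i in range(n):                        # step (1): the first diagonal
            t[(i, i + 1)] = set(unary.get(words[i], ()))
        for d in range(2, n + 1):                 # remaining diagonals, d = j - i
            for i in range(n - d + 1):
                j = i + d
                for k in range(i + 1, j):         # step (2): compare t[i,k], t[k,j]
                    for B in t[(i, k)]:
                        for C in t[(k, j)]:
                            t[(i, j)] |= binary.get((B, C), set())
        return t

The input is recognized if and only if the start symbol is in cyk_table(words, unary, binary)[(0, n)].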

    a_1   0,1   0,2   0,3   0,4   0,5
          a_2   1,2   1,3   1,4   1,5
                a_3   2,3   2,4   2,5
                      a_4   3,4   3,5
                            a_5   4,5
                                  $

Fig. 6 The upper-triangular CYK-table.

Due to the systematic construction of each entry the need for an internal agenda has gone. The item driven aspect from the previous versions has disappeared completely and what remains is a control structure in which even empty item sets have to be visited. In Fig. 7 we have displayed the CYK table for example sentence "John saw Mary with Linda" produced by our running example grammar.

    John   *n,NP   -        S        -        S
           saw     *v       VP       -        VP
                   Mary     *n,NP    -        NP
                            with     *prep    PP
                                     Linda    *n,NP

Fig. 7 Tabular chart parsing of "John saw Mary with Linda" (empty entries are marked "-").

We leave it to the reader to provide a tabular chart version which computes the entries column by column (each column bottom-up) from right to left. Since the tabular CYK chart is a direct representation of the parse tree(s) for a sentence, the strictly upper-triangular tables produced by this right-to-left version do not differ from those produced by the left-to-right version. Also left to the reader is the construction of the graphical representation according to the rules of this tabular chart version. For that construction it is necessary to distinguish between the column by column and the diagonal by diagonal computation of the entries.

In order to discuss the time complexity of this version, observe that the number of elements in each entry is limited by the number of nonterminal symbols in the grammar and not by the length of the sentence. Therefore, the amount of storage that is required by this method is only proportional to n^2. Notice that for each of the O(n^2) entries O(n) comparisons have to be made; therefore the number of elementary operations is proportional to n^3. An implementation can be given which accesses the table entries in such a way that the time bound of the algorithm is indeed O(n^3). Starting at the upper right entry we can descend to the first diagonal and construct a parse tree in O(n^2) time. This can be reduced to linear time if during the construction of the table with each nonterminal in t_{i,j} pointers are added to the nonterminals in t_{i,k} and t_{k,j} that caused it to be placed in t_{i,j}. Provided that the grammar has a bounded ambiguity this can be done for each possible realization of a nonterminal in an entry without increasing the O(n^3) time bound for recognition.
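The pointer technique just described can be sketched as follows (our illustration, not the paper's implementation): during table construction we record, for each nonterminal placed in t_{i,j}, one pair of constituents that caused it, and afterwards we descend along these pointers.

    def cyk_with_pointers(words, unary, binary):
        n = len(words)
        back = {(i, j): {} for i in range(n + 1) for j in range(i + 1, n + 1)}
        for i in range(n):
            for A in unary.get(words[i], ()):
                back[(i, i + 1)][A] = None                 # leaf: A -> a_{i+1}
        for d in range(2, n + 1):
            for i in range(n - d + 1):
                j = i + d
                for k in range(i + 1, j):
                    for B in back[(i, k)]:
                        for C in back[(k, j)]:
                            for A in binary.get((B, C), ()):
                                # keep one realization per nonterminal
                                back[(i, j)].setdefault(A, ((i, k, B), (k, j, C)))
        return back

    def extract_tree(back, words, i, j, A):
        ptr = back[(i, j)][A]
        if ptr is None:
            return (A, words[i])
        left, right = ptr
        return (A, extract_tree(back, words, *left),
                   extract_tree(back, words, *right))

Since every pointer is followed exactly once, extract_tree indeed delivers a parse tree in time linear in the size of the tree.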

3. A recognition scheme

Given a context-free grammar G = (N, Σ, P, S) and a sentence w ∈ Σ*, the recognition problem for G is the problem of evaluating the predicate RECOGNIZE(G, w), which is true if S ⇒*_G w, and false otherwise. A recognizer for a CFG G is a predicate REC_G such that REC_G(w) is true if w is derivable from S using productions in G, and false otherwise. A parser is an implementation of a parse function, i.e. a function that not only recognizes a sentence but, in addition, delivers a representation of the (or all) derivations of the sentence.

Let G be a context-free grammar and w = a_1...a_n a sentence to be recognized. Our algorithm works according to a recognition scheme of which Earley's algorithm is also a particular instance. For every symbol a_j of the input sentence we construct a set of parse objects I_j. (The parse objects are sets of items in Earley's algorithm. There, an item has the form [A → α•β, i], where A → αβ is a production and i is some number with 0 ≤ i ≤ j.) The initial step of the algorithm is the construction of the set of parse objects I_0. The main part is the construction of the set I_j from the previously constructed sets I_0, I_1,..., I_{j-1}. The final step of the algorithm checks whether the finally constructed set I_n satisfies a particular condition. (In Earley's algorithm, the final step tests whether an item [S → α•, 0] is a member of I_n.) If this is true, the sentence is accepted. The general scheme is shown in Figure 8.

    begin
        Initialize;                    { I_0 is computed }
        for j := 1 to n do Compute(I_j) end;
                                       { I_0,..., I_n are computed }
        Answer(I_n)
    end

Fig. 8 The recognition scheme

The version of Earley's algorithm that works according to this scheme is sometimes called the string version. There is also a tabular version, which uses a recognition matrix. (The tabular and the string version of Earley's method are explained in detail in section 4.) The tabular version of our method can best be seen as a variant of the CYK method. Let G = (N, Σ, P, S) be a CFG in Chomsky normal form. In the CYK method, nonterminal A ∈ N is an element of the entry T_{i,j} of the recognition matrix T (with 0 ≤ i, j ≤ n) if and only if A ⇒* a_{i+1}...a_j. Hence, the sentence a_1...a_n is accepted if S ∈ T_{0,n}, and rejected otherwise. The well-known CYK recognition algorithm for the construction of the recognition matrix T is shown in Figure 9. The operator * in this algorithm is defined as follows:

    X * Y = { A | A → BC ∈ P, B ∈ X and C ∈ Y }

    begin CYK-recognizer;
        for i := 0 to n-1 do
            T_{i,i+1} := { A | A → a_{i+1} ∈ P }
        end;
        for d := 2 to n do
            for i := 0 to n-d do
                j := i+d;
                T_{i,j} := ∪_{k=i+1}^{j-1} ( T_{i,k} * T_{k,j} )
            end
        end;
        if S ∈ T_{0,n} then accept else reject end
    end CYK-recognizer.

Fig. 9 The CYK-recognizer

4. Earley's recognition method

Earley's algorithm can be applied to arbitrary CFGs. An arbitrary CFG may contain ε-rules and may be cyclic. A CFG G = (N, Σ, P, S) is cyclic if for some nonterminal A ∈ N: A ⇒+ A. Using Earley's method we obtain a result, the parse objects, from which we can derive the parse tree of the input sentence. In this method the parse objects are called item sets. An item [A → X_1...X_k • X_{k+1}...X_m, i] is a tuple consisting of a production of G with a dot (that marks a position) in the right-hand side (a dotted rule), and a number. Item [A → X_1...X_k • X_{k+1}...X_m, i] is an item for a_1...a_n if A → X_1...X_m ∈ P, 0 ≤ k ≤ m and 0 ≤ i ≤ n.

Suppose that we want to recognize the sentence a_1 a_2...a_n. We define sets I_0, I_1,..., I_n of items for a_1...a_n such that they have the following characteristic property:

    [A → α•β, i] ∈ I_j if and only if A → αβ ∈ P, S ⇒* a_1...a_i Aδ, and α ⇒* a_{i+1}...a_j.

From this property it follows immediately that [S → α•, 0] ∈ I_n if and only if there is a production S → α in P and α ⇒* a_1...a_n; hence, if and only if a_1...a_n is in L(G), the language generated by G. Intuitively, [A → α•β, i] ∈ I_j means that the production A → αβ can be applied in the left-most derivation, such that the prefix a_1...a_i has already been recognized, and that the substring a_{i+1}...a_j can be generated from α. The part β behind the dot in this item is only predicted. Earley's algorithm is a method to compute the item sets I_j such that the characteristic property holds. In this method all possible left-most derivations of the prefixes of a sentence are put together in item sets. Since left-most derivations are considered simultaneously (i.e. in a breadth-first way), the problem with left-recursive grammars that we encounter if we first completely work out one possible choice in a left-most derivation (as in a depth-first method) is avoided.

We present the original (string) version of Earley's algorithm for constructing the item sets.

Earley's Recognition Algorithm

Construct I_0, following steps (1), (2) and (3) below.

(1) I_0 := ∅; if S → α is in P, then add [S → •α, 0] to I_0. Apply steps (2) and (3) as long as new items can be added to I_0.

(2) If [B → γ•, 0] is in I_0, then add item [A → αB•β, 0] to I_0 for all items of the form [A → α•Bβ, 0] ∈ I_0.

(3) If [A → α•Bβ, 0] ∈ I_0, then add, for all productions B → γ in P, [B → •γ, 0] to I_0, if this item is not already in the set.

Construct item set I_j from the already constructed sets I_0,..., I_{j-1}.

(4) SCANNER: If [B → α•aβ, i] ∈ I_{j-1} and a = a_j, then add [B → αa•β, i] to I_j. Apply steps (5) and (6) as long as new items can be added to I_j.

(5) COMPLETER: For all items in I_j of the form [A → α•, k] for which an item in I_k of the form [B → δ•Aβ, i] exists, add item [B → δA•β, i] to I_j.

(6) PREDICTOR: If [A → α•Bβ, i] ∈ I_j, then add, for each production B → γ in P, item [B → •γ, j] to I_j (if it has not already been added to this set).

The names SCANNER, COMPLETER and PREDICTOR denote what happens in each step. The scanner operation reads a new symbol from the sentence and introduces new items from those that expected this symbol: the dot is placed just behind the terminal symbol in the new items. The completer operation is applicable if a production has been completely recognized: the dot is placed at the end in the item. The number k in the item [A → γ•, k] ∈ I_j denotes that item [A → •γ, k] was introduced in item set I_k; hence, in this case, γ ⇒* a_{k+1}...a_j. The predictor operation introduces (predicts) productions that are possibly applied in the left-most derivations. The construction of the item sets for a particular CFG and an input sentence is shown in the following example.

Example 1. Let the CFG G_1 be given by the following productions.

    S → aBD     B → HCa | ε     H → ε     D → bb     C → F     F → B

If we apply Earley's algorithm to the sentence aaabb, we obtain the sets of items as shown in Figure 10.

In his thesis, Earley offers the following suggestions for an efficient implementation of his method.

I.   For the predictor operation it is efficient to have for each nonterminal A ∈ N a list containing the A-productions (i.e. productions of the form A → α).

II.  The items of a set are placed in a pointer list in order to have an easy and efficient implementation for the scanner operation.

III. At the time item set I_j is constructed, make a vector V_j of pointers, in which V_j(i) points at the list of items of which the second component is i. Then, if we want to check whether an item [A → α•β, i] has already been added to I_j, we just have to check V_j(i).

IV.  For the completer operation, we keep for each set I_j and each nonterminal B a list of those items that have the form [A → α•Bβ, k].
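The four suggestions above concern constant factors; the algorithm itself can be sketched quite compactly. The following Python version (ours) encodes an item [A → α•β, i] as a tuple (A, rhs, dot, i), with the grammar given as a dict from each nonterminal to a list of right-hand sides (tuples over terminals and nonterminals); all encodings are illustrative assumptions. Steps (5) and (6) are simply repeated until nothing new can be added, exactly as prescribed.

    def earley_recognize(words, grammar, start="S"):
        n = len(words)
        nonterms = set(grammar)
        I = [set() for _ in range(n + 1)]
        I[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}   # step (1)
        for j in range(n + 1):
            if j > 0:                                           # (4) SCANNER
                I[j] = {(A, rhs, d + 1, i) for (A, rhs, d, i) in I[j - 1]
                        if d < len(rhs) and rhs[d] == words[j - 1]}
            changed = True
            while changed:                  # apply (5) and (6) to a fixpoint
                changed = False
                for (A, rhs, d, i) in list(I[j]):
                    if d == len(rhs):                           # (5) COMPLETER
                        new = {(B, r2, d2 + 1, k)
                               for (B, r2, d2, k) in I[i]
                               if d2 < len(r2) and r2[d2] == A}
                    elif rhs[d] in nonterms:                    # (6) PREDICTOR
                        new = {(rhs[d], r2, 0, j)
                               for r2 in grammar.get(rhs[d], ())}
                    else:
                        new = set()
                    if not new <= I[j]:
                        I[j] |= new
                        changed = True
        return any(A == start and d == len(rhs) and i == 0
                   for (A, rhs, d, i) in I[n])

With grammar = {"S": [("a","B","D")], "B": [("H","C","a"), ()], "H": [()], "D": [("b","b")], "C": [("F",)], "F": [("B",)]} this accepts the sentence aaabb of Example 1.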

    I_0:  [S → •aBD, 0]

    I_1:  SCANNER:   [S → a•BD, 0]
          PREDICTOR: [B → •HCa, 1], [B → •, 1], [H → •, 1]
          COMPLETER: [B → H•Ca, 1]
          PREDICTOR: [C → •F, 1], [F → •B, 1]
          COMPLETER: [S → aB•D, 0], [F → B•, 1], [C → F•, 1], [B → HC•a, 1]
          PREDICTOR: [D → •bb, 1]

    I_2:  SCANNER:   [B → HCa•, 1]
          COMPLETER: [S → aB•D, 0], [F → B•, 1], [C → F•, 1], [B → HC•a, 1]
          PREDICTOR: [D → •bb, 2]

    I_3:  SCANNER:   [B → HCa•, 1]
          COMPLETER: [S → aB•D, 0], [F → B•, 1], [C → F•, 1], [B → HC•a, 1]
          PREDICTOR: [D → •bb, 3]

    I_4:  SCANNER:   [D → b•b, 3]

    I_5:  SCANNER:   [D → bb•, 3]
          COMPLETER: [S → aBD•, 0]

Fig. 10 The item sets for Example 1

We now consider the complexity of Earley's algorithm. For arbitrary CFGs, the time complexity of this recognition method is O(n^3), in which n is the length of the input sentence. Hence, there is a constant C, not depending on n, such that for sufficiently long input the time needed for recognizing the input does not exceed C·n^3. The constant C depends on the size of the grammar. T(n), the time needed for an input of length n, is about |P|^3 · m^2 · n^3, in which m is the maximal length of the right-hand side of a production in the grammar. From this estimate we see that, if we apply Earley's method to small sentences (more common in natural languages than in programming languages), we definitely have to take the constant C into account.

For unambiguous grammars, the time complexity is O(n^2), and the same remarks concerning the constant factor hold. Dealing with an unambiguous grammar, it is not necessary to check whether an item [A → α•β, i], in which α ≠ ε, has already been added to the item set I_j. In this case, an item can only be added by one combination of items.

Therefore we gain a factor n in the time complexity if we compare this with that for the general case. For deterministic context-free languages (i.e. the languages generated by LR(1) grammars), the time needed is linear in the length of the input sentence. It may be better, in this case, to use a recognition or parsing method especially appropriate for LR grammars or some subclass of this grammar class.

The amount of space needed for the item sets is O(n^2). This also holds for unambiguous grammars. The number of items in a set I_j is O(j). The completer operation claims most of the space. For each item [A → γ•, i] in I_j this operation can add at most q·i new items, where q only depends on the grammar, not on n and j.

We now present a way to obtain the parse tree, or a representation of this tree, from the item sets constructed during recognition. Let I_0,..., I_n be the item sets constructed in processing the sentence a_1...a_n. The parse algorithm, which delivers a right-most derivation, works correctly for cycle-free CFGs only. If there is no item of the form [S → α•, 0] in I_n, then the sentence is not accepted. If there is such an item in set I_n, then the parse algorithm, shown in Figure 11, is activated. The initial call is Parse([S → α•, 0], n).

    Parse([A → X_1...X_m •, i], j);
    { k and q are local variables }
    begin
        { assume A → X_1...X_m has production number p }
        write p to output;
        k := m; q := j;
        while k > 0 do
            if X_k ∈ Σ then
                k := k-1; q := q-1
            else
                search for an item [X_k → γ•, r] in I_q for which
                    [A → X_1...X_{k-1} • X_k...X_m, i] is in I_r;
                Parse([X_k → γ•, r], q);
                k := k-1; q := r
            end
        end
    end.

Fig. 11 The parse algorithm for Earley's method

A CFG is cyclic if and only if it is infinitely ambiguous, i.e. it generates a sentence with an unbounded number of parse trees. It is not possible to deliver, one by one, all parse trees of an infinitely ambiguous sentence in bounded time. We need some finite representation of this infinite set of trees. Such a representation is a shared forest, a finite graph that represents all possible parse trees of a sentence. We refer to the literature on shared forests for more details. For cycle-free grammars, the parsing algorithm can be augmented in such a way that it delivers all parse trees of a sentence, one by one. Also for CFGs that are not cyclic, the number of parse trees can be quite large. This number may even grow exponentially with the length of the input. This is, for example, the case with the grammar having productions S → SS, S → a. For this grammar, the time complexity of the all-parse parser will, of course, also be exponential.
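A Python rendering of the parse procedure of Fig. 11 (ours), built on the item encoding of the recognizer sketch above, collects the productions of a right-most derivation in a list instead of writing production numbers.

    def parse(I, nonterms, item, j, out):
        (A, rhs, _, i) = item          # item is [A -> X_1...X_m ., i] in I_j
        out.append((A, rhs))           # "write p to output"
        k, q = len(rhs), j
        while k > 0:
            X = rhs[k - 1]
            if X not in nonterms:      # terminal: step one position left
                k, q = k - 1, q - 1
            else:                      # search a completed item for X
                for (B, r2, d2, r) in I[q]:
                    if (B == X and d2 == len(r2)
                            and (A, rhs, k - 1, i) in I[r]):
                        parse(I, nonterms, (B, r2, d2, r), q, out)
                        k, q = k - 1, r
                        break
        return out

The initial call is parse(I, nonterms, final_item, n, []) for a final item (S, α, len(α), 0) found in I[n]; as in the text, termination is only guaranteed for cycle-free grammars.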

We now discuss tabular variants of Earley's method (Earley [1968]). In this tabular recognition method, for a sentence w = a_1 a_2...a_n, an (n+1) × (n+1) recognition matrix T is constructed. The elements of T are sets of dotted rules A → α•β, which we will also call items here. T_{i,j} denotes the element in the j-th column and the i-th row of T. The j-th column of this matrix coincides with the item set I_j in the string version of Earley's method. Hence, if item [A → α•β, k] is an element of item set I_j, then A → α•β will be in T_{k,j}. Because of the characteristic property satisfied by the item sets in the string version, the following will hold: A → α•β is an element of T_{i,j} if and only if there is a derivation S ⇒* a_1...a_i Aδ and, moreover, α ⇒* a_{i+1}...a_j. And, again, as a particular case of this characteristic property, S → α• is in T_{0,n} for some S-production S → α in P if and only if S ⇒ α ⇒* a_1...a_n.

We define the following operations on sets of items. These operations are used in the tabular recognizer, shown in Figure 12. Let G = (N, Σ, P, S) be a CFG, Q a set of items of G, and M a subset of V.

    Q ⊗ M = { A → αBβ•γ | A → α•Bβγ ∈ Q, β ⇒* ε, B ∈ M }

We now extend this product operation. Q and R now both are sets of items of G:

    Q × R = { A → αBβ•γ | A → α•Bβγ ∈ Q, β ⇒* ε, B → η• ∈ R }

and

    Q * R = { A → αBβ•γ | A → α•Bβγ ∈ Q, β ⇒* ε, B ⇒* C, C → η• ∈ R }.

Notice that Q × R ⊆ Q * R.

Let M be a subset of N. We define the function Predict as follows:

    Predict(M) = { C → γ•ξ | C → γξ ∈ P, γ ⇒* ε, B ⇒* Cη, B ∈ M, η ∈ V* }

Predict satisfies the following properties, which can be used for its evaluation when it is applied to a nonterminal or a set of nonterminals. Let X and Y be sets; then Predict(X ∪ Y) = Predict(X) ∪ Predict(Y). If A ⇒ Bα, then Predict(B) ⊆ Predict(A). Hence, Predict({A}) = { A → α•β | A → αβ ∈ P, α ⇒* ε } ∪ Predict(X), in which X = { B | A → Bγ ∈ P }. Notice that Predict only depends on the grammar G and can therefore be computed before recognition of a particular sentence. We extend the definition of Predict so that it can be applied to item sets R, as follows:

    Predict(R) = Predict({ B | A → α•Bβ ∈ R })

Notice that the operators * and × comprise the predictor and completer operations of Earley's method. Furthermore, the predictor part in this algorithm only computes the diagonal elements of the recognition matrix T.
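Because Predict depends only on G, it can be tabulated in advance. The following Python sketch (ours) does so via the left-corner relation B ⇒* Cη; it assumes a precomputed set nullable of nonterminals deriving ε and the grammar encoding of the earlier sketches, and it represents a dotted rule C → γ•ξ as a triple (C, rhs, dot).

    def left_corners(grammar, nullable):
        """lc[B] = all nonterminals C with B =>* C eta."""
        lc = {B: {B} for B in grammar}
        changed = True
        while changed:
            changed = False
            for B in grammar:
                for C in list(lc[B]):
                    for rhs in grammar.get(C, ()):
                        for X in rhs:             # walk over a nullable prefix
                            if X in grammar and X not in lc[B]:
                                lc[B].add(X)
                                changed = True
                            if X not in nullable:
                                break
        return lc

    def predict(M, grammar, nullable, lc):
        """Dotted rules C -> gamma . xi with gamma =>* eps, B =>* C eta, B in M."""
        items = set()
        for B in M:
            for C in lc.get(B, ()):
                for rhs in grammar.get(C, ()):
                    d = 0
                    items.add((C, rhs, d))        # gamma = epsilon
                    while d < len(rhs) and rhs[d] in nullable:
                        d += 1                    # extend gamma over nullables
                        items.add((C, rhs, d))
        return items

The identity Predict(X ∪ Y) = Predict(X) ∪ Predict(Y) then reduces Predict of an item set to a union of precomputable per-nonterminal tables.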

    begin RECOGNIZER;
        T_{0,0} := Predict({S});
        for j := 1 to n do
            SCANNER:
                for i := 0 to j-1 do
                    T_{i,j} := T_{i,j-1} ⊗ {a_j}
                end;
            COMPLETER:
                for k := j-1 downto 0 do
                    T_{k,j} := T_{k,j} ∪ ( T_{k,k} * T_{k,j} );
                    for i := k-1 downto 0 do
                        T_{i,j} := T_{i,j} ∪ ( T_{i,k} × T_{k,j} )
                    end
                end;
            PREDICTOR:
                T_{j,j} := Predict( ∪_{0 ≤ i ≤ j-1} T_{i,j} )
        end;
        if S → α• ∈ T_{0,n} for some production S → α then accept else reject end
    end RECOGNIZER.

Fig. 12 The tabular recognizer

Example 2. Let G_2 be a CFG given by the following set of rules.

    S → aBC     B → bBH | b     C → c     H → d

If we apply the tabular recognizer of Figure 12 to the sentence abbbddc, we obtain the recognition matrix in Figure 13. From the fact that S → aBC• is an element of T_{0,7} we conclude that abbbddc is in the language L(G_2).

The difference between this tabular version and the original string version of Earley's method becomes apparent if we apply them to grammars that contain ε-rules. To see this, consider the grammar with the productions S → AB, A → ε, B → CDC, C → ε, D → a. This grammar has only one tree, namely the one for the sentence a. According to the string version of Earley's method, item [C → •, 1] is added to I_1 before item [B → CDC•, 0] is added to I_1. If we apply the tabular algorithm, however, C → • is added at the end. Here, item B → CDC• is already added by the completer operation, i.e. before item C → • is added.

The nonempty entries of the recognition matrix are the following.

    T_{0,0} = { S → •aBC }      T_{0,1} = { S → a•BC }      T_{0,2} = { S → aB•C }
    T_{0,6} = { S → aB•C }      T_{0,7} = { S → aBC• }
    T_{1,1} = { B → •bBH, B → •b }
    T_{1,2} = { B → b•BH, B → b• }
    T_{1,3} = { B → bB•H }      T_{1,5} = { B → bB•H }      T_{1,6} = { B → bBH• }
    T_{2,2} = { B → •bBH, B → •b, C → •c }
    T_{2,3} = { B → b•BH, B → b• }
    T_{2,4} = { B → bB•H }      T_{2,5} = { B → bBH• }
    T_{3,3} = { B → •bBH, B → •b, H → •d }
    T_{3,4} = { B → b•BH, B → b• }
    T_{4,4} = { B → •bBH, B → •b, H → •d }
    T_{4,5} = { H → d• }
    T_{5,5} = { H → •d }        T_{5,6} = { H → d• }
    T_{6,6} = { C → •c }        T_{6,7} = { C → c• }

Fig. 13 The recognition matrix for Example 2

We call the algorithm in Figure 12 the column-by-column variant, and we present the entry-by-entry variant of the tabular recognition algorithm in Figure 14. In this algorithm the elements T_{i,j} are computed completely, element by element. Elements within one column are computed from the bottom to the top, except for the diagonal elements; they are computed last, as in the column-by-column variant. In the column-by-column variant as well as in the entry-by-entry variant the columns are computed from left to right. This is necessary because the diagonal elements must be computed before the elements in the next column can be computed. The diagonal sets in turn depend on the elements in the rows above them. This dependency is particularly bad for parallel computation of the elements. This can also be seen from the condition that S ⇒* a_1...a_i Aγ should hold in order for an item A → α•β to be an element of T_{i,j}. Intuitively, this means that the left context plays a role in determining the contents of the following entries in the recognition matrix. Especially this aspect, sometimes called top-down filtering, distinguishes Earley's method from the CYK method, which is purely bottom-up.

In order to remove the dependency on the elements in the main diagonal, we can fill the entries in this diagonal with the set Predict(N), N being the set of all nonterminal symbols of the grammar. Now we can compute the other elements of the matrix diagonal by diagonal, working from the main diagonal in right-upward direction. This method is followed in the diagonal-by-diagonal variant of the tabular recognition method, shown in Figure 15. First, the elements T_{i,j} are computed for which d = j - i, the difference between the column index j and the row index i, is 0. Then this difference d is incremented each time by one, and the elements of the corresponding diagonal are computed. The elements in each diagonal are computed from the bottom upwards.

    begin RECOGNIZER;
        T_{0,0} := Predict({S});
        for j := 1 to n do
            for i := j-1 downto 0 do
                SCANNER:
                    T_{i,j} := T_{i,j-1} ⊗ {a_j};
                COMPLETER:
                    for all k with i < k < j do
                        T_{i,j} := T_{i,j} ∪ ( T_{i,k} × T_{k,j} )
                    end;
                    T_{i,j} := T_{i,j} ∪ ( T_{i,i} * T_{i,j} )
            end;
            PREDICTOR:
                T_{j,j} := Predict( ∪_{0 ≤ i ≤ j-1} T_{i,j} )
        end;
        if S → α• ∈ T_{0,n} for some production S → α then accept else reject end
    end RECOGNIZER.

Fig. 14 The entry-by-entry variant

5. A recognition method without dots

In this section we present our general recognition method. It is based on the following simple property:

    S ⇒* w if and only if there is some α ∈ V* such that α ⇒* w and S ∈ GLS(α),

where for β ∈ V*,

    GLS(β) = { A | for some B ∈ N: A ⇒* B and B → β ∈ P }.

GLS is an abbreviation of Generalized Left-Hand Side.

At first, we restrict ourselves to ε-free CFGs, i.e. CFGs without ε-rules. We define the following predicates on strings in V*:

    Compl(α) = true if α is the right-hand side of some production of G;
    Extend(α) = true if α is a prefix of the right-hand side of some production of G.

If Extend(αX) holds, we say that α is extendible by X. Notice that since every string is a prefix of itself, Compl(α) implies Extend(α), and since ε is a prefix of every string, Extend(ε) holds.

The parse objects have the form [α, i], where α ∈ V* and 0 ≤ i ≤ n.

    begin RECOGNIZER;
        for i := 0 to n do
            T_{i,i} := Predict(N)
        end;
        for d := 1 to n do
            for i := 0 to n-d do
                COMPUTE(T_{i,i+d})
            end
        end;
        if S → α• ∈ T_{0,n} for some production S → α then accept else reject end
    end RECOGNIZER.

The procedure COMPUTE is the following.

    COMPUTE(T_{i,j});
    begin
        SCANNER:
            T_{i,j} := T_{i,j-1} ⊗ {a_j};
        COMPLETER:
            for k := j-1 downto i+1 do
                T_{i,j} := T_{i,j} ∪ ( T_{i,k} × T_{k,j} )
            end;
            T_{i,j} := T_{i,j} ∪ ( T_{i,i} * T_{i,j} )
    end COMPUTE.

Fig. 15 The diagonal-by-diagonal variant

We now specify for our method the procedures Initialize, Compute and Answer in the Recognition Scheme (Figure 8). Initialize sets I_0 to contain the object [ε, 0]. Compute(I_j) is defined as follows.

(1) Start with I_j = { [ε, j] }.

(2) SCANNER: For all objects [α, i] in I_{j-1} for which α is extendible by a_j, add [αa_j, i] to I_j. Repeat the following step until no more objects can be added to I_j.

(3) COMPLETER: If [α, i] ∈ I_j is such that Compl(α), then add [γB, k] to I_j for all [γ, k] in I_i for which γ is extendible by B, such that B ∈ GLS(α). (Notice that γ is possibly ε.)

Answer checks whether [S, 0] is a member of I_n.
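A compact Python sketch of the whole method (ours; for ε-free grammars, with the dict-of-right-hand-sides grammar encoding of the earlier sketches) precomputes the prefixes, the right-hand sides and the GLS sets, and then runs Initialize, Compute and Answer. The explicit special case for the start symbol is our addition; it makes [S, 0] constructible even when S occurs in no right-hand side (cf. Theorem 3 below).

    def dotless_recognize(words, grammar, start="S"):
        n = len(words)
        nonterms = set(grammar)
        rhss = {rhs for A in grammar for rhs in grammar[A]}
        prefixes = {rhs[:k] for rhs in rhss for k in range(len(rhs) + 1)}
        # derives[A] = {B | A =>* B}; for eps-free G only unit chains qualify
        derives = {A: {A} for A in nonterms}
        changed = True
        while changed:
            changed = False
            for A in nonterms:
                for B in list(derives[A]):
                    for rhs in grammar.get(B, ()):
                        if (len(rhs) == 1 and rhs[0] in nonterms
                                and rhs[0] not in derives[A]):
                            derives[A].add(rhs[0])
                            changed = True

        def gls(alpha):    # GLS(alpha) = {A | A =>* B, B -> alpha in P}
            return {A for A in nonterms for B in derives[A]
                    if alpha in grammar.get(B, ())}

        I = [set() for _ in range(n + 1)]
        I[0] = {((), 0)}                                   # Initialize
        for j in range(1, n + 1):                          # Compute(I_j)
            I[j] = {((), j)}
            for (alpha, i) in I[j - 1]:                    # (2) SCANNER
                if alpha + (words[j - 1],) in prefixes:
                    I[j].add((alpha + (words[j - 1],), i))
            changed = True
            while changed:                                 # (3) COMPLETER
                changed = False
                for (alpha, i) in list(I[j]):
                    if alpha in rhss:                      # Compl(alpha)
                        for B in gls(alpha):
                            for (gamma, k) in list(I[i]):
                                ok = (gamma + (B,) in prefixes
                                      or (gamma == () and B == start))
                                if ok and (gamma + (B,), k) not in I[j]:
                                    I[j].add((gamma + (B,), k))
                                    changed = True
        return ((start,), 0) in I[n]                       # Answer

Applied to the grammar and sentence of Example 4 below, this returns True.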

The correctness of this recognition algorithm follows from Theorem 3.

Theorem 3. A parse object [α, i] is a member of I_j if and only if α ⇒* a_{i+1}...a_j and, moreover, Extend(α) or α = S.

This theorem implies that after all sets I_i have been computed, the sentence a_1...a_n can be accepted if [S, 0] ∈ I_n, and should be rejected otherwise.

Example 4. Consider the grammar with the following productions:

    S → bAC     A → a | Aa | aAa     C → c

When we apply the algorithm to the input sentence baaaac, we obtain the following sets of parse objects.

    I_0: [ε,0]
    I_1: [ε,1], [b,0]
    I_2: [ε,2], [a,1], [A,1], [bA,0]
    I_3: [ε,3], [a,2], [Aa,1], [A,2], [aA,1], [A,1], [bA,0]
    I_4: [ε,4], [a,3], [Aa,2], [aAa,1], [Aa,1], [A,3], [aA,2], [A,2], [aA,1], [A,1], [bA,0]
    I_5: [ε,5], [a,4], [Aa,3], [aAa,2], [Aa,2], [aAa,1], [Aa,1], [A,4], [aA,3], [A,3], [aA,2], [A,2], [aA,1], [A,1], [bA,0]
    I_6: [ε,6], [c,5], [C,5], [bAC,0], [S,0]

Since [S, 0] is an element of I_6, the string is accepted.

The time complexity of the algorithm is O(n^3), where n is the length of the string to be recognized. The number of objects [α, i] in a set I_j is O(j). The completer operation (see above) may take O(j) time for each element in I_j. This means that the time complexity is the same as that of Earley's recognizer. In general, the space complexity is quadratic, which is also comparable with Earley's method. Since we do not store complete items, but only extendible parts of right-hand sides of production rules in the parse objects, this method does, in general, not need as much space as Earley's. Obviously, the predicates Compl and Extend and the GLS sets can be computed, for a given CFG G, before processing a sentence.

We presented the string version of our recognition method. In the tabular version the j-th column of the triangular recognition matrix T coincides with the set I_j in the string version: T_{i,j} contains the strings α for which [α, i] ∈ I_j. In Earley's method an item [A → α•β, i] is a member of the item set I_j if not only α ⇒* a_{i+1}...a_j but, in addition, a derivation S ⇒* a_1...a_i Aδ exists. The predictor operation in Earley's method causes this extra condition to hold.

We compare the tabular version of our method to the CYK method. In the CYK method, A ∈ N is an element of the entry T_{i,j} of the recognition matrix T, where 0 ≤ i, j ≤ n, if and only if A ⇒* a_{i+1}...a_j. In the present method the following will hold: α ∈ T_{i,j} if and only if α ⇒* a_{i+1}...a_j.

Moreover, we put no restrictions on the grammar, except that we assume the grammar to be ε-free. We will show later that this restriction is not essential. Our method can be derived from the tabular CYK method by a simple reinterpretation of the constructors used in that algorithm. We define the operator Closure on subsets W of V* as follows:

    Closure(W) = W ∪ ⋃_{α ∈ W} GLS(α)

The predicate Sub(α) holds iff α is a substring of a right-hand side of a production. Using this predicate, we can define the binary operator ⊙ on sets of strings as follows:

    X ⊙ Y = Closure( { α | α = γδ, γ ∈ X, δ ∈ Y, Sub(α) } )

With these operators we can present the recognition algorithm for our method. It is shown in Figure 16.

    begin Recognizer;
        for i := 0 to n-1 do
            T_{i,i+1} := Closure({a_{i+1}})
        end;
        for d := 2 to n do
            for i := 0 to n-d do
                j := i+d;
                T_{i,j} := ∪_{k=i+1}^{j-1} ( T_{i,k} ⊙ T_{k,j} )
            end
        end;
        if S ∈ T_{0,n} then accept else reject end
    end Recognizer.

Fig. 16 The recognizer without dots

The correctness of the algorithm is a consequence of the following theorem.

Theorem 5. The set T_{i,j} contains α if and only if α ⇒* a_{i+1}...a_j and, moreover, Sub(α) or α = S.

Example 6. If we apply the recognition algorithm to the grammar of Example 4 and the string baaaac, we obtain the recognition matrix shown in Figure 17.

The operator Closure can easily be redefined to handle grammars with ε-rules as well. In that case we should have: if H ⇒* ε and Sub(αH), respectively Sub(Hα), then include αH, respectively Hα, in a set T_{i,j} if α is already in this set.

If we add a pair of indices to each item in every item set T_{i,j}, indicating from which matrix element(s) the item is derived, then we can easily construct the parse tree(s) from this augmented recognition matrix by following the trace(s) given by these indices. Notice that the graph given by these indices is already a representation of the parse tree(s).

6. Items with two dots: another Earley variant


More information

Syntactic Analysis. Top-Down Parsing

Syntactic Analysis. Top-Down Parsing Syntactic Analysis Top-Down Parsing Copyright 2017, Pedro C. Diniz, all rights reserved. Students enrolled in Compilers class at University of Southern California (USC) have explicit permission to make

More information

Syntax Analysis. Amitabha Sanyal. (www.cse.iitb.ac.in/ as) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay

Syntax Analysis. Amitabha Sanyal. (www.cse.iitb.ac.in/ as) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay Syntax Analysis (www.cse.iitb.ac.in/ as) Department of Computer Science and Engineering, Indian Institute of Technology, Bombay September 2007 College of Engineering, Pune Syntax Analysis: 2/124 Syntax

More information

UNIT III & IV. Bottom up parsing

UNIT III & IV. Bottom up parsing UNIT III & IV Bottom up parsing 5.0 Introduction Given a grammar and a sentence belonging to that grammar, if we have to show that the given sentence belongs to the given grammar, there are two methods.

More information

Ling 571: Deep Processing for Natural Language Processing

Ling 571: Deep Processing for Natural Language Processing Ling 571: Deep Processing for Natural Language Processing Julie Medero January 14, 2013 Today s Plan Motivation for Parsing Parsing as Search Search algorithms Search Strategies Issues Two Goal of Parsing

More information

Parsing partially bracketed input

Parsing partially bracketed input Parsing partially bracketed input Martijn Wieling, Mark-Jan Nederhof and Gertjan van Noord Humanities Computing, University of Groningen Abstract A method is proposed to convert a Context Free Grammar

More information

Syntax Analysis. The Big Picture. The Big Picture. COMP 524: Programming Languages Srinivas Krishnan January 25, 2011

Syntax Analysis. The Big Picture. The Big Picture. COMP 524: Programming Languages Srinivas Krishnan January 25, 2011 Syntax Analysis COMP 524: Programming Languages Srinivas Krishnan January 25, 2011 Based in part on slides and notes by Bjoern Brandenburg, S. Olivier and A. Block. 1 The Big Picture Character Stream Token

More information

Assignment 4 CSE 517: Natural Language Processing

Assignment 4 CSE 517: Natural Language Processing Assignment 4 CSE 517: Natural Language Processing University of Washington Winter 2016 Due: March 2, 2016, 1:30 pm 1 HMMs and PCFGs Here s the definition of a PCFG given in class on 2/17: A finite set

More information

Computer Science 160 Translation of Programming Languages

Computer Science 160 Translation of Programming Languages Computer Science 160 Translation of Programming Languages Instructor: Christopher Kruegel Top-Down Parsing Parsing Techniques Top-down parsers (LL(1), recursive descent parsers) Start at the root of the

More information

Normal Forms and Parsing. CS154 Chris Pollett Mar 14, 2007.

Normal Forms and Parsing. CS154 Chris Pollett Mar 14, 2007. Normal Forms and Parsing CS154 Chris Pollett Mar 14, 2007. Outline Chomsky Normal Form The CYK Algorithm Greibach Normal Form Chomsky Normal Form To get an efficient parsing algorithm for general CFGs

More information

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice.

Parsing. Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice. Parsing Note by Baris Aktemur: Our slides are adapted from Cooper and Torczon s slides that they prepared for COMP 412 at Rice. Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students

More information

The CKY Parsing Algorithm and PCFGs. COMP-599 Oct 12, 2016

The CKY Parsing Algorithm and PCFGs. COMP-599 Oct 12, 2016 The CKY Parsing Algorithm and PCFGs COMP-599 Oct 12, 2016 Outline CYK parsing PCFGs Probabilistic CYK parsing 2 CFGs and Constituent Trees Rules/productions: S VP this VP V V is rules jumps rocks Trees:

More information

Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5

Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5 Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5 1 Not all languages are regular So what happens to the languages which are not regular? Can we still come up with a language recognizer?

More information

Parsing Part II (Top-down parsing, left-recursion removal)

Parsing Part II (Top-down parsing, left-recursion removal) Parsing Part II (Top-down parsing, left-recursion removal) Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit

More information

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7 Top-Down Parsing and Intro to Bottom-Up Parsing Lecture 7 1 Predictive Parsers Like recursive-descent but parser can predict which production to use Predictive parsers are never wrong Always able to guess

More information

Syntax Analysis, III Comp 412

Syntax Analysis, III Comp 412 COMP 412 FALL 2017 Syntax Analysis, III Comp 412 source code IR Front End Optimizer Back End IR target code Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp

More information

Computational Linguistics

Computational Linguistics Computational Linguistics CSC 2501 / 485 Fall 2018 3 3. Chart parsing Gerald Penn Department of Computer Science, University of Toronto Reading: Jurafsky & Martin: 13.3 4. Allen: 3.4, 3.6. Bird et al:

More information

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7

Top-Down Parsing and Intro to Bottom-Up Parsing. Lecture 7 Top-Down Parsing and Intro to Bottom-Up Parsing Lecture 7 1 Predictive Parsers Like recursive-descent but parser can predict which production to use Predictive parsers are never wrong Always able to guess

More information

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

CS415 Compilers. Syntax Analysis. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University CS415 Compilers Syntax Analysis These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University Limits of Regular Languages Advantages of Regular Expressions

More information

CS 321 Programming Languages and Compilers. VI. Parsing

CS 321 Programming Languages and Compilers. VI. Parsing CS 321 Programming Languages and Compilers VI. Parsing Parsing Calculate grammatical structure of program, like diagramming sentences, where: Tokens = words Programs = sentences For further information,

More information

Ambiguity, Precedence, Associativity & Top-Down Parsing. Lecture 9-10

Ambiguity, Precedence, Associativity & Top-Down Parsing. Lecture 9-10 Ambiguity, Precedence, Associativity & Top-Down Parsing Lecture 9-10 (From slides by G. Necula & R. Bodik) 9/18/06 Prof. Hilfinger CS164 Lecture 9 1 Administrivia Please let me know if there are continued

More information

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill

Syntax Analysis. COMP 524: Programming Language Concepts Björn B. Brandenburg. The University of North Carolina at Chapel Hill Syntax Analysis Björn B. Brandenburg The University of North Carolina at Chapel Hill Based on slides and notes by S. Olivier, A. Block, N. Fisher, F. Hernandez-Campos, and D. Stotts. The Big Picture Character

More information

3.7 Denotational Semantics

3.7 Denotational Semantics 3.7 Denotational Semantics Denotational semantics, also known as fixed-point semantics, associates to each programming language construct a well-defined and rigorously understood mathematical object. These

More information

Vorlesung 7: Ein effizienter CYK Parser

Vorlesung 7: Ein effizienter CYK Parser Institut für Computerlinguistik, Uni Zürich: Effiziente Analyse unbeschränkter Texte Vorlesung 7: Ein effizienter CYK Parser Gerold Schneider Institute of Computational Linguistics, University of Zurich

More information

Parsers. Xiaokang Qiu Purdue University. August 31, 2018 ECE 468

Parsers. Xiaokang Qiu Purdue University. August 31, 2018 ECE 468 Parsers Xiaokang Qiu Purdue University ECE 468 August 31, 2018 What is a parser A parser has two jobs: 1) Determine whether a string (program) is valid (think: grammatically correct) 2) Determine the structure

More information

2.2 Syntax Definition

2.2 Syntax Definition 42 CHAPTER 2. A SIMPLE SYNTAX-DIRECTED TRANSLATOR sequence of "three-address" instructions; a more complete example appears in Fig. 2.2. This form of intermediate code takes its name from instructions

More information

CS143 Handout 20 Summer 2011 July 15 th, 2011 CS143 Practice Midterm and Solution

CS143 Handout 20 Summer 2011 July 15 th, 2011 CS143 Practice Midterm and Solution CS143 Handout 20 Summer 2011 July 15 th, 2011 CS143 Practice Midterm and Solution Exam Facts Format Wednesday, July 20 th from 11:00 a.m. 1:00 p.m. in Gates B01 The exam is designed to take roughly 90

More information

A Characterization of the Chomsky Hierarchy by String Turing Machines

A Characterization of the Chomsky Hierarchy by String Turing Machines A Characterization of the Chomsky Hierarchy by String Turing Machines Hans W. Lang University of Applied Sciences, Flensburg, Germany Abstract A string Turing machine is a variant of a Turing machine designed

More information

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones

Parsing III. CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones Parsing III (Top-down parsing: recursive descent & LL(1) ) (Bottom-up parsing) CS434 Lecture 8 Spring 2005 Department of Computer Science University of Alabama Joel Jones Copyright 2003, Keith D. Cooper,

More information

Compilerconstructie. najaar Rudy van Vliet kamer 140 Snellius, tel rvvliet(at)liacs(dot)nl. college 3, vrijdag 22 september 2017

Compilerconstructie. najaar Rudy van Vliet kamer 140 Snellius, tel rvvliet(at)liacs(dot)nl. college 3, vrijdag 22 september 2017 Compilerconstructie najaar 2017 http://www.liacs.leidenuniv.nl/~vlietrvan1/coco/ Rudy van Vliet kamer 140 Snellius, tel. 071-527 2876 rvvliet(at)liacs(dot)nl college 3, vrijdag 22 september 2017 + werkcollege

More information

Formal Languages and Compilers Lecture V: Parse Trees and Ambiguous Gr

Formal Languages and Compilers Lecture V: Parse Trees and Ambiguous Gr Formal Languages and Compilers Lecture V: Parse Trees and Ambiguous Grammars Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/

More information

1. [5 points each] True or False. If the question is currently open, write O or Open.

1. [5 points each] True or False. If the question is currently open, write O or Open. University of Nevada, Las Vegas Computer Science 456/656 Spring 2018 Practice for the Final on May 9, 2018 The entire examination is 775 points. The real final will be much shorter. Name: No books, notes,

More information

Introduction to Parsing. Comp 412

Introduction to Parsing. Comp 412 COMP 412 FALL 2010 Introduction to Parsing Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make

More information

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata

Formal Languages and Compilers Lecture IV: Regular Languages and Finite. Finite Automata Formal Languages and Compilers Lecture IV: Regular Languages and Finite Automata Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/

More information

Lecture 25 Spanning Trees

Lecture 25 Spanning Trees Lecture 25 Spanning Trees 15-122: Principles of Imperative Computation (Fall 2018) Frank Pfenning, Iliano Cervesato The following is a simple example of a connected, undirected graph with 5 vertices (A,

More information

Monday, September 13, Parsers

Monday, September 13, Parsers Parsers Agenda Terminology LL(1) Parsers Overview of LR Parsing Terminology Grammar G = (Vt, Vn, S, P) Vt is the set of terminals Vn is the set of non-terminals S is the start symbol P is the set of productions

More information

CMPSCI 311: Introduction to Algorithms Practice Final Exam

CMPSCI 311: Introduction to Algorithms Practice Final Exam CMPSCI 311: Introduction to Algorithms Practice Final Exam Name: ID: Instructions: Answer the questions directly on the exam pages. Show all your work for each question. Providing more detail including

More information

Job-shop scheduling with limited capacity buffers

Job-shop scheduling with limited capacity buffers Job-shop scheduling with limited capacity buffers Peter Brucker, Silvia Heitmann University of Osnabrück, Department of Mathematics/Informatics Albrechtstr. 28, D-49069 Osnabrück, Germany {peter,sheitman}@mathematik.uni-osnabrueck.de

More information

LL(1) predictive parsing

LL(1) predictive parsing LL(1) predictive parsing Informatics 2A: Lecture 11 John Longley School of Informatics University of Edinburgh jrl@staffmail.ed.ac.uk 13 October, 2011 1 / 12 1 LL(1) grammars and parse tables 2 3 2 / 12

More information

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs

programming languages need to be precise a regular expression is one of the following: tokens are the building blocks of programs Chapter 2 :: Programming Language Syntax Programming Language Pragmatics Michael L. Scott Introduction programming languages need to be precise natural languages less so both form (syntax) and meaning

More information

COP 3402 Systems Software Syntax Analysis (Parser)

COP 3402 Systems Software Syntax Analysis (Parser) COP 3402 Systems Software Syntax Analysis (Parser) Syntax Analysis 1 Outline 1. Definition of Parsing 2. Context Free Grammars 3. Ambiguous/Unambiguous Grammars Syntax Analysis 2 Lexical and Syntax Analysis

More information

Compilation 2012 Context-Free Languages Parsers and Scanners. Jan Midtgaard Michael I. Schwartzbach Aarhus University

Compilation 2012 Context-Free Languages Parsers and Scanners. Jan Midtgaard Michael I. Schwartzbach Aarhus University Compilation 2012 Parsers and Scanners Jan Midtgaard Michael I. Schwartzbach Aarhus University Context-Free Grammars Example: sentence subject verb object subject person person John Joe Zacharias verb asked

More information

Algorithms for NLP. Chart Parsing. Reading: James Allen, Natural Language Understanding. Section 3.4, pp

Algorithms for NLP. Chart Parsing. Reading: James Allen, Natural Language Understanding. Section 3.4, pp -7 Algorithms for NLP Chart Parsing Reading: James Allen, Natural Language Understanding Section 3.4, pp. 53-6 Chart Parsing General Principles: A Bottom-Up parsing method Construct a parse starting from

More information

University of Nevada, Las Vegas Computer Science 456/656 Fall 2016

University of Nevada, Las Vegas Computer Science 456/656 Fall 2016 University of Nevada, Las Vegas Computer Science 456/656 Fall 2016 The entire examination is 925 points. The real final will be much shorter. Name: No books, notes, scratch paper, or calculators. Use pen

More information

Homework & NLTK. CS 181: Natural Language Processing Lecture 9: Context Free Grammars. Motivation. Formal Def of CFG. Uses of CFG.

Homework & NLTK. CS 181: Natural Language Processing Lecture 9: Context Free Grammars. Motivation. Formal Def of CFG. Uses of CFG. C 181: Natural Language Processing Lecture 9: Context Free Grammars Kim Bruce Pomona College pring 2008 Homework & NLTK Review MLE, Laplace, and Good-Turing in smoothing.py Disclaimer: lide contents borrowed

More information

LL(1) predictive parsing

LL(1) predictive parsing LL(1) predictive parsing Informatics 2A: Lecture 11 Mary Cryan School of Informatics University of Edinburgh mcryan@staffmail.ed.ac.uk 10 October 2018 1 / 15 Recap of Lecture 10 A pushdown automaton (PDA)

More information

Syntax Analysis. Martin Sulzmann. Martin Sulzmann Syntax Analysis 1 / 38

Syntax Analysis. Martin Sulzmann. Martin Sulzmann Syntax Analysis 1 / 38 Syntax Analysis Martin Sulzmann Martin Sulzmann Syntax Analysis 1 / 38 Syntax Analysis Objective Recognize individual tokens as sentences of a language (beyond regular languages). Example 1 (OK) Program

More information

Context Free Grammars. CS154 Chris Pollett Mar 1, 2006.

Context Free Grammars. CS154 Chris Pollett Mar 1, 2006. Context Free Grammars CS154 Chris Pollett Mar 1, 2006. Outline Formal Definition Ambiguity Chomsky Normal Form Formal Definitions A context free grammar is a 4-tuple (V, Σ, R, S) where 1. V is a finite

More information

Parser Generation. Bottom-Up Parsing. Constructing LR Parser. LR Parsing. Construct parse tree bottom-up --- from leaves to the root

Parser Generation. Bottom-Up Parsing. Constructing LR Parser. LR Parsing. Construct parse tree bottom-up --- from leaves to the root Parser Generation Main Problem: given a grammar G, how to build a top-down parser or a bottom-up parser for it? parser : a program that, given a sentence, reconstructs a derivation for that sentence ----

More information

CS 314 Principles of Programming Languages

CS 314 Principles of Programming Languages CS 314 Principles of Programming Languages Lecture 5: Syntax Analysis (Parsing) Zheng (Eddy) Zhang Rutgers University January 31, 2018 Class Information Homework 1 is being graded now. The sample solution

More information

MIT Specifying Languages with Regular Expressions and Context-Free Grammars

MIT Specifying Languages with Regular Expressions and Context-Free Grammars MIT 6.035 Specifying Languages with Regular essions and Context-Free Grammars Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology Language Definition Problem How to precisely

More information

Table-driven using an explicit stack (no recursion!). Stack can be viewed as containing both terminals and non-terminals.

Table-driven using an explicit stack (no recursion!). Stack can be viewed as containing both terminals and non-terminals. Bottom-up Parsing: Table-driven using an explicit stack (no recursion!). Stack can be viewed as containing both terminals and non-terminals. Basic operation is to shift terminals from the input to the

More information

Iterative CKY parsing for Probabilistic Context-Free Grammars

Iterative CKY parsing for Probabilistic Context-Free Grammars Iterative CKY parsing for Probabilistic Context-Free Grammars Yoshimasa Tsuruoka and Jun ichi Tsujii Department of Computer Science, University of Tokyo Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033 CREST, JST

More information

Context Free Languages and Pushdown Automata

Context Free Languages and Pushdown Automata Context Free Languages and Pushdown Automata COMP2600 Formal Methods for Software Engineering Ranald Clouston Australian National University Semester 2, 2013 COMP 2600 Context Free Languages and Pushdown

More information

Types of parsing. CMSC 430 Lecture 4, Page 1

Types of parsing. CMSC 430 Lecture 4, Page 1 Types of parsing Top-down parsers start at the root of derivation tree and fill in picks a production and tries to match the input may require backtracking some grammars are backtrack-free (predictive)

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Syntax Analysis. Prof. James L. Frankel Harvard University. Version of 6:43 PM 6-Feb-2018 Copyright 2018, 2015 James L. Frankel. All rights reserved.

Syntax Analysis. Prof. James L. Frankel Harvard University. Version of 6:43 PM 6-Feb-2018 Copyright 2018, 2015 James L. Frankel. All rights reserved. Syntax Analysis Prof. James L. Frankel Harvard University Version of 6:43 PM 6-Feb-2018 Copyright 2018, 2015 James L. Frankel. All rights reserved. Context-Free Grammar (CFG) terminals non-terminals start

More information

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;}

Parsing. source code. while (k<=n) {sum = sum+k; k=k+1;} Compiler Construction Grammars Parsing source code scanner tokens regular expressions lexical analysis Lennart Andersson parser context free grammar Revision 2012 01 23 2012 parse tree AST builder (implicit)

More information

Formal Languages and Compilers Lecture VI: Lexical Analysis

Formal Languages and Compilers Lecture VI: Lexical Analysis Formal Languages and Compilers Lecture VI: Lexical Analysis Free University of Bozen-Bolzano Faculty of Computer Science POS Building, Room: 2.03 artale@inf.unibz.it http://www.inf.unibz.it/ artale/ Formal

More information

LL(k) Parsing. Predictive Parsers. LL(k) Parser Structure. Sample Parse Table. LL(1) Parsing Algorithm. Push RHS in Reverse Order 10/17/2012

LL(k) Parsing. Predictive Parsers. LL(k) Parser Structure. Sample Parse Table. LL(1) Parsing Algorithm. Push RHS in Reverse Order 10/17/2012 Predictive Parsers LL(k) Parsing Can we avoid backtracking? es, if for a given input symbol and given nonterminal, we can choose the alternative appropriately. his is possible if the first terminal of

More information

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence.

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence. Bottom-up parsing Recall For a grammar G, with start symbol S, any string α such that S α is a sentential form If α V t, then α is a sentence in L(G) A left-sentential form is a sentential form that occurs

More information

Advanced PCFG Parsing

Advanced PCFG Parsing Advanced PCFG Parsing Computational Linguistics Alexander Koller 8 December 2017 Today Parsing schemata and agenda-based parsing. Semiring parsing. Pruning techniques for chart parsing. The CKY Algorithm

More information

Outline. The strategy: shift-reduce parsing. Introduction to Bottom-Up Parsing. A key concept: handles

Outline. The strategy: shift-reduce parsing. Introduction to Bottom-Up Parsing. A key concept: handles Outline Introduction to Bottom-Up Parsing Lecture Notes by Profs. Alex Aiken and George Necula (UCB) he strategy: -reduce parsing A key concept: handles Ambiguity and precedence declarations CS780(Prasad)

More information

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen COP4020 Programming Languages Syntax Prof. Robert van Engelen Overview Tokens and regular expressions Syntax and context-free grammars Grammar derivations More about parse trees Top-down and bottom-up

More information

Lexical Analysis. Chapter 2

Lexical Analysis. Chapter 2 Lexical Analysis Chapter 2 1 Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexers Regular expressions Examples

More information

Wednesday, September 9, 15. Parsers

Wednesday, September 9, 15. Parsers Parsers What is a parser A parser has two jobs: 1) Determine whether a string (program) is valid (think: grammatically correct) 2) Determine the structure of a program (think: diagramming a sentence) Agenda

More information

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs:

Parsers. What is a parser. Languages. Agenda. Terminology. Languages. A parser has two jobs: What is a parser Parsers A parser has two jobs: 1) Determine whether a string (program) is valid (think: grammatically correct) 2) Determine the structure of a program (think: diagramming a sentence) Agenda

More information

Wednesday, August 31, Parsers

Wednesday, August 31, Parsers Parsers How do we combine tokens? Combine tokens ( words in a language) to form programs ( sentences in a language) Not all combinations of tokens are correct programs (not all sentences are grammatically

More information

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions.

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions. CS 787: Advanced Algorithms NP-Hardness Instructor: Dieter van Melkebeek We review the concept of polynomial-time reductions, define various classes of problems including NP-complete, and show that 3-SAT

More information

The CYK Parsing Algorithm

The CYK Parsing Algorithm The CYK Parsing Algorithm Lecture 19 Section 6.3 Robb T. Koether Hampden-Sydney College Fri, Oct 7, 2016 Robb T. Koether (Hampden-Sydney College) The CYK Parsing Algorithm Fri, Oct 7, 2016 1 / 21 1 The

More information

COP4020 Programming Languages. Syntax Prof. Robert van Engelen

COP4020 Programming Languages. Syntax Prof. Robert van Engelen COP4020 Programming Languages Syntax Prof. Robert van Engelen Overview n Tokens and regular expressions n Syntax and context-free grammars n Grammar derivations n More about parse trees n Top-down and

More information

Context-free grammars

Context-free grammars Context-free grammars Section 4.2 Formal way of specifying rules about the structure/syntax of a program terminals - tokens non-terminals - represent higher-level structures of a program start symbol,

More information

Top down vs. bottom up parsing

Top down vs. bottom up parsing Parsing A grammar describes the strings that are syntactically legal A recogniser simply accepts or rejects strings A generator produces sentences in the language described by the grammar A parser constructs

More information

Introduction to Lexical Analysis

Introduction to Lexical Analysis Introduction to Lexical Analysis Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexical analyzers (lexers) Regular

More information

General Overview of Compiler

General Overview of Compiler General Overview of Compiler Compiler: - It is a complex program by which we convert any high level programming language (source code) into machine readable code. Interpreter: - It performs the same task

More information

CS502: Compilers & Programming Systems

CS502: Compilers & Programming Systems CS502: Compilers & Programming Systems Top-down Parsing Zhiyuan Li Department of Computer Science Purdue University, USA There exist two well-known schemes to construct deterministic top-down parsers:

More information