A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese

Makoto Nagao, Shinsuke Mori
Department of Electrical Engineering, Kyoto University

Abstract

In the process of establishing the information theory, C. E. Shannon proposed the Markov process as a good model to characterize a natural language. The core of this idea is to calculate the frequencies of strings composed of n characters (n-grams), but this statistical analysis for large text data and for a large n has never been carried out because of the memory limitation of computers and the shortage of text data. Taking advantage of the recent powerful computers we developed a new algorithm of n-grams of large text data for arbitrarily large n and calculated successfully, within relatively short time, n-grams of some Japanese text data containing between two and thirty million characters. From this experiment it became clear that the automatic extraction or determination of words, compound words and collocations is possible by mutually comparing the n-gram statistics for different values of n.

category: topical paper, quantitative linguistics, large text corpora, text processing

1 Introduction

Claude E. Shannon established the information theory in 1948 [1]. His theory included the concept that a language could be approximated by an n-th order Markov model, with n extended to infinity. Since his proposal there have been many trials to calculate n-grams (statistics of character strings of a language) for big text data of a language. However, computers up to the present could not calculate them for a large n because the calculation required a huge amount of memory space and time. For example the frequency calculation of 10-grams of English requires at least 26^10 (about 1.4 x 10^14) words of memory space. Therefore the calculation was done at most for n = 4 with a modest text quantity. We developed a new method of calculating n-grams for large n's. We do not prepare a table for the n-grams beforehand. Our method consists of two stages.
The first stage performs the sorting of the substrings of a text and finds out the length of the prefix parts which are the same for the adjacent substrings in the sorted table. The second stage is the calculation of the n-grams when it is asked for a specific n. Only the existing character combinations require table entries for the frequency count, so that we need not reserve a big space for the n-gram table. The program we have developed requires 7l bytes for an l-character text of a two-byte code such as Japanese and Chinese texts, and 6l bytes for an l-character text of English and other European languages. By the present program n can be extended up to 255. The program can be changed very easily for a larger n if it is required. We performed n-gram frequency calculations for three different text data. We were not so much interested in the entropy value of a language, but were interested in the extraction of varieties of language properties, such as words, compound words, collocations and so on. The calculation of the frequency of occurrences of character strings is particularly important to determine what is a word in such languages as Japanese and Chinese, where there are no spaces between words and the determination of word boundaries is not so easy. In this paper we will explain some of our results on these problems.

2 Calculation of n-grams for an arbitrarily large number n

It was very difficult to calculate n-grams for a large number n because of the memory limitation of a computer. For example, the Japanese language has more than 4,000 different characters, and if we want
to have 10-gram frequencies of a Japanese text, we must reserve 4000^10 entries, which exceeds 10^36. Therefore only 3- or 4-grams were calculated so far. A new method we developed can calculate n-grams for an arbitrarily large number n with a reasonable memory size in a reasonable calculation time. It consists of two stages. The first stage is to get a table of alphabetically sorted substrings of a text string and to get the value of the coincidence number of prefix characters of adjacently sorted strings. The second stage is to calculate the frequency of the n-grams for all the existing character strings from the sorted strings, for a specific number n.

2.1 First stage

(1) When a text is given it is stored in a computer as one long character string. It may include sentence boundaries, paragraph boundaries and so on if they are regarded as components of the text. When a text is composed of l characters it occupies 2l bytes of memory because a Japanese character is encoded by a 16-bit code. We prepare another table of the same size (l entries), each entry of which keeps a pointer to a substring of the text string. This is illustrated in Figure 1.

Figure 1: Text string (l characters: 2l bytes) and the pointer table (l entries of 4 bytes) to its substrings; the i-th entry points at the i-th word.

A substring pointed to by entry i is defined as composed of the characters from the i-th position to the end of the text string (see Figure 1). We call this substring a word. The first word is the text string itself, and the second word is the string which starts from the second character and ends at the final character of the text string. Similarly the last word is the final character of the text string. As the text size is l characters, a pointer must have at least p bits, where 2^p >= l. In our program we set p = 32 bits so that we can accept a text size of up to 2^32, i.e. about 4 giga characters. The pointer table represents a set of l words. We apply the dictionary sorting operation to this set of l words. It is performed by utilizing the pointers in the pointer table. We used comb sort [2], which is an improved version of bubble sort. The sorting time is of the order O(l log l). When the sorting is completed the result is the change of pointer positions in the pointer table, and there is no replacement of the actual words. As we are interested in n-grams of n less than 255, the actual sorting of the words is performed on the leftmost 255 or fewer characters of the words.

(2) Next we compare two adjacent words in the pointer table, and count the length of the prefix parts which are the same in the two words. For example when "extension to the left side..." and "extension to the right side..." are two words placed adjacently, the number is 17. This is stored in the table of coincidence numbers of prefix characters. This is shown in Figure 2. As we are interested in n <= 255, one byte is given to each entry of this table.

Figure 2: Sorted pointer table and table of coincidence numbers of prefix characters (one byte per entry).

The total memory space required for this first-stage operation is 2l + 4l + l = 7l bytes. For example when a text size is 10 mega Japanese characters, 70 mega bytes of memory must be reserved. This is not difficult for present-day computers. We developed two software versions, one using main memory alone, and the other using a disc memory, where the software has the additional operations of a disc merge sort. By the disc version we can handle a text of more than 100 mega characters of Japanese text. The software was implemented on a
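The first stage can be sketched in a few lines of Python. This is our own illustrative rendering, not the authors' implementation: Python's built-in sort stands in for the comb sort used in the paper, and the function and variable names are our own.

```python
# Sketch of the first stage: sort all suffixes ("words") of the text
# through a pointer table, then record, for each adjacent pair in the
# sorted table, the number of coinciding prefix characters.

def first_stage(text, max_n=255):
    l = len(text)
    # Pointer table: index i stands for the suffix text[i:].
    # As in the paper, sorting compares at most the leftmost
    # max_n characters of each word.
    pointers = sorted(range(l), key=lambda i: text[i:i + max_n])
    # Table of coincidence numbers: entry k is the length of the
    # common prefix of the k-th and (k+1)-th sorted words.
    coincidence = []
    for a, b in zip(pointers, pointers[1:]):
        n = 0
        while (n < max_n and a + n < l and b + n < l
               and text[a + n] == text[b + n]):
            n += 1
        coincidence.append(n)
    return pointers, coincidence
```

For the toy text "abab" the sorted words are "ab", "abab", "b", "bab", giving the pointer table [2, 0, 3, 1] and the coincidence numbers [2, 0, 1].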
SUN SPARC Station.

2.2 Second stage

The second stage is the calculation of the n-gram frequency table. This is done by using the pointer table and the table of coincidence numbers of prefix characters. Let us fix n to a certain number. We first read out the first n characters of the first word in the pointer table, and see the number in the table of coincidence numbers of prefix characters. If this is equal to or larger than n, it means that the second word has at least the same n prefix characters as the first word. Then we see the next entry of the coincidence numbers of prefix characters and check whether it is equal to or larger than n or not. We continue this operation until we meet the condition that the number is smaller than n. The number of words checked up to this point is the frequency of the n prefix characters of the first word. At this stage the first n prefix characters of the next word are different, and so the same operation as for the first n characters is performed from here, that is, to check the number in the table of coincidence numbers of prefix characters to see whether it is equal to or larger than n or not, and so on. In this way we get the frequency of the second n prefix characters. We perform this process until the last entry of the table. These operations give the n-gram table of the given text. We do not need any extra memory space in this operation when we print out every n-gram string and its frequency as they are obtained.

We calculated n-grams for some different Japanese texts which were available in electronic form in our laboratory. These were the following.

1. Encyclopedic Dictionary of Computer Science (3.7 M bytes)
2. Journalistic essays from Asahi Newspaper (8 M bytes)
3. Miscellaneous texts available in our laboratory (59 M bytes)

The first two texts were not large and could be managed in main memory. The third one was processed by using a disc memory, applying a merge sort program three times. The first two texts were processed within one and two hours respectively by a standard SUN SPARC Station for the first stage mentioned above. The third text required about twenty-four hours.
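The second-stage scan described above can be sketched as follows. This is our own code, not the authors': it takes the stage-1 tables as input and emits the frequency of every existing n-gram for one fixed n in a single pass.

```python
# Sketch of the second stage: walk the sorted pointer table once,
# counting runs of adjacent words whose coincidence number is >= n.

def ngram_frequencies(text, pointers, coincidence, n):
    freqs = {}
    l = len(text)
    i = 0
    while i < len(pointers):
        if pointers[i] + n > l:        # this word is shorter than n; skip it
            i += 1
            continue
        gram = text[pointers[i]:pointers[i] + n]
        count = 1
        # Adjacent words sharing at least n prefix characters
        # carry the same n-gram.
        while i < len(coincidence) and coincidence[i] >= n:
            i += 1
            count += 1
        freqs[gram] = count
        i += 1
    return freqs
```

No extra table proportional to the alphabet size is needed: only the n-grams that actually occur get an entry, mirroring the memory argument made above.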
Calculation of the n-gram frequency (the second stage) took less than an hour including the print-out.

3 Extraction of useful linguistic information from n-gram frequency data

3.1 Entropy

Everybody is interested in the entropy value of a language. Shannon's theory says that the entropy is calculated by the formula [3]

    H_n(L) = -(1/n) Σ_w P(w) log P(w)

where P(w) is the probability of occurrence of w, and the summation is over all the different strings w of n characters appearing in the language. The entropy of the language L is

    H(L) = lim_{n→∞} H_n(L)

We calculated H_n(L) for the texts mentioned in Section 2 for n = 1, 2, 3, .... The results are shown in Figure 3. Unlike our initial expectation that the entropy would converge to a certain constant value between 0.6 and 1.3, which C. E. Shannon estimated for English, it continued to decrease towards zero. We checked in detail whether our method had something wrong, but there was nothing doubtful. Our conclusion for this strange phenomenon was that a text quantity of a few mega characters is too small to get meaningful statistics for a large n, because we have more than 4,000 different characters in the Japanese language. For English and many other European languages, which have alphabetic sets of less than fifty characters, the situation may be better. But still a text quantity of a few giga bytes or more will be necessary to get a meaningful entropy value for n = 10 or more.

Figure 3: Entropy curve by n-grams (H_n in bits, plotted for n up to 40).
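The per-character entropy H_n(L) can be computed directly from an n-gram frequency table. The snippet below is a small illustration of the formula above with toy counts of our own (not data from the paper), using log base 2 so the result is in bits per character.

```python
import math

# H_n(L) = -(1/n) * sum_w P(w) * log2 P(w),
# with P(w) estimated as count(w) / total count.
def entropy_per_char(freqs, n):
    total = sum(freqs.values())
    h = -sum((c / total) * math.log2(c / total) for c in freqs.values())
    return h / n
```

For example, the 2-gram counts {"ab": 2, "ba": 1, "bb": 1} give probabilities 0.5, 0.25, 0.25, hence H = 1.5 bits per 2-gram and 0.75 bits per character.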
3.2 Obtaining the longest compound word

From the n-gram frequency table we can get much interesting information. When we have a string w (length n) of high frequency, as shown in Figure 4, we can try to find out the longest string w' which includes w by the following process, using the n-gram frequency table.

Figure 4: Obtaining the longest word w' from a high-frequency word fragment w (Japanese examples glossed "must do...", "it is known that...", "can do...", "can ask..."; the character codes are not recoverable in this copy).

Figure 5: Frequencies of the partial strings and obtaining the longest word (Japanese example; the character codes are not recoverable in this copy).

(1) Extension to the left: We cut off the last character of w and add a character x to the left of w. We call this a cut-and-pasted word. We look for the character x which gives the maximum frequency to the cut-and-pasted word. We repeat the same operation step by step to the left and draw a frequency curve for these words. This operation is stopped when the frequency curve drops below a certain value. This process is performed by consulting the n-gram frequency table alone.

(2) Extension to the right: The same operation as (1) is performed by cutting off the left character and adding a character to the right.

(3) Extraction of the high-frequency part: From the frequency curve as shown in Figure 4 we can easily extract the high-frequency part as the longest string. An example is shown in Figure 5. The strings extracted in this way are very often compound words of postpositions in Japanese. Postpositional phrases are usually composed of one to three words, and are used as if they are compound postpositions. Some extracted examples are: [Japanese examples; not decodable in this copy].

3.3 Word extraction

After getting high-frequency character strings by the above method, we can consult dictionaries for these strings. Then we find out many strings which are not included in the dictionaries. Some are phrases (collocations, idiomatic expressions), some others are terminological words, and unknown (new) words.
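The left-extension procedure of step (1) can be sketched as below. This is our own simplification, not the authors' code: for brevity the n-gram table lookup is replaced by direct substring counting over the text, and the alphabet, threshold, and step bound are illustrative assumptions.

```python
def count(text, s):
    # Frequency of s in text (overlapping occurrences allowed);
    # stands in for a lookup in the n-gram frequency table.
    return sum(1 for i in range(len(text) - len(s) + 1)
               if text[i:i + len(s)] == s)

def extend_left(text, w, alphabet, threshold):
    window, prefix = w, ""
    for _ in range(len(text)):            # bounded walk to the left
        core = window[:-1]                # cut off the last character
        best_f, best_x = max((count(text, x + core), x) for x in alphabet)
        if best_f < threshold:            # the frequency curve dropped
            break
        window = best_x + core            # the cut-and-pasted word
        prefix = best_x + prefix          # accumulate the extension
    return prefix + w
```

On a toy text in which "abcde" occurs three times, starting from the fragment "cde" the window slides left through "bcd" and "abc" and stops when no further character keeps the frequency high, recovering "abcde".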
From the text data of the Encyclopedic Dictionary of Computer Science we extracted many terminological words. In general the frequencies of n-grams become smaller as n becomes larger. But we sometimes had relatively high frequency values in n-grams of large n's. These were very often terminological words or terminological phrases. We extracted such terminological phrases as (given here by their English glosses; the Japanese character codes are not recoverable in this copy): "programs written in the (...) language", "problem solving in artificial intelligence", "page replacement algorithm", and "partial correctness of programs".

3.4 Compound word

We can get more interesting information when we compare the data of different n's. When we have a character string (length n) of high frequency, which we may be able to define as a word w, we are recommended to check whether two substrings w1 and w2 of the lengths n1 and n2 (n1 + n2 = n), as
shown in Figure 6, have high-frequency appearances in the n1-gram and n2-gram tables. If we can find out such a situation by changing n1 (and n2), we can conclude that the original character string w is a compound word of w1 and w2. Some examples are shown in Table 1.

Table 1: Determination of compound words: each compound word with its proper segmentation into two high-frequency parts and its improper segmentations; ( ): frequency in the Encyclopedic Dictionary of Computer Science. (The Japanese example strings are not decodable in this copy.)

Figure 6: Possible segmentation of a word w into two components w1 and w2.

3.5 Collocation

We can see whether a particular word w has strong collocational relations with some other words from the n-gram frequency results. We can get an n-gram table where n is sufficiently large, w is the prefix of these n-grams, and some words (w', w'', ...) may appear with relatively high frequency. This is shown in Figure 7. We can find out easily that ww' and ww'' are two collocational expressions from this figure. For example we have 「影響」 (effect) and find out that 「影響を受ける」 (receive effect) and 「影響を与える」 (give effect) have relatively high frequencies, and there are no other significant combinations in the n-gram table with 「影響」 as the prefix. 「入退院」 (in and out of hospital) has almost all the time 「を繰り返す」 (repeat) as the following phrase, and so we will be able to judge that 「入退院を繰り返す」 is an idiomatic expression.

Figure 7: Finding collocational word pairs ww' and ww''.

4 Conclusions

We developed a new method and software for n-gram frequency calculation for n up to 255, and calculated n-grams for some large text data of Japanese. From these data we could derive words, compound words and collocations automatically. We think that this method is equally useful for languages like Chinese, where there are no word spaces in a sentence, and for European languages as well, and also for speech phoneme sequences to get more detailed HMM models.
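The compound-word test of Section 3.4 can be sketched as follows. The scoring rule (choose the split point whose less-frequent half is most frequent) is our own simplification of the paper's comparison of n1-gram and n2-gram frequencies, and the toy frequency table in the usage note is illustrative, not data from the paper.

```python
# Sketch: given a frequency table and a candidate word w, find the
# split w = w1 + w2 whose rarer half has the highest frequency.
def best_split(freq, w):
    best = None
    for k in range(1, len(w)):
        w1, w2 = w[:k], w[k:]
        score = min(freq.get(w1, 0), freq.get(w2, 0))
        if best is None or score > best[0]:
            best = (score, w1, w2)
    return best[1], best[2]
```

With toy counts such as {"abcd": 28, "ab": 41, "cd": 540, "a": 100, "bcd": 29, "abc": 28, "d": 60}, the proper segmentation "ab" + "cd" wins, since both halves are independently frequent, while the straddling strings are not.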
Another possibility is that when we get large text data with part-of-speech tags, we can extract high-frequency part-of-speech sequences by this n-gram calculation over the part-of-speech data. These may be regarded as grammar rules of the primary level. By replacing these part-of-speech sequences by single non-terminal symbols we can calculate new n-grams, and will be able to get higher-level grammar rules. These examples indicate that large text data with varieties of annotations are very important and valuable for the extraction of linguistic information by calculating n-grams for larger values of n.

References

[1] C. E. Shannon: A mathematical theory of communication, Bell System Tech. J., Vol. 27, pp. 379-423, pp. 623-656, (1948).

[2] Stephen Lacey, Richard Box: Nikkei BYTE, November, pp. 30-32, (1991).

[3] N. Abramson: Information theory and coding, McGraw-Hill, (1963).