A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese

Makoto Nagao, Shinsuke Mori
Department of Electrical Engineering, Kyoto University

Abstract

In the process of establishing information theory, C. E. Shannon proposed the Markov process as a good model to characterize a natural language. The core of this idea is to calculate the frequencies of strings composed of n characters (n-grams), but this statistical analysis of large text data for a large n has never been carried out because of the memory limitations of computers and the shortage of text data. Taking advantage of recent powerful computers, we developed a new algorithm for n-grams of large text data for arbitrarily large n and successfully calculated, within a relatively short time, n-grams of some Japanese text data containing between two and thirty million characters. From this experiment it became clear that the automatic extraction or determination of words, compound words and collocations is possible by mutually comparing n-gram statistics for different values of n.

category: topical paper, quantitative linguistics, large text corpora, text processing

1 Introduction

Claude E. Shannon established information theory in 1948 [1]. His theory included the concept that a language could be approximated by an n-th order Markov model, with n to be extended to infinity. Since his proposal there have been many attempts to calculate n-grams (statistics of character strings of a language) for large text data of a language. However, computers up to the present could not calculate them for a large n because the calculation required a huge amount of memory space and time. For example, the frequency calculation of 10-grams of English requires at least $26^{10} \approx 1.4 \times 10^{14}$ table entries, which is on the order of $10^5$ giga words of memory. Therefore the calculation was done at most for n = 4 with a modest text quantity.

We developed a new method of calculating n-grams for large n. We do not prepare a table for an n-gram. Our method consists of two stages.
The first stage performs the sorting of substrings of a text and finds the length of the prefix parts which are the same for the adjacent substrings in the sorted table. The second stage is the calculation of an n-gram when it is asked for a specific n. Only the existing character combinations require table entries for the frequency count, so we need not reserve a big space for an n-gram table. The program we have developed requires 7l bytes for an l character text of two byte code such as Japanese and Chinese texts, and 6l bytes for an l character text of English and other European languages. With the present program n can be extended up to 255. The program can be changed very easily for larger n if required.

We performed n-gram frequency calculations for three different text data. We were not so much interested in the entropy value of a language but in the extraction of varieties of language properties, such as words, compound words, collocations and so on. The calculation of the frequency of occurrences of character strings is particularly important to determine what is a word in such languages as Japanese and Chinese, where there are no spaces between words and the determination of word boundaries is not so easy. In this paper we will explain some of our results on these problems.

2 Calculation of n-grams for an arbitrary large number of n

It was very difficult to calculate n-grams for a large number of n because of the memory limitation of a computer. For example, the Japanese language has more than 4000 different characters, and if we want to have 10-gram frequencies of a Japanese text, we must reserve $4000^{10}$ entries, which exceeds $10^{35}$. Therefore only 3- or 4-grams were calculated so far.

A new method we developed can calculate n-grams for an arbitrary large number of n with a reasonable memory size in a reasonable calculation time. It consists of two stages. The first stage is to get a table of alphabetically sorted substrings of a text string and to get the value of the coincidence number of prefix characters of adjacently sorted strings. The second stage is to calculate the frequency of n-grams for all the existing character strings from the sorted strings for a specific number of n.

2.1 First stage

(1) When a text is given it is stored in a computer as one long character string. It may include sentence boundaries, paragraph boundaries and so on if they are regarded as components of the text. When a text is composed of l characters it occupies 2l bytes of memory because a Japanese character is encoded by a 16 bit code. We prepare another table of the same size (l), each entry of which keeps the pointer to a substring of the text string. This is illustrated in Figure 1.

[Figure 1: Text string and the pointer table to substrings. Labels: pointer table (entries 0 to l-1, 4 bytes each); text string (l characters: 2l bytes); the i-th word.]

A substring pointed to by entry i-1 is defined as composed of the characters from the i-th position to the end of the text string (see Figure 1). We call this substring a word. The first word is the text string itself, and the second word is the string which starts from the second character and ends at the final character of the text string. Similarly the last word is the final character of the text string. As the text size is l characters, a pointer must have at least p bits where $2^p \ge l$. In our program we set p = 32 bits so that we can accept a text size up to $2^{32} \approx 4$ giga characters.

The pointer table represents a set of l words. We apply the dictionary sorting operation to this set of l words. It is performed by utilizing the pointers in the pointer table. We used comb sort [2], which is an improved version of bubble sort. The sorting time is of the order of O(l log l). When the sorting is completed the result is a change of pointer positions in the pointer table, and there is no replacement of actual words. As we are interested in n-grams of n up to 255, the actual sorting of words is performed on the leftmost 255 or fewer characters of the words.

(2) Next we compare two adjacent words in the pointer table, and count the length of the prefix parts which are the same in the two words. For example, when "extension to the left side..." and "extension to the right side..." are two words placed adjacent, the number is 17. This is stored in the table of coincidence number of prefix characters. This is shown in Figure 2. As we are interested in n up to 255, one byte is given to an entry of this table. The total memory space required for this first stage operation is 2l + 4l + l = 7l bytes. For example, when the text size is 10 mega Japanese characters, 70 mega bytes of memory must be reserved. This is not difficult with present-day computers.

[Figure 2: Sorted pointer table and table of coincidence number of characters. Labels: table of coincidence number of characters (1 byte per entry); pointer table (4 bytes per entry); text string (l characters: 2l bytes).]

We developed two software versions, one using main memory alone, and the other using a disc memory, where the software has the additional operations of disc merge sort. With the disc version we can handle a Japanese text of more than 100 mega characters. The software was implemented on a SUN SPARC Station.

2.2 Second stage

The second stage is the calculation of the n-gram frequency table. This is done by using the pointer table and the table of coincidence number of prefix characters. Let us fix n to a certain number. We first read out the first n characters of the first word in the pointer table, and see the number in the table of coincidence number of prefix characters. If this is equal to or larger than n, it means that the second word has at least the same n prefix characters as the first word. Then we see the next entry of the coincidence number of prefix characters and check whether it is equal to or larger than n or not. We continue this operation until we meet the condition that the number is smaller than n. The number of words checked up to this point is the frequency of the n prefix characters of the first word. At this stage the first n prefix characters of the next word are different, and so the same operation as for the first n characters is performed from here, that is, we check the number in the coincidence number of prefix characters to see whether it is equal to or larger than n or not, and so on. In this way we get the frequency of the second n prefix characters. We perform this process until the last entry of the table. These operations give the n-gram table of the given text. We do not need any extra memory space in this operation when we print out every n-gram string and its frequency as they are obtained.

We calculated n-grams for some different Japanese texts which were available in electronic form in our laboratory. These were the following.

1. Encyclopedic Dictionary of Computer Science (3.7 M bytes)
2. Journalistic essays from Asahi Newspaper (8 M bytes)
3. Miscellaneous texts available in our laboratory (9 M bytes)

The first two texts were not large and could be managed in the main memory. The third one was processed by using a disc memory by applying a merge sort program three times. The first two texts were processed within one and two hours by a standard SUN SPARC Station for the first stage mentioned above. The third text required about twenty four hours.
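The two stages of Section 2 can be illustrated with a minimal in-memory sketch. It is not the authors' program: the function names `build_tables` and `ngram_frequencies` are hypothetical, Python's built-in sort stands in for comb sort, and Python lists stand in for the 4-byte pointer entries and 1-byte coincidence entries.

```python
MAX_N = 255  # longest n handled; mirrors the one-byte coincidence entries


def build_tables(text, max_n=MAX_N):
    """First stage: sorted pointer table and coincidence (common prefix) table."""
    # Pointer i stands for the "word" text[i:]. Only the leftmost max_n
    # characters take part in the comparison. (Assumption: built-in sort
    # is used here instead of comb sort.)
    pointers = sorted(range(len(text)), key=lambda i: text[i:i + max_n])
    coincidence = []
    for a, b in zip(pointers, pointers[1:]):
        k = 0
        while (k < max_n and a + k < len(text) and b + k < len(text)
               and text[a + k] == text[b + k]):
            k += 1
        coincidence.append(k)  # shared prefix length of two adjacent words
    return pointers, coincidence


def ngram_frequencies(text, pointers, coincidence, n):
    """Second stage: scan the sorted table once, yielding (n-gram, count)."""
    i = 0
    while i < len(pointers):
        start = pointers[i]
        if start + n > len(text):  # word shorter than n: no n-gram starts here
            i += 1
            continue
        count = 1
        # The following words share the same n-gram while the coincidence
        # number between neighbours stays at n or more.
        while i + count - 1 < len(coincidence) and coincidence[i + count - 1] >= n:
            count += 1
        yield text[start:start + n], count
        i += count


if __name__ == "__main__":
    sample = "abracadabra"
    ptrs, coin = build_tables(sample)
    for gram, freq in ngram_frequencies(sample, ptrs, coin, 2):
        print(gram, freq)
```

Because each n-gram is yielded as soon as its run of equal prefixes ends, no separate n-gram table is held in memory, which matches the remark above about printing every n-gram as it is obtained. The same pair of tables can be reused for any n up to `MAX_N`.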
Calculation of the n-gram frequency (the second stage) took less than an hour including print-out.

3 Extraction of useful linguistic information from n-gram frequency data

3.1 Entropy

Everybody is interested in the entropy value of a language. Shannon's theory says that the entropy is calculated by the formula [3]

$$H_n(L) = -\sum_{w} P(w) \log P(w)$$

where P(w) is the probability of occurrence of w, and the summation is over all the different strings w of n characters appearing in a language. The entropy of a language L is

$$H(L) = \lim_{n \to \infty} H_n(L)$$

We calculated $H_n(L)$ for the texts mentioned in Section 2 for n = 1, 2, 3, .... The results are shown in Figure 3. Unlike our initial expectation that the entropy would converge to a certain constant value between 0.6 and 1.3, which C. E. Shannon estimated for English, it continued to decrease to zero. We checked in detail whether something was wrong with our method, but there was nothing doubtful. Our conclusion for this strange phenomenon was that a text quantity of a few mega characters is too small to get meaningful statistics for a large n, because we have more than 4000 different characters in the Japanese language. For English and many other European languages, which have alphabetic sets of less than fifty characters, the situation may be better. But a text quantity of a few giga bytes or more will still be necessary to get a meaningful entropy value for n = 10 or more.

[Figure 3: Entropy curve by n-gram. Vertical axis: entropy H; horizontal axis: n, from 0 to 40.]
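As a small illustration of the formula for $H_n(L)$, the sketch below estimates P(w) by the relative frequency of each n-gram in a text. The function name `ngram_entropy` is hypothetical, and the use of base-2 logarithms (entropy in bits) is an assumption, since the formula above does not fix the base.

```python
import math
from collections import Counter


def ngram_entropy(text, n):
    """Estimate H_n = -sum P(w) log2 P(w) over the n-grams observed in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    # P(w) is approximated by the relative frequency count / total.
    return -sum(c / total * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "abracadabra"
    for n in range(1, 6):
        print(n, round(ngram_entropy(sample, n), 3))
```

In practice the counts would come from the n-gram frequency table of Section 2 rather than from a `Counter` built over the raw text; the `Counter` only keeps the example self-contained.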

3.2 Obtaining the longest compound word

From the n-gram frequency table we can get much interesting information. When we have a string w (length n) of high frequency as shown in Figure 4, we can try to find the longest string w' which includes w by the following process, using the n-gram frequency table.

[Figure 4: Obtaining the longest word w' from a high frequency word fragment w. The example contexts are glossed as "must do...", "it is known that...", "can do..." and "can ask...".]

(1) extension to the left: We cut off the last character of w and add a character x to the left of w. We call this a cut-and-pasted word. We look for the character x which gives the maximum frequency to the cut-and-pasted word. We repeat the same operation step by step to the left and draw a frequency curve for these words. This operation is stopped when the frequency curve drops to a certain value. This process is performed by looking at the n-gram frequency table alone.

(2) extension to the right: The same operation as (1) is performed by cutting the left character and adding a character to the right.

(3) extraction of high frequency part: From the frequency curve as shown in Figure 4 we can easily extract a high frequency part as the longest string. An example is shown in Figure 5.

[Figure 5: Frequencies of partial strings and obtaining the longest word. Columns: partial strings, frequencies.]

The strings extracted in this way are very often compound words of postpositions in Japanese. Postpositional phrases are usually composed of one to three words, and are used as if they were compound postpositions.

3.3 Word extraction

After getting high frequency character strings by the above method we can consult dictionaries for these strings. We then find many strings which are not included in the dictionaries. Some are phrases (collocations, idiomatic expressions), some others are terminology words, and unknown (new) words.
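The cut-and-paste extension of Section 3.2 can be sketched as follows. This is only an illustration under stated assumptions: the names `ngram_table`, `extend` and `longest_string` are hypothetical, the stopping rule "drops to a certain value" is modeled by a fixed threshold `min_freq`, and a `max_steps` cap is added so that the loop ends on periodic text.

```python
from collections import Counter


def ngram_table(text, n):
    """n-gram frequency table (a Counter stands in for the table of Section 2)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def extend(w, table, alphabet, min_freq, to_left, max_steps=50):
    """Slide the n-character window one step at a time in one direction.

    Returns the characters added, in the order they were found.
    """
    added = []
    for _ in range(max_steps):
        if to_left:   # cut the last character, paste x on the left
            freq, x = max((table[x + w[:-1]], x) for x in alphabet)
        else:         # cut the first character, paste x on the right
            freq, x = max((table[w[1:] + x], x) for x in alphabet)
        if freq < min_freq:   # frequency curve has dropped: stop
            break
        added.append(x)
        w = x + w[:-1] if to_left else w[1:] + x
    return added


def longest_string(w, text, min_freq):
    """Longest high-frequency string around the fragment w."""
    table = ngram_table(text, len(w))
    alphabet = sorted(set(text))
    left = extend(w, table, alphabet, min_freq, to_left=True)
    right = extend(w, table, alphabet, min_freq, to_left=False)
    return "".join(reversed(left)) + w + "".join(right)


if __name__ == "__main__":
    sample = "xx can do it yy can do it zz can do it ww"
    print(repr(longest_string("an d", sample, min_freq=3)))
```

Only the n-gram table for the length of w is consulted, in line with the remark that the process is performed by looking at the n-gram frequency table alone.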
From the text data of the Encyclopedic Dictionary of Computer Science we extracted many terminological words. In general the frequencies of n-grams become smaller as n becomes larger. But we sometimes had relatively high frequency values in n-grams of large n. These were very often terminological words or terminological phrases. We extracted terminological phrases with such meanings as "programs written by (...) language", "problem solving in artificial intelligence", "page replacement algorithm" and "partial correctness of programs".

3.4 Compound word

We can get more interesting information when we compare data of different n. When we have a character string (length n) of high frequency, which we may be able to define as a word (w), it is advisable to check whether two substrings ($w_1$ and $w_2$) of length $n_1$ and $n_2$ ($n_1 + n_2 = n$), as shown in Figure 6, have high frequency appearance in the $n_1$-gram and $n_2$-gram tables. If we can find such a situation by changing $n_1$ (and $n_2$), we can conclude that the original character string w is a compound word of $w_1$ and $w_2$. Some examples are shown in Table 1.

[Figure 6: Possible segmentation of a word w into two components $w_1$ and $w_2$.]

[Table 1: Determination of compound words. Columns: compound word, proper segmentation, improper segmentation. Figures in parentheses are frequencies in the Encyclopedic Dictionary of Computer Science.]

3.5 Collocation

We can see whether a particular word w has strong collocational relations with some other words from the n-gram frequency results. We can get an n-gram table where n is sufficiently large, w is the prefix of these n-grams, and some words (w', w'', ...) may appear in relatively high frequency. This is shown in Figure 7. From this figure we can easily find that w w' and w w'' are two collocational expressions. For example, taking the word meaning "effect", we find that the expressions meaning "receive effect" and "give effect" have relatively high frequencies, and there are no other significant combinations in the n-gram table with this word as the prefix. The expression meaning "in and out of hospital" is almost always followed by the word meaning "repeat", and so we can judge that the combination is an idiomatic expression.

[Figure 7: Finding collocational word pairs w w' and w w''.]

4 Conclusions

We developed a new method and software for n-gram frequency calculation for n up to 255, and calculated n-grams for some large text data of Japanese. From these data we could derive words, compound words and collocations automatically. We think that this method is equally useful for languages like Chinese, where there are no word spaces in a sentence, and for European languages as well, and also for speech phoneme sequences to get more detailed HMM models.
Another possibility is that when we get large text data with part-of-speech tags, we can extract high frequency part-of-speech sequences by this n-gram calculation over the part-of-speech data. These may be regarded as grammar rules of the primary level. By replacing these part-of-speech sequences by single non-terminal symbols we can calculate new n-grams, and will be able to get higher level grammar rules. These examples indicate that large text data with varieties of annotations are very important and valuable for the extraction of linguistic information by calculating n-grams for larger values of n.

References

[1] C. E. Shannon: A mathematical theory of communication, Bell System Tech. J., Vol. 27, pp. 379-423, pp. 623-656, (1948).
[2] Stephen Lacey, Richard Box: Nikkei BYTE, November, pp. 30-32, (1991).
[3] N. Abramson: Information theory and coding, McGraw Hill, (1963).