Inducing translations from officially published materials in Canadian government websites

Size: px

Start display at page:

Download "Inducing translations from officially published materials in Canadian government websites"

Emma Patrick
5 years ago
Views:

1 Inducng translatons from offcally publshed materals n Canadan government webstes Qbo Zhu Statstcs Canada & Insttute of Cogntve Scence, Carleton Unversty 1125 Colonel By Drve, Ottawa, Ontaro, Canada K1S 5B6 Qbo.Zhu@statcan.ca Dana Inkpen School of Informaton Technology & Engneerng, Unversty of Ottawa 800 Kng Edward, Ottawa, Ontaro, Canada K1N 6N5 dana@ste.uottawa.ca Ash Asudeh Insttute of Cogntve Scence & School of Lngustcs and Language Studes, Carleton Unversty 1125 Colonel By Drve, Ottawa, Ontaro, Canada K1S 5B6 ash_asudeh@carleton.ca Abstract Btexts collected from web-based materals that are offcally publshed on government webstes can be used for a varety of purposes n language analyss and natural language processng. Mnng offcally publshed web pages can thus be an nvaluable undertakng for translators n government departments who are producng the translatons, and for machne translaton researchers who are studyng how those translatons are produced. In ths paper, we present the StatCan Daly Translaton Extracton System (SDTES) and demonstrate how t s used to nduce translatons from offcally publshed blngual materals from government webstes n Canada. New evaluaton results show that SDTES s a very effectve system for dentfyng and extractng sentences that are translaton pars from most of the federal government web pages whch are currently under the CLF2 (Common Look and Feel for the Internet 2.0) framework. 1 Introducton Well-algned blngual materals are a useful source of translatons that can be used for translaton studes, translaton recyclng, tranng data n machne learnng, translaton memores, nformaton retreval, machne translaton, machne aded translaton, and natural language processng (Jutras, 2000; Moore, 2002; Callson-Burch et al., 2005; Hutchns, 2005; Deng et al., 2006). In the past decade or so, an emergng trend n the use of blngual texts has become rather notceable: use of the web as a blngual corpus (Resnk, 1999; Chen and Ne, 2000). Wth the ncreasngly wdespread avalablty of vast quanttes of web-based blngual texts, more and more researchers are explorng ways to collect, from the web, blngual texts of such language pars as Englsh-French, Englsh- German, Englsh-Italan, Englsh-Chnese, Englsh-Arabc and others (Kraaj et al., 2003; Resnk and Smth, 2003). Web-based materals that are offcally-publshed blngual materals on Canadan government webstes contan exemplary translatons, and mnng these web-based blngual materals can be benefcal to translators, lngusts and researchers n many felds. The StatCan Daly Translaton Extracton System (SDTES) s a system that automatcally extracts translatons from the Daly news release texts of Statstcs Canada (Zhu et al., 2007). In ths paper, we wll descrbe the algorthms and procedures of SDTES and present new results that show that SDTES can be effectvely used to nduce translatons from other government webstes n Canada. The paper s dvded nto 4 parts. The frst part s ths ntroducton. In the second part, we wll

2 analyze the general characterstcs of the blngual texts from Canadan government webstes. The thrd part contans the algorthms and procedures we used to nduce translatons from web-based blngual materals. In the fourth part, we wll present evaluaton metrcs and results. The ffth part s the concluson. 2 Features of web-based materals n Canadan government webstes Here are some general features that we observed of offcally-publshed web materals on government webstes n Canada. 1. It s requred that federal government web pages n Canada be blngual. For example, t s relatvely rare for us to fnd a statc HTML Englsh page wthout a translated French page. Dfferent departments or agences may have dfferent fle namng conventons, but we generally do not have cases where two Englsh pages correspond to one French page or vce versa. 2. Texts are normally translated by government employees who are traned n translaton or text edtng. There are rules and gudelnes for them to observe about translatng, edtng and proofreadng texts to be offcally publshed. For example, wthout a good reason, edtors are not allowed to add texts n one language to convey messages that are not present n the other language. Ths means that nsertons and deletons at the sentence level wll be very rare, f any. 3. Web pages have to be presented n the same way for the two languages. For example, f we use a headng (h1) for the ttle n the Englsh page, we should do the same for the correspondng French page. Currently, almost all the federal government web stes are requred to be under the CLF2 framework. Therefore we can expect to fnd many consstent correspondences n some man HTML markups such as h1, h2, h3, table, etc. 4. Important modules and templates of web pages normally go through usablty tests before they are used for offcally publshed translaton products. One suggeston that often surfaces from users s that extra long web pages should be avoded when we can. If possble, long web pages should be ether splt nto shorter pages, or rewrtten to accommodate easy navgaton and browsng. Therefore web pages on government webstes are usually not very long. 5. Most texts are doman specfc or text-genre specfc, and use standard termnology of the specfc department. The texts are more of a standard wrtten style than of a spoken language style. Texts are mostly expostory data and the translaton style s typcally not free translaton. All these gve rse to frequent translaton correspondences at the word level, phrase level or sentence level. Some stes, such as the stes that publsh data for statstcs, banks, and fnances, contan a lot of numercal data. Numbers can serve as mportant anchors for text algnment. We used these observatons as basc assumptons about the web pages of government webstes n Canada, and desgned algorthms and procedures to algn the web pages and to detect potental errors n translaton and n algnment. 3 Algorthms and procedures SDTES contans two major components: one for blngual text algnment and the other for msalgnment detecton. In ths part, we present a set of protocols and algorthms that are desgned for the two components. 3.1 Text mappng usng the Gale-Church statstcal model The algnment component of SDTES s mostly based on the Gale and Church (1993) length model. The basc assumpton of ths model s that there s a strong lkelhood that, for example, a long sentence n Englsh wll correspond to a long sentence n French; smlarly a short sentence n one language wll correspond to a short sentence n the other. Roughly speakng, f the average lengths of sentences n French and Englsh are known, t s possble to set up a dstrbuton of algnment possbltes from the sentence length nformaton. Suppose we have parallel texts S and T whch can be splt nto n segments each. Each segment n S (s ) s the translaton of a segment n T (t ), and they are algned to form an algnment segment par a. For each, 1 n. A can be defned as the algnment of S and T that conssts of a seres of algned segment pars: A a1,..., an. For the set B of all possble algnments ( A B), the goal s to fnd the mum-lkelhood algnment:

3 A = arg Pr( A S, T ) (1) It s assumed that the probablty of any algned segment par s ndependent of any other segment par: B = arg Pr( a s ) = 1 A (2) The next assumpton s that the length dfference functon d ( s ) s the only feature nfluencng the algnment probablty: B = arg Pr( a d( s )) = 1 A (3) By Bayes theorem, equaton (3) becomes: = 1 P( N M ) P( M ) P ( M N) =, P( N) B Pr( d ( s ) a ) Pr( a ) A = arg (4) Pr( d( s, t )) where the dstrbutons Pr( d ( s, t ) a ) and Pr( a ) can be estmated usng hand-algned data. When the normalzng constant Pr( d ( s )) s gnored and the logarthm s used, B = arg log Pr( d( s ) a )Pr( a ) = 1 A (5) Gale and Church employed dynamc programmng for establshng the optmal algnment to solve Equaton (5). When applyng the Gale and Church algorthm n SDTES, we adopted a two-parse algnment procedure. Pror to the procedure, SDTES moves table contents and mage contents to the end of the HTML page because n many cases these floatng elements n HTML pages can end up n dfferent paragraph postons n dfferent languages. In some cases these elements can be the source of massve msalgnment for the Gale-Church algorthm. In the frst parse, SDTES counts the few man HTML elements such as h2tle, and others to see f the par of text fles has the same number of man HTML features n them. If the numbers were the same, the system splts the texts nto blocks separated by these feature HTML tags. SDTES frst locates the paragraph boundares n the texts and assgn sentence delmters to them. Then t assgns paragraph delmters to text blocks that are dvded by the man HTML tags. If the numbers of major HTML elements are dfferent, the system treats the whole text document as a sngle text block, and assgns a paragraph delmter to t. When the paragraph delmters and the sentence delmters are all assgned, SDTES does the frst round of text algnment by the Gale-Church algorthm. In the second parse, the system automatcally reconstructs the Englsh document and the French document from the pared paragraphs that are algned n the frst parse. SDTES locates the sentence boundares n the algned texts and assgn sentence delmters to them. Then SDTES separates the two halves n each algned par, and assgns a paragraph delmter to each of them. After ths, texts of dfferent languages are collected and reassembled n two fles, one for Englsh and one for French. By reorganzng the text structures on the bass of the paragraph parng results n the frst around of text algnment, SDTES was able to reset the Englsh and French documents to the orgnal text format pror to the ntal algnment. The two nput fles are processed wth the newly assgned boundary symbols by the Gale-Church algorthm. The second parse produces text algnment at a more fne-graned text segment level. 3.2 Msalgnment detecton usng anchor nformaton Anchor nformaton ncludes postons or propertes n one text whch seem to match up wth those n a parallel text. The nformaton can be about structural features n the text or delmters and markers that ndcate hard and soft boundares (Gale and Church, 1993:89), or true ponts of correspondence (Melamed, 1999:107) n algnment. Usng anchor nformaton, regons of text can be dentfed and algnments of text segments can be sought. Most algnment methods make use of anchor nformaton n one way or another. In STDES, we use the followng structural features as anchors to assst msalgnment dentfcaton.

4 HTML markups: Anchor nformaton can be markups that go wth the text and that reveal metanformaton or style nformaton about the text. For example some commonly used HTML tags such as h1, h2, h3, br, p, hr, table,, pre, form, mg, and a can be good anchors n blngual text algnment. For a more detaled lst of structural tags, format tags, content tags and rrelevant tags that can be used n algnment, see Sanchez-Vllaml et al. (2006). As ndcated earler, n Canadan government webstes, we are lkely to have a prolferaton of such HTML markups. Usually, f a text n French contans a markup for a secton n talcs, then the correspondng secton n Englsh s lkely to have the same markup. We dd some HTML style unfcaton formattng so that some parts of the HTML codes are hghlghted, whle some are gnored. For example, the code <a > becomes <a alnk> after the unfcaton formattng, because otherwse the lnk to an Englsh webpage and the lnk to a French webpage mght have been dfferent. Lexcal unts: Specfc lexcal unts such as words or phrases can serve as anchor ponts (Brown et al., 1991). In SDTES, we manly use cognate words as anchors to help locate traces of algnment devatons. Identfcaton of cognates n SDTES s a two-step operaton. The frst step s to produce canddate cognate lsts usng the K-vec algorthm (Fung and Church, 1994). The man objectve s to fnd cognate canddates wthn an acceptable text-regon range and lmt the number of words to be consdered as cognate pars. The K-vec method was developed as a means of generatng a quck-and-drty estmate of a blngual lexcon that could be used as a startng pont for a more detaled algnment algorthm (Fung and Church, 1994). We use the K-vec++ package (Pedersen and Varma, 2002) for the mplementaton of the K-vec algorthm. The package s called the K-vec++ package because t extends the K-vec algorthm n a number of ways. Usng the Perl programs n the K-vec++ package, SDTES was able to obtan a very rough lst that mght contan cognates for each par of documents, although there s a lot of nose n the lst too. In the second step, we apply an algorthm called Acceptable Matchng Sequence (AMS) to dentfy true cognates from the canddate cognate lsts generated by the K-vec algorthm. Ths s an algorthm that we desgned and developed n SDTES. An AMS has two non-overlappng substrngs that can be matched n the same order n both of the words n a cognate par. The algorthm extracts two substrngs (θ a and β b ) from a source word, say a French word (W 1 ), wth a length threshold (T) for the two substrngs combned. The purpose s to fnd f the target Englsh word (W 2 ) has the two substrngs θ a and β b n the same order n the strng sequence. Suppose x = mn(length(w 1 ), length(w 2 )), and y=(length(w 1 ), length(w 2 )), and z s the length dfference threshold. length(w 1 ) - length(w 2 ) z. We use the length dfference threshold for ntal flterng: f y>10 then 0 z 4, otherwse 0 z 3. SDTES determnes the T parameter wth reference to y. Snce T s the combned length of θ a and β b, 0 a T, 0 b T, a+b=t. The system dscards word pars wth y<4. SDTES sets T=8 f y>10. For all the rest, SDTES uses the smple lnear regresson model T=0.5y+1.8 to compute the threshold value. Ths regresson model s derved from the regresson analyss of sample cognate pars we pcked from the StatCan Daly releases. When matchng θ a and β b n W 2, skppng some characters s acceptable before, after or between the two substrngs; we call them Don t Care Characters (DCCs). The ntal value a n θ a s set to 0, and b n β b to T. The system starts to match the two substrngs θ a and β b n W 2. If a match s found for θ a and β b n W 2 n the rght order, W 1 and W 2 are a cognate par, f not, one substrng θ a s ncreased n length (a=a+1) whle the other substrng β b gets decreased (b=b-1). The search contnues tll a two-substrng match s found or a>t or b<0. The man advantage of usng K-vec wth AMS for cognate dentfcaton s that K-vec does not rely on sentence boundary nformaton. In ths way, t can help detect errors n algnment that depend on sentence boundares. At the same tme, AMS has the straghtforwardness of the naïve matchng algorthm of Smard et al. (1992), but avods problems caused by common prefxes or by the requrement that the frst four letters have to match. Ths mprovement can ncrease the number of correct cognate pars dentfed. At the same tme, AMS nherts the strength of no-crossnglnks constrant n the Longest Common Subsequence Rato (LCSR) algorthm by Melamed (1999). Also, n lmtng the number of substrngs

5 to be matched, AMS overcomes the nherent weakness of LCSR n postng non-ntutve lnks because of lack of context senstvty, as noted n a recent study by Kondrak and Dorr (2004). Ths can help reduce the number of false postves for SDTES such as courters/computers, mensuels/results, and paruton/startng. We found that the K-vec technque together wth AMS algorthm s a good ft for cognate dentfcaton for the msalgnment detecton purpose n SDTES. Numbers and punctuatons: Smlarly, numbers n texts can serve as anchors n algnment. They are good ndcatons of correspondence because a number n one language s usually nterpreted as a number n the other language. Some punctuaton marks can also be anchor ponts. For example, f there s a queston mark n Englsh, normally we are expectng a correspondng queston mark n ts translaton text n French. When detectng msalgned portons of texts, what SDTES does s to parse every algned text segment par, and compares the features n each half of the par before t arrves at one of the two decsons: pass and problem. The detectng process starts from two pror flterng mechansms. One of them s the length rato crtera. If a text segment n one language s more than 3 tmes longer than the correspondng text segment n the other language, the par s marked as a problem par. The second crteron s matchng type. Because the extracted translatons are ndependent translaton pars that wll be used for translaton memory systems and cross-language nformaton retreval systems, matchng types lke 1:0 and 0:1 have to be dscarded. When these two crtera have been checked, SDTES compares the structural and lexcal clues of the HTML text segments for further detecton. These clues nclude selected cognates, punctuaton marks, numbers and HTML tags. Fnally, we have a no-match-tolerance prncple: f there are no structural and textual clues present, and f the two pror flterng crtera (length and matchng type) are checked, we mark the segment as pass. When the msalgnment detecton process s completed, SDTES apples a flterng mechansm to the lst of stored translatons to elmnate the followng types of algned pars: 1) Pars that contan only meta-nformaton codng or codes that are derved from the HTML codng unfcaton process. 2) Algned segments that nclude only the numercal nformaton. 3) Duplcate sentences or smlar constructons that have been seen more than once n the collecton of texts. Sometmes ths nvolves unfyng or dscardng some nformaton such as numbers and tags n the text. 4 Results and performance evaluaton For evaluaton, we measure the performance of the two man components of SDTES: one s the text algnng component and the other s the msalgnment detecton component. We used the fles collected from for evaluaton. These web pages contan 2,682 offcally publshed news releases from Aprl of 2000 to June of We assgned each page a fle name begnnng wth a sequental number from 1 to 9. We grouped these fles accordng to the ntal number n the fle name such as 1, 2, 3,, 9, and randomly chose two fles from each of the 9 groups. Then we manually algned these 18 selected fles to buld the reference collecton. For the evaluaton of the algnment component, we defne M as the set of segments n the manually algned reference collecton, A as the set of machne algned segments before the msalgnment detecton devce s appled. Precson (P) and recall (R) are as follows: A M A M P = R = A M We used the same randomly selected fles to evaluate the performance of the msalgnment detecton component. Here, the system proposed algnments (A') nclude those remanng pars after the system has fltered what t dentfes as msalgned segments. M s the reference set,.e., the algned translaton unts that the human evaluator thnks the system should have reported. The recall (R) represents the proporton of system-proposed translaton segments (A') that are rght wth respect to the reference (M), and the precson (P) s the proporton of correctly proposed algnment segments wth respect to the total of those proposed (A'). A' M A' M P = R = A' M As we can see from Table 1, good results have been reported of the text algnment component. For the 18 algned fles used for evaluaton, the overall

6 algnment precson for the algned translaton pars s 0.967; the recall s Table 1 also ndcates that SDTES s accurate n detectng msalgned pars (P=0.988 and R=0.974). A comparson of precson and recall for the data sets before and after the msalgnment flterng shows the effect of the msalgnment detecton algorthm. Precson mproved from to wth a neglgble loss of recall from to Text algnment Msalgnment detecton Fle P R P R Overall * Table 1. Performance evaluaton of the two man components n SDTES Whle the SDTES model has acheved good results for offcally publshed government blngual text data on the web, there are also lmtatons. For the text algnment part, we stll fnd some examples, although the number s very small, of chans of msalgned sentences. For about a dozen of so fles n the collecton, chan msalgnment segments can be found where the Gale and Church algorthm fnds t hard to come back to the rght track. Although most of the msalgned pars can be fltered by the msalgnment detecton mecha- * Computed over all the mstakes n all the documents. nsm, t s worthwhle to analyze these fles to trace the causes of the msalgnments. There are four types of texts that can trgger massve msalgnment for the Gale and Church algorthm. Alphabetcal lsts: In some government texts, there are lsts of names, commttees, or terms and defntons. Items n these lsts usually land n dfferent postons when sorted alphabetcally n dfferent languages. Because of the poston changes, translatons n these lsts can hardly match on a lne-to-lne bass. Swapped paragraphs: When paragraphs have rearranged postons because of blocks of texts such as note to readers sectons or nserted menu tems that are ntroduced by dv and other HTML tags, the correct algnment chan can be dsrupted. Deletons and nsertons: For any accdental deletons and nsertons, once the system starts to algn the wrong text segments, t usually contnues to msalgn a number of text segments before t can fnally correct tself. Algnment types: We also found a few examples where the exstng 6 algnment types such as 0:1, 1:0, 1:1, 1:2, 2:1 or 2:2 were not enough to cover the rght algnment. We need somethng lke 3:1 or 3:2 to algn them properly. When we gve t an alternatve algnment pattern, t would establsh the wrong lnks wth the followng text segments and cause chan msalgnment. In assessng the performance of the msalgnment detecton devce, we have to take nto account two types of errors. One type nvolves some msalgned pars that stll go undetected. We call them overlooked pars. The other type ncludes correct algnments that are wrongly labeled as msalgnments and are thus mstakenly fltered n the process. We call these overdone pars. As can be seen from Table 1, the number of overlooked pars that are not captured by the system was mnmal: 98.8% of the dentfed correct algnment pars are truly relable translaton pars. Mostly, the overlooked pars are pars of partally correct algnments such as those close to pars wth 1:0 or 0:1 algnment patterns. Generally speakng, overdone pars are very few. When we examne these pars, we found that nconsstent use of HTML codes and varous ways of nterpretng numbers were two notable sources of errors n msalgnment detecton. Inconsstent use of HTML tags: In analyzng the results of the evaluaton, we encountered some

7 examples n whch HTML codes are not used consstently n the two halves of the algned translaton segments. For example, n some Englsh sentences, we found the span tag, but the correspondng tag n French s font. In some cases, there are some HTML markups that are present n only one language, but not n the other. For example, we have some French sentences wth the HTML tag sup, but the same tag s nowhere to be found n ts correspondng translaton segment n Englsh. Number nterpretatons: There are cases n whch we have numbers on one sde of the algned par, but not necessarly on the other sde. In some cases, n Englsh we fnd the number 10, but n French, the correspondng number s X. In other cases, numbers are found n both texts, but the numbers are not the same. For example n Fgure 1, usng dfferent numbers (1990 vs. 90) s not necessarly wrong, but t confuses SDTES and leads the system to msjudge correct translaton pars as overdone pars. e 1. In the early 1990s, despte larger declnes n earnngs n the North than n Canada, employment ncome remaned hgher. f 1. Au début des années 90, malgré une basse des gans plus prononcée dans le Nord que dans l'ensemble du Canada, le revenu d'emplo y est demeuré plus élevé. Fgure 1. Dstnct numbers are used n the algned translaton par SDTES was ntally bult on the bass of the Daly releases of Statstcs Canada publshed between 2004 and Then we extended the applcaton to data collecton of the Daly releases from 1995 to Altogether, we assembled 32,276 Daly release fles (16,138 for each language). After flterng and formattng, SDTES generated 488,646 algned text segments. In ths study, we are usng the model for fve other government webstes of whch the Canadan Forces webste s one, and more than 200,000 pars of translatons were generated. The algned text segments are organzed n an XML format for easy exportablty nto the translaton memory system and other applcaton systems. Meta nformaton tems about each of the algned pars were recorded, such as the length nformaton (before the HTML codes are strpped), the source of the matched strngs, and the matchng patterns. Fgure 2 ncludes a sample record of the fnal algned segments. For the sake of presentaton clarty, lne breaks are added to dfferent levels of XML elements. <bead> <en>it wll ensure that our Canadan Forces n Afghanstan receve the supples and equpment they need to get the job done.</en> <fr>il fera en sorte que les Forces canadennes en Afghanstan reçovent l'approvsonnement et le matérel dont elles ont beson pour fare leur traval.</fr> <pa>1:1</pa> <d>2185:22</d> <re>pas</re> <le>120=150</le> </bead> Fgure 2. SDTES XML output (modfed verson) These algned pars of translatons are clean and consstent, and most of them are free of translaton errors. They are ready to be used n many applcatons and systems such as translaton memory systems, nformaton retreval templates, and machne translaton systems. 5 Concluson In ths paper, we presented the algorthms and procedures of the StatCan Daly Translaton Extracton System (SDTES) and demonstrated how t s used to nduce translatons from offcally publshed materals n Canadan government webstes. SDTES contans two man components: one for text algnment and the other for msalgnment detecton that we use to flter out possble translaton errors. For the text algnment component, we used the Gale-Church algorthm for a two-parse algnment at the paragraph level and the text segment level. For msalgnment detecton, we used a few structural features as anchors for btext comparson. The structural features nclude cognate words, numbers, HTML markups, punctuaton marks, etc. In extractng cognates, we employed the K-vec algorthm n combnaton wth our AMS algorthm. Our evaluaton shows that SDTES has acheved good results and that t s a good model for nducng translatons from offcally-publshed materals on federal government webstes n Canada beyond

8 the Statstcs Canada webstes for whch t was ntally desgned. It would have been good f we could set the orgnal Gale and Church algorthm (wthout the two-parse procedure) as the baselne, and compare t wth the algnment method we adopted n SDTES to evaluate the effects of the two-parse procedure. However, ths attempt was deterred by problems n dentfyng the paragraph boundares wthout the help of HTML tags n the source fles. If we run the algorthm wthout paragraph detecton, the results would be very poor. We leave ths to future work. Acknowledgments We thank Professor Davd Sankoff, Dr. Joel Martn, Professor Charles J. Fllmore, Professor Andrew Brook, and Professor Jm Daves for ther nvaluable nput on varous parts of the paper and for dscussng the StatCan Daly Translaton Extracton System wth us. We also thank the three anonymous revewers for ther nsghtful comments. References Brown, P. F., J. C. La, and R. L. Mercer Algnng sentences n parallel corpora. In Proceedngs of the 29th Annual Meetng of the Assocaton for Computatonal Lngustcs, pages , Berkeley, CA. Callson-Burch, C., C. Bannard, and Schroeder, J A compact data structure for searchable translaton memores. In 10th European Assocaton for Machne Translaton Conference: Buldng Applcatons of Machne Translaton, pages Chen, J. and J. Y. Ne Parallel web text mnng for cross-language IR. In Proceedngs of RIAO 2000: Content-Based Multmeda Informaton Access, vol. 1, pages 62-78, Pars, France. Deng, Y., S. Kumar, and W. Byrne Segmentaton and algnment of parallel text for statstcal machne translaton. Natural Language Engneerng, 12(4): Fung, P. and K. W. Church K-vec: A new approach for algnng parallel texts. In Proceedngs of the 15th Internatonal Conference on Computatonal Lngustcs, pages 1,096-1,102, Kyoto, Japan. Gale, W. A. and K. W. Church A program for algnng sentences n blngual corpora. Computatonal Lngustcs, 19(1): Hutchns, J Towards a defnton of examplebased machne translaton. In MT Summt X. Workshop: Second Workshop on Example-Based Machne Translaton, pages 63-70, Phuket, Thaland. Jutras, J-M An automatc revser: the TransCheck system. In Proceedngs of Appled Natural Language Processng, pages , Seattle, WA. Kondrak G. and B. Dorr Identfcaton of confusable drug names: a new approach and evaluaton methodology. In Proceedngs of the 20th Internatonal Conference on Computatonal Lngustcs, vol. II, pages , Geneva, Swtzerland. Kraaj, W., J.-Y. Ne. and M. Smard Embeddng web-based statstcal translaton models n crosslanguage nformaton retreval. Computatonal Lngustcs, 29(3): Melamed, I. Dan Btext maps and algnment va pattern recognton. Computatonal Lngustcs, 25(1): Moore, Robert C Fast and accurate sentence algnment of blngual corpora. In Machne Translaton: From Research to Real Users, 5th Conference of the Assocaton for Machne Translaton n the Amercas, pages , Sprnger-Verlag, Berln. Pedersen, Ted and N. Varma K-vec++: Approach for fndng word correspondences. Avalable onlne: ~tpederse/parallel.html. Resnk, P Mnng the web for blngual text. In Proceedngs of the 37th Meetng of ACL, pages , College Park, MD. Resnk, P. and Noah A. Smth The web as a parallel corpus. Computatonal Lngustcs, 29 (3): Sanchez-Vllaml, E., S. Santos-Anton, S. Ortz-Rojas and M. L. Forcada Evaluaton of algnment methods for HTML parallel text. In Advances n Natural Language Processng, Proceedngs of Fn- TAL 2006, 5th Internatonal Conference on Natural Language Processng, pages , Turku, Fnland, LNCS Sprnger, Berln. Smard, M., G. Foster and P. Isabelle Usng cognates to algn sentences n blngual corpora. In Proceedngs of the Fourth Internatonal Congress on Theoretcal and Methodologcal Issues n Machne Translaton, pages 67-81, Montreal, Canada. Zhu, Q., D. Inkpen and A. Asudeh Automatc extracton of translatons from web-based blngual materals. Machne Translaton, 21 (3):

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,