Proceedings of the Third NTCIR Workshop

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Wen-Cheng Lin and Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, TAIWAN
E-mail: denislin@nlg.csie.ntu.edu.tw; hh_chen@csie.ntu.edu.tw

Abstract

This paper deals with Chinese, English and Japanese multilingual information retrieval. Several merging strategies, including raw-score merging, round-robin merging, normalized-score merging, and normalized-by-top-k merging, were investigated. Experimental results show that the centralized approach is better than the distributed approach. In the distributed approach, normalized-by-top-k with consideration of translation penalty outperforms the other merging strategies.

Keywords: Merging Strategy, Multilingual Information Retrieval, Query Translation

1. Introduction

Multilingual Information Retrieval (MLIR) [7] uses a query in one language to retrieve documents in different languages. In addition to the language translation issue, how to construct a ranked list that contains documents in different languages from several text collections is also critical. There are two possible architectures in MLIR, i.e., centralized and distributed. In a centralized architecture, a huge collection that contains documents in different languages is used. In a distributed architecture, documents in different languages are indexed and retrieved separately, and all the results are merged into a multilingual ranked list. Several merging strategies have been proposed. Raw-score merging selects documents based on their original similarity scores. Normalized-score merging normalizes the similarity score of each document and sorts all the documents by their normalized scores; for each topic, the similarity score of each document is divided by the maximum score in that topic. Round-robin merging interleaves the results in the intermediate runs. In this paper, we adopted the distributed architecture and proposed merging strategies to merge the result lists. The rest of this paper is organized as follows. Section 2 describes the indexing method.
Section 3 shows the query translation process. Section 4 describes our merging strategies. Section 5 shows the experimental results. Section 6 gives the concluding remarks.

2. Indexing

The document set used in the NTCIR3 MLIR task consists of Chinese, English and Japanese documents. The numbers of documents in the Chinese, English, and Japanese document sets are 381,681, 22,927 and 236,664, respectively. The participants can use two or all of these three document collections as the target language sets. We used all three document collections to conduct X→CJE experiments. The IR model we used is the basic vector space model. Documents and queries are represented as term vectors, and the cosine vector similarity formula is used to measure the similarity of a query and a document. Appropriate terms are extracted from each document in the indexing stage. In the experiments, the <HEADLINE> and <TEXT> sections were used for indexing. For English, all words were retained, and all letters were transformed to lower case. The Japanese documents were first segmented by ChaSen [6]. All words in the above two sections were used as index terms. For Chinese, we used Chinese character bigrams to index Chinese documents. The term weighting function for all document sets is tf*idf.

© 2003 National Institute of Informatics

3. Query Translation

In the experiment, the Japanese queries were used as source queries and translated into the target languages, i.e., English and Chinese. We used the CO model [1], which is a hybrid dictionary- and corpus-based
The Third NTCIR Workshop, Sep. 2001 - Oct. 2002

method, to translate queries. Since we did not have a Japanese-Chinese dictionary, we used English as an intermediate language in the initial study. The Japanese queries were translated into English, and then the translated English queries were further translated into Chinese. The Japanese queries were translated into English as follows: (a) The Japanese query was segmented by ChaSen. (b) For each Japanese query term, we found its English translation equivalents by looking up a Japanese-English dictionary. (c) By using co-occurrence information trained from the TREC6 text collection [4], we selected the best English translation equivalent for each source query term. We adopted mutual information (MI) [2] to measure the co-occurrence strength between words. For a query term, we compared the MI values of all the translation equivalent pairs (x_i, y_j), where x_i is a translation equivalent of this term, and y_j is a translation equivalent of another query term within a sentence. The word pair (x_i, y_j) with the highest MI value is extracted, and the translation equivalent x_i is regarded as the best translation equivalent of this query term. Selection is carried out based on the order of the query terms. Translated English queries were translated into Chinese using the same method except that English queries did not need to be segmented. The MI values of Chinese words were trained from the Academia Sinica Balanced Corpus (ASBC) [5].

4. Merging Strategies

There are two possible architectures in MLIR, i.e., centralized and distributed. In a centralized architecture, document collections in different languages are viewed as a single document collection and are indexed in one huge index file. The advantage of the centralized architecture is that it avoids the merging problem. It needs only one retrieving phase to produce a result list that contains relevant documents in different languages. One of the problems of a centralized architecture is that index terms may be over-weighted.
In other words, the total number of documents increases, but the number of occurrences of a term does not. In the tf*idf scheme, the idf of a term is therefore increased and the term is over-weighted. This phenomenon is especially clear for a small text collection. For example, the N in the idf formula is 22,927 when only the English document collection is used. However, this value is increased to 641,272, i.e., about 27.97 times larger, if the three document collections are merged together. Comparatively, the weights of Chinese index terms are increased only 1.68 times due to the size of N. The increments of weights are unbalanced for document collections of different sizes. Thus, an IR system may prefer documents in a small document collection. The second architecture is a distributed MLIR. Documents in different languages are indexed and retrieved separately. The ranked lists of all monolingual and cross-lingual runs are merged into one multilingual ranked list. How to merge result lists is a problem. Recent works have proposed various approaches to deal with the merging problem. A simple merging method is raw-score merging, which sorts all results by their original similarity scores and then selects the top ranked documents. Raw-score merging is based on the assumption that the similarity scores across collections are comparable. However, the collection-dependent statistics in document or query weights invalidate this assumption [3, 8]. Another approach, round-robin merging, interleaves the results based on the rank. This approach assumes that each collection contains approximately the same number of relevant documents and that the distribution of relevant documents is similar across the result lists. Actually, different collections do not contain equal numbers of relevant documents. Thus, the performance of round-robin merging may be poor. The third approach is normalized-score merging. For each topic, the similarity score of each document is divided by the maximum score in this topic. After adjusting scores, all results are put into a pool and sorted by the normalized score.
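The three baseline strategies just described can be sketched as follows. The representation of a run as a score-sorted list of (document, score) pairs and the function names are our own illustration, not code from the system described in this paper.

```python
# Sketch of the baseline merging strategies described above. Each run is
# a list of (doc_id, score) pairs already sorted by descending score.

def raw_score_merge(runs, top_n):
    """Pool all results and keep the top_n by original similarity score."""
    pooled = [pair for run in runs for pair in run]
    return sorted(pooled, key=lambda p: p[1], reverse=True)[:top_n]

def round_robin_merge(runs, top_n):
    """Interleave the runs rank by rank."""
    merged = []
    for rank in range(max(len(run) for run in runs)):
        for run in runs:
            if rank < len(run):
                merged.append(run[rank])
    return merged[:top_n]

def normalized_score_merge(runs, top_n):
    """Divide each score by the run's maximum score, then pool and sort."""
    pooled = []
    for run in runs:
        max_score = run[0][1]  # runs are sorted, so the first score is the maximum
        pooled.extend((doc, score / max_score) for doc, score in run)
    return sorted(pooled, key=lambda p: p[1], reverse=True)[:top_n]
```

Note that raw-score merging only makes sense when the scores are comparable across runs, which is the assumption criticized above, while normalized-score merging forces every run's top document to the same score of 1.0, which causes the rank-2 problem discussed next.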
This approach maps the similarity scores of different result lists into the same range, from 0 to 1, and makes the scores more comparable. But it has a problem. If the maximum score is much higher than the second one in the same result list, the normalized score of the document at rank 2 would be made lower even if its original score is high. Thus, the final rank of this document would be lower than that of the top ranked documents with very low but similar original scores in another result list. A similarity score reflects the degree of similarity between a document and a query. A document with a high similarity score seems to be more relevant to the desired query. But, if the query is not formulated well, e.g., because of an inappropriate translation of the query, a document with a high score still does not meet the user's information need. When merging results, such documents with incorrectly high scores should not be included in the final result list. Thus, we have to consider the effectiveness of each individual run in the merging stage. The basic idea of our merging strategy is to adjust the similarity scores of documents in each result list to make them more comparable and to reflect their confidence. The similarity scores are adjusted by the following formula:

    Ŝ_ij = S_ij × (1 / S_ik) × W_i    (1)
where S_ij is the original similarity score of the document at rank j in the ranked list of topic i, Ŝ_ij is the adjusted similarity score of the document at rank j in the ranked list of topic i, S_ik is the average similarity score of the top k documents, and W_i is the weight of query i in a cross-lingual run. We divide the weight adjusting process into two steps. First, we use a modified score normalization method to normalize the similarity scores. The original score of each document is divided by the average score of the top k documents instead of the maximum score. We call this normalized-by-top-k. Second, the normalized score is multiplied by a weight that reflects the retrieval effectiveness for the desired topic in each text collection. Because the retrieval performance is not known in advance, we have to estimate the performance of each run. For each language pair, the queries are translated into the target language and then the system retrieves the target language documents. A good translation should have better performance. We can predict the retrieval performance based on the translation performance. There are two factors affecting the translation performance, i.e., the degree of translation ambiguity and the number of unknown words. For each query, we compute the average number of translation equivalents of the query terms and the number of unknown words in each language pair, and use them to compute the weight of each cross-lingual run. The weight can be determined by one of the following three formulas:

    W_i = c_1 + c_2 × ((51 − T_i) / 50) + c_3 × (1 − U_i / n_i)    (2)

    W_i = c_1 + c_2 × (1 / T_i) + c_3 × (1 − U_i / n_i)    (3)

    W_i = c_1 + c_2 × (1 / (T_i × n_i)) + c_3 × (1 − U_i / n_i)    (4)

where W_i is the weight of query i in a cross-lingual run, T_i is the average number of translation equivalents of the query terms in query i, U_i is the number of unknown words in query i, n_i is the number of query terms in query i, and c_1, c_2 and c_3 are tunable parameters with c_1 + c_2 + c_3 = 1. In the experiment, the Japanese queries were translated into English, and then the translated English queries were further translated into Chinese.
Some Japanese query terms have no English translation, and therefore they cannot be translated into English or, consequently, into Chinese. The unknown words in Japanese-English translation are also unknown in English-Chinese translation. Thus, the number of unknown words in Japanese-English-Chinese translation is the sum of those in Japanese-English translation and English-Chinese translation.

5. Results

We submitted three J→CJE multilingual runs and one E→E monolingual run. All runs use the description field only. The English monolingual run, NTU-E-E-D-01, uses the official English topics to retrieve English documents. The three multilingual runs use Japanese topics as source queries. The Japanese topics were translated into English and Chinese by the CO Model described in Section 3. The source Japanese topics and the translated English and Chinese topics were used to retrieve Japanese, English and Chinese documents, respectively. Then, we merged these three result lists. We used different merging strategies for the three multilingual runs.

1. NTU-J-CJE-D-01
First, we used formula (1) to adjust the similarity score of each document. We used the average similarity score of the top 10 documents for normalization. The weight W_i was determined by formula (2). The values of c_1, c_2 and c_3 were set to 0.1, 0.4 and 0.5, respectively. Then all results were put in a pool and sorted by the adjusted score. The top 1000 documents were selected as the final results.

2. NTU-J-CJE-D-02
The merging strategy is the same as run NTU-J-CJE-D-01 except that the weight W_i was determined by formula (3).

3. NTU-J-CJE-D-03
The similarity scores were adjusted by multiplying a constant weight. The similarity scores in the Japanese-English run were multiplied by 1.5; the similarity scores in the Japanese-Chinese run were multiplied by 0.5; the similarity scores in the monolingual Japanese run were not changed. These values were trained from the experiments using dry-run data. The results of our official runs are shown in Table 1.
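As a sketch, the merging used in run NTU-J-CJE-D-01 (formula (1) with normalized-by-top-k scores and the translation penalty of formula (2)) can be written as follows. The parameter names (T, U, n, c_1, c_2, c_3, k) follow Section 4, with the default coefficients of NTU-J-CJE-D-01; the algebraic shape of formula (2) is our reading of the paper, and the runs in the usage note are invented for illustration.

```python
# A minimal sketch of the merging used in run NTU-J-CJE-D-01:
# normalized-by-top-k (formula (1)) combined with the translation
# penalty of formula (2).

def translation_weight(T, U, n, c1=0.1, c2=0.4, c3=0.5):
    """Formula (2): W = c1 + c2*(51 - T)/50 + c3*(1 - U/n), where T is the
    average number of translation equivalents per query term, U the number
    of unknown words, and n the number of query terms."""
    return c1 + c2 * (51 - T) / 50 + c3 * (1 - U / n)

def normalized_by_top_k(run, W, k=10):
    """Formula (1): divide each score by the average of the top-k scores
    of its own run, then multiply by the run weight W."""
    avg_top_k = sum(score for _, score in run[:k]) / min(k, len(run))
    return [(doc, score / avg_top_k * W) for doc, score in run]

def merge(weighted_runs, top_n=1000):
    """Adjust every (run, W) pair, pool the results, keep the top_n documents."""
    pooled = []
    for run, W in weighted_runs:
        pooled.extend(normalized_by_top_k(run, W))
    return sorted(pooled, key=lambda p: p[1], reverse=True)[:top_n]
```

For a monolingual run W is simply 1 (no translation penalty), while a poorly translated cross-lingual run gets W < 1, so its documents are demoted in the pooled list.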
Table 2 shows the unofficial evaluation of the intermediate monolingual (i.e., Japanese to Japanese) and cross-lingual runs (i.e., Japanese to English and Japanese to Chinese). The relevance assessment for each language is extracted from the multilingual assessment file. The performance of run NTU-J-CJE-D-02 is slightly better than that of run NTU-J-CJE-D-01. The weight W_i of run NTU-J-CJE-D-02 is smaller than that of run NTU-J-CJE-D-01 for most queries. But it seems not small enough for the Japanese-Chinese cross-lingual run. Since the Chinese translations of the Japanese queries are not translated well, the performance of the Japanese-Chinese cross-lingual run is worse. When merging results, the
Japanese-Chinese cross-lingual run should have a lower weight.

Table 1. The results of official runs

Run             # Topic  Scoring Mode  Average Precision  Recall
NTU-E-E-D-01    32       Rigid         0.2072             391 / 444
                         Relax         0.2519             641 / 741
NTU-J-CJE-D-01  50       Rigid         0.0884             1211 / 4053
                         Relax         0.0839             1769 / 6648
NTU-J-CJE-D-02  50       Rigid         0.0907             1172 / 4053
                         Relax         0.0865             1719 / 6648
NTU-J-CJE-D-03  50       Rigid         0.0934             1194 / 4053
                         Relax         0.0893             1766 / 6648

Table 2. The results of intermediate runs

Run           # Topic  Scoring Mode  Average Precision  Recall
ntu-fr-j-j-d  45       Rigid         0.1506             1064 / 1659
ntu-fr-j-e-d  40       Rigid         0.1269             225 / 456
ntu-fr-j-c-d  48       Rigid         0.0146             517 / 1938

The performances of the official multilingual runs do not differ much. The best run is NTU-J-CJE-D-03, whose average precision is 0.0934. The weights trained from the dry-run experiments still perform well in the formal run. In order to compare the effectiveness of different merging strategies, we also conducted several unofficial runs, described as follows.

1. ntu-fr-j-cje-d-01
The merging strategy is the same as run NTU-J-CJE-D-01, but the values of parameters c_1, c_2 and c_3 were set to 0, 0.6 and 0.4, respectively.
2. ntu-fr-j-cje-d-02
The merging strategy is the same as run NTU-J-CJE-D-02, but the values of parameters c_1, c_2 and c_3 were set to 0, 0.6 and 0.4, respectively.
3. ntu-fr-j-cje-d-04
The merging strategy is the same as run ntu-fr-j-cje-d-01 except that the weight W_i was determined by formula (4).
4. ntu-fr-j-cje-d-raw-score
We used raw-score merging to merge the result lists.
5. ntu-fr-j-cje-d-normalized-score
The result lists were merged by the normalized-score merging strategy. The maximum similarity score was used for normalization.
6. ntu-fr-j-cje-d-normalized-top10
In this run, we used the modified normalized-score merging method. We did not consider the performance drop caused by query translation. That is, the weight W_i in formula (1) was 1 for every sub-run.
7. ntu-fr-j-cje-d-round-robin
We used round-robin merging to merge the result lists.
8.
ntu-fr-j-cje-d-centralized
This run adopted the centralized architecture. All document collections were indexed in one index file. The topics contained the original Japanese query terms, the translated English query terms and the translated Chinese query terms.

The results of the unofficial runs are shown in Table 3. We used the rigid relevance set to evaluate the unofficial runs. In the official evaluation, the Japanese documents without text were removed from the ranked list. We did not remove those Japanese documents when evaluating our unofficial runs. Therefore, the results of the unofficial runs cannot be compared with the official runs in Table 1. We re-evaluated the official runs without removing the Japanese documents without text. The new results of runs NTU-J-CJE-D-01, NTU-J-CJE-D-02 and NTU-J-CJE-D-03 are shown in the last three rows of Table 3. Table 3 shows that the performances of ntu-fr-j-cje-d-01, ntu-fr-j-cje-d-02, and ntu-fr-j-cje-d-04 are similar to those of the official runs even though the values of parameters c_1, c_2 and c_3 are changed. The performance of raw-score merging is good. This is probably because we use the same IR model and term weighting scheme for all text collections. Comparatively, the performances of normalized-score, round-robin and normalized-by-top-k merging are poor, especially the round-robin merging strategy. Normalized-by-top-k is better than normalized-score merging. When the translation penalty is considered, the performance of normalized-by-top-k increases and is better than the other merging strategies. This shows that the translation penalty is helpful. The best run is ntu-fr-j-cje-d-centralized, which indexes all documents in different languages together. In this run, most of the top ranked documents are in Japanese or in English for most topics. Table 2 shows that the performances of Japanese monolingual
retrieval and Japanese-English cross-lingual retrieval are much better than that of Japanese-Chinese cross-lingual retrieval. Therefore, the final result list should not contain too many Chinese documents. The over-weighting phenomenon in the centralized architecture increases the scores of Japanese and English documents, so that more Japanese and English documents are included in the result list of run ntu-fr-j-cje-d-centralized. This makes the performance better. In the initial experiments, we used English as a pivot language to derive the Chinese translation equivalents of Japanese query terms. Experimental results show that the performance of the Japanese-Chinese cross-lingual run is very bad.

Table 3. The results of unofficial runs

Run                              Average Precision  Recall
ntu-fr-j-cje-d-01                0.0833             1194 / 4053
ntu-fr-j-cje-d-02                0.0872             1152 / 4053
ntu-fr-j-cje-d-04                0.0868             1124 / 4053
ntu-fr-j-cje-d-raw-score         0.0867             1310 / 4053
ntu-fr-j-cje-d-normalized-score  0.0492             1245 / 4053
ntu-fr-j-cje-d-normalized-top10  0.0514             1257 / 4053
ntu-fr-j-cje-d-round-robin       0.0447             1233 / 4053
ntu-fr-j-cje-d-centralized       0.0973             1149 / 4053
NTU-J-CJE-D-01                   0.0842             1211 / 4053
NTU-J-CJE-D-02                   0.0863             1172 / 4053
NTU-J-CJE-D-03                   0.0891             1194 / 4053

Table 4. The results of unofficial runs using the new Japanese-Chinese run

Run                                Average Precision  Recall
ntu-fr-j-c-d-2                     0.0289             340 / 1938
ntu-fr-j-cje-d-01-2                0.0841             1242 / 4053
ntu-fr-j-cje-d-02-2                0.0869             1233 / 4053
ntu-fr-j-cje-d-04-2                0.0863             1229 / 4053
ntu-fr-j-cje-d-raw-score-2         0.0850             1315 / 4053
ntu-fr-j-cje-d-normalized-score-2  0.0685             1273 / 4053
ntu-fr-j-cje-d-normalized-top10-2  0.0635             1277 / 4053
ntu-fr-j-cje-d-round-robin-2       0.0516             1225 / 4053
ntu-fr-j-cje-d-centralized-2       0.0990             1177 / 4053

Table 5. Normalized-by-top-k with translation penalty (c_1 = 0, c_2 = 0.4, c_3 = 0.6)

Run                  Average Precision  Recall
ntu-fr-j-cje-d-01-3  0.0877             1205 / 4053
ntu-fr-j-cje-d-02-3  0.0883             1203 / 4053
ntu-fr-j-cje-d-04-3  0.0880             1196 / 4053
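The over-weighting effect behind the centralized run's preference for Japanese and English documents, described in Section 4, is easy to check numerically with the collection sizes given in this paper. The idf variant below, log(N/df), and the document frequency value are our own assumptions for illustration; the paper states only that tf*idf weighting was used.

```python
import math

# Merging the three collections raises N from 22,927 (English only)
# to 641,272, while a term's document frequency df stays the same,
# so English index terms gain weight in the centralized index.
# log(N/df) is a common idf variant; df = 100 is a hypothetical value.

def idf(N, df):
    return math.log(N / df)

print(round(641272 / 22927, 2))            # English N grows about 27.97 times
print(round(641272 / 381681, 2))           # Chinese N grows only about 1.68 times
print(idf(641272, 100) > idf(22927, 100))  # True: idf is inflated after merging
```

Because the inflation is much larger for the small English collection than for the large Chinese one, the centralized index systematically boosts English (and Japanese) terms, which matches the observed behavior of run ntu-fr-j-cje-d-centralized.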
In further tests, we translated the Japanese queries into Chinese directly by using the BitEx online Japanese-Chinese dictionary (http://www.bitex-cn.com/). Table 4 lists the performances of the new Japanese-Chinese cross-lingual run and the multilingual runs. The performance of the new Japanese-Chinese cross-lingual run is improved only a little, i.e., from 0.0146 to 0.0289. The major reason is that many query terms have no translation. After analysis, there are 351 distinct query terms in the description fields of the Japanese queries. Among these, 232 terms have no translation. The remaining query terms have only one translation. Furthermore, our system did not return any documents in the new Japanese-Chinese cross-lingual run for topics 16, 19 and 23. This is because all the translated Chinese terms of these three topics are unigrams and our system uses bigrams as index terms. In such a case, no relevant document is proposed. The performance of the centralized architecture is still the best. When the translation penalty is considered, normalized-by-top-k (i.e., runs ntu-fr-j-cje-d-02-2 and ntu-fr-j-cje-d-04-2) is better than the other merging strategies. Compared to the old translation scheme, the performances of all merging strategies
except raw-score and normalized-by-top-k with translation penalty are increased when the Japanese queries are translated into Chinese by using the BitEx online dictionary. Since the Japanese query terms have only one Chinese translation, emphasizing the degree-of-translation-ambiguity part in formulas (2)-(4) increases the merging weight of the Japanese-Chinese cross-lingual run. That decreases the performance of normalized-by-top-k with translation penalty. Thus, we adjusted the values of parameters c_1, c_2 and c_3 to 0, 0.4 and 0.6, respectively. Table 5 shows that the performances are improved. In summary, the translation penalty is an important factor in merging, and we should also consider the quality of the dictionaries (e.g., their coverage).

6. Concluding Remarks

This paper considers the two architectures in MLIR. The centralized approach performed well in all the experiments. However, the centralized architecture is not suitable in practice, especially for very huge corpora. A centralized architecture needs to spend more time to index and to retrieve documents in all languages. A distributed architecture is more flexible. It is easy to add or delete corpora in different languages and to employ different retrieval systems in a distributed architecture. The merging problem is critical in a distributed architecture. This paper proposed several merging strategies to integrate the result lists of collections in different languages. The experimental results showed that the performance of normalized-by-top-k with translation penalty was better than raw-score merging, normalized-score merging and round-robin merging.

References

[1] Chen, H.H., Bian, G.W., and Lin, W.C., 1999. Resolving translation ambiguity and target polysemy in cross-language information retrieval. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, Maryland, June 1999. Association for Computational Linguistics, 215-222.
[2] Church, K., Gale, W., Hanks, P., and Hindle, D., 1989. Parsing, Word Associations and Typical Predicate-Argument Relations. In Proceedings of the International Workshop on Parsing Technologies, Pittsburgh, PA, August 1989. Carnegie Mellon University, Pittsburgh, PA, 389-398.
[3] Dumais, S.T., 1993. LSI meets TREC: A Status Report. In Proceedings of the First Text REtrieval Conference (TREC-1), Gaithersburg, Maryland, November 1992. NIST Publication, 137-152.
[4] Harman, D.K., 1997. TREC-6 Proceedings. Gaithersburg, Maryland. National Institute of Standards and Technology.
[5] Huang, C.R. and Chen, K.J., 1995. Academia Sinica Balanced Corpus. Technical Report 95-02/98-04. Academia Sinica, Taipei, Taiwan.
[6] Matsumoto, Y., Kitauchi, A., Yamashita, T., and Hirano, Y., 1999. Japanese Morphological Analysis System ChaSen version 2.0 Manual. Technical Report NAIST-IS-TR99009, Nara Institute of Science and Technology.
[7] Oard, D.W. and Dorr, B.J., 1996. A Survey of Multilingual Text Retrieval. Technical Report UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies.
[8] Voorhees, E.M., Gupta, N.K., and Johnson-Laird, B., 1995. The Collection Fusion Problem. In Proceedings of the Third Text REtrieval Conference (TREC-3), Gaithersburg, Maryland, November 1994. NIST Publication, 95-104.