
UB at GeoCLEF 2006

Miguel E. Ruiz (1), Stuart Shapiro (2), June Abbas (1), Silvia B. Southwick (1) and David Mark (3)
State University of New York at Buffalo
(1) Department of Library and Information Studies
(2) Department of Computer Science and Engineering
(3) Department of Geography
E-mail: meruiz@buffalo.edu

Abstract

This paper summarizes the work done at the State University of New York at Buffalo (UB) in the GeoCLEF 2006 track. The approach presented uses pure IR techniques (indexing of single word terms as well as word bigrams, and automatic retrieval feedback) to try to improve performance of queries with geographical references. The main purpose of this work is to identify the strengths and shortcomings of this approach so that it serves as a basis for future development of a geographical reference extraction system. We submitted four runs to the monolingual English task, two automatic runs and two manual runs, using the title and description fields of the topics. Our official results are above the median system (auto=0.2344 MAP, manual=0.2445 MAP). We also present an unofficial run that uses title, description and narrative, which shows a 10% improvement in results with respect to our baseline runs.

Our manual runs were prepared by creating a Boolean query based on the topic description and manually adding terms drawn from geographical resources available on the web. Although the average performance of the manual run is comparable to the automatic runs, a query by query analysis shows significant differences among individual queries. In general, we got significant improvements (more than 10% average precision) in 8 of the 25 queries. However, we also noticed that 5 queries in the manual runs perform significantly below the automatic runs.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms
Experimentation

Keywords
Geographical Information Retrieval, Query Expansion

1 Introduction

For our participation in GeoCLEF 2006 we used pure information retrieval techniques to expand geographical terms present in the topics. We used a version of the SMART [3] system that has been updated to handle modern weighting schemes (BM25, pivoted length normalization, etc.) as well as multilingual support (ISO-Latin1 encoding and stemming for 13 European languages using Porter's stemmer). We decided to work only with English documents and resources since they were readily available.

Sections 2 and 3 present the details on collection preparation and query processing. Section 4 presents the retrieval model implemented with the SMART system. Section 5 shows results on the GeoCLEF 2005 data that was used for tuning parameters. Section 6 presents results using the official GeoCLEF 2006 topics as well as a brief analysis and discussion of the results. Section 7 presents our conclusions and future work.

2 Collection Preparation

Details about the GeoCLEF document collection are not discussed in this paper; the reader is referred to the GeoCLEF overview paper. Our document collection consists of 169,477 documents from the LA Times and The Glasgow Herald. Processing of English documents followed a standard IR approach, discarding stop words and using Porter's stemmer. Additionally, we added word bigrams that identify pairs of contiguous non-stop words to form a two-word phrase. A bigram was allowed to contain a stop word only if that stop word was "of", since it was identified as a common component of geographical names (e.g., United_Kingdom and City_of_Liverpool would be valid bigrams). Documents were indexed using the vector space model (as implemented in the SMART system) with two ctypes. The first ctype was used to index words in the title and body of the article, while the second ctype represented the indexing of the word bigrams previously described.

3 Query Processing

To process the topics we followed the same approach described above (removing stop words, stemming, and adding word bigrams). Each query was represented using two ctypes. The first ctype holds the single word terms extracted from the parts of the topic that are used in the query (i.e., title and description); the second ctype holds the word bigrams built from the same parts. For our official runs we only use the title and description.

We designed a way to identify geographical features and expand them using geographical resources, but due to the short time available for development we could not include it in our official runs. For this reason we submitted results using a pure IR approach this year and will work on the development of the geographical feature extraction for next year. Our results should be considered baseline results.

One of the authors created a manual version of the queries by consulting geographical resources available on the internet and writing a Boolean query. This manual run was included in the official results. We also explored automatic retrieval feedback for both automatic and manual queries.
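The bigram rule from Section 2 can be illustrated with a short Python sketch. This is not the SMART indexing code: the stop list, the tokenizer, and the function names `tokenize` and `word_bigrams` are assumptions made for illustration, and stemming is omitted for brevity.

```python
import re

# Illustrative stop list only (an assumption); a real stop list is much larger.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Split text into alphanumeric tokens (a simplification of real tokenization)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def word_bigrams(text):
    """Return bigrams of contiguous non-stop words.

    The only stop word allowed inside a bigram is "of", which yields
    phrases such as City_of_Liverpool.
    """
    tokens = tokenize(text)
    bigrams = []
    for i in range(len(tokens) - 1):
        first = tokens[i]
        if first.lower() in STOP_WORDS:
            continue  # a bigram must start with a non-stop word
        second = tokens[i + 1]
        if second.lower() not in STOP_WORDS:
            bigrams.append(f"{first}_{second}")
        elif (second.lower() == "of" and i + 2 < len(tokens)
              and tokens[i + 2].lower() not in STOP_WORDS):
            bigrams.append(f"{first}_of_{tokens[i + 2]}")
    return bigrams
```

For example, calling `word_bigrams("The City of Liverpool is in the United Kingdom")` on this sketch produces the two phrases mentioned in Section 2, City_of_Liverpool and United_Kingdom.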
4 Retrieval Model

We use a generalized vector space model that combines the representation of the two ctypes and weights the contribution of each part in the final similarity score between document \vec{d} and query \vec{q}. The final score is computed as the linear combination of ctype1 (words) and ctype2 (bigrams) as follows:

\[
sim(\vec{d}, \vec{q}) = \lambda \cdot sim_{words}(\vec{d}, \vec{q}) + \mu \cdot sim_{bigrams}(\vec{d}, \vec{q})
\]

where λ and μ are coefficients that control the contribution of each of the two ctypes. The values of these coefficients are computed empirically using the optimal results on the GeoCLEF 2005 topics. The similarity values are computed using the pivoted length normalization weighting scheme [4] (pivot = 347.259, slope = 0.2).

We also performed automatic retrieval feedback by retrieving 1000 documents using the original query and assuming that the top n documents are relevant and the bottom 100 documents are not relevant. This allows us to select the top m terms ranked according to Rocchio's relevance feedback formula [2]:

\[
w_{new}(t) = \alpha \cdot w_{orig}(t) + \beta \cdot \frac{\sum_{d \in Rel} w(t, d)}{|Rel|} - \gamma \cdot \frac{\sum_{d \in \neg Rel} w(t, d)}{|\neg Rel|}
\]

where α, β, and γ are coefficients that control the contribution of the original query, the relevant documents (Rel) and the non-relevant documents (¬Rel) respectively. The optimal values for these parameters are also determined using the CLEF 2005 topics. Note that the automatic query expansion adds m terms to each of the two ctypes.
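The two formulas in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SMART implementation: queries and documents are represented as plain term-to-weight dictionaries, the function names are hypothetical, and the choice to keep every original query term and add only positively scored new terms is an assumption of the sketch.

```python
from collections import defaultdict


def combined_similarity(sim_words, sim_bigrams, lam, mu):
    """Linear combination of the word (ctype1) and bigram (ctype2) similarities."""
    return lam * sim_words + mu * sim_bigrams


def rocchio_expand(query, ranked_docs, n, m, alpha, beta, gamma, n_nonrel=100):
    """Expand a query with the m best new terms under Rocchio's formula.

    query       -- dict mapping term -> weight
    ranked_docs -- list of dicts (term -> weight), best-ranked document first
    n           -- number of top documents assumed relevant
    n_nonrel    -- number of bottom documents assumed non-relevant
    """
    rel = ranked_docs[:n]
    # Take the bottom n_nonrel documents without overlapping the top n.
    nonrel = ranked_docs[max(n, len(ranked_docs) - n_nonrel):]

    scores = defaultdict(float)
    for term, weight in query.items():
        scores[term] += alpha * weight
    for doc in rel:
        for term, weight in doc.items():
            scores[term] += beta * weight / len(rel)
    for doc in nonrel:
        for term, weight in doc.items():
            scores[term] -= gamma * weight / len(nonrel)

    # Rank candidate expansion terms; ties are broken alphabetically.
    candidates = [t for t in scores if t not in query and scores[t] > 0]
    candidates.sort(key=lambda t: (-scores[t], t))

    expanded = {t: scores[t] for t in query}
    expanded.update({t: scores[t] for t in candidates[:m]})
    return expanded
```

Because the expansion adds m terms to each ctype, a function like `rocchio_expand` would be applied once to the word vector and once to the bigram vector of a query.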

5 Preliminary Experiments Using CLEF 2005 Topics

We first tested our baseline system using the GeoCLEF 2005 topics. We used the title, description and geographic tags. Table 1 shows the performance values for the baseline run and for the best run submitted to GeoCLEF 2005 [1] (BKGeoE1). The mean average precision for this baseline run is 0.3592, which would have placed it among the top three systems in GeoCLEF 2005. This indicates that a pure IR system was enough to answer most of the topics proposed last year.

Table 1. Performance of our baseline system against the best run in GeoCLEF 2005

Measure      UB Baseline   UB retrieval feedback         Best Run (BKGeoE1)
Parameters                 n=5, m=50, α=16, β=96, γ=8
MAP          36.42%        37.93%                        39.36%
P@5          59.20%        58.40%                        57.60%
P@10         46.80%        50.40%                        48.00%
P@20         35.60%        37.20%                        41.00%
P@100        18.16%        18.80%                        21.48%

[Figure 1. Recall-precision graph of our baseline and retrieval feedback systems against the best run in CLEF 2005. Series: Baseline, Best-run (BERKE1), Ret feedback.]

A query by query analysis reveals that the IR approach performs well on many topics but there are a few that could be improved (see Table 2). The system did not perform well on topics 2, 7, 8, 11 and 23. After analyzing these topics we conclude that most of them could have performed better if we had used some sort of expansion of continents using the countries located in them (e.g., European countries).
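For readers unfamiliar with the measures reported in Tables 1 and 2, the following Python sketch shows how precision at k (P@k) and non-interpolated average precision are commonly computed. It is an illustration only and not the evaluation tool used for the official results; the function names and the data layout (a ranked list of document ids plus a set of relevant ids) are assumptions.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Non-interpolated average precision for a single query.

    Precision is taken at the rank of each relevant document retrieved
    and averaged over all relevant documents (retrieved or not).
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision over all judged queries.

    runs  -- dict mapping query id -> ranked list of document ids
    qrels -- dict mapping query id -> set of relevant document ids
    """
    return sum(average_precision(runs.get(q, []), qrels[q]) for q in qrels) / len(qrels)
```

Under this definition, a query whose relevant documents are mostly retrieved but ranked low (for instance topic 8 in Table 2, with 10 of 10 relevant documents retrieved and 4% average precision) still scores poorly.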

Table 2. Query by query evaluation of baseline run

Qid   #Relev   #RelRet   Avg-P   Exact-P   P@5    P@10   P@20
1     14       14        60%     57%       80%    60%    40%
2     11       8         10%     18%       20%    20%    15%
3     10       8         44%     40%       80%    40%    20%
4     43       39        35%     37%       80%    70%    55%
5     27       25        52%     56%       100%   80%    55%
6     13       11        28%     38%       40%    30%    30%
7     85       57        6%      9%        0%     10%    5%
8     10       10        4%      0%        0%     0%     0%
9     19       17        46%     42%       100%   80%    40%
10    12       12        82%     75%       100%   90%    50%
11    21       13        7%      5%        20%    10%    5%
12    76       57        14%     17%       60%    50%    35%
13    7        7         52%     43%       60%    40%    25%
14    43       41        39%     49%       40%    70%    50%
15    110      110       74%     72%       60%    70%    75%
16    15       15        88%     80%       100%   100%   65%
17    129      129       50%     49%       80%    90%    60%
18    48       41        29%     38%       80%    50%    50%
19    100      79        16%     24%       0%     30%    30%
20    9        8         11%     22%       40%    20%    10%
21    29       27        44%     38%       100%   80%    55%
22    46       42        48%     52%       80%    80%    75%
23    43       19        3%      5%        20%    10%    5%
24    105      104       50%     56%       60%    50%    65%
25    3        3         59%     67%       60%    30%    15%
All   1028     896       38%     40%       58%    50%    37%

6 Results Using GeoCLEF 2006 Topics

We submitted four official runs: two using automatic query processing and two using manual methods. As expected, our results (both automatic and manual) performed above the median system. Results are presented in Table 3. The automatic runs perform slightly above the median system, which indicates that the set of topics for this year was harder to solve using only IR techniques. After reviewing the official topics we realized that we could have used a better expansion method based on the geographical resources (e.g., identifying queries that have specific latitude and longitude references to restrict the set of retrieved results).

On the other hand, the manual queries perform on average similarly to the automatic runs, but a query by query analysis reveals that there are quite a few queries that significantly outperform the automatic runs. At the same time, there are two queries that perform significantly below the automatic systems. Note that the first manual run (UBGManual1) does not use automatic feedback while the second manual run (UBGManual2) uses automatic retrieval feedback.
This merits further analysis to identify those strategies that are successful in improving performance.

Table 3. Performance on GeoCLEF 2006 topics

Run Label                         Mean Avg. P   Parameters
Official Runs
UBGTDrf1 (automatic feedback)     0.2344        n=10, m=20
UBGTDrf2 (automatic feedback)     0.2330        n=5, m=50
UBGManual1 (manual run only)      0.2307
UBGManual2 (automatic feedback)   0.2446        n=10, m=20
Unofficial Runs
UBGTDNrf1                         0.2758        n=5, m=5

We also noted that our best run (not submitted) performs quite well with respect to our baseline official runs. This run uses title, description and narrative, and conservative retrieval feedback parameters (n=5 documents and m=5 terms). It is also encouraging that this run, when compared to the manual run, captures several of the good terms that were added manually.

[Figure 2. Comparison of best manual run and best automatic run using our system. Per-query values for the 25 topics on a scale from -1 to 1; series: besttdn, Manual2.]
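A query by query comparison of two runs, such as the one summarized in Figure 2, can be sketched in Python as follows. The function name, the data layout, and the reading of "more than 10% average precision" as an absolute difference of 0.10 are assumptions made for this illustration; the query ids and values in the usage note are hypothetical, not results from our runs.

```python
def compare_runs(ap_a, ap_b, threshold=0.10):
    """Split queries by how run A's average precision differs from run B's.

    ap_a, ap_b -- dicts mapping query id -> average precision
    threshold  -- absolute difference treated as significant (an assumption)

    Returns (better, worse): sorted query ids where run A is above or
    below run B by more than the threshold.
    """
    better, worse = [], []
    for qid in sorted(ap_a.keys() & ap_b.keys()):
        diff = ap_a[qid] - ap_b[qid]
        if diff > threshold:
            better.append(qid)
        elif diff < -threshold:
            worse.append(qid)
    return better, worse
```

With hypothetical inputs `{1: 0.50, 2: 0.10, 3: 0.30}` for a manual run and `{1: 0.20, 2: 0.35, 3: 0.28}` for an automatic run, query 1 would be listed as better, query 2 as worse, and query 3 as neither.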

7 Conclusion

This paper presents an IR-based approach to Geographical Information Retrieval. Although this is our baseline system, the results are competitive, especially if we use the long topics (title, description and narrative). We still need to do a more in-depth analysis of the reasons why some manual queries improved significantly with respect to the median system, and of the problem presented by the 5 queries that performed significantly below the median. We plan to explore ways to generate automatic geographic references and ontology-based expansion for next year.

References

1. Gey, F., Larson, R., Sanderson, M., Joho, H. and Clough, P. GeoCLEF: the CLEF 2005 Cross-Language Geographic Information Retrieval Track. Working Notes for the CLEF 2005 Workshop, 21-23 September, Vienna, Austria, 2005.
2. Rocchio, J.J. Relevance feedback in information retrieval. In Salton, G. ed. The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall, Englewood Cliffs, NJ, 1971, 313-323.
3. Salton, G. (ed.). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ, 1971.
4. Singhal, A., Buckley, C. and Mitra, M. Pivoted Document Length Normalization. In 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (1996), ACM Press, 21-29.