UB at GeoCLEF 2006

Miguel E. Ruiz (1), Stuart Shapiro (2), June Abbas (1), Silvia B. Southwick (1) and David Mark (3)
State University of New York at Buffalo
(1) Department of Library and Information Studies
(2) Department of Computer Science and Engineering
(3) Department of Geography
E-mail: meruz@buffalo.edu

Abstract

This paper summarizes the work done at the State University of New York at Buffalo (UB) in the GeoCLEF 2006 track. The approach presented uses pure IR techniques (indexing of single word terms as well as word bigrams, and automatic retrieval feedback) to try to improve performance of queries with geographical references. The main purpose of this work is to identify the strengths and shortcomings of this approach so that it serves as a basis for future development of a geographical reference extraction system. We submitted four runs to the monolingual English task, two automatic runs and two manual runs, using the title and description fields of the topics. Our official results are above the median system (auto=0.2344 MAP, manual=0.2445 MAP). We also present an unofficial run that uses title, description and narrative, which shows a 10% improvement in results with respect to our baseline runs. Our manual runs were prepared by creating a Boolean query based on the topic description and manually adding terms taken from geographical resources available on the web. Although the average performance of the manual run is comparable to the automatic runs, a query by query analysis shows significant differences among individual queries. In general, we got significant improvements (more than 10% average precision) in 8 of the 25 queries. However, we also noticed that 5 queries in the manual runs perform significantly below the automatic runs.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms
Experimentation

Keywords
Geographical Information Retrieval, Query Expansion

1 Introduction

For our participation in GeoCLEF 2006 we used pure information retrieval techniques to expand geographical terms present in the topics. We used a version of the SMART [3] system that has been updated to handle modern weighting schemes (BM25, pivoted length normalization, etc.) as well as multilingual support (ISO-Latin1 encoding and stemming for 13 European languages using Porter's stemmer). We decided to work only with English documents and resources since they were readily available.

Sections 2 and 3 present the details of collection preparation and query processing. Section 4 presents the retrieval model implemented with the SMART system. Section 5 shows results on the GeoCLEF 2005 data that was used for tuning parameters. Section 6 presents results using the official GeoCLEF 2006 topics as well as a brief analysis and discussion of the results. Section 7 presents our conclusions and future work.

2 Collection Preparation

Details about the GeoCLEF document collection are not discussed in this paper; the reader is referred to the GeoCLEF overview paper. Our document collection consists of 169,477 documents from the LA Times and The Glasgow Herald. Processing of English documents followed a standard IR approach, discarding stop words and using Porter's stemmer. Additionally, we added word bigrams that identify pairs of contiguous non-stop words to form a two-word phrase. These bigrams allowed a stop word to be part of the bigram if it was the word "of", since it was identified as a common component of geographical names (e.g., United_Kingdom and City_of_Liverpool would be valid bigrams). Documents were indexed using the vector space model (as implemented in the SMART system) with two ctypes. The first ctype was used to index words in the title and body of the article, while the second ctype represented the indexing of the word bigrams previously described.

3 Query Processing

To process the topics we followed the same approach described above (removing stop words, stemming, and adding word bigrams). Each query was represented using two ctypes. The first ctype holds the single word terms extracted from the parts of the topic that are used in the query (i.e., title and description); the second holds the word bigrams. For our official runs we only used the title and description.

We designed a way to identify geographical features and expand them using geographical resources, but due to the short time available for development we could not include it in our official runs. For this reason we submitted results using a pure IR approach this year and will work on the development of the geographical feature extraction for next year. Our results should be considered baseline results.

One of the authors created a manual version of the queries by consulting geographical resources available on the internet and writing a Boolean query. This manual run was included in the official results. We also explored automatic retrieval feedback for both automatic and manual queries.
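The term extraction described in Section 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is a tiny hypothetical one, stemming is omitted, and the function names (`tokenize`, `single_terms`, `bigram_terms`) are invented here rather than taken from SMART.

```python
import re

# Hypothetical, deliberately tiny stop list; the paper does not publish its list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def single_terms(tokens):
    """ctype 1: single words with stop words discarded (stemming omitted here)."""
    return [t for t in tokens if t not in STOP_WORDS]


def bigram_terms(tokens):
    """ctype 2: pairs of contiguous non-stop words, optionally bridged by 'of'."""
    phrases = []
    for i, tok in enumerate(tokens):
        if tok in STOP_WORDS:
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        after = tokens[i + 2] if i + 2 < len(tokens) else None
        if nxt is not None and nxt not in STOP_WORDS:
            # Two contiguous non-stop words, e.g. united_kingdom.
            phrases.append(f"{tok}_{nxt}")
        elif nxt == "of" and after is not None and after not in STOP_WORDS:
            # The one stop word allowed inside a phrase, e.g. city_of_liverpool.
            phrases.append(f"{tok}_of_{after}")
    return phrases


tokens = tokenize("The City of Liverpool in the United Kingdom")
print(single_terms(tokens))   # ['city', 'liverpool', 'united', 'kingdom']
print(bigram_terms(tokens))   # ['city_of_liverpool', 'united_kingdom']
```

Keeping the two term lists separate mirrors the two ctypes: each list would be indexed and weighted on its own before the scores are combined.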
4 Retrieval Model

We use a generalized vector space model that combines the representation of the two ctypes and weights the contribution of each part in the final similarity score between document $\vec{d}$ and query $\vec{q}$. The final score is computed as the linear combination of ctype 1 (words) and ctype 2 (bigrams) as follows:

\[
sim(\vec{d}, \vec{q}) = \lambda \cdot sim_{words}(\vec{d}, \vec{q}) + \mu \cdot sim_{bigrams}(\vec{d}, \vec{q})
\]

where $\lambda$ and $\mu$ are coefficients that control the contribution of each of the two ctypes. The values of these coefficients are computed empirically using the optimal results on the GeoCLEF 2005 topics. The similarity values are computed using the pivoted length normalization weighting scheme [4] (pivot=347.259, slope=0.2).

We also performed automatic retrieval feedback by retrieving 1000 documents using the original query and assuming that the top n documents are relevant and the bottom 100 documents are not relevant. This allows us to select the top m terms ranked according to Rocchio's relevance feedback formula [2]:

\[
w_{new}(t) = \alpha \, w_{orig}(t) + \beta \, \frac{1}{|Rel|} \sum_{d \in Rel} w(t, d) - \gamma \, \frac{1}{|\neg Rel|} \sum_{d \in \neg Rel} w(t, d)
\]

where $\alpha$, $\beta$, and $\gamma$ are coefficients that control the contribution of the original query, the relevant documents ($Rel$) and the non-relevant documents ($\neg Rel$) respectively. The optimal values for these parameters are also determined using the CLEF 2005 topics. Note that the automatic query expansion adds m terms to each of the two ctypes.
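To make the two formulas in this section concrete, here is a minimal Python sketch. It assumes the standard form of Rocchio's formula in which each sum is averaged over its document set, represents queries and documents as plain term-to-weight dictionaries, and uses invented names (`combined_score`, `rocchio_expand`) and invented toy weights. The values α=16, β=96, γ=8 are the ones reported for the feedback run in Table 1; the λ and μ values are hypothetical because the paper does not report them.

```python
from collections import defaultdict


def combined_score(sim_words, sim_bigrams, lam, mu):
    """Linear combination of the word and bigram similarities."""
    return lam * sim_words + mu * sim_bigrams


def rocchio_expand(query, rel_docs, nonrel_docs, alpha, beta, gamma, m):
    """Reweight the query terms and add the m best-scoring new terms.

    query and every document are dicts mapping term -> weight.
    """
    scores = defaultdict(float)
    for term, w in query.items():
        scores[term] += alpha * w
    for doc in rel_docs:
        for term, w in doc.items():
            scores[term] += beta * w / len(rel_docs)
    for doc in nonrel_docs:
        for term, w in doc.items():
            scores[term] -= gamma * w / len(nonrel_docs)
    # Candidate expansion terms: not already in the query, positive score.
    candidates = sorted(
        (t for t in scores if t not in query and scores[t] > 0),
        key=lambda t: scores[t],
        reverse=True,
    )
    kept = list(query) + candidates[:m]
    return {t: scores[t] for t in kept}


# Toy data (hypothetical weights): top-n documents are assumed relevant,
# bottom-ranked documents are assumed non-relevant.
query = {"flood": 1.0, "europe": 1.0}
assumed_relevant = [{"flood": 0.5, "germany": 0.4},
                    {"flood": 0.3, "germany": 0.2, "rain": 0.1}]
assumed_nonrelevant = [{"rain": 0.9, "sport": 0.9}]

expanded = rocchio_expand(query, assumed_relevant, assumed_nonrelevant,
                          alpha=16, beta=96, gamma=8, m=1)
print(expanded)
print(combined_score(0.5, 0.2, lam=1.0, mu=0.5))  # lam and mu are hypothetical
```

In this toy run the only term added is "germany", since "rain" and "sport" end up with negative scores. In the system described above the same expansion would be applied separately to each of the two ctypes.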

5 Preliminary Experiments Using CLEF 2005 Topics

We first tested our baseline system using the GeoCLEF 2005 topics. We used the title, description and geographic tags. Table 1 shows the performance values for the baseline run and for the best run submitted to GeoCLEF 2005 [1] (BKGeoE1). The mean average precision for this baseline run is 0.3592, which would have placed it among the top 3 systems in GeoCLEF 2005. This indicates that a pure IR system was enough to answer most of the topics proposed last year.

Table 1. Performance of our baseline system against the best run in GeoCLEF 2005

             UB Baseline   UB retrieval feedback          Best Run (BKGeoE1)
Parameters   -             n=5, m=50, α=16, β=96, γ=8     -
MAP          36.42%        37.93%                         39.36%
P@5          59.20%        58.40%                         57.60%
P@10         46.80%        50.40%                         48.00%
P@20         35.60%        37.20%                         41.00%
P@100        18.16%        18.80%                         21.48%

[Figure 1. Recall-precision graph of our baseline and retrieval feedback systems against the best run in CLEF 2005]

A query by query analysis reveals that the IR approach performs well in many topics but there are a few that could be improved (see Table 2). The system did not perform well in topics 2, 7, 8, 11 and 23. After analyzing these topics we conclude that most of them could have performed better if we had used some sort of expansion of continents using the countries located in them (e.g., European countries).
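The measures reported in this section (P@k, average precision and MAP) follow their usual definitions, which can be sketched as below. This is a generic illustration, not the evaluation software used for the official results; the document identifiers and function names are invented.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Sum of the precision at each rank holding a relevant document,
    divided by the total number of relevant documents for the topic."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical topic: three relevant documents, two of them retrieved.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print(precision_at_k(ranked, relevant, 5))   # 2 relevant in the top 5
print(average_precision(ranked, relevant))   # (1/2 + 2/4) / 3
```

Because average precision divides by all relevant documents, a relevant document that is never retrieved (here "d5") lowers the score, which is why topics with low recall also show low average precision.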

Table 2. Query by query evaluation of baseline run

Qid   #Relev  #RelRet  Avg-P  Exact-P  P@5    P@10   P@20
1     14      14       60%    57%      80%    60%    40%
2     11      8        10%    18%      20%    20%    15%
3     10      8        44%    40%      80%    40%    20%
4     43      39       35%    37%      80%    70%    55%
5     27      25       52%    56%      100%   80%    55%
6     13      11       28%    38%      40%    30%    30%
7     85      57       6%     9%       0%     10%    5%
8     10      10       4%     0%       0%     0%     0%
9     19      17       46%    42%      100%   80%    40%
10    12      12       82%    75%      100%   90%    50%
11    21      13       7%     5%       20%    10%    5%
12    76      57       14%    17%      60%    50%    35%
13    7       7        52%    43%      60%    40%    25%
14    43      41       39%    49%      40%    70%    50%
15    110     110      74%    72%      60%    70%    75%
16    15      15       88%    80%      100%   100%   65%
17    129     129      50%    49%      80%    90%    60%
18    48      41       29%    38%      80%    50%    50%
19    100     79       16%    24%      0%     30%    30%
20    9       8        11%    22%      40%    20%    10%
21    29      27       44%    38%      100%   80%    55%
22    46      42       48%    52%      80%    80%    75%
23    43      19       3%     5%       20%    10%    5%
24    105     104      50%    56%      60%    50%    65%
25    3       3        59%    67%      60%    30%    15%
All   1028    896      38%    40%      58%    50%    37%

6 Results Using GeoCLEF 2006 Topics

We submitted four official runs: two using automatic query processing and two using manual methods. As expected, our results (both automatic and manual) were above the median system. Results are presented in Table 3. The automatic runs perform only slightly above the median system, which indicates that this year's topics were harder to solve using only IR techniques. After examining the official topics we realized that we could have used a better expansion method based on geographical resources (e.g., identifying queries that have specific latitude and longitude references in order to restrict the set of retrieved results).

On the other hand, the manual queries perform on average similarly to the automatic runs, but a query by query analysis reveals that quite a few queries significantly outperform the automatic runs. However, at the same time there are two queries that perform significantly below the automatic systems. This merits further analysis to identify those strategies that are successful in improving performance. Note that the first manual run (UBGManual1) does not use automatic feedback, while the second manual run (UBGManual2) uses automatic retrieval feedback.

Table 3. Performance of GeoCLEF 2006 topics

Run Label                          Mean Avg. P   Parameters
Official Runs
UBGTDrf1 (automatic feedback)      0.2344        n=10, m=20
UBGTDrf2 (automatic feedback)      0.2330        n=5, m=50
UBGManual1 (manual run only)       0.2307        -
UBGManual2 (automatic feedback)    0.2446        n=10, m=20
Unofficial Runs
UBGTDNrf1                          0.2758        n=5, m=5

We also noted that our best run (not submitted) performs quite well with respect to our baseline official runs. This run uses title, description and narrative, and conservative retrieval feedback parameters (n=5 documents and m=5 terms). It is also encouraging that this run, when compared to the manual run, captures several of the good terms that were added manually.

[Figure 2. Comparison of best manual run and best automatic run using our system (series: besttdn, Manual2; x-axis: topics 1 to 25)]

7 Conclusion

This paper presents an IR-based approach to geographical information retrieval. Although this is our baseline system, the results are competitive, especially if we use the long topics (title, description and narrative). We still need to do a more in-depth analysis of the reasons why some manual queries improved significantly with respect to the median system, and of the problem presented by the 5 queries that performed significantly below the median. We plan to explore ways to generate automatic geographic references and ontology-based expansion for next year.

References

1. Gey, F., Larson, R., Sanderson, M., Joho, H. and Clough, P. GeoCLEF: the CLEF 2005 Cross-Language Geographic Information Retrieval Track. Working Notes for the CLEF 2005 Workshop, 21-23 September, Vienna, Austria, 2005.
2. Rocchio, J.J. Relevance feedback in information retrieval. In Salton, G. (ed.), The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ, 1971, 313-323.
3. Salton, G. (ed.), The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ, 1971.
4. Singhal, A., Buckley, C. and Mitra, M. Pivoted Document Length Normalization. In 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 1996, pages 21-29.