Geographical Classification of Documents Using Evidence from Wikipedia

Rafael Odon de Alencar (odon.rafael@gmail.com)
Clodoveu Augusto Davis Jr. (clodoveu@dcc.ufmg.br)
Marcos André Gonçalves (mgoncalv@dcc.ufmg.br)
Universidade Federal de Minas Gerais, Brazil
GIR '10, 18-19 Feb. 2010, Zurich, Switzerland

Introduction / Motivation (common to most of GIR 2010)
- Geography-related terms are often used in Web search queries
- Many user activities on the Web are directly related to the user's location
- It is important to design applications that take this geographic intent into account

Introduction
- Recent work has suggested identifying the geographic context of documents
  - Association of Web pages to places
- Advances can enhance current information retrieval mechanisms
  - Allow people to perform local search
  - Enable geographically-focused advertising
  - Develop novel ranking strategies

Introduction
- Identification of the geographic context of a Web document:
  - Inferred from the location of its Web server (GeoIP)
  - Inferred from the location of its visitors and of adjacent pages in the Web graph
  - Determined by analyzing the document's textual content

Introduction
- In previous work, our group developed means to recognize direct and indirect evidence of location, using an extraction ontology
  - Addresses
  - Postal codes
  - Telephone numbers and area codes
  - Positioning expressions: <place of interest> <location expression> <landmark>, e.g. "Hotel CLOSE TO Convention Center"

Introduction
- However, not all pages include unambiguous and easily recognizable evidence
- This work looks at other types of textual evidence
  - Terms and expressions semantically related to a location
  - Not necessarily other place names

Our Proposal
- Use Wikipedia as a semantic network, composed of its entries (nodes) and links (arcs), to gather textual geographic evidence for places

Our Proposal
- This work intends to demonstrate, through classification experiments, that such evidence is valid
  - Classes: a subset of Brazilian states, treated as single labels
  - Dataset: a set of articles from the local news sections of newspapers
- We do not intend to propose a definitive geographic classification model

Geographic Evidence from Wikipedia
- Start with a set of places
- Find the Wikipedia entry for each place
- Collect the titles of its inlinks and outlinks
- Titles of entries are used as terms for IR
- Use weights to indicate how frequent (how important) a term is
- Organize this information as evidence for a classifier

Geographic Evidence from Wikipedia
- Consider a set of places and their adjacent entries (links) in Wikipedia

Geographic Evidence from Wikipedia
- Each place has a list of inlinks and outlinks
- Weights are used to indicate the discriminative value of each term

Geographic Evidence from Wikipedia
- The weight of a term t is based on its adjacency adj(t) to the considered set of m places
  - More exclusive terms have a weight close to 1.0
  - More popular terms have a weight close to 0.0
$$w_t = \left(1 - \frac{adj(t) - 1}{m}\right)^2$$
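The weighting idea can be sketched in a few lines of Python. This is only an illustration, assuming the weight has the form w_t = (1 - (adj(t) - 1)/m)^2, which matches the behaviour stated on the slide (exclusive terms near 1.0, popular terms near 0.0). The `links` dictionary and the function names are hypothetical, not taken from the original work.

```python
def term_weight(adj_t: int, m: int) -> float:
    """Weight of a term adjacent to adj_t of the m places (assumed formula)."""
    if not 1 <= adj_t <= m:
        raise ValueError("adj_t must be between 1 and m")
    return (1 - (adj_t - 1) / m) ** 2


def weights_from_links(links: dict) -> dict:
    """Count how many places each title is adjacent to, then weight it."""
    m = len(links)
    adjacency = {}
    for titles in links.values():
        for title in titles:
            adjacency[title] = adjacency.get(title, 0) + 1
    return {title: term_weight(count, m) for title, count in adjacency.items()}


# Hypothetical toy data: place -> titles of entries linked to/from its entry.
links = {
    "Minas Gerais": {"Belo Horizonte", "Ouro Preto", "Brazil"},
    "Bahia": {"Salvador", "Brazil"},
    "Rio de Janeiro": {"Samba", "Brazil"},
}
weights = weights_from_links(links)
```

In this toy set, "Ouro Preto" is adjacent to a single place and gets the maximum weight, while "Brazil" is adjacent to all three and gets a weight near zero.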

Geographic Evidence from Wikipedia
- Classification: we find occurrences of entry titles in documents
  - Document 1: "Our company has offices in Belo Horizonte and Ouro Preto"
  - Document 2: "This year's samba festival will also occur in other Brazilian southeast state capitals"

Geographic Evidence from Wikipedia
- Considering the occurrences found, we use weighted sums to describe the relationship of a document d_j to each place p_i from the set
$$S^{in}(p_i, d_j) = \sum_{l=1}^{in} w_{t_l} \cdot Frequency(t_l, d_j)$$
$$S^{out}(p_i, d_j) = \sum_{l=1}^{out} w_{t_l} \cdot Frequency(t_l, d_j)$$
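A minimal sketch of these weighted sums follows. It assumes that Frequency(t, d) is a case-insensitive, whole-phrase count of the entry title in the raw text; the original pipeline may match titles differently (for example after stemming). The link lists and weight values are hypothetical.

```python
import re


def frequency(title: str, text: str) -> int:
    """Count whole-phrase, case-insensitive occurrences of an entry title."""
    pattern = r"\b" + re.escape(title) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def weighted_sum(titles, weights, text) -> float:
    """Sum of weight * frequency over the inlink (or outlink) titles of a place."""
    return sum(weights[t] * frequency(t, text) for t in titles)


# Hypothetical evidence for one place, and one document from the slides.
inlinks = ["Ouro Preto", "Belo Horizonte"]
outlinks = ["Brazil", "Belo Horizonte"]
weights = {"Ouro Preto": 1.0, "Belo Horizonte": 1.0, "Brazil": 0.25}
doc = "Our company has offices in Belo Horizonte and Ouro Preto"

s_in = weighted_sum(inlinks, weights, doc)    # inlink-based score
s_out = weighted_sum(outlinks, weights, doc)  # outlink-based score
```

Computing one such pair of scores per place gives each document a small, fixed-size feature vector that a classifier can consume.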

Geographic Evidence from Wikipedia
- Some improvements were made in order to get a richer description
  - Use separate sums for every level of importance
  - This can help a classifier better capture the relationship between the text and the places

Document collection
- We classified documents associated with Brazilian states
- We considered a subset of 8 of the 27 Brazilian states
- We extracted 831 articles from 8 different local news sections
- We read the title of each article to make sure it was indeed related to the respective state

Document collection
- Only the article title and body were extracted (no structure was preserved)
- The text was pre-processed:
  - Stemming: words reduced to their stem (root form)
  - Stopword removal: ignore conjunctions, prepositions, punctuation and other uninformative words

Evaluation
- We chose the Multinomial Naïve Bayes classifier to perform our tests
  - Features represent the frequency of terms
  - Ignores the position of the terms in texts
  - Considers features to be independent (the naïve assumption)
    - In practice this simplifies the learning process
  - Fits a model based on the probability of a class generating an instance, given the training examples
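To make the classifier concrete, here is a small standalone sketch of multinomial Naïve Bayes with add-one (Laplace) smoothing. It is not the Weka implementation used in the experiments; the class name, the smoothing choice, and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.vocab = {term for doc in docs for term in doc}
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.n_docs = len(docs)
        return self

    def predict(self, doc):
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            # Log prior plus the sum of log likelihoods (independence assumption).
            score = math.log(n / self.n_docs)
            for term in doc:
                if term not in self.vocab:
                    continue  # terms never seen in training are ignored
                count = self.term_counts[label][term]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set labelled with state abbreviations.
train_docs = [["samba", "rio", "beach"], ["rio", "carnival"],
              ["cheese", "mining", "ouro"], ["mining", "belo", "cheese"]]
train_labels = ["RJ", "RJ", "MG", "MG"]
model = TinyMultinomialNB().fit(train_docs, train_labels)
prediction = model.predict(["rio", "samba"])
```

Because only term counts enter the score, word order has no effect on the prediction, which is the position-independence noted above.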

Evaluation
- N-fold cross validation was used for tests
  - The dataset is divided into N parts
  - Each part is used once as the test set, with training done on the other N-1 parts
  - Every instance is guaranteed to be used for both testing and training
  - Success rate is obtained from the whole dataset
- All tests were performed using Weka 3.6.1
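The N-fold procedure can be sketched as follows. This is a plain, unstratified split written for illustration and does not reproduce how Weka builds its folds; `k_fold_indices`, `cross_validate`, `train_fn`, and `predict_fn` are hypothetical names.

```python
import random


def k_fold_indices(n_items, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(items, labels, n_folds, train_fn, predict_fn):
    """Success rate over the whole dataset: every item is tested exactly once."""
    correct = 0
    for train, test in k_fold_indices(len(items), n_folds):
        model = train_fn([items[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_fn(model, items[i]) == labels[i] for i in test)
    return correct / len(items)


splits = list(k_fold_indices(10, 5))
```

Any classifier can be plugged in through `train_fn` and `predict_fn`, so the same loop would serve both the Wikipedia-based features and a baseline.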

Evaluation
- TF-IDF measurements of a bag-of-words representation of documents were used as a baseline for our evaluation
- Bag-of-words: reduces documents to lists of terms
- TF-IDF: gives us term frequencies, normalized by the document length and term popularity in the collection
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
$$idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
$$tfidf_{i,j} = tf_{i,j} \cdot idf_i$$
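As a worked illustration of the baseline, the sketch below computes TF-IDF using the standard definitions: term count divided by document length, times the log of the collection size over the number of documents containing the term. The function name and the toy documents are hypothetical, and the natural logarithm is an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf} dictionary per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        vectors.append({
            term: (n / length) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return vectors


docs = [["samba", "festival", "samba"], ["festival", "offices"]]
vectors = tf_idf(docs)
```

Here "festival" appears in every document, so its idf and therefore its TF-IDF value is zero, while "samba" keeps a positive weight in the first document.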

Evaluation Results
- Success rate for different training set sizes
[Figure: success rate versus training set size. Labels: Training (50%) used at 100%, 80%, 60%, 40% and 20%; Test (50%).]

Evaluation Results
- Success rate for different numbers of classes
[Figure: success rate versus number of classes, comparing TF-IDF and Wiki.]

Evaluation Results
- Effects of removing place names
- Our hypothesis: TF-IDF bag-of-words classification has a non-geographic bias
  - Other, geographically irrelevant terms are represented by the features
- We defined 100 place names to be removed from the documents in order to check the impact on precision
  - State names, abbreviations, important city names, and others
  - More than 35,000 removals

Evaluation Results
- Effects of removing place names
- We classified the dataset before and after the place name removal for comparison
  - 10-fold cross validation was used
- Impact on precision:
  - Wikipedia model: more than 30% loss
  - TF-IDF bag-of-words model: about 6% loss
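The place-name removal step could look roughly like the sketch below. It assumes case-insensitive, whole-word matching against a fixed list; the actual list of 100 names and the matching rules are not given in the slides, so the names used here are hypothetical.

```python
import re


def remove_place_names(text, place_names):
    """Delete listed place names from text; return (cleaned_text, n_removed)."""
    if not place_names:
        return text, 0
    # Longest names first, so multi-word names win over their own substrings.
    ordered = sorted(place_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )
    cleaned, n_removed = pattern.subn("", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split()), n_removed


cleaned, n_removed = remove_place_names(
    "Offices in Belo Horizonte and Ouro Preto, MG",
    ["Belo Horizonte", "Ouro Preto", "MG"],
)
```

Classifying the same collection before and after this step is what exposes how much each model depends on explicit place names.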

Conclusion
- Wikipedia model
  - Operates very well with less training
  - Adding more classes decreases its precision
  - Is sensitive to geographic evidence in the text
- TF-IDF bag-of-words model
  - Captures non-geographic detail from the training documents

Conclusion
- Future work includes:
  - Generate a much larger collection, from more sources, covering all 27 states
  - Mix types of places: states, cities, countries, etc.
  - Improve the matching of entry titles in documents by considering alternatives or synonyms for them
  - Consider multi-label classification

Current directions:
- Classification is not the ideal path, only an interesting experiment
- Much of the success lies in correctly identifying the subject of the text:
  - Keyword extraction
  - Topic indexing
- Reproduce the wikification mechanism of Milne & Witten 2008
- Build a document collection with hierarchical, multi-class aspects, with or without a defined context
