Handling Place References in Text


Arthur Manning
6 years ago

1 Handling Place References in Text

2 Introduction
- Most (geographic) information is available in the form of textual documents.
- Place reference resolution involves two subtasks:
  - Recognition: delimiting occurrences of place references in text
  - Disambiguation: resolving place references to geo-coordinates
- It is an essential task in Geographic Information Retrieval:
  - Supports access through geography to textual documents
- Existing methods mostly rely on hand-tuned heuristics:
  - Labor-intensive to develop, optimize and maintain
- Current research focuses on data-driven methods.
- The main problems are related to natural language ambiguity.

3 Ambiguity
- Geographic/non-geographic ambiguity refers to place names that also have other, non-geographic meanings:
  - E.g., Reading in England, Buffalo in the US
  - Should be addressed while recognizing place references
- Geographic/geographic ambiguity arises when multiple distinct places share the same name:
  - Many major European cities have a namesake in the New World
  - Should be addressed while disambiguating place references

4 Place reference recognition
- Approaches based on dictionaries
  - Sliding window approaches
  - Aho-Corasick algorithm (finite state automaton)
  - Good recall, good performance, often poor precision
- Approaches based on rules
  - Regular expression patterns (finite state automaton)
  - Grammatical rules
  - Large human effort involved in creating the rules
- Approaches based on machine learning
  - Hidden Markov Models
  - Conditional Random Fields
  - Good generalization behavior, but require large amounts of training data
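
As an illustration of the dictionary-based sliding window approach, here is a minimal Python sketch. The function name `sliding_window_matches`, the `max_len` parameter and the toy gazetteer are assumptions made for this example, not part of the original slides. It also hints at why precision is often poor: any token window that happens to equal a gazetteer entry is reported, regardless of its meaning in context.

```python
def sliding_window_matches(tokens, gazetteer, max_len=3):
    """Return (start, end, name) for token windows found in the gazetteer.

    `end` is exclusive. At each start position the longest matching window
    wins, so multi-word names are preferred over their single-word parts.
    """
    matches = []
    for start in range(len(tokens)):
        longest = min(max_len, len(tokens) - start)
        for length in range(longest, 0, -1):
            candidate = " ".join(tokens[start:start + length])
            if candidate in gazetteer:
                matches.append((start, start + length, candidate))
                break
    return matches


# Hypothetical toy gazetteer; a real one would hold millions of names.
gazetteer = {"Campo Grande", "Reading", "Buffalo", "Lisbon"}
tokens = "Yesterday Bruno Martins went to Campo Grande".split()
matches = sliding_window_matches(tokens, gazetteer)
print(matches)
```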

5 Aho-Corasick String Matching
- Locates all occurrences of any of a finite number of keywords (e.g., location names) in a string of text.
- Consists of two steps:
  1. Constructing a finite state pattern matching machine from the keywords
  2. Using the pattern matching machine to process the text string in a single pass

6 Pattern Matching Machine
- Let $P = \{y_1, y_2, \ldots, y_k\}$ be a finite set of string patterns, which we shall call keywords.
- Let $x$ be an arbitrary string, which we shall call the text string (i.e., the document).
- The behavior of the pattern matching machine is dictated by three functions:
  - a goto function $g$
  - a failure function $f$
  - an output function $output$

7 The Pattern Matching Machine
- Goto function $g$: maps a pair consisting of a state and an input symbol into a state or fail.
- Failure function $f$: maps a state into a state, and is consulted whenever the goto function reports fail.
  - Allows fast transitions from a failed pattern match to another branch of the tree whose prefix is a suffix of the text matched so far. For example, a search for "cat" in a tree that contains "cart" but not "cat" fails at the node for the prefix "ca"; a branch for "attribute", which begins with the "at" just read, might be the best lateral transition.
- Output function: associates a (possibly empty) set of keyword patterns with every state.

8 Aho-Corasick Algorithm
[Figure: pattern tree (state machine) for the pattern set {he, she, his, hers}. Goto function: black arrows. Failure function: blue arrows. Output function: red dots.]

9 Aho-Corasick Search Algorithm
- l: the starting position in the text string T of the current match
- c: the position of the current character of T to be compared with a character on the tree K
- w: the current node on the tree K
- Input: pattern set P and text T
- Output: all occurrences in T of any pattern from P
Algorithm: Aho-Corasick
    l = 1; c = 1; w = root of K
    repeat
        while there is an edge (w, w') labeled with T[c]
            if w' is numbered by pattern i then
                report that p_i occurs in T starting at l
            w = w'; c++
        if w is the root then c++
        w = failure(w); l = c - length-prefix(w)
    until c > |T|
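
The two steps (building the machine, then scanning the text in a single pass) can be sketched in Python as follows. This is a minimal illustration rather than the exact procedure on the slide: the names `build_machine` and `search` are assumptions, states are plain integers, and the output sets are merged along failure links so that keywords ending inside a longer match (such as "he" inside "she") are also reported.

```python
from collections import deque


def build_machine(keywords):
    """Build the goto, failure and output functions for a keyword set."""
    goto = [{}]        # goto[state][symbol] -> next state (state 0 is the root)
    output = [set()]   # output[state] -> keywords recognised on reaching state
    for keyword in keywords:
        state = 0
        for symbol in keyword:
            if symbol not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output[state].add(keyword)

    failure = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail back to the root
    while queue:                      # breadth-first traversal of the tree
        state = queue.popleft()
        for symbol, nxt in goto[state].items():
            queue.append(nxt)
            fallback = failure[state]
            while fallback and symbol not in goto[fallback]:
                fallback = failure[fallback]
            failure[nxt] = goto[fallback].get(symbol, 0)
            output[nxt] |= output[failure[nxt]]
    return goto, failure, output


def search(text, machine):
    """Return (start_index, keyword) pairs, using 0-based start indices."""
    goto, failure, output = machine
    state, hits = 0, []
    for end, symbol in enumerate(text):
        while state and symbol not in goto[state]:
            state = failure[state]    # consult the failure function
        state = goto[state].get(symbol, 0)
        for keyword in output[state]:
            hits.append((end - len(keyword) + 1, keyword))
    return hits


machine = build_machine(["he", "she", "his", "hers"])
hits = sorted(search("ushers", machine))
print(hits)
```

For place reference recognition the same machine could be built over tokens instead of characters, so that only whole-word matches are reported; that variant is not shown here.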

10 Hidden Markov Models
- HMMs are the standard sequence modeling tool in NLP and IE
- Finite state model / graphical model: a state sequence ..., S_{t-1}, S_t, S_{t+1}, ... connected by transitions generates an observation sequence ..., O_{t-1}, O_t, O_{t+1}, ... (e.g., o_1 o_2 ... o_8)
- Joint probability of a state sequence and an observation sequence:
  $P(\mathbf{s}, \mathbf{o}) = \prod_{t=1}^{|\mathbf{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$
- Parameters, for all states $S = \{s_1, s_2, \ldots\}$:
  - Start state probabilities: $P(s_t)$
  - Transition probabilities: $P(s_t \mid s_{t-1})$
  - Observation (emission) probabilities: $P(o_t \mid s_t)$
- Training: maximize the probability of the training observations
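
To make the factorisation concrete, the following Python sketch evaluates the joint probability of one state sequence and one observation sequence, treating the first transition term as the start state probability. The two states and all probability values are invented for illustration; they are not taken from the slides or from any trained model.

```python
import math


def joint_log_prob(states, observations, start_p, trans_p, emit_p):
    """log P(s, o) = log start + sum of log transition and log emission terms."""
    logp = math.log(start_p[states[0]]) + math.log(emit_p[states[0]][observations[0]])
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        logp += math.log(trans_p[prev][cur]) + math.log(emit_p[cur][obs])
    return logp


# Hypothetical toy parameters: "bg" = background, "loc" = location name.
start_p = {"bg": 0.8, "loc": 0.2}
trans_p = {"bg": {"bg": 0.7, "loc": 0.3}, "loc": {"bg": 0.4, "loc": 0.6}}
emit_p = {
    "bg": {"to": 0.5, "Campo": 0.25, "Grande": 0.25},
    "loc": {"to": 0.1, "Campo": 0.5, "Grande": 0.4},
}

logp = joint_log_prob(["bg", "loc", "loc"], ["to", "Campo", "Grande"],
                      start_p, trans_p, emit_p)
print(math.exp(logp))  # (0.8 * 0.5) * (0.3 * 0.5) * (0.6 * 0.4)
```

Working in log space avoids numerical underflow on longer sequences, which is why the sketch sums logarithms instead of multiplying probabilities.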

11 Placename Extraction with HMMs
- Given a sequence of observations:
    Yesterday Bruno Martins went to Campo Grande
- and a trained HMM with the states: person name, location name, background
- Find the most likely state sequence (Viterbi): $\arg\max_{\mathbf{s}} P(\mathbf{s}, \mathbf{o})$
- Any words said to be generated by the designated location name state are extracted as a location name:
    Location name: Campo Grande

12 B-I-O Encoding
- Encode the chunking problem of recognizing place references as a tagging problem of assigning classes to individual word tokens.
- Classes: Begin_place, Inside_place, Other
- Example:
    Yesterday  Bruno  Martins  went  to  Campo  Grande
    O          B_per  I_per    O     O   B_loc  I_loc
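
A small Python sketch of the encoding and its inverse may help. The helper names `bio_encode` and `bio_decode` and the span format `(start, end, label)` with an exclusive end are assumptions for this example; the tag names follow the slide.

```python
def bio_encode(tokens, spans):
    """Turn non-overlapping (start, end, label) spans into one tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B_{label}"
        for i in range(start + 1, end):
            tags[i] = f"I_{label}"
    return tags


def bio_decode(tokens, tags):
    """Recover (label, text) chunks from a tag sequence."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("I_") and current and tag[2:] == label:
            current.append(token)          # continue the open chunk
            continue
        if current:                        # any other tag closes the open chunk
            chunks.append((label, " ".join(current)))
            current, label = [], None
        if tag.startswith("B_"):
            current, label = [token], tag[2:]
    if current:
        chunks.append((label, " ".join(current)))
    return chunks


tokens = "Yesterday Bruno Martins went to Campo Grande".split()
tags = bio_encode(tokens, [(1, 3, "per"), (5, 7, "loc")])
chunks = bio_decode(tokens, tags)
print(tags)
print(chunks)
```

In this sketch an `I_` tag that does not continue a chunk of the same label is silently dropped; other conventions treat it as the start of a new chunk.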

13 Hidden Markov Models
- Learning the model with training data
  - General algorithm based on Expectation-Maximization (EM):
    1. Initialise model $\lambda_0$
    2. Compute new model $\lambda$, using $\lambda_0$ and the observed sequence
    3. Adjust the model: $\lambda_0 \leftarrow \lambda$
    4. Repeat steps 2 and 3 until $\log P(X, Y \mid \lambda) - \log P(X, Y \mid \lambda_0) < d$
- Using the model (i.e., decoding)
  - Choose the output label sequence that maximizes the probability of the token observation sequence
  - Viterbi: a dynamic programming algorithm that keeps the best label sequence at each instance

14 The Viterbi Algorithm
- The algorithm sweeps through all the tag possibilities for each word, computing the best sequence leading to each possibility.
- Dynamic programming approach: the key to the algorithm's efficiency is that, because of the Markov assumption used in the model, we only need to know the best sequences leading to the previous word.

15 The Viterbi Algorithm
Let T = # of tags in our annotation problem (e.g., B-I-O tags for each entity type)
    W = # of words in the text to be annotated
/* Initialization Step */
for t = 1 to T
    Score(t, 1) = Pr(Word_1 | Tag_t) * Pr(Tag_t)
    BackPtr(t, 1) = 0
/* Iteration Step */
for w = 2 to W
    for t = 1 to T
        Score(t, w) = Pr(Word_w | Tag_t) * MAX_{j=1..T} (Score(j, w-1) * Pr(Tag_t | Tag_j))
        BackPtr(t, w) = index of j that gave the max above
/* Sequence Identification */
Seq(W) = t that maximizes Score(t, W)
for w = W-1 to 1
    Seq(w) = BackPtr(Seq(w+1), w+1)
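
The three steps (initialization, iteration, sequence identification) translate almost line by line into Python. The sketch below uses dictionaries keyed by tag name instead of numeric indices, and a two-tag toy model ("bg" for background, "loc" for location name) whose probabilities are invented for illustration only.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under the given HMM."""
    # Initialization step: score of each tag for the first word.
    score = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    # Iteration step: best way to reach each tag at each later word.
    for w in range(1, len(words)):
        score.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda j: score[w - 1][j] * trans_p[j][t])
            score[w][t] = (emit_p[t].get(words[w], 0.0)
                           * score[w - 1][best_prev] * trans_p[best_prev][t])
            back[w][t] = best_prev
    # Sequence identification: follow the back pointers from the best last tag.
    seq = [max(tags, key=lambda t: score[-1][t])]
    for w in range(len(words) - 1, 0, -1):
        seq.append(back[w][seq[-1]])
    seq.reverse()
    return seq


# Hypothetical toy parameters, not estimated from any corpus.
tags = ["bg", "loc"]
start_p = {"bg": 0.8, "loc": 0.2}
trans_p = {"bg": {"bg": 0.7, "loc": 0.3}, "loc": {"bg": 0.4, "loc": 0.6}}
emit_p = {
    "bg": {"went": 0.4, "to": 0.4, "Campo": 0.1, "Grande": 0.1},
    "loc": {"went": 0.05, "to": 0.05, "Campo": 0.5, "Grande": 0.4},
}

best = viterbi(["went", "to", "Campo", "Grande"], tags, start_p, trans_p, emit_p)
print(best)
```

The running time is proportional to W * T^2, since each of the W words considers every pair of previous and current tags.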

16 Disambiguation and Gazetteers
- Place reference disambiguation relies on (external) gazetteer data for places.
- A gazetteer is a database associating place names with the corresponding place metadata.
- Similar to an address geocoding service
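
As a rough picture of what a gazetteer lookup provides, here is a minimal in-memory sketch in Python. The `Place` record, its fields and the `build_index` helper are assumptions for this example and do not mirror the schema of any particular gazetteer; the coordinates are approximate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    lat: float
    lon: float
    feature_type: str


def build_index(places):
    """Map each lower-cased name to all places that carry it."""
    index = defaultdict(list)
    for place in places:
        index[place.name.lower()].append(place)
    return index


# Hypothetical toy entries with approximate coordinates.
places = [
    Place("Paris", "FR", 48.857, 2.352, "city"),
    Place("Paris", "US", 33.661, -95.556, "city"),
    Place("Lisbon", "PT", 38.722, -9.139, "city"),
]
index = build_index(places)
candidates = index["paris"]
print([(p.name, p.country) for p in candidates])
```

A name that maps to more than one record, as "Paris" does here, is exactly the geographic/geographic ambiguity that the disambiguation step has to resolve.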

17 Some Popular Gazetteer Services
- The Alexandria Digital Library (ADL) Gazetteer
  - Pioneering effort in defining data models and XML access protocols for managing gazetteer data
  - The dataset was built by integrating data from multiple sources, but usage requires a private license
- The geonames.org world gazetteer
  - Dataset built by integrating data from multiple sources, with 8 million geographic names, in multiple languages, for more than 6.5 million unique geographic features
  - Geographic features are only associated with centroid coordinates, as opposed to polygons or MBRs
  - Does not include historical place names or time periods
- The Getty Thesaurus of Geographic Names (TGN)
  - Describes about 1 million places around the globe, with alternative names in multiple languages
  - Usage of TGN data requires a private license
  - Includes historical place names (associated with time periods), but not names of historical periods
- The Yahoo! GeoPlanet database
- Many more

18 Place Reference Disambiguation
- Most approaches leverage contextual information:
  - External: information from gazetteers (e.g., population, types, ...)
  - Internal: words and other entities surrounding the place reference
- Disambiguation heuristics can be grouped into:
  - Default senses: disambiguate to the most important candidate referent, with importance estimated from geometric area or population.
  - Spatial minimalism: disambiguate to the candidate that minimizes the distance to other place references in the same context, or the geometric area that covers all place references in the same context.
  - Attribute coherence: disambiguate to the candidate referent whose attributes (e.g., the place type) are similar to those mentioned in the textual context where the reference appears.
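
Two of these heuristics are easy to sketch in Python. The code below assumes candidates are dictionaries with `coords` (latitude, longitude in degrees) and `population` keys, which is a made-up format; the coordinates are approximate and the population figures are illustrative values, not gazetteer data. Distances use the haversine great-circle formula with a mean Earth radius of 6371 km.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def default_sense(candidates):
    """Pick the most important candidate, using population as importance."""
    return max(candidates, key=lambda c: c["population"])


def spatial_minimalism(candidates, context_coords):
    """Pick the candidate closest, in total, to the other places in context."""
    return min(candidates,
               key=lambda c: sum(haversine_km(c["coords"], p) for p in context_coords))


# Hypothetical candidates for the reference "Cambridge".
candidates = [
    {"id": "Cambridge, UK", "coords": (52.205, 0.119), "population": 145000},
    {"id": "Cambridge, US", "coords": (42.374, -71.106), "population": 118000},
]
boston = (42.360, -71.059)  # another place mentioned in the same context

by_default = default_sense(candidates)
by_distance = spatial_minimalism(candidates, [boston])
print(by_default["id"], "|", by_distance["id"])
```

With these toy values the two heuristics disagree, which illustrates why practical systems combine several sources of evidence instead of relying on a single rule.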

19 Disambiguation with Machine Learning
- Disambiguation can be seen as a problem of ranking candidate referents and choosing the best candidate.
- The ranking can be based on an estimate of the geospatial distance between the candidate referent and the referent corresponding to the true disambiguation.
- Regression models are used to estimate the geospatial distance:
  - Several features are correlated with the geospatial distance
  - Find a function that combines the available features in order to estimate the geospatial distance associated with the candidate
  - Linear regression
  - Genetic programming
  - SVM regression
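
The ranking step can be sketched as follows, assuming a linear regression model has already been fitted. The weights, bias, feature layout and candidate values below are all hypothetical and chosen only to show the mechanics: each candidate receives an estimated distance to the true referent, and the candidate with the smallest estimate is selected.

```python
def predicted_distance(features, weights, bias):
    """Linear model: estimated distance (km) from candidate to true referent."""
    return bias + sum(w * x for w, x in zip(weights, features))


def rank_candidates(candidates, weights, bias):
    """Sort candidates by estimated distance, best (smallest) first."""
    return sorted(candidates,
                  key=lambda c: predicted_distance(c["features"], weights, bias))


# Hypothetical feature layout: [name dissimilarity, distance to context in 1000 km].
weights = [500.0, 800.0]   # made-up coefficients standing in for a fitted model
bias = 10.0
candidates = [
    {"id": "A", "features": [0.0, 5.2]},
    {"id": "B", "features": [0.0, 0.005]},
    {"id": "C", "features": [0.4, 0.3]},
]

ranked = rank_candidates(candidates, weights, bias)
print([c["id"] for c in ranked])
```

Swapping the linear function for an SVM regressor or a function evolved by genetic programming would only change `predicted_distance`; the ranking logic stays the same.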

20 Disambiguation with Machine Learning

21 Disambiguation Features
- String similarity between the candidate name for the referent and the reference string in the text
- Population count for the candidate referent
- Geospatial area for the candidate referent
- Number of alternative names for the candidate referent
- Geospatial distance between the candidate referent and the closest interpretation for place references in the same textual unit (e.g., the same paragraph)
- Area of the convex hull covering the candidate referent and all candidates of place references in the same text unit
- Many more have been tested in the related literature
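
A few of the non-spatial features can be computed with the Python standard library alone, as in the sketch below. The candidate record format and the function name `extract_features` are assumptions, as are two design choices: similarity is measured with `difflib.SequenceMatcher` and taken as the maximum over the primary and alternative names, and the population count is log-scaled. The geospatial features are left out for brevity.

```python
import math
from difflib import SequenceMatcher


def extract_features(reference, candidate):
    """Compute a small feature dictionary for one candidate referent."""
    names = [candidate["name"], *candidate["alt_names"]]
    best_similarity = max(
        SequenceMatcher(None, reference.lower(), name.lower()).ratio()
        for name in names
    )
    return {
        "name_similarity": best_similarity,
        "log_population": math.log10(candidate["population"] + 1),
        "n_alt_names": len(candidate["alt_names"]),
    }


# Hypothetical candidate record with an illustrative population value.
candidate = {"name": "Lisbon", "alt_names": ["Lisboa", "Lisbonne"], "population": 500000}
features = extract_features("Lisboa", candidate)
print(features)
```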

22 State of the art results

23 Current research challenges
- Some commercial services already exist:
  - Yahoo! Placemaker
  - MetaCarta Text Geotagging Service
- But there are many open research challenges:
  - Multilingual place reference resolution with machine learning
    - Requires more annotation standards/corpora such as SpatialML
    - Using advanced sequence tagging models
  - Considering other geospatial reference resolution tasks:
    - Resolution of geospatial relations given in text
    - Fine-grained classification of place references in text

24 Questions?
