Annotating Spatio-Temporal Information in Documents

1 Annotating Spatio-Temporal Information in Documents
Jannik Strötgen, University of Heidelberg, Institute of Computer Science, Database Systems Research Group
June 8, 2010
Name Classification and Grounding in Multilingual Corpora, University of Zurich

2 University of Heidelberg
- Oldest German university, founded in 1386
- Volluniversität (comprehensive university): 12 faculties, 180 fields of study
- 20% international students
- Computer Science, Computational Linguistics
June 8, 2010 Annotating Spatio-Temporal Information Jannik Strötgen 2 / 60

3 Database Systems Research Group
Major research topics include:
- Geospatial and spatio-temporal data management
- Moving objects and object trajectories
- Processing and mining geospatial data streams
- Spatial and temporal information extraction
- Spatial and temporal information retrieval

4 Motivation
- A lot of information is published only as unstructured text
- Information extraction helps to identify valuable information: names, locations, dates
- This information is useful for several search and exploration tasks

5 Motivation
Query to Google: "Alexander von Humboldt"
- More than 1 million results
- A lot of unstructured information
- Document search and exploration need support

6 Motivation
Figure: Part of the Wikipedia page on Alexander von Humboldt

7 Motivation
What is the document talking about? Events = space + time
Figure: Part of the Wikipedia page on Alexander von Humboldt

10 Motivation
Goal: extraction and exploration of spatio-temporal information in documents (extraction of events)
Tasks:
- Information extraction (temporal and spatial)
- A model for spatio-temporal information (events)
- Implementation: document processing pipeline

11 Outline
1. Information Extraction
   - Temporal Information Extraction
   - Spatial Information Extraction
2. A Model for Spatio-Temporal Information
   - Spatio-Temporal Document Profiles
3. Document Processing Pipeline
   - Yahoo Placemaker
4. The Temporal Tagger HeidelTime
5. Summary and Ongoing Work

12 Information Extraction
- A lot of information is published only in unstructured form
- Temporal and spatial information is widespread in documents and highly valuable for search and exploration tasks
- Temporal and spatial information extraction are named entity recognition and normalization tasks

13 Temporal Information Extraction
Temporal information (Timex3):
- explicit: October 12
- implicit: Columbus Day
- relative: today
Extraction: identify temporal expressions together with their offset information
Normalization (to the Timex3 ISO standard):
- all expressions are normalized to a standard format
- all expressions referring to the same value receive an identical normalized value

14 Spatial Information Extraction
Spatial information is highly ambiguous ("Go to Springfield in the US")

15 Spatial Information Extraction
Figure: Springfields in the United States

16 Spatial Information Extraction
Spatial information is:
- highly ambiguous
- associated with longitude/latitude information
- associated with a geometry (point or polygonal region)
Extraction: identify spatial expressions together with their offset information
Normalization:
- all expressions are assigned longitude/latitude information
- all expressions referring to the same location receive identical longitude/latitude information (e.g., New York City, NYC, Big Apple)
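
The normalization requirement above (different names, identical coordinates) can be sketched with a toy alias table. This is only an illustration: the names `PLACES`, `ALIASES`, and `normalize_place` are hypothetical, the coordinates are rough approximations, and real geo taggers additionally have to disambiguate names such as Springfield.

```python
# Toy gazetteer (hypothetical): one canonical entry per place.
# Values are approximate (longitude, latitude) pairs, in the order used on the slides.
PLACES = {
    "new_york_city": (-74.0, 40.7),
}

# Every surface form points to the same canonical entry.
ALIASES = {
    "new york city": "new_york_city",
    "nyc": "new_york_city",
    "big apple": "new_york_city",
}


def normalize_place(expression):
    """Return (longitude, latitude) for a known surface form, else None."""
    place_id = ALIASES.get(expression.strip().lower())
    if place_id is None:
        return None
    return PLACES[place_id]


# All three expressions resolve to the identical coordinate pair.
same = normalize_place("NYC") == normalize_place("Big Apple") == normalize_place("New York City")
```

Routing all aliases through one canonical identifier guarantees that the coordinates cannot drift apart between surface forms.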

17 A Model for Spatio-Temporal Information
Document profiles are:
- a model describing a document's information in a concise manner
- a data structure that makes spatial and temporal information accessible for search and exploration tasks
Three kinds: temporal document profiles, spatial document profiles, spatio-temporal document profiles

18 Temporal Document Profiles
A temporal document profile tdp(d) is a sequence of tuples (e_i, c_i, p_i):
- e_i: temporal expression
- c_i: normalized value (chronon)
- p_i: offset information in the document
Example: tdp(d) = {..., ("January 6, 1802", 1802-01-06, ...), ...}
All tuples are extracted by the temporal tagger and normalized to their standard format

19 Spatial Document Profiles
A spatial document profile sdp(d) is a sequence of tuples (g_i, v_i, p_i):
- g_i: geographic expression
- v_i: normalized value (longitude/latitude)
- p_i: offset information in the document
Example: sdp(d) = {..., ("Quito", -78.5/-0.19, ...), ...}
All tuples are extracted by the geo tagger and normalized to their standard format

20 Spatio-Temporal Document Profiles
Question: How can spatial and temporal information be combined to extract events?
Method: extraction of co-occurrences of spatial and temporal information

21 Spatio-Temporal Document Profiles
Co-occurrence: both expressions occur in the same window of the document (e.g., paragraph or sentence)
A spatio-temporal document profile stdp(d)
- combines the spatial and temporal information
- is a sequence of tuples (e, c, g, v, p_t, p_s), where
  - (e, c, p_t) is in tdp(d)
  - (g, v, p_s) is in sdp(d)
  - p_t and p_s belong to the same window of the document
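
A minimal sketch of the co-occurrence definition above, assuming that windows are given as (begin, end) character ranges and that offsets are (begin, end) pairs. The class and function names are hypothetical, and the spatial offset and window boundaries in the example are invented for illustration; only the temporal entry (November 24, 1800-11-24, 7837-7848) comes from the slides.

```python
from typing import NamedTuple


class TemporalEntry(NamedTuple):
    expression: str  # e: temporal expression
    value: str       # c: normalized value
    offset: tuple    # p_t: (begin, end) character offsets


class SpatialEntry(NamedTuple):
    expression: str  # g: geographic expression
    value: tuple     # v: (longitude, latitude)
    offset: tuple    # p_s: (begin, end) character offsets


def window_of(offset, windows):
    """Return the index of the window that fully contains the offset, else None."""
    begin, end = offset
    for index, (w_begin, w_end) in enumerate(windows):
        if w_begin <= begin and end <= w_end:
            return index
    return None


def build_stdp(tdp, sdp, windows):
    """Pair every temporal entry with every spatial entry in the same window."""
    stdp = []
    for t in tdp:
        t_window = window_of(t.offset, windows)
        if t_window is None:
            continue
        for s in sdp:
            if window_of(s.offset, windows) == t_window:
                stdp.append((t.expression, t.value, s.expression, s.value,
                             t.offset, s.offset))
    return stdp


# Example: one sentence window (hypothetical boundaries and spatial offset).
example_windows = [(7800, 8000)]
example_tdp = [TemporalEntry("November 24", "1800-11-24", (7837, 7848))]
example_sdp = [SpatialEntry("Cuba", (-79.5, 22.0), (7860, 7864))]
example_stdp = build_stdp(example_tdp, example_sdp, example_windows)
```

Changing the window list from sentences to paragraphs changes how many pairs are produced without touching the pairing logic.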

22 Spatio-Temporal Document Profiles
Example, entities with normalization:
- te_1: ("November 24", 1800-11-24, 7837-7848)
- se_1: ("Cuba", -79.5/22.0, ...)
- se_2: ("Cartagena, Colombia", -75.5/10.4, ...)
Co-occurrences: (te_1, se_1), (te_1, se_2)

23 Spatio-Temporal Document Profiles
Example, entities with normalization:
- te_2: ("January 6, 1802", 1802-01-06, ...)
- se_3: ("Magdalena", -74.5/10.0, ...)
- se_4: ("Cordillera Real", -78.0/0.0, ...)
- se_5: ("Quito", -78.5/-0.19, ...)
Co-occurrences: (te_2, se_3), (te_2, se_4), (te_2, se_5)

24 Spatio-Temporal Document Profiles
stdp(d) = {...,
  (e_1, 1800-11-24, "Cuba", -79.5/22.0, p_t, p_s),
  (e_1, 1800-11-24, "Cartagena, Colombia", -75.5/10.4, p_t, p_s),
  (e_2, 1802-01-06, "Magdalena", -74.5/10.0, p_t, p_s),
  (e_2, 1802-01-06, "Cordillera Real", -78.0/0.0, p_t, p_s),
  (e_2, 1802-01-06, "Quito", -78.5/-0.19, p_t, p_s),
  ...}

25 Document Trajectory
stdp(d), a sequence of tuples ordered by time:
- a good model
- but hard to analyze and not eye-catching
Document trajectory:
- a trajectory is a sequence of time/location pairs
- stdp(d) can be seen as a document trajectory, i.e., a sequence of events
- document trajectories can be visualized on a map
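
As a rough sketch of the step from stdp(d) to a document trajectory, the tuples can be ordered by their normalized time value and reduced to time/location pairs. The function name is hypothetical, and the sketch assumes that all values are full YYYY-MM-DD strings with four-digit years, so that string order equals chronological order. The offsets in the example are placeholders.

```python
def document_trajectory(stdp):
    """Order stdp tuples (e, c, g, v, p_t, p_s) by time; keep (c, v) pairs."""
    # Assumption: c is a full ISO-style date string, so sorting the
    # strings also sorts the dates chronologically.
    ordered = sorted(stdp, key=lambda entry: entry[1])
    return [(c, v) for (_e, c, _g, v, _p_t, _p_s) in ordered]


# Placeholder offsets (None): only values and coordinates matter here.
example_stdp = [
    ("January 6, 1802", "1802-01-06", "Quito", (-78.5, -0.19), None, None),
    ("November 24", "1800-11-24", "Cuba", (-79.5, 22.0), None, None),
]
trajectory = document_trajectory(example_stdp)
```

Each resulting pair can be plotted as a point on a map, with consecutive points connected in time order.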

26 Document Trajectory
Figure: Part of the document trajectory of Wikipedia's Humboldt page

27 Document Trajectory
Useful for search and exploration tasks:
- visualization of a document's events on a map (for one document or for multiple documents)
- spatio-temporal snippets

28 Document Processing Pipeline
Goals:
- flexible pipeline
- corpus-independent processing pipeline
- ability to integrate new components easily

29 Document Processing Pipeline
UIMA: Unstructured Information Management Architecture
- component framework for unstructured content
- helps to connect tools that were not built to be used together
- all components work on the same data structure, the CAS object (Common Analysis Structure)

30 UIMA - Components of a Pipeline
Docs → Collection Reader → Analysis Engines → CAS Consumer → Results

31 UIMA - Components of a Pipeline
Collection Reader:
- reads documents from a source (e.g., file system, database)
- instantiates a CAS for each document
- initializes the CAS with the document text (metadata, etc.)
CAS contents at this stage: document text, metadata

32 UIMA - Components of a Pipeline
Analysis Engines:
- usually several Analysis Engines analyze the document
- each reads the content of the CAS
- each adds annotations to the CAS
CAS contents at this stage: document text, metadata, annotations

33 UIMA - Components of a Pipeline
CAS Consumer:
- reads the content of the CAS
- does the final processing: evaluation, visualization, indexing
CAS contents at this stage: document text, metadata, annotations

34 UIMA - Components of a Pipeline
UIMA: What's the key idea?
- individual components are not directly connected to each other; instead, they communicate through the CAS
- components are independent of each other
- components only have to be able to handle the CAS
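
The decoupling idea can be illustrated with a toy pipeline in plain Python. This is not the UIMA API: the `Cas` class and all function names below are hypothetical stand-ins, and the two analysis engines are deliberately naive. The point is only that each component reads from and writes to the shared structure and never calls another component.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Toy stand-in for a CAS: document text, metadata, annotations."""
    text: str
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)


def collection_reader(documents):
    """Create one Cas per document from a {name: text} mapping."""
    for name, text in documents.items():
        yield Cas(text=text, metadata={"source": name})


def sentence_tagger(cas):
    """Naive analysis engine: annotate spans ending in '.', '!' or '?'."""
    for match in re.finditer(r"[^.!?]+[.!?]", cas.text):
        cas.annotations.append(("sentence", match.start(), match.end()))


def year_tagger(cas):
    """Naive analysis engine: annotate four-digit numbers starting with 1."""
    for match in re.finditer(r"\b1[0-9]{3}\b", cas.text):
        cas.annotations.append(("year", match.start(), match.end()))


def annotation_counter(cas):
    """Toy consumer: count annotations per type for one document."""
    counts = {}
    for kind, _begin, _end in cas.annotations:
        counts[kind] = counts.get(kind, 0) + 1
    return cas.metadata["source"], counts


def run_pipeline(documents, engines, consumer):
    """Pass each Cas through all engines, then hand it to the consumer."""
    results = []
    for cas in collection_reader(documents):
        for engine in engines:
            engine(cas)
        results.append(consumer(cas))
    return results


results = run_pipeline(
    {"doc1": "He left in 1799. He returned in 1804."},
    [sentence_tagger, year_tagger],
    annotation_counter,
)
```

Adding or removing an engine only changes the list passed to `run_pipeline`, which mirrors the goal of integrating new components easily.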

35 Document Processing Pipeline
- Sources: Wikipedia Featured Articles, gold standard
- Tasks: paragraph splitting, sentence splitting, geo tagging, temporal tagging, co-occurrence extraction
- Results: document profiles, evaluation results, document trajectories; results are stored in a database

36 Document Processing Pipeline
- Collection Readers: Wiki Reader, Gold Standard Reader
- Analysis Engines: Paragraph Splitter, Sentence Splitter, Geo Tagger, Temporal Tagger, Co-occurrence Extractor
- CAS Consumers: Database Writer, Visualizer, Evaluator

37 Document Processing Pipeline
Components:
- Sentence Splitter: OpenNLP Sentence Splitter
- Geo Tagger: MetaCarta Service, Yahoo Placemaker
- Temporal Tagger: own implementation

38 Yahoo Placemaker
What is Yahoo Placemaker?
- free geo-parsing web service
- returns geographic metadata
Processing steps:
- identifies places in unstructured content
- disambiguates those places
- returns unique identifiers (WOEIDs)

39 Yahoo Placemaker
Supported languages: multiple, e.g., English, German, Italian, French, Spanish, Japanese, Chinese, ...
Information on identified places:
- latitude/longitude information
- normalized name

40 Yahoo Placemaker
Additional information using the Yahoo GeoPlanet API:
- bounding box
- containment information, e.g.: World Trade Center → Downtown Manhattan → New York → New York (State) → United States → Earth ...

41 The Temporal Tagger HeidelTime
HeidelTime is a rule-based system for
- the extraction of temporal expressions
- their normalization (according to the Timex3 standard)
It was optimized for and evaluated within the TempEval-2 challenge

42 The Temporal Tagger HeidelTime
The TempEval-2 challenge:
- Task 13 of SemEval-2010, the 5th Workshop on Semantic Evaluation
- 6 tasks: extraction and normalization of temporal expressions (Task A), events (Task B), temporal relations (Tasks C-F)

43 Temporal Expressions
4 types of semantics:
- Dates: April 29, 2010
- Times: 12 p.m.
- Durations: two weeks
- Sets: twice a week

44 Temporal Expressions
3 types of occurrences:
- explicit: October 12
- implicit: Columbus Day
- relative: today

45 HeidelTime
Extraction:
- mainly regular expressions
- other features (POS, POS of next token, etc.)
Normalization:
- knowledge resources (names of months, holidays, etc.)
- linguistic clues (tense of sentences)

46 HeidelTime
Rules: every rule is a triple of
- expression rule
- normalization function
- type information
Example of a temporal expression: June 8, 2010

47 HeidelTime
Expression rule (of type date): date_r1 = (remonth)_g1 (reday)_g2, (refullyear)_g3
Normalization function: norm_r1(g1, g2, g3) = g3-normmonth(g1)-normday(g2)
Expression resources:
- remonth = (...|June|July|...)
- reseason = (...|summer|...)
Normalization functions:
- normmonth("June") = 06
- normmonth("July") = 07
- normseason("summer") = SU
Normalized temporal expression: June 8, 2010 → 2010-06-08
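
The rule above can be approximated in a few lines of Python. This is a sketch of the idea, not HeidelTime's actual rule syntax or implementation: the variable names loosely mirror the slide, the rule is modeled as a (pattern, normalization function, type) triple, and the day and year patterns are assumptions.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Expression resources (assumed patterns for day and full year).
RE_MONTH = "|".join(MONTHS)
RE_DAY = r"[0-9]{1,2}"
RE_FULL_YEAR = r"[0-9]{4}"

# Normalization resource: month name -> two-digit month number.
NORM_MONTH = {name: "%02d" % number for number, name in enumerate(MONTHS, start=1)}


def norm_month(name):
    return NORM_MONTH[name]


def norm_day(day):
    return day.zfill(2)


def norm_r1(g1, g2, g3):
    """g3-normmonth(g1)-normday(g2), as on the slide."""
    return g3 + "-" + norm_month(g1) + "-" + norm_day(g2)


# A rule as a triple: expression rule, normalization function, type.
DATE_R1 = (
    re.compile(r"\b(" + RE_MONTH + r") (" + RE_DAY + r"), (" + RE_FULL_YEAR + r")\b"),
    norm_r1,
    "DATE",
)


def apply_rule(rule, text):
    """Return (expression, normalized value, type, (begin, end)) per match."""
    pattern, normalize, timex_type = rule
    return [(m.group(0), normalize(*m.groups()), timex_type, m.span())
            for m in pattern.finditer(text)]


matches = apply_rule(DATE_R1, "The talk was given on June 8, 2010 in Zurich.")
```

Keeping the pattern resources and the normalization tables separate from the rule is what makes it plausible to adapt such a tagger to other languages by swapping resources.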

48 HeidelTime: Architecture
- realized as a UIMA component
- rule development within the UIMA pipeline

49 HeidelTime: Architecture
Figure: UIMA document processing pipeline
- Collection Readers: TempEval-2 Reader (TempEval-2 data), other Collection Readers (other heterogeneous sources)
- Analysis Engines: Sentence Splitter, Tokenizer, POS Tagger, HeidelTime, other Analysis Engines
- CAS Consumers: TempEval-2 File Writer, TempEval-2 Evaluator, other Consumers
- Workflows shown: rule design workflow, task workflow
Rule development:
- TempEval-2 data: training data (gold standard)
- TempEval-2 Evaluator: lists of fp, fn, tp
Evaluation:
- TempEval-2 data: test data
- TempEval-2 File Writer: creates the files to submit

50 HeidelTime: Evaluation
TempEval-2: 9 systems for Task A (15 runs)
HeidelTime: 2 runs
- precision-optimized rule set
- recall-optimized rule set

51 HeidelTime: Evaluation
Extraction:
Rule set              P      R      F-score
Precision-optimized   90 %   82 %   86 %
Recall-optimized      82 %   91 %   86 %
Figure: Performance of participating systems (recall [%] against precision [%]) with F-score contours for reference.
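
The reported F-scores are consistent with the balanced F-score, i.e., the harmonic mean of precision and recall. Assuming that this is the measure used on the slide, the two values can be checked as follows (the function name is illustrative):

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision-optimized run: P = 90 %, R = 82 %
f_precision_run = round(100 * f_score(0.90, 0.82))

# Recall-optimized run: P = 82 %, R = 91 %
f_recall_run = round(100 * f_score(0.82, 0.91))
```

Both runs round to the same F-score although they trade precision against recall in opposite directions.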

52 HeidelTime: Evaluation
Normalization:
Measure                            Precision-optimized   Recall-optimized
Correct value (normalized value)   85 %                  77 %
Correct type (date, time, ...)     96 %                  92 %
Figure: Value normalization results [%] of participating systems (HeidelTime runs HT-1 and HT-2, other systems s-1 to s-13).

53 HeidelTime: Evaluation
Evaluation results:
- HeidelTime: best system for the extraction task
- HeidelTime: best system for the normalization task
Differences to other systems: see the SemEval workshop in July (at the ACL conference)

54 HeidelTime: Goals
Adaptations for other languages:
- new extraction resources (names of months, days, ...)
- new normalization functions for those expressions
- new rules
Adaptations for other types of documents:
- TempEval: news documents
- other documents: normalization is more difficult because the document creation time is less useful for normalization

55 Summary
Model and implementation:
- extraction of events (space and time)
- a way to organize temporal and spatial information: spatio-temporal document profiles, document trajectories
Search and exploration tasks:
- visualization of events
- exploration of spatio-temporal snippets
- similarity search using stdp
- query constraints using stdp

56 Summary
Geo tagging:
- several geo taggers are available
- quality depends on the gazetteer used (coverage) and the (NLP) methods used (disambiguation)
Temporal tagging:
- few tools are available
- HeidelTime achieves good results for English

57 Ongoing Work
Temporal tagger:
- adapt HeidelTime to other languages and corpora
- clean up the code to make HeidelTime available
Improve the model:
- the co-occurrence approach ignores context
- instead of co-occurrences, use NLP methods for a better understanding of syntax and semantics
- add new NLP components as new analysis engines
Which date belongs to which location? "In 1792 and 1797 he was in Vienna, in 1795 he made a geological and botanical tour through Switzerland and Italy."
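
The limitation can be made concrete for the example sentence. With a sentence-level window, plain co-occurrence pairs every date with every location. The sketch below assumes the dates and locations have already been extracted; the set of intended pairs is one plausible reading of the sentence, not an annotation from the original work.

```python
from itertools import product

dates = ["1792", "1797", "1795"]
locations = ["Vienna", "Switzerland", "Italy"]

# Sentence-level co-occurrence: the full cross product of dates and locations.
cooccurrences = list(product(dates, locations))

# One plausible reading of the sentence (an assumption for illustration).
intended = {
    ("1792", "Vienna"),
    ("1797", "Vienna"),
    ("1795", "Switzerland"),
    ("1795", "Italy"),
}

# Pairs that co-occurrence produces but the sentence does not state.
spurious = [pair for pair in cooccurrences if pair not in intended]
```

Under this reading, more than half of the generated pairs are spurious, which is the motivation for using syntactic and semantic analysis instead of a pure window criterion.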

58 Ongoing Work
Evaluation:
- compare different NER tools for locations
- evaluate the quality of document trajectories
Enlarge the model: add "Who" or "What" to "Where" and "When"!

59 Further Reading
Spatio-temporal information:
- Jannik Strötgen, Michael Gertz, and Pavel Popov. Extraction and Exploration of Spatio-Temporal Information in Documents. In: GIR '10: Proceedings of the 6th Workshop on Geographic Information Retrieval, Zurich, Switzerland, February 18-19, 2010. ACM.
Temporal tagger HeidelTime:
- Jannik Strötgen and Michael Gertz. HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. To appear in: SemEval-2010: 5th International Workshop on Semantic Evaluations (at ACL 2010). ACL.

60 Thank you for your attention!
