Wikulu: Information Management in Wikis Enhanced by Language Technologies

1 Wikulu: Information Management in Wikis Enhanced by Language Technologies. Iryna Gurevych (joint work with Dr. Torsten Zesch, Daniel Bär and Nico Erbs)

2 UKP Lab: Projects. Educational Natural Language Processing; Natural Language Processing and Wikis (Dr. Torsten Zesch); Semantic Information Management (Dr. György Szarvas); Language Technology for eHumanities (Richard Eckart, Dipl. Inf.); Statistical Semantics (Jun. Prof. Chris Biemann)

3 Wikis Everywhere

4 The Wiki Paradigm

5 NLP4Wiki & Wiki4NLP. Diagram: Wiki, Wiki4NLP, NLP4Wiki, Natural Language Processing. Tasks: Information Extraction, Information Retrieval, Link Detection, Named Entity Recognition, Question Answering, Summarization, Text Categorization

6 JWPL & JWKTL: Using Wikipedia and Wiktionary. JWPL: Articles, Links, Redirects, Paragraphs, Categories, Tables, ... JWKTL: Definitions, Examples, Synonyms, Hypernyms, Hyponyms, Word class, ...

7 Talk Outline. Introduction to Wikulu: Project goals, Demonstration. Link Discovery: Anchor Discovery, Target Discovery. Experiments and outlook

8 Why is NLP so important to Wikis? In the beginning: adding and finding is easy; flexible and open. People add lots of content: "I can't find anything!?" "Where do I put this?"

9 Wiki User Complaints: Searching, Organizing. Source: M. Buffa: Intranet Wikis. Proceedings of the IntraWebs Workshop 2006 at the 15th International World Wide Web Conference.

10 Nice Use Case for Language Technologies. Wikulu: Hawaiian for "organize" [kukulu] "fast" [wiki]. Natural Language Processing supports adding, organizing and finding.

11 Types of User Interactions. Adding Content: detect duplicates; suggest appropriate points for insertion. Organizing Content: suggest intra-wiki links; suggest tags; suggest page split/merge. Finding Content: semantic search; show related pages.

12 Relevant Research Issues. Foundational research to adapt and integrate natural language processing methods in wikis. Create an intuitive interface to support users with their day-to-day tasks in a wiki. Support multiple wiki platforms. Information security, and more.

13 The Wikulu Architecture: Traditional Wiki vs. Wikulu Approach

14 Wikulu Architecture in Detail

15 Talk Outline. Introduction to Wikulu: Project goals, Demonstration. Link Discovery: Anchor Discovery, Target Discovery. Experiments and outlook

16 What is Link Discovery about? Growth: from "everything is known, finding targets is easy" to "where to link to??"

17 Task Definition. Document Collection. Source Document: "UKP Lab develops natural language processing techniques for automatically understanding written text and applies them to information management like information retrieval, question answering, and structuring information in Wikis." Target Document t1. Target Document t2.

18 Encoding of Links in the Wiki Markup
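
The slide content for the link encoding did not survive in this transcript. As a minimal sketch, assuming MediaWiki-style markup in which a link is written as [[Target]] or [[Target|anchor text]], anchor/target pairs could be read from page source as follows. The function name extract_links and the sample sentence are illustrative, not part of Wikulu.

```python
import re

# Assumption: MediaWiki-style links, [[Target]] or [[Target|anchor text]].
# Nested brackets and templates are deliberately not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(markup):
    """Return (anchor_text, target_title) pairs found in the markup."""
    links = []
    for match in LINK_RE.finditer(markup):
        target = match.group(1).strip()
        anchor = match.group(2)
        # Without a pipe, the target title doubles as the anchor text.
        anchor = target if anchor is None else anchor.strip()
        links.append((anchor, target))
    return links


sample = "The [[Bank (geography)|bank]] of the river is far from the [[City of London]]."
for anchor, target in extract_links(sample):
    print(f"anchor={anchor!r} -> target={target!r}")
```

Other wiki engines use different link syntax, so a multi-platform system would need one such parser per platform.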

19 Subtasks of Link Discovery

20 Bridging NLP and Link Discovery. Anchor discovery is similar to keyphrase extraction. Target discovery is similar to information retrieval. Use techniques from these fields!

21 Each Wiki is Different

22 Wiki Examples at INEX 2009. Wikipedia: general knowledge encyclopedia; dense link structure; 5000 orphans out of ~2.7 million documents. Te Ara: encyclopedia about New Zealand; no links; 438 articles (~3000 files); all articles should be linked; highly structured articles.

23 Experimental Goals. Design an algorithm that uses no prior knowledge of links or of the wiki page structure, so that it can be applied to any Wiki and any unstructured document collection. Compare it with two naive baselines and two systems that make use of prior knowledge about links. Investigate the effect of the number of links used for training in link discovery.

24 Anchor Discovery. Stages: Anchor Discovery, Ranking, Selection. Information types: Text, Title, Link. Anchor candidates: Tokens, n-grams, Noun Phrases, Document Titles, Anchor Phrases. Ranking: Length, tf.idf, Cooccurrence Graph, Anchor Probability, Anchor Strength. Selection: top-ranked anchor candidates.

25 Identifying Anchor Candidates. No links: document title, section titles, n-grams, noun phrases. Existing links: anchor phrases.
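
For the link-free case, candidate generation from n-grams and titles can be sketched as below. This is an illustrative sketch only: the tokenizer, the function names and the max_n limit are assumptions, and noun phrase extraction is left out because it would need a part-of-speech tagger or chunker.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def ngram_candidates(text, max_n=3):
    """Return every word n-gram of length 1..max_n as a candidate anchor."""
    tokens = tokenize(text)
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.add(" ".join(tokens[i:i + n]))
    return candidates


def title_candidates(text, titles, max_n=3):
    """Keep only the n-grams that exactly match a known document title."""
    known = {" ".join(tokenize(title)) for title in titles}
    return ngram_candidates(text, max_n) & known


text = "He bought a cricket bat in the City of London."
print(sorted(title_candidates(text, ["Cricket bat", "City of London", "Bank"])))
```

Raw n-grams overgenerate heavily, which is why a ranking step has to follow candidate identification.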

31 Anchor Candidate Ranking. Without links: term frequency and specificity measure, e.g. TF.IDF. Using links: Keyphraseness (Mihalcea & Csomai, 2007); Maximum Association Strength.
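
A minimal sketch of the link-free ranking, assuming a standard smoothed tf.idf weighting; the slides do not state which tf.idf variant was used, and the toy corpus below is invented for illustration.

```python
import math
from collections import Counter


def tf_idf_ranking(doc_tokens, corpus):
    """Rank the distinct tokens of one document by tf * idf.

    corpus is a list of token lists, one per document in the collection.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    term_freq = Counter(doc_tokens)
    scores = {}
    for term, count in term_freq.items():
        # Smoothed idf (an assumption) so unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        scores[term] = count * idf
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


corpus = [["the", "bank"], ["the", "river"], ["the", "bat"]]
for term, score in tf_idf_ranking(["the", "bank", "bank"], corpus):
    print(f"{term}: {score:.3f}")
```

Frequent but unspecific words such as "the" receive a low idf and fall to the bottom of the ranking.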

32 Keyphraseness. Example "bank": "bank" is used as an anchor in 1760 documents; "bank" occurs in [...] documents.
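
Keyphraseness in the sense of Mihalcea & Csomai (2007) is the number of documents in which a term is used as an anchor divided by the number of documents in which the term occurs at all. The second count for "bank" is missing from this transcript, so the sketch below uses an invented toy collection rather than the slide's figures; the document structure and the naive substring matching are assumptions.

```python
def keyphraseness(term, documents):
    """Share of documents containing the term in which it is also an anchor.

    Each document is a dict with "text" (str) and "anchors" (list of str).
    Matching is a naive case-insensitive substring test, for illustration only.
    """
    term = term.lower()
    occurs = sum(1 for doc in documents if term in doc["text"].lower())
    as_anchor = sum(
        1 for doc in documents
        if term in {anchor.lower() for anchor in doc["anchors"]}
    )
    return as_anchor / occurs if occurs else 0.0


# Toy collection, invented for this example.
toy_docs = [
    {"text": "The bank raised interest rates.", "anchors": ["bank"]},
    {"text": "We sat on the river bank.", "anchors": []},
    {"text": "A bank holiday in London.", "anchors": ["London"]},
    {"text": "Bats are mammals.", "anchors": ["mammals"]},
]
print(f"keyphraseness('bank') = {keyphraseness('bank', toy_docs):.3f}")
```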

33 Example "bat": "bat" is used in 1775 documents. Targets: Bat (1682), Baseball bat (34), Cricket bat (33), ...
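
The counts on this slide can be turned into a per-target share, which is one plausible reading of a strength score for an (anchor, target) pair. The sketch assumes that 1775 is the number of documents in which "bat" is used as an anchor and that the share is simply count divided by that total; the slides do not give the exact formula.

```python
def target_shares(target_counts, total_anchor_docs):
    """Return each target's share of the anchor's uses, largest first."""
    ordered = sorted(target_counts.items(), key=lambda kv: -kv[1])
    return {target: count / total_anchor_docs for target, count in ordered}


# Counts as given on the slide; further targets are elided there ("...").
bat_targets = {"Bat": 1682, "Baseball bat": 34, "Cricket bat": 33}
shares = target_shares(bat_targets, total_anchor_docs=1775)
for target, share in shares.items():
    print(f"{target}: {share:.4f}")
```

Under this reading, "Bat" accounts for roughly 95% of the uses, so it would be the default target for the anchor "bat" when no other evidence is available.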

34 Target Discovery: Existing Link Targets. Geography of Paraguay; City of London; Bank_(geography); Bank

35 Target Discovery. Stages: Target Discovery, Ranking, Selection. Information types: Text, Title, Link. Input: top-ranked anchor candidates. Target discovery: Full Text Search; Full Text Search in Titles; Existing Targets. Ranking: Score of Full Text Search; Title Disambiguation; Score of Full Text Search in Titles; Target Strength.
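
As a rough stand-in for "full text search in titles", the sketch below scores candidate titles by token overlap (Jaccard similarity) with the anchor. A real system would query a search engine index; the scoring function, the names and the use of the page names from slide 34 as a title list are all assumptions made for illustration.

```python
import re


def title_tokens(title):
    """Lowercase a page title and split it on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def rank_targets_by_title(anchor, titles, top_k=5):
    """Return up to top_k titles ordered by Jaccard overlap with the anchor."""
    anchor_tokens = title_tokens(anchor)
    scored = []
    for title in titles:
        tokens = title_tokens(title)
        overlap = anchor_tokens & tokens
        if not overlap:
            continue  # no shared word, not a candidate target
        score = len(overlap) / len(anchor_tokens | tokens)
        scored.append((score, title))
    # Best score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_k]]


titles = ["Geography of Paraguay", "City of London", "Bank_(geography)", "Bank"]
print(rank_targets_by_title("bank", titles))
```

An exact title match ranks first, and parenthesised variants such as Bank_(geography) follow, which is where a title disambiguation step would have to choose between senses.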

36 Our Experiments

37 Experimental Data. Wikipedia snapshot from October 8, 2008, from INEX: ,666,190 articles; 120,399,731 links. Eliminated all links consisting of stopwords and ordinal numbers only. A subset of 6,665 randomly selected articles with over 500,000 user-defined links is used as the Gold Standard.

38 Geva's Page Name Matching (Geva, 2007): GPNM system. Each step is marked by information type (Text, Title, Link). Anchor candidates: Titles. Anchor ranking: Length. Target discovery: Existing Targets. Target ranking: Target Strength.

39 Itakura & Clarke (2007) Link Mining: ICLM system. Each step is marked by information type (Text, Title, Link). Anchor candidates: Anchor Phrases. Anchor ranking: Anchor Strength. Target discovery: Existing Targets. Target ranking: Target Strength.

40 Our Approach (no prior knowledge of links): UKP system. Each step is marked by information type (Text, Title, Link). Anchor candidates: Tokens, n-grams, Noun Phrases. Anchor ranking: Length, tf.idf, Cooccurrence Graph. Target discovery: Full Text Search Engine. Target ranking: Full Text Search Score.
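
The transcript names a cooccurrence graph as a ranking signal but does not say how the graph is scored. One common choice in keyphrase extraction is a TextRank-style PageRank over a word cooccurrence graph, which is what the following sketch assumes; the window size, damping factor and iteration count are illustrative defaults, not values from the talk.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link every token to the distinct tokens that follow it within the window."""
    graph = defaultdict(set)
    for i, token in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != token:
                graph[token].add(tokens[j])
                graph[tokens[j]].add(token)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour passes on an equal share of its current rank.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


tokens = "wiki links connect wiki pages and wiki users".split()
scores = pagerank(cooccurrence_graph(tokens))
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{word}: {score:.3f}")
```

Words that cooccur with many different neighbours collect the most rank, so they surface as anchor candidates without any link information.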

41 Results of Anchor Discovery
Name      Anchor Selection  Anchor Ranking      Info Type
GPNM      Titles            Length              Title
ICLM      Anchor Phrases    Anchor Strength     Link
UKP       Tokens            Cooccurrence Graph  Text
Baseline  Noun Phrase       First               Text
Baseline  Token             Random              Text
Result values (garbled in the source): 1% 1% 1% 6%; % 6%

42 Anchor Discovery Results for ICLM

43 Recall of Target Discovery

44 Recall of Target Discovery with 5 Suggestions
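
The result plots are not part of this transcript. As a sketch of how recall with a fixed number of suggestions can be computed, assuming that a gold target counts as found when it appears among the top k suggestions (k = 5 on this slide), with invented example data:

```python
def recall_at_k(suggested, gold, k=5):
    """Fraction of gold targets that appear among the first k suggestions."""
    gold = set(gold)
    if not gold:
        return 0.0
    top_k = set(suggested[:k])
    return len(top_k & gold) / len(gold)


# Invented example: ranked suggestions for one anchor and its gold targets.
suggested = ["Bank", "Bank_(geography)", "River", "Money", "Loan", "Bat"]
gold = {"Bank_(geography)", "Bat"}
print(recall_at_k(suggested, gold, k=5))
```

Here only one of the two gold targets falls inside the first five suggestions, so raising k to 6 would raise the recall.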

45 Future Work. User-based evaluation of the proposed links instead of Wikipedia links as Gold Standard. Inverted user interaction: discover semantically related documents as recommended link targets; then discover an anchor, or link via "PAGE is related to this document". Linking to external documents (outside the Wiki), very useful in eLearning scenarios.

46 Interdisciplinary Collaborations. Now that we have NLP-enhanced technology for wikis, investigate its usefulness and acceptance by users. Fabian Tamin: Creating Wiki Page Overview Snippets. Diploma Thesis, Computer Science Department, Technische Universität Darmstadt. Cooperation with Prof. Dr. Nina Keith (Dept. of Psychology): Bachelor Thesis by Christopher Schwarz, "Do overview snippets improve information search in the World Wide Web?", Institut für Psychologie, Technische Universität Darmstadt.

47 Acknowledgements
