Wikulu: Information Management in Wikis Enhanced by Language Technologies

1 Wikulu: Information Management in Wikis Enhanced by Language Technologies. Iryna Gurevych (joint work with Dr. Torsten Zesch, Daniel Bär and Nico Erbs)

2 UKP Lab: Projects. Educational Natural Language Processing; Natural Language Processing and Wikis (Dr. Torsten Zesch); Semantic Information Management (Dr. György Szarvas); Language Technology for eHumanities (Richard Eckart, Dipl. Inf.); Statistical Semantics (Jun. Prof. Chris Biemann)

3 Wikis Everywhere

4 The Wiki Paradigm

5 NLP4Wiki & Wiki4NLP. Wiki: Wiki4NLP, NLP4Wiki. Natural Language Processing: Information Extraction, Information Retrieval, Link Detection, Named Entity Recognition, Question Answering, Summarization, Text Categorization

6 JWPL & JWKTL: Using Wikipedia and Wiktionary. JWPL: Articles, Links, Redirects, Paragraphs, Categories, Tables. JWKTL: Definitions, Examples, Synonyms, Hypernyms, Hyponyms, Word class

7 Talk Outline: Introduction to Wikulu; Project goals; Demonstration; Link Discovery; Anchor Discovery; Target Discovery; Experiments and outlook

8 Why is NLP so important to Wikis? In the beginning: adding and finding is easy; flexible and open. Then people add lots of content: "I can't find anything!?" "Where do I put this?"

9 Wiki User Complaints: Searching, Organizing. Source: M. Buffa: Intranet Wikis. Proceedings of the IntraWebs Workshop 2006 at the 15th International World Wide Web Conference.

10 Nice Use Case for Language Technologies. Wikulu: Hawaiian for "organize" [kukulu] "fast" [wiki]. Natural Language Processing supports adding, organizing and finding

11 Types of User Interactions. Adding content: detect duplicates; suggest appropriate points for insertion. Organizing content: suggest intra-wiki links; suggest tags; suggest page split/merge. Finding content: semantic search; show related pages

12 Relevant Research Issues: foundational research to adapt and integrate natural language processing methods in wikis; create an intuitive interface to support users with their day-to-day tasks in a wiki; support multiple wiki platforms; information security; and more

13 The Wikulu Architecture: Traditional Wiki vs. Wikulu Approach

14 Wikulu Architecture in Detail

15 Talk Outline: Introduction to Wikulu; Project goals; Demonstration; Link Discovery; Anchor Discovery; Target Discovery; Experiments and outlook

16 What is Link Discovery about? Before growth: everything is known, finding targets is easy. After growth: where to link to?

17 Task Definition. Document collection. Source document: "UKP Lab develops natural language processing techniques for automatically understanding written text and applies them to information management like information retrieval, question answering, and structuring information in Wikis." Target Document t1; Target Document t2

18 Encoding of Links in the Wiki Markup
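
The slide names link encoding in wiki markup without showing the syntax in this transcript. As a minimal sketch, assuming MediaWiki-style internal links of the form `[[Target]]` or `[[Target|anchor text]]` (the pattern name `WIKI_LINK` and the function `extract_links` are illustrative, not part of Wikulu), anchor and target pairs could be read from markup like this:

```python
import re

# Assumed syntax: MediaWiki-style internal links, [[Target]] or [[Target|anchor]].
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def extract_links(markup):
    """Return (anchor, target) pairs for every internal link found in the markup."""
    links = []
    for match in WIKI_LINK.finditer(markup):
        target = match.group(1).strip()
        # Without a pipe, the target title doubles as the anchor text.
        anchor = (match.group(2) or target).strip()
        links.append((anchor, target))
    return links

sample = "The [[Bank_(geography)|bank]] of the river is far from the [[City of London]]."
for anchor, target in extract_links(sample):
    print(f"anchor={anchor!r} -> target={target!r}")
```

Other wiki platforms use different link syntax, so the pattern would need to be adapted per platform.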

19 Subtasks of Link Discovery

20 Bridging NLP and Link Discovery: Anchor discovery is similar to keyphrase extraction. Target discovery is similar to information retrieval. Use techniques from these fields!

21 Each Wiki is Different

22 Wiki Examples at INEX 2009. Wikipedia: general knowledge encyclopedia; dense link structure; 5000 orphans out of ~2.7 million documents. Te Ara: encyclopedia about New Zealand; no links; 438 articles (~3000 files); all articles should be linked; highly structured articles

23 Experimental Goals: design an algorithm that uses no prior knowledge of links or of the wiki page structure, so that it can be applied to any wiki and any unstructured document collection; compare it with two naive baselines and two systems that make use of prior knowledge about links; investigate the effect of the number of links used for training in link discovery

24 Anchor Discovery. Information types: Text, Title, Link. Selection of anchor candidates: Tokens, n-grams, Noun Phrases (text information); Document Titles (title information); Anchor Phrases (link information). Ranking: Length, tf.idf, Cooccurrence Graph, Anchor Probability, Anchor Strength. Output: top-ranked anchor candidates

25 Identifying Anchor Candidates. No links: Document title, Section titles, N-Grams, Noun phrases. Existing links: Anchor phrases
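
For the link-free case, n-gram candidates can be enumerated directly from the text. The following is a minimal sketch under simple assumptions (whitespace tokenization, lowercasing, a hypothetical helper name `ngram_candidates`); the talk does not specify the tokenizer or the maximum n-gram length that was actually used.

```python
def ngram_candidates(text, max_n=3):
    """Collect all word n-grams up to length max_n as anchor candidates."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    candidates = set()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            candidates.add(" ".join(tokens[start:start + n]))
    return candidates

for candidate in sorted(ngram_candidates("natural language processing in wikis", max_n=2)):
    print(candidate)
```

Noun-phrase candidates would additionally need a part-of-speech tagger or chunker, which this sketch leaves out.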

31 Anchor Candidate Ranking. Without links: term frequency and specificity measure, e.g. TF.IDF. Using links: Keyphraseness (Mihalcea & Csomai, 2007); Maximum Association Strength
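
The link-free ranking can be illustrated with a small tf.idf computation. This is a sketch only: the smoothed idf variant and the function name `tfidf_rank` are assumptions chosen for illustration, not the weighting used in the experiments.

```python
import math
from collections import Counter

def tfidf_rank(doc_tokens, collection):
    """Rank the terms of one document by tf.idf against a collection of token lists."""
    n_docs = len(collection)
    term_freq = Counter(doc_tokens)
    scored = []
    for term, freq in term_freq.items():
        doc_freq = sum(1 for other in collection if term in other)
        # Assumed smoothing: terms that occur in every document get an idf of 0.
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        scored.append((term, freq * idf))
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

collection = [
    ["wiki", "link", "wiki"],
    ["wiki", "page"],
    ["wiki", "text"],
]
for term, score in tfidf_rank(collection[0], collection):
    print(f"{term}: {score:.3f}")
```

Frequent but unspecific terms such as "wiki" in this toy collection drop to the bottom, which is the behaviour a specificity measure is meant to provide.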

32 Keyphraseness. Example: bank. bank used as anchor in 1760 documents; bank occurs in documents
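
Keyphraseness in the sense of Mihalcea & Csomai (2007) is the ratio of the number of documents in which a phrase is used as a link anchor to the number of documents in which it occurs at all. A minimal sketch follows; the anchor count 1760 is from the slide, while the occurrence count did not survive in this transcript, so the value 10_000 below is a purely hypothetical placeholder.

```python
def keyphraseness(docs_as_anchor, docs_containing):
    """Share of the documents containing a phrase in which it is used as an anchor."""
    if docs_containing <= 0:
        raise ValueError("docs_containing must be positive")
    return docs_as_anchor / docs_containing

# 1760 comes from the slide; 10_000 is a hypothetical stand-in for the missing count.
score = keyphraseness(1760, 10_000)
print(f"keyphraseness('bank') = {score:.3f} (hypothetical denominator)")
```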

33 Example: bat. bat used in 1775 documents; targets: Bat (1682), Baseball bat (34), Cricket bat (33), ...
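
Reading the slide as "bat is used as an anchor in 1775 documents, and these are the targets it points to", the share of each target can be computed directly. This interpretation, and the name `target_shares`, are assumptions; the talk's exact definition of association or target strength is not given in the transcript.

```python
def target_shares(target_counts, total_uses):
    """Fraction of all anchor uses that point to each target."""
    return {target: count / total_uses for target, count in target_counts.items()}

# Counts from the slide; targets hidden behind the "..." are not listed here.
bat_targets = {"Bat": 1682, "Baseball bat": 34, "Cricket bat": 33}
shares = target_shares(bat_targets, total_uses=1775)
best_target = max(shares, key=shares.get)

for target, share in shares.items():
    print(f"{target}: {share:.1%}")
print("most likely target:", best_target)
```

The strongly skewed distribution suggests that always proposing the most frequent existing target is already a reasonable choice for this anchor.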

34 Target Discovery. Existing link targets: Geography of Paraguay, City of London, Bank_(geography), Bank

35 Target Discovery. Information types: Text, Title, Link. Input: top-ranked anchor candidates. Selection: Full Text Search, Full Text Search in Titles, Existing Targets. Ranking: Score of Full Text Search, Title Disambiguation, Score of Full Text Search in Titles, Target Strength
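
Searching anchor candidates in page titles can be sketched with a toy token-overlap score. This is not the search engine used in the talk: the Jaccard score, the removal of parenthesized disambiguation suffixes, and the names `normalize_title` and `rank_targets` are all assumptions for illustration.

```python
import re

def normalize_title(title):
    """Lowercase a page title and drop a trailing suffix such as '_(geography)'."""
    title = title.replace("_", " ")
    title = re.sub(r"\s*\([^)]*\)$", "", title)
    return title.lower().strip()

def rank_targets(anchor, titles):
    """Rank page titles by token overlap (Jaccard) with the anchor text."""
    anchor_tokens = set(anchor.lower().split())
    scored = []
    for title in titles:
        title_tokens = set(normalize_title(title).split())
        overlap = len(anchor_tokens & title_tokens)
        if overlap:
            score = overlap / len(anchor_tokens | title_tokens)
            scored.append((score, title))
    # Best score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]

titles = ["Geography of Paraguay", "City of London", "Bank_(geography)", "Bank"]
print(rank_targets("bank", titles))
```

Both "Bank" and "Bank_(geography)" match the anchor equally well here, which is exactly the ambiguity that a title disambiguation step or a link-based target strength would have to resolve.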

36 Our Experiments

37 Experimental Data. Wikipedia snapshot from October 8, 2008 from INEX ,666,190 articles; 120,399,731 links. Eliminated all links consisting of stopwords and ordinal numbers only. Subset of 6,665 randomly selected articles with over 500,000 user-defined links used as the Gold Standard

38 Geva's Page Name Matching (Geva, 2007): GPNM system. Anchor selection: Titles. Anchor ranking: Length. Target selection: Existing Targets. Target ranking: Target Strength. (Each step is marked on the slide against the information types Text, Title, Link.)

39 Itakura & Clark (2007) Link Mining: ICLM system. Anchor selection: Anchor Phrases. Anchor ranking: Anchor Strength. Target selection: Existing Targets. Target ranking: Target Strength. (Each step is marked on the slide against the information types Text, Title, Link.)

40 Our Approach (no prior knowledge of links): UKP system. Anchor selection: Tokens, n-grams, Noun Phrases. Anchor ranking: Length, tf.idf, Cooccurrence Graph. Target selection: Full Text Search Engine. Target ranking: Full Text Search Score. (Each step is marked on the slide against the information types Text, Title, Link.)
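
The co-occurrence graph ranking named on the slide is not specified further in this transcript. As a simplified stand-in, the sketch below links tokens that appear within a small sliding window and ranks them by their number of distinct neighbours (degree centrality); a PageRank-style score over the same graph would be a natural refinement. The function name `cooccurrence_rank` and the window size are assumptions.

```python
from collections import defaultdict

def cooccurrence_rank(tokens, window=2):
    """Rank tokens by their number of distinct neighbours in a co-occurrence graph."""
    neighbours = defaultdict(set)
    for position, token in enumerate(tokens):
        # Connect the token with every different token inside the window to its right.
        for other in tokens[position + 1:position + 1 + window]:
            if other != token:
                neighbours[token].add(other)
                neighbours[other].add(token)
    # Most connected tokens first; ties are broken alphabetically.
    return sorted(neighbours, key=lambda token: (-len(neighbours[token]), token))

tokens = ["wiki", "link", "wiki", "page", "wiki", "text"]
print(cooccurrence_rank(tokens, window=1))
```

Because the graph is built from the text alone, this kind of ranking needs no existing links, which matches the stated goal of the approach.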

41 Results of Anchor Discovery
Name | Anchor Selection | Anchor Ranking | Info Type
GPNM | Titles | Length | Title
ICLM | Anchor Phrases | Anchor Strength | Link
UKP | Tokens | Cooccurrence Graph | Text
Baseline | Noun Phrase | First | Text
Baseline | Token | Random | Text
Chart values (row assignment not recoverable): 1% 1% 1% 6%; % 6%

42 Anchor Discovery Results for ICLM

43 Recall of Target Discovery

44 Recall of Target Discovery with 5 Suggestions
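
Recall with a fixed number of suggestions can be computed as the share of gold-standard targets that appear among the top k suggestions. The sketch below assumes this common definition and a hypothetical function name `recall_at_k`; the transcript does not state how recall was computed in the experiments.

```python
def recall_at_k(suggestions, gold_targets, k=5):
    """Fraction of gold targets that appear among the first k suggestions."""
    gold = set(gold_targets)
    if not gold:
        return 0.0
    found = set(suggestions[:k]) & gold
    return len(found) / len(gold)

suggested = ["Bank", "Bank_(geography)", "City of London", "River", "Money", "Bat"]
gold = ["Bank_(geography)", "Bat"]
print(recall_at_k(suggested, gold, k=5))
```

In this toy example only one of the two gold targets falls within the first five suggestions, so raising k to 6 would lift recall to 1.0.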

45 Future Work: user-based evaluation of the proposed links instead of Wikipedia links as Gold Standard; inverted user interaction: discover semantically related documents as recommended link targets, then discover an anchor, or link via "PAGE is related to this document"; linking to external documents (outside the wiki), very useful in e-learning scenarios

46 Interdisciplinary Collaborations. Now that we have NLP-enhanced technology for wikis, investigate its usefulness and acceptance by users. Fabian Tamin: Creating Wiki Page Overview Snippets. Diploma Thesis, Computer Science Department, Technische Universität Darmstadt. Cooperation with Prof. Dr. Nina Keith (Dept. of Psychology). Bachelor Thesis by Christopher Schwarz: Do overview snippets improve information search in the World Wide Web? Institut für Psychologie, Technische Universität Darmstadt.

47 Acknowledgements


More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Heading-aware Snippet Generation for Web Search

Heading-aware Snippet Generation for Web Search Heading-aware Snippet Generation for Web Search Tomohiro Manabe and Keishi Tajima Graduate School of Informatics, Kyoto Univ. {manabe@dl.kuis, tajima@i}.kyoto-u.ac.jp Web Search Result Snippets Are short

More information

WordNet-based User Profiles for Semantic Personalization

WordNet-based User Profiles for Semantic Personalization PIA 2005 Workshop on New Technologies for Personalized Information Access WordNet-based User Profiles for Semantic Personalization Giovanni Semeraro, Marco Degemmis, Pasquale Lops, Ignazio Palmisano LACAM

More information

The Emerging Web of Linked Data

The Emerging Web of Linked Data 4th Berlin Semantic Web Meetup 26. February 2010 The Emerging Web of Linked Data Prof. Dr. Christian Bizer Freie Universität Berlin Outline 1. From a Web of Documents to a Web of Data Web APIs and Linked

More information

WikiTranslate: Query Translation for Cross-Lingual Information Retrieval Using Only Wikipedia

WikiTranslate: Query Translation for Cross-Lingual Information Retrieval Using Only Wikipedia WikiTranslate: Query Translation for Cross-Lingual Information Retrieval Using Only Wikipedia Dong Nguyen, Arnold Overwijk, Claudia Hauff, Dolf R.B. Trieschnigg, Djoerd Hiemstra, and Franciska M.G. de

More information

Axiomatic Approaches to Information Retrieval - University of Delaware at TREC 2009 Million Query and Web Tracks

Axiomatic Approaches to Information Retrieval - University of Delaware at TREC 2009 Million Query and Web Tracks Axiomatic Approaches to Information Retrieval - University of Delaware at TREC 2009 Million Query and Web Tracks Wei Zheng Hui Fang Department of Electrical and Computer Engineering University of Delaware

More information

Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming

Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming Florian Boudin LINA - UMR CNRS 6241, Université de Nantes, France Keyphrase 2015 1 / 22 Errors made by

More information

Entity Linking at TAC Task Description

Entity Linking at TAC Task Description Entity Linking at TAC 2013 Task Description Version 1.0 of April 9, 2013 1 Introduction The main goal of the Knowledge Base Population (KBP) track at TAC 2013 is to promote research in and to evaluate

More information

IJMIE Volume 2, Issue 8 ISSN:

IJMIE Volume 2, Issue 8 ISSN: DISCOVERY OF ALIASES NAME FROM THE WEB N.Thilagavathy* T.Balakumaran** P.Ragu** R.Ranjith kumar** Abstract An individual is typically referred by numerous name aliases on the web. Accurate identification

More information

Exploiting DBpedia for Graph-based Entity Linking to Wikipedia

Exploiting DBpedia for Graph-based Entity Linking to Wikipedia Exploiting DBpedia for Graph-based Entity Linking to Wikipedia Master Thesis presented by Bernhard Schäfer submitted to the Research Group Data and Web Science Prof. Dr. Christian Bizer University Mannheim

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

A cocktail approach to the VideoCLEF 09 linking task

A cocktail approach to the VideoCLEF 09 linking task A cocktail approach to the VideoCLEF 09 linking task Stephan Raaijmakers Corné Versloot Joost de Wit TNO Information and Communication Technology Delft, The Netherlands {stephan.raaijmakers,corne.versloot,

More information

Best Practices for World-Class Search

Best Practices for World-Class Search Best Practices for World-Class Search MARY HOLSTEGE Distinguished Engineer, MarkLogic @mathling 4 June 2018 MARKLOGIC CORPORATION SLIDE: 2 4 June 2018 MARKLOGIC CORPORATION Search Application: Search for

More information

Creating a Testbed for the Evaluation of Automatically Generated Back-of-the-book Indexes

Creating a Testbed for the Evaluation of Automatically Generated Back-of-the-book Indexes Creating a Testbed for the Evaluation of Automatically Generated Back-of-the-book Indexes Andras Csomai and Rada Mihalcea University of North Texas Computer Science Department csomaia@unt.edu, rada@cs.unt.edu

More information

NATURAL LANGUAGE PROCESSING

NATURAL LANGUAGE PROCESSING NATURAL LANGUAGE PROCESSING LESSON 9 : SEMANTIC SIMILARITY OUTLINE Semantic Relations Semantic Similarity Levels Sense Level Word Level Text Level WordNet-based Similarity Methods Hybrid Methods Similarity

More information

DOCUMENT content comprehension is an important

DOCUMENT content comprehension is an important Proceedings of the International Multiconference on Computer Science and Information Technology pp. 265 272 ISBN 978-83-60810-22-4 ISSN 1896-7094 TermPedia for Interactive Document Enrichment Using Technical

More information

A Comparative Study Weighting Schemes for Double Scoring Technique

A Comparative Study Weighting Schemes for Double Scoring Technique , October 19-21, 2011, San Francisco, USA A Comparative Study Weighting Schemes for Double Scoring Technique Tanakorn Wichaiwong Member, IAENG and Chuleerat Jaruskulchai Abstract In XML-IR systems, the

More information

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD 10 Text Mining Munawar, PhD Definition Text mining also is known as Text Data Mining (TDM) and Knowledge Discovery in Textual Database (KDT).[1] A process of identifying novel information from a collection

More information

Ontology Extraction from Heterogeneous Documents

Ontology Extraction from Heterogeneous Documents Vol.3, Issue.2, March-April. 2013 pp-985-989 ISSN: 2249-6645 Ontology Extraction from Heterogeneous Documents Kirankumar Kataraki, 1 Sumana M 2 1 IV sem M.Tech/ Department of Information Science & Engg

More information

Chinese Named Entity Recognition and Disambiguation Based on Wikipedia

Chinese Named Entity Recognition and Disambiguation Based on Wikipedia Chinese Named Entity Recognition and Disambiguation Based on Wikipedia Yu Miao, Lv Yajuan, Liu Qun, Su Jinsong, and Xiong Hao Key Laboratory of Intelligent Information Processing, Institute of Computing

More information

Making Sense Out of the Web

Making Sense Out of the Web Making Sense Out of the Web Rada Mihalcea University of North Texas Department of Computer Science rada@cs.unt.edu Abstract. In the past few years, we have witnessed a tremendous growth of the World Wide

More information

INEX REPORT. Report on INEX 2010

INEX REPORT. Report on INEX 2010 INEX REPORT Report on INEX 2010 D. Alexander P. Arvola T. Beckers P. Bellot T. Chappell C.M. De Vries A. Doucet N. Fuhr S. Geva J. Kamps G. Kazai M. Koolen S. Kutty M. Landoni V. Moriceau R. Nayak R. Nordlie

More information

Automatic people tagging for expertise profiling in the enterprise

Automatic people tagging for expertise profiling in the enterprise Automatic people tagging for expertise profiling in the enterprise Pavel Serdyukov * (Yandex, Moscow, Russia) Mike Taylor, Vishwa Vinay, Matthew Richardson, Ryen White (Microsoft Research, Cambridge /

More information

Stanford-UBC at TAC-KBP

Stanford-UBC at TAC-KBP Stanford-UBC at TAC-KBP Eneko Agirre, Angel Chang, Dan Jurafsky, Christopher Manning, Valentin Spitkovsky, Eric Yeh Ixa NLP group, University of the Basque Country NLP group, Stanford University Outline

More information

Joseph Casamento. Christopher Struttmann. Faculty Advisor: Phillip Chan

Joseph Casamento. Christopher Struttmann. Faculty Advisor: Phillip Chan Joseph Casamento Christopher Struttmann Faculty Advisor: Phillip Chan About Us Chris Struttmann Florida Tech Senior (B.S. Software

More information

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining Masaharu Yoshioka Graduate School of Information Science and Technology, Hokkaido University

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information