Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios
|
|
- Jeffry Baker
- 6 years ago
- Views:
Transcription
1 Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids Marek Lipczak Arash Koushkestani Evangelos Milios
2 Problem definition The goal of Entity Recognition and Disambiguation (ERD) Identify mentions of entities Link the mentions to a relevant entry in an external knowledge base The knowledge base is typically a large subset of Wikipedia articles Example: The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip. 2
3 Recognition and Disambiguation The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip. Recognition Is this a valid mention of an entity present in the knowledge base? Disambiguation Which of the potential entities (senses) is correct? 3
4 Recognition and Disambiguation The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip. Recognition Is this a valid mention of an entity present in the knowledge base? Disambiguation Which of the potential entities (senses) is correct? Default sense the entity with a largest number of wiki-links with the mention as the anchor text Tulip focuses on default sense entities Main goal is to recognize whether the default sense is consistent with the document 4
5 Our background Visual Text Analytics Lab Some experience with using ERD systems No experience implementing ERD systems Key issue with state-of-the-art systems: obvious false positive mistakes Visualize Prof. Smith's research interests: Data Mining Machine Learning 50 cent Our goal: minimize the number of false positives 5
6 Tulip system overview Spotter Find all mentions of entities in the text (Solr Text Tagger) Special handling for personal names Recognizer Retrieve profiles of spotted entities (from Sunflower) Generate a topic centroid representing the document Select entities consistent with the document 6
7 Spotter Spotter Find all mentions of entities in the text (Solr Text Tagger) Special handling for personal names Recognizer Retrieve profiles of spotted entities (from Sunflower) Generate a topic centroid representing the document Select entities consistent with the document 7
8 Solr Text Tagger Solr (Lucene) is a text search engine Indexes textual documents Retrieve documents for keyword-based queries Solr Text Tagger Indexes entity surface forms stored in a lexicon E.g., Baltimore Ravens, Ravens, Baltimore ( ) Uses full text documents as queries Finds all entity mentions in the document Retrieves the mentioned entities (candidate selection) Implemented based on Solr's Finite State Transducers By David Smiley and Rupert Westenthaler (thanks!) 8
9 Building the lexicon Three sources of entity surface forms (external datasets) Entity names (from Freebase) Wiki-links anchor text (from Wikipedia) Web anchor text (from Google's Wikilinks corpus) 9
10 Building the lexicon Three sources of entity surface forms (external datasets) Entity names (from Freebase) Wiki-links anchor text (from Wikipedia) Web anchor text (from Google's Wikilinks corpus) Special handling of personal names Jack and London are not allowed as surface forms for Jack London Instead they are indexed as generic personal names and will be matched only if Jack London is mentioned by his full name 10
11 Building the lexicon Three sources of entity surface forms (external datasets) Entity names (from Freebase) Wiki-links anchor text (from Wikipedia) Web anchor text (from Google's Wikilinks corpus) Special handling of personal names Jack and London are not allowed as surface forms for Jack London Instead they are indexed as generic personal names and will be matched only if Jack London is mentioned by his full name Flagging suspicious surface forms (e.g., It - Stephen King's novel) stop-word filter marks all stop-words or phrases composed of stopwords (e.g., This is) Wiktionary filter marks all common nouns, verbs, adjectives, etc. found in Wiktionary lower-case filter marks all lower-case words or phrases 11
12 Spotter example The [1] (...) [97] selling offsets decent earnings from Cisco Systems [1] and Home Depot [1]. Techs fall (1) (...) [7], led by Microsoft [1] (...) [13] and Intel [1] (...) [9]. Michael Kors [1] rises. Gold (1) (...) [31] and oil slip. Default sense for all mentions (Freebase only) 12
13 Spotter example The [1] (...) [97] selling offsets decent earnings from Cisco Systems [1] and Home Depot [1]. Techs fall (1) (...) [7], led by Microsoft [1] (...) [13] and Intel [1] (...) [9]. Michael Kors [1] rises. Gold (1) (...) [31] and oil slip. Default sense for all mentions (Freebase only) Default sense for all mentions (Freebase + Wikpedia) 13
14 Spotter example The [1] (...) [97] selling offsets decent earnings from Cisco Systems [1] and Home Depot [1]. Techs fall (1) (...) [7], led by Microsoft [1] (...) [13] and Intel [1] (...) [9]. Michael Kors [1] rises. Gold (1) (...) [31] and oil slip. Default sense for all mentions (Freebase only) Default sense for all mentions (Freebase + Wikpedia) Suspicious mentions removed 14
15 Spotter example The [1] (...) [97] selling offsets decent earnings from Cisco Systems [1] and Home Depot [1]. Techs fall (1) (...) [7], led by Microsoft [1] (...) [13] and Intel [1] (...) [9]. Michael Kors [1] rises. Gold (1) (...) [31] and oil slip. Default sense for all mentions (Freebase only) Default sense for all mentions (Freebase + Wikpedia) Suspicious mentions removed How can we remove Michael Kors and bring back Home Depot? Relatedness of entities to the document 15
16 Recognizer Spotter Find all mentions of entities in the text (Solr Text Tagger) Special handling for personal names Recognizer Retrieve profiles of spotted entities (from Sunflower) Generate a topic centroid representing the document Select entities consistent with the document 16
17 Relatedness score The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip. How strongly or are related to the document? Our solution Retrieve a profile of every entity mentioned in the text Agglomerate the profiles in a centroid representing the document Check which entities are coherent with the topics (relatedness score) 17
18 Relatedness score The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip. How strongly or are related to the document? Our solution Retrieve a profile of every entity mentioned in the text Agglomerate the profiles in a centroid representing the document Check which entities are coherent with the topics (relatedness score) How do we create the entity profiles? 18
19 Relatedness Sunflower A concept graph based on unified category graph from 120 Wikipedia language versions Each language version acts like a witness for the importance of stored relation Compact and accurate category profiles for all Wikipedia articles Removal of unimportant categories Inference of more general categories 19
20 Sunflower from graph to term profile Sunflower graph is: Directed Weighted (importance score) Sparse (only k most important links per node) Category-based profile is a sparse, weighted term vector All categories at distance < d Term weights based on edge weights E.g., k = 3, d = 2 Path weight is the product of edge weights w(intel Comp. of US Ec. of US) = 0.42*0.27 = 0.11 Category weight is the sum of path weights w(ec. of US) = =
21 Topic centroids in Tulip Retrieve category-based profiles for all default senses (example next slide) 21
22 22
23 Topic centroids in Tulip Retrieve category-based profiles for all default senses (example next slide) Topic Centroid Generation Centroid is a linear combination of entity profiles Default senses of non-suspicious mentions only (entity core) 23
24 Topic centroids in Tulip Retrieve category-based profiles for all default senses (example next slide) Topic Centroid Generation Centroid is a linear combination of entity profiles Default senses of non-suspicious mentions only (entity core) Topic Centroid Refinement Entities far from the centroid are removed from the core Cosine similarity with predefined threshold t coh =0.2 24
25 Topic centroids in Tulip Retrieve category-based profiles for all default senses (example next slide) Topic Centroid Generation Centroid is a linear combination of entity profiles Default senses of non-suspicious mentions only (entity core) Topic Centroid Refinement Entities far from the centroid are removed from the core Cosine similarity with predefined threshold t coh =0.2 Entity Scoring Relatedness score assigned to each default sense entity (including suspicious mentions) 25
26 Topic centroids in Tulip Retrieve category-based profiles for all default senses (example next slide) Topic Centroid Generation Centroid is a linear combination of entity profiles Default senses of non-suspicious mentions only (entity core) Topic Centroid Refinement Entities far from the centroid are removed from the core Cosine similarity with predefined threshold t coh =0.2 Entity Scoring Relatedness score assigned to each default sense entity (including suspicious mentions) System output Entities with score > t coh Entity with best relatedness score for each mention 26
27 Challenge results Tulip got second place in the long track Category-based topic centroids promising solution for relatedness Top recall among all submitted systems (?!) Lowest latency among all submitted systems 27
28 Lightweight ERD Entity Recognition and Disambiguation is typically just a single step in a more complex document processing system To be practical the ERD system has to be lightweight: Fast lowest latency among all competing systems, over 200 documents per minute Adaptable both Solr Text Tagger and Sunflower can be easily adapted to changing data Compact the full system requires less than 4 GB of operational memory and uses no external data repositories 28
29 Lightweight ERD Entity Recognition and Disambiguation is typically just a single step in a more complex document processing system To be practical the ERD system has to be lightweight: Fast lowest latency among all competing systems, throughput of over 200 documents per minute Adaptable both Solr Text Tagger and Sunflower can be easily adapted to changing data Compact the full system requires less than 4 GB of operational memory and uses no external data repositories 29
30 The importance of default sense Analysis on 50 documents with ground-truth data (1166 entities) 85% of mentions that can be disambiguated, should be disambiguated with default sense Another 5% is explicitly disambiguated with another mention in the document (e.g., E72 and Nokia E72) Focusing on default sense Tulip missed < 5% of entities 30
31 The importance of default sense Analysis on 50 documents with ground-truth data (1166 entities) 85% of mentions that can be disambiguated, should be disambiguated with default sense Another 5% is explicitly disambiguated with another mention in the document (e.g., E72 and Nokia E72) Focusing on default sense Tulip missed < 5% of entities 31
32 The importance of default sense Analysis on 50 documents with ground-truth data (1166 entities) 85% of mentions that can be disambiguated, should be disambiguated with default sense Another 5% is explicitly disambiguated with another mention in the document (e.g., E72 and Nokia E72) Focusing on default sense Tulip missed < 5% of entities 32
33 Conclusions Wikipedia-based category profiles can be used to determine the relatedness of an entity to the topics of a document Small size of category profiles allows the system to represent the aggregated topics of the document in form of a centroid, which simplifies the recognition process The pruning of suspicious mentions and focus on the default sense entities helps Tulip to build precise document centroids that can be further used to clean or expand the set of returned entities The accuracy of extracted entities relies more on the successful recognition of correct entity mentions rather than their disambiguation Project website: 33
34 Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids Marek Lipczak Arash Koushkestani Evangelos Milios
35 Solr Text Tagger Two level Finite State Transducers approach Word to index (each edge is a letter) Surface form to list of entities (each edge is a word) 35
Exploiting Semantic Analysis of Documents for the Domain User
Exploiting Semantic Analysis of Documents for the Domain User Evangelos Milios Dalhousie University Closing Conference: Statistical and Computational Analytics for Big Data Dalhousie University June 12,
More informationA Multilingual Entity Linker Using PageRank and Semantic Graphs
A Multilingual Entity Linker Using PageRank and Semantic Graphs Anton Södergren Department of computer science Lund University, Lund karl.a.j.sodergren@gmail.com Pierre Nugues Department of computer science
More informationUnstructured Data. CS102 Winter 2019
Winter 2019 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for patterns in data
More informationUnderstanding the Query: THCIB and THUIS at NTCIR-10 Intent Task. Junjun Wang 2013/4/22
Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task Junjun Wang 2013/4/22 Outline Introduction Related Word System Overview Subtopic Candidate Mining Subtopic Ranking Results and Discussion
More informationAugust 2012 Daejeon, South Korea
Building a Web of Linked Entities (Part I: Overview) Pablo N. Mendes Free University of Berlin August 2012 Daejeon, South Korea Outline Part I A Web of Linked Entities Challenges Progress towards solutions
More informationNLP Final Project Fall 2015, Due Friday, December 18
NLP Final Project Fall 2015, Due Friday, December 18 For the final project, everyone is required to do some sentiment classification and then choose one of the other three types of projects: annotation,
More informationWikulu: Information Management in Wikis Enhanced by Language Technologies
Wikulu: Information Management in Wikis Enhanced by Language Technologies Iryna Gurevych (this is joint work with Dr. Torsten Zesch, Daniel Bär and Nico Erbs) 1 UKP Lab: Projects UKP Lab Educational Natural
More informationSEMANTIC INDEXING (ENTITY LINKING)
Анализа текста и екстракција информација SEMANTIC INDEXING (ENTITY LINKING) Jelena Jovanović Email: jeljov@gmail.com Web: http://jelenajovanovic.net OVERVIEW Main concepts Named Entity Recognition Semantic
More informationIs Brad Pitt Related to Backstreet Boys? Exploring Related Entities
Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, and Gabriela Vulcu Unit for Natural Language Processing Insight-centre, National University
More informationNATURAL LANGUAGE PROCESSING
NATURAL LANGUAGE PROCESSING LESSON 9 : SEMANTIC SIMILARITY OUTLINE Semantic Relations Semantic Similarity Levels Sense Level Word Level Text Level WordNet-based Similarity Methods Hybrid Methods Similarity
More informationPapers for comprehensive viva-voce
Papers for comprehensive viva-voce Priya Radhakrishnan Advisor : Dr. Vasudeva Varma Search and Information Extraction Lab, International Institute of Information Technology, Gachibowli, Hyderabad, India
More informationProceedings of NTCIR-9 Workshop Meeting, December 6-9, 2011, Tokyo, Japan
Overview of the NTCIR-9 Crosslink Task: Cross-lingual Link Discovery Ling-Xiang Tang 1, Shlomo Geva 1, Andrew Trotman 2, Yue Xu 1, Kelly Y. Itakura 1 1 Faculty of Science and Technology, Queensland University
More informationA new methodology for gene normalization using a mix of taggers, global alignment matching and document similarity disambiguation
A new methodology for gene normalization using a mix of taggers, global alignment matching and document similarity disambiguation Mariana Neves 1, Monica Chagoyen 1, José M Carazo 1, Alberto Pascual-Montano
More informationLinking Entities in Chinese Queries to Knowledge Graph
Linking Entities in Chinese Queries to Knowledge Graph Jun Li 1, Jinxian Pan 2, Chen Ye 1, Yong Huang 1, Danlu Wen 1, and Zhichun Wang 1(B) 1 Beijing Normal University, Beijing, China zcwang@bnu.edu.cn
More informationNERD workshop. Luca ALMAnaCH - Inria Paris. Berlin, 18/09/2017
NERD workshop Luca Foppiano @ ALMAnaCH - Inria Paris Berlin, 18/09/2017 Agenda Introducing the (N)ERD service NERD REST API Usages and use cases Entities Rigid textual expressions corresponding to certain
More informationCIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets
CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,
More informationSemantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.
Semantic Web Company PoolParty - Server PoolParty - Technical White Paper http://www.poolparty.biz Table of Contents Introduction... 3 PoolParty Technical Overview... 3 PoolParty Components Overview...
More informationEntity Linking at TAC Task Description
Entity Linking at TAC 2013 Task Description Version 1.0 of April 9, 2013 1 Introduction The main goal of the Knowledge Base Population (KBP) track at TAC 2013 is to promote research in and to evaluate
More informationReview on Text Mining
Review on Text Mining Aarushi Rai #1, Aarush Gupta *2, Jabanjalin Hilda J. #3 #1 School of Computer Science and Engineering, VIT University, Tamil Nadu - India #2 School of Computer Science and Engineering,
More informationJianyong Wang Department of Computer Science and Technology Tsinghua University
Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity
More informationA fully-automatic approach to answer geographic queries: GIRSA-WP at GikiP
A fully-automatic approach to answer geographic queries: at GikiP Johannes Leveling Sven Hartrumpf Intelligent Information and Communication Systems (IICS) University of Hagen (FernUniversität in Hagen)
More informationA cocktail approach to the VideoCLEF 09 linking task
A cocktail approach to the VideoCLEF 09 linking task Stephan Raaijmakers Corné Versloot Joost de Wit TNO Information and Communication Technology Delft, The Netherlands {stephan.raaijmakers,corne.versloot,
More informationText Mining. Representation of Text Documents
Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,
More informationPrecise Medication Extraction using Agile Text Mining
Precise Medication Extraction using Agile Text Mining Chaitanya Shivade *, James Cormack, David Milward * The Ohio State University, Columbus, Ohio, USA Linguamatics Ltd, Cambridge, UK shivade@cse.ohio-state.edu,
More informationI Know Your Name: Named Entity Recognition and Structural Parsing
I Know Your Name: Named Entity Recognition and Structural Parsing David Philipson and Nikil Viswanathan {pdavid2, nikil}@stanford.edu CS224N Fall 2011 Introduction In this project, we explore a Maximum
More informationResPubliQA 2010
SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first
More informationSigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Erich Schubert, Michael Weiler, Hans-Peter Kriegel! Institute of Informatics Database Systems Group
More informationLODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.0 July 24, 2015
LODtagger Guide for Users and Developers Bahar Sateli René Witte Release 1.0 July 24, 2015 Semantic Software Lab Concordia University Montréal, Canada http://www.semanticsoftware.info Contents 1 LODtagger
More informationPresented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu
Presented by: Dimitri Galmanovich Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu 1 When looking for Unstructured data 2 Millions of such queries every day
More informationMulti-Stage Rocchio Classification for Large-scale Multilabeled
Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale
More informationLODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.1 October 7, 2016
LODtagger Guide for Users and Developers Bahar Sateli René Witte Release 1.1 October 7, 2016 Semantic Software Lab Concordia University Montréal, Canada http://www.semanticsoftware.info Contents 1 LODtagger
More informationWebSAIL Wikifier at ERD 2014
WebSAIL Wikifier at ERD 2014 Thanapon Noraset, Chandra Sekhar Bhagavatula, Doug Downey Department of Electrical Engineering & Computer Science, Northwestern University {nor.thanapon, csbhagav}@u.northwestern.edu,ddowney@eecs.northwestern.edu
More informationMaster Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala
Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue
More informationAnnotating Spatio-Temporal Information in Documents
Annotating Spatio-Temporal Information in Documents Jannik Strötgen University of Heidelberg Institute of Computer Science Database Systems Research Group http://dbs.ifi.uni-heidelberg.de stroetgen@uni-hd.de
More informationUnderstanding the Query: THCIB and THUIS at NTCIR-10 Intent Task
Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task Yunqing Xia 1 and Sen Na 2 1 Tsinghua University 2 Canon Information Technology (Beijing) Co. Ltd. Before we start Who are we? THUIS is
More informationProf. Ahmet Süerdem Istanbul Bilgi University London School of Economics
Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics Media Intelligence Business intelligence (BI) Uses data mining techniques and tools for the transformation of raw data into meaningful
More informationRe-contextualization and contextual Entity exploration. Sebastian Holzki
Re-contextualization and contextual Entity exploration Sebastian Holzki Sebastian Holzki June 7, 2016 1 Authors: Joonseok Lee, Ariel Fuxman, Bo Zhao, and Yuanhua Lv - PAPER PRESENTATION - LEVERAGING KNOWLEDGE
More informationThe Billion Object Platform (BOP): a system to lower barriers to support big, streaming, spatio-temporal data sources
FOSS4G 2017 Boston The Billion Object Platform (BOP): a system to lower barriers to support big, streaming, spatio-temporal data sources Devika Kakkar and Ben Lewis Harvard Center for Geographic Analysis
More informationConceptual and Logical Design
Conceptual and Logical Design Lecture 3 (Part 1) Akhtar Ali Building Conceptual Data Model To build a conceptual data model of the data requirements of the enterprise. Model comprises entity types, relationship
More informationCross Lingual Question Answering using CINDI_QA for 2007
Cross Lingual Question Answering using CINDI_QA for QA@CLEF 2007 Chedid Haddad, Bipin C. Desai Department of Computer Science and Software Engineering Concordia University 1455 De Maisonneuve Blvd. W.
More informationELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators
ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators Pablo Ruiz, Thierry Poibeau and Frédérique Mélanie Laboratoire LATTICE CNRS, École Normale Supérieure, U Paris 3 Sorbonne Nouvelle
More informationIRCE at the NTCIR-12 IMine-2 Task
IRCE at the NTCIR-12 IMine-2 Task Ximei Song University of Tsukuba songximei@slis.tsukuba.ac.jp Yuka Egusa National Institute for Educational Policy Research yuka@nier.go.jp Masao Takaku University of
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Beyond Bag of Words Bag of Words a document is considered to be an unordered collection of words with no relationships Extending
More informationMIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion
MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad
More informationSemantically Driven Snippet Selection for Supporting Focused Web Searches
Semantically Driven Snippet Selection for Supporting Focused Web Searches IRAKLIS VARLAMIS Harokopio University of Athens Department of Informatics and Telematics, 89, Harokopou Street, 176 71, Athens,
More informationParsing tree matching based question answering
Parsing tree matching based question answering Ping Chen Dept. of Computer and Math Sciences University of Houston-Downtown chenp@uhd.edu Wei Ding Dept. of Computer Science University of Massachusetts
More informationChrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO
Chrome based Keyword Visualizer (under sparse text constraint) SANGHO SUH MOONSHIK KANG HOONHEE CHO INDEX Proposal Recap Implementation Evaluation Future Works Proposal Recap Keyword Visualizer (chrome
More informationTIB AV-Portal. Margret Plank 19th of January 2015 TACC Meeting
TIB AV-Portal Margret Plank 19th of January 2015 TACC Meeting German National Library of Science and Technology (TIB) German National Library of Science and Technology for all areas of engineering as well
More informationImproving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar Youngstown State University Bonita Sharif, Youngstown State University
Improving Stack Overflow Tag Prediction Using Eye Tracking Alina Lazar, Youngstown State University Bonita Sharif, Youngstown State University Jenna Wise, Youngstown State University Alyssa Pawluk, Youngstown
More informationLearning Semantic Entity Representations with Knowledge Graph and Deep Neural Networks and its Application to Named Entity Disambiguation
Learning Semantic Entity Representations with Knowledge Graph and Deep Neural Networks and its Application to Named Entity Disambiguation Hongzhao Huang 1 and Larry Heck 2 Computer Science Department,
More informationData Mining Concepts & Tasks
Data Mining Concepts & Tasks Duen Horng (Polo) Chau Georgia Tech CSE6242 / CX4242 Sept 9, 2014 Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos Last Time
More informationPlagiarism Detection Using FP-Growth Algorithm
Northeastern University NLP Project Report Plagiarism Detection Using FP-Growth Algorithm Varun Nandu (nandu.v@husky.neu.edu) Suraj Nair (nair.sur@husky.neu.edu) Supervised by Dr. Lu Wang December 10,
More informationScienceDirect. Enhanced Associative Classification of XML Documents Supported by Semantic Concepts
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 194 201 International Conference on Information and Communication Technologies (ICICT 2014) Enhanced Associative
More informationMining Wikipedia for Large-scale Repositories
Mining Wikipedia for Large-scale Repositories of Context-Sensitive Entailment Rules Milen Kouylekov 1, Yashar Mehdad 1;2, Matteo Negri 1 FBK-Irst 1, University of Trento 2 Trento, Italy [kouylekov,mehdad,negri]@fbk.eu
More information@Note2 tutorial. Hugo Costa Ruben Rodrigues Miguel Rocha
@Note2 tutorial Hugo Costa (hcosta@silicolife.com) Ruben Rodrigues (pg25227@alunos.uminho.pt) Miguel Rocha (mrocha@di.uminho.pt) 23-01-2018 The document presents a typical workflow using @Note2 platform
More informationFinal report - CS224W: Using Wikipedia pages traffic data to infer pertinence of links
Final report - CS224W: Using Wikipedia pages traffic data to infer pertinence of links Marc Thibault; SUNetID number: 06227968 Mael Trean; SUNetIDnumber: 06228438 I - Introduction/Motivation/Problem Definition
More informationExam Marco Kuhlmann. This exam consists of three parts:
TDDE09, 729A27 Natural Language Processing (2017) Exam 2017-03-13 Marco Kuhlmann This exam consists of three parts: 1. Part A consists of 5 items, each worth 3 points. These items test your understanding
More informationGhent University-IBCN Participation in TAC-KBP 2015 Cold Start Slot Filling task
Ghent University-IBCN Participation in TAC-KBP 2015 Cold Start Slot Filling task Lucas Sterckx, Thomas Demeester, Johannes Deleu, Chris Develder Ghent University - iminds Gaston Crommenlaan 8 Ghent, Belgium
More informationEntity Linking. David Soares Batista. November 11, Disciplina de Recuperação de Informação, Instituto Superior Técnico
David Soares Batista Disciplina de Recuperação de Informação, Instituto Superior Técnico November 11, 2011 Motivation Entity-Linking is the process of associating an entity mentioned in a text to an entry,
More informationDocument Retrieval using Predication Similarity
Document Retrieval using Predication Similarity Kalpa Gunaratna 1 Kno.e.sis Center, Wright State University, Dayton, OH 45435 USA kalpa@knoesis.org Abstract. Document retrieval has been an important research
More informationFinding Related Entities by Retrieving Relations: UIUC at TREC 2009 Entity Track
Finding Related Entities by Retrieving Relations: UIUC at TREC 2009 Entity Track V.G.Vinod Vydiswaran, Kavita Ganesan, Yuanhua Lv, Jing He, ChengXiang Zhai Department of Computer Science University of
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London What Is Text Clustering? Text Clustering = Grouping a set of documents into classes of similar
More informationGraph-Based Concept Clustering for Web Search Results
International Journal of Electrical and Computer Engineering (IJECE) Vol. 5, No. 6, December 2015, pp. 1536~1544 ISSN: 2088-8708 1536 Graph-Based Concept Clustering for Web Search Results Supakpong Jinarat*,
More informationString Vector based KNN for Text Categorization
458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research
More informationToday s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan
Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics
More informationSemantic Label and Structure Model based Approach for Entity Recognition in Database Context
Semantic Label and Structure Model based Approach for Entity Recognition in Database Context Nihel Kooli, Abdel Belaïd To cite this version: Nihel Kooli, Abdel Belaïd. Semantic Label and Structure Model
More informationTowards Summarizing the Web of Entities
Towards Summarizing the Web of Entities contributors: August 15, 2012 Thomas Hofmann Director of Engineering Search Ads Quality Zurich, Google Switzerland thofmann@google.com Enrique Alfonseca Yasemin
More informationA Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2
A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,
More informationOpen Data Integration. Renée J. Miller
Open Data Integration Renée J. Miller miller@northeastern.edu !2 Open Data Principles Timely & Comprehensive Accessible and Usable Complete - All public data is made available. Public data is data that
More informationGet First Page in One Month. How I ranked my blog in Google Page 1 in a month
Get First Page in One Month How I ranked my blog in Google Page 1 in a month 2015 Dipendra Pokharel, DipIncome.com Contents Background and Introduction(This is where I have introduced myself and shared
More informationOverview of the INEX 2009 Link the Wiki Track
Overview of the INEX 2009 Link the Wiki Track Wei Che (Darren) Huang 1, Shlomo Geva 2 and Andrew Trotman 3 Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia 1,
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationA Semantic Based Search Engine for Open Architecture Requirements Documents
Calhoun: The NPS Institutional Archive Reports and Technical Reports All Technical Reports Collection 2008-04-01 A Semantic Based Search Engine for Open Architecture Requirements Documents Craig Martell
More informationWikiOnto: A System For Semi-automatic Extraction And Modeling Of Ontologies Using Wikipedia XML Corpus
2009 IEEE International Conference on Semantic Computing WikiOnto: A System For Semi-automatic Extraction And Modeling Of Ontologies Using Wikipedia XML Corpus Lalindra De Silva University of Colombo School
More informationQuery Difficulty Prediction for Contextual Image Retrieval
Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.
More informationA Framework for Summarization of Multi-topic Web Sites
A Framework for Summarization of Multi-topic Web Sites Yongzheng Zhang Nur Zincir-Heywood Evangelos Milios Technical Report CS-2008-02 March 19, 2008 Faculty of Computer Science 6050 University Ave., Halifax,
More informationNamed Entity Detection and Entity Linking in the Context of Semantic Web
[1/52] Concordia Seminar - December 2012 Named Entity Detection and in the Context of Semantic Web Exploring the ambiguity question. Eric Charton, Ph.D. [2/52] Concordia Seminar - December 2012 Challenge
More informationQuery Expansion using Wikipedia and DBpedia
Query Expansion using Wikipedia and DBpedia Nitish Aggarwal and Paul Buitelaar Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway firstname.lastname@deri.org
More informationQuestion Answering Systems
Question Answering Systems An Introduction Potsdam, Germany, 14 July 2011 Saeedeh Momtazi Information Systems Group Outline 2 1 Introduction Outline 2 1 Introduction 2 History Outline 2 1 Introduction
More informationDocument Structure Analysis in Associative Patent Retrieval
Document Structure Analysis in Associative Patent Retrieval Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media Studies University of Tsukuba 1-2 Kasuga, Tsukuba, 305-8550,
More informationObject Purpose Based Grasping
Object Purpose Based Grasping Song Cao, Jijie Zhao Abstract Objects often have multiple purposes, and the way humans grasp a certain object may vary based on the different intended purposes. To enable
More informationA multi-step attack-correlation method with privacy protection
A multi-step attack-correlation method with privacy protection Research paper A multi-step attack-correlation method with privacy protection ZHANG Yongtang 1, 2, LUO Xianlu 1, LUO Haibo 1 1. Department
More informationBuilding Multilingual Resources and Neural Models for Word Sense Disambiguation. Alessandro Raganato March 15th, 2018
Building Multilingual Resources and Neural Models for Word Sense Disambiguation Alessandro Raganato March 15th, 2018 About me alessandro.raganato@helsinki.fi http://wwwusers.di.uniroma1.it/~raganato ERC
More informationCrawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server
Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document
More informationA Scheme of Automated Object and Facet Extraction for Faceted Search over XML Data
IDEAS 2014 A Scheme of Automated Object and Facet Extraction for Faceted Search over XML Data Takahiro Komamizu, Toshiyuki Amagasa, Hiroyuki Kitagawa University of Tsukuba Background Introduction Background
More informationA Discriminatively Trained, Multiscale, Deformable Part Model
A Discriminatively Trained, Multiscale, Deformable Part Model by Pedro Felzenszwalb, David McAllester, and Deva Ramanan CS381V Visual Recognition - Paper Presentation Slide credit: Duan Tran Slide credit:
More informationGIR experiements with Forostar at GeoCLEF 2007
GIR experiements with Forostar at GeoCLEF 2007 Simon Overell 1, João Magalhães 1 and Stefan Rüger 2,1 1 Multimedia & Information Systems Department of Computing, Imperial College London, SW7 2AZ, UK 2
More informationVisualization and text mining of patent and non-patent data
of patent and non-patent data Anton Heijs Information Solutions Delft, The Netherlands http://www.treparel.com/ ICIC conference, Nice, France, 2008 Outline Introduction Applications on patent and non-patent
More informationQuality-Aware Entity-Level Semantic Representations for Short Texts
Quality-Aware Entity-Level Semantic Representations for Short Texts Wen Hua #, Kai Zheng #, Xiaofang Zhou # # School of ITEE, The University of Queensland, Brisbane, Australia School of Computer Science
More informationSemantic text features from small world graphs
Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK
More informationCMU System for Entity Discovery and Linking at TAC-KBP 2016
CMU System for Entity Discovery and Linking at TAC-KBP 2016 Xuezhe Ma, Nicolas Fauceglia, Yiu-chang Lin, and Eduard Hovy Language Technologies Institute Carnegie Mellon University 5000 Forbes Ave, Pittsburgh,
More informationRecommender Systems: Practical Aspects, Case Studies. Radek Pelánek
Recommender Systems: Practical Aspects, Case Studies Radek Pelánek 2017 This Lecture practical aspects : attacks, context, shared accounts,... case studies, illustrations of application illustration of
More informationKAIFIA: Knowledge Assisted Intelligent Framework for Information Access
KAIFIA: Knowledge Assisted Intelligent Framework for Information Access Chattun Lallah Intelligent Media Systems and Services The University of Reading http://www.imss.reading.ac.uk c.lallah@reading.ac.uk
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationChapter 4: Text Clustering
4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can
More informationApache Solr Learning to Rank FTW!
Apache Solr Learning to Rank FTW! Berlin Buzzwords 2017 June 12, 2017 Diego Ceccarelli Software Engineer, News Search dceccarelli4@bloomberg.net Michael Nilsson Software Engineer, Unified Search mnilsson23@bloomberg.net
More informationInformation Retrieval
Information Retrieval WS 2016 / 2017 Lecture 2, Tuesday October 25 th, 2016 (Ranking, Evaluation) Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University
More informationTokenization and Sentence Segmentation. Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017
Tokenization and Sentence Segmentation Yan Shao Department of Linguistics and Philology, Uppsala University 29 March 2017 Outline 1 Tokenization Introduction Exercise Evaluation Summary 2 Sentence segmentation
More informationSequence clustering. Introduction. Clustering basics. Hierarchical clustering
Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering
More informationLet s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed
Let s get parsing! SpaCy default model includes tagger, parser and entity recognizer nlp = spacy.load('en ) tells spacy to use "en" with ["tagger", "parser", "ner"] Each component processes the Doc object,
More information