Ghent University-IBCN Participation in TAC-KBP 2015 Cold Start Slot Filling task
Lucas Sterckx, Thomas Demeester, Johannes Deleu, Chris Develder
Ghent University - iMinds
Gaston Crommenlaan 8, Ghent, Belgium
{firstname.lastname@intec.ugent.be}

Abstract

This paper presents the system of the UGENT-IBCN team for the TAC KBP 2015 cold start (slot filling variant) task. This was the team's second participation. The slot filling system uses distant supervision to generate training data, combined with feature labeling and semi-supervision, and two different types of classifiers. We show that the noise reduction step significantly improves precision, and propose an application of word embeddings for slot filling.

1 Introduction

Relation extraction is a vital step in information extraction. The task has received attention in TAC KBP for a number of years as the Slot Filling track, and as a vital sub-task of the recently introduced Cold Start track. As this is our first participation in the Knowledge Base Population Slot Filling and Cold Start tracks, our system builds on previous work by other teams, in particular the systems described in (?) and (?), which use facts from external databases to generate training data, a technique known as distant supervision. Distant supervision has become an effective way to generate training data for the slot filling task, as demonstrated by last year's top submission (?). The contribution of our system is twofold: (1) we add a noise reduction step that boosts the precision of the distant supervision step, and (2) we explore the use of word embeddings (?) in a relation detection classifier.

Figure 1: System Overview (components: entity query, query expansion, candidate-document retrieval over the TAC 2015 CS source corpus, binary SVM and binary LR classifiers trained with distant + partial supervision on FreeBase and GigaWord 2014, high-precision patterns, ensemble, answers).

In the following section, we give an overview of the system and describe its different components.
A more elaborate discussion of the training with distant supervision is given in Section 3. Section 4 discusses adaptations of the system for the cold start task. Finally, results and a brief conclusion are given in Sections 5 and 6.

2 System Overview

Figure 1 shows an overview of the slot filling system. Interactions between the components and the different data sources are visualized by arrows. In this section we discuss the parts of the system that act at run time to generate slot fillers.

2.1 Query Expansion and Document Retrieval

The first step is the retrieval of all documents containing the entity query (person or organization) from the TAC 2014 source document collection. We expand the query by including all alternate names obtained from Freebase and Wiki-redirects for the given query. When we do not retrieve any alternate names, we clean the query, e.g., remove middle initials for persons and remove company suffixes (LLC, Inc., Corp.) for organizations, and repeat the search for alternate names using this filtered query. For indexing and search of the source collection we use Whoosh.

2.2 Relation Classifiers

In each retrieved document we identify relevant sentences by checking whether any of the entities from the expanded set of query entity names is present. Note that we did not use co-reference resolution to increase recall, as suggested in earlier work (?; ?). Next, we assign a type (e.g., title, state-or-province) to all slot candidates from the relevant sentences. Slot candidates are extracted using the Stanford 7-class Named Entity Recognizer (?) and assigned a type using lists of known candidates for each type. For each combination of an extracted candidate with the query entity, we classify whether a type-matching relation from the TAC schema holds. We trained four different sets of classifiers, i.e., two sets of binary Support Vector Machines (SVMs) and two multiclass classifiers, resulting in four different runs submitted for the slot filling task. Only the first classifier was used for a run in the cold start task.

2.2.1 Binary SVMs

41 different binary SVMs are used to detect the presence or absence of each relation in a sentence, for the query entity and a possible slot filler. All binary SVMs trained for the different relations use the same set of features: a combination of dependency tree features, token sequence features, entity features, semantic features and an order feature. We extract these features using the Stanford CoreNLP tools (?). This corresponds to the features used in (?).
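The per-relation setup can be sketched as follows. This is a minimal illustration, not the actual system: scikit-learn's DictVectorizer and LinearSVC are our assumed implementation (the paper does not name one), and the feature dicts are toy stand-ins for the Table 1 features.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy instances for one relation (per:title): each (query entity, candidate)
# pair in a sentence is described by sparse features in the spirit of Table 1.
train = [
    ({"wbo": "chief financial officer", "et12": "PERSON-ORGANIZATION", "mtitle": 1}, 1),
    ({"wbo": "visited",                 "et12": "PERSON-ORGANIZATION", "mtitle": 0}, 0),
    ({"wbo": "executive of",            "et12": "PERSON-ORGANIZATION", "mtitle": 1}, 1),
    ({"wbo": "sued",                    "et12": "PERSON-ORGANIZATION", "mtitle": 0}, 0),
]

vec = DictVectorizer()
X = vec.fit_transform([f for f, _ in train])
y = [label for _, label in train]

# One binary SVM per relation; only the per:title classifier is shown here.
per_title_svm = LinearSVC(C=1.0).fit(X, y)

candidate = vec.transform([{"wbo": "chief financial officer",
                            "et12": "PERSON-ORGANIZATION", "mtitle": 1}])
print(per_title_svm.predict(candidate)[0])  # 1
```

In the full system, 41 such classifiers are trained and each candidate pair is scored by every classifier whose relation matches the candidate's type.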
A complete overview of the features used is given in Table 1, which illustrates them for the example relation tuple <Ray Young, General Motors> and the sentence "Ray Young, the chief financial officer of General Motors, said GM could not bail out Delphi", the same example sentence as used in (?).

Table 2: The features used to train the multiclass classifiers.

name     description
wbo      words in between the two entities
bm       two words in front of the first entity
am       two words after the last entity
et1,et2  the entity types of the two entities
ntw      number of words in wbo

The two sets of binary SVM classifiers differ in their training data. Classifiers for the second run use the original output of the distant supervision step, while the first set is trained on the training examples after a noise reduction step. Section 3 describes this training data in more detail.

2.2.2 Multiclass Convolutional Neural Network

We experiment with word embeddings to see whether relation classification can benefit from their use. We implement a single Convolutional Neural Network (CNN) that functions as a multiclass classifier, and compare its performance with that of a logistic regression classifier using the same set of features but no word embeddings (discussed in the next section). The CNN uses only a subset of the features used by the SVMs, as shown in Table 2. The network is trained on the cleaned training data, which we introduce later on. We use the SENNA embeddings (?) for the lookup tables. The length-varying WBO features are modeled by a convolutional layer combined with a max layer. A schematic representation of our network is shown in Figure 2.

Figure 2: Schematic representation of the CNN classifier.
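The convolutional layer plus max layer over the length-varying WBO features can be sketched in a few lines of numpy. The embedding table, filter count and window size below are illustrative stand-ins, not the actual SENNA or network dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim, n_filters, window = 50, 100, 3             # illustrative sizes
embeddings = rng.normal(size=(1000, emb_dim))       # stand-in for the SENNA lookup table
W = rng.normal(size=(n_filters, window * emb_dim))  # convolution filters
b = np.zeros(n_filters)

def wbo_feature(word_ids):
    """Turn a variable-length words-in-between sequence into a fixed-size vector."""
    x = embeddings[word_ids]                        # (seq_len, emb_dim)
    if len(x) < window:                             # pad short sequences
        x = np.vstack([x, np.zeros((window - len(x), emb_dim))])
    # Slide the convolution window over the sequence...
    windows = np.stack([x[i:i + window].ravel()
                        for i in range(len(x) - window + 1)])
    conv = np.tanh(windows @ W.T + b)               # (n_windows, n_filters)
    return conv.max(axis=0)                         # ...and max-pool over time

print(wbo_feature([4, 8, 15, 16, 23]).shape)  # (100,)
```

The max over time yields a fixed-size vector regardless of how many words lie between the two entities, which is what makes the WBO feature usable alongside the fixed-width features of Table 2.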
2.2.3 Multiclass Logistic Regression

The final classifier is a multiclass logistic regression classifier. It uses the same features as the CNN classifier (see Table 2) and the same training data, and is constructed to test the impact of the word embeddings on the classification results. The classifier consists of a single logistic regression layer, and its input is a concatenation of the features given in Table 2. The WBO features are represented as a bag of words, thereby ignoring word order. To reduce the number of dimensions, we do not use all word ids, but only the most representative words; representative power is measured by the information gain of the words for the classification task, obtained with Pearson's χ²-test.

Table 1: The features used to train the SVMs for each relation type.

Feature group     Feature   Description                                         Value
Dependency Tree   name      Shortest path connecting the two names in the       PERSON appos officer prep_of ORGANIZATION
                            dependency parse tree, coupled with the entity
                            types of the two names
                  e1dh      The head word for name one                          said
                  e2dh      The head word for name two                          officer
                  e12dh     Whether e1dh is the same as e2dh                    false
                  e1dw      The dependent word for name one                     officer
                  e2dw      The dependent word for name two                     nil
Token Sequence    tpattern  The middle token sequence pattern                   , the chief financial officer of
                  ntw       Number of words between the two names               6
                  wbf       First word in between                               ,
                  wbl       Last word in between                                of
                  wbo       Other words in between                              {the, chief, financial, officer}
                  bm1f      First word before the first name                    nil
                  bm1l      Second word before the first name                   nil
                  am2f      First word after the second name                    ,
                  am2l      Second word after the second name                   said
Entity Features   e1        String of name one                                  Ray Young
                  e2        String of name two                                  General Motors
                  e12       Conjunction of e1 and e2                            Ray Young General Motors
                  et1       Entity type of name one                             PERSON
                  et2       Entity type of name two                             ORGANIZATION
                  et12      Conjunction of et1 and et2                          PERSON ORGANIZATION
Semantic Feature  mtitle    Title in between                                    True
Order Feature     order     1 if name one comes before name two; 2 otherwise    1

2.3 Entity Linking

Finally, the slot fillers extracted from the different documents are combined in an entity linking step. We link the entities from different documents and combine the extracted relation tuples to obtain our final set of extracted relations.
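A minimal sketch of this combination step, assuming each classifier emits a confidence score per supporting sentence (the slot names and scores below are illustrative): evidence is summed per relation tuple, and single-valued slots keep only the top tuple.

```python
from collections import defaultdict

SINGLE_VALUED = {"per:city_of_birth"}  # slots that allow only one filler

def aggregate(extractions):
    """extractions: (query, slot, filler, score) tuples, one per supporting sentence."""
    evidence = defaultdict(float)
    for query, slot, filler, score in extractions:
        evidence[(query, slot, filler)] += score       # sum evidence over sentences
    answers = defaultdict(list)
    for (query, slot, filler), total in evidence.items():
        answers[(query, slot)].append((filler, total))
    final = {}
    for key, fillers in answers.items():
        fillers.sort(key=lambda f: -f[1])
        # single-valued slots keep only the tuple with the highest evidence
        final[key] = fillers[:1] if key[1] in SINGLE_VALUED else fillers
    return final

out = aggregate([
    ("Obama", "per:city_of_birth", "Honolulu", 0.9),
    ("Obama", "per:city_of_birth", "Chicago", 0.4),
    ("Obama", "per:city_of_birth", "Honolulu", 0.3),
])
print(out[("Obama", "per:city_of_birth")][0][0])  # Honolulu
```

Summing rather than taking the maximum rewards tuples that are expressed in several independent sentences, which is what the evidence score described below captures.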
The output of this step consists of a list of all possible relation tuples if the relation can have multiple fillers, e.g., per:cities_of_residence. If only one relation instance is allowed, e.g., for per:city_of_birth, we choose the relation tuple with the highest evidence. The evidence score for each relation tuple is obtained by summing the evidence of all its relation instances, i.e., the sum of the evidence scores given by the classifier for each sentence that expresses the relation tuple.

3 Distant Supervision with Noise Reduction

Training data for the classifiers is generated using distant supervision. The left side of Figure 3 shows the different steps in the generation of the training data. We start by mapping FreeBase relations to KBP slots and subsequently search the full GigaWord corpus for possible mentions of these relations, i.e., two entities from a fact tuple co-occurring in a sentence. Negative examples are all co-occurring entity pairs that are not present in FreeBase.

Figure 3: Classifier Overview. Left: training (map FreeBase relations to KBP CS slots; search GigaWord for all possible relation mentions; generate positive and negative training examples; feature extraction (entity + surface + syntactic); label features and filter positive training examples; train classifier). Right: classification (tag and retrieve candidates in a candidate document (NER + table lookups); feature extraction (entity + surface + syntactic); classify candidates).

3.1 Noise Reduction

The training data obtained by distant supervision is noisy, i.e., not all extracted sentences that may indicate a given relationship actually express it. For example, "President Obama visited Honolulu" does not express the relationship per:city_of_birth, although President Obama was born in Honolulu. To address this problem, we add a noise reduction step for 14 frequently occurring relation types (limited to 14 due to time constraints). We let a team of students label distantly supervised instances per relation type with a true or false label. The fraction of instances labeled true for the different relation types is shown in Table 3; it strongly depends on the type of relation. To use these manually refined instances, we train a logistic regression (LR) classifier for each annotated relation and apply this classifier to the remaining examples of the distant supervision output. For relations with only a very small fraction of true examples (e.g., per:city_of_birth), we did not train an LR classifier, but instead filter the data using a list of trigger words that need to be present. The results in Table 4 show that this noise reduction step indeed results in a significant increase in precision.
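The filtering step can be sketched as follows, with scikit-learn assumed as the LR implementation and toy sentences standing in for the distantly supervised instances; the trigger-word fallback for relations with few true examples is shown at the end.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hand-labeled distant-supervision instances for one relation
# (per:city_of_birth style); the sentences are toy examples.
labeled = [("born in Honolulu in 1961", 1),
           ("was born in a small town", 1),
           ("visited Honolulu last week", 0),
           ("made a speech in Chicago", 0)]
vec = CountVectorizer()
X = vec.fit_transform([s for s, _ in labeled])
lr = LogisticRegression().fit(X, [y for _, y in labeled])

# Apply the learned filter to the remaining distant-supervision output:
# keep only instances the classifier accepts as truly expressing the relation.
unlabeled = ["the senator was born in Springfield",
             "the senator visited Springfield"]
keep = [s for s in unlabeled if lr.predict(vec.transform([s]))[0] == 1]
print(keep)

# Fallback for relations with very few true examples: trigger-word filtering.
TRIGGERS = {"born", "native"}
flagged = [s for s in unlabeled if TRIGGERS & set(s.lower().split())]
print(flagged)
```

In the actual system one such filter is trained per annotated relation type, and only the surviving instances are used to train the relation classifiers.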
Table 3: Fraction of instances labeled true for 14 relation types.

relation                              true fraction
per:title                             91.78%
per:employee_or_member_of             90.90%
per:origin                            86.48%
org:stateorprovince_of_headquarters   86.64%
org:top_members_employees             73.61%
per:age                               65.60%
per:charges                           58.97%
per:spouse                            56.10%
per:countries_of_residence            58.97%
per:city_of_death                     15.26%
org:founded_by                        12.89%
per:cities_of_residence                2.15%
per:city_of_birth                      1.29%
per:country_of_birth                   0.30%

3.2 Subsampling

The distant-supervision output contains far more negative instances than positive instances. We therefore subsample the negative instances for each relation, to obtain approximately 50% positive and 50% negative instances per relation.

4 Adaptations for the Cold Start Task

Our participation in the Cold Start task (slot filling variant) required only minimal modifications to the setup. We used only the SVM classifier with noise reduction for this task. The slot filling system and tagger were extended to handle relations involving geopolitical entities. A first run is performed on the provided single-slot queries, and a second run on the slots of the answers resulting from the first run.

5 Results

5.1 Slot Filling task

The results for the slot filling task are shown in Table 4. Our best run achieves the median score for this year's competition: we ranked 10th of the 18 participating teams and 3rd of the 6 new teams. Furthermore, when evaluating only on the retrieved fillers (ignoring document IDs), our F1 increased by 3.15%, corresponding to 9th place.

Table 4: Results (precision, recall, F1) of the four runs on the slot filling task, compared with the median score: b* SVM + NR, b* SVM, m* CNN + NR and m* LR + NR, where b* stands for binary, m* for multiclass, and NR for the noise reduction step.

Binary SVMs clearly perform better than the multiclass CNN and the multiclass LR. This large difference between approaches is mostly due to the lack of extra features in the multiclass CNN. Unfortunately, this makes it difficult to assess the impact of using a multiclass classifier versus multiple binary classifiers. Note that the multiclass classifiers use the same filtered training instances as the binary SVM with noise reduction.

The noise reduction is clearly beneficial for precision: with the filtered training data, the precision of the binary SVMs improves from 16.4% to 24.1%, while recall only drops from 16.3% to 15.9%. The comparison of the CNN with the LR classifier shows a slight increase in F1 from incorporating the word embeddings; however, both values are far below the F1 obtained with the binary SVMs. So far, we were unable to improve any results by using word embeddings. In future work we will test whether including additional features in the CNN improves these results.

5.2 Cold Start task

The results for the different hops of the slot filling variant of the Cold Start task are shown in Table 5. Precision and recall on the initial queries (Hop 0) are close to the performance on the regular slot filling task. In the second iteration (Hop 1) much recall is lost, since a considerable fraction of these queries was not generated after the first run. Our system achieved second place of the three teams participating in the slot filling variant.

Table 5: Results (precision, recall, F1) for Hop 0, Hop 1 and All Hops in the slot filling variant of the Cold Start task.

6 Conclusion

This paper described our first setup for the slot filling and cold start tasks, which achieved results close to the median performance. We conclude that the multiclass classifiers, using only lexical data, seriously underperformed the more elaborate binary SVMs, and that incorporating noise reduction of the training data obtained with distant supervision significantly increases the performance of the classifiers.
Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar
More informationCLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval
DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,
More informationA Korean Knowledge Extraction System for Enriching a KBox
A Korean Knowledge Extraction System for Enriching a KBox Sangha Nam, Eun-kyung Kim, Jiho Kim, Yoosung Jung, Kijong Han, Key-Sun Choi KAIST / The Republic of Korea {nam.sangha, kekeeo, hogajiho, wjd1004109,
More informationContents. Foreword to Second Edition. Acknowledgments About the Authors
Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1
More informationWikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population
Wikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population Heather Simpson 1, Stephanie Strassel 1, Robert Parker 1, Paul McNamee
More informationCRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools
CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools Wahed Hemati, Alexander Mehler, and Tolga Uslu Text Technology Lab, Goethe Universitt
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationEfficient Voting Prediction for Pairwise Multilabel Classification
Efficient Voting Prediction for Pairwise Multilabel Classification Eneldo Loza Mencía, Sang-Hyeun Park and Johannes Fürnkranz TU-Darmstadt - Knowledge Engineering Group Hochschulstr. 10 - Darmstadt - Germany
More informationVISION & LANGUAGE From Captions to Visual Concepts and Back
VISION & LANGUAGE From Captions to Visual Concepts and Back Brady Fowler & Kerry Jones Tuesday, February 28th 2017 CS 6501-004 VICENTE Agenda Problem Domain Object Detection Language Generation Sentence
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationNew York University 2014 Knowledge Base Population Systems
New York University 2014 Knowledge Base Population Systems Thien Huu Nguyen, Yifan He, Maria Pershina, Xiang Li, Ralph Grishman Computer Science Department New York University {thien, yhe, pershina, xiangli,
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data
More information15-780: Problem Set #2
15-780: Problem Set #2 February 19, 2014 1. Constraint satisfaction problem (CSP) [20pts] A common problem at universities is to schedule rooms for exams. The basic structure of this problem is to divide
More informationTools and Infrastructure for Supporting Enterprise Knowledge Graphs
Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Sumit Bhatia, Nidhi Rajshree, Anshu Jain, and Nitish Aggarwal IBM Research sumitbhatia@in.ibm.com, {nidhi.rajshree,anshu.n.jain}@us.ibm.com,nitish.aggarwal@ibm.com
More informationCLASSIFICATION JELENA JOVANOVIĆ. Web:
CLASSIFICATION JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is classification? Binary and multiclass classification Classification algorithms Naïve Bayes (NB) algorithm
More informationConceptual Design. The Entity-Relationship (ER) Model
Conceptual Design. The Entity-Relationship (ER) Model CS430/630 Lecture 12 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke Database Design Overview Conceptual design The Entity-Relationship
More informationApril 3, 2012 T.C. Havens
April 3, 2012 T.C. Havens Different training parameters MLP with different weights, number of layers/nodes, etc. Controls instability of classifiers (local minima) Similar strategies can be used to generate
More informationA CASE STUDY: Structure learning for Part-of-Speech Tagging. Danilo Croce WMR 2011/2012
A CAS STUDY: Structure learning for Part-of-Speech Tagging Danilo Croce WM 2011/2012 27 gennaio 2012 TASK definition One of the tasks of VALITA 2009 VALITA is an initiative devoted to the evaluation of
More information311 Predictions on Kaggle Austin Lee. Project Description
311 Predictions on Kaggle Austin Lee Project Description This project is an entry into the SeeClickFix contest on Kaggle. SeeClickFix is a system for reporting local civic issues on Open311. Each issue
More informationRich feature hierarchies for accurate object detection and semantic segmentation
Rich feature hierarchies for accurate object detection and semantic segmentation BY; ROSS GIRSHICK, JEFF DONAHUE, TREVOR DARRELL AND JITENDRA MALIK PRESENTER; MUHAMMAD OSAMA Object detection vs. classification
More informationCOSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor
COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality
More informationAT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands
AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands Svetlana Stoyanchev, Hyuckchul Jung, John Chen, Srinivas Bangalore AT&T Labs Research 1 AT&T Way Bedminster NJ 07921 {sveta,hjung,jchen,srini}@research.att.com
More informationUniversity of Amsterdam at INEX 2010: Ad hoc and Book Tracks
University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,
More informationQuestion Answering Approach Using a WordNet-based Answer Type Taxonomy
Question Answering Approach Using a WordNet-based Answer Type Taxonomy Seung-Hoon Na, In-Su Kang, Sang-Yool Lee, Jong-Hyeok Lee Department of Computer Science and Engineering, Electrical and Computer Engineering
More informationClassifying Depositional Environments in Satellite Images
Classifying Depositional Environments in Satellite Images Alex Miltenberger and Rayan Kanfar Department of Geophysics School of Earth, Energy, and Environmental Sciences Stanford University 1 Introduction
More informationFacial Expression Classification with Random Filters Feature Extraction
Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle
More informationKnowledge Based Systems Text Analysis
Knowledge Based Systems Text Analysis Dr. Shubhangi D.C 1, Ravikiran Mitte 2 1 H.O.D, 2 P.G.Student 1,2 Department of Computer Science and Engineering, PG Center Regional Office VTU, Kalaburagi, Karnataka
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationS-MART: Novel Tree-based Structured Learning Algorithms Applied to Tweet Entity Linking
S-MART: Novel Tree-based Structured Learning Algorithms Applied to Tweet Entity Linking Yi Yang * and Ming-Wei Chang # * Georgia Institute of Technology, Atlanta # Microsoft Research, Redmond Traditional
More informationBi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track
Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Jeffrey Dalton University of Massachusetts, Amherst jdalton@cs.umass.edu Laura
More information1 Document Classification [60 points]
CIS519: Applied Machine Learning Spring 2018 Homework 4 Handed Out: April 3 rd, 2018 Due: April 14 th, 2018, 11:59 PM 1 Document Classification [60 points] In this problem, you will implement several text
More informationA modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems
A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University
More informationEvaluating Classifiers
Evaluating Classifiers Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website) Evaluating Classifiers What we want: Classifier that best predicts
More information