Rhetorical Classification of Anchor Text for Citation Recommendation

Size: px
Start display at page:

Download "Rhetorical Classification of Anchor Text for Citation Recommendation"

Transcription

1 Rhetorical Classification of Anchor Text for Citation Recommendation Daniel Duma University of Edinburgh James Ravenscroft University of Warwick Maria Liakata University of Warwick Ewan Klein University of Edinburgh Amanda Clare Aberystwyth University CCS Concepts Computing methodologies Natural language processing; Discourse, dialogue and pragmatics; Information systems Information retrieval; Keywords Core Scientific Concepts; CoreSC; context based; citation recommendation; anchor text; incoming link contexts ABSTRACT Wouldn t it be helpful if your text editor automatically suggested papers that are contextually relevant to your work? We concern ourselves with this task: we desire to recommend contextually relevant citations to the author of a paper. A number of rhetorical annotation schemes for academic articles have been developed over the years, and it has often been suggested that they could find application in Information Retrieval scenarios such as this one. In this paper we investigate the usefulness for this task of CoreSC, a sentence-based, functional, scientific discourse annotation scheme (e.g. Hypothesis, Method, Result, etc.). We specifically apply this to anchor text, that is, the text surrounding a citation, which is an important source of data for building document representations. By annotating each sentence in every document with CoreSC and indexing them separately by sentence class, we aim to build a more useful vector-space representation of documents in our collection. Our results show consistent links between types of citing sentences and types of cited sentences in anchor text, which we argue can indeed be exploited to increase the relevance of recommendations. 1. INTRODUCTION Scientific papers follow a formal structure, and the language of academia requires clear argumentation [9]. This has Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). WOSP June, 2016, Newark, NJ, USA c 2016 Copyright held by the owner/author(s). ACM ISBN. DOI: led to the creation of classification schemes for the rhetorical and argumentative structure of scientific papers, of which two of the most prominent are Argumenative Zoning [19] and Core Scientific Concepts (CoreSC, [11]). The former focusses on the relation between current and previous work whereas the latter mostly on the content of a scientific investigation. These are among the first approaches to incorporate successful automatic classification of sentences in full scientific papers, using a supervised machine learning approach. It has often been suggested that these rhetorical schemes could be applied in information retrieval scenarios ([19], [12], [3]). Indeed, some experimental academic retrieval tools have tried applying them to different retrieval modes ([18], [14], [1]), and here we explore their potential application to a deeper integration with the writing process. Our aim is to make automatic citation recommendation as relevant as possible to the author s needs and to integrate it into the authoring workflow. Automatically recommending contextually relevant academic literature can help the author identify relevant previous work and find contrasting methods and results. In this work we specifically look at the domain of biomedical science, and examine the usefulness of CoreSC for this purpose. 2. PREVIOUS WORK The ever-increasing volume of scientific literature is a fact, and the need to navigate it a real one. This has brought much attention to the task of Context-Based Citation Recommendation (CBCR) over the last few years [6, 5, 3, 7]. The task consists in recommending relevant papers to be cited at a specific point in a draft scientific paper, and is universally framed as an information retrieval scenario. We need to recommend a citation for each citation placeholder: a special token inserted in the text of a draft paper where the citation should appear. In a standard IR approach, the corpus of potential papers to recommend (the document collection) is indexed for retrieval using a standard vector-space-model approach. Then, for each citation placeholder, the query is generated from the textual context around it (the citing context), and a similarity measure between the query and each document is then applied to rank the documents in the collection. A list of documents ranked by relevance is returned in reply to the query, so as to maximise the chance of picking the most useful paper to cite.

2 Figure 1: A high-level illustration of our approach. The class of the citing sentence is the query type and it determines a set of weights to apply to the classes of sentences in the anchor paragraphs of links to documents in our collection. In this example, for Bac, only 3 classes have non-zero weights: Bac, Met and Res. We show extracts from 3 different citing papers, exemplifying terms matching in different classes of sentences. The citing sentence is the sentence in which the prospective citation must appear. It determines the function of this citation and therefore provides information that can be exploited to increase the relevance of the suggested citations. As it is common practice, we evaluate our performance at this task by trying to recover the original citations found in papers that have already been published. Perhaps the seminal piece of work in this area is He et al. s [6] work, where they built an experimental citation recommendation system using the documents indexed by the CiteSeerX search engine as a test collection (over 450,000 documents), which was deployed as a testable system [8]. Recently, all metrics on this task and dataset were improved by applying multi-layered neural networks [7]. Other techniques have been applied to this task, such as collaborative filtering [2] and translation models [5], and other aspects of it have been explored, such as document representation [3] and context extraction [16]. 2.1 Incoming link contexts In order to make contextual suggestions as useful and relevant as possible, we argue here that we need to apply a measure of understanding to the text of the draft paper. Specifically, we hypothesize that there is a consistent relation between the type of citing sentence and the type of cited sentence. In this paper, we specifically target incoming link contexts, also known as anchor text in the information retrieval literature, which is text that occurs in the vicinity of a citation to a document. Incoming link contexts (henceforth ILCs) have previously been used to generate vector-space representations of documents for the purpose of context-based citation recommendation. The idea is intuitive: a citation to a paper is accompanied by text that often summarizes a key point in the cited paper, or its contribution to the field. It has been found experimentally that there is useful information in these ILCs that is not found in the cited paper itself [17], and using them exclusively to generate a document s representation has proven superior to using the contents of the actual document [3]. Typically these contexts are treated as a single bag-of-words, often simply concatenated. We propose a different approach here, where we separate the text in these contexts according to the type of sentence. All sentences of a same type from all ILCs to a same document are then indexed into the same field in a document in our index, allowing us to query by type of sentence in which the keywords appeared. Figure 1 illustrates our approach: the class of citing sentence is the query type, and for each query type we learn a set of weights to apply to finding the extracted keywords in different types of cited sentences in ILCs. Our approach is to apply existing rhetorical annotation schemes to classify sentences in citing documents and use this segmentation of the anchor text to a citation to increase the relevance of recommendations. For the task of recommending a citation for a given span of text, the ideal resource for classifying these spans would

3 Figure 2: Indexing and query generation for evaluation using the same corpus. We use a cut-off year of publication to create our document collection and our test set. Each document in the collection is indexed containing only text from Incoming Link Contexts (ILCs) citing it from other documents in the document collection. Text from all sentences of the same CoreSC class from all ILCs to this document are indexed into a single Lucene field. Citations to this document from the test set are then used to generate the queries to evaluate on, where the keywords are extracted from the citing context (1 sentence up, 1 down, including the citing sentence) and the query type is the class of the citing sentence. deal with the function of a citation within its argumentative context. While specific schemes for classifying the function of a citation have been developed (e.g. [20]), we are not aware of a scheme particularly tailored to our domain of biomedical science. Instead, we employ the CoresSC class of a citing sentence as a proxy for the function of all citations found inside it, which we have previously shown is a reasonable approach [4]. CoreSC takes the sentence as the minimum unit of annotation, continuing the standard approach to date, which we maintain in this work. 3. METHODOLOGY We label each sentence in our corpus according with CoreSC (see Table 1), which captures its rhetorical function in the document, and we aim to find whether there is a particular link between the class of citing sentence and the class of cited text, that is, the classes of sentences found in ILCs. As illustrated in Figure 2, we apply a cut-off date to separate our corpus into a large document collection and a smaller test set from which we will extract our queries for evaluation. We index each document in our document collection into a Lucene 1 index, creating a field in each document for each class of CoreSC (Hypothesis, Background, Method, etc.). We collect incoming-link contexts to all the documents in our document collection, that is, the potential documents to recommend, only from the document collection, excluding documents in our test set. We extract the paragraph where the incoming citation occurs as the ILC, keeping each sentence s label. All the text in sentences of that class from all ILCs to that document will be indexed into the same field. This allows us to apply different weights to the same keywords depending on the class of sentence they originally appeared in ILCs. Evaluation: In order to reduce purpose-specific annotation, we use the implicit judgements found in existing sci- 1

4 Category Hypothesis Motivation Background Goal Object-New Method-New Method-Old Experiment Model Observation Result Conclusion Description A statement not yet confirmed rather than a factual statement The reasons behind an investigation Generally accepted background knowledge and previous work A target state of the investigation where intended discoveries are made An entity which is a product or main theme of the investigation Means by which authors seek to achieve a goal of the investigation A method mentioned pertaining to previous work An experimental method A statement about a theoretical model or framework The data/phenomena recorded in an investigation Factual statements about the outputs of an investigation Statements inferred from observations & results relating to research hypothesis Table 1: CoreSC classes and their description. CoreSC is a content-focussed rhetorical annotation scheme developed and tested in the biomedical domain [11, 10]. Note that in this work we treat Method-Old and Method-New as a single category. entific publications as our ground truth. That is, we substitute all citations in the text of each paper in our test set with citation placeholders and make it our task to match each placeholder with the correct reference that was originally cited. We only consider resolvable citations, that is, citations to references that point to a paper that is in our collection, which means we have access to its metadata and full machine-readable contents. The task then becomes, for each citation placeholder, to: 1. extract its citing context, and from it the query terms (see Figure 2), and 2. attempt to retrieve the original paper cited in the context from the whole document collection We measure how well we did at our task by how far down the list of ranked retrieval results we find the original paper cited. We use two metrics to measure accuracy: Normalized Discounted Cumulative Gain (NDCG), a smooth discounting scheme over ranks, and top-1 accuracy, which is just the number of times the original paper was retrieved in the first position. Query extraction: For evaluation, the class of citing sentence becomes the query type, and for each type we apply a different set of per-field weights to each extracted term. We extract the context of the citation using a symmetric window of 3 sentences: 1 before the citation, the sentence containing the citation and 1 after. This is a frequently applied method [7] and is close to what has been assumed to be the optimal window of 2 sentences up, 2 down [13], while yielding fewer query terms and therefore allowing us more experimental freedom through faster queries. Similarity: We use the default Lucene similarity formula for assessing the similarity between a query and a document (Figure 3). score(q, d) = coord(q, d) tf(t d) idf(t) 2 norm(t, d) t q Figure 3: Default Lucene similarity formula In this formula, the coord term is an absolute multiplier of the number of terms in the query q found in the document d, tf is the absolute term frequency score of term t in document d, idf(t) is the inverse document score and norm is a normalization factor that divides the overall score by the length of document d. Note that all these quantities are per-field, not per-document. Technical implementation: We index the document collection using the Apache Lucene retrieval engine, specifically through the helpful interface provided by elasticsearch For each document, we create one field for each CoreSC class, and index into each field all the words from all sentences in the document that have been labelled with that class. The query is formed of all the terms in the citation s context that are not in a short list of stopwords. Lucene queries take the basic form field:term, where each combination of field and term form a unique term in the query. We want to match the set of extracted terms to all fields in the document, as each field represents one class of CoreSC. The default Lucene similarity formula (Figure 2) gives a boost to a term matching across multiple fields, which in our case would introduce spurious results. To avoid this, we employ DisjunctionMax queries, where only the top scoring result is evaluated out of a number of them. Having one query term for each of the classes of CoreSC for each distinct token (e.g. Bac: method, Goa: method, Hyp: method, etc.), only the one with the highest score will be evaluated as a match. Weight training: Testing all possible weight combinations is infeasible due to the combinatorial explosion, so we adopt the greedy heuristic of trying to maximise the objective function at each step. Our weight training algorithm can be summarized as hill climbing with restarts. For each fold, and for each citation type, we aim to find the best combination of weights to set on sentence classes that will maximise our metric, in this case the NDCG score that we compute by trying to recover the original citation. We keep the queries the same in structure and term content and we only change the weights applied to each field in a document to recommend. Each field, as explained above, contains only the terms from the sentences in the document of one CoreSC class. The weights are initialized at 1 and they move by 1, 6, and 2 in sequence, going through a minimum of 3 iterations. Each time a weight movement is applied, it is only kept if the score increases, otherwise the previous weight value is restored. 2

5 Figure 4: Weight values for the query types (types of citing sentences) that improved across all folds. The weight values for the 4 folds are shown, together with test scores and improvement over the baseline. These weights apply to text indexed from sentences in ILCs to the same document and the weight cells are shaded according to their value, darker is higher. In bold, citation types that consistently improve across folds. On the right-hand side are the scores obtained through testing and the percentage increase over the baseline, in which all weights were set to 1. *NDCG and Accuracy (top-1) are averaged scores over all citations in the test set for that fold. This simple algorithm is not guaranteed to find a globally optimal combination of parameters for the very complex function we are optimizing, but it is sufficient for our current objective. We aim to apply more robust parameter tuning techniques to learning the weights in future work. 4. EXPERIMENTS Our corpus is formed of one million papers from the PubMed Central Open Access collection 3. These papers are already provided in a clean, hand-authored XML format with a welldefined XML schema 4. For our experiments we used all papers published up to and including 2014 as our document collection ( 950k documents), and selected 1000 random papers published in or after 2015 as our test set. We treat the documents in the test set as our draft documents from which to extract the citations that we aim to recover and their citation contexts. We generate the queries from these contexts and the query type is the CoreSC class of the citing sentence. These are to our knowledge the largest experiments of this kind ever carried out with this corpus. We need to test whether our conditional weighting of text spans based on CoreSC classification is actually reflecting some underlying truth and is not just a random effect of the dataset. To this end, we employ 4-fold cross-validation, where we learn the weights for 3 folds and test their impact on one fold, and we report the averaged gains over each fold. The full source code employed to run these experiments and instructions on how to replicate them are available on GitHub 5. The automatically annotated corpus is currently available on request, and we aim to make it publicly available shortly. 5. RESULTS AND DISCUSSION Figure 4 shows the results for the 7 classes of citing sentences for which there was consistent improvement across all 4 folds, with a matrix of the best weight values that were found for each fold. On the right-hand side are the testing scores obtained for each fold and the percentage increase over the baseline, in which all weights are set to 1. For the remaining 4 classes (Experiment, Model, Motivation, Observation) the experiments failed to find consistent improvement, with wild variation across folds. As is to be expected, the citations are skewed in numbers towards some CoreSC classes. A majority of citations occur within sentences that were automatically labelled Back- 5

6 Figure 5: Citation network: links between query types and classes of cited sentences. On the left, the results presented here of CoreSC-labelled incoming link contexts. *On the right, a comparison with previous work (see [4]), where we explored the link between citing sentences and CoreSC-labelled contents of the cited document. The thickness of the lines represents the weight given to terms indexed from that class of cited sentence. ground, Methodology and Results, no doubt due to a pattern in the layout of the content of articles. This yields many more Bac, Met and Res citations to evaluate on, and for this reason we set a hard limit to the number of citations per CoreSC to 1000 in these experiments. A number of patterns are immediately evident from these initial results. For all query types, it seems to be almost universally useful to know that Background or Methodology sentences in a document s incoming link contexts match the query terms extracted from the citation context. The possibility exists that this is partly an effect of there being more sentences of type Background and Method in our collection. Similarly, it seems it is better to ignore other classes of sentences in the incoming link contexts of candidate papers, specifically Experiment, Hypothesis Motivation and Observation. Also notably, Conclusion seems to be relevant only to queries of type Goal, Hypothesis and Result. Even more notably, Goal and Object seem relevant to Goal queries, and exclusively to them. Note here that the fact that a weight combination was found where the best weight for a citing sentence class is 0 does not mean that including information from this CoreSC is not useful but rather that it is in fact detrimental, as eliminating it actually increased the average NDCG score. These are of course averaged results, and it is certain that the weights that we find are not optimal for each individual test case, only better on average. It is important to note that our evaluation pipeline necessarily consists of many steps, and encounters issues with XML conversion, matching of citations with references, matching of references in papers to references in the collection, etc., where each step in the pipeline introduces a measure of error that we have not estimated here. The one we can offer an estimate for is that of the automatic sentence classifier. The Sapienta classifier 6 we employ here has recently been independently evaluated on a different corpus from the orig- 6 inally annotated corpus used to train it. It yielded 51.9% accuracy over all eleven classes, improving on the 50.4% 9- fold cross-validation accuracy over its training corpus [15]. Further to this, we judge that the consistency of correlations we find confirms that what we can see in Figure 4 is not due to random noise, but rather hints at underlying patterns in the connections between scientific articles in the corpus. Figure 5 shows our results as a graph, with the per-class weights flowing from the class of citing sentence to the class of cited sentence. For this graph, we take a majority vote for the weights from Figure 4: if three folds agree and a fourth differs by a small value, we take this to be noise and use the majority value. If folds agree in two groups we average the values. We show a side-by-side comparison of these new results with our previous results where we indexed a document s actual contents instead of the incoming link contexts to it. We had previously proposed that there is an observable link between the class of citing sentence and the class of sentence in the cited document [4]. Now we find the same evidence for a link between the class of citing sentence and the class of sentence within incoming link contexts, so inside other documents citing a given document. There are both similarities and differences between the weights found for incoming link contexts and document text. Background and Method are almost as universally relevant for one as for the other, and Results equally as irrelevant for citing sentences of classes Conclusion and Goal. However, we also find that whereas sentences of type Observation found inside a document s text are useful (for Background, Object and Result), they are not when they are found inside incoming link contexts to that document. 6. CONCLUSION AND FUTURE WORK We have presented a novel application of CoreSC discourse function classification to context-based citation recommen-

7 dation, an information retrieval application. We have carried out experiments on the full PubMed Central Open Access Corpus and found strong indications of correlation between different classes of sentences in the Incoming Link Contexts of documents citing a single document. We also find that these relationships are not intuitively predictable and yet consistent. This suggests that there are gains to be reaped in a practical application of CoreSC to context-based citation recommendation. In future work we aim to evaluate this against more standard approaches, such as concatenating and indexing the anchor text and the document text together. 7. REFERENCES [1] M. Angrosh, S. Cranefield, and N. Stanger. Context identification of sentences in research articles: Towards developing intelligent tools for the research community. Natural Language Engineering, 19(04): , [2] C. Caragea, A. Silvescu, P. Mitra, and C. L. Giles. Can t see the forest for the trees?: a citation recommendation system. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries, pages ACM, [3] D. Duma and E. Klein. Citation resolution: A method for evaluating context-based citation recommendation systems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), page 358âĂŞ363, Baltimore, Maryland, USA, [4] D. Duma, M. Liakata, A. Clare, J. Ravenscroft, and E. Klein. Applying core scientific concepts to context-based citation recommendation. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference, [5] J. He, J.-Y. Nie, Y. Lu, and W. X. Zhao. Position-aligned translation model for citation recommendation. In String Processing and Information Retrieval, pages Springer, [6] Q. He, J. Pei, D. Kifer, P. Mitra, and L. Giles. Context-aware citation recommendation. In Proceedings of the 19th international conference on World wide web, pages ACM, [7] W. Huang, Z. Wu, C. Liang, P. Mitra, and C. L. Giles. A neural probabilistic model for context based citation recommendation. In AAAI 15 Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, [8] W. Huang, Z. Wu, P. Mitra, and C. L. Giles. Refseer: A citation recommendation system. In Digital Libraries (JCDL), 2014 IEEE/ACM Joint Conference on, pages IEEE, [9] K. Hyland. Academic discourse: English in a global context. Bloomsbury Publishing, [10] M. Liakata, S. Saha, S. Dobnik, C. Batchelor, and D. Rebholz-Schuhmann. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7): , [11] M. Liakata, S. Teufel, A. Siddharthan, and C. R. Batchelor. Corpora for the conceptualisation and zoning of scientific papers. In LREC, [12] P. I. Nakov, A. S. Schwartz, and M. Hearst. Citances: Citation sentences for semantic analysis of bioscience text. In Proceedings of the SIGIRâĂŹ04 workshop on Search and Discovery in Bioinformatics, pages 81 88, [13] V. Qazvinian and D. R. Radev. Identifying non-explicit citing sentences for citation-based summarization. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages Association for Computational Linguistics, [14] J. Ravenscroft, M. Liakata, and A. Clare. Partridge: An effective system for the automatic cassification of the types of academic papers. In Research and Development in Intelligent Systems XXX, pages Springer, [15] J. Ravenscroft, A. Oellrich, S. Saha, and M. Liakata. Multi-label annotation in scientific articles -the multi-label cancer risk assessment corpus. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference, [16] A. Ritchie. Citation context analysis for information retrieval. Technical report, University of Cambridge Computer Laboratory, [17] A. Ritchie, S. Teufel, and S. Robertson. Creating a test collection for citation-based ir experiments. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages Association for Computational Linguistics, [18] U. Schäfer and U. Kasterka. Scientific authoring support: A tool to navigate in typed citation graphs. In Proceedings of the NAACL HLT 2010 workshop on computational linguistics and writing: Writing processes and authoring aids, pages Association for Computational Linguistics, [19] S. Teufel. Argumentative zoning: Information extraction from scientific text. PhD thesis, Citeseer, [20] S. Teufel, A. Siddharthan, and D. Tidhar. Automatic classification of citation function. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages Association for Computational Linguistics, 2006.

SAPIENT Automation project

SAPIENT Automation project Dr Maria Liakata Leverhulme Trust Early Career fellow Department of Computer Science, Aberystwyth University Visitor at EBI, Cambridge mal@aber.ac.uk 25 May 2010, London Motivation SAPIENT Automation Project

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

Making Sense of Massive Amounts of Scientific Publications: the Scientific Knowledge Miner Project

Making Sense of Massive Amounts of Scientific Publications: the Scientific Knowledge Miner Project Making Sense of Massive Amounts of Scientific Publications: the Scientific Knowledge Miner Project Francesco Ronzano, Ana Freire, Diego Saez-Trumper and Horacio Saggion Department of Information and Communication

More information

Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications

Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications Amanda Clare 1,3, Samuel Croset 2,3 (croset@ebi.ac.uk), Christoph Grabmueller 2,3, Senay

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

NUS-I2R: Learning a Combined System for Entity Linking

NUS-I2R: Learning a Combined System for Entity Linking NUS-I2R: Learning a Combined System for Entity Linking Wei Zhang Yan Chuan Sim Jian Su Chew Lim Tan School of Computing National University of Singapore {z-wei, tancl} @comp.nus.edu.sg Institute for Infocomm

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

Context based Re-ranking of Web Documents (CReWD)

Context based Re-ranking of Web Documents (CReWD) Context based Re-ranking of Web Documents (CReWD) Arijit Banerjee, Jagadish Venkatraman Graduate Students, Department of Computer Science, Stanford University arijitb@stanford.edu, jagadish@stanford.edu}

More information

Detection and Extraction of Events from s

Detection and Extraction of Events from  s Detection and Extraction of Events from Emails Shashank Senapaty Department of Computer Science Stanford University, Stanford CA senapaty@cs.stanford.edu December 12, 2008 Abstract I build a system to

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Identification of Navigation Lead Candidates Using Citation and Co-Citation Analysis

Identification of Navigation Lead Candidates Using Citation and Co-Citation Analysis Identification of Navigation Lead Candidates Using Citation and Co-Citation Analysis Robert Moro, Mate Vangel, Maria Bielikova Slovak University of Technology in Bratislava Faculty of Informatics and Information

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK

QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK QANUS A GENERIC QUESTION-ANSWERING FRAMEWORK NG, Jun Ping National University of Singapore ngjp@nus.edu.sg 30 November 2009 The latest version of QANUS and this documentation can always be downloaded from

More information

Collaborative Filtering using a Spreading Activation Approach

Collaborative Filtering using a Spreading Activation Approach Collaborative Filtering using a Spreading Activation Approach Josephine Griffith *, Colm O Riordan *, Humphrey Sorensen ** * Department of Information Technology, NUI, Galway ** Computer Science Department,

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009 Maximizing the Value of STM Content through Semantic Enrichment Frank Stumpf December 1, 2009 What is Semantics and Semantic Processing? Content Knowledge Framework Technology Framework Search Text Images

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Martin Rajman, Pierre Andrews, María del Mar Pérez Almenta, and Florian Seydoux Artificial Intelligence

More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

Identifying Important Communications

Identifying Important Communications Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our

More information

GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS

GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS Dear Participant of the MScIS Program, If you have chosen to follow an internship, one of the requirements is to write a Thesis. This document gives you

More information

Dmesure: a readability platform for French as a foreign language

Dmesure: a readability platform for French as a foreign language Dmesure: a readability platform for French as a foreign language Thomas François 1, 2 and Hubert Naets 2 (1) Aspirant F.N.R.S. (2) CENTAL, Université Catholique de Louvain Presentation at CLIN 21 February

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Scholarly Big Data: Leverage for Science

Scholarly Big Data: Leverage for Science Scholarly Big Data: Leverage for Science C. Lee Giles The Pennsylvania State University University Park, PA, USA giles@ist.psu.edu http://clgiles.ist.psu.edu Funded in part by NSF, Allen Institute for

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information

Mining Association Rules in Temporal Document Collections

Mining Association Rules in Temporal Document Collections Mining Association Rules in Temporal Document Collections Kjetil Nørvåg, Trond Øivind Eriksen, and Kjell-Inge Skogstad Dept. of Computer and Information Science, NTNU 7491 Trondheim, Norway Abstract. In

More information

KNOW At The Social Book Search Lab 2016 Suggestion Track

KNOW At The Social Book Search Lab 2016 Suggestion Track KNOW At The Social Book Search Lab 2016 Suggestion Track Hermann Ziak and Roman Kern Know-Center GmbH Inffeldgasse 13 8010 Graz, Austria hziak, rkern@know-center.at Abstract. Within this work represents

More information

Personalized Web Search

Personalized Web Search Personalized Web Search Dhanraj Mavilodan (dhanrajm@stanford.edu), Kapil Jaisinghani (kjaising@stanford.edu), Radhika Bansal (radhika3@stanford.edu) Abstract: With the increase in the diversity of contents

More information

Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot

Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot Ruslan Salakhutdinov Word Sense Disambiguation Word sense disambiguation (WSD) is defined as the problem of computationally

More information

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands Svetlana Stoyanchev, Hyuckchul Jung, John Chen, Srinivas Bangalore AT&T Labs Research 1 AT&T Way Bedminster NJ 07921 {sveta,hjung,jchen,srini}@research.att.com

More information

Extracting Algorithms by Indexing and Mining Large Data Sets

Extracting Algorithms by Indexing and Mining Large Data Sets Extracting Algorithms by Indexing and Mining Large Data Sets Vinod Jadhav 1, Dr.Rekha Rathore 2 P.G. Student, Department of Computer Engineering, RKDF SOE Indore, University of RGPV, Bhopal, India Associate

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

SDMX self-learning package No. 5 Student book. Metadata Structure Definition

SDMX self-learning package No. 5 Student book. Metadata Structure Definition No. 5 Student book Metadata Structure Definition Produced by Eurostat, Directorate B: Statistical Methodologies and Tools Unit B-5: Statistical Information Technologies Last update of content December

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

Automatic Generation of Query Sessions using Text Segmentation

Automatic Generation of Query Sessions using Text Segmentation Automatic Generation of Query Sessions using Text Segmentation Debasis Ganguly, Johannes Leveling, and Gareth J.F. Jones CNGL, School of Computing, Dublin City University, Dublin-9, Ireland {dganguly,

More information

Where Should the Bugs Be Fixed?

Where Should the Bugs Be Fixed? Where Should the Bugs Be Fixed? More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports Presented by: Chandani Shrestha For CS 6704 class About the Paper and the Authors Publication

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

DCU at FIRE 2013: Cross-Language!ndian News Story Search

DCU at FIRE 2013: Cross-Language!ndian News Story Search DCU at FIRE 2013: Cross-Language!ndian News Story Search Piyush Arora, Jennifer Foster, and Gareth J. F. Jones CNGL Centre for Global Intelligent Content School of Computing, Dublin City University Glasnevin,

More information

Hierarchical Location and Topic Based Query Expansion

Hierarchical Location and Topic Based Query Expansion Hierarchical Location and Topic Based Query Expansion Shu Huang 1 Qiankun Zhao 2 Prasenjit Mitra 1 C. Lee Giles 1 Information Sciences and Technology 1 AOL Research Lab 2 Pennsylvania State University

More information

Query Reformulation for Clinical Decision Support Search

Query Reformulation for Clinical Decision Support Search Query Reformulation for Clinical Decision Support Search Luca Soldaini, Arman Cohan, Andrew Yates, Nazli Goharian, Ophir Frieder Information Retrieval Lab Computer Science Department Georgetown University

More information

Active Code Search: Incorporating User Feedback to Improve Code Search Relevance

Active Code Search: Incorporating User Feedback to Improve Code Search Relevance Active Code Search: Incorporating User Feedback to Improve Code Search Relevance Shaowei Wang, David Lo, and Lingxiao Jiang School of Information Systems, Singapore Management University {shaoweiwang.2010,davidlo,lxjiang}@smu.edu.sg

More information

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs

Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Tools and Infrastructure for Supporting Enterprise Knowledge Graphs Sumit Bhatia, Nidhi Rajshree, Anshu Jain, and Nitish Aggarwal IBM Research sumitbhatia@in.ibm.com, {nidhi.rajshree,anshu.n.jain}@us.ibm.com,nitish.aggarwal@ibm.com

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Academic Paper Recommendation Based on Heterogeneous Graph

Academic Paper Recommendation Based on Heterogeneous Graph Academic Paper Recommendation Based on Heterogeneous Graph Linlin Pan, Xinyu Dai, Shujian Huang, and Jiajun Chen National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023,

More information

Motivating Ontology-Driven Information Extraction

Motivating Ontology-Driven Information Extraction Motivating Ontology-Driven Information Extraction Burcu Yildiz 1 and Silvia Miksch 1, 2 1 Institute for Software Engineering and Interactive Systems, Vienna University of Technology, Vienna, Austria {yildiz,silvia}@

More information

MetaData for Database Mining

MetaData for Database Mining MetaData for Database Mining John Cleary, Geoffrey Holmes, Sally Jo Cunningham, and Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand. Abstract: At present, a machine

More information

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD 10 Text Mining Munawar, PhD Definition Text mining also is known as Text Data Mining (TDM) and Knowledge Discovery in Textual Database (KDT).[1] A process of identifying novel information from a collection

More information

Pattern Mining in Frequent Dynamic Subgraphs

Pattern Mining in Frequent Dynamic Subgraphs Pattern Mining in Frequent Dynamic Subgraphs Karsten M. Borgwardt, Hans-Peter Kriegel, Peter Wackersreuther Institute of Computer Science Ludwig-Maximilians-Universität Munich, Germany kb kriegel wackersr@dbs.ifi.lmu.de

More information

CL Scholar: The ACL Anthology Knowledge Graph Miner

CL Scholar: The ACL Anthology Knowledge Graph Miner CL Scholar: The ACL Anthology Knowledge Graph Miner Mayank Singh, Pradeep Dogga, Sohan Patro, Dhiraj Barnwal, Ritam Dutt, Rajarshi Haldar, Pawan Goyal and Animesh Mukherjee Department of Computer Science

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

Incompatibility Dimensions and Integration of Atomic Commit Protocols

Incompatibility Dimensions and Integration of Atomic Commit Protocols The International Arab Journal of Information Technology, Vol. 5, No. 4, October 2008 381 Incompatibility Dimensions and Integration of Atomic Commit Protocols Yousef Al-Houmaily Department of Computer

More information

Patent Classification Using Ontology-Based Patent Network Analysis

Patent Classification Using Ontology-Based Patent Network Analysis Association for Information Systems AIS Electronic Library (AISeL) PACIS 2010 Proceedings Pacific Asia Conference on Information Systems (PACIS) 2010 Patent Classification Using Ontology-Based Patent Network

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Chapter 6 Evaluation Metrics and Evaluation

Chapter 6 Evaluation Metrics and Evaluation Chapter 6 Evaluation Metrics and Evaluation The area of evaluation of information retrieval and natural language processing systems is complex. It will only be touched on in this chapter. First the scientific

More information

Information Search in Web Archives

Information Search in Web Archives Information Search in Web Archives Miguel Costa Advisor: Prof. Mário J. Silva Co-Advisor: Prof. Francisco Couto Department of Informatics, Faculty of Sciences, University of Lisbon PhD thesis defense,

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

Using Query History to Prune Query Results

Using Query History to Prune Query Results Using Query History to Prune Query Results Daniel Waegel Ursinus College Department of Computer Science dawaegel@gmail.com April Kontostathis Ursinus College Department of Computer Science akontostathis@ursinus.edu

More information

Making Sense Out of the Web

Making Sense Out of the Web Making Sense Out of the Web Rada Mihalcea University of North Texas Department of Computer Science rada@cs.unt.edu Abstract. In the past few years, we have witnessed a tremendous growth of the World Wide

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

doi: / _32

doi: / _32 doi: 10.1007/978-3-319-12823-8_32 Simple Document-by-Document Search Tool Fuwatto Search using Web API Masao Takaku 1 and Yuka Egusa 2 1 University of Tsukuba masao@slis.tsukuba.ac.jp 2 National Institute

More information

Clustering using Topic Models

Clustering using Topic Models Clustering using Topic Models Compiled by Sujatha Das, Cornelia Caragea Credits for slides: Blei, Allan, Arms, Manning, Rai, Lund, Noble, Page. Clustering Partition unlabeled examples into disjoint subsets

More information

Repositorio Institucional de la Universidad Autónoma de Madrid.

Repositorio Institucional de la Universidad Autónoma de Madrid. Repositorio Institucional de la Universidad Autónoma de Madrid https://repositorio.uam.es Esta es la versión de autor de la comunicación de congreso publicada en: This is an author produced version of

More information

Just-In-Time Hypermedia

Just-In-Time Hypermedia A Journal of Software Engineering and Applications, 2013, 6, 32-36 doi:10.4236/jsea.2013.65b007 Published Online May 2013 (http://www.scirp.org/journal/jsea) Zong Chen 1, Li Zhang 2 1 School of Computer

More information

Exam Marco Kuhlmann. This exam consists of three parts:

Exam Marco Kuhlmann. This exam consists of three parts: TDDE09, 729A27 Natural Language Processing (2017) Exam 2017-03-13 Marco Kuhlmann This exam consists of three parts: 1. Part A consists of 5 items, each worth 3 points. These items test your understanding

More information

Relevance and Effort: An Analysis of Document Utility

Relevance and Effort: An Analysis of Document Utility Relevance and Effort: An Analysis of Document Utility Emine Yilmaz, Manisha Verma Dept. of Computer Science University College London {e.yilmaz, m.verma}@cs.ucl.ac.uk Nick Craswell, Filip Radlinski, Peter

More information

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Multi-Stage Rocchio Classification for Large-scale Multilabeled Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale

More information

Document Clustering of Scientific Texts Using Citation Contexts

Document Clustering of Scientific Texts Using Citation Contexts Noname manuscript No. (will be inserted by the editor) Document Clustering of Scientific Texts Using Citation Contexts Bader Aljaber. Nicola Stokes. James Bailey. Jian Pei Received: date / Accepted: date

More information

Focused Retrieval Using Topical Language and Structure

Focused Retrieval Using Topical Language and Structure Focused Retrieval Using Topical Language and Structure A.M. Kaptein Archives and Information Studies, University of Amsterdam Turfdraagsterpad 9, 1012 XT Amsterdam, The Netherlands a.m.kaptein@uva.nl Abstract

More information

Automated Extraction of Event Details from Text Snippets

Automated Extraction of Event Details from Text Snippets Automated Extraction of Event Details from Text Snippets Kavi Goel, Pei-Chin Wang December 16, 2005 1 Introduction We receive emails about events all the time. A message will typically include the title

More information

More Efficient Classification of Web Content Using Graph Sampling

More Efficient Classification of Web Content Using Graph Sampling More Efficient Classification of Web Content Using Graph Sampling Chris Bennett Department of Computer Science University of Georgia Athens, Georgia, USA 30602 bennett@cs.uga.edu Abstract In mining information

More information

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Exploiting Index Pruning Methods for Clustering XML Collections

Exploiting Index Pruning Methods for Clustering XML Collections Exploiting Index Pruning Methods for Clustering XML Collections Ismail Sengor Altingovde, Duygu Atilgan and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,

More information

Taccumulation of the social network data has raised

Taccumulation of the social network data has raised International Journal of Advanced Research in Social Sciences, Environmental Studies & Technology Hard Print: 2536-6505 Online: 2536-6513 September, 2016 Vol. 2, No. 1 Review Social Network Analysis and

More information

Heading-Based Sectional Hierarchy Identification for HTML Documents

Heading-Based Sectional Hierarchy Identification for HTML Documents Heading-Based Sectional Hierarchy Identification for HTML Documents 1 Dept. of Computer Engineering, Boğaziçi University, Bebek, İstanbul, 34342, Turkey F. Canan Pembe 1,2 and Tunga Güngör 1 2 Dept. of

More information

A Novel Approach of Mining Write-Prints for Authorship Attribution in Forensics

A Novel Approach of Mining Write-Prints for Authorship Attribution in  Forensics DIGITAL FORENSIC RESEARCH CONFERENCE A Novel Approach of Mining Write-Prints for Authorship Attribution in E-mail Forensics By Farkhund Iqbal, Rachid Hadjidj, Benjamin Fung, Mourad Debbabi Presented At

More information

Applying the Annotation View on Messages for Discussion Search

Applying the Annotation View on Messages for Discussion Search Applying the Annotation View on Messages for Discussion Search Ingo Frommholz University of Duisburg-Essen ingo.frommholz@uni-due.de In this paper we present our runs for the TREC 2005 Enterprise Track

More information

Natural Language Processing. SoSe Question Answering

Natural Language Processing. SoSe Question Answering Natural Language Processing SoSe 2017 Question Answering Dr. Mariana Neves July 5th, 2017 Motivation Find small segments of text which answer users questions (http://start.csail.mit.edu/) 2 3 Motivation

More information

An Unsupervised Approach for Discovering Relevant Tutorial Fragments for APIs

An Unsupervised Approach for Discovering Relevant Tutorial Fragments for APIs An Unsupervised Approach for Discovering Relevant Tutorial Fragments for APIs He Jiang 1, 2, 3 Jingxuan Zhang 1 Zhilei Ren 1 Tao Zhang 4 jianghe@dlut.edu.cn jingxuanzhang@mail.dlut.edu.cn zren@dlut.edu.cn

More information

Self-tuning ongoing terminology extraction retrained on terminology validation decisions

Self-tuning ongoing terminology extraction retrained on terminology validation decisions Self-tuning ongoing terminology extraction retrained on terminology validation decisions Alfredo Maldonado and David Lewis ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Conclusions. Chapter Summary of our contributions

Conclusions. Chapter Summary of our contributions Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

second_language research_teaching sla vivian_cook language_department idl

second_language research_teaching sla vivian_cook language_department idl Using Implicit Relevance Feedback in a Web Search Assistant Maria Fasli and Udo Kruschwitz Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, United Kingdom fmfasli

More information

University of Santiago de Compostela at CLEF-IP09

University of Santiago de Compostela at CLEF-IP09 University of Santiago de Compostela at CLEF-IP9 José Carlos Toucedo, David E. Losada Grupo de Sistemas Inteligentes Dept. Electrónica y Computación Universidad de Santiago de Compostela, Spain {josecarlos.toucedo,david.losada}@usc.es

More information

Enhancing Automatic Wordnet Construction Using Word Embeddings

Enhancing Automatic Wordnet Construction Using Word Embeddings Enhancing Automatic Wordnet Construction Using Word Embeddings Feras Al Tarouti University of Colorado Colorado Springs 1420 Austin Bluffs Pkwy Colorado Springs, CO 80918, USA faltarou@uccs.edu Jugal Kalita

More information