Chapter 6. Queries and Interfaces
|
|
- Ralf Chase
- 5 years ago
- Views:
Transcription
1 Chapter 6 Queries and Interfaces
2 Keyword Queries Simple, natural language queries were designed to enable everyone to search Current search engines do not perform well (in general) with natural language queries People trained (in effect) to use keywords Compare average of about 2.3 words/web query to average of 30 words/community-based Question Answering (CQA) query Keyword selection is not always easy Query refinement techniques can help 2
3 Query-Based Stemming Make decision about stemming at query time rather than during indexing Improved flexibility and effectiveness Query is expanded using word variants Documents are not stemmed e.g., rock climbing expanded with climb, not stemmed to climb 3
4 Stem Classes Stemming generates stem classes A stem class is the group of words that will be transformed into the same stem by the stemming algorithm Generated by running stemmer on large corpus e.g., Porter stemmer on TREC News on 3 stem classes with the first entry being the stem 4
5 Stem Classes Stem classes are often too big and inaccurate Modify using analysis of word co-occurrence Assumption: Word variants that could substitute for each other should co-occur often in documents 5
6 Modifying Stem Classes For all pairs of words in the stem classes, count how often they co-occuroccur in text windows of W words, where W is in the range Compute a co-occurrence or association metric for each pair, which measures the degree of association between the words. Construct a graph where the vertices represent words and the edges are between words whose co-occurrence metric is above a threshold value, which h is set empirically. i Find the connected components of the graph. These are the new stem classes. 6
7 Modifying Stem Classes Dices Coefficient is an example of a term association measure, where n x is the number of windows (documents) containing x a measure of the proportion of term co-occurrence Two vertices are in the same connected component of a graph if there is a path between them Forms word clusters Sample output tof modification 7
8 Context Sensitive Stemming According to [Peng 07]: Although stemming increases recall, when considering Web searches, stemming lowers precision and requires significant additional computation. Traditional stemming transforms all query terms, i.e., it always performs the same transformation for the same query word, regardless of the context in which the words appears. Stemming by query expansion, on the other hand, handles each query differently, i.e., considers the context of each word in a query, which is more effective. The Porter Stemmer algorithm does not considers the semantic meaning of words and thus can make mistakes when converting a word to its base form, e.g., news to new. [Peng 07] F. Peng, et al. Context Sensitive Stemming for Web Search. In Proc. of Conf. on Research and Development in Information Retrieval
9 Spell Checking Important part of query processing 10-15% of all Web queries have spelling errors Errors include typical word processing errors but also many other types, (e.g., errors extracted from query logs) 9
10 Spell Checking Basic approach: suggest corrections for words not found in spelling dictionary Many spelling errors are related to websites, products, companies, people, etc. that are unlikely to be found Suggestions found by comparing word to words in dictionary using similarity measure Most common similarity measure is edit distance Number of operations required to transform one word into the other 10
11 Edit Distance Damerau-Levenshtein Distance Counts the minimum number of insertions, deletions, substitutions, or transpositions of single characters required e.g., Damerau-Levenshtein distance 1 (single-character errors) Distance 2 11
12 Edit Distance Different techniques used to speed up calculation of edit distances -- restrict to words that start with same character (spelling errors rarely occur in the first letter) come with similar length (spelling errors rarely occur on words with the same length) sound the same (homophone, rules map words to codes) Last option uses a (same) phonetic code to group words e.g., Soundex 12
13 Soundex Code (correct word may not always have the same Soundex code) 13
14 Spelling Correction Issues Ranking corrections (> 1 possible corrections for an error) Did you mean... feature requires accurate ranking of possible corrections (more likely: the best suggestion) Context t Choosing right suggestion depends on context (other words) e.g., lawers lowers, lawyers, layers, lasers, lagers but trial lawers trial lawyers Run-on errors (word boundaries are skipped/mistyped) eg e.g., mainscourcebank Missing spaces can be considered another single character error in right framework 14
15 Noisy Channel Model Address the issues of ranking, context, and run-on errors User chooses word w based on probability distribution P(w) Called the language g model Can capture context information about the frequency of occurrence of a word in text, e.g., P(w( 1 w 2 2) ) The probability of observing a word, given that another one has just been observed User writes word w, but noisy channel causes word e to be written instead with probability P(e w) Called error model Represents information about the frequency of spelling errors 15
16 Noisy Channel Model Need to estimate probability of correction to represent info. about the frequency of different types of errors Error Model P(w e) =P(e w)p(w), i.e., the probability that given a written word e,, the correct word is w Estimate language model using context P(w) = λp(w) +(1 λ)p(w wp) Probability of Word Occurrence Probability of Word Co-occurrence where wp is a previous word of w, and λ is a parameter which specifies the relative importance of P(w) & P(w wp) Examples. fish tink : tank and think both likely corrections, but P(tank fish) > P(think fish) 16
17 Noisy Channel Model Language model probabilities estimated using corpus and query log Both simple and complex methods have been used for estimating error model Simple approach: assume that all words with same edit distance have same probability, only edit distance 1 and 2 considered More complex approach: incorporate estimates based on common typing errors Estimates are derived from large collections of text by finding many pairs of (in)correctly spelled words 17
18 Cucerzan and Brill (2004) s Spell Checker using (i) query log (QL) and (ii) dictionary (D) 1. Tokenize the query. 2. For each token, a set of alternative words and pairs of words is found using an edit distance modified by weighting g certain types of errors. The data structure that is searched for the alternatives contains words and pairs from QL and D. 3. The noisy channel model is used to select the best correction. 4. The process of looking for alternatives and finding the best correction is repeated until no better correction is found. 18
19 Query Expansion A variety of (semi)-automatic query expansion techniques have been developed Goal: improve effectiveness by matching related terms Semi-automatic techniques require user interaction to select best expansion terms Query suggestion is a related technique Alternative queries, not necessarily more terms 19
20 Query Expansion Approaches usually based on an analysis of term co- occurrence either in (i) the entire document collection, (ii) a large collection of queries, or (iii) the top-ranked documents in a result list Query-based stemming also an expansion technique Automatic expansion based on general thesaurus, but not effective Does not take context into account 20
21 Term Association Measures Key to effective expansion: choose appropriate words for the context of the query e.g., aquarium for tank in tropical fish tanks, but not for armor for tanks Dice s Coefficient: computing the term-occurrence Mutual Information: measure co-occurrence words log P(a, b) rank P(a)P(b) n ab n a. n b 21
22 Term Association Measures Mutual Information measure favors low frequency terms Expected Mutual Information Measure (EMIM) corrects the problem of MIM by weighting g MIM values using P(a, b) where N is the number of documents in a collection favoring high-frequency terms, which can be a problem 22
23 Term Association Measures Pearson s Chi-squared (χ 2 ) measure Compares the number of co-occurrences of two words with the expected number of co-occurrences if the two words were independent independent co-occurrences Normalizes this comparison by the expected number Also limited form focused on word co-occurrence 23
24 Association Measure Summary Favor lowfrequency terms Favor highfrequency terms 24
25 Association Measures Associated words are of little use for expanding the query tropical fish Expansion based on whole query takes context into account e.g., g, using Dice with term tropical fish gives the following highly associated words: goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet Impractical for all possible queries, other approaches used to achieve e this effect 25
26 Other Approaches Pseudo-relevance feedback Expansion terms based on top retrieved documents for initial query Context vectors Represent words by the co-occurredoccurred words e.g., top 35 most strongly associated words for aquarium (using Dice s coefficient): c e Rank words for a query by ranking context t vectors 26
27 Other Approaches Query logs Best source of information about queries and related terms Short pieces of text and click data e.g., most frequent words in queries containing tropical fish from MSN log: stores, pictures, live, sale, types, clipart, blue, freshwater, aquarium, supplies Query suggestion based on finding similar queries Group based on click data 27
28 Query Expansion on Web Searches According to [Crabtree 07] automatic query expansion (on Web searches) Enhances recall but not precision: expanding queries using 20 terms improves the recall of the result set, but hurts the precision due to query drift. Helps identifying other relevant pages by identifying additional terms that occur on relevant pages Relies on both local and global document analysis (considering documents in the result set or using a thesauri/properties of the entire corpus, respectively) as sources for refinement terms [Crabtree 07] D. Crabtree et al. Exploiting Underrepresented Query Aspects for Automatic Query Expansion. In Proc. of Conf. on Knowledge Discovery and Data Mining
29 Relevance Feedback User identifies relevant (and maybe non-relevant) documents in the initial result list System modifies query using terms from those documents and re-ranks documents Example of simple ML algorithm using training data But, very little training data Pseudo-relevance feedback just assumes top-ranked documents are relevant no user input 29
30 Relevance Feedback Example Snippets Top 10 documents for tropical fish 30
31 Relevance Feedback Example If we assume top 10 are relevant, most frequent terms are (with frequency): a (926), td (535), href (495), http (357), width (345), com (343), nbsp (316), www (260), tr (239), htm (233), class (225), jpg (221) Too many stopwords and HTML expressions Use only snippets and remove stopwords tropical (26), fish (28), aquarium (8), freshwater (5), breeding (4), information (3), species (3), tank (2), Badman s (2), page (2), hobby (2), forums (2) 31
32 Relevance Feedback Both relevance and pseudo-relevance feedback are effective, but not used in many applications Pseudo-relevance feedback has reliability issues, especially with queries that don t retrieve many relevant documents Some applications use relevance feedback Filtering, more like this Query suggestion more popular May be less accurate, but can work if initial query fails 32
33 Context and Personalization If a query has the same words as another query, results will be the same regardless of Who submitted the query Why the query was submitted Where the query was submitted What other queries were submitted in the same session These other factors (the context) could have a significant impact on relevance Difficult to incorporate into ranking 33
34 User Models Generate user profiles based on documents that the person looks at Such as Web pages visited, messages, or word processing documents on the desktop Modify user s queries using words from his profile Generally not effective Imprecise profiles, information needs can change significantly 34
35 Query Logs Query logs provide important contextual information that can be used effectively Context in this case is Previous queries that are the same Previous queries that are similar Query sessions including the same query Query history for individuals could be used for caching 35
36 Location is context Local Search Local search uses geographic information to modify the ranking of search results Location derived from the query text Location of the device where the query originated Examples Underworld 3 cape cod Underworld 3 from mobile device in Hyannis 36
37 Local Search Identify the geographic region associated with Web pages Use location metadata that has been manually added to the document Or identify locations such as place names, city names, or country names in text Identify the geographic region associated with the query 10-15% of queries contain some location reference Rank Web pages using location information in addition to text and link-based features 37
38 Snippet Generation Query-dependent document summary Simple summarization approach Rank each sentence in a document using a significance factor Select the top sentences for the summary First proposed by Luhn in 50 s 38
39 Sentence Selection Significance factor for a sentence is calculated based on the occurrence of significant words If f d,w is the frequency of word w in document d, then w is a significant word if it is not a stopword, ie i.e., a high-frequency word, and where s d is the number of sentences in document d Text is bracketed by significant words (limit on number of non-significant words in bracket) 39
40 Sentence Selection Significance factor for bracketed text spans is computed by (i) dividing the square of the number of significant words in the span by (ii) the total number of words Example. The limit it set for non-significant ifi words in a bracket is 4 Significance factor = 4 2 /7 =
41 Snippet Generation Involves more features than just significance factor Example. (A sentence-based approach) on a news story 1. Whether the sentence is a heading 2. Whether it is the 1 st or 2 nd line of the document 3. The total number of query terms occurring in the sentence 4. The number of unique query terms in the sentence 5. The longest contiguous run of query words in the sentence 6. A density measure of query words (i.e., significance factor) Weighted combination of features used to rank sentences 41
42 Blog Snippets According to [Parapar 10], in performing a Blog search Considering reader s comments improves the quality of post snippets and thus enhances users access to relevant post in a result list Indexing comments, as well as the post itself, enhances the recall of a blog search, which is more important in blog than web search, since searchers are usually interested in all recent posts about a specific topic) [Parapar 10] J. Parapar et al. Blog Snippets: A Comments-Biased Approach. In Proc. of Intl. Conf. on Research and Development in Information Retrieval
43 Clustering Results Result lists often contain documents related to different aspects of the query topic Clustering is used to group related documents to simplify browsing Example clusters for query tropical fish 43
44 Result List Example Top 10 docs for tropical fish 44
45 Requirements Efficiency Clustering Results Must be specific to each query and are based on the top-ranked documents for that query Typically based on snippets Easy to understand Can be difficult to assign good labels to groups Monothetic vs. polythetic classification 45
46 Monothetic Types of Classification Every member of a class has the property that defines the class Typical assumption made by users Easy to understand Polythetic Members of classes share many properties but there is no single defining property Most clustering algorithms (e.g., K-means) produce this type of output 46
47 Classification Example D 1 = {a, b, c} D 2 = {a, d, e} D 3 = {d, e, f, g} D 4 = {f, g} Possible monothetic classification { D 1,D 2 } (labeled using a) and { D 2,D 3 } (labeled e) Possible polythetic classification { D 2,D 3,D 4 }, D 1 Labels? 47
48 Result Clusters Simple algorithm Group based on words in snippets Refinements Use phrases Use more features Whether phrases occurred in titles or snippets Length of the phrase Collection frequency of the phrase Overlap of the resulting clusters 48
CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University
CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Previously: Indexing Process Query Process Queries Queries Query Expansion Spell Checking Context
More informationSearch Engines Chapter 6 Queries and Interfaces Felix Naumann
Search Engines Chapter 6 Queries and Interfaces 2.6.2009 Felix Naumann Overview 2 Information needs Query transformation & refinement Showing results Cross-language search Information Needs 3 An information
More informationQuery Suggestion. A variety of automatic or semi-automatic query suggestion techniques have been developed
Query Suggestion Query Suggestion A variety of automatic or semi-automatic query suggestion techniques have been developed Goal is to improve effectiveness by matching related/similar terms Semi-automatic
More informationQuery Refinement and Search Result Presentation
Query Refinement and Search Result Presentation (Short) Queries & Information Needs A query can be a poor representation of the information need Short queries are often used in search engines due to the
More informationCS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University
CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University Previously: Indexing Process Query Process Informa.on Needs An informa(on need is the underlying
More informationChapter 4. Processing Text
Chapter 4 Processing Text Processing Text Modifying/Converting documents to index terms Convert the many forms of words into more consistent index terms that represent the content of a document What are
More informationIN4325 Query refinement. Claudia Hauff (WIS, TU Delft)
IN4325 Query refinement Claudia Hauff (WIS, TU Delft) The big picture Information need Topic the user wants to know more about The essence of IR Query Translation of need into an input for the search engine
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationSearch Engine Architecture II
Search Engine Architecture II Primary Goals of Search Engines Effectiveness (quality): to retrieve the most relevant set of documents for a query Process text and store text statistics to improve relevance
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationClustering Results. Result List Example. Clustering Results. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationImproving Difficult Queries by Leveraging Clusters in Term Graph
Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationRanked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?
Ranked Retrieval One option is to average the precision scores at discrete Precision 100% 0% More junk 100% Everything points on the ROC curve But which points? Recall We want to evaluate the system, not
More informationInformation Retrieval
Introduction Information Retrieval Information retrieval is a field concerned with the structure, analysis, organization, storage, searching and retrieval of information Gerard Salton, 1968 J. Pei: Information
More informationMEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI
MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI 1 KAMATCHI.M, 2 SUNDARAM.N 1 M.E, CSE, MahaBarathi Engineering College Chinnasalem-606201, 2 Assistant Professor,
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationCS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University
CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process Processing Text Converting documents to index terms Why? Matching the exact
More informationChapter 2. Architecture of a Search Engine
Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationSec. 8.7 RESULTS PRESENTATION
Sec. 8.7 RESULTS PRESENTATION 1 Sec. 8.7 Result Summaries Having ranked the documents matching a query, we wish to present a results list Most commonly, a list of the document titles plus a short summary,
More informationChapter IR:IV. IV. Indexing. Abstract Model of Ranking Inverted Indexes Compression Auxiliary Structures Index Construction Query Processing
Chapter IR:IV IV. Indexing Abstract Model of Ranking Inverted Indexes Compression Auxiliary Structures Index Construction Query Processing IR:IV-276 Indexing HAGEN/POTTHAST/STEIN 2017 Inverted Indexes
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationSearch Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]
Search Evaluation Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Table of Content Search Engine Evaluation Metrics for relevancy Precision/recall F-measure MAP NDCG Difficulties in Evaluating
More informationAn Information Retrieval Approach for Source Code Plagiarism Detection
-2014: An Information Retrieval Approach for Source Code Plagiarism Detection Debasis Ganguly, Gareth J. F. Jones CNGL: Centre for Global Intelligent Content School of Computing, Dublin City University
More informationRobust Relevance-Based Language Models
Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval
More informationSearch Engines Exercise 5: Querying. Dustin Lange & Saeedeh Momtazi 9 June 2011
Search Engines Exercise 5: Querying Dustin Lange & Saeedeh Momtazi 9 June 2011 Task 1: Indexing with Lucene We want to build a small search engine for movies Index and query the titles of the 100 best
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationUniversity of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015
University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic
More informationInformation Retrieval
Information Retrieval Data Processing and Storage Ilya Markov i.markov@uva.nl University of Amsterdam Ilya Markov i.markov@uva.nl Information Retrieval 1 Course overview Offline Data Acquisition Data Processing
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationKnowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.
Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures
More informationChapter IR:II. II. Architecture of a Search Engine. Indexing Process Search Process
Chapter IR:II II. Architecture of a Search Engine Indexing Process Search Process IR:II-87 Introduction HAGEN/POTTHAST/STEIN 2017 Remarks: Software architecture refers to the high level structures of a
More informationWeb Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Web Search Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Working with Massive data Link analysis (PageRank) Anchor text 2 1 The Web Document
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationInformation Retrieval CSCI
Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1
More informationUnsupervised Keyword Extraction from Single Document. Swagata Duari Aditya Gupta Vasudha Bhatnagar
Unsupervised Keyword Extraction from Single Document Swagata Duari Aditya Gupta Vasudha Bhatnagar Presentation Outline Introduction and Motivation Statistical Methods for Automatic Keyword Extraction Graph-based
More informationTag-based Social Interest Discovery
Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture
More informationInformation Retrieval
Natural Language Processing SoSe 2014 Information Retrieval Dr. Mariana Neves June 18th, 2014 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationSYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT
SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND
More informationIJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:
IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T
More informationInformation Retrieval
Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationComment Extraction from Blog Posts and Its Applications to Opinion Mining
Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan
More informationA Framework for Summarization of Multi-topic Web Sites
A Framework for Summarization of Multi-topic Web Sites Yongzheng Zhang Nur Zincir-Heywood Evangelos Milios Technical Report CS-2008-02 March 19, 2008 Faculty of Computer Science 6050 University Ave., Halifax,
More informationMicrosoft PowerPoint Presentations
Microsoft PowerPoint Presentations In this exercise, you will create a presentation about yourself. You will show your presentation to the class. As you type your information, think about what you will
More informationInformation Retrieval Term Project : Incremental Indexing Searching Engine
Information Retrieval Term Project : Incremental Indexing Searching Engine Chi-yau Lin r93922129@ntu.edu.tw Department of Computer Science and Information Engineering National Taiwan University Taipei,
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Graph Data & Introduction to Information Retrieval Huan Sun, CSE@The Ohio State University 11/21/2017 Slides adapted from Prof. Srinivasan Parthasarathy @OSU 2 Chapter 4
More informationEffective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization
Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Romain Deveaud 1 and Florian Boudin 2 1 LIA - University of Avignon romain.deveaud@univ-avignon.fr
More information[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632
More informationEffective Structured Query Formulation for Session Search
Effective Structured Query Formulation for Session Search Dongyi Guan Hui Yang Nazli Goharian Department of Computer Science Georgetown University 37 th and O Street, NW, Washington, DC, 20057 dg372@georgetown.edu,
More informationWeb Page Similarity Searching Based on Web Content
Web Page Similarity Searching Based on Web Content Gregorius Satia Budhi Informatics Department Petra Chistian University Siwalankerto 121-131 Surabaya 60236, Indonesia (62-31) 2983455 greg@petra.ac.id
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Processing Text Converting documents to index terms Why? Matching the exact string of characters typed by the user is too
More informationSearch Engines Information Retrieval in Practice
Search Engines Information Retrieval in Practice W. BRUCE CROFT University of Massachusetts, Amherst DONALD METZLER Yahoo! Research TREVOR STROHMAN Google Inc. ----- PEARSON Boston Columbus Indianapolis
More informationA Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2
A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,
More informationA System for Query-Specific Document Summarization
A System for Query-Specific Document Summarization Ramakrishna Varadarajan, Vagelis Hristidis. FLORIDA INTERNATIONAL UNIVERSITY, School of Computing and Information Sciences, Miami. Roadmap Need for query-specific
More informationUnstructured Data. CS102 Winter 2019
Winter 2019 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for patterns in data
More informationReading group on Ontologies and NLP:
Reading group on Ontologies and NLP: Machine Learning27th infebruary Automated 2014 1 / 25 Te Reading group on Ontologies and NLP: Machine Learning in Automated Text Categorization, by Fabrizio Sebastianini.
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationCADIAL Search Engine at INEX
CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr
More informationTREC-10 Web Track Experiments at MSRA
TREC-10 Web Track Experiments at MSRA Jianfeng Gao*, Guihong Cao #, Hongzhao He #, Min Zhang ##, Jian-Yun Nie**, Stephen Walker*, Stephen Robertson* * Microsoft Research, {jfgao,sw,ser}@microsoft.com **
More informationClassification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014
Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Sungbin Choi, Jinwook Choi Medical Informatics Laboratory, Seoul National University, Seoul, Republic of
More informationRetrieval and Feedback Models for Blog Distillation
Retrieval and Feedback Models for Blog Distillation Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell Language Technologies Institute, School of Computer Science, Carnegie Mellon University
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Prof. Chris Clifton 27 August 2018 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 AD-hoc IR: Basic Process Information
More informationHow Does a Search Engine Work? Part 1
How Does a Search Engine Work? Part 1 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0 What we ll examine Web crawling
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationBi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track
Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Jeffrey Dalton University of Massachusetts, Amherst jdalton@cs.umass.edu Laura
More informationA Document Graph Based Query Focused Multi- Document Summarizer
A Document Graph Based Query Focused Multi- Document Summarizer By Sibabrata Paladhi and Dr. Sivaji Bandyopadhyay Department of Computer Science and Engineering Jadavpur University Jadavpur, Kolkata India
More informationAutomatic Identification of User Goals in Web Search [WWW 05]
Automatic Identification of User Goals in Web Search [WWW 05] UichinLee @ UCLA ZhenyuLiu @ UCLA JunghooCho @ UCLA Presenter: Emiran Curtmola@ UC San Diego CSE 291 4/29/2008 Need to improve the quality
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval 1 Outline Dictionaries Wildcard queries skip Edit distance skip Spelling correction skip Soundex 2 Inverted index Our
More informationText Analytics (Text Mining)
CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More informationA Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2
A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor
More informationProgramming Exercise 6: Support Vector Machines
Programming Exercise 6: Support Vector Machines Machine Learning May 13, 2012 Introduction In this exercise, you will be using support vector machines (SVMs) to build a spam classifier. Before starting
More informationA Short Introduction to CATMA
A Short Introduction to CATMA Outline: I. Getting Started II. Analyzing Texts - Search Queries in CATMA III. Annotating Texts (collaboratively) with CATMA IV. Further Search Queries: Analyze Your Annotations
More informationSE Workshop PLAN. What is a Search Engine? Components of a SE. Crawler-Based Search Engines. How Search Engines (SEs) Work?
PLAN SE Workshop Ellen Wilson Olena Zubaryeva Search Engines: How do they work? Search Engine Optimization (SEO) optimize your website How to search? Tricks Practice What is a Search Engine? A page on
More informationImgSeek: Capturing User s Intent For Internet Image Search
ImgSeek: Capturing User s Intent For Internet Image Search Abstract - Internet image search engines (e.g. Bing Image Search) frequently lean on adjacent text features. It is difficult for them to illustrate
More informationCountering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008
Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008 Overview Introduction Countering Email Spam Problem Description Classification
More informationEffect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching
Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna
More informationIndex Compression. David Kauchak cs160 Fall 2009 adapted from:
Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?
More informationConcept-Based Document Similarity Based on Suffix Tree Document
Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri
More informationInformation Retrieval. Techniques for Relevance Feedback
Information Retrieval Techniques for Relevance Feedback Introduction An information need may be epressed using different keywords (synonymy) impact on recall eamples: ship vs boat, aircraft vs airplane
More informationCLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval
DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
More informationΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου
Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs
More informationSearch Results Clustering in Polish: Evaluation of Carrot
Search Results Clustering in Polish: Evaluation of Carrot DAWID WEISS JERZY STEFANOWSKI Institute of Computing Science Poznań University of Technology Introduction search engines tools of everyday use
More informationOntology based Web Page Topic Identification
Ontology based Web Page Topic Identification Abhishek Singh Rathore Department of Computer Science & Engineering Maulana Azad National Institute of Technology Bhopal, India Devshri Roy Department of Computer
More informationAn Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments
An Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments Hui Fang ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign Abstract In this paper, we report
More informationResPubliQA 2010
SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first
More informationInferring User Search for Feedback Sessions
Inferring User Search for Feedback Sessions Sharayu Kakade 1, Prof. Ranjana Barde 2 PG Student, Department of Computer Science, MIT Academy of Engineering, Pune, MH, India 1 Assistant Professor, Department
More informationCABI Training Materials. Ovid Silver Platter (SP) platform. Simple Searching of CAB Abstracts and Global Health KNOWLEDGE FOR LIFE.
CABI Training Materials Ovid Silver Platter (SP) platform Simple Searching of CAB Abstracts and Global Health www.cabi.org KNOWLEDGE FOR LIFE Contents The OvidSP Database Selection Screen... 3 The Ovid
More informationKEYWORD GENERATION FOR SEARCH ENGINE ADVERTISING
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 6, June 2014, pg.367
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More informationDesigning and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
More informationText Mining. Representation of Text Documents
Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,
More information