Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields
|
|
- Lydia Leonard
- 5 years ago
- Views:
Transcription
1 Informa/on Retrieval CISC437/637, Lecture #23 Ben CartereAe Copyright Ben CartereAe 1 Text Search Consider a database consis/ng of long textual informa/on fields News ar/cles, patents, web pages, books, What is the best way to search within this data? grep for keywords in it? Store it in a DBMS and use SQL queries with LIKE %keyword1% AND LIKE %keyword2%? If there s enough data, these approaches are very slow and very inefficient Hash and B+- tree indexes don t help Copyright Ben CartereAe 2 1
2 Informa/on Retrieval Informa(on retrieval (IR) studies systems for indexing and querying large full- text corpora Google is the most widely- known modern example Work in IR and DB has mostly been separate IR has roots in library science and informa/on science going back to the 1950s, today allied with AI DB is more firmly rooted in algorithms and systems In recent years they have begun to intersect via XML, text mining, data mining Copyright Ben CartereAe 3 IR systems vs DBMS Both involve queries that are matched to records (possibly using an index) to retrieve results Afer that, many differences: IR Relevance seman/cs Keyword search Unstructured data Read- only (mostly) Rank best- matching results DBMS Rela/onal seman/cs Full SQL query language Structured data Read/write Return full result set Copyright Ben CartereAe 4 2
3 Common Topics, Different Focus IR and DB have many things in common, but focus differs between the two: DBMS Users Query language Query/record matching Record ranking Building indexes Indexing strategies Query op/miza/on Concurrency control IR Users Query language Query/record matching Record ranking Building indexes Indexing strategies Query op/miza/on Concurrency control Copyright Ben CartereAe 5 Relevance and Ranking In a DBMS, records either match a SQL query or they don t there is no middle ground In IR systems, some documents can be beaer matches than others Matching documents may not be relevant Documents that don t match may be relevant Relevance describes the usefulness of a document to a par/cular user Copyright Ben CartereAe 6 3
4 query relevant results (for one user) top- ranked results relevant results (for a different user) Copyright Ben CartereAe 7 What Determines Relevance? Many factors can contribute to a document being relevant to a user: Frequency of search terms within the document Frequency of search terms in the corpus Proximity of search terms in document Popularity of a document among other users Popularity of a document among content developers User prior knowledge User task The system can only make guesses about relevance; it can t read the user s mind Guesses are based on computa/ons of above factors Copyright Ben CartereAe 8 4
5 The Bag of Words Model Bag of words refers to a simple representa/on scheme for text: Documents and queries are simply unordered sets of words Syntac/c informa/on (grammar) stripped out Two details: Very common words ( stopwords ) are removed E.g. the, a, of, in, that, Words are converted to their stems E.g. surfs, surfing, surfed surf Copyright Ben CartereAe 9 Text Indexes The bag of words model allows a /me- and space- efficient indexing model: the inverted index Instead of storing a rela/on with documents as records and terms as aaributes, store each term with the list of documents it appears in An inverted list or pos(ng list To answer a one- term query, simply retrieve pos/ng list for that term Copyright Ben CartereAe 10 5
6 Longer Queries Boolean queries use AND, OR, NOT syntax term1 AND (term2 OR term3) To answer an AND query: Get pos/ng lists for all terms and take the intersec/on To answer an OR query: Take union of all pos/ng lists To answer an AND NOT query: Set subtrac/on To answer an OR NOT query: Union of term1 and NOT term2 which will be a very large set OR NOT usually not allowed Copyright Ben CartereAe 11 Ranking Querying the inverted index only provides the set of documents that match the query The next step is to rank them in order of likelihood of relevance to the user Ranking algorithms are the subject of much study in the field Most are based on the probability ranking principle, which says that the op/mal ordering is in decreasing order of probability of relevance Copyright Ben CartereAe 12 6
7 Text Sta/s/cs and Ranking Another advantage of inverted files: They can store a lot of informa/on about terms within documents and in the corpus For each term, also store the following: Document frequency df, the total number of documents the term appears in Collec/on term frequency c=, the total number of /mes the term appears in all documents For each term/document pair, store: Term frequency =, the number of /mes the term appears in the document For each document, store: Document length N, the total number of terms in the document Copyright Ben CartereAe 13 Ranking Func/ons Use v, N, df to calculate a score for each document The score indicates the likelihood of relevance One common approach treats a document D as a vector in V- dimensional space The magnitude along each dimension is set to the weight of the corresponding term Weights are func/ons of v, N, and df The query Q is represented as a vector in the same space The score S(Q, D) is defined to be the cosine of the angle between the two vectors Copyright Ben CartereAe 14 7
8 Evalua/ng Ranking Func/ons Different func/ons produce different rankings of documents How do we choose among ranking func/ons? Performance evalua(on Judge the relevance of each document in the ranking to a user s informa/on need Calculate a summary measure of performance over the relevance judgments Common measures: Precision, recall, average precision, DCG Copyright Ben CartereAe 15 How Google Works (Kind Of) Google will never say how they actually work, but they have provided some details in the past Basic approach: Crawl the web constantly Within documents, store some formaxng info Store info about links between documents Use links between documents to gauge popularity Store indexes across many cheap servers Process queries in parallel Copyright Ben CartereAe 16 8
9 From Anatomy of a Large- Scale Hypertextual Web Search Engine, Brin & Page, 2001 Copyright Ben CartereAe 17 Google s Index Google uses an inverted index Each term s pos/ng list is made up of: An internal page ID The number of hits on that page For each hit, a fixed- size data structure Hit data structure: Stores font info, posi/on in document, and some markup informa/on Total size = 2 bytes (16 bits) Forward barrels store a sequence of hits appearing in a page Inverted barrels store hits in all pages for each term The lexicon supports fast access to the inverted barrels Copyright Ben CartereAe 18 9
10 PageRank PageRank is a cita<on analysis algorithm Basic idea: Pages with many links from high- quality pages are likely to be high- quality High- quality pages are more likely to link to other high- quality pages The recursive flavor comes through in the PageRank formula Copyright Ben CartereAe 19 PageRank User Model PageRank is the result of modeling users as random surfers : A user starts randomly on some page in the web They randomly click links, never going back At some point they jump to a new page selected uniformly at random PR(A) = the probability that a random surfer is looking at page A Google uses PageRank to modify an IR score- based ranking Copyright Ben CartereAe 20 10
Introduc3on to Data Management
ICS 101 Fall 2014 Introduc3on to Data Management Assoc. Prof. Lipyeow Lim Informa3on & Computer Science Department University of Hawaii at Manoa Lipyeow Lim - - University of Hawaii at Manoa 1 The Data
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationGes$one Avanzata dell Informazione Part A Full- Text Informa$on Management. Full- Text Indexing
Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management Full- Text Indexing Contents } Introduction } Inverted Indices } Construction } Searching 2 GAvI - Full- Text Informa$on Management:
More informationInforma(on Retrieval
Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 7: Scoring, Term Weigh9ng and the Vector Space Model 7 Last Time: Index Compression Collec9on and vocabulary sta9s9cs: Heaps and
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationThe Anatomy of a Large-Scale Hypertextual Web Search Engine
The Anatomy of a Large-Scale Hypertextual Web Search Engine Article by: Larry Page and Sergey Brin Computer Networks 30(1-7):107-117, 1998 1 1. Introduction The authors: Lawrence Page, Sergey Brin started
More informationSearch Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson
Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Indexes Indexes are data structures designed to make search faster Text search has unique
More informationCS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University
CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University Indexing Process Indexes Indexes are data structures designed to make search faster Text search
More information60-538: Information Retrieval
60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are
More informationSearch Engines. Informa1on Retrieval in Prac1ce. Annota1ons by Michael L. Nelson
Search Engines Informa1on Retrieval in Prac1ce Annota1ons by Michael L. Nelson All slides Addison Wesley, 2008 Evalua1on Evalua1on is key to building effec$ve and efficient search engines measurement usually
More informationInformation Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.
Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.
More informationCS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University
CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted
More informationCC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan
CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7: Information Retrieval II Aidan Hogan aidhog@gmail.com How does Google know about the Web? Inverted Index: Example 1 Fruitvale Station is a 2013
More informationInforma(on Retrieval
Introduc)on to Informa)on Retrieval CS3245 Informa(on Retrieval Lecture 7: Scoring, Term Weigh9ng and the Vector Space Model 7 Last Time: Index Construc9on Sort- based indexing Blocked Sort- Based Indexing
More informationTrustworthy Keyword Search for Regulatory Compliant Records Reten;on
Trustworthy Keyword Search for Regulatory Compliant Records Reten;on S. Mitra, W. Hsu, M. WinsleA Presented by Thao Pham Introduc;on Important documents: emails, mee;ng memos, must be maintained in a trustworthy
More informationRela+onal Algebra. Rela+onal Query Languages. CISC437/637, Lecture #6 Ben Cartere?e
Rela+onal Algebra CISC437/637, Lecture #6 Ben Cartere?e Copyright Ben Cartere?e 1 Rela+onal Query Languages A query language allows manipula+on and retrieval of data from a database The rela+onal model
More informationCS54701: Information Retrieval
CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationRelevance of a Document to a Query
Relevance of a Document to a Query Computing the relevance of a document to a query has four parts: 1. Computing the significance of a word within document D. 2. Computing the significance of word to document
More informationWeb Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search
Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search
More informationElementary IR: Scalable Boolean Text Search. (Compare with R & G )
Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context
More informationXML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson
Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges
More informationIntroduction to Information Retrieval
Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural
More informationSearching the Web for Information
Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More informationBasic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval
Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert
More informationCIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing University of Florida, CISE Department Prof. Daisy Zhe Wang Text To Knowledge IR and Boolean Search Text to Knowledge (IE)
More informationSession 10: Information Retrieval
INFM 63: Information Technology and Organizational Context Session : Information Retrieval Jimmy Lin The ischool University of Maryland Thursday, November 7, 23 Information Retrieval What you search for!
More informationInforma(on Retrieval. Administra*ve. Sta*s*cal MT Overview. Problems for Sta*s*cal MT
Administra*ve Introduc*on to Informa(on Retrieval CS457 Fall 2011! David Kauchak Projects Status 2 on Friday Paper next Friday work on the paper in parallel if you re not done with experiments by early
More informationIntroduction to Information Retrieval. Lecture Outline
Introduction to Information Retrieval Lecture 1 CS 410/510 Information Retrieval on the Internet Lecture Outline IR systems Overview IR systems vs. DBMS Types, facets of interest User tasks Document representations
More informationCS105 Introduction to Information Retrieval
CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationCrawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server
Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document
More informationMultimedia Information Systems
Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive
More informationRepresentation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s
Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence
More informationOutline. Lecture 3: EITN01 Web Intelligence and Information Retrieval. Query languages - aspects. Previous lecture. Anders Ardö.
Outline Lecture 3: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University February 5, 2013 A. Ardö, EIT Lecture 3: EITN01 Web Intelligence
More informationCS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University
CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University Course Goals To help you to understand search engines, evaluate and compare them, and
More informationArchitecture and Implementation of Database Systems (Summer 2018)
Jens Teubner Architecture & Implementation of DBMS Summer 2018 1 Architecture and Implementation of Database Systems (Summer 2018) Jens Teubner, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2018 Jens
More informationAn Adaptive Approach in Web Search Algorithm
International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 15 (2014), pp. 1575-1581 International Research Publications House http://www. irphouse.com An Adaptive Approach
More informationInformation Retrieval
Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid
More informationBehrang Mohit : txt proc! Review. Bag of word view. Document Named
Intro to Text Processing Lecture 9 Behrang Mohit Some ideas and slides in this presenta@on are borrowed from Chris Manning and Dan Jurafsky. Review Bag of word view Document classifica@on Informa@on Extrac@on
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Prof. Chris Clifton 27 August 2018 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 AD-hoc IR: Basic Process Information
More informationCS60092: Informa0on Retrieval. Sourangshu Bha<acharya
CS60092: Informa0on Retrieval Sourangshu Bha
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationLecture 5: Information Retrieval using the Vector Space Model
Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query
More informationIndexing and Query Processing. What will we cover?
Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries
More informationCHAPTER THREE INFORMATION RETRIEVAL SYSTEM
CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationSearch Engine Architecture. Hongning Wang
Search Engine Architecture Hongning Wang CS@UVa CS@UVa CS4501: Information Retrieval 2 Document Analyzer Classical search engine architecture The Anatomy of a Large-Scale Hypertextual Web Search Engine
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationInforma(on Retrieval
Introduc)on to Informa(on Retrieval cs160 Introduction David Kauchak adapted from: h6p://www.stanford.edu/class/cs276/handouts/lecture1 intro.ppt Introduc)ons Name/nickname Dept., college and year One
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationDefinitions. Lecture Objectives. Text Technologies for Data Science INFR Learn about main concepts in IR 9/19/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Definitions Instructor: Walid Magdy 19-Sep-2017 Lecture Objectives Learn about main concepts in IR Document Information need Query Index BOW 2 1 IR in a nutshell
More informationIntroduc)on to. CS60092: Informa0on Retrieval
Introduc)on to CS60092: Informa0on Retrieval Ch. 4 Index construc)on How do we construct an index? What strategies can we use with limited main memory? Sec. 4.1 Hardware basics Many design decisions in
More informationInformation Retrieval
Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationLink Analysis and Web Search
Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html
More informationNatural Language Processing
Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document
More informationInformation Retrieval
Natural Language Processing SoSe 2014 Information Retrieval Dr. Mariana Neves June 18th, 2014 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationCS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University
CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University Course Goals To help you to understand search engines, evaluate and compare them, and
More informationText Analytics (Text Mining)
CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,
More informationInformation Retrieval CSCI
Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationAnatomy of a search engine. Design criteria of a search engine Architecture Data structures
Anatomy of a search engine Design criteria of a search engine Architecture Data structures Step-1: Crawling the web Google has a fast distributed crawling system Each crawler keeps roughly 300 connection
More informationOverview of DB & IR. ICS 624 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa
ICS 624 Spring 2011 Overview of DB & IR Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 1/12/2011 Lipyeow Lim -- University of Hawaii at Manoa 1 Example
More informationRoadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases
Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing
More informationOutline. Lecture 2: EITN01 Web Intelligence and Information Retrieval. Previous lecture. Representation/Indexing (fig 1.
Outline Lecture 2: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University January 23, 2013 A. Ardö, EIT Lecture 2: EITN01 Web Intelligence
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections
More informationSearch Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson
Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Retrieval Models Provide a mathema1cal framework for defining the search process includes
More informationRanked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?
Ranked Retrieval One option is to average the precision scores at discrete Precision 100% 0% More junk 100% Everything points on the ROC curve But which points? Recall We want to evaluate the system, not
More informationLogistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search
CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects
More informationExam IST 441 Spring 2011
Exam IST 441 Spring 2011 Last name: Student ID: First name: I acknowledge and accept the University Policies and the Course Policies on Academic Integrity This 100 point exam determines 30% of your grade.
More informationIndexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems
Indexing: Part IV CPS 216 Advanced Database Systems Announcements (February 17) 2 Homework #2 due in two weeks Reading assignments for this and next week The query processing survey by Graefe Due next
More informationUniversity of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015
University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic
More informationBoolean Model. Hongning Wang
Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer
More informationExam IST 441 Spring 2014
Exam IST 441 Spring 2014 Last name: Student ID: First name: I acknowledge and accept the University Policies and the Course Policies on Academic Integrity This 100 point exam determines 30% of your grade.
More informationWeb search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)
' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search
More informationIntroduction to Information Retrieval and Anatomy of Google. Information Retrieval Introduction
Introduction to Information Retrieval and Anatomy of Google Information Retrieval Introduction Earlier we discussed methods for string matching Appropriate for small documents that fit in memory available
More informationCSE 494: Information Retrieval, Mining and Integration on the Internet
CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18 th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:
More informationBasic techniques. Text processing; term weighting; vector space model; inverted index; Web Search
Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More informationCS60092: Informa0on Retrieval
Introduc)on to CS60092: Informa0on Retrieval Sourangshu Bha1acharya Today s lecture hypertext and links We look beyond the content of documents We begin to look at the hyperlinks between them Address ques)ons
More informationHome Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit
Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing
More informationChapter 2. Architecture of a Search Engine
Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them
More informationIntroduction to IR Systems: Supporting Boolean Text Search
Introduction to IR Systems: Supporting Boolean Text Search Ramakrishnan & Gehrke: Chapter 27, Sections 27.1 27.2 CPSC 404 Laks V.S. Lakshmanan 1 Information Retrieval A research field traditionally separate
More informationInformation Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group
Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)
More informationQuery Phrase Expansion using Wikipedia for Patent Class Search
Query Phrase Expansion using Wikipedia for Patent Class Search 1 Bashar Al-Shboul, Sung-Hyon Myaeng Korea Advanced Institute of Science and Technology (KAIST) December 19 th, 2011 AIRS 11, Dubai, UAE OUTLINE
More informationModels for Document & Query Representation. Ziawasch Abedjan
Models for Document & Query Representation Ziawasch Abedjan Overview Introduction & Definition Boolean retrieval Vector Space Model Probabilistic Information Retrieval Language Model Approach Summary Overview
More informationInforma(on Retrieval
Introduc*on to Informa(on Retrieval CS276: Informa*on Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 4: Index Construc*on Plan Last lecture: Dic*onary data structures Tolerant retrieval
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval
More informationINTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)
INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)
More informationWeb Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Web Search Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Working with Massive data Link analysis (PageRank) Anchor text 2 1 The Web Document
More informationLecture 8: Linkage algorithms and web search
Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationInformation Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationRunning Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.
Running Head: 1 How a Search Engine Works Sara Davis INFO 4206.001 Spring 2016 Erika Gutierrez May 1, 2016 2 Search engines come in many forms and types, but they all follow three basic steps: crawling,
More informationSearch Engine Architecture. Search Engine Architecture
Search Engine Architecture CISC489/689 010, Lecture #2 Wednesday, Feb. 11 Ben CartereGe Search Engine Architecture A soiware architecture consists of soiware components, the interfaces provided by those
More informationMidterm Exam Search Engines ( / ) October 20, 2015
Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points
More informationSEMINAR: GRAPH-BASED METHODS FOR NLP
SEMINAR: GRAPH-BASED METHODS FOR NLP Organisatorisches: Seminar findet komplett im Mai statt Seminarausarbeitungen bis 15. Juli (?) Hilfen Seminarvortrag / Ausarbeitung auf der Webseite Tucan number for
More information