This lecture. Introduction to information retrieval. Making money with information retrieval. Some technical basics. Link analysis.
- Percival Kelley
- 5 years ago
Transcription
2 This lecture Introduction to information retrieval. Making money with information retrieval. Some technical basics. Link analysis. CSC401/2511 Spring
3 Information retrieval systems Information retrieval (IR): n. searching for documents, or for information within documents. Question answering: respond with a specific answer to a question (e.g., Wolfram Alpha). Document retrieval: find documents relevant to a query, ranked by relevance (e.g., Google). Text analytics/data mining: general organization of large textual databases (e.g., Lexis-Nexis, OpenText, MedSearch, etc.).
4 Terminology Information retrieval uses slightly different terminology from the tasks we've seen previously. Document: a book, article, web page, or paragraph (depending on the task and data). Collection: a corpus of documents. Term: a word type. Stop word: a functional (non-content) word (e.g., "the").
5 Query types Different kinds of questions can be asked. Factoid questions, e.g., How often were the peace talks in Ireland delayed or disrupted as a result of acts of violence? Narrative (open-ended) questions, e.g., Can you tell me about contemporary interest in the Greek philosophy of stoicism? Complex/hybrid questions, e.g., Who was involved in the Schengen agreement to eliminate border controls in Western Europe and what did they hope to accomplish?
6 Question answering (QA) Which woman has won more than 1 Nobel prize? (Marie Curie) Question Answering (QA) usually involves a specific answer to a question.
7 Document retrieval vs IR One strategy is to turn question answering into information retrieval (IR) and let the human complete the task.
8 Question answering (QA)
9 Knowledge-based QA 1. Build a structured semantic representation of the query: extract times, dates, locations, and entities using regular expressions, and fit them to well-known templates. 2. Query databases with these semantics: ontologies (Wikipedia infoboxes), restaurant review databases, calendars, movie schedules.
10 IR-based QA
11 IR-based QA
12 IR-based QA [Figure labels: Information retrieval; Question answering.]
13 IBM's Watson [Figure: system diagram. Labels: Human 1; Human 2; Game Control System; Clue Grid; Clues, Scores & Other Game Data; Watson's Game Controller; Decisions to Buzz and Bet; Strategy; Text-to-Speech; Clue & Category; Answers & Confidences; Watson's QA Engine; 2,880 IBM Power750 compute cores; 15 TB of memory; content equivalent to ~1,000,000 books.] Source: "A Brief Overview and Thoughts for Healthcare Education and Performance Improvement" by the IBM Watson team.
14 IBM's Watson: search "This man became the 44th President of the United States in 2008."
15 IBM's Watson: search Title-oriented search: in some cases, the solution is in the title of highly ranked documents. E.g., "This pizza delivery boy celebrated New Year's at Applied Cryogenics."
16 IBM's Watson: selection Once candidates have been gathered from various sources and methods, rank them according to various scores (IBM Watson uses >50 scoring metrics). E.g., "In cell division, mitosis splits the nucleus & cytokinesis splits this liquid cushioning the nucleus."
17 IBM's Watson: selection One aspect of Jeopardy! is that answers are often posed with puns that have to be disambiguated. E.g., "Bilbo shouldn't have played riddles in the dark with this shady character." [Figure: from WordNet's synonym sets.]
18 How to make money out of this?
19 Making money before search Advertisers used to pay for banner ads that did not depend on user queries. CPM (cost per mille): pay per thousand ad displays. CPC (cost per click): pay when a user clicks an ad. CTR (click-through rate): fraction of ad displays that result in click-throughs. CPA (cost per action): pay only when a user makes an online purchase after click-through.
20 Making money with search Advertisers now bid for keywords. Ads are displayed for the highest bidders when a query contains those keywords. PPC (pay per click): CPC for ads served based on a ranking of bid keywords and user interest (e.g., Google AdWords). (It's a bit more complicated.)
21 How are ads ranked? Today, a two-bid process is typical. First, organizations bid on keywords. By itself, this can lead to abuse, monopolization, and irrelevant content. Second, ads are re-ranked by relevance, as estimated from click-through.
22 How are ads ranked? [Table: columns Advertiser, Bid, CTR, Ad rank, Rank, Paid. Paid values: A (minimum), B $2.68, C $1.51, D $0.51.] Bid: amount determined by the advertiser for a keyword. CTR: click-through rate, an approximation of relevance. Ad rank: Bid × CTR, which trades off advertiser and user interests. Rank: actual rank. Paid: minimum amount necessary to maintain rank, plus 1 cent.
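23 How are ads ranked? Paid: minimum amount necessary to maintain rank, plus 1 cent. With advertisers indexed by rank \(i\), the payment just matches the ad rank of the next advertiser: \(\mathrm{Paid}_i \times \mathrm{CTR}_i = \mathrm{Bid}_{i+1} \times \mathrm{CTR}_{i+1}\), so \(\mathrm{Paid}_i = \mathrm{Bid}_{i+1} \times \frac{\mathrm{CTR}_{i+1}}{\mathrm{CTR}_i} + \$0.01\). E.g., \(\mathrm{Paid}_1 = \$3.00 \times \frac{0.03}{0.06} + \$0.01 = \$1.51\).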
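The pricing rule above can be sketched in a few lines of Python. This is a minimal sketch, not the ad platform's actual implementation: the `Ad` class and `price_ads` function are hypothetical names, and the bids and click-through rates are illustrative values chosen so that the computed payments match the Paid column on the slide ($1.51, $2.68, $0.51).

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid: float  # dollars per click offered by the advertiser
    ctr: float  # click-through rate, used as a proxy for relevance


def price_ads(ads, increment=0.01):
    """Rank ads by bid * CTR; charge each the minimum needed to keep its rank."""
    ranked = sorted(ads, key=lambda ad: ad.bid * ad.ctr, reverse=True)
    results = []
    for i, ad in enumerate(ranked):
        if i + 1 < len(ranked):
            nxt = ranked[i + 1]
            # Paid_i = Bid_{i+1} * CTR_{i+1} / CTR_i + one cent
            paid = round(nxt.bid * nxt.ctr / ad.ctr + increment, 2)
        else:
            paid = None  # the last-ranked ad pays the auction minimum
        results.append((ad.name, i + 1, paid))
    return results


# Illustrative values (assumed, not recovered from the slide's table).
ads = [Ad("A", 4.00, 0.01), Ad("B", 3.00, 0.03),
       Ad("C", 2.00, 0.06), Ad("D", 1.00, 0.08)]

for name, rank, paid in price_ads(ads):
    print(name, rank, paid)
```

Note that advertiser A bids the most but ranks last, because its low click-through rate drags down Bid × CTR.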
24 Aside: highest-paying search terms (according to …): $69.10 mesothelioma treatment options; $66.46 mesothelioma risk; $65.85 personal injury lawyer michigan; $65.74 michigan personal injury attorney; $62.59 student loans consolidation; $61.44 car accident attorney los angeles; $61.26 mesothelioma survival rate; $60.96 treatment of mesothelioma; $59.44 online car insurance quotes; $59.39 arizona dui lawyer.
25 Back to basics. How do we find the right documents for a query?
26 Queries A query is a textual key which orders a specific subset of documents (or answers) in a collection. Historically, queries were highly structured in a logical language, but in modern search engines they are more often streams of syntactically disconnected keywords. A boolean query is a logical combination of boolean membership predicates, e.g., Brutus AND Caesar AND NOT Calpurnia.
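27 Term-document incidence [Matrix: rows are the terms ANTONY, BRUTUS, CAESAR, CALPURNIA, CLEOPATRA, MERCY, WORSER; columns are the plays Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth; an entry is 1 if the play contains the term and 0 otherwise.] For the query Brutus AND Caesar AND NOT Calpurnia, take the rows for Brutus and Caesar and the complement of the row for Calpurnia, and compute their bitwise AND.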
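A minimal sketch of answering this query with bit vectors follows. The 0/1 entries of the slide's matrix did not survive, so the three rows below are assumed values, taken from the standard textbook version of this Shakespeare example; `boolean_query` and `matching` are hypothetical helper names.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One bit per play, leftmost bit = first play. Assumed illustrative rows.
incidence = {
    "BRUTUS":    0b110100,
    "CAESAR":    0b110111,
    "CALPURNIA": 0b010000,
}


def boolean_query(incidence, must, must_not, n_docs):
    """AND together the rows in `must` and the complements of `must_not`."""
    mask = (1 << n_docs) - 1  # keeps complements within n_docs bits
    result = mask
    for term in must:
        result &= incidence[term]
    for term in must_not:
        result &= ~incidence[term] & mask
    return result


def matching(result, names):
    """Translate a result bit vector back into document names."""
    n = len(names)
    return [name for i, name in enumerate(names)
            if (result >> (n - 1 - i)) & 1]


result = boolean_query(incidence, ["BRUTUS", "CAESAR"], ["CALPURNIA"], len(plays))
print(format(result, "06b"), matching(result, plays))
```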
28 Boolean queries and big collections If we have 1 million documents, each with 1,000 tokens, we have 1 billion tokens, and hence at most 1 billion 1s in the matrix. If we have 500,000 distinct terms, the term-document incidence matrix will have 500,000,000,000 elements, so at most 0.2% of its entries can be 1. The matrix is very sparse and a waste of space. Can there be a better way?
29 Inverted index Given a query word, the inverted index gives us all documents that contain that word in the title, the abstract (summary), some hidden metadata, or the entire text. More sophisticated versions also include the frequency and positions of the query word in each document. [Figure: the query "Matlab" is looked up in the inverted index, which returns a list of documents.] How does one construct such indices?
30 Inverted index construction 1. Collect the documents to be indexed, e.g., "Friends, Romans, countrymen." "So let it be with Caesar." 2. Tokenize the text: Friends, Romans, countrymen, So, ... 3. Do preprocessing and normalization, resulting in the indexing terms: friend, roman, countryman, so, ... 4. Create a dictionary (hash) of documents given terms.
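The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions: normalization here is only lower-casing (the slide's "friend", "roman" would additionally need a stemmer or lemmatizer), the document IDs and the third document are made up, and `tokenize` and `build_inverted_index` are hypothetical names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Steps 2-3, simplified: lower-case and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Step 4: map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Step 1: a toy collection (IDs and document 3 are illustrative).
docs = {
    1: "Friends, Romans, countrymen",
    2: "So let it be with Caesar",
    3: "Caesar was ambitious, so Brutus says",
}

index = build_inverted_index(docs)
print(index["caesar"])  # document IDs whose text contains "caesar"
```

Keeping each document list sorted is what later makes list intersection a single linear pass.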
31 Simple conjunctive query Given the query Brutus AND Calpurnia: 1. Locate Brutus in the dictionary and retrieve its document list. 2. Locate Calpurnia in the dictionary and retrieve its document list. 3. Intersect the two document lists and return the result to the user. If the lists are sorted, the intersection takes time linear in their combined length.
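The intersection step can be sketched as a two-pointer merge over sorted lists. The function name `intersect` and the document IDs below are illustrative assumptions, not data from the slides.

```python
def intersect(p1, p2):
    """Intersect two sorted document-ID lists in O(len(p1) + len(p2)) time."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller current ID
        else:
            j += 1
    return out


# Illustrative document lists for the two query terms.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each step advances at least one pointer, which is why the running time is linear only when both lists are sorted.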
32 Constructing indices Spiders (a.k.a. robots, bots, crawlers) start with root (seed) URLs and follow all links on these pages recursively. Novel pages are processed and indexed. Despite its exponential growth in memory with depth, breadth-first search is quite popular. Depth-first search needs memory only linear in depth, but can get lost. Trivia: if you keep clicking on the first contentful link, starting from almost any Wikipedia page, you will usually be led to the Philosophy article.
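A breadth-first crawl can be sketched with a queue and a set of already-seen URLs. This sketch runs on an in-memory toy graph rather than the real web: `crawl_bfs`, the `get_links` callback, and the `web` dictionary are all hypothetical, and a real crawler would also need fetching, parsing, politeness delays, and robots.txt handling.

```python
from collections import deque


def crawl_bfs(seeds, get_links, max_pages=100):
    """Visit pages breadth-first from the seeds; return them in visit order."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)  # a real spider would process and index the page here
        for link in get_links(url):
            if link not in seen:  # only novel pages are queued
                seen.add(link)
                queue.append(link)
    return order


# Toy link graph standing in for the web.
web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
print(crawl_bfs(["a"], lambda url: web.get(url, [])))
```

Swapping `queue.popleft()` for `queue.pop()` turns this into a depth-first crawl, which keeps less in memory but can wander far from the seeds.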
33 Increasing entropy? Boolean retrieval is precise and was very popular for decades (it is still used for structured data, like desktop file search). The amount and value of unstructured data (i.e., text) has grown faster than that of structured data on the web. [Figure: data volume and market cap, unstructured vs. structured data; data from Chris Manning.]
34 Zipf's law on the web These variables have Zipfian distributions: number of links to and from a page; length of web pages; number of web page hits. (Graph from Ray Mooney.)
35 New challenges for IR on the web Distributed data: documents are spread over millions of web servers. Volatile data: documents change or disappear frequently and rapidly. Large volume: petabytes of data. Poor quality: no editorial control, false information, poor writing, typographic errors. Heterogeneity: various media, languages, encodings. Unstructured: no uniform structure, HTML errors, duplicate documents.
36 Detecting duplicates duplicates The user will become annoyed when many top-ranking hits are identical or similar. Nearly identical pages can be detected by hashing. E.g., don't index en.m.wikipedia.org/wiki/ if you've indexed en.wikipedia.org/wiki/. Zero marginal relevance occurs when a highly relevant document becomes irrelevant by being ranked below a (near-)duplicate.
37 Detecting duplicates duplicates Compute similarity with some edit-distance measure. Syntactic similarity (e.g., overlap of bigrams) is easier to measure than semantic similarity. If this measure is above some threshold θ for some pair of documents, we consider them duplicates. Jaccard coefficient: \(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\) is a measure of similarity on [0, 1], with \(J(A, A) = 1\) and \(J(A, B) = 0\) iff \(A \cap B = \emptyset\).
38 Jaccard coefficient on 2-grams Documents: \(d_1\): Jack London went to Toronto. \(d_2\): Jack London went to the city of Toronto. \(d_3\): Jack went from Toronto to London. \(J(d_1, d_2) = \frac{3}{8} = 0.375\) and \(J(d_1, d_3) = 0\).
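The worked example above can be checked with a short sketch. The helper names `bigrams` and `jaccard` are hypothetical, and tokenization is assumed to be plain whitespace splitting with case preserved.

```python
def bigrams(text):
    """Return the set of adjacent word pairs (2-grams) in a text."""
    tokens = text.split()
    return set(zip(tokens, tokens[1:]))


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


d1 = "Jack London went to Toronto"
d2 = "Jack London went to the city of Toronto"
d3 = "Jack went from Toronto to London"

print(jaccard(bigrams(d1), bigrams(d2)))  # 3 shared bigrams out of 8 distinct
print(jaccard(bigrams(d1), bigrams(d3)))  # no shared bigrams
```

Although d1 and d3 use almost the same words, they share no bigram, which shows how sensitive 2-gram overlap is to word order.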
39 Link analysis When we're crawling the web and indexing, we want to retain some record of similarity between (non-duplicate) documents in terms of their link structure. This will help in searching.
40 Bibliometrics: citation analysis Impact factor: developed in 1972 to measure the quality and influence of scientific journals; it measures how often articles are cited. Bibliographic coupling: measure of similarity between documents A and B according to the intersection of their citations (Kessler, 1963).
41 Bibliometrics: citation analysis Co-citation: measure of similarity between documents A and B according to the intersection of the documents that cite them (Small, 1973).
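Both measures reduce to set intersections over a citation graph, as the sketch below shows. The `cites` dictionary is made-up data, and the two function names are hypothetical; raw intersection sizes are used here, with no normalization.

```python
# cites[d] is the set of documents that d cites (toy data).
cites = {
    "A": {"X", "Y", "Z"},
    "B": {"Y", "Z", "W"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
    "F": {"A", "B"},
}


def bibliographic_coupling(cites, a, b):
    """Number of documents cited by both a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def co_citation(cites, a, b):
    """Number of documents that cite both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)


print(bibliographic_coupling(cites, "A", "B"))  # shared references: Y and Z
print(co_citation(cites, "A", "B"))             # cited together by C, D, and F
```

Bibliographic coupling is fixed once both documents are published, whereas co-citation can keep growing as new documents cite the pair.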
42 Links are not citations Many links are navigational within a website. Many pages with high in-degree are portals without much content. Some links are not necessarily endorsements. Relevance of citations in scientific settings is (theoretically) enforced by peer review. Can we mimic the enforcement of relevance usually done by human experts in scientific articles?
43 Authorities and hubs Authorities are pages recognized as significant, trustworthy, and useful for a topic. In-degree (number of incoming links) is an estimate of authority. Should incoming links from authoritative pages count more than others? Hubs are index pages that provide lots of links to relevant content pages. E.g., reddit.com is a hub page for recycled memes.
44 HITS (hits hits hits hits) The HITS algorithm (Kleinberg, 1998) attempts to learn hubs and authorities on a given topic given relevant web subgraphs. Hubs and authorities tend to form bipartite graphs. [Figure: bipartite graph with hubs pointing to authorities.]
45 HITS First, find the (top N) most relevant pages for a query; this is the root set, R. (We'll see how to do this next lecture.) Next, look at the link structure relative to R. The base set S consists of R together with all pages that link to, or are linked from, pages in R. [Figure: R nested inside S.]
46 HITS: Authorities and in-degree Even within S, nodes with high in-degree may not be authorities; they may just be generically popular pages. Authority should be determined by strong hubs. Iteratively (slowly) converge on a mutually reinforcing set of hubs and authorities. For every page \(p \in S\), maintain an authority score \(a_p\) (initialized to \(1/|S|\)) and a hub score \(h_p\) (initialized to \(1/|S|\)).
47 HITS update rules Authorities \(p\) are pointed to by lots of good hubs \(q\): \(a_p = \sum_{q: q \to p} h_q\). E.g., if pages 1, 2, and 3 all link to page 4, then \(a_4 = h_1 + h_2 + h_3\). Hubs point to lots of good authorities: \(h_q = \sum_{p: q \to p} a_p\). E.g., if page 4 links to pages 1, 2, and 3, then \(h_4 = a_1 + a_2 + a_3\).
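The two update rules can be iterated as in the sketch below. Several choices are assumptions rather than something the slides specify: scores are rescaled to sum to 1 after every round (Kleinberg's formulation normalizes too, but with the Euclidean norm), hub scores are computed from the freshly updated authority scores, and a fixed iteration count stands in for a convergence test. The function name `hits` and the toy graph are hypothetical.

```python
def hits(links, iterations=50):
    """links maps each page to the set of pages it links to (within S)."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    auth = {p: 1.0 / len(pages) for p in pages}
    hub = dict(auth)
    for _ in range(iterations):
        # a_p = sum of h_q over all q that link to p
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # h_q = sum of a_p over all p that q links to
        new_hub = {q: sum(new_auth[p] for p in links.get(q, ())) for q in pages}
        # Rescale so each score vector sums to 1 (assumed normalization).
        a_total = sum(new_auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {p: v / a_total for p, v in new_auth.items()}
        hub = {p: v / h_total for p, v in new_hub.items()}
    return auth, hub


# Toy base set: three hub-like pages pointing at two content pages.
links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "h3": {"a1"}}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph a1 ends up as the top authority because every hub points to it, and h3 scores lower as a hub because it misses a2.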
48 Page similarity using HITS Given honda.com, we also get: toyota.com, ford.com, bmwusa.com, saturn.com, nissanmotors.com. This method can have trouble with ambiguous queries, however.
49 PageRank PageRank (Brin & Page, 1998) is an alternative to HITS that does not distinguish between hub and authority.
50 PageRank: initial idea In-degree alone does not account for the authority of the source of a link. For page \(p\), the page rank is \(R(p) = c \sum_{q: q \to p} \frac{R(q)}{N_q}\), where \(N_q\) is the total number of out-links from \(q\) and \(c\) is a normalizing constant. A page's rank flows out equally among its outgoing links.
51 PageRank: flow of authority PageRank iteratively adjusts all \(R(p)\) until the overall page ranking converges to a steady state.
52 PageRank: problem Groups of purely self-referential pages (linked from the outside) are sinks that absorb all the rank in the system during the iterative rank assignment process.
53 PageRank: rank source An ethereal rank source \(E\) continually replenishes the rank of each page \(p\) by a fixed amount \(E(p)\): \(R(p) = c \left( \sum_{q: q \to p} \frac{R(q)}{N_q} + E(p) \right)\).
54 Complete ranking A complete ranking involves combining: PageRank; preferences using HTML tags (e.g., title or abstract are often highly informative); similarity of query words and documents. How do we relate query words and documents in the first place?
55 Next lecture How to relate query terms and documents. Vector space model. How to generalize query terms. Latent semantic indexing. How to rank documents. Singular value decomposition. How to evaluate different search engines.
56 Misc Some slides and material are based on those of Ray J. Mooney (UTexas, CS371R), Hinrich Schütze, Christina Lioma, Chris Manning (Stanford, CS276), and Dan Jurafsky (Stanford, CS124).
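57 Aside: PageRank algorithm
Given the total set of pages \(S\):
Let \(\forall p \in S: E(p) = \alpha / |S|\) (for some \(0 < \alpha < 1\))
Initialize \(\forall p \in S: R(p) = 1 / |S|\)
Until convergence:
    For each \(p \in S\): \(R'(p) \leftarrow (1 - \alpha) \sum_{q: q \to p} \frac{R(q)}{N_q} + E(p)\)
    \(c \leftarrow 1 / \sum_{p \in S} R'(p)\)
    For each \(p \in S\): \(R(p) \leftarrow c\,R'(p)\) // normalize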
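A runnable sketch of this iteration is given below. It assumes a uniform rank source E(p) = alpha/|S|, a default alpha of 0.15, and convergence measured by the summed absolute change in rank; these choices, the function name `pagerank`, and the toy graph are illustrative assumptions rather than details confirmed by the slides.

```python
def pagerank(links, alpha=0.15, tol=1e-10, max_iter=1000):
    """links maps each page to the set of pages it links to."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)
    e = alpha / n  # uniform rank source E(p), an assumption
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: e for p in pages}
        for q in pages:
            out = links.get(q, ())
            for p in out:
                # q's rank flows out equally among its N_q out-links
                new[p] += (1 - alpha) * rank[q] / len(out)
        c = 1.0 / sum(new.values())  # normalizing constant
        new = {p: c * v for p, v in new.items()}
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Toy graph: a links to b and c, b links to c, c links back to a.
links = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page c ranks highest because it receives all of b's rank and half of a's, while b receives only half of a's rank.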
More informationWeb Search Basics. Berlin Chen Department t of Computer Science & Information Engineering National Taiwan Normal University
Web Search Basics Berlin Chen Department t of Computer Science & Information Engineering i National Taiwan Normal University References: 1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationInformation Retrieval
Introduction to Information Retrieval Introducing Information Retrieval and Web Search Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructurednature
More informationInformation Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group
Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)
More informationGes$one Avanzata dell Informazione Part A Full- Text Informa$on Management. Full- Text Indexing
Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management Full- Text Indexing Contents } Introduction } Inverted Indices } Construction } Searching 2 GAvI - Full- Text Informa$on Management:
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview
More informationSearching the Web [Arasu 01]
Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web
More informationHomework: Exercise 19. Homework: Exercise 21. Homework: Exercise 20. Homework: Exercise 22. Detour: Apache Lucene
Homework: Exercise 19 Are the following statements true or false? Information Retrieval and Web Search Engines In a Boolean retrieval system, stemming never lowers precision Lecture 10: Introduction to
More informationA Survey on Web Information Retrieval Technologies
A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New York, Stony Brook Presented by Kajal Miyan Michigan State University Overview Web Information
More informationInformation Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer
More informationIntroduction to Information Retrieval
Boolean model and Inverted index Processing Boolean queries Why ranked retrieval? Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute
More informationPV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211
PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 IIR 1: Boolean Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,
More informationLecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule
Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question
More informationLec 8: Adaptive Information Retrieval 2
Lec 8: Adaptive Information Retrieval 2 Advaith Siddharthan Introduction to Information Retrieval by Manning, Raghavan & Schütze. Website: http://nlp.stanford.edu/ir-book/ Linear Algebra Revision Vectors:
More informationComputer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm
Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm Name: Student Id Number: 1. This is a closed book exam. 2. Please answer all questions. 3. There are a total of 40 questions.
More informationSearching the Web What is this Page Known for? Luis De Alba
Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse
More informationDigital Libraries: Language Technologies
Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................
More information: Semantic Web (2013 Fall)
03-60-569: Web (2013 Fall) University of Windsor September 4, 2013 Table of contents 1 2 3 4 5 Definition of the Web The World Wide Web is a system of interlinked hypertext documents accessed via the Internet
More informationCHAPTER THREE INFORMATION RETRIEVAL SYSTEM
CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost
More informationCS60092: Informa0on Retrieval. Sourangshu Bha<acharya
CS60092: Informa0on Retrieval Sourangshu Bha
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationIntroduction to Text Mining. Hongning Wang
Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:
More informationCOMP5331: Knowledge Discovery and Data Mining
COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationWeb Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search
Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search
More informationText Retrieval and Web Search IIR 1: Boolean Retrieval
Text Retrieval and Web Search IIR 1: Boolean Retrieval Mihai Surdeanu (Based on slides by Hinrich Schütze at informationretrieval.org) Spring 2017 Boolean Retrieval 1 / 88 Take-away Why you should take
More informationEECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling
EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report
More informationAuthoritative Sources in a Hyperlinked Environment
Authoritative Sources in a Hyperlinked Environment Journal of the ACM 46(1999) Jon Kleinberg, Dept. of Computer Science, Cornell University Introduction Searching on the web is defined as the process of
More informationInformation Retrieval CSCI
Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability
More informationPlan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis
CS276B Text Retrieval and Mining Winter 2005 Lecture 7 Plan for today Review search engine history (slightly more technically than in the first lecture) Web crawling/corpus construction Distributed crawling
More informationInformation Retrieval. Lecture 4: Search engines and linkage algorithms
Information Retrieval Lecture 4: Search engines and linkage algorithms Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk Today 2
More informationInformation Networks. Hacettepe University Department of Information Management DOK 422: Information Networks
Information Networks Hacettepe University Department of Information Management DOK 422: Information Networks Search engines Some Slides taken from: Ray Larson Search engines Web Crawling Web Search Engines
More information