- Eustace Grant
- 5 years ago
CMSC 476/676 Review

1. Week 1 Overview of Information Retrieval
   a. Information retrieval (IR) is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
   b. The primary focus of IR since the 1950s has been on text and documents.
      i. It now also includes images, video, audio, music, and scanned documents, each with retrieval mechanisms suited to the particular media.
   c. Text documents can be web pages, email, books, news stories, scholarly papers, text messages, Word, PowerPoint, PDF, forum postings, patents, IM sessions, etc.
   d. Text documents have lots of text and some structure, e.g. title, author, date for papers; subject, sender, destination for email.
   e. Text documents vs. relational database records
      i. Database records (or tuples in relational databases)
         1. Well-defined fields (or attributes), e.g. bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
         2. Easy to compare fields using well-defined semantics and a query language (SQL)
      ii. Text documents: comparing the query text to the document text is the core challenge
   f. IR tasks include ad-hoc search, filtering, classification, and question answering.
   g. Relevance of a document to a query is usually based on statistical rather than linguistic properties of text.
      i. Statistical properties: counting simple text features such as words instead of parsing and analyzing the sentences
      ii. Linguistic properties: breaking sentences down into parts of speech, noun phrases, and verb phrases
   h. Judging effectiveness: 2 measures
      i. Recall: percentage of ALL relevant documents that are retrieved by the query
      ii. Precision: percentage of retrieved documents that are relevant
   i. Indexes are data structures designed to improve search efficiency
      i. Designing and implementing them are major issues for search engines
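The two effectiveness measures in item h can be computed directly from sets of document ids. The following is a minimal sketch; the function name `precision_recall` and the example ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: doc ids returned by the search engine
    relevant:  ALL doc ids judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 2 of them relevant; 5 relevant documents exist in total.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```

Retrieving more documents can only raise recall, but it usually lowers precision, which is why the two are reported together.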
2. Week 2 Overview of Search Engine Architecture
   a. Search engine architecture is defined by effectiveness (quality of results) and efficiency (response time and throughput)
   b. Text Acquisition: types of crawlers are Web, Enterprise, Desktop
      i. Web crawlers efficiently find huge numbers of web pages (coverage) and keep them up to date (freshness)
      ii. Document crawlers for enterprise and desktop search follow links and scan directories
   c. Document Feeds
      i. Real-time feeds from blogs, news channels, etc.
      ii. RSS feeds are in XML
   d. Document Conversion/Transformation
      i. Parse/convert document formats into text, e.g. HTML, XML, Word, PDF
      ii. Convert text encodings from different languages and encoding schemes: UTF-8, UTF-16, UTF-32, ASCII, EBCDIC
      iii. Tokenizer recognizes words in the text
         1. Consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
      iv. Stopping: remove common words, e.g. and, or, the
      v. Stemming: remove word endings, e.g. ing, ed
   e. Document Statistics
      i. Counts of each token/term/word in a document
      ii. Counts of each token/term/word in the collection of documents
      iii. Counts of token/term/word occurrences in documents
   f. Inverted Index: reverses document-term information into term-document information
   g. Index Distribution
      i. Distributes indexes across multiple computers and/or multiple sites
      ii. Essential for fast query processing with large numbers of documents
      iii. See Bigtable (item i)
   h. Scoring queries
      i. Calculates a score for the query compared to the indexed documents
      ii. Basic form of the score is $\sum_i q_i d_i$
         1. $q_i$ and $d_i$ are the query and document term weights for term $i$
   i. Google Bigtable
      i. A row is created when a new document is stored
      ii. Rows are sorted lexicographically (alphabetically), e.g. so that pages on the same topic are stored together, usually on the same machine
      iii. Large tables are broken into tablets at row boundaries
      iv. Each tablet holds a contiguous range of rows, sized in MB
      v. A server manages ~100 tablets
   j. How does Google work? 3 key features
      i. Web Crawling: tiers of crawlers with different frequencies. News sites change most frequently, other sites less frequently. Crawlers are called spiders. They search for documents to index and store.
      ii. Indexing: does the document have the key terms "Katy" and "Perry"?
         1. When you type in a set of terms to search, you are searching the index, not the web.
      iii. Ranking: PageRank plus many other signals, e.g. "Katy" and "Perry" next to each other. Provides a ranked list of the documents that best match the query based on a scoring function.
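The inverted index and the basic score (the sum over terms of query weight times document weight) from this week can be sketched together. This is an illustrative toy, not how any production engine is written: the whitespace tokenizer, the use of raw counts as term weights, and the three sample documents are all assumptions.

```python
from collections import Counter, defaultdict

STOPWORDS = {"and", "or", "the"}  # the stopping examples given above


def tokenize(text):
    """Lower-case, split on whitespace, drop stopwords (deliberately simple)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def build_inverted_index(docs):
    """Turn document -> terms into term -> {doc_id: count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def score(query, index):
    """Score each document as sum_i q_i * d_i, using raw counts as weights."""
    scores = defaultdict(int)
    for term, q_weight in Counter(tokenize(query)).items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by doc id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    1: "katy perry and the tour",
    2: "perry mason",
    3: "the weather",
}
index = build_inverted_index(docs)
print(index["perry"])              # {1: 1, 2: 1}
print(score("katy perry", index))  # [(1, 2), (2, 1)]
```

Only the postings of the query terms are touched, so document 3 is never examined; that is the efficiency gain the index provides.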
3. Week 3 - Web Crawling: HTTP, Web Scraping, Freshness, RSS Feeds, Character Encoding
   a. HTTP: Hypertext Transfer Protocol
      i. Client/server model: the client runs on your device
      ii. Stateless protocol: each transaction is independent of any other, but cookies track your state.
      iii. Application-layer protocol on top of a transport-layer protocol, e.g. TCP. TCP cares how data is sent and how it is acknowledged. HTTP doesn't care about transport-layer details.
      iv. Client actions, e.g. GET, POST, DELETE
      v. Defines response status codes, e.g. 404 Not Found
      vi. Headers: small pieces of custom information sent in both directions, with requests and responses, such as content type (e.g. text file, JSON) and Cache-Control.
   b. Web Scraping: in the absence of an API, you can scrape any web page
      i. Open the URL, then GET the HTML
      ii. Read the HTML text into one big string
      iii. Parse the HTML string
      iv. Find elements of the page, e.g. body tag, span tag, etc.
      v. HTML is an embedded tagging language
      vi. In Python you can parse HTML using Beautiful Soup and other modules.
   c. Freshness (Web Crawling 6: Keeping the Index Fresh)
      i. How often should a web page be re-crawled? How often is a web page refreshed? We want the index to be fresh.
      ii. Once the page has changed since we crawled it, our copy is stale and no longer fresh.
      iii. But you don't know when the page has changed.
      iv. Assume (1) page changes are independent of one another, and (2) two changes cannot happen at the same time.
      v. Inter-arrival times are then exponential (Poisson distribution).
      vi. Want to know the expected age of a page.
      vii. Need the frequency of updates = C. Can estimate C by counting the changes over an interval.
      viii. Estimate C adaptively: if the page changed in the interval, make the interval smaller.
   d. RSS Feeds: Really Simple Syndication file format
      i. Similar to the on-screen guide for cable TV
      ii. Channel, Item (blog post)
      iii. XML: like HTML but without page structure; tags for channel, item, description, etc.
      iv. Feed with entries
      v. Blogs, news, podcasts, meetups, etc.
   e. Character Encoding: Unicode, ASCII, EBCDIC
      i. Code points: Unicode 8, 16
         1. Latin script is the same as ASCII within a given range.
         2. Different language scripts occupy different ranges.
         3. Math symbols, special characters, etc. have their own ranges.
      ii. Encoding code points
         1. UTF-8 uses 8-bit code units (a code point takes one or more bytes)
         2. Code points are conventionally written in hexadecimal.
   f. LSH: Locality Sensitive Hashing
      i. Similar hash codes for nearby points (documents)
      ii. SimHash: an efficient variant of LSH
         1. Tokenize
         2. Assign weights to tokens
         3. Compute a b-bit binary hash for each word
         4. Convert 0 to -1 and multiply by the word weight
         5. Add by columns. Set columns > 0 to 1, 0 otherwise.
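The five SimHash steps listed under LSH can be followed almost line by line in code. This sketch makes several assumptions that the notes do not specify: tokens come from a whitespace split, the weight of a token is its frequency in the text, and the per-word hash is the low `bits` bits of an MD5 digest (any stable hash would do).

```python
import hashlib
from collections import Counter


def simhash(text, bits=64):
    """Return a `bits`-bit SimHash fingerprint of `text` as an int (bits <= 128)."""
    weights = Counter(text.lower().split())  # steps 1-2: tokenize, weight = frequency
    columns = [0] * bits
    mask = (1 << bits) - 1
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask  # step 3: b-bit hash for each word
        for i in range(bits):
            # step 4: a 1 bit contributes +weight, a 0 bit contributes -weight
            columns[i] += weight if (h >> i) & 1 else -weight
    # step 5: columns were added up above; positive columns become 1, others 0
    return sum(1 << i for i in range(bits) if columns[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash("tropical fish include fish found in tropical environments")
b = simhash("tropical fish include fish found in tropical waters")
print(hamming(a, b))
```

Near-duplicate documents tend to produce fingerprints that differ in few bits, so a crawler can compare fingerprints by Hamming distance instead of comparing full texts.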
4. Week 4
   a. Scoring Terms
      i. Term Frequency and Similarity functions
         1. Which words occur in both query and document?
         2. Basic similarity score: dot product of the terms in the query and the terms in the document.
         3. Term frequency can be
            a. Binary: presence or absence
            b. Count: number of times the term occurs in the document
            c. Normalized by document length to avoid long-document bias
            d. Rank invariance: change the numbers but not the rank
      ii. Inverse Document Frequency
         1. Rare words carry more meaning
         2. Some words carry content, e.g. aardvark, cryogenic. Have them carry more weight.
            a. If a word occurs once in a corpus, it is probably a typo
            b. If it occurs 3 or 4 times in a corpus, it is an important word.
            c. If the number of documents containing the word is small compared to the number of documents in the corpus, it is probably important.
         3. Stopwords don't carry content: and, or, the (linguistic glue)
         4. The logarithm puts the idf component on the same scale as the tf component
      iii. Feature Selection with tf-idf
         1. Retrieval models and ranking algorithms depend heavily on statistical properties of words
            a. e.g., important words occur often in documents but are not high frequency in the collection
         2. See Zipf's law
   b. Zipf's Law
      i. Distribution of word frequencies is very skewed
         1. A few words occur very often, many words hardly ever occur
         2. e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents
      ii. Zipf's law:
         1. Observation that the rank ($r$) of a word times its frequency ($f$) is approximately a constant ($k$)
         2. Assuming words are ranked in order of decreasing frequency
            a. i.e., $r \cdot f \approx k$ or $r \cdot P_r \approx c$, where $P_r$ is the probability of word occurrence and $c \approx 0.1$ for English
      iii. What is the proportion of words with a given frequency?
         1. A word that occurs $n$ times has rank $r_n = k/n$
         2. Number of words with frequency $n$ is $r_n - r_{n+1} = k/n - k/(n+1) = k/(n(n+1))$
         3. Proportion found by dividing by the total number of words = highest rank = $k$
         4. So, the proportion with frequency $n$ is $1/(n(n+1))$
   c. Vocabulary growth: Heaps' Law
      i. As the corpus grows, so does the vocabulary size
      ii. Fewer new words when the corpus is already large
      iii. Observed relationship (Heaps' Law): $v = k \cdot n^{\beta}$, where $v$ is the vocabulary size (number of unique words); $n$ is the number of words in the corpus; $k$, $\beta$ are parameters that vary for each corpus (typical values given are $10 \le k \le 100$ and $\beta \approx 0.5$)
      iv. New words come from a variety of sources: spelling errors, invented words (e.g. product, company names), code, other languages, addresses, etc.
      v. How many pages contain all of the query terms?
         1. For the query "a b c": $f_{abc} = N \cdot (f_a/N) \cdot (f_b/N) \cdot (f_c/N) = (f_a \cdot f_b \cdot f_c)/N^2$
            a. Assuming that terms occur independently
            b. $f_{abc}$ is the estimated size of the result set
            c. $f_a$, $f_b$, $f_c$ are the number of documents that terms a, b, and c occur in
            d. $N$ is the number of documents in the collection
   d. Tokenizing problems
      i. Small words can be important in some queries, usually in combinations
         1. xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II
      ii. Both hyphenated and non-hyphenated forms of many words are common
         1. Sometimes the hyphen is not needed
         2. e-bay, wal-mart, active-x, cd-rom, t-shirts
      iii. At other times, hyphens should be considered either as part of the word or as a word separator
         1. winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
      iv. Special characters are an important part of tags, URLs, and code in documents
      v. Capitalized words can have a different meaning from lower-case words
         1. Bush, Apple
      vi. Apostrophes can be a part of a word, a part of a possessive, or just a mistake
         1. rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
      vii. Numbers can be important, including decimals
         1. nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, (a US patent?)
      viii. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
         1. I.B.M., Ph.D., cs.umbc.edu, F.E.A.R.
      ix. Note: tokenizing steps for queries must be identical to the steps for documents
   e. Tokenizing Process
      i. First step is to use a parser to identify the appropriate parts of the document to tokenize
      ii. Defer complex decisions to other components
      iii. A word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lower case
      iv. Everything is indexed
      v. Examples of rules used with TREC
         1. Apostrophes in words are ignored
            a. o'connor → oconnor, bob's → bobs
         2. Periods in abbreviations are ignored
            a. I.B.M. → ibm, Ph.D. → ph d
   f. Stopping
      i. Function words (determiners, prepositions) have little meaning on their own
      ii. High occurrence frequencies
      iii. Treated as stopwords (i.e. removed)
         1. Reduce index space, improve response time, improve effectiveness
      iv. Can be important in combinations
         1. e.g., "to be or not to be"
      v. Lists are customized for applications, domains, and even parts of documents
         1. e.g., "click" is a good stopword for anchor text
      vi. Best policy may be to index all words in documents and decide which words to use at query time
   g. Stemming
      i. Many morphological variations of words
         1. Inflectional (plurals, tenses)
         2. Derivational (making verbs into nouns, etc.)
      ii. In most cases, these have the same or very similar meanings
      iii. Stemmers attempt to reduce morphological variations of words to a common stem
         1. Usually involves removing suffixes
      iv. Can be done at indexing time or as part of query processing (like stopwords)
      v. Generally a small but significant effectiveness improvement
         1. Can be crucial for some languages, e.g., 5-10% improvement for English, up to 50% in Arabic
      vi. Two basic types
         1. Dictionary-based: uses lists of related words
         2. Algorithmic: uses a program to determine related words
            a. suffix-s: remove "s" endings assuming plural, e.g., cats → cat, lakes → lake, wiis → wii
            b. Many false negatives: supplies → supplie
            c. Some false positives: ups → up
      vii. Porter Stemmer: algorithmic stemmer used in IR experiments since the 1970s
      viii. Krovetz Stemmer: hybrid algorithmic-dictionary
         1. Word is checked in the dictionary
         2. If present, it is either left alone or replaced with an exception
         3. If not present, the word is checked for suffixes that could be removed
         4. After removal, the dictionary is checked again
   h. Character n-grams
      i. Building stemmers takes a lot of time and expertise in the language; character n-grams work well instead
      ii. Use a sliding window of length n over the string.
         1. Get a lot of overlapping n-grams that capture a lot of linguistic information
         2. Tolerant of garbles and misspellings in the text
         3. Usable for any language/script (particularly European languages, but will also work for Chinese and others)
      iii. N-grams of size (per Dr. Cohen)
         1. 3 are good for language identification (id)
         2. Other sizes suit topic id, speaker id, and malware or plagiarism detection
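The sliding-window idea behind character n-grams fits in a few lines. In this sketch the function names are made up, and comparing two strings by Jaccard overlap of their n-gram sets is an assumption added to illustrate the tolerance to misspellings; the notes do not prescribe a particular comparison.

```python
def char_ngrams(text, n=3):
    """Slide a window of length n over the text and return every n-gram in order."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def overlap(a, b, n=3):
    """Jaccard overlap of the n-gram sets of two strings (0.0 to 1.0)."""
    x, y = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return len(x & y) / len(x | y) if (x | y) else 0.0


print(char_ngrams("search"))  # ['sea', 'ear', 'arc', 'rch']

# A misspelling still shares several trigrams with the correct word,
# while an unrelated word shares none.
print(overlap("retrieval", "retreival") > overlap("retrieval", "compression"))  # True
```

No dictionary or language-specific rule is involved, which is why the same code can be pointed at any language or script.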
5. Week 5 Building and Using Inverted Indexes
   a. Use a hash table (dictionary) to store information by term (word, n-gram, ...)
      i. Represent documents as vectors
         1. One entry per term (word)
         2. Sparse: ~10^6 entries, most = 0
         3. Zipf's law says ½ of all terms will occur only once
         4. Value = number of times the term occurs in the document
         5. A document is a vector
         6. A collection of documents becomes a sparse matrix. You don't want to store all of the zeros in the matrix, so use an inverted index, which is a sparse representation
      ii. Inverted lists (also called postings lists)
         1. Attached to each bucket (term) in the hash table
         2. The column vector is an inverted list with one posting for each document, sorted in document-number order
         3. Contains information about the value of the term in the document
            a. Counts, weights, etc.
         4. Keep additional information with terms:
            a. Number of documents in the list (number of documents that contain that term)
            b. Number of times the term is used in the collection
         5. Compact and fast because sorted in document order
      iii. Linear Merge
         1. Merging the inverted lists of the terms that occur in a query and a document (or 2 documents) works because the documents are in sorted order.
            a. Set a pointer to the top of each of the first 2 inverted lists
            b. Compare the documents at the pointers in each posting list
               i. If Boolean scoring, then just looking for a match
               ii. If calculating a scoring function, it must be decomposable into a linear function, e.g. cosine similarity
            c. If not the same document, advance the pointer in the list with the lowest document number, then go back to b.
            d. Else you have a match
         2. If more than two terms, need a multi-way merge
      iv. Proximity Indexes: phrase search
         1. Turn the phrase into a Boolean AND if there is no positional information in the indexes
         2. Term pairs: re-tokenize into 2- or 3-word sequences; increases the size of the inverted index
         3. Proximity index stores positions instead of counts
            a. Key to rich indexing: structure, fields, tags, etc.
            b. DNA index store
            c. Could build a NEAR operator.
            d. Order may be important
         4. Use linear merge; the comparison looks for positions within a given range
      v. Extent Indexes: structure and tags
         1. Separate field or index for each structural element; now need to look at multiple indexes
         2. Push structure into index values: author = William
         3. Extent index: like a positional index
            a. Special new term for each structural element (named-entity term) with a positional index span (beginning and end), e.g. in doc3 spanning positions 1 to 3.
            b. Does a term fall in the span of a special term? That is, does "William" fall within the Author span?
            c. Overlays on other terms
   b. Index Compression: indexes are big. Reduce I/O time. Make big numbers small by converting document numbers to the deltas between them, and pack the deltas into a smaller number of bits
      i. Delta encoding
         1. Doc numbers are big: 10 billion docs
         2. 4 bytes can only store about 4 billion doc numbers
         3. Often the difference between document numbers is small because lists are sorted by doc ID. Take differences (deltas).
         4. Works well for frequent words
      ii. V-byte encoding
         1. Pack numbers into fewer bytes to represent small numbers
         2. Variable number of bytes: small numbers need 1 byte to encode, etc. Each byte carries 7 data bits, with the high bit set to indicate that this is the last byte of the encoded number
            a. 6 = 10000110
            b. 127 = 11111111
            c. 128 = 00000001 10000000
         3. For sparse terms, the deltas between document numbers in the posting lists will be bigger and thus will need multiple bytes to represent the difference.
      iii. Skip pointers
   c. Query Execution
      i. Set-based: a document matches or not
      ii. Ranked retrieval: rank by relevance
      iii. Goal: fast retrieval
         1. Doc-at-a-time (DaaT)
            a. Assemble a matching vector for each document
               i. List of matching words/counts/positions
               ii. Compute score
               iii. Output score
            b. K-way linear merge
               i. Compute the scoring function for each document as you go through the merge
               ii. Optimal with 2 lists
               iii. Naïve variant with k lists is to merge one by one, e.g. merge a,b with c, then a,b,c with d, etc.
                  1. Complexity: worst case O(n^2) for a query with all rare words
               iv. Better way: loop over all pointers at once
                  1. Complexity: loop over k pointers until all docs are done. If no two lists have the same docs (n), then O(kn)
                  2. If lists are dense and point to the same docs, then the score operation is O(k^2)
               v. Even better: use a min-heap (priority queue)
                  1. Keep pointers in a data structure that finds the smallest in logarithmic time: O(n log k)
               vi. Only emits a score when non-zero
         2. Term-at-a-time (TaaT): flip the problem, don't merge lists
            a. Incrementally compute scores for all documents, updated one term at a time.
            b. Initialize an array, then for each term iterate over its inverted list and update the array with each doc's information
            c. Compute scores in no particular order
            d. Final step is to extract scores from the array (e.g. take the top 5 from the sorted array)
            e. O(n + N), where n = length of all lists, N = number of docs
         3. Trade-offs
            a. TaaT
               i. Linear in n, with sequential I/O (fetched by term); order doesn't matter, so it can process lists in the order stored on disk in a single pass
               ii. But it has to store the array of N accumulators, one per document. This can be an issue if the array is on disk.
               iii. Can only be used if the scoring function is quasi-linear, i.e. it is decomposable into parts that are:
                  1. Dependent on the query, g(q)
                  2. Dependent on the document, h(d)
                  3. Combined with a dot-product summation
            b. DaaT
               i. Higher run time O(nk), O(k) memory, non-sequential I/O (take a hit on performance)
               ii. Any scoring function
               iii. Structured queries
         4. Why complexity matters: compute pairwise similarity with 10^6 docs, 10^6 terms, ~10^3 terms/doc
            a. Brute force: 10^15 operations (10^6 x 10^6 x 10^3)
               i. Compute the similarity of every doc to every other doc
            b. DaaT: C*10^13 (10 to 100 times faster)
               i. For each query doc Q: 10^6
               ii. Merge 10^3 inverted lists (one per query term; remember the avg doc length is 10^3)
                  1. Avg size of a list is 10^3 posts = 10^9 (nonempty cells over all posts) / 10^6 (columns)
                  2. Total number of posts across all lists is 10^6 = 10^3*10^3
               iii. Heap merge of 10^3 lists with 10^6 elements is 10^7 = 10^6*log2(10^3). C is a constant to maintain the heap.
               iv. Compute similarity for all matches above: 10^6 (total number of inverted lists), which is less than 10^7
               v. Complexity = C*(10^6*10^7) = C*10^13
            c. TaaT: 10^12 (1000 times faster)
               i. For each query: 10^6
               ii. For each term in the query: 10^3
               iii. For each doc D in the inverted list: 10^3 (= 10^9/10^6)
               iv. Complexity = 10^6*10^3*10^3 = 10^12, and no C term
   d. Index Construction
      i. Add a document to the index by appending to the inverted lists of the terms in the document; create a new entry if the term does not exist
      ii. Why will this not scale?
         1. If everything is in memory, then i above is OK, but a tuple is needed for each post, so it uses lots of memory
         2. If on disk, then for each word: fetch the inverted list, append an entry, then write it back to disk.
         3. Use index merging for documents in batches: same as linear merge, combining partial indexes on disk
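Delta encoding and v-byte encoding from the Index Compression item combine naturally: take the gaps between sorted document numbers, then pack each gap into as few bytes as possible. The sketch below assumes 7 data bits per byte, the high bit set on the last byte of each number, and the most significant 7-bit group written first; the byte order and the function names are assumptions for illustration.

```python
def vbyte_encode(number):
    """Encode one non-negative int: 7 data bits per byte, high bit marks the last byte."""
    chunks = [number & 0x7F]
    number >>= 7
    while number:
        chunks.append(number & 0x7F)
        number >>= 7
    chunks.reverse()    # most significant 7-bit group first (assumed byte order)
    chunks[-1] |= 0x80  # flag the final byte of this number
    return bytes(chunks)


def compress(doc_ids):
    """Delta-encode a sorted list of doc ids, then v-byte encode each delta."""
    out, previous = bytearray(), 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decompress(data):
    """Invert compress(): rebuild the sorted doc ids from the byte stream."""
    doc_ids, value, previous = [], 0, 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:  # last byte of the current delta
            previous += value
            doc_ids.append(previous)
            value = 0
    return doc_ids


for n in (6, 127, 128):
    print(n, " ".join(f"{b:08b}" for b in vbyte_encode(n)))
# 6 10000110
# 127 11111111
# 128 00000001 10000000

postings = [3, 130, 131, 400]  # deltas: 3, 127, 1, 269
packed = compress(postings)
print(len(packed), decompress(packed) == postings)  # 5 True
```

Four document numbers fit in 5 bytes here instead of 16 bytes as fixed 4-byte integers; only the delta of 269 needs a second byte, which matches the point that sparse terms with large gaps compress less well.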
More informationDigital Libraries: Language Technologies
Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................
More informationCOMP6237 Data Mining Searching and Ranking
COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001
More information3-2. Index construction. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
3-2. Index construction Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Ch. 4 Index construction How do we construct an index? What strategies can we use with
More informationRecap: lecture 2 CS276A Information Retrieval
Recap: lecture 2 CS276A Information Retrieval Stemming, tokenization etc. Faster postings merges Phrase queries Lecture 3 This lecture Index compression Space estimation Corpus size for estimates Consider
More informationInformation Retrieval 6. Index compression
Ghislain Fourny Information Retrieval 6. Index compression Picture copyright: donest /123RF Stock Photo What we have seen so far 2 Boolean retrieval lawyer AND Penang AND NOT silver query Input Set of
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationHome Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit
Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo
More informationInternational Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine
International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 02 The Term Vocabulary and Postings Lists Constructing Inverted Indexes The major steps in constructing
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI
More informationAnalyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc
Analyzing the performance of top-k retrieval algorithms Marcus Fontoura Google, Inc This talk Largely based on the paper Evaluation Strategies for Top-k Queries over Memory-Resident Inverted Indices, VLDB
More informationCS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University
CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University Indexing Process Processing Text Conver.ng documents to index terms Why? Matching the exact string
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 04 Index Construction 1 04 Index Construction - Information Retrieval - 04 Index Construction 2 Plan Last lecture: Dictionary data structures Tolerant
More informationIndexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems
Indexing: Part IV CPS 216 Advanced Database Systems Announcements (February 17) 2 Homework #2 due in two weeks Reading assignments for this and next week The query processing survey by Graefe Due next
More informationMidterm Exam Search Engines ( / ) October 20, 2015
Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points
More informationIndexing and Searching
Indexing and Searching Introduction How to retrieval information? A simple alternative is to search the whole text sequentially Another option is to build data structures over the text (called indices)
More informationSearch Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson
Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Indexes Indexes are data structures designed to make search faster Text search has unique
More informationIndex Compression. David Kauchak cs160 Fall 2009 adapted from:
Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?
More informationIndexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table
Indexing Web pages Web Search: Indexing Web Pages CPS 296.1 Topics in Database Systems Indexing the link structure AltaVista Connectivity Server case study Bharat et al., The Fast Access to Linkage Information
More informationQuery Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4
Query Languages Berlin Chen 2005 Reference: 1. Modern Information Retrieval, chapter 4 Data retrieval Pattern-based querying The Kinds of Queries Retrieve docs that contains (or exactly match) the objects
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationBing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer
Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web
More informationMore on indexing CE-324: Modern Information Retrieval Sharif University of Technology
More on indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Plan
More informationmodern database systems lecture 4 : information retrieval
modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation
More informationText Retrieval an introduction
Text Retrieval an introduction Michalis Vazirgiannis Nov. 2012 Outline Document collection preprocessing Feature Selection Indexing Query processing & Ranking Text representation for Information Retrieval
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:
More informationWeb Information Retrieval. Lecture 4 Dictionaries, Index Compression
Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data
More informationElementary IR: Scalable Boolean Text Search. (Compare with R & G )
Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context
More informationQuery Processing and Alternative Search Structures. Indexing common words
Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such
More informationInformation Retrieval May 15. Web retrieval
Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically
More informationBasic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval
Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert
More information5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval
Acknowledgement Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014 Contents of lectures, projects are extracted
More informationFull-Text Indexing For Heritrix
Full-Text Indexing For Heritrix Project Advisor: Dr. Chris Pollett Committee Members: Dr. Mark Stamp Dr. Jeffrey Smith Darshan Karia CS298 Master s Project Writing 1 2 Agenda Introduction Heritrix Design
More informationDepartment of Electronic Engineering FINAL YEAR PROJECT REPORT
Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationSearching the Web for Information
Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content
More informationCrawling. CS6200: Information Retrieval. Slides by: Jesse Anderton
Crawling CS6200: Information Retrieval Slides by: Jesse Anderton Motivating Problem Internet crawling is discovering web content and downloading it to add to your index. This is a technically complex,
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationMultimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency
Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following
More informationRecap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval
Ch. 2 Recap of the previous lecture Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval The type/token distinction Terms are normalized types put in the dictionary Tokenization
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationSYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT
SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND
More informationCS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University
CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationManning Chapter: Text Retrieval (Selections) Text Retrieval Tasks. Vorhees & Harman (Bulkpack) Evaluation The Vector Space Model Advanced Techniques
Text Retrieval Readings Introduction Manning Chapter: Text Retrieval (Selections) Text Retrieval Tasks Vorhees & Harman (Bulkpack) Evaluation The Vector Space Model Advanced Techniues 1 2 Text Retrieval:
More informationInformation Retrieval. Lecture 2 - Building an index
Information Retrieval Lecture 2 - Building an index Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 40 Overview Introduction Introduction Boolean
More informationBasic techniques. Text processing; term weighting; vector space model; inverted index; Web Search
Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More informationInformation Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes
CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten
More informationOverview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components
Overview Lecture 3: Index Representation and Tolerant Retrieval Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group 1 Recap 2
More informationOutline. Lecture 3: EITN01 Web Intelligence and Information Retrieval. Query languages - aspects. Previous lecture. Anders Ardö.
Outline Lecture 3: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University February 5, 2013 A. Ardö, EIT Lecture 3: EITN01 Web Intelligence
More information