

CMSC 476/676 Review

1. Week 1 Overview of Information Retrieval
   a. Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
   b. Primary focus of IR since the 50s has been on text and documents
      i. Now also includes images, video, audio, music, and scanned documents, each with different retrieval mechanisms suited to the particular media.
   c. Text documents can be web pages, email, books, news stories, scholarly papers, text messages, Word, PowerPoint, PDF, forum postings, patents, IM sessions, etc.
   d. Text documents have lots of text and some structure, e.g. title, author, date for papers; subject, sender, destination for email
   e. Text Documents vs. Relational Database Records
      i. Database records (or tuples in relational databases)
         1. Well-defined fields (or attributes), e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
         2. Easy to compare fields using well-defined semantics and query language (SQL)
      ii. Text documents: comparison of query text to document text is the core challenge
   f. IR tasks include ad-hoc search, filtering, classification, and question answering.
   g. Relevance of a document to a query is usually based on statistical properties of text rather than linguistic properties
      i. Statistical properties: counting simple text features such as words instead of parsing and analyzing the sentences
      ii. Linguistic properties: breaking down sentences into parts of speech, noun phrases, and verb phrases.
   h. Judging effectiveness: 2 measures
      i. Recall: percentage of ALL relevant documents that are retrieved by the query
      ii. Precision: percentage of retrieved documents that are relevant
   i. Indexes are data structures designed to improve search efficiency
      i. Designing and implementing them are major issues for search engines
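The two effectiveness measures in item h can be illustrated with a small sketch. The function name `precision_recall` and the document IDs are made up for illustration; the only assumption is that the retrieved and relevant documents are given as collections of IDs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: IDs of the documents the search engine returned.
    relevant:  IDs of ALL documents judged relevant to the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 relevant documents exist, 2 overlap.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7", "d9", "d10"])
print(p, r)  # 0.5 0.4
```

Here half of what was retrieved is relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).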

2. Week 2 Overview of Search Engine Architecture
   a. Search Engine Architecture: defined by effectiveness (quality of results) and efficiency (response time and throughput)
   b. Text Acquisition: types of crawlers are Web, Enterprise, Desktop
      i. Web crawlers efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
      ii. Document crawlers for enterprise and desktop search follow links and scan directories
   c. Document Feeds
      i. Real-time feeds from blogs, news channels, etc.
      ii. RSS feed in XML
   d. Document Conversion/Transformation
      i. Parse/convert document format into text, e.g. HTML, XML, Word, PDF
      ii. Convert text encoding from different languages and encoding schemes: UTF-8, 16, 32, ASCII, EBCDIC
      iii. Tokenizer recognizes words in the text
         1. Consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
      iv. Stopping: remove common words, e.g. and, or, the
      v. Stemming: remove word endings, e.g. -ing, -ed
   e. Document Statistics
      i. Counts of token/term/word in document
      ii. Counts of token/term/word in collection of documents
      iii. Counts of token/term/word occurrence in documents
   f. Inverted Index: reverses document-term information into term-document information
   g. Index Distribution
      i. Distributes indexes across multiple computers and/or multiple sites
      ii. Essential for fast query processing with large numbers of documents
      iii. See Big Table
   h. Scoring queries
      i. Calculates score for the query compared to indexed documents
      ii. Basic form of score is $\sum_i q_i d_i$
         1. $q_i$ and $d_i$ are query and document term weights for term $i$
   i. Google Big Table
      i. Row is created when a new document is stored
      ii. Rows sorted lexicographically (alphabetically), e.g. so pages on the same topic are stored together, usually on the same machine
      iii. Large tables are broken into tablets at row boundaries.
      iv. Tablets hold a contiguous range of rows holds MB
      v. Server manages ~100 tablets
   j. How does Google work? 3 key features

      i. Web Crawling: tiers of crawlers with different frequencies. News sites change most frequently, other sites less frequently. Crawlers are called Spiders. They search for documents to index and store.
      ii. Indexing: does the document have the key terms Katy and Perry?
         1. When you type in a set of terms to search, you are searching the index, not the web.
      iii. Ranking: PageRank plus many other signals, e.g. Katy and Perry next to each other. Provides a ranked list of the documents that best match the query based on a scoring function.
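The inverted index and the basic dot-product score from the Week 2 notes can be sketched in a few lines. This is a minimal illustration, not how any production engine is built: the function names, the whitespace tokenizer, and the use of raw term counts as the weights $q_i$ and $d_i$ are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Turn {doc_id: text} (document -> terms) into {term: {doc_id: count}} (term -> documents)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lower-casing and splitting on whitespace stands in for a real tokenizer.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def score(query, index):
    """Rank documents by sum_i q_i * d_i, using raw term counts as the weights."""
    scores = defaultdict(int)
    for term, q_weight in Counter(query.lower().split()).items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical three-document collection.
docs = {1: "katy perry sings", 2: "perry mason returns", 3: "katy perry and katy"}
index = build_inverted_index(docs)
ranked = score("katy perry", index)
print(ranked)  # [(3, 3), (1, 2), (2, 1)]
```

Note that the query is answered entirely from `index`; the original documents are never rescanned, which is the point of item ii.1 above.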

3. Week 3 - Web Crawling: HTTP, Web Scraping, Freshness, RSS feeds, Character Encoding
   a. HTTP: Hypertext Transfer Protocol
      i. Client/Server model: client on your device
      ii. Stateless protocol: each transaction is independent of any other. But cookies track your state.
      iii. Application layer protocol on top of a transport layer protocol, e.g. TCP. TCP cares how data is sent and how it is acknowledged. HTTP doesn't care about transport layer details.
      iv. Client actions: GET, POST, DELETE, ...
      v. Defines response status codes, e.g. 404 not found.
      vi. Headers: tiny bits of custom information sent with requests and responses (both directions), such as content type (e.g. text file, JSON) and Cache-Control.
   b. Web Scraping: in the absence of an API, you can scrape any web page
      i. Open the URL, then GET the HTML
      ii. Read the HTML text into one big string
      iii. Parse the HTML string
      iv. Find elements of the page, e.g. body tag, span tag, etc.
      v. HTML is an embedded tagging language
      vi. In Python you can parse HTML using Beautiful Soup and other modules.
   c. Freshness (Web Crawling 6: Keeping Index Fresh)
      i. How often to re-crawl a web page? How often is a web page refreshed? Want the index to be fresh.
      ii. Once the page has been refreshed (since we crawled it), the page is stale and no longer fresh.
      iii. But you don't know when the page has been changed.
      iv. Assume (1) page changes are independent of one another, and (2) there cannot be 2 changes at the same time.
      v. Exponential inter-arrival times: Poisson distribution.
      vi. Want to know the expected age of a page.
      vii. Need the frequency of updates = C. Can estimate C by counting the changes over an interval.
      viii. Estimate C adaptively: if the page changed in the interval, set the interval smaller.
   d. RSS Feeds: Really Simple Syndication file format
      i. Similar to the onscreen guide for cable TV
      ii. Channel, Item (blog post)
      iii. XML: like HTML but without page structure; tags for channel, item, description, etc.
      iv. Feed with entries.
      v. Blogs, news, podcasts, meetups, etc.
   e. Character Encoding: UNICODE, ASCII, EBCDIC
      i. Code Points: UNICODE 8, 16
         1. Latin script is the same as ASCII, in a given range.

         2. Different language scripts are in different ranges.
         3. Math symbols, special characters, etc. have ranges.
      ii. Encoding Code Points
         1. UTF-8 uses 8-bit code units (1 to 4 bytes per code point)
         2. Code points are usually written in hexadecimal.
   f. LSH: Locality Sensitive Hashing
      i. Similar hash codes for near-by points (documents)
      ii. SIMHASH: efficient variant of LSH
         1. Tokenize
         2. Assign weights to tokens
         3. Compute a b-bit binary hash for each word.
         4. Convert 0 to -1 and multiply by word weight
         5. Add by columns. Set columns > 0 to 1, 0 otherwise.
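The five SIMHASH steps listed above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are made up, term frequency is used as the token weight, a whitespace split stands in for the tokenizer, and `hashlib.blake2b` is just one convenient way to get a b-bit hash per word.

```python
import hashlib
from collections import Counter


def simhash(text, bits=64):
    """Return a `bits`-bit SimHash fingerprint (bits must be a multiple of 8, at most 512)."""
    # Steps 1-2: tokenize and weight each token (assumption: weight = term frequency).
    weights = Counter(text.lower().split())
    columns = [0] * bits
    for word, weight in weights.items():
        # Step 3: b-bit binary hash for each word.
        digest = hashlib.blake2b(word.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Steps 4-5: a 1 bit adds +weight to its column, a 0 bit adds -weight.
        for i in range(bits):
            columns[i] += weight if (h >> i) & 1 else -weight
    # Step 5 (continued): columns > 0 become 1, all others 0.
    fingerprint = 0
    for i, total in enumerate(columns):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


a = simhash("web crawlers keep the index fresh by recrawling pages")
b = simhash("web crawlers keep the index fresh by recrawling sites")
print(hamming(a, b))  # small distances suggest near-duplicate documents
```

Documents that share most of their weighted tokens tend to agree in most columns, so near-duplicates tend to have fingerprints with a small Hamming distance.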

4. Week 4
   a. Scoring Terms
      i. Term Frequency and Similarity functions
         1. See
         2. Which words occur in both query and document?
         3. Basic similarity score: dot product of the terms in the query and the terms in the document.
         4. Term frequency can be
            a. Binary: presence or absence
            b. Count: number of times the term occurs in the document
            c. Normalized by document length to avoid long-document bias
            d. Rank invariance: change the numbers but not the rank
      ii. Inverse Document Frequency: rare words carry more meaning
         1. Some words carry content: aardvark, cryogenic.
         2. Have them carry more weight.
            a. If a word occurs once in a corpus, then it is probably a typo
            b. If 3 or 4 times in a corpus, then it is an important word.
            c. If the number of documents containing that word is small compared to the number of documents in the corpus, then it is probably important.
         3. Stopwords don't: and, or, the (linguistic glue)
         4. Logarithm puts the idf component on the same scale as the tf component
      iii. Feature Selection with Tf-Idf
         1. Retrieval models and ranking algorithms depend heavily on statistical properties of words
            a. e.g., important words occur often in documents but are not high frequency in the collection
         2. See Zipf's law
   b. Zipf's Law
      i. Distribution of word frequencies is very skewed
         1. a few words occur very often, many words hardly ever occur
         2. e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents
      ii. Zipf's law:
         1. observation that rank ($r$) of a word times its frequency ($f$) is approximately a constant ($k$)
         2. assuming words are ranked in order of decreasing frequency
            a. i.e., $r \cdot f \approx k$ or $r \cdot P_r \approx c$, where $P_r$ is the probability of word occurrence and $c \approx 0.1$ for English

      iii. What is the proportion of words with a given frequency?
         1. A word that occurs $n$ times has rank $r_n = k/n$
         2. Number of words with frequency $n$ is $r_n - r_{n+1} = k/n - k/(n+1) = k/(n(n+1))$
         3. Proportion found by dividing by total number of words = highest rank = $k$
         4. So, proportion with frequency $n$ is $1/(n(n+1))$
   c. Vocabulary growth: Heaps' Law
      i. As corpus grows, so does vocabulary size
      ii. Fewer new words when corpus is already large
      iii. Observed relationship (Heaps' Law): $v = k \cdot n^{\beta}$, where $v$ is vocabulary size (number of unique words); $n$ is the number of words in the corpus; $k$, $\beta$ are parameters that vary for each corpus (typical values given are $10 \le k \le 100$ and $\beta \approx 0.5$)
      iv. New words come from a variety of sources: spelling errors, invented words (e.g. product, company names), code, other languages, addresses, etc.
      v. How many pages contain all of the query terms?
         1. For the query "a b c": $f_{abc} = N \cdot \frac{f_a}{N} \cdot \frac{f_b}{N} \cdot \frac{f_c}{N} = \frac{f_a f_b f_c}{N^2}$
            a. Assuming that terms occur independently
            b. $f_{abc}$ is the estimated size of the result set
            c. $f_a$, $f_b$, $f_c$ are the number of documents that terms a, b, and c occur in
            d. $N$ is the number of documents in the collection
   d. Tokenizing problems
      i. Small words can be important in some queries, usually in combinations
         1. xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II
      ii. Both hyphenated and non-hyphenated forms of many words are common
         1. Sometimes hyphen is not needed
         2. e-bay, wal-mart, active-x, cd-rom, t-shirts
      iii. At other times, hyphens should be considered either as part of the word or a word separator
         1. winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
      iv. Special characters are an important part of tags, URLs, code in documents
      v. Capitalized words can have different meaning from lower case words
         1. Bush, Apple
      vi. Apostrophes can be a part of a word, a part of a possessive, or just a mistake

         1. rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
      vii. Numbers can be important, including decimals
         1. nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, (a US patent?)
      viii. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
         1. I.B.M., Ph.D., cs.umbc.edu, F.E.A.R.
      ix. Note: tokenizing steps for queries must be identical to steps for documents
   e. Tokenizing Process
      i. First step is to use parser to identify appropriate parts of document to tokenize
      ii. Defer complex decisions to other components
      iii. word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lower-case
      iv. everything indexed
      v. Examples of rules used with TREC
         1. Apostrophes in words ignored
            a. o'connor → oconnor, bob's → bobs
         2. Periods in abbreviations ignored
            a. I.B.M. → ibm, Ph.D. → ph d
   f. Stopping
      i. Function words (determiners, prepositions) have little meaning on their own
      ii. High occurrence frequencies
      iii. Treated as stopwords (i.e. removed)
         1. reduce index space, improve response time, improve effectiveness
      iv. Can be important in combinations
         1. e.g., "to be or not to be"
      v. Lists are customized for applications, domains, and even parts of documents
         1. e.g., "click" is a good stopword for anchor text
      vi. Best policy may be to index all words in documents, make decisions about which words to use at query time
   g. Stemming
      i. Many morphological variations of words
         1. inflectional (plurals, tenses)
         2. derivational (making verbs nouns etc.)
      ii. In most cases, these have the same or very similar meanings
      iii. Stemmers attempt to reduce morphological variations of words to a common stem
         1. usually involves removing suffixes
      iv. Can be done at indexing time or as part of query processing (like stopwords)

      v. Generally a small but significant effectiveness improvement
         1. can be crucial for some languages, e.g., 5-10% improvement for English, up to 50% in Arabic
      vi. Two basic types
         1. Dictionary-based: uses lists of related words
         2. Algorithmic: uses program to determine related words
            a. suffix-s: remove "s" endings assuming plural, e.g., cats → cat, lakes → lake, wiis → wii
            b. Many false negatives: supplies → supplie
            c. Some false positives: ups → up
      vii. Porter Stemmer: algorithmic stemmer used in IR experiments since the 1970s
      viii. Krovetz Stemmer: hybrid algorithmic-dictionary
         1. Word checked in dictionary
         2. If present, either left alone or replaced with exception
         3. If not present, word is checked for suffixes that could be removed
         4. After removal, dictionary is checked again
   h. Character n-grams
      i. Building stemmers takes a lot of time and expertise in the language; character n-grams work well
      ii. Use a sliding window of length n over the string.
         1. Get a lot of overlapping n-grams that capture a lot of linguistic information
         2. Tolerant of garbles and misspellings in text
         3. Use for any language/script (particularly European languages, but will also work for Chinese and others)
      iii. N-grams of size (per Dr. Cohen)
         1. 3 are good for language identification (id) are good for topic id are good for speaker id are good for malware detection, plagiarism,
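The sliding window from the character n-gram notes above can be sketched directly. The helper names and the misspelling example are hypothetical; comparing n-gram sets is just one simple way to show why n-grams tolerate garbled text.

```python
def char_ngrams(text, n=3):
    """Slide a window of length n over the string, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shared_ngrams(a, b, n=3):
    """Set of n-grams that two strings have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


print(char_ngrams("retrieval"))
# ['ret', 'etr', 'tri', 'rie', 'iev', 'eva', 'val']

# A transposition typo still shares several trigrams with the correct spelling,
# whereas exact word matching would treat the two strings as unrelated terms.
print(sorted(shared_ngrams("retrieval", "retreival")))
# ['etr', 'ret', 'val']
```

No language-specific rules are involved, which is why the same code works unchanged on any script.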

5. Week 5 Building and using Inverted Indexes
   a. Use hash table (dictionary) to store information by term (word, n-gram, ...)
      i. Represent documents as vectors
         1. One entry per term (word)
         2. Sparse: ~10^6 entries, most = 0
         3. Zipf's law says ½ of all terms will occur only once
         4. Value = number of times the term occurs in the document
         5. Document is a vector
         6. Collection of documents becomes a sparse matrix. Don't want to store all of the zeros in a matrix, so use an inverted index, which is a sparse representation
      ii. Inverted lists (also called postings lists)
         1. attached to each bucket (term) in the hash table
         2. Column vector is an inverted list with one posting for each document, sorted in document number order
         3. Contains information about the value of the term in the document
            a. Counts, weights, etc.
         4. Keep additional information with terms:
            a. number of documents in the list (number of documents that contain that term),
            b. number of times the term is used in the collection
         5. Compact and fast because sorted in document order
      iii. Linear Merge
         1. Merging the inverted lists of the terms that occur in a query and a document (or 2 documents). Works because the documents are in sorted order.
            a. Set a pointer to the top of each of the first 2 inverted lists
            b. Compare the documents at the pointers in each posting list
               i. If Boolean scoring, then just looking for a match
               ii. If calculating a scoring function, it must be decomposable into a linear function, e.g. cosine similarity
            c. If not the same document, advance the pointer in the list with the lower document number, then go back to b.
            d. Else you have a match
         2. If more than two terms, need multi-way merge
      iv. Proximity Indexes: Phrase search
         1. Turn phrase into Boolean AND if there is no positional information in the indexes
         2. Term pairs: re-tokenize into 2 or 3 word sequences; increases size of inverted index
         3. Proximity index stores positions instead of counts
            a. Key to rich indexing: structure, fields, tags, etc.

            b. DNA index store
            c. Could build a Near operator.
            d. Order may be important
         4. Use Linear Merge: comparison looks for positions within a given range
      v. Extent Indexes: Structure and Tags
         1. Separate field or index for each structural element: now need to look at multiple indexes
         2. Push structure into index values: author = William
         3. Extent index: like a positional index
            a. Special new term for each structural element: named entity term with a positional index span (beginning and end), e.g. in doc3 spanning positions 1 to 3.
            b. Does the term fall in the span for a special term? That is, does William fall within the Author span?
            c. Overlays on other terms
   b. Index Compression: indexes are big. Reduce I/O time. Make big numbers small by converting to the delta between document numbers and packing the deltas into a smaller number of bits
      i. Delta encoding
         1. Doc numbers are big: 10 billion docs
         2. 4 bytes can only store about 4 billion docs
         3. Often the difference in document numbers is small because lists are sorted by doc ID. Take differences (deltas).
         4. Works well for frequent words
      ii. V-byte encoding
         1. Pack numbers into fewer bytes to represent small numbers
         2. Variable number of bytes: small numbers need 1 byte to encode, etc.; the high bit indicates that this is the last byte to be encoded
            a. 6 fits in one byte
            b. 127 fits in one byte
            c. 128 needs two bytes
         3. For sparse terms, the deltas between document numbers in your posting lists will be bigger and thus will need multiple bytes to represent the difference.
      iii. Skip pointers
   c. Query Execution
      i. Set-based: a document matches or not
      ii. Ranked retrieval: rank by relevance
      iii. Goal: fast retrieval
         1. Doc-at-a-time (DaaT)
            a. Assemble matching vector for each document
               i. List of matching words/counts/positions

               ii. Compute score
               iii. Output score
            b. K-way linear merge
               i. Computing scoring function for each document as you go through the merge
               ii. Optimal with 2 lists
               iii. Naïve variant with k lists is to merge one by one, e.g. merge a,b with c, then a,b,c with d, etc.
                  1. Complexity: worst case O(n^2), query with all rare words
               iv. Better way: loop over all pointers at once
                  1. Complexity: loop over k pointers until all docs done. If no two lists have the same docs (n), then O(kn)
                  2. If lists are dense and point to the same docs, then score operation is O(k^2)
               v. Even better: use a min-heap (priority queue)
                  1. Keep pointers in a data structure to find the smallest in logarithmic time: O(n log k)
               vi. Only emits score when non-zero
         2. Term-at-a-time (TaaT): flip the problem, don't merge lists
            a. Incrementally compute scores for all documents, updated one term at a time.
            b. Initialize array for term, iterate over inverted list and update with each doc's information
            c. Compute score in no particular order
            d. Final step is to extract scores from the array (take top 5 from sorted array)
            e. O(n + N) where n = length of all lists, N = number of docs
         3. Trade-offs
            a. TaaT is linear in n and has sequential I/O (fetched by term); order doesn't matter, so it can process in the order stored on disk in a single pass. But it has to store the array of N accumulators, one for each document. This can be an issue if on disk. Also, it can only be used if the scoring function is quasi-linear, i.e. it is decomposable into parts that are:
               i. Dependent on query, g(q)
               ii. Dependent on document, h(d)
               iii. Uses a dot product summation
            b. DaaT
               i. Higher run time O(nk), O(k) memory, non-sequential I/O (take hit on performance)

               ii. Any scoring function
               iii. Structured queries
         4. Why complexity matters: compute pair-wise similarity with 10^6 docs, 10^6 terms, ~10^3 terms/doc
            a. Brute force: 10^15 operations (10^6 x 10^6 x 10^3)
               i. Compute similarity of every doc to every other doc
            b. DaaT: C*10^13 (10 to 100 times faster)
               i. For each query doc Q: 10^6
               ii. Merge 10^3 inverted lists (one per query term; remember avg doc length is 10^3)
                  1. Avg size of list: 10^3 posts = 10^9 (nonempty cells over all posts) / 10^6 (columns)
                  2. Total number of posts across all lists is 10^6 = 10^3 * 10^3
               iii. Heap merge of 10^3 lists with 10^6 elements is 10^7 = 10^6 * log(10^3). C is a constant to maintain the heap.
               iv. Compute similarity for all matches above: 10^6 (total number of inverted lists), which is less than 10^7
               v. Complexity = C * (10^6 * 10^7) = C * 10^13
            c. TaaT: 10^12 (1000 times faster)
               i. For each query: 10^6
               ii. For each term in query: 10^3
               iii. For each doc D in inverted list: 10^3 (= 10^9 / 10^6)
               iv. Complexity = 10^6 * 10^3 * 10^3 = 10^12 and no C term
   d. Index Construction
      i. Add a document to the index by appending to the inverted lists of the terms in the document; create a new entry if the term does not exist
      ii. Why will this not scale?
         1. If everything is in memory, then i above is ok. But need a tuple for each post, so it uses lots of memory
         2. If on disk, then for each word, fetch the inverted list, append an entry, then write back to disk.
         3. Use index merging for documents in batches: same as linear merge, for combining partial indexes on disk
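The delta and v-byte encoding described under Index Compression can be sketched as below. The function names and the sample posting list are made up for illustration. The sketch assumes 7 data bits per byte with the high bit set on the last byte of each number, as in the notes, and additionally assumes that the most significant 7-bit group is written first and that all numbers are non-negative.

```python
from itertools import accumulate


def to_deltas(doc_ids):
    """Replace each sorted doc ID by its gap from the previous one."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]


def from_deltas(deltas):
    """Rebuild the doc IDs by running sum over the gaps."""
    return list(accumulate(deltas))


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, high bit set on each number's last byte."""
    out = bytearray()
    for n in numbers:
        groups = []
        while True:  # split n into 7-bit groups, least significant first
            groups.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        out.extend(reversed(groups[1:]))  # leading groups, most significant first
        out.append(groups[0] | 0x80)      # final byte carries the terminator bit
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:  # high bit set: this number is complete
            numbers.append(n)
            n = 0
    return numbers


doc_ids = [1, 5, 9, 18, 23, 24, 30]          # hypothetical posting list
encoded = vbyte_encode(to_deltas(doc_ids))   # 7 small deltas -> 7 bytes
print(len(encoded), vbyte_encode([128]).hex())  # 7 0180
```

Every delta in this list is below 128, so each posting costs a single byte instead of a fixed 4-byte document number; a rare term with large gaps would need more bytes per delta.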

Chapter 4. Processing Text

Chapter 4. Processing Text Chapter 4 Processing Text Processing Text Modifying/Converting documents to index terms Convert the many forms of words into more consistent index terms that represent the content of a document What are

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process Processing Text Converting documents to index terms Why? Matching the exact

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Processing Text Converting documents to index terms Why? Matching the exact string of characters typed by the user is too

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

CS6200 Information Retreival. Crawling. June 10, 2015

CS6200 Information Retreival. Crawling. June 10, 2015 CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2014 Information Retrieval Dr. Mariana Neves June 18th, 2014 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs

More information

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,

More information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton Indexing Index Construction CS6200: Information Retrieval Slides by: Jesse Anderton Motivation: Scale Corpus Terms Docs Entries A term incidence matrix with V terms and D documents has O(V x D) entries.

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

Search Engine Architecture. Search Engine Architecture

Search Engine Architecture. Search Engine Architecture Search Engine Architecture CISC489/689 010, Lecture #2 Wednesday, Feb. 11 Ben CartereGe Search Engine Architecture A soiware architecture consists of soiware components, the interfaces provided by those

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing

More information
