2. Week 2 Overview of Search Engine Architecture a. Search Engine Architecture defined by effectiveness (quality of results) and efficiency (response


Eustace Grant
5 years ago

CMSC 476/676 Review

1. Week 1 Overview of Information Retrieval
   a. Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
   b. The primary focus of IR since the 50s has been on text and documents.
      i. It now also includes images, video, audio, music, and scanned documents, each with retrieval mechanisms suited to the particular medium.
   c. Text documents can be web pages, email, books, news stories, scholarly papers, text messages, Word, PowerPoint, PDF, forum postings, patents, IM sessions, etc.
   d. Text documents have lots of text and some structure, e.g. title, author, date for papers; subject, sender, destination for email.
   e. Text Documents vs. Relational Database Records
      i. Database records (or tuples in relational databases)
         1. Well-defined fields (or attributes), e.g. bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
         2. Easy to compare fields using well-defined semantics and a query language (SQL)
      ii. Text documents: comparison of query text to document text is the core challenge
   f. IR tasks include ad-hoc search, filtering, classification, and question answering.
   g. Relevance of a document to a query is usually based on statistical rather than linguistic properties of text.
      i. Statistical properties: counting simple text features such as words instead of parsing and analyzing the sentences
      ii. Linguistic properties: breaking sentences down into parts of speech, noun phrases, and verb phrases
   h. Judging effectiveness: 2 measures
      i. Recall: percentage of ALL relevant documents that are retrieved by the query
      ii. Precision: percentage of retrieved documents that are relevant
   i. Indexes are data structures designed to improve search efficiency.
      i. Designing and implementing them are major issues for search engines.
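The two effectiveness measures from Week 1 can be illustrated with a minimal Python sketch. The function name and the toy document IDs are assumptions made for illustration, and both measures are returned as fractions rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the system returned.
    relevant:  IDs of ALL documents judged relevant to the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant overall, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)
```

Precision divides by the size of the retrieved set while recall divides by the size of the relevant set, which is why returning more documents tends to raise recall and lower precision.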

2. Week 2 Overview of Search Engine Architecture
   a. Search engine architecture is defined by effectiveness (quality of results) and efficiency (response time and throughput).
   b. Text Acquisition: types of crawlers are Web, Enterprise, Desktop
      i. Web crawlers efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
      ii. Document crawlers for enterprise and desktop search follow links and scan directories
   c. Document Feeds
      i. Real-time feeds from blogs, news channels, etc.
      ii. RSS feed in XML
   d. Document Conversion/Transformation
      i. Parse/convert document format into text, e.g. HTML, XML, Word, PDF
      ii. Convert text encoding from different languages and encoding schemes: UTF-8, UTF-16, UTF-32, ASCII, EBCDIC
      iii. Tokenizer recognizes words in the text
         1. Consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
      iv. Stopping: remove common words, e.g. and, or, the
      v. Stemming: remove word endings, e.g. ing, ed
   e. Document Statistics
      i. Counts of token/term/word in document
      ii. Counts of token/term/word in collection of documents
      iii. Counts of token/term/word occurrence across documents
   f. Inverted Index: reverses document-term information into term-document information
   g. Index Distribution
      i. Distributes indexes across multiple computers and/or multiple sites
      ii. Essential for fast query processing with large numbers of documents
      iii. See Bigtable
   h. Scoring queries
      i. Calculates score for the query compared to indexed documents
      ii. Basic form of score is $\sum_i q_i d_i$
         1. $q_i$ and $d_i$ are query and document term weights for term i
   i. Google Bigtable
      i. A row is created when a new document is stored
      ii. Rows are sorted lexicographically (alphabetically), e.g. so pages on the same topic are stored together, usually on the same machine
      iii. Large tables are broken into tablets at row boundaries.
      iv. Tablets hold a contiguous range of rows, each holding … MB
      v. A server manages ~100 tablets
   j. How does Google work? 3 key features

      i. Web Crawling: tiers of crawlers with different frequencies. News sites change most frequently, other sites less frequently. Crawlers are called spiders. They search for documents to index and store.
      ii. Indexing: does the document have the key terms Katy and Perry?
         1. When you type in a set of terms to search, you are searching the index, not the web.
      iii. Ranking: PageRank plus many other signals, e.g. Katy and Perry next to each other. Provides a ranked list of the documents that best match the query based on a scoring function.
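The Week 2 ideas of an inverted index and the basic score $\sum_i q_i d_i$ can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: tokenizing is a lowercase whitespace split, term weights are raw counts, and the function names and three toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Turn {doc_id: text} (document-term) into {term: {doc_id: count}} (term-document)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def score(query, index):
    """Rank documents by sum_i q_i * d_i, with raw counts as the weights."""
    query_weights = Counter(query.lower().split())
    scores = defaultdict(int)
    for term, q_weight in query_weights.items():
        # Only documents on this term's list can gain score from the term.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy collection.
docs = {
    1: "katy perry new album",
    2: "perry mason episode guide",
    3: "katy perry katy perry tour dates",
}
index = build_inverted_index(docs)
print(score("katy perry", index))
```

The query only touches the lists for its own terms, which is the sense in which a search looks at the index rather than at the documents themselves.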

3. Week 3 - Web Crawling: HTTP, Web Scraping, Freshness, RSS Feeds, Character Encoding
   a. HTTP: Hypertext Transfer Protocol
      i. Client/server model: the client runs on your device
      ii. Stateless protocol: each transaction is independent of any other, but cookies track your state.
      iii. Application-layer protocol on top of a transport-layer protocol, e.g. TCP. TCP cares how data is sent and how it is acknowledged. HTTP doesn't care about transport-layer details.
      iv. Client actions: GET, POST, DELETE, ...
      v. Defines response status codes, e.g. 404 Not Found.
      vi. Headers: tiny bits of custom information sent in both directions (requests and responses), such as content type (e.g. text file, JSON) and Cache-Control.
   b. Web Scraping: in the absence of an API, you can scrape any web page
      i. Open the URL, then GET the HTML
      ii. Read the HTML text into one big string
      iii. Parse the HTML string
      iv. Find elements of the page, e.g. body tag, span tag, etc.
      v. HTML is an embedded tagging language
      vi. In Python you can parse HTML using Beautiful Soup and other modules.
   c. Freshness (Web Crawling 6: Keeping Index Fresh)
      i. How often to re-crawl a web page? How often is a web page refreshed? Want the index to be fresh.
      ii. Once the page has been refreshed (since we crawled it), our copy of the page is stale and no longer fresh.
      iii. But you don't know when the page has been changed.
      iv. Assume (1) page changes are independent of one another, and (2) there cannot be 2 changes at the same time.
      v. Inter-arrival times are then exponential: changes follow a Poisson distribution.
      vi. Want to know the expected age of a page.
      vii. Need the frequency of updates = C. Can estimate C by counting the changes over an interval.
      viii. Estimate C adaptively: if the page changed in the interval, set the interval smaller.
   d. RSS Feeds: Really Simple Syndication file format
      i. Similar to the onscreen guide for cable TV
      ii. Channel, item (blog post)
      iii. XML: like HTML but without page structure; tags for channel, item, description, etc.
      iv. Feed with entries
      v. Blogs, news, podcasts, meetups, etc.
   e. Character Encoding: UNICODE, ASCII, EBCDIC
      i. Code Points: UNICODE 8, 16, 32
         1. Latin script is the same as ASCII in a given range.

         2. Different language scripts occupy different ranges.
         3. Math symbols, special characters, etc. have ranges.
      ii. Encoding Code Points
         1. UTF-8 uses 8-bit code units (one to four bytes per code point)
         2. Code points are conventionally written in hexadecimal.
   f. LSH: Locality Sensitive Hashing
      i. Similar hash codes for nearby points (documents)
      ii. SimHash: efficient variant of LSH
         1. Tokenize
         2. Assign weights to tokens
         3. Compute a b-bit binary hash for each word.
         4. Convert 0 to -1 and multiply by word weight
         5. Add by columns. Set columns > 0 to 1, 0 otherwise.
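The five SimHash steps listed above can be sketched in Python as follows. The choices here are assumptions for illustration, not part of the course notes: tokens come from a lowercase whitespace split, the weight of a token is its count in the text, and the b-bit word hash is taken from the low bits of an MD5 digest.

```python
import hashlib
from collections import Counter


def token_hash(token, bits=64):
    """Step 3: a b-bit hash per word (assumption: low bits of MD5)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(text, bits=64):
    weights = Counter(text.lower().split())  # steps 1-2: tokenize, weight = count
    columns = [0] * bits
    for token, weight in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # Step 4: a 1 bit contributes +weight, a 0 bit contributes -weight.
            columns[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(columns):
        if total > 0:  # step 5: positive column sums become 1, others 0
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical near-duplicate texts.
a = simhash("tropical fish include fish found in tropical environments")
b = simhash("tropical fish include fish found in tropical waters")
print(hamming_distance(a, b))
```

Near-duplicate documents are then detected by comparing fingerprints with Hamming distance; the distance threshold is an application choice that the notes do not specify.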

4. Week 4
   a. Scoring Terms
      i. Term Frequency and Similarity functions
         1. See …
         2. Which words occur in both query and document?
         3. Basic similarity score: dot product of the terms in the query and the terms in the document.
         4. Term frequency can be
            a. Binary: presence or absence
            b. Count: number of times the term occurs in the document
            c. Normalized by document length to avoid long-document bias
            d. Rank invariance: change the numbers but not the rank
      ii. Inverse Document Frequency: rare words carry more meaning
         1. Some words carry content, e.g. aardvark, cryogenic. Have them carry more weight.
            a. If a word occurs once in a corpus, then it is probably a typo
            b. If it occurs 3 or 4 times in a corpus, then it is an important word.
            c. If the number of documents containing that word is small compared to the number of documents in the corpus, then it is probably important.
         2. Stopwords don't carry content: and, or, the (linguistic glue)
         3. The logarithm puts the idf component on the same scale as the tf component
      iii. Feature Selection with Tf-Idf
         1. Retrieval models and ranking algorithms depend heavily on statistical properties of words
            a. e.g., important words occur often in documents but are not high frequency in the collection
         2. See Zipf's law
   b. Zipf's Law
      i. Distribution of word frequencies is very skewed
         1. a few words occur very often, many words hardly ever occur
         2. e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents
      ii. Zipf's law:
         1. observation that the rank (r) of a word times its frequency (f) is approximately a constant (k)
         2. assuming words are ranked in order of decreasing frequency
            a. i.e., $r \cdot f \approx k$ or $r \cdot P_r \approx c$, where $P_r$ is the probability of word occurrence and $c \approx 0.1$ for English

      iii. What is the proportion of words with a given frequency?
         1. A word that occurs n times has rank $r_n = k/n$
         2. The number of words with frequency n is $r_n - r_{n+1} = k/n - k/(n+1) = k/(n(n+1))$
         3. The proportion is found by dividing by the total number of words = highest rank = k
         4. So, the proportion with frequency n is $1/(n(n+1))$
   c. Vocabulary growth: Heaps' Law
      i. As the corpus grows, so does vocabulary size
      ii. Fewer new words when the corpus is already large
      iii. Observed relationship (Heaps' Law): $v = k \cdot n^{\beta}$, where v is vocabulary size (number of unique words); n is the number of words in the corpus; k, β are parameters that vary for each corpus (typical values given are $10 \le k \le 100$ and $\beta \approx 0.5$)
      iv. New words come from a variety of sources: spelling errors, invented words (e.g. product, company names), code, other languages, addresses, etc.
      v. How many pages contain all of the query terms?
         1. For the query "a b c": $f_{abc} = N \cdot f_a/N \cdot f_b/N \cdot f_c/N = (f_a \cdot f_b \cdot f_c)/N^2$
            a. Assuming that terms occur independently
            b. $f_{abc}$ is the estimated size of the result set
            c. $f_a$, $f_b$, $f_c$ are the number of documents that terms a, b, and c occur in
            d. N is the number of documents in the collection
   d. Tokenizing problems
      i. Small words can be important in some queries, usually in combinations
         1. xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II
      ii. Both hyphenated and non-hyphenated forms of many words are common
         1. Sometimes the hyphen is not needed
         2. e-bay, wal-mart, active-x, cd-rom, t-shirts
      iii. At other times, hyphens should be considered either as part of the word or as a word separator
         1. winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
      iv. Special characters are an important part of tags, URLs, code in documents
      v. Capitalized words can have a different meaning from lower-case words
         1. Bush, Apple
      vi. Apostrophes can be a part of a word, a part of a possessive, or just a mistake

         1. rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
      vii. Numbers can be important, including decimals
         1. nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, … (a US patent?)
      viii. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
         1. I.B.M., Ph.D., cs.umbc.edu, F.E.A.R.
      ix. Note: tokenizing steps for queries must be identical to the steps for documents
   e. Tokenizing Process
      i. First step is to use a parser to identify the appropriate parts of the document to tokenize
      ii. Defer complex decisions to other components
      iii. A word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lower case
      iv. Everything is indexed
      v. Examples of rules used with TREC
         1. Apostrophes in words are ignored
            a. o'connor → oconnor, bob's → bobs
         2. Periods in abbreviations are ignored
            a. I.B.M. → ibm, Ph.D. → ph d
   f. Stopping
      i. Function words (determiners, prepositions) have little meaning on their own
      ii. High occurrence frequencies
      iii. Treated as stopwords (i.e. removed)
         1. reduce index space, improve response time, improve effectiveness
      iv. Can be important in combinations
         1. e.g., "to be or not to be"
      v. Lists are customized for applications, domains, and even parts of documents
         1. e.g., "click" is a good stopword for anchor text
      vi. Best policy may be to index all words in documents and make decisions about which words to use at query time
   g. Stemming
      i. Many morphological variations of words
         1. inflectional (plurals, tenses)
         2. derivational (making verbs into nouns, etc.)
      ii. In most cases, these have the same or very similar meanings
      iii. Stemmers attempt to reduce morphological variations of words to a common stem
         1. usually involves removing suffixes
      iv. Can be done at indexing time or as part of query processing (like stopwords)

      v. Generally a small but significant effectiveness improvement
         1. can be crucial for some languages, e.g., 5-10% improvement for English, up to 50% in Arabic
      vi. Two basic types
         1. Dictionary-based: uses lists of related words
         2. Algorithmic: uses a program to determine related words
            a. suffix-s: remove "s" endings assuming plural, e.g., cats → cat, lakes → lake, wiis → wii
            b. Many false negatives: supplies → supplie
            c. Some false positives: ups → up
      vii. Porter Stemmer: algorithmic stemmer used in IR experiments since the 1970s
      viii. Krovetz Stemmer: hybrid algorithmic-dictionary
         1. Word is checked in the dictionary
         2. If present, either left alone or replaced with an exception
         3. If not present, the word is checked for suffixes that could be removed
         4. After removal, the dictionary is checked again
   h. Character n-grams
      i. Building stemmers takes a lot of time and expertise in the language; character n-grams work well
      ii. Use a sliding window of length n over the string.
         1. Get a lot of overlapping n-grams that capture a lot of linguistic information
         2. Tolerant of garbles and misspellings in text
         3. Use for any language/script (particularly European languages, but will also work for Chinese and others)
      iii. N-grams of size (per Dr. Cohen)
         1. 3 are good for language identification (id)
         2. … are good for topic id
         3. … are good for speaker id
         4. … are good for malware detection, plagiarism, …
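The sliding-window idea behind character n-grams, and their tolerance of misspellings, can be illustrated with a short Python sketch. The function names and the use of Jaccard overlap as the comparison are assumptions made for this example.

```python
def char_ngrams(text, n=3):
    """Slide a window of length n over the string, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings (an illustrative measure)."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(char_ngrams("tropical"))
# A misspelled form still shares most of its trigrams with the correct word.
print(ngram_overlap("tropical", "tropicl"))
```

Because no language-specific rules are involved, the same function applies to any script, which is the advantage over hand-built stemmers noted above.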

5. Week 5 Building and Using Inverted Indexes
   a. Use a hash table (dictionary) to store information by term (word, n-gram, ...)
      i. Represent documents as vectors
         1. One entry per term (word)
         2. Sparse: ~10^6 entries, most = 0
         3. Zipf's law says ½ of all terms will occur only once
         4. Value = number of times the term occurs in the document
         5. A document is a vector
         6. A collection of documents becomes a sparse matrix. We don't want to store all of the zeros in the matrix, so use an inverted index, which is a sparse representation
      ii. Inverted lists (also called postings lists)
         1. Attached to each bucket (term) in the hash table
         2. A column vector is an inverted list with one posting for each document, sorted in document-number order
         3. Contains information about the value of the term in the document
            a. Counts, weights, etc.
         4. Keep additional information with terms:
            a. number of documents in the list (number of documents that contain that term),
            b. number of times the term is used in the collection
         5. Compact and fast because sorted in document order
      iii. Linear Merge
         1. Merging the inverted lists of the terms that occur in a query and a document (or 2 documents) works because the documents are in sorted order.
            a. Set a pointer to the top of each of the first 2 inverted lists
            b. Compare the documents at the pointers in each posting list
               i. If Boolean scoring, then just looking for a match
               ii. If calculating a scoring function, it must be decomposable into a linear function, e.g. cosine similarity
            c. If not the same document, advance the pointer of the list with the lowest document number, then go back to b.
            d. Else you have a match
         2. If more than two terms, need a multi-way merge
      iv. Proximity Indexes: Phrase search
         1. Turn the phrase into a Boolean AND if there is no positional information in the indexes
         2. Term pairs: re-tokenize into 2- or 3-word sequences; increases the size of the inverted index
         3. Proximity index stores positions instead of counts
            a. Key to rich indexing: structure, fields, tags, etc.

            b. DNA index store
            c. Could build a Near operator.
            d. Order may be important
         4. Use linear merge: the comparison looks for positions within a given range
      v. Extent Indexes: Structure and Tags
         1. Separate field or index for each structural element: now need to look at multiple indexes
         2. Push structure into index values, e.g. author = William
         3. Extent index: like a positional index
            a. Special new term for each structural element (named-entity term) with a positional-index span giving beginning and end, e.g. in doc3 spanning positions 1 to 3.
            b. Does the term fall in the span for a special term? That is, does William fall within the Author span?
            c. Overlays on other terms
   b. Index Compression: indexes are big, so reduce I/O time. Make big numbers small by converting to the delta between document numbers, and pack the deltas into a smaller number of bits
      i. Delta encoding
         1. Doc numbers are big: 10 billion docs
         2. 4 bytes can only store 4 billion docs
         3. Often the difference between document numbers is small because lists are sorted by doc ID. Take differences (deltas).
         4. Works well for frequent words
      ii. V-byte encoding
         1. Pack numbers into fewer bytes to represent small numbers
         2. Variable number of bytes: a small number needs 1 byte to encode, etc. Each byte carries 7 data bits, with the high bit indicating that this is the last byte to be encoded
            a. 6 = 1 0000110
            b. 127 = 1 1111111
            c. 128 = 0 0000001 1 0000000
         3. For sparse terms, the deltas between document numbers in the posting lists will be bigger and thus will need multiple bytes to represent the difference.
      iii. Skip pointers
   c. Query Execution
      i. Set-based: a document matches or not
      ii. Ranked retrieval: rank by relevance
      iii. Goal: fast retrieval
         1. Doc-at-a-time (DaaT)
            a. Assemble the matching vector for each document
               i. List of matching words/counts/positions

               ii. Compute score
               iii. Output score
            b. K-way linear merge
               i. Compute the scoring function for each document while going through the merge
               ii. Optimal with 2 lists
               iii. Naïve variant with k lists: merge one by one, e.g. merge a,b with c, then a,b,c with d, etc.
                  1. Complexity: worst case O(n^2) for a query with all rare words
               iv. Better way: loop over all pointers at once
                  1. Complexity: loop over k pointers until all docs are done. If no two lists have the same docs (n), then O(kn)
                  2. If lists are dense and point to the same docs, then the score operation is O(k^2)
               v. Even better: use a min-heap (priority queue)
                  1. Keep the pointers in a data structure that finds the smallest in logarithmic time: O(n log k)
               vi. Only emits a score when non-zero
         2. Term-at-a-time (TaaT): flip the problem, don't merge lists
            a. Incrementally compute scores for all documents, updated one term at a time.
            b. Initialize an array, then for each term iterate over its inverted list and update the array with each doc's information
            c. Compute scores in no particular order
            d. Final step is to extract scores from the array (take the top 5 from the sorted array)
            e. O(n + N), where n = length of all lists, N = number of docs
         3. Trade-offs
            a. TaaT is linear in n and has sequential I/O (fetched by term); order doesn't matter, so lists can be processed in the order stored on disk in a single pass. But it has to store an array of N accumulators, one for each document. This can be an issue if the array is on disk. Also, it can only be used if the scoring function is quasi-linear, i.e. it is decomposable into parts that are:
               i. Dependent on the query, g(q)
               ii. Dependent on the document, h(d)
               iii. Combined using a dot-product summation
            b. DaaT
               i. Higher run time O(nk), O(k) memory, non-sequential I/O (takes a hit on performance)

               ii. Any scoring function
               iii. Structured queries
         4. Why complexity matters: compute pair-wise similarity with 10^6 docs, 10^6 terms, ~10^3 terms/doc
            a. Brute force: 10^15 operations (10^6 x 10^6 x 10^3)
               i. Compute the similarity of every doc to every other doc
            b. DaaT: C*10^13 (10 to 100 times faster)
               i. For each query doc Q: 10^6
               ii. Merge 10^3 inverted lists (one per query term; remember the avg doc length is 10^3)
                  1. Avg size of a list is 10^3 posts = 10^9 (nonempty cells over all posts) / 10^6 (columns)
                  2. Total number of posts across all lists is 10^6 = 10^3 * 10^3
               iii. Heap merge of 10^3 lists with 10^6 elements is 10^7 = 10^6 * log(10^3). C is a constant to maintain the heap.
               iv. Compute similarity for all matches above: 10^6 (total number of inverted lists), which is less than 10^7
               v. Complexity = C * (10^6 * 10^7) = C * 10^13
            c. TaaT: 10^12 (1000 times faster)
               i. For each query: 10^6
               ii. For each term in the query: 10^3
               iii. For each doc D in the inverted list: 10^3 (= 10^9 / 10^6)
               iv. Complexity = 10^6 * 10^3 * 10^3 = 10^12, and no C term
   d. Index Construction
      i. Add a document to the index by appending to the inverted lists of the terms in the document; create a new entry if the term does not exist
      ii. Why will this not scale?
         1. If everything is in memory, then i above is OK. But a tuple is needed for each post, so it uses lots of memory
         2. If on disk, then for each word: fetch the inverted list, append an entry, then write it back to disk.
         3. Use index merging for documents in batches: the same as linear merge, used for combining partial indexes on disk
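The delta and v-byte encoding described under Index Compression in Week 5 can be sketched in Python as below. This is a minimal illustration, assuming the convention stated in the notes that the high bit marks the last byte of a number, and assuming that 7-bit groups are written most significant first; the function names are hypothetical.

```python
def vbyte_encode(number):
    """Encode one non-negative integer using 7 data bits per byte."""
    if number < 0:
        raise ValueError("v-byte encodes non-negative integers only")
    groups = [number & 0x7F]
    number >>= 7
    while number:
        groups.append(number & 0x7F)
        number >>= 7
    groups.reverse()     # most significant 7-bit group first (assumption)
    groups[-1] |= 0x80   # high bit set marks the last byte of this number
    return bytes(groups)


def vbyte_decode(data):
    """Decode a byte string holding one or more v-byte encoded integers."""
    numbers, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:  # last byte of the current number
            numbers.append(current)
            current = 0
    return numbers


def compress_postings(doc_ids):
    """Delta-encode a sorted list of document numbers, then v-byte each delta."""
    out, previous = bytearray(), 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decompress_postings(data):
    """Invert compress_postings by summing the decoded deltas."""
    doc_ids, running = [], 0
    for delta in vbyte_decode(data):
        running += delta
        doc_ids.append(running)
    return doc_ids


# Hypothetical posting list: three close document numbers and one distant one.
postings = [3, 7, 130, 1000000]
packed = compress_postings(postings)
print(len(packed), decompress_postings(packed))
```

The three small deltas take one byte each while the large one takes three, which matches the observation that frequent terms, whose postings sit close together, compress best.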
