C++lue
EECS 398 Search Engine Team Final Report
Anvi Arora, Ben Bergkamp, Jake Close, Veronica Day, Zane Dunnings, Nicholas Yang
April 16th, 2018
Overview

We began the semester with multiple 2-3 hour weekly team meetings where we designed and scoped the various components of the project. After creating a rough initial system diagram and discussing how the APIs between the classes would look, we split the group of six into three groups of two, making an effort to figure out which pieces of the project could be parallelized and worked on separately.

(Image of one of our first whiteboarding brainstorming sessions)

Our major design decisions were typically made iteratively. We would define one way that we thought the interfaces should communicate; then, once we researched and discovered better approaches, we would reevaluate and refactor. Some of our major design choices included using the producer-consumer relationship for the crawler-URL frontier, parser-URL frontier, parser-indexer, and constraint solver-ranker.

Initially Jake and Ben worked on the Crawler, Anvi and Veronica on the Parser, and Nick and Zane on the Indexer. We would meet twice weekly for about an hour after class to discuss issues, problems, and changes to the interface. Once those components were done, Jake and Ben worked on the Constraint Solver; Anvi, Zane, and Veronica worked on the Ranker; Nick continued improving the Indexer and worked on ISRWord; and Zane worked on the Query Language.

Our development process and communication were key parts of our success. We each used CLion as our debugger, linter, and build system, which saved valuable time. We also set up a Vagrant machine to run all our code in a consistent environment, although we ended up leaning towards using Mac OS X for the majority of the project. We used the GitLab instance hosted on the UMich EECS servers for version control. We followed best practices by testing new components thoroughly, branching for each new feature, and merging only stable builds to master.

Outside of our twice-weekly meetings, we were constantly messaging in our GroupMe group, figuring out times to meet up and solve issues.

(Screenshot of our original Gantt chart of progress deadlines)

(Spreadsheet of major components and who did what in LOC)

We used the CLOC (Count Lines of Code) terminal script to compute the lines of code for each file, excluding comments and whitespace.
Functionality Level: ~7
Basic statistics
a. Total Docs = 50,000 docs
b. Total Size = 157 MB
c. 10 crawler worker threads
d. Crawl Speed = ~100 docs every 10 seconds
e. 6,632,329 total tokens indexed
f. 228,465 unique tokens (may be inaccurate due to our use of stemming)
g. Graph below depicts query speed:

We found query speed to be linear in the number of hits. We believe we could have reduced this time if we had run one thread per chunk in the constraint solver, optimized some in-memory seek data, and stored more cached data on disk.
2. Diagram of filesystem

- Master.txt (disk hash table)
  - Contains the word as the key, with chunk lists, frequency, document frequency, and last offset as the value.
  - Also stores statistics such as the number of words, number of unique words, number of chunks, etc., decorated with =.
- Chunks are composed of three files: #.txt, #-seek.txt, and #-wordseek.txt.
  - #.txt is the actual postings list. The first offset for each word is the absolute location.
  - #-seek.txt (disk hash table) is the seek list for every word in the chunk. The seek location is simply stored as the value, so all the constraint solver needs to do to parse the file is look up the term as the key and get the seek location into the actual postings list.
    - Also stores statistics such as the number of words, number of unique words, number of documents, etc., decorated with =.
  - #-wordseek.txt (disk hash table) is the dictionary/synchronization point for frequently used words. Each key is stored as the word appended with a number, with the number acting as a partition. The value is a listing of <absolute location, seek offset>, where the absolute location is the location in the entire corpus and the seek offset is where to seek in the postings list to reach that location.
- One thing we experimented with was making just a single wordseek file, eliminating the seek list and perhaps lowering the amount of space required for a single chunk. However, this resulted in significantly larger space per chunk, because there were many terms with only one or two postings. We decided to continue with the current working file structure.
- Chunk size: can be adjusted per build in IndexerConstants.h to any size desired.
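As a rough illustration of the two-step lookup through #-seek.txt and #.txt, the sketch below simulates both files in memory. SeekTable and postingsFor are hypothetical names, and the real files live on disk as disk hash tables rather than in-memory strings.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical in-memory stand-in for the #-seek.txt disk hash table:
// term -> byte offset into the postings file (#.txt).
using SeekTable = std::unordered_map<std::string, size_t>;

// Resolve a term to its postings list: consult the seek table first,
// then "seek" into the postings buffer. Returns empty on a miss.
std::string postingsFor(const SeekTable& seek,
                        const std::string& postingsFile,
                        const std::string& term) {
    auto it = seek.find(term);
    if (it == seek.end()) return "";              // term not in this chunk
    size_t start = it->second;
    size_t end = postingsFile.find('\n', start);  // one postings list per line
    return postingsFile.substr(start, end - start);
}
```

This mirrors the constraint solver's access pattern: one hash lookup, one seek, no scan over the whole chunk.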
3. Diagram of index build process

(Detailed Diagram at Build Index Process)

Our build process ran with a dynamic number of crawling threads and a single indexing thread. The spiders and the indexer were each given a pointer to a shared-memory producer-consumer queue. On startup, the crawler would spawn the chosen number of spider threads. Each spider had its own stream reader and Parser class. The spider would retrieve a new URL off of the URL Frontier, download the web page, and pass it to the Parser, which would push the final inverted index onto the Indexer Queue. The Indexer would pop an index off the queue and process it in memory. Once the in-memory data structure reached a predetermined size, the chunk would be written to disk.

Strategy

For modularity, we implemented a UrlFrontier class which encapsulated the web crawling strategy. This allowed us to give the spiders and parsers push and pop APIs and easily change the strategy. Our goal for the crawler was to find as diverse a set of high-quality websites as possible. We also wanted to obey some form of politeness and not spam a host with many requests at a time. When we originally began, we used a single queue and found that many weird or irrelevant sites were getting indexed. Our first attempt to fix this was adding a blacklist for domains we wanted to ignore. However, we quickly realized that hand-picking all of the bad sites was not scalable. Next, we tried a whitelist and implemented a time buffer for each domain so the queue would prioritize the least recently seen domains. This was better, but we were still getting stuck with a lot of the same sites and the crawl was still not diverse enough. Finally, we came upon the Mercator round-robin structure, which provided us with the best diversity in the crawl as well as a form of politeness. This strategy utilized a priority queue for each of the whitelisted domains.

(Diagram of Url Frontier and Multi Queue Structure)
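A minimal sketch of the per-domain round-robin idea is below. The class and method names are illustrative, not the project's actual UrlFrontier API, and the real version also tracked per-domain politeness timers and priorities.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Mercator-style frontier sketch: one FIFO queue per whitelisted domain,
// popped round-robin so consecutive requests spread across hosts.
class UrlFrontier {
    std::vector<std::string> domains_;                       // round-robin order
    std::map<std::string, std::deque<std::string>> queues_;  // domain -> urls
    size_t next_ = 0;                                        // next domain to serve
public:
    void Push(const std::string& domain, const std::string& url) {
        if (!queues_.count(domain)) domains_.push_back(domain);
        queues_[domain].push_back(url);
    }
    // Pop from the next domain that still has urls queued; skipping empty
    // queues keeps the rotation going as domains drain.
    std::string Pop() {
        for (size_t tried = 0; tried < domains_.size(); ++tried) {
            auto& q = queues_[domains_[next_]];
            next_ = (next_ + 1) % domains_.size();
            if (!q.empty()) {
                std::string url = q.front();
                q.pop_front();
                return url;
            }
        }
        return "";  // frontier empty
    }
};
```

Even with one host contributing most of the URLs, successive pops alternate across domains, which is the politeness property the round-robin structure buys.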
Seeding

We determined our seeds by running multiple Google searches and recording which sites were consistently in the top ten. We removed sites from the seed list that we found were not easily crawled, or that prevented us from downloading their pages. An example list of the seeds we used:
Parser

(Detailed Diagram for Parsing Process)

Overview

The parser, once it received a StreamReader object from the Crawler, would call doctostring() on the object to retrieve the entire HTML as one string. The parser used various custom string-processing functions to find the text within the HTML that we deemed important.

Tag Detection

Once the parser found a < opening tag, it continued to index into the string, determining what type of tag it was encountering and calling various extract functions to pull full strings from within the tags. Tags the parser handled include body, paragraph, anchor, and title tags. Each extract function would pull out the full string and pass it to the Parser's Tokenizer instance. Any tags that did not have detectable closing tags, such as meta tags, were ignored. Any URLs detected in anchor tags were sent to the crawler after a check for validity (i.e. no PDF files).

Tokenization

The tokenizer accepted a string and a decorator char. If the string was a URL (determined by the decorator), the Tokenizer split the text on any symbols found. For all other strings, the Tokenizer split the text on whitespace, removed symbols such as "!", and case-folded all chars so the string was entirely lowercase. Stop words were removed and all other words were stemmed. The individual tokens were concatenated with their decorator at the beginning of the token, then stored in a document inverted index (an unordered map that lived on the heap), with the token as the key and a list of offsets, relative to the beginning of the document, as the value. Once all strings in the document had been parsed and tokenized, a pointer to the dictionary was sent to the Indexer, which was responsible for merging this dictionary with the master index, normalizing offsets to be relative to the beginning of the index, and deleting the dynamically allocated document index.

Stemming

One important design decision was to incorporate a stemmer into our tokenization process. We knew we would have a limited amount of memory, and stemming would reduce the length of our master index by storing similar words in the same location. Stemming should also increase recall for our search engine, since stemmed queries can be mapped to more terms. For example, the query "why are cats so happy", after stemming and stop word removal, yields "cat happi", and search results such as "Happy Cats" will also be processed to "happi cat". In this way more relevant documents will be returned by the constraint solver. We used the Porter Stemmer algorithm to accomplish this. The algorithm removes the endings of words incrementally through a 5-step process. Word endings such as "s", "ing", and "ment" are removed, and only if that word's measure (the count of consonant-vowel-consonant patterns) is greater than 1.

Query Language

Overview

The query language was designed to read in a query from the user, extract the important words, and generate a parse tree from that query. We chose to insert decorated nodes into our tree, rather than have the constraint solver generate the decorated token. We support queries that are nested with parentheses and support both ORs and ANDs. If not specified, we assume words are ANDed together. We support the following for AND: "and", "AND", "&", and "&&", and the following for OR: "or", "OR", "|", and "||".
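The tokenization steps described in the Parser section (whitespace split, symbol removal, case folding, stop-word removal) can be sketched as follows. Stemming and the decorator char are omitted, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Sketch of non-URL tokenization: split on whitespace, strip symbols,
// case-fold, drop stop words. Porter stemming would follow in the real path.
std::vector<std::string> tokenize(const std::string& text,
                                  const std::set<std::string>& stopWords) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string raw;
    while (in >> raw) {
        std::string word;
        for (char c : raw)  // keep only alphanumerics, lowercased
            if (std::isalnum(static_cast<unsigned char>(c)))
                word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (!word.empty() && !stopWords.count(word))
            tokens.push_back(word);
    }
    return tokens;
}
```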
Query Preprocessing

We support the following methods of cleaning our queries:
1. Bracket cleanup: the query language detects "(" or ")" symbols and places a space before and after. Since we separate on whitespace, we avoid any errors in treating "(word" as one word, since it is transformed into "( word" in the preprocessing stage.
2. Removing stop words: our query language removes the same stop words that we remove in the text parser. Removing words that don't add any extra context ensures our system searches for the most relevant keywords in a given query. Also, these words do not appear in the index, so we would return no search results if we ANDed a stop word. The only modification is that we keep "or" and "and", since they are needed for context in the query language.
3. Lowercasing all words: this matches the parser.
4. Stemming the query words to match what's stored in the index.
5. Removing symbols: we remove all symbols, except that "|", "||", "&", and "&&" are replaced with OR, OR, AND, and AND respectively.

AND / OR Logic

After the string has been preprocessed, the QueryParser object separates the string on whitespace, creating a vector of each individual word or symbol in the query. The parser then splits the words in the query on any OR not within brackets, and recursively calls itself on each subset of the original query. If no ORs are found, it splits on ANDs, and if none of those are found it resolves the remaining query as an individual string. During this process, if any parentheses are found, the function searches for the matching parenthesis and calls itself on that subset of the query. Here is an example of a query tree generated from the query "What are some electric cars or gas trucks?"

(Query tree diagram)
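The recursive OR-then-AND splitting can be sketched over a preprocessed token vector as below. Parenthesis handling is omitted, and Node, parse, and splitOn are illustrative names rather than the project's real QueryParser API.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// A tree node: either an operator ("OR"/"AND") with children, or a leaf term.
struct Node {
    std::string value;
    std::vector<std::unique_ptr<Node>> children;
};

std::unique_ptr<Node> parse(const std::vector<std::string>& toks);

// Split toks on every occurrence of `op`, recursing on each piece.
// Returns nullptr when the operator is absent so the caller can fall through.
static std::unique_ptr<Node> splitOn(const std::vector<std::string>& toks,
                                     const std::string& op) {
    std::vector<std::vector<std::string>> parts(1);
    for (const auto& t : toks)
        if (t == op) parts.emplace_back();
        else parts.back().push_back(t);
    if (parts.size() == 1) return nullptr;  // operator not present
    auto node = std::make_unique<Node>();
    node->value = op;
    for (auto& p : parts) node->children.push_back(parse(p));
    return node;
}

std::unique_ptr<Node> parse(const std::vector<std::string>& toks) {
    if (toks.size() == 1) {                 // single term: leaf node
        auto leaf = std::make_unique<Node>();
        leaf->value = toks[0];
        return leaf;
    }
    if (auto n = splitOn(toks, "OR")) return n;   // ORs bind loosest
    if (auto n = splitOn(toks, "AND")) return n;  // then explicit ANDs
    auto node = std::make_unique<Node>();         // implicit AND of bare terms
    node->value = "AND";
    for (const auto& t : toks)
        node->children.push_back(parse({t}));
    return node;
}
```

For "electric cars OR gas trucks" this yields an OR root whose two children are implicit-AND nodes over the term leaves, matching the tree shape described above.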
Ranker

Overview

The Ranker object consisted of a list of ISRWords. Each document returned by the constraint solver was read into a Site object. Each Site represented one potential document to be returned, and had a dictionary that mapped each word in the document and query to a list of relevant information obtained from the ISRWords for that document, such as the word's frequency in the document and its offsets.
Each Site was sent to the Scorer, an object managed by the Ranker. Our Scorer object ran through four different heuristics to rank a document: static ranking, TF-IDF, proximity matching, and location scoring. Each score was multiplied by its coefficient, determined from our linear regression model. After a final score was computed, it was normalized by dividing by the sum of the coefficients for each heuristic. This ensured that all of our scores were relative and fell within the range 0-1.

Static Ranking

Static ranking took into account the TLD of the URL. We assumed that URLs with TLDs such as .com and .edu would contain more relevant information than other TLDs. We used the Wikipedia article "List of Internet top-level domains" to support our scoring of each TLD.

TF-IDF*

TF-IDF, term frequency times inverse document frequency, is a bag-of-words model, which does not take into account the order or context of words in a document. We used this scoring method to balance the raw frequency of query terms that appeared in the document against their respective rarity in the corpus. We multiplied the term frequency (normalized by the number of words in the query) by the log of the total number of documents in the corpus divided by the document frequency of that particular word. We summed the scores for each word, then divided by the number of words in the query to get an average TF-IDF score for each document. This score reflects how often rare query words are also found in the document, and scores accordingly.

*We were unable to fix a bug in the TF-IDF function that sometimes caused the score to come out as negative infinity. Due to lack of time, we had to leave this scoring function commented out in production.

Proximity Matching Spans

For a proximity match score, we used the heuristic that words that are near each other in the query should also appear near each other in a relevant document.
Documents that contain exact phrases matching the query phrase should also be weighted higher than other documents. In order to find exact phrases, and also account for near-exact phrases, we used the spanning algorithm taught in lecture. To do this, we first found the rarest word in the query (the term with the fewest occurrences in the document being scored). Next we found all other query terms that were also in the document, and computed the minimum offset
distance between each term and the rarest term. We then calculated the delta, or average distance, between the terms in each span. A span consists of the query terms that are closest in order of appearance in the document.

(Image representation of a span and the span delta calculation)

We then took the sum of the span deltas and divided it by the total number of spans in the document. This value represents the average length, or window, of terms needed to include all the query terms. The lower this number, the higher the document's score should be, so we took the reciprocal and multiplied it by the number of unique terms in the query. We then multiplied this result by alpha, a weight determined by the linear regression modeling. To this we added the number of exact phrases in the document (spans whose delta is one less than the number of words in the query) divided by the number of spans, multiplied by alpha prime, also determined by the linear regression modeling.

Max Score

We wanted to normalize this number by dividing the score by the highest score possible for a document. At first we assumed that the best document would consist only of query terms and only of exact phrases. We realized this yielded an unrealistic maximum and resulted in low scores for even high-ranking documents. To counter this, we still assumed that the average span delta was one less than the number of unique terms in the query, multiplied by alpha. But instead of adding alpha prime multiplied by 100% exact phrases, we found beta, which represented a more realistic percentage of exact phrases found in a relevant document.
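The span-delta computation can be sketched as below, assuming we already have the offset lists for the rarest term and for each of the other query terms in the document; averageSpanDelta is a hypothetical name, and the alpha/alpha-prime weighting is left out.

```cpp
#include <algorithm>
#include <cstdlib>
#include <limits>
#include <vector>

// Anchor on each occurrence of the rarest query term and measure the distance
// to the nearest occurrence of every other term; average those distances to
// get the span delta, then average over all spans. Lower means the query
// terms cluster tightly in the document.
double averageSpanDelta(const std::vector<int>& rarest,
                        const std::vector<std::vector<int>>& others) {
    double total = 0;
    for (int anchor : rarest) {
        double delta = 0;
        for (const auto& offsets : others) {
            int best = std::numeric_limits<int>::max();
            for (int off : offsets)                 // nearest occurrence
                best = std::min(best, std::abs(off - anchor));
            delta += best;
        }
        total += delta / others.size();  // average distance for this span
    }
    return total / rarest.size();        // average over all spans
}
```

The reciprocal of this value, scaled by the number of unique query terms, gives the window-tightness component of the proximity score described above.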
To find beta, we wrote a Python script, goldstandarddoc.py, to run random phrase queries through Google. We then took the top five results returned for each query, parsed each document, and counted the number of exact query phrases in the document's text. We divided this number by the number of N-grams in the document, with N equal to the number of terms in the query, yielding the percentage of exact phrases found in the document. We then averaged this percentage over the five documents returned for each query. Through this subset, we found that a top-ranked result on Google consists of about 0.06% exact phrase matches on average. This was the number we multiplied with alpha prime. If the score we calculated ended up being higher than our estimated exact score, we set it equal to the exact score and divided by itself, so our scoring was still normalized and fell within the bounds of 0-1.

Word Locations

The Scorer finds the number of instances of each query word of interest in each respective location, and multiplies the frequency/totalsize by a corresponding weight. This weights documents with the query words in the URL and title very highly, but also gives some weight if the words of interest occur many times in a document.

Machine Learning

We found that our hand-tuned coefficients were working generally well, and ranking was returning relevant, quality results. However, we decided to create a simple machine learning tool to improve the results further. We used a linear regression model, trained on Google results, to produce the coefficients we wanted.

Training Set

Our training set consisted of the top result on each page of a Google query's results. If a Google query returned 10 results per page, we extracted results 1, 11, 21, 31, etc.
Since the top Google results are extremely relevant and have little differentiation, we chose results with more variation in features to help our program learn what the best pages look like. [4]

Loss Function

To calculate the error in each step of the gradient descent, we use an evaluation metric called Kendall's Tau. The metric evaluates how similar two ordered lists are. In our case, we
perform gradient descent, increasing the coefficient that improves the Kendall's Tau score the most. To evaluate the cost, we create an n×n matrix, where n is the number of documents to rank. The binary matrix has the value 1 at (i, j) if r(i) > r(j) and 0 otherwise, where r(i) is the i-th document in the set of documents r. Two matrices (ordered sets of documents) are compared by counting P (the number of elements that agree) and Q (the number of elements that disagree), and calculating tau.

Potential Improvements

There are some things we'd love to improve about this machine learning approach that we think would greatly improve its accuracy:
- Fix the gradient jump with Kendall's Tau: with each step of gradient descent the ranking of the documents may not change, so the loss function might not register an improvement in a document's score. The calculated gradient therefore jumps only when there is a change in ranking.
- More data: in our case, it was very difficult to extract features from all of the pages, so getting enough training data was tough; increasing the amount of data would have made this approach much more powerful.
- Non-linear regression: it would have been useful to experiment with logistic regression and other non-linear models. It is unlikely that a set of ranked documents is scored in a linear way.

Results

Our engine is able to return good, relevant results for terms that are very common, although it is slightly slower because there are more pages to rank. Less common terms return results faster, but the results are less relevant. AND and OR queries are relatively fast as well, although not as fast as simple word queries. We don't anticipate this being a huge issue given that queries are 2.3 words on average.

Server

We built a web server in C++ using sockets to send the results in JSON format. The front end was implemented using jQuery and HTML to parse and display the top ten results, including links, scores, and URLs. Below is a screenshot.

(Screenshot of the front end)
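The Kendall's Tau comparison used in the loss function can be sketched directly from the P/Q counts described above. kendallTau is an illustrative name, ties are ignored, and the inputs here are two score vectors over the same n documents rather than the explicit n×n matrices.

```cpp
#include <cstddef>
#include <vector>

// Kendall's Tau over two rankings of the same n documents: count pairs that
// agree (P) and disagree (Q) on relative order, then tau = (P - Q) / (P + Q).
// tau = 1 for identical orderings, -1 for exactly reversed ones.
double kendallTau(const std::vector<double>& ours,
                  const std::vector<double>& gold) {
    int P = 0, Q = 0;
    for (size_t i = 0; i < ours.size(); ++i)
        for (size_t j = i + 1; j < ours.size(); ++j) {
            bool a = ours[i] > ours[j];  // our relative order of pair (i, j)
            bool b = gold[i] > gold[j];  // gold-standard relative order
            (a == b ? P : Q)++;
        }
    return static_cast<double>(P - Q) / (P + Q);
}
```

During training, the coefficient update that pushes this value closest to 1 against the Google-derived ordering is the one that gets taken.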
Bugs
a. One known bug is that sometimes the title of a returned page is "301 Forbidden" or "404 Not Found", even though the link below it is valid. We believe this is due to parsing these returned values incorrectly in some rare cases.
b. We weren't able to get TF-IDF functionality working; we think this may be because our method of retrieving document frequency is buggy.
c. The front end fails when JSON isn't returned properly.
d. The linear regression training data only includes a subset of the features, so some of the coefficients are hand-tuned.
Production

There would be a fair amount of work required to get this product to production. While it would never be able to compete with modern search engines like Google or Bing, it could fill the need for a local file system searcher. Adapting our engine into a pure local file system searcher would require another few weeks of refactoring our code to be local-file specific. These changes would include changing our parsing logic to take every word, instead of looking for specific HTML tags, and detecting which files are binary by looking at how high the byte values are. Another option would be to focus our search engine on a certain world of data, such as a sports search engine, or a search engine solely for Wikipedia.

4. What would we do differently:
a. Crawler:
   i. Add the ability to start/stop the crawler.
   ii. Obey robots.txt.
b. Constraint Solver:
   i. Use a thread per chunk of the index to reduce query time.
   ii. Add NOT and phrase ISRs.
c. Indexer:
   i. Introduce more compression to the file system (a lot of room on disk is dominated by our seek and wordseek disk hash tables).
d. Query Language:
   i. Phrase matching.
   ii. NOT functionality.
   iii. More error checking for mismatched brackets and missing words.
e. Ranker:
   i. Fix TF-IDF.
   ii. Include a larger set of heuristics.
f. Machine Learning:
   i. More training data.
   ii. Non-linear regression.
   iii. Better loss function.
Appendix

[4] Hung-Mo Lin & Vernon M. Chinchilli (2011). A Measure of Association Between Two Sets of Multivariate Responses. Journal of Biopharmaceutical Statistics, 11:3.
More informationDesktop Crawls. Document Feeds. Document Feeds. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used
More informationCS6200 Information Retreival. Crawling. June 10, 2015
CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on
More informationInformation Retrieval May 15. Web retrieval
Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically
More informationBasic techniques. Text processing; term weighting; vector space model; inverted index; Web Search
Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More informationAdministrative. Web crawlers. Web Crawlers and Link Analysis!
Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt
More informationElementary IR: Scalable Boolean Text Search. (Compare with R & G )
Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context
More informationProblem 1: Complexity of Update Rules for Logistic Regression
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 20: Crawling Hinrich Schütze Center for Information and Language Processing, University of Munich 2009.07.14 1/36 Outline 1 Recap
More informationDigital Libraries: Language Technologies
Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................
More informationCrawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server
Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document
More informationCS140 Final Project. Nathan Crandall, Dane Pitkin, Introduction:
Nathan Crandall, 3970001 Dane Pitkin, 4085726 CS140 Final Project Introduction: Our goal was to parallelize the Breadth-first search algorithm using Cilk++. This algorithm works by starting at an initial
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures
More informationLink Analysis in Web Mining
Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationCC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018
CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018 Lecture 6 Information Retrieval: Crawling & Indexing Aidan Hogan aidhog@gmail.com MANAGING TEXT DATA Information Overload If we didn t have search Contains
More informationTask Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document
More informationNews Article Matcher. Team: Rohan Sehgal, Arnold Kao, Nithin Kunala
News Article Matcher Team: Rohan Sehgal, Arnold Kao, Nithin Kunala Abstract: The news article matcher is a search engine that allows you to input an entire news article and it returns articles that are
More informationNotes: Notes: Primo Ranking Customization
Primo Ranking Customization Hello, and welcome to today s lesson entitled Ranking Customization in Primo. Like most search engines, Primo aims to present results in descending order of relevance, with
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 17 September 2018 Some slides courtesy Manning, Raghavan, and Schütze Other characteristics Significant duplication Syntactic
More informationSearch Engines. Charles Severance
Search Engines Charles Severance Google Architecture Web Crawling Index Building Searching http://infolab.stanford.edu/~backrub/google.html Google Search Google I/O '08 Keynote by Marissa Mayer Usablity
More informationSearch Engines. Dr. Johan Hagelbäck.
Search Engines Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Search Engines This lecture is about full-text search engines, like Google and Microsoft Bing They allow people to search a large
More informationSolution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution
Summary Each of the ham/spam classifiers has been tested against random samples from pre- processed enron sets 1 through 6 obtained via: http://www.aueb.gr/users/ion/data/enron- spam/, or the entire set
More informationΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου
Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationA Brief Look at Optimization
A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest
More informationChapter 2. Architecture of a Search Engine
Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them
More informationJob Reference Guide. SLAMD Distributed Load Generation Engine. Version 1.8.1
Job Reference Guide SLAMD Distributed Load Generation Engine Version 1.8.1 December 2004 Contents 1. Introduction...3 2. The Utility Jobs...4 3. The LDAP Search Jobs...11 4. The LDAP Authentication Jobs...22
More informationIndex construction CE-324: Modern Information Retrieval Sharif University of Technology
Index construction CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Ch.
More informationA short introduction to the development and evaluation of Indexing systems
A short introduction to the development and evaluation of Indexing systems Danilo Croce croce@info.uniroma2.it Master of Big Data in Business SMARS LAB 3 June 2016 Outline An introduction to Lucene Main
More informationAdministrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks
Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt
More informationCrawling. CS6200: Information Retrieval. Slides by: Jesse Anderton
Crawling CS6200: Information Retrieval Slides by: Jesse Anderton Motivating Problem Internet crawling is discovering web content and downloading it to add to your index. This is a technically complex,
More informationThe Topic Specific Search Engine
The Topic Specific Search Engine Benjamin Stopford 1 st Jan 2006 Version 0.1 Overview This paper presents a model for creating an accurate topic specific search engine through a focussed (vertical)
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling
More informationSE Workshop PLAN. What is a Search Engine? Components of a SE. Crawler-Based Search Engines. How Search Engines (SEs) Work?
PLAN SE Workshop Ellen Wilson Olena Zubaryeva Search Engines: How do they work? Search Engine Optimization (SEO) optimize your website How to search? Tricks Practice What is a Search Engine? A page on
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing
More informationCC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 6: Information Retrieval I. Aidan Hogan
CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 6: Information Retrieval I Aidan Hogan aidhog@gmail.com Postponing MANAGING TEXT DATA Information Overload If we didn t have search Contains all
More informationRelevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search
Algoritmi per IR Web Search Goal of a Search Engine Retrieve docs that are relevant for the user query Doc: file word or pdf, web page, email, blog, e-book,... Query: paradigm bag of words Relevant?!?
More informationRELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRACTION
RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRACTION Srivatsan Sridharan 1, Kausal Malladi 1 and Yamini Muralitharan 2 1 Department of Computer Science, International Institute
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #11: Link Analysis 3 Seoul National University 1 In This Lecture WebSpam: definition and method of attacks TrustRank: how to combat WebSpam HITS algorithm: another algorithm
More informationIntroduc)on to. CS60092: Informa0on Retrieval
Introduc)on to CS60092: Informa0on Retrieval Ch. 4 Index construc)on How do we construct an index? What strategies can we use with limited main memory? Sec. 4.1 Hardware basics Many design decisions in
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationJava Archives Search Engine Using Byte Code as Information Source
Java Archives Search Engine Using Byte Code as Information Source Oscar Karnalim School of Electrical Engineering and Informatics Bandung Institute of Technology Bandung, Indonesia 23512012@std.stei.itb.ac.id
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI
More informationBoolean Model. Hongning Wang
Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 4: Index Construction Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards This time: Spell correction Soundex Index construction Index
More informationToday s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan
Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics
More informationInstructor: Stefan Savev
LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information
More informationResearch and implementation of search engine based on Lucene Wan Pu, Wang Lisha
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,
More informationCS/COE 1501
CS/COE 1501 www.cs.pitt.edu/~nlf4/cs1501/ Introduction Meta-notes These notes are intended for use by students in CS1501 at the University of Pittsburgh. They are provided free of charge and may not be
More informationFACETs. Technical Report 05/19/2010
F3 FACETs Technical Report 05/19/2010 PROJECT OVERVIEW... 4 BASIC REQUIREMENTS... 4 CONSTRAINTS... 5 DEVELOPMENT PROCESS... 5 PLANNED/ACTUAL SCHEDULE... 6 SYSTEM DESIGN... 6 PRODUCT AND PROCESS METRICS...
More informationBasic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval
Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert
More informationWACC Report. Zeshan Amjad, Rohan Padmanabhan, Rohan Pritchard, & Edward Stow
WACC Report Zeshan Amjad, Rohan Padmanabhan, Rohan Pritchard, & Edward Stow 1 The Product Our compiler passes all of the supplied test cases, and over 60 additional test cases we wrote to cover areas (mostly
More informationInformation Retrieval
Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid
More informationseosummit seosummit April 24-26, 2017 Copyright 2017 Rebecca Gill & ithemes
April 24-26, 2017 CLASSROOM EXERCISE #1 DEFINE YOUR SEO GOALS Template: SEO Goals.doc WHAT DOES SEARCH ENGINE OPTIMIZATION REALLY MEAN? Search engine optimization is often about making SMALL MODIFICATIONS
More information12. Web Spidering. These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.
12. Web Spidering These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin. 1 Web Search Web Spider Document corpus Query String IR System 1. Page1 2. Page2
More informationCSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query
More informationLecture 8 May 7, Prabhakar Raghavan
Lecture 8 May 7, 2001 Prabhakar Raghavan Clustering documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Given the set of docs from the results of
More information5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search
Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page
More informationMG4J: Managing Gigabytes for Java. MG4J - intro 1
MG4J: Managing Gigabytes for Java MG4J - intro 1 Managing Gigabytes for Java Schedule: 1. Introduction to MG4J framework. 2. Exercitation: try to set up a search engine on a particular collection of documents.
More informationQuery Processing and Alternative Search Structures. Indexing common words
Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such
More information