C++lue; EECS 398 Search Engine Team Final Report
Anvi Arora, Ben Bergkamp, Jake Close, Veronica Day, Zane Dunnings, Nicholas Yang
April 16th, 2018


Overview

We began the semester with multiple two- to three-hour weekly team meetings in which we designed and scoped the various components of the project. After creating a rough initial system diagram and discussing how the APIs between the classes would look, we split the group of six into three groups of two, making an effort to figure out which pieces of the project could be parallelized and worked on separately.

(Image of one of our first whiteboarding brainstorming sessions)

Our major design decisions were typically made through an iterative process: we would define one way we thought the interfaces should communicate, then, once we researched and discovered better approaches, we would reevaluate and refactor. Some of our major design choices included using a producer-consumer relationship for the crawler-URL frontier, parser-URL frontier, parser-indexer, and constraint solver-ranker interfaces.

Initially Jake and Ben worked on the Crawler, Anvi and Veronica on the Parser, and Nick and Zane on the Indexer. We met twice weekly for about an hour after class to discuss issues, problems, and changes to the interfaces. Once those components were done, Jake and Ben worked on the Constraint Solver, Anvi, Zane, and Veronica worked on the Ranker, Nick continued improving the Indexer and worked on ISRWord, and Zane worked on the Query Language.

Our development process and communication were key parts of our success. We each used CLion as our debugger, linter, and build system, which saved valuable time. We also set up a Vagrant machine to run all our code in a consistent environment, although we ended up leaning towards using Mac OS X for the majority of the project. We used the GitLab instance hosted on the UMich EECS servers for version control. We made sure to follow best practices by testing new components thoroughly, branching for each new feature, and merging only stable builds to master.

Outside of our twice-weekly meetings, we were constantly messaging in our GroupMe group, figuring out times to meet and resolve issues.

(Screenshot of our original Gantt chart of progress deadlines)

(Spreadsheet of major components and who did what, in lines of code)

We used the CLOC (Count Lines of Code) terminal script to compute the lines of code for each file, excluding comments and whitespace.

Functionality Level: ~7

Basic statistics
a. Total docs = 50,000
b. Total size = 157 MB
c. 10 crawler worker threads
d. Crawl speed = ~100 docs every 10 seconds
e. 6,632,329 total tokens indexed
f. 228,465 unique tokens (may be inaccurate due to our use of stemming)
g. The graph below depicts query speed. We found query time to be linear in the number of hits. We believe we could have reduced this time if we had run one thread per chunk while running the constraint solver, optimized some in-memory seek data, and stored more cached data on disk.

2. Diagram of filesystem

- Master.txt (disk hash table)
  - Contains the word as the key, and the chunk list, frequency, document frequency, and last offset as the value.
  - Also stores statistics such as the number of words, number of unique words, number of chunks, etc., decorated with "=".
- Chunks are composed of three files: #.txt, #-seek.txt, and #-wordseek.txt.
  - #.txt is the actual postings list. The first offset for each word is the absolute location.
  - #-seek.txt (disk hash table) is the seek list for every word in the chunk. The seek location is simply stored as the value, so all the constraint solver needs to do to parse the file is look up the term as the key and get the seek location into the actual postings list.
    - Also stores statistics such as the number of words, number of unique words, number of documents, etc., decorated with "=".
  - #-wordseek.txt (disk hash table) is the dictionary / synchronization point for frequently used words. Each key is simply the word appended with a number, with the number acting as a partition. The value is a listing of <absolute location, seek offset>, where the absolute location is the location in the entire corpus and the seek offset is where to seek in the postings list to get to that location.
- One thing we experimented with was making just a single wordseek file, eliminating the seek list and perhaps lowering the amount of space required for a single chunk. However, this resulted in significantly more space per chunk, because there were many terms with only one or two postings. We decided to continue with the current working file structure.
- Chunk size: can be adjusted per build in the IndexerConstants.h file to any size desired.
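To make the chunk layout concrete, the following is a minimal sketch of how a term lookup could proceed: find the term's seek offset in the chunk's seek file, then seek directly into the postings file. The file format modeled here (plain "term offset" lines and whitespace-separated postings) is illustrative only; the actual project stored the seek files as disk hash tables.

```cpp
// Minimal sketch of a chunk lookup, assuming a simplified on-disk layout.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Look up a term's seek offset in a (hypothetical) plain-text seek file.
bool FindSeekOffset(const std::string& seekPath, const std::string& term, long& offset) {
    std::ifstream seekFile(seekPath);
    std::string line;
    while (std::getline(seekFile, line)) {
        std::istringstream iss(line);
        std::string word;
        long off;
        if (iss >> word >> off && word == term) {
            offset = off;
            return true;
        }
    }
    return false;
}

// Seek directly into the postings file (#.txt) and read that term's postings line.
std::vector<long> ReadPostings(const std::string& postingsPath, long offset) {
    std::ifstream postings(postingsPath);
    postings.seekg(offset);
    std::string line;
    std::getline(postings, line);

    std::vector<long> offsets;
    std::istringstream iss(line);
    long value;
    while (iss >> value) offsets.push_back(value);
    return offsets;
}

int main() {
    long seekOffset = 0;
    if (FindSeekOffset("0-seek.txt", "cat", seekOffset)) {
        for (long post : ReadPostings("0.txt", seekOffset))
            std::cout << post << ' ';
        std::cout << '\n';
    }
}
```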

3. Diagram of index build process

(Detailed diagram at Build Index Process)

Our build process ran with a configurable number of crawler threads and a single indexing thread. The spiders and the indexer were each given a pointer to a shared-memory producer-consumer queue. On startup, the crawler would spawn the chosen number of spider threads. Each spider had its own StreamReader and Parser class. The spider would retrieve a new URL off of the URL Frontier, download the web page, and pass it to the Parser, which would push the resulting inverted index onto the Indexer queue. The Indexer would pop an index off the queue and process it in memory. Once the in-memory data structure reached a predetermined size, the chunk would be written to disk.
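The indexer side of this pipeline amounts to popping document indexes from the shared queue and flushing a chunk once an in-memory threshold is reached. Below is a minimal sketch of that loop, assuming a hypothetical thread-safe queue and chunk writer; the class and method names are illustrative, not the project's actual API.

```cpp
// Minimal producer-consumer sketch of the indexer loop.
// DocumentIndex, the chunk writer, and the size threshold are illustrative stand-ins.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

using DocumentIndex = std::unordered_map<std::string, std::vector<long>>;

class ProducerConsumerQueue {
public:
    void Push(std::unique_ptr<DocumentIndex> doc) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(doc));
        cv_.notify_one();
    }
    std::unique_ptr<DocumentIndex> Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        auto doc = std::move(queue_.front());
        queue_.pop();
        return doc;
    }
private:
    std::queue<std::unique_ptr<DocumentIndex>> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

void IndexerLoop(ProducerConsumerQueue& queue) {
    DocumentIndex master;                          // in-memory index being built
    const std::size_t kChunkTokenLimit = 500000;   // illustrative flush threshold
    std::size_t tokensInMemory = 0;

    while (true) {
        std::unique_ptr<DocumentIndex> doc = queue.Pop();  // blocks until a parser pushes
        for (auto& [word, offsets] : *doc) {
            // (Offset normalization relative to the whole index is omitted in this sketch.)
            auto& postings = master[word];
            postings.insert(postings.end(), offsets.begin(), offsets.end());
            tokensInMemory += offsets.size();
        }
        if (tokensInMemory >= kChunkTokenLimit) {
            // WriteChunkToDisk(master);  // hypothetical: writes #.txt, #-seek.txt, #-wordseek.txt
            master.clear();
            tokensInMemory = 0;
        }
    }
}
```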

Strategy

For modularity, we implemented a Url Frontier class which handled the web crawling strategy. This allowed us to give the spiders and parsers simple push and pop APIs and to easily change the strategy. Our goal for the crawler was to find as diverse a set of high-quality websites as possible. We also wanted to obey some form of politeness and not spam a host with many requests at a time. When we originally began, we utilized a single queue and found that many weird or irrelevant sites were getting indexed. Our first attempt to fix this was adding a blacklist of domains we wanted to ignore. However, we quickly realized that hand-picking all of the bad sites was not scalable. Next, we tried a whitelist and implemented a time buffer for each domain so the queue would prioritize the least recently seen domains. This was better, but we were still getting stuck with a lot of the same sites and the crawl was still not diverse enough. Finally, we came upon the Mercator-style round-robin structure, which gave us the best diversity in the crawl as well as a form of politeness. This strategy utilized a priority queue for each one of the whitelisted domains; a sketch of the idea follows the diagram below.

(Diagram of URL Frontier and multi-queue structure)
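A minimal sketch of the per-domain queue idea described above: one queue per whitelisted host, with the least recently crawled eligible host served next and a politeness delay per domain. The class shape and delay value are illustrative, not the project's actual Url Frontier interface.

```cpp
// Illustrative round-robin URL frontier: one queue per whitelisted domain,
// serving the least recently crawled domain whose politeness delay has expired.
#include <chrono>
#include <deque>
#include <map>
#include <optional>
#include <string>

class RoundRobinFrontier {
public:
    void Push(const std::string& domain, const std::string& url) {
        queues_[domain].urls.push_back(url);
    }

    // Returns the next URL from the least recently served polite domain, if any.
    std::optional<std::string> Pop() {
        const auto now = std::chrono::steady_clock::now();
        DomainQueue* best = nullptr;
        for (auto& [domain, q] : queues_) {
            if (q.urls.empty()) continue;
            if (now - q.lastCrawl < kPolitenessDelay) continue;   // be polite to this host
            if (best == nullptr || q.lastCrawl < best->lastCrawl) best = &q;
        }
        if (best == nullptr) return std::nullopt;                 // nothing eligible right now
        std::string url = best->urls.front();
        best->urls.pop_front();
        best->lastCrawl = now;
        return url;
    }

private:
    struct DomainQueue {
        std::deque<std::string> urls;
        std::chrono::steady_clock::time_point lastCrawl{};
    };
    static constexpr std::chrono::seconds kPolitenessDelay{2};    // illustrative delay
    std::map<std::string, DomainQueue> queues_;
};
```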

Seeding

We determined our seeds by running multiple Google searches and recording which sites were consistently in the top ten. We removed sites from the seed list that we found were not easily crawled, or that prevented us from downloading their pages. An example list of the seeds we used:

Parser

(Detailed diagram for the parsing process)

Overview

Once the parser received a StreamReader object from the Crawler, it would call doctostring() on the object in order to retrieve the entire HTML in one string. The parser used various custom string processing functions to find the text within the HTML that we deemed important.

Tag Detection

Once the parser found a '<' opening tag, it continued to index through the string, determining what type of tag it was encountering and calling various extract functions to pull full strings from within the tags. Tags the parser handled include body, paragraph, anchor, and title tags. Each extract function would pull out the full string and pass it to the Parser's Tokenizer instance. Any tags that did not have detectable closing tags, such as meta tags, were ignored. Any URLs detected in anchor tags were sent to the crawler after a check for validity (e.g., no PDF files).
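As a rough illustration of the tag-detection approach, here is a minimal sketch that extracts the text between an opening tag and its closing tag; the function name and the simplistic string scanning are illustrative only and ignore nesting and malformed HTML.

```cpp
// Illustrative extraction of the contents of a single HTML tag, e.g. <title>...</title>.
#include <iostream>
#include <string>

// Returns the text between <tag ...> and </tag>, or an empty string if not found.
std::string ExtractTag(const std::string& html, const std::string& tag) {
    const std::string open = "<" + tag;
    const std::string close = "</" + tag + ">";

    std::size_t start = html.find(open);
    if (start == std::string::npos) return "";
    start = html.find('>', start);            // skip past any attributes
    if (start == std::string::npos) return "";
    ++start;

    std::size_t end = html.find(close, start);
    if (end == std::string::npos) return "";
    return html.substr(start, end - start);
}

int main() {
    std::string page = "<html><title>Happy Cats</title><body><p>Cats are happy.</p></body></html>";
    std::cout << ExtractTag(page, "title") << '\n';  // prints "Happy Cats"
    std::cout << ExtractTag(page, "p") << '\n';      // prints "Cats are happy."
}
```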

Tokenization

The Tokenizer accepted a string and a decorator character. If the string was a URL (determined by the decorator), the Tokenizer split the text on any symbols found. For all other strings, the Tokenizer split the text on whitespace, removed symbols such as '!', and case-folded all characters so the string was entirely lowercase. Stop words were removed and all other words were stemmed. The individual tokens were concatenated with their decorator at the beginning of the token, then stored in a document inverted index (an unordered map that lived on the heap), with the token as the key and a list of offsets, relative to the beginning of the document, as the value. Once all strings in the document had been parsed and tokenized, a pointer to the dictionary was sent to the Indexer, which was responsible for merging this dictionary with the master index, normalizing offsets to be relative to the beginning of the index, and deleting the dynamically allocated document index.

Stemming

One important design decision was to incorporate a stemmer into our tokenization process. We knew we would have a limited amount of memory, and stemming would reduce the size of our master index by storing similar words in the same location. Stemming should also increase recall for our search engine, since stemmed queries can be mapped to more terms. For example, the query "why are cats so happy", after stemming and stop word removal, becomes "cat happi", and search results such as "Happy Cats" are likewise processed to "happi cat". In this way more relevant documents are returned by the constraint solver. We used the Porter Stemmer algorithm to accomplish this. The algorithm removes the endings of words incrementally through a five-step process. Word endings such as "s", "ing", and "ment" are removed, and only if that word's measure (the count of consonant-vowel-consonant patterns) is greater than 1.
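A minimal sketch of the tokenization steps described above (whitespace splitting, symbol stripping, case folding, and prefixing a decorator character); stop-word removal and stemming are only indicated with a placeholder comment, and the decorator convention shown is illustrative rather than the project's exact scheme.

```cpp
// Illustrative tokenizer: split on whitespace, strip symbols, lowercase,
// then record each token's offset in a document-level inverted index.
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using DocumentIndex = std::unordered_map<std::string, std::vector<long>>;

std::string CleanWord(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (std::isalnum(c)) out += static_cast<char>(std::tolower(c));
    }
    return out;
}

// decorator marks where the text came from, e.g. '#' for body, '@' for title (illustrative).
void Tokenize(const std::string& text, char decorator, long& offset, DocumentIndex& index) {
    std::istringstream iss(text);
    std::string word;
    while (iss >> word) {
        std::string token = CleanWord(word);
        if (token.empty()) continue;
        // Stop-word removal and Porter stemming would happen here in the real parser.
        index[decorator + token].push_back(offset++);
    }
}

int main() {
    DocumentIndex index;
    long offset = 0;
    Tokenize("Happy Cats are happy!", '#', offset, index);
    for (const auto& [token, offsets] : index)
        std::cout << token << " appears " << offsets.size() << " time(s)\n";
}
```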

Query Language

Overview

The query language was designed to read in a query from the user, extract the important words, and generate a parse tree from that query. We chose to insert decorated nodes into our tree, rather than have the constraint solver generate the decorated tokens. We support queries that are nested with parentheses, and we support both ORs and ANDs. If not specified, we assume words are ANDed together. We support "and", "AND", "&", and "&&" for AND, and "or", "OR", and their symbol equivalents for OR.

Query Preprocessing

We support the following methods of cleaning our queries:
1. Bracket cleanup: the query language detects "(" and ")" symbols and places a space before and after them. Since we separate on whitespace, this avoids the error of treating "(word" as one word, because it is transformed into "( word" in the preprocessing stage.
2. Removing stop words: our query language removes the same stop words that we remove in the text parser. Removing words that don't add any extra context ensures our system searches for the most relevant keywords in a given query. Also, these words do not appear in the index, so we would return no search results if we ANDed a stop word. The only modification is that we keep "or" and "and", since they are needed for context in the query language.
3. Lowercasing all words: this matches the parser's behavior.
4. Stemming the query words to match what is stored in the index.
5. Removing symbols: we remove all symbols, except that the AND symbols ("&", "&&") are replaced with AND and the OR symbols are replaced with OR.

AND / OR Logic

After the string has been preprocessed, the QueryParser object separates the string on whitespace, creating a vector of the individual words and symbols in the query. The parser then splits the query on any OR not within brackets and recursively calls itself on each subset of the original query. If no ORs are found, it splits on ANDs, and if none of those are found, it resolves the remaining query as an individual string. During this process, if any parentheses are found, the function searches for the matching parenthesis and calls itself on that subset of the query. Here is an example of the query tree generated from the query "What are some electric cars or gas trucks?"; a minimal parser sketch follows below.

(Query tree diagram for the example query)
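A minimal sketch of the recursive split-on-OR-then-AND idea, assuming a simplified token representation and ignoring parenthesis matching; the node shape and names are illustrative, not the project's QueryParser API.

```cpp
// Illustrative recursive query parser: split on OR first, then AND,
// otherwise treat the remaining token as a single word node.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string value;                          // "OR", "AND", or a search term
    std::vector<std::unique_ptr<Node>> children;
};

std::unique_ptr<Node> Parse(const std::vector<std::string>& tokens);

// Split the token list on every occurrence of `op` and parse each piece.
std::unique_ptr<Node> SplitOn(const std::vector<std::string>& tokens, const std::string& op) {
    auto node = std::make_unique<Node>();
    node->value = op;
    std::vector<std::string> piece;
    for (const auto& t : tokens) {
        if (t == op) {
            node->children.push_back(Parse(piece));
            piece.clear();
        } else {
            piece.push_back(t);
        }
    }
    node->children.push_back(Parse(piece));
    return node;
}

std::unique_ptr<Node> Parse(const std::vector<std::string>& tokens) {
    for (const auto& t : tokens)
        if (t == "OR") return SplitOn(tokens, "OR");
    for (const auto& t : tokens)
        if (t == "AND") return SplitOn(tokens, "AND");
    auto leaf = std::make_unique<Node>();
    leaf->value = tokens.empty() ? "" : tokens.front();  // single remaining term
    return leaf;
}

void Print(const Node& n, int depth = 0) {
    std::cout << std::string(depth * 2, ' ') << n.value << '\n';
    for (const auto& c : n.children) Print(*c, depth + 1);
}

int main() {
    // "electric cars or gas trucks" after preprocessing, with implicit ANDs made explicit.
    std::vector<std::string> tokens = {"electr", "AND", "car", "OR", "gas", "AND", "truck"};
    Print(*Parse(tokens));
}
```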

Ranker

Overview

The Ranker object consisted of a list of ISRWords. Each document returned by the constraint solver was read into a Site object. Each Site represented one potential document to be returned and had a dictionary that mapped each word appearing in both the document and the query to a list of relevant information, such as the frequency of the word in the document and its offsets, obtained from the ISRWords for that document.
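A sketch of what such a Site record could look like; the field names are illustrative, not the project's actual Site class.

```cpp
// Illustrative Site record holding per-word statistics used by the Scorer.
#include <string>
#include <unordered_map>
#include <vector>

struct WordInfo {
    long frequency = 0;            // occurrences of the word in this document
    std::vector<long> offsets;     // positions of the word within the document
};

struct Site {
    std::string url;
    std::string title;
    long documentLength = 0;
    // One entry per query word found in the document, filled from the ISRWords.
    std::unordered_map<std::string, WordInfo> wordData;
};
```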

Each Site was sent to the Scorer, an object managed by the Ranker. Our Scorer object ran through four different heuristics to rank a document: static ranking, TF-IDF, proximity matching, and location scoring. Each score was multiplied by its coefficient, determined from our linear regression model. After a final score was computed, it was normalized by dividing by the sum of the coefficients for the heuristics. This ensured that all of our scores were relative and fell within the range of 0-1.

Static Ranking

Static ranking took into account the TLD of the URL. We assumed that URLs with TLDs such as .com and .edu would contain more relevant information than other TLDs. We used the Wikipedia article "List of Internet top-level domains" to inform our scoring of each TLD.

TF-IDF*

TF-IDF, term frequency times inverse document frequency, is a bag-of-words model which does not take into account the order or context of words in a document. We used this scoring method to balance the raw frequency of query terms appearing in the document against their respective rarity in the corpus. We multiplied the term frequency (normalized by the number of words in the query) by the log of the total number of documents in the corpus divided by the document frequency of that particular word. We summed the scores for each word, then divided by the number of words in the query to get an average TF-IDF score for each document. This score reflects how often the rare words in the query are also found in the document, and scores accordingly.

*We were unable to fix a bug in the TF-IDF function that sometimes caused the score to be negative infinity. Due to lack of time, we had to leave this scoring function commented out in production.
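A minimal sketch of the TF-IDF computation as described above (average over query terms of tf multiplied by log(N/df)); the normalization choice and the guard against a zero document frequency are illustrative.

```cpp
// Illustrative TF-IDF scoring: average over query terms of tf * log(N / df).
#include <cmath>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

double TfIdfScore(const std::vector<std::string>& queryTerms,
                  const std::unordered_map<std::string, long>& termFreqInDoc,
                  const std::unordered_map<std::string, long>& docFreqInCorpus,
                  long totalDocs) {
    double score = 0.0;
    for (const auto& term : queryTerms) {
        auto tfIt = termFreqInDoc.find(term);
        auto dfIt = docFreqInCorpus.find(term);
        if (tfIt == termFreqInDoc.end() || dfIt == docFreqInCorpus.end() || dfIt->second == 0)
            continue;  // guard: a missing or zero document frequency would blow up the log
        double tf = static_cast<double>(tfIt->second) / queryTerms.size();
        double idf = std::log(static_cast<double>(totalDocs) / dfIt->second);
        score += tf * idf;
    }
    return score / queryTerms.size();  // average TF-IDF over the query terms
}

int main() {
    std::vector<std::string> query = {"cat", "happi"};
    std::unordered_map<std::string, long> tf = {{"cat", 12}, {"happi", 3}};
    std::unordered_map<std::string, long> df = {{"cat", 900}, {"happi", 40}};
    std::cout << TfIdfScore(query, tf, df, 50000) << '\n';
}
```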

Proximity Matching Spans

For a proximity match score, we used the heuristic that words that are near each other in the query will also appear near each other in a relevant document. Documents that contain exact phrases matching the query phrase should also be weighted higher than other documents. In order to find exact phrases, and also take into account near-exact phrases, we used the spanning algorithm taught in lecture. To do this we first found the rarest word in the query (the term with the fewest occurrences in the document being scored). Next we found all other query terms that were also in the document, and found the minimum offset distance between each term and the rarest term. We then calculated the delta, or average distance, between the terms in the span. A span consists of the query terms that are closest in order of appearance in the document.

(Image representation of a span and the span delta calculation)

We then took the sum of the span deltas and divided it by the total number of spans in the document. This value represents the average length, or window, of text needed to include all the query terms. The lower this number, the higher the score for the document should be, so we took the reciprocal and multiplied it by the number of unique terms in the query. We then multiplied this result by alpha, a weight determined by the linear regression modeling. To this we added the number of exact phrases in the document (spans whose delta is one less than the number of words in the query), divided by the number of spans and multiplied by alpha prime, which was also determined by the linear regression modeling. A sketch of this span scoring appears below.
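A rough sketch of the span scoring arithmetic described above, assuming the span deltas have already been computed from the posting offsets; the weights alpha and alphaPrime and the exact-phrase test are illustrative stand-ins for the trained coefficients.

```cpp
// Illustrative proximity score from precomputed span deltas.
// A delta equal to (query length - 1) is treated as an exact phrase match.
// Assumes a multi-term query, so every delta is at least 1.
#include <iostream>
#include <vector>

double ProximityScore(const std::vector<double>& spanDeltas,
                      int uniqueQueryTerms, int queryLength,
                      double alpha, double alphaPrime) {
    if (spanDeltas.empty()) return 0.0;

    double deltaSum = 0.0;
    int exactPhrases = 0;
    for (double d : spanDeltas) {
        deltaSum += d;
        if (d == queryLength - 1) ++exactPhrases;  // tightest possible span
    }

    double avgWindow = deltaSum / spanDeltas.size();              // average span length
    double windowScore = (uniqueQueryTerms / avgWindow) * alpha;  // smaller window -> higher score
    double phraseScore = (static_cast<double>(exactPhrases) / spanDeltas.size()) * alphaPrime;
    return windowScore + phraseScore;
}

int main() {
    // Three spans found for a two-word query; one of them is an exact phrase.
    std::vector<double> deltas = {1.0, 4.0, 7.0};
    std::cout << ProximityScore(deltas, 2, 2, 0.5, 0.3) << '\n';
}
```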

Max Score

We wanted to normalize this number by dividing the score by the highest score possible for a document. At first we assumed that the best document would consist only of query terms and only of exact phrases. We realized this yields an unrealistic maximum and resulted in low scores for even high-ranking documents. To counter this, we still assumed that the average span delta was one less than the number of unique terms in the query, multiplied by alpha. But instead of adding alpha prime multiplied by 100% exact phrases, we found beta, which represents a more realistic percentage of exact phrases found in a relevant document.

To find beta, we wrote a Python script, goldstandarddoc.py, to run random phrase queries on Google. We then took the top five results returned for each query, parsed each document, and found the number of exact query phrases in the document's text. We divided this number by the number of N-grams in the document, with N equal to the number of terms in the query. This yielded the percentage of exact phrases found in the document. We then averaged this percentage across the five documents returned for each query. Through this sample, we found that a top-ranked result on Google consists of about 0.06% exact phrase matches on average. This was the number we multiplied with alpha prime. If the score we calculated ended up being higher than our estimated maximum, we set it equal to the maximum and divided by itself, so our scoring was still normalized and fell within the bounds of 0-1.

Word Locations

The Scorer finds the number of instances of each query word of interest in each respective location and multiplies the frequency divided by the total size by a corresponding weight. This weights documents with the query words in the URL and title very highly, but also gives some weight if the words of interest occur many times in a document.

Machine Learning

We found that our hand-tuned coefficients were working generally well, and ranking was returning relevant, quality results. However, we decided to create a simple machine learning tool in order to improve the results further. We used a linear regression model, trained on Google results, to produce the coefficients we wanted.

Training Set

Our training set consisted of the top result from each page of several Google queries' results. Since Google returns 10 results per page, we extracted results 1, 11, 21, 31, and so on. The top Google results are extremely relevant and have little differentiation, so we chose results with more variation in features to help our program learn what the best pages look like. [4]

Loss Function

To calculate the error at each step of gradient descent, we used an evaluation metric called Kendall's Tau. The metric evaluates how similar two ordered lists are. In our case, we perform a greedy descent in which we increase the coefficient that improves the Kendall's Tau score the most. To evaluate the cost, you create an n x n matrix, where n is the number of documents to rank. The binary matrix has the value 1 at (i, j) if r(i) > r(j), and zero otherwise, where r(i) is the i-th document in the set of documents r. You compare two such matrices (ordered sets of documents) by counting P (the number of entries that agree) and Q (the number of entries that disagree), and calculating tau.
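A minimal sketch of that pairwise-agreement computation, comparing a model-produced ranking against a gold-standard ranking; reducing P and Q to a single value uses the standard (P - Q) / (P + Q) form of Kendall's Tau, which is an assumption about the exact formula the project used.

```cpp
// Illustrative Kendall's Tau between two rankings of the same n documents.
// rankA[i] and rankB[i] give document i's rank under each ordering.
#include <iostream>
#include <vector>

double KendallTau(const std::vector<int>& rankA, const std::vector<int>& rankB) {
    const std::size_t n = rankA.size();
    long agree = 0, disagree = 0;  // P and Q over all document pairs
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            bool aSaysIBetter = rankA[i] < rankA[j];
            bool bSaysIBetter = rankB[i] < rankB[j];
            if (aSaysIBetter == bSaysIBetter) ++agree;
            else ++disagree;
        }
    }
    return static_cast<double>(agree - disagree) / (agree + disagree);
}

int main() {
    // Gold-standard ranking vs. the ranking our scorer produced (ranks per document).
    std::vector<int> gold  = {1, 2, 3, 4, 5};
    std::vector<int> model = {1, 3, 2, 4, 5};
    std::cout << KendallTau(gold, model) << '\n';  // 0.8: one swapped pair out of ten
}
```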

Potential Improvements

There are some things we would love to improve about this machine learning approach that we think would greatly improve its accuracy:
- Fix the gradient jump with Kendall's Tau: with each step of gradient descent the ranking of the documents may not change, so the loss function might not register an improvement in a document's score. As a result, the calculated gradient jumps only when there is a change in ranking.
- More data: in our case it was very difficult to extract features from all of the pages, so getting enough training data was tough; we think more data would have made this approach really powerful.
- Non-linear regression: it would have been useful to experiment with logistic regression and other non-linear models. It is unlikely that a set of ranked documents is best scored in a linear way.

Results

Our engine is able to return relevant results for terms that are very common, though it is slightly slower for them because there are more pages to rank. Terms that are less common return results faster, but the results are less relevant. AND and OR queries are relatively fast as well, although not as fast as single-word queries. We don't anticipate this being a huge issue given that queries are on average 2.3 words.

Server

We built a web server in C++ using sockets to send the results in JSON format. The front end was implemented using jQuery and HTML to parse and display the top ten results, including links, scores, and URLs. Below is a screenshot.

(Screenshot of the search results front end)
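A minimal sketch of the kind of socket-based server described above, returning a hard-coded JSON payload over HTTP; it uses plain POSIX sockets and omits request parsing, threading, and error handling, and the response fields are illustrative, not the project's actual result format.

```cpp
// Illustrative single-threaded HTTP server returning search results as JSON.
#include <iostream>
#include <string>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int serverFd = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(8080);             // illustrative port
    bind(serverFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(serverFd, 10);

    while (true) {
        int client = accept(serverFd, nullptr, nullptr);
        if (client < 0) continue;

        char request[4096] = {};
        read(client, request, sizeof(request) - 1);   // the query would be parsed from this

        // Hard-coded result; the real server would run the query pipeline here.
        std::string body =
            R"({"results":[{"url":"https://example.com","title":"Example","score":0.42}]})";
        std::string response =
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: application/json\r\n"
            "Content-Length: " + std::to_string(body.size()) + "\r\n"
            "\r\n" + body;
        write(client, response.c_str(), response.size());
        close(client);
    }
}
```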

Bugs

a. One known bug is that sometimes the title of a returned page is "301 Forbidden" or "404 Not Found", even though the link below it is valid. We believe this is due to incorrectly parsing these returned values in some rare cases.
b. We weren't able to get TF-IDF working; we think this may be because our method of retrieving document frequency is buggy.
c. The front end fails when JSON isn't returned properly.
d. The linear regression training data only includes a subset of the features, so some of the coefficients are hand-tuned.

Production

A fair amount of work would be required to get this product to production. While it would never be able to compete with modern search engines like Google or Bing, it could fill the need for a local file system searcher. Adapting our engine to be a pure local file system searcher would require another few weeks of refactoring our code to be local-file specific. These changes would include changing our parsing logic to get every word instead of looking for specific HTML tags, and detecting which files are binary by looking at how high the byte values are. Another option would be to focus our search engine on a certain world of data, such as a sports search engine or a search engine solely for Wikipedia.

4. What would we do differently:
a. Crawler:
   i. Add the ability to start/stop the crawler.
   ii. Obey robots.txt.
b. Constraint Solver:
   i. Use a thread per chunk of the index to reduce query time.
   ii. Add NOT and phrase ISRs.
c. Indexer:
   i. Introduce more compression to the file system (a lot of room on disk is dominated by our seek and wordseek disk hash tables).
d. Query Language:
   i. Phrase matching.
   ii. NOT functionality.
   iii. More error checking for mismatched brackets and missing words.
e. Ranker:
   i. Fix TF-IDF.
   ii. Include a larger set of heuristics.
f. Machine Learning:
   i. More training data.
   ii. Non-linear regression.
   iii. A better loss function.

Appendix

[4] Hung-Mo Lin & Vernon M. Chinchilli (2011). A Measure of Association Between Two Sets of Multivariate Responses. Journal of Biopharmaceutical Statistics, 11(3).
