Finding Similar Content
Michael Welch and Ethan Chan {mjwelch, ...}

Abstract

In this paper, we present an algorithm for automatically discovering similar pages. Unlike common keyword-based search engines, our similarity comparison is based on the entire textual content of the pages. We are unaware of any existing content-based similarity detection mechanism for the Web that is accessible to the common user. Our algorithm, FSC, exploits temporal locality of content without depending on absolute position within a page, making it difficult to bypass.

1. Introduction

The World Wide Web contains millions of sources of information spread across billions of pages. One of the most difficult and frustrating tasks for a user of the Web is finding pages that contain the information they are interested in. Popular search engines such as Google [1] attempt to answer keyword queries from users and return relevant pages. Often a user may wish to find a mirrored site, where content is 100% similar, that is hosted closer and is therefore faster. Page similarity may also pertain to legal issues: a website administrator may wish to search the Web for sites infringing on his copyrighted page by measuring, say, 70% or higher similarity. Content theft is an ongoing problem [2] that is difficult if not impossible to detect by ordinary Web search methods and content category listings such as those of Yahoo! [3].

We propose a method for finding similar pages based on the union and intersection of content units, through an analysis of page content. Our two-phase algorithm, FSC, analyzes the contents of HTML pages and finds related pages that meet a user-defined threshold percentage of similarity. In the first phase, we extract and hash units from the pages. In the second phase, we calculate the similarity between pages and separate them into related groups. To avoid complex and time-consuming web crawls in our experiments, all HTML pages are read from the UCLA Webcat Web Repository of 70 million pages through a C++ interface.

We present our technique in three parts. Sections 2 and 3 describe the first phase of FSC and its implementation using C++ and LEX: Section 2 covers our unit generation methods, and Section 3 discusses the hashing techniques we have used. Section 4 describes the second phase of FSC and our analysis techniques using Perl and SQL. We present our initial results in Section 5, and in Section 6 we discuss possible extensions and future work.
2. Unit Generation

The first phase of the FSC algorithm processes HTML pages. Because our goal is to find pages that match at a given threshold of similarity, which is based on the union and intersection of corresponding page sections, we must decompose each page into units. We define a unit of a page as a specific sequence of characters within the HTML page that matches the designated unit function. The unit function is a pattern-matching function that finds and returns the next occurrence of a unit string within the page. In our initial analysis, we have implemented and tested three unit functions.

The first unit function selects consecutive sentences until the total length of the string is at least gminwordsize, a parameter specified at runtime and set to 20 for our experiments. The end of a sentence is determined by common punctuation marks (!, ?, .). As noted in [4], this creates potential problems when punctuation is not used to end a sentence, such as in an abbreviation. We will show in our results that this does not substantially affect our similarity detection.

The second unit function selects individual words, separated by space characters or end-of-sentence punctuation, of at least gminwordsize characters; in our analysis using this unit function, gminwordsize is set to 3. This provides the most exact measure of similarity, comparable to a search engine's inverted-indexing approach, by taking word-for-word content from a page. It does not, however, preserve relative word location, so some of the page's semantics can be lost. For this reason, we believe individual word units are not appropriate for similarity detection.

Our third unit function uses overlapping units similar to those described in [4]. It returns gminwordsize (3 in our experiments) consecutive words as a unit, with two consecutive units overlapping by one word. For example, given the sequence of 5 words "A B C D E", this unit function returns the two units "A B C" and "C D E".

All of our unit functions are generally free of phase dependence [4], in which inserting a single additional unit into a page can cause a false negative, because they do not consider absolute location within a page. At the same time, the first and third methods preserve temporal locality, which helps protect against false positives.
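The third (overlapping) unit function is simple to express in code. Below is a minimal C++ sketch of this shingle-style unit generation; the whitespace tokenizer and function name are illustrative stand-ins for our LEX-based implementation, which also handles punctuation.

// Sketch of the overlapping unit function: "A B C D E" yields
// "A B C" and "C D E". Assumes gminwordsize >= 2.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> overlappingUnits(const std::string& text,
                                          std::size_t gminwordsize = 3) {
    // Split the preprocessed (tag-free) page text into words.
    std::vector<std::string> words;
    std::istringstream in(text);
    for (std::string w; in >> w; ) words.push_back(w);

    std::vector<std::string> units;
    // Advance so that consecutive units overlap by exactly one word.
    for (std::size_t i = 0; i + gminwordsize <= words.size(); i += gminwordsize - 1) {
        std::string unit = words[i];
        for (std::size_t j = 1; j < gminwordsize; ++j)
            unit += " " + words[i + j];
        units.push_back(unit);
    }
    return units;
}

int main() {
    for (const std::string& u : overlappingUnits("A B C D E"))
        std::cout << u << "\n";  // prints "A B C" then "C D E"
}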
3. Hashing Units

Once we have divided an HTML page into units, we must store them for comparison with other pages. To save storage space, and without loss of similarity information, we hash each unit into a four-byte integer value. For each page, we assign a unique integer URL_ID tag. Then, for each unit in the page, we generate a hash value and insert the <URL_ID, UNIT_HASH_VALUE> tuple into a relational table. For this storage, we have chosen IBM's DB2 Universal Database system; a relational database gives us a simple mechanism for persistent storage that is easily accessible by both phases of the FSC algorithm.

In general, hashing functions aim to avoid excessive collisions. In the case of FSC, collision avoidance is desirable but not essential. A collision between hash values of two units whose underlying data differ is in essence a false positive; our similarity metric, however, is generally robust against the small percentage of such collisions. Our main goals for the hashing scheme are therefore simplicity and speed. As a result, we chose two hashing functions based on character sums. The first sums all alphanumeric characters within the unit; the second sums all characters in the unit. Both have produced similar, high-quality results in similarity detection. Figure 3.1 summarizes the first phase of the FSC algorithm.

For each page in the Web Repository
    Preprocess the raw HTML page, removing all HTML tags
    Assign the URL a unique ID number and insert the tuple into the LOOKUP table
    Do
        Get the next unit from the processed HTML
        Hash the unit into an integer value
        Insert the <URL_ID, HASH_VALUE> pair into the HASHES table
    While (more units in the page)

Figure 3.1: Phase 1 of FSC

4. Finding Similarity

The primary goal of an algorithm for determining similar pages should be reducing processing time, since any valid algorithm will eventually return the correct results. We also had to take implementation cost into account: given a short development time frame, we wanted a solution that could reuse the database already populated in phase 1 of FSC. This choice has several disadvantages, processing time ironically being one of them. Algorithms such as Min-Hashing and Locality-Sensitive Hashing [5] achieve O(n) processing time; however, because our overall priority was quality unit functions in phase 1, we implemented phase 2 as a single SQL query rather than a more efficient and elegant algorithm.

There are many ways of querying for and detecting similarity between two web pages. Our initial naive approach checked whether each pair of pages had at least a predefined threshold t of hashes in common, which effectively removes pairs sharing only a few units. The problem with this simplistic method becomes evident when the threshold is set too low: too many pages are returned, most of them false positives. The solution is to scale the threshold with respect to the page sizes. Using page unions and intersections lets us normalize the similarity between two pages:

    Similarity ratio = |A ∩ B| / |A ∪ B|

Equation 4.1: Similarity
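As a concrete illustration, the following is a minimal C++ sketch, under simplifying assumptions, of the two character-sum hash functions from Section 3 and the Equation 4.1 ratio computed over two pages' in-memory hash sets; the actual implementation stores the hashes in DB2 and computes the ratio in SQL, as described next.

// Sketch only: the real phase 1 is C++/LEX based and writes to DB2.
#include <cctype>
#include <cstdint>
#include <iostream>
#include <set>
#include <string>

// First hash function: sum of the alphanumeric characters in a unit.
uint32_t hashAlnum(const std::string& unit) {
    uint32_t sum = 0;
    for (unsigned char c : unit)
        if (std::isalnum(c)) sum += c;
    return sum;
}

// Second hash function: sum of all characters in a unit.
uint32_t hashAll(const std::string& unit) {
    uint32_t sum = 0;
    for (unsigned char c : unit) sum += c;
    return sum;
}

// Similarity ratio = |A ∩ B| / |A ∪ B| over two pages' unit-hash sets.
double similarity(const std::set<uint32_t>& a, const std::set<uint32_t>& b) {
    std::size_t common = 0;
    for (uint32_t h : a) common += b.count(h);
    return static_cast<double>(common) / (a.size() + b.size() - common);
}

int main() {
    std::set<uint32_t> a = {hashAlnum("quick brown fox"), hashAlnum("jumps over dog")};
    std::set<uint32_t> b = {hashAlnum("quick brown fox"), hashAlnum("lazy sleeping dog")};
    std::cout << similarity(a, b) << "\n";  // 1 shared hash of 3 distinct, about 0.33
}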
Given our SQL database implementation, the denominator |A ∪ B| can be rewritten:

    Similarity ratio = |A ∩ B| / (|A| + |B| - |A ∩ B|)

Equation 4.2: Denominator now maps to SQL queries

This form allows both the numerator and denominator to be mapped to simple SQL queries: |A| and |B| are easily calculated with SQL's COUNT aggregate function, and |A ∩ B| with a self-join and SQL's GROUP BY clause. For example, if |A| = 10, |B| = 12, and |A ∩ B| = 8, the similarity ratio is 8 / (10 + 12 - 8), or about 57%. Notice also that |A ∩ B| is used twice but only needs to be calculated once. The resulting SQL query is shown in Figure 4.1.

SELECT J.url AS A_URL, K.url AS B_URL,
       DECIMAL(R.CNT / REAL(S.CNT + T.CNT - R.CNT) * 100) AS SIMILAR
FROM (SELECT A.url_id AS A_url, B.url_id AS B_url, COUNT(A.hash) AS CNT
      FROM HASH_TABLE AS A, HASH_TABLE AS B
      WHERE A.url_id < B.url_id AND A.hash = B.hash
      GROUP BY A.url_id, B.url_id
      HAVING COUNT(A.hash) > 1) AS R,
     (SELECT url_id, COUNT(*) AS CNT FROM HASH_TABLE GROUP BY url_id) AS S,
     (SELECT url_id, COUNT(*) AS CNT FROM HASH_TABLE GROUP BY url_id) AS T,
     LOOKUP_TABLE AS J, LOOKUP_TABLE AS K
WHERE R.A_url = S.url_id AND R.B_url = T.url_id
  AND J.url_id = R.A_url AND K.url_id = R.B_url
  AND SUBSTR(J.url, 1, 15) = SUBSTR(K.url, 1, 15)
  AND R.CNT / REAL(S.CNT + T.CNT - R.CNT) * 100 >= MIN_THRESH
  AND R.CNT / REAL(S.CNT + T.CNT - R.CNT) * 100 < MAX_THRESH
ORDER BY R.CNT / REAL(S.CNT + T.CNT - R.CNT) DESC;

Figure 4.1: SQL query

Figure 4.2 shows the Perl script we wrote to run the query of Figure 4.1 against the database and categorize the results into 10% increments from 20% to 100%; for our analysis, anything below 20% similarity is not worth considering and is discarded. The script takes two parameters, the lookup and hash table names. We also distinguish between comparing two web pages from the same domain and from different domains: the predicate SUBSTR(J.url, 1, 15) = SUBSTR(K.url, 1, 15) compares only the first 15 characters of the URLs (roughly the scheme, the "www" prefix, and the first few characters of the domain name), so two distinct domains sharing the same 15-character prefix will unfortunately be grouped together, whether the comparison requires the domains to be the same (=) or different (!=). We have verified that this simple character comparison is effective at separating same-domain and different-domain pairs with very few false positives.
5 unless {die("no arguments - You must specify 2 parameters : <LOOKUP TABLE> <HASH for ($i=30; $i<=110; $i=$i+10) { system("echo 'connect to cs249;\n SELECT count(*) FROM (SELECT A.url_id AS A_url, B.url_id AS B_url, count(a.hash) AS CNT FROM ".@ARGV[1]." AS A, ".@ARGV[1]." AS B WHERE A.url_id < B.url_id AND A.hash = B.hash Group by A.url_id, B.url_id having count(a.hash)>1 ) AS R, (SELECT url_id, count(*) AS CNT FROM ".@ARGV[1]." GROUP BY url_id) AS S, (SELECT url_id, count(*) AS CNT FROM GROUP BY url_id) AS T, " as J, ".@ARGV[0]." as K WHERE R.A_url = S.url_id AND R.B_url = T.url_id AND J.URL_id =R.A_url AND K.URL_id=R.B_url AND substr(j.url,1,15) = substr(k.url,1,15) AND R.CNT / REAL(S.CNT + T.CNT - R.CNT)*100>=".($i-10)." AND R.CNT / REAL(S.CNT + T.CNT - R.CNT)*100<".$i.";\n' >process"); system("db2 -tvf process > samedomain_".@argv[0]."_".@argv[1]."_".($i- 10)."_".$i."res"); } for ($i=30; $i<=110; $i=$i+10) { system("echo 'connect to cs249;\n SELECT count(*) FROM (SELECT A.url_id AS A_url, B.url_id AS B_url, count(a.hash) AS CNT FROM AS A, ".@ARGV[1]." AS B WHERE A.url_id < B.url_id AND A.hash = B.hash Group by A.url_id, B.url_id having count(a.hash)>1 ) AS R, (SELECT url_id, count(*) AS CNT FROM ".@ARGV[1]." GROUP BY url_id) AS S, (SELECT url_id, count(*) AS CNT FROM ".@ARGV[1]." GROUP BY url_id) AS T, ".@ARGV[0]." as J, ".@ARGV[0]." as K WHERE R.A_url = S.url_id AND R.B_url = T.url_id AND J.URL_id =R.A_url AND K.URL_id=R.B_url AND substr(j.url,1,15)!= substr(k.url,1,15) AND R.CNT / REAL(S.CNT + T.CNT - R.CNT)*100>=".($i-10)." AND R.CNT / REAL(S.CNT + T.CNT - R.CNT)*100<".$i.";\n' >process"); system("db2 -tvf process > diffdomain_".@argv[0]."_".@argv[1]."_".($i- 10)."_".$i."res"); } Figure 4.2 Perl Script The following simple example demonstrates the analysis steps taken. Figure 4.3 shows an example snippet view of an overall hash database from phase 1 of FSC. Figure 4.4 shows the calculated numerator of the similarity equation. Next, we find the sizes of pages A and B in Figure 4.5 to calculate the denominator of Equation 4.2, as shown in Figure 4.6. Lastly, we calculate the similarity of two pages by calculating Equation 4.2 from the values shown in Figure 4.4 and Figure 4.6. The final results are shown in Figure 4.7. url Hash Value Figure Hash database url A url B Same hash freq Figure Intersection: How many units are in both pages? ( A B )
Figure 4.5: Intersection plus sizes (columns: url A, url B, size A, size B): determining |A| and |B|

Figure 4.6: Intersection plus union (columns: url A, url B, size A, size B, union): determining |A ∪ B| = |A| + |B| - |A ∩ B|

Figure 4.7: Calculated similarity (columns: url A, url B, same-hash count, union, similarity)

5. Results

Figure 5.1 shows our initial results, obtained using the first unit function and first hash function on a sample of pages from our UCLA Webcat Web Repository. Due to time constraints and issues with DB2 and other processes running on the Webcat server, we were unable to fully analyze our suite of unit and hash functions. However, as noted earlier, we believe the choice of hash function is fairly insignificant and the first unit function is likely the best.

The first thing we notice is that pages within the same domain are very likely to be similar (about 25% of all pages are at least 50% similar). This can be attributed to site templates, which give pages within a website a uniform style, such as a common navigation bar and site banners, and which therefore contribute similar content to every page. Taking this thought further, we note that most site templates probably do not contribute more than half of a page's content; otherwise, there would be more 50%+ similar pages. Notice also that there are very few highly similar pages
(in fact, similarity appears inversely proportional to frequency), which suggests that copied pages are rare but do in fact occur.

The second observation is that pages are less likely to be similar if they are from different domains. This is most likely the flip side of our previous observation: without shared site templates, pages no longer have as much content in common. Because of its rarity, a match in this reduced set of pages carries, in some sense, more weight than a same-domain match. Notice also that eliminating same-domain comparisons makes the result set much smaller, which allows quicker post-processing verification.

Similarity   Same Domain   Different Domain
   20%          3,869           1,023
   30%          2,...            ...
   40%          1,...            ...
   50%          1,...            ...
   60%           ...             ...
   70%           ...             ...
   80%           ...             ...
   90%           ...             ...
  100%            18              2

Figure 5.1: Frequency of page pairs by similarity

Figure 5.2: Frequency versus similarity (%), for same-domain and different-domain pairs
A closer inspection of the actual similar pages revealed more interesting results.

Figure 5.3: Sample of 100% similar web pages from the same domain

Figure 5.3 reflects several characteristics of same-domain pages. First, some pages contain only two units each, causing two near-empty pages to be matched as 100% similar; in future work, we hope to weed out such false positives by discarding pages with too few units. Another characteristic is custom error pages: one group of matched pages turns out to consist entirely of "page not found" pages, a second type of false positive in which the server returns a custom error page. A simple way to eliminate these from our results is to look for keywords indicating that a page no longer exists, such as "not found" or "removed". Lastly, we see dynamic URLs where the page itself is no different despite varying parameters.

Figure 5.4: Sample of 90% similar web pages from the same domain

Figure 5.4 shows a very interesting characteristic: clearly different categories of web pages from Yale University (i.e., medical, living, admissions, search) are considered very similar. The reason is that these pages share a common site template, as predicted earlier in this section; with site templates, much of the content is duplicated. If a search for similar, non-templated pages is required, research in template detection such as [6] could aid in removing the templates as part of the preprocessing stage, before similarity detection. For different-domain pages, the observations are different:
Figure 5.5: Sample of 100% similar web pages from different domains

Figure 5.5 shows the two pairs of pages we detected as 100% similar from different domains. Unfortunately, the pages in both pairs are only 2 units long, meaning they are essentially empty. Had we implemented our filter for removing small pages (such as those only two units long), these false positives would not have appeared.

Figure 5.6: Sample of 80% similar web pages from different domains

Figure 5.6 shows the general pattern of different-domain similarity: most of the matches are robots.txt files, which tell search engine robots which files they may download. Apparently, most of them disallow very similar directories, causing the high similarity. Since robots.txt is not technically a web page, it should be filtered out in future revisions of our algorithm.

6. Conclusion and Future Work

In this paper, we have described the two phases of our algorithm FSC in Sections 2, 3, and 4. We have implemented FSC using C++, LEX, Perl, and IBM DB2, and our results in Section 5 show that it works well in finding similar content between pages at an arbitrary user-defined threshold. Our brief study and analysis of similarity detection on a fairly small sample set has yielded promising results, especially considering that our page sample is minuscule compared to the billions of pages available on the Web.¹ This paper lays the groundwork for some potentially very interesting future study and development. We believe the ease with which similarity can be calculated, combined with its effectiveness in finding copied pages, provides an excellent starting point. The next step in adapting FSC to copy detection is to add facilities for specifying a seed page for which we wish to find copies. Another extension of this work could incorporate page layout, multimedia data, and the link structure between two pages and their in- and out-links.

¹ Google claims to index over 3.3 billion pages.
Our implementation of the FSC algorithm is suboptimal in both phases. The first phase could be parallelized into several separate processes for preprocessing pages, generating units, and hashing units. The second phase currently runs an SQL query over the data generated by phase 1, a clearly time-consuming process for large datasets. Phase 2 could instead be revamped to use Min-Hashing techniques [5] for efficiency on large datasets while still using the same definition of similarity; a sketch of this idea appears at the end of this section.

By analyzing similar pages within the same domain and across different domains, we have observed many useful characteristics. Same-domain results tend to group site-wide template pages together, whereas different-domain results surface actual copied content. Even though we did not find copied web pages across different domains in this sample, we are hopeful that with a larger set of pages our algorithm will detect them. We have also observed many false positives and have proposed several easy-to-implement ways of removing them.
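As a rough illustration of that direction, the following is a minimal C++ sketch of the min-hashing idea: each page's unit hashes are reduced to a short signature of per-function minima, and the fraction of agreeing signature positions estimates the Equation 4.1 ratio without comparing full hash sets. The constants and the simple hash family below are illustrative assumptions, not part of our implementation or of [5].

// Sketch only: a production version would need a properly min-wise
// independent hash family and locality-sensitive bucketing [5].
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
#include <utility>
#include <vector>

struct MinHasher {
    std::vector<std::pair<uint64_t, uint64_t>> coeffs;  // (a, b) per hash function
    explicit MinHasher(int k, uint64_t seed = 42) {
        std::mt19937_64 rng(seed);
        for (int i = 0; i < k; ++i)
            coeffs.push_back({rng() | 1, rng()});  // odd multiplier a, offset b
    }
    // One k-value signature per page, built from its unit hash values.
    std::vector<uint64_t> signature(const std::vector<uint32_t>& units) const {
        std::vector<uint64_t> sig(coeffs.size(), std::numeric_limits<uint64_t>::max());
        for (uint32_t u : units)
            for (std::size_t i = 0; i < coeffs.size(); ++i)
                sig[i] = std::min(sig[i], coeffs[i].first * u + coeffs[i].second);
        return sig;
    }
};

// Estimated similarity = fraction of signature positions that agree.
double estimate(const std::vector<uint64_t>& a, const std::vector<uint64_t>& b) {
    int same = 0;
    for (std::size_t i = 0; i < a.size(); ++i) same += (a[i] == b[i]);
    return static_cast<double>(same) / a.size();
}

int main() {
    MinHasher mh(128);
    std::vector<uint32_t> pageA = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<uint32_t> pageB = {1, 2, 3, 4, 5, 9, 10, 11};
    std::cout << estimate(mh.signature(pageA), mh.signature(pageB)) << "\n";
    // True ratio |A ∩ B| / |A ∪ B| = 5/11, about 0.45; the estimate converges as k grows.
}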
7. References

[1] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine.
[2] Pirated Sites!! Aaarrgghh...
[3] Yahoo!
[4] Sergey Brin, James Davis, and Hector Garcia-Molina. Copy Detection Mechanisms for Digital Documents.
[5] Edith Cohen, Mayur Datar, Shinji Fujiwara, Aristides Gionis, Piotr Indyk, Rajeev Motwani, Jeffrey Ullman, and Cheng Yang. Finding Interesting Associations without Support Pruning.
[6] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites.