Finding Similar Content
Michael Welch and Ethan Chan {mjwelch, ...}

Abstract

In this paper, we present an algorithm for automatically discovering similar pages. Unlike common keyword-based search engines, our similarity comparison is based on the entire textual content of the pages. We are unaware of any existing content-based similarity detection mechanism for the Web that is accessible to the common user. Our algorithm, FSC, exploits temporal locality of content without depending on absolute position within a page, making it difficult to bypass.

1. Introduction

The World Wide Web contains millions of sources of information spread across billions of pages. One of the most difficult and frustrating tasks for a user of the Web is finding pages that contain the information they are interested in. Popular search engines such as Google [1] attempt to answer keyword queries from users and return relevant pages. Often a user may wish to find a mirrored site, where content is 100% similar, that is hosted closer and is therefore faster. Page similarity may also pertain to legal issues: a website administrator may wish to search the Web for sites infringing on his copyrighted page by measuring, say, 70% or higher similarity. Content theft is an ongoing problem [2] that is difficult if not impossible to detect by ordinary Web search methods and content category listings such as those of Yahoo! [3].

We propose a method for finding similar pages based on the union and intersection of content units, through an analysis of page content. Our two-phase algorithm, FSC, analyzes the contents of HTML pages and finds related pages that meet a user-defined threshold percentage of similarity. In the first phase, we extract and hash units from the pages. In the second phase, we calculate the similarity between pages and separate them into related groups. To avoid complex and time-consuming web crawls in our experiments, all HTML pages are read from the UCLA Webcat Web Repository of 70 million pages through a C++ interface.

We present our technique in three parts. Sections 2 and 3 describe the first phase of FSC and its implementation using C++ and LEX: Section 2 covers our unit generation methods, and Section 3 discusses the hashing techniques we have used. Section 4 describes the second phase of FSC and our analysis techniques using Perl and SQL. We present our initial results in Section 5, and in Section 6 we discuss possible extensions and future work.
2. Unit Generation

The first phase of the FSC algorithm processes HTML pages. Because our goal is to find pages that match at a given threshold of similarity, which is based on the union and intersection of corresponding page sections, we must decompose each page into units. We define a unit of a page as a specific sequence of characters within the HTML page that matches the designated unit function. The unit function is a pattern-matching function that finds and returns the next occurrence of a unit string within the page. In our initial analysis, we have implemented and tested three unit functions.

The first unit function selects consecutive sentences until the total length of the string is at least gminwordsize, a parameter specified at runtime and set to 20 for our experiments. The end of a sentence is determined by common punctuation marks (!, ?, .). As noted in [4], this creates potential problems when punctuation is not used to end a sentence, such as in an abbreviation. We will show in our results that this does not substantially affect our similarity detection.

The second unit function selects individual words, separated by space characters or end-of-sentence punctuation, of at least gminwordsize characters; in our analysis using this unit function, gminwordsize is set to 3. This provides the most exact measure of similarity, comparable to a search engine's inverted-indexing approach, by taking word-for-word content from a page. It does not, however, preserve relative word location, so some of the page's semantics can be lost. For this reason, we believe individual word units are not appropriate for similarity detection.

Our third unit function uses overlapping units similar to those described in [4]. It returns gminwordsize (3 in our experiments) consecutive words as a unit, with two consecutive units overlapping by one word. For example, given the sequence of 5 words "A B C D E", this unit function returns the two units "A B C" and "C D E".

All of our unit functions are generally free of phase dependence [4], in which inserting a single additional unit into a page can cause a false negative, because they do not consider absolute location within a page. At the same time, the first and third methods preserve temporal locality, which helps protect against false positives.
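The third (overlapping) unit function is simple to express in code. Below is a minimal C++ sketch of this shingle-style unit generation; the whitespace tokenizer and function name are illustrative stand-ins for our LEX-based implementation, which also handles punctuation.

// Sketch of the overlapping unit function: "A B C D E" yields
// "A B C" and "C D E". Assumes gminwordsize >= 2.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> overlappingUnits(const std::string& text,
                                          std::size_t gminwordsize = 3) {
    // Split the preprocessed (tag-free) page text into words.
    std::vector<std::string> words;
    std::istringstream in(text);
    for (std::string w; in >> w; ) words.push_back(w);

    std::vector<std::string> units;
    // Advance so that consecutive units overlap by exactly one word.
    for (std::size_t i = 0; i + gminwordsize <= words.size(); i += gminwordsize - 1) {
        std::string unit = words[i];
        for (std::size_t j = 1; j < gminwordsize; ++j)
            unit += " " + words[i + j];
        units.push_back(unit);
    }
    return units;
}

int main() {
    for (const std::string& u : overlappingUnits("A B C D E"))
        std::cout << u << "\n";  // prints "A B C" then "C D E"
}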
3. Hashing Units

Once we have divided an HTML page into units, we must store them for comparison with other pages. To save storage space, and without loss of similarity information, we hash each unit into a four-byte integer value. For each page, we assign a unique integer URL_ID tag. Then, for each unit in the page, we generate a hash value and insert the <URL_ID, UNIT_HASH_VALUE> tuple into a relational table. For this storage, we have chosen IBM's DB2 Universal Database system; a relational database gives us a simple mechanism for persistent storage that is easily accessible by both phases of the FSC algorithm.

In general, hashing functions aim to avoid excessive collisions. In the case of FSC, collision avoidance is desirable but not essential. A collision between hash values of two units whose underlying data differ is in essence a false positive; our similarity metric, however, is generally robust against the small percentage of such collisions. Our main goals for the hashing scheme are therefore simplicity and speed. As a result, we chose two hashing functions based on character sums. The first sums all alphanumeric characters within the unit; the second sums all characters in the unit. Both have produced similar, high-quality results in similarity detection. Figure 3.1 summarizes the first phase of the FSC algorithm.

For each page in the Web Repository
    Preprocess the raw HTML page, removing all HTML tags
    Assign the URL a unique ID number and insert the tuple into the LOOKUP table
    Do
        Get the next unit from the processed HTML
        Hash the unit into an integer value
        Insert the <URL_ID, HASH_VALUE> pair into the HASHES table
    While (more units in the page)

Figure 3.1: Phase 1 of FSC

4. Finding Similarity

The primary goal of an algorithm for determining similar pages should be reducing processing time, since any valid algorithm will eventually return the correct results. We also had to take implementation cost into account: given a short development time frame, we wanted a solution that could reuse the database already populated in phase 1 of FSC. This choice has several disadvantages, processing time ironically being one of them. Algorithms such as Min-Hashing and Locality-Sensitive Hashing [5] achieve O(n) processing time; however, because our overall priority was quality unit functions in phase 1, we implemented phase 2 as a single SQL query rather than a more efficient and elegant algorithm.

There are many ways of querying for and detecting similarity between two web pages. Our initial naive approach checked whether each pair of pages had at least a predefined threshold t of hashes in common, which effectively removes pairs sharing only a few units. The problem with this simplistic method becomes evident when the threshold is set too low: too many pages are returned, most of them false positives. The solution is to scale the threshold with respect to the page sizes. Using page unions and intersections lets us normalize the similarity between two pages:

    Similarity ratio = |A ∩ B| / |A ∪ B|

Equation 4.1: Similarity
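As a concrete illustration, the following is a minimal C++ sketch, under simplifying assumptions, of the two character-sum hash functions from Section 3 and the Equation 4.1 ratio computed over two pages' in-memory hash sets; the actual implementation stores the hashes in DB2 and computes the ratio in SQL, as described next.

// Sketch only: the real phase 1 is C++/LEX based and writes to DB2.
#include <cctype>
#include <cstdint>
#include <iostream>
#include <set>
#include <string>

// First hash function: sum of the alphanumeric characters in a unit.
uint32_t hashAlnum(const std::string& unit) {
    uint32_t sum = 0;
    for (unsigned char c : unit)
        if (std::isalnum(c)) sum += c;
    return sum;
}

// Second hash function: sum of all characters in a unit.
uint32_t hashAll(const std::string& unit) {
    uint32_t sum = 0;
    for (unsigned char c : unit) sum += c;
    return sum;
}

// Similarity ratio = |A ∩ B| / |A ∪ B| over two pages' unit-hash sets.
double similarity(const std::set<uint32_t>& a, const std::set<uint32_t>& b) {
    std::size_t common = 0;
    for (uint32_t h : a) common += b.count(h);
    return static_cast<double>(common) / (a.size() + b.size() - common);
}

int main() {
    std::set<uint32_t> a = {hashAlnum("quick brown fox"), hashAlnum("jumps over dog")};
    std::set<uint32_t> b = {hashAlnum("quick brown fox"), hashAlnum("lazy sleeping dog")};
    std::cout << similarity(a, b) << "\n";  // 1 shared hash of 3 distinct, about 0.33
}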
Given our SQL database implementation, the denominator |A ∪ B| can be rewritten:

    Similarity ratio = |A ∩ B| / (|A| + |B| - |A ∩ B|)

Equation 4.2: Denominator now maps to SQL queries

This form allows both the numerator and denominator to be mapped to simple SQL queries: |A| and |B| are easily calculated with SQL's COUNT aggregate function, and |A ∩ B| with a self-join and SQL's GROUP BY clause. For example, if |A| = 10, |B| = 12, and |A ∩ B| = 8, the similarity ratio is 8 / (10 + 12 - 8), or about 57%. Notice also that |A ∩ B| is used twice but only needs to be calculated once. The resulting SQL query is shown in Figure 4.1.

SELECT J.url AS A_URL, K.url AS B_URL,
       DECIMAL(R.CNT / REAL(S.CNT + T.CNT - R.CNT) * 100) AS SIMILAR
FROM (SELECT A.url_id AS A_url, B.url_id AS B_url, COUNT(A.hash) AS CNT
      FROM HASH_TABLE AS A, HASH_TABLE AS B
      WHERE A.url_id < B.url_id AND A.hash = B.hash
      GROUP BY A.url_id, B.url_id
      HAVING COUNT(A.hash) > 1) AS R,
     (SELECT url_id, COUNT(*) AS CNT FROM HASH_TABLE GROUP BY url_id) AS S,
     (SELECT url_id, COUNT(*) AS CNT FROM HASH_TABLE GROUP BY url_id) AS T,
     LOOKUP_TABLE AS J, LOOKUP_TABLE AS K
WHERE R.A_url = S.url_id AND R.B_url = T.url_id
  AND J.url_id = R.A_url AND K.url_id = R.B_url
  AND SUBSTR(J.url, 1, 15) = SUBSTR(K.url, 1, 15)
  AND R.CNT / REAL(S.CNT + T.CNT - R.CNT) * 100 >= MIN_THRESH
  AND R.CNT / REAL(S.CNT + T.CNT - R.CNT) * 100 < MAX_THRESH
ORDER BY R.CNT / REAL(S.CNT + T.CNT - R.CNT) DESC;

Figure 4.1: SQL query

Figure 4.2 shows the Perl script we wrote to run the query of Figure 4.1 against the database and categorize the results into 10% increments from 20% to 100%; for our analysis, anything below 20% similarity is not worth considering and is discarded. The script takes two parameters, the lookup and hash table names. We also distinguish between comparing two web pages from the same domain and from different domains: the predicate SUBSTR(J.url, 1, 15) = SUBSTR(K.url, 1, 15) compares only the first 15 characters of the URLs (roughly the scheme, the "www" prefix, and the first few characters of the domain name), so two distinct domains sharing the same 15-character prefix will unfortunately be grouped together, whether the comparison requires the domains to be the same (=) or different (!=). We have verified that this simple character comparison is effective at separating same-domain and different-domain pairs with very few false positives.
5 unless {die("no arguments - You must specify 2 parameters : <LOOKUP TABLE> <HASH for ($i=30; $i<=110; $i=$i+10) { system("echo 'connect to cs249;\n SELECT count(*) FROM (SELECT A.url_id AS A_url, B.url_id AS B_url, count(a.hash) AS CNT FROM ".@ARGV[1]." AS A, ".@ARGV[1]." AS B WHERE A.url_id < B.url_id AND A.hash = B.hash Group by A.url_id, B.url_id having count(a.hash)>1 ) AS R, (SELECT url_id, count(*) AS CNT FROM ".@ARGV[1]." GROUP BY url_id) AS S, (SELECT url_id, count(*) AS CNT FROM GROUP BY url_id) AS T, " as J, ".@ARGV[0]." as K WHERE R.A_url = S.url_id AND R.B_url = T.url_id AND J.URL_id =R.A_url AND K.URL_id=R.B_url AND substr(j.url,1,15) = substr(k.url,1,15) AND R.CNT / REAL(S.CNT + T.CNT - R.CNT)*100>=".($i-10)." AND R.CNT / REAL(S.CNT + T.CNT - R.CNT)*100<".$i.";\n' >process"); system("db2 -tvf process > samedomain_".@argv[0]."_".@argv[1]."_".($i- 10)."_".$i."res"); } for ($i=30; $i<=110; $i=$i+10) { system("echo 'connect to cs249;\n SELECT count(*) FROM (SELECT A.url_id AS A_url, B.url_id AS B_url, count(a.hash) AS CNT FROM AS A, ".@ARGV[1]." AS B WHERE A.url_id < B.url_id AND A.hash = B.hash Group by A.url_id, B.url_id having count(a.hash)>1 ) AS R, (SELECT url_id, count(*) AS CNT FROM ".@ARGV[1]." GROUP BY url_id) AS S, (SELECT url_id, count(*) AS CNT FROM ".@ARGV[1]." GROUP BY url_id) AS T, ".@ARGV[0]." as J, ".@ARGV[0]." as K WHERE R.A_url = S.url_id AND R.B_url = T.url_id AND J.URL_id =R.A_url AND K.URL_id=R.B_url AND substr(j.url,1,15)!= substr(k.url,1,15) AND R.CNT / REAL(S.CNT + T.CNT - R.CNT)*100>=".($i-10)." AND R.CNT / REAL(S.CNT + T.CNT - R.CNT)*100<".$i.";\n' >process"); system("db2 -tvf process > diffdomain_".@argv[0]."_".@argv[1]."_".($i- 10)."_".$i."res"); } Figure 4.2 Perl Script The following simple example demonstrates the analysis steps taken. Figure 4.3 shows an example snippet view of an overall hash database from phase 1 of FSC. Figure 4.4 shows the calculated numerator of the similarity equation. Next, we find the sizes of pages A and B in Figure 4.5 to calculate the denominator of Equation 4.2, as shown in Figure 4.6. Lastly, we calculate the similarity of two pages by calculating Equation 4.2 from the values shown in Figure 4.4 and Figure 4.6. The final results are shown in Figure 4.7. url Hash Value Figure Hash database url A url B Same hash freq Figure Intersection: How many units are in both pages? ( A B )
Figure 4.5: Intersection plus sizes (columns: url A, url B, size A, size B): determining |A| and |B|

Figure 4.6: Intersection plus union (columns: url A, url B, size A, size B, union): determining |A ∪ B| = |A| + |B| - |A ∩ B|

Figure 4.7: Calculated similarity (columns: url A, url B, same-hash count, union, similarity)

5. Results

Figure 5.1 shows our initial results, obtained using the first unit function and first hash function on a sample of pages from our UCLA Webcat Web Repository. Due to time constraints and issues with DB2 and other processes running on the Webcat server, we were unable to fully analyze our suite of unit and hash functions. However, as noted earlier, we believe the choice of hash function is fairly insignificant and the first unit function is likely the best.

The first thing we notice is that pages within the same domain are very likely to be similar (about 25% of all pages are at least 50% similar). This can be attributed to site templates, which give pages within a website a uniform style, such as a common navigation bar and site banners, and which therefore contribute similar content to every page. Taking this thought further, we note that most site templates probably do not contribute more than half of a page's content; otherwise, there would be more 50%+ similar pages. Notice also that there are very few highly similar pages
(in fact, similarity appears inversely proportional to frequency), which suggests that copied pages are rare but do in fact occur.

The second observation is that pages are less likely to be similar if they are from different domains. This is most likely the flip side of our previous observation: without shared site templates, pages no longer have as much content in common. Because of its rarity, a match in this reduced set of pages carries, in some sense, more weight than a same-domain match. Notice also that eliminating same-domain comparisons makes the result set much smaller, which allows quicker post-processing verification.

Similarity   Same Domain   Different Domain
   20%          3,869           1,023
   30%          2,...            ...
   40%          1,...            ...
   50%          1,...            ...
   60%           ...             ...
   70%           ...             ...
   80%           ...             ...
   90%           ...             ...
  100%            18              2

Figure 5.1: Frequency of page pairs by similarity

Figure 5.2: Frequency versus similarity (%), for same-domain and different-domain pairs
A closer inspection of the actual similar pages revealed more interesting results.

Figure 5.3: Sample of 100% similar web pages from the same domain

Figure 5.3 reflects several characteristics of same-domain pages. First, some pages contain only two units each, causing two near-empty pages to be matched as 100% similar; in future work, we hope to weed out such false positives by discarding pages with too few units. Another characteristic is custom error pages: one group of matched pages turns out to consist entirely of "page not found" pages, a second type of false positive in which the server returns a custom error page. A simple way to eliminate these from our results is to look for keywords indicating that a page no longer exists, such as "not found" or "removed". Lastly, we see dynamic URLs where the page itself is no different despite varying parameters.

Figure 5.4: Sample of 90% similar web pages from the same domain

Figure 5.4 shows a very interesting characteristic: clearly different categories of web pages from Yale University (i.e., medical, living, admissions, search) are considered very similar. The reason is that these pages share a common site template, as predicted earlier in this section; with site templates, much of the content is duplicated. If a search for similar, non-templated pages is required, research in template detection such as [6] could aid in removing the templates as part of the preprocessing stage, before similarity detection. For different-domain pages, the observations are different:
Figure 5.5: Sample of 100% similar web pages from different domains

Figure 5.5 shows the two pairs of pages we detected as 100% similar from different domains. Unfortunately, the pages in both pairs are only 2 units long, meaning they are essentially empty. Had we implemented our filter for removing small pages (such as those only two units long), these false positives would not have appeared.

Figure 5.6: Sample of 80% similar web pages from different domains

Figure 5.6 shows the general pattern of different-domain similarity: most of the matches are robots.txt files, which tell search engine robots which files they may download. Apparently, most of them disallow very similar directories, causing the high similarity. Since robots.txt is not technically a web page, it should be filtered out in future revisions of our algorithm.

6. Conclusion and Future Work

In this paper, we have described the two phases of our algorithm FSC in Sections 2, 3, and 4. We have implemented FSC using C++, LEX, Perl, and IBM DB2, and our results in Section 5 show that it works well in finding similar content between pages at an arbitrary user-defined threshold. Our brief study and analysis of similarity detection on a fairly small sample set has yielded promising results, especially considering that our page sample is minuscule compared to the billions of pages available on the Web.¹ This paper lays the groundwork for some potentially very interesting future study and development. We believe the ease with which similarity can be calculated, combined with its effectiveness in finding copied pages, provides an excellent starting point. The next step in adapting FSC to copy detection is to add facilities for specifying a seed page for which we wish to find copies. Another extension of this work could incorporate page layout, multimedia data, and the link structure between two pages and their in- and out-links.

¹ Google claims to index over 3.3 billion pages.
Our implementation of the FSC algorithm is suboptimal in both phases. The first phase could be parallelized into several separate processes for preprocessing pages, generating units, and hashing units. The second phase currently runs an SQL query over the data generated by phase 1, a clearly time-consuming process for large datasets. Phase 2 could instead be revamped to use Min-Hashing techniques [5] for efficiency on large datasets while still using the same definition of similarity; a sketch of this idea appears at the end of this section.

By analyzing similar pages within the same domain and across different domains, we have observed many useful characteristics. Same-domain results tend to group site-wide template pages together, whereas different-domain results surface actual copied content. Even though we did not find copied web pages across different domains in this sample, we are hopeful that with a larger set of pages our algorithm will detect them. We have also observed many false positives and have proposed several easy-to-implement ways of removing them.
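As a rough illustration of that direction, the following is a minimal C++ sketch of the min-hashing idea: each page's unit hashes are reduced to a short signature of per-function minima, and the fraction of agreeing signature positions estimates the Equation 4.1 ratio without comparing full hash sets. The constants and the simple hash family below are illustrative assumptions, not part of our implementation or of [5].

// Sketch only: a production version would need a properly min-wise
// independent hash family and locality-sensitive bucketing [5].
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
#include <utility>
#include <vector>

struct MinHasher {
    std::vector<std::pair<uint64_t, uint64_t>> coeffs;  // (a, b) per hash function
    explicit MinHasher(int k, uint64_t seed = 42) {
        std::mt19937_64 rng(seed);
        for (int i = 0; i < k; ++i)
            coeffs.push_back({rng() | 1, rng()});  // odd multiplier a, offset b
    }
    // One k-value signature per page, built from its unit hash values.
    std::vector<uint64_t> signature(const std::vector<uint32_t>& units) const {
        std::vector<uint64_t> sig(coeffs.size(), std::numeric_limits<uint64_t>::max());
        for (uint32_t u : units)
            for (std::size_t i = 0; i < coeffs.size(); ++i)
                sig[i] = std::min(sig[i], coeffs[i].first * u + coeffs[i].second);
        return sig;
    }
};

// Estimated similarity = fraction of signature positions that agree.
double estimate(const std::vector<uint64_t>& a, const std::vector<uint64_t>& b) {
    int same = 0;
    for (std::size_t i = 0; i < a.size(); ++i) same += (a[i] == b[i]);
    return static_cast<double>(same) / a.size();
}

int main() {
    MinHasher mh(128);
    std::vector<uint32_t> pageA = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<uint32_t> pageB = {1, 2, 3, 4, 5, 9, 10, 11};
    std::cout << estimate(mh.signature(pageA), mh.signature(pageB)) << "\n";
    // True ratio |A ∩ B| / |A ∪ B| = 5/11, about 0.45; the estimate converges as k grows.
}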
7. References

[1] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine.
[2] Pirated Sites!! Aaarrgghh...
[3] Yahoo!
[4] Sergey Brin, James Davis, and Hector Garcia-Molina. Copy Detection Mechanisms for Digital Documents.
[5] Edith Cohen, Mayur Datar, Shinji Fujiwara, Aristides Gionis, Piotr Indyk, Rajeev Motwani, Jeffrey Ullman, and Cheng Yang. Finding Interesting Associations without Support Pruning.
[6] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites.