EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING
Chapter 3
EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

3.1 INTRODUCTION

Generally, web pages are retrieved with the help of search engines, which deploy crawlers for downloading them. Given a query, the search engine consults its index for possible hits and, based on the match between the query words and the index, displays a large list of links, leading to the problem of information overload. Hence, there is a need to design search engines capable of discovering the practical and valuable information lying in relevant web pages, comparatively few in number, spread across the WWW. In fact, this has been the focus of considerable past research. Prior studies have mostly either addressed issues pertaining to search engine effectiveness for general web searches [110] or commented on trends in e-commerce-related web searching [111]. In this work, an algorithm called the Relevance_Retrieval_Rank algorithm [112] has been designed that employs a data mining technique called a-priori [113, 114] to identify the relevance of web pages in proportion to the various keywords; the ranking of these pages is subsequently carried out.

3.2 A-PRIORI ALGORITHM

A-priori is a breadth-first search algorithm based on the association rule mining technique, which finds associations between the various items of a database. The algorithm, also called a generate-and-test algorithm, iterates over the database in multiple passes. The item sets having support equal to or greater than a minimum support (minsup) are selected for the next pass, and this process continues until an item set with the maximum number of items is generated. Minsup is the primary user-specified tuning parameter.
Consider the following association rule in the transaction set D with support s:

    A => B

where s is the percentage of the transactions in D that contain A ∪ B (i.e. both A and B). The nomenclature used by a-priori for web crawling is as follows:

D: Database, a collection of web pages each involving a set of keywords.
I: Item set, a set of web pages.
T: Transaction, where each T is a set of items such that T ⊆ I.
k-item set: An item set that contains k items.
Ck: The set of candidate k-item sets. It has two fields: support count and item set.
Lk: The set of candidate k-item sets that have passed the minsup threshold value.
Association Rule: An association rule is of the form X => Y, where X = {x1, x2, x3, ..., xn} and Y = {y1, y2, y3, ..., yn} are sets of keywords, with xi and yj being distinct items for all i and j.
Support: The support for X => Y is the percentage of documents that hold all the keywords in X ∪ Y.

The pseudocode for the a-priori algorithm is given in Fig. 3.1. The algorithm considers each item, checks its support and rejects any item with support less than the minimum support; it then extends each surviving item set by one item at a time, again checking support, and so on until the largest item set with support greater than the minimum support is found. At each iteration, the crawler can keep a copy of the items (the web pages, in this case) in a table maintained by it for further use.

With a view to improving the efficacy of search engines, a novel approach called the Relevance_Retrieval_Rank algorithm has been proposed in this work. It makes use of the a-priori algorithm to compute a Relevance_Retrieval_Rank (RRR), as discussed in the next section.
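The generate-and-test procedure described above can be sketched as follows (an illustrative Python implementation, not the thesis code; transactions are modelled as sets of keywords and support is returned as a fraction):

```python
def apriori(transactions, minsup):
    """Return every frequent item set (support >= minsup) with its support.

    transactions: list of sets of keywords, one set per web page.
    minsup: minimum support as a fraction, e.g. 0.4 for 40%.
    """
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions (pages) containing every item of the set.
        return sum(1 for t in transactions if itemset <= t) / n

    # C1 -> L1: single items that meet the minimum support.
    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= minsup}

    result = {s: support(s) for s in frequent}
    k = 1
    while frequent:
        # Generate C(k+1) by joining members of L(k), then keep only the
        # candidates whose support clears minsup (these form L(k+1)).
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        frequent = {c for c in candidates if support(c) >= minsup}
        result.update({s: support(s) for s in frequent})
        k += 1
    return result
```

The join step here unions any two frequent k-item sets whose union has k+1 items; the support filter then prunes the result, which matches the generate-and-test behaviour of Fig. 3.1 even though a production implementation would join more selectively.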
Pseudo code: A-Priori Algorithm

Frequent item set(s)
{
Step 1: For each item, check whether it is a frequent item set
        (it is frequent if its support > minsup); place it in L1.
Step 2: Set k = 1.
Step 3: Repeat   // iterative identification of frequent item sets
            For each frequent item set Ik with k items in Lk,
                generate all the candidate item sets Ik+1 with k+1 items,
                formed by joining item sets from Lk.
            Scan all transactions once and check whether the generated
            (k+1)-item sets are frequent.
Step 4:     Set k = k + 1.
        Until no new frequent item sets are identified.
}

Fig. 3.1 A-Priori Algorithm

3.3 RELEVANCE_RETRIEVAL_RANK (RRR): THE PROPOSED WORK

Generally, a crawler maintains a database of web pages at the search engine's side. On this database, page rank functions are applied by the search engine, and the user is presented with the pages according to their page rank values matched against the submitted query. This approach retrieves information simply on the basis of the popularity of a web page, without giving importance to its relevance. In this work, the a-priori algorithm coupled with a page rank function is employed, which not only considers the frequency of the keywords within the documents but also takes account of the associations of various keywords within the documents, so as to arrive at a
set of most relevant results. Thus, in order to retrieve relevant information, the proposed algorithm uses a data mining technique as well as the PageRank mechanism. The keywords and the page rank obtained thereafter are stored in a database called the Search Engine Database, the format of which is given in Fig. 3.2.

    URLs | Keywords | PageRank

Fig. 3.2 Search Engine Database

The various steps followed by the crawler towards the computation of the page rank are given below:

Step 1. Initialize the Search Engine Database (SED) table.
Step 2. Download a web page and store its URL in this table.
Step 3. Analyze the content of the web page, determine the relevant keywords from the web page in question and make their entry into the table.
Step 4. Compute the page rank of the web page and store it against its corresponding URL.

It may be noted that steps 1 to 4 are performed by the crawler with the objective of creating the Search Engine Database, which is required for computing the intermediate terms such as the candidate sets (Ck) and the candidate sets with support above minsup (Lk).

Step 5. Apply the a-priori algorithm on the SED to compute the support for the individual keywords and store them in a table called the Table of Support (Table 3.2). The data from the Table of Support is filtered on the basis of a user-supplied threshold value, and the resultant item sets are stored in a table called the Table of Filtered Item Sets (Table 3.3).
Step 6. Calculate the support for the combinations of keywords, repeat step 5, and store the intermediate results obtained thereby in the intermediate set of tables, i.e. the Table of Support and the Table of Filtered Item Sets. Step 5 is repeated until all combinations of keywords whose support is greater than minsup are exhausted. It may be noted that, at the end, the Table of Filtered Item Sets contains those combinations of keywords whose support is greater than minsup.

Step 7. Look up the URLs in the SED pertaining to the short-listed set of keywords present in the Table of Filtered Item Sets. Store both the URL and the set of keywords in a new table named the Table of Relevant Pages. It may further be noted that this entry in the Table of Relevant Pages points to the most relevant document present in the SED for the given query. However, other relevant pages of lesser magnitude may also be present, which need to be mined; the next steps have been proposed for the same.

Step 8. Identify the rest of the relevant pages by considering the other keywords whose support is above the threshold value in combination with the keywords identified for the most relevant document.

Step 9. Append the individual keywords to the keywords identified for the most relevant document in decreasing order of their support. Identify the corresponding URLs from the SED and, if found, make their entry into the Table of Relevant Pages.

Step 10. Repeat step 9 in reverse order for populating the Table of Relevant Pages, i.e. starting from the (N-1)th iteration down to the 1st, until all the URLs from the SED are identified.
Step 11. Assign a new page rank, termed the magnitude, to each URL of the Table of Relevant Pages.

Step 12. Compute the new RRR by attributing weightage in the ratio of 60:40 to the proposed mechanism and the existing PageRank mechanism of Google, respectively.

Let steps 7 to 10 be known as the Table_fill process. The diagrammatic representation of the above steps is shown in Fig. 3.3.

[Fig. 3.3: the crawler downloads pages from the WWW and stores them in a web page repository; relevant keywords are analyzed and stored in an appropriate data structure; the a-priori algorithm is applied and the page rank computed; the results are filled into the Table of Relevant Pages using the Table_fill process; finally the Relevance_Retrieval_Rank (RRR) is computed and used by the search engine to answer user queries.]

Fig. 3.3 Steps followed in the computation of RRR
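The rows of the Search Engine Database built in steps 1-4 can be modelled minimally as follows (an illustrative sketch; the field names and the example URL are assumptions, not taken from the thesis):

```python
from dataclasses import dataclass, field

@dataclass
class SEDEntry:
    """One row of the Search Engine Database (Fig. 3.2): a crawled URL,
    its extracted keywords and the page rank computed for it."""
    url: str
    keywords: set = field(default_factory=set)
    page_rank: float = 0.0

# The SED table, filled by the crawler during steps 1-4.
sed = []
sed.append(SEDEntry("http://example.com", {"Human", "Computer"}, 3))
print(sed[0].url)  # http://example.com
```

Steps 5-12 then operate only on this table: the a-priori passes read the `keywords` column, while step 12 combines the resulting magnitude with the stored `page_rank`.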
This work assumes that 60 percent of the weightage is given to the Relevance_Retrieval_Rank (RRR) and 40 percent to Google's PageRank. This is because the RRR computation searches for those web pages whose relevance is much closer to the keywords given by the user, whereas Google's PageRank gives importance only to the popularity of the web pages.

Note: A-priori brings together only those web pages whose relevance crosses the minimum support of relevance supplied by the user.

The step-by-step execution of the RRR algorithm is illustrated through the example given below.

3.4 EXAMPLE

Consider the Search Engine Database (SED) given in Table 3.1, formed according to steps 1-4 of the RRR algorithm.

Table 3.1 Search Engine Database

URLs   Keywords                                       Page Rank (assumed)
W1     Human, Computer, Intelligent Machines          3
W2     Human, Intelligent Machines                    4
W3     Computer, Intelligent Machines                 5
W4     Computer, Robot, Human                         1
W5     Human, Computer, Robot, Resource               7
W6     Human, Robot, Computer, Management             6
W7     Human, Computer, Resource                      8
W8     Robot, Intelligent Machines, Human             9
W9     Computer, Robot, Intelligent Machines          10
W10    Human, Computer, Robot, Intelligent Machines   2
This SED is used for the computation of the candidate sets (Ck) and the candidate sets with support greater than minsup (Lk). Suppose the keywords entered by the user in the query are Human, Computer and Robot, with the threshold for support being 40%, indicating that an item set must possess a minimum support of 40% to qualify for the next iteration.

1st iteration: Apply step 5 of the RRR algorithm on the SED. Identify the individual keywords from the SED and calculate their support. Store the candidate item sets Ck (for k=1) obtained thereafter in a table called the Table of Support, given in Table 3.2.

Table 3.2: Table of Support (Ck where k=1)

Keywords                Support
Human                   8/10 = 80%
Computer                8/10 = 80%
Intelligent Machines    6/10 = 60%
Robot                   6/10 = 60%
Resource                2/10 = 20%
Management              1/10 = 10%

Since the minimum qualifying threshold, i.e. minsup, is 40%, it can be observed from Table 3.2 that the keywords Resource and Management have individual support of less than 40% and are thereby removed from the table. The resultant frequent item sets Lk (k=1) are then stored in a table called the Table of Filtered Item Sets, as shown in Table 3.3.

Table 3.3: Table of Filtered Item Sets (Lk where k=1)

Keywords                Support
Human                   80%
Computer                80%
Intelligent Machines    60%
Robot                   60%
2nd iteration: Repeat step 5 of the RRR algorithm on Table 3.3 to produce Table 3.4. This process repeats itself until all the combinations of keywords for k=2 are exhausted.

Table 3.4: Table of Support, Table of Filtered Item Sets (Ck, Lk where k=2)

Keywords                            Support
Human, Computer                     60%
Human, Intelligent Machines         40%
Human, Robot                        60%
Computer, Intelligent Machines      40%
Computer, Robot                     50%

3rd iteration: Increment k by 1, i.e. set k=3, and again repeat step 5 of the RRR algorithm to produce Tables 3.5 and 3.6.

Table 3.5: Table of Support (Ck where k=3)

Keywords                                 Support
Human, Computer, Intelligent Machines    20%
Human, Computer, Robot                   40%

Table 3.6: Table of Filtered Item Sets (Lk where k=3)

Keywords                                 Support
Human, Computer, Robot                   40%

It can be observed from Table 3.6 that no more keywords are left to be combined. Hence the a-priori process, i.e. steps 1-6 of the RRR algorithm, stops here. By applying the process outlined in step 7 onwards, Table 3.1 is appropriately modified into the Table of Relevant Pages. The resultant tables generated are as below:
Table 3.7(a): Table of Relevant Pages (after 1st iteration)

URLs   Keywords
W4     Computer, Robot, Human

Table 3.7(b): Table of Relevant Pages (after 2nd iteration)

URLs   Keywords
W4     Computer, Robot, Human
W10    Human, Computer, Robot, Intelligent Machines
W5     Human, Computer, Robot, Resource
W6     Human, Robot, Computer, Management

Table 3.7(c): Table of Relevant Pages (after 3rd iteration)

URLs   Keywords
W4     Computer, Robot, Human
W10    Human, Computer, Robot, Intelligent Machines
W5     Human, Computer, Robot, Resource
W6     Human, Robot, Computer, Management
W1     Human, Computer, Intelligent Machines
W7     Human, Computer, Resource
W8     Robot, Intelligent Machines, Human
W9     Computer, Robot, Intelligent Machines
W2     Human, Intelligent Machines
W3     Computer, Intelligent Machines

Table 3.7(d): Table of Relevant Pages with their Magnitude (step 11)

URLs   Keywords                                       Magnitude from proposed algorithm
W4     Computer, Robot, Human                         1
W10    Human, Computer, Robot, Intelligent Machines   2
W5     Human, Computer, Robot, Resource               3
W6     Human, Robot, Computer, Management             4
W1     Human, Computer, Intelligent Machines          5
W7     Human, Computer, Resource                      6
W8     Robot, Intelligent Machines, Human             7
W9     Computer, Robot, Intelligent Machines          8
W2     Human, Intelligent Machines                    9
W3     Computer, Intelligent Machines                 10

The magnitude entered in the above table is the new page rank obtained for the URLs of the SED, with the first URL being the most relevant, having the highest magnitude of 1, the second URL being the second most relevant, having a magnitude of 2, and so forth. Google's PageRank for each URL from the SED is then merged with Table 3.7(d) to form Table 3.8.

Table 3.8: Table of Relevant Pages

URLs   Keywords                                       Magnitude from proposed algorithm   Google PageRank
W4     Computer, Robot, Human                         1                                   1
W10    Human, Computer, Robot, Intelligent Machines   2                                   2
W5     Human, Computer, Robot, Resource               3                                   7
W6     Human, Robot, Computer, Management             4                                   6
W1     Human, Computer, Intelligent Machines          5                                   3
W7     Human, Computer, Resource                      6                                   8
W8     Robot, Intelligent Machines, Human             7                                   9
W9     Computer, Robot, Intelligent Machines          8                                   10
W2     Human, Intelligent Machines                    9                                   4
W3     Computer, Intelligent Machines                 10                                  5

According to step 12 of the algorithm, as discussed in section 3.3, the final rank (RRR) is calculated on the basis of the following rule:

    Magnitude from RRR_Algorithm : Google PageRank = 60 : 40    ... (3.1)

Hence the RRR for each page is computed in the following manner:

    RRR = (60 * magnitude)/100 + (40 * page rank)/100           ... (3.2)

After applying the formula described in equation 3.2 to Google's PageRank and the magnitudes of the URLs obtained from the RRR algorithm in Table 3.8, the new ranks of the URLs, termed the Relevance_Retrieval_Rank (RRR), are computed and listed in Table 3.9.

Table 3.9: Relevance_Retrieval_Rank Table

URLs   Keywords                                       RRR Calculation            RRR
W4     Computer, Robot, Human                         (60*1)/100 + (40*1)/100    1
W10    Human, Computer, Robot, Intelligent Machines   (60*2)/100 + (40*2)/100    2
W1     Human, Computer, Intelligent Machines          (60*3)/100 + (40*5)/100    3.8
W6     Human, Robot, Computer, Management             (60*6)/100 + (40*4)/100    5.2
W5     Human, Computer, Robot, Resource               (60*7)/100 + (40*3)/100    5.4
W2     Human, Intelligent Machines                    (60*4)/100 + (40*9)/100    6
W3     Computer, Intelligent Machines                 (60*5)/100 + (40*10)/100   7
W7     Human, Computer, Resource                      (60*8)/100 + (40*6)/100    7.2
W8     Robot, Intelligent Machines, Human             (60*9)/100 + (40*7)/100    8.2
W9     Computer, Robot, Intelligent Machines          (60*10)/100 + (40*8)/100   9.2

The next section compares the proposed work with Google's PageRank mechanism.
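Equation 3.2 is a simple weighted combination of the two ranks; the helper below (an illustrative sketch, with the 60:40 split exposed as parameters) makes the arithmetic explicit:

```python
def rrr(magnitude, page_rank, w_mag=60, w_pr=40):
    """Weighted combination of the proposed magnitude and Google's
    PageRank, following equation 3.2 (lower score = more relevant)."""
    return (w_mag * magnitude + w_pr * page_rank) / 100

# W4 (magnitude 1, PageRank 1) keeps the top position:
print(rrr(1, 1))   # 1.0
# W10 (magnitude 2, PageRank 2):
print(rrr(2, 2))   # 2.0
```

Since a magnitude of 1 marks the most relevant page, the combined scores are sorted in ascending order to obtain the final positions compared in the next section.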
3.5 COMPARISON OF PROPOSED VS. EXISTING MECHANISM

The Relevance_Retrieval_Rank listed in Table 3.9 was compared with Google's PageRank (see Table 3.10), and it was observed that the URLs W4, W10 and W1 have retained their positions. However, the URLs W5 and W6 have moved up, from position 7 to position 5 and from position 6 to position 4 respectively. W2, which obtained a higher position earlier but was not very relevant, has moved down to position 6 in the RRR table; the same is true for W3, while W7, W8 and W9 have retained their positions in both tables. Table 3.10 clearly indicates the list of URLs that are closer to the keywords in the given query and hence more relevant.

Table 3.10: Comparison of Google's PageRank and RRR

URLs   Keywords                                       RRR   Google PageRank
W4     Computer, Robot, Human                         1     1
W10    Human, Computer, Robot, Intelligent Machines   2     2
W1     Human, Computer, Intelligent Machines          3     3
W6     Human, Robot, Computer, Management             4     6
W5     Human, Computer, Robot, Resource               5     7
W2     Human, Intelligent Machines                    6     4
W3     Computer, Intelligent Machines                 7     5
W7     Human, Computer, Resource                      8     8
W8     Robot, Intelligent Machines, Human             9     9
W9     Computer, Robot, Intelligent Machines          10    10

As seen from the above results, the major benefit of the proposed mechanism is that it considers both the popularity and the relevancy of a web page according to the keywords supplied by the user. This brings up those web pages in the result list that were more relevant but were lying below their deserved positions owing to a lack of popularity (i.e. having fewer backward and forward links), thereby illustrating how the Relevance_Retrieval_Rank algorithm improves the relevancy of URLs against the keywords in question and offers improved search results to the user. Thus, the superiority of the mechanism has been established.

Though RRR has improved the relevancy of the web pages, given the vast and increasing amount of information available on the World Wide Web, relevancy of search
results alone is not sufficient. In fact, a short and quick response time is another important factor that governs the overall performance of any search mechanism. The response time is the time that a generic system or a functional unit takes to react to a given input. People use the web to access information from remote sites but do not like to wait long for their results. A well-known report [109] indicated in its press release that if a web page does not load within 8 seconds, the user tends to go elsewhere for his information needs. Thus web latency is an issue that can impact a large number of users. The following related issues have been identified as affecting the user-perceived latency during the surfing of the web:

- The delay could come from the server side if web servers take longer to process a request, especially when they are overloaded or have slow computational ability.
- Web clients can also add delay if they are not able to quickly parse the retrieved data and display it for the users.
- The retrieval time of web documents also depends on network latency. Much of the network latency comes from the propagation delay, which is basically determined by the total distance traversed and is difficult to reduce beyond a particular point.

The next chapter addresses the above issues of reducing the user-perceived latency, provides the user with relevant information according to the keywords supplied, and comes up with a robust framework for the same.
A GEOGRAPHICAL LOCATION INFLUENCED PAGE RANKING TECHNIQUE FOR INFORMATION RETRIEVAL IN SEARCH ENGINE Sanjib Kumar Sahu 1, Vinod Kumar J. 2, D. P. Mahapatra 3 and R. C. Balabantaray 4 1 Department of Computer
More informationMinghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University
Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang Microsoft Research, Asia School of EECS, Peking University Ordering Policies for Web Crawling Ordering policy To prioritize the URLs in a crawling queue
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationLECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS
Department of Computer Science University of Babylon LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS By Faculty of Science for Women( SCIW), University of Babylon, Iraq Samaher@uobabylon.edu.iq
More informationWeb-Page Indexing Based on the Prioritized Ontology Terms
Web-Page Indexing Based on the Prioritized Ontology Terms Sukanta Sinha 1,2, Rana Dattagupta 2, and Debajyoti Mukhopadhyay 1,3 1 WIDiCoReL Research Lab, Green Tower, C-9/1, Golf Green, Kolkata 700095,
More informationAircraft Tracking Based on KLT Feature Tracker and Image Modeling
Aircraft Tracking Based on KLT Feature Tracker and Image Modeling Khawar Ali, Shoab A. Khan, and Usman Akram Computer Engineering Department, College of Electrical & Mechanical Engineering, National University
More informationMining N-most Interesting Itemsets. Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang. fadafu,
Mining N-most Interesting Itemsets Ada Wai-chee Fu Renfrew Wang-wai Kwong Jian Tang Department of Computer Science and Engineering The Chinese University of Hong Kong, Hong Kong fadafu, wwkwongg@cse.cuhk.edu.hk
More informationInformation Retrieval Spring Web retrieval
Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic
More informationEmpowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia
Empowering People with Knowledge the Next Frontier for Web Search Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Important Trends for Web Search Organizing all information Addressing user
More informationUniversity of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015
University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 2:00pm-3:30pm, Tuesday, December 15th Name: ComputingID: This is a closed book and closed notes exam. No electronic
More informationDeep Web Crawling and Mining for Building Advanced Search Application
Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech
More informationComputer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm
Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm Name: Student Id Number: 1. This is a closed book exam. 2. Please answer all questions. 3. There are a total of 40 questions.
More informationInternational Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine
International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains
More informationPath Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff Dr Ahmed Rafea
Path Analysis References: Ch.10, Data Mining Techniques By M.Berry, andg.linoff http://www9.org/w9cdrom/68/68.html Dr Ahmed Rafea Outline Introduction Link Analysis Path Analysis Using Markov Chains Applications
More informationRanking Techniques in Search Engines
Ranking Techniques in Search Engines Rajat Chaudhari M.Tech Scholar Manav Rachna International University, Faridabad Charu Pujara Assistant professor, Dept. of Computer Science Manav Rachna International
More informationdeseo: Combating Search-Result Poisoning Yu USF
deseo: Combating Search-Result Poisoning Yu Jin @MSCS USF Your Google is not SAFE! SEO Poisoning - A new way to spread malware! Why choose SE? 22.4% of Google searches in the top 100 results > 50% for
More informationSearching the Web What is this Page Known for? Luis De Alba
Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse
More informationImplementation of Enhanced Web Crawler for Deep-Web Interfaces
Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,
More informationFinding the boundaries of attributes domains of quantitative association rules using abstraction- A Dynamic Approach
7th WSEAS International Conference on APPLIED COMPUTER SCIENCE, Venice, Italy, November 21-23, 2007 52 Finding the boundaries of attributes domains of quantitative association rules using abstraction-
More informationCategorizing Migrations
What to Migrate? Categorizing Migrations A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are
More informationEFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS
EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS A Project Report Presented to The faculty of the Department of Computer Science San Jose State University In Partial Fulfillment of the Requirements
More informationFrequency-based NCQ-aware disk cache algorithm
LETTER IEICE Electronics Express, Vol.11, No.11, 1 7 Frequency-based NCQ-aware disk cache algorithm Young-Jin Kim a) Ajou University, 206, World cup-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do 443-749, Republic
More informationCRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA
CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA An Implementation Amit Chawla 11/M.Tech/01, CSE Department Sat Priya Group of Institutions, Rohtak (Haryana), INDIA anshmahi@gmail.com
More informationOnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for.
1 OnCrawl Metrics What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for. UNLEASH YOUR SEO POTENTIAL Table of content 01 Crawl Analysis 02 Logs Monitoring
More informationA System of Image Matching and 3D Reconstruction
A System of Image Matching and 3D Reconstruction CS231A Project Report 1. Introduction Xianfeng Rui Given thousands of unordered images of photos with a variety of scenes in your gallery, you will find
More informationContent Based Smart Crawler For Efficiently Harvesting Deep Web Interface
Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Prof. T.P.Aher(ME), Ms.Rupal R.Boob, Ms.Saburi V.Dhole, Ms.Dipika B.Avhad, Ms.Suvarna S.Burkul 1 Assistant Professor, Computer
More informationPatternRank: A Software-Pattern Search System Based on Mutual Reference Importance
PatternRank: A Software-Pattern Search System Based on Mutual Reference Importance Atsuto Kubo, Hiroyuki Nakayama, Hironori Washizaki, Yoshiaki Fukazawa Waseda University Department of Computer Science
More informationPagerank Computation and Keyword Search on Distributed Systems and P2P Networks
Journal of Grid Computing 1: 291 307, 2003. 2004 Kluwer Academic Publishers. Printed in the Netherlands. 291 Pagerank Computation and Keyword Search on Distributed Systems and P2P Networks Karthikeyan
More informationPurna Prasad Mutyala et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011,
Weighted Association Rule Mining Without Pre-assigned Weights PURNA PRASAD MUTYALA, KUMAR VASANTHA Department of CSE, Avanthi Institute of Engg & Tech, Tamaram, Visakhapatnam, A.P., India. Abstract Association
More informationAnalytical survey of Web Page Rank Algorithm
Analytical survey of Web Page Rank Algorithm Mrs.M.Usha 1, Dr.N.Nagadeepa 2 Research Scholar, Bharathiyar University,Coimbatore 1 Associate Professor, Jairams Arts and Science College, Karur 2 ABSTRACT
More informationFinding Hubs and authorities using Information scent to improve the Information Retrieval precision
Finding Hubs and authorities using Information scent to improve the Information Retrieval precision Suruchi Chawla 1, Dr Punam Bedi 2 1 Department of Computer Science, University of Delhi, Delhi, INDIA
More informationI. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80].
Focus: Accustom To Crawl Web-Based Forums M.Nikhil 1, Mrs. A.Phani Sheetal 2 1 Student, Department of Computer Science, GITAM University, Hyderabad. 2 Assistant Professor, Department of Computer Science,
More information5. search engine marketing
5. search engine marketing What s inside: A look at the industry known as search and the different types of search results: organic results and paid results. We lay the foundation with key terms and concepts
More informationElection Analysis and Prediction Using Big Data Analytics
Election Analysis and Prediction Using Big Data Analytics Omkar Sawant, Chintaman Taral, Roopak Garbhe Students, Department Of Information Technology Vidyalankar Institute of Technology, Mumbai, India
More informationDeep Web Content Mining
Deep Web Content Mining Shohreh Ajoudanian, and Mohammad Davarpanah Jazi Abstract The rapid expansion of the web is causing the constant growth of information, leading to several problems such as increased
More informationVISUAL RERANKING USING MULTIPLE SEARCH ENGINES
VISUAL RERANKING USING MULTIPLE SEARCH ENGINES By Dennis Lim Thye Loon A REPORT SUBMITTED TO Universiti Tunku Abdul Rahman in partial fulfillment of the requirements for the degree of Faculty of Information
More informationCHAPTER 5 CLUSTERING USING MUST LINK AND CANNOT LINK ALGORITHM
82 CHAPTER 5 CLUSTERING USING MUST LINK AND CANNOT LINK ALGORITHM 5.1 INTRODUCTION In this phase, the prime attribute that is taken into consideration is the high dimensionality of the document space.
More informationWeb Search Engines: Solutions to Final Exam, Part I December 13, 2004
Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to
More informationChapter 4: Association analysis:
Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily
More informationSearching the Web [Arasu 01]
Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web
More informationQUERY RECOMMENDATION SYSTEM USING USERS QUERYING BEHAVIOR
International Journal of Emerging Technology and Innovative Engineering QUERY RECOMMENDATION SYSTEM USING USERS QUERYING BEHAVIOR V.Megha Dept of Computer science and Engineering College Of Engineering
More informationSQL/MX UPDATE STATISTICS Enhancements
SQL/MX UPDATE STATISTICS Enhancements Introduction... 2 UPDATE STATISTICS Background... 2 Tests Performed... 2 Test Results... 3 For more information... 7 Introduction HP NonStop SQL/MX Release 2.1.1 includes
More informationSmartcrawler: A Two-stage Crawler Novel Approach for Web Crawling
Smartcrawler: A Two-stage Crawler Novel Approach for Web Crawling Harsha Tiwary, Prof. Nita Dimble Dept. of Computer Engineering, Flora Institute of Technology Pune, India ABSTRACT: On the web, the non-indexed
More informationWorld Wide Web has specific challenges and opportunities
6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has
More informationA FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET
A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET Shervin Daneshpajouh, Mojtaba Mohammadi Nasiri¹ Computer Engineering Department, Sharif University of Technology, Tehran, Iran daneshpajouh@ce.sharif.edu,
More informationChapter 28. Outline. Definitions of Data Mining. Data Mining Concepts
Chapter 28 Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms
More informationANALYTICS DATA To Make Better Content Marketing Decisions
HOW TO APPLY ANALYTICS DATA To Make Better Content Marketing Decisions AS A CONTENT MARKETER you should be well-versed in analytics, no matter what your specific roles and responsibilities are in working
More informationA Novel Architecture of Ontology based Semantic Search Engine
International Journal of Science and Technology Volume 1 No. 12, December, 2012 A Novel Architecture of Ontology based Semantic Search Engine Paras Nath Gupta 1, Pawan Singh 2, Pankaj P Singh 3, Punit
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More information