REDUNDANCY REMOVAL IN WEB SEARCH RESULTS USING RECURSIVE DUPLICATION CHECK ALGORITHM

Dr. S. RAVICHANDRAN 1, E. ELAKKIYA 2
1 Head, Dept. of Computer Science, H. H. The Rajah's College, Pudukkottai, Tamil Nadu, India
2 Research Scholar, PG & Research Department of Computer Science, J.J. College of Arts and Science, Pudukkottai, Tamil Nadu, India

ABSTRACT
Users of the voluminous data on the World Wide Web find it difficult to locate resources that are high in quality, relevant to their information needs, and free of redundancy, and this difficulty is not properly addressed. Researchers face enormous challenges in providing relevant and related documents that match the user query. Data mining now plays a pivotal role in searching information on the web, which spans a wide variety of data types. The primary focus of this paper is a novel algorithm, the Recursive Duplication Check Algorithm, which makes the search results redundancy-free and improves the relevance of the retrieved documents.

Keywords: Ranking Algorithm, Data mining, Web mining, Redundancy removal

Introduction
The volume of information on the internet surges every day with millions of new additions to the web, and this presents a colossal challenge for site owners: providing proper and relevant information to internet users, popularly called netizens. In addition, the number of duplicate documents on the web grows perpetually, which increases retrieval time and reduces the precision of the retrieved documents. Redundancy is generally classified into two categories. The first is exact replication, where two web pages contain the same content. The second is near-duplication: web pages that are very similar, i.e. whose similarity exceeds a threshold value. This paper focuses on the detection and removal of both duplicate and near-duplicate web pages from the dataset.
Background Research
Many researchers are working on information retrieval, especially in the domain of duplicate detection for web documents. Detecting duplicate and near-duplicate documents is an active subject, as it helps the user get the documents related to a given query. Min-yan Wang et al. [1] proposed a web page de-duplication method in which information including the original websites and web titles is extracted to eliminate duplicated web pages based on feature codes, with the help of URL hashing. Charikar [2] suggested a method based on random projection of words in documents to detect near-duplicate documents, improving overall precision and recall. G. Poonkuzhali et al. [3] proposed a mathematical approach based on signed and rectangular representations to detect and remove redundant web documents. Li Zhiyi et al. [4] summarize the state of duplicated web page detection technology in China. Gaudence U. et al. [5] proposed an algorithm for near-duplicate document detection that uses a word-positional-based approach for document selection. Junxiu An et al. [6] proposed an algorithm to detect duplicate web pages based on edit distance.

17 admin@ijarcsa.org
Existence of the Correlation Approach and the Signed Approach
In the correlation approach, the common words between two documents are extracted and the term frequency of each common word is found. A correlation coefficient is then computed between the two documents. If the correlation value is 1, the documents are exactly redundant, and the second document is removed from the original document set. This process is repeated for the remaining documents.

In the signed approach, the set of documents is split into white-space-separated words and indexed in a two-dimensional format (i, j), where i represents the web page and j the word. Thus the first word of the first web page is indexed as (1, 1), the second word of the first page as (1, 2), and so on. The domain dictionary is arranged so that all 1-letter words are indexed first, followed by 2-letter words, then 3-letter words, and so on up to 15-letter words, a reasonable upper bound on the number of characters in a word. Each page is mined individually to detect relevant and irrelevant documents using the signed approach. The current dataset is then processed further to remove nearly duplicate documents based on term weightage and term frequency. Finally, the documents are ordered and listed according to the given user query using a Genetic Algorithm with a rank-based objective function.

Algorithm
Step 1: Load web documents into memory
Step 2: Remove the exact duplicate documents from the dataset
Step 3: Remove the nearly duplicate documents from the dataset
Step 4: Order the dataset according to the user query using a Genetic Algorithm
Step 5: Display the rank-based document list

The main drawback of these approaches is time consumption and execution speed; the proposed approach overcomes this problem, and the execution time it consumes is comparatively less.
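The correlation step described above can be sketched in Python (the paper's implementation is in C#.Net). This is a minimal illustration, not the authors' code: the helper names and the treatment of constant frequency vectors are assumptions.

```python
from collections import Counter
import math

def correlation(doc_a: str, doc_b: str) -> float:
    """Pearson correlation of term frequencies over the words
    common to both documents (hypothetical helper)."""
    tf_a, tf_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    common = sorted(set(tf_a) & set(tf_b))
    if len(common) < 2:
        return 0.0
    x = [tf_a[w] for w in common]
    y = [tf_b[w] for w in common]
    n = len(common)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        # Constant frequency vectors: treat identical vectors as fully correlated
        # (an assumption; the paper does not specify this edge case).
        return 1.0 if x == y else 0.0
    return cov / (sx * sy)

def remove_exact_duplicates(docs: list[str]) -> list[str]:
    """Drop any later document whose correlation with a kept one is 1."""
    kept: list[str] = []
    for d in docs:
        if not any(abs(correlation(k, d) - 1.0) < 1e-9 for k in kept):
            kept.append(d)
    return kept
```

As the paper notes, repeating this pairwise check over the whole result set is what makes the correlation approach slow on larger datasets.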
Proposed Approach
Several approaches to detect duplicate and near-duplicate documents have been proposed, but they still need improvement. Lack of speed in query searching and relevance ranking is an obstacle in information retrieval research, which makes a powerful algorithm necessary to make the search process faster and provide better results. We propose a new approach based on the WWT (Weighted Web Tool) inverted index, which contains unique terms together with a list of the HTML documents in which they occur, along with the frequency and weightage of each term. The inverted index is a common technique in IR (Information Retrieval) because of its excellent efficiency in retrieving relevant documents and its ability to detect duplicate documents. The term weight is calculated based on the significance and importance of the term in identifying a document, that is, the location of the term within the HTML tags of the document. The dataset considered for the experimental setup consists of the top 800 search results returned for the keyword "Aptitude". Redundancy and relevancy are computed for this search result dataset, obtained from the Google search engine. To find the relevancy of the results, the weights are the factors that decide the relevant pages or documents. The weights of the searched terms are calculated according to Table 1.
HTML Tag        Weight
Title           6
Head, H1, H2    5
Anchor          4
Bold            3
Body            1
Table 1: Weightage calculation

Figure 1: Flowchart of the proposed work (search engine result set as dataset -> document extraction and data cleaning -> redundancy check using the proposed Recursive Duplication Check algorithm -> display of the checked result set as output)

Experimental Study
Most search engine results displayed are redundant and irrelevant to the query posed by the user. One major question of this paper is whether there is a difference among the ranking algorithms of search engines; if so, the goal is to analyse and compare the relevant pages displayed by popular search engines such as Google, and to design a novel algorithm that filters out redundant pages and produces a user-friendly result set. This work is carried out with the Google search engine, and the results displayed for the queries given by the user are taken as the dataset to be mined for duplication checking and redundancy elimination. As an example, the query word "Aptitude" is considered in this experiment, and the result displayed in Google is shown in Figure 2.
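The tag weights of Table 1 can be turned into a simple term-scoring routine. The sketch below is an assumption about how the weighting might be applied (the paper does not give code for this step); the function name, the "highest matching tag wins" rule, and the regex-based tag extraction are all simplifications for illustration.

```python
import re

# Tag weights taken from Table 1.
TAG_WEIGHTS = {"title": 6, "head": 5, "h1": 5, "h2": 5, "a": 4, "b": 3, "body": 1}

def term_weight(html: str, term: str) -> int:
    """Weight of `term` = highest Table-1 weight among the tags
    whose enclosed text contains the term (a simplifying assumption)."""
    term = term.lower()
    best = 0
    for tag, w in TAG_WEIGHTS.items():
        # Naive, non-nested tag extraction -- adequate for a sketch,
        # not for parsing real-world HTML.
        for text in re.findall(rf"<{tag}[^>]*>(.*?)</{tag}>", html, re.I | re.S):
            if term in text.lower():
                best = max(best, w)
    return best

page = ("<html><title>Aptitude tests</title>"
        "<body>Practice aptitude questions <b>daily</b></body></html>")
print(term_weight(page, "aptitude"))  # appears in <title> -> 6
print(term_weight(page, "daily"))     # appears in <b> -> 3
```

A term found in the title thus outranks the same term found only in the body, which is the intent behind the weightage column of Table 1.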
Figure 2: Search result of Google for the keyword "Aptitude"

The result displayed in Figure 2 clearly shows redundancy in the search results: the first three results belong to the same website, www.indiabix.com. This is the redundancy discussed in this work, and a new algorithm is proposed to eliminate it. The dataset DS is first collected from the search result and converted into a text file for mining and redundancy elimination. This file is web mined using C#.Net code, and the proposed algorithm reduces or completely eliminates duplication of domain names in the search result.

Observation
The datasets considered for the experimental study range in size from 50 to 800 documents. The proposed algorithm is run on these datasets, and the duplicates found are reported to showcase the effectiveness of the algorithm:

Efficiency = (Actual data - Duplicates found) / Actual data

The redundancy is checked, and if a document is found to be a duplicate, it is removed from the web search result. Finally, a modified web document set is obtained which contains the required information catering to the user's needs without redundant data. The query "Aptitude" fetched almost 22% redundant result pages in the Google search engine; the algorithm detected these pages and eliminated the duplicates to ensure quality results for the users.
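The efficiency formula above can be checked directly against the figures later reported in Table 2 (function name is ours, values are the paper's):

```python
def efficiency(actual: int, duplicates: int) -> float:
    """Efficiency = (Actual data - Duplicates found) / Actual data."""
    return (actual - duplicates) / actual

# (data size, duplicates found) pairs reported for the "Aptitude" query in Table 2.
for size, dups in [(50, 8), (200, 14), (400, 23), (800, 41)]:
    print(size, round(efficiency(size, dups), 2))
```

For example, 14 duplicates in a 200-document result set gives (200 - 14) / 200 = 0.93, matching the table.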
Data Size   Query      Duplicates Found   Execution Time (s)   Efficiency
50          Aptitude   8                  0.221                0.84
200         Aptitude   14                 0.249                0.93
400         Aptitude   23                 0.277                0.94
800         Aptitude   41                 0.284                0.94
Table 2: Experimental results

Four different datasets are considered for the experiment; the data are tested with the new algorithm and the results are displayed in Table 2. The top fifty results showed almost 16% duplicate results, and for the bigger datasets the duplicate percentage was found to be lower. The more relevant pages produced more duplication in the result.

Proposed Algorithm - Recursive Duplication Check Algorithm

Procedure Remove_Redundancy(dataset DS)
Output: ResultSet DR
1. Load dataset DS
2. Initialize result set DR = {}
3. Initialize counter i := 0
4. Loop until (i < DS count)
   a) Fetch data DS[i]
   b) Initialize TitleCount = 0 and AnchorCount = 0
   c) Initialize counter j := i + 1
      Loop until (j < DS count)
         Fetch data DS[j]
         If titleText(D[i]) = titleText(D[j]) then increment TitleCount by 1
         If anchorText(D[i]) = anchorText(D[j]) then increment AnchorCount by 1
         Increment j := j + 1
      End loop j
   d) If TitleCount = 0 and AnchorCount = 0 then add document D[i] to result DR
   e) Increment i := i + 1
5. End main loop i
6. Return result set DR
End Procedure

Comparison with Existing Approaches
The proposed algorithm was compared with two existing techniques, the signed approach and the correlation approach, and the experimental results are displayed below.

Algorithm              Query      Data Size   Relevancy   Precision   Execution Time (s)
Signed approach        Aptitude   50          30          87.5        0.344
                                  100         62          87.1        0.367
                                  200         131         88.9        0.379
Correlation approach   Aptitude   50          31          88.3        0.319
                                  100         63          89.2        0.328
                                  200         136         90.1        0.344
Proposed algorithm     Aptitude   50          33          89.8        0.221
                                  100         70          91.4        0.234
                                  200         142         92.6        0.249
Table 3: Comparison of relevancy for different data sizes
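The pseudocode of the Recursive Duplication Check procedure can be sketched in Python as follows (the paper's implementation is in C#.Net; the `Doc` class and its field names are assumptions). Note that, as in the pseudocode, a document is kept only when no later document shares its title text or anchor text, so among a group of duplicates it is the last occurrence that survives.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    title: str   # title text of the search result (assumed field)
    anchor: str  # anchor/link text of the search result (assumed field)

def remove_redundancy(ds: list[Doc]) -> list[Doc]:
    """Recursive Duplication Check: keep document i only if no later
    document in DS shares its title text or anchor text."""
    dr: list[Doc] = []
    for i, d in enumerate(ds):
        title_count = anchor_count = 0
        for other in ds[i + 1:]:           # inner loop j = i+1 .. count-1
            if d.title == other.title:
                title_count += 1
            if d.anchor == other.anchor:
                anchor_count += 1
        if title_count == 0 and anchor_count == 0:
            dr.append(d)                   # no later duplicate found
    return dr

# Usage: two of these three results come from the same site and title,
# so only one of them survives the check.
results = [
    Doc("Aptitude Questions", "indiabix.com"),
    Doc("Aptitude Questions", "indiabix.com"),
    Doc("Aptitude Test", "testpot.com"),
]
print(len(remove_redundancy(results)))  # prints 2
```

The nested scan makes the sketch O(n^2) in the number of results, which is acceptable for the result-set sizes (50-800) used in the experiments.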
INTERNATIONAL JOURNAL OF ADVANCED RESEARCH IN DATAMINING AND CLOUD COMPUTING

It is clear from Table 3 that the proposed algorithm outscores the existing algorithms in terms of relevancy and precision, outperforming the signed and correlation approaches by a narrow margin. The execution time of the proposed algorithm is comparatively low, and its execution speed is relatively good compared with the signed and correlation approaches. The proposed approach ensures the quality and efficiency of the search results while consuming less search and execution time, as shown in Figure 3.

Figure 3: Comparison of execution time (s) for the signed, correlation, and proposed approaches at data sizes 50, 100, and 200

Conclusion
Although research on duplicate document detection continues, the efficiency and effectiveness of document relevance remain issues needing improvement. Based on the approach used, the proposed algorithm provides a definite rank to the resultant web pages. A typical search engine should use web page ranking techniques based on the specific needs of its users. The beauty of the new approach is its simplicity; the output list is free of redundant documents and contains more promising results. The proposed algorithm completely removes redundant domain and site names from the displayed search results.

References
[1] Min-yan Wang, Dong-Sheng Liu, "The Research of Web Page De-duplication Based on Web Pages Re-shipment Statement," First International Workshop on Database Technology and Applications, 2009, pp. 271-274.
[2] M. S. Charikar, "Similarity Estimation Techniques from Rounding Algorithms," Proceedings of the 34th Annual ACM Symposium on Theory of Computing, Montreal, Quebec, Canada, 2002, pp. 380-388.
[3] G. Poonkuzhali et al., International Journal of Engineering Science and Technology, Vol. 2(9), 2010, pp. 4026-4032.
[4] Li Zhiyi, Liyang Shijin, "National Research on Deleting Duplicated Web Pages: Status and Summary," Library Information Service, 2011, 55(7), pp. 118-121.
[5] Gaudence Uwamahoro, Zhang Zuping, "Efficient Algorithm for Near Duplicate Documents Detection," International Journal of Computer Science Issues, Vol. 10, Issue 2, March 2013.
[6] Ajik Kumar Mahapatra, Sitanath Biswas, "Inverted Index Techniques," International Journal of Computer Science Issues, Vol. 8, Issue 4, No. 1, 2011.
A. Z. Broder, S. C. Glassman, M. S. Manasse, "Syntactic Clustering of the Web," Sixth International Conference on World Wide Web, 1997.
Suresh S., Sivaprakasam, "Genetic Algorithm with a Ranking Based Objective Function and Inverse Index Representation for Web Data Mining," International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 5, September-October 2013, pp. 84-90.
Ammar A., Rasha S., "Genetic Algorithm in Web Search Using Inverted Index Representation," 5th IEEE Conference, 2009.