REDUNDANCY REMOVAL IN WEB SEARCH RESULTS USING RECURSIVE DUPLICATION CHECK ALGORITHM

Dr. S. RAVICHANDRAN 1, E. ELAKKIYA 2
1 Head, Dept. of Computer Science, H. H. The Rajah's College, Pudukkottai, Tamil Nadu, India
2 Research Scholar, PG & Research Department of Computer Science, J.J. College of Arts and Science, Pudukkottai, Tamil Nadu, India

ABSTRACT

The most common difficulty users face, locating resources from the voluminous data on the World Wide Web that are high in quality, relevant to their information needs and free of redundancy, is not addressed properly. Researchers face enormous challenges in providing relevant and related documents that match the user query. Data mining now plays a pivotal role in searching information on the web, which spans a wide variety of data types. The primary focus of this paper is a novel algorithm, named the Recursive Duplication Check Algorithm, that makes search results redundancy-free and improves the relevance of the retrieved documents.

Keywords: Ranking Algorithm, Data mining, Web mining, Redundancy removal

Introduction

The volume of information on the internet surges every day with millions of new additions to the web, and this presents a colossal challenge for site owners who must provide proper and relevant information to internet users, popularly called netizens. In addition, the number of duplicate documents on the web grows perpetually, which increases retrieval time and reduces the precision of the retrieved documents. Redundancy is generally classified into two categories. The first is exact replication, where two web pages contain the same content. The second is near-duplication, where web pages are very similar, i.e. their similarity exceeds a threshold value. This paper focuses on the detection and removal of duplicate and near-duplicate web pages from the dataset.

Background Research

Many researchers work on information retrieval, especially in the domain of duplicate detection for web documents. In the modern era, detection of duplicate and near-duplicate documents is an interesting subject, since it helps the user get the documents related to a given query. Min-yan Wang et al. [1] proposed a web page de-duplication method in which information such as the original website and the web title is extracted, and duplicated web pages are eliminated on the basis of feature codes with the help of URL hashing. Charikar [2] suggested a method based on random projection of the words in documents to detect near-duplicate documents, improving overall precision and recall. G. Poonkuzhali et al. [3] proposed a mathematical approach based on signed and rectangular representations to detect and remove redundant web documents. Li Zhiyi et al. [4] summarise the state of duplicated web page detection technology in China. Gaudence U. et al. [5] proposed an algorithm for near-duplicate document detection that uses a word-positional approach for document selection. Junxiu An et al. [6] proposed an algorithm that detects duplicate web pages based on edit distance.

Existing correlation and signed approaches

In the correlation approach, the words common to two documents are extracted and the term frequency of each common word is found. The correlation coefficient is then computed between the two documents. If the correlation value is 1, the documents are exactly redundant, and the second document is removed from the original document set. This process is repeated for the remaining documents.

In the signed approach, the set of documents is treated as whitespace-separated words and indexed in a two-dimensional format (i, j), where i represents the web page and j the word. The first word of the first web page is therefore indexed as (1, 1), the second word of the first page as (1, 2), and so on. The domain dictionary is arranged so that all 1-letter words are indexed first, followed by 2-letter words, then 3-letter words, and so on up to 15-letter words, which is a reasonable upper bound on the number of characters in a word. Each page is mined individually to separate relevant from irrelevant documents using the signed approach. The resulting dataset is then processed further to remove nearly duplicate documents based on term weightage and term frequency. Finally, the documents are ordered and listed according to the user query using a Genetic Algorithm with a rank-based objective function.

Algorithm
Step 1: Load the web documents into memory
Step 2: Remove the exact duplicate documents from the dataset
Step 3: Remove the nearly duplicate documents from the dataset
Step 4: Order the dataset according to the user query using the Genetic Algorithm
Step 5: Display the rank-based document list

The main drawback of these approaches is their time consumption and execution speed; the proposed approach overcomes this problem, and the execution time it consumes is comparatively lower.

Proposed Approach

Several approaches have been proposed to detect duplicate and near-duplicate documents, but they still need improvement. The lack of speed in query searching and relevance ranking is an obstacle, especially in information retrieval, which makes a powerful algorithm necessary to speed up the searching process and provide better results. We propose a new approach based on the WWT (Weighted Web Tool) inverted index, which stores unique terms together with the list of HTML documents containing them, their frequencies and their weightage. The inverted index is a common technique in information retrieval (IR) because of its excellent efficiency in retrieving relevant documents and its ability to detect duplicate documents. The term weight is calculated from the significance and importance of a term in identifying a document, that is, from the location of the term within the HTML tags of the document.

The dataset considered for the experimental setup consists of the top 800 search results returned after providing the keyword Aptitude. Redundancy and relevancy are then measured for this search result dataset obtained from the Google search engine. To find the relevance of a result, the term weights are taken as the factors that decide whether a page or document is relevant.
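The paper does not give code for the WWT inverted index, but a minimal sketch of the idea, one posting of (document, frequency, tag-based weight) per term, could look as follows in C#. The type and member names are ours, not the paper's, and the tag weights are those listed in Table 1 below.

// Minimal sketch (our illustration) of a weighted inverted index: each term maps
// to the documents it occurs in, with its frequency and a weight derived from the
// HTML tag it appears under (weights as given in Table 1 below).
using System;
using System.Collections.Generic;

class Posting
{
    public int DocId;
    public int Frequency;
    public int Weight;      // highest tag weight seen for this term in the document
}

class WeightedInvertedIndex
{
    // Tag weights from Table 1.
    static readonly Dictionary<string, int> TagWeight = new Dictionary<string, int>
    {
        ["title"] = 6, ["head"] = 5, ["h1"] = 5, ["h2"] = 5,
        ["a"] = 4, ["b"] = 3, ["body"] = 1
    };

    readonly Dictionary<string, Dictionary<int, Posting>> index =
        new Dictionary<string, Dictionary<int, Posting>>();

    // Record one occurrence of a term inside a given HTML tag of a document.
    public void Add(string term, int docId, string tag)
    {
        term = term.ToLowerInvariant();
        int weight = TagWeight.TryGetValue(tag.ToLowerInvariant(), out int w) ? w : 1;

        if (!index.TryGetValue(term, out var postings))
            index[term] = postings = new Dictionary<int, Posting>();

        if (!postings.TryGetValue(docId, out var p))
            postings[docId] = p = new Posting { DocId = docId };

        p.Frequency++;
        p.Weight = Math.Max(p.Weight, weight);
    }

    public IEnumerable<Posting> Lookup(string term) =>
        index.TryGetValue(term.ToLowerInvariant(), out var postings)
            ? postings.Values
            : (IEnumerable<Posting>)Array.Empty<Posting>();
}

A lookup therefore returns, for a query term, every document containing it together with the frequency and the tag-based weight used as the relevance factor.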

The weights of the searched terms are calculated according to Table 1.

    HTML Tag        Weight
    Title           6
    Head, H1, H2    5
    Anchor          4
    Bold            3
    Body            1

Table 1: Weightage calculation

Flow diagram of the proposed work

Figure 1 shows the flowchart of the proposed work: the search engine result set (the dataset) is extracted and cleaned, checked for redundancy using the proposed Recursive Duplication Check algorithm, and the rechecked result set is displayed as the output.

Figure 1: Flowchart of the proposed work

Experimental Study

Most of the results displayed by search engines are redundant and irrelevant to the query posed by the user. One major question of this paper is whether there is a difference among the ranking algorithms of search engines; if so, the goal is to analyse and compare the relevant pages displayed by a popular search engine such as Google, and to design a novel algorithm that filters out the redundant pages and produces a user-friendly result set. This work is carried out with the Google search engine, and the results displayed for the queries given by the user are taken as the dataset to be mined in order to check for duplication and eliminate redundancy.
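For orientation, a minimal, self-contained C# driver mirroring the flow in Figure 1 is sketched below. The method names (FetchResultSet, CleanDocuments, CheckRedundancy, Display) are ours and stand in for the stages of the proposed work; they are placeholders, not a published API.

// Illustrative driver for the Figure 1 pipeline: fetch the result set, clean it,
// check it for redundancy, and display the rechecked result set.
using System;
using System.Collections.Generic;
using System.Linq;

class SearchResult
{
    public string Title;
    public string Url;
}

static class ProposedWorkPipeline
{
    static void Main()
    {
        List<SearchResult> dataset = FetchResultSet("Aptitude");   // search engine result set
        List<SearchResult> cleaned = CleanDocuments(dataset);      // extraction and data cleaning
        List<SearchResult> checkedSet = CheckRedundancy(cleaned);  // recursive duplication check
        Display(checkedSet);                                       // rechecked result set as output
    }

    static List<SearchResult> FetchResultSet(string query) =>
        new List<SearchResult>();   // placeholder: load the saved Google result set for the query

    static List<SearchResult> CleanDocuments(List<SearchResult> docs) =>
        docs.Where(d => !string.IsNullOrWhiteSpace(d.Title)).ToList();   // drop empty entries

    static List<SearchResult> CheckRedundancy(List<SearchResult> docs) =>
        docs;   // placeholder: apply the Recursive Duplication Check algorithm given later

    static void Display(IEnumerable<SearchResult> docs)
    {
        foreach (var d in docs) Console.WriteLine($"{d.Title}  {d.Url}");
    }
}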

For example, the query word Aptitude is considered in this experiment, and the result displayed by Google is shown in Figure 2.

Figure 2: Search result of Google for the keyword Aptitude

The result in Figure 2 clearly shows that there is redundancy in the search results: the first three results belong to the same website, www.indiabix.com. This is the redundancy discussed in this work, and a new algorithm is proposed to eliminate it. The dataset DS is first collected from the search result and converted into a text file for mining and redundancy elimination. This file is then web mined using C#.Net code, and the proposed algorithm reduces or completely eliminates the duplication of domain names present in the search result.

Observation

The datasets considered for the experimental study range in size between 100 and 800 documents. The proposed algorithm is run on these datasets and the duplicates are found, to showcase the effectiveness of the algorithm. Efficiency is measured as

Efficiency = (Actual data - Duplicates found) / Actual data

For instance, 8 duplicates found among 50 results give an efficiency of (50 - 8) / 50 = 0.84, i.e. 84%. The redundancy is checked, and if a document is found to be a duplicate it is removed from the web search result. Finally, a modified web document is obtained which contains the required information catering to the user's needs without redundant data. The query Aptitude fetched almost 22% redundant result pages from the Google search engine, and the algorithm detected these pages and eliminated the duplicates to ensure quality results for the users.
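The domain-level redundancy noted above (several results from the same site, such as www.indiabix.com) can be screened with a simple host check. The sketch below is our illustration of that idea, not the paper's Recursive Duplication Check itself; it keeps only the first result seen from each domain, and the example URL paths are hypothetical.

// Illustrative only: collapse results that share a domain name, keeping the
// first result seen for each host.
using System;
using System.Collections.Generic;

static class DomainFilter
{
    public static List<string> KeepOnePerDomain(IEnumerable<string> resultUrls)
    {
        var seenHosts = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var filtered = new List<string>();

        foreach (var url in resultUrls)
        {
            if (!Uri.TryCreate(url, UriKind.Absolute, out var uri))
                continue;                       // skip malformed URLs

            if (seenHosts.Add(uri.Host))        // first time this domain appears
                filtered.Add(url);
        }
        return filtered;
    }
}

// Usage (paths are made-up examples): two www.indiabix.com entries collapse to one.
// var unique = DomainFilter.KeepOnePerDomain(new[] {
//     "https://www.indiabix.com/aptitude/questions-and-answers/",
//     "https://www.indiabix.com/online-aptitude-test/",
//     "https://example.com/aptitude"
// });   // 2 entries remain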

    Data size   Query      Duplicates found   Execution time (sec)   Efficiency
    50          Aptitude   8                  0.221                  0.84
    200         Aptitude   14                 0.249                  0.93
    400         Aptitude   23                 0.277                  0.94
    800         Aptitude   41                 0.284                  0.94

Table 2: Experimental results

Four different datasets are considered for the experiment, the data are tested with the new algorithm, and the results are displayed in Table 2. The top fifty results showed almost 16% duplicate results, and when bigger datasets are considered the duplicate percentage is found to be lower. The more relevant pages produced more duplication in the results.

Proposed Algorithm - Recursive Duplication Check Algorithm

Procedure Remove_Redundancy(dataset DS)
Output: ResultSet DR
1. Load dataset DS
2. Initialize result set DR = {}
3. Initialize counter i := 0
4. Loop until (counter i < DS count)
   a) Fetch data DS_i
   b) Initialize TitleCount = 0 and AnchorCount = 0
   c) Initialize counter j := i + 1
      Loop until (j < DS count)
         Fetch data DS_j
         If titleText(D_i) = titleText(D_j), increment TitleCount by 1
         If anchorText(D_i) = anchorText(D_j), increment AnchorCount by 1
         Increment j := j + 1
      End loop j
   d) If TitleCount = 0 and AnchorCount = 0, add document D_i to result set DR
   e) Increment i := i + 1
5. End main loop i
6. Return result set DR
End Procedure

Comparison with existing approaches

The proposed algorithm was compared with two existing techniques for checking redundancy, the signed approach and the correlation approach, and the experimental results are displayed below.

    Algorithm              Query      Data size   Relevancy   Precision   Execution time (s)
    Signed approach        Aptitude   50          30          87.5        0.344
                                      100         62          87.1        0.367
                                      200         131         88.9        0.379
    Correlation approach   Aptitude   50          31          88.3        0.319
                                      100         63          89.2        0.328
                                      200         136         90.1        0.344
    Proposed algorithm     Aptitude   50          33          89.8        0.221
                                      100         70          91.4        0.234
                                      200         142         92.6        0.249

Table 3: Comparison of relevancy for different data sizes
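A direct C# rendering of the Remove_Redundancy procedure above might look like the following sketch. The Document class and its property names are our own choices; only the title and anchor text comparison from the pseudocode is implemented.

// C# sketch of the Recursive Duplication Check procedure listed above.
using System;
using System.Collections.Generic;

class Document
{
    public string TitleText;
    public string AnchorText;
    public string Url;
}

static class RecursiveDuplicationCheck
{
    public static List<Document> RemoveRedundancy(IList<Document> ds)
    {
        var dr = new List<Document>();                        // result set DR

        for (int i = 0; i < ds.Count; i++)                    // main loop over DS
        {
            int titleCount = 0, anchorCount = 0;

            for (int j = i + 1; j < ds.Count; j++)            // compare with later documents
            {
                if (string.Equals(ds[i].TitleText, ds[j].TitleText,
                                  StringComparison.OrdinalIgnoreCase))
                    titleCount++;
                if (string.Equals(ds[i].AnchorText, ds[j].AnchorText,
                                  StringComparison.OrdinalIgnoreCase))
                    anchorCount++;
            }

            // As in the pseudocode: keep D_i only if no later document shares
            // its title text or anchor text.
            if (titleCount == 0 && anchorCount == 0)
                dr.Add(ds[i]);
        }
        return dr;
    }
}

Read literally, the pseudocode keeps a document only when no later document shares its title or anchor text, so exactly one copy from each duplicate group survives in DR while the earlier copies are dropped.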

It is clear from Table 3 that the proposed algorithm outscores the existing algorithms in terms of relevancy and precision, outperforming the signed and correlation approaches by a narrow margin. The execution time of the proposed algorithm is comparatively low, and its execution speed is relatively good compared with the signed and correlation approaches. The proposed approach ensures that the search results are relevant and of good quality and efficiency, while consuming less search and execution time, as shown in Figure 3.

Figure 3: Comparison of execution time (s) for different data sizes (50, 100 and 200 documents) for the signed, correlation and proposed approaches

Conclusion

Although research on duplicate document detection continues to this day, efficiency and effectiveness with respect to document relevance are still open issues and need improvement. Based on the approach used, the proposed algorithm assigns a definite rank to the resultant web pages. A typical search engine should use web page ranking techniques based on the specific needs of its users. The beauty of the new approach is its simplicity: the output list is free from redundant documents and contains more promising results, and the proposed algorithm completely removes the redundant domain and site names displayed in the search results.

References

[1] Min-yan Wang, Dong-Sheng Liu, The Research of Web Page De-duplication Based on Web Pages Reshipment Statement, First International Workshop on Database Technology and Applications, 2009, pp. 271-274.
[2] Charikar M. S., Similarity Estimation Techniques from Rounding Algorithms, Proceedings of the 34th Annual ACM Symposium on Theory of Computing, Montreal, Quebec, Canada, 2002, pp. 380-388.
[3] G. Poonkuzhali et al., International Journal of Engineering Science and Technology, Vol. 2(9), 2010, pp. 4026-4032.
[4] Li Zhiyi, Liyang Shijin, National Research on Deleting Duplicated Web Pages: Status and Summary, Library Information Service, 2011, 55(7), pp. 118-121.
[5] Gaudence Uwamahoro, Zhang Zuping, Efficient Algorithm for Near Duplicate Documents Detection, International Journal of Computer Science Issues, Vol. 10, Issue 2, March 2013.
[6] Ajit Kumar Mahapatra, Sitanath Biswas, Inverted Index Techniques, International Journal of Computer Science Issues, Vol. 8, Issue 4, No. 1, 2011.
Broder A. Z., Glassman S. C., Manasse M. S., Syntactic Clustering of the Web, The Sixth International Conference on World Wide Web, 1997.
Suresh S., Sivaprakasam, Genetic Algorithm with a Ranking Based Objective Function and Inverse Index Representation for Web Data Mining, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 5, September-October 2013, pp. 84-90.
Ammar A., Rasha S., Genetic Algorithm in Web Search Using Inverted Index Representation, 5th IEEE Conference, 2009.