Near-Duplicates Detection and Elimination Based on Web Provenance for Effective Web Search


(IJIDCS) International Journal on Internet and Distributed Computing Systems, Vol. 1, No. 1

Y. Syed Mudhasir, Department of Computer Science & Engineering, College of Engineering, Anna University, Chennai-25, sydmudhasirmr@gmail.com
J. Deepika, Department of Computer Science & Engineering, College of Engineering, Anna University, Chennai-25, deepi.realmail@gmail.com
S. Sendhilkumar, Department of Information Science & Technology, College of Engineering, Anna University, Chennai-25, ssk_pdy@yahoo.co.in
G. S. Mahalakshmi, Department of Computer Science & Engineering, College of Engineering, Anna University, Chennai-25, mahalakshmi@cs.annauniv.edu

Abstract - Users of the World Wide Web rely on search engines for information retrieval, as search engines play a vital role in finding information on the web. However, the performance of a web search suffers greatly when the results are flooded with redundant information, i.e., near-duplicates. Such near-duplicates hold back other promising results from the user, and many of them come from untrusted websites and/or authors hosting information on the web. Such near-duplicates may be eliminated by means of provenance. This paper therefore proposes a novel approach to identifying near-duplicates based on provenance. A provenance model is built over the web pages returned as search results by an existing search engine, and the model combines both content-based and trust-based factors to classify each result as original or near-duplicate.

Keywords - Web search, Near-duplicates, Provenance, Semantics, Trustworthiness

I. INTRODUCTION

Finding information on the internet has become a day-to-day task for billions of users, so it is important that users get the best results for their queries. However, any web search environment faces challenges in providing the most relevant, useful and trustworthy results, namely:
- the lack of semantics in the web,
- the enormous number of near-duplicate documents, and
- the lack of emphasis on the trustworthiness of documents.

Many other factors also affect the performance of a web search, and research continues on optimizing it: semantic analysis of the web to provide relevant results [1], optimization of indexing functions to improve the storage and retrieval of web documents [2], and optimization of ranking functions to place the best documents at the top of the results [3]. The efficiency of all these approaches, however, depends on the amount of data available over the internet. Information on the WWW is enormous and redundant, and this gives rise to the problem of near-duplicate documents. Subsections I-A and I-B briefly discuss the concepts of near-duplicates detection and provenance.

A. Near-Duplicates Detection

Near-duplicate documents can be identified by scanning the content of every document: when two documents have identical content, they are regarded as duplicates, while files that bear small dissimilarities, and so are not identified as exact duplicates of each other yet are alike to a remarkable extent, are known as near-duplicates.
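To make the distinction concrete, here is a minimal Python sketch: exact duplicates can be caught with a content hash, while near-duplicates need a similarity measure and a threshold. The word-set measure and the 0.9 threshold here are illustrative assumptions, not taken from this paper.

```python
import hashlib

def is_exact_duplicate(doc_a: str, doc_b: str) -> bool:
    # Identical content -> identical hash -> exact duplicate.
    return hashlib.sha1(doc_a.encode()).digest() == hashlib.sha1(doc_b.encode()).digest()

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.9) -> bool:
    # Small dissimilarities defeat hashing, so compare word sets instead
    # (Jaccard overlap); the 0.9 threshold is illustrative, not prescribed.
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return (len(a & b) / len(a | b) >= threshold) if (a | b) else True
```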
Some examples of near-duplicate documents, following [4], are:
- documents differing in only a few words, the most widespread form of near-duplicates;
- documents with the same content but different formatting, for instance the same text in dissimilar fonts, bold type or italics;
- documents with the same content but with typographical errors;
- plagiarized documents and different versions of a document;
- documents with the same content but a different file type, for instance Microsoft Word and PDF;
- documents providing the same information, written by the same author, published in more than one domain.

There are several existing approaches based on syntactic comparison, on URLs, and on semantic comparison. This paper suggests an effective way of identifying and eliminating near-duplicates using provenance, which allows documents to be compared on provenance factors.

B. Provenance

One cause of the growth of near-duplicates on the web is the ease with which one can access web data, together with the lack of semantics in near-duplicates detection techniques. It has also become extremely difficult to judge the trustworthiness of web documents when different versions/formats of the same content exist. Hence the need to bring semantics, that is, meaningful comparison, into near-duplicates detection with the help of the 6W factors: Who (authored the document), What (the content of the document), When (it was made available), Where (it is available), Why (the purpose of the document), and How (the format in which it was published and how it has been maintained) [5]. A quantitative measure of how reliable an arbitrary piece of data is can be determined from this provenance information, and the information is useful both for representative elimination during the near-duplicate detection process and for calculating the trustworthiness of each document. Existing approaches to near-duplicates detection and elimination give little importance to the trustworthiness of the content retrieved through web search. Provenance-based factors may therefore be used for near-duplicates detection and elimination, providing the user with the most trustworthy results.

II. RELATED WORK

A. Detection and Elimination of Near-Duplicates

There are many works on near-duplicates detection and elimination. In general they may be broadly classified, as shown in Fig. 1, into syntactic, URL based and semantic approaches.

[Fig. 1 Near-duplicates detection techniques: Syntactic (shingling, signature, pair-wise similarity), URL based, and Semantic (sentence-wise similarity, fuzziness, semantic-based graphs)]

1) Syntactic Approaches: One of the earliest, by Broder et al. [6], is a technique for estimating the degree of similarity among pairs of documents, known as shingling. It does not rely on any linguistic knowledge other than the ability to tokenize documents into lists of words, i.e., it is purely syntactic. All sequences (shingles) of adjacent words are extracted; if two documents contain the same set of shingles they are considered equivalent and can be termed near-duplicates. The problem of text-based document similarity was also investigated with a new measure that computes the pair-wise similarity of documents from a given series of terms, together with a kappa measure for document similarity; an ordered weighted averaging (OWA) operator then aggregates the similarity measures over a set of documents [7]. Reference [8] shows another approach: a similarity measure can be acquired by comparing the surface tokens of sentences, but a relevance measure can be obtained only by comparing the interior meaning of the sentences. It describes a method to explore the quantified conceptual relations of word pairs using the definition of a lexical item, and proposes a practical approach to measuring inter-sentence relevance. An approach based on the signature concept, as in [9], suggests a method of descriptive words for defining near-duplicates, based on choosing N words from the index to form a signature of a document. Any search engine based on an inverted index can apply this method, and any two documents with similar signatures are termed near-duplicates. When the shingle-based method and the signature method are compared, the signature method is more efficient in the presence of an inverted index.
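As a rough illustration of the shingling idea of [6], the sketch below extracts all w-word shingles from each document and treats a high Jaccard overlap between the shingle sets as a near-duplicate signal. The shingle width w = 4 and any decision threshold are illustrative assumptions, not the parameters of the original work.

```python
def shingles(text: str, w: int = 4) -> set:
    """All contiguous w-word sequences (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(doc_a: str, doc_b: str, w: int = 4) -> float:
    """Jaccard overlap of the two shingle sets, as in Broder et al. [6]."""
    sa, sb = shingles(doc_a, w), shingles(doc_b, w)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

# Documents whose resemblance exceeds a chosen threshold are flagged as
# near-duplicates; identical shingle sets mean the documents are equivalent.
print(resemblance("a b c d e f", "a b c d e g"))  # 0.5
```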
The syntactic approaches above carry out only a text-based comparison and do not involve URLs or link-structure techniques in identifying near-duplicates. The following subsection discusses the impact of URL-based approaches on near-duplicates detection.

2) URL-Based Approaches: DustBuster, a novel algorithm for uncovering DUST (Different URLs with Similar Text), is intended to discover rules that transform a given URL into others that are likely to have similar content. DustBuster mines previous crawl logs or web server logs, instead of probing page contents, to find DUST efficiently. With information about DUST, search engines can increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank [10]. Reference [11] shows another approach in which the detection process is divided into three steps: (1) removal according to URLs, i.e., removing pages with the same URL from the initial set so that the same page is not downloaded repeatedly through repeated links; (2) removing miscellaneous information from the pages and extracting the texts, i.e., pre-processing the pages to strip navigation information, advertising, HTML tags and other miscellaneous content; (3) detecting similar pages with the DDW algorithm. The combination of such URL-based approaches with syntactic approaches is still not sufficient, as they lack semantics in identifying near-duplicates. The following subsection briefly discusses a few semantic approaches.

3) Semantic Approaches: A method for plagiarism detection using a fuzzy semantic-based string similarity approach was proposed in [12], developed in four main stages. First is pre-processing, which includes tokenization, stemming and stop-word removal. Second is retrieving a list of candidate documents for each suspicious document using shingling and the Jaccard coefficient. Suspicious documents are then compared sentence-wise with the associated candidate documents; this stage computes a fuzzy degree of similarity that ranges between two edges, 0 for completely different sentences and 1 for exactly identical sentences, and two sentences are marked as similar (i.e., plagiarized) if their fuzzy similarity score exceeds a certain threshold. The last stage is post-processing, whereby consecutive sentences are joined to form single paragraphs/sections. Recognizing that two Semantic Web documents or graphs are similar, and characterizing their differences, is useful in many tasks, including retrieval, updating, version control and knowledge-base editing. A number of text-based similarity metrics that characterize the relation between Semantic Web graphs are discussed in [13] and evaluated for three specific cases of similarity: similarity in the classes and properties used while differing only in literal content, difference only in base-URI, and versioning relationships. Such techniques are inadequate because the emphasis is on providing relevant content rather than on the trust, originality or authenticity of the documents. Provenance is therefore likely to play an important role in near-duplicates detection.

B. Provenance

Research on provenance gives importance to the trustworthiness of content. In general, works on provenance techniques may be classified, as shown in Fig. 2, into workflow-oriented, network-oriented, trustworthiness-oriented and provenance-collecting techniques.

[Fig. 2 Provenance techniques: workflow oriented, network oriented, trustworthiness, collecting provenance, provenance graphs]

An approach based on a Collaborative Planning Application (CPA) helps users organize information, potentially at a variety of security levels, in the style of a blog/wiki, and data provenance is a natural fit to it. Labels consist of access-control labels, which are lists of groups of users with read and write access, and provenance labels, which comprise the ProvAction labels that have affected the labeled data. As with most wiki software, a log of how, when, and by whom pages are edited is a matter of concern; to implement data provenance, every public function that modifies the state of the wiki is wrapped so as to update the appropriate label. The provenance policies cover creating, modifying, deleting, restoring and relabeling blocks [14].

1) Trustworthiness: Another line of provenance research is knowledge provenance, which determines the validity and origin of web information by modeling and maintaining information sources and information dependencies. It constructs a trust judgment model in which judgments rest on the following factors: (1) the trustworthiness of the information creator can stand for the trustworthiness of the information created; (2) trust can be placed in how a trusted individual behaves, and this type of trust is intransitive; (3) trust can be placed in what a trusted friend believes to be true in a field, and this type of trust is transitive and can propagate through social networks; (4) trust in an organization in a field can be transferred to a professional member of that organization. The emphasis is on the trustworthiness of the content and on measures of that trustworthiness obtained through social networks [15].
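As a toy rendering of factors (2) and (3) above, the sketch below propagates "belief" trust along chains while refusing to chain "behavior" trust. The graph structure and scores are illustrative assumptions, not the model of [15].

```python
# Toy trust-judgment sketch: "belief" trust propagates along chains
# (factor 3), while "behavior" trust does not (factor 2).
TRUST = {  # (truster, trustee) -> (kind, score in [0, 1])
    ("alice", "bob"): ("belief", 0.9),
    ("bob", "carol"): ("belief", 0.8),
    ("alice", "dave"): ("behavior", 0.9),
    ("dave", "erin"): ("belief", 0.7),
}

def belief_trust(truster, target, seen=()):
    """Best transitive belief-trust score from truster to target."""
    best = 0.0
    for (a, b), (kind, score) in TRUST.items():
        if a != truster or kind != "belief" or b in seen:
            continue  # behavior trust is intransitive: no hop through it
        if b == target:
            best = max(best, score)
        else:
            best = max(best, score * belief_trust(b, target, seen + (b,)))
    return best

print(belief_trust("alice", "carol"))  # 0.9 * 0.8 = 0.72
```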
2) Collecting Provenance Information: Recording provenance information is a fundamental topic of provenance research, as discussed in [16]. While traditional provenance research usually addresses the creation of data, this provenance model also represents data access in the context of web data. A system that applies the model generates provenance graphs for data items. Some pieces of provenance information can be recorded by the system itself; for others it relies on metadata provided by third parties, so recordable provenance information and metadata-reliant provenance information are properly distinguished. Provenance information common to all provenance elements of this type includes the access time and the access method, and a provenance element can describe the creation and expiration dates. Provenance-relevant metadata is either attached directly to a data item or its host document, or is available as additional data on the web; examples of attached metadata are RDF statements about the RDF graph that contains them, or the author and creation date of blog entries. A provenance-based near-duplicates detection and elimination process will therefore help retain the original, i.e., most trustworthy, documents and eliminate the other replicas, removing the overhead that near-duplicate documents impose on the search results in any web search environment.

III. WEB PROVENANCE BASED DETECTION AND ELIMINATION OF NEAR-DUPLICATES

The entire process of web provenance based near-duplicates detection and elimination is represented in the architecture shown in Fig. 3, which comprises the following components:
(i) data collection, (ii) preprocessing, (iii) Document Term Matrix (DTM) construction, (iv) Provenance Matrix (PM) construction, (v) database, (vi) Singular Value Decomposition, (vii) document clustering based on similarity scores, (viii) filtering, and (ix) re-ranking based on trustworthiness values.

[Fig. 3 Web Search Optimization Based on Web Provenance: data collection -> preprocessing -> DTM and PM construction -> database -> Singular Value Decomposition -> document comparison and clustering (similarity scores) -> filtering of near-duplicates -> trustworthiness calculation -> re-ranking -> refined results]

A. Data Collection

In this work a web search application is utilized. The aim of the project is to identify and eliminate near-duplicates among the top 20 search results for a given query. Since the data needs to be collected automatically, a special web search browser, as in [17], which collects and stores web documents without user intervention, is used. Currently 20 search queries have been collected and 400 web documents downloaded for the purpose of near-duplicates detection. These documents need to be indexed using keywords that highlight their content, but the collected documents were of varied formats, with extensions .html, .pdf, .ppt, .doc, .ps, etc.; currently only web pages, i.e., HTML documents, are considered in this work. The documents have to be preprocessed as discussed in the following section.

B. Preprocessing

The preprocessing of a document involves: (i) removing HTML tags and scripting elements (parsing), (ii) tokenizing, (iii) removal of stop words, (iv) stemming, and (v) feature extraction based on term frequency.

C. Parsing

The first step of preprocessing is parsing. A parser was designed specifically to extract the textual content from any HTML document. Part of the source file of a typical web page resembles the one in Fig. 4, with its HTML tags.

Fig. 4 Sample input to Parser:
<br><img alt="ieee Xplore Digital Library" src="24_files/logo.xplore.gif" width=230 height=45>ieee
<P id=pagetop class=jumplink>We construct an automatic secure fingerprint verification system based on fuzzy vault scheme addressing a major security hole currently in most biometric systems.</p>
<A href=" ">Acoustics, Speech, and Signal Processing, 2005</A>

When the sample input in Fig. 4 is given to the parser function, it removes all tags and scripting elements and returns only the textual content of the document, as shown in Fig. 5.

Fig. 5 Sample output from Parser:
IEEE We construct an automatic secure fingerprint verification system based on fuzzy vault scheme addressing a major security hole currently in most biometric systems. Acoustics, Speech, and Signal Processing, 2005

The output of the parser, as shown in Fig. 5, acts as the input to the stop-words removal process.

D. Stop Words Removal (SWR)

A list of stop words such as "of", "the", "is", "are", "were", etc., available in [18], is used for this purpose. Each stop word is checked for in the given input text and removed if found, so the text then contains only the keywords that highlight the content of the document. The input is as shown in Fig. 5 and the output obtained after stop-words removal is represented in Fig. 6.

Fig. 6 Sample output from SWR:
IEEE Construct Automatic Secure Fingerprint Verification System Fuzzy Vault Scheme Addressing Major Security Hole Currently Biometric Systems Acoustics Speech Signal Processing

The filtered tokens obtained from the stop-words removal process, as shown in Fig. 6, go as input to the stemming process.
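A minimal sketch of the preprocessing chain of Sections III-C through III-E (parsing, tokenizing, stop-word removal, stemming), assuming Python's standard library plus NLTK's Porter stemmer as a stand-in for [19]; the stop-word list here is a tiny illustrative subset of the list in [18].

```python
from html.parser import HTMLParser
from nltk.stem import PorterStemmer  # stand-in for the Porter algorithm [19]

STOP_WORDS = {"of", "the", "is", "are", "were", "a", "an", "on", "in"}  # subset of [18]

class TextExtractor(HTMLParser):
    """Parser step: keep textual content, dropping tags and script/style bodies."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], False
    def handle_starttag(self, tag, attrs):
        self._skip = tag in ("script", "style")
    def handle_endtag(self, tag):
        self._skip = False
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def preprocess(html: str) -> list:
    extractor = TextExtractor()
    extractor.feed(html)
    tokens = "".join(extractor.parts).lower().split()    # tokenizing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]             # stemming

tokens = preprocess("<p>The fingerprint verification systems</p>")
# stop words dropped, remaining tokens stemmed to their root forms
```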

E. Stemming

Stemming is the process of obtaining root words from the derived words present in the filtered tokens. The Porter stemming algorithm [19] is used to identify the root words. In Fig. 6 there are keywords such as "Addressing", "Systems" and "Verification" that do not highlight the exact keyword content; such keywords are passed to the stemming module, where suffixes ("ing", "es", "s", "ity", etc.) and prefixes ("ir", "un", "im", etc.) are stripped off to obtain the root words, as highlighted in Fig. 7.

Fig. 7 Sample output from Stemmer:
IEEE Construct Automatic Secure Fingerprint Verify System Fuzzy Vault Scheme Address Major Security Hole Current Biometric System Acoustic Speech Signal Process

Hence each document is finally represented as a set of keywords, which supports the formulation of the Document Term Matrix for indexing and storage in the database.

F. DTM Construction

A document-term matrix (or term-document matrix) describes the frequency of terms occurring in a collection of documents, as in [20]. The input to this module is as shown in Fig. 7. The term frequency is computed for each term in the given input, and a matrix of order m x n is formed, as represented in Fig. 8, where m is the number of terms and n the total number of documents; the values of the matrix are the term frequencies of each term in each document.

[Fig. 8 Document Term Matrix (DTM): terms such as Accept, Access, Accomplish, Burger, Burn, Cambridge, Camera and Candid as rows; doc 1, doc 2, doc 3 as columns; term frequencies as cell values]

G. Provenance Matrix Construction

In this work we propose a matrix called the Provenance Matrix (PM), defined as follows.

Definition 3.1 (Provenance Matrix): Given a set of web pages/documents and their information Who, When, Where and What for each document, a Provenance Matrix may be defined in which the rows represent these factors and the columns represent the documents.

The quad-tuple of provenance-factor values for each document is obtained and a Provenance Matrix is formed, as shown in Fig. 9.

[Fig. 9 Provenance Matrix: rows Who (authors such as Andrew McCallum, Kamal Nigam, Lyle H. Ungar, Jing Luo), Where (IEEE, ACM), When (year of publication) and What (the content, as captured by the DTM); columns doc 1, doc 2, doc 3]

The Provenance Matrix is further elaborated into individual matrices for each of the provenance factors; Figs. 10, 11 and 12 represent the Who, Where and When matrices respectively. The two other factors, Why and How, remain to be explored.

[Fig. 10 Sample Who matrix: authors as rows, documents as columns]
[Fig. 11 Sample Where matrix: publication locations such as IEEE and ACM as rows, documents as columns]
[Fig. 12 Sample When matrix: years as rows, documents as columns]

The boolean values in the individual matrices of Figs. 10, 11 and 12 mark the author, location and year of creation of each document: 1 represents existence, 0 its absence.

H. Database Component

The Document Term Matrix, the Provenance Matrix and the three individual matrices are stored in the database, representing each document as vectors with respect to both content and provenance factors. These matrices are updated dynamically as each document vector enters, and they are used for computing the similarity between documents by retrieving any two document vectors for comparison based on both content and the other provenance factors.
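A sketch of how the DTM and the per-factor boolean matrices could be assembled from preprocessed documents; the metadata field names ("who", "where", "when") and helper structure are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def build_dtm(docs_tokens):
    """Document Term Matrix: rows = terms, columns = documents,
    cells = frequency of the term in each document."""
    vocab = sorted({t for tokens in docs_tokens for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    dtm = np.zeros((len(vocab), len(docs_tokens)), dtype=int)
    for j, tokens in enumerate(docs_tokens):
        for t in tokens:
            dtm[index[t], j] += 1
    return vocab, dtm

def build_factor_matrix(docs_meta, factor):
    """Boolean matrix for one provenance factor (Who/Where/When):
    rows = observed values, columns = documents; 1 marks existence."""
    values = sorted({m[factor] for m in docs_meta})
    mat = np.zeros((len(values), len(docs_meta)), dtype=int)
    for j, m in enumerate(docs_meta):
        mat[values.index(m[factor]), j] = 1
    return values, mat

docs_meta = [{"who": "Andrew McCallum", "where": "IEEE", "when": 2005},
             {"who": "Lyle H. Ungar", "where": "ACM", "when": 2005}]
authors, who_matrix = build_factor_matrix(docs_meta, "who")
```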

I. Singular Value Decomposition

Owing to the high dimensionality of the DTM, further processing of the documents becomes infeasible, so Singular Value Decomposition, as detailed in [21], is needed to reduce the dimensions of the matrix and reformulate the document vectors. The original DTM A can be decomposed as in (1):

A = U Σ V^T    (1)

where A is the m x n DTM, the columns of U and V are, respectively, the left- and right-singular vectors for the corresponding singular values, and the diagonal entries of Σ are the singular values of A. The document vectors can then be reformulated as in (2):

d̂_j = Σ_k^{-1} U_k^T d_j    (2)

where each document vector d_j, which has m rows, is reduced to d̂_j, which has only k rows, with k very much less than m. This reformulation reduces the complexity of applying similarity measures to the document vectors.

[Fig. 13 Matrix before SVD]

The matrix in Fig. 13 holds the values of the original DTM, whose dimensions are, say, 7355 rows and 10 columns, i.e., 7355 terms and 10 documents. Computing cosine similarity at these dimensions is infeasible, so the matrix is decomposed using (1) and the document vectors are reformulated using (2). The matrix after SVD, shown in Fig. 14, is reduced to 5 x 10, where 5 is the value of k in (1) and (2).

[Fig. 14 Matrix after SVD]

The exactness of the reduced matrix depends on the value of k, as in [22]: a matrix A of order m x n is reduced to a matrix of order k x n with k << m, which takes less space and less time when computing document similarity, and the reduced matrix is the best approximation to the original matrix A.

J. Document Clustering

Clustering is done based on the similarity scores between documents. Any two documents can be compared by applying the cosine similarity measure in (3) to their document vectors [23]:

sim(d_i, d_j) = (d_i · d_j) / (||d_i|| ||d_j||)    (3)

where d_i and d_j are the two document vectors compared. The measure in (3) yields values in [0, 1], ranging from completely dissimilar documents (0) to exactly similar documents (1). For every pair of document vectors in the individual factor matrices and among the reformulated vectors of the Document Term Matrix, the cosine similarity measure is applied and the similarity score between the two documents is computed. A new measure, given in (4), was formed to find the similarity score based on provenance:

SIM_prov(d_i, d_j) = (Σ_f SIMSCORE_f(d_i, d_j)) / F    (4)

where F is the count of factors having SIMSCORE > 0. Table I showcases the similarity between document 1 and several other documents based on all the factors Who, Where, When and What. Most existing methodologies compare documents based only on their content (the What factor); when the additional provenance factors Who, Where and When are taken into account and combined into the overall provenance-based similarity of (4), there is a marked increase in similarity scores, which helps in the successful identification of near-duplicate documents.
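A minimal numpy sketch of equations (1)-(4) as reconstructed above: truncate the SVD of the DTM to rank k, fold the document vectors into the k-dimensional space, and average the per-factor similarity scores. The folding formula in (2) and the averaging in (4) follow the reconstruction given here, which is an assumption where the transcription lost the original formulas.

```python
import numpy as np

def reduce_dtm(dtm, k):
    """(1)-(2): decompose A = U Σ V^T and fold each document d into
    d_hat = Σ_k^{-1} U_k^T d, reducing m-row vectors to k rows, k << m."""
    U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    return (np.diag(1.0 / sk) @ Uk.T) @ dtm  # k x n matrix of reduced doc vectors

def cosine(u, v):
    """(3): cosine similarity; lies in [0, 1] for the raw non-negative
    term-frequency vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def prov_similarity(factor_scores):
    """(4): average the per-factor SIMSCOREs (Who, Where, When, What)
    over the F factors that have SIMSCORE > 0."""
    positive = [s for s in factor_scores if s > 0]
    return sum(positive) / len(positive) if positive else 0.0

dtm = np.random.poisson(1.0, size=(7355, 10)).astype(float)
reduced = reduce_dtm(dtm, k=5)                      # 5 x 10, as in Fig. 14
print(cosine(reduced[:, 0], reduced[:, 1]))
```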

[TABLE I Effectiveness of Provenance in computing similarity between documents: for Document 1 against each other document, the Who, Where, When and What similarity scores and the combined Provenance similarity score]

Using the similarity scores of the documents, the k-means clustering technique [24] is applied to cluster groups of documents found to be highly similar. The number of clusters k is chosen by the rule of thumb in (5):

k = √(N/2)    (5)

where N is the total number of documents. The documents are thereby classified into three types of clusters, highly similar, marginally similar and lowly similar, with respect to their similarity scores. The clusters of highly similar documents in the DTM and in the Provenance Matrix are characterized as near-duplicates. The work in this paper has not yet explored the filtering of near-duplicate documents, the trustworthiness computation and the re-ranking of results on trustworthiness values described below.

K. Filtering

In the elimination process, a decision is taken among the documents classified as near-duplicates as to which one is the original and most trustworthy document compared to the others, which are to be eliminated. This is done with the representative elimination rules of Table II, which analyze and compare the author of the documents, when they were created, where they were published, and their purpose and format, and so decide which document to retain (see the sketch at the end of this section).

TABLE II Representative elimination rules:
1. WHO: compare; if equal return 1, else return 0.
2. If rule 1 returns 0: WHEN: retain the document with the earliest Date_of_Publish.
3. If rule 1 returns 1: WHERE: compare; retain the document with the standardized publication.
4. WHY: check; retain the document with the better purpose.
5. HOW: check; retain the document with the better format.

This representative elimination based upon Web Provenance not only eliminates near-duplicates but also puts emphasis on the retained content being the most trustworthy one.

L. Re-Ranking Based on Trustworthiness Values

The filtered, non-redundant results left after near-duplicates elimination go through trustworthiness calculation. Based on the provenance information, author citations, etc., a trustworthiness value for each document can be calculated from factors such as accountability, maintainability, coverage and authority:
- Accountability deals with author information such as standing, qualification and contactability.
- Maintainability deals with the availability of up-to-date content.
- Coverage deals with the number of working links relative to the total number of links.
- Authority deals with the place where the document has been published.

Each factor carries its own weight in calculating the trustworthiness value of a document. The trustworthiness value associated with each document helps in re-ranking the results; re-ranking the filtered, non-duplicate documents by trustworthiness ensures that users get the best, i.e., most trustworthy, non-duplicate results at the top of the search results for their query.
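The sketch below strings the clustering and filtering stages together under the rule of thumb in (5): cluster documents by their pairwise similarity vectors with k-means, then apply the representative elimination rules of Table II to each near-duplicate group. scikit-learn's KMeans stands in for the k-means of [24], and the rule encoding (including the "standard_venue" flag) is a hedged reading of Table II rather than the paper's implementation.

```python
import math
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the k-means of [24]

def cluster_documents(sim_matrix):
    """Cluster documents using their similarity-score rows as features,
    with k chosen by the rule of thumb in (5): k = sqrt(N / 2)."""
    n_docs = sim_matrix.shape[0]
    k = max(1, round(math.sqrt(n_docs / 2)))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sim_matrix)

def pick_representative(docs):
    """Table II, hedged reading: among a group of near-duplicates, keep the
    document judged original/most trustworthy; the rest are eliminated."""
    same_author = all(d["who"] == docs[0]["who"] for d in docs)    # rule 1: WHO
    if not same_author:
        return min(docs, key=lambda d: d["when"])                  # rule 2: earliest publication
    standardized = [d for d in docs if d.get("standard_venue")]    # rule 3: WHERE
    return standardized[0] if standardized else docs[0]            # rules 4-5 (WHY/HOW) unexplored

sims = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
print(cluster_documents(sims))
```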

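A hedged sketch of the trustworthiness score used for re-ranking: each of the four factors gets a weight and the document score is their weighted sum. The weight values and the 0-1 factor scales are illustrative assumptions, since the paper leaves the exact weighting to future work.

```python
# Illustrative weights; the paper assigns each factor its own weight
# but does not fix the values, so these numbers are assumptions.
WEIGHTS = {"accountability": 0.3,   # author standing, qualification, contactability
           "maintainability": 0.2,  # availability of up-to-date content
           "coverage": 0.2,         # working links / total links
           "authority": 0.3}        # standing of the publication venue

def trustworthiness(factors: dict) -> float:
    """Weighted sum of factor scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

def rerank(documents):
    """Re-rank the filtered, non-duplicate results by trustworthiness."""
    return sorted(documents, key=lambda d: trustworthiness(d["factors"]), reverse=True)
```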
IV. RESULTS AND DISCUSSIONS

Using the specialized browser, the user issues a search query, for example "Detecting near-duplicates for web crawling", as demonstrated in Fig. 15. The top 20 search results are automatically downloaded by the browser; only web pages, i.e., HTML documents, are taken into account in the proposed work. These 20 HTML documents are first preprocessed.

[Fig. 15 Search results for a query]

Based on the feature extraction, a Document Term Matrix is constructed, resembling the one in Fig. 16, where terms occupy the rows and columns represent the documents; the values in the matrix are the frequency of each term in each document. This matrix corresponds to the What factor in provenance.

[Fig. 16 Document Term Matrix (DTM)]

[Fig. 17 Provenance Matrix]

A Provenance Matrix is formed, as represented in Fig. 17, by obtaining the values of the provenance factors for each document. This Provenance Matrix is then split into three individual matrices representing the factors Who, Where and When: the Who tuple expands into an author matrix giving the author of each document; the Where tuple expands into a location matrix indicating whether a document comes from a specific location; and likewise a year matrix, associating documents with their year of publication, is constructed from the When tuple. The cosine similarity measure is then applied to the document vectors in all four matrices, i.e., the reformulated DTM and the Who, Where and When matrices. Documents highly similar according to the DTM, i.e., based on content, are clustered and identified as plagiarized documents, while the highly similar documents from the other individual matrices form another cluster. The documents from both clusters are then combined and termed near-duplicates. Fig. 18 demonstrates the effective identification of near-duplicates in the search results: the near-duplicates shown discuss the same document but are located on different sites. This dissatisfies users, who are served redundant content within the first 20 results; eliminating it makes room for other promising results that would otherwise lag behind on subsequent result pages.

[Fig. 18 Redefined Search Results]

Fig. 19 shows the comparison of 10 documents with respect to each other based on their content, i.e., the keywords, whereas Fig. 20 shows the comparison of the same 10 documents, for the same query, based on the provenance factors (Who, Where, When, Why, How).

[Fig. 19 Comparison based on DTM]

OBSERVATION 1: The clusters of documents classified as highly similar from the DTM are found to be very similar with respect to their content, i.e., plagiarized. In Fig. 19 the documents highly similar to each other are Doc 2, Doc 5, Doc 6, Doc 8 and Doc 9.

[Fig. 20 Comparison based on Provenance Matrix]

OBSERVATION 2: The clusters of documents classified as highly similar from the Provenance Matrix are found to be similar with respect to their provenance factors (Who, When, Where). In Fig. 20 the documents highly similar to each other are Doc 2, Doc 5, Doc 7 and Doc 10.

OBSERVATION 3: The documents that are highly similar in either observation 1 or observation 2 are classified as near-duplicates. From Figs. 19 and 20, these are Doc 2, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9 and Doc 10, since they are found to be highly similar on the content and/or the provenance factors.

NOTE: The documents Doc 7 and Doc 10 remain undetected in observation 1, where comparison is based on content alone; they are detected as similar in observation 2 through comparison based on provenance factors.

This approach of comparing documents on the provenance factors (Who, When, Where, What) in addition to content-based similarity proves very effective in categorizing documents as near-duplicates: on average 3-4 near-duplicates were detected per query, so in the dataset of 400 documents around 80 documents are found to be near-duplicates. The documents classified as near-duplicates by observation 3 then go through the second phase, near-duplicates elimination. The process of eliminating the near-duplicates by applying the representative elimination rules and then ranking the filtered documents by trustworthiness is under experimentation.

V. CONCLUSION AND FUTURE WORK

In this paper we have proposed a method for detecting and eliminating near-duplicates and for re-ranking documents by their trustworthiness values. Our approach uses the Web Provenance concept to ensure that near-duplicates detection and elimination and the trustworthiness calculation are done with semantics, by means of the provenance factors (Who, When, Where, What, Why and How). Future work involves considering the remaining factors Why and How in the Provenance Matrix, performing the filtering of near-duplicates, computing the trustworthiness values of documents, and re-ranking the documents on those values. A further study will be made of the characteristics and properties of Web Provenance in near-duplicates detection and elimination, and of the calculation of trustworthiness in varied web search environments and domains. The architecture of a search engine, or a web crawler, based on Web Provenance can also be designed for semantics-based detection and elimination of near-duplicates, with ranking based on trustworthiness values in addition to the present link-structure techniques, which is expected to make web search more effective.
REFERENCES
[1] Cristina Scheau, Traian Rebedea, Costin Chiru and Stefan Trausan-Matu, "Improving the Relevance of Search Engine Results by Using Semantic Information from Wikipedia," IEEE International Conference, 2010.
[2] Ourania I. Markaki and Dimitris E. Charilas, "Personalization Mechanisms for Content Indexing, Search, Retrieval and Presentation in a Multimedia Search Engine," IEEE, 2009.
[3] Chunshui Zhao, Jun Yan and Ning Liu, "Improve Web Search Ranking by Co-Ranking SVM," Fourth International Conference on Natural Computation, pp. 81-85, 2008.
[4] Prasanna Kumar J and Govindarajulu P, "Duplicate and Near-Duplicate Documents Detection," European Journal of Scientific Research, Vol. 32, No. 4.
[5] George Komatsoulis, "Toward a Functional Model of Data Provenance."
[6] Broder A, Glassman S, Manasse M and Zweig G, "Syntactic Clustering of the Web," 6th International World Wide Web Conference.
[7] Ali Emrouznejad and Gholam R. Amin, "Document Similarity: A New Measure Using OWA," Sixth International Conference on Fuzzy Systems and Knowledge Discovery.
[8] Maosheng Zhong, Yi Hu, Lei Liu and Ruzhan Lu, "A Practical Approach for Relevance Measure of Inter-Sentence," Fifth International Conference on Fuzzy Systems and Knowledge Discovery, 2008.
[9] Ilyinsky S, Kuzmin M, Melkov A and Segalovich I, "An Efficient Method to Detect Duplicates of Web Documents with the Use of Inverted Index," Proceedings of the Eleventh International World Wide Web Conference.
[10] Bar-Yossef Z, Keidar I and Schonfeld U, "Do Not Crawl in the DUST: Different URLs with Similar Text," 16th International World Wide Web Conference, Alberta, Canada, Data Mining Track.
[11] Junping Qiu and Qian Zeng, "Detection and Optimized Disposal of Near-Duplicate Pages," 2nd International Conference on Future Computer and Communication, Vol. 2.
[12] Salha Alzahrani and Naomie Salim, "Fuzzy Semantic-Based String Similarity for Extrinsic Plagiarism Detection."
[13] Krishnamurthy Koduvayur Viswanathan and Tim Finin, "Text Based Similarity Metrics and Delta for Semantic Web Graphs," pp. 17-20, 2010.

[14] Brian J. Corcoran, Nikhil Swamy and Michael Hicks, "Combining Provenance and Security Policies in a Web-Based Document Management System," On-line Proceedings of the Workshop on Principles of Provenance.
[15] Jingwei Huang and Mark S. Fox, "Trust Judgment in Knowledge Provenance," Proceedings of the 16th International Workshop on Database and Expert Systems Applications (DEXA'05).
[16] Olaf Hartig, "Provenance Information in the Web of Data."
[17] Sendhilkumar S and Geetha T. V, "Personalized Ontology for Web Search Personalization," Proceedings of the 1st Bangalore Annual Compute Conference, pp. 1-7.
[18]
[19] Willett P, "The Porter Stemming Algorithm: Then and Now," Program: Electronic Library and Information Systems, 40(3).
[20]
[21] Emmett J. Ientilucci, "Using the Singular Value Decomposition," 2003.
[22] Sudarsun Santhiappan and Venkatesh Prabhu Gopalan, "Finding the Optimal Rank for LSI Models."
[23] E. Garcia, "Cosine Similarity and Term Weight Tutorial."
[24] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence.

BIBLIOGRAPHY

Y. Syed Mudhasir was born at Chennai, India, on 16th June. He is a student currently pursuing his M.E. degree at the College of Engineering Guindy, Anna University, and received his B.Tech. degree from Anna University. His areas of interest include web search, object-oriented programming concepts and database management systems.

S. Sendhilkumar, born at Pondicherry, India, completed his Bachelor of Technology (EEE) at Pondicherry Engineering College and his Master of Technology (IT) at the erstwhile Indian Institute of Information Technology, Pondicherry. In 2009 he finished his Ph.D. at Anna University, Chennai, India. He is currently working as Assistant Professor (Sr. Grade) in the Dept. of Information Science & Technology at Anna University. He has more than 9 years of teaching and research experience and numerous international journal and conference publications to his credit. His research interests include Personalized Web Search, Web Mining, Data Mining, Social Network Analysis and Opinion Mining from Short Texts. He is currently the associate editor for Minmanjari, an e-magazine from the house of INFITT.

G. S. Mahalakshmi, born at Coimbatore, Tamil Nadu, India, completed her B.E. (Computer Science and Engineering) from R.V.S. College of Engineering and Technology, Dindigul, in 1998 and her M.E. (Computer Science and Engineering) from College of Engineering, Anna University, Chennai. She also completed her Ph.D. in 2009 from the same university. She is an Assistant Professor (Senior Grade) in the Department of Computer Science and Engineering, College of Engineering, Anna University, Chennai, and has numerous international journal and conference publications to her credit. Her research interests include Reasoning, Knowledge Sharing and Representation, Text Mining, Social Network Analysis, Robotics, and Natural Language Computing. She is a peer reviewer for reputed publications such as ACM Transactions on Autonomous and Adaptive Systems, and is currently the associate editor for Minmanjari, an e-magazine from the house of INFITT.

J. Deepika was born at Namakkal, India, on 11th May. She is a student currently pursuing her M.E. degree at the College of Engineering Guindy, Anna University, and received her B.E. degree from Anna University. Her areas of interest include text mining and software engineering.


IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

MAINTAIN TOP-K RESULTS USING SIMILARITY CLUSTERING IN RELATIONAL DATABASE

MAINTAIN TOP-K RESULTS USING SIMILARITY CLUSTERING IN RELATIONAL DATABASE INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 MAINTAIN TOP-K RESULTS USING SIMILARITY CLUSTERING IN RELATIONAL DATABASE Syamily K.R 1, Belfin R.V 2 1 PG student,

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

Adaptive and Personalized System for Semantic Web Mining

Adaptive and Personalized System for Semantic Web Mining Journal of Computational Intelligence in Bioinformatics ISSN 0973-385X Volume 10, Number 1 (2017) pp. 15-22 Research Foundation http://www.rfgindia.com Adaptive and Personalized System for Semantic Web

More information

Supporting Fuzzy Keyword Search in Databases

Supporting Fuzzy Keyword Search in Databases I J C T A, 9(24), 2016, pp. 385-391 International Science Press Supporting Fuzzy Keyword Search in Databases Jayavarthini C.* and Priya S. ABSTRACT An efficient keyword search system computes answers as

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Advanced Spam Detection Methodology by the Neural Network Classifier

Advanced  Spam Detection Methodology by the Neural Network Classifier Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 2, February 2014,

More information

Text Mining. Representation of Text Documents

Text Mining. Representation of Text Documents Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,

More information

Theme Identification in RDF Graphs

Theme Identification in RDF Graphs Theme Identification in RDF Graphs Hanane Ouksili PRiSM, Univ. Versailles St Quentin, UMR CNRS 8144, Versailles France hanane.ouksili@prism.uvsq.fr Abstract. An increasing number of RDF datasets is published

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Prof. T.P.Aher(ME), Ms.Rupal R.Boob, Ms.Saburi V.Dhole, Ms.Dipika B.Avhad, Ms.Suvarna S.Burkul 1 Assistant Professor, Computer

More information

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 02, February -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Survey

More information

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION Ms. Nikita P.Katariya 1, Prof. M. S. Chaudhari 2 1 Dept. of Computer Science & Engg, P.B.C.E., Nagpur, India, nikitakatariya@yahoo.com 2 Dept.

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics An Oracle White Paper October 2012 Oracle Social Cloud Platform Text Analytics Executive Overview Oracle s social cloud text analytics platform is able to process unstructured text-based conversations

More information

A Measurement of Similarity to Identify Identical Code Clones

A Measurement of Similarity to Identify Identical Code Clones The International Arab Journal of Information Technology, Vol. 12, No. 6A, 2015 735 A Measurement of Similarity to Identify Identical Code Clones Mythili ShanmughaSundaram and Sarala Subramani Department

More information

Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web

Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web Leena Lulu, Boumediene Belkhouche, Saad Harous College of Information Technology United Arab Emirates University Al Ain, United

More information

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK REVIEW PAPER ON IMPLEMENTATION OF DOCUMENT ANNOTATION USING CONTENT AND QUERYING

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Web Data Extraction and Generating Mashup

Web Data Extraction and Generating Mashup IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 9, Issue 6 (Mar. - Apr. 2013), PP 74-79 Web Data Extraction and Generating Mashup Achala Sharma 1, Aishwarya

More information

Data Extraction and Alignment in Web Databases

Data Extraction and Alignment in Web Databases Data Extraction and Alignment in Web Databases Mrs K.R.Karthika M.Phil Scholar Department of Computer Science Dr N.G.P arts and science college Coimbatore,India Mr K.Kumaravel Ph.D Scholar Department of

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION vi TABLE OF CONTENTS ABSTRACT LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION iii xii xiii xiv 1 INTRODUCTION 1 1.1 WEB MINING 2 1.1.1 Association Rules 2 1.1.2 Association Rule Mining 3 1.1.3 Clustering

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Research Article QOS Based Web Service Ranking Using Fuzzy C-means Clusters

Research Article QOS Based Web Service Ranking Using Fuzzy C-means Clusters Research Journal of Applied Sciences, Engineering and Technology 10(9): 1045-1050, 2015 DOI: 10.19026/rjaset.10.1873 ISSN: 2040-7459; e-issn: 2040-7467 2015 Maxwell Scientific Publication Corp. Submitted:

More information

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER N. Suresh Kumar, Dr. M. Thangamani 1 Assistant Professor, Sri Ramakrishna Engineering College, Coimbatore, India 2 Assistant

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

International Journal of Advanced Computer Technology (IJACT) ISSN: CLUSTERING OF WEB QUERY RESULTS USING ENHANCED K-MEANS ALGORITHM

International Journal of Advanced Computer Technology (IJACT) ISSN: CLUSTERING OF WEB QUERY RESULTS USING ENHANCED K-MEANS ALGORITHM CLUSTERING OF WEB QUERY RESULTS USING ENHANCED K-MEANS ALGORITHM M.Manikantan, Assistant Professor (Senior Grade), Department of MCA, Kumaraguru College of Technology, Coimbatore, Tamilnadu. Abstract :

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS A Semantic Link Network Based Search Engine For Multimedia Files Anuj Kumar 1, Ravi Kumar Singh 2, Vikas Kumar 3, Vivek Patel 4, Priyanka Paygude 5 Student B.Tech (I.T) [1].

More information

Implementation of Smart Question Answering System using IoT and Cognitive Computing

Implementation of Smart Question Answering System using IoT and Cognitive Computing Implementation of Smart Question Answering System using IoT and Cognitive Computing Omkar Anandrao Salgar, Sumedh Belsare, Sonali Hire, Mayuri Patil omkarsalgar@gmail.com, sumedhbelsare@gmail.com, hiresoni278@gmail.com,

More information

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Implementation of Enhanced Web Crawler for Deep-Web Interfaces Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,

More information

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Anil K Goswami 1, Swati Sharma 2, Praveen Kumar 3 1 DRDO, New Delhi, India 2 PDM College of Engineering for

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH Sai Tejaswi Dasari #1 and G K Kishore Babu *2 # Student,Cse, CIET, Lam,Guntur, India * Assistant Professort,Cse, CIET, Lam,Guntur, India Abstract-

More information

Similarities in Source Codes

Similarities in Source Codes Similarities in Source Codes Marek ROŠTÁR* Slovak University of Technology in Bratislava Faculty of Informatics and Information Technologies Ilkovičova 2, 842 16 Bratislava, Slovakia rostarmarek@gmail.com

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

A Review on Identifying the Main Content From Web Pages

A Review on Identifying the Main Content From Web Pages A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Modelling Structures in Data Mining Techniques

Modelling Structures in Data Mining Techniques Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor

More information

Minutiae Based Fingerprint Authentication System

Minutiae Based Fingerprint Authentication System Minutiae Based Fingerprint Authentication System Laya K Roy Student, Department of Computer Science and Engineering Jyothi Engineering College, Thrissur, India Abstract: Fingerprint is the most promising

More information