Near-Duplicates Detection and Elimination Based on Web Provenance for Effective Web Search


(IJIDCS) International Journal on Internet and Distributed Computing Systems, Vol. 1, No. 1

Y. Syed Mudhasir, Department of Computer Science & Engineering, College of Engineering, Anna University, Chennai-25, sydmudhasirmr@gmail.com
J. Deepika, Department of Computer Science & Engineering, College of Engineering, Anna University, Chennai-25, deepi.realmail@gmail.com
S. Sendhilkumar, Department of Information Science & Technology, College of Engineering, Anna University, Chennai-25, ssk_pdy@yahoo.co.in
G. S. Mahalakshmi, Department of Computer Science & Engineering, College of Engineering, Anna University, Chennai-25, mahalakshmi@cs.annauniv.edu

Abstract - Users of the World Wide Web rely on search engines for information retrieval, as search engines play a vital role in finding information on the web. However, the performance of a web search suffers greatly when the results are flooded with redundant information, i.e., near-duplicates. Such near-duplicates hold back other promising results from the user, and many of them come from untrusted websites and/or authors hosting information on the web. Such near-duplicates may be eliminated by means of provenance. This paper therefore proposes a novel approach to identifying near-duplicates based on provenance. A provenance model is built over the web pages returned as search results by an existing search engine, and the model combines both content-based and trust-based factors to classify each result as original or near-duplicate.

Keywords - Web search, Near-duplicates, Provenance, Semantics, Trustworthiness

I. INTRODUCTION

Finding information on the internet has become a day-to-day task for billions of users, so it is important that users get the best results for their queries. However, any web search environment faces challenges in providing the most relevant, useful and trustworthy results, namely:
- the lack of semantics in the web,
- the enormous number of near-duplicate documents, and
- the lack of emphasis on the trustworthiness of documents.

Many other factors also affect the performance of a web search, and research continues on optimizing it: semantic analysis of the web to provide relevant results [1], optimization of indexing functions to improve the storage and retrieval of web documents [2], and optimization of ranking functions to place the best documents at the top of the results [3]. The efficiency of all these approaches, however, depends on the amount of data available over the internet. Information on the WWW is enormous and redundant, and this gives rise to the problem of near-duplicate documents. Subsections I-A and I-B briefly discuss the concepts of near-duplicates detection and provenance.

A. Near-Duplicates Detection

Near-duplicate documents can be identified by scanning the content of every document: when two documents have identical content, they are regarded as duplicates, while files that bear small dissimilarities, and so are not identified as exact duplicates of each other yet are alike to a remarkable extent, are known as near-duplicates.
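To make the distinction concrete, here is a minimal Python sketch: exact duplicates can be caught with a content hash, while near-duplicates need a similarity measure and a threshold. The word-set measure and the 0.9 threshold here are illustrative assumptions, not taken from this paper.

```python
import hashlib

def is_exact_duplicate(doc_a: str, doc_b: str) -> bool:
    # Identical content -> identical hash -> exact duplicate.
    return hashlib.sha1(doc_a.encode()).digest() == hashlib.sha1(doc_b.encode()).digest()

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.9) -> bool:
    # Small dissimilarities defeat hashing, so compare word sets instead
    # (Jaccard overlap); the 0.9 threshold is illustrative, not prescribed.
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return (len(a & b) / len(a | b) >= threshold) if (a | b) else True
```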
Some examples of near-duplicate documents, following [4], are:
- documents differing in only a few words, the most widespread form of near-duplicates;
- documents with the same content but different formatting, for instance the same text in dissimilar fonts, bold type or italics;
- documents with the same content but with typographical errors;
- plagiarized documents and different versions of a document;
- documents with the same content but a different file type, for instance Microsoft Word and PDF;
- documents providing the same information, written by the same author, published in more than one domain.

There are several existing approaches based on syntactic comparison, on URLs, and on semantic comparison. This paper suggests an effective way of identifying and eliminating near-duplicates using provenance, which allows documents to be compared on provenance factors.

B. Provenance

One cause of the growth of near-duplicates on the web is the ease with which one can access web data, together with the lack of semantics in near-duplicates detection techniques. It has also become extremely difficult to judge the trustworthiness of web documents when different versions/formats of the same content exist. Hence the need to bring semantics, that is, meaningful comparison, into near-duplicates detection with the help of the 6W factors: Who (authored the document), What (the content of the document), When (it was made available), Where (it is available), Why (the purpose of the document), and How (the format in which it was published and how it has been maintained) [5]. A quantitative measure of how reliable an arbitrary piece of data is can be determined from this provenance information, and the information is useful both for representative elimination during the near-duplicate detection process and for calculating the trustworthiness of each document. Existing approaches to near-duplicates detection and elimination give little importance to the trustworthiness of the content retrieved through web search. Provenance-based factors may therefore be used for near-duplicates detection and elimination, providing the user with the most trustworthy results.

II. RELATED WORK

A. Detection and Elimination of Near-Duplicates

There are many works on near-duplicates detection and elimination. In general they may be broadly classified, as shown in Fig. 1, into syntactic, URL based and semantic approaches.

[Fig. 1 Near-duplicates detection techniques: Syntactic (shingling, signature, pair-wise similarity), URL based, and Semantic (sentence-wise similarity, fuzziness, semantic-based graphs)]

1) Syntactic Approaches: One of the earliest, by Broder et al. [6], is a technique for estimating the degree of similarity among pairs of documents, known as shingling. It does not rely on any linguistic knowledge other than the ability to tokenize documents into lists of words, i.e., it is purely syntactic. All sequences (shingles) of adjacent words are extracted; if two documents contain the same set of shingles they are considered equivalent and can be termed near-duplicates. The problem of text-based document similarity was also investigated with a new measure that computes the pair-wise similarity of documents from a given series of terms, together with a kappa measure for document similarity; an ordered weighted averaging (OWA) operator then aggregates the similarity measures over a set of documents [7]. Reference [8] shows another approach: a similarity measure can be acquired by comparing the surface tokens of sentences, but a relevance measure can be obtained only by comparing the interior meaning of the sentences. It describes a method to explore the quantified conceptual relations of word pairs using the definition of a lexical item, and proposes a practical approach to measuring inter-sentence relevance. An approach based on the signature concept, as in [9], suggests a method of descriptive words for defining near-duplicates, based on choosing N words from the index to form a signature of a document. Any search engine based on an inverted index can apply this method, and any two documents with similar signatures are termed near-duplicates. When the shingle-based method and the signature method are compared, the signature method is more efficient in the presence of an inverted index.
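As a rough illustration of the shingling idea of [6], the sketch below extracts all w-word shingles from each document and treats a high Jaccard overlap between the shingle sets as a near-duplicate signal. The shingle width w = 4 and any decision threshold are illustrative assumptions, not the parameters of the original work.

```python
def shingles(text: str, w: int = 4) -> set:
    """All contiguous w-word sequences (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(doc_a: str, doc_b: str, w: int = 4) -> float:
    """Jaccard overlap of the two shingle sets, as in Broder et al. [6]."""
    sa, sb = shingles(doc_a, w), shingles(doc_b, w)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

# Documents whose resemblance exceeds a chosen threshold are flagged as
# near-duplicates; identical shingle sets mean the documents are equivalent.
print(resemblance("a b c d e f", "a b c d e g"))  # 0.5
```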
The syntactic approaches above carry out only a text-based comparison and do not involve URLs or link-structure techniques in identifying near-duplicates. The following subsection discusses the impact of URL-based approaches on near-duplicates detection.

2) URL-Based Approaches: DustBuster, a novel algorithm for uncovering DUST (Different URLs with Similar Text), is intended to discover rules that transform a given URL into others that are likely to have similar content. DustBuster mines previous crawl logs or web server logs, instead of probing page contents, to find DUST efficiently. With information about DUST, search engines can increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank [10]. Reference [11] shows another approach in which the detection process is divided into three steps: (1) removal according to URLs, i.e., removing pages with the same URL from the initial set so that the same page is not downloaded repeatedly through repeated links; (2) removing miscellaneous information from the pages and extracting the texts, i.e., pre-processing the pages to strip navigation information, advertising, HTML tags and other miscellaneous content; (3) detecting similar pages with the DDW algorithm. The combination of such URL-based approaches with syntactic approaches is still not sufficient, as they lack semantics in identifying near-duplicates. The following subsection briefly discusses a few semantic approaches.

3) Semantic Approaches: A method for plagiarism detection using a fuzzy semantic-based string similarity approach was proposed in [12], developed in four main stages. First is pre-processing, which includes tokenization, stemming and stop-word removal. Second is retrieving a list of candidate documents for each suspicious document using shingling and the Jaccard coefficient. Suspicious documents are then compared sentence-wise with the associated candidate documents; this stage computes a fuzzy degree of similarity that ranges between two edges, 0 for completely different sentences and 1 for exactly identical sentences, and two sentences are marked as similar (i.e., plagiarized) if their fuzzy similarity score exceeds a certain threshold. The last stage is post-processing, whereby consecutive sentences are joined to form single paragraphs/sections. Recognizing that two Semantic Web documents or graphs are similar, and characterizing their differences, is useful in many tasks, including retrieval, updating, version control and knowledge-base editing. A number of text-based similarity metrics that characterize the relation between Semantic Web graphs are discussed in [13] and evaluated for three specific cases of similarity: similarity in the classes and properties used while differing only in literal content, difference only in base-URI, and versioning relationships. Such techniques are inadequate because the emphasis is on providing relevant content rather than on the trust, originality or authenticity of the documents. Provenance is therefore likely to play an important role in near-duplicates detection.

B. Provenance

Research on provenance gives importance to the trustworthiness of content. In general, works on provenance techniques may be classified, as shown in Fig. 2, into workflow-oriented, network-oriented, trustworthiness-oriented and provenance-collecting techniques.

[Fig. 2 Provenance techniques: workflow oriented, network oriented, trustworthiness, collecting provenance, provenance graphs]

An approach based on a Collaborative Planning Application (CPA) helps users organize information, potentially at a variety of security levels, in the style of a blog/wiki, and data provenance is a natural fit to it. Labels consist of access-control labels, which are lists of groups of users with read and write access, and provenance labels, which comprise the ProvAction labels that have affected the labeled data. As with most wiki software, a log of how, when, and by whom pages are edited is a matter of concern; to implement data provenance, every public function that modifies the state of the wiki is wrapped so as to update the appropriate label. The provenance policies cover creating, modifying, deleting, restoring and relabeling blocks [14].

1) Trustworthiness: Another line of provenance research is knowledge provenance, which determines the validity and origin of web information by modeling and maintaining information sources and information dependencies. It constructs a trust judgment model in which judgments rest on the following factors: (1) the trustworthiness of the information creator can stand for the trustworthiness of the information created; (2) trust can be placed in how a trusted individual behaves, and this type of trust is intransitive; (3) trust can be placed in what a trusted friend believes to be true in a field, and this type of trust is transitive and can propagate through social networks; (4) trust in an organization in a field can be transferred to a professional member of that organization. The emphasis is on the trustworthiness of the content and on measures of that trustworthiness obtained through social networks [15].
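As a toy rendering of factors (2) and (3) above, the sketch below propagates "belief" trust along chains while refusing to chain "behavior" trust. The graph structure and scores are illustrative assumptions, not the model of [15].

```python
# Toy trust-judgment sketch: "belief" trust propagates along chains
# (factor 3), while "behavior" trust does not (factor 2).
TRUST = {  # (truster, trustee) -> (kind, score in [0, 1])
    ("alice", "bob"): ("belief", 0.9),
    ("bob", "carol"): ("belief", 0.8),
    ("alice", "dave"): ("behavior", 0.9),
    ("dave", "erin"): ("belief", 0.7),
}

def belief_trust(truster, target, seen=()):
    """Best transitive belief-trust score from truster to target."""
    best = 0.0
    for (a, b), (kind, score) in TRUST.items():
        if a != truster or kind != "belief" or b in seen:
            continue  # behavior trust is intransitive: no hop through it
        if b == target:
            best = max(best, score)
        else:
            best = max(best, score * belief_trust(b, target, seen + (b,)))
    return best

print(belief_trust("alice", "carol"))  # 0.9 * 0.8 = 0.72
```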
2) Collecting Provenance Information: Recording provenance information is a fundamental topic of provenance research, as discussed in [16]. While traditional provenance research usually addresses the creation of data, this provenance model also represents data access in the context of web data. A system that applies the model generates provenance graphs for data items. Some pieces of provenance information can be recorded by the system itself; for others it relies on metadata provided by third parties, so recordable provenance information and metadata-reliant provenance information are properly distinguished. Provenance information common to all provenance elements of this type includes the access time and the access method, and a provenance element can describe the creation and expiration dates. Provenance-relevant metadata is either attached directly to a data item or its host document, or is available as additional data on the web; examples of attached metadata are RDF statements about the RDF graph that contains them, or the author and creation date of blog entries. A provenance-based near-duplicates detection and elimination process will therefore help retain the original, i.e., most trustworthy, documents and eliminate the other replicas, removing the overhead that near-duplicate documents impose on the search results in any web search environment.

III. WEB PROVENANCE BASED DETECTION AND ELIMINATION OF NEAR-DUPLICATES

The entire process of web provenance based near-duplicates detection and elimination is represented in the architecture shown in Fig. 3, which comprises the following components:
(i) data collection, (ii) preprocessing, (iii) Document Term Matrix (DTM) construction, (iv) Provenance Matrix (PM) construction, (v) database, (vi) Singular Value Decomposition, (vii) document clustering based on similarity scores, (viii) filtering, and (ix) re-ranking based on trustworthiness values.

[Fig. 3 Web Search Optimization Based on Web Provenance: data collection -> preprocessing -> DTM and PM construction -> database -> Singular Value Decomposition -> document comparison and clustering (similarity scores) -> filtering of near-duplicates -> trustworthiness calculation -> re-ranking -> refined results]

A. Data Collection

In this work a web search application is utilized. The aim of the project is to identify and eliminate near-duplicates among the top 20 search results for a given query. Since the data needs to be collected automatically, a special web search browser, as in [17], which collects and stores web documents without user intervention, is used. Currently 20 search queries have been collected and 400 web documents downloaded for the purpose of near-duplicates detection. These documents need to be indexed using keywords that highlight their content, but the collected documents were of varied formats, with extensions .html, .pdf, .ppt, .doc, .ps, etc.; currently only web pages, i.e., HTML documents, are considered in this work. The documents have to be preprocessed as discussed in the following section.

B. Preprocessing

The preprocessing of a document involves: (i) removing HTML tags and scripting elements (parsing), (ii) tokenizing, (iii) removal of stop words, (iv) stemming, and (v) feature extraction based on term frequency.

C. Parsing

The first step of preprocessing is parsing. A parser was designed specifically to extract the textual content from any HTML document. Part of the source file of a typical web page resembles the one in Fig. 4, with its HTML tags.

Fig. 4 Sample input to Parser:
<br><img alt="ieee Xplore Digital Library" src="24_files/logo.xplore.gif" width=230 height=45>ieee
<P id=pagetop class=jumplink>We construct an automatic secure fingerprint verification system based on fuzzy vault scheme addressing a major security hole currently in most biometric systems.</p>
<A href=" ">Acoustics, Speech, and Signal Processing, 2005</A>

When the sample input in Fig. 4 is given to the parser function, it removes all tags and scripting elements and returns only the textual content of the document, as shown in Fig. 5.

Fig. 5 Sample output from Parser:
IEEE We construct an automatic secure fingerprint verification system based on fuzzy vault scheme addressing a major security hole currently in most biometric systems. Acoustics, Speech, and Signal Processing, 2005

The output of the parser, as shown in Fig. 5, acts as the input to the stop-words removal process.

D. Stop Words Removal (SWR)

A list of stop words such as "of", "the", "is", "are", "were", etc., available in [18], is used for this purpose. Each stop word is checked for in the given input text and removed if found, so the text then contains only the keywords that highlight the content of the document. The input is as shown in Fig. 5 and the output obtained after stop-words removal is represented in Fig. 6.

Fig. 6 Sample output from SWR:
IEEE Construct Automatic Secure Fingerprint Verification System Fuzzy Vault Scheme Addressing Major Security Hole Currently Biometric Systems Acoustics Speech Signal Processing

The filtered tokens obtained from the stop-words removal process, as shown in Fig. 6, go as input to the stemming process.
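A minimal sketch of the preprocessing chain of Sections III-C through III-E (parsing, tokenizing, stop-word removal, stemming), assuming Python's standard library plus NLTK's Porter stemmer as a stand-in for [19]; the stop-word list here is a tiny illustrative subset of the list in [18].

```python
from html.parser import HTMLParser
from nltk.stem import PorterStemmer  # stand-in for the Porter algorithm [19]

STOP_WORDS = {"of", "the", "is", "are", "were", "a", "an", "on", "in"}  # subset of [18]

class TextExtractor(HTMLParser):
    """Parser step: keep textual content, dropping tags and script/style bodies."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], False
    def handle_starttag(self, tag, attrs):
        self._skip = tag in ("script", "style")
    def handle_endtag(self, tag):
        self._skip = False
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def preprocess(html: str) -> list:
    extractor = TextExtractor()
    extractor.feed(html)
    tokens = "".join(extractor.parts).lower().split()    # tokenizing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]             # stemming

tokens = preprocess("<p>The fingerprint verification systems</p>")
# stop words dropped, remaining tokens stemmed to their root forms
```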

E. Stemming

Stemming is the process of obtaining root words from the derived words present in the filtered tokens. The Porter stemming algorithm [19] is used to identify the root words. In Fig. 6 there are keywords such as "Addressing", "Systems" and "Verification" that do not highlight the exact keyword content; such keywords are passed to the stemming module, where suffixes ("ing", "es", "s", "ity", etc.) and prefixes ("ir", "un", "im", etc.) are stripped off to obtain the root words, as highlighted in Fig. 7.

Fig. 7 Sample output from Stemmer:
IEEE Construct Automatic Secure Fingerprint Verify System Fuzzy Vault Scheme Address Major Security Hole Current Biometric System Acoustic Speech Signal Process

Hence each document is finally represented as a set of keywords, which supports the formulation of the Document Term Matrix for indexing and storage in the database.

F. DTM Construction

A document-term matrix (or term-document matrix) describes the frequency of terms occurring in a collection of documents, as in [20]. The input to this module is as shown in Fig. 7. The term frequency is computed for each term in the given input, and a matrix of order m x n is formed, as represented in Fig. 8, where m is the number of terms and n the total number of documents; the values of the matrix are the term frequencies of each term in each document.

[Fig. 8 Document Term Matrix (DTM): terms such as Accept, Access, Accomplish, Burger, Burn, Cambridge, Camera and Candid as rows; doc 1, doc 2, doc 3 as columns; term frequencies as cell values]

G. Provenance Matrix Construction

In this work we propose a matrix called the Provenance Matrix (PM), defined as follows.

Definition 3.1 (Provenance Matrix): Given a set of web pages/documents and their information Who, When, Where and What for each document, a Provenance Matrix may be defined in which the rows represent these factors and the columns represent the documents.

The quad-tuple of provenance-factor values for each document is obtained and a Provenance Matrix is formed, as shown in Fig. 9.

[Fig. 9 Provenance Matrix: rows Who (authors such as Andrew McCallum, Kamal Nigam, Lyle H. Ungar, Jing Luo), Where (IEEE, ACM), When (year of publication) and What (the content, as captured by the DTM); columns doc 1, doc 2, doc 3]

The Provenance Matrix is further elaborated into individual matrices for each of the provenance factors; Figs. 10, 11 and 12 represent the Who, Where and When matrices respectively. The two other factors, Why and How, remain to be explored.

[Fig. 10 Sample Who matrix: authors as rows, documents as columns]
[Fig. 11 Sample Where matrix: publication locations such as IEEE and ACM as rows, documents as columns]
[Fig. 12 Sample When matrix: years as rows, documents as columns]

The boolean values in the individual matrices of Figs. 10, 11 and 12 mark the author, location and year of creation of each document: 1 represents existence, 0 its absence.

H. Database Component

The Document Term Matrix, the Provenance Matrix and the three individual matrices are stored in the database, representing each document as vectors with respect to both content and provenance factors. These matrices are updated dynamically as each document vector enters, and they are used for computing the similarity between documents by retrieving any two document vectors for comparison based on both content and the other provenance factors.
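A sketch of how the DTM and the per-factor boolean matrices could be assembled from preprocessed documents; the metadata field names ("who", "where", "when") and helper structure are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def build_dtm(docs_tokens):
    """Document Term Matrix: rows = terms, columns = documents,
    cells = frequency of the term in each document."""
    vocab = sorted({t for tokens in docs_tokens for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    dtm = np.zeros((len(vocab), len(docs_tokens)), dtype=int)
    for j, tokens in enumerate(docs_tokens):
        for t in tokens:
            dtm[index[t], j] += 1
    return vocab, dtm

def build_factor_matrix(docs_meta, factor):
    """Boolean matrix for one provenance factor (Who/Where/When):
    rows = observed values, columns = documents; 1 marks existence."""
    values = sorted({m[factor] for m in docs_meta})
    mat = np.zeros((len(values), len(docs_meta)), dtype=int)
    for j, m in enumerate(docs_meta):
        mat[values.index(m[factor]), j] = 1
    return values, mat

docs_meta = [{"who": "Andrew McCallum", "where": "IEEE", "when": 2005},
             {"who": "Lyle H. Ungar", "where": "ACM", "when": 2005}]
authors, who_matrix = build_factor_matrix(docs_meta, "who")
```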

I. Singular Value Decomposition

Owing to the high dimensionality of the DTM, further processing of the documents becomes infeasible, so Singular Value Decomposition, as detailed in [21], is needed to reduce the dimensions of the matrix and reformulate the document vectors. The original DTM A can be decomposed as in (1):

A = U Σ V^T    (1)

where A is the m x n DTM, the columns of U and V are, respectively, the left- and right-singular vectors for the corresponding singular values, and the diagonal entries of Σ are the singular values of A. The document vectors can then be reformulated as in (2):

d̂_j = Σ_k^{-1} U_k^T d_j    (2)

where each document vector d_j, which has m rows, is reduced to d̂_j, which has only k rows, with k very much less than m. This reformulation reduces the complexity of applying similarity measures to the document vectors.

[Fig. 13 Matrix before SVD]

The matrix in Fig. 13 holds the values of the original DTM, whose dimensions are, say, 7355 rows and 10 columns, i.e., 7355 terms and 10 documents. Computing cosine similarity at these dimensions is infeasible, so the matrix is decomposed using (1) and the document vectors are reformulated using (2). The matrix after SVD, shown in Fig. 14, is reduced to 5 x 10, where 5 is the value of k in (1) and (2).

[Fig. 14 Matrix after SVD]

The exactness of the reduced matrix depends on the value of k, as in [22]: a matrix A of order m x n is reduced to a matrix of order k x n with k << m, which takes less space and less time when computing document similarity, and the reduced matrix is the best approximation to the original matrix A.

J. Document Clustering

Clustering is done based on the similarity scores between documents. Any two documents can be compared by applying the cosine similarity measure in (3) to their document vectors [23]:

sim(d_i, d_j) = (d_i · d_j) / (||d_i|| ||d_j||)    (3)

where d_i and d_j are the two document vectors compared. The measure in (3) yields values in [0, 1], ranging from completely dissimilar documents (0) to exactly similar documents (1). For every pair of document vectors in the individual factor matrices and among the reformulated vectors of the Document Term Matrix, the cosine similarity measure is applied and the similarity score between the two documents is computed. A new measure, given in (4), was formed to find the similarity score based on provenance:

SIM_prov(d_i, d_j) = (Σ_f SIMSCORE_f(d_i, d_j)) / F    (4)

where F is the count of factors having SIMSCORE > 0. Table I showcases the similarity between document 1 and several other documents based on all the factors Who, Where, When and What. Most existing methodologies compare documents based only on their content (the What factor); when the additional provenance factors Who, Where and When are taken into account and combined into the overall provenance-based similarity of (4), there is a marked increase in similarity scores, which helps in the successful identification of near-duplicate documents.
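A minimal numpy sketch of equations (1)-(4) as reconstructed above: truncate the SVD of the DTM to rank k, fold the document vectors into the k-dimensional space, and average the per-factor similarity scores. The folding formula in (2) and the averaging in (4) follow the reconstruction given here, which is an assumption where the transcription lost the original formulas.

```python
import numpy as np

def reduce_dtm(dtm, k):
    """(1)-(2): decompose A = U Σ V^T and fold each document d into
    d_hat = Σ_k^{-1} U_k^T d, reducing m-row vectors to k rows, k << m."""
    U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    return (np.diag(1.0 / sk) @ Uk.T) @ dtm  # k x n matrix of reduced doc vectors

def cosine(u, v):
    """(3): cosine similarity; lies in [0, 1] for the raw non-negative
    term-frequency vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def prov_similarity(factor_scores):
    """(4): average the per-factor SIMSCOREs (Who, Where, When, What)
    over the F factors that have SIMSCORE > 0."""
    positive = [s for s in factor_scores if s > 0]
    return sum(positive) / len(positive) if positive else 0.0

dtm = np.random.poisson(1.0, size=(7355, 10)).astype(float)
reduced = reduce_dtm(dtm, k=5)                      # 5 x 10, as in Fig. 14
print(cosine(reduced[:, 0], reduced[:, 1]))
```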

[TABLE I Effectiveness of Provenance in computing similarity between documents: for Document 1 against each other document, the Who, Where, When and What similarity scores and the combined Provenance similarity score]

Using the similarity scores of the documents, the k-means clustering technique [24] is applied to cluster groups of documents found to be highly similar. The number of clusters k is chosen by the rule of thumb in (5):

k = √(N/2)    (5)

where N is the total number of documents. The documents are thereby classified into three types of clusters, highly similar, marginally similar and lowly similar, with respect to their similarity scores. The clusters of highly similar documents in the DTM and in the Provenance Matrix are characterized as near-duplicates. The work in this paper has not yet explored the filtering of near-duplicate documents, the trustworthiness computation and the re-ranking of results on trustworthiness values described below.

K. Filtering

In the elimination process, a decision is taken among the documents classified as near-duplicates as to which one is the original and most trustworthy document compared to the others, which are to be eliminated. This is done with the representative elimination rules of Table II, which analyze and compare the author of the documents, when they were created, where they were published, and their purpose and format, and so decide which document to retain (see the sketch at the end of this section).

TABLE II Representative elimination rules:
1. WHO: compare; if equal return 1, else return 0.
2. If rule 1 returns 0: WHEN: retain the document with the earliest Date_of_Publish.
3. If rule 1 returns 1: WHERE: compare; retain the document with the standardized publication.
4. WHY: check; retain the document with the better purpose.
5. HOW: check; retain the document with the better format.

This representative elimination based upon Web Provenance not only eliminates near-duplicates but also puts emphasis on the retained content being the most trustworthy one.

L. Re-Ranking Based on Trustworthiness Values

The filtered, non-redundant results left after near-duplicates elimination go through trustworthiness calculation. Based on the provenance information, author citations, etc., a trustworthiness value for each document can be calculated from factors such as accountability, maintainability, coverage and authority:
- Accountability deals with author information such as standing, qualification and contactability.
- Maintainability deals with the availability of up-to-date content.
- Coverage deals with the number of working links relative to the total number of links.
- Authority deals with the place where the document has been published.

Each factor carries its own weight in calculating the trustworthiness value of a document. The trustworthiness value associated with each document helps in re-ranking the results; re-ranking the filtered, non-duplicate documents by trustworthiness ensures that users get the best, i.e., most trustworthy, non-duplicate results at the top of the search results for their query.
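The sketch below strings the clustering and filtering stages together under the rule of thumb in (5): cluster documents by their pairwise similarity vectors with k-means, then apply the representative elimination rules of Table II to each near-duplicate group. scikit-learn's KMeans stands in for the k-means of [24], and the rule encoding (including the "standard_venue" flag) is a hedged reading of Table II rather than the paper's implementation.

```python
import math
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the k-means of [24]

def cluster_documents(sim_matrix):
    """Cluster documents using their similarity-score rows as features,
    with k chosen by the rule of thumb in (5): k = sqrt(N / 2)."""
    n_docs = sim_matrix.shape[0]
    k = max(1, round(math.sqrt(n_docs / 2)))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sim_matrix)

def pick_representative(docs):
    """Table II, hedged reading: among a group of near-duplicates, keep the
    document judged original/most trustworthy; the rest are eliminated."""
    same_author = all(d["who"] == docs[0]["who"] for d in docs)    # rule 1: WHO
    if not same_author:
        return min(docs, key=lambda d: d["when"])                  # rule 2: earliest publication
    standardized = [d for d in docs if d.get("standard_venue")]    # rule 3: WHERE
    return standardized[0] if standardized else docs[0]            # rules 4-5 (WHY/HOW) unexplored

sims = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
print(cluster_documents(sims))
```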

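A hedged sketch of the trustworthiness score used for re-ranking: each of the four factors gets a weight and the document score is their weighted sum. The weight values and the 0-1 factor scales are illustrative assumptions, since the paper leaves the exact weighting to future work.

```python
# Illustrative weights; the paper assigns each factor its own weight
# but does not fix the values, so these numbers are assumptions.
WEIGHTS = {"accountability": 0.3,   # author standing, qualification, contactability
           "maintainability": 0.2,  # availability of up-to-date content
           "coverage": 0.2,         # working links / total links
           "authority": 0.3}        # standing of the publication venue

def trustworthiness(factors: dict) -> float:
    """Weighted sum of factor scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

def rerank(documents):
    """Re-rank the filtered, non-duplicate results by trustworthiness."""
    return sorted(documents, key=lambda d: trustworthiness(d["factors"]), reverse=True)
```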
IV. RESULTS AND DISCUSSIONS

Using the specialized browser, the user issues a search query, for example "Detecting near-duplicates for web crawling", as demonstrated in Fig. 15. The top 20 search results are automatically downloaded by the browser; only web pages, i.e., HTML documents, are taken into account in the proposed work. These 20 HTML documents are first preprocessed.

[Fig. 15 Search results for a query]

Based on the feature extraction, a Document Term Matrix is constructed, resembling the one in Fig. 16, where terms occupy the rows and columns represent the documents; the values in the matrix are the frequency of each term in each document. This matrix corresponds to the What factor in provenance.

[Fig. 16 Document Term Matrix (DTM)]

[Fig. 17 Provenance Matrix]

A Provenance Matrix is formed, as represented in Fig. 17, by obtaining the values of the provenance factors for each document. This Provenance Matrix is then split into three individual matrices representing the factors Who, Where and When: the Who tuple expands into an author matrix giving the author of each document; the Where tuple expands into a location matrix indicating whether a document comes from a specific location; and likewise a year matrix, associating documents with their year of publication, is constructed from the When tuple. The cosine similarity measure is then applied to the document vectors in all four matrices, i.e., the reformulated DTM and the Who, Where and When matrices. Documents highly similar according to the DTM, i.e., based on content, are clustered and identified as plagiarized documents, while the highly similar documents from the other individual matrices form another cluster. The documents from both clusters are then combined and termed near-duplicates. Fig. 18 demonstrates the effective identification of near-duplicates in the search results: the near-duplicates shown discuss the same document but are located on different sites. This dissatisfies users, who are served redundant content within the first 20 results; eliminating it makes room for other promising results that would otherwise lag behind on subsequent result pages.

[Fig. 18 Redefined Search Results]

Fig. 19 shows the comparison of 10 documents with respect to each other based on their content, i.e., the keywords, whereas Fig. 20 shows the comparison of the same 10 documents, for the same query, based on the provenance factors (Who, Where, When, Why, How).

[Fig. 19 Comparison based on DTM]

OBSERVATION 1: The clusters of documents classified as highly similar from the DTM are found to be very similar with respect to their content, i.e., plagiarized. In Fig. 19 the documents highly similar to each other are Doc 2, Doc 5, Doc 6, Doc 8 and Doc 9.

[Fig. 20 Comparison based on Provenance Matrix]

OBSERVATION 2: The clusters of documents classified as highly similar from the Provenance Matrix are found to be similar with respect to their provenance factors (Who, When, Where). In Fig. 20 the documents highly similar to each other are Doc 2, Doc 5, Doc 7 and Doc 10.

OBSERVATION 3: The documents that are highly similar in either observation 1 or observation 2 are classified as near-duplicates. From Figs. 19 and 20, these are Doc 2, Doc 5, Doc 6, Doc 7, Doc 8, Doc 9 and Doc 10, since they are found to be highly similar on the content and/or the provenance factors.

NOTE: The documents Doc 7 and Doc 10 remain undetected in observation 1, where comparison is based on content alone; they are detected as similar in observation 2 through comparison based on provenance factors.

This approach of comparing documents on the provenance factors (Who, When, Where, What) in addition to content-based similarity proves very effective in categorizing documents as near-duplicates: on average 3-4 near-duplicates were detected per query, so in the dataset of 400 documents around 80 documents are found to be near-duplicates. The documents classified as near-duplicates by observation 3 then go through the second phase, near-duplicates elimination. The process of eliminating the near-duplicates by applying the representative elimination rules and then ranking the filtered documents by trustworthiness is under experimentation.

V. CONCLUSION AND FUTURE WORK

In this paper we have proposed a method for detecting and eliminating near-duplicates and for re-ranking documents by their trustworthiness values. Our approach uses the Web Provenance concept to ensure that near-duplicates detection and elimination and the trustworthiness calculation are done with semantics, by means of the provenance factors (Who, When, Where, What, Why and How). Future work involves considering the remaining factors Why and How in the Provenance Matrix, performing the filtering of near-duplicates, computing the trustworthiness values of documents, and re-ranking the documents on those values. A further study will be made of the characteristics and properties of Web Provenance in near-duplicates detection and elimination, and of the calculation of trustworthiness in varied web search environments and domains. The architecture of a search engine, or a web crawler, based on Web Provenance can also be designed for semantics-based detection and elimination of near-duplicates, with ranking based on trustworthiness values in addition to the present link-structure techniques, which is expected to make web search more effective.
REFERENCES
[1] Cristina Scheau, Traian Rebedea, Costin Chiru and Stefan Trausan-Matu, "Improving the Relevance of Search Engine Results by Using Semantic Information from Wikipedia," IEEE International Conference, 2010.
[2] Ourania I. Markaki and Dimitris E. Charilas, "Personalization Mechanisms for Content Indexing, Search, Retrieval and Presentation in a Multimedia Search Engine," IEEE, 2009.
[3] Chunshui Zhao, Jun Yan and Ning Liu, "Improve Web Search Ranking by Co-Ranking SVM," Fourth International Conference on Natural Computation, pp. 81-85, 2008.
[4] Prasanna Kumar J and Govindarajulu P, "Duplicate and Near-Duplicate Documents Detection," European Journal of Scientific Research, Vol. 32, No. 4.
[5] George Komatsoulis, "Toward a Functional Model of Data Provenance."
[6] Broder A, Glassman S, Manasse M and Zweig G, "Syntactic Clustering of the Web," 6th International World Wide Web Conference.
[7] Ali Emrouznejad and Gholam R. Amin, "Document Similarity: A New Measure Using OWA," Sixth International Conference on Fuzzy Systems and Knowledge Discovery.
[8] Maosheng Zhong, Yi Hu, Lei Liu and Ruzhan Lu, "A Practical Approach for Relevance Measure of Inter-Sentence," Fifth International Conference on Fuzzy Systems and Knowledge Discovery, 2008.
[9] Ilyinsky S, Kuzmin M, Melkov A and Segalovich I, "An Efficient Method to Detect Duplicates of Web Documents with the Use of Inverted Index," Proceedings of the Eleventh International World Wide Web Conference.
[10] Bar-Yossef Z, Keidar I and Schonfeld U, "Do Not Crawl in the DUST: Different URLs with Similar Text," 16th International World Wide Web Conference, Alberta, Canada, Data Mining Track.
[11] Junping Qiu and Qian Zeng, "Detection and Optimized Disposal of Near-Duplicate Pages," 2nd International Conference on Future Computer and Communication, Vol. 2.
[12] Salha Alzahrani and Naomie Salim, "Fuzzy Semantic-Based String Similarity for Extrinsic Plagiarism Detection."
[13] Krishnamurthy Koduvayur Viswanathan and Tim Finin, "Text Based Similarity Metrics and Delta for Semantic Web Graphs," pp. 17-20, 2010.

[14] Brian J. Corcoran, Nikhil Swamy and Michael Hicks, "Combining Provenance and Security Policies in a Web-Based Document Management System," On-line Proceedings of the Workshop on Principles of Provenance.
[15] Jingwei Huang and Mark S. Fox, "Trust Judgment in Knowledge Provenance," Proceedings of the 16th International Workshop on Database and Expert Systems Applications (DEXA'05).
[16] Olaf Hartig, "Provenance Information in the Web of Data."
[17] Sendhilkumar S and Geetha T. V, "Personalized Ontology for Web Search Personalization," Proceedings of the 1st Bangalore Annual Compute Conference, pp. 1-7.
[18]
[19] Willett P, "The Porter Stemming Algorithm: Then and Now," Program: Electronic Library and Information Systems, 40(3).
[20]
[21] Emmett J. Ientilucci, "Using the Singular Value Decomposition," 2003.
[22] Sudarsun Santhiappan and Venkatesh Prabhu Gopalan, "Finding the Optimal Rank for LSI Models."
[23] E. Garcia, "Cosine Similarity and Term Weight Tutorial."
[24] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence.

BIBLIOGRAPHY

Y. Syed Mudhasir was born at Chennai, India, on 16th June. He is a student currently pursuing his M.E. degree at the College of Engineering Guindy, Anna University, and received his B.Tech. degree from Anna University. His areas of interest include web search, object-oriented programming concepts and database management systems.

S. Sendhilkumar, born at Pondicherry, India, completed his Bachelor of Technology (EEE) at Pondicherry Engineering College and his Master of Technology (IT) at the erstwhile Indian Institute of Information Technology, Pondicherry. In 2009 he finished his Ph.D. at Anna University, Chennai, India. He is currently working as Assistant Professor (Sr. Grade) in the Dept. of Information Science & Technology at Anna University. He has more than 9 years of teaching and research experience and numerous international journal and conference publications to his credit. His research interests include Personalized Web Search, Web Mining, Data Mining, Social Network Analysis and Opinion Mining from Short Texts. He is currently the associate editor for Minmanjari, an e-magazine from the house of INFITT.

G. S. Mahalakshmi, born at Coimbatore, Tamil Nadu, India, completed her B.E. (Computer Science and Engineering) from R.V.S. College of Engineering and Technology, Dindigul, in 1998 and her M.E. (Computer Science and Engineering) from College of Engineering, Anna University, Chennai. She also completed her Ph.D. in 2009 from the same university. She is an Assistant Professor (Senior Grade) in the Department of Computer Science and Engineering, College of Engineering, Anna University, Chennai, and has numerous international journal and conference publications to her credit. Her research interests include Reasoning, Knowledge Sharing and Representation, Text Mining, Social Network Analysis, Robotics, and Natural Language Computing. She is a peer reviewer for reputed publications such as ACM Transactions on Autonomous and Adaptive Systems, and is currently the associate editor for Minmanjari, an e-magazine from the house of INFITT.

J. Deepika was born at Namakkal, India, on 11th May. She is a student currently pursuing her M.E. degree at the College of Engineering Guindy, Anna University, and received her B.E. degree from Anna University. Her areas of interest include text mining and software engineering.


IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

MAINTAIN TOP-K RESULTS USING SIMILARITY CLUSTERING IN RELATIONAL DATABASE

MAINTAIN TOP-K RESULTS USING SIMILARITY CLUSTERING IN RELATIONAL DATABASE INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 MAINTAIN TOP-K RESULTS USING SIMILARITY CLUSTERING IN RELATIONAL DATABASE Syamily K.R 1, Belfin R.V 2 1 PG student,

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

Adaptive and Personalized System for Semantic Web Mining

Adaptive and Personalized System for Semantic Web Mining Journal of Computational Intelligence in Bioinformatics ISSN 0973-385X Volume 10, Number 1 (2017) pp. 15-22 Research Foundation http://www.rfgindia.com Adaptive and Personalized System for Semantic Web

More information

Supporting Fuzzy Keyword Search in Databases

Supporting Fuzzy Keyword Search in Databases I J C T A, 9(24), 2016, pp. 385-391 International Science Press Supporting Fuzzy Keyword Search in Databases Jayavarthini C.* and Priya S. ABSTRACT An efficient keyword search system computes answers as

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Advanced Spam Detection Methodology by the Neural Network Classifier

Advanced  Spam Detection Methodology by the Neural Network Classifier Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 2, February 2014,

More information

Text Mining. Representation of Text Documents

Text Mining. Representation of Text Documents Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,

More information

Theme Identification in RDF Graphs

Theme Identification in RDF Graphs Theme Identification in RDF Graphs Hanane Ouksili PRiSM, Univ. Versailles St Quentin, UMR CNRS 8144, Versailles France hanane.ouksili@prism.uvsq.fr Abstract. An increasing number of RDF datasets is published

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Prof. T.P.Aher(ME), Ms.Rupal R.Boob, Ms.Saburi V.Dhole, Ms.Dipika B.Avhad, Ms.Suvarna S.Burkul 1 Assistant Professor, Computer

More information

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 02, February -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Survey

More information

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION Ms. Nikita P.Katariya 1, Prof. M. S. Chaudhari 2 1 Dept. of Computer Science & Engg, P.B.C.E., Nagpur, India, nikitakatariya@yahoo.com 2 Dept.

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics An Oracle White Paper October 2012 Oracle Social Cloud Platform Text Analytics Executive Overview Oracle s social cloud text analytics platform is able to process unstructured text-based conversations

More information

A Measurement of Similarity to Identify Identical Code Clones

A Measurement of Similarity to Identify Identical Code Clones The International Arab Journal of Information Technology, Vol. 12, No. 6A, 2015 735 A Measurement of Similarity to Identify Identical Code Clones Mythili ShanmughaSundaram and Sarala Subramani Department

More information

Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web

Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web Candidate Document Retrieval for Arabic-based Text Reuse Detection on the Web Leena Lulu, Boumediene Belkhouche, Saad Harous College of Information Technology United Arab Emirates University Al Ain, United

More information

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK REVIEW PAPER ON IMPLEMENTATION OF DOCUMENT ANNOTATION USING CONTENT AND QUERYING

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Web Data Extraction and Generating Mashup

Web Data Extraction and Generating Mashup IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 9, Issue 6 (Mar. - Apr. 2013), PP 74-79 Web Data Extraction and Generating Mashup Achala Sharma 1, Aishwarya

More information

Data Extraction and Alignment in Web Databases

Data Extraction and Alignment in Web Databases Data Extraction and Alignment in Web Databases Mrs K.R.Karthika M.Phil Scholar Department of Computer Science Dr N.G.P arts and science college Coimbatore,India Mr K.Kumaravel Ph.D Scholar Department of

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION vi TABLE OF CONTENTS ABSTRACT LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION iii xii xiii xiv 1 INTRODUCTION 1 1.1 WEB MINING 2 1.1.1 Association Rules 2 1.1.2 Association Rule Mining 3 1.1.3 Clustering

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Research Article QOS Based Web Service Ranking Using Fuzzy C-means Clusters

Research Article QOS Based Web Service Ranking Using Fuzzy C-means Clusters Research Journal of Applied Sciences, Engineering and Technology 10(9): 1045-1050, 2015 DOI: 10.19026/rjaset.10.1873 ISSN: 2040-7459; e-issn: 2040-7467 2015 Maxwell Scientific Publication Corp. Submitted:

More information

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER N. Suresh Kumar, Dr. M. Thangamani 1 Assistant Professor, Sri Ramakrishna Engineering College, Coimbatore, India 2 Assistant

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

International Journal of Advanced Computer Technology (IJACT) ISSN: CLUSTERING OF WEB QUERY RESULTS USING ENHANCED K-MEANS ALGORITHM

International Journal of Advanced Computer Technology (IJACT) ISSN: CLUSTERING OF WEB QUERY RESULTS USING ENHANCED K-MEANS ALGORITHM CLUSTERING OF WEB QUERY RESULTS USING ENHANCED K-MEANS ALGORITHM M.Manikantan, Assistant Professor (Senior Grade), Department of MCA, Kumaraguru College of Technology, Coimbatore, Tamilnadu. Abstract :

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS A Semantic Link Network Based Search Engine For Multimedia Files Anuj Kumar 1, Ravi Kumar Singh 2, Vikas Kumar 3, Vivek Patel 4, Priyanka Paygude 5 Student B.Tech (I.T) [1].

More information

Implementation of Smart Question Answering System using IoT and Cognitive Computing

Implementation of Smart Question Answering System using IoT and Cognitive Computing Implementation of Smart Question Answering System using IoT and Cognitive Computing Omkar Anandrao Salgar, Sumedh Belsare, Sonali Hire, Mayuri Patil omkarsalgar@gmail.com, sumedhbelsare@gmail.com, hiresoni278@gmail.com,

More information

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Implementation of Enhanced Web Crawler for Deep-Web Interfaces Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,

More information

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Anil K Goswami 1, Swati Sharma 2, Praveen Kumar 3 1 DRDO, New Delhi, India 2 PDM College of Engineering for

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH Sai Tejaswi Dasari #1 and G K Kishore Babu *2 # Student,Cse, CIET, Lam,Guntur, India * Assistant Professort,Cse, CIET, Lam,Guntur, India Abstract-

More information

Similarities in Source Codes

Similarities in Source Codes Similarities in Source Codes Marek ROŠTÁR* Slovak University of Technology in Bratislava Faculty of Informatics and Information Technologies Ilkovičova 2, 842 16 Bratislava, Slovakia rostarmarek@gmail.com

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

A Review on Identifying the Main Content From Web Pages

A Review on Identifying the Main Content From Web Pages A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Modelling Structures in Data Mining Techniques

Modelling Structures in Data Mining Techniques Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor

More information

Minutiae Based Fingerprint Authentication System

Minutiae Based Fingerprint Authentication System Minutiae Based Fingerprint Authentication System Laya K Roy Student, Department of Computer Science and Engineering Jyothi Engineering College, Thrissur, India Abstract: Fingerprint is the most promising

More information