
International Journal of Scientific & Engineering Research, Volume 5, Issue 10, October-2014

DCCR: Document Clustering by Conceptual Relevance as a Factor of Unsupervised Learning

Annaluri Sreenivasa Rao, Prof. S. Ramakrishna

Annaluri Sreenivasa Rao is currently working as Associate Professor at the Department of Computer Science, MRCE, Hyderabad, Telangana State, India. Email: annaluri.rao@gmail.com
Prof. S. Ramakrishna is currently working as Professor at the Department of Computer Science, Sri Venkateswara University, Tirupathi, Andhra Pradesh, India.

Abstract: Present document clustering approaches cluster documents by term frequency and do not consider concept and semantic relations during unsupervised or supervised learning. In this paper, a new unsupervised learning approach is proposed that estimates the similarity between any two documents by a concept similarity measure. This method represents a concept as a set of word sequences found in the given documents. To demonstrate the significance of the proposal, we applied it to a set of benchmark document datasets.

Index Terms: Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms.

1. INTRODUCTION

Clustering is an unsupervised learning approach that partitions data items into clusters. It forms the clusters without using predefined class labels; hence the number of clusters and the documents in each cluster are dynamic. In contrast, supervised learning approaches such as classification and prediction group the data by analyzing class-labeled data objects. Clustering analyzes the given data and forms these labels dynamically. A similarity measure, or similarity function, is a real-valued function that quantifies the similarity between any two given elements.

2. RELATED WORK

In clustering algorithms and text classification, similarity measures are used extensively. For document clustering, a cosine similarity measure was adopted by the spherical k-means algorithm [4]. In this particular model, clusters of unlabeled documents became very regular and readily available. Text documents are projected as high-dimensional, sparse vectors using words as features. To obtain unit Euclidean norm, the algorithm outputs k disjoint clusters, each with a concept vector that is the center of the normalized cluster. For document clustering, a cosine-based pairwise-adaptive similarity measure was used by Derrick Higgins [5]. Compared with the original cosine similarity measure, the pairwise-adaptive similarity measure improves both the quality and the speed of clustering for high-dimensional documents. Using the Kullback-Leibler divergence, a divisive information-theoretic feature clustering algorithm was proposed by Daphne Koller [7] for text classification; this makes it possible to apply demanding learners such as Support Vector Machines to text classification while avoiding the high dimensionality of text. In a k-means-like clustering algorithm, squared Euclidean distance and relative entropy were combined by Kullback [8]. A recently emerging k-means algorithm was developed specifically to manage unit-length document vectors. Chim and Deng [9] performed document clustering based on the Suffix Tree Document (STD) model, which computes the pairwise similarities of documents.

3. DOCUMENT CLUSTERING BY CONCEPTUAL RELEVANCE

In our proposed model, the given input documents are clustered into a minimum of k clusters. Since the proposed model is an unsupervised learning approach, the class labels must be generated dynamically from features extracted from the given input corpus.

3.1 The algorithmic approach of DCCR

Let TDC be the text document corpus given as input for DCCR.

Initially, a data preprocessing step is applied that produces the processed documents word vector pdwv:
o Extract the words from each document and form a row in a word vector dwv.
o For each row in the word vector dwv:
 - Remove stop words
 - Remove noise (non-English terms and special symbols)
 - Apply stemming to the words of each row
o After the preprocessing, the word vector dwv becomes the resultant processed documents word vector pdwv.
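The paper leaves the stop-word set sws and the stemming procedure unspecified. The following is a minimal Python sketch of this preprocessing step, assuming NLTK's English stop-word list and Porter stemmer as stand-ins (both are illustrative choices, not the authors' stated ones):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-time fetch of the stop-word list

def preprocess(dwv):
    # Mirrors the steps above: drop noise and stop words, then stem.
    # NLTK's English stop-word list stands in for the unspecified set sws,
    # and the Porter stemmer for the unspecified stemming procedure.
    sws = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    pdwv = []
    for row in dwv:                                  # one row of words per document
        pdr = []
        for w in row:
            w = re.sub(r"[^a-zA-Z]", "", w).lower()  # remove non-English terms and special symbols
            if w and w not in sws:
                pdr.append(stemmer.stem(w))          # apply stemming
        pdwv.append(pdr)
    return pdwv
```

Applied to one token list per document, this yields the pdwv rows that the later stages consume.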

Next, the word sequences of size wst are considered as feature attributes from each row of pdwv and form a set of feature attributes fas with no duplicate elements. A word sequence is a set of wst words that appear in a row of pdwv in sequence.

Then co-occurrence feature sets cofs are formed such that each feature of a feature set {fs | fs ∈ cofs} belongs to fas; the size of each set {fs | fs ∈ cofs} can vary. The co-occurrence feature sets are formed as follows:
o Initially, feature sets of size one are formed and moved to cofs.
o Then co-occurrence feature sets of size two up to the maximum possible size are formed and moved to cofs.
o Then the co-occurrence feature sets are pruned as follows: if fsi ⊂ fsj for some {fsj | fsj ∈ cofs} and the co-occurrence frequency of fsi is identically equal to the co-occurrence frequency of fsj, then fsi is pruned from cofs.

Then the co-occurrence feature sets of cofs are ordered in descending order of their length, with co-occurrence feature sets of similar length ordered in descending order of their co-occurrence frequency. The top k feature sets are then selected as centroids. This set of top k centroids is referred to as kcs in further discussion.

Further, the model clusters the documents as follows:
o For each document {d | d ∈ TDC}:
 - Choose the row wvd from pdwv that represents document d.
 - For each centroid {c | c ∈ kcs}, find the similarity score ss(wvd, c) between wvd and centroid c.
 - Move document d to the cluster represented by the centroid ci such that ss(wvd, ci) is maximal compared to the similarity scores of document d with the other centroids, and ss(wvd, ci) is greater than the given minimal similarity threshold mst.
 - If the similarity scores between document d and all centroids are less than mst, move document d to the non-cluster group ncg.
 - Continue the above process for all documents in TDC.
o If the non-cluster group is not empty, find new clusters as follows:
 - Remove all centroids from kcs.
 - Consider the leftover feature sets of cofs with size greater than the minimum centroid length threshold mclt as centroids and move them to kcs.
 - For each document {d | d ∈ ncg}:
  - Choose the row wvd from pdwv that represents document d.
  - For each centroid {c | c ∈ kcs}, find the similarity score ss(wvd, c) between wvd and centroid c.
  - Move document d to the cluster represented by the centroid ci such that ss(wvd, ci) is maximal compared to the similarity scores of document d with the other centroids.
 - Continue the above process for all documents in ncg.

Finally, output the documents of all clusters with their cluster ids.

3.2 Pseudo code for DCCR

i. Main Process

Inputs:
 TDC (text document corpus to be clustered)
 k (minimal number of clusters to be formed)
 mst (minimal similarity threshold)
 mclt (minimal centroid length threshold)
 wst (word sequence length threshold)

1. Begin
2. Form a word vector dwv such that each row of the vector represents the words of one document of the TDC
3. pdwv ← preprocess(dwv)
4. fas ← findfas(wst, pdwv) // fas is the set of feature attributes (word sequences of size wst)
5. cofs ← findcofs(fas, pdwv) // the method invoked in step 5 can use any frequent itemset mining algorithm, such as FP-growth [14] or Eclat [15]; due to space constraints that algorithm is not explored in this article
6. Order cofs in descending order by the size of each {fs | fs ∈ cofs}, and order all feature sets of similar size in descending order of their co-occurrence frequency
7. Choose as centroids the top k feature sets from cofs whose length is greater than mclt, and refer to them as kcs
8. Set the non-cluster group ncg ← φ
9. kclusters ← findclusters(TDC, pdwv, kcs, mst, ncg)
10. if (ncg ≠ φ)
11.  Remove all centroids from kcs
12.  Add the leftover feature sets from cofs to kcs
13.  k'Clusters ← find_k'Clusters(ncg, pdwv, kcs)
14. End
15. if (k'Clusters ≠ φ) kclusters ← kclusters ∪ k'Clusters
16. End
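As a concrete reading of steps 4-7 of the main process, the sketch below enumerates word-sequence features, counts co-occurrence frequencies, prunes non-closed sets, and picks the top k centroids. The brute-force enumeration and the max_size cap are assumptions of this sketch; as step 5 notes, a real implementation would delegate findcofs to FP-growth [14] or Eclat [15]:

```python
from collections import Counter
from itertools import combinations

def findfas(wst, pdwv):
    # Collect every word sequence of length wst (the feature attributes),
    # without duplicates, as in findfas(wst, pdwv).
    fas = set()
    for row in pdwv:
        for i in range(len(row) - wst + 1):
            fas.add(tuple(row[i:i + wst]))
    return fas

def contains(row, seq):
    # True if the word sequence seq occurs consecutively in row.
    n = len(seq)
    return any(tuple(row[i:i + n]) == seq for i in range(len(row) - n + 1))

def findcofs(fas, pdwv, max_size=3):
    # Count the co-occurrence frequency of feature sets (combinations of word
    # sequences appearing together in a document), then prune any set whose
    # frequency equals that of one of its supersets, keeping only "closed"
    # sets as the pruning rule above requires. max_size caps the enumeration
    # and is an assumption of this sketch, not a parameter of the paper.
    freq = Counter()
    for row in pdwv:
        present = sorted(f for f in fas if contains(row, f))
        for size in range(1, max_size + 1):
            for combo in combinations(present, size):
                freq[frozenset(combo)] += 1
    cofs = dict(freq)
    for fsi in list(cofs):
        if any(fsi < fsj and cofs[fsi] == cofs.get(fsj) for fsj in cofs):
            del cofs[fsi]  # prune non-closed subset with equal frequency
    return cofs

def choose_centroids(cofs, k, mclt):
    # Steps 6-7: order by set size, then by co-occurrence frequency, both
    # descending, and take the top k sets longer than mclt as centroids kcs.
    ordered = sorted(cofs.items(), key=lambda kv: (len(kv[0]), kv[1]), reverse=True)
    return [fs for fs, _ in ordered if len(fs) > mclt][:k]
```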

ii. Preprocessing

1. preprocess(dwv)
2. Set pdwv ← φ
3. For each row dr of dwv
4.  Set pdr ← φ
5.  Remove non-English characters from dr
6.  Trim the leading and trailing spaces of each word of dr
7.  For each word w of dr
8.   if (w ∈ sws) then remove w from dr // here sws is the stop-word set
9.   else
10.   apply the stemming process to w and add w to pdr
11.  Add pdr to pdwv
12.  End
13. End
14. End
15. Return pdwv
16. End

iii. Finding feature attribute sets

1. findfas(wst, pdwv)
2. Set fas ← φ
3. For each row dr of the pdwv
4.  For each word w of dr
5.   if ((index_of(w) + wst) < size_of(dr))
6.    add to fas the word sequence of size wst beginning at index_of(w)
7.   End
8.  End
9. End
10. Return fas
11. End

iv. Finding min k clusters

findclusters(TDC, pdwv, kcs, mst, ncg)
1. Begin
2. For each document d in TDC
3.  Select the row dr from pdwv that represents d
4.  Set s ← φ
5.  Set i ← φ
6.  For each centroid c in kcs
7.   ss(dr, c) ← (dr · c) / (|dr| |c|) // find the similarity score of document d and centroid c
8.   if (ss(dr, c) > s)
9.    set s ← ss(dr, c)
10.   Set i ← index_of(c)
11.  End
12. End
13. if (s ≥ mst)
14.  Move document d into cluster kcs[i]
15. End
16. Else move document d to ncg
17. End
18. End

v. Finding clusters from leftover documents

find_k'Clusters(ncg, pdwv, kcs)
19. Begin
20. For each document d in ncg
21.  Select the row dr from pdwv that represents d
22.  Set s ← φ
23.  Set i ← φ
24.  For each centroid c in kcs
25.   ss(dr, c) ← (dr · c) / (|dr| |c|) // find the similarity score of document d and centroid c
26.   if (ss(dr, c) > s)
27.    set s ← ss(dr, c)
28.    Set i ← index_of(c)
29.   End
30.  End
31.  Move document d into cluster kcs[i]
32. End
33. End
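The similarity fraction in sections iv and v reads as cosine similarity, ss(dr, c) = (dr · c) / (|dr| |c|). Below is a minimal sketch of the assignment routine under that reading, treating rows and centroids as bags of words; the flattening of centroid word sequences into bags is an assumption of this sketch:

```python
import math
from collections import Counter

def ss(dr, c):
    # Similarity score read as cosine similarity over bag-of-words counts:
    # ss(dr, c) = (dr . c) / (|dr| |c|).
    a, b = Counter(dr), Counter(c)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def findclusters(docs, pdwv, kcs, mst, ncg):
    # Sections iv-v in Python: assign each document to its most similar
    # centroid when the best score reaches mst; otherwise park it in the
    # non-cluster group ncg.
    centroids = [[w for seq in c for w in seq] for c in kcs]  # flatten word sequences
    clusters = {i: [] for i in range(len(kcs))}
    for d, dr in zip(docs, pdwv):
        scores = [ss(dr, c) for c in centroids]
        best = max(range(len(centroids)), key=scores.__getitem__)
        if scores[best] >= mst:
            clusters[best].append(d)
        else:
            ncg.append(d)
    return clusters
```

Calling findclusters again with the leftover centroids and mst = 0.0 mimics find_k'Clusters from section v, which assigns every leftover document to its best centroid unconditionally.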

4. EXPERIMENTAL RESULTS

In this section, the effectiveness of the proposed Document Clustering by Conceptual Relevance (DCCR) as a similarity factor for unsupervised learning is investigated. The investigation is done by applying the DCCR measure to the Reuters data.

4.1 Dataset Exploration

Reuters-21578 is a collection of thousands of documents from the Reuters newswire. The dataset is rich in both volume and categorization.

4.2 Performance Analysis

From the corpus of the input dataset, we removed the actual cluster identifications and then applied DCCR to those documents with divergent word sequence length thresholds and minimum cluster counts. The performance of DCCR is explored by clustering the given documents under divergent word sequence length thresholds and minimum numbers of required clusters. Fig 1 shows the accuracy of cluster formation under different values of wst.

Fig 1: DCCR performance analysis under divergent feature attribute set sizes

Fig 2 shows the influence of the k (minimum number of clusters) value on accuracy. It is clearly evident that DCCR performed well under the maximum conceptual relevance factor (see Fig 1). We limited the word sequence (concept) length to 9, since beyond that value clusters are created with a potentially low number of documents each. The given dataset is benchmarked with 90 categories of documents. As DCCR generates a minimum of k clusters, when the k value exceeds the actual number of categories the performance of DCCR falls significantly (see Fig 2 with k = 100).

Fig 2: Performance analysis of DCCR under divergent minimum cluster counts
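For reproduction purposes, NLTK ships the ApteMod split of Reuters-21578, which matches the 90-category benchmark mentioned above; the purity function below is one plausible stand-in for the unspecified accuracy measure (both the loader and the metric are assumptions of this sketch):

```python
import nltk
from collections import Counter
from nltk.corpus import reuters

nltk.download("reuters")  # ApteMod split of Reuters-21578, 90 categories

docs = reuters.fileids()
dwv = [list(reuters.words(fid)) for fid in docs]          # one token row per document
gold = {fid: reuters.categories(fid)[0] for fid in docs}  # first category as gold label

def purity(clusters, gold):
    # Majority-label purity: the fraction of documents sharing the most
    # common gold label of their cluster. The paper does not specify which
    # accuracy metric underlies Figs 1 and 2.
    total = sum(len(m) for m in clusters.values())
    hit = sum(Counter(gold[d] for d in members).most_common(1)[0][1]
              for members in clusters.values() if members)
    return hit / total if total else 0.0
```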

5. CONCLUSION AND FUTURE WORK

In this article we devised an unsupervised learning strategy that clusters text documents using conceptual relevance as the clustering factor, which we refer to as Document Clustering by Conceptual Relevance (DCCR) as a factor of unsupervised learning. The model clusters documents based on their conceptual relevance rather than on the term frequency used by many of the current state-of-the-art models in the literature. The performance analysis of the model indicates that it performs extremely well under an optimal length of the feature attribute set and an optimal k (minimum cluster count) value. The current work demands domain knowledge to fix the word sequence length and the minimum number of clusters required, as these two are the prime factors influencing the performance of DCCR. In future work, we will focus on forming concepts by correlation analysis of word tokens instead of forming concepts from word sequences as done in the present solution. Another direction for extending this work would be to include compound words when estimating conceptual relevance. A better representation of conceptual relevance can help to further improve the performance of unsupervised learning strategies in text document mining.

REFERENCES

[1] A. Huang. Similarity Measures for Text Document Clustering. 2008.
[2] F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
[3] S. Clinchant and E. Gaussier. Information-Based Models for Ad Hoc IR. Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 234-241, 2010.
[4] D. Higgins and J. Burstein. Sentence Similarity Measures for Essay Coherence. 2007.
[5] D. Metzler, S. Dumais, and C. Meek. Similarity Measures for Short Segments of Text. 2007.
[6] Z. Younes, F. Abdallah, and T. Denoeux. Multi-Label Classification Algorithm Derived from K-Nearest Neighbor Rule with Label Dependencies. 2003.
[7] D. Koller and M. Sahami. Hierarchically Classifying Documents Using Very Few Words. Proceedings of the 14th International Conference on Machine Learning (ML), Nashville, Tennessee, July 1997, pages 170-178.
[8] S. Kullback and R. A. Leibler. On Information and Sufficiency. Annals of Mathematical Statistics, 22(1):79-86, 1951.
[9] H. Chim and X. Deng. Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data Engineering, 20(9):1217-1229, 2008.
[10] http://web.ist.utl.pt/~acardoso/datasets/
[11] http://www.cs.technion.ac.il/~ronb/thesis.html
[12] http://www.daviddlewis.com/resources/testcollections/reuters21578/
[13] http://www.dmoz.org/
[14] Y. Liu and Y. Guan. FP-Growth Algorithm for Application in Research of Market Basket Analysis. IEEE International Conference on Computational Cybernetics (ICCC 2008), pp. 269-272, 27-29 Nov. 2008.
[15] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New Algorithms for Fast Discovery of Association Rules. Proc. 3rd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'97, Newport Beach, CA), pages 283-296. AAAI Press, Menlo Park, CA, USA, 1997.