Text Mining Data Preparator with Multi-View Clustering


J.B. Naga Venkata Lakshmi, Student, Dept. of CSE, CMRIT College, Bangalore, India (jagarlamudibala@gmail.com)
Sudhakar KN, Assoc. Prof., Dept. of CSE, CMRIT College, Bangalore, India (sudhukn@gmail.com)
Jitendranath M, Prof. & Dean, Dept. of CSE, CMRIT, Bangalore, India (jmungara@yahoo.com)

Abstract

The proposed system assumes some cluster relationship among the data objects to which it is applied. Similarity between a pair of objects is defined either explicitly or implicitly. In this paper, we introduce a novel multi-viewpoint based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, whereas the latter utilizes many different viewpoints, which are objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure.

Keywords: document clustering, text mining, similarity measure.

I. INTRODUCTION

Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year. They are proposed for very distinct research fields and developed using quite different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple k-means algorithm still remains one of the top ten data mining algorithms. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is still the favorite algorithm of practitioners in the related fields.

Needless to say, k-means has more than a few basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance is worse than that of other state-of-the-art algorithms in several domains. In spite of that, its simplicity, understandability and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios may well be preferable to one with better performance in some cases but limited usage due to high complexity. While giving reasonable results, k-means is fast and easy to combine with other methods in larger systems.

A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among data. Basically, there is an implicit assumption that the true intrinsic structure of the data can be correctly described by the similarity formula defined and embedded in the clustering criterion function. Hence, the effectiveness of clustering algorithms under this approach depends on the appropriateness of the similarity measure to the data at hand. For example, the original k-means has a sum-of-squared-error objective function that uses Euclidean distance. In a very sparse and high-dimensional domain such as text documents, spherical k-means, which uses cosine similarity instead of Euclidean distance as the measure, is deemed to be more suitable [3], [4].

In [5], Banerjee et al. showed that Euclidean distance is in fact one particular form of a class of distance measures called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in which any kind of Bregman divergence can be applied. Kullback-Leibler divergence is a special case of Bregman divergences that was said to give good clustering results on document datasets. Kullback-Leibler divergence is also a good example of a non-symmetric measure. Also on the topic of capturing dissimilarity in data, Pekalska et al. [6] found that the discriminative power of some distance measures could increase when their non-Euclidean and non-metric attributes were increased. They concluded that non-Euclidean and non-metric measures could be informative for statistical learning of data. In [7], Pelillo even argued that the symmetry and non-negativity assumption of similarity measures is actually a limitation of current state-of-the-art clustering approaches. At the same time, clustering still needs more robust dissimilarity or similarity measures; recent works such as [8] illustrate this need.
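To make the contrast between a symmetric and a non-symmetric measure concrete, the following minimal sketch compares squared Euclidean distance with Kullback-Leibler divergence on two small discrete distributions. The function names and the example vectors `p` and `q` are illustrative assumptions and do not come from the paper.

```python
import math


def squared_euclidean(p, q):
    """Squared Euclidean distance; symmetric in its arguments."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical term distributions of two documents (each sums to 1).
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

# Swapping the arguments leaves the Euclidean value unchanged ...
print(squared_euclidean(p, q), squared_euclidean(q, p))
# ... but changes the KL value, which is why KL is called non-symmetric.
print(kl_divergence(p, q), kl_divergence(q, p))
```

Both functions are instances of Bregman divergences, so the sketch also hints at why a single clustering procedure can accept either one as its distortion measure.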

The work in this paper is motivated by investigations from the above and similar research findings. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method. Our first objective is to derive a novel method for measuring similarity between data objects in sparse and high-dimensional domains, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.

The remainder of this paper is organized as follows. Section II reviews related literature on similarity and clustering of documents. Section III describes the problem. Section IV presents the proposed system, including the two criterion functions for document clustering. Section V presents the implementation and results. Finally, conclusions and potential future work are given in Section VI.

II. RELATED WORK

Tae-Wan Ryu and Christoph F. Eick [9] introduce an approach to cope with the representational inappropriateness of the traditional flat file format for data sets from databases, specifically in database clustering. Steffen Bickel and Tobias Schaeffer [10] consider clustering problems in which the available attributes can be split into two independent subsets, such that either subset suffices for learning. They study partitioning and agglomerative hierarchical clustering algorithms for text data. Mala Mehrotra and Chris Wild [11] address the feasibility of partitioning rule-based systems into a number of meaningful units to enhance the comprehensibility, maintainability and reliability of expert systems software. They also present the results of using this approach to partition a deployed knowledge-based system that navigates the Space Shuttle's entry.

N. Balayesu et al. [12] assume some cluster relationship among the data objects to which clustering is applied. Similarity between a pair of objects can be defined either explicitly or implicitly. The major difference between a traditional dissimilarity/similarity measure and theirs is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed not to be in the same cluster as the two objects being measured. Mala Mehrotra and Dmitri Bobrovnikoff [13] present the MVP-CA tool, which clusters a knowledge base into related rule sets, thus allowing the user to comprehend the knowledge base in terms of conceptually meaningful clusters of rules. The tool is ultimately meant to help knowledge engineers and subject matter experts author, understand and manage the knowledge base for maximal utilization.

Kamalika Chaudhuri et al. [14] consider constructing projections using multiple views of the data, via Canonical Correlation Analysis (CCA). Mario Frank et al. [15] propose a probabilistic model for clustering Boolean data in which an object can be simultaneously assigned to multiple clusters. They also extend the model with different noise processes and demonstrate that maximum-likelihood estimation with multiple assignments consistently infers source parameters more accurately than single-assignment clustering. Bo Long et al. [16] propose a general model, collective factorization on related matrices, for multi-type relational data clustering. Under this model, they derive a novel algorithm, spectral relational clustering, to cluster multi-type interrelated data objects simultaneously.

K.P.N.V. Satya Sree and J. V. R. Murthy [17] propose a new way to compute the overlap rate in order to improve time efficiency and veracity. Based on the hierarchical clustering method, they describe the use of the Expectation-Maximization (EM) algorithm in the Gaussian Mixture Model to estimate the parameters and to merge the two sub-clusters whose overlap is the largest. Jean-Charles Lamirel [18] proposes a new approach for knowledge extraction based on a MultiGAS model, which is itself an extension of the Neural Gas model relying on the MDVA paradigm. The approach makes use of original measures of unsupervised Recall and Precision for extracting rules from gases. Ran Song et al. [19] propose a novel integration method cast in the framework of Markov random fields (MRF). They define a probabilistic description of an MRF model designed to represent not only the interposing Euclidean distances but also the surface topology and neighborhood consistency intrinsically embedded in a predefined neighborhood. Tilman Lange and Joachim M. Buhmann [20] present an approach that utilizes multiple information sources in the form of similarity data for unsupervised learning. Based on similarity information, the clustering task is phrased as a non-negative matrix factorization problem of a mixture of similarity measurements.

Anna Huang [21] compares and analyzes the effectiveness of similarity measures in partitional clustering for text document datasets. The experiments use the standard K-means algorithm and report results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering.

III. PROBLEM DESCRIPTION

The common approach to the clustering problem is to treat it as an optimization process. The problem formulation itself implies that some form of measurement is needed to determine similarity or dissimilarity. It rests on one principle: the similarity measure must be appropriate for the clustering problem.

Clustering is the process of partitioning or dividing a set of patterns (data) into groups. Each cluster is abstracted using one or more representatives. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification: the data is modeled by its clusters. Clustering is a type of classification imposed on a finite set of objects. The relationship between objects is represented in a proximity matrix in which the rows represent n e-mails and the columns correspond to the terms given as dimensions. If objects are categorized as patterns, or points in a d-dimensional metric space, the proximity measure can be the Euclidean distance between a pair of points. Unless a meaningful measure of distance or proximity between a pair of objects is established, no meaningful cluster analysis is possible.

Clustering is useful in many applications such as decision making, data mining, text mining, machine learning, grouping, pattern classification and intrusion detection. Clustering is also needed because it helps in detecting outliers and in examining small clusters.

IV. PROPOSED SYSTEM

This project aims to find intrinsic structures in data and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year. They are proposed for very distinct research fields and developed using quite different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple k-means algorithm still remains one of the top 10 data mining algorithms. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favorite algorithm of practitioners in the related fields.

A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among data. Banerjee et al. [5] proposed the Bregman hard-clustering algorithm, in which any kind of Bregman divergence can be applied. Kullback-Leibler divergence is a special case of Bregman divergences that was said to give good clustering results on document datasets. Kullback-Leibler divergence is also a good example of a non-symmetric measure. Also on the topic of capturing dissimilarity in data, Pekalska et al. [6] found that the discriminative power of some distance measures could increase when their non-Euclidean and non-metric attributes were increased.

The main work is to develop a clustering algorithm for document clustering that provides maximum efficiency and performance. The proposed architecture is shown in Figure 1. The work particularly focuses on studying and making use of the cluster overlapping phenomenon to design cluster merging criteria. It concentrates on a new way to compute the overlap rate in order to improve time efficiency and veracity. Based on the hierarchical clustering method, the Expectation-Maximization algorithm is used in the Gaussian Mixture Model to estimate the parameters, and the two sub-clusters whose overlap is the largest are merged. Experiments on both public data and document clustering data show that this approach can improve the efficiency of clustering and save computing time.

Given a data set that follows a mixture of Gaussians, the degree of overlap between components affects the number of clusters perceived by a human operator or detected by a clustering algorithm. In other words, there may be a significant difference between intuitively defined clusters and the true clusters corresponding to the components in the mixture. The work also aims at establishing the fundamentals for future Ubiquitous Computing Architectures by developing an intelligent algorithm that can integrate, manage and connectively operate individual applications composed of diverse platforms and components, in accordance with clustering, using data objects and a number of documents.

The two clustering criterion functions are:

IR: cluster size-weighted sum of the average pairwise similarities of documents in the same cluster.
IV: weighted difference between two terms, the intra-cluster similarity measure and the inter-cluster similarity measure.

Figure 1: Proposed System Architecture (components shown: document vector; number of terms, documents, classes and clusters; multi-viewpoint similarity approach; data objects; clustering criteria IR and IV; incremental clustering design; similarity measure for sparse, high-dimensional domains; accuracy evaluation)
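The multi-viewpoint idea behind the proposed system can be illustrated with a short sketch. The paper states only that similarity is assessed from many viewpoints outside the cluster and that the relative similarities are averaged; the specific form used below, the dot product of the two difference vectors (d_i - d_h) and (d_j - d_h), is an assumption made for illustration, as are the function and variable names.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def multi_viewpoint_similarity(d_i, d_j, viewpoints):
    """Average relative similarity of d_i and d_j seen from each viewpoint.

    Assumed relative similarity from viewpoint d_h:
        (d_i - d_h) . (d_j - d_h)
    The viewpoints should be documents assumed NOT to share a cluster
    with d_i and d_j.
    """
    if not viewpoints:
        raise ValueError("at least one viewpoint is required")
    total = 0.0
    for d_h in viewpoints:
        u = [a - h for a, h in zip(d_i, d_h)]
        v = [b - h for b, h in zip(d_j, d_h)]
        total += dot(u, v)
    return total / len(viewpoints)


# Hypothetical unit-length document vectors.
d_i = [0.6, 0.8]
d_j = [1.0, 0.0]

# With the origin as the only viewpoint, the measure reduces to the
# ordinary dot product, i.e. cosine similarity for unit vectors.
print(multi_viewpoint_similarity(d_i, d_j, [[0.0, 0.0]]))

# With documents from other clusters as viewpoints, the value changes.
print(multi_viewpoint_similarity(d_i, d_j, [[0.0, 1.0], [-1.0, 0.0]]))
```

The first call shows how the single-viewpoint case sits inside the multi-viewpoint one: the traditional measure is what remains when the only viewpoint is the origin.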


V. IMPLEMENTATION AND RESULTS

The proposed system was tested on a standard 32-bit Windows OS on the Java platform. For full functionality, the project should be run on a well-equipped computer with at least a P4 processor, a 20 GB HDD and 2 GB of RAM. Normally, the OS is Windows XP/7/Vista. The main theme of this project work is to introduce a clustering algorithm for document clustering that provides maximum efficiency and performance. It also focuses on studying and making use of the cluster overlapping phenomenon to design cluster merging criteria, and on a new way to compute the overlap rate in order to improve time efficiency and veracity.

The two datasets (Fig. 3) are preprocessed by stop-word removal and stemming. Moreover, we remove words that appear in fewer than two documents or in more than a maximum share of the total number of documents. Finally, the documents are weighted by TF-IDF and normalized to unit vectors.

In this project work, the focus is on deriving a novel method for measuring similarity between data objects in sparse and high-dimensional domains, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance. Clustering is also needed because it helps in detecting outliers and in examining small clusters.

When the application is started, the multi-view based clustering details are displayed. Here the user can enter a project name and select any type of data set.

Figure 2: GUI of the multi-view clustering

Figure 3: Two data sets

While browsing the data sets, the user selects the maximum amount of data set content. After that, the data content must be transformed into words.

Figure 4: Data updating configuration

Once the data sets have been updated, the application counts the data subsets, after which the data must be configured.
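The preprocessing steps described in this section (document-frequency filtering, TF-IDF weighting and normalization to unit vectors) can be sketched as below. This is a minimal illustration rather than the system's Java implementation: the function name, the `max_df_ratio` parameter and its default value, and the particular TF-IDF variant (raw term count times log(n / df)) are all assumptions. The input is assumed to be already stop-word filtered and stemmed.

```python
import math
from collections import Counter


def tfidf_unit_vectors(docs, min_df=2, max_df_ratio=0.9):
    """Turn tokenized documents into unit-length TF-IDF vectors.

    docs: list of token lists (stop words removed, tokens stemmed).
    min_df: drop terms found in fewer than this many documents.
    max_df_ratio: hypothetical cut-off; drop terms found in more than
        this share of all documents.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(
        t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio
    )
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        # Leave all-zero vectors untouched to avoid division by zero.
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vocab, vectors


# Hypothetical, already preprocessed documents.
docs = [
    ["cluster", "text", "mine"],
    ["cluster", "text", "similar"],
    ["cluster", "measure", "similar"],
    ["graph", "text", "mine"],
]
vocab, vectors = tfidf_unit_vectors(docs)
print(vocab)
for vec in vectors:
    print([round(x, 3) for x in vec])
```

Because every vector has unit length, the dot product of two such vectors equals their cosine similarity, which is convenient for the similarity computations that follow.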


Figure 5: Stop-word removal

After the data has been configured, the data content is transformed and the stop words must be removed.

Figure 6: Result of multi-view similarity

The content of the data sets is converted into string tokens; the tokens are counted and the vector matrix is generated. This vector matrix shows the result of the multi-view similarity.

Figure 7: Result of clustering

After the similarity values of the matrix have been obtained, the final clusters are generated according to the content of the data sets.

Figure 8: Term frequency graph for matrix

The clustering result for each data set is represented as a term frequency graph, as shown above.

Figure 9: Term frequency-idf bar graph for matrix

The bar graph above is based on the content of the clustering matrix of the data sets.

VI. CONCLUSION

We propose a Multi-Viewpoint based Similarity measure, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, IR and IV, and their respective clustering algorithms, MVSC-IR and MVSC-IV, are introduced. Compared with other state-of-the-art clustering methods that use different types of similarity measures, on a large number of document datasets and under different evaluation metrics, the proposed algorithms show significantly improved clustering performance. The

key contribution of this paper is the fundamental concept of similarity measurement from multiple viewpoints. Future methods could make use of the same principle, but define alternative forms for the relative similarity, or combine the relative similarities from the different viewpoints by methods other than averaging. Besides, this paper focuses on partitional clustering of documents. In the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. Finally, we have shown the application of MVS and its clustering algorithms to text data. It would be interesting to explore how they work on other types of sparse and high-dimensional data.

REFERENCES

[1] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowl. Inf. Syst., vol. 14, no. 1, pp. 1-37.
[2] I. Guyon, U. von Luxburg, and R. C. Williamson, "Clustering: Science or Art?" NIPS'09 Workshop on Clustering Theory.
[3] I. Dhillon and D. Modha, "Concept decompositions for large sparse text data using clustering," Mach. Learn., vol. 42, no. 1-2, Jan.
[4] S. Zhong, "Efficient online spherical K-means clustering," in IEEE IJCNN, 2005.
[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," J. Mach. Learn. Res., vol. 6, Oct.
[6] E. Pekalska, A. Harol, R. P. W. Duin, B. Spillmann, and H. Bunke, "Non-Euclidean or non-metric measures can be informative," in Structural, Syntactic, and Statistical Pattern Recognition, ser. LNCS, vol. 4109, 2006.
[7] M. Pelillo, "What is a cluster? Perspectives from game theory," in Proc. of the NIPS Workshop on Clustering Theory.
[8] D. Lee and J. Lee, "Dynamic dissimilarity measure for support based clustering," IEEE Trans. on Knowl. and Data Eng., vol. 22, no. 6.
[9] Tae-Wan Ryu and Christoph F. Eick, "Similarity Measures for Multi-Valued Attributes for Database Clustering," CIT: Department of Computer Science, University of Houston.
[10] Steffen Bickel and Tobias Schaeffer, "Multi-View Clustering," IEEE International Conference on Data Mining, 2004, SCHE540/10-1.
[11] Mala Mehrotra and Chris Wild, "Multi-Viewpoint Clustering Analysis," ViGYAN, Inc., 30 Research Drive, Hampton, Va , CA 95014. From: AAAI Technical Report WS. Compilation copyright 1993, AAAI. All rights reserved.
[12] N. Balayesu, M. Rambabu, and D. Anusha, "Performance of Clustering with Multi-Viewpoint based Similarity Measure and Optimization Technique," International Journal of Computer Science and Technology (IJCST), vol. 3, issue 1, spl. 5, Jan.-March.
[13] Mala Mehrotra and Dmitri Bobrovnikoff, "Multi-ViewPoint Clustering Analysis (MVP-CA) Tool," from AAAI-02 Proceedings. Copyright 2002, AAAI. All rights reserved. American Association for Artificial Intelligence.
[14] Kamalika Chaudhuri, Sham M. Kakade, Karen Livescu, and Karthik Sridharan, "Multi-View Clustering via Canonical Correlation Analysis," 26th International Conference on Machine Learning, Montreal, Canada.
[15] Mario Frank, Andreas P. Streich, David Basin, and Joachim M. Buhmann, "Multi-Assignment Clustering for Boolean Data," Journal of Machine Learning Research 13 (2012). Submitted 9/10; revised 6/11; published 2/12.
[16] Bo Long, Zhongfei (Mark) Zhang, Xiaoyun Wu, and Philip S. Yu, "Spectral Clustering for Multi-type Relational Data," 23rd International Conference on Machine Learning, Pittsburgh, PA.
[17] K.P.N.V. Satya Sree and J. V. R. Murthy, "Clustering Based on Cosine Similarity Measure," International Journal of Engineering Science & Advanced Technology (IJESAT), vol. 2, issue 3.
[18] Jean-Charles Lamirel, "A New Multi-Viewpoint and Multi-Level Clustering Paradigm for Efficient Data Mining Tasks," in New Fundamental Technologies in Data Mining.
[19] Ran Song, Yonghuai Liu, Ralph R. Martin, and Paul L. Rosin, "Markov Random Field-Based Clustering for the Integration of Multiview Range Images," ISVC 2010, Part I, LNCS 6453, Springer-Verlag Berlin Heidelberg.
[20] Tilman Lange and Joachim M. Buhmann, "Fusion of Similarity Data in Clustering," in Advances in Neural Information Processing Systems 18 (2006).
[21] Anna Huang, "Similarity Measures for Text Document Clustering," NZCSRSC 2008 (New Zealand Computer Science Research Student Conference), April 2008, Christchurch, New Zealand.
