Clustering with a Multi-Viewpoint Based Similarity Measure


Clustering with a Multi-Viewpoint Based Similarity Measure

M. Soumya Harika 1, G. Prasad Babu 2, P. Nirupama 3
1 M.Tech Student, 2 Assoc. Prof., 3 Prof. & Head
1, 2, 3 Department of CSE, Siddharth Institute of Engineering & Technology, Puttur, Andhra Pradesh, India

Abstract - All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multi-viewpoint based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints: objects that are assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

Keywords - Document clustering, text mining, similarity measure.

I. INTRODUCTION

Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms have been published every year. They can be proposed for very different research fields, and developed using totally different techniques and approaches.
However, according to a recent study [1], more than half a century after it was introduced, the simple k-means algorithm still remains one of the top 10 data mining algorithms. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favorite algorithm that practitioners in the related fields choose to use. Needless to say, k-means has more than a few basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance can be worse than other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios could be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.

A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among data. Basically, there is an implicit assumption that the true intrinsic structure of the data could be correctly described by the similarity formula defined and embedded in the clustering criterion function. Hence, the effectiveness of clustering algorithms under this approach depends on the appropriateness of the similarity measure to the data at hand. For instance, the original k-means has a sum-of-squared-error objective function that uses Euclidean distance. In a very sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine similarity rather than Euclidean distance as the measure, is deemed to be more suitable [3], [4]. In [5], Banerjee et al.
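The contrast drawn above between k-means and spherical k-means comes down to the assignment rule: minimum Euclidean distance versus maximum cosine similarity. The sketch below is only illustrative (function names and data are ours, not the paper's):

```python
import numpy as np

def kmeans_assign(X, centroids):
    # Standard k-means: assign each point to the centroid at the
    # smallest Euclidean distance.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def spherical_kmeans_assign(X, centroids):
    # Spherical k-means: unit-normalize, then assign each point to the
    # centroid with the largest cosine similarity (inner product).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return (Xn @ Cn.T).argmax(axis=1)

X = np.array([[1.0, 0.1], [0.1, 1.0], [2.0, 0.3]])
C = np.array([[1.0, 0.0], [0.0, 1.0]])
labels_euc = kmeans_assign(X, C)
labels_cos = spherical_kmeans_assign(X, C)
```

For unit-length document vectors the two rules often agree, as here; they diverge when vector magnitudes vary, which is why spherical k-means is preferred for TF-IDF text vectors.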
showed that Euclidean distance was indeed one particular form of a class of distance measures called Bregman divergences. They proposed a Bregman hard-clustering algorithm, in which any kind of Bregman divergence could be applied. Kullback-Leibler divergence was a special case of Bregman divergence that was said to give good clustering results on document datasets; it is also a good example of a non-symmetric measure. Also on the topic of capturing dissimilarity in data, Pekalska et al. [6] found that the discriminative power of some distance measures could increase when their non-Euclidean and non-metric attributes were increased. They concluded that non-Euclidean and non-metric measures could be informative for statistical learning of data. In [7], Pelillo even argued that the symmetry and non-negativity assumptions of similarity measures were in fact a limitation of current state-of-the-art clustering approaches. At the same time, clustering still requires more robust dissimilarity or similarity measures; recent works such as [8] illustrate this need. The work in this paper is motivated by findings from the above and similar research. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method.

Table 1: Notations

Our first objective is to derive a novel method for measuring similarity between data objects in a sparse and high-dimensional domain, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.

The remainder of this paper is organized as follows. In Section 2, we review related literature on similarity and clustering of documents. We then present our proposal for a document similarity measure in Section 3. It is followed by two criterion functions for document clustering and their optimization algorithms in Section 4. Extensive experiments on real-world benchmark datasets are presented and discussed in Sections 5 and 6. Finally, conclusions and potential future work are given in Section 7.

II. RELATED WORK

First of all, Table 1 summarizes the basic notations used extensively throughout this paper to represent documents and related concepts. Each document in a corpus corresponds to an m-dimensional vector d, where m is the total number of terms in the document corpus. Document vectors are often subjected to some weighting scheme, such as the standard Term Frequency-Inverse Document Frequency (TF-IDF), and normalized to have unit length.

The principal definition of clustering is to arrange data objects into separate clusters such that the intra-cluster similarity as well as the inter-cluster dissimilarity is maximized. The problem formulation itself implies that some form of measurement is needed to determine such similarity or dissimilarity. Euclidean distance plays this role in the traditional k-means algorithm, whose objective is to minimize the Euclidean distance between objects of a cluster and that cluster's centroid Cr:

min Σ(r=1..k) Σ(di ∈ Sr) ||di − Cr||^2   (2)
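The sum-of-squared-error objective in Eq. (2) is simple to compute directly. A minimal sketch (the function name and toy data are ours, for illustration only):

```python
import numpy as np

def kmeans_sse(X, labels, k):
    # Eq. (2): sum over clusters r of the squared Euclidean distances
    # between each member of cluster S_r and the cluster centroid C_r.
    sse = 0.0
    for r in range(k):
        members = X[labels == r]
        if len(members) == 0:
            continue  # empty clusters contribute nothing
        centroid = members.mean(axis=0)
        sse += ((members - centroid) ** 2).sum()
    return sse

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
objective = kmeans_sse(X, labels, 2)
```

k-means alternates between assigning points to the nearest centroid and recomputing centroids, which monotonically decreases this objective.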
There are several state-of-the-art clustering approaches that do not employ any specific form of measure, for instance, probabilistic model-based methods [9], non-negative matrix factorization [10], information-theoretic co-clustering [11] and so on. In this paper, though, we primarily focus on methods that do indeed utilize a specific measure. In the literature, Euclidean distance is one of the most popular measures:

Dist(di, dj) = ||di − dj||   (1)

However, for data in a sparse and high-dimensional space, such as that in document clustering, cosine similarity is more widely used. It is also a popular similarity score in text mining and information retrieval [12]. Particularly, the similarity of two document vectors di and dj, Sim(di, dj), is defined as the cosine of the angle between them. For unit vectors, this equals their inner product:

Sim(di, dj) = cos(di, dj) = di^t dj   (3)

The cosine measure is used in a variant of k-means called spherical k-means [3]. While k-means aims to minimize Euclidean distance, spherical k-means intends to maximize the cosine similarity between documents in a cluster and that cluster's centroid. The major difference between Euclidean distance and cosine similarity, and therefore between k-means and spherical k-means, is that the former focuses on vector magnitudes, while the latter emphasizes vector directions. Besides its direct application in spherical k-means, the cosine of document vectors is also widely used in many other document clustering methods as a core similarity measurement. The min-max cut graph-based spectral method is an example [13]. In the graph partitioning approach, the document corpus is considered as a graph G = (V, E), where each document is a vertex in V and each edge in E has a weight equal to the similarity between a pair of vertices. The min-max cut algorithm tries to minimize the criterion function:

min Σ(r=1..k) Sim(Sr, S \ Sr) / Sim(Sr, Sr)   (5)

and when the cosine as in Eq. (3) is used, minimizing the criterion in Eq. (5) is equivalent to minimizing:

min Σ(r=1..k) (Dr^t D) / ||Dr||^2   (6)

where Dr is the composite vector of cluster Sr and D is the composite vector of the whole corpus. There are many other graph partitioning methods with different cutting strategies and criterion functions, such as Average Weight [14] and Normalized Cut [15], all of which have been successfully applied for document clustering using cosine as the pairwise similarity score [16], [17]. In [18], an
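Eq. (3) states that for unit vectors the cosine reduces to a plain inner product. A small numeric illustration (the vectors are our own example, not from the paper):

```python
import numpy as np

def cosine_sim(di, dj):
    # Eq. (3): cosine of the angle between two document vectors.
    return float(di @ dj / (np.linalg.norm(di) * np.linalg.norm(dj)))

# Normalize two vectors to unit length, as is standard for TF-IDF
# document vectors; the cosine then equals the inner product.
d1 = np.array([3.0, 4.0]); d1 = d1 / np.linalg.norm(d1)
d2 = np.array([4.0, 3.0]); d2 = d2 / np.linalg.norm(d2)
sim = cosine_sim(d1, d2)
inner = float(d1 @ d2)
```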

experimental study was conducted to compare a variety of criterion functions for document clustering.

Another popular graph-based clustering technique is implemented in a software package called CLUTO [19]. This technique first models the documents with a nearest-neighbor graph, and then splits the graph into clusters using a min-cut algorithm. Besides the cosine measure, the extended Jaccard coefficient can also be used in this method to represent similarity between nearest documents. Given non-unit document vectors ui, uj (di = ui/||ui||, dj = uj/||uj||), their extended Jaccard coefficient is:

Sim_eJacc(ui, uj) = (ui^t uj) / (||ui||^2 + ||uj||^2 − ui^t uj)   (7)

Compared with Euclidean distance and cosine similarity, the extended Jaccard coefficient takes into account both the magnitude and the direction of the document vectors. If the documents are instead represented by their corresponding unit vectors, this measure has the same effect as cosine similarity. In [20], Strehl et al. compared four measures: Euclidean, cosine, Pearson correlation and extended Jaccard, and concluded that cosine and extended Jaccard are the best ones on web documents.

In nearest-neighbor graph clustering methods, such as the CLUTO graph method above, the concept of similarity is somewhat different from the previously discussed methods. Two documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, they have no connection between them. In such a case, some context-based knowledge or relativeness property is already taken into account when considering similarity. Recently, Ahmad and Dey [21] proposed a method to compute distance between two categorical values of an attribute based on their relationship with all other attributes. Subsequently, Ienco et al. [22] introduced a similar context-based distance learning method for categorical data. However, for a given attribute, they only selected a relevant subset of attributes from the whole attribute set to use as the context for computing distance between its two values.

More related to text data, there are phrase-based and concept-based document similarities. Lakkaraju et al. [23] employed a conceptual tree-similarity measure to identify similar documents. This method requires representing documents as concept trees with the help of a classifier. For clustering, Chim and Deng [24] proposed a phrase-based document similarity by combining the suffix tree model and the vector space model. They then used a Hierarchical Agglomerative Clustering algorithm to perform the clustering task. However, a drawback of this approach is its high computational complexity, due to the need of building the suffix tree and computing pairwise similarities explicitly before clustering. There are also measures designed specifically for capturing structural similarity among XML documents [25]. They are fundamentally different from the document-content measures that are discussed in this paper.

In general, cosine similarity still remains the most popular measure because of its simple interpretation and easy computation, though its effectiveness is yet fairly limited. In the following sections, we propose a novel way to evaluate similarity between documents, and consequently formulate new criterion functions for document clustering.

III. MULTI-VIEWPOINT BASED SIMILARITY

A. Our novel similarity measure

The cosine similarity in Eq. (3) can be expressed in the following form without changing its meaning:

Sim(di, dj) = cos(di − 0, dj − 0) = (di − 0)^t (dj − 0)   (8)

where 0 is the zero vector that represents the origin point. According to this formula, the measure takes 0 as the one and only reference point. The similarity between two documents di and dj is determined w.r.t. the angle between the two points when looking from the origin.
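The extended Jaccard coefficient of Eq. (7) is easy to compute from dot products. A minimal sketch (example vectors are ours); note how, unlike cosine, it is sensitive to vector magnitude:

```python
import numpy as np

def extended_jaccard(ui, uj):
    # Eq. (7): u_i . u_j / (||u_i||^2 + ||u_j||^2 - u_i . u_j),
    # defined for non-unit document vectors.
    dot = float(ui @ uj)
    return dot / (float(ui @ ui) + float(uj @ uj) - dot)

u = np.array([1.0, 2.0])
same = extended_jaccard(u, u)          # identical vectors -> 1
scaled = extended_jaccard(u, 2 * u)    # same direction, doubled length
```

For unit vectors the denominator becomes 2 − cos, so the coefficient is a monotone function of cosine similarity, consistent with the remark above that it then has the same effect as cosine.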
To construct a new concept of similarity, it is possible to use more than just one point of reference. We may have a more accurate assessment of how close or distant a pair of points is if we look at them from many different viewpoints. From a third point dh, the directions and distances to di and dj are indicated respectively by the difference vectors (di − dh) and (dj − dh). By standing at various reference points dh to view di, dj and working on their difference vectors, we define the similarity between the two documents as:

Sim(di, dj | di, dj ∈ Sr) = (1 / (n − nr)) Σ(dh ∉ Sr) Sim(di − dh, dj − dh)   (9)

As described by the above equation, the similarity of two documents di and dj - given that they are in the same cluster - is defined as the average of the similarities measured relatively from the views of all other documents outside that cluster. The two objects to be measured must be in the same cluster, while the points from which to establish this measurement must be outside of the cluster. We call this proposal the Multi-Viewpoint based Similarity, or MVS. From this point onwards, we will denote the proposed similarity measure between two document vectors di and dj by MVS(di, dj | di, dj ∈ Sr). The final form of MVS in Eq. (9) depends on the particular formulation of the individual similarities within the sum. If the relative similarity is defined by the dot product of the difference vectors, we have:

MVS(di, dj | di, dj ∈ Sr) = (1 / (n − nr)) Σ(dh ∉ Sr) (di − dh)^t (dj − dh)
                          = (1 / (n − nr)) Σ(dh ∉ Sr) cos(di − dh, dj − dh) ||di − dh|| ||dj − dh||   (10)

The similarity between two points di and dj inside cluster Sr, viewed from a point dh outside this cluster, is equal to the product of the cosine of the angle between di and dj looking from dh and the Euclidean distances from dh to these two points. This definition is based on the assumption that dh is not in the same cluster with di and dj. The smaller the distances ||di − dh|| and ||dj − dh|| are, the higher the chance that dh is in fact in the same cluster with di and dj, and the similarity based on dh should also be small to reflect this potential. Therefore, through these distances, Eq. (10) also provides a measure of inter-cluster dissimilarity, given that points di and dj belong to cluster Sr, whereas dh belongs to another cluster. The overall similarity between di and dj is determined by taking the average over all the viewpoints not belonging to cluster Sr. It is possible to argue that while most of these viewpoints are useful, some of them may give misleading information, just as may happen with the origin point. However, given a large enough number of viewpoints and their diversity, it is reasonable to assume that the majority of them will be useful. Hence, the effect of misleading viewpoints is constrained and reduced by the averaging step. It can be seen that this method offers a more informative assessment of similarity than the single origin point based similarity measure.

B. Analysis and practical examples of MVS

In this section, we present an analytical study to show that the proposed MVS could be a very effective similarity measure for data clustering. In order to demonstrate its advantages, MVS is compared with cosine similarity (CS) on how well they reflect the true group structure in document collections. Firstly, expanding Eq. (10) for unit document vectors (dh^t dh = 1), we have:

MVS(di, dj | di, dj ∈ Sr) = di^t dj − di^t C(S\Sr) − dj^t C(S\Sr) + 1   (11)

where D(S\Sr) is the composite vector of all the documents outside cluster r, called the outer composite w.r.t. cluster r, and C(S\Sr) = D(S\Sr)/(n − nr) is the outer centroid w.r.t. cluster r, r = 1, ..., k.
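The dot-product form of MVS in Eq. (10) can be computed directly from the difference vectors. The sketch below is an illustration under the stated assumption that the supplied viewpoints lie outside the cluster of di and dj (the toy vectors are ours):

```python
import numpy as np

def mvs_similarity(di, dj, viewpoints):
    # Eq. (10): average over viewpoints d_h of the dot product of the
    # difference vectors (d_i - d_h) and (d_j - d_h). Each term equals
    # cos(d_i - d_h, d_j - d_h) * ||d_i - d_h|| * ||d_j - d_h||.
    total = 0.0
    for dh in viewpoints:
        total += float((di - dh) @ (dj - dh))
    return total / len(viewpoints)

di = np.array([1.0, 0.0])
dj = np.array([0.0, 1.0])
# Two assumed outside-cluster viewpoints; the averaging step damps
# the influence of any single misleading viewpoint.
viewpoints = np.array([[-1.0, -1.0], [0.0, -1.0]])
sim = mvs_similarity(di, dj, viewpoints)
```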
From Eq. (11), it can be seen that when comparing two pairwise similarities, besides the direct cosine terms, MVS also takes into account the relation of each document to the outer centroid, leading to the comparison criterion in Eq. (12). To further justify the above proposal and analysis, we carried out a validity test for MVS and CS. The purpose of this test is to check how well a similarity measure coincides with the true class labels. It is based on one principle: if a similarity measure is appropriate for the clustering problem, then for any document in the corpus, the documents that are closest to it under this measure should be in the same cluster with it.

The test is described by the procedure "Get validity score" in Fig. 1, which averages the per-document scores over all the rows of the similarity matrix A, as in line 14, Fig. 2. It is obvious that the validity score is bounded within 0 and 1. The higher the validity score a similarity measure has, the more suitable it should be for the clustering task.

Two real-world document datasets are used as examples in this validity test. The first is reuters7, a subset of the well-known collection of Reuters newswire articles, Reuters-21578 Distribution 1.0. Reuters is one of the most widely used test collections for text classification. For our validity test, we selected 2,500 documents from the 7 largest categories: acq, crude, interest, earn, money-fx, ship and trade to form reuters7. Some of the documents may appear in more than one category. The second dataset is k1b, a collection of 2,340 web pages from the Yahoo! subject hierarchy, covering 6 topics: health, entertainment, sport, politics, tech and business. It was created from a past study in information retrieval called WebACE [26], and is now available with the CLUTO toolkit [19]. The two datasets were preprocessed by stop-word removal and stemming. Moreover, we removed words that appear in fewer than two documents or in more than 99.5% of the total number of documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors.
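Fig. 1 itself is not reproduced in this transcription, so the following is only a hedged sketch consistent with the description above: for each document, look at its q most similar documents under a given similarity matrix and count the fraction that share its class label, then average. The neighborhood size q and all names here are our own:

```python
import numpy as np

def validity_score(S, labels, q):
    # For each document i, take the q documents most similar to it
    # (excluding i itself) under similarity matrix S, and record the
    # fraction carrying the same class label; return the average.
    # The result is bounded in [0, 1], higher being better.
    n = len(labels)
    total = 0.0
    for i in range(n):
        sims = S[i].copy()
        sims[i] = -np.inf                 # exclude the document itself
        nearest = np.argsort(-sims)[:q]   # indices of q most similar
        total += np.mean(labels[nearest] == labels[i])
    return total / n

# Toy example: two tight classes of unit-normalized "documents".
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
S = X @ X.T                               # cosine similarity matrix
labels = np.array([0, 0, 1, 1])
score = validity_score(S, labels, q=1)
```

On this well-separated toy data every document's nearest neighbor shares its label, so the score attains its upper bound of 1.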

IV. MULTI-VIEWPOINT BASED CLUSTERING

A. Two clustering criterion functions IR and IV

Having defined our similarity measure, we now formulate our clustering criterion functions. The first function, called IR, is the cluster size-weighted sum of the average pairwise similarities of documents in the same cluster. Firstly, let us express this sum in a general form by function F:

F = Σ(r=1..k) nr [ (1/nr^2) Σ(di, dj ∈ Sr) Sim(di, dj) ]   (13)

We would like to transform this objective function into a suitable form that facilitates the optimization procedure being performed in a simple, fast and effective way. Substituting Eq. (10) and noting that n is a constant, maximizing F is equivalent to maximizing the reduced form F' in Eq. (14). If we compare F' with the min-max cut in Eq. (5), both functions contain the two terms ||Dr||^2 (an intra-cluster similarity measure) and Dr^t D (an inter-cluster similarity measure). Nonetheless, while the objective of min-max cut is to minimize the ratio between these two terms, our aim here is to maximize their weighted difference. In F', this difference term is determined for each cluster, weighted by the inverse of the cluster's size, before being summed up over all the clusters. One problem is that this formulation is expected to be quite sensitive to cluster size. From the formulation of COSA [27] - a widely known subspace clustering algorithm - we have learned that it is desirable to have a set of weight factors {λr}, r = 1..k, to regulate the distribution of the cluster sizes in clustering solutions. Hence, we integrate {λr} into the expression of F' to obtain the form in Eq. (15). In common practice, {λr} are often taken to be simple functions of the respective cluster sizes {nr}, which finally yields the criterion function IR in Eq. (16).

In the formulation of IR, cluster quality is measured by the average pairwise similarity between documents within that cluster. However, such an approach can lead to sensitivity to the size and tightness of the clusters. With CS, for example, the pairwise similarity of documents in a sparse cluster is usually smaller than that in a dense cluster. Though not as obvious as with CS, it is still possible that the same effect may hinder MVS-based clustering if pairwise similarity is used. To prevent this, an alternative approach is to consider the similarity between each document vector and its cluster's centroid instead. This is expressed in objective function G in Eq. (17). Substituting as before and eliminating the constant n, maximizing G is equivalent to maximizing IV in Eq. (18). IV measures the weighted difference between the two terms ||Dr|| and Dr^t D/||Dr||, which again represent an intra-cluster similarity measure and an inter-cluster similarity measure, respectively.

Fig. 2. Algorithm: Incremental clustering.
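The incremental optimization algorithm of Fig. 2 is not reproduced in this transcription. The following generic sketch only illustrates the underlying idea - repeatedly move a single document to another cluster whenever the move improves the criterion, until a full pass makes no change. The stand-in objective below is a negated SSE for demonstration; the paper's algorithms optimize IR or IV instead:

```python
import numpy as np

def incremental_clustering(X, labels, k, objective, max_iter=20):
    # Visit documents in turn; move one to whichever cluster most
    # increases `objective` (a function of (X, labels, k) to be
    # maximized). Stop when a full pass produces no change.
    labels = labels.copy()
    for _ in range(max_iter):
        changed = False
        for i in range(len(X)):
            best_r, best_val = labels[i], objective(X, labels, k)
            for r in range(k):
                if r == labels[i]:
                    continue
                old = labels[i]
                labels[i] = r                     # tentatively move
                val = objective(X, labels, k)
                if val > best_val:
                    best_r, best_val = r, val
                labels[i] = old                   # undo tentative move
            if best_r != labels[i]:
                labels[i] = best_r                # commit best move
                changed = True
        if not changed:
            break
    return labels

def neg_sse(X, labels, k):
    # Negated sum-of-squared-error, so that "maximize" applies.
    sse = 0.0
    for r in range(k):
        m = X[labels == r]
        if len(m):
            sse += ((m - m.mean(axis=0)) ** 2).sum()
    return -sse

X = np.array([[0.0], [0.5], [10.0], [10.5]])
final = incremental_clustering(X, np.array([0, 1, 0, 1]), 2, neg_sse)
```

Starting from a deliberately bad partition, the single-document moves recover the two obvious groups {0, 0.5} and {10, 10.5}.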

V. PERFORMANCE EVALUATION OF MVSC

To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-IR and MVSC-IV with the existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include Euclidean distance, cosine similarity and the extended Jaccard coefficient.

A. Document collections

The data corpora that we used for the experiments consist of twenty benchmark document datasets. Besides reuters7 and k1b, which have been described in detail earlier, we included another eighteen text collections so that the evaluation of the clustering methods is more thorough and complete. Similar to k1b, these datasets are provided together with CLUTO by the toolkit's authors [19]. They had been used for experimental testing in previous papers, and their source and origin had also been described in detail there. Table 2 (Document Database) summarizes their characteristics. The corpora present a diversity of size, number of classes and class balance. They were all preprocessed by standard procedures, including stop-word removal, stemming, and removal of too rare as well as too frequent words.

VI. MVSC AS REFINEMENT FOR K-MEANS

From the analysis of Eq. (12) in Section 3.2, MVS provides an additional criterion for measuring the similarity among documents compared with CS. Alternatively, MVS can be considered as a refinement of CS, and consequently the MVSC algorithms as refinements of spherical k-means, which uses CS. To further investigate the appropriateness and effectiveness of MVS and its clustering algorithms, we carried out another set of experiments in which solutions obtained by Spkmeans were further optimized by MVSC-IR and MVSC-IV. The rationale for doing so is that if the final solutions produced by MVSC-IR and MVSC-IV are better than the intermediate ones obtained by Spkmeans, MVS is indeed good for the clustering problem. These experiments reveal more clearly whether MVS actually improves the clustering performance compared with CS.

In the previous section, the MVSC algorithms were compared against the existing algorithms that are closely related to them, i.e. ones that also employ similarity measures and criterion functions. In this section, we make use of extended experiments to further compare MVSC with a different type of clustering approach, the NMF methods [10], which do not use any form of explicitly defined similarity measure for documents.

A. TDT2 and Reuters collections

For diversity and thoroughness, in this experimental study, we used the two further document corpora described in Table 3 (TDT2 and Reuters document corpora). The original TDT2 corpus, which consists of 11,201 documents in 96 topics (i.e. classes), has been one of the most standard sets for document clustering purposes. We used a sub-collection of this corpus containing 10,021 documents in the 56 largest topics. The Reuters-21578 Distribution 1.0 has been mentioned earlier in this paper. The original corpus consists of 21,578 documents in 135 topics. We used a sub-collection having 8,213 documents from the 41 largest topics. The same two document collections had been used in the paper on the NMF methods [10]. Documents that appear in two or more topics were removed, and the remaining documents were preprocessed in the same way as in Section 5.1.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we propose a Multi-Viewpoint based Similarity measure, named MVS. Theoretical analysis and experimental examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, IR and IV, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced. Compared with other state-of-the-art clustering methods that use different types of similarity measure, on a large number of document datasets and under different evaluation metrics, the proposed algorithms show that they can provide significantly improved clustering performance. The key contribution of this paper is the fundamental concept of similarity measured from multiple viewpoints.
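The evaluation metrics themselves are not listed in this transcription. Purity is one commonly used external metric for document clustering against known class labels; it is our illustrative choice here, not necessarily one of the paper's:

```python
import numpy as np
from collections import Counter

def purity(pred, truth):
    # Purity: match each predicted cluster to its majority true class
    # and report the fraction of documents covered by those majorities.
    total = 0
    for c in np.unique(pred):
        members = truth[pred == c]
        total += Counter(members.tolist()).most_common(1)[0][1]
    return total / len(truth)

pred = np.array([0, 0, 1, 1, 1])   # cluster assignments
truth = np.array([0, 0, 0, 1, 1])  # true class labels
score = purity(pred, truth)
```

Purity is easy to interpret but rewards over-splitting (many tiny clusters score highly), so it is usually reported alongside metrics such as normalized mutual information.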

Future models could make use of the same principle, but specify alternative forms for the relative similarity in Eq. (10), or use schemes other than averaging to combine the relative similarities from the different viewpoints. Besides, this paper focuses on partitional clustering of documents. In the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. Finally, we have shown the application of MVS and its clustering algorithms to text data. It would be interesting to explore how they work on other types of sparse and high-dimensional data.

REFERENCES

[1] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg, "Top 10 algorithms in data mining," Knowl. Inf. Syst., vol. 14, no. 1, pp. 1-37, 2008.
[2] I. Guyon, U. von Luxburg, and R. C. Williamson, "Clustering: Science or Art?" NIPS'09 Workshop on Clustering Theory, 2009.
[3] I. Dhillon and D. Modha, "Concept decompositions for large sparse text data using clustering," Mach. Learn., vol. 42, no. 1-2, Jan. 2001.
[4] S. Zhong, "Efficient online spherical k-means clustering," in IEEE IJCNN, 2005.
[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," J. Mach. Learn. Res., vol. 6, Oct. 2005.
[6] E. Pekalska, A. Harol, R. P. W. Duin, B. Spillmann, and H. Bunke, "Non-Euclidean or non-metric measures can be informative," in Structural, Syntactic, and Statistical Pattern Recognition, ser. LNCS, vol. 4109, 2006.
[7] M. Pelillo, "What is a cluster? Perspectives from game theory," in Proc. of the NIPS Workshop on Clustering Theory, 2009.
[8] D. Lee and J. Lee, "Dynamic dissimilarity measure for support-based clustering," IEEE Trans. on Knowl. and Data Eng., vol. 22, no. 6, 2010.
[9] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," J. Mach. Learn. Res., vol. 6, Sep. 2005.
[10] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," in SIGIR, 2003.
[11] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in KDD, 2003.
[12] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge University Press, 2008.
[13] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, "A min-max cut algorithm for graph partitioning and data clustering," in IEEE ICDM, 2001.
[14] H. Zha, X. He, C. H. Q. Ding, M. Gu, and H. D. Simon, "Spectral relaxation for k-means clustering," in NIPS, 2001.
[15] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 888-905, 2000.
[16] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in KDD, 2001.
[17] Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis. Springer-Verlag New York, Inc., 2007.
[18] Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Mach. Learn., vol. 55, no. 3, Jun. 2004.
[19] G. Karypis, "CLUTO - a clustering toolkit," Dept. of Computer Science, Univ. of Minnesota, Tech. Rep., 2003, umn.edu/gkhome/views/cluto.
[20] A. Strehl, J. Ghosh, and R. Mooney, "Impact of similarity measures on web-page clustering," in Proc. of the 17th National Conf. on Artif. Intell.: Workshop of Artif. Intell. for Web Search. AAAI, Jul. 2000.
[21] A. Ahmad and L. Dey, "A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set," Pattern Recognit. Lett., vol. 28, no. 1, 2007.
[22] D. Ienco, R. G. Pensa, and R. Meo, "Context-based distance learning for categorical data clustering," in Proc. of the 8th Int. Symp. IDA, 2009.
[23] P. Lakkaraju, S. Gauch, and M. Speretta, "Document similarity based on concept tree distance," in Proc. of the 19th ACM Conf. on Hypertext and Hypermedia, 2008.
[24] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Trans. on Knowl. and Data Eng., vol. 20, no. 9, 2008.
[25] S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese, "Fast detection of XML structural similarity," IEEE Trans. on Knowl. and Data Eng., vol. 17, no. 2, 2005.
[26] E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore, "WebACE: a web agent for document categorization and exploration," in AGENTS'98: Proc. of the 2nd ICAA, 1998.
[27] J. Friedman and J. Meulman, "Clustering objects on subsets of attributes," J. R. Stat. Soc. Series B Stat. Methodol., vol. 66, no. 4, 2004.


International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

An Improvement of Centroid-Based Classification Algorithm for Text Classification

An Improvement of Centroid-Based Classification Algorithm for Text Classification An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,

More information

Text Mining Data Preparator with Multi-View Clustering

Text Mining Data Preparator with Multi-View Clustering Text Mining Data Preparator with Multi-View Clustering J.B.Naga Venkata Lakshmi Sudhakar KN Jitendranath M Student: Dept of CSE Assoc Prof.: Dept. CSE Prof & Dean, Dept of CSE CMRIT college CMRIT college

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms.

Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms. International Journal of Scientific & Engineering Research, Volume 5, Issue 10, October-2014 559 DCCR: Document Clustering by Conceptual Relevance as a Factor of Unsupervised Learning Annaluri Sreenivasa

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

Pattern Clustering with Similarity Measures

Pattern Clustering with Similarity Measures Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Automatic Cluster Number Selection using a Split and Merge K-Means Approach

Automatic Cluster Number Selection using a Split and Merge K-Means Approach Automatic Cluster Number Selection using a Split and Merge K-Means Approach Markus Muhr and Michael Granitzer 31st August 2009 The Know-Center is partner of Austria's Competence Center Program COMET. Agenda

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan

More information

Semi supervised clustering for Text Clustering

Semi supervised clustering for Text Clustering Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2

A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation. Kwanyong Lee 1 and Hyeyoung Park 2 A Distance-Based Classifier Using Dissimilarity Based on Class Conditional Probability and Within-Class Variation Kwanyong Lee 1 and Hyeyoung Park 2 1. Department of Computer Science, Korea National Open

More information

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH Sai Tejaswi Dasari #1 and G K Kishore Babu *2 # Student,Cse, CIET, Lam,Guntur, India * Assistant Professort,Cse, CIET, Lam,Guntur, India Abstract-

More information

Clustering Documents in Large Text Corpora

Clustering Documents in Large Text Corpora Clustering Documents in Large Text Corpora Bin He Faculty of Computer Science Dalhousie University Halifax, Canada B3H 1W5 bhe@cs.dal.ca http://www.cs.dal.ca/ bhe Yongzheng Zhang Faculty of Computer Science

More information

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,

More information

Unsupervised Feature Selection for Sparse Data

Unsupervised Feature Selection for Sparse Data Unsupervised Feature Selection for Sparse Data Artur Ferreira 1,3 Mário Figueiredo 2,3 1- Instituto Superior de Engenharia de Lisboa, Lisboa, PORTUGAL 2- Instituto Superior Técnico, Lisboa, PORTUGAL 3-

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 11, November 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM

CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED K-NEAREST NEIGHBOR (MKNN) ALGORITHM 4.1 Introduction Nowadays money investment in stock market gains major attention because of its dynamic nature. So the

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA)

TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA) TOWARDS NEW ESTIMATING INCREMENTAL DIMENSIONAL ALGORITHM (EIDA) 1 S. ADAEKALAVAN, 2 DR. C. CHANDRASEKAR 1 Assistant Professor, Department of Information Technology, J.J. College of Arts and Science, Pudukkottai,

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Explore Co-clustering on Job Applications. Qingyun Wan SUNet ID:qywan

Explore Co-clustering on Job Applications. Qingyun Wan SUNet ID:qywan Explore Co-clustering on Job Applications Qingyun Wan SUNet ID:qywan 1 Introduction In the job marketplace, the supply side represents the job postings posted by job posters and the demand side presents

More information

Improved K-Means Algorithm Based on COATES Approach. Yeola, Maharashtra, India. Yeola, Maharashtra, India.

Improved K-Means Algorithm Based on COATES Approach. Yeola, Maharashtra, India. Yeola, Maharashtra, India. Improved K-Means Algorithm Based on COATES Approach Mr.Nikhil Patankar 1, Prof.Prakash Rokade 2 1 Department of Computer Engineering, SND College of Engineering & Research Center, Yeola, Maharashtra, India

More information

Implementation of Data Clustering With Meta Information Using Improved K-Means Algorithm Based On COATES Approach

Implementation of Data Clustering With Meta Information Using Improved K-Means Algorithm Based On COATES Approach www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 4 Issue 3 March 2015, Page No. 10647-10651 Implementation of Clustering With Meta Information Using Improved

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH

A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH Sandhya V. Kawale Prof. Dr. S. M. Kamalapur M.E. Student Associate Professor Deparment of Computer Engineering, Deparment of Computer Engineering, K. K. Wagh

More information

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining 1 Vishakha D. Bhope, 2 Sachin N. Deshmukh 1,2 Department of Computer Science & Information Technology, Dr. BAM

More information

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE

DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE DENSITY BASED AND PARTITION BASED CLUSTERING OF UNCERTAIN DATA BASED ON KL-DIVERGENCE SIMILARITY MEASURE Sinu T S 1, Mr.Joseph George 1,2 Computer Science and Engineering, Adi Shankara Institute of Engineering

More information

Improved Similarity Measure For Text Classification And Clustering

Improved Similarity Measure For Text Classification And Clustering Improved Similarity Measure For Text Classification And Clustering Rahul Nalawade 1, Akash Samal 2, Kiran Avhad 3 1Computer Engineering Department, STES Sinhgad Academy Of Engineering,Pune 2Computer Engineering

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

C-NBC: Neighborhood-Based Clustering with Constraints

C-NBC: Neighborhood-Based Clustering with Constraints C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

Image retrieval based on bag of images

Image retrieval based on bag of images University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2009 Image retrieval based on bag of images Jun Zhang University of Wollongong

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

Criterion Functions for Document Clustering Experiments and Analysis

Criterion Functions for Document Clustering Experiments and Analysis Criterion Functions for Document Clustering Experiments and Analysis Ying Zhao and George Karypis University of Minnesota, Department of Computer Science / Army HPC Research Center Minneapolis, MN 55455

More information

Centroid-Based Document Classification: Analysis & Experimental Results?

Centroid-Based Document Classification: Analysis & Experimental Results? Centroid-Based Document Classification: Analysis & Experimental Results? Eui-Hong (Sam) Han and George Karypis University of Minnesota, Department of Computer Science / Army HPC Research Center Minneapolis,

More information

Semi-supervised Data Representation via Affinity Graph Learning

Semi-supervised Data Representation via Affinity Graph Learning 1 Semi-supervised Data Representation via Affinity Graph Learning Weiya Ren 1 1 College of Information System and Management, National University of Defense Technology, Changsha, Hunan, P.R China, 410073

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

The Effect of Word Sampling on Document Clustering

The Effect of Word Sampling on Document Clustering The Effect of Word Sampling on Document Clustering OMAR H. KARAM AHMED M. HAMAD SHERIN M. MOUSSA Department of Information Systems Faculty of Computer and Information Sciences University of Ain Shams,

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Text clustering based on a divide and merge strategy

Text clustering based on a divide and merge strategy Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and

More information

Size Regularized Cut for Data Clustering

Size Regularized Cut for Data Clustering Size Regularized Cut for Data Clustering Yixin Chen Department of CS Univ. of New Orleans yixin@cs.uno.edu Ya Zhang Department of EECS Uinv. of Kansas yazhang@ittc.ku.edu Xiang Ji NEC-Labs America, Inc.

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

Module: CLUTO Toolkit. Draft: 10/21/2010

Module: CLUTO Toolkit. Draft: 10/21/2010 Module: CLUTO Toolkit Draft: 10/21/2010 1) Module Name CLUTO Toolkit 2) Scope The module briefly introduces the basic concepts of Clustering. The primary focus of the module is to describe the usage of

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Progress Report: Collaborative Filtering Using Bregman Co-clustering

Progress Report: Collaborative Filtering Using Bregman Co-clustering Progress Report: Collaborative Filtering Using Bregman Co-clustering Wei Tang, Srivatsan Ramanujam, and Andrew Dreher April 4, 2008 1 Introduction Analytics are becoming increasingly important for business

More information

Distance based Clustering for Categorical Data

Distance based Clustering for Categorical Data Distance based Clustering for Categorical Data Extended Abstract Dino Ienco and Rosa Meo Dipartimento di Informatica, Università di Torino Italy e-mail: {ienco, meo}@di.unito.it Abstract. Learning distances

More information

Collaborative Filtering using Euclidean Distance in Recommendation Engine

Collaborative Filtering using Euclidean Distance in Recommendation Engine Indian Journal of Science and Technology, Vol 9(37), DOI: 10.17485/ijst/2016/v9i37/102074, October 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Collaborative Filtering using Euclidean Distance

More information

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming Dr.K.Duraiswamy Dean, Academic K.S.Rangasamy College of Technology Tiruchengode, India V. Valli Mayil (Corresponding

More information

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS Kirthiga, M.E-Communication system, PREC, Thanjavur R.Kannan,Assistant professor,prec Abstract: Face Recognition is important

More information

Datasets Size: Effect on Clustering Results

Datasets Size: Effect on Clustering Results 1 Datasets Size: Effect on Clustering Results Adeleke Ajiboye 1, Ruzaini Abdullah Arshah 2, Hongwu Qin 3 Faculty of Computer Systems and Software Engineering Universiti Malaysia Pahang 1 {ajibraheem@live.com}

More information

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information

Learning based face hallucination techniques: A survey

Learning based face hallucination techniques: A survey Vol. 3 (2014-15) pp. 37-45. : A survey Premitha Premnath K Department of Computer Science & Engineering Vidya Academy of Science & Technology Thrissur - 680501, Kerala, India (email: premithakpnath@gmail.com)

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Still Image Objective Segmentation Evaluation using Ground Truth

Still Image Objective Segmentation Evaluation using Ground Truth 5th COST 276 Workshop (2003), pp. 9 14 B. Kovář, J. Přikryl, and M. Vlček (Editors) Still Image Objective Segmentation Evaluation using Ground Truth V. Mezaris, 1,2 I. Kompatsiaris 2 andm.g.strintzis 1,2

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1

More information

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method. IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY IMPROVED ROUGH FUZZY POSSIBILISTIC C-MEANS (RFPCM) CLUSTERING ALGORITHM FOR MARKET DATA T.Buvana*, Dr.P.krishnakumari *Research

More information

Cluster Analysis (b) Lijun Zhang

Cluster Analysis (b) Lijun Zhang Cluster Analysis (b) Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Grid-Based and Density-Based Algorithms Graph-Based Algorithms Non-negative Matrix Factorization Cluster Validation Summary

More information

Generalized Transitive Distance with Minimum Spanning Random Forest

Generalized Transitive Distance with Minimum Spanning Random Forest Generalized Transitive Distance with Minimum Spanning Random Forest Author: Zhiding Yu and B. V. K. Vijaya Kumar, et al. Dept of Electrical and Computer Engineering Carnegie Mellon University 1 Clustering

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

Collaborative Filtering based on User Trends

Collaborative Filtering based on User Trends Collaborative Filtering based on User Trends Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos Papadopoulos, and Yannis Manolopoulos Aristotle University, Department of Informatics, Thessalonii 54124,

More information
