Centroid Based Text Clustering

Priti Maheshwari, Jitendra Agrawal
School of Information Technology, Rajiv Gandhi Technical University, Bhopal [M.P.], India

Abstract--Web mining is a burgeoning new field that attempts to glean meaningful information from natural language text. It refers generally to the process of extracting interesting information and knowledge from unstructured text. Text clustering is one of the important Web mining functionalities: it is the task of grouping texts into clusters of similar objects based on their content. Current research in Web mining tackles problems of text data representation, classification, clustering, information extraction, and the search for and modeling of hidden patterns. To mine large document collections, it is necessary to pre-process the web documents and store the information in a data structure that is more appropriate for further processing than a plain web file. In this paper we develop a PHP-MySQL based utility that converts unstructured web documents into a structured tabular representation through preprocessing and indexing. We apply a centroid based web clustering method to the preprocessed data, using three clustering methods. Finally, we propose a method that can increase accuracy based on clustering of documents.

Keywords: Web mining, text clustering, centroid selection, single link, complete link.

I. Introduction

Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification: the data are modeled by their clusters. With the help of clustering, a data set is partitioned into groups based on data similarity [6]. The advantages of such a clustering based process are that it is adaptable to change and helps single out useful features that distinguish different groups.
We calculate the frequency of a term in a document, that is, how many times a particular term occurs in a particular document:

$$w_{ij} = \mathrm{Freq}_{ij} \qquad (1.1)$$

where $w_{ij}$ is the weight of the jth term in the ith document and $\mathrm{Freq}_{ij}$ is the number of times the jth term occurs in document $D_i$.

In this paper, we design a new model for web page clustering based on centroid selection techniques. The proposed approach is as follows:

Web document pre-processing
  a) HTML tag remover
  b) Stop-word remover
  c) Stemming word remover
Calculate centroid term document vector
Apply agglomerative clustering approach for different clusters
  (a) Single linkage
  (b) Average linkage
  (c) Complete linkage

The framework of the approach starts with pre-processing. The pre-processing task is accomplished in several steps that remove irrelevant or noisy information from the data: 1] removal of HTML tags; 2] stop-word removal, in which non-informative words are removed; 3] stemming, in which suffixes are removed. After pre-processing we calculate the term weights, which give the frequency count of each term. We then calculate the centroid, which gives the document vector. Clustering is the process in which data objects that are similar to one another are placed in one group and dissimilar objects in other clusters. It is a useful component in various natural language processing applications. Here we use agglomerative clustering approaches and find different clusters through single link (minimum distance), average link, and complete link (maximum distance).

II. Related Approaches

Rimma V. Nehme and Elke A. Rundensteiner [1] propose SCUBA, an algorithm for evaluating a large set of continuous queries over spatio-temporal data streams. The key idea of SCUBA is to group moving objects and queries, based on common spatio-temporal properties at runtime, into moving clusters in order to optimize query execution and thus facilitate scalability. SCUBA exploits shared cluster-based execution by abstracting the evaluation of a set of spatio-temporal queries as a spatial join, first between moving clusters. This cluster-based filtering prunes true negatives. The execution then proceeds with a fine-grained within-moving-cluster join for all pairs of moving clusters identified as potentially joinable by a positive cluster-join match. A moving cluster can serve as an approximation of the location of its members. The authors show how moving clusters can serve as a means for intelligent load shedding of spatio-temporal data, avoiding performance degradation with minimal harm to result quality.
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim [2] propose a clustering algorithm called CURE that is more robust to outliers and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a fixed number of points, generated by selecting well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned, and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Furthermore, the authors demonstrate that random sampling and partitioning enable CURE not only to outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

Mohamed E. El-Sharkawi and Mohamed A. El-Zawawy [3] propose CPO, an efficient clustering technique that solves the problem of clustering in the presence of obstacles. The algorithm divides the spatial area into rectangular cells. Each cell is associated with statistical information used to label the cell as dense or non-dense. Each cell is also labeled as obstructed (i.e., it intersects an obstacle) or non-obstructed. For each obstructed cell, the algorithm finds a number of non-obstructed sub-cells. It then finds the dense regions of non-obstructed cells or sub-cells by a breadth-first search; these regions, each with a center, are the required clusters.
George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar [4] propose a clustering algorithm called CHAMELEON that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between the two clusters are comparable to the internal inter-connectivity of the clusters and the closeness of items within the clusters. The merging process using the dynamic model facilitates the discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in CHAMELEON is applicable to all types of data as long as a similarity matrix can be constructed.

Khaled M. Hammouda [5] derives an information-theoretic similarity measure based on phrases shared between documents rather than on individual words. The basic concept is to find a metric that makes use of phrases rather than individual words. Two pairwise document similarity measures are proposed: one is corpus-dependent and the other is corpus-independent. The corpus-independent measure allows for incremental processing of documents. Only the corpus-independent measure was evaluated in that report.

III. Proposed Approach

We divide the proposed work into five steps.

Priti Maheshwari et al. / International Journal of Engineering Science and Technology

ALGORITHM OF APPROACH MODEL

Step 1:- First we load different web pages.

Step 2:- We perform preprocessing, which is composed of the following steps:
  a) Apply HTML tag remover
  b) Apply stop-word remover
  c) Apply stemming word remover
  d) Concatenate all words that remain after processing in all documents
  e) Apply frequency count

Step 3:- We calculate the centroid. The algorithm for centroid selection is as follows:

Algorithm centroid_as_sum()
Input:
  List of keywords
  Document vectors of a particular class
  N is the total number of documents in a class
Output:
  A vector that is the centroid as sum of a class

Open a file that contains all keywords;
i = 0;
While (!EOF)
  Sum = 0;
  Read keyword[i];
  For j = 0 to N-1 do
    Open doc[j];
    If keyword[i] is in doc[j] then
      Sum = Sum + frequency or weight of keyword[i] in doc[j];
  Write keyword[i] + " - " + Sum into file;
  i = i + 1;

Algorithm: centroid_as_sum()
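As a minimal sketch of the frequency count (Step 2e) and of centroid_as_sum (Step 3), the Python code below counts term frequencies per document and sums them per keyword over one class. It is illustrative only: the function names, the whitespace tokenizer, and the in-memory dictionaries (used here in place of the keyword file and document files of the pseudocode) are assumptions, not the authors' PHP-MySQL implementation.

```python
from collections import Counter


def term_frequencies(text):
    """Return w_ij = Freq_ij for one document as a term -> count mapping."""
    # Assumption: lowercase whitespace tokenization stands in for the
    # HTML-tag, stop-word and stemming steps of Step 2.
    return Counter(text.lower().split())


def centroid_as_sum(keywords, doc_vectors):
    """Sum each keyword's weight over all documents of one class."""
    centroid = {}
    for keyword in keywords:
        total = 0
        for vector in doc_vectors:
            # A keyword that does not occur in a document adds nothing.
            if keyword in vector:
                total += vector[keyword]
        centroid[keyword] = total
    return centroid


# Hypothetical two-document class used only for illustration.
docs = ["web mining text clustering", "text clustering of web text"]
vectors = [term_frequencies(d) for d in docs]
keywords = sorted(set().union(*vectors))
centroid = centroid_as_sum(keywords, vectors)
print(centroid)
```

The keyword list is built here as the union of all terms in the class; in the paper it is the dictionary produced by feature identification and reduction.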


The centroid_as_sum algorithm has two inputs. One is the dictionary (or set of features) of a particular class, which is the outcome of the preprocessing, feature identification, feature weighting and feature reduction process. The other is a set of training documents that are already preprocessed. The algorithm takes one feature from the dictionary at a time and sums its weight over the documents in which the feature occurs.

Step 4:- We then apply the agglomerative clustering approach.

SIMPLEHAC(d_1, ..., d_N)
  for n ← 1 to N
  do for i ← 1 to N
     do C[n][i] ← SIM(d_n, d_i)
     I[n] ← 1   (keeps track of active clusters)
  A ← []   (assembles clustering as a sequence of merges)
  for k ← 1 to N - 1
  do ⟨i, m⟩ ← arg max {C[i][m] : i ≠ m ∧ I[i] = 1 ∧ I[m] = 1}
     A.APPEND(⟨i, m⟩)   (store merge)
     for j ← 1 to N
     do C[i][j] ← SIM(i, m, j)
        C[j][i] ← SIM(i, m, j)
     I[m] ← 0   (deactivate cluster)
  return A

Step 5:- Finally we find different clusters through single link, complete link, and average link.

IV. Experimental Results

In this work we used a PHP-MySQL based web application that provides the preprocessing and processing tasks. We used WampServer to run this application. We propose a clustering based approach that partitions the documents in each class into k optimum clusters. A centroid is then computed for each cluster. We used the HCE tool (Hierarchical Clustering Explorer) to form clusters of classes in our experiments. For text clustering we downloaded text documents from the UCI KDD Archive dataset and vectorised them using our preprocessor. On the basis of our experiments, the following observations have been made:

(1) Centroid based approaches produce very high clustering accuracy (95%) for documents belonging to classes that are orthogonal.
(2) Clustering results are more efficient.
(3) When orthogonality between classes is lower, the clustering accuracy is not as high.
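Returning to Steps 4 and 5, the SIMPLEHAC procedure and the three linkage criteria can be sketched in Python as follows. This is a hedged illustration, not the HCE tool or the authors' code: the paper does not define SIM, so cosine similarity over term-weight dictionaries is assumed, and the average-link update is taken to be the size-weighted mean of pairwise inter-cluster similarities. The names `cosine` and `simple_hac` and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts (assumed SIM)."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def simple_hac(docs, linkage="single"):
    """Naive agglomerative clustering; returns the merge sequence [(i, m), ...]."""
    if linkage not in ("single", "complete", "average"):
        raise ValueError("unknown linkage: " + linkage)
    n = len(docs)
    sim = [[cosine(docs[a], docs[b]) for b in range(n)] for a in range(n)]
    size = [1] * n        # cluster sizes, needed for average link
    active = [True] * n   # I[] in the pseudocode
    merges = []           # A in the pseudocode
    for _ in range(n - 1):
        # Pick the most similar pair of distinct active clusters.
        pairs = [(a, b) for a in range(n) for b in range(a + 1, n)
                 if active[a] and active[b]]
        i, m = max(pairs, key=lambda p: sim[p[0]][p[1]])
        merges.append((i, m))
        # Update the similarity of the merged cluster i to every other cluster.
        for j in range(n):
            if not active[j] or j in (i, m):
                continue
            if linkage == "single":      # closest members (minimum distance)
                s = max(sim[i][j], sim[m][j])
            elif linkage == "complete":  # farthest members (maximum distance)
                s = min(sim[i][j], sim[m][j])
            else:                        # mean over all inter-cluster pairs
                s = (size[i] * sim[i][j] + size[m] * sim[m][j]) / (size[i] + size[m])
            sim[i][j] = sim[j][i] = s
        size[i] += size[m]
        active[m] = False  # cluster m is absorbed into cluster i
    return merges


# Hypothetical term-weight vectors for two roughly orthogonal classes.
docs = [
    {"web": 2, "mining": 1},
    {"web": 2, "mining": 2},
    {"text": 1, "cluster": 3},
    {"text": 1, "cluster": 2},
]
for method in ("single", "complete", "average"):
    print(method, simple_hac(docs, method))
```

On these sample vectors the two classes share no terms, so all three linkage criteria merge documents within each class before joining the classes, which is the situation in observation (1).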

V. Conclusion

This paper addresses the clustering problem. It first analyses the main framework of a popular clustering algorithm, the agglomerative clustering approach. It indicates that centroid based text clustering is one of the most popular supervised approaches for classifying a text into a set of predefined classes with relatively low computation. It then briefly discusses the work related to centroid computation for each cluster. Finally, it forms the clusters of classes with the help of HCE Tool 3.5 (Hierarchical Clustering Explorer). The result is a fast and more robust web document clustering compared to other methods, and the empirical results are more accurate. We have developed a PHP-MySQL based approach for preprocessing, indexing and feature selection; these steps are mandatory for text categorization.

As future work we plan to experiment more on less orthogonal classes. We expect to improve our proposed approach by clustering the documents in various arbitrary shapes. We also plan to improve clustering performance by using techniques that adjust the importance or weights of the different features in a supervised setting, and to study how centroid based text clustering methods behave when new documents are incrementally added to the training set. The preprocessing algorithm can also be further changed with respect to hyperlinks for web data.

VI. References

[1] Rimma V. Nehme, Elke A. Rundensteiner: SCUBA: Scalable Cluster-Based Algorithm for Evaluating Continuous Spatio-temporal Queries on Moving Objects. EDBT 2006.
[2] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: CURE: An Efficient Clustering Algorithm for Large Databases. International Conference on Management of Data, Proceedings of the 1998 ACM.
[3] Mohamed A. El-Zawawy, Mohamed E. El-Sharkawi: Clustering with Obstacles in Spatial Database, 2009.
[4] George Karypis, Eui-Hong (Sam) Han, Vipin Kumar: CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, 1999.
[5] Khaled M. Hammouda: Web Document Clustering Using Phrase-based Document Similarity. Proceedings of the 2002 IEEE International Conference on Data Mining.
[6] Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques, second edition.
