Volume 3, Issue 3, March 2013 ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper. Available online at: www.ijarcsse.com
Special Issue: Computing Technologies and Research Development Conference, held at SCAD College of Engineering and Technology, India

An Efficient Clustering Algorithm for Spam Mail Detection
Sharmila.P #1, Shanthalakshmi Revathy.J *2
# Post Graduate Student, * Assistant Professor, Department of Computer Science and Engineering, Velammal College of Engineering and Technology, Madurai, Tamil Nadu, India

Abstract— Clustering high dimensional data results in overlapping clusters and loss of some data. This paper extends k-means clustering with a weight function for clustering high dimensional data. The weight function is determined by a vector space model that converts high dimensional data into a vector matrix. The proposed algorithm is thus a projective clustering method used to find the overlapping boundaries in various subspaces. The objective is to find the relevant dimensions by eliminating irrelevant dimensions during cluster formation. The approach is illustrated through document clustering, with email documents taken as the sample datasets.

Keywords— Document Clustering, Spam Filtering, Document Frequency, K-Means Clustering.

I. INTRODUCTION
Clustering is an unsupervised learning process: there are no predefined classes or class labels. A good clustering method produces high quality clusters with high intra-class similarity and low inter-class similarity. Document clustering organizes a collection into groups such that the documents within each group are similar to each other and dissimilar to those in other groups. Clustering can produce disjoint or overlapping partitions; in an overlapping partition, a document may appear in multiple clusters.
Most clustering approaches represent each document as a vector, reducing each document to a dimensionality suitable for traditional data clustering approaches such as k-means, BIRCH, and KNN. Hierarchical and partitional algorithms are the dominant clustering methods. In hierarchical clustering, each document is initially its own cluster, and the algorithm works by successively merging or splitting clusters. An advantage of this method is that the number of clusters need not be supplied in advance, but hierarchical algorithms are not appropriate for real-time applications or large corpora. It is therefore generally accepted that partitional algorithms perform better than hierarchical algorithms on such workloads. Partitional methods, of which the classical example is k-means, start by choosing k initial documents as cluster centroids and iteratively assign documents to clusters while updating the centroids of these clusters. It is well known that text data is directional, so it is typical to normalize document vectors and to use a cosine similarity measure rather than Euclidean distance; the resulting algorithm is called spherical k-means. For each data set, the pre-processing step first computes the VSM model and removes stop words using common stop lists.
A. Dimension Reduction : The major problem in clustering document data sets is high dimensionality. The two main dimension reduction techniques are feature transformation and feature selection. In feature transformation, the high dimensional space is transformed into a lower dimensional space. Feature selection extracts only the relevant dimensions; for text data, document frequency is a common criterion.
B. Document Clustering : The k-means algorithm works well for large datasets, but its main issues are the selection of initial centers and the handling of noisy data. To avoid the problem of noisy data, irrelevant data are removed in the pre-processing step and term weights are calculated using the vector model.
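The document-frequency criterion described above can be sketched as a simple filter that keeps only terms whose document frequency falls within chosen bounds. This is a minimal illustration, not the paper's implementation; the `min_df` and `max_df_ratio` thresholds and the toy corpus are assumptions made for the example.

```python
from collections import Counter

def df_feature_selection(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies between the thresholds.

    docs: list of token lists. min_df and max_df_ratio are illustrative
    cut-offs, not values prescribed by the paper.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

# Toy corpus of tokenized e-mails.
docs = [["free", "money", "offer"],
        ["meeting", "money", "agenda"],
        ["free", "offer", "click"],
        ["agenda", "minutes", "money"]]
kept = df_feature_selection(docs)  # rare terms like "click" are dropped
```

Terms appearing in only one document ("meeting", "click", "minutes") are discarded as irrelevant dimensions, shrinking the vector space before clustering.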
The other issue, selecting the initial centers, is overcome by projective k-means clustering. In this clustering algorithm, the first centroid is selected randomly, the document least similar to it is selected as the second centroid, and subsequent centroids are chosen so that they are far away from the already chosen centroids; traditional k-means clustering then proceeds.
C. Spam Filtering : Many problems arise from spam mails. A number of techniques are used to identify spam mails: keyword identification, mail-header analysis, blacklists or whitelists, Bayesian analysis, and so on. The proposed clustering algorithm is based on the keyword identification method.
II. RELATED WORKS
Numerous works on document clustering using data mining techniques have motivated this study. The vector space model [1] is an algebraic model for representing text documents as vectors of identifiers. The tf-idf weighting scheme is applied, where tf is the term frequency and idf is the inverse document frequency. The common terms are eliminated
using this effect. [2][11] compared agglomerative and partitional document clustering algorithms and showed that partitional clustering is more efficient for document clustering. [3] proposed a robust partitional distance-based projected clustering algorithm for detecting projected clusters of low dimensionality embedded in a high-dimensional space, while avoiding distance computation in the full-dimensional space. [4] showed that the Euclidean distance used in the k-means clustering algorithm is inefficient when clustering large numbers of documents, and used a cosine similarity measure instead. [5] showed that the document frequency based technique works better in higher dimensions than in lower dimensions. [6] extended the k-means clustering process to calculate a weight for each dimension in each cluster and used the weight values to identify the subsets of important dimensions that characterize different clusters. The main issue in the traditional k-means algorithm is that the clustering result is sensitive to the initial cluster centroids and may converge to a local optimum [8]. [10] presented a text clustering system based on a k-means type subspace clustering algorithm for large, high dimensional, and sparse text data; a new step is added to the k-means clustering process to automatically calculate the weights of keywords in each cluster so that the important words of a cluster can be identified by their weight values. Feature selection is one of the important and frequently used techniques in data pre-processing for data mining [13], [14]; it reduces the number of features and removes irrelevant, redundant, or noisy data.
III. SYSTEM ARCHITECTURE
A. System Architecture : Document clustering is the process of grouping similar documents into a single cluster. A document may belong to more than one cluster; this is soft clustering.
First, collect the documents from the corpus, then extract the words for pre-processing, such as removal of stop words and truncation to stem words. These words are then represented in a vector using their weights; a weight is calculated as the product of the term frequency and the inverse document frequency. The highest frequency and lowest frequency words are eliminated for efficient clustering. The words in the vector are the keywords for clustering. Sort the word weights in descending order to minimize the number of keywords used for clustering the documents. Cluster the documents using the similarity measure and k-means projective clustering. Fig.1 explains the system architecture.
Fig.1 System Architecture
B. Block Diagram : Fig.2 is the block diagram for document clustering. Extract the data source, then perform the initial pre-processing step: removal of stop-list words and truncation to stem words. Construct the vector model using document frequency, and then cluster the documents with cosine similarity.
Fig.2 Block Diagram
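The keyword minimization step described above (sorting term weights in descending order and keeping only the strongest terms) can be sketched as follows; the `top_k` cut-off and the sample weights are illustrative assumptions, not values from the paper.

```python
def select_keywords(term_weights, top_k=3):
    # Rank terms by weight, highest first, and keep the top_k terms
    # as the keyword set used for clustering.
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:top_k]]

# Hypothetical tf-idf weights for four terms.
weights = {"free": 0.82, "meeting": 0.31, "offer": 0.77, "agenda": 0.05}
keywords = select_keywords(weights)  # strongest terms survive the cut
```

Low-weight terms such as "agenda" fall below the cut-off and are excluded from the clustering keywords.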
IV. MODULES
The proposed clustering method consists of four main modules. First, import the ling-spam data set and extract the words from each document. If an extracted word is a stop word, remove it; then perform stemming based on the rules of the stemming algorithm. Construct the vocabulary and calculate the tf-idf weights. Finally, cluster the documents with the k-means projective algorithm. Each module is explained in detail below.
A. Text Pre-processing : Real world databases are highly susceptible to noisy, missing, and inconsistent data due to their huge size and their likely origin from multiple, heterogeneous sources. Low quality data will lead to low quality mining results, so the data should be pre-processed. There are several data pre-processing techniques in data mining: data cleaning, data integration, data reduction, and data transformation. Pre-processing the data set is an important step in the data mining process. In document clustering, pre-processing is done by removing stop words and truncating words to their roots, called stopping and stemming respectively; these are the two main data reduction modules applied here.
B. Vector Representation : After pre-processing, the documents are represented in the vector space model by a term-document matrix, as shown in Fig.3. The frequency of each term is its weight, which means terms appearing more frequently are more important for the document. The weight for each term is calculated using the formula

Wt(t,d) = tf(t,d) * idf(t)

where the term frequency tf(t,d) is the number of times term t appears in document d, and the inverse document frequency is idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents in which the term appears.
Fig.3 Term-Document Matrix
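The weighting formula above can be sketched in a few lines. This is an illustrative implementation under two assumptions: a natural logarithm is used (the paper does not fix the log base) and the documents are a toy corpus invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one {term: weight} dict per
    # document, with weight = tf(t,d) * log(N / df(t)).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each term counted once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["free", "offer", "free"], ["meeting", "agenda"], ["free", "click"]]
vecs = tfidf_vectors(docs)
# "free" appears in 2 of 3 documents, so idf("free") = log(3/2);
# in the first document tf("free") = 2, giving weight 2 * log(3/2).
```

A term occurring in every document gets idf = log(1) = 0 and drops out, which is exactly how the common terms are eliminated.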
C. Similarity Measure : A similarity measure is a function that computes the degree of similarity between two vectors. Cosine similarity is a common measure of similarity between two vectors: it measures the cosine of the angle between them. In the term-document matrix A, the cosine between document vectors di = {x1, x2, x3, ..., xn} and dj = {y1, y2, y3, ..., yn} is computed by the cosine formula

cos(di, dj) = (di . dj) / (||di|| ||dj||)

where di and dj are the ith and jth document vectors and ||di|| and ||dj|| denote their Euclidean lengths. The greater the value, the more similar the documents are said to be. Very often the document vectors are normalized to unit length. In high dimensions the cosine measure shows better performance than other measures of similarity.
D. K-Means Projective Clustering : In traditional clustering, centroids are chosen randomly, so it is possible to select similar documents as different centroids, which increases the number of iterations. To avoid this, the first centroid is picked as a random document, and the document least similar to it is picked as the second centroid. Subsequent centroids are chosen so that they are farthest away from the centroids already picked. The major issues in k-means clustering are 1) selection of the initial centroids; 2) handling of outliers; and 3) choice of the number of clusters k. The proposed algorithm overcomes these issues by selecting the initial centroids based on the similarity score; outliers are handled by eliminating the least-weighted terms. The procedure for the proposed clustering algorithm is:
1. Specify the value of k (the number of clusters).
2. Randomly select k documents and place one of the k selected documents in each cluster.
3. Place each remaining document in a cluster based on the similarity between the document and the documents already present in the clusters.
4. Compute the centroid of each of the k clusters.
5.
Using the similarity measure again, find the similarity between the centroids and the input documents; that is, generate the similarity vector.
6. Place each document in a cluster based on the similarity between the document and the cluster centroids.
7. After placing all the documents in the clusters, compare the previous and current iterations; if they are identical, terminate the algorithm and output the final clusters.
8. Otherwise, repeat from step 5.
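The procedure above can be sketched end to end. This is a minimal illustration under several assumptions: documents are already tf-idf dictionaries, cosine similarity is used throughout, seeding follows the least-similar rule of Section IV.D (with the first centroid fixed to the first document for reproducibility rather than chosen at random), and the loop stops when assignments no longer change.

```python
import math

def cos_sim(a, b):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Component-wise mean of a group of sparse vectors.
    c = {}
    for v in vectors:
        for t, w in v.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in c.items()}

def seed_centroids(docs, k):
    # First centroid: the first document (a stand-in for a random pick);
    # each next centroid is the document least similar to those picked.
    cents = [docs[0]]
    while len(cents) < k:
        far = min(docs, key=lambda d: sum(cos_sim(d, c) for c in cents))
        cents.append(far)
    return cents

def kmeans_projective(docs, k, max_iter=20):
    cents = seed_centroids(docs, k)
    assign = None
    for _ in range(max_iter):
        new = [max(range(k), key=lambda i: cos_sim(d, cents[i])) for d in docs]
        if new == assign:  # step 7: no change between iterations, terminate
            break
        assign = new
        for i in range(k):
            members = [d for d, a in zip(docs, assign) if a == i]
            if members:
                cents[i] = centroid(members)
    return assign

# Toy corpus: two "spam-like" and two "ham-like" tf-idf vectors.
docs = [{"free": 1.0, "offer": 1.0}, {"free": 1.0, "click": 1.0},
        {"meeting": 1.0, "agenda": 1.0}, {"agenda": 1.0, "minutes": 1.0}]
labels = kmeans_projective(docs, 2)
```

On this toy corpus the seeding rule picks the first document and the document least similar to it as the two starting centroids, so the two groups separate in a single pass.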
V. IMPLEMENTATION
The corpus used for training and testing is ling-spam [12]. Ling-spam has four subdirectories, corresponding to four versions of the corpus:
Bare: lemmatiser disabled, stop-list disabled.
Lemm: lemmatiser enabled, stop-list disabled.
Lemm_stop: lemmatiser enabled, stop-list enabled.
Stop: lemmatiser disabled, stop-list enabled.
Here lemmatising is similar to stemming, and the stop-list setting indicates whether stopping is applied to the content of the messages. Attachments, HTML tags, and duplicate spam messages received on the same day are not included.
A. Algorithm for text pre-processing : This algorithm pre-processes the text in the email dataset to reduce the irrelevant text: remove punctuation symbols and special characters, compare each word with the stop-list and remove it if present, then truncate the remaining words to stem (root) words.
Input: training data set
Output: reduced documents
For all documents
    Remove special characters and stop words
    For all remaining words in the document
        Truncate to root words (stemming)
B. Algorithm for Vector Representation : Calculate the document frequency, term frequency, and inverse document frequency to assign a weight to each term in the reduced data set.
Input: reduced documents
Output: weight for each term
For each word in each document
    Calculate idf = log(N / df)
    Weight = tf * idf
C. Algorithm for K-means projective clustering : In traditional clustering, centroids are chosen randomly, so it is possible to select similar documents as different centroids, which increases the number of iterations.
Input: k = 2, dataset
Output: 2 distinct clusters
k1 = randomly chosen document
k2 = document most dissimilar to k1
Repeat
    For each document
        Calculate the similarity measure with k1
        If similarity > 0.5
            Place in the spam cluster
        Else
            Place in the non-spam cluster
    Compute new centroids for each cluster
Until no change
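Algorithm A above can be sketched as follows. The stop-list and the suffix-stripping rules here are tiny illustrative stand-ins (a real system would use a full stop-list and a proper stemmer such as Porter's), and the sample message is invented.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of"}  # toy stop-list
SUFFIXES = ("ing", "ed", "es", "s")                       # toy stemming rules

def stem(word):
    # Strip the first matching suffix; a crude stand-in for a real stemmer.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    # Remove punctuation/special characters, drop stop words, then stem.
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

tokens = preprocess("Click here to claim the amazing offers!!!")
# -> ['click', 'here', 'claim', 'amaz', 'offer']
```

The regular expression discards the punctuation, the stop-list drops "to" and "the", and the suffix rules reduce "offers" to "offer" before vector construction.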
VI. CONCLUSIONS
In this work, a k-means projective clustering method is proposed for email clustering and implemented to detect spam mails. The ling-spam corpus was selected for the experiment. The e-mail documents are pre-processed and represented in a vector space model; the documents are then clustered efficiently with the k-means projective clustering algorithm using the calculated weights and the similarity measure, and a list of keywords is identified for filtering the spam mails. The efficiency of the proposed method could be further improved by applying association rules.
REFERENCES
[1] Vector Space Model. Internet: http://en.wikipedia.org/wiki/Vector_space_model [Mar. 23, 2012].
[2] Xu Rui, Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, vol. 16, issue 3, pp. 634-678, 2005.
[3] M. Bouguessa and S. Wang, Mining Projected Clusters in High Dimensional Spaces, IEEE Trans. Knowledge and Data Eng., vol. 21, issue 4, pp. 507-522, 2009.
[4] A. Strehl, J. Ghosh, and R. Mooney, Impact of Similarity Measures on Web-Page Clustering, in AAAI 2000 Workshop on AI for Web Search, pp. 58-64, July 2000.
[5] G. Sanguinetti, Dimensionality Reduction of Clustered Data Sets, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, issue 3, pp. 535-540, 2008.
[6] L. Jing, M.K. Ng, and J.Z. Huang, An Entropy Weighting k-means Algorithm for Subspace Clustering of High-Dimensional Sparse Data, IEEE Trans. Knowledge and Data Eng., vol. 19, issue 8, pp. 1026-1041, 2007.
[7] L. Jing, M.K. Ng, J. Xu, and J.Z. Huang, A Text Clustering System Based on k-means Type Subspace Clustering, Int'l J. Intelligent Technology, vol. 1, issue 2, pp. 91-103, 2006.
[8] H. Liu and L.
Yu, Toward Integrating Feature Selection Algorithms for Classification and Clustering, IEEE Trans. Knowledge and Data Eng., vol. 17, issue 4, pp. 491-502, 2005.
[9] C.C. Aggarwal and P.S. Yu, Redefining Clustering for High Dimensional Applications, IEEE Trans. Knowledge and Data Eng., vol. 14, issue 2, pp. 210-225, 2005.
[10] L. Jing, M. Ng, J. Xu, and Z. Huang, Subspace Clustering of Text Documents with Feature Weighting k-means Algorithm, pp. 802-812, 2005.
[11] Ying Zhao and George Karypis, Comparison of Agglomerative and Partitional Document Clustering Algorithms, 17 Apr. 2002.
[12] Ling-Spam data set. Internet: http://csmining.org/index.php/ling-spam-datasets.html [Mar. 23, 2012].
[13] H. Liu and H. Motoda, eds., Feature Extraction, Construction and Selection: A Data Mining Perspective, Boston: Kluwer Academic, 1998, second printing, 2001.
[14] A.L. Blum and P. Langley, Selection of Relevant Features and Examples in Machine Learning, Artificial Intelligence, vol. 97, pp. 245-271, 1997.