Volume 3, Issue 3, March 2013 ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper Available online at: www.ijarcsse.com
Special Issue: Computing Technologies and Research Development Conference Held at SCAD College of Engineering and Technology, India

An Efficient Clustering Algorithm for Spam Mail Detection

Sharmila.P #1, Shanthalakshmi Revathy.J *2
# Post Graduate Student, * Assistant Professor, Department of Computer Science and Engineering, Velammal College of Engineering and Technology, Madurai, Tamil Nadu, India

Abstract: Clustering high-dimensional data results in overlapping clusters and the loss of some data. This paper extends k-means clustering with a weight function for clustering high-dimensional data. The weight function is determined by a vector space model that converts the high-dimensional data into a vector matrix. The proposed algorithm thus performs projective clustering, which is used to find the overlapping boundaries in various subspaces. The objective is to find the relevant dimensions for cluster formation by reducing the irrelevant ones. The approach is illustrated with document clustering, using e-mail documents as sample datasets.

Keywords: Document Clustering, Spam Filtering, Document Frequency, K-Means Clustering.

I. INTRODUCTION
Clustering is an unsupervised learning process: there are no predefined classes or class labels. A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. Document clustering organizes a collection into groups such that the documents within each group are similar to each other and dissimilar to those in other groups. Clustering can produce disjoint or overlapping partitions; in an overlapping partition, a document may appear in multiple clusters. Most clustering approaches represent each document as a vector, reducing each document to a form suitable for traditional data clustering approaches such as k-means, BIRCH, and KNN.

Hierarchical and partitional algorithms are the dominant clustering methods. In hierarchical clustering, each document is initially its own cluster, and the algorithm works by successively merging or splitting clusters. An advantage of this method is that the number of clusters need not be supplied in advance, but hierarchical algorithms are not appropriate for real-time applications or large corpora, so it is generally accepted that partitional algorithms perform better for document clustering. Partitional methods, of which the classical example is k-means, start by choosing k initial documents as clusters and iteratively assign documents to clusters while updating the centroids of those clusters. Text data is directional, so it is typical to normalize document vectors and to use a cosine similarity measure rather than Euclidean distance; the resulting algorithm is called spherical k-means. For each data set, pre-processing is first applied to compute the vector space model (VSM) representation, and stop words are removed using common stop lists.

A. Dimension Reduction: A major problem in clustering document data sets is high dimensionality. Two types of dimension reduction technique are feature transformation and feature selection. In feature transformation, the high-dimensional space is transformed into a lower-dimensional space. In feature selection, only the relevant dimensions are extracted; for text data, document frequency is the usual criterion.
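To make the feature-selection step concrete, the following is a minimal sketch of document-frequency-based term selection, assuming tokenized documents; the function name and cut-off values are illustrative, not taken from the paper.

```python
from collections import Counter

def select_terms_by_df(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies in [min_df, max_df_ratio * N].

    docs: list of token lists, one list per document.
    Very rare terms carry little clustering signal, and very common
    terms (near stop words) do not discriminate between clusters.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    return {t for t, f in df.items() if min_df <= f <= max_df_ratio * n_docs}

# Toy example: "money" appears in all three documents and is dropped as
# too common; "free" and "meeting" are dropped as too rare.
docs = [["free", "offer", "money"], ["meeting", "money"], ["offer", "money"]]
print(select_terms_by_df(docs))  # {'offer'}
```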
B. Document Clustering: For large datasets the k-means algorithm works well, but its known issues are the selection of the initial centers and the handling of noisy data. To avoid the problem of noisy data, irrelevant data are reduced in the pre-processing step and the weights are calculated using the vector model. The other issue, selecting the initial centers, is overcome by projective k-means clustering: select the first centroid randomly, select the document least similar to it as the second centroid, choose each subsequent centroid to be far away from the centroids already chosen, and then proceed with traditional k-means clustering (a sketch of this seeding strategy appears after Section C below).

C. Spam Filtering: Many problems arise due to spam mails, and a number of techniques are used to identify them: keyword identification, mail-header analysis, blacklists or whitelists, Bayesian analysis, and so on. The proposed clustering algorithm is based on the keyword identification method.
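The following is a minimal sketch of the centroid-seeding strategy described in Section B. It assumes document vectors and a similarity function such as the cosine measure defined in Section IV; the function names are illustrative.

```python
import random

def seed_centroids(docs, k, sim):
    """Farthest-first seeding: pick a random first centroid, then repeatedly
    pick the document least similar to every centroid chosen so far.

    docs: list of document vectors; sim(a, b): similarity score, higher
    meaning more similar (e.g. cosine similarity).
    """
    centroids = [random.choice(docs)]
    while len(centroids) < k:
        # A candidate's score is its similarity to its closest centroid;
        # the next centroid is the candidate with the lowest such score.
        farthest = min(docs, key=lambda d: max(sim(d, c) for c in centroids))
        centroids.append(farthest)
    return centroids
```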

II. RELATED WORKS
Numerous works on document clustering using data mining techniques have motivated this study. The vector space model [1] is an algebraic model for representing text documents as vectors of identifiers. The tf-idf weighting scheme is applied, where tf is the term frequency and idf is the inverse document frequency; common terms are eliminated through this weighting. Agglomerative and partitional document clustering algorithms were compared in [2], [11], which showed partitional clustering to be the more efficient for document clustering. Bouguessa and Wang [3] proposed a robust partitional distance-based projected clustering algorithm for detecting projected clusters of low dimensionality embedded in a high-dimensional space, while avoiding distance computations in the full-dimensional space. Strehl et al. [4] showed that the Euclidean distance used in the k-means clustering algorithm is inefficient when clustering large numbers of documents and used a cosine similarity measure instead. Sanguinetti [5] proposed that the document-frequency-based technique is better suited to higher dimensions than to lower ones. Jing et al. [6] extended the k-means clustering process to calculate a weight for each dimension in each cluster and used the weight values to identify the subsets of important dimensions that characterize different clusters. The main issue with the traditional k-means algorithm is that the clustering result is sensitive to the initial cluster centroids and may converge to a local optimum [8]. Jing et al. [10] presented a text clustering system based on a k-means-type subspace clustering algorithm for large, high-dimensional, sparse text data; a new step added to the k-means process automatically calculates the weights of the keywords in each cluster, so that the important words of a cluster can be identified by their weight values. Feature selection is one of the important and frequently used techniques in data pre-processing for data mining [13], [14]; it reduces the number of features and removes irrelevant, redundant, or noisy data.

III. SYSTEM ARCHITECTURE
A. System Architecture: Document clustering is the process of grouping similar documents into a single cluster. During clustering, a document may belong to more than one cluster; this is soft clustering. First, the documents are collected from the corpus and their words are extracted for pre-processing, that is, removal of stop words and truncation to stem words. These words are then represented in a vector by their weights, which are calculated from the term frequency and the inverse document frequency. The highest-frequency and lowest-frequency words are eliminated for efficient clustering. The terms in the vector are the keywords for clustering; the word weights are sorted in descending order to minimize the number of keywords used for clustering the documents. Finally, the documents are clustered using the similarity measure and k-means projective clustering. Fig. 1 illustrates the system architecture.

[Fig. 1: System Architecture]

B. Block Diagram: Fig. 2 shows the block diagram for document clustering. Extract the data source; in the initial pre-processing step, remove stop-list words and truncate the remaining words to their stems. Construct the vector model using document frequency, and then cluster the documents using cosine similarity.

[Fig. 2: Block Diagram]
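As an illustration of this flow only (the stage names are ours, with the stages themselves sketched in Section IV below), the block diagram amounts to a three-step pipeline:

```python
def cluster_corpus(raw_docs, k, preprocess, vectorize, kmeans_projective):
    """End-to-end flow of Fig. 2: pre-process, weight, then cluster.

    The three stages are passed in as functions, keeping this skeleton
    independent of any particular implementation.
    """
    tokenized = [preprocess(d) for d in raw_docs]   # stopping + stemming
    vectors = vectorize(tokenized)                  # tf-idf document vectors
    return kmeans_projective(vectors, k)            # cosine-based clustering
```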

IV. MODULES
The proposed clustering method consists of four main modules. First, import the ling-spam data set and extract the words from each document. If an extracted word is a stop word, remove it; then perform stemming based on the rules of the stemming algorithm. Next, construct the vocabulary and calculate the tf-idf weights. Finally, cluster the documents with the k-means projective algorithm. Each module is explained in detail below.

A. Text Pre-processing: Real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their huge size and their likely origin in multiple, heterogeneous sources. Low-quality data leads to low-quality mining results, so the data should be pre-processed to improve its quality. There are several data pre-processing techniques in data mining: data cleaning, data integration, data reduction, and data transformation. Pre-processing the data set is an important step in the data mining process. In document clustering, pre-processing removes stop words and truncates words to their roots, called stopping and stemming respectively; these two modules implement the data reduction applied in the pre-processing step.

B. Vector Representation: After pre-processing, the documents are represented in the vector space model by a term-document matrix, as shown in Fig. 3. The frequency of a term determines its weight, meaning that terms appearing more frequently are more important for the document. The weight of each term is calculated as

    Wt_{t,d} = tf_{t,d} * idf_t

where the term frequency tf_{t,d} is the number of times term t appears in document d of the collection, and the inverse document frequency is

    idf_t = log(N / df_t)

where N is the total number of documents and df_t is the number of documents in which the specific term appears.

[Fig. 3: Term-Document Matrix, with documents 1, 2, ..., n and term weights Wt1, Wt2, ..., Wtn]

C. Similarity Measure: A similarity measure is a function that computes the degree of similarity between two vectors. Cosine similarity is a common measure of similarity between two vectors; it measures the cosine of the angle between them. In the term-document matrix A, the cosine between document vectors d_i = {x1, x2, ..., xn} and d_j = {y1, y2, ..., yn} is computed according to the cosine formula

    cos(d_i, d_j) = (d_i . d_j) / (||d_i|| ||d_j||)

where d_i and d_j are the i-th and j-th document vectors and ||d_i|| and ||d_j|| denote their Euclidean lengths. The greater the value, the more similar the vectors are said to be. Very often the document vectors are normalized to unit length. In high dimensions, the cosine measure performs better than other measures of similarity.
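A minimal sketch of Sections B and C together, assuming tokenized documents and a vocabulary drawn from the corpus (so every vocabulary term has a nonzero document frequency); the names are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Build the tf-idf document vectors of Section IV.B.

    docs: list of token lists; vocab: ordered list of terms that occur
    in at least one document (so df_t >= 1 and no division by zero).
    """
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Wt_{t,d} = tf_{t,d} * log(N / df_t)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vectors

def cosine(di, dj):
    """cos(d_i, d_j) = (d_i . d_j) / (||d_i|| ||d_j||), as in Section IV.C."""
    dot = sum(x * y for x, y in zip(di, dj))
    norm = math.sqrt(sum(x * x for x in di)) * math.sqrt(sum(y * y for y in dj))
    return dot / norm if norm else 0.0
```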
D. K-Means Projective Clustering: In traditional clustering, the centroids are chosen randomly, so similar documents may be selected as different centroids, which increases the number of iterations. To avoid this, the first centroid is picked as a random document, the document least similar to it is picked as the second centroid, and subsequent centroids are chosen to be farthest away from the centroids already picked. The major issues in k-means clustering are 1) selection of the initial centroids, 2) handling of outliers, and 3) choice of the number of clusters k. The proposed algorithm addresses these issues by selecting the initial centroids based on the similarity score and by handling outliers through elimination of the least-weighted terms.

The procedure for the proposed clustering algorithm is:
1. Specify the value of k (the number of clusters).
2. Randomly select k documents and place one of them in each cluster.
3. Place each remaining document in a cluster based on its similarity to the documents already present in the clusters.
4. Compute the centroid of each of the k clusters.
5. Using the similarity measure, compute the similarity between the centroids and the input documents, i.e. generate the similarity vector.
6. Place each document in the cluster whose centroid it is most similar to.
7. After placing all the documents, compare the previous and current iterations; if they agree, terminate the algorithm and output the final clusters.
8. Otherwise, repeat from step 5.
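A compact sketch of this procedure, reusing the seeding and cosine sketches above as the `seed` and `sim` parameters; the names and the iteration cap are illustrative assumptions, not from the paper.

```python
def kmeans_projective(vectors, k, sim, seed, max_iter=100):
    """Steps 1-8 of Section IV.D: seed k centroids, then alternate between
    assigning each document to its most similar centroid and recomputing
    the centroids, until the assignments stop changing.

    sim(a, b): similarity measure (e.g. the cosine sketch above);
    seed(vectors, k, sim): initial centroids (e.g. farthest-first seeding).
    """
    centroids = seed(vectors, k, sim)
    assignment = None
    for _ in range(max_iter):
        # Steps 5-6: assign each document to the most similar centroid.
        new_assignment = [max(range(k), key=lambda c: sim(v, centroids[c]))
                          for v in vectors]
        # Step 7: stop when the previous and current iterations agree.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4 (repeated): each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```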

V. IMPLEMENTATION
The corpus used for training and testing is ling-spam [12]. Ling-spam contains four subdirectories corresponding to four versions of the corpus: Bare (lemmatiser disabled, stop-list disabled), Lemm (lemmatiser enabled, stop-list disabled), Lemm_stop (lemmatiser enabled, stop-list enabled), and Stop (lemmatiser disabled, stop-list enabled). Lemmatising is similar to stemming, and the stop-list setting indicates whether stopping has been applied to the message contents. Attachments, HTML tags, and duplicate spam messages received on the same day are not included.

A. Algorithm for text pre-processing: This algorithm pre-processes the text in the e-mail dataset, reducing irrelevant text by removing punctuation symbols and special characters, removing any word that appears in the stop-list, and truncating the remaining words to stem (root) words.

Input: Training data set
Output: Reduced documents
For all documents
    Remove special characters and stop words
    For all remaining words in documents
        Truncate to root words (stemming)

B. Algorithm for vector representation: Calculate the document frequency, term frequency, and inverse document frequency to assign a weight to each term in the reduced data set.

Input: Reduced documents
Output: Weight for each term
For each word in each document
    Calculate idf = log(N / df)
    Weight = tf * idf

C. Algorithm for k-means projective clustering: In traditional clustering, the centroids are chosen randomly, so similar documents may be selected as different centroids, which increases the number of iterations.

Input: k = 2, dataset
Output: 2 distinct clusters
K1 = randomly chosen document
K2 = document least similar to K1
Repeat
    For each document
        Calculate the similarity measure with K1
        If similarity > 0.5
            Place in the spam cluster
        Else
            Place in the non-spam cluster
    Compute new centroids for each cluster
Until no change
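As an illustration, the two-cluster algorithm above might be realised in Python as follows, assuming tf-idf vectors and the `cosine` sketch from Section IV; the 0.5 threshold comes from the pseudocode, while the function and variable names are ours.

```python
def spam_cluster(vectors, cosine, threshold=0.5, max_iter=100):
    """Sketch of Algorithm C: split documents into spam / non-spam clusters.

    K1 seeds the spam cluster; K2, the document least similar to K1, seeds
    the non-spam cluster. As in the pseudocode, assignment thresholds the
    similarity to K1 only, and iteration stops when nothing changes.
    """
    k1 = vectors[0]                                  # randomly chosen in the paper
    k2 = min(vectors, key=lambda v: cosine(v, k1))   # least similar to K1
    spam = None
    for _ in range(max_iter):
        new_spam = [i for i, v in enumerate(vectors) if cosine(v, k1) > threshold]
        if new_spam == spam:
            break                                    # until no change
        spam = new_spam
        # Recompute each centroid as the mean vector of its cluster.
        spam_vecs = [vectors[i] for i in spam]
        ham_vecs = [v for i, v in enumerate(vectors) if i not in set(spam)]
        if spam_vecs:
            k1 = [sum(col) / len(spam_vecs) for col in zip(*spam_vecs)]
        if ham_vecs:
            k2 = [sum(col) / len(ham_vecs) for col in zip(*ham_vecs)]
    non_spam = [i for i in range(len(vectors)) if i not in set(spam)]
    return spam, non_spam
```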

To avoid seeding several centroids with similar documents, the first centroid is picked as a random document, the document least similar to it is picked as the second centroid, and any subsequent centroids are chosen to be farthest away from those already picked.

VI. CONCLUSIONS
In this project, a k-means projective clustering method is proposed for e-mail clustering and implemented to detect spam mails. The ling-spam corpus was selected for the experiment. The e-mail documents are pre-processed and represented in a vector space model; the documents are then clustered efficiently by the k-means projective clustering algorithm using the calculated weights and the similarity measure, and a list of keywords is identified for filtering the spam mails. The efficiency of the proposed method could be further improved by applying association rules.

REFERENCES
[1] Vector Space Model. Internet: http://en.wikipedia.org/wiki/Vector_space_model [Mar. 23, 2012].
[2] R. Xu, "Survey of Clustering Algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 634-678, 2005.
[3] M. Bouguessa and S. Wang, "Mining Projected Clusters in High-Dimensional Spaces," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 4, pp. 507-522, 2009.
[4] A. Strehl, J. Ghosh, and R. Mooney, "Impact of Similarity Measures on Web-Page Clustering," in AAAI 2000 Workshop on AI for Web Search, pp. 58-64, July 2000.
[5] G. Sanguinetti, "Dimensionality Reduction of Clustered Data Sets," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 535-540, 2008.
[6] L. Jing, M.K. Ng, and J.Z. Huang, "An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 8, pp. 1026-1041, 2007.
[7] L. Jing, M.K. Ng, J. Xu, and J.Z. Huang, "A Text Clustering System Based on k-Means Type Subspace Clustering," Int'l J. Intelligent Technology, vol. 1, no. 2, pp. 91-103, 2006.
[8] H. Liu and L. Yu, "Toward Integrating Feature Selection Algorithms for Classification and Clustering," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 4, pp. 491-502, 2005.
[9] C.C. Aggarwal and P.S. Yu, "Redefining Clustering for High-Dimensional Applications," IEEE Trans. Knowledge and Data Eng., vol. 14, no. 2, pp. 210-225, 2005.
[10] L. Jing, M. Ng, J. Xu, and Z. Huang, "Subspace Clustering of Text Documents with Feature Weighting k-Means Algorithm," pp. 802-812, 2005.
[11] Y. Zhao and G. Karypis, "Comparison of Agglomerative and Partitional Document Clustering Algorithms," 17 Apr. 2002.
[12] Ling-Spam data set. Internet: http://csmining.org/index.php/ling-spam-datasets.html [Mar. 23, 2012].
[13] H. Liu and H. Motoda, eds., Feature Extraction, Construction and Selection: A Data Mining Perspective. Boston: Kluwer Academic, 1998; second printing, 2001.
[14] A.L. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.