CHAPTER 3 ASSOCIATION RULE BASED CLUSTERING


3.1 INTRODUCTION

This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms are efficient when the dimensionality of the data is low. The number of documents available, however, is very large, and applying clustering directly to them degrades the performance of the clustering algorithm. This motivated us to reduce the dimension of the documents before clustering them, so that the performance of the clustering algorithm is improved.

Clustering algorithms are divided into two classes: hierarchical and partitioning. The hierarchical method has the disadvantage that once a clustering step (i.e. a merge or a split) is done, it cannot be undone. Therefore, the documents have to be analysed properly before they are clustered. This led us to retrieve only the associated documents from the entire dataset before clustering. To retrieve the related documents, the association rule mining algorithm is used. Association rule mining is generally applied to transactional databases to find associated items; here it has been adapted to text databases to retrieve associated documents. The associated documents are then clustered using the hierarchical algorithm, and the results have shown improved performance.

Next, in the partitioning approach, the disadvantage is the initial selection of the k centroids. In the simple K-means algorithm the centroids are selected randomly, and this random selection affects the performance of the algorithm. The centroids should be placed carefully, because different initial locations lead to different results. It has been observed that locating the centroids at a larger distance from one another produces better results (Mushfeq-Us-Saleheen Shameem & Raihana Ferdous 2009). To place the centroids at such distances, we propose a method for locating the initial centroids. This method uses dissimilar documents as the initial centroids, and these dissimilar documents are identified using the Apriori algorithm: the Apriori algorithm finds the similar documents, and the remaining documents are identified as dissimilar. These documents are selected as the initial centroids, and the K-means algorithm is applied. The experimental results have shown that selecting dissimilar documents as the initial centroids improves the clustering performance.

These two points motivated us to develop this work. The chapter is divided into two parts: in the first part, association rule mining integrated with the hierarchical algorithm is described; in the second part, the K-means algorithm integrated with the association rule algorithm is explained. The rest of the chapter is organized as follows. In Section 3.2, the association rule based hierarchical clustering algorithm is described. In Section 3.3, the K-means algorithm based on association rules is described.

3.2 ASSOCIATION RULE BASED HIERARCHICAL CLUSTERING ALGORITHM

The Apriori algorithm is a popular association rule mining algorithm that is used in data mining to find frequent itemsets. The same algorithm is used in this work to find similar documents.

The remaining dissimilar documents are removed, which reduces the size of the database. The associated documents alone are then clustered, which improves the efficiency of the clustering algorithm. The method consists of several phases: pre-processing, feature selection, and finally, clustering. The block diagram of the proposed system is depicted in Figure 3.1.

Figure 3.1 Block Diagram of the Proposed Method (Input Text File → Preprocessing → Feature Selection → Association Rule Miner → Hierarchical Clustering → Clustered Documents as Output)

3.2.1 Pre-processing and Feature Selection Method

Pre-processing involves stop word removal and stemming. Stop words are common English words that carry little meaning in a document; these words are removed using a stop word list. The next step in pre-processing is stemming, which is the process of reducing derived words to their base, stem or root form. The words are stemmed using the Porter stemmer algorithm.
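A minimal sketch of this preprocessing stage follows, assuming Python with NLTK's PorterStemmer; the regular-expression tokenizer and the abbreviated stop word list are illustrative stand-ins, since the text names the Porter algorithm and "a stop word list" but no particular implementation.

```python
import re
from nltk.stem import PorterStemmer  # Porter stemming, as named in the text

# Abbreviated illustrative stop word list; a full standard list
# (e.g. NLTK's English list) would be substituted in practice.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
              "from", "has", "in", "is", "it", "of", "on", "that",
              "the", "to", "was", "were", "will", "with"}

def preprocess(text):
    """Lower-case, tokenize, remove stop words, and stem each token."""
    stemmer = PorterStemmer()
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

# Example: "The clusters are formed" -> ['cluster', 'form']
```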

After these pre-processing steps, the important features are selected using the feature selection method proposed by Xue & Zhou (2007). In this method, the importance of a term is calculated based on the compactness of the appearances of the word and the position of its first appearance. The distribution of the word is calculated in two steps. First, the document is divided into several parts; then the counts are entered into an array, where each entry corresponds to the number of appearances of the word in that part. In this work, the number of parts corresponds to the number of sentences in the document. Suppose a document d contains n sentences; the distributional array of the word t is represented as array(t, d) = [c_0, c_1, ..., c_{n-1}]. Compactness measures whether the word is concentrated in a specific part of the document or spread over the whole document. The compactness (ComPact) of the appearances of the word t and the position of the first appearance (FirstApp) of the word t are defined in Equations (3.1) to (3.6) as follows:

FirstApp(t, d) = min_{i ∈ {0,...,n-1}} (c_i ≠ 0 ? i : n)                          (3.1)

ComPact_PartNum(t, d) = Σ_{i=0}^{n-1} (c_i ≠ 0 ? 1 : 0)                           (3.2)

LastApp(t, d) = max_{i ∈ {0,...,n-1}} (c_i ≠ 0 ? i : -1)                          (3.3)

ComPact_FLDist(t, d) = LastApp(t, d) - FirstApp(t, d)                             (3.4)

centroid(t, d) = (Σ_{i=0}^{n-1} c_i · i) / count(t, d),
    where count(t, d) = Σ_{i=0}^{n-1} c_i                                         (3.5)

ComPact_PosVar(t, d) = (Σ_{i=0}^{n-1} c_i · |i - centroid(t, d)|) / count(t, d)   (3.6)
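The definitions in Equations (3.1) to (3.6) translate almost line for line into Python. The sketch below assumes each sentence is already a list of tokens and follows the ternary notation of the equations; it presumes the word actually occurs in the document (count > 0), as the equations do.

```python
def dist_array(word, sentences):
    """array(t, d) = [c_0, ..., c_{n-1}]: count of `word` per sentence."""
    return [s.count(word) for s in sentences]  # each sentence: token list

def first_app(c):
    """Eq. (3.1): index of the first part containing the word, else n."""
    n = len(c)
    return min((i if c[i] != 0 else n) for i in range(n))

def last_app(c):
    """Eq. (3.3): index of the last part containing the word, else -1."""
    return max((i if c[i] != 0 else -1) for i in range(len(c)))

def compact_part_num(c):
    """Eq. (3.2): number of parts in which the word appears."""
    return sum(1 for ci in c if ci != 0)

def compact_fl_dist(c):
    """Eq. (3.4): distance between the last and first appearances."""
    return last_app(c) - first_app(c)

def centroid(c):
    """Eq. (3.5): mean position of the word's occurrences."""
    return sum(ci * i for i, ci in enumerate(c)) / sum(c)

def compact_pos_var(c):
    """Eq. (3.6): mean absolute deviation of positions from the centroid."""
    mu = centroid(c)
    return sum(ci * abs(i - mu) for i, ci in enumerate(c)) / sum(c)
```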

The frequency of a word is calculated using the Term Frequency (TF) measure, the compactness of its appearances using the ComPactness (CP) measures, and the position of its first appearance using the FirstAppearance (FA) measure. TF, CP and FA are calculated using Equations (3.7) to (3.12) as follows:

TF(t, d) = count(t, d) / size(d)                      (3.7)

CP_PN(t, d) = ComPact_PartNum(t, d) / len(d)          (3.8)

CP_FLD(t, d) = 1 - ComPact_FLDist(t, d) / len(d)      (3.9)

CP_PV(t, d) = 1 - ComPact_PosVar(t, d) / len(d)       (3.10)

FA(t, d) = f(FirstApp(t, d), len(d))                  (3.11)

f(p, len(d)) = (len(d) - p) / len(d)                  (3.12)

where size(d) is the total number of words in document d and len(d) is the number of parts (sentences). The importance of each word is calculated as the summation of TF, CP and FA. The words are then ranked in descending order of this value, and the topmost four words are selected for each text file. All the text files are processed in the same manner to obtain the final feature set.
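The scoring step is sketched below, reusing the primitives above. The text states that the importance is the summation of TF, CP and FA; how the three compactness quantities (3.8) to (3.10) combine into the single CP term is not spelled out, so a plain sum is assumed here, and `top_k=4` mirrors the four-words-per-file rule.

```python
def importance(word, sentences):
    """Importance(t, d) = TF + CP + FA, per Equations (3.7)-(3.12)."""
    c = dist_array(word, sentences)
    n = len(sentences)                        # len(d): number of parts
    size = sum(len(s) for s in sentences)     # size(d): total word count
    tf = sum(c) / size                                   # Eq. (3.7)
    cp_pn = compact_part_num(c) / n                      # Eq. (3.8)
    cp_fld = 1 - compact_fl_dist(c) / n                  # Eq. (3.9)
    cp_pv = 1 - compact_pos_var(c) / n                   # Eq. (3.10)
    fa = (n - first_app(c)) / n                          # Eqs. (3.11)-(3.12)
    # Assumption: CP is taken as the plain sum of the three CP measures.
    return tf + (cp_pn + cp_fld + cp_pv) + fa

def select_features(sentences, top_k=4):
    """Rank the vocabulary by importance and keep the top_k words."""
    vocab = {w for s in sentences for w in s}
    ranked = sorted(vocab, key=lambda w: importance(w, sentences),
                    reverse=True)
    return ranked[:top_k]
```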

3.2.2 Association Rule Based Hierarchical Clustering Method

The selected features are passed to the association rule miner, which uses the Apriori algorithm to find association rules between the text documents. The association rule problem is defined as follows. Let I = {i_1, i_2, ..., i_m} be a set of words. Let D be a set of documents, where each document T contains a set of words. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The association rule X → Y holds in the set of documents D with confidence c if c% of the documents in D that contain X also contain Y. The association rule X → Y has support s if s% of the documents in D contain X ∪ Y. Mining association rules means finding all the rules whose support and confidence are greater than or equal to the user-specified minimum support (minsup) and minimum confidence (minconf), respectively (Agrawal & Srikant 1994). The threshold values for support and confidence lie between 0 and 1. In this work, the threshold value is fixed at 0.01 for all the documents, so that the evaluation criteria remain the same throughout the process.

The documents that are associated with one another are obtained from the association rule miner, and the remaining documents are discarded. As a result, the database size is reduced to a great extent. The associated documents are then given as input to the hierarchical algorithm. This improves cluster quality and also reduces the time taken for clustering. The hierarchical algorithm performs agglomerative clustering to obtain the final results. The clusters obtained with ARBHC are found to have better quality than the clusters produced using hierarchical clustering alone.
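The sketch below implements only the pairwise level of Apriori over the selected feature words, with documents playing the role of transactions and the fixed 0.01 support threshold from the text. The rule that a document counts as "associated" if it contains at least one frequent word pair is an assumption, since the text leaves the final selection criterion implicit.

```python
from itertools import combinations

def frequent_pairs(docs, minsup=0.01):
    """Word pairs whose support (fraction of documents containing both
    words) is at least minsup; the text fixes minsup at 0.01."""
    n = len(docs)
    support = {}
    for doc in docs:                        # each doc: set of feature words
        for pair in combinations(sorted(doc), 2):
            support[pair] = support.get(pair, 0) + 1
    return {p for p, cnt in support.items() if cnt / n >= minsup}

def associated_documents(docs, minsup=0.01):
    """Keep documents containing at least one frequent word pair
    (assumed criterion); the rest are dropped before clustering."""
    pairs = frequent_pairs(docs, minsup)
    return [d for d in docs if any(set(p) <= d for p in pairs)]
```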

3.2.3 Experimental Results

The experiments were carried out on the Reuters-21578 dataset. Four Reuters subsets were taken for testing, each containing 180 documents. The topics include commodity, corporate, currency, energy, economic indicator and subject codes. The clusters are evaluated using the F-measure and Entropy, the cluster quality is evaluated using the cophenetic correlation coefficient, and the time taken to cluster the documents using both methods is also measured.

3.2.3.1 Cophenetic Correlation Coefficient

This coefficient is used to evaluate the quality of the clusters. It is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodelled data points. The clusters are evaluated and the results are given in Table 3.1.

Table 3.1 Cluster Quality

           D0      D1      D2      D3
Hier       0.73   -1.52   -0.52   -1.29
ARBHC      4.13    3.54    0.18    5.17

The tabulated values show that the clusters formed using ARBHC have better quality than the clusters formed using the hierarchical algorithm alone.
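The coefficient itself can be computed with SciPy, as in the sketch below; the thesis does not name an implementation, and the random matrix stands in for the document feature vectors.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(20, 4)              # stand-in document feature vectors

Z = linkage(X, method='average')       # agglomerative clustering dendrogram
c, coph_dists = cophenet(Z, pdist(X))  # cophenetic vs. original distances
print(f"cophenetic correlation coefficient: {c:.3f}")
```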

3.2.3.2 F-measure

The F-measure is the weighted harmonic mean of precision and recall, given in Equation (3.13) as follows:

F = 2 · precision · recall / (precision + recall)                  (3.13)

Precision is the fraction of the retrieved documents that are relevant to the user's information need. Recall is the fraction of the relevant documents that are successfully retrieved. For clustering they are computed per class and cluster, as given in Equation (3.14):

precision(i, k) = N_ik / N_k,    recall(i, k) = N_ik / NC_i        (3.14)

where N is the total number of documents, i indexes the predefined classes, k indexes the clusters produced by unsupervised classification, NC_i is the number of documents of class i, N_k is the number of documents in cluster k, and N_ik is the number of documents of class i in cluster k.
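A sketch of this computation follows, working from the class-by-cluster contingency matrix. The class-weighted best-matching-cluster aggregation at the end is the usual convention for a clustering F-measure and is an assumption here, since the text does not state how the per-pair values are combined.

```python
import numpy as np

def overall_f_measure(n_ik):
    """F-measure from the contingency matrix n_ik (rows: classes i,
    columns: clusters k), per Equations (3.13)-(3.14)."""
    n_ik = np.asarray(n_ik, dtype=float)
    N = n_ik.sum()
    nc = n_ik.sum(axis=1)                 # NC_i: size of class i
    nk = n_ik.sum(axis=0)                 # N_k:  size of cluster k
    with np.errstate(divide='ignore', invalid='ignore'):
        precision = n_ik / nk             # precision(i, k) = N_ik / N_k
        recall = n_ik / nc[:, None]       # recall(i, k)    = N_ik / NC_i
        f = 2 * precision * recall / (precision + recall)
    f = np.nan_to_num(f)                  # empty combinations contribute 0
    # Assumed aggregation: weight each class by size, take its best cluster.
    return float(((nc / N) * f.max(axis=1)).sum())

# Example: two classes split across two clusters.
print(overall_f_measure([[90, 10],
                         [20, 80]]))
```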

The F-measure values are given in Table 3.2.

Table 3.2 F-Measure Values

           D0      D1      D2      D3
Hier       0.04    0.04    0.045   0.048
ARBHC      0.29    0.22    0.25    0.28

The tabulated values show that the ARBHC method has better F-measure values than the hierarchical method.

Figure 3.2 F-Measure Values

Figure 3.2 shows the diagrammatic representation of the tabulated values and confirms that the ARBHC method outperforms the hierarchical method.

3.2.3.3 Entropy

In information theory, entropy is a measure of the uncertainty associated with a random variable; for clustering, lower entropy indicates purer clusters. The values are given in Table 3.3.

Table 3.3 Entropy Values

           D0      D1      D2      D3
Hier       5.66    5.8     6.1     5.9
ARBHC      3.4     3.5     3.66    3.57

The tabulated values show that the ARBHC method has lower entropy values than the hierarchical method, which is a sign of better performance.

Figure 3.3 Entropy Values

Figure 3.3 shows the diagrammatic representation of the tabulated values, which confirms that the ARBHC method outperforms the hierarchical method.

3.2.3.4 Time

The time taken to cluster the documents using the plain hierarchical method and the ARBHC method is measured, and the results are tabulated in Table 3.4.

Table 3.4 Time Values (Seconds)

           D0      D1      D2      D3
Hier       178     178     178     178
ARBHC      54      46      50      38

Figure 3.4 Time Values

The tabulated values and Figure 3.4 show that the ARBHC method takes less time to cluster the documents than the hierarchical method.

3.3 MODIFIED K-MEANS ALGORITHM

Another fundamental clustering algorithm is the K-means algorithm. K-means, a heuristic algorithm, is a method of cluster analysis which partitions n observations into k clusters such that each observation is assigned to the cluster with the nearest mean. The K-means algorithm is less accurate than the hierarchical algorithm, but provides greater efficiency. In the simple K-means algorithm the initial centroids are selected randomly, which affects the performance of the algorithm.

3.3.1 Introduction

In this work, a method has been proposed to locate the initial centroids. It has been observed that locating the centroids at a larger distance from one another produces better results (Mushfeq-Us-Saleheen Shameem & Raihana Ferdous 2009). To place the centroids at such distances, we propose using dissimilar documents as the initial centroids. In the proposed method, the text data are first preprocessed and the features are selected using an appropriate method. The dissimilar documents are then identified and selected as the initial centroids. With these initial centroids the K-means algorithm is run and the performance is evaluated.

3.3.2 Preprocessing

The stop words in the input text files are first removed, using a standard stop word list. The words are then stemmed using the Porter stemming algorithm, which reduces each word to its base form.

3.3.3 Feature Selection

This step picks out the important words, which eventually represent the entire document. It uses the term frequency method and the title match method. Term frequency is a measure of how often a word appears in a document, defined as the number of times the word appears divided by the total number of words. In the title match method, if a word in the file matches a word in the title, it is given a weight value. The term frequency value and the title match value are added to obtain the weight of each word. The words are then ranked by weight, and the top four words are taken for each file. The process is repeated for all the text files.

3.3.4 Finding Dissimilar Documents

The Apriori algorithm is used to find the similar documents, and the remaining documents are considered dissimilar. The number of dissimilar documents is taken as the initial value of K, and the dissimilar documents themselves are selected as the initial centroids for clustering. These documents are given as input to the modified K-means algorithm.
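A sketch of this selection is given below, reusing the frequent_pairs helper from Section 3.2.2. The names doc_words (each document's selected feature words) and doc_vectors (its numeric representation, which the text leaves unspecified), and the one-frequent-pair similarity criterion, are illustrative assumptions.

```python
def initial_centroids(doc_words, doc_vectors, minsup=0.01):
    """Pick the dissimilar documents (those outside the Apriori-found
    similar set) as initial centroids; K = number of such documents."""
    pairs = frequent_pairs(doc_words, minsup)   # helper from Section 3.2.2
    similar = {i for i, words in enumerate(doc_words)
               if any(set(p) <= words for p in pairs)}
    dissimilar = [i for i in range(len(doc_words)) if i not in similar]
    return [doc_vectors[i] for i in dissimilar], len(dissimilar)
```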

3.3.5 Modified K-Means Algorithm

The dissimilar documents found using the Apriori algorithm are selected as the initial centroids. This helps the algorithm to separate the documents into disjoint sets. The clustering procedure is defined as follows:

    Select the dissimilar documents as the initial centroids m_1, m_2, ..., m_k
    Until there is no change in the means:
        Use the current means to assign each sample to a cluster
        For i from 1 to k:
            Replace m_i with the mean of all the samples assigned to cluster i
        end_for
    end_until

The modified K-means algorithm, run with the dissimilar documents as the initial centroids, has been shown to improve the clustering performance. It has also been shown to converge in fewer iterations than the traditional K-means algorithm.
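A runnable rendering of the pseudocode follows; Euclidean distance and NumPy are assumptions, since the thesis fixes neither, and X is taken to be an (n × d) array of document vectors.

```python
import numpy as np

def modified_kmeans(X, centroids, max_iter=100):
    """K-means seeded with the dissimilar-document centroids; iterates
    until the means no longer change, following the pseudocode above."""
    means = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Assign each sample to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each mean with the mean of the samples assigned to it;
        # an empty cluster keeps its previous mean.
        new_means = np.array([X[labels == i].mean(axis=0)
                              if np.any(labels == i) else means[i]
                              for i in range(len(means))])
        if np.allclose(new_means, means):    # no change in the means: stop
            break
        means = new_means
    return labels, means
```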

3.3.6 Experimental Results

All the experiments were carried out on the Reuters-21578 dataset. The subsets Reut2-000, Reut2-001, Reut2-002 and Reut2-003 were taken for the analysis, and are named D0, D1, D2 and D3 respectively. From each subset, 180 documents were selected, giving 720 documents in total, drawn from topics such as commodity, corporate, currency, energy, economic indicator and subject codes. The evaluation metrics used are the F-measure and Entropy; the number of iterations taken to converge is also evaluated.

3.3.6.1 F-Measure

The formula of the F-measure is given in Equation (3.15) as follows:

F = 2 · Precision · Recall / (Precision + Recall)        (3.15)

Precision, the measure of exactness, is the fraction of the retrieved documents that are relevant to the user's information need. Recall, the measure of completeness, is the fraction of the relevant documents that are successfully retrieved.

Table 3.5 F-Measure Values

                     D0      D1      D2      D3
K-means              0.18    0.16    0.18    0.22
Modified K-means     0.17    0.154   0.14    0.2

The values tabulated in Table 3.5 show that the modified K-means algorithm has F-measure values comparable to those of the standard K-means algorithm.

3.3.6.2 Entropy

Entropy is a measure of the uncertainty associated with a random variable; it refers to the information contained in a message. The general form of entropy is given in Equation (3.16) as follows:

E_j = - Σ_i p_ij log(p_ij)        (3.16)

where p_ij is the probability that a member of cluster j belongs to class i.
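Equation (3.16) gives the entropy of a single cluster. The sketch below uses base-2 logarithms and aggregates the per-cluster entropies by cluster size; both are conventional choices that the text does not pin down.

```python
import numpy as np

def cluster_entropy(n_ik):
    """Entropy per Eq. (3.16) from the class-by-cluster matrix n_ik, where
    p_ij is the fraction of cluster j's documents belonging to class i."""
    n_ik = np.asarray(n_ik, dtype=float)
    nj = n_ik.sum(axis=0)                    # documents in each cluster j
    p = n_ik / nj                            # p_ij, column-normalised
    logp = np.where(p > 0, np.log2(np.where(p > 0, p, 1.0)), 0.0)
    e_j = -(p * logp).sum(axis=0)            # E_j for each cluster
    return float((nj / nj.sum()) @ e_j)      # cluster-size weighted average
```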

Table 3.6 Entropy Values

                     D0      D1      D2      D3
K-means              9.5     10.11   8.8     4.21
Modified K-means     8.3     8.9     8.4     5.7

The values tabulated in Table 3.6 show that the modified K-means algorithm has lower entropy values than the K-means algorithm on three of the four subsets, which indicates better performance; the lower the value, the better the performance.

3.3.6.3 Number of Iterations

The number of iterations taken by the algorithm to converge depends on the initial centroids, so the number of iterations to convergence is measured. It is found that the modified K-means algorithm takes fewer iterations than the simple K-means algorithm. The values are tabulated in Table 3.7.

Table 3.7 Number of Iterations

                     D0      D1      D2      D3
K-means              6       4       6       8
Modified K-means     4       3       6       6

The tabulated values show that the modified K-means algorithm converges in at most as many iterations as the simple K-means algorithm, and usually in fewer.

3.4 SUMMARY

In this work, we have proposed a new hierarchical clustering algorithm based on association rules, evaluated on the Reuters-21578 dataset. The algorithm is motivated by the fact that increased database size eventually reduces clustering performance. To overcome this, we integrated the association rule algorithm with the hierarchical algorithm: only the associated documents identified by the association algorithm are clustered using the hierarchical algorithm, which greatly improves the efficiency of the clustered output. The experimental results show that association rule based hierarchical clustering outperforms the traditional hierarchical clustering algorithm. The cluster quality, tested using the cophenetic correlation coefficient, showed improved performance, and the ARBHC algorithm also took less time to complete the clustering process than the hierarchical clustering algorithm.

A solution to improve the performance of the K-means algorithm has also been formulated. The clustering performance of the simple K-means algorithm depends on the initial centroids; since these are selected at random, the performance varies. To overcome this, dissimilar documents are identified using the Apriori algorithm and selected as the initial centroids, which has yielded better results. The number of iterations taken to converge was also measured, and the modified K-means algorithm used fewer iterations than the simple K-means algorithm.

The next chapter describes a new method for keyword extraction from text documents.