CHAPTER 3 ASSOCIATION RULE BASED CLUSTERING


3.1 INTRODUCTION

This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms are efficient in cases of low dimensionality, but the number of documents available is very large, and applying clustering directly to such collections degrades the performance of the clustering algorithm. This motivated us to reduce the dimension of the documents before clustering them, so that the performance of the clustering algorithm is improved.

Clustering algorithms are divided into two classes: hierarchical and partitioning. The hierarchical method has the disadvantage that once a clustering step (i.e. a merge or split) is done, it cannot be undone. Therefore, the documents have to be analysed properly before they are clustered. This led us to retrieve only the associated documents from the entire dataset before clustering. To retrieve the related documents, the association rule mining algorithm is used. Association rule mining is generally applied to transactional databases to find associated items; here it has been modified for text databases to retrieve associated documents. The associated documents are then clustered using the hierarchical algorithm, and the results show improved performance.
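The idea of keeping only the associated documents via frequent itemsets can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the documents, their keyword sets and the minsup value are all made-up toy data, and only one Apriori level (frequent word pairs) is mined.

```python
# Toy sketch: documents whose keyword sets share a frequent itemset
# are kept as "associated"; the rest are discarded before clustering.
from itertools import combinations

docs = {
    "d1": {"oil", "price", "market"},
    "d2": {"oil", "market", "export"},
    "d3": {"football", "league"},
    "d4": {"price", "market", "trade"},
}
minsup = 0.5  # fraction of documents that must contain an itemset

n = len(docs)
vocab = set().union(*docs.values())

# Frequent 1-itemsets, then candidate 2-itemsets (one Apriori level).
freq1 = {w for w in vocab
         if sum(w in d for d in docs.values()) / n >= minsup}
freq2 = {
    pair
    for pair in combinations(sorted(freq1), 2)
    if sum(set(pair) <= d for d in docs.values()) / n >= minsup
}

# A document is "associated" if it contains some frequent 2-itemset.
associated = {
    name for name, words in docs.items()
    if any(set(pair) <= words for pair in freq2)
}
print(sorted(associated))
```

With this data, d3 (the only sports document) shares no frequent itemset with the rest and is discarded, shrinking the input to the clustering step.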

Next, in the partitioning approach, the disadvantage is the initial selection of the k centroids. In the simple K-means algorithm the centroids are selected randomly, and this random selection affects the performance of the algorithm. The centroids should be placed carefully, because different locations lead to different results. It has been observed that locating the centroids at a larger distance from one another gives better results (Mushfeq-Us-Saleheen Shameem & Raihana Ferdous 2009). To place the centroids far apart, we propose a method to locate the initial centroids: dissimilar documents are used as the initial centroids, and these dissimilar documents are identified using the Apriori algorithm. The Apriori algorithm finds the similar documents; the remaining documents are identified as dissimilar. These documents are selected as the initial centroids, and the K-means algorithm is applied. The experimental results show that selecting dissimilar documents as the initial centroids improves the clustering performance.

These two points motivated us to develop this work. The chapter is divided into two parts. In the first part, association rule mining integrated with the hierarchical algorithm is described. In the second part, the K-means algorithm integrated with the association rule algorithm is explained. The rest of the chapter is organized as follows. In Section 3.2, the association rule based hierarchical clustering algorithm is described. In Section 3.3, the K-means algorithm based on association rules is described.

3.2 ASSOCIATION RULE BASED HIERARCHICAL CLUSTERING ALGORITHM

The Apriori algorithm is a popular association rule algorithm used in data mining to find frequent itemsets. The same algorithm is used in

this work to find similar documents. The remaining dissimilar documents are removed, which reduces the size of the database. The associated documents alone are then clustered, which improves the efficiency of the clustering algorithm. The method consists of several phases: pre-processing, feature selection and, finally, clustering. The block diagram of the proposed system is depicted in Figure 3.1.

Input Text File -> Preprocessing -> Feature Selection -> Association Rule Miner -> Hierarchical Clustering -> Clustered Documents as Output

Figure 3.1 Block Diagram of the Proposed Method

Pre-processing and Feature Selection Method

Pre-processing involves stop word removal and stemming. Stop words are common English words that carry little meaning for distinguishing one document from another; these words are removed using a stop word list. The next step in pre-processing is stemming, which is the process of reducing derived words to their base, stem or root form. The words are stemmed using the Porter stemmer algorithm. After these pre-processing steps, the important features are selected using the feature selection method proposed by Xiao-Bing Xue & Zhi-Hua Zhou (2007). In this method, the importance of a term is calculated based on

the compactness of the appearances of the word, and the position of the first appearance of the word. The distribution of the word is calculated in two steps: first the document is divided into several parts, and then an array is built in which each entry is the count of appearances of the word in the corresponding part. In this work, the number of parts is the number of sentences present in the document. Suppose a document d contains n sentences; the distributional array of the word t is represented as array(t, d) = [c_0, c_1, ..., c_{n-1}]. Compactness measures whether the word is concentrated in a specific part of a document or spread over the whole document. The compactness (ComPact) of the appearances of the word t and the position of the first appearance (FirstApp) of the word t are defined in Equations (3.1) to (3.6) as follows:

FirstApp(t, d) = min_{i in {0..n-1}} (c_i > 0 ? i : n)                          (3.1)

ComPactPartNum(t, d) = sum_{i=0}^{n-1} (c_i > 0 ? 1 : 0)                        (3.2)

LastApp(t, d) = max_{i in {0..n-1}} (c_i > 0 ? i : -1)                          (3.3)

ComPactFLDist(t, d) = LastApp(t, d) - FirstApp(t, d)                            (3.4)

centroid(t, d) = (sum_{i=0}^{n-1} c_i * i) / count(t, d),
    where count(t, d) = sum_{i=0}^{n-1} c_i                                     (3.5)

ComPactPosVar(t, d) = (sum_{i=0}^{n-1} c_i * (i - centroid(t, d))^2) / count(t, d)   (3.6)

The frequency of a word is calculated using the Term Frequency (TF) feature selection method. The compactness of the appearances of a word

is calculated using the ComPactness (CP) measures, and the first appearance of a word is found using the FirstAppearance (FA) measure. TF, CP and FA are calculated using Equations (3.7) to (3.12) as follows:

TF(t, d) = count(t, d) / size(d)                          (3.7)

CPPN(t, d) = ComPactPartNum(t, d) / len(d)                (3.8)

CPFLD(t, d) = 1 - ComPactFLDist(t, d) / len(d)            (3.9)

CPPV(t, d) = 1 - ComPactPosVar(t, d) / len(d)             (3.10)

FA(t, d) = f(FirstApp(t, d), len(d))                      (3.11)

f(p, len(d)) = (len(d) - p) / len(d)                      (3.12)

The importance of each word is calculated as the summation of TF, CP and FA. The words are then ranked by this value, with the highest value on top, and the topmost four words are selected for each text file. All the text files are processed in the same manner to obtain the final feature set.

Association Rule Based Hierarchical Clustering Method

The selected features are passed into the association rule miner, which uses the Apriori algorithm to find association rules between the text documents. The association rule problem is defined as follows. Let I = {i_1, i_2, ..., i_m} be a set of words. Let D be a set of

documents, where each document T contains a set of words. An association rule is an implication of the form X -> Y, where X is a subset of I, Y is a subset of I, and X and Y are disjoint. The rule X -> Y holds in the set of documents D with confidence c if c% of the documents in D that contain X also contain Y. The rule X -> Y has support s if s% of the documents in D contain X union Y. Mining association rules means finding all the rules whose support and confidence are greater than or equal to the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively (Agrawal & Srikant 1994). The threshold values for support and confidence lie between 0 and 1. In this work the threshold value is fixed at 0.01 for all the documents, so that the evaluation criteria remain the same throughout the process.

The documents that are associated with one another are obtained from the association rule miner, and the remaining documents are discarded. As a result, the database size is reduced to a great extent. The associated documents are then given as input to the hierarchical algorithm. This improves cluster quality and also reduces the time taken for clustering. The hierarchical algorithm performs agglomerative clustering to obtain the final results. The clusters obtained with ARBHC are found to have better quality than the clusters produced using hierarchical clustering alone.

Experimental Results

The experiments were run on the Reuters dataset. Four Reuters subsets were taken for testing, each with 180 documents. The topics include commodity, corporate, currency, energy, economic indicator and subject codes. The clusters are evaluated using the

measures F-measure and Recall. The cluster quality is evaluated using the cophenetic correlation coefficient. The time taken to cluster the documents using both methods is also measured.

Cophenetic Correlation Coefficient

This coefficient is used to evaluate the quality of the clusters. It is a measure of how faithfully a dendrogram preserves the pairwise distances between the original data points. The clusters are evaluated and the results are given in Table 3.1.

Table 3.1 Cluster Quality

          D0    D1    D2    D3
Hier
ARBHC

The tabulated values show that the clusters formed using ARBHC have better quality than the clusters formed using the hierarchical algorithm alone.

F-measure

The F-measure is the weighted harmonic mean of precision and recall, given in Equation (3.13) as follows:

F = 2 * precision * recall / (precision + recall)        (3.13)

Precision is the fraction of the retrieved documents that are relevant to the user's information need. Recall is the fraction of the relevant documents that are successfully retrieved. The formulas are given in Equation (3.14) as follows:

precision(i, k) = N_ik / N_k        Recall(i, k) = N_ik / NC_i        (3.14)

where N is the total number of documents, i is the index of a (predefined) class, k is the index of a cluster in the unsupervised classification, NC_i is the number of documents of class i, N_k is the number of documents of cluster k, and N_ik is the number of documents of class i in cluster k. The values are given in Table 3.2.

Table 3.2 F-Measure Values

          D0    D1    D2    D3
Hier
ARBHC

The tabulated values show that the ARBHC method has better F-measure values than the hierarchical method.

Figure 3.2 F-Measure Values
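Equations (3.13)-(3.14) can be computed directly from the class/cluster counts. The sketch below is illustrative: the counts passed in at the end are made-up toy values, not results from the thesis's experiments.

```python
# Precision, recall and F-measure for one class/cluster pair, following
# Equations (3.13)-(3.14): precision(i,k) = N_ik / N_k,
# recall(i,k) = N_ik / NC_i, F = 2PR / (P + R).

def f_measure(n_ik, n_k, nc_i):
    """n_ik: docs of class i in cluster k; n_k: size of cluster k;
    nc_i: size of class i."""
    precision = n_ik / n_k
    recall = n_ik / nc_i
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: cluster k holds 10 documents, 8 of which belong to
# class i; class i has 16 documents overall (precision 0.8, recall 0.5).
print(f_measure(8, 10, 16))
```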

Figure 3.2 shows the diagrammatic representation of the tabulated values and confirms that the ARBHC method outperforms the hierarchical method.

Entropy

In information theory, entropy is a measure of the uncertainty associated with a random variable. The values are given in Table 3.3.

Table 3.3 Entropy Values

          D0    D1    D2    D3
Hier
ARBHC

The tabulated values show that the ARBHC method has lower entropy values than the hierarchical method, which is a sign of better performance.

Figure 3.3 Entropy Values

Figure 3.3 shows the diagrammatic representation of the tabulated values, which confirms that the ARBHC method outperforms the hierarchical method.

Time

The time taken to cluster the documents using the plain hierarchical method and the ARBHC method is measured, and the results are tabulated in Table 3.4.

Table 3.4 Time Values (Seconds)

          D0    D1    D2    D3
Hier
ARBHC

Figure 3.4 Time Values

The tabulated values and Figure 3.4 show that the time taken to cluster the documents is less for the ARBHC method than for the hierarchical method.
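The agglomerative step used by both methods compared in this section can be sketched in a few lines. This is a simplified illustration with average linkage on toy 2-D points; the real pipeline operates on document feature vectors, and the points and target cluster count here are assumptions.

```python
# Minimal average-linkage agglomerative clustering: repeatedly merge
# the closest pair of clusters until k clusters remain.
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, k):
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise distance between the two clusters
                d = sum(euclid(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge b into a
        del clusters[b]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (6, 5)]
print(agglomerate(points, 2))  # two tight groups emerge
```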

3.3 MODIFIED K-MEANS ALGORITHM

One of the other fundamental clustering algorithms is the K-means algorithm. K-means, a heuristic method of cluster analysis, partitions n observations into k clusters such that each observation is grouped into the cluster with the nearest mean. The K-means algorithm is less accurate than the hierarchical algorithm, but it is more efficient. In the simple K-means algorithm the centroids are selected randomly, which affects the performance of the algorithm.

Introduction

In this work, a method has been proposed to locate the initial centroids. It has been observed that locating the centroids at a larger distance from one another gives better results (Mushfeq-Us-Saleheen Shameem & Raihana Ferdous 2009). To place the centroids at a distance, we propose using dissimilar documents as the initial centroids. In the proposed method, the text data are first preprocessed and the features are selected using an appropriate method. The dissimilar documents are then identified and selected as the initial centroids, the K-means algorithm is run with these centroids, and the performance is evaluated.

Preprocessing

The stop words in the input text files are first removed, using a standard stop word list. The words are then stemmed using the Porter stemming algorithm, which reduces each word to its base form.
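The preprocessing stage can be sketched as below. The stop-word list is a tiny illustrative subset of a standard list, and the suffix-stripper is a crude stand-in for the full Porter algorithm, which has several ordered rule steps.

```python
# Sketch of preprocessing: stop-word removal followed by stemming.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}

def crude_stem(word):
    # NOT the real Porter stemmer: just strips a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [w.lower() for w in text.split() if w.isalpha()]
    return [crude_stem(w) for w in tokens if w not in STOP_WORDS]

print(preprocess("the prices of oil are falling in the market"))
```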

Feature Selection

This step picks out the important words that will eventually represent the entire document, using the term frequency method and the title match method. Term frequency is a measure of how often a word appears in a document; it is defined as the number of times the word appears divided by the total number of words. In the title match method, a word that also appears in the title of the file is given an additional weight value. The term frequency value and the title match value are added to give the weight of each word. The words are then ranked by weight and the top four words are taken for each file. The process is repeated for all the text files.

Finding Dissimilar Documents

The Apriori algorithm is used to find the similar documents, and the remaining documents are considered dissimilar. The number of dissimilar documents found is taken as the initial k value, and the dissimilar documents are selected as the initial centroids for clustering. These documents are given as input to the modified K-means algorithm.

Modified K-Means Algorithm

The dissimilar documents found using the Apriori algorithm are selected as the initial centroids. This helps the algorithm to partition the documents into disjoint sets.
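The term-frequency-plus-title-match weighting of the feature selection step can be sketched as follows. The title bonus value (0.1) and the sample words are illustrative assumptions; the thesis does not state the exact weight given to a title match.

```python
# Feature selection sketch: weight = term frequency + title-match bonus,
# then keep the top four words per file.
from collections import Counter

def top_words(title, body_words, title_bonus=0.1, top_k=4):
    counts = Counter(body_words)
    total = len(body_words)
    title_words = set(title.lower().split())
    weights = {
        w: c / total + (title_bonus if w in title_words else 0.0)
        for w, c in counts.items()
    }
    return [w for w, _ in sorted(weights.items(),
                                 key=lambda kv: kv[1], reverse=True)[:top_k]]

body = ["oil", "price", "oil", "market", "trade", "price", "oil", "export"]
print(top_words("oil market report", body))
```

Note how "market" outranks "trade" despite the same body frequency, because it also matches the title.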

The procedure for finding the clusters is defined as follows:

    Select the dissimilar documents as the initial centroids, i.e. m_1, m_2, ..., m_k
    Until there is no change in the means
        Use the estimated means to classify the samples into clusters
        For i from 1 to k
            Replace m_i with the mean of all the samples of cluster i
        end_for
    end_until

Running the modified K-means algorithm with the dissimilar documents as the initial centroids has shown that the clustering performance is improved. It was also found that the number of iterations taken to converge is smaller than for the traditional K-means algorithm.

Experimental Results

All the experiments have been carried out on the Reuters dataset. Reut2-000, Reut2-001, Reut2-002 and Reut2-003 have been taken for the analysis and are named D0, D1, D2 and D3 respectively. From each, 180 documents have been selected, giving 720 documents in total from topics such as commodity, corporate, currency, energy, economic indicator and subject codes. The evaluation metrics used are the F-measure and Entropy. The number of iterations taken to converge is also evaluated.
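The procedure above can be sketched in Python. The vectors and the two "dissimilar" starting centroids are toy values standing in for document feature vectors identified by the Apriori step.

```python
# Modified K-means sketch: the initial centroids are given (the
# dissimilar-document vectors) rather than chosen at random.
def kmeans(points, centroids, max_iter=100):
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        # assign each point to its nearest centroid (squared distance)
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # recompute means; stop when nothing moves
        new = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return clusters, centroids

points = [(0, 0), (1, 0), (9, 9), (10, 10)]
# two far-apart ("dissimilar") documents as initial centroids
clusters, centroids = kmeans(points, [(0, 0), (10, 10)])
print(clusters)
```

Because the starting centroids already sit in different regions, this run converges after a single mean update.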

F-Measure

The F-measure is given in Equation (3.15) as follows:

F = 2 * Precision * Recall / (Precision + Recall)        (3.15)

Precision, which is the measure of exactness, is defined as the fraction of the retrieved documents that are relevant to the user's information need. Recall, which is the measure of completeness, is defined as the fraction of the relevant documents that are successfully retrieved.

Table 3.5 F-Measure Values

                    D0    D1    D2    D3
K-means
Modified K-means

The values tabulated in Table 3.5 show that the modified K-means algorithm has F-measure values comparable to those of the K-means algorithm.

Entropy

Entropy is a measure of the uncertainty associated with a random variable; it refers to the information contained in a message. The general form of entropy is given in Equation (3.16) as follows:

E_j = - sum_i p_ij log(p_ij)        (3.16)

where p_ij is the probability that a member of cluster j belongs to class i.
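Equation (3.16) can be computed per cluster from the class labels of its documents. The label lists below are toy data for illustration.

```python
# Entropy of one cluster: E_j = -sum_i p_ij * log(p_ij), where p_ij is
# the fraction of cluster j's documents that belong to class i.
import math
from collections import Counter

def cluster_entropy(class_labels):
    n = len(class_labels)
    probs = [c / n for c in Counter(class_labels).values()]
    return sum(-p * math.log(p) for p in probs)

print(cluster_entropy(["a", "a", "a", "a"]))  # pure cluster: entropy 0
print(cluster_entropy(["a", "a", "b", "b"]))  # evenly mixed: log(2)
```

Lower entropy means purer clusters, which is why the smaller values in Table 3.6 indicate better performance.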

Table 3.6 Entropy Values

                    D0    D1    D2    D3
K-means
Modified K-means

The values tabulated in Table 3.6 show that the modified K-means algorithm has lower entropy values than the K-means algorithm, which indicates better performance: the lower the value, the better the performance.

Number of Iterations

The number of iterations the algorithm takes to converge depends on the initial centroids, so the number of iterations to convergence is measured. It is found that the modified K-means algorithm takes fewer iterations than the simple K-means algorithm. The values are tabulated in Table 3.7.

Table 3.7 Number of Iterations

                    D0    D1    D2    D3
K-means
Modified K-means

The tabulated values show that the modified K-means algorithm converges in fewer iterations than the K-means algorithm.

3.4 SUMMARY

In this work, we have proposed a new hierarchical clustering algorithm based on association rules, evaluated on the Reuters dataset. Our algorithm is motivated by the fact that an increased database size eventually reduces clustering performance. To overcome this, we have integrated the association rule algorithm with the hierarchical algorithm: only the associated documents identified by the association algorithm are clustered using the hierarchical algorithm, which greatly improves the efficiency of the clustered output. The experimental results show that association rule based hierarchical clustering outperforms the traditional hierarchical clustering algorithm. The cluster quality, tested using the cophenetic correlation coefficient, also showed improved performance, and the ARBHC algorithm took less time than the plain hierarchical clustering algorithm.

A solution to improve the performance of the K-means algorithm has also been formulated. The clustering performance of the simple K-means algorithm depends on the initial centroids; when these are selected at random, the performance varies. To overcome this, dissimilar documents are identified using the Apriori algorithm and selected as the initial centroids, which has yielded better results. The number of iterations taken to converge was also measured, and the modified K-means algorithm used fewer iterations than the simple K-means algorithm.

The next chapter describes a new method for keyword extraction from text documents.


Cluster Analysis: Agglomerate Hierarchical Clustering Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

More information

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Clustering Algorithms Contents K-means Hierarchical algorithms Linkage functions Vector quantization SOM Clustering Formulation

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search 1 / 33 Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search Bernd Wittefeld Supervisor Markus Löckelt 20. July 2012 2 / 33 Teaser - Google Web History http://www.google.com/history

More information

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Midterm Examination CS540-2: Introduction to Artificial Intelligence Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search

More information

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

Accelerating Unique Strategy for Centroid Priming in K-Means Clustering IJIRST International Journal for Innovative Research in Science & Technology Volume 3 Issue 07 December 2016 ISSN (online): 2349-6010 Accelerating Unique Strategy for Centroid Priming in K-Means Clustering

More information

Chapter 6: Cluster Analysis

Chapter 6: Cluster Analysis Chapter 6: Cluster Analysis The major goal of cluster analysis is to separate individual observations, or items, into groups, or clusters, on the basis of the values for the q variables measured on each

More information

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 CLUSTERING CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16 1. K-medoids: REFERENCES https://www.coursera.org/learn/cluster-analysis/lecture/nj0sb/3-4-the-k-medoids-clustering-method https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Analysis and Extensions of Popular Clustering Algorithms

Analysis and Extensions of Popular Clustering Algorithms Analysis and Extensions of Popular Clustering Algorithms Renáta Iváncsy, Attila Babos, Csaba Legány Department of Automation and Applied Informatics and HAS-BUTE Control Research Group Budapest University

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

http://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review

More information

2. Background. 2.1 Clustering

2. Background. 2.1 Clustering 2. Background 2.1 Clustering Clustering involves the unsupervised classification of data items into different groups or clusters. Unsupervised classificaiton is basically a learning task in which learning

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric.

Case-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric. CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance

More information

CS 188: Artificial Intelligence Fall 2008

CS 188: Artificial Intelligence Fall 2008 CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley 1 1 Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Neelam Singh neelamjain.jain@gmail.com Neha Garg nehagarg.february@gmail.com Janmejay Pant geujay2010@gmail.com

More information

Chapter 9. Classification and Clustering

Chapter 9. Classification and Clustering Chapter 9 Classification and Clustering Classification and Clustering Classification and clustering are classical pattern recognition and machine learning problems Classification, also referred to as categorization

More information

Machine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering.

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Clustering Algorithms for general similarity measures

Clustering Algorithms for general similarity measures Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

Supervised and Unsupervised Learning (II)

Supervised and Unsupervised Learning (II) Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised

More information

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data

More information

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES

Administrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning BANANAS APPLES Administrative Machine learning: Unsupervised learning" Assignment 5 out soon David Kauchak cs311 Spring 2013 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Machine

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using

More information

Clustering: Overview and K-means algorithm

Clustering: Overview and K-means algorithm Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of real-valued data is often used as a pre-processing

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

1 More configuration model

1 More configuration model 1 More configuration model In the last lecture, we explored the definition of the configuration model, a simple method for drawing networks from the ensemble, and derived some of its mathematical properties.

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Clustering Results. Result List Example. Clustering Results. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

A New Technique to Optimize User s Browsing Session using Data Mining

A New Technique to Optimize User s Browsing Session using Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information