CHAPTER 3 ASSOCIATION RULE BASED CLUSTERING
3.1 INTRODUCTION

This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms are highly efficient at low dimensionality, but the number of available documents is very large, and applying clustering directly to such collections degrades the performance of the clustering algorithm. This motivated us to reduce the dimension of the documents before clustering them, so that the performance of the clustering algorithm is improved.

Clustering algorithms are divided into two classes: hierarchical and partitioning. The hierarchical method has the disadvantage that once a clustering step (i.e. a merge or a split) is done, it cannot be undone. Therefore, the documents have to be analysed properly before they are clustered. This led us to retrieve only the associated documents from the entire dataset before clustering. To retrieve the related documents, the association rule mining algorithm is used. Association rule mining is generally applied to transactional databases to find associated items; here it has been modified for text databases, to retrieve the associated documents. The associated documents are then clustered using the hierarchical algorithm, and the results have shown improved performance.
Next, in the partitioning algorithms, the disadvantage is the initial selection of the k centroids. In the simple K-means algorithm the centroids are selected randomly by the user, which affects the performance of the algorithm. The centroids should be placed carefully, because different locations lead to different results. It has been observed that locating the centroids at a larger distance from one another gives better results (Mushfeq-Us-Saleheen Shameem & Raihana Ferdous 2009). To place the centroids at such a distance, we propose a method to locate the initial centroids: dissimilar documents are used as the initial centroids, and these dissimilar documents are identified using the Apriori algorithm. The Apriori algorithm is used to find the similar documents; the remaining documents are identified as dissimilar. These documents are selected as the initial centroids, and the K-means algorithm is applied. The experimental results have shown that selecting dissimilar documents as the initial centroids improves the clustering performance.

These two points motivated this work. The chapter is divided into two parts: in the first, association rule mining integrated with the hierarchical algorithm is described; in the second, the K-means algorithm integrated with the association rule algorithm is explained. The rest of the chapter is organized as follows. In section 3.2, the association rule based hierarchical clustering algorithm is described. In section 3.3, the K-means algorithm based on association rules is described.

3.2 ASSOCIATION RULE BASED HIERARCHICAL CLUSTERING ALGORITHM

The Apriori algorithm is a popular association rule algorithm that is used in data mining to find frequent itemsets. The same algorithm is used in
this work to find similar documents. The remaining dissimilar documents are removed, which reduces the size of the database. Only the associated documents are then clustered, which improves the efficiency of the clustering algorithm. The method consists of several phases: pre-processing, feature selection, and finally, clustering. The block diagram of the proposed system is depicted in Figure 3.1.

Figure 3.1 Block Diagram of the Proposed Method (Input Text File -> Preprocessing -> Feature Selection -> Association Rule Miner -> Hierarchical Clustering -> Clustered Documents as Output)

Pre-processing and Feature Selection Method

Pre-processing involves stop word removal and stemming. Stop words are common English words that carry little meaning for characterizing a document; these words are removed using a stop word list. The next step in pre-processing is stemming, the process of reducing derived words to their base, stem or root form. The words are stemmed using the Porter stemmer algorithm.

After these pre-processing steps, the important features are selected using the feature selection method proposed by Xiao-Bing Xue & Zhi-Hua Zhou (2007). In this method, the importance of a term is calculated based on
the compactness of the appearances of the word and the position of its first appearance. The distribution of the word is calculated in two steps: first the document is divided into several parts, and then an array is built in which each entry is the count of appearances of the word in the corresponding part. In this work, the number of parts corresponds to the number of sentences in the document. Suppose a document d contains n sentences; the distributional array of the word t is represented as array(t, d) = [c_0, c_1, ..., c_{n-1}]. Compactness measures whether the word is concentrated in a specific part of the document or spread over the whole document. The compactness (ComPact) of the appearances of the word t and the position of the first appearance (FirstApp) of the word t are defined in Equations (3.1) to (3.6) as follows:

FirstApp(t, d) = min_{i in {0..n-1}} (c_i != 0 ? i : n)                          (3.1)

ComPactPartNum(t, d) = sum_{i=0}^{n-1} (c_i != 0 ? 1 : 0)                        (3.2)

LastApp(t, d) = max_{i in {0..n-1}} (c_i != 0 ? i : -1)                          (3.3)

ComPactFLDist(t, d) = LastApp(t, d) - FirstApp(t, d)                             (3.4)

centroid(t, d) = (sum_{i=0}^{n-1} i * c_i) / count(t, d),
    where count(t, d) = sum_{i=0}^{n-1} c_i                                      (3.5)

ComPactPosVar(t, d) = (sum_{i=0}^{n-1} c_i * |i - centroid(t, d)|) / count(t, d) (3.6)

The frequency of a word is calculated using the Term Frequency (TF) feature selection method. The compactness of the appearances of a word
is calculated using the ComPactness (CP) method, and the first appearance of a word is found using the FirstAppearance (FA) method. TF, CP and FA are calculated using Equations (3.7) to (3.12) as follows:

TF(t, d) = count(t, d) / size(d)                     (3.7)

CPPN(t, d) = ComPactPartNum(t, d) / len(d)           (3.8)

CPFLD(t, d) = 1 - ComPactFLDist(t, d) / len(d)       (3.9)

CPPV(t, d) = 1 - ComPactPosVar(t, d) / len(d)        (3.10)

FA(t, d) = f(FirstApp(t, d), len(d))                 (3.11)

f(p, len(d)) = (len(d) - p) / len(d)                 (3.12)

The importance of each word is calculated as the summation of TF, CP and FA. The words are then ranked by their values, with the highest value on top, and the topmost four words are selected for each text file. All the text files are processed in a similar manner, and the final output is obtained.

Association Rule Based Hierarchical Clustering Method

The selected features are passed into the association rule miner, which uses the Apriori algorithm to find the association rules between the text documents. The association rule problem is defined as follows. Let I = {i_1, i_2, ..., i_m} be a set of words. Let D be a set of
documents, where each document T contains a set of words. An association rule is an implication of the form X -> Y, where X is a subset of I, Y is a subset of I, and X and Y are disjoint. The association rule X -> Y holds in the set of documents D with confidence c if c% of the documents in D that contain X also contain Y. The association rule X -> Y has support s if s% of the documents in D contain both X and Y. Mining association rules means finding all the association rules whose support and confidence are greater than or equal to the user-specified minimum support (minsup) and minimum confidence (minconf), respectively (Agrawal & Srikant 1994). The threshold values for support and confidence lie between 0 and 1. In this work the threshold value is fixed at 0.01 for all the documents, so that the evaluation criteria remain the same throughout the process.

The documents that are associated with one another are obtained from the association rule miner, and the remaining documents are discarded. As a result, the database size is reduced to a great extent. The associated documents are then given as input to the hierarchical algorithm. This gives the advantage of improved cluster quality, and also reduces the time taken for clustering. The hierarchical algorithm performs agglomerative clustering, and the final results are obtained. The clusters obtained as a result of ARBHC are found to have better quality than the clusters produced using hierarchical clustering alone.

Experimental Results

The experiments are tested on the Reuters dataset. Out of the entire collection, four Reuters subsets were taken for testing purposes, each with 180 documents. The topics include commodity, corporate, currency, energy, economic indicator and subject codes. The clusters are evaluated using the
F-measure and Entropy measures. The cluster quality is evaluated using the cophenetic correlation coefficient. The time taken to cluster the documents using both methods is also calculated.

Cophenetic Correlation Coefficient

This coefficient is used to evaluate the quality of the clusters. It measures how faithfully a dendrogram preserves the pairwise distances between the original unmodelled data points. The clusters are evaluated and the results are given in Table 3.1.

Table 3.1 Cluster Quality

            D0      D1      D2      D3
  Hier
  ARBHC

The tabulated values show that the clusters formed using ARBHC have better quality than the clusters formed using the hierarchical algorithm.

F-measure

The F-measure is the weighted harmonic mean of precision and recall. The formula of the F-measure is given in Equation (3.13) as follows:

F = 2 * precision * recall / (precision + recall)    (3.13)

Precision is the fraction of the retrieved documents that are relevant to the user's information need. Recall is the fraction of the documents relevant to the query that are successfully retrieved. The formulas are given in Equation (3.14) as follows:
precision(i, k) = N_ik / N_k        recall(i, k) = N_ik / NC_i    (3.14)

where N is the total number of documents, i indexes the predefined classes, k indexes the clusters of the unsupervised classification, NC_i is the number of documents of class i, N_k is the number of documents of cluster k, and N_ik is the number of documents of class i in cluster k. The values are given in Table 3.2.

Table 3.2 F-Measure Values

            D0      D1      D2      D3
  Hier
  ARBHC

The tabulated values show that the ARBHC method has better F-measure values than those of the hierarchical method.

Figure 3.2 F-Measure Values
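The per-class computations of Equations (3.13) and (3.14) can be sketched as follows; the contingency counts used in the example are hypothetical, chosen only to illustrate the arithmetic.

```python
# Sketch of Eqs. (3.13)-(3.14): precision and recall of class i against
# cluster k from the contingency counts, combined into the F-measure.

def precision(n_ik, n_k):
    # fraction of cluster k's documents that belong to class i
    return n_ik / n_k

def recall(n_ik, nc_i):
    # fraction of class i's documents that ended up in cluster k
    return n_ik / nc_i

def f_measure(p, r):
    # weighted harmonic mean of precision and recall
    return 2 * p * r / (p + r) if (p + r) else 0.0

p = precision(8, 10)   # cluster k holds 10 documents, 8 from class i
r = recall(8, 16)      # class i has 16 documents in total
print(round(f_measure(p, r), 3))   # -> 0.615
```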
Figure 3.2 shows the diagrammatic representation of the tabulated values and confirms that the ARBHC method outperforms the hierarchical method.

Entropy

In information theory, entropy is a measure of the uncertainty associated with a random variable. The values are given in Table 3.3.

Table 3.3 Entropy Values

            D0      D1      D2      D3
  Hier
  ARBHC

The tabulated values show that the ARBHC method has lower entropy values than those of the hierarchical method, which is a sign of better performance.

Figure 3.3 Entropy Values
Figure 3.3 shows the diagrammatic representation of the tabulated values, which confirms that the ARBHC method outperforms the hierarchical method.

Time

The time taken to cluster the documents using the plain hierarchical method and the ARBHC method is calculated, and the results are tabulated in Table 3.4.

Table 3.4 Time Values (Seconds)

            D0      D1      D2      D3
  Hier
  ARBHC

Figure 3.4 Time Values

The tabulated values and Figure 3.4 show that the time taken to cluster the documents is less for the ARBHC method than for the hierarchical method.
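The cophenetic correlation coefficient used for Table 3.1 can be computed from a dendrogram as in the following sketch, which assumes SciPy is available; the 2-D points are toy stand-ins for document vectors.

```python
# Sketch of the cophenetic correlation check: the coefficient compares the
# dendrogram's cophenetic distances with the original pairwise distances;
# values near 1 mean the hierarchy preserves the geometry of the data well.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# two tight, well-separated groups of toy points
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
d = pdist(X)                       # condensed pairwise distance matrix
Z = linkage(d, method="average")   # agglomerative clustering
c, _ = cophenet(Z, d)              # cophenetic correlation coefficient
print(round(float(c), 2))
```

For such clearly separated groups the coefficient comes out close to 1.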
3.3 MODIFIED K-MEANS ALGORITHM

Another fundamental clustering algorithm is the K-means algorithm. K-means, a heuristic algorithm, is a method of cluster analysis which partitions n observations into k clusters, such that each observation is grouped into the cluster with the nearest mean. The K-means algorithm is less accurate than the hierarchical algorithm, but provides greater efficiency. In the simple K-means algorithm the centroids are selected randomly by the user, which affects the performance of the algorithm.

Introduction

In this work, a method has been proposed to locate the initial centroids. It has been observed that locating the centroids at a larger distance from one another gives better results (Mushfeq-Us-Saleheen Shameem & Raihana Ferdous 2009). To place the centroids at a distance, we propose using dissimilar documents as the initial centroids. In the proposed method, the text data are first preprocessed. Then the features are selected using an appropriate method. Then the dissimilar documents are identified and selected as the initial centroids. With these initial centroids the K-means algorithm is run and the performance is evaluated.

Preprocessing

The stop words in the input text files are first removed, using a standard stop word list. Then the words are stemmed using the Porter stemming algorithm, which reduces each word to its base form.
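This preprocessing stage can be sketched as follows; the stop list is a tiny illustrative subset of a real one, and strip_suffix is a crude stand-in for the full Porter stemmer rules, not the actual algorithm.

```python
# Minimal sketch of the pre-processing stage: stop-word removal followed
# by stemming. STOP_WORDS is an illustrative subset; strip_suffix only
# mimics the idea of Porter stemming with a few suffix rules.

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for"}

def strip_suffix(word):
    """Crude suffix stripping (illustrative only; not the real Porter rules)."""
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [w.lower() for w in text.split()]
    kept = [w for w in tokens if w not in STOP_WORDS]   # stop word removal
    return [strip_suffix(w) for w in kept]              # stemming

print(preprocess("the trader is buying commodities for the energy markets"))
# -> ['trader', 'buy', 'commoditi', 'energy', 'market']
```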
Feature Selection

This step picks out the important words from a file, which eventually represent the entire document. It is done using the term frequency method and the title match method. Term frequency is a measure of how often a word appears in a document, defined as the number of times the word appears divided by the total number of words. In the title match method, if a word in the file matches the title, it is given a weight value. The term frequency value and the title match value are added to obtain the weight of each word. The words are then ranked according to their weights, and the top four words are taken for each file. The process is repeated for all the text files, and the words are selected.

Finding Dissimilar Documents

The Apriori algorithm is used to find the similar documents, and the remaining documents are considered dissimilar. The number of documents thus found is given as the initial k value. The dissimilar documents are selected as the initial centroids for clustering and given as input to the modified K-means algorithm.

Modified K-Means Algorithm

The dissimilar documents found using the Apriori algorithm are selected as the initial centroids. This helps the algorithm to classify the documents as disjoint sets.
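The Apriori-style support and confidence tests that underlie the similar-document search (treating each document as a transaction of words, per the definition given earlier in this chapter) can be sketched as:

```python
# Sketch of the support/confidence measures: each document is a transaction
# (a set of words); a rule X -> Y is accepted when both measures reach the
# user-specified minsup and minconf thresholds.

def support(docs, itemset):
    # fraction of documents containing every word in itemset
    return sum(1 for d in docs if itemset <= d) / len(docs)

def confidence(docs, x, y):
    # of the documents containing X, the fraction also containing Y
    x_hits = sum(1 for d in docs if x <= d)
    xy_hits = sum(1 for d in docs if (x | y) <= d)
    return xy_hits / x_hits if x_hits else 0.0

docs = [{"oil", "price", "trade"}, {"oil", "trade"}, {"gold", "price"}]
print(round(support(docs, {"oil", "trade"}), 2))   # -> 0.67
print(confidence(docs, {"oil"}, {"trade"}))        # -> 1.0
```

A rule is then kept only if its support and confidence both meet the fixed 0.01 threshold used in this work.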
The clustering procedure is defined as follows:

    Select the dissimilar documents as the initial centroids m_1, m_2, ..., m_k
    Until there is no change in the means:
        Use the current means to classify the samples into clusters
        For i from 1 to k:
            Replace m_i with the mean of all the samples assigned to cluster i
        end_for
    end_until

The modified K-means algorithm, run with the dissimilar documents as the initial centroids, has shown improved clustering performance. It was also found that the number of iterations taken to converge is smaller than for the traditional K-means algorithm.

Experimental Results

All the experiments have been carried out on the Reuters dataset. Reut2-000, Reut2-001, Reut2-002 and Reut2-003 have been taken for the analysis, and are named D0, D1, D2 and D3 respectively. From each, 180 documents have been selected, for a total of 720 documents from topics such as commodity, corporate, currency, energy, economic indicator and subject codes. The evaluation metrics used are the F-measure and Entropy. The number of iterations taken to converge is also evaluated.
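The modified K-means procedure can be sketched in plain Python as follows; the 2-D points stand in for document vectors, and squared Euclidean distance is an assumption the text does not fix.

```python
# Sketch of the modified K-means loop: the dissimilar documents (here,
# toy vectors) are used as the fixed initial centroids instead of random
# picks, and the usual assign/update iteration runs until the means stop
# changing.

def kmeans(points, init_centroids, max_iter=100):
    centroids = [list(c) for c in init_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            # assign each point to the nearest current mean
            j = min(range(len(centroids)),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        new = [[sum(x) / len(c) for x in zip(*c)] if c else m
               for c, m in zip(clusters, centroids)]
        if new == centroids:          # no change in the means: converged
            break
        centroids = new
    return centroids, clusters

pts = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.1), (4.2, 3.9)]
cents, clus = kmeans(pts, init_centroids=[(0.0, 0.0), (4.0, 4.0)])
print(cents)
```

With well-separated initial centroids the loop here converges in two passes, mirroring the reduced iteration counts reported below.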
F-Measure

The formula of the F-measure is given in Equation (3.15) as follows:

F = 2 * Precision * Recall / (Precision + Recall)    (3.15)

Precision, the measure of exactness, is the fraction of the retrieved documents that are relevant to the user's information need. Recall, the measure of completeness, is the fraction of the documents relevant to the query that are successfully retrieved.

Table 3.5 F-Measure Values

                     D0      D1      D2      D3
  K-means
  Modified K-means

The values tabulated in Table 3.5 show that the modified K-means algorithm has F-measure values somewhat similar to those of the K-means algorithm.

Entropy

Entropy is a measure of the uncertainty associated with a random variable; it refers to the information contained in a message. The general form of entropy is given in Equation (3.16) as follows:

E_j = - sum_i p_ij log(p_ij)    (3.16)

where p_ij is the probability that a member of cluster j belongs to class i.
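Equation (3.16) can be sketched for a single cluster as follows; the natural logarithm is assumed, since the text does not fix the base.

```python
# Sketch of Eq. (3.16) for one cluster j: p_ij is the fraction of the
# cluster's documents that belong to class i; a pure cluster has entropy 0,
# and lower values mean better clustering.
import math

def cluster_entropy(class_counts):
    total = sum(class_counts)
    e = 0.0
    for n in class_counts:
        if n:                      # 0 * log(0) is taken as 0
            p = n / total
            e -= p * math.log(p)
    return e

print(cluster_entropy([10, 0, 0]))          # pure cluster -> 0.0
print(round(cluster_entropy([5, 5]), 4))    # evenly mixed -> 0.6931 (= ln 2)
```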
Table 3.6 Entropy Values

                     D0      D1      D2      D3
  K-means
  Modified K-means

The values tabulated in Table 3.6 show that the modified K-means algorithm has lower entropy values than the K-means algorithm, which indicates its better performance: the lower the value, the better the performance.

Number of Iterations

The number of iterations the algorithm takes to converge depends on the initial centroids. Therefore, the number of steps taken to converge is calculated. It is found that the modified K-means algorithm takes fewer iterations than the simple K-means algorithm. The values are tabulated in Table 3.7.

Table 3.7 Number of Iterations

                     D0      D1      D2      D3
  K-means
  Modified K-means

The tabulated values show that the number of iterations taken by the modified K-means algorithm to converge is less than the number taken by the K-means algorithm.
3.4 SUMMARY

In this work, we have proposed a new hierarchical clustering algorithm based on association rules, and evaluated it on the Reuters dataset. The algorithm is motivated by the fact that an increased database size will eventually reduce the clustering performance. To overcome this, we have integrated the association rule algorithm with the hierarchical algorithm: only the associated documents identified by the association algorithm are clustered using the hierarchical algorithm, which greatly improves the efficiency of the clustered output. The experimental results show that association rule based hierarchical clustering outperforms the traditional hierarchical clustering algorithm. The cluster quality was also tested using the cophenetic correlation coefficient, and the results showed improved performance. The time taken to complete the clustering process was also measured, and the ARBHC algorithm took less time than the hierarchical clustering algorithm.

In addition, a solution to improve the performance of the K-means algorithm has been formulated. The clustering performance of the simple K-means algorithm depends on the initial centroids; since these are selected at random, the performance varies. To overcome this, dissimilar documents are identified using the Apriori algorithm and selected as the initial centroids, which has yielded better results. The number of iterations taken to converge was also calculated, and the modified K-means algorithm used fewer iterations than the simple K-means algorithm.

The next chapter describes a new method that has been used for keyword extraction from text documents.
More informationIndexing in Search Engines based on Pipelining Architecture using Single Link HAC
Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationAnalysis and Extensions of Popular Clustering Algorithms
Analysis and Extensions of Popular Clustering Algorithms Renáta Iváncsy, Attila Babos, Csaba Legány Department of Automation and Applied Informatics and HAS-BUTE Control Research Group Budapest University
More informationUnsupervised Learning : Clustering
Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex
More informationClustering & Classification (chapter 15)
Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical
More informationClustering. Chapter 10 in Introduction to statistical learning
Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What
More informationINF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering
INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,
More informationhttp://www.xkcd.com/233/ Text Clustering David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Administrative 2 nd status reports Paper review
More information2. Background. 2.1 Clustering
2. Background 2.1 Clustering Clustering involves the unsupervised classification of data items into different groups or clusters. Unsupervised classificaiton is basically a learning task in which learning
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationCase-Based Reasoning. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. Parametric / Non-parametric.
CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance
More informationCS 188: Artificial Intelligence Fall 2008
CS 188: Artificial Intelligence Fall 2008 Lecture 25: Kernels and Clustering 12/2/2008 Dan Klein UC Berkeley 1 1 Case-Based Reasoning Similarity for classification Case-based reasoning Predict an instance
More informationEnhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques
24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,
More informationDocument Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure
Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Neelam Singh neelamjain.jain@gmail.com Neha Garg nehagarg.february@gmail.com Janmejay Pant geujay2010@gmail.com
More informationChapter 9. Classification and Clustering
Chapter 9 Classification and Clustering Classification and Clustering Classification and clustering are classical pattern recognition and machine learning problems Classification, also referred to as categorization
More informationMachine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering.
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationClustering Algorithms for general similarity measures
Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative
More informationClustering in Ratemaking: Applications in Territories Clustering
Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking
More informationSupervised and Unsupervised Learning (II)
Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised
More informationExploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray
Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data
More informationAdministrative. Machine learning code. Supervised learning (e.g. classification) Machine learning: Unsupervised learning" BANANAS APPLES
Administrative Machine learning: Unsupervised learning" Assignment 5 out soon David Kauchak cs311 Spring 2013 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt Machine
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationWorking with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan
Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using
More informationClustering: Overview and K-means algorithm
Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 2
Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical
More informationUNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania
UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of real-valued data is often used as a pre-processing
More informationComparison of FP tree and Apriori Algorithm
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti
More informationINF 4300 Classification III Anne Solberg The agenda today:
INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15
More informationCS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University
CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document
More informationClustering in Data Mining
Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,
More informationDS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li
Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More information1 More configuration model
1 More configuration model In the last lecture, we explored the definition of the configuration model, a simple method for drawing networks from the ensemble, and derived some of its mathematical properties.
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationClustering Part 4 DBSCAN
Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of
More informationKeywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.
Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering
More informationClustering Results. Result List Example. Clustering Results. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to
More informationMachine Learning. Unsupervised Learning. Manfred Huber
Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationText Documents clustering using K Means Algorithm
Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals
More informationString Vector based KNN for Text Categorization
458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationCOMP 465: Data Mining Still More on Clustering
3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More information