Clustering Documents in Large Text Corpora


Bin He
Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5
bhe

Yongzheng Zhang
Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5
yongzhen

Abstract

In this report, two clustering algorithms, PDDP and EM, are tested on a corpus of neural network articles. Terms extracted by the C-value / NC-value method are used to construct the vector space model. The resulting clusters are evaluated with the mean scatter value for the PDDP algorithm and with the log likelihood for the EM algorithm.

1 Introduction

As the World Wide Web and online systems continue to grow at a tremendous rate, text clustering is becoming increasingly widespread [10]. The topic of clustering has been extensively studied in many scientific disciplines, and over the years a variety of different approaches have been developed [4, 6, 10]. Fast, high-quality document clustering algorithms play an important role in organizing large amounts of data into a small number of meaningful clusters [12].

Typically, clustering approaches can be categorized as agglomerative or partitional based on the underlying methodology of the algorithm, or as hierarchical or non-hierarchical based on the structure of the final solution [13].

In general, text clustering involves constructing a vector space model and representing documents by feature vectors. First, a set of keywords (or significant terms) is extracted from the document corpus to form the feature vector. Second, each document is represented by the feature vector, which consists of frequency and weight statistics of all significant terms. Finally, clustering proceeds by measuring the similarity (usually a function of Euclidean distance) between documents and assigning documents to appropriate clusters. In this project, we used significant terms as feature vectors, which differs from the work in [3], where keywords are used.

Traditionally, the frequency-of-occurrence method is used to extract useful terms. Research in [5] shows that the C-value / NC-value method, which combines linguistic and statistical information, performs better than the conventional frequency method. In this project, we applied both the frequency method and the C-value / NC-value method to extract terms from the corpus.

In our work, we took advantage of two software packages, PDDP [2] and weka [7], to cluster documents in a large text corpus of neural network articles. These two tools implement the PDDP algorithm [3] and the EM algorithm [9], respectively. We aim to evaluate the quality of the clustering results based on different feature vectors, generated using the two different methods, i.e., frequency of occurrence and C-value / NC-value.

2 Text Clustering Algorithms

In this section we briefly describe how the PDDP and EM algorithms work on a collection of documents.

2.1 PDDP algorithm

The method of Principal Direction Divisive Partitioning (PDDP) was developed by Boley [4]. It falls into both the partitional and the hierarchical categories. It first computes a root hyperplane, then a child hyperplane for each cluster formed from the root hyperplane, and so on. The algorithm proceeds by splitting a leaf node into two child nodes using the leaf's associated hyperplane. The final result is a binary tree of clusters defined by their associated principal directions and hyperplanes [3].

As indicated in [4], each document is represented by a column of term frequencies, and all the columns together form a term frequency matrix, say $M$. Specifically, the $(i,j)$-th entry, $M_{ij}$, is the number of occurrences of term $t_i$ in document $d_j$. In order to make the results independent of document length, each column is scaled to have unit length in the usual Euclidean norm: $\hat{M}_{ij} = M_{ij} / \sqrt{\sum_i M_{ij}^2}$, so that $\sum_i \hat{M}_{ij}^2 = 1$.

At each stage of the algorithm a cluster is split as follows. The centroid vector of the cluster is the vector $c$ whose $i$-th component is $c_i = \sum_j \hat{M}_{ij} / k$, where the sum is taken over all documents in the cluster and $k$ is the number of documents in the cluster. The principal direction of the cluster is the direction of maximum variance, defined to be the eigenvector corresponding to the largest eigenvalue of the unscaled sample covariance matrix $(\hat{M} - ce)(\hat{M} - ce)^T$, where $e$ is a row vector of all ones and $^T$ denotes the matrix transpose. All the documents are projected onto this principal direction; those with positive projections are allocated to the right child cluster, and the remaining documents are allocated to the left child cluster.
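To make the split step concrete, the following is a minimal Python sketch of one split under the definitions above, assuming NumPy; the function name pddp_split is hypothetical, and a full implementation such as the PDDP package [2] also handles the binary-tree bookkeeping and the choice of which leaf to split next.

    import numpy as np

    def pddp_split(M):
        # M is a terms-by-documents matrix of raw term frequencies
        # for the documents in one cluster.
        # Scale each column (document) to unit Euclidean length.
        Mhat = M / np.linalg.norm(M, axis=0, keepdims=True)
        # Centroid c: component-wise mean of the document columns.
        c = Mhat.mean(axis=1, keepdims=True)
        # Principal direction: leading left singular vector of (Mhat - ce),
        # i.e., the top eigenvector of (Mhat - ce)(Mhat - ce)^T.
        U, _, _ = np.linalg.svd(Mhat - c, full_matrices=False)
        u = U[:, 0]
        # Project the centered documents onto the principal direction
        # and split by the sign of the projection.
        proj = u @ (Mhat - c)
        right = np.flatnonzero(proj > 0)    # positive projections
        left = np.flatnonzero(proj <= 0)    # remaining documents
        return left, right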

2.2 EM algorithm

The expectation-maximization (EM) algorithm was first introduced by Dempster et al. [9] as an approach to estimating missing model parameters, and it can be seen as an iterative approach to optimization. It first finds the expected value of the log likelihood with respect to the current parameter estimates. The second step is to maximize the expectation computed in the first step. These two steps are iterated as necessary. Each iteration is guaranteed to increase the log likelihood, and the algorithm is guaranteed to converge to a local maximum of the likelihood function [1]. EM has many applications, such as text classification and text clustering, and it is one of the most widely used statistical unsupervised learning algorithms.

3 Evaluation of Clustering Results

In this section we discuss how we conducted various clustering tasks using the PDDP and weka EM clustering packages, and how we evaluated the quality of the clustering results.

3.1 Experiment setup

PDDP for Matlab [2] and EM for Java [7] are available online; our task was to create separate term frequency matrices for clustering with these two packages. Precisely, we applied two automatic term extraction methods (frequency of occurrence and C-value / NC-value) [5] to generate a set of candidate terms, and then manually extracted the real terms. Next, each document was scanned and the frequency of term occurrences was recorded to construct the term frequency matrix.

In terms of term selection, we had to select real terms with high frequencies of occurrence to construct the feature vectors, because intuitively similar documents must share a high rate of common terms. However, terms that appear too frequently, such as "neural network", might not be good features: most articles in this corpus are about neural networks, and using such features might result in most papers being clustered into a single cluster. So we eliminated the top 10 terms when constructing the term list, as in the sketch below. Consequently, we constructed eight different term frequency matrices, as described below, for the two software packages, corresponding to four different term lists. Then we ran the clustering software and compared the quality of both methods against the different feature vectors.
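The following Python sketch illustrates this construction for one term list; build_tfm is a hypothetical helper, documents are assumed to be pre-tokenized so that each extracted term appears as a single token, and no such code appears in the original report.

    import numpy as np
    from collections import Counter

    def build_tfm(documents, candidate_terms, drop_top=10):
        # documents: list of token lists; candidate_terms: extracted terms.
        # Rank the candidate terms by total frequency across the corpus.
        totals = Counter()
        for doc in documents:
            counts = Counter(doc)
            for term in candidate_terms:
                totals[term] += counts[term]
        ranked = [t for t, _ in totals.most_common()]
        # Drop the overly frequent top terms (e.g., "neural network").
        terms = ranked[drop_top:]
        # Term frequency matrix: one row per term, one column per document.
        M = np.zeros((len(terms), len(documents)))
        for j, doc in enumerate(documents):
            counts = Counter(doc)
            for i, term in enumerate(terms):
                M[i, j] = counts[term]
        return terms, M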

3.2 Evaluation Schemes

For clustering there are usually two kinds of measures of cluster goodness or quality [11]. One type allows us to compare different sets of clusters without reference to external knowledge and is called an internal quality measure; it uses the overall similarity, which is based on the pairwise similarity of documents in a cluster. The second type allows us to evaluate how well the clustering is working by comparing the groups produced by the clustering technique to known classes; it is called an external quality measure. One external measure is entropy [11], which provides a measure of goodness for un-nested clusters or for the clusters at one level of a hierarchical clustering. Another external measure is the F-measure [11], which is more oriented toward measuring the effectiveness of a hierarchical clustering.

For the PDDP algorithm, we can use scatter to measure the overall similarity, since the scatter value in PDDP plays the role that the distance measure plays in other clustering algorithms [4]: at each iteration, the cluster with the largest scatter value is the one that is split. For the EM algorithm, we can use the log likelihood to measure the overall similarity, since at each iteration EM optimizes the expected log likelihood of the parameters.

3.3 Clustering results using PDDP

Four experiments were carried out to test the performance of the PDDP algorithm. We constructed the term lists with both the frequency-of-occurrence and the C-value / NC-value methods [5]. The data sets for the experiments are summarized in Table 1. As stated in the previous subsection, the scatter value can be used as an indicator of overall similarity: the bigger the scatter, the more distinct the two clusters. As shown in Table 2, the mean scatters of the four experiments differ little. The largest mean scatters are from experiments Frq400 and Cv400, whereas the smallest is from Cv200. This suggests that for the PDDP algorithm, the more terms used in the vector space model, the better the resulting clusters.

Experiment   Data
Frq200       extract 200 terms by frequency from 1000 articles
Frq400       extract 400 terms by frequency from 1000 articles
Cv200        extract 200 terms by C-value from 1000 articles
Cv400        extract 400 terms by C-value from 1000 articles

Table 1: Experiment data set summary

Experiment   mean (scatter)   standard deviation (scatter)
Frq200
Frq400
Cv200
Cv400

Table 2: Scatter values of the four experiments

Figure 1 illustrates the distributions of the scatter values for the four experiments against the cluster node number. The overall distributions for the four experiments differ little.

3.4 Clustering results using EM

In our experiments with the EM algorithm, four term lists, namely TL1, TL2, TL3, and TL4, were constructed from 200 neural network articles. TL1 was created using the C-value / NC-value method and consists of the 100 terms ranked 11 to 110 in the final term list. TL2 was created using the same method, but with only the 50 terms ranked 11 to 60. TL3 and TL4 were generated using the frequency-of-occurrence method, with 100 and 50 terms, respectively. In this set of experiments, four term frequency matrices (TFM1, TFM2, TFM3, and TFM4) in the Attribute-Relation File Format (ARFF) [8] were constructed based on the term lists TL1, TL2, TL3, and TL4, respectively; a sketch of the ARFF encoding is given below. Next we ran the EM algorithm on the four term frequency matrices one by one. First we examine the quality of EM clustering on term frequency matrix TFM1, which was generated using the 100 terms extracted by the C-value / NC-value method.
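In ARFF, each document becomes one data row whose attributes are its term frequencies. A minimal sketch of what such a file might look like, with a hypothetical relation name and three hypothetical term attributes standing in for the 100 real ones:

    @relation tfm_example

    @attribute backpropagation numeric
    @attribute hidden_layer numeric
    @attribute learning_rate numeric

    @data
    3,0,1
    0,2,5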

Figure 1: Scatter distribution along the cluster node number for the four experiments (Cv400, Cv200, Frq400, and Frq200).

The EM algorithm uses the log likelihood to measure how likely a particular clustering is: the greater the log likelihood, the better the clustering result. We noticed that a minimum allowable standard deviation, σ, must be set for the normal density calculation, because text clustering in a high-dimensional space often involves large sparse data, and the joint density consequently overflows. Table 3 shows various values of σ and the corresponding clustering results, where N is the number of resulting clusters and l is the log likelihood.

σ   N   l

Table 3: Various σ values and clustering results

As we can see in Table 3, different values of σ result in different numbers of clusters and different log likelihoods. More importantly, for a particular term frequency matrix there exists a threshold t that allows meaningful clustering, such as t = 0.04 in this case.
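In weka, this minimum allowable standard deviation is an option of the EM clusterer. The following Python sketch shows the kind of invocation meant here, assuming a weka where weka.clusterers.EM accepts -M (minimum allowable standard deviation) and -N (number of clusters); the jar path and the file name tfm1.arff are hypothetical.

    import subprocess

    # Run weka's EM clusterer on one term frequency matrix, with the
    # minimum allowable standard deviation set to 0.05 and 7 clusters.
    subprocess.run([
        "java", "-cp", "weka.jar", "weka.clusterers.EM",
        "-t", "tfm1.arff",  # training file in ARFF format
        "-M", "0.05",       # minimum allowable standard deviation
        "-N", "7",          # number of clusters to fit
    ], check=True)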

With σ < t, EM achieves a positive log likelihood, which indicates a wrong clustering. We also observe that there is an optimal σ that achieves the greatest log likelihood (i.e., the best clustering performance), such as σ = 0.05 in this case. As shown in Table 4, seven clusters were built, where i is the cluster number, |C_i| is the number of documents in cluster C_i, and p is the percentage of all documents that fall in that cluster.

i   |C_i|   p
1   15      7.5%
2   28      14.0%
3   13      6.5%
4   17      8.5%
5   23      11.5%
6   49      24.5%
7   55      27.5%

Table 4: Best clustering results achieved with σ = 0.05

We are also interested in the EM clustering quality when the number of clusters is given as an additional option besides σ = 0.05. Table 5 shows the clustering results with the number of clusters, N, varying from 2 to 10.

N   l

Table 5: Clustering results for different numbers of clusters

As we can see in Table 5, these 200 documents are most probably clustered into 7 clusters, which gives the greatest log likelihood of -13.4; 6 or 8 clusters are also acceptable. Next we performed similar tests on TFM2, TFM3, and TFM4, respectively. The best clustering results achieved in each test are shown in Table 6, where N is the number of resulting clusters and l is the greatest log likelihood achieved in that test. We observe that different term frequency matrices of the same document set lead to different clustering quality. Precisely, the term frequency matrices (TFM1 and TFM2) built on terms generated by C-value / NC-value achieve better results than those (TFM3 and TFM4) built with the frequency method, and the higher-dimensional space produces better quality (100 terms vs. 50).

TFM   N   l

Table 6: Best clustering results for all TFMs

4 Conclusion

We have carried out a number of tests aiming to show the difference in clustering performance between different feature vectors and different dimensions. The experiments show that the quality of the clustering results depends on the feature vectors: clustering based on terms that are more representative of the document corpus achieves better quality. Moreover, text clustering usually suffers from high dimensionality, where the distances between documents all seem to be the same and the documents therefore all seem similar. For the PDDP algorithm, however, this is not the case, since PDDP does not use distance as its similarity measure.

Due to the limited time, we have only explored how to take advantage of mature software packages to conduct clustering tasks, and we have gotten a sense of how clustering proceeds on a small document set. In terms of future work, we are interested in digging deeper into the following topics:

- We are interested in using more formal evaluation criteria, such as entropy and the F-measure, as indicated in [11].

- As we have seen, the EM clustering algorithm falls into the non-hierarchical category. However, it can be extended to a hierarchical method by applying it to further decompose the clusters obtained in the first iteration: we force the EM algorithm to produce two clusters (bisecting) and iterate on this to produce a hierarchical decomposition of the same type as PDDP (see the sketch after this list). We could then compare the computational efficiency and clustering quality of the PDDP and EM algorithms in a high-dimensional space.

- The text corpus we use is a set of computer science articles on neural networks, and currently we are staying with a small collection. It will be interesting to experiment with different-sized subsets of the whole collection, and to estimate the time required to cluster the full papers in the whole neural network corpus.
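The bisecting idea in the second item can be sketched as follows in Python; fit_em is a hypothetical helper that runs a two-component EM clustering (e.g., via weka) and returns one cluster label per document, and no such code appears in the original report.

    def bisecting_em(docs, fit_em, depth):
        # Recursively split a document set into a PDDP-style binary tree
        # by forcing EM to produce exactly two clusters at each node.
        if depth == 0 or len(docs) < 2:
            return docs                       # leaf: an unsplit cluster
        labels = fit_em(docs, n_clusters=2)   # EM forced to bisect
        left = [d for d, g in zip(docs, labels) if g == 0]
        right = [d for d, g in zip(docs, labels) if g == 1]
        return (bisecting_em(left, fit_em, depth - 1),
                bisecting_em(right, fit_em, depth - 1))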

References

[1] J. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR, University of Berkeley.

[2] Daniel Boley. Experimental software for Principal Direction Divisive Partitioning. boley/distribution/pddp.html, last accessed on Dec. 11, 2002.

[3] Daniel Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 1998.

[4] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. Document categorization and query generation on the World Wide Web using WebACE. Artificial Intelligence Review, 13:365-391, 1999.

[5] K. Frantzi, S. Ananiadou, and H. Mima. Automatic recognition of multi-word terms. International Journal on Digital Libraries, 3(2), 2000.

[6] A. Hotho, A. Maedche, and S. Staab. Ontology-based text clustering. In Proceedings of the IJCAI-2001 Workshop "Text Learning: Beyond Supervision", Seattle, USA, August 2001.

[7] The University of Waikato. Weka 3: Machine Learning Software in Java. ml/weka/index.html, last accessed on Dec. 11, 2002.

[8] The University of Waikato. Attribute-Relation File Format (ARFF). ml/weka/arff.html, last accessed on Dec. 11, 2002.

[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

[10] H. Schütze and C. Silverstein. Projections for efficient document clustering. In Proceedings of SIGIR '97, Philadelphia, 1997.

[11] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.

[12] Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001.

[13] Ying Zhao and George Karypis. Evaluation of hierarchical clustering algorithms for document datasets.
