One-mode Additive Clustering of Multiway Data

Dirk Depril and Iven Van Mechelen
KULeuven, Tiensestraat 103, 3000 Leuven, Belgium
(e-mail: dirk.depril@psy.kuleuven.ac.be, iven.vanmechelen@psy.kuleuven.ac.be)

Abstract. Given a multiway data set, in several contexts it may be desirable to obtain an overlapping clustering of one of the modes implied by the data. For this purpose a one-mode additive clustering model has been proposed, which implies a decomposition of the data into a binary membership matrix and a real-valued centroid matrix. To fit this model to a data set, a least squares loss function can be minimized. This can be achieved by means of a sequential fitting algorithm (SEFIT) as developed by Mirkin. In this paper we propose a new algorithm for the same model, based on an alternating least squares approach.

Keywords: Additive Clustering, Approximation Clustering, ALS algorithm.

1 Introduction

N-way N-mode data often show up in statistical practice. The simplest instance is a two-way two-mode data set. In this paper we will focus on the latter type of data, but everything can easily be extended to the N-way case.

For two-way two-mode data sets a one-mode additive clustering model has been described by several authors, including [Mirkin, 1996]. The aim of the associated data analysis is to fit this model to a data set under study (either in a least squares or in a maximum likelihood classification sense). For this purpose [Mirkin, 1990] proposed a sequential fitting (SEFIT) algorithm. However, at this moment not much information is available about its performance; moreover, as will be discussed below, this algorithm implies some conceptual problems. As a possible way out, in this paper we present a new algorithm to estimate the same model.

The remainder of this paper is structured as follows. In Section 2 we describe the one-mode additive clustering model and in Section 3 we explain the aim of the associated data analysis. In Section 4 the SEFIT algorithm is explained and in Section 5 we present our new algorithm. In Section 6 we present a few concluding remarks.

2 Model

In one-mode additive clustering the data matrix X is approximated by a model matrix M. This model matrix M with entries $m_{ij}$ ($i = 1,\ldots,I$, $j = 1,\ldots,J$) can be decomposed as

$$ m_{ij} = \sum_{r=1}^{R} a_{ir} g_{rj}, \qquad (1) $$

with R denoting the total number of clusters, with $a_{ir}$ taking values 1 or 0, and with $g_{rj}$ real-valued. A is called the cluster membership matrix, with entries $a_{ir}$ indicating whether entity i belongs to cluster r ($a_{ir} = 1$) or not ($a_{ir} = 0$). One may note that, apart from their binary nature, no further restrictions are imposed on the values $a_{ir}$, implying that the resulting clustering may be an overlapping one. The vector $\mathbf{g}_r = (g_{rj})_{j=1}^{J}$ is called the centroid of cluster r, and the entire matrix G with entries $g_{rj}$ is called the centroid matrix. Equation (1) then means that the ith row of M is obtained by summing the centroids of the clusters to which row i belongs. Note that (1) can also be written in matrix form as

$$ M = AG. \qquad (2) $$

This model has previously been described in [Mirkin, 1990] and [Mirkin, 1996].

To illustrate the conceptual meaningfulness of the one-mode additive clustering model we may refer to the following hypothetical medical example. Consider a patients by symptoms data matrix, the entries of which indicate the extent to which each of a set of patients suffers from each of a set of symptoms. In such a context, symptom strength may be attributed to underlying diseases or syndromes, which correspond to clusters of patients. Given that patients may suffer from more than one syndrome (a phenomenon called syndrome co-morbidity), an overlapping patient clustering is justified in such a case. The measured values of symptom strength can then be considered additive combinations of the underlying syndrome profiles, formalized by the rows of the centroid matrix G of the additive clustering model (1).
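To make the decomposition in (1) and (2) concrete, the following minimal Python sketch (not part of the original paper; the matrices A and G are purely hypothetical) builds the model matrix M = AG for a small overlapping clustering of six patients into two syndromes.

```python
import numpy as np

# Hypothetical binary membership matrix A (I = 6 patients, R = 2 syndromes).
# Patients 3 and 4 belong to both clusters, so the clustering is overlapping.
A = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [1, 1],
              [0, 1],
              [0, 1]])

# Hypothetical real-valued centroid matrix G (R = 2 syndromes, J = 3 symptoms);
# row r is the symptom profile g_r of syndrome r.
G = np.array([[2.0, 1.0, 0.0],
              [0.5, 3.0, 1.5]])

# Model matrix M = AG: row i is the sum of the centroids of the clusters
# to which patient i belongs (equations (1) and (2)).
M = A @ G
print(M)
# Rows 3 and 4 equal g_1 + g_2 = [2.5, 4.0, 1.5], illustrating the additive
# combination of overlapping cluster profiles.
```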

3 Aim of the data analysis

A two-way two-mode data matrix X resulting from a real experiment can always be represented by the model in (1). However, in most cases a large number of clusters R will be needed for this. Therefore one usually looks for a model with a small value of R that fits the data well in some way.

A first way to do this is a deterministic one. In that case one assumes that $X \approx M$, and the goal of the data analysis is then to find the model M with R clusters that optimally approximates the data X according to some loss function. In this paper, the quality of the approximation will be expressed in terms of a least squares loss function:

$$ L_2 = \sum_{i,j} \left( x_{ij} - \sum_{r=1}^{R} a_{ir} g_{rj} \right)^2, \qquad (3) $$

which needs to be minimized with respect to the unknown $a_{ir}$ and $g_{rj}$ ($i = 1,\ldots,I$, $r = 1,\ldots,R$, $j = 1,\ldots,J$). Note that, if the matrix A is given, then the optimal G according to (3) is the least squares multiple regression estimator $(A'A)^{-1}A'X$. Since there are only $2^{IR}$ possible binary matrices A, this implies that the solution space of (3) is finite, and that in principle it is therefore possible to find the global minimum enumeratively. However, as computation time is an exponential function of the size of the data matrix, an enumerative search quickly becomes infeasible. Therefore, in practice suitable algorithms or heuristics need to be developed to find the global optimum of (3).

A second approach to the data analysis is of a stochastic nature. We now assume that

$$ x_{ij} = \sum_{r=1}^{R} a_{ir} g_{rj} + e_{ij}, \qquad (4) $$

where $e_{ij}$ is an error term with $e_{ij} \overset{iid}{\sim} N(0, \sigma^2)$. The goal of the data analysis then is to estimate the $a_{ir}$, $g_{rj}$ and $\sigma$ that maximize the log-likelihood

$$ \log l = \sum_{i,j} \log f(x_{ij} \mid A, G, \sigma) = -\frac{IJ}{2} \log 2\pi - IJ \log \sigma - \frac{\sum_{i,j} \left( x_{ij} - \sum_{r=1}^{R} a_{ir} g_{rj} \right)^2}{2\sigma^2}. \qquad (5) $$

This can be characterized as a classification likelihood problem. In the latter type of problem, the binary entries of A are considered fixed parameters that need to be estimated, rather than realisations of latent random variables as in mixture-like models. For the estimation of $a_{ir}$ and $g_{rj}$ we need to minimize the sum in the numerator of the rightmost term in (5). This means that the stochastic approach for estimating the memberships and centroids is fully equivalent to the deterministic approach explained above. For the estimation of $\sigma^2$ we have

$$ \hat\sigma^2 = \frac{\sum_{i,j} \left( x_{ij} - \sum_{r=1}^{R} \hat a_{ir} \hat g_{rj} \right)^2}{IJ}, \qquad (6) $$

where $\hat a_{ir}$ and $\hat g_{rj}$ are the maximum likelihood estimators of $a_{ir}$ and $g_{rj}$, respectively.
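As a small illustration of the loss (3), of the conditional least squares estimator $(A'A)^{-1}A'X$, and of the variance estimate (6), the following Python sketch may be helpful; it is not part of the original paper, and the data-generating values it uses are hypothetical.

```python
import numpy as np

def conditional_centroids(A, X):
    """Least squares estimate of G given A, i.e. (A'A)^{-1} A'X."""
    # lstsq avoids forming the inverse explicitly and also copes with a
    # rank-deficient A (e.g., an empty cluster column).
    return np.linalg.lstsq(A, X, rcond=None)[0]

def l2_loss(A, G, X):
    """Least squares loss function L2 of equation (3)."""
    return np.sum((X - A @ G) ** 2)

# Hypothetical data: I = 6 objects, J = 3 variables, R = 2 overlapping clusters.
rng = np.random.default_rng(0)
A_true = np.array([[1, 0], [1, 0], [1, 1], [1, 1], [0, 1], [0, 1]])
G_true = np.array([[2.0, 1.0, 0.0], [0.5, 3.0, 1.5]])
X = A_true @ G_true + rng.normal(scale=0.1, size=(6, 3))   # model (4)

G_hat = conditional_centroids(A_true, X)
loss = l2_loss(A_true, G_hat, X)
sigma2_hat = loss / X.size          # equation (6): residual sum of squares / (IJ)
print(G_hat, loss, sigma2_hat)
```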

4 SEFIT

As explained in the previous section, the minimization of the loss function (3) requires suitable algorithms. In this section we explain a first such algorithm, developed by [Mirkin, 1990], which is a sequential fitting (SEFIT) algorithm. In this algorithm the membership matrix A is estimated column by column, meaning that one sequentially looks for new clusters. Suppose $m - 1$ clusters have already been found; the mth cluster is then estimated by making use of the residual data

$$ x_{ij}^{m} = x_{ij} - \sum_{r < m} a_{ir} g_{rj} \qquad (7) $$

and by minimizing the function

$$ \sum_{i,j} \left( x_{ij}^{m} - a_{im} g_{mj} \right)^2. \qquad (8) $$

Given the memberships $a_{im}$ ($i = 1,\ldots,I$), the least squares estimates for the centroid values $g_{mj}$ ($j = 1,\ldots,J$) are given by

$$ g_{mj} = \frac{\sum_{i=1}^{I} a_{im} x_{ij}^{m}}{\sum_{i=1}^{I} a_{im}^{2}}, \qquad (9) $$

which is the simple mean of the (residual) elements in the cluster. The estimation of the memberships $a_{im}$ ($i = 1,\ldots,I$) proceeds as follows. We start with a zero membership column (i.e., an empty cluster) and sequentially add elements of the first mode to the cluster in a greedy way, that is, we add the row that yields the biggest decrease in the loss function (8), and continue until no further decrease is obtained.

A full loop of the algorithm then goes as follows. First estimate the memberships $a_{im}$ ($i = 1,\ldots,I$) from the residuals $x_{ij}^{m}$ by means of the greedy procedure explained above, and next estimate the centroid values $g_{mj}$ ($j = 1,\ldots,J$) by means of equation (9). Denote $\mathbf{a}_m = (a_{im})_{i=1}^{I}$ and $\mathbf{g}_m = (g_{mj})_{j=1}^{J}$; calculate the outer product $\mathbf{a}_m \mathbf{g}_m'$ and subtract it from $X^m$, yielding the residuals $x_{ij}^{m+1} = x_{ij}^{m} - a_{im} g_{mj}$. This loop is repeated on $x_{ij}^{m+1}$, and the algorithm terminates when the prespecified number of clusters R is reached.

One may note that this algorithm may only yield a local minimum rather than the global optimum of the loss function. Moreover, SEFIT may also have problems in recovering a true underlying model; a sketch of the procedure and an illustration of this recovery problem follow below.
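The following Python sketch is a minimal, non-authoritative rendering of the sequential fitting idea described above. It assumes that the centroid (9) is re-evaluated for every candidate row during the greedy growth step, a detail the description above leaves open.

```python
import numpy as np

def fit_one_cluster(Xres):
    """Greedily grow one cluster on the residual data Xres, minimizing (8)."""
    I, J = Xres.shape
    a = np.zeros(I, dtype=int)
    best_loss = np.sum(Xres ** 2)            # loss of the empty cluster (g = 0)
    improved = True
    while improved:
        improved = False
        best_i = None
        for i in np.where(a == 0)[0]:
            a_try = a.copy()
            a_try[i] = 1
            g_try = Xres[a_try == 1].mean(axis=0)            # equation (9)
            loss = np.sum((Xres - np.outer(a_try, g_try)) ** 2)
            if loss < best_loss:
                best_loss, best_i = loss, i
        if best_i is not None:                # add the row with the biggest decrease
            a[best_i] = 1
            improved = True
    g = Xres[a == 1].mean(axis=0) if a.any() else np.zeros(J)
    return a, g

def sefit(X, R):
    """Sequentially extract R clusters, deflating the residual data each time."""
    Xres = X.astype(float).copy()
    A = np.zeros((X.shape[0], R), dtype=int)
    G = np.zeros((R, X.shape[1]))
    for m in range(R):
        a, g = fit_one_cluster(Xres)
        A[:, m], G[m] = a, g
        Xres -= np.outer(a, g)                # subtract the fitted outer product
    return A, G
```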

We will now illustrate this recovery problem with a hypothetical example. Suppose the following true structure underlies the data X:

$$ M = AG = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} g_{11} & g_{12} & g_{13} \\ g_{21} & g_{22} & g_{23} \end{pmatrix} = \begin{pmatrix} g_{11} & g_{12} & g_{13} \\ g_{11} & g_{12} & g_{13} \\ g_{11}+g_{21} & g_{12}+g_{22} & g_{13}+g_{23} \\ g_{11}+g_{21} & g_{12}+g_{22} & g_{13}+g_{23} \\ g_{21} & g_{22} & g_{23} \\ g_{21} & g_{22} & g_{23} \end{pmatrix}. \qquad (10) $$

Suppose now that we estimate the first cluster and that the correct membership vector $\mathbf{a}_1 = (1, 1, 1, 1, 0, 0)'$ has been recovered. Then according to (9) the estimate of the corresponding centroid $\mathbf{g}_1$ reads

$$ \hat{\mathbf{g}}_1 = \left( g_{11} + \tfrac{g_{21}}{2},\; g_{12} + \tfrac{g_{22}}{2},\; g_{13} + \tfrac{g_{23}}{2} \right), \qquad (11) $$

which is not equal to the true $(g_{11}, g_{12}, g_{13})$. Clearly a bias has been introduced due to the overlapping nature of the clusters, and all further estimates may be affected by this biased estimate, since in the next step of the algorithm the centroid is subtracted from the data.

5 ALS algorithm

Our new approach to finding the optimum of the loss function (3) is of the alternating least squares (ALS) type: given a membership matrix A we look for an optimal G conditional upon the given A; given this G, we subsequently look for a new and conditionally optimal A, and so on. The easiest part is the search for G given the memberships A, since this comes down to an ordinary multivariate least squares regression problem with a closed-form solution:

$$ G = (A'A)^{-1} A'X. \qquad (12) $$

For the estimation of A we can use a separability property of the loss function (3); see also [Chaturvedi and Carroll, 1994]. This loss function can indeed be rewritten as

$$ L_2 = \sum_{j} \left( x_{1j} - \sum_{r=1}^{R} a_{1r} g_{rj} \right)^2 + \ldots + \sum_{j} \left( x_{Ij} - \sum_{r=1}^{R} a_{Ir} g_{rj} \right)^2. \qquad (13) $$

The latter means that the contribution of the ith row of A has no influence on the contributions of the other rows. As a consequence, A can be estimated row by row, which reduces the work to evaluating $I \cdot 2^{R}$ possible membership patterns (instead of the full $2^{IR}$).
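To make the alternating scheme concrete, here is a minimal Python sketch, not taken from the paper, that alternates the closed-form centroid update (12) with the row-by-row membership update justified by (13); the multistart wrapper reflects the random-start strategy discussed below, and all function names are illustrative.

```python
import itertools
import numpy as np

def als_additive_clustering(X, R, A_init, max_iter=100):
    """One ALS run for the one-mode additive clustering model, from a given start."""
    X = np.asarray(X, dtype=float)
    I, J = X.shape
    A = A_init.copy()
    # All 2^R candidate membership patterns for a single row of A.
    patterns = np.array(list(itertools.product([0, 1], repeat=R)))
    prev_loss = np.inf
    for _ in range(max_iter):
        # Conditional update of G given A (equation (12)); lstsq copes with a
        # rank-deficient A (e.g., an empty cluster) where the inverse would fail.
        G = np.linalg.lstsq(A, X, rcond=None)[0]
        # Conditional update of A given G: thanks to separability (13), each row
        # can be optimized independently over its 2^R candidate patterns.
        fits = patterns @ G                              # (2^R, J) candidate row fits
        for i in range(I):
            row_losses = np.sum((X[i] - fits) ** 2, axis=1)
            A[i] = patterns[np.argmin(row_losses)]
        loss = np.sum((X - A @ G) ** 2)
        if loss >= prev_loss:                            # no further decrease
            break
        prev_loss = loss
    return A, G, prev_loss

def als_multistart(X, R, n_starts=50, seed=0):
    """Run the ALS algorithm from several random starts and keep the best solution."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        A0 = rng.integers(0, 2, size=(X.shape[0], R))
        result = als_additive_clustering(X, R, A0)
        if best is None or result[2] < best[2]:
            best = result
    return best
```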

The alternating process is repeated until there is no further decrease in the loss function. Since in each step the optimal conditional solution is found, we obtain a decreasing sequence of nonnegative loss function values. As a consequence, this sequence has a limit, which moreover will be reached after a finite number of iterations, since there is only a finite number of possible membership matrices A. The iterative process is to be started from some initial membership matrix A, which can for instance be user specified or randomly drawn.

Since in the ALS approach entire matrices are estimated rather than single columns, the bias implied by the SEFIT strategy is avoided. Nevertheless, the ALS algorithm may still yield a local optimum of the loss function (3). A standard remedy for this inconvenience is to use a large number of starts and to retain the best solution.

6 Concluding remark

In this paper we proposed a new algorithm for finding overlapping clusters of one mode of a multiway data set. It involves an alternating least squares approach and may overcome some limitations of Mirkin's original SEFIT algorithm. To justify the latter claim, however, an extensive simulation study in which the performance of both algorithms is compared would be needed.

References

[Chaturvedi and Carroll, 1994] A. Chaturvedi and J. D. Carroll. An alternating combinatorial optimization approach to fitting the INDCLUS and generalized INDCLUS models. Journal of Classification, pages 155–170, 1994.

[Mirkin, 1990] B. G. Mirkin. A sequential fitting procedure for linear data analysis models. Journal of Classification, pages 167–195, 1990.

[Mirkin, 1996] B. G. Mirkin. Mathematical Classification and Clustering. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1996.