Information-based Biclustering for the Analysis of Multivariate Time Series Data


Kevin Casey
Courant Institute of Mathematical Sciences, New York University, NY
August 6

Abstract

A wide variety of strategies have been proposed for the analysis of gene expression data. Many of these approaches, especially those that focus on identifying sets of coexpressed (or potentially coregulated) genes, have been based on some variety of data clustering. While useful, many of these techniques have limitations that are related to the fact that gene expression data sets generally show correlations in both genes and conditions. Specifically, with respect to time series microarray data, there are often intervals of time in which gene expression is more or less correlated for various sets of genes, followed by intervals for which the correlation changes or disappears. This structure suggests that a clustering approach that partitions both genes and time points simultaneously might be profitable. Using techniques from information theory, one may characterize such a biclustering of the data in terms of how much various regions of the data set can be compressed. Motivated by these considerations, we have developed a biclustering algorithm, based on techniques from information theory and graph algorithms, that finds a partition of time course data such that sets of clustered genes are found for an optimally disjoint windowing of the dataset in time. This windowing places the boundaries between temporal regions at points in time for which gene expression activity undergoes substantial reorganization and thus sheds light on the process level dynamics in the biological data. We validate the method against more traditional techniques on both simple synthetic data and actual time course microarray data from the literature, and show that while computationally expensive, our method outperforms others in terms of accuracy. Further, the method outlined here serves as a first step in the construction of automata based models of gene interaction as discussed in [8].

Background

Clustering

Clustering data [16] is a major topic of research within the disciplines of statistics, computer science and machine learning. It is also an important practical technique used for data analysis and finds application in fields as diverse as biology, data mining, document understanding, object recognition and image analysis. Currently, many approaches under research are a response to biological questions related to microarray analysis of gene expression and genetic regulatory network reconstruction [25]. Clustering is characterized as a form of unsupervised learning wherein the elements of a dataset are organized into some number of groups. The number of groups might be known a priori or might be discovered during the clustering process. Clustering is generally regarded as a form of statistical learning executed without the benefit of known class labels (i.e. unsupervised) that finds a natural partition of a dataset that is optimal in some respect (where optimality makes sense in the context of the data under consideration). Such techniques contrast with classification achieved by supervised methods, in which a set of labeled data vectors is used to train a learning machine that is then used to classify novel unlabeled patterns. Clustering methods discover groups in the data without any advance knowledge of the nature of the classes (except perhaps their number) and are thus entirely data driven. Less abstractly, clustering is the classification of feature vectors or patterns into groups when we do not have the benefit of patterns that are marked up with class information. It is generally exploratory in nature and often makes few assumptions about the data. When clustering, one is interested in finding a grouping (clustering) of the data for which elements within the same group are similar and elements associated with different groups are different (by some measure). There are a wide variety of methods and similarity measures that are used to this end, some of which are discussed below. Algorithmically, clustering methods are divided into a number of classes (e.g. hierarchical, partitional, model-based, density-based, grid-based, etc. [16]). More recently, a number of techniques based on spectral methods and graphs [27] have also been developed, as well as methods relying on information theoretic principles. Additionally, clustering may be hard (each pattern belongs to a single group) or soft (each data vector belongs to each cluster with some probability). Hierarchical clustering (common in biological applications) produces a nested family of more and more finely grained groups, while partitional clustering finds a grouping that optimizes (often locally) some objective function. As mentioned above, there are a wide variety of clustering techniques that have been developed; our technique is an iterative partitional method that uses techniques from information theory to formulate the optimization problem and that expects data in the form of a multivariate time series. Additionally, there has been much recent interest in so called biclustering techniques (see below), especially in the context of biological research. This report focuses on methods related to partitional algorithms, and the current work can be seen as a relative of biclustering algorithms that partition data by optimizing some measure of energy or fitness. What is unique to this work is the emphasis on creating a partition specifically for ordered temporal data (i.e.
time series), and characterizing the partition using the language of lossy data compression. In our specific case, we are looking for a windowing or segmentation of a time series dataset into intervals, within each of which we perform a clustering. In this way we achieve a biclustering of our data for which each window of the data is optimally clustered. Partitioning algorithms divide (i.e. partition) the data into some number of clusters such that some measure of the distances between the items in the clusters is minimal while the dissimilarity between the clusters is maximal. The number of clusters is usually specified by the user, but there are techniques for automatically discovering model size (see below). Examples of partitioning algorithms include the popular

k-means and k-medians algorithms. The present work develops an iterative biclustering method that builds on previous partitioning algorithms, optimization techniques, and traditional graph search in order to find a set of partitions in both genes and time. It minimizes an energy term developed using the tools of information theory and results in a set of clusterings for a corresponding set of disjoint temporal windows that cover the dataset and share only their endpoints. Within the computational biology community, clustering has found popularity as a means to explore microarray gene expression data, aiding the researcher in the attempt to locate groups of coexpressed genes (with the hope that coexpression might imply - at least in some circumstances - coregulation). However, this objective is difficult to achieve as genes often show varying amounts of correlation with different sets of genes as regulatory programs execute in time. It is this shortcoming that motivated the development of biclustering within the context of computational biology, and again there has been much work in this area. Part of the problem with many biclustering techniques, however, is that they are computationally complex and they do not take special characteristics of the data into account. Our algorithm is specifically designed to work with time series data and to locate points in the data at which significant process level reorganization occurs. Furthermore, our technique differentiates small, tight clusters from large, loose clusters of less related data elements, an important quality when dealing with biological data. Historically, two important steps in any clustering task are pattern representation (possibly including feature extraction) and the definition of a similarity measure. Indeed these two tasks are related, since the definition of distance between data elements may be seen as implicit feature selection (e.g. Euclidean distance treats distance in any component as equally important). We try to make few assumptions here other than that the data is temporal in nature (i.e. a time series) and that correlation captures proximity between vectors in a manner that we are satisfied with. Specifically, in what follows we present a model-free procedure for time series segmentation that makes no assumptions about the underlying distributions that generate our data. While we do rely on correlation as a measure of similarity, it should be pointed out in advance that we are not wedded to it and that one could choose to use another basis for distortion calculations if one preferred. Finally, it is often the case that clustering procedures require a number of necessary parameters that must be supplied by the user. We have based our time series segmentation procedure on a clustering subprocedure that does not need any such additional inputs. As we will see, our algorithm attempts to search for the best values of such tuning parameters in a natural way.

Precision vs. Complexity

Discussions related to learning from data often begin with descriptions of curve fitting as an example that illustrates the trade-off one must make between precision and the complexity of data representation. If one fits a curve of too high a degree (high complexity representation), one risks over-fitting and an inability to generalize, whereas if one uses too low a degree (low complexity representation), one risks a poor description of the data. In unsupervised learning a similar problem is often encountered.
Determining the appropriate model size and type is difficult enough when the data is labeled, and such considerations become only more significant when one is working without the benefit of such information. The method below sidesteps the issue of model type by making few suppositions about the data, but the question of model complexity remains. How many clusters should one use to describe the data? We explore our specific solution to these problems in the discussion related to model size below, but make a few remarks now to clarify the issues at hand. For our purposes, model complexity generally corresponds to the cardinality of our clustering variable T. The more we compress the data (i.e. the smaller |T| or, as we will see, the lower I(T;X)), the less precision and the more expected distortion will obtain.

Figure 1: An example of the precision complexity trade-off. Which clustering better captures the basic structure of the data? The red clustering achieves more compression (i.e. has lower complexity) but loses its ability to discriminate the subclusters (i.e. has less precision). Conversely, the green clustering has greater precision at the cost of increased complexity. The decision about model size, here one between 2 or 4 clusters, will be discussed below.

Conversely, if T is too large we might be modeling noise and overfitting, which leads to a lack of ability to generalize to new examples. The data analysis solution that we explore in this report allows for a consistent comparison of various descriptions of the data with respect to precision and complexity. In fact, the clustering subprocedure that we use, which is based on various recent extensions to rate distortion theory [10, 29], can be understood as an algorithm that captures just this trade-off, characterizing the problem in terms of similarity of clusters and compactness of representation (model size and amount of compression). In essence, one agrees to this trade-off and then attempts to do the best one can by formulating an objective function whose optimization finds the most compressed representation while striving for precision. This formalism offers a convenient way to define significant transition points in a large set of time series data as locations at which the amount of data compression one can do fluctuates. If one is motivated by a desire to capture biologically significant changes in such sets of data, this formulation is quite beneficial, for it can be used to capture times for which biological processes undergo significant reorganization.

Microarray Data and Critical Time Points

Over the last decade, gene expression studies have emerged as a powerful tool for studying biological systems. With the emergence of whole genome sequences, microarrays (i.e. DNA chips) allow for the simultaneous measurement of a large number of genes, often the whole genome of an organism. When used repeatedly, microarray studies allow one to build up a matrix of expression measurements for which rows correspond to genes' expression levels and columns correspond to experimental conditions or time points at which a sample is taken. Thus, one may make comparisons between rows (vectors of expression

values for various genes), and columns (vectors of different genes' responses at a specific time or under a specific condition). In our study, we are interested in microarray data for which the columns are ordered in time, possibly (but not necessarily) at regular intervals. In this case, the rows are (possibly nonuniform) time series; that is, they are the expression profiles for the individual genes under study and they capture the history of each gene's dynamic behavior across time. A current area of research activity in computational biology is the effort to use such time course microarray data to elucidate networks of gene interaction, that is, to pull from the data a coherent picture of how groups of genes execute in a coordinated fashion across time and how the behavior of one group of genes influences and regulates the behavior of other groups of genes. In our context we are interested in contributing to this general area of research by considering the points in time at which significant reorganization of gene expression activity takes place, for if we can locate these crucial points in time, we can aid biologists in focusing their analysis on the portions of the data that might be the most informative. As indicated above, clustering has proved to be a powerful tool for data analysis and continues to be an active area of research. However, when applied to microarray data, conventional clustering is somewhat limited. The problem derives from the fact that when analyzing a microarray data matrix, conventional clustering techniques allow one to cluster genes (rows) and thus compare expression profiles, or to cluster conditions (columns) and thus compare experimental samples, but are not intended to allow one to accomplish both simultaneously. Often this becomes a problem, especially when one is attempting to track the development of groups of genes over time, that is, when the rows of the data matrix may be viewed as multivariate time series. In this case, biological intuition would suggest that as biochemical programs execute, various groups of genes would flow in and out of correlation with each other. That is, one would expect genes to show correlation with certain genes during some periods of time, and with other genes during other periods. Additionally, there might be times when a gene's expression might not show a high degree of similarity with any other identifiable group of genes. For this reason, simply clustering genes across conditions (time points) does not make sense, as one would like to capture this dynamic aspect of the data. Moreover, one might even be explicitly interested in identifying the times at which these critical points of gene expression reorganization take place. Locating such critical time points and understanding the gene activity related to them might shed light on network level arrangements and processes that are too difficult to discern when looking at all of the time points simultaneously.

Biclustering and Biological Data

Recently progress has been made on some of the limitations of applying clustering to microarray data analysis. Specifically, so called biclustering algorithms have been introduced that aim to find a clustering simultaneously in both the rows (genes) and columns of a data matrix. These techniques locate submatrices in the data for which subsets of genes exhibit correlated activity across subsets of conditions. There has been much research in this area in the recent past and several excellent reviews compile surveys of this work [19].
There are a substantial number of approaches to biclustering that result in various types of clustered data. Much of the work has centered on finding biclusters in microarray data for which the conditions are not necessarily ordered. We are interested in a specific type of clustering of our temporally ordered data, one that respects the ordering of the columns and that searches for blocks of time in which coordinated gene activity takes place. One assumption that we are working under is that the signals in our biological data show varying compression across points of critical reorganization. Here we are using compression in the technical sense found in the communication theory literature (as discussed below). While there has been some work concerned with finding biclusters in time series data (e.g. [34]), a biclustering algorithm that finds clusters of concerted gene activity within temporal windows that are optimal in the objective just

mentioned (i.e. data compression) has not, to our knowledge, been investigated. Our interest is in a specific constrained biclustering problem for which the order of adjacent time points is respected. We have the twin objectives of clustering the data within temporal windows and deducing the correct window endpoints (i.e. critical time points). Thus, we offer a biclustering algorithm for time series microarray data that locates clustered gene activity in regions of time (i.e. windows) that are optimal in terms of the total amount of data compression that may be responsibly done on the data. Our biclustering produces clusters of genes in each of a number of disjoint temporal windows that partition the data in time. The windows are optimal in the amount of compression one can do on the underlying biological signals (i.e. expression profiles). We will clarify and make explicit these concepts in the discussion below.

Figure 2: A simple example of entropy for the binary case. It is a plot of the binary entropy H_2(x) = x \log \frac{1}{x} + (1-x) \log \frac{1}{1-x} as a function of x.

Theory

Information theory [10] is the standard tool for discussing data compression in a quantitative manner. Recently [32, 29], a novel approach to clustering that relies on information theory to characterize the precision vs. complexity trade-off has been introduced and applied to various data types (e.g. neural spike trains, microarray expression data, text). We will see that a method similar to these is suitable for use as a subprocedure for our time series segmentation algorithm, as it offers a measure of goodness that includes notions of both data compression and clustering fitness. First, however, we must introduce some basic facts, definitions and concepts from the field of information theory.

Entropy

Information in the abstract is a difficult concept to define precisely. In the seminal 1948 paper [26] that both defined and solved many of the basic problems related to information theory, Claude Shannon defined the notion of entropy, which captures much of what is usually meant by information:

H(X) \equiv -\sum_x p(x) \log p(x) = \sum_x p(x) \log \frac{1}{p(x)}    (1)

Entropy may be regarded as a measure of uncertainty (i.e. information) related to an event (or signal). It measures the expected number of bits (binary digits) required on average to describe the signal. In the context of data compression, the entropy is known to be the expected limit of lossless compression; that is, given a random variable X, one cannot generally compress X past H(X) without loss. The definition above makes accessible a clear development of the concept of entropy based on the notion of surprise. On an intuitive level, one would want a definition of information to capture what we experience subjectively as

surprise. That is, suppose one receives a message but one already knows the message's text. In this case there is no information transmitted via the message, as the receiver already knows what is being sent. Similarly, imagine observing a coin flip. If the coin is unbiased, with a 50 percent chance of coming up heads, then the coin flip is generally more surprising than a flip of a biased coin that has a 99 percent chance of coming up heads. There is more information conveyed by a flip of an unbiased coin than there is by a flip of the biased coin. (In fact, it can be shown that an unbiased coin maximizes the entropy (or information) related to the coin flip, and that the uniform distribution generally maximizes the entropy related to an event.) We can see then that we would like the information content of the coin toss to correspond to the amount of surprise one should expect upon the event taking place. But how does one characterize a surprising event? Generally, all other things being equal, one is more surprised when an event with a small probability takes place than when an event that is very likely happens. Thus, we want the information content of an event to be proportional to the inverse of the likelihood of the event, 1/p(x); that way, the less likely the event, the more information is conveyed. Taking the expectation of the log of this quantity yields Shannon's concept of entropy H(X), which is intended to measure the information content of a random variable X. Note that entropy as defined here is a functional, that is, a function of the distribution over X, rather than a function of a variable. One could also write H[p(x)] to emphasize this point. Additionally, one may define the entropy of two (or more) random variables, also known as the joint entropy:

H(X,Y) \equiv -\sum_{x,y} p(x,y) \log p(x,y) = \sum_{x,y} p(x,y) \log \frac{1}{p(x,y)}    (2)

The joint entropy is the uncertainty (or information) associated with a set of random variables (in the above case two). Finally, the conditional entropy is the expected uncertainty (or information) associated with one random variable T, given that we know the value of another random variable X. That is:

H(T|X) \equiv \sum_x p(x) H(T|X=x) = -\sum_x p(x) \sum_t p(t|x) \log p(t|x)    (3)

These definitions are natural and follow the chain rule, that is, the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other: H(X,T) = H(X) + H(T|X).

KL Divergence and Mutual Information

The relative entropy or Kullback-Leibler divergence (KL divergence) is a measure of the distance between two probability distributions; it is the expected logarithm of the likelihood ratio and is defined:

D[p \| q] \equiv \sum_x p(x) \log \frac{p(x)}{q(x)}    (4)

One can immediately see that if p(x) = q(x) for all x, then D[p \| q] = 0. One can use the relative entropy to define yet another information measure called the mutual information. The mutual information is a measure of the amount of information that one random variable contains about another, and is defined:

I(X;Y) \equiv \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}    (5)
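To make these definitions concrete, the short Python sketch below computes the entropy, joint entropy, KL divergence and mutual information for a small hypothetical joint distribution and checks identity (6) given in the next section; the particular numbers are illustrative only and do not come from any dataset discussed in this report.

```python
import numpy as np

def entropy(p):
    """H = -sum p log2 p, in bits; zero-probability entries contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """D[p||q] = sum p log2(p/q); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# A small hypothetical joint distribution p(x, y) over 2 x 3 outcomes.
pxy = np.array([[0.30, 0.10, 0.10],
                [0.05, 0.25, 0.20]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(pxy.ravel())

# Mutual information as the KL divergence between p(x,y) and p(x)p(y), equation (5).
I_XY = kl_divergence(pxy.ravel(), np.outer(px, py).ravel())

# Identity (6): I(X;Y) = H(X) + H(Y) - H(X,Y).
assert np.isclose(I_XY, H_X + H_Y - H_XY)
print(f"H(X)={H_X:.3f}  H(Y)={H_Y:.3f}  H(X,Y)={H_XY:.3f}  I(X;Y)={I_XY:.3f} bits")
```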

Figure 3: A pictorial representation of the various information quantities defined so far. Adapted from [10].

With the help of a little algebra it is not hard to show [10] that the following identities hold as well:

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)    (6)

Thus, mutual information is the relative entropy between the joint distribution and the product of the marginal distributions. It is symmetric and always greater than or equal to zero [10]. It is equal to the uncertainty of one random variable left over after subtracting the conditional entropy with respect to another random variable. In some sense, mutual information seems even closer than entropy to our colloquial notion of information, since in most cases we speak of one thing containing information about another rather than just information in the abstract. This idea of shared information is exactly what mutual information formalizes, and its role in what follows is crucial. The diagram in Figure 3, adapted from [10], is a useful pictorial representation of the information measures that we have defined so far.

Rate Distortion Theory

In looking for the appropriate formalism to characterize our time series segmentation problem, it is useful to review rate distortion theory (RDT). Traditionally, RDT has been the main tool that information theorists use to address lossy compression in a rigorous manner. Given that clustering can be viewed as a form of lossy compression, and since the main component of our method is an information based clustering algorithm, it makes sense to review RDT and build on it as necessary. We will see that various recent extensions to RDT form the heart of our method and provide a powerful framework that we may use to attack our specific biclustering problem.

In rate distortion theory [10], one desires a compressed representation T of a random variable X that minimizes some measure of distortion between the elements x ∈ X and their prototypes t ∈ T. Taking I(T;X), the mutual information between T and X, to be a measure of the compactness or degree of compression of the new representation, and defining a distortion measure d(x,t) that measures distance between cluster prototypes and data elements (traditionally in terms of Euclidean distance), one can frame this problem as a trade-off between compression and average distortion. The main idea is that one balances the desire to achieve a compressed description of the data with the precision of the clustering, as measured by the average distortion, and strikes the appropriate balance that maintains enough information while eliminating noise and inessential details. In rate distortion theory, this trade-off is characterized mathematically with the rate distortion function R(D), which is the minimal achievable rate under a given constraint on the expected distortion:

R(D) \equiv \min_{\{p(t|x) : \langle d(x,t) \rangle \le D\}} I(T;X)    (7)

where the average distortion is defined to be:

\langle d(x,t) \rangle = \sum_{x,t} p(x) p(t|x) d(x,t)    (8)

and is simply the weighted sum of the distortions between the data elements and their prototypes. To find R(D), we introduce a Lagrange parameter β for the constraint on the distortion, and solve the variational problem:

F_{\min}[p(t|x)] = I(T;X) + \beta \langle d(x,t) \rangle_{p(x)p(t|x)}    (9)

This functional captures the compression-precision trade-off and allows one to use an iterative method, based on Blahut-Arimoto [11, 4, 6], to calculate points on R(D). The solution to this problem [10],

\frac{\partial F}{\partial p(t|x)} = 0    (10)

under the constraints \sum_t p(t|x) = 1 for all x ∈ X, has the form:

p(t|x) = \frac{p(t)}{Z(x,\beta)} \exp(-\beta d(x,t))    (11)

where Z(x,β) is a partition function, and the Lagrange multiplier β is positive and determined by the upper bound on the distortion D:

\frac{\partial R}{\partial D} = -\beta    (12)

That is, the slope of the rate-distortion curve is -β. This is an implicit solution (p(t) depends on p(t|x)) and is defined for a fixed set of prototypes. Different prototypes will change the solution obtained and for this reason selecting the correct prototypes is an important question. The joint optimization over cluster assignments p(t|x) and prototypes is in general more difficult to solve and does not have a unique solution. One can see from (11) that if the expected distance between a data element x ∈ X and a prototype t ∈ T is small, then the cluster assignment p(t|x) will be large for that pair and x will be assigned to the cluster with centroid t. However, choosing these centroids so that one achieves optimal compression is a more complicated task and rate distortion theory unfortunately does not provide the solution.
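The following Python sketch illustrates how points on R(D) can be traced for a fixed set of prototypes by alternating the update (11) with the recomputation of the cluster marginal p(t). The toy source, prototypes and squared-error distortion are assumptions made purely for illustration; they are not part of the method developed in this report.

```python
import numpy as np

def blahut_arimoto(px, d, beta, n_iter=200):
    """Trace one point on R(D) for fixed prototypes: alternate the update
    p(t|x) ~ p(t) exp(-beta d(x,t)) with p(t) = sum_x p(x) p(t|x)."""
    n_x, n_t = d.shape
    pt = np.full(n_t, 1.0 / n_t)                  # initial cluster marginal
    for _ in range(n_iter):
        w = pt * np.exp(-beta * d)                # unnormalized p(t|x), shape (n_x, n_t)
        pt_x = w / w.sum(axis=1, keepdims=True)   # normalize by Z(x, beta)
        pt = px @ pt_x                            # new marginal p(t)
    joint = px[:, None] * pt_x
    rate = np.sum(joint * np.log2(pt_x / pt))     # I(T;X) in bits
    distortion = np.sum(joint * d)                # <d(x,t)>
    return rate, distortion, pt_x

# Hypothetical toy problem: 6 source symbols, 2 prototypes, squared-error distortion.
x = np.linspace(0.0, 1.0, 6)
prototypes = np.array([0.2, 0.8])
d = (x[:, None] - prototypes[None, :]) ** 2
px = np.full(len(x), 1.0 / len(x))

for beta in (1.0, 10.0, 100.0):
    R, D, _ = blahut_arimoto(px, d, beta)
    print(f"beta={beta:6.1f}  I(T;X)={R:.3f} bits  <d>={D:.4f}")
```

As β increases, the returned (rate, distortion) pairs move up and to the left along the curve of Fig. 5: less expected distortion at the price of less compression.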

Figure 4: An example of the Blahut-Arimoto alternating minimization, in this case between two convex sets A and B in R^2 under the Euclidean distance d(a,b). Since the minimization is of a convex function over convex sets, the algorithm is guaranteed to converge to the global minimum regardless of starting conditions.

We can calculate R(D) using an iterative technique based on Blahut-Arimoto [11, 4, 6], in which we consider two convex sets and a convex distance function over them that is simultaneously convex in both of its arguments. We alternately minimize the distance between points chosen from the two sets, which has been shown to converge to the global minimum. An illustration of the procedure is provided in Fig. 4. In the specific case of calculating the rate distortion functional, we define two sets:

1. A, the set of all joint distributions p(t,x) with marginal p(x) such that \langle d(x,t) \rangle \le D
2. B, the set of product distributions p(t)p(x) with normalized p(t)

We can then reformulate the rate distortion functional R(D) as the double minimization of the KL divergence between elements chosen from these sets:

R(D) = \min_{a \in A} \min_{b \in B} D_{KL}[a \| b]    (13)

We can rewrite R(D) in this way because it can be shown that at the minimum, this KL divergence D_{KL}[p(x)p(t|x) \| p(x)p(t)] equals I(T;X); thus the D_{KL} bounds the information, with equality when p(t) equals the marginal \sum_x p(x) p(t|x).

Figure 5: A typical rate-distortion curve of I(T;X) versus <d(x,t)>, illustrating the trade-off between compression and average distortion; the region above the curve is the achievable region, and the slope of the curve is parameterized by β, running from β → ∞ to β → 0. One can see that in order to achieve high compression (small I(T;X)), a larger upper bound on the expected distortion must be used.

One can apply Blahut-Arimoto to the sets A and B and the R(D) of (13). This allows one to fix β which, in turn, fixes the upper bound on the distortion D. We then pick a point in B and minimize the KL divergence R(D), thus determining a point a ∈ A. We subsequently minimize the KL divergence again, this time holding our point a fixed and generating a new point b ∈ B. We iterate this until the algorithm converges to a limit, which is guaranteed by [11]. Carrying out this procedure for various β values allows one to trace out an approximation to the rate distortion curve R(D). An example of such a curve can be seen in Fig. 5. The points above the curve are the possible rate-distortion pairs, that is, they correspond to achievable amounts of compression for various upper bounds on the average distortion. We call this the achievable region. The parameter β is related to the derivative of the rate-distortion function, and as one changes the value of β, one traces out the entire curve R(D).

Information Based Clustering

From the discussion of rate distortion theory above, it is clear that one would like to have a formulation of the clustering problem that involves only relations between data elements, rather than prototypes. This would allow one to sidestep the thorny issue of how to correctly choose the cluster centers, which is one of the major drawbacks of conventional RDT. The information based clustering of [29] is just such a clustering scheme. Information based clustering is a method that is similar in many respects to RDT but that makes modifications to the distortion term that result in a number of important gains. The functional that characterizes information based clustering looks deceptively similar to RDT, but the distortion term masks an important difference. To perform information based clustering one minimizes the functional:

F_{\min} = I(T;X) + \beta \langle d_{info} \rangle    (14)

This method replaces the <d> term in the RDT functional with an overall measure of distortion <d_info> that is defined only in terms of pairwise relations between data elements (rather than relations between data elements and prototypes). Here again β serves as a parameter controlling the trade-off between compression and precision, and sets the balance between the number of bits required to describe the data and the average distortion between the elements within the data partitions. In information based clustering, <d_info> is defined as the average distortion taken over all of the clusters:

\langle d_{info} \rangle = \sum_{i=1}^{N_c} p(t_i) d(t_i)    (15)

where N_c is the number of clusters (i.e. |T|) and d(t) is the average (pairwise) distortion between elements chosen out of cluster t:

d(t) = \sum_{i=1}^{N} \sum_{j=1}^{N} p(x_i|t) p(x_j|t) d(x_i, x_j)    (16)

In the above, d(x_1, x_2) is a measure of distortion between two elements in a cluster (this could instead be a measure for m > 2 elements, or some more complicated measure of multi-way distortion between elements). In our present case, we use a pairwise distortion measure (defined below) based on correlation. The central idea is that one wants to choose the probabilistic cluster assignments p(t|x) such that the average distortion <d_info> is minimized, while simultaneously performing compression. This is accomplished by constraining the average distortion term <d_info> and minimizing the mutual information between the clusters and the data, I(X;T), over all probability distributions p(t|x) that satisfy the constraint on the compression level. The crucial difference between this method and RDT is located in the average distortion terms. For the example of pairwise clustering, we can easily see the difference. In RDT the average pairwise distortion is defined as:

\langle d_{RDT\,pair} \rangle = \sum_{i=1}^{N_c} p(t_i) \sum_{j=1}^{N} p(x_j|t_i) d(x_j, t_i)    (17)

where the prototype t_i (the cluster centroid) is calculated by averaging over the elements in a single cluster:

t_i = \sum_{k=1}^{N} p(x_k|t_i) x_k    (18)

whereas in information based clustering the average distortion is defined as:

\langle d_{Info\,pair} \rangle = \sum_{i=1}^{N_c} p(t_i) \sum_{j=1}^{N} \sum_{k=1}^{N} p(x_j|t_i) p(x_k|t_i) d(x_j, x_k)    (19)

The important thing to recognize is that in \langle d_{RDT\,pair} \rangle the sum over k takes place before the call to d(x_j, t_i), in the sense that the prototypes are calculated by averaging over members in the cluster as in equation (18). However, in \langle d_{Info\,pair} \rangle the sum over k is outside of the call to d(x_j, x_k). Thus, in RDT the distortion is pairwise between data elements and prototypes, whereas in information based clustering we have eliminated any reference to prototypes and only consider pairwise distortions between data elements. For our purposes, the most important aspects of characterizing clustering in the above way are that there are explicit numerical measures of the goodness of the clustering (i.e. the average distortion <d>) as well as of the trade-off captured in the functional value. We can make use of these values to perform a

segmentation of our time series data such that we produce a series of time windows that capture transitions between major stages in the data or interesting events. As in traditional rate distortion theory, in information based clustering one computes updates to the matrix of conditional probabilities p(t|x) (i.e. the cluster assignments) by using an iterative procedure that calculates a Boltzmann distribution. Again, this method is based on Blahut-Arimoto, and the form of the distribution is found by differentiating the clustering functional and setting it equal to zero. A proof for our case is provided below; however, one should note that the form of the distribution contains an important difference that distinguishes it from traditional rate distortion theory. The form of the distribution is:

p(t|x) = \frac{p(t)}{Z(x,\beta)} \exp(-\beta (d(t) + 2 d(x,t)))    (20)

This form is for a pairwise distortion measure and differs from (11) above in that it contains an additional term d(t) (the average distortion for a cluster t), as well as a factor of 2 in the exponent. This form is a result of the differences in the original clustering functional and it adds an important notion of cluster tightness to the cluster assignment updating function. That is, tighter clusters (with low average distortion) are more desirable than diffuse clusters (high average distortion), and the clustering should try to produce clusters with low average pairwise distortion.

Time Series Segmentation

Given a set of time series gene expression data, we want to determine a sequence of windows in the dataset that capture important aspects of the temporal regulation of the sampled genes. We define a window W_{t_s}^{t_e} as a set of consecutive time points beginning at time point t_s and ending at time point t_e. Given a time series dataset with time points T = {t_1, t_2, ..., t_n}, the task is to segment the time series into a sequence of windows {W_{t_1}^{t_a}, W_{t_a}^{t_b}, ..., W_{t_k}^{t_n}} such that each window represents some unique temporal aspect of the data. Note that adjacent windows meet at their boundary points but do not overlap. This problem is basically a special case of the biclustering problem discussed above; that is, we desire a biclustering that maintains the correct ordering of the elements in time but that finds clusters of data elements that are similar within informative temporal intervals. In the end, we have a number of windows, each with its own set of clusters. The clusters in each window are composed from the data subvectors that correspond to each window. The goal is to find the optimal such windowing that results in the maximal amount of data compression while preserving the important features in the data. The start and end points of such windows (i.e. t_s and t_e) correspond to points in the time series dataset where significant reorganization among genes has occurred. We would like to highlight such points in time, where the amount of compression changes significantly, for further investigation into the underlying biology. We have attempted to create a method that relies on as few external parameters as possible while retaining flexibility. Thus, if one happens to have a good guess for the model size or temperature (i.e. β) parameters, then such values can be supplied. If no reasonable guess exists, we attempt to locate good values automatically (at additional cost in the running time). The one input that must be given, of course, is the distortion matrix that describes how similar various pairs of data elements are.
We discuss the construction of this input below.
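As a concrete illustration of the clustering subprocedure just described, the sketch below iterates the self-consistent update (20) for a given pairwise distortion matrix, with a uniform weight p(x) on the data elements. It is a minimal reading of equations (14)-(16) and (20), not the implementation used for the experiments reported here, and the random distortion matrix in the usage lines is purely hypothetical.

```python
import numpy as np

def info_clustering(D, n_clusters, beta, n_iter=200, seed=0):
    """Iterate the self-consistent update (20) on a pairwise distortion matrix D (N x N):
    p(t|x) is proportional to p(t) exp(-beta * (d(t) + 2 d(x,t)))."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    px = np.full(N, 1.0 / N)                           # uniform weight on data elements
    pt_x = rng.dirichlet(np.ones(n_clusters), size=N)  # random soft assignments, shape (N, n_clusters)
    for _ in range(n_iter):
        pt = px @ pt_x                                 # cluster marginals p(t)
        px_t = pt_x * px[:, None] / pt                 # p(x|t) by Bayes' rule
        d_xt = D @ px_t                                # d(x,t) = sum_j p(x_j|t) d(x, x_j)
        d_t = np.sum(px_t * d_xt, axis=0)              # d(t), as in equation (16)
        w = pt * np.exp(-beta * (d_t + 2.0 * d_xt))    # numerator of update (20)
        pt_x = w / w.sum(axis=1, keepdims=True)        # divide by the partition function Z(x, beta)
    pt = px @ pt_x
    px_t = pt_x * px[:, None] / pt
    d_t = np.sum(px_t * (D @ px_t), axis=0)
    avg_d = float(pt @ d_t)                            # <d_info>, equation (15)
    rate = float(np.sum(px[:, None] * pt_x * np.log2(pt_x / pt)))  # I(T;X)
    return pt_x, rate, rate + beta * avg_d             # assignments, I(T;X), functional (14)

# Hypothetical usage with a random symmetric distortion matrix over 30 elements.
rng = np.random.default_rng(1)
profiles = rng.random((30, 5))
D_toy = np.abs(profiles[:, None, :] - profiles[None, :, :]).mean(axis=-1)
assignments, I_TX, F = info_clustering(D_toy, n_clusters=3, beta=5.0)
print(f"I(T;X)={I_TX:.3f} bits  F={F:.3f}")
```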

Distortion Measure Details

To create our distortion matrix, we used a pairwise similarity measure that is common in gene expression studies, the Pearson correlation coefficient [12]. While it has well known deficiencies (e.g. a lack of ability to capture nonlinear relations between the profiles that it is comparing), it also has various strengths, including its ability to function well as a measure of similarity between profiles that have a small number of points. Our approach is to form the distortion matrix directly from the values of the correlation coefficient:

d(i,j) = 1 - \frac{1}{N_p} \sum_{n=1}^{N_p} \frac{(X_{in} - \bar{X}_i)(X_{jn} - \bar{X}_j)}{S_{X_i} S_{X_j}}    (21)

where N_p = |X_i| = |X_j| is the number of points in each profile and S_X = \sqrt{\frac{1}{N_p} \sum_{n=1}^{N_p} (X_n - \bar{X})^2} is the standard deviation of X. We can calculate the (pairwise) distortion matrix based on the correlation coefficients and feed this input into the clustering subprocedure of our time series segmentation algorithm. Here the values in the distortion matrix take 0 if the vectors are perfectly correlated and 2 if the vectors are perfectly negatively correlated. If an objective measure of clustering goodness is required within the windows, one may measure the coherence [25] with respect to Gene Ontology terms; this gives us a good idea of how well the basic algorithm partitions the data with respect to an external qualitative grouping (we discuss this in further detail below). Our notion of biological similarity is derived from the labels given in the Gene Ontology [5], and can be added to the distortion measure to augment distortion based purely on correlation of time series profiles. This allows us to validate our method in the manner common to gene expression clustering (i.e. by measuring coherence) and then to use these same ontology annotations to allow our algorithm to cluster based on both correlation of time series profiles as well as prior knowledge about the functional characteristics of the gene products themselves. Thus, we capitalize on the annotations provided by biological specialists as well as on the underlying characteristics of the data, and work toward automating a process that is ordinarily accomplished by hand (namely, choosing genes with specific known function and clustering around them to find potential functional partners). Based on these concepts, we have begun to experiment with another related idea: namely, using an additional labeling (e.g. GO terms) in the clustering algorithm itself. Future work will include the construction of a similarity matrix that takes both correlation as well as proximity on the graph of GO terms into account. Initial experiments have included taking a weighted sum of distortion matrices, where one is a pairwise correlation matrix M_p with entries defined as in (21) above, and the other is a matrix M_g with entries that correspond to how similar two genes' ontology labels are. Here both M_p and M_g have N rows (where N is the number of genes under consideration) and N columns. An entry e_{ij} in matrix M_g is in the interval [0,1] and takes on values closer to one the more the corresponding GO terms are shared between g_i and g_j in the ontology. The entry is zero if no terms are shared. When using this strategy, we create a distortion matrix by using a weighted combination of the above matrices, M_s = a M_p + (1-a) M_g with a ∈ [0,1], and use M_s as the input to our clustering method. In fact, this method is quite general and can be used to add any type of prior similarity information we like to the algorithm.
The difficulty here, of course, relates to choosing the relative weights on the various matrices appropriately and deciding how to weigh the contributions of the various ontology terms in M_g (i.e. more specific labels should count more than extremely general ones). The flexibility of the distortion term also allows for prototypes to be selected by the user, thus forcing clusters to consolidate around specific profiles. This is useful if the researcher is interested in a single well-understood gene and wishes to find out what other genes might be related to it. In such a case, one would simply define a measure of pairwise distance that relied on how far apart genes' profiles were from some third target profile.
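A minimal sketch of the distortion matrix construction follows: equation (21) computed for a hypothetical expression matrix, with the optional blend M_s = a*M_p + (1-a)*M_g described above. The placeholder M_g here merely stands in for an ontology-derived matrix; building the real one is the subject of the future work just mentioned.

```python
import numpy as np

def correlation_distortion(X):
    """Pairwise distortion d(i,j) = 1 - Pearson correlation between rows i and j of X,
    as in equation (21): 0 for perfectly correlated profiles, 2 for anti-correlated ones."""
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    return 1.0 - (Z @ Z.T) / X.shape[1]

# Hypothetical expression matrix: 50 genes sampled at 12 time points.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))
M_p = correlation_distortion(X)

# Optional blending with prior knowledge, M_s = a*M_p + (1-a)*M_g, as described above.
# M_g is only a placeholder here; in practice it would encode GO-term similarity.
M_g = np.eye(50)
a = 0.7
M_s = a * M_p + (1.0 - a) * M_g
```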

Previous work [29] has shown that using information measures (e.g. mutual information) to characterize distortion works well in practice. In our case we have steered away from this approach due to the short lengths of the time windows we would like to consider. With windows as short as four or five time points, estimating probability distributions well enough to calculate mutual information becomes too error prone.

Model Selection and Model Fitting

In addition to the question of how to generate the initial distortion matrix, there are also choices to be made about both the trade-off parameter β and the underlying model complexity (i.e. the number of clusters N_c). Although these questions have been explored to some degree [23], including in the context of the information bottleneck [30], we use a straightforward approach that favors simplicity and ease of interpretation in terms of rate-distortion curves. That is, we perform rudimentary model selection by iterating over the number of clusters while optimizing (via line search) over β. This procedure, while somewhat expensive, results in a fairly complete sampling of the rate-distortion curves (i.e. the plots of I(X;T) vs. <d>) at various resolutions. Essentially, we trace the phase transitions (corresponding to different numbers of clusters) while tuning β and choose the simplest model that achieves minimal cost (and maximal compression) as measured by the target functional. In this way, by optimizing the target functional over β and the number of clusters, we obtain for each window a score that is the minimum cost in terms of model size and model fit, based on the trade-off between compression and precision. Obviously, for this method, run times can be substantial, and for this reason we have developed an implementation that can take advantage of parallel hardware if it is available. We have used the Message Passing Interface [13] to provide a parallel implementation on a cluster of machines. This offers the opportunity to decompose the larger problem into a set of clustering tasks to be performed on multiple machines and consolidated during the final stages of execution. One aspect of the problem as we have formulated it above is worth mentioning here, namely the relationship between β and the clustering solution produced by the clustering algorithm. We have stated that β parameterizes R(D) and controls the trade-off between information preservation and compression. As β goes to 0, we focus on compression (in the limit we find just a single cluster with high distortion). Alternatively, as β goes to infinity, we focus on eliminating expected distortion (at the cost of increased mutual information). Thus, if we know before we run the algorithm that we would prefer a very compressed representation of the data, we can set β accordingly. Similarly, if we know that we want to concentrate on minimizing distortion we can do that as well. We do not have to exhaustively search across β if we know what kind of solution we are looking for in advance, but if we want to try to determine the best possible minima, optimizing over this parameter is a reasonable task.

Graph Search for Optimal Windowing

Let T = {t_1, t_2, ..., t_n} be the time points at which a given time series dataset is sampled, and let l_min and l_max be the minimum and maximum window lengths respectively. For each time point t_a ∈ T, we define a candidate set of windows starting from t_a as S_{t_a} = {W_{t_a}^{t_b} : l_min < t_b - t_a < l_max}.
Each of these windows may then be clustered and labeled with a score based on its length and the cost associated with the value of the clustering functional. Following scoring, we formulate the problem of finding the lowest cost windowing of our time series in terms of a graph search problem and use a shortest path algorithm to generate the final set of (non-overlapping) time windows that fully cover the original series. To score the windows, we use a variant of the information based clustering procedure described above. We want to maximize compression (by minimizing the mutual information between the clusters and data elements), while at the same time forcing our clusters to have minimal distortion.
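The scoring of a single candidate window might look like the sketch below, which minimizes the functional over a small set of model sizes and a coarse grid of β values standing in for the line search described earlier. Here cluster_fn is assumed to behave like the info_clustering sketch given above; the edge weight used in the graph search is then this best score multiplied by the window length, mirroring Figure 6.

```python
import numpy as np

def score_window(D_window, cluster_fn,
                 model_sizes=(2, 3, 4, 5), betas=np.linspace(0.5, 20.0, 8), restarts=3):
    """Score one candidate window: minimize F = I(T;X) + beta*<d_info> over model size
    and beta, with a few random restarts per setting. cluster_fn(D, n_clusters, beta, seed)
    is assumed to return (assignments, rate, F), e.g. the info_clustering sketch above."""
    best_F, best_setting = np.inf, None
    for n_c in model_sizes:                  # iterate over candidate model sizes
        for beta in betas:                   # coarse grid standing in for the line search over beta
            for seed in range(restarts):     # multiple initializations; keep the best convergence
                _, _, F = cluster_fn(D_window, n_c, beta, seed=seed)
                if F < best_F:
                    best_F, best_setting = F, (n_c, beta)
    return best_F, best_setting

# The edge weight for the window from time point a to b would then be
# (b - a) * best_F, as in the example of Figure 6.
```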

Figure 6: A portion of an example of the graph of weighted free energies, the output of the segmentation procedure. Edges are labeled with the clustering functional values weighted by window lengths, e.g. E_{5,17} = (17 - 5) * min F_{5,17} for the edge from t_5 to t_17, and E_{6,10} = (10 - 6) * min F_{6,10} for the edge from t_6 to t_10. We use Dijkstra's algorithm to search for the minimum cost path through the graph (in terms of the weighted free energy). In this way we find the lowest cost windowing of our data from the first time point to the last.

In such a framework, the measure of distortion is left up to the user, and while in the past performance has been studied using a distortion term based on information estimation [28], we chose (due to the small lengths of our windows and the difficulty of accurately estimating mutual information between short sections of time series) to use a measure of similarity based on the Pearson correlation coefficient that is common in gene expression studies [12], and to augment it with optional terms that measure similarity based on biological characteristics (as described above). Once these scores are generated, we pose the problem of finding the lowest cost tiling of the time series by viewing it as a graph search problem. We consider a graph G = (V,E) for which the vertices represent time points, V = {t_1, t_2, ..., t_n}, and the edges represent windows with associated scores, E = {W_{t_a}^{t_b}} (see Fig. 6). The fully connected graph has n vertices (t_1, ..., t_n) and \binom{n}{2} edges, one between each pair of vertices. Each edge e_{ab} ∈ E represents the corresponding window W_{t_a}^{t_b} from time point t_a to time point t_b, and has an initially infinite (positive) cost. The edges are then labeled with the costs for the windows they represent, taken from the scores for the {S_{t_i}} computed earlier; each edge cost is set to (F_{ab} * length), where F_{ab} is the minimum cost found by the information based clustering procedure and length is the length of the window, (b - a). The edge weights are computed using a function that iterates over the number of clusters, optimizes over β, and computes a numerical solution to equation (14) in an inner loop that tries multiple initializations and chooses the one that converges to the best cost. This algorithm is depicted in Fig. 7. In this way, we simultaneously label and prune the graph, because edges that correspond to windows of illegal length are left unlabeled and their costs remain infinite, while edges with finite cost are labeled appropriately. Our original problem of segmenting the time series into an optimal sequence of windows can now be formulated as finding the minimal cost path from the vertex t_1 to the vertex t_n. The vertices on the path with minimal cost represent the points at which our optimal windows begin and end. We may apply Dijkstra's shortest path algorithm to generate our set of optimal windows. We use the shortest path algorithm to generate a windowing that covers all of our original time points in a disjoint fashion and, as such, segments our original time series data into a sequence of optimally selected windows which perform maximal compression in terms of the information based clustering cost functional. One thing to note is that if one desired to provide a set number of clusters or a specific β based on some prior knowledge, one may easily do so. See Fig. 7 for a complete description of the segmentation algorithm in pseudocode.

Algorithmic Complexity

Dijkstra's algorithm is a graph search method with a worst case running time of O(n^2) for a graph with n vertices. The clustering procedure used to score the windows is O(N^3 N_c), where N is the number of time points in the window and N_c is the number of clusters. One can see this by noting that an outer loop of size N iterates over the rows of the conditional probability matrix and updates each entry (one for each of the N_c columns).
Each update is of order N^2, since a call to the Boltzmann procedure, which generates the entries p(t|x) in the matrix, must compute d(t), the average distortion of cluster t, which contains two summations over N elements. This clustering procedure is nested in a loop that iterates over a small number of model sizes (an O(1) number, constant in N) and a line search over potential values for β, also an O(1) operation. The clustering procedure is run for each window of legal length; there are n^2/2 of these in the case of no restrictions on length. Creating the distortion matrix requires that we calculate the correlation coefficient for each of the N^2 entries of the matrix, where N is the number of genes in the dataset (distortion measures beyond pairwise would require many more computations). The graph search and distortion matrix creation complexity are dominated by the iterated clustering, with its O(N^5 N_c) cost.
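To tie the pieces together, the sketch below builds the window graph and runs Dijkstra's algorithm to recover a minimum-cost disjoint windowing. Here window_cost is assumed to supply the length-weighted clustering score for each candidate window (as in the scoring sketch above), and the toy cost in the usage line is purely illustrative, standing in for the real clustering functional values.

```python
import heapq

def segment_time_series(n_timepoints, window_cost, l_min=4, l_max=12):
    """Find a minimum-cost disjoint windowing of time points 0 .. n_timepoints-1.
    window_cost(a, b) is assumed to return the length-weighted clustering score for
    the window spanning time points a..b; windows with lengths outside [l_min, l_max]
    are never added to the graph (their cost stays effectively infinite)."""
    last = n_timepoints - 1
    edges = {a: [] for a in range(n_timepoints)}
    for a in range(n_timepoints):
        for b in range(a + l_min, min(a + l_max, last) + 1):
            edges[a].append((b, window_cost(a, b)))   # edge (a, b) represents window W_a^b

    # Dijkstra's shortest path from the first time point to the last.
    dist = [float('inf')] * n_timepoints
    prev = {}
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, a = heapq.heappop(heap)
        if d > dist[a]:
            continue
        for b, w in edges[a]:
            if d + w < dist[b]:
                dist[b], prev[b] = d + w, a
                heapq.heappush(heap, (d + w, b))

    # Walk back from the last time point to recover the window boundaries
    # (the candidate "critical time points").
    boundaries, t = [last], last
    while t != 0:
        t = prev[t]
        boundaries.append(t)
    return boundaries[::-1]

# Purely illustrative usage with an arbitrary toy cost in place of the clustering score.
toy_cost = lambda a, b: (b - a) * (1.0 + 0.05 * abs((a + b) / 2.0 - 10.0))
print(segment_time_series(21, toy_cost))
```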


Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

1. Lecture notes on bipartite matching

1. Lecture notes on bipartite matching Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans February 5, 2017 1. Lecture notes on bipartite matching Matching problems are among the fundamental problems in

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Theory of Variational Inference: Inner and Outer Approximation Eric Xing Lecture 14, February 29, 2016 Reading: W & J Book Chapters Eric Xing @

More information

A Generalized Method to Solve Text-Based CAPTCHAs

A Generalized Method to Solve Text-Based CAPTCHAs A Generalized Method to Solve Text-Based CAPTCHAs Jason Ma, Bilal Badaoui, Emile Chamoun December 11, 2009 1 Abstract We present work in progress on the automated solving of text-based CAPTCHAs. Our method

More information

A Topography-Preserving Latent Variable Model with Learning Metrics

A Topography-Preserving Latent Variable Model with Learning Metrics A Topography-Preserving Latent Variable Model with Learning Metrics Samuel Kaski and Janne Sinkkonen Helsinki University of Technology Neural Networks Research Centre P.O. Box 5400, FIN-02015 HUT, Finland

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

A Fast Learning Algorithm for Deep Belief Nets

A Fast Learning Algorithm for Deep Belief Nets A Fast Learning Algorithm for Deep Belief Nets Geoffrey E. Hinton, Simon Osindero Department of Computer Science University of Toronto, Toronto, Canada Yee-Whye Teh Department of Computer Science National

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

1 Homophily and assortative mixing

1 Homophily and assortative mixing 1 Homophily and assortative mixing Networks, and particularly social networks, often exhibit a property called homophily or assortative mixing, which simply means that the attributes of vertices correlate

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Chapter II. Linear Programming

Chapter II. Linear Programming 1 Chapter II Linear Programming 1. Introduction 2. Simplex Method 3. Duality Theory 4. Optimality Conditions 5. Applications (QP & SLP) 6. Sensitivity Analysis 7. Interior Point Methods 1 INTRODUCTION

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

CS229 Lecture notes. Raphael John Lamarre Townshend

CS229 Lecture notes. Raphael John Lamarre Townshend CS229 Lecture notes Raphael John Lamarre Townshend Decision Trees We now turn our attention to decision trees, a simple yet flexible class of algorithms. We will first consider the non-linear, region-based

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time Today Lecture 4: We examine clustering in a little more detail; we went over it a somewhat quickly last time The CAD data will return and give us an opportunity to work with curves (!) We then examine

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Mining di Dati Web. Lezione 3 - Clustering and Classification

Mining di Dati Web. Lezione 3 - Clustering and Classification Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Information-Theoretic Co-clustering

Information-Theoretic Co-clustering Information-Theoretic Co-clustering Authors: I. S. Dhillon, S. Mallela, and D. S. Modha. MALNIS Presentation Qiufen Qi, Zheyuan Yu 20 May 2004 Outline 1. Introduction 2. Information Theory Concepts 3.

More information

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017 CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.

More information

B553 Lecture 12: Global Optimization

B553 Lecture 12: Global Optimization B553 Lecture 12: Global Optimization Kris Hauser February 20, 2012 Most of the techniques we have examined in prior lectures only deal with local optimization, so that we can only guarantee convergence

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

Measures of Clustering Quality: A Working Set of Axioms for Clustering

Measures of Clustering Quality: A Working Set of Axioms for Clustering Measures of Clustering Quality: A Working Set of Axioms for Clustering Margareta Ackerman and Shai Ben-David School of Computer Science University of Waterloo, Canada Abstract Aiming towards the development

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

Matching Theory. Figure 1: Is this graph bipartite?

Matching Theory. Figure 1: Is this graph bipartite? Matching Theory 1 Introduction A matching M of a graph is a subset of E such that no two edges in M share a vertex; edges which have this property are called independent edges. A matching M is said to

More information

Feature Selection for Image Retrieval and Object Recognition

Feature Selection for Image Retrieval and Object Recognition Feature Selection for Image Retrieval and Object Recognition Nuno Vasconcelos et al. Statistical Visual Computing Lab ECE, UCSD Presented by Dashan Gao Scalable Discriminant Feature Selection for Image

More information

Mixture models and clustering

Mixture models and clustering 1 Lecture topics: Miture models and clustering, k-means Distance and clustering Miture models and clustering We have so far used miture models as fleible ays of constructing probability models for prediction

More information

A Taxonomy of Semi-Supervised Learning Algorithms

A Taxonomy of Semi-Supervised Learning Algorithms A Taxonomy of Semi-Supervised Learning Algorithms Olivier Chapelle Max Planck Institute for Biological Cybernetics December 2005 Outline 1 Introduction 2 Generative models 3 Low density separation 4 Graph

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA

International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Novel Lossy Compression Algorithms with Stacked Autoencoders

Novel Lossy Compression Algorithms with Stacked Autoencoders Novel Lossy Compression Algorithms with Stacked Autoencoders Anand Atreya and Daniel O Shea {aatreya, djoshea}@stanford.edu 11 December 2009 1. Introduction 1.1. Lossy compression Lossy compression is

More information

Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets

Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan

More information

10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski

10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski 10. Clustering Introduction to Bioinformatics 30.9.2008 Jarkko Salojärvi Based on lecture slides by Samuel Kaski Definition of a cluster Typically either 1. A group of mutually similar samples, or 2. A

More information

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, 00142 Roma, Italy e-mail: pimassol@istat.it 1. Introduction Questions can be usually asked following specific

More information

Convex Clustering with Exemplar-Based Models

Convex Clustering with Exemplar-Based Models Convex Clustering with Exemplar-Based Models Danial Lashkari Polina Golland Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge, MA 2139 {danial, polina}@csail.mit.edu

More information

Data Mining: Models and Methods

Data Mining: Models and Methods Data Mining: Models and Methods Author, Kirill Goltsman A White Paper July 2017 --------------------------------------------------- www.datascience.foundation Copyright 2016-2017 What is Data Mining? Data

More information

Inital Starting Point Analysis for K-Means Clustering: A Case Study

Inital Starting Point Analysis for K-Means Clustering: A Case Study lemson University TigerPrints Publications School of omputing 3-26 Inital Starting Point Analysis for K-Means lustering: A ase Study Amy Apon lemson University, aapon@clemson.edu Frank Robinson Vanderbilt

More information

Convex Clustering with Exemplar-Based Models

Convex Clustering with Exemplar-Based Models Convex Clustering with Exemplar-Based Models Danial Lashkari Polina Golland Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge, MA 2139 {danial, polina}@csail.mit.edu

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University September 30, 2016 1 Introduction (These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan.

More information

Lecture 2 September 3

Lecture 2 September 3 EE 381V: Large Scale Optimization Fall 2012 Lecture 2 September 3 Lecturer: Caramanis & Sanghavi Scribe: Hongbo Si, Qiaoyang Ye 2.1 Overview of the last Lecture The focus of the last lecture was to give

More information

Pattern Clustering with Similarity Measures

Pattern Clustering with Similarity Measures Pattern Clustering with Similarity Measures Akula Ratna Babu 1, Miriyala Markandeyulu 2, Bussa V R R Nagarjuna 3 1 Pursuing M.Tech(CSE), Vignan s Lara Institute of Technology and Science, Vadlamudi, Guntur,

More information

Edge-exchangeable graphs and sparsity

Edge-exchangeable graphs and sparsity Edge-exchangeable graphs and sparsity Tamara Broderick Department of EECS Massachusetts Institute of Technology tbroderick@csail.mit.edu Diana Cai Department of Statistics University of Chicago dcai@uchicago.edu

More information

A Memetic Heuristic for the Co-clustering Problem

A Memetic Heuristic for the Co-clustering Problem A Memetic Heuristic for the Co-clustering Problem Mohammad Khoshneshin 1, Mahtab Ghazizadeh 2, W. Nick Street 1, and Jeffrey W. Ohlmann 1 1 The University of Iowa, Iowa City IA 52242, USA {mohammad-khoshneshin,nick-street,jeffrey-ohlmann}@uiowa.edu

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

How Learning Differs from Optimization. Sargur N. Srihari

How Learning Differs from Optimization. Sargur N. Srihari How Learning Differs from Optimization Sargur N. srihari@cedar.buffalo.edu 1 Topics in Optimization Optimization for Training Deep Models: Overview How learning differs from optimization Risk, empirical

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Auxiliary Variational Information Maximization for Dimensionality Reduction

Auxiliary Variational Information Maximization for Dimensionality Reduction Auxiliary Variational Information Maximization for Dimensionality Reduction Felix Agakov 1 and David Barber 2 1 University of Edinburgh, 5 Forrest Hill, EH1 2QL Edinburgh, UK felixa@inf.ed.ac.uk, www.anc.ed.ac.uk

More information

AM 221: Advanced Optimization Spring 2016

AM 221: Advanced Optimization Spring 2016 AM 221: Advanced Optimization Spring 2016 Prof. Yaron Singer Lecture 2 Wednesday, January 27th 1 Overview In our previous lecture we discussed several applications of optimization, introduced basic terminology,

More information

Medical Image Segmentation Based on Mutual Information Maximization

Medical Image Segmentation Based on Mutual Information Maximization Medical Image Segmentation Based on Mutual Information Maximization J.Rigau, M.Feixas, M.Sbert, A.Bardera, and I.Boada Institut d Informatica i Aplicacions, Universitat de Girona, Spain {jaume.rigau,miquel.feixas,mateu.sbert,anton.bardera,imma.boada}@udg.es

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Markov Networks in Computer Vision

Markov Networks in Computer Vision Markov Networks in Computer Vision Sargur Srihari srihari@cedar.buffalo.edu 1 Markov Networks for Computer Vision Some applications: 1. Image segmentation 2. Removal of blur/noise 3. Stereo reconstruction

More information

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information