1 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST Distributed Clustering Using Wireless Sensor Networks Pedro A. Forero, Student Member, IEEE, Alfonso Cano, Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE Abstract Clustering spatially distributed data is well motivated and especially challenging when communication to a central processing unit is discouraged, e.g., due to power constraints. Distributed clustering schemes are developed in this paper for both deterministic and probabilistic approaches to unsupervised learning. The centralized problem is solved in a distributed fashion by recasting it to a set of smaller local clustering problems with consensus constraints on the cluster parameters. The resulting iterative schemes do not exchange local data among nodes, and rely only on single-hop communications. Performance of the novel algorithms is illustrated with simulated tests on synthetic and real sensor data. Surprisingly, these tests reveal that the distributed algorithms can exhibit improved robustness to initialization than their centralized counterparts. Index Terms Clustering methods, distributed algorithms, expectation maximization (EM) algorithms, iterative methods, wireless sensor networks. I. INTRODUCTION T HE development of small, low-cost, intelligent sensors with communication capabilities has prompted the emergence of wireless sensor networks (WSNs) in applications including environmental monitoring, surveillance, tracking, and inference tasks in bio-informatics [20], [25], [26]. When using a WSN as an exploratory infrastructure, it is often desired to infer hidden structures in distributed data collected by the sensors. With each sensor having available a set of unlabeled observations drawn from a known number of classes, the goal of the present paper is to design local clustering rules that perform at least as well as global ones, which rely on all observations being centrally available. Because low-cost sensors must operate under stringent power constraints, transmitting all observations to a central location may be infeasible. This motivates Manuscript received June 01, 2010; revised November 15, 2010; accepted January 25, Date of publication February 14, 2011; date of current version July 20, This work was supported in part by the National Science Foundation (NSF) under Grants CCF and CON and also in part through collaborative participation in the Communications and Networks Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies of the Army Research Laboratory or the U.S. Government. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Anna Scaglione. The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN USA ( forer002@umn.edu; alfonso@umn.edu; georgios@umn.edu). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /JSTSP looking for in-network clustering algorithms requiring information exchanges among single-hop neighbors only. 
Focus is placed on partitional (as opposed to hierarchical) clustering algorithms, which yield a single partitioning of the data described by a fixed number of parameters [30]. With these parameters being less than the available data, partitional clustering can afford parsimonious distributed implementations of deterministic and probabilistic approaches. A popular centralized deterministic partitional clustering approach is offered by the K-means algorithm, which features simple, and fast-convergent iterations [19]. Alternatively, clustering can be viewed as the byproduct of a density estimation problem by introducing a parametric probabilistic model governing the data generation; e.g., a Gaussian mixture model (GMM) [9, Ch. 10]. Density estimation problems are of further interest in the clustering context, because they provide extra information in the form of confidence on the data-to-cluster assignment. When the sought density is described by a finite number of parameters, a popular scheme for estimating them using the maximum-likelihood (ML) approach is the centralized expectation maximization (EM) algorithm. The EM algorithm has well-documented merits because it is computationally affordable, and offers convergence guarantees [7]. Parallel and distributed implementations of the K-means (DKM) and EM (DEM) algorithms have risen most often because of the need to deal with large data sets. However, most existing schemes are agnostic to the network communication constraints [8], [22], [31]. In the WSN context, various probabilistic approaches have been reported leading to: an incremental (I-) DEM scheme [23]; a gossip-based scheme [18]; a scheme based on consensus averaging [14]; a scheme based on junction trees and related topologies [29]; and a scheme based on the alternating direction method of multipliers [12]. Except for [12] and [29], all these works are confined to parameter estimation when the data probability density function (pdf) is modeled as a finite mixture of Gaussian density functions, a case local estimators are available in closed form. In addition, [23] and [29] are confined to specific communication network topologies (loops or trees). This paper presents and analyzes novel distributed algorithms for clustering observations collected by spatially distributed resource-aware sensors, which exchange only sufficient information with their one-hop neighbors. Viewing first the data as deterministic, a distributed version of the centralized K-means algorithm is developed. In par with the centralized K-means algorithm, the novel DKM scheme iterates over the variables of a consensus-based decentralized version of the global classification cost. Subsequently, viewing the data as random draws from a probabilistic model, the underlying data pdf is modeled /$ IEEE

2 708 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 as a finite mixture of (not necessarily Gaussian) pdfs. Then, the clustering problem reduces to a distributed parameter estimation problem, for which a distributed version of the EM algorithm is introduced. Using the variational formulation of [21], the resulting DEM algorithm relies on alternating iterations over the mixture parameters, and the unknown cluster labels. Compared to [18], [23], and [14], the novel algorithm does not require closed-form expressions for local (i.e., per sensor) estimators in terms of sufficient statistics. Compared to [11] and [12], the novel DKM and DEM algorithms can afford reduced communication overhead and are provably convergent. Interestingly, numerical tests reveal that these algorithms can be less sensitive to initialization, while achieving improved clustering results relative to their centralized counterparts with random initializations. The remainder of this paper is organized as follows. Section II states the problem, and reviews briefly the centralized K-means and EM algorithms to establish context and notation. The corresponding distributed versions are developed in Sections III and IV. Section V presents numerical tests. The paper concludes with final remarks in Section VI. Notational conventions are as follows: upper (lower) bold face letters are used for matrices (column vectors); denotes matrix and vector transposition; ( ) the -entry of a matrix ( -entry of a vector); set cardinality; a set of variables with appropriate indexing; stands for a diagonal matrix with on its diagonal; for a block-diagonal matrix with on its diagonal; ( ) for a 1 vector of ones (zeros); for a identity matrix; for the Euclidean norm; for the trace operation; and for the multivariate Gaussian density function of vector with mean, and covariance matrix. II. PRELIMINARIES AND PROBLEM STATEMENT Consider a WSN with nodes, node is allowed to communicate only with its one-hop neighbors. Communication links between neighboring nodes are assumed symmetric; thus, if, then. The WSN is modeled via an undirected graph with vertex set, and the edge set comprising the connections between pairs of nodes. Graph is assumed connected, meaning that data at any node can become available to any other node generally through a multi-hop path of. The WSN is deployed to gather data, and perform an unsupervised classification task. Every node collects a set of observations, denotes the th observation at sensor. Each observation is assumed belonging to a class-set with, contains the indices of all possible classes, and denotes the total number of classes present. Although is assumed known a priori, the methodology presented in this paper can be complemented readily with model order selection criteria to estimate along the lines of, e.g., [17] or [10]. The goal is to assign each to a cluster based on a proper criterion chosen to quantify similarity among observations. Hard clustering assigns to a unique class, while soft clustering yields a probability-like score that belongs to class. Soft clustering is preferred whenever the boundaries between classes are not well defined; e.g., when neighboring classes overlap considerably. In a centralized setting, all observations are available at the same location for clustering. Centralized clustering schemes, outlined next, will be used to benchmark performance of the novel distributed algorithms presented in the ensuing section. A. 
Deterministic Partitional Clustering In addition to the similarity assessing criterion, deterministic partitional clustering (DPC) entails prototypical elements (a.k.a. cluster centroids) per class in order to avoid comparisons between every pair of observations. Let denote the prototype element for class, and the membership coefficient of to class. The clustering problem amounts to specifying the family of clusters with centroids such that the sum of squared-errors is minimized; that is denotes the convex set of constraints on all membership coefficients, and is a tuning parameter. If, the cost in (1) becomes linear with respect to (w.r.t.), and with fixed, (1) then is a linear program for which the optimal coefficients result in hard assignments, and yield clusters. On the other hand, if, then the optimal coefficients generally result in soft membership assignments, which turn out to be expressible in closed form. In this case, the optimal clusters are obtained as. Problem (1) is a well-known combinatorial problem with NP-hard complexity in the number of observations [6]. The K-means algorithm offers a low-complexity, iterative, suboptimum solver that minimizes in two steps per iteration the cost in (1) w.r.t. (S1) with fixed, and (S2) with fixed [19]. Convergence is guaranteed at least to a local minimum. Both steps require availability of global information centrally: (S1) requires knowledge of the global cluster centroids after each iteration; and, (S2) requires knowledge of which data belong to each cluster. Hence, the centralized K-means algorithm is not amenable to distributed implementation in its current form. B. Probabilistic Partitional Clustering Probabilistic partitional clustering (PPC) views clustering as a follow-up of estimating a mixture density, formed by class-conditional pdfs. Once the class-conditional pdfs are estimated, each node can construct a probabilistic rule for clustering data, e.g., using the maximum a posteriori (MAP), or, the Neyman Pearson criterion [9]. In PPC, the data are viewed as realizations of independent, identically distributed (i.i.d.) random vectors, and likewise are their labels. Consider the 1 random, binary-valued, hard-label vector indicating the class the random vector is drawn from; (1)

3 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 709 and and their corresponding realizations. Label vectors take values from the set, denotes the canonical vector with one in its th entry and zeros else. If belongs to class, then and,. Data are drawn by first selecting class at random with probability, and then drawing a vector according to the class-conditional density, collects unknown deterministic parameters of the pdf. A datum and its class label, i.e., the complete data, are jointly described by the pdf only one factor in (2) has exponent equal to 1 while all other factors have exponents equal to 0, and thus do not affect the product. Nodes do not know the class each belongs to; hence, the only observable datum is the realization. Let collect parameters and. The pdf of is a mixture of pdfs obtained by marginalizing (2) with respect to as (2) With available and denoting expectation w.r.t., the M-step updates as which is used in the E-step of the ensuing iteration. Iterative passes of the E- and M-steps proceed until the condition is satisfied for a prescribed tolerance. Notice that at iteration, the E- and M-steps require knowledge of the global parameter estimate ; thus, the EM algorithm is not amenable to a distributed implementation in its current form. III. DISTRIBUTED ALGORITHMS FOR DPC In this section, distributed algorithms are developed for DPC, and their convergence is analyzed. With the goal of minimizing (1) in a distributed fashion, per sensor define local prototype vectors, and formulate the distributed clustering problem as (7) (8) s.t. The ML estimate of function (3) is found by maximizing the likelihood. The maximizer of cannot be generally found in closed form since the (log-) likelihood in (4) is a nonlinear function of. To overcome this hurdle, the EM approach to ML estimation hinges on the idea that if the class labels for each were known, it would be easier to maximize the likelihood in (4). Specifically, the EM approach relies on the pdf of the complete data, whose loglikelihood is given by the set contains all observations with their corresponding labels. The EM algorithm starts with an initial guess to obtain an estimate per class. Given at iteration index, the E-step estimates class labels as, the last equality holds because is a Bernoulli random variable. By using Bayes rule on, and noticing that is conditionally independent of all other, with and, given, the E-step yields (4) (5) (6) the auxiliary variables will allow (8) to be solved in a distributed fashion, and the consensus constraints and guarantee that problems (8) and (1) are equivalent as formalized by the following lemma proved in Appendix A. Lemma 1: (Equivalence of (8) with (1)) For every, let denote a feasible solution of (8). If is connected, then problems (8) and (1) are equivalent, i.e.,, is a feasible solution of (1). Lemma 1 ensures agreement on the cluster centroids with common index across all nodes. The union of local clusters,or when performing soft clustering, forms, which constitutes a cluster equivalent to one that would be found if all observations were centrally available. Unfortunately, the computational complexity involved in solving (8) via exhaustive search remains exponential, which motivates pursuing low-complexity distributed solvers. A. Distributed K-Means Algorithm Consider what can be termed surrogate augmented Lagrangian of (8), given by (9)
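For concreteness, the centralized E- and M-steps reviewed above (cf. (6) and (7)) can be sketched for a finite Gaussian mixture as follows. This is a minimal NumPy illustration only: function and variable names (e.g., em_gmm, resp) are not the paper's notation, and the small diagonal loading added to the covariances is purely for numerical safety.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate Gaussian density evaluated at the rows of X."""
    p = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)
    return np.exp(expo) / np.sqrt(((2 * np.pi) ** p) * np.linalg.det(cov))

def em_gmm(X, K, n_iters=100, seed=0):
    """Centralized EM for a K-component Gaussian mixture (E-step / M-step)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    pi = np.full(K, 1.0 / K)                                  # mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: posterior class probabilities ("responsibilities")
        resp = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: ML updates of the mixture parameters given the responsibilities
        Nk = resp.sum(axis=0)
        pi = Nk / N
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(p)
    return pi, mu, Sigma
```

The distributed formulation in (8) and (25) replicates such parameters per node and couples the replicas through consensus constraints on single-hop neighbors, which is what the surrogate augmented Lagrangian below handles.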

4 710 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 and denote Lagrange multipliers corresponding to the constraints and, respectively, and is a positive scalar. The constraints in have not been included in (9), hence the name surrogate. Further, the quadratic terms and augment the ordinary definition of the Lagrangian. They render it strictly convex w.r.t. each variable with all other variables fixed, and, through the scalar, allow one to control the convergence properties of the resulting iterative algorithm (see Section V). In the same spirit of centralized clustering algorithms, consider minimizing cyclically one set of variables at a time w.r.t., and with all other variables fixed, followed by a gradient ascent step over the multipliers. With denoting iteration index, the resulting iterates are given by (10a) (10b) (10c) (10d) (10e) Due to the augmented terms, update (10c) entails minimizing which is linear-quadratic w.r.t., and thus can be obtained in closed form. With and initialized to zero, the five iterates in (10a) (10e) can be reduced to the following three (see Appendix B for the proof) (11a) (11b) (11c) denotes a local (per-node) aggregate Lagrange multiplier and. Solving (11a) with amounts to assigning observation to cluster if with.for, solving the constrained convex optimization problem in (11a), yields the membership coefficient updates in closed form as (12) The novel DKM clustering scheme is summarized as Algorithm 1. At iteration, each node randomly initializes its local cluster centroids, while Lagrange multipliers are initialized to. Per iteration, each node locally assigns membership coefficients to its observations via (11a). Next, each node updates its local cluster centroids via (11b), and broadcasts to its neighboring nodes. Subsequently, each node updates its local aggregate Lagrange multipliers via (11c), thus completing iteration. Algorithm 1 requires each node to wait for the information from all its neighbors, before updating in (11c); thus, it is a synchronous algorithm. At the startup phase, DKM requires no central information made available to the nodes other than agreeing a priori on and, which incurs minimal communication overhead. During the operational phase, DKM broadcasts vectors of length 1 per iteration per node. Every node stores the vectors representing the local cluster centroids, and the local Lagrange multipliers ; along with an matrix for the membership variables. Algorithm 1 Distributed K-means (DKM) Require: Node initializes and. 1: for do 2: for all do 3: Compute via (11a). 4: Compute via (11b). 5: end for 6: for all do 7: Broadcast to all neighbors. 8: end for 9: for all do 10: Compute via (11c). 11: end for 12: end for Remark 1: (Distributed vs. centralized -means) Per iteration, consider updating (11b) and (11c) times before updating again in (11a). For fixed, (11b) and (11c) solve a distributed least-squares problem using the alternating direction method of multipliers as in [32] and [11]. As, the vector converges to its centralized weighted average solution for the cluster centroids (13)
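Since the displayed iterations (11a)-(11c) are not reproduced in this transcription, the following per-node sketch should be read as a generic consensus-ADMM instantiation of the DKM description in Algorithm 1, not as the paper's exact expressions: each node assigns its own observations to the nearest local centroid, updates its centroids in closed form using its aggregate multipliers and the centroids broadcast by its single-hop neighbors, and then performs a multiplier ascent step. The placement of the penalty coefficient eta and the constants are illustrative assumptions.

```python
import numpy as np

def dkm_hard(X_local, adj, K, eta=1.0, n_iters=200, seed=0):
    """Schematic distributed hard K-means over a connected WSN graph.

    X_local: list of (N_j, p) observation arrays, one per node (N_j >= K assumed).
    adj: list of neighbor index lists (symmetric links, connected graph assumed).
    eta: consensus penalty coefficient; constants follow a generic consensus-ADMM
    recipe and are not necessarily those of the paper's (11a)-(11c).
    """
    rng = np.random.default_rng(seed)
    J, p = len(X_local), X_local[0].shape[1]
    m = [X_local[j][rng.choice(len(X_local[j]), K, replace=False)].astype(float)
         for j in range(J)]                                   # local centroids
    v = [np.zeros((K, p)) for _ in range(J)]                  # aggregate multipliers
    for _ in range(n_iters):
        m_prev = [mj.copy() for mj in m]                      # centroids broadcast last iteration
        labels = []
        for j in range(J):
            # local hard memberships: nearest local centroid
            d2 = ((X_local[j][:, None, :] - m_prev[j][None, :, :]) ** 2).sum(axis=2)
            lab = d2.argmin(axis=1)
            labels.append(lab)
            # local centroid update (closed form of a linear-quadratic subproblem)
            for k in range(K):
                idx = (lab == k)
                Sx = X_local[j][idx].sum(axis=0) if idx.any() else np.zeros(p)
                nbr = sum(m_prev[j][k] + m_prev[i][k] for i in adj[j])
                m[j][k] = (Sx - v[j][k] + eta * nbr) / (idx.sum() + 2.0 * eta * len(adj[j]))
        for j in range(J):
            # multiplier ascent after receiving neighbors' updated centroids
            for k in range(K):
                v[j][k] += eta * sum(m[j][k] - m[i][k] for i in adj[j])
    return m, labels
```

Only the K local centroid vectors are exchanged per iteration per node, never the raw observations, which is the communication pattern the text describes.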

5 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 711 which is in turn the closed-form expression for (S2) in the equivalent centralized K-means algorithm. Different from DKM, the centralized solution reached in (13) satisfies,. In this case, the DKM inherits the convergence properties of the centralized K-means algorithm. However, (13) does not need to hold for any finite, and a different approach is necessary to prove convergence of the DKM Algorithm 1 as is shown in the ensuing section. B. Convergence Analysis Standard approaches to analyze convergence of centralized clustering algorithms capitalize on the fact that the cost is nonincreasing after each iteration; see, e.g., [3], [27]. For the iterations in (11), this holds true for updates (11a) and (11b), but not for (11c). For this reason, a different approach is pursued in this section to establish stability of (11). In particular, it is shown that (11b) (11c) are bounded-input bounded-output (BIBO) stable iterations regardless of the cluster assignment update in (11a), with the output iterates arbitrarily close to their fixed point. For the special case of the hard K-means cost, this suffices to guarantee convergence at least to a local minimum of (1). Iterations (11b) and (11c) are first-order linear difference equations with time-varying coefficients that depend on. To analyze their convergence, consider the weighted average, and the super-vectors (14) (15) (16) Let denote the graph Laplacian matrix, with denoting the degree matrix, and the adjacency matrix [13, Ch. 8]. Matrix describes the connectivity pattern of the WSN. If, then, which implies by the link symmetry. Using these definitions, it turns out that (11b) and (11c) can be compactly re-written as Consider further the diagonal matrix, (17) (18) (19) collects all per-node cluster cardinalities for class. Next, let, and denote the iterates at a fixed point of the iterations (17), (18), and (11a). At this fixed point (18) yields. Using this fact in (17) evaluated at the fixed point, yields (20) and are defined as in (16) and (19), respectively, with substituted by. Based on (14) (20), the following lemma is proved in Appendix C. Lemma 2: (Characterization of fixed points) The fixed points of iterations (11a), (17) and (18), namely,, and, are the Karush Kuhn Tucker (KKT) points of (1). Since (1) has multiple KKT points,,, and are not unique (although they always exist). Because iterations are deterministic, each fixed point of (17), (18), and (11a) is uniquely specified by the initialization point; that is, different initialization points generally lead to different fixed points but the same initialization point will always converge to the same fixed point. To proceed, define the error sequences,, and the corresponding supervectors and. From (17) (20), it follows that (21) (22). Since, the definition of implies that can be bounded as, for some finite constant. From (21) (22), Appendix D establishes the following result. Proposition 1: (Stability of the -means iterations) For sufficiently large, there exists a such that for any the distance between iterate in (17) and its fixed point is bounded; i.e., for some finite constant, it holds that (23) Proposition 1 asserts that remains bounded regardless of. For a sufficiently large, the bound on becomes arbitrarily small. Furthermore, for the special case of the K-means algorithm, the following stronger result holds (see Appendix E for the proof). 
Corollary 1: (Convergence of hard K-means) For sufficiently large, the distributed clustering algorithm (11) with membership exponent converges to a local minimum of problem (1). Intuitively, for a sufficiently large, prototypes are so close to that remains fixed. This, along with the fact that every fixed point of the centralized K-means algorithm is a KKT point of (1), suffices to prove Corollary 1. Remark 2: (On the selection of ) A sufficiently large value of is needed for Proposition 1 and Corollary 1 to hold. However, numerical tests in Section V will illustrate that large values of can affect the convergence speed of DKM and may lead iterations to KKT points of higher cost. The proof of Proposition 1 is based on the stability analysis in [28, Sec. C6], which is an existence result. Finding constructively the minimum that guarantees convergence depends on the WSN topology and the set of observations, but goes beyond the scope of this paper. Nonetheless, simulated tests in Section V suggest that in practice relatively small values of suffice for the hard-DKM to converge, and for the soft-DKM to refrain from noticeable hovering-type behavior.
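The stability analysis above is phrased in terms of the graph Laplacian (degree matrix minus adjacency matrix), and the simulations in Section V report the WSN's algebraic connectivity, i.e., the second-smallest Laplacian eigenvalue, which is strictly positive if and only if the graph is connected. A quick NumPy check of both (names are illustrative):

```python
import numpy as np

def laplacian_and_connectivity(A):
    """Graph Laplacian L = D - A and algebraic connectivity (second-smallest eigenvalue).

    A: symmetric 0/1 adjacency matrix of the WSN. The network is connected iff the
    returned algebraic connectivity is strictly positive (the standing assumption here).
    """
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals = np.sort(np.linalg.eigvalsh(L))
    return L, eigvals[1]

# Example: a 4-node ring
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
L, lam2 = laplacian_and_connectivity(A)
print(lam2 > 0)  # True: the ring is connected
```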

6 712 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 Remark 3: (Node failures) The convergence claims in Proposition 1 and Corollary 1 still hold when a node fails, provided that the resultant network remains connected. In this case, the fixed points are those of a centralized cost function entailing all but the observations of the failed node. If node failure(s) lead to a disconnected network (it is a graph cut-vertex), then the algorithm would remain operational in each subnetwork. The fixed points of each subnetwork would be those of a centralized problem using the observations from all nodes in that subnetwork. IV. DISTRIBUTED ALGORITHMS FOR PPC In this section, the centralized PPC formulation of Section II-B is first re-cast in a variational form, and subsequently solved in a distributed fashion. Let denote the pdf of the 0 1 valued random label vector, denote expectation w.r.t., and the entropy of. Using this notation, the EM algorithm can be viewed as an alternating functional-parameter optimization solver that cyclically maximizes [21] (26) is a positive tunable scalar, and ( ) denotes the inner product (induced norm) over the space. As in the previous section, (26) will be minimized cyclically w.r.t.,,,, and. Then, gradient ascent updates on and will be performed. Consider the quantities,,,, and that are given by (cf. Appendix B) (27a) (24) first w.r.t. over the space of pdfs factorized as, with the parameters fixed; and next w.r.t. for the fixed pdf. At each iteration, the steps in the EM algorithm can be viewed as (see [21] for details): (E-step) find ; (M-step) find. To perform the (E-step) local information is needed, and all sensors must know from the previous (M-step). Similar to Section III, local parameter estimates comprising local and are introduced to allow distributing the objective function in (24). In this framework, the distributed PPC solves the following optimization problem: (25) the auxiliary variables and will play a role similar to that of in (8). Let Lagrange multipliers and, with 1, 2, correspond to the constraints,,, and, respectively. Also, let denote an inner product space such that,. The augmented Lagrangian associated with (25) is (27b) (27c) (27d) (27e) denotes the set of constraints on all parameters. Finding in (27b) requires solving a convex optimization problem which can be efficiently accomplished via, e.g., interior point methods. If the pdfs are log-concave w.r.t., then update (27c) is also a convex optimization problem. The DEM scheme described by iterations (27a) (27e) is summarized as Algorithm 2. DEM begins at iteration when each node randomly initializes its local parameter estimates, ; and its local aggregate Lagrange multipliers as, and. At iteration, every node updates its class label estimates via (27a). Then, every node updates its local estimates and via (27b) and (27c), respectively. Per node, the updated estimates and are broadcast to all nodes. Finally, every node updates its local Lagrange multipliers and via (27d) and (27e). Similar to Algorithm 1, Algorithm 2 requires each node to wait for

7 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 713 information all its neighbors before updating (11c); thus, it is also a synchronous algorithm. Algorithm 2 Distributed EM (DEM) from in Require: Node initializes,,, and. 1: for do 2: for all do 3: Compute via (27a). 4: Compute via (27b). 5: Compute via (27c). 6: end for 7: for all do 8: Broadcast and to all neighbors. 9: end for 10: for all do 11: Compute via (27d). 12: Compute via (27e). 13: end for 14: end for Remark 4: (Distributed vs. centralized EM) As in Remark 1, consider fixing, and cyclically computing (27b)-(27e) times before updating again. Because the alternating direction method of multipliers is provably convergent [2], iterates converge as to the centralized M-step solution in (7). In this case, DEM also inherits convergence guarantees from its centralized counterpart. Notice that when dealing with Gaussian distributions, DEM reduces for to the algorithm in [14]. The merits of DEM relative to [23] and [18] can be summarized as follows: DEM does not require finding a path traversing across all sensors. Incremental alternatives tacitly assume such a so-termed Hamiltonian cycle to ensure passing information once per node, and thus minimize communication overhead. However, Hamiltonian cycles do not always exist, and when they do, finding them is an NP hard task [24]. Because the alternating direction method of multipliers that DEM relies on is known to be resilient to quantization errors and additive noise affecting inter-sensor communications [32], unlike [23] and [18], the DEM is robust to imperfect links. DEM only requires one-hop connectivity among sensors, and thus keeps communication overhead per sensor at an affordable level even when the number of nodes grows. Without requiring estimators to be available in closed form, DEM applies to general mixture models provided that their class-conditional pdfs are differentiable and (log-) concave. A. Gaussian PDFs Without Consensus on The mixture parameters characterize the percentage of observations coming from a specific class at node.ina distributed setting, it is possible that the proportion in which data are mixed varies across nodes. This motivates enforcing consensus only on the parameters across the network, while allowing the to vary across nodes, thereby enabling each node to construct local clustering rules that best reflect the per-node observations available from different classes. Let, and are unknown. Each is fully characterized by, denotes the interior of the cone of positive semi-definite matrices. The space is endowed with the inner product, for any, and. The inner product in is given by, and the corresponding induced norm is,, denotes the Frobenious norm. If the coefficients are allowed to vary across nodes, then (27a) (27e) can be re-written as (28a) (28b) (28c) (28d) (28e) (28f) ; is a positive tuning scalar, and, are aggregate Lagrange multipliers pertaining to the consensus constraints and, respectively. Since no consensus constraints on the are present, the Lagrange multipliers are not included in the augmented Lagrangian (26), and (27b) can be solved in closed form yielding (28b). Note also that (28d) entails

8 714 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 solving a convex optimization problem per node. The first-order optimality condition for (28d) is (29), is a matrix of all zeros,, and Lemma 3: (Characterization of fixed points) The fixed points of iterations (27a) and (27c), namely and, are convergence points of the centralized EM iterations (6) and (7). As in Section III-C, consider the error vectors, and, which satisfy (31) For, (28d) can be solved in closed form. For, interior point methods or direct solution of the matrix (29) can be used to carry out the minimization in (28d)[4], [15]. B. Convergence Analysis Analyzing the convergence of centralized EM typically relies on the objective function in (25) being non-increasing as iterations proceed; see, e.g., [7]. This holds true for updates (27a), (27b), and (27c) too, but not for updates (27d) and (27e). Similar to Section III-C, it is proved here that iterations (27a) (27e) come arbitrarily close to a fixed point of the DEM algorithm, which in turn is a fixed point of the centralized EM. To this end, the following assumptions are necessary. AS1: Class-conditional pdf is log-concave and twice-differentiable w.r.t.,. Notice that AS1 holds true for various members of the widely used family of exponential distributions. In the ensuing derivations only convergence of in (27c) will be established. Convergence of from iteration (27b) follows similarly after combining (27b) with (27c), and considering as the new optimization variable (details are omitted). Under AS1, the first-order optimality conditions for (27c) yields (30) denotes the gradient of w.r.t. evaluated at ; and in deriving (30), was assumed per iteration to fall in the interior of the subspace, which holds, e.g., when in (28d) is a strictly positive-definite matrix. At a fixed point, (27e) implies, which upon substituting into (30) yields,. These points are indeed convergence points of the centralized EM as shown by the following lemma proved in Appendix F. Vectorizing (or equivalently assuming is a vector) and applying the mean-value theorem on each row of, it is shown in Appendix G that (32) matrix depends on,, and. Substituting (32) into (31), letting, and following steps similar to those used in Section III-C, the iterates can now be written as and (33) (34),, and is a new (bounded) error vector that depends on and. Matrix is clearly nonsingular for all. Using steps similar to those in the previous section, Appendix G establishes the following result. Proposition 2: (Stability of the DEM iterations) For sufficiently large, there exists a such that for any, the distance between iterates in (33) and its fixed point is bounded; i.e., (35) for some finite constant. For a sufficiently large, the iterates in (33) come arbitrarily close to a fixed point of the iterations in (27a) (27e), which by Lemma 3 are fixed points of the centralized EM. Similar to the DKM algorithm (cf. Remark 2), finding the minimum that guarantees stability is a challenging task, aggravated by the presence of different scalar, vector and matrix updates. However, numerical tests in Section V indicate that when choosing different values of for each set of optimization parameters, convergence can be guaranteed with relatively small step size values.
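Because the displayed updates (27a)-(27e) and (28a)-(28f) are not reproduced in this transcription, the following reduced sketch only conveys the structure of a DEM-style loop for Gaussian components: a local E-step, a local M-step in which the component means carry an ADMM-style consensus penalty, and a multiplier ascent step. Unlike the paper's algorithm, the covariances and mixing weights are kept as purely local updates here, and the constants, penalty placement, and names are illustrative assumptions.

```python
import numpy as np

def _gauss(X, m, S):
    # multivariate Gaussian density evaluated at the rows of X
    d = X - m
    inv = np.linalg.inv(S)
    e = -0.5 * np.einsum('ni,ij,nj->n', d, inv, d)
    return np.exp(e) / np.sqrt(((2 * np.pi) ** X.shape[1]) * np.linalg.det(S))

def dem_gaussian_sketch(X_local, adj, K, eta=1.0, n_iters=300, seed=0):
    """Simplified distributed-EM loop: consensus enforced on the means only.

    X_local: list of (N_j, p) data arrays (one per node); adj: neighbor index lists.
    Mixing weights and covariances are purely local M-step updates in this sketch,
    so nodes are not forced to agree on them -- a reduced version of DEM.
    """
    rng = np.random.default_rng(seed)
    J, p = len(X_local), X_local[0].shape[1]
    mu = [X_local[j][rng.choice(len(X_local[j]), K, replace=False)].astype(float) for j in range(J)]
    lam = [np.zeros((K, p)) for _ in range(J)]     # aggregate multipliers (mean-consensus constraints)
    pi = [np.full(K, 1.0 / K) for _ in range(J)]
    Sig = [[np.cov(X_local[j].T) + 1e-6 * np.eye(p) for _ in range(K)] for j in range(J)]
    for _ in range(n_iters):
        mu_prev = [m.copy() for m in mu]           # means broadcast at the previous iteration
        for j in range(J):
            Xj = X_local[j]
            # local E-step: responsibilities under the current local parameters
            resp = np.stack([pi[j][k] * _gauss(Xj, mu[j][k], Sig[j][k]) for k in range(K)], axis=1)
            resp /= resp.sum(axis=1, keepdims=True)
            Nk = resp.sum(axis=0)
            pi[j] = Nk / len(Xj)                   # local mixing weights (no consensus enforced)
            for k in range(K):
                # M-step for the mean with an ADMM-style consensus penalty (closed form)
                inv_S = np.linalg.inv(Sig[j][k])
                nbr = sum(mu_prev[j][k] + mu_prev[i][k] for i in adj[j])
                A = Nk[k] * inv_S + 2.0 * eta * len(adj[j]) * np.eye(p)
                b = inv_S @ (resp[:, k] @ Xj) - lam[j][k] + eta * nbr
                mu[j][k] = np.linalg.solve(A, b)
                d = Xj - mu[j][k]
                Sig[j][k] = (resp[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(p)
        for j in range(J):
            # multiplier ascent after receiving the neighbors' updated means
            for k in range(K):
                lam[j][k] += eta * sum(mu[j][k] - mu[i][k] for i in adj[j])
    return pi, mu, Sig
```

As in the DKM sketch, only parameter estimates (never raw data) travel over single-hop links, and the loop is synchronous: every node waits for its neighbors' broadcasts before the multiplier step.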

9 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 715 V. NUMERICAL RESULTS In this section, the performance of the proposed distributed clustering algorithms is evaluated via numerical simulations on both synthetic and real data sets. A. Distributed K-Means Consider a randomly generated WSN with nodes. The network is connected with algebraic connectivity , and average degree per sensor. Data are generated at random from classes, with vectors from each class generated from a two-dimensional Gaussian distribution with common covariance, and corresponding means:,,,,,,,,,,,,,,,,, and. Every node has available a set of observations with 30 observations per class. Each observation is drawn at random from class according to the density,. Thus, each sensor has observations available. The centralized K-means algorithm is simulated to benchmark performance. Letting, the figure of merit for both schemes is [cf. (8)] (36) Fig. 1. DKM for various values of and K = 18(top). DKM for various values of K and =10(bottom). with for hard clustering, and for soft clustering. In all tests, the cluster centroids in (11b) are randomly initialized per sensor, while the Lagrange multipliers in (11c) are initialized to. The centralized K-means algorithm is initialized using the initialization of a sensor in the WSN chosen at random. 1) Distributed Hard K-Means: Fig. 1 (top) depicts the evolution of the hard K-means cost with for different values of and the same initialization. Fig. 2 shows the minimum, mean, and variance of the cost in (36) obtained after iterations (the algorithm converges to the final solution after iterations in most cases) for 100 Monte Carlo runs. This test reveals that most often small values of lead to a lower value of the cost than large values of. Also larger values of lead to increased sensitivity to initialization. For the centralized K-means algorithm with the results obtained were: (min.) ; (mean) ; and (std. dev.) 5,904. Interestingly, the DKM algorithm outperforms the centralized K-means algorithm reaching both a lower minimum and a lower average cost. This could be intuitively explained by: 1) the weighted augmentation in (9), which allows DKM to more easily avoid local minima; and 2) the dual update in (11c), which offers an extra degree of freedom, compared to the centralized K-means, by tuning. The behavior of the DKM algorithm w.r.t. the initial guess for was tested for fixed. Fig. 1 (right) and Table I summarize the results obtained after iterations. Compared to the centralized K-means, DKM consistently finds im- Fig. 2. Error bars comparison between centralized K-means and DKM for K =18. TABLE I PERFORMANCE COMPARISON BETWEEN CENTRALIZED K-MEANS AND DKM FOR =10 proved clustering results regardless of the initialization. This fact is even more evident when. As mentioned in Section III, DKM can mimic the iterations of the centralized K-means if iterations (11b) and (11c) are performed cyclically a number of times for fixed memberships in (11a). Fig. 3 shows the performance of this ap-

10 716 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 Fig. 3. DKM for different values of. Fig. 5. Error bars comparison between centralized soft K-means and soft-dkm for K =18. TABLE II PERFORMANCE METRICS FOR SOFT-DKM, =4AND t =800 Fig. 4. Soft-DKM for various values of and K =18(top). Soft-DKM for various values of K and =4(bottom). proach for various values of [note that corresponds to the DKM algorithm in (11a) (11c)]. As expected, the average performance of DKM degrades for larger values of approaching that of the centralized K-means. 2) Distributed Soft K-Means: Fig. 4 (top) and Fig. 5 show the performance of the distributed soft K-means (soft-dkm) for, after iterations, various values of and the same initialization. For the centralized soft K-means the performance metrics obtained were: (min.) ; (mean) ; and (std. dev.) 537. As in DKM, the choice of affects the convergence of the algorithm. For small values of, soft-dkm achieves lower costs than its centralized counterpart. The performance of the soft-dkm algorithm for various values of with fixed is depicted in Fig. 4 (bottom), and summarized in Table II. As seen, the distributed algorithm outperforms its centralized counterpart for small values of. B. DEM Algorithm In this section, the performance of DEM is tested and compared with the one of I-DEM in [23]. Consider a randomly generated WSN with nodes. The network is connected with algebraic connectivity , and average degree 3.80 per sensor. Observations are generated at random from classes, each class is modeled as a two-dimensional Gaussian distribution with covariance matrices given by,,,, and per class; and corresponding mean vectors given by,,,,, and. Each node has available a total of observations. The proportion of observations per class per node is determined by the local mixture parameters. For nodes 1, 2, 3, 5, 6, 7, 9, 10, the mixture parameters are. For nodes 4, 8, the mixture parameters are,, and ; and,,,, and. The figure of merit for the DEM algorithm is the negative log-likelihood of the complete data given by [cf. (5)] (37). As in the previous tests, iterates are randomly initialized and Lagrange multipliers are initialized to zero. It was found empirically that improved convergence speed could be obtained by using different values of for each of the parameters sought. Consequently, the individual penalty parameters are set to,, and,,, and are the penalty parameters associated with,, and, respectively. These values are chosen to balance out consensus optimization variables taking values

11 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 717 Fig. 8. Error bars comparison between centralized EM and DEM with consensus on for K =6. TABLE III PERFORMANCE METRICS FOR DEM WITH CONSENSUS ON AND =5 Fig. 6. DEM with consensus on for various values of and K =6(top). DEM with consensus on for various values of K and fixed =5(bottom). Fig. 7. Level set curves of the GMM fitted by DEM, with =4and K =6, at different nodes after 400 iterations with consensus on. that greatly differ in magnitude. Otherwise, the iterates with the largest absolute magnitudes will be impacted more by the consensus constraints hindering the convergence rate of the DEM algorithm. 1) DEM With Consensus on : Fig. 6 (top) and Fig. 8 show the performance of the DEM with for different values of. Update (27b) is found numerically via interior point methods. Since the pdfs are Gaussian, update (27c) is carried out by solving directly the quadratic matrix equation (29). The centralized EM algorithm is used as a benchmark. The performance metrics obtained for the centralized EM algorithm are: (min.) ; (mean) ; and (std. dev.) 452. Fig. 7 depicts the level curves for the GMM, and the local data point for nodes 1, 4, 8 with and after 500 iterations. Note that DEM correctly identifies the presence of the Gaussian components not present in the local observations at nodes and. The performance of DEM for various values of with fixed is shown in Fig. 6 (bottom) and Table III. As in the K-means iterations, smaller values of lead to faster convergence to a local minimum with lower cost, comparable with the centralized solution. Larger values of ensure that nodes achieve similar values per node faster, but also cause iterations to be trapped at less favorable local minima. 2) DEM Without Consensus on : In this test, the case of free mixture parameters per node is explored. As benchmarks, the centralized EM algorithm and the I-DEM in [23] are also implemented. For the centralized EM algorithm with the performance metrics obtained are those of the previous subsection. For the incremental EM algorithm with and after 80 cycles through the network, the performance metrics, averaged over 100 Monte Carlo runs, obtained are: (min.) ; (mean) ; and (std. dev.) Fig. 9 (top) depicts the average evolution of the DEM algorithm after 100 Monte Carlo runs with for different values of, and. Clearly, not having to consent on allows nodes to adapt their parameter estimates to the locally available observations. As in the previous cases, the choice of considerably impacts the algorithm s performance. In particular, smaller values of reach consensus in less iterations, and achieve results comparable to the centralized and incremental ones. Fig. 11 shows the performance metrics for DEM after iterations for various values of. The freedom of the mixture coefficients per node allows nodes to adapt their parameter estimates to the local behavior of the observations. This translates to larger values for the global

12 718 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 TABLE IV PERFORMANCE METRICS FOR DEM WITHOUT CONSENSUS ON AND =5 Fig. 10. Level set curves of the GMM fitted by DEM, with =2and K = 6, at different nodes for after 400 iterations and no consensus on the mixture coefficients. Fig. 9. DEM for various values of and K =6(top). DEM for various values of K and fixed =5(bottom). log-likelihood of the data. Fig. 9 (bottom) shows the average evolution of DEM for various values of. The performance metrics for the centralized EM, the DEM, and the I-DEM algorithms are shown in Table IV for various values of at. Note that the centralized EM algorithm automatically forces all with common to coincide across nodes. Fig. 10 depicts the level curves for the GMM, and the local data point for nodes 1, 4, 8 with 2 and 6, after 400 iterations. C. Clustering of Oceanographic Data Environmental monitoring is a potentially important application of WSNs. Such an example is one involving WSNs for oceanographic monitoring, in which the cost of computation per node is lower than the cost of accessing each nodes observations [1]. This makes the option of centralized processing less attractive, thus motivating decentralized processing. This section tests the proposed decentralized schemes on real data collected by multiple underwater sensors in the Mediterranean coast of Spain, available at the World Oceanic Database (WOD) [5], with the goal of identifying regions sharing common physical characteristics. A total of 5720 feature vectors were selected, Fig. 11. Error bars for DEM without consensus on for K =6. each with entries the temperature ( ) and salinity (psu) levels ( ). The measurements were normalized to have zero mean and unit variance. The data were grouped into blocks of measurements each. The algebraic connectivity of the WSN is and the average degree per node is 4.9. Fig. 12 (left) shows the performance of 25 Monte Carlo runs for the hard-dkm algorithm with different values. The best average convergence rate was obtained for yielding a performance attaining the average centralized performance after 300 iterations. Tests with different values of and are also included in Fig. 12 (left) for comparison. Notice that for and hard-dkm hovers around a point without converging.
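In this experiment each observation is a (temperature, salinity) pair normalized to zero mean and unit variance before clustering, and the distributed and centralized schemes are compared through the network-wide sum of squared errors in (36). A small sketch of both steps (function names are illustrative, not the paper's notation):

```python
import numpy as np

def zscore(F):
    """Normalize feature columns (e.g., temperature and salinity) to zero mean, unit variance."""
    return (F - F.mean(axis=0)) / F.std(axis=0)

def networkwide_sse(X_local, labels_local, centroids_local):
    """Hard-clustering figure of merit: the sum over nodes and observations of the
    squared distance to the assigned local centroid (cf. the cost used to compare
    DKM against centralized K-means)."""
    cost = 0.0
    for Xj, labj, Mj in zip(X_local, labels_local, centroids_local):
        cost += ((Xj - Mj[labj]) ** 2).sum()
    return cost
```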

13 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 719 Fig. 13. Performance of DEM on the WOD data set using a WSN with J = 20 nodes for various values of K (top). Level sets and clustering results of the GMM fitted by DEM for the WOD data set with K = 3 (bottom) at t = 4000 iterations. Fig. 12. Average performance of hard-dkm on the WOD data set using a WSN with J =20nodes for various values of and K (top). Clustering results with K =3and =5(bottom) at t = 400 iterations. Choosing a larger value of guarantees convergence of the algorithm to a unique solution. The clustering results of hard-dkm at iterations for and are depicted in Fig. 12 (bottom). Similarly, the DEM algorithm with and parameters and, was used to cluster the WOD data set. The DEM performance using different values of, shown in Fig. 13 (top), approaches the centralized one after a few thousand iterations. Fig. 13 (bottom) shows the GMM level sets obtained by the DEM fit, and the clustering results using the MAP rule for and iterations. VI. CONCLUSION This paper developed distributed algorithms for partitional clustering by capitalizing on a consensus-based formulation and parallel optimization tools. The novel algorithms are well suited for applications network-wide observations cannot become available to individual nodes due e.g., to stringent power constraints. Also, the algorithms guarantee homogeneous usage of power across nodes, thus increasing their battery lifetime. Both deterministic and probabilistic partitional clustering approaches were addressed. The DKM algorithm uses spatially distributed sets of observations to obtain a unified clustering rule across nodes. Both hard and soft versions were explored. The DEM uses local sets of observations to estimate the parameters defining a mixture of pdfs from which observations were drawn. Parameter estimates can be found for any family of log-concave, twice-differentiable, parametric pdfs even if their estimators cannot be written in closed form. Scenarios consensus on the mixture parameters across nodes is not enforced were also investigated. In this case, the global behavior of the observations determines the values of class parameters, while the local proportion in which observations appear per node determines the local mixture parameters. Convergence was analyzed in all cases, establishing that the iterates come arbitrarily close to a fixed point of the algorithm, this fixed point being a KKT point of the centralized cost function. For the special case of the hard K-means algorithm, this guarantees convergence at least to a local minimum. The proofs relied on stability analysis of linear and non-linear time varying systems. Numerical tests show that the novel algorithms can exhibit resilience to initialization since on average they find better local minima more often than their centralized counterparts. APPENDIX A PROOF OF LEMMA 1 It is shown that for every, the equality constraints are equivalent to for any feasible solution of (8). It follows readily that the constraints in (8) imply

14 720 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST Consider two arbitrary nodes. Since the network is connected, there exists a path of length at least one, connecting nodes and. Because for, it follows that. Since, are arbitrary, the latter implies. Since is also arbitrary, the result of the lemma follows. APPENDIX B DERIVATION OF (11a) (11c) The goal is to show that iterations (10a) (10e) reduce to (11a) (11c). First, note that (10a) yields (11a) since the first term of the Lagrangian (9) is the only one that depends on. Iteration (10c) requires solving an unconstrained minimization problem w.r.t. over a linear-quadratic cost function; thus, it admits a closed-form solution, given by APPENDIX C PROOF OF LEMMA 2 At a fixed point of (11a), (17) and (18), it holds that, is a consensus solution reached by the DKM algorithm, and [cf. (20)]. Left-multiplying the latter by and, it follows that, yields, but since (C.41) Since the are not affected by the consensus constraints in (8), in (C.41) coincides with a KKT optimal point of (1). Substituting (B.38) into (10d) and (10e) yields (B.38) (B.39a) APPENDIX D PROOF OF PROPOSITION 1 With denoting matrix pseudo-inverse, rewrite the iterations (21) and (22), as in the following lemma. Lemma 4: If and, then can be written in terms of an auxiliary variable as (D.42) (D.43) (B.39b) Let and be initialized to zero at every node; i.e.,. From (B.39a) (B.39b), it then holds that. Likewise, if, then by induction. Upon defining, iteration (11c) follows directly from (B.39a). Finally, notice that iteration (10b) solves an unconstrained quadratic optimization problem. The first-order optimality conditions for (10b) yields, with Proof: Use (21) and (22) to recognize that first-order difference equation (D.44) (D.45) (D.46) obeys the (D.47) (B.40) is defined in (D.46), and (D.48). Since and, then again by induction. Using the definition of and the fact that, iteration (11b) follows from (B.40). Consider now an auxiliary variable, and the system of equations in (D.42) and (D.43). From the definition of in (D.45), it follows that (D.49)

15 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 721 Also, from the definition of, and by exploiting the fact that is a symmetric matrix to write, it follows that (D.50) Next, it is shown by induction that if, then iterations (D.42) and (D.43) are equivalent to (D.47). For iteration, it follows readily from (D.42) that. Substituting back into (D.43) and using (D.49) and (D.50) yields, as desired. Next, suppose that at iteration, (D.42) and (D.43) are equivalent to (D.47). Consider iteration and substitute into, to obtain (D.51) the second equality follows from (D.49) and (D.50), and the third equality is due to (D.43). Note that (D.51) corresponds to (D.47) evaluated at. Hence, the system of equations in (D.42) and (D.43) is equivalent to (D.47),. Since is a bounded matrix, stability analysis of iterations (21) and (22) can be deduced from that of (D.42). To this end, consider the homogeneous linear time-varying system. If the latter is exponentially stable, then (D.42) is BIBO stable. Exponential stability of is ensured if [28, Sec. C6]: 1) is slowly time-varying; i.e.,, for small enough; 2) all eigenvalues are inside the unit circle, i.e., ; and 3). Condition 3) is readily satisfied since the entries of remain finite for all. Conditions 1) and 2) are satisfied by the updates, as shown in the following two lemmas. Lemma 5: For any fixed, there exists a sufficiently large such that (D.52) Using the Frobenius matrix norm and properties of Kronecker products, it follows that (D.55) is the norm of the matrix in (D.54) that does not depend on. In addition, recalling that,, yields (D.56) Combining (D.55) and (D.56) in (D.53) yields. Note that decreases with, and so approaches zero as. In other words, for any fixed one can always find a sufficiently large so that (D.52) holds. Lemma 6: The spectral radius of, denoted,is strictly smaller than unity. Proof: By the properties of Kronecker products, the eigenvalues of correspond to the eigenvalues of, each with algebraic multiplicity ; hence,. Matrix can be written as (D.57). For notational simplicity consider dropping the iteration index.if denotes an eigenvalue for with corresponding eigenvector, it clearly holds that. Splitting into and with appropriate dimensions, the latter gives rise to the following system of equations [cf. (D.57)]: (D.58) (D.59) Invoking properties of the matrix pseudo-inverse to manipulate (D.58), yields, which implies that for some. Solving for, and substituting the result back into (D.59) leads to Proof: Since, it clearly holds that (D.60) Matrix can be decomposed as [cf. (D.44)] (D.53) which is valid for all. Re-organizing terms in (D.60) and left-multiplying by, denotes Hermitian transposition, yields the quadratic equation (D.54) (D.61)

16 722 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 Conditions on the values of solution of (D.61), given by can be inferred through the, (D.62) First, consider the case, which yields complex conjugate values for. Next, we prove that for this case. With, it follows that ; hence, it suffices to check whether. This follows readily by noticing that for every, since, which is a positive definite matrix. Next, consider, which yields two distinct real values for. Define the auxiliary variable, which by the definition of equals (D.63) Since both and are positive definite, it follows that. Moreover, consider matrix, and employ Gershgorin s circle theorem [16, Ch. 6] to deduce that it is positive definite for all. Hence, it follows that. The solution of (D.61) in terms of is. Consider the term and note that (D.64) Recalling the fact that (D.61) is valid for all values of, it follows from (D.64) that. Likewise, it follows readily that the smallest possible value for is. Next, consider the case and. From (D.58), it follows that. Upon substituting this value into (D.58) with, one arrives at (D.65) We claim that there is no satisfying (D.65). Arguing by contradiction, suppose that there is such a. Let,, and recall that has rank. Then using the eigen-decomposition of, (D.65) can be written as (D.66) is a diagonal matrix with the eigenvalues of on its diagonal, and is an orthogonal matrix whose th column is the eigenvector of corresponding to. Assume without loss of generality that, and. It then follows from (D.66) that implying that, which is a contradiction since. Hence, (D.65) does not have a solution, and for this case matrix cannot have as an eigenvalue. Finally, consider the case, and. Then, and must satisfy ; thus, with, implying that is an eigen-vector of.we claim that the only possible value for is. Indeed, if, then it must be true that which implies that (D.67) (D.68) Taking the Hermitian transpose on both sides of (D.68) yields, which contradicts the assumption. Hence, this case leads to an eigenvector of zeros, which is not allowed by the definition of eigenvectors [16, Ch. 1]. Therefore, matrix cannot have as an eigenvalue either; which completes the proof of the lemma. After proving in Lemma 5 that, for arbitrarily small, and in Lemma 6 that has all its eigenvalues inside the unit circle, the stability result for slowly time varying linear systems applies readily [28, Sec. C6] and establishes the validity of (23). APPENDIX E PROOF OF COROLLARY 1 When, the membership assignment in (11a) is such that if, with, and otherwise. At a fixed point of Algorithm 1, it thus holds that (E.69) Using the triangle inequality, the first term of (E.69) can be bounded as (E.70) the last inequality follows from Proposition 1. Using (E.70) in (E.69), one arrives at Adding now again the triangle inequality yields (E.71) to (E.71), and using (E.72) (E.73) (E.74)

17 FORERO et al.: DISTRIBUTED CLUSTERING USING WIRELESS SENSOR NETWORKS 723 For sufficiently large, we thus have, keeping membership assignments unchanged; i.e., for all. As a consequence, iterations (D.43) (D.42) become time-invariant ( ), and error-free ( ). Invoking again Lemma 6, we find,or equivalently (E.75) Lemma 2 established that any fixed point of (11a) (11c) is a KKT point of the centralized K-means in (1). Since for the hard K-means algorithm, any KKT point is a local minimum [27], it follows that the convergence point of the distributed hard K-means algorithm is a local minimum as well. APPENDIX F PROOF OF LEMMA 3 At any fixed point of the centralized EM algorithm, denoted by, and, it holds that [cf. (5) and (7)] (F.76a) (F.76b) Clearly, any fixed point of (27a) satisfies (F.76a) as well. Next, we prove that any fixed point in (27c) satisfies (F.76b). At a fixed point and for every, iteration (27c) yields APPENDIX G PROOF OF PROPOSITION 2 Letting, iterations (33) (34) can be written as (G.79) is defined as in (D.44), and. Next, we will show that iteration (G.79) is BIBO stable. Similar to Appendix D, the stability claim holds because satisfies the following. 1), for small enough. 2) All eigenvalues lie inside the unit circle. Arguing as in Lemma 5, 1) follows readily. To establish 2), apply the mean-value theorem to the rows of to arrive at (32), is a diagonal matrix with diagonal entries evaluated at. Under AS1, is a positive definite matrix implying that is positive definite as well. In order to show that the spectral radius, note that there exists a such that is positive definite. Using in lieu of, and mimicking the steps followed in the proof of Lemma 6, yields the desired result. Since, for arbitrarily small, and has all its eigenvalues inside the unit circle, the stability result in [28, Sec. C6] readily applies, and (35) holds. (F.77) Also, at a fixed point of (27e), we have, implying. Using this fact the set of problems in (F.77) can equivalently be solved as (F.78) the equality constraints are enforced at the consensus solution reached by DEM. Note that the quadratic term in the cost of (F.77) vanishes at the minimum. Introducing the equality constraints into the objective function by defining and ; and noting that by construction, it follows that (F.78) is equivalent to (F.76b). A similar reasoning applies to the iterates in (27b), and completes the proof of the lemma. REFERENCES [1] C. Albaladejo, P. Sánchez, A. Iborra, F. Soto, J. A. López, and R. Torres, Wireless sensor networks for oceanographic monitoring: a systematic review, Sensors, vol. 10, no. 7, pp , Jul [2] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Nashua, NH: Athena Scientific, [3] J. C. Bezdek, R. J. Hathaway, M. J. Sabin, and W. T. Tucker, Convergence theory for fuzzy c-means: Counterexamples and repairs, IEEE Trans. Syst., Man, Cybern., vol. SMC-17, no. 5, pp , Sep./Oct [4] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, MA: Cambridge Univ. Press, [5] T. P. Boyer, J. I. Antonov, H. E. Garcia, D. R. Johnson, R. A. Locarnini, A. V. Mishonov, M. T. Pitcher, O. K. Baranova, and I. V. Smolyar, World Ocean Database 2005, S. Levitus, Ed. Washington, D.C.: U.S. Government Printing Office, 2006, vol. 60, NOAA Atlas NESDIS, p [6] S. Dasgupta and Y. Freund, Random projection trees for vector quantization, IEEE Trans. Inf. Theory, vol. 55, no. 7, pp , Jul [7] A. P. Dempster, N. M. Laird, and D. B. 
REFERENCES

[1] C. Albaladejo, P. Sánchez, A. Iborra, F. Soto, J. A. López, and R. Torres, "Wireless sensor networks for oceanographic monitoring: A systematic review," Sensors, vol. 10, no. 7, Jul. 2010.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Nashua, NH: Athena Scientific, 1997.
[3] J. C. Bezdek, R. J. Hathaway, M. J. Sabin, and W. T. Tucker, "Convergence theory for fuzzy c-means: Counterexamples and repairs," IEEE Trans. Syst., Man, Cybern., vol. SMC-17, no. 5, Sep./Oct. 1987.
[4] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[5] T. P. Boyer, J. I. Antonov, H. E. Garcia, D. R. Johnson, R. A. Locarnini, A. V. Mishonov, M. T. Pitcher, O. K. Baranova, and I. V. Smolyar, World Ocean Database 2005, S. Levitus, Ed. Washington, D.C.: U.S. Government Printing Office, 2006, vol. 60, NOAA Atlas NESDIS.
[6] S. Dasgupta and Y. Freund, "Random projection trees for vector quantization," IEEE Trans. Inf. Theory, vol. 55, no. 7, Jul. 2009.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. Ser. B (Methodological), vol. 39, pp. 1-38, 1977.
[8] I. Dhillon and D. Modha, "A data-clustering algorithm on distributed memory multiprocessors," in Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence.
[9] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[10] P. A. Forero, A. Cano, and G. B. Giannakis, "Distributed feature-based modulation classification using wireless sensor networks," in Proc. IEEE Military Commun. Conf., San Diego, CA, Nov. 2008, pp. 1-7.

[11] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based k-means algorithm for distributed learning using wireless sensor networks," in Proc. Workshop Sens., Signal, Inf. Process., Sedona, AZ, May 11-14, 2008.
[12] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based distributed expectation-maximization algorithm for density estimation and classification using wireless sensor networks," in Proc. Int. Conf. Acoust., Speech, Signal Process., Las Vegas, NV, Mar. 30-Apr. 4, 2008.
[13] C. Godsil and G. Royle, Algebraic Graph Theory. New York: Springer, 2001.
[14] D. Gu, "Distributed EM algorithm for Gaussian mixtures in sensor networks," IEEE Trans. Neural Netw., vol. 19, no. 7, Jul. 2008.
[15] N. J. Higham and H. Kim, "Numerical analysis of a quadratic matrix equation," IMA J. Numer. Anal., vol. 20, 2000.
[16] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1985.
[17] R. E. Kass and L. Wasserman, J. Amer. Stat. Assoc., vol. 90, no. 431, Sep. 1995.
[18] W. Kowalczyk and N. Vlassis, "Newscast EM," in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2005.
[19] S. P. Lloyd, "Least-squares quantization in PCM," IEEE Trans. Inf. Theory, vol. IT-28, no. 2, Mar. 1982.
[20] C. E. Lopes, F. D. Linhares, M. M. Santos, and L. B. Ruiz, "A multi-tier, multimodal wireless sensor network for environmental monitoring," in Proc. 4th Int. Conf. Ubiquitous Intell. Comput., Hong Kong, China, Jul. 2007.
[21] R. Neal and G. E. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models, M. I. Jordan, Ed. Norwell, MA: Kluwer, 1998.
[22] M. K. Ng, "K-means-type algorithms on distributed memory computer," Int. J. High Speed Comput.
[23] R. D. Nowak, "Distributed EM algorithms for density estimation and clustering in sensor networks," IEEE Trans. Signal Process., vol. 51, no. 8, Aug. 2003.
[24] C. H. Papadimitriou, Computational Complexity. Reading, MA: Addison-Wesley, 1994.
[25] M. Piórkowski and M. Grossglauser, "Constrained tracking on a road network," in Proc. 3rd Eur. Workshop on Wireless Sensor Networks, Berlin/Heidelberg, Germany, Feb. 2006.
[26] L. Schwiebert, K. S. Gupta, and J. Weinmann, "Research challenges in wireless networks of biomedical sensors," in Proc. 7th Annu. Int. Conf. Mobile Comput. Netw., Rome, Italy, Jul. 2001.
[27] S. Z. Selim and M. A. Ismail, "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality," IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, no. 1, Jan. 1984.
[28] V. Solo and X. Kong, Adaptive Signal Processing Algorithms: Stability and Performance, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[29] J. Wolfe, A. Haghighi, and D. Klein, "Fully distributed EM for very large datasets," in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, Jul. 6-8, 2008.
[30] R. Xu and D. Wunsch II, "Survey of clustering algorithms," IEEE Trans. Neural Netw., vol. 16, no. 3, May 2005.
[31] Y. Zhang, Z. Xiong, J. Mao, and L. Ou, "The study of parallel K-means algorithm," in Proc. 6th World Congr. Intell. Control Autom., Dalian, China, Jun. 2006, vol. 2.
[32] H. Zhu, G. B. Giannakis, and A. Cano, "Distributed in-network channel decoding," IEEE Trans. Signal Process., vol. 57, Oct. 2009.

Pedro A. Forero (S'07) received the Diploma in electronics engineering from Pontificia Universidad Javeriana, Bogota, Colombia, in 2003 and the M.Sc. degree in electrical engineering from Loyola Marymount University, Los Angeles, CA. He is currently working towards the Ph.D. degree with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis. His research interests include statistical signal processing, machine learning, and sensor networks. His current research focuses on distributed and robust algorithms for unsupervised learning. Mr. Forero was a recipient of the Science, Mathematics, and Research for Transformation (SMART) fellowship.

Alfonso Cano (M'07) received the electrical engineering degree and the Ph.D. degree with honors in telecommunications engineering from the Universidad Carlos III de Madrid, Madrid, Spain, in 2002 and 2006, respectively. He was previously with the Department of Signal Theory and Communications, Universidad Rey Juan Carlos, Madrid, Spain. Since 2007, he has been with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, where he is a post-doctoral researcher and lecturer. His general research interests lie in the areas of signal processing, communications, and networking.

Georgios B. Giannakis (F'97) received the Diploma in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1981, and the M.Sc. degree in electrical engineering in 1983, the M.Sc. degree in mathematics in 1986, and the Ph.D. degree in electrical engineering in 1986, all from the University of Southern California (USC), Los Angeles. Since 1999, he has been a Professor with the University of Minnesota, Minneapolis, where he now holds an ADC Chair in Wireless Telecommunications in the Electrical and Computer Engineering Department and serves as Director of the Digital Technology Center. His general interests span the areas of communications, networking, and statistical signal processing, subjects on which he has published more than 300 journal papers, 500 conference papers, two edited books, and two research monographs. His current research focuses on compressive sensing, cognitive radios, network coding, cross-layer designs, mobile ad hoc networks, wireless sensor networks, and social networks. He is the (co-)inventor of 20 issued patents. Prof. Giannakis is the (co-)recipient of seven paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received Technical Achievement Awards from the SP Society (2000) and from EURASIP (2005), a Young Faculty Teaching Award, and the G. W. Taylor Award for Distinguished Research from the University of Minnesota. He is a Fellow of EURASIP and has served the IEEE in a number of posts, including that of a Distinguished Lecturer for the IEEE SP Society.
