Nesterov's Smoothing Technique and Minimizing Differences of Convex Functions for Hierarchical Clustering


Portland State University, PDXScholar
Mathematics and Statistics Faculty Publications and Presentations, Fariborz Maseeh Department of Mathematics and Statistics

Nesterov's Smoothing Technique and Minimizing Differences of Convex Functions for Hierarchical Clustering

Mau Nam Nguyen, Portland State University; Wondi Geremew, Stockton University; Sam Reynolds, Portland State University; Tuyen Tran, Portland State University

Citation Details: Nam, N. M., Geremew, W., Reynolds, S., & Tran, T. (2017). The Nesterov Smoothing Technique and Minimizing Differences of Convex Functions for Hierarchical Clustering. arXiv preprint.
This pre-print is brought to you for free and open access. It has been accepted for inclusion in Mathematics and Statistics Faculty Publications and Presentations by an authorized administrator of PDXScholar.

NESTEROV'S SMOOTHING TECHNIQUE AND MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS FOR HIERARCHICAL CLUSTERING

N. M. Nam¹, W. Geremew², S. Reynolds³ and T. Tran⁴

Abstract. A bilevel hierarchical clustering model is commonly used in designing optimal multicast networks. In this paper, we consider two different formulations of the bilevel hierarchical clustering problem, a discrete optimization problem which can be shown to be NP-hard. Our approach is to reformulate the problem as a continuous optimization problem by relaxing the discreteness conditions. Then Nesterov's smoothing technique and a numerical algorithm for minimizing differences of convex functions called the DCA are applied to cope with the nonsmoothness and nonconvexity of the problem. Numerical examples are provided to illustrate our method.

Key words. DC programming, Nesterov's smoothing technique, hierarchical clustering, subgradient, Fenchel conjugate.

AMS subject classifications. 49J52, 49J53, 90C31

1 Introduction

Although convex optimization techniques and numerical algorithms have been the topics of extensive research for more than 50 years, solving large-scale optimization problems without the presence of convexity remains a challenge. This is a motivation to search for new optimization methods that are capable of handling broader classes of functions and sets where convexity is not assumed. One of the most successful approaches to go beyond convexity is to consider the class of functions representable as differences of two convex functions. Functions of this type are called DC functions, where DC stands for difference of convex. It was recognized early by P. Hartman [10] that the class of DC functions has many nice algebraic properties. For instance, this class of functions is closed under many operations usually considered in optimization, such as taking linear combinations, maxima, or products of a finite number of DC functions.

Given a linear space $X$, a DC program is an optimization problem in which the objective function $f\colon X\to\mathbb{R}$ can be represented as $f = g - h$, where $g, h\colon X\to\mathbb{R}$ are convex functions. This extension of convex programming enables us to take advantage of the available tools from convex analysis and optimization. At the same time, DC programming is sufficiently broad to use in solving many nonconvex optimization problems faced in recent applications. Another feature of DC programming is that it possesses a very nice duality theory; see [3] and the references therein. Although DC programming had been known to be important for many applications much earlier, the first algorithm for minimizing differences of convex functions, called the DCA, was introduced by Tao and An in [3, 9].

¹ Fariborz Maseeh Department of Mathematics and Statistics, Portland State University, Portland, OR 97207, USA (mau.nam.nguyen@pdx.edu). Research of this author was partly supported by the National Science Foundation under grant #
² School of General Studies, Stockton University, Galloway, NJ 08205, USA (wondi.geremew@stockton.edu)
³ Fariborz Maseeh Department of Mathematics and Statistics, Portland State University, Portland, OR 97207, USA (ser6@pdx.edu)
⁴ Fariborz Maseeh Department of Mathematics and Statistics, Portland State University, Portland, OR 97207, USA (tuyen@pdx.edu)

The DCA is a simple but effective optimization scheme used extensively in DC programming and its applications.

Cluster analysis or clustering is one of the most important problems in many fields such as machine learning, pattern recognition, image analysis, data compression, and computer graphics. Given a finite number of data points in a metric space, a centroid-based clustering problem seeks a finite number of cluster centers with each data point assigned to the nearest cluster center in a way that a certain distance, a measure of dissimilarity among the data points, is minimized. Since many kinds of data encountered in practical applications have nested structures, they require multilevel hierarchical clustering, which involves grouping a data set into a hierarchy of clusters.

In this paper, we apply the mathematical optimization approach to the bilevel hierarchical clustering problem. In fact, using mathematical optimization in clustering is a very promising approach to overcome many disadvantages of the k-means algorithm commonly used in clustering; see [1, 2, 5, 15] and the references therein. In particular, the DCA was successfully applied in [2] to a bilevel hierarchical clustering problem in which the distance measurement is defined by the squared Euclidean distance. Although the DCA in [2] provides an effective way to solve the bilevel hierarchical clustering problem in high dimensions, it has not been used to solve the original model proposed in [5], which is defined by the Euclidean distance measurement and, according to the authors of [2], is not suitable for the resulting DCA. By applying Nesterov's smoothing technique and the DCA, we are able to solve the original model proposed in [5] in high dimensions.

The paper is organized as follows. In Section 2, we present basic definitions and tools of optimization that are used throughout the paper. In Section 3, we study two models of the bilevel hierarchical clustering problem, along with two new algorithms based on Nesterov's smoothing technique and the DCA. Numerical examples and conclusions are presented in Section 4 and Section 5, respectively.

2 Basic Definitions and Tools of Optimization

In this section, we present the two main tools of optimization used to solve the bilevel hierarchical clustering problem: the DCA introduced by Pham Dinh Tao and Nesterov's smoothing technique. We consider throughout the paper the DC program
\[
\text{minimize } f(x) := g(x) - h(x),\quad x\in\mathbb{R}^n, \tag{2.1}
\]
where $g\colon\mathbb{R}^n\to\mathbb{R}$ and $h\colon\mathbb{R}^n\to\mathbb{R}$ are convex functions. The function $f$ in (2.1) is called a DC function and $g - h$ is called a DC decomposition of $f$.

Given a convex function $g\colon\mathbb{R}^n\to\mathbb{R}$, the Fenchel conjugate of $g$ is defined by
\[
g^*(y) := \sup\{\langle y, x\rangle - g(x)\mid x\in\mathbb{R}^n\},\quad y\in\mathbb{R}^n.
\]
Note that $g^*\colon\mathbb{R}^n\to(-\infty,\infty]$ is also a convex function. In addition, $x\in\partial g^*(y)$ if and only if $y\in\partial g(x)$, where $\partial$ denotes the subdifferential operator in the sense of convex analysis; see, e.g., [19-21].
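As a simple illustration of this relation (an elementary example, not taken from the clustering model itself), consider $g(x) = \frac{1}{2}\|x\|^2$ on $\mathbb{R}^n$. A direct computation gives
\[
g^*(y) = \sup_{x\in\mathbb{R}^n}\big\{\langle y, x\rangle - \tfrac{1}{2}\|x\|^2\big\} = \tfrac{1}{2}\|y\|^2,
\]
so $\partial g(x) = \{x\}$, $\partial g^*(y) = \{y\}$, and the relation $x\in\partial g^*(y)\iff y\in\partial g(x)$ reduces to $x = y$. In the DCA presented below, this is the step that converts a subgradient of $h$ back into a primal iterate; when $g$ is a simple quadratic, as it will be after smoothing, $\nabla g^*$ is available in closed form.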

Let us present below the DCA introduced by Tao and An [3, 9] as applied to (2.1). Although the algorithm is used for nonconvex optimization problems, the convexity of the functions involved still plays a crucial role.

The DCA
INPUT: $x_0\in\mathbb{R}^n$, $N\in\mathbb{N}$.
for $k = 1,\ldots,N$ do
    Find $y_k\in\partial h(x_{k-1})$.
    Find $x_k\in\partial g^*(y_k)$.
end for
OUTPUT: $x_N$.

Let us discuss below a convergence result of DC programming. A function $h\colon\mathbb{R}^n\to\mathbb{R}$ is called $\gamma$-convex ($\gamma\geq 0$) if the function defined by $k(x) := h(x) - \frac{\gamma}{2}\|x\|^2$, $x\in\mathbb{R}^n$, is convex. If there exists $\gamma > 0$ such that $h$ is $\gamma$-convex, then $h$ is called strongly convex. We say that an element $\bar{x}\in\mathbb{R}^n$ is a critical point of the function $f$ defined by (2.1) if $\partial g(\bar{x})\cap\partial h(\bar{x})\neq\emptyset$. Obviously, in the case where both $g$ and $h$ are differentiable, $\bar{x}$ is a critical point of $f$ if and only if $\bar{x}$ satisfies the Fermat rule $\nabla f(\bar{x}) = 0$. The theorem below provides a convergence result for the DCA. It can be derived directly from [9, Theorem 3.7].

Theorem 2.1 Consider the function $f$ defined by (2.1) and the sequence $\{x_k\}$ generated by the DCA. The following properties are valid:
(i) If $g$ is $\gamma_1$-convex and $h$ is $\gamma_2$-convex, then
\[
f(x_k) - f(x_{k+1}) \geq \frac{\gamma_1 + \gamma_2}{2}\,\|x_{k+1} - x_k\|^2\ \text{ for all } k\in\mathbb{N}.
\]
(ii) The sequence $\{f(x_k)\}$ is monotone decreasing.
(iii) If $f$ is bounded from below, $g$ is $\gamma_1$-convex and $h$ is $\gamma_2$-convex with $\gamma_1 + \gamma_2 > 0$, and $\{x_k\}$ is bounded, then every subsequential limit of the sequence $\{x_k\}$ is a critical point of $f$.

Now we present a direct consequence of Nesterov's smoothing technique given in [16]. In the proposition below, $d(x;\Omega)$ denotes the Euclidean distance and $P(x;\Omega)$ denotes the Euclidean projection from a point $x$ to a nonempty closed convex set $\Omega$ in $\mathbb{R}^n$.

Proposition 2.2 Given any $a\in\mathbb{R}^n$ and $\mu > 0$, a Nesterov smoothing approximation of $\varphi(x) := \|x - a\|$ defined on $\mathbb{R}^n$ has the representation
\[
\varphi_\mu(x) := \frac{1}{2\mu}\|x - a\|^2 - \frac{\mu}{2}\Big[d\Big(\frac{x - a}{\mu};\mathbb{B}\Big)\Big]^2. \tag{2.2}
\]
Moreover, $\nabla\varphi_\mu(x) = P\big(\frac{x - a}{\mu};\mathbb{B}\big)$ and
\[
\varphi_\mu(x)\leq\varphi(x)\leq\varphi_\mu(x) + \frac{\mu}{2},
\]
where $\mathbb{B}$ is the closed unit ball of $\mathbb{R}^n$.
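To make the role of Proposition 2.2 concrete, the short sketch below is an illustration in Python/NumPy (not the MATLAB implementation used for the experiments in Section 4); it evaluates $\varphi_\mu$, its gradient, and the approximation bound at a random point. The names `phi_mu`, `grad_phi_mu`, and `proj_unit_ball` are ours.

```python
import numpy as np

def proj_unit_ball(y):
    """Euclidean projection of y onto the closed unit ball B."""
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm

def phi_mu(x, a, mu):
    """Nesterov smoothing of phi(x) = ||x - a||, cf. (2.2)."""
    y = (x - a) / mu
    dist_to_B = np.linalg.norm(y - proj_unit_ball(y))    # d((x - a)/mu; B)
    return np.dot(x - a, x - a) / (2.0 * mu) - 0.5 * mu * dist_to_B**2

def grad_phi_mu(x, a, mu):
    """Gradient of the smoothed function: P((x - a)/mu; B)."""
    return proj_unit_ball((x - a) / mu)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, x, mu = rng.normal(size=3), rng.normal(size=3), 0.1
    phi = np.linalg.norm(x - a)
    approx = phi_mu(x, a, mu)
    # The bound phi_mu <= phi <= phi_mu + mu/2 from Proposition 2.2:
    assert approx <= phi + 1e-12 and phi <= approx + mu / 2 + 1e-12
    print(phi, approx, grad_phi_mu(x, a, mu))
```

For $\|x - a\|\geq\mu$ one can check by hand that $\varphi_\mu(x) = \|x - a\| - \mu/2$, so the upper bound in the proposition is tight in that regime.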

3 The Bilevel Hierarchical Clustering Problem

Given a set of $m$ points (nodes) $a^1, a^2,\ldots, a^m$ in $\mathbb{R}^n$, our goal is to decompose this set into $k$ clusters. In each cluster, we would like to find a point $x^l$ among the nodes and assign it as the center of this cluster, with all points in the cluster connected to this center. Then we will find a total center among the given points $a^1, a^2,\ldots, a^m$, and all centers are connected to this total center. The goal is to minimize the total transportation cost in this tree, computed as the sum of the distances from the total center to each center and from each center to the nodes in its cluster. This is a discrete optimization problem which can be shown to be NP-hard. We will solve this problem based on continuous optimization techniques.

3.1 The Bilevel Hierarchical Clustering: Model I

The difficulty in solving this hierarchical clustering problem lies in the fact that the centers and the total center have to be among the nodes. We first relax this condition with the use of artificial centers $x^1, x^2,\ldots, x^k$ that could be anywhere in $\mathbb{R}^n$. Then we define the total center as the centroid of $x^1, x^2,\ldots, x^k$ given by
\[
\bar{x} := \frac{1}{k}(x^1 + x^2 + \cdots + x^k).
\]
The total cost of the tree is given by
\[
\varphi(x^1,\ldots,x^k) := \sum_{i=1}^m \min_{l=1,\ldots,k}\|x^l - a^i\| + \sum_{l=1}^k\|x^l - \bar{x}\|.
\]
Note that here each $a^i$ is assigned to its closest center. However, what we expect are the real centers, which can be approximated by trying to minimize the difference between the artificial centers and the real centers. To achieve this goal, define the function
\[
\phi(x^1,\ldots,x^k) := \sum_{l=1}^k \min_{i=1,\ldots,m}\|x^l - a^i\|,\quad x^1,\ldots,x^k\in\mathbb{R}^n.
\]
Observe that $\phi(x^1,\ldots,x^k) = 0$ if and only if for every $l = 1,\ldots,k$ there exists $i\in\{1,\ldots,m\}$ such that $x^l = a^i$, which means that $x^l$ is a real node. Therefore, we consider the constrained minimization problem:
\[
\text{minimize } \varphi(x^1,\ldots,x^k)\ \text{ subject to }\ \phi(x^1,\ldots,x^k) = 0.
\]
This problem can be converted to an unconstrained minimization problem:
\[
\text{minimize } f_\lambda(x^1,\ldots,x^k) := \varphi(x^1,\ldots,x^k) + \lambda\,\phi(x^1,\ldots,x^k),\quad x^1,\ldots,x^k\in\mathbb{R}^n, \tag{3.1}
\]
where $\lambda > 0$ is a penalty parameter. Similar to the situation with the clustering problem, this new problem is nonsmooth and nonconvex, and it can be solved by smoothing techniques and the DCA.
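Before turning to the DC decomposition, the following short sketch (our own illustration in Python/NumPy, not the implementation used in Section 4) evaluates $\varphi$, $\phi$, and the penalized objective $f_\lambda$ of (3.1) directly from their definitions; here `X` stores the artificial centers $x^1,\ldots,x^k$ as rows, `A` stores the nodes $a^1,\ldots,a^m$ as rows, and the example data are arbitrary.

```python
import numpy as np

def model1_objective(X, A, lam):
    """Evaluate f_lambda = varphi + lam * phi of (3.1) for centers X (k x n) and nodes A (m x n)."""
    xbar = X.mean(axis=0)                               # centroid of the artificial centers
    # pairwise Euclidean distances ||x^l - a^i||, shape (k, m)
    D = np.linalg.norm(X[:, None, :] - A[None, :, :], axis=2)
    varphi = D.min(axis=0).sum() + np.linalg.norm(X - xbar, axis=1).sum()
    phi = D.min(axis=1).sum()                           # distance of each center to its nearest node
    return varphi + lam * phi

# Example: 18 random planar nodes, k = 2 centers initialized at two of the nodes.
rng = np.random.default_rng(1)
A = rng.normal(size=(18, 2))
X = A[rng.choice(18, size=2, replace=False)]
print(model1_objective(X, A, lam=0.1))
```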

Note that a particular case of this model in two dimensions was considered in [5], where the problem was solved using the derivative-free discrete gradient method established in [4]; this method, however, is not suitable for large-scale settings in high dimensions. The DCA was used in [2] to solve a similar model with the squared Euclidean distance function used as the distance measurement. In that paper, the authors also pointed out that (3.1) is difficult to handle directly, as it is not suitable for the resulting DCA. Nevertheless, we will show in what follows that the DCA is applicable to this model when combined with Nesterov's smoothing technique.

Note that the functions $\varphi$ and $\phi$ in (3.1) belong to the class of DC functions with the following DC decompositions:
\[
\varphi(x^1,\ldots,x^k) = \sum_{i=1}^m\sum_{l=1}^k\|x^l - a^i\| + \sum_{l=1}^k\|x^l - \bar{x}\| - \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{l=1,\,l\neq r}^k\|x^l - a^i\|,
\]
\[
\phi(x^1,\ldots,x^k) = \sum_{l=1}^k\sum_{i=1}^m\|x^l - a^i\| - \sum_{l=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^l - a^i\|.
\]
It follows that the objective function $f_\lambda$ in (3.1) has the DC decomposition:
\[
f_\lambda(x^1,\ldots,x^k) = (1+\lambda)\sum_{i=1}^m\sum_{l=1}^k\|x^l - a^i\| + \sum_{l=1}^k\|x^l - \bar{x}\| - \Big[\sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{l=1,\,l\neq r}^k\|x^l - a^i\| + \lambda\sum_{l=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^l - a^i\|\Big].
\]
This DC decomposition is not suitable for applying the DCA because there is no closed form for a subgradient of the function $g$ involved. In the next step, we apply Nesterov's smoothing technique from Proposition 2.2 to approximate the objective function $f_\lambda$ by a new DC function favorable for applying the DCA. To accomplish this goal, we simply replace each term of the form $\|x - a\|$ in the first part of $f_\lambda(x^1,\ldots,x^k)$ (the positive part) by the smooth approximation (2.2), while keeping the second part (the negative part) the same. As a result, we obtain
\[
f_{\lambda\mu}(x^1,\ldots,x^k) := \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{l=1}^k\|x^l - a^i\|^2 - \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{l=1}^k\Big[d\Big(\frac{x^l - a^i}{\mu};\mathbb{B}\Big)\Big]^2 + \frac{1}{2\mu}\sum_{l=1}^k\|x^l - \bar{x}\|^2 - \frac{\mu}{2}\sum_{l=1}^k\Big[d\Big(\frac{x^l - \bar{x}}{\mu};\mathbb{B}\Big)\Big]^2 - \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{l=1,\,l\neq r}^k\|x^l - a^i\| - \lambda\sum_{l=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^l - a^i\|.
\]
The original bilevel hierarchical clustering problem can now be solved using the DC program:
\[
\text{minimize } f_{\lambda\mu}(x^1,\ldots,x^k) = g_{\lambda\mu}(x^1,\ldots,x^k) - h_{\lambda\mu}(x^1,\ldots,x^k),\quad x^1,\ldots,x^k\in\mathbb{R}^n.
\]

In this formulation, $g_{\lambda\mu}$ and $h_{\lambda\mu}$ are convex functions on $(\mathbb{R}^n)^k$ defined by
\[
g_{\lambda\mu}(x^1,\ldots,x^k) := g^1_{\lambda\mu}(x^1,\ldots,x^k) + g^2_{\mu}(x^1,\ldots,x^k),
\]
\[
h_{\lambda\mu}(x^1,\ldots,x^k) := h^1_{\lambda\mu}(x^1,\ldots,x^k) + h^2_{\mu}(x^1,\ldots,x^k) + h^3(x^1,\ldots,x^k) + h^4_{\lambda}(x^1,\ldots,x^k),
\]
with their respective components defined as
\[
g^1_{\lambda\mu}(x^1,\ldots,x^k) := \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{l=1}^k\|x^l - a^i\|^2,\qquad g^2_{\mu}(x^1,\ldots,x^k) := \frac{1}{2\mu}\sum_{l=1}^k\|x^l - \bar{x}\|^2,
\]
\[
h^1_{\lambda\mu}(x^1,\ldots,x^k) := \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{l=1}^k\Big[d\Big(\frac{x^l - a^i}{\mu};\mathbb{B}\Big)\Big]^2,\qquad h^2_{\mu}(x^1,\ldots,x^k) := \frac{\mu}{2}\sum_{l=1}^k\Big[d\Big(\frac{x^l - \bar{x}}{\mu};\mathbb{B}\Big)\Big]^2,
\]
\[
h^3(x^1,\ldots,x^k) := \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{l=1,\,l\neq r}^k\|x^l - a^i\|,\qquad h^4_{\lambda}(x^1,\ldots,x^k) := \lambda\sum_{l=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^l - a^i\|.
\]
To facilitate the gradient and subgradient calculations for the DCA, we introduce a data matrix $A$ and a variable matrix $X$. The data matrix $A$ is formed by putting each $a^i$, $i = 1,\ldots,m$, in the $i$th row, i.e.,
\[
A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n}\\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn}\end{bmatrix}.
\]
Similarly, if $x^1,\ldots,x^k$ are the $k$ cluster centers, then the variable matrix $X$ is formed by putting each $x^l$, $l = 1,\ldots,k$, in the $l$th row, i.e.,
\[
X = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n}\\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ x_{k1} & x_{k2} & x_{k3} & \cdots & x_{kn}\end{bmatrix}.
\]
Then the variable matrix $X$ of the optimization problem belongs to $\mathbb{R}^{k\times n}$, the linear space of $k\times n$ real matrices equipped with the inner product $\langle X, Y\rangle := \operatorname{trace}(X^T Y)$. The Frobenius norm on $\mathbb{R}^{k\times n}$ is defined by
\[
\|X\|_F := \sqrt{\langle X, X\rangle} = \sqrt{\sum_{l=1}^k\langle x^l, x^l\rangle} = \sqrt{\sum_{l=1}^k\|x^l\|^2}.
\]
Finally, we represent the average of the $k$ cluster centers by $\bar{x}$, i.e., $\bar{x} := \frac{1}{k}\sum_{j=1}^k x^j$.

Gradient and Subgradient Calculations for the DCA

Let us start by computing the gradient of $g_{\lambda\mu}(X) = g^1_{\lambda\mu}(X) + g^2_{\mu}(X)$.

Using the Frobenius norm, the function $g^1_{\lambda\mu}$ can equivalently be written as
\[
g^1_{\lambda\mu}(X) = \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{l=1}^k\|x^l - a^i\|^2 = \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{l=1}^k\big[\|x^l\|^2 - 2\langle x^l, a^i\rangle + \|a^i\|^2\big] = \frac{1+\lambda}{2\mu}\big[m\|X\|_F^2 - 2\langle X, E_{km}A\rangle + k\|A\|_F^2\big],
\]
where $E_{km}$ is the $k\times m$ matrix whose entries are all ones. Hence, one can see that $g^1_{\lambda\mu}$ is differentiable and its gradient is given by
\[
\nabla g^1_{\lambda\mu}(X) = \frac{1+\lambda}{\mu}\big[mX - E_{km}A\big].
\]
Similarly, $g^2_{\mu}$ can equivalently be written as
\[
g^2_{\mu}(X) = \frac{1}{2\mu}\sum_{l=1}^k\|x^l - \bar{x}\|^2 = \frac{1}{2\mu}\sum_{l=1}^k\big[\|x^l\|^2 - 2\langle x^l,\bar{x}\rangle + \|\bar{x}\|^2\big] = \frac{1}{2\mu}\Big[\|X\|_F^2 - \frac{2}{k}\langle X, E_{kk}X\rangle + \frac{1}{k}\langle X, E_{kk}X\rangle\Big] = \frac{1}{2\mu}\Big[\|X\|_F^2 - \frac{1}{k}\langle X, E_{kk}X\rangle\Big],
\]
where $E_{kk}$ is the $k\times k$ matrix whose entries are all ones. Hence, $g^2_{\mu}$ is differentiable and its gradient is given by
\[
\nabla g^2_{\mu}(X) = \frac{1}{\mu}\Big[X - \frac{1}{k}E_{kk}X\Big].
\]
Since $g_{\lambda\mu}(X) = g^1_{\lambda\mu}(X) + g^2_{\mu}(X)$, its gradient can be computed by
\[
\nabla g_{\lambda\mu}(X) = \nabla g^1_{\lambda\mu}(X) + \nabla g^2_{\mu}(X) = \frac{1+\lambda}{\mu}\big[mX - E_{km}A\big] + \frac{1}{\mu}\Big[X - \frac{1}{k}E_{kk}X\Big] = \frac{1}{\mu}\Big[\big[(1+\lambda)m + 1\big]X - \frac{1}{k}E_{kk}X - (1+\lambda)E_{km}A\Big].
\]
Therefore,
\[
\nabla g_{\lambda\mu}(X) = \frac{1}{\mu}\Big[\big(((1+\lambda)m + 1)I_k - J\big)X - (1+\lambda)S\Big],
\]
where $J = \frac{1}{k}E_{kk}$ and $S = E_{km}A$.
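As a quick sanity check on this closed form, the following sketch (our own Python/NumPy illustration; the arrays mirror $J$, $S$, and $E_{km}$ above, and the test data are arbitrary) compares the formula for $\nabla g_{\lambda\mu}$ with a finite-difference approximation of $g_{\lambda\mu} = g^1_{\lambda\mu} + g^2_{\mu}$.

```python
import numpy as np

def g_value(X, A, lam, mu):
    """g_{lam,mu}(X) = g1 + g2 for Model I."""
    xbar = X.mean(axis=0)
    g1 = (1 + lam) / (2 * mu) * np.sum(np.linalg.norm(X[:, None, :] - A[None, :, :], axis=2) ** 2)
    g2 = 1 / (2 * mu) * np.sum(np.linalg.norm(X - xbar, axis=1) ** 2)
    return g1 + g2

def g_grad(X, A, lam, mu):
    """Closed form: (1/mu)[(((1+lam)m + 1)I - J)X - (1+lam)S]."""
    k, m = X.shape[0], A.shape[0]
    J_X = np.tile(X.mean(axis=0), (k, 1))            # J X, with J = (1/k) E_kk
    S = np.tile(A.sum(axis=0), (k, 1))               # S = E_km A
    return (((1 + lam) * m + 1) * X - J_X - (1 + lam) * S) / mu

rng = np.random.default_rng(2)
A, X = rng.normal(size=(18, 2)), rng.normal(size=(3, 2))
lam, mu, eps = 0.5, 0.1, 1e-6
G = g_grad(X, A, lam, mu)
# central finite differences, entry by entry (g is quadratic, so they agree up to roundoff)
for l in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X); E[l, j] = eps
        fd = (g_value(X + E, A, lam, mu) - g_value(X - E, A, lam, mu)) / (2 * eps)
        assert abs(fd - G[l, j]) < 1e-4
```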

Our goal now is to compute $\nabla g^*_{\lambda\mu}(Y)$, which can be accomplished by the relation
\[
X = \nabla g^*_{\lambda\mu}(Y)\ \text{ if and only if }\ Y = \nabla g_{\lambda\mu}(X).
\]
The latter can equivalently be written as
\[
\big[((1+\lambda)m + 1)I_k - J\big]X = (1+\lambda)S + \mu Y.
\]
Then, with some algebraic manipulation, we can show that
\[
\nabla g^*_{\lambda\mu}(Y) = X = \Big[\frac{1}{1 + (1+\lambda)m}\,I_k + \frac{1}{[1 + (1+\lambda)m](1+\lambda)m}\,J\Big]\big[(1+\lambda)S + \mu Y\big].
\]
Next, we will demonstrate in more detail the techniques we used in finding a subgradient of the convex function $h_{\lambda\mu}$. Recall that $h_{\lambda\mu}$ is defined by
\[
h_{\lambda\mu}(X) = h^1_{\lambda\mu}(X) + h^2_{\mu}(X) + h^3(X) + h^4_{\lambda}(X).
\]
We will start with the function $h^1_{\lambda\mu}$ given by
\[
h^1_{\lambda\mu}(X) = \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{l=1}^k\Big[d\Big(\frac{x^l - a^i}{\mu};\mathbb{B}\Big)\Big]^2.
\]
From this representation, one can see that $h^1_{\lambda\mu}$ is differentiable, and hence its subgradient coincides with its gradient, which can be computed from the partial derivatives with respect to $x^1,\ldots,x^k$, i.e.,
\[
\frac{\partial h^1_{\lambda\mu}}{\partial x^l}(X) = (1+\lambda)\sum_{i=1}^m\Big[\frac{x^l - a^i}{\mu} - P\Big(\frac{x^l - a^i}{\mu};\mathbb{B}\Big)\Big].
\]
Thus, $\nabla h^1_{\lambda\mu}(X)$ is the $k\times n$ matrix $U$ whose $l$th row is $\frac{\partial h^1_{\lambda\mu}}{\partial x^l}(X)$, $l = 1,2,\ldots,k$. Similarly, one can see that the function $h^2_{\mu}$ given by
\[
h^2_{\mu}(X) = \frac{\mu}{2}\sum_{l=1}^k\Big[d\Big(\frac{x^l - \bar{x}}{\mu};\mathbb{B}\Big)\Big]^2
\]
is differentiable, with its partial derivatives computed by
\[
\frac{\partial h^2_{\mu}}{\partial x^l}(X) = \Big[\frac{x^l - \bar{x}}{\mu} - P\Big(\frac{x^l - \bar{x}}{\mu};\mathbb{B}\Big)\Big] - \frac{1}{k}\sum_{j=1}^k\Big[\frac{x^j - \bar{x}}{\mu} - P\Big(\frac{x^j - \bar{x}}{\mu};\mathbb{B}\Big)\Big].
\]
Hence, $\nabla h^2_{\mu}(X)$ is the $k\times n$ matrix $V$ whose $l$th row is $\frac{\partial h^2_{\mu}}{\partial x^l}(X)$, $l = 1,2,\ldots,k$. Unlike $h^1_{\lambda\mu}$ and $h^2_{\mu}$, the convex functions $h^3$ and $h^4_{\lambda}$ are not differentiable, but both can be written as a finite sum of maxima of finite numbers of convex functions. Let us compute a subgradient of $h^3$ as an example. We have
\[
h^3(X) = \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{l=1,\,l\neq r}^k\|x^l - a^i\| = \sum_{i=1}^m\gamma_i(X),
\]

where, for $i = 1,\ldots,m$,
\[
\gamma_i(X) := \max_{r=1,\ldots,k}\gamma_{ir}(X),\qquad \gamma_{ir}(X) = \sum_{l=1,\,l\neq r}^k\|x^l - a^i\|,\quad r = 1,\ldots,k.
\]
Then, for each $i = 1,\ldots,m$, we find $W_i\in\partial\gamma_i(X)$ according to the subdifferential rule for the maximum of convex functions, and define $W := \sum_{i=1}^m W_i$ to get a subgradient of the function $h^3$ at $X$ by the subdifferential sum rule. To accomplish this goal, we first choose an index $r$ from the index set $\{1,\ldots,k\}$ such that
\[
\gamma_i(X) = \gamma_{ir}(X) = \sum_{l=1,\,l\neq r}^k\|x^l - a^i\|.
\]
Using the familiar subdifferential formula for the Euclidean norm function, the $l$th row $w_i^l$, $l\neq r$, of the matrix $W_i$ is determined as follows:
\[
w_i^l := \begin{cases}\dfrac{x^l - a^i}{\|x^l - a^i\|} & \text{if } x^l\neq a^i,\\[1ex] u\in\mathbb{B} & \text{if } x^l = a^i.\end{cases}
\]
The $r$th row of the matrix $W_i$ is $w_i^r := 0$. The procedure for calculating a subgradient of the function $h^4_{\lambda}$ given by
\[
h^4_{\lambda}(x^1,\ldots,x^k) = \lambda\sum_{l=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^l - a^i\|
\]
is very similar to what we just demonstrated for $h^3$.

At this point, we are ready to give a new DCA-based algorithm for Model I.

Algorithm 1.
INPUT: $X_0$, $\lambda_0$, $\mu_0$, $N\in\mathbb{N}$.
while stopping criteria$(\lambda,\mu)$ = false do
    for $k = 1,2,3,\ldots,N$ do
        Find $Y_k\in\partial h_{\lambda\mu}(X_{k-1})$.
        Find $X_k = \nabla g^*_{\lambda\mu}(Y_k)$.
    end for
    update $\lambda$ and $\mu$.
end while
OUTPUT: $X_N$.
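The following is our own Python/NumPy illustration of one inner iteration of Algorithm 1, a sketch built from the formulas above rather than the MATLAB code used in Section 4. The names `proj_rows_unit_ball`, `subgrad_h`, `grad_g_conj`, and `dca_step` are ours, and at a point where some $x^l$ coincides with a node we take $u = 0\in\mathbb{B}$ in the subgradient formula.

```python
import numpy as np

def proj_rows_unit_ball(Y):
    """Project each row of Y onto the closed unit ball B."""
    nrm = np.linalg.norm(Y, axis=-1, keepdims=True)
    return Y / np.maximum(nrm, 1.0)

def subgrad_h(X, A, lam, mu):
    """A subgradient of h = h1 + h2 + h3 + h4 at X (k x n), data A (m x n)."""
    k, n = X.shape
    m = A.shape[0]
    D = X[:, None, :] - A[None, :, :]                 # x^l - a^i, shape (k, m, n)
    dist = np.linalg.norm(D, axis=2)                  # ||x^l - a^i||, shape (k, m)
    # h1: (1+lam) * sum_i [ (x^l - a^i)/mu - P((x^l - a^i)/mu; B) ]
    U = (1 + lam) * np.sum(D / mu - proj_rows_unit_ball(D / mu), axis=1)
    # h2: row-wise term minus its average over the rows
    G = (X - X.mean(axis=0)) / mu
    G = G - proj_rows_unit_ball(G)
    V = G - G.mean(axis=0)
    # h3: for each node i, max_r sum_{l != r} ||x^l - a^i|| is attained by dropping
    # the nearest center r; rows l != r get the unit vector (x^l - a^i)/||x^l - a^i||.
    safe = np.where(dist > 0, dist, 1.0)
    W = np.zeros((k, n))
    for i in range(m):
        r = np.argmin(dist[:, i])
        Wi = np.where(dist[:, [i]] > 0, D[:, i, :] / safe[:, [i]], 0.0)
        Wi[r] = 0.0
        W += Wi
    # h4: for each center l, max_s sum_{i != s} ||x^l - a^i|| drops the nearest node s.
    Z = np.zeros((k, n))
    for l in range(k):
        s = np.argmin(dist[l, :])
        rows = np.where(dist[l, :, None] > 0, D[l, :, :] / safe[l, :, None], 0.0)
        Z[l] = lam * (rows.sum(axis=0) - rows[s])
    return U + V + W + Z

def grad_g_conj(Y, A, lam, mu, k):
    """Closed-form nabla g*(Y): [(1/c)I + J/(c(c-1))]((1+lam)S + mu*Y), with c = (1+lam)m + 1."""
    m = A.shape[0]
    c = (1 + lam) * m + 1
    R = (1 + lam) * np.tile(A.sum(axis=0), (k, 1)) + mu * Y
    return R / c + np.tile(R.mean(axis=0), (k, 1)) / (c * (c - 1))

def dca_step(X, A, lam, mu):
    """One DCA iteration X_k -> X_{k+1} for the smoothed Model I objective."""
    return grad_g_conj(subgrad_h(X, A, lam, mu), A, lam, mu, X.shape[0])
```

With these two pieces, the inner loop of Algorithm 1 amounts to repeating `X = dca_step(X, A, lam, mu)` $N$ times, with $\lambda$ and $\mu$ updated in the outer loop as described in Section 4.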

3.2 The Bilevel Hierarchical Clustering: Model II

In this section, we introduce the second model to solve the bilevel hierarchical clustering problem. In this model, we use an additional variable $x^{k+1}$ to denote the total center. At first we allow the total center $x^{k+1}$ to be a free point in $\mathbb{R}^n$, the same as the $k$ cluster centers. Then the total cost of the tree is given by
\[
\varphi(x^1,\ldots,x^{k+1}) := \sum_{i=1}^m\min_{l=1,\ldots,k}\|x^l - a^i\| + \sum_{l=1}^k\|x^l - x^{k+1}\|,\quad x^1,\ldots,x^{k+1}\in\mathbb{R}^n.
\]
To force the $k+1$ centers to be chosen from the given nodes (or to make them as close to the nodes as possible), we impose the constraint
\[
\phi(x^1,\ldots,x^{k+1}) := \sum_{l=1}^{k+1}\min_{i=1,\ldots,m}\|x^l - a^i\| = 0.
\]
Our goal is to solve the optimization problem
\[
\text{minimize } \varphi(x^1,\ldots,x^{k+1})\ \text{ subject to }\ \phi(x^1,\ldots,x^{k+1}) = 0,\quad x^1,\ldots,x^{k+1}\in\mathbb{R}^n.
\]
Similar to the first model, this problem formulation can be converted to an unconstrained minimization problem involving a penalty parameter $\lambda > 0$:
\[
\text{minimize } f_\lambda(x^1,\ldots,x^{k+1}) := \varphi(x^1,\ldots,x^{k+1}) + \lambda\,\phi(x^1,\ldots,x^{k+1}),\quad x^1,\ldots,x^{k+1}\in\mathbb{R}^n. \tag{3.2}
\]
Next, we apply Nesterov's smoothing technique to get an approximation of the objective function $f_\lambda$ given in (3.2), which involves the two parameters $\lambda > 0$ and $\mu > 0$:
\[
f_{\lambda\mu}(X) := \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{l=1}^k\|x^l - a^i\|^2 - \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{l=1}^k\Big[d\Big(\frac{x^l - a^i}{\mu};\mathbb{B}\Big)\Big]^2 + \frac{\lambda}{2\mu}\sum_{i=1}^m\|x^{k+1} - a^i\|^2 - \frac{\lambda\mu}{2}\sum_{i=1}^m\Big[d\Big(\frac{x^{k+1} - a^i}{\mu};\mathbb{B}\Big)\Big]^2 + \frac{1}{2\mu}\sum_{l=1}^k\|x^l - x^{k+1}\|^2 - \frac{\mu}{2}\sum_{l=1}^k\Big[d\Big(\frac{x^l - x^{k+1}}{\mu};\mathbb{B}\Big)\Big]^2 - \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{l=1,\,l\neq r}^k\|x^l - a^i\| - \lambda\sum_{l=1}^{k+1}\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^l - a^i\|.
\]
As we will show in what follows, it is convenient to apply the DCA to minimize the function $f_{\lambda\mu}$. This function can be represented as the difference of two convex functions defined on $\mathbb{R}^{(k+1)\times n}$ using a variable $X$ whose $i$th row is $x^i$ for $i = 1,\ldots,k+1$:
\[
f_{\lambda\mu}(X) = g_{\lambda\mu}(X) - h_{\lambda\mu}(X),\quad X\in\mathbb{R}^{(k+1)\times n}.
\]
In this formulation, $g_{\lambda\mu}$ and $h_{\lambda\mu}$ are convex functions defined on $\mathbb{R}^{(k+1)\times n}$ by
\[
g_{\lambda\mu}(X) := g^1_{\lambda\mu}(X) + g^2_{\mu}(X)\quad\text{and}\quad h_{\lambda\mu}(X) := h^1_{\mu}(X) + h^2_{\lambda\mu}(X) + h^3_{\lambda\mu}(X) + h^4_{\mu}(X) + h^5(X) + h^6_{\lambda}(X),
\]

with their respective components given by
\[
g^1_{\lambda\mu}(X) := \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{l=1}^{k+1}\|x^l - a^i\|^2,\qquad g^2_{\mu}(X) := \frac{1}{2\mu}\sum_{l=1}^k\|x^l - x^{k+1}\|^2,
\]
\[
h^1_{\mu}(X) := \frac{1}{2\mu}\sum_{i=1}^m\|x^{k+1} - a^i\|^2,\qquad h^2_{\lambda\mu}(X) := \frac{\lambda\mu}{2}\sum_{i=1}^m\Big[d\Big(\frac{x^{k+1} - a^i}{\mu};\mathbb{B}\Big)\Big]^2,
\]
\[
h^3_{\lambda\mu}(X) := \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{l=1}^k\Big[d\Big(\frac{x^l - a^i}{\mu};\mathbb{B}\Big)\Big]^2,\qquad h^4_{\mu}(X) := \frac{\mu}{2}\sum_{l=1}^k\Big[d\Big(\frac{x^l - x^{k+1}}{\mu};\mathbb{B}\Big)\Big]^2,
\]
\[
h^5(X) := \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{l=1,\,l\neq r}^k\|x^l - a^i\|,\qquad h^6_{\lambda}(X) := \lambda\sum_{l=1}^{k+1}\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^l - a^i\|.
\]

Gradient and Subgradient Calculations for the DCA

Let us start by computing the gradient of the first part of the DC decomposition, i.e., $g_{\lambda\mu}(X) = g^1_{\lambda\mu}(X) + g^2_{\mu}(X)$. By applying techniques similar to those used in computing gradients/subgradients for Model I, the gradient of $g^1_{\lambda\mu}$ can be written as
\[
\nabla g^1_{\lambda\mu}(X) = \frac{1+\lambda}{\mu}\big[mX - EA\big],
\]
where $E$ is the $(k+1)\times m$ matrix whose entries are all ones. Similarly, the gradient of $g^2_{\mu}$ can be written as
\[
\nabla g^2_{\mu}(X) = \frac{1}{\mu}\big[I + T\big]X,
\]
where $I$ is the $(k+1)\times(k+1)$ identity matrix and $T$ is the $(k+1)\times(k+1)$ matrix whose entries are all zeros except for its $(k+1)$th row and $(k+1)$th column, which are both filled by the vector $(-1,-1,\ldots,-1,k-1)$. It follows that
\[
\nabla g_{\lambda\mu}(X) = \nabla g^1_{\lambda\mu}(X) + \nabla g^2_{\mu}(X) = \frac{1+\lambda}{\mu}\big[mX - EA\big] + \frac{1}{\mu}\big[I + T\big]X = \frac{1}{\mu}\big[c_1 I + T\big]X - \frac{1+\lambda}{\mu}EA,
\]
where $c_1 = 1 + (1+\lambda)m$. Our goal now is to compute $\nabla g^*_{\lambda\mu}(Y)$, which can be accomplished by the relation
\[
X = \nabla g^*_{\lambda\mu}(Y)\ \text{ if and only if }\ Y = \nabla g_{\lambda\mu}(X).
\]
The latter can equivalently be written as
\[
\big[c_1 I + T\big]X = (1+\lambda)EA + \mu Y,
\]

whose solution can be computed explicitly row by row:
\[
x^l = \frac{\big[(1+\lambda)EA + \mu Y\big]^l + x^{k+1}}{c_1}\ \text{ for } l = 1,\ldots,k,\qquad
x^{k+1} = \frac{c_1\big[(1+\lambda)EA + \mu Y\big]^{k+1} + \sum_{l=1}^k\big[(1+\lambda)EA + \mu Y\big]^l}{(c_1 + k)(c_1 - 1)}.
\]
In the representation
\[
h_{\lambda\mu}(X) = h^1_{\mu}(X) + h^2_{\lambda\mu}(X) + h^3_{\lambda\mu}(X) + h^4_{\mu}(X) + h^5(X) + h^6_{\lambda}(X),
\]
the convex functions $h^1_{\mu}$, $h^2_{\lambda\mu}$, $h^3_{\lambda\mu}$, and $h^4_{\mu}$ are differentiable. The partial derivatives of $h^1_{\mu}$ are given by
\[
\frac{\partial h^1_{\mu}}{\partial x^l}(X) = 0\ \text{ for } l = 1,\ldots,k,\qquad
\frac{\partial h^1_{\mu}}{\partial x^{k+1}}(X) = \frac{1}{\mu}\sum_{i=1}^m(x^{k+1} - a^i) = \frac{1}{\mu}\Big[m x^{k+1} - \sum_{i=1}^m a^i\Big].
\]
Then $\nabla h^1_{\mu}(X)$ is the $(k+1)\times n$ matrix $L$ whose $l$th row is $\frac{\partial h^1_{\mu}}{\partial x^l}(X)$ for $l = 1,\ldots,k+1$. Similarly, the partial derivatives of $h^2_{\lambda\mu}$ are given by
\[
\frac{\partial h^2_{\lambda\mu}}{\partial x^l}(X) = 0\ \text{ for } l = 1,\ldots,k,\qquad
\frac{\partial h^2_{\lambda\mu}}{\partial x^{k+1}}(X) = \lambda\sum_{i=1}^m\Big[\frac{x^{k+1} - a^i}{\mu} - P\Big(\frac{x^{k+1} - a^i}{\mu};\mathbb{B}\Big)\Big].
\]
Then $\nabla h^2_{\lambda\mu}(X)$ is the $(k+1)\times n$ matrix $M$ whose $l$th row is $\frac{\partial h^2_{\lambda\mu}}{\partial x^l}(X)$ for $l = 1,\ldots,k+1$. Now we compute the gradient of $h^3_{\lambda\mu}$. We have
\[
\frac{\partial h^3_{\lambda\mu}}{\partial x^l}(X) = (1+\lambda)\sum_{i=1}^m\Big[\frac{x^l - a^i}{\mu} - P\Big(\frac{x^l - a^i}{\mu};\mathbb{B}\Big)\Big]\ \text{ for } l = 1,\ldots,k,\qquad
\frac{\partial h^3_{\lambda\mu}}{\partial x^{k+1}}(X) = 0.
\]
Thus, $\nabla h^3_{\lambda\mu}(X)$ is the $(k+1)\times n$ matrix $U$ whose $l$th row is $\frac{\partial h^3_{\lambda\mu}}{\partial x^l}(X)$ for $l = 1,\ldots,k+1$. Let us compute the gradient of $h^4_{\mu}$. We have
\[
\frac{\partial h^4_{\mu}}{\partial x^l}(X) = \frac{x^l - x^{k+1}}{\mu} - P\Big(\frac{x^l - x^{k+1}}{\mu};\mathbb{B}\Big)\ \text{ for } l = 1,\ldots,k,\qquad
\frac{\partial h^4_{\mu}}{\partial x^{k+1}}(X) = -\sum_{l=1}^k\Big[\frac{x^l - x^{k+1}}{\mu} - P\Big(\frac{x^l - x^{k+1}}{\mu};\mathbb{B}\Big)\Big].
\]

Similarly, $\nabla h^4_{\mu}(X)$ is the $(k+1)\times n$ matrix $V$ whose $l$th row is $\frac{\partial h^4_{\mu}}{\partial x^l}(X)$ for $l = 1,\ldots,k+1$. The procedure for computing subgradients of the last two nondifferentiable components, $h^5$ and $h^6_{\lambda}$, is similar to the procedure we described for computing subgradients of $h^3$ in Model I. The following is a DCA-based algorithm for Model II.

Algorithm 2.
INPUT: $X_0$, $\lambda_0$, $\mu_0$, $N\in\mathbb{N}$.
while stopping criteria$(\lambda,\mu)$ = false do
    for $k = 1,2,3,\ldots,N$ do
        Find $Y_k\in\partial h_{\lambda\mu}(X_{k-1})$.
        Find $X_k = \nabla g^*_{\lambda\mu}(Y_k)$.
    end for
    update $\lambda$ and $\mu$.
end while
OUTPUT: $X_N$.

4 Numerical Experiments

We use MATLAB to code our algorithms and perform numerical experiments on a MacBook Pro with a 2.2 GHz Intel Core i7 processor and 16 GB 1600 MHz DDR3 memory. For our numerical experiments, we use three data sets: an artificial data set with 18 data points in $\mathbb{R}^2$ (see Figure 1a), and EIL76 and PR1002 from [17] (see Figure 1b and Figure 1c, respectively).

Figure 1: Plots of the three test data sets: (a) 18 data points, 2 centers; (b) 76 data points, 3 centers; (c) 1002 data points, 6 centers.

The two MATLAB codes used to implement our two algorithms have two major parts: an outer loop for updating the penalty and smoothing parameters and an inner loop for updating the cluster centers. The penalty parameter $\lambda$ and the smoothing parameter $\mu$ are updated as follows. Choosing $\mu_0 > 0$, $\lambda_0 > 0$ and $\sigma_1 > 1$, $\sigma_2\in(0,1)$, we update $\lambda_{i+1} = \sigma_1\lambda_i$ and $\mu_{i+1} = \sigma_2\mu_i$ for $i\geq 0$ after each outer loop.

To choose $\sigma_1$ and $\sigma_2$, we first let $N$ be the number of outer iterations and choose $\lambda_0$, $\lambda_{\max}$ as the initial and final values of $\lambda$, respectively, and similarly $\mu_0$, $\mu_{\min}$ for $\mu$. Then the growth/decay parameters are chosen according to $\sigma_1 = (\lambda_{\max}/\lambda_0)^{1/N}$ and $\sigma_2 = (\mu_{\min}/\mu_0)^{1/N}$. By trial and error we find that the values chosen for $\lambda_0$, $\lambda_{\max}$, $\mu_0$, and $\mu_{\min}$ in large part determine the performance of the two algorithms for each data set. Intuitively, we see that very large values of $\lambda$ will over-penalize the distance between an artificial center and its nearest data node and may prevent the algorithm from clustering properly. We therefore use $\lambda_0\ll 1\ll\lambda_{\max}$, so that the algorithm has a chance to cluster the data before the penalty parameter takes effect. Similarly, we choose $\mu_{\min}\ll 1\ll\mu_0$.

We select the starting center $X_0$ at a certain radius, $\gamma\,\mathrm{rad}(A)$, from the median point, $\mathrm{median}(A)$, of the entire data set, i.e.,
\[
X_0 = \mathrm{median}(A) + \gamma\,\mathrm{rad}(A)\,U,
\]
where $\mathrm{rad}(A) := \max\{\|a^i - \mathrm{median}(A)\|\mid a^i\in A\}$, $\gamma$ is a randomly chosen real number from the standard uniform distribution on the open interval $(0,1)$, $U$ is a $k\times n$ matrix whose $k$ rows are randomly generated unit vectors in $\mathbb{R}^n$, and the sum is understood in the sense of adding a vector to each row of a matrix.

As shown in Tables 1, 2, and 3, both algorithms identify the optimal solutions within a reasonable amount of time for both DS18 and EIL76, with two and three cluster centers, respectively. To get a good starting point which yields a better estimate of the optimal value for bigger data sets such as PR1002, we use a method called radial search, described as follows. Given an initial radius $r_0 > 0$ and $m\in\mathbb{N}$, set $\gamma = i r_0$ for $i = 1,\ldots,m$. Then we test the algorithm with different starting points given by
\[
X_0(i) = \mathrm{median}(A) + i r_0\,\big(\mathrm{rad}(A)\,U\big)\quad\text{for } i = 1,\ldots,m.
\]
Figure 2 shows the result of this method applied to PR1002 with six cluster centers, where the $y$-axis represents the optimal value returned by ALG1 with different starting points $X_0(i)$, as represented on the $x$-axis. A short sketch of this parameter schedule and the radial search is given after Table 2 below.

Table 1: Results for the 18-point artificial data set (columns Cost1, Cost2, Time1, Time2, Iter1, Iter2, $k$, $m$, $n$; parameters $\mu_0 = 5.70$, $\lambda_0 = 0.001$, $\sigma_1 = 7500$, $\sigma_2 = 0.5$).

Table 2: Results for the EIL76 data set (columns Cost1, Cost2, Time1, Time2, Iter1, Iter2, $k$, $m$, $n$; parameters $\mu_0 = 100$, $\lambda_0 = 10^{-6}$, $\sigma_1 = 1$, $\sigma_2 = 0.5$).
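The parameter schedule and the radial-search construction described above can be sketched as follows (our own Python/NumPy illustration; the function names are ours, the coordinate-wise median is used for $\mathrm{median}(A)$, and the endpoint values in the example call are placeholders rather than the settings of Tables 1-3).

```python
import numpy as np

def schedule(lam0, lam_max, mu0, mu_min, N):
    """Geometric growth/decay factors sigma1, sigma2 and the resulting parameter sequences."""
    sigma1 = (lam_max / lam0) ** (1.0 / N)
    sigma2 = (mu_min / mu0) ** (1.0 / N)
    lams = lam0 * sigma1 ** np.arange(N + 1)
    mus = mu0 * sigma2 ** np.arange(N + 1)
    return sigma1, sigma2, lams, mus

def radial_starts(A, k, r0, m_starts, rng):
    """Starting centers X0(i) = median(A) + i*r0*(rad(A)*U), i = 1,...,m_starts."""
    med = np.median(A, axis=0)                            # coordinate-wise median of the data
    rad = np.max(np.linalg.norm(A - med, axis=1))         # rad(A)
    starts = []
    for i in range(1, m_starts + 1):
        U = rng.normal(size=(k, A.shape[1]))
        U /= np.linalg.norm(U, axis=1, keepdims=True)     # random unit rows
        starts.append(med + i * r0 * rad * U)
    return starts

rng = np.random.default_rng(3)
A = rng.normal(size=(76, 2))
print(schedule(lam0=1e-6, lam_max=1e6, mu0=100.0, mu_min=1e-6, N=20)[:2])
print(radial_starts(A, k=3, r0=0.1, m_starts=5, rng=rng)[0].shape)
```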

Table 3: Results for the PR1002 data set (columns Cost1, Cost2, Time1, Time2, Iter1, Iter2, $k$, $m$, $n$; parameters $\mu_0 = 1950$, $\lambda_0 = 10^{-6}$, $\sigma_1 = 7500$, $\sigma_2 = 0.5$).

For comparison purposes, both Cost1 and Cost2 are computed in the same way. First, we systematically replace the $k$ cluster centers returned by the respective algorithms by the $k$ real nodes that are closest to them, i.e., for $l = 1,\ldots,k$,
\[
x^l := \operatorname{argmin}\{\|x^l - a^i\|\mid a^i\in A\}.
\]
Then the total center $\bar{x}$ will be a real node, chosen from the remaining nodes, whose sum of distances from the $k$ reassigned centers is minimal, i.e.,
\[
\bar{x} := \operatorname{argmin}\Big\{\sum_{l=1}^k\|x^l - a^i\|\ \Big|\ a^i\in A\Big\}.
\]
The total cost is computed by adding the distance of each real node to its closest center (including the total center) and the distances of the total center from the $k$ cluster centers, i.e.,
\[
\mathrm{Cost} := \sum_{i=1}^m\min_{l=1,\ldots,k+1}\|x^l - a^i\| + \sum_{l=1}^k\|x^l - x^{k+1}\|,\quad\text{where } x^{k+1} = \bar{x}.
\]

Figure 2: PR1002, the 1002-city problem, with 6 cluster centers.
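A compact sketch of this post-processing step (our own Python/NumPy illustration; the MATLAB code used for the experiments is not reproduced here, and the test data below are arbitrary) reads:

```python
import numpy as np

def reported_cost(X, A):
    """Snap the k returned centers to their nearest nodes, pick a total center among the
    remaining nodes, and compute the reported Cost value."""
    k = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - A[None, :, :], axis=2)     # ||x^l - a^i||
    idx = D.argmin(axis=1)                                        # nearest real node for each center
    centers = A[idx]                                              # reassigned centers x^1, ..., x^k
    # total center: remaining node minimizing the sum of distances to the k centers
    remaining = np.setdiff1d(np.arange(A.shape[0]), idx)
    sums = np.linalg.norm(centers[:, None, :] - A[None, remaining, :], axis=2).sum(axis=0)
    total = A[remaining[sums.argmin()]]
    # Cost = sum_i min over the k+1 centers ||. - a^i|| + sum_l ||x^l - total||
    all_centers = np.vstack([centers, total])
    assign = np.linalg.norm(all_centers[:, None, :] - A[None, :, :], axis=2).min(axis=0).sum()
    return assign + np.linalg.norm(centers - total, axis=1).sum()

rng = np.random.default_rng(4)
A = rng.normal(size=(76, 2))
X = A[rng.choice(76, size=3, replace=False)] + 0.05 * rng.normal(size=(3, 2))
print(reported_cost(X, A))
```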

5 Conclusions

In our numerical experiments, one of the major challenges we face is how to choose optimal parameters for our algorithms. We observe that parameter selection is the decisive factor in terms of accuracy and speed of convergence of our proposed algorithms. The performance of the proposed algorithms depends strongly on the initial values of the penalty and smoothing parameters, $\lambda_0$ and $\mu_0$, and on their respective growth/decay factors $\sigma_1$ and $\sigma_2$. Better guidance for selecting these parameters than the methods we suggest in this paper will be addressed in our future research.

References

[1] L. T. H. An, M. T. Belghiti, and P. D. Tao, A new efficient algorithm based on DC programming and DCA for clustering, J. Glob. Optim. 37 (2007).
[2] L. T. H. An and L. H. Minh, Optimization based DC programming and DCA for hierarchical clustering, European J. Oper. Res. 183 (2007).
[3] L. T. H. An and P. D. Tao, Convex analysis approach to D.C. programming: theory, algorithms and applications, Acta Math. Vietnam. 22 (1997).
[4] A. M. Bagirov, Derivative-free methods for unconstrained nonsmooth optimization and its numerical analysis, Investigacao Operacional 19 (1999).
[5] A. Bagirov, L. Jia, I. Ouveysi, and A. M. Rubinov, Optimization based clustering algorithms in multicast group hierarchies, in: Proceedings of the Australian Telecommunications, Networks and Applications Conference (ATNAC), 2003, Melbourne, Australia (published on CD).
[6] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, New York, 2011.
[7] J. M. Borwein and A. S. Lewis, Convex Analysis and Nonlinear Optimization, 2nd edition, Springer, New York, 2006.
[8] R. I. Boţ, Conjugate Duality in Convex Optimization, Springer, Berlin, 2010.
[9] T. Pham Dinh and H. A. Le Thi, A d.c. optimization algorithm for solving the trust-region subproblem, SIAM J. Optim. 8 (1998).
[10] P. Hartman, On functions representable as a difference of convex functions, Pacific J. Math. 9 (1959).
[11] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms I, II, Springer, Berlin, 1993.
[12] J.-B. Hiriart-Urruty, Generalized differentiability, duality and optimization for problems dealing with differences of convex functions, Lecture Notes in Economics and Mathematical Systems 256 (1985).
[13] B. S. Mordukhovich, Variational Analysis and Generalized Differentiation, I: Basic Theory, II: Applications, Springer, Berlin, 2006.
[14] B. S. Mordukhovich and N. M. Nam, An Easy Path to Convex Analysis and Applications, Morgan & Claypool Publishers, San Rafael, CA, 2014.

[15] N. M. Nam, R. B. Rector, and D. Giles, Minimizing differences of convex functions with applications to facility location and clustering, submitted.
[16] Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program. 103 (2005).
[17] G. Reinelt, TSPLIB: a traveling salesman problem library, ORSA Journal on Computing 3 (1991).
[18] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, 1970.
[19] R. T. Rockafellar, Conjugate Duality and Optimization, SIAM, Philadelphia, PA, 1974.


More information

Dipartimento di Ingegneria Informatica, Automatica e Gestionale A. Ruberti, SAPIENZA, Università di Roma, via Ariosto, Roma, Italy.

Dipartimento di Ingegneria Informatica, Automatica e Gestionale A. Ruberti, SAPIENZA, Università di Roma, via Ariosto, Roma, Italy. Data article Title: Data and performance profiles applying an adaptive truncation criterion, within linesearchbased truncated Newton methods, in large scale nonconvex optimization. Authors: Andrea Caliciotti

More information

Function approximation using RBF network. 10 basis functions and 25 data points.

Function approximation using RBF network. 10 basis functions and 25 data points. 1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming SECOND EDITION Dimitri P. Bertsekas Massachusetts Institute of Technology WWW site for book Information and Orders http://world.std.com/~athenasc/index.html Athena Scientific, Belmont,

More information

A primal-dual framework for mixtures of regularizers

A primal-dual framework for mixtures of regularizers A primal-dual framework for mixtures of regularizers Baran Gözcü baran.goezcue@epfl.ch Laboratory for Information and Inference Systems (LIONS) École Polytechnique Fédérale de Lausanne (EPFL) Switzerland

More information

Convex Optimization. Lijun Zhang Modification of

Convex Optimization. Lijun Zhang   Modification of Convex Optimization Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Modification of http://stanford.edu/~boyd/cvxbook/bv_cvxslides.pdf Outline Introduction Convex Sets & Functions Convex Optimization

More information

Mathematical Programming and Research Methods (Part II)

Mathematical Programming and Research Methods (Part II) Mathematical Programming and Research Methods (Part II) 4. Convexity and Optimization Massimiliano Pontil (based on previous lecture by Andreas Argyriou) 1 Today s Plan Convex sets and functions Types

More information

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies.

CSE 547: Machine Learning for Big Data Spring Problem Set 2. Please read the homework submission policies. CSE 547: Machine Learning for Big Data Spring 2019 Problem Set 2 Please read the homework submission policies. 1 Principal Component Analysis and Reconstruction (25 points) Let s do PCA and reconstruct

More information

3 Nonlinear Regression

3 Nonlinear Regression CSC 4 / CSC D / CSC C 3 Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Distributed Alternating Direction Method of Multipliers

Distributed Alternating Direction Method of Multipliers Distributed Alternating Direction Method of Multipliers Ermin Wei and Asuman Ozdaglar Abstract We consider a network of agents that are cooperatively solving a global unconstrained optimization problem,

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

The Branch-and-Sandwich Algorithm for Mixed-Integer Nonlinear Bilevel Problems

The Branch-and-Sandwich Algorithm for Mixed-Integer Nonlinear Bilevel Problems The Branch-and-Sandwich Algorithm for Mixed-Integer Nonlinear Bilevel Problems Polyxeni-M. Kleniati and Claire S. Adjiman MINLP Workshop June 2, 204, CMU Funding Bodies EPSRC & LEVERHULME TRUST OUTLINE

More information

Mathematics of Optimization Viewpoint of Applied Mathematics Logic of Optimization

Mathematics of Optimization Viewpoint of Applied Mathematics Logic of Optimization Global Focus on Knowledge Lecture (The World of Mathematics #7) 2007 06 07 Mathematics of Optimization Viewpoint of Applied Mathematics Logic of Optimization Kazuo Murota Department. of Mathematical Engineering

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Theory of Variational Inference: Inner and Outer Approximation Eric Xing Lecture 14, February 29, 2016 Reading: W & J Book Chapters Eric Xing @

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information