Stability yields a PTAS for k-median and k-means Clustering

Pranjal Awasthi, Carnegie Mellon University, pawasthi@cs.cmu.edu
Avrim Blum, Carnegie Mellon University, avrim@cs.cmu.edu
Or Sheffet, Carnegie Mellon University, osheffet@cs.cmu.edu

Abstract

We consider $k$-median clustering in finite metric spaces and $k$-means clustering in Euclidean spaces, in the setting where $k$ is part of the input (not a constant). For the $k$-means problem, Ostrovsky et al. [18] show that if the optimal $(k-1)$-means clustering of the input is more expensive than the optimal $k$-means clustering by a factor of $1/\epsilon^2$, then one can achieve a $(1 + f(\epsilon))$-approximation to the $k$-means optimal in time polynomial in $n$ and $k$ by using a variant of Lloyd's algorithm. In this work we substantially improve this approximation guarantee. We show that given only the condition that the $(k-1)$-means optimal is more expensive than the $k$-means optimal by a factor $1+\alpha$ for some constant $\alpha > 0$, we can obtain a PTAS. In particular, under this assumption, for any $\epsilon > 0$ we achieve a $(1 + \epsilon)$-approximation to the $k$-means optimal in time polynomial in $n$ and $k$, and exponential in $1/\epsilon$ and $1/\alpha$. We thus decouple the strength of the assumption from the quality of the approximation ratio. We also give a PTAS for the $k$-median problem in finite metrics under the analogous assumption. For $k$-means, we in addition give a randomized algorithm with improved running time of $n^{O(1)} (k \log n)^{\mathrm{poly}(1/\epsilon, 1/\alpha)}$. Our technique also obtains a PTAS under the assumption of Balcan et al. [4] that all $(1 + \alpha)$ approximations are $\delta$-close to a desired target clustering, in the case that all target clusters have size greater than $\delta n$ and $\alpha > 0$ is constant. Note that the motivation of Balcan et al. [4] is that for many clustering problems, the objective function is only a proxy for the true goal of getting close to the target.
From this perspective, our improvement is that for $k$-means in Euclidean spaces we reduce the distance of the clustering found to the target from $O(\delta)$ to $\delta$ when all target clusters are large, and for $k$-median we improve the "largeness" condition needed in [4] to get exactly $\delta$-close from $O(\delta n)$ to $\delta n$. Our results are based on a new notion of clustering stability.

1. INTRODUCTION

Clustering is a well-studied task, arising in numerous areas from computer vision to computational biology to distributed computing. Generally speaking, the goal of clustering is to partition $n$ given data objects into $k$ groups that share some commonality. Operationally, clustering is often performed by viewing the data as points in a metric space and then optimizing some natural objective over them. In this paper, we consider two popular such objectives, $k$-median and $k$-means. Both measure a $k$-partition by choosing a special point for each cluster, called the center, and define the cost of a clustering as a function of the distances between the data points and their respective centers. In the $k$-median case, the cost is the sum of the distances of the points to their centers, and in the $k$-means case, the cost is the sum of these distances squared. The $k$-median objective is typically studied for data in a finite metric (complete weighted graph satisfying the triangle inequality) over the $n$ data points; $k$-means clustering is typically studied for $n$ points in a (finite dimensional) Euclidean space.

Both objectives are known to be NP-hard (we view $k$ as part of the input and not a constant, though even the 2-means problem in Euclidean space was recently shown to be NP-hard [8]). For $k$-median in a finite metric, there is a known $(1+1/e)$-hardness of approximation result [14] and substantial work on approximation algorithms [11], [7], [2], [14], [9], with the best guarantee a $3 + \epsilon$ approximation. For $k$-means in a Euclidean space, there is also a vast literature of approximation algorithms [17], [3], [9], [10], [12], [15], with the best guarantee a constant-factor approximation if polynomial dependence on $k$ and the dimension $d$ is desired.^1

^1 If $k$ is constant, then $k$-median in finite metrics can be trivially solved in polynomial time and there is a PTAS known for $k$-means (and $k$-median) in Euclidean space [16]. There is also a PTAS known for low-dimensional Euclidean spaces (dimension at most $\log \log n$) [1], [12].

Ostrovsky et al. [18] proposed an interesting condition under which one can achieve better $k$-means approximations in time polynomial in $n$ and $k$. They consider $k$-means instances where the optimal $k$-clustering has cost noticeably smaller than the cost of any $(k-1)$-clustering, motivated by the idea that "if a near-optimal $k$-clustering can be achieved by a partition into fewer than $k$ clusters, then that smaller value of $k$ should be used to cluster the data" [18]. Under the assumption that the ratio of the cost of the optimal $(k-1)$-means clustering to the cost of the optimal $k$-means clustering is at least $\max\{100, 1/\epsilon^2\}$, Ostrovsky et al. show that one can obtain a $(1+f(\epsilon))$-approximation for $k$-means in time polynomial in $n$ and $k$, by using a variant of Lloyd's algorithm.

In this paper, we substantially improve on this approximation guarantee. We show that under the much weaker assumption that the ratio of these costs is just at least $(1 + \alpha)$ for some constant $\alpha > 0$, we can achieve a PTAS: namely, $(1 + \epsilon)$-approximate the $k$-means optimum, for any constant $\epsilon > 0$. Our approximation scheme runs in time which is $\mathrm{poly}(n, k)$ and exponential only in $1/\epsilon$ and $1/\alpha$. Thus, we decouple the strength of the assumption from the quality of the conclusion, and in the process allow the assumption to be substantially weaker. For $k$-means clustering we in addition give a randomized algorithm with improved running time $n^{O(1)} (k \log n)^{\mathrm{poly}(1/\epsilon, 1/\alpha)}$.

Balcan et al. [4], motivated by the fact that objective functions are often just a proxy for the underlying goal of getting the data clustered correctly, consider clustering instances that satisfy the condition that all $(1 + \alpha)$ approximations to the given objective (e.g., $k$-median or $k$-means) are $\delta$-close, in terms of how points are partitioned, to a target clustering (such as a correct clustering of proteins by function or a correct clustering of images by who is in them). This can be viewed as an assumption made implicitly when considering approximation algorithms for problems of this nature where the true goal is to get close to the target. Balcan et al. show that for any $\alpha$ and $\delta$, given an instance satisfying this property for $k$-median or $k$-means objectives, one can in fact efficiently produce a clustering that is $O(\delta/\alpha)$-close to the target clustering (so, $O(\delta)$-close for any constant $\alpha > 0$), even though obtaining a $1 + \alpha$ approximation to the objective is NP-hard for $\alpha < 1/e$, and remains hard even under this assumption. Thus they show that one can approximate the target even though it is hard to approximate the objective.

One interesting question that has remained is the approximability of the objectives when all target clusters are large compared to $\delta n$, since the hardness of approximation requires allowing small clusters.^2 Here, we show that for both $k$-median and $k$-means objectives, if all clusters contain more than $\delta n$ points, then for any constant $\alpha > 0$ we can in fact get a PTAS. Thus, we (nearly) resolve the approximability of these objectives under this condition. Note that under this condition, this further implies finding a $\delta$-close clustering (setting $\epsilon = \alpha$). Thus, we also extend the results of Balcan et al. [4] in the case of large clusters and constant $\alpha$ by getting exactly $\delta$-close for both $k$-median and $k$-means objectives. (In [4] this exact closeness was achieved for the $k$-median objective but needed a somewhat larger $O(\delta n (1 + 1/\alpha))$ minimum cluster size requirement.)

^2 In fact, as shown in [19], the $k$-median algorithm in [4] for the case that clusters are sufficiently large compared to $\delta n (1 + 1/\alpha)$ achieves a better constant-factor approximation. Note that $\delta$ need not be a constant.

Our algorithmic results are achieved by examining implications of a property we call weak deletion-stability that is implied by both the separation condition of Ostrovsky et al. [18] and (when target clusters are large) the stability condition of Balcan et al. [4]. In particular, an instance of $k$-median/$k$-means clustering satisfies weak deletion-stability if in the optimal solution, deleting any of the centers $c^*_i$ and assigning all points in cluster $i$ instead to one of the remaining $k-1$ centers $c^*_j$ results in an increase in the $k$-median/$k$-means cost by an (arbitrarily small) constant factor. We also show that weak deletion-stability still allows for NP-hard instances and that no FPTAS is possible (unless P = NP). Thus, our algorithm, whose running time is $(nk)^{\mathrm{poly}(1/\epsilon, 1/\beta)}$, is optimal in the sense that the super-polynomial dependence on $1/\epsilon$ and $1/\beta$ is unavoidable.

After presenting notation and preliminaries in Section 2, in Section 3 we introduce weak deletion-stability and relate it to the stability notions of [18] and [4]. We then define another property of a clustering, being $\beta$-distributed, which, while not so intuitive, we show is implied by weak deletion-stability and will be the actual condition that our algorithms use. We then go on to prove that being $\beta$-distributed suffices to give a PTAS for $k$-median in Section 4. We extend the algorithm to $k$-means clustering in Section 5, where we also introduce a randomized version whose run-time is bounded by $n^{3} (k \log n)^{\mathrm{poly}(1/\epsilon, 1/\beta)}$. We conclude with discussion and open problems in the final section.

2. NOTATION AND PRELIMINARIES

We are given a set $S$ of $n$ points. When discussing $k$-median, we assume the $n$ points reside in a finite metric space, and when discussing $k$-means, we assume they all reside in a finite dimensional Euclidean space. We denote by $d : S \times S \to \mathbb{R}_{\ge 0}$ the distance function. A solution to the $k$-median objective partitions the $n$ points into $k$ disjoint subsets, $C_1, C_2, \ldots, C_k$, and assigns a center $c_i$ to each subset. The $k$-median cost of this partition is then measured by
$$\sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i).$$
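To make the definition concrete, the following minimal sketch evaluates the $k$-median cost of a given partition. It assumes the metric is supplied as a symmetric matrix `dist` indexed by point number; the function name `k_median_cost` and the toy instance are illustrative and do not come from the paper.

```python
def k_median_cost(dist, clusters, centers):
    """k-median cost: sum over clusters C_i of d(x, c_i) for all x in C_i.

    dist     -- symmetric matrix, dist[x][y] = d(x, y)  (assumed input format)
    clusters -- list of k disjoint lists of point indices
    centers  -- list of k point indices, centers[i] serving clusters[i]
    """
    return sum(dist[x][c]
               for cluster, c in zip(clusters, centers)
               for x in cluster)


# Toy instance (hypothetical): six points on a line with the absolute-value metric.
points = [0, 1, 2, 100, 101, 102]
dist = [[abs(a - b) for b in points] for a in points]

cost = k_median_cost(dist, clusters=[[0, 1, 2], [3, 4, 5]], centers=[1, 4])
print(cost)  # 1 + 0 + 1 + 1 + 0 + 1 = 4
```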
A solution to the $k$-means objective again gives a $k$-partition of the $n$ data points, but now we may assume it uses the center of mass, $\mu_{C_i} = \frac{1}{|C_i|} \sum_{x \in C_i} x$, as the center of $C_i$. We then measure the $k$-means cost of this clustering by
$$\sum_{i=1}^{k} \sum_{x \in C_i} d^2(x, \mu_{C_i}) = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_{C_i}\|^2.$$

The optimal clustering (with respect to either the $k$-median or the $k$-means objective) is denoted by $C^* = \{C^*_1, C^*_2, \ldots, C^*_k\}$, and its cost is denoted by $\mathrm{OPT}$. The centers used in the optimal clustering are denoted by $\{c^*_1, c^*_2, \ldots, c^*_k\}$. Clearly, given the optimal clustering, we can find the optimal centers (either by brute-force checking all possible points for $k$-median, or by $c^*_i = \mu_{C^*_i}$ for $k$-means). Alternatively, given the optimal centers, we can assign each $x$ to its nearest center, thus obtaining the optimal clustering. Thus, we use $C^*$ to denote both the optimal $k$-partition and the optimal list of $k$ centers. We use $\mathrm{OPT}_i$ to denote the contribution of cluster $i$ to $\mathrm{OPT}$, that is, $\mathrm{OPT}_i = \sum_{x \in C^*_i} d(x, c^*_i)$ in the $k$-median case, or $\mathrm{OPT}_i = \sum_{x \in C^*_i} d^2(x, c^*_i)$ in the $k$-means case.

3. STABILITY PROPERTIES

As mentioned above, our results are achieved by exploiting implications of a stability condition we call weak deletion-stability, and in particular an implication we call being $\beta$-distributed. In this section we define weak deletion-stability and being $\beta$-distributed, relate weak deletion-stability to the conditions of Ostrovsky et al. [18] and Balcan et al. [4], and show that weak deletion-stability implies the clustering is $\beta$-distributed. In Sections 4 and 5 we use the property of being $\beta$-distributed to obtain a PTAS.^3

^3 Technically, we could skip the middleman of weak deletion-stability and just define the property of being $\beta$-distributed as our main stability notion, but weak deletion-stability is a more intuitive condition.

Definition 3.1. For $\alpha > 0$, a $k$-median/$k$-means instance satisfies $(1+\alpha)$ weak deletion-stability if it has the following property. Let $\{c^*_1, c^*_2, \ldots, c^*_k\}$ denote the centers in the optimal $k$-median/$k$-means solution. Let $\mathrm{OPT}$ denote the optimal $k$-median/$k$-means cost and let $\mathrm{OPT}^{(i \to j)}$ denote the cost of the clustering obtained by removing $c^*_i$ as a center and assigning all its points instead to $c^*_j$. Then for any $i \ne j$, it holds that
$$\mathrm{OPT}^{(i \to j)} > (1 + \alpha)\,\mathrm{OPT}.$$

We use weak deletion-stability via the following implication we call being $\beta$-distributed.

Definition 3.2. For $\beta > 0$, a $k$-median instance is $\beta$-distributed if for any center $c^*_i$ of the optimal clustering and any data point $x \notin C^*_i$, it holds that
$$d(x, c^*_i) \ge \beta\, \frac{\mathrm{OPT}}{|C^*_i|}.$$
A $k$-means instance is $\beta$-distributed if for any such $c^*_i$ and $x \notin C^*_i$, it holds that
$$d^2(x, c^*_i) \ge \beta\, \frac{\mathrm{OPT}}{|C^*_i|}.$$

We prove that $(1 + \alpha)$ weak deletion-stability implies the clustering is $\alpha/2$-distributed for $k$-median ($\alpha/4$-distributed for $k$-means) in Theorem 3.5 below. First, however, we relate weak deletion-stability to the conditions considered in [18] and [4].

A. ORSS-Separability

Ostrovsky, Rabani, Schulman and Swamy [18] define a clustering instance to be $\epsilon$-separated if the optimal $k$-means solution is cheaper than the optimal $(k-1)$-means solution by at least a factor $\epsilon^2$. For a given objective ($k$-means or $k$-median) let us use $\mathrm{OPT}_{(k-1)}$ to denote the cost of the optimal $(k-1)$-clustering. Introducing a parameter $\alpha > 0$, say a clustering instance is $(1 + \alpha)$-ORSS separable if
$$\frac{\mathrm{OPT}_{(k-1)}}{\mathrm{OPT}} > 1 + \alpha.$$
If an instance satisfies $(1 + \alpha)$-ORSS separability then all $(k-1)$-clusterings must have cost more than $(1 + \alpha)\,\mathrm{OPT}$, and hence it is immediately evident that the instance will also satisfy $(1 + \alpha)$-weak deletion-stability. Hence we have the following claim:

Claim 3.3. Any $(1 + \alpha)$-ORSS separable $k$-median/$k$-means instance is also $(1 + \alpha)$-weakly deletion-stable.

B. BBG-Stability

Balcan, Blum, and Gupta [4] (see also Balcan and Braverman [5] and Balcan, Röglin, and Teng [6]) consider a notion of stability to approximations motivated by settings in which there exists some (unknown) target clustering $C^{\mathrm{target}}$ we would like to produce. Balcan et al. [4] define a clustering instance to be $(1 + \alpha, \delta)$ approximation-stable with respect to some objective $\Phi$ (such as $k$-median or $k$-means) if any $k$-partition whose cost under $\Phi$ is at most $(1+\alpha)\,\mathrm{OPT}$ agrees with the target clustering on all but at most $\delta n$ data points. That is, for any $(1 + \alpha)$ approximation $C$ to objective $\Phi$, we have
$$\min_{\sigma \in S_k} \sum_{i} \left| C^{\mathrm{target}}_i \setminus C_{\sigma(i)} \right| \le \delta n$$
(here, $\sigma$ is simply a matching of the indices in the target clustering to those in $C$). In general, $\delta n$ may be larger than the smallest target cluster size, and in that case approximation-stability need not imply weak deletion-stability (not surprisingly, since [4] show that $k$-median and $k$-means remain hard to approximate). However, when all target clusters have size greater than $\delta n$ (note that $\delta$ need not be a constant), approximation-stability indeed also implies weak deletion-stability, allowing us to get a PTAS (and thereby $\delta$-close to the target) when $\alpha > 0$ is a constant.

Claim 3.4. A $k$-median/$k$-means clustering instance that satisfies $(1 + \alpha, \delta)$ approximation-stability, and in which all clusters in the target clustering have size greater than $\delta n$, also satisfies $(1 + \alpha)$ weak deletion-stability.

Proof: Consider an instance of $k$-median/$k$-means clustering which satisfies $(1 + \alpha, \delta)$ approximation-stability. As before, let $\{c^*_1, c^*_2, \ldots, c^*_k\}$ be the centers in the optimal solution and consider the clustering $C^{(i \to j)}$ obtained by no longer using $c^*_i$ as a center and instead assigning each point from cluster $i$ to $c^*_j$, making the $i$-th cluster empty. The distance of this clustering from the target is defined as
$$\frac{1}{n} \min_{\sigma \in S_k} \sum_{i'} \left| C^{\mathrm{target}}_{i'} \setminus C^{(i \to j)}_{\sigma(i')} \right|.$$
Since $C^{(i \to j)}$ has only $(k-1)$ nonempty clusters, one of the target clusters must map to an empty cluster under any permutation $\sigma$. Since by assumption this target cluster has more than $\delta n$ points, the distance between $C^{\mathrm{target}}$ and $C^{(i \to j)}$ will be greater than $\delta$, and hence by the BBG stability condition, the $k$-median/$k$-means cost of $C^{(i \to j)}$ must be greater than $(1 + \alpha)\,\mathrm{OPT}$.

C. Weak Deletion-Stability implies β-distributed

We show now that weak deletion-stability implies the instance is $\beta$-distributed.

Theorem 3.5.
Any $(1 + \alpha)$-weakly deletion-stable $k$-median instance is $\frac{\alpha}{2}$-distributed. Any $(1+\alpha)$-weakly deletion-stable $k$-means instance is $\frac{\alpha}{4}$-distributed.

Proof: Fix any center $c^*_i$ in the optimal $k$-clustering, and fix any point $p$ that does not belong to the cluster $C^*_i$. Denote by $C^*_j$ the cluster that $p$ is assigned to in the optimal $k$-clustering. Therefore it must hold that $d(p, c^*_j) \le d(p, c^*_i)$. Consider the clustering obtained by deleting $c^*_i$ from the list of centers and assigning each point in $C^*_i$ to $C^*_j$. Since the instance is $(1 + \alpha)$-weakly deletion-stable, this should increase the cost by at least $\alpha\,\mathrm{OPT}$.

Suppose we are dealing with a $k$-median instance. Each point $x \in C^*_i$ originally pays $d(x, c^*_i)$, and now, assigned to $c^*_j$, it pays $d(x, c^*_j) \le d(x, c^*_i) + d(c^*_i, c^*_j)$. Thus, the new cost of the points in $C^*_i$ is upper bounded by
$$\sum_{x \in C^*_i} d(x, c^*_j) \le \mathrm{OPT}_i + |C^*_i|\, d(c^*_i, c^*_j).$$
As the increase in cost is lower bounded by $\alpha\,\mathrm{OPT}$ and upper bounded by $|C^*_i|\, d(c^*_i, c^*_j)$, we deduce that $d(c^*_i, c^*_j) > \alpha\, \frac{\mathrm{OPT}}{|C^*_i|}$. Observe that the triangle inequality gives $d(c^*_i, c^*_j) \le d(c^*_i, p) + d(p, c^*_j) \le 2 d(c^*_i, p)$, so we have that $d(c^*_i, p) > \frac{\alpha}{2} \cdot \frac{\mathrm{OPT}}{|C^*_i|}$.

Suppose we are dealing with a Euclidean $k$-means instance. Again, we have created a new clustering by assigning all points in $C^*_i$ to the center $c^*_j$. Thus, the cost of transitioning from the optimal $k$-clustering to this new $(k-1)$-clustering, which is at least $\alpha\,\mathrm{OPT}$, is upper bounded by
$$\sum_{x \in C^*_i} \left( \|x - c^*_j\|^2 - \|x - c^*_i\|^2 \right).$$
As $c^*_i = \mu_{C^*_i}$, it follows that this bound is exactly $\sum_{x \in C^*_i} \|c^*_j - c^*_i\|^2 = |C^*_i|\, d^2(c^*_i, c^*_j)$; see [13] (Section 2, Theorem 2). It follows that $d^2(c^*_i, c^*_j) > \alpha\, \frac{\mathrm{OPT}}{|C^*_i|}$. As before, $d^2(c^*_i, c^*_j) \le \left( d(c^*_i, p) + d(p, c^*_j) \right)^2 \le 4 d^2(c^*_i, p)$, so $d^2(c^*_i, p) > \frac{\alpha}{4} \cdot \frac{\mathrm{OPT}}{|C^*_i|}$.

D. NP-hardness under weak deletion-stability

Finally, we would like to point out that NP-hardness of the $k$-median problem is maintained even if we restrict ourselves only to weakly deletion-stable instances. Also, the reduction sketched below uses only integer poly-size distances, and hence rules out the existence of an FPTAS for the problem, unless P = NP. In addition, the reduction can be modified to show that NP-hardness is maintained under the conditions studied in [18] and [4].

Theorem 3.6. For any constant $\alpha > 0$, finding the optimal $k$-median clustering of $(1 + \alpha)$-weakly deletion-stable instances is NP-hard.

Proof: Fix any constant $\alpha > 0$. We give a poly-time reduction from Set-Cover to $(1 + \alpha)$-weakly deletion-stable $k$-median instances. Under standard notation, we assume our input consists of $n$ subsets of a given universe of size $m$, for which we seek a $k$-cover. We reduce such an instance to a $k$-median instance over $m + k(n + 4\alpha m)$ points. We start with the usual reduction of Set-Cover to an instance with $m$ points representing the items of the universe and $n$ points representing all possible sets. Fix an integer $D \ge 1$ to be chosen later. If item $j$ belongs to the $i$-th set, fix the distance $d(i, j) = D$; otherwise fix the distance $d(i, j) = D + 1$. Between any two set-points we fix the distance to be 1. (The distance between any two item-points is the shortest-path distance.)
However, we augment the $n$ set-points with an additional $2mD$ points, setting the distance between all of the $(n + 2mD)$ points to 1. Furthermore, we replicate $k$ copies of these $(n + 2mD)$ augmented set-points, all connected only via the $m$ item-points. Observe that each of the $k$ copies of our augmented set-point components contains many points, and all points outside this copy are at distance at least $D$ from it. Therefore, in the optimal $k$-median solution, each center resides in one unique copy of the augmented set-points. Now, if our Set-Cover instance has a $k$-cover, then we can pick the respective $k$ centers and have an optimal solution with cost exactly $k(n + 2mD - 1) + mD$. Otherwise, no $k$ sets cover all $m$ items, so for any $k$ centers, some item-point must have distance $D + 1$ from its center, and so the cost of any $k$-partition is at least $k(n + 2mD - 1) + mD + 1$.

Furthermore, the resulting instance is $(1 + \alpha)$ weakly deletion-stable, in fact even $(1+\alpha)$ ORSS-separable. In particular, using one center from each copy of the augmented set-points results in a $k$-median solution of cost at most $m(D+1) + k(n + 2mD - 1) < (k+1)(n + 2mD)$; hence, $\mathrm{OPT}$ is at most this quantity. However, in any $(k-1)$-clustering, one of the copies of the augmented set-points must not contain a center, and therefore $\mathrm{OPT}_{(k-1)} \ge \mathrm{OPT} + (n + 2mD)(D - 1)$. Choosing $D = \alpha(k + 1) + 1$ ensures that this cost is at least $(1 + \alpha)\,\mathrm{OPT}$.

4. A PTAS FOR ANY $\beta$-DISTRIBUTED $k$-MEDIAN INSTANCE

We now present the algorithm for finding a $(1 + \epsilon)$-approximation of the $k$-median optimum for $\beta$-distributed instances. First, we comment that using a standard doubling technique, we can assume we approximately know the value of $\mathrm{OPT}$.^4 Our algorithm works if instead of $\mathrm{OPT}$ we use a value $v$ such that $\mathrm{OPT} \le v \le (1 + \epsilon/2)\,\mathrm{OPT}$, but for ease of exposition, we assume that the exact value of $\mathrm{OPT}$ is known.

Below, we informally describe the algorithm for a special case of $\beta$-distributed instances in which no cluster dominates the overall cost of the optimal clustering. Specifically, we say a cluster $C^*_i$ in the optimal $k$-median clustering $C^*$ (hereafter also referred to as the target clustering) is cheap if $\mathrm{OPT}_i \le \frac{\beta \epsilon\, \mathrm{OPT}}{32}$; otherwise, we say $C^*_i$ is expensive.
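The cheap/expensive split can be illustrated with a short sketch. It assumes the per-cluster costs $\mathrm{OPT}_i$ of the optimal clustering are available, which is only for illustration since the algorithm itself never sees them; the helper name `split_cheap_expensive` and the sample costs are hypothetical.

```python
def split_cheap_expensive(cluster_costs, beta, eps):
    """Return (cheap, expensive) index lists for costs OPT_1, ..., OPT_k.

    Cluster i is cheap if OPT_i <= beta * eps * OPT / 32, where OPT is the
    sum of all cluster costs; otherwise it is expensive.
    """
    opt = sum(cluster_costs)
    threshold = beta * eps * opt / 32
    cheap = [i for i, c in enumerate(cluster_costs) if c <= threshold]
    expensive = [i for i, c in enumerate(cluster_costs) if c > threshold]
    return cheap, expensive


# Hypothetical costs: OPT = 100, threshold = 1 * 0.5 * 100 / 32 = 1.5625.
cheap, expensive = split_cheap_expensive([1, 1, 1, 97], beta=1.0, eps=0.5)
print(cheap, expensive)  # [0, 1, 2] [3]
```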
Note that in any event, there can be at most a constant number ($\frac{32}{\beta\epsilon}$) of expensive clusters.

^4 Instead of doubling from 1, we can alternatively run an off-the-shelf 5-approximation of $\mathrm{OPT}$, which will return a value $v \le 5\,\mathrm{OPT}$.

Algorithm Intuition: The intuition for our algorithm and for introducing the notion of cheap clusters is the following. Pick some cluster $C^*_i$ in the optimal $k$-median clustering. Since the instance is $\beta$-distributed, any $x \notin C^*_i$ is far from $c^*_i$, namely, $d(x, c^*_i) > \beta\, \frac{\mathrm{OPT}}{|C^*_i|}$. In contrast, the average distance of $x \in C^*_i$ from $c^*_i$ is $\frac{\mathrm{OPT}_i}{|C^*_i|}$. Thus, if we focus on a cluster whose contribution, $\mathrm{OPT}_i$, is no more than, say, $\frac{\beta}{100}\,\mathrm{OPT}$, we have that $c^*_i$ is 100 times closer, on average, to the points of $C^*_i$ than to the points outside $C^*_i$. Furthermore, using the triangle inequality we have that any two "average" points of $C^*_i$ are at distance at most $\frac{2\beta}{100} \cdot \frac{\mathrm{OPT}}{|C^*_i|}$, while the distance between any such "average" point and any point outside of $C^*_i$ is at least $\frac{99\beta}{100} \cdot \frac{\mathrm{OPT}}{|C^*_i|}$. So, if we manage to correctly guess the size $s$ of a cheap cluster, we can set a radius $r = \Theta\!\left(\beta\, \frac{\mathrm{OPT}}{s}\right)$ and collect data points according to the size and intersection of the $r$-balls around them. We note that this use of balls with an inverse relation between size and radius is similar to that in the min-sum clustering algorithm of [5].

Note that in the general case we might have up to $\frac{32}{\beta\epsilon}$ expensive clusters. We handle them by brute-force guessing their centers. In Subsection 4-A, we present the algorithm for clustering $\beta$-distributed instances of $k$-median under the assumption that for all the expensive clusters we have made the correct guess for their cluster centers. The algorithm populates a list $Q$, where each element in this list is a subset of points. Ideally, each subset is contained in some target cluster, yet we might have a few subsets with points from two or more target clusters. The first stage of the algorithm is to add components into $Q$, and the second stage is to find $k$ good components in $Q$ and use these $k$ components to retrieve a clustering with low cost. Since we do not have many expensive clusters, we can run the algorithm for all possible guesses for the centers of the expensive clusters and choose the solution which has the minimum cost. The analysis below shows that one such guess will lead to a solution of cost at most $(1+\epsilon)\,\mathrm{OPT}$.

Later, in Section 5, when we deal with $k$-means in Euclidean space, we use sampling techniques, similar to those of Kumar et al. [16] and Ostrovsky et al. [18], to get good substitutes for the centers of the expensive clusters. Note however an important difference between the approach of [16], [18] and ours. While they sample points from all $k$ clusters, we sample points only for the $O(1)$ expensive clusters. As a result, the runtime of the PTAS of [16], [18] has exponential dependence on $k$, while ours has only a polynomial dependence on $k$.

A. Clustering β-distributed Instances

The algorithm is presented in Figure 1. In this section we assume that at the beginning, the list $Q$ is initialized with $Q_{\mathrm{init}}$, which contains the centers of all the expensive clusters. In general, the algorithm will be run several times with $Q_{\mathrm{init}}$ containing different guesses for the centers of the expensive clusters.

Before going into the proof of correctness of the algorithm, we introduce another definition. We define the inner ring of $C^*_i$ as the set $\left\{ x \,;\, d(x, c^*_i) \le \frac{\beta\, \mathrm{OPT}}{8 |C^*_i|} \right\}$. Note the following fact:

Fact 4.1. If $C^*_i$ is a cheap cluster, then no more than an $\epsilon/4$ fraction of its points reside outside the inner ring. In particular, at least half of a cheap cluster is contained within the inner ring.

Proof: This follows from Markov's inequality. If more than $(\epsilon/4)|C^*_i|$ points are outside of the inner ring, then $\mathrm{OPT}_i > \frac{\epsilon |C^*_i|}{4} \cdot \frac{\beta\, \mathrm{OPT}}{8 |C^*_i|} = \frac{\beta\epsilon\, \mathrm{OPT}}{32}$.
This contradicts the fact that $C^*_i$ is cheap.

Our high-level goal is to show that for any cheap cluster $C^*_i$ in the target clustering, we insert a component $T_i$ that is contained within $C^*_i$ and, furthermore, contains only points that are close to $c^*_i$. It will follow from the next claims that the component $T_i$ is the one that contains points from the inner ring of $C^*_i$. We start with the following lemma, which we will utilize a few times.

Lemma 4.2. Let $T$ be any component added to $Q$. Let $s$ be the stage in which we add $T$ to $Q$. Let $C^*_i$ be any cheap cluster such that $s \ge |C^*_i|$. Then (a) $T$ does not contain any point $z$ such that the distance $d(c^*_i, z)$ lies within the range $\left[ \frac{\beta\, \mathrm{OPT}}{2 |C^*_i|}, \frac{3\beta\, \mathrm{OPT}}{4 |C^*_i|} \right]$, and (b) $T$ cannot contain both a point $p_1$ such that $d(c^*_i, p_1) < \frac{\beta\, \mathrm{OPT}}{2 |C^*_i|}$ and a point $p_2$ such that $d(c^*_i, p_2) > \frac{3\beta\, \mathrm{OPT}}{4 |C^*_i|}$.

1) Initialization Stage: Set $Q \leftarrow Q_{\mathrm{init}}$.
2) Population Stage: For $s = n, n-1, n-2, \ldots, 1$ do:
   a) Set $r = \frac{\beta\, \mathrm{OPT}}{4s}$.
   b) Remove any point $x$ such that $d(x, Q) < 2r$. (Here, $d(x, Q) = \min_{T \in Q;\, y \in T} d(x, y)$.)
   c) For any remaining data point $x$, denote the set of data points whose distance from $x$ is at most $r$ by $B(x, r)$. Connect any two remaining points $a$ and $b$ if: (i) $d(a, b) \le r$, (ii) $|B(a, r)| > \frac{s}{2}$ and (iii) $|B(b, r)| > \frac{s}{2}$.
   d) Let $T$ be a connected component of size $> \frac{s}{2}$. Then:
      i) Add $T$ to $Q$. (That is, $Q \leftarrow Q \cup \{T\}$.)
      ii) Define the set $B(T) = \{x : d(x, y) \le 2r \text{ for some } y \in T\}$. Remove the points of $B(T)$ from the instance.
3) Centers-Retrieving Stage: For any choice of $k$ components $T_1, T_2, \ldots, T_k$ out of $Q$ (we later show that $|Q| < k + O(1/\beta)$):
   a) Find the best center $c'_i$ for $T_i \cup B(T_i)$. That is, $c'_i = \arg\min_{p \in T_i \cup B(T_i)} \sum_{x \in T_i \cup B(T_i)} d(x, p)$.^a
   b) Partition all $n$ points according to the nearest point among the $k$ centers of the current $k$ components.
   c) If a clustering of cost at most $(1+\epsilon)\,\mathrm{OPT}$ is found, output these $k$ centers and halt.
^a This can be done before fixing the choice of $k$ components out of $Q$.

Figure 1. The algorithm to obtain a PTAS for $\beta$-distributed instances of $k$-median.

Proof: We prove (a) by contradiction. Assume $T$ contains a point $z$ such that $\frac{\beta\, \mathrm{OPT}}{2 |C^*_i|} \le d(c^*_i, z) \le \frac{3\beta\, \mathrm{OPT}}{4 |C^*_i|}$.
Set $r = \frac{\beta\, \mathrm{OPT}}{4s} \le \frac{\beta\, \mathrm{OPT}}{4 |C^*_i|}$, just as in the stage when $T$ was added to $Q$, and let $p$ be any point in the ball $B(z, r)$. Then by the triangle inequality we have that $d(c^*_i, p) \ge d(c^*_i, z) - d(z, p) \ge \frac{\beta\, \mathrm{OPT}}{4 |C^*_i|}$, and similarly $d(c^*_i, p) \le d(c^*_i, z) + d(z, p) \le \frac{\beta\, \mathrm{OPT}}{|C^*_i|}$. Since our instance is $\beta$-distributed it holds that $p$ belongs to $C^*_i$, and from the definition of the inner ring of $C^*_i$, it holds that $p$ falls outside the inner ring. However, $z$ is added to $T$ because the ball $B(z, r)$ contains more than $s/2 \ge |C^*_i|/2$ points. So more than half of the points in $C^*_i$ fall outside the inner ring of $C^*_i$, which contradicts Fact 4.1.

Assume now that (b) does not hold. Recall that $T$ is a connected component, so there exists some path from $p_1$ to $p_2$. Each two consecutive points along this path were connected because their distance is at most $\frac{\beta\, \mathrm{OPT}}{4s} \le \frac{\beta\, \mathrm{OPT}}{4 |C^*_i|}$. As $d(c^*_i, p_1) < \frac{\beta\, \mathrm{OPT}}{2 |C^*_i|}$ and $d(c^*_i, p_2) > \frac{3\beta\, \mathrm{OPT}}{4 |C^*_i|}$, there must exist a point $z$ along the path whose distance from $c^*_i$ falls in the range $\left[ \frac{\beta\, \mathrm{OPT}}{2 |C^*_i|}, \frac{3\beta\, \mathrm{OPT}}{4 |C^*_i|} \right]$, contradicting (a).

Claim 4.3. Let $C^*_i$ be any cheap cluster in the target clustering. By stage $s = |C^*_i|$, the algorithm adds to $Q$ a component $T$ that contains a point from the inner ring of $C^*_i$.

Proof: Suppose that up to the stage $s = |C^*_i|$ the algorithm has not inserted such a component into $Q$. Now, it is possible that by stage $s$, the algorithm has inserted some component $T'$ to $Q$ such that some $x$ in the inner ring of $C^*_i$ is too close to some $y \in T'$ (namely, $d(x, y) \le 2r$), thus causing $x$ to be removed from the instance. Assume for now this is not the case. This means that the inner ring of cluster $C^*_i$ still contains more than $|C^*_i|/2$ points. Also observe that all inner ring points are at distance at most $\frac{\beta\, \mathrm{OPT}}{8 |C^*_i|}$ from the center, so every pair of inner ring points has a distance of at most $\frac{\beta\, \mathrm{OPT}}{4 |C^*_i|}$. Hence, when we reach stage $s = |C^*_i|$, any ball of radius $r = \frac{\beta\, \mathrm{OPT}}{4s} = \frac{\beta\, \mathrm{OPT}}{4 |C^*_i|}$ centered at any inner ring point must contain all other inner ring points. This means that at stage $s = |C^*_i|$ all inner ring points are connected among themselves, so they form a component (in fact, a clique) of size $> s/2$. Therefore, the algorithm inserts a new component containing all inner ring points.

So, by stage $s = |C^*_i|$, one of two things can happen. Either the algorithm inserts a component that contains some inner ring point to $Q$, or the algorithm removes an inner ring point due to some component $T' \in Q$. If the former happens, we are done. So let us prove by contradiction that we cannot have only the latter. Let $s \ge |C^*_i|$ be the stage in which we throw away the first inner ring point of the cluster $C^*_i$. At stage $s$ the algorithm removes this inner ring point $x$ because there exists a point $y$ in some component $T' \in Q$ such that $d(x, y) \le 2r = \frac{\beta\, \mathrm{OPT}}{2s}$, and so
$$d(c^*_i, y) \le d(c^*_i, x) + d(x, y) \le \frac{\beta\, \mathrm{OPT}}{8 |C^*_i|} + \frac{\beta\, \mathrm{OPT}}{2s} \le \frac{5\beta}{8} \cdot \frac{\mathrm{OPT}}{|C^*_i|}.$$
This immediately implies that $T'$ cannot be the center of an expensive cluster, since any such point will be at a distance of at least $\beta\, \frac{\mathrm{OPT}}{|C^*_i|}$ from $c^*_i$.
Let $s' \ge s \ge |C^*_i|$ be the earlier stage in which the component $T'$ containing $y$ was added to $Q$. As Lemma 4.2 applies to $T'$, we deduce that $d(c^*_i, y) < \frac{\beta\, \mathrm{OPT}}{2 |C^*_i|}$. Recall that $T'$ contains $> s'/2 \ge |C^*_i|/2$ points, yet, by assumption, contains none of the at least $|C^*_i|/2$ points that reside in the inner ring of $C^*_i$. It follows from Fact 4.1 that some point $w \in T'$ must belong to a different cluster $C^*_j$. Since the instance is $\beta$-distributed, we have that $d(c^*_i, w) > \beta\, \frac{\mathrm{OPT}}{|C^*_i|}$. The existence of both $y$ and $w$ in $T'$ contradicts part (b) of Lemma 4.2.

We call a component $T \in Q$ good if it contains an inner ring point of some cheap cluster $C^*_i$. A component is called bad if it is not good and is not one of the initial centers present in $Q_{\mathrm{init}}$. We now discuss the properties of good components.

Claim 4.4. Let $T$ be a good component added to $Q$, containing an inner ring point from a cheap cluster $C^*_i$. (By Claim 4.3 we know at least one such $T$ exists.) Then: (a) all points in $T$ are at distance at most $\frac{\beta\, \mathrm{OPT}}{2 |C^*_i|}$ from $c^*_i$, (b) $T \cup B(T)$ is fully contained in $C^*_i$, (c) the entire inner ring of $C^*_i$ is contained in $T \cup B(T)$, and (d) no other component $T' \in Q$, $T' \ne T$, contains an inner ring point from $C^*_i$.

Proof: As we do not know (d) in advance, it might be the case that $Q$ contains many good components, all containing an inner ring point from the same cluster $C^*_i$. Out of these (potentially many) components, let $T$ denote the first one inserted to $Q$. Denote the stage in which $T$ was inserted to $Q$ as $s$. Due to the previous claim, we know $s \ge |C^*_i|$, and so Lemma 4.2 applies to $T$. We show that (a), (b), (c) and (d) hold for $T$, and deduce that $T$ is the only good component to contain an inner ring point from $C^*_i$.

Part (a) follows immediately from Lemma 4.2. We know $T$ contains some inner ring point $x$ from $C^*_i$, so $d(c^*_i, x) \le \frac{\beta\, \mathrm{OPT}}{8 |C^*_i|} < \frac{\beta\, \mathrm{OPT}}{2 |C^*_i|}$, and hence any $y \in T$ must satisfy $d(c^*_i, y) < \frac{\beta\, \mathrm{OPT}}{2 |C^*_i|}$. Since we now know (a) holds and the instance is $\beta$-distributed, we have that $T \subseteq C^*_i$, so we only need to show $B(T) \subseteq C^*_i$. Fix any $y \in B(T)$. The point $y$ is assigned to $B(T)$ (thus removed from the instance) because there exists some point $x \in T$ such that $d(x, y) \le 2r$.
So again, we have that $d(c^*_i, y) \le d(c^*_i, x) + d(x, y) \le \beta\, \frac{\mathrm{OPT}}{|C^*_i|}$, which gives us that $y \in C^*_i$ (since the instance is $\beta$-distributed).

We now prove (c). Because of (b), we deduce that the number of points in $T$ is at most $|C^*_i|$. However, in order for $T$ to be added to $Q$, it must also hold that $|T| > s/2$. It follows that $s < 2|C^*_i|$. Let $x$ be an inner ring point of $C^*_i$ that belongs to $T$. Then the distance between any other inner ring point of $C^*_i$ and $x$ is at most $\frac{\beta\, \mathrm{OPT}}{4 |C^*_i|} < \frac{\beta\, \mathrm{OPT}}{2s} = 2r$. It follows that any inner ring point of $C^*_i$ which is not added to $T$ is assigned to $B(T)$. Thus $T \cup B(T)$ contains all inner ring points.

Finally, observe that (d) follows immediately from the definition of a good component and from (c).

We now show that in addition to having all good components, we cannot have too many bad components.

Claim 4.5. We have fewer than $16/(3\beta)$ bad components.

Proof: Let $T$ be a bad component, and let $s$ be the stage in which $T$ was inserted to $Q$. Let $y$ be any point in $T$, and let $C^*$ be the cluster to which $y$ belongs in the optimal clustering, with center $c^*$. We show $d(c^*, y) > \frac{3\beta}{8} \cdot \frac{\mathrm{OPT}}{s}$. We divide into cases.

Case 1: $C^*$ is an expensive cluster. Note that we are working under the assumption that $Q_{\mathrm{init}}$ contains the correct centers of the expensive clusters. In particular, $Q_{\mathrm{init}}$ contains $c^*$. Also, the fact that point $y$ was not thrown out in stage $s$ implies that $d(c^*, y) > 2r = \frac{\beta\, \mathrm{OPT}}{2s} > \frac{3\beta\, \mathrm{OPT}}{8s}$.

Case 2: $C^*$ is a cheap cluster and $s \ge |C^*|$. We apply Lemma 4.2, and deduce that either $d(c^*, y) < \frac{\beta\, \mathrm{OPT}}{2 |C^*|}$ or that $d(c^*, y) > \frac{3\beta\, \mathrm{OPT}}{4 |C^*|} \ge \frac{3\beta\, \mathrm{OPT}}{4s}$. As the inner ring of $C^*$ contains $> |C^*|/2$ points and $T$ contains $> s/2 \ge |C^*|/2$ points, none of which is an inner ring point, some point $w \in T$ does not belong to $C^*$ and hence $d(c^*, w) > \beta\, \frac{\mathrm{OPT}}{|C^*|} > \frac{3\beta\, \mathrm{OPT}}{4 |C^*|}$. Part (b) of Lemma 4.2 assures us that all points in $T$ are also far from $c^*$.

Case 3: $C^*$ is a cheap cluster and $s < |C^*|$. Using Claim 4.3 we have that some good component containing a point $x$ from the inner ring of $C^*$ was already added to $Q$. So it must hold that $d(x, y) > 2r$, for otherwise we would have removed $y$ from the instance and it could not be added to any $T$. We deduce that
$$d(c^*, y) \ge d(x, y) - d(c^*, x) \ge \frac{\beta\, \mathrm{OPT}}{2s} - \frac{\beta\, \mathrm{OPT}}{8 |C^*|} > \frac{3\beta\, \mathrm{OPT}}{8s}.$$

All points in $T$ have distance $> \frac{3\beta\, \mathrm{OPT}}{8s}$ from their respective centers in the optimal clustering, and recall that $T$ is added to $Q$ because $T$ contains at least $s/2$ points. Therefore, the contribution of all elements in $T$ to $\mathrm{OPT}$ is at least $\frac{3\beta}{16}\,\mathrm{OPT}$. It follows that we can have no more than $16/(3\beta)$ such bad components.

We can now prove the correctness of our algorithm.

Theorem 4.6. The algorithm outputs a $k$-clustering whose cost is no more than $(1 + \epsilon)\,\mathrm{OPT}$.

Proof: Using Claim 4.4, it follows that there exists some choice of $k$ components, $T_1, \ldots, T_k$, such that we have the center of every expensive cluster and the good component corresponding to every cheap cluster $C^*$. Fix that choice. We show that for the optimal clustering, replacing the true centers $\{c^*_1, c^*_2, \ldots, c^*_k\}$ with the centers $\{c'_1, c'_2, \ldots, c'_k\}$ that the algorithm outputs increases the cost by at most a $(1+\epsilon)$ factor. This implies that using $\{c'_1, c'_2, \ldots, c'_k\}$ as centers must result in a clustering with cost at most $(1 + \epsilon)\,\mathrm{OPT}$.

Fix any $C^*_i$ in the optimal clustering. Let $\mathrm{OPT}_i$ be the cost of this cluster. If $C^*_i$ is an expensive cluster then we know that its center $c^*_i$ is present in the list of centers chosen. Hence, the cost paid by points in $C^*_i$ will be at most $\mathrm{OPT}_i$. If $C^*_i$ is a cheap cluster then denote by $T$ the good component corresponding to it.
We break the cost of $C_i^*$ into two parts:

$$\mathrm{OPT}_i = \sum_{x \in C_i^*} d(x, c_i^*) = \sum_{x \in T \cup B(T)} d(x, c_i^*) + \sum_{x \in C_i^*,\ x \notin T \cup B(T)} d(x, c_i^*)$$

and compare it to the cost of $C_i^*$ using $c'_i$, the point picked by the algorithm to serve as center:

$$\sum_{x \in C_i^*} d(x, c'_i) = \sum_{x \in T \cup B(T)} d(x, c'_i) + \sum_{x \in C_i^*,\ x \notin T \cup B(T)} d(x, c'_i).$$

Now, the first term is exactly the function that is minimized by $c'_i$, as $c'_i = \arg\min_p \sum_{x \in T \cup B(T)} d(x, p)$. We also know that $c_i^*$, the actual center of $C_i^*$, resides in the inner ring, and therefore, by Claim 4.4, must belong to $T \cup B(T)$. It follows that $\sum_{x \in T \cup B(T)} d(x, c'_i) \le \sum_{x \in T \cup B(T)} d(x, c_i^*)$.

We now upper bound the second term, and show that

$$\sum_{x \in C_i^*,\ x \notin T \cup B(T)} d(x, c'_i) \le (1 + \epsilon) \sum_{x \in C_i^*,\ x \notin T \cup B(T)} d(x, c_i^*).$$

Any point $x \in C_i^*$ s.t. $x \notin T \cup B(T)$ must reside outside the inner ring of $C_i^*$. Therefore, $d(x, c_i^*) > \beta\,\mathrm{OPT}/(8|C_i^*|)$. We show that $d(c_i^*, c'_i) \le \epsilon\beta\,\mathrm{OPT}/(8|C_i^*|)$, and thus we have that $d(x, c'_i) \le d(x, c_i^*) + d(c_i^*, c'_i) \le (1 + \epsilon)\,d(x, c_i^*)$, which gives the required result.

Note that thus far, we have only used the fact that the per-point (average) cost of any cheap cluster is proportional to $\beta\,\mathrm{OPT}/|C_i^*|$. Here is the first (and the only) time we use the fact that this cost is actually at most $(\epsilon/32)\,\beta\,\mathrm{OPT}/|C_i^*|$. Using the Markov inequality, we have that the set $\{x;\ d(x, c_i^*) \le \epsilon\beta\,\mathrm{OPT}/(16|C_i^*|)\}$ contains at least half of the points in $C_i^*$, and they all reside in the inner ring, thus belong to $T \cup B(T)$. Assume for the sake of contradiction that $d(c_i^*, c'_i) > \epsilon\beta\,\mathrm{OPT}/(8|C_i^*|)$. Then, by the triangle inequality, at least half of the points in $C_i^*$ each contribute more than $\epsilon\beta\,\mathrm{OPT}/(16|C_i^*|)$ to the sum $\sum_{x \in T \cup B(T)} d(x, c'_i)$. It follows that this sum is more than $\epsilon\beta\,\mathrm{OPT}/32 \ge \mathrm{OPT}_i$. However, $c'_i$ is the point that minimizes the sum $\sum_{x \in T \cup B(T)} d(x, p)$, and by using $p = c_i^*$ we have $\sum_{x \in T \cup B(T)} d(x, p) \le \mathrm{OPT}_i$. Contradiction.

B. Runtime analysis

A naive implementation of the second step of the algorithm in Section 4-A takes $O(n^3)$ time (for every $s$ and every point $x$, find how many of the remaining points fall within the ball of radius $r$ around it). Finding $c'_i$ for all components takes $O(n^2)$ time, and measuring the cost of the solution using a particular set of $k$ data points as centers takes $O(nk)$ time.
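The two per-choice computations mentioned in the runtime analysis (picking the best center for a point set, and pricing a candidate set of centers) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `best_center` and `kmedian_cost` are hypothetical, the metric is assumed to be given as a precomputed distance matrix `dist`, and the candidate centers are assumed to range over the point set itself (which suffices for the argument above, since $c_i^*$ lies in $T \cup B(T)$).

```python
from typing import List, Sequence

def best_center(points: Sequence[int], dist: List[List[float]]) -> int:
    """Return the member of `points` with the smallest total distance to `points`.

    Tries every candidate against every point: O(|points|^2) distance lookups.
    """
    return min(points, key=lambda p: sum(dist[x][p] for x in points))

def kmedian_cost(centers: Sequence[int], dist: List[List[float]]) -> float:
    """Sum over all n points of the distance to the nearest center: O(n * k)."""
    return sum(min(row[c] for c in centers) for row in dist)

# Toy metric (an assumption for illustration): four points on a line.
pos = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in pos] for a in pos]

center = best_center([0, 1, 2], dist)    # the set [0, 1, 2] plays the role of T ∪ B(T)
total = kmedian_cost([center, 3], dist)  # cost of the clustering with k = 2 centers
```

The quadratic scan in `best_center` is the naive choice; it matches the $O(n^2)$ accounting in the text rather than attempting anything faster.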
Guessing the right $k$ components takes $k^{O(1/\beta)}$ time. Overall, the running time of the algorithm in Figure 1 is $O(n^3 k^{O(1/\beta)})$. The general algorithm, which brute-force guesses the centers of all expensive clusters, makes $n^{O(1/\beta\epsilon)}$ iterations of the given algorithm, so its overall running time is $n^{O(1/\beta\epsilon)} k^{O(1/\beta)}$.

5. A PTAS FOR ANY β-DISTRIBUTED EUCLIDEAN k-MEANS INSTANCE

Analogous to the k-median algorithm, we present an essentially identical algorithm for k-means in Euclidean space. Indeed, the fact that k-means considers distances squared makes upper (or lower) bounding distances a bit more complicated, and requires that we fiddle with the parameters of the algorithm. In addition, the centers $c_i^*$ may not be data points. However, the overall approach remains the same. Roughly speaking, converting the k-median algorithm to the k-means case, we use the same constants, only squared. (We stress that we made no attempt to optimize the constants.) As before, we handle expensive clusters by guessing good substitutes for their centers, and obtain good components for cheap clusters.

Often, when considering the Euclidean space k-means problem, the dimension of the space plays an important factor. In contrast, here we make no assumptions about the dimension, and our results hold for any poly($n$) dimension. In fact, for ease of exposition, we assume all distances between any two points were computed in advance and are given to our algorithm. Clearly, this only adds $O(n^2 \cdot \mathrm{dim})$ to our runtime.

In addition to the change in parameters, we utilize the following facts that hold for the center of mass in Euclidean space.

Fact 5.1. Let $U$ be a (finite) set of points in a Euclidean space, and let $\mu_U$ denote their center of mass ($\mu_U = \frac{1}{|U|} \sum_{x \in U} x$). Let $A$ be a random subset of $U$, and denote by $\mu_A$ the center of mass of $A$. Then for any $\delta < 1/2$, we have both

$$\Pr\left[ \|\mu_U - \mu_A\|^2 > \frac{1}{\delta |A|} \cdot \frac{1}{|U|} \sum_{x \in U} \|x - \mu_U\|^2 \right] < \delta \qquad (1)$$

$$\Pr\left[ \sum_{x \in U} \|x - \mu_A\|^2 > \left(1 + \frac{1}{\delta |A|}\right) \sum_{x \in U} \|x - \mu_U\|^2 \right] < \delta$$

Fact 5.2. Let $U$ be a (finite) set of points in a Euclidean space, and let $A$ and $B$ be a partition of $U$. Denote by $\mu_U$ and $\mu_A$ the centers of mass of $U$ and $A$, respectively. Then

$$\|\mu_U - \mu_A\|^2 \le \frac{1}{|U|} \sum_{x \in U} \|x - \mu_U\|^2 \cdot \frac{|B|}{|A|}.$$

Fact 5.2, proven in [18] (Lemma 2.2), allows us to upper bound the distance between the real center of a cluster and the empirical center we get by averaging all points in $T \cup B(T)$ for a good component $T$. Fact 5.1 allows us to handle expensive clusters. Since we cannot brute-force guess a center (as the centers of the clusters aren't necessarily data points), we guess a sample of $O(\beta^{-1} + \epsilon^{-1})$ points from every expensive cluster, and use their average as a center. Both properties of Fact 5.1, proven in [13] (§3, Lemmas 1 and 2), assure us that this center is an adequate substitute for the real center and is also close to it. This motivates the approach behind our first algorithm, in which we brute-force traverse all choices of $O(\epsilon^{-1} + \beta^{-1})$ points for any of the expensive clusters.

The second algorithm, whose runtime is $(k \log n)^{\mathrm{poly}(1/\epsilon, 1/\beta)} \cdot O(n^3)$, replaces brute-force guessing with random sampling. Indeed, if a cluster contains a poly($1/k$) fraction of the points, then by randomly sampling $O(\epsilon^{-1} + \beta^{-1})$ points, the probability that all points belong to the same expensive cluster and, furthermore, that their average can serve as a good empirical center, is at least $1/k^{\mathrm{poly}(1/\epsilon, 1/\beta)}$. In contrast, if we have expensive clusters that contain few points (e.g. an expensive cluster of size $\sqrt{n}$, while $k = \mathrm{poly}(\log(n))$), then random sampling is unlikely to find good empirical centers for them.
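The way a sampled center of mass stands in for the true one can be checked numerically. The second bound of Fact 5.1 follows from the first via the standard identity $\sum_{x \in U} \|x - \mu_A\|^2 = \sum_{x \in U} \|x - \mu_U\|^2 + |U| \cdot \|\mu_U - \mu_A\|^2$, and the sketch below verifies that identity on synthetic data. The data, the seed, the sample size and all helper names (`centroid`, `sq_dist`, `cost`) are assumptions made for illustration only.

```python
import random

def centroid(points):
    """Center of mass of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cost(points, center):
    """k-means cost of serving all `points` from a single `center`."""
    return sum(sq_dist(p, center) for p in points)

rng = random.Random(0)  # fixed seed; synthetic data, not from the paper
U = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
A = rng.sample(U, 10)   # a small random sample standing in for the guessed points

mu_U, mu_A = centroid(U), centroid(A)
lhs = cost(U, mu_A)                                 # cost when using the sampled center
rhs = cost(U, mu_U) + len(U) * sq_dist(mu_U, mu_A)  # optimal cost plus the shift penalty
ratio = lhs / cost(U, mu_U)  # the factor that Fact 5.1 bounds by 1 + 1/(delta*|A|) w.p. > 1 - delta
```

Since the extra cost is exactly $|U| \cdot \|\mu_U - \mu_A\|^2$, any bound on the distance between the two centers translates directly into a multiplicative bound on the cost.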
However, recall that our algorithm collects points and deletes them from our instance. So it is possible that in the middle of the run we are left with so few points that expensive clusters whose size is small in comparison to the original number of points contain a poly($1/k$) fraction of the remaining points. Indeed, this is the motivation behind our second algorithm. We run the algorithm while interleaving the Population Stage of the algorithm with random sampling. Instead of running $s$ from $n$ to 1, we use

$$\left\{ n,\ \frac{n}{k^2},\ \frac{n}{k^4},\ \frac{n}{k^6},\ \ldots,\ 1 \right\} \qquad (2)$$

as break points. Correspondingly, we define $l_i$ to be the number of expensive clusters whose size is in the range $[n/k^{2i+2}, n/k^{2i})$. Whenever $s$ reaches such an $n/k^{2i}$ break point, we randomly sample points in order to guess the $l_{i+3}$ centers of the clusters that lie 3 intervals ahead (and so, initially, we guess all centers in the first 3 intervals). We prove that in every interval we are likely to sample good empirical centers. This is a simple corollary of Fact 5.2 along with the following two claims. First, we claim that at the end of each interval, the number of points remaining is at most $n/k^{2i+1}$. Secondly, we also claim that in each interval we do not remove even a single point from a cluster whose size is smaller than $n/k^{2i+6}$. We refer the reader to Appendix A for the algorithms and their analysis.

6. DISCUSSION AND OPEN PROBLEMS

The algorithm we present here for k-median has runtime of poly($n^{1/\beta}$, $n^{1/\epsilon}$, $k$), and the algorithm for k-means has runtime poly($n$, $(k \log n)^{1/\epsilon}$, $(k \log n)^{1/\beta}$) (see footnote 6). We comment that it is unlikely that we can obtain an algorithm of runtime poly($1/\epsilon$, $1/\beta$, $n$). Observe that for any clustering instance and any $k > 1$ we have that $\mathrm{OPT}_{(k-1)} > (1 + \frac{1}{n})\,\mathrm{OPT}_{(k)}$, simply by considering the $k$-clustering that results from taking the optimal $(k-1)$-clustering and setting the point which is the furthest from its center in a cluster of its own (as a new center). Hence, any k-median/k-means instance is β-distributed for $\beta = \Omega(\frac{1}{n})$. Recall from Section 3-D that the k-median problem restricted only to weakly-stable instances has no FPTAS.
So the fact that our algorithm's runtime has super-polynomial dependence on both $1/\beta$ and $1/\epsilon$ is unavoidable. Nonetheless, one might still hope to do better. In particular, one major runtime expense of our algorithm comes from handling expensive clusters by brute-force guessing or sampling. Can one improve the runtime by doing something more clever for expensive clusters? It is worth noting that for the stability conditions of [4], Voevodski et al. [20] develop an especially efficient implementation with good performance (in terms of both accuracy and speed) on real-world protein sequence datasets.

A different open problem lies in the relation to the results of Ostrovsky et al. [18]. Their motivating question was to analyze the performance of Lloyd-type methods over stable instances. Is it possible that weak deletion-stability is sufficient for some version of the k-means heuristic to converge to the optimal clustering?

Footnote 6: When dealing with k-means in a Euclidean space of dimension dim, we need to explicitly compute the distances, so we add $n^2 \cdot \mathrm{dim}$ to the runtime.

Acknowledgements: This work was supported in part by the National Science Foundation under grant CCF

REFERENCES

[1] Sanjeev Arora, Prabhakar Raghavan, and Satish Rao. Approximation schemes for Euclidean k-medians and related problems. In STOC.

[2] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristic for k-median and facility location problems. In STOC.
[3] Mihai Bādoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In STOC.
[4] Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Approximate clustering without the approximation. In SODA.
[5] Maria-Florina Balcan and Mark Braverman. Finding low error clusterings. In COLT.
[6] Maria-Florina Balcan, Heiko Röglin, and Shang-Hua Teng. Agnostic clustering. In ALT.
[7] Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem. In STOC.
[8] Sanjoy Dasgupta. The hardness of k-means clustering. Technical report, University of California at San Diego.
[9] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In STOC.
[10] Michelle Effros and Leonard J. Schulman. Deterministic clustering with data nets. ECCC, (050).
[11] Sudipto Guha and Samir Khuller. Greedy strikes back: Improved facility location algorithms. In Journal of Algorithms.
[12] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC.
[13] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In Proc. 10th Symp. Comp. Geom.
[14] Kamal Jain, Mohammad Mahdian, and Amin Saberi. A new greedy approach for facility location problems (extended abstract). In STOC.
[15] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. In Proc. 18th Symp. Comp. Geom.
[16] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS.
[17] R. Ostrovsky and Y. Rabani. Polynomial time approximation schemes for geometric k-clustering. In FOCS.
[18] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In FOCS.
[19] F. Schalekamp, M. Yu, and A. van Zuylen. Clustering with or without the approximation. In COCOON.
[20] Konstantin Voevodski, Maria Florina Balcan, Heiko Röglin, Shang-Hua Teng, and Yu Xia. Efficient clustering with limited distance information. In Proc. 26th UAI.

APPENDIX

We present the algorithm for $(1 + \epsilon)$-approximation to the k-means optimum of a β-distributed instance. Much like in Section 4, we call a cluster in the optimal k-means solution cheap if $\mathrm{OPT}_i = \sum_{x \in C_i^*} d^2(x, c_i^*) \le \beta\epsilon\,\mathrm{OPT}/4^6$.

A. Clustering β-distributed Instances of Euclidean k-means

The algorithm is presented in Figure 2. The correctness is proved in a similar fashion to the proof of correctness presented in Section 4. First, observe that by the Markov inequality, for any cheap cluster $C_i^*$, the set $\{x;\ d^2(x, c_i^*) > t\,\beta\,\mathrm{OPT}/|C_i^*|\}$ cannot contain more than an $\epsilon/(4^6 t)$ fraction of the points in $C_i^*$. It follows that the inner ring of $C_i^*$, the set $\{x;\ d^2(x, c_i^*) \le \beta\,\mathrm{OPT}/(256|C_i^*|)\}$, contains at least half of the points of $C_i^*$. As mentioned in Section 5, the algorithm populates the list $Q$ with good components corresponding to cheap clusters. Also from Section 5, we know that for every expensive cluster, there exists a sample of $O(\frac{1}{\beta} + \frac{1}{\epsilon})$ data points whose center is a good substitute for the center of the cluster. In the analysis below, we assume that $Q$ has been initialized correctly with $Q_{init}$ containing these good substitutes. In general, the algorithm will be run multiple times for all possible guesses of samples from expensive clusters.

We start with the following lemma, which is similar to Lemma 4.2.

Lemma A.1. Let $T \in Q$ be any component and let $s$ be the stage in which we insert $T$ to $Q$. Let $C_i^*$ be any cheap cluster s.t. $s \ge |C_i^*|$. Then (a) $T$ does not contain any point $z$ s.t. the distance $d^2(c_i^*, z)$ lies within the range $\left[\beta\,\mathrm{OPT}/(16|C_i^*|),\ \beta\,\mathrm{OPT}/(4|C_i^*|)\right]$, and (b) $T$ cannot contain both a point $p_1$ s.t. $d^2(c_i^*, p_1) \le \beta\,\mathrm{OPT}/(16|C_i^*|)$ and a point $p_2$ s.t. $d^2(c_i^*, p_2) > \beta\,\mathrm{OPT}/(4|C_i^*|)$.
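Proof: Assume (a) does not hold. Let $z$ be such a point, and let $B(z, r)$ be the set of all points $p$ s.t. $d^2(z, p) \le r = \beta\,\mathrm{OPT}/(64s) \le \beta\,\mathrm{OPT}/(64|C_i^*|)$. As $d^2(z, c_i^*) \ge \beta\,\mathrm{OPT}/(16|C_i^*|)$, we have that $d(z, p) \le \frac{1}{2} d(z, c_i^*)$. It follows that $d^2(c_i^*, p) \ge (d(c_i^*, z) - d(z, p))^2 \ge (d(c_i^*, z)/2)^2 \ge \beta\,\mathrm{OPT}/(64|C_i^*|)$. Similarly, $d^2(c_i^*, p) \le (d(c_i^*, z) + d(z, p))^2 \le (3d(c_i^*, z)/2)^2 \le 9\beta\,\mathrm{OPT}/(16|C_i^*|)$. Thus $B(z, r)$ is contained in $C_i^*$ but falls outside the inner ring of $C_i^*$, yet contains more than $s/2 \ge |C_i^*|/2$ points. Contradiction.

Assume (b) does not hold. Let $p_1$ and $p_2$ be the above mentioned points. As $T$ is a connected component, it follows that along the path from $p_1$ to $p_2$ there exists a pair of neighboring nodes $x, y$ s.t. $d^2(x, y) \le r \le \beta\,\mathrm{OPT}/(64|C_i^*|)$, yet $d^2(c_i^*, x) \le \beta\,\mathrm{OPT}/(16|C_i^*|)$ while $d^2(c_i^*, y) > \beta\,\mathrm{OPT}/(4|C_i^*|)$. However, a simple computation gives that $d^2(c_i^*, y) \le (d(c_i^*, x) + d(x, y))^2 \le 9\beta\,\mathrm{OPT}/(64|C_i^*|)$. Contradiction.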
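The Population Stage that Lemma A.1 reasons about (step 2 of Figure 2) can be sketched in code. This is a hedged illustration rather than the authors' implementation, and it rests on several assumptions: the value of OPT is known and passed as `opt`; squared distances are precomputed in a matrix `d2`; `q_init` is given as sets of data-point indices (in the paper, the initial k-means centers need not be data points); and within one stage, components are extracted one at a time, recomputing the dense points after each removal. The names `population_stage` and `_large_component` are hypothetical.

```python
from typing import List, Optional, Sequence, Set

def _large_component(dense: Sequence[int], d2: List[List[float]],
                     r: float, s: int) -> Optional[Set[int]]:
    """Return one connected component of size > s/2 among the dense points, or None."""
    unseen = set(dense)
    while unseen:
        start = unseen.pop()
        comp, stack = {start}, [start]
        while stack:
            a = stack.pop()
            nbrs = {b for b in unseen if d2[a][b] <= r}  # edge rule (i)
            unseen -= nbrs
            comp |= nbrs
            stack.extend(nbrs)
        if len(comp) > s / 2:
            return comp
    return None

def population_stage(d2: List[List[float]], beta: float, opt: float,
                     q_init: Sequence[Set[int]] = ()) -> List[Set[int]]:
    """Populate the list Q of components for s = n, n-1, ..., 1."""
    n = len(d2)
    remaining = set(range(n))
    Q = [set(T) for T in q_init]
    for s in range(n, 0, -1):
        r = beta * opt / (64 * s)                          # step (a)
        in_q = set().union(*Q)
        remaining = {x for x in remaining                  # step (b): keep x only if d2(x, Q) >= 4r
                     if all(d2[x][y] >= 4 * r for y in in_q)}
        while True:
            dense = [x for x in remaining                  # rules (ii), (iii): |B(x, r)| > s/2
                     if sum(d2[x][y] <= r for y in remaining) > s / 2]
            T = _large_component(dense, d2, r, s)          # step (d)
            if T is None:
                break
            Q.append(T)
            remaining = {x for x in remaining              # remove B(T), which includes T itself
                         if all(d2[x][y] > 4 * r for y in T)}
    return Q

# Toy instance (illustrative values, not from the paper): two tight groups on a line.
xs = [0.0, 0.01, 0.02, 0.03, 10.0, 10.01, 10.02, 10.03]
d2 = [[(a - b) ** 2 for b in xs] for a in xs]
Q = population_stage(d2, beta=1.0, opt=2.56)
```

On this toy input each tight group is collected as its own component. The values of `beta` and `opt` were chosen only so that the radii are easy to follow by hand; they are not derived from the instance.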

1) Initialization Stage: Set $Q \leftarrow Q_{init}$.
2) Population Stage: For $s = n, n-1, n-2, \ldots, 1$ do:
   a) Set $r = \beta\,\mathrm{OPT}/(64s)$.
   b) Remove any point $x$ such that $d^2(x, Q) < 4r$. (Here, $d(x, Q) = \min_{T \in Q;\ y \in T} d(x, y)$.)
   c) For any remaining data point $x$, denote the set of data points whose distance squared from $x$ is at most $r$ by $B(x, r)$. Connect any two remaining points $a$ and $b$ if: (i) $d^2(a, b) \le r$, (ii) $|B(a, r)| > s/2$ and (iii) $|B(b, r)| > s/2$.
   d) Let $T$ be a connected component of size $> s/2$. Then:
      i) Add $T$ to $Q$. (That is, $Q \leftarrow Q \cup \{T\}$.)
      ii) Define the set $B(T) = \{x : d^2(x, y) \le 4r \text{ for some } y \in T\}$. Remove the points of $B(T)$ from the instance.
3) Centers-Retrieving Stage: For any choice of $k$ components $T_1, T_2, \ldots, T_k$ out of $Q$:
   a) Find the best center $c'_i$ for $T_i \cup B(T_i)$. That is, $c'_i = \mu(T_i \cup B(T_i)) = \frac{1}{|T_i \cup B(T_i)|} \sum_{x \in T_i \cup B(T_i)} x$.
   b) Partition all $n$ points according to the nearest point among the $k$ centers of the current $k$ components.
   c) If a clustering of cost at most $(1+\epsilon)\mathrm{OPT}$ is found, output these $k$ centers and halt.

Figure 2. A PTAS for β-distributed instances of Euclidean k-means.

Lemma A.1 allows us to give the analogous claims to Claims 4.3 and 4.4. As before, call a component $T$ good if it is contained within some target cluster $C_i^*$ and $T \cup B(T)$ contains all of the inner ring points of $C_i^*$. Otherwise, the component is called bad, provided it is not one of the initial centers present in $Q_{init}$. We now show that each cheap target cluster will have a single, unique, good component.

Claim A.2. Let $C_i^*$ be any cheap cluster in the target clustering. By stage $s = |C_i^*|$, the algorithm adds to $Q$ a component $T$ that contains a point from the inner ring of $C_i^*$.

Claim A.3. Let $T$ be a good connected component added to $Q$, containing an inner ring point from cluster $C_i^*$. Then: (a) all points in $T$ are of distance squared at most $\beta\,\mathrm{OPT}/(16|C_i^*|)$ from $c_i^*$, (b) $T \cup B(T)$ is fully contained in $C_i^*$, (c) the entire inner ring of $C_i^*$ is contained in $T \cup B(T)$, and (d) no other component $T' \ne T$ in $Q$ contains an inner ring point from $C_i^*$.

As the proofs of Claims A.2 and A.3 are identical to the proofs of Claims 4.3 and 4.4, we omit them.

Lemma A.4. We do not add to $Q$ more than $1000/\beta$ bad components.
Proof: Consider any bad component $T$ that we add to $Q$, and denote the stage in which we insert $T$ to $Q$ as $s$. So the size of this component is $> s/2$. Let $y$ be an arbitrary point from $T$, which belongs to cluster $C$ in the optimal clustering. Let $c$ be the center of $C$. We show that $d^2(c, y) > \beta\,\mathrm{OPT}/(500s)$. We divide into cases.

Case 1: $C$ is a cheap cluster and $s \ge |C|$. Recall that $T$ must contain more than $s/2 \ge |C|/2$ points, so it follows that $T$ contains some point $x$ that does not belong to $C$. β-stability gives that this point has distance $d^2(c, x) > \beta\,\mathrm{OPT}/|C|$, and we apply Lemma A.1 to deduce that all points in $T$ are of distance squared at least $\beta\,\mathrm{OPT}/(4|C|)$ from $c$.

Case 2: $C$ is a cheap cluster and $s < |C|$. In this case the entire inner ring of $C$ already belongs to some $T' \in Q$. Let $x \in T'$ be any inner ring point from $C$; we have that $d^2(c, x) \le \beta\,\mathrm{OPT}/(256|C|) < \beta\,\mathrm{OPT}/(256s)$, while $d^2(x, y) > \beta\,\mathrm{OPT}/(16s)$. It follows that $d^2(c, y) \ge (3d(x, y)/4)^2 > \beta\,\mathrm{OPT}/(500s)$.

Case 3: $C$ is an expensive cluster and $s > 2|C|$. We claim that $d^2(c, y) > \beta\,\mathrm{OPT}/(32|C|)$. If, by contradiction, we have that $d^2(c, y) \le \beta\,\mathrm{OPT}/(32|C|)$, then we show that the ball $B(y, r)$ contains only points from $C$, yet it must contain more than $s/2 > |C|$ points. This is because each $p \in B(y, r)$ satisfies $d^2(c, p) \le (d(c, y) + d(y, p))^2 \le \left(\sqrt{\beta\,\mathrm{OPT}/(32|C|)} + \sqrt{\beta\,\mathrm{OPT}/(16s)}\right)^2 < \beta\,\mathrm{OPT}/|C|$.

Case 4: $C$ is an expensive cluster and $s \le 2|C|$. In this case, from Fact 5.1 we know that $Q_{init}$ contains a good empirical center $c'$ for the expensive cluster $C$, in the sense that $\|c - c'\|^2 \le \beta\,\mathrm{OPT}/(512|C|) \le \beta\,\mathrm{OPT}/(256s)$. Then, similarly to Case 2 above, we have $d^2(y, c) \ge (d(y, c') - d(c, c'))^2 > \beta\,\mathrm{OPT}/(500s)$.

It follows that every point in $T$ has a large distance from its center. Therefore, the more than $s/2$ points in this component contribute at least $\beta\,\mathrm{OPT}/1000$ to the k-means cost. Hence, we can have no more than $1000/\beta$ such bad components.

We now prove the main theorem.

Theorem A.5. The algorithm outputs a $k$-clustering whose cost is at most $(1 + \epsilon)\mathrm{OPT}$.

Proof: Using Claim A.3, it follows that there exists some choice of $k$ components which has good components for all the cheap clusters and good substitutes for the centers of the expensive clusters.
Fix that choice and consider a cluster $C_i^*$ with center $c_i^*$. If $C_i^*$ is an expensive cluster, then from Section 5 we know that $Q_{init}$ contains a point $c'_i$ such that $d^2(c_i^*, c'_i) \le \frac{\beta\epsilon}{\beta+\epsilon} \cdot \frac{\mathrm{OPT}_i}{|C_i^*|}$. Hence, the cost paid by the points in $C_i^*$ will be at most $(1 + \epsilon)\mathrm{OPT}_i$. If $C_i^*$ is a cheap cluster, then denote by $T$ the good component that resides within $C_i^*$. Denote $T \cup B(T)$ by $A$, and $C_i^* \setminus A$ by $B$. Let
