arxiv: v1 [cs.lg] 5 Oct 2018

Size: px

Start display at page:

Download "arxiv: v1 [cs.lg] 5 Oct 2018"

Marsha Hines
5 years ago
Views:

1 CDF TRANSFORM-SHIFT: AN EFFECTIVE WAY TO DEAL WITH INHOMOGENEOUS DENSITY DATASETS A PREPRINT arxiv: v1 [cs.lg] 5 Oct 2018 Ye Zhu ( ) School of Iformatio Techology Deaki Uiversit Victoria, Australia 3125 ye.zhu@ieee.org Mark J. Carma Faculty of Iformatio Techology Moash Uiversit Victoria, Australia 3800 mark.carma@moash.edu October 9, 2018 ABSTRACT Kai Mig Tig School of Iformatio Techology Federatio Uiversit Victoria, Australia 3842 kaimig.tig@federatio.edu.au Maia Agelova School of Iformatio Techology Deaki Uiversit Victoria, Australia 3125 maia.a@deaki.edu.au May distace-based algorithms exhibit bias towards dese clusters i ihomogeeous datasets (i.e., those which cotai clusters i both dese ad sparse regios of the space). For example, desity-based clusterig algorithms ted to joi eighbourig dese clusters together ito a sigle group i the presece of a sparse cluster; while distace-based aomaly detectors exhibit difficulty i detectig local aomalies which are close to a dese cluster i datasets also cotaiig sparse clusters. I this paper, we propose the CDF Trasform-Shift (CDF-TS) algorithm which is based o a multi-dimesioal Cumulative Distributio Fuctio (CDF) trasformatio. It effectively coverts a dataset with clusters of ihomogeeous desity to oe with clusters of homogeeous desity, i.e., the data distributio is coverted to oe i which all locally low/high-desity locatios become globally low/high-desity locatios. Thus, after performig the proposed Trasform-Shift, a sigle global desity threshold ca be used to separate the data ito clusters ad their surroudig oise poits. Our empirical evaluatios show that CDF-TS overcomes the shortcomigs of existig desitybased clusterig ad distace-based aomaly detectio algorithms ad sigificatly improves their performace. Keywords Desity-ratio Desity-based Clusterig knn aomaly detectio ihomogeeous desities Scalig Shift 1 Itroductio May distace-based algorithms have a bias towards dese clusters 2,5,13,36,26. For example, (i) DBSCAN 12 is biased towards groupig eighbourig dese clusters ito a sigle cluster, i the presece of a sparse cluster 36. (ii) The knn aomaly detector 4,27 employs the knn desity estimator 16,23 to estimate the desity of every poit i a dataset, ad the sorts all the poits based o their estimated desity i descedig order. Low-desity poits are regarded as more likely to be aomalous tha high-desity poits. Sice the rakig is based o global desities, this bias results i the misclassificatio of so-called local aomalies (poits cosidered aomalous with respect to their eighborhood) i high desity regios as ormal poits 7,8,9.

2 A umber of techiques have bee proposed to correct this bias, otably, the desity-ratio approach of ReCo 36, ReScale 36 ad DScale 37 (for desity-based clusterig such as DBSCAN 14 ad DP 29 ), ad the reachability-distace approach of LOF 7 (for the knn aomaly detector 27 ). The desity-ratio approach 36,37 aims to trasform the data i such a way that the estimated desity of each trasformed poit approximates the desity-ratio of that poit i the origial space, ad as a result all locally low-desity poits are easily separated from all locally high-desity poits usig a sigle global threshold. While the curret desity-ratio methods have bee show to improve the clusterig performace of desity-based clusterig, we idetify their shortcomigs ad propose a ew algorithm to better achievig the aim of the desity-ratio approach. Furthermore, we exted its applicatio to distace-based aomaly detectio. We propose to perform a multi-dimesioal Cumulative Distributio Fuctio (CDF) based trasformatio o the give dataset as a preprocessig step, which lears a mappig f 1 such that the resultig Euclidea distace betwee pairs of poits d(x, y) = f(x) f(y) 2 is locally cosistet with the Euclidea distace betwee poits i the origial space (i.e., all local iter-poit distaces are proportioal to their origial values), while the fial trasformed dataset has a uiform distributio, as estimated by desity estimator pdf usig a certai badwidth. This effectively achieves the desired aim of CDF trasform, i.e., desity equalisatio w.r.t. a desity estimator with a certai badwidth, without impairig the cluster structure i the origial dataset. This paper makes the followig cotributios: (1) It geeralises a curret CDF scalig method DScale 37 so that it ca be applied to desity estimators usig uiform kerels with either fixed or variable badwidths. (2) It proposes a ew Trasform-Shift method called CDF-TS (CDF Trasform-Shift) that chages the volumes of each poit s eighbourhood simultaeously i the full multi-dimesioal space. Existig CDF trasform methods are either idividual attribute based (e.g., ReScale 36 ) or idividual poit based (e.g., DScale 37 ). As far as we kow, this is the first attempt to perform a multi-dimesioal CDF trasform to achieve the desired effect of homogeisig (makig uiform) the distributio of a etire dataset w.r.t a desity estimator. (3) It applies the ew Trasform-Shift method to desity-based clusterig ad distace-based aomaly detectio to demostrate its impact i overcomig the weakesses of three existig algorithms. The proposed approach CDF-TS differs from existig approaches i terms of: (i) Methodology: While ReScale 36, DScale 37 ad the proposed CDF-TS follow the same pricipled approach ad have the same aim, the proposed CDF Trasform-Shift method is a multi-dimesioal techique which icorporates both trasformatio ad poit-shift, which greatly expad the approach s applicability. I cotrast, ReScale is a oe-dimesioal trasformatio ad poit-shift techique; ad DScale icorporates trasformatio without poit-shift. The complete CDF-TS method leads to a better result i both clusterig ad aomaly detectio, which we will show i Sectio 6. (ii) Metric compliace: CDF-TS ad ReScale create a similarity matrix which is a metric; whereas the oe produced by DScale 37 is ot a metric, i.e., it does ot satisfy the symmetry or triagle iequalities. (iii) Ease of use: Like Rescale, CDF-TS trasforms the data as a preprocessig step. The targeted (clusterig or aomaly detectio) algorithm ca the applied ualtered to the trasformed data. I cotrast, DScale requires a algorithmic modificatio i order to use it. To demostrate the effects of differet dissimilarity measures o image segmetatio, we took the image show i Figure 1a, ad applied differet rescalig methods (o rescalig, ReScale, DScale ad CDF-TS) ad clusterig the 3-dimesioal pixel values of the image i LAB space 32. Tables 1 show the results of applyig the clusterig algorithm DP 29 to produce three clusters. The scatter plots i LAB space i Figure 1 show that the three clusters become more uiform distributed with similar desity ad the gaps betwee their boudaries are much larger for CDF-TS tha the other three scalig methods. As a result, DP with CDF-TS yields the best clusterig result, show i Table 1. The rest of the paper is orgaised as follows. We describe issues of ihomogeeous desity i desity-based clusterig ad distace-based aomaly detectio i Sectio 2. Sectio 3 provides the desity-ratio estimatio as a priciple to address the issues of ihomogeous desity. Sectio 4 presets the two existig CDF scalig approaches based o desity-ratio. Sectio 5 proposes the local-desity trasform-shift as a multi-dimesioal CDF scalig method. Sectio 6 empirically evaluates the performace of existig desity-based clusterig ad distace-based aomaly detectio algorithms o the trasformed-ad-shifted datasets. Discussio ad the coclusios are provided i the last two sectios. 1 The mappig f(x; D, pdf) depeds o all poits i the give dataset D, where each poit y D asserts a (local ad poit-based) scalig factor r(y; pdf) to x. 2

The scatter plots of DScale ad No Scalig are based o the origial LAB attributes; ad the other two are based o the

3 (a) Image (b) No Scalig (c) ReScale (d) DScale (e) CDF-TS Figure 1: Clusterig results by DP i LAB space of the image show i (a). The colours i the scatter plots idicate the three clusters idetified by DP usig each of the four methods. The scatter plots of DScale ad No Scalig are based o the origial LAB attributes; ad the other two are based o the trasformed attributes. Table 1: DP s image segmetatio o the image show i Figure 1a. Cluster 1 Cluster 2 Cluster 3 CDF-TS DScale ReScale No Scalig 3

4 2 Issues of ihomogeeous desity 2.1 Desity-based clusterig Let D = {x 1, x 2,..., x }, x i R d, x i F deote a dataset of poits, each sampled idepedetly from a distributio F. Let pdf(x) deote the desity estimate of poit x which approximates the true desity pdf(x). I additio, let N (x; ɛ) be the ɛ-eighbourhood of x, N (x; ɛ) = {y D s(x, y) ɛ}, where s(, ) is the distace fuctio (s : R d R d R). I geeral, the desity of a poit pdf(x) ca be estimated via a small ɛ-eighbourhood (as used by the classic desity-based clusterig algorithm DBSCAN 14 ) defied as follows: where V ɛ ɛ d is the volume of a d-dimesioal ball of radius ɛ. pdf ɛ (x) = 1 {y D s(x, y) ɛ} N (x; ɛ) = (1) V ɛ V ɛ A set of clusters {C 1,..., C ς } is defied as o-empty ad o-itersectig subsets: C i D, C i, i j C i C j =. Let c i = arg max x Ci pdf(x) deote the mode (poit of the highest estimated desity) for cluster Ci ; ad p i = pdf(c i ) deote the correspodig peak desity value. DBSCAN uses a global desity threshold to idetify core poits (which have desities higher tha the threshold); ad the it liks eighbourig core poits together to form clusters 14. It is defied as follows. Defiitio 1. A core poit is a poit with a estimated desity above or equal to a user-specified threshold τ, i.e., ( pdf ɛ (x) τ) Core(x) = 1, where Core deotes a set idicator fuctio. Defiitio 2. Usig a desity estimator with desity threshold τ, a poit x 1 is desity coected with aother poit x p i a sequece of p uique poits from D, i.e., {x 1, x 2, x 3,..., x p }: CON τ ɛ (x 1, x p ) is defied as: (i) if p = 2 : CONɛ τ (x (x 1, x p ) 1 N ɛ (x p )) (Core(x 1 ) Core(x p )); (ii) if p > 2 : (x1,x 2,...x p)(( i {2,...,p} x i 1 N ɛ (x i )) ( i {2,...,p 1} Core(x i ))). Defiitio 3. A cluster detected by the desity-based algorithm DBSCAN is a maximal set of desity coected istaces, i.e., C i = {x D CON τ ɛ (x, c i )}, where c i = arg max x Ci pdf ɛ (x) is the cluster mode. For this kid of algorithm to fid all clusters i a dataset, the data distributio must have the followig ecessary coditio: the peak desity of each cluster must be greater tha the maximum over all possible paths of the miimum desity alog ay path likig ay two modes. 2 This coditio is formally described by Zhu et al 36 as follows: mi c k > max g ij (2) k {1,...,ς} i j {1,...,ς} where g ij is the largest of the miimum desity alog ay path likig the mode of for clusters C i ad C j. This coditio implies that there must exist a threshold τ that ca be used to break all paths betwee the modes by assigig regios with desity less tha τ to oise. Otherwise, if the mode of some cluster has a desity lower tha that of a low-desity regio betwee other clusters, the this kid of desity-based clusterig algorithm will fail to fid all clusters. Either some high-desity clusters will be merged together (whe a lower desity threshold is used), or low-desity clusters will be desigated as oise (whe a higher desity threshold is used). To illustrate, Figure 2a shows that usig a high threshold τ 1 will cause all poits i Cluster C 3 to be assiged to oise but usig a low threshold τ 2 will cause poits i C 1 ad C 2 to be assiged to the same cluster. 2.2 knn aomaly detectio A classic earest-eighbour based aomaly detectio algorithm assigs a aomaly score to a istace based o its distace to the k-th earest eighbour 27. The istaces with the largest aomaly scores are idetified as aomalies. 2 A path likig two modes c i ad c j is defied as a sequece of uique poits startig with c i ad edig with c j where adjacet poits lie i each other s ɛ-eighbourhood. 4

5 (a) Origial data (b) ReScaled data Figure 2: (a) A mixture of three Gaussia distributios that caot be separated usig a sigle desity threshold; (b) Desity distributio o ReScaled data of (a), where a sigle desity threshold ca be foud to separated all three clusters. Note that poit x 1, x 2 ad x 3 are shifted to y 1, y 2 ad y 3, respectively. Here η is a larger badwidth tha ɛ. Give a dataset ad the parameter k, the desity of x ca be estimated usig a k-th earest eighbour desity estimator (as used by the classic knn aomaly detector 27 ): pdf knn (x; k) = pdf(x; k ɛ k (x)) = V ɛk (x) 1 ( ɛ k (x) )d (3) where ɛ k (x) is the distace betwee x R d ad its k-th earest eighbour i a dataset D. Note that the k-th earest eighbour distace ɛ k (x) is a proxy to the desity of x, i.e., high ɛ k (x) idicates low desity, ad vice versa. Let C be the set of all ormal poits i a dataset D. The coditio uder which the classic knn aomaly detector could, with a appropriate settig of a desity/distace threshold, idetify every aomaly y i A = D \ C is give as follows: mi y A ɛ k (y) > max x C ɛ k (x) (4) Equatio 4 states that all aomalies must have the highest knn distaces (or lowest desities) i order to detect them. I other words, knn aomaly detectors ca detect both global aomalies ad scattered aomalies which have lower desities tha that of all ormal poits 1,20. However, based o this characteristic, knn aomaly detectors are uable to detect: (a) Local aomalies. This is because a local aomaly y with low desity relative to earby ormal (o-aomalous) istaces i a regio of high average desity may still have higher desity tha that of ormal (o-aomalous) istaces i regios of lower average desity 7. Traslatig this i terms of k-th NN distace, we have: x,z C,y A ɛ k (x) < ɛ k (y) < ɛ k (z). Here, some local aomalies (for example, poits located aroud the boudaries of C 1 ad C 2 ) are raked lower tha the ormal poits located aroud the cere sparse cluster (C 3 ), as show i Figure 2a. (b) Aomalous clusters (sometimes referred to as clustered aomalies 21 ) are groups of poits that are too small to be cosidered true clusters ad are foud i low desity parts of the space (i.e. are separate from the other clusters). A purported beefit of the knn aomaly detector is that it is able to remove such aomalous clusters providig k is sufficietly large (larger tha the size of the aomalous group) 1. By rescalig the data we are able to exted this property to locally aomalous clusters, i.e. to be tolerat to variatios i the local (backgroud or average) desity over the space. A example is show i Figure Summary Both desity-based clusterig ad distace-based aomaly detector have weakesses whe it comes to hadlig datasets with ihomogeeous desity clusters. Rather tha creatig a ew desity estimator or modifyig the existig clusterig ad aomaly detectio algorithm procedures, we advocate trasformig data to be more uiformly distributed tha it is i the origial space such that the separatio betwee clusters ca be idetified easily. 5

6 (a) Origial data (b) ReScaled data Figure 3: (a) A mixture of two Gaussia distributios that C is a ormal cluster ad A is a aomalous cluster; (b) Desity distributio o the ReScaled data of (a), where the aomalous cluster are farther to the ormal cluster cetre. Note that Poit x 1, x 2 ad x 3 are shifted to y 1, y 2 ad y 3, respectively. 3 Desity-ratio estimatio Desity-ratio estimatio is a pricipled approach to overcome the weakess of desity-based clusterig for detectig clusters with ihomogeeous desities 36. The desity-ratio of a poit is the ratio of two desity estimates calculated usig the same desity estimator, but with two differet badwidth settigs. Let pdf(, γ) ad pdf(, λ) be desity estimators usig kerels of badwidth γ ad λ, respectively. Give the costrait that the deomiator has larger badwidth tha the umerator γ < λ, the desity ratio of x is estimated as rpdf(x; γ, λ) = pdf(x; γ) pdf(x; λ) (5) We correct the lemma from 36 regardig the desity ratio value : Lemma 1. For ay data distributio ad sufficietly small values of γ ad λ s.t. γ < λ, if x is at a local maximum desity of N (x; λ), the rpdf(x; γ, λ) 1; ad if x is at a local miimum desity of N (x; λ), the rpdf(x; γ, λ) 1. Sice poits located at local high-desity areas (almost ivariably) have desity-ratio higher tha poits located at local low-desity areas, a global desity-ratio threshold aroud uity ca be used to idetify all cluster peaks ad break all paths betwee differet clusters. Thus, based o desity-ratio estimatio, existig desity-based clusterig algorithms such as DBSCAN ca idetify clusters as regios of locally high desity, separated by regios of locally low desity. Similarly, a desity-based aomaly detector is able to detect local aomalies sice their desity-ratio values are lower ad raked higher tha ormal poits with locally high desities. 4 CDF scalig to overcome the issue of ihomogeeous desities 4.1 Oe-dimesioal CDF scalig Let pdf(, λ) ad cdf(, λ) be a desity estimator ad a cumulative desity estimator, respectively, usig a kerel of badwidth λ. Let x be a trasformed poit of x usig cdf as follows: The, we have the property as follows 36 : ad pdf(x ; γ) x = cdf(x; λ) (6) pdf(x ; λ) = 1/ (7) pdf(x; γ) pdf(x; λ) 6 for λ > γ (8)

7 ReScale 36 is a represetative implemetatio algorithm based o this cdf trasform. Figure 2b ad Figure 3b show ReScale rescales the data distributio o two 1-dimesioal datasets, respectively. They show that clusters ad aomalies are easier to idetify after the applicatio of ReScale. Sice this cdf trasform ca be performed o a oe-dimesioal dataset oly, ReScale must apply the trasformatio to each attribute idepedetly for a multi-dimesioal dataset. 4.2 Multi-dimesioal CDF scalig Usig a distace scalig method, a multi-dimesioal cdf trasform ca be achieved by simply rescalig the distaces betwee each poit ad all the other istaces i its local eighbourhood, (thereby cosiderig all of the dimesios of the multidimesioal dataset at oce). Give a poit x D, the distace betwee x ad all poits i its λ-eighbourhood y N (x; s, λ) 3 ca be rescaled usig a scalig fuctio r( ): where s (, ) is the scaled distace of s(, ). x,y D,y N (x;s,λ) s (x, y) = s(x, y) r(x; λ) (9) Here, the scalig fuctio r(x; λ) depeds o both the positio x ad size of the eighbourhood λ. It is defied as follows usig the estimated desity pdf(x; s, λ) with the aim of makig the desity distributio withi the λ-eighbourhood uiform: r(x; λ) = m λ (pdf(x; s, λ) V λ ) 1 1 d pdf(x; s, λ) d (10) where m = max x,y D s(x, y) is the maximum pairwise distace i D. Note that we have ow reparameterised the desity estimator pdf(x; s, λ) to iclude the distace fuctio s, i order to facilitate calculatios below. With referece to the uiform distributio which has desity V m, where V m is the volume of the ball havig radius m: r(x; λ) < 1 if pdf(x; s, λ) < V m (11) r(x; λ) > 1 if pdf(x; s, λ) > V m (12) That is, the process rescales sparse regios to be more dese by usig r(x; λ) < 1, ad dese regios to be more sparse by usig r(x; λ) > 1; such that the etire dataset is approximately uiformly distributed i the scaled λ-eighbourhood. More spefically, after rescalig distaces with s, the desity of poits i the eighbourhood of size λ x = λ r(x; λ) aroud x is the same as the desity of poits across the whole dataset: pdf(x; s, λ x) = pdf(x; s, λ) V λ V λ x = pdf(x; s, λ) V λ V λ r(x; λ) d pdf(x; s, λ) V λ = V λ ( m λ ( pdf(x;s,λ) V λ ) 1 d ) = λd d V λ m d = (13) V m Note that the above derivatio is possible because (i) the scalig is isotropic about x (hece the shape of the uit ball does t chage oly its size) ad (ii) the uiform-kerel desity estimator is local (i.e., its value depeds oly o poits withi the λ x eighbourhood). I order to maitai the same relative orderig of distaces betwee x ad all other poits i the dataset, the distace betwee x ad ay poit ot i the λ-eighbourhood y D \ N (x; s, λ) ca be ormalised by a simple mi-max ormalisatio: y D\N (x;s,λ) s (x, y) = (s(x, y) λ) m λ x m λ + λ x (14) 3 For reasos that will become obvious shortly, we ow reparameterise the eighbourhood fuctio N (x; s, λ) to deped o the distace measure s. 7

8 It is iterestig to ote that providig λ is sufficietly large that the average of the desity estimates across the data poits approximates the average desity over the space E x D [pdf(x; s, λ)] V m, the we have that the average rescalig factor is approximately 1: E x D [r(x; λ)] 1, E x D [λ x] λ, ad thus the average distace betwee poits is uchaged after rescalig: 4 y D E x D [s (x, y)] E x D [s(x, y)] (15) The implemetatio of DScale 37, which is a represetative algorithm based o this multi-dimesioal CDF scalig, is provided i Algorithm 2 i Appedix 8. A lemma about the desity o the rescaled distace is give as follows: Lemma 2. The desity pdf(x; s, γ x) with the rescaled distace s is approximately proportioal to the desity-ratio i terms of the origial distace s withi the λ-eighbourhood of x. pdf(x;s,γ) pdf(x;s,λ) Proof. where γ x = γ r(x; λ) ad γ < λ. pdf(x; s, γ x) pdf(x; s, γ) V γ V γ x pdf(x; s, γ) = m d λ d pdf(x;s,λ) V λ = pdf(x; s, γ) V γ V γ r(x; λ) d = pdf(x; s, γ) V m pdf(x; s, λ) pdf(x; s, γ) pdf(x; s, λ) Note that the above lemma is oly valid withi the λ-eighbourhood of x, ad whe each x D is treated idepedetly. I other words, the cdf trasform is oly valid locally. I additio, the rescaled distace is asymmetric, i.e., s (x, y) s (y, x) whe r(x; λ) r(y; λ). To be a valid cdf trasform globally for the etire dataset, we propose i this paper to perform a iterative process that ivolves two steps: distace rescalig ad poit shiftig. 5 CDF Trasform-Shift (CDF-TS) Istead of just rescalig distaces betwee each data istace x ad all other poits i the dataset, we ca apply the rescalig directly to the dataset by traslatig poits i approximate accordace with the rescaled distaces. Two advatages of the ew approach are: (i) the shifted dataset becomes approximately uiformly distributed as a whole; ad (ii) the stadard Euclidea distace ca the be used to measure distaces betwee the trasformed poits, rather tha the o-symmetric ad as a result o-metric rescaled distace. Cosider two poits x, y D. I order to make the distributio aroud poit x more uiform, we wish to rescale (expad or cotract) the distace betwee x ad y to be s (x, y) as defied by Equatios 9 ad 14. We ca do this by traslatig the poit y to a ew poit deoted y x which lies alog the directio (y x) as follows: y x = x + s (x, y) (y x) (16) s(x, y) Note that the distace betwee x ad the ewly trasformed poit y x satisfies the rescaled distace requiremet: s(x, y x) = s (x, y) (17) Above we cosidered the effect of traslatig poit y to y x to scale the space aroud poit x. I order to approximately scale the space aroud all poits i the dataset, poit y eeds to be traslated by the average of the above traslatios: ȳ = 1 y x = 1 x + s (x, y) (y x) (18) s(x, y) x D x D A theorem regardig the fial shifted data D is provided as follows: 4 Proof: For y N (x, s, λ) we have that E x D[s (x, y)] E x D[r(x; λ)]e x D[s(x, y)] E x D[s(x, y)] sice r(x; λ) ad s(x, y) are idepedet. The same ca be show for poits y / N (x, s, λ). 8

9 Theorem 1. Give a sufficietly large badwidth λ < m, the λ-eighbourhood desity of D is expected to be more uiform tha that of D: x D pdf( x ; s, λ) /V m < pdf(x; s, λ) /V m, Proof. Let us defie the badwidth λ x for the trasformed eighbourhood aroud x as λ x = λ s( x,ȳ ) s(x,y) x, y D ad y N (x, s, λ). We ca ow estimate λ x as follows. We have s( x, ȳ ) = x ȳ = 1 ( s (z, x) s(z, x) (x z) s (z, y) (y z))) s(z, y) z D where Sice y is i the λ-eighbourhood of x we have z D s (z,x) s(z,x) s( x, ȳ ) 1 s( x, ȳ ) s(x, y) z D = s(x, y) 1 = 1 ( s (z,y) s(z,y), ad s (z, x) (x y) s(z, x) z D z N (x;s,λ) s (z, x) s(z, x) s (z, x) s(z, x) + z / N (x;s,λ) s (z, x) s(z, x) ) Supposig that pdf(z; s, λ) varies slowly withi the eighbourhood N (x; s, λ), we have: z N (x;s,λ) s (z, x) s(z, x) z N (x;s,λ) s (x, z) s(x, z) = z N (x;s,λ) r(x; λ) The based o Equatio 15, give a sufficietly large λ such that E z D [pdf(z; s, λ)] V m ad providig the cout N (x; s, λ) is sufficietly small, we have: E z,x D,z / N (x;s,λ) [s (z, x)] E z,x D [s (z, x)] E z,x D [s(z, x)] This meas that s( x, ȳ ) is maily affected by the poit z located i the λ-eighbourhood of x, i.e., E z,x D,z / N (x;s,λ) [ s (z,x) s(z,x) ] 1. The we have s( x, ȳ ) s(x, y) The above relatio ca be rewritte as: 1 ( z N (x;s,λ) r(x; λ) + z / N (x;s,λ) N (x; s, λ) = ( r(x; λ) + s( x,ȳ ) s(x,y) 1 N (x; s, λ) r(x; λ) 1 1) N (x; s, λ) ) As N (x;s,λ) (0, 1), the relatio betwee s( x,ȳ ) s(x,y) ad r(x; λ) depeds o whether x is i the sparse or dese regio: r(x; λ) < s( x, ȳ ) s(x, y) r(x; λ) > s( x, ȳ ) s(x, y) < 1 if pdf(x; s, λ) < > 1 if pdf(x; s, λ) > V m (19) V m (20) Whe the desity varies slowly i the N (x; s, λ), we ca safely assume that each poit s earest eighbours keep similar after trasform ad the desity still varied slowly i λ -eighbourhood. The we have E x D N (x ; s, λ ) N (x; s, λ) ad pdf( x ; s, λ) ca estimate as 9

I dese regio pdf(x; s, λ) > /V m, we have r(x; λ) > s( x,ȳ ) s(x,y) iequality i Equatios 13 ad 21, we have > 1, as show i Equatio 20.

10 pdf( x ; s, λ) pdf( x ; s, λ ) N (x; s, λ) V λ Here we have three scearios, depedig o the desity of x: = pdf(x; s, λ) V λ V λ ( s( x,ȳ ) s(x,y) )d (21) 1. I sparse regio pdf(x; s, λ) < /V m, we have r(x; λ) < s( x,ȳ ) s(x,y) iequality i Equatios 13 ad 21, we have /V m > pdf( x ; s, λ) > pdf(x; s, λ) < 1, as show i Equatio 19. Usig this 2. I dese regio pdf(x; s, λ) > /V m, we have r(x; λ) > s( x,ȳ ) s(x,y) iequality i Equatios 13 ad 21, we have > 1, as show i Equatio 20. Usig this /V m < pdf( x ; s, λ) < pdf(x; s, λ) Therefore, x D, pdf( x ; s, λ) /V m < pdf(x; s, λ) /V m. Sice our target is to get a uiformly distributed data w.r.t. pdf( x, s, λ), this trasform-shift process ca be repeated multiple times usig a iterative method to reduce the λ-eighbourhood desity variatio o the shifted dataset i the previous iteratio. This is accomplished by usig a expectatio-maximisatio-like algorithm. The objective is to miimise the desity variace i D. A coditio for termiatig the iteratio process is whe the total distace of poit-shifts from D to D i the latest iteratio is less tha a threshold δ, i.e., x D, x D x x δ (22) Note that, due to the shifts i each iteratio, the value rages of attributes of the shifted dataset D are likely to be differet from those of the origial dataset D. We ca use a mi-max ormalisatio 3 o each attribute to keep the values i the same fixed rage at the ed of each iteratio. Figure 4 illustrates the effects of CDF-TS o a two-dimesioal data with differet iteratios. It shows that the origial clusters with differet desity become icreasig uiform as the umber of iteratios icreases. (a) A two-dimesioal data (b) After 1 iteratio (c) After 3 iteratios (d) After 5 iteratios Figure 4: Illustratig the effects of CDF-TS o a two-dimesioal data with λ = 0.1. The implemetatio of CDF-TS is show i Algorithm 1. The DScale algorithm used i step 6 is provided i Algorithm 2 i Appedix 8. Algorithm 1 requires two parameters λ ad δ. 10

11 Algorithm 1 CDF-TS(D, λ, δ) Iput: D - iput data matrix ( d matrix); λ - badwidth parameter; δ - threshold for the total shifted distace. Output: D - data matrix after rescalig ad poit shiftig. 1: Normalisig D usig mi-max ormalisatio 2: = 3: t = 1 4: while > δ do 5: S Calculatig the distace matrix for D 6: S DScale(S, λ, d) 7: for each z D (where z is a referece poit) do 8: D z Shift every poit x D to x z i the directio of z to x with magitude S [z, x] 9: ed for 10: D 1 z D D z 11: Normalisig D usig mi-max ormalisatio 12: = 1 d i,j D[i, j] D [i, j] 13: t = t : D = D 15: ed while 16: retur D 5.1 Desity estimators applicable to CDF-TS The followig two types of desity estimators, with fixed-size ad variable-size badwidths of the uiform kerel, are applicable to CDF-TS to do the CDF scalig: (a) ɛ-eighbourhood desity estimator: pdf ɛ (x) = where V ɛ is the volume of the ball with radius ɛ. {y D s(x, y) ɛ} V ɛ Whe this desity estimator is used i CDF-TS, the λ-eighbourhood described i Sectios 4.2 ad 5, i.e., N (x; s, λ) = {y D s(x, y) ɛ}, where λ = ɛ, deotig the fixed-size badwidth uiform kerel. (b) k-th earest eighbour desity estimator: pdf(x; ɛ k (x)) = k V s(x,xk ) 1 ( s(x, x k ) )d where x k is the k-th earest eighbour of x; ɛ k (x) is k-th earest eighbour distace of x. Whe this desity estimator is used i CDF-TS, the eighbourhood size λ becomes depedet o the locatio x ad the distributio of surroudig poits. We deote λ(x) k = s(x, x k ) as the distace to the k th earest eighbour of x. I this case the desity is simply calculated over the eighbourhood: N (x; s, λ(x) k ) = {y D s(x, y) s(x, x k )}, deotig the variable-size badwidth uiform kerel, i.e., small i dese regio ad large i sparse regio. Note that, i this circumstace, the desity of the shifted poits still become more uiformly w.r.t their surroudig. I the followig experimets, we employ the same desity estimators as used i DBSCAB ad DP (i clusterig) ad knn aomaly detector i CDF-TS to trasform D to D, i.e., ɛ-eighbourhood desity estimator is used i CDF-TS for clusterig; ad k-th earest eighbour desity estimator is used i CDF-TS for aomaly detectio. 6 Empirical Evaluatio This sectio presets experimets desiged to evaluate the effectiveess of CDF-TS. All algorithms used i our experimets were implemeted i Matlab R2017a (the source code ca be obtaied at The experimets are ru o a machie with eight cores CPU (Itel Core 3.60GHz), 32GB memory ad a 2560 CUDA cores GPU (GeForce GTX 1080). All datasets were ormalised usig the mi-max ormalisatio to yield each attribute to be i [0,1] before the experimets bega. 11

Table 2: Properties of clusterig datasets Dataset Data Size #Dimesios #Clusters Segmet 2310 19 7 Mice 1080 83 8 Biodeg 1055 41 2 ILPD 579 9 2 ForestType 523 27 4 Wilt 500 5 2 Musk 476 166 2 Libras

with differet data sizes ad dimesios from UCI Machie Learig Repository 11. Table 2 presets the data properties of the datasets.

12 Table 2: Properties of clusterig datasets Dataset Data Size #Dimesios #Clusters Segmet Mice Biodeg ILPD ForestType Wilt Musk Libras Dermatology Ecoli Haberma Seeds Lug Wie L C For clusterig, we used 2 artificial datasets ad 14 real-world datasets with differet data sizes ad dimesios from UCI Machie Learig Repository 11. Table 2 presets the data properties of the datasets. 3L is a 2-dimesioal data cotaiig three elogated clusters with differet desities, as show i Figure 5a. 4C is a 2-dimesioal dataset cotaiig four clusters with differet desities (three Gaussia clusters ad oe elogated cluster), as show i Figure 5b. Note that DBSCAN is uable to correctly idetify all clusters i both of these datasets because they do ot satisfy the coditio specified i Equatio 2. Furthermore, clusters i 3L sigificatly overlap o idividual attribute projectios, which violates the requiremet of ReScale that the oe-dimesioal projectios allow for the idetificatio of the desity peaks of each cluster. (a) 3L data distributio (b) 4C data distributio Figure 5: (a) A two-dimesioal data cotaiig three elogated clusters. (b) A two-dimesioal data cotaiig four clusters. For aomaly detectio, we compared the aomaly detectio performace o 2 sythetic datasets with aomalous clusters ad 10 real-world bechmark datasets 5. The data size, dimesios ad percetage of aomalies are show i Table 3. Both the Sy 1 ad Sy 2 datasets cotai clusters of aomalies. Their data distributios are show i Figure Clusterig I this sectio, we compare CDF-TS with ReScale ad DScale usig two existig desity-based clusterig algorithms, i.e., DBSCAN 14 ad DP 29, i terms of F-measure 28 : give a clusterig result, we calculate the precisio score P i ad the recall score R i for each cluster C i based o the cofusio matrix, ad the the F-measure score of C i is the harmoic mea of P i ad R i. The overall F-measure score is the uweighted (macro) average over all clusters: F-measure= 1 ς ς i=1 2P ir i P i+r i. 6 5 Velocity, At ad Tomcat are from ad others are from UCI Machie Learig Repository It is worth otig that other evaluatio measures such as purity ad Normalised Mutual Iformatio (NMI) 31 oly take ito accout the poits assiged to clusters ad do ot accout for oise. A clusterig algorithm which assigs the majority of the poits 12

13 Table 3: Properties of aomaly detectio datasets Dataset Data Size #Dimesios % Aomaly AThyroid % Pageblocks % Tomcat % At % BloodDoatio % Vowel % Mfeat % Dermatology % Balace % Velocity % Sy % Sy % Desity (a) Data distributio o Sy 1 dataset x (b) Stacked histogram o Sy 2 dataset Figure 6: Data distributios of the Sy 1 ad Sy 2 datasets, where the red colour idicate the aomaly. We report the best clusterig performace after performig a parameter search for each algorithm. Table 4 lists the parameters ad their search rages for each algorithm. ψ i ReScale cotrols the precisio of ĉdf λ(x), i.e., the umber of itervals used for estimatig ĉdf λ(x). We set δ = as the default value for CDF-TS. Table 4: Parameters ad their search rage. The search rages of ψ ad λ are as used by Zhu et. al. 36. Algorithm Parameter with search rage DBSCAN Mipts {2, 3,..., 10}; ɛ [0, 1] DP k {2, 3,..., 10}; ɛ [0, 1] ReScale ψ = 100; λ {0.1, 0.2,..., 0.5} DScale λ {0.1, 0.2,..., 0.5} CDF-TS λ {0.1, 0.2,..., 0.5}; δ = Table 5 shows the best F-measures for DBSCAN, DP, their ReScale, DScale ad CDF-TS versios. The average F-measures, showed i the secod-to-last row, reveal that CDF-TS improves the clusterig performace of either DBSCAN or DP with a larger performace gap tha both ReScale ad DScale. I additio, CDF-TS is the best performer o may more datasets tha other coteders (show i the last row of Table 5.) It is iterestig to see that CDF-TS exhibits a large performace improvemet over both DBSCAN ad DP o may datasets, such as ForestType, Wilt, Dermatology, Ecoli, Lug ad Wie. O may of these datasets, CDF-TS also shows a large performace gap i compariso with ReScale ad/or DScale. Note that the performace gap is smaller for DP tha it is for DBSCAN because DP is a more powerful algorithm which does ot rely o a sigle desity threshold to idetify clusters. To evaluate whether the performace differece amog the three scalig algorithms is sigificat, we coduct the Friedma test with the post-hoc Nemeyi test 10. to oise may result i a high clusterig performace. Thus the F-measure is more suitable tha purity or NMI i assessig the clusterig performace of desity-based clusterig. 13

14 Table 5: Best F-measure of DBSCAN, DP, ad their ReScale, DScale ad CDF-TS versios. For each clusterig algorithm, the best performer i each dataset is boldfaced. Ori, ReS, ad DS represet the Origial algorithm, ReScale, ad DScale respectively. Data DBSCAN DP Orig ReS DS CDF-TS Orig ReS DS CDF-TS Segmet Mice Biodeg ILPD ForestType Wilt Musk Libras Dermatology Ecoli Haberma Seeds Lug Wie L C Average #Top Figures 7a ad 7b show the results of the sigificace test for DBSCAN ad DP, respectively. The results show that the CDF-TS versios are sigificatly better tha the DScale ad ReScale versios for both DBSCAN ad DP. (a) Sigificace test for DBSCAN (b) Sigificace test for DP Figure 7: Critical differece (CD) diagram of the post-hoc Nemeyi test (α = 0.10). Two algorithms are sigificat differet if the gap betwee their raks is larger tha CD. Otherwise, there is a lie likig them. Here we compare the differet effects o the trasformed datasets due to ReScale ad CDF-TS usig the MDS visualisatio 7. The effects o the two sythetic datasets are show i Figure 8 ad Figure 9. They show that both the rescaled datasets are more axis-parallel distributed usig ReScale tha those usig CDF-TS. For the 3L dataset, the blue cluster is still very sparse after ruig ReScale, as show i Figure 8a. I cotrast, it becomes deser usig CDF-TS, as show i Figure 9a. As a result, DBSCAN has much better performace with CDF-TS o the 3L dataset. 7 Multidimesioal scalig (MDS) 6 is used for visualisig a high-dimesioal dataset i a 2-dimesioal space through a projectio method which preserves as well as possible the origial pairwise dissimilarities betwee istaces. 14

(a) 3L data distributio (b) 4C data distributio Figure 8: Data distributio after ruig ReScale whe achievig the best DBSCAN clusterig performace (a) 3L data distributio (b) 4C data distributio Figure

the Wilt dataset which CDF-TS has the largest performace improvemet over both DBSCAN ad DP.

15 (a) 3L data distributio (b) 4C data distributio Figure 8: Data distributio after ruig ReScale whe achievig the best DBSCAN clusterig performace (a) 3L data distributio (b) 4C data distributio Figure 9: Data distributio after ruig CDF-TS whe achievig the best DBSCAN clusterig performace (a) Based o distace Figure 10: MDS plots o the Wilt dataset (b) Based o CDF-TS Figure 10 shows the MDS plots o the Wilt dataset which CDF-TS has the largest performace improvemet over both DBSCAN ad DP. The figure shows that the two clusters are more uiformly distributed i the ew space which makes their boudary easier to be idetified tha that i the origial space. I cotrast, based o distace, most poits of the red cluster are surrouded by poits of the blue cluster, as show i Figure 10a. 6.2 Aomaly detectio I this sectio, we evaluate the ability of CDF-TS to detect local aomalies based o knn aomaly detectio. Two state-of-the-art local aomaly detector, Local Outlier Factor (LOF) 7 ad iforest 19,22, are also used i the compariso. Table 6 lists the parameters ad their search rages for each algorithm. Parameters ψ ad t i iforest cotrol the sub-sample size ad umber of itrees, respectively. We report the best performace of each algorithm o each dataset i terms of best AUC (Area uder the Curve of ROC) 15. Table 7 compares the best AUC score of each algorithm. CDF-TS-kNN achieves the highest average AUC of 0.90 ad performs the best o 6 out of 12 datasets. For the Sy 1 dataset which has overlappig o the idividual attribute projectio, DScale ad CDF-TS have much better results tha ReScale. Figure 11 shows the sigificace test o the three algorithms applied to knn aomaly detector. It ca be see from the results that CDF-TS-kNN is sigificatly better tha DScale-kNN ad ReScale-kNN. 15

16 Table 6: Parameters ad their search rages. Algorithm Parameters ad their search rage knn/lof k {5%, 10%,..., 50%} iforest t = 100; ψ {2 1, 2 2,..., 2 10 } ReScale ψ = 100; λ {0.1, 0.2,..., 0.5} DScale λ {0.1, 0.2,..., 0.5} CDF-TS λ {0.1, 0.2,..., 0.5}; δ = Table 7: Best AUC o 12 datasets. The best performer o each dataset is boldfaced. ReS, ad DS represet ReScale-kNN, ad DScale-kNN respectively. Data knn LOF iforest Res DS CDF-TS AThyroid Pageblocks Tomcat At BloodDoatio Vowel Mfeat Dermatology Balace Velocity Sy Sy Average #Top Figure 11: Critical differece (CD) diagram of the post-hoc Nemeyi test (α = 0.10) for aomaly detectio algorithms. It is iterestig to metio that CDF-TS-kNN has the largest AUC improvemet over knn o Dermatology ad AThyroid, show i Table 7. Table 8 compares their MDS plots betwee the origial datasets ad the shifted (desity equalised) datasets. It shows that most aomalies become aomalous clusters ad farther away from ormal poits o the shifted datasets, makig them easier to detect by knn aomaly detector. Note that some aomalies are closed to ormal clusters ad have higher desities tha may ormal istaces i the origial dataset. 6.3 Ru-time ReScale, DScale ad CDF-TS are all pre-processig methods, their computatioal complexities are show i Table 9. Because may existig desity-based clusterig algorithms have time ad space complexities of O( 2 ), all these methods does ot icrease their overall complexities. Table 10 shows the rutime of the dissimilarity matrix calculatio for each of the three methods o the Mfeat, Segmet, Pageblocks ad AThyroid datasets. Their parameters are set to be the same for all datasets, i.e., λ = 0.2, ψ = 100 ad δ = The result shows that CDF-TS took the logest rutime because it requires multiple DScale calculatios. 16

Table 8: MDS plots o two datasets, where red poits idicate aomalies. Distace CDF-TS AThyroid Dermatology Table 9: Computatioal complexity of ReScale, DScale ad CDF-TS.

17 Table 8: MDS plots o two datasets, where red poits idicate aomalies. Distace CDF-TS AThyroid Dermatology Table 9: Computatioal complexity of ReScale, DScale ad CDF-TS. Algorithm Time complexity Space complexity ReScale O(dψ) O(d + dψ) DScale ad CDF-TS O(d 2 ) O(d + 2 ) Table 10: Rutime compariso of dissimilarity matrix calculatio (i secods). Dataset CPU GPU ReScale DScale CDF-TS ReScale DScale CDF-TS Mfeat Segmet Pageblocks AThyroid Discussio 7.1 Parameter sesitivity Both DScale ad CDF-TS have oe critical parameter λ to defie the λ-eighbourhood. The desity-ratio based o a small λ will approximate 1 ad provides o iformatio; while o a large λ will approximate the estimated desity ad have o advatage. Geerally, λ [0.1, 0.2] ad ɛ i DBSCAN ad DP shall be set slightly smaller tha λ. The parameter δ i CDF-TS cotrols the umber of iteratios, i.e., a smaller δ usually results i a higher umber of iteratios i the CDF-TS process, ad makes the shifted dataset more uiform. I our experimets, we set δ = as default value ad got sigificatly better results tha existig algorithms. Whe tuig the best δ for differet datasets, CDF-TS would get better results o most datasets. 7.2 Relatio to LOF LOF 7 is a local desity-based approach for aomaly detectio. The LOF score of x is the ratio of the desity of x to the average desity of x s k-earest eighbours, where the desity of x is iversely proportioal to the average distace to its k-earest-eighbours 7. 17

18 Though LOF is a desity-ratio approach, it does ot employ CDF trasform, which is the core techique i CDF-TS. While LOF has the ability to detect local aomalies, it is weaker i idetifyig aomalous clusters tha knn which employs CDF-TS. 7.3 Relatio to metric learig There are may usupervised distace metric learig algorithms which trasform data e.g., global methods such as PCA 18, KPCA 30 ad KUMMP 35 ); ad local methods such as Isomap 33, t-sne 24 ad LPP 17. These methods are usually used as dimesio reductio methods such that data clusters i the leared low-dimesioal space are adjusted based o some objective fuctio 34. CDF-TS is a usupervised algorithm based o the CDF trasform such that the data are more uiform i the scaled space. Though it is a liear ad local method, CDF-TS is ot a metric learig method for dimesio reductio. We have examied the performace of t-sne i desity-based clusterig ad knn aomaly detectio. We set the dimesioality of its output space to be the same as the origial data. We foud that t-sne has icosistet clusterig performace. For example, while it produced the best clusterig results o some high-dimesioal datasets such as the ForestType ad Lug datasets; but it yielded the worst results o the Breast ad 3L datasets. I additio, t-sne could ot improve the performace o knn aomaly detector. This is because aomalies are still dese or close to some dese clusters i the t-sne trasformed space. 7.4 Compariso with a desity equalisatio method I the cotext of kerel k-meas 25, knn kerel 26 has bee suggested to be a way to equalise the desity of a give dataset to reduce the bias of kerel k-meas algorithm ad to improve the clusterig performace o datasets with ihomogeeous desities. The dissimilarity matrix geerated by knn kerel is a biary matrix ca be see as the adjacet matrix of knn graph such that 1 meas a poit is i the set of k earest eighbours of aother poit; ad 0 otherwise. Thus, a k-meas clusterig algorithm ca be used to group poits based o their knn graph. However, knn kerel caot be applied to both desity-based clusterig ad knn aomaly detectio. This is because it coverts all poits to have the same desity i the ew space regardless of the badwidth used by a desity estimator used by the iteded algorithm. Thus, DBSCAN will either group eighbourig clusters ito a sigle cluster or assig all clusters to oise. This outcome is ot coducive i producig a good clusterig result. DP also caot work with knn kerel for the same reaso. DP liks a poit to aother poit with a higher desity to form clusters i the last step of the clusterig process 29. DP would ot be able to lik poits whe all poits have the same desity. For knn aomaly detectio, replacig the distace measure with the knn kerel produces either 0 or 1 similarity betwee ay two poits. This does ot work for aomaly detectio. 8 Coclusios We itroduce a CDF Trasform-Shift (CDF-TS) algorithm based o a multi-dimesioal CDF trasform to effectively deal with datasets with ihomogeeous desities. It is the first attempt to chage the volumes of every poit s eighbourhood i the etire multi-dimesioal space i order to get the desired effect of uiform distributio for the whole dataset w.r.t a desity estimator. Existig CDF trasform methods such as ReScale 36 ad DScale 37 achieved a much reduced effect o the recaled dataset. We show that this more sophisticated CDF trasform brigs about a much better outcome. I additio, CDF-TS is more geeric tha existig CDF trasform methods with the ability to use differet desity estimators, i.e, uiform kerels with either fixed or variable badwidths. As a result, CDF-TS ca be applied to more algorithms/tasks with a better outcome tha existig methods such as knn kerel 26. Through itesive evaluatios, we show that CDF-TS sigificatly improves the performace of two existig desitybased clusterig algorithms ad oe existig distace-based aomaly detector. Appedix: DScale algorithm Algorithm 2 is a geeralisatio of the implemetatio of DScale 37 which ca employ a desity estimator with a badwidth parameter λ. It requires oe parameter λ oly. 18

19 Algorithm 2 DScale(S, λ, d) Iput: S - iput distace matrix ( matrix); λ - badwidth parameter; d - dimesioality of the dataset. Output: S - distace matrix after scalig. 1: m the maximum distace i S 2: Iitialisig matrix S 3: for i = 1 to do 4: r(x i ) = m N (xi,s,λ) λ ( ) 1 d 5: xj N (x i,s,λ) S [x i, x j ] = S[x i, x j ] r(x i ) 6: xj D\N (x i,s,λ) S [x i, x j ] = (S[x i, x j ] λ) m λ r(xi) m λ + λ r(x i ) 7: ed for 8: retur S Refereces 1. Charu C Aggarwal. Outlier Aalysis. Spriger Iteratioal Publishig, Switzerlad, 2d editio, ISBN Charu C Aggarwal ad Chada K Reddy. Data Clusterig: Algorithms ad Applicatios. Chapma ad Hall/CRC Press, Selim Aksoy ad Robert M Haralick. Feature ormalizatio ad likelihood-based similarity measures for image retrieval. Patter recogitio letters, 22(5): , Fabrizio Agiulli ad Clara Pizzuti. Fast outlier detectio i high dimesioal spaces. I Europea Coferece o Priciples of Data Miig ad Kowledge Discovery, pages Spriger, Mihael Akerst, Markus M. Breuig, Has-Peter Kriegel, ad Jörg Sader. OPTICS: Orderig poits to idetify the clusterig structure. I Proceedigs of the 1999 ACM SIGMOD Iteratioal Coferece o Maagemet of Data, SIGMOD 99, pages 49 60, New York, NY, USA, ACM. ISBN Igwer Borg, Patrick JF Groee, ad Patrick Mair. Applied Multidimesioal Scalig. Spriger Sciece & Busiess Media, Markus M. Breuig, Has-Peter Kriegel, Raymod T. Ng, ad Jörg Sader. LOF: Idetifyig desity-based local outliers. SIGMOD Rec., 29(2):93 104, May Varu Chadola, Aridam Baerjee, ad Vipi Kumar. Aomaly detectio: A survey. ACM computig surveys (CSUR), 41(3):15, Timothy De Vries, Sajay Chawla, ad Michael E Houle. Fidig local aomalies i very high dimesioal space. I Data Miig (ICDM), 2010 IEEE 10th Iteratioal Coferece o, pages IEEE, Jaez Demšar. Statistical comparisos of classifiers over multiple data sets. Joural of Machie Learig Research, 7:1 30, Dua Dheeru ad Efi Karra Taiskidou. UCI machie learig repository, URL Levet Ertöz, Michael Steibach, ad Vipi Kumar. Fidig topics i collectios of documets: A shared earest eighbor approach. Clusterig ad Iformatio Retrieval, 11:83 103, Levet Ertöz, Michael Steibach, ad Vipi Kumar. Fidig clusters of differet sizes, shapes, ad desities i oisy, high dimesioal data. I Proceedigs of the 2003 SIAM Iteratioal Coferece o Data Miig, pages SIAM, Marti Ester, Has-Peter Kriegel, Jörg S, ad Xiaowei Xu. A desity-based algorithm for discoverig clusters i large spatial databases with oise. I Proceedigs of the Secod Iteratioal Coferece o Kowledge Discovery ad Data Miig (KDD-96), pages AAAI Press, Tom Fawcett. A itroductio to roc aalysis. Patter recogitio letters, 27(8): , Evely Fix ad Joseph L Hodges Jr. Discrimiatory aalysis-oparametric discrimiatio: cosistecy properties. Techical report, Califoria Uiv Berkeley,

20 17. Xiaofei He ad Partha Niyogi. Locality preservig projectios. I S. Thru, L. K. Saul, ad B. Schölkopf, editors, Advaces i Neural Iformatio Processig Systems 16, pages MIT Press, Ia Jolliffe. Pricipal compoet aalysis. I Iteratioal ecyclopedia of statistical sciece, pages Spriger, Fei Toy Liu, Kai Mig Tig, ad Zhi-Hua Zhou. Isolatio forest. I 2008 Eighth IEEE Iteratioal Coferece o Data Miig, pages IEEE, Fei Toy Liu, Kai Mig Tig, ad Zhi-Hua Zhou. O detectig clustered aomalies usig SCiForest. I Proceedigs of the 2010 Europea Coferece o Machie Learig ad Kowledge Discovery i Databases: Part II, ECML PKDD 10, pages , Berli, Heidelberg, Spriger-Verlag. ISBN X, Fei Toy Liu, Kai Mig Tig, ad Zhi-Hua Zhou. O detectig clustered aomalies usig sciforest. I José Luis Balcázar, Fracesco Bochi, Aristides Giois, ad Michèle Sebag, editors, Machie Learig ad Kowledge Discovery i Databases, pages , Berli, Heidelberg, Spriger Berli Heidelberg. ISBN Fei Toy Liu, Kai Mig Tig, ad Zhi-Hua Zhou. Isolatio-based aomaly detectio. ACM Trasactios o Kowledge Discovery from Data (TKDD), 6(1):3:1 3:39, March ISSN Do O Loftsgaarde, Charles P Queseberry, et al. A oparametric estimate of a multivariate desity fuctio. The Aals of Mathematical Statistics, 36(3): , Laures va der Maate ad Geoffrey Hito. Visualizig data usig t-se. Joural of machie learig research, 9(Nov): , James MacQuee. Some methods for classificatio ad aalysis of multivariate observatios. I Proceedigs of the fifth Berkeley symposium o mathematical statistics ad probability, volume 1, pages Oaklad, CA, USA, D. Mari, M. Tag, I. Be Ayed, ad Y. Y. Boykov. Kerel clusterig: desity biases ad solutios. IEEE Trasactios o Patter Aalysis ad Machie Itelligece, pages 1 1, ISSN doi: / TPAMI Sridhar Ramaswamy, Rajeev Rastogi, ad Kyuseok Shim. Efficiet algorithms for miig outliers from large data sets. I Proceedigs of the 2000 ACM SIGMOD Iteratioal Coferece o Maagemet of Data, SIGMOD 00, pages , New York, NY, USA, ACM. ISBN C. J. Va Rijsberge. Iformatio Retrieval. Butterworth-Heiema, Newto, MA, USA, 2d editio, ISBN Alex Rodriguez ad Alessadro Laio. Clusterig by fast search ad fid of desity peaks. Sciece, 344(6191): , Berhard Schölkopf, Alexader J Smola, et al. Learig with kerels: support vector machies, regularizatio, optimizatio, ad beyod. MIT press, Cambridge, Alexader Strehl ad Joydeep Ghosh. Cluster esembles a kowledge reuse framework for combiig multiple partitios. Joural of Machie Learig Research, 3(Dec): , Richard Szeliski. Computer visio: algorithms ad applicatios. Spriger Sciece & Busiess Media, Joshua B Teebaum, Vi De Silva, ad Joh C Lagford. A global geometric framework for oliear dimesioality reductio. sciece, 290(5500): , Fei Wag ad Jimeg Su. Survey o distace metric learig ad dimesioality reductio i data miig. Data Miig ad Kowledge Discovery, 29(2), Fei Wag, Bi Zhao, ad Chagshui Zhag. Usupervised large margi discrimiative projectio. IEEE trasactios o eural etworks, 22(9): , Ye Zhu, Kai Mig Tig, ad Mark J. Carma. Desity-ratio based clusterig for discoverig clusters with varyig desities. Patter Recogitio, 60: , ISSN

Image Segmentation EEE 508

Image Segmentation EEE 508 Image Segmetatio Objective: to determie (etract) object boudaries. It is a process of partitioig a image ito distict regios by groupig together eighborig piels based o some predefied similarity criterio.