arxiv: v1 [cs.lg] 5 Oct 2018

Size: px
Start display at page:

Download "arxiv: v1 [cs.lg] 5 Oct 2018"

Transcription

1 CDF TRANSFORM-SHIFT: AN EFFECTIVE WAY TO DEAL WITH INHOMOGENEOUS DENSITY DATASETS A PREPRINT arxiv: v1 [cs.lg] 5 Oct 2018 Ye Zhu ( ) School of Iformatio Techology Deaki Uiversit Victoria, Australia 3125 ye.zhu@ieee.org Mark J. Carma Faculty of Iformatio Techology Moash Uiversit Victoria, Australia 3800 mark.carma@moash.edu October 9, 2018 ABSTRACT Kai Mig Tig School of Iformatio Techology Federatio Uiversit Victoria, Australia 3842 kaimig.tig@federatio.edu.au Maia Agelova School of Iformatio Techology Deaki Uiversit Victoria, Australia 3125 maia.a@deaki.edu.au May distace-based algorithms exhibit bias towards dese clusters i ihomogeeous datasets (i.e., those which cotai clusters i both dese ad sparse regios of the space). For example, desity-based clusterig algorithms ted to joi eighbourig dese clusters together ito a sigle group i the presece of a sparse cluster; while distace-based aomaly detectors exhibit difficulty i detectig local aomalies which are close to a dese cluster i datasets also cotaiig sparse clusters. I this paper, we propose the CDF Trasform-Shift (CDF-TS) algorithm which is based o a multi-dimesioal Cumulative Distributio Fuctio (CDF) trasformatio. It effectively coverts a dataset with clusters of ihomogeeous desity to oe with clusters of homogeeous desity, i.e., the data distributio is coverted to oe i which all locally low/high-desity locatios become globally low/high-desity locatios. Thus, after performig the proposed Trasform-Shift, a sigle global desity threshold ca be used to separate the data ito clusters ad their surroudig oise poits. Our empirical evaluatios show that CDF-TS overcomes the shortcomigs of existig desitybased clusterig ad distace-based aomaly detectio algorithms ad sigificatly improves their performace. Keywords Desity-ratio Desity-based Clusterig knn aomaly detectio ihomogeeous desities Scalig Shift 1 Itroductio May distace-based algorithms have a bias towards dese clusters 2,5,13,36,26. For example, (i) DBSCAN 12 is biased towards groupig eighbourig dese clusters ito a sigle cluster, i the presece of a sparse cluster 36. (ii) The knn aomaly detector 4,27 employs the knn desity estimator 16,23 to estimate the desity of every poit i a dataset, ad the sorts all the poits based o their estimated desity i descedig order. Low-desity poits are regarded as more likely to be aomalous tha high-desity poits. Sice the rakig is based o global desities, this bias results i the misclassificatio of so-called local aomalies (poits cosidered aomalous with respect to their eighborhood) i high desity regios as ormal poits 7,8,9.

2 A umber of techiques have bee proposed to correct this bias, otably, the desity-ratio approach of ReCo 36, ReScale 36 ad DScale 37 (for desity-based clusterig such as DBSCAN 14 ad DP 29 ), ad the reachability-distace approach of LOF 7 (for the knn aomaly detector 27 ). The desity-ratio approach 36,37 aims to trasform the data i such a way that the estimated desity of each trasformed poit approximates the desity-ratio of that poit i the origial space, ad as a result all locally low-desity poits are easily separated from all locally high-desity poits usig a sigle global threshold. While the curret desity-ratio methods have bee show to improve the clusterig performace of desity-based clusterig, we idetify their shortcomigs ad propose a ew algorithm to better achievig the aim of the desity-ratio approach. Furthermore, we exted its applicatio to distace-based aomaly detectio. We propose to perform a multi-dimesioal Cumulative Distributio Fuctio (CDF) based trasformatio o the give dataset as a preprocessig step, which lears a mappig f 1 such that the resultig Euclidea distace betwee pairs of poits d(x, y) = f(x) f(y) 2 is locally cosistet with the Euclidea distace betwee poits i the origial space (i.e., all local iter-poit distaces are proportioal to their origial values), while the fial trasformed dataset has a uiform distributio, as estimated by desity estimator pdf usig a certai badwidth. This effectively achieves the desired aim of CDF trasform, i.e., desity equalisatio w.r.t. a desity estimator with a certai badwidth, without impairig the cluster structure i the origial dataset. This paper makes the followig cotributios: (1) It geeralises a curret CDF scalig method DScale 37 so that it ca be applied to desity estimators usig uiform kerels with either fixed or variable badwidths. (2) It proposes a ew Trasform-Shift method called CDF-TS (CDF Trasform-Shift) that chages the volumes of each poit s eighbourhood simultaeously i the full multi-dimesioal space. Existig CDF trasform methods are either idividual attribute based (e.g., ReScale 36 ) or idividual poit based (e.g., DScale 37 ). As far as we kow, this is the first attempt to perform a multi-dimesioal CDF trasform to achieve the desired effect of homogeisig (makig uiform) the distributio of a etire dataset w.r.t a desity estimator. (3) It applies the ew Trasform-Shift method to desity-based clusterig ad distace-based aomaly detectio to demostrate its impact i overcomig the weakesses of three existig algorithms. The proposed approach CDF-TS differs from existig approaches i terms of: (i) Methodology: While ReScale 36, DScale 37 ad the proposed CDF-TS follow the same pricipled approach ad have the same aim, the proposed CDF Trasform-Shift method is a multi-dimesioal techique which icorporates both trasformatio ad poit-shift, which greatly expad the approach s applicability. I cotrast, ReScale is a oe-dimesioal trasformatio ad poit-shift techique; ad DScale icorporates trasformatio without poit-shift. The complete CDF-TS method leads to a better result i both clusterig ad aomaly detectio, which we will show i Sectio 6. (ii) Metric compliace: CDF-TS ad ReScale create a similarity matrix which is a metric; whereas the oe produced by DScale 37 is ot a metric, i.e., it does ot satisfy the symmetry or triagle iequalities. (iii) Ease of use: Like Rescale, CDF-TS trasforms the data as a preprocessig step. The targeted (clusterig or aomaly detectio) algorithm ca the applied ualtered to the trasformed data. I cotrast, DScale requires a algorithmic modificatio i order to use it. To demostrate the effects of differet dissimilarity measures o image segmetatio, we took the image show i Figure 1a, ad applied differet rescalig methods (o rescalig, ReScale, DScale ad CDF-TS) ad clusterig the 3-dimesioal pixel values of the image i LAB space 32. Tables 1 show the results of applyig the clusterig algorithm DP 29 to produce three clusters. The scatter plots i LAB space i Figure 1 show that the three clusters become more uiform distributed with similar desity ad the gaps betwee their boudaries are much larger for CDF-TS tha the other three scalig methods. As a result, DP with CDF-TS yields the best clusterig result, show i Table 1. The rest of the paper is orgaised as follows. We describe issues of ihomogeeous desity i desity-based clusterig ad distace-based aomaly detectio i Sectio 2. Sectio 3 provides the desity-ratio estimatio as a priciple to address the issues of ihomogeous desity. Sectio 4 presets the two existig CDF scalig approaches based o desity-ratio. Sectio 5 proposes the local-desity trasform-shift as a multi-dimesioal CDF scalig method. Sectio 6 empirically evaluates the performace of existig desity-based clusterig ad distace-based aomaly detectio algorithms o the trasformed-ad-shifted datasets. Discussio ad the coclusios are provided i the last two sectios. 1 The mappig f(x; D, pdf) depeds o all poits i the give dataset D, where each poit y D asserts a (local ad poit-based) scalig factor r(y; pdf) to x. 2

3 (a) Image (b) No Scalig (c) ReScale (d) DScale (e) CDF-TS Figure 1: Clusterig results by DP i LAB space of the image show i (a). The colours i the scatter plots idicate the three clusters idetified by DP usig each of the four methods. The scatter plots of DScale ad No Scalig are based o the origial LAB attributes; ad the other two are based o the trasformed attributes. Table 1: DP s image segmetatio o the image show i Figure 1a. Cluster 1 Cluster 2 Cluster 3 CDF-TS DScale ReScale No Scalig 3

4 2 Issues of ihomogeeous desity 2.1 Desity-based clusterig Let D = {x 1, x 2,..., x }, x i R d, x i F deote a dataset of poits, each sampled idepedetly from a distributio F. Let pdf(x) deote the desity estimate of poit x which approximates the true desity pdf(x). I additio, let N (x; ɛ) be the ɛ-eighbourhood of x, N (x; ɛ) = {y D s(x, y) ɛ}, where s(, ) is the distace fuctio (s : R d R d R). I geeral, the desity of a poit pdf(x) ca be estimated via a small ɛ-eighbourhood (as used by the classic desity-based clusterig algorithm DBSCAN 14 ) defied as follows: where V ɛ ɛ d is the volume of a d-dimesioal ball of radius ɛ. pdf ɛ (x) = 1 {y D s(x, y) ɛ} N (x; ɛ) = (1) V ɛ V ɛ A set of clusters {C 1,..., C ς } is defied as o-empty ad o-itersectig subsets: C i D, C i, i j C i C j =. Let c i = arg max x Ci pdf(x) deote the mode (poit of the highest estimated desity) for cluster Ci ; ad p i = pdf(c i ) deote the correspodig peak desity value. DBSCAN uses a global desity threshold to idetify core poits (which have desities higher tha the threshold); ad the it liks eighbourig core poits together to form clusters 14. It is defied as follows. Defiitio 1. A core poit is a poit with a estimated desity above or equal to a user-specified threshold τ, i.e., ( pdf ɛ (x) τ) Core(x) = 1, where Core deotes a set idicator fuctio. Defiitio 2. Usig a desity estimator with desity threshold τ, a poit x 1 is desity coected with aother poit x p i a sequece of p uique poits from D, i.e., {x 1, x 2, x 3,..., x p }: CON τ ɛ (x 1, x p ) is defied as: (i) if p = 2 : CONɛ τ (x (x 1, x p ) 1 N ɛ (x p )) (Core(x 1 ) Core(x p )); (ii) if p > 2 : (x1,x 2,...x p)(( i {2,...,p} x i 1 N ɛ (x i )) ( i {2,...,p 1} Core(x i ))). Defiitio 3. A cluster detected by the desity-based algorithm DBSCAN is a maximal set of desity coected istaces, i.e., C i = {x D CON τ ɛ (x, c i )}, where c i = arg max x Ci pdf ɛ (x) is the cluster mode. For this kid of algorithm to fid all clusters i a dataset, the data distributio must have the followig ecessary coditio: the peak desity of each cluster must be greater tha the maximum over all possible paths of the miimum desity alog ay path likig ay two modes. 2 This coditio is formally described by Zhu et al 36 as follows: mi c k > max g ij (2) k {1,...,ς} i j {1,...,ς} where g ij is the largest of the miimum desity alog ay path likig the mode of for clusters C i ad C j. This coditio implies that there must exist a threshold τ that ca be used to break all paths betwee the modes by assigig regios with desity less tha τ to oise. Otherwise, if the mode of some cluster has a desity lower tha that of a low-desity regio betwee other clusters, the this kid of desity-based clusterig algorithm will fail to fid all clusters. Either some high-desity clusters will be merged together (whe a lower desity threshold is used), or low-desity clusters will be desigated as oise (whe a higher desity threshold is used). To illustrate, Figure 2a shows that usig a high threshold τ 1 will cause all poits i Cluster C 3 to be assiged to oise but usig a low threshold τ 2 will cause poits i C 1 ad C 2 to be assiged to the same cluster. 2.2 knn aomaly detectio A classic earest-eighbour based aomaly detectio algorithm assigs a aomaly score to a istace based o its distace to the k-th earest eighbour 27. The istaces with the largest aomaly scores are idetified as aomalies. 2 A path likig two modes c i ad c j is defied as a sequece of uique poits startig with c i ad edig with c j where adjacet poits lie i each other s ɛ-eighbourhood. 4

5 (a) Origial data (b) ReScaled data Figure 2: (a) A mixture of three Gaussia distributios that caot be separated usig a sigle desity threshold; (b) Desity distributio o ReScaled data of (a), where a sigle desity threshold ca be foud to separated all three clusters. Note that poit x 1, x 2 ad x 3 are shifted to y 1, y 2 ad y 3, respectively. Here η is a larger badwidth tha ɛ. Give a dataset ad the parameter k, the desity of x ca be estimated usig a k-th earest eighbour desity estimator (as used by the classic knn aomaly detector 27 ): pdf knn (x; k) = pdf(x; k ɛ k (x)) = V ɛk (x) 1 ( ɛ k (x) )d (3) where ɛ k (x) is the distace betwee x R d ad its k-th earest eighbour i a dataset D. Note that the k-th earest eighbour distace ɛ k (x) is a proxy to the desity of x, i.e., high ɛ k (x) idicates low desity, ad vice versa. Let C be the set of all ormal poits i a dataset D. The coditio uder which the classic knn aomaly detector could, with a appropriate settig of a desity/distace threshold, idetify every aomaly y i A = D \ C is give as follows: mi y A ɛ k (y) > max x C ɛ k (x) (4) Equatio 4 states that all aomalies must have the highest knn distaces (or lowest desities) i order to detect them. I other words, knn aomaly detectors ca detect both global aomalies ad scattered aomalies which have lower desities tha that of all ormal poits 1,20. However, based o this characteristic, knn aomaly detectors are uable to detect: (a) Local aomalies. This is because a local aomaly y with low desity relative to earby ormal (o-aomalous) istaces i a regio of high average desity may still have higher desity tha that of ormal (o-aomalous) istaces i regios of lower average desity 7. Traslatig this i terms of k-th NN distace, we have: x,z C,y A ɛ k (x) < ɛ k (y) < ɛ k (z). Here, some local aomalies (for example, poits located aroud the boudaries of C 1 ad C 2 ) are raked lower tha the ormal poits located aroud the cere sparse cluster (C 3 ), as show i Figure 2a. (b) Aomalous clusters (sometimes referred to as clustered aomalies 21 ) are groups of poits that are too small to be cosidered true clusters ad are foud i low desity parts of the space (i.e. are separate from the other clusters). A purported beefit of the knn aomaly detector is that it is able to remove such aomalous clusters providig k is sufficietly large (larger tha the size of the aomalous group) 1. By rescalig the data we are able to exted this property to locally aomalous clusters, i.e. to be tolerat to variatios i the local (backgroud or average) desity over the space. A example is show i Figure Summary Both desity-based clusterig ad distace-based aomaly detector have weakesses whe it comes to hadlig datasets with ihomogeeous desity clusters. Rather tha creatig a ew desity estimator or modifyig the existig clusterig ad aomaly detectio algorithm procedures, we advocate trasformig data to be more uiformly distributed tha it is i the origial space such that the separatio betwee clusters ca be idetified easily. 5

6 (a) Origial data (b) ReScaled data Figure 3: (a) A mixture of two Gaussia distributios that C is a ormal cluster ad A is a aomalous cluster; (b) Desity distributio o the ReScaled data of (a), where the aomalous cluster are farther to the ormal cluster cetre. Note that Poit x 1, x 2 ad x 3 are shifted to y 1, y 2 ad y 3, respectively. 3 Desity-ratio estimatio Desity-ratio estimatio is a pricipled approach to overcome the weakess of desity-based clusterig for detectig clusters with ihomogeeous desities 36. The desity-ratio of a poit is the ratio of two desity estimates calculated usig the same desity estimator, but with two differet badwidth settigs. Let pdf(, γ) ad pdf(, λ) be desity estimators usig kerels of badwidth γ ad λ, respectively. Give the costrait that the deomiator has larger badwidth tha the umerator γ < λ, the desity ratio of x is estimated as rpdf(x; γ, λ) = pdf(x; γ) pdf(x; λ) (5) We correct the lemma from 36 regardig the desity ratio value : Lemma 1. For ay data distributio ad sufficietly small values of γ ad λ s.t. γ < λ, if x is at a local maximum desity of N (x; λ), the rpdf(x; γ, λ) 1; ad if x is at a local miimum desity of N (x; λ), the rpdf(x; γ, λ) 1. Sice poits located at local high-desity areas (almost ivariably) have desity-ratio higher tha poits located at local low-desity areas, a global desity-ratio threshold aroud uity ca be used to idetify all cluster peaks ad break all paths betwee differet clusters. Thus, based o desity-ratio estimatio, existig desity-based clusterig algorithms such as DBSCAN ca idetify clusters as regios of locally high desity, separated by regios of locally low desity. Similarly, a desity-based aomaly detector is able to detect local aomalies sice their desity-ratio values are lower ad raked higher tha ormal poits with locally high desities. 4 CDF scalig to overcome the issue of ihomogeeous desities 4.1 Oe-dimesioal CDF scalig Let pdf(, λ) ad cdf(, λ) be a desity estimator ad a cumulative desity estimator, respectively, usig a kerel of badwidth λ. Let x be a trasformed poit of x usig cdf as follows: The, we have the property as follows 36 : ad pdf(x ; γ) x = cdf(x; λ) (6) pdf(x ; λ) = 1/ (7) pdf(x; γ) pdf(x; λ) 6 for λ > γ (8)

7 ReScale 36 is a represetative implemetatio algorithm based o this cdf trasform. Figure 2b ad Figure 3b show ReScale rescales the data distributio o two 1-dimesioal datasets, respectively. They show that clusters ad aomalies are easier to idetify after the applicatio of ReScale. Sice this cdf trasform ca be performed o a oe-dimesioal dataset oly, ReScale must apply the trasformatio to each attribute idepedetly for a multi-dimesioal dataset. 4.2 Multi-dimesioal CDF scalig Usig a distace scalig method, a multi-dimesioal cdf trasform ca be achieved by simply rescalig the distaces betwee each poit ad all the other istaces i its local eighbourhood, (thereby cosiderig all of the dimesios of the multidimesioal dataset at oce). Give a poit x D, the distace betwee x ad all poits i its λ-eighbourhood y N (x; s, λ) 3 ca be rescaled usig a scalig fuctio r( ): where s (, ) is the scaled distace of s(, ). x,y D,y N (x;s,λ) s (x, y) = s(x, y) r(x; λ) (9) Here, the scalig fuctio r(x; λ) depeds o both the positio x ad size of the eighbourhood λ. It is defied as follows usig the estimated desity pdf(x; s, λ) with the aim of makig the desity distributio withi the λ-eighbourhood uiform: r(x; λ) = m λ (pdf(x; s, λ) V λ ) 1 1 d pdf(x; s, λ) d (10) where m = max x,y D s(x, y) is the maximum pairwise distace i D. Note that we have ow reparameterised the desity estimator pdf(x; s, λ) to iclude the distace fuctio s, i order to facilitate calculatios below. With referece to the uiform distributio which has desity V m, where V m is the volume of the ball havig radius m: r(x; λ) < 1 if pdf(x; s, λ) < V m (11) r(x; λ) > 1 if pdf(x; s, λ) > V m (12) That is, the process rescales sparse regios to be more dese by usig r(x; λ) < 1, ad dese regios to be more sparse by usig r(x; λ) > 1; such that the etire dataset is approximately uiformly distributed i the scaled λ-eighbourhood. More spefically, after rescalig distaces with s, the desity of poits i the eighbourhood of size λ x = λ r(x; λ) aroud x is the same as the desity of poits across the whole dataset: pdf(x; s, λ x) = pdf(x; s, λ) V λ V λ x = pdf(x; s, λ) V λ V λ r(x; λ) d pdf(x; s, λ) V λ = V λ ( m λ ( pdf(x;s,λ) V λ ) 1 d ) = λd d V λ m d = (13) V m Note that the above derivatio is possible because (i) the scalig is isotropic about x (hece the shape of the uit ball does t chage oly its size) ad (ii) the uiform-kerel desity estimator is local (i.e., its value depeds oly o poits withi the λ x eighbourhood). I order to maitai the same relative orderig of distaces betwee x ad all other poits i the dataset, the distace betwee x ad ay poit ot i the λ-eighbourhood y D \ N (x; s, λ) ca be ormalised by a simple mi-max ormalisatio: y D\N (x;s,λ) s (x, y) = (s(x, y) λ) m λ x m λ + λ x (14) 3 For reasos that will become obvious shortly, we ow reparameterise the eighbourhood fuctio N (x; s, λ) to deped o the distace measure s. 7

8 It is iterestig to ote that providig λ is sufficietly large that the average of the desity estimates across the data poits approximates the average desity over the space E x D [pdf(x; s, λ)] V m, the we have that the average rescalig factor is approximately 1: E x D [r(x; λ)] 1, E x D [λ x] λ, ad thus the average distace betwee poits is uchaged after rescalig: 4 y D E x D [s (x, y)] E x D [s(x, y)] (15) The implemetatio of DScale 37, which is a represetative algorithm based o this multi-dimesioal CDF scalig, is provided i Algorithm 2 i Appedix 8. A lemma about the desity o the rescaled distace is give as follows: Lemma 2. The desity pdf(x; s, γ x) with the rescaled distace s is approximately proportioal to the desity-ratio i terms of the origial distace s withi the λ-eighbourhood of x. pdf(x;s,γ) pdf(x;s,λ) Proof. where γ x = γ r(x; λ) ad γ < λ. pdf(x; s, γ x) pdf(x; s, γ) V γ V γ x pdf(x; s, γ) = m d λ d pdf(x;s,λ) V λ = pdf(x; s, γ) V γ V γ r(x; λ) d = pdf(x; s, γ) V m pdf(x; s, λ) pdf(x; s, γ) pdf(x; s, λ) Note that the above lemma is oly valid withi the λ-eighbourhood of x, ad whe each x D is treated idepedetly. I other words, the cdf trasform is oly valid locally. I additio, the rescaled distace is asymmetric, i.e., s (x, y) s (y, x) whe r(x; λ) r(y; λ). To be a valid cdf trasform globally for the etire dataset, we propose i this paper to perform a iterative process that ivolves two steps: distace rescalig ad poit shiftig. 5 CDF Trasform-Shift (CDF-TS) Istead of just rescalig distaces betwee each data istace x ad all other poits i the dataset, we ca apply the rescalig directly to the dataset by traslatig poits i approximate accordace with the rescaled distaces. Two advatages of the ew approach are: (i) the shifted dataset becomes approximately uiformly distributed as a whole; ad (ii) the stadard Euclidea distace ca the be used to measure distaces betwee the trasformed poits, rather tha the o-symmetric ad as a result o-metric rescaled distace. Cosider two poits x, y D. I order to make the distributio aroud poit x more uiform, we wish to rescale (expad or cotract) the distace betwee x ad y to be s (x, y) as defied by Equatios 9 ad 14. We ca do this by traslatig the poit y to a ew poit deoted y x which lies alog the directio (y x) as follows: y x = x + s (x, y) (y x) (16) s(x, y) Note that the distace betwee x ad the ewly trasformed poit y x satisfies the rescaled distace requiremet: s(x, y x) = s (x, y) (17) Above we cosidered the effect of traslatig poit y to y x to scale the space aroud poit x. I order to approximately scale the space aroud all poits i the dataset, poit y eeds to be traslated by the average of the above traslatios: ȳ = 1 y x = 1 x + s (x, y) (y x) (18) s(x, y) x D x D A theorem regardig the fial shifted data D is provided as follows: 4 Proof: For y N (x, s, λ) we have that E x D[s (x, y)] E x D[r(x; λ)]e x D[s(x, y)] E x D[s(x, y)] sice r(x; λ) ad s(x, y) are idepedet. The same ca be show for poits y / N (x, s, λ). 8

9 Theorem 1. Give a sufficietly large badwidth λ < m, the λ-eighbourhood desity of D is expected to be more uiform tha that of D: x D pdf( x ; s, λ) /V m < pdf(x; s, λ) /V m, Proof. Let us defie the badwidth λ x for the trasformed eighbourhood aroud x as λ x = λ s( x,ȳ ) s(x,y) x, y D ad y N (x, s, λ). We ca ow estimate λ x as follows. We have s( x, ȳ ) = x ȳ = 1 ( s (z, x) s(z, x) (x z) s (z, y) (y z))) s(z, y) z D where Sice y is i the λ-eighbourhood of x we have z D s (z,x) s(z,x) s( x, ȳ ) 1 s( x, ȳ ) s(x, y) z D = s(x, y) 1 = 1 ( s (z,y) s(z,y), ad s (z, x) (x y) s(z, x) z D z N (x;s,λ) s (z, x) s(z, x) s (z, x) s(z, x) + z / N (x;s,λ) s (z, x) s(z, x) ) Supposig that pdf(z; s, λ) varies slowly withi the eighbourhood N (x; s, λ), we have: z N (x;s,λ) s (z, x) s(z, x) z N (x;s,λ) s (x, z) s(x, z) = z N (x;s,λ) r(x; λ) The based o Equatio 15, give a sufficietly large λ such that E z D [pdf(z; s, λ)] V m ad providig the cout N (x; s, λ) is sufficietly small, we have: E z,x D,z / N (x;s,λ) [s (z, x)] E z,x D [s (z, x)] E z,x D [s(z, x)] This meas that s( x, ȳ ) is maily affected by the poit z located i the λ-eighbourhood of x, i.e., E z,x D,z / N (x;s,λ) [ s (z,x) s(z,x) ] 1. The we have s( x, ȳ ) s(x, y) The above relatio ca be rewritte as: 1 ( z N (x;s,λ) r(x; λ) + z / N (x;s,λ) N (x; s, λ) = ( r(x; λ) + s( x,ȳ ) s(x,y) 1 N (x; s, λ) r(x; λ) 1 1) N (x; s, λ) ) As N (x;s,λ) (0, 1), the relatio betwee s( x,ȳ ) s(x,y) ad r(x; λ) depeds o whether x is i the sparse or dese regio: r(x; λ) < s( x, ȳ ) s(x, y) r(x; λ) > s( x, ȳ ) s(x, y) < 1 if pdf(x; s, λ) < > 1 if pdf(x; s, λ) > V m (19) V m (20) Whe the desity varies slowly i the N (x; s, λ), we ca safely assume that each poit s earest eighbours keep similar after trasform ad the desity still varied slowly i λ -eighbourhood. The we have E x D N (x ; s, λ ) N (x; s, λ) ad pdf( x ; s, λ) ca estimate as 9

10 pdf( x ; s, λ) pdf( x ; s, λ ) N (x; s, λ) V λ Here we have three scearios, depedig o the desity of x: = pdf(x; s, λ) V λ V λ ( s( x,ȳ ) s(x,y) )d (21) 1. I sparse regio pdf(x; s, λ) < /V m, we have r(x; λ) < s( x,ȳ ) s(x,y) iequality i Equatios 13 ad 21, we have /V m > pdf( x ; s, λ) > pdf(x; s, λ) < 1, as show i Equatio 19. Usig this 2. I dese regio pdf(x; s, λ) > /V m, we have r(x; λ) > s( x,ȳ ) s(x,y) iequality i Equatios 13 ad 21, we have > 1, as show i Equatio 20. Usig this /V m < pdf( x ; s, λ) < pdf(x; s, λ) Therefore, x D, pdf( x ; s, λ) /V m < pdf(x; s, λ) /V m. Sice our target is to get a uiformly distributed data w.r.t. pdf( x, s, λ), this trasform-shift process ca be repeated multiple times usig a iterative method to reduce the λ-eighbourhood desity variatio o the shifted dataset i the previous iteratio. This is accomplished by usig a expectatio-maximisatio-like algorithm. The objective is to miimise the desity variace i D. A coditio for termiatig the iteratio process is whe the total distace of poit-shifts from D to D i the latest iteratio is less tha a threshold δ, i.e., x D, x D x x δ (22) Note that, due to the shifts i each iteratio, the value rages of attributes of the shifted dataset D are likely to be differet from those of the origial dataset D. We ca use a mi-max ormalisatio 3 o each attribute to keep the values i the same fixed rage at the ed of each iteratio. Figure 4 illustrates the effects of CDF-TS o a two-dimesioal data with differet iteratios. It shows that the origial clusters with differet desity become icreasig uiform as the umber of iteratios icreases. (a) A two-dimesioal data (b) After 1 iteratio (c) After 3 iteratios (d) After 5 iteratios Figure 4: Illustratig the effects of CDF-TS o a two-dimesioal data with λ = 0.1. The implemetatio of CDF-TS is show i Algorithm 1. The DScale algorithm used i step 6 is provided i Algorithm 2 i Appedix 8. Algorithm 1 requires two parameters λ ad δ. 10

11 Algorithm 1 CDF-TS(D, λ, δ) Iput: D - iput data matrix ( d matrix); λ - badwidth parameter; δ - threshold for the total shifted distace. Output: D - data matrix after rescalig ad poit shiftig. 1: Normalisig D usig mi-max ormalisatio 2: = 3: t = 1 4: while > δ do 5: S Calculatig the distace matrix for D 6: S DScale(S, λ, d) 7: for each z D (where z is a referece poit) do 8: D z Shift every poit x D to x z i the directio of z to x with magitude S [z, x] 9: ed for 10: D 1 z D D z 11: Normalisig D usig mi-max ormalisatio 12: = 1 d i,j D[i, j] D [i, j] 13: t = t : D = D 15: ed while 16: retur D 5.1 Desity estimators applicable to CDF-TS The followig two types of desity estimators, with fixed-size ad variable-size badwidths of the uiform kerel, are applicable to CDF-TS to do the CDF scalig: (a) ɛ-eighbourhood desity estimator: pdf ɛ (x) = where V ɛ is the volume of the ball with radius ɛ. {y D s(x, y) ɛ} V ɛ Whe this desity estimator is used i CDF-TS, the λ-eighbourhood described i Sectios 4.2 ad 5, i.e., N (x; s, λ) = {y D s(x, y) ɛ}, where λ = ɛ, deotig the fixed-size badwidth uiform kerel. (b) k-th earest eighbour desity estimator: pdf(x; ɛ k (x)) = k V s(x,xk ) 1 ( s(x, x k ) )d where x k is the k-th earest eighbour of x; ɛ k (x) is k-th earest eighbour distace of x. Whe this desity estimator is used i CDF-TS, the eighbourhood size λ becomes depedet o the locatio x ad the distributio of surroudig poits. We deote λ(x) k = s(x, x k ) as the distace to the k th earest eighbour of x. I this case the desity is simply calculated over the eighbourhood: N (x; s, λ(x) k ) = {y D s(x, y) s(x, x k )}, deotig the variable-size badwidth uiform kerel, i.e., small i dese regio ad large i sparse regio. Note that, i this circumstace, the desity of the shifted poits still become more uiformly w.r.t their surroudig. I the followig experimets, we employ the same desity estimators as used i DBSCAB ad DP (i clusterig) ad knn aomaly detector i CDF-TS to trasform D to D, i.e., ɛ-eighbourhood desity estimator is used i CDF-TS for clusterig; ad k-th earest eighbour desity estimator is used i CDF-TS for aomaly detectio. 6 Empirical Evaluatio This sectio presets experimets desiged to evaluate the effectiveess of CDF-TS. All algorithms used i our experimets were implemeted i Matlab R2017a (the source code ca be obtaied at The experimets are ru o a machie with eight cores CPU (Itel Core 3.60GHz), 32GB memory ad a 2560 CUDA cores GPU (GeForce GTX 1080). All datasets were ormalised usig the mi-max ormalisatio to yield each attribute to be i [0,1] before the experimets bega. 11

12 Table 2: Properties of clusterig datasets Dataset Data Size #Dimesios #Clusters Segmet Mice Biodeg ILPD ForestType Wilt Musk Libras Dermatology Ecoli Haberma Seeds Lug Wie L C For clusterig, we used 2 artificial datasets ad 14 real-world datasets with differet data sizes ad dimesios from UCI Machie Learig Repository 11. Table 2 presets the data properties of the datasets. 3L is a 2-dimesioal data cotaiig three elogated clusters with differet desities, as show i Figure 5a. 4C is a 2-dimesioal dataset cotaiig four clusters with differet desities (three Gaussia clusters ad oe elogated cluster), as show i Figure 5b. Note that DBSCAN is uable to correctly idetify all clusters i both of these datasets because they do ot satisfy the coditio specified i Equatio 2. Furthermore, clusters i 3L sigificatly overlap o idividual attribute projectios, which violates the requiremet of ReScale that the oe-dimesioal projectios allow for the idetificatio of the desity peaks of each cluster. (a) 3L data distributio (b) 4C data distributio Figure 5: (a) A two-dimesioal data cotaiig three elogated clusters. (b) A two-dimesioal data cotaiig four clusters. For aomaly detectio, we compared the aomaly detectio performace o 2 sythetic datasets with aomalous clusters ad 10 real-world bechmark datasets 5. The data size, dimesios ad percetage of aomalies are show i Table 3. Both the Sy 1 ad Sy 2 datasets cotai clusters of aomalies. Their data distributios are show i Figure Clusterig I this sectio, we compare CDF-TS with ReScale ad DScale usig two existig desity-based clusterig algorithms, i.e., DBSCAN 14 ad DP 29, i terms of F-measure 28 : give a clusterig result, we calculate the precisio score P i ad the recall score R i for each cluster C i based o the cofusio matrix, ad the the F-measure score of C i is the harmoic mea of P i ad R i. The overall F-measure score is the uweighted (macro) average over all clusters: F-measure= 1 ς ς i=1 2P ir i P i+r i. 6 5 Velocity, At ad Tomcat are from ad others are from UCI Machie Learig Repository It is worth otig that other evaluatio measures such as purity ad Normalised Mutual Iformatio (NMI) 31 oly take ito accout the poits assiged to clusters ad do ot accout for oise. A clusterig algorithm which assigs the majority of the poits 12

13 Table 3: Properties of aomaly detectio datasets Dataset Data Size #Dimesios % Aomaly AThyroid % Pageblocks % Tomcat % At % BloodDoatio % Vowel % Mfeat % Dermatology % Balace % Velocity % Sy % Sy % Desity (a) Data distributio o Sy 1 dataset x (b) Stacked histogram o Sy 2 dataset Figure 6: Data distributios of the Sy 1 ad Sy 2 datasets, where the red colour idicate the aomaly. We report the best clusterig performace after performig a parameter search for each algorithm. Table 4 lists the parameters ad their search rages for each algorithm. ψ i ReScale cotrols the precisio of ĉdf λ(x), i.e., the umber of itervals used for estimatig ĉdf λ(x). We set δ = as the default value for CDF-TS. Table 4: Parameters ad their search rage. The search rages of ψ ad λ are as used by Zhu et. al. 36. Algorithm Parameter with search rage DBSCAN Mipts {2, 3,..., 10}; ɛ [0, 1] DP k {2, 3,..., 10}; ɛ [0, 1] ReScale ψ = 100; λ {0.1, 0.2,..., 0.5} DScale λ {0.1, 0.2,..., 0.5} CDF-TS λ {0.1, 0.2,..., 0.5}; δ = Table 5 shows the best F-measures for DBSCAN, DP, their ReScale, DScale ad CDF-TS versios. The average F-measures, showed i the secod-to-last row, reveal that CDF-TS improves the clusterig performace of either DBSCAN or DP with a larger performace gap tha both ReScale ad DScale. I additio, CDF-TS is the best performer o may more datasets tha other coteders (show i the last row of Table 5.) It is iterestig to see that CDF-TS exhibits a large performace improvemet over both DBSCAN ad DP o may datasets, such as ForestType, Wilt, Dermatology, Ecoli, Lug ad Wie. O may of these datasets, CDF-TS also shows a large performace gap i compariso with ReScale ad/or DScale. Note that the performace gap is smaller for DP tha it is for DBSCAN because DP is a more powerful algorithm which does ot rely o a sigle desity threshold to idetify clusters. To evaluate whether the performace differece amog the three scalig algorithms is sigificat, we coduct the Friedma test with the post-hoc Nemeyi test 10. to oise may result i a high clusterig performace. Thus the F-measure is more suitable tha purity or NMI i assessig the clusterig performace of desity-based clusterig. 13

14 Table 5: Best F-measure of DBSCAN, DP, ad their ReScale, DScale ad CDF-TS versios. For each clusterig algorithm, the best performer i each dataset is boldfaced. Ori, ReS, ad DS represet the Origial algorithm, ReScale, ad DScale respectively. Data DBSCAN DP Orig ReS DS CDF-TS Orig ReS DS CDF-TS Segmet Mice Biodeg ILPD ForestType Wilt Musk Libras Dermatology Ecoli Haberma Seeds Lug Wie L C Average #Top Figures 7a ad 7b show the results of the sigificace test for DBSCAN ad DP, respectively. The results show that the CDF-TS versios are sigificatly better tha the DScale ad ReScale versios for both DBSCAN ad DP. (a) Sigificace test for DBSCAN (b) Sigificace test for DP Figure 7: Critical differece (CD) diagram of the post-hoc Nemeyi test (α = 0.10). Two algorithms are sigificat differet if the gap betwee their raks is larger tha CD. Otherwise, there is a lie likig them. Here we compare the differet effects o the trasformed datasets due to ReScale ad CDF-TS usig the MDS visualisatio 7. The effects o the two sythetic datasets are show i Figure 8 ad Figure 9. They show that both the rescaled datasets are more axis-parallel distributed usig ReScale tha those usig CDF-TS. For the 3L dataset, the blue cluster is still very sparse after ruig ReScale, as show i Figure 8a. I cotrast, it becomes deser usig CDF-TS, as show i Figure 9a. As a result, DBSCAN has much better performace with CDF-TS o the 3L dataset. 7 Multidimesioal scalig (MDS) 6 is used for visualisig a high-dimesioal dataset i a 2-dimesioal space through a projectio method which preserves as well as possible the origial pairwise dissimilarities betwee istaces. 14

15 (a) 3L data distributio (b) 4C data distributio Figure 8: Data distributio after ruig ReScale whe achievig the best DBSCAN clusterig performace (a) 3L data distributio (b) 4C data distributio Figure 9: Data distributio after ruig CDF-TS whe achievig the best DBSCAN clusterig performace (a) Based o distace Figure 10: MDS plots o the Wilt dataset (b) Based o CDF-TS Figure 10 shows the MDS plots o the Wilt dataset which CDF-TS has the largest performace improvemet over both DBSCAN ad DP. The figure shows that the two clusters are more uiformly distributed i the ew space which makes their boudary easier to be idetified tha that i the origial space. I cotrast, based o distace, most poits of the red cluster are surrouded by poits of the blue cluster, as show i Figure 10a. 6.2 Aomaly detectio I this sectio, we evaluate the ability of CDF-TS to detect local aomalies based o knn aomaly detectio. Two state-of-the-art local aomaly detector, Local Outlier Factor (LOF) 7 ad iforest 19,22, are also used i the compariso. Table 6 lists the parameters ad their search rages for each algorithm. Parameters ψ ad t i iforest cotrol the sub-sample size ad umber of itrees, respectively. We report the best performace of each algorithm o each dataset i terms of best AUC (Area uder the Curve of ROC) 15. Table 7 compares the best AUC score of each algorithm. CDF-TS-kNN achieves the highest average AUC of 0.90 ad performs the best o 6 out of 12 datasets. For the Sy 1 dataset which has overlappig o the idividual attribute projectio, DScale ad CDF-TS have much better results tha ReScale. Figure 11 shows the sigificace test o the three algorithms applied to knn aomaly detector. It ca be see from the results that CDF-TS-kNN is sigificatly better tha DScale-kNN ad ReScale-kNN. 15

16 Table 6: Parameters ad their search rages. Algorithm Parameters ad their search rage knn/lof k {5%, 10%,..., 50%} iforest t = 100; ψ {2 1, 2 2,..., 2 10 } ReScale ψ = 100; λ {0.1, 0.2,..., 0.5} DScale λ {0.1, 0.2,..., 0.5} CDF-TS λ {0.1, 0.2,..., 0.5}; δ = Table 7: Best AUC o 12 datasets. The best performer o each dataset is boldfaced. ReS, ad DS represet ReScale-kNN, ad DScale-kNN respectively. Data knn LOF iforest Res DS CDF-TS AThyroid Pageblocks Tomcat At BloodDoatio Vowel Mfeat Dermatology Balace Velocity Sy Sy Average #Top Figure 11: Critical differece (CD) diagram of the post-hoc Nemeyi test (α = 0.10) for aomaly detectio algorithms. It is iterestig to metio that CDF-TS-kNN has the largest AUC improvemet over knn o Dermatology ad AThyroid, show i Table 7. Table 8 compares their MDS plots betwee the origial datasets ad the shifted (desity equalised) datasets. It shows that most aomalies become aomalous clusters ad farther away from ormal poits o the shifted datasets, makig them easier to detect by knn aomaly detector. Note that some aomalies are closed to ormal clusters ad have higher desities tha may ormal istaces i the origial dataset. 6.3 Ru-time ReScale, DScale ad CDF-TS are all pre-processig methods, their computatioal complexities are show i Table 9. Because may existig desity-based clusterig algorithms have time ad space complexities of O( 2 ), all these methods does ot icrease their overall complexities. Table 10 shows the rutime of the dissimilarity matrix calculatio for each of the three methods o the Mfeat, Segmet, Pageblocks ad AThyroid datasets. Their parameters are set to be the same for all datasets, i.e., λ = 0.2, ψ = 100 ad δ = The result shows that CDF-TS took the logest rutime because it requires multiple DScale calculatios. 16

17 Table 8: MDS plots o two datasets, where red poits idicate aomalies. Distace CDF-TS AThyroid Dermatology Table 9: Computatioal complexity of ReScale, DScale ad CDF-TS. Algorithm Time complexity Space complexity ReScale O(dψ) O(d + dψ) DScale ad CDF-TS O(d 2 ) O(d + 2 ) Table 10: Rutime compariso of dissimilarity matrix calculatio (i secods). Dataset CPU GPU ReScale DScale CDF-TS ReScale DScale CDF-TS Mfeat Segmet Pageblocks AThyroid Discussio 7.1 Parameter sesitivity Both DScale ad CDF-TS have oe critical parameter λ to defie the λ-eighbourhood. The desity-ratio based o a small λ will approximate 1 ad provides o iformatio; while o a large λ will approximate the estimated desity ad have o advatage. Geerally, λ [0.1, 0.2] ad ɛ i DBSCAN ad DP shall be set slightly smaller tha λ. The parameter δ i CDF-TS cotrols the umber of iteratios, i.e., a smaller δ usually results i a higher umber of iteratios i the CDF-TS process, ad makes the shifted dataset more uiform. I our experimets, we set δ = as default value ad got sigificatly better results tha existig algorithms. Whe tuig the best δ for differet datasets, CDF-TS would get better results o most datasets. 7.2 Relatio to LOF LOF 7 is a local desity-based approach for aomaly detectio. The LOF score of x is the ratio of the desity of x to the average desity of x s k-earest eighbours, where the desity of x is iversely proportioal to the average distace to its k-earest-eighbours 7. 17

18 Though LOF is a desity-ratio approach, it does ot employ CDF trasform, which is the core techique i CDF-TS. While LOF has the ability to detect local aomalies, it is weaker i idetifyig aomalous clusters tha knn which employs CDF-TS. 7.3 Relatio to metric learig There are may usupervised distace metric learig algorithms which trasform data e.g., global methods such as PCA 18, KPCA 30 ad KUMMP 35 ); ad local methods such as Isomap 33, t-sne 24 ad LPP 17. These methods are usually used as dimesio reductio methods such that data clusters i the leared low-dimesioal space are adjusted based o some objective fuctio 34. CDF-TS is a usupervised algorithm based o the CDF trasform such that the data are more uiform i the scaled space. Though it is a liear ad local method, CDF-TS is ot a metric learig method for dimesio reductio. We have examied the performace of t-sne i desity-based clusterig ad knn aomaly detectio. We set the dimesioality of its output space to be the same as the origial data. We foud that t-sne has icosistet clusterig performace. For example, while it produced the best clusterig results o some high-dimesioal datasets such as the ForestType ad Lug datasets; but it yielded the worst results o the Breast ad 3L datasets. I additio, t-sne could ot improve the performace o knn aomaly detector. This is because aomalies are still dese or close to some dese clusters i the t-sne trasformed space. 7.4 Compariso with a desity equalisatio method I the cotext of kerel k-meas 25, knn kerel 26 has bee suggested to be a way to equalise the desity of a give dataset to reduce the bias of kerel k-meas algorithm ad to improve the clusterig performace o datasets with ihomogeeous desities. The dissimilarity matrix geerated by knn kerel is a biary matrix ca be see as the adjacet matrix of knn graph such that 1 meas a poit is i the set of k earest eighbours of aother poit; ad 0 otherwise. Thus, a k-meas clusterig algorithm ca be used to group poits based o their knn graph. However, knn kerel caot be applied to both desity-based clusterig ad knn aomaly detectio. This is because it coverts all poits to have the same desity i the ew space regardless of the badwidth used by a desity estimator used by the iteded algorithm. Thus, DBSCAN will either group eighbourig clusters ito a sigle cluster or assig all clusters to oise. This outcome is ot coducive i producig a good clusterig result. DP also caot work with knn kerel for the same reaso. DP liks a poit to aother poit with a higher desity to form clusters i the last step of the clusterig process 29. DP would ot be able to lik poits whe all poits have the same desity. For knn aomaly detectio, replacig the distace measure with the knn kerel produces either 0 or 1 similarity betwee ay two poits. This does ot work for aomaly detectio. 8 Coclusios We itroduce a CDF Trasform-Shift (CDF-TS) algorithm based o a multi-dimesioal CDF trasform to effectively deal with datasets with ihomogeeous desities. It is the first attempt to chage the volumes of every poit s eighbourhood i the etire multi-dimesioal space i order to get the desired effect of uiform distributio for the whole dataset w.r.t a desity estimator. Existig CDF trasform methods such as ReScale 36 ad DScale 37 achieved a much reduced effect o the recaled dataset. We show that this more sophisticated CDF trasform brigs about a much better outcome. I additio, CDF-TS is more geeric tha existig CDF trasform methods with the ability to use differet desity estimators, i.e, uiform kerels with either fixed or variable badwidths. As a result, CDF-TS ca be applied to more algorithms/tasks with a better outcome tha existig methods such as knn kerel 26. Through itesive evaluatios, we show that CDF-TS sigificatly improves the performace of two existig desitybased clusterig algorithms ad oe existig distace-based aomaly detector. Appedix: DScale algorithm Algorithm 2 is a geeralisatio of the implemetatio of DScale 37 which ca employ a desity estimator with a badwidth parameter λ. It requires oe parameter λ oly. 18

19 Algorithm 2 DScale(S, λ, d) Iput: S - iput distace matrix ( matrix); λ - badwidth parameter; d - dimesioality of the dataset. Output: S - distace matrix after scalig. 1: m the maximum distace i S 2: Iitialisig matrix S 3: for i = 1 to do 4: r(x i ) = m N (xi,s,λ) λ ( ) 1 d 5: xj N (x i,s,λ) S [x i, x j ] = S[x i, x j ] r(x i ) 6: xj D\N (x i,s,λ) S [x i, x j ] = (S[x i, x j ] λ) m λ r(xi) m λ + λ r(x i ) 7: ed for 8: retur S Refereces 1. Charu C Aggarwal. Outlier Aalysis. Spriger Iteratioal Publishig, Switzerlad, 2d editio, ISBN Charu C Aggarwal ad Chada K Reddy. Data Clusterig: Algorithms ad Applicatios. Chapma ad Hall/CRC Press, Selim Aksoy ad Robert M Haralick. Feature ormalizatio ad likelihood-based similarity measures for image retrieval. Patter recogitio letters, 22(5): , Fabrizio Agiulli ad Clara Pizzuti. Fast outlier detectio i high dimesioal spaces. I Europea Coferece o Priciples of Data Miig ad Kowledge Discovery, pages Spriger, Mihael Akerst, Markus M. Breuig, Has-Peter Kriegel, ad Jörg Sader. OPTICS: Orderig poits to idetify the clusterig structure. I Proceedigs of the 1999 ACM SIGMOD Iteratioal Coferece o Maagemet of Data, SIGMOD 99, pages 49 60, New York, NY, USA, ACM. ISBN Igwer Borg, Patrick JF Groee, ad Patrick Mair. Applied Multidimesioal Scalig. Spriger Sciece & Busiess Media, Markus M. Breuig, Has-Peter Kriegel, Raymod T. Ng, ad Jörg Sader. LOF: Idetifyig desity-based local outliers. SIGMOD Rec., 29(2):93 104, May Varu Chadola, Aridam Baerjee, ad Vipi Kumar. Aomaly detectio: A survey. ACM computig surveys (CSUR), 41(3):15, Timothy De Vries, Sajay Chawla, ad Michael E Houle. Fidig local aomalies i very high dimesioal space. I Data Miig (ICDM), 2010 IEEE 10th Iteratioal Coferece o, pages IEEE, Jaez Demšar. Statistical comparisos of classifiers over multiple data sets. Joural of Machie Learig Research, 7:1 30, Dua Dheeru ad Efi Karra Taiskidou. UCI machie learig repository, URL Levet Ertöz, Michael Steibach, ad Vipi Kumar. Fidig topics i collectios of documets: A shared earest eighbor approach. Clusterig ad Iformatio Retrieval, 11:83 103, Levet Ertöz, Michael Steibach, ad Vipi Kumar. Fidig clusters of differet sizes, shapes, ad desities i oisy, high dimesioal data. I Proceedigs of the 2003 SIAM Iteratioal Coferece o Data Miig, pages SIAM, Marti Ester, Has-Peter Kriegel, Jörg S, ad Xiaowei Xu. A desity-based algorithm for discoverig clusters i large spatial databases with oise. I Proceedigs of the Secod Iteratioal Coferece o Kowledge Discovery ad Data Miig (KDD-96), pages AAAI Press, Tom Fawcett. A itroductio to roc aalysis. Patter recogitio letters, 27(8): , Evely Fix ad Joseph L Hodges Jr. Discrimiatory aalysis-oparametric discrimiatio: cosistecy properties. Techical report, Califoria Uiv Berkeley,

20 17. Xiaofei He ad Partha Niyogi. Locality preservig projectios. I S. Thru, L. K. Saul, ad B. Schölkopf, editors, Advaces i Neural Iformatio Processig Systems 16, pages MIT Press, Ia Jolliffe. Pricipal compoet aalysis. I Iteratioal ecyclopedia of statistical sciece, pages Spriger, Fei Toy Liu, Kai Mig Tig, ad Zhi-Hua Zhou. Isolatio forest. I 2008 Eighth IEEE Iteratioal Coferece o Data Miig, pages IEEE, Fei Toy Liu, Kai Mig Tig, ad Zhi-Hua Zhou. O detectig clustered aomalies usig SCiForest. I Proceedigs of the 2010 Europea Coferece o Machie Learig ad Kowledge Discovery i Databases: Part II, ECML PKDD 10, pages , Berli, Heidelberg, Spriger-Verlag. ISBN X, Fei Toy Liu, Kai Mig Tig, ad Zhi-Hua Zhou. O detectig clustered aomalies usig sciforest. I José Luis Balcázar, Fracesco Bochi, Aristides Giois, ad Michèle Sebag, editors, Machie Learig ad Kowledge Discovery i Databases, pages , Berli, Heidelberg, Spriger Berli Heidelberg. ISBN Fei Toy Liu, Kai Mig Tig, ad Zhi-Hua Zhou. Isolatio-based aomaly detectio. ACM Trasactios o Kowledge Discovery from Data (TKDD), 6(1):3:1 3:39, March ISSN Do O Loftsgaarde, Charles P Queseberry, et al. A oparametric estimate of a multivariate desity fuctio. The Aals of Mathematical Statistics, 36(3): , Laures va der Maate ad Geoffrey Hito. Visualizig data usig t-se. Joural of machie learig research, 9(Nov): , James MacQuee. Some methods for classificatio ad aalysis of multivariate observatios. I Proceedigs of the fifth Berkeley symposium o mathematical statistics ad probability, volume 1, pages Oaklad, CA, USA, D. Mari, M. Tag, I. Be Ayed, ad Y. Y. Boykov. Kerel clusterig: desity biases ad solutios. IEEE Trasactios o Patter Aalysis ad Machie Itelligece, pages 1 1, ISSN doi: / TPAMI Sridhar Ramaswamy, Rajeev Rastogi, ad Kyuseok Shim. Efficiet algorithms for miig outliers from large data sets. I Proceedigs of the 2000 ACM SIGMOD Iteratioal Coferece o Maagemet of Data, SIGMOD 00, pages , New York, NY, USA, ACM. ISBN C. J. Va Rijsberge. Iformatio Retrieval. Butterworth-Heiema, Newto, MA, USA, 2d editio, ISBN Alex Rodriguez ad Alessadro Laio. Clusterig by fast search ad fid of desity peaks. Sciece, 344(6191): , Berhard Schölkopf, Alexader J Smola, et al. Learig with kerels: support vector machies, regularizatio, optimizatio, ad beyod. MIT press, Cambridge, Alexader Strehl ad Joydeep Ghosh. Cluster esembles a kowledge reuse framework for combiig multiple partitios. Joural of Machie Learig Research, 3(Dec): , Richard Szeliski. Computer visio: algorithms ad applicatios. Spriger Sciece & Busiess Media, Joshua B Teebaum, Vi De Silva, ad Joh C Lagford. A global geometric framework for oliear dimesioality reductio. sciece, 290(5500): , Fei Wag ad Jimeg Su. Survey o distace metric learig ad dimesioality reductio i data miig. Data Miig ad Kowledge Discovery, 29(2), Fei Wag, Bi Zhao, ad Chagshui Zhag. Usupervised large margi discrimiative projectio. IEEE trasactios o eural etworks, 22(9): , Ye Zhu, Kai Mig Tig, ad Mark J. Carma. Desity-ratio based clusterig for discoverig clusters with varyig desities. Patter Recogitio, 60: , ISSN

Image Segmentation EEE 508

Image Segmentation EEE 508 Image Segmetatio Objective: to determie (etract) object boudaries. It is a process of partitioig a image ito distict regios by groupig together eighborig piels based o some predefied similarity criterio.

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Sprig 2017 A secod course i data miig http://www.it.uu.se/edu/course/homepage/ifoutv2/vt17/ Kjell Orsbor Uppsala Database Laboratory Departmet of Iformatio Techology, Uppsala Uiversity,

More information

3D Model Retrieval Method Based on Sample Prediction

3D Model Retrieval Method Based on Sample Prediction 20 Iteratioal Coferece o Computer Commuicatio ad Maagemet Proc.of CSIT vol.5 (20) (20) IACSIT Press, Sigapore 3D Model Retrieval Method Based o Sample Predictio Qigche Zhag, Ya Tag* School of Computer

More information

Fundamentals of Media Processing. Shin'ichi Satoh Kazuya Kodama Hiroshi Mo Duy-Dinh Le

Fundamentals of Media Processing. Shin'ichi Satoh Kazuya Kodama Hiroshi Mo Duy-Dinh Le Fudametals of Media Processig Shi'ichi Satoh Kazuya Kodama Hiroshi Mo Duy-Dih Le Today's topics Noparametric Methods Parze Widow k-nearest Neighbor Estimatio Clusterig Techiques k-meas Agglomerative Hierarchical

More information

Performance Plus Software Parameter Definitions

Performance Plus Software Parameter Definitions Performace Plus+ Software Parameter Defiitios/ Performace Plus Software Parameter Defiitios Chapma Techical Note-TG-5 paramete.doc ev-0-03 Performace Plus+ Software Parameter Defiitios/2 Backgroud ad Defiitios

More information

Pattern Recognition Systems Lab 1 Least Mean Squares

Pattern Recognition Systems Lab 1 Least Mean Squares Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig

More information

Dynamic Programming and Curve Fitting Based Road Boundary Detection

Dynamic Programming and Curve Fitting Based Road Boundary Detection Dyamic Programmig ad Curve Fittig Based Road Boudary Detectio SHYAM PRASAD ADHIKARI, HYONGSUK KIM, Divisio of Electroics ad Iformatio Egieerig Chobuk Natioal Uiversity 664-4 Ga Deokji-Dog Jeoju-City Jeobuk

More information

arxiv: v2 [cs.ds] 24 Mar 2018

arxiv: v2 [cs.ds] 24 Mar 2018 Similar Elemets ad Metric Labelig o Complete Graphs arxiv:1803.08037v [cs.ds] 4 Mar 018 Pedro F. Felzeszwalb Brow Uiversity Providece, RI, USA pff@brow.edu March 8, 018 We cosider a problem that ivolves

More information

Improving Template Based Spike Detection

Improving Template Based Spike Detection Improvig Template Based Spike Detectio Kirk Smith, Member - IEEE Portlad State Uiversity petra@ee.pdx.edu Abstract Template matchig algorithms like SSE, Covolutio ad Maximum Likelihood are well kow for

More information

Evaluation scheme for Tracking in AMI

Evaluation scheme for Tracking in AMI A M I C o m m u i c a t i o A U G M E N T E D M U L T I - P A R T Y I N T E R A C T I O N http://www.amiproject.org/ Evaluatio scheme for Trackig i AMI S. Schreiber a D. Gatica-Perez b AMI WP4 Trackig:

More information

Stone Images Retrieval Based on Color Histogram

Stone Images Retrieval Based on Color Histogram Stoe Images Retrieval Based o Color Histogram Qiag Zhao, Jie Yag, Jigyi Yag, Hogxig Liu School of Iformatio Egieerig, Wuha Uiversity of Techology Wuha, Chia Abstract Stoe images color features are chose

More information

Analysis of Different Similarity Measure Functions and their Impacts on Shared Nearest Neighbor Clustering Approach

Analysis of Different Similarity Measure Functions and their Impacts on Shared Nearest Neighbor Clustering Approach Aalysis of Differet Similarity Measure Fuctios ad their Impacts o Shared Nearest Neighbor Clusterig Approach Ail Kumar Patidar School of IT, Rajiv Gadhi Techical Uiversity, Bhopal (M.P.), Idia Jitedra

More information

Bayesian approach to reliability modelling for a probability of failure on demand parameter

Bayesian approach to reliability modelling for a probability of failure on demand parameter Bayesia approach to reliability modellig for a probability of failure o demad parameter BÖRCSÖK J., SCHAEFER S. Departmet of Computer Architecture ad System Programmig Uiversity Kassel, Wilhelmshöher Allee

More information

An Improved Shuffled Frog-Leaping Algorithm for Knapsack Problem

An Improved Shuffled Frog-Leaping Algorithm for Knapsack Problem A Improved Shuffled Frog-Leapig Algorithm for Kapsack Problem Zhoufag Li, Ya Zhou, ad Peg Cheg School of Iformatio Sciece ad Egieerig Hea Uiversity of Techology ZhegZhou, Chia lzhf1978@126.com Abstract.

More information

Evaluation of Support Vector Machine Kernels for Detecting Network Anomalies

Evaluation of Support Vector Machine Kernels for Detecting Network Anomalies Evaluatio of Support Vector Machie Kerels for Detectig Network Aomalies Prera Batta, Maider Sigh, Zhida Li, Qigye Dig, ad Ljiljaa Trajković Commuicatio Networks Laboratory http://www.esc.sfu.ca/~ljilja/cl/

More information

SD vs. SD + One of the most important uses of sample statistics is to estimate the corresponding population parameters.

SD vs. SD + One of the most important uses of sample statistics is to estimate the corresponding population parameters. SD vs. SD + Oe of the most importat uses of sample statistics is to estimate the correspodig populatio parameters. The mea of a represetative sample is a good estimate of the mea of the populatio that

More information

1 Graph Sparsfication

1 Graph Sparsfication CME 305: Discrete Mathematics ad Algorithms 1 Graph Sparsficatio I this sectio we discuss the approximatio of a graph G(V, E) by a sparse graph H(V, F ) o the same vertex set. I particular, we cosider

More information

IMP: Superposer Integrated Morphometrics Package Superposition Tool

IMP: Superposer Integrated Morphometrics Package Superposition Tool IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College

More information

Dimension Reduction and Manifold Learning. Xin Zhang

Dimension Reduction and Manifold Learning. Xin Zhang Dimesio Reductio ad Maifold Learig Xi Zhag eeizhag@scut.edu.c Cotet Motivatio of maifold learig Pricipal compoet aalysis ad its etesio Maifold learig Global oliear maifold learig (IsoMap) Local oliear

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045 Oe Brookigs Drive St. Louis, Missouri 63130-4899, USA jaegerg@cse.wustl.edu

More information

are two specific neighboring points, F( x, y)

are two specific neighboring points, F( x, y) $33/,&$7,212)7+(6(/)$92,',1* 5$1'20:$/.12,6(5('8&7,21$/*25,7+0,17+(&2/285,0$*(6(*0(17$7,21 %RJGDQ602/.$+HQU\N3$/86'DPLDQ%(5(6.$ 6LOHVLDQ7HFKQLFDO8QLYHUVLW\'HSDUWPHQWRI&RPSXWHU6FLHQFH $NDGHPLFND*OLZLFH32/$1'

More information

New Fuzzy Color Clustering Algorithm Based on hsl Similarity

New Fuzzy Color Clustering Algorithm Based on hsl Similarity IFSA-EUSFLAT 009 New Fuzzy Color Clusterig Algorithm Based o hsl Similarity Vasile Ptracu Departmet of Iformatics Techology Tarom Compay Bucharest Romaia Email: patrascu.v@gmail.com Abstract I this paper

More information

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON Roberto Lopez ad Eugeio Oñate Iteratioal Ceter for Numerical Methods i Egieerig (CIMNE) Edificio C1, Gra Capitá s/, 08034 Barceloa, Spai ABSTRACT I this work

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information

Euclidean Distance Based Feature Selection for Fault Detection Prediction Model in Semiconductor Manufacturing Process

Euclidean Distance Based Feature Selection for Fault Detection Prediction Model in Semiconductor Manufacturing Process Vol.133 (Iformatio Techology ad Computer Sciece 016), pp.85-89 http://dx.doi.org/10.1457/astl.016. Euclidea Distace Based Feature Selectio for Fault Detectio Predictio Model i Semicoductor Maufacturig

More information

BASED ON ITERATIVE ERROR-CORRECTION

BASED ON ITERATIVE ERROR-CORRECTION A COHPARISO OF CRYPTAALYTIC PRICIPLES BASED O ITERATIVE ERROR-CORRECTIO Miodrag J. MihaljeviC ad Jova Dj. GoliC Istitute of Applied Mathematics ad Electroics. Belgrade School of Electrical Egieerig. Uiversity

More information

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only Edited: Yeh-Liag Hsu (998--; recommeded: Yeh-Liag Hsu (--9; last updated: Yeh-Liag Hsu (9--7. Note: This is the course material for ME55 Geometric modelig ad computer graphics, Yua Ze Uiversity. art of

More information

Accuracy Improvement in Camera Calibration

Accuracy Improvement in Camera Calibration Accuracy Improvemet i Camera Calibratio FaJie L Qi Zag ad Reihard Klette CITR, Computer Sciece Departmet The Uiversity of Aucklad Tamaki Campus, Aucklad, New Zealad fli006, qza001@ec.aucklad.ac.z r.klette@aucklad.ac.z

More information

New HSL Distance Based Colour Clustering Algorithm

New HSL Distance Based Colour Clustering Algorithm The 4th Midwest Artificial Itelligece ad Cogitive Scieces Coferece (MAICS 03 pp 85-9 New Albay Idiaa USA April 3-4 03 New HSL Distace Based Colour Clusterig Algorithm Vasile Patrascu Departemet of Iformatics

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpeCourseWare http://ocw.mit.edu 6.854J / 18.415J Advaced Algorithms Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.415/6.854 Advaced Algorithms

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations Applied Mathematical Scieces, Vol. 1, 2007, o. 25, 1203-1215 A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045, Oe

More information

Administrative UNSUPERVISED LEARNING. Unsupervised learning. Supervised learning 11/25/13. Final project. No office hours today

Administrative UNSUPERVISED LEARNING. Unsupervised learning. Supervised learning 11/25/13. Final project. No office hours today Admiistrative Fial project No office hours today UNSUPERVISED LEARNING David Kauchak CS 451 Fall 2013 Supervised learig Usupervised learig label label 1 label 3 model/ predictor label 4 label 5 Supervised

More information

Ones Assignment Method for Solving Traveling Salesman Problem

Ones Assignment Method for Solving Traveling Salesman Problem Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history:

More information

The isoperimetric problem on the hypercube

The isoperimetric problem on the hypercube The isoperimetric problem o the hypercube Prepared by: Steve Butler November 2, 2005 1 The isoperimetric problem We will cosider the -dimesioal hypercube Q Recall that the hypercube Q is a graph whose

More information

Analysis of Server Resource Consumption of Meteorological Satellite Application System Based on Contour Curve

Analysis of Server Resource Consumption of Meteorological Satellite Application System Based on Contour Curve Advaces i Computer, Sigals ad Systems (2018) 2: 19-25 Clausius Scietific Press, Caada Aalysis of Server Resource Cosumptio of Meteorological Satellite Applicatio System Based o Cotour Curve Xiagag Zhao

More information

A Novel Feature Extraction Algorithm for Haar Local Binary Pattern Texture Based on Human Vision System

A Novel Feature Extraction Algorithm for Haar Local Binary Pattern Texture Based on Human Vision System A Novel Feature Extractio Algorithm for Haar Local Biary Patter Texture Based o Huma Visio System Liu Tao 1,* 1 Departmet of Electroic Egieerig Shaaxi Eergy Istitute Xiayag, Shaaxi, Chia Abstract The locality

More information

Investigating methods for improving Bagged k-nn classifiers

Investigating methods for improving Bagged k-nn classifiers Ivestigatig methods for improvig Bagged k-nn classifiers Fuad M. Alkoot Telecommuicatio & Navigatio Istitute, P.A.A.E.T. P.O.Box 4575, Alsalmia, 22046 Kuwait Abstract- We experimet with baggig knn classifiers

More information

( n+1 2 ) , position=(7+1)/2 =4,(median is observation #4) Median=10lb

( n+1 2 ) , position=(7+1)/2 =4,(median is observation #4) Median=10lb Chapter 3 Descriptive Measures Measures of Ceter (Cetral Tedecy) These measures will tell us where is the ceter of our data or where most typical value of a data set lies Mode the value that occurs most

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

Fast algorithm for skew detection. Adnan Amin, Stephen Fischer, Tony Parkinson, and Ricky Shiu

Fast algorithm for skew detection. Adnan Amin, Stephen Fischer, Tony Parkinson, and Ricky Shiu Fast algorithm for skew detectio Ada Ami, Stephe Fischer, Toy Parkiso, ad Ricky Shiu School of Computer Sciece ad Egieerig Uiversity of New South Wales, Sydey NSW, 2052 Australia ABSTRACT Documet image

More information

Chapter 3 Classification of FFT Processor Algorithms

Chapter 3 Classification of FFT Processor Algorithms Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As

More information

Lower Bounds for Sorting

Lower Bounds for Sorting Liear Sortig Topics Covered: Lower Bouds for Sortig Coutig Sort Radix Sort Bucket Sort Lower Bouds for Sortig Compariso vs. o-compariso sortig Decisio tree model Worst case lower boud Compariso Sortig

More information

Parallel Polygon Approximation Algorithm Targeted at Reconfigurable Multi-Ring Hardware

Parallel Polygon Approximation Algorithm Targeted at Reconfigurable Multi-Ring Hardware Parallel Polygo Approximatio Algorithm Targeted at Recofigurable Multi-Rig Hardware M. Arif Wai* ad Hamid R. Arabia** *Califoria State Uiversity Bakersfield, Califoria, USA **Uiversity of Georgia, Georgia,

More information

On (K t e)-saturated Graphs

On (K t e)-saturated Graphs Noame mauscript No. (will be iserted by the editor O (K t e-saturated Graphs Jessica Fuller Roald J. Gould the date of receipt ad acceptace should be iserted later Abstract Give a graph H, we say a graph

More information

Lecture 5. Counting Sort / Radix Sort

Lecture 5. Counting Sort / Radix Sort Lecture 5. Coutig Sort / Radix Sort T. H. Corme, C. E. Leiserso ad R. L. Rivest Itroductio to Algorithms, 3rd Editio, MIT Press, 2009 Sugkyukwa Uiversity Hyuseug Choo choo@skku.edu Copyright 2000-2018

More information

Mobile terminal 3D image reconstruction program development based on Android Lin Qinhua

Mobile terminal 3D image reconstruction program development based on Android Lin Qinhua Iteratioal Coferece o Automatio, Mechaical Cotrol ad Computatioal Egieerig (AMCCE 05) Mobile termial 3D image recostructio program developmet based o Adroid Li Qihua Sichua Iformatio Techology College

More information

condition w i B i S maximum u i

condition w i B i S maximum u i ecture 10 Dyamic Programmig 10.1 Kapsack Problem November 1, 2004 ecturer: Kamal Jai Notes: Tobias Holgers We are give a set of items U = {a 1, a 2,..., a }. Each item has a weight w i Z + ad a utility

More information

Counting the Number of Minimum Roman Dominating Functions of a Graph

Counting the Number of Minimum Roman Dominating Functions of a Graph Coutig the Number of Miimum Roma Domiatig Fuctios of a Graph SHI ZHENG ad KOH KHEE MENG, Natioal Uiversity of Sigapore We provide two algorithms coutig the umber of miimum Roma domiatig fuctios of a graph

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

OCR Statistics 1. Working with data. Section 3: Measures of spread

OCR Statistics 1. Working with data. Section 3: Measures of spread Notes ad Eamples OCR Statistics 1 Workig with data Sectio 3: Measures of spread Just as there are several differet measures of cetral tedec (averages), there are a variet of statistical measures of spread.

More information

Algorithms for Disk Covering Problems with the Most Points

Algorithms for Disk Covering Problems with the Most Points Algorithms for Disk Coverig Problems with the Most Poits Bi Xiao Departmet of Computig Hog Kog Polytechic Uiversity Hug Hom, Kowloo, Hog Kog csbxiao@comp.polyu.edu.hk Qigfeg Zhuge, Yi He, Zili Shao, Edwi

More information

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method A ew Morphological 3D Shape Decompositio: Grayscale Iterframe Iterpolatio Method D.. Vizireau Politehica Uiversity Bucharest, Romaia ae@comm.pub.ro R. M. Udrea Politehica Uiversity Bucharest, Romaia mihea@comm.pub.ro

More information

Lecture 2: Spectra of Graphs

Lecture 2: Spectra of Graphs Spectral Graph Theory ad Applicatios WS 20/202 Lecture 2: Spectra of Graphs Lecturer: Thomas Sauerwald & He Su Our goal is to use the properties of the adjacecy/laplacia matrix of graphs to first uderstad

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

Image based Cats and Possums Identification for Intelligent Trapping Systems

Image based Cats and Possums Identification for Intelligent Trapping Systems Volume 159 No, February 017 Image based Cats ad Possums Idetificatio for Itelliget Trappig Systems T. A. S. Achala Perera School of Egieerig Aucklad Uiversity of Techology New Zealad Joh Collis School

More information

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation Improvemet of the Orthogoal Code Covolutio Capabilities Usig FPGA Implemetatio Naima Kaabouch, Member, IEEE, Apara Dhirde, Member, IEEE, Saleh Faruque, Member, IEEE Departmet of Electrical Egieerig, Uiversity

More information

CSCI 5090/7090- Machine Learning. Spring Mehdi Allahyari Georgia Southern University

CSCI 5090/7090- Machine Learning. Spring Mehdi Allahyari Georgia Southern University CSCI 5090/7090- Machie Learig Sprig 018 Mehdi Allahyari Georgia Souther Uiversity Clusterig (slides borrowed from Tom Mitchell, Maria Floria Balca, Ali Borji, Ke Che) 1 Clusterig, Iformal Goals Goal: Automatically

More information

Our Learning Problem, Again

Our Learning Problem, Again Noparametric Desity Estimatio Matthew Stoe CS 520, Sprig 2000 Lecture 6 Our Learig Problem, Agai Use traiig data to estimate ukow probabilities ad probability desity fuctios So far, we have depeded o describig

More information

Harris Corner Detection Algorithm at Sub-pixel Level and Its Application Yuanfeng Han a, Peijiang Chen b * and Tian Meng c

Harris Corner Detection Algorithm at Sub-pixel Level and Its Application Yuanfeng Han a, Peijiang Chen b * and Tian Meng c Iteratioal Coferece o Computatioal Sciece ad Egieerig (ICCSE 015) Harris Corer Detectio Algorithm at Sub-pixel Level ad Its Applicatio Yuafeg Ha a, Peijiag Che b * ad Tia Meg c School of Automobile, Liyi

More information

Numerical Methods Lecture 6 - Curve Fitting Techniques

Numerical Methods Lecture 6 - Curve Fitting Techniques Numerical Methods Lecture 6 - Curve Fittig Techiques Topics motivatio iterpolatio liear regressio higher order polyomial form expoetial form Curve fittig - motivatio For root fidig, we used a give fuctio

More information

Eigenimages. Digital Image Processing: Bernd Girod, Stanford University -- Eigenimages 1

Eigenimages. Digital Image Processing: Bernd Girod, Stanford University -- Eigenimages 1 Eigeimages Uitary trasforms Karhue-Loève trasform ad eigeimages Sirovich ad Kirby method Eigefaces for geder recogitio Fisher liear discrimat aalysis Fisherimages ad varyig illumiatio Fisherfaces vs. eigefaces

More information

FEATURE BASED RECOGNITION OF TRAFFIC VIDEO STREAMS FOR ONLINE ROUTE TRACING

FEATURE BASED RECOGNITION OF TRAFFIC VIDEO STREAMS FOR ONLINE ROUTE TRACING FEATURE BASED RECOGNITION OF TRAFFIC VIDEO STREAMS FOR ONLINE ROUTE TRACING Christoph Busch, Ralf Dörer, Christia Freytag, Heike Ziegler Frauhofer Istitute for Computer Graphics, Computer Graphics Ceter

More information

Journal of Chemical and Pharmaceutical Research, 2013, 5(12): Research Article

Journal of Chemical and Pharmaceutical Research, 2013, 5(12): Research Article Available olie www.jocpr.com Joural of Chemical ad Pharmaceutical Research, 2013, 5(12):745-749 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 K-meas algorithm i the optimal iitial cetroids based

More information

Neuro Fuzzy Model for Human Face Expression Recognition

Neuro Fuzzy Model for Human Face Expression Recognition IOSR Joural of Computer Egieerig (IOSRJCE) ISSN : 2278-0661 Volume 1, Issue 2 (May-Jue 2012), PP 01-06 Neuro Fuzzy Model for Huma Face Expressio Recogitio Mr. Mayur S. Burage 1, Prof. S. V. Dhopte 2 1

More information

A Note on Least-norm Solution of Global WireWarping

A Note on Least-norm Solution of Global WireWarping A Note o Least-orm Solutio of Global WireWarpig Charlie C. L. Wag Departmet of Mechaical ad Automatio Egieerig The Chiese Uiversity of Hog Kog Shati, N.T., Hog Kog E-mail: cwag@mae.cuhk.edu.hk Abstract

More information

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem Exact Miimum Lower Boud Algorithm for Travelig Salesma Problem Mohamed Eleiche GeoTiba Systems mohamed.eleiche@gmail.com Abstract The miimum-travel-cost algorithm is a dyamic programmig algorithm to compute

More information

Handwriting Stroke Extraction Using a New XYTC Transform

Handwriting Stroke Extraction Using a New XYTC Transform Hadwritig Stroke Etractio Usig a New XYTC Trasform Gilles F. Houle 1, Kateria Bliova 1 ad M. Shridhar 1 Computer Scieces Corporatio Uiversity Michiga-Dearbor Abstract: The fudametal represetatio of hadwritig

More information

CS 683: Advanced Design and Analysis of Algorithms

CS 683: Advanced Design and Analysis of Algorithms CS 683: Advaced Desig ad Aalysis of Algorithms Lecture 6, February 1, 2008 Lecturer: Joh Hopcroft Scribes: Shaomei Wu, Etha Feldma February 7, 2008 1 Threshold for k CNF Satisfiability I the previous lecture,

More information

Cubic Polynomial Curves with a Shape Parameter

Cubic Polynomial Curves with a Shape Parameter roceedigs of the th WSEAS Iteratioal Coferece o Robotics Cotrol ad Maufacturig Techology Hagzhou Chia April -8 00 (pp5-70) Cubic olyomial Curves with a Shape arameter MO GUOLIANG ZHAO YANAN Iformatio ad

More information

Data Analysis. Concepts and Techniques. Chapter 2. Chapter 2: Getting to Know Your Data. Data Objects and Attribute Types

Data Analysis. Concepts and Techniques. Chapter 2. Chapter 2: Getting to Know Your Data. Data Objects and Attribute Types Data Aalysis Cocepts ad Techiques Chapter 2 1 Chapter 2: Gettig to Kow Your Data Data Objects ad Attribute Types Basic Statistical Descriptios of Data Data Visualizatio Measurig Data Similarity ad Dissimilarity

More information

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov Sortig i Liear Time Data Structures ad Algorithms Adrei Bulatov Algorithms Sortig i Liear Time 7-2 Compariso Sorts The oly test that all the algorithms we have cosidered so far is compariso The oly iformatio

More information

Lecture 1: Introduction and Strassen s Algorithm

Lecture 1: Introduction and Strassen s Algorithm 5-750: Graduate Algorithms Jauary 7, 08 Lecture : Itroductio ad Strasse s Algorithm Lecturer: Gary Miller Scribe: Robert Parker Itroductio Machie models I this class, we will primarily use the Radom Access

More information

Cluster Analysis. Andrew Kusiak Intelligent Systems Laboratory

Cluster Analysis. Andrew Kusiak Intelligent Systems Laboratory Cluster Aalysis Adrew Kusiak Itelliget Systems Laboratory 2139 Seamas Ceter The Uiversity of Iowa Iowa City, Iowa 52242-1527 adrew-kusiak@uiowa.edu http://www.icae.uiowa.edu/~akusiak Two geeric modes of

More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #2: Randomized MST and MST Verification January 14, 2015

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #2: Randomized MST and MST Verification January 14, 2015 15-859E: Advaced Algorithms CMU, Sprig 2015 Lecture #2: Radomized MST ad MST Verificatio Jauary 14, 2015 Lecturer: Aupam Gupta Scribe: Yu Zhao 1 Prelimiaries I this lecture we are talkig about two cotets:

More information

Diego Nehab. n A Transformation For Extracting New Descriptors of Shape. n Locus of points equidistant from contour

Diego Nehab. n A Transformation For Extracting New Descriptors of Shape. n Locus of points equidistant from contour Diego Nehab A Trasformatio For Extractig New Descriptors of Shape Locus of poits equidistat from cotour Medial Axis Symmetric Axis Skeleto Shock Graph Shaked 96 1 Shape matchig Aimatio Dimesio reductio

More information

Pruning and Summarizing the Discovered Time Series Association Rules from Mechanical Sensor Data Qing YANG1,a,*, Shao-Yu WANG1,b, Ting-Ting ZHANG2,c

Pruning and Summarizing the Discovered Time Series Association Rules from Mechanical Sensor Data Qing YANG1,a,*, Shao-Yu WANG1,b, Ting-Ting ZHANG2,c Advaces i Egieerig Research (AER), volume 131 3rd Aual Iteratioal Coferece o Electroics, Electrical Egieerig ad Iformatio Sciece (EEEIS 2017) Pruig ad Summarizig the Discovered Time Series Associatio Rules

More information

Lecture 28: Data Link Layer

Lecture 28: Data Link Layer Automatic Repeat Request (ARQ) 2. Go ack N ARQ Although the Stop ad Wait ARQ is very simple, you ca easily show that it has very the low efficiecy. The low efficiecy comes from the fact that the trasmittig

More information

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13 CIS Data Structures ad Algorithms with Java Sprig 08 Stacks ad Queues Moday, February / Tuesday, February Learig Goals Durig this lab, you will: Review stacks ad queues. Lear amortized ruig time aalysis

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

New Results on Energy of Graphs of Small Order

New Results on Energy of Graphs of Small Order Global Joural of Pure ad Applied Mathematics. ISSN 0973-1768 Volume 13, Number 7 (2017), pp. 2837-2848 Research Idia Publicatios http://www.ripublicatio.com New Results o Eergy of Graphs of Small Order

More information

Which movie we can suggest to Anne?

Which movie we can suggest to Anne? ECOLE CENTRALE SUPELEC MASTER DSBI DECISION MODELING TUTORIAL COLLABORATIVE FILTERING AS A MODEL OF GROUP DECISION-MAKING You kow that the low-tech way to get recommedatios for products, movies, or etertaiig

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful

More information

Lecture 18. Optimization in n dimensions

Lecture 18. Optimization in n dimensions Lecture 8 Optimizatio i dimesios Itroductio We ow cosider the problem of miimizig a sigle scalar fuctio of variables, f x, where x=[ x, x,, x ]T. The D case ca be visualized as fidig the lowest poit of

More information

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA Creatig Exact Bezier Represetatios of CST Shapes David D. Marshall Califoria Polytechic State Uiversity, Sa Luis Obispo, CA 93407-035, USA The paper presets a method of expressig CST shapes pioeered by

More information

Computational Geometry

Computational Geometry Computatioal Geometry Chapter 4 Liear programmig Duality Smallest eclosig disk O the Ageda Liear Programmig Slides courtesy of Craig Gotsma 4. 4. Liear Programmig - Example Defie: (amout amout cosumed

More information

Criterion in selecting the clustering algorithm in Radial Basis Functional Link Nets

Criterion in selecting the clustering algorithm in Radial Basis Functional Link Nets WSEAS TRANSACTIONS o SYSTEMS Ag Sau Loog, Og Hog Choo, Low Heg Chi Criterio i selectig the clusterig algorithm i Radial Basis Fuctioal Lik Nets ANG SAU LOONG 1, ONG HONG CHOON 2 & LOW HENG CHIN 3 Departmet

More information

HADOOP: A NEW APPROACH FOR DOCUMENT CLUSTERING

HADOOP: A NEW APPROACH FOR DOCUMENT CLUSTERING Y.K. Patil* Iteratioal Joural of Advaced Research i ISSN: 2278-6244 IT ad Egieerig Impact Factor: 4.54 HADOOP: A NEW APPROACH FOR DOCUMENT CLUSTERING Prof. V.S. Nadedkar** Abstract: Documet clusterig is

More information

CIS 121 Data Structures and Algorithms with Java Spring Stacks, Queues, and Heaps Monday, February 18 / Tuesday, February 19

CIS 121 Data Structures and Algorithms with Java Spring Stacks, Queues, and Heaps Monday, February 18 / Tuesday, February 19 CIS Data Structures ad Algorithms with Java Sprig 09 Stacks, Queues, ad Heaps Moday, February 8 / Tuesday, February 9 Stacks ad Queues Recall the stack ad queue ADTs (abstract data types from lecture.

More information

Learning to Shoot a Goal Lecture 8: Learning Models and Skills

Learning to Shoot a Goal Lecture 8: Learning Models and Skills Learig to Shoot a Goal Lecture 8: Learig Models ad Skills How do we acquire skill at shootig goals? CS 344R/393R: Robotics Bejami Kuipers Learig to Shoot a Goal The robot eeds to shoot the ball i the goal.

More information

On-line cursive letter recognition using sequences of local minima/maxima. Robert Powalka

On-line cursive letter recognition using sequences of local minima/maxima. Robert Powalka O-lie cursive letter recogitio usig sequeces of local miima/maxima Summary Robert Powalka 19 th August 1993 This report presets the desig ad implemetatio of a o-lie cursive letter recogizer usig sequeces

More information

Octahedral Graph Scaling

Octahedral Graph Scaling Octahedral Graph Scalig Peter Russell Jauary 1, 2015 Abstract There is presetly o strog iterpretatio for the otio of -vertex graph scalig. This paper presets a ew defiitio for the term i the cotext of

More information

ENGI 4421 Probability and Statistics Faculty of Engineering and Applied Science Problem Set 1 Descriptive Statistics

ENGI 4421 Probability and Statistics Faculty of Engineering and Applied Science Problem Set 1 Descriptive Statistics ENGI 44 Probability ad Statistics Faculty of Egieerig ad Applied Sciece Problem Set Descriptive Statistics. If, i the set of values {,, 3, 4, 5, 6, 7 } a error causes the value 5 to be replaced by 50,

More information

Fast Fourier Transform (FFT) Algorithms

Fast Fourier Transform (FFT) Algorithms Fast Fourier Trasform FFT Algorithms Relatio to the z-trasform elsewhere, ozero, z x z X x [ ] 2 ~ elsewhere,, ~ e j x X x x π j e z z X X π 2 ~ The DFS X represets evely spaced samples of the z- trasform

More information

IMAGE-BASED MODELING AND RENDERING 1. HISTOGRAM AND GMM. I-Chen Lin, Dept. of CS, National Chiao Tung University

IMAGE-BASED MODELING AND RENDERING 1. HISTOGRAM AND GMM. I-Chen Lin, Dept. of CS, National Chiao Tung University IMAGE-BASED MODELING AND RENDERING. HISTOGRAM AND GMM I-Che Li, Dept. of CS, Natioal Chiao Tug Uiversity Outlie What s the itesity/color histogram? What s the Gaussia Mixture Model (GMM? Their applicatios

More information

4.2.1 Bayesian Principal Component Analysis Weighted K Nearest Neighbor Regularized Expectation Maximization

4.2.1 Bayesian Principal Component Analysis Weighted K Nearest Neighbor Regularized Expectation Maximization 4 DATA PREPROCESSING 4.1 Data Normalizatio 4.1.1 Mi-Max 4.1.2 Z-Score 4.1.3 Decimal Scalig 4.2 Data Imputatio 4.2.1 Bayesia Pricipal Compoet Aalysis 4.2.2 K Nearest Neighbor 4.2.3 Weighted K Nearest Neighbor

More information

Evaluation of Different Fitness Functions for the Evolutionary Testing of an Autonomous Parking System

Evaluation of Different Fitness Functions for the Evolutionary Testing of an Autonomous Parking System Evaluatio of Differet Fitess Fuctios for the Evolutioary Testig of a Autoomous Parkig System Joachim Wegeer 1 ad Oliver Bühler 2 1 DaimlerChrysler AG, Research ad Techology, Alt-Moabit 96 a, D-10559 Berli,

More information