Unsupervised Discretization Using Kernel Density Estimation

Marenglen Biba, Floriana Esposito, Stefano Ferilli, Nicola Di Mauro, Teresa M.A. Basile
Department of Computer Science, University of Bari
Via Orabona 4, 70125 Bari, Italy

Abstract

Discretization, defined as a set of cuts over domains of attributes, represents an important preprocessing task for numeric data analysis. Some Machine Learning algorithms require a discrete feature space, but in real-world applications continuous attributes must be handled. To deal with this problem many supervised discretization methods have been proposed, but little has been done to synthesize unsupervised discretization methods to be used in domains where no class information is available. Furthermore, existing methods such as equal-width or equal-frequency binning are not well-principled, which raises the need for more sophisticated methods for the unsupervised discretization of continuous features. This paper presents a novel unsupervised discretization method that uses non-parametric density estimators to automatically adapt sub-interval dimensions to the data. The proposed algorithm searches for the next two sub-intervals to produce, evaluating the best cut-point on the basis of the density induced in the sub-intervals by the current cut and the density given by a kernel density estimator for each sub-interval. It uses cross-validated log-likelihood to select the maximal number of intervals. The new method is compared to equal-width and equal-frequency discretization methods through experiments on well-known benchmarking data.

1 Introduction

Data format is an important issue in Machine Learning (ML) because different types of data make a relevant difference in learning tasks. While there can be infinitely many values for a continuous attribute, the number of discrete values is often small or finite. When learning, e.g., classification trees/rules, the data type has an important impact on the decision tree induction. As reported in [Dougherty et al., 1995], discretization makes learning more accurate and faster.
In general, the decision trees and rules learned using discrete features are more compact and more accurate than those induced using continuous ones. In addition to the advantages of discrete values over continuous ones, many learning algorithms can only handle discrete attributes, so good discretization methods are a key issue for them, since they can significantly affect the learning outcome.

Discretization methods can be classified along different axes, according to the different directions followed by the implementation of discretization techniques due to different needs: global vs. local, splitting (top-down) vs. merging (bottom-up), direct vs. incremental, and supervised vs. unsupervised.

Local methods, as exemplified by C4.5, discretize in a localized region of the instance space (i.e., a subset of instances). Global methods, on the other hand, use the entire instance space [Chmielevski and Grzymala-Busse, 1994].

Splitting methods start with an empty list of cut-points and, while splitting the intervals in a top-down fashion, progressively produce the cut-points that make up the discretization. Merging methods, on the contrary, start with all the possible cut-points and, at each step of the discretization refinement, eliminate cut-points by merging intervals.

Direct methods divide the initial interval into sub-intervals simultaneously (e.g., equal-width and equal-frequency), so they need the number of intervals to produce as a further input from the user. Incremental methods [Cerquides and Mantaras, 1997] start with a simple discretization step and progressively improve the discretization, hence they need an additional criterion to stop the process.

Supervised discretization considers class information while unsupervised discretization does not. Equal-width and equal-frequency binning are simple techniques that perform unsupervised discretization without exploiting any class information.
In these methods, continuous intervals are split into sub-intervals and it is up to the user to specify the width (range of values to include in a sub-interval) or the frequency (number of instances in each sub-interval). These simple methods may not lead to good results when the continuous values do not follow a uniform distribution. Additionally, since outliers are not handled, they can produce results with low accuracy in the presence of skewed data. Usually, to deal with these problems, class information has been used in supervised methods, but when no such information is available the only option is to exploit unsupervised methods. While there exist many supervised methods in the literature, not much work has been done on synthesizing unsupervised methods. This could be due to the fact that discretization has been commonly associated with the classification task. Therefore, work on unsupervised methods is strongly motivated in those learning tasks where no class information is available. In particular, in many domains learning algorithms deal only with discrete values. Among these learning settings, in many cases no class information can be exploited and unsupervised discretization methods such as simple binning are used.

The work presented in this paper proposes a top-down, global, direct and unsupervised method for discretization. It exploits density estimation methods to select the cut-points during the discretization process. The number of cut-points is computed by cross-validating the log-likelihood. We consider as candidate cut-points those that fall between two instances of the attribute to be discretized. The space of all the possible cut-points to evaluate could grow large for datasets that have continuous attributes with many instances taking different values. For this reason we developed and implemented an efficient algorithm of complexity N log(N), where N is the number of instances.

The paper is organized as follows. In Section 2 we describe non-parametric density estimators, a special case of which is the kernel density estimator. In Section 3 we present the discretization algorithm, while in Section 4 we report experiments carried out on classical datasets of the UCI repository. Section 5 concludes the paper and outlines future work.

2 Non-parametric density estimation

Since data may follow various distributions, it is not always straightforward to construct density functions from some given data. In parametric density estimation an important assumption is made: the available data has a density function that belongs to a known family of distributions, such as the normal (Gaussian) distribution, with its own parameters for mean and variance. A parametric method finds the values of these parameters that best fit the data.
However, data may be complex, and assumptions about the distribution that are forced upon the data may lead to models that do not fit the data well. In these cases, where making assumptions is difficult, non-parametric density functions are preferred.

Simple binning (histograms) is one of the best-known non-parametric density methods. It consists in assigning the same value of the density function f to every instance that falls in the interval [C − h/2, C + h/2), where C is the origin of the bin and h is the binwidth. The value of such a function is defined as follows (the symbol # stands for "number of"):

$$f = \frac{1}{h}\,\#\left\{\text{instances that fall in } \left[C - \tfrac{h}{2},\ C + \tfrac{h}{2}\right)\right\}$$

Once the origin C of a bin is fixed, for every instance that falls in the interval centered in C and of width h, a block of size 1 by the bin width is placed over the interval (Figure 1). Here it is important to note that, if one wants the density value in x, every other point in the same bin contributes equally to the density in x, no matter how close to or far away from x these points are.

Figure 1. Simple binning places a block in every sub-interval for every instance x that falls in it

This is rather restrictive because it does not mirror the data faithfully. In principle, points closer to x should be weighted more than points that are far from it. The first step towards this is to eliminate the dependence on bin origins fixed a priori and to center the bin at every point x. Thus the following pseudo-formula:

$$\frac{1}{\text{binwidth}}\,\#\{\text{instances that fall in a bin containing } x\}$$

should be transformed into the following one:

$$\frac{1}{\text{binwidth}}\,\#\{\text{instances that fall in a bin around } x\}$$

The subtle but important difference in the second formula is that the density is computed not in a bin containing x and depending on the origin C, but in a bin whose center is x itself. Centering the bin on x then makes it possible to assign different weights to the other points in the same bin, in terms of their impact upon the density in x, depending on their distance from x.
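The difference between a bin with a fixed origin and a bin centered on x can be illustrated with a minimal Python sketch. The function names and the sample data are hypothetical, and the counts are additionally divided by the number of instances N so that the values are proper densities; that normalization is an assumption of this sketch and does not appear in the pseudo-formulas above.

```python
import math
from typing import Sequence


def fixed_bin_density(x: float, data: Sequence[float], origin: float, h: float) -> float:
    """Histogram estimate at x using bins [C - h/2, C + h/2) whose centres C
    lie on the fixed grid origin + k*h."""
    k = math.floor((x - origin) / h + 0.5)   # index of the grid bin containing x
    lo = origin + k * h - h / 2
    hi = lo + h
    count = sum(1 for v in data if lo <= v < hi)
    return count / (len(data) * h)           # assumption: normalised by N


def centered_bin_density(x: float, data: Sequence[float], h: float) -> float:
    """Estimate at x using a bin of width h centred on x itself."""
    count = sum(1 for v in data if x - h / 2 <= v < x + h / 2)
    return count / (len(data) * h)           # assumption: normalised by N


sample = [1.0, 1.6, 1.7, 2.3, 2.8, 5.0]
fixed = fixed_bin_density(2.4, sample, origin=0.0, h=1.0)    # bin [1.5, 2.5)
centered = centered_bin_density(2.4, sample, h=1.0)          # bin [1.9, 2.9)
```

With this sample the two estimates at x = 2.4 differ, because the fixed bin also counts the instances 1.6 and 1.7, which lie farther from x than 2.8 does.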
If we consider intervals of half-width h centered on x, then the density function in x is given by the formula:

$$f = \frac{1}{2h}\,\#\{\text{instances that fall in } [x - h,\ x + h]\}$$

In this case, when constructing the density function, a box of width 2h is placed for every point that falls in the interval centered in x. These boxes (the dashed ones in Figure 2) are then added up, yielding the density function of Figure 2. This provides a more accurate view of the density of the data, called the box kernel density estimate. However, the weights of the points that fall in the same bin as x have not been changed yet.

Figure 2. Placing a box for every instance in the interval around x and adding them up.

In order to do this, the kernel density function is introduced:

$$p(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$$

where K is a weighting function. This function provides a smart way of estimating the density in x, by counting the frequency of the other points X_i in the same bin as x and weighting them differently depending on their distance from x. Contributions to the density value in x from the points X_i vary, since those that are closer to x are weighted more than points that are further away. This property is fulfilled by many functions, which are called kernel functions. A kernel function K is usually a probability density function that integrates to 1 and takes positive values in its domain. What matters for the density estimation is not the kernel function itself (Gaussian, Epanechnikov or quadratic could be used) but the bandwidth selection [Silverman 1986]. We will motivate our choice of bandwidth (the value h in the case of kernel functions) in the next section, where we introduce the problem of cutting intervals based on the density induced by the cut and the density given by the above kernel density estimation.

3 Where and what to cut

The aim of discretization is always to produce sub-intervals whose induced density over the instances best fits the available data. The first problem to be solved is where to cut. While most supervised top-down discretization methods cut exactly at the points in the main interval that represent instances of the data, we decided to cut at the middle points between instance values. The advantage is that this cutting strategy avoids the need to decide whether the point at which the cut is performed is to be included in the left or in the right sub-interval.

The second question is which (sub-)interval should be cut/split next among those produced at a given step of the discretization process.
Such a choice must be driven by the objective of capturing the significant changes of density in different separated bins. Our proposal is to evaluate all the possible cut-points in all the sub-intervals, assigning to each of them a score according to a method whose meaning is as follows. Given a single interval to split, any of its cut-points produces two bins and thus induces upon the initial interval two densities, computed using the simple binning density estimation formula. Such a formula, as shown in the previous section, assigns the same density value of the function f to every instance in the bin and ignores the distance from x of the other instances of the bin when computing the density in x. Every sub-interval produced has an averaged binned density (the binned density in each point) that is different from the density estimated with the kernel function. The smaller this difference is, the better the sub-interval fits the data, i.e. the better this binning is, and hence there is no reason to split it.

On the contrary, the idea underlying our discretization algorithm is that, when splitting, one must search for the next two worst sub-intervals to produce, where "worst" means that the density shown by each of the sub-intervals is very different from what it would be if the distances among points in the intervals and a weighting function were considered. The identified worst sub-intervals are exactly those to be split to produce other intervals, because they do not fit the data well. In this way intervals whose density differs much from the real data situation are eliminated and replaced by other sub-intervals. In order to achieve the density computed by the kernel density function we should reproduce a splitting of the main interval such as that in Figure 2.

An obvious question arises: when should a given sub-interval no longer be cut? Indeed, when searching for the worst sub-intervals, there are always good candidates to be split. This is true, but on the other hand at each step the algorithm can split only one sub-interval into two.
Thus, if there is more than one sub-interval to be split (this is the case after the first split), the scoring function of the cut-points allows us to choose the sub-interval to split.

3.1 The scoring function for the cut-points

At each step of the discretization process, we must choose among different sub-intervals to split. In every sub-interval we identify as candidate cut-points all the middle points between the instances. For each candidate cut-point T we compute a score as follows:

$$\mathrm{Score}(T) = \sum_{i=1}^{k}\bigl(p(x_i) - f(x_i)\bigr) + \sum_{i=k+1}^{n}\bigl(p(x_i) - f(x_i)\bigr)$$

where i = 1, ..., k refers to the instances that fall into the left sub-interval and i = k+1, ..., n to the instances that fall into the right bin. The density functions p and f are respectively the kernel density function and the simple binning density function. These functions are computed as follows:

$$f(x_i) = \frac{m}{wN}$$

where m is the number of instances that fall in the (left or right) bin, w is the binwidth and N is the number of instances in the interval that is being split. The kernel density estimator is given by the formula:

$$p(x_i) = \frac{1}{hN}\sum_{j=1}^{N} K\!\left(\frac{x_i - X_j}{h}\right)$$

where h is the bandwidth and K is a kernel function.

In this framework for discretization, it still remains to be clarified how the bandwidth of the kernel density estimator is chosen. Although there are several ways to do it, as reported in [Silverman 1986], in this context we are not interested in the density computed by a classic kernel density estimator that considers globally the entire set of available instances. The classic kernel density estimator takes N as the total number of instances in the initial interval and chooses h as the smoothing parameter. The choice of h is not easy, and various techniques have been investigated to find an optimal h. Our proposal, in this context, is to adapt the classic kernel density estimator by taking h equal to the binwidth w. Indeed, as can be seen from the formula of p(x_i), instances that are more distant than h from x_i contribute with weight equal to zero to the density of x_i. Hence, if a sub-interval (bin) under consideration has binwidth h, only the instances that fall in it will contribute, depending on their distance from x_i, to the density in x_i. As we are interested in knowing how the current binned density (induced by the candidate cut-point and computed by f with binwidth w) differs from the density in the same bin computed by weighting the contributions of X_j to the density in x_i on the basis of the distance x_i − X_j, it is useless to consider, for the function p, a bandwidth greater than w.
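The scoring step described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the score is read as the summed difference p(x_i) − f(x_i) over both bins, the kernel is an Epanechnikov kernel (chosen because it gives zero weight beyond distance h), the kernel sum runs only over the instances of the same bin, and all function names are hypothetical.

```python
from typing import List, Sequence


def epanechnikov(u: float) -> float:
    """Kernel with support [-1, 1]: instances farther than h get weight zero."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def candidate_cuts(values: Sequence[float]) -> List[float]:
    """Middle points between adjacent distinct instance values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def bin_score(points: Sequence[float], width: float, n_total: int) -> float:
    """Sum over the bin's points of p(x_i) - f(x_i), with bandwidth h = width."""
    m = len(points)
    if m == 0 or width <= 0:
        return 0.0
    f = m / (width * n_total)  # simple binning density, identical for every point
    total = 0.0
    for xi in points:
        # Assumption: only instances of the same bin enter the kernel sum.
        p = sum(epanechnikov((xi - xj) / width) for xj in points) / (width * n_total)
        total += p - f
    return total


def score(values: Sequence[float], lower: float, upper: float, cut: float) -> float:
    """Score of splitting the interval [lower, upper] at `cut`."""
    left = [v for v in values if v < cut]
    right = [v for v in values if v >= cut]
    n = len(values)  # N: instances in the interval being split
    return bin_score(left, cut - lower, n) + bin_score(right, upper - cut, n)


cuts = candidate_cuts([10.0, 15.0, 20.0, 25.0, 30.0])
example = score([0.0, 1.0], lower=0.0, upper=1.0, cut=0.5)
```

If the intended score uses the magnitude of the difference rather than its sign, `p - f` can be replaced by `abs(p - f)` without changing the rest of the sketch.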
3.2 The discretization algorithm

Once a scoring function has been synthesized, we explain how the discretization algorithm works. Figure 3 shows the algorithm in pseudo language. It starts with an empty list of cut-points (which can be implemented as a priority queue in order to keep the cut-points ordered, at each step, by their value according to the scoring function) and another priority queue that contains the sub-intervals generated thus far.

    Discretize(Interval)
    Begin
      PotentialCutpoints = ComputeCutPoints(Interval);
      PriorityQueueIntervals.Add(Interval);
      While stopping criterion is not met do
        If PriorityQueueCPs is empty
          Foreach cutpoint CP in PotentialCutpoints do
            scoreCP = ComputeScoringFunction(CP, Interval);
            PriorityQueueCPs.Add(CP, scoreCP);
          End for
        Else
          BestCP = PriorityQueueCPs.GetBest();
          CurrentInterval = PriorityQueueIntervals.GetBest();
          NewIntervals = Split(CurrentInterval, BestCP);
          LeftInterval = NewIntervals.GetLeftInterval();
          RightInterval = NewIntervals.GetRightInterval();
          PotentialLeftCPs = ComputeCutPoints(LeftInterval);
          PotentialRightCPs = ComputeCutPoints(RightInterval);
          Foreach cutpoint CP in PotentialLeftCPs
            scoreCP = ComputeScoringFunction(CP, LeftInterval);
            PriorityQueueCPs.Add(CP, scoreCP);
            PriorityQueueIntervals.Add(LeftInterval, scoreCP);
          End for
          // the same foreach cycle for PotentialRightCPs
      End while
    End

Figure 3. The discretization algorithm in pseudo language

Let us see it through an example. Suppose the initial interval to be discretized is the one in Figure 4 (frequencies of the instances are not shown).

Figure 4. The first cut

The candidate cut-points are placed in the middle of adjacent instances: 12.5, 17.5, 22.5, 27.5; the sub-intervals produced by cut-point 12.5 are [10, 12.5] and [12.5, 30], and similarly for all the other cut-points. Now suppose that, computing the scoring function for each cut-point, the greatest value (indicating the cut-point that produces the next two worst sub-intervals) is reached by the cut-point 17.5.
Then the sub-intervals are [10, 17.5] and [17.5, 30], and the list of candidate cut-points becomes <12.5, 16.25, 18.75, 22.5, 27.5>. Suppose the scoring function evaluates as follows: Score(12.5) = 40, Score(16.25) = 22, Score(18.75) = [illegible], Score(22.5) = 51, Score(27.5) = 28. The algorithm selects 22.5 as the best cut-point and splits the corresponding interval as shown in Figure 5.

Figure 5. The second cut

This second cut produces two new sub-intervals, and hence the current discretization is made up of three sub-intervals: [10, 17.5], [17.5, 22.5], [22.5, 30], with candidate cut-points <12.5, 18.75, 23.75, 27.5>. Suppose the values of the scoring function are as follows: Score(12.5) = 40, Score(18.75) = 20, Score(23.75) = 35, Score(27.5) = 48. The best cut-point, 27.5, suggests the third cut and the discretization becomes [10, 17.5], [17.5, 22.5], [22.5, 27.5], [27.5, 30]. Thus, the algorithm refines those sub-intervals that show the worst fit to the data.

One remark is in order: in some cases a split may be performed even if one of the two sub-intervals it produces (either the left or the right one) shows such a good fit, compared to the other sub-intervals, that it is never split afterwards. This is not strange, since the scoring function evaluates the overall fit of the two sub-intervals. This is the case of the first cut in the present example: the cut-point 17.5 has been chosen, where the left sub-interval [10, 17.5] shows a good fit to the data in terms of density while the right one [17.5, 30] shows a bad fit. In this case the interval [10, 17.5] will not be cut before the interval [17.5, 30] and may remain untouched until the end of the discretization algorithm. The algorithm stops cutting when the stopping criterion (the maximal number of cut-points, computed by the procedure explained in the next section) is met.

3.3 Stopping criteria and complexity

The definition of a stopping criterion is fundamental to prevent the algorithm from continuing to cut until each bin contains a single instance. Even without reaching such an extreme situation, the risk of overfitting the model is real because, as usual in the literature, we use log-likelihood to evaluate the density estimators, i.e. the simple binning and the kernel density estimate.
As a solution, instead of requiring a specific number of intervals (which could be too rigid and not based on valid assumptions), we propose the use of cross-validation to provide an unbiased estimate of how well the model fits the real distribution. For the experiments performed, 10-fold cross-validation was used. For each fold the algorithm computes the stopping criterion as follows. Supposing there are N candidate cut-points, the cross-validated log-likelihood is computed for each of them. In order to optimize performance, at each step a structure maintains the sub-intervals in the current discretization and the corresponding splitting values, so that only the new values for the interval to be split have to be computed at each step. Thus the algorithm that computes the log-likelihood for the N cut-points is performed 10 times overall. The number of cut-points that shows the maximum value of the averaged log-likelihood on the test folds is chosen as the best. The log-likelihood on the test data is given by the following formula:

$$\text{Log-likelihood} = \sum_{j} n_{j\text{-test}}\,\log\frac{n_{j\text{-train}}}{w_j\,N}$$

where n_{j-train} is the number of training instances in bin j, n_{j-test} is the number of test instances that fall in bin j, N is the total number of instances and w_j is the width of bin j.

As regards the complexity of the kernel density estimator, it can be deduced from the formula of p that evaluating the kernel density in N points costs O(N^2). For univariate data, the complexity problem has been solved by the algorithms proposed in [Greengard and Strain, 1991] and [Yang et al 2003], which compute the kernel density estimate in O(N+N) instead of O(N^2). In our context we deal only with univariate data, because only single continuous attributes have to be processed, and thus for N instances the theoretical complexity of the algorithm is O(N log N).
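The held-out log-likelihood above can be sketched in Python for a single train/test split. The function names are hypothetical, N is taken to be the number of training instances (an assumption; the text says only "total number of instances"), test instances outside the outer edges are ignored, and the selection shown here uses one split rather than the average over the 10 folds.

```python
import math
from bisect import bisect_right
from typing import List, Sequence, Tuple


def bin_counts(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count values per bin. Bin j is [edges[j], edges[j+1]); the last bin
    also includes its right edge. Values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            j = min(bisect_right(edges, v) - 1, len(counts) - 1)
            counts[j] += 1
    return counts


def held_out_log_likelihood(train: Sequence[float], test: Sequence[float],
                            edges: Sequence[float]) -> float:
    """Sum over bins of n_test * log(n_train / (w * N))."""
    n_train = bin_counts(train, edges)
    n_test = bin_counts(test, edges)
    total = len(train)  # assumption: N is the number of training instances
    ll = 0.0
    for j, (tr, te) in enumerate(zip(n_train, n_test)):
        if te == 0:
            continue                 # bins without test instances add nothing
        if tr == 0:
            return float("-inf")     # test data where the estimated density is zero
        width = edges[j + 1] - edges[j]
        ll += te * math.log(tr / (width * total))
    return ll


def best_number_of_cuts(train: Sequence[float], test: Sequence[float],
                        ordered_cuts: Sequence[float],
                        lower: float, upper: float) -> Tuple[int, float]:
    """Try the first k cut-points (in the order they were produced) for every k
    and return the k with the highest held-out log-likelihood."""
    best_k, best_ll = 0, float("-inf")
    for k in range(1, len(ordered_cuts) + 1):
        edges = [lower] + sorted(ordered_cuts[:k]) + [upper]
        ll = held_out_log_likelihood(train, test, edges)
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k, best_ll


train = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0]
test = [2.0, 8.0]
k, ll = best_number_of_cuts(train, test, [5.0, 7.5], lower=0.0, upper=10.0)
```

In this toy split the second cut at 7.5 leaves a bin with no training instances but one test instance, so the log-likelihood drops to minus infinity and a single cut is preferred; this is the overfitting effect that the cross-validated criterion is meant to detect.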
4 Experiments

In order to assess the validity and performance of the proposed discretization algorithm, we have performed experiments on several datasets taken from the UCI repository and classically used in the literature to evaluate discretization algorithms. Specifically, the datasets used are: autos, bupa, wine, ionosphere, ecoli, sonar, glass, heart, hepatitis, arrhythmia, anneal, cylinder, and auto-mpg. These datasets contain a large set of numeric attributes of various types, from which 200 continuous attributes were extracted at random and used to test the discretization algorithm.

In order to evaluate the discretization carried out by the proposed algorithm with respect to other algorithms in the literature, we compared it to three other methods: equal-width with a fixed number of bins (we use 10 for the experiments), equal-frequency with a fixed number of bins (we use 10 for the experiments), and equal-width with a cross-validated number of bins. The comparison was made on the log-likelihood on the test data using a 10-fold cross-validation methodology. The results on the test folds were compared through a paired t-test on the cross-validated log-likelihood. Table 1 presents the results of the t-test based on cross-validated log-likelihood with a risk level α = [value missing in source]. It shows the number of continuous attributes whose discretization through our method was significantly better than, equal to, or significantly worse than that of the other methods.

                                Our method significantly   Equal   Our method significantly
                                more accurate                      less accurate
    EqualWidth 10 bins          71 (35.5%)                 126     3 (1.5%)
    EqualFreq 10 bins           79 (39.5%)                 119     2 (1.0%)
    EqualWidth Cross-Validated  54 (27.0%)                 136     10 (5.0%)

Table 1. Results of paired t-test based on cross-validated log-likelihood on 10 folds.

It is clear that, even if in the majority of cases the new algorithm shows no difference in performance with respect to the others, there is a remarkable percentage of cases (at least 27%) in which it behaves better, while the opposite holds only in very rare cases (at most 5%).

Among the datasets there are many cases of continuous attributes whose range of values contains many occurrences of the same value. This characteristic had an impact on the results of the equal-frequency method which often, in such cases, was not able to produce a valid model that could fit the data. This is natural, since this method creates the bins based on the number of instances that fall in them. For example, if the total number of instances is 200 and the bins to generate are 10, then the number of instances that must fall in a bin is 20. Thus, if among the instances there is a value that has 30 occurrences, the equal-frequency method is not able to build a good model, because it cannot compute the density of the bin that contains only the occurrences of that single value. This would be even more problematic in the case of cross-validation, which is the reason why no comparison with an Equal Frequency Cross-Validation method was carried out.

An important note can be made concerning (very) discontinuous data, on which our method performs better than the others. This is due to the ability of the proposed algorithm to catch the changes in density in separated bins. Thus very high densities in the intervals (for example a large number of instances in a small region) are isolated in bins different from those which host low densities. Although it is not straightforward to handle very discontinuous distributions, the method we have proposed achieves good results when trying to produce bins that can fit these kinds of distributions.

5 Conclusions and future work

Discretization represents an important preprocessing task for numeric data analysis. So far many supervised discretization methods have been proposed, but little has been done to synthesize unsupervised methods.
This paper presents a novel unsupervised discretization method that exploits a kernel density estimator for choosing the intervals to be split and cross-validated log-likelihood to select the maximal number of intervals. The new method is compared to equal-width and equal-frequency discretization methods through experiments on well-known benchmarking data. Preliminary results are promising and show that kernel density estimation methods are a good basis for developing sophisticated discretization methods. Further work and experiments are needed to fine-tune the discretization method to deal with those cases where the other methods show better accuracy.

As a future application we plan to use the proposed discretization algorithm in a learning task that requires discretization and where class information is not always available. One such context could be Inductive Logic Programming, where objects whose class is not known are often described by continuous attributes. This investigation will aim at assessing the quality of the learning task and how it is affected by the discretization of the continuous attributes.

References

[Cerquides and Mantaras, 1997] Cerquides, J. and Mantaras, R.L. Proposal and empirical comparison of a parallelizable distance-based discretization method. In KDD97, Third International Conference on Knowledge Discovery and Data Mining, 1997.

[Chmielevski and Grzymala-Busse, 1994] Chmielevski, M.R. and Grzymala-Busse, J.W. Global discretization of continuous attributes as preprocessing for machine learning. In Third International Workshop on Rough Sets and Soft Computing, 1994.

[Dougherty et al., 1995] Dougherty, J., Kohavi, R., and Sahami, M. Supervised and unsupervised discretization of continuous features. In Proc. Twelfth International Conference on Machine Learning, Los Altos, CA: Morgan Kaufmann, 1995.

[Greengard and Strain, 1991] Greengard, L. and Strain, J. The fast Gauss transform. SIAM Journal of Scientific and Statistical Computing, 12, 1991.

[Silverman 1986] Silverman, B.W. Density estimation for statistics and data analysis. Chapman and Hall, London, 1986.

[Yang et al 2003] Yang, C., Duraiswami, R., and Gumerov, N. Improved fast Gauss transform. Tech. Rep. CS-TR-4495, Dept. of Computer Science, University of Maryland, College Park, 2003.
