UNIVERSITY OF JOENSUU COMPUTER SCIENCE DISSERTATIONS 11. Mantao Xu. K-means Based Clustering and Context Quantization ACADEMIC DISSERTATION


UNIVERSITY OF JOENSUU COMPUTER SCIENCE DISSERTATIONS 11

Mantao Xu

K-means Based Clustering and Context Quantization

ACADEMIC DISSERTATION

To be presented, with the permission of the Faculty of Science of the University of Joensuu, for public criticism in the Louhela Auditorium of the Science Park, Länsikatu 15, Joensuu, on June 8th, 2005, at 12 noon.

UNIVERSITY OF JOENSUU 2005

Supervisor: Professor Pasi Fränti, Department of Computer Science, University of Joensuu, Joensuu, Finland

Reviewers: Professor Laurence S. Dooley, Faculty of Information Technology, Monash University, Victoria, Australia; Professor Aleš Leonardis, Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia

Opponent: Professor Samuel Kaski, Department of Computer Science, University of Helsinki, Helsinki, Finland

ISBN
ISSN

Computing Reviews (1998) Classification: E.4, H.3.3, I.2.8, I.4.2, I.4.6, I.5.1, I.5.2
General Terms: Algorithm

Yliopistopaino, Joensuu 2005

K-means Based Clustering and Context Quantization

Mantao Xu
Department of Computer Science
University of Joensuu
P.O. Box 111, FIN-80101 Joensuu, FINLAND

University of Joensuu, Computer Science, Dissertations 11
Joensuu, 2005, 62 pages
ISBN, ISSN

Abstract

In this thesis, we study the problems of K-means clustering and context quantization. The main task of K-means clustering is to partition the training patterns into k distinct groups, or clusters, that minimize the mean-square-error (MSE) objective function. The main difficulty of conventional K-means clustering is that its classification performance is highly susceptible to the initial solution, or codebook. Hence, the main goal of this research is to investigate effective K-means clustering algorithms that overcome this difficulty. A further task addressed by this thesis is the design of a feasible context quantizer for circumventing the so-called context dilution problem, which is a specific form of the K-means clustering problem.

Publication P1 presents a genetic algorithm that tackles the k-center clustering problem by using randomized swapping to exchange one reference vector between the parent solutions in the crossover, and by then applying a local repartition clustering procedure. The algorithm estimates the number of clusters automatically while optimizing the location of the clusters. It has been shown that the algorithm outperforms the other K-means algorithms reviewed in the publication.

In publication P2, a heuristic dissimilarity function, the SC-distance, is proposed for K-means clustering to minimize the stochastic complexity in taxonomy problems. When incorporated into K-means clustering, this dissimilarity yields better classification performance than the traditional L2 distance. The design scheme of the SC-distance is then extended to the minimization of the MSE distortion in publication P3, which introduces another dissimilarity function, the Delta-MSE dissimilarity. The Delta-MSE dissimilarity is defined as the change of within-class variance incurred by moving a given data sample from one cluster to another.

In publications P4 and P5, we focus on a suboptimal scheme for estimating the initial solution for K-means clustering. The initial solutions are selected by performing dynamic programming in a one-dimensional feature subspace, which can be constructed either by kernel principal component analysis or by Fisher discriminant analysis. Instead of using the traditional L2 distance, the suboptimal K-means algorithms incorporate the Delta-MSE dissimilarity into the clustering procedure. Experimental results show that the proposed algorithms outperform the other K-means variants considered in the publications.

The use of high-order fixed context templates in lossless image compression often leads to a severe context dilution problem. A common technique for tackling this problem is context quantization, a special form of vector quantization in the context space. Context quantization aims at grouping the context vectors such that the conditional entropy, or the Kullback-Leibler (KL) distance, is minimized. Publication P6 examines and reviews the application and performance of context quantization in lossless image compression. In this publication, context quantization is implemented mainly by using the generalized Lloyd algorithm according to the Kullback-Leibler distance.

Even if context quantization is an effective means of approaching the desirable conditional entropy code length, its practical application poses the great difficulty that the resulting quantizer mapping function in the context space is very complex. This limitation has motivated us to investigate, in publication P7, a feasible design scheme that simplifies context quantizers by using a state-of-the-art nonlinear classifier, the kernel Fisher discriminant (KFD). The kernel Fisher discriminant makes the context quantizer cells more separable when they are projected onto the discriminant curve. The design scheme succeeds in approaching the optimal minimum conditional entropy context quantizer designed in the probability simplex space, but with a practical simplification of the quantizer mapping function.

Keywords: K-means clustering, vector quantization, dynamic programming, data reduction, statistical pattern recognition, kernel-based methods, context quantization, minimum conditional entropy

Acknowledgements

The work presented in this dissertation was conducted at the Department of Computer Science, University of Joensuu, Finland. I would like to express my utmost gratitude to my thesis supervisor, Professor Pasi Fränti, for his endless support, guidance and encouragement throughout the research process. I also appreciate the cooperative work of Professor Xiaolin Wu, who contributed an innovative idea and gave me constructive suggestions about this research work. I also owe my thanks to the co-author of some parts of this research work, Mr. Ismo Kärkkäinen. In particular, I would like to thank the East Finland Graduate School in Computer Science and Engineering for its financial support during the winter semester. Finally, I would like to express my gratitude to my wife Lingling, who has given me constant support throughout my Ph.D. studies.

Joensuu, June 2005
Mantao Xu

Abbreviations and symbols

Abbreviations

BIC     Bayesian information criterion
BIRCH   balanced iterative reducing and clustering using hierarchies algorithm
CALIC   context-based, adaptive, lossless image codec
CL      Shannon code-length distance
DBI     Davies-Bouldin index
DPCM    differential pulse code modulation
EM      expectation-maximization algorithm
FDA     Fisher discriminant analysis
GLA     generalized Lloyd algorithm
GMM     Gaussian mixture model
GRASP   growing, reordering and selection by pruning algorithm
HMSE    heuristic mean square error
JBIG    joint bi-level image experts group
KFD     kernel Fisher discriminant
KL      Kullback-Leibler distance
LFD     linear Fisher discriminant
MBAS    modified basic sequential clustering algorithmic scheme
MCECQ   minimum conditional entropy context quantizer
MDL     minimum description length
MSE     mean square error
MST     minimum spanning tree
PCA     principal component analysis
pdfs    probability density functions
PNN     pairwise-nearest-neighborhood algorithm

SC      stochastic complexity
SOM     self-organizing map
VQ      vector quantization

Symbols

A_m, B_m   context cells or context clusters in the context space
A, B       context cells or clusters in the probability simplex space
M_B^Φ      N × N matrix derived in the kernel space F, corresponding to S_B
M_W^Φ      N × N matrix derived in the kernel space F, corresponding to S_W
c_i        the i-th cluster centroid; in Chapter 5, it represents a raw context vector representation of context C
C          cluster centroids, or the codebook C = {c_i | i = 1, …, k}
d          dimension of the data vectors; in Chapter 5, a general term representing a context
F          high-dimensional kernel space
G          a set of clusters, or a clustering
k          the number of clusters
K          kernel matrix
n_j        the sample size of cluster G_j
M          the number of classes (in most cases, equivalent to k)
N          the number of data vectors
q          a scalar or vector quantizer
Q          context quantizer
Q_N^k      the set of k-level quantizers given N data samples
u_j(x_i)   fuzzy cluster membership function for data vector x_i
x_i        data vector
x̄          mean vector of a set of data vectors X

S_B        between-class covariance matrix
S_W        within-class covariance matrix
S_X        total covariance matrix of X
S_B^Φ      between-class covariance matrix in the kernel space F
S_W^Φ      within-class covariance matrix in the kernel space F
v          Fisher discriminant or principal component
X          a set of N data vectors X = {x_1, x_2, …, x_N}
X_t        coding sample at the current time t
X^{t-1}    prefix or context of the current coding sample X_t
α          coefficients of the kernel expansion of a vector y in the kernel space F
λ          eigenvalue of the covariance matrix Σ
π_i        cluster indicator for data vector x_i
σ_i        standard deviation of cluster G_i
Π          partition: the cluster labels of X, where Π = {π_i | i = 1, …, N}
Σ          covariance matrix of the input data X

List of original publications

P1. P. Fränti and M. Xu, "Genetic local repartition for solving dynamic clustering problems", Int. Conf. on Signal Processing (ICSP'02), Beijing, China, vol. 2, 28-33, August 2002.

P2. P. Fränti, M. Xu and I. Kärkkäinen, "Classification of binary vectors by using DeltaSC-distance to minimize stochastic complexity", Pattern Recognition Letters, 24 (1-3), 65-73, January 2003.

P3. M. Xu, "Delta-MSE dissimilarity in GLA-based vector quantization", IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'04), Montreal, Canada, vol. 5, 83-86, May 2004.

P4. M. Xu and P. Fränti, "Iterative K-means algorithm based on Fisher discriminant", Int. Conf. on Information Fusion (Fusion'04), Stockholm, Sweden, vol. 1, 70-73, June 2004.

P5. M. Xu and P. Fränti, "A heuristic K-means clustering algorithm by kernel PCA", IEEE Int. Conf. on Image Processing (ICIP'04), Singapore, vol. 3, October 2004.

P6. M. Xu and P. Fränti, "Context clustering in lossless compression of gray-scale images", Scandinavian Conf. on Image Analysis (SCIA'03), Göteborg, Sweden, Lecture Notes in Computer Science, vol. 2749, June 2003.

P7. M. Xu, X. Wu and P. Fränti, "Context quantization by kernel Fisher discriminant", IEEE Trans. on Image Processing, accepted for publication.

Table of Contents

1. Introduction
   1.1 Task of clustering
   1.2 Applications of cluster analysis
   1.3 Motivation
   1.4 Context quantization
   1.5 Structure of the thesis
2. Clustering
   2.1 Problem formulation
   2.2 Review of clustering algorithms
   2.3 Dissimilarity function
   2.4 Objective function
   2.5 The number of clusters
3. Data reduction
   3.1 Principal component analysis
   3.2 Kernel PCA
   3.3 Fisher discriminant analysis
   3.4 Kernel Fisher discriminant analysis
4. Implementation of K-means clustering
   4.1 Related work and background
   4.2 Selection of initial solution
   4.3 Dynamic programming
   4.4 Dissimilarity revisited

5. Context quantization
   5.1 Motivation
   5.2 Problem formulation
   5.3 Predictive coding
   5.4 Context modeling
   5.5 Context clustering
   5.6 Implementation of the quantizer mapping function
6. Summary of the publications
7. Conclusions
References

1. Introduction

Clustering techniques have been applied to a wide variety of research problems. In the field of medicine, categorizing cures for diseases, or symptoms of diseases, leads to various types of taxonomic problems that can be solved by clustering algorithms. For example, a healthcare center database usually records thousands of patients' basic health information. This huge amount of information can be categorized and retrieved through clustering algorithms. For this purpose, a clustering procedure might be applied to group the patients in such a way that patients with similar diseases are assigned to the same cluster. In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analysis techniques. In marketing research, cluster analysis is commonly used in market segmentation and in determining target markets. In computer science, many practical problems that arise in pattern recognition and image analysis can be formulated as clustering tasks.

The main task of clustering is to classify a set of data patterns into distinct groups, or clusters, such that the data patterns in each cluster are similar. The process of cluster analysis embraces a number of different algorithms and methods for grouping unlabelled objects of a similar kind into their respective categories. Another function of clustering is to reveal the organization of patterns into sensible groups. In a broader sense, cluster analysis allows for the discovery of similarities and dissimilarities hidden within data patterns, as well as for drawing conclusions from those patterns. Thus, cluster analysis has attracted considerable interest in a variety of contexts, owing to its extensive applications and its simplicity of implementation.

1.1 Task of clustering

Clustering is commonly used as an exploratory tool in data analysis. For instance, it can be applied as a data reduction technique or as the preprocessing stage of many other machine learning tasks.

In order to fulfill a practical clustering task, one must take into account the following essential procedures:

Feature selection: The selected features should minimize the information redundancy in the representative raw patterns. The main goal of feature selection is to encode the information that resides in the patterns as compactly as possible, in order to simplify the clustering task.

Dissimilarity function: The dissimilarity function is an important issue in cluster analysis; it quantifies how similar or dissimilar a given feature vector is to another feature vector or to its corresponding cluster representative. In cluster analysis, different dissimilarity functions can be used to measure the distinction between feature vectors and cluster representatives.

Objective function: The clustering criterion can be expressed in the form of a cost function, by summing over all the dissimilarities between the feature vectors and their representative clusters, or over the dissimilarities amongst all possible clusters.

Clustering procedure: The most crucial part of cluster analysis is to design an algorithm that seeks the cluster representatives from the training data and groups the training data in terms of the underlying dissimilarity function.

Validation of the results: The correctness of the clusters produced by a clustering procedure has to be validated by a series of appropriate tests on the models obtained. In general, this can be implemented by conducting the validation procedure on a test dataset.

Interpretation of the output: In practice, expert domain knowledge and other empirical evidence can be incorporated into the validation of the clustering results in order to interpret the ultimate output of the cluster analysis.

1.2 Applications of cluster analysis

Clusters are pervasive in the natural and social environments of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In a supermarket, items of a similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations. Clustering can hence be formulated as a common problem that exists in many application fields of those natural and social environments. For instance, biologists must categorize different species of animals before a meaningful description of the differences between animals is possible. In bioinformatics, cluster analysis has played a crucial part in analyzing time-course gene expression data [WZK04] and gene expression profiles [YK03].

In computer science, cluster analysis usually serves as a powerful tool for data reduction. For example, once a clustering algorithm has classified the data patterns into sensible groups, one can treat each cluster as a single entity in the remaining data processing stages. Another example is that cluster analysis plays an increasingly important role in speech recognition, since investigating every piece of speech information, or collecting a large-scale labeled training set, is intractably expensive. The point of departure for overcoming this obstacle is to apply clustering techniques, such as vector quantization [HC95] or the Gaussian mixture model (GMM) [RB93, Jel99], to estimate the likelihood of emitting each observation given the speech state in a hidden Markov model.

In the context of statistics, cluster analysis has been applied to statistical hypothesis testing and generation in revealing the nature of data patterns. In this sense, cluster analysis seeks to convey the more informative distinctions amongst various groups, rather than to make a statistical null hypothesis test of differences. For example, the results of an uncertainty analysis in the risk assessment of food can be summarized by a cluster analysis of food categories [FDAR03].

One important application of cluster analysis is to provide clustering algorithms for data mining. For example, clustering algorithms have been used intensively to investigate the behavior of web users by analyzing the different websites they visit.

The classification of web-user behaviors has given rise to an improvement in customerisation for commercial companies. It should be pointed out that, unlike many other statistical procedures, cluster analysis methods are widely utilized when no a priori hypothesis is applicable. In principle, cluster analysis seeks the most significant possible solution.

1.3 Motivation

A popular category of clustering algorithms dealing with k-center clustering problems is the so-called K-means clustering [For65, Mac67, BH67, DH73]. The conventional K-means algorithm classifies data by minimizing the MSE objective function. However, the clustering procedure assumes that the data are presented with a known number of clusters, which in turn allows for a simple implementation of K-means clustering. Since the K-means algorithm is a kind of gradient descent method, its clustering performance is greatly susceptible to the initial solution. In other words, the clustering algorithm promises only local optimality, that is, convergence to a locally optimal solution. Even though this operational difficulty has been settled by many variants of K-means clustering, most of them complicate the main optimization procedure of K-means clustering. For instance, the K-median algorithm [KR87, BMS97] searches for each cluster centroid among all data patterns such that the summation of the distances from all data patterns in the cluster to the cluster centroid is minimized. This complete search for the cluster centroids over the entire dataset is obviously robust with respect to the selection of the initial solution, but it imposes an O(N²) time complexity on the computation of any cluster centroid. Hence, one motivation for this research is to develop a partition-based clustering scheme that does not change the main optimization procedure of K-means clustering, for the sake of avoiding additional time complexity. To fulfill this strategy, the corresponding part of this work will focus on estimating a suboptimal initial partition and on incorporating a heuristic type of dissimilarity function into K-means clustering.

1.4 Context quantization

K-means clustering can also be implemented as the vector quantization (VQ) technique [LBG80] in the field of communication. The objective of vector quantization is to design a set of code vectors, or a codebook, for a given vector source with a known statistical property such that the average distortion measure is minimized. Although vector quantization is essentially intended for data compression, its main idea can be extended to quantizing the conditioning contexts in compressing a discrete sequence from a finite source [Che04, FWA04, Wu99].

When coding a finite source, a high-order conditioning event, or context template, implies a large number of parameters in the statistical model, and is thereby associated with a huge model cost that can offset the entropy coding savings. In the sequential coding scheme, the increased model cost can be interpreted as a result of the context dilution phenomenon, which occurs when the count statistics spread over too many contexts, eventually affecting the accuracy of the statistical estimates. A natural solution is to reduce the resolution of the raw contexts, that is, to quantize the contexts according to some criterion. Thus, context quantization has become one technique for tackling the context dilution problem.

Expressly, context quantization is posed as a form of the k-center clustering problem under the principle of minimizing the Kullback-Leibler distance, or the code length, for a class of finite memory sources. Thus, this problem can of course be solved in the framework of K-means clustering. However, the shape of the clusters formed by context quantization is extremely complex and irregular in the context space, which poses another challenge: how to implement the irregular quantizer mapping function. Although this operational problem can be treated by utilizing a huge lookup table, our motivation is to design a natural quantizer mapping function that classifies each raw context to its coding state.

This has led us to conduct successive research on context quantization.

1.5 Structure of the thesis

The thesis is organized as follows. The clustering problem is presented in Chapter 2; this chapter also reviews and discusses different clustering algorithms. In Chapter 3, we introduce the two most common techniques in feature extraction: principal component analysis and Fisher discriminant analysis. This chapter also studies the popular kernel extensions of the two techniques: kernel principal component analysis and kernel Fisher discriminant analysis. In Chapter 4, we mainly study the problem of K-means clustering from two aspects: the dissimilarity function and the selection of initial solutions. The dynamic programming technique is introduced to estimate the initial solution for K-means clustering. In Chapter 5, we consider a specific k-center clustering problem in lossless image compression: context quantization. We investigate the problem from two aspects: context clustering in terms of the Kullback-Leibler distance, and the practical implementation of the quantizer mapping function. In Chapter 6, we summarize the main contributions made by the original publications of this thesis. In Chapter 7, we outline the main conclusions of this thesis.

2. Clustering

This chapter begins with the problem formulation of cluster analysis. Next, we briefly review a variety of clustering algorithms and the related work. Section 2.3 discusses the dissimilarity functions used in cluster analysis. In Section 2.4, several common objective functions for clustering problems are introduced. Finally, Section 2.5 reviews a variety of clustering validity indices for estimating the number of clusters.

2.1 Problem formulation

A key task of clustering is, given a set of unlabelled data patterns X, to discover the inherent partition G = {G_1, …, G_k} of X such that

$$ \bigcup_{j=1}^{k} G_j = X, \qquad G_i \cap G_j = \emptyset \quad \text{for } i \neq j, $$

where

    N is the number of data vectors;
    k is the number of clusters;
    d is the dimension of the data vectors;
    X = {x_1, x_2, …, x_N} is the set of N data vectors;
    Π = {π_i | i = 1, …, N} is the class label, or membership, of X;
    C = {c_j | j = 1, …, k} are the k cluster centroids, or the codebook of the k clusters.

Clearly, the grouping of the clusters G is uniquely determined by the class label function Π = {π_i | i = 1, …, N}, i.e.,

$$ G_j = \{ x_i \mid \pi_i = j,\ x_i \in X \} . $$

Formally, the quality of a resulting partition of X can be measured by a clustering objective function f:

$$ f(X, G, k) = \frac{1}{N} \sum_{i=1}^{N} D\big( x_i, G_{\pi_i} \big), $$

where D is some dissimilarity function, or quantity, that measures the distinction between a given pattern x_i and its assigned cluster G_{π_i}. Thus, in principle, the task of clustering is to seek a partition such that the objective function is minimized. In most cases, the partition G is equivalent to, or can be derived from, the set of cluster centroids (i.e., representative vectors) C, if the dissimilarity D is fairly defined. An example of a clustering comprised of three compact clusters is visualized in Figure 2.1, in which the training patterns are generated in terms of a Gaussian mixture model with three components.

Figure 2.1. An example dataset with three compact clusters of patterns produced by three GMM components.

It is worth noting that the training patterns X are also termed data vectors, data patterns, data objects or data samples; they are interpreted as a set of d-dimensional feature vectors in the Euclidean space R^d.
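To make the objective concrete, the following minimal NumPy sketch (an illustration, not code from the thesis) evaluates f for the common special case where D is the squared Euclidean distance to the assigned cluster centroid; all array names are illustrative:

```python
import numpy as np

def clustering_objective(X, C, labels):
    """Mean dissimilarity f(X, G, k) with D taken as the squared
    Euclidean distance to the assigned cluster centroid (MSE)."""
    diffs = X - C[labels]              # x_i - c_{pi_i} for every sample
    return np.mean(np.sum(diffs ** 2, axis=1))

# Tiny usage example with k = 2 clusters in d = 2 dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.1], [3.9, 4.0]])
C = np.array([[0.05, 0.1], [3.95, 4.05]])
labels = np.array([0, 0, 1, 1])
print(clustering_objective(X, C, labels))
```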

A practical term for denoting a cluster representative is cluster centroid, or reference vector. In general, the number of possible partitions, or possible solutions, grows exponentially [Spa85] with the number of clusters k as

$$ S(N, k) = \frac{1}{k!} \sum_{j=1}^{k} (-1)^{k-j} \binom{k}{j} j^{N} . $$

Thus, the combinatorial optimization of the k-center clustering problem in a d-dimensional feature space is NP-complete in k (i.e., the time complexity of an exhaustive search is O(k^N)). Even if a branch-and-bound technique [KNF75] is able to find the global optimum without the need for exhaustive search, the excessive time complexity is still exponential in N. Only under some rigorous restrictions can some clustering problems be resolved in polynomial time. For instance, the scalar quantization problem can be solved in O(kN²) time. This is why most clustering algorithms guarantee merely a heuristic solution, or a local optimum.

The definition of distinct clusters G is sometimes termed a hard clustering. In a broader sense, the definition above can be extended to so-called fuzzy clustering [Bez80], which classifies X into k clusters according to k specific membership functions,

$$ u_j : X \to [0, 1], \qquad j = 1, \ldots, k, $$

such that for any arbitrary pattern x_i,

$$ \sum_{j=1}^{k} u_j(x_i) = 1, \qquad i = 1, \ldots, N, $$

and

$$ 0 < \sum_{i=1}^{N} u_j(x_i) < N, \qquad j = 1, \ldots, k . $$

In contrast to distinct clusters, the partition of the data patterns X is not uniquely determined by the membership functions u_j. However, in practice, once the clusters are formed, one can derive the class label function π by

$$ \pi(x_i) = \arg\max_{1 \le j \le k} u_j(x_i) . $$

Intuitively, distinct clusters can be interpreted as a special case of fuzzy clustering. One advantage of fuzzy clustering is that it does not force every data pattern into a specific cluster.

2.2 Review of clustering algorithms

Clustering techniques can be classified into two major types: partition-based algorithms and hierarchical algorithms. Partition-based clustering uses an iterative optimization procedure that aims at minimizing an objective function f, which measures the goodness of the clustering. Figure 2.2 shows a simple example of partition-based clustering in which the MSE distortion function is minimized. A typical partition-based clustering is composed of two learning steps: the partitioning of each pattern to its closest cluster, and the computation of the cluster centroids. A common feature of partition-based clustering is that the clustering procedure starts from an initial solution with a known number of clusters. The cluster centroids are usually computed based on the optimality criterion such that the objective function is minimized.

Figure 2.2. An example of partition-based clustering visualized by a Voronoi diagram.

Among partition-based clustering methods, K-means clustering [BH67, DH73, LBG80] is the most popular. There are a number of variants of K-means clustering, of which the generalized Lloyd algorithm (GLA) is the most useful; it is thus usually referred to as the conventional K-means algorithm [KMN02]. The pseudocode of the GLA is presented in Figure 2.3; a runnable sketch of the same iteration is also given below.

K-means clustering can be posed as a special form of fuzzy K-means clustering [Dunn74, CDB86] and of the statistical Gaussian mixture model [Bis95]. The Gaussian mixture model often serves as a probabilistic decomposition in the classification of incomplete data (missing class labels). Thus, it follows the task of partition-based clustering by assuming that each component model characterizes the cluster shape and location. A common approach for estimating the model parameters of a GMM is to use the expectation-maximization (EM) algorithm [DLR77], in the spirit of maximizing the log-likelihood estimate. Analogously to K-means clustering, the EM algorithm offers a locally optimal estimate of the cluster densities, albeit exhibiting a nice form of iterative gradient descent. But this does not inhibit it from providing a practical clustering solution for non-vector forms of data within a unifying framework [CGS00].

Fuzzy clustering seeks to minimize a heuristic global cost function by exploiting the fact that each pattern has some graded membership in each cluster. The clustering criterion allows each pattern multiple cluster assignments. Fuzzy K-means clustering can be cast as a form of convex optimization, but without a closed-form expression. Intuitively, it iteratively updates the cluster centroids and estimates the class membership functions u_j by using a gradient descent approach. An alternative clustering approach in a similar manner to fuzzy clustering is K-harmonic clustering [Zhan01], which replaces the MSE cost function with a harmonic function. This harmonic objective function is interpreted as the harmonic summation of the distances between all possible data patterns and clusters:

$$ f_{HM}(X, G, k) = \sum_{i=1}^{N} \frac{k}{\displaystyle\sum_{j=1}^{k} 1 / \| x_i - c_j \|^2 } . $$

Input:  X: a set of N data vectors
        C_I: k initialized cluster centroids
Output: C: the cluster centroids of the k-clustering
        Π = {π_i | i = 1, …, N}: the cluster labels of X

Function GLA(X, C_I):
    C ← C_I;
    REPEAT
        C_previous ← C;
        FOR all i ∈ [1, N] DO π_i ← arg min_{1≤j≤k} d(x_i, c_j);
        FOR all j ∈ [1, k] DO c_j ← average of the x_i with π_i = j;
    UNTIL C = C_previous;
    RETURN C, Π;

Figure 2.3. Pseudocode of the generalized Lloyd algorithm.

However, partition-based clustering methods often face a common problem, namely convergence to a local optimum of poor quality [BMS97, FR98], which is due in part to their sensitivity to the initial random clustering and to the existence of outliers or noise [DK97, ECY04]. The local optimality problem has been remedied by converting the k-center clustering problem into a bilinear program, which can be solved by the K-median algorithm [BMS97]. The use of K-median clustering leads to robustness in the statistical sense [RL87], since outliers and initial solutions have less impact on a full search for the cluster centroids than on the optimization scheme conducted by K-means clustering. Thus, K-median clustering is robust with respect to random initialization, additive noise and multiplicative noise [ECH01]. The K-median algorithm uses median vectors as the cluster centroids in terms of the L1 norm, which can produce clusters of higher quality, but its implementation takes O(kN²) time.
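As a complement to the pseudocode of Figure 2.3, the following is a minimal, self-contained NumPy sketch of the same GLA iteration (my illustration, not the thesis implementation), assuming the squared Euclidean distance as d; an empty cluster simply keeps its previous centroid, one of several common conventions:

```python
import numpy as np

def gla(X, C_init, max_iter=100):
    """Generalized Lloyd algorithm (conventional K-means).
    X: (N, d) data matrix, C_init: (k, d) initial centroids."""
    C = C_init.copy()
    for _ in range(max_iter):
        C_previous = C.copy()
        # Partition step: assign each vector to its nearest centroid.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Centroid step: recompute each centroid as the cluster mean.
        for j in range(len(C)):
            members = X[labels == j]
            if len(members) > 0:        # keep empty clusters unchanged
                C[j] = members.mean(axis=0)
        if np.allclose(C, C_previous):  # converged: centroids unchanged
            break
    return C, labels
```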

A treatment that accelerates the costly K-median approach to subquadratic time has been proposed, conducting a partial search for the cluster centroids through a k-nearest-neighbor graph constructed by the Delaunay triangulation technique [ECY04, ECH01].

Hierarchical clustering recursively groups the data patterns into a tree, i.e., a dendrogram; see Figure 2.4. The hierarchical clustering scheme is a popular choice when different levels of heterogeneity of the cluster structure are desired. The resulting clusters are always produced as the internal nodes of the tree, while the root node is reserved for the entire dataset and the leaf nodes for the individual data samples. In particular, these algorithms involve as many steps as the number of data samples. The two main categories of algorithm used in the hierarchical clustering framework are agglomerative and divisive.

Figure 2.4. Hierarchical clustering using the average linkage tree over the IRIS dataset.

Agglomerative algorithms seek to merge clusters into larger and larger ones, starting with N single-point clusters. The algorithms can be divided into three classes: single-link algorithms [JD88], complete-link algorithms [King67] and minimum-variance algorithms [Ward63, Murt83].

The single-link algorithm merges two clusters according to the minimum distance between the data samples of the two clusters. Accordingly, the algorithm has a tendency to produce clusters with elongated shapes. In contrast, the complete-link algorithm incorporates the maximum distance between the data samples in the clusters, and its application always results in compact clusters. Thus, the quality of hierarchical clustering depends considerably on how the dissimilarity between two clusters is defined, as illustrated by the SciPy sketch below. The minimum-variance algorithm combines two clusters in the sense of minimizing a cost function, namely, it forms the new cluster with the minimum increase of the cost function. This algorithm has also attracted considerable interest in vector quantization, where it is termed the pairwise-nearest-neighborhood (PNN) algorithm [Equ89, Bot92, KS98]. Further treatments for accelerating the PNN algorithm can be found in [FKSC00, VF03, VFT01].

Divisive clustering begins with the entire dataset in a single cluster, followed by iterative splitting of the dataset until single-point clusters are attained at the leaf nodes. It clearly follows the reverse clustering strategy to agglomerative clustering, as demonstrated in Figure 2.5. At each node, the divisive algorithm conducts a full search over all possible pairs of clusters for the data samples on the node. In practice, the method is seldom used [JMF99] for clustering numerical datasets, due to its exponential time complexity. The tractable way to implement this clustering is to make a compromise between the number of searches and the quality of the resulting clusters, i.e., to use a partial search instead. This can be realized by imposing some additional clustering criterion on each division; related work can be found in [DH02]. Despite this reservation, divisive clustering has provided an intuitive approach for grouping binary data samples [Chav98].
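To illustrate how the choice of the between-cluster dissimilarity changes the result, the sketch below (an illustration using SciPy's standard hierarchical clustering routines, not code from the thesis) clusters the same elongated dataset with single and complete linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two elongated point groups; single link tends to follow chains,
# while complete link prefers compact, rounded clusters.
X = np.vstack([rng.normal([0, 0], [2.0, 0.2], (50, 2)),
               rng.normal([0, 3], [2.0, 0.2], (50, 2))])

for method in ("single", "complete"):
    Z = linkage(X, method=method)           # agglomerative merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])  # cluster sizes
```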

A notable class of hierarchical clustering algorithms is the sequential clustering algorithms for manipulating large-scale databases. For example, the modified basic sequential clustering algorithmic scheme (MBAS) algorithm [TK99, KS03] applies a tree structure to split the entire dataset with a single scan. The number of clusters is determined dynamically by using a presumed threshold. Later, this algorithm was improved as the so-called balanced iterative reducing and clustering using hierarchies (BIRCH) algorithm [ZRL96]. The BIRCH algorithm constructs a height-balanced CF-tree to maintain the clustering features of the sub-cluster entries. The algorithm is capable of producing robust clusters of good quality within a few additional scans. Figure 2.6 [BIR] briefly outlines the mechanism of the BIRCH algorithm. However, as a sequential clustering algorithm, it is inevitably sensitive to the ordering of the data sequences.

Figure 2.5. A demonstration of a divisive clustering procedure.

Alternative clustering approaches include graph-based clustering algorithms [Jarv78, Zahn71], artificial neural network clustering [CTC94, WC03] and evolutionary clustering algorithms [FV03, KFN03, P1]. A useful graph-based clustering technique is the minimum spanning tree (MST) algorithm, which converts a multidimensional clustering problem into a tree-partitioning problem. The algorithm assumes that each cluster corresponds to one subtree of the MST; accordingly, at each step, the most inconsistent or largest edge is cut, i.e., removed from the graph. A distinguishing advantage is that the simple tree structure facilitates an efficient implementation that does not depend on any particular geometric shape of the clusters. Thus, it avoids many of the problems faced by other partition-based clustering algorithms.

Figure 2.6. An illustrative diagram of the implementation of the BIRCH clustering algorithm.

The self-organizing map (SOM) [CTC94, Koh95, Fran99] is a well-known extension of artificial neural networks to unsupervised learning, and is usually applied to reveal the complex structure of a data clustering. The SOM is usually composed of a two-dimensional regular grid of nodes, where each node is associated with a model of the data patterns. The SOM algorithm computes the models so that similar models are closer to each other in the grid than less similar ones. In this sense, the SOM provides both a similarity graph and a clustering diagram at the same time.

Among evolutionary approaches, genetic algorithms are the most commonly used in cluster analysis. A variety of reproduction schemes have been elaborated in these genetic clustering methods, of which the crossover technique is the most popular. The essential idea of genetic clustering was first reported in Raghavan and Birchand's work [RB78] on the minimization of the MSE cost function of clustering. The form of the genetic crossover mainly depends on the representation of the clustering solution [Fran00], and the methods can thus be classified into two subcategories: partition-label based methods and centroid-based methods. In particular, a genetic algorithm can be applied to estimating the number of clusters by using dynamic crossovers [P1]. An example of a dynamic crossover is illustrated in Figure 2.7.

Figure 2.7. Example of swapping cluster centroids (reference vectors) in genetic crossover: (a) static crossover by swapping cluster centroids; (b) dynamic crossover by swapping cluster centroids.

29 appled to estmatng the number of clusters by usng dynamc crossovers [P]. An example of dynamc crossover s llustrated n Fgure 2.7. Father Cf Cf2 Cf3 Cf4 Cf5 Cf6 Cf7 Father Cf Cf2 Cf3 Cf4 Cf5 Cf6 Cf7 Cf8 Cf9 Chld Cf Cf2 Cf3 Cf4 Cm5 Cm6 Cm7 Chld Cm Cm2 Cm3 Cf4 Cf5 Cf6 Mother Cm Cm2 Cm3 Cm4 Cf5 Cm6 Cm7 Mother Cm Cm2 Cm3 Cm4 Cm5 Cm6 Cm7 a b a Statc crossover by swappng cluster centrods b Dynamc crossover by swappng cluster centrods Fgure 2.7. Example of swappng cluster centrods or reference vectors n genetc crossover 2.3 Dssmlarty functon An essental quantty to measure the dstncton between data patterns s the dssmlarty functon. It can be defned ether on the contnuous valued vectors or the dscrete valued vectors. In the sense of data clusterng, the dssmlarty functons must be carefully chosen due to the dversty and scale of features nhabted n patterns. More mportantly, dfferent choces of clusterng dssmlartes lead to dstnct clusterng results. The mpact of dssmlarty measures on herarchcal clusterng s presented n Fgure 2.8. A common way to formulate the dssmlarty of two data vectors s Mnkowsk metrc d dm p x, x = s= x p / p, s x, s In partcular, wth p =, t mples the L dstance or Manhattan dstance, and wth p = 2 t mples the Eucldean dstance or L 2 dstance. Conceptually, a metrc must satsfy the followng requrements of a dstance functon 7

$$ d(x, x) = 0, \qquad d(x, y) = d(y, x), \qquad d(x, y) \ge 0, \qquad \forall x, y \in X . $$

The definition of a metric is further restricted by a more rigorous condition, the triangle inequality,

$$ d(x, z) \le d(x, y) + d(y, z), $$

together with the requirement that

$$ d(x, y) = 0 \iff x = y . $$

The case p = ∞ in the Minkowski metric yields the maximum metric

$$ d_{\infty}(x_i, x_j) = \max_{1 \le s \le \dim} | x_{i,s} - x_{j,s} | . $$

These measures are translation invariant but not invariant to scaling. A common practice for achieving invariance is to normalize the entire dataset in the preprocessing stage. However, the specific scales lodged in the features are factors that influence the clustering results, even if they are seemingly irrelevant. Scale-invariant dissimilarities include the Canberra metric

$$ d(x_i, x_j) = \sum_{s=1}^{\dim} \frac{ | x_{i,s} - x_{j,s} | }{ x_{i,s} + x_{j,s} } $$

and the cosine similarity

$$ d(x_i, x_j) = \frac{ \langle x_i, x_j \rangle }{ \| x_i \| \cdot \| x_j \| }, $$

where ⟨·,·⟩ represents the inner product of two vectors and ‖·‖ is the L2 norm in Euclidean space. Another way to generalize the Euclidean distance to the scale-invariant case is the Mahalanobis distance,

$$ d(x_i, x_j) = (x_i - x_j)^{T} \Sigma^{-1} (x_i - x_j), $$

where Σ is the covariance matrix with respect to the entire dataset

or with respect to some cluster G_j:

$$ \Sigma_X = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{T}, \qquad \Sigma_{G_j} = \frac{1}{n_j} \sum_{x_i \in G_j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^{T}, $$

where x̄ (respectively x̄_j) is the corresponding mean vector. The Mahalanobis distance serves as the dissimilarity measure in model-based clustering, such as the EM clustering algorithm: maximization of the log-likelihood of the Gaussian mixture model leads to an approximation of the posterior probability P(G_j | x) in terms of the Mahalanobis distance. But applying the Mahalanobis distance always suffers from the singularity problem, and from a high computational cost, when used in the framework of K-means clustering.

Figure 2.8. Different dissimilarity measures lead to distinct hierarchical clustering results (single linkage and complete linkage): (a) the distance between the closest pair of points; (b) the distance between the farthest pair of points.

The measures above are unlikely to help us describe binary-valued feature vectors. The comparison of two such feature vectors can be realized by counting the occurrences of differing bits in the two vectors, which is the Hamming distance

$$ d(x_i, x_j) = \sum_{s=1}^{\dim} | x_{i,s} - x_{j,s} | . $$
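The dissimilarities of this section translate directly into code. A compact NumPy sketch (mine, not from the thesis; `Sigma` is assumed to be a nonsingular covariance estimate) is:

```python
import numpy as np

def minkowski(x, y, p=2.0):
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

def maximum_metric(x, y):                 # the p = infinity case
    return np.abs(x - y).max()

def cosine_similarity(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def mahalanobis(x, y, Sigma):
    d = x - y
    return d @ np.linalg.solve(Sigma, d)  # (x-y)^T Sigma^{-1} (x-y)

def hamming(x, y):                        # binary vectors
    return int(np.sum(x != y))
```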

An equivalent form of the Hamming distance is the M-coefficient

$$ s(x_i, x_j) = \frac{ n_{00} + n_{11} }{ \dim }, $$

where n_00 and n_11 are the numbers of 0-bits and 1-bits, respectively, presented simultaneously in the common features of the two vectors. A further generalization,

$$ s(x_i, x_j) = \frac{ n_{11} }{ \dim - n_{00} }, $$

is intended for datasets where 0-bits appear more frequently than 1-bits.

Most of the dissimilarity functions above, excluding the Hamming distance, can be utilized in partition-based clustering algorithms. One form of binary-valued data clustering has been conducted in the framework of K-means clustering [FGG00, GKV94] according to Rissanen's stochastic complexity [Ris89],

$$ SC = \sum_{j=1}^{M} n_j H_d(c_j) + N \cdot H\!\left( \frac{n_1}{N}, \ldots, \frac{n_M}{N} \right) + \sum_{j=1}^{M} \frac{d}{2} \log_2 \max(1, n_j), \qquad (2.1) $$

where n_j = |G_j| is the sample size of cluster G_j (i.e., the number of data vectors in the cluster), M is the number of classes (i.e., equivalent to k), and

$$ H_b(x, c) = -x \log_2 c - (1 - x) \log_2 (1 - c), $$

$$ H_d(x, c) = \sum_{s=1}^{d} H_b(x_s, c_s), \qquad H_d(c_j) = H_d(c_j, c_j), $$

$$ H(z_1, \ldots, z_M) = -\sum_{j=1}^{M} z_j \log_2 z_j, $$

where d is the dimension of the binary vectors. An intrinsic dissimilarity can minimize the stochastic complexity with the help of the Shannon code-length (CL) distance,

$$ d_{CL}(x, c_j) = H_d(x, c_j) - \log_2 \frac{n_j}{N}, \qquad (2.2) $$

where n_j is the number of binary vectors in the j-th cluster and c_j is the centroid of cluster G_j. The cluster centroid c_j is equivalent to the cluster density function P(x | G_j) only for binomially distributed data (see publication P2).
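As a concrete reading of the CL distance (2.2), the following sketch (mine; the clamping constant `eps` is an added safeguard against log2(0) that the formulas above do not discuss) scores a binary vector against two hypothetical clusters:

```python
import numpy as np

def cl_distance(x, c, n_j, N, eps=1e-12):
    """Shannon code-length distance of binary vector x to a cluster
    with centroid c (per-dimension 1-frequencies) and size n_j."""
    c = np.clip(c, eps, 1.0 - eps)        # guard against log2(0)
    cross_entropy = -(x * np.log2(c) + (1 - x) * np.log2(1 - c)).sum()
    return cross_entropy - np.log2(n_j / N)

x = np.array([1, 0, 1, 1])
print(cl_distance(x, np.array([0.9, 0.1, 0.8, 0.7]), n_j=60, N=100))
print(cl_distance(x, np.array([0.2, 0.6, 0.3, 0.4]), n_j=40, N=100))
```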

The CL dissimilarity can be interpreted as the logarithm of the posterior probability P(c_j | x), which can equally be derived from the likelihood, −log₂ P(x | c_j)P(c_j). Plainly, the dissimilarity is not a distance, due to its asymmetry. For simplicity, we term this dissimilarity the CL distance in the sequel.

In general, the dissimilarity function should be chosen carefully before any clustering procedure is conducted. The application of different dissimilarities clearly leads to distinct shapes of clusters, which, however, depends on the clustering model that is chosen.

2.4 Objective function

One of the major issues in clustering concerns the criterion function to be optimized. Since the objective of clustering is to classify the data patterns into sensibly similar groups, one way to formulate a clustering problem is to define a criterion function, or objective function, that measures the overall quality of a clustering partition. Once the objective function is well defined, the clustering task is to seek a partition such that the objective function is minimized. In the following, we study several basic objective functions in common use.

The simplest and most popular objective function is the mean-square-error (MSE) function, which is written as

$$ f_{MSE} = \frac{1}{N} \sum_{i=1}^{N} u_i \, \| x_i - c_{\pi_i} \|^2, \qquad (2.3) $$

where c_j is the mean vector of the j-th cluster,

$$ c_j = \frac{1}{|G_j|} \sum_{x_i \in G_j} x_i, $$

and u_i is the weight of the distance between x_i and c_{π_i}. This objective function is interpreted as the overall within-cluster variance incurred by the resulting partition G. The mean vectors are the best cluster centroids in the sense that the MSE function is minimized. This implies that the objective function uniquely determines the definition of the cluster centroids.

It is important to point out that the MSE function is an appropriate clustering criterion only when the clusters have compact shapes and are well separated. A stricter assumption is that each cluster has the same scattering variance. Thus, the objective function is, to some extent, sensitive to the presence of outliers. The computation of the cluster centroids is intractable in some specific cases; in such cases, a more general form of f_MSE can be used instead,

$$ f_{MSE} = \frac{1}{2N} \sum_{j=1}^{k} \frac{1}{|G_j|} \sum_{x, y \in G_j} d(x, y)^2, $$

where d(x, y) is some dissimilarity function chosen for the clustering. This generalized form can be viewed as the summation of the average squared distances between the vectors in each cluster. The generalization clearly lends us an intuitive way to derive an objective function from the dissimilarity function.

A notable class of criterion functions can be derived from the analysis of variance. Namely, the total covariance matrix of the entire dataset X, S_X, can be decomposed into the sum of the between-group covariance matrix and the within-group covariance matrix,

$$ S_X = S_B + S_W, \qquad (2.4) $$

where n_j = |G_j| and

$$ S_W = \sum_{j=1}^{k} \sum_{x_i \in G_j} (x_i - c_j)(x_i - c_j)^{T}, \qquad S_B = \sum_{j=1}^{k} n_j (c_j - \bar{x})(c_j - \bar{x})^{T}, \qquad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i . $$

Clearly, the total covariance matrix S_X depends only on the data samples X. The decomposition in (2.4) implies that the between-group covariance matrix S_B moves in the direction opposite to the within-group covariance matrix S_W.
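The decomposition (2.4) is easy to verify numerically. The sketch below (my check, with illustrative data, not code from the thesis) builds S_W and S_B from a labeled dataset and compares their sum with the total scatter matrix:

```python
import numpy as np

def scatter_decomposition(X, labels):
    """Return (S_X, S_B + S_W) for a labeled dataset; the two
    matrices should coincide up to floating-point error."""
    xbar = X.mean(axis=0)
    S_X = (X - xbar).T @ (X - xbar)       # total scatter
    S_W = np.zeros_like(S_X)
    S_B = np.zeros_like(S_X)
    for j in np.unique(labels):
        G = X[labels == j]
        c = G.mean(axis=0)
        S_W += (G - c).T @ (G - c)        # within-group scatter
        S_B += len(G) * np.outer(c - xbar, c - xbar)
    return S_X, S_B + S_W

X = np.random.default_rng(1).normal(size=(200, 3))
labels = (X[:, 0] > 0).astype(int)
S_X, S_sum = scatter_decomposition(X, labels)
print(np.allclose(S_X, S_sum))            # True
```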

This is why maximization of the between-group covariance is equivalent to minimization of the within-group covariance. One clustering criterion, the F-ratio validity index, can be defined as a ratio between the traces of the two covariance matrices S_W and S_B,

$$ f_{F\text{-}ratio} = \frac{ k \cdot \mathrm{tr}(S_W) }{ N \cdot \mathrm{tr}(S_B) }, \qquad (2.5) $$

where

$$ \mathrm{tr}(S_W) = \sum_{j=1}^{k} \sum_{x_i \in G_j} (x_i - c_j)^{T} (x_i - c_j), \qquad \mathrm{tr}(S_B) = \sum_{j=1}^{k} n_j (c_j - \bar{x})^{T} (c_j - \bar{x}) . $$

The form (2.5) is in fact equivalent to the F-statistic, and it is quite often used as a cluster validity index for estimating the number of clusters. Obviously, for a fixed k, minimization of the F-ratio index (2.5) equals minimization of f_MSE in (2.3). A more general form of the F-ratio index [NMC97], with the translation-invariant property, is defined according to either the trace or the determinant of the matrix S_W^{-1} S_B; these are the so-called J-measures,

$$ f_{J_1} = \frac{k}{N} \, \mathrm{tr}\big( S_W^{-1} S_B \big) \qquad \text{or} \qquad f_{J_2} = \frac{k}{N} \log \frac{ \det S_B }{ \det S_W } . $$

A distinct advantage of the criteria above is that they do not rest on the rigorous assumption that each cluster has the same variance. The two functions are quite useful in validating the number of clusters.

Even if many objective functions can be chosen for solving clustering problems, the objective function must be defined according to the specific application of the clustering. It should also be in agreement with the definition of the clustering dissimilarity function. Iterative-optimization-based clustering often prefers an objective function with a closed form, in the sense that the cluster centroids can be computed tractably. In a broader sense, the objective function plays a critical role in many heuristic clustering algorithms, where the quality of the clustering solutions needs to be evaluated frequently.

However, in practice, a specific class of objective functions is utilized more frequently for estimating the number of clusters than for optimizing the location of the clusters. For this purpose, we study in the next section some of the cluster validity indices that are used to estimate the number of clusters.

2.5 The number of clusters

Cluster validation is a procedure for evaluating the quality of the resulting clusters through an objective function. However, the number of clusters is seldom known in advance. Thus, a major challenge arising in cluster analysis is the determination of the number of clusters, which has been one of the most common applications of cluster validation. Regardless of which clustering method is chosen, the estimation of an appropriate number of clusters remains of critical importance. One direct way to estimate the number of clusters is through the optimization of an objective function, termed the cluster validity index function. A wide class of cluster validity indices was studied in [BP98] in terms of three major measurement functions.

The modified Hubert statistic is one measurement for evaluating the goodness of fit between the data patterns and the resulting partition. This statistic is computed as the normalized correlation between the proximity matrix and the representative matrix of the resulting partition. Another validity index is a function of the ratio of the within-cluster variance to the between-cluster variance, referred to as the Davies-Bouldin index (DBI) [DB79],

$$ f_{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} r_{ij}, \qquad r_{ij} = \frac{ \sigma_i + \sigma_j }{ d(c_i, c_j) }, $$

where σ_i is the standard deviation of all the samples in cluster G_i.
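A direct transcription of the DBI into NumPy reads as follows (my sketch; σ_i is computed as the root-mean-square distance of the members of cluster i to its centroid, which is one common reading of "standard deviation of all samples in the cluster"):

```python
import numpy as np

def davies_bouldin(X, C, labels):
    k = len(C)
    # sigma[i]: RMS distance of cluster i's members to centroid c_i.
    sigma = np.array([np.sqrt(((X[labels == i] - C[i]) ** 2)
                              .sum(axis=1).mean()) for i in range(k)])
    worst = []
    for i in range(k):
        r = [(sigma[i] + sigma[j]) / np.linalg.norm(C[i] - C[j])
             for j in range(k) if j != i]
        worst.append(max(r))               # max_{j != i} r_ij
    return float(np.mean(worst))
```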

The DBI is geometrically plausible: it seeks a partition such that the within-cluster variance is minimized and the between-cluster variance is maximized. In particular, the validity index is useful for estimating the number of clusters of a dataset with compact clusters. For example, the number of clusters can be determined so that the validity index shows the greatest decrease in value when incrementing k by 1. In a similar manner, the F-ratio index is also a popular choice for determining the number of clusters.

A rational validity index analogous to the DBI is the Dunn index, which is also designed to measure the separability of compact clusters. However, the weakness of the three validity indices above is that they can be greatly influenced by the presence of outliers or overlapping clusters. For this reason, the Dunn index was generalized in [BP98] to ameliorate its sensitivity to outliers,

$$ f_{Dunn} = \min_{1 \le s \le k} \; \min_{t \ne s} \; \frac{ \delta(G_s, G_t) }{ \max_{1 \le r \le k} d(G_r) }, $$

where the definition of the diameter of a cluster G_j can be rewritten as

$$ d(G_j) = \frac{1}{ n_j (n_j - 1) } \sum_{x, y \in G_j,\; x \ne y} d(x, y), $$

and the distance δ between two clusters is defined by the Hausdorff metric. The distance d(x, y) between two data samples is often a Minkowski metric in Euclidean space. Notably, this generalized form does not impose the restriction that the resulting clusters must be compact and well separated.

A popular solution independent of the clustering algorithm is the so-called GAP statistic [TWH01], which is defined as

$$ \mathrm{Gap}_n(k) = E_n^{*}\big[ \log W_k \big] - \log W_k, \qquad W_k = \sum_{j=1}^{k} \frac{1}{2 n_j} \sum_{x \in G_j} \sum_{y \in G_j} \| x - y \|^2, $$

where E_n^{*} is the expectation with respect to some reference distribution. The optimal number of clusters is the k that maximizes the statistic Gap(k).

The reference distribution is often selected as a uniform distribution over a box aligned with the principal components of the data samples. Similarly, this statistical approach is restricted by the assumption that the clusters are well separated.

Another way to analyze the best number of clusters is the maximization of the likelihood density [RH94, Smy00] by a cross-validation test over the dataset. The cross-validated likelihood serves as a means of automatically determining the appropriate number of clusters in finite mixture models, particularly in the context of model-based probabilistic clustering. In other words, the cross-validation test of the log-likelihood is maximized over the optimal number of clusters; this is an unbiased estimator of the Kullback-Leibler distance between the true model and the model under consideration.

To estimate the number of clusters, we argue for a more general framework of maximizing a penalized likelihood function. Penalized likelihood methods are typically deduced from approximations in terms of asymptotic arguments as the number of samples approaches infinity. The advantage of these methods is that they allow for a simple implementation by adding the penalty term. For example, the minimum description length (MDL) principle [Ris89] provides a strategy for encoding the data with any parametric model that is defined by the maximum likelihood estimate of the resulting clustering. The principle is seemingly capable of being incorporated into any clustering model or density distribution, in accordance with the criterion that data samples should not be overfitted by too complex models [BRY98]. A practical application of the MDL principle in seeking the optimal number of code vectors for vector quantization can be found in [BLS99]. Furthermore, a more general clustering framework based on the MDL principle was presented in [KMW04] to overcome the difficulties caused by having no prior information on the model parameters (i.e., the model parameter prior in K-means clustering refers to the probability of the cluster centroids, which is seldom known in a typical clustering framework). A practical implementation of the MDL principle is the stochastic complexity, defined as the shortest description length of the given data relative to a model class M.

Namely, letting θ̂(X) be the maximum likelihood estimate of X, the stochastic complexity is

$$ SC(X \mid M) = -\log P\big( \hat{\theta}(X) \mid N, M \big) - \log P\big( X \mid \hat{\theta}(X), M \big), $$

where N is the sample size. Clearly, this represents the two parts of the code, where the first term represents the prior of the parameters,

$$ P\big( \hat{\theta}(X) \mid N, M \big) = \max_{\theta} P\big( \hat{\theta}(X) \mid \theta, M \big) . $$

The formulation above gives an intuitive way to estimate the clustering model parameters in terms of the model class M. In practice, M can be a class of density distributions, for example, the multinomial distribution. The cluster validity index based on stochastic complexity relative to the multinomial distribution was investigated in more detail in [KMW04]; we skip the formulation of this cluster validity index here, as it is very complex. However, for binary data samples, the stochastic complexity based on the binomial distribution can be approximated [GKV94] in a simpler form, as in (2.1).

Another cluster validity index similar to the MDL principle is the Bayesian information criterion (BIC) [Sch78]. To estimate the number of clusters, a common approach is to maximize the BIC validity index in terms of the Gaussian mixture model (i.e., the model class M uses the Gaussian distribution) [FR98]. Fortunately, this validity index has an elegant approximation,

$$ f_{BIC} = \log P\big( X \mid \hat{\theta}(X), M \big) - \frac{m_M}{2} \log N, $$

where m_M is the number of independent parameters to be estimated in the model class M. However, this approach implicitly restricts itself to a prior assumption on the shape of the clusters. To overcome this reservation, one recent solution [SJ03] focuses on the analysis of a measurement f_RD(k) derived from rate-distortion theory,

$$ f_{RD}(k) = \frac{1}{ N \cdot \dim } \sum_{i=1}^{N} (x_i - c_{\pi_i})^{T} \, \Sigma^{-1} (x_i - c_{\pi_i}), $$

obtained by averaging the distortions between the data samples and their assigned cluster centroids per dimension. The desirable number of clusters is then chosen as the k that produces the greatest jump f_RD(k)^{-y} − f_RD(k−1)^{-y}, where an appropriate choice is y = dim/2. But this approach is based on the assumption that the underlying data samples obey a mixture of some distribution with equal prior parameters. We may conclude from the arguments above that the plausible answer to choosing the number of clusters is application dependent.
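As a concrete footnote to the jump criterion above, the following sketch (mine, not from the thesis) selects k from a list of per-dimension distortions f_RD(1), …, f_RD(K) that are assumed to have been computed beforehand by some clustering routine:

```python
import numpy as np

def jump_method(distortions, dim):
    """distortions[k-1] = f_RD(k) for k = 1..K; returns the k with the
    largest jump f_RD(k)^(-y) - f_RD(k-1)^(-y), using y = dim / 2."""
    y = dim / 2.0
    transformed = np.asarray(distortions, dtype=float) ** (-y)
    jumps = np.diff(transformed)           # jump at k = 2..K
    return int(np.argmax(jumps)) + 2       # diff index 0 maps to k = 2

print(jump_method([5.1, 2.4, 0.6, 0.55, 0.5], dim=2))  # -> 3
```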

3. Data reduction

Feature extraction is an essential procedure of clustering, encompassing feature construction, dimensionality reduction, sparse representations and feature selection. These techniques are commonly used in the preprocessing stages of pattern recognition tasks. Although these problems have been investigated for many years, feature extraction has recently attracted considerable interest. This is because a number of new applications with high-dimensional data critically need dimensionality reduction for efficiency. These demanding applications have strongly motivated the use of data reduction, particularly in the context of clustering and classification.

In this chapter, we mainly review two useful dimensionality reduction techniques: principal component analysis (PCA) and Fisher discriminant analysis (FDA). The two data reduction techniques are among the most commonly used in pattern recognition, yet they have simple implementations; we thus attempt to seek a suboptimal initial solution of k-center clustering based on them. Fortunately, they have both been successfully generalized as kernel-based learning machines, which are accredited as a class of state-of-the-art classification approaches.

3.1 Principal component analysis

The purpose of clustering is to represent the data samples in a more manageable form by distinguishing them into different classes or groups. However, the clustering of high-dimensional data has proven to be very difficult. A common approach is to reduce the data dimensionality by exploiting the redundancy in the input data. Principal component analysis is one method that fulfills this demand.

PCA is often applied as a decorrelation technique in statistics to extract the main correlations in high-dimensional data. The goal is to find a set of m orthogonal vectors that account for as much of the data's variance as possible; these are called the principal components.

Projecting the input data onto the m-dimensional subspace spanned by these vectors implicitly conducts a dimensionality reduction but, at the same time, preserves most of the intrinsic information in the data.

There are several algorithms for calculating the principal components. A standard way is to solve the eigenvalue problem of the covariance matrix of the input data,

$$ \lambda v = \Sigma v, $$

where Σ is the covariance matrix of X,

$$ \Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^{T} . $$

The resulting eigenvectors can then be sorted in descending order according to their corresponding eigenvalues. The eigenvalues give an indication of the amount of information that the respective principal components carry. The first principal component is often called the leading principal component, and it is the most informative. Hence, the implementation of PCA is cast as an eigenvalue problem of the covariance matrix Σ, which can be solved in O(d³) time [PZ99].

From the statistical point of view, PCA analyzes the total variance by implicitly assuming the initial communalities to be 1 [FWM99]. However, the components in PCA are not always interpretable. This motivates a rotation of the principal components, for the purpose of achieving a simple structure for interpretation. The varimax rotation is one of the rotation-based methods for enhancing the interpretability of the principal components, or factors, whereby each component has high loadings on a smaller number of variables and lower loadings on the other variables. This rotation technique is typically applied in the context of psychology [Kai58].

Despite its useful application in exploratory data analysis, PCA implies a potential oversimplification of the data samples being analyzed, since it operates in the framework of linear modeling. However, the advent of neural network models raises the hope that, even if high nonlinearity is accommodated in the data samples, data reduction is resolvable by implementing PCA with neural network approaches.
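In code, the eigendecomposition route to PCA takes only a few lines; the sketch below (an illustration, not the thesis implementation) extracts the leading m components with NumPy:

```python
import numpy as np

def pca(X, m):
    """Return the top-m principal components (rows) and the
    projected data, via the covariance eigenvalue problem."""
    xbar = X.mean(axis=0)
    Sigma = np.cov(X - xbar, rowvar=False)       # d x d covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]        # top-m eigenvalues
    V = eigvecs[:, order]                        # d x m basis
    return V.T, (X - xbar) @ V                   # components, scores

components, scores = pca(np.random.default_rng(2).normal(size=(100, 5)), m=2)
print(components.shape, scores.shape)            # (2, 5) (100, 2)
```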

A variety of nonlinear PCAs have been invented in the framework of neural networks [Oja82, DK96]. These nonlinear PCAs have been studied intensively in many applications; however, the learning procedures of most neural networks converge to local minima. Thus, they are less stable than the linear ones.

A direct way to address nonlinearity in data reduction is to apply principal component analysis in the framework of K-means clustering [KL97]. This clustering-based PCA is implemented by first clustering the data samples and then performing a PCA projection in each cluster (i.e., a PCA subspace formed by the data in each cluster). In contrast to other K-means algorithms, the clustering procedure is conducted according to the distance from a given data sample to the PCA subspace formed by a cluster. This amounts to a nonlinear approach to dimension reduction, which provides a more accurate representation and is fast to compute. From the viewpoint of K-means clustering, this local PCA is merely an implementation of PCA by the generalized Lloyd algorithm. The pseudocode of local PCA is presented in Figure 3.1. It can also be viewed as a specific form of the so-called clustered component analysis [CBL04], which is derived from the MDL principle. A more general extension of local PCA is the so-called multiple eigenspaces [LBM02], which allows for an automatic selection of the number of local PCAs, and of the dimensionality of each local PCA eigenspace, according to the MDL principle. In particular, clustered component analysis deals with principal component analysis within a mixture model, and thereby formulates the data reduction task as a mixture of linear regressions.

From a geometrical point of view, the principal components can be designed as a class of nonlinear curves or manifolds [HS89], i.e., principal curves. These principal curves have been defined as "self-consistent" curves passing through the middle of random samples from a given distribution. A principal curve in data reduction projects the data samples onto a nonlinear manifold instead of a linear one. Fortunately, these principal curves can be designed and constructed from various theoretical viewpoints [Webb99, Del01, KK02]. In practice, however, learning and designing the principal curves [KKL00] amounts to an expectation-maximization procedure [Tib92] for fitting a hyper-curve to the given training patterns.

However, in practice, learning and design of the principal curves [KKL00] fulfills an expectation-maximization procedure [Tib92] in fitting a hypercurve to given training patterns. They can thereby be interpreted as a parametric formulation of self-organized mapping [CG01]. Remarkably, the principal curves have successfully led to a popular research trend in pattern classification and feature generation [CG98]. In a more general way, however, the nonlinearity of the data can be captured by abstracting a linear principal component into an infinite functional space, which leads to the so-called kernel principal component analysis. The kernel PCA is an idealized technique in data reduction for high-dimensional complex data sources [Mog02, CF00]. We will briefly review the kernel PCA technique in the next section.

Local PCA
Local PCA is an implementation of the generalized Lloyd algorithm that minimizes the total reconstruction distance

E[D^2(x, \tilde{x})] = \int d^2(x, c_{\pi(x)}) \, dP(x)

by two iterative steps:
1. Assign each data sample to the cluster with the least reconstruction distance d(x, c_i) = \| (x - c_i) - \langle x - c_i, e_i \rangle e_i \|
2. Re-compute the cluster centroids c_i and estimate the corresponding principal components e_i (i.e., unit vectors) over the data samples of each cluster i

Figure 3.1. Pseudocode of the local PCA algorithm

3.2 Kernel PCA
Conceptually, the kernel PCA is a nonlinear implementation of principal component analysis in a high-dimensional feature space F, to which the input data is mapped by \Phi: R^d \to F. Although the input data is more separable in F, the space F and therewith the mapping \Phi are also very complicated. For this purpose, the kernel PCA performs a kernel trick in F instead of investigating the mapping \Phi directly:

\langle \Phi(x_i), \Phi(x_j) \rangle = K(x_i, x_j)

Thus, the alternative principal components of X can be estimated in the new feature space F. In principle, the kernel PCA provides as many principal components as there are input data samples. Fortunately, this implies that a larger number of features are permissible in the estimation of the initial solutions for clustering problems. Accordingly, the covariance matrix of X in F can be written as:

W^{\Phi} = \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i) \Phi(x_i)^T

For any eigenvalue \lambda > 0 of W^{\Phi} and its corresponding eigenvector v \in F \setminus \{0\}, the equivalent formulation of the eigenvalue problem [SMS98] in F is defined as:

N \lambda \alpha = K \alpha

where the eigenvector v is spanned in the space F as:

v = \sum_{i=1}^{N} \alpha_i \Phi(x_i)

and where K_{ij} = k(x_i, x_j), \alpha = (\alpha_1, \alpha_2, \dots, \alpha_N)^T, and k(x, y) is a kernel function, e.g., k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2). To extract a kernel component feature, we can project each data vector x onto the eigenvector v:

\langle \Phi(x), v \rangle = \sum_{i=1}^{N} \alpha_i k(x_i, x)

The kernel PCA allows us to obtain features with high-order correlations between the input data samples. Naturally, the extraction of a kernel feature implicitly reveals the nonlinear spatial structure of the input data with most merit in the principal component subspace. This can be observed intuitively in the example shown in Figure 3.2. The kernel features are particularly useful in estimating an initial solution for unsupervised learning, as reported in publication P5.
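The following sketch extracts kernel principal component features along the lines above, using the Gaussian kernel; it omits the centering of the kernel matrix for brevity, matching the uncentered formulation of W^Φ given here, and all names are illustrative.

import numpy as np

def rbf_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_pca_features(X, sigma, m):
    K = rbf_kernel(X, X, sigma)
    # Eigenvalue problem N * lambda * alpha = K * alpha.
    mu, alphas = np.linalg.eigh(K)          # mu = N * lambda
    order = np.argsort(mu)[::-1][:m]
    mu, A = mu[order], alphas[:, order]
    # Scale each alpha so the eigenvector v has unit norm in F:
    # <v, v> = alpha^T K alpha = mu * ||alpha||^2.
    A = A / np.sqrt(np.maximum(mu, 1e-12))
    # Feature of each sample x: <Phi(x), v> = sum_i alpha_i k(x_i, x).
    return K @ A

Z = kernel_pca_features(np.random.default_rng(1).normal(size=(100, 3)), 1.0, 2)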

Figure 3.2. An example illustration of a kernel principal component curve

3.3 Fisher discriminant analysis
Fisher's linear discriminant [FS75] is a linear mapping that projects high-dimensional data onto a one-dimensional subspace according to the class labels of the data. The projection maximizes the ratio of the between-class variance against the within-class variance, which is directly derived from the definition of the F-ratio validity index,

J(v) = \frac{\sum_{j=1}^{M} n_j \left( v^T (c_j - \bar{x}) \right)^2}{\sum_{i=1}^{N} \left( v^T (x_i - c_{\pi_i}) \right)^2}   (3.1)

where \pi_i is the class label of each sample x_i and \bar{x} is the mean of all data samples. Note that M represents the number of classes here and in the sequel. Thus, a multi-class linear Fisher discriminant v is the optimal discriminant direction that maximizes the F-ratio validity index in (3.1), i.e.,

v = \arg\max_{y} \frac{y^T S_B y}{y^T S_W y}   (3.2)

where S_B and S_W are the between-class covariance matrix and the within-class covariance matrix, respectively:

S_B = \sum_{j=1}^{M} n_j (c_j - \bar{x})(c_j - \bar{x})^T, \quad S_W = \sum_{i=1}^{N} (x_i - c_{\pi_i})(x_i - c_{\pi_i})^T

where c_j and n_j are the mean vector and sample size of class j, respectively. Maximizing the F-ratio index yields a closed-form solution but involves the eigenvalue problem of S_W^{-1} S_B. This approach can be viewed as a specific form of linear perceptrons. However, Fisher's linear discriminant is restricted by the assumption that the data samples obey a Gaussian distribution with equal group covariances. A more practical reservation is that it assumes all discriminating information to be contained in the linear discriminant subspace. On the contrary, most input data may not be linearly separable, as shown in the right part of Figure 3.3. Thus, in practice, the data reduction amounts to seeking a nonlinear manifold to discriminate the input classes. Nevertheless, mathematically, the nonlinearity of the labeled data can be captured by most reproducing kernel functions. For this purpose, we will examine a kernel-based learning machine, the kernel Fisher discriminant analysis, in the next section.
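A small sketch of the multi-class linear Fisher discriminant may help here: it builds S_B and S_W exactly as defined above and takes the leading eigenvector of S_W^{-1} S_B. The function name is illustrative; the same routine is reused in a clustering sketch in Chapter 4.

import numpy as np

def fisher_direction(X, labels):
    # Multi-class LFD: leading eigenvector of inv(S_W) @ S_B, as in (3.2).
    xbar = X.mean(axis=0)
    d = X.shape[1]
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for j in np.unique(labels):
        Xj = X[labels == j]
        cj = Xj.mean(axis=0)
        S_B += len(Xj) * np.outer(cj - xbar, cj - xbar)   # between-class scatter
        S_W += (Xj - cj).T @ (Xj - cj)                    # within-class scatter
    # inv(S_W) S_B is generally non-symmetric, so use the general solver.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return v / np.linalg.norm(v)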

3.4 Kernel Fisher discriminant analysis
The recent success in discriminant analysis has arisen from a generalization based on reproducing kernels [BA00, NS02]. This extension can be viewed as a multi-class kernel Fisher discriminant, which enables a high discriminating power over input classes with complex structure. The kernel discriminant first maps the input feature vectors into some new feature space F in which the different classes are better separable. A linear discriminant is then computed to separate the input classes in F. This process implicitly constructs a nonlinear discriminant hypercurve in the original feature space.

Figure 3.3. Two cases of separability: linearly separable and linearly inseparable

Let \Phi(x) be the nonlinear mapping from the input feature space to some high-dimensional Hilbert space F. The goal is to find the projection line v in F such that the F-ratio validity index

J(v) = \frac{v^T S_B^{\Phi} v}{v^T S_W^{\Phi} v}   (3.3)

is maximized, where S_B^{\Phi} and S_W^{\Phi} are the between-class and within-class covariance matrices in F. Since the space F is of very high or even infinite dimension, computing the mapping \Phi(x) directly is infeasible. A technique to overcome this difficulty is the Mercer kernel function k(x, y) = \langle \Phi(x), \Phi(y) \rangle, which is the dot product in the Hilbert feature space F. It is known that under some mild conditions on S_B^{\Phi} and S_W^{\Phi}, any solution v \in F maximizing (3.3) can be written as a linear span of all mapped data samples [MRW03]:

v = \sum_{i=1}^{N} \alpha_i \Phi(x_i)   (3.4)

Fortunately, the Mercer kernel k(x, y) makes it possible to reformulate the F-ratio validity index as

J(v) = \frac{\alpha^T A \alpha}{\alpha^T B \alpha}   (3.5)

where A and B are N \times N matrices:

A = \sum_{j=1}^{M} n_j (\mu_j - \mu)(\mu_j - \mu)^T, \quad B = K K^T - \sum_{j=1}^{M} n_j \mu_j \mu_j^T

where K is the kernel matrix such that K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle, \mu = K \mathbf{1} / N and \mu_j = K \mathbf{1}_j / n_j, where \mathbf{1}_j \in \{0, 1\}^N are the membership vectors corresponding to the class labels, and \mathbf{1} is the vector of all ones. The projection of any data sample x onto the discriminant subspace is given by the inner product

\langle v, \Phi(x) \rangle = \sum_{i=1}^{N} \alpha_i k(x_i, x)   (3.6)

Formally, the KFD problem is to find the leading eigenvector of B^{-1} A. As the dimension of F is higher than the number of data samples N, and B is a highly singular N \times N matrix estimated from only N samples, some form of regularization is necessary. The simplest way is to add a multiple of the identity matrix to B; namely, the matrix B is replaced by B_{\beta} = B + \beta I. This makes the problem numerically stable, since B_{\beta} becomes more positive definite for large \beta. It is also roughly equivalent to adding independent noise to each of the kernel bases. However, in practice, the matrices B and A are too large in size. Since an N \times N matrix eigenvalue problem has O(N^3) time complexity, estimating a kernel discriminant in terms of (3.4) is intractable for large N. A common remedy is to restrict the solution to a subspace of F, namely, to use a partial expansion instead:

v = \sum_{i=1}^{l} \alpha_i \Phi(x_i), \quad l \ll N   (3.7)

A practical implementation of KFD will be discussed in Chapter 5.

4. Implementation of K-means clustering
K-means clustering is one of the most popular techniques in unsupervised learning [BH67, DH73], of which the main objective is to classify the underlying data samples into disjoint clusters so as to minimize the MSE objective function. However, it often suffers from the common problem that the clustering procedure converges to a local minimum. To circumvent this limitation, we study K-means clustering mainly from two implementation aspects: the selection of an initial solution and the choice of a dissimilarity function. Our intention is to investigate a clustering scheme without changing the main optimization procedure of K-means clustering. Thus, in the sequel, we concentrate on implementations of K-means clustering that incur no additional time complexity.

4.1 Related work and background
The local optimality of K-means clustering occurs not only on datasets with a high number of clusters but also on noisy datasets. In general, K-means clustering itself is one type of gradient descent method, and thus, inevitably, the quality of the resulting clusters is sensitive to the initial solution. This weakness can be clearly seen from the Voronoi diagrams of the K-means clusterings starting with three different initial guesses in Figure 4.1. Many k-center clustering algorithms have succeeded in seeking an alternative optimization scheme to circumvent this local optimality. Remarkable examples include the K-median clustering algorithm [BMS97], the K-harmonic clustering algorithm [Zhan01] and the adaptive K-means algorithm [CS95]. The K-median algorithm searches for each cluster centroid among the data samples such that the centroid minimizes the overall distance from all data points in the cluster to the cluster centroid. Moreover, the median idea can be generalized in a practical implementation of fuzzy clustering [Ker97], the well-known fuzzy C-median algorithm.

The exact evaluation of the fuzzy median vectors always involves an expensive ordering of the data samples, but an approximate solution is resolvable in practice. Even though the local optimality has been remedied in many cases, an improper choice of the initial solution can still influence the quality of the clusters significantly. In addition, any partial solution obtained by approximation is not sufficient to circumvent the local optimality completely. Analogously to K-means clustering, the EM clustering algorithm is often criticized [BFR99, RG99] for its local optimality due to an improper choice of the initial partition. Moreover, the median approaches are obviously not applicable in the framework of EM clustering. The simplest departure from this limitation [OO02] is to pick the initial solution by using other clustering techniques. Since K-means clustering aims at the minimization of the MSE distortion, it can be tackled by any useful gradient descent method. One faithful approach is to apply hard competitive learning [CS95, FK99] with an adaptive learning rate, which claims a faster convergence to a global minimum. But that enhancement is restricted by the assumption that the underlying clusters must have the same variance when the number of clusters is sufficiently large. For this purpose, the kernel K-means can be a gentle choice [Gro02] for obtaining the global minimum. The kernel K-means expresses its distance function in the form of a kernel product of two samples in some high-dimensional space, where the data samples are more separable. Namely, it solves the k-center clustering problem in a reproducing kernel space instead of the original feature space. However, the implementation of the kernel approach needs to maintain an N x N matrix. Another group of K-means clustering algorithms [PM99, LVV03, P3] chooses the initial solutions by using a k-d tree structure. In effect, this is equivalent to using a divisive clustering scheme to initialize the cluster centroids. Even if in some cases one of these clustering algorithms [KMN02] can be accelerated in the sense of enhancing the convergence to the global optimum, we argue that the optimality of the initialized solutions is quite sensitive to the ordering of the nodes in the tree structure.

Figure 4.1. Three K-means clusterings (a)-(c) obtained from different initial guesses or solutions

4.2 Selection of the initial solution
We have noted that the optimization of k-center clustering problems in a d-dimensional vector space has been proved to be NP-hard in k. However, in a one-dimensional scalar space, the clustering problem is equivalent to a scalar quantization problem, of which the globally optimal k-level quantizer over N samples can be computed by using dynamic programming. This result was first reported by Bruce [Bruc64], and later improved for convex cost functions by Sharma [Shar78]. Furthermore, the optimal scalar quantizer can be constructed in O(N^2) time for a wide class of cost functions with the so-called Monge property [AKM87, AST94], which can be implemented by a matrix searching algorithm [WZ93]. Since the dynamic programming technique provably yields a global minimum for the scalar quantization problem, our intention is to reduce the initialization of clustering to some one-dimensional feature space. In other words, the initial partition can be obtained by using dynamic programming over a one-dimensional subspace instead. For this purpose, most feature extraction techniques can be utilized in practice. A quite obvious way is to apply PCA to obtain the principal direction of the input data, implicitly forming a scalar feature space.

In a similar manner to the work on color quantization in [Wu92], such an initialization of K-means clustering is naively straightforward. Namely, the initial solution can be estimated by using dynamic programming over the principal direction of the input data, which leads to only a slight gap between the initialized partition and the global optimum. As principal component analysis provides at most d scalar features, we can obtain at most d initial solutions. However, for a large-scale data set, the resulting d initialized solutions might still be distant from the global optimum. In this sense, the scalar features can be extracted by alternative approaches, as in [P4, P5]. For instance, one may iteratively use the multi-class linear Fisher discriminant analysis (LFD) [FS75] to obtain the one-dimensional subspace in K-means clustering. Since LFD is a classification technique, input classes are needed. But one can select the output partition of K-means clustering as the input classes of LFD, from which the Fisher discriminant can be constructed. Once the Fisher discriminant is constructed, an optimal partition of the input data in the discriminant subspace can be obtained by using dynamic programming. This classification procedure can be conducted iteratively in the framework of K-means clustering. Namely, in each iteration, LFD is first conducted, and then a suboptimal initial partition for K-means clustering is estimated by using dynamic programming in the discriminant scalar space. Once the suboptimal initial solution is chosen, the conventional K-means algorithm can be performed. Note that the input classes of LFD in each iteration are given by the output partition of the K-means clustering of the former iteration. It can be observed from the experiments in [P4] that the iterative K-means clustering by Fisher discriminant analysis is superior to the PCA based K-means algorithm.
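The iterative loop just described can be sketched compactly; the sketch below reuses the fisher_direction routine from section 3.3 and the optimal_scalar_quantizer routine sketched in section 4.3 below, and assumes for simplicity that no cluster empties during refinement.

import numpy as np

def lfd_dp_init(X, labels, k):
    # One round of the scheme in [P4]: fit the Fisher direction to the
    # current partition, then cut the 1-D projections optimally by
    # dynamic programming.
    v = fisher_direction(X, labels)
    z = X @ v
    order = np.argsort(z)
    _, cuts = optimal_scalar_quantizer(z, k)
    new_labels = np.zeros(len(z), dtype=int)
    for t in cuts:                    # each cut raises the label of the tail
        new_labels[order[t:]] += 1
    return new_labels

def iterative_lfd_kmeans(X, k, rounds=3, kmeans_iters=10):
    labels = np.random.default_rng(0).integers(0, k, size=len(X))
    for _ in range(rounds):
        labels = lfd_dp_init(X, labels, k)
        for _ in range(kmeans_iters):  # conventional K-means refinement
            C = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    return labels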

In addition to discriminant analysis, a more general approach is to extract a nonlinear feature from the input data. For instance, most kernel machines are able to extract an ideally descriptive feature for linearly inseparable data. The kernel PCA is one such technique for obtaining a highly correlated feature for the input data. Naturally, the extraction of a kernel feature implicitly reveals the nonlinear spatial structure of the input data with most merit in the principal direction. Likewise, once the kernel feature is extracted, one can apply dynamic programming to attain a suboptimal initial solution. It has been demonstrated experimentally in [P5] that this kernel PCA based K-means clustering also outperforms the linear PCA based clustering algorithm. We remark that the time complexity incurred by the principal component analysis and the linear Fisher discriminant analysis is at most O(d^2 N). Thus, performing the two linear extractions does not impose an additional time complexity on K-means clustering. However, the computation of one kernel principal component poses a great difficulty due to its O(N^3) computational complexity. Even if the kernel principal component can be constructed by the expectation-maximization algorithm [RG01] in O(pN^2) time, where p is the number of features extracted, it is obviously computationally more expensive than the K-means clustering itself. A feasible approach is to use a subset S of the entire dataset X instead, as in [SS00], to construct the leading kernel principal component. The subset S can be incrementally sought by adding one data vector y to S at a time. The selection of a new vector y can be done in a greedy manner; namely, y is the training vector in a randomized subset U of X \ S that minimizes the MSE distortion of K-means clustering along the newly formed principal component after y is added to S. The proper size of U was shown to be 59 [SS00] in order to obtain nearly as good a performance as if the search were conducted through the whole of X \ S. We should note that once the leading kernel component has been constructed over S at each incremental stage, dynamic programming can be utilized to obtain an initial clustering solution, followed by performing the K-means procedure. This partial estimation of the kernel principal component involves a slender time complexity of O(N l^4), where l is the size of S.

4.3 Dynamic programming
Dynamic programming is a technique for computing recurrence relations efficiently by storing partial results, which dramatically reduces some optimization problems to polynomial time.

An optimal scalar quantizer can be constructed by dynamic programming for a wide class of cost functions [WZ93] with the monotonicity property. Fortunately, many convex cost functions obey this elegant property, which guarantees the uniqueness of the optimal quantizer (e.g., the MSE cost function). Furthermore, the monotonicity of the cost function forms the optimal quantizer as a convex partition of the scalar space, which also guarantees that the recursive procedure of dynamic programming converges globally to the optimal quantizer. We briefly review the dynamic programming scheme in the following. Let Y = {y_1, ..., y_N} be a sorted sequence of scalar data samples. We can build an optimal scalar quantizer over Y by dynamic programming:

E_k[Q[1, n]] = \min_{k-1 \le t \le n-1} \left\{ E_{k-1}[Q[1, t]] + Err[t+1, n] \right\}   (4.1)

where t (k-1 \le t \le n-1) is a cutting point and Err[t+1, n] = Err(y_{t+1}, y_n) is the error distortion over (y_t, y_n]. Fortunately, the MSE distortion over a single interval (a, b] can be computed recursively in linear time [WZ93]. Formally, this can be seen by rewriting the MSE distortion Err(a, b) as

Err(a, b) = \int_a^b p(y) (y - E_{a,b}[y])^2 \, dy = M_2(b) - M_2(a) - \frac{(M_1(b) - M_1(a))^2}{M_0(b) - M_0(a)}   (4.2)

where M_m is the accumulative m-th moment of the data samples Y with probability density p(y), and

E_{a,b}[y] = \frac{M_1(b) - M_1(a)}{M_0(b) - M_0(a)}

The dynamic programming in (4.1) leads to an O(kN^2) time complexity. However, the dynamic programming problem established above [WZ93] can be converted to a matrix search problem if the cost function satisfies the Monge property:

Err(y_i, y_{j+1}) + Err(y_{i+1}, y_j) \ge Err(y_i, y_j) + Err(y_{i+1}, y_{j+1})

whereby a general result in [AKM87] ensures a matrix search algorithm with O(kN) time.
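A direct O(kN^2) rendering of the recurrence (4.1) is given below; it uses prefix sums in the role of the accumulative moments of (4.2) so that each interval distortion costs O(1), while the O(kN) matrix-searching speedup is omitted. The names are illustrative; the returned cut points index the sorted samples.

import numpy as np

def optimal_scalar_quantizer(y, k):
    # MSE-optimal k-level quantizer of scalar samples via recurrence (4.1).
    y = np.sort(np.asarray(y, dtype=float))
    N = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))        # first moments
    s2 = np.concatenate(([0.0], np.cumsum(y * y)))    # second moments

    def err(i, j):
        # MSE of grouping the sorted samples y[i:j] around their own mean.
        n = j - i
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / n

    E = np.full((k + 1, N + 1), np.inf)
    cut = np.zeros((k + 1, N + 1), dtype=int)
    E[0, 0] = 0.0
    for q in range(1, k + 1):
        for n in range(q, N + 1):
            for t in range(q - 1, n):
                cand = E[q - 1, t] + err(t, n)
                if cand < E[q, n]:
                    E[q, n], cut[q, n] = cand, t
    bounds, n = [], N
    for q in range(k, 0, -1):          # trace the optimal cutting points back
        n = cut[q, n]
        bounds.append(n)
    return E[k, N], sorted(bounds)[1:]  # total distortion, interior cuts

dist, cuts = optimal_scalar_quantizer(np.random.default_rng(4).normal(size=60), 4)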

We remark that the dominant time complexity is the computation of the distortion Err(a, b) for all possible intervals (a, b], i.e., O(N^2), even if the matrix searching algorithm itself can be done in O(kN) time. In the case of the MSE cost function, however, the error distortion Err(a, b) can be computed in linear time in the preprocessing stage according to (4.2). We immediately see that the dynamic programming technique then solves the scalar quantization problem of minimizing the MSE distortion in O(kN) time. Thus, estimating the initial solution by dynamic programming does not incur an additional time complexity for K-means clustering.

4.4 Dissimilarity revisited
We have conceptually reviewed the stochastic complexity in Chapter 2. The stochastic complexity measures the goodness of how a given model fits the data to be compressed. It is posited as a powerful metric for deciding the optimal number of clusters, eventually serving as a good clustering validity index. It has been shown in [FGG00] that the CL distance is an intrinsic dissimilarity for minimizing the stochastic complexity. The computation of the CL distance in (2.2) can certainly be simplified by maintaining two k x d matrices for the entropies \log_2(1 - c_{j,s}) and \log_2 c_{j,s}. However, this does not allow for a reduced number of searches for the nearest cluster. For the Euclidean distance, a reduced number of searches is obtainable through the triangular inequality technique [CH91, KFN00]. This criterion seemingly applies to the computation of the CL distances due to the convexity of the function d(x | c_j) = d_{CL}(x, c_j) + \log_2(n_j / N) in c_j, which can be verified by checking that the Hessian matrix satisfies \nabla^2 d(x | c_j) \succeq 0. A direct application of the convexity is able to yield a sufficient condition for avoiding redundant distance calculations. In addition to the CL distance, we now review a heuristic dissimilarity termed the SC-distance. The dissimilarity function can be defined by a design scheme, which is derived directly from the change of the objective function before and after classifying a given data sample into one cluster.

Figure 4.2 displays the design scheme of deriving the SC-distance by moving the underlying vector and calculating the change of the resulting stochastic complexity. The design approach can capture the desirable changes of the resulting clusters, and it is extendable in the sense of minimizing other objective functions. Thus, it can be envisioned as an intrinsically heuristic dissimilarity for any partition-based clustering method.

Figure 4.2. The design diagram of the SC-distance: moving x_4 from cluster G_1 to G_2, where d(x_4, G_1) is the decrease and d(x_4, G_2) the increase of the stochastic complexity

In principle, one may desire a gentle dissimilarity sufficient to survey the process of clustering. Eventually, we were inspired to design a dissimilarity function that reveals the major change of the cost function during the clustering procedure. An intuitive way is to induce the dissimilarity by moving a given data vector from one cluster to another, assuming that a sensible movement always leads to a desirable decrease of the cost function. This also guides us to actualize the imaginary dissimilarity in the spirit of minimizing the stochastic complexity. Another motivation for designing this dissimilarity function is due in part to a singularity restriction of the CL distance such that 0 < c_j < 1. In order to design this dissimilarity, namely, by moving the underlying vector x from cluster i to cluster j, we can rewrite the change of stochastic complexity in (2.1):

\Delta SC(x; c_i, c_j) = SC(x, c_j) - SC(x, c_i)

where

SC(x, c_j) = (n_j + 1) H_d\!\left(\frac{n_j c_j + x}{n_j + 1}\right) - n_j H_d(c_j) + \frac{d}{2} \log_2(n_j + 1) - \frac{d}{2} \log_2 n_j

Since \log_2(n + 1) = \log_2 n + \frac{1}{n \ln 2} + O(n^{-2}), we obtain

SC(x, c_j) \approx (n_j + 1) H_d\!\left(\frac{n_j c_j + x}{n_j + 1}\right) - n_j H_d(c_j) + \frac{d}{2 n_j \ln 2}

Likewise, SC(x, c_i) can be deduced in a similar manner. We have investigated the SC(x, c_j) distance as a feasible dissimilarity function in P2, where it is superior to the L_2 distance and the CL distance. Again, the concavity of H_d in c led us to two further simplifications of SC(x, c_j):

SC_{\sup}(x, c_j) = H_d(x, c_j) + \frac{d}{2 n_j \ln 2}   (4.3)

and

SC_{\inf}(x, c_j) = H_d\!\left(x, \frac{n_j c_j + x}{n_j + 1}\right) + \frac{d}{2 n_j \ln 2}

The approximation in (4.3) implies that the computation of SC_{\sup} is as expensive as that of the CL distance. Clearly, we can derive the bound of this approximation by the Taylor expansion:

SC_{\sup}(x, c_j) - SC_{\inf}(x, c_j) = H_d(x, c_j) - H_d\!\left(x, \frac{n_j c_j + x}{n_j + 1}\right) = \frac{1}{n_j + 1} (x - c_j)^T m (x - c_j)

where m is a diagonal matrix such that

m_{ss} = -\lambda_s \frac{\partial^2 H_d(y)}{\partial y_s^2} \bigg|_{y_s = \lambda_s c_{j,s} + (1 - \lambda_s) b_s}, \quad b_s = \frac{n_j c_{j,s} + x_s}{n_j + 1}, \quad \lambda_s \in (0, 1)

Although the approximation in (4.3) allows for a simpler computation, it is restricted by the same condition that 0 < c_j < 1. Once the SC(x, G_j) distance is approximated, the searches for the nearest cluster to x can be accelerated by the convexity of H_d(x, c_j) in c_j. Regardless of the accelerated searches above, computing the SC-distances still leads to a severely overburdening time complexity [P2] due to the calculation of logarithms. However, this computational difficulty can be circumvented by precomputing \log_2 n for 1 \le n \le N and storing the values in an array. We have extended the above design approach to the minimization of the MSE cost function in K-means clustering [P3, XF04]. Supposing a training vector x is moved from cluster i to cluster j, the change of the MSE function [Spa80] incurred by this movement is

\Delta v(x; i, j) = \frac{n_j}{n_j + 1} \|x - c_j\|^2 - \frac{n_i}{n_i - 1} \|x - c_i\|^2   (4.4)

The first part on the right-hand side, which represents the increase of the overall variance of cluster j incurred by this move, can be termed the addition cost. The second part, representing the decrease of the overall variance of cluster i, can be termed the removal cost. We can conceptualize the change of cluster variance as the Delta-MSE dissimilarity:

d_{MSE}(x, c_j) = w_j \|x - c_j\|^2, \quad w_j = \begin{cases} n_j / (n_j + 1), & j \ne \pi_x \\ n_j / (n_j - 1), & j = \pi_x \end{cases}

The dissimilarity d_{MSE}(x, c_j) can be interpreted as the variance between cluster j and the single-point cluster {x}, or as the merge cost of combining cluster j with the single point x.
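A tiny sketch of the Delta-MSE rule follows directly from the definition above; the names are illustrative, and a singleton current cluster (n = 1) would need guarding in real use.

import numpy as np

def delta_mse(x, c, n, same_cluster):
    # Weighted squared distance of (4.4): removal factor n/(n-1) for x's
    # own cluster, addition factor n/(n+1) for any other cluster.
    w = n / (n - 1.0) if same_cluster else n / (n + 1.0)
    return w * float(np.sum((x - c) ** 2))

def best_cluster(x, centroids, sizes, cur):
    # Staying put costs the removal term; moving costs an addition term.
    # The argmin therefore moves x exactly when the move lowers the MSE.
    costs = [delta_mse(x, c, n, j == cur)
             for j, (c, n) in enumerate(zip(centroids, sizes))]
    return int(np.argmin(costs))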

Fortunately, the triangular inequality [CH91] also allows a number of computations of the Delta-MSE dissimilarities to be avoided under a sufficient condition given in [P3]. We experimentally demonstrated in [P3] that the dissimilarity is superior to the standard L_2 distance. Moreover, the calculation of the Delta-MSE dissimilarities does not impose any additional time complexity if each n_j / (n_j - 1) and n_j / (n_j + 1) is precomputed and stored in two arrays in each iteration. Since the Delta-MSE dissimilarity is analytically derived from the MSE distortion, it immediately enjoys a heuristic property: if the removal cost d_{MSE}(x, c_i) in (4.4) exceeds the addition cost d_{MSE}(x, c_j), the MSE distortion can be reduced by moving x from cluster i to cluster j. This property also holds [P3] in the minimization of the F-ratio validity index. Let us normalize the Delta-MSE dissimilarity in the sense of clustering separability, expressly,

d_{F-ratio}(x, c_j) = \frac{d_{MSE}(x, c_j)}{var(G_j, x)}   (4.5)

where

var(G_j, x) = \begin{cases} var(G_j) / (n_j - 1), & j \ne \pi_x \\ var(G_j \setminus \{x\}) / (n_j - 2), & j = \pi_x \end{cases}

and var(S) represents the overall variance of the subset S of X. Insightfully, one may view (4.5) as the F-ratio validity index between the cluster G_j and the single-point cluster {x}. The merit of addressing such a separability-based dissimilarity not only hinges on its simplicity of implementation but also lies in its characterization of the dynamic structure of the cluster variance. Thus, equally, any calculation of the dissimilarity between x and G_j embodies an assumption on the homogeneity of the variances amongst the clusters. Although the F-ratio itself is sensitive to any departure from normality and to the appearance of outliers, calculating the F-ratio dissimilarity for a given data pattern with respect to each possible cluster implicitly enrolls a measurement for detecting outliers.

For instance, one may define an outlier in the sense that the dissimilarities between the given data pattern and all dense clusters exceed an empirical threshold. However, the minimization of the overall F-ratio dissimilarities leads to a non-closed-form expression for the cluster centroids, which is infeasible in K-means clustering. Since the F-ratio dissimilarity measures the homogeneity and heterogeneity of clusters simultaneously, a useful application of it is to estimate the unknown number of clusters. For example, incorporating the F-ratio dissimilarity of (4.5) into the modified basic sequential clustering algorithmic scheme (MBSAS) [TK99] makes it possible to detect the number of clusters automatically.

5. Context quantization
The essential figure of merit for data compression hinges on the compression ratio, the ratio of the size of a compressed file against that of the original uncompressed file. Lossless data compression is utilized when the data has to be decompressed exactly as it was before compression. For instance, a lossless image compression scheme encodes and decodes the data perfectly, and the resulting image matches the original image exactly. There is no degradation or loss of information in the compression process. Formally, a common task in losslessly compressing a discrete source X_0, X_1, X_2, ... is the estimation of the conditional probabilities P(X_t | X^{t-1}), where X^{t-1} = X_0, X_1, X_2, ..., X_{t-1} is the prefix of X_t, which is called the context. The idea of context has been intensively utilized either in coding a Markov source or in defining a regression model for predictive coding. Given a class of source models, the number of parameters must be carefully chosen following the principle of minimum description length [Ris83]. For example, a larger number of parameters in the statistical model, with an associated model cost, could trade off the entropy savings. In coding a discrete sequence of symbols, this cost can be interpreted as capturing the penalties of the so-called context dilution, which occurs when the count statistics must be spread over too many contexts, thus affecting the accuracy of the corresponding estimates. Context quantization [Wu97] is one technique to overcome this difficulty in context-based entropy coding. In this chapter, we study the context quantization problem in lossless image compression. In the first subsection, we address the main task motivating the succeeding research on context quantization. Section 5.2 characterizes the problem formulation of context quantization. In sections 5.3 and 5.4, we outline the two basic preprocessing stages of context quantization in lossless image compression: predictive coding and context modeling. Section 5.5 studies context quantization in a framework of data clustering according to the Kullback-Leibler distance.

The main results of this research, i.e., the context quantizer design algorithms based on the multi-class linear Fisher discriminant and the kernel Fisher discriminant, are presented in section 5.6.

5.1 Motivation
The pioneering solution to universal source coding is the definition of context [Ris83], which dynamically selects a variable-order subset of the past samples in X^{t-1}, i.e., a context C_t. The source coding structured by the context has been proven to be universal under some asymptotic assumptions. The most popular solution is the context tree weighting technique [WST95], which weights the probability estimates associated with different branches of a context tree to obtain a better estimate of P(X_t | X^{t-1}). Although the tree-based context modeling techniques [AYA97, Eks96] have attracted considerable interest in text compression, their application to image compression demands a faithful scheduling of the two-dimensional image signals into a one-dimensional sequence. In particular, Mrak et al. investigated how to optimize the ordering of the context parameters within the context trees [MMW03]. This is the so-called growing, reordering and selection by pruning algorithm (GRASP), which is designed in a spirit of context shape optimization. Thus, it serves either as a coding machine achieving the minimum code length on the fly or as a design scheme for seeking an offline optimal context model. The context shape optimization implicitly tackles the context dilution problem by optimally pruning the context tree according to the model cost of the terminal nodes. We can view the above design schemes from another perspective by regarding them as tree-based context clustering for minimizing the description code length. In other words, the context shape optimization scheme is inevitably equivalent to quantizing the contexts in a binary-valued hypercube so as to achieve the minimum conditional entropy. This differs from the prior models with fixed complexity chosen by most image/video compression algorithms.

The application of those prior context models is restricted by a good estimate of domain knowledge, such as the correlation structure of the samples and the typical input sequence length. For instance, the JBIG standard for binary image compression uses the contexts of a fixed-size causal template [CCI92]. The actual coding is implemented by sequentially applying entropy coding based on the estimated conditional probabilities P(X_t | C_t). Estimating the conditional probabilities directly using count statistics from past samples can incur a severe context dilution problem if the symbol alphabet is large, which has motivated the succeeding research on context quantization [Che04, FWA04, P7]. Thus, context quantization reduces the resolution of the causal contexts in the spirit of minimizing the expected resulting redundancy. The following section presents a descriptive formulation of context quantization.

5.2 Problem formulation
Formally, context quantization is one form of vector quantization, since the context C is a random vector in the d-dimensional Euclidean space E^d, i.e., the context model has order d. Naturally, the objective of optimal context quantization should be the minimization of the conditional entropy H(X | Q(C)). Since the convexity of the entropy function H implies H(X | Q(C)) \ge H(X | C), context quantization seeks to make H(X | Q(C)) as close to H(X | C) as possible for a given M, or to minimize the Kullback-Leibler distance:

D(Q) = H(X | Q(C)) - H(X | C)

We remark that the above H refers to the true source entropy, not to the actual code length, which would also include the model cost. Although the Kullback-Leibler distance (relative entropy) is not strictly a distance metric, for its violation of symmetry and the triangular inequality, the standard practice is to use it as a non-negative distortion of the context quantizer Q. A context quantizer Q is a mapping function partitioning the d-dimensional context space E^d into M subsets or coding states:

Q(c) = m \iff c \in A_m, \quad m = 1, \dots, M

Minimizing the Kullback-Leibler distance in context quantizer design leads to complex structures and shapes of the quantizer cells, which are in general not convex or even connected in the context space [WCX00]. Figure 5.1 plots the complex structure and boundaries of the MCECQ cells A_m formed in the context space, for the number of quantizer cells M = 3 and the binary source of the least significant bit of the differential pulse code modulation (DPCM) errors of the image Cameraman. Nevertheless, their associated sets of probability density functions (pdfs) B_m = \{ P_{X|C}(\cdot | c) : c \in A_m \} are simple convex sets in the probability simplex space of X, owing to a necessary condition for the minimum conditional entropy quantizer Q [FWA04]. The center of a cell B_m is the expected conditional probability

P(\cdot | Q(c) = m) = \frac{\sum_{c: Q(c) = m} P(c) P(\cdot | c)}{\sum_{c: Q(c) = m} P(c)}

If X is a binary random variable, then the probability simplex is one-dimensional. Hence the quantizer cells B_m are simple intervals. Letting Z = P_{X|C}(X = 1 | c), the posterior probability of X = 1 as a function of the context c, be a random variable, the conditional entropy H(X | Q(c)) of a context quantizer Q can be expressed by

H(X | Q(c)) = \sum_{m=1}^{M} P(Z \in (q_{m-1}, q_m]) \, H(X | Z \in (q_{m-1}, q_m]), \quad 0 = q_0 < q_1 < \dots < q_M = 1

Thus, the minimal conditional entropy context quantizer (MCECQ) design equals a scalar quantization problem in Z, even though the context c is drawn from a d-dimensional vector space. The global minimum of H(X | Q(c)),

\{q_1^*, q_2^*, \dots, q_{M-1}^*\} = \arg\min_{\{q_m\}} \sum_{m=1}^{M} P(Z \in (q_{m-1}, q_m]) \, H(X | Z \in (q_{m-1}, q_m])

could be attained by using dynamic programming. Thanks to the so-called concave Monge property of the objective function, the MCECQ design problem can be solved in O(MN) time [GYZ98], where N is the number of raw, i.e., unquantized, contexts before quantization.

Figure 5.1. The complex distribution of the MCECQ cells A_m in the context space, for M = 3 and the source of the least significant bit of the DPCM errors of the image Cameraman

5.3 Predictive coding
Predictive coding is a standard technique for achieving a better coding efficiency when compressing a Markov source in lossless coding. Conceptually, it predicts the value of the current sample based on the past samples that the encoder or decoder has processed. Instead of coding each symbol in a memoryless fashion, this technique encodes the prediction residual, and thereby, eventually, the redundant information among the adjacent symbols is eliminated. Ideally, estimating the value of the current pixel X_t based on the past samples X^{t-1} should be done by using an adaptive model to exploit the redundant information inhabiting between X_t and X^{t-1}. However, the complexity constraints of compressing a long sequence rule out this possibility. Regardless of this reservation, a primitive edge detector is still desirable in order to approach the best possible prediction.

A common predictor is the lossless DPCM, whereby d samples within a causal context of the current sample are used to make a linear prediction of the sample's value, i.e.,

\hat{X}_n = \sum_{i=1}^{d} \alpha_i X_{n-i}

The prediction residual signal e_n is then constructed as the difference between the actual value and the predicted value:

e_n = X_n - \hat{X}_n

The coding procedure using the lossless DPCM predictor is illustrated in Figure 5.2. In lossless predictive coding, the differential signal typically has a greatly reduced variance in contrast to the original signal; thereby it is significantly less correlated and exhibits a stable histogram well approximated by a Laplacian (double-sided exponential) distribution. One major limitation of DPCM is that the predictor is fixed throughout the sequence of samples. Adaptive prediction [DL92] usually reduces the magnitude of the prediction residuals; thus, it imposes a greatly skewed distribution of source symbols, leading to a lower bit rate.

Figure 5.2. The coding procedure of the lossless DPCM encoder and decoder
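A minimal sketch of this predictor-residual loop is given below; it uses fixed coefficients and floating-point arithmetic (a real lossless codec would round the prediction to an integer), and the function names are illustrative.

import numpy as np

def dpcm_residuals(x, alpha):
    # X_hat[n] = sum_i alpha[i] * x[n-1-i]; emit e[n] = x[n] - X_hat[n].
    d = len(alpha)
    x = np.asarray(x, dtype=float)
    e = x.copy()                       # the first d samples pass through
    for n in range(d, len(x)):
        e[n] = x[n] - sum(alpha[i] * x[n - 1 - i] for i in range(d))
    return e

def dpcm_reconstruct(e, alpha):
    # The decoder mirrors the predictor, so x is recovered exactly.
    d = len(alpha)
    x = np.asarray(e, dtype=float).copy()
    for n in range(d, len(x)):
        x[n] = e[n] + sum(alpha[i] * x[n - 1 - i] for i in range(d))
    return x

x = np.cumsum(np.random.default_rng(5).integers(-2, 3, 100))  # smooth source
e = dpcm_residuals(x, alpha=[1.0])      # first-order predictor
assert np.allclose(dpcm_reconstruct(e, alpha=[1.0]), x)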

A well-known adaptive predictor in lossless image compression is the Context-based, Adaptive, Lossless Image Codec (CALIC) [WM97]. Even though most predictors can be designed adaptively, complexity issues in practice usually force the use of a fixed predictor exhibiting the nonlinear correlation of the source samples. For this purpose, the median predictor of the JPEG-LS standard [WSS00] consists in performing a nonlinear test to detect vertical or horizontal edges; see Figure 5.3. With a = left, b = above and c = above-left neighbors of the current pixel,

\hat{x} = \begin{cases} \min(a, b) & \text{if } c \ge \max(a, b) \\ \max(a, b) & \text{if } c \le \min(a, b) \\ a + b - c & \text{otherwise} \end{cases}

Figure 5.3. The median predictor of the JPEG lossless standard

In context quantization, the redundancy reduction by prediction minimizes the dependency between adjacent samples, and thus it produces a fairly universal source for context modeling. We will briefly review the context modeling phase for context quantization in the next section.
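The median predictor translates directly into a few lines of code; this is the standard MED rule as written above, with the three small checks serving as a usage example.

def med_predict(a, b, c):
    # JPEG-LS median (MED) predictor: a = left, b = above, c = above-left.
    if c >= max(a, b):
        return min(a, b)    # edge detected: predict the smaller neighbor
    if c <= min(a, b):
        return max(a, b)    # edge detected: predict the larger neighbor
    return a + b - c        # smooth region: planar prediction

assert med_predict(10, 20, 25) == 10   # c above both neighbors
assert med_predict(10, 20, 5) == 20    # c below both neighbors
assert med_predict(10, 20, 15) == 15   # otherwise a + b - c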

5.4 Context modeling
The joint bi-level image experts group (JBIG) standard for binary image compression uses context modeling with a fixed-size, high-order template [CCI92], whereby the contexts are selected from the pixels surrounding the coding sample. The context-based entropy coding is then conducted by sequentially applying arithmetic coding according to the estimated conditional probabilities. However, since the most common form of statistical redundancy in continuous-tone image data is the smoothness of the intensity function, a universal context modeling should be able to characterize the level of smoothness with greatest merit. This may require both an adaptive prediction scheme via context modeling and a context template defined by the local gradients surrounding a pixel. For instance, JPEG-LS utilizes a context template comprising three directional gradients such that the level of activity (e.g., smoothness and edginess) surrounding a pixel can be captured. Those gradients in the context modeling capture the statistical behavior of the prediction residuals. Likewise, the adaptive predictor, as shown in [Wu96], can adjust its parameters via context modeling based on the local gradients. However, in the sense of context quantization, the contexts conditioning a binary source coding can be selected from the gray-scale image space. The application of a high-order context model to offline context quantization might increase the possibility of over-fitting in the coding phase or lead to the problem of context dilution. Thus, in this research work, we incorporated a context model that consists of three gradients in a local window, c = (c_1, c_2, c_3):

c_1 = I(i-1, j) - I(i-1, j-1), \quad c_2 = I(i, j-1) - I(i, j-2), \quad c_3 = I(i-1, j-1) - I(i, j-1)

Since the context model above yields an intractably huge number of raw contexts, a natural product pre-quantizer

Q(c_i) = \begin{cases} q, & c_i \in (2^{q-1}, 2^q] \\ 0, & c_i = 0 \\ -q, & c_i \in [-2^q, -2^{q-1}) \end{cases} \quad 0 < q \le k_i

where k_i is the number of scalar quantization levels of gradient c_i (i = 1, 2, 3), is able to reduce the resolution of the contexts to a feasible scope. In this sense, the raw contexts are merged into equi-probable regions based on the assumption that each gradient obeys a geometric distribution. A theoretical advantage [WSS00] is that this merging of raw contexts attempts to maximize the mutual information between the sequence of image pixels and the sequence of contexts.
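As a sketch, the gradient context and its logarithmic pre-quantization can be written as follows; note that the exact neighbor offsets of the three gradients are an assumption here (the extraction above is a reconstruction), so treat them as illustrative.

import numpy as np

def prequantize(c, k):
    # Map each gradient to a signed logarithmic level: magnitudes in
    # (2^(q-1), 2^q] share level q, roughly equi-probable under a
    # geometric distribution of gradient magnitudes.
    c = np.asarray(c)
    q = np.zeros(c.shape, dtype=int)
    nz = c != 0
    lev = np.minimum(np.maximum(np.ceil(np.log2(np.abs(c[nz]))), 1), k)
    q[nz] = (np.sign(c[nz]) * lev).astype(int)
    return q

def context_vector(I, i, j, k=4):
    # Three local gradients over the causal neighborhood of pixel (i, j).
    c = (int(I[i - 1, j]) - int(I[i - 1, j - 1]),
         int(I[i, j - 1]) - int(I[i, j - 2]),
         int(I[i - 1, j - 1]) - int(I[i, j - 1]))
    return prequantize(c, k)

I = np.random.default_rng(6).integers(0, 256, size=(8, 8))
print(context_vector(I, 4, 4))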

5.5 Context clustering
The existing context quantizer design algorithms are divided into two approaches: those that form the coding contexts directly in the context space, as in [Che04, P6], and those that classify the contexts in the space of conditional probabilities [WCX00, GYZ98]. In the context space, one can apply the generalized Lloyd method to design a context quantizer by clustering the raw contexts of a representative training set according to the Kullback-Leibler distance [Che04]. Alternatively, as in [P6], the context quantization can be performed on the fly, but the side information and the model cost must then be saved in the compressed file. Namely, the index mapping each raw context to its coding state and the coding conditional probability P(\cdot | Q(c)) need to be coded and transmitted to the decoder. Most iterative gradient descent approaches converge only to a local optimum, but if the random variable X to be coded is binary, then the VQ problem of context quantization can be converted to a scalar quantization problem in the probability simplex space of P(\cdot | c). This change of space makes it possible to design a globally optimal context quantizer through dynamic programming and the matrix search algorithm [GYZ98] on a representative training set, which can be collected from a series of training images. Nevertheless, the context quantizer constructed in terms of a training set might not appropriately fit the statistical context model when coding a single image. Under some mild assumptions, this limitation can be circumvented by using an adaptive entropy coder. Even though the M-clustering of N contexts in coding a binary source can be computed in O(MN) time by dynamic programming, the authors of [GYZ98] did not detail how to compute the error distortion of the MCECQ quantizer, Err[t+1, n] = Err(c_t, c_n), or the Kullback-Leibler distance L(i, j), where 1 \le i, j \le n. We remark here that this Kullback-Leibler distance for arbitrary (i, j) can be computed in linear time by

L(i, j) = H^*(i, j) - H(i, j)

where H(i, j) is the self conditional entropy over (i, j],

H(i, j) = -\sum_{l=i+1}^{j} f_l \left( p_l \log_2 p_l + (1 - p_l) \log_2 (1 - p_l) \right)

and H^*(i, j) is the expected cell conditional entropy over (i, j],

H^*(i, j) = -F^*(i, j) \left( p^*(i, j) \log_2 p^*(i, j) + (1 - p^*(i, j)) \log_2 (1 - p^*(i, j)) \right)

where

F^*(i, j) = \sum_{l=i+1}^{j} f_l, \quad f^*(i, j) = \sum_{l=i+1}^{j} f_l p_l, \quad p^*(i, j) = \frac{f^*(i, j)}{F^*(i, j)}

We can derive the following equations:

H(i, j) = H(0, j) - H(0, i), \quad f^*(i, j) = f^*(0, j) - f^*(0, i), \quad F^*(i, j) = F^*(0, j) - F^*(0, i)

which can be computed accumulatively in O(n) time. Hence we conclude that if H(0, i), f^*(0, i) and F^*(0, i) are calculated in the preprocessing stage and maintained in three arrays, then any Kullback-Leibler distance L(i, j) = H^*(i, j) - H(i, j) over (i, j] can be computed in constant time.
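The constant-time evaluation above is easy to verify in code: the sketch below builds the three cumulative arrays once and answers any L(i, j) query in O(1). The array and function names are illustrative.

import numpy as np

def kl_tables(f, p):
    # f[l]: count of raw context l; p[l]: P(X=1 | context l), sorted by p.
    def h2(q):
        q = np.clip(q, 1e-12, 1 - 1e-12)
        return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))
    H = np.concatenate(([0.0], np.cumsum(f * h2(p))))        # H(0, i)
    fs = np.concatenate(([0.0], np.cumsum(f * p)))           # f*(0, i)
    Fs = np.concatenate(([0.0], np.cumsum(f, dtype=float)))  # F*(0, i)

    def L(i, j):
        F = Fs[j] - Fs[i]
        pbar = (fs[j] - fs[i]) / F
        return F * h2(pbar) - (H[j] - H[i])   # H*(i, j) - H(i, j)

    return L

rng = np.random.default_rng(7)
p = np.sort(rng.uniform(0.01, 0.99, 50))
f = rng.integers(1, 100, 50)
L = kl_tables(f, p)
assert L(3, 17) >= 0   # the Kullback-Leibler distance is non-negative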

5.6 Implementation of the quantizer mapping function
The implementation of an arbitrary quantizer mapping function Q [WCX00] has been regarded as an operational difficulty in the practical use of MCECQ. The simplest way of implementing Q is to use a look-up table. But since |C|, the number of all possible raw contexts, grows exponentially in the order of the contexts, building a huge table of |C| entries for Q is clearly intractable. Even if hashing functions can be utilized to avoid the excessive memory use of the Q table, this saving of memory is compromised by an increased time of the quantizer mapping operation whenever collisions occur in the table accesses. We note that the savings in memory use are sensible only when the actual number of different raw contexts appearing in an input image is much smaller than |C|. Furthermore, in order to approach a constant execution time of the quantizer mapping function, the size of the hashing table has to be larger than the actual number of distinct raw contexts by a sufficient factor. Thus, in practice, the table size needs to be comparable to the image size, since many raw contexts have a very low frequency of occurrence. Since mapping any raw context to an appropriate coding state is obviously a classification problem, a common technique to simplify the quantizer mapping function Q is projection based on Fisher's linear discriminant [Wu99]. When coding a binary source, the idea is to project the training context vectors onto the discriminant direction y such that the two marginal posterior distributions P_{C|X}(y^T c | X = 0) and P_{C|X}(y^T c | X = 1), c \in E^d, have maximum separation. Then the dynamic programming algorithm is used to form a convex M-partition of the corresponding one-dimensional projection space to minimize the conditional entropy H(X | y^T c \in (q_{m-1}, q_m]), which implicitly defines the context quantizer Q(c) = m iff y^T c \in (q_{m-1}, q_m], 1 \le m \le M. We can obviously see that such a projection scheme has an operational advantage in practice. The linear Fisher discriminant [Wu99] was used to separate the two posterior distributions P_{C|X}(c | X = 0) and P_{C|X}(c | X = 1), which is a two-class classification problem. Alternatively, we can seek to separate the M optimal MCECQ cells formed in the probability simplex space via the multi-class linear Fisher discriminant of (3.2). The goal here is to apply the discriminant classifier to form a convex partition in the projection subspace that best matches the optimal partition of the B_m's in the probability simplex space. However, the success of the linear discriminant is limited to cases where the input classes A_m are linearly separable to a certain degree. For the more difficult, linearly inseparable shapes of the context cells A_m, the kernel Fisher discriminant of (3.4) can serve as a gentle approximation of the MCECQ cells A_m. Motivated by the success of kernel-based learning machines [MRM01, MRW99, MRW00, MRW03, MSS01], we have proposed a new design technique for context quantizers using a multi-class kernel Fisher discriminant [P7]. The new design technique constructs the context quantizer Q in three steps.

In the first step, the dynamic programming algorithm is applied to design the MCECQ in the probability simplex space. This produces the MCECQ cells B_m, which implicitly constitute the input classes of KFD, i.e., the MCECQ cells A_m in the context space. Note that choosing an input class different from A_m might lead to an undesirable increase of H(X | \langle v, \Phi(c) \rangle \in (q_{m-1}, q_m]). Figure 5.4 presents the impact of two different input classes on the actual bit rate of coding the second least significant bit of the DPCM errors of the image Cameraman. Secondly, the kernel Fisher discriminant analysis is conducted to seek a kernel projection direction v in F, corresponding to a curve in the context space, in which the MCECQ cells A_m have maximum separation. Finally, after all projection values of the training contexts have been computed and put into a sorted list, the dynamic programming algorithm is used again to construct a convex partition of the projection subspace that minimizes the conditional entropy H(X | \langle v, \Phi(c) \rangle \in (q_{m-1}, q_m]), which implicitly defines another context quantizer Q(c) = m if and only if \langle v, \Phi(c) \rangle \in (q_{m-1}, q_m], 1 \le m \le M.

Figure 5.4. Comparison of the coding bit rates (bpp against the number of cell contexts) achieved by the two KFD based context quantizers constructed from two different input classes: the MCECQ cells, and the quantized context cells obtained by incorporating the KL distance into the GLA algorithm; the optimal MCECQ is plotted as a reference

As shown in section 3.4, the estimation of the kernel Fisher discriminant v is equivalent to solving the eigenvalue problem Ax = \lambda Bx at the expense of O(N^3) time complexity, which makes the training of the kernel discriminant intractable even over a small number of raw contexts.

A further restriction of using v is that coding each symbol with the context quantizer Q needs to manipulate one kernel projection according to (3.6), which involves O(N) operations. A possible solution, applicable to any choice of A and B, has been clearly discussed in [MRW03, MSS01]: namely, to restrict the discriminant v to a subspace of F, i.e., to estimate v in terms of the partial solution of (3.7). Correspondingly, two l x l covariance matrices A(l) and B(l) should be computed according to (3.7), where l \ll N. The design of the context quantizer using the partial solution of the KFD problem has been treated in [P7]. The selection of the basis vectors \{c_i, i = 1, \dots, l\} of (3.7) (a raw context c_i is equivalent to a feature vector x_i) can be done in a greedy manner, which has been studied in theory [SMB99] as the reduced set method for support vector machines. The kernel discriminant is incrementally constructed by adding one raw context as a new basis c_{l+1} at a time to the existing expansion, i.e., incrementing the dimensionality l by one at a time. Accordingly, the new matrix A(l+1) can be computed from the previous matrices A(l) and B(l):

A(l+1) = \begin{pmatrix} A(l) & d \\ d^T & a \end{pmatrix}, \quad d = \sum_{j=1}^{M} n_j (\eta_j - \eta)(\mu_j - \mu), \quad a = \sum_{j=1}^{M} n_j (\eta_j - \eta)^2   (5.1)

where \eta_j = K_{l+1} \mathbf{1}_j / n_j, \eta = K_{l+1} \mathbf{1} / N, and K_{l+1} is the (l+1)-th row of the full kernel matrix K. We note that K(l) is an l x N sub-matrix of K, and that \mu_j = K(l) \mathbf{1}_j / n_j and \mu = K(l) \mathbf{1} / N here denote the corresponding l-dimensional class means. In a similar manner, the new covariance matrix can be estimated as

B(l+1) = \begin{pmatrix} B(l) & e \\ e^T & b \end{pmatrix}, \quad e = K(l) K_{l+1}^T - \sum_{j=1}^{M} n_j \eta_j \mu_j, \quad b = K_{l+1} K_{l+1}^T - \sum_{j=1}^{M} n_j \eta_j^2   (5.2)

Likewise, the inverse of B(l+1) can be updated from B(l) and B(l)^{-1}:

B(l+1)^{-1} = \begin{pmatrix} B(l)^{-1} + g g^T / r & -g \\ -g^T & r \end{pmatrix}, \quad r = \frac{1}{b + \beta - e^T B(l)^{-1} e}, \quad g = r B(l)^{-1} e   (5.3)

where \beta is a regularization factor guarding against the singularity of the matrix B.
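The rank-one update of (5.3) is easy to check numerically; the sketch below extends a 4 x 4 inverse by one row/column and compares it against a direct inversion (the names and the test matrix are illustrative).

import numpy as np

def extend_inverse(B_inv, e, b, beta):
    # Block-matrix update of (5.3): append row/column (e, b) to B and
    # regularize the new diagonal entry by beta.
    r = 1.0 / (b + beta - e @ B_inv @ e)      # inverse Schur complement
    g = r * (B_inv @ e)
    top = np.hstack([B_inv + np.outer(g, g) / r, -g[:, None]])
    bottom = np.hstack([-g[None, :], np.array([[r]])])
    return np.vstack([top, bottom])

rng = np.random.default_rng(8)
M = rng.normal(size=(5, 5)); M = M @ M.T + 5 * np.eye(5)
beta = 0.01
B_inv = np.linalg.inv(M[:4, :4])
M_reg = M.copy(); M_reg[4, 4] += beta
assert np.allclose(extend_inverse(B_inv, M[:4, 4], M[4, 4], beta),
                   np.linalg.inv(M_reg))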

The updating of A(l+1), B(l+1) and B(l+1)^{-1} by (5.1)-(5.3) involves O(Nl) operations in total, whereas solving the eigenvalue problem for B(l+1)^{-1} A(l+1) takes O(l^3) time by singular value decomposition. In general, the classification performance of the kernel discriminant is proportional to the number of basis contexts l, as measured by the F-ratio validity index of (3.5). Hence the number of basis contexts plays a crucial role in achieving the desirable coding length. But the number of basis contexts l must be carefully chosen, since the use of a large l incurs an impractical increase of the time and space complexity of the coding stage, i.e., the kernel projection for coding one pixel takes O(l) time, and O(l) basis contexts need to be maintained in memory. We found out in the experiments that the gain in coding efficiency remains stable after the number of basis contexts l has been increased to 36. This can be viewed from Figure 5.5, which plots the average bit rate of coding the ten test images used in [P7] against the F-ratio validity index. The practical selection of a new basis context has also been described in publication P7. Although the kernel Fisher discriminant has given rise to context quantization, its desirable classification rate relies on the selection of the kernel parameter \sigma and the regularization constant \beta. A common practice is to minimize the cross-validation estimate of the misclassification errors with respect to these two model parameters. However, the application of KFD in context quantization seeks to obtain a minimized achievable coding rate, or the minimum conditional entropy. Hence we attempt to perform the cross-validation test toward minimizing the conditional entropy, instead of the misclassification rate.

Figure 5.5. An example curve reflecting the relation between coding efficiency (bit rate, bpp) and classification performance (F-ratio). The number of selected basis vectors ranges from 5 to 50, corresponding to F-ratio validity index values from 0.4 to 1.5

Likewise, the cross-validation test should be conducted according to the partial solution v of (3.7), even though some faster algorithms [CT03, FDB04] have been successfully proposed for the complete kernel discriminant of (3.4). The greedy approximation of v in [P7] involves only O(Nl^2) time complexity, thereby implicitly offering the k-fold cross-validation test with a lower computational complexity of O(kNl^2) + O(kNM) + O(kN(l+M)). In the cross-validation test for each pair (\sigma, \beta), the training set is randomly partitioned into k = 6 disjoint subsets \{T_i\}_{i=1}^{k} of equal size. Then the context quantization is performed by k rounds of training and validation, each time with a different set T_i adopted as the validation set and the remainder as the training set. Namely, k context quantizers are constructed, one for each of the k validation sets. The validation test aims at finding the two model parameters such that the averaged achievable conditional entropies are minimized. For the sake of an efficient and fair estimate of (\sigma, \beta), the regularization parameter \beta and the Gaussian parameter \sigma can be determined by using a series of Fibonacci searches in a two-dimensional fashion over the two intervals [0, 0.01] and [0.1 var, 2 var], respectively, where var is the variance over all training contexts.

Namely, for each \sigma appearing in the Fibonacci search for the Gaussian parameter (i.e., the first directional Fibonacci search), another Fibonacci search for the regularization parameter \beta (i.e., the second directional Fibonacci search) is executed such that the minimum conditional entropy is achieved.

6. Summary of the publications
In the first paper [P1], we present a simple genetic clustering algorithm for solving the clustering problem with an unknown number of clusters. The algorithm is implemented by using randomized swapping of one cluster centroid between the two parent solutions in the crossover stage. Hence each genetic crossover produces six new candidate solutions, on which a local repartition is subsequently conducted. The candidate solution with the best fitness is then selected into the next generation. An advantage of the proposed genetic algorithm is that it is able to detect the number of clusters automatically if provided with an appropriate clustering evaluation function. The heuristic mean-square-error function (HMSE) and the MDL function are applied as the distortion functions in evaluating the fitness of the clustering solutions. As comparison benchmarks, the generalized Lloyd algorithm and the randomized local search algorithm are investigated in the experiments. We experimentally observed that the algorithm is able to estimate an accurate number of clusters by using HMSE when the number of clusters is less than 5% of the size of the artificial datasets produced by the Gaussian mixture model. We argue that this is in part due to the fact that HMSE can fairly offset the decrease of the MSE when the number of clusters increases under some upper bound; in particular, an appropriately small number of clusters for the GMM model can guarantee a less overlapping structure of clusters. If the number of clusters exceeds 5% of the size of the dataset, HMSE fails to locate the number of clusters correctly. In this case, the MDL function can be utilized as an alternative evaluation function for detecting the accurate number of clusters, but the two parameters of the MDL function must then be chosen appropriately. Performance comparisons have shown that the genetic local repartition algorithm outperforms the two other comparative clustering algorithms in estimating the number of clusters. The algorithm thus presents a simple scheme for detecting the accurate number of clusters automatically while optimizing the locations of the clusters.

In the second paper [P2], we propose a new dissimilarity function, the SC-distance, for the clustering of binary vectors by minimizing the stochastic complexity. The dissimilarity function is defined as the difference of the stochastic complexity before and after moving a given vector from one class to another. Thus, it implicitly takes into account the change in the class distribution caused by the clustering procedure, and in this way it eventually avoids the infinity problem incurred by some specific types of clusters (e.g., single-point clusters). The infinity problem occurs frequently in minimizing the stochastic complexity by using the Shannon code-length distance. As a comparison, we study two other clustering distances: the L_2 distance and the revised CL distance. The performance of the three distance functions is investigated by using the generalized Lloyd algorithm and the randomized local search algorithm. It turned out from the experimental results that the SC-distance outperforms the two other distances in minimizing the stochastic complexity. The original CL distance must be reformulated carefully in order to avoid the infinity problems. However, we noticed in the experiments that the revised CL distance performs even worse than the L_2 distance in most cases. We also observed that the L_2 distance is effective when clustering simple binary-vector data. The paradigm of designing the SC-distance is general in nature, as it can also be extended to other cost functions. An essential superiority of the underlying distance is that each calculation of the SC-distance directly promises a heuristic descent direction of the stochastic complexity. In the third paper [P3], we present the Delta-MSE dissimilarity function between vectors and clusters in GLA based vector quantization minimizing the MSE distortion function. The Delta-MSE dissimilarity is defined as the change of the within-cluster variance before and after moving a given data vector from one cluster to another. Thus, the dissimilarity measurement automatically provides vector quantization with a heuristic direction in which the MSE function will decrease the most.

Although derived from the MSE function, the Delta-MSE dissimilarity is shown to also be applicable in the minimization of the F-ratio clustering validity index. The underlying dissimilarity is incorporated into the GLA algorithm, of which the initial codebook is selected as the bucket centers of a k-d tree structure developed by the nested PCA algorithm. When using the Delta-MSE dissimilarity, the GLA algorithm can also be accelerated by the triangular inequality elimination technique in a similar manner to the L_2 distance. Experimental results demonstrate that the Delta-MSE dissimilarity outperforms the L_2 distance. With an increasing codebook size, its performance gains over the L_2 distance improve. We have succeeded in studying the design scheme of a dissimilarity function based on moving a given vector from one cluster to another. The feasibility and success of the Delta-MSE dissimilarity have demonstrated that this design scheme can be applied as a simple approach to deriving a new dissimilarity measurement, in particular when the optimality of a traditional dissimilarity function is limited in minimizing the objective function. The design scheme directly drives more reassignments of the training vectors in the descent direction of the objective function, eventually providing a better convergence to the global optimum. In the fourth paper [P4], we propose a new approach to the k-center clustering problem by the iterative use of the linear Fisher discriminant analysis and the dynamic programming technique. The optimization of K-means clustering faces the common problem that its clustering performance, measured by the F-ratio validity index, is highly susceptible to the initialized solution. However, the globally optimal solution for k-center clustering in a one-dimensional space can be obtained by using dynamic programming in O(kN) time. For this purpose, the local optimality is remedied by iteratively incorporating Fisher discriminant analysis into the K-means clustering algorithm. Namely, at each iteration, the Fisher discriminant analysis is first conducted, of which the input classes are the output partition of the K-means clustering of the previous iteration.

Then an optimally initialized solution can be obtained by using dynamic programming in the discriminant subspace. Once the initial solution is determined, we apply the K-means algorithm again to refine the clustering solution. Experimental results show that the proposed approach outperforms the two comparative suboptimal K-means algorithms: the PCA based suboptimal K-means clustering algorithm and the kd-tree based K-means clustering algorithm. In particular, with an increasing number of clusters, its performance gains over the comparative K-means algorithms improve. Albeit the one-dimensional subspace can also be estimated by using principal component analysis, the best principal direction can be selected from only d principal components. Thus, the optimality of the PCA based approach is attainable only for a small-scale dataset. In contrast, the proposed Fisher discriminant based approach is a gentle departure from this limitation, as the iterations can be conducted up to any desirable number. In the fifth paper [P5], we propose a new scheme for the k-clustering problem based on the kernel PCA and the dynamic programming technique. The kernel PCA is a state-of-the-art feature extraction technique, which reveals the nonlinear spatial structure by mapping the input data into a higher-dimensional feature space. Hence the data samples become more separable in the nonlinear kernel principal direction, whereby the dynamic programming technique can be applied to form a convex optimal partition of the data samples. The convex partition can be treated as an initial K-means clustering solution approaching the global optimum. Since the kernel PCA provides as many principal components as there are data samples, the practical initial solution is chosen from one set of convex partitions produced over a given number of kernel principal directions. The Delta-MSE dissimilarity is then incorporated into the proposed K-means algorithm instead of the Euclidean L_2 distance. It turned out from the experimental results that the proposed approach is superior to the PCA based suboptimal K-means algorithm and the kd-tree based K-means algorithm. In particular, with an increasing number of clusters, its clustering performance improves relative to the two other algorithms.

We have thus advanced the suboptimal scheme for selecting the initial solution through kernel PCA. Applying dynamic programming along the nonlinear principal direction obtained by kernel PCA offers a suboptimal initial partition for K-means clustering. Even though an optimal partition of the data samples is attainable only in a one-dimensional kernel component subspace, the progress made by kernel PCA can be quite considerable when clustering large-scale data with complex shape and structure. The kernel PCA based approach can be viewed as either a classification or a data reduction technique, and is obviously not limited to the scope of the k-center clustering problem.

In the sixth paper [P6], we apply and examine context clustering for lossless image compression in the framework of JPEG-LS. We employ the JPEG-LS median predictor to produce the prediction error pixels. The causal context is defined as the vector formed by the three gradients over the neighboring pixels. In contrast to JPEG-LS, the three directional gradients were quantized with different codebook sizes (7, 9 and 9, respectively), and the contexts were then merged according to the quantized values. In the context space, the K-means algorithm was applied to cluster the merged contexts according to the Kullback-Leibler distance. The context clustering was implemented in an online training style, and thereby two types of side information must be encoded and transmitted in the compressed file: the index of the cluster membership and the probability conditioned on the cell context. To further reduce the side information, we process the error pixel in two parts, the higher four bits and the lower four bits, separately. For simplicity, we ignored the dependency between the higher four bits and the lower bits, which could penalize the actual coding rate significantly; the initial purpose of the paper was only to evaluate the performance of context clustering in lossless image compression.
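As a rough illustration of the clustering step just described, the sketch below assigns merged contexts, each represented by a histogram of its prediction errors, to the centroid with the smallest Kullback-Leibler distance, and recomputes each centroid as the normalized sum of its member histograms (a count-weighted average). The smoothing constant and all names are assumptions of this sketch, not details from [P6].

    import numpy as np

    def kl_distance(p, q, eps=1e-12):
        """Kullback-Leibler distance D(p || q) of two discrete distributions."""
        p = np.clip(p, eps, None)
        q = np.clip(q, eps, None)
        return float(np.sum(p * np.log(p / q)))

    def assign_contexts(hists, centroids):
        """Assign every context histogram to the KL-closest cluster centroid."""
        probs = hists / hists.sum(axis=1, keepdims=True)
        return np.array([[kl_distance(p, c) for c in centroids]
                         for p in probs]).argmin(axis=1)

    def update_centroids(hists, labels, k):
        """Centroid = normalized sum of member histograms; the +1 is
        additive smoothing (an assumption of this sketch)."""
        cents = [hists[labels == i].sum(axis=0) + 1.0 for i in range(k)]
        return np.array([c / c.sum() for c in cents])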

We also introduced an adaptive arithmetic coder whose probability density is updated from the past samples. Starting from a known frequency table for each symbol in the alphabet, the adaptive entropy coder updates the table according to the samples already encoded: the frequency of a symbol is decreased by 1 immediately after that symbol has been coded once, so that the model always reflects the symbols that remain to be coded. The adaptive arithmetic coder is able to yield a higher compression ratio. The context clustering approach allows an online implementation of such an adaptive entropy coder by exploiting the savings in probability density storage, i.e., only the probability density conditioned on the clustered context needs to be transmitted.

The experiments have shown that context clustering is an effective alternative for lossless image compression. A variable-size quantizer for each gradient could be more appropriate than a fixed size such as 9; we noticed in the experiments that when the quantizer size of the vertical gradient equals the sum of the quantizer sizes of the two horizontal gradients, the context clustering method achieves a better compression rate. Even though the dependency between the two parts of the prediction residuals is ignored, the context clustering method successfully approaches the JPEG-LS coding rate.
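A minimal sketch of the decrementing frequency model described above: the exact symbol counts are known in advance (they are part of the transmitted side information), and a count is decreased by one as soon as that symbol has been coded. Class and method names are illustrative, not from the dissertation.

    class DecrementingModel:
        """Adaptive frequency table that counts down from known totals,
        so the probabilities always track the not-yet-coded remainder."""

        def __init__(self, counts):
            self.counts = list(counts)     # exact per-symbol counts (side information)
            self.total = sum(counts)

        def probability(self, symbol):
            return self.counts[symbol] / self.total

        def code(self, symbol):
            # An arithmetic coder would emit about -log2(probability(symbol))
            # bits here before the counts are updated.
            assert self.counts[symbol] > 0, "symbol coded more often than counted"
            self.counts[symbol] -= 1
            self.total -= 1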

In the seventh paper [P7], we propose new algorithms for designing context quantizers in the context space based on the multi-class Fisher discriminant and the kernel Fisher discriminant. Optimal context quantizers for minimum conditional entropy can be constructed by dynamic programming in the probability simplex space. The main difficulty is the resulting complex quantizer mapping function in the context space, in which the conditional entropy coding is conducted. To overcome this difficulty, we simplify the quantizer mapping function Q through projections that map the input contexts onto the linear discriminant or the kernel discriminant curve. Dynamic programming is then used to form a convex M-partition of the corresponding one-dimensional projection space that minimizes the conditional entropy. In particular, since the kernel Fisher discriminant is able to describe linearly inseparable quantizer cells, it can be posited as a powerful tool for simplifying the quantizer mapping function.

The new algorithms outperform the previous binary-class Fisher discriminant method for context quantization. They succeed in approaching the minimum empirical conditional entropy context quantizer designed in the probability simplex space, but with the practical implementation of a simple scalar quantizer mapping function. The proposed kernel Fisher discriminant based method outperforms the linear Fisher discriminant based methods consistently on each test image, albeit its improvement over JPEG-LS is quite small. The small margin between the kernel Fisher discriminant based method and JPEG-LS indicates that the heuristic context quantizer of JPEG-LS is already very good compared with a heavily optimized one. We envision this work to be a useful benchmark for evaluating the quality of more practical context quantizers.

7. Conclusions

We have studied the problems of K-means clustering and context quantization, and have proposed and improved several clustering algorithms in the framework of K-means clustering.

We presented a simple genetic clustering algorithm for the case of an unknown number of clusters, implemented through the randomized swapping of one reference vector and a simple local repartition clustering algorithm. The genetic clustering algorithm is able to detect the number of clusters successfully with an appropriate clustering evaluation function.

We investigated the k-center clustering problem for binary data objects by minimizing the stochastic complexity. For this purpose, we proposed a new clustering dissimilarity function, the SC-distance, which outperforms the two other distances considered, the L2-distance and the Shannon code-length distance, when clustering binary data objects by minimizing stochastic complexity. We have remarked that the design scheme of the SC-distance is general in nature, as it can be extended to any other cost function. Indeed, we succeeded in extending the design scheme of the SC-distance to the minimization of the MSE distortion in solving the K-means clustering problem, which yielded another heuristic dissimilarity function, the Delta-MSE dissimilarity, superior to the traditional L2 distance.

We proposed two suboptimal K-means clustering algorithms that use linear Fisher discriminant analysis and kernel principal component analysis. The suboptimal K-means clustering algorithms are designed in the spirit of estimating an initial clustering solution that approaches the global optimum. Both algorithms select their initial clustering solutions by dynamic programming, either in the Fisher discriminant subspace or in the kernel principal component subspace. The proposed suboptimal K-means algorithms have been shown to outperform the previous suboptimal K-means clustering algorithms. We remind the reader that the usability of the two algorithms is not limited to small-scale datasets.

We studied and formulated the problem of context clustering, or context quantization, in lossless image compression in the framework of the JPEG-LS standard. The problem was formulated as K-means clustering in terms of the Kullback-Leibler distance and tackled by the generalized Lloyd method. The prediction residuals of JPEG-LS were then divided into two parts and coded separately using the clustered contexts. Even though the dependency between the two separated parts is ignored, the context clustering method is still successful in approaching the JPEG-LS coding rate.

We continued our progress in context quantization by proposing new context quantizer design algorithms based on the multi-class Fisher discriminant and the kernel Fisher discriminant. The quantizer mapping function is simplified by applying the dynamic programming technique over the multi-class linear Fisher discriminant and the kernel Fisher discriminant. Since the kernel Fisher discriminant is able to describe linearly inseparable quantizer cells, the kernel Fisher discriminant based design algorithm successfully approaches the optimal context quantizer designed in the probability simplex space, but with the practical implementation of a simple context quantizer mapping function.

The main contribution of this work is to investigate the problem of K-means clustering and the problem of context quantization by using different data reduction techniques and dynamic programming. However, the data reduction techniques used in this work only extract an appropriate feature or principal subspace from the input data globally, which assumes that the input data is homogeneously distributed. A future extension of this work is to incorporate local data reduction techniques (e.g. local PCA) into the conventional clustering and classification problems.

For example, one can choose the median vector from the subcluster centers formed by using PCA and dynamic programming on each cluster, which can efficiently reduce the computational complexity of the K-median algorithm. This design scheme can also be extended to online or sequential clustering algorithms (e.g. the MBAS algorithm) so as to offer more robustness against additive noise. A more direct extension is to apply local PCA first and then conduct the dynamic programming on each clustered component, but this imposes another difficulty: how to determine the number of subclusters for each cluster. We argue that the determination of the number of subclusters for each cluster can be approximately cast as an integer programming problem, which will be an interesting direction of our future work.


Publication P1

P. Fränti and M. Xu, "Genetic local repartition for solving dynamic clustering problems", Int. Conf. on Signal Processing (ICSP'02), Beijing, China, vol. 2, 28-33, August 2002.

[Figures 4-7 of the original article are not reproduced in this transcription; only the captions are recoverable:]
Figure 4: Numbers of clusters for Set 1 and Set 2 solved by the genetic local repartition and HMSE.
Figure 5: Performance comparisons of MSE, HMSE and MDL in the genetic local repartition.
Figure 6: Performance comparisons of the genetic local repartition, Random-NOC (RLS) and genetic GLA: (a) numbers of clusters; (b) HMSE.
Figure 7: Performance comparisons of RLS, the genetic GLA and the genetic local repartition: (a) numbers of clusters (NOC); (b) MDL.


Publication P2

P. Fränti, M. Xu and I. Kärkkäinen, "Classification of binary vectors by using ΔSC-distance to minimize stochastic complexity", Pattern Recognition Letters, 24 (1-3), 65-73, January 2003.


Pattern Recognition Letters

Classification of binary vectors by using ΔSC distance to minimize stochastic complexity

Pasi Fränti*, Mantao Xu, Ismo Kärkkäinen
Department of Computer Science, University of Joensuu, P.O. Box 111, FIN-80101 Joensuu, Finland
* Corresponding author. E-mail address: franti@cs.joensuu.fi (P. Fränti)

Received 7 October 2001; received in revised form 28 February 2002

Abstract

Stochastic complexity (SC) has been employed as a cost function for solving the binary clustering problem, using the Shannon code length (CL) distance as the distance function. The CL distance, however, is defined for a given static clustering only, and it does not take into account the changes in the class distribution during the clustering process. We propose a new ΔSC distance function, which is derived directly from the difference of the cost function value before and after the classification. The effect of the new distance function is demonstrated by implementing it with two clustering algorithms. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Clustering; Stochastic complexity; Algorithms; Distance functions

1. Introduction

Binary vector classification has been widely used in DNA computing and human chromosome study, and in solving taxonomy problems in the biomedical area. Statistical models are usually applied to describe and solve prediction and taxonomy problems. For example, Rissanen (1987, 1996) has introduced a model known as stochastic complexity (SC), which is an extensible explanation for Shannon information theory (Kontkanen et al., 1999). To serve as a cost function for classification, SC needs to be approximated by a simple model (Rissanen, 1987). Gyllenberg et al. (1994, 1997, 2000) have given a simple and practical approximation of SC for binary vector classifications. Thereafter, SC has been employed as a generic evaluation function in solving binary clustering problems as follows. The clustering problem is first formulated as an optimization problem. Approximate solutions are then found for every reasonable number of groups, and SC is applied for measuring the goodness of the various clustering results. The individual clusterings can be generated using any algorithm, such as the generalized Lloyd algorithm (GLA) (Linde et al., 1980) or the randomized local search (RLS) (Fränti and Kivijärvi, 2000). A better idea is to integrate the SC cost function directly in the clustering algorithm, as done in Gyllenberg et al. (1997) and Fränti et al. (2000).

The vector-to-cluster distance for the classification of the vectors must then be re-defined correspondingly. The Euclidean distance (L2-norm) provides the optimal classification of the data vectors for the minimization of the MSE, but not for the SC. The optimal classification for SC is given by the Shannon code length (CL) function (Gyllenberg et al., 1997). It represents the entropy of the binary vector when coded by the probability model of the particular cluster.

Surprisingly, the CL distance introduces a new problem that never arises with the L2-distance. This is illustrated in Fig. 1, in which we classify the black point according to the two existing clusters. The probability distribution of the leftmost cluster indicates that the point belongs to this class with a low probability. Nevertheless, the probability distribution of the rightmost cluster has zero variance in the horizontal dimension (σ_x = 0), resulting in zero probability and infinite entropy. As a consequence, the point will be classified to the leftmost cluster. (Fig. 1: Illustrative example of the problem in the CL distance.) This infinite-entropy problem happens often in the classification of multi-dimensional binary data vectors. It has therefore been necessary to make modifications to the existing clustering algorithms when SC has been applied as a cost function.

Previously, the problem has been solved in Gyllenberg et al. (1997) and Fränti et al. (2000) by applying the clustering algorithms first using the sub-optimal but less problematic L2-distance. The CL distance is then applied in the last stage of the algorithm, when the global clustering structure has already settled down and only fine-tuning of the solution takes place. The drawback of this approach is that a similar patch has to be made for every clustering algorithm that is to be applied with the SC.

In this paper, we propose a more general solution to the infinity problem in the form of a new ΔSC distance function. The distance function is derived directly from the difference of the cost function value before and after the classification. It therefore implicitly takes into account the change in the class distribution caused by the re-classification of the data vector, and in this way avoids the infinity problem. The ΔSC is general in the sense that it applies to any clustering algorithm, and no more patches are therefore needed. The effect of the new distance function is demonstrated by implementing it with two clustering algorithms.

The rest of the paper is organized as follows. In Section 2, we define the clustering problem of binary vectors and give the simplified formalization of the SC. The SC function is then applied within two clustering algorithms as the cost function, and the CL distance is employed in the RLS and GLA algorithms as a practical vector-to-cluster distance. In Section 3, we introduce the new ΔSC function derived from the SC difference of the old and new classification when a data vector is moved from one class to another. In Section 4, we make performance comparisons of the different variants, including the RLS and GLA algorithms, and the ΔSC, CL and L2 distance functions.

2. Clustering by minimizing SC

We use the following notations:

N: number of data vectors
M: number of groups
d: dimension of the vectors
X: set of N data objects, X = {x_1, x_2, ..., x_N}
P: partition indices of the x_i, P = {p_i | i = 1, ..., N}
C: set of cluster centroids, C = {c_i | i = 1, ..., M}

The goal of the clustering is to partition a given set of N data vectors into a number of groups so that a given cost function is minimized. In the clustering process, we must solve both the number of clusters M and their locations c_i. The clustering result is described by the partition P of the data set, giving for each vector x_i the cluster index p_i of the group to which it belongs. We consider a set of d-dimensional binary data vectors.

2.1. Stochastic complexity

SC can be applied to the clustering by finding the minimum description of the data via a clustering model. SC measures the information content of the data, and it is defined as the shortest possible code length for the data obtainable by using a set of class distributions. SC includes both the model parameters and the coding of the data in the measurement.

Suppose that we have classified the data vectors into M groups described by the partition of the data. The model of a class can then be described by the probability distribution within the class in each dimension:

    c_{ij} = n_{ij} / n_i                                                   (1)

where n_i is the number of binary vectors in the class, and n_{ij} is the number of vectors having the j-th coordinate value 1. The probability vector c_i of the class is also the centroid (average vector) of the cluster. The simplified approximation of the SC function in Gyllenberg et al. (1997) can be described using the class distribution models as:

    SC = \sum_{i=1}^{M} \sum_{j=1}^{d} n_i\, h(n_{ij}/n_i)
         - \sum_{i=1}^{M} n_i \log \frac{n_i}{N}
         + \frac{d}{2} \sum_{i=1}^{M} \log \max(1, n_i)                     (2)

where h measures the entropy of a binary distribution:

    h(p) = -p \log(p) - (1 - p) \log(1 - p)                                 (3)

Since every vector is classified to some group, it is known that \sum_i n_i = N. Moreover, the constant N \log N contained in the middle term can be dropped and, therefore, the equation can be simplified as:

    SC \approx \sum_{i=1}^{M} \sum_{j=1}^{d} n_i\, h(n_{ij}/n_i)
         - \sum_{i=1}^{M} n_i \log n_i
         + \frac{d}{2} \sum_{i=1}^{M} \log \max(1, n_i)                     (4)

Note that the simplified Eq. (4) can make SC negative. The first part of the SC function measures the intra-class information as the code length when every data vector is coded according to the class probability model. The code length is calculated by multiplying the number of vectors in each cluster n_i by the average entropy h of the cluster. The second part measures the inter-class information as the code length of the partition. It can be calculated from the number of vectors in each cluster n_i multiplied by the average entropy of the corresponding cluster index. The third part measures the information of the model as the code length of the class distributions when described by a series of numbers in [1..n_i].
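For concreteness, the following added sketch (not part of the original article) evaluates the simplified SC of Eq. (4) for a given partition of binary data; the notation follows the paper, the implementation details are assumptions.

    import numpy as np

    def h(p):
        """Binary entropy of Eq. (3); h(0) = h(1) = 0 by convention."""
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def stochastic_complexity(X, labels, M):
        """Simplified SC of Eq. (4) for binary data X (N x d) and a
        partition given as one integer label per vector."""
        N, d = X.shape
        sc = 0.0
        for i in range(M):
            cluster = X[labels == i]
            n = len(cluster)
            if n == 0:
                continue
            c = cluster.mean(axis=0)              # class distribution, Eq. (1)
            sc += n * h(c).sum()                  # intra-class code length
            sc -= n * np.log2(n)                  # partition code length (N log N dropped)
            sc += (d / 2.0) * np.log2(max(1, n))  # model code length
        return sc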

2.2. Clustering algorithm

The SC can be applied to the clustering problem as follows. We first find approximate solutions for every reasonable number of groups using any clustering algorithm. The solutions are then evaluated, and the one that minimizes the SC is the final result of the clustering. This search strategy can use any clustering algorithm to find the individual solutions. In the following, we recall two clustering algorithms: the GLA by Linde et al. (1980), and the RLS by Fränti and Kivijärvi (2000).

The pseudocode of the GLA is shown in Fig. 2. The algorithm takes any initial solution (here the partition P) as an input, and iteratively fine-tunes the solution by repeating two operations in turn. The first operation calculates the centroids of the clusters, and the second operation re-partitions the data vectors according to the new set of centroids. The algorithm is iterated until no more improvement appears in the solution. The method is simple to implement, and has been widely used for the clustering problem as such, or integrated with more complicated methods. (Fig. 2: Pseudocode for the GLA.)

The pseudocode of the RLS is shown in Fig. 3. The method takes any initial solution, which is then improved by a sequence of operations. At each iteration phase, the algorithm creates a new candidate solution by making a small change to the current clustering structure. First, a randomly chosen cluster centroid c_i is replaced by a randomly chosen data vector x_j. This moves the cluster location to another part of the vector space. The partition is then adjusted by a local repartition operation, which consists of the two steps shown in Fig. 4. In the first step, the old cluster is removed by re-partitioning its data vectors to other clusters. In the second step, the newly created cluster is populated by attracting data vectors from the neighboring clusters. The modified clustering is fine-tuned by the application of the GLA. The new candidate solution is then evaluated and accepted only if it improves the previous solution. Otherwise, the candidate solution is discarded and the previous solution remains as the starting point for the next iteration. (Fig. 3: Pseudocode for the RLS. Fig. 4: Pseudocode for the local repartition operation.)

The GLA and the RLS are both applicable to the clustering task and also rather simple to implement. The RLS is less sensitive to the initialization because it is capable of making global changes in the clustering structure by random swapping of the clusters, and can therefore correct an incorrect settlement of the initial clustering. If the GLA is to be used, it should be repeated several times in order to reduce the dependency on the initialization.
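The pseudocode figures (Figs. 2-4) are not reproduced in this transcription. As a substitute illustration, here is a compact Python rendering of the GLA loop described above; the distance argument also receives the candidate cluster's size, so the SC-based distances of the later sections fit the same interface. This is an added sketch, not the authors' code.

    import numpy as np

    def gla(X, labels, M, distance, iterations=100):
        """Generalized Lloyd algorithm: repeat centroid computation and
        repartitioning until the solution stops improving.
        distance(x, c, n) gets a candidate centroid c and its cluster size n."""
        for _ in range(iterations):
            sizes = np.bincount(labels, minlength=M)
            # Step 1: centroids are the class distributions of the partition
            C = np.vstack([X[labels == i].mean(axis=0) if sizes[i] > 0
                           else np.zeros(X.shape[1]) for i in range(M)])
            # Step 2: re-partition every vector to its nearest centroid
            new = np.array([min(range(M), key=lambda i: distance(x, C[i], sizes[i]))
                            for x in X])
            if np.array_equal(new, labels):
                break                      # no improvement: stop iterating
            labels = new
        return labels, C

    # Example with the plain (squared) Euclidean distance of Eq. (5):
    # labels, C = gla(X, labels, M, lambda x, c, n: np.sum((x - c) ** 2))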

2.3. Shannon code-length distance

The clustering algorithms employ a distance function d, which measures the vector-to-cluster distance and is used for the classification of the vectors during the clustering process. Usually the distance function is defined as the Euclidean distance (L2-norm) between the data vector x_t and the particular cluster centroid c_i:

    d_E(x_t, c_i) = \sqrt{ \sum_{k=1}^{d} (x_{tk} - c_{ik})^2 }             (5)

This gives the optimal classification for the minimization of the MSE, but not for the SC. The optimal classification for the SC is given by the Shannon code-length function CL(x_t, c_i) of Gyllenberg et al. (1997):

    d_{CL}(x_t, c_i) = -\sum_{j=1}^{d} \left[ (1 - x_{tj}) \log(1 - c_{ij}) + x_{tj} \log c_{ij} \right] - \log \frac{n_i}{N}    (6)

It measures the code length when the data vector (the summation term in the equation) and its class index (the second term) are coded using the given model.

In principle, the CL distance is well defined, but in practice it has a fundamental problem in its definition, which is explained by the following example. Consider a single cluster c_1 consisting of the following three data vectors: x_1 = (0, 0, 0), x_2 = (0, 1, 0) and x_3 = (1, 0, 0). The corresponding class probability distribution of the cluster is c_1 = (0.33, 0.33, 0.00). The distances of the vectors can now be calculated using the CL distance (the second term of Eq. (6) is omitted in this example for simplicity):

    d_CL(x_1, c_1) = 0.58 + 0.58 + 0.00 = 1.17
    d_CL(x_2, c_1) = 0.58 + 1.58 + 0.00 = 2.17
    d_CL(x_3, c_1) = 1.58 + 0.58 + 0.00 = 2.17

Let us then consider a fourth vector x_4 = (0, 0, 1), which is equally close to the cluster according to the Euclidean distance, but for which the CL distance gives the following result:

    d_CL(x_4, c_1) = 0.58 + 0.58 + ∞ = undefined

The problem can appear when there is a uniform bit distribution in some dimension and the data vector has a different value in the same position. The homogeneous bit distribution indicates that there is no uncertainty, and the entropy of the contradicting value therefore approaches infinity. This is a serious flaw especially in the local repartition procedure of the RLS: it creates new clusters starting from a singular cluster, which evidently has a uniform bit distribution. As a consequence, no other data vectors except equal ones can ever be classified to this cluster.

The problem of the CL distance is that even though it measures the uncertainty of the classification, it does not take into account the uncertainty of the model itself. In other words, the bit distribution of the class model is indeed homogeneous, but the model is only an approximation and subject to change during the clustering process. Zero probability is therefore not a feasible approximation for the classification.

The infinite values could be avoided by preventing the centroids from taking the values 0 and 1. This can be achieved, for example, given a binary data vector x, by taking the centroid values to be the mean vector of x and c_i. If some coordinate of c_i equals 0 or 1, the number of vectors in the i-th cluster, n_i, can be taken as a parameter of the CL distance function. Obviously, c_i can be replaced with a new vector, which is the centroid of the i-th cluster after the vector x is put into the cluster:

    c'_{ij} = \frac{n_i c_{ij} + x_j}{n_i + 1}, \quad c_{ij} \neq x_j       (7)

If c_i = x, the CL distance is taken as \log_2(n_i). Another condition on the CL distance value is considered as follows:

    x_j \log c_{ij} + (1 - x_j) \log(1 - c_{ij}) = 0, \quad c_{ij} = x_j,\ c_i \neq x    (8)

This patch, however, does not remove the problem itself, as it merely assigns a low probability instead of a zero value. Additional modifications have therefore been necessary in the clustering algorithms so that the CL distance could be used properly in the GLA and in the RLS. For example, in the algorithms presented in Gyllenberg et al. (1997) and Fränti et al. (2000), the CL distance is applied only in the last step of the clustering process, when the global clustering structure has already settled down and only fine-tuning of the solution takes place. The problem of this approach is that it is not trivial to determine the stage of the clustering process at which it would be safe enough to start using the CL distance.

To sum up, the problem with the CL distance is fundamental in its nature. It is therefore better to fix it than to find a patch for every clustering algorithm that is to be applied.
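The CL distance and its infinity problem translate directly into code; the following added sketch reproduces the worked example above (1.17, 2.17, 2.17 and an infinite distance for x_4, with the class-index term omitted by setting n_i = N).

    import numpy as np

    def cl_distance(x, c, n_i, N):
        """Shannon code-length distance of Eq. (6). A coordinate where the
        class distribution contradicts the observed bit has probability
        zero, yielding the infinite distance discussed above."""
        p = np.where(x == 1, c, 1.0 - c)      # probability of each observed bit
        if np.any(p == 0.0):
            return np.inf
        return float(-np.log2(p).sum()) - np.log2(n_i / N)

    c1 = np.array([1/3, 1/3, 0.0])
    for x in ([0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]):
        print(cl_distance(np.array(x), c1, 3, 3))   # 1.17, 2.17, 2.17, inf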
3. ΔSC distance function

We introduce a new vector-to-cluster distance function, denoted the ΔSC distance. It is based on a design paradigm in which the distance function is derived directly from the difference of the cost function value before and after the classification of a data vector. The main advantage of this design philosophy is that it implicitly takes into account the changes in the clustering model caused by the classification. It is also general in the sense that it does not depend on the chosen clustering algorithm, and should therefore be applicable with any distance-based clustering method.

The ΔSC distance function is always defined relative to a given model. We can therefore assume that we have a model for which we can calculate the SC value.

If we then consider the distance calculation as a movement of the data vector from one group to another, we can define the distance function as the difference in the SC of the clustering before and after the movement of the data vector.

Given two classes 1 and 2 and a binary vector x, which we consider moving from class 1 to class 2, the SC function value after the movement is:

    SC' = \sum_{i \neq 1,2} \sum_{j=1}^{d} n_i\, h(n_{ij}/n_i)
          + (n_1 - 1) \sum_{j=1}^{d} h\!\left(\frac{n_{1j} - x_j}{n_1 - 1}\right)
          + (n_2 + 1) \sum_{j=1}^{d} h\!\left(\frac{n_{2j} + x_j}{n_2 + 1}\right)
          - \sum_{i \neq 1,2} n_i \log n_i
          - (n_1 - 1) \log(n_1 - 1) - (n_2 + 1) \log(n_2 + 1)
          + \frac{d}{2} \Big( \sum_{i \neq 1,2} \log \max(1, n_i)
              + \log \max(1, n_1 - 1) + \log \max(1, n_2 + 1) \Big)
          + N \log N                                                        (9)

We can then calculate the difference between the SC function values of the old clustering (before the movement) and the new one (after the movement) as:

    \text{SC-diff}(x, 1, 2) =
        \sum_{j=1}^{d} \Big[ (n_1 - 1)\, h\!\Big(\frac{n_{1j} - x_j}{n_1 - 1}\Big) - n_1\, h\!\Big(\frac{n_{1j}}{n_1}\Big)
            + (n_2 + 1)\, h\!\Big(\frac{n_{2j} + x_j}{n_2 + 1}\Big) - n_2\, h\!\Big(\frac{n_{2j}}{n_2}\Big) \Big]
        + (n_1 - d/2) \log n_1 + (n_2 - d/2) \log n_2
        - (n_2 + 1 - d/2) \log(n_2 + 1) - (n_1 - 1 - d/2) \log(n_1 - 1),   \quad n_1 > 1

    \text{SC-diff}(x, 1, 2) =
        \sum_{j=1}^{d} \Big[ (n_2 + 1)\, h\!\Big(\frac{n_{2j} + x_j}{n_2 + 1}\Big) - n_2\, h\!\Big(\frac{n_{2j}}{n_2}\Big) \Big]
        + (n_2 - d/2) \log n_2 + (d/2 - n_2 - 1) \log(n_2 + 1),            \quad n_1 = 1      (10)

The SC-diff takes the value zero if the two classes are the same. Negative values are obtained when the movement of the vector improves the solution, and positive values otherwise. The SC-diff could now be applied as such in the cases where we re-classify a vector in an existing solution.

In the SC-diff function we assume that the given vector is already classified into some class. In general, however, this is not the case, and we must be able to define a more general distance function that depends only on the vector x and on the candidate cluster c_2. For example, in the repartition procedure of the RLS algorithm, we classify vectors whose previous class has been removed. A more general ΔSC function can be derived from (10) as follows. The classification can be considered as a two-step procedure, in which we first remove the vector x from class 1 and then add it to class 2. For a given vector x, the cost of the removal is constant. This means that the parameters N, n_1 and n_{1j} are fixed in the classification and, as a consequence, we can consider only the cost of adding the vector to class 2 and ignore the removal part in the formula. Thus, the ΔSC (for distinct classes) can be defined merely as the cost of the addition:

    \Delta SC(x, c_2) =
        \sum_{j=1}^{d} \Big[ (n_2 + 1)\, h\!\Big(\frac{n_{2j} + x_j}{n_2 + 1}\Big) - n_2\, h\!\Big(\frac{n_{2j}}{n_2}\Big) \Big]
        + (n_2 - d/2) \log n_2 + (d/2 - n_2 - 1) \log(n_2 + 1) + \log N    (11)

This gives the same result as the SC-diff up to a constant difference. The only exception is when we measure the distance of x to the cluster in which it is already included. In this case, we should use (n_1 - 1) as the class size instead of n_1, because the class size does not increase due to the classification. Thus, if the previous classification is known, we should apply the following equation for this special case:

    \Delta SC(x, c_2) =
        \sum_{j=1}^{d} \Big[ n_1\, h\!\Big(\frac{n_{1j}}{n_1}\Big) - (n_1 - 1)\, h\!\Big(\frac{n_{1j} - x_j}{n_1 - 1}\Big) \Big]
        + (d/2 - n_1) \log n_1 + (n_1 - 1 - d/2) \log(n_1 - 1) + \log N,   \quad \text{own class},\ n_1 > 1

    \Delta SC(x, c_2) = \log N,   \quad \text{own class},\ n_1 = 1          (12)

Hence, the ΔSC distance as in Eq. (11) is applicable as a vector-to-cluster distance in all cases, although it underestimates the distance when the vector is already included in the class. The special case of Eq. (12) should therefore be used when applicable, to give a more exact value.
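Because the ΔSC distance is defined as a cost-function difference, an implementation of Eqs. (11) and (12) can be validated against a brute-force version that simply re-evaluates the SC before and after a tentative move (cf. Eq. (10)). The added sketch below does exactly that, reusing a function such as the stochastic_complexity sketch given after Section 2.1; it is an illustration, far slower than the closed forms.

    def delta_sc_bruteforce(X, labels, t, target, M, sc_fn):
        """SC difference caused by tentatively moving vector number t into
        cluster `target`, evaluated directly from the definition: negative
        values mean the move improves the clustering."""
        before = sc_fn(X, labels, M)
        moved = labels.copy()
        moved[t] = target
        return sc_fn(X, moved, M) - before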

4. Test results

We use three binary data sets to test the new method: DNA-1, DNA-2 and Normal. The features of the first two sets (DNA-1 and DNA-2) were extracted from the analysis of DNA samples of fishes (presence or absence of a given DNA fragment) in biological research experiments. There are [...]-dimensional binary vectors in DNA-1 and [...]-dimensional vectors in DNA-2. The third set (Normal) was artificially created by generating 265 binary vectors into a 10-dimensional vector space with 12 clusters.

We study first the DNA-1 and DNA-2 sets when clustered using the RLS and GLA methods. The RLS was performed for 80 iterations. The results in Figs. 5-8 show that the ΔSC distance and the L2-distance come up with much better results than the CL distance in the RLS algorithm. There seems to be no big difference between the ΔSC distance and the L2-distance when they are employed in the RLS. The difference, however, can be significant when the correct number of clusters is to be determined in the stepwise search. The result with the GLA is quite different from that of the RLS, mainly because the variance of the results is much greater. The CL distance still performs worse than the L2-distance, but the ΔSC distance is now clearly better than the L2-distance with respect to almost every number of clusters. It is expected that the correct result would be reached more reliably using the ΔSC distance. The drawback of the ΔSC distance is that it takes much more time to compute than the L2-distance.

(Fig. 5: Clustering results by the RLS algorithm for DNA-1. Fig. 6: Clustering results by the GLA algorithm for DNA-1. Fig. 7: Clustering results by the RLS algorithm for DNA-2. Fig. 8: Clustering results by the GLA algorithm for DNA-2.)

Table 1 summarizes the clustering and classification results for the Normal data set. It is the only data set for which the real classification is known, and thus a classification rate could be calculated. The results show that the RLS with the ΔSC distance gives the best performance both in terms of the best clustering result (smallest SC value) and the highest classification rate. The RLS algorithm found the correct number of clusters also with the L2-distance and the CL distance, but the corresponding classification rates were smaller. The GLA, however, was able to find the correct result (12 clusters) only by using the ΔSC distance.

Table 1. The classification results of the RLS and GLA algorithms with the L2-distance, CL-distance and ΔSC-distance:

                           RLS algorithm            GLA algorithm           Real
                           L2      CL      ΔSC      L2      CL      ΔSC     classification
    SC                     -       -       -        -       -       -       -
    Number of clusters     12      12      12       16 (a)  16 (a)  12      12
    Classification rate    96.98%  92.83%  98.87%   74.34%  87.55%  91.32%  100%

    (a) The classification rates are calculated from the clustering of 12 clusters even though a smaller SC value was found with 16 clusters.

5. Conclusions

We proposed a new vector-to-cluster distance for the classification problems of binary vectors by minimizing SC. The distance function was applied both in the GLA and in the RLS algorithms. Experiments show that the RLS with the ΔSC distance gives the best clustering performance in minimizing SC among all the variants considered, and the highest classification rate. In most cases, the modified CL distance performs even worse than the L2-distance. It is somewhat difficult for the stepwise GLA to deliver satisfactory results in solving the correct number of clusters, even by using the ΔSC distance. The L2-distance is moderately effective for classifying simple data. Among the three distances, the ΔSC distance is the most precise for minimizing stochastic complexity. Our approach of using the ΔSC distance is general in its nature, as the same design paradigm can be applied with any other cost function too.

References

Fränti, P., Kivijärvi, J., 2000. Randomized local search algorithm for the clustering problem. Pattern Analysis and Applications 3 (4).
Fränti, P., Gyllenberg, H.G., Gyllenberg, M., Kivijärvi, J., Koski, T., Lund, T., Nevalainen, O., 2000. Minimizing stochastic complexity using GLA and local search with applications to classification of bacteria. Biosystems 57, 37-48.
Gyllenberg, M., Koski, T. Probabilistic Models for Bacterial Taxonomy. TUCS Technical Report No. 325, Turku Centre for Computer Science, Finland.
Gyllenberg, M., Koski, M., Verlaan, M., 1994. Clustering and quantization of binary vectors with stochastic complexity. In: Proc. IEEE International Symposium on Information Theory, Trondheim, Norway.
Gyllenberg, M., Koski, T., Verlaan, M., 1997. Classification of binary vectors by stochastic complexity. Journal of Multivariate Analysis 63.
Kontkanen, P., Myllymäki, P., Silander, T., Tirri, H., 1999. On the accuracy of stochastic complexity approximations. In: Gammerman, A. (Ed.), Causal Models and Intelligent Data Management, Chapter 9. Springer-Verlag.
Linde, Y., Buzo, A., Gray, R., 1980. An algorithm for vector quantizer design. IEEE Trans. Communications 28, 84-95.
Rissanen, J., 1987. Stochastic Complexity and the MDL Principle. Econometric Reviews 6.
Rissanen, J., 1987. Stochastic complexity. Journal of the Royal Statistical Society, Series B 49.
Rissanen, J., 1996. Fisher information and stochastic complexity. IEEE Trans. on Information Theory 42.


3 Publication P3

M. Xu, "Delta-MSE dissimilarity in GLA based vector quantization", IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'04), Montreal, Canada, vol. 5, 83-86, May 2004.


DELTA-MSE DISSIMILARITY IN GLA BASED VECTOR QUANTIZATION

Mantao Xu
Department of Computer Science, University of Joensuu
P.O. Box 111, FIN-80101 Joensuu, FINLAND

ABSTRACT

The generalized Lloyd algorithm is one of the most popular partition-based algorithms for constructing the codebook in vector quantization. We propose the Delta-MSE dissimilarity measure between training vectors and code vectors based on the MSE distortion function. The Delta-MSE function is heuristically derived by calculating the difference of the MSE distortion before and after moving a training vector from one cluster to another. We show that the Delta-MSE dissimilarity applies also to minimizing the F-ratio validity index of the vector quantizer. We incorporate the underlying dissimilarity into the generalized Lloyd algorithm in vector quantization, with the initial codebook derived from the PCA-based k-d tree algorithm. Experimental results show that the proposed dissimilarity generally achieves better performance than the L2 distance in constructing the codebook of vector quantization.

1. INTRODUCTION

Vector quantization (VQ) is a method for data reduction that is widely used in low bit rate compression of image and audio data sources [1, 2]. The objective of vector quantization is to search for a set of M code vectors (the codebook) with the minimum distortion between the training vectors and their representative code vectors. One of the most cited partition-based algorithms is the generalized Lloyd algorithm (GLA). It basically consists of two steps: the assignment of each training vector to a class label by finding its closest code vector, and the computation of the code vectors. There are many improved versions of the GLA algorithm, such as the genetic GLA algorithms [3], the randomized local search algorithms [4, 5] and the fast implementations of GLA [6, 7]. The standard GLA is applied as an integral part of the vector quantization algorithms above. Either the genetic algorithms or the randomized algorithms run the GLA algorithm many times during one run of the algorithm. The computation in the GLA mainly relies on the distance calculations between the training vectors and the code vectors. The fast implementations of GLA such as PDS [8] and MPS [9] reduce the number of distance calculations after several runs of partition in GLA.

The GLA vector quantization algorithm can also be considered as a clustering algorithm on training sets. Hence its dissimilarity function or distance function can be reformulated to improve the vector quantization performance. The distortion function of VQ is always defined by the total dissimilarity between all training vectors and their code vectors. The definition of a new dissimilarity function often leads to the re-formalization of the distortion function, which also requires that the code vectors are re-computed consistently to minimize the distortion function. However, in this work, a heuristic and non-symmetric dissimilarity function is analytically induced from the predefined distortion function. The considered approach takes into account the dynamic nature of the GLA partition process, in which the cluster parameters (the cluster sizes) are subject to change all the time during the run of the algorithm. The above design paradigm can be applied to the MSE distortion function to derive a dissimilarity function between training vectors and code vectors.

The structure of this paper is organized as follows. We first describe the design paradigm of the Delta-MSE dissimilarity based on one partition of the training vectors. In the following section, we show that the Delta-MSE dissimilarity is applicable also to the F-ratio clustering validity index. Then the Delta-MSE dissimilarity is incorporated into the GLA algorithm in the next section.
In the experimental section, performance comparisons are reviewed between the Delta-MSE dissimilarity and the L2 distance. Finally, the conclusions are drawn.

2. DELTA-MSE DISSIMILARITY

The aim of vector quantization is to find the partition of the training set with the minimum distortion between all training vectors and their code vectors. The standard GLA vector quantization is an optimization problem specified by the minimization of the MSE function:

$$\mathrm{MSE}(P,C)=\frac{1}{N}\sum_{i=1}^{N}\left\|x_i-c_{p_i}\right\|^2\qquad(1)$$

where N is the number of data vectors; k is the number of clusters (NOC); X = {x_1, x_2, ..., x_N} is the set of N training vectors; P = {p_i | i = 1, ..., N} is the set of class labels; and C = {c_j | j = 1, ..., k} is the set of code vectors. Assuming that a training vector x is moved from cluster i to cluster j, the change of the MSE function [10] caused by this move is:

$$\Delta v(x)=\frac{n_j}{n_j+1}\left\|x-c_j\right\|^2-\frac{n_i}{n_i-1}\left\|x-c_i\right\|^2\qquad(2)$$

where n_i and n_j are the two cluster sizes, respectively. The first part on the right hand side of equation (2), representing the increased value of the total variance of cluster j caused by this move, is denoted the addition cost. The second part, representing the decreased value of the total variance of cluster i, is denoted the removal cost. The addition cost can be interpreted as the dissimilarity between training vector x and code vector c_j when x is outside cluster j. A smaller cluster size n_j obviously makes the addition cost more different from the squared L2 distance. It should be noted that the change of variance caused by adding a training vector into one cluster is equivalent to the change of variance caused by removing the training vector from the new cluster. Hence the second part can be interpreted as the dissimilarity between training vector x and its former code vector c_i. Obviously, the training vectors in sparse clusters are moved more frequently by the dissimilarity than those in dense clusters. The Delta-MSE dissimilarity between training vector x and code vector c_j is defined as:

$$D_{\mathrm{MSE}}(x,c_j)=w_j\left\|x-c_j\right\|^2\qquad(3)$$

where w_j is defined as:

$$w_j=\begin{cases}n_j/(n_j+1), & p_x\neq j\\ n_j/(n_j-1), & p_x=j\end{cases}\qquad(4)$$

The squared L2 distance in equation (3) can be replaced with the standardized L2 distance as:

$$D_{\mathrm{MSE}}(x,c_j)=w_j\,(x-c_j)^{T}D\,(x-c_j)\qquad(5)$$

where D is the diagonal matrix with diagonal elements given by 1/v_l, where v_l denotes the variance of the l-th variable over the N training vectors.

The distribution of cluster sizes determines the clustering performance of the Delta-MSE dissimilarity. The sparser a cluster is, the more different the Delta-MSE dissimilarity can be in comparison to the L2 norm. When the codebook size is increased, most of the clusters become sparser. In this case, the proposed dissimilarity enables more reassignments of the training vectors in sparse clusters, consequently increasing the number of vector reassignments. The Delta-MSE dissimilarity therefore yields a better VQ distortion than the L2 distance.

3. F-RATIO VALIDITY INDEX

Many iterative clustering algorithms rely on the F-ratio validity index in the estimation of the codebook size. The F-ratio is defined as the ratio of the within-groups variance to the between-groups variance. The total variance of the training set can be decomposed into the sum of the within-groups variance and the between-groups variance as:

$$N\sigma^2(X)=\sum_{i=1}^{N}\left\|x_i-c_{p_i}\right\|^2+\sum_{j=1}^{k}n_j\left\|c_j-\bar{x}\right\|^2\qquad(6)$$

where x̄ is the mean vector of the training set. The F-ratio is an extension of Fisher's discriminant to measure the separability between clusters. The F-ratio clustering validity is calculated as the ratio of the total within-groups variance against the total between-groups variance as:

$$F=\frac{k\sum_{i=1}^{N}\left\|x_i-c_{p_i}\right\|^2}{\sum_{j=1}^{k}n_j\left\|c_j-\bar{x}\right\|^2}=\frac{k\,\mathrm{MSE}}{\sigma^2(X)-\mathrm{MSE}}\qquad(7)$$

The smaller the F-ratio is, the more separated the clusters are. The F-ratio validity index is useful in the estimation of the codebook size, which also relies on the geometrical structure of the training source. Since the Delta-MSE dissimilarity is analytically derived from the MSE distortion, if the removal cost D_MSE(x, c_i) is greater than the addition cost D_MSE(x, c_j), the MSE distortion will decrease after this movement. In the partition phase, the training vector x is inclined to move into the cluster with the minimum addition cost, which brings the greatest decrease of the MSE value. In the following, we will show that this property holds for the F-ratio validity index as well.
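A minimal Python sketch of the Delta-MSE computation of Eqs. (3)-(4) (an illustration with hypothetical names, not part of the original publication):

import numpy as np

def delta_mse(x, c, n, in_cluster):
    # Delta-MSE dissimilarity of Eqs. (3)-(4): a weighted squared L2 distance.
    # n is the size of the cluster of code vector c; the weight is the
    # addition cost n/(n+1) when x lies outside that cluster, and the
    # removal cost n/(n-1) when x is currently assigned to it.
    w = n / (n - 1.0) if in_cluster else n / (n + 1.0)
    d = np.asarray(x, float) - np.asarray(c, float)
    return w * float(np.dot(d, d))

In the partition phase each training vector is then assigned to the cluster with the smallest addition cost, which by Eq. (2) yields the largest decrease of the MSE value.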
Lemma: Given the partition of the training set that assigns training vector x into cluster i, if the addition cost D_MSE(x, c_j) is greater than the addition cost D_MSE(x, c_l), then the F-ratio F(x, c_j) after moving x to cluster j is greater than the F-ratio value F(x, c_l) after moving x to cluster l.

Proof of Lemma: Suppose that training vector x is moved from cluster i to cluster j and to cluster l, respectively. From equations (3) and (7), the difference of F(x, c_j) and F(x, c_l) is calculated as:

$$\begin{aligned}F(x,c_j)-F(x,c_l)&=\frac{k\,\mathrm{MSE}(x,c_j)}{\sigma^2(X)-\mathrm{MSE}(x,c_j)}-\frac{k\,\mathrm{MSE}(x,c_l)}{\sigma^2(X)-\mathrm{MSE}(x,c_l)}\\&=\frac{k\,\sigma^2(X)\left[\mathrm{MSE}(x,c_j)-\mathrm{MSE}(x,c_l)\right]}{\left[\sigma^2(X)-\mathrm{MSE}(x,c_j)\right]\left[\sigma^2(X)-\mathrm{MSE}(x,c_l)\right]}\\&=\frac{k\,\sigma^2(X)\left[D_{\mathrm{MSE}}(x,c_j)-D_{\mathrm{MSE}}(x,c_l)\right]}{\left[\sigma^2(X)-\mathrm{MSE}(x,c_j)\right]\left[\sigma^2(X)-\mathrm{MSE}(x,c_l)\right]}\end{aligned}\qquad(8)$$

The total variance σ²(X) is a positive constant; σ²(X) − MSE(x, c_j) and σ²(X) − MSE(x, c_l), representing the between-group variances after the two movements respectively, are also positive. Thus, the value of equation (8) is positive if and only if D_MSE(x, c_j) is greater than D_MSE(x, c_l), which proves the lemma.

4. IMPLEMENTATION OF THE GLA ALGORITHM

The Delta-MSE dissimilarity is incorporated into the generalized Lloyd algorithm in this work. The incorporated GLA algorithm can also be accelerated by the triangular inequality elimination (TIE) technique of Chen and Hsieh [11]. The values of all weights {w_ij | i = 1, ..., N; j = 1, ..., k} are maintained in two k-dimensional arrays in each partition phase. The partition of training vectors by the Delta-MSE dissimilarity can also be exactly accelerated by the application of two triangular inequalities. The number of Delta-MSE calculations can be reduced by the following two inequalities:

$$\sqrt{w_j}\left(\left\|c_a-c_j\right\|-\left\|c_a-x\right\|\right)>\sqrt{w_a}\left\|c_a-x\right\|\qquad(9)$$

$$\sqrt{w_j}\,\Bigl|\,\left\|c_b-x\right\|-\left\|c_b-c_j\right\|\,\Bigr|>\sqrt{w_a}\left\|c_a-x\right\|\qquad(10)$$

where x is the training vector, c_a and c_b are its nearest and farthest code vectors found so far, and c_j is the code vector to be tested. If one of the above inequalities holds, the calculation of D_MSE(x, c_j) can be avoided. A practical implementation of the acceleration utilizes the k×k matrix of the L2 distances between the code vectors. The calculation of the matrix usually takes O(k²d) time. Assuming that k ≪ N, the accelerated partition with the Delta-MSE dissimilarity takes O(d(k−s)N) time, where s is the average number of avoided Delta-MSE calculations in the reassignments of all training vectors.

The initial code vectors here are chosen by a k-d tree algorithm based on the nested principal component analysis, which is proposed in [12, 13]. The code vectors are selected as the k bucket centers of the k-d tree. The k-d tree algorithm ensures that its bucket centers can be as appropriate candidate code vectors as the training vectors. The time complexity of selecting the initial code vectors from the k-d tree buckets is O(dkN).

Table 1: Comparison of the average MSE and F-ratio between the L2 distance and the Delta-MSE dissimilarity for the data sets Air5, Bridge, Bridge2, Camera and Housec5.

Fig. 1: F-ratios of the vector quantizers for Air5, plotted against the number of clusters for the L2 distance and the Delta-MSE dissimilarity.

5. EXPERIMENTAL RESULTS

We first study the training sets generated from four standard images: Air5 and Housec5 are the training sets with the RGB values from the images Airplane and House, quantized to 5 bits per color; Bridge and Camera are the training sets with 4×4 blocks from the images Bridge and Cameraman; Bridge2 is the training set with 4×4 binarized blocks from the image Bridge. Both the L2 norm and the Delta-MSE dissimilarity are tested in the standard GLA algorithm. The initial code vectors are selected from the PCA-based k-d tree bucket centers. The average MSE and F-ratio values over the codebook sizes from 48 to 70 are displayed in Table 1. The F-ratios for Air5 are plotted against the number of clusters (codebook size) in Figure 1. It turns out in Figure 1 that the Delta-MSE dissimilarity achieves significantly smaller F-ratio distortions than the L2 distance as the codebook size increases. The clusters become sparser with the increased codebook size, which makes the Delta-MSE dissimilarity more different from the squared L2 distance and consequently provides more heuristic guidance for minimizing the distortions. Table 1 shows that the proposed dissimilarity generally performs better than the L2 norm in GLA based vector quantization.
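The pruning tests of Eqs. (9)-(10) in Section 4 can be sketched as follows (a hypothetical illustration under the reconstruction above, with the inter-code-vector distances taken from the precomputed k×k matrix; not the paper's implementation):

import math

def can_skip(x, c_j, w_j, c_a, w_a, c_b, d_aj, d_bj):
    # Triangle-inequality elimination for the weighted Delta-MSE search:
    # returns True if D_MSE(x, c_j) provably cannot beat the current best
    # weighted distance to the nearest code vector c_a.
    d_ax = math.dist(c_a, x)    # distance to the nearest code vector so far
    d_bx = math.dist(c_b, x)    # distance to the farthest code vector so far
    best = math.sqrt(w_a) * d_ax
    return (math.sqrt(w_j) * (d_aj - d_ax) > best or      # Eq. (9)
            math.sqrt(w_j) * abs(d_bx - d_bj) > best)     # Eq. (10)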

Table 2: Comparison of the average MSE, F-ratio and test error between the standardized L2 distance and its corresponding Delta-MSE dissimilarity for the data sets Speaker1-Speaker5.

Fig. 2: Test errors of the vector quantizers for Speaker1, plotted against the number of clusters for the standardized L2 distance and the Delta-MSE dissimilarity.

We secondly study the five real speaker datasets from the TIMIT speech corpus by using the stepwise GLA algorithm. The Delta-MSE dissimilarity of equation (5) and the standardized L2 distance are investigated in the GLA algorithm. Each dataset is separated into a training set and a test set (about 25%:75%). The two dissimilarities are incorporated into the stepwise GLA algorithm. Then the vector quantizers are tested on their test sets. The average test errors are shown in Table 2. The average MSE and F-ratio values displayed in the table are calculated over the codebook sizes from 30 to 55. Figure 2 plots the test error of the vector quantizers for Speaker1 against the codebook size. It turns out that the proposed dissimilarity generally has better performance than the standardized L2 distance in GLA based vector quantization. With the increase of the codebook size, the performance gains of the proposed dissimilarity are increased.

6. CONCLUSIONS

We have proposed the Delta-MSE dissimilarity function between training vectors and code vectors. The dissimilarity function is calculated as the change of the within-group variance before and after moving a given training vector from one class to another. The dissimilarity function provides a heuristic direction for the training vector movements, in which the VQ distortion function will most probably decrease. Although derived from the MSE distortion, the dissimilarity function applies also to the minimization of the F-ratio validity index. The experimental results show that the proposed dissimilarity induces more reassignments of training vectors than the L2 norm. With the increase of the codebook size, the performance gains of the GLA algorithm are increased as well. The weakness of the Delta-MSE dissimilarity function lies in its non-symmetric formulation. The cluster size can be one of the dominant factors only if the cluster is sparse enough.

7. REFERENCES

[1] N. M. Nasrabadi and R. A. King, "Image coding using vector quantization: A review", IEEE Trans. Commun., 36(8), August 1988.
[2] A. Gersho and R. Gray, Vector Quantization and Signal Compression, Kluwer, Boston, MA, 1992.
[3] P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, Elsevier, 21, pp. 61-68, 2000.
[4] P. Fränti and J. Kivijärvi, "Randomized local search algorithm for the clustering problem", Pattern Analysis and Applications, 3(4), 2000.
[5] P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31(8), 1139-1148, August 1998.
[6] T. Kaukoranta, P. Fränti and O. Nevalainen, "A fast exact GLA based on code vector activity detection", IEEE Trans. on Image Processing, 9(8), August 2000.
[7] J. Z. C. Lai and C. C. Lue, "Fast search algorithms for VQ codebook generation", Journal of Visual Communication and Image Representation, 7(2), June 1996.
[8] C. D. Bei and R. M. Gray, "An improvement of the minimum distortion encoding algorithm for vector quantization", IEEE Trans. Commun., vol. 33, Oct. 1985.
[9] S. W. Ra and J. K. Kim, "A fast mean-distance-ordered partial codebook search algorithm for image vector quantization", IEEE Trans. Circuits Syst., vol. 40, Sept. 1993.
[10] H. Späth, Cluster Analysis Algorithms for Data Reduction and Classification of Objects, Ellis Horwood Publ., Chichester, U.K., 1980.
[11] S.-H. Chen and W. M. Hsieh, "Fast algorithm for VQ codebook design", Proc. Inst. Elect. Eng., vol.
138, Oct. 1991.
[12] R. F. Sproull, "Refinements to nearest-neighbour searching in k-d trees", Algorithmica, no. 6, 1991.
[13] A. Likas, N. Vlassis and J. J. Verbeek, "The global k-means clustering algorithm", Pattern Recognition, 36(2), pp. 451-461, 2003.

4 Publication P4

M. Xu and P. Fränti, "Iterative K-Means algorithm based on Fisher discriminant", Int. Conf. on Information Fusion (Fusion'04), Stockholm, Sweden, vol. 1, 70-73, June 2004.


Iterative K-Means Algorithm Based on Fisher Discriminant

Mantao Xu
University of Joensuu
P.O. Box 111, FIN-80101 Joensuu, Finland

Pasi Fränti
University of Joensuu
P.O. Box 111, FIN-80101 Joensuu, Finland

Abstract — K-Means clustering is a well-known tool in unsupervised learning. The performance of K-Means clustering, measured by the F-ratio validity index, depends highly on the selection of its initial partition. This problematic dependency usually leads to a locally optimal solution for k-center clustering. To overcome this difficulty, we present an intuitive approach that iteratively incorporates Fisher discriminant analysis into the conventional K-Means clustering algorithm. In other words, at each iteration, a suboptimal initial partition for K-Means clustering is estimated by using dynamic programming in the discriminant subspace of the input data. Experimental results show that the proposed algorithm outperforms the two comparative clustering algorithms, the PCA-based suboptimal K-Means clustering algorithm and the kd-tree based K-Means clustering algorithm.

Keywords: K-Means clustering, discriminant analysis, dynamic programming.

1 Introduction

K-Means clustering is a famous unsupervised learning technique in the context of pattern recognition and machine learning. The objective of the conventional K-Means clustering is to discover the inherent partition of the data objects, namely, to search for an optimal partition of the data objects with the minimum value of the mean distortion function. Thus, K-Means clustering is an optimization problem described by the minimization of the MSE function:

$$\mathrm{MSE}(P)=\frac{1}{N}\sum_{i=1}^{N}\left\|x_i-c_{p_i}\right\|^2\qquad(1)$$

where N is the number of data samples; k is the number of clusters; d is the dimension of the data vectors; X = {x_1, x_2, ..., x_N} is a set of N data vectors; P = {p_i | i = 1, ..., N} is the class labeling of X; and C = {c_j | j = 1, ..., k} are the k cluster centroids.

The main challenge for the conventional K-Means clustering is that its classification performance depends highly on the initially selected partition. In other words, with most of the randomized initial partitions, the conventional K-Means algorithm converges to a locally optimal solution. Extended versions of K-Means such as K-Median [1], adaptive K-Means [2] and kernel K-Means [4] were recently developed to overcome this local optimality problem. The K-Median algorithm searches for each cluster centroid among the data samples such that the centroid minimizes the summation of the distances from all data points in the cluster to it. The optimal adaptive K-Means provides the conventional K-Means algorithm with an enhancement of fast convergence by approximating an optimal clustering solution with an adaptive learning rate. The improvement made by this adaptive K-Means algorithm is based on the optimality criterion that the clusters in the underlying partition of the data source have the same variances when the number of clusters is large enough. The optimality criterion also provides a biased distance measurement [2] by weighting the square distance with the within-class variance. A state-of-the-art technique to attack the k-center clustering problem is the kernel version of K-Means clustering, which expresses its distance function in the form of a kernel product of two data samples in a higher dimensional space, where the data samples are more separable. Namely, the kernel machine solves the k-center clustering problem in a high-dimensional Hilbert space instead of the original feature space. The optimization problem of k-center clustering in d-dimensional feature space has been proved to be NP-complete in k. The solution for k-center clustering in a one-dimensional space, however, can be found by dynamic programming in O(kN) time [7].
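The one-dimensional solution can be illustrated by the following straightforward O(kN²) dynamic-programming sketch in Python (the O(kN) algorithm of [7] additionally exploits quantizer monotonicity; this simplified version is ours, not the paper's):

import numpy as np

def dp_kmeans_1d(x, k):
    # Optimal k-clustering of one-dimensional data by dynamic programming.
    # Returns cluster labels in the order of the sorted samples.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))       # prefix sums
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))   # prefix sums of squares

    def sse(i, j):  # within-cluster squared error of x[i..j]
        m = j - i + 1
        s = s1[j + 1] - s1[i]
        return (s2[j + 1] - s2[i]) - s * s / m

    cost = np.full((k, n), np.inf)
    split = np.zeros((k, n), dtype=int)
    for j in range(n):
        cost[0, j] = sse(0, j)
    for m in range(1, k):
        for j in range(m, n):
            for i in range(m, j + 1):
                c = cost[m - 1, i - 1] + sse(i, j)
                if c < cost[m, j]:
                    cost[m, j], split[m, j] = c, i
    labels = np.empty(n, dtype=int)      # backtrack the optimal boundaries
    j = n - 1
    for m in range(k - 1, -1, -1):
        i = split[m, j] if m > 0 else 0
        labels[i:j + 1] = m
        j = i - 1
    return labels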
An intuitive approach to estimate an initial partition closer to the global optimum is to apply the dynamic programming technique over some one-dimensional subspace of the input data. In particular, in Wu's work on color quantization [8], this subspace was estimated by using principal component analysis (PCA) on the input data. In other words, the dynamic programming technique can be performed over each principal component subspace obtained by PCA. Since the best principal direction can be selected only from the d principal components, this estimated initial partition might still be far from the global optimum in the case of a high dimensional data source. A departure from this limitation is to iteratively incorporate both the linear Fisher discriminant and the dynamic programming technique into the K-Means clustering. The initial partition of the K-

Means clustering at each iteration is estimated by using dynamic programming in the discriminant subspace of the input data. The input class partition for the Fisher discriminant analysis at each iteration is selected as the output partition of the K-Means clustering at the former iteration.

In this work, a suboptimal K-Means clustering algorithm is investigated based on the multi-class Fisher discriminant and dynamic programming. In particular, a biased non-symmetric distance measurement, the Delta-MSE dissimilarity, is incorporated into the proposed clustering algorithm. In the second section, we describe the suboptimal K-Means clustering algorithm. In section 3, we briefly review the multi-class Fisher discriminant. In section 4, a heuristic biased dissimilarity, the Delta-MSE function, is introduced for K-Means clustering. In the experimental section, the proposed approach is compared to two other K-Means clustering algorithms: the PCA-based suboptimal K-Means algorithm [8] and the kd-tree based K-Means clustering algorithm [5]. Finally, conclusions are drawn in section 6.

Function SubOptimalKMeans(X, k, m)
input:  Dataset X
        Number of clusters k
        Number of iterations m
output: Class labels P_OPT

C ← randomly choose k cluster centroids from X;
P ← K-Means(X, C, k);
f_min ← ∞
for i ← 1 to m
    w ← solve Fisher discriminant based on class labels P;
    X_w ← project input data X onto discriminant direction w;
    P_w ← optimally solve the k-center clustering problem on X_w using dynamic programming;
    (C, P) ← K-Means(X, P_w, k);
    fratio ← calculate F-ratio of P
    if fratio < f_min then
        P_OPT ← P
        f_min ← fratio
    end if
end for

Fig. 1. Pseudo-code for the proposed algorithm.

2 Suboptimal K-Means Clustering

As mentioned earlier, the conventional K-Means algorithm typically converges to a local minimum of the mean square error (MSE). The algorithm is often initialized by a randomly chosen initial partition. However, in this sense, there is no guarantee of convergence to the global minimum. Motivated by Wu's optimal solution for scalar quantization [7] and his solution for color quantization [8], we iteratively apply the multi-class Fisher discriminant in the estimation of the suboptimal initial partition instead of using only the d principal components. The Fisher discriminant at each iteration can be constructed from the output class assignments obtained by the K-Means clustering at the previous iteration. The application of dynamic programming in the discriminant direction leads to a suboptimal partition of the data source in the discriminant subspace. Thus, the output suboptimal partition can be selected as the initial partition of K-Means clustering at the next iteration. Namely, K-Means clustering and the Fisher discriminant are performed once at each iteration. We present the pseudo-code of the proposed suboptimal K-Means clustering algorithm in Figure 1.

3 Multi-class Fisher discriminant

Discriminant analysis is a powerful tool for finding a direction that best reveals the classification structure. The goal of its application in this work is to apply the discriminant classifier to form a convex partition in the projection subspace that best matches the partition obtained by K-Means clustering in the original feature space. After the discriminant direction is determined, one can apply dynamic programming to all the projected data samples to further optimize the partition in the projection subspace. The multi-class Fisher discriminant [8] lends us a tool to design a classifier that approximates the partition achieved by the conventional K-Means clustering algorithm. The separation of the input classes in the projection direction w can be measured by the so-called F-ratio validity index, F(w), the ratio of the within-class variance to the between-class variance:

$$F(w)=\frac{k\sum_{i=1}^{N}\left(w^{T}(x_i-c_{p_i})\right)^2}{\sum_{j=1}^{M}n_j\left(w^{T}(c_j-\bar{x})\right)^2}\qquad(2)$$

where n_j is the sample size of class j.
The multi-class linear Fisher discriminant is derived by the minimization of the F-ratio validity index in equation (2), i.e.,

$$w=\arg\min_{w}\frac{w^{T}S_{W}\,w}{w^{T}S_{B}\,w}\qquad(3)$$

where S_B is the between-class covariance matrix and S_W is the within-class covariance matrix, respectively:

$$S_{B}=\sum_{j=1}^{M}n_j\left(c_j-\bar{x}\right)\left(c_j-\bar{x}\right)^{T},\qquad S_{W}=\sum_{i=1}^{N}\left(x_i-c_{p_i}\right)\left(x_i-c_{p_i}\right)^{T}\qquad(4)$$

The discriminant direction w can be estimated by computing the leading eigenvector of the matrix S_W^{-1} S_B.

4 Delta-MSE dissimilarity

In this work, instead of the squared L2 distance, a heuristic distance measurement, the Delta-MSE dissimilarity, is taken into account for the K-Means algorithm as proposed in [9]. The Delta-MSE dissimilarity is analytically induced from the clustering MSE distortion by moving a given data sample from one cluster to another cluster. The dissimilarity is calculated as the change of the within-class variance caused by this movement. The design approach

of the Delta-MSE dissimilarity always takes into account the dynamic nature of the K-Means partition process, in which the cluster parameters (cluster sizes) are subject to change all the time in the clustering algorithm. Assuming that a data sample x is moved from cluster i to cluster j, the change of the MSE function caused by this move is:

$$\Delta v(x)=\frac{n_j}{n_j+1}\left\|x-c_j\right\|^2-\frac{n_i}{n_i-1}\left\|x-c_i\right\|^2\qquad(5)$$

The first part on the right hand side, representing the increased variance of cluster j caused by this move, denotes the biased dissimilarity between x and c_j, written as D_MSE(x, c_j). The second part, representing the decreased variance of cluster i, denotes the dissimilarity between x and c_i, written as D_MSE(x, c_i). Thus, the Delta-MSE dissimilarity between data point x and the cluster centroid c_j can be defined as:

$$D_{\mathrm{MSE}}(x,c_j)=w_j\left\|x-c_j\right\|^2\qquad(6)$$

weighted by:

$$w_j=\begin{cases}n_j/(n_j+1), & p_x\neq j\\ n_j/(n_j-1), & p_x=j\end{cases}\qquad(7)$$

It is worth noting that the sparser the cluster is, the more different the Delta-MSE dissimilarity can be in comparison to the L2 distance. The weight w_j makes the biased dissimilarity bigger than the squared L2 distance if x is allocated in the cluster, and smaller than the squared L2 distance otherwise. In the repartition of the data samples driven by the biased dissimilarity, each sample is inclined to join or leave the sparser clusters more frequently than the denser clusters. Accordingly, the reassignments of data samples into their closest clusters are driven with the Delta-MSE dissimilarity more frequently than with the squared L2 distance. Thus, the biased dissimilarity enables the suboptimal clustering algorithm to converge faster toward the global optimum.

5 Experimental results

We have conducted experiments on the k-clustering problems of 5 real datasets from the UCI machine learning repository [6]. In the experiments, we studied the proposed suboptimal K-Means clustering with two dynamic programming methods. In the first method, denoted as LFD-I, we implemented dynamic programming with the MSE distortion function defined on the projection subspace; this method converges to a global minimum [7] in the one-dimensional projection subspace. In the second method, denoted as LFD-II, we consider the MSE distortion function defined on the d-dimensional feature space in the design of dynamic programming. Of course, in practice, one can view this approach not only as an approximation algorithm but also as a heuristic algorithm. We also compared the two proposed methods with the two alternative clustering algorithms: the PCA-based suboptimal K-Means algorithm and the kd-tree based K-Means clustering algorithm. The kd-tree based K-Means algorithm selects its initial cluster centroids from the k bucket centers of a so-called nested PCA kd-tree structure. This kd-tree structure can be constructed by recursively applying principal component analysis. The two comparative K-Means clustering algorithms are here denoted as PCA and KD-Tree, respectively.

Fig. 2. F-ratio distortions obtained by using the four different K-Means clustering algorithms (KD-Tree, PCA, LFD-I, LFD-II) on the datasets Glass, Heart and Image.
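The discriminant direction of Section 3 can be sketched as follows (our illustration, not the authors' code; the labels come from the previous K-Means pass, and a pseudo-inverse guards against a singular S_W):

import numpy as np

def fisher_direction(X, labels):
    # Leading eigenvector of pinv(S_W) S_B, per Eqs. (3)-(4).
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for j in np.unique(labels):
        Xj = X[labels == j]
        cj = Xj.mean(axis=0)
        Sw += (Xj - cj).T @ (Xj - cj)        # within-class scatter
        diff = (cj - mean).reshape(-1, 1)
        Sb += len(Xj) * (diff @ diff.T)      # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    w = np.real(evecs[:, np.argmax(np.real(evals))])
    return w / np.linalg.norm(w)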

The four K-Means clustering approaches, PCA, LFD-I, LFD-II and KD-Tree, are tested over five well-known datasets from the UCI machine learning repository [6]: boston, heart, glass, image and thyroid. The clustering performances of the four K-Means clustering algorithms are measured by the F-ratio clustering validity index. Figure 2 plots the F-ratio validity index obtained by the four K-Means algorithms over the datasets glass, heart and image. The F-ratio validity index is presented as a function of the number of clusters k. It can be observed that the two proposed algorithms in general outperform the two comparative algorithms. In particular, as the number of clusters k increases, their clustering performance gains over the two comparative algorithms improve further. Among the four clustering algorithms, the proposed suboptimal K-Means algorithms based on multi-class Fisher discriminant analysis yield better results than the other two algorithms. We also compared the clustering results of the four algorithms at the practical number of clusters for each dataset. Table 1 displays the F-ratio validity indices at the practical number of clusters for each dataset. Not surprisingly, the suboptimal K-Means algorithms based on Fisher discriminant analysis achieve better F-ratio validity indices than the other two algorithms.

Table 1: Performance comparisons of the four K-Means algorithms (KD-Tree, PCA, LFD-I, LFD-II) at the practical numbers of clusters for the datasets boston, glass, heart, image and thyroid.

6 Conclusion

We have proposed a new approach to the k-center clustering problem based on linear Fisher discriminant analysis and the dynamic programming technique. The linear Fisher discriminant analysis serves as a tool for finding the subspace that best matches the classification structure obtained by the conventional K-Means clustering algorithm. Application of dynamic programming in the linear discriminant subspace improves the clustering partition of the K-Means algorithm. The improved partition is used as the initial partition of the K-Means clustering in the next iteration. Thus, a design technique for iterative K-Means clustering can be constructed by iteratively incorporating Fisher discriminant analysis and the dynamic programming technique. Experimental results show that the proposed approach in general outperforms the two comparative K-Means algorithms: the PCA based suboptimal K-Means clustering algorithm and the kd-tree based K-Means clustering algorithm. In particular, as the number of clusters increases, its classification performance gains over the two comparative K-Means algorithms improve further.

References

[1] P. S. Bradley, O. L. Mangasarian and W. N. Street, "Clustering via Concave Minimization", Advances in Neural Information Processing Systems 9 (NIPS9), MIT Press, Cambridge, MA, 1997.
[2] C. Chinrungrueng and C. H. Séquin, "Optimal adaptive k-means algorithm with dynamic adjustment of learning rate", IEEE Transactions on Neural Networks, 6(1):157-169, 1995.
[3] D. H. Foley and J. W. Sammon, "An optimal set of discriminant vectors", IEEE Transactions on Computers, 24(3), 1975.
[4] M. Girolami, "Mercer Kernel Based Clustering in Feature Space", IEEE Trans. on Neural Networks, 13(3), 2002.
[5] A. Likas, N. Vlassis and J. J. Verbeek, "The Global K-means Clustering Algorithm", Pattern Recognition, 36(2): 451-461, 2003.
[6] UCI Repository of Machine Learning Databases and Domain Theories. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[7] X. Wu and K. Zhang, "Quantizer Monotonicities and Globally Optimal Quantizer Design Algorithms", IEEE Trans. on Information Theory, vol. 39, no. 3, May 1993.
[8] X.
Wu, "Color Quantzaton by Dynamc Programmng and Prncpal Analyss", ACM Trans. on Graphcs, vol., no. 4 TOG specal ssue on color, p , Oct [9] M. Xu, "Delta-MSE Dssmlarty n GLA-based Vector Quantzaton", IEEE Int. Conf. on Acoustcs, Speech, and Sgnal Processng, ICASSP 04, Montreal, Canada, to appear

5 Publication P5

M. Xu and P. Fränti, "A heuristic K-Means clustering algorithm by kernel PCA", IEEE Int. Conf. on Image Processing (ICIP'04), Singapore, vol. 3, 3503-3506, October 2004.


A HEURISTIC K-MEANS CLUSTERING ALGORITHM BY KERNEL PCA

Mantao Xu and Pasi Fränti
University of Joensuu
P.O. Box 111, FIN-80101 Joensuu, Finland
{xu, franti}@cs.joensuu.fi

ABSTRACT

K-Means clustering utilizes an iterative procedure that converges to local minima. This local minimum is highly sensitive to the selected initial partition for the K-Means clustering. To overcome this difficulty, we present a heuristic K-Means clustering algorithm based on a scheme for selecting a suboptimal initial partition. The initial partition is estimated by applying dynamic programming in a nonlinear principal direction. In other words, an optimal partition of the data samples along the kernel principal direction is selected as the initial partition for the K-Means clustering. Experimental results show that the proposed algorithm outperforms the PCA based K-Means clustering algorithm and the kd-tree based K-Means clustering algorithm, respectively.

1. INTRODUCTION

K-Means is a well-known technique in unsupervised learning and vector quantization. The K-Means clustering is formulated by minimizing a formal objective function, the mean-squared-error distortion:

$$\mathrm{MSE}(P)=\frac{1}{N}\sum_{i=1}^{N}\left\|x_i-c_{p_i}\right\|^2\qquad(1)$$

where N is the number of data samples; k is the number of clusters; d is the dimension of the data vectors; X = {x_1, x_2, ..., x_N} is a set of N data samples; P = {p_i | i = 1, ..., N} is the class labeling of X; and C = {c_j | j = 1, ..., k} are the k cluster centroids. Due to its simplicity of implementation, the conventional K-Means can be applied to a given clustering algorithm as a postprocessing stage to improve the final solution [1]. However, the main challenge for the conventional K-Means is that its classification performance relies highly on the selected initial partition. In other words, with most of the randomized initial partitions, the conventional K-Means algorithm converges to a locally optimal solution. An extended version of K-Means, the K-Median clustering, serves as one solution to overcome this limitation. The K-Median algorithm searches for each cluster centroid among the data samples such that the centroid minimizes the summation of the distances from all data points in the cluster to it. However, in practice, no efficient solutions are known to most of the formulated K-Median problems, which are NP-hard [2]. A more advanced technique [3] is to formulate the K-Means clustering as a kernel machine in a high-dimensional feature space. Namely, the kernel machine solves the k-clustering problem in a high-dimensional Hilbert space instead of the input space.

The optimization of k-clustering problems in d-dimensional space has been proved to be NP-hard in k; for a one-dimensional feature space, however, a scheme based on dynamic programming [8] can serve as a tool to derive a globally minimal solution. Hence, a heuristic approach to estimate the initial partition for K-Means clustering is to tackle the clustering optimization problem in some one-dimensional component space. Motivated by Wu's work on color quantization [9], this can be solved by dynamic programming in the principal component subspace. In particular, a nonlinear curve can be selected as this principal direction, i.e. a kernel principal component [5]. Developed by Schölkopf et al. [6], kernel principal component analysis (KPCA) is a state-of-the-art technique for feature extraction with an underlying nonlinear spatial structure, which transfers the input data into a higher dimensional feature space. In this sense, a kernel trick is utilized to perform the operations in the new feature space, where the data samples are more separable. Since the best principal direction can be selected only from the d principal components in the linear PCA, the estimated initial partition could be far from the global optimum in the case of a high dimensional data source.
However, the kernel PCA can provide the same number of principal components as the number of input data samples. In a larger sense, the data samples are more separable in the nonlinear principal curve direction than in the linear one. Hence, an initial partition closer to the global optimum can

be obtained by applying dynamic programming in the nonlinear principal curve subspace. In this paper, a heuristic K-Means clustering algorithm is investigated based on the kernel PCA and dynamic programming. A biased distance measurement, the Delta-MSE dissimilarity, is incorporated into the proposed clustering algorithm instead of the Euclidean distance. In the next section, we describe the heuristic K-Means algorithm using kernel PCA and dynamic programming. In section 3, we briefly review the technique of kernel principal component analysis. Section 4 introduces the Delta-MSE dissimilarity for the K-Means algorithm. In the experimental section, the proposed algorithm is compared to two existing clustering approaches: the PCA based suboptimal K-Means algorithm [9] and the kd-tree based K-Means clustering algorithm [4]. Finally, conclusions are drawn in section 6.

Function HeuristicKMeans(X, k, m)
input:  Dataset X
        Number of clusters k
        Number of principal components m
output: Class membership P_OPT

W ← solve m kernel principal directions of X;
f_min ← ∞
for j ← 1 to m
    X_PJ ← project X onto the j-th kernel principal direction w_j;
    P_I ← solve the optimal k-clustering problem on the scalar variable X_PJ by dynamic programming;
    P ← solve the K-Means clustering problem in the d-dimensional input space with initial partition P_I;
    fratio ← calculate F-ratio of P
    if fratio < f_min then
        P_OPT ← P
        f_min ← fratio
    end if
end for

Figure 1. Pseudocode of the heuristic K-Means algorithm.

2. HEURISTIC K-MEANS CLUSTERING

As mentioned earlier, the conventional K-Means algorithm typically converges to a local minimum of the mean-squared error (MSE). The K-Means algorithm is often initialized with a randomly chosen partition. However, in this sense, there is no guarantee of convergence to the global optimum. The optimization problem of k-clustering in d-dimensional feature space has been proved to be NP-complete in k. Encouraged by the success of kernel PCA [5, 6], we apply the kernel PCA in the estimation of the suboptimal initial partition instead of using only the d principal components as in Wu's work on color quantization [9]. The nonlinear principal components are constructed by performing PCA in the higher dimensional feature space spanned by Mercer kernel functions. Application of dynamic programming in each nonlinear principal direction leads to an optimal partition of the data samples in the projection subspace. Among the output optimal partitions in the m principal directions, the partition with the minimum F-ratio clustering validity index is selected as the initial partition for the K-Means clustering. This selection strategy leads to a smaller distortion between the suboptimal initial partition and the globally optimal solution. We present the pseudocode of the proposed heuristic clustering algorithm in Figure 1.

3. KERNEL PCA

Principal component analysis is one of the most popular techniques for feature extraction. The principal components of the input data X can be obtained by solving the eigenvalue problem of the covariance matrix of X. The conventional PCA can be generalized into a nonlinear one, the kernel PCA, by Φ: R^d → F, a mapping from the input data space to a high-dimensional feature space F. The space F, and therewith also the mapping Φ, might be very complicated. To avoid this problem, the kernel PCA employs a kernel trick to perform the feature space operations by explicitly using the inner product between two points in the feature space:

$$\left\langle\Phi(x_i),\Phi(x_j)\right\rangle=K(x_i,x_j)\qquad(2)$$

Thus, the covariance matrix can be written as:

$$W_{\Phi}=\frac{1}{N}\sum_{i=1}^{N}\Phi(x_i)\,\Phi(x_i)^{T}\qquad(3)$$

For any eigenvalue λ ≥ 0 of W_Φ and its corresponding eigenvectors V ∈ F\{0}, the equivalent formulation of the eigenvalue problem [6] in F can be defined as:

$$N\lambda\,\alpha=K\alpha\qquad(4)$$

where the eigenvector V is spanned in the space F as:

$$V=\sum_{i=1}^{N}\alpha_i\,\Phi(x_i)\qquad(5)$$

and where K_ij = K(x_i, x_j) and α = (α_1, α_2, ..., α_N)^T.
For the kernel component extraction, we compute the projection of each data sample x onto the eigenvector V:

$$\left\langle\Phi(x),V\right\rangle=\sum_{i=1}^{N}\alpha_i\,K(x,x_i)\qquad(6)$$

The kernel PCA allows us to obtain features with high-order correlations between the input data samples. In nature, the kernel projection of the data samples onto the kernel principal component may reveal the nonlinear spatial structure of the input data. Namely, the inherent nonlinear structure inside the input data is reflected with most merit in the principal component subspace.
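A minimal sketch of Eqs. (2)-(6) (ours, not the paper's code; a Gaussian kernel is assumed purely for illustration, while the paper only requires a Mercer kernel):

import numpy as np

def kpca_projections(X, m, gamma=1.0):
    # Project the data onto the first m kernel principal directions (Eq. 6).
    X = np.asarray(X, dtype=float)
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                      # Gaussian kernel matrix (assumption)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # centering in the feature space
    evals, evecs = np.linalg.eigh(Kc)            # eigenproblem of Eq. (4)
    order = np.argsort(evals)[::-1][:m]
    alphas = evecs[:, order] / np.sqrt(np.maximum(evals[order], 1e-12))
    return Kc @ alphas                           # N x m matrix of projections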

Figure 2: F-ratio distortions obtained by using the four different K-Means clustering algorithms (KD-Tree, PCA, KPCA-I, KPCA-II) on the datasets Camera, Image and Missa1.

4. DELTA-MSE DISSIMILARITY

Instead of using the Euclidean distance, we incorporate a heuristic distance measurement, the Delta-MSE dissimilarity, into the K-Means clustering as proposed in [10]. This dissimilarity is analytically induced from the clustering MSE function by moving a data sample from one cluster to another, and is calculated as the change of the within-class variance caused by this movement. Let a data sample x move from cluster i to cluster j; the change of the MSE function caused by this move is:

$$\Delta v(x)=\frac{n_j}{n_j+1}\left\|x-c_j\right\|^2-\frac{n_i}{n_i-1}\left\|x-c_i\right\|^2\qquad(7)$$

The first part on the right hand side, the increased variance of cluster j, denotes the biased dissimilarity between x and c_j. The second part, representing the decreased variance of cluster i, denotes the dissimilarity between x and c_i. Thus, the Delta-MSE dissimilarity between data point x and the cluster centroid c_j is written as:

$$D_{\mathrm{MSE}}(x,c_j)=\begin{cases}n_j\left\|x-c_j\right\|^2/(n_j+1), & p_x\neq j\\ n_j\left\|x-c_j\right\|^2/(n_j-1), & p_x=j\end{cases}\qquad(8)$$

It is worth noting that the sparser the cluster is, the more different the Delta-MSE dissimilarity can be in comparison to the squared L2 distance. In the repartition of the data samples driven by this dissimilarity, each sample is inclined to join or leave sparse clusters more frequently than dense clusters. Thus, the heuristic dissimilarity enables the proposed clustering procedure to converge to a solution closer to the global optimum.

5. EXPERIMENTAL RESULTS

We have conducted experiments on the k-clustering problems of 5 real datasets from the UCI machine learning repository [7] and of the datasets from 6 standard images: Bridge and Camera are the datasets with 4×4 blocks from the images Bridge and Cameraman; Housec5 and Housec8 are quantized to 5 bits and 8 bits per color, respectively; Missa1 and Missa2 are the datasets with 4×4 vectors from the difference image of frames 1 and 2 for Miss America and the difference image of frames 2 and 3, respectively. We studied the proposed K-Means algorithm with two dynamic programming methods. In the first method, denoted as KPCA-I, we implemented the dynamic programming with the MSE distortion defined only on the projection subspace. In the second method, denoted as KPCA-II, we considered the MSE distortion defined on the whole d-dimensional input space in the design of dynamic programming. Of course, in practice, one can view this approach as a heuristic algorithm for selecting the initial partition for the K-Means clustering. We also compared the two proposed approaches with the two existing clustering algorithms: the PCA-based suboptimal K-Means algorithm (denoted as PCA) and the kd-tree based K-Means clustering algorithm (denoted as KD-Tree). The kd-tree based K-Means algorithm selects the initial cluster centroids from the k bucket centers of a kd-tree developed also by principal component analysis.
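For reference, the F-ratio validity index used to compare the methods can be computed as follows (a minimal sketch under the usual definition, k times the within-group variance over the between-group variance; not the authors' code):

import numpy as np

def f_ratio(X, labels, k):
    X = np.asarray(X, dtype=float)
    gmean = X.mean(axis=0)
    within = between = 0.0
    for j in range(k):
        Xj = X[labels == j]
        if len(Xj) == 0:
            continue
        cj = Xj.mean(axis=0)
        within += ((Xj - cj) ** 2).sum()                 # within-group variance
        between += len(Xj) * ((cj - gmean) ** 2).sum()   # between-group variance
    return k * within / between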

The four K-Means clustering approaches, PCA, KPCA-I, KPCA-II and KD-Tree, are tested over the five datasets from the UCI repository and the six image datasets. The performances of the clustering algorithms are measured by the F-ratio clustering validity index. Figure 2 plots the F-ratio validity index obtained by the four K-Means approaches over the datasets Camera, Image (image segmentation data from UCI) and Missa1. The F-ratio validity index is presented as a function of the number of clusters k. It can be observed that the two proposed methods in general outperform the other comparative algorithms. In particular, as the number of clusters k increases, their clustering performances improve further against the two others. Among the four clustering approaches, the proposed K-Means algorithms based on the kernel PCA yield better results than the others. We also compare the clustering results of the four algorithms with the number of clusters k = 10 in Tables 1-2. Not surprisingly, the proposed heuristic K-Means algorithms achieve better F-ratio validity indices than the others.

Table 1: Performance comparisons of the four K-Means clustering algorithms (KD-Tree, PCA, KPCA-I, KPCA-II) on the five real datasets from UCI: boston, glass, heart, image and thyroid.

Table 2: Performance comparisons of the four K-Means clustering algorithms on the six image datasets: bridge, camera, housec5, housec8, missa1 and missa2.

6. CONCLUSION

We have proposed a new approach to the k-clustering problem based on the kernel PCA and dynamic programming. Application of dynamic programming in the nonlinear principal direction obtained by the kernel PCA estimates a suboptimal initial partition for the K-Means clustering. Since the data samples are more separable in the nonlinear principal direction than in the linear one, an initial partition closer to the global optimum is achieved by the proposed selection scheme. A heuristic distance measurement, the Delta-MSE function, is also incorporated into the proposed K-Means clustering algorithm instead of the Euclidean distance. Experimental results show that the proposed algorithm in general outperforms the two existing K-Means algorithms compared in this work. In particular, as the number of clusters increases, its classification performance improves against the two other algorithms.

7. REFERENCES

[1] P. Fränti, J. Kivijärvi and O. Nevalainen, "Tabu search algorithm for codebook generation in VQ", Pattern Recognition, 31(8), pp. 1139-1148, August 1998.
[2] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to NP-Completeness, W. H. Freeman, New York, 1979.
[3] M. Girolami, "Mercer Kernel Based Clustering in Feature Space", IEEE Trans. on Neural Networks, 13(3), 2002.
[4] A. Likas, N. Vlassis and J. J. Verbeek, "The Global K-means Clustering Algorithm", Pattern Recognition, 36(2): pp. 451-461, 2003.
[5] B. Schölkopf, A. Smola, and K.R. Müller, "Kernel principal component analysis", Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[6] B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.R. Müller, "Kernel PCA pattern reconstruction via approximate pre-images", Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, Springer Verlag, Berlin, 1998.
[7] UCI Repository of Machine Learning Databases and Domain Theories. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[8] X. Wu, "Color Quantization by Dynamic Programming and Principal Analysis", ACM Trans. on Graphics, vol. 11, no. 4 (TOG special issue on color), Oct. 1992.
[9] X. Wu and K. Zhang, "Quantizer Monotonicities and Globally Optimal Quantizer Design Algorithms", IEEE Trans. on Information Theory, vol. 39, no. 3, May 1993.
[10] M.
Xu, "Delta-MSE Dissimilarity in GLA-based Vector Quantization", in Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'04), Montreal, Canada, May 2004.

6 Publication P6

M. Xu and P. Fränti, "Context clustering in lossless compression of grayscale image", Scandinavian Conf. on Image Analysis (SCIA'03), Göteborg, Sweden, Lecture Notes in Computer Science, vol. 2749, 328-334, June 2003.


Context Clustering in Lossless Compression of Gray-Scale Image

Mantao Xu and Pasi Fränti
Department of Computer Science, University of Joensuu,
P.O. Box 111, FIN-80101 Joensuu, Finland
{xu, franti}@cs.joensuu.fi

Abstract. We consider and evaluate a context clustering method for lossless image compression based on the existing LOCO-I algorithm used in JPEG-LS, the latest lossless image compression standard. We employ the LOCO-I Med-predictor to produce the error pixels. The contexts are defined by calculating gradients around the current pixel. The three directional gradients are quantized with codebook sizes 7, 9 and 9, respectively. The error pixels are then corrected and encoded using the clustered contexts. A main advantage of using the context clustering method is that it can eliminate the storage of the probability vectors. An adaptive arithmetic encoder is also introduced to yield a higher compression rate.

1 Introduction

Lossless image compression has been widely used in many application fields such as picture archiving, geophysical surveying and telemedicine. Recently, several algorithms have been developed that serve as benchmarks of image compression. FELICS is a simple and efficient compression algorithm avoiding the time-consuming arithmetic coding [1]. The latest JPEG-LS standard was implemented based on the LOCO-I algorithm [2, 3]. CALIC, the adaptive context-based lossless image codec [4], was evaluated to have the best performance by the JPEG working group. Both of these algorithms employ prediction techniques before context modeling. In addition to prediction techniques, a continuous-tone image can be divided into a set of bitplane images that are coded by CTW, the context tree weighting method [5]. An alternative for bi-level image compression is context clustering [6], which replaces the conditional probability vector or density function of each context with the reference probability vector of its cluster of contexts. The method improves bi-level image compression performance by reducing the storage of the probability density functions (PDFs).

Since the conditional probability density function of each context is represented as a probability vector in a Euclidean space, context clustering can be used to quantize the probability density functions of the existing algorithms. Thus, only the reference probability vectors need to be coded and transmitted to the decoder as a part of the storage of the compressed file. To implement lossless image compression, we first follow the LOCO-I Med-predictor. Then context clustering and an adaptive arithmetic

coder are employed. We replace the fixed-size quantizer in LOCO-I with an optimized quantizer. For reduction of the PDF vector storage, the codeword of each corrected error pixel is divided into two parts. Context clustering is implemented only in the PDF vector space by using the generalized Kullback-Leibler distance. Finally, we test the performances of JPEG-LS, CALIC, BTPC (bit plane predictive coding) and our context clustering algorithm on several gray-scale images. Experimental results show that the proposed algorithm yields a compression rate similar to JPEG-LS but provides a flexible framework and some variations of the methods included in JPEG-LS.

2 Prediction

We first take a 4-pixel neighborhood as the context used for the prediction of the current pixel; see Figure 1. Since the predictor in CALIC is more complicated, we here employ the LOCO-I predictor described on the right side of Figure 1:

$$\hat{x}=\begin{cases}\min(a,b) & \text{if } c\geq\max(a,b)\\ \max(a,b) & \text{if } c\leq\min(a,b)\\ a+b-c & \text{otherwise}\end{cases}$$

Fig. 1. The 4-pixel context (a, b, c, d) in a gray-scale image and the Med-predictor in LOCO-I.

The predicted error pixel is calculated as ε(t) = x(t) − x̂(t) and will be corrected later by bias cancellation for statistical modeling, where t is the current pixel position.

3 Context Modeling

The context modeling includes the following two steps: context quantization and bias cancellation. The context is determined by calculating three gradients between the four context pixels: g_1 = d − b, g_2 = b − c, and g_3 = c − a.

3.1 Context Quantization

Each gradient variable can be quantized either by a K-means quantizer or by a fast scalar quantizer [7]. The codebook sizes of g_1, g_2 and g_3 are 7, 9 and 9, respectively. We here assume that the value of g_1 is very close to the value of g_2, because they are the horizontally neighboring gradients, while g_3 is the unique vertical gradient. We therefore assume the number of quantized values for g_3 should be more or less equal to the sum

of the numbers for g_1 and g_2. Then the contexts are merged according to the quantized values. That is to say, the contexts with identical quantized values are merged. The number of models is reduced further by assuming that the symmetric contexts C = [g_1, g_2, g_3] and C = [−g_1, −g_2, −g_3] have the same statistical properties with the difference of the sign. The total number of models is thus:

$$C=\frac{(2T_1+1)(2T_2+1)(2T_3+1)+1}{2}\qquad(1)$$

where 2T_i + 1 is the codebook size of gradient g_i.

3.2 Bias Cancellation

Since the quantized context C(t) is known both by the encoder and the decoder, the prediction error pixel ε(t), which is used to recover the image pixel x(t), can be corrected adaptively by C(t). The adaptive correction is used to cancel the bias offset of the TSGD distribution [3] due to the fixed predictor. We denote the estimated error pixel and the corrected error pixel as ε̂(t) and ε̄(t), respectively. We here present the pseudo-code of the bias cancellation [8] as follows:

PROCEDURE BiasCancellation(x, x̂, C) return ε̄
  FOR q ← 1 TO NumberOfQuantizedContexts DO
    S[q] ← 0;  N[q] ← 1;
  FOR t ← 1 TO NumberOfImagePixels DO
    q ← QuantizationIndexOf(C(t));
    ε̂(t) ← S[q] / N[q];
    x̃(t) ← x̂(t) + ε̂(t);
    ε̄(t) ← x(t) − x̃(t);
    S[q] ← S[q] + ε̄(t);
    N[q] ← N[q] + 1;
    IF N[q] ≥ 128 THEN
      S[q] ← S[q]/2;  N[q] ← N[q]/2;
  return ε̄;
END PROCEDURE

4 Statistical Modeling

To prepare for statistical modeling, we need to interleave the corrected errors into the non-negative region by a transformation in LOCO-I; see formula (4) in [2]. Then the conditional PDF functions in the quantized contexts can be calculated for the encoder. However, the conditional PDF functions, which would have to be stored in the compressed file, would take a large memory space. To reduce the storage of the PDF functions, we here separate the corrected error pixel ε̄ into two variables, ε_1 and ε_2, which will be coded separately. A simple solution is to divide each corrected error ε̄ by a constant integer d = 16 and calculate the corresponding quotient and modulo as:

$$\varepsilon_1=\left\lfloor\bar{\varepsilon}/d\right\rfloor,\qquad\varepsilon_2=\bar{\varepsilon}\bmod d\qquad(2)$$

4.1 Context Clustering

To further reduce the PDF function storage, we employ context clustering in the PDF vector space and replace the PDF vector of each context with the reference PDF vector of its context cluster. A given context C_j is represented by the pair of probabilities C_j = (f_j, p_j), where f_j is the probability of the context C_j and p_j is the conditional PDF function (or vector) of the corrected error pixel in the context C_j. An optimized codebook Y consisting of reference vectors is generated by the K-means clustering algorithm. The i-th reference vector y_i is calculated as:

$$y_i=(\bar{f}_i,\bar{p}_i),\qquad \bar{f}_i=\sum_{L(j)=i}f_j,\qquad \bar{p}_i=\frac{1}{\bar{f}_i}\sum_{L(j)=i}f_j\,p_j\qquad(3)$$

where L is the partition table. Instead of the conditional PDF vector p_j, the reference conditional PDF vector p̄_i is used to encode all error pixels neighbored by any context in cluster i. The number of conditional PDF vectors used by the encoder is therefore reduced. To optimize the codebook Y in the PDF vector space {(f, p) | f ∈ R, p ∈ R^dim}, we employ the Kullback-Leibler distance as the vector-to-cluster distance in this work. The quantized contexts are therefore reallocated into clusters by clustering only in the PDF vector space.

4.2 Clustering Distance and Distortion Function

As shown in section 4.1, each context is optimally allocated into a cluster with a reference conditional PDF vector p̄_i, which is used to encode the corrected error pixel instead of its own PDF vector p_j. So the distortion of context clustering should reflect the total difference of entropy incurred by using the two kinds of PDF vectors above. In compression of gray-scale images, the distortion function is naturally generalized as:

$$\mathrm{Distortion}=\sum_{j=1}^{M}f_j\,D\!\left(p_j\,\middle\|\,\bar{p}_{L(j)}\right)\qquad(4)$$

where D(p_j ‖ p̄_i) is the Kullback-Leibler distance, calculated as:

$$D\!\left(p\,\middle\|\,\bar{p}\right)=\sum_{k=1}^{\dim}p_k\log_2\!\left(p_k/\bar{p}_k\right)\qquad(5)$$
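The clustering step of Eqs. (3)-(5) can be sketched as follows (our minimal illustration, not the paper's implementation): the contexts are reassigned by the Kullback-Leibler distance and the reference PDF vectors are recomputed as frequency-weighted averages.

import numpy as np

def kl_distance(p, q, eps=1e-12):
    # Kullback-Leibler distance of Eq. (5), in bits; eps avoids log(0).
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log2(p / q)))

def cluster_contexts(f, P, n_clusters, iters=20, seed=0):
    # f: (M,) context probabilities; P: (M, dim) conditional PDF vectors.
    # Returns the partition table L and the reference PDF vectors of Eq. (3).
    rng = np.random.default_rng(seed)
    M = len(f)
    refs = P[rng.choice(M, n_clusters, replace=False)]
    L = np.zeros(M, dtype=int)
    for _ in range(iters):
        for j in range(M):                  # reassignment by KL distance
            L[j] = min(range(n_clusters), key=lambda i: kl_distance(P[j], refs[i]))
        for i in range(n_clusters):         # frequency-weighted centroids
            idx = (L == i)
            if idx.any():
                refs[i] = (f[idx][:, None] * P[idx]).sum(axis=0) / f[idx].sum()
    return L, refs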

5 Adaptive Arithmetic Coding

In this section we introduce an adaptive arithmetic coder whose PDF function is updated with the encoded pixels. The adaptive coder is demonstrated by encoding a sequence S = {S_i | i = 1, ..., N} that consists of the symbols a, b and c. The sequence S is given as aaabbaaccbbccbb, from which the initial frequency of each symbol is calculated as f_a = 5, f_b = 6 and f_c = 4. The initial frequencies need to be transmitted to the decoder. Then the sequence can be coded by the set of adaptive probabilities {5/15, 4/14, 3/13, 6/12, 5/11, 2/10, 1/9, 4/8, 3/7, 4/6, 3/5, 2/4, 1/3, 2/2, 1/1}. To make it clear, we present the pseudo-code of this example adaptive coder as:

PROCEDURE AdaptiveCoder(S, f_a, f_b, f_c)
  g_a ← f_a;  g_b ← f_b;  g_c ← f_c;
  N ← f_a + f_b + f_c;
  FOR i ← 1 TO NumberOfCharacters DO
    ArithmeticCoding(S_i, g_a/N, g_b/N, g_c/N);
    IF S_i = 'a' THEN g_a ← g_a − 1;
    ELSE IF S_i = 'b' THEN g_b ← g_b − 1;
    ELSE g_c ← g_c − 1;
    N ← N − 1;
END PROCEDURE

In the adaptive coder above, once the current character has been coded, it is removed from the sequence; therefore, the next symbol to be encoded is coded by the probability distribution of the rest of the sequence. Hence the adaptive arithmetic coder can yield a better compression rate if the sequence size is not very large. Similarly, the adaptive arithmetic coder works as well for coding the corrected error pixels by updating the conditional PDF vectors.

6 Experimental Results

This section describes the testing results of lossless image compression with context clustering. We used six standard images for testing our context clustering algorithm; see Table 1, which lists the final bit rates of the six compressed images. We also experimented with three lossless image compression algorithms, CALIC, LOCO-I and BTPC, to evaluate the performance of our algorithm. From the testing results, we learned that context clustering achieves a compression ratio close to LOCO-I but is more flexible than the others in the selection of the contexts and the conditional PDFs. We chose 7, 9 and 9 as the quantizer sizes for the three gradients, respectively. We learned from the experiments that the selection of the gradient quantizer sizes can have an effect on the compression performance. The adaptive arithmetic coder does improve the compression ratio in this work; however, the improvement is sensitive to the number of symbols to be coded: when the number of symbols is very large, the improvement is limited.
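The decrementing probability model of Section 5 can also be expressed compactly (an illustrative Python sketch of ours; the arithmetic-coding step itself is omitted):

from collections import Counter

def adaptive_probabilities(seq):
    # Each symbol is coded with the frequency distribution of the
    # not-yet-coded remainder of the sequence, exactly as in the
    # AdaptiveCoder procedure above.
    counts = Counter(seq)
    n = len(seq)
    probs = []
    for s in seq:
        probs.append((counts[s], n))   # probability counts[s] / n
        counts[s] -= 1
        n -= 1
    return probs

# adaptive_probabilities("aaabbaaccbbccbb") reproduces the probabilities
# 5/15, 4/14, 3/13, 6/12, 5/11, 2/10, 1/9, 4/8, 3/7, 4/6, 3/5, 2/4, 1/3, 2/2, 1/1.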

Table 1. Compression results of the testing images in bits/pixel for context clustering, CALIC, LOCO-I and BTPC on the images Camera, Bridge, Missa1, Missa2, Lena and Boats.

7 Conclusions

Context clustering is shown to be an effective alternative for lossless image compression in this work. By replacing the conditional PDF vectors with the reference PDF vectors, context clustering solves the storage problem of the PDF vectors. Therefore, an adaptive arithmetic coder can be directly used to encode the corrected error pixels; however, its coding performance depends much on the number of symbols to be coded. A variable quantizer size for each gradient should be more reasonable than a fixed size such as 9. We found that when the quantizer size of the vertical gradient is equal to the sum of the quantizer sizes of the two horizontal gradients, our context clustering method achieves a better performance. Context clustering still suffers from a bottleneck between the entropy of the arithmetic coding and the storage of the PDF vectors. The use of more PDF vectors in the arithmetic coder is the best way to reduce the coding entropy; however, the storage of the PDF vectors will then increase sharply. Further research will concentrate on improving the methods used in the proposed context clustering algorithm. For example, an optimal division number d in equation (2) needs to be estimated to improve the compression performance. The selection of the optimal number of clustered contexts in the PDF vector space should be investigated in the future. Furthermore, the scalar context quantization in this work could be replaced by vector quantization in the context gradient space, which would need to be solved on the fly.

References

1. Howard, P., Vitter, J.: Fast and Efficient Lossless Image Compression. In: Storer, J., Cohn, M. (eds.): Proceedings of the IEEE Data Compression Conference, Snowbird, Utah, March 30 - April, 1993. IEEE Computer Society Press.
2. Weinberger, M., Seroussi, G., Sapiro, G.: A Low Complexity, Context-Based, Lossless Image Compression Algorithm. In: Storer, J., Cohn, M. (eds.): Proceedings of the 6th Data Compression Conference, Snowbird, Utah, March 31 - April 3, 1996. IEEE Computer Society Press.

3. Weinberger, M., Seroussi, G., Sapiro, G.: The LOCO-I Lossless Image Compression Algorithm: Principles and Standardization into JPEG-LS. IEEE Trans. Image Processing.
4. Wu, X., Memon, N.: Context-based, Adaptive, Lossless Image Codec. IEEE Trans. Communications.
5. Ekstrand, N.: Lossless Compression of Grayscale Images via Context Tree Weighting. In: Storer, J., Cohn, M. (eds.): Proceedings of the 6th Data Compression Conference, Snowbird, Utah, March 31 - April 3, 1996. IEEE Computer Society Press.
6. Greene, D.H., Yao, F., Zhang, T.: A Linear Algorithm for Optimal Context Clustering with Application to Bi-level Image Coding. In: Proceedings of the 1998 IEEE Int. Conf. on Image Processing, Chicago, Illinois, October 4-7, 1998. IEEE Computer Society.
7. Ortega, A., Vetterli, M.: Adaptive Scalar Quantization Without Side Information. IEEE Trans. Image Processing.
8. Wu, X.: Lossless Compression of Continuous-tone Images via Context Selection, Quantization, and Modeling. IEEE Trans. Image Processing.


7 Publication P7

M. Xu, X. Wu and P. Fränti, "Context quantization by kernel Fisher discriminant", IEEE Trans. on Image Processing, accepted for publication.


Context Quantization by Kernel Fisher Discriminant

Mantao Xu, Xiaolin Wu, Senior Member, IEEE, and Pasi Fränti

Abstract - Optimal context quantizers for minimum conditional entropy can be constructed by dynamic programming in the probability simplex space. The main difficulty, operationally, is the resulting complex quantizer mapping function in the context space in which the conditional entropy coding is conducted. To overcome this difficulty we propose new algorithms for designing context quantizers in the context space based on the multi-class Fisher discriminant and the kernel Fisher discriminant. In particular, the kernel Fisher discriminant can describe linearly non-separable quantizer cells by projecting input context vectors onto a high-dimensional curve, in which these cells become better separable. The new algorithms outperform the previous linear Fisher discriminant method for context quantization. They approach the minimum empirical conditional entropy context quantizer designed in the probability simplex space, but with a practical implementation that employs a simple scalar quantizer mapping function rather than a large look-up table.

Index Terms - Context quantization, entropy coding, Fisher discriminants, image compression.

(Manuscript received June 2, 2004; revised December 23, 2004. This work was supported in part by NSERC, NSF and a Nokia Research Fellowship. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Thierry Blu. M. Xu is with the Department of Computer Science, University of Joensuu, PL 111, 80101 Joensuu, Finland (e-mail: xu@cs.joensuu.fi). X. Wu is with the Department of Electrical & Computer Engineering, McMaster University, Hamilton, Ontario, Canada, L8G 4K1. He is also a visiting professor of the University of Joensuu on a Nokia Fellowship, Finland (e-mail: xwu@mail.ece.mcmaster.ca). P. Fränti is with the Department of Computer Science, University of Joensuu, PL 111, 80101 Joensuu, Finland (e-mail: franti@cs.joensuu.fi).)

I. INTRODUCTION

A key and important task in compressing a discrete sequence X_0, X_1, X_2, ... is the estimation of the conditional probabilities P(X_t | X^{t-1}), where X^{t-1} = X_0, X_1, ..., X_{t-1} is the prefix or context of X_t. Given a class of source models, the model order or the number of parameters must be carefully chosen on the principle of minimum description length or universal source coding. The pioneering solution to the problem is Rissanen's algorithm Context [1], which dynamically selects a variable-order subset of the past samples in X^{t-1}, called the context, C_t. The algorithm structures the contexts of different orders by a tree, and it can be shown to be, under certain assumptions, universal in terms of approaching the minimum adaptive code length for a class of finite memory sources. A more recent and increasingly popular universal source coding technique is context tree weighting [2]. The idea is to weight the probability estimates associated with different branches of a context tree to obtain a better estimate of P(X_t | X^{t-1}).

Although tree-based context modeling techniques have had remarkable success in text compression, applying them to image compression poses a great difficulty. A context tree can only model a sequence, not a two-dimensional signal like an image. In order to apply context tree-based techniques to image coding, one needs to schedule the pixels or transform coefficients of an image into a linear sequence, as proposed by the authors of [3], [4]. Recently, Mrak et al. investigated how to optimize the ordering of the context parameters within the context trees [5]. But any linear ordering of pixels will inevitably destroy the intrinsic two-dimensional sample structure of an image.
This is why most image/video compression algorithms choose an a priori two-dimensional context model with fixed complexity, based on domain knowledge such as the correlation structure of the pixels and typical input image size, and estimate only the model parameters. For instance, the JBIG standard for binary image compression uses the contexts of a fixed-size causal template [6]. The actual coding is implemented by sequentially applying arithmetic coding based on the estimated conditional probabilities. Estimating the conditional probabilities P(X_t | C_t) directly using count statistics from past samples can incur a severe context dilution problem if the number of symbols in the context is large and/or if the symbol alphabet is large with respect to the length of the input signal, which is the case for image/video compression. Context quantization is a common technique to overcome this difficulty [7]-[9]. For example, the state-of-the-art lossless image compression algorithm CALIC [10] and the JPEG 2000 entropy coding algorithm EBCOT [11] quantize the context C_t into a relatively small number M of conditioning states, and estimate P(X_t | Q(C_t)), Q(.) in {1, ..., M}, instead of P(X_t | C_t), where Q denotes a context quantizer.

Context quantization is a form of vector quantization because the context C is a random vector in the d-dimensional context space E^d (i.e., the context model has order d). Naturally, the objective of optimal context quantization should be the minimization of the conditional entropy H(X | Q(C)). Although the convexity of the entropy function H implies H(X | Q(C)) >= H(X | C), we would like to make H(X | Q(C)) as close to H(X | C) as possible for a given M, or minimize the Kullback-Leibler distance:

    D(Q) = H(X | Q(C)) - H(X | C)    (1)
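For intuition, the empirical version of this quantity can be computed directly from count tables; the following is a minimal Python sketch (our illustration, with hypothetical helper names, not part of the paper):

  import numpy as np

  def cond_entropy(counts):
      # empirical H(X | S) in bits from a table counts[s, x] of joint
      # occurrences of conditioning state s and coded symbol x
      counts = np.asarray(counts, dtype=float)
      total = counts.sum()
      h = 0.0
      for row in counts:
          n = row.sum()
          h += sum(c * np.log2(n / c) for c in row if c > 0)
      return h / total

  def kl_distortion(raw_counts, cell_of_context):
      # D(Q) = H(X | Q(C)) - H(X | C): merge the rows of the raw-context
      # table according to the quantizer cells, then compare entropies
      M = max(cell_of_context) + 1
      merged = np.zeros((M, raw_counts.shape[1]))
      for c, m in enumerate(cell_of_context):
          merged[m] += raw_counts[c]
      return cond_entropy(merged) - cond_entropy(raw_counts)

By the convexity argument above, kl_distortion is always nonnegative, and a good quantizer Q drives it toward zero.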

Fig. 1. An example distribution of MCECQ cells A_m in the context space, for M = 3 and the source of least significant bits of DPCM errors of the image cameraman. The x and y axes represent the values of the first two elements of the raw context (the two directional gradients I(i, j-1) - I(i, j-2) and I(i-1, j) - I(i-2, j)), as given in (12) and (13). The symbols ., + and o in the scatter plot are respectively the raw contexts of cells A_1, A_2 and A_3.

Note that the above H, referring to the true source entropy, is not the actual code length, which should include the model cost. Although the Kullback-Leibler distance (relative entropy) is not strictly a distance metric, since it violates symmetry and the triangle inequality, the standard practice is to use it as a nonnegative distortion of the context quantizer Q.

The problem of context quantization minimizing the Kullback-Leibler distance was first studied by Wu [7] and then by Chen [12] for the application of wavelet image compression. Greene et al. also developed an optimal context quantization algorithm for the compression of binary images [13]. Recently, Forchhammer et al. proposed a context quantizer design algorithm under the criterion of minimal adaptive code length, and applied it to lossless video coding. A more theoretical treatment of the problem can be found in [8].

The existing context quantizer design algorithms can be classified into two approaches: those that form coding contexts directly in the context space of conditioning events (the feature space in the terminology of classification and pattern recognition), like [7] and [12], and those that form coding contexts in the probability simplex space [8], [9], [13]. In the context space one can apply the generalized Lloyd method [14] to design a context quantizer by clustering the raw contexts of a training set according to the Kullback-Leibler distance, which was the idea in [12]. But this iterative approach of gradient descent cannot guarantee the globally optimal solution. If the random variable X to be coded is binary, then the VQ problem of context quantization can be converted to a scalar quantization problem in the probability simplex space of P(X). This change of space makes it possible to design a globally optimal context quantizer by dynamic programming (DP) [8], [9], [13].

For the sake of rigor we remind the reader that the above-mentioned optimality is with respect to the statistics of the chosen training data. In practice, if the statistics of an input image mismatch those of the training set, then the coding performance of course becomes suboptimal. Nevertheless, designing an optimal context quantizer still has practical significance, because situations exist where a suitable training set can be found. Furthermore, an off-line optimized context quantizer can be used in conjunction with adaptive arithmetic coding to compensate for any coding loss due to the mismatch of statistics.

Regardless of what space is chosen to design the context quantizer, an input context feature vector c (a realization of the random variable C) has to be mapped to a coding state (a context quantizer cell) when it comes to actual context-based coding using P(X | Q(c)). In this regard, both design approaches face a common operational difficulty: the complex quantizer mapping function Q(c). Unlike in conventional VQ, the cells (coding states) of an optimal context quantizer are not convex or even connected in the context space. Since the quantizer mapping function Q(c) is highly unstructured and complex in the context space of c, its description seems possible only via table look-up. Unfortunately, the table size required by Q(c) grows exponentially in the order of the context.
To circumvent this problem, previous authors resorted to prequantization of the raw contexts c, i.e., limiting the resolution of the context space [12], or to the technique of product quantization [13]. Another technique is projection by the linear Fisher discriminant [7]. However, all these techniques compromise optimality. In this paper we reexamine the problem of optimal context quantization and strive to approach the minimal empirical conditional entropy of X under the constraint of a simple quantizer mapping function Q(c). We have made measured progress toward this objective by designing context quantizers using the kernel Fisher discriminant.

The presentation of this paper is organized as follows. Section 2 characterizes the structure of the cells of a context quantizer in both the probability simplex space and the context space, and exposes the complexity of the quantizer mapping function. The main results of this research, i.e., the context quantizer design algorithms based on the multi-class linear Fisher discriminant and the kernel Fisher discriminant, are presented in Section 3. The details of the design algorithm using the kernel Fisher discriminant are given in Section 4. Section 5 presents experimental results, and the conclusion follows in Section 6.

II. STRUCTURE AND COMPLEXITY OF QUANTIZER MAPPING

A context quantizer Q partitions a d-dimensional context space E^d into M subsets:

    A_m = { c : Q(c) = m },  m = 1, ..., M

The criterion of minimizing the Kullback-Leibler distance in context quantizer design leads to complex structures and shapes of the quantizer cells, which are in general not convex or even connected [8].

Fig. 2. The separability of two MCECQ cells A_1 (a) and A_2 (b) in the projection subspace formed by the kernel Fisher discriminant.

However, the associated sets of probability mass functions (pmfs)

    B_m = { P_{X|C}(. | c) : Q(c) = m },  m = 1, ..., M

are simple convex sets in the probability simplex space of X, owing to a necessary condition for the minimum conditional entropy quantizer Q [9]. If X is a binary random variable, then the probability simplex is one-dimensional. In this case, the quantizer cells B_m are simple intervals. Let Z = P_{X|C}(1 | c), the conditional probability of X = 1 as a function of the context c, be a random variable; then the conditional entropy H(X | Q(c)) of a context quantizer Q can be expressed by

    H(X | Q(c)) = sum_{m=1..M} P{ Z in (q_{m-1}, q_m] } H(X | Z in (q_{m-1}, q_m]),
    0 = q_0 < q_1 < ... < q_{M-1} < q_M = 1

where the quantizer thresholds {q_m | m = 1, ..., M-1} partition the unit interval into M contiguous cells {B_m | m = 1, ..., M}. Thus the design of the minimum conditional entropy context quantizer (MCECQ) can be reduced to a scalar quantization problem in Z, even though the context c is drawn from a d-dimensional vector space.

Fig. 3. The separability of two MCECQ cells A_1 (a) and A_2 (b) in the projection subspace formed by the linear Fisher discriminant.

The globally optimal solution of the problem,

    (q_1, q_2, ..., q_{M-1}) = argmin_{0 < q_1 < ... < q_{M-1} < 1} sum_{m=1..M} P{ Z in (q_{m-1}, q_m] } H(X | Z in (q_{m-1}, q_m]),

can be obtained using dynamic programming. Greene et al. showed that the MCECQ design problem can be solved in O(NM) time, where N is the number of raw (i.e., unquantized) contexts, thanks to a so-called concave Monge property of the objective function [13]. Once Z is scalar quantized for minimal empirical conditional entropy of a training set, the optimal MCECQ cells A_m are formed implicitly by

    A_m = { c : P_{X|C}(1 | c) in (q_{m-1}, q_m] }

However, P_{X|C} is seldom known exactly in practice; otherwise one would directly drive an entropy coder with P_{X|C}. Instead, a training set is used to estimate P_{X|C}.
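As a concrete illustration of this design step, the following is a minimal Python sketch (ours; the names are illustrative and the O(NM) Monge speed-up of [13] is not implemented) that finds the optimal thresholds by a plain O(MN^2) dynamic program over the empirical counts of a training set:

  import math
  from collections import defaultdict
  from itertools import accumulate

  def interval_cost(c0, c1):
      # code length in bits of an interval holding c0 zeros and c1 ones,
      # i.e. n * H(X | Z in interval), with 0*log(0) treated as 0
      n = c0 + c1
      return sum(c * math.log2(n / c) for c in (c0, c1) if c)

  def mcecq_thresholds(z, x, M):
      # z: estimated Z = P(X=1|c) per training context; x: the coded bits
      agg = defaultdict(lambda: [0, 0])
      for zi, xi in zip(z, x):
          agg[zi][xi] += 1
      zs = sorted(agg)
      N = len(zs)
      P0 = [0] + list(accumulate(agg[v][0] for v in zs))   # prefix sums
      P1 = [0] + list(accumulate(agg[v][1] for v in zs))
      cost = lambda i, j: interval_cost(P0[j] - P0[i], P1[j] - P1[i])
      INF = float('inf')
      dp = [[INF] * (N + 1) for _ in range(M + 1)]
      arg = [[0] * (N + 1) for _ in range(M + 1)]
      dp[0][0] = 0.0
      for m in range(1, M + 1):            # dp[m][i]: best cost of cutting
          for i in range(m, N + 1):        # the first i values into m cells
              for j in range(m - 1, i):
                  v = dp[m - 1][j] + cost(j, i)
                  if v < dp[m][i]:
                      dp[m][i], arg[m][i] = v, j
      cuts, i = [], N                      # backtrack the cell boundaries
      for m in range(M, 0, -1):
          cuts.append(i)
          i = arg[m][i]
      return [zs[c - 1] for c in sorted(cuts)[:-1]]   # thresholds q_1..q_{M-1}

The returned values are the right endpoints of the M contiguous intervals in Z, i.e., the thresholds q_1, ..., q_{M-1} of the display above.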

Wu et al. [8] showed that the partition of the context space E^d by the MCECQ cells A_m is generally very complex in shape and structure, resulting in a highly irregular quantizer mapping function Q(c). An example of the distribution of the A_m in the context space is given in Fig. 1. Only when P_{C|X}(c | X=0) and P_{C|X}(c | X=1) are Kotz-type d-dimensional elliptical distributions are the MCECQ cells A_m bounded by quadratic surfaces [8]. Consequently, the implementation of an arbitrary quantizer mapping function Q becomes an operational difficulty in using MCECQ in practice, which is the main issue that motivated this research.

The simplest way of implementing Q is to use a look-up table. But since |C|, the number of all possible raw contexts, grows exponentially in the order of the contexts, building a huge table of |C| entries for Q is clearly impractical. Hashing techniques can be used to avoid the excessive memory use of the Q table by exploiting the fact that the actual number of different raw contexts appearing in an input image is much smaller than |C|. But this saving of memory comes at the expense of increased time for the quantizer mapping operation when collisions in table access occur. To achieve constant execution time of the quantizer mapping function, the size of the hashing table has to be larger than the number of distinct raw contexts by a sufficient factor. In the case of image coding, the table size needs to be comparable to the image size, since many raw contexts have a very low frequency of occurrence.

A common technique to simplify the quantizer mapping function Q is projection. Wu proposed a suboptimal context quantizer design algorithm based on Fisher's linear discriminant [7]. The idea was to project the training context vectors in the direction y such that the two marginal posterior distributions of P_{C|X}(y.c | X=0) and P_{C|X}(y.c | X=1), c in E^d, have maximum separation. Then a dynamic programming algorithm was used to form a convex M-partition of the corresponding one-dimensional projection space to minimize the conditional entropy:

    H(X | Q(y.c))    (2)

in which the intervals (q_{m-1}, q_m], 1 <= m <= M, define the context quantizer Q. In this design approach the context quantizer Q is a scalar one in the projection direction y, i.e., a subspace of the original context space E^d. Although the projection approach is suboptimal, it simplifies the quantizer mapping function to Q(c) = m if and only if y.c in (q_{m-1}, q_m], which has operational advantages in practice [7].

III. IMPROVED DESIGN ALGORITHMS OF FISHER DISCRIMINANTS

The progress made by this paper is to combine the advantages of the two MCECQ design approaches, in the probability simplex space and in the projection context space of Fisher's discriminant. Namely, we seek to attain simultaneously the optimality of MCECQ in the probability simplex space and the simplicity of quantizer mapping in the projection space.

A. Multi-class Linear Fisher Discriminant

In [7], a linear Fisher discriminant was used to separate the two posterior distributions of P_{C|X}(c | X=0) and P_{C|X}(c | X=1), which is a two-class clustering problem. However, the success of this approach is limited to cases where P_{C|X}(c | X=0) and P_{C|X}(c | X=1) are linearly separable to a certain degree. For more difficult, linearly non-separable shapes of context cells a departure from [7] is needed. We seek to separate the M optimal MCECQ cells formed in the probability simplex space via a suitable, non-linear projection of the context space. The goal is to apply the discriminant classifier to form a convex partition in the projection subspace that best matches the optimal partition of the B_m's in the probability simplex space. The multi-class Fisher discriminant [15] lends us a tool to design a classifier that approximates the optimal partition of contexts in the probability simplex space by an optimized partition in a projection subspace.
The separation of the input classes (i.e., the partition of the A_m's formed by MCECQ in the context space) in a projection direction y can be measured by the so-called F-ratio validity index, J(y), defined as the ratio of between-class variance to within-class variance:

    J(y) = [ sum_{i=1..M} n_i ( y^T (m_i - xbar) )^2 ] / [ sum_{j=1..N} ( y^T (x_j - m_{pi_j}) )^2 ]    (3)

where pi_j is the class label of each sample x_j and xbar is the mean vector of all raw context samples. The multi-class linear Fisher discriminant is the maximization of the F-ratio validity index in (3), i.e.,

    y = argmax_v ( v^T S_B v ) / ( v^T S_W v )    (4)

where v represents a discriminant vector in the raw context space. S_B and S_W in (4) are the between-class covariance matrix and the within-class covariance matrix, respectively:

    S_B = sum_{i=1..M} n_i (m_i - xbar)(m_i - xbar)^T
    S_W = sum_{j=1..N} (x_j - m_{pi_j})(x_j - m_{pi_j})^T

where m_i and n_i are the mean vector and sample size of class i in the context space, respectively. After the projection direction y is determined by (4), one can still apply dynamic programming to the projected samples y.c to optimize the context quantizer in the same way as in (2).
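A minimal Python sketch of this construction (ours, assuming numpy/scipy; the function name is illustrative) solves (4) as the generalized symmetric eigenproblem S_B v = lambda S_W v:

  import numpy as np
  from scipy.linalg import eigh

  def multiclass_lfd(X, labels):
      # X: (N, d) raw training contexts; labels: MCECQ cell of each context.
      # Returns the direction maximizing the F-ratio (3).
      xbar = X.mean(axis=0)
      d = X.shape[1]
      S_B = np.zeros((d, d))
      S_W = np.zeros((d, d))
      for m in np.unique(labels):
          Xm = X[labels == m]
          mm = Xm.mean(axis=0)
          S_B += len(Xm) * np.outer(mm - xbar, mm - xbar)   # between-class
          Dm = Xm - mm
          S_W += Dm.T @ Dm                                   # within-class
      # a tiny ridge keeps S_W positive definite for degenerate samples
      w, V = eigh(S_B, S_W + 1e-9 * np.eye(d))
      return V[:, -1]        # eigenvector of the largest eigenvalue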

B. Kernel Fisher Discriminant

The multi-class linear Fisher discriminant outperformed the two-class linear Fisher discriminant in terms of designing context quantizers of shorter code length in our experiments (see Section 5). But the contexts of different MCECQ cells (the input classes for the Fisher discriminant) are not linearly separable in the context space, as shown in [8]. A superior alternative is to use a non-linear classifier of higher discriminating power. Encouraged by the success of kernel-based learning machines, such as the support vector machine, kernel principal component analysis and kernel Fisher discriminant analysis (KFD), in many other classification and learning applications [16]-[20], we propose a new design technique for context quantizers using the multi-class kernel Fisher discriminant.

The multi-class kernel Fisher discriminant has been intensively studied as a generalization of discriminant analysis using the kernel approach [21], [22]. As an extension of the Fisher discriminant, the kernel version is known for its high discriminating power on input clusters of complex structures. The kernel discriminant first maps the source feature vectors (context vectors in MCECQ design) into some new feature space F in which the different classes are better separable. A linear discriminant is then computed to separate the input classes in F. Implicitly, this process constructs a non-linear classifier of high discriminating power in the original feature space. In our application to context quantization, the objective of the kernel discriminant is, given an M-fold input partition A_m = {c : Q(c) = m}, 1 <= m <= M, to find a projection direction y in a new feature space F such that the different A_m's are most separable in y. A dynamic programming algorithm is then applied to design an MCECQ in y. The resulting MCECQ in F implicitly constructs a context quantizer in the context space E^d.

Let Phi(c) be the nonlinear mapping from the context space to some high-dimensional Hilbert space F. Our goal is to find the projection line y in F such that the F-ratio validity index

    J(y) = ( y^T S_B^Phi y ) / ( y^T S_W^Phi y )    (5)

is maximized, where S_B^Phi and S_W^Phi are the between-class and within-class covariance matrices. Since the space F is of very high or even infinite dimension, computing the function Phi(c) is infeasible. A technique to overcome this difficulty is the Mercer kernel function k(x, y) = (Phi(x), Phi(y)), which is the dot product in the Hilbert feature space F. A popular choice for the kernel function k that has proved useful, e.g., in support vector machines, is the Gaussian RBF (radial basis function), k(x, y) = exp(-||x - y||^2 / 2 sigma). It is known that, under some mild assumptions on S_B^Phi and S_W^Phi, any solution y in F maximizing (5) can be written as a linear span of all the mapped context samples [19]:

    y = sum_{i=1..N} alpha_i Phi(c_i)    (6)

As a result, the F-ratio J(y) can be reformulated as

    J(y) = ( y^T S_B^Phi y ) / ( y^T S_W^Phi y ) = ( alpha^T A alpha ) / ( alpha^T B alpha )    (7)

where A and B are N x N matrices:

    A = sum_{i=1..M} n_i (mu_i - mubar)(mu_i - mubar)^T,  B = K K^T - sum_{i=1..M} n_i mu_i mu_i^T

where K is the kernel matrix, K_{ij} = Phi(c_i).Phi(c_j), and mu_i = K 1_i / n_i, mubar = K 1 / N, where the 1_i in {0, 1}^N are the membership vectors corresponding to the class labels and 1 is the vector of all ones. The projection of a test context c onto the discriminant is given by the inner product

    (y, Phi(c)) = sum_{i=1..N} alpha_i k(c_i, c)

where k(x, y) = exp(-||x - y||^2 / 2 sigma) is the RBF kernel function.

The superior discriminating power of KFD over the linear Fisher discriminant (LFD) method of [7] for MCECQ design is illustrated in Figs. 2 and 3. The plots are for the context vectors in the modeling of the least significant bit of the test image cameraman. By comparing the histograms of the projected MCECQ cells A_1 and A_2 from the cameraman image for the case M = 2 for the two methods, one can easily see that KFD offers significantly better separation of A_1 and A_2 than LFD. Note that the projection of KFD is in general nonlinear, unlike the classic LFD. Computationally, the KFD problem is to find the leading eigenvector of B^{-1} A.
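The following is a dense O(N^3) Python sketch of this construction (our illustration, assuming numpy/scipy; the reduced-set approximation of Section IV avoids the cubic cost, and the beta regularization anticipates the paragraph below):

  import numpy as np
  from scipy.linalg import eigh

  def rbf_kernel(A, B, sigma):
      # Gaussian RBF k(x, y) = exp(-||x - y||^2 / (2*sigma)), as in the text
      sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
      return np.exp(-sq / (2.0 * sigma))

  def kfd_direction(C, labels, sigma, beta=1e-3):
      # C: (N, d) raw training contexts; labels: MCECQ cell of each context.
      # Returns alpha of the expansion (6), regularized as B_beta = B + beta*I.
      N = len(C)
      K = rbf_kernel(C, C, sigma)
      mu_bar = K.mean(axis=1)                   # K 1 / N
      A = np.zeros((N, N))
      B = K @ K.T
      for m in np.unique(labels):
          ind = (labels == m).astype(float)     # membership vector 1_m
          n_m = ind.sum()
          mu_m = K @ ind / n_m                  # class mean, kernel coords
          A += n_m * np.outer(mu_m - mu_bar, mu_m - mu_bar)   # between-class
          B -= n_m * np.outer(mu_m, mu_m)                     # within-class
      w, V = eigh(A, B + beta * np.eye(N))      # leading eigenvector of
      return V[:, -1]                           # (B + beta*I)^{-1} A

  def kfd_project(alpha, C, c, sigma):
      # projection y(c) = sum_i alpha_i k(c_i, c) of a raw context c
      return rbf_kernel(np.atleast_2d(c), C, sigma)[0] @ alpha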
As the dimension of F is higher than the number of source samples N, and B is a highly singular N x N matrix obtained from only N source samples, some form of regularization is necessary. The simplest solution is to add either the identity matrix or the kernel matrix K to the matrix B, i.e., the matrix B is replaced by B_beta = B + beta I. This makes the problem numerically more stable, because the within-class matrix B becomes more positive definite for large beta. It is also roughly equivalent to adding independent noise to each of the kernel bases.

IV. IMPLEMENTATION OF KFD FOR CONTEXT QUANTIZATION

In the above formulation, the matrices B and A are too large in practice. Maximizing (7) takes O(N^3) time since it requires solving the N x N matrix eigenvalue problem. This complexity is too high for large N. More importantly, in the context quantization application we are not able to use all the basis functions corresponding to all raw training contexts. Solving the kernel Fisher discriminant for two classes can be cast as a quadratic optimization problem [18], [19]. However, this scheme cannot be directly applied to estimating the multi-class kernel Fisher discriminant.

A possible solution applicable to any choice of A and B is to restrict the discriminant y to a subspace of F, as proposed in [19], [20]. Instead of using (6), we express y in the subspace:

    y = sum_{i=1..l} alpha_i Phi(c_i)    (8)

where l << N, and the samples c_i can either be selected from all the raw training context samples or be estimated by some clustering algorithm. Without loss of generality, if we choose each c_i in (8) from the training set, 1 <= i <= l, then:

    J(y) = ( alpha_l^T A_l alpha_l ) / ( alpha_l^T B_l alpha_l )    (9)

where alpha_l is an l-dimensional vector, and A_l and B_l,

    A_l = sum_{i=1..M} n_i (mu_i^l - mubar^l)(mu_i^l - mubar^l)^T,  B_l = K_l K_l^T - sum_{i=1..M} n_i mu_i^l (mu_i^l)^T    (10)

are two l x l covariance matrices with K_l being an l x N submatrix of K, where mu_i^l = K_l 1_i / n_i and mubar^l = K_l 1 / N.

Given the dimension l of the subspace of F, the partial expansion (8) presents a greedy approximation of the optimal KFD solution, which was described in [19], [20] and studied theoretically as the reduced set method for support vector machines in [23]. This approximation can be improved incrementally by adding a raw context sample (a new context base) one at a time to the existing expansion, i.e., incrementing the dimensionality l by one at a time. Such incremental expansion can be done in a greedy fashion as follows. In each iteration we first randomly select a subset U of the remaining training set, and then we conduct an exhaustive search in U, instead of in the whole remaining training set, for the training context c that maximizes (9) after c is added to (8). A proper size of U was shown to be 59 in order to obtain nearly as good a performance as if the search were conducted through the entire remaining training set [24]. Since l << N, incrementing the kernel expansion (8) by one base context takes only O(Nl) time. Consequently, the approximation of the kernel discriminant in an l-dimensional subspace of F has O(N l^2) time complexity, which is drastically lower than O(N^3). The pseudo code of this practical approximation algorithm of the kernel Fisher discriminant for context quantization is presented in Fig. 4.

We build the context quantizer in three steps. In the first step, we apply the dynamic programming algorithm to design the MCECQ in the probability simplex space. This produces the MCECQ cells B_m that constitute the input classes of KFD.

Fig. 4. Pseudocode of context quantization by kernel Fisher discriminant.

In the second step, we map the B_m back to the A_m in the context space, and use the kernel Fisher discriminant to find a projection direction in F (corresponding to a curve in the context space) in which the MCECQ cells A_m have maximum separation. In the final step, we compute all the projection values of the training contexts and put them into a sorted list. Since each class in the projection direction is in general not convex, in order to make the underlying classification problem tractable and, more importantly, to make the quantizer mapping function simple, dynamic programming is used again to construct a convex partition of the projection subspace that minimizes the conditional entropy H(X | y(c) in (q_{m-1}, q_m]), where the kernel projection y(c) is given by

    y(c) = sum_{i=1..l} alpha_i k(c_i, c)

Once the KFD context quantizer is designed, the decoder can map a raw context c to a coding state m in entropy decoding using the following context quantizer mapping function: Q(c) = m if y(c) in (q_{m-1}, q_m].
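The resulting mapping function is cheap to evaluate; a minimal Python sketch (ours, reusing the hypothetical kfd_project helper from the earlier sketch) is:

  import bisect

  def quantize_context(c, alpha, bases, thresholds, sigma):
      # map a raw context c to a coding state m: project it onto the
      # kernel discriminant and locate the interval (q_{m-1}, q_m]
      # among the M-1 sorted thresholds by binary search
      y = kfd_project(alpha, bases, c, sigma)    # reduced-set projection
      return bisect.bisect_left(thresholds, y)   # state index in 0..M-1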

Fig. 5. Average bit rates achieved by the four context quantizers on coding the sign of the DPCM error pixel, in bits/sample.

Fig. 6. Average bit rates achieved by the four context quantizers on coding the least significant bit of the DPCM error pixel, in bits/sample.

V. EXPERIMENTAL RESULTS

We implemented the proposed context quantizers and evaluated them in DPCM predictive lossless coding of gray-scale images. The prediction residuals are coded by binary arithmetic coding that uses the context states optimized by the proposed algorithms. The binary random variables to be coded are the binary decisions in resolving the value of the prediction residual. In particular we are interested in two binary sources: the signs of the DPCM prediction errors of gray-scale images, and the least significant bits of the DPCM prediction errors. These binary sources are among the most difficult to compress, with their self-entropy being close to the maximum of 1 bit per sample, and thus present great challenges to context-based entropy coding. Consequently, they serve as good, demanding test cases for the performance of different context quantizers.

The causal context in which the current pixel I(i, j) is coded consists of three gradients in a local window, c = (c_1, c_2, c_3):

    c_1 = I(i, j-1) - I(i, j-2)    (12)
    c_2 = I(i-1, j) - I(i-2, j)    (13)
    c_3 = I(i-1, j-1) - I(i-1, j-2)

The reason for choosing (c_1, c_2, c_3) as the feature vector in context modeling is that these gradients capture the variance and signal the presence of edge structure in the image signal, while keeping the dimensionality of the feature space low. We did not use a higher-order context model, to avoid overfitting in the coding phase. Even this three-dimensional feature space generates a very large number of raw contexts. A scalar prequantization scheme,

    Q_k(c) =  i,  if c in [2^{i-1}, 2^i - 1], 0 < i < k
             -i,  if c in [-2^i + 1, -2^{i-1}], 0 < i < k
              k,  if c >= 2^{k-1};  -k, if c <= -2^{k-1};  0, if c = 0

is used to reduce the number of raw contexts to a manageable level of (2k+1)^3 (k was chosen to be 6 in our experiments; a sketch of this binning follows at the end of this subsection). Since a gradient is the difference of adjacent samples, it obeys a geometric distribution for natural images. The above scalar prequantization therefore merges the raw contexts into roughly equally probable regions.

The training set of raw contexts was generated from 23 images sampled from two archives of benchmark gray-scale images on the Internet [25], [26]. The test set, consisting of the images airplane, barb, boat, cameraman, couple, crowd, girl, lena, peppers and tiffany, is disjoint from the training set. The model parameters (beta, sigma) used to construct the kernel discriminants for the two training sets were chosen by cross-validation [27], [28], with sigma = 4.6 and sigma = 5.33 respectively, by minimizing the misclassification rate or, desirably, the conditional entropy. Either the encoding or the decoding of each binary symbol by a KFD context quantizer requires projecting a context onto the discriminant direction, which takes O(l) time according to (8). Thus the encoding or decoding complexity of a KFD context quantizer is O(Nl), where N is the length of the input sequence.

We compare three context quantizers of the Fisher discriminant type reviewed and developed in this paper, namely LFD-I: the two-class linear Fisher discriminant scheme of [7]; LFD-II: the multi-class linear Fisher discriminant scheme discussed in Subsection A of III; and KFD: the MCECQ design algorithm based on the kernel Fisher discriminant developed in Subsection B of III and Section IV.
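A minimal Python sketch of the gradient prequantizer (ours, assuming the logarithmic binning reconstructed in the equation above; the function name is illustrative):

  def prequantize(g, k=6):
      # scalar prequantization of one gradient into 2k+1 levels: the bin
      # index i satisfies 2^{i-1} <= |g| < 2^i, saturating at k, so the
      # roughly geometric gradient distribution gives near-equal bin masses
      if g == 0:
          return 0
      i = min(abs(g).bit_length(), k)
      return i if g > 0 else -i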

TABLE I. BIT RATES OF SIGNS OF DPCM ERRORS FOR DIFFERENT METHODS
TABLE II. BIT RATES OF LEAST SIGNIFICANT BITS OF DPCM ERRORS FOR DIFFERENT METHODS
TABLE III. BIT RATES OF LOSSLESS IMAGE COMPRESSION BY DIFFERENT METHODS

All three context quantizer design algorithms output convex quantizer cells in the context space with simple quantizer mapping functions. As a performance benchmark we also include the ideal results, i.e., the conditional entropy rates of the MCECQ quantizer in the probability simplex space, against which the test results of the three practical methods are measured. These rates were obtained by MCECQ designed for the sample statistics of each individual test image. Clearly, these rates serve as a theoretical lower bound with respect to the context model in question, since they are the best achievable in the ideal situation where the training data and the input image have identical statistics and the quantizer mapping function, regardless of its complexity, could be implemented precisely in practice.

Figs. 5 and 6 plot the average bit rates achieved by the three MCECQ design methods in the context space, LFD-I, LFD-II and KFD, on coding the sign and the least significant bit of the DPCM errors of the ten test images. The DPCM errors are generated by the median predictor used by JPEG-LS. The bit rates are presented as functions of the number of context quantizer cells. As lower bounds for the bit rates achievable by any convex partition of the context space, we also include in the figures the corresponding average conditional entropy rates of the optimal MCECQs designed in the probability simplex space, as explained above. It can be observed from our experimental results, as expected, that LFD-II outperforms LFD-I, and KFD outperforms both variants of the linear discriminant type, because KFD has higher discriminating power with its capability of forming more complex quantizer cells. In fact, the KFD method achieves bit rates that are less than 0.5% away from the lower bound.

We applied the three context quantizers designed from the training set to encode the signs and the least significant bits of the DPCM errors of the 10 test images outside of the training set. All three context quantizers have 12 cells; in other words, the conditional entropy coding is carried out in 12 coding states. Tables I and II show the actual code lengths obtained by the three context quantizers. Not surprisingly, the kernel Fisher discriminant in general outperforms the two linear ones.

Table III presents the lossless bit rates of the ten gray-level test images achieved by adaptive binary arithmetic coding that uses the modeling contexts designed by the proposed MCECQ methods for each binary decision. As references, the bit rates of the JPEG-LS lossless image coding standard are also listed in the table. The comparison is fair and meaningful because JPEG-LS uses the same context template as in our experiments but employs a heuristic context quantization scheme [29]. Since an alternative method for lossless coding of gray-scale images is to code each bitplane using a high-order binary context as in JBIG, we also include in Table III the lossless bit rates obtained by the JBIG standard. The proposed KFD-based context quantizer outperforms all the other methods consistently on each test image, albeit its improvement over JPEG-LS is quite small. The small margin between the two methods indicates that the heuristic context quantizer of JPEG-LS is already very good compared with a heavily optimized one. We envision this work as a useful algorithmic tool for evaluating the quality of more practical context quantizers.
VI. CONCLUSIONS

We proposed new algorithms for designing context quantizers toward minimum conditional entropy, based on the multi-class Fisher discriminant and the kernel Fisher discriminant. We succeeded in approaching the lower bound of the achievable bit rates with a practical implementation that employs a simple scalar quantizer mapping function rather than a large look-up table.


A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Detection of an Object by using Principal Component Analysis

Detection of an Object by using Principal Component Analysis Detecton of an Object by usng Prncpal Component Analyss 1. G. Nagaven, 2. Dr. T. Sreenvasulu Reddy 1. M.Tech, Department of EEE, SVUCE, Trupath, Inda. 2. Assoc. Professor, Department of ECE, SVUCE, Trupath,

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article Avalable onlne www.jocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2512-2520 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 Communty detecton model based on ncremental EM clusterng

More information

On Supporting Identification in a Hand-Based Biometric Framework

On Supporting Identification in a Hand-Based Biometric Framework On Supportng Identfcaton n a Hand-Based Bometrc Framework Pe-Fang Guo 1, Prabr Bhattacharya 2, and Nawwaf Kharma 1 1 Electrcal & Computer Engneerng, Concorda Unversty, 1455 de Masonneuve Blvd., Montreal,

More information

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION 1 THE PUBLISHING HOUSE PROCEEDINGS OF THE ROMANIAN ACADEMY, Seres A, OF THE ROMANIAN ACADEMY Volume 4, Number 2/2003, pp.000-000 A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION Tudor BARBU Insttute

More information

Image Alignment CSC 767

Image Alignment CSC 767 Image Algnment CSC 767 Image algnment Image from http://graphcs.cs.cmu.edu/courses/15-463/2010_fall/ Image algnment: Applcatons Panorama sttchng Image algnment: Applcatons Recognton of object nstances

More information

Discriminative Dictionary Learning with Pairwise Constraints

Discriminative Dictionary Learning with Pairwise Constraints Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Human Face Recognition Using Generalized. Kernel Fisher Discriminant

Human Face Recognition Using Generalized. Kernel Fisher Discriminant Human Face Recognton Usng Generalzed Kernel Fsher Dscrmnant ng-yu Sun,2 De-Shuang Huang Ln Guo. Insttute of Intellgent Machnes, Chnese Academy of Scences, P.O.ox 30, Hefe, Anhu, Chna. 2. Department of

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Maximum Variance Combined with Adaptive Genetic Algorithm for Infrared Image Segmentation

Maximum Variance Combined with Adaptive Genetic Algorithm for Infrared Image Segmentation Internatonal Conference on Logstcs Engneerng, Management and Computer Scence (LEMCS 5) Maxmum Varance Combned wth Adaptve Genetc Algorthm for Infrared Image Segmentaton Huxuan Fu College of Automaton Harbn

More information

A Deflected Grid-based Algorithm for Clustering Analysis

A Deflected Grid-based Algorithm for Clustering Analysis A Deflected Grd-based Algorthm for Clusterng Analyss NANCY P. LIN, CHUNG-I CHANG, HAO-EN CHUEH, HUNG-JEN CHEN, WEI-HUA HAO Department of Computer Scence and Informaton Engneerng Tamkang Unversty 5 Yng-chuan

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

A Post Randomization Framework for Privacy-Preserving Bayesian. Network Parameter Learning

A Post Randomization Framework for Privacy-Preserving Bayesian. Network Parameter Learning A Post Randomzaton Framework for Prvacy-Preservng Bayesan Network Parameter Learnng JIANJIE MA K.SIVAKUMAR School Electrcal Engneerng and Computer Scence, Washngton State Unversty Pullman, WA. 9964-75

More information