Intelligent Information Acquisition for Improved Clustering

Duy Vu, University of Texas at Austin, duyvu@cs.utexas.edu
Mikhail Bilenko, Microsoft Research, mbilenko@microsoft.com
Prem Melville, IBM T.J. Watson Research Center, pmelvil@us.ibm.com
Maytal Saar-Tsechansky, University of Texas at Austin, maytal@mail.utexas.edu

1. Introduction and motivation

In many data mining and machine learning tasks, datasets include instances with missing feature values that can be acquired at a cost. However, both the acquisition cost and the usefulness with respect to the learning task may vary dramatically across feature values. While this observation has inspired a number of approaches for active and cost-sensitive learning, most work in these areas has focused on classification settings. Yet the problem of obtaining the most useful missing data cost-effectively is equally important in unsupervised settings, such as clustering, since the amount by which acquired information may improve performance varies significantly across instances and features. For example, clustering algorithms are commonly used to identify users with similar preferences so as to produce personalized product recommendations. With instances corresponding to individual consumers and features describing consumers' ratings of a given product/service, individual features of particular instances may be missing because customers may not have provided feedback on all the items they purchased. Furthermore, because consumers are often reluctant to provide feedback, acquiring feedback on unrated items may entail costly incentives, such as free or discounted products or services. However, obtaining different feature values may have a varying effect on the accuracy of the subsequently obtained clustering of consumers. Thus, choosing which ratings to acquire via incentives so as to benefit the clustering task most cost-effectively is an important decision, as acquiring feedback for all missing ratings is prohibitively expensive.
In this paper, we address the problem of active feature-value acquisition (AFA) for clustering: given a clustering of incomplete data, the task is to select feature values which, when acquired, are likely to provide the highest improvement in clustering quality with respect to acquisition cost. To the best of our knowledge, this general problem has not been considered previously, as prior research focused either on acquiring pairwise distances ([3], [4]) or cluster labels for complete instances [1]. Prior work addressed the AFA task for supervised learning, where missing feature values are acquired in a cost-effective manner for training classification models [6]. However, this approach exploits supervised information to estimate the expected improvement in model accuracy for prospective acquisitions. The primary challenge addressed in this paper lies in a priori estimation of the value of a potential acquisition in the absence of any supervision (i.e., it is not known to which cluster each instance actually belongs). We employ an expected utility acquisition framework and present an instantiation of our overall framework for K-means, where the value of prospective acquisitions is derived from their expected impact on the clustering configuration (see [8] for an instantiation of our framework for hierarchical agglomerative clustering algorithms). Empirical results demonstrate that the proposed utility function effectively identifies acquisitions that improve clustering quality per unit cost significantly better than acquisitions selected uniformly at random. In addition, we show that our policy performs well for different feature cost structures.

2. Task definition and algorithm

The clustering task is traditionally defined as the problem of partitioning a set of instances into disjoint subsets, or clusters, where each cluster contains similar instances. We focus our attention on clustering in domains where instances include missing feature values that can be acquired at a cost. A
dataset consisting of m n-dimensional instances is represented by an m-by-n data matrix X, where x_ij corresponds to the value of the j-th feature of the i-th instance. Initially, the data matrix X is incomplete, i.e., its elements corresponding to missing values are undefined. For each missing feature value x_ij, there is a corresponding cost C_ij at which it can be acquired. Let q_ij refer to the query for the value of x_ij. Then, the general task of active feature-value acquisition is the problem of selecting the instance-feature query that will result in the highest increase in clustering quality per unit cost.

The overall framework for the generalized AFA problem is presented in Algorithm 1. Information is acquired iteratively, where at each step all possible queries are ranked based on their expected contribution to clustering quality normalized by cost. The highest-ranking query is then selected, and the feature value corresponding to this query is acquired. The dataset is appropriately updated, and this process is repeated until some stopping criterion is met, e.g., a desirable clustering quality has been achieved. To reduce computational costs, multiple queries can be selected at each iteration. While this framework is intuitive, the crux of the problem lies in devising effective measures for the utility of acquisitions. In subsequent sections, we address challenges related to performing this task accurately and efficiently.

Algorithm 1: Active Feature-value Acquisition for Clustering
Given: X - initial (incomplete) instance-feature matrix; L - clustering algorithm; b - size of query batch; C - cost matrix for all instance-feature pairs.
Output: M = L(X) - final clustering of the dataset incorporating acquired values
1. Initialize TotalCost to the initial cost of X
2. Initialize the set of possible queries Q = {q_ij : x_ij is missing}
3. Repeat until stopping criterion is met:
4.    Generate a clustering M = L(X)
5.    For each q_ij in Q, compute its utility score
6.    Select a subset S of b queries with the highest scores
7.    For each q_ij in S: acquire the value for x_ij; X = X <- x_ij; TotalCost = TotalCost + C_ij
8.    Remove S from Q
9. Return M = L(X)

At every step of the AFA algorithm, the feature value which in expectation will result in the highest clustering improvement per unit cost is acquired. Fundamental to our approach is a utility function U(x_ij = x, C_ij), which quantifies the benefit of a specific value x for feature value x_ij, acquired via the corresponding query q_ij at cost C_ij. The expected utility of query q_ij, EU(q_ij), is then defined as the expectation of the utility over the marginal distribution of the feature value x_ij:

EU(q_ij) = Σ_x U(x_ij = x, C_ij) P(x_ij = x).

Since the true marginal distribution of each missing feature value x_ij is unknown, an empirical estimate of P(x_ij = x) can be obtained using probabilistic classifiers. For example, in the case of discrete (categorical) data, for each feature j a naïve Bayes classifier M_j can be trained to estimate the feature's probability distribution based on the values of the other features of a given instance. The expectation can then be easily computed by piecewise summation over the possible values. For continuous attributes, computation of the expected utility can be performed either using computational methods such as Monte Carlo estimation, or by discretizing them and using probabilistic classifiers as described above.
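The piecewise summation above, together with the cost-normalized ranking of queries from Algorithm 1, can be sketched in Python. This is a minimal illustrative sketch, not the paper's implementation: `values` and `probs` are assumed to be supplied by a probabilistic model such as the naïve Bayes classifier M_j, and `utility_fn` is a hypothetical stand-in for U.

```python
def expected_utility(values, probs, utility_fn, cost):
    """EU(q_ij) = sum over x of U(x_ij = x, C_ij) * P(x_ij = x).

    `values`/`probs` enumerate the discrete values x_ij can take and their
    estimated probabilities; `utility_fn(v, cost)` stands in for U.
    """
    return sum(utility_fn(v, cost) * p for v, p in zip(values, probs))

def rank_queries(candidates, costs):
    """Score each candidate query by expected utility and return query ids
    sorted best-first; `candidates` maps query id -> (values, probs, U)."""
    scores = {q: expected_utility(v, p, u, costs[q])
              for q, (v, p, u) in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With the sub-sampling of Section 2.2, `candidates` would contain only the sampled queries rather than all of Q.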
2.1 Capturing the utility of a prospective acquisition

Devising a utility function U that captures the benefits of possible acquisition outcomes is the critical component of the AFA framework. Acquisitions aim to improve clustering quality. Clustering quality measures proposed in prior work can be loosely divided into external measures, such as pairwise F-measure [7], which are derived from a category distribution unseen at clustering time, and internal measures, e.g., the ratio between average inter-cluster and intra-cluster distances, which use only data that is available to the clustering algorithm. Since external measures cannot be assessed at the time of clustering, an acquisition policy must capture the value of acquisitions using merely the dataset at hand. Most clustering algorithms optimize a specific objective function, which allows defining utility as the improvement in this objective per unit cost. For example, the objective of the popular K-Means algorithm [5] is to minimize the sum of squared distances between every instance x_i and the centroid of the instance's cluster, μ_{y_i}:

J(X) = Σ_i ||x_i − μ_{y_i}||²,

where y_i ∈ {1, ..., k} is the index of the cluster to which instance x_i is assigned, and missing feature values are omitted from the squared-distance computation. Thus, the objective-based utility of acquisition outcome x_ij = x can be defined as the cost-normalized reduction in the value of the objective function:

U_Obj(x_ij = x, C_ij) = (J(X) − J(X ← x_ij)) / C_ij,

where the objective-function value after the acquisition, J(X ← x_ij), is estimated following the relocation of cluster centroids caused by the acquisition.

While an objective-based utility function provides a well-motivated acquisition strategy, it may select feature values that improve cluster centroid locations without significantly changing cluster assignments, which often underlie external measures of the clustering outcome. The effect of such wasteful acquisitions can be significant, rendering an objective-based utility a suboptimal strategy for improving external evaluation measures.
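A sketch of the objective-based utility under these definitions (function names and the NaN convention for missing values are our own assumptions, not the paper's code): missing coordinates are skipped in the squared distances, and centroid relocation is approximated by re-estimating only the centroid of the acquired instance's cluster.

```python
import numpy as np

def kmeans_objective(X, centroids, assign):
    # J(X) = sum_i ||x_i - mu_{y_i}||^2, omitting missing (NaN) coordinates
    return sum(np.nansum((X[i] - centroids[y]) ** 2)
               for i, y in enumerate(assign))

def objective_utility(X, centroids, assign, i, j, v, cost_ij):
    """U_Obj(x_ij = v, C_ij) = (J(X) - J(X <- x_ij)) / C_ij."""
    X_new = X.copy()
    X_new[i, j] = v                              # hypothetical acquisition
    k = assign[i]
    members = [p for p, y in enumerate(assign) if y == k]
    new_centroids = centroids.copy()
    new_centroids[k] = np.nanmean(X_new[members], axis=0)  # relocate centroid
    return (kmeans_objective(X, centroids, assign)
            - kmeans_objective(X_new, new_centroids, assign)) / cost_ij
```

Note that acquiring a value can also increase J, since a previously omitted coordinate now contributes to the squared distances; the utility is then negative.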
Because internal objective functions may not relate well to external measures, we propose an alternative utility measure which approximates the qualitative impact on the clustering configuration caused by the acquisition. We define this utility as the number of instances for which cluster membership changes as the result of an acquisition, given a certain value of the acquired feature. Formally, given the current data matrix X, let y_p^(X) be the cluster assignment of point x_p before the acquisition, and y_p^(X←x_ij) be the cluster assignment of x_p after the acquisition. Then, the perturbation-based utility of acquiring value x for feature value x_ij is defined as follows:

U_Pert(x_ij = x, C_ij) = ( Σ_{p=1}^{m} 1[ y_p^(X←x_ij) ≠ y_p^(X) ] ) / C_ij.

For K-Means, the cluster assignments after the acquisition, Y^(X←x_ij) = { y_p^(X←x_ij) }_{p=1}^{m}, can be obtained by re-estimating the cluster centroid to which instance x_i is currently assigned, assuming the value x for feature x_ij. Performing a single assignment step for all points then provides the new set of cluster assignments Y^(X←x_ij). As we show below, this utility measure identifies highly informative acquisitions. Henceforth, we refer to this perturbation-based utility as Expected Utility (EU); we refer to the use of the objective-based utility as Expected-Utility-Objective (EU-Objective).

2.2 Efficiency considerations: Instance-based sampling

A significant challenge lies in the fact that exhaustively evaluating all potential acquisitions is computationally infeasible for datasets of even moderate size. We propose to make this selection tractable by evaluating only a sub-sample of the available queries. We specify an exploration parameter α which controls the complexity of the search. To select a batch of b queries, first a sub-sample of αb queries is
selected from the available pool, and then the expected utility of each query in this sub-sample is evaluated. The value of α can be set depending on the amount of time the user is willing to spend on this process. One approach is to draw this sample uniformly at random to make the computation feasible. However, it may be possible to improve performance by applying Expected Utility estimation to a particularly informative sample of queries. In particular, because the goal of clustering is to define boundaries between potential classes, instances near these boundaries have the most impact on cluster formation. Consequently, the missing features of these instances give us the most decisive information for adjusting the clustering boundaries. Formally, if μ_y and μ_y' are respectively the closest and second-closest centroids for instance x_i in the current clustering, we define the margin δ(x_i) of instance x_i as the difference between their distances from x_i, according to the distance metric D being used for clustering: δ(x_i) = D(x_i, μ_y') − D(x_i, μ_y). Given incomplete information about the position of instances in the feature space, smaller margins correspond to lower confidence in an instance's current cluster assignment. For these instances, obtaining a better estimate of their position in the feature space is more likely to improve our ability to assign them to the correct cluster than for instances with large margins. Following this rationale, we rank all instances in ascending order of their margins based on the current cluster assignments. Then, a set of αb queries from the top-ranked instances is selected for evaluation, where b is the desired batch size and α is the exploration parameter. This candidate set of queries is then subjected to the same expected utility evaluation described in the previous section. We refer to this approach as Instance-Based Sampling Expected Utility (IBS-EU).

3. Experimental evaluation

We evaluated our proposed approach on four datasets from the UCI repository [2]: iris, wine, letter-l, and protein, which have been previously used in a number of clustering studies.
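The margin-based instance ranking of Section 2.2 can be sketched as follows. This is an illustrative sketch with our own naming, assuming Euclidean distance as D and missing values stored as NaN and ignored in distances.

```python
import numpy as np

def margins(X, centroids):
    """delta(x_i) = D(x_i, mu_second) - D(x_i, mu_closest); smaller margins
    indicate lower confidence in the current cluster assignment."""
    out = []
    for x in X:
        d = np.sort([np.sqrt(np.nansum((x - c) ** 2)) for c in centroids])
        out.append(d[1] - d[0])
    return np.array(out)

def rank_instances(X, centroids):
    # ascending margin order: the most boundary-like instances come first,
    # and their missing features form the candidate query pool
    return np.argsort(margins(X, centroids))
```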
Features with continuous values in these datasets were discretized into 10 bins of equal width. Since feature acquisition costs are not available for these datasets, in our first set of experiments we assume that acquisition costs are uniform for all feature values; experiments for other cost distributions follow. Discrete feature values enable the use of piecewise summation for the expectation calculation, which is computationally preferable; in principle, however, continuous values can also be used. We compare the proposed acquisition policies with a strategy that selects queries uniformly at random, using the K-means clustering algorithm. The sampling parameter α of our methods is set to 10. We report results obtained from 100 runs for each active acquisition policy. In each run, a small fraction of features is randomly selected for initialization for each instance in the dataset¹, and we evaluate clustering performance after each acquisition step. Lastly, because the datasets we consider have underlying class labels, we employ an external metric, pairwise F-measure, to evaluate clustering quality. We have found empirically that there are no qualitative differences in our results for different external measures. Given a clustering and underlying class labels, pairwise precision and recall are defined as the proportion of same-cluster instance pairs that have the same class, and the proportion of same-class instance pairs that have been placed in the same cluster, respectively. F-measure is then the harmonic mean of precision and recall: F1 = (2 × Precision × Recall)/(Precision + Recall). The performance comparison of any two acquisition schemes A and B can be summarized by the average percentage increase in pairwise F-measure of A over B over all acquisition phases. We refer to this metric as the average % F-measure increase.

4. Results

Table 1 presents summary results for EU, EU-Objective, IBS-EU, and IBS-Random, which acquires feature values drawn uniformly at random from the informative instances selected by IBS.
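For reference, the pairwise F-measure used throughout this evaluation can be computed directly over all instance pairs, as in this short sketch (our naming):

```python
from itertools import combinations

def pairwise_f1(clusters, labels):
    """Pairwise F-measure: harmonic mean of pairwise precision and recall."""
    same_cluster = same_class = both = 0
    for i, j in combinations(range(len(labels)), 2):
        sc = clusters[i] == clusters[j]
        sl = labels[i] == labels[j]
        same_cluster += sc          # pairs placed in the same cluster
        same_class += sl            # pairs sharing the true class
        both += sc and sl
    precision = both / same_cluster if same_cluster else 0.0
    recall = both / same_class if same_class else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```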
¹ We randomly selected 1 out of 4 features for each instance in the iris dataset, 2 out of 4 features for wine, and 3 out of 16 and 20 features for the letter-l and protein datasets, respectively.

Let us
first examine the relative performance of the EU policy, which identifies acquisitions that are likely to impact the cluster assignments, and EU-Objective, which targets acquisitions that are expected to improve the clustering algorithm's internal objective function. Figure 1(a) presents clustering performance as a function of acquisition costs for the protein dataset, obtained with EU, EU-Objective, and random sampling. For all datasets, EU leads to better clustering than random query sampling. The improvements in performance range from a 10% to a 32% increase in F-measure over the top 20% of acquisition phases. One can also observe the cost benefits of using EU to obtain a desired level of performance. For example, on the iris dataset, Expected Utility achieves a pairwise F-measure of 0.8 with fewer than 300 feature values, while random sampling requires twice as many acquisitions to achieve the same result. In contrast to Expected Utility, using the objective-based utility function in EU-Objective is rather ineffective at improving pairwise F-measure. This is because the K-means objective is focused on producing tighter clusters, and an acquisition strategy based on it may select feature values that reduce this objective without changing any cluster assignments, resulting in no improvement with respect to external evaluation measures.

Table 1: Performance of different acquisition policies for clustering (average % F-measure increase over random sampling)

Data set   EU-Objective     EU    IBS-EU   IBS-Random
iris          -0.81        6.19    7.96       3.14
wine          -8.42       10.92   11.41       4.58
letter-l       7           6.22    5.55       0.16
protein        4.78       14.19   14.93       2.90

Figure 1: Learning curves for alternative acquisition policies. (a) Pairwise F-measure vs. number of feature values acquired for EU, EU-Objective, and random sampling; (b) pairwise F-measure vs. number of feature values acquired for EU, IBS-EU, and IBS-Random.

Now, let us examine the benefit to EU of evaluating a subset of acquisitions from particularly informative instances, as captured by our instance-based sampling approach.
Table 1 presents summary performance for IBS-EU and IBS-Random, and, for the iris dataset, Figure 1(b) shows clustering quality after each acquisition phase obtained by EU, IBS-EU, and IBS-Random. On 3 of the 4 datasets, IBS-EU produces the highest average increase in pairwise F-measure compared to random sampling. On these datasets, IBS-Random also performs substantially better than random. These results demonstrate that our margin measure effectively identifies particularly informative instances for acquisition. Consequently, IBS-EU focuses the evaluation of Expected Utility on a more promising set of queries, leading to better models on average. However, the improvements of IBS-EU over EU are not very large.

Lastly, we evaluated the policies when applied to the iris dataset under different cost distributions. We assigned each feature a cost drawn uniformly at random from the range between 1 and 100. For this evaluation we include a cost-sensitive benchmark policy, Cheapest-first, which selects acquisitions in order of increasing cost. The results for all randomly assigned cost distributions show that
IBS-EU and Expected Utility consistently result in better clustering than random acquisition for a given cost. Figure 3 presents F-measure versus acquisition costs for two representative cost distributions. As shown, in settings where features have varying information value and non-negligible costs, EU's ability to capture the value of different feature values per unit cost becomes more critical. In such cases, acquiring an uninformative feature value at a substantial cost results in a significant loss, and, as shown, EU and IBS-EU are more likely to avoid such losses. In contrast, the performance of Cheapest-first is inconsistent. It performs well when its underlying assumption holds and the cheapest features are also informative. In such cases, EU does not perform as well, since it imperfectly estimates the expected improvement from each acquisition. When many inexpensive features are also uninformative, Cheapest-first can perform poorly, as shown by the early acquisition stages of Figure 3. EU, however, estimates the trade-off between cost and expected improvement in clustering quality, and although the estimation is imperfect, it consistently selects better queries than random acquisitions for all cost structures.

Figure 3: Performance under different feature-value cost structures. (a) Inexpensive features are also informative; (b) some expensive features are informative. (Pairwise F-measure vs. acquisition cost.)

5. Conclusions

In this paper, we proposed an expected utility approach to active feature-value acquisition for clustering, where informative feature values are obtained based on the estimated expected improvement in clustering quality per unit cost. Experiments show that the Expected Utility approach consistently leads to better clustering than random sampling for a given acquisition cost.

6. References

[1] S. Basu, A. Banerjee, and R. J. Mooney. Active semi-supervision for pairwise constrained clustering.
In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM-04), April 2004.
[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/mlrepository.html, 1998.
[3] J. M. Buhmann and T. Zöller. Active learning for hierarchical pairwise data clustering. In ICPR, pages 2186–2189, 2000.
[4] T. Hofmann and J. M. Buhmann. Active data clustering. In Advances in Neural Information Processing Systems 10, 1998.
[5] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
[6] P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. An expected utility approach to active feature-value acquisition. In Proceedings of the International Conference on Data Mining, pages 745–748, Houston, TX, November 2005.
[7] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In Proceedings of the KDD-2000 Workshop on Text Mining, 2000.
[8] D. Vu, M. Bilenko, P. Melville, and M. Saar-Tsechansky. Active information acquisition for improved clustering. Working Paper, McCombs School of Business, May 2007.