Intelligent Information Acquisition for Improved Clustering

Duy Vu, University of Texas at Austin, duyvu@cs.utexas.edu
Mikhail Bilenko, Microsoft Research, mbilenko@microsoft.com
Prem Melville, IBM T.J. Watson Research Center, pmelvil@us.ibm.com
Maytal Saar-Tsechansky, University of Texas at Austin, maytal@mail.utexas.edu

1. Introduction and motivation

In many data mining and machine learning tasks, datasets include instances that have missing feature values that can be acquired at a cost. However, both the acquisition cost and the usefulness with respect to the learning task may vary dramatically for different feature values. While this observation has inspired a number of approaches for active and cost-sensitive learning, most work in these areas has focused on classification settings. Yet, the problem of obtaining the most useful missing data cost-effectively is equally important in unsupervised settings, such as clustering, since the amount by which acquired information may improve performance varies significantly across instances and features. For example, clustering algorithms are commonly used to identify users with similar preferences, so as to produce personalized product recommendations. With instances corresponding to individual consumers and features describing consumers' ratings of a given product or service, individual features of particular instances may be missing because customers may not have provided feedback on all the items they purchased. Furthermore, because consumers are often reluctant to provide feedback, acquiring feedback on unrated items may entail costly incentives, such as free or discounted products or services. However, obtaining different feature values may have a varying effect on the accuracy of the subsequently obtained clustering of consumers. Thus, choosing which ratings to acquire via incentives that will benefit the clustering task most cost-effectively is an important decision, as acquiring feedback for all missing ratings is prohibitively expensive.

In this paper, we address the problem of active feature-value acquisition (AFA) for clustering: given a clustering of incomplete data, the task is to select feature values which, when acquired, are likely to provide the highest improvement in clustering quality with respect to acquisition cost. To the best of our knowledge, this general problem has not been considered previously, as prior research focused either on acquiring pairwise distances ([3],[4]) or cluster labels for complete instances [1]. Prior work addressed the AFA task for supervised learning, where missing feature values are acquired in a cost-effective manner for training classification models [6]. However, that approach exploits supervised information to estimate the expected improvement in model accuracy for prospective acquisitions. The primary challenge addressed in this paper lies in a priori estimation of the value of a potential acquisition in the absence of any supervision (i.e., it is not known to which cluster each instance actually belongs). We employ an expected utility acquisition framework and present an instantiation of our overall framework for K-means, where the value of prospective acquisitions is derived from their expected impact on the clustering configuration (see [8] for an instantiation of our framework for a hierarchical agglomerative clustering algorithm). Empirical results demonstrate that the proposed utility function effectively identifies acquisitions that improve clustering quality per unit cost significantly better than acquisitions selected uniformly at random. In addition, we show that our policy performs well for different feature cost structures.
2. Task definition and algorithm

The clustering task is traditionally defined as the problem of partitioning a set of instances into disjoint subsets, or clusters, where each cluster contains similar instances. We focus our attention on clustering in domains where instances include missing feature values that can be acquired at a cost. A dataset consisting of m n-dimensional instances is represented by an m-by-n data matrix X, where x_{ij} corresponds to the value of the j-th feature of the i-th instance. Initially, the data matrix X is incomplete, i.e., its elements corresponding to missing values are undefined. For each missing feature value x_{ij}, there is a corresponding cost C_{ij} at which it can be acquired. Let q_{ij} refer to the query for the value of x_{ij}. Then, the general task of active feature-value acquisition is the problem of selecting the instance-feature query that will result in the highest increase in clustering quality per unit cost.

The overall framework for the generalized AFA problem is presented in Algorithm 1. Information is acquired iteratively, where at each step all possible queries are ranked based on their expected contribution to clustering quality normalized by cost. The highest-ranking query is then selected, and the feature value corresponding to this query is acquired. The dataset is appropriately updated, and this process is repeated until some stopping criterion is met, e.g., a desirable clustering quality has been achieved. To reduce computational costs, multiple queries can be selected at each iteration. While this framework is intuitive, the crux of the problem lies in devising effective measures for the utility of acquisitions. In subsequent sections, we address challenges related to performing this task accurately and efficiently.

Algorithm 1: Active Feature-value Acquisition for Clustering
Given: X - initial (incomplete) instance-feature matrix, L - clustering algorithm, b - size of query batch, C - cost matrix for all instance-feature pairs.
Output: M = L(X) - final clustering of the dataset incorporating acquired values
1. Initialize TotalCost to the initial cost of X
2. Initialize the set of possible queries Q = {q_{ij} : x_{ij} is missing}
3. Repeat until stopping criterion is met:
4.   Generate a clustering M = L(X)
5.   For each q_{ij} in Q, compute the utility score
6.   Select a subset S of b queries with the highest scores
7.   For each q_{ij} in S: acquire the value of x_{ij} and incorporate it into X; TotalCost = TotalCost + C_{ij}
8.   Remove S from Q
9. Return M = L(X)

At every step of the AFA algorithm, the feature value which in expectation will result in the highest clustering improvement per unit cost is acquired. Fundamental to our approach is a utility function U(x_{ij} = x, C_{ij}), which quantifies the benefit from a specific value x for feature x_{ij} acquired via the corresponding query q_{ij} at cost C_{ij}. Then, the expected utility of query q_{ij}, EU(q_{ij}), is defined as the expectation of the utility over the marginal distribution of the feature x_{ij}:

EU(q_{ij}) = \sum_{x} U(x_{ij} = x, C_{ij}) P(x_{ij} = x).

Since the true marginal distribution of each missing feature value x_{ij} is unknown, an empirical estimate of P(x_{ij}) can be obtained using probabilistic classifiers. For example, in the case of discrete (categorical) data, for each feature j a naive Bayes classifier M_j can be trained to estimate the feature's probability distribution based on the values of the other features of a given instance. Then, the expectation can easily be computed by piecewise summation over the possible values. For continuous attributes, computation of the expected utility can be performed either using computational methods such as Monte Carlo estimation, or by discretizing them and using probabilistic classifiers as described above.
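To make the acquisition loop concrete, the Python sketch below implements Algorithm 1 for discrete features under stated assumptions: `cluster`, `utility`, and `marginal` are caller-supplied functions (illustrative names, not part of the original system), missing entries are marked with NaN, and `true_values` stands in for the oracle that answers queries. The marginal P(x_{ij}) would in practice come from a per-feature probabilistic classifier such as naive Bayes, as described above.

```python
import numpy as np

def expected_utility(X, i, j, cost, marginal, utility):
    # EU(q_ij): expectation of the utility over the estimated marginal
    # distribution of the missing value x_ij (discrete features).
    return sum(p * utility(X, i, j, v, cost)
               for v, p in marginal(X, i, j).items())

def afa_clustering(X, costs, true_values, cluster, utility, marginal,
                   batch_size=1, n_steps=50):
    # Generalized AFA loop (Algorithm 1): score missing entries by expected
    # utility, acquire the best batch, and re-cluster, until the step budget
    # (the stopping criterion assumed here) is exhausted.
    X = X.copy()
    total_cost = 0.0
    queries = {(i, j) for i in range(X.shape[0]) for j in range(X.shape[1])
               if np.isnan(X[i, j])}
    assignments = cluster(X)
    for _ in range(n_steps):
        if not queries:
            break
        ranked = sorted(queries, reverse=True,
                        key=lambda q: expected_utility(X, q[0], q[1],
                                                       costs[q], marginal, utility))
        for i, j in ranked[:batch_size]:
            X[i, j] = true_values[i, j]      # acquire the value (oracle call)
            total_cost += costs[i, j]
            queries.discard((i, j))
        assignments = cluster(X)             # update clustering with new data
    return assignments, X, total_cost
```

In an experiment, `cluster` could, for instance, wrap a K-means implementation that ignores still-missing entries in its distance computations, and `utility` could be either of the utility functions described in Section 2.1.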

2.1 Capturing the utility from a prospective acquisition

Devising a utility function U to capture the benefits of possible acquisition outcomes is the critical component of the AFA framework. Acquisitions aim to improve clustering quality. Clustering quality measures proposed in prior work can be loosely divided into external measures, such as pairwise F-measure [7], which are derived from a category distribution unseen at clustering time, and internal measures, e.g., the ratio between average inter-cluster and intra-cluster distances, which use only data that is available to the clustering algorithm. Since external measures cannot be assessed at the time of clustering, an acquisition policy must capture the value of acquisitions using merely the dataset at hand.

Most clustering algorithms optimize a specific objective function, which allows defining utility as the improvement in this objective per unit cost. For example, the objective of the popular K-Means algorithm [5] is to minimize the sum of squared distances between every instance x_h and the centroid of the instance's cluster, \mu_{y_h}:

J(X) = \sum_{h=1}^{m} (x_h - \mu_{y_h})^2,

where y_h is the index of the cluster to which instance x_h is assigned, and missing feature values are omitted from the squared-distance computation. Thus, the objective-based utility of acquisition outcome x_{ij} = x can be defined as the cost-normalized reduction in the value of the objective function:

U^{Obj}(x_{ij} = x, C_{ij}) = (J(X) - J(X_{x_{ij}=x})) / C_{ij},

where the objective-function value after the acquisition, J(X_{x_{ij}=x}), is estimated following the relocation of cluster centroids caused by the acquisition.

While an objective-based utility function provides a well-motivated acquisition strategy, it may select feature values that improve cluster centroid locations without significantly changing cluster assignments, which often underlie external measures of clustering outcome. The effect of such wasteful acquisitions can be significant, rendering an objective-based utility a suboptimal strategy for improving external evaluation measures. Because internal objective functions may not relate well to external measures, we propose an alternative utility measure which approximates the qualitative impact on the clustering configuration caused by the acquisition. We define this utility as the number of instances for which cluster membership changes as the result of an acquisition, given a certain value of the acquired feature. Formally, given the current data matrix X, let y_h^{(X)} be the cluster assignment of point x_h before the acquisition, and y_h^{(X_{x_{ij}=x})} be the cluster assignment of x_h after the acquisition. Then, the perturbation-based utility of acquiring value x for feature x_{ij} is defined as follows:

U^{Pert}(x_{ij} = x, C_{ij}) = \sum_{h=1}^{m} 1[ y_h^{(X)} \neq y_h^{(X_{x_{ij}=x})} ] / C_{ij}.

For K-Means, the cluster assignments after the acquisition, Y^{(X_{x_{ij}=x})}, can be obtained by re-estimating the cluster centroid to which instance x_i is currently assigned, assuming the value x for feature x_{ij}; performing a single assignment step for all points then provides the new set of cluster assignments. As we show below, this utility measure identifies highly informative acquisitions. Henceforth, we refer to this perturbation-based utility as Expected Utility (EU); we refer to the use of the objective-based utility as Expected-Utility-Objective (EU-Objective).

2.2 Efficiency considerations: Instance-based sampling

A significant challenge lies in the fact that exhaustively evaluating all potential acquisitions is computationally infeasible for datasets of even moderate size. We propose to make this selection tractable by evaluating only a sub-sample of the available queries. We specify an exploration parameter α which controls the complexity of the search. To select a batch of b queries, first a sub-sample of αb queries is selected from the available pool, and then the expected utility of each query in this sub-sample is evaluated. The value of α can be set depending on the amount of time the user is willing to spend on this process. One approach is to draw this sample uniformly at random to make the computation feasible. However, it may be possible to improve performance by applying Expected Utility estimation to a particularly informative sample of queries. In particular, because the goal of clustering is to define boundaries between potential classes, instances near these boundaries have the most impact on cluster formation. Consequently, missing features of these instances give us the most decisive information for adjusting the clustering boundaries. Formally, if \mu_{y_i} and \mu_{y'_i} are respectively the closest and second-closest centroids for instance x_i in the current clustering, we define the margin δ(x_i) of instance x_i as the difference between their distances from x_i, according to the distance metric D being used for clustering:

δ(x_i) = D(x_i, \mu_{y'_i}) - D(x_i, \mu_{y_i}).

Given incomplete information about the position of instances in the feature space, smaller margins correspond to lower confidence in an instance's current cluster assignment. For these instances, obtaining a better estimate of their position in the feature space is more likely to improve our ability to assign them to the correct cluster than for instances with large margins. Following this rationale, we rank all instances in ascending order of their margins based on the current cluster assignments. Then, a set of αb queries from the top-ranked instances is selected for evaluation, where b is the desired batch size and α is the exploration parameter. This candidate set of queries is then subjected to the same expected-utility evaluation described in the previous section. We refer to this approach as Instance-Based Sampling Expected Utility (IBS-EU).
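As an illustration of the perturbation-based utility and the margin used by IBS-EU, the Python sketch below follows the description above for K-means: it omits NaN-marked missing features from the squared distance, re-estimates only the centroid of the affected instance's current cluster under the hypothesized value, performs a single assignment step, and counts changed assignments per unit cost. The helper names and the NaN-masking convention are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def masked_sq_dist(x, centroid):
    # Squared distance with missing (NaN) features omitted, as in J(X).
    observed = ~np.isnan(x)
    diff = x[observed] - centroid[observed]
    return float(diff @ diff)

def assign(X, centroids):
    # Single K-means assignment step over all points.
    return np.array([np.argmin([masked_sq_dist(x, c) for c in centroids])
                     for x in X])

def perturbation_utility(X, centroids, y, i, j, value, cost):
    # U_Pert(x_ij = value, C_ij): number of instances whose cluster
    # assignment changes after hypothetically setting x_ij = value.
    X_new = X.copy()
    X_new[i, j] = value
    new_centroids = centroids.copy()
    members = X_new[y == y[i]]
    # Re-estimate only the centroid of instance i's current cluster,
    # averaging each feature over the values observed so far.
    new_centroids[y[i]] = np.nanmean(members, axis=0)
    y_new = assign(X_new, new_centroids)
    return np.sum(y_new != y) / cost

def margin(x, centroids):
    # delta(x): distance to the second-closest centroid minus distance to
    # the closest one; small margins flag boundary instances for IBS-EU.
    d = np.sort([masked_sq_dist(x, c) for c in centroids])
    return d[1] - d[0]
```

IBS-EU would rank instances by `margin` in ascending order, take the αb candidate queries belonging to the lowest-margin instances, and score only those by the expectation of `perturbation_utility` under the estimated marginal of each candidate feature.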

3. Experimental evaluation

We evaluated our proposed approach on four datasets from the UCI repository [2]: iris, wine, letters-l, and protein, which have been previously used in a number of clustering studies. Features with continuous values in these datasets were discretized into 10 bins of equal width. Since feature acquisition costs are not available for these datasets, in our first set of experiments we assume that acquisition costs are uniform for all feature values; experiments with other cost distributions follow. Discrete feature values enable the use of piecewise summation for the expectation calculation, which is computationally preferable; in principle, however, continuous values can also be used. We compare the proposed acquisition policies with a strategy that selects queries uniformly at random; all policies use the K-means clustering algorithm. The sampling parameter α of our methods is set to 10. We report results obtained from 100 runs for each active acquisition policy. In each run, a small fraction of features is randomly selected for initialization for each instance in the dataset,¹ and we evaluate clustering performance after each acquisition step. Lastly, because the datasets we consider have underlying class labels, we employ an external metric, pairwise F-measure, to evaluate clustering quality. We have found empirically that there are no qualitative differences in our results for different external measures. Given a clustering and underlying class labels, pairwise precision and recall are defined as the proportion of same-cluster instance pairs that have the same class, and the proportion of same-class instance pairs that have been placed in the same cluster, respectively. The F-measure is then the harmonic mean of precision and recall: F1 = (2 × Precision × Recall) / (Precision + Recall).

¹ We randomly selected 1 out of 4 features for each instance in the iris dataset, 2 out of 4 features for wine, and 3 out of 16 and 20 features for the letters-l and protein datasets, respectively.
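The pairwise F-measure above can be computed directly from cluster assignments and class labels; the short Python sketch below is a straightforward implementation of that definition (illustrative code, not from the original study).

```python
from itertools import combinations

def pairwise_f_measure(cluster_ids, class_labels):
    # Count, over all unordered instance pairs, the co-clustered pairs,
    # the same-class pairs, and the pairs that are both.
    same_cluster = same_class = both = 0
    for a, b in combinations(range(len(cluster_ids)), 2):
        co_clustered = cluster_ids[a] == cluster_ids[b]
        co_labeled = class_labels[a] == class_labels[b]
        same_cluster += co_clustered
        same_class += co_labeled
        both += co_clustered and co_labeled
    precision = both / same_cluster if same_cluster else 0.0
    recall = both / same_class if same_class else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, pairwise_f_measure([0, 0, 1, 1], ['a', 'a', 'a', 'b']) returns 0.4 (pairwise precision 0.5, pairwise recall 1/3).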
The performance comparison for any two acquisition schemes A and B can be summarized by the average percentage increase in pairwise F-measure of A over B across all acquisition phases. We refer to this metric as the average % F-measure increase.

4. Results

Table 1 presents summary results for EU, EU-Objective, IBS-EU, and IBS-Uniform, which acquires feature values drawn uniformly at random from the informative instances selected by IBS. Let us first examine the relative performance of the EU policy, which identifies acquisitions that are likely to impact the cluster assignments, and EU-Objective, which targets acquisitions that are expected to improve the clustering algorithm's internal objective function. Figure 2(a) presents clustering performance as a function of acquisition costs for the protein data set, obtained with EU, EU-Objective, and random sampling. For all data sets, EU leads to better clustering than random query sampling. The improvements in performance range from a 10% to a 32% increase in F-measure on the top 20% of acquisition phases. One can also observe the cost benefits of using EU to obtain a desired level of performance. For example, on the iris data set, Expected Utility achieves a pairwise F-measure of 0.8 with fewer than 300 feature values, while random sampling requires twice as many acquisitions to achieve the same result. In contrast to Expected Utility, using the objective-based utility function in EU-Objective is rather ineffective in improving pairwise F-measure. This is because the K-means objective is focused on producing tighter clusters, and the acquisition strategy based on it may select feature values that reduce this objective without changing any cluster assignments, resulting in no improvement with respect to external evaluation measures.

% F-measure increase over random sampling
Data set    EU-Objective    EU       IBS-EU    IBS-Uniform
iris        -0.81            6.19     7.96      3.14
wine        -8.42           10.92    11.41      4.58
letters      7               6.22     5.55      0.16
protein      4.78           14.19    14.93      2.90
Table 1: Performance of different acquisition policies for clustering

Figure 2: Learning curves for alternative acquisition policies. (a) Pairwise F-measure vs. number of feature values acquired for EU, EU-Objective, and random sampling on the protein data set. (b) Pairwise F-measure vs. number of feature values acquired for EU, IBS-EU, and IBS-Uniform on the iris data set.

Now, let us examine the benefit to EU of evaluating a subset of acquisitions from particularly informative instances, as captured by our instance-based sampling approach. Table 1 presents summary performance for IBS-EU and IBS-Uniform, and for the iris data set, Figure 2(b) shows clustering quality after each acquisition phase obtained by EU, IBS-EU, and IBS-Uniform. On 3 of the 4 datasets, IBS-EU produces the highest average increase in pairwise F-measure compared to random sampling. On these datasets, IBS-Uniform also performs substantially better than random. These results demonstrate that our margin measure effectively identifies particularly informative instances for acquisition. Consequently, IBS-EU focuses the evaluation of Expected Utility on a more promising set of queries, leading to better models on average. However, the improvements of IBS-EU over EU are not very large.

Lastly, we evaluated the policies when applied to the iris dataset under different cost distributions. We assigned each feature a cost drawn uniformly at random from a range between 1 and 100. For this evaluation we include a cost-sensitive benchmark policy, Cheapest-first, which selects acquisitions in order of increasing cost. The results for all randomly assigned cost distributions show that IBS-EU and Expected Utility consistently result in better clustering than random acquisition for a given cost. Figure 3 presents F-measure versus acquisition costs for two representative cost distributions. As shown, in settings where features have varying information value with non-negligible costs, EU's ability to capture the value of different feature values per unit cost is more critical. In such cases, acquiring an uninformative feature value at a substantial cost results in a significant loss and, as shown, EU and IBS-EU are more likely to avoid such losses. In contrast, the performance of Cheapest-first is inconsistent. It performs well when its underlying assumption holds and the cheapest features are also informative. In such cases, EU does not perform as well, since it imperfectly estimates the expected improvement from each acquisition. When many inexpensive features are also uninformative, Cheapest-first can perform poorly, as shown by the early acquisition stages of Figure 3. EU, however, estimates the trade-off between cost and expected improvement in clustering quality, and although the estimation is imperfect, it consistently selects better queries than random acquisitions for all cost structures.

Figure 3: Performance under different feature-value cost structures. Pairwise F-measure vs. acquisition cost for EU, IBS-EU, and Cheapest-first. (a) Inexpensive features are also informative. (b) Some expensive features are informative.

5. Conclusions

In this paper, we proposed an expected utility approach to active feature-value acquisition for clustering, where informative feature values are obtained based on the estimated expected improvement in clustering quality per unit cost. Experiments show that the Expected Utility approach consistently leads to better clustering than random sampling for a given acquisition cost.

6. References

[1] S. Basu, A. Banerjee, and R. J. Mooney. Active semi-supervision for pairwise constrained clustering. In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM-04), April 2004.
[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[3] J. M. Buhmann and T. Zöller. Active learning for hierarchical pairwise data clustering. In ICPR, pages 2186-2189, 2000.
[4] T. Hofmann and J. M. Buhmann. Active data clustering. In Advances in Neural Information Processing Systems 10, 1998.
[5] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297, 1967.
[6] P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. An expected utility approach to active feature-value acquisition. In Proceedings of the International Conference on Data Mining, pages 745-748, Houston, TX, November 2005.
[7] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In Proceedings of the KDD-2000 Workshop on Text Mining, 2000.
[8] D. Vu, M. Bilenko, P. Melville, and M. Saar-Tsechansky. Active information acquisition for improved clustering. Working paper, McCombs School of Business, May 2007.