Clustering with Confidence: A Binning Approach

Rebecca Nugent, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213
Werner Stuetzle, Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195

Abstract

We present a plug-in method for estimating the cluster tree of a density. The method takes advantage of the ability to exactly compute the level sets of a piecewise constant density estimate. We then introduce clustering with confidence, an automatic pruning procedure that assesses the significance of splits (and thereby clusters) in the cluster tree; the only user input required is the desired confidence level.

1 Introduction

The goal of clustering is to identify distinct groups in a data set and assign a group label to each observation. Ideally, we would find the number of groups, as well as where each group lies in the feature space, with minimal input from the user. For example, Figure 1 shows two curvilinear groups of 200 observations each and two spherical groups of 100 observations each; we would like a four-cluster solution, one cluster per group.

Figure 1: Data set with four well-separated groups

To cast clustering as a statistical problem, we regard the data X = {x_1, ..., x_n} ⊂ R^p as a sample from some unknown population density p(x). There are two statistical approaches. Parametric (model-based) clustering assumes the data have been generated by a finite mixture of g underlying parametric probability distributions p_g(x) (often multivariate Gaussian), one component per group (Fraley and Raftery 1998; McLachlan and Basford 1988). This procedure selects a number of mixture components (clusters) and estimates their parameters. However, it is susceptible to violations of the distributional assumptions: skewed or non-Gaussian groups may be modeled incorrectly by a mixture of several Gaussians.

In contrast, the nonparametric approach assumes a correspondence between groups in the data and modes of the density p(x). Wishart (1969) first advocated searching for modes as manifestations of the presence of groups; nonparametric clustering algorithms should be able to resolve distinct data modes, independently of their shape and variance. Hartigan (1975, 1981, 1985) expanded this idea and made it more precise. Define the level set L(λ; p) of a density p at level λ as the subset of the feature space on which the density exceeds λ:

L(λ; p) = {x : p(x) > λ}.

The connected components of a level set are its maximally connected subsets. For any two connected components A and B, possibly at different levels, either A ⊂ B, B ⊂ A, or A ∩ B = ∅. This hierarchical structure of the level sets is summarized by the cluster tree of p.

The cluster tree is easiest to define recursively (Stuetzle 2003). Each node N of the tree represents a subset D(N) of the support L(0; p) of p and is associated with a density level λ(N). The root node represents the entire support of p and is associated with density level λ(N) = 0. To determine the daughters of a node, we find the lowest level λ_d for which L(λ_d; p) ∩ D(N) has two or more connected components. If no such λ_d exists, then D(N) is a mode of the density and N is a leaf of the tree. Otherwise, let C_1, C_2, ..., C_n be the connected components of L(λ_d; p) ∩ D(N). If n = 2, we create two daughter nodes at level λ_d, one for each connected component, and apply the procedure recursively to each daughter. If n > 2, we create two daughter nodes, one for C_1 and one for C_2 ∪ C_3 ∪ ... ∪ C_n, and then recurse. We call the regions D(N) the high density clusters of p.

Note that we have defined a binary tree; other definitions with non-binary splits could be used. However, the recursive binary tree easily accommodates level sets with more than two connected components. For example, if the level set L(λ_d; p) ∩ D(N) has three connected components C_1, C_2, C_3, the binary tree first splits, say, C_1 from C_2 ∪ C_3 at height λ_d and, after finding any subsequent splits and leaves on the branch for C_1, returns to the component C_2 ∪ C_3 at height λ_d, immediately splits C_2 from C_3, and creates two more daughter nodes at height λ_d. The final cluster tree therefore contains a node for each of the three connected components, all at height λ_d. Figure 2 shows a univariate density with four modes and the corresponding cluster tree, with an initial split at λ = 0.0044 and subsequent splits at λ = 0.0288 and λ = 0.0434. Estimating the cluster tree is a fundamental goal of nonparametric cluster analysis.

Figure 2: (a) Density with four modes; (b) corresponding cluster tree with three splits (four leaves)

Several other level set estimation procedures have been proposed. Walther's (1997) level set estimator consists of two steps. First, compute a density estimate p̂ of the underlying density p. Second, approximate L(λ; p̂) by a union of balls of radius r centered at the observations x_i with p̂(x_i) > λ that do not contain any x_i with p̂(x_i) ≤ λ. While this structure is computationally convenient for finding connected components, its accuracy depends on the choice of r; Walther provides a formula for r based on the behavior of p on the boundary of the level set L(λ; p). The level set estimator of Cuevas, Febrero, and Fraiman (2000, 2001) is also based on a union of balls of radius r, but their estimate is the union of all balls centered at observations x_i with p̂(x_i) > λ; no balls are excluded. They propose several heuristic methods for choosing r based on interpoint distances. Stuetzle's runt pruning method (2003) estimates the cluster tree of p by computing the cluster tree of the nearest neighbor density estimate and then pruning branches corresponding to presumably spurious modes; the method is sensitive to the pruning parameter chosen by the user. Klemelä (2004) developed visualization tools for multivariate density estimates that are piecewise constant over (hyper-)rectangles. He defines a level set tree, which has nodes at every level corresponding to a unique value of p̂; in contrast, the cluster tree only has nodes at split levels.

There are several previously suggested clustering methods based on level sets. Wishart's one level mode analysis (1969) is probably the first such method; its goal is to find the connected components of L(λ; p) for a given λ. After computing a density estimate p̂, all observations x_i with p̂(x_i) ≤ λ are set aside. Well-separated groups of the remaining high density observations correspond to connected components of L(λ; p). One drawback is that not all groups or modes may be identified by examining a single level set.
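As a concrete illustration of one level mode analysis, here is a minimal sketch (not the paper's code; it assumes an ordinary kernel density estimate, toy data, and a user-chosen connectivity radius r, all illustrative choices): threshold the density estimate at λ and group the surviving observations by connectivity.

```python
# A minimal sketch of Wishart-style one level mode analysis (illustrative only;
# the paper itself works with binned density estimates, not this exact recipe).
import numpy as np
from scipy.stats import gaussian_kde
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import distance_matrix

rng = np.random.default_rng(0)
# Toy data: two well-separated spherical groups (a stand-in for Figure 1's data).
X = np.vstack([rng.normal(0.0, 0.1, (100, 2)), rng.normal(1.0, 0.1, (100, 2))])

kde = gaussian_kde(X.T)              # ordinary kernel density estimate
dens = kde(X.T)                      # density evaluated at each observation
lam = np.quantile(dens, 0.25)        # a user-chosen level lambda

high = X[dens > lam]                 # observations in L(lambda; p-hat)
# Connect retained points within radius r of each other and take graph
# connected components as the clusters at this single level.
r = 0.15                             # connectivity radius (a tuning choice)
D = distance_matrix(high, high)
graph = csr_matrix(D <= r)
n_clusters, labels = connected_components(graph, directed=False)
print(n_clusters, np.bincount(labels))
```

Groups that only separate at a different density level would be missed by this single-level scan, which is exactly the weakness noted above.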
Wishart's hierarchical mode analysis addresses the single-level limitation; this complex algorithm constructs a cluster tree (although not so named) through iterative merging.

2 Constructing the cluster tree for a piecewise constant density estimate

We can estimate the cluster tree of a density p by the cluster tree of a density estimate p̂. However, for most density estimates, computing the cluster tree is a difficult problem; there is no obvious method for computing and representing the level sets. Exceptions are density estimates that are piecewise constant over (hyper-)rectangles. Let B_1, B_2, ..., B_N be the rectangles and let p̂_i be the estimated density on B_i. Then

L(λ; p̂) = ∪_{i : p̂_i > λ} B_i.

If the dimension is low enough, any density estimate can be reasonably binned. To illustrate, we choose the binned kernel density estimate (BKDE), a binned approximation to an ordinary kernel density estimate, on the finest grid we can computationally afford (Hall and Wand 1996; Wand and Jones 1995). We use ten-fold cross-validation for bandwidth selection. Figure 3a shows a heat map of the BKDE for the four well-separated groups on a 20 x 20 grid with a cross-validation bandwidth of 0.0244. Several high frequency areas are indicated.
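Because the estimate is constant on each bin, a level set is just the union of bins whose value exceeds λ, which reduces to thresholding a grid of values. The sketch below is my own illustration: it uses a Gaussian-smoothed 2-D histogram as a stand-in for the BKDE (only an approximation to the binned kernel density estimator with a cross-validated bandwidth used in the paper), and the grid size, bandwidth, and data are illustrative choices.

```python
# Sketch: a piecewise constant density estimate on a 20 x 20 grid and its level set.
import numpy as np
from scipy.ndimage import gaussian_filter

def binned_density(X, grid=20, sigma=1.0):
    """Histogram the data on a grid, smooth it, and normalize to a density."""
    H, xe, ye = np.histogram2d(X[:, 0], X[:, 1], bins=grid)
    H = gaussian_filter(H, sigma=sigma)
    bin_area = (xe[1] - xe[0]) * (ye[1] - ye[0])
    return H / (H.sum() * bin_area), (xe, ye)

def level_set(p_hat, lam):
    """Boolean mask of bins with estimated density strictly above lam."""
    return p_hat > lam

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.2, 0.05, (100, 2)), rng.normal(0.8, 0.05, (100, 2))])
p_hat, edges = binned_density(X)
mask = level_set(p_hat, lam=np.quantile(p_hat[p_hat > 0], 0.5))
print(mask.sum(), "of", mask.size, "bins lie in the level set")
```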

Note that during cluster tree construction, L(λ_d; p̂) ∩ D(N) changes structure only when the level λ_d reaches the next higher value of p̂(B_i) for one or more bins B_i in D(N). After identifying and sorting the bins' unique density estimate values, we can compute the cluster tree of p̂ by stepping through only this subset of all possible levels λ. Every increase in level then corresponds to the removal of one or more bins from the level set L(λ_d; p̂).

Finding the connected components of L(λ_d; p̂) ∩ D(N) can be cast as a graph problem. Let G be an adjacency graph over the bins B_i ∈ L(λ_d; p̂) ∩ D(N). The vertices of the graph are the bins B_i; two bins B_i and B_j are connected by an edge if they are adjacent (i.e., share a lower-dimensional face). The connected components of G correspond to the connected components of the level set L(λ_d; p̂) ∩ D(N), and finding them is a standard graph problem (Sedgewick 2002).

When a node N of the cluster tree has been split into daughters N_l and N_r, the high density clusters D(N_l) and D(N_r), also referred to as the cluster cores, do not form a partition of D(N). We refer to the bins in D(N) \ (D(N_l) ∪ D(N_r)) (and the observations in these bins) as the fluff. We assign a fluff bin B to N_r if the Manhattan distance d_M(B, D(N_r)) is smaller than d_M(B, D(N_l)); if d_M(B, D(N_r)) > d_M(B, D(N_l)), then B is assigned to N_l. In case of ties, the algorithm arbitrarily chooses an assignment. The cluster cores and fluff represented by the leaves of the cluster tree form a partition of the support of p̂ and a corresponding partition of the observations; the same is true for every subtree of the cluster tree.

Figure 3b shows L(0.00016; p̂) with the removed bins in red. Two daughter nodes are created, one for each connected component/cluster core. Figure 3c illustrates the subsequent partitioning of the feature space after assignment of the fluff bins to the closest core.

Figure 3: (a) Example data BKDE on a 20 x 20 grid; (b) level set at λ = 0.00016, cluster cores in orange and white; (c) corresponding partitioned feature space

Figure 4a shows the cluster tree generated from the example BKDE; the corresponding cluster assignments and partitioned feature space (labeled by color) are shown in Figures 4b,c. The cluster tree indicates that the BKDE has nine modes. The first split at λ = 1 x 10^-16 and the mode of p̂ around (0.5, 0) (in yellow in Figure 4c) are artifacts of the density estimate; no observations are assigned to the resulting daughter node. Each of the remaining eight leaves corresponds to (a subset of) one of the groups.

Figure 4: (a) Cluster tree for the BKDE; (b) cluster assignments by color; (c) partitioned feature space
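To make the graph step concrete, here is a small sketch (my own, not the authors' implementation) that reuses the hypothetical `p_hat` grid from the earlier sketch: face-adjacent bins are connected (4-connectivity in 2-D), the connected components above the split level are the cluster cores, and the remaining fluff bins are assigned to the core with the smaller Manhattan distance between bin indices.

```python
# Sketch: connected components of the bin adjacency graph and fluff assignment.
import numpy as np
from scipy.ndimage import label

def cores_and_fluff(p_hat, lam_parent, lam_d):
    """Split the region {p_hat > lam_parent} at level lam_d.

    Returns an integer grid (0 outside the parent region, k > 0 for bins
    assigned to the k-th core plus its fluff) and the number of cores found.
    """
    parent = p_hat > lam_parent
    structure = np.array([[0, 1, 0],
                          [1, 1, 1],
                          [0, 1, 0]])              # face adjacency only
    cores, n_cores = label((p_hat > lam_d) & parent, structure=structure)
    if n_cores < 2:
        return parent.astype(int), 1               # no split at this level

    # Fluff: parent bins not in any core; assign each to the nearest core
    # in Manhattan (L1) distance between bin indices (ties broken arbitrarily).
    assign = cores.copy()
    core_idx = {k: np.argwhere(cores == k) for k in range(1, n_cores + 1)}
    for i, j in np.argwhere(parent & (cores == 0)):
        dists = [np.abs(core_idx[k] - (i, j)).sum(axis=1).min()
                 for k in range(1, n_cores + 1)]
        assign[i, j] = 1 + int(np.argmin(dists))
    return assign, n_cores

assign, n_cores = cores_and_fluff(p_hat, lam_parent=0.0,
                                  lam_d=np.quantile(p_hat, 0.9))
print(n_cores, "cores at this level")
```

A full cluster tree would apply this split test recursively within each daughter region, stepping only through the sorted unique values of `p_hat` as described above.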

We compare these results to common clustering approaches, briefly described here. K-means is an iterative descent approach that minimizes the within-cluster sum of squares; it tends to partition the feature space into equal-sized (hyper-)spheres (Hartigan and Wong 1979). Although we know there are four groups, we range the number of clusters from two to twelve and plot the total within-cluster sum of squares against the number of clusters. The elbow in the curve occurs around five or six clusters; solutions with more clusters yield negligible decreases in the criterion. Model-based clustering is described in Section 1; we allow the procedure to search over one to twelve clusters with no constraints on the covariance matrix, and a ten-cluster solution is chosen. Finally, we include results from three hierarchical linkage methods (Mardia, Kent, and Bibby 1979). Hierarchical agglomerative linkage methods iteratively join clusters in an order determined by a chosen between-cluster distance; we use the minimum, maximum, or average Euclidean distance between observations in different clusters (single, complete, and average linkage, respectively). The hierarchical cluster structure is graphically represented by a dendrogram whose branches merge at heights corresponding to the clusters' dissimilarity. We prune each dendrogram for the known number of groups (four).

In all cases, we compare the generated clustering partitions to the true groups using the Adjusted Rand Index (Table 1) (Hubert and Arabie 1985). The ARI is a common measure of agreement between two partitions. Its expected value is zero under random partitioning and its maximum value is one; larger values indicate better agreement.

Table 1: Comparison of Clustering Methods

Method                                 ARI
Cluster Tree: 20 x 20 (k = 8)          0.781
Cluster Tree: 15 x 15 (k = 6)          0.865
K-means (k = 4)                        0.924
K-means (k = 5)                        0.803
K-means (k = 6)                        0.673
MBC (k = 10)                           0.534
Linkage (single, complete, average)    1

The groups in the simulated data are well-separated; however, the two curvilinear groups lead to an expected increase in the number of clusters. While the four-cluster k-means solution had high agreement with the true partition, the total within-cluster sum of squares criterion was lower for solutions with more clusters (5, 6), which had lower ARIs comparable to that of the cluster tree partition generated for the BKDE on a 20 x 20 grid. Model-based clustering overestimated the number of groups in the population, and the ARI correspondingly drops. All three linkage methods gave perfect agreement; this performance is not unexpected given the well-separated groups.

Although the cluster tree method has an ARI of 0.781, the algorithm generates the correct partition for the density estimate; its performance depends on the density estimate. A similarly constructed density estimate on a 15 x 15 grid with a cross-validation selected bandwidth of 0.015 yielded a cluster tree with an ARI of 0.865. In both cases, the number of groups is overestimated (8 and 6). Figure 4 illustrates this problem with the approach. While the cluster tree is accurate for the given density estimate, a density estimate is inherently noisy, which results in spurious modes that do not correspond to groups in the underlying population. In our example, the procedure identified the four original groups (the three splits after the artifact split in the cluster tree) but erroneously continued splitting the clusters. The corresponding branches of the cluster tree need to be pruned.

3 Clustering with Confidence

We propose a bootstrap-based automatic pruning procedure. The idea is to find (1 - α) simultaneous upper and lower confidence sets for each level set. During cluster tree construction, only splits indicated as significant by the bootstrap confidence sets are taken to signal multi-modality. Spurious modes are discarded during estimation; the only decision required from the user is the confidence level.

3.1 Bootstrap confidence sets for level sets

We choose upper confidence sets of the form L^u(λ; p̂) = L(λ - δ^u_λ; p̂) and lower confidence sets of the form L^l(λ; p̂) = L(λ + δ^l_λ; p̂), with δ^u_λ, δ^l_λ > 0. By construction,

Lower = L^l(λ; p̂) ⊂ L(λ; p̂) ⊂ L^u(λ; p̂) = Upper.

Let p̂_1, p̂_2, ..., p̂_m be the density estimates for m bootstrap samples of size n drawn with replacement from the original sample.
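Generating these bootstrap estimates is straightforward. The sketch below is my own illustration; it reuses the stand-in smoothed-histogram estimator from the earlier sketch (not the paper's BKDE) and holds the bin edges fixed so that all m estimates live on a common grid.

```python
# Sketch: density estimates for m bootstrap samples on the same grid as p_hat.
import numpy as np
from scipy.ndimage import gaussian_filter

def binned_density_fixed(X, xe, ye, sigma=1.0):
    """Binned density estimate on a fixed grid of bin edges (xe, ye)."""
    H, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=[xe, ye])
    H = gaussian_filter(H, sigma=sigma)
    bin_area = (xe[1] - xe[0]) * (ye[1] - ye[0])
    return H / (H.sum() * bin_area)

def bootstrap_densities(X, xe, ye, m=100, sigma=1.0, seed=2):
    """m bootstrap resamples (with replacement) and their density estimates."""
    rng = np.random.default_rng(seed)
    n = len(X)
    return [binned_density_fixed(X[rng.integers(0, n, n)], xe, ye, sigma)
            for _ in range(m)]

xe, ye = edges                 # bin edges from the original fit in the earlier sketch
boot = bootstrap_densities(X, xe, ye, m=100)
```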
We call the pair (L(λ - δ^u_λ; p̂), L(λ + δ^l_λ; p̂)) a non-simultaneous (1 - α) confidence band for L(λ; p) if, for at least (1 - α) of the bootstrap density estimates p̂_i, the upper confidence set L(λ - δ^u_λ; p̂) contains L(λ; p̂_i) and the lower confidence set L(λ + δ^l_λ; p̂) is contained in L(λ; p̂_i):

P_boot{ L(λ + δ^l_λ; p̂) ⊂ L(λ; p̂_i) ⊂ L(λ - δ^u_λ; p̂) } ≥ 1 - α.

Here is one method to determine δ^u_λ and δ^l_λ (Buja 2002). For each bootstrap sample p̂_i and each of the finitely many levels λ of p̂, find the smallest δ^u_λ(i) such that L(λ; p̂_i) ⊂ L(λ - δ^u_λ(i); p̂) and the smallest δ^l_λ(i) such that L(λ + δ^l_λ(i); p̂) ⊂ L(λ; p̂_i). Choose δ^u_λ as the (1 - α/2) quantile of the δ^u_λ(i) and δ^l_λ as the (1 - α/2) quantile of the δ^l_λ(i). By construction, the pair (L(λ - δ^u_λ; p̂), L(λ + δ^l_λ; p̂)) is a (1 - α) non-simultaneous confidence band for L(λ; p). To get confidence bands for all λ occurring as values of p̂ with simultaneous coverage probability 1 - α, we simply increase the coverage level of the individual bands until

P_boot{ L(λ + δ^l_λ; p̂) ⊂ L(λ; p̂_i) ⊂ L(λ - δ^u_λ; p̂) for all λ } ≥ 1 - α.
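A sketch of this calibration follows. It is my own reading of the procedure, built on the hypothetical `p_hat` and `boot` objects from the previous sketches; strict versus non-strict inequalities are glossed over, and the simultaneity step is only indicated, not implemented.

```python
# Sketch: per-level bootstrap deltas and their (1 - alpha/2) quantiles.
import numpy as np

def deltas_for_level(p_hat, p_boot, lam):
    """Smallest shifts so that L(lam; p_boot) fits inside L(lam - d_u; p_hat)
    and L(lam + d_l; p_hat) fits inside L(lam; p_boot)."""
    above = p_boot > lam                        # bins in L(lam; p_boot)
    d_u = max(0.0, np.max(lam - p_hat[above])) if above.any() else 0.0
    below = ~above                              # bins outside L(lam; p_boot)
    d_l = max(0.0, np.max(p_hat[below] - lam)) if below.any() else 0.0
    return d_u, d_l

def calibrate(p_hat, boot, alpha=0.05):
    """Quantiles of the per-bootstrap deltas at every unique level of p_hat."""
    q = 1 - alpha / 2
    delta_u, delta_l = {}, {}
    for lam in np.unique(p_hat):
        d = np.array([deltas_for_level(p_hat, pb, lam) for pb in boot])
        delta_u[lam] = np.quantile(d[:, 0], q)
        delta_l[lam] = np.quantile(d[:, 1], q)
    # In the full procedure the individual coverage level would then be
    # increased until all levels are covered jointly for (1 - alpha) of the
    # bootstrap estimates; that simultaneity search is omitted here.
    return delta_u, delta_l

delta_u, delta_l = calibrate(p_hat, boot, alpha=0.05)
```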

Note that the actual upper and lower confidence sets for L(λ; p) are the level sets L(λ - δ^u_λ; p̂) and L(λ + δ^l_λ; p̂) of p̂, respectively; the bootstrap is used only to find δ^u_λ and δ^l_λ.

3.2 Constructing the cluster tree

After finding δ^u_λ and δ^l_λ for all λ, we incorporate the bootstrap confidence sets into the cluster tree construction by only allowing splits at heights λ for which the corresponding bootstrap confidence set (L^l(λ; p̂), L^u(λ; p̂)) gives strong evidence of a split. We use a recursive procedure similar to the one described in Section 2. The root node represents the entire support of p̂ and is associated with density level λ(N) = 0. To determine the daughters of a node, we find the lowest level λ_d for which (a) L^l(λ_d; p̂) ∩ D(N) has two or more connected components that (b) are disconnected in L^u(λ_d; p̂) ∩ D(N). Condition (a) indicates that the underlying density p has two peaks above height λ_d; condition (b) indicates that the two peaks are separated by a valley dipping below height λ_d. Satisfying both conditions indicates a split at height λ_d. If no such λ_d exists, N is a leaf of the tree. Otherwise, let C^l_1 and C^l_2 be two connected components of L^l(λ_d; p̂) ∩ D(N) that are disconnected in L^u(λ_d; p̂) ∩ D(N), and let C^u_1 and C^u_2 be the connected components of L^u(λ_d; p̂) ∩ D(N) that contain C^l_1 and C^l_2, respectively. If there are n = 2 such components, we create two daughter nodes at level λ_d, one for C^u_1 and one for C^u_2, and apply the procedure recursively to each. If n > 2, we create daughter nodes for the two components C^u_1 and C^u_2 ∪ C^u_3 ∪ ... ∪ C^u_n and recurse.

Returning to the example cluster tree (Figure 4a), we examine the split at λ = 0.0036, the first split that breaks the lower left curvilinear group into two clusters. Figure 5 shows the bootstrap confidence set (α = 0.05) for this level set. The original level set L(0.0036; p̂) is shown in Figure 5a, color-coded by subsequent final leaf. The lower confidence set (Figure 5b) is δ^l_λ = 0.0037 higher, i.e. L(0.0073; p̂); the upper confidence set (Figure 5c) is δ^u_λ = 0.0026 lower, i.e. L(0.001; p̂). At λ = 0.0036, the upper confidence set does not have two connected components (no valley). Moreover, even though the lower confidence set does have two connected components, they do not correspond to the two connected components in L(0.0036; p̂). We do not have evidence of a significant split and so do not create daughter nodes at this level.

Figure 5: Bootstrap confidence set for λ = 0.0036: (a) L(0.0036; p̂); (b) lower confidence set, δ^l_λ = 0.0037; (c) upper confidence set, δ^u_λ = 0.0026

Clustering with Confidence (CWC) with α = 0.05 generates the cluster tree and data/feature space partitions seen in Figure 6. The cluster tree's significant splits identify the four original groups (ARI = 1); no other smaller clusters (or modal artifacts) are found. Note that the split heights are higher than the corresponding split heights in the cluster tree in Figure 4: the CWC procedure required stronger evidence for a split than was available at the lower levels. CWC performed similarly to the hierarchical linkage methods and more favorably than k-means or model-based clustering (Table 1).

Figure 6: (a) 95% confidence cluster tree; (b) cluster assignments; (c) partitioned feature space
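The split test in conditions (a) and (b) is easy to express on the grid representation used earlier. The sketch below is my own illustration, reusing the hypothetical `p_hat`, `delta_u`, and `delta_l` objects from the previous sketches; it checks whether a candidate level λ_d provides a significant split within a parent region.

```python
# Sketch: test conditions (a) and (b) for a significant split at level lam_d
# within a parent region (a boolean mask over the bins).
import numpy as np
from scipy.ndimage import label

FACE = np.array([[0, 1, 0],
                 [1, 1, 1],
                 [0, 1, 0]])          # face adjacency (4-connectivity in 2-D)

def significant_split(p_hat, delta_u, delta_l, lam_d, parent):
    """True if the lower confidence set has two or more components (a) that
    remain disconnected in the upper confidence set (b)."""
    lower = (p_hat > lam_d + delta_l[lam_d]) & parent   # L^l(lam_d) within D(N)
    upper = (p_hat > lam_d - delta_u[lam_d]) & parent   # L^u(lam_d) within D(N)
    low_lab, n_low = label(lower, structure=FACE)
    if n_low < 2:
        return False                                    # condition (a) fails
    up_lab, _ = label(upper, structure=FACE)
    # Condition (b): two lower components lie in different upper components.
    owners = {int(up_lab[low_lab == k][0]) for k in range(1, n_low + 1)}
    return len(owners) >= 2

parent = p_hat > 0
candidates = [lam for lam in np.unique(p_hat)
              if significant_split(p_hat, delta_u, delta_l, lam, parent)]
print(len(candidates), "levels show a significant split in the root region")
```

In the full CWC construction, the lowest such λ_d would trigger the creation of daughter nodes, and the test would then be repeated recursively within each daughter region.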

4 Examples

The algorithms presented could be used for any number of dimensions but are more tractable in lower dimensions. For easy visualization of the results, we present two examples using real data in two dimensions; we comment on higher dimensionality in the summary and future work section.

4.1 Identifying different skill sets in a student population

In educational research, a fundamental goal is identifying whether or not students have mastered the skills taught. The available data usually consist of whether or not a student answered questions correctly and which skills the questions require. Cognitive diagnosis models and item response theory models, often used to estimate students' abilities, can have lengthy run times, and estimation instability grows as the number of students, questions, and skills increases. Clustering capability estimates easily derived from students' responses and skill-labeled questions to partition students into groups with different skill sets has been proposed as a quick alternative (Ayers, Nugent, and Dean 2008). A capability vector B_i = (B_i1, B_i2, ..., B_iK) ∈ [0, 1]^K over K skills is found for each student. For skill k, a B_ik near 0 indicates no mastery, near 1 indicates mastery, and near 0.5 indicates partial mastery or uncertainty. A cluster center is viewed as a skill set capability profile for a group of students.

Figure 7: (a) Cluster tree; (b) cluster assignments

Figure 7b shows the (jittered) estimated capabilities for 551 students on two skills from an online mathematics tutor (Assistment system; Heffernan, Koedinger, and Junker 2001). The BKDE cross-validated bandwidth was 0.032. The corresponding cluster tree (Figure 7a) has 10 different skill set profiles (clusters); Figure 7b shows the subsequent partition. Note that the two highest splits in the tree separate the clusters in the top half of the feature space; these students have partial to complete mastery of one skill and are separated by their mastery level of the other. The more well-separated skill set profiles are identified earlier.
Figure 8: (a) K-means assignment (k = 14); (b) model-based clustering assignment (k = 11); (c) complete linkage dendrogram; (d) complete linkage cluster assignment (k = 4)

Figure 8 shows selected results from other clustering procedures. The total within-cluster sum of squares criterion suggested a k-means solution with 14 or 15 clusters; Figure 8a illustrates its propensity to partition the data into small, tight spheres (perhaps overly so). Given a search range of two to 18 clusters with no constraints, model-based clustering chose an eleven-cluster solution (Figure 8b). The strips are divided into cohesive sections of the feature space; the uncertain mastery students are clustered together. Note that model-based clustering identified a very loose cluster of students with poor mastery of one skill and mastery of the other ranging from 0.2 to 1, overlapping a tighter cluster of students with a value near 0.5. This skill profile is a casualty of the assumption that each mixture component in the density estimate corresponds to a group in the population. However, overall the results seem to give more natural skill profile groupings than k-means. All hierarchical linkage methods yielded dendrograms without obvious cluster solutions; Figures 8c,d show an example of a four-cluster complete linkage solution. Students with different mastery levels are combined; in this respect, the results are unsatisfying.
Developing 10 different instruction methods for a group of students is unreasonable in practice. We use CWC to construct a cluster tree for α = 0.90 to identify skill set profiles in which we are at least 10% confident.

Figure 9: (a) 10% confidence cluster tree; (b) cluster assignments

Figure 9 shows the resulting cluster tree and assignments. At 10% confidence, we have only three skill set profiles, each identified only by the mastery level of one of the skills. This may be a more reasonable partition given the less clear separation along the other skill. Note that there are instances in which nearby fluff observations are assigned to different cores, a result of the tie-breaker.

4.2 Automatic Gating in Flow Cytometry

Flow cytometry is a technique for examining and sorting microscopic particles. Fluorescent tags are attached to mRNA molecules in a population of cells, which are passed in front of a single wavelength laser; the level of fluorescence in each cell (corresponding, for example, to the level of gene expression) is recorded. We might be interested in discovering groups of cells that have high fluorescence levels on multiple channels, or groups of cells that have different levels across channels. A common identification method is gating, or subgroup extraction from two-dimensional plots of measurements on two channels. Most commonly, these subgroups are identified subjectively by eyeballing the graphs. Clustering techniques would allow for more statistically motivated subgroup identification (see Lo, Brinkman, and Gottardo 2008 for one proposed method).

Figure 10a shows 1545 flow cytometry measurements on two fluorescence markers applied to Rituximab, a therapeutic monoclonal antibody, in a drug-screening project designed to identify agents to enhance its anti-lymphoma activity (Gottardo and Lo 2008). Cells were stained, following culture, with the agent anti-BrdU and the DNA binding dye 7-AAD. We construct a cluster tree (cross-validated BKDE bandwidth of 21.834) and plot the cluster assignments, indicating whether each observation is fluff or part of a cluster core (Figures 10b,c). The cluster tree has 13 leaves (8 clusters, 5 modal artifacts). The cores are located in the high frequency areas, and the fluff is appropriately assigned. The sizes of the cores give some evidence as to their eventual significance. For example, the core near (400, 700) consists of one observation; we would not expect the corresponding cluster to remain in the confidence cluster tree for any reasonable α.

Figure 10: (a) Flow cytometry measurements on the two fluorescent markers anti-BrdU and 7-AAD; (b) cluster tree with 13 leaves (8 clusters, 5 artifacts); (c) cluster assignments, with fluff marked by x

Figure 11: (a) K-means assignment (k = 10); (b) model-based clustering assignment (k = 7)

Figure 11a shows a ten-cluster k-means solution chosen as before. While many of the smaller clusters are found by both k-means and the cluster tree, k-means splits the large cluster in Figure 10c into several smaller clusters. Model-based clustering chooses a seven-cluster solution (Figure 11b). We again see overlapping clusters in the lower left quadrant; this solution would not be viable in this application. Linkage methods again provided unclear solutions.

We use CWC to construct a confidence cluster tree for α = 0.10; we are at least 90% confident in the generated clusters (Figure 12). All modal artifacts have been removed, and the smaller clusters have merged into two larger clusters with cores at (200, 200) and (700, 300). Note that the right cluster is a combination of the two high 7-AAD/low anti-BrdU clusters from Figure 10c; CWC did not find enough evidence to warrant splitting this larger cluster into subgroups.

Figure 12: (b) 90% confidence cluster tree with two leaves; (c) cluster assignments, with fluff marked by x

5 Summary and future work

We have presented a plug-in method for estimating the cluster tree of a density that takes advantage of the ability to exactly compute the level sets of a piecewise constant density estimate. The approach shows flexibility in finding clusters of unequal sizes and shapes. However, the cluster tree depends on the (inherently noisy) density estimate. We introduced clustering with confidence, an automatic pruning procedure that assesses the significance of splits in the cluster tree; the only input needed is the desired confidence level. These procedures may become computationally intractable as the number of adjacent bins grows with the dimension and are realistically intended for use in lower dimensions. One high-dimensional approach would be to employ projection or dimension reduction techniques prior to cluster tree estimation. We have also developed a graph-based approach that approximates the cluster tree in high dimensions; Clustering with Confidence could then be applied to the resulting graph to identify significant clusters.

References

Ayers, E., Nugent, R., and Dean, N. (2008) Skill Set Profile Clustering Based on Student Capability Vectors Computed From Online Tutoring Data. Proceedings of the International Conference on Educational Data Mining (peer-reviewed), to appear.

Buja, A. (2002) Personal communication. See also Buja, A. and Rolke, W., Calibration for Simultaneity: (Re)Sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data, in revision.

Cuevas, A., Febrero, M., and Fraiman, R. (2000) Estimating the Number of Clusters. The Canadian Journal of Statistics, 28:367-382.

Cuevas, A., Febrero, M., and Fraiman, R. (2001) Cluster Analysis: A Further Approach Based on Density Estimation. Computational Statistics & Data Analysis, 36:441-459.

Fraley, C. and Raftery, A. (1998) How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. The Computer Journal, 41:578-588.

Gottardo, R. and Lo, K. (2008) flowClust Bioconductor package.

Hall, P. and Wand, M. P. (1996) On the Accuracy of Binned Kernel Density Estimators. Journal of Multivariate Analysis, 56:165-184.

Hartigan, J. A. (1975) Clustering Algorithms. Wiley.

Hartigan, J. A. (1981) Consistency of Single Linkage for High-Density Clusters. Journal of the American Statistical Association, 76:388-394.

Hartigan, J. A. (1985) Statistical Theory in Clustering. Journal of Classification, 2:63-76.

Hartigan, J. A. and Wong, M. A. (1979) A K-Means Clustering Algorithm. Applied Statistics, 28:100-108.

Heffernan, N. T., Koedinger, K. R., and Junker, B. W. (2001) Using Web-Based Cognitive Assessment Systems for Predicting Student Performance on State Exams. Research proposal to the Institute of Educational Statistics, US Department of Education. Department of Computer Science, Worcester Polytechnic Institute, Massachusetts.
Hubert, L. and Arabie, P. (1985) Comparing Partitions. Journal of Classification, 2:193-218.

Klemelä, J. (2004) Visualization of Multivariate Density Estimates with Level Set Trees. Journal of Computational and Graphical Statistics, 13:599-620.

Lo, K., Brinkman, R., and Gottardo, R. (2008) Automated Gating of Flow Cytometry Data via Robust Model-Based Clustering. Cytometry, Part A, 73A:321-332.

Mardia, K., Kent, J. T., and Bibby, J. M. (1979) Multivariate Analysis. Academic Press.

McLachlan, G. J. and Basford, K. E. (1988) Mixture Models: Inference and Applications to Clustering. Marcel Dekker.

Sedgewick, R. (2002) Algorithms in C, Part 5: Graph Algorithms, 3rd Ed. Addison-Wesley.

Stuetzle, W. (2003) Estimating the Cluster Tree of a Density by Analyzing the Minimal Spanning Tree of a Sample. Journal of Classification, 20:25-47.

Walther, G. (1997) Granulometric Smoothing. The Annals of Statistics, 25:2273-2299.

Wand, M. P. and Jones, M. C. (1995) Kernel Smoothing. Chapman & Hall.

Wishart, D. (1969) Mode Analysis: A Generalization of Nearest Neighbor which Reduces Chaining Effects. In Numerical Taxonomy, ed. A. J. Cole, Academic Press, 282-311.