Weighted and Continuous Clustering

John Burkardt (ARC/ICAM), Virginia Tech, Math/CS 4414
http://people.sc.fsu.edu/~jburkardt/presentations/clustering_weighted.pdf
ARC: Advanced Research Computing. ICAM: Interdisciplinary Center for Applied Mathematics.
25 September 2009

The K-Means algorithm is a simple and very useful tool for making an initial clustering of data. It does have the drawback that it often stops at an answer that isn't the best one, but this can be detected and corrected by running the program from several different starting points. Because it can give some information about almost any set of data, researchers have tried to extend it to handle new classes of problems for which good clusterings are hard to find. As an example, online image repositories may have thousands of images classified as images of an eagle, and it is important to be able to return a few typical images.
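Restarting is easy to script. As a minimal sketch (not from the original slides), using the wkm and wkmeans_variance routines shown later in this talk, with unit weights so the runs are equivalent to plain K-Means:

    % Run K-Means from several random initializations and keep the
    % clustering with the smallest variance.  Assumes the data p (dim x n)
    % and the cluster count k are already defined.
    w = ones ( 1, n );                          % unit weights
    best_v = Inf;
    for trial = 1 : 10
      [ c, ptoc ] = wkm ( dim, n, p, w, k );    % random start each time
      v = wkmeans_variance ( dim, n, p, w, k, c, ptoc );
      if ( v < best_v )
        best_v = v;
        best_c = c;
        best_ptoc = ptoc;
      end
    end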

An Extension of K-Means

Today we will consider some extensions of the K-Means algorithm that are not so ambitious, but which let us consider new classes of problems. If we know that some data has greater importance, we would like to be able to include that information, so that the clustering can take it into account. When you cluster hundreds of points in a region, the points are organized into polygonal subregions. Can we just go ahead and cluster all the points in, say, the unit square? Instead of a discrete set, we would have a continuous set of points to consider.

Outline:
- Weighted K-Means
- A Weighted K-Means Algorithm
- K-Means for Continuous Data
- Weighted K-Means for Continuous Data

Weighted K-Means: The Centroid
Suppose that we want to cluster a set of data points X, but now we want to be able to stipulate that some points are more important than others. We might encounter this problem if we are deciding where to put the capital city of a new state, or the hubs of an airline, or a brand new set of elementary schools in a city. In each of these cases, we are trying to choose convenient centers, but now it's not just the geographic location that we have to consider, but also how many people, or average flights, or school children will have to travel from the data points to the cluster centers.

Weighted K-Means: Using Weights
The natural way to record the varying importance of the different points is to assign a weight to each one. Our original problem can then be thought of as the simple case where each point was given the same weight. With the addition of weights, the K-Means algorithm can handle situations in which some data points are so important that they pull one of the cluster centers toward them, almost as if by gravitational attraction.

Weighted K-Means: The Centroid
In the unweighted case, a cluster center $c_j$ was the centroid or average of the coordinates of all the data points $x_i$ in the cluster:

$$c_j = \frac{\sum_{x_i \in c_j} x_i}{\sum_{x_i \in c_j} 1}$$

The centroid is a geometric quantity whose location can be determined from just a picture of the points. If we imagine the points being connected to the centroid, and having equal weight, then this object is perfectly balanced around the centroid. No matter how we turn it, it will be balanced.
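As a tiny worked example (not from the original slides), the centroid of three points in MATLAB:

    % The centroid of the points (0,0), (1,0) and (0,3).
    p = [ 0 1 0 ;
          0 0 3 ];                        % each column is a point x_i
    c = sum ( p, 2 ) / size ( p, 2 )      % centroid = (1/3, 1)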

Weighted K-Means: The Centroid

Weighted K-Means: The Cluster Variance
For unweighted clustering, we defined the cluster variance as:

$$\mathrm{var}(x,c) = \sum_{j=1}^{k} \mathrm{var}(x,c_j) = \sum_{j=1}^{k} \frac{\sum_{x_i \in c_j} \|x_i - c_j\|^2}{\sum_{x_i \in c_j} 1}$$

where $c_j$ is the cluster center. We noted that each step of the K-Means algorithm reduces the variance, and that the best clustering has the minimum value of the variance.

Weighted K-Means: The Center of Mass
When point $p_i$ has weight $w_i$, the formula for the center of mass is:

$$c_j = \frac{\sum_{x_i \in c_j} w_i x_i}{\sum_{x_i \in c_j} w_i}$$

and cluster center $c_j$ is the center of mass of its cluster. The location of the center of mass can be anywhere within the convex hull of the set of data points. Similarly, if each point $p_i$ is connected to the center of mass, and given weight $w_i$, this object will also be perfectly balanced.
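The same three points as before, now weighted (again a sketch, not from the slides): giving the point (0,3) four times the weight pulls the center of mass toward it:

    p = [ 0 1 0 ;
          0 0 3 ];                        % each column is a point
    w = [ 1 1 4 ];                        % the third point is 4 times as heavy
    c = ( p * w' ) / sum ( w )            % center of mass = (1/6, 2)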

Weighted K-Means: The Center of Mass

Weighted K-Means: Center of Mass Minimization
The center of mass of a set of weighted points minimizes the weighted variance. For one cluster with data points $x$ and weights $w$, the variance with respect to an arbitrary point $c$ is

$$\mathrm{var}(x,w,c) = \frac{\sum_i w_i \|x_i - c\|^2}{\sum_i w_i}$$

and this is minimized if we take $c$ to be the center of mass. Proof: solve $\frac{\partial\,\mathrm{var}(x,w,c)}{\partial c} = 0$.
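Filling in that one-line proof (a standard calculation, not spelled out on the slide): setting the derivative of the numerator to zero,

$$\frac{\partial}{\partial c} \sum_i w_i \|x_i - c\|^2 = -2 \sum_i w_i (x_i - c) = 0 \quad\Longrightarrow\quad c = \frac{\sum_i w_i x_i}{\sum_i w_i},$$

which is exactly the center of mass.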

Weighted K-Means: Weighted Cluster Variance
For weighted clustering, we define the weighted cluster variance as:

$$\mathrm{var}(x,w,c) = \sum_{j=1}^{k} \mathrm{var}(x,w,c_j) = \sum_{j=1}^{k} \frac{\sum_{x_i \in c_j} w_i \|x_i - c_j\|^2}{\sum_{x_i \in c_j} w_i}$$

where $c_j$ is the cluster center. Each step of the weighted K-Means algorithm reduces the weighted cluster variance, and the best clustering minimizes this quantity.

Weighted K-Means A Weighted K-Means Algorithm K-Means for Continuous Data Weighted K-Means for Continuous Data

WKMeans Algorithm
The weighted K-Means algorithm is very similar to the K-Means algorithm. Changes occur mainly in two places: when we update the centers, we use centers of mass instead of centroids; and when we measure the clustering, we compute a sum of weighted variances.

WKMeans Algorithm: Main Program

    function [ c, ptoc ] = wkm ( dim, n, p, w, k )
    %% WKM carries out the weighted K-Means algorithm.
    %
    %  Initialize the cluster centers.
    %
      c = kmeans_initialize ( dim, n, p, k );
    %
    %  Repeatedly update clusters and centers until no change.
    %
      v = -1;
      while ( 1 )
        ptoc = kmeans_update_clusters ( dim, n, p, k, c );
        c = wkmeans_update_centers ( dim, n, p, w, k, ptoc );
        v_old = v;
        v = wkmeans_variance ( dim, n, p, w, k, c, ptoc );
        if ( v == v_old )
          break
        end
      end
      return
    end

WKMeans Algorithm: Update the Centers

    function c = wkmeans_update_centers ( dim, n, p, w, k, ptoc )
    %% WKMEANS_UPDATE_CENTERS resets the cluster centers to the weighted data averages.
    %
    %  Scale each point by its weight (w is a 1 x n row vector; the
    %  elementwise product relies on MATLAB implicit expansion).
    %
      wp(1:dim,1:n) = p(1:dim,1:n) .* w(1:n);
      for j = 1 : k
        index = find ( ptoc(1:n) == j );
        c(1:dim,j) = sum ( wp(1:dim,index), 2 ) / sum ( w(index) );
      end
      return
    end

WKMeans Algorithm: Compute the Variance

    function v = wkmeans_variance ( dim, n, p, w, k, c, ptoc )
    %% WKMEANS_VARIANCE computes the variance of the weighted K-Means clustering.
    %
    %  Displacement of each point from its cluster center.
    %
      pmc(1:dim,1:n) = p(1:dim,1:n) - c(1:dim,ptoc(1:n));
      v = 0.0;
      for j = 1 : k
        index = find ( ptoc(1:n) == j );
        wpmc = w(index) .* sum ( pmc(1:dim,index).^2, 1 );   % w_i * ||p_i - c_j||^2
        v = v + sum ( wpmc ) / sum ( w(index) );
      end
      return
    end
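A small driver (a sketch, not from the slides) showing how wkm might be called; kmeans_initialize and kmeans_update_clusters are the unweighted routines from the earlier lecture and are assumed to be available:

    dim = 2; n = 100; k = 4;
    p = rand ( dim, n );                  % random points in the unit square
    w = ones ( 1, n );
    w(1) = 50.0;                          % make one point very important
    [ c, ptoc ] = wkm ( dim, n, p, w, k );
    % The heavy point p(:,1) should pull one cluster center toward itself.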

Outline:
- Weighted K-Means
- A Weighted K-Means Algorithm
- K-Means for Continuous Data
- Weighted K-Means for Continuous Data

Continuous Data
So far, our data set X has been a set of N objects in D-dimensional space. Whether or not the objects are actually geometric, we can think of them as points in that space. It's natural to ask whether it's essential that X be a finite set; in particular, we'd be interested in knowing whether we can cluster a set X that comprises all the points in a geometric region, perhaps a square, or a circle, or perhaps the surface of a sphere. It may seem like an awfully big jump, but the idea is to take what we have done with finite sets and see how far we can extend it.

Continuous Data
Ignore for a moment the question of how we can do the clustering. Instead, let's ask: what would be the result of clustering points in a square? We've already seen that clustering produces a set of special center points C with the property that each data point is close to one of the centers. This idea makes perfectly good sense for points in a square.

Continuous Data
So how does our algorithm begin? We first have to initialize C with some random values. Well, that's easy if the region is a square. For other regions, we would clearly need to be able to produce K random points that lie in the region. So keep in mind that we have to be able to sample the region.
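For illustration (a sketch, not from the slides): uniform samples in the unit square are one line of MATLAB, and an irregular region can be sampled by rejection from a bounding box:

    % Uniform samples in the unit square:
    s = rand ( 2, 1000 );
    % Rejection sampling for an irregular region, here the unit disk:
    % sample the bounding square [-1,1]^2 and keep the points inside.
    s = 2 * rand ( 2, 4000 ) - 1;
    inside = ( sum ( s.^2, 1 ) <= 1.0 );   % logical mask for the disk
    s = s(:,inside);                       % about pi/4 of the points survive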

Continuous Data: Initial Centers

Continuous Data - Can't List the Cluster Assignments
Our next step was to assign each data point to the nearest $c_j$ in C. In the discrete case, we recorded this information in a vector ptoc, and used that information several times. But if X is infinite, we can't store the cluster map in a vector. So no matter how many times we want to know which cluster some point $x_i$ belongs to, we have to figure the answer out from scratch. So keep in mind that it will be expensive to determine cluster membership, and that we can't save the results for later reference.
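Determining membership from scratch is just a nearest-center search; a vectorized sketch (not from the slides, and relying on MATLAB implicit expansion):

    % Given sample points s (dim x m) and centers c (dim x k), find the
    % nearest center for each sample point.
    m = size ( s, 2 );
    k = size ( c, 2 );
    d = zeros ( k, m );
    for j = 1 : k
      d(j,:) = sum ( ( s - c(:,j) ).^2, 1 );   % squared distances to center j
    end
    [ ~, stoc ] = min ( d, [], 1 );            % stoc(i) = nearest center to s(:,i)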

Continuous Data - Update Centers by Averaging
Next we averaged the $x_i$ belonging to each $c_j$:

$$c_j = \frac{\sum_{x_i \in c_j} x_i}{\sum_{x_i \in c_j} 1}$$

I wrote the discrete average that way to suggest how we can replace sums by integrals:

$$c_j = \frac{\int_{x \in c_j} x \, dx}{\int_{x \in c_j} 1 \, dx}$$

This calculation is harder than it looks, because we are integrating over an irregular region. (We will come back to this point!)

Continuous Clusters - Example of Centers and Clusters

Continuous Data - The Variance
Finally, knowing the new $c_j$, we computed the variance:

$$\mathrm{var}(x,c) = \sum_{j=1}^{k} \mathrm{var}(x,c_j) = \sum_{j=1}^{k} \frac{\sum_{x_i \in c_j} \|x_i - c_j\|^2}{\sum_{x_i \in c_j} 1}$$

This becomes:

$$\mathrm{var}(x,c) = \sum_{j=1}^{k} \mathrm{var}(x,c_j) = \sum_{j=1}^{k} \frac{\int_{x \in c_j} \|x - c_j\|^2 \, dx}{\int_{x \in c_j} 1 \, dx}$$

Continuous Data - Time for Details
Now, what we have done is to think about our new problem with a continuous X and decide that everything we did in the discrete case can pretty much be copied or slightly adjusted, and the next thing you know we'll be solving these problems too! But it's time for some more careful analysis. In particular, when we replaced sums by integrals, we went from a simple task to a very hard one!

Continuous Data - Approximating an Integral
To focus attention, let's ask how we could compute an integral like

$$\int_{x \in c_j} 1 \, dx$$

which represents the area (or volume, or hypervolume) of the points that belong to the cluster centered at $c_j$. If we were working in a 2D region, this cluster would probably be a convex polygon, but we have no idea how many sides it has or where the vertices are! If we are willing to live with an estimate of this value, then there is a reasonable approach that will work for any shape, and any number of dimensions, called Monte Carlo sampling.

Continuous Data - Approximating an Integral
The integral

$$\int_{x \in c_j} 1 \, dx$$

represents, in the 2D case, the area of the cluster. Let's suppose we're working in the unit square. The cluster associated with $c_j$ is a polygonal portion of the unit square. So we compute a large number S of sample points (we said we would have to know how to do that). We count the number of sample points that lie in $c_j$'s cluster to estimate the relative area of that cluster. Since the total area is 1 in this case, we are actually estimating the area itself. Using this same idea, we can estimate the area of each cluster.
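Here is the Monte Carlo area estimate as a MATLAB sketch (not from the slides), given centers c (2 x k) in the unit square:

    S = 10000;
    s = rand ( 2, S );                         % uniform samples in the square
    k = size ( c, 2 );
    d = zeros ( k, S );
    for j = 1 : k
      d(j,:) = sum ( ( s - c(:,j) ).^2, 1 );
    end
    [ ~, stoc ] = min ( d, [], 1 );            % cluster index of each sample
    area = zeros ( 1, k );
    for j = 1 : k
      area(j) = sum ( stoc == j ) / S;         % sample fraction = area, since
    end                                        % the square has total area 1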

Continuous Data - Approximating an Integral
OK, but how do we approximate integrals like

$$\int_{x \in c_j} \|x - c_j\|^2 \, dx$$

which we need for the variance? Again, we can use the same set of S sample points. Each sample point $s_i$ that lies in $c_j$'s cluster will contribute to our estimate. The integral estimate is simply

$$\int_{x \in c_j} \|x - c_j\|^2 \, dx \approx \mathrm{Area}(c_j) \cdot \frac{\sum_{s_i \in c_j} \|s_i - c_j\|^2}{\sum_{s_i \in c_j} 1}$$
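Continuing the sketch, the estimate of cluster j's integral uses the same samples and the areas just computed (variable names carried over from the snippet above):

    index = find ( stoc == j );
    sq = sum ( ( s(:,index) - c(:,j) ).^2, 1 );       % ||s_i - c_j||^2
    integral_j = area(j) * sum ( sq ) / length ( index );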

Continuous Data: A CKMeans Program
Our MATLAB program for this problem is similar to the others, except that we can't compute the ptoc array. Instead, we rename our current centers c_old, we use sampling to find all the points nearest each center, and average those to get the new set of centers c. Then we estimate the variance to see if we should keep working. These calculations are all estimated using sampling, so we may find that increasing the size of the sampling set S will improve our answers.

Continuous Data - Main Program

    function c = ckm ( dim, n, p, w, k )
    %% CKM carries out the continuous K-Means algorithm.
    %
    %  Initialize the cluster centers.
    %
      c = ckmeans_initialize ( dim, n, p, k );
    %
    %  Repeatedly update clusters and centers until no change.
    %
      v = -1;
      while ( 1 )
        c_old = c;
        c = ckmeans_update_centers ( dim, n, p, w, k, c_old );
        v_old = v;
        v = ckmeans_variance ( dim, n, p, w, k, c );
        if ( v == v_old )
          break
        end
      end
      return
    end
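The slides do not show ckmeans_update_centers; based on the sampling description, one plausible sketch (the helper's name and details are assumptions) is:

    function c = ckmeans_update_centers ( dim, n, p, w, k, c_old )
    %% Sketch: move each center to the (weighted) average of the sample
    %  points nearest it.  Here p is a dim x n array of fresh samples
    %  drawn from the region, and w holds their weights.
      d = zeros ( k, n );
      for j = 1 : k
        d(j,:) = sum ( ( p - c_old(:,j) ).^2, 1 );
      end
      [ ~, stoc ] = min ( d, [], 1 );
      c = c_old;
      for j = 1 : k
        index = find ( stoc == j );
        if ( ~isempty ( index ) )
          c(:,j) = ( p(:,index) * w(index)' ) / sum ( w(index) );
        end
      end
      return
    end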

Continuous Data - Main Program Despite all the approximation that goes on, the results can have some surprising structure. We only talked about the unit square, but you can do this kind of process on any region for which you can produce sample points. Often, an engineer or scientist needs to break down a region into simple subregions for analysis. This method gives you polygonal regions (in 2D). The subdivision, which assigns each point to the nearest center, is known as the Voronoi diagram.

Continuous Data: On a Square

Continuous Data: In a Triangle

Continuous Data: In an Ellipse

Continuous Data: Delaunay Triangulation Often you want your original region to be broken up into triangular subregions. This is especially true in engineering problems which use finite elements, and in graphics applications which expect a regular mesh of triangles. Because our continuous K-Means algorithm is giving us a Voronoi Diagram, we can draw lines that connect the centers to create the dual graph, which is known as the Delaunay triangulation. This will have the correct structure.
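MATLAB's built-in delaunay and voronoi functions can display both structures for a given set of centers; a quick sketch (not from the slides):

    % Given centers c (2 x k), draw the Delaunay triangulation and the
    % Voronoi diagram it is dual to.
    x = c(1,:);
    y = c(2,:);
    tri = delaunay ( x, y );        % triangle list, one row per triangle
    triplot ( tri, x, y );          % draw the Delaunay triangulation
    hold on
    voronoi ( x, y );               % overlay the Voronoi diagram
    hold off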

Continuous Data: Voronoi/Delaunay Step 1

Continuous Data: Voronoi/Delaunay Step 5

Continuous Data: Voronoi/Delaunay Step 50

Outline:
- Weighted K-Means
- A Weighted K-Means Algorithm
- K-Means for Continuous Data
- Weighted K-Means for Continuous Data

Weighted Continuous Data
Just as in the discrete case, it is often true that some parts of the data are much more important than others. If an engineer is using the continuous K-Means algorithm to form a mesh of an aircraft, it may be important that there be lots of points near places where cracks are likely to form, such as where the wings connect to the body. It is easy to add weights to the continuous case. Instead of a weight vector, we have a weight function w(x), which shows up in the integrals.
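One common way to make w(x) show up in the sampling is to draw sample points with density proportional to w(x) by rejection. A sketch (not from the slides), using a hypothetical weight function peaked at the center of the unit square:

    % Rejection sampling in the unit square, density proportional to w(x).
    wfun = @(x) 1 + 20 * exp ( -20 * sum ( ( x - 0.5 ).^2, 1 ) );
    wmax = 21.0;                       % upper bound for wfun on the square
    S = 5000;
    s = zeros ( 2, 0 );
    while ( size ( s, 2 ) < S )
      cand = rand ( 2, S );                           % candidate points
      keep = ( rand ( 1, S ) * wmax < wfun ( cand ) );
      s = [ s, cand(:,keep) ];                        % accept with prob w/wmax
    end
    s = s(:,1:S);                      % samples concentrate where w is large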

Weighted Continuous Data: A Circle

Weighted Continuous Data: A Square, Step 1

Weighted Continuous Data: A Square, Step 5

Weighted Continuous Data: A Square, Step 50

Conclusion
The K-Means algorithm can be extended to use weights, and it can be extended to the case where the objects we are grouping are actually regions of space. Weighting allows for a more sophisticated grouping, which takes into account that some things are more important, or more valuable, or carry more information. Clustering continuous regions of space is only possible when we use approximations. However, it enables us to achieve regular subdivisions of regions even when those regions have unusual shapes, and to deal with weight functions.