Finding Holes in Data


Leonid Pekelis
Stat 208
June 18th, 2009

Abstract

Attempts are made to employ persistent homology to infer topological properties of point cloud data. Two potential approaches are introduced and briefly evaluated, both relying on Monte Carlo estimation and the program PLEX [9]. In the first, homology signatures are compared to a null distribution that exhibits complete spatial randomness. This can be regarded as similar to hypothesis testing with the bootstrap. The second applies permutation testing to landmark selection.

1 Intro

Topological Statistics is a relatively new and exciting field that lies in the intersection of Mathematics (Algebraic Topology) and Statistics. The main point is that methods from geometry and topology attempt to make precise the notion that concepts such as measures and coordinate systems should be intrinsic to the data. There is no a priori reason that the Euclidean distance function and $\mathbb{R}^n$ are the best choices. Although the mathematical theory behind Topological Statistics has a solid foundation (see [2] for example), the statistical application has not reached the same level of rigor. This paper employs the bootstrap in order to study the statistics of some topological methods used to study data sets.

In order to present a complete argument, we briefly define some terminology and explain the general procedures behind estimating persistent homology of data. Much of what follows is taken from some combination of [2], [3], and [11], and contains no proofs whatsoever.

Let $\mathcal{X}$ be a subset of $\mathbb{R}^N$, the space that our data lives in, and let $X \subset \mathcal{X}$ be the (assumed random) sample of $\mathcal{X}$ that we observe. The purpose of persistent homology is to estimate the topological structure of $\mathcal{X}$ by using $X$ to build a simplicial complex, $S$, that is a close approximation to $\mathcal{X}$ itself. The simplicial complex is defined by three pieces of information:

- A vertex set $Z \subseteq X$, with fixed but arbitrary ordering $\{z_0, \ldots, z_m\}$.
- A rule that specifies when a $p$-simplex $\sigma = [z_0, \ldots, z_p]$ belongs to $S$.
- The recursive rule that if we define the faces of $\sigma$ by the set of $(p-1)$-simplexes $U = \{[z_{\pi(0)}, \ldots, z_{\pi(p-1)}] : \pi \text{ is any permutation of } 0, \ldots, p\}$, then $U \subseteq S$.

Note that a $p$-simplex is defined as the convex hull of a set of $p + 1$ independent points, or the $p$-dimensional analogue of a triangle. Standard examples of simplicial complexes include Čech, Rips and α-shape complexes. Generally, the vertex set is chosen to be $Z = X$.
The problem is that the resulting simplicial complex quickly becomes too unwieldy for computation. For example, it is not uncommon for a Rips complex to have 10 times the number of points in the sample. The witness complex, defined next, instead operates by choosing a small set $L \subset X$ of landmark points (usually th to 20th of the sample) as the vertex set, while the remaining unchosen points act as witnesses in determining the simplexes forming $S$. We define the lazy witness complex, $W_t(X, \nu)$, as follows:

- The vertex set is $L$.
- For each $x \in X$, define $m(x)$ to be 0 for $\nu = 0$, and the distance from $x$ to the $\nu$th closest landmark, $l \in L$, for $\nu > 0$.
- The edge, or line, $[ab]$ is in $W_t(X, \nu)$ if there exists a witness $x \in X$ such that $\max\{\lVert a - x \rVert, \lVert b - x \rVert\} \le t + m(x)$.
- A higher dimensional simplex is in $W_t(X, \nu)$ if and only if all of its edges are.

There does not seem to be much theory motivating this definition, but de Silva and Carlsson (2004) note that for $\nu = 1$, $W_t(X, \nu)$ can be linked to Voronoi-like regions surrounding each point. For the remainder of the paper we take $\nu = 1$, and drop both the parameters $\nu$ and $X$, except where the sample taken is not obvious.

At the heart of topological statistics rests the fact that the witness complex, $W_t$, can be characterized by a sequence of Betti numbers, $b(W_t; t) = (b_0, b_1, \ldots)$, for all $t \ge 0$. Each integer, $b_k$, corresponds to the number of $k$-dimensional holes found in $W_t$. For example, $b_1$ corresponds to the number of 1-dimensional holes, or the number of closed curves (or loops) that cannot be deformed into a single point without tearing. Analogously, $b_2$ counts the closed 2-dimensional surfaces on the simplicial complex that cannot be deformed to a point. The number $b_0$ denotes the number of connected components, and can be seen as analogous to the above descriptions when considering line segments in $\mathbb{R}^1$. Examples of Betti sequences include $(1, 0, 0, \ldots)$ for a single point, $(1, 2, 1, 0, 0, \ldots)$ for a torus, and $(1, 0, 1, 0, 0, \ldots)$ for a sphere.

Furthermore, these sequences are usually not constant in $t$ for the general simplicial complex $W_t$. We do not go into details, but it should be intuitive from the definition of $W_t$ that many disconnected components should exist for small $t$, and that they will most likely merge to form bigger, but less numerous, connected components as $t$ increases. At the same time, other features may appear only with larger $t$ values.
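The edge rule of the lazy witness complex defined earlier in this section can be sketched in a few lines. This is only an illustration under stated assumptions, not the PLEX implementation: the function name `lazy_witness_edges`, the use of Euclidean distance, and the representation of landmarks as indices into the sample are all choices made here for the example.

```python
import math
from itertools import combinations

def lazy_witness_edges(points, landmark_idx, t, nu=1):
    """Return the edges of W_t(X, nu) as pairs of positions in landmark_idx.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    landmarks = [points[i] for i in landmark_idx]
    edges = set()
    for x in points:  # every sample point may act as a witness
        d = [math.dist(x, l) for l in landmarks]
        # m(x): 0 for nu = 0, else distance to the nu-th closest landmark
        m = 0.0 if nu == 0 else sorted(d)[nu - 1]
        # max(|a - x|, |b - x|) <= t + m(x) holds exactly when both
        # landmarks a and b lie within t + m(x) of the witness x
        close = [j for j, dj in enumerate(d) if dj <= t + m]
        edges.update(combinations(close, 2))
    return edges
```

Because higher dimensional simplexes are included whenever all of their edges are, the edge set alone determines the whole complex at a given $t$.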
Thus, the persistent homology of $W_t$ can be defined by a series of intervals $[t, t']_k$ that dictate when a $k$-dimensional hole opens in the complex, and when it closes again. Combining such intervals for all holes observed produces a bar code of intervals. See Figure 5 in the appendix for an example barcode. The map back from bar codes to Betti sequences is formed by selecting any value of $t$ on the x-axis. For each $k$, the number of $k$-dimensional intervals that contain $t$ is $b_k$, and together these counts give the Betti sequence $b(W_t; t)$. In this sense, $t$ can be thought of as an index of time, even though it really denotes a distance threshold in selecting simplexes.
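As a small illustration of the map from a bar code back to a Betti sequence, the sketch below counts the intervals that contain a chosen $t$. The tuple layout `(k, birth, death)`, the half-open convention `birth <= t < death`, and the toy barcode are assumptions made for this example, not output from PLEX.

```python
def betti_numbers(barcode, t):
    """Betti sequence at threshold t from a list of (k, birth, death) intervals.

    Assumes half-open intervals [birth, death); death may be float('inf').
    """
    counts = {}
    for k, birth, death in barcode:
        if birth <= t < death:
            counts[k] = counts.get(k, 0) + 1
    top = max((k for k, _, _ in barcode), default=-1)
    return [counts.get(k, 0) for k in range(top + 1)]

# Hypothetical barcode: three components that merge, and one loop.
toy_barcode = [
    (0, 0.0, float("inf")),
    (0, 0.0, 0.3),
    (0, 0.0, 0.5),
    (1, 0.6, 1.4),
]
```

With this toy barcode, `betti_numbers(toy_barcode, 1.0)` gives one connected component and one loop, the signature of a circle.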

The main assumption is that if a certain interval persists for a long time, it can be regarded as a genuine hole in the data, while shorter intervals are noise, either from the sample itself or introduced by the homology algorithm. This paper attempts some headway into quantifying "a long time". We propose, and evaluate, two potential approaches. The first borrows strength from spatial statistics and compares barcodes observed in a sample to a distribution of barcodes obtained from a process exhibiting no topological structure in $\mathcal{X}$. The second applies permutation testing to landmark selection.

Both methods use Monte Carlo sampling to approximate underlying distributions. It should be noted that such sampling techniques are practically indispensable at this stage, since hardly any statistical theory behind persistent homology has been established. Furthermore, both approaches rest on assumptions that are reasonable a priori. When testing for topological structure in $\mathcal{X}$, our null hypothesis should be its absence:

$H_0$: $\mathcal{X}$ exhibits no topological structure.

Although it is not entirely obvious what this means, we discuss in the next section why a homogeneous Poisson process may be a good approximation. Second, the largest source of uncertainty in the homology algorithm is the choice of landmark points. Current methods involve sequential max-min or random sampling [2], but neither method is particularly well supported, or even seems to give consistent results. We demonstrate this inconsistency, and test a third approach that involves a simple scheme of weighted permutations.

The next section outlines the null-hypothesis method and presents some results concerning the estimated distribution of barcodes for the homogeneous Poisson process, as well as one example. Section three gives findings for permutation tests. Section four concludes.

2 The Hypothesis Testing Approach

2.1 Definition and Null Distributions

As per the null hypothesis presented earlier, we wish to compare the barcodes obtained from our sample $X$ to a process with no topological structure. A candidate, the homogeneous Poisson process, $P$, is characterized by the following properties [1]:

1. The number $N(P \cap B)$ of points in any region $B$ is distributed Poisson($\lambda m(B)$), where $m$ is a measure on the space $\mathcal{X}$ that $P$ is defined on, and $\lambda > 0$.
2. If $B_1 \cap B_2$ is the null set, then $N(P \cap B_1)$ and $N(P \cap B_2)$ are independent.
3. Given $N(P \cap B) = n$, the $n$ points are independently and uniformly distributed in $B$.

From the above properties we see that any topological structure is possible in any region $B \subseteq \mathcal{X}$, so long as $m(B) > 0$. Since the parameter $\lambda$ and the space itself completely define $P$, we can use these characterizations to define our test. Given the sample $X \subset \mathcal{X}$, let $\hat{\lambda} = \#\{x \in X\}$ and $\hat{\mathcal{X}} = \mathcal{X}$ be the parameters defining $\hat{P}$. The hope is then that such a process is a good approximation of $P$ for $\mathcal{X}$, which is itself a good approximation of $H_0$.

Figure 1 demonstrates the distribution of a few parameter functions under $H_0$. To form the images, we selected 1000 random samples of $P(\lambda, F)$, where $\lambda = 50$ and $F = [0, 1] \times [0, 1]$. Image (a) shows the distribution of lengths of $b_0$ intervals. This can be interpreted as the distribution of the length of time a connected component is expected to persist. Interestingly, it almost resembles a Poisson distribution. This makes some intuitive sense if we consider the $b_0$ interval length to be proportional to the size of a cluster of points, which is distributed Poisson according to property 1. Parts (b) and (c) demonstrate the distribution of $b_1$ intervals, or 1-dimensional holes, and the time to coalesce, defined as the first time at which the Betti code signature is $(1, 0, 0, \ldots)$. We see that 1-dimensional holes are rather uncommon when sampling from the null distribution, but it is possible for holes to form over varying lengths of $t$.
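The three properties of the homogeneous Poisson process listed in Section 2.1 translate directly into a sampler on a rectangle: draw a Poisson count, then place that many points uniformly. The sketch below is an illustration only and is not the spatstat code used for the paper. The function names are hypothetical, and the Poisson draw uses Knuth's multiplication method, which is adequate for moderate means such as 50.

```python
import math
import random

def poisson_count(mean, rng):
    """Draw a Poisson(mean) variate by Knuth's multiplication method."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_poisson_process(lam, width, height, rng):
    """Homogeneous Poisson process on [0, width] x [0, height].

    lam is the expected number of points per unit area (an assumption of
    this sketch), so the count is Poisson(lam * width * height).
    """
    n = poisson_count(lam * width * height, rng)
    # Given the count, points are independent and uniform in the region.
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

# One draw from P(lambda = 50, F = [0, 1] x [0, 1])
null_sample = sample_poisson_process(50, 1.0, 1.0, random.Random(0))
```

Repeating the draw many times and computing a barcode summary for each sample gives a Monte Carlo approximation of its null distribution.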
The time to coalesce is instructive because after this time we are certain to see the barcode of a single point forever. This means that $t$ has grown so large that one connected simplicial complex covers the data, and we cannot infer any topological structure afterwards. Again this parameter seems to be roughly Poisson or exponential in distribution. We do not make any further inferences other than to note that these results should be taken with

a grain of salt. We will later note that even though the homogeneous Poisson process exhibits interesting barcode distributions, the lengths themselves are very small compared to data with any actual structure.

[Figure 1: Approximated distributions for various parameters of interest of the homogeneous Poisson process. Panels: (a) average $b_0$ interval length; (b) average $b_1$ interval length; (c) average time to coalesce. Vertical axes: frequency.]

2.2 The Circle Example

The circle example is defined as follows. Pick $n$ points uniformly distributed on a unit circle, and apply to each $x_i$, $i \in \{1, \ldots, n\}$, noise $\epsilon_i \sim N(0, \epsilon)$ iid. In the first example, let $n = 50$ and $\epsilon \in \{0, .05, .10, \ldots, .95\}$. This gives 20 models, indexed by increasing noise (see Figure 6 in the appendix for an image of a sample dataset). If $X_i$ is a sample from model $i$, we let $W_i$ be the smallest rectangle parallel to the x and y axes that encompasses the data. Then the null-distribution process corresponding to $X_i$ is given by $P(\lambda = 50, W_i)$, and the achieved significance level (ASL) is defined by

$\mathrm{ASL}_i = P_{H_0}\{\hat{l}^{*}_{b_1} \ge \hat{l}_{b_1}\}$,

where $\hat{l}_{b_1}$ denotes the average length of a $b_1$ interval observed in the sample and $\hat{l}^{*}_{b_1}$ the same quantity for a draw from the null process. The idea is that we expect $\mathrm{ASL}_i$ to increase with higher noise, corresponding to a lower probability of detecting structure. Such a result would give evidence of the sensitivity of persistent homology in detecting topological structure.

The results are summarized in Table 1. Each null distribution is obtained from 1000 samples from $P(\lambda, W_i)$. We see that for the majority of models, the ASL is virtually zero. This corresponds to zero or one $b_1$ interval lengths under the null hypothesis being larger than the observed length in the circle data. We also see two ASL values of 1, corresponding to the models with $\epsilon = .40$ and $.60$. This occurred because among the landmarks selected for those models, not once did a simplicial complex occur that encircled some area of $\mathbb{R}^2$. We attribute this to two possible explanations: one, that the random points were selected such that they all bunched up along one side of the circle, or, two, that the landmark points were.

The first hypothesis is not interesting. Since we see that longer $b_1$ intervals were correctly expressed for models with larger noise, it seems obvious that if we sampled data multiple times from the models, a sort of averaging of $b_1$ interval lengths would occur. The second hypothesis, though, deserves more investigation.
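The ingredients of the circle test can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for Table 1: the function names are hypothetical, the noise is taken to be Gaussian with standard deviation `sd` added independently to each coordinate, and the ASL is estimated as the fraction of null draws whose statistic is at least as large as the observed one.

```python
import math
import random

def noisy_circle(n, sd, rng):
    """n points uniform on the unit circle plus N(0, sd^2) noise per coordinate."""
    pts = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        pts.append((math.cos(theta) + rng.gauss(0.0, sd),
                    math.sin(theta) + rng.gauss(0.0, sd)))
    return pts

def bounding_box(pts):
    """Smallest axis-parallel rectangle W_i as (xmin, xmax, ymin, ymax)."""
    xs, ys = zip(*pts)
    return min(xs), max(xs), min(ys), max(ys)

def asl(observed, null_values):
    """Monte Carlo ASL: share of null statistics >= the observed statistic."""
    return sum(v >= observed for v in null_values) / len(null_values)

# The 20 noise levels 0, .05, ..., .95 of the first example
noise_levels = [round(0.05 * i, 2) for i in range(20)]
```

In use, `observed` would be the average $b_1$ interval length computed from the circle sample, and `null_values` the same statistic computed on each draw from the Poisson process on the bounding box.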
In the current set of trials we chose a landmark set of 5 points sampled without replacement from 50. It seems that the choice of these 5 points has a strong effect on the resulting barcode structure. In fact, the above trial took averages over 10 independent landmark selections for each model in order to get sensible results. Figure 2 (a) shows a boxplot of the $b_1$ lengths when each trial for each model is viewed separately, 10 trials × 20 models = 200 values in all. Almost 70% of the landmark sets selected did not find evidence of a $b_1$-hole at any level of $t$. The next trial demonstrates that this is most likely due to a combination of landmark size and number of sample points. We repeat the same test now

for 100 points on a unit circle, and landmark sets of size 10 instead of 5. Figure 2 (b) shows the same box plot of all $b_1$ lengths for 10 trials under each model noise, $\epsilon_i \in \{0, \ldots, .95\}$. The median length of a 1-dimensional hole is no longer 0. In fact, only about 10% of all samples mistakenly did not observe a 1-dimensional hole for any interval. If we take this to be an estimate of the type II error for testing the presence of 1-dimensional holes in the data (a hole is present but goes undetected), then taking 3 or 4 landmark samples should reduce the chance of error sufficiently. It is reasonable to believe that increasing sample size and landmark amounts further would reduce the error rate even more, but we do not rule out a leveling-off effect.

The ASL values for the second trial are not shown. It is enough to say that all were identically zero except one, which had a p-value of [value missing]. It seems that even low amounts of circular structure produce $b_1$ intervals that exceed those found under structureless conditions. In fact, one worries that the hypothesis test may be too strong to allow for comparison between samples.

[Table 1: ASL values for n = 50 test. Columns: Model, Noise, ASL value. The table entries are missing.]

[Figure 2: Box plot of 200 $b_1$ interval lengths, 10 each sampled by landmark from model $i$ in $\{1, \ldots, 20\}$. Panels: (a) sample size 50; (b) sample size 100.]

[Figure 3: Data points and nonparametric loess regression bootstrap for length of $b_1$ interval, with n = 100 data. Panels: (a) data points; (b) bootstrap nonparametric regression. Axes: model index against average $b_1$ interval length.]

A perhaps more interesting figure is Figure 3, giving the average $b_1$ interval length among the 10 trials for each model. As theorized, we see what appears

to be a gradual decrease in the interval lengths for which a 1-dimensional hole appeared. A local polynomial regression of degree 2 with a moving neighborhood of 75% of the data is superimposed on the points to suggest a trend. We examine this trend by running a simple pairwise bootstrap. Figure 3 (b) plots the resulting polynomial regression for 200 multinomial samples of the data points. A decreasing trend is clearly visible up to a noise level of about .4 to .5 (models 8-10), after which the interval length appears to level off. Note that this level still lies within the top 0.01 percentile of $b_1$ lengths under the null hypothesis. Intuitively, with more noise, landmarks have more trouble closing around a central hole, as they are influenced by noisy witnesses, and hence the underlying structure is expressed less.

We have seen that the sizes of the landmark set and of the sample have a noticeable effect on 2-dimensional data. Also, the persistent homology algorithm seems to do well in detecting a single 1-dimensional hole in the data described. The next section attempts to apply permutation of landmark sets to real data in many dimensions.
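The pairwise bootstrap behind Figure 3 (b) resamples (model index, average length) pairs with replacement and refits the curve on each resample. The sketch below shows only that resampling logic. To stay short it swaps in an ordinary least-squares line for the degree-2 loess fit used in the paper, and the function names are hypothetical.

```python
import random

def ls_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def pairs_bootstrap_slopes(xs, ys, n_boot, rng):
    """Refit the slope on n_boot resamples of the (x, y) pairs."""
    pairs = list(zip(xs, ys))
    slopes = []
    while len(slopes) < n_boot:
        # Multinomial resample: draw len(pairs) pairs with replacement
        draw = [rng.choice(pairs) for _ in pairs]
        bx, by = zip(*draw)
        if len(set(bx)) > 1:  # skip degenerate draws with a single x value
            slopes.append(ls_slope(bx, by))
    return slopes
```

The spread of the refitted curves (here, of the slopes) indicates how much of the apparent trend could be due to sampling variation in the 20 data points.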

3 Landmark Permutation

It has been of interest to the author to use the persistent homology algorithm on a set of real data. The notable applications to real data have so far been largely confined to two examples: neuron firing in the visual cortex [11], and a database of high contrast images [4]. A further, related example used the Mapper algorithm to compare three dimensional meshes of various figures [10].

In this section, we attempt a topological examination of the dataset noisyapples. The data consists of NIR spectroscopy signals measured on 5 different varieties of apples, at 51 uniformly spaced frequencies with added white noise. The 5 varieties are Red Delicious, Golden Delicious, Rome, Granny Smith, and McIntosh. As can be shown with standard techniques (smoothed PCA for example), the five varieties can be roughly clustered according to NIR signature.

Since the 0th Betti interval measures connected components, it would seem plausible that one could test for clustering in a data set. If the data contained more than one cluster, it could be described by multiple simplicial complexes, one at each cluster. Topologically speaking, this corresponds to a $b_0$ value equal to the number of clusters. In terms of persistent homology, if the barcode of a sample has a long period of time where more than one $b_0$ interval overlaps, one could conclude that there is some clustering in the data. Moreover, since it is possible to extract the simplicial complex itself (although we do not do it here), the location and even rough shape of the cluster can be found. Persistent homology also has the advantage of being unsupervised, in contrast to other grouping methods such as LDA. Hence it could be useful where collecting any training data is too costly.

There are a number of issues that must be confronted to see if persistent homology can be used for clustering.
In the following paragraphs we review some of these issues and pose potential solutions for further study.

First, the term "long" is used abstractly above, as no definitive measures exist for the significance of barcodes. How do we tell if an interval is long enough to be considered intrinsic to the data, or simply noise from the algorithm? The null hypothesis approach of the previous section could be applied here, but is currently untested. Also, there is still a question as to whether a homogeneous Poisson process is a correct choice for the null.

Second, it was shown previously that homology results are sensitive to the particular choice of landmark set, at least when data sets are small. It appears that parameters may become more robust as sample size increases,

but there is currently no evidence on what an appropriate sample size may be. One direction may be a study of how robustness to random landmark choices changes with increased sample size. It would be instructive to compare results to other landmark choosing algorithms, such as maxmin. The other direction, which we illustrate below, instead attempts to reduce random landmark error directly, by permutation resampling of landmark sets. In other words, make many landmark choices and describe parameter results in terms of the resulting distribution. The following presents an introduction to this approach.

Figure 4 shows the results of 100 independent landmark selections. Each landmark set consisted of 50 points selected without replacement from the 320 NIR signals observed. The y axis records the number of $b_0$ intervals observed at any point in time. The horizontal bars now correspond to the length of time for which $n$ connected components are observed in the sample. The x axis terminates at the maximum time to coalesce (described in the previous section) among all 100 resamples. Figure 4 (a) plots results for all 100 landmark sets simultaneously. Here the transparency of any single point corresponds to how many landmark samples had that many connected components at that particular point in time. Figure 4 (b) shows the middle 90% quantile range, calculated pointwise in each bar. For reference, an arrow points to 5 connected components, the desired result.

We do not attempt any description of the results, other than to say it is expected that many connected components are seen while the threshold is very low. Also, 50 is the highest $b_0$ number, corresponding to each landmark having its own complex. Finally, the resemblance to a transposed plot of explained variances for PCA suggests that looking for an elbow may also be possible here.
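The resampling scheme behind Figure 4 can be sketched as follows: draw many landmark sets, record the number of connected components of each at a grid of thresholds, and summarize pointwise with a 5% to 95% band. This is an illustration under stated assumptions, not the JPLEX computation. In particular, connectivity is computed on a plain neighborhood graph over the landmarks (two landmarks are joined when their Euclidean distance is at most $t$), which is swapped in here for the lazy witness complex, and all function names are hypothetical.

```python
import math
import random

def n_components(landmarks, t):
    """Connected components when landmarks within distance t are joined."""
    parent = list(range(len(landmarks)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(landmarks)):
        for j in range(i + 1, len(landmarks)):
            if math.dist(landmarks[i], landmarks[j]) <= t:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(landmarks))})

def b0_band(points, n_landmarks, thresholds, n_resamples, rng, lo=0.05, hi=0.95):
    """Pointwise (lo, hi) order-statistic band of component counts."""
    curves = []
    for _ in range(n_resamples):
        landmarks = rng.sample(points, n_landmarks)  # without replacement
        curves.append([n_components(landmarks, t) for t in thresholds])
    band = []
    for column in zip(*curves):  # one column per threshold
        ordered = sorted(column)
        last = len(ordered) - 1
        band.append((ordered[int(lo * last)], ordered[int(hi * last)]))
    return band
```

A threshold range over which the band stays at a single value greater than 1 would be the analogue of a long stretch of overlapping $b_0$ intervals, and hence a candidate for the number of clusters.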

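[Figure 4: $b_0$ interval distributions for 100 independent random landmark choices of noisyapples data. Panels: (a) $b_0$ barcodes for all landmark sets; (b) 90% CI, with an arrow marking the correct number of connected components. Axes: $t$ against number of connected components.]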

4 Conclusion

In this study we examined the uses of Monte Carlo sampling in studying the persistent homology of data. The results were largely inconclusive. On the one hand, it looks as though the bar code distributions under a null hypothesis of no topological structure are concentrated around interval lengths that are too small to compare to any data with structure. On the other hand, this does suggest that if data can be characterized by having n-dimensional holes, then null hypothesis testing will detect them.

The problem of effectively selecting landmark sets still exists. One possible avenue for future study is to weigh the results from a landmark set by the value $\max_{z \in Z} \{\min_{l \in L} \lVert z - l \rVert\}$, which can be thought of as representing how poorly the landmark set was chosen. Another future test will be to take arbitrary rotations of low dimensional figures (with noise) in a high dimensional space and see if the correct bar code values are still expressed.

Of course, there is no end to such toy examples. The real interest lies in devising methods to employ persistent homology in studying arbitrary data sets. We have presented the possibility that topological structure is present in any high dimensional data set, but it is an open question as to how best to harness this structure in describing data. One existing method, Mapper, identified earlier and described in Singh et al. (2007), has effectively used persistent homology analysis to create simplicial complexes that illustrate the structure of data. It may be instructive to see if Monte Carlo sampling can be of help.
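The landmark quality value $\max_{z \in Z} \{\min_{l \in L} \lVert z - l \rVert\}$ proposed above is cheap to compute, and the sequential max-min selection mentioned in the introduction greedily tries to keep it small. The sketch below illustrates both under stated assumptions: Euclidean distance, hypothetical function names, and a fixed starting point for the greedy selection.

```python
import math

def coverage_radius(points, landmarks):
    """max over points z of the distance from z to its nearest landmark."""
    return max(min(math.dist(z, l) for l in landmarks) for z in points)

def maxmin_landmarks(points, k, start=0):
    """Sequential max-min: repeatedly add the point farthest from the chosen set.

    Returns indices into points; start is the index of the first landmark.
    """
    chosen = [start]
    dist_to_chosen = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=dist_to_chosen.__getitem__)
        chosen.append(nxt)
        # Update each point's distance to its nearest chosen landmark
        dist_to_chosen = [min(d, math.dist(p, points[nxt]))
                          for d, p in zip(dist_to_chosen, points)]
    return chosen
```

A large `coverage_radius` means some sample point lies far from every landmark, so results from that landmark set could be given less weight.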

5 Appendix

For the analysis in this paper we use a combination of the Java-based software JPLEX (http://comptop.stanford.edu/programs/jplex/index.), the spatstat R package by Adrian Baddeley and Rolf Turner, and self-created R and Java code. For the latter two see the attached code appendix.

5.1 Images

[Figure 5: Sample Barcodes from PLEX.]

[Figure 6: Example of Circle data used in Section 2.2. Panels (a) and (b) show sample datasets at increasing noise levels; legible panel titles include Noise sd = 0, 0.1, 0.2, 0.25 in (a) and Noise sd = 0.4, 0.5, 0.6, 0.65 in (b).]

References

[1] Adrian Baddeley. Analysing spatial point patterns in R. Workshop Notes, Version 3.
[2] Gunnar Carlsson. Topology and data. Bull. Amer. Math. Soc. (N.S.), 46(2).
[3] Gunnar Carlsson and Vin de Silva. Topological estimation using witness complexes. In M. Alexa and S. Rusinkiewicz, editors, Eurographics Symposium on Point-Based Graphics (2004).
[4] Gunnar Carlsson, Tigran Ishkhanov, Vin de Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. Int. J. Comput. Vision, 76(1):1-12.
[5] M. Greenberg and J. Harper. Algebraic Topology. Westview Press.
[6] Trevor Hastie and Patrice Y. Simard. Metrics and models for handwritten character recognition. Statistical Science, 13(1):54-65.
[7] V. Sankrithi Krishnan. An Introduction to Category Theory. North Holland, New York.
[8] J. Munkres. Topology: A First Course. Prentice Hall College Div.
[9] P. Parry and V. de Silva. Plex: Simplicial complexes in Matlab. Available at [URL missing].
[10] Gurjeet Singh, Facundo Memoli, and Gunnar Carlsson. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. In Point Based Graphics 2007, Prague.
[11] Gurjeet Singh, Facundo Memoli, Tigran Ishkhanov, Guillermo Sapiro, Gunnar Carlsson, and Dario L. Ringach. Topological analysis of population activity in visual cortex. J. Vis., 8(8):1-18.
