1 Surprises in high dimensions

Clemence Gilbert
5 years ago

Our intuition about space is based on two and three dimensions and can often be misleading in high dimensions. It is instructive to analyze, in high dimensions, the shape and properties of some basic geometric forms that we understand very well in dimensions two and three. To that end, we will look at the sphere and the cube as their dimension increases.

1.1 Geometry of the d-dimensional Sphere

Consider the unit sphere in $d$ dimensions. Its volume is given by

$$V(d) = \frac{\pi^{d/2}}{\frac{d}{2}\,\Gamma\!\left(\frac{d}{2}\right)},$$

where $\Gamma$ is the Gamma function. Recall that for positive integers $n$, $\Gamma(n) = (n-1)!$. Using Stirling's Formula,

$$\Gamma(n) \approx \sqrt{\frac{2\pi}{n}} \left(\frac{n}{e}\right)^{n},$$

we can see that $\Gamma(d/2)$ grows much faster than $\pi^{d/2}$, and hence $V(d) \to 0$ as $d \to \infty$. In words, the volume of the $d$-dimensional sphere with radius 1 goes (very quickly) to 0 as the dimension $d$ increases to infinity, see Figure 1. That means a unit sphere in high dimensions has almost no volume (compare this to the volume of the unit cube, which is always 1).

One can also show that most of the volume of the $d$-dimensional sphere is contained near the boundary of the sphere. That is, for a $d$-dimensional sphere of radius $r$, most of the volume is contained in an annulus of width proportional to $r/d$. (Volume scales like $r^d$, so the concentric sphere of radius $(1-\epsilon)r$ holds only a fraction $(1-\epsilon)^d$ of the total.) This means that if you peel a high-dimensional orange, then there is almost nothing left, since almost all of the orange's mass is in the peel.

1.2 Geometry of the d-dimensional Cube

Claim: Most of the volume of the high-dimensional cube is located in its corners.

Proof (probabilistic argument). You do not need to know or follow this proof, but it is included for completeness. We assume that the cube is given by $[-1, 1]^d$; our argument easily extends to cubes of any size. Pick a point at random in the box $[-1, 1]^d$. We want to calculate the probability that the point is also in the sphere.


Figure 1: Volume of a unit sphere as the dimension $d$ of the sphere increases.

Figure 2: Most of the volume of the $d$-dimensional sphere is contained near its boundary. Hence, if you peel a high-dimensional orange, then there is almost nothing left.
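The two statements illustrated in Figures 1 and 2 can be checked numerically. The sketch below is only an illustration: the function names `unit_ball_volume` and `peel_fraction` are made up for this example, and the peel width of 5% of the radius is an arbitrary choice. It evaluates the volume formula for the unit sphere, and the fraction $1 - (1-\epsilon)^d$ of the volume that lies within relative distance $\epsilon$ of the boundary.

```python
import math


def unit_ball_volume(d):
    """Volume of the unit sphere in d dimensions: pi^(d/2) / ((d/2) * Gamma(d/2))."""
    return math.pi ** (d / 2) / ((d / 2) * math.gamma(d / 2))


def peel_fraction(d, eps):
    """Fraction of the volume lying within relative distance eps of the boundary."""
    return 1.0 - (1.0 - eps) ** d


# Hypothetical choice: a peel whose thickness is 5% of the radius.
for d in (2, 3, 10, 20, 50):
    print(d, unit_ball_volume(d), peel_fraction(d, 0.05))
```

The printed volumes shrink rapidly once $d$ is moderately large, while the peel fraction approaches 1.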

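Let $x = [x_1, \dots, x_d] \in \mathbb{R}^d$, where each $x_i \in [-1, 1]$ is chosen uniformly at random. The event that $x$ also lies in the sphere means

$$\|x\|_2 = \sqrt{\sum_{i=1}^{d} x_i^2} \le 1.$$

Let $z_i = x_i^2$ and note that the expectation of $z_i$ is given by

$$E(z_i) = \frac{1}{2}\int_{-1}^{1} t^2 \, dt = \frac{1}{3}, \qquad \text{so that} \qquad E\left(\|x\|_2^2\right) = \frac{d}{3},$$

and the variance of $z_i$ is

$$\mathrm{Var}(z_i) = \frac{1}{2}\int_{-1}^{1} t^4 \, dt - \left(\frac{1}{3}\right)^2 = \frac{1}{5} - \frac{1}{9} = \frac{4}{45} \le \frac{1}{10}.$$

Using Chernoff's Inequality (with $t = \frac{d}{3} - 1$ and $\sigma^2 \le \frac{d}{10}$),

$$P\left(\|x\|_2 \le 1\right) = P\left(\sum_{i=1}^{d} x_i^2 \le 1\right) = P\left(\sum_{i=1}^{d} \bigl(z_i - E(z_i)\bigr) \le 1 - \frac{d}{3}\right) \le \exp\left[-\frac{\left(\frac{d}{3} - 1\right)^2}{4\,\frac{d}{10}}\right] \le \exp\left[-\frac{d}{10}\right],$$

where the last inequality holds for $d \ge 8$. Since this value converges to 0 as the dimension $d$ goes to infinity, this shows that random points in high-dimensional cubes are most likely outside the sphere. In other words, almost all the volume of a hypercube lies in its corners.

For completeness, here is Chernoff's Inequality.

Theorem 1.1 (Chernoff's Inequality). Let $X_1, X_2, \dots, X_n$ be independent random variables with zero mean such that for all $i$ there holds $|X_i| \le 1$ almost surely. Furthermore, define $\sigma_i^2 := E|X_i|^2$ and $\sigma^2 := \sum_{i=1}^{n} \sigma_i^2$. Then,

$$P\left\{\sum_{i=1}^{n} X_i \ge t\right\} \le \max\left\{e^{-t^2/(4\sigma^2)},\; e^{-t/2}\right\},$$

and

$$P\left\{\sum_{i=1}^{n} X_i \le -t\right\} \le \max\left\{e^{-t^2/(4\sigma^2)},\; e^{-t/2}\right\}.$$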
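The probability in the proof can also be estimated by simulation. The following sketch is an illustration, not part of the proof. The function names, the number of trials and the seed are arbitrary choices made for this example. It draws points uniformly from $[-1, 1]^d$, counts how many land in the unit sphere, and compares the result with the exact ratio $V(d)/2^d$ and with the bound $e^{-d/10}$, which the argument above justifies only for $d \ge 8$.

```python
import math
import random


def exact_fraction_in_sphere(d):
    """Exact probability that a uniform point of [-1, 1]^d lies in the unit sphere."""
    sphere_volume = math.pi ** (d / 2) / ((d / 2) * math.gamma(d / 2))
    return sphere_volume / 2.0 ** d  # the cube [-1, 1]^d has volume 2^d


def estimated_fraction_in_sphere(d, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability (trials and seed are arbitrary)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        squared_norm = sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(d))
        if squared_norm <= 1.0:
            hits += 1
    return hits / trials


def chernoff_bound(d):
    """The bound exp(-d/10) from the proof; justified there only for d >= 8."""
    return math.exp(-d / 10)


for d in (2, 5, 10):
    print(d, exact_fraction_in_sphere(d), estimated_fraction_in_sphere(d), chernoff_bound(d))
```

The exact fraction falls off much faster than the bound, which is expected: Chernoff's Inequality gives a convenient estimate, not a sharp one.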

1.3 Comparisons between the d-dimensional Sphere and Cube

We compare a unit $d$-dimensional cube (a cube with side length 1) with a unit $d$-dimensional sphere (a sphere with radius 1) as the dimension $d$ increases.

Figure 3: 2-dimensional unit sphere and unit cube, centered at the origin.

In two dimensions (Fig. 3), the unit square is completely contained in the unit sphere. The distance from the center to a vertex (the radius of the circumscribed sphere, or half the length of the diagonal of the cube) is $\sqrt{2}/2$ and the apothem (the radius of the inscribed sphere) is $1/2$. In four dimensions (Fig. 4), the distance from the center to a vertex (i.e., half the length of the cross-diagonal) is 1, so the vertices of the cube touch the surface of the sphere. However, the length of the apothem is still $1/2$. The result, when projected into two dimensions, no longer appears convex; however, all hypercubes are convex. This is part of the strangeness of higher dimensions: hypercubes are both convex and pointy. In dimensions greater than 4 the distance from the center to a vertex is $\sqrt{d}/2 > 1$ (every vertex has all $d$ coordinates equal to $\pm 1/2$), and thus the vertices of the hypercube extend far outside the sphere. Hence, we can picture a cube in high dimensions to look like a regular cube, but also something like the spiky solid in the right panel of Figure 6. Maybe the Renaissance artist Paolo Uccello had hypercubes in mind when he drew the picture shown in the right panel of Figure 7?

2 Curses and Blessings of Dimensionality

2.1 Curses

Bellman, in 1957, coined the term "the curse of dimensionality". It describes the problem caused by the exponential increase in volume associated with adding extra dimensions to Euclidean space.

Figure 4: Projections of the 4-dimensional unit sphere and unit cube, centered at the origin (4 of the 16 vertices of the hypercube are shown).

Figure 5: Projections of the $d$-dimensional unit sphere and unit cube, centered at the origin (4 of the $2^d$ vertices of the hypercube are shown).

For example, 100 evenly spaced grid points suffice to sample the interval $[0, 1]$ with a distance of 0.01 between the grid points. For the unit square $[0, 1] \times [0, 1]$ we would need $100^2$ grid points to ensure a distance of 0.01 between adjacent points. If we want to sample the 10-dimensional unit cube with a grid with a distance of 0.01 between adjacent points, we would need $100^{10} = 10^{20}$ points. Thus we need a factor of $10^{18}$ more points, even though the dimension increased only by a factor of 10. Any algorithm would now need to process a factor of $10^{18}$ more points than in dimension 1. This exponential increase of complexity versus a linear increase in dimension is a manifestation of the curse of dimensionality.

But we can feel the curse already even if the complexity does not increase exponentially. Often the computation time of algorithms scales badly with dimension.
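The grid-counting arithmetic above is easy to reproduce. In this minimal sketch the helper name `grid_points` is made up for the example, and the count of 100 points per axis is taken from the text.

```python
def grid_points(points_per_axis, d):
    """Number of points in a regular grid with the given resolution along each of d axes."""
    return points_per_axis ** d


# 100 points per axis, as in the text; compare each dimension with dimension 1.
for d in (1, 2, 10):
    total = grid_points(100, d)
    print(d, total, total // grid_points(100, 1))
```

Python integers have arbitrary precision, so the count for $d = 10$ is computed exactly.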

Figure 6: Representations of a 16-dimensional hypercube. Both representations are valid; each captures different features of the hypercube.

Figure 7: Artistic representations of a 7-dimensional hypercube? (Drawing by Paolo Uccello from the 15th century.)


As an example of bad scaling with dimension, the naive approach for finding nearest neighbors among $n$ points in $d$ dimensions requires $O(n^2 d)$ operations, but for $d = 1$ we can simply sort in $O(n \log n)$ operations. For large data sets ($n \gg 1$) in high dimensions ($d \gg 1$), the complexity of $n^2 d$ can be too high to be feasible.

The curse of dimensionality also manifests itself in the amount of training data versus the number of features needed for successful classification. This example and the images below are from the blog of Vincent Spruyt (VP Chief Scientist at Sentiance). Assume we have images of cats and dogs and we want to construct an algorithm that can correctly classify those images into, well, cats and dogs. To be concrete, assume we have a training set of 10 labeled cat/dog images, and a larger number of unlabeled images. We start out using one feature to attempt to distinguish between cats and dogs. For example, we could use the number of red pixels. With one feature alone we will not be able to separate the training images into two separate classes, see Figure 8.

Figure 8: Classification using 1 feature. (Image by Vincent Spruyt)

Therefore, we might decide to add another feature, e.g. the average green color in the image. But adding a second feature still does not result in a linearly separable classification problem: no single line can separate all cats from all dogs in this example, see Figure 9.

Figure 9: Classification using 2 features. (Image by Vincent Spruyt)

Finally we decide to add a third feature, e.g. the average blue color in the image, yielding a three-dimensional feature space. Adding a third feature results in a linearly separable classification problem in our example: a plane exists that perfectly separates dogs from cats, see Figure 10.
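The effect of adding features can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not the cat/dog data of the example: the three one-dimensional points and their labels are hypothetical, the extra feature $x^2$ is chosen only for convenience, and the perceptron is a stand-in for any method that looks for a separating hyperplane. With one feature no threshold separates the labels, so training gives up; with two features a separating line exists and is found.

```python
def perceptron(points, labels, max_epochs=1000):
    """Train a perceptron with bias.

    Returns the weights (bias last) if every training point ends up strictly on
    the correct side, or None if no separator was found within max_epochs.
    """
    dim = len(points[0])
    w = [0.0] * (dim + 1)  # the last entry is the bias
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if y * score <= 0:  # misclassified, or exactly on the boundary
                for i in range(dim):
                    w[i] += y * x[i]
                w[-1] += y
                mistakes += 1
        if mistakes == 0:
            return w
    return None


# Hypothetical training data: the middle point has the opposite label.
labels = [+1, -1, +1]
one_feature = [(-1.0,), (0.0,), (1.0,)]
# Same points with an assumed second feature x^2 appended.
two_features = [(x, x * x) for (x,) in one_feature]

print(perceptron(one_feature, labels))   # no separating threshold exists
print(perceptron(two_features, labels))  # weights of a separating line
```

Perfect separation of three training points says nothing about how the classifier behaves on new data.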


Figure 10: Classification using 3 features. In the three-dimensional feature space, we can now find a plane that perfectly separates dogs from cats. (Images by Vincent Spruyt)

This means that a linear combination of the three features can be used to obtain perfect classification results on our training data of 10 images.

The above illustrations might seem to suggest that increasing the number of features until perfect classification results are obtained is the best way to train a classifier. But this is not the case. If we keep adding features, the dimensionality of the feature space grows, and the training data become sparser and sparser in it. Due to this sparsity, it becomes much easier to find a separating hyperplane, because the likelihood that a training sample lies on the wrong side of the best hyperplane becomes vanishingly small as the number of features grows without bound. However, if the amount of available training data is fixed, then overfitting occurs if we keep adding dimensions. That means we perfectly separate the training data, but when we try the classifier on the rest of the data, the classification will fail terribly. On the other hand, if we keep adding dimensions, the amount of training data needs to grow exponentially fast to maintain the same coverage and to avoid overfitting.

2.2 Blessings

The blessings of dimensionality include the concentration of measure phenomenon (so-called in the geometry of Banach spaces), which means that certain random fluctuations are very well controlled in high dimensions, and the success of asymptotic methods, used widely in mathematical statistics and statistical physics, which suggest that statements about very high-dimensional settings may be made where moderate dimensions would be too complicated. The best-known example is the Central Limit Theorem (a refinement of the Law of Large Numbers), which says in a nutshell that the suitably normalized sum of independent, identically distributed scalar-valued random variables is approximately normally distributed.
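As a small illustration of concentration of measure, consider again a point $x$ drawn uniformly from $[-1, 1]^d$. The average $\frac{1}{d}\sum_i x_i^2$ has mean $1/3$, and its fluctuations shrink like $1/\sqrt{d}$. The sketch below is only a numerical check of that statement: the function name, the number of trials and the seed are arbitrary choices made for this example.

```python
import random
import statistics


def mean_square_samples(d, trials=500, seed=0):
    """Samples of (1/d) * sum(x_i^2) for x uniform in [-1, 1]^d."""
    rng = random.Random(seed)
    return [
        sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(d)) / d
        for _ in range(trials)
    ]


for d in (10, 100, 1000):
    samples = mean_square_samples(d)
    print(d, statistics.mean(samples), statistics.stdev(samples))
```

The sample mean stays near $1/3$ for every $d$, while the sample standard deviation drops by roughly a factor of $\sqrt{10}$ each time $d$ grows tenfold.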


Figure 11: Concentration of the condition number of a Gaussian random matrix.

We observe such blessings, or such concentration of measure phenomena, for example in physics. It is very difficult to predict the behavior of individual atoms in a gas. But we can nevertheless predict very precisely how a gas will behave, because we can make quite accurate statements about the behavior averaged over millions of atoms. There is a vast number of such concentration phenomena that extend from scalars to vectors and matrices. These generalizations often make our life in machine learning much easier.

Here is one such example of the blessings of dimensionality from random matrix theory. The condition number of a matrix (the ratio of its largest singular value to its smallest singular value) is an important measure of its stability with respect to noise when solving a linear system of equations. In general it is very difficult or even impossible to tell in advance what the condition number of a matrix will be without having explicit access to all the entries of that matrix. However, for matrices whose entries are chosen randomly, we can often give a fairly precise prediction of the condition number of that matrix based on very few parameters, such as the dimensions of the matrix.

Theorem 2.1. Let $A$ be a $d \times n$ matrix (with $d \le n$) whose entries are independent standard normal random variables. Let $\sigma_{\min}$ and $\sigma_{\max}$ be the smallest and largest singular value of $A$ (i.e., the square roots of the smallest and largest eigenvalue of $AA^*$), respectively. Let $\gamma = \frac{d}{n}$ and let $\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}}$ denote the condition number of $A$. Then, as $d, n \to \infty$ (but with $\frac{d}{n} = \gamma$ fixed),

$$\kappa(A) \to \frac{1 + \sqrt{\gamma}}{1 - \sqrt{\gamma}}.$$

In Matlab we can generate $A$ via A = randn(d,n). Figure 11 shows the distribution of $\kappa(A)$ of a Gaussian random matrix for different realizations of $A$. One can clearly see how well $\kappa(A)$ is concentrated near the value $\frac{1 + \sqrt{\gamma}}{1 - \sqrt{\gamma}}$ (which is equal to 3 in this case).
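The value 3 quoted for Figure 11 pins down the aspect ratio used there. The short check below assumes only the limit formula of the theorem; the actual sizes $d$ and $n$ behind the figure are not stated in the text.

```latex
\[
\frac{1+\sqrt{\gamma}}{1-\sqrt{\gamma}} = 3
\;\Longleftrightarrow\;
1+\sqrt{\gamma} = 3 - 3\sqrt{\gamma}
\;\Longleftrightarrow\;
\sqrt{\gamma} = \tfrac{1}{2}
\;\Longleftrightarrow\;
\gamma = \frac{d}{n} = \frac{1}{4}.
\]
```

So a matrix with four times as many columns as rows (for instance the hypothetical choice $d = 250$, $n = 1000$) would be expected to have a condition number close to 3 once the dimensions are large.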
