Cellular Tree Classifiers. Gérard Biau & Luc Devroye

1 Cellular Tree Classifiers Gérard Biau & Luc Devroye Paris, December 2013

3 Outline 1 Context 2 Cellular tree classifiers 3 A mathematical model 4 Are there consistent cellular tree classifiers? 5 A non-randomized solution 6 Random forests

7 The Big Data era Big Data is a collection of data sets so large and complex that it becomes impossible to process using classical tools. It includes data sets with sizes beyond the ability of commonly used software tools to process the data within a tolerable elapsed time. As of 2012, every day 2.5 quintillion ($2.5 \times 10^{18}$) bytes of data were created. Megabytes and gigabytes are old-fashioned.

13 The Big Data era The challenges involved in Big Data problems are interdisciplinary. Data analysis in the Big Data regime requires consideration of: Systems issues: How to store, index and transport data at massive scales? Statistical issues: How to cope with errors and biases of all kinds? How to develop models and procedures that work when both n and p are astronomical? Algorithmic issues: How to perform computations? Big Data requires massively parallel software running on tens, hundreds, or even thousands of servers.

16 Greedy algorithms Greedy algorithms build solutions incrementally, usually with little effort. Such procedures form a result piece by piece, always choosing the next item that offers the most obvious and immediate benefit. Greedy methods have an autonomy that makes them ideally suited for distributed or parallel computation.

18 Greedy algorithms Our goal is to formalize the setting and to provide a foundational discussion of various properties of tree classifiers that are designed following these principles. They may find use in a world with new computational models in which parallel or distributed computation is feasible and even the norm.

19 Outline 1 Context 2 Cellular tree classifiers 3 A mathematical model 4 Are there consistent cellular tree classifiers? 5 A non-randomized solution 6 Random forests

20-22 Classification (figure slides)

26 Basics of classification We have a random pair $(X,Y) \in \mathbb{R}^d \times \{0,1\}$ with an unknown distribution. Goal: Design a classifier $g : \mathbb{R}^d \to \{0,1\}$. The probability of error is $L(g) = \mathbb{P}\{g(X) \neq Y\}$. The Bayes classifier
$$g^*(x) = \begin{cases} 1 & \text{if } \mathbb{P}\{Y = 1 \mid X = x\} > 1/2 \\ 0 & \text{otherwise} \end{cases}$$
has the smallest probability of error, that is,
$$L^* = L(g^*) = \inf_{g:\mathbb{R}^d \to \{0,1\}} \mathbb{P}\{g(X) \neq Y\}.$$

31 Basics of classification The data: $D_n = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$, i.i.d. copies of $(X,Y)$. A classifier $g_n(x)$ is a function of $x$ and $D_n$. The probability of error is $L(g_n) = \mathbb{P}\{g_n(X) \neq Y \mid D_n\}$. It is consistent if $\lim_{n\to\infty} \mathbb{E} L(g_n) = L^*$. It is universally consistent if it is consistent for all possible distributions of $(X,Y)$.
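
A quick numerical illustration may help here. The sketch below is my own toy example (a logistic posterior and Gaussian X, none of it from the talk): it checks by simulation that the Bayes classifier attains the Bayes risk $L^* = \mathbb{E}[\min(\eta(X), 1-\eta(X))]$, where $\eta(x) = \mathbb{P}\{Y=1\mid X=x\}$.

```python
# Minimal sketch (my toy example, not from the slides): the Bayes classifier
# and a Monte Carlo estimate of the Bayes risk in a model where the posterior
# eta(x) = P(Y = 1 | X = x) is known exactly.
import numpy as np

def eta(x):
    """Assumed toy posterior probability P(Y = 1 | X = x), x real-valued."""
    return 1.0 / (1.0 + np.exp(-3.0 * x))

def bayes_classifier(x):
    """g*(x) = 1 if P(Y = 1 | X = x) > 1/2, else 0."""
    return (eta(x) > 0.5).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)                            # X ~ N(0, 1)
Y = (rng.uniform(size=X.shape) < eta(X)).astype(int)    # labels drawn from eta
L_star = np.mean(np.minimum(eta(X), 1.0 - eta(X)))      # Monte Carlo L*
empirical = np.mean(bayes_classifier(X) != Y)           # error of g* on a sample
print(f"L* is roughly {L_star:.3f}; empirical error of g* is {empirical:.3f}")
```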

37 Tree classifiers Many popular classifiers are universally consistent. These include several brands of histogram rules, k-nearest neighbor rules, kernel rules, neural networks, and tree classifiers. Tree methods loom large for several reasons: All procedures that partition space can be viewed as special cases of partitions generated by trees. Tree classifiers are conceptually simple, and explain the data very well.

38-59 Trees? / Trees (sequence of figure slides)

62 Trees The tree structure is usually data dependent, and it is in the construction itself that trees differ. Thus, there are virtually infinitely many possible strategies to build classification trees. Despite this great diversity, all tree species end up with two fundamental questions at each node: 1 Should the node be split? 2 In the affirmative, what are its children?

64 The cellular spirit Cellular trees proceed from a different philosophy. A cellular tree should be able to answer questions 1 and 2 using local information only. (Diagram: at each node a cellular decision, split or no split; final leaves classify by majority vote.)

65 Outline 1 Context 2 Cellular tree classifiers 3 A mathematical model 4 Are there consistent cellular tree classifiers? 5 A non-randomized solution 6 Random forests

69 A model Let $\mathcal{C}$ be a class of possible subsets of $\mathbb{R}^d$ that can be used for splits. Example: $\mathcal{C} = \{\text{hyperplanes of } \mathbb{R}^d\}$. The class is parametrized by a vector $\sigma \in \mathbb{R}^p$. There is a splitting function $f(x,\sigma)$ such that $\mathbb{R}^d$ is partitioned into $A = \{x \in \mathbb{R}^d : f(x,\sigma) \geq 0\}$ and $B = \{x \in \mathbb{R}^d : f(x,\sigma) < 0\}$.
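
As a concrete instance of this model, here is a small sketch (my own illustration, not the authors' code) of the hyperplane split class, under the assumption $p = d+1$ with $\sigma = (w,b)$ and $f(x,\sigma) = \langle w, x\rangle + b$.

```python
# Minimal sketch (my illustration of the model above): the hyperplane split
# class, parametrized by sigma = (w, b) in R^(d+1), with splitting function
# f(x, sigma) = <w, x> + b.  A cell is cut into A = {f >= 0} and B = {f < 0}.
import numpy as np

def f(x, sigma):
    """Splitting function for the hyperplane class."""
    w, b = sigma[:-1], sigma[-1]
    return float(np.dot(w, x) + b)

def split_cell(points, sigma):
    """Partition an array of points of shape (n, d) into the two child cells."""
    values = points @ sigma[:-1] + sigma[-1]
    return points[values >= 0], points[values < 0]

# Example: the axis-aligned cut x_1 >= 0.5 in dimension d = 2.
sigma = np.array([1.0, 0.0, -0.5])
A, B = split_cell(np.random.default_rng(1).random((10, 2)), sigma)
```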

73 A model A cellular split may be seen as a family of mappings $\sigma : (\mathbb{R}^d \times \{0,1\})^n \to \mathbb{R}^p$. We see that $\sigma$ is a model for the cellular decision 2. In addition, there is a second family of mappings $\theta$, but this time with a boolean output. It is a stopping rule and models the cellular decision 1.

79 Cellular procedure A cellular machine partitions the data recursively. If $\theta(D_n) = 0$, the root cell is final, and the space is not split. Otherwise, $\mathbb{R}^d$ is split into $A = \{x : f(x,\sigma(D_n)) \geq 0\}$ and $B = \{x : f(x,\sigma(D_n)) < 0\}$. The data $D_n$ are partitioned into two groups. The groups are sent to child cells, and the process is repeated. Final classification proceeds by a majority vote.
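
Put together, the procedure can be sketched in a few lines. The code below is my own illustration of the recursion, under the assumption that the user supplies the three ingredients $\theta$, $\sigma$ and $f$; it is not the authors' implementation. Each recursive call sees only the data points falling in its own cell, which is exactly the cellular constraint.

```python
# Minimal sketch (my illustration) of a cellular machine built from three
# user-supplied ingredients, each of which sees only the data in the cell:
#   theta(X, Y) -> 0 or 1, the stopping rule (cellular decision 1),
#   sigma(X, Y) -> split parameter (cellular decision 2),
#   f(x, s)     -> real-valued splitting function (f >= 0 goes to child A).
import numpy as np

def build_cellular_tree(X, Y, theta, sigma, f):
    """X: (n, d) points in the cell, Y: (n,) labels in {0, 1}."""
    if len(Y) == 0 or theta(X, Y) == 0:
        # Leaf: classify by a majority vote over the labels in the cell.
        return {"leaf": True, "label": int(Y.mean() > 0.5) if len(Y) else 0}
    s = sigma(X, Y)
    mask = np.array([f(x, s) >= 0 for x in X])
    if mask.all() or not mask.any():
        # Degenerate split: stop here rather than recurse forever.
        return {"leaf": True, "label": int(Y.mean() > 0.5)}
    return {"leaf": False, "s": s, "f": f,
            "A": build_cellular_tree(X[mask], Y[mask], theta, sigma, f),
            "B": build_cellular_tree(X[~mask], Y[~mask], theta, sigma, f)}

def classify(tree, x):
    while not tree["leaf"]:
        tree = tree["A"] if tree["f"](x, tree["s"]) >= 0 else tree["B"]
    return tree["label"]
```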

82 An important remark Is any classifier a cellular tree? 1 Set $\theta = 1$ if we are at the root, and $\theta = 0$ elsewhere. 2 The root node is split by the classifier into a set $A = \{x \in \mathbb{R}^d : g_n(x) = 1\}$ and its complement, and both child nodes are leaves. This is not allowed.

83 Outline 1 Context 2 Cellular tree classifiers 3 A mathematical model 4 Are there consistent cellular tree classifiers? 5 A non-randomized solution 6 Random forests

90 An example At first sight, there are no consistent cellular tree classifiers. Example: the k-median tree. When $d = 1$, split by finding the median element among the $X_i$'s. Keep doing this for $k$ rounds. In $d$ dimensions, rotate through the coordinates. This rule is consistent, provided $k \to \infty$ and $k 2^k/n \to 0$. This is not cellular.
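
For concreteness, a sketch of the k-median tree follows (my own code, illustrative only). The global number of rounds k, chosen in advance as a function of n, is precisely the ingredient a cellular machine is not allowed to use.

```python
# Minimal sketch (my illustration) of the k-median tree: cut at the median of
# the X_i's along one coordinate, rotate through the coordinates with depth,
# and stop after k rounds of splits.
import numpy as np

def k_median_tree(X, Y, k, depth=0):
    """X: (n, d) sample, Y: (n,) labels in {0, 1}."""
    if depth == k or len(Y) <= 1:
        return {"leaf": True, "label": int(Y.mean() > 0.5) if len(Y) else 0}
    j = depth % X.shape[1]              # rotate through the coordinates
    cut = np.median(X[:, j])
    left = X[:, j] <= cut
    if left.all() or not left.any():
        return {"leaf": True, "label": int(Y.mean() > 0.5)}
    return {"leaf": False, "dim": j, "cut": cut,
            "A": k_median_tree(X[left], Y[left], k, depth + 1),
            "B": k_median_tree(X[~left], Y[~left], k, depth + 1)}
```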

95 A randomized solution The cellular splitting method σ mimics the median tree classifier. The dimension to cut is chosen uniformly at random. The selected dimension is then split at the median. The novelty is in the choice of the decision function θ. This function ignores the data altogether and uses a randomized decision that is based on the size of the input.

98 Consistency Consider a nonincreasing function $\varphi : \mathbb{N} \to (0, 1]$. Then, if $U$ is the uniform $[0,1]$ random variable associated with node $A$, $\theta = \mathbf{1}_{[U > \varphi(N(A))]}$. Theorem Let $\beta$ be a real number in $(0,1)$. Define
$$\varphi(n) = \begin{cases} 1 & \text{if } n < 3 \\ 1/\log^{\beta} n & \text{if } n \geq 3. \end{cases}$$
Then $\lim_{n\to\infty} \mathbb{E} L(g_n) = L^*$.
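
These choices of σ and θ plug directly into the generic cellular construction sketched earlier. The code below is my own illustration (β = 0.5 is an arbitrary value, and the natural logarithm is assumed).

```python
# Minimal sketch (my illustration) of the randomized cellular rule: sigma cuts
# a uniformly random coordinate at its median; theta ignores the data values
# and splits only when an independent uniform U exceeds phi(N(A)), with
# phi(n) = 1 for n < 3 (such small cells are never split).
import numpy as np

rng = np.random.default_rng(0)

def phi(n, beta=0.5):
    return 1.0 if n < 3 else 1.0 / (np.log(n) ** beta)   # natural log assumed

def theta(X, Y, beta=0.5):
    """Stopping rule: depends only on the number of points in the cell."""
    return int(rng.uniform() > phi(len(Y), beta))

def sigma(X, Y):
    """Split parameter: (random coordinate, median value along it)."""
    j = int(rng.integers(X.shape[1]))
    return j, float(np.median(X[:, j]))

def f(x, s):
    """Splitting function: nonnegative side goes to child A."""
    j, cut = s
    return x[j] - cut
```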

99 Outline 1 Context 2 Cellular tree classifiers 3 A mathematical model 4 Are there consistent cellular tree classifiers? 5 A non-randomized solution 6 Random forests

103 A full $2^d$-ary tree At the root, we find the median in direction 1. Then on each of the two subsets, we find the median in direction 2. Then on each of the four subsets, we find the median in direction 3, and so forth. Repeating this for $k$ levels of nodes leads to $2^{dk}$ leaf regions.

108 The stopping rule θ The quality of the classifier at node $A$ is assessed by
$$\hat{L}_n(A) = \frac{1}{N(A)} \min\left( \sum_{i=1}^n \mathbf{1}_{[X_i \in A,\, Y_i=1]},\ \sum_{i=1}^n \mathbf{1}_{[X_i \in A,\, Y_i=0]} \right).$$
Define the nonnegative integer $k^+$ by $k^+ = \lfloor \alpha \log_2(N(A)+1) \rfloor$. Set
$$\hat{L}_n(A,k^+) = \sum_{A_j \in \mathcal{P}_{k^+}(A)} \hat{L}_n(A_j)\, \frac{N(A_j)}{N(A)}.$$
Both $\hat{L}_n(A)$ and $\hat{L}_n(A,k^+)$ may be evaluated on the basis of the data points falling in $A$ only. This is cellular.

110 The stopping rule θ Put $\theta = 0$ if
$$\hat{L}_n(A) - \hat{L}_n(A,k^+) \leq \left(\frac{1}{N(A)+1}\right)^{\beta}.$$
Theorem Take $1 - d\alpha - 2\beta > 0$. Then $\lim_{n\to\infty} \mathbb{E} L(g_n) = L^*$.
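
Here is my reading of this stopping rule in code. The lookahead bookkeeping (d rotating median cuts per level of the full 2^d-ary tree), the tie handling, and the default values of α and β are my own illustrative choices, not the authors' implementation.

```python
# Minimal sketch of the non-randomized stopping rule.  L_hat(A) is the
# within-cell error of a majority vote; L_hat(A, k+) re-evaluates it after k+
# further levels of the full 2^d-ary median tree.  Only points in A are used.
import numpy as np

def L_hat(Y):
    """Majority-vote error rate inside a cell (empty cell -> 0)."""
    n = len(Y)
    return 0.0 if n == 0 else min(int(Y.sum()), n - int(Y.sum())) / n

def weighted_error_after_cuts(X, Y, cuts_left, depth=0):
    """Sum over the refined cells of N(cell) * L_hat(cell)."""
    n = len(Y)
    if cuts_left == 0 or n <= 1:
        return n * L_hat(Y)
    j = depth % X.shape[1]
    cut = np.median(X[:, j])
    left = X[:, j] <= cut
    if left.all() or not left.any():
        return n * L_hat(Y)
    return (weighted_error_after_cuts(X[left], Y[left], cuts_left - 1, depth + 1)
            + weighted_error_after_cuts(X[~left], Y[~left], cuts_left - 1, depth + 1))

def theta(X, Y, alpha=0.1, beta=0.1):
    """Split (theta = 1) only if the lookahead gain beats (1/(N(A)+1))^beta."""
    n_A, d = len(Y), X.shape[1]
    k_plus = int(np.floor(alpha * np.log2(n_A + 1)))
    gain = L_hat(Y) - weighted_error_after_cuts(X, Y, d * k_plus) / max(n_A, 1)
    return int(gain > (1.0 / (n_A + 1)) ** beta)
```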

111 Outline 1 Context 2 Cellular tree classifiers 3 A mathematical model 4 Are there consistent cellular tree classifiers? 5 A non-randomized solution 6 Random forests

112-113 Leo Breiman (1928-2005) (photograph slides)

118 From trees to forests Leo Breiman promoted random forests. Idea: use tree averaging as a means of obtaining good rules. The base trees are simple and randomized. Breiman's ideas were decisively influenced by Amit and Geman (1997, geometric feature selection), Ho (1998, random subspace method), and Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/randomforests

123 Random forests They are serious competitors to state-of-the-art methods. They are fast and easy to implement. They can handle a very large number of input variables. The algorithm is difficult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure.

126 Three basic ingredients 1. Randomization and no-pruning For each tree, select at random, at each node, a small group of input coordinates to split. Calculate the best split based on these features and cut. The tree is grown to maximum size.
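
At a single node this ingredient looks roughly as follows (my own sketch; "mtry" is the usual name for the size of the random feature subset, and the split is scored here by the misclassification count, which is only one of several common criteria).

```python
# Minimal sketch (my illustration) of ingredient 1: at a node, draw a small
# random subset of the input coordinates and keep the best axis-aligned cut
# among them, scored by the number of misclassified points after the split.
import numpy as np

rng = np.random.default_rng(0)

def misclassified(Y):
    """Number of errors of a majority vote on the labels Y."""
    return min(int(Y.sum()), len(Y) - int(Y.sum()))

def best_random_split(X, Y, mtry):
    d = X.shape[1]
    features = rng.choice(d, size=min(mtry, d), replace=False)
    best = None
    for j in features:
        for cut in np.unique(X[:, j]):
            left = X[:, j] <= cut
            score = misclassified(Y[left]) + misclassified(Y[~left])
            if best is None or score < best[0]:
                best = (score, int(j), float(cut))
    return best   # (errors after the split, coordinate, threshold)
```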

128 Three basic ingredients 2. Aggregation Final predictions are obtained by aggregating over the ensemble. It is fast and easily parallelizable.

130 Three basic ingredients 3. Bagging The subspace randomization scheme is blended with bagging. Breiman (1996). Bühlmann and Yu (2002). Biau, Cérou and Guyader (2010).

134 Mathematical framework A training sample: $D_n = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$. A generic pair: $(X,Y)$ satisfying $\mathbb{E}Y^2 < \infty$. Our goal: Estimate the regression function $r(x) = \mathbb{E}[Y \mid X = x]$. Quality criterion: $\mathbb{E}[r_n(X) - r(X)]^2$.

137 The model A random forest is a collection of randomized base regression trees $\{r_n(x,\Theta_m,D_n),\, m \geq 1\}$. These random trees are combined to form the aggregate $\bar{r}_n(X,D_n) = \mathbb{E}_{\Theta}[r_n(X,\Theta,D_n)]$. $\Theta$ is independent of $X$ and the training sample $D_n$.
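
In practice the expectation $\mathbb{E}_\Theta$ is approximated by an average over a finite number M of independently randomized trees. The sketch below is my own illustration; it also draws a bootstrap resample for each tree, reflecting the bagging ingredient mentioned above.

```python
# Minimal sketch (my illustration): the aggregate E_Theta[r_n(X, Theta, D_n)]
# approximated by averaging M independently randomized base trees, each grown
# on a bootstrap resample of the data (bagging).
import numpy as np

def forest_predict(x, X, Y, build_tree, tree_predict, M=100, seed=0):
    """build_tree(X, Y, rng) -> tree;  tree_predict(tree, x) -> prediction."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    preds = []
    for _ in range(M):
        idx = rng.integers(n, size=n)            # bootstrap sample (bagging)
        tree = build_tree(X[idx], Y[idx], rng)   # randomized base tree
        preds.append(tree_predict(tree, x))
    return float(np.mean(preds))                 # Monte Carlo version of E_Theta
```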

138-142 The procedure (figure slides)

145 The forest A local averaging estimate:
$$r_n(x) = \frac{\sum_{i=1}^n K(x,X_i)\, Y_i}{\sum_{j=1}^n K(x,X_j)}.$$
Centered cuts:
$$K(x,z) = \sum_{\substack{k_{n1},\dots,k_{nd} \\ \sum_{j=1}^d k_{nj}=k_n}} \frac{k_n!}{k_{n1}! \cdots k_{nd}!}\ \prod_{j=1}^d p_{nj}^{k_{nj}}\, \mathbf{1}_{[k_{nj} < \alpha_j]}.$$
Uniform cuts:
$$K(x,z) = \sum_{\substack{k_{n1},\dots,k_{nd} \\ \sum_{j=1}^d k_{nj}=k_n}} \frac{k_n!}{k_{n1}! \cdots k_{nd}!}\ \prod_{m=1}^d p_{nm}^{k_{nm}}\, \frac{1}{k_{nm}!} \int_0^{\left|\log\frac{x_m}{z_m}\right|} t^{k_{nm}} e^{-t}\, dt.$$
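
The local averaging form itself is easy to code. In the sketch below (my own illustration) the kernel K is passed in as an argument; a Gaussian kernel is used as a stand-in, not the centered or uniform forest kernels written above.

```python
# Minimal sketch (my illustration) of a local averaging estimate
#   r_n(x) = sum_i K(x, X_i) Y_i / sum_j K(x, X_j).
# Any nonnegative kernel can be plugged in; the Gaussian below is only a
# stand-in for the forest kernels of this slide.
import numpy as np

def local_averaging_estimate(x, X, Y, K):
    weights = np.array([K(x, xi) for xi in X])
    total = weights.sum()
    return float("nan") if total == 0 else float(weights @ Y) / total

def gaussian_kernel(x, z, h=0.2):
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * h * h)))

rng = np.random.default_rng(0)
X = rng.random((200, 2))
Y = np.sin(2.0 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)
print(local_averaging_estimate(np.array([0.3, 0.5]), X, Y, gaussian_kernel))
```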

149 Consistency Theorem The random forests estimate $\bar{r}_n$ is consistent whenever $p_{nj}\log k_n \to \infty$ for all $j = 1,\dots,d$ and $k_n/n \to 0$ as $n \to \infty$. In the purely random model, $p_{nj} = 1/d$, independently of $n$ and $j$. A more in-depth analysis is needed.

154 Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images: wavelet coefficients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both fields, which rely upon the notion of sparsity.

158 Our vision The regression function $r(x) = \mathbb{E}[Y \mid X]$ depends in fact only on a nonempty subset $\mathcal{S}$ of the $d$ features. In other words: $r(x) = \mathbb{E}[Y \mid X_{\mathcal{S}}]$. In this dimension reduction scenario, the ambient dimension $d$ can be very large, much larger than $n$. As such, the value $S = |\mathcal{S}|$ characterizes the sparsity: the smaller $S$, the sparser $r$.

161 Sparsity and random forests Ideally, $p_{nj} = 1/S$ for $j \in \mathcal{S}$. To stick to reality, we will rather require $p_{nj} = \frac{1}{S}(1+\xi_{nj})$. Today, an assumption. Tomorrow, a lemma.

163 Main result Theorem If $p_{nj} = (1/S)(1+\xi_{nj})$ for $j \in \mathcal{S}$, with $\xi_{nj}\log n \to 0$ as $n \to \infty$, then for the choice $k_n \propto n^{1/(1+\frac{0.75}{S\log 2})}$, we have
$$\limsup_{n\to\infty}\ \sup_{(X,Y)\in\mathcal{F}_S} \frac{\mathbb{E}\left[\bar{r}_n(X) - r(X)\right]^2}{\left(\frac{S^2\log n}{n}\right)^{\frac{0.75}{S\log 2 + 0.75}}} \leq \Lambda.$$
Take-home message The rate $n^{-\frac{0.75}{S\log 2 + 0.75}}$ is strictly faster than the usual minimax rate $n^{-2/(d+2)}$ as soon as $S \leq 0.54\, d$.

164 Dimension reduction

165 Proof (Diagram: the tree partition $\mathcal{P}_k$ down to level $k$, a cell $A \in \mathcal{P}_k$, and its refinement $\mathcal{P}_{k^+}(A)$ obtained $k^+$ levels further down.)

166 Proof Let $\psi(n,k) = L_k - L^*$. Set
$$k_n = \min\left\{ l \geq 0 : \psi(n,l) < \left(\frac{2^{dl}}{n}\right)^{1-d\alpha} \right\}.$$
Then $2^{dk_n}/n \to 0$ as $n \to \infty$.
