An Implementation and Discussion of Random Forest with Gaussian Process Leaves



Anonymous Author(s)

Abstract

Stationary Gaussian Process (GP) regression assumes that a single correlation structure is appropriate at every location in the input space, and therefore cannot fit piecewise-continuous data sets well. Various nonstationary GP models have been developed to solve this problem. Here, we propose to use a random forest for partitioning and Gaussian Process regression on the leaves of the trees to handle piecewise-continuous data sets. This combination takes advantage of the randomization and averaging inherent in Random Forest, the independence gained from binary tree partitioning, and the smooth nonlinear regression achieved by Gaussian Processes, to provide a solution for general large-scale regression.

1 Introduction

Gaussian Process regression models are widely adopted in machine learning applications, especially in domains that require prediction, such as earth sciences, planning and computer simulation experiments. Because it is built on a correlation matrix, a Gaussian Process model can capture the smoothness of the objective function and reveal the effect of potentially correlated input dimensions. However, this property also causes problems where discontinuity is inherent in the objective function. Nonstationary Gaussian Process models have been proposed to solve this problem by partitioning the input space into regions, each of which is fitted with an independent GP model. This transforms the regression problem into a partitioning problem, for which a tree structure can return a satisfactory disjoint partition under a specified criterion. As in a classification forest, the partitioning is carried out by each tree independently, based on information gain and limited by the tree-structure parameters. Several other partitioning methods are based on clustering of the input space or on MCMC-based posterior computation.

Compared with those options, our method is simpler in principle and can therefore be more general. Random Forest can mitigate the problems caused by an overfitted tree and effectively lowers the number of input dimensions seen by each Gaussian Process, whose results degrade substantially when more than roughly a dozen input dimensions are fed in. In general, the tree structure partitions the data space into sections on which Gaussian Process models fit better, while the bootstrapping and bagging procedure of Random Forest limits the dimensions each tree passes to its Gaussian Process models and mitigates possible overfitting.

2 Method

The construction of the forest generally follows that of classification forests. The only difference is at the leaves of the trees, where the output is the prediction given by the Gaussian Process model held by that node.

2.1 Gaussian Process

Each leaf node holds a Gaussian Process model which, once the trees have been constructed, is independent of the models of the other nodes. However, as in the combination of linear regression with Random Forest, the information gain is measured at each node during construction to decide which split, if any, is most beneficial. As with decision trees, we need an entropy that represents the regression quality of the current node. To keep things simple, we use the squared-exponential kernel to model smoothness, with the same length-scale parameter for every dimension. Because optimizing a separate length scale for each dimension is complex, a single shared length scale remains the most widely adopted model in applications. Before computing the entropy, we fit the Gaussian Process model of the current node u to obtain the best-fitting parameters and hence the optimized correlation matrix. With $X_u$ and $Y$ the $N$ inputs and outputs held by the node:

$$\kappa(x, x') = \delta_f^2 \exp\left(-\frac{1}{2l^2}\|x - x'\|^2\right)$$

$$\Sigma_u = \kappa(X_u, X_u) + \mathrm{diag}(\delta_n^2)$$

$$\log p(Y \mid X_u) = -\frac{1}{2} Y^\top \Sigma_u^{-1} Y - \frac{1}{2}\log|\Sigma_u| - \frac{N}{2}\log(2\pi)$$

$$\hat{\delta}_f, \hat{l}, \hat{\delta}_n = \arg\max_{\delta_f,\, l,\, \delta_n} \log p(Y \mid X_u)$$

The optimized $\Sigma_u$ is the matrix above evaluated at $\hat{\delta}_f$, $\hat{l}$ and $\hat{\delta}_n$. The differential entropy of node u is then:

$$E(u) = -\int P(Y \mid \mu, \Sigma_u)\,\log P(Y \mid \mu, \Sigma_u)\,dY = \frac{1}{2}\log\left\{(2\pi e)^N\,|\Sigma_u|\right\}$$

2.2 Binary Tree

The construction of the trees is based on split decisions. We use information gain to quantify the quality of a split [1]. During growing, a sample of the input dimensions is drawn to determine which dimensions are candidates for splitting. Every break point along these dimensions is tested and its information gain is measured. A node forks only when at least one split achieves a positive information gain, and it splits at the threshold that yields the largest gain.
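The entropy and split search described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, the hyperparameters are fitted by a coarse grid search standing in for a numerical optimizer, the prior mean is taken as zero, and the gain weights each child's entropy by its share of the data points, the usual form in decision forests [1].

```python
import numpy as np

def se_kernel(X, length_scale, signal_var):
    """Squared-exponential kernel matrix for the rows of X (shape N x D)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dist / length_scale ** 2)

def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    """Zero-mean GP log marginal likelihood, plus log|Sigma| for reuse."""
    n = X.shape[0]
    sigma = se_kernel(X, length_scale, signal_var) + noise_var * np.eye(n)
    chol = np.linalg.cholesky(sigma)
    alpha = np.linalg.solve(chol, y)  # y^T Sigma^{-1} y == alpha @ alpha
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    lml = -0.5 * alpha @ alpha - 0.5 * logdet - 0.5 * n * np.log(2.0 * np.pi)
    return lml, logdet

def node_entropy(X, y, grid=(0.1, 0.3, 1.0, 3.0)):
    """Fit hyperparameters on a coarse grid (assumption: a stand-in for a
    gradient-based optimizer), then return 0.5 * log((2*pi*e)^N |Sigma|)."""
    n = X.shape[0]
    best_lml, best_logdet = -np.inf, None
    for length_scale in grid:
        for signal_var in grid:
            for noise_var in grid:
                lml, logdet = log_marginal_likelihood(
                    X, y, length_scale, signal_var, noise_var)
                if lml > best_lml:
                    best_lml, best_logdet = lml, logdet
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + best_logdet)

def information_gain(e_parent, e_left, n_left, e_right, n_right):
    """Parent entropy minus the child entropies weighted by their data share."""
    n = n_left + n_right
    return e_parent - n_left / n * e_left - n_right / n * e_right

def best_split(X, y, dims, min_leaf=3):
    """Return (gain, dim, threshold) for the best positive-gain split, else None."""
    e_parent = node_entropy(X, y)
    best = None
    for d in dims:
        values = np.unique(X[:, d])
        for threshold in 0.5 * (values[:-1] + values[1:]):
            left = X[:, d] <= threshold
            n_left, n_right = left.sum(), (~left).sum()
            if min(n_left, n_right) < min_leaf:
                continue
            gain = information_gain(
                e_parent,
                node_entropy(X[left], y[left]), n_left,
                node_entropy(X[~left], y[~left]), n_right)
            if gain > 0.0 and (best is None or gain > best[0]):
                best = (gain, d, threshold)
    return best
```

Note that the entropy depends on the outputs only through the fitted hyperparameters, so skipping the fit would make the split decision independent of the data values.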
In the information gain, $N$, $N_{left}$ and $N_{right}$ denote the number of data points in the current node, in the left child and in the right child, respectively:

$$I = H(u) - \frac{N_{left}}{N} H(u_{left}) - \frac{N_{right}}{N} H(u_{right})$$

Using the differential entropy $E(u)$ obtained above as $H(u)$:

$$I = E(u) - \frac{N_{left}}{N} E(u_{left}) - \frac{N_{right}}{N} E(u_{right})$$

2.3 Forest

At the forest level, only a few parameters are needed to define the forest structure:

$n_{tree}$: number of trees in the forest
$m_{depth}$: maximum depth of each tree
$n_{min}$: minimum number of data points for leaf nodes
$n_{dim}$: number of dimensions each split will try
$d$: number of data points each tree owns

The bootstrapping procedure is the same as for classification forests: each tree is fed $d$ data points sampled uniformly from the whole training data set. The bagging process is also the same as for classification forests, except that the output distribution is the average over all trees of the per-tree output distributions, which are computed by Gaussian Process regression instead of by counting labels. The prediction $Y^*$ for an input $X^*$ is estimated by averaging the predictions of the $T$ trees:

$$P(Y^* \mid X^*) = \frac{1}{T}\sum_{t=1}^{T} P_t(Y^* \mid X^*)$$

3 Experiments

So far, our Random Forest Gaussian Process regression model has the following parameters:

$n_{tree}$: number of trees in the forest
$m_{depth}$: maximum depth of each tree
$n_{min}$: minimum number of data points for leaf nodes
$n_{dim}$: number of dimensions one split will try
$d$: number of data points each tree owns
$\mu$: the mean of the prior multivariate Gaussian distribution
$l$: the initial value of the length-scale parameter in the squared-exponential kernel
$\delta_n$: the initial value of the noise variance added to the correlation matrix
$\delta_f$: the initial value of the coefficient in the squared-exponential kernel

In our experiments, all training data from real data sets were normalized to zero mean, so $\mu$ is set to 0 throughout. The first five parameters determine the size and structure of the forest. The last three should not affect the result in theory, but in practice a successful optimization of the three parameters of the correlation matrix depends on appropriate initial values. The covariance of two input points $x$ and $x'$ is initialized as $\delta_f^2 \exp\left(-\|x - x'\|^2 / (2l^2)\right)$. If $\delta_f$ or $l$ is too small, the optimization may therefore produce the correlation matrix of an independent multivariate Gaussian distribution, in which all correlation between distinct data points is eliminated, and the regression will fail.
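The small-length-scale failure mode can be checked numerically. The sketch below uses hypothetical inputs spaced 2.0 apart and a unit kernel coefficient (both are assumptions for illustration), and compares the largest off-diagonal entry of the squared-exponential correlation matrix for a very small and a moderate length scale.

```python
import numpy as np

def se_kernel_1d(x, length_scale, signal_var=1.0):
    """Squared-exponential kernel matrix for a 1-D array of inputs."""
    diff = x[:, None] - x[None, :]
    return signal_var * np.exp(-0.5 * diff ** 2 / length_scale ** 2)

def max_off_diagonal(K):
    """Largest absolute entry of K outside its diagonal."""
    return np.max(np.abs(K - np.diag(np.diag(K))))

x = np.linspace(0.0, 10.0, 6)  # hypothetical inputs: 0, 2, 4, 6, 8, 10

# Length scale far below the point spacing: the matrix is numerically
# diagonal, so the GP treats every data point as independent.
small = max_off_diagonal(se_kernel_1d(x, length_scale=0.1))

# Length scale comparable to the spacing: neighbours stay correlated.
moderate = max_off_diagonal(se_kernel_1d(x, length_scale=2.0))

print(small, moderate)
```

With the small length scale the neighbouring correlation is exp(-200), which is indistinguishable from zero in practice, whereas the moderate length scale gives exp(-0.5), roughly 0.61. Starting the optimizer in the first regime leaves it with almost no gradient signal from the off-diagonal terms.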
If $\delta_f$ or $l$ is too large, overflow or a singular matrix may occur, because the gradient and Hessian are approximated numerically. As for the initial value of $\delta_n$, a value that is either too large or too small can easily lead to a numerical failure. In the following subsections, we compare single Gaussian Process regression with Random Forest Gaussian Process regression on simple synthetic data. A discussion of the parameters follows.

3.1 Simple synthetic data

Compared with single Gaussian Process regression, the tree-partitioned GP model should be able to recognize discontinuities within the data. To test this, we added faults to sin(x) to make it discontinuous. The Gaussian noise added to the objective function follows N(0, 0.1). A comparison between single GP regression and tree-partitioned GP regression is shown in Figure 1, which shows that the tree-partitioned GP regression gives a much better result than the single GP model. The parameters of the random forest GP model are listed below:

$n_{tree} = 1$
$m_{depth} = 20$
$n_{min} = 3$
$n_{dim} = 1$
$d = 20$ (all training data)
$l = 1.0$
$\delta = 1.0$
$\delta = $ [value missing]

Figure 1: Comparison between single GP regression and tree-partitioned GP regression. The left graph shows the result of a single GP model; the right one shows the result of a random forest GP model with only one tree. The red dashed line represents the objective function. The black lines represent the result of the regression. Spots marked x are training data points.

The random forest GP model used in Figure 1 has only one tree and uses all the training data, so it is in fact a tree-partitioned GP model. We use one tree because there are so few data points and input dimensions. The random forest GP result also benefited from the small noise. As the noise increases (Figure 2), the results of both the single GP model and the random forest GP become worse, but the benefit of partitioning is still visible.

Another property we want our model to have is the ability to tell whether a split is beneficial or not. This is achieved by setting a reasonable information gain threshold. So far, we believe zero is a reasonable choice, since we do not want the overall entropy to go up. In this respect regression models can differ from classification models: for a classification model a split never increases the overall entropy, but for a regression model it can. In Figure 3, we use the continuous sin(x) function to test whether our model will wrongly split it. The result shows that our model decided not to generate child branches.

3.2 Discussion of the parameters

This simple synthetic data set provides a good opportunity to observe the influence of the tree-structure parameters, because its output is easy to interpret. We discuss the influence of the structure parameters $n_{tree}$, $m_{depth}$, $d$ and $n_{min}$ here. Because there is only one input dimension, we cannot discuss the effect of $n_{dim}$, and we can also expect a relatively minor effect of $m_{depth}$ because there are only a few faults.
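A data set in the spirit of the synthetic experiment of Section 3.1 can be generated as sketched below. The fault positions, the fault height and the input range are hypothetical choices, since the text does not state them, and N(0, 0.1) is read here as a noise variance of 0.1.

```python
import numpy as np

def faulted_sine(x, fault_positions=(2.0, 4.0), fault_height=1.5):
    """sin(x) with an upward step of fault_height at each fault position.

    The positions and height are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    steps = sum((x >= p).astype(float) for p in fault_positions)
    return np.sin(x) + fault_height * steps

def make_training_set(n_points=20, noise_var=0.1, seed=0):
    """Sample sorted inputs on [0, 2*pi] and noisy outputs of faulted_sine."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(0.0, 2.0 * np.pi, n_points))
    noise = rng.normal(0.0, np.sqrt(noise_var), n_points)
    return x, faulted_sine(x) + noise

x_train, y_train = make_training_set()                 # twenty points
x_large, y_large = make_training_set(n_points=40)      # forty points
```

Setting `fault_positions=()` recovers the continuous sin(x) case used to check that the model does not split unnecessarily.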
In addition, to amplify the influence of these parameters, we increase the number of training points from twenty to forty.

Given one tree, the minimum leaf size $n_{min}$ largely determines the fineness of the regression. Too large an $n_{min}$ results in underfitting similar to that of the single GP model, while too small an $n_{min}$ introduces unnecessary zigzags, that is, an overfitted regression. The maximum depth $m_{depth}$ can also limit the fineness and can correct the overfitting that results from too small an $n_{min}$. However, the influence of $m_{depth}$ is sensitive to the location of the discontinuities in the objective function. For example, if one side of the optimal split point has many faults while the other side has only a few, some values of $m_{depth}$ result in a regression that overfits on one side and underfits on the other. The number of trees $n_{tree}$ needs to be considered together with $d$ to enable nontrivial bootstrapping and bagging (since our inputs have only one dimension). The results show that bootstrapping and bagging act as a kind of smoothing. Applying them reduces the chance of splitting on a noisy fluctuation by thinning the data, while keeping the overall trend and the meaningful small variations that are observed by most of the trees. Because of bootstrapping and bagging, we can use a smaller $n_{min}$ and a larger $m_{depth}$ without much concern about overfitting. However, this is only the advantage that bootstrapping and bagging bring from the data-set perspective. The benefits from the data-dimension perspective are not covered here.

Figure 2: The influence of noise variance on the regression results of the single GP model (left column) and the random forest GP (tree-partitioned GP) model (right column). The noise variances imposed on the models for the three rows are 0.2, 0.5 and 1.0 respectively.

Figure 3: For a well-formed continuous objective function with insignificant noise, the random forest GP model should recognize that it is unnecessary to partition the data. The left graph is returned from a single GP model; the right one is obtained by our random forest GP model. The black lines are the predictions returned by the models, the red line is the objective function sin(x), and the x marks represent training data points.

4 Application to real-world data sets

In this section, we demonstrate how we apply our random forest Gaussian Process regression model to real-world data sets. Two data sets are used. One is the Canada flu trend data downloaded from Google.org [6]. A regression on this data set might be helpful for flu trend prediction. The other data set consists of records of salinity, temperature and oxygen density in a deep-water region. The data are drawn from the database of the UBC Earth and Ocean Science department. The goal is to find the relation of oxygen density to temperature and salinity.

4.1 Canada flu trends

The data set records the flu intensity index for nine provinces of Canada every seven days from 2003 until now. We use the records from 2004 to 2012 because they are complete. As for the provinces, we picked the records of Alberta, British Columbia, Saskatchewan, Manitoba, Ontario, Quebec, and Newfoundland and Labrador, seven provinces in total. These seven provinces lie roughly in order from the west coast to the east coast, so it is appropriate to treat them as a location axis. The input data contain two dimensions: date and location. The output is the flu intensity index. We hope to find a function of date and location that models the flu intensity index, which might be helpful for flu prediction.

4.1.1 Manipulating the data

Applying the input data directly to the model ends in a failed optimization, because the distance $\|x - x'\|$ is so large that the correlation matrix becomes diagonal. In addition, the output vector Y needs to be normalized to ease the calculation of the log likelihood.
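A minimal sketch of input preprocessing for this kind of data is given below. It assumes dates are stored as day counts and provinces as integer indices (both hypothetical encodings), centres each axis, and rescales dates so that consecutive weekly records are one unit apart. Only the inputs are covered. The helper `neighbour_correlation` assumes a squared-exponential kernel with unit coefficient and unit length scale, purely to show the effect of the rescaling.

```python
import numpy as np

def preprocess_inputs(day_index, province_index):
    """Centre both axes and rescale dates from days to weeks.

    day_index and province_index are paired 1-D arrays, one entry per record.
    """
    date = (day_index - day_index.mean()) / 7.0
    location = province_index - province_index.mean()
    return np.column_stack([date, location])

def neighbour_correlation(gap, length_scale=1.0):
    """Squared-exponential correlation of two inputs separated by gap."""
    return np.exp(-0.5 * gap ** 2 / length_scale ** 2)

# Consecutive weekly records are 7 apart in raw days and 1 apart after scaling.
raw = neighbour_correlation(7.0)      # exp(-24.5): effectively uncorrelated
scaled = neighbour_correlation(1.0)   # exp(-0.5): clearly correlated

days = np.array([0.0, 7.0, 14.0])
provinces = np.array([0.0, 1.0, 2.0])
inputs = preprocess_inputs(days, provinces)
```

With raw day counts the correlation between neighbouring weeks is already negligible at a unit length scale, which is the diagonal-matrix failure described above.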
Both the location and the date axes have their respective mean values subtracted, and the date axis is divided by seven so that the interval of both axes is one. The normalization of the output vector Y is simply conducted as: Y = [formula missing]

4.1.2 Result

Here we compare the regressions achieved by the single GP and the random forest GP. Figure 4 shows the result. The training set is a sample one quarter the size of the whole data set we used. For the random forest model, we set $n_{tree} = 10$, $m_{depth} = 10$, $n_{min} = 5$, $n_{dim} = 1$ and $d = $ [value missing] (training data set).

From Figure 4, we can see that the random forest GP returned a finer-grained graph in general, while it avoided an abnormally high output for the winter of 2009 in all districts. After checking the records for all of these districts, we found that the single GP regression is correct: there is an apparent increase in all these districts in the winter of 2009. The reason the random forest GP does not return an obvious peak as the single GP does lies in the nature of bootstrapping and bagging. Although there is an obvious peak in most locations, the increase is so short that it occupies only a few records. Many trees of the forest did not own enough data points to describe the peak, which results in a surface that is not very responsive in that region. Although the output of the random forest GP shows a steadier surface, it is actually slightly overfitted. Possible reasons for this are too small an $n_{min}$, too large an $m_{depth}$, or even $d$.

Figure 4: The regression result for the flu trend data set. The left graph is returned from a single GP; the right one is returned from our random forest GP. The blue spots displayed in the figure are all the points of the data set. Many of them are hidden by the surface, but the differences between the results are still clear.

4.2 Deep water oxygen density

To find the latent function relating oxygen density to salinity and temperature, we build the random forest GP model with two input axes, normalized salinity and normalized temperature, and one output axis representing the normalized oxygen density. Figure 5 shows the data set. The results from the single GP and our random forest GP are shown in Figure 6. The training data MSE for the single GP and the random forest GP are [value missing] and [value missing] respectively. The testing data MSE for them are [value missing] and [value missing] respectively. From the result, we can see that the random forest GP model returned a finer-grained surface.

Figure 5: The normalized data of oxygen density, temperature and salinity. Both graphs show the same data set, but from different perspectives.

Figure 6: The results of single GP regression (left) and random forest GP regression (right). Blue circles are the data points, and the blue-to-red surfaces represent the prediction for the corresponding input.

The random forest GP regression yields more similar MSE values for the training and testing data sets. The result from the single GP is very likely overfitted and is therefore not a reliable prediction for generalizing the underlying pattern. On this data set, the random forest GP performs better at extracting the latent objective function from very noisy data.

This property is due to the averaging effect of forests. The right graph of Figure 6 comes from a forest with forty trees, a maximum growing depth of ten and a minimum of twenty points per node. The data fed to each tree are only 1/10 the size of the whole training data.

5 Conclusion and future work

In this work, a random forest with Gaussian Processes on the leaves is implemented for regression, and the reasoning behind it is described. Many of the steps are inspired by and adapted from their classification counterparts, including the calculation of the information gain and the bootstrapping and bagging procedures. Based on the simple synthetic data, some properties of the random forest GP model are demonstrated. We found that using a tree structure to partition the data set enables the model to adapt to data sets that contain discontinuities. The combination of tree partitioning and Gaussian Processes preserves piecewise smoothness, which reflects the correlation between different points just as single Gaussian Process regression does, while also introducing discontinuities that cut off the false correlation imposed by a kernel function that always treats all data points in the same way. When the parameters are set properly, tree-partitioned GP regression gives a better fit for a piecewise-continuous data set. The bootstrapping and bagging process, as a form of randomization and averaging, can improve the robustness of the tree-structured partitioning when plenty of data points are available. Another potential benefit of random forest is limiting the input dimensions for Gaussian Process regression. Since our data sets have only a few dimensions, we did not cover this aspect, but the performance on high-dimensional data sets is worth studying. Many problems remain for future study. The values of the parameters are very important to the final regression, while their concrete effects and the interactions between them can be complex and subtle.
In general, we observe that a smaller number of trees, a smaller minimum number of points per leaf and a larger maximum allowed growing depth generate a finer-grained result that is prone to overfitting. Moving these parameters in the opposite direction is more likely to generate an underfitted regression. How to adjust the parameters for an ideal result is not answered in this paper.

6 Related work

Nonstationary Gaussian Process regression has been studied for many years, and many partitioning strategies have been proposed. Chipman et al. [3] proposed Bayesian CART model search for tree-based regression, and Gramacy and Lee [4] augmented the model with Gaussian Processes at the leaves. The fitting procedure is conducted with an MCMC algorithm guided entirely by posterior estimation. Although their inference and calculation, which are based on the posterior rather than the likelihood, are more accurate and correct in theory, the practical implementation can be too complex. Kim et al. [2] and Das and Srivastava [5] use other clustering algorithms to perform the partitioning. Such pre-processing requires some knowledge of the specific data set and thus might not be a general solution, but it is also a promising research direction that is likely to yield improved results.

References

[1] A. Criminisi, J. Shotton and E. Konukoglu (2011). Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Tech. Rep. MSR-TR, Microsoft.
[2] H.-M. Kim, B. K. Mallick and C. C. Holmes (2005). Analyzing Nonstationary Spatial Data Using Piecewise Gaussian Processes. Journal of the American Statistical Association, 100.
[3] H. Chipman, E. George and R. McCulloch (1998). Bayesian CART model search (with discussion). Journal of the American Statistical Association, 93.
[4] R. B. Gramacy and H. K. H. Lee (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103.
[5] K. Das and A. Srivastava (2010). Block-GP: Scalable Gaussian Process Regression for Multimodal Data. In the 10th IEEE International Conference on Data Mining, ICDM 2010.
[6] Data source: Google Flu Trends.
