Modelling Bivariate Distributions Using Kernel Density Estimation

Alexander Bilock, Carl Jidling and Ylva Rydin

Project in Computational Science, January 2016

Department of Information Technology

Abstract

Kernel density estimation is a topic covering methods for computing continuous estimates of the underlying probability density function of a data set. A wide range of approximation methods are available for this purpose; these include the use of binning on coarser grids and the fast Fourier transform (FFT) in order to speed up the calculations. A key factor in the kernel density estimation process is the selection of the so-called kernel bandwidth. The aim of this project is to implement different kernel density estimation approaches proposed in the literature and compare their performance in terms of speed and accuracy. Matlab is used as the main environment for the implementation. The results show that using the FFT can speed up the calculation with almost maintained accuracy if the data is binned on a dense grid. Some general advice for the selection of the kernel bandwidth is also discussed.

Contents

1 Introduction
2 Univariate kernel density estimates
3 Bivariate kernel density estimates
4 Error Estimation
5 Approximations
  5.1 Binning
  5.2 Fourier transform
6 Bandwidth selection
  6.1 Plug-in bandwidth selection
  6.2 Cross validation
    6.2.1 Smoothed cross validation
  6.3 Pre-transformation
7 Applications of KDE
  7.1 Cloud transform
  7.2 Examples with real data
8 Method and results
  8.1 Comparison of binning methods
  8.2 Comparison of KDE-calculation methods
  8.3 Comparison of bandwidth selection methods
9 Summary and conclusions
A Comparison of binning methods
B Comparison of KDE-calculation methods
C Comparison of bandwidth selection methods

1 Introduction

In many fields of science data exploration is of significant importance. In one dimension, investigating the properties of a data set can often be done intuitively. However, in higher dimensions detecting properties such as skewness and multi-modality may be difficult. In lower dimensions histograms can be used to reveal some of the properties, but a smooth estimate of the underlying probability density function (PDF) is often desired. A popular method for obtaining one is kernel density estimation (KDE). The purpose of this work is to implement two-dimensional KDEs in Matlab using different methods and to investigate them in terms of accuracy and speed.

In Sections 2 and 3 the theory of kernel density estimation is presented. Error estimation is introduced in Section 4. Section 5 describes approximative ways of calculating KDEs in order to increase the speed. In Section 6 the bandwidth concept is introduced, with a walk-through of existing algorithms. An application field for KDEs is introduced in Section 7, including some examples with geostatistical data. Section 8 presents the methods of and results from the performance study. Conclusions and analysis are found in Section 9.

2 Univariate kernel density estimates

One way to explore the properties of a data set is by constructing a histogram. If the histogram is normalised, it yields a non-smooth representation of the PDF. A KDE is used to get a smooth estimate of the PDF instead. The univariate KDE $\hat f$ of the PDF $f$ is defined as

$$\hat f(x, h) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) \qquad (1)$$

for a data set with $n$ samples $x = [x_1, x_2, \ldots, x_n]$ from $f$. The kernel function $K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right)$ is a symmetric and non-negative function fulfilling $\int_{\mathbb{R}} K_h(u)\,du = 1$. There is a wide range of kernels, although the choice of kernel function does not have a significant impact on the estimator. In this work the two most commonly used kernels have been considered, namely the Gaussian kernel and the Epanechnikov kernel, both defined below.
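As an illustration of definition (1), the following Matlab sketch evaluates a univariate Gaussian KDE directly on a grid. This is a minimal sketch and not the project's code; the function name kde1d_def and the variable names are our own.

    function fhat = kde1d_def(x, grid, h)
    % Univariate KDE by definition (1) with a Gaussian kernel.
    % x    - vector of n data points
    % grid - vector of evaluation points
    % h    - scalar bandwidth
    n = numel(x);
    fhat = zeros(size(grid));
    for i = 1:n
        u = (grid - x(i)) / h;
        fhat = fhat + exp(-u.^2 / 2) / (h * sqrt(2*pi));  % K_h(grid - x_i)
    end
    fhat = fhat / n;
    end

For a small data set such as the one in Figure 1, a call could look like fhat = kde1d_def(data, linspace(-3, 3, 200), 0.75), with h = 0.75 as in the figure.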

The Gaussian kernel is

$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}, \qquad (2)$$

and the Epanechnikov kernel is

$$K(u) = \frac{3}{4}\,(1 - u^2)\, \mathbf{1}_{\{|u| < 1\}}, \qquad (3)$$

where $\mathbf{1}_{\{|u|<1\}}$ is the indicator function, equal to 1 if $|u| < 1$ and 0 otherwise. The main difference between these kernels is that while the Gaussian kernel has infinite support (it is non-zero everywhere), the Epanechnikov kernel is non-zero only on a limited domain.

The parameter $h$ is called the bandwidth of the kernel. The choice of $h$ is the most important factor for the accuracy of the estimate. The bandwidth selection methods used in this project are described in Section 6. A simple visualisation is seen in Figure 1. It shows a KDE of a data set with six points, calculated with a Gaussian kernel and $h = 0.75$. For comparison, a histogram constructed from the same points is shown as well. In the left part of the figure the blue dots are the data points and the red curves are the kernels evaluated at each point. The green curve is the final KDE.

3 Bivariate kernel density estimates

In the bivariate case the data points are represented by two vectors $x_1 = [x_{11}, x_{12}, \ldots, x_{1n}]$ and $x_2 = [x_{21}, x_{22}, \ldots, x_{2n}]$, where $x_i = (x_{1i}, x_{2i})$ is a sample from a bivariate distribution $f$. In analogy with the univariate case, the bivariate kernel density estimate is defined as

$$\hat f(x, H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i). \qquad (4)$$

Figure 1: Kernel density estimation (a) and histogram (b) for a data set with six points.

Here the bandwidth is the positive definite matrix

$$H = \begin{bmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{bmatrix}, \qquad (5)$$

and the kernel function $K_H$ is a symmetric and non-negative function fulfilling $\int_{\mathbb{R}^2} K_H(u)\,du = 1$. In the bivariate case $K_H(u) = |H|^{-1/2} K(H^{-1/2} u)$. As in the univariate case, the bivariate kernels used in this work have been the Gaussian kernel,

$$K(u) = \frac{1}{2\pi}\, e^{-\frac{1}{2} u^T u}, \qquad (6)$$

and the Epanechnikov kernel,

$$K(u) = \frac{2}{\pi}\,(1 - u^T u)\, \mathbf{1}_{\{u^T u < 1\}}. \qquad (7)$$

Figure 2 demonstrates the difference between a bivariate histogram and a kernel density estimate. It shows a data set generated from a combination of two bivariate normal distributions, visualised through a scatter plot, a histogram, a Gaussian kernel density estimate and the true PDF.
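Definition (4) translates directly into Matlab. The sketch below evaluates the bivariate Gaussian KDE with kernel (6) on a rectangular grid; it is our own illustrative implementation, not the project's code, and it is reused in later sketches.

    function fhat = kde2d_def(X, gx, gy, H)
    % Bivariate KDE by definition (4) with the Gaussian kernel (6).
    % X      - n-by-2 matrix of data points
    % gx, gy - vectors defining the evaluation grid
    % H      - 2-by-2 symmetric positive definite bandwidth matrix
    [GX, GY] = meshgrid(gx, gy);
    P = [GX(:), GY(:)];                  % grid points as rows
    n = size(X, 1);
    Hinv = inv(H);
    c = 1 / (2*pi*sqrt(det(H)));         % (2*pi)^(-1) * |H|^(-1/2)
    fhat = zeros(size(P, 1), 1);
    for i = 1:n
        U = bsxfun(@minus, P, X(i, :));  % grid points relative to x_i
        q = sum((U * Hinv) .* U, 2);     % quadratic form u' * H^(-1) * u
        fhat = fhat + c * exp(-q / 2);
    end
    fhat = reshape(fhat / n, size(GX));
    end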

Figure 2: Comparison between (a) a scatter plot, (b) the true density, (c) a histogram and (d) a KDE for a data set generated from two normal distributions.

4 Error Estimation

To assess the closeness of a kernel density estimator to the target density, an error criterion must be used. A common error measure for kernel density estimation is the Mean Integrated Square Error (MISE),

$$\mathrm{MISE}(\hat f) = \mathrm{E} \int \left( \hat f(x, H) - f(x) \right)^2 dx. \qquad (8)$$

Since the MISE depends on the true density $f$, it can only be calculated for data sets drawn from known distributions. The MISE can be approximated with the Integrated Mean Square Error (IMSE); the expression for the IMSE is obtained by moving the expectation value in (8) inside the integral. The IMSE can be calculated numerically using, for instance, Monte Carlo integration. The algorithm goes as follows (a code sketch is given at the end of this section):

- Generate m data sets, each with n random points drawn from the density f, and define a uniform grid [X, Y].
- Generate a set of k uniformly distributed random points x_c on the grid.
- For each of the m data sets, calculate a KDE f̂ and evaluate it on the grid.
- Use linear interpolation to obtain an approximation f̂(x_c, h) of the KDE in the random points x_c.
- The Mean Squared Error (MSE) in each point x_c is given by

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat f_i(x_c, h) - f(x_c) \right)^2. \qquad (9)$$

The Integrated Mean Square Error is then approximated as $\mathrm{IMSE} = \overline{\mathrm{MSE}} \cdot A$, where $\overline{\mathrm{MSE}}$ is the mean of the MSE over all Monte Carlo points and A is the area of the domain spanned by the grid [X, Y].

In some situations it is more interesting to study the Integrated Square Error (ISE). The difference from the IMSE calculation above is that no mean over the data sets is taken to form the MSE. Instead, the squared error is saved for each data set. The result can thereafter be integrated as above to form the ISE and presented, e.g., in box plots to visualise the deviations from its mean value, which then is an approximate MISE.

Given the number of sample points and the bandwidth matrix, exact values of the MISE can be calculated in closed form if f is a combination of normal distributions and K is the Gaussian kernel, as described in [2]. This closed form can be used in comparison studies of bandwidth selection methods.

The Asymptotic MISE (AMISE) is an approximation of the MISE used in bandwidth selection, since it depends on the bandwidth h in a simpler way. In Wand and Jones (1995) [2] it is stated that, under certain assumptions on f, h and K,

$$\mathrm{AMISE}(\hat f) = \frac{R(K)}{nh} + \frac{h^4}{4}\, \mu_2(K)^2\, R(f''), \qquad (10)$$

where $R(L) = \int L(x)^2\, dx$ and $\mu_2(L) = \int x^2 L(x)\, dx$ for any function $L$.
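The IMSE procedure above can be condensed into the following Matlab sketch for the bivariate case. It assumes a sampler randf(n) and a true density truef(x, y) for the target distribution, neither of which is part of the report's code, together with the kde2d_def sketch from Section 3; the sizes are arbitrary examples.

    % Monte Carlo estimate of the IMSE, following the algorithm above.
    m = 100; n = 1000; k = 5000;               % example sizes
    gx = linspace(-4, 4, 64); gy = linspace(-4, 4, 64);
    xc = -4 + 8*rand(k, 1); yc = -4 + 8*rand(k, 1);   % Monte Carlo points
    H = [0.2, 0.05; 0.05, 0.2];                % some fixed bandwidth matrix
    sqerr = zeros(k, m);
    for i = 1:m
        X = randf(n);                          % the i:th data set
        fhat = kde2d_def(X, gx, gy, H);        % KDE evaluated on the grid
        fi = interp2(gx, gy, fhat, xc, yc);    % linear interpolation at x_c
        sqerr(:, i) = (fi - truef(xc, yc)).^2;
    end
    MSE = mean(sqerr, 2);                      % mean over the m data sets, eq. (9)
    A = 8 * 8;                                 % area of the domain
    IMSE = mean(MSE) * A;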

5 Approximations

5.1 Binning

In many practical applications direct computation of the kernel density estimate is too computationally expensive. One strategy to reduce the computational load is binning. Instead of evaluating the kernels at each data point, an approximation is made by binning the data onto the grid where the KDE is calculated. In this way the number of kernel evaluations is changed from O(nM) to O(M²), where M is the number of grid points (in any dimension). This implies that binning reduces the computational burden provided that the number of data points exceeds the number of grid points (neglecting the time required for the binning itself). The expression for the approximate, binned KDE in dimension d is

$$\tilde f(x_i) = \frac{1}{n} \sum_{l_1 = 1}^{M_1} \cdots \sum_{l_d = 1}^{M_d} K_H(x_i - x_l)\, c_l, \qquad (11)$$

where $c_l$ is the weight assigned to the grid point $x_l$.

The two most commonly used binning rules are simple binning and linear binning. In the univariate case, simple binning assigns a unit mass to the grid point nearest to the data point x. In the case of linear binning, x gives a weighted contribution to both of the surrounding grid points: if y and z are the left and right surrounding grid points, the weighted masses are (z - x)/(z - y) for y and (x - y)/(z - y) for z. The extension to the bivariate case and higher dimensions is straightforward; the line between the two closest grid points in one dimension is replaced by the area enclosed by the four surrounding grid points in the bivariate case, and so on with volumes in higher dimensions.

The approximation by linear binning is considerably more accurate than simple binning. Moreover, the number of grid points can be a quarter as many for linear binning as compared to simple binning with maintained accuracy [1]. Figure 3 illustrates a bivariate example of linear binning; a univariate code sketch is given below.

Figure 3: Bivariate linear binning, with green markers as data, the mesh represented by blue lines and the scaled weight contributions as filled red circles.
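A minimal Matlab sketch of univariate linear binning as just described; the grid handling and all names are our own choices.

    function c = linbin1d(x, g)
    % Univariate linear binning: each data point splits a unit mass between
    % its two surrounding grid points, weights (z-x)/(z-y) and (x-y)/(z-y).
    % x - vector of data points (assumed to lie inside the grid range)
    % g - vector of equally spaced grid points
    c = zeros(size(g));
    delta = g(2) - g(1);
    for i = 1:numel(x)
        j = floor((x(i) - g(1)) / delta) + 1;  % index of the left grid point y
        j = min(j, numel(g) - 1);              % guard for points on the last edge
        w = (x(i) - g(j)) / delta;             % fraction of the mass given to z
        c(j) = c(j) + (1 - w);
        c(j+1) = c(j+1) + w;
    end
    end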

5.2 Fourier transform

As described in Section 5.1, an approximation of the KDE can be calculated by binning the data and assigning a weight to each grid point. The more the number of data points exceeds the number of grid points, the faster the binned calculation will be compared to calculation by the definition. The speed can be increased further by making use of the fast Fourier transform (FFT). The key point is that expression (11) for the binned approximation can be rewritten in the form of a convolution,

$$\tilde f_j = \sum_{l_1 = -(L_1 - 1)}^{L_1 - 1} \cdots \sum_{l_d = -(L_d - 1)}^{L_d - 1} c_{j-l}\, k_l, \qquad (12)$$

where $L_i = M_i$, although it can be shrunk for a slightly reduced computational burden. Furthermore $k_l = \frac{1}{n} K_H(\delta_1 l_1, \ldots, \delta_d l_d)$, where $\delta_i$ is the mesh size in direction $i$. With the convolution form (12) the Fourier transform can easily be applied, and using the FFT is recommended since the computational load is reduced from O(M²) to O(M log M).
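To illustrate the convolution form (12) in the univariate case, the sketch below evaluates a binned Gaussian KDE with Matlab's fft/ifft pair, using zero-padding to obtain a linear (non-circular) convolution. It mirrors the structure of the algorithms in [1, 3] but is not a reproduction of them; linbin1d is the sketch from Section 5.1, and the data and bandwidth are arbitrary examples.

    % Univariate binned KDE evaluated with the FFT (illustration of (12)).
    g = linspace(-4, 4, 401); delta = g(2) - g(1);
    x = randn(1e5, 1); h = 0.2;
    x = x(x > g(1) & x < g(end));             % keep points inside the grid
    n = numel(x);
    c = linbin1d(x, g);                       % grid counts
    L = numel(g) - 1;
    l = -L:L;                                 % kernel on all grid offsets
    k = exp(-(delta*l/h).^2 / 2) / (h*sqrt(2*pi)) / n;
    P = 2^nextpow2(numel(c) + numel(k));      % zero-padded transform length
    u = ifft(fft(c(:), P) .* fft(k(:), P));   % full linear convolution
    fhat = real(u(L+1 : L+numel(g)));         % extract the aligned part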

An FFT method for KDE calculations is presented by Wand in [1]. This algorithm, however, suffers from the drawback of not allowing unconstrained bandwidth matrices. A corrected version of the algorithm was recently presented by Gramacki and Gramacki in [3], which is the one used in the implementation of this work. As can be seen in Section 8, the FFT method surpasses the binned calculation (11) in terms of computation time. Regarding the accuracy, no numerical difference has been detected. However, the FFT method may introduce some visual artifacts, as seen in Figure 4. This is assumed to be caused by numerical errors due to the limited precision of the floating point format. Attempts to remove the effect by an extended zero-padding of the computational domain turned out unsuccessful.

Figure 4: Approximative versions of the KDE in Figure 2(d). Linear binning has been used in both cases, but the KDE to the right has been calculated using the FFT, which has introduced some artifacts.

6 Bandwidth selection

An implementation of kernel density estimation requires the selection of a bandwidth, denoted h in the univariate case and H in the bivariate case. The choice of bandwidth has been shown to be of greater importance than the actual choice of kernel [2]. Figure 5 demonstrates the importance of an appropriate bandwidth. In 5(a) the KDE is over-smoothed because of a too large value of h, and it therefore misses some of the distribution's structural behaviour. On the other hand, a too small h, as in 5(b), makes the KDE under-smoothed. In 5(c) the bandwidth is calculated according to Silverman's rule of thumb, described in [2], and the KDE seems to catch the actual bimodality of the distribution.

Figure 5: Kernel density estimation with a Gaussian kernel for three different values of h, for a data set sampled from a combined normal density: (a) over-smoothed, (b) under-smoothed, (c) bandwidth given by Silverman's rule of thumb.

In the univariate case it is possible to choose the bandwidth by inspection. This is done by calculating the KDE for a large number of values of h, decreasing h until the KDE in some sense looks satisfying. This approach is also possible in the bivariate case, but in higher dimensions the data cannot be visualised intuitively. Visual inspection also assumes some knowledge of the data, for example the positions of the modes. In many situations the distribution is totally unknown, and an automatic bandwidth selection is preferred in order to avoid the problems of the inspection method.

The previously mentioned rule of thumb is a bandwidth selection method which is very easy to understand and implement (a sketch is given below). It gives a satisfying result in many situations and can serve as a useful starting point. However, the method lacks in terms of robustness and optimality.
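For reference, here is a Matlab sketch of a common univariate form of Silverman's rule of thumb, using the robust spread estimate min(std, IQR/1.34) given in [2]; the report does not state exactly which variant the project used, so the constants below are an assumption.

    function h = silverman_rot(x)
    % Silverman's rule-of-thumb bandwidth for a univariate Gaussian KDE.
    x = sort(x(:));
    n = numel(x);
    % Crude quartile estimate, to stay within base Matlab (assumes large n).
    iqr_x = x(round(0.75*n)) - x(max(1, round(0.25*n)));
    s = min(std(x), iqr_x / 1.34);
    h = 0.9 * s * n^(-1/5);
    end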

More robust, and in some sense optimal, alternatives to the rule of thumb try to minimise the AMISE. Calculating such a bandwidth is manageable in the univariate case but becomes very complex in higher dimensions. The extension to bivariate bandwidth selection increases the complexity significantly, since the bivariate bandwidth H is the matrix defined in equation (5). Often some simplification can be made by considering diagonal H, and in some cases it has been shown that a diagonal H can be sufficient [6]. On the other hand, a diagonal H does not support an arbitrary change of the kernel orientation, which in some cases is quite crucial. In the next two sections the main classes of bandwidth selection methods will be presented, namely plug-in methods (PI) and cross validation (CV).

6.1 Plug-in bandwidth selection

As previously mentioned, most available bandwidth selection methods aim to minimise the asymptotic error estimate AMISE. In the univariate case the following expression for the optimal bandwidth h_AMISE can be obtained by differentiating the AMISE expression (10) with respect to h and setting the derivative equal to zero:

$$h_{\mathrm{AMISE}} = \left[ \frac{R(K)}{\mu_2(K)^2\, R(f'')\, n} \right]^{1/5}. \qquad (13)$$

Usually the only unknown quantity in the expression above is the actual probability density function f. In the plug-in method, R(f'') is replaced by the kernel functional estimator $\hat\psi_4(g)$, obtained from the formula

$$\hat\psi_r(g) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} L_g^{(r)}(X_i - X_j), \qquad (14)$$

where $L_g$ is an appropriate kernel and $g$ is the pilot bandwidth. The pilot bandwidth is usually chosen by applying the formula for the AMISE optimal bandwidth again,

$$g_{\mathrm{AMISE}} = \left[ \frac{-2\, K^{(4)}(0)}{\mu_2(K)\, \psi_6\, n} \right]^{1/7}. \qquad (15)$$

This has the effect of introducing $\psi_6$, which requires a new pilot bandwidth to be estimated; every new estimate $\hat\psi_r$ will depend on $\psi_{r+2}$.

The common solution to this problem is at some point to estimate $\psi_r$ with an easily obtained estimate, such as the rule of thumb, instead of an AMISE-based approximation. This yields a variety of plug-in methods, differing in the number of steps in which kernel functional estimators are obtained before the simple estimate is applied. If k stages are applied before the simple estimate, it is referred to as a k-stage plug-in method. Several versions of the PI method have been developed. The most well-known univariate plug-in selector is the algorithm developed by Sheather and Jones (1991) [4]. The plug-in method can be extended to several dimensions, first shown by Wand and Jones [6] and refined and optimised by Duong and Hazelton [5]. In the bivariate case the plug-in method aims to minimise the bivariate AMISE,

$$\mathrm{AMISE}\, \hat f(H) = \frac{R(K)}{n\, |H|^{1/2}} + \frac{\mu_2(K)^2}{4} \left( \mathrm{vech}^T H \right) \Psi_4 \left( \mathrm{vech}\, H \right), \qquad (16)$$

where vech denotes the half-vectorisation

$$\mathrm{vech}\, H = \mathrm{vech} \begin{bmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{22} \end{bmatrix}^T. \qquad (17)$$

The matrix $\Psi_4$ is defined as

$$\Psi_4 = \begin{bmatrix} \psi_{40} & \psi_{31} & \psi_{22} \\ \psi_{31} & \psi_{22} & \psi_{13} \\ \psi_{22} & \psi_{13} & \psi_{04} \end{bmatrix}, \qquad (18)$$

where

$$\psi_{r_1 r_2} = \int_{\mathbb{R}^2} f^{(r_1, r_2)}(x)\, f(x)\, dx, \qquad f^{(r_1, r_2)}(x) = \frac{\partial^{\, r_1 + r_2}}{\partial x_1^{r_1}\, \partial x_2^{r_2}}\, f(x),$$

i.e. $f^{(r_1, r_2)}$ is the partial derivative of f with respect to x₁ and x₂.

As in the univariate case, $\psi_{r_1 r_2}$ has to be estimated. A commonly used estimate is

$$\hat\psi_{r_1 r_2}(G) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K_G^{(r_1, r_2)}(X_i - X_j), \qquad (19)$$

where G is the pilot bandwidth matrix. In Duong and Hazelton [5] it is suggested that this matrix should be of the form $G = g^2 I$. Choosing g can be done in a similar way as in the univariate case. For each entry $\psi_{r_1 r_2}$ in $\Psi_4$, $g = g_{\mathrm{AMSE}}$ is chosen such that it minimises the Asymptotic Mean Square Error approximation

$$\mathrm{AMSE}\, \hat\psi_{r_1 r_2}(g) = \frac{2}{n^2 g^{2(r_1 + r_2) + 2}}\, \psi_{00}\, R\big(K^{(r_1, r_2)}\big) + \left( \frac{1}{n g^{r_1 + r_2 + 2}}\, K^{(r_1, r_2)}(0) + \frac{g^2}{2}\, \mu_2(K) \left( \psi_{r_1 + 2,\, r_2} + \psi_{r_1,\, r_2 + 2} \right) \right)^2. \qquad (20)$$

This method may produce matrices $\Psi_4$ that are not positive definite, in which case a minimum of the objective function does not exist. To solve this issue, Duong and Hazelton suggest another approach: instead of finding one optimal g for each entry in $\Psi_4$, the $g = g_{\mathrm{SAMSE}}$ that minimises the sum

$$\mathrm{SAMSE} = \sum_{r_1 + r_2 = 4} \mathrm{AMSE}\, \hat\psi_{r_1 r_2}(g)$$

is calculated and used as a common g for all entries in $\Psi_4$. A closed-form expression for $g_{\mathrm{SAMSE}}$ is stated in Duong and Hazelton [5]. In analogy with the univariate case, the estimate of g depends on $\psi_{r_1 r_2}$, and therefore a simple estimate of $\psi_{r_1 r_2}$ has to be made at some stage.

The plug-in method as described above requires higher-order derivatives of the kernel. It is therefore not possible to implement the method for an Epanechnikov kernel, since its derivatives of third order and higher are all equal to 0.
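As a concrete illustration of the univariate machinery in equations (13)-(15), the sketch below implements a one-stage Gaussian plug-in selector: psi_6 is replaced by its normal-scale value, psi_4 is then estimated with the pilot bandwidth (15), and the result is plugged into (13). This is our own condensed sketch, not the full Sheather and Jones algorithm [4].

    function h = plugin1d(x)
    % One-stage univariate Gaussian plug-in bandwidth, sketch of (13)-(15).
    x = x(:); n = numel(x);
    sigma = std(x);
    psi6 = -15 / (16*sqrt(pi)*sigma^7);          % normal-scale estimate of psi_6
    g = (6 / (sqrt(2*pi)*(-psi6)*n))^(1/7);      % pilot bandwidth, eq. (15)
    U = bsxfun(@minus, x, x') / g;               % all pairwise differences / g
    phi4 = (U.^4 - 6*U.^2 + 3) .* exp(-U.^2/2) / sqrt(2*pi);  % 4th Gaussian derivative
    psi4 = sum(phi4(:)) / (n^2 * g^5);           % functional estimate, eq. (14)
    h = (1 / (2*sqrt(pi)*psi4*n))^(1/5);         % eq. (13), R(f'') replaced by psi4
    end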

6.2 Cross validation

The most commonly used bandwidth selectors besides PI belong to the class using cross-validation (CV). Generally, methods based on CV can be applied to any kernel. This differs from the PI methods, which usually require higher-order derivatives. The MISE previously defined in equation (8) can be rewritten as

$$\mathrm{MISE}(h) = \mathrm{E} \int \left( \hat f(x, h) - f(x) \right)^2 dx = \mathrm{E} \int \hat f(x, h)^2\, dx - 2\, \mathrm{E} \int \hat f(x, h) f(x)\, dx + \int f(x)^2\, dx. \qquad (21)$$

CV aims to minimise the MISE, which is equivalent to keeping the approximation f̂ as close to f as possible. The third term in (21) is independent of the bandwidth, so the equivalent minimisation problem can be written as

$$\mathrm{MISE}(h) - \int f(x)^2\, dx = \mathrm{E} \int \hat f(x, h)^2\, dx - 2\, \mathrm{E} \int \hat f(x, h) f(x)\, dx. \qquad (22)$$

The calculation of the first term on the right-hand side is quite straightforward, since it only involves known quantities. However, the second term complicates things, since it involves the unknown density f. Several versions of bandwidth selection methods using CV have been developed, but the main focus here has been to investigate smoothed cross validation.

6.2.1 Smoothed cross validation

The most commonly used bandwidth selector within the CV family is smoothed cross validation (SCV). SCV can be seen as a general method for bandwidth selection, and it usually performs better than other CV methods. The method is presented here for the bivariate case. To approximate f in equation (22), SCV uses the pilot estimate

$$\hat f_L(x, G) = \frac{1}{n} \sum_{i=1}^{n} L_G(x - X_i). \qquad (23)$$

Here L is an appropriate kernel and G the pilot bandwidth. This gives the objective function

$$\mathrm{SCV}(H) = \frac{R(K)}{n\, |H|^{1/2}} + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( K_H * K_H * L_G * L_G - 2\, K_H * L_G * L_G + L_G * L_G \right)(X_i - X_j), \qquad (24)$$

where $*$ denotes convolution. This method is similar to PI in the sense that a pilot estimate is used. As for the PI estimates, the choice of G is important and there are different ways to choose it. Usually it is chosen in the same way as g in the PI selection in Section 6.1. Since that method cannot be applied to the Epanechnikov kernel, neither can this version of SCV. The convolutions in equation (24) are simplified a lot if a Gaussian kernel is used, since in that case there is a closed-form expression [8].

6.3 Pre-transformation

Bandwidth selection methods with pilot bandwidths often require some sort of pre-transformation of the data [5]. This is of particular importance when the data is scaled differently along the coordinate axes. The two main methods for pre-transforming the data are sphering and scaling (both are sketched below). Both methods use the variance of the data set in order to make it more uniformly scaled. After the transformation the bandwidth can be calculated and the data transformed back into its original form. The preferred method is not always obvious, although some general recommendations can be given [5]. If the data in some sense has a different local orientation compared to the global one, the sphering method can destroy the local structures of the distribution. On the other hand, if the entire data set is skewed, sphering can yield a considerably more accurate result than scaling. Figure 6(a) shows an example where the sphering pre-transformation can be suitable, while Figure 6(b) shows an example where scaling can be more suitable due to the difference in orientation of the two modes.
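A minimal Matlab sketch of the two pre-transformations; the variable names are our own. Scaling divides each coordinate by its standard deviation, while sphering multiplies by an inverse square root of the full sample covariance and thereby also removes correlation.

    % Pre-transformations of an n-by-2 data matrix X.
    Xc = bsxfun(@minus, X, mean(X));     % centre the data
    % Scaling: unit variance in each coordinate, correlation untouched.
    Xscale = bsxfun(@rdivide, Xc, std(X));
    % Sphering: identity sample covariance.
    S = cov(X);
    Xsphere = Xc / sqrtm(S);             % X * S^(-1/2), symmetric square root

A bandwidth selected for the transformed data is then transformed back, e.g. H = sqrtm(S) * Ht * sqrtm(S) in the sphering case, where Ht denotes the bandwidth found for the sphered data.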

Figure 6: Two examples of distributions where the two pre-transformation methods give significantly different results: (a) correlated Gaussian, (b) asymmetric bimodal.

7 Applications of KDE

7.1 Cloud transform

Application fields of kernel density estimation include the so-called cloud transform (see Kolbjørnsen and Abrehamsen [11]). In this context the term is used equivalently to the conditional cumulative distribution F(y|x), which can be estimated from data according to the expression

$$\hat F(y \mid x) = \frac{ \displaystyle\sum_{i=1}^{n} k_d\!\left( \frac{x - X_i}{h_x} \right) \mathcal{K}\!\left( \frac{y - Y_i}{h_y} \right) }{ \displaystyle\sum_{i=1}^{n} k_d\!\left( \frac{x - X_i}{h_x} \right) }, \qquad (25)$$

where $\mathcal{K}(y) = \int_{-\infty}^{y} k_1(t)\, dt$. If the data is bivariate, then d = 1 and $k_d = k_1$ is a one-dimensional kernel. For illustration, Figure 7 shows the estimator of the conditional cumulative distribution F̂(y|x) for the scattered data in Figure 2(a).
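A Matlab sketch of the estimator (25) with Gaussian kernels, evaluated for a single query point x; the function name, the interface and the separate bandwidths hx and hy are our own assumptions.

    function F = cloud_cdf(X, Y, xq, yq, hx, hy)
    % Conditional CDF estimate F(y|x) according to (25), Gaussian kernels.
    % X, Y - data vectors; xq - query point in x; yq - vector of y values
    % Unnormalised weights: the kernel constants cancel in the ratio.
    w = exp(-((xq - X(:)) / hx).^2 / 2);
    F = zeros(size(yq));
    for j = 1:numel(yq)
        Kcdf = 0.5 * (1 + erf((yq(j) - Y(:)) / (hy*sqrt(2))));  % integrated kernel
        F(j) = sum(w .* Kcdf) / sum(w);
    end
    end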

Figure 7: The estimated conditional cumulative distribution F̂(y|x) for the scattered data in Figure 2(a).

7.2 Examples with real data

This section contains examples of kernel density estimates and cloud transforms for petro-elastic data. Scatter plots showing porosity versus acoustic impedance, for two different wells separately and for both wells together, are presented in Figure 8. Kernel density estimates and the conditional cumulative distributions of the porosity given the acoustic impedance are seen in Figures 9 and 10, respectively. To produce the plots in the latter figures, the Gaussian kernel has been used and the bandwidths have been generated with the plug-in method (see Section 6). Corresponding plots for data sets of log permeability versus porosity are shown in Figures 11, 12 and 13.

Figure 8: Scatter plots of acoustic impedance and porosity for two wells: (a) well 1, (b) well 2, (c) wells 1 and 2 together.

Figure 9: Kernel density estimates for the data shown in Figure 8.

Figure 10: Conditional cumulative distributions of porosity given acoustic impedance for the data shown in Figure 8.

Figure 11: Scatter plots of log permeability and porosity for three different wells and for all wells together.

Figure 12: Kernel density estimates for the data shown in Figure 11.

Figure 13: Conditional cumulative distributions of log permeability given porosity for the data shown in Figure 11.

8 Method and results

All code for comparison and testing is written in Matlab, including the calculation of the KDEs. The code communicates with R in order to use its ks package (developed by Tarn Duong) for the bandwidth selection process. In addition, C is used to speed up the linear binning algorithm. The testing is done by comparing the results obtained with the different methods against the true values from a known underlying density. A set of four target densities, picked from a larger set in the literature and all built up from combinations of normal distributions representing different properties, is used for these studies. The four densities are shown in Figure 14.

Figure 14: The four target densities used in the tests: (a) uncorrelated Gaussian, (b) correlated Gaussian, (c) strongly skewed, (d) asymmetric bimodal.

The tests are carried out on a system running Scientific Linux 6.5 with an AMD Opteron 6282 SE (Bulldozer) CPU at 2.6 GHz.

8.1 Comparison of binning methods

To verify and extend the results on binning methods in [1], a comparison test is performed on the four target densities in Figure 14. In order to compare simple and linear binning, simulations are done to estimate the relative mean integrated squared error (RMISE), as defined by Wand in [1]:

$$\mathrm{RMISE} = \left[ \frac{ \mathrm{E} \int \{ \tilde f(x) - \hat f(x) \}^2\, dx }{ \mathrm{E} \int \{ \hat f(x) - f(x) \}^2\, dx } \right]^{1/2}, \qquad (26)$$

where $\tilde f$ denotes the binned estimate. In words, the RMISE is the MISE error due to binning divided by the MISE of the KDE calculated according to the definition. The denominator MISE is calculated with the closed form. The numerator is estimated with the IMSE, calculated as described in Section 4 (so the RMISE is actually estimated as the RIMSE). An equally spaced grid with M₁ = M₂ = M is used for four different values of M, and for each M four different numbers of sample points are investigated. For each setting, a number of data sets are generated to approximate the MSE, and uniformly distributed random points are used for the Monte Carlo integration; in these points the KDE is approximated with linear interpolation. For each number of sample points, the bandwidth H is chosen with the plug-in method for an initial data set and then kept fixed for the remaining data sets and grid sizes.

Parts of the results are seen in Figure 15, while the remaining figures are found in Appendix A. For each target density it is seen that linear binning yields a more accurate result than simple binning for almost all combinations of grid and sample sizes. The absolute difference is most significant for small sample sizes and on coarse grids, situations in which good approximations are naturally harder to make. Note, however, that the relative difference increases as the grid size grows. One should also note that the RIMSE values grow with the sample size, which implies that larger samples require more grid points to keep the binning error down. It should be recalled that there is an additional uncertainty introduced by the linear interpolation, which grows larger as the grid size shrinks. This extra level of approximation is also the reason why even smaller grid sizes are not used in the test.

Figure 15: log RIMSE versus the number of grid points for the four target densities at one fixed sample size. Stars and circles correspond to simple and linear binning, respectively.

Regarding speed, linear binning is generally faster (all time comparison plots are found in Appendix A). This result may be surprising, considering that linear binning is a more complicated algorithm than simple binning. The explanation is found in our implementation: simple binning is implemented purely in Matlab, while the more complex linear binning algorithm is partly written in C to speed up the execution of an expensive for-loop. With these implementation differences in mind, the results should not be used to draw any general conclusions about how the methods compare in terms of speed. In any case, the binning time is small compared to the time required for the actual KDE computation (see also Section 8.2). Conclusively, the choice of binning method should be based on the accuracy comparison, and thus linear binning is to prefer.

8.2 Comparison of KDE-calculation methods

The aim of the test described in this section is to compare the KDEs computed by the definition (4), the binned estimate (11) and the binned estimate

computed with the FFT (12). Linear binning (lb) is used, since it is the preferable binning method according to Section 8.1. The ISE for the three methods is compared through box plots. The ISE values are calculated as described in Section 4, using a number of data sets and uniformly distributed random points for the Monte Carlo integration. The tests are performed on data sets generated from the four target densities in Figure 14. For each density, three different sample sizes are used for the data sets. Each test is performed with both the Epanechnikov and the Gaussian kernel on two different grid sizes, one coarser grid and one 64 × 64 grid. The Gaussian bandwidth is chosen once for each target density, using the plug-in method on an initial data set. The bandwidth used for the Epanechnikov kernel is obtained by scaling the Gaussian bandwidth with a constant factor, as described in Duong [12].

The test results are presented in Figures 16 and 17. The results are similar for three of the four target densities: the ISE values are in the same range on both grids and for all three KDE methods. The main difference in ISE is seen as the sample size increases. On the coarser grid, the KDE-by-definition estimate improves more than the binned estimates as the number of points increases. This is expected and follows from the fact that the RIMSE grows when the sample size increases, as described in Section 8.1. The ISE behaviour for calculations using linear binning with and without the FFT cannot be distinguished. This is expected, since the only observed difference between the methods is the visual artifacts described in Section 5.2. Another observed pattern is that increasing the sample size n improves the accuracy for all methods. This is intuitive, since a larger sample contains more information about the estimated PDF. An unexpected result is that the binned estimates in some cases have a smaller mean value (approximate MISE) than the KDE by definition. This occurs mainly when the number of grid points is larger than the number of sample points.

The results for the strongly skewed target density stand out compared to the others. For this density the test on the coarser grid yields an ISE several times larger than for the other densities, and the estimate by definition performs significantly better than the binned estimates. On the 64 × 64 grid the ISE is also clearly larger than for the other densities, and the binned estimates have a smaller mean value for all sample sizes. These slightly unexpected results are probably a consequence of the density's strong

skewness. A general advice is to use a dense grid and a large sample size if a KDE approximation is calculated from a very skewed data set.

The results for the Gaussian and the Epanechnikov kernel are similar but not identical. The kernel with the best performance varies between the different methods and target densities. The Epanechnikov kernel is provably the most efficient [2], which is not observed in this test. The reason could be that the target densities used are combinations of normal distributions, which may give an advantage to the infinitely supported Gaussian kernel. Furthermore, the bandwidth matrices are algorithmically chosen for the Gaussian kernel and adapted to the Epanechnikov kernel using a scale factor; this may cause the Epanechnikov bandwidth to be less optimal than the Gaussian one.

For each parameter setting the mean execution time of each KDE method is recorded. This is found to be independent of the target density and the kernel type, so the execution times for one combination of kernel and target density are representative for all remaining settings. Figure 18 shows the results for one target density and the Epanechnikov kernel; the remaining figures from the time study are found in Appendix B. Some general patterns are observed for all test densities. The execution time increases with the sample size. As mentioned in Section 5.1, the number of kernel evaluations required for the KDE by definition is O(nM), where M is the number of grid points. This can be observed in Figure 18, where the time for the KDE by definition, denoted def, increases proportionally to the sample size. For the binned estimates the number of kernel evaluations is O(M²), although one must also take into account the time required by the binning algorithm, which is proportional to O(nM) as shown in [1]. However, the computational burden of the binning procedure is significantly smaller than that of the actual KDE calculation. This is also clear from Figure 18, where it is seen that the execution time for the KDE by definition grows rapidly compared to the binned estimates. Furthermore, using the FFT results in an enormous speed-up compared to the other methods. The ratio between the times required for the binned KDE estimates computed with and without the FFT is only a few per cent on the coarser mesh, shrinks further on the 64 × 64 grid, and would continue to shrink on finer meshes due to the FFT's speed benefits discussed in Section 5.2. Due to this speed-up it is strongly recommended to use the FFT in binned estimations.
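The flavour of this timing study can be reproduced with Matlab's tic/toc and the kde2d_def sketch from Section 3; kde2d_bin and kde2d_fft below are hypothetical binned and FFT-based counterparts with the same interface, standing in for the project's implementations.

    % Timing comparison in the spirit of Figure 18 (our own sketch).
    gx = linspace(-4, 4, 64); gy = linspace(-4, 4, 64);
    H = [0.1, 0; 0, 0.1];
    for n = [1e3, 1e4, 1e5]
        X = randn(n, 2);
        tic; kde2d_def(X, gx, gy, H); t_def = toc;
        tic; kde2d_bin(X, gx, gy, H); t_lb  = toc;
        tic; kde2d_fft(X, gx, gy, H); t_fft = toc;
        fprintf('n = %d: def %.3f s, lb %.3f s, fft %.3f s\n', n, t_def, t_lb, t_fft);
    end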

Figure 16: ISE box plots for the KDE method test on the coarser grid. For each target density, the left panel uses the Gaussian kernel and the right panel the Epanechnikov kernel; each panel compares the KDE by definition, linear binning and linear binning with FFT over the different sample sizes.

Figure 17: ISE box plots for the KDE method test on the 64 × 64 grid, with the same layout as in Figure 16.

Figure 18: Execution times for the three KDE methods (def, lb, fft) with the Epanechnikov kernel on both grid sizes.

8.3 Comparison of bandwidth selection methods

A benefit of using combined normal distributions as target densities is that they allow exact computation of the MISE, given the number of sample points and the bandwidth matrix, as described in [2]. This way it is not necessary to carry out the thorough IMSE calculations in the same way as for the KDE comparison. Instead one can simply use the bandwidth matrices suggested by the selection methods, calculate the corresponding exact MISE values and compare the results.

The main interest in this work is to compare the performance of the plug-in and smoothed cross validation selection methods. As mentioned in the introduction to Section 8, R's ks package is used in the implementation. This package allows a number of different options for bandwidth selection. The focus is to investigate how PI and SCV compare when using one- and two-stage methods and the two different pre-transformations, sphering and scaling. Let sphering be denoted with a star (*); scaling is used if nothing else is stated. The total parameter setting yields eight combinations of

bandwidth selection methods. For a faster execution, the bandwidth algorithms make use of binning the data on a grid of size 64 × 64; earlier observations have shown that this decreases the execution time by orders of magnitude without any notable loss of accuracy. The test is carried out on the four target densities in Figure 14 using a set of different sample sizes. For each sample size and target density the ks package is used to calculate the bandwidth, which is then used to compute the exact MISE value. This procedure is repeated for a number of data sets. In Figures 19-21 the MISE is visualised in the form of box plots for three sample sizes.

It is seen in Figures 19-21 that the pre-transformation is an important factor. The results seem to agree with the arguments of Section 6.3, since sphering performs better on the skewed densities while scaling is better for the density with multiple orientations. In general, the two-stage method performs equally well as or better than the one-stage method. No distinct conclusion can be drawn regarding the difference between PI and SCV. On some target densities they show similar results for all sample sizes, and where there are clear differences in accuracy the picture is not consistent: PI outperforms SCV on one density for all sample sizes, while SCV seems to be the more robust selector on another. The results also indicate a slight difference in dispersion between the two selectors, with PI having the lowest.

Besides the accuracy study, the execution time for each bandwidth method is recorded. No difference in execution time could be detected between the target densities; a representative result is presented in Figure 22 (the remaining figures are found in Appendix C). PI is faster than SCV in all cases investigated. The difference is especially large for the one-stage method and for the smallest sample size, in which case the speed of PI completely surpasses that of SCV. A possible explanation is that the objective function for SCV is hard to minimise for a small sample size. Increasing the sample size yields significantly reduced execution times for the SCV method, while a further increase does not reduce the execution times to the same extent; this can probably be explained by the binning approximation used in the bandwidth selection. The one-stage method is faster than the two-stage method in all investigated cases, although the differences are in some cases very small, especially for SCV. The patterns of the remaining time plots are roughly similar to the one in Figure 22.

Figure 19: Box plots showing the accuracy of the different bandwidth selection methods (one- and two-stage PI and SCV, with and without sphering (*)) for the smallest sample size.

Figure 20: Box plots showing the accuracy of the different bandwidth selection methods for the intermediate sample size.

Figure 21: Box plots showing the accuracy of the different bandwidth selection methods for the largest sample size.

Figure 22: Execution times for the bandwidth selection methods on one target density for the three sample sizes: (a) smallest, (b) intermediate, (c) largest.

9 Summary and conclusions

In data exploration, KDE is a useful tool for finding underlying PDFs. In this project the focus has been to investigate the properties of different approximations and methods in order to identify an efficient and accurate estimate, with the main attention on binning, bandwidth selection and the use of the FFT.

The test carried out in Section 8.1 shows that linear binning is more accurate than simple binning. Regarding the KDE calculations, the sample size and the grid size are the most important factors for accuracy. A denser grid makes the binned estimate more reliable, but on the other hand it requires additional computations. Using the FFT is shown to be faster than the KDE by definition; however, the KDE by definition is more accurate on a coarse grid. Therefore our recommendation is to use the FFT on a dense grid for a good trade-off between accuracy and speed.

The bandwidth selection can be seen as one of the more crucial parts of the KDE calculations. Some general recommendations can be given from the tests carried out in Section 8.3, even though the results depend strongly on the shape of the target density. First of all, the data should be pre-transformed correctly: if only one orientation is present in the data set, sphering should be seen as the preferable pre-transformation due to its non-destructive properties in that situation, while for data with multiple orientations scaling is the preferable method. A two-stage method should be considered due to its more robust and solid performance compared to the one-stage counterpart, although its computational cost is higher.

Regarding the execution times, the real bottleneck of the calculations is the bandwidth selection. Since the bandwidth selection is performed through an external call to already existing software, profiling is hard to perform; the ks package contains some highly evolved code, including calls to C to improve the speed of time-consuming parts. Compared to the bandwidth selection, the binning and the actual KDE calculation are usually fast.

Since the implemented KDE calculation makes use of already existing software, the portability is somewhat tricky. As an extension of the work it would be highly interesting to have all the code written in Matlab, for portability as well as for analysis purposes. It would also be desirable to have a bandwidth selection especially developed for the Epanechnikov kernel, instead of scaling the Gaussian bandwidth. This would not work for the PI approach, since it requires higher-order derivatives. However, it should be possible to implement for SCV, although some practical

issues must be dealt with, such as the convolutions in equation (24) and the choice of pilot kernel. Another interesting aspect would be to perform the tests on a non-Gaussian target density.

In addition to the theoretical results presented and discussed in this report, the source code and its implementation have been an important part of the work. Anyone interested in the subject who wishes to make use of these resources is welcome to contact the authors at any of the e-mail addresses found below.

Contact Information

Alexander Bilock: alexander.bilock.57@student.uu.se
Carl Jidling: carl.jidling.87@student.uu.se
Ylva Rydin: ylva.rydin.@student.uu.se

Acknowledgements

Thanks to our supervisor David Marquez for his support and comments.

References

[1] M. P. Wand, Fast Computation of Multivariate Kernel Estimators, Journal of Computational and Graphical Statistics (1994).

[2] M. P. Wand, M. C. Jones, Kernel Smoothing, Chapman & Hall, 1st edition (1995).

[3] A. Gramacki, J. Gramacki, FFT-Based Fast Computation of Multivariate Kernel Density Estimators with Unconstrained Bandwidth Matrices, arXiv:1508.02766 (2016).

[4] M. C. Jones, S. J. Sheather, Using non-stochastic terms to advantage in integrated squared density derivatives (1991).

[5] T. Duong, M. L. Hazelton, Plug-In Bandwidth Matrices for Bivariate Kernel Density Estimation, Nonparametric Statistics, Vol. 15(1), pp. 17-30 (2003).

[6] M. P. Wand, M. C. Jones, Comparison of smoothing parameterizations in bivariate kernel density estimation (1993).

[7] M. P. Wand, M. C. Jones, Multivariate Plug-in Bandwidth Selection (1994).

[8] T. Duong, M. L. Hazelton, Cross-validation Bandwidth Matrices for Multivariate Kernel Density Estimation (2005).

[9] S. R. Sain, K. A. Baggerly, D. W. Scott, Cross-validation of multivariate densities (1994).

[10] J. E. Chacón, Cross-validation Bandwidth Matrices for Multivariate Kernel Density Estimation, The Canadian Journal of Statistics.

[11] O. Kolbjørnsen, P. Abrehamsen, Theory of the Cloud Transform for Applications, Geostatistics Banff (7th International Geostatistics Congress) (2004).

[12] T. Duong, Spherically symmetric multivariate beta family kernels, Statistics and Probability Letters (2015).

Appendix A: Comparison of binning methods

Accuracy

In the accuracy figures below, stars and circles correspond to simple binning and linear binning, respectively; each figure shows log RIMSE versus the number of grid points for four different sample sizes.

Figure 23: Target density 1.

Figure 24: Target density 2.

Figure 25: Target density 3.

Figure 26: Target density 4.

Execution time

Figure 27: Execution times for simple and linear binning, target density 1.

Figure 28: Execution times for simple and linear binning, target density 2.

Figure 29: Execution times for simple and linear binning, target density 3.


GRAPHICS PROCESSING UNITS IN ACCELERATION OF BANDWIDTH SELECTION FOR KERNEL DENSITY ESTIMATION

GRAPHICS PROCESSING UNITS IN ACCELERATION OF BANDWIDTH SELECTION FOR KERNEL DENSITY ESTIMATION Int. J. Appl. Math. Comput. Sci., 2013, Vol. 23, No. 4, 869 885 DOI: 10.2478/amcs-2013-0065 GRAPHICS PROCESSING UNITS IN ACCELERATION OF BANDWIDTH SELECTION FOR KERNEL DENSITY ESTIMATION WITOLD ANDRZEJEWSKI,

More information

2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001)

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001) An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (000/001) Summary The objectives of this project were as follows: 1) Investigate iterative

More information

2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

More information

A Bayesian approach to parameter estimation for kernel density estimation via transformations

A Bayesian approach to parameter estimation for kernel density estimation via transformations A Bayesian approach to parameter estimation for kernel density estimation via transformations Qing Liu,, David Pitt 2, Xibin Zhang 3, Xueyuan Wu Centre for Actuarial Studies, Faculty of Business and Economics,

More information

Chapter 4: Non-Parametric Techniques

Chapter 4: Non-Parametric Techniques Chapter 4: Non-Parametric Techniques Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Supervised Learning How to fit a density

More information

An Introduction to Markov Chain Monte Carlo

An Introduction to Markov Chain Monte Carlo An Introduction to Markov Chain Monte Carlo Markov Chain Monte Carlo (MCMC) refers to a suite of processes for simulating a posterior distribution based on a random (ie. monte carlo) process. In other

More information

Level-set MCMC Curve Sampling and Geometric Conditional Simulation

Level-set MCMC Curve Sampling and Geometric Conditional Simulation Level-set MCMC Curve Sampling and Geometric Conditional Simulation Ayres Fan John W. Fisher III Alan S. Willsky February 16, 2007 Outline 1. Overview 2. Curve evolution 3. Markov chain Monte Carlo 4. Curve

More information

Multivariate Standard Normal Transformation

Multivariate Standard Normal Transformation Multivariate Standard Normal Transformation Clayton V. Deutsch Transforming K regionalized variables with complex multivariate relationships to K independent multivariate standard normal variables is an

More information

Physics 736. Experimental Methods in Nuclear-, Particle-, and Astrophysics. - Statistical Methods -

Physics 736. Experimental Methods in Nuclear-, Particle-, and Astrophysics. - Statistical Methods - Physics 736 Experimental Methods in Nuclear-, Particle-, and Astrophysics - Statistical Methods - Karsten Heeger heeger@wisc.edu Course Schedule and Reading course website http://neutrino.physics.wisc.edu/teaching/phys736/

More information

10.4 Linear interpolation method Newton s method

10.4 Linear interpolation method Newton s method 10.4 Linear interpolation method The next best thing one can do is the linear interpolation method, also known as the double false position method. This method works similarly to the bisection method by

More information

Generalized Additive Model

Generalized Additive Model Generalized Additive Model by Huimin Liu Department of Mathematics and Statistics University of Minnesota Duluth, Duluth, MN 55812 December 2008 Table of Contents Abstract... 2 Chapter 1 Introduction 1.1

More information

Challenge Problem 5 - The Solution Dynamic Characteristics of a Truss Structure

Challenge Problem 5 - The Solution Dynamic Characteristics of a Truss Structure Challenge Problem 5 - The Solution Dynamic Characteristics of a Truss Structure In the final year of his engineering degree course a student was introduced to finite element analysis and conducted an assessment

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Improving the Post-Smoothing of Test Norms with Kernel Smoothing

Improving the Post-Smoothing of Test Norms with Kernel Smoothing Improving the Post-Smoothing of Test Norms with Kernel Smoothing Anli Lin Qing Yi Michael J. Young Pearson Paper presented at the Annual Meeting of National Council on Measurement in Education, May 1-3,

More information

3 Nonlinear Regression

3 Nonlinear Regression CSC 4 / CSC D / CSC C 3 Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Jason Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Nonparametric Methods 1 / 49 Nonparametric Methods Overview Previously, we ve assumed that the forms of the underlying densities

More information

SAS/STAT 13.2 User s Guide. The KDE Procedure

SAS/STAT 13.2 User s Guide. The KDE Procedure SAS/STAT 13.2 User s Guide The KDE Procedure This document is an individual chapter from SAS/STAT 13.2 User s Guide. The correct bibliographic citation for the complete manual is as follows: SAS Institute

More information

Schedule for Rest of Semester

Schedule for Rest of Semester Schedule for Rest of Semester Date Lecture Topic 11/20 24 Texture 11/27 25 Review of Statistics & Linear Algebra, Eigenvectors 11/29 26 Eigenvector expansions, Pattern Recognition 12/4 27 Cameras & calibration

More information

Kernel Density Estimation

Kernel Density Estimation Kernel Density Estimation An Introduction Justus H. Piater, Université de Liège Overview 1. Densities and their Estimation 2. Basic Estimators for Univariate KDE 3. Remarks 4. Methods for Particular Domains

More information

Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8

Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8 Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8 Grade 6 Grade 8 absolute value Distance of a number (x) from zero on a number line. Because absolute value represents distance, the absolute value

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction A Monte Carlo method is a compuational method that uses random numbers to compute (estimate) some quantity of interest. Very often the quantity we want to compute is the mean of

More information

Ultrasonic Multi-Skip Tomography for Pipe Inspection

Ultrasonic Multi-Skip Tomography for Pipe Inspection 18 th World Conference on Non destructive Testing, 16-2 April 212, Durban, South Africa Ultrasonic Multi-Skip Tomography for Pipe Inspection Arno VOLKER 1, Rik VOS 1 Alan HUNTER 1 1 TNO, Stieltjesweg 1,

More information

Chapter 2 Describing, Exploring, and Comparing Data

Chapter 2 Describing, Exploring, and Comparing Data Slide 1 Chapter 2 Describing, Exploring, and Comparing Data Slide 2 2-1 Overview 2-2 Frequency Distributions 2-3 Visualizing Data 2-4 Measures of Center 2-5 Measures of Variation 2-6 Measures of Relative

More information

Probability and Statistics for Final Year Engineering Students

Probability and Statistics for Final Year Engineering Students Probability and Statistics for Final Year Engineering Students By Yoni Nazarathy, Last Updated: April 11, 2011. Lecture 1: Introduction and Basic Terms Welcome to the course, time table, assessment, etc..

More information

Middle School Math Course 3

Middle School Math Course 3 Middle School Math Course 3 Correlation of the ALEKS course Middle School Math Course 3 to the Texas Essential Knowledge and Skills (TEKS) for Mathematics Grade 8 (2012) (1) Mathematical process standards.

More information

An Interval-Based Tool for Verified Arithmetic on Random Variables of Unknown Dependency

An Interval-Based Tool for Verified Arithmetic on Random Variables of Unknown Dependency An Interval-Based Tool for Verified Arithmetic on Random Variables of Unknown Dependency Daniel Berleant and Lizhi Xie Department of Electrical and Computer Engineering Iowa State University Ames, Iowa

More information

brahim KARA and Nihan HOSKAN

brahim KARA and Nihan HOSKAN Acta Geophysica vol. 64, no. 6, Dec. 2016, pp. 2232-2243 DOI: 10.1515/acgeo-2016-0097 An Easy Method for Interpretation of Gravity Anomalies Due to Vertical Finite Lines brahim KARA and Nihan HOSKAN Department

More information

Edge and corner detection

Edge and corner detection Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements

More information

Direction Fields; Euler s Method

Direction Fields; Euler s Method Direction Fields; Euler s Method It frequently happens that we cannot solve first order systems dy (, ) dx = f xy or corresponding initial value problems in terms of formulas. Remarkably, however, this

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information

Name Course Days/Start Time

Name Course Days/Start Time Name Course Days/Start Time Mini-Project : The Library of Functions In your previous math class, you learned to graph equations containing two variables by finding and plotting points. In this class, we

More information

Package r2d2. February 20, 2015

Package r2d2. February 20, 2015 Package r2d2 February 20, 2015 Version 1.0-0 Date 2014-03-31 Title Bivariate (Two-Dimensional) Confidence Region and Frequency Distribution Author Arni Magnusson [aut], Julian Burgos [aut, cre], Gregory

More information

Getting to Know Your Data

Getting to Know Your Data Chapter 2 Getting to Know Your Data 2.1 Exercises 1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss

More information

Assessing the Quality of the Natural Cubic Spline Approximation

Assessing the Quality of the Natural Cubic Spline Approximation Assessing the Quality of the Natural Cubic Spline Approximation AHMET SEZER ANADOLU UNIVERSITY Department of Statisticss Yunus Emre Kampusu Eskisehir TURKEY ahsst12@yahoo.com Abstract: In large samples,

More information

Optimised corrections for finite-difference modelling in two dimensions

Optimised corrections for finite-difference modelling in two dimensions Optimized corrections for 2D FD modelling Optimised corrections for finite-difference modelling in two dimensions Peter M. Manning and Gary F. Margrave ABSTRACT Finite-difference two-dimensional correction

More information

Maths Year 11 Mock Revision list

Maths Year 11 Mock Revision list Maths Year 11 Mock Revision list F = Foundation Tier = Foundation and igher Tier = igher Tier Number Tier Topic know and use the word integer and the equality and inequality symbols use fractions, decimals

More information

DATA DEPTH AND ITS APPLICATIONS IN CLASSIFICATION

DATA DEPTH AND ITS APPLICATIONS IN CLASSIFICATION DATA DEPTH AND ITS APPLICATIONS IN CLASSIFICATION Ondrej Vencalek Department of Mathematical Analysis and Applications of Mathematics Palacky University Olomouc, CZECH REPUBLIC e-mail: ondrej.vencalek@upol.cz

More information

ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD. Julius Goodman. Bechtel Power Corporation E. Imperial Hwy. Norwalk, CA 90650, U.S.A.

ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD. Julius Goodman. Bechtel Power Corporation E. Imperial Hwy. Norwalk, CA 90650, U.S.A. - 430 - ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD Julius Goodman Bechtel Power Corporation 12400 E. Imperial Hwy. Norwalk, CA 90650, U.S.A. ABSTRACT The accuracy of Monte Carlo method of simulating

More information

3 Nonlinear Regression

3 Nonlinear Regression 3 Linear models are often insufficient to capture the real-world phenomena. That is, the relation between the inputs and the outputs we want to be able to predict are not linear. As a consequence, nonlinear

More information

Detecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2. Xi Wang and Ronald K. Hambleton

Detecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2. Xi Wang and Ronald K. Hambleton Detecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2 Xi Wang and Ronald K. Hambleton University of Massachusetts Amherst Introduction When test forms are administered to

More information

Using the DATAMINE Program

Using the DATAMINE Program 6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection

More information

AMTH142 Lecture 10. Scilab Graphs Floating Point Arithmetic

AMTH142 Lecture 10. Scilab Graphs Floating Point Arithmetic AMTH142 Lecture 1 Scilab Graphs Floating Point Arithmetic April 2, 27 Contents 1.1 Graphs in Scilab......................... 2 1.1.1 Simple Graphs...................... 2 1.1.2 Line Styles........................

More information

Package feature. R topics documented: July 8, Version Date 2013/07/08

Package feature. R topics documented: July 8, Version Date 2013/07/08 Package feature July 8, 2013 Version 1.2.9 Date 2013/07/08 Title Feature significance for multivariate kernel density estimation Author Tarn Duong & Matt Wand

More information

Package feature. R topics documented: October 26, Version Date

Package feature. R topics documented: October 26, Version Date Version 1.2.13 Date 2015-10-26 Package feature October 26, 2015 Title Local Inferential Feature Significance for Multivariate Kernel Density Estimation Author Tarn Duong & Matt Wand

More information

EECS 556 Image Processing W 09. Interpolation. Interpolation techniques B splines

EECS 556 Image Processing W 09. Interpolation. Interpolation techniques B splines EECS 556 Image Processing W 09 Interpolation Interpolation techniques B splines What is image processing? Image processing is the application of 2D signal processing methods to images Image representation

More information

MetroPro Surface Texture Parameters

MetroPro Surface Texture Parameters MetroPro Surface Texture Parameters Contents ROUGHNESS PARAMETERS...1 R a, R q, R y, R t, R p, R v, R tm, R z, H, R ku, R 3z, SR z, SR z X, SR z Y, ISO Flatness WAVINESS PARAMETERS...4 W a, W q, W y HYBRID

More information

Software Tutorial Session Universal Kriging

Software Tutorial Session Universal Kriging Software Tutorial Session Universal Kriging The example session with PG2000 which is described in this and Part 1 is intended as an example run to familiarise the user with the package. This documented

More information

INDEPENDENT COMPONENT ANALYSIS WITH QUANTIZING DENSITY ESTIMATORS. Peter Meinicke, Helge Ritter. Neuroinformatics Group University Bielefeld Germany

INDEPENDENT COMPONENT ANALYSIS WITH QUANTIZING DENSITY ESTIMATORS. Peter Meinicke, Helge Ritter. Neuroinformatics Group University Bielefeld Germany INDEPENDENT COMPONENT ANALYSIS WITH QUANTIZING DENSITY ESTIMATORS Peter Meinicke, Helge Ritter Neuroinformatics Group University Bielefeld Germany ABSTRACT We propose an approach to source adaptivity in

More information

Three Different Algorithms for Generating Uniformly Distributed Random Points on the N-Sphere

Three Different Algorithms for Generating Uniformly Distributed Random Points on the N-Sphere Three Different Algorithms for Generating Uniformly Distributed Random Points on the N-Sphere Jan Poland Oct 4, 000 Abstract We present and compare three different approaches to generate random points

More information

Chapter 7: Dual Modeling in the Presence of Constant Variance

Chapter 7: Dual Modeling in the Presence of Constant Variance Chapter 7: Dual Modeling in the Presence of Constant Variance 7.A Introduction An underlying premise of regression analysis is that a given response variable changes systematically and smoothly due to

More information

Chapter Two: Descriptive Methods 1/50

Chapter Two: Descriptive Methods 1/50 Chapter Two: Descriptive Methods 1/50 2.1 Introduction 2/50 2.1 Introduction We previously said that descriptive statistics is made up of various techniques used to summarize the information contained

More information

Scaled representations

Scaled representations Scaled representations Big bars (resp. spots, hands, etc.) and little bars are both interesting Stripes and hairs, say Inefficient to detect big bars with big filters And there is superfluous detail in

More information

Visualizing and Exploring Data

Visualizing and Exploring Data Visualizing and Exploring Data Sargur University at Buffalo The State University of New York Visual Methods for finding structures in data Power of human eye/brain to detect structures Product of eons

More information

In the real world, light sources emit light particles, which travel in space, reflect at objects or scatter in volumetric media (potentially multiple

In the real world, light sources emit light particles, which travel in space, reflect at objects or scatter in volumetric media (potentially multiple 1 In the real world, light sources emit light particles, which travel in space, reflect at objects or scatter in volumetric media (potentially multiple times) until they are absorbed. On their way, they

More information

ADAPTIVE METROPOLIS-HASTINGS SAMPLING, OR MONTE CARLO KERNEL ESTIMATION

ADAPTIVE METROPOLIS-HASTINGS SAMPLING, OR MONTE CARLO KERNEL ESTIMATION ADAPTIVE METROPOLIS-HASTINGS SAMPLING, OR MONTE CARLO KERNEL ESTIMATION CHRISTOPHER A. SIMS Abstract. A new algorithm for sampling from an arbitrary pdf. 1. Introduction Consider the standard problem of

More information

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

LOCAL BANDWIDTH SELECTION FOR KERNEL ESTIMATION OF' POPULATION DENSITIES WITH LINE TRANSECT SAMPLING

LOCAL BANDWIDTH SELECTION FOR KERNEL ESTIMATION OF' POPULATION DENSITIES WITH LINE TRANSECT SAMPLING LOCAL BANDWIDTH SELECTION FOR KERNEL ESTIMATION OF' POPULATION DENSITIES WITH LINE TRANSECT SAMPLING Patrick D. Gerard Experimental Statistics Unit Mississippi State University, Mississippi 39762 William

More information

Learning Objectives. Continuous Random Variables & The Normal Probability Distribution. Continuous Random Variable

Learning Objectives. Continuous Random Variables & The Normal Probability Distribution. Continuous Random Variable Learning Objectives Continuous Random Variables & The Normal Probability Distribution 1. Understand characteristics about continuous random variables and probability distributions 2. Understand the uniform

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information