Modelling Bivariate Distributions Using Kernel Density Estimation

Alexander Bilock, Carl Jidling and Ylva Rydin

Project in Computational Science, January 2016

Department of Information Technology

Abstract

Kernel density estimation is a topic covering methods for computing continuous estimates of the underlying probability density function of a data set. A wide range of approximation methods are available for this purpose; these include the use of binning on coarser grids and the fast Fourier transform (FFT) in order to speed up the calculations. A key factor in the kernel density estimation process is the selection of the so-called kernel bandwidth. The aim of this project is to implement different kernel density estimation approaches proposed in the literature and compare their performance in terms of speed and accuracy. Matlab is used as the main environment for the implementation. The results show that using the FFT can speed up the calculation with almost maintained accuracy if the data is binned on a dense grid. Some general advice for the selection of the kernel bandwidth is also discussed.

Contents

1 Introduction
2 Univariate kernel density estimates
3 Bivariate kernel density estimates
4 Error Estimation
5 Approximations
  5.1 Binning
  5.2 Fourier transform
6 Bandwidth selection
  6.1 Plug-in bandwidth selection
  6.2 Cross validation
    6.2.1 Smoothed cross validation
  6.3 Pre-transformation
7 Applications of KDE
  7.1 Cloud transform
  7.2 Examples with real data
8 Method and results
  8.1 Comparison of binning methods
  8.2 Comparison of KDE-calculation methods
  8.3 Comparison of bandwidth selection methods
9 Summary and conclusions
A Comparison of binning methods
B Comparison of KDE-calculation methods
C Comparison of bandwidth selection methods

1 Introduction

In many fields of science data exploration is of significant importance. In one dimension, investigating the properties of a data set can often be done intuitively. However, in higher dimensions detecting properties such as skewness and multi-modality may be difficult. In lower dimensions histograms can be used to reveal some of the properties, but a smooth estimate of the underlying probability density function (PDF) is often desired. A popular method for obtaining one is kernel density estimation (KDE). The purpose of this work is to implement two-dimensional KDEs in Matlab using different methods and to investigate them in terms of accuracy and speed.

In Sections 2 and 3 the theory of kernel density estimation is presented. Error estimation is introduced in Section 4. Section 5 describes approximative ways of calculating KDEs in order to increase the speed. In Section 6 the bandwidth concept is introduced, with a walk-through of existing algorithms. An application field for KDEs is introduced in Section 7, including some examples with geostatistical data. Section 8 presents the methods of and results from the performance study. Conclusions and analysis are found in Section 9.

2 Univariate kernel density estimates

One way to explore the properties of a data set is by constructing a histogram. If the histogram is normalised, it yields a non-smooth representation of the PDF. A KDE is used to get a smooth estimate of the PDF instead. The univariate KDE $\hat f$ of the PDF $f$ is defined as

$$\hat f(x, h) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) \qquad (1)$$

for a data set with $n$ samples $x = [x_1, x_2, \ldots, x_n]$ from $f$. The kernel function $K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right)$ is a symmetric and non-negative function fulfilling $\int_{\mathbb{R}} K_h(u)\,du = 1$. There is a wide range of kernels, although the choice of kernel function does not have a significant impact on the estimator. In this work the two most commonly used kernels have been considered, namely the Gaussian kernel and the Epanechnikov kernel, both defined below.
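As an illustration of definition (1), the following Matlab sketch evaluates a univariate Gaussian KDE directly on a grid. This is a minimal sketch and not the project's code; the function name kde1d_def and the variable names are our own.

    function fhat = kde1d_def(x, grid, h)
    % Univariate KDE by definition (1) with a Gaussian kernel.
    % x    - vector of n data points
    % grid - vector of evaluation points
    % h    - scalar bandwidth
    n = numel(x);
    fhat = zeros(size(grid));
    for i = 1:n
        u = (grid - x(i)) / h;
        fhat = fhat + exp(-u.^2 / 2) / (h * sqrt(2*pi));  % K_h(grid - x_i)
    end
    fhat = fhat / n;
    end

For a small data set such as the one in Figure 1, a call could look like fhat = kde1d_def(data, linspace(-3, 3, 200), 0.75), with h = 0.75 as in the figure.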

The Gaussian kernel is

$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}, \qquad (2)$$

and the Epanechnikov kernel is

$$K(u) = \frac{3}{4}\,(1 - u^2)\, \mathbf{1}_{\{|u| < 1\}}, \qquad (3)$$

where $\mathbf{1}_{\{|u|<1\}}$ is the indicator function, equal to 1 if $|u| < 1$ and 0 otherwise. The main difference between these kernels is that while the Gaussian kernel has infinite support (it is non-zero everywhere), the Epanechnikov kernel is non-zero only on a limited domain.

The parameter $h$ is called the bandwidth of the kernel. The choice of $h$ is the most important factor for the accuracy of the estimate. The bandwidth selection methods used in this project are described in Section 6. A simple visualisation is seen in Figure 1. It shows a KDE of a data set with six points, calculated with a Gaussian kernel and $h = 0.75$. For comparison, a histogram constructed from the same points is shown as well. In the left part of the figure the blue dots are the data points and the red curves are the kernels evaluated at each point. The green curve is the final KDE.

3 Bivariate kernel density estimates

In the bivariate case the data points are represented by two vectors $x_1 = [x_{11}, x_{12}, \ldots, x_{1n}]$ and $x_2 = [x_{21}, x_{22}, \ldots, x_{2n}]$, where $x_i = (x_{1i}, x_{2i})$ is a sample from a bivariate distribution $f$. In analogy with the univariate case, the bivariate kernel density estimate is defined as

$$\hat f(x, H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i). \qquad (4)$$

Figure 1: Kernel density estimation (a) and histogram (b) for a data set with six points.

Here the bandwidth is the positive definite matrix

$$H = \begin{bmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{bmatrix}, \qquad (5)$$

and the kernel function $K_H$ is a symmetric and non-negative function fulfilling $\int_{\mathbb{R}^2} K_H(u)\,du = 1$. In the bivariate case $K_H(u) = |H|^{-1/2} K(H^{-1/2} u)$. As in the univariate case, the bivariate kernels used in this work have been the Gaussian kernel,

$$K(u) = \frac{1}{2\pi}\, e^{-\frac{1}{2} u^T u}, \qquad (6)$$

and the Epanechnikov kernel,

$$K(u) = \frac{2}{\pi}\,(1 - u^T u)\, \mathbf{1}_{\{u^T u < 1\}}. \qquad (7)$$

Figure 2 demonstrates the difference between a bivariate histogram and a kernel density estimate. It shows a data set generated from a combination of two bivariate normal distributions, visualised through a scatter plot, a histogram, a Gaussian kernel density estimate and the true PDF.
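Definition (4) translates directly into Matlab. The sketch below evaluates the bivariate Gaussian KDE with kernel (6) on a rectangular grid; it is our own illustrative implementation, not the project's code, and it is reused in later sketches.

    function fhat = kde2d_def(X, gx, gy, H)
    % Bivariate KDE by definition (4) with the Gaussian kernel (6).
    % X      - n-by-2 matrix of data points
    % gx, gy - vectors defining the evaluation grid
    % H      - 2-by-2 symmetric positive definite bandwidth matrix
    [GX, GY] = meshgrid(gx, gy);
    P = [GX(:), GY(:)];                  % grid points as rows
    n = size(X, 1);
    Hinv = inv(H);
    c = 1 / (2*pi*sqrt(det(H)));         % (2*pi)^(-1) * |H|^(-1/2)
    fhat = zeros(size(P, 1), 1);
    for i = 1:n
        U = bsxfun(@minus, P, X(i, :));  % grid points relative to x_i
        q = sum((U * Hinv) .* U, 2);     % quadratic form u' * H^(-1) * u
        fhat = fhat + c * exp(-q / 2);
    end
    fhat = reshape(fhat / n, size(GX));
    end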

Figure 2: Comparison between (a) a scatter plot, (b) the true density, (c) a histogram and (d) a KDE for a data set generated from two normal distributions.

4 Error Estimation

To assess the closeness of a kernel density estimator to the target density, an error criterion must be used. A common error measure for kernel density estimation is the Mean Integrated Square Error (MISE),

$$\mathrm{MISE}(\hat f) = \mathrm{E} \int \left( \hat f(x, H) - f(x) \right)^2 dx. \qquad (8)$$

Since the MISE depends on the true density $f$, it can only be calculated for data sets drawn from known distributions. The MISE can be approximated with the Integrated Mean Square Error (IMSE); the expression for the IMSE is obtained by moving the expectation value in (8) inside the integral. The IMSE can be calculated numerically using, for instance, Monte Carlo integration. The algorithm goes as follows (a code sketch is given at the end of this section):

- Generate m data sets, each with n random points drawn from the density f, and define a uniform grid [X, Y].
- Generate a set of k uniformly distributed random points x_c on the grid.
- For each of the m data sets, calculate a KDE f̂ and evaluate it on the grid.
- Use linear interpolation to obtain an approximation f̂(x_c, h) of the KDE in the random points x_c.
- The Mean Squared Error (MSE) in each point x_c is given by

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat f_i(x_c, h) - f(x_c) \right)^2. \qquad (9)$$

The Integrated Mean Square Error is then approximated as $\mathrm{IMSE} = \overline{\mathrm{MSE}} \cdot A$, where $\overline{\mathrm{MSE}}$ is the mean of the MSE over all Monte Carlo points and A is the area of the domain spanned by the grid [X, Y].

In some situations it is more interesting to study the Integrated Square Error (ISE). The difference from the IMSE calculation above is that no mean over the data sets is taken to form the MSE. Instead, the squared error is saved for each data set. The result can thereafter be integrated as above to form the ISE and presented, e.g., in box plots to visualise the deviations from its mean value, which then is an approximate MISE.

Given the number of sample points and the bandwidth matrix, exact values of the MISE can be calculated in closed form if f is a combination of normal distributions and K is the Gaussian kernel, as described in [2]. This closed form can be used in comparison studies of bandwidth selection methods.

The Asymptotic MISE (AMISE) is an approximation of the MISE used in bandwidth selection, since it depends on the bandwidth h in a simpler way. In Wand and Jones (1995) [2] it is stated that, under certain assumptions on f, h and K,

$$\mathrm{AMISE}(\hat f) = \frac{R(K)}{nh} + \frac{h^4}{4}\, \mu_2(K)^2\, R(f''), \qquad (10)$$

where $R(L) = \int L(x)^2\, dx$ and $\mu_2(L) = \int x^2 L(x)\, dx$ for any function $L$.
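The IMSE procedure above can be condensed into the following Matlab sketch for the bivariate case. It assumes a sampler randf(n) and a true density truef(x, y) for the target distribution, neither of which is part of the report's code, together with the kde2d_def sketch from Section 3; the sizes are arbitrary examples.

    % Monte Carlo estimate of the IMSE, following the algorithm above.
    m = 100; n = 1000; k = 5000;               % example sizes
    gx = linspace(-4, 4, 64); gy = linspace(-4, 4, 64);
    xc = -4 + 8*rand(k, 1); yc = -4 + 8*rand(k, 1);   % Monte Carlo points
    H = [0.2, 0.05; 0.05, 0.2];                % some fixed bandwidth matrix
    sqerr = zeros(k, m);
    for i = 1:m
        X = randf(n);                          % the i:th data set
        fhat = kde2d_def(X, gx, gy, H);        % KDE evaluated on the grid
        fi = interp2(gx, gy, fhat, xc, yc);    % linear interpolation at x_c
        sqerr(:, i) = (fi - truef(xc, yc)).^2;
    end
    MSE = mean(sqerr, 2);                      % mean over the m data sets, eq. (9)
    A = 8 * 8;                                 % area of the domain
    IMSE = mean(MSE) * A;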

5 Approximations

5.1 Binning

In many practical applications direct computation of the kernel density estimate is too computationally expensive. One strategy to reduce the computational load is binning. Instead of evaluating the kernels at each data point, an approximation is made by binning the data onto the grid where the KDE is calculated. In this way the number of kernel evaluations is changed from O(nM) to O(M²), where M is the number of grid points (in any dimension). This implies that binning reduces the computational burden provided that the number of data points exceeds the number of grid points (neglecting the time required for the binning itself). The expression for the approximate, binned KDE in dimension d is

$$\tilde f(x_i) = \frac{1}{n} \sum_{l_1 = 1}^{M_1} \cdots \sum_{l_d = 1}^{M_d} K_H(x_i - x_l)\, c_l, \qquad (11)$$

where $c_l$ is the weight assigned to the grid point $x_l$.

The two most commonly used binning rules are simple binning and linear binning. In the univariate case, simple binning assigns a unit mass to the grid point nearest to the data point x. In the case of linear binning, x gives a weighted contribution to both of the surrounding grid points: if y and z are the left and right surrounding grid points, the weighted masses are (z - x)/(z - y) for y and (x - y)/(z - y) for z. The extension to the bivariate case and higher dimensions is straightforward; the line between the two closest grid points in one dimension is replaced by the area enclosed by the four surrounding grid points in the bivariate case, and so on with volumes in higher dimensions.

The approximation by linear binning is considerably more accurate than simple binning. Moreover, the number of grid points can be a quarter as many for linear binning as compared to simple binning with maintained accuracy [1]. Figure 3 illustrates a bivariate example of linear binning; a univariate code sketch is given below.

Figure 3: Bivariate linear binning, with green markers as data, the mesh represented by blue lines and the scaled weight contributions as filled red circles.
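A minimal Matlab sketch of univariate linear binning as just described; the grid handling and all names are our own choices.

    function c = linbin1d(x, g)
    % Univariate linear binning: each data point splits a unit mass between
    % its two surrounding grid points, weights (z-x)/(z-y) and (x-y)/(z-y).
    % x - vector of data points (assumed to lie inside the grid range)
    % g - vector of equally spaced grid points
    c = zeros(size(g));
    delta = g(2) - g(1);
    for i = 1:numel(x)
        j = floor((x(i) - g(1)) / delta) + 1;  % index of the left grid point y
        j = min(j, numel(g) - 1);              % guard for points on the last edge
        w = (x(i) - g(j)) / delta;             % fraction of the mass given to z
        c(j) = c(j) + (1 - w);
        c(j+1) = c(j+1) + w;
    end
    end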

5.2 Fourier transform

As described in Section 5.1, an approximation of the KDE can be calculated by binning the data and assigning a weight to each grid point. The more the number of data points exceeds the number of grid points, the faster the binned calculation will be compared to calculation by the definition. The speed can be increased further by making use of the fast Fourier transform (FFT). The key point is that expression (11) for the binned approximation can be rewritten in the form of a convolution,

$$\tilde f_j = \sum_{l_1 = -(L_1 - 1)}^{L_1 - 1} \cdots \sum_{l_d = -(L_d - 1)}^{L_d - 1} c_{j-l}\, k_l, \qquad (12)$$

where $L_i = M_i$, although it can be shrunk for a slightly reduced computational burden. Furthermore $k_l = \frac{1}{n} K_H(\delta_1 l_1, \ldots, \delta_d l_d)$, where $\delta_i$ is the mesh size in direction $i$. With the convolution form (12) the Fourier transform can easily be applied, and using the FFT is recommended since the computational load is reduced from O(M²) to O(M log M).
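To illustrate the convolution form (12) in the univariate case, the sketch below evaluates a binned Gaussian KDE with Matlab's fft/ifft pair, using zero-padding to obtain a linear (non-circular) convolution. It mirrors the structure of the algorithms in [1, 3] but is not a reproduction of them; linbin1d is the sketch from Section 5.1, and the data and bandwidth are arbitrary examples.

    % Univariate binned KDE evaluated with the FFT (illustration of (12)).
    g = linspace(-4, 4, 401); delta = g(2) - g(1);
    x = randn(1e5, 1); h = 0.2;
    x = x(x > g(1) & x < g(end));             % keep points inside the grid
    n = numel(x);
    c = linbin1d(x, g);                       % grid counts
    L = numel(g) - 1;
    l = -L:L;                                 % kernel on all grid offsets
    k = exp(-(delta*l/h).^2 / 2) / (h*sqrt(2*pi)) / n;
    P = 2^nextpow2(numel(c) + numel(k));      % zero-padded transform length
    u = ifft(fft(c(:), P) .* fft(k(:), P));   % full linear convolution
    fhat = real(u(L+1 : L+numel(g)));         % extract the aligned part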

An FFT method for KDE calculations is presented by Wand in [1]. This algorithm, however, suffers from the drawback of not allowing unconstrained bandwidth matrices. A corrected version of the algorithm was recently presented by Gramacki and Gramacki in [3], which is the one used in the implementation of this work. As can be seen in Section 8, the FFT method surpasses the binned calculation (11) in terms of computation time. Regarding the accuracy, no numerical difference has been detected. However, the FFT method may introduce some visual artifacts, as seen in Figure 4. This is assumed to be caused by numerical errors due to the limited precision of the floating point format. Attempts to remove the effect by an extended zero-padding of the computational domain turned out unsuccessful.

Figure 4: Approximative versions of the KDE in Figure 2(d). Linear binning has been used in both cases, but the KDE to the right has been calculated using the FFT, which has introduced some artifacts.

6 Bandwidth selection

An implementation of kernel density estimation requires the selection of a bandwidth, denoted h in the univariate case and H in the bivariate case. The choice of bandwidth has been shown to be of greater importance than the actual choice of kernel [2]. Figure 5 demonstrates the importance of an appropriate bandwidth. In 5(a) the KDE is over-smoothed because of a too large value of h, and it therefore misses some of the distribution's structural behaviour. On the other hand, a too small h, as in 5(b), makes the KDE under-smoothed. In 5(c) the bandwidth is calculated according to Silverman's rule of thumb, described in [2], and the KDE seems to catch the actual bimodality of the distribution.

Figure 5: Kernel density estimation with a Gaussian kernel for three different values of h, for a data set sampled from a combined normal density: (a) over-smoothed, (b) under-smoothed, (c) bandwidth given by Silverman's rule of thumb.

In the univariate case it is possible to choose the bandwidth by inspection. This is done by calculating the KDE for a large number of values of h, decreasing h until the KDE in some sense looks satisfying. This approach is also possible in the bivariate case, but in higher dimensions the data cannot be visualised intuitively. Visual inspection also assumes some knowledge of the data, for example the positions of the modes. In many situations the distribution is totally unknown, and an automatic bandwidth selection is preferred in order to avoid the problems of the inspection method.

The previously mentioned rule of thumb is a bandwidth selection method which is very easy to understand and implement (a sketch is given below). It gives a satisfying result in many situations and can serve as a useful starting point. However, the method lacks in terms of robustness and optimality.
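For reference, here is a Matlab sketch of a common univariate form of Silverman's rule of thumb, using the robust spread estimate min(std, IQR/1.34) given in [2]; the report does not state exactly which variant the project used, so the constants below are an assumption.

    function h = silverman_rot(x)
    % Silverman's rule-of-thumb bandwidth for a univariate Gaussian KDE.
    x = sort(x(:));
    n = numel(x);
    % Crude quartile estimate, to stay within base Matlab (assumes large n).
    iqr_x = x(round(0.75*n)) - x(max(1, round(0.25*n)));
    s = min(std(x), iqr_x / 1.34);
    h = 0.9 * s * n^(-1/5);
    end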

More robust, and in some sense optimal, alternatives to the rule of thumb try to minimise the AMISE. Calculating such a bandwidth is manageable in the univariate case but becomes very complex in higher dimensions. The extension to bivariate bandwidth selection increases the complexity significantly, since the bivariate bandwidth H is the matrix defined in equation (5). Often some simplification can be made by considering diagonal H, and in some cases it has been shown that a diagonal H can be sufficient [6]. On the other hand, a diagonal H does not support an arbitrary change of the kernel orientation, which in some cases is quite crucial. In the next two sections the main classes of bandwidth selection methods will be presented, namely plug-in methods (PI) and cross validation (CV).

6.1 Plug-in bandwidth selection

As previously mentioned, most available bandwidth selection methods aim to minimise the asymptotic error estimate AMISE. In the univariate case the following expression for the optimal bandwidth h_AMISE can be obtained by differentiating the AMISE expression (10) with respect to h and setting the derivative equal to zero:

$$h_{\mathrm{AMISE}} = \left[ \frac{R(K)}{\mu_2(K)^2\, R(f'')\, n} \right]^{1/5}. \qquad (13)$$

Usually the only unknown quantity in the expression above is the actual probability density function f. In the plug-in method, R(f'') is replaced by the kernel functional estimator $\hat\psi_4(g)$, obtained from the formula

$$\hat\psi_r(g) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} L_g^{(r)}(X_i - X_j), \qquad (14)$$

where $L_g$ is an appropriate kernel and $g$ is the pilot bandwidth. The pilot bandwidth is usually chosen by applying the formula for the AMISE optimal bandwidth again,

$$g_{\mathrm{AMISE}} = \left[ \frac{-2\, K^{(4)}(0)}{\mu_2(K)\, \psi_6\, n} \right]^{1/7}. \qquad (15)$$

This has the effect of introducing $\psi_6$, which requires a new pilot bandwidth to be estimated; every new estimate $\hat\psi_r$ will depend on $\psi_{r+2}$.

The common solution to this problem is at some point to estimate $\psi_r$ with an easily obtained estimate, such as the rule of thumb, instead of an AMISE-based approximation. This yields a variety of plug-in methods, differing in the number of steps in which kernel functional estimators are obtained before the simple estimate is applied. If k stages are applied before the simple estimate, it is referred to as a k-stage plug-in method. Several versions of the PI method have been developed. The most well-known univariate plug-in selector is the algorithm developed by Sheather and Jones (1991) [4]. The plug-in method can be extended to several dimensions, first shown by Wand and Jones [6] and refined and optimised by Duong and Hazelton [5]. In the bivariate case the plug-in method aims to minimise the bivariate AMISE,

$$\mathrm{AMISE}\, \hat f(H) = \frac{R(K)}{n\, |H|^{1/2}} + \frac{\mu_2(K)^2}{4} \left( \mathrm{vech}^T H \right) \Psi_4 \left( \mathrm{vech}\, H \right), \qquad (16)$$

where vech denotes the half-vectorisation

$$\mathrm{vech}\, H = \mathrm{vech} \begin{bmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{22} \end{bmatrix}^T. \qquad (17)$$

The matrix $\Psi_4$ is defined as

$$\Psi_4 = \begin{bmatrix} \psi_{40} & \psi_{31} & \psi_{22} \\ \psi_{31} & \psi_{22} & \psi_{13} \\ \psi_{22} & \psi_{13} & \psi_{04} \end{bmatrix}, \qquad (18)$$

where

$$\psi_{r_1 r_2} = \int_{\mathbb{R}^2} f^{(r_1, r_2)}(x)\, f(x)\, dx, \qquad f^{(r_1, r_2)}(x) = \frac{\partial^{\, r_1 + r_2}}{\partial x_1^{r_1}\, \partial x_2^{r_2}}\, f(x),$$

i.e. $f^{(r_1, r_2)}$ is the partial derivative of f with respect to x₁ and x₂.

As in the univariate case, $\psi_{r_1 r_2}$ has to be estimated. A commonly used estimate is

$$\hat\psi_{r_1 r_2}(G) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K_G^{(r_1, r_2)}(X_i - X_j), \qquad (19)$$

where G is the pilot bandwidth matrix. In Duong and Hazelton [5] it is suggested that this matrix should be of the form $G = g^2 I$. Choosing g can be done in a similar way as in the univariate case. For each entry $\psi_{r_1 r_2}$ in $\Psi_4$, $g = g_{\mathrm{AMSE}}$ is chosen such that it minimises the Asymptotic Mean Square Error approximation

$$\mathrm{AMSE}\, \hat\psi_{r_1 r_2}(g) = \frac{2}{n^2 g^{2(r_1 + r_2) + 2}}\, \psi_{00}\, R\big(K^{(r_1, r_2)}\big) + \left( \frac{1}{n g^{r_1 + r_2 + 2}}\, K^{(r_1, r_2)}(0) + \frac{g^2}{2}\, \mu_2(K) \left( \psi_{r_1 + 2,\, r_2} + \psi_{r_1,\, r_2 + 2} \right) \right)^2. \qquad (20)$$

This method may produce matrices $\Psi_4$ that are not positive definite, in which case a minimum of the objective function does not exist. To solve this issue, Duong and Hazelton suggest another approach: instead of finding one optimal g for each entry in $\Psi_4$, the $g = g_{\mathrm{SAMSE}}$ that minimises the sum

$$\mathrm{SAMSE} = \sum_{r_1 + r_2 = 4} \mathrm{AMSE}\, \hat\psi_{r_1 r_2}(g)$$

is calculated and used as a common g for all entries in $\Psi_4$. A closed-form expression for $g_{\mathrm{SAMSE}}$ is stated in Duong and Hazelton [5]. In analogy with the univariate case, the estimate of g depends on $\psi_{r_1 r_2}$, and therefore a simple estimate of $\psi_{r_1 r_2}$ has to be made at some stage.

The plug-in method as described above requires higher-order derivatives of the kernel. It is therefore not possible to implement the method for an Epanechnikov kernel, since its derivatives of third order and higher are all equal to 0.
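As a concrete illustration of the univariate machinery in equations (13)-(15), the sketch below implements a one-stage Gaussian plug-in selector: psi_6 is replaced by its normal-scale value, psi_4 is then estimated with the pilot bandwidth (15), and the result is plugged into (13). This is our own condensed sketch, not the full Sheather and Jones algorithm [4].

    function h = plugin1d(x)
    % One-stage univariate Gaussian plug-in bandwidth, sketch of (13)-(15).
    x = x(:); n = numel(x);
    sigma = std(x);
    psi6 = -15 / (16*sqrt(pi)*sigma^7);          % normal-scale estimate of psi_6
    g = (6 / (sqrt(2*pi)*(-psi6)*n))^(1/7);      % pilot bandwidth, eq. (15)
    U = bsxfun(@minus, x, x') / g;               % all pairwise differences / g
    phi4 = (U.^4 - 6*U.^2 + 3) .* exp(-U.^2/2) / sqrt(2*pi);  % 4th Gaussian derivative
    psi4 = sum(phi4(:)) / (n^2 * g^5);           % functional estimate, eq. (14)
    h = (1 / (2*sqrt(pi)*psi4*n))^(1/5);         % eq. (13), R(f'') replaced by psi4
    end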

6.2 Cross validation

The most commonly used bandwidth selectors besides PI belong to the class using cross-validation (CV). Generally, methods based on CV can be applied to any kernel. This differs from the PI methods, which usually require higher-order derivatives. The MISE previously defined in equation (8) can be rewritten as

$$\mathrm{MISE}(h) = \mathrm{E} \int \left( \hat f(x, h) - f(x) \right)^2 dx = \mathrm{E} \int \hat f(x, h)^2\, dx - 2\, \mathrm{E} \int \hat f(x, h) f(x)\, dx + \int f(x)^2\, dx. \qquad (21)$$

CV aims to minimise the MISE, which is equivalent to keeping the approximation f̂ as close to f as possible. The third term in (21) is independent of the bandwidth, so the equivalent minimisation problem can be written as

$$\mathrm{MISE}(h) - \int f(x)^2\, dx = \mathrm{E} \int \hat f(x, h)^2\, dx - 2\, \mathrm{E} \int \hat f(x, h) f(x)\, dx. \qquad (22)$$

The calculation of the first term on the right-hand side is quite straightforward, since it only involves known quantities. However, the second term complicates things, since it involves the unknown density f. Several versions of bandwidth selection methods using CV have been developed, but the main focus here has been to investigate smoothed cross validation.

6.2.1 Smoothed cross validation

The most commonly used bandwidth selector within the CV family is smoothed cross validation (SCV). SCV can be seen as a general method for bandwidth selection, and it usually performs better than other CV methods. The method is presented here for the bivariate case. To approximate f in equation (22), SCV uses the pilot estimate

$$\hat f_L(x, G) = \frac{1}{n} \sum_{i=1}^{n} L_G(x - X_i). \qquad (23)$$

Here L is an appropriate kernel and G the pilot bandwidth. This gives the objective function

$$\mathrm{SCV}(H) = \frac{R(K)}{n\, |H|^{1/2}} + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( K_H * K_H * L_G * L_G - 2\, K_H * L_G * L_G + L_G * L_G \right)(X_i - X_j), \qquad (24)$$

where $*$ denotes convolution. This method is similar to PI in the sense that a pilot estimate is used. As for the PI estimates, the choice of G is important and there are different ways to choose it. Usually it is chosen in the same way as g in the PI selection in Section 6.1. Since that method cannot be applied to the Epanechnikov kernel, neither can this version of SCV. The convolutions in equation (24) are simplified a lot if a Gaussian kernel is used, since in that case there is a closed-form expression [8].

6.3 Pre-transformation

Bandwidth selection methods with pilot bandwidths often require some sort of pre-transformation of the data [5]. This is of particular importance when the data is scaled differently along the coordinate axes. The two main methods for pre-transforming the data are sphering and scaling (both are sketched below). Both methods use the variance of the data set in order to make it more uniformly scaled. After the transformation the bandwidth can be calculated and the data transformed back into its original form. The preferred method is not always obvious, although some general recommendations can be given [5]. If the data in some sense has a different local orientation compared to the global one, the sphering method can destroy the local structures of the distribution. On the other hand, if the entire data set is skewed, sphering can yield a considerably more accurate result than scaling. Figure 6(a) shows an example where the sphering pre-transformation can be suitable, while Figure 6(b) shows an example where scaling can be more suitable due to the difference in orientation of the two modes.
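A minimal Matlab sketch of the two pre-transformations; the variable names are our own. Scaling divides each coordinate by its standard deviation, while sphering multiplies by an inverse square root of the full sample covariance and thereby also removes correlation.

    % Pre-transformations of an n-by-2 data matrix X.
    Xc = bsxfun(@minus, X, mean(X));     % centre the data
    % Scaling: unit variance in each coordinate, correlation untouched.
    Xscale = bsxfun(@rdivide, Xc, std(X));
    % Sphering: identity sample covariance.
    S = cov(X);
    Xsphere = Xc / sqrtm(S);             % X * S^(-1/2), symmetric square root

A bandwidth selected for the transformed data is then transformed back, e.g. H = sqrtm(S) * Ht * sqrtm(S) in the sphering case, where Ht denotes the bandwidth found for the sphered data.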

Figure 6: Two examples of distributions where the two pre-transformation methods give significantly different results: (a) correlated Gaussian, (b) asymmetric bimodal.

7 Applications of KDE

7.1 Cloud transform

Application fields of kernel density estimation include the so-called cloud transform (see Kolbjørnsen and Abrehamsen [11]). In this context the term is used equivalently to the conditional cumulative distribution F(y|x), which can be estimated from data according to the expression

$$\hat F(y \mid x) = \frac{ \displaystyle\sum_{i=1}^{n} k_d\!\left( \frac{x - X_i}{h_x} \right) \mathcal{K}\!\left( \frac{y - Y_i}{h_y} \right) }{ \displaystyle\sum_{i=1}^{n} k_d\!\left( \frac{x - X_i}{h_x} \right) }, \qquad (25)$$

where $\mathcal{K}(y) = \int_{-\infty}^{y} k_1(t)\, dt$. If the data is bivariate, then d = 1 and $k_d = k_1$ is a one-dimensional kernel. For illustration, Figure 7 shows the estimator of the conditional cumulative distribution F̂(y|x) for the scattered data in Figure 2(a).
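A Matlab sketch of the estimator (25) with Gaussian kernels, evaluated for a single query point x; the function name, the interface and the separate bandwidths hx and hy are our own assumptions.

    function F = cloud_cdf(X, Y, xq, yq, hx, hy)
    % Conditional CDF estimate F(y|x) according to (25), Gaussian kernels.
    % X, Y - data vectors; xq - query point in x; yq - vector of y values
    % Unnormalised weights: the kernel constants cancel in the ratio.
    w = exp(-((xq - X(:)) / hx).^2 / 2);
    F = zeros(size(yq));
    for j = 1:numel(yq)
        Kcdf = 0.5 * (1 + erf((yq(j) - Y(:)) / (hy*sqrt(2))));  % integrated kernel
        F(j) = sum(w .* Kcdf) / sum(w);
    end
    end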

Figure 7: The estimated conditional cumulative distribution F̂(y|x) for the scattered data in Figure 2(a).

7.2 Examples with real data

This section contains examples of kernel density estimates and cloud transforms for petro-elastic data. Scatter plots showing porosity versus acoustic impedance, for two different wells separately and for both wells together, are presented in Figure 8. Kernel density estimates and the conditional cumulative distributions of the porosity given the acoustic impedance are seen in Figures 9 and 10, respectively. To produce the plots in the latter figures, the Gaussian kernel has been used and the bandwidths have been generated with the plug-in method (see Section 6). Corresponding plots for data sets of log permeability versus porosity are shown in Figures 11, 12 and 13.

Figure 8: Scatter plots of acoustic impedance and porosity for two wells: (a) well 1, (b) well 2, (c) wells 1 and 2 together.

Figure 9: Kernel density estimates for the data shown in Figure 8.

Figure 10: Conditional cumulative distributions of porosity given acoustic impedance for the data shown in Figure 8.

Figure 11: Scatter plots of log permeability and porosity for three different wells and for all wells together.

Figure 12: Kernel density estimates for the data shown in Figure 11.

Figure 13: Conditional cumulative distributions of log permeability given porosity for the data shown in Figure 11.

8 Method and results

All code for comparison and testing is written in Matlab, including the calculation of the KDEs. The code communicates with R in order to use its ks package (developed by Tarn Duong) for the bandwidth selection process. In addition, C is used to speed up the linear binning algorithm. The testing is done by comparing the results obtained with the different methods against the true values from a known underlying density. A set of four target densities, picked from a larger set in the literature and all built up from combinations of normal distributions representing different properties, is used for these studies. The four densities are shown in Figure 14.

Figure 14: The four target densities used in the tests: (a) uncorrelated Gaussian, (b) correlated Gaussian, (c) strongly skewed, (d) asymmetric bimodal.

The tests are carried out on a system running Scientific Linux 6.5 with an AMD Opteron 6282 SE (Bulldozer) CPU at 2.6 GHz.

8.1 Comparison of binning methods

To verify and extend the results on binning methods in [1], a comparison test is performed on the four target densities in Figure 14. In order to compare simple and linear binning, simulations are done to estimate the relative mean integrated squared error (RMISE), as defined by Wand in [1]:

$$\mathrm{RMISE} = \left[ \frac{ \mathrm{E} \int \{ \tilde f(x) - \hat f(x) \}^2\, dx }{ \mathrm{E} \int \{ \hat f(x) - f(x) \}^2\, dx } \right]^{1/2}, \qquad (26)$$

where $\tilde f$ denotes the binned estimate. In words, the RMISE is the MISE error due to binning divided by the MISE of the KDE calculated according to the definition. The denominator MISE is calculated with the closed form. The numerator is estimated with the IMSE, calculated as described in Section 4 (so the RMISE is actually estimated as the RIMSE). An equally spaced grid with M₁ = M₂ = M is used for four different values of M, and for each M four different numbers of sample points are investigated. For each setting, a number of data sets are generated to approximate the MSE, and uniformly distributed random points are used for the Monte Carlo integration; in these points the KDE is approximated with linear interpolation. For each number of sample points, the bandwidth H is chosen with the plug-in method for an initial data set and then kept fixed for the remaining data sets and grid sizes.

Parts of the results are seen in Figure 15, while the remaining figures are found in Appendix A. For each target density it is seen that linear binning yields a more accurate result than simple binning for almost all combinations of grid and sample sizes. The absolute difference is most significant for small sample sizes and on coarse grids, situations in which good approximations are naturally harder to make. Note, however, that the relative difference increases as the grid size grows. One should also note that the RIMSE values grow with the sample size, which implies that larger samples require more grid points to keep the binning error down. It should be recalled that there is an additional uncertainty introduced by the linear interpolation, which grows larger as the grid size shrinks. This extra level of approximation is also the reason why even smaller grid sizes are not used in the test.

Figure 15: log RIMSE versus the number of grid points for the four target densities at one fixed sample size. Stars and circles correspond to simple and linear binning, respectively.

Regarding speed, linear binning is generally faster (all time comparison plots are found in Appendix A). This result may be surprising, considering that linear binning is a more complicated algorithm than simple binning. The explanation is found in our implementation: simple binning is implemented purely in Matlab, while the more complex linear binning algorithm is partly written in C to speed up the execution of an expensive for-loop. With these implementation differences in mind, the results should not be used to draw any general conclusions about how the methods compare in terms of speed. In any case, the binning time is small compared to the time required for the actual KDE computation (see also Section 8.2). Conclusively, the choice of binning method should be based on the accuracy comparison, and thus linear binning is to prefer.

8.2 Comparison of KDE-calculation methods

The aim of the test described in this section is to compare the KDEs computed by the definition (4), the binned estimate (11) and the binned estimate

computed with the FFT (12). Linear binning (lb) is used, since it is the preferable binning method according to Section 8.1. The ISE for the three methods is compared through box plots. The ISE values are calculated as described in Section 4, using a number of data sets and uniformly distributed random points for the Monte Carlo integration. The tests are performed on data sets generated from the four target densities in Figure 14. For each density, three different sample sizes are used for the data sets. Each test is performed with both the Epanechnikov and the Gaussian kernel on two different grid sizes, one coarser grid and one 64 × 64 grid. The Gaussian bandwidth is chosen once for each target density, using the plug-in method on an initial data set. The bandwidth used for the Epanechnikov kernel is obtained by scaling the Gaussian bandwidth with a constant factor, as described in Duong [12].

The test results are presented in Figures 16 and 17. The results are similar for three of the four target densities: the ISE values are in the same range on both grids and for all three KDE methods. The main difference in ISE is seen as the sample size increases. On the coarser grid, the KDE-by-definition estimate improves more than the binned estimates as the number of points increases. This is expected and follows from the fact that the RIMSE grows when the sample size increases, as described in Section 8.1. The ISE behaviour for calculations using linear binning with and without the FFT cannot be distinguished. This is expected, since the only observed difference between the methods is the visual artifacts described in Section 5.2. Another observed pattern is that increasing the sample size n improves the accuracy for all methods. This is intuitive, since a larger sample contains more information about the estimated PDF. An unexpected result is that the binned estimates in some cases have a smaller mean value (approximate MISE) than the KDE by definition. This occurs mainly when the number of grid points is larger than the number of sample points.

The results for the strongly skewed target density stand out compared to the others. For this density the test on the coarser grid yields an ISE several times larger than for the other densities, and the estimate by definition performs significantly better than the binned estimates. On the 64 × 64 grid the ISE is also clearly larger than for the other densities, and the binned estimates have a smaller mean value for all sample sizes. These slightly unexpected results are probably a consequence of the density's strong

skewness. A general advice is to use a dense grid and a large sample size if a KDE approximation is calculated from a very skewed data set.

The results for the Gaussian and the Epanechnikov kernel are similar but not identical. The kernel with the best performance varies between the different methods and target densities. The Epanechnikov kernel is provably the most efficient [2], which is not observed in this test. The reason could be that the target densities used are combinations of normal distributions, which may give an advantage to the infinitely supported Gaussian kernel. Furthermore, the bandwidth matrices are algorithmically chosen for the Gaussian kernel and adapted to the Epanechnikov kernel using a scale factor; this may cause the Epanechnikov bandwidth to be less optimal than the Gaussian one.

For each parameter setting the mean execution time of each KDE method is recorded. This is found to be independent of the target density and the kernel type, so the execution times for one combination of kernel and target density are representative for all remaining settings. Figure 18 shows the results for one target density and the Epanechnikov kernel; the remaining figures from the time study are found in Appendix B. Some general patterns are observed for all test densities. The execution time increases with the sample size. As mentioned in Section 5.1, the number of kernel evaluations required for the KDE by definition is O(nM), where M is the number of grid points. This can be observed in Figure 18, where the time for the KDE by definition, denoted def, increases proportionally to the sample size. For the binned estimates the number of kernel evaluations is O(M²), although one must also take into account the time required by the binning algorithm, which is proportional to O(nM) as shown in [1]. However, the computational burden of the binning procedure is significantly smaller than that of the actual KDE calculation. This is also clear from Figure 18, where it is seen that the execution time for the KDE by definition grows rapidly compared to the binned estimates. Furthermore, using the FFT results in an enormous speed-up compared to the other methods. The ratio between the times required for the binned KDE estimates computed with and without the FFT is only a few per cent on the coarser mesh, shrinks further on the 64 × 64 grid, and would continue to shrink on finer meshes due to the FFT's speed benefits discussed in Section 5.2. Due to this speed-up it is strongly recommended to use the FFT in binned estimations.
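The flavour of this timing study can be reproduced with Matlab's tic/toc and the kde2d_def sketch from Section 3; kde2d_bin and kde2d_fft below are hypothetical binned and FFT-based counterparts with the same interface, standing in for the project's implementations.

    % Timing comparison in the spirit of Figure 18 (our own sketch).
    gx = linspace(-4, 4, 64); gy = linspace(-4, 4, 64);
    H = [0.1, 0; 0, 0.1];
    for n = [1e3, 1e4, 1e5]
        X = randn(n, 2);
        tic; kde2d_def(X, gx, gy, H); t_def = toc;
        tic; kde2d_bin(X, gx, gy, H); t_lb  = toc;
        tic; kde2d_fft(X, gx, gy, H); t_fft = toc;
        fprintf('n = %d: def %.3f s, lb %.3f s, fft %.3f s\n', n, t_def, t_lb, t_fft);
    end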

Figure 16: ISE box plots for the KDE method test on the coarser grid. For each target density, the left panel uses the Gaussian kernel and the right panel the Epanechnikov kernel; each panel compares the KDE by definition, linear binning and linear binning with FFT over the different sample sizes.

Figure 17: ISE box plots for the KDE method test on the 64 × 64 grid, with the same layout as in Figure 16.

Figure 18: Execution times for the three KDE methods (def, lb, fft) with the Epanechnikov kernel on both grid sizes.

8.3 Comparison of bandwidth selection methods

A benefit of using combined normal distributions as target densities is that they allow exact computation of the MISE, given the number of sample points and the bandwidth matrix, as described in [2]. This way it is not necessary to carry out the thorough IMSE calculations in the same way as for the KDE comparison. Instead one can simply use the bandwidth matrices suggested by the selection methods, calculate the corresponding exact MISE values and compare the results.

The main interest in this work is to compare the performance of the plug-in and smoothed cross validation selection methods. As mentioned in the introduction to Section 8, R's ks package is used in the implementation. This package allows a number of different options for bandwidth selection. The focus is to investigate how PI and SCV compare when using one- and two-stage methods and the two different pre-transformations, sphering and scaling. Let sphering be denoted with a star (*); scaling is used if nothing else is stated. The total parameter setting yields eight combinations of

bandwidth selection methods. For a faster execution, the bandwidth algorithms make use of binning the data on a grid of size 64 × 64; earlier observations have shown that this decreases the execution time by orders of magnitude without any notable loss of accuracy. The test is carried out on the four target densities in Figure 14 using a set of different sample sizes. For each sample size and target density the ks package is used to calculate the bandwidth, which is then used to compute the exact MISE value. This procedure is repeated for a number of data sets. In Figures 19-21 the MISE is visualised in the form of box plots for three sample sizes.

It is seen in Figures 19-21 that the pre-transformation is an important factor. The results seem to agree with the arguments of Section 6.3, since sphering performs better on the skewed densities while scaling is better for the density with multiple orientations. In general, the two-stage method performs equally well as or better than the one-stage method. No distinct conclusion can be drawn regarding the difference between PI and SCV. On some target densities they show similar results for all sample sizes, and where there are clear differences in accuracy the picture is not consistent: PI outperforms SCV on one density for all sample sizes, while SCV seems to be the more robust selector on another. The results also indicate a slight difference in dispersion between the two selectors, with PI having the lowest.

Besides the accuracy study, the execution time for each bandwidth method is recorded. No difference in execution time could be detected between the target densities; a representative result is presented in Figure 22 (the remaining figures are found in Appendix C). PI is faster than SCV in all cases investigated. The difference is especially large for the one-stage method and for the smallest sample size, in which case the speed of PI completely surpasses that of SCV. A possible explanation is that the objective function for SCV is hard to minimise for a small sample size. Increasing the sample size yields significantly reduced execution times for the SCV method, while a further increase does not reduce the execution times to the same extent; this can probably be explained by the binning approximation used in the bandwidth selection. The one-stage method is faster than the two-stage method in all investigated cases, although the differences are in some cases very small, especially for SCV. The patterns of the remaining time plots are roughly similar to the one in Figure 22.

Figure 19: Box plots showing the accuracy of the different bandwidth selection methods (one- and two-stage PI and SCV, with and without sphering (*)) for the smallest sample size.

Figure 20: Box plots showing the accuracy of the different bandwidth selection methods for the intermediate sample size.

Figure 21: Box plots showing the accuracy of the different bandwidth selection methods for the largest sample size.

Figure 22: Execution times for the bandwidth selection methods on one target density for the three sample sizes: (a) smallest, (b) intermediate, (c) largest.

9 Summary and conclusions

In data exploration, KDE is a useful tool for finding underlying PDFs. In this project the focus has been to investigate the properties of different approximations and methods in order to identify an efficient and accurate estimate, with the main attention on binning, bandwidth selection and the use of the FFT.

The test carried out in Section 8.1 shows that linear binning is more accurate than simple binning. Regarding the KDE calculations, the sample size and the grid size are the most important factors for accuracy. A denser grid makes the binned estimate more reliable, but on the other hand it requires additional computations. Using the FFT is shown to be faster than the KDE by definition; however, the KDE by definition is more accurate on a coarse grid. Therefore our recommendation is to use the FFT on a dense grid for a good trade-off between accuracy and speed.

The bandwidth selection can be seen as one of the more crucial parts of the KDE calculations. Some general recommendations can be given from the tests carried out in Section 8.3, even though the results depend strongly on the shape of the target density. First of all, the data should be pre-transformed correctly: if only one orientation is present in the data set, sphering should be seen as the preferable pre-transformation due to its non-destructive properties in that situation, while for data with multiple orientations scaling is the preferable method. A two-stage method should be considered due to its more robust and solid performance compared to the one-stage counterpart, although its computational cost is higher.

Regarding the execution times, the real bottleneck of the calculations is the bandwidth selection. Since the bandwidth selection is performed through an external call to already existing software, profiling is hard to perform; the ks package contains some highly evolved code, including calls to C to improve the speed of time-consuming parts. Compared to the bandwidth selection, the binning and the actual KDE calculation are usually fast.

Since the implemented KDE calculation makes use of already existing software, the portability is somewhat tricky. As an extension of the work it would be highly interesting to have all the code written in Matlab, for portability as well as for analysis purposes. It would also be desirable to have a bandwidth selection especially developed for the Epanechnikov kernel, instead of scaling the Gaussian bandwidth. This would not work for the PI approach, since it requires higher-order derivatives. However, it should be possible to implement for SCV, although some practical

issues must be dealt with, such as the convolutions in equation (24) and the choice of pilot kernel. Another interesting aspect would be to perform the tests on a non-Gaussian target density.

In addition to the theoretical results presented and discussed in this report, the source code and its implementation have been an important part of the work. Anyone interested in the subject who wishes to make use of these resources is welcome to contact the authors at any of the e-mail addresses found below.

Contact Information

Alexander Bilock: alexander.bilock.57@student.uu.se
Carl Jidling: carl.jidling.87@student.uu.se
Ylva Rydin: ylva.rydin.@student.uu.se

Acknowledgements

Thanks to our supervisor David Marquez for his support and comments.

References

[1] M. P. Wand, Fast Computation of Multivariate Kernel Estimators, Journal of Computational and Graphical Statistics (1994).

[2] M. P. Wand, M. C. Jones, Kernel Smoothing, Chapman & Hall, 1st edition (1995).

[3] A. Gramacki, J. Gramacki, FFT-Based Fast Computation of Multivariate Kernel Density Estimators with Unconstrained Bandwidth Matrices, arXiv:1508.02766 (2016).

[4] M. C. Jones, S. J. Sheather, Using non-stochastic terms to advantage in integrated squared density derivatives (1991).

[5] T. Duong, M. L. Hazelton, Plug-In Bandwidth Matrices for Bivariate Kernel Density Estimation, Nonparametric Statistics, Vol. 15(1), pp. 17-30 (2003).

[6] M. P. Wand, M. C. Jones, Comparison of smoothing parameterizations in bivariate kernel density estimation (1993).

[7] M. P. Wand, M. C. Jones, Multivariate Plug-in Bandwidth Selection (1994).

[8] T. Duong, M. L. Hazelton, Cross-validation Bandwidth Matrices for Multivariate Kernel Density Estimation (2005).

[9] S. R. Sain, K. A. Baggerly, D. W. Scott, Cross-validation of multivariate densities (1994).

[10] J. E. Chacón, Cross-validation Bandwidth Matrices for Multivariate Kernel Density Estimation, The Canadian Journal of Statistics.

[11] O. Kolbjørnsen, P. Abrehamsen, Theory of the Cloud Transform for Applications, Geostatistics Banff (7th International Geostatistics Congress) (2004).

[12] T. Duong, Spherically symmetric multivariate beta family kernels, Statistics and Probability Letters (2015).

Appendix A: Comparison of binning methods

Accuracy

In the accuracy figures below, stars and circles correspond to simple binning and linear binning, respectively; each figure shows log RIMSE versus the number of grid points for four different sample sizes.

Figure 23: Target density 1.

Figure 24: Target density 2.

Figure 25: Target density 3.

Figure 26: Target density 4.

Execution time

Figure 27: Execution times for simple and linear binning, target density 1.

Figure 28: Execution times for simple and linear binning, target density 2.

Figure 29: Execution times for simple and linear binning, target density 3.


GRAPHICS PROCESSING UNITS IN ACCELERATION OF BANDWIDTH SELECTION FOR KERNEL DENSITY ESTIMATION

GRAPHICS PROCESSING UNITS IN ACCELERATION OF BANDWIDTH SELECTION FOR KERNEL DENSITY ESTIMATION Int. J. Appl. Math. Comput. Sci., 2013, Vol. 23, No. 4, 869 885 DOI: 10.2478/amcs-2013-0065 GRAPHICS PROCESSING UNITS IN ACCELERATION OF BANDWIDTH SELECTION FOR KERNEL DENSITY ESTIMATION WITOLD ANDRZEJEWSKI,

More information

2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001)

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001) An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (000/001) Summary The objectives of this project were as follows: 1) Investigate iterative

More information

2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

More information

A Bayesian approach to parameter estimation for kernel density estimation via transformations

A Bayesian approach to parameter estimation for kernel density estimation via transformations A Bayesian approach to parameter estimation for kernel density estimation via transformations Qing Liu,, David Pitt 2, Xibin Zhang 3, Xueyuan Wu Centre for Actuarial Studies, Faculty of Business and Economics,

More information

Chapter 4: Non-Parametric Techniques

Chapter 4: Non-Parametric Techniques Chapter 4: Non-Parametric Techniques Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Supervised Learning How to fit a density

More information

An Introduction to Markov Chain Monte Carlo

An Introduction to Markov Chain Monte Carlo An Introduction to Markov Chain Monte Carlo Markov Chain Monte Carlo (MCMC) refers to a suite of processes for simulating a posterior distribution based on a random (ie. monte carlo) process. In other

More information

Level-set MCMC Curve Sampling and Geometric Conditional Simulation

Level-set MCMC Curve Sampling and Geometric Conditional Simulation Level-set MCMC Curve Sampling and Geometric Conditional Simulation Ayres Fan John W. Fisher III Alan S. Willsky February 16, 2007 Outline 1. Overview 2. Curve evolution 3. Markov chain Monte Carlo 4. Curve

More information

Multivariate Standard Normal Transformation

Multivariate Standard Normal Transformation Multivariate Standard Normal Transformation Clayton V. Deutsch Transforming K regionalized variables with complex multivariate relationships to K independent multivariate standard normal variables is an

More information

Physics 736. Experimental Methods in Nuclear-, Particle-, and Astrophysics. - Statistical Methods -

Physics 736. Experimental Methods in Nuclear-, Particle-, and Astrophysics. - Statistical Methods - Physics 736 Experimental Methods in Nuclear-, Particle-, and Astrophysics - Statistical Methods - Karsten Heeger heeger@wisc.edu Course Schedule and Reading course website http://neutrino.physics.wisc.edu/teaching/phys736/

More information

10.4 Linear interpolation method Newton s method

10.4 Linear interpolation method Newton s method 10.4 Linear interpolation method The next best thing one can do is the linear interpolation method, also known as the double false position method. This method works similarly to the bisection method by

More information

Generalized Additive Model

Generalized Additive Model Generalized Additive Model by Huimin Liu Department of Mathematics and Statistics University of Minnesota Duluth, Duluth, MN 55812 December 2008 Table of Contents Abstract... 2 Chapter 1 Introduction 1.1

More information

Challenge Problem 5 - The Solution Dynamic Characteristics of a Truss Structure

Challenge Problem 5 - The Solution Dynamic Characteristics of a Truss Structure Challenge Problem 5 - The Solution Dynamic Characteristics of a Truss Structure In the final year of his engineering degree course a student was introduced to finite element analysis and conducted an assessment

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Improving the Post-Smoothing of Test Norms with Kernel Smoothing

Improving the Post-Smoothing of Test Norms with Kernel Smoothing Improving the Post-Smoothing of Test Norms with Kernel Smoothing Anli Lin Qing Yi Michael J. Young Pearson Paper presented at the Annual Meeting of National Council on Measurement in Education, May 1-3,

More information

3 Nonlinear Regression

3 Nonlinear Regression CSC 4 / CSC D / CSC C 3 Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Jason Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Nonparametric Methods 1 / 49 Nonparametric Methods Overview Previously, we ve assumed that the forms of the underlying densities

More information

SAS/STAT 13.2 User s Guide. The KDE Procedure

SAS/STAT 13.2 User s Guide. The KDE Procedure SAS/STAT 13.2 User s Guide The KDE Procedure This document is an individual chapter from SAS/STAT 13.2 User s Guide. The correct bibliographic citation for the complete manual is as follows: SAS Institute

More information

Schedule for Rest of Semester

Schedule for Rest of Semester Schedule for Rest of Semester Date Lecture Topic 11/20 24 Texture 11/27 25 Review of Statistics & Linear Algebra, Eigenvectors 11/29 26 Eigenvector expansions, Pattern Recognition 12/4 27 Cameras & calibration

More information

Kernel Density Estimation

Kernel Density Estimation Kernel Density Estimation An Introduction Justus H. Piater, Université de Liège Overview 1. Densities and their Estimation 2. Basic Estimators for Univariate KDE 3. Remarks 4. Methods for Particular Domains

More information

Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8

Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8 Glossary Common Core Curriculum Maps Math/Grade 6 Grade 8 Grade 6 Grade 8 absolute value Distance of a number (x) from zero on a number line. Because absolute value represents distance, the absolute value

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction A Monte Carlo method is a compuational method that uses random numbers to compute (estimate) some quantity of interest. Very often the quantity we want to compute is the mean of

More information

Ultrasonic Multi-Skip Tomography for Pipe Inspection

Ultrasonic Multi-Skip Tomography for Pipe Inspection 18 th World Conference on Non destructive Testing, 16-2 April 212, Durban, South Africa Ultrasonic Multi-Skip Tomography for Pipe Inspection Arno VOLKER 1, Rik VOS 1 Alan HUNTER 1 1 TNO, Stieltjesweg 1,

More information

Chapter 2 Describing, Exploring, and Comparing Data

Chapter 2 Describing, Exploring, and Comparing Data Slide 1 Chapter 2 Describing, Exploring, and Comparing Data Slide 2 2-1 Overview 2-2 Frequency Distributions 2-3 Visualizing Data 2-4 Measures of Center 2-5 Measures of Variation 2-6 Measures of Relative

More information

Probability and Statistics for Final Year Engineering Students

Probability and Statistics for Final Year Engineering Students Probability and Statistics for Final Year Engineering Students By Yoni Nazarathy, Last Updated: April 11, 2011. Lecture 1: Introduction and Basic Terms Welcome to the course, time table, assessment, etc..

More information

Middle School Math Course 3

Middle School Math Course 3 Middle School Math Course 3 Correlation of the ALEKS course Middle School Math Course 3 to the Texas Essential Knowledge and Skills (TEKS) for Mathematics Grade 8 (2012) (1) Mathematical process standards.

More information

An Interval-Based Tool for Verified Arithmetic on Random Variables of Unknown Dependency

An Interval-Based Tool for Verified Arithmetic on Random Variables of Unknown Dependency An Interval-Based Tool for Verified Arithmetic on Random Variables of Unknown Dependency Daniel Berleant and Lizhi Xie Department of Electrical and Computer Engineering Iowa State University Ames, Iowa

More information

brahim KARA and Nihan HOSKAN

brahim KARA and Nihan HOSKAN Acta Geophysica vol. 64, no. 6, Dec. 2016, pp. 2232-2243 DOI: 10.1515/acgeo-2016-0097 An Easy Method for Interpretation of Gravity Anomalies Due to Vertical Finite Lines brahim KARA and Nihan HOSKAN Department

More information

Edge and corner detection

Edge and corner detection Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements

More information

Direction Fields; Euler s Method

Direction Fields; Euler s Method Direction Fields; Euler s Method It frequently happens that we cannot solve first order systems dy (, ) dx = f xy or corresponding initial value problems in terms of formulas. Remarkably, however, this

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information

Name Course Days/Start Time

Name Course Days/Start Time Name Course Days/Start Time Mini-Project : The Library of Functions In your previous math class, you learned to graph equations containing two variables by finding and plotting points. In this class, we

More information

Package r2d2. February 20, 2015

Package r2d2. February 20, 2015 Package r2d2 February 20, 2015 Version 1.0-0 Date 2014-03-31 Title Bivariate (Two-Dimensional) Confidence Region and Frequency Distribution Author Arni Magnusson [aut], Julian Burgos [aut, cre], Gregory

More information

Getting to Know Your Data

Getting to Know Your Data Chapter 2 Getting to Know Your Data 2.1 Exercises 1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss

More information

Assessing the Quality of the Natural Cubic Spline Approximation

Assessing the Quality of the Natural Cubic Spline Approximation Assessing the Quality of the Natural Cubic Spline Approximation AHMET SEZER ANADOLU UNIVERSITY Department of Statisticss Yunus Emre Kampusu Eskisehir TURKEY ahsst12@yahoo.com Abstract: In large samples,

More information

Optimised corrections for finite-difference modelling in two dimensions

Optimised corrections for finite-difference modelling in two dimensions Optimized corrections for 2D FD modelling Optimised corrections for finite-difference modelling in two dimensions Peter M. Manning and Gary F. Margrave ABSTRACT Finite-difference two-dimensional correction

More information

Maths Year 11 Mock Revision list

Maths Year 11 Mock Revision list Maths Year 11 Mock Revision list F = Foundation Tier = Foundation and igher Tier = igher Tier Number Tier Topic know and use the word integer and the equality and inequality symbols use fractions, decimals

More information

DATA DEPTH AND ITS APPLICATIONS IN CLASSIFICATION

DATA DEPTH AND ITS APPLICATIONS IN CLASSIFICATION DATA DEPTH AND ITS APPLICATIONS IN CLASSIFICATION Ondrej Vencalek Department of Mathematical Analysis and Applications of Mathematics Palacky University Olomouc, CZECH REPUBLIC e-mail: ondrej.vencalek@upol.cz

More information

ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD. Julius Goodman. Bechtel Power Corporation E. Imperial Hwy. Norwalk, CA 90650, U.S.A.

ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD. Julius Goodman. Bechtel Power Corporation E. Imperial Hwy. Norwalk, CA 90650, U.S.A. - 430 - ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD Julius Goodman Bechtel Power Corporation 12400 E. Imperial Hwy. Norwalk, CA 90650, U.S.A. ABSTRACT The accuracy of Monte Carlo method of simulating

More information

3 Nonlinear Regression

3 Nonlinear Regression 3 Linear models are often insufficient to capture the real-world phenomena. That is, the relation between the inputs and the outputs we want to be able to predict are not linear. As a consequence, nonlinear

More information

Detecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2. Xi Wang and Ronald K. Hambleton

Detecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2. Xi Wang and Ronald K. Hambleton Detecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2 Xi Wang and Ronald K. Hambleton University of Massachusetts Amherst Introduction When test forms are administered to

More information

Using the DATAMINE Program

Using the DATAMINE Program 6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection

More information

AMTH142 Lecture 10. Scilab Graphs Floating Point Arithmetic

AMTH142 Lecture 10. Scilab Graphs Floating Point Arithmetic AMTH142 Lecture 1 Scilab Graphs Floating Point Arithmetic April 2, 27 Contents 1.1 Graphs in Scilab......................... 2 1.1.1 Simple Graphs...................... 2 1.1.2 Line Styles........................

More information

Package feature. R topics documented: July 8, Version Date 2013/07/08

Package feature. R topics documented: July 8, Version Date 2013/07/08 Package feature July 8, 2013 Version 1.2.9 Date 2013/07/08 Title Feature significance for multivariate kernel density estimation Author Tarn Duong & Matt Wand

More information

Package feature. R topics documented: October 26, Version Date

Package feature. R topics documented: October 26, Version Date Version 1.2.13 Date 2015-10-26 Package feature October 26, 2015 Title Local Inferential Feature Significance for Multivariate Kernel Density Estimation Author Tarn Duong & Matt Wand

More information

EECS 556 Image Processing W 09. Interpolation. Interpolation techniques B splines

EECS 556 Image Processing W 09. Interpolation. Interpolation techniques B splines EECS 556 Image Processing W 09 Interpolation Interpolation techniques B splines What is image processing? Image processing is the application of 2D signal processing methods to images Image representation

More information

MetroPro Surface Texture Parameters

MetroPro Surface Texture Parameters MetroPro Surface Texture Parameters Contents ROUGHNESS PARAMETERS...1 R a, R q, R y, R t, R p, R v, R tm, R z, H, R ku, R 3z, SR z, SR z X, SR z Y, ISO Flatness WAVINESS PARAMETERS...4 W a, W q, W y HYBRID

More information

Software Tutorial Session Universal Kriging

Software Tutorial Session Universal Kriging Software Tutorial Session Universal Kriging The example session with PG2000 which is described in this and Part 1 is intended as an example run to familiarise the user with the package. This documented

More information

INDEPENDENT COMPONENT ANALYSIS WITH QUANTIZING DENSITY ESTIMATORS. Peter Meinicke, Helge Ritter. Neuroinformatics Group University Bielefeld Germany

INDEPENDENT COMPONENT ANALYSIS WITH QUANTIZING DENSITY ESTIMATORS. Peter Meinicke, Helge Ritter. Neuroinformatics Group University Bielefeld Germany INDEPENDENT COMPONENT ANALYSIS WITH QUANTIZING DENSITY ESTIMATORS Peter Meinicke, Helge Ritter Neuroinformatics Group University Bielefeld Germany ABSTRACT We propose an approach to source adaptivity in

More information

Three Different Algorithms for Generating Uniformly Distributed Random Points on the N-Sphere

Three Different Algorithms for Generating Uniformly Distributed Random Points on the N-Sphere Three Different Algorithms for Generating Uniformly Distributed Random Points on the N-Sphere Jan Poland Oct 4, 000 Abstract We present and compare three different approaches to generate random points

More information

Chapter 7: Dual Modeling in the Presence of Constant Variance

Chapter 7: Dual Modeling in the Presence of Constant Variance Chapter 7: Dual Modeling in the Presence of Constant Variance 7.A Introduction An underlying premise of regression analysis is that a given response variable changes systematically and smoothly due to

More information

Chapter Two: Descriptive Methods 1/50

Chapter Two: Descriptive Methods 1/50 Chapter Two: Descriptive Methods 1/50 2.1 Introduction 2/50 2.1 Introduction We previously said that descriptive statistics is made up of various techniques used to summarize the information contained

More information

Scaled representations

Scaled representations Scaled representations Big bars (resp. spots, hands, etc.) and little bars are both interesting Stripes and hairs, say Inefficient to detect big bars with big filters And there is superfluous detail in

More information

Visualizing and Exploring Data

Visualizing and Exploring Data Visualizing and Exploring Data Sargur University at Buffalo The State University of New York Visual Methods for finding structures in data Power of human eye/brain to detect structures Product of eons

More information

In the real world, light sources emit light particles, which travel in space, reflect at objects or scatter in volumetric media (potentially multiple

In the real world, light sources emit light particles, which travel in space, reflect at objects or scatter in volumetric media (potentially multiple 1 In the real world, light sources emit light particles, which travel in space, reflect at objects or scatter in volumetric media (potentially multiple times) until they are absorbed. On their way, they

More information

ADAPTIVE METROPOLIS-HASTINGS SAMPLING, OR MONTE CARLO KERNEL ESTIMATION

ADAPTIVE METROPOLIS-HASTINGS SAMPLING, OR MONTE CARLO KERNEL ESTIMATION ADAPTIVE METROPOLIS-HASTINGS SAMPLING, OR MONTE CARLO KERNEL ESTIMATION CHRISTOPHER A. SIMS Abstract. A new algorithm for sampling from an arbitrary pdf. 1. Introduction Consider the standard problem of

More information

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables

More information

A noninformative Bayesian approach to small area estimation

A noninformative Bayesian approach to small area estimation A noninformative Bayesian approach to small area estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu September 2001 Revised May 2002 Research supported

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

Multicollinearity and Validation CIVL 7012/8012

Multicollinearity and Validation CIVL 7012/8012 Multicollinearity and Validation CIVL 7012/8012 2 In Today s Class Recap Multicollinearity Model Validation MULTICOLLINEARITY 1. Perfect Multicollinearity 2. Consequences of Perfect Multicollinearity 3.

More information

LOCAL BANDWIDTH SELECTION FOR KERNEL ESTIMATION OF' POPULATION DENSITIES WITH LINE TRANSECT SAMPLING

LOCAL BANDWIDTH SELECTION FOR KERNEL ESTIMATION OF' POPULATION DENSITIES WITH LINE TRANSECT SAMPLING LOCAL BANDWIDTH SELECTION FOR KERNEL ESTIMATION OF' POPULATION DENSITIES WITH LINE TRANSECT SAMPLING Patrick D. Gerard Experimental Statistics Unit Mississippi State University, Mississippi 39762 William

More information

Learning Objectives. Continuous Random Variables & The Normal Probability Distribution. Continuous Random Variable

Learning Objectives. Continuous Random Variables & The Normal Probability Distribution. Continuous Random Variable Learning Objectives Continuous Random Variables & The Normal Probability Distribution 1. Understand characteristics about continuous random variables and probability distributions 2. Understand the uniform

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information