Kernel Density Estimation (KDE)

Size: px

Start display at page:

Download "Kernel Density Estimation (KDE)"

Gavin Simmons
5 years ago
Views:

1 Kernel Density Estimation (KDE) Previously, we ve seen how to use the histogram method to infer the probability density function (PDF) of a random variable (population) using a finite data sample. In this tutorial, we ll carry on the problem of probability density function inference, but using another method: Kernel density estimation. Kernel density estimates (KDE) are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel. Why do we care? One of the main problems in practical applications is that the needed probability distribution is usually not readily available, but rather it must be derived from other existing information (e.g. sample data). KDEs are similar to histograms in terms of being non parametric method, so there are no restrictive assumptions about the shape of the density function, but KDE is far more superior to histograms as far as accuracy and continuity. Overview Let s consider a finite data sample {x 1, x 2,..., x} observed from a stochastic (i.e. continuous and random) process. We wish to infer the population probability density function. In the histogram method, we select the left bound of the histogram ( x o ), the bin s width ( h ), and then compute the bin k probability estimator f ˆ h( k ): Bin k represents the following interval [ x ( k 1) h, x k h) fˆ ( k ) h i1 I{( k1) h x x kh} i o I{} is an event function that returns 1 (one) if the condition is true, 0 (zero) otherwise. The choice of bins, especially the bin width ( h ), has a substantial effect on the shape and other properties of f ˆ h( k ). Finally, we can think of the histogram method as follows: o o Each observation (event) is statistically independent of all others, and its occurrence probability is equal to 1. fˆ ( k ) is simply the integral (sum) of the event probabilities in each bin. h Kernel Density Estimation (KDE) Tutorial 1 Spider Financial Corp, 2013

2 What is a kernel? A kernel is a non negative, real valued, integrable function K (.) satisfying the following two requirements: Kudu ( ) 1 Ku ( ) K( u) And, as a result, the scaled function K * ( u ), where K * ( u) K( u), is a kernel as well. ow, place a scaled kernel function at each observation in the sample and compute the new probability estimators fˆ ( x ) for a value x (compared to an earlier bin in the histogram). h ˆ 1 f ( x ) K ( xx ) h h i i1 1 u Kh(u) K( ) h h ˆ 1 x x ( ) ( i fh x K ) h h As an example, let K(.) be the standardized Gaussian density function. The KDE looks like the sum of Gaussian curves, each centered on one observation. i1 ote: For Gaussian kernel, the bandwidth h is the same as the standard deviation of ( x x ). i Kernel Density Estimation (KDE) Tutorial 2 Spider Financial Corp, 2013

3 1/ x{ x1, x2,.., x} The KDE method replaces the discrete probability P(x) with a kernel 0 x{ x1, x2,.., x} function. This permits overlap between kernels, thus promoting continuity in the probability estimator. Why KDE? Due to our data sampling, we are left with a finite set of values for continuous random variables. Using a kernel instead of discrete probabilities, we promote the continuity nature in the underlying random variable. To proceed with KDE, you ll need to decide on two key parameters: Kernel function and bandwidth. Which kernel should I use? A range of kernel functions are commonly used: uniform, triangular, biweight, triweight and Epanechnikov. The Gaussian kernel is often used; K(.) (.), where is the standard normal density function. How do I properly compute kernel bandwidth? Intuitively, one wants to choose an h as small as the data allows, but there is a trade off between the bias of the estimator and its variance. Selection of the bandwidth of a kernel estimator is a subject of considerable research. We will outline two popular methods: 1. Subjective selection One can experiment by using different bandwidths and simply selecting one that looks right for the type of data under investigation. 2. Selection with reference to some given distribution Here one selects the bandwidth that would be optimal for a particular PDF. Keep in mind that you are not assuming that f( x ) is normal, but rather selecting an h which would be optimal if the PDF were normal. Using a Gaussian kernel, the optimal bandwidth h opt is defined as follows: h opt The normal distribution is not a wiggly distribution; it is uni modal and bell shaped. It is therefore to be expected that h opt will be too large for multimodal distributions. Furthermore, 2 2 the sample variance ( s ) is not a robust estimator of ; it overestimates if some outliers (extreme observations) are present. To overcome these problems, Silverman proposed the following bandwidth estimator: Kernel Density Estimation (KDE) Tutorial 3 Spider Financial Corp, 2013

0.9 ˆ h opt 5 R ˆ min(s, ) 1.34 R IQR Q Q 3

Data driven estimation this is an area of current research using several different methods: Fourier transform, diffusion based, etc.

) and (optionally) with a bandwidth value.

4 0.9 ˆ h opt 5 R ˆ min(s, ) 1.34 R IQR Q Q 3 1 Where IQR is the interquartile range and s is the sample standard deviation. 3. Data driven estimation this is an area of current research using several different methods: Fourier transform, diffusion based, etc. Process Using the umxl add in for Excel, you can compute the KDE values for different kernel functions (e.g. Gaussian, uniform, triangular, etc.) and (optionally) with a bandwidth value. For our sample data, we are using 50 randomly generated values of the normal distribution (using the random generator in the Excel Analysis Pack). We plotted the histogram for our reference: ow we are ready to construct our KDE plot. First, select the empty cell in your worksheet where you wish the output table to be generated, then locate and click on the Descriptive Statistics icon in the umxl tab (or toolbar). Then, select the Kernel density estimation item from the drop down menu. Kernel Density Estimation (KDE) Tutorial 4 Spider Financial Corp, 2013

By default, the output table cells range is set to the current selected cell in your worksheet. 3.

5 The KDE wizard appears. Select the cells range for the values of the input variable. otes: 1. The cells range includes (optional) the heading ( Label ) cell, which would be used in the output tables where it references those variables. 2. By default, the output table cells range is set to the current selected cell in your worksheet. 3. By default, the output graph cells range is set to the 7 cells to the right of the currently selected cell in your worksheet. Finally, once we select the input data (X) cells range, the Options and Missing Values tabs become available (enabled). ext, select the Options tab: Kernel Density Estimation (KDE) Tutorial 5 Spider Financial Corp, 2013

otes: 1. By default, the Gaussian kernel function is selected. Let s leave this option unchanged. 2. By default, the optimal bandwidth option is checked.

This option in effect instructs the wizard to generate a second curve for the Gaussian distribution for comparison purposes. Leave this option checked. ow, click on the Missing Values tab.

6 otes: 1. By default, the Gaussian kernel function is selected. Let s leave this option unchanged. 2. By default, the optimal bandwidth option is checked. The KDE function will use the Silverman estimate for the bandwidth. Leave it checked. 3. By default, the output table size is set to 5. Leave it unchanged. 4. Overlay ormal distribution is checked. This option in effect instructs the wizard to generate a second curve for the Gaussian distribution for comparison purposes. Leave this option checked. ow, click on the Missing Values tab. In this tab, you can select an approach to handle missing values in the data set (X s). By default, any observation with missing value would be excluded from the analysis. This treatment is a good approach for our analysis, so let s leave it unchanged. ow, click OK to generate the output tables. otes: 1. The values of all X are sorted in ascending order. Kernel Density Estimation (KDE) Tutorial 6 Spider Financial Corp, 2013

2. The summary statistics in the 1 st row are computed merely to facilitate the creation of the table or computing the overlay Gaussian distribution function.

7 2. The summary statistics in the 1 st row are computed merely to facilitate the creation of the table or computing the overlay Gaussian distribution function. The generated plot of the KDE is shown below: ote that the KDE curve (blue) tracks very closely with the Gaussian density (orange) curve. Case 2 ow let s try a non normal sample data set. We generated 50 random values of a uniform distribution between 3 and 3. Following similar steps, we plotted the histogram and the KDE: ote that the KDE curve (blue) tracks much more closely with the underlying distribution (i.e. uniform) than the histogram. Case 3 For our 3 rd case, we generated 50 random values of a binomial distribution (p=0.2 and batch size=20). Following similar steps, we plotted the histogram and the KDE. Kernel Density Estimation (KDE) Tutorial 7 Spider Financial Corp, 2013

8 ote that KDE curve (blue) tracks much more closely with the underlying distribution (i.e. uniform) than the histogram. Conclusion In this tutorial, we demonstrated the process to generate a kernel density estimation in Excel using umxl s add in functions. The KDE method is a major improvement for inferring the probability density function of the population, in terms of accuracy and continuity of the function. evertheless, it introduce a new challenge: selecting a proper bandwidth. In the majority of cases, the Silverman estimator for the bandwidth proves to be satisfactory, but is it optimal? Do we care? Where do we go from here? First, to answer the question of optimality, we need to introduce additional algorithms to estimate its values. For example, in Annals of Statistics, Volume 38, umber 5, pages , Z. I. Botev, J. F. Grotowski, and D. P. Kroese described a numerical sample data driven method for finding the optimal bandwidth using a Kernel density estimation via the diffusion approach. Second, in cases where the range of values that the random number can take are known to be constrained from one side (e.g. prices, binomial data, etc.), or in a range (e.g. survival rate, default rate, etc.), then how do we adapt the KDE to factor in those constraints? Finally, we defined the KDE probability estimator using a fixed bandwidth ( h ) for all observations. If the bandwidth is not held fixed, but is varied depending upon the location of either the estimate (balloon estimator) or the samples (point wise estimator), this produces a particularly powerful method known as adaptive or variable bandwidth kernel density estimation. Kernel Density Estimation (KDE) Tutorial 8 Spider Financial Corp, 2013

Section 4 Matching Estimator

Section 4 Matching Estimator Matching Estimators Key Idea: The matching method compares the outcomes of program participants with those of matched nonparticipants, where matches are chosen on the basis