Multivariate Conditional Distribution Estimation and Analysis


IT 14 62
Examensarbete 45 hp
Oktober 2014

Multivariate Conditional Distribution Estimation and Analysis

Sander Medri

Institutionen för informationsteknologi
Department of Information Technology

Abstract

Multivariate Conditional Distribution Estimation and Analysis
Sander Medri

The goals of this thesis were to implement different methods for estimating conditional distributions from data and to evaluate the performance of these methods on data sets with different characteristics. The methods were implemented in C++ and several existing software libraries were also used. Tests were run on artificially generated data sets and on some real-world data sets. The accuracy, run time and memory usage of the methods were measured. Based on the results, the natural or smoothing spline methods or the k-nearest neighbors method would potentially be a good first choice to apply to a data set if not much is known about it. In general the wavelet method did not seem to perform particularly well. The noisy-OR method could be a faster and possibly more accurate alternative to the popular logistic regression in certain cases.

Handledare: Michael Ashcroft
Ämnesgranskare: Kristiaan Pelckmans
Examinator: Ivan Christoff
IT 14 62
Tryckt av: Reprocentralen ITC

Acknowledgements This project was supported by the Estonian Ministry of Education and Research and the Archimedes Foundation.

Contents

List of Figures
List of Tables

1 Background
  1.1 Introduction
  1.2 Linear models
    1.2.1 Ordinary least squares
    1.2.2 Building a probability distribution
  1.3 Basis functions
    1.3.1 Splines
      1.3.1.1 Natural cubic splines
      1.3.1.2 B-splines
      1.3.1.3 Smoothing splines
    1.3.2 Wavelets
  1.4 k-nearest neighbors
  1.5 Kernel density estimation
  1.6 Logistic regression
  1.7 Noisy-OR

2 Analysis
  2.1 Method of analysis
  2.2 Software libraries
  2.3 Data

3 Results
  3.1 Accuracy of methods
    3.1.1 Artificial data sets
      3.1.1.1 Ordinary least squares
      3.1.1.2 B-splines
      3.1.1.3 Natural splines
      3.1.1.4 Regression splines
      3.1.1.5 Wavelets
      3.1.1.6 Smoothing splines
      3.1.1.7 k-nearest neighbors
      3.1.1.8 Kernel density estimation
      3.1.1.9 Noisy-OR and logistic regression
      3.1.1.10 Performance graphs
    3.1.2 Real-world data sets
  3.2 Run time and memory usage
    3.2.1 Regression splines
    3.2.2 Natural splines and B-splines
    3.2.3 Smoothing splines
    3.2.4 Wavelets
    3.2.5 Ordinary least squares
    3.2.6 k-nearest neighbors
    3.2.7 Kernel density estimation
    3.2.8 Noisy-OR and logistic regression
    3.2.9 Performance graphs
  3.3 Observations and discussion
    3.3.1 General observations
    3.3.2 Discrete methods
    3.3.3 Real methods
      3.3.3.1 Splines in general
      3.3.3.2 Wavelets
      3.3.3.3 Natural and smoothing splines
      3.3.3.4 Regression splines
      3.3.3.5 B-splines
      3.3.3.6 k-nearest neighbors
      3.3.3.7 Kernel density estimator
      3.3.3.8 Ordinary least squares
    3.3.4 Data sets

References

List of Figures

1.1 The pdf of a normal distribution with a mean of 0 and a standard deviation of 1
1.2 Regression spline, natural spline and B-spline estimates on a dataset with three knots at the locations indicated by vertical dashed lines
1.3 The Haar and Ricker wavelets and the corresponding scaling functions
1.4 KDE on data sampled from the normal distribution. Highest bandwidth in the top-left graph and lowest in the bottom-right
2.1 A linear dataset in one and two dimensions
2.2 A simple polynomial dataset in one and two dimensions
2.3 A cyclical dataset in one and two dimensions
2.4 A difficult dataset in one and two dimensions
3.1 Accuracy of noisy-OR compared to logistic regression for different datasets with different dimensions at 1 data points
3.2 Accuracy of the methods on the linear dataset at 1 data points for different dimensions
3.3 Accuracy of the methods on the square dataset at 1 data points for different dimensions
3.4 Accuracy of the methods on the cyclical dataset at 1 data points for different dimensions
3.5 Accuracy of the methods on the difficult dataset at 1 data points for different dimensions
3.6 Run time and peak memory usage of noisy-OR compared to logistic regression for different numbers of data points and dimensions
3.7 Run times and peak memory usage of different methods for transforming data in 3 dimensions or with 1 data points
3.8 Run time and memory usage of OLS, k-NN and KDE for different numbers of data points and dimensions

List of Tables

1.1 Example of a basis transformation
3.1 RMSE and log-likelihood of ordinary least squares on different datasets of different sizes and different dimensions
3.2 RMSE and log-likelihood of B-splines on different datasets of different sizes and different dimensions
3.3 RMSE and log-likelihood of natural splines on different datasets of different sizes and different dimensions
3.4 RMSE and log-likelihood of regression splines on different datasets of different sizes and different dimensions
3.5 RMSE and log-likelihood of Ricker wavelets on different datasets of different sizes and different dimensions
3.6 RMSE and log-likelihood of using RBFs with k-means on different datasets of different sizes and different dimensions (k = N/2)
3.7 RMSE and log-likelihood of k-NN on different datasets of different sizes and different dimensions (k = √N)
3.8 Log-likelihood of KDE on different datasets of different sizes and different dimensions (bandwidth of 1)
3.9 Log-likelihood of noisy-OR on different datasets of different sizes and different dimensions
3.10 Log-likelihood of logistic regression on different datasets of different sizes and different dimensions
3.11 Performance of the different methods on the motorcycle dataset
3.12 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different numbers of knots, dimensions and data points using regression splines
3.13 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different numbers of knots, dimensions and data points using natural splines
3.14 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different numbers of knots, dimensions and data points using B-splines
3.15 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different numbers of dimensions and data points using RBF with k-means
3.16 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different smoothing parameters, dimensions and data points using the Ricker wavelet
3.17 Run time (in milliseconds) for building the model, computing the estimate, computing the likelihood of data, building the error and generating values, and peak memory usage (in megabytes) for ordinary least squares
3.18 Run time (in milliseconds) for building the model, computing the estimate, computing the likelihood of data, building the error and generating values, and peak memory usage (in megabytes) for k-NN
3.19 Run time (in milliseconds) for building the model, computing the likelihood of data and generating values, and peak memory usage (in megabytes) for KDE
3.20 Run time (in milliseconds) for building the model, computing the likelihood of data and generating values, and peak memory usage (in megabytes) for logistic regression
3.21 Run time (in milliseconds) for building the model, computing the likelihood of data and generating values, and peak memory usage (in megabytes) for noisy-OR

Chapter 1

Background

1.1 Introduction

A random variable is a function that gives a unique numerical value to every outcome of a random event (e.g. a random variable can represent the number of times tails come up when flipping a coin a certain number of times). A probability distribution describes the relative likelihood of such a variable taking on certain values. When the random variables are continuous the probability distribution is called the probability density function (pdf) and when the variables are discrete it is called the probability mass function (pmf). As an example, the pdf of a standard normal distribution is shown in figure 1.1.

[Figure 1.1: The pdf of a normal distribution with a mean of 0 and a standard deviation of 1.]

A joint probability distribution for multiple random variables is a probability distribution that describes the probability of those variables taking on certain values simultaneously. A conditional probability distribution of a variable is its probability distribution when the values of the other variables are fixed to some specific values.

There exist different methods for estimating the distribution of data. This estimation can be useful in any field where it is necessary to make predictions about the future based on existing data. For example, by using these methods it would be possible to estimate the probability of a patient having a heart attack based on their symptoms given data about previous patients, or to estimate the probability of a change in weather conditions, stock prices, etc.

For example, when building Bayesian networks it is necessary to specify the conditional independences between the random variables that make up the nodes of the network. When building such a network from data it is necessary to estimate the conditional distributions of different variables on alternative subsets of other variables. This needs to be done numerous times in order to build an accurate network, and so the speed of the methods used for estimating these distributions becomes very important in addition to their accuracy.

For estimating distributions based on data with continuous variables the methods considered in this thesis are linear regression and k-nearest neighbors (using an error function in conjunction with the output of the method) and kernel density estimation. Additionally, methods such as splines and wavelets will be studied, which better allow the analysis of non-linear relationships. In the case of data with discrete or mixed variables the logistic regression and noisy-OR methods will be examined.

1.2 Linear models

In a linear model the output is produced by a linear combination of the input variables

    f(x) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i,    (1.1)

where x = (x_1, x_2, ..., x_n) is a vector of input variables and f(x) is the output of the model. The term \beta_0 is a constant called the intercept or bias (the value of the estimate when the x vector is zero). The output of the model could in general also be a vector; however, in this thesis only situations where it is a scalar are considered.

Such a simple model is obviously not capable of accurately representing non-linear relationships and, depending on the method used, data sets with very different shapes could end up having the same model (e.g. Anscombe's quartet [1]).

1.2.1 Ordinary least squares

There are many different methods for creating a linear model based on a data set. Ordinary least squares (OLS) is one of the simplest of such methods and also a very common one. When using OLS the coefficients \beta are chosen in order to minimize the residual sum of squares

    RSS(\beta) = \sum_i (y_i - f(x_i))^2,    (1.2)

resulting in

    \beta = (X^T X)^{-1} X^T y,    (1.3)

where X is the matrix whose rows are the input vectors x_i and y is the vector of corresponding outputs.

1.2.2 Building a probability distribution

A model for estimating the value of y alone does not provide a way to calculate the probability of y given x. One way to get this conditional probability is to also build a model for the error of the estimate f(x). This can be done, for example, by using the difference between the estimate and the actual value of y with the pdf of a distribution (e.g. a Gaussian)

    P(y | x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(y - f(x))^2}{2\sigma^2} \right),    (1.4)

where the variance is built from the training data (e.g. by summing all of the squared differences and dividing them by the number of data points).
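To make the preceding two subsections concrete, the following is a minimal C++ sketch of how equations (1.3) and (1.4) could be implemented with the Armadillo library (which is also used later in the thesis for linear regression). The function names and the way the variance is estimated are illustrative choices, not the thesis implementation itself.

#include <armadillo>
#include <cmath>

// Fit beta = (X^T X)^{-1} X^T y as in equation (1.3).
// X is assumed to already contain a leading column of ones for the intercept.
arma::vec ols_fit(const arma::mat& X, const arma::vec& y)
{
    return arma::solve(X.t() * X, X.t() * y);
}

// Estimate the error variance from the training residuals.
double residual_variance(const arma::mat& X, const arma::vec& y, const arma::vec& beta)
{
    arma::vec r = y - X * beta;
    return arma::dot(r, r) / r.n_elem;
}

// Conditional density P(y | x) from equation (1.4): a Gaussian centred on the
// linear prediction, with the variance built from the training data.
double conditional_density(const arma::rowvec& x, double y,
                           const arma::vec& beta, double sigma2)
{
    const double pi = 3.14159265358979323846;
    double mean = arma::as_scalar(x * beta);
    return std::exp(-(y - mean) * (y - mean) / (2.0 * sigma2))
           / std::sqrt(2.0 * pi * sigma2);
}

Given a training matrix and response vector, ols_fit and residual_variance would be called once, after which conditional_density can be evaluated for any new (x, y) pair.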

1.3 Basis functions

In order to make use of linear methods in cases where the underlying function cannot be directly expressed by a linear combination of the inputs we can perform a transformation of the data using a basis for some function space. It is then possible to use linear models on the transformed data in order to express non-linear relationships in the original space. For example, every quadratic polynomial can be expressed as a linear combination of the functions {1, x, x^2}. If we want to model a relationship like y = x^2 + 2 from data then we can first transform the data using the functions from that basis and then perform linear regression on the transformed data.

For a simple example consider the data in table 1.1. In table 1.1(A) the relationship between x and y is y = x^2 + 2 and x_0 is added for computing the intercept \beta_0. Applying equation (1.3) on this data directly results in \beta = (1, 3), which produces the values in the column ŷ according to equation (1.1). The data in table 1.1(B) has been transformed using the functions {1, x, x^2} on the original input variable x, producing three new variables x_1, x_2 and x_3. The values for ŷ are produced in the same manner as in table 1.1(A), with \beta = (2, 0, 1). It should be evident that the values for ŷ in table 1.1(B) match the original values of y much better than in table 1.1(A).

Table 1.1: Example of a basis transformation.

(A)
x_0  x   y   ŷ
1    0   2   1
1    1   3   4
1    2   6   7
1    3   11  10

(B)
x_1  x_2  x_3  y   ŷ
1    0    0    2   2
1    1    1    3   3
1    2    4    6   6
1    3    9    11  11

For calculating basis transformations in multidimensional cases there is a basis function b_{i k_i}(x_i) for each coordinate of x \in R^n, where k_i = 1, ..., m_i and m_i is the number of functions for coordinate i. The transformation is then defined by

    g_{k_1 ... k_n}(x) = b_{1 k_1}(x_1) b_{2 k_2}(x_2) \cdots b_{n k_n}(x_n),    k_i = 1, ..., m_i,  i \in {1, ..., n}.    (1.5)
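As an illustration of this kind of transformation in the one-dimensional case, the sketch below reproduces the Table 1.1 example: the single input is expanded with the basis {1, x, x^2} and the OLS fit from equation (1.3) is applied to the transformed data (again using Armadillo; the variable names are only for illustration).

#include <armadillo>

// Expand a single input variable with the basis {1, x, x^2} and fit a
// linear model on the transformed data, as in Table 1.1(B).
int main()
{
    arma::vec x = {0, 1, 2, 3};
    arma::vec y = {2, 3, 6, 11};              // y = x^2 + 2

    arma::mat B(x.n_elem, 3);
    B.col(0) = arma::ones<arma::vec>(x.n_elem);  // basis function 1
    B.col(1) = x;                                // basis function x
    B.col(2) = arma::square(x);                  // basis function x^2

    // Ordinary least squares on the transformed data, equation (1.3).
    arma::vec beta = arma::solve(B.t() * B, B.t() * y);
    beta.print("beta (expected approximately (2, 0, 1)):");
    return 0;
}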

1.3.1 Splines

One possibility is to transform the data to have splines as regressors. Splines are piecewise polynomials that join at points called knots. The number and degrees of the polynomials and the number and position of the knots can vary. The highest degree of the polynomials that make up the spline is also the degree of the spline.

A spline with degree D and K knots t_j can be represented by the following set of functions (also known as the truncated polynomial representation) [2]:

    s_i(x) = x^i,    i = 0, ..., D
    s_{D+j}(x) = (x - t_j)^D_+,    j = 1, ..., K    (1.6)

where u_+ = u if u > 0 and u_+ = 0 if u \le 0. Such fixed-knot splines are called regression splines [2]. As an example, this is the set of functions for a cubic spline with knots at t_1 and t_2:

    s_0(x) = 1
    s_1(x) = x
    s_2(x) = x^2
    s_3(x) = x^3
    s_4(x) = (x - t_1)^3_+
    s_5(x) = (x - t_2)^3_+.

The cubic spline is continuous and has continuous first and second derivatives. In general a spline of degree D has continuous derivatives up to order D - 1. Discontinuities can be allowed by introducing additional functions (e.g. by introducing (x - t_i)^0_+ the spline can be made discontinuous at knot t_i). For a detailed description of how the continuity restrictions are formed see [3].

Regression splines are simple and straightforward to use, with the main problem being how many knots to choose and where to place them. It is also necessary to decide the degree of the spline; however, cubic splines should usually be sufficient, as there should be no need to use splines of higher degrees unless smooth derivatives are required [2].
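The truncated power basis in equation (1.6) is simple to generate directly. The sketch below does so for the cubic case (D = 3); the knot locations are whatever the caller supplies, since the text leaves their number and placement open.

#include <vector>
#include <cmath>

// Truncated power basis of a cubic regression spline, equation (1.6):
// 1, x, x^2, x^3, (x - t_1)^3_+, ..., (x - t_K)^3_+.
std::vector<double> cubic_spline_basis(double x, const std::vector<double>& knots)
{
    std::vector<double> b;
    for (int i = 0; i <= 3; ++i)
        b.push_back(std::pow(x, i));            // x^0 .. x^3
    for (double t : knots) {
        double u = x - t;
        b.push_back(u > 0.0 ? u * u * u : 0.0); // (x - t)^3_+
    }
    return b;
}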

1.3.1.1 Natural cubic splines

A problem with the method described so far is that the polynomials that are fit to data beyond the boundary knots can behave wildly. In order to avoid this erratic behavior a natural cubic spline adds the constraint that the function is linear beyond the boundary knots. The following basis functions result from applying this constraint [2]:

    n_0(x) = 1
    n_1(x) = x
    n_{k+1}(x) = d_k(x) - d_{K-1}(x)    (1.7)
    d_k(x) = \frac{(x - t_k)^3_+ - (x - t_K)^3_+}{t_K - t_k}.

An example of how this looks compared to regular regression splines and B-splines is shown in figure 1.2.

[Figure 1.2: Regression spline, natural spline and B-spline estimates on a dataset with three knots at the locations indicated by vertical dashed lines.]

1.3.1.2 B-splines

An alternative to these splines defined through truncated polynomials are the B-splines. The B-spline basis functions are defined recursively as follows [2]:

    b_{i,1}(x) = 1 if k_i \le x < k_{i+1}, and 0 otherwise    (1.8)

    b_{i,m}(x) = \frac{x - k_i}{k_{i+m-1} - k_i} b_{i,m-1}(x) + \frac{k_{i+m} - x}{k_{i+m} - k_{i+1}} b_{i+1,m-1}(x),    (1.9)

where k_i is the ith knot and m is the order of the spline (a cubic B-spline has m = 4). With B-splines extra knots get added to the beginning and end of the (ordered) sequence of knots. The number of knots added is equal to two times the order m of the spline (m knots are added before the first knot and m knots after the last knot). The value of these knots is arbitrary and customarily equal to the first and last knots respectively [2].

The main benefits of B-splines are that they are better conditioned [3][4] than truncated polynomials and computationally efficient (least squares computations with B-splines can be reduced to O(n) [2], where n is the number of data points).
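The recursion in equations (1.8) and (1.9) translates almost directly into code. The sketch below assumes the padded, non-decreasing knot sequence described above and uses the common convention that a term with a zero denominator (which occurs with repeated knots) is treated as zero.

#include <vector>

// B-spline basis function b_{i,m}(x) from equations (1.8) and (1.9).
// `knots` is the padded knot sequence; i indexes into that sequence.
double bspline_basis(int i, int m, double x, const std::vector<double>& knots)
{
    if (m == 1)
        return (knots[i] <= x && x < knots[i + 1]) ? 1.0 : 0.0;

    double left = 0.0, right = 0.0;
    double den1 = knots[i + m - 1] - knots[i];
    double den2 = knots[i + m] - knots[i + 1];
    if (den1 > 0.0)
        left = (x - knots[i]) / den1 * bspline_basis(i, m - 1, x, knots);
    if (den2 > 0.0)
        right = (knots[i + m] - x) / den2 * bspline_basis(i + 1, m - 1, x, knots);
    return left + right;
}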

1.3.1.3 Smoothing splines

The smoothing spline estimate of a function y is the function f(x) that minimizes

    \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int f''(t)^2 \, dt,    (1.10)

where \lambda (the smoothing parameter) determines the magnitude of the penalty that the second derivative imposes. The main difference compared to regression splines or B-splines is that there is no need to directly specify the number of knots to use (it is essentially still done indirectly); instead, one parameter for the desired smoothness needs to be provided. The solution can be formulated as [2]

    f(x) = \beta_0 + \beta^T x + \sum_{j=1}^{N} \alpha_j h_j(x),    (1.11)

where h_j(x) = \lVert x - x_j \rVert^2 \log \lVert x - x_j \rVert, which is a radial basis function (a function whose value depends on the distance from some point).
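In the experiments later in the thesis the smoothing spline is handled through a radial basis expansion of the form in equation (1.11), with the centres chosen by k-means (see Section 2.1 and Table 3.6). The sketch below builds such a feature vector for one input point; the centres are simply passed in, since the clustering itself is done elsewhere (the thesis uses the KMlocal library for that), and the choice h(r) = r^2 log r follows the formula above.

#include <vector>
#include <cmath>

// Radial basis expansion suggested by equation (1.11):
// an intercept, the linear terms, and h_j(x) = ||x - c_j||^2 log ||x - c_j||
// for centres c_j supplied by the caller.
std::vector<double> rbf_features(const std::vector<double>& x,
                                 const std::vector<std::vector<double>>& centres)
{
    std::vector<double> features;
    features.push_back(1.0);                       // intercept term
    for (double xi : x) features.push_back(xi);    // linear terms beta^T x
    for (const auto& c : centres) {
        double r2 = 0.0;
        for (std::size_t d = 0; d < x.size(); ++d)
            r2 += (x[d] - c[d]) * (x[d] - c[d]);
        // r^2 log r = 0.5 * r^2 log r^2; defined as 0 at r = 0.
        features.push_back(r2 > 0.0 ? 0.5 * r2 * std::log(r2) : 0.0);
    }
    return features;
}

A linear model is then fit on these features exactly as in Section 1.2.1.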

1.3.2 Wavelets

Wavelets are a class of functions with certain properties and it is possible to define many different types of wavelets. There are multiple different methods for utilizing wavelets; however, only their use in a simple basis expansion is considered in this thesis. What makes wavelet bases interesting is their structure, which grants specificity in location and frequency, and the fact that many types of functions can be sparsely and uniquely represented by them [5].

Wavelet bases can be formed by a father wavelet (scaling function) \phi(x) and scalings and translations of a mother wavelet \psi(x) [5]

    f(x) = c_0 \phi(x) + \sum_{j=0}^{M} \sum_{k=0}^{2^j - 1} d_{jk} \psi_{jk}(x),    (1.12)

where \psi_{jk}(x) = 2^{j/2} \psi(2^j x - k). Equation (1.12) is known as a linear truncated wavelet estimator. The coefficients c and d can be found, for example, by using linear regression. As an example of wavelets, the Haar and Ricker (Mexican hat) wavelets and the corresponding scaling functions are shown in figure 1.3.

[Figure 1.3: The Haar and Ricker wavelets and the corresponding scaling functions.]

The performance of this method depends on the choice of M in a similar way as the choice of the number of knots affects a spline [5] (the possible values for M range up to n - 1 if 2^n is the number of observations). The performance is also affected by the wavelet used, and that choice depends on the specific problem (e.g. the discontinuous nature of the Haar wavelet makes it a poor choice for estimating continuous functions). This estimator will have difficulties with inhomogeneous functions with local singularities [5]. On such data it is probably better to use a non-linear wavelet estimator [5], where the general idea is to use n - 1 for M and use the data to decide which coefficients to keep (e.g. discard coefficients whose value is below a certain threshold). Another problem to be aware of is that using simple periodic wavelets will cause poor behavior near boundaries when the underlying function of the data is not periodic [6].
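As a concrete example of the truncated expansion in equation (1.12), the sketch below generates Ricker wavelet features on a dyadic grid for an input that has already been scaled to [0, 1] (the scaling step mentioned in Section 2.1). The scaling-function term c_0 \phi(x), the wavelet normalisation constant and the regression that determines the coefficients are omitted; they are assumptions left to the surrounding fitting code.

#include <vector>
#include <cmath>

// Unnormalised Ricker ("Mexican hat") wavelet.
double ricker(double t)
{
    return (1.0 - t * t) * std::exp(-0.5 * t * t);
}

// Dyadic wavelet features for the truncated expansion in equation (1.12):
// psi_{j,k}(x) = 2^{j/2} psi(2^j x - k), for j = 0..M and k = 0..2^j - 1.
// x is assumed to have been scaled to [0, 1] beforehand.
std::vector<double> ricker_features(double x, int M)
{
    std::vector<double> features;
    for (int j = 0; j <= M; ++j) {
        double scale = std::pow(2.0, j);
        for (int k = 0; k < static_cast<int>(scale); ++k)
            features.push_back(std::sqrt(scale) * ricker(scale * x - k));
    }
    return features;
}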

1.4 k-nearest neighbors

One alternative to using linear regression (with various basis transformations) is the k-nearest neighbor method, which makes no assumptions about the underlying data. When using k-NN the estimate of the function is formed by averaging the k responses y_i produced by the k points x_i that are closest to the input x in the training data

    f(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i,    (1.13)

where N_k(x) defines the k closest points to x according to some distance measure. For example, with k = 1 the value of f(x) would be the value of y_i corresponding to the x_i in the training data that is closest to x.

The behavior of this method can be adjusted by changing the distance measure, changing k or giving different weights to different data points. There are various ways to define distance, and which definition is best depends on the problem; however, if nothing specific is known about the data the Euclidean distance measure is commonly used. The choice of k determines the smoothness of the fit, or how much noise is able to affect the estimate. One option in choosing k is to follow a guideline [7] (in some places referred to as the rule of thumb) that states that k should be \sqrt{n}, where n is the number of data points. The value for k can also be chosen empirically, for example by using cross-validation.

This method is suitable for capturing non-linear dependencies since the function is approximated locally. However, k-NN is unlikely to perform well if there is insufficient data or if the data is too sparse [8], and the method is also expected to perform worse as the dimensionality of the data increases [9]. With this method most of the computation time is spent when calculating the output, since there is no model built beforehand (unlike, for example, calculating the coefficients in linear regression).
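The following is a brute-force sketch of the estimate in equation (1.13) with the Euclidean distance measure discussed above; weighting schemes and faster neighbor search structures are left out.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// k-NN regression estimate, equation (1.13). Distances are compared as
// squared Euclidean distances, which gives the same ordering.
double knn_estimate(const std::vector<std::vector<double>>& train_x,
                    const std::vector<double>& train_y,
                    const std::vector<double>& x, std::size_t k)
{
    std::vector<std::pair<double, double>> dist_y;   // (squared distance, response)
    for (std::size_t i = 0; i < train_x.size(); ++i) {
        double d2 = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j)
            d2 += (train_x[i][j] - x[j]) * (train_x[i][j] - x[j]);
        dist_y.push_back({d2, train_y[i]});
    }
    k = std::min(k, dist_y.size());
    std::partial_sort(dist_y.begin(), dist_y.begin() + k, dist_y.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < k; ++i) sum += dist_y[i].second;
    return sum / k;
}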

1.5 Kernel density estimation

Instead of building essentially two models (one for the relationship between the values of the variables and one for the likelihood of a data point based on the estimate) it is possible to try to directly estimate the probability distribution of the data. That is what density estimation is: the attempt to construct an estimate of a probability distribution function based on a set of observations. One method for performing this estimation is kernel density estimation (KDE), which was introduced by Parzen [10] and Rosenblatt [11].

The basic KDE can be written as

    f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right),    (1.14)

where (x_1, x_2, ..., x_n) is an independent and identically distributed sample drawn from some distribution, K(\cdot) is the kernel and h is a parameter called the bandwidth, which determines the smoothness of the resulting estimate. The effect of using different bandwidths is illustrated in figure 1.4.

[Figure 1.4: KDE on data sampled from the normal distribution. Highest bandwidth in the top-left graph and lowest in the bottom-right.]

The kernel is a non-negative real-valued integrable function which must integrate to one over R and be symmetric

    \int_{-\infty}^{+\infty} K(x) \, dx = 1,    (1.15)

    K(-x) = K(x) \quad \forall x \in R.    (1.16)

Since the estimate is a linear combination of kernels it inherits most of the properties of the kernel (all except symmetry). However, the choice of the kernel is much less important than the selection of a good bandwidth [12]; therefore it is a good idea to simply use a kernel that is computationally efficient, e.g. the Gaussian [13].
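A sketch of the basic estimator in equation (1.14) with a Gaussian kernel and a fixed bandwidth h (the parameter that the later experiments vary) is shown below; the conditional estimator discussed next is built from the same ingredients.

#include <cmath>
#include <vector>

// Basic kernel density estimate, equation (1.14), with a Gaussian kernel.
double kde(double x, const std::vector<double>& sample, double h)
{
    const double inv_sqrt_2pi = 0.3989422804014327;
    double sum = 0.0;
    for (double xi : sample) {
        double u = (x - xi) / h;
        sum += inv_sqrt_2pi * std::exp(-0.5 * u * u);
    }
    return sum / (sample.size() * h);
}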

Estimating conditional distributions can be done by estimating the joint distribution and the marginal distribution separately and then dividing

    f(y | x) = \frac{f(x, y)}{f(x)} = \frac{\sum_{i=1}^{n} K_{h_0}(y - y_i) \prod_{j=1}^{m} K_{h_j}(x_j - x_{ij})}{\sum_{i=1}^{n} \prod_{j=1}^{m} K_{h_{j+m}}(x_j - x_{ij})},    (1.17)

where n is the number of data points, m is the dimension of the data and K_h(x) = \frac{1}{h} K\left(\frac{x}{h}\right).

A problem that arises while using a fixed bandwidth is poor performance on the tails of distributions, or over-smoothing on the main part of the density (for an example see [14]). To overcome this problem the adaptive kernel density estimator (AKDE) can be used, where the bandwidth is varied from one point to another (for an overview of such methods see [15]). Using fixed-width bandwidths in the multivariate case can be problematic because the KDE suffers from the curse of dimensionality [15], i.e. the data becomes too sparse in higher dimensions.

1.6 Logistic regression

Logistic regression models the conditional probability P(Y = k | X = x) of a discrete variable Y as a function of x [2]

    P(Y = k | X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + B}, \quad k = 1, ..., K - 1
    P(Y = K | X = x) = \frac{1}{1 + B},    (1.18)
    B = \sum_{i=1}^{K-1} \exp(\beta_{i0} + \beta_i^T x).

Logistic regression is a popular method that is widely used (especially in situations where K = 2, i.e. where there are only two possible values for Y). These models are, for example, also often used as a data analysis tool in order to understand the role of the input variables (e.g. finding out which features are redundant) [2].

The coefficients \beta are commonly estimated by iteration (e.g. by using the Newton-Raphson method), where the convergence criterion is the maximization of the likelihood function [2] (no closed-form expression exists for computing the coefficients that maximize the likelihood function)

    l(\theta) = \sum_{i=1}^{N} \log p_{y_i}(x_i; \theta),    (1.19)

where \theta is the set of parameters and p_k(x_i; \theta) = P(Y = k | X = x_i; \theta). One big downside of this approach is that it might converge slowly or not converge at all.

1.7 Noisy-OR

The noisy-OR model [16] represents the conditional probability of a single binary variable Y (which represents an OR gate) given n binary variables X_i (the inputs to the gate). However, this model can also be generalized to work with n-ary variables and arbitrary functions [17]. The basic idea in noisy-OR is to specify the conditional probability of Y given the values of each X_i separately instead of the possible combinations of values for different X_i.

With each given variable X_i in the model is associated a noise variable N_i, which determines the value of a new variable X_i'. The variable X_i' takes the value of the corresponding X_i only if N_i is off (e.g. taking the value 0), and if N_i is on then X_i' is off. The output Y is off if all X_i' are off and is on if any X_i' is on

    P(Y = 0 | X_i' = 0 for all i) = 1,    (1.20)
    P(Y = 0 | X_i' = 1 for some i) = 0.    (1.21)

The probability of N_i being active is q_i

    P(N_i = 1) = q_i.    (1.22)

Using these assumptions it can be shown that

    P(Y = 0 | X = x) = \prod_{i \in S} q_i,    (1.23)

where S = {i : X_i = 1} (for a proof see [18]). The probability q_i can be found through the causal strength p_i of X_i for Y (e.g. given a set of items that all have property A and some also have property B, the causal strength of A for B is the number of items with property B divided by the number of items with property A)

    q_i = 1 - p_i.    (1.24)
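Once the causal strengths p_i have been estimated from the data, evaluating the model is a single product over the active inputs, as in equations (1.23) and (1.24). The sketch below returns P(Y = 1 | x); the leak term of the Henrion variant described next is not included.

#include <cstddef>
#include <vector>

// Noisy-OR conditional probability from equations (1.23) and (1.24):
// P(Y = 0 | x) is the product of the noise probabilities q_i = 1 - p_i
// over the inputs that are on; p_i are the causal strengths.
double noisy_or_probability_y1(const std::vector<int>& x,
                               const std::vector<double>& p)
{
    double prob_y0 = 1.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (x[i] == 1)
            prob_y0 *= (1.0 - p[i]);   // q_i = 1 - p_i
    return 1.0 - prob_y0;              // P(Y = 1 | x)
}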

When attempting to construct this model from data it is arguably more practical to use a version introduced by Henrion [19] where an extra term p_0 is added, which represents the probability that Y will occur when all X_i are absent

    P(Y = 1 | X_i = 0 for all i) = p_0.    (1.25)

The noisy-OR model is shown to be useful when learning conditional probability distributions from small datasets where the relationships are expected to follow the noisy-OR relatively well, and when it is desirable to include expert knowledge in the model [20].

Chapter 2

Analysis

2.1 Method of analysis

The methods described in the previous chapter were implemented in C++ and the performance of those implementations was measured on different datasets. The run time, memory usage and accuracy of the implementations were measured. The experiments were run on an Intel i7-3630QM processor under 64-bit Windows 7.

Accuracy of the methods was tested on different artificial data sets and on data sets retrieved from external sources. The artificially generated data sets were split into training and test data sets, with 80% of the data designated for training. For data sets with less than 1 data points the experiments were run ten times and the performance was averaged over all of the runs. For the other data sets k-fold cross-validation was used and the results were again averaged.

When testing for run time and memory usage the methods were run ten times in each case and the lowest run time and peak memory usage were reported. The run times were only measured from the start of the computations relevant to the methods; tasks common to all methods, such as loading the data from disk, were not factored in. The lowest run time is reported (instead of, for example, the average) to avoid factoring in situations where the process is waiting unnecessarily due to something irrelevant to the task. For the wavelet transform the scaling of the data to the range [0, 1] was factored into the run time, since this is necessary to achieve decent accuracy. The radial basis function transform was timed in conjunction with the k-means clustering of the data and the two were treated as one single method.
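The accuracy figures reported in Chapter 3 are the RMSE of the point estimates and the log-likelihood of the test data. Below is a sketch of how these two measures could be computed for a method that produces point predictions together with the Gaussian error model of equation (1.4); the actual bookkeeping in the thesis implementation may differ.

#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square error of point predictions against test responses.
double rmse(const std::vector<double>& predicted, const std::vector<double>& actual)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i)
        sum += (predicted[i] - actual[i]) * (predicted[i] - actual[i]);
    return std::sqrt(sum / actual.size());
}

// Log-likelihood of the test responses under the Gaussian error model of
// equation (1.4), with variance sigma2 estimated from the training data.
double gaussian_log_likelihood(const std::vector<double>& predicted,
                               const std::vector<double>& actual, double sigma2)
{
    const double pi = 3.14159265358979323846;
    double ll = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        double r = actual[i] - predicted[i];
        ll += -0.5 * std::log(2.0 * pi * sigma2) - r * r / (2.0 * sigma2);
    }
    return ll;
}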

2.2 Software libraries

Most of the methods were implemented directly for this thesis; however, a few existing software libraries were leveraged. The liblinear [21] library was used for running logistic regression and the KMlocal [22] library was used for running k-means. Scipy [23] was used for generating and plotting test data. The Armadillo [24] library was used for operating with matrices (mainly in performing linear regression).

2.3 Data

Four artificial data sets with different characteristics were generated for testing the accuracy of the methods. The data sets were scaled to be in the same range and to have the same amount of noise. The added noise has a variance of 2 and it was added homogeneously to all data sets. This means that the data is homoscedastic, which is not likely to be the case with real-world data and is easier for the different methods to handle.

[Figure 2.1: A linear dataset in one and two dimensions.]

The first artificial data set is a simple linear data set (see figure 2.1) that should not be challenging for any method. It is meant as a baseline where all of the methods should be able to achieve their best performance. The second data set is a simple polynomial data set (figure 2.2). It is intended as a small step up from the linear data set that some methods might perform slightly worse on than others.

The cyclical data set (figure 2.3) was added to test whether certain methods might have an advantage when there is such a repeating pattern in the data. The last data set (figure 2.4) was designed to be difficult to model for all methods.

[Figure 2.2: A simple polynomial dataset in one and two dimensions.]

[Figure 2.3: A cyclical dataset in one and two dimensions.]

[Figure 2.4: A difficult dataset in one and two dimensions.]

In addition to the artificially generated data sets, some data sets from external sources were used. These data sets were retrieved from the UCI Machine Learning Repository [25]. The lenses data set [26] (used to test noisy-OR and logistic regression) contains data representing people that need hard contact lenses, soft contact lenses or no contact lenses. In order to apply the noisy-OR method the data set was simplified to represent people that either need contact lenses or do not. The chess data set [27] represents different positions in a chess game and whether or not white can win from those positions. This data set was also used for the noisy-OR and logistic regression methods. The motorcycle data set [28] contains accelerometer readings taken over time when testing helmets. This data set was used for testing all of the other methods.

Chapter 3

Results

The results of the experiments are presented in this chapter. All of the results are listed in tables and the more relevant aspects are represented with graphs. In the tables the L symbol represents the log-likelihood of the data given the trained model and the numbers in the first row represent the dimension of the data set. RMSE stands for root-mean-square error and is provided where possible. The number of knots for the spline methods was chosen so that the resulting transformation would have the same dimensionality (and thus the same degrees of freedom) for each spline. Due to this the B-splines have the smallest number of effective knots and the natural splines have the largest.

3.1 Accuracy of methods

3.1.1 Artificial data sets

3.1.1.1 Ordinary least squares

Table 3.1: RMSE and log-likelihood of ordinary least squares on different datasets of different sizes and different dimensions. Columns: dataset type, N, then L and RMSE for dimensions 1 to 4.

linear    1e+2   -43 1.98      -52 3.37      -63 8.51      -59 5.51
linear    1e+3   -423 2.       -434 2.12     -436 2.14     -447 2.27
linear    1e+4   -418 1.96     -4255 2.3     -4179 1.95    -439 2.8
linear    1e+5   -42222 2.     -42258 2.     -42343 2.1    -42128 1.99
square    1e+2   -93 19.2      -9 19.76      -85 15.95     -91 19.67
square    1e+3   -777 11.62    -83 13.49     -781 12.1     -778 12.
square    1e+4   -7226 8.96    -733 9.44     -7214 8.91    -7226 8.97
square    1e+5   -66988 6.89   -71235 8.52   -6563 6.43    -67564 7.9
cyclical  1e+2   -1 35.43      -94 26.73     -94 25.21     -91 22.69
cyclical  1e+3   -999 35.76    -93 25.21     -893.97       -873 19.1
cyclical  1e+4   -9963 35.26   -9286 25.13   -8845.14      -8692 18.66
cyclical  1e+5   -997 35.41    -92821 25.8   -88788.5      -865 17.84
difficult 1e+2   -99 33.42     -91 23.3      -93 23.85     -93 24.11
difficult 1e+3   -977 31.99    -91 22.82     -879 19.65    -865 18.23
difficult 1e+4   -9752 31.72   -997 22.86    -8678 18.54   -8487 16.85
difficult 1e+5   -9774 32.7    -9678 22.53   -8697 18.72   -849 16.88

Table 3.1 lists the accuracy experiment results for ordinary least squares. For the linear dataset the RMSE comes down to around two, which is to be expected since that is the variance of the noise that has been added to the data. For the other datasets the RMSE does not go that low, and that is also to be expected since we are building a linear fit on the data.

3.1.1.2 B-splines

Table 3.2: RMSE and log-likelihood of B-splines on different datasets of different sizes and different dimensions. Columns: dataset type, N, then L and RMSE for dimensions 1 to 4.

linear    1e+2   -185 5.96     -1145 95.62   - -           - -
linear    1e+3   -499 2.51     -446 2.       -4213 52.43   - -
linear    1e+4   -4251 2.3     -4276 2.5     -5916 6.91    -1369 43.15
linear    1e+5   -42162 1.99   -481 1.98     -42231 2.     -68288 7.35
square    1e+2   -292 5.95     -646 71.23    - -           - -
square    1e+3   -577 2.97     -669 3.43     -515 132.     - -
square    1e+4   -4252 2.3     -4165 1.94    -4537 2.29    -17186 52.53
square    1e+5   -42182 1.99   -4248 2.2     -42246 2.     -46561 2.81
cyclical  1e+2   -1 36.17      -871 186.57   - -           - -
cyclical  1e+3   -996 35.17    -997 31.12    -668 711.73   - -
cyclical  1e+4   -9951 35.4    -931 25.43    -97 41.55     -1795 264.5
cyclical  1e+5   -99735 35.44  -92913 25.    -88914.63     -91741 26.16
difficult 1e+2   -94 26.59     -1176 348.57  - -           - -
difficult 1e+3   -939 26.49    -175 3.       -6692 1336.94 - -
difficult 1e+4   -9454 27.34   -8871.33      -9528 33.55   -5 74.46
difficult 1e+5   -94438 27.19  -8776 19.42   -84769 19.9   -91487 43.75

Table 3.2 lists the accuracy experiment results for B-splines. This method performs much worse on the linear dataset compared to OLS when the number of data points is low or the dimension is large. B-splines do outperform OLS on the square data set as expected, but not on the cyclical data set and only slightly on the difficult data set; however, that is likely due to the small number of knots used.

3.1.1.3 Natural splines

Table 3.3: RMSE and log-likelihood of natural splines on different datasets of different sizes and different dimensions. Columns: dataset type, N, then L and RMSE for dimensions 1 to 4.

linear    1e+2   -45 2.24      -64 2.76      - -           - -
linear    1e+3   -425 2.3      -427 2.4      -67 4.1       - -
linear    1e+4   -4258 2.3     -4252 2.3     -5471 3.69    -9989 21.79
linear    1e+5   -42115 1.99   -4231 2.1     -42479 2.2    -152585 52.17
square    1e+2   -53 3.29      -4 7.13       - -           - -
square    1e+3   -59 2.95      -53 3.25      -79 6.82      - -
square    1e+4   -4762 2.62    -4515 2.31    -4614 2.43    -9424 24.8
square    1e+5   -45478 2.35   -45952 2.41   -45123 2.31   -8364 16.4
cyclical  1e+2   -99 31.46     -216 4.78     - -           - -
cyclical  1e+3   -969 3.74     -916 23.28    -1243 32.3    - -
cyclical  1e+4   -9724 31.28   -916 22.91    -8784 19.51   -1453 182.96
cyclical  1e+5   -9752 3.99    -916 21.9     -86288 18.9   -118795 15.1
difficult 1e+2   -82 13.7      -123 18.84    - -           - -
difficult 1e+3   -797 13.3     -74 9.76      -841 13.4     - -
difficult 1e+4   -881 13.74    -7371 9.65    -7121 8.46    -12855 127.72
difficult 1e+5   -8391 13.47   -73549 9.57   -73 8.2       -92492 24.58

Table 3.3 lists the accuracy experiment results for natural splines. Natural splines perform quite well on the linear data set if the number of data points is large relative to the number of dimensions. The natural splines perform slightly worse than the B-splines on the square data set. This is most likely explained by the difference in the number of knots. The natural splines outperform B-splines on the cyclical and difficult data sets.

3.1.1.4 Regression splines

Table 3.4: RMSE and log-likelihood of regression splines on different datasets of different sizes and different dimensions. Columns: dataset type, N, then L and RMSE for dimensions 1 to 4.

linear    1e+2   -42 1.93      -1291 356.18  - -           - -
linear    1e+3   -4 1.97       -566 6.12     -2915 42.73   - -
linear    1e+4   -43 1.98      -4227 2.      -4616 2.35    -1527 13.14
linear    1e+5   -42322 2.1    -42144 1.99   -42416 2.2    -44345 2.23
square    1e+2   -46 2.22      -658 35.8     - -           - -
square    1e+3   -424 2.1      -457 2.22     -3565 52.41   - -
square    1e+4   -4238 2.1     -47 1.98      -4341 2.11    -155 32.7
square    1e+5   -42436 2.2    -4223 2.      -42422 2.2    -44617 2.3
cyclical  1e+2   -1 35.65      -711 293.29   - -           - -
cyclical  1e+3   -999 35.77    -956 27.92    -4988 967.72  - -
cyclical  1e+4   -9975 35.46   -928 25.5     -954 26.25    -25467 266.76
cyclical  1e+5   -99769 35.5   -92817 25.8   -88811.52     -871 18.69
difficult 1e+2   -149 39.41    -1154 428.25  - -           - -
difficult 1e+3   -938 26.28    -944 23.42    -4672 2763.9  - -
difficult 1e+4   -943 26.98    -89.72        -8482 16.63   -26627 369.86
difficult 1e+5   -9445 26.66   -87475 19.19  -83654 15.85  -85112 16.74

Table 3.4 lists the accuracy experiment results for regression splines. Regression splines perform relatively well on the linear data set and they perform the best on the square data set compared to the other splines, as would be expected. The performance on the cyclical and difficult data sets is somewhere in between the natural splines and B-splines.

3.1.1.5 Wavelets

Table 3.5: RMSE and log-likelihood of Ricker wavelets on different datasets of different sizes and different dimensions. Columns: dataset type, N, then L and RMSE for dimensions 1 to 4.

linear    1e+2   -576 12.9     -895 16.81    - -           - -
linear    1e+3   -1822 7.36    -39 7.38      -2349 9.34    - -
linear    1e+4   -4418 2.18    -534 2.7      -14789 6.86   -8478 16.72
linear    1e+5   -81363 4.44   -13347 5.33   -44491 2.21   -66239 6.59
square    1e+2   -634.96       -199 24.61    - -           - -
square    1e+3   -9.36         -2422 1.5     -2249 8.57    - -
square    1e+4   -7219 4.23    -9526 5.27    -14261 7.     -5837 4.36
square    1e+5   -64746 3.68   -62654 3.52   -129695 6.37  -69887 7.95
cyclical  1e+2   -1 34.87      -268 1.15     - -           - -
cyclical  1e+3   -999 35.76    -942 26.66    -1358 61.15   - -
cyclical  1e+4   -9919 34.45   -9267 24.89   -8978 21.49   -12875 12.75
cyclical  1e+5   -99749 35.46  -92742 24.98  -151884 48.59 -87129 18.87
difficult 1e+2   -1 33.25      -224 59.39    - -           - -
difficult 1e+3   -965 3.1      -917 23.17    -1296 37.62   - -
difficult 1e+4   -9558 28.78   -914 21.86    -8746 18.95   -189 221.86
difficult 1e+5   -9776 31.79   -918 21.84    -85473 17.37  -85582 17.46

Table 3.5 lists the accuracy experiment results for Ricker wavelets. The performance of the wavelets seems to be quite erratic in some sense. It can get either better or worse as either the number of data points or the dimension of the data increases.

3.1.1.6 Smoothing splines

Table 3.6: RMSE and log-likelihood of using RBFs with k-means on different datasets of different sizes and different dimensions (k = N/2). Columns: dataset type, N, then L and RMSE for dimensions 1 to 4.

linear    1e+2   -47 2.54      -54 3.45      -63 4.65      -59 4.56
linear    1e+3   -423 2.       -452 2.3      -453 2.33     -472 2.55
linear    1e+4   -4178 1.95    -4232 2.1     -4266 2.4     -4314 2.9
linear    1e+5   -93726 26.24  -42171 1.99   -42335 2.1    -42393 2.2
square    1e+2   -43 1.97      -64 5.73      -89 12.52     -84 12.73
square    1e+3   -426 2.3      -46 2.4       -49 2.8       -56 3.93
square    1e+4   -4254 2.3     -4295 2.7     -4418 2.      -4437 2.22
square    1e+5   -78 12.7      -42289 2.     -42388 2.1    -42717 2.5
cyclical  1e+2   -99 31.38     -94 25.62     -94 26.23     -93 23.31
cyclical  1e+3   -816 14.41    -896 21.21    -896 21.24    -875 19.17
cyclical  1e+4   -12347 116.13 -7646 11.4    -8649 18.24   -8583 17.68
cyclical  1e+5   -123738 117.66 -7764 8.32   -7665 1.85    -8368 15.82
difficult 1e+2   -86 17.42     -95 26.67     -94 24.17     -92 23.58
difficult 1e+3   -645 6.11     -788 12.44    -845 16.53    -844 16.45
difficult 1e+4   -861 17.91    -7156 8.66    -7634 1.99    -789 12.49
difficult 1e+5   -65868 6.52   -6197 5.36    -755 8.24     -73354 9.48

Table 3.6 lists the accuracy experiment results for RBF with k-means. This method seems to perform better as the dimension increases when the number of data points is large. That seems to indicate that the choice of k = N/2 is not good when there is a lot of data. Ignoring that, the RBF method performs very well in all situations.

3.1.1.7 k-nearest neighbors

Table 3.7: RMSE and log-likelihood of k-NN on different datasets of different sizes and different dimensions (k = √N). Columns: dataset type, N, then L and RMSE for dimensions 1 to 4.

linear    1e+2   -63 4.27      -8 7.76       -7 7.75       -75 9.56
linear    1e+3   -473 2.55     -531 3.31     -575 4.27     -623 5.38
linear    1e+4   -4282 2.6     -4513 2.31    -4827 2.7     -592 3.8
linear    1e+5   -42487 2.2    -4315 2.8     -44441 2.23   -46629 2.49
square    1e+2   -94 1.18      -75 9.65      -84 13.27     -91 16.44
square    1e+3   -568 3.55     -633 5.48     -686 7.35     -75 8.
square    1e+4   -4951 2.86    -595 3.9      -5968 4.78    -648 5.93
square    1e+5   -43973 2.18   -4712 2.55    -5876 3.8     -53768 3.56
cyclical  1e+2   -91.61        -94 25.18     -92 23.58     -91 21.95
cyclical  1e+3   -746 9.52     -832 15.42    -883 19.88    -873 19.4
cyclical  1e+4   -651 4.84     -723 8.97     -818 13.94    -8381 15.98
cyclical  1e+5   -5149 3.17    -63454 5.78   -7488 9.83    -79714 13.2
difficult 1e+2   -76 1.29      -85 16.54     -9.64         -89 19.96
difficult 1e+3   -583 4.45     -765 11.5     -82 13.34     -826 15.2
difficult 1e+4   -4783 2.64    -665 6.73     -7524 1.39    -7725 11.51
difficult 1e+5   -43773 2.16   -58264 4.46   -68238 7.34   -716 8.86

Table 3.7 lists the accuracy experiment results for k-NN. With all data set types the RMSE comes quite close to the expected value of 2 and gets progressively worse as the dimensionality increases. The worst performance is shown on the cyclical dataset and the best on the linear dataset. In all cases the method performs better as the size of the dataset increases.

3.1.1.8 Kernel density estimation

Table 3.8: Log-likelihood of KDE on different datasets of different sizes and different dimensions (bandwidth of 1). Columns: dataset type, N, then L for dimensions 1 to 4.

linear    1e+2   -141    -118    -19     -258
linear    1e+3   -868    -523    -661    -891
linear    1e+4   -884    -4699   -489    -5532
linear    1e+5   -77754  -45717  -45631  -46419
square    1e+2   -92     -212    -19     -391
square    1e+3   -741    -617    -919    -17
square    1e+4   -6631   -4848   -544    -5911
square    1e+5   -6316   -4458   -44633  -46425
cyclical  1e+2   -97     -246    -334    -485
cyclical  1e+3   -922    -179    -1529   -73
cyclical  1e+4   -96     -8195   -954    -11116
cyclical  1e+5   -9199   -79362  -77941  -8324
difficult 1e+2   -97     -21     -352    -485
difficult 1e+3   -897    -819    -1136   -1649
difficult 1e+4   -8984   -7334   -7647   -969
difficult 1e+5   -88949  -752    -7743   -72376

Table 3.8 lists the accuracy experiment results for KDE. The bandwidth choice is apparently not great for the one-dimensional case. Otherwise the method performs decently. The accuracy does deteriorate as the dimensionality increases and when there are smaller amounts of data.

3.1.1.9 Noisy-OR and logistic regression

Table 3.9: Log-likelihood of noisy-OR on different datasets of different sizes and different dimensions. Columns: dataset type, N, then L for dimensions 1 to 4.

or    1e+2   .48   .47      .45     .28
or    1e+3   .5    .5       .43     .35
or    1e+4   .5    .51      .43     .39
or    1e+5   .5    .5       .44     .37
and   1e+2   .49   8.56     1.7     9.27
and   1e+3   .5    86.22    8.95    61.31
and   1e+4   .5    853.79   798.1   588.69
and   1e+5   .5    8383.81  789.65  5798.39

Table 3.10: Log-likelihood of logistic regression on different datasets of different sizes and different dimensions. Columns: dataset type, N, then L for dimensions 1 to 4.

or    1e+2   7.43     4.99           3.13     2.8
or    1e+3   72.57    41.21          19.74    1.64
or    1e+4   75.1     355.19         183.6    97.22
or    1e+5   6957.41  34.61          1754.53  869.72
and   1e+2   7.38     14.6           12.11    8.74
and   1e+3   71.71    138.84         123.25   84.82
and   1e+4   75.7     1388.21        1174.92  796.91
and   1e+5   6986.44  13 862.8 11 99.1        8531.93

Table 3.9 lists the accuracy experiment results for noisy-OR. This method performs very well on the OR data set and the performance increases with the dimensionality of the data. With the AND data set the performance also increases with the dimensionality.

Table 3.10 lists the accuracy experiment results for logistic regression. Like with noisy-OR, the performance increases with the dimensionality; however, logistic regression performs much worse on the OR data set and slightly worse on the AND data set with the default settings.

3.1.1.10 Performance graphs

[Figure 3.1: Accuracy of noisy-OR compared to logistic regression for different datasets with different dimensions at 1 data points. Top panel: OR data set; bottom panel: AND data set; log-likelihood against number of dimensions.]

Figure 3.1 shows how the noisy-OR and logistic regression methods compare in terms of accuracy. For these simple data sets, and using the default settings for logistic regression, noisy-OR outperforms logistic regression in all cases, and for the OR dataset it performs extremely well. The graphs seem to indicate that as the dimensionality increases logistic regression could catch up and possibly surpass the performance of noisy-OR.

[Figure 3.2: Accuracy of the methods on the linear dataset at 1 data points for different dimensions. Log-likelihood against number of dimensions for OLS, regression spline, natural spline, B-spline, RBF, Ricker wavelet, k-NN and KDE.]

Figures 3.2 to 3.5 show the performance of the methods on the synthetic data sets when there is an abundance of data. Figure 3.2 shows how the different methods perform on the linear data set. It can be seen that the performance of the wavelet method varies a lot with the number of dimensions and is worst at two dimensions. It is also interesting how much the performance of the KDE and RBF methods increases moving from one dimension to two. This is likely explained by the choice of the bandwidth for KDE and the number of means for RBF. OLS, the natural spline and the regression spline perform the best across all dimensions.

[Figure 3.3: Accuracy of the methods on the square dataset at 1 data points for different dimensions. Log-likelihood against number of dimensions for OLS, regression spline, natural spline, B-spline, RBF, Ricker wavelet, k-NN and KDE.]

Figure 3.3 shows how the different methods perform on the square data set. Again it can be seen how widely the performance of the wavelet method varies; this time, however, the worst-case performance is in three dimensions and not two. The best methods on this data set are the RBF and the regression spline.

[Figure 3.4: Accuracy of the methods on the cyclical dataset at 1 data points for different dimensions. Log-likelihood against number of dimensions for OLS, regression spline, natural spline, B-spline, RBF, Ricker wavelet, k-NN and KDE.]

Figure 3.4 shows how the different methods perform on the cyclical data set. Wavelets perform very badly in the three-dimensional case, but otherwise are relatively competitive. For this data set the best methods seem to be k-NN and RBF.

[Figure 3.5: Accuracy of the methods on the difficult dataset at 1 data points for different dimensions. Log-likelihood against number of dimensions for OLS, regression spline, natural spline, B-spline, RBF, Ricker wavelet, k-NN and KDE.]

Figure 3.5 shows how the different methods perform on the difficult data set. On this data set the wavelets perform consistently, in the sense that their performance is not worse in some dimension by a relatively large amount. The best methods for this data set seem to be k-NN, KDE and RBF.

3.1.2 Real-world data sets

The logistic regression method outperforms noisy-OR on the real-world data sets even with the default parameters (given by the liblinear library), and does so massively on the chess data set. On the lenses dataset noisy-OR achieved a log-likelihood of 8.55 and logistic regression obtained 3.47. This was done using 3-fold cross-validation. On the chess dataset, using 10-fold cross-validation, noisy-OR achieved 15.83 and logistic regression achieved 35.37.