Multivariate Conditional Distribution Estimation and Analysis


IT Examensarbete 45 hp (degree project, 45 credits), Oktober 14
Multivariate Conditional Distribution Estimation and Analysis
Sander Medri
Institutionen för informationsteknologi / Department of Information Technology


Abstract

Multivariate Conditional Distribution Estimation and Analysis
Sander Medri

The goals of this thesis were to implement different methods for estimating conditional distributions from data and to evaluate the performance of these methods on data sets with different characteristics. The methods were implemented in C++ and several existing software libraries were also used. Tests were run on artificially generated data sets and on some real-world data sets. The accuracy, run time and memory usage of the methods were measured. Based on the results, the natural or smoothing spline methods or the k-nearest neighbors method would potentially be a good first choice to apply to a data set if not much is known about it. In general the wavelet method did not seem to perform particularly well. The noisy-OR method could be a faster and possibly more accurate alternative to the popular logistic regression in certain cases.

Handledare (supervisor): Michael Ashcroft
Ämnesgranskare (subject reviewer): Kristiaan Pelckmans
Examinator (examiner): Ivan Christoff


Acknowledgements

This project was supported by the Estonian Ministry of Education and Research and the Archimedes Foundation.

Contents

List of Figures
List of Tables
1 Background
  1.1 Introduction
  1.2 Linear models
    1.2.1 Ordinary least squares
    1.2.2 Building a probability distribution
  1.3 Basis functions
    Splines
    Natural cubic splines
    B-splines
    Smoothing splines
    Wavelets
  1.4 k-nearest neighbors
  1.5 Kernel density estimation
  1.6 Logistic regression
  1.7 Noisy-OR
2 Analysis
  2.1 Method of analysis
  2.2 Software libraries
  2.3 Data
3 Results
  3.1 Accuracy of methods
    Artificial data sets: ordinary least squares, B-splines, natural splines, regression splines, wavelets, smoothing splines, k-nearest neighbors, kernel density estimation, noisy-OR and logistic regression, performance graphs
    Real-world data sets
  Run time and memory usage: regression splines, natural splines and B-splines, smoothing splines, wavelets, ordinary least squares, k-nearest neighbors, kernel density estimation, noisy-OR and logistic regression, performance graphs
  Observations and discussion: general observations, discrete methods, real methods, splines in general, wavelets, natural and smoothing splines, regression splines, B-splines, k-nearest neighbors, kernel density estimator, ordinary least squares, data sets
References

List of Figures

1.1 The pdf of a normal distribution with a mean of 0 and a standard deviation of 1
1.2 Regression spline, natural spline and B-spline estimates on a dataset with three knots at the locations indicated by vertical dashed lines
1.3 The Haar and Ricker wavelets and the corresponding scaling functions
1.4 KDE on data sampled from the normal distribution. Highest bandwidth in the top-left graph and lowest in the bottom-right
2.1 A linear dataset in one and two dimensions
2.2 A simple polynomial dataset in one and two dimensions
2.3 A cyclical dataset in one and two dimensions
2.4 A difficult dataset in one and two dimensions
3.1 Accuracy of noisy-OR compared to logistic regression for different datasets with different dimensions at 1 data points
3.2 Accuracy of the methods on the linear dataset at 1 data points for different dimensions
3.3 Accuracy of the methods on the square dataset at 1 data points for different dimensions
3.4 Accuracy of the methods on the cyclical dataset at 1 data points for different dimensions
3.5 Accuracy of the methods on the difficult dataset at 1 data points for different dimensions
3.6 Run time and peak memory usage of noisy-OR compared to logistic regression for different numbers of data points and dimensions
3.7 Run times and peak memory usage for different methods for transforming data in 3 dimensions or with 1 data points
3.8 Run time and memory usage of OLS, k-NN and KDE for different numbers of data points and dimensions

List of Tables

1.1 Example of a basis transformation
3.1 RMSE and log-likelihood of ordinary least squares on different datasets of different sizes and different dimensions
3.2 RMSE and log-likelihood of B-splines on different datasets of different sizes and different dimensions
3.3 RMSE and log-likelihood of natural splines on different datasets of different sizes and different dimensions
3.4 RMSE and log-likelihood of regression splines on different datasets of different sizes and different dimensions
3.5 RMSE and log-likelihood of Ricker wavelets on different datasets of different sizes and different dimensions
3.6 RMSE and log-likelihood of using RBFs with k-means on different datasets of different sizes and different dimensions (k = N/2)
3.7 RMSE and log-likelihood of k-NN on different datasets of different sizes and different dimensions (k = √N)
3.8 Log-likelihood of KDE on different datasets of different sizes and different dimensions (bandwidth of 1)
3.9 Log-likelihood of noisy-OR on different datasets of different sizes and different dimensions
3.10 Log-likelihood of logistic regression on different datasets of different sizes and different dimensions
3.11 Performance of the different methods on the motorcycle dataset
3.12 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different numbers of knots, dimensions and data points using regression splines
3.13 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different numbers of knots, dimensions and data points using natural splines
3.14 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different numbers of knots, dimensions and data points using B-splines
3.15 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different numbers of dimensions and data points using RBF with k-means
3.16 Runtime (in milliseconds) and peak memory usage (in megabytes) for transforming the data for different smoothing parameters, dimensions and data points using the Ricker wavelet
3.17 Run time (in milliseconds) for building the model, computing the estimate, computing the likelihood of data, building the error and generating values, and peak memory usage (in megabytes) for ordinary least squares
3.18 Run time (in milliseconds) for building the model, computing the estimate, computing the likelihood of data, building the error and generating values, and peak memory usage (in megabytes) for k-NN
3.19 Run time (in milliseconds) for building the model, computing the likelihood of data and generating values, and peak memory usage (in megabytes) for KDE
3.20 Run time (in milliseconds) for building the model, computing the likelihood of data and generating values, and peak memory usage (in megabytes) for logistic regression
3.21 Run time (in milliseconds) for building the model, computing the likelihood of data and generating values, and peak memory usage (in megabytes) for noisy-OR

Chapter 1. Background

1.1 Introduction

A random variable is a function that gives a unique numerical value to every outcome of a random event (e.g. a random variable can represent the number of times tails comes up when flipping a coin a certain number of times). A probability distribution describes the relative likelihood of such a variable taking on certain values. When the random variables are continuous the probability distribution is called the probability density function (pdf) and when the variables are discrete it is called the probability mass function (pmf). As an example, the pdf of a standard normal distribution is shown in figure 1.1.

Figure 1.1: The pdf of a normal distribution with a mean of 0 and a standard deviation of 1.
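For reference, the density plotted in figure 1.1 is the standard normal pdf:

$$ f(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right). $$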

A joint probability distribution for multiple random variables is a probability distribution that describes the probability of those variables taking on certain values simultaneously. A conditional probability distribution of a variable is its probability distribution when the values of other variables are fixed to some specific values.

There exist different methods for estimating the distribution of data. This estimation can be useful in any field where it is necessary to make predictions about the future based on existing data. For example, by using these methods it would be possible to estimate the probability of a patient having a heart attack based on their symptoms given data about previous patients, or to estimate the probability of a change in weather conditions, stock prices, etc.

For example, when building Bayesian networks it is necessary to specify the conditional independences between the random variables that make up the nodes of the network. When building such a network from data it is necessary to estimate the conditional distributions of different variables on alternative subsets of other variables. This needs to be done numerous times in order to build an accurate network, and so the speed of the methods used for estimating these distributions becomes very important in addition to their accuracy.

For estimating distributions based on data with continuous variables the methods considered in this thesis are linear regression and k-nearest neighbors (using an error function in conjunction with the output of the method) and kernel density estimation. Additionally, methods such as splines and wavelets will be studied, which allow non-linear relationships to be analyzed. In the case of data with discrete or mixed variables the logistic regression and noisy-OR methods will be examined.

1.2 Linear models

In a linear model the output is produced by a linear combination of the input variables

$$ f(x) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i, \qquad (1.1) $$

where x = (x_1, x_2, ..., x_n) is a vector of input variables and f(x) is the output of the model. The term β_0 is a constant called the intercept or bias (the value of the estimate when the x vector is zero). The output of the model could in general also be a vector, however in this thesis only situations where it is a scalar are considered.

Such a simple model is obviously not capable of accurately representing non-linear relationships, and depending on the method used, different data sets with very different shapes could end up having the same model (e.g. Anscombe's quartet [1]).

1.2.1 Ordinary least squares

There are many different methods for creating a linear model based on a data set. Ordinary least squares (OLS) is one of the simplest of such methods and also a very common one. When using OLS the coefficients β are chosen to minimize the residual sum of squares

$$ \mathrm{RSS}(\beta) = \sum_i \big(y_i - f(x_i)\big)^2, \qquad (1.2) $$

resulting in

$$ \beta = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T y. \qquad (1.3) $$

1.2.2 Building a probability distribution

A model for estimating the value of y alone does not provide a way to calculate the probability of y given x. One way to get this conditional probability is to also build a model for the error of the estimate f(x). This can be done, for example, by using the difference between the estimate and the actual value of y with the pdf of a distribution (e.g. a Gaussian)

$$ P(y \mid x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(y - f(x))^2}{2\sigma^2}\right), \qquad (1.4) $$

where the variance is built from the training data (e.g. by summing all of the squared differences and dividing them by the number of data points).

1.3 Basis functions

In order to make use of linear methods in cases where the underlying function can not be directly expressed by a linear combination of inputs we can perform a transformation of the data using a basis for some function space. It is then possible to use linear models on the transformed data in order to express non-linear relationships in the original space. For example, every quadratic polynomial can be expressed as a linear combination of the functions {1, x, x²}. If we want to model a relationship like y = x² from data, then we can first transform the data using the functions from that basis and then perform linear regression on the transformed data.

For a simple example consider the data in table 1.1. In table 1.1(A) the relationship between x and y is y = x² + 2, and a constant column x_0 = 1 is added for computing the intercept β_0. Applying equation (1.3) on this data directly results in β = (1, 3), which produces the values in the column ŷ according to equation (1.1). The data in table 1.1(B) has been transformed using the functions {1, x, x²} on the original input variable x, producing three new variables x_1, x_2 and x_3. The values for ŷ are produced in the same manner as in table 1.1(A), with β = (2, 0, 1). It should be evident that the values for ŷ in table 1.1(B) match the original values of y much better than in table 1.1(A).

Table 1.1: Example of a basis transformation, with columns (A) x_0, x, y, ŷ and (B) x_1, x_2, x_3, y, ŷ.

For calculating basis transformations in multidimensional cases there is a set of basis functions b_{ik_i}(x_i) for each coordinate of x ∈ R^n, where k_i = 1, ..., m_i and m_i is the number of functions for coordinate i. The transformation is then defined by

$$ g_{k_1 \dots k_n}(x) = b_{1k_1}(x_1)\, b_{2k_2}(x_2) \cdots b_{nk_n}(x_n), \qquad k_i = 1, \dots, m_i, \; i \in \{1, \dots, n\}. \qquad (1.5) $$

Splines

One possibility is to transform the data to have splines as regressors. Splines are piecewise polynomials that join at points that are called knots. The number and degrees of the polynomials and the number and position of the knots can vary. The highest degree of the polynomials that make up the spline is also the degree of the spline.
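Before looking at specific spline bases, here is a minimal NumPy sketch tying together the OLS fit of equation (1.3), the Gaussian error model of equation (1.4) and the quadratic basis expansion of table 1.1. It is illustrative only: the methods evaluated in this thesis were implemented in C++ (see chapter 2), and the data-generating function and noise level below are arbitrary examples.

```python
# Illustrative sketch (not the thesis's C++ implementation): ordinary least
# squares on a quadratic basis expansion, plus a Gaussian error model that
# turns the point estimate f(x) into a conditional density P(y | x).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x**2 + 2 + rng.normal(scale=1.0, size=x.size)   # y = x^2 + 2 plus noise

# Basis expansion {1, x, x^2}, as in table 1.1(B).
X = np.column_stack([np.ones_like(x), x, x**2])

# OLS coefficients, equation (1.3): beta = (X^T X)^{-1} X^T y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Error model: residual variance estimated from the training data.
residuals = y - X @ beta
sigma2 = np.mean(residuals**2)

def conditional_density(y_new, x_new):
    """P(y | x) under the Gaussian error model of equation (1.4)."""
    f = beta @ np.array([1.0, x_new, x_new**2])
    return np.exp(-(y_new - f)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(beta)                        # should be close to (2, 0, 1)
print(conditional_density(6.0, 2.0))
```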

A spline with degree D and K knots t_j can be represented by the following set of functions (also known as the truncated polynomial representation) [2]:

$$ s_i(x) = x^i, \quad i = 0, \dots, D, \qquad s_{D+j}(x) = (x - t_j)_+^D, \quad j = 1, \dots, K, \qquad (1.6) $$

where u_+ = u if u > 0 and u_+ = 0 if u ≤ 0. Such fixed-knot splines are called regression splines [2]. As an example, this is the set of functions for a cubic spline with knots at t_1 and t_2:

s_0(x) = 1, s_1(x) = x, s_2(x) = x², s_3(x) = x³, s_4(x) = (x − t_1)³_+, s_5(x) = (x − t_2)³_+.

The cubic spline is continuous and has continuous first and second derivatives. In general a spline of degree D has continuous derivatives up to order D − 1. Discontinuities can be allowed by introducing additional functions (e.g. by introducing (x − t_i)⁰_+ the spline can be made discontinuous at knot t_i). For a detailed description of how the continuity restrictions are formed see [3].

Regression splines are simple and straightforward to use, with the main problem being how many knots to choose and where to place them. It is also necessary to decide the degree of the spline, however cubic splines should usually be sufficient, as there should be no need to use splines of higher degrees unless smooth derivatives are required [2].

Natural cubic splines

A problem with the method described so far is that the polynomials that are fit to data beyond the boundary knots can behave wildly. In order to avoid this erratic behavior a natural cubic spline adds the constraint that the function is linear beyond the boundary knots. The following basis functions result from applying this constraint [2]:

$$ n_0(x) = 1, \quad n_1(x) = x, \quad n_{k+1}(x) = d_k(x) - d_{K-1}(x), \qquad d_k(x) = \frac{(x - t_k)_+^3 - (x - t_K)_+^3}{t_K - t_k}. \qquad (1.7) $$
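A hedged sketch of these two bases follows, built directly from the truncated power functions of equation (1.6) and the natural cubic spline functions of equation (1.7); the knot locations below are arbitrary examples, not the ones used in the experiments.

```python
# Sketch of the regression-spline and natural-cubic-spline bases of
# equations (1.6) and (1.7); knot locations here are arbitrary examples.
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Columns: 1, x, ..., x^D, (x - t_1)_+^D, ..., (x - t_K)_+^D."""
    x = np.asarray(x, dtype=float)
    cols = [x**i for i in range(degree + 1)]
    cols += [np.maximum(x - t, 0.0)**degree for t in knots]
    return np.column_stack(cols)

def natural_cubic_basis(x, knots):
    """Columns: 1, x, and d_k(x) - d_{K-1}(x) for k = 1, ..., K-2."""
    x = np.asarray(x, dtype=float)
    t = np.sort(np.asarray(knots, dtype=float))
    K = len(t)
    def d(k):  # 0-based index into the sorted knots
        return (np.maximum(x - t[k], 0.0)**3
                - np.maximum(x - t[K - 1], 0.0)**3) / (t[K - 1] - t[k])
    cols = [np.ones_like(x), x] + [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

x = np.linspace(0.0, 10.0, 50)
B1 = truncated_power_basis(x, knots=[3.0, 7.0])           # 4 + 2 columns
B2 = natural_cubic_basis(x, knots=[1.0, 3.0, 7.0, 9.0])   # 2 + 2 columns
print(B1.shape, B2.shape)
```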

An example of how this looks compared to regular regression splines and B-splines is shown in figure 1.2.

Figure 1.2: Regression spline, natural spline and B-spline estimates on a dataset with three knots at the locations indicated by vertical dashed lines.

B-splines

An alternative to these splines defined through truncated polynomials are the B-splines. The B-spline basis functions are defined recursively as follows [2]:

$$ b_{i,1}(x) = \begin{cases} 1 & \text{if } k_i \le x < k_{i+1} \\ 0 & \text{otherwise} \end{cases} \qquad (1.8) $$

$$ b_{i,m}(x) = \frac{x - k_i}{k_{i+m-1} - k_i}\, b_{i,m-1}(x) + \frac{k_{i+m} - x}{k_{i+m} - k_{i+1}}\, b_{i+1,m-1}(x), \qquad (1.9) $$

where k_i is the ith knot and m is the order of the spline (a cubic B-spline has m = 4). With B-splines extra knots get added to the beginning and end of the (ordered) sequence of knots. The number of knots added is equal to two times the order m of the spline (m knots are added before the first knot and m knots after the last knot).
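The recursion in equations (1.8)-(1.9) can be sketched directly; the interior knot sequence below is an arbitrary example, and the repeated boundary knots follow the padding rule just described.

```python
# Sketch of the B-spline basis functions from the recursion in
# equations (1.8) and (1.9) (Cox-de Boor); written for clarity, not speed.
import numpy as np

def bspline_basis(i, m, x, knots):
    """Value of the i-th B-spline basis function of order m at points x."""
    x = np.asarray(x, dtype=float)
    k = knots
    if m == 1:
        return np.where((k[i] <= x) & (x < k[i + 1]), 1.0, 0.0)
    left = np.zeros_like(x)
    right = np.zeros_like(x)
    if k[i + m - 1] > k[i]:       # guard against repeated knots
        left = (x - k[i]) / (k[i + m - 1] - k[i]) * bspline_basis(i, m - 1, x, knots)
    if k[i + m] > k[i + 1]:
        right = (k[i + m] - x) / (k[i + m] - k[i + 1]) * bspline_basis(i + 1, m - 1, x, knots)
    return left + right

# Cubic B-splines (m = 4) with repeated boundary knots, as described above.
interior = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
m = 4
knots = np.concatenate([np.repeat(interior[0], m), interior, np.repeat(interior[-1], m)])
x = np.linspace(0.0, 4.0, 9)
B = np.column_stack([bspline_basis(i, m, x, knots) for i in range(len(knots) - m)])
print(B.shape)
```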

The value of these added boundary knots is arbitrary and customarily equal to the first and last knots respectively [2].

The main benefits of B-splines are that they are better conditioned [3] [4] than truncated polynomials and computationally efficient (least squares computations with B-splines can be reduced to O(n) [2], where n is the number of data points).

Smoothing splines

The smoothing spline estimate of a function y is the function f(x) that minimizes

$$ \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \int f''(t)^2\, dt, \qquad (1.10) $$

where λ (the smoothing parameter) determines the magnitude of the penalty that the second derivative imposes. The main difference compared with regression splines or B-splines is that there is no need to directly specify the number of knots to use (it is essentially still done indirectly); instead one parameter for the desired smoothness needs to be provided. The solution can be formulated as [2]

$$ f(x) = \beta_0 + \beta^T x + \sum_{j=1}^{N} \alpha_j h_j(x), \qquad (1.11) $$

where h_j(x) = ‖x − x_j‖² log ‖x − x_j‖, which is a radial basis function (a function whose value depends on the distance from some point).

Wavelets

Wavelets are a class of functions with certain properties and it is possible to define many different types of wavelets. There are multiple different methods for utilizing wavelets, however only their use in a simple basis expansion is considered in this thesis. What makes these wavelet bases interesting is their structure, which grants specificity in location and frequency, and the fact that many types of functions can be sparsely and uniquely represented by them [5]. Wavelet bases can be formed by a father wavelet (scaling function) φ(x) and scalings and translations of a mother wavelet ψ(x) [5]

$$ f(x) = c_0 \varphi(x) + \sum_{j=0}^{M} \sum_{k=0}^{2^j - 1} d_{jk}\, \psi_{jk}(x), \qquad (1.12) $$

where ψ_{jk}(x) = 2^{j/2} ψ(2^j x − k). Equation (1.12) is known as a linear truncated wavelet estimator. The coefficients c_0 and d_{jk} can be found, for example, by using linear regression. As an example of wavelets, the Haar and Ricker (Mexican hat) wavelets and the corresponding scaling functions are shown in figure 1.3.

Figure 1.3: The Haar and Ricker wavelets and the corresponding scaling functions.

The performance of this method depends on the choice of M, in a similar way that the choice of the number of knots affects the spline [5] (the possible values for M range up to n − 1 if 2^n is the number of observations). The performance is also affected by the wavelet used, and that choice depends on the specific problem (e.g. the discontinuous nature of the Haar wavelet makes it a poor choice for estimating continuous functions). This estimator will have difficulties with inhomogeneous functions with local singularities [5]. On such data it is probably better to use a non-linear wavelet estimator [5], where the general idea is to use n − 1 for M and use the data to decide which coefficients to keep (e.g. discard coefficients if their value is below a certain threshold). Another problem to be aware of is that using simple periodic wavelets will cause poor behavior near boundaries when the underlying function of the data is not periodic [6].
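Before moving on to k-nearest neighbors, here is a hedged sketch of the linear truncated wavelet estimator of equation (1.12) using the Haar wavelet; the choice of M and the test data are illustrative only, and the inputs are assumed to be scaled to [0, 1] as mentioned in chapter 2.

```python
# Sketch of the linear truncated wavelet estimator of equation (1.12),
# using the Haar father/mother wavelets; M and the data are illustrative.
import numpy as np

def haar_father(x):
    return np.where((0.0 <= x) & (x < 1.0), 1.0, 0.0)

def haar_mother(x):
    return np.where((0.0 <= x) & (x < 0.5), 1.0,
                    np.where((0.5 <= x) & (x < 1.0), -1.0, 0.0))

def haar_design_matrix(x, M):
    """Columns: phi(x) and psi_{jk}(x) = 2^{j/2} psi(2^j x - k), j=0..M, k=0..2^j-1."""
    x = np.asarray(x, dtype=float)          # assumed scaled to [0, 1]
    cols = [haar_father(x)]
    for j in range(M + 1):
        for k in range(2**j):
            cols.append(2**(j / 2) * haar_mother(2**j * x - k))
    return np.column_stack(cols)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 256)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

W = haar_design_matrix(x, M=3)
coef, *_ = np.linalg.lstsq(W, y, rcond=None)   # c_0 and d_{jk} by linear regression
print(W.shape, coef.shape)
```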

1.4 k-nearest neighbors

One alternative to using linear regression (with various basis transformations) is the k-nearest neighbor method, which makes no assumptions about the underlying data. When using k-NN the estimate of the function is formed by averaging the k responses y_i produced by the k points x_i that are closest to the input x in the training data

$$ f(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i, \qquad (1.13) $$

where N_k(x) defines the k closest points to x according to some distance measure. For example, with k = 1 the value of f(x) would be the value of y_i corresponding to the x_i in the training data that is closest to x.

The behavior of this method can be adjusted by changing the distance measure, changing k or giving different weights to different data points. There are various different ways to define distance and which definition is best depends on the problem, however if nothing specific is known about the data the Euclidean distance measure is commonly used. The choice of k determines the smoothness of the fit, or how much noise is able to affect the estimate. One option in choosing k is to follow a guideline [7] (in some places referred to as the rule of thumb) that states that k should be √n, where n is the number of data points. The value for k can also be chosen empirically, for example by using cross-validation.

This method is suitable for capturing non-linear dependencies since the function is approximated locally. However, k-NN is unlikely to perform well if there is insufficient data or if the data is too sparse [8], and the method is also expected to perform worse as the dimensionality of the data increases [9]. With this method most of the computation time is spent when calculating the output, since there is no model built beforehand (e.g. like calculating the coefficients in linear regression).
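A minimal sketch of the estimator in equation (1.13), using Euclidean distance and the √n rule of thumb for k; the data below is a made-up example.

```python
# Sketch of the k-NN regression estimate of equation (1.13); Euclidean
# distance, k = sqrt(n) rule of thumb, and made-up example data.
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Average the y-values of the k training points closest to x_query."""
    distances = np.linalg.norm(x_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1]**2 + rng.normal(scale=0.5, size=500)

k = int(np.sqrt(len(X)))        # the sqrt(n) guideline mentioned above
print(knn_predict(X, y, np.array([0.5, 1.0]), k))
```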

1.5 Kernel density estimation

Instead of building essentially two models (one for the relationship between the values of the variables and one for the likelihood of a data point based on the estimate) it is possible to try to directly estimate the probability distribution of the data. That is what density estimation is: the attempt to construct an estimate of a probability distribution function based on a set of observations. One method for performing this estimation is called kernel density estimation (KDE), which was introduced by Parzen [10] and Rosenblatt [11].

The basic KDE can be written as

$$ f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right), \qquad (1.14) $$

where (x_1, x_2, ..., x_n) is an independent and identically distributed sample drawn from some distribution, K(·) is the kernel and h is a parameter called the bandwidth, which determines the smoothness of the resulting estimate. The effect of using different bandwidths is illustrated in figure 1.4.

Figure 1.4: KDE on data sampled from the normal distribution. Highest bandwidth in the top-left graph and lowest in the bottom-right.

The kernel is a non-negative real-valued integrable function which must be symmetric and integrate to one over R

$$ \int_{-\infty}^{+\infty} K(x)\, dx = 1, \qquad (1.15) $$

$$ K(-x) = K(x) \quad \forall x \in \mathbb{R}. \qquad (1.16) $$

Since the estimate is a linear combination of kernels it inherits most of the properties of the kernel (all except symmetry). However, the choice of the kernel is much less important than the selection of a good bandwidth [12], therefore it is a good idea to simply use a kernel that is computationally efficient, e.g. the Gaussian [13].
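A small sketch of equation (1.14) with a Gaussian kernel; the bandwidth values below are placeholders rather than tuned choices.

```python
# Sketch of the basic kernel density estimate of equation (1.14) with a
# Gaussian kernel; the bandwidth h is a placeholder, not a tuned value.
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x_query, sample, h):
    """f(x) = 1/(n h) * sum_i K((x - x_i) / h)."""
    sample = np.asarray(sample, dtype=float)
    u = (x_query - sample) / h
    return gaussian_kernel(u).sum() / (len(sample) * h)

rng = np.random.default_rng(3)
sample = rng.normal(size=1000)
for h in (0.05, 0.3, 1.0):      # smaller h -> rougher estimate
    print(h, kde(0.0, sample, h))
```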

Estimating conditional distributions can be done by estimating the joint distribution and the marginal distribution separately and then dividing

$$ f(y \mid x) = \frac{f(x, y)}{f(x)} = \frac{\sum_{i=1}^{n} K_h(y - y_i) \prod_{j=1}^{m} K_{h_j}(x_j - x_{ij})}{\sum_{i=1}^{n} \prod_{j=1}^{m} K_{h_{j+m}}(x_j - x_{ij})}, \qquad (1.17) $$

where n is the number of data points, m is the dimension of the data and K_h(x) = (1/h) K(x/h).

A problem that arises while using a fixed bandwidth is poor performance on the tails of distributions, or over-smoothing on the main part of the density (for an example see [14]). To overcome this problem the adaptive kernel density estimator (AKDE) can be used, where the bandwidth is varied from one point to another (for an overview of such methods see [15]). Using fixed-width bandwidths in the multivariate case can be problematic because the KDE suffers from the curse of dimensionality [15], i.e. the data becomes too sparse in higher dimensions.

1.6 Logistic regression

Logistic regression models the conditional probability P(Y = k | X = x) of a discrete variable Y as a function of x [2]

$$ P(Y = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{B}, \quad k = 1, \dots, K - 1, \qquad P(Y = K \mid X = x) = \frac{1}{B}, \qquad B = 1 + \sum_{i=1}^{K-1} \exp(\beta_{i0} + \beta_i^T x). \qquad (1.18) $$

Logistic regression is a popular method that is widely used (especially in situations where K = 2, i.e. where there are only two possible values for Y). These models are, for example, also often used as a data analysis tool in order to understand the role of the input variables (e.g. finding out which features are redundant) [2].

The coefficients β are commonly estimated by iteration (e.g. by using the Newton-Raphson method) where the convergence criterion is the maximization of the likelihood function [2] (no closed-form expression exists for computing the coefficients that maximize the likelihood function)

$$ l(\theta) = \sum_{i=1}^{N} \log p_{y_i}(x_i; \theta), \qquad (1.19) $$

where θ is the set of parameters and p_k(x_i; θ) = P(Y = k | X = x_i; θ).
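A hedged sketch of equations (1.18) and (1.19); the coefficients and data below are illustrative placeholders, not fitted values (in the experiments the fitting was done by the liblinear library, see chapter 2).

```python
# Sketch of the multiclass logistic model of equation (1.18) and its
# log-likelihood (1.19); the coefficients here are illustrative, not fitted.
import numpy as np

def class_probabilities(beta0, beta, x):
    """P(Y = k | X = x) for k = 1..K, with class K as the reference class."""
    scores = beta0 + beta @ x                  # shape (K-1,)
    expo = np.exp(scores)
    B = 1.0 + expo.sum()
    return np.append(expo / B, 1.0 / B)        # last entry is class K

def log_likelihood(beta0, beta, X, y):
    """Equation (1.19): sum_i log p_{y_i}(x_i; theta), with labels y in 0..K-1."""
    return sum(np.log(class_probabilities(beta0, beta, x)[yi])
               for x, yi in zip(X, y))

beta0 = np.array([0.5, -0.2])                  # K = 3 classes, 2 input variables
beta = np.array([[1.0, -1.0], [0.3, 0.8]])
X = np.array([[0.2, 1.1], [-0.5, 0.4]])
y = np.array([0, 2])
print(class_probabilities(beta0, beta, X[0]))
print(log_likelihood(beta0, beta, X, y))
```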

One big downside of this iterative approach is that it might converge slowly or not converge at all.

1.7 Noisy-OR

The noisy-OR model [16] represents the conditional probability of a single binary variable Y (which represents an OR gate) given n binary variables X_i (the inputs to the gate). However, this model can also be generalized to work with n-ary variables and arbitrary functions [17]. The basic idea in noisy-OR is to specify the conditional probability of Y given the values of each X_i separately, instead of for the possible combinations of values of the different X_i.

With each given variable X_i in the model there is an associated noise variable N_i, which determines the value of a new variable X'_i. The variable X'_i takes the value of the corresponding X_i only if N_i is off (e.g. takes the value 0); if N_i is on, then X'_i is off. The output Y is off if all X'_i are off and is on if any X'_i is on

$$ P(Y = 0 \mid X'_i = 0 \text{ for all } i) = 1, \qquad (1.20) $$

$$ P(Y = 0 \mid X'_i = 1 \text{ for some } i) = 0. \qquad (1.21) $$

The probability of N_i being active is q_i

$$ P(N_i = 1) = q_i. \qquad (1.22) $$

Using these assumptions it can be shown that

$$ P(Y = 0 \mid X = x) = \prod_{i \in S} q_i, \qquad (1.23) $$

where S = {i : X_i = 1} (for a proof see [18]). The probability q_i can be found through the causal strength p_i of X_i for Y (e.g. given a set of items that all have property A and some also have property B, the causal strength of A for B is the number of items with property B divided by the number of items with property A)

$$ q_i = 1 - p_i. \qquad (1.24) $$

When attempting to construct this model from data it is arguably more practical to use a version introduced by Henrion [19] where an extra term p_0 is added which represents the probability that Y will occur when all X_i are absent

$$ P(Y = 1 \mid X_i = 0 \text{ for all } i) = p_0. \qquad (1.25) $$

The noisy-OR model is shown to be useful when learning conditional probability distributions from small datasets where the relationships are expected to follow the noisy-OR relatively well, and when it is desirable to include expert knowledge in the model [20].
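As a sketch of the leaky noisy-OR just described, the function below combines equations (1.23)-(1.25) in one common way of incorporating the leak term that is consistent with equation (1.25); the causal strengths and leak probability are made-up example values, not parameters learned from data.

```python
# Sketch of a leaky noisy-OR model based on equations (1.23)-(1.25); the
# causal strengths p_i and the leak probability p0 are made-up examples.
import numpy as np

def noisy_or_probability(p, p0, x):
    """P(Y = 1 | X = x) = 1 - (1 - p0) * prod_{i: x_i = 1} (1 - p_i)."""
    p = np.asarray(p, dtype=float)
    x = np.asarray(x, dtype=int)
    q = 1.0 - p                        # q_i = 1 - p_i, equation (1.24)
    return 1.0 - (1.0 - p0) * np.prod(q[x == 1])

p = np.array([0.8, 0.6, 0.3])          # causal strength of each X_i for Y
p0 = 0.05                              # probability of Y when all X_i are 0
print(noisy_or_probability(p, p0, [1, 0, 1]))   # 1 - 0.95 * 0.2 * 0.7
print(noisy_or_probability(p, p0, [0, 0, 0]))   # equals p0
```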

Chapter 2. Analysis

2.1 Method of analysis

The methods described in the previous chapter were implemented in C++ and the performance of those implementations was measured on different datasets. The run time, memory usage and accuracy of the implementations were measured. The experiments were run on an Intel i7-3630QM processor under 64-bit Windows 7.

Accuracy of the methods was tested on different artificial data sets and data sets retrieved from external sources. The artificially generated data sets were split into training and test data sets with 80% of the data designated for training. For data sets with less than 1 data points the experiments were run ten times and the performance was averaged over all of the runs. For the other data sets k-fold cross-validation was used and the results were again averaged.

When testing for run time and memory usage the methods were run ten times in each case and the lowest run time and peak memory usage were reported. The run times were only measured from the start of the computations relevant to the methods; tasks common to all methods such as loading the data from disk were not factored in. The lowest run time is reported (instead of the average, for example) to avoid factoring in situations where the process is waiting unnecessarily due to something irrelevant to the task. For the wavelet transform the scaling of the data to the range [0, 1] was factored into the run time since this is necessary to achieve decent accuracy. The radial basis function transform was timed in conjunction with the k-means clustering of the data and the two were treated as one single method.
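The evaluation loop described above can be sketched as follows; the fold count, the toy model and the RMSE score here are placeholders standing in for the actual methods and measures used in the experiments.

```python
# Sketch of a k-fold cross-validation loop of the kind used in the evaluation;
# fit/score below are trivial stand-ins, not the methods from chapter 1.
import numpy as np

def k_fold_scores(X, y, k, fit, score):
    """Average a score over k train/test splits (folds chosen at random)."""
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[test], y[test]))
    return float(np.mean(scores))

# Tiny usage example: the "model" is just the training mean, scored by RMSE.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = X[:, 0] + rng.normal(scale=0.5, size=100)
fit = lambda Xtr, ytr: ytr.mean()
score = lambda m, Xte, yte: float(np.sqrt(np.mean((yte - m) ** 2)))
print(k_fold_scores(X, y, k=10, fit=fit, score=score))
```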

2.2 Software libraries

Most of the methods were implemented directly for this thesis, however a few existing software libraries were leveraged. The liblinear [21] library was used for running logistic regression and the KMlocal [22] library was used for running k-means. Scipy [23] was used for generating and plotting test data. The Armadillo [24] library was used for operating with matrices (mainly in performing linear regression).

2.3 Data

Four artificial data sets with different characteristics were generated for testing the accuracy of the methods. The data sets were scaled to be in the same range and to have the same amount of noise. The added noise has a variance of 2 and it was added homogeneously to all data sets. This means that the data is homoscedastic, which is not likely to be the case with real-world data and is easier to handle for the different methods.

Figure 2.1: A linear dataset in one and two dimensions.

The first artificial data set is a simple linear data set (see figure 2.1) that should not be challenging for any method. It is meant as a baseline where all of the methods should be able to achieve their best performance. The second data set is a simple polynomial data set (figure 2.2). It is intended as a small step up from the linear data set that some methods might perform slightly worse on than others. The cyclical data set (figure 2.3) was added to test if certain methods might have an advantage if there is such a repeating pattern in the data. The last data set (figure 2.4) was designed to be difficult to model for all methods.

Figure 2.2: A simple polynomial dataset in one and two dimensions.

Figure 2.3: A cyclical dataset in one and two dimensions.

Figure 2.4: A difficult dataset in one and two dimensions.

In addition to the artificially generated data sets, some data sets from external sources were used. These data sets were retrieved from the UCI Machine Learning Repository [25]. The lenses data set [26] (used to test noisy-OR and logistic regression) contains data representing people that need hard or soft contact lenses or no contact lenses. In order to apply the noisy-OR method the data set was simplified to represent people that either need contact lenses or not. The chess data set [27] represents different positions in a chess game and whether or not white can win from those positions. This data set was also used for the noisy-OR and logistic regression methods. The motorcycle data set [28] contains data from accelerometer readings taken over time when testing helmets. This data set was used for testing all of the other methods.

Chapter 3. Results

The results of the experiments are presented in this chapter. All of the results are listed in tables and the more relevant aspects are represented with graphs. In the tables the L symbol represents the log-likelihood of the data given the trained model and the numbers in the first row represent the dimension of the data set. RMSE stands for root-mean-square error and is provided where possible.

The number of knots for the spline methods was chosen so that the resulting transformation would have the same dimensionality (and thus the same degrees of freedom) for each spline. Due to this the B-splines have the smallest number of effective knots and the natural splines have the largest.
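For reference, the two measures reported in the tables can be sketched as follows; for the regression-based methods the density behind the log-likelihood is the Gaussian error model of section 1.2.2, which is the case shown here, and the numbers are placeholders.

```python
# Sketch of the two accuracy measures used below: RMSE of the point estimates
# and log-likelihood of the test data under a Gaussian error model.
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def gaussian_log_likelihood(y_true, y_pred, sigma2):
    r = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma2) - r ** 2 / (2 * sigma2)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])
print(rmse(y_true, y_pred))
print(gaussian_log_likelihood(y_true, y_pred, sigma2=0.04))
```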

3.1 Accuracy of methods

Artificial data sets

Ordinary least squares

Table 3.1: RMSE and log-likelihood of ordinary least squares on different datasets of different sizes and different dimensions.

Table 3.1 lists the accuracy experiment results for ordinary least squares. For the linear dataset the RMSE comes down to around two, which is to be expected since that is the variance of the noise that has been added to the data. For the other datasets the RMSE does not go that low, and that is also to be expected since we are building a linear fit on the data.

B-splines

Table 3.2: RMSE and log-likelihood of B-splines on different datasets of different sizes and different dimensions.

Table 3.2 lists the accuracy experiment results for B-splines. This method performs much worse on the linear dataset compared to OLS when the number of data points is low or the dimension is large. B-splines do outperform OLS on the square data set as expected, but not on the cyclical data set and only slightly on the difficult data set, however that is likely due to the small number of knots used.

Natural splines

Table 3.3: RMSE and log-likelihood of natural splines on different datasets of different sizes and different dimensions.

Table 3.3 lists the accuracy experiment results for natural splines. Natural splines perform quite well on the linear data set if the number of data points is large relative to the number of dimensions. The natural splines perform slightly worse than the B-splines on the square data set. This is most likely explained by the difference in the number of knots. The natural splines outperform B-splines on the cyclical and difficult data sets.

Regression splines

Table 3.4: RMSE and log-likelihood of regression splines on different datasets of different sizes and different dimensions.

Table 3.4 lists the accuracy experiment results for regression splines. Regression splines perform relatively well on the linear data set and they perform the best on the square data set compared to the other splines, as would be expected. The performance on the cyclical and difficult data sets is somewhere in between the natural splines and B-splines.

Wavelets

Table 3.5: RMSE and log-likelihood of Ricker wavelets on different datasets of different sizes and different dimensions.

Table 3.5 lists the accuracy experiment results for Ricker wavelets. The performance of the wavelets seems to be quite erratic in some sense. It can get either better or worse as either the number of data points or the dimension of the data increases.

Smoothing splines

Table 3.6: RMSE and log-likelihood of using RBFs with k-means on different datasets of different sizes and different dimensions (k = N/2).

Table 3.6 lists the accuracy experiment results for RBF with k-means. This method seems to perform better as the dimension increases when the number of data points is large. That seems to indicate that the choice of k = N/2 is not good when there is a lot of data. Ignoring that, the RBF method performs very well in all situations.

k-nearest neighbors

Table 3.7: RMSE and log-likelihood of k-NN on different datasets of different sizes and different dimensions (k = √N).

Table 3.7 lists the accuracy experiment results for k-NN. With all data set types the RMSE comes quite close to the expected value of 2 and gets progressively worse as the dimensionality increases. The worst performance is shown on the cyclical dataset and the best on the linear dataset. In all cases the method performs better as the size of the dataset increases.

Kernel density estimation

Table 3.8: Log-likelihood of KDE on different datasets of different sizes and different dimensions (bandwidth of 1).

Table 3.8 lists the accuracy experiment results for KDE. The bandwidth choice is apparently not great for the one-dimensional case. Otherwise the method performs decently. The accuracy does deteriorate as the dimensionality increases and the amount of data gets smaller.

Noisy-OR and logistic regression

Table 3.9: Log-likelihood of noisy-OR on different datasets of different sizes and different dimensions.

Table 3.10: Log-likelihood of logistic regression on different datasets of different sizes and different dimensions.

Table 3.9 lists the accuracy experiment results for noisy-OR. This method performs very well on the OR data set and the performance increases with the dimensionality of the data. With the AND data set the performance also increases with the dimensionality.

Table 3.10 lists the accuracy experiment results for logistic regression. Like with noisy-OR, the performance increases with the dimensionality, however logistic regression performs much worse on the OR data set and slightly worse on the AND data set with the default settings.

Performance graphs

Figure 3.1: Accuracy of noisy-OR compared to logistic regression for different datasets with different dimensions at 1 data points.

Figure 3.1 shows how the noisy-OR and logistic regression methods compare in terms of accuracy. For these simple data sets and using default settings for logistic regression the noisy-OR outperforms logistic regression in all cases, and for the OR dataset it performs extremely well. The graphs seem to indicate that as the dimensionality increases logistic regression could catch up and possibly surpass the performance of noisy-OR.

Figure 3.2: Accuracy of the methods on the linear dataset at 1 data points for different dimensions.

Figures 3.2 to 3.5 show the performance of the methods on the synthetic data sets when there is an abundance of data. Figure 3.2 shows how the different methods perform on the linear data set. It can be seen that the performance of the wavelet method varies a lot with the number of dimensions and is worst at two dimensions. It is also interesting how much the performance of the KDE and RBF methods increases moving from one dimension to two. This is likely explained by the choice of the bandwidth for KDE and the number of means for RBF. The OLS, natural spline and regression spline perform the best across all dimensions.

Figure 3.3: Accuracy of the methods on the square dataset at 1 data points for different dimensions.

Figure 3.3 shows how the different methods perform on the square data set. Again it can be seen how widely the performance of the wavelet method varies, this time however the worst-case performance is in three dimensions and not two. The best methods on this data set are the RBF and the regression spline.

Figure 3.4: Accuracy of the methods on the cyclical dataset at 1 data points for different dimensions.

Figure 3.4 shows how the different methods perform on the cyclical data set. Wavelets perform very badly in the three-dimensional case, but otherwise are relatively competitive. For this data set the best methods seem to be k-NN and RBF.

Figure 3.5: Accuracy of the methods on the difficult dataset at 1 data points for different dimensions.

Figure 3.5 shows how the different methods perform on the difficult data set. On this data set the wavelets perform consistently in the sense that their performance is not worse in some dimension by a relatively large amount. The best methods for this data set seem to be k-NN, KDE and RBF.

Real-world data sets

The logistic regression method outperforms noisy-OR on the real-world data sets even with the default parameters (given by the liblinear library), and does so massively on the chess data set. On the lenses dataset noisy-OR achieved a log-likelihood of 8.55 and logistic regression obtained This was done using 3-fold cross-validation. On the chess dataset using 10-fold cross-validation noisy-OR achieved and logistic regression achieved


More information

Learning from Data Linear Parameter Models

Learning from Data Linear Parameter Models Learning from Data Linear Parameter Models Copyright David Barber 200-2004. Course lecturer: Amos Storkey a.storkey@ed.ac.uk Course page : http://www.anc.ed.ac.uk/ amos/lfd/ 2 chirps per sec 26 24 22 20

More information

Lecture 27, April 24, Reading: See class website. Nonparametric regression and kernel smoothing. Structured sparse additive models (GroupSpAM)

Lecture 27, April 24, Reading: See class website. Nonparametric regression and kernel smoothing. Structured sparse additive models (GroupSpAM) School of Computer Science Probabilistic Graphical Models Structured Sparse Additive Models Junming Yin and Eric Xing Lecture 7, April 4, 013 Reading: See class website 1 Outline Nonparametric regression

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information

A Brief Look at Optimization

A Brief Look at Optimization A Brief Look at Optimization CSC 412/2506 Tutorial David Madras January 18, 2018 Slides adapted from last year s version Overview Introduction Classes of optimization problems Linear programming Steepest

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Dynamic Thresholding for Image Analysis

Dynamic Thresholding for Image Analysis Dynamic Thresholding for Image Analysis Statistical Consulting Report for Edward Chan Clean Energy Research Center University of British Columbia by Libo Lu Department of Statistics University of British

More information

Moving Beyond Linearity

Moving Beyond Linearity Moving Beyond Linearity The truth is never linear! 1/23 Moving Beyond Linearity The truth is never linear! r almost never! 1/23 Moving Beyond Linearity The truth is never linear! r almost never! But often

More information

Function approximation using RBF network. 10 basis functions and 25 data points.

Function approximation using RBF network. 10 basis functions and 25 data points. 1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview 1. Overview of SVMs 2. Margin Geometry 3. SVM Optimization 4. Overlapping Distributions 5. Relationship to Logistic Regression 6. Dealing

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Assessing the Quality of the Natural Cubic Spline Approximation

Assessing the Quality of the Natural Cubic Spline Approximation Assessing the Quality of the Natural Cubic Spline Approximation AHMET SEZER ANADOLU UNIVERSITY Department of Statisticss Yunus Emre Kampusu Eskisehir TURKEY ahsst12@yahoo.com Abstract: In large samples,

More information

lecture 10: B-Splines

lecture 10: B-Splines 9 lecture : -Splines -Splines: a basis for splines Throughout our discussion of standard polynomial interpolation, we viewed P n as a linear space of dimension n +, and then expressed the unique interpolating

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Neural Computation : Lecture 14 John A. Bullinaria, 2015 1. The RBF Mapping 2. The RBF Network Architecture 3. Computational Power of RBF Networks 4. Training

More information

Automated Parameterization of the Joint Space Dynamics of a Robotic Arm. Josh Petersen

Automated Parameterization of the Joint Space Dynamics of a Robotic Arm. Josh Petersen Automated Parameterization of the Joint Space Dynamics of a Robotic Arm Josh Petersen Introduction The goal of my project was to use machine learning to fully automate the parameterization of the joint

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

University of Cambridge Engineering Part IIB Paper 4F10: Statistical Pattern Processing Handout 11: Non-Parametric Techniques

University of Cambridge Engineering Part IIB Paper 4F10: Statistical Pattern Processing Handout 11: Non-Parametric Techniques University of Cambridge Engineering Part IIB Paper 4F10: Statistical Pattern Processing Handout 11: Non-Parametric Techniques Mark Gales mjfg@eng.cam.ac.uk Michaelmas 2011 11. Non-Parameteric Techniques

More information

Chapter 6 Continued: Partitioning Methods

Chapter 6 Continued: Partitioning Methods Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Machine Learning and Pervasive Computing

Machine Learning and Pervasive Computing Stephan Sigg Georg-August-University Goettingen, Computer Networks 17.12.2014 Overview and Structure 22.10.2014 Organisation 22.10.3014 Introduction (Def.: Machine learning, Supervised/Unsupervised, Examples)

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

1 Training/Validation/Testing

1 Training/Validation/Testing CPSC 340 Final (Fall 2015) Name: Student Number: Please enter your information above, turn off cellphones, space yourselves out throughout the room, and wait until the official start of the exam to begin.

More information

Mini-project 2 CMPSCI 689 Spring 2015 Due: Tuesday, April 07, in class

Mini-project 2 CMPSCI 689 Spring 2015 Due: Tuesday, April 07, in class Mini-project 2 CMPSCI 689 Spring 2015 Due: Tuesday, April 07, in class Guidelines Submission. Submit a hardcopy of the report containing all the figures and printouts of code in class. For readability

More information

Machine Learning: Think Big and Parallel

Machine Learning: Think Big and Parallel Day 1 Inderjit S. Dhillon Dept of Computer Science UT Austin CS395T: Topics in Multicore Programming Oct 1, 2013 Outline Scikit-learn: Machine Learning in Python Supervised Learning day1 Regression: Least

More information

Moving Beyond Linearity

Moving Beyond Linearity Moving Beyond Linearity Basic non-linear models one input feature: polynomial regression step functions splines smoothing splines local regression. more features: generalized additive models. Polynomial

More information

Topics in Machine Learning

Topics in Machine Learning Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng Machine Learning - Motivation Arthur

More information

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016 CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

More information

The K-modes and Laplacian K-modes algorithms for clustering

The K-modes and Laplacian K-modes algorithms for clustering The K-modes and Laplacian K-modes algorithms for clustering Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced http://faculty.ucmerced.edu/mcarreira-perpinan

More information

Edge and local feature detection - 2. Importance of edge detection in computer vision

Edge and local feature detection - 2. Importance of edge detection in computer vision Edge and local feature detection Gradient based edge detection Edge detection by function fitting Second derivative edge detectors Edge linking and the construction of the chain graph Edge and local feature

More information

6 Model selection and kernels

6 Model selection and kernels 6. Bias-Variance Dilemma Esercizio 6. While you fit a Linear Model to your data set. You are thinking about changing the Linear Model to a Quadratic one (i.e., a Linear Model with quadratic features φ(x)

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Robust Shape Retrieval Using Maximum Likelihood Theory

Robust Shape Retrieval Using Maximum Likelihood Theory Robust Shape Retrieval Using Maximum Likelihood Theory Naif Alajlan 1, Paul Fieguth 2, and Mohamed Kamel 1 1 PAMI Lab, E & CE Dept., UW, Waterloo, ON, N2L 3G1, Canada. naif, mkamel@pami.uwaterloo.ca 2

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

Model Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer

Model Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer Model Assessment and Selection Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Model Training data Testing data Model Testing error rate Training error

More information

HW 10 STAT 672, Summer 2018

HW 10 STAT 672, Summer 2018 HW 10 STAT 672, Summer 2018 1) (0 points) Do parts (a), (b), (c), and (e) of Exercise 2 on p. 298 of ISL. 2) (0 points) Do Exercise 3 on p. 298 of ISL. 3) For this problem, try to use the 64 bit version

More information

CS 559: Machine Learning Fundamentals and Applications 9 th Set of Notes

CS 559: Machine Learning Fundamentals and Applications 9 th Set of Notes 1 CS 559: Machine Learning Fundamentals and Applications 9 th Set of Notes Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Office: Lieb 215 Overview

More information

LOGISTIC REGRESSION FOR MULTIPLE CLASSES

LOGISTIC REGRESSION FOR MULTIPLE CLASSES Peter Orbanz Applied Data Mining Not examinable. 111 LOGISTIC REGRESSION FOR MULTIPLE CLASSES Bernoulli and multinomial distributions The mulitnomial distribution of N draws from K categories with parameter

More information

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation

Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each

More information

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines DATA MINING LECTURE 10B Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines NEAREST NEIGHBOR CLASSIFICATION 10 10 Illustrating Classification Task Tid Attrib1

More information

COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING

COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING Luca Bortolussi Department of Mathematics and Geosciences University of Trieste Office 238, third floor, H2bis luca@dmi.units.it Trieste, Winter Semester

More information

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

CS 521 Data Mining Techniques Instructor: Abdullah Mueen CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines & Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

CS 450 Numerical Analysis. Chapter 7: Interpolation

CS 450 Numerical Analysis. Chapter 7: Interpolation Lecture slides based on the textbook Scientific Computing: An Introductory Survey by Michael T. Heath, copyright c 2018 by the Society for Industrial and Applied Mathematics. http://www.siam.org/books/cl80

More information