Knowledge Discovery and Data Mining


1 Knowledge Discovery and Data Mining: Basis Functions
Tom Kelsey, School of Computer Science, University of St Andrews

2 Housekeeping
P1 spec, dates and assessment will be released on Friday.
Legitimate concerns:
CS students: can I cope with the Maths?
Maths students: is there too much programming?
Other science students: both of the above.
These are problems for me/us. We can't offer a course that can't be done by a subset of the student cohort... or is too easy.

3 Some recurring ideas
I encourage you (particularly for this course) to think of our analysis process as separating signal from noise.
Standard statistical models: ANOVA, t-tests, regression, etc. Often there are two formal models within each: one for the signal (the location model) and one for the noise (the dispersion model).
Our statistical models are chosen partly from theoretical considerations and partly from convenience.
There is a big difference between fitting a model with a strong a priori theoretical form and fitting models to data where little is known.

4 Preliminary notation
Lowercase bold letters are vectors (a), defaulting to column vectors (i × 1) with elements a_i; uppercase bold letters are matrices A (i × j), with elements a_{ij}, rows a_i, and columns a_j.
a^T is the transpose of a.
X is typically our data: n rows with entries for p attributes.
X^{-1} is the inverse of X (recall Gaussian elimination).
y, ŷ are typically the n observed and predicted response values, respectively.

5 The Linear Regression equation
Variations on this regression equation will recur throughout the course:

y = Xβ + e

where y is an n-element column vector for the response, X is an n × p matrix of covariate values, β is a p-element column vector of population parameters, and e is a column vector of errors.
How do we calculate the model prediction at the i-th observation? From our training data we have input values X and responses y.

6 The LRE
We use these to identify a suitable β̂. At the i-th observation we have ŷ_i = x_i^T β̂.
In the well-known case of the Residual Sum of Squares we choose

RSS(β) = (y − Xβ)^T (y − Xβ)

To minimise this error, we identify where its derivative is zero:

X^T (y − Xβ) = 0

This has the unique solution (if X^T X is nonsingular)

β̂ = (X^T X)^{-1} X^T y
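A minimal sketch of this calculation in NumPy (my own illustration, not from the slides; the data and variable names are invented for the example):

```python
import numpy as np

# Simulate a small regression problem: n observations, p covariates.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                        # n x p matrix of covariates
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)  # responses with noise

# Normal equations: beta_hat = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically safer route (QR/SVD under the hood).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)
```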

7 The LRE
For test input we can predict the response for this choice of β̂:

ŷ = X (X^T X)^{-1} X^T y

The matrix that converts actual responses into predicted responses is known as the projection operator, or hat matrix, or influence matrix, depending on which textbook you read.
The residual is ê = y − Xβ̂, giving an RSS (or SSE) of ê^T ê.
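A short sketch of the hat matrix and residuals (again my own illustration, with invented data):

```python
import numpy as np

# The hat matrix H = X (X^T X)^{-1} X^T maps observed responses y to y-hat.
def hat_matrix(X):
    return X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(size=50)])   # intercept + one covariate
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.2, size=50)

H = hat_matrix(X)
y_hat = H @ y            # predicted responses
e_hat = y - y_hat        # residuals
rss = e_hat @ e_hat      # RSS (SSE) = e-hat^T e-hat
print(rss)
```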

8 The LRE
Key points:
1 We know how to invert matrices (Gaussian elimination), so variations on this technique are attractive.
2 Matrix inversion is O(p^3), so the computational complexity is low.
3 We need to watch out for problems like numeric cancellation, though.
4 For other definitions of error, the calculations for β̂ are minimisation problems for which we typically have good algorithms.

9 Basis Functions
Need-to-knows:
1 polynomial regression as a linear combination of basis functions.
2 why and how curves may be created as a linear combination of basis functions.
3 why and how to construct a univariate piecewise constant basis, e.g. a bin smooth.
4 b-spline bases.
5 extension to tensor product b-spline basis functions (the multi-dimensional case).

10 Basis functions
Basis functions are combined to produce more complex functions: the building blocks are usually simple themselves, but their combinations can represent complex functions.
Several common methods may be viewed from the basis function perspective: polynomial regression, splines, (as we will see) regression trees, bin smoothing, wavelet analysis, Fourier analysis, and many more.

11 Basis functions to construct a bin smooth
Divide the x-region(s) into K regions R_j, each having the same number of data points.
The basis functions are:

b_j(x) = 1 if x ∈ R_j, 0 otherwise.

Now multiply each basis function by the average of the outputs in that region. This gives a discontinuous function, but it is easy to derive and interpret.
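An illustrative sketch of a bin smooth (the function and variable names are my own, not from the slides): split x into K regions holding roughly equal numbers of points, then predict with the mean response of whichever region a query point falls into.

```python
import numpy as np

def fit_bin_smooth(x, y, K):
    # Region boundaries at empirical quantiles give ~equal counts per region.
    edges = np.quantile(x, np.linspace(0.0, 1.0, K + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, K - 1)
    means = np.array([y[idx == j].mean() for j in range(K)])
    return edges, means

def predict_bin_smooth(edges, means, x_new):
    j = np.clip(np.searchsorted(edges, x_new, side="right") - 1, 0, len(means) - 1)
    return means[j]

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
edges, means = fit_bin_smooth(x, y, K=8)
print(predict_bin_smooth(edges, means, np.array([0.5, 5.0, 9.5])))
```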

12-16 [figures: example fits to the same data — linear, quadratic, cubic, cubic spline, and bin smooth]. Source: Ty Tong's Nonparametric Kernel Regression Slides.

17 Piecewise linear
1 Basis functions are 1 and x.
2 Split the x region into parts divided at cut points c_k.
3 Specify a linear model for each part.
4 Add a continuity constraint.
5 Simplify the linear system of equations.
6 Solve for the parameters.
More generally (see the sketch below):
1 Choose a loss function.
2 Define the error using the linear regression equation.
3 Solve the resulting optimisation problem.
For RSS, the final stage is just matrix inversion, which is cheap. This procedure generalises!
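One standard way to realise this (my own sketch, not the lecture's derivation) is the truncated linear ("hinge") basis 1, x, (x − c_k)_+, which builds the continuity constraint into the basis itself; fitting is then ordinary least squares on the expanded design matrix.

```python
import numpy as np

# Continuous piecewise-linear regression via the hinge basis 1, x, (x - c_k)_+.
def piecewise_linear_design(x, cuts):
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - c, 0.0) for c in cuts]   # hinge terms (x - c_k)_+
    return np.column_stack(cols)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 150))
y = np.where(x < 5, 1.0 + 0.5 * x, 3.5 - 0.3 * (x - 5)) + rng.normal(scale=0.2, size=x.size)

cuts = [2.5, 5.0, 7.5]
X = piecewise_linear_design(x, cuts)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimise RSS as on the slide
y_hat = X @ beta_hat
print(beta_hat)
```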

18 Other basis functions
We often want nonlinear models, but also want to keep our linear regression equation.
A solution is to use nonlinear basis functions (1, x, x^2, x^3, ... for polynomials) but search for linear combinations of the model parameters.
To do this we turn X into a design matrix with additional columns 1, x_i, x_i^2, x_i^3, ..., and use the linear regression equation to find optimal parameters.
As before, we set the parameter derivatives to zero, then solve by inverting X^T X.
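A minimal sketch of this expansion (illustrative only; data are simulated):

```python
import numpy as np

# Polynomial regression as linear regression on the expanded design matrix
# with columns 1, x, x^2, ..., x^degree.
def polynomial_design(x, degree):
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-1, 1, 100))
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.1, size=x.size)

X = polynomial_design(x, degree=3)
# Same linear regression machinery as before: minimise RSS over the coefficients.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```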

19 Other basis functions
This approach gives the best of all worlds, in some sense.
By increasing the degree of the polynomial (e.g. 1, x_i, x_i^2, x_i^3, ..., x_i^58) we can get an excellent (too good?) fit to the data, i.e. the RSS can be made as small as we want.
We only have to invert a (p + 1) × (p + 1) matrix, where p is the degree of the polynomial. This is computationally inexpensive.
So we get nonlinear models by solving a linear regression problem.
But data values have a global influence on the model, and we may not want this.

20 Local regression using basis functions
More basis functions, and/or different bases. What makes a good choice of basis function?
"Local" effects outweigh far-off data points.
Minimum loss of accuracy due to cancellation.
Easy to calculate.
Straightforward convergence analysis.
B-splines (usually) have all these characteristics.
For detailed exposition, see Ch 19 of Approximation Theory and Methods, M. J. D. Powell, CUP, and the Appendix to Ch 5 of HTF.

21 A local univariate basis
For calculation of a series of b-spline bases (B):

B_{i,j}(x) = ((x − δ_i) / (δ_{i+j−1} − δ_i)) B_{i,j−1}(x) + ((δ_{i+j} − x) / (δ_{i+j} − δ_{i+1})) B_{i+1,j−1}(x)   (1)

where we start with:

B_{i,1}(x) = 1 if δ_i ≤ x < δ_{i+1}, 0 otherwise.   (2)
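A direct sketch of this recursion (my own implementation; the knots are called delta to match the slide's notation, indices are zero-based, and the first-order basis is the indicator of the half-open interval [δ_i, δ_{i+1})):

```python
import numpy as np

def bspline_basis(i, j, x, delta):
    # Order-1 basis: indicator of the knot interval [delta_i, delta_{i+1}).
    if j == 1:
        return np.where((delta[i] <= x) & (x < delta[i + 1]), 1.0, 0.0)
    # Higher orders blend two lower-order bases, guarding against zero denominators.
    left_den = delta[i + j - 1] - delta[i]
    right_den = delta[i + j] - delta[i + 1]
    left = 0.0 if left_den == 0.0 else \
        (x - delta[i]) / left_den * bspline_basis(i, j - 1, x, delta)
    right = 0.0 if right_den == 0.0 else \
        (delta[i + j] - x) / right_den * bspline_basis(i + 1, j - 1, x, delta)
    return left + right

delta = np.arange(7.0)                 # knots 0, 1, ..., 6
x = np.linspace(0.0, 6.0, 13)
print(bspline_basis(0, 3, x, delta))   # an order-3 (quadratic) basis function
```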

22 Example b-spline basis [figure: b-spline basis functions plotted against x]

23 Example b-spline basis [figure: the fitted response over the x data, shown as a linear combination of the basis functions with estimated coefficients β̂_1, β̂_2, ...]

24 B-spline discussion
B-splines are constructed to be zero on large ranges of x.
Calculations are relatively easy: the matrices in the linear regression equation are sparse and have exploitable structure, giving an O(n) solution for the parameters.
Curves can be as smooth as desired, giving useful derivative information. Care is needed for repeated knots, though.
Overfitting remains a problem.
For some data, transformations (e.g. log, cube root, ...) are used to obtain well-behaved residuals.
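A sketch of fitting with a b-spline basis (my own example, assuming SciPy is available; the knot positions are arbitrary). Each basis function is obtained by evaluating a BSpline whose coefficient vector is a unit vector, and the coefficients are then fitted with the usual least-squares machinery.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0.0, 10.0, 300))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

k = 3                                        # cubic b-splines
interior = np.linspace(0.0, 10.0, 9)         # evenly spaced knots
t = np.r_[[0.0] * k, interior, [10.0] * k]   # repeat boundary knots k extra times
n_basis = len(t) - k - 1

# Column j of B is the j-th basis function evaluated at the data points.
B = np.column_stack([BSpline(t, np.eye(n_basis)[j], k)(x) for j in range(n_basis)])
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta_hat
print(n_basis, float(((y - y_hat) ** 2).sum()))   # number of bases and the RSS
```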

25 Computational complexity
I Choose a model f(x):
A linear basis, no knots
B linear basis, with knots
C nonlinear basis
II Choose an error measure or loss function L(y, f(x)):
(i) RSS
(ii) quadratic loss
(iii) arbitrary loss

26 Cases
A(i) find (X^T X)^{-1}
A(ii) find (X^T X)^{-1}
A(iii) global optimisation with a nonlinear objective
B(i) solve a linear system, then find (X^T X)^{-1}
B(ii) solve a linear system, then find (X^T X)^{-1}
B(iii) global optimisation with a nonlinear objective, linear constraints
C(i) Quadratic Programming, no constraints
C(ii) Quadratic Programming, linear constraints
C(iii) global optimisation with a nonlinear objective and nonlinear constraints

27 What about more dimensions?
Multiplying a univariate basis function in one dimension by one from another dimension gives a basis function (a shape) in higher dimensions.
Two b-spline bases of degree 1 (linear) give tensor products that look like pyramids or ramps.
More curved bases, such as degree-3 (cubic) b-spline bases, give bump-like tensor products.
The number of parameters grows quickly, since we have to consider each parameter in every dimension with respect to each parameter in every other dimension, i.e. using tensor products.
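An illustrative sketch of a tensor product basis in two dimensions (names and data are my own): each 2-D basis function is a product b_j(x1) b_k(x2), so the design matrix gets one column per (j, k) pair and the column count multiplies up quickly.

```python
import numpy as np

# A small univariate basis (1, x, and hinge terms), reused for each dimension.
def linear_hinge_basis(x, cuts):
    cols = [np.ones_like(x), x] + [np.maximum(x - c, 0.0) for c in cuts]
    return np.column_stack(cols)

rng = np.random.default_rng(6)
x1 = rng.uniform(0, 1, 500)
x2 = rng.uniform(0, 1, 500)
y = np.sin(3 * x1) * np.cos(3 * x2) + rng.normal(scale=0.1, size=500)

B1 = linear_hinge_basis(x1, cuts=[0.25, 0.5, 0.75])   # 5 columns
B2 = linear_hinge_basis(x2, cuts=[0.25, 0.5, 0.75])   # 5 columns
# Row-wise tensor product: 5 * 5 = 25 columns, one per pair of univariate bases.
B = np.einsum('ij,ik->ijk', B1, B2).reshape(B1.shape[0], -1)

beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
print(B.shape, beta_hat.shape)
```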

28 Example b-spline basis [figure: a two-dimensional tensor product basis function, basis value plotted over x1 and x2]

29 Example b-spline basis [figure: a two-dimensional tensor product basis function, basis value plotted over x1 and x2]

30 Recap
Need-to-knows:
1 polynomial regression as a linear combination of basis functions.
2 why and how curves may be created as a linear combination of basis functions.
3 why and how to construct a univariate piecewise constant basis, e.g. a bin smooth.
4 tensor products for b-spline basis functions in higher dimensions.
5 read the Kondor paper for more mathematical detail (non-examinable!)
