Algorithms for LTS regression


October 26, 2009

Outline
- Robust regression
- LTS regression
- Adding row algorithm
- Branch and bound algorithm (BBA); observation preordering
- Structured problems: general linear model, seemingly unrelated regressions
- Conclusions

Robust regression: ordinary least squares (OLS)
Ordinary linear model (OLM): $y = A\beta + \varepsilon$, where $y \in \mathbb{R}^m$, $A \in \mathbb{R}^{m \times n}$, $\beta \in \mathbb{R}^n$ and $\varepsilon \in \mathbb{R}^m$. Objective function: $\mathrm{RSS}(\beta) = \sum_{i=1}^{m} \varepsilon_i^2$. OLS estimation is sensitive to outliers.

Outliers
[Figure slides: scatter plots contrasting an outlier with an influential (leverage) observation.]

Robust regression
Breakdown point: the smallest fraction of contamination that can bias the estimator arbitrarily. For OLS, a single observation is enough to contaminate the estimator, i.e. the breakdown point is $1/m \to 0$. High-breakdown estimators can resist contamination of nearly 50% of the data. Well-known high-breakdown estimators: least median of squares (LMS) and least trimmed squares (LTS).

LTS regression
Objective function: $\mathrm{RSS}_h(\beta) = \sum_{i=1}^{h} (\varepsilon^2)_{[i]}$, where $(\varepsilon^2)_{[1]} \le \cdots \le (\varepsilon^2)_{[m]}$ are the ordered squared residuals and the coverage $h$ may lie between $m/2$ and $m$. Minimizing it is equivalent to finding the $h$-subset with the optimal LS objective function. Naive algorithm: enumerate all subsets. Number of $h$-subsets: $\binom{m}{h} = \frac{m!}{h!(m-h)!}$. The computational load is prohibitive for $m > 30$.
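The trimmed objective is simple to evaluate for a given coefficient vector; the hard part is minimizing it over both $\beta$ and the implicit $h$-subset. A minimal sketch in Python (an illustration, not the presentation's implementation; the function name is ours):

```python
import numpy as np

def lts_rss(A, y, beta, h):
    """LTS objective: the sum of the h smallest squared residuals at beta."""
    e2 = (y - A @ beta) ** 2              # squared residuals
    return float(np.sort(e2)[:h].sum())   # trim to the h smallest
```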

LTS regression
Approximate algorithms: PROGRESS (Rousseeuw and Leroy, 1987), FSA (Hawkins, 1994), FAST-LTS (Rousseeuw and Van Driessen, 2006). Exact algorithm: branch and bound (Agulló, 2001).

Issues
In practice, a coverage of $h \approx m/2$ (50%) is used. But contamination rarely exceeds 15-20% of the data, so maximal breakdown seems like overkill: large portions of good data are ignored. Problem: find a good tradeoff between robustness (discard all bad data) and efficiency (include all good data). How should the unknown coverage parameter $h$ be chosen?

Adding row algorithm (ARA)
There is a strong correspondence between row-selection techniques and procedures for computing all-variable-subsets regressions. The ARA computes the all-observation-subsets regressions for a range of coverage values $[h_{\min}, h_{\max}]$. The organization of the algorithm is determined by the all-subsets tree: a node corresponds to an observation-subset model, represented by the upper-triangular factor of the QR decomposition of $(A, y)$. A Cholesky updating algorithm is employed to move from one node to another.

Regression tree ($m = 4$ observations; each node is a pair [selected observations], [remaining candidates]):
Level 0: ([], [1234])
Level 1: ([1], [234]), ([2], [34]), ([3], [4]), ([4], [])
Level 2: ([12], [34]), ([13], [4]), ([14], []), ([23], [4]), ([24], []), ([34], [])
Level 3: ([123], [4]), ([124], []), ([134], []), ([234], [])
Level 4: ([1234], [])
Number of nodes: $2^m = 16$.
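A depth-first enumeration of this tree is straightforward; the following sketch (illustrative, not the presentation's code) generates exactly these node pairs for $m = 4$:

```python
def subsets_tree(remaining, selected=()):
    """Enumerate the all-subsets tree depth-first: each node is a pair
    (selected, remaining); a child moves one candidate into 'selected'."""
    yield selected, remaining
    for i in range(len(remaining)):
        # a child only keeps candidates after position i, so every
        # observation subset is generated exactly once
        yield from subsets_tree(remaining[i + 1:], selected + (remaining[i],))

print(sum(1 for _ in subsets_tree((1, 2, 3, 4))))  # 16 nodes, i.e. 2^m
```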

Cholesky update
[Figure slides: a new observation row $(x^T, y)$ is appended below the factor $(R, z)$; Givens rotations $G_1, \dots, G_5$ annihilate it one element at a time, restoring the upper-triangular form and delivering the updated RSS.]
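A sketch of this row update, assuming the standard Givens-rotation scheme (the interface is ours): appending one observation $(x^T, y)$ to the factor $(R, z)$ costs $O(n^2)$ and updates the RSS as a by-product.

```python
import numpy as np

def add_row(R, z, x, y, rss):
    """Append observation (x, y) to the QR factor [R | z] of (A, y) via
    Givens rotations; returns the updated factor and RSS.
    R: (n, n) upper triangular, z: (n,), x: (n,), y and rss: scalars."""
    R, z, x = R.copy(), z.copy(), x.copy()
    n = R.shape[0]
    for k in range(n):
        if x[k] == 0.0:
            continue                      # this element needs no rotation
        r = np.hypot(R[k, k], x[k])       # Givens rotation G_k against pivot R[k, k]
        c, s = R[k, k] / r, x[k] / r
        Rk, xk = R[k, k:].copy(), x[k:].copy()
        R[k, k:] = c * Rk + s * xk
        x[k:] = -s * Rk + c * xk          # x[k] is annihilated
        z[k], y = c * z[k] + s * y, -s * z[k] + c * y
    return R, z, rss + y * y              # the leftover y carries the RSS increment
```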

Branch and bound algorithm (BBA)
The ARA is prohibitive even for a moderate number of observations. Fundamental property: given two subsets of observations $S_A$ and $S_B$, $\mathrm{RSS}(S_A) \le \mathrm{RSS}(S_B)$ if $S_A \subseteq S_B$, where $\mathrm{RSS}(S)$ denotes the residual sum of squares of the LS estimator of the model on $S$. In other words, updating the model by one observation increases the RSS. This property can be used to restrict the number of evaluated subsets, i.e. to reduce the search space (the number of generated nodes).

Cutting test
A residual lookup table stores the best RSS found so far for each coverage value: $\cdots < \rho_k < \rho_{k+1} < \rho_{k+2} < \rho_{k+3} < \cdots$. Cutting test: prune the subtree rooted at $S$ if $b_S = \mathrm{RSS}(S) > \rho_{k+3}$, where $S$ is a set of $k$ observations, $b_S$ is the node bound, and $k+3$ is the largest coverage reachable in that subtree.
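A minimal branch-and-bound sketch for a single coverage value $h$ (the presentation's version keeps a lookup table $\rho_h$ over a whole coverage range and updates the factorization incrementally; here the RSS is recomputed at every node for brevity, and all names are ours):

```python
import numpy as np

def subset_rss(A, y, idx):
    """LS residual sum of squares on the observation subset idx."""
    if len(idx) <= A.shape[1]:
        return 0.0                        # underdetermined subsets fit exactly
    Ai, yi = A[list(idx)], y[list(idx)]
    beta, *_ = np.linalg.lstsq(Ai, yi, rcond=None)
    r = yi - Ai @ beta
    return float(r @ r)

def lts_bba(A, y, h):
    """Exact LTS via branch and bound: prune a subtree as soon as the RSS at
    its root exceeds the best h-subset RSS found so far (valid because the
    RSS is monotone under adding observations)."""
    m = A.shape[0]
    best = [np.inf, None]

    def branch(selected, start):
        b = subset_rss(A, y, selected)
        if b >= best[0]:                  # cutting test: bound the whole subtree
            return
        if len(selected) == h:
            best[:] = [b, tuple(selected)]
            return
        # only descend while enough observations remain to reach size h
        for i in range(start, m - (h - len(selected)) + 1):
            branch(selected + [i], i + 1)

    branch([], 0)
    return best  # [optimal RSS, optimal h-subset]
```

Even on small simulated datasets the cutting test removes most of the $\binom{m}{h}$ candidate subsets.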

Observation preordering
The BBA's computational efficiency rises when more nodes are cut, i.e. when bigger subtrees are bounded with bigger values. Strategies: sort the observations in decreasing order of
1. absolute LS residuals: estimate the observation-subset model and determine the residuals $\hat\varepsilon$ (+ cheap: one linear system to solve; - approximate bounds);
2. partial increment in RSS: at a node $(S, [123])$, evaluate $\mathrm{RSS}([S,1])$, $\mathrm{RSS}([S,2])$, $\mathrm{RSS}([S,3])$ and order accordingly (- expensive: up to $m$ Cholesky updates; + exact bounds).
A sketch of the first strategy follows.
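The cheap strategy in code (illustrative names, assuming a plain full-model LS fit):

```python
import numpy as np

def preorder_by_residuals(A, y):
    """Sort observations in decreasing order of absolute LS residuals, so
    that node bounds grow quickly and large subtrees are cut early."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # one LS fit
    order = np.argsort(-np.abs(y - A @ beta))
    return A[order], y[order], order
```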

Computing the bound
[Figure slides: Givens rotations $G_1, \dots, G_4$ are applied to the appended row only, against the factor $(R, z)$; the transformed right-hand-side element yields the RSS bound.]

Computing the bound
Benefits: the upper-triangular factor is not modified; about half the computational cost; matrix-copy operations are avoided.

Examples: Example 1
Dataset: random data matrix $X$ ($m = 50$, $n = 4$); fixed coefficients $\beta$; $y = X\beta + \varepsilon$. Outliers: construct a modified model $(X_m, y_m)$ from $(X, y)$ by replacing ten observations with random data. Compute the LTS estimates and determine the relative efficiencies $\mathrm{RE}(h) = \sigma_h^2 / \sigma_{\mathrm{LS}}^2$ for $h = h_{\min}, \dots, h_{\max}$.
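With the usual degrees-of-freedom normalization (an assumption; the slides do not spell out the variance estimators), the relative efficiency can be computed as:

```python
def relative_efficiency(rss_h, h, rss_ls, m, n):
    """RE(h) = sigma^2_h / sigma^2_LS with unbiased variance estimates."""
    return (rss_h / (h - n)) / (rss_ls / (m - n))
```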

[Plots: relative efficiencies against the coverage index for the original model $(X, y)$ and the contaminated model $(X_m, y_m)$.]

Examples: Example 2
Dataset: randomly generated according to $y_i = x_{i,1} + x_{i,2} + \cdots + x_{i,n} + e_i$, where $e_i \sim N(0, 1)$ is the error term and the explanatory variables are $x_{i,j} \sim N(0, 100)$. Outliers: replace some of the $x_{i,1}$ by values that are normally distributed with mean 100 and variance 100.
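A data generator following this recipe (sample sizes and the number of outliers are assumptions, since the slide does not fix them):

```python
import numpy as np

def make_example2(m=50, n=4, n_out=10, seed=0):
    """y_i = x_{i,1} + ... + x_{i,n} + e_i, e_i ~ N(0,1), x_{ij} ~ N(0,100);
    contamination replaces n_out of the x_{i,1} by N(100,100) draws."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 10.0, size=(m, n))    # sd = sqrt(100) = 10
    y = X.sum(axis=1) + rng.normal(size=m)
    Xm = X.copy()
    Xm[:n_out, 0] = rng.normal(100.0, 10.0, size=n_out)
    return X, Xm, y
```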

[Plots: relative efficiencies $\mathrm{RE}$ against the coverage $h$ for the original model $(X, y)$ and the contaminated model $(X_m, y_m)$.]

Structured linear models: general linear model
Generalized least squares (GLS). The general linear model (GLM) is given by $y = X\beta + \varepsilon$, $\varepsilon \sim (0, \sigma^2 \Omega)$, where $y \in \mathbb{R}^m$, $X \in \mathbb{R}^{m \times n}$, $\beta \in \mathbb{R}^n$ and $\Omega \in \mathbb{R}^{m \times m}$ ($m \ge n$). Here $\Omega$ is assumed to be of full rank. The BLUE of $\beta$ minimizes the objective function $\|X\beta - y\|^2_{\Omega^{-1}} = (X\beta - y)^T \Omega^{-1} (X\beta - y)$.

Generalized linear least squares problem (GLLSP)
The GLS problem is reformulated as $(\hat\beta, \hat{u}) = \operatorname{argmin}_{\beta, u} \|u\|$ subject to $y = X\beta + Bu$, where $B \in \mathbb{R}^{m \times p}$ is such that $\Omega = BB^T$ and $u \sim (0, \sigma^2 I_p)$. In this formulation $\Omega$ may be singular. Residual sum of squares: $\mathrm{RSS}(\hat\beta) = \|\hat{u}\|^2$. Computational tool: the generalized QR decomposition (GQRD) of $X$ and $B$.
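A dense sketch of the GLLSP solution via a generalized QR decomposition, assuming $B$ square and nonsingular (the singular case needs a more careful partitioning; names are ours):

```python
import numpy as np
from scipy.linalg import qr, rq, solve_triangular

def gllsp(X, B, y):
    """Minimize ||u|| subject to y = X beta + B u, with Omega = B B^T.
    Returns the BLUE of beta and the RSS ||u_hat||^2."""
    m, n = X.shape
    Q, R = qr(X)                     # X = Q [R1; 0]
    T, Z = rq(Q.T @ B)               # Q^T B = T Z, T upper triangular
    c = Q.T @ y                      # constraint: c = [R1; 0] beta + T v, v = Z u
    v2 = solve_triangular(T[n:, n:], c[n:])                 # bottom block fixes v2
    beta = solve_triangular(R[:n], c[:n] - T[:n, n:] @ v2)  # minimal ||v|| sets v1 = 0
    return beta, float(v2 @ v2)
```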

Solving the GLLSP
In a node on level $n$ of the ARA tree, the GLLSP consists of $y$, $X$ and $B$. Computing the GQRD and transforming the GLLSP reduces it to a system involving the upper-triangular factors $R$ and $T$ (figure slide).

Updating the GLLSP
For each new node, the GQRD is updated rather than recomputed: the update transforms the factors $R$ and $T$ together with the right-hand side, and the GLLSP is then reduced again to triangular form (figure slides).

Givens sequence
[Figure slides: the GQRD is updated by a sequence of Givens rotations, sweeping the affected rows and columns of the factors until triangular form is restored.]

Seemingly unrelated regressions (SUR)
$G$ regressions: $y^{(i)} = X^{(i)} \beta^{(i)} + \varepsilon^{(i)}$, $i = 1, \dots, G$, where $y^{(i)} \in \mathbb{R}^m$, $X^{(i)} \in \mathbb{R}^{m \times n_i}$ ($m \ge n_i$), $\beta^{(i)} \in \mathbb{R}^{n_i}$, $\varepsilon^{(i)} \in \mathbb{R}^m$ and $E(\varepsilon^{(i)}) = 0$. Contemporaneous disturbances are correlated: $\mathrm{Cov}(\varepsilon^{(i)}_t, \varepsilon^{(j)}_t) = \sigma_{ij}$ and $\mathrm{Cov}(\varepsilon^{(i)}_s, \varepsilon^{(j)}_t) = 0$ if $s \ne t$. Compact form: $\mathrm{vec}(Y) = \left( \bigoplus_{i=1}^{G} X^{(i)} \right) \mathrm{vec}(\{\beta^{(i)}\}_G) + \mathrm{vec}(E)$, where $Y = [y^{(1)}, \dots, y^{(G)}] \in \mathbb{R}^{m \times G}$, $E = [\varepsilon^{(1)}, \dots, \varepsilon^{(G)}] \in \mathbb{R}^{m \times G}$, $\mathrm{vec}(E) \sim (0, \Sigma \otimes I_m)$ and $\Sigma = [\sigma_{ij}] \in \mathbb{R}^{G \times G}$.
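For illustration, the compact form can be solved directly as a dense GLS problem (a sketch only; the presentation instead exploits the Kronecker structure with a GQRD):

```python
import numpy as np
from scipy.linalg import block_diag

def sur_gls(Xs, Y, Sigma):
    """GLS on the stacked SUR system: vec(Y) = blockdiag(X_1,...,X_G) beta
    + vec(E) with Cov(vec(E)) = Sigma kron I_m; Xs is the list of X^(i)."""
    m, G = Y.shape
    X = block_diag(*Xs)                    # direct sum of the G regressor matrices
    W = np.linalg.inv(np.kron(Sigma, np.eye(m)))
    yv = Y.reshape(-1, order="F")          # vec(Y) stacks the columns
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ yv)
    return beta                            # stacked beta^(1), ..., beta^(G)
```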

GLLSP of the SUR model
SUR-GLLSP: $(\{\hat\beta^{(i)}\}_G, \hat{U}) = \operatorname{argmin}_{\{\beta^{(i)}\}_G, U} \|U\|_F$ subject to $\mathrm{vec}(Y) = \left( \bigoplus_{i=1}^{G} X^{(i)} \right) \mathrm{vec}(\{\beta^{(i)}\}_G) + K \, \mathrm{vec}(U)$, where $E = U C^T$, $C \in \mathbb{R}^{G \times G}$ is upper triangular such that $C C^T = \Sigma$, $K = C \otimes I_m$, $\mathrm{vec}(U) \sim (0, \sigma^2 I_M)$ and $M = Gm$. Objective function: $\mathrm{RSS}(\{\hat\beta^{(i)}\}_G) = \|\hat{U}\|_F^2$.

Solving the SUR-GLLSP
On a level of the subsets tree, the SUR-GLLSP involves $\mathrm{vec}(Y)$, the block-diagonal matrix of the $X^{(i)}$ and $K$. An orthogonal transformation $Q^T$ is applied from the left, rows and columns are permuted, an orthogonal transformation $P^T$ is applied from the right, and the SUR-GLLSP is reduced to a system involving $\mathrm{vec}(\{z^{(i)}\})$ and the triangular factors $R^{(i)}$ and $T_{11}$ (figure slides).

Updating the SUR-GLLSP
For each new node: apply an orthogonal transformation $Q^T$ from the left, then an orthogonal transformation $P^T$ from the right, and reduce the SUR-GLLSP again to a system in $\mathrm{vec}(\{\tilde{z}^{(i)}\})$, $R^{(i)}$ and $T_{11}$ (figure slides).

Conclusions
An efficient method to compute exact LTS estimates for a whole coverage range. It allows one to assess the quality (i.e. the degree of contamination) of the data, to identify outliers, and to choose the best (e.g. the most efficient) estimate. The algorithm can be applied to other, structured linear models as a black box. An R package is available.
