Industrialising Small Area Estimation at the Australian Bureau of Statistics

1 Industrialising Small Area Estimation at the Australian Bureau of Statistics Peter Radisich Australian Bureau of Statistics Workshop on Methods in Official Statistics - March

2 Outline Background Current method Industrialisation Alternative methods

3 BACKGROUND

4 What is a Small Area? Liberal use of the term Area Small Domain Estimation Areas with small sample size Standard methods not fit for purpose Some areas may have no sample

5 What is Small Area Estimation? The small area estimates are still simple Averages Proportions Totals Just lots of them!

6 Why are estimates for small areas needed? Resource allocation for services Planning and decision making at the local level Policy development and evaluation Research (social, health, labour market) Microeconomic analysis (sustainability of regional economies)

7 Why Use Small Area Estimation? ABS surveys Direct estimates (weighted totals and averages) Usually designed for state and national level Impractical to design for small areas Effort Cost Respondent burden

8 Why Use Small Area Estimation? Need for explicit statistical models Why will a model work? Provide greater access to rich sources of data Sharing data (borrowing strength) More data = better estimates Most of the time. Sharing data not always appropriate

9 How accurate are direct estimates for small areas? Relative Standard Errors for direct estimates of Needs assistance with mobility

10 Small Area Applications People with a disability Labour force status Undercount of Aboriginal and Torres Strait Islander peoples Health conditions Household energy consumption Household wealth Farm water use and agricultural practices Land use, land cover and crop yields

11 CURRENT METHOD As shown by extensive research and analysis, the mean square error would be halved, if you divided it by 2.

12 Current method generalised linear mixed models (GLMM) Usually logistic regression Random intercept at small area level Unit level modelling Ignore design weights Include design variables into the model

13 Small Area Methods Poisson Logistic Multinomial Linear Log Linear Models Synthetic Random effects Estimation WinBUGS MPQL REML Quality Diagnostics

14 Small Area Methods Poisson Logistic Multinomial Linear Log Linear Models Random effects Estimation WinBUGS INLA? PROC MPQL GLIMMIX REML Quality Diagnostics

15 Current SAE Process 1. Obtain and prepare data. 2. Build the regression model 3. Calculate predictions and measures of accuracy 4. Calibrate to published estimates 5. Quality assurance of predictions

17 Data sources Survey has the variable of interest Census Administrative Centrelink, Tax, Building Approvals, Weather, etc. Estimated resident population Anything else you can get access to.

18 Preparing the data End goal: two final data sets Sample data Population data

19 Preparing the data Sample Data Population Data

20 Preparing data Look for common data items Variable definition in Survey vs Census Look for important variables Contextual variables Easy to drown in data definitions Family Composition vs Family Type Data item lists for surveys are massive

21 Preparing data Understanding data takes time We have information in different people/sections Understand data, but not models Understand models, but not the data The search for explanatory variables When to stop? Interplay with knowing the data & modelling

22 Current SAE Process 1. Obtain and prepare data. 2. Build the regression model 3. Calculate predictions and measures of accuracy 4. Calibrate to published estimates 5. Quality assurance of predictions

23 Build regression model GLMM type model. $y$: variable of interest; $y_s$ ($y_r$): sample (population) data set. $X$: explanatory variables ("fixed"); $X_s$ ($X_r$): sample (population) data set. $Z$: small area variable ("random"); $Z_s$ ($Z_r$): sample (population) data set.

24 Build regression model Sample Data: $y_s$, $X_s$, $Z_s$. Population Data: $y_r$, $X_r$, $Z_r$.

26 Build regression model We know everything except $y_r$. Fit a model using the sample ($y_s$, $X_s$, $Z_s$): model selection ("pruning" of columns in $X_s$), parameter estimates, standard errors. Predict using the population ($X_r$, $Z_r$).

27 Build regression model Rough description of model: $E[y_s \mid u] = h(X_s \beta + Z_s u)$, where $h(\eta) = \eta$ (linear), $h(\eta) = \exp(\eta)$ (log linear), or $h(\eta) = \exp(\eta) / (1 + \exp(\eta))$ (logistic).
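The three choices of $h$ named on this slide can be written out directly. This is only a minimal sketch: the function names and the toy numbers are hypothetical, it assumes a single random intercept per small area, and it is not the ABS implementation.

```python
import math

# Inverse link functions h(eta); names are illustrative only.
def h_linear(eta):
    return eta

def h_log_linear(eta):
    return math.exp(eta)

def h_logistic(eta):
    # Algebraically equal to exp(eta) / (1 + exp(eta)).
    return 1.0 / (1.0 + math.exp(-eta))

def conditional_mean(x_row, beta, u_area, h):
    """E[y | u] = h(x' beta + u) for one unit, assuming a random intercept u_area."""
    eta = sum(x * b for x, b in zip(x_row, beta)) + u_area
    return h(eta)

# Toy example: intercept plus one covariate, area effect of 0.2 (made-up values).
p = conditional_mean([1.0, 0.5], [-1.0, 0.8], 0.2, h_logistic)
```

With the logistic choice the result is always a proportion between 0 and 1, which is why it suits the binary variables of interest described in the deck.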

28 Build regression model Random effects: $u \sim \text{Gaussian}(0, \phi I)$. Parameters of the GLMM model: $\theta = (\beta, u, \phi)$.

29 Build regression model Parameters of the model: contain everything about $y_s$ that is relevant to predicting $y_r$. If we knew $\theta$ then we would not need the sample data.

30 Build regression model Sample Data, Population Data. Parameter estimates: $\hat\theta = (\hat\beta, \hat u, \hat\phi)$.

31 Build regression model Model selection Weak theory We expect many variables to be unimportant Raftery (1995) Bayesian Model Selection in Social Research Laundry list of possible explanatory variables

32 Build regression model Use BIC to select explanatory variables. Implied p-value = $\Pr(\chi^2_1 > \log n)$. Much smaller than usual. [Figure: implied p-value plotted against sample size]

33 Build regression model Use BIC to select explanatory variables. Implied p-value = $\Pr(\chi^2_1 > \log n)$. Much smaller than the usual 0.05.

34 Build regression model Use BIC to select explanatory variables. If $\mathrm{CV}(\beta_k) > 30\%$ then drop $X_k$ from the model. [Figure: $\mathrm{CV}(\beta_k)$, axis from 25% to 35%, plotted against sample size]
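The link between BIC, the implied p-value and a CV cut-off can be checked numerically. The sketch below rests on two assumptions that are not stated on the slides: that BIC keeps a single extra variable roughly when its Wald chi-square statistic exceeds $\log n$, and that the CV is the standard error divided by the absolute coefficient, so the equivalent rule is to keep the variable when CV is below $1/\sqrt{\log n}$. Function names are hypothetical.

```python
import math

def implied_p_value(n):
    """Pr(chi-square with 1 df > log n), via the normal tail:
    Pr(chi2_1 > c) = erfc(sqrt(c / 2))."""
    return math.erfc(math.sqrt(math.log(n) / 2.0))

def cv_threshold(n):
    """CV cut-off implied by a Wald statistic of log n (assumed reading):
    CV = 1 / |z|, so the threshold is 1 / sqrt(log n)."""
    return 1.0 / math.sqrt(math.log(n))

for n in (1_000, 10_000, 100_000):
    print(n, implied_p_value(n), cv_threshold(n))
```

Under these assumptions the implied p-value falls well below 0.05 as the sample grows, and a cut-off of 30% corresponds to a sample size in the tens of thousands.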

35 Build regression model... but No guarantee that model predictions will be close to survey estimates at broad level (see Step 4: calibration). Use BIC, but constrain possible models to be close to direct estimates at broad level. Non-significant variables may be kept. Significant variables may be dropped.

36 Current SAE Process 1. Obtain and prepare data. 2. Build the regression model 3. Calculate predictions and measures of accuracy 4. Calibrate to published estimates 5. Quality assurance of predictions

37 Calculate predictions Calculate predictions by plugging in estimates (EBLUP): $E[y_r \mid u] = h(X_r \beta + Z_r u)$. Predictions only depend on sample through parameter estimates: $\hat y_r = h(X_r \hat\beta + Z_r \hat u)$.
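The plug-in step can be sketched as follows for the logistic case. This is an illustration under stated assumptions, not the ABS code: it assumes one random intercept per area, sets the random effect to zero for areas with no sample (a purely synthetic prediction), and uses made-up parameter values and hypothetical names throughout.

```python
import math

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def predict_area_totals(X_r, area_r, beta_hat, u_hat):
    """Sum plug-in predictions h(x' beta_hat + u_hat[area]) over population rows.

    X_r      : list of covariate rows for the population data
    area_r   : small area label for each population row
    beta_hat : estimated fixed effects
    u_hat    : dict of estimated area effects; areas missing from the dict
               are assumed to have no sample and get an effect of 0
    """
    totals = {}
    for x_row, area in zip(X_r, area_r):
        eta = sum(x * b for x, b in zip(x_row, beta_hat)) + u_hat.get(area, 0.0)
        totals[area] = totals.get(area, 0.0) + expit(eta)
    return totals

# Toy population of four units in three areas (made-up values).
X_r = [[1.0], [1.0], [1.0], [1.0]]
area_r = ["A", "A", "B", "C"]
totals = predict_area_totals(X_r, area_r, [0.0], {"A": 0.0, "B": math.log(3.0)})
```

Only the parameter estimates enter the function, which mirrors the point on the slide that predictions depend on the sample only through those estimates.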

38 Calculate predictions and measures of accuracy Sample Data Population Data Parameter estimates Small Area Estimates

39 Calculate measures of accuracy Primary measure of accuracy: Mean Square Error (MSE). Calculated by magic: $\mathrm{MSE}(\hat y_r) = G_1 + G_2 + 2G_3 + G_4$. MSE estimator only depends on sample data through parameter estimates.

40 Current SAE Process 1. Obtain and prepare data. 2. Build the regression model 3. Calculate predictions and MSEs 4. Calibrate to published estimates 5. Quality assurance of predictions

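41 Calibrate to published estimates Small Area Estimates created after key headline figures released. E.g. State level estimates of disability counts: $\sum_{d \in \text{NSW}} \hat Y_d = \hat Y_{\text{NSW}}$, where the left-hand side sums the estimates from the GLMM and the right-hand side is the direct estimate using survey weights.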

43 Calibrate to published estimates Also, the modelling is done variable by variable. For multicategory variables, we model each category separately as a binary variable. $\hat Y_{d,1}$: estimates of number of people with any disability. $\hat Y_{d,2}$: estimates of number of people with a mild disability.

45 Calibrate to published estimates Competing goals Want to publish our model based predictions Want to be coherent with other releases Want to be coherent with small area estimates for other variables Calibration! Implemented through GREGWT macro Some tricks used for large number of constraints
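The simplest way to make small area estimates add up to a published figure is pro-rata scaling within each state. The sketch below shows that technique only; it is not the GREGWT macro, which handles many simultaneous constraints, and all names and numbers are hypothetical.

```python
def ratio_calibrate(model_estimates, area_to_state, state_benchmarks):
    """Scale model-based area estimates so they sum to each state's
    published (direct) estimate. Simple pro-rata adjustment, one constraint
    per state; not a substitute for generalised regression weighting."""
    state_sums = {}
    for area, est in model_estimates.items():
        state = area_to_state[area]
        state_sums[state] = state_sums.get(state, 0.0) + est
    calibrated = {}
    for area, est in model_estimates.items():
        state = area_to_state[area]
        calibrated[area] = est * state_benchmarks[state] / state_sums[state]
    return calibrated

# Made-up example: two areas in NSW, one in VIC.
calibrated = ratio_calibrate(
    {"a": 10.0, "b": 30.0, "c": 5.0},
    {"a": "NSW", "b": "NSW", "c": "VIC"},
    {"NSW": 50.0, "VIC": 4.0},
)
```

Pro-rata scaling keeps the relative sizes of areas within a state unchanged, but it cannot enforce coherence across several variables at once, which is one reason a more general calibration tool is needed in practice.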

46 Current SAE Process 1. Obtain and prepare data. 2. Build the regression model 3. Calculate predictions and MSEs 4. Calibrate to published estimates 5. Quality assurance of predictions

47 Quality Diagnostics for Small Area Estimates Relative Root Mean Square Errors (RRMSEs) Bias plots Check model assumptions, Goodness of Fit Consistency with direct estimates Spatial mapping
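The first diagnostic, the RRMSE, is the square root of the estimated MSE divided by the estimate. A minimal sketch follows; the 25% cut-off used for flagging is purely illustrative and is not taken from the slides, and the function names are hypothetical.

```python
import math

def rrmse(estimate, mse):
    """Relative root mean square error: sqrt(MSE) / |estimate|."""
    if estimate == 0:
        raise ValueError("RRMSE is undefined for a zero estimate")
    return math.sqrt(mse) / abs(estimate)

def flag_unreliable(estimates, mses, cutoff=0.25):
    """Return the areas whose RRMSE exceeds an illustrative cut-off."""
    return [
        area for area in estimates
        if rrmse(estimates[area], mses[area]) > cutoff
    ]

# Made-up estimates and MSEs for two areas.
flags = flag_unreliable({"A": 200.0, "B": 40.0}, {"A": 400.0, "B": 400.0})
```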

48 Bias plot 261 small areas with sample size of at least 30 people

51 Distribution of SAEs Small area estimates of proportions of males with any disability.

52 Difficulties with the current method Models required for every variable Surveys have a large number of potential variables of interest Example: disability consultancies Over 50 breakdowns of disability Required fitting 50 GLMMs

53 Difficulties with the current method Big Data = Big Data manipulation Creating the X and Z matrices is expensive Big data sets often lead to an explosion in the number of potential explanatory variables. Difficult to transfer knowledge and experience Knowledge about data and computer systems Knowledge about modelling and analysis

54 INDUSTRIALISATION

55 Industrialisation of SAEs Knowledge Management ABS 2017 Toolset SAE Methods Data Preparation Documentation Integrated SAE System Metadata Retrieval Quality Assurance Diagnostics

56 ALTERNATIVE METHODS

57 Alternative methods Standard statistical output How efficient are direct estimates? Survey data comes with weights. These weights do not depend on the variable of interest Fit one model, produce estimates for all variables

58 Alternative methods Fay Herriot type models Bayesian Bootstrap INLA Weighting methods Reweighting Model Based Direct Estimation BARE

59 Alternative methods Fay Herriot type models Extensive literature Similar to unit level modelling Change small area means new FH model Less efficient

60 Alternative methods Bayesian Bootstrap Pólya's urn model Robust standard errors (e.g. model selection) Use of Monte Carlo less efficient Bayesian analogue of model-assisted method

61 Alternative methods INLA Bayesian approximation for GLMMs Similar to computations used in current method More accurate computations

62 Alternative methods Weight based methods: $\hat Y_d = \sum_{i \in s} w_{di} y_i$. Weighted sum over whole sample, not just those in the area/domain. Weights for each unit are different for different areas/domains.

63 Alternative methods Reweighting: $\hat Y_d = \sum_{i \in s} w_{di} y_i$. One set of weights for each small area. Hard calibration on area specific benchmarks. Consistency with survey weights ("weight sharing"): $w_{1i} + w_{2i} + \dots + w_{Di} = W_i$.

64 Alternative methods Model Based Direct Estimation: $\hat Y_d = \frac{1}{\sum_{i \in s_d} w_i} \sum_{i \in s_d} w_i y_i$. One set of weights ($w_{di} = w_i$). Based on Linear Mixed Model. Weighted average over sample in the area/domain. Similar to direct estimation. Hard calibration on fixed effects, soft calibration on random effects.
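The two weight-based estimators can be contrasted in a few lines. This sketch takes the weights as given (it does not derive them from any model or calibration), and every name and number in it is hypothetical.

```python
def reweighted_estimate(y, w_d):
    """Reweighting: weighted sum over the WHOLE sample using the
    area-specific weights w_d[i] for area d."""
    return sum(w * yi for w, yi in zip(w_d, y))

def weights_are_shared(w_by_area, survey_weights, tol=1e-9):
    """Check the weight-sharing constraint w_1i + ... + w_Di = W_i per unit."""
    for i, W_i in enumerate(survey_weights):
        total = sum(w_d[i] for w_d in w_by_area.values())
        if abs(total - W_i) > tol:
            return False
    return True

def mbde_estimate(y, w, area, d):
    """Model Based Direct Estimation: weighted average over the sample
    units that fall in area d, using one common set of weights w."""
    in_area = [(wi, yi) for wi, yi, ai in zip(w, y, area) if ai == d]
    denominator = sum(wi for wi, _ in in_area)
    return sum(wi * yi for wi, yi in in_area) / denominator

# Made-up sample of four units.
y = [1, 0, 1, 1]
W = [10.0, 20.0, 30.0, 40.0]
w_by_area = {"A": [6.0, 5.0, 0.0, 10.0], "B": [4.0, 15.0, 30.0, 30.0]}
total_A = reweighted_estimate(y, w_by_area["A"])
mean_A = mbde_estimate(y, [2.0, 2.0, 1.0, 3.0], ["A", "A", "B", "B"], "A")
```

In both cases the weights do not depend on `y`, so the same weights can be reused for every variable of interest.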

65 Alternative methods MSE estimation Very difficult for reweighting/BARE. Data sharing = bias up, variance down. Bias hard to quantify. Expect MSE to be smaller. Difficult to quantify how much smaller.

66 Summary Background Current method Industrialisation

67 More information? Pfeffermann, D. (2013) New Important Developments in Small Area Estimation, Statistical Science, 28, 1, A Guide to Small Area Estimation is available on -> Statistical References. Small Area Estimation by J. N. K. Rao Sean Buttsworth sean.buttsworth@abs.gov.au (02) Peter Radisich peter.radisich@abs.gov.au (02)
