Computing Optimal Strata Bounds Using Dynamic Programming

Eric Miller
Summit Consulting, LLC
7/27/2012

Motivation

Sampling can be costly. Sample size is often chosen so that point estimates achieve a minimum level of precision. A stratified sampling design can reduce costs by improving efficiency relative to simple random sampling.

Stratified Sampling Design

A stratification variable is used to partition the population into homogeneous subgroups. Simple random sampling is performed within each group. We want to choose the set of strata boundary points that minimizes the within-stratum variance and maximizes the between-strata variance. This can improve the precision of point estimates.

Optimal Stratification

Number of strata (based on the needs of the end user)
Optimal sample allocation (a simple closed-form solution exists)
Optimal strata bounds

Optimal Stratification: Previous Research

Approximation methods: Dalenius and Hodges (1959), Gunning and Horgan (2004)
Numerical optimization methods: Lavallee and Hidiroglou (1988), Kozak (2004)
Dynamic programming: Buhler and Deutler (1975), Khan, Nand and Ahmad (2008)

Contribution

Use dynamic programming to determine optimal strata bounds.
Build on the work of Khan et al. (2008) and take the theory to the data.
Describe the user-written Stata command optbounds, which calculates optimal strata boundary points.
Assess margin of error (for a 95% confidence interval) and design effect.

Optimal Stratification for Variance Reduction

Let X be a random variable defined on [a, b], and let [a, b] be partitioned into L strata. We want to minimize the following expression:

\[ \operatorname{Var}(\bar{x}_{st}) = \sum_{h=1}^{L} W_h^2 \operatorname{Var}(\bar{x}_h) \tag{1} \]

$W_h$ is the weight given to stratum h, $\bar{x}_h$ is the sample mean within stratum h, and $\bar{x}_{st}$ is the stratified sample mean. If we make one stratum smaller, the other strata must necessarily become larger. As a result, the strata variances must be minimized simultaneously.
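As a rough illustration of expression (1), the sketch below evaluates the variance of the stratified mean for a given partition. It assumes simple random sampling within each stratum and ignores the finite population correction, so that Var(x_h) is taken to be s_h^2 / n_h. The function name and the toy data are hypothetical and are not part of optbounds.

```python
from statistics import variance

def stratified_mean_variance(strata, sample_sizes):
    """Evaluate sum_h W_h^2 * Var(x_bar_h) for a given partition.

    Assumption: Var(x_bar_h) = s_h^2 / n_h (SRS within strata,
    no finite population correction).
    """
    N = sum(len(values) for values in strata)
    total = 0.0
    for values, n_h in zip(strata, sample_sizes):
        W_h = len(values) / N          # stratum weight
        s2_h = variance(values)        # within-stratum variance (N_h - 1 divisor)
        total += W_h ** 2 * s2_h / n_h
    return total

# Toy example: two homogeneous strata, one sampled unit each
print(stratified_mean_variance([[1, 2, 3], [10, 20, 30]], [1, 1]))
```

Homogeneous strata keep each s_h^2 small, which is why the choice of boundary points drives the value of (1).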

Optimal Stratification: Sequential Formulation

We can rewrite (1) as a function of the strata boundary points $(d_0, \ldots, d_L)$. An optimal stratification scheme solves the following problem:

\[ \min_{\{d_1, \ldots, d_{L-1}\}} \sum_{h=1}^{L} \phi_h(d_h, d_{h-1}) \tag{2} \]

subject to $a = d_0 \le d_1 \le \cdots \le d_{L-1} \le d_L = b$.

$d_h$ and $d_{h-1}$ are the boundary points for stratum h.
$\phi_h$ depends on the allocation method.

For example, under Neyman (optimal) allocation,

\[ \phi_h = W_h \sigma_h, \qquad n_h = \frac{n W_h \sigma_h}{\sum_{k=1}^{L} W_k \sigma_k} \qquad \text{and} \qquad \sigma_h^2 = \frac{\sum_{i=1}^{N_h} (x_{hi} - \bar{x}_h)^2}{N_h - 1} \]
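The Neyman allocation formula above can be sketched in a few lines. The function name is hypothetical, and the sketch returns fractional sample sizes; how to round them to integers is left open here.

```python
from statistics import stdev

def neyman_allocation(strata, n):
    """Return n_h = n * W_h * sigma_h / sum_k(W_k * sigma_k) for each stratum.

    sigma_h uses the (N_h - 1) divisor, matching the formula for sigma_h^2.
    The results are fractional; rounding is not handled in this sketch.
    """
    N = sum(len(values) for values in strata)
    weights = [len(values) / N for values in strata]
    sigmas = [stdev(values) for values in strata]
    denom = sum(w * s for w, s in zip(weights, sigmas))
    return [n * w * s / denom for w, s in zip(weights, sigmas)]

# Toy example: the more variable stratum receives most of the sample
print(neyman_allocation([[1, 2, 3], [10, 20, 30]], 11))
```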

Optimal Stratification as a Multi-Stage Problem

We can rewrite (2) as a series of simple recursive equations. Dynamic programming provides a method for finding the set of decision rules (policy functions) that solve these equations. It can be shown that the solutions to the sequential and recursive problems are identical. This is referred to as the principle of optimality (Bellman 1957).

Optimal Stratification: Recursive Formulation

We can solve the recursive problem below using standard dynamic programming techniques (Rust 2008).

\[ V_h(d_h) = \min_{d_{h-1}} \left[ \phi_h(d_h, d_{h-1}) + V_{h-1}(d_{h-1}) \right], \quad h \ge 2 \tag{3} \]

\[ V_1(d_1) = \phi_1(d_1) \]

subject to $d_h - d_{h-1} \ge 0$.
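To make recursion (3) concrete, here is a minimal sketch of a discrete analogue. It is not the continuous, distribution-based method described in these slides: it assumes the candidate boundaries are the gaps between sorted observations, uses the Neyman cost phi_h = W_h * sigma_h computed from the data, and requires at least two observations per stratum. The function name and the min_size parameter are hypothetical.

```python
from statistics import stdev

def optimal_strata(values, L, min_size=2):
    """Split sorted data into L strata minimizing sum_h W_h * sigma_h.

    V[h][j] is the minimal cost of splitting the first j sorted points
    into h strata, the discrete counterpart of V_h(d_h) in (3).
    """
    x = sorted(values)
    N = len(x)
    if N < L * min_size:
        raise ValueError("not enough observations for the requested strata")

    def phi(i, j):
        # Cost of one stratum holding x[i:j]: weight times standard deviation
        return (j - i) / N * stdev(x[i:j])

    INF = float("inf")
    V = [[INF] * (N + 1) for _ in range(L + 1)]
    arg = [[0] * (N + 1) for _ in range(L + 1)]

    # Base case: V_1 is the cost of a single stratum starting at the left end
    for j in range(min_size, N + 1):
        V[1][j] = phi(0, j)

    # Recursion: choose the previous cut i that minimizes phi + V_{h-1}
    for h in range(2, L + 1):
        for j in range(h * min_size, N + 1):
            for i in range((h - 1) * min_size, j - min_size + 1):
                cost = V[h - 1][i] + phi(i, j)
                if cost < V[h][j]:
                    V[h][j], arg[h][j] = cost, i

    # Backtrack through the stored argmins to recover the cut positions
    cuts, j = [], N
    for h in range(L, 1, -1):
        j = arg[h][j]
        cuts.append(j)
    cuts.reverse()
    strata = [x[i:k] for i, k in zip([0] + cuts, cuts + [N])]
    return V[L][N], strata

cost, strata = optimal_strata([1, 2, 3, 101, 102, 103], 2)
print(cost, strata)
```

The triple loop recomputes each standard deviation from scratch, which is fine for small data sets such as the 74-observation example used later but would need running sums for larger ones.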

Example: Triangular Distribution

Let X be a continuous random variable with support [a, b] and mode m. This variable is said to follow a triangular distribution if it has the following density function:

\[ f(x) = \begin{cases} \dfrac{2(x - a)}{(b - a)(m - a)}, & a \le x \le m \\[2ex] \dfrac{2(b - x)}{(b - a)(b - m)}, & m < x \le b \end{cases} \tag{4} \]
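Density (4) translates directly into code. The sketch below assumes a < m < b strictly, so that neither denominator is zero, and returns 0 outside the support; the function name is hypothetical.

```python
def triangular_pdf(x, a, m, b):
    """Density (4) of the triangular distribution, assuming a < m < b."""
    if x < a or x > b:
        return 0.0                      # outside the support [a, b]
    if x <= m:
        return 2 * (x - a) / ((b - a) * (m - a))   # rising branch
    return 2 * (b - x) / ((b - a) * (b - m))       # falling branch

# The density peaks at the mode, where it equals 2 / (b - a)
print(triangular_pdf(1.0, 0.0, 1.0, 4.0))
```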

Estimating the Mode of a Triangular Distribution

Let X be a random variable with pdf (4). For a random sample $X = (X_1, \ldots, X_s)$ with order statistics $X_{(1)} < X_{(2)} < \cdots < X_{(s)}$, the likelihood for X is:

\[ L(X; a, m, b) = \left( \frac{2}{b - a} \right)^{s} \left\{ \prod_{i=1}^{r} \frac{X_{(i)} - a}{m - a} \prod_{i=r+1}^{s} \frac{b - X_{(i)}}{b - m} \right\} \tag{5} \]

r is implicitly defined by $X_{(r)} \le m < X_{(r+1)}$.

For given values of a and b we can easily compute the ML estimate of m. In general, a and b are unobserved population parameters (Kotz and van Dorp 2004). The ML estimates of the endpoints a and b can be computed using numerical methods (e.g. Nelder-Mead).
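For fixed endpoints, likelihood (5) can be maximized over m by direct search. The sketch below works with the log of (5) and, as a simplifying assumption, restricts the candidate values of m to the observed sample points; it also assumes a and b lie strictly outside the sample range. Function names are hypothetical, and estimating a and b themselves is not attempted here.

```python
import math

def log_likelihood(sample, a, m, b):
    """Log of likelihood (5) for given endpoints a, b and mode m."""
    ll = len(sample) * math.log(2.0 / (b - a))
    for x in sample:
        if x <= m:
            ll += math.log((x - a) / (m - a))   # observations up to the mode
        else:
            ll += math.log((b - x) / (b - m))   # observations above the mode
    return ll

def estimate_mode(sample, a, b):
    """Pick the sample point that maximizes the log-likelihood.

    Assumption: the search is restricted to the observed values,
    and a < min(sample), b > max(sample).
    """
    return max(sorted(sample), key=lambda m: log_likelihood(sample, a, m, b))

print(estimate_mode([1.0, 2.0, 2.1, 2.2, 3.0], a=0.0, b=4.0))
```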

Experiment

Compare the results of stratification using dynamic programming and the popular cumulative square root frequency (CSRF) algorithm.
Use the variable price from the Stata auto dataset (74 observations).
Use a sample size of 15 and allocate sampled items among three strata using Neyman allocation.
Price is assumed to follow a triangular distribution.
For the CSRF algorithm, price is grouped into 9 equal classes.
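For readers unfamiliar with the CSRF benchmark, the following is a minimal sketch of the cumulative square root frequency rule under stated assumptions: classes have equal width over the observed range, and each boundary is the upper edge of the class whose cumulative sqrt(f) is nearest to an equal share of the total. The function name is hypothetical, and this is not the code used to produce the results in these slides. With few classes the rule can return repeated boundaries, which the sketch does not guard against.

```python
import math

def csrf_bounds(values, n_classes, n_strata):
    """Strata boundaries from the cumulative sqrt(frequency) rule."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes

    # Frequency of each equal-width class (the maximum goes in the last class)
    freq = [0] * n_classes
    for v in values:
        k = min(int((v - lo) / width), n_classes - 1)
        freq[k] += 1

    # Cumulative sum of sqrt(f) across classes
    cum, total = [], 0.0
    for f in freq:
        total += math.sqrt(f)
        cum.append(total)

    # Boundary h sits where cum sqrt(f) is closest to h/L of the total
    bounds = []
    for h in range(1, n_strata):
        target = h * total / n_strata
        k = min(range(n_classes), key=lambda i: abs(cum[i] - target))
        bounds.append(lo + (k + 1) * width)
    return bounds

print(csrf_bounds(list(range(12)), n_classes=3, n_strata=3))
```

Because boundaries can only fall on class edges, the result depends on the number of classes chosen.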

Stata Output

    . optbounds price, distribution(triangular) stratanum(3) endpts(1 2) nooutput b
    > ins(9)

    ML estimate of the mode   3798
    -----------------------
    Stratification Results

    Minimized Standard Deviation   1161.825181

    Optimal Strata Bounds
                   1
      1  6079.705973
      2  9674.824836

    . sum price

    Variable   Obs       Mean   Std. Dev.    Min     Max
    price       74   6165.257    2949.496   3291   15906

[Figure: Optimal Strata Bounds]

Results

Method   Point Estimate      Standard Error   Margin of Error as     Design Effect
         (Population Mean)                    % of Point Estimate
DP       5,969               163              4.9%                   .091
CSRF     8,451               419              8.8%                   .220

Sensitivity Analysis

Method          Margin of Error as     Design Effect
                % of Point Estimate
DP              4.9%                   .091
CSRF
  3 Classes     5.5%                   .094
  5 Classes     9.6%                   .236
  7 Classes     8.2%                   .195
  9 Classes     8.8%                   .220
  11 Classes    9.0%                   .203
  13 Classes    8.9%                   .201
  15 Classes    4.5%                   .053
  17 Classes    8.6%                   .197

Conclusion

A stratified sampling design can improve the precision of point estimates. In practice, optimal stratification using dynamic programming compares favorably with the commonly used CSRF algorithm. Dynamic programming methods are flexible and theoretically appealing.

References

Bellman, R., Dynamic Programming, Princeton University Press, 1957.
Buhler, W. and T. Deutler, Minimum Variance Stratification, Metrika, 1975, 22, 161-175.
Cochran, W., Sampling Techniques, John Wiley and Sons, 1977.
Dalenius, T. and J.L. Hodges, Minimum Variance Stratification, Journal of the American Statistical Association, 1959, 54 (285), 88-101.
Gunning, P. and J.M. Horgan, A New Algorithm for the Construction of Stratum Boundaries in Skewed Populations, Survey Methodology, 2004, 30 (2), 159-166.
Khan, M., N. Nand, and N. Ahmad, Determining the Optimum Strata Boundary Points Using Dynamic Programming, Survey Methodology, 2008, 34 (2), 205-214.
Kotz, S. and J.R. van Dorp, Beyond Beta: Other Continuous Families of Distributions with Bounded Support and Applications, World Scientific Publishing Company Inc., 2004.
Kozak, M., Optimal Stratification using Random Search Method in Agricultural Surveys, Statistics in Transition, 2004, 6 (5), 797-806.
Lavallee, P. and M. Hidiroglou, On the Stratification of Skewed Populations, Survey Methodology, 1988, 14, 33-43.
Rust, J., Dynamic Programming, in S.N. Durlauf and L.E. Blume, eds., The New Palgrave Dictionary of Economics, Palgrave Macmillan, 2008.