Generating random samples from user-defined distributions

Similar documents
Probability Models.S4 Simulating Random Variables

Physics 736. Experimental Methods in Nuclear-, Particle-, and Astrophysics. - Statistical Methods -

ACCURACY AND EFFICIENCY OF MONTE CARLO METHOD. Julius Goodman. Bechtel Power Corporation E. Imperial Hwy. Norwalk, CA 90650, U.S.A.

DECISION SCIENCES INSTITUTE. Exponentially Derived Antithetic Random Numbers. (Full paper submission)

MULTI-DIMENSIONAL MONTE CARLO INTEGRATION

VARIANCE REDUCTION TECHNIQUES IN MONTE CARLO SIMULATIONS K. Ming Leung

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

SPSS Basics for Probability Distributions

Probability and Statistics for Final Year Engineering Students

Fathom Dynamic Data TM Version 2 Specifications

Dealing with Categorical Data Types in a Designed Experiment

15 Wyner Statistics Fall 2013

BESTFIT, DISTRIBUTION FITTING SOFTWARE BY PALISADE CORPORATION

A Random Number Based Method for Monte Carlo Integration

Multivariate Normal Random Numbers

Computational Methods. Randomness and Monte Carlo Methods

Scientific Computing: An Introductory Survey

The Normal Distribution

Quantitative - One Population

R Programming Basics - Useful Builtin Functions for Statistics

Package UnivRNG. R topics documented: January 10, Type Package

Markov chain Monte Carlo methods

Computing Murphy Topel-corrected variances in a heckprobit model with endogeneity

Integration. Volume Estimation

Laplace Transform of a Lognormal Random Variable

1. Estimation equations for strip transect sampling, using notation consistent with that used to

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea

C N O S N T S RA R INT N - T BA B SE S D E L O L C O A C L S E S A E RC R H

10-701/15-781, Fall 2006, Final

Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition

Chapter 6: DESCRIPTIVE STATISTICS

Intensity Transformations and Spatial Filtering

Discrete Optimization. Lecture Notes 2

Use of Extreme Value Statistics in Modeling Biometric Systems

Testing Random- Number Generators

1.2 Numerical Solutions of Flow Problems

Monte Carlo Integration

Review Questions 26 CHAPTER 1. SCIENTIFIC COMPUTING

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

CS 563 Advanced Topics in Computer Graphics Monte Carlo Integration: Basic Concepts. by Emmanuel Agu

Behavior of the sample mean. varx i = σ 2

Chapter 6. The Normal Distribution. McGraw-Hill, Bluman, 7 th ed., Chapter 6 1

CHAPTER 2: SAMPLING AND DATA

StatsMate. User Guide

Technical Report Probability Distribution Functions Paulo Baltarejo Sousa Luís Lino Ferreira

Numerical Integration

An introduction to plotting data

ELEN E4830 Digital Image Processing. Homework 3 Solution

YEAR 12 Core 1 & 2 Maths Curriculum (A Level Year 1)

Introductory Applied Statistics: A Variable Approach TI Manual

Design of Experiments

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Algebra 2 Semester 2 Final Exam Study Outline Semester 2 Final Exam Study Tips and Information

Physics 736. Experimental Methods in Nuclear-, Particle-, and Astrophysics. - Statistics and Error Analysis -

Getting to Know Your Data

Categorical Data in a Designed Experiment Part 2: Sizing with a Binary Response

Monte Carlo Methods and Statistical Computing: My Personal E

What We ll Do... Random

Towards Interactive Global Illumination Effects via Sequential Monte Carlo Adaptation. Carson Brownlee Peter S. Shirley Steven G.

METROPOLIS MONTE CARLO SIMULATION

Table Of Contents. Table Of Contents

Statistics (STAT) Statistics (STAT) 1. Prerequisites: grade in C- or higher in STAT 1200 or STAT 1300 or STAT 1400

Semantic Importance Sampling for Statistical Model Checking

Computing Optimal Strata Bounds Using Dynamic Programming

BIOL Gradation of a histogram (a) into the normal curve (b)

The Normal Distribution

Chapter 1. Introduction

Inclusion of Aleatory and Epistemic Uncertainty in Design Optimization

MAPLE CARD. All commands and constants must be typed in the case (UPPER or lower) that is indicated. Syntax Rules

Economics Nonparametric Econometrics

Monte Carlo Integration

Mini-Max Type Robust Optimal Design Combined with Function Regularization

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

The Plan: Basic statistics: Random and pseudorandom numbers and their generation: Chapter 16.

STATISTICS (STAT) Statistics (STAT) 1

Digital image processing

Introduction to Mplus

C A S I O f x L B UNIVERSITY OF SOUTHERN QUEENSLAND. The Learning Centre Learning and Teaching Support Unit

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Unit WorkBook 2 Level 4 ENG U2 Engineering Maths LO2 Statistical Techniques 2018 UniCourse Ltd. All Rights Reserved. Sample

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

Chapter Two: Descriptive Methods 1/50

R Primer for Introduction to Mathematical Statistics 8th Edition Joseph W. McKean

Measures of Central Tendency

UNIT 1: NUMBER LINES, INTERVALS, AND SETS

Learning Objectives. Continuous Random Variables & The Normal Probability Distribution. Continuous Random Variable

Excel 2010 with XLSTAT

CDA6530: Performance Models of Computers and Networks. Chapter 8: Statistical Simulation --- Discrete-Time Simulation

MATLAB INTRODUCTION. Risk analysis lab Ceffer Attila. PhD student BUTE Department Of Networked Systems and Services

Basic Algorithms for Digital Image Analysis: a course

DATA PROCESSING AND CURVE FITTING FOR OPTICAL DENSITY - ETHANOL CONCENTRATION CORRELATION VESELLENYI TIBERIU ŢARCĂ RADU CĂTĂLIN ŢARCĂ IOAN CONSTANTIN

4 Visualization and. Approximation

CREATING SIMULATED DATASETS Edition by G. David Garson and Statistical Associates Publishing Page 1

Laboratorio di Problemi Inversi Esercitazione 4: metodi Bayesiani e importance sampling

C A S I O f x S UNIVERSITY OF SOUTHERN QUEENSLAND. The Learning Centre Learning and Teaching Support Unit

Random Number Generation and Monte Carlo Methods

Spatial Analysis and Modeling (GIST 4302/5302) Guofeng Cao Department of Geosciences Texas Tech University

Minitab Study Card J ENNIFER L EWIS P RIESTLEY, PH.D.

Frequency Distributions

Transcription:

The Stata Journal (2011) 11, Number 2, pp. 299 304 Generating random samples from user-defined distributions Katarína Lukácsy Central European University Budapest, Hungary lukacsy katarina@phd.ceu.hu Abstract. Generating random samples in Stata is very straightforward if the distribution drawn from is uniform or normal. With any other distribution, an inverse method can be used; but even in this case, the user is limited to the builtin functions. For any other distribution functions, their inverse must be derived analytically or numerical methods must be used if analytical derivation of the inverse function is tedious or impossible. In this article, I introduce a command that generates a random sample from any user-specified distribution function using numeric methods that make this command very generic. Keywords: st0229, rsample, random sample, user-defined distribution function, inverse method, Monte Carlo exercise 1 Introduction In statistics, a probability distribution identifies the probability of a random variable. If the random variable is discrete, it identifies the probability of each value of this variable. If the random variable is continuous, it defines the probability that this variable s value falls within a particular interval. The probability distribution describes the range of possible values that a random variable can attain, further referred to as the support interval, and the probability that the value of the random variable is within any (measurable) subset of that range. Random sampling refers to taking a number of independent observations from a probability distribution. Typically, the parameters of the probability distribution (often referred to as true parameters) are unknown, and the aim is to retrieve them using various estimation methods on the random sample generated from this probability distribution. For example, consider a normally distributed population with true mean μ and true variance σ 2. Assume that we take a sample of a given size from this population and calculate its mean and standard deviation these two statistics are called the sample mean and the sample standard deviation. Depending on the estimation method used, the sample size, and other factors, these two statistics can be shown to asymptotically converge to the true parameters of the original probability distribution and are thus good approximations of these values. c 2011 StataCorp LP st0229

300 Random samples The researcher may wish to evaluate the accuracy of the estimation method prior to using it on real data. In such a case, a random sample generated from a distribution with known true parameters can be used, and its estimated sample parameters can be compared with the true parameters to establish the accuracy of the estimation method. Taking our previous example, we can use a random sample generated from a normal distribution with parameters μ = 0 and σ 2 = 1; we can estimate its sample mean by taking the arithmetic average, and we can estimate its sample variance by taking the arithmetic average of the squared errors. If these two values are reasonably close enough to the true values of these parameters, we can conclude that our estimation method (arithmetic averaging) is accurate. In this article, I introduce a command that generates random samples from userspecified probability distribution functions with known parameters that can be used, among other applications, in such simulation exercises. 2 Probability distributions in Stata A probability distribution can take on any functional form as long as it fulfills certain conditions; namely, it must be nonnegative and it must integrate or sum to one. Some probability distributions have become so widely used that they have been given specific names. Two of the most important are the normal and the uniform distributions; among others are, for example, β, χ 2, F, γ, Bernoulli, Poisson, logarithmic, and lognormal distributions. For more information on basic statistics and random sampling, refer to Gentle (2003) andlind, Marchal, and Wathen (2008). Several built-in random-number functions such as runiform(), rbeta(a,b), rbinomial(n,p), rchi2(df), etc., are available to generate a random sample in Stata 10 and later versions. However, if a sample is to be drawn from any other distribution function, an inverse cumulative distribution function method must be used to apply an inverse distribution function on a uniform random sample. This method builds on the fact that if x is a continuous random variable with cumulative distribution function F x, and if u = F x (x), then u has a uniform distribution on (0, 1). It holds then that if u has a uniform distribution on (0, 1) and if x is defined as x = Fx 1 (u), then x has cumulative distribution function F x. This essentially means equation u = F x (x) must be solved for x. For more information on the inverse cumulative distribution function method and its applications to simulation experiments, refer to Ulrich and Watson (1987) and Avramidis and Wilson (1994). To apply this method, a number of built-in inverse cumulative distributions are available, such as invibeta(a,b,p), invbinomial(n,k,p), invchi2(n,p), invf(n1,n2,p), invnormal(p), etc. They can be applied to generate random samples as outlined in the following two examples, which generate a normally and a χ 2 distributed series my normal and my chi2, respectively: generate my normal = invnormal(runiform()) generate my chi2 = invchi2(n,runiform())

K. Lukácsy 301 However, a user may wish to draw a random sample from a distribution that is not built in Stata. In such a case, the inverse distribution function must be algebraically derived so that the inverse cumulative distribution function method can be used. But sometimes, the inverse of a function cannot be expressed by a formula. For example, if f is the function f(x) =x +sin(x), then f is a one-to-one function and therefore possesses an inverse function f 1. However, there is no simple formula for this inverse, because the equation y = x +sin(x) cannot be solved algebraically for x. In this case, numerical methods must be applied to match the values of x and y. 3 The rsample command The syntax of the rsample command is rsample pdf function [, left(#) right(#) bins(#) size(#) plot(#) ] This command generates the random variables into a new variable called rsample. pdf function is required. It is a string that specifies the probability distribution function that the random sample is to be drawn from. It must be formulated in terms of x, for example, exp(-x^2), although no x variable needs to exist prior to command execution. Two properties must normally hold for the probability distribution functions: 1. They must be nonnegative on the whole support interval of the random variable. 2. They must sum or integrate to 1. The second condition does not have to be fulfilled in this case because the rsample command calculates what the pdf function sums or integrates up to and uses that value as a rescaling factor. This way, a rescaled pdf function is used that always integrates or sums to 1. The first condition, however, must be fulfilled. If it is violated, an error message appears that reminds the user to supply a nonnegative pdf function. The rescaling constant always appears on the screen. left(#) and right(#) specify the support interval, that is, all values that the random variable can attain. left() must be specified to be less than right(). If these values are not specified, they take on default values of -2 and 2, respectively. bins(#) specifies the number of bins into which the support interval is split for the purposes of the algorithm. Essentially, it allows the user to specify the precision of the algorithm. If this value is not specified, it takes on a default value of 1000. However, if this value (whether defined by the user or its default) exceeds the total number of observations (whether defined by the corresponding optional parameter or by the size of the existing Stata file), it is automatically set to equal one-fifth of the total number of observations.

302 Random samples size(#) specifies the number of observations to be generated. The default is size(5000). If this value is specified even though the command is executed in an existing Stata session with a nonzero number of observations, then this value is not used and the original number of observations is preserved. plot(#) allows the user to choose whether results will be presented graphically in a histogram or whether no graphical view is needed. It is a binary option; that is, it can only take on values 1 and 0, where 1 is for yes and 0 is for no. The default is plot(1). 4 Examples In this section, I provide several specific examples with various pdf functions and various optional parameters specified. The graphical representations of the last three examples are presented in figures 1, 2, and3 in the form of histograms of the generated random values. These three examples were generated from normal, lognormal, and Laplace distribution functions, respectively.. rsample exp(-x^2), left(-2.5) right(2.5). rsample exp(-abs(x)), left(-4) right(4) bins(500). rsample exp(-abs(x)), left(-4) right(4) size(500). rsample exp(-abs(x)), plot(0). rsample exp(-log(x)^2), left(0.1) right(6) bins(300) size(3000) plot(0). rsample exp(-x^2). rsample exp(-log(x)^2), left(0.1) right(6). rsample exp(-abs(x)), left(-4) right(4) Density 0.2.4.6.8 2 1 0 1 2 values random sample theoretical pdf Figure 1. Random sample from a normal distribution function

K. Lukácsy 303 Density 0.1.2.3.4.5 0 2 4 6 values random sample theoretical pdf Figure 2. Random sample from a lognormal distribution function Density 0.2.4.6 4 2 0 2 4 values random sample theoretical pdf Figure 3. Random sample from a Laplace distribution function 5 References Avramidis, A. N., and J. R. Wilson. 1994. A flexible method for estimating inverse distribution functions in simulation experiments. ORSA Journal on Computing 6: 342 355. Gentle, J. E. 2003. Random Number Generation and Monte Carlo Methods. 2nd ed. New York: Springer. Lind, D. A., W. G. Marchal, and S. A. Wathen. 2008. Basic Statistics for Business and Economics. 6th ed. New York: McGraw Hill.

304 Random samples Ulrich, G., and L. T. Watson. 1987. A method for computer generation of variates from arbitrary continuous distributions. SIAM Journal on Scientific Computing 8: 185 197. About the author Katarína Lukácsy works as a statistical analyst for a large multinational company primarily focusing on improving the relevance of search results and related issues. Currently, she is finalizing her PhD dissertation on price-setting mechanisms from statistical, econometrical, and theoretical viewpoints.