Biostatistics & SAS programming
Kevin Zhang
February 27, 2017

Random variables and distributions

Simulation study

Data analysis
  Apply existing methodologies to your collected samples, in the hope of finding some useful conclusions.
  - Check assumptions
  - Apply the PROC
  - Interpret results

Development
  Try to develop new methodologies or enhance existing methods to draw some conclusions.
  - Derive formulas
  - Program using IML and MACRO
  - Run a simulation study to verify the results

Simulation procedure:
  - Generate datasets from the assumed distribution
  - Apply your algorithm to each dataset and collect the results
  - Interpret the results: are they close to what you expected?
  - Assess the accuracy and compare with existing methods

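The steps above can be sketched in SAS as follows. This is only a minimal illustration under assumed choices that are not from the slides: the "algorithm" is simply the sample mean, the assumed distribution is N(3, 4^2), and the dataset names sim and est, the seed, the 500 replicates, and the sample size 30 are all made up for the example.

```sas
/* Step 1: generate 500 datasets (indexed by rep), each with 30 values
   from the assumed distribution N(mean = 3, sd = 4)                    */
data sim(keep = rep x);
   call streaminit(2017);
   do rep = 1 to 500;
      do i = 1 to 30;
         x = rand("normal", 3, 4);
         output;
      end;
   end;
run;

/* Step 2: apply the "algorithm" (here: the sample mean) to each dataset
   and collect the 500 results in the data set est                       */
proc means data = sim noprint;
   by rep;
   var x;
   output out = est mean = xbar;
run;

/* Steps 3 and 4: summarize the collected results.
   Theory says the mean of xbar should be near 3 and its
   standard deviation near 4 / sqrt(30)                                  */
proc means data = est mean std;
   var xbar;
run;
```

To compare with an existing method, one could compute a second estimator (for example the median) in Step 2 and summarize both in the last step.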
Distribution

The distribution defines the rule by which probability is assigned. With it we can:
  - Evaluate probabilities
  - Generate random values from a specified distribution

In fact, each variable in your sample is a sequence of random values (in most cases we don't know the distribution).

Mathematical expression

f(x): density function (PDF) for continuous variables, mass function (PMF) for discrete variables
  - Describes how probability is assigned to each possible value
  - For a discrete variable it means f(a) = P(X = a), i.e. the probability assigned to the value a
  - For a continuous variable P(X = a) = 0 for any single value a; probabilities are areas under f(x)

F(x): cumulative distribution function (CDF)
  - Gives the probability accumulated from the smallest possible value up to a given threshold
  - It means F(a) = P(X ≤ a)

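As a small illustration, f(a) and F(a) can be evaluated with the SAS functions PDF and CDF. The particular distributions and the threshold a = 3 below are assumptions chosen for the example, as are the variable names.

```sas
/* Evaluate f(a) and F(a) for two example distributions */
data _null_;
   /* Discrete: X ~ Binomial(n = 10, p = 0.5), a = 3 */
   p = 0.5; n = 10; a = 3;
   pmf_a = pdf("binomial", a, p, n);   /* f(3) = P(X = 3)  = 120/1024 */
   cdf_a = cdf("binomial", a, p, n);   /* F(3) = P(X <= 3) = 176/1024 */
   put pmf_a= cdf_a=;

   /* Continuous: Z ~ N(0, 1); the density at 0 is not a probability */
   dens_0 = pdf("normal", 0, 0, 1);    /* height of the density curve at 0 */
   cdf_0  = cdf("normal", 0, 0, 1);    /* P(Z <= 0) = 0.5 by symmetry      */
   put dens_0= cdf_0=;
run;
```

The results are written to the SAS log by the PUT statements.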
Commonly used distributions

Discrete
  - Bernoulli (also called 0-1), B(1, p)
  - Binomial, B(n, p)
  - Poisson, P(λ)
  - Geometric, G(p)
  - Discrete uniform, DU(a, b)

Continuous
  - Continuous uniform, U(a, b)
  - Normal, N(μ, σ²)
  - Student's t, t(df)
  - Exponential, Exp(β)
  - Chi-square, χ²(df)
  - F distribution, F(df1, df2)

Bernoulli distribution
  - Models cases with only two outcomes
  - F(x) and f(x)
  - Numerical characteristics
  - Example: flip a coin; do you get a Head?

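For reference, the standard textbook forms for X ~ B(1, p) are given below, coding the two outcomes as 1 (success, e.g. a Head) and 0 (failure). That coding is the usual convention and is assumed here.

```latex
f(x) = P(X = x) = p^{x}(1-p)^{1-x}, \qquad x \in \{0, 1\}

F(x) = P(X \le x) =
\begin{cases}
0,     & x < 0 \\
1 - p, & 0 \le x < 1 \\
1,     & x \ge 1
\end{cases}

E[X] = p, \qquad \mathrm{Var}(X) = p(1-p)
```

For a fair coin, p = 0.5, so E[X] = 0.5 and Var(X) = 0.25.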
Binomial distribution
  - Repeat a sequence of independent, identically designed Bernoulli trials and count the successes
  - F(x) and f(x)
  - Numerical characteristics
  - Example: flip a coin 10 times; how many Heads do you get?

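For reference, the standard textbook forms for X ~ B(n, p), the number of successes in n independent trials with success probability p, are:

```latex
f(x) = P(X = x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \qquad x = 0, 1, \dots, n

F(x) = P(X \le x) = \sum_{k=0}^{\lfloor x \rfloor} \binom{n}{k} p^{k} (1-p)^{n-k}, \qquad 0 \le x \le n

E[X] = np, \qquad \mathrm{Var}(X) = np(1-p)
```

For 10 flips of a fair coin (n = 10, p = 0.5) this gives E[X] = 5 Heads and Var(X) = 2.5.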
Poisson distribution
  - How many desired results will be obtained during a given period of time?
  - F(x) and f(x)
  - Numerical characteristics
  - Example: how many customers enter the local Walmart between 8 am and 10 am?

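For reference, the standard textbook forms for X ~ P(λ), where λ > 0 is the average number of events in the given period, are:

```latex
f(x) = P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \qquad x = 0, 1, 2, \dots

F(x) = P(X \le x) = \sum_{k=0}^{\lfloor x \rfloor} \frac{e^{-\lambda} \lambda^{k}}{k!}, \qquad x \ge 0

E[X] = \lambda, \qquad \mathrm{Var}(X) = \lambda
```

The mean and the variance are equal, which gives a quick check on simulated Poisson data.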
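Geometric distribution
  - How many trials are needed to obtain the first desired result? (Waiting for a fixed number r > 1 of desired results gives the negative binomial distribution; the geometric is the case r = 1.)
  - F(x) and f(x)
  - Numerical characteristics
  - Example: how many rolls of the same fair die are needed to get the first 1? (The number of rolls needed to get five 1's follows a negative binomial distribution.)
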
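For reference, the standard textbook forms for X ~ G(p) are given below, taking X as the number of trials up to and including the first success. This convention is an assumption; some texts and software count only the failures before the first success, which shifts X by one.

```latex
f(x) = P(X = x) = (1-p)^{x-1} p, \qquad x = 1, 2, 3, \dots

F(x) = P(X \le x) = 1 - (1-p)^{\lfloor x \rfloor}, \qquad x \ge 1

E[X] = \frac{1}{p}, \qquad \mathrm{Var}(X) = \frac{1-p}{p^{2}}
```

For a fair die and the outcome 1, p = 1/6, so on average E[X] = 6 rolls are needed to see the first 1.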
Discrete uniform distribution
  - Models cases in which all possible results are equally likely
  - F(x) and f(x)
  - Characteristics
  - Example: rolling a fair die

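For reference, the standard textbook forms for X ~ DU(a, b) are given below, assuming the possible values are the integers a, a+1, ..., b.

```latex
f(x) = P(X = x) = \frac{1}{b - a + 1}, \qquad x = a, a+1, \dots, b

F(x) = P(X \le x) = \frac{\lfloor x \rfloor - a + 1}{b - a + 1}, \qquad a \le x \le b

E[X] = \frac{a + b}{2}, \qquad \mathrm{Var}(X) = \frac{(b - a + 1)^{2} - 1}{12}
```

For a fair die (a = 1, b = 6): f(x) = 1/6, E[X] = 3.5 and Var(X) = 35/12.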
Random numbers in the computer

Random generator
  - In fact it is an algorithm that picks numbers, in a random-looking way, from a certain fixed sequence of numbers.
  - When no seed is set, the starting point depends on things such as the time, date, computer name, IP address, hardware IDs, etc. This makes the choice differ from computer to computer.

Random seed
  - A number that identifies the randomness: it tells the computer where to start choosing values from the sequence.
  - Computers will obtain EXACTLY the SAME random sequence if you set the same random seed (with the same generator).

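The effect of the seed can be checked with a small experiment. This is a sketch only; the dataset names run1 and run2 and the choice of five uniform values are assumptions for the example.

```sas
/* Two DATA steps that use the same seed ... */
data run1(keep = x);
   call streaminit(123);
   do i = 1 to 5;
      x = rand("uniform");
      output;
   end;
run;

data run2(keep = x);
   call streaminit(123);
   do i = 1 to 5;
      x = rand("uniform");
      output;
   end;
run;

/* ... should contain exactly the same five values */
proc compare base = run1 compare = run2;
run;
```

Changing the seed in one of the two DATA steps (say to 124) should make PROC COMPARE report unequal values.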
DATA step

SAS: use a DATA step together with a loop.

    /* Bernoulli experiment */
    data bino1(keep = x);              /* keep= lists the variables you wish to keep in the data set */
       p = 0.5; n = 1;                 /* parameters for the distribution */
       call streaminit(123);           /* random seed: set random number seed */
       do i = 1 to 1000;               /* loop 1000 times, thus you get 1000 values */
          x = rand("binomial", p, n);  /* random generator: x ~ Bernoulli(0.5) */
          output;
       end;
    run;

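The same pattern extends to a binomial count with n > 1, for example the number of Heads in 10 flips of a fair coin. The sketch below is illustrative; the dataset name bino10 and the frequency check with PROC FREQ are additions, not part of the original example.

```sas
/* Number of Heads in 10 flips of a fair coin, repeated 1000 times */
data bino10(keep = x);
   p = 0.5; n = 10;                  /* parameters for the distribution */
   call streaminit(123);             /* set random number seed */
   do i = 1 to 1000;
      x = rand("binomial", p, n);    /* x ~ Binomial(n = 10, p = 0.5) */
      output;
   end;
run;

/* Tabulate how often each count 0, 1, ..., 10 occurred */
proc freq data = bino10;
   tables x;
run;
```

The most frequent values should cluster around np = 5.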
More: Poisson distribution

    /* --- Poisson random numbers --- */
    data pos(keep = x);
       call streaminit(123);           /* set random number seed */
       lambda = 4;
       do i = 1 to 1000;
          x = rand("poisson", lambda); /* x ~ Pois(4) */
          output;
       end;
    run;

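One way to check simulated Poisson values is to compare the sample mean and variance, since both should be close to lambda. The block below is a self-contained sketch; the dataset name pos_check is an assumption.

```sas
/* Generate 1000 Poisson(4) values */
data pos_check(keep = x);
   call streaminit(123);
   lambda = 4;
   do i = 1 to 1000;
      x = rand("poisson", lambda);
      output;
   end;
run;

/* For a Poisson distribution, mean and variance are both equal to lambda,
   so both statistics should come out near 4                              */
proc means data = pos_check mean var;
   var x;
run;
```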
Full list

You can find the manual for the SAS RAND() function here:
https://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#p0fpeei0opypg8n1b06qe4r040lv.htm

We can use RAND() to get random numbers from any of the distributions listed there by giving the distribution name together with its parameters.

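As an illustration of the "name plus parameters" pattern, several distributions can be drawn in one DATA step. The dataset name demo, the variable names, and the parameter values are assumptions for the example; the die roll is built from a uniform value rather than from a dedicated distribution name.

```sas
data demo(keep = u b die);
   call streaminit(123);                /* set random number seed */
   do i = 1 to 1000;
      u   = rand("uniform");            /* continuous uniform on (0,1), no parameters */
      b   = rand("bernoulli", 0.3);     /* Bernoulli with p = 0.3 */
      die = ceil(6 * rand("uniform"));  /* fair die: integers 1 to 6, equally likely */
      output;
   end;
run;
```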
PROC IML

We can also use the IML procedure to program it.
  - IML: Interactive Matrix Language, https://support.sas.com/rnd/app/iml/
  - It allows us to define vectors and matrices and calculate results just as in other programming languages (such as MATLAB, R, Python)

Example:

    proc iml;
    call randseed(123);              /* set random number seed */
    x = j(10, 1);                    /* allocate a vector with 10 values in it */
    call randgen(x, "Uniform");      /* x ~ U(0,1) */
    print x;
    quit;

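Distribution parameters are passed to RANDGEN after the distribution name. The sketch below, with an assumed vector length of 1000 and assumed variable names b and phat, fills a vector with Bernoulli(0.5) values and prints their average.

```sas
proc iml;
call randseed(123);                  /* set random number seed */
b = j(1000, 1);                      /* allocate a vector with 1000 values in it */
call randgen(b, "Bernoulli", 0.5);   /* each element of b ~ Bernoulli(0.5) */
phat = b[:];                         /* mean of all elements (subscript reduction operator) */
print phat;                          /* should be close to 0.5 */
quit;
```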
HW

Try to generate the following random sequences:
  - Normal, with mean 3 and standard deviation 4
  - Chi-square with 5 degrees of freedom
  - Student's t with 10 degrees of freedom
  - Geometric distribution with p = 0.3
  - Exponential distribution
  - F distribution with n = 3 and d = 10

Examine the histograms of the above random sequences, together with the /normal option, and see what happens.
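One possible way to draw such a histogram is sketched below on a distribution that is not in the list above. It assumes that "/normal" refers to the NORMAL option of the HISTOGRAM statement in PROC UNIVARIATE, which overlays a fitted normal curve; the dataset name unif is also an assumption.

```sas
/* 1000 continuous uniform values on (0,1) */
data unif(keep = x);
   call streaminit(123);
   do i = 1 to 1000;
      x = rand("uniform");
      output;
   end;
run;

/* Histogram of x with a fitted normal density curve overlaid */
proc univariate data = unif noprint;
   histogram x / normal;
run;
```

The same two steps can be reused for each homework sequence by changing only the RAND() call.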