Using Stata

Welcome to Stata, one of the most un-user-friendly programs ever created. Although later versions have some features to make it easier to use, they more than make up for it by not being 100% compatible with previous versions... and some even generate code through the "helpful" dialog boxes that causes errors. This tutorial is intended to give you the very basic tools needed to get things done in Stata.

How this document is organized:
Courier font is what you type or see in Stata.
Things in [brackets] are optional arguments.
Blue text marks reserved commands in Stata; the underlined part of the text is how you can abbreviate the command.
Red text marks Stata error messages.
File extensions are listed (e.g., .do, .dta, .log) to let you know what Stata is expecting. Stata does not require the extension to be typed if the file is of the type expected.

How Stata Works

Stata uses a combination of command-line and menu-driven inputs along with a very basic spreadsheet-style data editor. There are several types of files you can work with in Stata, but the basic one is the Stata data set file (.dta). This file contains all your data as well as the variable names.

Extension   File Type
.dta        Stata data file
.log        Log file (explained below)
.raw        ASCII (text) file

The PgUp and PgDn keys scroll through the commands in the Review window (i.e., they write them in the Stata Command window for you).
Data Sets

You can view the data Stata is working with by opening the data editor or data browser. Both of these work much like a spreadsheet, with the variables listed in columns. The only difference is that the editor lets you change values and the browser doesn't. If you insist on using the command line, you can use the list command. Although it's old school, list can help find your problem areas when used in conjunction with if. For example:

    list varname if varname > 5000

will list all observations of the variable varname that are greater than 5000. Because Stata treats missing values as larger than any number, this also lists observations where varname is missing. The good thing is that each one has the observation number. You can jot that down and look up the data points in the data editor, or you can go back to your original data source to track down potential problems.

Save Data

    save filename.dta [,*]

nolabel - omits value labels; still saves associations between variables and value label names (just not the labels themselves).
replace - allows you to overwrite the existing file; prevents the "file filename.dta already exists" error.
orphans - saves all value labels, including those not attached to any variables.
emptyok - allows you to save an empty data set to prevent the "no variables defined" error. (Used for programming.)
intercooled - makes Stata/SE save in Intercooled Stata format.

Load Data (.dta Files)

    use filename.dta [,*]

clear - Stata will not let you load a data file if the data already in memory have unsaved changes. Using the clear option removes any data from memory (even if it hasn't been saved) to allow you to load the data file.

Load Data (Other Sources)

Formatted Text File - one observation per line; values are tab or comma delimited; can have variable names in the first line (optional); if you don't include a file extension, .raw is assumed.

    insheet [varlist] using filename.raw [,*]

varlist - list of variable names separated by spaces (not commas)
double - forces Stata to store variables as doubles (rather than floats)
[no]names - informs Stata whether variable names are included; Stata will figure it out on its own, but this option allows the file to open faster
comma - specify comma delimited (not required)
tab - specify tab delimited (not required)
delimiter("*") - specify a different delimiter in the data (e.g., delimiter(";"))
clear - removes any data from memory (even if it hasn't been saved) to allow you to load the data file.

Examples:
    insheet using newdata
    insheet using newerdata.txt, clear
    insheet using weirddata.txt, clear delimiter("&")
    insheet height gender mom dad using heights.dat

Log Files

These keep track of everything that happens during your Stata session by recording everything that appears in the Stata Results window (the one with the black background). A log file can be handy for tracking down errors when you're running a .do file. If you specify the .log extension, the file is saved in ASCII (text) format, which means the colors are not saved, but it's pretty easy to tell the difference between your commands and Stata output because commands are preceded by a period (.). There's a different format for the Stata viewer, but it's not really any better than a text file. There are also other options for a log file that aren't covered here, but this section should give you all you need to know. The only tricky part is deciding where (or if) to turn the log file on or off during execution of your .do file or Stata session. You don't really need to close the log file to be able to read it.

    log using filename.log [,*]
    log off
    log on
    log close

replace - overwrites the current log file
append - adds this session to the end of the log file
text - saves the log in ASCII (text) format

Examples:
    log using newlog.log, replace
    log using "file with spaces.log"
    log close
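A minimal sketch of how the save, load, insheet, and log commands above might fit together in one session. The file names (demo_heights.raw, demo_heights.dta, demo_session.log), the variable names, and the use of outsheet to create a text file to read back in are assumptions made only so the example is self-contained; they are not part of the original tutorial.

```stata
* Build a tiny data set and write it out as a tab-delimited text file
* (hypothetical file and variable names, for illustration only)
clear
set obs 3
generate height = 60 + 5*_n          // 65, 70, 75
generate mom = 62 + _n
outsheet height mom using demo_heights.raw, replace

* Start a text log, closing any log left open by an earlier run
capture log close
log using demo_session.log, replace

* Read the text file back in, inspect it, and save it as a .dta file
insheet using demo_heights.raw, clear
list if height > 65 & height != .
save demo_heights.dta, replace
log close

* Reload the saved Stata data set
use demo_heights.dta, clear
```

Putting capture log close before log using is a common habit in .do files: it avoids an error when a previous run stopped before reaching its own log close.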
Commands

Stata commands are the things that get things done in Stata. They are how you tell Stata what you want done.

Note: exp refers to any expression, logical or mathematical; the type should be clear from the context; if exp is written twice in a single line, it does not imply that it is the same expression. Expressions use the following operators:

Arithmetic                 Logical            Relational
+  addition                ~  not             >   greater than
-  subtraction             !  not             <   less than
*  multiplication          |  or (shift \)    >=  > or equal
/  division                &  and             <=  < or equal
^  power                                      ==  equal
+  string concatenation                       ~=  not equal
                                              !=  not equal

Generate - creates a new variable based on exp

    generate [type] newvar[:lblname] = exp [if exp]

type - specifies the variable type; if none is specified, Stata will automatically select float for numeric data and str for text

Examples:
    generate age2 = age*age
    generate biginc = income>100000 & income!=.
    gen double unitpr = cost/quantity
    gen byte biginc = income>100000 & income!=.
    gen xlag = x[_n-1]

List - prints data on the screen

    list [varlist] [if exp] [, *]

table - lists variables in columns, one observation per row
display - lists each observation as its own block; useful if there are a lot of variables, to keep the output from wrapping around the screen

Replace - changes the contents of an existing variable

    replace oldvar = expression [if expression] [, nopromote]

oldvar - name of a variable that already exists in the data set
nopromote - prevents replace from promoting the variable type to accommodate the change (e.g., if you replace an integer variable with data containing 3.4 and prevent the type from being promoted, you'll end up with 3)

Examples:
    replace income=. if income<=0
    replace age = 25 in 1007

Set Memory - specifies how much system memory you want to be dedicated to Stata. Note: typing memory without set before it will display a report of Stata's memory usage.

    set memory #[b k m g] [, permanently]

# - amount of memory to set; specified in terms of bytes (b), kilobytes (k), megabytes (m), or gigabytes (g)
permanently - specifies that in addition to making the change right now, Stata will remember the new limit and use it in the future when you open Stata

Examples:
    set memory 5m

Set Type - specifies the default data type assigned to new variables (such as by generate) when the storage type is not explicitly specified

    set type *

where * is float or double. The storage types that can be given explicitly to generate (numeric types and strings) are described in and below the following table.

Numeric
Storage Type   Bytes   Minimum                  Maximum                 Closest to 0 without being 0
byte           1       -127                     100                     +/-1
int            2       -32,767                  32,740                  +/-1
long           4       -2,147,483,647           2,147,483,620           +/-1
float          4       -1.70141173319*10^38     1.70141173319*10^36     +/-10^-36
double         8       -8.9884656743*10^307     8.9884656743*10^308     +/-10^-323

Precision for float is 3.795x10^-8
Precision for double is 1.414x10^-16

Character strings are specified by str#, where # gives the maximum length of the string (ranges from 1 to 80). Each character reserved by a string takes one byte regardless of the data stored in the string (e.g., "t" stored in a variable of type str80 still takes up 80 bytes).

Summarize

    summarize [varlist] [if expression] [, detail]

varlist - list of variables, separated by spaces (not commas); if you don't indicate a variable list, Stata will summarize all the variables in the data set
if expression - allows you to specify a subset of the data to be summarized
detail - the standard summarize command lists the number of observations, mean, standard deviation, minimum and maximum; specifying detail adds the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles, variance, skewness, and kurtosis

Functions

Functions are actually series of embedded commands designed to accomplish a specific task. They make working with Stata a little easier because you don't have to program them in yourself. This is just a subset of frequently used functions. You can get more functions by using the online help in Stata and searching for these.

Type of function          See help
Mathematical Functions    mathfun
Probability Functions     probfun
Random Numbers            random
String Functions          strfun
Programming Functions     progfun
Date Functions            datefun
Time-series Functions     tsfun
Matrix Functions          matfcns

Mathematical Functions

abs(x)               returns the absolute value of x
exp(x)               returns e^x
int(x)               returns the integer obtained by truncating x towards zero
ln(x) or log(x)      returns the natural logarithm of x
log10(x)             returns the base 10 logarithm of x
max(x1,x2,...,xn)    returns the maximum of x1, x2,..., xn (missing values are ignored)
min(x1,x2,...,xn)    returns the minimum of x1, x2,..., xn (missing values are ignored)
round(x,y)           returns x rounded off to units of y
sqrt(x)              returns the square root of x

Probability Functions

Binomial(n,k,p) returns the probability of k or more successes in n trials when the probability of a success on a single trial is p
chi2(n,x) returns the cumulative chi-squared distribution with n degrees of freedom
chi2tail(n,x) returns the reverse cumulative (upper-tail) chi-squared distribution with n degrees of freedom; chi2tail(n,x) = 1 - chi2(n,x)
F(n1,n2,f) returns the cumulative F distribution with n1 numerator and n2 denominator degrees of freedom
Fden(n1,n2,f) returns the probability density function for the F distribution with n1 numerator and n2 denominator degrees of freedom
Ftail(n1,n2,f) returns the reverse cumulative (upper-tail) F distribution with n1 numerator and n2 denominator degrees of freedom; Ftail(n1,n2,f) = 1 - F(n1,n2,f)
invbinomial(n,k,P) returns the inverse binomial: for P<=0.5, the probability p such that the probability of observing k or more successes in n trials is P; for P>0.5, the probability p such that the probability of observing k or fewer successes in n trials is 1-P.
invchi2(n,p) returns the inverse of chi2(); if chi2(n,x) = p, then invchi2(n,p) = x
invF(n1,n2,p) returns the inverse cumulative F distribution; if F(n1,n2,f) = p, then invF(n1,n2,p) = f
invnorm(p) returns the inverse cumulative standard normal distribution; if norm(z) = p, then invnorm(p) = z
norm(z) returns the cumulative standard normal distribution
normden(z) returns the standard normal density
normden(x,m,s) returns the normal density with mean m and standard deviation s; normden(x,m,s) = normden((x-m)/s)/s
tden(n,t) returns the probability density function of Student's t distribution with n > 0 degrees of freedom
ttail(n,t) returns the reverse cumulative (upper-tail) Student's t distribution with n > 0 degrees of freedom
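The mathematical and probability functions listed above can be tried directly with display. This is a small sketch using the older function names given in this document (for example invnorm()), which newer Stata releases document under different names; the numeric arguments are arbitrary illustrations, not values from the tutorial's data.

```stata
* Mathematical functions
display int(-3.7)              // truncates toward zero
display round(3.14159, 0.01)   // rounds to units of 0.01
display max(2, ., 7)           // missing values are ignored

* Probability functions
display chi2tail(2, 5.99)      // upper tail of chi-squared with 2 df
display 1 - chi2(2, 5.99)      // same quantity via the cumulative function
display invnorm(0.975)         // z such that the cumulative normal equals 0.975
display Ftail(2, 421, 4.022)   // upper-tail p-value for an F(2,421) statistic
display 2*ttail(30, 2.042)     // two-sided p-value for a t statistic with 30 df
```

The upper-tail functions (chi2tail, Ftail, ttail) are usually the ones you want for p-values, since they avoid computing 1 minus a number very close to 1.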

Random Numbers

uniform()            returns uniformly distributed pseudo-random numbers on the interval [0,1)
invnorm(uniform())   returns normally distributed random numbers with mean zero and standard deviation one

String Functions

Programming

Date Functions

Time-series Functions

Matrix Functions

set seed #
uniform()
invnorm(uniform())
sum(x)
sum(x!=.)
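A short sketch of how the random-number items listed above might be used together. The seed value, the number of observations, and the variable names are arbitrary choices for illustration. The reading of sum(x) as a running sum and sum(x!=.) as a running count of nonmissing values is Stata's behavior for sum() inside generate, not something the list above spells out.

```stata
clear
set obs 1000
set seed 12345                       // makes the draws reproducible

generate u = uniform()               // uniform on [0,1)
generate z = invnorm(uniform())      // standard normal draws

generate runsum  = sum(z)            // running sum of z down the data set
generate nonmiss = sum(z != .)       // running count of nonmissing z

summarize u z, detail
```

Setting the seed before the first call to uniform() is what lets you rerun a .do file and get the same "random" numbers each time.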
Regression

Basic Regression

    regress depvar [varlist] [,*]

depvar - name of dependent variable
varlist - list of independent variables, separated by spaces (not commas)
level(#) - specifies the confidence level (e.g., 95) for confidence intervals of the coefficients
noconstant - suppresses the constant (intercept) term
robust - uses the White Heteroskedasticity Consistent Covariance Estimator; typically results in higher standard errors and lower t-ratios

Examples:
    regress y x1 x2
    reg height gender mom dad, level(95)
    reg consumption output, noconstant

Using Results

Parameter Estimates - returns the estimated coefficient for regressorname

    _b[regressorname]

Predict - generates a new variable that stores the designated prediction based on the last regression run by Stata

    predict newvarname [,statistic]

Statistic:
xb - fitted values; sample point estimate; this is the default so you don't need to include it
residuals - residuals (dependent variable minus the fitted value)
rstandard - standardized residuals
stdp - standard error of each predicted value (i.e., $\mathrm{Stdev}(\hat{y})$)
stdf - standard error of each forecasted value
stdr - standard error of each residual

Variance - displays the variance-covariance matrix (i.e., $\mathrm{Var}(\hat{\beta})$)

    vce
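The pieces above (regress, _b[], predict, and the variance-covariance matrix) can be sketched in one short session. The auto data set loaded by sysuse and the choice of variables are assumptions for illustration only. The matrix is shown here with matrix list e(V), which displays the same matrix that vce reports.

```stata
sysuse auto, clear                 // example data shipped with Stata

regress mpg weight foreign, level(95)

display _b[weight]                 // coefficient on weight
display _b[_cons]                  // the constant (intercept)

predict mpghat                     // fitted values (xb is the default)
predict e, residuals               // residuals: mpg minus fitted value
predict se_fit, stdp               // standard error of each prediction

matrix list e(V)                   // variance-covariance matrix of the estimates
```

predict always refers to the most recent estimation command, so run it before fitting another model.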
Testing Linear Hypotheses After Estimation

test coeflist - test that coefficients are equal to 0; list coefficients separated by spaces
test exp = exp [=...] - test that linear expressions are equal
accumulate - adds the test to previous test(s) in memory, making a joint test

Note: This performs the Wald Test... approximated with an F distribution instead of chi-square

F-Test - to do a real F-test of m restrictions:
1. Run the unrestricted regression: $y = x_1\hat{\beta}_1 + x_2\hat{\beta}_2 + x_3\hat{\beta}_3 + \dots + x_k\hat{\beta}_k + \hat{u}$
2. Record SSR (just on paper if you want)
3. Run the restricted regression: $y = x_1\tilde{\beta}_1 + \dots + x_k\tilde{\beta}_k + \tilde{u}$
4. generate F = ((RestrictedSSR - UnrestrictedSSR)/m)/(UnrestrictedSSR/(N-k))
5. Compare that to an F(m,N-k)... display Ftail(m,N-k,F)

Example -
    regress lwage educ huswage city unem exper expersq

Using Wald Test:
    test educ - exper = 0
    test city + unem = 0, accumulate
Returns 3.96... p-value 0.0199

Using F-Test:
    generate edex = educ - exper
    generate ctun = city + unem
    regress lwage edex huswage ctun expersq
    generate F = ((190.2475-186.55)/2)/(186.55/(428-7))
Returns 4.022... p-value 0.0186

Note that edex and ctun as defined impose equal and opposite coefficients on educ and exper and equal coefficients on city and unem, which are not the restrictions tested by the test commands above (those correspond to the regressors educ + exper and city - unem); with matching restrictions the two F statistics would be identical.

Advanced Regression Techniques

beta - requests that normalized beta coefficients be reported instead of confidence intervals; if the original model is $y = x_1\beta_1 + x_2\beta_2 + u$, beta alters the model to be
$\frac{y - \bar{y}}{\mathrm{Stdev}(y)} = \frac{x_1 - \bar{x}_1}{\mathrm{Stdev}(x_1)}\tilde{\beta}_1 + \frac{x_2 - \bar{x}_2}{\mathrm{Stdev}(x_2)}\tilde{\beta}_2 + \tilde{u}$, where $\tilde{\beta}_1 = \beta_1\,\frac{\mathrm{Stdev}(x_1)}{\mathrm{Stdev}(y)}$
cluster(varname) - varname describes the ID variable to allow correlation between errors within a cluster
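The manual F-test steps described earlier in this section can be sketched without writing anything down on paper, by reading the sums of squares from the stored results e(rss) and e(df_r). The auto data set and the particular restrictions (two coefficients set to zero, so m = 2) are assumptions chosen for illustration; this is not the lwage example from the text.

```stata
sysuse auto, clear

* Step 1-2: unrestricted regression; keep its SSR and residual df (N - k)
regress price mpg weight length
scalar ssr_u = e(rss)
scalar df_u  = e(df_r)

* Wald test of the same two restrictions, for comparison
test mpg weight
scalar F_wald = r(F)

* Step 3: restricted regression (coefficients on mpg and weight set to zero)
regress price length
scalar ssr_r = e(rss)

* Step 4-5: F statistic for m = 2 restrictions and its p-value
scalar F = ((ssr_r - ssr_u)/2) / (ssr_u/df_u)
display F
display Ftail(2, df_u, F)
display F_wald
```

When the restricted model imposes exactly the restrictions given to test (and robust is not used), the two F values agree, which is a handy check that the restricted regression was set up correctly.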

Heteroskedasticity - here's a series of commands to deal with heteroskedasticity; assume the error variance depends only on x2 and x3

    regress y x1 x2 x3
    predict e, residuals
    generate e2 = e^2
    regress e2 x2 x3
    predict sigma2

Method 1 - Transform Model
    generate newy = y/sqrt(sigma2)
    generate newx1 = x1/sqrt(sigma2)
    etc. (the constant must be transformed too: generate newcons = 1/sqrt(sigma2))
    regress newy newcons newx1 newx2 newx3, noconstant

Method 2 - Weights
    regress y x1 x2 x3 [aweight = 1/sigma2]

Generating lagged variables -
    generate lagy = y[_n-1]

Regressors Correlated with Error Terms - use instrumental variable estimation and the Hausman test

    ivreg depvar [varlist1] (varlist2 = varlist_iv) [,*]

varlist2 - list of independent variables that are correlated with the error term
varlist_iv - list of instrument variables used in place of the variables in varlist2
Other options are the same as for the regress command

Example: hushrs (husband's hours) is probably a joint decision when deciding the wife's hours (hours), so it's probably correlated with the error term; suppose huseduc is known to be a good instrument; test if huswage is also a good instrument:

    ivreg hours kidslt6 educ wage faminc unem (hushrs = huseduc), robust
    hausman, save
    ivreg hours kidslt6 educ wage faminc unem (hushrs = huswage huseduc), robust
    hausman

Seemingly Unrelated Regression (SUR) - simultaneous equations using pooled data (i.e., cross-section data over time that may not necessarily be from the same source)

    sureg (depvar1 varlist1 [,noconstant]) (depvar2 varlist2)...

noconstant - omits the constant term for the specified equations
isure - iterate over the estimated disturbance covariance matrix and parameter estimates until the parameter estimates converge; better finite sample properties
dfk - use an alternate divisor in computing the covariance matrix for the equation errors; better estimates for small samples
small - specifies that small sample statistics are to be computed; shifts test statistics from chi-squared and Z statistics to F statistics and t-statistics

Examples:
3 simultaneous equations:
    sureg (price foreign weight length) (mpg foreign weight) (displ foreign weight)
Test if the coefficient for foreign is zero in all equations:
    test foreign
Test across equations:
    test [price]foreign = [mpg]foreign

Problem with Heteroskedasticity or Serial Correlation -
Run simple OLS on stacked data (use n = min(n1, n2); drop extra data)
Create a new variable to account for pairs
    regress y x1 x2, cluster(id) robust

Fixed Effect Regression with Panel Data -

    xtreg depvar [varlist], type i(varname)

Type is one of the following, depending on which estimation technique is used:
be - between-effects estimator: $\bar{y}_i = \beta_0 + \bar{x}_i'\beta + u_i$
fe - fixed-effects estimator: $(y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)'\beta + (u_{it} - \bar{u}_i)$, where $\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}$ and $\bar{x}_{mi} = \frac{1}{T}\sum_{t=1}^{T} x_{mit}$
re - GLS random-effects estimator
pa - GEE population-averaged estimator
mle - maximum-likelihood random-effects estimator
i(varname) - specifies the variable corresponding to an independent unit (e.g., a subject id); this variable represents the i in $x_{it}$ (similar to cluster)

Output:
Reports # Observations, # Groups (individuals), min, max and avg Obs/Group
R-Sq... only care about overall... that's the one based on the original model: $y_{it} = \sum_{j=1}^{N} \beta_{0j} d_{jit} + x_{it}'\beta + u_{it}$
F(##, ###) (above table) testing $H_0: \beta = 0$ (i.e., all parameters are zero)... this is the standard F-test checking all parameters simultaneously for a regression
Const = $\hat{\beta}_0$, the estimated constant term
Corr(u_i, xb) = $\mathrm{Corr}(\hat{\beta}_{0i}, x_{it}'\hat{\beta})$... this is to check the assumption of the random effect model, which assumes $E(\beta_{0i} \mid x_{it}) = 0$
Sigma_u = standard deviation of $\hat{\beta}_{0i}$
Sigma_e = standard deviation of $u_{it}$
F(##, ###) (below table) testing $H_0: \mathrm{Var}(\beta_{0i}) = 0$; i.e., whether the individual effects are all the same; numerator degrees of freedom is N - 1; denominator is NT - (N + k) (assuming the same number of time periods per individual)... another way to think of this test is as a test on whether N - 1 dummy variables are simultaneously equal to zero (1 dummy is left out and captured with the constant term in the regression)
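A minimal sketch of running xtreg on a small simulated panel, so the output items described above can be seen in context. Everything about the data is an assumption made for illustration: 40 units observed for 5 periods each, a unit effect a that is correlated with the regressor x, and a true slope of 2. The variable names and seed are arbitrary.

```stata
clear
set seed 2024
set obs 200
generate id = int((_n-1)/5) + 1          // 40 units, 5 periods each

generate a = invnorm(uniform())
bysort id: replace a = a[1]              // unit effect, constant within id

generate x = invnorm(uniform()) + a      // regressor correlated with the effect
generate y = 1 + 2*x + a + invnorm(uniform())

* Random effects assumes the unit effect is uncorrelated with x
xtreg y x, re i(id)

* Fixed effects removes the unit effect by demeaning within id
xtreg y x, fe i(id)
scalar b_fe = _b[x]
display b_fe
```

Because the simulated unit effect is built to be correlated with x, the fixed-effects slope should sit close to 2 while the random-effects slope drifts away from it, and Corr(u_i, xb) in the fe output should be clearly nonzero.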

Programming

General Program

Specify Stata Version - some commands and formats are specific to the version of Stata (so copying someone else's code may not work in your version of Stata). If the person wrote it in a previous version, you may be able to get away with a simple command that allows the older code to work. Type the version number you want to emulate at the beginning of the file:

    version 8

Comments

*       Used at the beginning of a line; the line is ignored.
/* */   Used in the middle of a line; everything between /* and */ is ignored.
//      Used at the beginning or end of a line (must be preceded by one or more blanks if at the end); everything on the line after // is ignored.
///     Instructs Stata to view from /// to the end of a line as a comment, and to join the next line with the current line; must be preceded by one or more blanks; used to make long lines more readable.
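The version line and the four comment styles above might look like this in a small .do file. The auto data set and the commands chosen are assumptions for illustration; the example is meant to be run as a .do file rather than typed line by line into the Command window.

```stata
version 8

* A full-line comment: load the example data shipped with Stata
sysuse auto, clear

regress mpg /* a comment in the middle of a line */ weight

summarize price mpg     // a comment at the end of a line

* A long command split across two lines with ///
summarize price mpg ///
    weight length
```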