Sacha Kapoor - Masters Metrics 091610

Address: Max Gluskin House, 150 St. George, Rm 329
Email: sacha.kapoor@utoronto.ca
Web: http://individual.utoronto.ca/sacha_kapoor

1 Basics

Here are some data resources available to University of Toronto students:

- CHASS: http://datacentre.chass.utoronto.ca/
- Data Library, 5th Floor Robarts
- Rotman Finance Lab

There are also many data sets online. You just need to do a bit more searching.

There are many different types of data:

Financial markets data:
- CRSP Database: NYSE/AMEX/Nasdaq daily and monthly security prices and other historical data related to over 20,000 companies
- Canadian Financial Markets Research Centre: Toronto Stock Exchange trading information about specific securities
- Fundata Mutual Fund Database

Companies' financial data:
- Financial Post Corporate Database
- COMPUSTAT Database: Income Statement, Balance Sheet, Flow of Funds, and supplemental data items on more than 10,000 active and 9,400 inactive companies

National income statistics:
- OECD National Accounts Database
- World Bank databases
- Penn World Tables

What is Stata?

A high-level, general-purpose statistical software package (built on a C environment), with lots of built-in functions. Caveat: functions are not substitutes for understanding.

3 versions:
- Stata SE (for large datasets, found on arbor.economics.utoronto.ca, can be accessed remotely).
- Intercooled Stata (for medium-sized datasets, can be purchased through Robarts Library).
- Small Stata (for small data sets).

3 ways to use Stata:
- Interactively, through the command prompt (enter the commands one by one).
- Batch files, by collecting commands and running them all at once.
- Point and click.

How to collect commands? Use a do file:

    doedit

To track results/output you should use a log file:

    cd ../../../../documents/ta/2010-2011/masters_metrics
    log using "tutorial_091610.log", replace

where the first command changes the working directory to the data location and the second command opens the log file. To examine the contents of the current working directory:

    dir

To import comma-delimited data (.csv) use the insheet command:

    insheet using "S&P_data.csv"

To examine attributes of the data:

    des

Another way to obtain the same information and more:

    edit

Note that in Stata 11, as opposed to previous versions, you can run commands and have the editor open at the same time.

Before proceeding, label the data and variables:

    label data "S&P (01-31-80 to 12-31-99)"
    label variable eps "Earnings per share"
    label variable price "Price per share"
    label variable weather "Weather"

To convert the data into Stata format:

    save "sacha_s&p.dta", replace

To import data already in Stata format use the use command:

    use "sacha_s&p.dta"

To destring the date variable, let's try:
    destring date, replace
    list date in 1/10
    destring date, force replace
    list date in 1/10

This raises two issues: (1) the force option turns every entry that is not purely numeric into missing data; (2) destring is not the proper command for converting dates. To deal with the first problem, take the necessary precautions in your preamble:

    use "sacha_s&p.dta", clear
    preserve
    destring date, force replace
    list date in 1/10
    edit
    restore
    list date in 1/10

To deal with the second problem:

    generate date2 = date(date,"MDY")
    list date2 in 1/10

Now let's tell Stata that this is a time series:

    tsset date2, monthly

To extract more detailed date information:

    generate year = year(date2)
    generate month = month(date2)
    generate day = day(date2)
    label variable year "Year"
    label variable month "Month"
    label variable day "Day"
    list in 1/10

To drop variables:

    preserve
    drop day

To keep variables:

    keep year

To drop observations 5 through 15:
    drop in 5/15

Let's restore the data:

    restore

Still on the topic of time series data, to generate a trend:

    generate x = _n
    list x in 1/10

To generate lags (for x):

    generate x_1 = x[_n-1]
    replace x_1=0 if x_1==.

Let's take a closer look at the weather variable:

    des weather
    edit weather

One way to turn this into a dummy variable:

    generate weather2 = 0
    replace weather2=1 if weather=="yes"
    replace weather2=0 if weather=="no"
    list weather2 in 1/10

Notice how the replace command conditions on a logical expression. For future reference, conditional statements can involve any one of the following:

    <    less than
    >    greater than
    <=   less than or equal to
    >=   greater than or equal to
    ==   equal to in a logical expression
    !=   not equal to in a logical expression (~= also works)

2 Some Basic (Mostly) Statistical Commands

To check the current memory allocation:

    help memory

To set a new allocation:

    set memory 100
Note that the set command can be used to change many basic defaults in Stata.

I always begin investigations with the following command:

    tabulate weather

Why is it nonsensical to tabulate price?

    tabulate price

To present continuous data:

    histogram price

An even better way:

    histogram price, kdensity

Compare this with:

    histogram eps, kdensity

Coarser evidence is obtained with the following command:

    summarize price eps

To include a summary of a categorical variable we can use the xi prefix:

    xi: summarize price eps i.weather

To calculate means for price and eps under good and bad weather:

    by weather, sort: summarize price eps

To summarize a subset of values:

    summarize price if price <=150

To collapse the data and create a new dataset:

    preserve
    collapse (mean) price, by(weather)
    save "price.dta", replace
    restore
    des

To test the hypothesis that the mean of price equals 150, at the 95 percent confidence level:

    ttest price=150, level(95)

To test the equality of means:

    gen price_g = price if weather2==1
    gen price_b = price if weather2==0
    ttest price_g = price_b, unequal unpaired
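The equality-of-means test above builds one variable per group. As a hedged sketch, the same comparison can also be run with the by() option of ttest, which forms the two groups directly. The data below are simulated stand-ins (not the S&P file), and the scalar names p_two_vars and p_by are hypothetical, chosen only for this illustration. Run it from a do file.

```stata
* Sketch on simulated data: variable names mirror the tutorial,
* but every value here is made up for illustration.
clear
set seed 12345
set obs 200
generate weather2 = mod(_n, 2)
generate price = 150 + 5*weather2 + rnormal(0, 10)

* Approach from the text: one variable per group
generate price_g = price if weather2==1
generate price_b = price if weather2==0
ttest price_g == price_b, unequal unpaired
scalar p_two_vars = r(p)

* Alternative: let by() form the two groups
ttest price, by(weather2) unequal
scalar p_by = r(p)

* The two-sided p-values should agree
display "p-values: " p_two_vars "  " p_by
```

The two calls report differences with opposite signs (group 1 minus group 0 versus group 0 minus group 1), so only the two-sided p-value is compared here.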
3 Regression

Suppose our interest is in the relationship between price and eps:

    twoway (scatter price eps)
    twoway (scatter price eps) (lfit price eps)

Fitting a line through these points is equivalent to:

    regress price eps

Controls are easy to add:

    regress price eps x

The xi prefix works here as well:

    xi: regress price eps x i.weather

One way to deal with persistence in the dependent variable:

    generate price_1 = price[_n-1]
    xi: regress price eps x i.weather price_1

4 Merging Data Sets

Let's access online data from Stata.com:

    webuse odd, clear
    webuse even1

Merges can be one-to-one

    merge using http://www.stata-press.com/data/r10/odd

or can match observations across datasets

    webuse even1, clear
    merge number using http://www.stata-press.com/data/r10/odd, sort
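The merge commands above use the syntax from Stata 10 and earlier. Stata 11 also offers a newer form in which the match type is stated explicitly (1:1, m:1, 1:m). The following is a minimal sketch of that form on two tiny made-up datasets; the variables id, a, and b and the temporary file left are hypothetical and not part of the tutorial data. Run it from a do file so that the tempfile macro stays defined.

```stata
* First made-up dataset, saved to a temporary file
clear
input id a
1 10
2 20
3 30
end
tempfile left
save "`left'"

* Second made-up dataset, kept in memory as the master
clear
input id b
2 200
3 300
4 400
end

* Stata 11 syntax: one-to-one match on the key variable id
merge 1:1 id using "`left'"

* _merge records where each observation came from:
* 1 = master only, 2 = using only, 3 = matched
tabulate _merge
list, clean
```

Stating the match type makes Stata stop with an error if id does not uniquely identify observations, which is a useful check before trusting the merged data.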
5 Loops

Let's generate data:

    set obs 100

To create a variable with draws from a uniform distribution:

    generate y = runiform()
    list y in 1/10

To generate many variables with draws from the uniform distribution:

    forvalues i = 1(1)100 {
        generate x`i' = runiform()
    }

Note: (1) gives the increment; the loop generates 100 uniform random variables over (0,1).

To check for consistency of an estimator:

    webuse census2, clear
    generate x = rnormal(1000,100)
    generate e = rnormal()
    list x e in 1/10
    generate y = 100+1*x + e
    regress y x

6 Panel Data

    use "MATHPNL.DTA", clear
    des

Tell Stata you have a panel:

    xtset distid year

To run regressions using panel data:

    xtreg math4 y93 y94 y95 y96 y97 y98 lrexpp lenrol lunch, fe
    xtreg math4 y93 y94 y95 y96 y97 y98 lrexpp lenrol lunch, fe robust

To obtain predictions for the dependent variable and residuals, respectively (after xtreg, the e option of predict returns the idiosyncratic residual):

    predict yhat
    predict resid, e

To compare predictions with actual values:

    edit yhat math4

To close the log file:

    log close
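The consistency check in Section 5 fits a single regression at one sample size. Consistency is a statement about what happens as the sample grows, so one way to illustrate it, sketched here under assumed settings, is to combine it with a loop: repeat the simulation for increasing n and watch the slope estimate settle near its true value of 1. The sample sizes and the seed are arbitrary choices for this illustration.

```stata
* Sketch: OLS slope estimate at increasing sample sizes.
* The data-generating process matches Section 5: y = 100 + 1*x + e.
clear
set seed 2010
foreach n of numlist 100 1000 10000 100000 {
    quietly {
        clear
        set obs `n'
        generate x = rnormal(1000, 100)
        generate e = rnormal()
        generate y = 100 + 1*x + e
        regress y x
    }
    * _b[x] holds the slope from the most recent regression
    display "n = `n'   slope estimate = " _b[x]
}
```

A single run shows only one draw per sample size; repeating the loop with different seeds gives a better sense of how the spread of the estimates shrinks as n grows.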