How to use the FSBforecast Excel add-in for regression analysis

FSBforecast is an Excel add-in for data analysis and regression that was developed here at the Fuqua School of Business over the last 3 years by faculty members who teach statistics, in collaboration with Professor John Butler at the University of Texas. See the separate handout on How to install and uninstall FSBforecast for details on how to install or update it. After it has been installed, you should see FSBforecast appear on the main menu bar in Excel whenever you use it. If you click on the FSBforecast tab, a toolbar will appear showing the available options.

FSBforecast is very simple to use; this handout contains about all you need to know. The examples shown here were created from the accompanying file called FSBforecast_car_data.xlsx, which contains data on makes and models of cars sold in the U.S. in 1993. To obtain this file, go to the Decision 411 course software web page, click on the FSBforecast_car_data_file link, then click the Extract button on the WinZip toolbar to extract the Excel file to a directory of your choice. Then open it from there using Excel after FSBforecast has been installed. (A second file containing the completed analysis, called FSBforecast_car_data_with_analysis.xlsx, is also available there.)

Data definitions: FSBforecast expects your variables to reside in named ranges in Excel. Variables that are to be used in the same analysis should all be the same length, and the best approach is to organize them on a single data worksheet in consecutive columns with their names in the first row. For example, here is a picture of a portion of the sample file, which is arranged in this format. Note that text labels (to be used as variable names) appear in row 1 and the data appears in subsequent rows. Only a portion of this file is shown here. Overall it has 15 columns and 93 rows of data.

Variables are defined as named ranges in Excel.
They can be located anywhere in a workbook, but it is usually best to organize them in a single table on a single data worksheet with variable names in the first row.

To assign the text labels in row 1 as range names for the data in the rows below, proceed as follows:

1. Select the entire data area (including the top row with the names) by positioning the cursor on cell A1 and then holding down the Shift key while hitting the End key and then the Home key, i.e., Shift-End-Home. Caution: check to be sure that the lower right corner of the selected (blue) area is really the lower right corner of the data area. Sometimes this automatic method of selecting a range grabs an area with blank rows or columns or even the entire worksheet. If that happens, you will need to select the area manually by clicking and dragging the cursor to the bottom right data value.

2. Hit the Create From Selection button on the FSBforecast menu and check (only) the Top row box in the dialog box.

To define the variables for analysis, highlight the table of data (including the first row with the variable names) and hit the Create From Selection button. Check only the Top row box for creating names.

You can have any number of named ranges in your workbook, although you cannot use more than 50 variables at one time in the Data Analysis or Regression procedures. You can have up to 32,000 rows of data, although the graphs will take a long time to draw if you have a huge number of rows, and the row limit is somewhat less for regressions with large numbers of variables. A 50-variable regression is limited to about 18,000 rows. The regression procedure has a brief output mode that suppresses some of the chart output to speed up the analysis of large data sets and keep file sizes from getting too large when many models are fitted. In brief output mode, a regression with 50 variables and 18,000 rows of data will run in about 30 seconds on most PCs, which is as fast as or faster than most other regression software such as SPSS.

Data analysis: The Data Analysis procedure provides descriptive statistics, correlations, series plots, and scatterplots for a selected group of variables. Simply click the Data Analysis button on the FSBforecast toolbar and check the boxes for the variables you wish to include. The variable list that you see will only include variables containing at least some rows of numeric data. In this example, the variables Make and Type do not appear on the list of variables available for analysis because they have only text values. Model does appear because a few of its values are numeric (e.g., for the Audi 90 and 100 models), but you would not choose it for analysis.

In the Data Analysis procedure, select the variables you want to analyze and choose the plot options.

If you check the Show Series Plots box, you will also get a plot of each variable versus row number. We recommend that you always ask for series plots in at least one of your data analysis runs, no matter how large the data set. These plots give you a visual impression of each variable by itself and are vitally important if the variables are time series (although in this example they are not). If your variables are time series (i.e., measurements of the same quantities performed at different periods in time and arranged in chronological order), then you should check the Time Series Data box. This will provide an additional table of statistics, namely the autocorrelations of the variables, i.e., their correlations with their own prior values, going back as far as 12 periods into the past depending on the amount of history available. Also, the series plots are drawn with connecting lines when the Time Series box is checked. Here is a picture of the top portion of the Data Analysis report for the variables selected above, showing the descriptive statistics and series plots. (Only two of the 7 series plots in this analysis are shown.) Notice that the Cylinders variable has only a small number of possible values and they are all integers (4, 5, 6, 8), and there are only two cars with 5 cylinders and only seven cars with 8 cylinders in the sample. This is an example of the properties of your data that you can clearly see when you look at the series plots. The results of running the procedure are stored on a new worksheet. Descriptive stats and optional series plots appear at the top. If the Time series data box is checked, you also get a table of autocorrelations and the series plots have connecting lines. Sample sizes may vary if any values are missing: Be aware that on any given run the data analysis procedure ignores rows where any of the selected variables have missing values or text values, so that the sample size is the same for all the variables. 
(In some data files, missing values may be coded as text labels such as "NA", meaning "not available".) This means that the sample sizes and the values of the sample statistics may vary from one data analysis run to another if you add or drop variables that have missing or text values in different positions. If the sample size (Count) is less than you expected or if it varies from one run to another, you should look carefully at the data matrix to see if there are unsuspected missing or text values scattered around among the variables. In this data set, if you choose Model as one of the variables to be analyzed, you will only get a sample size of 7, because there are only 7 cars whose model names consist of numbers (like the Audi 90 and 100). The reason for following this convention is that it keeps the data analysis sheet in sync with a regression model sheet that uses the same set of variables, e.g., the correlation matrix on both sheets will be the same. When fitting a regression model, only rows of data in which all the chosen dependent and independent variables have numeric values can be used to estimate the model.

Correlation and scatterplot matrices: The Data Analysis procedure always shows you the correlation matrix of the selected variables (i.e., all correlations between one variable and another), because correlations are the key statistics that are used to measure linear relationships among variables. If you check the Show Scatter Plots box when running the Data Analysis procedure, you will also get a matrix of all 2-way scatterplots, which is the visual counterpart of the correlation matrix. The scatterplots may take some time to draw if you choose to analyze a large number of variables at once (e.g., 15 or more) and there are also many rows of data (e.g., 1000 or more). If you run the procedure and select n variables, you will get n^2 plots, and they are drawn at the rate of several per second (faster or slower depending on the number of rows of data). If you try this with 50 variables, you will get 2500 scatterplots on a single worksheet. The result is impressive to look at, but you may wait a while for it! Here is a picture of what the output looks like when only 3 variables are chosen:

The correlation matrix is displayed farther down on the Data Analysis worksheet, and there is an option to generate a full matrix of all 2-way scatterplots.

Any of the individual scatterplots can be enlarged by pulling on its corners, and it can be copied and pasted to another worksheet or to a Word or PowerPoint document and reformatted there as well. The same is true of all chart output in FSBforecast. Note that in these plots, the relationship between MPG_City and the two other variables appears to be somewhat nonlinear, i.e., the points appear to be distributed around a curved line rather than a straight line.
Other patterns you might (or might not) observe in a scatterplot are extreme values of some variables (outliers), which may or may not line up with extreme values of other variables, or clusters of points along the edges or in the corners of some plots. These sorts of patterns can present challenges for fitting models that assume linear relationships and normally distributed errors. Sometimes transformations of variables are needed to straighten things out.
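The row-dropping convention and the correlation statistic described above can be illustrated outside Excel. The following is a minimal Python sketch, not FSBforecast's own code: the function names and the data values are hypothetical, and it assumes that "numeric" simply means an integer or floating-point cell.

```python
from math import sqrt

def complete_rows(columns):
    """Keep only the rows in which every selected variable is numeric."""
    rows = zip(*columns.values())
    kept = [r for r in rows
            if all(isinstance(v, (int, float)) and not isinstance(v, bool)
                   for v in r)]
    return {name: [r[i] for r in kept] for i, name in enumerate(columns)}

def correlation(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical values, not taken from the car data file.
data = {
    "MPG_City": [25, 18, 20, "NA", 22, 31],
    "Weight":   [2705, 3560, 3375, 3405, None, 2295],
}
clean = complete_rows(data)
count = len(clean["MPG_City"])   # rows containing "NA" or a blank are dropped
r = correlation(clean["MPG_City"], clean["Weight"])
print(count, round(r, 3))
```

Adding a third variable with a missing value in a different row would shrink the count further, which is why the sample size can change from one run to the next.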

Regression: The Regression procedure fits multiple regression models and allows them to be easily compared side by side. Just hit the Regression button, select the dependent variable you want to use, check the boxes for the independent variables from which you wish to predict it, and then hit the Run button. Consecutive models are named Model 1, Model 2, etc., by default, but you can also enter a name of your choice in the Model Name box before hitting Run, and the custom name will be used to label all of the output.

To run a regression, select the dependent variable and then check the boxes for the independent variables you wish to include, and hit the Run button. A model can have up to 50 independent variables and over 18,000 rows of data.

If you also check the Brief Output box, then some of the usual regression output (the normal probability plot, the descriptive statistics and plots of the individual variables, the residuals vs. independent variable plots, and the residual table) will not be included on the model worksheet. These take a large amount of time and space to produce compared to the rest of the standard output. If you have relatively large numbers of independent variables (say, a dozen or more) and/or relatively large numbers of rows (say, 500 or more), you may wish to ask for brief output when first running a model. Brief output will give you more compact model sheets, and it will also cut down on the time needed to redraw plots with large numbers of points when you scroll up and down the sheet. Once you have identified a promising-looking model for a large data set, you can rerun it with full output for a more complete picture. Brief output mode will also keep the file size more manageable if you fit a large number of models in one workbook. It is easy to end up with file sizes of 10 MB or 20 MB or more if you run a lot of full-output regressions with many variables and many rows of data.
If all your variables consist of time series (i.e., variables whose values are ordered in time, such as daily or weekly or monthly or annual observations of some quantities), then you should also check the Time Series Data box. This will provide additional model statistics that are relevant only for time series, such as autocorrelations of the residuals, which reveal whether there are unexplained time patterns.
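To make the idea of a residual autocorrelation concrete, here is a minimal Python sketch. It assumes the common sample definition (lagged cross-products of deviations from the mean, divided by the total sum of squared deviations). The handout does not state which formula FSBforecast uses, and the residual values below are hypothetical.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series with its own values `lag` periods back."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = sum((values[t] - mean) * (values[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

# Hypothetical residuals that drift slowly down and back up over time.
residuals = [1.0, 0.8, 0.5, 0.1, -0.3, -0.7, -0.9, -0.6, -0.2, 0.3]
lag1 = autocorrelation(residuals, 1)
print(round(lag1, 3))
```

A lag-1 value far from zero, as in this made-up series, is the kind of unexplained time pattern that the residual autocorrelation table is meant to reveal.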

There is also a Set Intercept to 0 option, which forces the intercept to be zero in the equation. In the special case of a simple (1-variable) regression model, this means that the regression line is a straight line that passes through the origin, i.e., the point (0, 0) in the X-Y plane. If you use this option, values for R-squared and adjusted R-squared are not computed, because they do not have the same meaning for a model that does not include an intercept and there is no universally accepted way of defining them in this situation.

The model sheet: The regression results for each model are stored on a new worksheet whose name is whatever model name was entered in the name box on the regression input panel when the model was run (either a default name such as Model n or a custom name of your choice). Here is a picture of a portion of the regression output which appears at the top of the model sheet. More tables and charts will appear below it.

The results of running each model are stored on a new worksheet. At the top of the sheet the variables are listed and the model equation is printed out as a text string, suitable for copying and pasting into a report. The usual tables of regression model statistics, coefficient estimates, and significance tests appear below, followed by a table of residual distribution statistics that includes the Anderson-Darling test for a non-normal error distribution and the size and location of the largest-magnitude residual. If the Time series data box was checked, a table of residual autocorrelations and tests of their significance are also shown.

It is easy to refine an existing model by adding or removing variables. If you hit the Regression button while positioned on an existing model worksheet, the variable specifications for that model are the starting point for specifying the next model. You can add or remove a variable relative to that model by checking or unchecking a single box.
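For the Set Intercept to 0 option described above, the simple-regression case has a closed-form least-squares slope, namely the sum of X*Y divided by the sum of X squared. The short Python sketch below illustrates that standard formula with hypothetical data; it is not a description of how FSBforecast performs the calculation internally.

```python
def slope_through_origin(x, y):
    """Least-squares slope for the model y = b*x (no intercept term)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Hypothetical data that lie close to a line through (0, 0).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
b = slope_through_origin(x, y)
print(round(b, 2))
```

The fitted line here passes through the origin by construction, whether or not the data support that assumption, which is one reason the usual R-squared is not reported for such models.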

Charts appear farther down on the model sheet. The output always includes a chart of actual and predicted values vs. observation number, residuals vs. observation number, residual histogram plot, residuals vs. predicted values, and a line fit plot in the case of a simple (1-variable) regression model. Forecasts, if any were produced, are shown in a table and also plotted. Full output, which is the default, also includes a normal probability plot and plots of residuals vs. each of the independent variables and dependent variable vs. each of the independent variables.

On the worksheet the charts are all arranged one above the other, not side by side as shown here, and the charts and tables are sized to be printable at 100% scaling on 8.5" wide paper. The default print area is preset to include all pages of output, so the entire output is printable on standard-width paper with a few keystrokes, leaving a complete audit trail on paper. However, for presentation purposes, it is usually best to copy and paste individual charts and tables to other documents, as discussed later. All table and chart titles include the model name and the name of the dependent variable to leave an audit trail if they are copied and pasted to reports.

At the very bottom of the model sheet is a table that shows actual and predicted values, residuals, and standardized residuals for all rows in the data file. The table is sorted in descending order of absolute values of the residuals, so that outliers appear at the top.

Forecasting: If you wish to generate forecasts from your fitted regression models, there are two ways to do it in FSBforecast: manually and automatically. In the manual approach, define your variables so that they contain only the sample data to be used for estimating the model, not the data to be used for forecasting. Then, after fitting a regression model, scroll down to the line on the worksheet that says Forecasts: Dep. Var.
= etc., and click the + in the left sidebar of the sheet to maximize (i.e., open up) the forecast table. Then type (or copy and paste) values for the independent variables into the cells at the right end of the forecast row, as in the shaded cells in the table below, and then click the Forecasting button. The forecast and its confidence limits will then be computed and displayed in the cells to the left. Two plots of the forecasts are also produced. The first one shows only the forecast(s), together with 95% confidence limits for both means and forecasts. (A 95% confidence interval for the mean is a confidence interval for the true height of the regression line for given values of the independent variables. A 95% confidence interval for the forecast is a confidence interval for a prediction that is based on the regression line. The latter confidence interval also takes into account the unexplained variations of the data around the regression line, so it is wider.) The second plot shows the actual and predicted values from the sample to which the model was fitted, together with the forecasts and 95% confidence intervals for forecasts. (The latter plot is always produced, even if there are no forecasts.)

How to generate forecasts manually: enter values for the independent variables in one or more rows at the right end of the forecast table, below the variable names, then hit the Forecasting button on the toolbar. The forecasts and confidence limits will be displayed at the left end of the same row(s), and they will also be plotted.

In the automatic forecasting approach, which is more systematic and more suitable for generating many forecasts at once, define your variables up front so that they include rows for out-of-sample data from which forecasts are to be computed later. FSBforecast will automatically generate forecasts for any rows where all of the independent variables have values but the dependent variable is missing (i.e., has a blank cell). The variables must all be ranges with the same length, but the dependent variable will have some empty cells at the bottom or elsewhere.
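The difference between the two kinds of confidence interval, and the rule that a forecast is produced wherever the dependent variable is blank, can be sketched for a simple (1-variable) regression. The Python code below uses the standard textbook formulas for the standard error of the mean and the standard error of the forecast. It is an illustration under those assumptions rather than FSBforecast's actual implementation, and the function name and data are hypothetical.

```python
from math import sqrt

def fit_and_forecast(x, y):
    """Fit y = a + b*x on rows where y is present; forecast rows where y is None."""
    sample = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(sample)
    mx = sum(a for a, _ in sample) / n
    my = sum(b for _, b in sample) / n
    sxx = sum((a - mx) ** 2 for a, _ in sample)
    slope = sum((a - mx) * (b - my) for a, b in sample) / sxx
    intercept = my - slope * mx
    sse = sum((b - intercept - slope * a) ** 2 for a, b in sample)
    s = sqrt(sse / (n - 2))          # standard error of the regression
    results = []
    for a, b in zip(x, y):
        if b is None:                # blank dependent variable: forecast this row
            se_mean = s * sqrt(1 / n + (a - mx) ** 2 / sxx)
            se_forecast = sqrt(s ** 2 + se_mean ** 2)
            results.append((a, intercept + slope * a, se_mean, se_forecast))
    return results

# Hypothetical data: weight in thousands of pounds, last MPG value left blank.
weight = [2.3, 2.7, 3.0, 3.4, 3.6, 4.0, 3.2]
mpg_city = [31, 25, 22, 20, 18, 16, None]
forecasts = fit_and_forecast(weight, mpg_city)
for x0, pred, se_mean, se_fcst in forecasts:
    print(x0, round(pred, 2), round(se_mean, 3), round(se_fcst, 3))
```

Each set of 95% limits is the forecast plus or minus the appropriate standard error times a t critical value with n - 2 degrees of freedom. Because the forecast standard error adds the regression's own error variance to the uncertainty in the line, it is always the larger of the two.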
The advantage of this approach is that you only need to enter the forecast data once, at the time the data file is first created, and it will automatically be transformed if you apply any data transformations to the same variables later. Also, when using this method it is possible for forecasts to be generated in the middle of the data set if missing values of the dependent variable happen to occur there. The file used in the example above contains an extra row of data at the bottom for a hypothetical car whose mileage is to be predicted. It has values for all the numeric variables other than MPG_City, so any model fitted to MPG_City will generate a forecast for this row automatically, without the need for you to type values for the independent variables in the forecast table. Only one forecast is shown in this example, but you can generate any number of forecasts in this way by including additional rows with out-of-sample data for the independent variables. You can also use this feature to do out-of-sample testing of a model by removing the values of the dependent variable from a large block of rows and then comparing the forecasts to the actual values afterward.

A forecast is also generated automatically for any row of data where the dependent variable is missing and all independent variables are present.

Viewing tables and charts in your regression output: Each model worksheet provides a number of standard tables and charts, and they can be maximized or minimized by clicking the +'s or -'s on the left sidebar of the worksheet. At the time you run the model you have the option for full regression output (which is the default) or brief output (which you get by checking the box). If you allow full output to be produced, much of it will be minimized to start with, and you will need to go down the left sidebar of the sheet clicking the +'s to see the complete results. As noted earlier, full output includes scatterplots of the dependent variable versus each of the independent variables and plots of the residuals versus each of the independent variables. These are all minimized by default because they take up a lot of room when there are many variables. Full output also includes a normal probability plot (a diagnostic test for normally distributed errors) as well as the usual histogram plot of the residuals. In the special case of a simple regression model, you also get a line fit plot (the regression line and confidence bands around it) in both brief output and full output mode. See the last page of this handout for an example.

Choosing the output to display: click the - symbol to minimize (hide) a table or chart and click + to maximize (unhide) it.
Model summary worksheet: An innovative feature of FSBforecast is that it maintains a separate Model Summary worksheet that shows side-by-side summary statistics and model coefficients for all regression models that have been fitted in the same workbook. This allows easy comparison of models, and it also provides an audit trail for all of the models you have fitted so far. Here is an example of the model summary worksheet that was obtained after fitting two more models in which less significant variables were successively removed:

Model statistics and coefficients are compared side by side on the Model Comparison worksheet. This sheet also provides an audit trail of your work. Each model is time- and date-stamped.

Variable Transformations: At any stage in your analysis you can create new variables in additional columns by entering and copying your own Excel formulas and assigning range names to the results. However, there is also a Variable Transformations option on the Regression panel that allows you to easily create new variables by applying standard transformations to your existing variables, such as the natural log transformation or exponential or power transformations. The transformed variables are automatically assigned descriptive names, such as X_LN (natural log of X).

The Variable Transformation tool can be used to create additional variables from transformations of the existing ones.

In the data set shown here, the relationship between miles per gallon and some of the other variables looks somewhat nonlinear on the scatterplots, as pointed out earlier. Perhaps it would be better to predict gallons per mile as the dependent variable? The MPG_City variable can be transformed into units of gallons per mile by raising it to the power of negative 1, as shown in the dialog box below.

Basic variable transformation options: natural log, exponential, power, plus/minus/times/divided by (f(x)), and creation of dummy variables for integer or categorical data.

The transformed variable will be automatically assigned the name MPG_City_POWneg1, and it will show up next to the original variable in the alphabetical list of variable names in the dialog boxes. You could also assign a less geeky name to the variable (e.g., GallonsPerMile) by using the Name Manager to change its name. To change the name of a variable, click the Formulas tab on the Excel main menu, then click the Name Manager button, then click on the variable whose name you want to change, then click the Edit button, enter a new name for it in the Name box, and hit OK.

The Make Dummy Variable transformation can be used to create dummy (0-1) variables from variables that consist either of numbers or text labels, including variables such as DriveTrain (front/rear/all) in this file. A separate dummy variable (with a name such as DriveTrain_EQ_front) will automatically be created for each distinct value of the input variable.
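The power and dummy-variable transformations described above amount to simple column operations. The Python sketch below mimics them with hypothetical data and helper names. It follows the naming pattern quoted in the text (such as DriveTrain_EQ_front) but is not FSBforecast's code.

```python
def power_transform(values, exponent):
    """Raise every value to the given power (exponent -1 gives the reciprocal)."""
    return [v ** exponent for v in values]

def make_dummies(name, values):
    """Create one 0-1 variable for each distinct value of the input variable."""
    return {f"{name}_EQ_{level}": [1 if v == level else 0 for v in values]
            for level in sorted(set(values))}

# Hypothetical values, not taken from the car data file.
mpg_city = [25, 18, 20, 22]
gallons_per_mile = power_transform(mpg_city, -1)   # miles/gallon -> gallons/mile

drive_train = ["front", "rear", "front", "all"]
dummies = make_dummies("DriveTrain", drive_train)
for dummy_name, column in dummies.items():
    print(dummy_name, column)
```

In a regression that includes an intercept, one of the dummy variables for a given category is normally left out of the model, because the full set of dummies always sums to 1 in every row.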

If the Time Series Data box is checked on the regression input panel, then many additional transformations are available which are specific to time series, such as computing lagged values, or changes from one period to another, or percentage changes from one period to another, or adjusting for inflation using a fixed rate of deflation.

Additional transformations that are specific to time series data: lags, differences, and deflation. These are only available when the Time Series Data box is checked on the regression input panel.

Scaling of variables: The coefficients in the regression equation and regression summary table are displayed in fixed format with 3 decimal places. Normally this is fine for a wide range of units of measurement, but if your dependent and independent variables are measured in units that are poorly scaled relative to each other (e.g., one measured in dollars and another measured in millions or billions of dollars), the coefficients may end up displaying as zeros in 3-decimal-place format because their estimated values are less than 0.0005, even though they are statistically significant. Keep in mind that the value of a regression coefficient is measured in units of Y per unit of X, whatever those units may be. If you are puzzled to find zeros or very small numbers in the model equation or table of regression coefficients, when the model otherwise seems reasonable, you should consider rescaling some of the variables. For example, if an independent variable has a coefficient that is displayed as zero despite being statistically significant (as indicated by a large t-stat and a small P-value), consider rescaling it in thousands of its original units, so that its values are smaller by a factor of 1000, which will increase its estimated coefficient by the same factor while leaving the t-stat and P-value unaffected. Alternatively, you might rescale the dependent variable so that its values are larger rather than smaller.
In the car data example above, the coefficients of RevsPerMile and Weight were on the order of 0.002 and 0.008 respectively, so they were displayed with only one significant digit of precision. Some rescaling of variables might be helpful there. For example, you could create a new dependent variable called GallonsPer100Miles by multiplying GallonsPerMile by 100. This would increase the values of all the estimated coefficients by a factor of 100, other things being equal.
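The effect of rescaling on a displayed coefficient can be checked with a small calculation. The Python sketch below fits a simple regression slope before and after rescaling, using hypothetical weights and fuel consumption figures. It illustrates the arithmetic only and does not reproduce any FSBforecast output.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x in a simple regression with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

# Hypothetical data: weight in pounds, fuel use in gallons per mile.
weight_lb = [2295, 2705, 3375, 3560]
gallons_per_mile = [0.032, 0.040, 0.050, 0.056]

slope_lb = ols_slope(weight_lb, gallons_per_mile)
slope_klb = ols_slope([w / 1000 for w in weight_lb], gallons_per_mile)
slope_per_100 = ols_slope(weight_lb, [100 * g for g in gallons_per_mile])

print(f"{slope_lb:.3f}")       # shows as zero at 3 decimal places
print(f"{slope_klb:.3f}")      # X in thousands: coefficient is 1000 times larger
print(f"{slope_per_100:.3f}")  # Y times 100: coefficient is 100 times larger
```

Only the units change in each case. The fit of the model and the significance of the coefficient are the same, which is why rescaling is a safe way to make the displayed numbers readable.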

Displaying gridlines and column headings on the spreadsheet: By default the data analysis sheets and model sheets do not show gridlines and column headings, in order to make the data stand out more clearly. However, if you wish to turn them back on, you can do so by going to the View toolbar and clicking the boxes for Gridlines and/or Headings. This allows you to do things like changing column widths if necessary.

Copying output to Word and PowerPoint files: The various tables and charts produced by FSBforecast have been designed in such a way that they can be easily copied to document files, and the table and chart titles all include the name of the dependent variable and the model name so that they can be traced back to their source. When copying and pasting a chart or table, there are several alternatives. On the Home tab, the pull-down Paste menu has a row of icons for different formats as well as a paste special option. The icons give you a number of complicated options, e.g., tables can be pasted in a form that allows their contents to be edited, and they can be given the same format as either their source or destination, and their contents can be merged into other tables. We suggest that you use the picture option, which is on the right end of the list of icons, or else choose paste special and then choose one of the picture formats (e.g., png or enhanced metafile). This will paste the table or chart as an image whose contents cannot be edited. It can be scaled up and down in a way that will keep everything in proportion, and it will be secure against having its numbers changed (accidentally by you or deliberately by others) later on. Often charts can be made smaller without loss of readability or impact, and you should always consider doing this when preparing reports. For example, here is the line fit plot for a simple regression model pasted as a picture and scaled way down:

[Figure: Line Fit Plot, Dep. Var. = MPG_City, Model = Model 3. Vertical axis: MPG_City; horizontal axis: Weight. Series shown: Actual, Predicted, Upper 95%F, Lower 95%F.]