Introduction to CS databases and statistics in Excel
Jacek Wiślicki, Laurent Babout

One of the applications of MS Excel is data processing and statistical analysis. The following exercises demonstrate some of these functions. The base files for the exercises are included in http://lbabout.iis.p.lodz.pl/teaching_and_student_projects_files/files/us/lab_04b.zip. Download the archive and extract its content to your local drive.

Exercise 1
Open lab_04b.xls. The first worksheet (database) contains a dataset with a large number of employee names and some personal data. Using the Autofilter function available from the Data tab, display:
- all the people living in Canada earning less than 15 zł per hour,
- all the people whose family name starts with cal,
- all the people working more than 35 hours a week and employed between 1997 and 1999.

[Figure: autofilter]

Exercise 2
Remove the autofilter (this is not strictly necessary, but it will not be used anymore). Using the Sort function from the Data tab, sort the table by country ascending, family name descending and employment year descending.

Exercise 3
Import the data from the lab_04b.csv file into an empty worksheet. This file is an example of CSV (comma-separated values), a format that reflects the column layout used by spreadsheets but is a plain text file (open it with a notepad and see the structure). The CSV format is very simple and useful when interchanging data between different systems.

Hint:
1. Click on the From Text button in the Get External Data group on the Data tab.
2. Choose the text files filter and point to the source file.
3. When the import dialog opens, you can set all import parameters. Set the file encoding to Central European (ISO) so that the Polish diacritic characters are displayed correctly.
   [Figure: import dialog, with callouts for column format, file encoding, start import from row, and text file preview]
4. Since the file is semicolon delimited (not fixed width), press the Next button.
5. Choose the column delimiter character (in this case semicolon) and no text qualifier.
   [Figure: delimiter settings, with callouts for semicolon, no qualifier, and file preview with distinguished columns]
6. Press the Next button.
7. Finally, you can adjust the column types. In this case all can be left as General (MS Excel 2010 will recognize numbers). Press the Finish button and indicate the cell where the import should start.

The worksheet should look as below:
[Figure: worksheet after the import]
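The import settings chosen in the hint (semicolon delimiter, no text qualifier, Central European (ISO) encoding) can also be tried outside Excel. The following is a minimal Python sketch, assuming that Central European (ISO) corresponds to the iso-8859-2 codec. The function name read_semicolon_csv and the sample rows are hypothetical and are not taken from lab_04b.csv.

```python
import csv
import io


def read_semicolon_csv(raw, encoding="iso-8859-2"):
    """Decode raw bytes and split them into rows of semicolon-separated fields."""
    text = raw.decode(encoding)
    # QUOTE_NONE mirrors the "no text qualifier" choice of the import wizard.
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=";",
        quoting=csv.QUOTE_NONE,
    )
    return [row for row in reader]


# Hypothetical sample content, encoded the way the hint assumes the file is.
sample = "name;surname;average\nŁucja;Żak;4.5\n".encode("iso-8859-2")
rows = read_semicolon_csv(sample)
header, first_student = rows[0], rows[1]
```

Decoding with a different codec would run without error for many byte values but would show wrong diacritics, which is the effect the encoding setting in step 3 is meant to avoid.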

Exercise 4
Using formulae, calculate the average, maximum, minimum, median and standard deviation of the average grades. Then create a graph illustrating the average grade distribution, having sorted the students by their marks (if needed, reverse the Y-axis categories). Format the plot as in the example, using appropriate options and functions.

Exercise 5
The second worksheet (sales) in lab_04b.xls contains data about the quarterly sales of some product. The sales differ among the quarters, which is quite a common phenomenon for many seasonal products, such as ice cream. In the quarter columns, the quarter to which the current row refers is denoted with 1, the others with 0. Columns n and time mean the same.

First, create a line plot of sales with respect to time and add a linear trend line. As you can see, the sales are generally growing, however they reflect some seasonal fluctuations.

Your task is to determine the estimated sales at any time (quarter) in the future, respecting the overall trend and fluctuations.

Edit the trend line and, in the Options tab, check the options for displaying the equation and the R-squared value. The equation is the trend line function in the form $y = ax + b$, while $R^2$ indicates how accurate the trend line approximation is. The maximum value of $R^2$ is 1, however this happens only if the data fit the trend equation exactly. In realistic situations you can regard the trend line as quite good if $R^2$ is greater than 0.8.

[Figure: trend line parameters]

Choose Data Analysis in the Analysis group on the Data tab (provided it is installed; if not, see the hint below). Select Regression from the list and press OK.

How to install the add-in: if this add-in is not installed, proceed as follows:
1. Click the File tab, then click Options.
2. Click Add-Ins, and then in the Manage box, select Excel Add-ins.
3. Click Go.
4. In the Add-Ins available box, select the Analysis ToolPak check box, and then click OK.

Enter the X and Y data ranges (Y are the sales, X are the time and the quarters) and indicate the output range (any cell outside the data table). To make the regression parameters easier to identify, select the data ranges together with their column labels. In that case, check the Labels option.

[Figure: regression dialog, with a callout for titles]

Press the OK button. The regression analysis results are placed in your worksheet.

[Figure: regression results table, with callouts for intersection [i], [tr], [q1r] and [q2r]]

The only cells required in the prognosis are marked with a yellow background.

Multiple regression has the form $y = b_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n$ + [random component], where $y$ is the sales, $x_1$ is the time, and $x_2, \dots, x_n$ are the ones and zeros (the quarters). $b_0$ is the intercept (intersection), $a_1$ is the time coefficient, and so on. The prognosis formula contains no random component, which is the difference between the real data and the estimated data.

In the column to the right of the table, enter the regression formula:

[i]+[tr]*[t]+[q1r]*[q1]+[q2r]*[q2]+[q3r]*[q3]+[q4r]*[q4]

where:
[i] is an absolute reference to the cell with the intercept (intersection) in the regression table,
[t] is a relative reference to the cell in the time column,
[q1], [q2], [q3], [q4] are relative references to the cells with the quarters (zeros and ones),
[tr] is an absolute reference to the cell with the time coefficient in the regression table,
[q1r], [q2r], [q3r], [q4r] are absolute references to the cells with the quarter coefficients in the regression table.

The result should be as follows, with the formula in cell H2:

=$B$40+$B$41*C2+$B$42*D2+$B$43*E2+$B$44*F2+$B$45*G2

[Figure: worksheet with a callout for the estimated values]

As you can see in the regression results, $R^2$ is about 0.969, which is fairly close to one. This means that the regression estimation of the trend is very good. Illustrate this by drawing a plot containing the real sales data and the estimated ones.
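The regression formula entered in the worksheet can be written out as a short Python sketch to make its structure explicit: the intercept, plus the time coefficient multiplied by the time index, plus the coefficient of whichever quarter is flagged with 1. All coefficient values below are hypothetical placeholders, not the values from the regression table, and the helper quarter_flags_for assumes that time index 1 falls in the first quarter, which the worksheet may or may not do.

```python
def estimate_sales(intercept, time_coef, quarter_coefs, t, quarter_flags):
    """Evaluate intercept + time_coef * t + sum(quarter_coef_k * quarter_flag_k)."""
    if len(quarter_coefs) != len(quarter_flags):
        raise ValueError("one coefficient is needed per quarter flag")
    seasonal = sum(c * q for c, q in zip(quarter_coefs, quarter_flags))
    return intercept + time_coef * t + seasonal


def quarter_flags_for(t):
    """Return the 0/1 flags [q1, q2, q3, q4], ASSUMING t = 1 is a first quarter."""
    flags = [0, 0, 0, 0]
    flags[(t - 1) % 4] = 1
    return flags


# Hypothetical coefficients standing in for the cells [i], [tr], [q1r]..[q4r].
INTERCEPT = 100.0
TIME_COEF = 2.0
QUARTER_COEFS = [-10.0, 5.0, 20.0, 0.0]

# Estimate for a future time index, here t = 21.
forecast = estimate_sales(
    INTERCEPT, TIME_COEF, QUARTER_COEFS, 21, quarter_flags_for(21)
)
```

Because exactly one quarter flag is 1 in each row, only one quarter coefficient contributes to each estimate, which is what the zeros and ones in the quarter columns achieve in the spreadsheet formula.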

Having the correct regression estimation, you can calculate the sales for any quarter in the future (and in the past, back to some zero moment), assuming that the trend remains constant. It is enough to apply the regression formula to any data row containing the time index and a 1 in the quarter for which the estimation is performed.

Exercise 6
source: http://www-zo.iinf.polsl.gliwice.pl/~kadam/pimfet_std/excel/excel.htm

The last worksheet in lab_04b.xls (babies) contains some data on newborns. The exercise demonstrates techniques of statistical data analysis such as histograms. An important thing that will simplify the work is to name the data ranges used for calculations. We will use the babies' weights and heights:
1. Select the cell or range of cells that you want to name.
2. Click the Name box at the left end of the formula bar.
   [Figure: formula bar, with the callout "Enter name here" on the Name box]
3. Type the name that you want to use to refer to your selection (e.g. weight). Names can be up to 255 characters in length.

Remark: alternatively, you can use the Name Manager box that you will find in the Defined Names group on the Formulas tab.

In the same way, name the heights range as height. Then calculate the maximum and the minimum weight and height.

Knowing the upper and lower bounds of the values in our distribution, we will create histograms (e.g. for weight the ranges will be [1800; 2000[, [2000; 2200[, etc.). To do so, create a table containing the data for the graph. Use the FREQUENCY (CZĘSTOŚĆ) function, whose arguments will be the weight named range and the ranges in the distribution table. The formula you are entering is called a matrix formula, as its values affect a whole range of cells. Follow the next steps carefully to avoid mistakes.

Select the whole column in the distribution table and enter the FREQUENCY function:

=FREQUENCY(weight;K11:K25)

[Figure: the formula with callouts: weight is the named range, K11:K25 are the histogram thresholds (aka bins)]

Accept the formula by pressing Ctrl+Shift+Enter simultaneously. This is the only way to enter a matrix formula. The formula will appear in braces:

{=FREQUENCY(weight;K11:K25)}

and the distribution table will be filled with the data.

Remarks:
1. If you don't use the matrix formula (so, basically, if you omit pressing Ctrl+Shift+Enter), the cumulative frequency is displayed (e.g. for the bin 2600, the corresponding frequency will actually count the babies with a weight lower than 2600). You can still obtain the frequency distribution or histogram by displaying in a new column the difference between adjacent cells of the cumulative frequency column (e.g. for bin 2600 in cell A10, the cumulative frequency is 10 in B10, but the frequency is C10 = B10 - B9 = 10 - 4 = 6). What about the first bin, i.e. 2000?
2. Alternatively, you can choose the Histogram module from the Data Analysis dialog box (Data Analysis in the Analysis group) to directly create, as for the regression, an output table which displays the histogram.
3. You can also display a relative or normalised histogram. You simply have to divide each frequency value by the number of observations (i.e. the number of babies in the statistic).

Now you can create the weight histogram.
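The relation between the frequency, cumulative frequency and relative frequency columns described in the remarks can be checked with a small Python sketch. It assumes that FREQUENCY counts, for each threshold, the values greater than the previous threshold and less than or equal to this one, and returns one extra count for values above the last threshold. The function name frequency and the weights below are hypothetical and are not taken from the babies worksheet.

```python
from bisect import bisect_left
from itertools import accumulate


def frequency(data, bins):
    """Count values per bin: (previous threshold, threshold], plus an overflow bin."""
    thresholds = sorted(bins)
    counts = [0] * (len(thresholds) + 1)
    for value in data:
        # Index of the first threshold that is >= value (len(thresholds) if none).
        counts[bisect_left(thresholds, value)] += 1
    return counts


# Hypothetical weights in grams and histogram thresholds.
weights = [1900, 2000, 2100, 2350, 2400, 2600, 2650]
bins = [2000, 2200, 2400, 2600]

counts = frequency(weights, bins)

# Cumulative frequency: number of values up to and including each threshold.
cumulative = list(accumulate(counts))

# Frequencies recovered by subtracting adjacent cumulative values (remark 1).
recovered = [cumulative[0]] + [
    cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
]

# Relative (normalised) histogram: each count divided by the number of observations.
relative = [c / len(weights) for c in counts]
```

The relative frequencies sum to 1, which is a quick sanity check that no observation was dropped or counted twice.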

Repeat the above steps and prepare the height histogram.

Finally, create a plot illustrating the dependence of height on weight. Adjust the axis ranges and try applying a linear trend line.
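What a linear trend line and its $R^2$ value represent can be sketched in Python with an ordinary least-squares fit of $y = ax + b$. This is an illustration under stated assumptions, not a description of how Excel computes the trend line internally; the function names and the weight and height values are hypothetical. The sample data lie exactly on a straight line, so $R^2$ comes out as 1 up to rounding, the limiting case in which the data fit the trend equation exactly.

```python
def linear_trend(xs, ys):
    """Return (a, b) of the least-squares line y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Return 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical weights (grams) and heights (cm) lying exactly on one line.
weights = [2000, 2500, 3000, 3500, 4000]
heights = [46, 48, 50, 52, 54]

a, b = linear_trend(weights, heights)
r2 = r_squared(weights, heights, a, b)
```

With real measurements the points scatter around the line, the residual sum of squares grows, and $R^2$ drops below 1.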