ENGG1811 Computing for Engineers
Data Analysis using Spreadsheets I

- Data Analysis
  - Pivot Tables
  - Simple Statistics
  - Histogram
  - Correlation
- Fitting Equations to Data
- Presenting Charts
- Solving Single-Variable Equations
  - Goal Seek examples: black holes and ballistics

Data Analysis
- Data analysis techniques allow professionals such as engineers, social scientists and economists to extract meaningful information from a typically vast amount of data.
- Spreadsheets are widely available, and provide useful features for data analysis. Some features are integrated with charts.

ENGG1811 UNSW, CRICOS Provider No: 00098G Data Analysis using Spreadsheets I slide 1

This week
- Pivot tables
- Simple Statistics
- Histogram
- Correlation
- Curve fitting and regression analysis
- Goal Seek (solves single-variable equations)

Next week
- Solver (for more general optimisation problems)
- Matrix calculations (very briefly, other tools are more appropriate)
- Data tables (ideal for financial modelling)
- Value lookup operations

Summarising Data: Pivot Tables
A Pivot Table is an interactive table that can be used to summarise large amounts of data quickly. You can move data sets (named by their column header) between rows, columns and the table body, and use multivalued filters to see different summaries of the source data.
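As a rough illustration of what a pivot table computes, the sketch below groups records by a row field and a column field and applies a function (sum or count) at each intersection. The order records, the field names and the helpers `pivot` and `row_totals` are all made up for this example; they are not the lecture data or a spreadsheet API.

```python
from collections import defaultdict

# Hypothetical source data: (region, product, order amount).
orders = [
    ("North", "Widget", 120.0),
    ("North", "Gadget", 80.0),
    ("South", "Widget", 200.0),
    ("North", "Widget", 30.0),
    ("South", "Gadget", 50.0),
]


def pivot(records, func=sum):
    """Apply func to the values found at each (row field, column field) pair."""
    cells = defaultdict(list)
    for row_key, col_key, value in records:
        cells[(row_key, col_key)].append(value)
    return {key: func(values) for key, values in cells.items()}


def row_totals(table):
    """Add the body cells across each row, like a pivot table's row totals."""
    totals = defaultdict(float)
    for (row_key, _col_key), value in table.items():
        totals[row_key] += value
    return dict(totals)


sums = pivot(orders)                # data field summarised with sum
counts = pivot(orders, func=len)    # same layout, summarised with count
totals = row_totals(sums)

for (region, product), amount in sorted(sums.items()):
    n = counts[(region, product)]
    print(f"{region:6s} {product:7s} sum={amount:7.1f} count={n}")
print(totals)
```

Swapping `func` corresponds to changing the data field's function in the pivot table layout, while the row and column keys play the part of the row and column fields.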
- To create a pivot table, select the data, including header rows, and choose Data → Pivot Table → Create
- Dialogue box appears as shown overleaf
- To make changes, right-click on the table and select Edit Layout
- Page fields: overall single-valued filter
- Row and column fields: filterable data
- Data fields: function (sum, count, max etc.) applied to values at intersecting row/col

Pivot Table
[Figure: example pivot table with callouts]
- Filters apply to whole table (single-value drop-downs or standard filter)
- Row and column filters are fully optioned (like Excel's filters)
- Multiple rows in body if multiple fields selected
- Usually sum, but also count (= number of orders here)
- Row and column totals

Simple Statistics
- Spreadsheets provide many predefined statistical functions to calculate useful information such as: mean, max, min, median, standard deviation, etc.
- These can be applied to columns or rows of data
- Excel provides a tool called Descriptive Statistics that calculates such commonly used statistical functions for a given data set and produces a useful report.
- OpenOffice Calc doesn't have this, but there are several user-contributed statistics packages that do
- More advanced statistics functions are available (χ², t-test, various distributions, confidence intervals etc.), but serious analysis usually requires specialised software such as SPSS, SAS or R, and the knowledge of how to use it.

Spreadsheets Part 1
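The summary functions listed under Simple Statistics can be sketched outside a spreadsheet as follows. This is a minimal example using Python's standard `statistics` module; the `readings` values and the `describe` helper are invented for illustration and do not reproduce Excel's Descriptive Statistics report.

```python
import statistics

# Hypothetical column of measurements.
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def describe(values):
    """Collect a few commonly used summary statistics in one dictionary."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        # Sample standard deviation (divides by n - 1).
        "sample_stdev": statistics.stdev(values),
        # Population standard deviation (divides by n).
        "population_stdev": statistics.pstdev(values),
    }


summary = describe(readings)
for name, value in summary.items():
    print(f"{name:17s} {value:.3f}")
```

The two standard deviations differ only in the divisor, which matters for small data sets, so it is worth checking which one a given spreadsheet function computes.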
Histogram
- A histogram is a graphical representation of a frequency distribution of a single variable, using a column chart.
- The columns represent contiguous ranges in the variable, called bins or classes, usually equally spaced, and the column height shows how many variable values lie in that range.
- Frequencies are stored in a table: the first column holds the bin boundary values, the second is filled in by the software
- Calc and Excel consider bin thresholds to be upper bounds, so there's an extra unlabelled bin after the last
- Some built-in histogram generators such as Matlab's display the midrange on the chart rather than the boundaries, but you can always add a column next to the bin for your own labels.

Histogram: Frequency table
- Find a spot on the sheet for the frequency table; 10 to 20 bins is about right
- Estimate the variable range by plotting or simple stats
- Fill in the bin values, equally spaced and rounded unless the categories are going to be labelled
- First value should be less than min and last value greater than max, to make sure the full range is covered

Histogram: completion
- Click in the first frequency cell, select the Function Wizard (important because this is an array operation).
- Find FREQUENCY in the function list and double-click
- classes: the bin threshold range, without the header
- end freqs are 0 as expected
- Chart the result (see lecture, extended in lab03), adding titles and labels in case the bin values are misleading

Correlation
- Correlation is a statistical measure of the strength of the linear relationship between two or more dependent variables that have the same independent variable (time, position etc.)
- It can vary from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation)
- Negative correlation means that larger values in one set are associated with smaller values in the other; positive goes the other way.
- Uncorrelated data simply shows no significant relationship

Correlation using Spreadsheets
- Calc provides the CORREL function. It accepts two ranges and returns the correlation coefficient.
- Alternatively, you can plot a chart for two or more variables and try to visually identify possible correlations between variables.
- For several variables, you can produce a table by using references carefully to allow fill operations across and down
- Excel provides a Correlation tool that calculates correlations between two or more variables. It constructs the table for you.

Correlation Table
- Formula: =CORREL($D$8:$D$32; D$8:D$32)
- first range uses absolute addresses
- second range is mixed to allow fill right
- For each row after the first, copy the formula for the new cell on the diagonal from the cell above it
- change the column letter on the first range only
- beware of Calc's formula prompts!
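The correlation table can be mimicked with a short script. The sketch below computes the Pearson correlation coefficient by hand and fills a table for every pair of variables, in the spirit of filling the CORREL formula across and down. The sensor names and values are hypothetical, and `correl` and `correlation_table` are illustrative helpers rather than spreadsheet functions.

```python
import math


def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return sxy / math.sqrt(sxx * syy)


def correlation_table(series):
    """Correlation of every named series against every other one."""
    names = list(series)
    return {a: {b: correl(series[a], series[b]) for b in names} for a in names}


# Hypothetical readings taken at the same five times.
sensors = {
    "Sensor1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Sensor2": [3.0, 5.0, 7.0, 9.0, 11.0],  # exactly 2 * Sensor1 + 1
    "Sensor3": [9.0, 7.5, 6.0, 4.0, 1.0],   # falls as Sensor1 rises
}

table = correlation_table(sensors)
for row_name, row in table.items():
    cells = "  ".join(f"{value:+.3f}" for value in row.values())
    print(f"{row_name}: {cells}")
```

The diagonal is always 1 and the table is symmetric, so in a spreadsheet only one triangle of the table carries new information.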
Correlation Charting
- You can visually inspect the relationship between variables on a chart, also an exercise in secondary axes
- Create a line chart for Precision, move it aside, double-click to edit, Format → Data ranges → Data Series tab
- Add a new series (Name is Sensor2 (G7), y-values are in G8:G32)
- Oops, the scales are badly mismatched!

Secondary Axis
- When two variables with quite different Y value ranges are to be plotted together, one goes on a secondary axis on the right
- Edit chart, select one of the data series, say Sensor2, and either press Format Selection on the toolbar or right-click and pick Format Data Series. Axis selection is under Options.
- Don't forget to tidy the line format, add titles, adjust fonts etc. to achieve a professional look

Correlation is Not Necessarily Causation
- If two data sets are correlated, it doesn't mean that the processes behind one caused the other; they could be influenced by some (often complex) third process such as
  - climate change
  - socio-economic factors
  - geographic influences
  - solar activity
- There have been celebrated cases of correlations for which no credible explanation is likely, like this one: http://pubs.acs.org/doi/pdf/10.1021/ci700332k
- A classic is overleaf, revealed at the lecture

Fitting Equations to Data
- Given an X-Y plot of values derived from a physical system with likely uncertainty in measurements, many lines or curves can be drawn to approximate the data
- The method of least squares chooses the parameters for a regression line or curve of a given type specified by the user
- The best-fit curve is the one that minimises the sum of the squares of the residuals (differences between the data and the predicted values).
- Trend Lines apply regression to data on a line chart. They provide the regression equation and R² value, and show the regression line superimposed on the data values.
- The R² value (varies from 0 to 1) indicates how well the model fits the data. R² = 1 indicates that the regression line (curve) is a perfect fit to the data.

Regression and Spreadsheets
- Calc can apply linear, exponential, logarithmic and power regression trend lines to a chart, but not polynomials
- Excel can also do polynomials
- Both can display the equation and R² value on the chart
- Both do so with poor choice of typeface and significance
- But you can roll your own with drawing elements
- Both can extrapolate the trend (use the trend line as a prediction of behaviour outside the range); in Calc you just change the X axis limits.
- If you just want the numbers, you can use the LINEST, LOGEST and GROWTH functions; TREND calculates points on the linear trend line
- Apart from simple analyses, regression is often done with Matlab or with statistical packages

Fitting Equations to Data
- Use scatter plot with small unfilled symbol markers, no lines
- Options to show regression on the chart
- When chart is in edit mode, Insert → Trend Lines activates the dialogue box
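To make the least-squares idea and the R² value concrete, here is a minimal sketch of a straight-line fit computed from first principles. It covers only the linear case; the data points and the `linear_fit` and `predict` helpers are hypothetical and are not the spreadsheet's own trend line routine.

```python
def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept, plus the R-squared value."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are the differences between the data and the predicted values.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def predict(slope, intercept, x):
    """Evaluate the fitted line, including outside the data range (extrapolation)."""
    return slope * x + intercept


# Hypothetical noisy measurements of a roughly linear process.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r_squared = linear_fit(xs, ys)
print(f"y = {slope:.3f} x + {intercept:.3f}, R^2 = {r_squared:.4f}")
print(f"extrapolated value at x = 8: {predict(slope, intercept, 8.0):.2f}")
```

An exponential or power model can often be handled the same way by fitting a line to the logarithm of the data, although the residuals being minimised are then those of the transformed values.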
Content vs Presentation
The result shows that the trend is appropriate, but the chart is not yet ready to be included in a professionally presented document. Defaults are rarely good enough.
- Y-axis number format is inappropriate
- There may be a need to point out features
- Auto-range implies there's a zero day
- Equation format is very poor (font, exp, precision, spacing are all wrong)
- Don't need grids as the chart is there to show a trend, not data
- Axis number font size is too large; the numbers aren't as important as the shape of the plot

Fixing the Presentation
Remember to double-click the chart to enter edit mode

Feature to improve: How to make changes
- Axis attributes: Right click on axis, select Format Axis
- Remove grid: Format → Grid → All Grids from main menu
- Edit equation format: (not supported in Calc, try the next)
- Add a text box (anywhere, not just on a chart): Pick text from the drawing toolbar; drag (off chart), type initial text
- Format text in box (double click to edit): General font changes from main toolbar; to apply sub/superscript, select characters and then Format → Character
- Lines, arrows, circles: Pick from drawing toolbar
- Format drawing elements: Right click, select Line/Area/Text; resize by dragging handles

Final Product
- There's still room for personal taste: you might not like to use fill and outline on the regression box, or prefer different relative font sizes.
- Always ask yourself: does the chart (etc.) convey the message embedded in the data effectively?
Other Presentation Issues
- Any labels you add overlay the chart; they're not part of it
- Select in turn with Shift-click, or drag a marquee around them
- Format → Group allows selected objects to be part of a single item
- The chart can be copied intact to another document such as OpenOffice Writer or Microsoft Word
- Calc to Writer is reliable, Excel to Word isn't always
- Other copy options if the other application can't import
- Click on group, File → Export as PDF, use Range Selection, and export graphic images using lossless compression (PNG)
- If you really have to take a screen dump, expand the view first and always save as PNG, never JPG.
[Figure: the same chart exported as PDF; PNG, 150ppi; JPEG, 150ppi; PNG, 75ppi]

Trends in Global CO₂ Concentrations
- For the final example, data for atmospheric concentrations of CO₂ taken at the Mauna Loa Observatory (Hawaii) each month since 1958 is readily available on the web
- Plot just the last 10 years of the data using thin lines: the result shows the detail known as the Keeling Curve after Mauna Loa CO₂ researcher Charles David Keeling
- Annual pattern is clearly evident, can you explain these?
  - the overall shape, peak in May each year, trough in September
  - the kink around January
- Apply trend line, looks pretty clear and ominous doesn't it?
  - on this model (that is, business as usual), when will we never see a value below 400ppm* again?
- Apply a trend line to the whole data set, is it still linear?

Solving Equations in One Variable
Finding the roots of an equation can sometimes, but not always, be done analytically; when it can't, we need other approaches.

Method 1: Graphical estimate
- Assign a range of values to a variable (say x), calculate the corresponding function values (say f(x)), and plot a graph of f(x) vs x.
- Now try to see where the chart line intersects with the x-axis.
- This approach is useful provided we can guess finite intervals within which to search for possible root values

* For interesting observations about this milestone, see climate.nasa.gov/400ppmquotes/
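The graphical estimate of Method 1 amounts to tabulating f(x) over a guessed interval and looking for places where the values change sign. A minimal sketch of that idea follows; the cubic used here is a hypothetical stand-in chosen for illustration (it is not the function from the lecture), and `tabulate` and `sign_change_intervals` are made-up helper names.

```python
def f(x):
    # Hypothetical cubic used only for illustration.
    return x ** 3 - 4.0 * x + 1.0


def tabulate(func, start, stop, step):
    """Return (x, func(x)) pairs, like two filled-down spreadsheet columns."""
    n = int(round((stop - start) / step))
    return [(start + i * step, func(start + i * step)) for i in range(n + 1)]


def sign_change_intervals(table):
    """Intervals between neighbouring rows where the function crosses zero."""
    brackets = []
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if y0 == 0.0:
            brackets.append((x0, x0))      # a tabulated point is itself a root
        elif y0 * y1 < 0.0:
            brackets.append((x0, x1))      # the curve crosses the x-axis in between
    if table and table[-1][1] == 0.0:
        brackets.append((table[-1][0], table[-1][0]))
    return brackets


table = tabulate(f, -3.0, 3.0, 0.5)
brackets = sign_change_intervals(table)
for low, high in brackets:
    print(f"root between x = {low} and x = {high}")
```

Each reported interval is only as narrow as the tabulation step, so the result is an estimate that a finer table or an iterative tool can then refine. A root where the curve touches the axis without crossing it would not be detected this way.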
Solving Single Equations
f(x) = 4x³ ? 12x² ? 64x ? 16 (the signs between the terms are not legible in this copy)

Solving Single Equations
Method 2: Goal Seek (Tools → Goal Seek) works with three inputs
- A formula cell
- A target value that you would like the cell to calculate, and
- A variable cell that is used, even indirectly, by the formula cell
Goal Seek tries to find a value for the variable cell that results in the target value in the formula cell.
- It uses iterative refinement, beginning with the current value. If the initial guess is not close enough, Goal Seek may not be able to find a solution
- Goal Seek tries a fixed number of iterations (attempts at getting closer to the goal) and stops after that, even if the equation is not solved.
- The Solver tool (next week) is more powerful than Goal Seek.
- The tools are very similar in Calc and Excel

Example 1: Cygnus X-1 Mass Calculation
- The mass of a black hole in a binary system can be calculated from a formula that is cubic in the unknown mass (see CygnusX1 sheet)
- This cubic has one positive real root

Example 2: Projectile
- The model used for the Trajectory example last week can be inverted to select one parameter so the trajectory passes through a designated point (the target)
- Assume V0 is fixed, angle can vary
- x and y are now the target location
- Solve for t in terms of y, quadratic

Example 2: Seeking the Target
- One of the solutions for t (typically the larger) is used to recalculate y (which must match because that's the equation we rearranged)
- The same t calculates x, but we may miss the target horizontally
- Use Goal Seek to change the angle (which changes t, and hence x and y) and set x to the target x value

Summary of Learning Outcomes
With practice you should be able to do the
following:
- Create and manipulate a pivot table from multivariate data
- Construct histograms to analyse large data sets
- Identify correlation between a small number of variables, using the CORREL function and correctly interpreting its results
- Use scatter charts and trend lines to identify the parameters of processes underlying noisy data, and to extrapolate trends
- Present charts and other visual content professionally
- Apply graphical methods to solving equations in one variable
- Set up a single-variable equation model and use Goal Seek to identify the value of an input parameter that converges the result to a particular target
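For readers who want to see the Goal Seek idea outside a spreadsheet, the sketch below follows the projectile example: solve the quadratic for t at the target height, compute x at that time, then adjust the launch angle until x matches the target. Several things here are assumptions rather than lecture content: the drag-free projectile equations, g = 9.81 m/s², the sample speed and target, and the use of a secant iteration as a stand-in for whatever refinement method Calc or Excel actually uses. The names `x_at_target_height` and `goal_seek` are hypothetical.

```python
import math

G = 9.81  # assumed gravitational acceleration, m/s^2


def x_at_target_height(theta, v0, y_target):
    """Horizontal position when a drag-free projectile is at height y_target.

    Solves y_target = v0*sin(theta)*t - G*t^2/2 for t (a quadratic) and uses
    the larger root, i.e. the descending part of the trajectory.
    """
    vy = v0 * math.sin(theta)
    disc = vy * vy - 2.0 * G * y_target
    if disc < 0.0:
        raise ValueError("trajectory never reaches the target height")
    t = (vy + math.sqrt(disc)) / G
    return v0 * math.cos(theta) * t


def goal_seek(formula, target, guess, max_iter=50, tol=1e-9):
    """Secant iteration (an assumed stand-in for Goal Seek).

    Returns (value, converged). Like Goal Seek, it starts from the current
    guess and gives up after a fixed number of iterations.
    """
    x0 = guess
    x1 = guess * (1.0 + 1e-4) if guess != 0.0 else 1e-4
    f0 = formula(x0) - target
    f1 = formula(x1) - target
    for _ in range(max_iter):
        if abs(f1) < tol:
            return x1, True
        if f1 == f0:
            break  # flat step: cannot refine any further
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        x0, f0 = x1, f1
        x1 = x2
        f1 = formula(x1) - target
    return x1, abs(f1) < tol


# Hypothetical scenario: launch speed 30 m/s, target at x = 60 m, y = 5 m.
v0, x_target, y_target = 30.0, 60.0, 5.0
theta, converged = goal_seek(
    lambda angle: x_at_target_height(angle, v0, y_target),
    target=x_target,
    guess=math.radians(60.0),
)
print(f"converged: {converged}, angle = {math.degrees(theta):.2f} degrees")
```

As with Goal Seek itself, the starting guess matters: a guess near the angle of maximum range gives an almost flat slope and the iteration can jump far from any solution.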