ANALYSING CLIMATIC DATA USING GENSTAT FOR WINDOWS. Roger Stern and James Gallagher


ANALYSING CLIMATIC DATA USING GENSTAT FOR WINDOWS

Roger Stern and James Gallagher
Statistical Services Centre, The University of Reading

The UK Met Office

ISBN:
June 2004


CONTENTS

1. Introduction
   The four parts of this guide; Use of this guide; Acknowledgements

Part I  Introduction to GenStat
   GenStat basics: Starting GenStat 7th Edition; Data input; Some basic data manipulation; Factor Columns; Understanding how GenStat works
   Simple statistical inference: The use of boxplots; Comparisons of means; References
   Simple regression: Setting up the data; Correlation and regression; A GenStat tutorial
   Review of Chapters 2-4
   Challenge 1 - Easy use of GenStat

Part II  Summary and presentation of climatic data
   Before starting the analysis: Examining the data; Repeated measures; Time series
   Challenge 2 - Taking control
   Summary of climatic data: Introduction; Setting up the data; Producing the summary values; Climate Indices; Other summaries
   Challenge 3 - Climatic indices
   Regression: Introduction; Linear regression; Multiple regression; Polynomial Regression; Including factors in a regression study; Nonlinear Regression; Combining regression and simple analysis of variance: detecting lack of fit
   Challenge 4 - Climate change

Part III  Statistical Methods
   Distributions in climatology: Introduction; Probability ideas; Fitting distributions; Generalised linear models; Moving on from chi-square tests: log-linear modelling; References
   Basic multivariate methods: Introduction; Understanding the Concepts; Principal Components Analysis (Empirical Orthogonal Function Analysis); Cluster Analysis; Concluding Remark; References
   Further methods: Introduction; Extremes; More on extremes; Directional Data - wind roses using an example in GenStat; More on directional data; Further methods; References

Part IV  Commands and Strategy
   Moving from menus to commands: Introduction; Finding errors in commands; Using Input Windows; The syntax of GenStat's commands; Examples of GenStat programs
   Challenge 5 - Changing a GenStat procedure
   Developing a strategy: Introduction; Data; ODBC; Software; Users

INDEX

1. Introduction

1.1 The four parts of this guide

This guide is intended primarily for scientists who wish to use GenStat for the analysis of climatic data. Our primary aim is to teach GenStat, rather than statistics, hence minimal information is given regarding the data and the interpretation of the results. The version of GenStat described here is the Seventh Edition for PCs under Windows 95, 98, NT or XP. The minimum recommended configuration is a Pentium PC with 32 MB RAM.

Part I of this guide is an introductory tutorial. It covers GenStat's facilities for simple data entry and analysis and introduces descriptive statistics and simple inference (t-tests and simple regression).

Part II describes GenStat's facilities for the summary and presentation of climatic data. We use examples of monthly and daily data.

Part III introduces GenStat's facilities for regression analysis. This is followed by chapters that describe some further facilities for processing climatic data, including the fitting of distributions and multivariate analysis.

GenStat is a Windows package and we assume that users have some experience of working in a Windows environment. However, GenStat can also be used by giving commands and, for users who wish to proceed further, we have a chapter entitled Moving from menus to commands in Part IV of the guide. The menus are based on an underlying command language, which is available for non-standard analyses. This language is common to all versions of GenStat, including those on workstations and mainframes.

The final chapter in Part IV is on developing a strategy. This covers three topics. The first is on the data and includes the use of ODBC for data transfer. The second is on software, considering how GenStat might fit into a software strategy, and lists some alternative packages. The third is on staff, and includes ideas for training.

GenStat [1] is developed by the GenStat Committee of the Statistics Department, IACR-Rothamsted, Harpenden, Hertfordshire AL5 2JQ, UK.

1.2 Use of this guide

One purpose in writing this guide is to provide supporting material for those who are on a training course. This guide, particularly the first part, may also be used for self-study, either within a supervised environment or by users who have experience of other statistical packages. This guide is not intended for self-study by beginners to statistical computing.

We find typically that Part I (Chapters 2-6) takes between one and three hours for those familiar with other statistical software. Thus the key elements of Part I could be covered in a half-day session of a training course. This would introduce the software and could include a discussion on initial impressions of GenStat at the end of the session.

In general the use of modern statistics packages has helped training courses considerably. It is now possible for even short courses to concentrate primarily on ideas of climatic analysis. Previously a much greater proportion was often devoted to mastering the software.

Most real datasets are much larger than those used in the Introduction. Hence, in Part II, we are particularly concerned with ways of organizing and presenting the types of data that are often used in the analysis of climatic data. Many studies involve looking at relationships. In statistical packages this is mainly handled by the regression facilities and hence we introduce those in Part III. We then introduce other facilities in GenStat for processing climatic data. We choose methods that are important in their own right, but also aim for users to gain sufficient confidence that they can look for additional facilities when they are needed.

[1] GenStat is distributed by VSN International Ltd, Wilkinson House, Jordan Hill Road, Oxford OX2 8DR, UK (Tel: +44 (0) Fax: +44 (0) info@vsn-intl.com). GenStat is a registered trademark of the Lawes Agricultural Trust.

Resource people may wish to read through this part of the guide when preparing a training course.

All datasets used in the examples and exercises can be found on the CD. If you read this manual in a printed or pdf version, the files can be downloaded from The University of Reading website.

1.3 Acknowledgements

The structure of this guide and part of the materials have been adapted from the guide called Using GenStat for Windows, 5th Edition, in Agriculture and Experimental Biology. This was prepared by staff from the SSC, Reading, and ICRAF, Nairobi. It was, in turn, based on original notes prepared by Gillian Arnold, from the University of Bristol. We are very grateful to the many people who contributed to earlier versions of this guide.

We also appreciate the permission from the Zimbabwe Meteorological Services to use their detailed data in this guide, and we acknowledge the other Met Services that have allowed us to use their data.

We also wish to acknowledge the efforts of the GenStat development team, who were prepared to add further facilities to this version of GenStat, mainly for those involved in climatic analysis. These included particularly the topics described in Chapter 15, on extremes and directional data. We are also most grateful to the UK Met Office, who have funded the preparation costs of this guide.

Roger Stern, Statistical Services Centre, School of Applied Statistics, The University of Reading, Reading RG6 6FN
James Gallagher, Statistical Services Centre, School of Applied Statistics, The University of Reading, Reading RG6 6FN

Part I Introduction to GenStat

8 2 GenStat Basics 2. GenStat basics This part of the guide is written in the form of our introductory tutorial. The aim is for the user to become familiar with the basic operations of GenStat for Windows. If you are following the chapter while using GenStat then we indicate where you should type something with a. The remaining text describes what is being done. In this guide, we sometimes assume a user already has experience of Excel. We show how data entered into Excel can be analysed with GenStat and also how data from GenStat can be saved as an Excel file. Users who are not familiar with Excel should omit these sections. Experience with Excel is not necessary for using GenStat. 2.1 Starting GenStat 7th Edition You start GenStat within Windows on a PC by clicking on the GenStat icon on the desktop or toolbar or by selecting GenStat executable, from the Programs Menu. If no GenStat icon is available on the desktop, you can create one yourself 2. After starting GenStat, you see a standard Windows interface with a title bar, menu bar, tool bar, status bar and several windows, Fig. 2.1a. The Output window will contain the output from the operations we perform. The input log keeps a record of what has been done in an analysis. Many of the menus are standard for Windows applications. Only Run, Data, Spread, Graphics and Stats are GenStat-specific. Fig. 2.1a GenStat Windows Menu bar Tool bar Output window Input log Fig. 2.1b shows an example of the interface after a spreadsheet has been opened. Status bar 2 By default, GenStat is installed in the folder C:\Program Files\Gen7Ed. Use Windows Explorer and go to the subfolder Gen7Ed\bin. Right click with the mouse on Genwind7.exe and create a shortcut. This shortcut can now be dragged onto the desktop. You might rename the icon on the desktop (right click and click rename ) as GenStat 7 th Edition, to avoid confusion with previous versions. 4

9 2 GenStat Basics Fig. 2.1b Windows Tile Vertical 2.2 Data input Data input using the Spread Menu We show two ways to enter data into GenStat. The first is within GenStat. Choose Spread New Create. Fig. 2.2a Spread New Create Fig. 2.2b Specify the size Choosing Create brings up a box allowing you to specify how many data columns you want, and how many rows of data there will be. Edit the box to make a GenStat spreadsheet with 11 rows and 2 columns as shown in Fig. 2.2b. Different types of spreadsheet can be made, but the default (i.e. what GenStat will select in the absence of further information) - Vector - is usually the type you will need. Click [OK], and an empty spreadsheet will appear. You can start to enter data by clicking in a cell in the spreadsheet. Type the number, and then press the [Enter] key. Enter the following numbers into the first column: Press the [Enter] key after the last number. The cursor will then move to the top of the next column. Enter these numbers into the second column:

10 2 GenStat Basics Make sure that you press the [Enter] key after typing the final number. The resulting sheet is shown in Fig. 2.2c. If you have made any mistakes, these can be easily corrected, using the arrow keys to move to the cell to amend and entering the correct value. For each row, the value in the first column is the annual rainfall total, and the value in the second column is the number rain days. It is helpful to give the columns more meaningful names than the default C1, C2, etc. To give a name, position the cursor as shown Fig. 2.2c. It becomes a pencil, rather than a hand, and clicking on the mouse gives a popup screen where you can type the name for the column, as shown in Fig. 2.2d. Then press [OK]. Fig. 2.2c Naming C1 Fig. 2.2d Giving the name Once you have given column C1 the name total, repeat with C2 with the name raindays. These names now appear on the columns of the spreadsheet Organising the Windows It is useful to decide how you wish to use the different windows in GenStat. Use Window Tile Vertically to give the layout with the three Windows namely the Output, the Input Log and the Spreadsheet. This is roughly as shown earlier in Fig. 2.1b. These windows indicate one difference between most statistics packages, like GenStat, and spreadsheets, like Excel. With a spreadsheet you have effectively one type of window within which you can have your data and results. In GenStat you have one window for your data and this is called the spreadsheet. It does not include any results. You have a separate window, called output, for the results. You also have here a third window called the Input Log, see Fig. 2.1b. This keeps a record of what you have done. Now minimize the Input Log and then use Window Tile Vertically again (or press <Shift><F4>) to give the layout roughly as in Fig. 2.2e. 6

11 2 GenStat Basics Fig. 2.2e Windows Tile vertically (with Input Log minimised) Now try maximizing the output window, and then reducing it to its half size again. Then use Window Tile Horizontally. Which layout of the windows do you prefer? Saving the file Use File Save As and save the file as cmtut1.gsh, as shown in Fig. 2.2f. Fig. 2.2f File Save As Use Run Restart Session, so you are ready to try the second way of entering data. It will warn you, as shown in Fig. 2.2h, but persevere by clicking on Yes. 7
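Before trying the second way of entering data, readers who are curious about the command language that the menus drive (it is covered in Part IV) may like to see roughly how two named columns could be created by commands. The sketch below is hedged and only illustrative: the data values are placeholders, not the rainfall figures used in this tutorial.

  " Hedged sketch only - the values below are illustrative placeholders "
  VARIATE [VALUES=1096,1142,987,1205,1334,1056,998,1111,1270,1023,1189] total
  VARIATE [VALUES=85,92,78,95,101,83,80,88,97,81,90] raindays
  PRINT total,raindays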

12 2 GenStat Basics Fig.2.2g Run Restart Fig. 2.2h Accepting the restart If you are not experienced in computing, or if you are not familiar with Excel, then go to Section Data input from Excel worksheets This section assumes you are familiar with Excel. If not, or if you are using different spreadsheet software, then omit this section and go to Section 2.3. Most of your data is probably entered already, in a database or in a spreadsheet like Excel. Importing data into GenStat is easy. Minimize GenStat and go into Excel. We assume you are now in Excel. Create a new Excel workbook and enter the same data as earlier, see Fig. 2.2i. In the cells above the data, you can enter the names for the columns : total and raindays. Fig. 2.2i Data entry in Excel Save your Excel workbook and give it the name cmtut1.xls. You have now finished with Excel, so minimize Excel and go back to GenStat. In GenStat, choose File Open and select the Input file. Indicate that the file to import is of the Other Spreadsheet Files type as shown in Fig. 2.2j. 8

13 2 GenStat Basics Fig. 2.2j Look for Excel file Fig. 2.2k Use first sheet In the next window, you can select which worksheet of the workbook you want to import. In this case just click Finish to import the data into a GenStat spreadsheet. In this example the data were easy to import, because the Excel sheet only included what was to be imported. To import any set of data equally easily, into GenStat, from Excel, you can define a named range in Excel. Go back into Excel and add a line or two of description as shown in Fig. 2.2l. Then, in Excel, highlight the range containing the data and the header row and choose Insert Name Define, see Fig. 2.2m. Fig. 2.2l Excel file with description Fig. 2.2m Defining a name in Excel Give the range a name, for instance Data, as shown in Fig. 2.2n. Then save the Excel file and minimize Excel. Fig. 2.2n Specifying the name of the range as Data 9

Go back to GenStat and restart the session by selecting Run Restart Session and then clicking [Restart], to clear all windows, dialogue boxes and the spreadsheet. When you now reopen the file cmtut1.xls, you are able to select the range Data as shown in Fig. 2.2o. The R:Data in Fig. 2.2o signifies that you are using a named range.

Fig. 2.2o Importing the named range into GenStat

An alternative way of transferring the data is to copy a range of cells from Excel and paste it into GenStat. This is not considered good practice in data management, as will be seen in Chapter 18, but is a fast and easy way of data transfer for a quick provisional analysis. To show this way, choose Run Restart Session to clear all data out of GenStat. Go back into Excel. Highlight the range containing the data and column headers and choose Edit Copy, or right click with the mouse in this range and click Copy. Now the data are loaded into the Windows clipboard. Go back to GenStat and choose Spread New From Clipboard, see Fig. 2.2p, and the data are entered into a GenStat spreadsheet.

Fig.2.2p GenStat option to import from the clipboard

Advanced data input

It is also possible to import data from other file formats or to create links with other files. More information can be found in Chapter 18.3 (page 210), where we show how to establish an ODBC link.

Leaving GenStat

To end a GenStat session, choose File Exit. You will be asked if you want to save any of the open windows or spreadsheets. Select [Yes] to save the spreadsheet, but [No] for the other windows, and [Exit] GenStat. As well as showing you how to enter data into GenStat, you have seen how easy it is to transfer data from another package, such as Excel. So, if you are already familiar with a spreadsheet or another statistical package, using GenStat does not have to stop you from

15 2 GenStat Basics using other software. You can use GenStat in addition. We will show examples from Excel spreadsheets at various points in this guide. 2.3 Some basic data manipulation Summary statistics Restart the session and reopen the file cmtut1.gsh. The data in the spreadsheet are passed into the GenStat server as soon as you click anywhere outside the spreadsheet or the spread menu. Try doing this by clicking in the output window. Some summary information about the two columns total and raindays will appear in the output window showing minimum, mean and maximum values, number of values and number of those that are missing. What are the values for these two variates? For further statistical summaries use the Stats menu, as shown below. Choose Stats Summary Statistics Summarize Contents of Variates. Select the variates required in the resulting dialogue shown in Fig. 2.3b, and then click [OK]. Fig. 2.3a Choosing the dialogue Fig 2.3b Selecting the columns Select the Output Window. If you cannot see this window, try clicking the or buttons in the tool bar successively until it appears. Some of the results are shown in Fig. 2.2c. There are other statistics available with the same dialogue box. 11
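The same kind of summary can also be obtained with the calculation functions that work inside CALCULATE. The sketch below is hedged: MEAN and MAXIMUM are taken to be among the summary functions listed in the GenStat help under List of functions for expressions, and the names should be checked there.

  " Hedged sketch - check the function names against the GenStat help "
  CALCULATE meantotal = MEAN(total)
  CALCULATE maxtotal = MAXIMUM(total)
  PRINT meantotal,maxtotal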

Fig. 2.3c The results

Find the Summarize Contents of Variates dialogue again. Click on the [Clear] button to clear all currently selected statistics. Reselect the variables and choose Arithmetic Mean, Standard Deviation and Standard Error of Mean, and click [OK]. Use Graphics Point Plot and complete the dialogue box as shown in Fig. 2.3e.

Fig.2.3d Choosing the scatter plot Fig.2.3e Specifying the y and x dialogue

The relationship between the total rainfall and the number of rain days is as follows:

17 2 GenStat Basics Fig. 2.3f The graph is in its own window Close the graph window with File Exit (choosing [No] to the question about saving the graph), and then close the point plot and summary dialogue boxes by choosing [Cancel]. 13

18 2 GenStat Basics Many dialogue boxes in GenStat do not close when you click [OK]. They only close if you click on [Cancel]. This is so you can easily repeat an operation, or get more output from the current analysis without having to go back through the menus. It is quite easy to get a large number of windows and dialogue boxes open at once, so it can be quite hard to find the one for which you are looking. Clicking the or buttons in the tool bar can help find the one you want. Alternatively, to find a particular dialogue or menu box, just repeat the menu commands that opened it (e.g. Graphics Point Plot) as this will bring back the box complete with anything that had been entered. It is a good idea to close a box by clicking [Cancel] as soon as it is no longer needed Calculating and formatting columns It is easy to calculate new variates from those already entered. In this example, it would be interesting to find the mean rain per rainday in each year. This is simplest to do within the spreadsheet. First, the spreadsheet needs to be selected. Do this, either by clicking somewhere in it (if you can see it), or use the toolbar arrow buttons or the Window menu, as shown in Fig. 2.3g. Fig.2.3g Selecting the spreadsheet To calculate a new column, choose Spread Calculate Column as shown in Fig. 2.3h. 14

19 2 GenStat Basics Fig. 2.3h Choosing the calculate dialogue Fig. 2.3i Giving the formula Fig. 2.3j The results Complete the box as shown in Fig. 2.3i. The calculation can either be typed into the top box, or you can use the mouse to click on the operator buttons and double click on the variates as required. Type the name of the new column into the bottom box labelled Save Result In, Fig. 2.3i. Then click [OK]. Click [Cancel] after this to remove the dialogue box. There is now a new variate, called meanperday, added to the spreadsheet, as shown in Fig. 2.3i, which holds the 11 values of the mean rain per rain day. The name is part shaded (in yellow on a colour screen) to indicate that the column meanperday is a calculated column. To illustrate the difference between an ordinary and a calculated column, try to change a value in the meanperday column. GenStat gives a warning, see Fig. 2.3k. 15
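As the Input Log will confirm, this spreadsheet calculation is passed to the GenStat server as a command. A hedged sketch of roughly what is involved is given below; the statements GenStat actually generates may differ in detail.

  CALCULATE meanperday = total/raindays   " mean rain per rain day for each year "
  PRINT meanperday; DECIMALS=1            " display with one decimal place "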

20 2 GenStat Basics Fig. 2.3k Showing that the column was calculated Thus GenStat's spreadsheet is a little like an ordinary spreadsheet in that it records the calculation, rather than just doing the transformation. If you change a value in the original column, the derived values do not, however, change automatically. You could then use Spread Calculate Recalculate, to update the derived values. Fig.2.3j Commands keep a record of your work You may have noticed that commands have been appearing in the Input Log as you work. This is a record of what you have done, written in the GenStat command language. You can re-run any of these commands with the Run menu, or copy them into a new window to make a program. There are already some examples on page 18. More information can be found in Chapter 16. To understand for instance what has happened within GenStat, when you did the last calculations, we have shown the output window in the figure above. There you see, in line 31, that the spreadsheet generated a command font command, that was executed by the GenStat Server. Line 32 shows that the results were then passed back to the spreadsheet. (Ignore the large number there, , which may well be different when you run the commands. It is an internal reference number so GenStat knows which spreadsheet contains the new column.) Calculations will normally be done in a spreadsheet as above. Once you become experienced in using GenStat, you could alternatively do calculations only in the GenStat server, using the Data Calculations menu, rather than the Spread Calculate Column route that you used above. The result is the same to the GenStat Server, but you would not automatically see the calculated column in a spreadsheet. In the spreadsheet, each value of meanperday is displayed with the same number of significant digits which may lead to a variable number of decimal places. You can change this with Spread Column Attributes/Format, Fig. 2.3n. Make sure that you select fmeanperday in the column box. A faster way is to right-click in the meanperday column and to choose the Column Attributes option. The same Column Attributes Dialogue Window will appear. Type 1 in the Decimals box Fig. 2.3n, and check that Fixed is now the numeric format. You may also wish to enter a concise explanation of the contents of the column in the Description 16

21 2 GenStat Basics box. Now, whenever meanperday is printed in the output, it will be displayed with 1 decimal place by default. Click [OK] to effect the change. Fig. 2.3m Choosing the Format dialogue Fig. 2.3n Give the column one decimal place Assume that these data values came from 11 years in order. It would be useful to have this information entered too. Click in the first column (total) of the spreadsheet. Choose Spread Insert Column before Current Column. This gives a dialogue box called Create a new column as shown in Fig. 2.3o. Fig. 2.3o Spread Insert Column Fig. 2.3p Right-click and choose Fill Type year in the name box and click on [OK]. A new column will appear in the spreadsheet filled with missing values (denoted by *), Fig. 2.3p. You could now type in the numbers 1 to 11, or the real years, if they are known, but there is a quicker way to fill in regular sequences. Right click in the Spreadsheet and choose Fill from the popup menu as shown in Fig. 2.3p or choose Spread Calculate Fill. In the Fill dialogue, shown in Fig. 2.3q, make sure that year is in the top box. Clicking [OK] will fill year with the numbers 1 to 11. Fill can also be used to make patterned sequences. Details of the use of this, or any other dialogue, can be found by clicking the [Help] button in the dialogue box. An example is given in Fig. 2.3r. 17

22 2 GenStat Basics Fig. 2.3q Complete dialogue Fig.2.3r Try the Help button Try plotting the mean per day against the year as a line graph. Use Graphics Line Plot SingleXY type with meanperday as the Y and year as the X. Now investigate the graph. What is the year with the lowest mean? Is there any obvious pattern? Remember to close the graph with File Exit at the top of the graph. 2.4 Factor Columns Introducing factors So far, all the information entered into GenStat has been numerical. It is possible to include textual information as well. One structure that accepts this kind of information is a factor. This is a special column used to indicate groups in the data. Four years in this data set were El Nino years, the second, third, eighth and tenth. The remainder were ordinary years. So we will make the factor with two groups or levels, and here, one is labelled E and the other O. Click in the first column of the spreadsheet (year) and choose Spread Insert Column after Current Column. Type the name type into the Name box, and click to select Factor under Column Type in Fig. 2.4a. The dialogue will change. Fig. 2.4a Spread Insert Column Specify that the factor has 2 levels and then click on the [Labels] button. The dialogue shown in Fig. 2.4b appears. Type 'E' and press the [Enter] key. The next level (2) will become selected. Type 'O', press [Enter] and then click [OK] to make the changes take effect. Click [OK] in the Create a new column dialogue, as shown in Fig. 2.4a to make the new column, which contains empty cells. 18
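In command terms a grouping column of this kind is declared with the FACTOR directive. The following is a hedged sketch only; the VALUES correspond to the E/O pattern typed in below, coded against the default levels 1 (E) and 2 (O).

  " Hedged sketch of declaring the El Nino/Other grouping as a factor "
  FACTOR [NVALUES=11; LABELS=!t('E','O'); VALUES=2,1,1,2,2,2,2,1,2,1,2] type
  PRINT type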

23 2 GenStat Basics Fig.2.4b Add labels to the factor Fig. 2.4c Entry of data Now type the following values into the new column, as shown in Fig. 2.4c. O E E O O O O E O E O If you make a mistake by typing lower case 'e' instead of an upper case 'E', GenStat will turn it into an upper case 'E'; if you type the wrong letter, GenStat will give you a message and ask you to retype your entry. Double clicking gives a pop-up menu, as shown above, which lists the allowable levels. The factor column can be used to label a graph. Choose Graphics Point Plot. Fill in the boxes as in Fig. 2.4d, and click [Finish] to produce the graph shown in Fig. 2.4e. If you first click [Next], you can add titles to the graph and the axes. Fig. 2.4d Graphics Point Plot 19

24 2 GenStat Basics Fig. 2.4e Resulting graph By using the Edit Edit Graph once you have the graph, or right clicking in the graph, you can choose to edit the Axes key or other commands. They can be used to modify the layout of the graph until it is ready for reporting or publishing. Graphs can be saved in different formats by choosing File Save as see Fig. 2.4f. You leave the GenStat Graphics Window by choosing File Exit from the menu bar. Fig. 2.4f File Save As, choosing and emf type Back in the spreadsheet, the column called type can be modified to display longer labels. Select the type column in the spreadsheet. Right click and choose Column Attributes. Click the [Labels] button, and edit the labels (to be El Nino and Other), making sure that you press [Enter] after typing each new label. Click two [OK] buttons when you have finished, and the labels in the variety column should now be modified. Alternatively, the full labels could have been entered when the factor was first created. You would still have been able to enter the values into the column by typing E or O only, the first letter of the labels. Earlier, you used Stats Summary Statistics Summarise Contents of Variates to give some summaries of the data. Now, with the data in two groups, it is useful to give the summaries for each group individually. The dialogue used earlier can be used for this, but a more general alternative is: Stats Summary Statistics Summaries of Groups (Tabulation), Fig. 2.4g. 20
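This menu is built on the TABULATE directive. A hedged command sketch is shown below; the option settings are assumptions that should be checked against the TABULATE help.

  " Hedged sketch - group means of total for each level of the factor type "
  TABULATE [CLASSIFICATION=type; PRINT=means] total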

25 2 GenStat Basics Fig. 2.4g Stats Summary Tabulation Fig. 2.4h Results Complete the dialogue as shown and press [OK]. The results are shown in Fig. 2.3h in the Output Window Saving data Before continuing, save the spreadsheet. Choose File Save as. By default, a Window appears suggesting you save the data as a GenStat spreadsheet (*.gsh). A wide range of other file formats is also available. Fig. 2.4i File Save As, then change to Excel In Section 2.2 we showed how data could be imported from an Excel worksheet or could be entered directly in GenStat using the Spread menu. We had imported the file cmtut1.xls from Excel and have modified it. If you change the format in the figure above, and specify an Excel file then, when you try to save, you get a warning message. Fig. 2.4j GenStat warning that the original file will be replaced 21

26 2 GenStat Basics Deleting data In this section we will delete the column, called meanperday, that has been generated. We also show the difference between deleting a whole column and deleting its contents. First select the column meanperday. Then click in the name field (or press <ALT><Ctrl>C, or use Spread Select Current Column). Clicking again will deselect the column. Practice selecting and deselecting columns. Finishing with the meanperday column selected. Once selected, you might think that the <Delete> key should delete the column. Press the <Delete> key. A dialogue box asks Do you want to Delete the selected cells?. If you click [OK], it deletes (as expected), but just the data. The column remains! Use Edit Undo to get the column back. What you need to do is to delete the whole column. The column should still be selected. Use Spread Delete Current Column. You may wish to practice this. If so, use Edit Undo and try again. You can also select one, or more, rows and delete them in the same way Available variables You can check which variables are currently available to the GenStat server using Data Display or pressing the F5 key, see Fig. 2.4k and Fig. 2.4l. Fig.2.4k Display Dialogue Fig. 2.4l Information about the current variables This lists the names of the structures and their types. All structures used so far are variates (meanperday, raindays, total and year) and factors (type), but later on you will use other types of columns too. This is also a useful dialogue box from which you can delete columns when they are no longer needed. Click [Close] to close the Display dialogue box. 22
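The same housekeeping can also be done from the command language; a hedged sketch is:

  " Hedged sketch - remove the calculated column from the GenStat server "
  DELETE [REDEFINE=yes] meanperday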

27 2 GenStat Basics 2.5 Understanding how GenStat works A first introduction to the GenStat command language Although in Chapter 2.1, we mentioned that GenStat is basically a standard Windows application, the truth is a bit more complex. Before the Windows version you could use GenStat as long as you knew the "language". You simply typed commands, which you submitted to GenStat. GenStat 7 th Edition is indeed a Windows application, but the menus are based on an underlying command language. You can still use GenStat by typing commands in the Input Window as we show now. At the same time, we show how GenStat may be used as a calculator. Restart GenStat. Use File New and choose the Text Window, Fig 2.5a. This gives you an Input Window. In this window, type Print 3+4 as shown in Fig. 2.5b. Fig.2.5a File New Fig. 2.5b Complete the text window Now select the Run menu as shown in Fig. 2.5c. You can choose either Submit Line (if the cursor is still on the line you typed) or Submit Window. Choose one of these. Fig.2.5c The Run menu Fig. 2.5d Results You have now submitted your "program" of commands to the GenStat server. The results are put in the Output Window, see Fig. 2.5d. You can go to the output window in various ways, e.g. by using the Windows menu. There you see that GenStat normally "echoes" the command and shows you that 3+4=7. 23

28 2 GenStat Basics An alternative to typing the command is to use the Data menu Data Calculations, see Fig. 2.5e. This gives the dialogue shown in Fig. 2.5f. Fig. 2.5e Choose the dialogue Fig. 2.5f Another way of calculating Then type as the function, click on Print in Output and then on [OK]. If you look in the Output window, you see that still equals 7! Fig. 2.5g is still 7! Fig.2.5h The Input log The Input Log Window is also useful. It keeps a record of all the commands you have submitted. Access it by Window Input Log. You see that the use of the Calculation menu has resulted in GenStat preparing the commands PRINT 3+4 for you and has submitted them to the GenStat server. So, that is how GenStat works. You prepare commands, which are submitted to the GenStat server. The Windows version has simply given you a variety of ways to prepare the commands for GenStat. GenStat obeys the commands and puts the results in the Output Window. It keeps a record in the Log Window. If the commands produce graphs, then GenStat puts the graphs in a Graphics Window. If you make a mistake in the command, it prints an error message in the Fault Window (and in the Output Window). 24

The example above (3 + 4 = 7) indicates that GenStat may be used as a simple calculator. This is worth a little practice. It is useful to have a scientific calculator, and it is sometimes useful to transform data. For example, if you want to calculate the difference between 4.35 and 2.37 expressed as a percentage of 4.35, open the calculator with Data Calculations, check that Print in Output is still ticked and type the following calculation in the top box: 100 * (4.35 - 2.37) / 4.35. Click [OK]. This will give the result in the output window: (100*(4.35-2.37))/4.35 is 45.52, i.e. the difference is 45.52% of 4.35.

It is important that the brackets () are included where appropriate to make sure that the calculation you are trying to do has only one meaning. The symbols +, -, *, / are used for the operations of addition, subtraction, multiplication and division respectively and ** is used for powers. There are also various mathematical functions available. One is for calculating the square root of a number. The function is SQRT(), where the number whose square root is required is given in the parentheses, for example SQRT(12.37). Fig. 2.5i gives an overview of how to perform some calculations by using the Input Window. More information can be found in the GenStat Help file under List of functions for expressions. Try more calculations to see how all this works, using both an Input Window and the Data Calculations dialogue box. Some examples are given below.

Fig. 2.5i Some basic calculations using the Input Window

  Symbol   Operation                 Example            Result
  +        addition                  PRINT 3+4          7
  -        subtraction
  *        product
  /        division
  **       exponentiation (powers)

  Function   Operation                                       Example            Result
  SQRT(x)    square root                                     PRINT SQRT(4)      2.00
  EXP(x)     exponential function                            PRINT EXP(1)       2.718
  LOG(x)     natural logarithm of x, for x > 0               PRINT LOG(2.718)   1.000
  LOG10(x)   logarithm to base 10 of x, for x > 0            PRINT LOG10(10)    1.000
  ROUND(x)   rounds the values of x to the nearest integer

  Other examples: PRINT (1/2) gives 0.5; PRINT 100*(4.35-2.37)/4.35 gives 45.52

Server sessions

After the above calculations, the Input and Output Windows look a mess. All the data can be cleared out of the GenStat server with Data Clear All Data or Run Restart Session. Less drastically, you can clean up the output window by clicking the Clear Output button in the toolbar.
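The calculator examples above can equally be typed into an Input Window and submitted with the Run menu. A short sketch, using only the PRINT statement and the functions listed in Fig. 2.5i (the arguments here are simply illustrative), is:

  PRINT SQRT(12.37)    " the square root example from the text above "
  PRINT LOG10(1000)    " gives 3 "
  PRINT ROUND(3.6)     " rounds to the nearest integer, here 4 "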


31 Analysing climatic data using Genstat for Windows 3 Simple statistical inference 3. Simple statistical inference In the analysis so far, we have just considered descriptive statistics. Thus we have summarised the data numerically and drawn graphs. In this example, we introduce ideas of simple statistical inference. We take an example from Mead, Curnow and Hasted, (2003) pages and This compares wheat yields for 6 farmers where there was a new system of giving agroclimatic advice, compared to 10 farmers, who used standard information. The yields, in tons per hectare, were as follows: new: standard: The use of boxplots Because these columns are of different lengths, they are entered into two separate spreadsheets. For the first set, use Spread New Create as shown earlier in Chapter Set it to have 1 column of 6 rows, enter the data as shown in Fig.3.1a and give the column the name new. Save the spreadsheet, giving it the name cmtut2.gsh (see Chapter 2.3.4) if you need instructions on saving). Then use Spread New Create again. Change the number of rows to 10 and enter the second set of data into this other spreadsheet, naming the column as standard, see Fig. 3.1b. Save the spreadsheet, giving it the name cmtut3.gsh. Fig. 3.1a 1st sample Fig. 3.1b 2nd sample Fig. 3.1c Graphics Boxplot One way to present the data is to use a boxplot. Use Graphics Boxplot, complete the dialogue as shown in Fig. 3.1c and click [Finish]. This gives the display shown in Fig. 3.1d. 27

32 Analysing climatic data using Genstat for Windows 3 Simple statistical inference Fig. 3.1d Resulting graph Fig. 3.1e Changed One use of boxplots is to show outliers. Go back to the spreadsheet and insert a value of 2.9 instead of 2.0 for the 8 th value in the Standard group. The general shape of the graph is the same, but the odd value is indicated as deserving close scrutiny. There are two ways of displaying the boxplot. Use Graphics Boxplot and click [Next]. You can now choose between two types: Box and Whisker, Fig. 3.1f and Schematic, Fig. 3.1g. Try both, as shown below. The advantage of a schematic boxplot is that you can easily discover outliers. Fig. 3.1f Box and Whisker plot maximum value third quartile median first quartile minimum value 28

33 Analysing climatic data using Genstat for Windows 3 Simple statistical inference Fig. 3.1g Schematic boxplot outlier upper inner fence third quartile median first quartile lower inner In a Box and Whisker boxplot, the ends of the whiskers mark the minimum and maximum values of the data set, in a schematic boxplot they mark the upper and lower inner fences. The upper inner fence is defined as the upper quartile plus 1.5 times the interquartile range, or the maximum value if that is smaller. The lower fence is defined similarly. Extreme values between 1.5 and 3 times the interquartile range (plus the upper or minus the lower quartile) are by default marked as green crosses. More extreme values (more than 3 times the above mentioned range) are marked as red crosses. If you made this change, then set the edited value back to 2.0, in the spreadsheet before continuing. 3.2 Comparisons of means Simple comparisons of the means of two different samples can be made with Stats Statistical Tests One and two sample t-tests. Complete the dialogues as shown in Fig. 3.2a and 3.2b. Use the Options button in Fig. 3.2a to give the dialogue in Fig. 3.2b. Fig. 3.2a Stats Tests One and Two Sample Fig. 3.2b Options sub-dialogue The output window shows the results, see Fig. 3.2c. 29

34 Analysing climatic data using Genstat for Windows 3 Simple statistical inference Fig. 3.2c Output from the t-test dialogue Some more data manipulation: appending spreadsheets In the 2-sample example that was used for the t-test, the data were put into separate spreadsheets. Data often need reorganising before analysis and here this step is illustrated by joining the data together for the two sets. Fig. 3.2f shows what we are aiming for. Fig. 3.2d 1st set Fig. 3.2e 2nd set Fig. 3.2f Stacked data 30

35 Analysing climatic data using Genstat for Windows 3 Simple statistical inference What we wish to do is to append the data from the two columns and add a further column, that specifies from which set each observation has come. If the spreadsheets are no longer in GenStat then they will have to be opened. They were saved earlier with the names cmtut2.gsh and cmtut3.gsh, see Fig. 3.2d and 3.2e. Click in the shorter spreadsheet cmtut2.gsh, so it is the active window. Use Spread Manipulate Append and complete the dialogue as shown in Fig. 3.2g. This appends cmtut3.gsh to the data in cmtut2.gsh and adds the information for a factor that distinguishes between the two groups. Press [OK]. Fig. 3.2g Spread Manipulate Append (with cmtut2.gsh as the active window) The layout of the data shown in Fig. 3.2f is more common and is used in most of the remainder of this guide. Rename the column new in the long spreadsheet to yield. Use File Save As to save the spreadsheet, giving it the name cmtut4.gsh. 3.3 References Mead, R., Curnow, R. N. and Hasted, A. M. (2003) Statistical Methods in Agriculture and Experimental Biology, 3rd edn. Boca Raton: Chapman & Hall/CRC Press. 31


37 Analysing climatic data using Genstat for Windows 4 Simple regression 4. Simple regression 4.1 Setting up the data We now introduce some key elements of data analysis, by means of simple regression. This example is taken from pages of Mead, Curnow and Hasted (2003). We will return in Chapter 11 to the more general use of GenStat for regression. Use Run Restart Session to start a new job. Accept the option [Yes] to proceed. Fig. 4.1a Restart again Fig. 4.1b Accept the warning Use Spread New Create and make a spreadsheet with 2 columns and 17 rows as shown in Fig. 4.1d. Fig. 4.1c Create a new sheet Fig. 4.1d Sheet with 17 rows and 2 columns Enter the data and name the two columns as shown in Fig. 4.1e : (See page 6 for instructions on naming columns, if necessary.) Save the data giving the file the name cmtut5.gsh. Click outside the spreadsheet to transfer the data to the GenStat server. This gives some summary statistics for each of the two columns. 33

38 Analysing climatic data using Genstat for Windows 4 Simple regression Choose Stats Summary Statistics Summarize Contents of Variates and specify some summary statistics as described in Section Fig. 4.1e Regression data Fig. 4.1f Produce some descriptive statistics Choose Graphics Point Plot and complete the dialogue as shown in Fig. 4.1g to give the scatterplot. Fig. 4.1g Graphics Point plot Fig. 4.1h Results 34

39 Analysing climatic data using Genstat for Windows 4 Simple regression 4.2 Correlation and regression Choose Stats Summary Statistics Correlations and complete the dialogue as shown in Fig. 4.2a to give the correlation between uptake and conc. You should find a value of Fig. 4.2a Stats Summary Correlations Choose Stats Regression Analysis Linear models, Fig. 4.2b, and complete the dialogue as shown in Fig. 4.2c. Fig. 4.2b Regression dialogue Fig. 4.2c Specify the x and y The results are in the output window and are given in full in the regression section of this guide. They show the fitted equation is: uptake = * conc Return to the regression dialogue to give a plot of the fitted line. Click on [Further Output] then [Fitted Model], to give the plot shown in Fig. 4.2e. 35
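The regression menus are driven by the MODEL and FIT directives of the command language (see Part IV). A hedged sketch of the same analysis in command form is:

  MODEL uptake   " declare the response variate "
  FIT conc       " fit the straight line of uptake on conc "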

40 Analysing climatic data using Genstat for Windows 4 Simple regression Fig. 4.2d Further output sub-dialogue Fig. 4.2e Fitted model Repeat the steps to give [Further Output] again and select [Model Checking]. Accept all the defaults by pressing [OK]. In the Graphics Window, 4 plots will be shown as in Fig. 4.2f. Fig. 4.2f Model checking This example should have shown it is easy to do statistics once you have become familiar with the use of dialogues in GenStat. This allows training courses to concentrate on statistical concepts. The computing has become easy. 4.3 A GenStat tutorial GenStat includes its own tutorials as part of the software. Use Help Tutorial and try the one called Linear Regression. If you find it helpful then try some of the other tutorials. 36

41 5 Review of Chapters Review of Chapters 2 4 Here we review some of the tasks you have undertaken in the tutorial. Could you? Task 1 Open a set of data you entered into Excel, for example the file cmtut1.xls? Hint See page 10 2 Enter a new set of data that has 3 columns and 6 rows? See page 5 3 Save the data in a GenStat spreadsheet to the disc? See page 7 4 Import a named range from an Excel worksheet. See page 10 5 Derive a new column containing the square of the values in an existing column? See page 15 6 Find the names and lengths of all the columns of data. See page 22 7 Summarise data in two columns by giving a boxplot? See page 27 8 Explain why a boxplot is often a useful summary of a set of data and also to compare different sets? See plot on page 29, look in a statistics book, or ask someone. 3 9 Give a line plot? See dialogue on page 12 You did not do a line plot but it is another option on the graphics menu. 10 Summarise a column of data? See page 11 and page 20 3 This tutorial is to teach GenStat, rather than statistics. 37

42 5 Review of Chapters 2-4 Task Hint 11 Explain the use of the Input Log Window? See page Explain how GenStat works? See page Explain what is meant by a factor column using the example in the third spreadsheet on page 31 See page Make an existing column into a factor column? See the pop-up menu on page Organise the Windows to show just the spreadsheet and output windows horizontally. See page 6 16 Leave GenStat? If not, then keep practising!) 38

43 6 Challenge 1 6. Challenge 1 Easy use of GenStat Open the file called cmtut1.gsh again. Mark the columns giving the rain days, the total, and the type of year, in that order. Now right-click on the mouse to give the popup menu shown in Fig. 6.1a. Fig. 6.1a Quick analysis When you use the menu to give a point plot you will probably get the dialogue for the point plot, so all you have to do is to press OK. If you would like to get the graph straight away, then use Options Spreadsheet Options and on the General tab untick the option to Open menu with selected data. Now try again and you should get the graph immediately. Try more of the Analysis and Graph options on the Quick analysis menu in Fig. 6.1a. For example what do you get with the same column still marked if you use Analysis Tally? 39

Part II Summary and presentation of climatic data


46 7 Before starting the analysis 7. Before starting the analysis In this chapter we look first at the way the data are set-up for the analysis. In most studies the time-consuming tasks are primarily concerned with data manipulation, rather than analysis. Hence we describe the facilities for manipulation in this chapter. We will use a simple example of monthly data, but the concepts apply equally to daily records, synoptic data, upper air, and so on. 7.1 Examining the data In this chapter we mainly use an example of monthly data. They are the monthly rainfall totals for 33 years, from 1950 to 1983 (omitting 1956 and 1957) at Galle, in Sri Lanka. We assume that the initial calculations to produce these monthly summaries were done elsewhere. In Chapter 9, we show that GenStat can be used to produce these summaries. Use File Open and open the file called genrain.xls. Part of the data are shown in Fig. 7.1a. Fig. 7.1a Monthly rainfall data (with bookmarks) Climatologists emphasise the importance of data scrutiny and the data above are in colour for the maximum and minimum value each month. We see, for example, that there were 47 mm in the driest May, and 590 mm in the wettest. To produce bookmarks, such as those shown above, use Search Bookmark By Value. Then select all the months and use the option, shown in Fig. 7.1c, to mark the minimum and maximum. A variety of other criteria are also possible. 45

47 7 Before starting the analysis Fig. 7.1b Bookmarks dialogues Fig. 7.1c Add bookmarks Sometimes it is useful to reorganize the data into a single long column. This can be done using Spread Manipulate Stack. Complete the dialogue as shown in Fig. 7.1e. You wish to stack 12 columns, i.e. all months. Mark the 12 columns in the Available Data field, i.e. everything except the Year and move them into the Stack Columns field. Do the same with the Year, but move it into the Repeat Column field. Say you want the resulting factor column to be called Month and that you want to use the names from the columns to label the month factor. Finally rename the stacked column to be Rain. Fig. 7.1d Stacking data Fig. 7.1e Stacking 12 columns The resulting spreadsheet, Fig. 7.1f, is the new shape, but is not quite as we would like. We first move the year column to be the first in the new spreadsheet. This is simply done by positioning the cursor so it becomes a hand, as shown in Fig. 7.1f. Then the mouse can be used to slide the column as you wish. An alternative is to use Spread Column Reorder. Following this step, either right-click to give the pop-up menu or use Spread Sort, or <Ctrl> <F9> and sort on the year column. This gives the data, in time-series order, as shown in Fig 7.1g. 46

48 7 Before starting the analysis Fig. 7.1f Moving Fig. 7.1g Sorting We now continue our examination of the data. Use Graphics Boxplot to give the dialogue shown in Fig. 7.1h. At the top of this dialogue we see the question How are the data organized? The first alternative, called List of variates corresponds to the way we had the data originally, while the second corresponds to the single long column we have just produced. Fig. 7.1h Graphics Boxplot Fig. 7.1i Schematic boxplots Complete the dialogue as shown in Fig. 7.1h and click Finish. The results are given in Fig. 7.1i. If you do not get the same type of picture, then return to the dialogue, click Next and then use the option for a Schematic, rather than a Box and Whisker plot. The main features of the resulting graphs are that there is rain throughout the year, but the main seasons, in April/May and October/November are clear. The data are reasonably symmetrical, but there are 9 values that are marked separately as being out of the ordinary. For example the 712 mm in July 1953 is particularly high, given that most of the years have less than a third of this value. A rugplot is an alternative to the boxplot. As is shown in Fig. 7.1j, it also gives the same options for the layout of the data. The resulting plot is in Fig. 7.1k. 47

49 7 Before starting the analysis Fig. 7.1j Graphics Rugplot Fig. 7.1k Resulting graph These graphs are called exploratory, because they are intended to support the scientist in examining the data. The alternative is presentation graphics, intended for readers of reports. 7.2 Repeated measures In the next two sections we compare the two alternative layouts of the data that were mentioned above. The first we call repeated measure and the second time series for reasons that we explain below. The two layouts of the monthly data, used in Section 7.1, are shown in Figs. 7.2a and 7.2b. Fig. 7.2a Repeated measures Fig. 7.2b Time series:layout layout To relate to statistics books and other areas of application, we consider the general issues that can guide the user on which layout might be appropriate for a given analysis. The first issue is the unit that is used. In the first layout, Fig. 7.2a, the unit is a year, and we have made 12 measurements in each year. These are sometimes called repeated-measures, because we are measuring the same unit repeatedly in time. 48

50 7 Before starting the analysis Repeated measures occur in many areas of application of statistics. For example, in a health study, we may weigh the same person on repeated occasions. In forestry, we measure the girth of a tree each time. In a livestock study we measure milk yield repeatedly on each animal. In the time-series layout, shown in Fig. 7.2b, the unit is a month, and we just have a single measurement each month. This is the same layout as we would have if the data were a monthly economic index, for which we wish to do a time-series analysis. In climatology, both layouts are useful, depending on the objectives of the study. We now consider the use of the repeated measure layout in more detail. Having the year as the unit, is also useful if we interpret the value for a given month as one index of the year. Then we have a total of 12 indices and more can be calculated and then analysed. As examples, we calculate the seasonal and annual totals from the data. An example of the calculate dialogue is in Fig. 7.2c and the results are in Fig. 7.2d. We have here defined Winter to be the three month period from January to March, and so on. Fig. 7.2c Spread Calculate Column Fig. 7.2d Quarterly and annual values This layout also makes it easy to give simple time-series graphs, as shown in Fig. 7.2f. We will see, in the regression chapter, Chapter 11, that it is equally easy to fit a trend line to the data. Fig. 7.2e Graphics Line plot Fig. 7.2f Time-series graph Sometimes care must be taken, because of the circular nature of the data. In this example the year goes round in a circle, while for synoptic data the day may be the circle. For example when 49

51 7 Before starting the analysis calculating the seasonal summary, how would you calculate the winter rainfall, if November to February is defined as the Winter season? In terms of the calculations in Fig. 7.2g, it is not quite Winter = Jan + Feb + Nov + Dec This is because you usually need the January and February values that follow December, while this calculation will give you the preceding ones. Fig. 7.2g Using the SHIFT function Fig. 7.2h Sub-dialogue for the functions The calculation is not difficult, once the complication is recognised, because Genstat, like Excel, has a wide variety of functions that can be used in calculations. In the Spread Calculation Column dialogue, in Fig. 7.2g, you either have to know that the Shift function exists, or click on the Functions button, and use the sub-dialogue given in Fig. 7.2h. In Fig. 7.2g we have given the calculation as Winter2 = shift(jan+feb;-1) + Nov + Dec It could equally be given as Winter2 = shift(jan;-1) +shift(feb;-1) + Nov + Dec We have saved the results into a column called Winter2, rather than Winter, because we did not want to overwrite the previous column that was calculated. 7.3 Time series The time-series layout, shown again in Fig. 7.3a, is often the way the data are presented. For example, it is the obvious layout if many different elements have been measured. In the repeated measures layout we were able to indicate the minimum and maximum values each month as an initial check of the data. With the time series layout, we use the Stats Summary of Groups (Tabulation) dialogue to give the same information. Complete the dialogue as shown in Fig. 7.3b and press OK. 50
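A hedged command-language equivalent of this tabulation is sketched below, with the stacked column Rain classified by the Month factor; the PRINT settings are assumptions to be checked against the TABULATE help.

  " Hedged sketch - minimum and maximum rainfall for each month "
  TABULATE [CLASSIFICATION=Month; PRINT=minima,maxima] Rain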

52 7 Before starting the analysis Fig 7.3a Time series layout Fig. 7.3b Stats Summary Tabulation The results are displayed in the output window as shown in Fig. 7.3c. Later in the section we will use the Save button to put similar summary results into another spreadsheet. Fig. 7.3c Results in output window In Section 7.1 we showed how to stack the data in the repeated-measures layout to give the layout shown in Fig. 7.1h. The converse is to use Spread Manipulate Unstack, as is shown in Fig. 7.3d. Fig. 7.3d Spread Manipulate Unstack 51

53 7 Before starting the analysis Before using Unstack we have declared the Years column above to be a factor, and we used this factor as an ID in the dialogue, so the years are transferred to the unstacked data. The results are shown in Fig. 7.3e, where we see that the month label has also been transferred across. Fig. 7.3e Unstacked data Fig. 7.3f Spread Column Rename One problem with the unstacked data is that the column names may not be what you want. What Genstat has done is sensible if other columns, perhaps containing temperature data are also to be unstacked. But if the work is just to analyse the rainfall data, then you may wish to rename the columns. This can be done individually, as described in Chapter 2, or use Spread Column Rename to give the dialogue in Fig. 7.3f. Many of the dialogues in this chapter have referred to groups or factors. Factors are a powerful and flexible feature of Genstat, and an understanding of them is useful to simplify many analyses. Factors were introduced in Chapter 2.4 and are used to signify groups in the data. They are indicated in the spreadsheet with the name in italics and an exclamation mark (!) in front. So, GenStat distinguishes between three sorts of column, as we show in Fig. 7.3g. Fig. 7.3g Types of column Fig. 7.3h Properties of a factor If a column is numeric, then it may be a variate, which signifies that it just contains numbers. In Fig. 7.3g, the first column called Years is, currently a variate. We have right-clicked on the mouse to 52

54 7 Before starting the analysis use the pop-up dialogue, which is one way to convert a column s type. Clicking on factor will make it into a factor or grouping or category type of column. Similarly the month column has text values initially and is a text column as shown in Fig. 7.3h. This can also be converted to a factor in the same way. The result is shown above. One curious feature is how Genstat knew that January is the first month, i.e. the first level of the factor. Usually, when text columns are converted to factors, the order of the levels is alphabetical, so April is the first month and August is the second, and so on. In this case Genstat checks the standard order with a list that is given under Options Options Date Format. In other situations the resulting factor levels may not be in the order you require. Even in this case, we might wish to override this order and start our year in September, say. There are alternative dialogues to change the order of the levels of a factor, and we show the Spread Factor Edit Levels and Labels dialogue in Fig. 7.3i. The active cell was in the month column when this dialogue was called. Fig. 7.3i Spread Factor Edit Fig. 7.3j Making Sept the first month We see here that a factor column has three components, which are Ordinal, Level and Label from Fig. 7.3i and 7.3j. The ordinal dictates the way that the results for a factor are presented, and so we have changed this order above. The level is a number associated with each level of the factor, and here is the number of the month. Similarly the label is a text. Once we accept this change, there will appear to be no difference in the spreadsheet, unless you look for the tool tip that shows the order of the levels. But any further analysis will now use the new order. We illustrate by repeating the Graphics Boxplot from Chapter 7.1. The results, shown in Fig. 7.3k, now start from September. 53

Fig. 7.3k Boxplot from September
Fig. 7.3l Spread Factor Edit, for the Years factor

As a second example we show the same Spread Factor Edit Levels and Labels dialogue for the Years factor in Fig. 7.3l. We see that there are 32 levels and the ordinals are therefore 1 to 32. The levels are the actual years, going from 1950 to 1983, and there are no labels for this factor. So, when results involving factor columns are presented in tables and graphs, the ordinals dictate the order in which the results are presented. What you see is either the level or the label.

Factors behave like variates if you do any calculations. So, for example, the calc command or the Spread Column Calculations dialogue could be used to give

Years = Years - 1900

This would change the levels, so they now go from 50 to 83.

In Chapter 7.2 we showed how the monthly data could be summed into quarterly totals. We now show how to do this in the time-series layout. This is important in its own right, and is also a way to introduce more features of GenStat that are useful in processing climatic data.

The first step is to recode the factor for the months into a new column that specifies the seasons. Before doing this, we use Spread Factor Edit Levels and Labels again to undo the change above, so that the months are displayed from January again. This step is optional, but simplifies the display in Fig. 7.3m. Then, with the active cell in the month column, we use Spread Factor Recode. In the dialogue in Fig. 7.3m we have chosen to call the new factor Season. In this dialogue we can choose whether to recode the ordinals, the levels, or the labels. For clarity we have recoded the labels. In practice this is slightly dangerous, because typing mistakes can invent extra levels in the new factor.

56 7 Before starting the analysis Fig. 7.3m Spread Factor Recode Fig 7.3n Factor labels in wrong order The resulting column is shown in Fig. 7.3n. We see that we have the 4 levels that we require, but not in the right order. They are currently in alphabetical order of the labels, and we would like them in the same order as the months. So, we change the ordinals in Fig. 7.3o. Fig. 7.3o Spread Factor Edit Levels and Labels Now we show two alternative ways that we can summarise the data to produce the seasonal totals. The first uses the Spread Calculate Summary Stats dialogue. Complete the dialogue as shown in Fig. 7.3p. This produces a new spreadsheet, shown in Fig. 7.3q, with the rainfall totals for each season. 55

57 7 Before starting the analysis Fig. 7.3p Spread Calculate Summary Stats Fig. 7.3q Quarterly totals We will see this dialogue again in Chapter 9, because it can equally be used to summarise daily, or other types of data. We have chosen to produce the totals here, but could equally produce means, extremes, or percentage points, see Fig. 7.3p. An alternative is to use the Stats Summary Statistics Summaries of Groups dialogue, as shown in Fig. 7.3s. We choose to summarise the rainfall data by both years and seasons. We click on the Margins, which will give us the annual totals as a bonus. If we click OK at this point, then the results will be produced, but only in the output window. Fig. 7.3r Tabulation Fig. 7.3s Quarterly totals Click on the Save button, in the dialogue above, to give a sub-menu as shown in Fig. 7.3t. Once completed, the results are then saved into a new spreadsheet, as shown in Fig. 7.3u. 56
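For those who move on to commands (Chapter 16), the same table can be produced with the TABULATE directive. The following is only a sketch: the column names (Rain, Years, Season) are assumed from the dialogues above, and the option and parameter names should be checked against the on-line Help.

TABULATE [PRINT=totals; CLASSIFICATION=Years,Season; MARGINS=yes] Rain; TOTALS=seastot
" seastot is a two-way table of seasonal totals; the margins give the annual totals "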

58 7 Before starting the analysis Fig. 7.3t Saving results Fig. 7.3u Resulting spreadsheet We see that the two methods have produced the same results. The first has produced the seasonal totals in the time-series layout, while the second has produced them in a tabular form, that we have called the repeated measures layout. This second layout is actually a two-way table, rather than a set of columns. This is a similar idea to Excel, which can produce summaries in what it calls a pivot-table. Similarly, the table above can be pivoted. Click on a column name in Fig. 7.3u, and drag it across, to give the orientation shown in Fig. 7.3w. Or use Spread Manipulate Reorder Table, as shown in Fig. 7.3v, and change the order of the factors. The one that is last is the factor that gives the columns in the table. Fig. 7.3v Pivoting the table Fig. 7.3w Pivoted table With the original order you may wish to convert the table, so the results can be used as a set of columns. Either right-click and choose Convert, or use Spread Manipulate Convert, to give the dialogue shown in Fig. 7.3x. 57

59 7 Before starting the analysis Fig. 7.3x Converting the type of spreadsheet Fig. 7.3y Resulting dialogue Click on the Sheet Type to make it a vector. The dialogue changes, see Fig. 7.3y. Click OK. Now the columns with the seasonal totals can be used, just as in the previous section. 58

60 8 Challenge 2 8. Challenge 2 Taking control Open the file sulphur.gsh. This is one of the examples used in the GenStat introductory guide. It shows the sulphur level in the atmosphere, together with some information on wind speed and direction. It is shown in Fig. 8.1a, together with a table of frequencies of wind direction, shown in Fig. 8.1b. Fig. 8.1a Sulphur data Fig. 8.1b Table of wind direction If you produce this table, Stats Summary Statistics Summaries of Groups (Tabulation) then the order of the factor levels will give the directions from North. How could you change it so the factor levels are as shown in Fig. 8.1b, i.e. from East? If you have the table as shown in Fig. 8.1b could you quickly sort the order, so it is from the North again. (Hint: Before producing the Table look at the Spread Factor Edit levels and labels dialogue). Use the same ideas to repeat the tabular presentation given at the start of Section 7.3, but for a place in the Southern hemisphere, where you wish to start the year in September. With the data used in Chapter 7.3 could you easily dictate the order of the month factor, if you were in a Francophone country? (Hint: Look at Options Options Date Format). 59


62 9 Summary of climatic data 9. Summary of climatic data 9.1 Introduction In this chapter we consider the analysis of daily data. The analysis usually proceeds in stages. The first stage involves a summary of the daily data, perhaps to produce monthly or annual values. These summary values are then processed. In this Chapter we mention both stages, but concentrate more on the production of the initial summary values. They effectively transform the raw climatic data into data that can be processed in standard ways, some of which are described in the next part of this guide. The methods described here apply also to more detailed data, such as synoptic records, where there are multiple values per day. Often, when processing daily data, much of the time is needed to set up and manipulate the data so that the subsequent analysis proceeds smoothly. We describe the tools needed for the initial steps in the analysis in Chapter 9.2. As an example we use 50 years of rainfall and temperature data from the Bulawayo Goetz observatory, Zimbabwe (Lat South, Long East). 9.2 Setting up the data The data were provided in Excel, in the form shown in Fig. 9.2a Fig. 9.2a Daily data for Bulawayo, Zimbabwe They were imported in this form into GenStat. The layout in Fig. 9.2a gives one month for each row of data. The data were then reorganized in GenStat to be in the time-series form shown in Fig 9.2b. This used the methods described in Chapter 7. An alternative is to reorganize the data in Excel, prior to importing the data. 61

63 9 Summary of climatic data Fig. 9.2b Data in Genstat in rows GenStat includes functions for handling date formats, so we first show how these can be introduced. We use the standard Spread Calculate Column dialogue. There the date-time functions can be listed and used, as shown in Figs. 9.2c and 9.2d. Fig. 9.2c Spread Calc Column Fig. 9.2d Date/time functions 62

64 9 Summary of climatic data This calculation adds a further column to the worksheet, which can then be formatted as a date. The result is shown in Fig. 9.2e. Fig. 9.2e Date column added The data in Fig. 9.2e now has the date expressed in two different ways. This duplication is not needed and is shown here to demonstrate that GenStat can use either form. Sometimes the data may be imported with a single date column. Then, if they are needed, the component columns can always be calculated, giving the year, month and day separately, see Fig. 9.2d. These data are from July 1951 to April Sometimes the analysis is required for just a subset of the data. A powerful feature in GenStat is that of restricting the analysis to a subset of the data. This uses the Spread Restrict/Filter options, shown in Fig. 9.2f. It is like the filter operation in Excel. We chose to filter using an expression and selected the data from 1961 to 1990, as shown in Fig. 9.2g. Fig. 9.2f Options for restrict (filtering) Fig. 9.2g Restriction to

65 9 Summary of climatic data The result is shown in Fig. 9.2h. Once a filter is in operation, only the unfiltered data are included in the subsequent analysis. The filtered data remain available, so the filter can be changed at any stage. Fig. 9.2h Resulting data Fig. 9.2i Adding a further restriction In the example shown here you may wish to produce a new spreadsheet, with just the remaining values. That is simply done using Spread Manipulate Split/Subset. For simple subsets of the data, the Split/Subset dialogue could be used directly. Here it was useful to use Filter/Restrict first, because the required subset was quite complicated to produce. Then one option of the Split/Subset dialogue is just to keep the visible, i.e. the non-filtered rows. If the restriction is temporary, then it is usually better to work with the single data file. For example, the data could be restricted to a given month, using Spread Restrict/Filter To Groups, with the condition, shown in Fig. 9.2i. In this dialogue as shown, the restriction would be added to the one that is already in force, so just the data for December 1961 to 1990 would be processed. Once that data for one month have been analysed, the restriction could be removed, or a different month could be considered. This becomes tedious and error-prone, if multiple files are produced. It is then much better to work with a single file, and change the restriction (filter) as needed. This use of restrictions (filters) becomes even more powerful if you progress to using GenStat by giving commands (or macros) as we describe in Chapter 16. Then you can include the filter within the commands. It is then easy to do the analysis for a given month, and then loop through the 12 months, just by changing the filter. 64
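As a foretaste of Chapter 16, such a command-based filter might look roughly like this. It is only a sketch: the column names (Rain, Year, Month) are assumed, we assume the list 1,2...12 and the FOR loop step through the months in the usual GenStat way, and the analysis itself is indicated only by a comment.

RESTRICT Rain; CONDITION=(Year.GE.1961).AND.(Year.LE.1990)    " keep 1961-1990 only "
FOR m=1,2...12
  RESTRICT Rain; CONDITION=(Year.GE.1961).AND.(Year.LE.1990).AND.(Month.EQ.m)
  " ... summaries or graphs for month m go here ... "
ENDFOR
RESTRICT Rain    " remove the restriction when finished "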

66 9 Summary of climatic data 9.3 Producing the summary values From the daily data the Spread Calculate Summary Stats dialogue may be used to give information on a monthly or other basis. The dialogue is shown in Fig. 9.3a for the data from Bulawayo. Fig. 9.3a Producing monthly summary values We could equally have calculated 5-day, weekly or 10-day summaries. If they are required, then first use the appropriate calculations on the date column to give the period within the month, see Spread Calculate Column and use the Date/Time functions. As an example we show this dialogue for 10-day periods. The function is called MFRACTION, and also allows the starting month of the year to be specified. This is useful when the summary is needed from the beginning of the season, rather than the calendar year. Then use the resulting column as one of the factors in Fig. 9.3a, rather than the Month column that is shown there. 65
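As an aside for command users (Chapter 16), the same kind of summary is available through the TABULATE directive, which becomes convenient when many stations are processed. A sketch only, assuming the daily columns are called Rain, Tmax and Tmin and that Year and Month are factors:

TABULATE [CLASSIFICATION=Year,Month] Rain; TOTALS=raintot    " monthly rainfall totals "
TABULATE [CLASSIFICATION=Year,Month] Tmax,Tmin; MAXIMA=tmaxmax,tminmax; MINIMA=tmaxmin,tminmin
" tmaxmax and tminmin hold the monthly extremes of the temperatures "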

Fig. 9.3b Generating a column to produce summaries by decade

We have chosen to give five summary statistics, namely the monthly extremes of the temperature values and the total rainfall for each month. The file was still restricted, so these summaries are just for the 30-year period from 1961 to 1990. This is an example where it is useful to have the date information coded in both ways. We have used the factor columns that give the year and month to define the summary, and we have added the median date, which is convenient for drawing graphs of the monthly values. The resulting monthly values were saved and tidied slightly to give the data shown in Fig. 9.3c.

Fig. 9.3c Resulting data

One way to present these values more clearly uses the Stats Summary Statistics Summaries of Groups dialogue, shown in Fig. 9.3d, together with the option, shown in Fig. 9.3e, of saving the table. We chose to look at the maximum temperatures.

Fig. 9.3d Tables to present the data
Fig. 9.3e Saving the monthly maxima

The results are shown in Fig. 9.3f, together with bookmarks to indicate the maxima. For example, we see that the overall maximum value was 37.4 degrees and occurred in the last year of the record.

Fig. 9.3f Results including Search Bookmark for maxima

These monthly data can then be analysed, either from the form shown in Fig. 9.3c, or from Fig. 9.3f, where we will first have to redefine the table to be an ordinary column spreadsheet. Some ideas were given in Chapter 7 and other methods will be described in Part III of this guide. In general we will use GenStat's Stats and Graphics menus, where the objectives of the study dictate which analysis is appropriate.

9.4 Climate Indices

The web page of the European Climate Assessment project includes information on the CCL/CLIVAR Working Group on Climate Change Detection. It also has a dictionary that defines a wide range of indices, together with Fortran code to calculate each of them. These indices are based on daily temperature and rainfall data. Examples are as follows:

Tn: Mean of daily minimum temperature.

T90: Percent of time T > 90th percentile of daily mean temperature (warm days).
DTR15: Number of days with diurnal temperature range > 15 degrees C.
R: (Annual) precipitation sum.
R1: Number of wet days (wet defined as >= 1mm).
R10: Number of days with precipitation >= 10mm.
SDII: Simple daily intensity index, defined as the total rainfall on days with 1mm or more divided by the number of such days.
Rx5: Greatest 5-day precipitation.
R75: Percent of time R > 75th percentile of daily precipitation amount (moderate wet days).
R95: Percent of time R > 95th percentile (very wet days).
R95T: Percentage of the annual total precipitation from the very wet days.

We take an example of these indices for the daily data from Zimbabwe. We use dialogues to show the steps in the calculation. If these calculations are to be done for many stations, then it would be better to use commands, as we describe in Chapter 16 of this guide.

We take the calculation of R75 as the example. The 75th percentile is defined using the wet days in the period, so we first restrict the data accordingly, as shown in Fig. 9.4a, with Spread Restrict/Filter By Logical Expression.

Fig. 9.4a Just including rain days
Fig. 9.4b Finding the 75th percentile

Then we use the Spread Calculate Column dialogue, as shown in Fig. 9.4b, to give the 75th percentile as 15.38mm. If needed, a second use of the dialogue in Fig. 9.4b shows the 95th percentile to be 38.12mm. Now we wish to look at the whole record, so we use Spread Restrict/Filter Remove All. Next we set up a column that takes the value 1 if the rain day has more than 15.38mm and 0 otherwise. The Spread Calculate Column dialogue, shown in Fig. 9.4c, provides one way to do this. We then make the resulting column into a factor, and label the two levels, as shown in Fig. 9.4d. Then we return to Spread Restrict/Filter By Logical Expression and again use the condition Rain >= 1. Part of the resulting sheet is shown in Fig. 9.4d.

Fig. 9.4c Deriving an indicator column
Fig. 9.4d New column as factor

Now we wish to examine the percentage of rain days each year that were greater than 15.38mm. We use the Stats Summary Statistics Frequency Tables dialogue, as shown in Fig. 9.4e.

Fig. 9.4e Stats Summary Frequency
Fig. 9.4f Resulting percentages each year

Now the percentage of heavy rainfall, shown in Fig. 9.4f, can be plotted against the year, or examined in any other way, to assess whether there is evidence of a trend. In Fig. 9.4f the last column gives the number of rain days in each year.

Where there is clear structure in the data we recommend that this structure be considered in the analysis. One clear aspect of structure in this problem is the seasonality of the data, so we examine whether the 75% point is roughly the same throughout the year. We therefore remove all the filters and then re-apply the same restrict/filter shown earlier in Fig. 9.4a. Now we use the Stats Summary Statistics Summaries of Groups (Tabulation) dialogue, shown in Fig. 9.4g, instead of the Calculate dialogue in Fig. 9.4b.
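For reference, and looking ahead to Chapter 16, the steps above might be written in command form roughly as follows. Treat this as a sketch: the column names (Rain, Year) are assumed, and the 15.38mm threshold is the 75th percentile found earlier.

RESTRICT Rain                               " remove any earlier restriction "
CALCULATE heavy = Rain.GE.15.38             " 1 if above the 75th percentile, else 0 "
RESTRICT Rain,heavy; CONDITION=Rain.GE.1    " rain days only "
TABULATE [CLASSIFICATION=Year] heavy; MEANS=propheavy
CALCULATE pctheavy = 100*propheavy          " percentage of rain days above 15.38mm each year "
PRINT pctheavy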

71 9 Summary of climatic data Fig. 9.4g Stats Summary Tabulation Fig. 9.4h Percentiles for each month The results in Fig. 9.4h indicate that the Winter months in Bulawayo (May to September), have few rain days, with just 86 in the 30 year, or about 3 per year, compared with close to 10 rain days per month in December and January. When rain falls in the Winter months it seems lighter. So the results overall reflect just the main rainfall season, which is from October to April. 9.5 Other summaries The types of indices described in Section 9.4 have standard definitions, and this is useful so results can be compared across different regions. It is also possible to consider summaries that are deliberately tailored locally to a specific application. Both types of product are useful, for different objectives. Examples of specific definitions are for the start of the season, for dry spells within the season, and for the length of the season. These may each be tailored to the planting of a specific crop and soil type. Currently this type of summary is provided by the simpler package, Instat+. Procedures are being written to permit the same type of event to be summarised in a future version of GenStat. The basic functions have been added in the current version, to facilitate the calculation of spell lengths, and the start of the rains, as we show here. For the first example we assume our objective is to find the date of the start of the rains each year. Our definition is that it is the first occasion after 1 st October that the 3-day rainfall total is more than 20mm. The way we use the menus below should emphasise that this definition is just an example, and could be adapted to a user s requirements. We continue to use GenStat s menu system, but have to use a sequence of menus to achieve our objectives. If this type of task has to be done repeatedly, for example for different periods, or definitions or stations, then it would be more efficient to use GenStat s commands, as we show in Chapter 16. With the data from Zimbabwe, the first step is to calculate the running 3-day totals, and one way is as shown in Fig. 9.5a. 70

72 9 Summary of climatic data Fig. 9.5a Calculating 3-day running totals Then we calculate all the success days, as indicated in Fig. 9.5b. We call the resulting column success. Then we restrict the analysis to just the days where success was TRUE, i.e. equals 1, as shown in Fig. 9.5c. The spreadsheet is now as shown in Fig. 9.5d. Fig. 9.5b Calculating the success days Fig. 9.5c Restricting the data 71
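The steps just described can also be expressed as commands, which is convenient if the definition is to be varied. This is only a sketch: it assumes the RTOTALS function takes the number of preceding values to include (as in the hint of Challenge 3), the column names follow the figures, and the condition Month >= 10 is one way to impose the "after 1 October" part of the definition.

CALCULATE rtot3 = RTOTALS(Rain; 2; 0)                   " running total of the current and 2 preceding days "
CALCULATE success = (rtot3.GT.20).AND.(Month.GE.10)     " candidate start days "
RESTRICT Date,Rain,rtot3; CONDITION=success.EQ.1        " keep only the successful days "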

73 9 Summary of climatic data Fig. 9.5d Resulting spreadsheet From Fig. 9.5d we see from the first row of data that the planting date in 1951 was on the 25 th October, and the 3-day total was 38.9mm. Now we summarise the data in this sheet, calculating the minimum date each year, as shown in Fig. 9.5e. The results are written to a new summary sheet. The dates come through as numbers. For clarity we set the column attributes, to display the column as a date. Fig. 9.5e Finding the first success (Spread Calculate Summary Stats) To calculate summary statistics of the dates of the start in each year, we calculate the day number in the year, as shown in Fig. 9.5f. We call the resulting column startday. We can then use any of the menus for simple statistics, for example Stats Summary Statistics Summarise Contents gives the median start date as 33, or 2 nd November. The mean start date is 31.5, and the standard deviation is 2 weeks. 72

74 9 Summary of climatic data Fig 9.5f Calculating the day in each year Another option is a one-way table, sometimes called a tally table, see Fig. 9.5g, which is also on the Summary Statistics menu. This also can be used to give a cumulative graph, as shown in Fig. 9.5h. This figure confirms that the starting dates are from early October to the end of November. Fig. 9.5g A tally table Fig. 9.5h Cumulative percentages For some applications it may be useful to know the actual rainfall that triggered the start. We can do this by merging the appropriate values from the sheet containing the original data. We make the new sheet the current window, and then use Spread Manipulate Merge to give the dialogue shown in Fig. 9.5i. We use the date columns for the matching, and select just the column with the 3-day totals to transfer. In Fig. 9.5i it is also important to set the option to not transfer any extra rows from the large sheet. Fig. 9.5i Merging data from the original sheet Part of the resulting sheet is shown in Fig. 9.5j, where we see, for example that the rainfall at the start of the season was only just over 20mm in 1953, but more than 50mm in

75 9 Summary of climatic data Fig. 9.5j Resulting values As a second example we look briefly at how to extract dry spell lengths from the daily data. We return to the original sheet. If you have followed the calculations above then you may have to remove the restriction to the success days, before proceeding. Then use the calculation shown in Fig. 9.5k, where we have chosen a threshold of 0.85mm to define a rain day. The choice of threshold is up to the user. The result of the calculation is shown in the last column of Fig. 9.5l. We see, for example, that the 20 th October 1951 was the 3 rd consecutive dry day. Fig. 9.5k Spell lengths Fig. 9.5l Resulting data We can now use the same dialogue as was shown earlier, see Fig. 9.5e, to find the maximum spell lengths each month of each year. They can then be analysed, as we did for the dates of the start of the rains. This is left as an exercise for the reader! An alternative approach is to look at all the spell lengths, rather than just the longest in the month. This is harder to extract from the data shown in Fig. 9.5l, and hence we outline the steps. In the last column of Fig. 9.5l, the only time the difference between two successive values is negative is when a dry spell has finished. We therefore calculate a new column, as shown in Fig. 9.5m, which is TRUE when the difference is negative, and FALSE otherwise. Then we shift it up one place. Now we use the Spread Restrict/Filter facility, as shown earlier, to show only those rows where a dry spell finishes. The result is in Fig. 9.5n, where we see that 12 th October 1951 was the last of 41 consecutive dry days. 74

Fig. 9.5m Ending spells
Fig. 9.5n Resulting data

We look further at these data on dry-spell lengths in Chapter 13. Hence we finish here by saving the data shown in Fig. 9.5n into a new sheet. One way to do this is through the Spread Manipulate Split/Subset dialogue, shown in Fig. 9.5o. Then we delete the columns that are not needed and save the new sheet, see Fig. 9.5p.

Fig. 9.5o Splitting the spreadsheet
Fig. 9.5p Save the resulting data


10. Challenge 3 - Climatic indices

In Section 9.4 we listed various climatic indices that have been proposed to study climate change. We calculated one index in Fig. 9.4f. Repeat these steps to calculate some of the other indices, and produce a time-series plot of each index. The indices we suggest are as follows:

The annual precipitation sum, R, and the numbers of rain days, R1 and R10, are straightforward to calculate for each year.

Once you have annual values of R and R1, calculate the simple daily intensity index (SDII).

The maximum 5-day precipitation (Rx5) is a slightly greater challenge. (Hint: start by calculating the 5-day precipitation total for each day, e.g. calc rx5 = rtotals(rain; 4; 0). Use the calculate dialogue or type the command. Then find the maximum of rx5 each year.)

Try also some temperature indices, e.g. DTR15.
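If you would like something to check against, a command sketch for the simpler indices could look like the following (the column names Rain and Year are assumed; treat it as a starting point rather than a model answer):

CALCULATE wet = Rain.GE.1                         " indicator of a rain day "
CALCULATE wet10 = Rain.GE.10                      " indicator of a day with at least 10mm "
TABULATE [CLASSIFICATION=Year] Rain,wet,wet10; TOTALS=R,R1,R10
CALCULATE SDII = R/R1                             " intensity index, as suggested above "
CALCULATE rx5 = RTOTALS(Rain; 4; 0)               " running 5-day totals, as in the hint "
TABULATE [CLASSIFICATION=Year] rx5; MAXIMA=Rx5    " greatest 5-day total each year "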


11. Regression

11.1 Introduction

It is useful for scientists to have some understanding of regression methods, even if their interest is primarily in other types of analysis.

Fig. 11.1a Menu showing regression options

Regression is a large subject, as is indicated by the number of options in the menu shown in Fig. 11.1a. In this chapter we mainly look at just part of the first option, called Linear Models. In Chapter 11.6 we look briefly at an example of a non-linear model, using the Standard Curves option in Fig. 11.1a.

GenStat's regression facilities are very powerful. One reason is that the terms that are fitted can be either variates or factors. This supports the examination of regression models where the data have (group) structure, within the standard regression framework. We show a simple example of this feature in Chapter 11.5, but it is the fact that this capability is general in GenStat that gives the methods their power.

An important advance is the capability to fit regression models when data are from non-normal distributions. This is important in climatology, because climatic data often do not follow a normal distribution. The material in this chapter also serves as an introduction to regression modelling in general, and we examine the more general models in Chapter 13. In the menu in Fig. 11.1a this is the second option, Generalized Linear Models. The ordinary and generalized linear models dialogues are shown in Figs. 11.1b and 11.1c.

Fig. 11.1b Ordinary regression dialogue
Fig. 11.1c Generalized regression dialogue

These dialogues are very similar. Hence, once users are familiar with ordinary regression modelling, it becomes easy to understand the generalized modelling described in Chapter 13.

81 11 - Regression 11.2 Linear regression In simple linear regression there are two variates, one containing the values for the dependent variable (y) and the other the values for the independent variable (x). The equation of the fitted line is y = a + bx, where a is the intercept and b the slope. In GenStat the estimate of the intercept (a) is labelled as the constant, with the slope (b) labelled by the name of the independent variate, x. In regression, plotting the data is an important part of the analysis, so the Graphics Point Plot menu should be used with the two data variates before attempting any regression analysis. This will enable you to see if it is appropriate to fit a straight line to your data. It is also useful to look at the correlation between the two variates: this can be obtained by choosing Stats Summary Statistics Correlations, entering the names of the x and y columns into the data box before clicking [OK]. If the relationship looks reasonably linear, then simple linear regression can easily be carried out using Stats Regression Analysis Linear Model, Fig. 11.2a. By default, this dialogue is set up for simple linear regression, as indicated in the top box labelled Regression. We will see later that the same dialogue is also used for multiple regression models and for the comparison of regressions. Fig. 11.2a Dialogue for typical regression Fig. 11.2b Regression options In Fig. 11.2a the name of the dependent variable (y in this case) should be entered into the Response Variate box, and the name of the independent variate (x in this case) into the Explanatory Variate box. Clicking [OK] will carry out the regression, and print the default output. The output can be altered by clicking the [Options] button in the regression dialogue box, as shown in Fig. 11.2b. Term Names y variable dependent variable or response variate x variable independent variable or explanatory variate a Constant or intercept After a regression has been executed, a visual impression of the fitted line can be obtained by selecting the [Further Output] and then [Fitted Model] from the Regression dialogue box. Also available under [Further Output] is [Model Checking] which allows a set of graphs to be produced to check the assumptions behind the regression (as seen before, in Chapter 4). Example: from Mead, Curnow and Hasted, pages Uptake by leaves of CO 2 (y) is to be regressed on the concentration of CO 2 in air being passed over the leaves (x). This example was considered earlier, in the tutorial section (see chapter 4). The data are entered as shown in Chapter 4 where these were saved as cmtut5.gsh file. Alternatively cmreg1.xls can be opened. 80

82 11 - Regression Choosing Graphics Point Plot and selecting conc for the X Coordinates and uptake for the Y Coordinates gives the graph shown in Fig. 11.2c. Fig. 11.2c Graph of data The correlation is obtained with Stats Summary Statistics Correlations. This gives a value of 0.984, as shown in Fig. 11.2e. Fig. 11.2d Correlation dialogue Fig. 11.2e Results in a spreadsheet Choosing Stats Regression Analysis Linear Models, and filling in the boxes as in Fig. 11.2f will allow a regression to be carried out. Fig. 11.2f Simple linear regression dialogue 81

83 11 - Regression Fig. 11.2g Results ***** Regression Analysis ***** *** Summary of analysis *** Response variate: uptake Fitted terms: Constant, conc d.f. s.s. m.s. v.r. F pr. Regression <.001 Residual Total Percentage variance accounted for 96.5 Standard error of observations is estimated to be * MESSAGE: The following units have large standardized residuals: Unit Response Residual * MESSAGE: The following units have high leverage: Unit Response Leverage *** Estimates of parameters *** estimate s.e. t(15) t pr. Constant <.001 conc <.001 The default results give the ANOVA table, a message about data points that deserve scrutiny and the equation. Here, from Fig. 11.2g the fitted equation is shown to be uptake = * conc A plot of the fitted line can be obtained by clicking the [Further Output] then [Fitted Model] buttons from the regression box. Fig. 11.2h Plot with fitted line Plots of the distribution of the residuals, a normal and half-normal plot and a plot of the residuals against the fitted values may be produced by choosing [Further Output], then [Model Checking] and accepting the default options. The results are in Fig. 11.2i. 82
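For comparison with Part IV, the menu analysis above corresponds to just a few commands. This is a sketch; RGRAPH and RCHECK are procedures that give output similar to the Further Output buttons.

MODEL uptake    " declare the response variate "
FIT conc        " simple linear regression of uptake on conc "
RGRAPH          " data with the fitted line "
RCHECK          " residual plots for model checking "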

84 11 - Regression Fig. 11.2i. Regression diagnostics 11.3 Multiple regression Here we first describe GenStat's "system" for multiple regression and then give an example. Use Stats Regression Analysis Linear, and then select Multiple or General Linear Regression from the pull-down list in the Regression box. The resulting dialogue box is almost identical to that for Simple Linear Regression except that a list of variates (separated by spaces, commas or +) can now be typed into the Explanatory Variates box. If you wish to change the explanatory variates that are included interactively, and to compare models involving different sets of explanatory variates, it is better to choose General Linear Regression, from the pull-down list as extra options become available under Change Model after fitting the first model. Fig. 11.3a Dialogue for multiple regression The name of the dependent variate should be entered into the Response Variate: box as for simple linear regression. The Maximal Model: box can be ignored in most cases, but can contain a formula defining the most complex model that is going to be fitted (this should be done if any of your variates contains missing values). The list of explanatory variates to be fitted in the first regression model should be entered into the Model to be Fitted box separated by + or by commas. Click [Options] and select 83

85 11 - Regression Accumulated (in addition to the things already selected) to print an accumulated analysis of variance table. This allows an assessment of each variable added or dropped. Fig. 11.3b The Change Model sub-dialogue After the first model has been fitted, the model can be changed using one of the choices available by clicking the [Change Model] button. The resulting dialogue is shown in Fig.11.3b. In the dialogue [Add] allows another explanatory variate to be added into the current model. [Drop] allows a variate in the current model to be omitted. The other buttons are not used in this guide. Unless the final regression only involves one independent (x) variable, the fitted line and data cannot be displayed directly, so choosing [Further Output] then [Fitted Model] will give unpredictable results, since the graph produced is adjusted for all explanatory variates other than the one chosen for the x-axis. A residual plot can easily be obtained with the [Further Output] then [Model checking] buttons as described on page 86. Another useful graph to examine is that of the observed data (y) plotted against the fitted values: the straighter and less scattered this graph, the better the regression. An example is given on page 86 Example: from Mead, Curnow and Hasted pages (Example 10.1). The dependent (y) variable is the production of oxygen, to be related to the amounts of chlorophyll and light [independent (x) variables]. The example below gives a small illustration of the stepwise regression method known as backward elimination. The three data variates can be entered using Spread New Blank, with 3 columns and 17 rows, or open the file cmreg2.xls. 84

86 11 - Regression Fig. 11.3c Data cmreg2.xls Fig. 11.3d Correlation matrix Correlation coefficients can be obtained from Stats Summary Statistics Correlations, entering the names of all three variates into the [Data] box. This gives the correlation between light and oxygen as 0.770, the correlation between chlor and light as 0.392, and so on. The initial regression can be carried out using Stats Regression Analysis Linear Models, selecting General Linear Regression from the types of regression. The resulting box should be completed as in Fig. 11.3a. The results are shown in Fig. 11.3e. Fig. 11.3e Results ***** Regression Analysis ***** Response variate: oxygen Fitted terms: Constant + chlor + light *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression <.001 Residual Total Percentage variance accounted for 61.5 Standard error of observations is estimated to be

87 11 - Regression * MESSAGE: The following units have large standardized residuals: Unit Response Residual * MESSAGE: The following units have high leverage: Unit Response Leverage *** Estimates of parameters *** estimate s.e. t(14) t pr. Constant chlor light Clicking the [Save] button, selecting [Fitted Values], typing fitted into the box and clicking [OK] can save the fitted values as shown in Fig. 11.3f. The data (oxygen) can be graphed against the new variate fitted, which contains the fitted values using the graphics menu as shown in Fig. 11.3g. Fig. 11.3f Saving the fitted values Fig. 11.3g Plot of data against fitted values Fig. 11.3h Graph of data against fitted values 86
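In command form the fit above, the saving of the fitted values, and the elimination step that follows might look like this (a sketch; the variate names are those in cmreg2.xls, and the PRINT settings are assumptions to be checked against the Help):

MODEL oxygen
FIT [PRINT=summary,estimates; FPROBABILITY=yes] chlor+light
RKEEP FITTEDVALUES=fitted         " save the fitted values to plot against the data "
DROP [PRINT=accumulated] chlor    " the backward-elimination step described next "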

88 11 - Regression In the output from the regression, the t-probability for chlor is The next stage using the backward elimination method is to refit the model, omitting the term with non-significant t-probability, in this case chlor. First, select Accumulated under [Options]. chlor can now be omitted using the Change Model dialogue box, entering chlor into the Terms box and clicking [Drop]. Fig. 11.3i Setting the output options Fig. 11.3f The Change Model subdialogue again Fig. 11.3k Results showing the change from the previous model ***** Regression Analysis ***** Response variate: oxygen Fitted terms: Constant + light *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression <.001 Residual Total Change Percentage variance accounted for 56.6 Standard error of observations is estimated to be Unit Response Residual * MESSAGE: The following units have high leverage: Unit Response Leverage *** Estimates of parameters *** estimate s.e. t(15) t pr. Constant light <

89 11 - Regression *** Accumulated analysis of variance *** Change d.f. s.s. m.s. v.r. F pr. + chlor light Residual chlor Total The t-probability for fitting light alone is < It can be seen that from Fig. 11.3k fitting chlorophyll alone (+ chlor) had a t-probability of (from the Accumulated analysis of variance), but when light was also included the t-probability is changed to This is as in the original analysis (also shown by the Change line in the Summary of analysis and the chlor line in the Accumulated analysis of variance). In this section we have just looked at multiple regression where the columns to be considered in the models are two variates. However, factors can also be included just as easily in the models, and this is one aspect that makes the regression facilities in GenStat so powerful Polynomial Regression This is an example of multiple regression where the order of fitting the possible independent (x) variables is important. The independent variables are successively increasing powers of the x variable (x, x 2, x 3, etc ) and it is not sensible to fit higher powers of x without including the lower ones. Polynomials can be fitted easily; for example instead of stating that an x variable is time, which fits a straight line, give it as POL(time;3) to fit a cubic curve. However, using the pol function does not help directly in checking whether a cubic is needed perhaps a quadratic is adequate. In order to decide on the appropriate order of polynomial, try first several options POL(time;2), POL(time;3), POL (time;4). The maximum order for POL is 4 since higher order polynomial models can be very unstable. Example: The following data are from Mead, Curnow and Hasted, page 253, with activity the dependent (y) variable and successive powers of time the independent (x) variables. The data can be entered into a spreadsheet as shown in Fig. 11.4a or cmreg3.xls can be opened. Fig. 11.4a Data cmreg3.xls 88

90 11 - Regression Now the initial regressions can be carried out choosing Stats Regression Analysis Linear, and selecting General Linear Regression. Select activity as the response variate, POL(time;1) as the model to be fitted, and click OK. Repeat this for other orders of the polynomial: POL(time;2), POL(time;3) and POL(time;4). Fig. 11.4b Fitting a polynomial regression model Click the [Further Output] and [Fitted Model] buttons each time. Then give the variable time as the explanatory variable, and ignore the grouping factor, to graph the data and fitted curve, see Fig. 11.4c. Fig. 11.4c Fitted models with up to 4 th order polynomials When comparing the results of the analyses in the output window, see Fig. 11.4d a second order polynomial seems to fit the data best (F pr. = 0.001). Fig. 11.4d Results from the 4 fitted models Response variate: activity 89

91 11 - Regression Fitted terms: Constant + time Submodels: POL(time; 1) *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression Residual Total Submodels: POL(time; 2) *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression Residual Total Submodels: POL(time; 3) *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression Residual Total Submodels: POL(time; 4) *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression Residual Total Once the model is fitted you can estimate the mean of the variate and its standard error by clicking on the [Predict] button, see Fig. 11.4e. In this case, the predicted mean of activity would be with a standard error of In Fig. 11.4e you can estimate the activity at any time point, for example double-click on the word mean in Fig. 11.4e and type 20, 25, 30, 35, 40 instead to estimate the activity at these times. Fig. 11.4e Calculating the predicted response An alternative, more flexible, approach would be first to calculate the first few powers of x (using either Data Calculations or Spread Calculate Column). They are added into the regression in order using General Linear Regression as above, first with x as the sole explanatory variate, then choosing [Change Model] and [Add] with each successive power. The advantage is that more complex models, for instance including interactions, can be used. The disadvantage that it is not so easy to calculate the predicted values. 90
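The polynomial fits and predictions can also be obtained from commands, for example as below. This is a sketch; we assume PREDICT is given the explanatory variate and the time points at which predictions are wanted.

MODEL activity
FIT POL(time; 2)                          " the quadratic, which the comparison above favours "
PREDICT time; LEVELS=!(20,25,30,35,40)    " predicted activity at the chosen times "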

92 11 - Regression Instead of fitting polynomials, you could use spline, by including the S() operator in the regression dialogue, rather than POL(). Look in the GenStat Help for more information Including factors in a regression study This is often considered as a comparison of linear regressions. When data for a linear regression have been taken from different samples or treatments, it is usually of interest to test whether the treatments affect the parameters of the regression (the slope and the intercept). There are three possible outcomes: - a single line regardless of treatment, - parallel lines where the treatments affect the intercept but not the slope, - individual lines with both slopes and intercepts differing. This series of regressions can be fitted directly in GenStat using a factor to define the treatment groupings, say fact, along with the dependent variable y and the independent variable x. Example: Mead, Curnow and Hasted, pages (Example 11.1) The data are numbers of leaves on cauliflower plants, to be related to accumulated day degrees; seven pairs of values are available from each of two years. The analysis will test to see if the (linear) relationship between number of leaves and day degrees varies between years. Enter the data as shown in Fig. 11.5a. or open the file called cmreg4.xls. Use the Linear Regression dialogue, choosing Simple Linear Regression with Groups from the list, see Fig. 11.5b. Fig. 11.5a Data cmreg4.xls Fig. 11.5b Regression groups Fig. 11.5c Check Accumulated option 91

93 11 - Regression Fig. 11.5c Check Accumulated option Click on [Options] in the Regression dialogue, select Accumulated in addition to the things already selected, and click [OK]. To understand the output below, note that the regression dialogue has generated three regressions, as is shown by examining the commands that the dialogue produced, in the input log window. These commands are as follows: Fig. 11.5d The 3 alternative models that have been fitted "Simple Linear Regression with Groups" MODEL leaves FIT [ ] daydeg ADD [ ] year ADD [ ] year.daydeg The first model fitted uses daydeg as the x variable, this is the simple linear regression. Then year is added, so the model is year + daydeg, corresponding to parallel lines. Finally the year.daydeg interaction is added, fitting a separate lines model. The output below is shown in these three sections, where some of the important elements have been put into bold. Fig. 11.5e Results for each of the three models ***** Regression Analysis ***** Response variate: leaves Fitted terms: Constant + daydeg *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression <.001 Residual Total Percentage variance accounted for 90.1 Fitted terms: Constant + daydeg + year *** Accumulated analysis of variance *** Change d.f. s.s. m.s. v.r. F pr. + daydeg <.001 Residual Total *** Accumulated analysis of variance *** *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression <.001 Residual Total Change <.001 Change d.f. s.s. m.s. v.r. F pr. + daydeg < year <.001 Residual Total Percentage variance accounted for 98.7 Fitted terms: Constant + daydeg + year + daydeg.year *** Summary of analysis *** *** Accumulated analysis of variance *** 92

94 11 - Regression d.f. s.s. m.s. v.r. F pr. Regression <.001 Residual Total Change Change d.f. s.s. m.s. v.r. F pr. + daydeg < year < daydeg.year Residual Total Percentage variance accounted for 98.7 From the simple model, we see that the term daydeg is important and that the residual mean square is The second section shows that adding the year term, i.e. parallel lines, is useful. The additional term is statistically significant and the residual mean square has dropped to The final section shows that the separate lines model is not an improvement, indeed the residual mean square has increased slightly to Hence we choose the parallel lines model. This is the model with the terms year + daydeg. To give the details and a graph of the fitted model we return to the regression dialogue and choose the parallel line model, see Fig. 11.5f. The final fitted model can be examined graphically by choosing [Further Output] and [Fitted Model], with daydeg as the Explanatory Variable and year as the Grouping Factor see Fig. 11.5h. Fig.11.5f Parallel lines model Fig. 11.5g Graph the fitted model Fig. 11.5h The fitted model Finally we look at the alternative ways GenStat can give the regression equations, which are to be reported. The default output is shown in Fig.11.5i. This is the option in Fig 11.5f Parallel lines, estimating differences from ref level 93

95 11 - Regression Fig. 11.5i Default output *** Estimates of parameters *** Fig. 11.5j Estimating each line *** Estimates of parameters *** estimate s.e. t(11) t pr. Constant year Year <.001 daydeg <.001 estimate s.e. t(11) t pr. year Year year Year <.001 daydeg <.001 In Fig 11.5f we asked if the option Parallel lines, estimate lines and this gives the output in Fig. 11.5j. From the output in Fig. 11.5j the equations are given directly as follows: Year 1: leaves = *daydeg Year 2: leaves = *daydeg If the model for separate lines had been required, then the corresponding outputs are in Figs. 11.5k and 11.5l. The default output corresponding to fitting the model as year + daydeg + year.daydeg as shown in Fig. 11.5k, while the output, without a constant and from fitting the model as year + year.daydeg is in Fig. 11.5l. Fig. 11.5k Default output for separate lines *** Estimates of parameters *** Fig. 11.5l Estimating each line *** Estimates of parameters *** estimate s.e. t(10) t pr. estimate s.e. t(10) t pr. Constant year Year daydeg <.001 daydeg.year Year year Year year Year <.001 daydeg.year Year <.001 daydeg.year Year <.001 The display in Fig. 11.5l again gives the equations directly as: Year 1: leaves = *daydeg Year 2: leaves = *daydeg 11.6 Nonlinear Regression So far the regression techniques covered have been essentially linear (although curves have been fitted using polynomials, the method was that of linear regression). In GenStat it is possible to fit nonlinear equations, using Stats Regression Analysis Standard Curves in a manner similar to linear regression. In the main Standard Curves dialogue box, the form of the curve to be fitted can be chosen from the list under Type of Curve. This list includes exponential, double exponential, linear + exponential, and logistic. For example, to fit an exponential curve to data in a variate y with respect to an independent variate x, choose Stats Regression Analysis Standard Curves; enter y as the Response Variate, and x as the Explanatory Variate. Exponential (or asymptotic regression) is the default Type of Curve. As with linear regression, a graph of the data and fitted curve can be produced by selecting the [Further Output] then [Fitted Model] buttons. Example: the data comprise the height of a plant measured on 10 different occasions (weeks 0-9). The curve to be fitted is a logistic without an additive constant (lower asymptote 0). The heights for weeks 0 to 9 are given in Fig.11.6a. 94

96 11 - Regression Enter these values into a variate called height in a spreadsheet, with the week numbers 0, 1 9 into another variate called time or open cmreg5.xls. Use Graphics Point Plot to produce the graph in Fig. 11.6b. Fig. 11.6a Data Fig. 11.6b Graph of the data Choose Stats Regression Analysis Standard Curves. Fig. 11.6c Basic dialogue for non-linear regression In the dialogue shown in Fig. 11.6c, try different types of curve and you will see the small examples changes in shape. Select Logistic (s-shaped or inverse s-shaped curve) from the list under Type of Curve. This looks of the right form for the data. Often data includes a factor that distinguishes between different groups. For linear regression, an example was given in Section 11.5, where we had data for two years. It is a useful feature that the Genstat dialogue in Fig. 11.6c allows for a factor to be in the model, even when a non-linear model is being fitted. In this example, enter height as the Response Variate, and time as the Explanatory Variate. Select [Options], and de-select Estimate Constant Term. Click [OK] twice. 95
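The Standard Curves menu appears to be built on the FITCURVE directive, so the same fit might be obtained with commands such as these (a sketch; the CURVE and CONSTANT settings are our reading of the dialogue choices above):

MODEL height
FITCURVE [CURVE=logistic; CONSTANT=omit] time    " logistic curve with lower asymptote fixed at 0 "
RGRAPH                                           " data with the fitted curve "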

97 11 - Regression Fig. 11.6d Results ***** Nonlinear regression analysis ***** Response variate: height Explanatory: time Fitted Curve: A + C/(1 + EXP(-B*(X - M))) Constraints: A = 0.0 *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression <.001 Residual Total Percentage variance accounted for 99.7 Standard error of observations is estimated to be 3.65 *** Estimates of parameters *** estimate s.e. B M C Note: C is the final (asymptotic) height, M the time to reach 50% of the final height, and B is related to the maximum steepness (slope) of the curve (at time M). From the standard curves dialogue box, click [Further Output] and [Fitted Model] to obtain a graph of the fitted curve: Fig. 11.6e Data with fitted curve 11.7 Combining regression and simple analysis of variance: detecting lack of fit The presence of more than one y variable for at least some of the x values in a linear regression enables lack of fit to be detected. The residual mean square can be partitioned into two components, one for the deviations of the means of the y values for each x about the fitted line (lack of fit) and the other measuring the variability between the multiple y values for each x. 96

98 11 - Regression We use this example to introduce the dialogue for Analysis of variance in Genstat. We also show how the regression and ANOVA dialogues can produce the same results. Example: Mead, Curnow and Hasted, page 170, (Example 9.1, full data set). This relates uptake to concentration of CO 2, with multiple observations for concentrations 100, 130, 160 and 200. This is the same example of linear regression that was used in the tutorial and in Section The dataset is cmreg1.xls For the full analysis we need the columns giving the concentration in two different forms, as shown in Fig. 11.7b. The first column, called conc is a variate, while the second, called fconc is a factor, with 7 levels corresponding to the 7 different concentrations used in the studies. Fig. 11.7a Duplicate column as a factor Fig. 11.7b Data with extra column To enter the data above it is simplest to start with the spreadsheet entering 2 columns and 17 rows. Open the file cmreg1.xls. Then click a cell in the column conc, choose Spread Column Duplicate. This gives the dialogue shown in Fig. 11.7a. Name the New Column as fconc, set it to be a factor and click on [OK]. The resulting spreadsheet is shown Fig. 11.7b. The column fconc is a factor with the same values as conc. One way to start the analysis is with the ANOVA dialogue. Choose Stats Analysis of Variance, selecting General Treatment Structure (no Blocking), as shown in Fig. 11.7c. Fig. 11.7c ANOVA dialogue 97

99 11 - Regression Put uptake into the Y-Variate box. Either complete the Treatment Structure box with POL(fconc;1), or click on [Contrasts] and complete the resulting dialogue, shown in Fig. 11.7d. Click [Options] from the main dialogue, and tick to display the contrasts. Click [OK] in each dialogue box. Fig. 11.7d Contrasts button Fig. 11.7e Options sub-dialogue In the output window, the results below show the F-prob for deviations (lack of fit) in the ANOVA Table is 0.340, which is not significant. This implies that the pattern in the concentration means is well-described by a straight line. Therefore, the calculation of standard errors can be based on the residual mean square from the ANOVA, and the s.e. for the slope given in Fig. 11.7f is correct. Fig. 11.7f Results ***** Analysis of variance ***** Variate: uptake Source of variation d.f. s.s. m.s. v.r. F pr. fconc <.001 Lin <.001 Deviations Residual Total ***** Tables of contrasts ***** Variate: uptake *** fconc contrasts *** Lin s.e ss.div Note: this gives the slope as with standard error of The results above do not give the full equation of the line. One way of giving this is to type a command. Type: APOLYNOMIAL fconc into one of the windows (or open an input window). Use <CNTL>L, or Run Submit Line. This gives the following result: Fig. 11.7g Equation of the line apolynomial fconc ***** Equation of the polynomial ***** * fconc The same analysis can be carried out using the regression dialogues. Choose Stats Regression Analysis Linear Models. Fit the model with the variate, conc, as shown in Fig. 11.7g. Check, using the [Options] that Accumulated has been ticked. 98
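As an aside, the ANOVA route above (a treatment structure with a linear contrast, then APOLYNOMIAL) can also be run entirely from commands; a sketch, with the PRINT settings assumed:

TREATMENTSTRUCTURE POL(fconc; 1)    " linear contrast plus deviations (lack of fit) "
ANOVA [PRINT=aovtable,contrasts; FPROBABILITY=yes] uptake
APOLYNOMIAL fconc                   " prints the equation of the fitted line "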

100 11 - Regression Fig. 11.7h First regression model Fig. 11.7i Adding the factor Then use [Change Model] and [Add] the factor, called fconc, as shown above. The output includes the information in Fig. 11.7j: Fig. 11.7j Results *** Accumulated analysis of variance *** Change d.f. s.s. m.s. v.r. F pr. + conc < fconc Residual Total The results show, just as in the output from the ANOVA dialogue earlier, that the fconc factor is not needed. The linear relationship with the variate, conc, is adequate. There is now a slight dilemma, if we use the regression approach, and this is a good excuse to introduce a third dialogue. The regression line after fitting the full model (with fconc included), does not have quite the same equation, as above, because of the presence of the factor in the model. However, leaving out the factor completely does not give the same error term as was used with the ANOVA. We use a trick to fit a linear regression, with a modification to enable the standard error to be based on the ANOVA residual mean square. Choose Stats Regression Analysis Generalised Linear Models as shown in Fig. 11.7k. We will see this dialogue again in Chapter 13. Click [Options], select Fix under Dispersion Parameter, and enter the Residual mean square (m.s.) from the ANOVA in the above box, as shown in Fig. 11.7l. (If the Deviations had been significant, i.e. significant Lack of Fit present, the Deviations mean square would need to be entered into the Dispersion Parameter box instead). De-select everything under Display apart from Estimates: Fig. 11.7k GLM dialogue Fig. 11.7l Fixing the residual Click [OK], and enter uptake as the response variate, and conc in Model to be Fitted. Click [OK]. 99

101 11 - Regression The reason this is a trick is that we are not fitting a generalized linear model, just an ordinary linear model. But the ordinary linear model dialogue does not include the option to fix the residual mean square. For a graph of the data with the fitted line, use Further Output, and then Fitted Model. Specify conc as the Explanatory Variable. The numerical results are shown in Fig. 11.7m with the graph in Fig. 11.7n. Fig. 11.7m Results Fig. 11.7n Graph ***** Regression Analysis ***** *** Estimates of parameters *** estimate s.e. t(*) Constant conc MESSAGE: s.e.s are based on dispersion parameter with value
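In command form the trick amounts to setting the DISPERSION option of the MODEL directive, rather than letting the dispersion be estimated. A sketch, where resms is a scalar whose value you set to the Residual m.s. from the ANOVA:

SCALAR resms    " set its value to the Residual m.s. from the ANOVA output "
MODEL [DISTRIBUTION=normal; LINK=identity; DISPERSION=resms] uptake
FIT conc        " the s.e.s are now based on the fixed dispersion "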

Challenge 4 - Climate change

How could you investigate climate change in the monthly data used in Chapter 7, i.e. genrain.gsh? Although the record is not very long, the methods you suggest could be used more generally.

1. How useful are the following?
a) Calculate the annual totals.
b) Produce a factor column that has three levels, splitting the 33 years into 3 consecutive groups of 11 years each.
c) Look at the annual totals in each group with boxplots.
d) Use a t-test to compare the annual totals in the first and third groups.
e) Use a one-way ANOVA to compare the three groups of annual totals.
f) Ignore the groups and give a line plot of the annual totals against the year number.
g) Calculate the correlation between the annual total and the year number.
h) Do a linear (and also a polynomial) regression. You could also try a spline model. (A sketch of (f)-(h) is given after this challenge.)

If you have succeeded with these calculations you have probably usefully reviewed quite a lot of GenStat. But assume your objective is to study climate change, rather than GenStat. Is this an effective way of looking at climate change?

2. Perhaps one problem is the use of just annual totals. If there is any evidence of change, then it might be useful to know at what time of year it occurs. So you could repeat the analyses for one or more of the individual months.

3. Perhaps there is a more fundamental problem. Part of the problem may be that you started your investigation of climate change with a file containing the monthly totals. The rainfall amounts on rainy days are very variable, hence the monthly rainfall totals include a lot of noise. In looking for climate change you are searching for a small signal. Describe possible analyses if you start instead with the daily data. (For example, as well as calculating the annual totals you could now first calculate the date of the start of the rains each year, or the length of the longest dry spell within the season, or the number of rain days, and so on.) Try any ideas you have on the data from Zimbabwe, zimdata.gsh, from Chapter 9.
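For readers who want to see items (f)-(h) outside GenStat, here is a minimal Python sketch. The annual totals below are invented placeholders; with the real data they would be the 33 annual sums calculated from genrain.gsh.

# Straight-line fit of annual totals against year number (items f-h);
# the totals are simulated placeholders, not the genrain.gsh values.
import numpy as np
from scipy import stats

years  = np.arange(1, 34)                     # 33 years, numbered 1..33
totals = 900 + 5 * years + np.random.default_rng(1).normal(0, 120, 33)

res = stats.linregress(years, totals)
print(f"slope = {res.slope:.1f} mm/year, p = {res.pvalue:.3f}")
print(f"correlation with year (item g) = {res.rvalue:.2f}")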


Part III Statistical Methods


106 13 Distributions in climatology 13. Distributions in climatology 13.1 Introduction So far, most of the analyses that we have looked at, e.g. t-tests and simple regression, were based on the assumption that the data had an underlying normal distribution. This distribution is often appropriate for continuous data such as annual rainfall or daily temperatures. The normal distribution is symmetric and bell-shaped, and is characterized by two parameters. These are its mean, which identifies the centre of the distribution, and the standard deviation, a measure of the spread of the data. The normal distribution is the most common statistical distribution, and is not an unreasonable assumption for many continuous variables. However, not all data are continuous, and not all continuous variables are normally distributed. One quick fix when data are not normally distributed is to transform the data e.g. take logs or square roots so that the transformed data are approximately normal and analyse the transformed data using methods such as t-tests and regression. In this chapter we introduce some other commonly used distributions for climatic data. We show how GenStat can easily be used to analyse such data. Fig. 13.1a Probability calculations Fig. 13.1b Fitting a distribution In Chapter 13.2 we look at probability ideas, and use the Data Probability Calculations dialogue, see Fig. 13.1a. We describe how to estimate the parameters of these distributions in Chapter 13.3, see Fig. 13.1b. We often need more than just to fit a single distribution. In Chapter 11 we looked at regression modelling where the data were normally distributed. In Chapter 13.4 we see how linear regression ideas can be generalised when we have data from a wide range of other distributions. This is the subject of generalised linear models (Fig. 13.1c) introduced by Nelder and Wedderburn (1972). This is a key part of GenStat and of some of the other powerful statistics packages. 105

107 13 Distributions in climatology Fig. 13.1c Fitting a generalised linear model The value of generalized linear models (GLMs) is more than just generalizing the subject of regression modelling. This general framework also includes further topics that used to be analysed as special cases. One example is chi-square tests. They are popular in climatology, but the chisquare test is limited to two-way tables. In Chapter 13.5 we introduce log-linear models. This generalizes the chi-square test and puts it into a modelling framework. Log-linear models are an example of a GLM Probability ideas If the data follow a particular distribution then it is possible to assign a probability to any event. For instance suppose the total annual rainfall has a Normal distribution with a mean of 1650mm and a standard deviation of 400mm. Assume a crop will only grow if the rainfall is at least 1200mm. Then, with this model of a Normal distribution, we can determine the probability of crop failure, i.e. of not getting the required amount of rainfall, in any year. To do this in GenStat choose Data Probability Calculations to bring up the dialogue shown in Fig. 13.2a. Fig. 13.2a Estimating a probability 106
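The same calculation can be checked outside GenStat; for example, a short scipy sketch of the Normal model with mean 1650 mm and standard deviation 400 mm:

# Probability calculations for annual rainfall ~ Normal(1650, 400^2)
from scipy import stats

rain = stats.norm(loc=1650, scale=400)

print(rain.cdf(1200))                    # cumulative lower probability, ~0.13
print(rain.sf(1200))                     # cumulative upper probability, ~0.87
print(rain.cdf(1400) - rain.cdf(1000))   # probability in an interval
print(rain.cdf([1000, 1200, 1400]))      # a list of values at once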

108 13 Distributions in climatology Note the flexibility of the screen shown in Fig. 13.2a. Models can be continuous or discrete. The default data type is continuous and the default distribution is the Normal model. This is depicted by the bell shaped curve in the picture on the right-hand side of the dialogue box. Click on the drop-down box in Fig. 13.2a see the range of distributions available for continuous data. For instance, the Gamma and Weibull distributions are two others. To determine a probability associated with a Normal distribution whose mean and standard deviation are 1650 and 400 respectively, first enter these values as shown in Fig. 13.2a. Next choose the type of calculation. Here we want the Cumulative Lower Probability, since we are interested in the chance of getting less than the required amount of rainfall. Enter the value, 1200 into the X deviate box in Fig. 13.2a, click on OK and your screen should look like Fig. 13.2b. Fig. 13.2b Estimating a probability The probability of less than 1200mm in any one year is 0.13 or 13%. Return now to the calculation section of the dialogue shown in Fig. 13.2b. There are various options available. For instance, the Cumulative Upper Probability will return the probability of being greater than a particular value. (For 1200 it should be 0.87 i.e ; try it and see!) Another, which may be of interest, is Probability in Interval which determines the probability of being within a defined range of values. From this dialogue you can also calculate probabilities for a range of values, by ticking Allow list of values. The results are displayed in the output window and you can tick the box shown in Fig. 13.2b to also display the results in a spreadsheet. For example to determine the probability of having less than 1000, 1200 and 1400 mm of rainfall, enter the data at the X-deviate box in Fig. 13.2b, using either spaces or commas between the numbers. The resulting spreadsheet of results (rainfall values and probability of being less than the values) is shown in Fig. 13.2c. Fig. 13.2c Probability calculations for a range of values If you have a large number of x values, perhaps 800, 850, , then enter these values into a GenStat spreadsheet first, and insert this column name in the dialogue in Fig. 13.2b instead of the list 107

109 13 Distributions in climatology of values. For these distributions you have no need for books of statistical tables any more. If ever you need a table then you can produce it to order! 13.3 Fitting distributions In this section we illustrate the use of the Stats Distributions Fit Distributions dialogue, shown in Fig. 13.3a. This permits the estimation of the parameters from a wide range of discrete and continuous distributions, as is indicated in Fig. 13.3b. Fig 13.3a The Fit Distribution dialogue box Fig 13.3b Some of the distributions In practice the use of this dialogue is preceded by an exploration of the data, using boxplots, histograms, probability or kernel density plots, all of which can be done from other menus in GenStat. We omit this stage to concentrate on some of the ideas involved in fitting the parameters of the chosen distribution. Once you have fitted the parameters, and have an acceptable probability model, then you would usually return to the ideas introduced in Chapter There you use the model to provide results, such as the risks or return periods that correspond to the objectives of your study. The ideas in this section are of use in their own right and they also act as an introduction to the more general modelling ideas that we describe in Chapter As an example we use the daily data from Bulawayo, that was introduced in Chapter 9. We assume our interest is in a study of the distribution of dry spell lengths. Starting with the daily data, a preliminary step is to calculate the spell lengths, and this typifies an initial data manipulation stage that is commonly needed to provide the data to be analysed. We showed how to do this in Chapter 9.5 and hence here we use the resulting GenStat spreadsheet, that we called spells.gsh 4. We fit a geometric distribution as a simple example. In fitting the different distributions it is sometimes important to know how they are parameterised in GenStat. This is provided in the reference guides that are automatically available as part of the help. Part of a table from Chapter 2 of the statistics guide is shown in Fig. 13.3c. 4 In Chapter 16 we will show how the production of the spreadsheet with the spell lengths can be done with GenStat s commands, rather than the menus. This is particularly useful when the analysis is to be repeated for different definitions of a dry day or for different stations. 108

110 13 Distributions in climatology Fig. 13.3c GenStat statistics guide, showing the definition of the geometric distribution : : The geometric distribution is a discrete analogue of the continuous exponential distribution described later in this section, and can be interpreted as the waiting time in a series of Bernoulli trials before an event occurs. The probability that r trials occur before an event is given by: p r =p(1-p) r 0<p<1, r=0,1 We see that the geometric distribution is from zero upwards, unlike our data that start at one. We therefore use the Data Calculations, Fig. 13.3d, to produce another column that subtracts one from the values in the column called spelldry. We call this new column spell0. Fig. 13.3d Calculate dialogue to give spells To fit a geometric distribution to the sample of dry spells choose Stats Distributions Fit Distributions. Select spell0 into the Data Values box and change the Distribution type to Geometric, as shown in Fig. 13.3d. Fig. 13.3d Fitting a geometric distribution The output is displayed in Fig. 13.3e. 109

111 13 Distributions in climatology Fig. 13.3e A geometric distribution fitted to dry spell lengths ***** Fit discrete distribution ***** *** Sample Statistics *** Sample Size 1279 Mean Variance Skewness 4.72 Poisson Index 5.46 Negative Binomial Index 2.02 *** Summary of analysis *** Observations: spell0 Parameter estimates from individual data values Distribution: Geometric Pr(X=r) = p.(1-p)**r Deviance: on 17 d.f. *** Estimates of defining parameters *** estimate s.e. p *** Fitted values (expected frequencies) and residuals *** r Number Number Weighted Observed Expected Residual The output in Fig. 13.3e begins with some simple summary statistics, followed by a (maximum likelihood) estimate of the geometric parameter, p, which is (s.e.=0.022). The table at the end of the output is used to assess the goodness-of-fit of the distribution. The 1279 dry spell lengths are grouped and tabulated, giving observed numbers (counts). Assuming a geometric distribution the corresponding expected values are calculated. A standard chi-square goodness-of-fit test is then applied using the likelihood ratio approach. The resulting chi-square statistic, , is labelled Deviance in the output. The associated degrees of freedom is 17. A formal significance test can be performed by comparing with the upper percentage points of a chi-square distribution on 17 degrees of freedom.5 This gives a highly significant lack-of-fit.6 Clearly, from examination of the observed and expected frequencies, the geometric distribution is not a good fit. There are clear discrepancies, for example there are 315 observed zeroes versus an expected 106. Given the seasonal variation in dry spell lengths the lack-of-fit is not unexpected, as the 1279 lengths are unlikely to form a single homogenous sample. It may be more realistic with a subset, say January to March from 1990 onwards. Restrict the analysis to these dry spell lengths by making the spreadsheet the active window and choosing Spread Restrict/Filter To Groups (factor levels). Ensure Year is in the Factor box and choose 1990 to 2001as the Selected Levels, Fig. 13.3f. Click 5 The results for the goodness-of-fit test depend upon the tabulation of the dry spell lengths. It is not unique, and can be altered using Number of Classifying Groups and/or Limits. Hence the deviance presented is not unique. 6 We hope a future version of GenStat will present a p-value. 110

Apply. This restricts the analysis to 1990 to 2001. Now specify Month in the Factor box and Jan to Mar as the Selected Levels, Fig. 13.3g. Click OK. This further restricts the analysis to January to March.

Fig. 13.3f Just years 1990 onwards    Fig. 13.3g Just months Jan to Mar

Refitting a geometric distribution (Stats Distributions Fit Distributions, Fig. 13.3d) gives the output presented in Fig. 13.3h.

Fig. 13.3h A geometric distribution fitted to dry spells (Jan-Mar, 1990 onwards)
***** Fit discrete distribution *****
*** Sample Statistics ***
Sample Size  131
Mean  4.44
Variance
Skewness  2.42
Poisson Index  1.37
Negative Binomial Index  2.07
*** Summary of analysis ***
Observations: spell0
Parameter estimates from individual data values
Distribution: Geometric  Pr(X=r) = p.(1-p)**r
Deviance: 8.17 on 7 d.f.
*** Estimates of defining parameters ***
     estimate  s.e.
p
*** Fitted values (expected frequencies) and residuals ***
        Number    Number  Weighted
   r  Observed  Expected  Residual

The estimate of the geometric parameter p is now (s.e. ). The tabulated observed and expected frequencies are in reasonable agreement, suggesting a geometric distribution is reasonable. This is confirmed by performing a chi-square goodness-of-fit test. The deviance, 8.17 on 7 df, indicates no lack of fit.
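The maximum-likelihood fit of the geometric distribution is easy to reproduce outside GenStat, which can be a useful check. The spell lengths below are invented placeholders standing in for spells.gsh; for the parameterisation Pr(X = r) = p(1-p)^r, r = 0, 1, ..., the ML estimate is p = 1/(1 + mean), so the subset above, whose mean is 4.44, gives p of about 0.18.

# Geometric fit by maximum likelihood, with observed vs expected frequencies.
# `spell` holds dry-spell lengths (>= 1 day); these values are placeholders.
import numpy as np
from scipy import stats

spell = np.array([1, 1, 2, 7, 3, 1, 12, 4, 2, 9, 1, 5, 3, 2, 16, 1, 6, 2, 4, 8])
spell0 = spell - 1                       # GenStat's geometric starts at r = 0

p_hat = 1.0 / (1.0 + spell0.mean())      # ML estimate for Pr(X=r) = p(1-p)^r
print("p =", round(p_hat, 3))

# Observed versus expected frequencies for a quick goodness-of-fit look
# (scipy's geom counts from 1, so shift r by one when evaluating the pmf)
r = np.arange(0, spell0.max() + 1)
expected = len(spell0) * stats.geom.pmf(r + 1, p_hat)
observed = np.bincount(spell0, minlength=len(r))
print(np.column_stack([r, observed, np.round(expected, 1)]))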

113 13 Distributions in climatology Unlike most statistical software packages separate geometric distributions can be fitted for each of the three months separately using the dialog in Fig. 13.3d However this is a limited advance, as we might also want to look for a trend in the years. For this a modelling approach is required. A modelling approach, where the dependency of the distribution on several explanatory variables, factors and variates, can be modelled and quantified is much more general. This approach is described in Chapter Generalised linear models The term generalised linear model (GLM) first appears in a landmark paper by Nelder and Weddurburn (1972). GLMs effectively generalise the methods of regression modelling, described in Chapter 11, to data from many other distributions in addition to the normal. Following their paper, there were many important contributors to the development of GLM technology, but the methodology was especially popularised by the McCullagh and Nelder (2 nd edition 1989) book, titled Generalized Linear Models. Now these methods are incorporated into most of the standard statistics packages 7 and there are many books, for example Dobson (2002). The importance of the subject may be gauged from the fact that even elementary books such as Mead et al (2003) or Manly (2001) include sections on GLMs. McConway et al (1999) include an extensive treatment in their book, titled Statistical Modelling using Genstat. GLMs are also discussed in the GenStat statistics guide (Chapter 3.5). The ideas of GLMs build on the methods of regression for data from normal distributions. The similarity between the dialogue boxes for regression for normal distributions, Fig. 13.4a and that for fitting generalized linear models, Fig. 13.4b is clear. Fig. 13.4a Linear Regression dialogue Fig. 13.4b Generalized Linear Models dialogue We illustrate the methods with an example of modelling daily rainfall amounts. We use the data from Bulawayo, introduced in Chapter 9 (Fig. 9.2b). The data are in the spreadsheet zimdata.gsh. Restrict the analysis to non-zero rainfall recordings. To do this, make the spreadsheet active and choose Spread Restrict/Filter By Logical Expression. State the inclusion criterion, as shown in Fig. 13.4c. Click OK. This leaves 3310 rainfall amounts to be modelled. 7 It is perhaps not surprising that GenStat was one of the first general statistics packages to incorporate GLMs, because early versions of Genstat were developed under John Nelder s leadership at Rothamsted. Rothamsted is justly famous for statistics in that Fisher was the first head of their statistics department. Yates was the second, and Nelder the third. 112

114 13 Distributions in climatology Fig. 13.4c Restricting the analysis to non-zero daily rainfall amounts Usually we would begin with an exploratory analysis, using graphs and summary statistics. For example a histogram of daily rainfall amount for January is given in Fig. 13.4d. Fig. 13.4d Histogram of January daily rainfall amounts The distribution is very positively skewed and is clearly not normal. Such skewed data is often modelled by a gamma distribution, a two-parameter distribution, with a scale and shape parameter. For illustration we will consider modelling daily rainfall as a function of the factor month, with a gamma random error. With this in mind an exploratory analysis might include graphs of rainfall amount by month, as in Fig. 13.4d. Summary statistics such the coefficient of variation by month would also be useful, as the gamma model assumes the true value of this statistic is constant and does not vary with month. We might also fit a gamma distribution for each month separately using the methods of Chapter 13.3 and examine the results, looking for evidence against the model we are considering. This is not done here and we concentrate on the model fitting process. It is usual to model the (natural) logarithm of the mean rainfall amount. The logarithmic transformation is an example of a link function. The use of this function allows the effect of explanatory variables to be interpreted multiplicatively, and prevents non-positive predictions for mean values from a fitted model. (Theoretically, for the gamma model, a more desirable link function, generically known as a canonical link function, is the reciprocal function. However, from a practical viewpoint its interpretation is more difficult.) A gamma model may be fitted to daily rainfall amounts using Stats Regression Analysis Generalized Linear Models. Specify Rain as the Response Variate and Month as an explanatory variable (factor) by selecting it into Model to be Fitted. Change the Distribution to Gamma and the Link Function to Logarithm, as shown in Fig. 13.4e. 113

115 13 Distributions in climatology Fig. 13.4e Fitting a gamma model to daily rainfall data The output is given in Fig. 13.4f. Fig. 13.4f Fitted gamma model (edited output) ***** Regression Analysis ***** Response variate: Rain Distribution: Gamma Link function: Log Fitted terms: Constant, Month *** Summary of analysis *** mean deviance approx d.f. deviance deviance ratio F pr. Regression <.001 Residual Total Coefficient of variation is estimated to be 1.39 from the residual deviance * MESSAGE: The residuals do not appear to be random; for example, fitted values in the range 7.85 to 7.85 are consistently larger than observed values and fitted values in the range 9.75 to 9.75 are consistently smaller than observed values * MESSAGE: The following units have high leverage: Unit Response Leverage : : : *** Estimates of parameters *** antilog of estimate s.e. t(3298) t pr. estimate Constant < Month Feb Month Mar < Month Apr Month May < Month Jun < Month Jul < Month Aug < Month Sep Month Oct < Month Nov Month Dec * MESSAGE: s.e.s are based on the residual deviance Parameters for factors are differences compared with the reference level: Factor Reference level Month Jan 114

The selected gamma model assumes the effect of month is additive with respect to the mean level of rainfall on a logarithmic scale. Algebraically we have:

log_e(µ_i) = β_0 + m_i,   for i = 1, ..., 12,

where µ_i is the true mean rainfall for month i, β_0 is an intercept parameter and m_i represents the month effect. For reporting parameter estimates GenStat constrains m_1 (corresponding to January) to be 0 and gives estimates for the remainder. Hence, from the output, the estimate of β_0, 2.39 (s.e. = 0.0566), is an estimate of the mean daily rainfall amount for January, on a logarithmic scale. The estimates of m_2, ..., m_12 represent estimated mean differences from January on a logarithmic scale. The output titled Summary of analysis gives an approximate F-test for the month effect. The p-value, <0.001, is very small, indicating a clear month effect. Examination of the parameter estimates suggests, as expected, a seasonal trend in mean daily rainfall. As was stated earlier, the gamma model assumes a constant coefficient of variation (equivalent to a constant shape parameter) and this is estimated to be 1.39 (or 139%).

As with the ordinary regression described in Chapter 11, we can then use the Predict button to give the estimated mean daily rainfall each month. In general, all the operations, such as Add and Drop, described in Chapter 11 can be applied to the choice and interpretation of these models.

To emphasise the parallels with the analysis methods traditionally used for normally distributed data, return to the Generalized Linear Models dialogue box and change the Distribution to Normal and the Link Function to Identity (i.e. model the true mean, not a transformation of it), as shown in Fig. 13.4g.

Fig. 13.4g Fitting an incorrect normal model to daily rainfall data

The model is clearly wrong but we use it to illustrate the general principles of modelling. The output is given in Fig. 13.4f. The F-test is now derived from the familiar analysis of variance table. Parameter estimates are also given but now correspond to the model:

µ_i = β_0 + m_i,

where the terms are defined in a similar fashion to the gamma case. Comparing the general form of the output for the fitted normal and gamma models we see they are very similar. In essence, we assume a reasonable model, fit the model and draw inferences from it, whether the error distribution is normal or something else.
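For readers who use other software alongside GenStat, the same gamma model with a log link can be sketched with the statsmodels package in Python. The file name below is a placeholder: it assumes the daily records from zimdata.gsh have been exported to a CSV with columns Rain and Month, and the spelling of the link class may differ slightly between statsmodels versions.

# Gamma GLM with a log link for non-zero daily rainfall, month as a factor.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("bulawayo_daily.csv")          # hypothetical export of zimdata.gsh
df = df[df["Rain"] > 0]                         # restrict to non-zero amounts, as above

gamma_log = smf.glm("Rain ~ C(Month)", data=df,
                    family=sm.families.Gamma(link=sm.families.links.Log()))
fit = gamma_log.fit()
print(fit.summary())                            # month effects on the log scale

# Back-transformed estimated mean daily rainfall per month (cf. Predict)
print(fit.predict(pd.DataFrame({"Month": sorted(df["Month"].unique())})))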

117 13 Distributions in climatology Fig. 13.4f Fitted normal model (edited output) ***** Regression Analysis ***** Response variate: Rain Fitted terms: Constant, Month *** Summary of analysis *** d.f. s.s. m.s. v.r. F pr. Regression <.001 Residual Total Percentage variance accounted for 2.0 Standard error of observations is estimated to be 12.8 : : *** Estimates of parameters *** estimate s.e. t(3298) t pr. Constant <.001 Month Feb Month Mar <.001 Month Apr Month May <.001 Month Jun <.001 Month Jul <.001 Month Aug Month Sep Month Oct <.001 Month Nov Month Dec Parameters for factors are differences compared with the reference level: Factor Reference level Month Jan 13.5 Moving on from chi-square tests: log-linear modelling Chi-square tests for association in two-way contingency tables of counts are very popular. However, chi-square tests are of limited use, as they cannot be used for more than two-way contingency tables, which occur often in practice. In this section we look at GenStat s facilities for the analysis of two-way contingency tables. The method will then be put into a modelling framework, called log-linear modelling, which can be extended to multidimensional tables. For illustration we use a data set on air pollution, supplied with GenStat. The data are in the spreadsheet sulphur.gsh. The data set is also used and discussed in Chapter 15. The data set consists of 114 sulphur (pollution) measurements and associated variables. To illustrate a chi-square test for association we will use the factors Wind and Sulphurgroup. The first is a categorisation of wind direction into four directions; the second is a categorisation of the actual sulphur measurement into three levels (<4, 4.5 to <11.5, 11.5). A suitable tabulation may be produced using Stats Summary Statistics Frequency Tables and select Wind and Sulphurgroup into the Groups list. Selecting the Set Margin option will produce row and column totals (Fig. 13.5a). 116

118 13 Distributions in climatology Fig. 13.5a An initial tabulation of wind direction by amount of sulphur The cross-tabulation or contingency table is shown in Fig. 13.5b. Fig. 13.5b Contingency table of wind direction by amount of sulphur Count Sulphurgroup < >= 11.5 Count Wind N E S W Count Unknown Count 1 Further exploratory analysis might include producing suitable tables of percentages. In this example the row percentages may be the most useful. These could be produced using the Display as percentage of option in the Frequency Tables dialogue. To perform a chi-square test choose Stats Statistical Tests Contingency tables. Define the dimensions of the table, as indicated in Fig. 13.5c, and specify the Data Arrangement to be Row and Column Factors. Fig. 13.5c Performing a chi-square test for association 117

119 13 Distributions in climatology The results are displayed in Fig. 13.5d. Fig. 13.5d Chi-square test results Pearson chi-square value is with 6 df. Probability level (under null hypothesis) p = The standard chi-square test is named after Pearson, as is indicated in the output in Fig. 15.5d. The p-value, 0.002, is very small indicating strong (statistical) evidence for an association between wind direction and amount of sulphur. An alternative approach is to use a likelihood ratio chi-square test. In Fig. 13.5c change the Method to Maximum Likelihood. The results are shown in Fig. 13.5e and are very similar, as the two test procedures are asymptotically equivalent. Fig. 13.5e A likelihood ratio chi-square test Likelihood chi-square value is with 6 df. Probability level (under null hypothesis) p = To investigate the nature of the association one may compare the observed counts (Fig. 13.5b) with the expected counts under the null hypothesis of no association. The expected values may be calculated by choosing Options in Fig. 13.5c and selecting Expected Values. GenStat produces expected (or fitted) values and standardised residuals for comparing the observed and corresponding counts, i.e. observed count versus model prediction, on a standardised scale. These are shown in Fig. 13.5f. Fig. 13.5f Expected values and standardised residuals Observed Fitted Residual Wind Sulphurgroup N < >= E < >= S < >= W < >= One of the main signals in the data can be derived from the large residual, 3.21, corresponding to a northerly wind direction and a high amount of sulphur. This indicates a tendency for higher amount of sulphur to be found when the wind is blowing in a northerly direction. The data used in this example was in list format, namely one row per measurement occasion. In some instances the data may have already been semi-processed and exist in a summarized list format. The summarised data are in the spreadsheet sulphur summarised.gsh, as shown in Fig. 13.5g. Count is a variate containing the frequencies from the contingency table in Fig. 13.5b. The corresponding levels of wind direction and amount of sulphur are indicated by two factors, as before. 118
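The same pair of tests, Pearson and likelihood ratio, can be reproduced outside GenStat with scipy. The counts below are illustrative placeholders only; substitute the observed 4 x 3 table of counts from Fig. 13.5b.

# Pearson and likelihood-ratio chi-square tests on a 4 x 3 contingency table.
# The counts are placeholders, not the sulphur.gsh values.
import numpy as np
from scipy import stats

table = np.array([[10,  8,  9],     # rows: wind directions N, E, S, W
                  [12,  7,  3],     # cols: sulphur < 4.5, 4.5-11.5, >= 11.5
                  [15, 11,  4],
                  [14, 12,  8]])

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print("Pearson chi-square:", round(chi2, 2), "df:", dof, "p:", round(p, 3))

g2, p_lr, _, _ = stats.chi2_contingency(table, correction=False,
                                        lambda_="log-likelihood")
print("Likelihood-ratio chi-square:", round(g2, 2), "p:", round(p_lr, 3))

# Pearson residuals, (observed - expected)/sqrt(expected), for locating the association
print(np.round((table - expected) / np.sqrt(expected), 2))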

Fig. 13.5g The sulphur data in summarised form

To perform a chi-square test when the data are in this summarised format, once again choose Stats Statistical Tests Contingency tables. Specify the Data Arrangement to be Single variate with grouping factors. Define the dimensions of the table, as indicated in Fig. 13.5h, and indicate that the counts are contained in Count by selecting it into the Data box.

Fig. 13.5h A chi-square test for association when the data are in summarised format

The results are identical to those in Fig. 13.5d. Occasionally you may want to manually input the contingency table data directly, as a table, into GenStat. In the Contingency Tables dialogue box reset the Defaults. Select Table to be the Data Arrangement, as in Fig. 13.5i.

Fig. 13.5i Setting options to manually input a table

Choose Create Table Spreadsheet (Fig. 13.5j), to specify that you want to type a table into a spreadsheet. In the next dialogue box, Fig. 13.5k, specify the dimensions of the table, and leave the default name for the new spreadsheet as table.

Fig. 13.5j Creating a table    Fig. 13.5k The dimensions of the table

Type the counts into the spreadsheet, as shown in Fig. 13.5l. (The spreadsheet is declared as a table structure in GenStat.) The Table box in the Contingency Tables dialogue should now contain the new tabular structure, as shown in Fig. 13.5m.

Fig. 13.5l The table    Fig. 13.5m The Contingency Tables dialogue

Finally click OK in the Contingency Tables dialogue. The results are, again, identical to those in Fig. 13.5d.

Now we develop a modelling approach to the analysis of contingency tables, using log-linear models. First we will repeat the analysis above, and then we will add a third dimension to see whether the sulphur is also related to the presence or absence of rain. A log-linear model for assessing the association between wind direction and amount of sulphur requires the data to be in summarised list format (Fig. 13.5g). Using these data we fit a log-linear model. Choose Stats Regression Analysis Generalized Linear Models. Select the type of Analysis to be Log-linear modelling, Fig. 13.5n. Specify the Response Variate to be Count and the Model to be Fitted as Wind+Sulphurgroup+Wind.Sulphurgroup.

122 13 Distributions in climatology Fig. 13.5n Specifying a log-linear model In the Model to be fitted box in Fig. 13.5n, the term Wind is the wind direction main effect, representing, in this case, the marginal distribution of counts over the four wind direction levels. Similarly Sulphurgroup represents the sulphur main effect. These two terms alone correspond to an independence model, such that the distribution of sulphur counts does not depend upon the level of wind direction, and vice versa. To allow for an association between wind direction and amount of sulphur, the interaction term, Sulphurgroup.Wind, is required. This model corresponds to an association model, and assumes the distribution of sulphur counts varies across wind direction level and vice versa. It is the significance of the estimated interaction effect that is of most interest. In the absence of a significant interaction the main effects may be investigated. However, the main effects are usually not of interest. Before continuing select Options in Fig. 13.5n. Select Accumulated and Fit model terms individually, Fig. 13.5o. This will produce significance tests for the individual terms (effects), which is not done by default. Fig. 13.5o Significance tests for individual effects in a log-linear model The output from fitting the log-linear model is shown in Fig. 13.5p. 121

123 13 Distributions in climatology Fig. 13.5p A fitted log-linear model ***** Regression Analysis ***** Response variate: Count Distribution: Poisson Link function: Log Fitted terms: Constant + Wind + Sulphurgroup + Wind.Sulphurgroup *** Summary of analysis *** mean deviance approx d.f. deviance deviance ratio chi pr Regression <.001 Residual * Total Change * MESSAGE: ratios are based on dispersion parameter with value 1 *** Estimates of parameters *** antilog of estimate s.e. t(*) t pr. estimate Constant Wind E Wind S Wind W < Sulphurgroup Sulphurgroup >= Wind E.Sulphurgroup Wind E.Sulphurgroup >= Wind S.Sulphurgroup Wind S.Sulphurgroup >= Wind W.Sulphurgroup Wind W.Sulphurgroup >= * MESSAGE: s.e.s are based on dispersion parameter with value 1 Parameters for factors are differences compared with the reference level: Factor Reference level Wind N Sulphurgroup < 4.5 *** Accumulated analysis of deviance *** Change mean deviance approx d.f. deviance deviance ratio chi pr + Wind < Sulphurgroup Residual Wind.Sulphurgroup Total * MESSAGE: ratios are based on dispersion parameter with value 1 Fig. 13.5p gives an approximate likelihood ratio chi-square test for the interaction effect. This is based on the reduction in deviance on adding the interaction term into the main effects model. From the Accumulated analysis of deviance part of the output the reduction in deviance corresponding to the interaction effect is Χ 2 =21.71 on 6 degrees of freedom. This is highly significant (p=0.001). Hence, there is strong evidence for an interaction. This approach to testing the interaction effect is identical to the likelihood ratio test approach used in the contingency table analysis (Fig. 13.5e), but here it is derived from a modelling framework that can be generalized. The (residual) deviance of the fitted model is 0. This equates with a perfect fit: predicted (fitted) cell counts = observed (response) cell count. A model with this property is often called a saturated model. To see this select Further Output in the Generalized Linear Models dialogue followed by Fitted Values. The output is shown in Fig. 13.5q. 122

124 13 Distributions in climatology Fig. 13.5q Fitted values and residuals ***** Regression Analysis ***** *** Fitted values and residuals *** Standardized Unit Response Fitted value residual Leverage Mean For further illustration fit a main effects model (i.e. drop the interaction term Sulphurgroup.Wind), Fig. 13.5r. This corresponds to a model for independence between wind direction and the amount of sulphur. Fig. 13.5r Fitting a main effects log-linear model The output is shown in Fig. 13.5s. Fig. 13.5s A fitted main effects log-linear model ***** Regression Analysis ***** Response variate: Count Distribution: Poisson Link function: Log Fitted terms: Constant + Wind + Sulphurgroup *** Summary of analysis *** mean deviance approx d.f. deviance deviance ratio chi pr Regression Residual Total * MESSAGE: ratios are based on dispersion parameter with value 1 Dispersion parameter is fixed at 1.00 * MESSAGE: The following units have large standardized residuals: Unit Response Residual *** Estimates of parameters *** antilog of estimate s.e. t(*) t pr. estimate Constant < Wind E Wind S Wind W < Sulphurgroup Sulphurgroup >= * MESSAGE: s.e.s are based on dispersion parameter with value 1 123

Parameters for factors are differences compared with the reference level:
Factor  Reference level
Wind  N
Sulphurgroup  < 4.5

The (residual) deviance of the fitted model has increased from 0 to 21.71 (on 6 degrees of freedom). This is the likelihood ratio test statistic for testing the interaction term. Hence, the (residual) model deviance can be viewed as a goodness-of-fit statistic. The expected (fitted) values and residuals for the main effects model [Further Output Expected Values] are shown in Fig. 13.5t.

Fig. 13.5t Fitted values and residuals for a main effects log-linear model
***** Regression Analysis *****
*** Fitted values and residuals ***
                            Standardized
Unit  Response  Fitted value    residual  Leverage
Mean

The fitted values are the expected counts under the assumed independence model. Hence, the residuals can be used in a similar manner to those obtained in Fig. 13.5f, to identify the main source(s) of the significant interaction between wind direction and amount of sulphur. Note the residuals in Fig. 13.5t and Fig. 13.5f differ because the latter is based on the standard Pearson approach to testing for association and not the likelihood ratio approach.

The main reason for using log-linear modelling is not to assess associations in two-way contingency tables but to explore associations in a multidimensional contingency table defined by several factors. We illustrate by introducing a third factor in the pollution example: rainfall (present or absent). This now defines a three-dimensional contingency table of counts. The data are in the spreadsheet sulphur for loglinear.gsh, part of which can be seen in Fig. 13.5u.

Fig. 13.5u Data from a three-dimensional contingency table

For a three-way contingency table we may be interested in a number of associations. There are two-way interactions: Rain.Wind (a rainfall by wind interaction), Rain.Sulphurgroup and Wind.Sulphurgroup, and a three-way interaction involving all three factors (Wind.Sulphurgroup.Rain). The three-way interaction effect is the most complex and corresponds to an association in the table that cannot be interpreted without considering all three factors simultaneously. As usual there are main effects, but these are not of interest, as stated earlier.

We might begin modelling the associations in the three-way table by fitting a saturated model, that is, all possible main effects and interactions. Choose Stats Regression Analysis Generalized Linear Models. Select the type of Analysis to be Log-linear modelling. Specify the Response Variate to be Count and the Model to be Fitted as Wind+Sulphurgroup+Rain+Rain.Wind+Rain.Sulphurgroup+Wind.Sulphurgroup+Wind.Sulphurgroup.Rain. Alternatively, specify the model using Wind*Sulphurgroup*Rain, which corresponds to specifying all of these effects, as shown in Fig. 13.5v. Before continuing select Options followed by Accumulated and Fit model terms individually, Fig. 13.5w. This will produce sequential significance tests for the individual terms (effects).

Fig. 13.5v Fitting a saturated log-linear model to the three-way contingency table    Fig. 13.5w Significance tests for individual effects

Part of the output is shown in Fig. 13.5x.

Fig. 13.5x Output from a saturated log-linear model (edited)
***** Regression Analysis *****
Response variate: Count
Distribution: Poisson
Link function: Log
Fitted terms: Constant + Wind + Sulphurgroup + Rain + Wind.Sulphurgroup + Wind.Rain + Sulphurgroup.Rain + Wind.Sulphurgroup.Rain
*** Accumulated analysis of deviance ***
Change  d.f.  deviance  mean deviance  deviance ratio  approx chi pr
+ Wind  <.001
+ Sulphurgroup
+ Rain
+ Wind.Sulphurgroup
+ Wind.Rain
+ Sulphurgroup.Rain
Residual
+ Wind.Sulphurgroup.Rain
Total
* MESSAGE: ratios are based on dispersion parameter with value 1
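As a cross-check outside GenStat, the same likelihood ratio test of the three-way term can be sketched with statsmodels: fit the saturated Poisson log-linear model and the model without the three-way interaction, and compare their deviances. The file and column names are assumptions (a hypothetical export of sulphur for loglinear.gsh).

# Log-linear (Poisson) modelling of a three-way table of counts.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("sulphur_loglinear.csv")       # placeholder file name

poisson = sm.families.Poisson()                 # log link is the default
saturated = smf.glm("Count ~ C(Wind) * C(Sulphurgroup) * C(Rain)",
                    data=df, family=poisson).fit()
no3way    = smf.glm("Count ~ (C(Wind) + C(Sulphurgroup) + C(Rain)) ** 2",
                    data=df, family=poisson).fit()

# Change in deviance = likelihood-ratio test of the three-way interaction
lr = no3way.deviance - saturated.deviance       # the saturated deviance is 0
df_diff = no3way.df_resid - saturated.df_resid
print("LR chi-square:", round(lr, 2), "on", df_diff, "df,",
      "p =", round(stats.chi2.sf(lr, df_diff), 3))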

The output in Fig. 13.5x indicates the three-way interaction effect is not significant (p=0.47). This term can now be dropped from the model, and the simpler model fitted and examined. As there are only three factors we might fit all possible models of interest 8 and select an appropriate one. Alternatively, for larger problems with many factors, one might employ a regression-type selection method (forward or backward stepwise selection). We do not continue the analysis here, but point out that our inference concerning the three-way interaction cannot be deduced through the use of traditional Pearson chi-square tests applied to two-way contingency tables. Hence, the latter can lead to erroneous conclusions.

13.6 References

Dobson, A. J. (2002) An Introduction to Generalized Linear Models. 2nd edn. Boca Raton: Chapman & Hall/CRC Press.
McConway, K. J., Jones, M. C. and Taylor, P. C. (1999) Statistical Modelling using GenStat. London: Arnold.
McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models. 2nd edn. Boca Raton: Chapman & Hall/CRC Press.
Manly, B. F. J. (2001) Statistics for Environmental Science and Management. Boca Raton: Chapman & Hall/CRC Press.
Nelder, J. A. (2000) The Analysis of Contingency Tables with One Factor as the Response: Round Two. The Statistician, 49.
Nelder, J. A. and Wedderburn, R. W. M. (1972) Generalized Linear Models. J. R. Statist. Soc. A, 135.

8 Log-linear modelling for multi-dimensional contingency tables requires some care. Often a set of factors can be regarded as explanatory and the others as responses. Also there may be factors or combinations of factors whose totals have been fixed by design. For example, we might make measurements on 50 rainy occasions and 50 dry occasions. Such factors are usually regarded as explanatory. For a discussion of how to correctly take into account a mixture of response and explanatory variables see McCullagh and Nelder (1989). Nelder (2000) presents two examples of misanalysed data that have been presented in recent literature.


129 14 Basic multivariate methods 14 Basic multivariate methods 14.1 Introduction Multivariate analysis is concerned with methods of analysing data where there are two or more variables for each individual or unit. The methods of analysis fall into two categories, descriptive and inferential. We will concentrate on descriptive techniques, as these methods are widely used in practice. The multivariate analysis menu in GenStat is shown in Fig. 14.1a and is seen to be extensive. Fig. 14.1a Multivariate analysis menu Fig. 14.1b Mean monthly temperatures (10 ºC) from 20 weather stations in 1951 Descriptive multivariate methods are generally regarded as data exploration tools. The aims are often data reduction, summary and visualisation, searching for natural groupings, and hypothesis 128

130 14 Basic multivariate methods generation. The underlying methodology is generally based on the concept of distance between units or between variables. One frequently exploited measure is the correlation. Data presented by Gabriel (1985) will be used as an example. The data set is presented in Fig. 14.1b. It consists of mean monthly temperatures (10 ºC) from 20 weather stations for six different months in The data are stored in the GenStat spreadsheet Gabriel.gsh. A scatter plot matrix of the six temperature variables is obtained using Graphics Scatter Plot Matrix. Select all the temperature variables for plotting by taking them into the Data list, as shown in Fig. 14.1c. Fig. 14.1c Selected data columns for the scatter plot matrix Fig. 14.1d Scatter plot matrix of the temperature data Clearly, from Fig. 14.1d, there are associations between the 6 temperature variables, some linear and others non-linear. The correlations between the temperature variables can be summarised in a correlation matrix by choosing Stats Summary Statistics Correlations, and again selecting all 6 temperature variables. See Fig. 14.1e. 129

131 14 Basic multivariate methods Fig. 14.1e Calculating correlations The correlation matrix (Fig. 14.1f) is displayed in the output window. Fig. 14.1f Correlation matrix for the temperature data *** Correlation matrix *** Jan Mar May Jul Sep Nov Jan Mar May Jul Sep Nov For example, the correlation between the January and March temperatures is The matrix is symmetric and hence the upper part, above the main diagonal is not displayed. As this example is purely for illustration we will not trouble ourselves over the inadequacy of correlation coefficients for summarising non-linear associations. In the following sections a number of topics are considered. In Chapter 14.2, GenStat s matrix calculator is demonstrated. This calculator allows many of the matrix algebra calculations underlying multivariate methods to be performed manually, and hence aids understanding of the methods. In Chapter 14.3 we describe principal components analysis or empirical orthogonal function analysis as it is sometimes known in climatology. This is a popular multivariate technique. Another common technique, cluster analysis, is the topic of Chapter Understanding the Concepts Matrix algebra plays a central role in multivariate methods. As an example consider the 6 6 correlation matrix from Chapter 14.1, and denote it by R. GenStat has a convenient matrix calculator that can be used to perform many matrix operations. For illustration, the eigenvalues of the matrix R will be determined. One way to produce R is, shown in Fig. 14.2c. This uses the Correlations dialogue box and the Save Correlations option, see Fig. 14.2a. Fig. 14.2a Saving the correlation matrix 130

132 14 Basic multivariate methods The name of the matrix is specified as R. Running the Correlations dialogue box results in the matrix R being created. (The correlation matrix can also be displayed in a spreadsheet by choosing the corresponding option, see Fig. 14.2a.) In situations where R has to be entered it can be typed into a Symmetric Matrix spreadsheet. Use File New Spreadheet tab Symmetric Matrix spreadsheet icon, see Fig. 14.2b In the dialogue box specify the number of columns to be 6 and name the matrix R, as shown in Fig. 14.2ab 9. Then enter the matrix, as shown in Fig. 14.2c. Fig. 14.2b Specifying a symmetric matrix Fig. 14.2c Typing the correlations The eigenvalues may then be found using Data Matrix Calculations. The Matrix Calculations dialogue box lists data available for calculations, see Fig. 14.2d. The matrices available are R (and possibly also R2). Note the prefix S: in Fig. 14.2d denoting a symmetric matrix. The blank top line in the dialogue in Fig. 14.2d is where the function or expression to be evaluated is typed. We want to determine the eigenvalues of R. To do this click the Eigenvalues button, which takes the EVALUES function to the top line of the dialogue box, and specify the argument of the function to be R by typing or selecting it into the brackets. Fig. 14.2d The matrix calculator Fig. 14.2e The eigenvalues The result must also be saved (to a diagonal 6 6 matrix in this case) specified as eigen in Fig. 14.2d, which is displayed in a spreadsheet. (The results can also be printed in the output window by choosing the appropriate option.) The eigenvalues of R, 3.980, 1.826,, are displayed in Fig. 14.2e. 9 If you are trying both methods, name the matrix differently, perhaps R2. 131
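The same two steps, forming R and extracting its eigenvalues, take only a few lines outside GenStat, for example with numpy. The file name below is a hypothetical export of Gabriel.gsh, assumed to contain one row per station and six numeric temperature columns with no header.

# Correlation matrix of the six temperature variables and its eigenvalues.
import numpy as np

temps = np.loadtxt("gabriel_temps.csv", delimiter=",")   # hypothetical export
R = np.corrcoef(temps, rowvar=False)                     # 6 x 6 correlation matrix
eigenvalues = np.linalg.eigvalsh(R)[::-1]                # largest first
print(np.round(R, 3))
print(np.round(eigenvalues, 3))                          # cf. 3.980, 1.826, ...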

133 14 Basic multivariate methods GenStat displays diagonal matrices like eigen as a single column. Note eigen will appear in the Available Data list pre-fixed with D: to indicate it is a diagonal matrix. One use of the matrix calculator is in training courses, either on matrices themselves or multivariate methods. If the aim is to get results for a particular method then the Statistics Multivariate Analysis dialogs would be used, as we shall do in the rest of this chapter Principal Components Analysis (Empirical Orthogonal Function Analysis) The presence of high correlations amongst the six temperature variables, shown in Chapter 14.1, suggests there is some redundant information, and that less than six dimensions may be required to adequately summarise the variation between the 20 stations. Principal components analysis (PCA), or empirical orthogonal function analysis, is often used to investigate this. The technique treats the variables as being on an equal footing, thus we do not have response variables and explanatory variables, as we would in a regression study. Further, PCA is suitable only for continuous variables, like temperature. PCA transforms the original temperature variables to a new set of uncorrelated variables, called the principal components. They are simple linear combinations of the temperature variables. They are derived such that the first principal component accounts for as much variation in the original data as possible, the second principal component accounts for as much of the remaining variation as possible and is uncorrelated with the first principal component, and so on. Geometrically PCA is an orthogonal rotation in the six-dimensional temperature space, with the six principal components representing the new axes. The principal components are given by the (normalised) eigenvectors of the covariance matrix for the 6 temperature variables. The corresponding eigenvalues of the matrix give the variance accounted for by each principal component. For presentation purposes it is convenient to consider the temperature variables to be mean-centered. However, it is more common for PCA to be done on the correlation matrix. This is equivalent to performing a PCA on the covariance matrix of the standardised variables (mean 0, variance 1). This removes the general problems of different measurement scales and avoids the domination of the results by a few variables with large variances. To perform PCA in GenStat use Stats Multivariate Analysis Principal Components. Select the 6 temperature variables and the Correlation Matrix option, as shown in Fig 14.3a. By default GenStat only calculates the first two principal components. To show all the results we increase this to the maximum, 6. To do this select Options and specify the Number of Dimensions to be 6, see Fig 14.3b. Fig. 14.3a Performing a PCA Fig. 14.3b Changing the options The results are displayed in the output window. 132

134 14 Basic multivariate methods Fig 14.3c Results of a PCA on the correlation matrix of the temperature data ***** Principal components analysis ***** *** Latent Roots *** *** Percentage variation *** *** Trace *** *** Latent Vectors (Loadings) *** Jan Mar May Jul Sep Nov The output in Fig. 14.3c begins with the latent roots. These are the eigenvalues of the correlation matrix; see Section Hence, the first principal component has a variance of The total (variation) variance in the original standardised data set, that is the sum of the variances of the 6 standardised temperature variables, is 6. This is also equal to the total variance of the principal components. The variance of each principal component as a percentage of the total variance is given next in Fig. 14.3c. For example, the first principal component accounts for 66.34% of the total variance. The total variance of the original standardised data is actually equal to the trace (sum of the main diagonal) of the correlation matrix. This is given in the third part of the output. The fourth and final part of the PCA output gives the principal components (latent vectors). For example the first principal component is given by Z1 = 0.33X1 0.44X2 0.47X3 0.33X4 0.42X5 0.44X6, where X 1 to X 6 are the standardised temperature variables Jan, Mar,,Nov. (Each principal component is unique up to multiplication by 1.) Usually, it is hoped that the first few principal components (here one or two) account for most of the variation in the data. Of course, this is a matter for debate. There are several generic rules for deciding how many principal components should be retained and considered further. Some rules in use are: Retain enough principal components to account for an arbitrary percentage of the total variance, say 70%, 80% or 90%. For PCA on the correlation matrix (standardised data) retain principal components whose eigenvalues are greater than 1. Plot the ordered eigenvalues or equivalently the percentage of variance accounted for by each principal component, and choose the number of principal components corresponding to an elbow. The plot is known as a scree plot. A scree plot can be produced in GenStat. Return to the Principal Components dialogue box. (Stats Multivariate Analysis Principal Components.) Select Options Scree Plot. The scree plot is shown in Fig 14.3d. 133
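A PCA on the correlation matrix can also be done by hand with numpy, which makes the link between the latent roots, the loadings and the scores explicit. The file name is again a hypothetical export of Gabriel.gsh, and the sign of each component is arbitrary, so the loadings and scores may differ in sign from GenStat's.

# Principal components analysis on the correlation matrix, done directly.
import numpy as np

temps = np.loadtxt("gabriel_temps.csv", delimiter=",")          # hypothetical export
Z = (temps - temps.mean(axis=0)) / temps.std(axis=0, ddof=1)    # standardise
R = np.corrcoef(temps, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                # sort largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(np.round(eigvals, 3))                        # latent roots
print(np.round(100 * eigvals / eigvals.sum(), 2))  # percentage variation
print(np.round(eigvecs[:, :2], 3))                 # loadings of the first two PCs

scores = Z @ eigvecs[:, :2]                        # principal component scores
print(np.round(scores, 3))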

135 14 Basic multivariate methods Fig. 14.3d A scree plot The first two principal components account for 96.8%; the first about 66% and the second about 30%. For most applications this total is more than adequate and indicates that the effective dimensionality of the data is two. These first two components are Z1 = 0.33X1 0.44X2 0.47X3 0.33X4 0.42X5 0.44X 6, Z2 =+ 0.54X X2 0.24X3 0.54X4 0.38X X 6. Data analysts sometimes try to use the coefficients to interpret these components. This is obviously subjective; one problem being sampling error. It is important to remember the coefficients (or loadings in GenStat terminology) defining the principal components are sample estimates. As there is no underlying statistical model the precision of these estimates is unknown. Despite these reservations some interpretation of the retained principal components can be attempted. Closer inspection of the linear combination forming the first principal component, Z 1, suggests it is essentially an average of the standardised temperature measurements. That is Z1 0.40[X1 + X2 + X3 + X4 + X5 + X 6], and hence Z 1 can be considered to be representing overall warmth. Clearly stations recording warmer temperatures will tend to have a lower value for Z 1, whereas colder stations will tend to have a higher value for Z 1. We can try to interpret the second component in a similar manner. Usually though, it becomes more difficult to interpret further principal components. It appears that the second principal component is some kind of contrast between the temperature in the months January, March and November versus May, June and September. Roughly Z2 X 1 + (X2 + X 6) X 4 (X3 X 5) Thus, for northern-hemisphere stations, this component may be representing a seasonal contrast, Winter/Spring versus Summer/Autumn. The values of Z 1 and Z 2 for the twenty weather stations are known as the principal component scores. As there are only two principal components the score can be represented in a two-dimensional scatter plot. To do this in GenStat return to the Principal Components Analysis dialogue box and in the Options sub-dialogue box de-select the Scree Plot, change the Number of Dimensions to be 2 (to indicate we are only interested in the first two components), select Scatter Plot Matrix of Principal Component Scores and specify the plot is to have each point labelled with the station number (note Station must be declared as text first!). Finally select Scores to print the principal component scores in the output window. The Options sub-dialogue box should now look like the one in Fig. 14.3e. 134

136 14 Basic multivariate methods Fig. 14.3e Producing a scatter plot of the first two principal component scores The principal component scores are listed and plotted in Fig. 14.3f and Fig 14.3g respectively. Fig. 14.3f The principal component scores (edited) *** Principal Component Scores *** : : :

Fig. 14.3g Scatter plot of the first two principal component scores

The scatter plot of the principal component scores can be used for further exploratory work. This may include visual inspection for clusters or groups of similar stations, or the inclusion of auxiliary information in the plot, with the aim of highlighting potential systematic differences or structure.

A biplot is a modification of the plot of the principal component scores. This plot can greatly help with the interpretation of a principal components analysis. The aim of a biplot is to summarise the units and the variables under consideration in one graph. There are several types of biplot, the commonly used one incorporating the correlations between the temperature variables through the use of angles between vectors. This is the biplot described here. For the temperature example the first two principal component scores are standardised to have equal variance and plotted against one another. This graphically summarises the stations (units), such that the inter-point distances are, approximately, the standardised statistical distances between stations. Information concerning the six (standardised) temperature variables is added by the use of vectors originating from the origin of the graph, one for each temperature variable. The length of a vector represents, approximately, the variability; the length being proportional to the standard deviation of the corresponding variable. Hence, as we are dealing with standardised data, we would expect the lengths to be similar. The cosine of the angle between pairs of vectors approximates the corresponding correlation between the variables. This means that if vectors occur very close together, they are highly correlated. A biplot of the temperature data is given in Fig. 14.3h. The biplot was produced using the GenStat syntax: the BIPLOT procedure (with method = variate) in conjunction with the DBIPLOT procedure from the Biometrics library, using the old-style graphics viewer (4.1).

Fig. 14.3h A biplot of the temperature data 11

One important point to note is that the above biplot has used -Z1 instead of Z1 in its construction. 12 From the biplot, November and March temperatures are very highly correlated. A further use of the biplot is that the approximate order of the 20 stations with respect to any one of the temperature variables can be deduced from the plot by projecting the points onto a line that passes through the origin in the direction of the temperature vector under consideration. The order of the data is approximately represented by the order of the projected stations; stations further along in the direction of the arrow having higher values for the variable. For example, from Fig. 14.3h it can be seen that stations 4, 5 and 11 have relatively low temperatures in May, July and September.

Further, the first, or horizontal, coordinate of the vectors is proportional to the coefficients defining the first principal component. Hence, from the biplot it is immediately apparent that all six standardised temperature variables contribute positively, and roughly equally, to the first principal component (-Z1). Likewise the second (vertical) coordinate of the vectors represents the coefficients of the second principal component. From these coordinates it can be seen that the coefficients for January, March and November are positive while the others are negative. Hence the second component represents a contrast, as described earlier.

Returning to the interpretation of the two principal components, a rotation is sometimes applied to find a new set. This is often done when the principal components are difficult to interpret, and it is hoped that the rotated components will be more easily interpretable. For this example we have a reasonable interpretation for the principal components. However, for illustration we apply a rotation and discuss the results. There are several rotation methods, the most common being varimax. This method seeks a rotation that results in the rotated components having extreme coefficients (large or small in absolute value), such that each temperature variable loads highly on just one rotated component; that is, each temperature variable does not have a large coefficient for more than one principal component.

11 The presented biplot can be improved. For example the presented vectors could be extended in both directions with faint dashed lines. This would allow easier visual projection of the points onto the dimension represented by a particular vector.
12 We hope a future version of GenStat will give consistent results, in addition to a biplot being included in the GUI.

To apply a varimax rotation to the first two principal components of the temperature data, return to the Principal Components dialogue and select Rotate Loadings. The default method is Varimax, so click OK to give a new set of coefficients, shown in Fig. 14.3j.

Fig. 14.3j Varimax rotated principal components (*** Rotated factors ***)

The first rotated component, Z1,varimax, has coefficients of about 0.50, 0.62 and 0.57 on X3, X4 and X5 and only small coefficients (magnitudes of roughly 0.16, 0.08 and 0.09) on X1, X2 and X6. The second, Z2,varimax, has its large coefficients on X1, X2 and X6 (the largest being about 0.61, on X1) and only small coefficients on the remaining variables. Since the rotated coefficients are estimates, the two components can be written approximately as

Z1,varimax is approximately 0.55 [X3 + X4 + X5]
Z2,varimax is approximately proportional to [X1 + X2 + X6].

Before attempting to interpret the rotated components note that they are no longer principal, in the sense that Z1,varimax no longer corresponds to the linear combination of the X-variables with maximum variance. The first, Z1,varimax, is essentially an average of the May, July and September temperatures, while Z2,varimax is an average for January, March and November. Hence, the first two rotated components represent average Summer/Autumn and Winter/Spring temperatures respectively, for northern-hemisphere stations. A scatter plot of the rotated component scores is presented in Fig. 14.3k. This was produced manually.
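To make the idea of "rotating the loadings" concrete, here is a minimal sketch in Python (not GenStat) of the standard varimax algorithm. The loading matrix used below is hypothetical, for illustration only; it is not the set of coefficients reported in Fig. 14.3j.

# Varimax rotation sketch (illustrative Python, not GenStat).
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    # Iteratively find the orthogonal rotation R that maximises the varimax criterion.
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lam = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lam ** 3 - (gamma / p) * Lam @ np.diag((Lam ** 2).sum(axis=0))))
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return L @ R

# Hypothetical 6 x 2 loading matrix (six monthly temperatures, two components).
loadings = np.array([[0.40,  0.45], [0.41,  0.40], [0.42, -0.30],
                     [0.41, -0.45], [0.42, -0.40], [0.40,  0.42]])
print(varimax(loadings).round(2))

The rotation leaves the two-dimensional subspace unchanged; it only redistributes the variance between the two components so that each variable tends to load heavily on just one of them.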

Fig. 14.3k Scatter plot of (varimax) rotated component scores

14.4 Cluster Analysis

Cluster analysis is the name given to a set of techniques concerned with finding natural clusters (groupings) of units (objects) in a sample of n units. Cluster analysis methods may be classified into hierarchical and non-hierarchical. Here we consider only agglomerative hierarchical clustering methods.

Agglomerative hierarchical cluster analysis generally operates on a symmetric n x n matrix of distances or dissimilarities, describing the pairwise distances between units. Alternatively a matrix of similarities may be used, giving the similarities between pairs of units. GenStat uses a similarity matrix as the starting point for cluster analysis. Sometimes a similarity matrix does not arise naturally and needs to be calculated. For example, consider Gabriel's temperature data. One might search for clusters of stations, using the six recorded temperature variables. To form a symmetric 20 x 20 matrix of similarities containing a similarity coefficient for each pair of stations, use Data Form Similarity Matrix. Select the six temperature variables into the Data Values list, as shown in Fig. 14.4a, specify the Name of New Matrix (similarity matrix) to be S, and label the rows of the similarity matrix using the values in the Station column by taking Station into the Unit Labels box. The default method (Type of Test) for calculating the similarity coefficients is based on a squared-Euclidean-distance-type measure where each of the six variables has been standardised by its respective sample range. The standardisation puts each of the variables on an equal footing, similar to standardisation in principal components analysis, and eliminates the effects of different measurement units. This may not always

14 We hope a later version of GenStat allows a) this graph to be produced and b) the rotated scores to be listed and saved directly from the Principal Components Analysis dialogue.

141 14 Basic multivariate methods be appropriate. The GenStat manual gives formulae for the available similarity measures. 15 Other methods are available and it is important to note that similarity measures do not have to be directly derived from Euclidean distance. 16 Fig. 14.4a Calculating a similarity matrix In certain contexts a correlation coefficient may be the basis of a similarity measure, particularly where similarity between continuous variables is of interest. However the most appropriate choice depends on the type of data and the context of the study. For the purpose of this example use the default choice of Type of Test, which for continuous variables like temperature is permissible. The matrix S may be displayed in the output window using Data Display. The labels and the first five columns of the matrix are given in Fig. 14.4b. Fig. 14.4b Part of the S similarity matrix S Notice the similarity coefficients are all between 0 and 1. A similarity coefficient of 1 indicates two stations are identical; at the other extreme 0 indicates the two stations differ maximally. Having determined a similarity matrix a cluster analysis may be performed. The algorithm begins with 20 groups, each containing one station. The most similar are merged into a new group. The process continues with the 19 groups. The two most similar groups are merged, resulting in 18 groups. This process continues until all 20 stations have been merged into a single group. While similarities between groups of size 1 are defined (from the similarity matrix) there is no unique way of 15 The command mode (FSIMILARITY directive) is more flexible and will allow appropriate similarity coefficients to be calculated for data that consists of a mixture of data types. 16 Ward s method, a popular clustering method among some analysts, which utilises sums of squares, is not included in GenStat. We hope this will be rectified in a later version of GenStat. 140

142 14 Basic multivariate methods characterising the similarity between two groups of stations of arbitrary size. This has given rise to many different clustering methods corresponding to different ways of defining the new similarities. To perform a cluster analysis choose Stats Multivariate Analysis Cluster Analysis Hierarchical. Specify S to be the Similarity Matrix. See Fig. 14.4c. The default clustering method is single-link (nearest neighbour), which uses the maximum pairwise similarity between stations in the two different groups to define a similarity coefficient between the two groups. For this example use the default method. Fig. 14.4c Performing a cluster analysis Fig. 14.4d Selecting options Before continuing select Options. In the sub-dialogue box select Display Similarity Axis, as shown in Fig. 14.4d. This gives the dendrogram an axis of measurement to read-off similarity levels at which merges take place between groups in the clustering process. Also label the units (stations) by specifying Station, as shown in Fig. 14.4d. 141

143 14 Basic multivariate methods Fig. 14.4e Dendrogram from cluster analysis 17 The resulting dendrogram summarises the similarity levels at which merges take place between groups in the clustering process. The branches (horizontal lines) represent the groups. These are merged at nodes to give new larger groups. For example, in Fig. 14.4e groups (4, 5) and (11) are joined in the clustering process and similarity between them is about Examination of the dendrogram suggests a cluster consisting of stations 2, 6, 7, 8, 9, 10, 13, 14, 15 and 16. Other clusters, appear to be smaller in size. Searching for clusters is somewhat arbitrary, and problems often arise for a number of reasons. For cluster analysis these include the following. Choice of similarity measure between units. Spurious clusters (when there are no natural groupings). There is no optimal clustering method covering all situations. Different clustering methods, e.g. single-link, average link, differ somewhat in their ability to reveal certain types of groupings. Hence, a particular clustering method may not reveal groupings when they exist or impose a certain structure on any groups that are revealed. For a summary of the performance of different methods see Sharma (1996). Different clustering methods may lead to quite different solutions. The results are sensitive to the variables used. Only relevant variables should be used. Other information may be ignored, such as spatial information, for the temperature data. The interpretation of the results is subjective. These points have not been raised to imply cluster analysis should be avoided, but that it should be used with care. For the novice it is tempting to perform cluster analysis on the same data set using many different clustering methods, and then choose the solution most suited to their research. This approach can be dangerous. Generally, different clustering methods will only reveal the same groupings if they are well separated. This is often not the case. If different cluster solutions are discovered, care is needed in the interpretation to avoid spurious results, which cannot be reproduced. Ideally, the best approach is a choice of similarity measure and clustering method based 17 We hope a future version produces more informative axis labels. 142

144 14 Basic multivariate methods on the type of data and informed scientific judgement concerning the context of the problem. Sharma (1996) discusses some techniques that may help with the choice of the number of clusters and the reliability and external validity of a cluster solution. Chatfield and Collins (1980) are rather negative about the use of cluster analysis and prefer more visual techniques as an alternative. For example, with the temperature data, this principal components analysis (PCA) revealed the variation between the 20 weather stations could be adequately summarised in two dimensions (see Section 14.3). Examination of either Fig. 14.3h showing the a scatter plot of the unrotated principal component scores, or Fig. 14.3k showing the plot for the varimax rotated scores suggests, once again, a cluster consisting of stations 2, 6, 7, 8, 9, 10, 13, 14, 15 and 16 and other clusters, if any, of smaller size. This illustrates one use of PCA, normally to search for clusters. PCA can also help to profile any clusters. For example, consider the varimax rotated principal components. In Chapter 14.3 it was stated that the first and second rotated components represent an average for spring/summer and autumn/winter respectively. Hence, rather than consider Fig. 14.3k further, a plot of the arithmetic sample mean for the two periods (May/July/September and January/March/November) based on the standardised data is more meaningful and a plot is given in Fig. 14.4f. Comparing Fig 14.3k and Fig 14.4f it is clear the two scatter plots are essentially very similar. Fig 14.4f A plot of the mean January/March/November temperature against the mean May/July/September temperature for the 20 weather stations (using standardised data) Examination of Fig 14.4f indicates the cluster of 10 stations proposed earlier all have similar recorded temperatures on average, in the two periods of the year. Further, the temperatures are relatively high, irrespective of the period. Some authors, for example Chatfield and Collins (1980), suggest using the first principal component scores in a cluster analysis, rather than a large number of continuous variables. However, agreement amongst statisticians is not complete, and Manly (1994) suggests avoiding this route. 143
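For readers who want to see the mechanics outside GenStat, the sketch below works through the same route in Python/SciPy: form a range-standardised, squared-Euclidean-type similarity matrix between stations, convert it to dissimilarities and apply single-link (nearest-neighbour) hierarchical clustering with a dendrogram. The data and the exact form of the similarity coefficient are assumptions; GenStat's own default coefficient may differ in detail.

# Similarity matrix and single-link clustering sketch (illustrative Python, not GenStat).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

X = np.random.default_rng(1).uniform(5, 30, size=(20, 6))   # placeholder: 20 stations x 6 variables

# Range-standardised, squared-Euclidean-type similarity (one plausible form); values lie in [0, 1].
Z = X / (X.max(axis=0) - X.min(axis=0))
n = len(Z)
S = np.ones((n, n))
for i in range(n):
    for j in range(i + 1, n):
        S[i, j] = S[j, i] = 1.0 - np.mean((Z[i] - Z[j]) ** 2)

# Single-link clustering on dissimilarity = 1 - similarity.
D = 1.0 - S
np.fill_diagonal(D, 0.0)
link = linkage(squareform(D, checks=False), method="single")

dendrogram(link, labels=[f"st{i + 1}" for i in range(n)])
plt.ylabel("1 - similarity")
plt.show()

The choice of linkage method (single, average, complete, and so on) can be changed in one argument, which makes it easy to see how sensitive the suggested groupings are to that choice, one of the cautions raised above.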

14.5 Concluding Remark

In this chapter we have followed Gabriel's treatment of the temperature data to illustrate some important multivariate techniques. However, the data has a well-defined geographic structure, dividing into the southern and northern hemispheres, as shown in Fig. 14.5a, and this structure has been ignored in the analysis. This is a common failing in the use of multivariate methods in climatology and, except for illustration, we would not usually recommend analysing data using techniques which ignore the structure in this way.

Fig. 14.5a A map showing the locations of the 20 weather stations (from Gabriel)

14.6 References

Chatfield, C. and Collins, A. J. (1980) Introduction to Multivariate Analysis. Chapman & Hall: London.
Gabriel, K. R. (1985) Exploratory Multivariate Analysis of a Single Batch of Data. In: Probability, Statistics, and Decision Making in the Atmospheric Sciences (eds Murphy, A. H. and Katz, R. W.). Westview Press: Boulder, Colorado.
Manly, B. F. J. (1994) Multivariate Statistical Methods: A Primer. Chapman & Hall: London.
Sharma, S. (1996) Applied Multivariate Techniques. Wiley: New York.

15. Further methods

15.1 Introduction

In this chapter we consider some of the more specialised types of analysis that are needed to process climatic and other environmental data. In Chapters 15.2 and 15.3 we look at the problem of estimating extremes. These can be extreme winds, rainfall or river flow. The analysis of circular data is considered in Chapters 15.4 and 15.5. In climatology the obvious example is the analysis of wind direction. With wind speed and wind direction together, a common display is called a rose diagram, effectively a form of stacked histogram. We give an example in Chapter 15.4 and look at some general issues concerned with directional data in Chapter 15.5. Time is also circular, on a daily or annual basis, so the same methods can be used to consider the time of day at which events occur, such as maximum temperatures or the start of a rainfall event.

15.2 Extremes 18

Extremes are important in the analysis of both climatic and hydrologic data. The recent book by Coles (2001) includes references to macros for the analysis of extremes using S-Plus, together with details of where they are available. Similar facilities have been added to GenStat Version 7, and are accessed using Statistics Distributions Extremes (Fig. 15.2a).

Fig. 15.2a Modelling extreme values

There are two ways of modelling extremes: either to use the largest value in a given period, often a year, or to use all values that are above a large threshold. Both can be modelled in GenStat, but we consider only an example of the first type in this section. For illustration we use an example that is originally from Changary (1982) and is also used in another recent book on extremes, by Reiss and Thomas (2001). Changary recorded 30 annual maximum wind speeds (mph) for Jacksonville, Florida from 1950 to 1979 and the corresponding storm type: tropical or non-tropical. The first 10 years of data are shown in Fig. 15.2b. The data are stored in the GenStat spreadsheet windspeed.gsh.

18 This is not yet in What's New in Help Reference Manual New Features because the GenStat procedure has yet to be refereed.
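The first approach (using the largest value in each year, the so-called block maxima) amounts to reducing a daily series to one value per year before any distribution is fitted. As a hedged illustration, here is a minimal Python/pandas sketch (not GenStat); the column names and values are placeholders, not the Jacksonville data.

# Block-maxima sketch: reduce a daily wind-speed series to annual maxima (illustrative Python).
import numpy as np
import pandas as pd

dates = pd.date_range("1950-01-01", "1979-12-31", freq="D")
daily = pd.DataFrame({
    "date": dates,
    "wind_mph": np.random.default_rng(0).gamma(shape=4.0, scale=5.0, size=len(dates)),
})

annual_max = daily.groupby(daily["date"].dt.year)["wind_mph"].max()
print(annual_max.head())   # one maximum per calendar year, ready for extreme-value fitting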

Fig. 15.2b Annual maximum wind speeds (mph) from 1950 to 1979 with storm type

We start by considering the 30 wind speeds as a single sample. This ignores any possible trend in the data, or possible dependence of the extremes on whether they result from tropical storms or not. We consider that type of complication in Chapter 15.3.

The Gumbel, or Type I extreme value, distribution is often used to model variation in the largest extreme. Denoting maximum wind speed generically by X, the cumulative distribution function (cdf) of X, F(x), is given by

F(x) = exp{ -exp[ -(x - µ)/σ ] },  for -∞ < x < +∞,

where µ is a location parameter and σ is a scale parameter. This is a special case of the generalised extreme value distribution (GEV), which has a further shape parameter ξ. This more general distribution also encompasses the Fréchet (Type II extreme value) distribution and the Weibull (Type III extreme value) distribution. See Coles (2001) for details.

We fit a Gumbel distribution to the windspeed data using Statistics Distributions Extremes Maxima. By default GenStat fits a GEV. For a Gumbel, set the Shape Parameter Eta to be 0, as shown in Fig. 15.2c, and select mph into the Data Values box.

Fig. 15.2c Fitting a Gumbel distribution

GenStat fits the Gumbel model using the method of maximum likelihood. The results are displayed in the output window (Fig. 15.2d). The maximum likelihood parameter estimates of µ and σ are reported first, together with their standard errors. The rest of the output contains various goodness-of-fit statistics. They provide no evidence against the assumption of a Gumbel distribution.
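As a cross-check outside GenStat, a Gumbel distribution can also be fitted by maximum likelihood in a few lines of Python/SciPy. This is only a sketch: the speeds array below is a placeholder, not the windspeed.gsh data, so the estimates will not reproduce Fig. 15.2d.

# Maximum-likelihood Gumbel fit (illustrative Python, not GenStat).
import numpy as np
from scipy import stats

speeds = np.array([45, 52, 60, 48, 55, 70, 43, 49, 58, 65,
                   47, 51, 62, 44, 56, 68, 50, 53, 59, 46,
                   54, 61, 42, 57, 66, 48, 52, 63, 49, 55], dtype=float)  # placeholder values

mu_hat, sigma_hat = stats.gumbel_r.fit(speeds)     # ML estimates of location and scale
loglik = stats.gumbel_r.logpdf(speeds, mu_hat, sigma_hat).sum()
print(f"mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}, log-likelihood = {loglik:.2f}")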

148 15 Further methods Fig. 15.2d A fitted Gumbel distribution (edited output) Gumbel Extreme Value Distribution (GEV with Eta = 0): CDF(x) = EXP(-EXP(-(x - Mu)/Sigma) *** Estimates of Gumbel parameters *** estimate "s.e." Mu Sigma Eta 0 FIXED Maximum Log-Likelihood = Maximum value of Gumbel Distribution is Infinite (Eta >= 0) Goodness of Fit Test for mph following a Gumbel distribution (i.e. ETA=0) Critical values of test statistics (MARGINAL tests) Significance level Test statistic 15% 10% 5% 2.5% 1% Anderson-Darling Cramer-von Mises Watson Test statistic Type of Anderson- Cramertest Variate(s) Darling von Mises Watson Marginal ?, *, ** indicate significance at 10%, 5% and 1% levels respectively GenStat also provides, by default, three graphs: a quantile-quantile plot, a kernel density plot and a return level plot. These are presented in Fig. 15.2e Fig. 15.2g respectively. These plots can be used to assess the adequacy of a Gumbel distribution for modelling maximum windspeed. Fig. 15.2e Quantile-quantile plot 147
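A quantile-quantile plot of the kind shown in Fig. 15.2e can be reproduced, at least approximately, outside GenStat. The sketch below uses SciPy's probability-plot routine with the Gumbel distribution and the placeholder speeds array from the previous sketch.

# Gumbel quantile-quantile plot (illustrative Python, not GenStat output).
import matplotlib.pyplot as plt
from scipy import stats

fig, ax = plt.subplots()
stats.probplot(speeds, dist="gumbel_r", plot=ax)   # ordered data against Gumbel quantiles
ax.set_title("Gumbel Q-Q plot of annual maximum wind speed")
plt.show()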

Fig. 15.2g Return level plot

All three plots suggest a Gumbel distribution is reasonable, apart possibly from a slight deviation in the tail. We assume a Gumbel distribution is satisfactory.

In addition to model checking, the plot shown in Fig. 15.2g can be used to estimate return levels for given return periods. For example, we might be interested in estimating the upper 10% point of the distribution of annual maximum wind speeds. The period of 10 years is commonly referred to as the return period, and the corresponding value of the distribution is the return level. The probability 0.1 is called a return probability. From Fig. 15.2g a return period of 10 years gives approximately 60 mph for the return level.

GenStat can estimate return levels corresponding to return probabilities (i.e. periods) and vice versa. Approximate 95% confidence intervals are provided. Return to the Generalized Extreme Value dialogue box (Statistics Distributions Extremes Maxima). Select Calculate Predictions. Ensure Return Levels is selected and specify a return probability of 0.1 (which corresponds to a return period of 10 years), as shown in Fig. 15.2h.

Fig 15.2h Estimating a return level

The relevant GenStat output is given in Fig 15.2i.
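For the Gumbel case the return level has a simple closed form, which makes the calculation in Fig. 15.2h easy to check by hand: for return period T (return probability 1/T) the level is mu - sigma*log(-log(1 - 1/T)). The Python sketch below is illustrative and uses the placeholder estimates from the earlier fitting sketch, so it will not reproduce the value reported by GenStat.

# Return-level calculation for a fitted Gumbel(mu, sigma) (illustrative Python).
import numpy as np
from scipy import stats

def gumbel_return_level(mu, sigma, T):
    # Level exceeded on average once every T years (return probability 1/T).
    return mu - sigma * np.log(-np.log(1.0 - 1.0 / T))

print(gumbel_return_level(mu_hat, sigma_hat, T=10))
print(stats.gumbel_r.ppf(0.9, mu_hat, sigma_hat))   # same thing: the upper 10% point of the fitted cdf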

150 15 Further methods Fig 15.2i Estimated return level 95.0 % Approximate Intervals for Return Periods Probability Return Period Level Lower Upper From Fig. 15.2i for a return probability of 0.1 (return period of 10 years) the estimated return level is 60.0 mph. An approximate 95% confidence interval for the true return level is 51.4 to 68.6 mph. The approximation is based on the normal distribution, yielding a symmetric interval. The normal approximation can be poor, particularly when dealing with long return periods (Coles (2001)). A profile likelihood method generally gives a better approximation, but is more computationally expensive, and is produced in GenStat by selecting Options in the Generalized Extreme Value dialogue box, followed by Exact confidence intervals 19 (Fig. 15.2j). Fig 15.2j Calculating a profile likelihood confidence interval for the true return level The output is shown in Fig. 15.2k. Fig. 15.2k A profile likelihood confidence interval for the true return level 95.0 % Profile Likelihood Intervals for Return Periods Probability Return Period Level Lower Upper The choice of method for calculating the confidence interval leaves the estimate the same and only affects the confidence interval. The profile likelihood 95% confidence interval for the 10 year return level is 54.8 to 67.4 mph More on extremes We have assumed the distribution of annual maximum wind speed does not depend on storm type (tropical and non-tropical), and that there is no trend with year (time). GenStat can model maximum wind speed incorporating explanatory variables like type of storm and year. Note these models do not belong to the class of generalised linear models, which we considered in Chapter 13, as the extreme value distributions do not belong to the exponential family. For illustration we consider the dependency of annual maximum wind speed on the type of storm (a factor), and assume the error distribution is Gumbel. In the Generalized Extreme Value dialogue box add Type to the Groups box, and make sure the other settings are as shown in Fig. 15.3a. 19 These are the profile likelihood confidence intervals. 149

151 15 Further methods Fig. 15.3a Fitting a Gumbel model with storm type as an explanatory variable. The results are displayed in the output window. Fig 15.3b contains edited output. Fig 15.3b Gumbel model with type of storm as an explanatory variable Gumbel Extreme Value Distribution (GEV with Eta = 0): CDF(x) = EXP(-EXP(-(x - Mu)/Sigma) Fitting Groups term: Type *** Estimates of Gumbel parameters *** estimate "s.e." Mu(Tropical) Sigma Eta 0 FIXED Non-tropical Maximum Log-Likelihood = Goodness of Fit Test for mph following a Gumbel distribution (i.e. ETA=0) Critical values of test statistics (MARGINAL tests) Significance level Test statistic 15% 10% 5% 2.5% 1% Anderson-Darling Cramer-von Mises Watson Test statistic Type of Anderson- Cramertest Variate(s) Darling von Mises Watson Marginal ?, *, ** indicate significance at 10%, 5% and 1% levels respectively Levels given for Type in Tropical 95.0 % Approximate Intervals for Return Periods

Probability Return Period Level Lower Upper

GenStat assumes there are two Gumbel distributions, one corresponding to tropical storms and the other to non-tropical storms. Further, each type of storm is assumed to have a common scale parameter, σ, and this is estimated to be 7.22 (s.e. = 1.50), see Fig. 15.3b. Each storm type has its own location parameter. Denoting these by µtropical and µnon-tropical, GenStat (Fig. 15.3b) reports the following estimates:

µ̂tropical = 43.4 (s.e. = 3.68)
µ̂non-tropical - µ̂tropical = 0.47 (s.e. = 4.22)

In general GenStat estimates the location parameter corresponding to the first level of the factor under consideration (e.g. µ1), and the differences between the location parameters for subsequent levels and this first level (e.g. µ2 - µ1, µ3 - µ1, ...). The difference µnon-tropical - µtropical can be interpreted as the difference between the storm-type modes or, equivalently, the storm-type means. We are interested in whether the estimate of this difference is significant, to check if storm type has an effect on annual maximum wind speed. In this case the standard error shows the difference is clearly not significant.

To test the storm type effect formally we may use (i) a Wald-type test, or (ii) a likelihood ratio test (often regarded as superior). Neither is done automatically in GenStat.20 For illustration consider the likelihood ratio test. The test statistic, Χ², is given by

Χ² = -2( loge(L1) - loge(L2) )

where loge(L1) is the maximised value of the log-likelihood function for the model with no explanatory variables (from Fig. 15.2d) and loge(L2) is the corresponding maximum for the model with storm type as an explanatory variable (from Fig. 15.3b).21 This gives Χ² = 0.04. A large value of Χ² is evidence for a storm type effect, and here the value is clearly very small. Comparing Χ² with the upper percentage points of a chi-square distribution with 1 degree of freedom (1 being the number of extra model parameters to be estimated on including storm type as an explanatory variable) gives a p-value of about 0.84. Hence, there is no evidence for a storm type effect. The p-value may be calculated in GenStat using Data Probability Calculations and choosing the dialogue box settings as in Fig. 15.3c.

Fig 15.3c Calculating the p-value for the likelihood ratio test

Finally, the goodness-of-fit tests in Fig 15.3b are not significant, indicating no evidence for lack of fit.22

20 Perhaps these might be included in a future version of GenStat.
21 If increased accuracy is needed, the maximised log-likelihood value from Monitoring Parameter Estimates could be used.
22 The default graphs now refer to standardized mph. What is this?
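The likelihood-ratio calculation above is easy to reproduce in any package once the two maximised log-likelihoods are known. The Python sketch below is illustrative; the two log-likelihood values are placeholders standing in for the values reported in Figs 15.2d and 15.3b.

# Likelihood-ratio test for the storm-type effect (illustrative Python).
from scipy import stats

loglik_null = -110.00    # placeholder: model with no explanatory variables (from Fig. 15.2d)
loglik_type = -109.98    # placeholder: model with storm type as an explanatory variable (Fig. 15.3b)

chi2_stat = -2.0 * (loglik_null - loglik_type)     # the X^2 statistic defined above
p_value = stats.chi2.sf(chi2_stat, df=1)            # one extra parameter is estimated
print(f"X^2 = {chi2_stat:.2f}, p-value = {p_value:.3f}")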

153 15 Further methods The last part of Fig 15.3b is an estimated return level. For a return probability of 0.1 (return period of 10 years) the estimated return level is 59.6 mph. An approximate 95% confidence interval for the true return level is 50.0 to 70.3 mph. This estimate is for tropical storms only, as indicated in the output. In addition to allowing for differences between factor levels to be assessed, GenStat allows trends to be modelled. For example if we were interested in the trend over years (ignoring storm type) we could specify Year in the Trend box of the Generalized Extreme Value dialogue box 23. Once again the underlying model assumed by GenStat is that the scale parameter, σ, remains constant, but the location parameter is a linear function of time. The location parameter at year t, µ t, is given by µ t = µ + βt. Hence, µ in the above equation (the intercept) is the location parameter corresponding to t=0. The significance of the trend may be assessed, as before, using a Wald or likelihood ratio test. When modelling a trend GenStat estimates return levels and probabilities corresponding to t=0 (not sensible in this case). To obtain estimates corresponding to times other than t=0 the Year variable should be recoded, and the model refitted. GenStat also allows both Year and Type to be fitted simultaneously as explanatory variables, but these models are not considered further here. Censoring is another common problem in the study of extremes. In the context of the current example, suppose we were interested in studying maxima from non-tropical storms. In the 8 years where the extreme was from a tropical storm we know that the extreme from an ordinary storm was less than this value. Hence, the sample of 30 annual maximum wind speeds is composed of 22 actual values and 8 (left) censored values, e.g. in 1950 the maximum non-tropical storm wind speed was < 65 mph. (The more common direction of censoring is to know that the extreme was greater than the value recorded, perhaps because it exceeded what the instruments could handle. These are right censored values.) We assume there is no trend over time, and that we have a single sample of 30 annual maximum wind speeds, 8 of which are left censored. To indicate in GenStat which values are censored an extra column, Censor, is provided in the spreadsheet, where 1 = censored. To fit a Gumbel distribution to the sample of wind speeds, taking account of the censoring use Statistics Distributions Extremes Maxima, and make the specifications shown in Fig, 15. 3d. Fig 15.3d Fitting a Gumbel model in the presence of censored data The parameter estimates are shown in Fig. 15.3e. Fig. 15.3e Gumbel model with censored data (edited output) Gumbel Extreme Value Distribution (GEV with Eta = 0): CDF(x) = EXP(-EXP(-(x - Mu)/Sigma) 23 If there are convergence problems, then use (Year-1950). 152

8 data points are left censored
*** Estimates of Gumbel parameters ***
estimate "s.e."
Mu
Sigma
Eta 0 FIXED
Maximum Log-Likelihood =

Comparing the fitted Gumbel distribution in Fig. 15.3e with Fig. 15.2d, the introduction of the censoring has decreased the estimate of the location parameter to 41.45, and increased its standard error (slightly). This corresponds to what one would expect.

15.4 Directional Data: wind roses using an example in GenStat

Methods for processing directional data have been added to Version 7 of GenStat. In this section we look at rose diagrams and also show how users can access examples of data that are provided with GenStat. For illustration we use data on sulphur pollution collected in 1990, and provided with GenStat (sulphur.gsh). There are 114 sulphur measurements and several associated variables. The data is described in the GenStat introductory guide. Two of the other variables of interest are wind speed and wind direction.

Use Help Example Programs to access the example program in GenStat. This gives the dialogue box shown in Fig 15.4a.

Fig 15.4a Accessing example programs in GenStat

Choose more and on the next screen choose the option to give examples from procedures in the Graphics module. Click OK three times to come to the WINDROSE procedure. Choose this procedure and run the example. This produces the two wind roses shown in Fig. 15.4b and Fig. 15.4c.

Fig 15.4b Rose diagram for wind direction depicting wind speed

Fig 15.4c Rose diagram for wind direction depicting the sulphur amount

Consider the wind rose diagram in Fig. 15.4b. The segments are analogous to the bars in a histogram. The 360 degrees representing the compass are divided into segments. A particular segment represents a range of wind directions (angles) and the radius gives the relative frequency with which they occur in the data set. A radial percentage scale is used so these relative frequencies are easily quantified. Fig 15.4b reveals that about 25% of wind directions are in a south-westerly direction. Information concerning the corresponding wind speed is also shown. Each segment is composed of varying wind speeds and these are grouped into intervals, with the relative frequency of wind speeds occurring within each interval measured against the radial scale. Hence, south-westerly winds are predominantly between 10 and 20 mph, and from the radial percentage scale, these occurrences constitute around 15-20% of the total number of measurements from around the

156 15 Further methods compass. Note zero (or negative) wind speeds are treated differently and represented by a centre circle 24. Such a centre circle can be seen in Fig. 15.4c 25 Zero values are said to be calm. The data was read into GenStat via the example program supplied. (The data can be viewed in the input window.) They are currently in the GenStat server. To load the data into a spreadsheet choose Spread New Data in GenStat. Select all of the variables into the Data to Load list, as shown in Fig. 15.4d. Fig. 15.4d Loading data from the GenStat server into a spreadsheet The first 8 rows of the data set are shown in Fig. 15.4e. Fig. 15.4e The air pollution data Wind roses are on the Graphics menu. Choose Graphics Windrose Diagram. Specify WindSpeed is the variate to be plotted and factor WindDirection contains the angles (directions), as shown in Fig. 15.4f. 24 The centre circle is of radius % calm in Fig. 15.4c contradicts the GenStat Help in the initial Version 7 documentation. There is one zero sulphur measurement (row 1), and one missing wind direction (row 31). Hence, the proportion of calm values is 1/113 = %. GenStat appears to use the case with unknown wind direction (and non-zero sulphur) and combines this case with the zero sulphur case giving 1.75%! The treatment of missing and zero values will be clarified in future documentation and help. 155
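A stacked wind rose of the kind shown in Figs 15.4b and 15.4c can also be built directly from the raw directions and speeds. The Python/matplotlib sketch below is an illustration only, not GenStat's WINDROSE procedure: it bins directions into 45-degree sectors, stacks bars by wind-speed band and uses a radial percentage scale. The direction and speed arrays, and the speed bands, are placeholders.

# Wind-rose sketch (illustrative Python, not the GenStat WINDROSE procedure).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
direction_deg = rng.uniform(0, 360, 114)            # placeholder wind directions
speed_mph = rng.gamma(2.0, 6.0, 114)                # placeholder wind speeds

sector = 45
edges = np.arange(0, 360 + sector, sector)
bands = [(0, 10), (10, 20), (20, 200)]              # wind-speed bands (mph), chosen arbitrarily

ax = plt.subplot(projection="polar")
ax.set_theta_zero_location("N")                     # 0 degrees at North
ax.set_theta_direction(-1)                          # angles increase clockwise
bottom = np.zeros(len(edges) - 1)
for lo, hi in bands:
    sel = (speed_mph >= lo) & (speed_mph < hi)
    counts, _ = np.histogram(direction_deg[sel], bins=edges)
    pct = 100.0 * counts / len(direction_deg)       # radial percentage scale
    ax.bar(np.radians(edges[:-1] + sector / 2), pct, width=np.radians(sector),
           bottom=bottom, label=f"{lo}-{hi} mph")
    bottom += pct
ax.legend()
plt.show()

Calm (zero-speed) observations would need separate handling, for example as a centre circle, as the text notes for GenStat's own wind roses.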

157 15 Further methods Fig. 15.4f Plotting a wind rose Select Next Finish. This produces the wind rose in Fig. 15.4b. The most interesting column in Fig. 15.4e is the one containing the directional data, so we look at this column in more detail. With a cell from this column the active one, use Spread Factor Edit Levels Labels, to give the display shown in Fig. 15.4g. This factor has levels that give the wind direction in degrees, and labels that give it as text. Fig. 15.4g The (factor) wind direction For the construction of the wind rose GenStat assumes the levels are in clockwise order, and represents the midpoints of the segments to be plotted. The data do not have to be in a factor. For illustration close the dialogue shown in Fig. 15.4g and use Spread Column => Duplicate. Specify that the duplicate column is to be a Variate and give it a name, such as Wind_degrees. Then produce a wind rose diagram again using Graphics Windrose Diagram. You get almost the same plot, as shown in Fig. 15.4h. 156

158 15 Further methods Fig. 15.4h Wind rose diagram for wind speed with wind direction declared as a variate The reason it is slightly different is shown in one of the screens from the wizard (sequence of dialog boxes) used in the plotting. See Fig 15.4i. Fig. 15.4i Specifying attributes of a wind rose When wind direction is a factor it assumes that the data are grouped, and so represent the whole circle. When dealing with a variate GenStat has grouped the wind directions into 20 degree segments. Change this value to 45 to get an identical plot to Fig. 15.4b More on directional data There is more to the analysis of circular data than rose diagrams. The formulae and facilities in Genstat are from Fisher (1993). To obtain summary statistics for circular data, say wind direction for example, choose Stats Summary Statistics Summarise Circular data. Specify Wind_degrees, as the variate containing the angles to be summarised, change the Width of sectors from 20 to 45 (for the reason mentioned in Chapter 15.4) and select Display Fitted Values for von Mises Distribution, see Fig. 15.5a. 157

Fig. 15.5a Summarising wind direction

The output is shown in Fig 15.5b. It contains various summary statistics and significance tests for distributional assumptions. One of the summaries is the mean (wind) direction, which is 233. The output contains the results of three significance tests26:
Prob. test of randomness gives a p-value from a test for randomness against any alternative.
Prob. Rayleigh test of uniformity gives a p-value from a test for randomness against a unimodal alternative.
Prob. Chi-square von Mises gives a p-value from a goodness-of-fit test for a von Mises distribution.

The von Mises distribution is the most common parametric distribution used to model circular data. It is a symmetric, unimodal distribution, completely defined by a location parameter or mean direction (µ) and a scale or concentration parameter (κ). Fisher (1993) gives the probability density function. Two properties of this distribution are that for κ = 0 the distribution is uniform, and as κ increases the distribution becomes increasingly concentrated around the mean direction. For the wind direction data the estimated mean direction is µ̂ = 233, with the estimate of κ given by the Kappa estimate line in Fig. 15.5b.

Fig. 15.5b Results from summarising wind direction
***** Summary statistics and test for Circular data *****
Variate : Wind_degrees
Number of equidistant sectors : 8
Number of observations : 114
Mean direction :
Circular standard deviation :
Mean resultant length :
Skewness :
Kappa estimate :
Prob. test of randomness :
Prob. Rayleigh test of uniformity :
Chi-square von Mises : 4.53 with 5 df
Prob. Chi-square von Mises :
*** Goodness of fit for von Mises distribution ***
Observed Expected ChiSquare Midpoint

26 We hope a future version of GenStat will include quantile-quantile plots.

160 15 Further methods Unknown Observed 1 The first two tests show the data are not uniform around the circle. The third test ( Prob. Chisquare von Mises ) is simply the standard chi-square test approach to goodness-of-fit testing. The end of the output contains the groups (segments), observed number of wind directions in the segment and the corresponding expected number under the assumption the wind directions are from a von Mises distribution. The high p-value (0.48) indicates no evidence against the assumption of a von Mises distribution. However, Fisher (1993) claims it is difficult to identify a von Mises distribution when κ < 2, presumably because the mode of the distribution is less clearly defined. The wind directions can also be summarised using a circular plot. Choose Graphics Circular Plot. Specify Wind_degrees in the Data list, as shown in Fig. 15.5c. In the next dialogue box change the Width of sectors from 20 to 45, and select Kernel Density. See Fig. 15.5d. Fig. 15.5c Specify the directions to plot. Fig. 15.5d Changing the options. The circular plot produced is given in Fig. 15.5e. 159
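The main circular summaries and the von Mises goodness-of-fit check can be reproduced outside GenStat along the following lines. This Python/SciPy sketch is illustrative only: wind_deg values are simulated placeholders, and the sector width and degrees-of-freedom adjustment follow the description in the text rather than GenStat's exact implementation.

# Circular summaries and a von Mises chi-square goodness-of-fit check (illustrative Python).
import numpy as np
from scipy import stats

theta = np.mod(np.random.default_rng(3).vonmises(np.radians(233), 0.6, 114), 2 * np.pi)  # placeholder angles

C, S = np.cos(theta).mean(), np.sin(theta).mean()
mean_dir = np.degrees(np.arctan2(S, C)) % 360          # mean direction (degrees)
R_bar = np.hypot(C, S)                                  # mean resultant length, between 0 and 1
kappa_hat, loc_hat, _ = stats.vonmises.fit(theta, fscale=1)   # ML fit of the von Mises distribution

# Chi-square goodness of fit over eight 45-degree sectors (expected counts from the fitted pdf).
edges = np.linspace(0, 2 * np.pi, 9)
mid = 0.5 * (edges[:-1] + edges[1:])
observed, _ = np.histogram(theta, bins=edges)
expected = len(theta) * np.diff(edges) * stats.vonmises.pdf(mid, kappa_hat, loc=loc_hat)
expected *= observed.sum() / expected.sum()             # make the totals agree exactly
chi2, p = stats.chisquare(observed, expected, ddof=2)   # two parameters (mu and kappa) estimated
print(f"mean direction = {mean_dir:.0f} deg, R = {R_bar:.2f}, kappa = {kappa_hat:.2f}")
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")

With 8 sectors and 2 estimated parameters this gives 8 - 1 - 2 = 5 degrees of freedom, matching the "4.53 with 5 df" layout of the GenStat output.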

Fig. 15.5e A circular plot of wind direction

The triangular segments represent relative frequencies, as in a rose diagram. The labelling on the outside of the circle is a count of the number of wind directions in each segment. The arrow is the mean vector: its direction is 233 degrees (the mean direction) and its length, taking the radius of the circle as 1 unit, is the mean resultant length. From Fig. 15.5b the mean resultant length is 0.28, hence the length of the vector is 28% of the radius. The outer dashed curve is a kernel density plot, representing the relative density of wind directions around the circle.

15.6 Further methods

GenStat also includes facilities for spatial analyses, see Fig. 15.6a. Kriging is available, though co-kriging is not; it is promised for a later version.

Fig. 15.6a Facilities for spatial analysis

Time series analysis is also available, including ARIMA modelling. However, other statistics packages offer more options and we await information on whether future versions of GenStat will upgrade these facilities.

15.7 References

Changary, M. J. (1982) Historical Extreme Winds for the United States Atlantic and Gulf of Mexico Coastlines. U.S. Nuclear Regulatory Commission, NUREG/CR.
Coles, S. (2001) An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag: London.
Fisher, N. I. (1993) Statistical Analysis of Circular Data. Cambridge University Press: Cambridge.
Payne (2003) GenStat for Windows (7th Edition).
Reiss, R. D. and Thomas, M. (2001) Statistical Analysis of Extreme Values. 2nd edn. Birkhäuser Verlag: Basel.


Part IV Commands and Strategy


166 16 Moving from menus to commands 16. Moving from menus to commands 16.1 Introduction In this guide we have mainly used GenStat's spreadsheet, menus and dialogue boxes to enter and analyse data. Occasionally we used commands. In this chapter we discuss the use of commands in more detail. It is important that users know the potential of the software they are using, but not urgent that everyone can realize this potential. So we hope that everyone will read parts of this chapter. At the start of each section we explain how much is useful for everyone, and what occasional users of GenStat might omit. This first section is for everyone! First we explain which language we will use in this chapter. GenStat is in two parts. There is the GenStat server that lurks at the bottom right of your screen when GenStat is running. It is usually green, but turns to red, when it is working. The GenStat server is where the calculations are done. It is written in a language called Fortran. Hardy computer users can add their own Fortran routines to GenStat, though Fortran programmes are a rare species now. To learn more about linking GenStat to other programs look at Help GenStat Guides Syntax and Data Management, Chapter 5, particularly Section 5.7. The front-end with the menus and dialogues has been programmed in a language called C. This is for the development team, and currently you cannot add to this part of GenStat. Third, GenStat s own commands are themselves a language, and it is this GenStat language that we explore in this chapter. When a dialogue box is completed, and [OK] is pressed, GenStat itself produces commands. These are written to the Input Log and sent to the GenStat Server for processing. Once the Server has processed the commands, it sends the results to the Output Window. If there were any mistakes in the commands, then it sends a description of the mistake to the Error Log, as well as to the Output window. Here is an example. We show the Simple Regression dialogue from Chapter 4, Fig. 4.2c, of this guide, together with the commands it produced, that were put into the Input Log. Fig. 16.1a Dialogue for simple linear regression Fig. 16.1b Commands corresponding to regression dialogue 165

167 16 Moving from menus to commands As usual when commands are generated automatically, they are sometimes more complicated than necessary. Here you could run the same analysis with the commands shown in Fig. 16.1c Fig. 16.1c Simpler command for the regression analysis Model uptake Fit conc So, you have been using GenStat's commands all the time, though without having to type them yourself. In this section we explain a little of the GenStat language. We do not expect you to stop using the spreadsheet and dialogue boxes, but anticipate that some knowledge of the language of GenStat will be valuable in your effective use of the package. We describe first why this extra knowledge is useful. 1. The Log File keeps a record of the commands used for your analyses This is a useful record and you should normally keep it with the data and output files. The details of the analysis are not always clear from an output file, while this Input Log shows all the steps that were undertaken. If you ever require help on an analysis, then the log file can be sent to an advisor. However, the log file is only useful to you, if you are able to read it. So, you need, at least, to be able to read GenStat commands. 2. Some of GenStat's facilities are not available through the dialogue boxes Sometimes a facility is not available through the menu. For example GenStat includes a simple procedure to calculate day lengths at any latitude and any day of the year. As a command it is called daylength, and there is no equivalent menu option. To see how this type of procedure can be used, select Help Example Programs see Fig. 16.1d Then click OK once, and choose the option to look for a Procedure in the Procedure Library (all). Click OK about 6 more times, which should give a menu with the daylength procedure included, as shown in Fig. 16.1e. Fig. 16.1d Example programs Fig. 16.1e Finding the Daylength procedure Accept the procedure but do not run it. It will then open in an input window as shown in Fig.16.1f. 166
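As a rough indication of the kind of calculation such a procedure performs, here is a short Python sketch of a standard textbook approximation for day length (Cooper's formula for solar declination plus the sunset hour angle). It is not GenStat's DAYLENGTH procedure, whose algorithm and refinements may differ, so its numbers will only roughly match the output shown on the next page.

# Approximate day length from latitude and day of year (illustrative Python, not GenStat's DAYLENGTH).
import numpy as np

def day_length(latitude_deg, day_of_year):
    # Cooper's approximation for solar declination (degrees).
    decl = 23.45 * np.sin(np.radians(360.0 * (284 + day_of_year) / 365.0))
    # Sunset hour angle; clip handles polar day and polar night, where no sunset occurs.
    cos_omega = -np.tan(np.radians(latitude_deg)) * np.tan(np.radians(decl))
    omega = np.degrees(np.arccos(np.clip(cos_omega, -1.0, 1.0)))
    return 2.0 * omega / 15.0          # 15 degrees of hour angle per hour

hours = day_length(52.0, 1)            # for example, about 52 degrees N on 1 January
print(f"day length = {hours:.1f} h, sunrise about {12 - hours / 2:.2f} h solar time")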

168 16 Moving from menus to commands Fig. 16.1f Example that uses the Daylength procedure This shows GenStat s commands including one that calls the daylength procedure. To run these commands use Run Submit Window. This should give the results below, showing, for example that sunrise at this location in England is at 8.09 hours on 1 st January. Fig. 16.1g Running the commands Fig. 16.1h Results 3. Examples can be provided, as commands, for similar problems to your own. Many examples are available as sample sets of commands, through the Help menu. Choose Help Example programs, make a selection from the list and by clicking [OK] you open the example program in a new Input window. The example below is from the textbook by Snedecor and Cochran, and is of simple linear regression. You also have the possibility to run the example program as described above. In this way, you can modify them and use them with your own data. 167

169 16 Moving from menus to commands Fig. 16.1i Choosing an example Fig. 16.1j Data for the example In addition, advisors could provide the commands for your analyses which you then retrieve and run. Instructions are much easier to provide in this way, compared to advice on how to complete dialogue boxes. 4. You can make more use of the HELP information in GenStat This is both informative and enables you to see more clearly what analyses are possible with GenStat. As a simple example, here is part of the HELP information about the HEATUNITS procedure that can be used to calculate degree-days. Fig. 16.1k Help about a procedure for degree days This is another procedure that is not currently available through the menus. 5. It is sometimes quicker to analyse data with commands Whether this is the case for you, will depend on how much you use GenStat. If you are just an occasional user, then the spreadsheet and dialogue boxes will be simpler, because you would probably make mistakes when typing the commands. If you use GenStat regularly, then it will be far quicker to make use of commands. As an example the commands 168

FACTOR [LEVELS = 4 ;VALUES = 12(1...4) ;LABELS = !T(A,B,C,D)] year
FSPREADSHEET year

would declare a factor column of length 48, with values 1,1,1,...,4,4,4, which is called year (first line) and write it to a new spreadsheet (second line). It also labels the levels 1 to 4 as A, B, C and D respectively. In this manual we have always entered such data through the menus. There we would use Spread New Create to create a new spreadsheet, then define the number of columns, the number of rows and the sheet type, use Spread Calculate Fill to enter the data, and then Spread Column Edit Column Attributes to give a name, make it a factor and attach labels. Typing the two lines is much quicker, especially if you already have examples of similar commands that can simply be copied and pasted. The main reason that it is quicker, however, is that a set of commands can automate an analysis that would otherwise involve using many different dialogues repetitively.

6. And it is not that difficult

Finally we address the concern that many of you may have, simply that it may be a good thing to do, but it will be too difficult, especially if you have never "programmed" before. We don't think so, if you are at the same time determined and realistic. Remember that your initial task is not to learn to write GenStat commands, but simply to be able to read and understand them. Then you can follow the examples that others may give you. The next step might be to adapt examples slightly. So you don't have to start with an empty window. You could start with the commands generated from the dialogues (which are in the Input Log) and copy them into your Input Window for editing. Or you could start with the commands for a similar analysis, or with the example programs from the Help menu.

16.2 Finding errors in commands

There is one part of writing commands that beginners do find difficult and you must practice as much as possible. When you write commands, you will certainly make mistakes. Then GenStat will give you an error message and you have to use this message to try to correct the mistake(s). This takes practice - so when you make mistakes, just think of it as part of the practice! Those who do not need the detail can proceed to Chapter 16.3.

Here is a common mistake, with the corresponding error message.

Fig. 16.2a Example with an error

Fig. 16.2b Explanation of the fault

You can probably guess the problem. The name that was given to a column, here year, was in lower case and GenStat thinks of Year and year as different names. You can use upper or lower case, but you must be consistent. Note that upper or lower case does not matter in other parts of the command, so:

FACTor [ Lev = 3 ; Val = 1,2,3,1,2,3] year

is fine. It is just in the names that you choose to call the columns that you must be consistent. This error message is reasonably clear, but that is not always the case. In general, you must become a detective when mistakes are made - and that takes practice. You can start by searching in the GenStat Help for GenStat fault codes by area, see Fig. 16.2c.

171 16 Moving from menus to commands Fig. 16.2c Check on the types of error that can be generated 16.3 Using Input Windows We use the example, that was previously used in the tutorial for a two sample t-test, see Chapter 3 of this guide. This is used here both as practice in writing commands and to describe a possible "system" for preparing and submitting commands to GenStat. This section is for everyone. It is not difficult and can take away the fear of programming that is felt by many users who know only how to click-and point as their way of controlling the computer. The full "program" is in Fig. 16.3a. Do not type it yet. We will produce it in stages. Fig. 16.3a A simple program in GenStat Before starting, press to clear the Output Window. Then use File New Text window to open a new input window. This is where you will type the commands. 170

172 16 Moving from menus to commands Fig. 16.3b First plot to type Fig. 16.3c Second plot to add Type the first three lines, as shown in Fig. 16.3b. Do not forget to type the colon (:) on the last line. Then press, or use Run Submit Window to submit these commands to GenStat. If this gave any errors, then correct them and run the commands again. Now add the second group, as shown in Fig. 16.3c. Run the whole set of commands so far, by pressing This clears the output Window before running the commands, so you can see the results clearly. Correct any errors if you made them. Of course GenStat can only detect errors in the "syntax". So, if you typed RAED instead of READ, then it will be an error. It can not detect if a number was typed incorrectly. This must be checked against the original data. GenStat does try to help by printing summary values in the Output Window. You have probably typed at least one value in error, because we have given 2.9 to type above, for the 8 th value in the second group, when it should have been 2.0. Correct this mistake and run again. It remains to type the last two lines of the "program". Type just the first of these lines, i.e. append [newvector=all ; group=type]new, standard With the cursor still on the line, use Run Submit line. Correct if there were mistakes. Notice that this is a command that does not produce results in the output window. It will tell you, however, if you have made a mistake. Instead of typing the last line directly we will practice using GenStat's dialogue to give the structure. Use Stats Statistical Tests One and two-sample tests. Complete the dialogue as shown and press [OK]. 171

173 16 Moving from menus to commands Fig. 16.3d Using a dialogue to generate the commands Now go to the input log. The last line is shown in Fig. 16.3e. Fig. 16.3e Command generated by t-test dialogue TTEST [PRINT=summary,test,confidence,variance; METHOD=twosided;\ GROUPS=type; CIPROB=0.95; VMETHOD=automatic] Y1=all As already mentioned, using the GenStat menus generates much longer commands in the Input Window than is usually needed. Copy this line into your input window. Edit it so it reads TTEST [GROUPS = type; ciprob = 0.95 ] all With the cursor still on the line, use Run Submit line. Correct if there were mistakes. Save the program file. Call it cmprog1.gen. Previously we saved files with the extention.gsh because they were in GenStat s spreadsheet format. Here we use the extension.gen because they contain GenStat commands. You should now have a "program" where each section has been tested. It is time to test the whole program. Restart GenStat using Run Restart Session. Then load the file called cmprog1.gen and run the whole program by selecting Run Submit Window It now just remains to see how GenStat responds when you make mistakes. If you have already made, and corrected many mistakes, that is fine. Otherwise now is the time to make some. These can be spelling mistakes, or grammatical errors, like swapping a comma and a semi-colon. Make just one mistake at a time, because GenStat does not always behave sensibly after the first mistake. To conclude this section we review the concepts that have been introduced. You prepared the program in a special input window. You can also type commands into the output window, for example the input log. But this confuses the function of these windows and we suggest that you normally use a special input window, as here. The program was prepared and tested in small parts. We think that this is a useful routine. Even if you write, or are given, a long program, unless it works first time, split it into small sections, to check that each part runs successfully, before running the whole program. GenStat has the facilities to encourage this, with its Run Submit Selection menu. If there are mistakes in one section, then do not proceed to the next section until they are corrected. GenStat often ignores further commands after a mistake has been detected. 172

174 16 Moving from menus to commands The program began with the JOB command. This is the equivalent of the Data Clear All Data menu option and ensures that a complete run of the program starts without columns in GenStat's memory The syntax of GenStat's commands All GenStat's commands, also called statements, have the same "syntax". That is, they obey the same grammatical rules. In this section we explain these rules. This explanation should enable you to read any GenStat program and also to make full use of GenStat's HELP facilities. It is important for everyone to realize that you need to understand the grammar of a language, if you are to use the language effectively. This applies to ordinary languages, like English or Swahili and also to computer languages. One difference is that people are forgiving of those who make mistakes when starting to use the language, but the computer is not. So any error, such as a comma, where a semi-colon was expected, will not work. Those who are skimming can proceed to Chapter We begin with some familiar examples of GenStat's commands TTEST [ ciprob = 0.95 ] new ; standard TTEST [groups = type ; ciprob = 0.95] data PRINT structures = block, treat, yield ; dec = 0,0,2 There are three components to explain, namely the name, the options and the parameters. 1. The name of the command. In the examples above, these are TTEST, PRINT and they tell GenStat what sort of action is required These names can be given in capital, or small letters. 2. The option list associated with a command. This is in square brackets; for example, the first TTEST command includes one option, namely [ciprob = 0.95]. The second TTEST example includes two options, namely [groups = type ; ciprob = 0.95 ]. If there are no options then you can omit the square brackets completely. 3. The parameter list associated with a command. For example, the PRINT command PRINT structures = block, treat, yield ; dec = 0,0,2 above has 2 parameters, the first specifies the structures upon which operations are to be performed, i.e. what to print. The second parameter gives information for each structure, i.e. how many decimals to use for each column printed. We will see below that this command can be given in a simpler form as PRINT block, treat, yield ; dec = 0,0,2 (Note that the word PRINT shows that GenStat started in the era when all results were automatically sent to a printer. Now a better name for the command would be DISPLAY, because the results are normally displayed in the output window.) It is important to know when to use a comma and a semi-colon in a command. When any option or parameter has more than one item, a comma separates the items in the list. So, in the PRINT command above, where we refer to 3 columns, the command is given as: PRINT block, treat, yield i.e. PRINT block [comma] treat [comma] yield When a command (or directive) includes more than one option or more than one parameter, then a semi-colon is used to separate them. For example: 173

175 16 Moving from menus to commands PRINT structure = block, treat, yield ; dec = 0,0,2 i.e. PRINT structure = block, treat, yield [semi-colon] dec = 0,0,2 normally shortened to PRINT block, treat, yield ; dec = 0,0,2 Each command or procedure has a list of allowed option and parameter names. You can see what they are by looking at the HELP associated with the command, see Fig. 16.1k for the Heatunits procedure. There are various ways to get HELP and here we look for the help on DESCRIBE. One method is as follows. Type describe at the beginning of a line in any window. Then press the [F1] key. This should take you straight to the HELP on DESCRIBE. An alternative is : Help Contents List of Procedures. Then click on the DESCRIBE procedure. Or go to the corresponding dialogue, which is Stats Summary Statistics Summarise Contents of Variates. Press the HELP button. From the HELP for describe you will see that it has 3 possible options, namely PRINT, and SELECTION and GROUPS and it also has two possible parameters, namely DATA and SUMMARIES. The DESCRIBE command was used (from the menus) in the introductory tutorial, see Section 2.3.1, page 11. If you take the default, shown in Fig. 16.4a. Fig. 16.4a Describe dialogue with default output it is equivalent to typing the command DESCRIBE data = total, raindays If you specify a selection of descriptive statistics, then it might result in the command. DESCRIBE [selection = mean, median] total, raindays Now we describe how to give options or parameters. The general rules apply consistently to any option or any parameter for any command. The normal way is to give the "name = list of things", for example with the DESCRIBE command we have "data = total, raindays". There is also "selection = mean, median". We take the "selection = mean, median" to explain the structure further. The name, i.e. "selection" is one of the permitted option words - see the HELP for DESCRIBE. It can be written in upper or lower case and can be abbreviated as long as GenStat can work out what it is. So it could be written as "sel = " or even "s = ". After the = sign we give the list of summary statistics that are recognised by the command. For the DESCRIBE command there are 22 possible summaries, of which "mean, median" are two. These words can again be shortened, as long as GenStat can work out what they are. So "sel = mea,med" would be alright. If you give a word that GenStat does not recognise, then an error is generated. For 174

Now look at the parameter for the DESCRIBE command, i.e. "describe data = cluster, fruit". The name of the parameter, "data", can be abbreviated, and as it is the first parameter that is permitted for this command, it can even be omitted completely. So the command could be given simply as

describe total, raindays

In the "data = total, raindays" parameter, the permissible list for data is the set of columns that are available. These names cannot be abbreviated.

In giving commands, extra spaces can be inserted to help make a command readable. If you need to use more than one line for the same command, then use the continuation character \ at the end of each line to be continued. For example, type and run the commands shown in Fig. 16.4b.

Fig. 16.4b Commands to input and display a matrix

Finally we explain the difference between options and parameters. The first parameter is special. Take the PRINT command again. The first parameter says what structures will be displayed. The decimals parameter then says how many decimals will be displayed for each of the three columns. So the decimals list of 0,0,2 is "in parallel" to the list of column names, i.e. block, treat, yield. For example

PRINT block, treat, yield ; dec = 0,0,2,4,1,2,1,0

would not be sensible. GenStat would give a warning, and would then ignore all the decimal settings after the third. Options, on the other hand, always apply to the command as a whole. In

DESCRIBE [selection = mean, median] cluster, fruit

the option to give the mean and median applies to both columns. If you wished to give the mean for the first column and the median for the second, you would have to give the command twice, i.e.

describe [sel=mean] cluster
describe [sel=median] fruit

16.5 Examples of GenStat programs

You have already seen a simple complete program in Fig. 16.3a. In the sections below we describe a more useful program. We calculate the dates of the start of the rains, repeating the work we did in Section 9.5, where we had to use a whole sequence of menus. We then generalize the commands by adding a loop. This will then process the data for four different definitions of the start at one go. This enables us to automate repetitive analyses; for example, we could similarly analyse the data for different stations. Putting commands together in this way could be called a program.

The next stage is to turn the commands into a procedure. You have already seen procedures in this chapter when, in Fig. 16.1e, you chose an example to illustrate the DAYLENGTH procedure.

Once a procedure has been written it can be used just like any other command in GenStat. With a procedure, you add an extra word that GenStat understands, and we explain why this is useful. Finally, we describe a limit in the current language facilities of GenStat, namely that it is a language, but not (yet?) a visual language.

Those of you who are just skimming should still read the next section, on using a program. There we emphasise that it is easy to use a program that someone else has written, and this is itself very powerful. Then proceed to the next chapter.

Using a program

In Fig. 16.5a we show a set of commands that we have written. There we open a data file containing daily rainfall data and calculate the date of the start of the rains, for the same definition that we used in Chapter 9.5. This is the first occasion after 1st October that the 3-day rainfall total exceeds 20mm. These commands have been saved in a file called start.gen. In other chapters we have saved data files, such as the Zimbabwe data, and given them names like zimdata.gsh. Here the file contains commands, rather than data, so we use a different extension.

Fig. 16.5a Commands to calculate the dates of the start of the rains

If you were sent this file, then to run the commands in it all you need to do is to save the file into the same directory that you are using for your current GenStat work. Then use Run ⇒ Submit File, as shown in Fig. 16.5b. This opens the dialogue shown in Fig. 16.5c, into which you type your input file name.

Fig. 16.5b Submitting the file
Fig. 16.5c The submit file dialogue

Once you click [OK], the commands are executed. One result is a graph, shown in Fig. 16.5d, which indicates, for example, that the median starting date was after about 60 days. The year is assumed to start on 1st September, so this is the beginning of November. The season always started by the end of November, as 100% of start days occurred before day 90.

Fig. 16.5d Part of the results from the file of commands

This way of running a file of commands is particularly useful if you want to call GenStat from another piece of software. You could then use GenStat for the analysis, and return to your software with the results. If you need further information then look at the help for GenBatch, for which the first few lines are shown in Fig. 16.5e.

Fig. 16.5e Help information about using GenStat in batch mode

Here we are within GenStat, and so there are simpler ways of running the commands in Fig. 16.5a. Use File ⇒ Open and open the file called start.gen. Then use the option Run ⇒ Submit Window to process the data.

Once you have opened the file, you could change the commands if you wished. In this section we do not assume that you will write this sort of program, or even understand every command. But you could still probably change it. For example, the definition was the first occasion that 20mm was exceeded. Could you find the line where this is specified, and change it to 15mm or 25mm, if that were more appropriate? Or take the line that specifies the data file, which is as follows:

import 'zimdata.gsh'

To analyse the data from another station, the name of the file is all that need be changed.

We claim that it is usually quite easy to follow roughly what a program is doing, and to change it slightly. This is already a powerful skill. It indicates how much the use of statistics packages has changed with the Windows-style interface. Earlier you had to learn how to write commands before you could start analysing your data. Now we have reached Chapter 16 before we need to consider commands in any detail. And you can still proceed, by learning how to read and edit, before you learn to write. Those who are skimming can now proceed to the next chapter.

Writing your own programs

We now assume that you would like to understand the program shown in Fig. 16.5a sufficiently well that you could then write some programs of your own. It is at this point that you may suddenly experience some pain. This will remind you that GenStat used to have a reputation of being difficult to learn.

We do not think that this stage need be so painful. The GenStat development team has worked hard to make the menus and dialogues easy to use. We hope they can similarly help to make this initial program writing as painless as possible. This is likely to be through a user guide, plus a set of worked examples that you could edit for different tasks. This may be done by the time you read this guide. To check, try Help ⇒ Example Programs, then look under topics such as Declaration of data structures, Input and output, and Manipulation of data. It is usually the organization of the data and results that is more difficult than the analyses, and there are already plenty of examples of different analyses. You may also find further information on the GenStat web site.

There is a payoff. The pain is not just to make you suffer, but is related to the power of the system. We look briefly at some examples from the program in Fig. 16.5a, and we start by avoiding the pain. The last five lines are as follows:

calc startday=ndayinyear(startdate;9)
print yr,startdate,startday
describe startday
tally [graph=%cum]startday
fspread yr,startdate,startday

These lines are fairly self-explanatory, except perhaps the last. This is to transfer these data from the GenStat server into a new spreadsheet. It is the command equivalent of Spread ⇒ New ⇒ Data in GenStat. For more information on any of the other lines, you could either go straight to the Help system, for example just type the command and press F1, or look for the equivalent dialogue.
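As a reading aid, here are the same five lines again with GenStat comments attached; the comments are ours, not part of the original file.

calc startday = ndayinyear(startdate; 9)  " day number within the year, here counted from 1 September (month 9) "
print yr, startdate, startday             " display the results in the output window "
describe startday                         " summary statistics of the starting day "
tally [graph=%cum] startday               " cumulative frequency graph, as in Fig. 16.5d "
fspread yr, startdate, startday           " put the results into a new GenStat spreadsheet "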

Before that we had the two lines:

calc success=(rainsum3>20).and.(dayfromsept>30)
restrict Date,YearfromSept,rainsum3;success

The first is a logical calculation, which returns the value 1 for true, or 0 for false. Then we used the command equivalent of the Spread ⇒ Restrict/Filter dialogue to look at just the days when planting was possible.

So where is the pain? Well, what is the point of the group command in the following two lines?

calc YearfromSept=year(Date)+(Month(Date)>8)
group [redefine=yes]YearfromSept

And what on earth does the second line do in the following part of the code?

tabulate [class=YearfromSept]Date;min=start
vtable start;startdate;!p(yr)

In particular, what does !p(yr) mean in that line?

The first two lines can give a clue to the problem. Having calculated the column called YearfromSept, we need to change it into a factor column, so we can use it later to get the summary for each year. This is easy in the spreadsheet: just right-click and change to a factor. The GROUP command is a way of doing the same thing. More generally, GenStat recognizes different data types, including variates, factors, matrices, tables and pointers. They are all described in Chapter 2 of the Syntax and Data Management guide, see Help ⇒ GenStat Guides. Understanding the role of these structures is needed to be able to write most programs. So you will need to check that chapter, which hopefully explains these structures with some simpler examples.

Repeating sets of commands

We now start the real payoff from using commands. It is easy to repeat sets of commands, possibly for different stations or different criteria. This is much simpler than having to go through the menus repeatedly. One way is to change the program given in Fig. 16.5a; here we show how we can repeat sets of commands within a single run of the program.

In this section we generalize the program to look at four different earliest dates that the season could start. We have set them at 30, 40, 50 and 60 days after the initial date of 1st September, but the user could change these. When this file is run, the results include those shown in Fig. 16.5f. The first column gives summary statistics when the earliest possible starting date was 1st October. Then, for example, Q1, the first quartile, shows that planting was possible in 25% of the years by day 50. September 1st was taken as day 1, so this is 19th October.

27 You need to know a little about tables and pointers to follow the two lines of code starting tabulate and vtable. In the tabulate command the days of the start are automatically stored in a table structure. This is useful, because the table automatically stores the year numbers as well as the starting dates. Then the vtable line is used to change this table into columns, and we chose the names startdate for the dates and yr for the corresponding year number; !p(yr) supplies that name as an unnamed pointer.

Fig. 16.5f Summary statistics for starting dates
Fig. 16.5g Adding a loop to the commands

In the command file shown in Fig. 16.5g, the key lines are those with FOR and then ENDFOR. These are described in the GenStat Syntax and Data Management guide, which is accessed through the help system. Other controls described there are IF ELSIF ELSE ENDIF and CASE OR ELSE ENDCASE. For example, in the program in Fig. 16.5g we might want to analyse the data only if there are more than 5 years of data, or if there are not too many missing values. To do this we could add lines such as

CALC check=nobs(rain)/366
IF check > 5
  " Do all the commands "
ELSE
  Print 'At least 5 years of data are needed for this analysis'
ENDIF

This example indicates that it is quite easy to generalize a simple program, such as the one we started with in Fig. 16.5a.
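We do not reproduce Fig. 16.5g here, but the looping idea can be sketched roughly as follows. A dummy (here called earliest, a name we have chosen) takes the values 30, 40, 50 and 60 in turn, and the commands between FOR and ENDFOR are run once for each value. The sketch assumes that the columns rainsum3 and dayfromsept have already been calculated, as in Fig. 16.5a, and it is not a copy of the actual figure.

FOR earliest = 30, 40, 50, 60
  calc success = (rainsum3 > 20) .and. (dayfromsept > earliest)
  " ... the remaining commands from Fig. 16.5a follow here, unchanged ... "
ENDFOR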

Making a program into a procedure

We have emphasized that part of the value of writing commands is that the user can easily make changes to tailor the analysis to their requirements. For example, in Fig. 16.5a the name of the station could be changed from zimdata.gsh, and the earliest date could be changed from 1st October by changing the value of 30 in the code.

One problem with the programming in Fig. 16.5a is that it forces the user to dig into the code to make these changes. It is not good programming practice to hardwire the station name, or 1st October, in the code. This is not a serious problem with such short programs, but it can be awkward in general, and it can lead to mistakes. A simple improvement is to put all the components that the user might change at the start of the program. So we could have:

Text [n=1] stationname; value = 'zimdata.gsh'
Scalar Earliestdate; value=30

Then we would use these names in the code, for example

Import StationName

instead of

import 'zimdata.gsh'

Even better would be to take anything like this out of the code altogether. This is the process of writing a procedure, as described in Chapters 5.3 and 5.4 of the GenStat Syntax and Data Management guide.
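As a rough sketch of the shape this takes (our illustration, not an extract from the guide), the commands are placed between PROCEDURE and ENDPROCEDURE, and the items the user supplies become the procedure's options and parameters. The declaration below is only our assumption of the form described in the Syntax and Data Management guide, and should be checked against the source of an existing procedure (Help ⇒ Procedure Source).

PROCEDURE 'RAINSUMMARY'
PARAMETER 'DATA'            " the column to be summarised, supplied when the procedure is called "
DESCRIBE [selection=mean,median] DATA
ENDPROCEDURE

" once defined, the new procedure is used like any other command "
RAINSUMMARY startday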

Many procedures have been written for GenStat. To see the code you can use Help ⇒ Procedure Source, and then choose one of your choice. Local or special interest groups can also add their own libraries of procedures, using Options ⇒ Procedure Libraries. The help associated with attaching these libraries is shown in Fig. 16.5h. This opens the possibility of adding a library to include further facilities for the analysis of climatic data.

Fig. 16.5h Attaching a procedure library

We sometimes find that staff who are comfortable at computing, though not programmers, assume that it will be impossible for them to write procedures. This is not the case. We therefore give two small challenges. They demonstrate that you can start with procedures as you might start writing commands, namely by changing one that exists, rather than writing your own.

The first task is as follows. If you were to run the commands that we showed in Fig. 16.5g, you would see that the results are not quite as we show in Fig. 16.5f. The results are given in a different order, and we changed the order in the spreadsheet to display the summaries in a more logical order. They are given by the DESCRIBE procedure, which is also what is run automatically when you use the Stats ⇒ Summary Statistics ⇒ Summarise Contents of Variates menu. Could you change the code in the DESCRIBE procedure so that the results are given in the same order as shown in Fig. 16.5f? Use Help ⇒ Procedure Source to get the existing code for DESCRIBE, and you may be surprised how easy it is to make the change.

So much so, that we suggest a second task. Suppose you were also asked to get the 10% and 90% points of the starting dates, as well as the quartiles. Currently the procedure permits any of 22 summary statistics, but not those. Could you make the 22 into 24, by adding the option to give these percentiles?

Understanding how easy it is to write procedures is important, even if you have little intention of becoming a programmer. Perhaps you will suggest some tasks for others, or commission a programmer. Then knowing just a little about writing procedures will greatly help you to define exactly what would be useful for you, or for your organization.

Making procedures visual

Apart from this chapter, you have been using GenStat mainly through menus and dialogues. We have now seen that it is possible to add your own procedures to those that are currently available in GenStat. In this final section we look at whether these additions could also have their own menus and dialogues. The short answer is no: we have reached a limit in the current version of GenStat.

Some other packages allow users to write their own procedures and then also include an extra menu, so that others can use them through a menu/dialogue system. One example is Excel, where add-ins are programmed in VBA, i.e. Visual Basic for Applications. It is the visual part of the language that allows users to include the menus/dialogues as part of the program. Some statistics packages, for example S-Plus and Stata, also have this feature.

In GenStat there is only a very crude way of adding a visual component. If you use Help ⇒ Example Programs ⇒ High-resolution graphics, then you see an example of the sort of menu that you can program currently.

Fig. 16.5i Adding a menu in a user-written program

While it may be interesting that you can use GenStat to see an example of a contour map, or a weather map, the point here is the type of menu that allows interaction with the user, so they do not have to understand much of GenStat's command syntax. A menu of the type shown in Fig. 16.5i was acceptable before Windows, but looks very old-fashioned now. We feel that in writing procedures for GenStat it is better to concentrate on writing a good procedure to be used in command mode, and then hope that the developers will improve the visual aspects in a future version of the software.


17. Challenge 5 - Changing a GenStat procedure

In Chapter 16.1 we ran a procedure called daylength. In the help on this procedure it states "The formula by which the day lengths is calculated is given in Sellers (1965)." The first task is to find the formula. One way is to find the article; the second is to look at the contents of the procedure. This is possible for any procedure, as is shown in Fig. 17.1a.

Fig. 17.1a Contents of the daylength procedure

Find the procedure, called daylength, and put it in an input window, so it can be modified. Give the procedure a new name, for example the name of the place where you work. Change the default latitude from its current value to the latitude that you want. Try running your own version of the procedure.

The procedure currently does not print anything. Change it by adding the lines

If Nvalues < 40
  print DAYNUMBER, DAYLENGTH ; dec = 0,2
Else
  print 'The results are saved in the variable called', !p(daylength)
Endif

This should be inserted just before the ENDPROCEDURE line. Now run it again.

18. Developing a strategy

18.1 Introduction

In this last chapter we look more generally at three aspects concerned with a strategy for processing climatic data. They are on data, software and staff. In Chapter 18.2 we consider aspects concerned with data entry and checking. We look at the access to data from databases using ODBC (Open Database Connectivity) in Chapter 18.3. In Chapter 18.4 we examine software issues. We consider where GenStat might fit in your software strategy: what are the alternatives, and what other software might you use together with GenStat? Finally, in Chapter 18.5, we look at people, and include issues of training.

We try to be as general as possible, so we hope the issues will be applicable whether you work in a Met Service, a University, or elsewhere. And we believe the issues are essentially the same whatever type of country you are in.

18.2 Data

A few people collect climatic data, but most users access the data from others, perhaps from a Met Service. As well as the actual data you will usually have some details, perhaps the location of the stations and details of the instruments used to take the measurements. This is called meta-data. The data are often not usable without some meta-data. Met Services are sometimes protective about their data, but we see no reason why they should restrict access to the meta-data. And this information is very useful to help anyone specify what data they need.

In many developing countries a database system called CLICOM has been used to store the climatic data and meta-data. The original CLICOM is now old fashioned, but the term CLICOM is sometimes used in a more generic way as countries adopt more modern versions of software for managing their climatic records. One such system, called CLIMSOFT, is being developed largely by the Zimbabwe Met Service. Another, called CLIDATA, has been developed by the Czech Met Service. Climatic data are also stored by many other organizations, including groups concerned with food early-warning systems, universities, and so on.

We do not wish to enter here into the quagmire of data access from Met Services, though we would like to encourage them to have a well-defined policy for data access. And if this policy makes their data expensive, then they must not be surprised if users ignore their services and look elsewhere for the data. 28

Even when the data are secondary, i.e. you did not collect them, we strongly encourage users to be critical about the data quality. We illustrate by considering three topics. The first is when you have the same data from two different sources and need to check whether they are really the same. We illustrate with the Zimbabwe data, though we had only the single copy. GenStat can help in the following way. Enter one copy of the data into a GenStat spreadsheet. We used the data in the file called zimdata.gsh and made two changes, to introduce errors. We changed 23.7 to 24.7 in the first row of the data in Fig. 18.2a and deleted the 4th row of data. Then use Spread ⇒ Sheet ⇒ Compare to give the dialogue shown in Fig. 18.2b.

28 Many organisations that try to charge for data are themselves quite adept at borrowing, rather than paying for, software. They should assess whether their potential clients will be willing to pay for data, or will avoid the supplier. This can marginalize the Met Service, often the opposite of what should be encouraged.

Fig. 18.2a Sheet with 2 faults
Fig. 18.2b The Spread Sheet Compare dialogue

When we make the comparison, the result is not very clear. GenStat reports that the number of rows is not equal. But then it starts by checking the year column, and so reports a mismatch on that column in line 184, where the data in memory give 1952, while there is still one more row of 1951 in the file on the disc. In comparisons such as this, it is useful if there is a unique identifier for each row, and here this is a column called Date. So we return to the menu in Fig. 18.2b and tick to match rows using Date as the ID column. The results are as follows, and correctly identify the discrepancies:

"Comparing Spreadsheets: zimdata.gsh and zimdata.gsh
Unequal number of rows: vs
Mismatch on Maxtemp at row 1: 24.7 <> 23.7
Row 4 in zimdata.gsh not matched.
Spreadsheets are different!"

A different problem is when the data are on file, but you also have access to a paper copy. In this case GenStat has an option for you to verify whether the two are the same. Use Spread ⇒ Sheet ⇒ Verify to give the dialogue shown in Fig. 18.2c. We chose to check just the column with maximum temperatures. The original values are no longer visible, as shown in Fig. 18.2d.

Fig. 18.2c Spread Sheet Verify dialogue
Fig. 18.2d Data ready to verify one column

We typed the correct value from the file, and this generated the error shown in Fig. 18.2e. We now have the choice between keeping what we just typed, or what was there before, or of now realizing that we were wrong both times! We can also add a note, and have chosen to do so; see Fig. 18.2e.

Fig. 18.2e The GenStat dialogue when there is inconsistency

After entering two values the screen is as shown in Fig. 18.2f, with the coloured cell indicating that there is a bookmark present. GenStat also prepares a new spreadsheet that keeps a record of all the changes we have made. This is shown in Fig. 18.2g, and indicates that both of the first two rows were inconsistent.

Fig. 18.2f The spreadsheet after verifying two rows
Fig. 18.2g A special GenStat sheet with a record of changes

In presenting these facilities, we do not suggest that you therefore consider using GenStat for the data entry and checking phase. But their existence can be useful in discussions with the suppliers of the data, on what checks were made when the data were entered.

In Part II of this guide we described some of the steps that are often needed to get your data into shape for the analysis. It is important to allow sufficient time for this initial data manipulation stage. If the data manipulation is to be done on a routine basis for many stations, then it will sometimes be more efficient to use commands, as described in Chapter 16, rather than just the menus and dialogues.

In Chapter 7 we also illustrated some exploratory graphs that can be used to check the data. This is different from the data entry checks described above: the data may be transcribed correctly, but a value can still be odd. Sometimes we find the large volume of the raw data overwhelms users, so they summarise the data, or (worse) they only start with the summary data. Checking for oddities should be done at every stage, so that we are happy that the summary values are sensible. But when they are not, we need to be able to return to the raw data to see where the problems have arisen. It is also useful to consider what can be checked on the raw data directly.

It is also important to consider checks in relation to the specific objectives of the analysis. For example, if the study includes looking at the occurrence of dry spells, their occurrence is particularly sensitive to the recording of small rainfall amounts. Typically some observers are more conscientious than others in recording small rainfalls, say those that are less than 1mm. If the analysis is sensitive to this aspect, then it should be checked.

We illustrate with the Zimbabwe data. As an example, the first step, with the active cell in the rainfall column, is to use Spread ⇒ Calculate ⇒ Code to Groups. This gives the dialogue in Fig. 18.2h. With the limits we have chosen there, we see there were 1503 days with 5mm rainfall or more, and so on.

Fig. 18.2h Checking the data prior to a dry-spells analysis

Now use the Stats ⇒ Summary Statistics ⇒ Frequency Tables dialogue, as shown in Fig. 18.2i.

Fig. 18.2i Stats ⇒ Summary Statistics ⇒ Frequency Tables

This gives the percentages of values in each group for each year, as shown in Fig. 18.2j. The final column in this table gives the count in each year. We see that there are 365 values, with 366 in leap years. That is comforting! Overall, the bottom line indicates that 82% of the days were dry and 4% had rain of less than 1mm. Looking at the individual years does not give any cause for concern.

Fig. 18.2j Percentages of days each year in different rain groups
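If this check has to be repeated for many stations, the menu steps above have command equivalents of the kind introduced in Chapter 16. The lines below are our own sketch, not taken from the guide: they assume daily data with a variate called rain and a factor called year (if year is a variate, it must first be converted, for example with GROUP), and they use the mean of a 0/1 indicator as the proportion of days in each category.

calc dry = (rain .eq. 0)                        " 1 for a dry day, 0 otherwise "
calc small = (rain .gt. 0) .and. (rain .lt. 1)  " rain recorded, but less than 1mm "
tabulate [class=year] dry, small ; means = pdry, psmall
print pdry, psmall ; dec = 2,2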

The next step would be to look in more detail, probably by omitting the dry days and looking at the important months in the year. Some of the results could be displayed in graphical form. These checks do not take long. It takes longer to decide on the appropriate action when there are problems, but this is nowhere near as long as the time that is wasted if the analysis is done first and the problems are only found afterwards.

As a second example, we show a trellis plot of the maximum and minimum temperatures for the daily data from Bulawayo in Fig. 18.2k. This is designed more for the screen than for the printer.

Fig. 18.2k Trellis plot of the daily maximum and minimum temperatures

If there are particular years or days that need closer inspection, then parts of the plot can be expanded, as we show in Fig. 18.2l.

Fig. 18.2l One year from Fig. 18.2k in more detail

18.3 ODBC

Climatic data, with the associated meta-data such as the station name and location, are usually kept in a database. This often covers many stations, and hence each analysis is usually on a subset of the data in the database. Most statistics packages, including GenStat, can connect to a database and extract the data that are needed using ODBC (Open Database Connectivity).

We illustrate using data from an Excel file; the process is similar for data from a database. You can choose to load the data directly into the GenStat server, or into a GenStat spreadsheet. We chose a spreadsheet, and therefore started with Spread ⇒ New ⇒ ODBC Data Query, see Fig. 18.3a.

Fig. 18.3a Starting an ODBC data query

The menu is then as shown in Fig. 18.3b. (N.B. The actual list of data sources that appears depends on what has been set up or installed on your computer.) We ask for a machine data source, and state that we will use an Excel file. We then press OK.

Fig. 18.3b Specifying that our data are in an Excel file

If you do not find Excel in the list on your machine (Fig. 18.3b), select Machine Data Source and click New, rather than [OK]. Highlight User Data Source and click Next (Fig. 18.3c). Select Microsoft Excel Driver (*.xls), then click Next and Finish. Then continue with the importing.

We now specify the file, Fig. 18.3c, and choose the monthly data we used in Chapter 7.

Fig. 18.3c Choosing the file

Fig. 18.3d Specifying the variables to input

It is at this point, Fig. 18.3d, that we see the distinction between the importing from Excel that we have done earlier and the use of the ODBC facility. We can now choose which of the columns we wish to import. In this case we choose all of them.

Fig. 18.3e A subset of the rows (or cases) can be chosen

We are then asked which rows to include, Fig. 18.3e. As an example, we choose to import only the rows from a chosen year onwards. Then we have the final query, on whether we wish to continue with the import, or save the code that we have generated through this query.

Fig. 18.3f The final step

In Fig. 18.3f we click on [Finish] and the data are imported.

18.4 Software

This guide is deliberately called Analysing Climatic Data, with "using GenStat for Windows" as the afterthought. Our emphasis is on the analysis, rather than the particular software. In this section we outline the decisions on software that individuals and organizations might make when analysing climatic data is a part of their work. We limit ourselves primarily to decisions concerning statistical software.

The authors of this guide have no commercial interest in GenStat. Our centre uses many statistics packages. We produce only one package ourselves, and that is called Instat. 29 It is available free to all individuals. It is currently the only general statistics package that includes a special menu and guide for the analysis of climatic data. We still think that many users will wish to continue with Instat for some of their statistical analyses of their climatic data. But Instat+ is intended as an introductory package, and some users will need statistical software that is more powerful for some of their applications.

Before statistics packages were in Windows, they were command driven, so you had to learn the language to be able to use the software. This was a considerable effort, and hence users often limited themselves to a single package. The ease of use of modern statistics packages has changed all that. Cost apart, you no longer need to use a single package, but can consider the best mix for your work.

Some users are content with a spreadsheet for their statistical work. For most applications, the analysis of climatic data benefits from the addition of a statistics package. This does not mean that you have to stop using the spreadsheet: you can add the statistics package to your spreadsheet use. Another recent development is the ease with which statistics packages read data in different formats, and are also able to transfer data between different packages. We saw the use of ODBC in the last section, and all common statistics packages read data from Excel. Most also write back to Excel, so you can return to a spreadsheet, perhaps to draw some presentation graphs.

Having decided that climatic analyses could benefit from the accessibility of a powerful statistics package, we describe below the criteria that made us choose GenStat. But there is healthy competition in the statistics package market. We hope that some readers might disagree with our choice. Then perhaps they will produce a guide for the analysis of climatic data using an alternative package. This will help users, and also encourage the suppliers to keep improving their products. SAS, SPSS, Systat, S-PLUS, R and Stata, see Fig. 18.4a, are among the other statistics packages that you might consider. For those who currently use a statistics package, the same argument applies as for Excel users. This need not be a competition between packages. If facilities in this guide would

29 We also produce an Excel add-in called SSC-Stat, that adds to the statistical facilities and encourages good statistical practice when using Excel.


More information

Excel 2013 Intermediate

Excel 2013 Intermediate Excel 2013 Intermediate Quick Access Toolbar... 1 Customizing Excel... 2 Keyboard Shortcuts... 2 Navigating the Spreadsheet... 2 Status Bar... 3 Worksheets... 3 Group Column/Row Adjusments... 4 Hiding

More information

Gloucester County Library System EXCEL 2007

Gloucester County Library System EXCEL 2007 Gloucester County Library System EXCEL 2007 Introduction What is Excel? Microsoft E x c e l is an electronic s preadsheet program. I t is capable o f performing many diff e r e n t t y p e s o f c a l

More information

Learning Worksheet Fundamentals

Learning Worksheet Fundamentals 1.1 LESSON 1 Learning Worksheet Fundamentals After completing this lesson, you will be able to: Create a workbook. Create a workbook from a template. Understand Microsoft Excel window elements. Select

More information

Experiment 1 CH Fall 2004 INTRODUCTION TO SPREADSHEETS

Experiment 1 CH Fall 2004 INTRODUCTION TO SPREADSHEETS Experiment 1 CH 222 - Fall 2004 INTRODUCTION TO SPREADSHEETS Introduction Spreadsheets are valuable tools utilized in a variety of fields. They can be used for tasks as simple as adding or subtracting

More information

Excel 2010: Basics Learning Guide

Excel 2010: Basics Learning Guide Excel 2010: Basics Learning Guide Exploring Excel 2010 At first glance, Excel 2010 is largely the same as before. This guide will help clarify the new changes put into Excel 2010. The File Button The purple

More information

Page 1. Graphical and Numerical Statistics

Page 1. Graphical and Numerical Statistics TOPIC: Description Statistics In this tutorial, we show how to use MINITAB to produce descriptive statistics, both graphical and numerical, for an existing MINITAB dataset. The example data come from Exercise

More information

Prepared By: Graeme Hilson. U3A Nunawading

Prepared By: Graeme Hilson. U3A Nunawading 0 Prepared By: Graeme Hilson U3A Nunawading - 2015 1 CONTENTS This Course Page 3 Reference Material Page 3 Introduction page 3 Microsoft Excel Page 3 What is a Spreadsheet Page 4 Excel Screen Page 4 Using

More information

Studying in the Sciences

Studying in the Sciences Organising data and creating figures (charts and graphs) in Excel What is in this guide Familiarisation with Excel (for beginners) Setting up data sheets Creating a chart (graph) Formatting the chart Creating

More information

Day : Date : Objects : Open MS Excel program * Open Excel application. Select : start. Choose: programs. Choose : Microsoft Office.

Day : Date : Objects : Open MS Excel program * Open Excel application. Select : start. Choose: programs. Choose : Microsoft Office. Day : Date : Objects : Open MS Excel program * Open Excel application. Select : start Choose: programs Choose : Microsoft Office Select: Excel *The interface of Excel program - Menu bar. - Standard bar.

More information

MS Excel Henrico County Public Library. I. Tour of the Excel Window

MS Excel Henrico County Public Library. I. Tour of the Excel Window MS Excel 2013 I. Tour of the Excel Window Start Excel by double-clicking on the Excel icon on the desktop. Excel may also be opened by clicking on the Start button>all Programs>Microsoft Office>Excel.

More information

Introduction to Excel 2007 for ESL students

Introduction to Excel 2007 for ESL students Introduction to Excel 2007 for ESL students Download at http://www.rtlibrary.org/excel2007esl.pdf Developed 2010 by Barb Hauck-Mah, Rockaway Township Library for The American Dream Starts @your Library

More information

SAMPLE ICDL 5.0. International Computer Driving Licence. Module 4 - Spreadsheets Using Microsoft Excel 2010

SAMPLE ICDL 5.0. International Computer Driving Licence. Module 4 - Spreadsheets Using Microsoft Excel 2010 ICDL 5.0 International Computer Driving Licence Module 4 - Spreadsheets Using Microsoft Excel 2010 This training, which has been approved by ECDL Foundation, includes exercise items intended to assist

More information

Reference and Style Guide for Microsoft Excel

Reference and Style Guide for Microsoft Excel Reference and Style Guide for Microsoft Excel TABLE OF CONTENTS Getting Acquainted 2 Basic Excel Features 2 Writing Cell Equations Relative and Absolute Addresses 3 Selecting Cells Highlighting, Moving

More information

Spreadsheet Warm Up for SSAC Geology of National Parks Modules, 2: Elementary Spreadsheet Manipulations and Graphing Tasks

Spreadsheet Warm Up for SSAC Geology of National Parks Modules, 2: Elementary Spreadsheet Manipulations and Graphing Tasks University of South Florida Scholar Commons Tampa Library Faculty and Staff Publications Tampa Library 2009 Spreadsheet Warm Up for SSAC Geology of National Parks Modules, 2: Elementary Spreadsheet Manipulations

More information

Microsoft Excel Using Excel in the Science Classroom

Microsoft Excel Using Excel in the Science Classroom Microsoft Excel Using Excel in the Science Classroom OBJECTIVE Students will take data and use an Excel spreadsheet to manipulate the information. This will include creating graphs, manipulating data,

More information

ICT & MATHS. Excel 2003 in Mathematics Teaching

ICT & MATHS. Excel 2003 in Mathematics Teaching ICT & MATHS Excel 2003 in Mathematics Teaching Published by The National Centre for Technology in Education in association with the Project Maths Development Team. Permission granted to reproduce for educational

More information

Introduction to Excel 2007

Introduction to Excel 2007 Introduction to Excel 2007 These documents are based on and developed from information published in the LTS Online Help Collection (www.uwec.edu/help) developed by the University of Wisconsin Eau Claire

More information