Using SAS to Manage Biological Species Data and Calculate Diversity Indices

Similar documents
CMISS the SAS Function You May Have Been MISSING Mira Shapiro, Analytic Designers LLC, Bethesda, MD

Introduction to entropart

A Visual Step-by-step Approach to Converting an RTF File to an Excel File

SAS/STAT 13.1 User s Guide. The NESTED Procedure

Producing Summary Tables in SAS Enterprise Guide

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy

1. Introduction. 2. Modelling elements III. CONCEPTS OF MODELLING. - Models in environmental sciences have five components:

Describe the two most important ways in which subspaces of F D arise. (These ways were given as the motivation for looking at subspaces.

The NESTED Procedure (Chapter)

Not Just Merge - Complex Derivation Made Easy by Hash Object

Section 1.1 Definitions and Properties

Linear Programming with Bounds

How to Conduct a Heuristic Evaluation

DESIGN OF A COMPOSITE ARITHMETIC UNIT FOR RATIONAL NUMBERS

Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by

York Public Schools Subject Area: Mathematics Course: 6 th Grade Math NUMBER OF DAYS TAUGHT DATE

Move-to-front algorithm

Gateway Regional School District VERTICAL ALIGNMENT OF MATHEMATICS STANDARDS Grades 3-6

Multi-Criteria Decision Making 1-AHP

ABSTRACT INTRODUCTION PROBLEM: TOO MUCH INFORMATION? math nrt scr. ID School Grade Gender Ethnicity read nrt scr

Software Development Techniques. 26 November Marking Scheme

DM and Cluster Identification Algorithm

A SAS MACRO for estimating bootstrapped confidence intervals in dyadic regression

Tweaking your tables: Suppressing superfluous subtotals in PROC TABULATE

Macros I Use Every Day (And You Can, Too!)

For Module 2 SKILLS CHECKLIST. Fraction Notation. George Hartas, MS. Educational Assistant for Mathematics Remediation MAT 025 Instructor

Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University

(Refer Slide Time 3:31)

by Kevin M. Chevalier

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

Multiplying and Dividing Rational Expressions

Data Manipulation with SQL Mara Werner, HHS/OIG, Chicago, IL

words, and models Assessment limit: Use whole numbers (0-10,000) b. Express whole numbers in expanded form

KEYWORDS Metadata, macro language, CALL EXECUTE, %NRSTR, %TSLIT

YEAR 7 KEY STAGE THREE CURRICULUM KNOWLEDGE AND SKILLS MAPPING TOOL

Using Basic Formulas 4

1. To add (or subtract) fractions, the denominators must be equal! a. Build each fraction (if needed) so that both denominators are equal.

Mini-Lectures by Section

Cheat sheet: Data Processing Optimization - for Pharma Analysts & Statisticians

Statistics, Data Analysis & Econometrics

Common Core State Standards Mathematics (Subset K-5 Counting and Cardinality, Operations and Algebraic Thinking, Number and Operations in Base 10)

Omitting Records with Invalid Default Values

Fast, Easy, and Publication-Quality Ecological Analyses with PC-ORD

A Prehistory of Arithmetic

6.001 Notes: Section 4.1

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Hexadecimal Numbers. Journal: If you were to extend our numbering system to more digits, what digits would you use? Why those?

Customizing ODS Graphics to Publish Visualizations

3rd Grade Texas Math Crosswalk Document:

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #44. Multidimensional Array and pointers

Section 4 General Factorial Tutorials

Repetition Through Recursion

Multiplying and Dividing Rational Expressions

Unit Map: Grade 3 Math

Easing into Data Exploration, Reporting, and Analytics Using SAS Enterprise Guide

Interleaving a Dataset with Itself: How and Why

CPSC 532L Project Development and Axiomatization of a Ranking System

Getting started with simulating data in R: some helpful functions and how to use them Ariel Muldoon August 28, 2018

Lastly, in case you don t already know this, and don t have Excel on your computers, you can get it for free through IT s website under software.

This paper describes a report layout for reporting adverse events by study consumption pattern and explains its programming aspects.

HyperNiche. Multiplicative Habitat Modeling

these developments has been in the field of formal methods. Such methods, typically given by a

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

Is the statement sufficient? If both x and y are odd, is xy odd? 1) xy 2 < 0. Odds & Evens. Positives & Negatives. Answer: Yes, xy is odd

Outline and Reading. Analysis of Algorithms 1

The Consequences of Poor Data Quality on Model Accuracy

Shell CSCE 314 TAMU. Haskell Functions

Using a Fillable PDF together with SAS for Questionnaire Data Donald Evans, US Department of the Treasury

An Application of PROC NLP to Survey Sample Weighting

Intermediate Algebra. Gregg Waterman Oregon Institute of Technology

Equations and Problem Solving with Fractions. Variable. Expression. Equation. A variable is a letter used to represent a number.

PharmaSUG Paper AD06

Chapter 1: Number and Operations

CHAPTER 6. The Normal Probability Distribution

Lukáš Plch at Mendel university in Brno

Optimizing System Performance

CS1 Recitation. Week 1

parameters, network shape interpretations,

Information Criteria Methods in SAS for Multiple Linear Regression Models

Stage 1 Place Value Calculations Geometry Fractions Data. Name and describe (using appropriate vocabulary) common 2d and 3d shapes

2.2 Order of Operations

SAS/STAT 13.1 User s Guide. The Power and Sample Size Application

command.name(measurement, grouping, argument1=true, argument2=3, argument3= word, argument4=c( A, B, C ))

When using computers, it should have a minimum number of easily identifiable states.

Armstrong State University Engineering Studies MATLAB Marina 2D Arrays and Matrices Primer

REDUCING INFORMATION OVERLOAD IN LARGE SEISMIC DATA SETS. Jeff Hampton, Chris Young, John Merchant, Dorthe Carr and Julio Aguilar-Chang 1

Checking for Duplicates Wendi L. Wright

While You Were Sleeping, SAS Was Hard At Work Andrea Wainwright-Zimmerman, Capital One Financial, Inc., Richmond, VA

METAPOPULATION DYNAMICS

Paul's Online Math Notes. Online Notes / Algebra (Notes) / Systems of Equations / Augmented Matricies

Introduction to Matlab. By: Hossein Hamooni Fall 2014

University of Alberta

CPSC 67 Lab #5: Clustering Due Thursday, March 19 (8:00 a.m.)

Real-time Monitoring of Multi-mode Industrial Processes using Feature-extraction Tools

Middle School Math Course 3

Gateway Regional School District VERTICAL ARTICULATION OF MATHEMATICS STANDARDS Grades K-4

Model Design and Evaluation

Curriculum Guide for Doctor of Philosophy degree program with a specialization in ENVIRONMENTAL PUBLIC HEALTH

SOLVING THE TRANSPORTATION PROBLEM WITH PIECEWISE-LINEAR CONCAVE COST FUNCTIONS

OLAP Drill-through Table Considerations

Transcription:

SCSUG November 2014 Using SAS to Manage Biological Species Data and Calculate Diversity Indices ABSTRACT Paul A. Montagna, Harte Research Institute, TAMU-CC, Corpus Christi, TX Species level information is necessary for conservation, management, and research into ecological and environmental systems. A common problem is that one doesn t know what the community composition will be, so data management is complex because it has to be managed by species and not by samples. Once the data management problem is solved, there will be a need to summarize information at the sample level, and perhaps calculate a diversity index. This presentation will describe algorithms for commonly used diversity indices, and demonstrate how SAS was used to solve the data management and computational problems. KEYWORDS PROC SORT, PROC TRANSPOSE, DATA STEP PROGRAMMING, ARRAYS, DO LOOPS, GOTO INTRODUCTION Biodiversity is a common indicator of ecosystem health and often is measured in environmental assessment studies. Both terrestrial and aquatic samples are composed of multiple species and it is useful to summarize the species in a sample by computing a diversity index. Diversity indices intend to describe richness (i.e., how many species are present) and evenness (i.e., how the counts are distributed among the species). For example, it is common for disturbed samples to have dominance (i.e., low evenness regardless of the richness) compared to samples from undisturbed environments. An important attribute of diversity is that the more samples taken, the more species are found. This concept is referred to as the species-sample size principle and it implies that one will always find new species. This attribute implies that when creating biological databases, it is always advisable to use a vector approach where each species from a sample is a record (Table 1). A matrix approach (where each sample is a record and species are in columns) will not work because it would be necessary to continually add new columns when new species are found. Thus, samples are decomposed and must be reassembled in order calculate diversity indices. In the present study, various diversity algorithms are discussed, and then SAS data base approaches and analytical methods are shown to implement the equations. MATHEMATICAL METHODS There are many approaches to calculating diversity indices, and most fall into one of the two classes being: richness indices or evenness indices (Ludwig and Reynolds 1988). The simplest richness index is R, the total number of species (equation 1).... (1) Richness indices are weak because they vary as a function of sample size. One of the first richness indices to account for sample size (n) is R1 (Margalef 1958).... (2) A similar formulation which is a less severe transformation of s sample size is R2 (Menhinick 1964).... (3) There are many different forms of diversity indices that try to account for evenness as well as richness. Hill (1973) reviewed many and demonstrated that they were part of a family or series of diversity numbers that are based on the proportional representation of each of the i species in a sample. Hill s diversity numbers are N0, N1, and N2. N0 is the total count, which I define above as R, the total number of species. Hill s N1 is the exponential form of the Shannon and Weaver (1948) H' diversity index. where Table 1. Biological data. Sample Species Value 1 1 6 1 2 5 2 1 9 2 3 1 2 4 2... (4) 1

[( ) ( )]... (5) Hill s N2 is the reciprocal of Simpson (1949) index λ.... (6) where ( )... (7) Evenness indices have also been created. An evenness index should be maximum if all the species are equally abundant, and then then should decrease to zero as relative abundances. Again Hill (1973) shows how a series of proportional ratios of Hill s Numbered indices can be used to generalize the equations and computing. The most common evenness index used by ecologists is Pielou (1975) J' where a perfectly even community has an index value of one because there is one individual on a proportional basis for every species. Hill named J' E1 Sheldon (1969) proposed the exponential form of E1, and Hill named this E2. Heip (1974) proposed subtracting the minimum from E2, and Hill named this E3.... (8)... (9)... (10) Hill (1973) proposed the ration of N2 to N1, and this is named E4. This ratio is useful because is it in units of very abundant species. That is, as N1 and N2 reduce to one species, the ratio reduces to a value of 1.... (11) Alatalo (1981) proposed a modified Hill s ratio that approaches zero as a single species becomes more dominant. ( )... (12) DATA METHODS The method to create biological databases is an important decision. Species data are multivariate responses in a sampling design, meaning the dataset will contain samples as rows and species as columns, and it will be easier to analyze the data once it is in this matrix format. However, it is recommended that biological community data be stored in a vector format (Table 1). The data model should contain every species within every sample as rows and the count (or size) of the species in a column (Table 1). This is important because, we never know which species will be found in any given sample, and new species are found all the time. So, most biological data is stored as a vector, not a matrix. Often samples are collected in a complex experimental design, so you can either include columns for all elements of the design, or include a key variable for each measurement value and then create a second relational table with key variables. It is important to distinguish your data from your metadata. Metadata is data about the data. Include the data source with contact information, what the weather was like, the exact georeferenced location, the units of the variables, adequate descriptions of variable names and codes used to identify treatments, and any other information that is usually overlooked or assumed to be easy to remember. Just remember this: You won t remember it. It is very important to write down all the details, and then record it in your data base. For example, note that in Table 1, species 1 occurs in samples 2 and 3, but in sample 2 there are two new species (3 and 4). This storage approach creates a problem because simply computing sample means with PROC MEANS would return the wrong values. For example the average for Species 2 would be 5, because the zero value for Species 2 was not found in the second sample. Zeros must be in the data set to calculate means. This is easily accomplished by transposing the dataset twice. The first transpose will create the missing cells by using the PREFIX option and the ID statement so that each species is placed in its own column. The second transpose returns the data set to the original format, and then it is easy to replace the missing values with zeros as in the example below. 2

PROC SORT DATA=sp; BY sample species; PROC TRANSPOSE DATA=sp OUT=tsp PREFIX=S; ID species; PROC TRANSPOSE DATA=tsp OUT=sp1; DATA sp2; SET sp1; IF value=. THEN value=0; Table 2. Results of first transpose. The result of the program to put Sample S1 S2 S3 S4 zeros in the dataset for missing species is shown in Table 2. The vector in Table 1 (data sp ) has 1 6 5.. been converted to a matrix (data 2 9. 1 2 tsp. The second transpose puts the dataset back into a vector format (data sp1 ), and then the data step is used to convert the missing values to zero (Table 3, data sp2 ). The data set with zeros can be used to calculate the average abundance of each species correctly. Table 3. Results of second transpose. Sample Species Value 1 1 6 1 2 5 1 3 0 1 4 0 2 1 9 2 2 0 2 3 1 2 4 2 While using the ID statement in TRANSPOSE is necessary to create a dataset with zeros, it is not if our goal is to simply convert the species data into an array that can then be used to calculate diversity indices. In fact, to calculate the diversity indices, we only need to TRANSPOSE once without the ID statement. An example of this is listed below: PROC SORT DATA=sp; BY sample species; PROC TRANSPOSE DATA=sp OUT=tsp2 PREFIX=S; Table 4. Results of transpose without the ID statement. Sample S1 S2 S3 1 6 5. 2 9 1 2 This data transform creates an array with missing values (Table 4, data tsp2). Notice that the values of the species are now compressed and that the column names (S1 to S3) no longer represent the individual species, but just the next species found in a sample. In contrast, notice in Table 2, which is a result of using the ID statement, that the column names do represent the individual specific species. Leaving out the ID statement creates an array, which is more useful for creating the programming code to calculate the diversity indices. The next task is to calculate various measures of biodiversity (Equations 1-13). This is best accomplished by transposing the data and treating each sample as an array without the ID statement as done in Table 4. We can also calculate total abundance of organisms (i.e., to sum values). First, the ARRAY has to be declared, then it is a simple matter to calculate the index contribution for each species, and then sum the individual contributions. Two cautions: 1) include a test to make certain data exists on the line, that is the sample is not empty or no organisms were found, or there are zeros in the row, and 2) make sure there are no divide by zero errors, which can happen if there is just one species in a sample. Although not included here, it is simple to add an IF THEN ELSE type statement or GOTO statement and test for the values of denominators in the equations. The SAS DATA step code below calculates the indices. Also, don t forget to FORMAT the calculated variables to display the correct number of significant digits. 3

PROC TRANSPOSE DATA=sp OUT=tsp2 PREFIX=S; DATA diversity; SET tsp2; ARRAY s s1-s100; /* Create temporary species array */ DROP s1-s100 _name_; DO OVER s; /* Loop to convert zero to missing */ IF s=0 THEN s=.; END; maxn=max(of s1-s100); /* maxn = highest number of individuals */ tn=sum(of s1-s100); /* tn = total number of individuals */ R=n(of s1-s100); /* R = Richness, total species EQ-1 */ R1=(R-1)/log(tn); /* R1 = Margalef index EQ-2 */ R2=R/sqrt(tn); /* R2 = Menhinick index EQ-3 */ Hprime=0; Lambda=0; /* Define all indices and set to zero */ N0=0; N1=0; N2=0; E1=0; E2=0; E3=0; E4=0; E5=0; DO OVER s; /* Loop to calculate diversity indices */ If s=. THEN GOTO msp; /* Skip species with missing values */ Hprime = Hprime + (-(s/tn) * log(s/tn)); /* Shannon H index EQ-5 */ Lambda = Lambda + (s/tn)**2; /* Simpson λ index EQ-7 */ msp: ; END; N0=R; /* Hill's Number 0, same as R EQ-1 */ N1=exp(Hprime); /* Hill's number 1 EQ-4 */ N2=1/Lambda; /* Hill's number 2 EQ-6 */ E1=log(n1)/log(n0); /* Eveness index, J, EQ-8 */ E2=N1/N0; /* Eveness index EQ-9 */ E3=(N1-1)/(N0-1); /* Eveness index EQ-10 */ E4=N2/N1; /* Eveness index EQ-11 */ E5=(N2-1)/(N1-1); /* Eveness index EQ-12 */ OUTPUT; /* Add new calculated values to dataset */ RETURN; /* Start the next data step */ FORMAT Hprime Lambda R1 R2 N0 N1 N2 E1 E2 E3 E4 E5 8.2; FORMAT tn R 8.0; DROP maxn; The program above contains two complex action steps necessary to calculate the proportions of each individual species: the ARRAY and DO statements. The ARRAY statement creates a series of temporary variables that are used to identify each individual species. The DO statement allows for repetitive processing in a single data step. For example, the first DO loop tests each species (s) and if it is zero it converts it to a missing value. Combining the ARRAY and DO statements is necessary to calculate the proportional representation of each species to the total abundance. This is implemented in the second DO loop where the individual species (s) is divided by the total number of organisms (tn). One caution about DO loops, you have to be careful which letter used as a name and make sure it is unique to the data set. For example, here S is used for each species column, so if we had another numeric variable beginning with an S then it could be processed as if it were a species. This would not occur if it was a character variable. SAS code can be added to test if there are no species present (If tn=. OR tn=0 THEN ), and then the indices can be set to either missing or zero. Likewise, it is simple to add a test if the denominator would be zero, and execute the calculation only for non-zero counts. CONCLUSION The calculation of diversity indices is a common practice among those who collect, manage, and study biological community samples or data. Biological community data has a unique hierarchical structure because each sample generates measurement values (in this case count data) for a unique number of species within each sample. This structure alone has implications on how to design data models for biological community structure data. 4

The SAS system provides powerful tools for data management, statistical analysis, and data visualization. In fact, it would very difficult and time consuming to perform the analyses presented here without the ability to easily transpose data and write code for complex equations. All diversity indices have a weakness in that they are species independent. For example, a sample with 3 species and a second sample with 3 different species have the same richness, but the samples are completely different. Also some indices are very sensitive to sample size, such as R1 and R2. When samples sizes are equal, R is just fine. The most commonly used indices are H' (Equation 5) for diversity and J' (Equation 8) for evenness. While there are many diversity indices, some are more easily interpretable than others. For example Hill s diversity indices N1 and N2 are in the units of number of dominant species, and thus are easy to understand. Likewise, Hill s evenness index E5 is also interpretable because it approaches zero as a single species becomes dominant. REFERENCES Alatalo, R.V. 1981. Problems in the measurement of evenness in ecology. Oikos 37:199-204. Heip, C. 1974. A new index measuring evenness. Journal of the Marine Biological Association 54:555-557. Hill, M.O. 1973. Diversity and evenness: a unifying notation and its consequences. Ecology 54:427-432. Ludwig, J.A. and J.F. Reynolds. 1988. Statistical Ecology: A Primer on Methods and Computing, John Wiley & Sons, New York. Margalef, R. 1958. Information Theory in Ecology. General Systematics 3:36-71. Pielou, E.C. 1975. Ecological Diversity. Wiley, New York. Shannon, C. E. and W. Weaver. 1949. The Mathematical Theory of Communication. University of Illinois Press. Urbana, IL. Sheldon, A.L. 1969. Equitability indices: dependence on species count. Ecology 50:466-467. Simpson, E.H. 1949. Measurement of diversity. Nature 163:688 ACKNOWLEDGMENTS This publication was made possible, in part, by the National Oceanic and Atmospheric Administration, Office of Education Educational Partnership Program award (NA11SEC4810001). Its contents are solely the responsibility of the award recipient Paul Montagna and do not necessarily represent the official views of the U.S. Department of Commerce, National Oceanic and Atmospheric Administration. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Paul A. Montagna Harte Research Institute for Gulf of Mexico Studies Texas A&M University-Corpus Christi 6300 Ocean Drive, Unit 5869 Corpus Christi, Texas 78712 paul.montagna@tamucc.edu Office (361) 825-2040 http://harteresearchinstitute.org/ TRADEMARK INFORMATION SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 5