Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data



1 Spatial Patterns
We will examine methods that are used to analyze patterns in two sorts of spatial data:
Point Pattern Analysis - These methods concern themselves with the location information associated with point data (not attributes associated with those locations, just where they are found).
Geographic Patterns in Areal Data - These methods are used to examine the pattern of attribute values associated with polygon representations of geographic phenomena (i.e. is there a pattern in the attributes of a set of adjacent polygons?).

2 Geographic Patterns in Areal Data
Given a set of geographic areas, whether they are represented using vector polygons or collections of raster cells, that have some accompanying variable or attribute information, we can ask questions like: Does the pattern of values show a spatial organization that differs from what we might expect if the values were distributed randomly? In general, we will answer this sort of question by forming a descriptive statistic that compares the observed pattern to some expected pattern.

3 Geographic Patterns in Areal Data
We can assess this sort of thing in different ways, depending on the type of data we have: If we can count occurrences of a nominal variable per area, we can form a contingency table and use a χ² test to compare the observed values to those expected, OR we can compare pairs of polygons that share a common boundary, computing the joint count statistic for binary nominal data, or Moran's I statistic when we have interval or ratio data that we want to examine for pattern.

4 Contingency Tables and the χ² Test
Any time we have a pair of nominal variables that can be cross-tabulated, this method can be applied. E.g. suppose we conducted a survey of all students taking a Geography course, and asked them to indicate their year {freshman, sophomore, junior, senior} and what county they live in {Orange, Durham, Chatham, Alamance}. We can use this data to form a 4x4 table, where each cell indicates the count of students in a particular year that live in a particular county. This sort of table is called a contingency table, and it can be applied to spatial patterns if one of our nominal variables represents location information (e.g. county).

5 County vs. Year Example Table
[Table: observed counts of students by county (rows: Alamance, Chatham, Durham, Orange, Totals) and by year (columns: Freshman, Sophomore, Junior, Senior, Totals)]
Contingency tables can be built for any data set where we have two nominal variables that we can use to categorize the values into the cells of the table. The application does not have to be spatial, but membership in a particular spatial unit (i.e. inside of a certain polygon) is a convenient approach for spatial analysis.
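
As a rough illustration of the cross-tabulation step, the sketch below counts (county, year) pairs into a table. The survey records are invented for the example and the function name `cross_tabulate` is hypothetical; only the category labels come from the slides.

```python
from collections import Counter

# Hypothetical survey records, one (county, year) pair per student.
records = [
    ("Orange", "freshman"),
    ("Orange", "senior"),
    ("Durham", "freshman"),
    ("Durham", "junior"),
    ("Chatham", "sophomore"),
    ("Alamance", "senior"),
    ("Orange", "senior"),
]

counties = ["Alamance", "Chatham", "Durham", "Orange"]
years = ["freshman", "sophomore", "junior", "senior"]


def cross_tabulate(records, row_labels, col_labels):
    """Count records per (row, column) combination; returns a list of rows."""
    counts = Counter(records)  # missing combinations count as 0
    return [[counts[(r, c)] for c in col_labels] for r in row_labels]


table = cross_tabulate(records, counties, years)
row_totals = [sum(row) for row in table]        # students per county
col_totals = [sum(col) for col in zip(*table)]  # students per year
grand_total = sum(row_totals)

for county, row, total in zip(counties, table, row_totals):
    print(county, row, total)
print("Totals", col_totals, grand_total)
```

The row and column totals computed here are the marginals that a contingency table normally carries in its "Totals" row and column.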

6 Contingency Tables and the χ² Test
Furthermore, we can use the data in a contingency table to assess the presence of a spatial pattern by first forming an expectation of how values of one of the nominal variables should be distributed with respect to the other. E.g. if our hypothesis is that the distribution of years of geography students shouldn't vary according to their county of residence, then the relative proportions of freshmen : sophomores : juniors : seniors should be the same for each of our four counties (even if the total number of students per county is different). We can use the observed frequency counts in each cell of our contingency table to generate expected frequency counts, based on the rule suggested above.

7 County vs. Year Example Table
[Table: expected counts of students by county (rows: Alamance, Chatham, Durham, Orange, Totals) and by year (columns: Freshman, Sophomore, Junior, Senior, Totals)]
Expected values are calculated by multiplying the row total by the column total for each cell, and dividing by the grand total, e.g. for the Freshmen in Alamance County 45 * 28 / 200 = 6.3, and so on for all the cells. This creates expected frequencies that are proportionate to one another across rows and columns.
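
The expected-value rule (row total times column total, divided by the grand total) can be sketched as below. The 2 x 3 table of observed counts is made up for illustration, and `expected_counts` is a hypothetical helper name; the last line simply repeats the Alamance freshmen arithmetic quoted on the slide.

```python
def expected_counts(observed):
    """Expected frequency for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical observed counts (2 areas x 3 categories), not the lecture's table.
observed = [
    [10, 20, 30],
    [20, 20, 0],
]
expected = expected_counts(observed)
for row in expected:
    print(row)

# The single cell worked on the slide: 45 * 28 / 200
alamance_freshmen = 45 * 28 / 200
print(round(alamance_freshmen, 1))
```

Each row of the expected table has the same proportions across columns, which is what "no relationship between the two variables" looks like.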

8 Contingency Tables and the χ² Test
Once we have observed and expected frequencies for each cell in the contingency table, we can use those values to calculate the χ² test statistic:
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
where O_i is the observed frequency in cell i, E_i is the expected frequency in cell i, and n is the number of cells. This χ² test statistic has (r - 1) * (c - 1) degrees of freedom, where r and c are the number of rows and columns in the contingency table. If the observed frequencies are very different from the expected frequencies, the χ² test statistic will be larger than the one-tailed critical value it is compared to, thus detecting the presence of a spatial pattern.
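
A minimal sketch of the test statistic and its degrees of freedom follows. The observed counts are hypothetical, the function name `chi_square` is ours, and the code assumes every expected frequency is nonzero (otherwise the division fails).

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected frequency
            statistic += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1) * (c - 1)
    return statistic, df


# Hypothetical observed counts, not the lecture's table.
observed = [
    [10, 20, 30],
    [20, 20, 0],
]
statistic, df = chi_square(observed)
print(round(statistic, 2), df)
```

The statistic would then be compared with a tabulated critical value for the chosen α and the computed degrees of freedom.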

9 Contingency Table χ² Test Example
Research question: Is there a spatial pattern in the distribution of student years in counties of residence?
1. H0: O ~ E (Frequencies are the same, no pattern)
2. HA: O ≠ E (Frequencies different, pattern present)
3. Select α = 0.05, one-tailed because of how the χ² test is used here
4. We calculate the χ² test statistic using the formula
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = \frac{(4 - 6.3)^2}{6.3} + \frac{(9 - 11.4)^2}{11.4} + \frac{(7 - 6.4)^2}{6.4} + \cdots$$

10 Contingency Table χ² Test Example
5. We now need to find the critical χ² value, first calculating the degrees of freedom: df = (r - 1) * (c - 1) = (4 - 1) * (4 - 1) = 3 * 3 = 9. We can now look up the critical χ² value for our α = 0.05, which we will apply here in a one-tailed fashion; thus we look in the χ² table for p = 0.05 and df = 9 to provide the critical value: 16.92

11 Contingency Table χ² Test Example
6. Finally, we must compare the χ² test value to the χ² critical value, finding that χ² test > χ² crit; therefore we reject H0 and accept HA. Based on the comparison between the expected and observed frequencies, there appears to be some pattern in which counties geography students in different years reside. Notably, this test cannot tell us anything about the pattern's nature, only that the distribution is significantly different from the expected, even distribution of the null hypothesis. Thus there is evidence of spatial autocorrelation, meaning that geography students in certain years tend to live in certain counties.

12 The Joint Count Statistic
The contingency table approach, while it can be applied to spatial analyses in the fashion described, does not actually include any spatial relationship information in its formulation, beyond encoding the coincidence of two nominal variables (when one of those variables represents location information). We can also formulate descriptive statistics that do include spatial relationships, specifically by finding all the regions that share a boundary in a set of polygons, and then comparing attribute values from the pairs to assess the pattern of that attribute.

13 The Joint Count Statistic
The first step in this method is to enumerate all of the pairs of polygons that share a boundary by creating a binary connectivity table (a.k.a. a spatial matrix). For example, using the following five region system:
[Figure: map of five regions labeled A, B, C, D, E]
1. Label the regions
2. Create a table with the same row & column labels
[Table: grid with rows A, B, C, D, E and columns A, B, C, D, E]
3. Fill in the table with 1s and 0s to indicate which regions share a boundary

14 The Joint Count Statistic
We can now take the sum of all the 1s in the binary connectivity table and divide by 2 to calculate the total number of shared boundaries in the system (J):
$$J = \frac{1}{2} \sum_{i=1}^{n} x_i$$
where the x_i are the entries of the connectivity table. Next, we are ready to look at the attribute information associated with the polygons to determine if each pair of polygons that shares a boundary has the same values or different values. The joint count statistic is designed to be used with binary nominal attributes, i.e. the attribute values need to be reduced to some 2 class description for use in this statistic.
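
The table-building and counting steps might be sketched as follows. The list of shared boundaries is an assumption made for illustration (the map itself is not reproduced in this text); it was chosen to give the J = 7 used later in the example. `connectivity_table` is a hypothetical helper name.

```python
regions = ["A", "B", "C", "D", "E"]

# Assumed adjacency for the five region system (not taken from the original map).
shared_boundaries = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
    ("C", "D"), ("C", "E"), ("D", "E"),
]


def connectivity_table(regions, pairs):
    """Build a symmetric table of 1s (share a boundary) and 0s (do not)."""
    index = {name: k for k, name in enumerate(regions)}
    table = [[0] * len(regions) for _ in regions]
    for a, b in pairs:
        table[index[a]][index[b]] = 1
        table[index[b]][index[a]] = 1  # each boundary appears twice in the table
    return table


table = connectivity_table(regions, shared_boundaries)

# Every shared boundary is recorded twice, so halve the sum of the 1s.
J = sum(sum(row) for row in table) // 2

for name, row in zip(regions, table):
    print(name, row)
print("J =", J)
```

The row sums of this table give the number of boundaries each region shares, which is the L_i quantity needed for the variance later on.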

15 The Joint Count Statistic
The binary attributes in question can be any number of possible representations: The example in the text uses positive or negative residuals in polygons from spatially-mapped regression results. It could be any sort of presence/absence data. Another possibility is a reclassification of other sorts of data (e.g. nominal or ordinal schemes reclassified to two classes, or interval/ratio data transformed to binary data in any number of ways, above and below the mean, for example). It can be any scheme in which each polygon is assigned either attribute A or attribute B.

16 The Joint Count Statistic
We will use the suggested example in the text, where each of our five regions is assigned either a + attribute or a - attribute (possibly describing regression residuals). We now have three types of boundaries: ++ boundaries (2), +- boundaries (5), -- boundaries (0). The joint count statistic compares the observed number of +- boundaries (where the value on either side of the boundary is different) to the number that we would expect to find if the values in the polygons did not exhibit any spatial autocorrelation.
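
Counting the three boundary types can be sketched as below. Both the adjacency list and the assignment of + and - to regions are assumptions for illustration; they were chosen only so that the counts come out as 2, 5 and 0, as quoted above. The original map may assign the signs differently.

```python
# Assumed adjacency and sign assignment (hypothetical, not from the original map).
shared_boundaries = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
    ("C", "D"), ("C", "E"), ("D", "E"),
]
signs = {"A": "-", "B": "+", "C": "+", "D": "-", "E": "+"}


def count_join_types(pairs, signs):
    """Classify each shared boundary as ++, +- or -- and count each type."""
    counts = {"++": 0, "+-": 0, "--": 0}
    for a, b in pairs:  # each boundary is listed once, so nothing is double counted
        if signs[a] != signs[b]:
            counts["+-"] += 1
        elif signs[a] == "+":
            counts["++"] += 1
        else:
            counts["--"] += 1
    return counts


counts = count_join_types(shared_boundaries, signs)
print(counts)
```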

17 The Joint Count Statistic
The expected number of +- boundaries is calculated as:
$$E[+-] = \frac{2JPM}{N(N-1)}$$
where J is the total number of shared boundaries, P is the number of + polygons, M is the number of - polygons, and N is the total number of polygons. For our example, E[+-] is calculated as:
$$E[+-] = \frac{2JPM}{N(N-1)} = \frac{2 \cdot 7 \cdot 3 \cdot 2}{5(5-1)} = \frac{84}{20} = 4.2$$
We will form a statistic by comparing the expected number of +- boundaries to the observed number of +- boundaries, which we obtain by simply counting the number of shared boundaries with this characteristic (being careful not to double count).
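
The expected-count formula is easy to check numerically. The sketch below uses the example's values (J = 7, P = 3, M = 2, N = 5); the function name is hypothetical.

```python
def expected_opposite_joins(J, P, M, N):
    """Expected number of +- boundaries: 2JPM / (N(N - 1))."""
    return 2 * J * P * M / (N * (N - 1))


expected = expected_opposite_joins(J=7, P=3, M=2, N=5)
print(round(expected, 2))
```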

18 The Joint Count Statistic
For our example five region system, the observed number of shared +- boundaries is 5. The last ingredient we need to be able to build a test statistic is an estimate of the variance in E[+-], and unfortunately, calculating this quantity requires a somewhat involved expression:
$$V[+-] = E[+-] + \frac{\sum L_i (L_i - 1)\, PM}{N(N-1)} + \frac{4\left[J(J-1) - \sum L_i (L_i - 1)\right] P(P-1)M(M-1)}{N(N-1)(N-2)(N-3)} - E[+-]^2$$
where L_i is the total number of boundaries shared by region i. In our example V[+-] = 0.56.
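
A sketch of the variance calculation follows, assuming the expression has the form E + (first fraction) + (second fraction) - E². The per-region boundary counts L_i = 2, 3, 4, 3, 2 are an assumption (they sum to 2J = 14); with them, this reading reproduces the 0.56 quoted for the example. The function requires N of at least 4 so that no denominator is zero.

```python
def variance_opposite_joins(J, P, M, N, L):
    """Variance of the number of +- boundaries.

    J: total shared boundaries, P: number of + polygons, M: number of - polygons,
    N: total polygons, L: list with the number of boundaries shared by each region.
    """
    expected = 2 * J * P * M / (N * (N - 1))
    sum_l = sum(l * (l - 1) for l in L)  # sum of L_i (L_i - 1)
    term1 = sum_l * P * M / (N * (N - 1))
    term2 = (
        4 * (J * (J - 1) - sum_l) * P * (P - 1) * M * (M - 1)
        / (N * (N - 1) * (N - 2) * (N - 3))
    )
    return expected + term1 + term2 - expected ** 2


# L values are assumed; they must sum to 2 * J.
variance = variance_opposite_joins(J=7, P=3, M=2, N=5, L=[2, 3, 4, 3, 2])
print(round(variance, 2))
```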

19 The Joint Count Statistic
We can now calculate a test statistic to compare the observed number of +- boundaries to the expected number of +- boundaries as a Z-statistic:
$$Z_{test} = \frac{(\text{Obs. } +-) - E[+-]}{\sqrt{V[+-]}}$$
This test statistic is normally distributed with mean 0 and variance 1, thus we can use the standard normal distribution to assess its significance. An exceptional Z-statistic value would indicate a level of spatial autocorrelation that exceeds the expected amount for our system.

20 Z-test for the Joint Count Statistic Example
Research question: Is the areal pattern of + and - values randomly distributed amongst the polygons?
1. H0: O[+-] ~ E[+-] (Areal pattern is random)
2. HA: O[+-] ≠ E[+-] (Pattern is spatially autocorrelated)
3. Select α = 0.05, two-tailed because of H0
4. We will calculate the test statistic using:
$$Z_{test} = \frac{(\text{Obs. } +-) - E[+-]}{\sqrt{V[+-]}} = \frac{5 - 4.2}{\sqrt{0.56}} = 1.07$$
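
The Z-test arithmetic can be checked with a few lines of Python. The two-tailed p-value is an addition of ours, computed from the standard normal distribution through `math.erfc`; the slides work with a critical value instead. The function name `z_test` is hypothetical.

```python
import math


def z_test(observed, expected, variance):
    """Return (Z statistic, two-tailed p-value under the standard normal)."""
    z = (observed - expected) / math.sqrt(variance)
    # Two-tailed p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed


z, p = z_test(observed=5, expected=4.2, variance=0.56)
print(round(z, 2), round(p, 3))
print("significant at alpha = 0.05:", p < 0.05)
```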

21 Z-test for the Joint Count Statistic Example
5. For an α = 0.05 and a two-tailed test, Z crit = 1.96. Z test < Z crit, therefore we cannot reject H0, finding that the areal pattern of + and - values in the polygons is not significantly different from a random areal pattern; there is no evidence of spatial autocorrelation in this system that exceeds that which we would normally expect were the values of + and - simply assigned randomly to polygons.

22 Moran's I Statistic
While the joint count statistic does include spatial information (shared boundaries between polygons) in its assessment of autocorrelation, it does so for very limited sorts of attribute data. We can use the joint count statistic with binary nominal information, whereas in many situations, we have measurements that are considerably more detailed (i.e. interval or ratio data). We may want to assess spatial patterns of interval or ratio data in a fashion that allows us to take full advantage of the detail inherent in those sorts of measurements, checking to see if the pattern of those values exhibits spatial autocorrelation.

23 Moran's I Statistic
For this purpose we can make use of Moran's I statistic, which we can view as an expansion of the ideas implemented in the joint count statistic. Moran's I statistic considers the spatial relationships between each pair of polygons in an areal data set, and encodes the relationships in a connectivity table, just as is done for the join count statistic. However, there is much greater flexibility in the nature of how neighborhood information is included in the Moran's I statistic:

24 Moran's I Statistic
The computation of Moran's I statistic includes a weight term, where the weights express the degree to which any two elements of the polygon coverage are considered to be spatially related or proximal: In the simplest case, two polygons that share a boundary have a weight of 1, and polygons that do not share a boundary have a weight of 0 (binary connectivity case). However, we can imagine all sorts of other schemes: We might weight by the length of boundary that is shared, as a function of a distance between the polygons, or using an expression that indicates how many neighbors apart they are (i.e. 1st order neighbors are adjacent, 2nd order neighbors are separated by one other polygon, etc.).
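
To make the "how many neighbors apart" idea concrete, the sketch below finds neighbor order with a breadth-first search over an adjacency list and turns it into weights under two schemes. The adjacency list is assumed for illustration, and the inverse-order weighting is just one hypothetical choice among many; the slides do not prescribe it.

```python
from collections import deque

# Assumed adjacency list for a five region system (hypothetical).
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D", "E"],
    "D": ["B", "C", "E"],
    "E": ["C", "D"],
}


def neighbor_orders(adjacency, start):
    """Breadth-first search: order 1 = adjacent, order 2 = one polygon between, etc."""
    orders = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in orders:
                orders[neighbor] = orders[current] + 1
                queue.append(neighbor)
    return orders


def binary_weight(order):
    """Binary connectivity case: 1 for adjacent polygons, 0 otherwise."""
    return 1.0 if order == 1 else 0.0


def inverse_order_weight(order):
    """One possible alternative: weight falls off as 1 / neighbor order."""
    return 1.0 / order if order > 0 else 0.0


orders_from_a = neighbor_orders(adjacency, "A")
print(orders_from_a)
print({k: binary_weight(v) for k, v in orders_from_a.items()})
print({k: inverse_order_weight(v) for k, v in orders_from_a.items()})
```

Under the binary scheme region A is related only to B and C; under the inverse-order scheme D and E also contribute, with half the weight.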

25 Moran's I Statistic
Thus, for each and every pair of polygons in the system, a weight expresses the degree to which they are spatially related (close to each other, connected, etc.). This weight term is multiplied by an expression that compares the attribute values of each and every pair of polygons, by calculating the mean and standard deviation for the whole data set, and then comparing the z-scores of the variable values for each polygon to that of the other:
$$\text{Moran's } I = \frac{n \sum_i \sum_j w_{ij} z_i z_j}{(n-1) \sum_i \sum_j w_{ij}}$$
where n is the number of polygons, w_ij is the weight for the combination of the polygon in column i and the polygon in row j of the connectivity matrix, and z_i and z_j are z-scores.
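
A minimal sketch of the formula above, written from scratch rather than taken from any library. It assumes the z-scores use the sample standard deviation (dividing by n - 1), which pairs naturally with the (n - 1) in the denominator. The attribute values and the binary weights matrix are hypothetical.

```python
import math


def morans_i(values, weights):
    """Moran's I = n * sum_ij(w_ij z_i z_j) / ((n - 1) * sum_ij(w_ij))."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 denominator); this is an assumption.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z = [(v - mean) / std for v in values]

    cross = sum(
        weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n)
    )
    total_weight = sum(sum(row) for row in weights)
    return n * cross / ((n - 1) * total_weight)


# Hypothetical attribute values for regions A to E.
values = [20, 10, 15, 16, 9]

# Hypothetical binary connectivity matrix (1 = shared boundary), rows/columns A to E.
weights = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
]

result = morans_i(values, weights)
print(round(result, 3))
```

Swapping in a different weights matrix (for example one based on distance or neighbor order) changes the result even though the attribute values stay the same.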

26 Moran's I Statistic
Moran's I statistic is a normalized statistic that can be interpreted much like a correlation coefficient: It produces values ranging from near +1, which indicate a very strong spatial pattern in which nearby values are similar, to near -1, which indicate strong negative spatial autocorrelation. Values near -1 are extremely rare in real data: we can certainly produce simulated patterns that exhibit strong negative autocorrelation, but finding such things in nature is all but unheard of, which is more or less what Tobler's Law predicts. Values around 0 indicate an absence of spatial pattern, neither showing organization where nearby values are similar, nor the ultra-rare opposite of that condition.

27 Moran's I Statistic
The value of a Moran's I statistic depends strongly on the particular weighting method used: Given the same data, depending on how the spatial relationships between pairs of polygons are encoded, one can produce Moran's I values of varying magnitude, despite the fact that the underlying data and pattern are the same. This shows how strongly the conceptual choice of how to describe spatial relationships influences the results. For conceptual ease, we will use the same definition we used in the joint count example: If two polygons share a boundary, they will be assigned a weight of 1 in the binary connectivity table, otherwise they will be given a value of 0, indicating that the comparison of their values has no impact on the statistic because they are not adjacent.
David Tenenbaum GEOG 090 UNC-CH Spring 2005

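28 Moran's I Statistic Example
[Figure: map of the five regions labeled A, B, C, D, E]
[Table: binary connectivity matrix W = {w_ij}, with columns i and rows j labeled A, B, C, D, E]
[Table: Polygon, Value and Z-Score for polygons A, B, C, D, E; Mean 14; Std. Dev.]
$$\text{Moran's } I = \frac{n \sum_i \sum_j w_{ij} z_i z_j}{(n-1) \sum_i \sum_j w_{ij}}$$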

29 Moran's I Statistic Example
To calculate the statistic, substitute the appropriate values into the equation:
$$\text{Moran's } I = \frac{n \sum_i \sum_j w_{ij} z_i z_j}{(n-1) \sum_i \sum_j w_{ij}} = \frac{5 \sum_i \sum_j w_{ij} z_i z_j}{(5-1) \cdot 14}$$
$$\sum_i \sum_j w_{ij} z_i z_j = 2\,[(1.33)(-0.88) + (1.33)(0.22) + (-0.88)(0.22) + (-0.88)(0.44) + (0.22)(0.44) + (0.22)(-1.11) + (0.44)(-1.11)] = -4.19$$
$$\text{Moran's } I = \frac{5\,(-4.19)}{(5-1) \cdot 14} = -0.37$$
