Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data


Spatial Patterns

We will examine methods that are used to analyze patterns in two sorts of spatial data:

- Point Pattern Analysis: These methods concern themselves with the location information associated with point data (not the attributes associated with those locations, just where they are found).
- Geographic Patterns in Areal Data: These methods are used to examine the pattern of attribute values associated with polygon representations of geographic phenomena (i.e. is there a pattern in the attributes of a set of adjacent polygons?).

Geographic Patterns in Areal Data

Given a set of geographic areas that have some accompanying variable or attribute information, whether they are represented using vector polygons or collections of raster cells, we can ask questions like: Does the pattern of values show a spatial organization that differs from what we would expect if the values were distributed randomly?

In general, we will answer this sort of question by forming a descriptive statistic that compares the observed pattern to some expected pattern.

Geographic Patterns in Areal Data

We can assess spatial organization in different ways, depending on the type of data we have:

- If we can count occurrences of a nominal variable per area, we can form a contingency table and use a χ² test to compare the observed values to those expected, OR
- We can compare pairs of polygons that share a common boundary, computing the joint count statistic (often written "join count statistic") for binary nominal data, or Moran's I statistic when we have interval or ratio data that we want to examine for pattern.

Contingency Tables and the χ² Test

This method can be applied any time we have a pair of nominal variables that can be cross-tabulated.

E.g. suppose we conducted a survey of all students taking a Geography course, and asked them to indicate their year {freshman, sophomore, junior, senior} and what county they live in {Orange, Durham, Chatham, Alamance}.

We can use this data to form a 4x4 table, where each cell indicates the count of students in a particular year that live in a particular county.

This sort of table is called a contingency table, and it can be applied to spatial patterns if one of our nominal variables represents location information (e.g. county).

County vs. Year Example Table (observed counts)

              Freshman  Sophomore  Junior  Senior  Totals
    Alamance      4         7         7      10      28
    Chatham       8        16        22      27      73
    Durham       21        10         6      10      47
    Orange       12        13         9      18      52
    Totals       45        46        44      65     200

Contingency tables can be built for any data set where we have two nominal variables that we can use to categorize the values into the cells of the table. The application does not have to be spatial, but membership in a particular spatial unit (i.e. inside of a certain polygon) is a convenient approach for spatial analysis.

Contingency Tables and the χ² Test

Furthermore, we can use the data in a contingency table to assess the presence of a spatial pattern by first forming an expectation of how values of one of the nominal variables should be distributed with respect to the other.

E.g. if our hypothesis is that the distribution of years of geography students shouldn't vary according to their county of residence, then the relative proportions of freshmen : sophomores : juniors : seniors should be the same for each of our four counties (even if the total number of students per county is different).

We can use the observed frequency counts in each cell of our contingency table to generate expected frequency counts, based on the rule suggested above.

County vs. Year Example Table (expected counts)

              Freshman  Sophomore  Junior  Senior  Totals
    Alamance     6.3       6.4       6.2     9.1     28
    Chatham     16.4      16.8      16.1    23.7     73
    Durham      10.6      10.8      10.3    15.3     47
    Orange      11.7      12.0      11.4    16.9     52
    Totals      45        46        44      65      200

Expected values are calculated by multiplying the row total by the column total for each cell, and dividing by the grand total, e.g. for the Freshmen in Alamance County, 45 * 28 / 200 = 6.3, and so on for all the cells.

This creates expected frequencies that are proportionate to one another across rows and columns.
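As a minimal sketch of the row-total times column-total rule, the following Python snippet rebuilds the expected table from the observed counts. The function name `expected_counts` and the dictionary layout are illustrative choices, not part of the original slides.

```python
# Observed counts from the County vs. Year table
# (rows: counties; columns: freshman, sophomore, junior, senior).
observed = {
    "Alamance": [4, 7, 7, 10],
    "Chatham": [8, 16, 22, 27],
    "Durham": [21, 10, 6, 10],
    "Orange": [12, 13, 9, 18],
}


def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    return {
        name: [rt * ct / grand_total for ct in col_totals]
        for name, rt in zip(table, row_totals)
    }


expected = expected_counts(observed)
for county, cells in expected.items():
    # Rounded to one decimal place, as in the slide's table.
    print(county, [round(v, 1) for v in cells])
```

Rounding is applied only for display here; the unrounded values are kept in `expected`.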

Contingency Tables and the χ² Test

Once we have observed and expected frequencies for each cell in the contingency table, we can use those values to calculate the χ² test statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where:
- O_i is the observed frequency in cell i
- E_i is the expected frequency in cell i
- n is the number of cells

This χ² test statistic has (r - 1) * (c - 1) degrees of freedom, where r and c are the number of rows and columns in the contingency table.

If the observed frequencies are very different from the expected frequencies, the χ² test statistic will be larger than the one-tailed critical value it is compared to, thus detecting the presence of a spatial pattern.

Contingency Table χ² Test Example

Research question: Is there a spatial pattern in the distribution of student years in counties of residence?

1. H_0: O ~ E (Frequencies are the same, no pattern)
2. H_A: O ≠ E (Frequencies different, pattern present)
3. Select α = 0.05, one-tailed because of how the χ² test is used here
4. We calculate the χ² test statistic using the formula, summing over all 16 cells (using the expected values rounded to one decimal place, as tabulated):

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = \frac{(4 - 6.3)^2}{6.3} + \frac{(7 - 6.4)^2}{6.4} + \dots + \frac{(9 - 11.4)^2}{11.4} + \frac{(18 - 16.9)^2}{16.9} = 22.61$$

Contingency Table χ² Test Example

5. We now need to find the critical χ² value, first calculating the degrees of freedom:

   df = (r - 1) * (c - 1) = (4 - 1) * (4 - 1) = 3 * 3 = 9

   We can now look up the χ²_crit value for our α = 0.05, which we apply here in a one-tailed fashion. Looking in the χ² table for p = 0.05 and df = 9 gives the critical value χ²_crit = 16.92.

Contingency Table χ² Test Example

6. Finally, we compare the χ² test value to the χ² critical value, finding that χ²_test > χ²_crit. We therefore reject H_0 and accept H_A: based on the comparison between the expected and observed frequencies, there appears to be some pattern in which counties geography students in different years reside.

Notably, this test cannot tell us anything about the pattern's nature, only that the distribution is significantly different from the expected, even distribution under the null hypothesis. There is thus evidence of spatial autocorrelation, meaning that geography students in certain years tend to live in certain counties.
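The six steps above can be sketched in Python as follows. This is an illustrative sketch, not the course's own code: the function name `chi_square` and its `decimals` parameter are assumptions, and the critical value 16.919 is hard-coded from a standard χ² table (df = 9, α = 0.05) rather than computed. With expected counts rounded to one decimal place the statistic reproduces the 22.61 from step 4; with unrounded expected counts it is about 22.77. Either way the conclusion is the same.

```python
# Observed counts (rows: Alamance, Chatham, Durham, Orange;
# columns: freshman, sophomore, junior, senior).
observed = [
    [4, 7, 7, 10],
    [8, 16, 22, 27],
    [21, 10, 6, 10],
    [12, 13, 9, 18],
]


def chi_square(obs, decimals=None):
    """Sum of (O - E)^2 / E over all cells.

    If `decimals` is given, each expected count is rounded first,
    mimicking a hand calculation from a rounded table.
    """
    row_totals = [sum(row) for row in obs]
    col_totals = [sum(col) for col in zip(*obs)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(obs):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            if decimals is not None:
                e = round(e, decimals)
            total += (o - e) ** 2 / e
    return total


chi2_rounded = chi_square(observed, decimals=1)  # hand-calculation style
chi2_exact = chi_square(observed)                # unrounded expected counts
df = (len(observed) - 1) * (len(observed[0]) - 1)

CHI2_CRIT = 16.919  # standard table value for df = 9, alpha = 0.05 (assumed)
reject_h0 = chi2_exact > CHI2_CRIT

print(f"chi2 (rounded E) = {chi2_rounded:.2f}")
print(f"chi2 (exact E)   = {chi2_exact:.2f}")
print(f"df = {df}, reject H0: {reject_h0}")
```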

The Joint Count Statistic

The contingency table approach, while it can be applied to spatial analyses in the fashion described, does not actually include any spatial relationship information in its formulation, beyond encoding the coincidence of two nominal variables (when one of those variables represents location information).

We can also formulate descriptive statistics that do include spatial relationships, specifically by finding all the regions that share a boundary in a set of polygons, and then comparing attribute values from the pairs to assess the pattern of that attribute.

The Joint Count Statistic

The first step in this method is to enumerate all of the pairs of polygons that share a boundary by creating a binary connectivity table (a.k.a. a spatial matrix). For example, using the following five-region system:

[Figure: map of a five-region system, with regions labeled A to E]

1. Label the regions
2. Create a table with the same row and column labels
3. Fill in the table with 1s and 0s to indicate which regions share a boundary

           A  B  C  D  E
        A  0  1  1  0  0
        B  1  0  1  1  0
        C  1  1  0  1  1
        D  0  1  1  0  1
        E  0  0  1  1  0

The Joint Count Statistic

We can now take the sum of all the 1s in the binary connectivity table and divide by 2 (each shared boundary appears twice in the table) to calculate the total number of shared boundaries in the system (J):

$$J = \frac{1}{2} \sum_{i=1}^{n} x_i$$

where the x_i are the n entries of the connectivity table.

Next, we are ready to look at the attribute information associated with the polygons to determine if each pair of polygons that shares a boundary has the same values or different values.

The joint count statistic is designed to be used with binary nominal attributes, i.e. the attribute values need to be reduced to some 2-class description for use in this statistic.
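A short Python sketch of this step, using the five-region connectivity table from the example. The variable names (`W`, `J`, `L`) are illustrative; the per-region boundary counts `L` are included because they are needed later for the variance.

```python
labels = ["A", "B", "C", "D", "E"]

# Binary connectivity table: 1 where two regions share a boundary.
W = [
    [0, 1, 1, 0, 0],  # A
    [1, 0, 1, 1, 0],  # B
    [1, 1, 0, 1, 1],  # C
    [0, 1, 1, 0, 1],  # D
    [0, 0, 1, 1, 0],  # E
]

# Sanity check: sharing a boundary is a symmetric relationship.
n = len(W)
assert all(W[i][j] == W[j][i] for i in range(n) for j in range(n))

# Each shared boundary appears twice in the table, so halve the sum.
J = sum(sum(row) for row in W) // 2

# Number of boundaries shared by each region (row sums).
L = {label: sum(row) for label, row in zip(labels, W)}

print("J =", J)
print("L =", L)
```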

The Joint Count Statistic

The binary attributes in question can come from any number of possible representations:

- The example in the text uses positive or negative residuals in polygons from spatially mapped regression results
- It could be any sort of presence/absence data
- Another possibility is a reclassification of other sorts of data (e.g. nominal or ordinal schemes reclassified to two classes, or interval/ratio data transformed to binary data in any number of ways, such as above and below the mean)
- It can be any scheme in which each polygon is assigned either attribute A or attribute B

The Joint Count Statistic

We will use the suggested example in the text, where each of our five regions is assigned either a + attribute or a - attribute (possibly describing regression residuals):

[Figure: the five-region map with each region marked + or -; the extracted labels read + + - + -]

We now have three types of boundaries:

- ++ boundaries (2)
- +- boundaries (5)
- -- boundaries (0)

The joint count statistic compares the observed number of +- boundaries (where the value on either side of the boundary is different) to the number that we would expect to find if the values in the polygons did not exhibit any spatial autocorrelation.
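The boundary counts can be checked with a few lines of Python. The assignment of signs to regions below (A, C and D are +; B and E are -) is an assumption: the figure did not survive extraction, and this assignment is chosen because it reproduces the stated counts of 2, 5 and 0. The edge list is read off the connectivity table, listing each shared boundary once to avoid double counting.

```python
# Each shared boundary listed once (from the connectivity table).
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
    ("C", "D"), ("C", "E"), ("D", "E"),
]

# ASSUMED sign assignment; it is consistent with the counts on the slide.
sign = {"A": "+", "B": "-", "C": "+", "D": "+", "E": "-"}

counts = {"++": 0, "+-": 0, "--": 0}
for a, b in edges:
    if sign[a] != sign[b]:
        counts["+-"] += 1          # unlike neighbors
    else:
        counts[sign[a] * 2] += 1   # "++" or "--"

print(counts)
```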

The Joint Count Statistic

The expected number of +- boundaries is calculated as:

$$E[+-] = \frac{2JPM}{N(N - 1)}$$

where:
- J is the total number of shared boundaries
- P is the number of + polygons
- M is the number of - polygons
- N is the total number of polygons

For our example, E[+-] is calculated as:

$$E[+-] = \frac{2JPM}{N(N - 1)} = \frac{2 \cdot 7 \cdot 3 \cdot 2}{5(5 - 1)} = \frac{84}{20} = 4.2$$

We will form a statistic by comparing the expected number of +- boundaries to the observed number of +- boundaries, which we obtain by simply counting the number of shared boundaries with this characteristic (being careful not to double count).

The Joint Count Statistic

For our example five-region system, the observed number of shared +- boundaries is 5.

The last ingredient we need to be able to build a test statistic is an estimate of the variance in E[+-], and unfortunately, calculating this quantity requires a somewhat involved expression:

$$V[+-] = E[+-] - E[+-]^2 + \frac{\sum L_i (L_i - 1) PM}{N(N - 1)} + \frac{4\left[J(J - 1) - \sum L_i (L_i - 1)\right] P(P - 1) M(M - 1)}{N(N - 1)(N - 2)(N - 3)}$$

where L_i is the total number of boundaries shared by region i.

In our example, V[+-] = 0.56.
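The expected value and variance can be sketched in Python as below. The function names are illustrative, and the variance is read as E - E² plus the two fraction terms, a reading that reproduces the stated V[+-] = 0.56 for the five-region example.

```python
def expected_unlike(J, P, M, N):
    """Expected number of +- boundaries: 2JPM / (N(N - 1))."""
    return 2 * J * P * M / (N * (N - 1))


def variance_unlike(J, P, M, N, L):
    """Variance of the +- boundary count.

    L is the list of boundary counts per region (L_i).
    """
    e = expected_unlike(J, P, M, N)
    sum_l = sum(l * (l - 1) for l in L)
    term_a = sum_l * P * M / (N * (N - 1))
    term_b = (
        4 * (J * (J - 1) - sum_l) * P * (P - 1) * M * (M - 1)
        / (N * (N - 1) * (N - 2) * (N - 3))
    )
    return e - e ** 2 + term_a + term_b


# Five-region example: 7 boundaries, 3 "+" polygons, 2 "-" polygons.
J, P, M, N = 7, 3, 2, 5
L = [2, 3, 4, 3, 2]  # boundaries shared by A, B, C, D, E

E = expected_unlike(J, P, M, N)
V = variance_unlike(J, P, M, N, L)
print(f"E[+-] = {E:.2f}, V[+-] = {V:.2f}")
```

Note that the last fraction requires N > 3, so the expression is undefined for systems of three or fewer polygons.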

The Joint Count Statistic

We can now calculate a test statistic to compare the observed number of +- boundaries to the expected number of +- boundaries as a Z-statistic:

$$Z_{test} = \frac{(\text{Obs. } +-) - E[+-]}{\sqrt{V[+-]}}$$

This test statistic is normally distributed with mean 0 and variance 1, thus we can use the standard normal distribution to assess its significance.

An exceptional Z-statistic value would indicate a level of spatial autocorrelation that exceeds the expected amount for our system.

Z-test for the Joint Count Statistic Example

Research question: Is the areal pattern of + and - values randomly distributed amongst the polygons?

1. H_0: O[+-] ~ E[+-] (Areal pattern is random)
2. H_A: O[+-] ≠ E[+-] (Pattern is spatially autocorrelated)
3. Select α = 0.05, two-tailed because of H_0
4. We will calculate the test statistic using:

$$Z_{test} = \frac{(\text{Obs. } +-) - E[+-]}{\sqrt{V[+-]}} = \frac{5 - 4.2}{\sqrt{0.56}} = 1.07$$

Z-test for the Joint Count Statistic Example

5. For α = 0.05 and a two-tailed test, Z_crit = 1.96
6. Z_test < Z_crit, therefore we fail to reject H_0, finding that the areal pattern of + and - values in the polygons is not significantly different from a random areal pattern. There is no evidence of spatial autocorrelation in this system beyond what we would normally expect were the values of + and - simply assigned randomly to polygons.
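The test itself is a one-line calculation. A minimal Python sketch, using the observed count, expected value and variance from the example (the function name `join_count_z` is illustrative, and 1.96 is the usual two-tailed critical value for α = 0.05):

```python
import math


def join_count_z(observed, expected, variance):
    """Z = (observed - expected) / sqrt(variance)."""
    return (observed - expected) / math.sqrt(variance)


z_test = join_count_z(observed=5, expected=4.2, variance=0.56)
Z_CRIT = 1.96  # two-tailed, alpha = 0.05

# Two-tailed test: compare the absolute value with the critical value.
significant = abs(z_test) > Z_CRIT
print(f"Z = {z_test:.2f}, significant: {significant}")
```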

Moran's I Statistic

While the joint count statistic does include spatial information (shared boundaries between polygons) in its assessment of autocorrelation, it does so for very limited sorts of attribute data.

We can use the joint count statistic only with binary nominal information, whereas in many situations we have measurements that are considerably more detailed (i.e. interval or ratio data).

We may want to assess spatial patterns of interval or ratio data in a fashion that allows us to take full advantage of the detail inherent in those sorts of measurements, checking to see if the pattern of those values exhibits spatial autocorrelation.

Moran's I Statistic

For this purpose we can make use of Moran's I statistic, which we can view as an expansion of the ideas implemented in the joint count statistic.

Moran's I statistic considers the spatial relationships between each pair of polygons in an areal data set, and encodes the relationships in a connectivity table, just as is done for the joint count statistic.

However, there is much greater flexibility in how neighborhood information is included in the Moran's I statistic:

Moran's I Statistic

The computation of Moran's I statistic includes a weight term, where the weights express the degree to which any two elements of the polygon coverage are considered to be spatially related or proximal:

- In the simplest case, two polygons that share a boundary have a weight of 1, and polygons that do not share a boundary have a weight of 0 (the binary connectivity case)
- However, we can imagine all sorts of other schemes: we might weight by the length of boundary that is shared, as a function of the distance between the polygons, or using an expression that indicates how many neighbors apart they are (i.e. 1st order neighbors are adjacent, 2nd order neighbors are separated by one other polygon, etc.)
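As one illustration of the "how many neighbors apart" idea, neighbor order can be derived from the binary adjacency information by a breadth-first search. This is only a sketch of one possible scheme, using the five-region system from the joint count example; the function name `neighbor_order` is hypothetical, and how an order is then turned into a weight (for example 1/order) is a separate modeling choice not specified here.

```python
from collections import deque

# Adjacency lists for the five-region system (from its connectivity table).
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D", "E"],
    "D": ["B", "C", "E"],
    "E": ["C", "D"],
}


def neighbor_order(adjacency, start):
    """Breadth-first search: order 1 = adjacent, order 2 = one polygon between."""
    order = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in order:
                order[neighbor] = order[current] + 1
                queue.append(neighbor)
    return order


orders_from_a = neighbor_order(adjacency, "A")
print(orders_from_a)
```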

Moran's I Statistic

Thus, for each and every pair of polygons in the system, a weight expresses the degree to which they are spatially related (close to each other, connected, etc.).

This weight term is multiplied by an expression that compares the attribute values of each and every pair of polygons, by calculating the mean and standard deviation for the whole data set, and then comparing the z-scores of the variable values for each polygon to that of the other:

$$I = \frac{n \sum_i \sum_j w_{ij} z_i z_j}{(n - 1) \sum_i \sum_j w_{ij}}$$

where:
- n is the number of polygons
- w_ij is the weight for the combination of the polygon in column i and the polygon in row j of the connectivity matrix
- z_i and z_j are z-scores

Moran's I Statistic

Moran's I statistic is a normalized statistic that can be interpreted much like a correlation coefficient:

- Values near +1 indicate a very strong spatial pattern, in which nearby values are similar
- Values near -1 indicate strong negative spatial autocorrelation. These are extremely rare in real data: we can certainly produce simulated patterns that exhibit strong negative autocorrelation, but finding such things in nature is all but unheard of, which is more or less what Tobler's Law predicts
- Values around 0 indicate an absence of spatial pattern, showing neither organization where nearby values are similar, nor the ultra-rare opposite of that condition

Moran's I Statistic

The value of a Moran's I statistic depends strongly on the particular weighting method used. Given the same data, depending on how the spatial relationships between pairs of polygons are encoded, one can produce Moran's I values of varying magnitude, despite the fact that the underlying data and pattern are the same. This reflects how strongly the conceptual choice of how to describe spatial relationships affects the results.

For conceptual ease, we will use the same definition we used in the joint count example: if two polygons share a boundary, they will be assigned a weight of 1 in the binary connectivity table; otherwise they will be given a value of 0, indicating that the comparison of their values has no impact on the statistic because they are not adjacent.

David Tenenbaum, GEOG 090, UNC-CH, Spring 2005

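Moran's I Statistic Example

[Figure: map of the five-region system, with regions labeled A to E]

Connectivity matrix W = {w_ij} (i indexes columns, j indexes rows):

           A  B  C  D  E
        A  0  1  1  0  0
        B  1  0  1  1  0
        C  1  1  0  1  1
        D  0  1  1  0  1
        E  0  0  1  1  0

Attribute values and z-scores:

        Polygon    Value   Z-Score
        A           20       1.33
        B           10      -0.88
        C           15       0.22
        D           16       0.44
        E            9      -1.11
        Mean        14
        Std. Dev.   4.53

$$I = \frac{n \sum_i \sum_j w_{ij} z_i z_j}{(n - 1) \sum_i \sum_j w_{ij}}$$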
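The mean, standard deviation and z-scores in the table can be reproduced with Python's standard library, assuming the sample standard deviation (dividing by n - 1), which is the choice that matches the tabulated 4.53. Computed from unrounded inputs, the z-score for E comes out at about -1.10, marginally different from the tabulated -1.11; the other four match to two decimal places.

```python
import statistics

values = {"A": 20, "B": 10, "C": 15, "D": 16, "E": 9}

data = list(values.values())
mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)

# z-score: how many standard deviations each value lies from the mean.
z = {name: (v - mean) / sd for name, v in values.items()}

print(f"mean = {mean}, sd = {sd:.2f}")
for name, score in z.items():
    print(name, round(score, 2))
```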

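Moran's I Statistic Example

To calculate the statistic, substitute the appropriate values into the equation:

$$I = \frac{n \sum_i \sum_j w_{ij} z_i z_j}{(n - 1) \sum_i \sum_j w_{ij}} = \frac{5 \sum_i \sum_j w_{ij} z_i z_j}{(5 - 1)(14)}$$

The double sum counts each of the seven shared boundaries twice, once as (i, j) and once as (j, i):

$$\sum_i \sum_j w_{ij} z_i z_j = 2\,[(1.33)(-0.88) + (1.33)(0.22) + (-0.88)(0.22) + (-0.88)(0.44) + (0.22)(0.44) + (0.22)(-1.11) + (0.44)(-1.11)] = -4.19$$

Five of the seven products pair a positive z-score with a negative one, so the sum is negative. Substituting:

$$I = \frac{5(-4.19)}{(5 - 1)(14)} \approx -0.37$$

The negative value reflects that, in this small example, most neighboring polygons fall on opposite sides of the mean.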
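The whole calculation can be sketched as a small Python function. This is an illustrative implementation of the z-score form of the formula given in these slides, with a binary connectivity matrix and the sample standard deviation; the function name `morans_i` is an assumption. Evaluating the sum of cross-products for the seven adjacent pairs directly gives a negative total, so the sketch returns a negative I of about -0.37 for the five-region data.

```python
import statistics


def morans_i(values, W):
    """Moran's I = n * sum(w_ij * z_i * z_j) / ((n - 1) * sum(w_ij))."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    z = [(v - mean) / sd for v in values]
    cross = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    total_weight = sum(sum(row) for row in W)
    return n * cross / ((n - 1) * total_weight)


# Five-region example: values for A, B, C, D, E and binary connectivity.
values = [20, 10, 15, 16, 9]
W = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
]

I = morans_i(values, W)
print(f"Moran's I = {I:.2f}")
```

Swapping in a different weight matrix `W` (for example, one based on shared boundary length) changes the result for the same data, which is the sensitivity to the weighting scheme described earlier.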