Spatial Statistics With R: Getting Started


Rosa Rosanna Howard

Introduction

In the last practical, you saw how to handle geographical data in R, and how to carry out some basic and more advanced statistical analysis on the data. However, even the more advanced Poisson modelling did not take into consideration any spatial dependencies in the data. The breach of peace counts in the census blocks were modelled as independent Poisson counts, and the count in each block was considered only in terms of other properties of that block, ignoring anything happening in surrounding blocks. There is, however, a large area of statistical analysis devoted to processes in which events in nearby areas are related. In this practical you will learn how to use R libraries devoted to this kind of analysis - in particular spdep.

The spdep Package

The name of this package is a shortened form of 'spatial dependencies'. It contains a number of statistical routines for testing for spatial dependencies in random variables, as well as other routines that allow for such dependencies when fitting models. To begin this practical, start up R by opening your working folder and clicking on the pract.rdata file, and then load the GISTools and spdep packages. Enter

    library(GISTools)
    library(spdep)

to load these, and then enter

    data(newhaven)

which will make the New Haven data visible again. Note that when you loaded the spdep package the printout shows that a number of 'helper' libraries were also loaded. You will see something like:

    Loading required package: tripack
    Loading required package: sp
    Loading required package: maptools
    Loading required package: foreign
    Loading required package: boot
    Loading required package: spam

The topology of a spatial data set is the term usually used to describe the spatial arrangement of geographical items within it. In particular, for a polygon data set the topology is a list of polygons that touch one another. Here, touching can mean the sharing of a common edge, and in some cases it can also mean the sharing of a single common point (for example, when two census block areas are joined only at their corners). spdep has a function to extract the topology information from a polygon object, called poly2nb. The nb here stands for neighbours, since the result is basically a list of which polygons neighbour which others. Enter

    blocks.nb = poly2nb(blocks)

to store this information in a variable called blocks.nb. It is possible to plot this information as a kind of network. The nodes of the network are the so-called label points of the polygons. Each polygon has a label point: a point somewhere inside the polygon where any text used to label the polygon may be placed. Label points are useful as node points on a network representing polygon neighbours. To extract the label points as a point object, enter

    blocks.labs = poly.labels(blocks)

and then it is possible to plot the neighbour information. Here, this is done on a backdrop of the census block polygons:

    plot(blocks,col='grey')
    plot(blocks.nb,coordinates(blocks.labs),col='red',add=TRUE)

The default for poly2nb is to define neighbours as having points, as well as edges, in common. This is sometimes called queen's case topology, because connection at edges and corners corresponds to the legal moves of the queen in chess. It is also possible to extract rook's case topology, where only common edges define neighbours. This corresponds to the legal moves of the rook in chess. To extract rook's case neighbours, add the argument queen=FALSE to the poly2nb function:

    blocks.nb = poly2nb(blocks,queen=FALSE)
    plot(blocks,col='grey')
    plot(blocks.nb,coordinates(blocks.labs),col='red',add=TRUE)

This repeats the network map from before, but now only polygons with common edges are connected. In this case, as few polygon pairs are connected only at their corners, the result is fairly similar.

An alternative definition of topology (based on nearness of polygons rather than contiguity) is to define two polygons as neighbours if their label points are within some distance d of one another. R can define these kinds of neighbours using the dnearneigh function.
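The three neighbour definitions described above (queen's case, rook's case and distance-based) can be compared on a toy example without any spatial packages. The following is only a sketch under stated assumptions: it uses a hypothetical 3-by-3 grid of unit squares, each identified by its centre point, in place of the New Haven blocks, and it builds each neighbour relation as a plain logical matrix rather than calling poly2nb or dnearneigh. The distance threshold d = 2 is an arbitrary choice for illustration.

```r
# Hypothetical 3x3 grid of unit squares, identified by their centre points.
centres <- expand.grid(x = 1:3, y = 1:3)

# Rook's case: squares sharing an edge have centres at Manhattan distance 1.
rook <- as.matrix(dist(centres, method = "manhattan")) == 1

# Queen's case: squares sharing an edge or a corner have centres at
# maximum-coordinate (Chebyshev) distance 1.
queen <- as.matrix(dist(centres, method = "maximum")) == 1

# Distance-based: centres within Euclidean distance d of one another,
# excluding each square itself. The threshold d is an arbitrary choice.
d <- 2
eucl <- as.matrix(dist(centres))
near <- eucl > 0 & eucl <= d

# Number of neighbours of each square under each definition.
neighbour.counts <- data.frame(rook = rowSums(rook),
                               queen = rowSums(queen),
                               near = rowSums(near))
print(neighbour.counts)
```

Row 5 of the printed table is the central square and row 1 a corner square; the counts differ between the three columns, which is the toy-grid analogue of the different networks drawn for the census blocks.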

For example, using dnearneigh to define census blocks as neighbours if their label points are within 1.2 miles of one another, enter the following:

    blocks.nb2 = dnearneigh(poly.labels(blocks),0,miles2ft(1.2))

It is then possible to plot the neighbour network under this definition:

    plot(blocks,col='grey')
    plot(blocks.nb2,coordinates(blocks.labs),col='red',add=TRUE)

Note that this demonstrates that under different definitions of neighbour, quite different patterns of network can occur.

Computing and Testing Moran's I

Having defined contiguity for this census block example, it is now possible to investigate the degree of spatial dependency in the attribute data. A typical way of doing this is to compute the Moran's I coefficient. Moran's I is defined as

$$
I = \frac{N}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij} (X_i - \bar{X})(X_j - \bar{X})}{\sum_i (X_i - \bar{X})^2}
$$

where:

    $X_i$       is the attribute attached to polygon i
    $N$         is the number of polygons
    $w_{ij}$    indicates whether polygons i and j are neighbours
    $\bar{X}$   is the average polygon attribute value

The formula may seem complex, but essentially it measures the degree to which similar-valued attributes occur near to each other. If above-average attribute values tend to be near other above-average values, this gives a positive value of Moran's I. If, on the other hand, above-average values tend to occur near to below-average values, as in a checker-board pattern, this gives a negative Moran's I. Moran's I is typically between -1 and 1, and in some ways is similar to a correlation coefficient. A value of zero suggests no spatial dependency. It is sometimes referred to as a measure of autocorrelation, as it measures the correlation of the variable X with itself, in a geographical sense. To illustrate this, choropleth maps corresponding to four values of Moran's I are given below:

[Figure: four choropleth maps, each labelled with its value of Moran's I]
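To make the definition concrete, Moran's I can be computed directly from the formula on a small example. The sketch below is an illustration under stated assumptions, not the spdep implementation: the function name moran.i, the hypothetical 4-by-4 grid of square zones and the two attribute patterns are all invented here, and the weights are plain 0/1 rook's case indicators. (By default nb2listw in spdep row-standardises the weights, so values from moran.test need not match a 0/1 calculation of this kind.)

```r
# Moran's I from attribute values x and a 0/1 neighbour matrix w,
# following the definition term by term.
moran.i <- function(x, w) {
  n <- length(x)
  dev <- x - mean(x)                  # X_i minus the mean of X
  cross <- sum(w * outer(dev, dev))   # sum over i, j of w_ij * dev_i * dev_j
  (n / sum(w)) * cross / sum(dev^2)
}

# Hypothetical 4x4 grid of square zones with rook's case neighbours:
# zones sharing an edge have centres at Manhattan distance 1.
cells <- expand.grid(col = 1:4, row = 1:4)
w <- (as.matrix(dist(cells, method = "manhattan")) == 1) * 1

# Two made-up attribute patterns on the grid.
clustered <- as.numeric(cells$col <= 2)   # high values fill the left half
checker <- (cells$col + cells$row) %% 2   # high and low values alternate

print(moran.i(clustered, w))
print(moran.i(checker, w))
```

The clustered pattern gives I = 2/3 and the checker-board pattern gives I = -1, matching the description of positive and negative autocorrelation above.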

R can compute Moran's I. To do this, it needs to convert a neighbourhood list to a w-list. This is really just another way of storing the polygon adjacency data. The conversion is done with the nb2listw function. Enter

    blocks.lw = nb2listw(blocks.nb)

which stores the w-list in blocks.lw. Having done this, it is possible to investigate spatial dependency in some of the New Haven data. To test whether the percent vacant properties variable P_VACANT exhibits spatial dependency, we first attach the data frame from the blocks object:

    attach(data.frame(blocks))

To compute the Moran's I statistic, now enter:

    moran.test(P_VACANT,blocks.lw)

which produces the following output:

    Moran's I test under randomisation

    data:  P_VACANT
    weights: blocks.lw

    Moran I statistic standard deviate = , p-value =
    alternative hypothesis: greater
    sample estimates:
    Moran I statistic       Expectation          Variance

This needs some explanation. The first number on the last line of the printout is the Moran's I statistic itself. The other information relates to a statistical test of whether Moran's I is equal to zero. If this is the case, then the theoretical expected value of Moran's I and its sample variance are estimated using the following formulae:

$$
E(I) = -\frac{1}{N - 1}
$$

$$
\mathrm{Var}(I) = \frac{ND - 6EC^2}{(N + 1)(N - 1)C^2}
$$

where:

$$
A = \frac{1}{2} \sum_i \sum_j (w_{ij} + w_{ji})^2, \quad i \neq j
$$

$$
B = \sum_k \left( \sum_j w_{kj} + \sum_i w_{ik} \right)^2
$$

$$
C = \sum_i \sum_j w_{ij}, \quad i \neq j
$$

$$
D = (N^2 - 3N + 3)A - NB + 3C^2
$$

$$
E = \frac{\sum_i (X_i - \bar{X})^4 / N}{\left( \sum_i (X_i - \bar{X})^2 / N \right)^2}
$$

These (very) complicated formulae can be used to create a test statistic

$$
z = \frac{I - E(I)}{\{\mathrm{Var}(I)\}^{1/2}}
$$

which is approximately Normally distributed, and can be converted to a p-value. The last line of the printout from moran.test gives the value of E(I) for P_VACANT (labelled 'Expectation') and that of Var(I) (labelled 'Variance'). These can be used to compute z in the formula above, which is then used to test the hypothesis that I=0. In the printout from moran.test, z is labelled 'Moran I statistic standard deviate'. Finally the p-value for the statistic is computed and shown in the printout. Recall that the p-value is the probability of obtaining a value at least as extreme as the one observed from the data, given that the null hypothesis is true. Thus, the lower the value, the more evidence against the null hypothesis. Here the smallness of the p-value suggests strong evidence against the null hypothesis, i.e. we should reject the hypothesis that I=0, implying that some degree of spatial dependency is present.

We can now do the same test in terms of the density of breach of peace events. Firstly, compute the density values in events per square mile:

    density = poly.counts(breach,blocks)/ft2miles(ft2miles(poly.areas(blocks)))

and then carry out the Moran's I test:

    moran.test(density,blocks.lw)

This gives a printout similar to the one before. As a self-test, you should be able to find the Moran's I statistic and its p-value in the printout and decide whether Moran's I differs significantly from zero.

Simulation-Based Tests

The basis for the significance tests in the last section was to compute the expected value and variance of the Moran's I statistic under the assumption that there is no spatial dependency in the attribute X.
Here, this is done by assuming that if there were no spatial dependency, then any of the observed X-values could have occurred with equal chance at any of the polygons. In other words, any permutation of the attribute values among the polygons is equally likely. The formulae for E(I) and Var(I) were theoretically derived given this assumption. However, the assumption that Moran's I is normally distributed in this case is only approximate. In times when computers were a lot slower than they are now, this approach was probably the most appropriate, but now there is an alternative. This is simply to permute the attributes randomly amongst the polygons a large number of times, and note the value of Moran's I each time. By comparing the actual Moran's I against these, we can see how extreme the true value is compared to those generated under the assumption that any permutation is equally likely. If there are n simulations, and m of these have a larger value than the true Moran's I, then the experimental p-value is (m+1)/(n+1), since the observed arrangement is itself counted as one of the n+1 equally likely permutations.

The theoretical approach of the previous section is relatively easy to compute (although seven formulae may seem complex to a human, they can be calculated in a fraction of a second by a computer) but it is only approximate. The simulation approach outlined here, also called the Monte-Carlo approach, requires more computer time (usually n should be around 10,000) but the simulations are of the true model. R can carry out simulation-based tests with the moran.mc function:

    moran.mc(P_VACANT,blocks.lw,nsim=10000)

The extra argument nsim tells the function how many simulations to carry out, that is, the number n mentioned above. The result will be something like:

    Monte-Carlo simulation of Moran's I

    data:  P_VACANT
    weights: blocks.lw
    number of simulations + 1:

    statistic = , observed rank = 9909, p-value =
    alternative hypothesis: greater

Note that the p-value here, although slightly different from that obtained from moran.test, still suggests that the hypothesis of zero Moran's I should be rejected. Also note that when you run moran.mc you may well obtain slightly different results, as this approach is based on random simulation, and so two runs of the function will generally not have identical outcomes. As another self-test, try running moran.mc on the density variable.
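The permutation idea behind this kind of test can be written out in a few lines of base R. The following is a minimal sketch under stated assumptions rather than the moran.mc code: the helper moran.i, the hypothetical 4-by-4 grid with 0/1 rook's case weights and the made-up clustered attribute are all invented for illustration, 999 simulations are used to keep it quick, and the p-value is taken as (m + 1)/(n + 1), counting the observed arrangement as one of the n + 1 equally likely ones and counting ties as "at least as large".

```r
set.seed(1)  # fixed seed so that the sketch is repeatable

# Moran's I for attribute values x and a 0/1 neighbour matrix w.
moran.i <- function(x, w) {
  dev <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(dev, dev)) / sum(dev^2)
}

# Hypothetical 4x4 grid of zones with rook's case neighbours.
cells <- expand.grid(col = 1:4, row = 1:4)
w <- (as.matrix(dist(cells, method = "manhattan")) == 1) * 1

# Made-up attribute: high values clustered in the left half of the grid.
x <- as.numeric(cells$col <= 2)

# Shuffle the attribute values among the zones n.sim times,
# recomputing Moran's I for each shuffled arrangement.
n.sim <- 999
observed <- moran.i(x, w)
simulated <- replicate(n.sim, moran.i(sample(x), w))

# m = number of simulated values at least as large as the observed one.
m <- sum(simulated >= observed)
p.value <- (m + 1) / (n.sim + 1)
print(c(observed = observed, m = m, p.value = p.value))
```

Because the made-up pattern is strongly clustered, very few random shuffles reach the observed value, so the p-value is small. Changing the seed changes the simulated values slightly, which is the run-to-run variation mentioned above.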

Regression Models with Spatial Autoregression

In this section the idea of spatial dependency is taken a step further, by considering its effect when calibrating regression models. A standard bivariate regression model has the form

$$
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
$$

where the Y variable is to be predicted by the X variable. The beta values $\{\beta_0, \beta_1\}$ are the regression coefficients (intercept and slope respectively) and the final epsilon term $\{\epsilon_i\}$ is an error term. In a standard model it is assumed that the errors are normally distributed, with a mean of zero. It is also assumed that all errors have the same standard deviation, and that they are independent. However, in many geographical situations, the last assumption is dubious. The error term in a model is essentially related to factors influencing the Y variable that are not reflected in the predictor variable X. If such factors relate to a geographical phenomenon, it is possible that their effects might spill over, so that error terms in adjacent regions will depend on one another. In this case, the model above will be inappropriate, and models allowing for dependency in the epsilons should be considered instead.

To consider this kind of model, we will look at two new New Haven crime variables related to residential burglaries. These are both point objects, called burgres.f and burgres.n. burgres.f is a list of burglaries occurring between 1st August 2007 and 31st January 2008 where entry was forced into the property, and burgres.n is a list of burglaries from the same time period where no entry was forced. Non-forced entry suggests that the property was left insecure, perhaps by leaving a door or window open.

One interesting question is whether both kinds of residential burglary occur in the same places - that is, if a place is a high risk area for non-forced entry, does it imply that it is also a high risk area for forced entry? To investigate this, we will use a bivariate regression model that attempts to predict the rate of forced burglaries from the rate of non-forced ones. The indicators needed for this are the rates of burglary given the number of properties at risk. Here we use the variable OCCUPIED from the data frame in the census blocks object to estimate the number of properties at risk. If we were to compute rates per 1,000 households, this would be

    1000*(number of burglaries in block)/OCCUPIED

and since this is over a six-month period, doubling this quantity gives the number of burglaries per 1,000 households per year. However, typing in OCCUPIED shows that some blocks have no occupied housing, so the above quantity is not defined for them.
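The rate arithmetic, and the difficulty with blocks that have no occupied housing, can be seen with a few made-up numbers. This is only an illustrative sketch: the counts below are hypothetical and are not taken from the New Haven data.

```r
# Made-up six-month burglary counts and occupied-dwelling counts for four blocks.
burglaries <- c(3, 0, 5, 0)
occupied <- c(150, 80, 0, 0)

# Burglaries per 1,000 households per year:
# 1000 * count / occupied for six months, doubled to give a yearly rate.
rate <- 2000 * burglaries / occupied
print(rate)

# Blocks with no occupied housing, for which the rate is undefined.
print(which(occupied == 0))
```

The first two blocks give ordinary rates (40 and 0), while the two blocks with no occupied dwellings give Inf and NaN, which cannot be used in a regression.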

To overcome this problem we select a subset of the blocks object consisting only of blocks with more than zero occupied dwellings. For polygon spatial objects, each individual polygon can be treated like a row in a data frame for purposes of subset selection. Thus to select only the blocks where the variable OCCUPIED is greater than zero, enter

    blocks2 = blocks[OCCUPIED > 0,]

to store the subset of the census block data in the object blocks2. We can now compute the burglary rates for forced and non-forced entries by first counting the burglaries in each block in blocks2 (with the poly.counts function), dividing these numbers by the OCCUPIED counts and then multiplying by 2,000 (to get yearly rates per 1,000 households). However, before we do this, remember that we need the OCCUPIED column from blocks2 and not blocks, but at the moment the one from blocks is attached. To sort this out, firstly detach the data frame associated with blocks and then attach the one associated with blocks2:

    detach(data.frame(blocks))
    attach(data.frame(blocks2))

Now the two rate variables can be calculated:

    forced.rate = 2000*poly.counts(burgres.f,blocks2)/OCCUPIED
    notforced.rate = 2000*poly.counts(burgres.n,blocks2)/OCCUPIED

so we now have the two rates stored in forced.rate and notforced.rate. A first attempt at modelling the relationship between the two rates could be via simple bivariate regression, ignoring any spatial dependencies in the error term. This is done using the lm function, which creates a simple regression model object.

    model1 = lm(forced.rate~notforced.rate)

This stores the basic model in model1. To see the regression coefficients, enter

    summary(model1)

which produces the following output:

    Call:
    lm(formula = forced.rate ~ notforced.rate)

    Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                    e-10 ***
    notforced.rate                                      *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 125 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 1 and 125 DF, p-value:

The key things to note here are that the forced rate is related to the not-forced rate by a formula of the form

    expected(forced rate) = intercept + slope*(not forced rate)

with the intercept and slope read from the Estimate column, and that the coefficient for the not-forced rate is statistically different from zero, so there is evidence that the two rates are related. One possible explanation is that if a burglar is active in an area, they will only use force to enter dwellings when it is necessary, making use of an insecure window or door if they spot the opportunity. Thus in areas where burglars are active, both kinds of burglary could potentially occur. However, in areas where they are less active it is less likely for either kind of burglary to occur.

However, this regression model could possibly be improved if, instead of assuming that the error terms are independent, we assume a spatial dependency. This can be done in a number of ways, but the approach we will use here is the spatially autocorrelated regression (SAR) model:

$$
y_i = \rho \sum_j w_{ij} y_j + \beta_0 + \beta_1 x_i + \epsilon_i
$$

The difference between this and the standard model is the first term on the right-hand side. Here, $w_{ij}$ is equal to 1 if polygons i and j are neighbours and zero otherwise. The coefficient $\rho$ controls the degree of spatial dependency. Effectively the variable y for a given polygon is predicted not just by x but also by the y-values of the polygons neighbouring it. Calibrating a SAR model involves estimating the regression coefficients, as before, but also involves estimating $\rho$. In R, SAR models can be calibrated using the function spautolm. This works in a similar way to lm, but also needs the contiguity information in listw form. Since we are now working with blocks2 rather than blocks we need to extract the information for the newer object:

    blocks2.nb = poly2nb(blocks2)
    blocks2.lw = nb2listw(blocks2.nb)

Now it is possible to fit the SAR model:

    model2 = spautolm(forced.rate~notforced.rate,listw=blocks2.lw)

This stores the result in model2. More information can be found by entering

    summary(model2)

giving the following output:

    Call:
    spautolm(formula = forced.rate ~ notforced.rate, listw = blocks2.lw)

    Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                    e-10
    notforced.rate

    Lambda:   LR test value:   p-value:

    Log likelihood:
    ML residual variance (sigma squared): , (sigma: )
    Number of observations: 127
    Number of parameters estimated: 4
    AIC:

This shows that calibrating the model in this way gives a formula of the same form

    expected(forced rate) = intercept + slope*(not forced rate)

whose coefficients differ only very slightly from those obtained with a standard regression model. The section marked 'Lambda' in the output shows that the estimated value of the dependency coefficient is 0.139, but the test of a null hypothesis of zero dependency (the LR test on the same line) has a p-value large enough that we fail to reject the null hypothesis. This suggests that, in this case, one does not need to allow for spatial dependency of the error term.

The Modifiable Areal Unit Problem

In the previous section, data was summarised and then analysed at the US Census block level. One important issue with spatial analytical models of this kind is their dependency on the set of areal units used. For example, if we were to work with US Census tracts instead of blocks, would we obtain similar results? With the data set here, it is possible to test this. Firstly, included in the library is an object called tracts which consists of the polygon outlines of the US Census tracts for New Haven. To see the relationship between the tracts and the blocks, enter:

    plot(blocks,border='red')
    plot(tracts,lwd=2,add=TRUE)

The parameter lwd controls the width of the lines being drawn. The Census blocks are nested within the tracts. Next, compute the burglary rates for the tracts. First, detach the data frame associated with blocks2 and then attach the one for tracts:

    detach(data.frame(blocks2))
    attach(data.frame(tracts))

Now, compute burglary rates for the tracts:

    forced.rate.t = 2000*poly.counts(burgres.f,tracts)/OCCUPIED
    notforced.rate.t = 2000*poly.counts(burgres.n,tracts)/OCCUPIED

and run a basic model:

    model1.t = lm(forced.rate.t~notforced.rate.t)
    summary(model1.t)

You should now be familiar with the format of the output. Working with data based on census tracts, we again obtain a model of the form

    expected(forced rate) = intercept + slope*(not forced rate)

but with different coefficient estimates. Notice that the difference in the calibrated model brought about by altering the areal units used for the analysis is notably larger than the difference made by the inclusion of a spatial dependency term in the error model. This is referred to as the Modifiable Areal Unit Problem - first identified in the 1930s, and extensively researched by Stan Openshaw in the 1970s and beyond.
This variability in results is often the case, and illustrates the importance of the Modifiable Areal Unit Problem as an issue in spatial analysis.

A Zone-Free Approach

An alternative approach to mapping these crime patterns is to use kernel density estimation. Here we model the relative density of the points as a density surface - essentially a function of location (x,y) representing the relative likelihood of occurrence of an event at that point. If we think of locations in space as a very fine pixel grid, then summing the pixels making up an arbitrary region on the map gives the probability that an event occurs in that area.
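The pixel-grid picture can be tried out numerically in base R. The sketch below is an illustration under stated assumptions and does not use kde.points: the event locations are simulated, the density surface is built by averaging a Gaussian 'bump' centred on each point (one possible choice of bump), and the bandwidth h, the grid size and the region of interest are all arbitrary choices made here.

```r
set.seed(42)

# Hypothetical event locations scattered over the unit square.
n.pts <- 200
px <- runif(n.pts)
py <- runif(n.pts)

# A fine pixel grid covering the points, with a generous margin.
gx <- seq(-1, 2, length.out = 151)
gy <- seq(-1, 2, length.out = 151)
cell.area <- (gx[2] - gx[1]) * (gy[2] - gy[1])

# Density surface: the average of Gaussian bumps centred on the points.
# The bandwidth h is an arbitrary choice for this sketch.
h <- 0.1
kx <- outer(px, gx, function(p, g) dnorm(g, mean = p, sd = h))
ky <- outer(py, gy, function(p, g) dnorm(g, mean = p, sd = h))
dens <- crossprod(kx, ky) / n.pts   # dens[i, j] is the density at (gx[i], gy[j])

# Summing every pixel (times the pixel area) gives a total close to 1.
total <- sum(dens) * cell.area

# Summing only the pixels in a region gives the probability for that region,
# here the square from (0, 0) to (0.5, 0.5).
in.region <- outer(gx >= 0 & gx <= 0.5, gy >= 0 & gy <= 0.5, "&")
p.region <- sum(dens[in.region]) * cell.area

print(c(total = total, p.region = p.region))
```

Each pixel value is multiplied by the pixel area before summing, so that the whole surface adds up to (approximately) one and any sub-region adds up to a probability between zero and one.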

For the more mathematically-minded, if f(x,y) is the density function, then the probability that an event occurs in an area A is:

$$
\iint_{(x,y) \in A} f(x, y) \, dy \, dx
$$

Kernel density estimators operate by averaging a small 'bump' (a probability distribution in 2D, in fact) centred on each observed point. Thus, with n observed points $(x_i, y_i)$, the approximation to f is given by:

$$
\hat{f}(x, y) = \frac{1}{n h_1 h_2} \sum_i k\left( \frac{x - x_i}{h_1}, \frac{y - y_i}{h_2} \right)
$$

in mathematical terms. The function k is the kernel function - that is, the 'bump' described earlier. The h parameters control the smoothness of the estimate. Very small values give rise to very 'spiky' surfaces, and large values to very flat ones. Typically, they are chosen automatically, from the distribution of the points.

Here, the function to compute a kernel density estimate is kde.points. This estimates the value of the density over a grid of points, and returns the result as a grid object. It can take two arguments: the set of points to use, and another geographical object, whose bounding box will be the extent of the grid object to be created. The points object breach will be used to produce a kernel surface:

    breach.dens = kde.points(breach,lims=tracts)

This stores the kernel density estimate of breach of peace events in a grid object called breach.dens. A quick way of drawing the density is to use the level.plot function:

    level.plot(breach.dens)

This draws a shaded contour plot of the density function. One thing to notice is that this covers a rectangular area, but to give context it would be helpful to add a map of New Haven. For example, to add the Census tracts, type

    plot(tracts,add=TRUE)

Another approach might be to mask out the information outside of the study area. The kde.points function always computes values on a rectangular grid, but part of the grid lies outside of the New Haven area. To overcome this, it is possible to create a mask polygon object.
This is simply a normal polygon object, shaped like the rectangle that kde.points produces, but with a hole in it the shape of the study area. In this case the hole is shaped like New Haven. If the mask polygon is plotted over the level plot of the grid data, with both its edges and fill colour being white, the effect is to erase the parts of the density surface lying outside of the study area. This can be achieved using the poly.outer function:

    masker = poly.outer(breach.dens,tracts,extend=100)

The first two parameters give the outer rectangle and the hole shape, respectively. The third parameter causes the outer rectangle to extend by a small amount in each direction. Sometimes this is useful, since occasionally there is a very slight mismatch between the coordinates of the outer edge of the grid and the outer edge of the mask polygon. The erasing technique set out above might then fail to erase a small amount of information on the edge of the grid. The extend parameter avoids this by making the mask polygon's outer edges slightly exceed those of the grid. Here, we extend the edges by 100 feet.

Now that we have a masking polygon, called masker, we can plot this on the map. The quickest way to do this is to use the add.masking command. This is more or less the same as the plot command, but defaults to drawing white filled polygons with white boundaries. Enter

    add.masking(masker)

This erases the part of the density map outside of New Haven. However, it has also partly erased the external boundaries of the census tracts. It would have been more sensible to draw the tracts after the mask polygon was drawn. A better map can be achieved by entering the commands in this order:

    level.plot(breach.dens)
    add.masking(masker)
    plot(tracts,add=TRUE)

Finally, it is also possible to use shading schemes (as seen in practical 2) to draw level plots with different intervals or colours. To do this, the auto.shading function is used as before. The variable used to define the shading scheme is the kernel density estimate in the breach.dens object, accessed by breach.dens$kde. The following gives a level plot with 7 levels, drawn as shades of green:

    breach.dens.shades = auto.shading(breach.dens$kde,
        n=7,cols=brewer.pal(7,"Greens"),cutter=range.cuts)
    level.plot(breach.dens,shades=breach.dens.shades)
    add.masking(masker)
    plot(tracts,add=TRUE)

Note the first command is split over two lines.

End of Practical

At this stage, the practical has finished. To exit R, enter

    save.image(file='rpract.rdata')
    detach(data.frame(tracts))
    q()

which will save your current variables into a file in your working folder, undo the most recent attach command (the one for tracts, which is the only data frame still attached), and quit R.
