Methods for cross-referencing, consistency check and generalisation of spatial data


SENSOR Project Deliverable Report
SENSOR Report Series 2006/01
SENSOR Sustainable Impact Assessment: Tools for Environmental, Social and Economic Effects of Multifunctional Land Use in European Regions

Title: Methods for cross-referencing, consistency check and generalisation of spatial data
Authors: Hansen HS, Grondal L
Date: February 2006
Category: Project Deliverable Report
Deliverable title: D Methods for consolidating, generalisation, cross-referencing and consistency checking of diverse historical and current spatial/statistical data
Submission date: December 2005

SENSOR Project

The Integrated EU project SENSOR aims to develop ex-ante Sustainability Impact Assessment Tools (SIAT) to support policy making regarding multifunctional land use in European regions. Land use represents a key human activity which drives socio-economic development in rural regions and manipulates structures and processes in the environment. At the European level, policies related to land use intend to support the efficient use of natural resources and to improve socio-economic developments. The project is financed by the EU 6th Framework Programme. Project duration is four years, starting in December. The project is carried out by a consortium of research institutes, led by the Leibniz-Centre for Agricultural Landscape Research (ZALF).

This report contains one of the early deliverables in the course of the project. Its objective is to identify and describe various methods and tools for generalisation, cross-referencing and consistency checking of diverse historical and current spatial/statistical data. This overview represents a general up-to-date recommendation for the SENSOR partners.

Keywords: sustainability impact assessment tool, multifunctional land use, geodatabase, SENSOR data management system, cross-referencing, spatial data models, database development, spatial and statistical data, vector and raster data

Correct reference: Hansen HS, Grondal L (2006): Methods for cross-referencing, consistency check and generalisation of spatial data. In: Helming K, Wiggering H (eds.) (2006) SENSOR Report Series 2006/01, ZALF, Germany

Prepared under contract from the European Commission
Contract no (GOCE)
EU FP6 Integrated Project
Priority Area "Global Change and Ecosystems"
December December 2008

This publication has been funded under the EU 6th Framework Programme for Research, Technological Development and Demonstration, Priority Global Change and Ecosystems (European Commission, DG Research, contract (GOCE)). Its content does not represent the official position of the European Commission and is entirely under the responsibility of the authors. The information in this document is provided as is and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.

Executive summary

The main aim of SENSOR is to deliver tools to evaluate the multifunctionality and sustainability of landscapes. Complete, reliable and harmonised data are therefore a core precondition for a successful implementation of the project. Data quality is a very important issue in all database development projects, and a major aim of this deliverable has been to identify and describe various methods and tools to enhance the quality of data produced by the SENSOR project.

After having developed a metadata profile (standard) for the SENSOR project and a web-based metadata reporting system (Deliverable D 5.1.1), the next natural step towards harmonised, good-quality spatial data is the development of methods for cross-referencing and consistency checking. During the autumn period we have worked on identifying the needs and methods concerning data quality. The new Geodatabase concept from ESRI has built-in facilities for setting up topological rules in order to prevent, identify and correct topological errors within a dataset as well as between datasets. The methods and tools are analysed and tested for usefulness.

Generalisation of spatial data will be a major issue for the SENSOR project. First, there will be a need for generalisation of detailed data from the case studies to the more overall research in Modules 2 and 3, and finally a further generalisation to the NUTSx level in the SIAT system. Many tools are already available for this effort, and Maria Luisa Paracchini has made a detailed review of methods for data and indicator disaggregation and weighting. Second, there might be a need to generalise data for subsequent publication, for example on the Internet.

The described methods and tools for generalisation, cross-referencing and consistency checking of diverse historical and current spatial/statistical data will be further developed and added as an integrated part of the whole SENSOR data management system.

Contents

Introduction
Data quality
Spatial data models
The geodatabase
Introduction
Data consistency check
Cluster Tolerance
Ranks
Rules
Subtract
Merge
Create Feature
Consistency test
Rules for consistency check within a polygon feature dataset
Rules for consistency check between two or more feature classes
Topology rule: Must not overlap
Topology rule: Must not have gaps
Topology rule: Must be covered by feature class of
Topology rule: Must be covered by
Topology rule: Must cover each other
Rules for consistency check within a point feature dataset
Topology rule: Must be properly inside polygons
Map projection
Map Generalisations in GIS
Map generalisation - lines
Map generalisation - polygons
Map generalisation by converting vector to raster models
Morphological filtering during map generalisation
Majority filter
Boundary clean
Statistical data
Database keys
Changing administrative units
Changing definitions
Conclusion
References

Methods for cross-referencing, consistency check and generalisation of spatial data

Louise Grøndal / NERI (hsh@dmu.dk)
Henning Sten Hansen / NERI (lgr@dmu.dk)

Abstract

Data quality is a very important issue in all database development projects, and a major aim of this deliverable has been to identify and describe various methods and tools to enhance the quality of data produced by the SENSOR project. After having developed a metadata profile (standard) for the SENSOR project, and a web-based metadata reporting system (Deliverable D 5.1.1), the next natural step towards harmonised, good-quality spatial data was the development of methods for cross-referencing and consistency checking. The focus of the work lay on identifying the needs and methods concerning data quality. The new Geodatabase concept from ESRI has built-in facilities for setting up topological rules in order to prevent, identify and correct topological errors within a dataset as well as between datasets. The methods and tools were analysed and tested for usefulness. Generalisation of spatial data is a major issue for the SENSOR project. First, there is a need for generalisation of detailed data from the case studies to the more overall research in Modules 2 and 3, and finally a further generalisation to the NUTSx level in the SIAT system. Second, there might be a need to generalise data for subsequent publication, for example on the Internet.

1 Introduction

Digital systems are capable of processing data more precisely than analogue systems, but the quality of the result depends on the accuracy of the source data. Hence, a protocol is needed for the procedures to be performed on the data. Today's technology makes it possible for a growing number of spatial data producers and users to access geospatial data. Users with diverse backgrounds can now obtain digital datasets through geospatial clearinghouses and/or warehouses. Furthermore, data producers can now more easily add new features, attributes and relationships to those already included in the database, so that datasets are the result of the contributions of a number of producers. Therefore, assumptions and limitations affecting the creation or modification of data must be documented. The need to communicate information about datasets is crucial when users integrate data from various sources (INSPIRE, 2002a; 2002b).

Data quality concepts provide an important framework for both data producers and users. Proper documentation provides spatial data producers with a better knowledge of their holdings, and allows them to manage data production, storage, updating and reuse more effectively. Data users can use this information to determine the appropriateness of a dataset for a given application and lessen the possibility of misuse.

This document describes how to assess data quality and how to apply different topology rules to vector datasets. The aim of this paper is to evaluate methods and procedures for controlling the quality of vector datasets using the tools provided by ArcGIS. The geodatabase format was found to be an effective tool for creating topologically correct datasets, and the establishment of topology rules will be discussed. In addition, tools for generalisation of raster datasets are presented with examples of their usage.

2 Data quality

According to Veregin and Hargitai (1995), the quality of a dataset can be described by its accuracy, precision, consistency and completeness. The components of data quality considered here are accuracy (the accuracy of the spatial component of the database), consistency (the fidelity of the relationships encoded in the database), completeness (the external validity of the database) and lineage (the processing history of the database, including sources, data capture methods and data transformation techniques).

Accuracy of the spatial components in the database is defined as the discrepancy between the database and the reference source. The uncertainty of a dataset can be defined statistically as the mean error and the standard deviation. The resolution is another parameter describing the accuracy of the database. The geometric resolution might be described as the number of points on a line or the density of a grid in an elevation model.

Consistency is a measure of the internal validity of the database. A consistent database contains no apparent contradictions in the relationships among the encoded features. In the spatial domain, consistency refers to the lack of topological errors (e.g. unclosed polygons, dangling nodes, etc.). Feature datasets need to be topologically consistent for an analysis to be carried out.

The completeness of a dataset describes its ability to depict the real world. It can be defined as the degree to which all intended entries into a database have actually been encoded. Given sufficient metadata, the completeness can be assessed by determining the degree to which a database contains all of the features it purports to contain.

Lineage is the origin of the dataset. It requires a description of the source material of the data and of the method of derivation, including all transformations involved in producing the final dataset.

For spatial operations (geoprocessing) to be carried out, the datasets need to be topologically consistent (Tomlinson, 2003). In the current context, consistency will be defined as the lack of topological errors (e.g. sliver polygons, overlapping polygons, dangling nodes, etc.). When working with a geodatabase, rules can be established to control the allowable spatial relationships of features within a feature class, in different feature classes or between subtypes. The topology is a type of dataset in a geodatabase that manages a set of rules and other properties associated with a collection of simple feature classes. The feature classes that participate in a topology are kept in a feature dataset so that all feature classes have the same spatial reference. The topology has an associated cluster tolerance, which can be specified by the data modeller to fit the precision of the data. Hence, a priori knowledge of the spatial precision is of great importance. This document offers suggestions for consistency checks on polygon datasets, presented with examples using the ArcGIS Geodatabase model.
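The statistical description of accuracy given above (mean error and standard deviation of the discrepancy between the database and a reference source) can be illustrated with a minimal sketch. The function name and the control-point coordinates are hypothetical, and the sketch assumes that each database point has exactly one matching reference point and that errors are measured as planar distances.

```python
import math

def positional_error_stats(measured, reference):
    """Return (mean error, standard deviation) of point-to-point distances."""
    if len(measured) != len(reference) or not measured:
        raise ValueError("need two non-empty point lists of equal length")
    # Euclidean distance between each database point and its reference point
    errors = [math.hypot(mx - rx, my - ry)
              for (mx, my), (rx, ry) in zip(measured, reference)]
    mean = sum(errors) / len(errors)
    # Population standard deviation of the error distances
    variance = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, math.sqrt(variance)

# Hypothetical control points (coordinates in metres), for illustration only
database_pts = [(3.0, 4.0), (10.0, 10.0), (20.0, 5.0)]
reference_pts = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
mean_err, std_err = positional_error_stats(database_pts, reference_pts)
print(f"mean error {mean_err:.2f} m, standard deviation {std_err:.2f} m")
```

Such figures could be recorded in the metadata so that users can judge, for instance, which cluster tolerance is appropriate for the dataset.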

3 Spatial data models

Spatial data can be represented by different types of models. When representing the real world in a GIS it is convenient to group entities of the same geometric type together in a class (Longley et al., 2001). According to Worboys and Duckham (2004) geospatial data are divided into classes, which are also used as data models in ArcGIS: the raster data model, the vector data model, the coverage data model and the geodatabase data model.

Figure 1 Vector data model (left) and the raster data model (right), as compared to the real world data.

Raster data models use an array of cells, known as pixels, to represent objects of the real world (Figure 1). Each cell holds attribute values. The resolution of the model is defined by the size of the cell: the larger the cell size, the lower the accuracy of the dataset, and vice versa. However, a small cell size increases the file size and might cause problems in operation; the file might therefore be compressed. The advantages of the raster data model are its simple data structure and its suitability for spatial analysis. The disadvantage is the lack of detail. The raster data model is closely related to a field conceptual data model.

The vector data model displays a discrete object view (Longley et al., 2001). Each object of the real world is classified as one of the fundamental geometric types: point, line or polygon. Points are recorded as single x,y-coordinate pairs and in some cases also a z-coordinate, whereas lines consist of a series of points, and polygons are areas enclosed by one or more line segments. Topological structuring of the dataset makes it possible to assess spatial relations between objects. Vector data are considered suitable for detailed maps, whereas raster datasets are more suitable for representing the geographic variation of a phenomenon (Burrough and McDonnell, 1998). Transformation from the vector model to the raster model reduces the level of detail and should therefore be conducted with caution (see also chapter 5).

The coverage data model stores topological relationships between the vector features; hence the spatial data record for a line contains information on which nodes delimit the line and which lines are connected to it. In addition, information is stored on which polygons are on its right and left sides. The advantage of the coverage data model is that the topology facilitates geographic analysis. According to Zeiler (1999) the disadvantage of the model is its rather primitive topological relationships, which weaken the integrity of the dataset. Thus, if a line is inserted across a polygon, the polygon is split into two polygons.

ArcGIS introduced the geodatabase model. This is an object-oriented data model allowing relationships to be defined among feature classes in the model. The advantage of this model is the ability to define topological, spatial and general relationships among the features (Zeiler, 1999). Since the geodatabase can include all the above-mentioned geographic information, it represents the most suitable data model and is recommended for SENSOR.

4 The geodatabase

4.1 Introduction

The geodatabase is the top-level unit of geographic data. It is a collection of (i) feature datasets, (ii) feature classes, (iii) object classes and (iv) relationship classes. The geodatabase supports an approach to modelling geography that integrates the behaviour of different feature types and supports different types of essential relationships. Topology is a collection of rules and relationships that, coupled with a set of editing tools and techniques, enables the geodatabase to model geometric relationships more accurately. The feature dataset is a collection of feature classes that share a common coordinate system. The feature class consists of points, lines or polygons. Datasets are centrally stored and managed. All features in the geodatabase share a common coordinate system.

According to Zeiler (1999), five steps are essential to create a data model for GIS data:
1. Model the user's view of the data (identify the data needs and organise data in logical groupings).
2. Define objects and relationships (build a logical data model).
3. Select geographic representation (determine which feature data are needed).
4. Match to geodatabase elements (specify relationships among features).
5. Organise geodatabase structures (build the structure of the database with consideration of thematic grouping and topological associations).

This report focuses on the fifth step, which includes defining and checking the topological rules. The consistency of a geodatabase is related to certain topological rules. When datasets are added to the database, the topological rules can be set to identify certain inconsistencies, e.g. gaps between polygons or overlapping polygons. The topological rules ensure the logical relationships and consistency in the database. Thus, neighbouring polygons share common borders, so that there are no uncovered areas and no overlapping polygons.

Figure 2 The geodatabase configuration. Datasets are entered into the database for analysis and visualisation in the GIS.

Data consistency check

The geodatabase is an effective tool when creating a topologically correct dataset. The main capabilities are:
- Declare and place limitations on how features share geometry (Figure 2).
- Create features from unstructured geometry using snapping and aggregation.
- Constraint support: relationships between features, validation rules and logical networks.

Figure 3 The shared geometry.

These capabilities are expressed as an integrity set defining the topology. Topologies can extend across feature types with the same spatial reference system. However, a feature type can only be included in one topology or geometric network. The defined topologies are based on the following components: (i) rules, (ii) ranks, and (iii) cluster tolerance.

Cluster Tolerance

The cluster tolerance is defined in ArcGIS as the distance within which all vertices and boundaries in a feature dataset are considered identical or coincident. Vertices that fall within this distance of each other are defined as coincident and snapped together. To minimise errors, the cluster tolerance should be as small as possible, depending on the precision level of the dataset. The behaviour of snapping operations is controlled using the cluster tolerance. As an example, snapping points together that are 20 cm apart is acceptable if the data accuracy is 2 metres. Hence, the cluster tolerance is entirely dependent on the scale, complexity and accuracy of the data. When editing a geodatabase, not all three components need to be modified; however, awareness of them is necessary, especially of the rules.

Ranks

The rank of the topology controls which features will be moved when coincident vertices are snapped during the validation of the topology. When merging shared points, edges or areas, the feature type rank is used as a tiebreaker. Vertices from lower ranked features will be snapped to the location of the higher ranked vertices if they fall within the cluster tolerance. Vertices from feature types with the same rank are averaged.
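The interplay of cluster tolerance and rank described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure ArcGIS actually uses: the function name `snap_vertices` is hypothetical, rank 1 is taken to be the highest rank, and clusters are formed greedily around the highest-ranked vertex encountered first.

```python
import math

def snap_vertices(vertices, cluster_tolerance):
    """Snap vertices lying within cluster_tolerance of a cluster seed.

    vertices: list of (x, y, rank); rank 1 is assumed to be the highest rank.
    Returns the snapped (x, y) locations in the input order.
    """
    clusters = []  # each cluster is a list of indices into vertices
    # Visit higher-ranked vertices first so that they seed the clusters
    order = sorted(range(len(vertices)), key=lambda i: vertices[i][2])
    for i in order:
        x, y, _ = vertices[i]
        for cluster in clusters:
            sx, sy, _ = vertices[cluster[0]]  # seed vertex of this cluster
            if math.hypot(x - sx, y - sy) <= cluster_tolerance:
                cluster.append(i)
                break
        else:
            clusters.append([i])

    snapped = [None] * len(vertices)
    for cluster in clusters:
        best_rank = min(vertices[i][2] for i in cluster)
        leaders = [vertices[i] for i in cluster if vertices[i][2] == best_rank]
        # Equal-ranked leaders are averaged; lower ranks move to that location
        cx = sum(v[0] for v in leaders) / len(leaders)
        cy = sum(v[1] for v in leaders) / len(leaders)
        for i in cluster:
            snapped[i] = (cx, cy)
    return snapped

# Hypothetical vertices (metres): the second snaps to the higher-ranked first,
# the third and fourth share a rank and are averaged, the fifth is untouched.
points = [(0.0, 0.0, 1), (0.15, 0.0, 2), (5.0, 5.0, 2), (5.1, 5.0, 2), (9.0, 9.0, 1)]
result = snap_vertices(points, cluster_tolerance=0.2)
print(result)
```

With a tolerance of 0.2 m only vertices closer than 20 cm to a seed are moved, which mirrors the advice to keep the tolerance well below the accuracy of the data.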

4.1.3 Rules

Rules define allowable relationships in a topology. Rules serve as the definition for topological integrity. A total of 26 geodatabase rules are available in ArcGIS 8.3 for modelling topological relationships among point, line and polygon feature classes. Any number of rules can be applied within a feature dataset as long as the rules match the feature types they are targeting. Topology rules can also be defined between subtypes of features in one or another feature class.

Table 1 Geodatabase topology rules, from ESRI Topology rules 8.3.

Topology rule                                   Dataset
Area boundary must be covered by boundary of    Polygon/Polygon
Boundary must be covered by                     Polygon/Line
Contains Point                                  Polygon/Point
Endpoint must be covered by                     Line
Must be covered by...                           Polygon/Polygon
Must be covered by boundary of                  Line/Polygon, Point/Polygon
Must be covered by endpoints of                 Point/Line
Must be covered by feature class of             Line/Line, Polygon/Polygon
Must be properly inside polygons                Point/Polygon
Must be single part                             Line
Must cover each other                           Polygon/Polygon
Must not have dangles                           Line
Must not have gaps                              Polygon
Must not have Pseudo-nodes                      Line
Must not intersect                              Line
Must not intersect or touch interior            Line
Must not overlap                                Line, Polygon
Must not overlap with...                        Line/Line, Polygon/Polygon
Must not self intersect                         Line
Must not self overlap                           Line
Point must be covered by Line                   Point/Line

The geodatabase takes a user-centred approach to topology, focusing on the editing process rather than data consistency. Rule violations are flagged for later correction. The topological rules are implemented as a part of the geodatabase structure in ESRI's ArcGIS. The spatial relationships among feature classes are defined by the topology, and specific rules are applied for each basic geometry type. The primary spatial relationships that can be modelled using topology are adjacency, coincidence and connectivity. Here the focus is on examples of topology among feature datasets.

Validation of the geodatabase is the process in which the software checks the applied topology rules and the relationships against the dataset. The initial validation of the topology checks all features against all rules. Before a validation is run, all areas not yet validated are considered dirty areas. According to the ESRI definition, dirty areas are the regions surrounding features that have been altered after the initial topology validation process and that require an additional topology validation to be performed to discover any errors. If features are modified after they have been validated, the affected areas once again become dirty. The validation is run within ArcMap.

Most topology violations have predefined fixes that can be used to correct the errors. Some topology rules, however, have no predefined fixes. Once the dirty areas are marked on the map, the errors can be selected and a correction can be applied for the error type. The fixes available for polygon errors are the following:

Subtract

The Subtract fix removes the overlapping portion of geometry from each feature that is causing the error and leaves a gap or void in its place. This fix can be applied to one or more selected Must Not Overlap errors.

Merge

The Merge fix adds the portion of overlap to one feature and subtracts it from the others that are violating the rule. The feature that receives the portion of overlap is picked in the Merge dialog box. This fix can be applied to one Must Not Overlap error only.

Create Feature

The Create Feature fix creates a new polygon feature out of the error shape and removes the portion of overlap from each of the features causing the error, in order to create a planar representation of the feature geometry. This fix can be applied to one or more selected Must Not Overlap errors.
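The effect of the three fixes can be illustrated with a small sketch. It is a simplification and not the ArcGIS implementation: polygons are represented as sets of unit grid cells so that overlap reduces to set intersection, and all function names and parcels are hypothetical.

```python
def cells(x0, y0, x1, y1):
    """Set of unit cells (col, row) covering the rectangle [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

def subtract_fix(a, b):
    """Remove the overlap from both features, leaving a void in its place."""
    overlap = a & b
    return a - overlap, b - overlap

def merge_fix(a, b, receiver="a"):
    """Give the overlap to the chosen receiver and remove it from the other."""
    overlap = a & b
    if receiver == "a":
        return a, b - overlap
    return a - overlap, b

def create_feature_fix(a, b):
    """Turn the overlap into a new feature and remove it from both inputs."""
    overlap = a & b
    return a - overlap, b - overlap, overlap

# Two hypothetical parcels that overlap in one column of four cells
parcel_a = cells(0, 0, 4, 4)   # 16 cells
parcel_b = cells(3, 0, 6, 4)   # 12 cells

sub_a, sub_b = subtract_fix(parcel_a, parcel_b)
mrg_a, mrg_b = merge_fix(parcel_a, parcel_b, receiver="a")
new_a, new_b, new_feature = create_feature_fix(parcel_a, parcel_b)
print(len(sub_a), len(sub_b), len(mrg_a), len(mrg_b), len(new_feature))
```

Only Subtract reduces the total covered area; Merge and Create Feature keep every cell covered exactly once, which is why they also avoid introducing new gaps.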

4.2 Consistency test

Consistency was tested in two ways: internally within a layer, and between two layers (feature datasets). A geodatabase was established using feature datasets providing information from the Danish coastal areas. The test was performed on two feature datasets: CORINE land cover and NUTS3.

Figure 4 Danish land areas presented by Corine Land cover 2000 and NUTS3. From the two datasets, the Danish region was extracted.

Establishing a geodatabase involves deciding which datasets should go into the database and to which rules the datasets should be subjected. Depending on the dataset, different kinds of topological rules are applicable to the feature datasets. There are rules for modelling spatial relationships between feature classes in feature datasets. The rules can be defined between features in a single feature class or subtype, or between two feature classes or subtypes. The steps are:
- Identify which datasets should go into the database.
- Define the projection of the datasets; the database projection is inherited by all datasets within the database (see section 4.6).
- Define the spatial relationships that are important.

In the following, examples of topological rules to apply to polygon and point feature datasets are presented.

4.3 Rules for consistency check within a polygon feature dataset

- Must not overlap.
- Must not have gaps.

4.4 Rules for consistency check between two or more feature classes

- Must be covered by feature class of.
- Must be covered by.
- Must cover each other.

4.4.1 Topology rule: Must not overlap

Figure 5 Polygons must not overlap within a feature class or subtype. Polygons can be disconnected or touch at a point or along an edge.

Overlapping areas are considered errors. This rule applies to the internal cross validation of a polygon dataset. Three methods are available for fixing the errors originating from this rule:
- Subtract.
- Merge.
- Create feature.

Topology rule: Must not have gaps

Figure 6 Polygons must not have gaps between them within a feature class or subtype. Errors are created where polygon boundaries are not coincident with other polygon boundaries.

This rule is applied for cross validation of a polygon dataset. The gaps are also known as sliver polygons. A sliver polygon is a small area feature that may be found along the borders of areas following the topological overlay of two or more datasets with common features (for example lakes). Topological overlay results in small sliver polygons if the two input data layers contain similar boundaries from two different sources. Consider two data layers containing land parcels that will be used in a topological overlay. One data layer may have come from an external source, perhaps provided in digital format by a data supplier. The other data layer may have been digitised within the organisation. After they are overlaid, it is likely that small errors in the parcel boundaries will appear as sliver polygons, that is, small thin polygons along the boundaries. It is recommended to remove the slivers (Tomlinson, 2003).

The potential error fix for this rule is to create a new polygon: the Create Feature fix creates new polygon features using a closed ring of line error shapes that form a gap. This fix can be applied to one or more selected Must Not Have Gaps errors. If two errors are selected, the result will be one polygon feature per ring. If one multipart feature is needed as the result of the fix, each new feature has to be selected and the features then merged. Note that the ring that forms the outer bounds of the feature class will be in error. Using the Create Feature fix for this specific error can create overlapping polygons. However, such errors can be marked as exceptions.
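Because slivers are described above as small, thin polygons, one common way to find candidates is to combine an area threshold with a shape measure. The sketch below is an illustration only and is not taken from the report or from ArcGIS: it uses the shoelace formula for the area and the compactness ratio 4*pi*A/P^2 (1.0 for a circle, close to 0 for a thin strip), and the thresholds are hypothetical values that would have to be tuned to the data.

```python
import math

def ring_area_perimeter(ring):
    """Area (shoelace formula) and perimeter of a closed ring of (x, y) points."""
    area2 = 0.0
    perimeter = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area2) / 2.0, perimeter

def is_sliver(ring, max_area, max_compactness=0.1):
    """Flag polygons that are both small and thin (thresholds are assumptions)."""
    area, perimeter = ring_area_perimeter(ring)
    if perimeter == 0:
        return True  # degenerate ring with no extent
    compactness = 4 * math.pi * area / perimeter ** 2  # 1.0 for a circle
    return area <= max_area and compactness <= max_compactness

# Hypothetical polygons (metres): a 10 m square parcel and a 100 m x 2 cm strip
parcel = [(0, 0), (10, 0), (10, 10), (0, 10)]
strip = [(0, 0), (100, 0), (100, 0.02), (0, 0.02)]
print(is_sliver(parcel, max_area=5.0), is_sliver(strip, max_area=5.0))
```

Flagged candidates would still need to be inspected before removal, since a genuinely narrow feature such as a ditch can look like a sliver.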

Gaps and sliver polygons create problems:
- They represent obvious inaccuracies and, in addition, the maps look bad.
- They can cause errors in spatial analyses: polygons that should be adjacent might not be, points that should lie within some polygon might lie within no polygon, points may lie in more than one polygon and distances may be inconsistent.

These problems can be resolved by identifying and eliminating the slivers within the data layer. In addition, these features should also be identified between two feature datasets.

Topology rule: Must be covered by feature class of

Figure 7 Corine Land cover 2000 must be covered by NUTS3. Red areas mark dirty areas where NUTS3 does not cover Corine Land cover.

This rule requires that a polygon in one feature class must share all of its area with polygons in another feature class. Errors are found where areas from the first feature class are not covered by polygons from the other feature class. However, areas of the other feature class that are not covered by the first feature class are not errors. Subtracting or creating new features can eliminate dirty areas from this rule.

Topology rule: Must be covered by

Figure 8

This rule requires that polygons of one feature class must be contained within polygons of another feature class. Any area defined in the contained feature class must be covered by an area in the covering feature class. Only polygons in CLC2000 that are completely within the NUTSX feature classes are not errors. All areas marked with red are dirty areas. To avoid this type of error, features might be created where the dirty areas are located.

4.4.5 Topology rule: Must cover each other

The rule is useful when two feature classes are used for the same geographical area and areas in one feature class must be defined in the other feature class. The errors from this rule can be fixed by subtracting or creating new features where the dirty areas are located. The following is an example from the Danish Wadden Sea area.

Figure 9 Corine Land cover 2000 and NUTS 3. Areas in red are errors from the rule Must cover each other. The rule requires that polygons from one feature class must share all of their area with polygons of another feature class. All intertidal flats are marked as errors with this rule.

Figure 10 Corine Land cover 2000 and NUTS 3. Areas in red are errors from the rule Must cover each other. The rule requires that polygons from one feature class must share all of their area with polygons of another feature class.
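The difference between a one-directional coverage rule and Must cover each other can be made concrete with a small sketch. As a simplifying assumption, each layer is represented as a set of unit grid cells rather than true polygons, so the rules reduce to set operations; the layer extents and function names are hypothetical.

```python
def cells(x0, y0, x1, y1):
    """Set of unit cells (col, row) covering the rectangle [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

def must_be_covered_by(first, second):
    """Error cells of the one-directional rule: parts of first outside second."""
    return first - second

def must_cover_each_other(first, second):
    """Error cells of the mutual rule: covered by exactly one of the layers."""
    return first ^ second

# Hypothetical layers: the land cover layer extends one column beyond the
# administrative layer (as intertidal flats might), and the administrative
# layer extends one row beyond the land cover layer.
land_cover = cells(0, 0, 5, 3)   # 15 cells
admin_units = cells(0, 0, 4, 4)  # 16 cells

one_way_errors = must_be_covered_by(land_cover, admin_units)
mutual_errors = must_cover_each_other(land_cover, admin_units)
print(len(one_way_errors), len(mutual_errors))
```

The one-directional rule reports only the land cover cells lying outside the administrative units, whereas the mutual rule also reports the administrative cells without land cover.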

4.5 Rules for consistency check within a point feature dataset

Topology rule: Must be properly inside polygons

This rule is used when points must be completely inside the polygons of another feature dataset. An example could be that cities have to belong to a certain area or county.

Figure 11 The points marked with red are the errors found using the topology rule.

Errors are created when a point falls outside a polygon or lies on the boundary of a polygon. Errors are fixed by removing the point features that are not properly within the polygon features with the Delete fix. This is, however, not always the best way of correcting the errors; another option is to move the point inside the polygon feature.

4.6 Map projection

Accurate georeferencing is a prerequisite for map projection. If the georeferencing is insufficiently accurate, the value of the original data acquisition may be lost, and in the worst case the data might be useless. Figure 12 gives an example of map projections of the NUTS dataset. The dataset is originally displayed as GCS_ETRS_1989; in Figure 12 it has been converted to UTM zone 32 WGS84 (blue) and UTM zone 32 ED50 (red). As seen there is a difference between the two projections by approximately m.

Figure 12 The NUTS dataset in two projections: UTM zone 32 WGS84 (blue) and UTM zone 32 ED50 (red). The map to the left does not show the difference in the projections. However, the difference between the two projections becomes visible when zooming into a small area.

ArcGIS provides good tools for converting between projections and different coordinate systems and hence enables combinations of datasets that use different systems of georeferencing.
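The test behind the Must be properly inside polygons rule of section 4.5 can be sketched as follows. This is a minimal illustration assuming a simple polygon without holes, given as a ring of (x, y) vertices; it uses the standard ray-casting technique with an explicit boundary check and is not the ArcGIS implementation. The function names and the numerical tolerance `eps` are assumptions.

```python
def on_segment(p, a, b, eps=1e-9):
    """True if point p lies on the segment from a to b (within eps)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    if abs(cross) > eps:
        return False  # p is not collinear with a and b
    return (min(ax, bx) - eps <= px <= max(ax, bx) + eps and
            min(ay, by) - eps <= py <= max(ay, by) + eps)

def properly_inside(point, ring):
    """True only if point is strictly inside the ring, not on its boundary."""
    px, py = point
    inside = False
    n = len(ring)
    for i in range(n):
        a, b = ring[i], ring[(i + 1) % n]
        if on_segment(point, a, b):
            return False  # boundary points violate the rule
        (ax, ay), (bx, by) = a, b
        # Count crossings of a horizontal ray running to the right of the point
        if (ay > py) != (by > py):
            x_cross = ax + (py - ay) * (bx - ax) / (by - ay)
            if x_cross > px:
                inside = not inside
    return inside

# Hypothetical county (a 10 x 10 square) and three city locations
county = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(properly_inside((5, 5), county),    # interior point
      properly_inside((10, 5), county),   # point on the boundary
      properly_inside((15, 5), county))   # point outside
```

Points for which the function returns False correspond to the errors marked in red in Figure 11, and would be candidates for deletion or for being moved inside the polygon.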

5 Map Generalisations in GIS

The resolution of a feature dataset refers to the amount of detail that can be discerned in space (Veregin and Hargitai, 1995). Generalisation refers to the elimination of small features, the smoothing and thinning of features, and the merging or aggregation of features in close proximity to each other. Generalisation in ARC/INFO is defined as a general reduction of the number of points used to represent a line. This includes removal of vertices from arcs in accordance with a specified weed distance.

5.1 Map generalisation lines

Generalisation of vector data models can be performed on lines and polygons. According to Longley et al. (2001), the generalisation of polylines is a weeding of vertices in the line, simplifying the representation of the line. The most common method of simplification is known as the Douglas-Poiker (Douglas-Peucker) algorithm. In addition, lines can be smoothed to improve the aesthetic or cartographic quality. Two methods are available in ArcGIS: the PAEK algorithm (polynomial approximation with exponential kernel), which produces smoothed lines with more vertices than the source line, and the Bezier interpolation algorithm, which fits Bezier curves between vertices.

Table 2 The two types of line generalisation presented with the file sizes.
Columns: Spatial transformation; Original feature dataset; Generalised feature dataset; Zoom on original (top) and generalised feature dataset (bottom).
Simplification: File size kb; 1928 kb
Smoothing: File size kb; kb

As shown in Table 2, simplifying a feature dataset decreases the file size, whereas the smoothing algorithm increases the file size of the feature dataset.
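The vertex weeding described in section 5.1 can be sketched as follows. This is a textbook recursive formulation of the Douglas-Peucker algorithm, not the ArcGIS code; the function names and the `tolerance` parameter (playing the role of the weed distance) are assumptions made for illustration.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the straight line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        # Degenerate chord: fall back to the distance to the single point.
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def simplify(line, tolerance):
    """Weed vertices that lie closer than tolerance to the local chord."""
    if len(line) < 3:
        return list(line)
    # Find the vertex farthest from the chord joining the two end points.
    index, dmax = 0, 0.0
    for i in range(1, len(line) - 1):
        d = perpendicular_distance(line[i], line[0], line[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        # Every intermediate vertex is within tolerance: keep the end points.
        return [line[0], line[-1]]
    # Otherwise keep the farthest vertex and simplify both halves.
    left = simplify(line[:index + 1], tolerance)
    right = simplify(line[index:], tolerance)
    return left[:-1] + right
```

For the line `[(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]` and a tolerance of 0.1, only the nearly collinear vertex (1, 0.05) is removed. Fewer vertices mean a smaller file, which matches the effect of simplification reported in Table 2.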


5.2 Map generalisation polygons

The dissolve tool is used to aggregate polygon features based on a specified attribute or attributes. The tool aggregates units but keeps the coordinates of the dataset. In this example, the regions in the Netherlands and in Belgium have been dissolved on a country basis, resulting in two polygons instead of individual regions. In addition, the aggregated features can include summaries of any of the attributes present in the input features. In the aforementioned example, the population density for each of the regions in the two countries could be summarised for the entire country to give the total population density in the country.

Figure 13 The NutsX feature dataset with provinces in the Netherlands and Belgium. The input feature dataset containing the regions in the two countries and the output of the aggregation on a country basis.

5.3 Map generalisation by converting vector to raster models

Two widespread methods for conversion between raster and vector models have been developed. The first approach is named vectorisation, defined as the conversion of raster data to vector data. An example is given in Figure 14.

Figure 14 Conversion of a raster feature dataset to a vector feature dataset. Each raster cell has an attribute value, and attribute values within the same class are grouped. The polygons are created by storing the x- and y-coordinates of the points adjacent to the boundaries (modified from Bernhardsen, 1992).

The other method is rasterisation, which means the conversion of vector data to raster data (Figure 15). Data are lost in both procedures. Once vector data are converted to raster data, raster generalisation tools can be applied either to clean up small erroneous pixels or to generalise the data by removing or smoothing unnecessary detail.


Figure 15 Example of the conversion of a vector feature dataset to a raster feature dataset. The polygon containing the centre of each raster cell is identified, and each cell is assigned the attribute code of the polygon to which it belongs (modified from Bernhardsen, 1992).

The following are two examples of the rasterisation of a Danish soil map, using raster grids of 1 by 1 km and 10 by 10 km.

Figure 16 The Danish soil map converted to the 1 km EuroGrid.

As shown in Figure 16, the conversion using the centre of the individual cells in the EuroGrid results in a coarse version of the original vector feature dataset. This rasterisation does, however, introduce errors, because the attribute code is assigned at the centre of each raster grid cell. At the bottom arrow, one of the small polygons marked in blue (SØ) contained the centre of a 1 by 1 km raster grid cell, so the cell was assigned the attribute value of this polygon, although the majority of the cell was covered by a polygon marked in yellow (GC). The opposite is the case at the upper arrow. Here a large polygon marked in blue (SØ) covered most of one of the EuroGrid cells, but the centre of the cell coincided with the yellow polygon (GC).

Figure 17 The Danish soil map converted to the 10 km EuroGrid.
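The centre-point assignment discussed for Figures 15 and 16 can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the tool used for the EuroGrid conversion: polygons are simple rings without holes, the grid origin is the lower-left corner, and the names `contains` and `rasterise` are hypothetical.

```python
def contains(ring, x, y):
    """Ray-casting test: True when (x, y) lies inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def rasterise(polygons, x0, y0, cell, ncols, nrows, nodata=None):
    """Assign each cell the attribute code of the polygon holding its centre.

    polygons is a list of (code, ring) pairs; the first match wins.
    Row 0 is the bottom row of the grid.
    """
    grid = []
    for row in range(nrows):
        cy = y0 + (row + 0.5) * cell
        line = []
        for col in range(ncols):
            cx = x0 + (col + 0.5) * cell
            code = nodata
            for attr, ring in polygons:
                if contains(ring, cx, cy):
                    code = attr
                    break
            line.append(code)
        grid.append(line)
    return grid


# A small SØ polygon sits on the centre of the first cell. It is listed
# before the large GC polygon because holes are not modelled in this sketch.
soil = [
    ("SØ", [(0.4, 0.4), (0.6, 0.4), (0.6, 0.6), (0.4, 0.6)]),
    ("GC", [(0, 0), (2, 0), (2, 1), (0, 1)]),
]
grid = rasterise(soil, x0=0, y0=0, cell=1, ncols=2, nrows=1)
```

Here the first cell is coded SØ although GC covers almost all of its area, which is the kind of error marked by the bottom arrow in Figure 16.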


As shown in Figure 17, the rasterisation results in a very coarse map. It reproduces some of the features of the vector feature dataset correctly, but there are also raster grid cells that are less correct. In the top left yellow cell of the 10 by 10 km EuroGrid, the majority of the cell is not covered by ES; in the vector dataset, the majority of that 10 by 10 km cell is rather the red colour (DS). The larger areas appear to be reproduced in the 10 by 10 km EuroGrid, for instance the areas dominated by orange (TS), but some erroneous features appear. Hence, rasterisation with a large grid is mostly applicable to vector datasets that contain large features.

5.4 Morphological filtering during map generalisation

Another widespread procedure for generalisation is the reduction of objects by aggregating and merging. When a threshold is set for the minimum size of an area in the map, small areas are removed and merged with the larger neighbouring area that is most similar to the small area. The following examples show how a feature dataset is coarsened as resampling tools are applied.

Figure 18 Converting a polygon vector dataset with small features to a grid of 1 by 1 km. Small features are lost with this type of generalisation.

Majority filter

This generalisation tool has commonly been used to remove small entities in the raster data model. The function replaces cells based on the majority value in their contiguous neighbourhoods. Majority filters have two criteria for replacement. First, the number of neighbouring cells with the same value must be either a majority or half of the cells. For example, 3 out of 4 or 5 out of 8 connected cells must have the same value with the majority parameter, and 2 out of 4 or 4 out of 8 with the half parameter. Secondly, those cells must be contiguous about the centre of the specified filter (for example, 3 out of 4 cells must be the same). The second criterion, concerning the spatial connectivity of the cells, minimises the corruption of cellular spatial patterns. If these criteria are not met, no replacement occurs and the cell retains its value. Notice that many of the smaller groups of cells have disappeared.

Figure 19 The majority filter with 8 neighbours, requiring at least half the values (four out of eight cells) to be the same before changing the cell's value, has been applied to the 1 by 1 km raster dataset seen in Figure 18. As seen, features have been removed in the filtered raster dataset.

Boundary clean

The Boundary Clean function is primarily used for cleaning ragged edges between zones. It uses an expand-then-shrink method that cleans boundaries on a relatively large scale. Initially, zones of higher priority invade their neighbouring zones of lower priority by one cell in all eight directions. Then they shrink back to those cells that are not completely surrounded by cells of the same value. Thin islands inside a zone, which can be viewed as sharing boundaries with the zone, may also be replaced. The smallest region that can be retained is a 3 by 3 block of cells. Therefore, thin regions may be replaced. For example, a region two cells wide and 10 cells long will be removed, since it cannot be recovered after shrinking. In Figure 20, Boundary Clean was applied to the input raster dataset seen in Figure 18 with no sorting of the zones. Zones with larger values have a higher priority to expand into zones with smaller values. Again, notice that even more of the smaller and thinner groups of cells have disappeared.

Figure 20 The majority filtered raster dataset from Figure 19 with a boundary clean.

As shown in Figure 20, the complexity of the dataset is markedly reduced when the generalisation functions are applied. Other filters, for instance nibble, use a mask of a certain property to filter the raster dataset. No examples will be shown for these filters. For smoothing a raster dataset, the majority filter in combination with the boundary clean filter is a good choice.
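The first replacement criterion of the majority filter can be sketched as below. This is a simplified illustration and not the ArcGIS function: it counts values among the eight neighbours and applies a threshold (5 for the majority parameter, 4 for the half parameter), but it deliberately omits the second criterion on contiguity. Leaving a cell unchanged when two values tie is an assumption of this sketch, and the name `majority_filter` is hypothetical.

```python
from collections import Counter


def majority_filter(grid, threshold=5):
    """Replace each cell by the dominant value among its eight neighbours.

    threshold=5 mimics the 'majority' parameter, threshold=4 the 'half'
    parameter. The contiguity criterion is not implemented here.
    """
    nrows, ncols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # filter from the input, write to a copy
    for r in range(nrows):
        for c in range(ncols):
            neighbours = [grid[rr][cc]
                          for rr in range(r - 1, r + 2)
                          for cc in range(c - 1, c + 2)
                          if (rr, cc) != (r, c)
                          and 0 <= rr < nrows and 0 <= cc < ncols]
            if not neighbours:
                continue
            ranked = Counter(neighbours).most_common(2)
            value, count = ranked[0]
            tied = len(ranked) > 1 and ranked[1][1] == count
            # Replace only when one value reaches the threshold on its own.
            if count >= threshold and not tied:
                out[r][c] = value
    return out
```

Applied to a 3 by 3 grid of ones with a single 2 in the centre, the centre cell is replaced by 1, while cells along the edge of the grid have fewer than eight neighbours and are rarely changed with these thresholds.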

6 Statistical data

Whereas the quality of spatial data mainly concerns the geometric properties, i.e. the coordinates, the other properties associated with spatial units, the so-called attribute data, are often stored separately from the spatial data. The European NUTS map is generally obtained from a mapping agency or EuroGeographics, while the associated statistical information is downloaded from EuroStat. However, this separate provision carries a risk of incompatibility between the two data sets: the database keys, the administrative units and various other items may be incompatible.

6.1 Database keys

When datasets are produced independently of each other, there are several pitfalls. With numeric keys, a common situation is that one database key has a numeric field type while the key in the associated table contains only digits but is stored in a field of type String. Accordingly, no records will be joined. For string type database keys there are several more pitfalls. The most common problem emerges when the values of one database key are in upper case whereas the key values in the associated table are in lower case. Although people tend not to care about the difference between small and capital letters, the join will return no matching records. The database keys in NUTSX of the SENSOR Core data set are in upper case, like DK004. Consequently, all tables with statistical data referring to NUTS regions must have database keys in the same form.

Actions needed in the SENSOR data management system
During upload of statistical data, we have to check that the database keys provided in the table to be uploaded are appropriate. Thus the tables must contain a key belonging either to one of the NUTS datasets or to one of the EuroGrids.

6.2 Changing administrative units

Contrary to the EuroGrids, which can basically be considered time invariant, the NUTS regions change a lot over time. The main reason for the variations is changes in the administrative boundaries within the Member States. Thus, Denmark has decided to make substantial changes in the administrative structure from January , and this will give rise to changes in NUTS-3 for Denmark. Other Member States have similar discussions on administrative reforms, and in a European Union with 25 (and soon 27) Member States, there is a potential risk of changes in the NUTS classification every year. Certainly this will complicate the analysis of changes over time.

Actions needed in the SENSOR data management system
The SENSOR Core data set must contain several versions of the NUTS classification, each with an appropriate endorsement. Thus the user must state during upload which version of the NUTS classification the table belongs to.
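The upload checks called for in sections 6.1 and 6.2 could look roughly like the following sketch. It is not part of the SENSOR data management system; the function names, the field name `nuts`, the version label `v1` and the key DK005 are hypothetical and serve only to illustrate the idea. Keys are converted to upper-case strings before they are compared with the key set of the NUTS version named by the user.

```python
def normalise_key(value):
    """Turn a numeric or mixed-case key into a trimmed upper-case string."""
    return str(value).strip().upper()


def check_upload(rows, key_field, nuts_versions, version):
    """Split rows into those whose key exists in the chosen NUTS version
    and those that would fail to join."""
    valid = nuts_versions[version]
    accepted, rejected = [], []
    for row in rows:
        key = normalise_key(row[key_field])
        if key in valid:
            # Store the normalised key so that a later join will match.
            accepted.append({**row, key_field: key})
        else:
            rejected.append(row)
    return accepted, rejected


# Hypothetical key sets, one per stored version of the NUTS classification.
nuts_versions = {"v1": {"DK004", "DK005"}}
rows = [
    {"nuts": "dk004", "population": 1},    # lower case: fixed by normalisation
    {"nuts": " DK005 ", "population": 2},  # stray blanks: fixed as well
    {"nuts": "XX999", "population": 3},    # unknown key: rejected
]
accepted, rejected = check_upload(rows, "nuts", nuts_versions, "v1")
```

Rejected rows can be reported back to the uploading partner instead of silently disappearing from a join, and keeping one key set per classification version makes the choice of NUTS version explicit.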

6.3 Changing definitions

The last issue we will deal with in this chapter is related to changing definitions of various phenomena. This may refer to different definitions among the Member States as well as different definitions between various years. One good example is the different definitions of the term urban among the Member States, which will of course be reflected in a table describing the level of urbanisation for the various NUTS regions.

Actions needed in the SENSOR data management system
We must admit that the SENSOR data management system cannot handle the consequences of definitions that change over time or between the Member States. However, we must stress that all partners must mention in their metadata descriptions if a table to be uploaded to the Data Management system is based (partly) on different definitions in the various Member States. Although this will not improve the data in the table, it will hopefully raise awareness among the potential users of the data.

7 Conclusion

Data quality is important when developing spatial databases as well as general databases with alphanumeric information, and to ensure high quality, cross-referencing and consistency checking need to be done. The methods and tools of the new Geodatabase concept from ESRI have built-in facilities for setting up topological rules to prevent, identify and correct topological errors within and between datasets. During the current work, these methods and tools have been tested and found useful.

The generalisation of spatial data will be another major issue for the SENSOR project, e.g. the generalisation of detailed data from scenarios, the test plots and the case studies in Modules 2, 3 and 6. Besides this, a further generalisation to the NUTS x level in the SIAT system is needed. The Review of methods for data and indicators disaggregation and weighting by Parachini (2005) presents and evaluates several methodologies for the aggregation and disaggregation of spatial data.

The described methods and tools for generalisation, cross-referencing and consistency checking of diverse historical and current spatial and statistical data will be further developed and added as an integrated part of the data management system of SENSOR. The generalised data for subsequent publication should be provided by the module partners using the tools of the Data Management system.

8 References

Bernhardsen T (1992) Geographic Information Systems. ISBN
Burrough PA, McDonnell RA (1998) Principles of Geographic Information Systems, 2nd ed. Clarendon Press.
INSPIRE (2002a) INSPIRE Architecture and Standards Position Paper.
INSPIRE (2002b) INSPIRE: Reference Data and Metadata Position Paper.
Longley P, Goodchild M, Maguire DJ, Rhind DW (2001) Geographic Information Systems and Science. ISBN
Parachini L (2005) Review of methods for data and indicators disaggregation and weighting.
Tomlinson RF (2003) Thinking about GIS: geographic information system planning for managers. ISBN
Veregin H, Hargitai P (1995) An evaluation matrix for geographical data quality. In: Guptill and Morrison (eds) Elements of Spatial Data Quality. ISBN
Worboys M, Duckham M (2004) GIS A Computing Perspective. 2nd ed. ISBN
Zeiler M (1999) Modeling Our World. The ESRI Guide to Geodatabase Design. ISBN

Useful link concerning geodatabase topology:

Main partner involved in this publication: The National Environmental Research Institute, Denmark.
This report was edited by the Leibniz-Centre for Agricultural Landscape Research.
Contact:
