DATA MODELS IN GIS

Prachi Misra Sahoo
I.A.S.R.I., New Delhi -110012

1. Introduction

GIS depicts the real world through models involving geometry, attributes, relations, and data quality. This chapter describes how those models are realized, with emphasis on geometric spatial information, attributes, and relations.

A prerequisite for describing the real world with GIS is that the different types of geographical information can be stored in the computer. All operations in a computer are based on the storage and handling of numbers, which is why data stored in computers are known as digital data. GIS needs to store graphical figures, images, numerical values, and plain text, and all of these forms of data must be converted into a digital representation.

In principle, only two different numerical symbols or signals can be stored in a computer: 0 and 1. The numerical system based on 0 and 1 is known as the binary system, and each binary digit is called a bit. To separate different numbers from each other, the stream of 0s and 1s is divided into groups of 8 bits. Each group is known as a byte. Decimal figures are commonly stored in 4 bytes (32 bits) using floating-point (scientific) notation, in which the number's mantissa and exponent are stored separately.

Text is stored digitally with the help of a code system called ASCII (American Standard Code for Information Interchange). Each number between 0 and 127 corresponds to one character, most of which appear on the computer's keyboard. For example, uppercase letters A through Z are represented by the numbers 65 to 90, and lowercase a through z by the numbers 97 to 122. To handle special national characters, variations of the system have been developed that use codes up to 255.

Geometric presentations are commonly called digital maps. Strictly speaking, a digital map would be peculiar because it would comprise only numbers (digits).
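The storage conventions described in the Introduction (bits and bytes, 4-byte real numbers, ASCII text) can be illustrated with a short Python sketch. It assumes the IEEE 754 single-precision layout for the 4-byte number, which is the usual convention but is not named in the text; the variable names are illustrative only.

```python
import struct

# One byte is a group of 8 bits; this is the bit pattern of the number 65.
bits_of_65 = format(65, "08b")

# Text: ASCII maps each character to a number in the range 0..127.
codes = [ord(ch) for ch in "GIS"]
ascii_bytes = "GIS".encode("ascii")   # one byte per character

# Real numbers: assuming IEEE 754 single precision, 4 bytes hold
# 1 sign bit, 8 exponent bits and 23 mantissa bits.
packed = struct.pack(">f", 0.15625)   # 0.15625 = 1.25 * 2**-3
as_bits = "".join(format(b, "08b") for b in packed)
sign, exponent, mantissa = as_bits[0], as_bits[1:9], as_bits[9:]

print(bits_of_65)                 # bit pattern of one byte
print(codes, len(ascii_bytes))    # ASCII codes and byte count
print(sign, exponent, mantissa)   # the three fields of the 32-bit value
```

The exponent field is stored with an offset of 127 in this layout, so the stored value 124 corresponds to the power 2**-3.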
By their very nature, maps are analog, whether they are drawn by hand or machine, and whether they appear on paper or are displayed on a screen. Technically, GIS does not produce digital maps; it produces analog maps from digital map data. Nonetheless, the term digital map is now so widely used that the distinction is well understood.

Spatial information is presented geometrically in two ways: as vector data in the form of points, lines, and areas (polygons), or as raster data in the form of uniform, systematically organized cells. The vector model and raster model are discussed in the following sections.

2. Basic Data Models in GIS

A data model is a set of guidelines for converting real-world features (called entities) into digitally and logically represented spatial objects consisting of attributes and geometry. The attributes are managed by a thematic or semantic structure, while the geometry is represented by a geometric-topological structure. There are two major types of geometric data model, the vector model and the raster model, as shown in Fig. 1.
Figure 1. Vector and Raster data models

2.1 Vector Data Model

The vector model uses discrete points, lines, and/or areas corresponding to discrete objects, each with a name or code number for its attributes. Given a map, you can tell what the map features are like and how they are related to one another spatially.

2.1.1 Geometry of Vector Data Model

The vector data model consists of three types of geometric objects: point, line, and area. A point may represent a gravel pit, a line may represent a stream, and an area may represent a vegetated area.

A point has zero dimensions. A point feature occupies a location and is separate from other features (Figure 2).

A line is one-dimensional and has the property of length. A line feature is made of points: a beginning point, an end point, and a series of points marking the shape of the line, which may be a smooth curve or a connection of straight-line segments. Smooth curves are typically generated or fitted by mathematical equations, such as cubic polynomial equations. Straight-line segments may represent human-made features or approximations of curves in data entry. The beginning and end points are called nodes, and points that mark the shape of a line feature but are not nodes are called vertices. Line features may intersect or join with other lines and may form a network (Figure 3).

An area is two-dimensional and has the properties of area and boundary. The boundary of an area feature separates the interior area from the exterior area. Area features may be isolated or connected. An isolated area feature typically has a node serving as both the beginning and end node. Area features may be surrounded by other areas and form holes within them. Area features may overlap one another and create overlapped areas. For example, the burned areas from previous forest fires may overlap each other (Figure 4).
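The point, line, and area objects described in Section 2.1.1 can be sketched as simple Python classes. This is a minimal illustration, not the structure of any particular GIS package: the class and method names are hypothetical, lines are assumed to be straight-segment polylines, and the area of a polygon is computed with the shoelace formula, which the text does not prescribe.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple


@dataclass(frozen=True)
class Point:
    """Zero-dimensional feature: a location only."""
    x: float
    y: float


@dataclass
class Line:
    """One-dimensional feature: beginning node, vertices, end node."""
    points: List[Point]

    def nodes(self) -> Tuple[Point, Point]:
        return self.points[0], self.points[-1]

    def vertices(self) -> List[Point]:
        # Shape points that are not nodes.
        return self.points[1:-1]

    def length(self) -> float:
        return sum(hypot(b.x - a.x, b.y - a.y)
                   for a, b in zip(self.points, self.points[1:]))


@dataclass
class Area:
    """Two-dimensional feature: a closed boundary (first point == last point)."""
    boundary: List[Point]

    def perimeter(self) -> float:
        return Line(self.boundary).length()

    def area(self) -> float:
        # Shoelace formula over consecutive boundary points.
        twice = sum(a.x * b.y - b.x * a.y
                    for a, b in zip(self.boundary, self.boundary[1:]))
        return abs(twice) / 2.0


gravel_pit = Point(2.0, 5.0)
stream = Line([Point(0, 0), Point(3, 4), Point(6, 8)])
field = Area([Point(0, 0), Point(4, 0), Point(4, 3), Point(0, 3), Point(0, 0)])

print(stream.length(), stream.vertices())
print(field.area(), field.perimeter())
```

Note that the isolated area `field` begins and ends at the same point, matching the description of a single node serving as both beginning and end node.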
Representing vector data as points, lines, and areas is not always straightforward, because the choice may depend on map scale and, occasionally, on criteria established by government mapping agencies. A city on a 1:1,000,000-scale map is represented as a point, but the same city is shown as an area on a 1:24,000-scale map. A stream is shown as a single line near its headwaters but as an area along its lower reaches. In this case, the width of the stream determines how it should be represented on a map.

2.1.2 Topology of Vector Data Model

Topology expresses explicitly the spatial relationships between geometric objects. The vector data model in ARC/INFO supports three basic topological concepts:

1. Connectivity: Arcs connect to each other at nodes.
2. Area definition: An area is defined by a series of connected arcs.
3. Contiguity: Arcs have directions and left and right polygons.

Figure 2. Points with x-, y-coordinates

Figure 3. The data structure of a line data model

Figure 4. The data structure of an area data model
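The three topological concepts of Section 2.1.2 can be illustrated with a small arc table. The sketch below is a simplified, hypothetical arc-node structure (it is not the ARC/INFO file format): each arc records its from-node, to-node, left polygon, and right polygon, and the three relationships are derived from that table. The example data describe two adjacent polygons, A and B, that share arc 1.

```python
from collections import defaultdict, namedtuple

Arc = namedtuple("Arc", "arc_id from_node to_node left_poly right_poly")

OUTSIDE = "outside"  # label assumed here for the area surrounding all polygons

# Two adjacent polygons A and B; arc 1 is their shared boundary.
ARCS = [
    Arc(1, "N1", "N2", "A", "B"),
    Arc(2, "N2", "N1", "A", OUTSIDE),
    Arc(3, "N1", "N2", "B", OUTSIDE),
]


def connectivity(arcs):
    """Connectivity: which arcs meet at each node."""
    at_node = defaultdict(list)
    for arc in arcs:
        for node in {arc.from_node, arc.to_node}:
            at_node[node].append(arc.arc_id)
    return {node: sorted(ids) for node, ids in at_node.items()}


def area_definition(arcs):
    """Area definition: the arcs that bound each polygon."""
    bounds = defaultdict(list)
    for arc in arcs:
        for poly in (arc.left_poly, arc.right_poly):
            if poly != OUTSIDE:
                bounds[poly].append(arc.arc_id)
    return dict(bounds)


def contiguity(arcs):
    """Contiguity: pairs of polygons lying on either side of a common arc."""
    return {tuple(sorted((arc.left_poly, arc.right_poly)))
            for arc in arcs
            if OUTSIDE not in (arc.left_poly, arc.right_poly)}


print(connectivity(ARCS))
print(area_definition(ARCS))
print(contiguity(ARCS))
```

Because the left and right polygons are stored with each directed arc, adjacency between A and B is found by reading one table row rather than by comparing coordinates.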
2.1.3 Advantages and Disadvantages of Vector Data Models

The advantages of the vector data model are:
1. Good representation of entity data models.
2. Compact data structure.
3. Topology can be described explicitly, which makes the model well suited to network analysis.
4. Coordinate transformation and rubber sheeting are easy.
5. Accurate graphic representation at all scales.
6. Retrieval, updating, and generalization of graphics and attributes are possible.

The disadvantages of the vector data model are:
1. Complex data structures.
2. Combining several polygon networks by intersection and overlay is difficult and requires considerable computer power.
3. Display and plotting may be time consuming and expensive, particularly for high-quality drawing, colouring, and shading.
4. Spatial analysis within basic units such as polygons is impossible without extra data, because the units are considered to be internally homogeneous.
5. Simulation modeling of processes of spatial interaction over paths not defined by explicit topology is more difficult than with raster structures, because each spatial entity has a different shape and form.

2.2 Raster Format

The raster model uses regularly spaced grid cells in a specific sequence. An element of the grid is called a pixel (picture element). The conventional sequence runs from left to right within each line (row), and then line by line from top to bottom. Every location is given in two-dimensional image coordinates (pixel number and line number), and each cell contains a single attribute value.

2.2.1 Geometry of Raster Data

The geometry of raster data is given by point, line, and area objects as follows (see Figure 5):
a. Point Objects: A point is given by a point ID, coordinates (i, j), and the attributes.
b. Line Objects: A line is given by a line ID, a series of coordinates forming the line, and the attributes.
c. Area Objects: An area segment is given by an area ID, a group of coordinates forming the area, and the attributes.
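The conventional raster sequence described in Section 2.2 (left to right within a line, then line by line from the top) means that a cell's image coordinates and its position in the stored sequence can be converted into each other. The following sketch assumes zero-based line and pixel numbers; the function names are illustrative.

```python
def cell_index(line, pixel, pixels_per_line):
    """Position of cell (line, pixel) in the row-by-row sequence."""
    return line * pixels_per_line + pixel


def cell_position(index, pixels_per_line):
    """Inverse of cell_index: returns (line, pixel)."""
    return divmod(index, pixels_per_line)


# A raster with 2 lines of 3 pixels, one attribute value per cell.
grid = [[1, 1, 2],
        [1, 3, 2]]

# The same raster stored as one sequence, line by line from the top.
flat = [value for row in grid for value in row]

print(flat)
print(cell_index(1, 2, 3), flat[cell_index(1, 2, 3)])
print(cell_position(4, 3))
```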
Area objects in the raster model are typically given by a "Run Length" structure, which rearranges the raster into a sequence of lengths (numbers of pixels) of each class, as shown in Figure 5.

The topology of the raster model is rather simple compared with that of the vector model, as shown in Figure 5. The topology of line objects is given by a sequence of pixels forming the line segments. The topology of an area object is usually given by a "Run Length" structure, which records the start line no., (start pixel no., number of pixels), the second line no., (start pixel no., number of pixels), and so on.
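The "Run Length" structure can be sketched in a few lines of Python. This is a minimal illustration under assumed conventions (zero-based line and pixel numbers, one list of runs per line); the function names are hypothetical. Each run stores a class value, the start pixel, and the number of pixels, and an area object is then the list of (line no., start pixel no., number of pixels) entries for one class.

```python
from itertools import groupby


def encode_rows(grid):
    """Encode each line as runs of (class value, start pixel, number of pixels)."""
    encoded = []
    for row in grid:
        runs, start = [], 0
        for value, group in groupby(row):
            length = len(list(group))
            runs.append((value, start, length))
            start += length
        encoded.append(runs)
    return encoded


def area_runs(encoded, cls):
    """Area object for one class: (line no., start pixel no., number of pixels)."""
    return [(line, start, length)
            for line, runs in enumerate(encoded)
            for value, start, length in runs
            if value == cls]


def decode_rows(encoded):
    """Rebuild the full raster from the runs."""
    return [[value for value, _, length in runs for _ in range(length)]
            for runs in encoded]


grid = [["A", "A", "B", "B"],
        ["A", "B", "B", "B"],
        ["C", "C", "C", "B"]]

encoded = encode_rows(grid)
print(encoded[0])
print(area_runs(encoded, "B"))
```

Decoding the runs reproduces the original raster, so no information is lost; the saving depends on how long the homogeneous runs are.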
2.2.2 Topology of Raster Data

Figure 5. Geometry and Topology of Raster Data

One of the weak points of the raster model is the difficulty of network and spatial analysis compared with the vector model. For example, although a line is easily identified as the group of pixels which form it, tracing the sequence of connected pixels as a chain is somewhat more difficult. In the case of polygons in the raster model, each polygon is easily identified, but the boundaries and the nodes (where at least three polygons meet) must be traced or detected.

a. Flow Directions
A line with direction can be represented using four directions, called the Rook's move in chess, or eight directions, called the Queen's move, as shown in Figure 6 (a), (b), (c). Figure 6 (c) shows an example of flow directions using the Queen's move. Water flow, links of a network, roads, etc. can be represented by flow directions (also called the Freeman chain code).

b. Boundary
A boundary is defined as a 2 x 2 pixel window that contains two different classes, as shown in Figure 7 (a). If a window is traced in the direction shown in Figure 7 (a), the boundary can be identified.

c. Node
A node in the polygon model can be defined as a 2 x 2 window that contains three or more different classes, as shown in Figure 7 (b). Figure 7 (c) and (d) show an example of identifying pixels on a boundary and at a node.

Figure 6. Flow Directions
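The flow-direction and 2 x 2 window ideas of Section 2.2.2 can be sketched as follows. Two assumptions are made that the text does not fix: the eight Queen's-move directions are numbered 0 to 7 counter-clockwise starting from east (a common Freeman chain code convention, which may differ from the numbering in Figure 6), and a window is treated as a node when it contains three or more classes. Cells are addressed as (line, pixel) with line numbers increasing downward.

```python
# Queen's move: (line step, pixel step) for codes 0..7,
# counter-clockwise from east (assumed numbering).
QUEEN = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def chain_code(path):
    """Encode a path of adjacent (line, pixel) cells as direction codes."""
    return [QUEEN.index((l1 - l0, p1 - p0))
            for (l0, p0), (l1, p1) in zip(path, path[1:])]


def follow(start, codes):
    """Rebuild the path from a start cell and its direction codes."""
    path = [start]
    for code in codes:
        dl, dp = QUEEN[code]
        line, pixel = path[-1]
        path.append((line + dl, pixel + dp))
    return path


def classify_windows(grid):
    """Label every 2 x 2 window as 'interior', 'boundary' or 'node'."""
    labels = {}
    for i in range(len(grid) - 1):
        for j in range(len(grid[0]) - 1):
            classes = {grid[i][j], grid[i][j + 1],
                       grid[i + 1][j], grid[i + 1][j + 1]}
            if len(classes) >= 3:
                labels[(i, j)] = "node"
            elif len(classes) == 2:
                labels[(i, j)] = "boundary"
            else:
                labels[(i, j)] = "interior"
    return labels


stream = [(2, 0), (1, 1), (1, 2), (2, 3)]
codes = chain_code(stream)
print(codes)

polygons = [[1, 1, 2],
            [1, 1, 2],
            [3, 3, 2]]
print(classify_windows(polygons))
```

Windows are keyed by the (line, pixel) position of their upper-left cell. In the example, the window at (1, 1) touches classes 1, 2, and 3 and is therefore reported as a node.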
Figure 7. Identification of Boundary and Node

2.2.3 Advantages and Disadvantages of Raster Data Models

The advantages of the raster data model are:
1. Simple data structures.
2. Location-specific manipulation of attribute data is easy.
3. Many kinds of spatial analysis and filtering may be used.
4. Mathematical modeling is easy because all spatial entities have a simple, regular shape.
5. The technology is cheap.
6. Many forms of data are available.

The disadvantages of the raster data model are:
1. Large data volumes.
2. Using large grid cells to reduce data volumes reduces spatial resolution, resulting in loss of information and an inability to recognize phenomenologically defined structures.
3. Crude raster maps are inelegant, though graphic elegance is becoming much less of a problem today.
4. Coordinate transformations are difficult and time consuming unless special algorithms and hardware are used, and even then may result in loss of information or distortion of grid cell shape.

2.3 Quadtree Data Model

Traditionally, the raster model is based on dividing the real world into equal-sized rectangular cells. However, in many cases it can be more practical to use a model with varying cell size. Larger cells (lower resolution) may be used to represent larger homogeneous areas, and smaller cells (higher resolution) may be used for more finely detailed areas. This approach, known as the quad-tree representation, is a refinement of the block code method.

In representing a given area with a regular raster, the aggregate amount of data involved is proportional to the square of the resolution (the number of cells per side). Because the quad-tree model is a very practical concept, it is preferable for the storage of both small and large volumes of data.

The quad-tree paradigm divides a geographical area into square cells of sizes varying from relatively large to that of the smallest cell of the raster. Usually, the squares are then quartered into four smaller squares. The quartering may be continued to a suitable level until a square is found to be so homogeneous that it no longer needs to be divided, and the data on it can be stored as a unit. A larger square may therefore comprise several raster cells having the same values. However, homogeneous areas that are not square or do not coincide with the pattern of squares employed may be further divided into homogeneous squares.
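The quartering procedure described above can be sketched recursively. The sketch assumes a square raster whose side is a power of two, stores a homogeneous square as a single value (a leaf), and stores a divided square as a list of its four quarters in the order NW, NE, SW, SE; the function names and this ordering are illustrative choices, not part of the text.

```python
def build_quadtree(grid, row=0, col=0, size=None):
    """Return a leaf value for a homogeneous square, else a list of 4 subtrees."""
    if size is None:
        size = len(grid)
    first = grid[row][col]
    homogeneous = all(grid[r][c] == first
                      for r in range(row, row + size)
                      for c in range(col, col + size))
    if homogeneous:
        return first          # stored as one unit, no further division
    half = size // 2
    return [build_quadtree(grid, row, col, half),                 # NW
            build_quadtree(grid, row, col + half, half),          # NE
            build_quadtree(grid, row + half, col, half),          # SW
            build_quadtree(grid, row + half, col + half, half)]   # SE


def count_leaves(node):
    """Number of stored squares (leaves) in the tree."""
    if not isinstance(node, list):
        return 1
    return sum(count_leaves(child) for child in node)


grid = [[1, 1, 2, 2],
        [1, 1, 2, 2],
        [1, 1, 3, 4],
        [1, 1, 4, 4]]

tree = build_quadtree(grid)
print(tree)
print(count_leaves(tree), "leaves instead of", len(grid) * len(grid[0]), "cells")
```

Only the mixed south-east quarter is divided down to single cells; the other three quarters are each stored as one unit.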
The structure of the quadtree resembles an inverted tree, whose leaves are pointers to the attributes of homogeneous squares and whose branch forks are pointers to smaller squares, hence the name quad-tree (Figure 8).

2.3.1 Advantages and Disadvantages of Quadtree Data Models

The advantages of the quad-tree model are:
1. Rapid data manipulation, because homogeneous areas are not divided into the smallest cells used.
2. Rapid search, because larger homogeneous areas are located higher up in the pointer structure.
3. Compact storage, because homogeneous squares are stored as units.
4. Efficient storage structure for certain operations, including searching for neighboring squares or for a square containing a specific point.
The disadvantages of the quad-tree model are:
1. Establishing the structure requires considerable processing time.
2. Protracted processing may prolong alterations and updating.
3. Data entered must be relatively homogeneous.
4. Complex data may require more storage capacity than ordinary raster storage.

Figure 8. Quadtree data model

3. Advanced Data Models in GIS

In GIS, continuous surfaces such as terrain, meteorological observations (rainfall, temperature, pressure, etc.), population density, and so on should also be modeled. Because sampling points are observed at discrete intervals, a surface model that represents the three-dimensional shape, z = f(x, y), should be built to allow the interpolation of values at arbitrary points of interest. Usually the following four types of sampling point structure are used to build a digital elevation model (DEM).

3.1 Grid Model

A systematic grid, or raster, of spot heights at fixed spacing is often used to describe terrain. Elevation is assumed constant within each cell of the grid; that is, the area represented by each cell is shown as a flat area in the model. Thus, small cells detail terrain more accurately than large cells. The size of cells is constant in a model, so areas with a greater variation of terrain may be described less accurately than those with less variation.

The grid model is most suitable for describing random variations in the terrain, whereas systematic linear structures can easily disappear or be deformed. A possible solution is to store the data as individual points and generate grids of varying density as required.

It is debatable whether the grid model represents samples on a grid, and can therefore be called a point model, or represents an average across raster cells. In the United States the former seems to be the most usual. Elevation values
are stored in a matrix, and the contiguity between points is thus expressed through the column and line numbers.

Different interpolation techniques are used to generate an elevation grid from source data such as points, contour lines, and break lines. In interpolating elevation values for the cells, it is usual to assume that points located at a greater distance have less influence than nearby points. The average of the elevations of the points closest to a grid point, within a given circle or square, can be assigned to that grid point with weights inversely proportional to the intervening distances. More advanced statistical methods can replace this kind of simple weighting in order to obtain the best possible model of the terrain based on the available data. When the data relate to profiles or contours, grid point elevations are interpolated in the same way from the elevations at the intersections of the original data lines and the lines of the grid.

3.2 TIN Model

An area model is an array of triangular areas with their corners stationed at selected points of most importance, for which the elevations are known. The inclination of the terrain is assumed to be constant within each triangle. The areas of the triangles may vary, with the smallest representing those areas in which the terrain varies most. The resulting model is called the triangulated irregular network (TIN). Insofar as possible, small equilateral triangles are preferable.

To construct a TIN, all measured points are built into the model, and the model thus represents lines of fracture, single points, and random variations in the terrain. The triangles are established by triangulation in such a way that no other points are located within each triangle's circumscribed circle. In the TIN model, the x-y-z coordinates of all points, as well as the triangle attributes of inclination and direction, are stored.
The triangles are stored in a topological vector data storage structure comprising polygons and nodes, thereby preserving the triangles' contiguity, which eases the calculation of z values for new points.

3.3 Contour lines

Interpolation based on proportional distance between adjacent contours is used. TIN is also used.

3.4 Profile

Profiles are observed perpendicular to an alignment or a curve, such as a highway. If the alignment is a straight line, grid points are interpolated. If the alignment is a curve, a TIN is generated. Figure 9 shows different types of DEMs.
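The inverse-distance weighting described for the grid model in Section 3.1 can be sketched as below. It is a minimal illustration, not a production interpolator: the function names are hypothetical, the search area is assumed to be a circle of a given radius, and the weight is 1 / distance**power with power = 1 to follow the wording of the text (a power of 2 is also widely used).

```python
from math import hypot


def idw(x, y, samples, radius, power=1.0):
    """Inverse-distance-weighted elevation at (x, y) from (sx, sy, sz) samples.

    Only samples within `radius` are used; returns None if there are none.
    """
    weighted_sum = weight_total = 0.0
    for sx, sy, sz in samples:
        distance = hypot(sx - x, sy - y)
        if distance == 0:
            return sz                      # exactly on a sample point
        if distance <= radius:
            weight = 1.0 / distance ** power
            weighted_sum += weight * sz
            weight_total += weight
    return weighted_sum / weight_total if weight_total else None


def interpolate_grid(samples, x0, y0, cell, n_rows, n_cols, radius):
    """Elevation matrix for grid points starting at (x0, y0) with spacing `cell`."""
    return [[idw(x0 + c * cell, y0 + r * cell, samples, radius)
             for c in range(n_cols)]
            for r in range(n_rows)]


samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
print(idw(0.5, 0.0, samples, radius=5.0))
print(interpolate_grid(samples, 0.0, 0.0, 1.0, 1, 3, radius=5.0))
```

A grid point that has no samples inside the search circle is left empty (None) rather than being assigned a guessed elevation.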
Figure 9. Different types of DEMs
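Because the inclination is assumed constant within each TIN triangle (Section 3.2), each triangle defines a plane, and the z value of a new point follows from that plane. The sketch below fits the plane through three corner points and derives an elevation, the inclination, and the direction of steepest descent. The function names are hypothetical, and the direction is expressed as a compass bearing with x pointing east and y pointing north, which is an assumption rather than something the text specifies.

```python
from math import atan, atan2, degrees, hypot


def plane_gradient(triangle):
    """Return (dz/dx, dz/dy) of the plane through three (x, y, z) corners."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = triangle
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)
    dzdx = ((z2 - z1) * (y3 - y1) - (z3 - z1) * (y2 - y1)) / det
    dzdy = ((x2 - x1) * (z3 - z1) - (x3 - x1) * (z2 - z1)) / det
    return dzdx, dzdy


def interpolate_z(triangle, x, y):
    """Elevation of the triangle's plane at (x, y)."""
    x1, y1, z1 = triangle[0]
    dzdx, dzdy = plane_gradient(triangle)
    return z1 + dzdx * (x - x1) + dzdy * (y - y1)


def inclination_degrees(triangle):
    """Slope angle of the triangle, in degrees from horizontal."""
    return degrees(atan(hypot(*plane_gradient(triangle))))


def direction_degrees(triangle):
    """Bearing of steepest descent, clockwise from north (assumed convention)."""
    dzdx, dzdy = plane_gradient(triangle)
    return degrees(atan2(-dzdx, -dzdy)) % 360.0


# A triangle lying on the plane z = x: it rises toward the east.
triangle = [(0.0, 0.0, 0.0), (10.0, 0.0, 10.0), (0.0, 10.0, 0.0)]

print(interpolate_z(triangle, 2.0, 3.0))
print(inclination_degrees(triangle), direction_degrees(triangle))
```

The caller is expected to pass a point that lies inside the triangle; locating the containing triangle is a separate step that relies on the stored triangle topology.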