Thiago L. Gomes, Salles V. G. Magalhães, Marcus V. A. Andrade, Guilherme C. Pena
Universidade Federal de Viçosa (UFV)
The availability of high resolution terrain data has become a challenge in GIS. On one hand, we have high quality data. On the other hand, the algorithms to process these data require high processing power and memory.
When this volume of data does not fit in internal memory, it needs to be processed externally (mainly on disk). The time to access data on disk is much higher than the time to access internal memory. Thus, the algorithms must be designed to optimize I/O operations, not only CPU processing.
Consider two algorithms to access a huge matrix M with n x n cells stored in external memory:

Alg. 1:
    for (i=1; i <= n; i++)
        for (j=1; j <= n; j++)
            M[i,j] = 0;

Alg. 2:
    for (j=1; j <= n; j++)
        for (i=1; i <= n; i++)
            M[i,j] = 0;

Based on CPU instructions, both algorithms are Θ(n^2).
But, considering I/O operations, if the block size B is smaller than a matrix row:
Algorithm 1 executes Θ(n^2/B) I/O operations.
Algorithm 2 executes Θ(n^2) I/O operations.

On a machine where a disk block contains 10000 cells and the time to read a block is 10 milliseconds (9 for seek and 1 for read), the time to access a matrix with 50000^2 cells is:
Algorithm 1: 4 minutes
Algorithm 2: 10 months
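The two running times quoted above can be checked with a short calculation. The sketch below is illustrative only. It assumes that Algorithm 1 reads its blocks sequentially and so pays about 1 ms per block with no seek, while Algorithm 2 pays the full 10 ms (seek plus read) for every cell access. The source does not spell out this model, and the function name and parameters are hypothetical.

```python
def io_time_seconds(n, cells_per_block, seek_ms, read_ms):
    """Estimated disk time for Algorithms 1 and 2 on an n x n matrix.

    Assumptions (not stated in the source): Algorithm 1 reads blocks
    sequentially and pays no seek; Algorithm 2 loads a block, with a
    seek, for every single cell access.
    """
    cells = n * n
    blocks = cells // cells_per_block
    alg1 = blocks * read_ms / 1000.0
    alg2 = cells * (seek_ms + read_ms) / 1000.0
    return alg1, alg2


alg1_s, alg2_s = io_time_seconds(n=50000, cells_per_block=10000, seek_ms=9, read_ms=1)
print(f"Algorithm 1: {alg1_s / 60:.1f} minutes")               # prints 4.2 minutes
print(f"Algorithm 2: {alg2_s / 86400 / 30:.1f} months")        # prints 9.6 months (30-day months)
```

Under these assumptions the estimate agrees with the rounded figures of 4 minutes and 10 months.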
An important application in GIS is the drainage network computation. Applications: environmental planning, watershed analysis, studies of sediment flow, dam planning.
We will work with terrains represented by digital elevation matrices. Objective: to compute the overland flow direction and flow accumulation matrices.
Example:

DEM:
71 72 67
68 62 65
63 61 58

[Figure: 3D view of the DEM and the flow direction of each cell]

Flow accumulation:
1 1 1
1 5 1
1 2 9

Threshold = 4: the drainage network is composed of all cells with flow accumulation ≥ 4.
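The 3 x 3 example above can be reproduced with a short script. This is a sketch under an assumption the slides do not state: it uses the common D8 rule, where each cell drains to the neighbor with the steepest downward slope and diagonal drops are divided by sqrt(2). With that rule the script yields the flow accumulation values 1 1 1 / 1 5 1 / 1 2 9 shown in the example. All function names are hypothetical.

```python
import math

# 3 x 3 DEM from the example.
DEM = [[71, 72, 67],
       [68, 62, 65],
       [63, 61, 58]]

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def flow_directions(dem):
    """Map each cell to its steepest-descent neighbor (None for a sink)."""
    rows, cols = len(dem), len(dem[0])
    direction = {}
    for r in range(rows):
        for c in range(cols):
            best, best_slope = None, 0.0
            for dr, dc in NEIGHBORS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    # Assumed D8 rule: drop divided by the distance to the neighbor.
                    slope = (dem[r][c] - dem[nr][nc]) / math.hypot(dr, dc)
                    if slope > best_slope:
                        best, best_slope = (nr, nc), slope
            direction[(r, c)] = best
    return direction


def flow_accumulation(dem, direction):
    """Each cell counts itself plus every cell that drains through it."""
    rows, cols = len(dem), len(dem[0])
    accum = [[1] * cols for _ in range(rows)]
    # Visiting cells from highest to lowest handles upstream cells first.
    order = sorted(direction, key=lambda rc: dem[rc[0]][rc[1]], reverse=True)
    for cell in order:
        target = direction[cell]
        if target is not None:
            accum[target[0]][target[1]] += accum[cell[0]][cell[1]]
    return accum


direction = flow_directions(DEM)
accum = flow_accumulation(DEM, direction)
# Cells whose accumulation reaches the threshold of 4 form the drainage network.
network = [[value >= 4 for value in row] for row in accum]
for row in accum:
    print(row)
```

In this small terrain only the cells with elevations 62 and 58 reach the threshold, so they form the drainage network.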
In some cases, it is not possible to determine the flow direction of a cell directly:

Local minimum (depression):
71 72 67 71 72
68 62 65 68 62
63 61 58 63 61
68 62 65 68 62
71 72 67 71 72

Flat area:
71 72 67 71 72
67 68 68 68 62
63 68 68 68 61
68 62 65 68 62
71 72 67 71 72

In general, these two cases are treated by a very time-consuming preprocessing step.
A depression is removed by filling it; that is, its elevation is raised to the elevation of its lowest neighbor. The flow direction in flat areas is oriented toward the lowest neighbor cell. However, in general, this preprocessing step takes more than 50% of the total running time.
To avoid this time-consuming preprocessing step, we developed the RWFlood method, which is very efficient when the whole terrain fits in internal memory.
The basic idea of RWFlood is to suppose that the terrain is being flooded by water coming from outside and entering the terrain through its boundary. The course of the water entering the terrain is the same as the course of rain water flowing downhill (that is, the flow direction), traversed in the opposite sense.
In other words, the idea is to suppose the terrain is surrounded by water (as an island) and to simulate the flooding process by raising the water level.
Initially, the water level is set to the elevation of the lowest cell on the terrain boundary. Then, two actions are executed iteratively: flooding a cell, and raising the water level.
Flooding a cell c:
For all cells d neighboring c:
    if the elevation of d is smaller than the elevation of c, then d is raised to the elevation of c;
    the flow direction of d is set to the cell c.
Raising the water level: after flooding all cells at the current water level H, the water is raised to the elevation of the lowest cell higher than H.
These cells are processed as before, and the water level is raised to the next level.
Now, the cell to be processed has some neighbor cells whose elevation is smaller than the water level (a depression).
The depression is filled. This algorithm could be implemented using a stable priority queue. But, for performance, we use an array of queues.
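The flooding process described above can be sketched as follows. This is an illustrative reading of the slides, not the authors' implementation, and it rests on several assumptions: elevations are integers (so one queue per elevation level can be stored in an array), every boundary cell is queued at the start with its flow pointing off the terrain, and the flow direction is set for every not yet visited neighbor, whether or not it has to be raised. The function name `rwflood` and the test terrain are hypothetical.

```python
from collections import deque

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def rwflood(dem):
    """Return (filled elevations, flow direction of each cell)."""
    rows, cols = len(dem), len(dem[0])
    elev = [row[:] for row in dem]          # working copy that gets filled
    low = min(min(row) for row in dem)
    high = max(max(row) for row in dem)
    # Array of queues: one FIFO queue per integer elevation level.
    queues = [deque() for _ in range(high - low + 1)]
    direction = {}                          # also serves as the "visited" set

    # Assumption: water enters through every boundary cell.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                direction[(r, c)] = None    # drains off the terrain
                queues[elev[r][c] - low].append((r, c))

    # Raise the water level one step at a time.
    for level in range(low, high + 1):
        queue = queues[level - low]
        while queue:
            r, c = queue.popleft()          # flood cell c at the current level
            for dr, dc in NEIGHBORS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in direction:
                    direction[(nr, nc)] = (r, c)
                    if elev[nr][nc] < level:
                        elev[nr][nc] = level    # fill the depression
                    queues[elev[nr][nc] - low].append((nr, nc))
    return elev, direction


# 5 x 5 terrain with a central depression (elevation 58).
DEPRESSION = [[71, 72, 67, 71, 72],
              [68, 62, 65, 68, 62],
              [63, 61, 58, 63, 61],
              [68, 62, 65, 68, 62],
              [71, 72, 67, 71, 72]]

filled, direction = rwflood(DEPRESSION)
for row in filled:
    print(row)
```

Each cell enters a queue exactly once and is examined a constant number of times, which is why an array of queues avoids the logarithmic cost of a general priority queue when the number of distinct elevation levels is modest.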
RWFlood was implemented to flood the terrain and to compute the flow direction in O(N) time. The flow accumulation can be easily computed (in linear time) using an algorithm based on topological sorting. However, RWFlood does not scale well to huge terrains that require external memory processing. Thus, the idea of this work (the EMFlow method) is to adapt RWFlood for external processing.
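A linear-time flow accumulation based on topological sorting can be sketched as below. This is a generic version using Kahn's algorithm, not necessarily the exact variant the authors used. The flow directions of the 3 x 3 example are hard-coded here as an assumption consistent with the accumulation values shown earlier, and the function name is hypothetical.

```python
from collections import deque


def accumulate_flow(direction):
    """direction maps each cell to its downstream cell (None for an outlet)."""
    # Count how many cells drain directly into each cell.
    indegree = {cell: 0 for cell in direction}
    for target in direction.values():
        if target is not None:
            indegree[target] += 1

    accum = {cell: 1 for cell in direction}   # every cell counts itself
    ready = deque(cell for cell, degree in indegree.items() if degree == 0)
    while ready:
        cell = ready.popleft()
        target = direction[cell]
        if target is not None:
            accum[target] += accum[cell]
            indegree[target] -= 1
            if indegree[target] == 0:         # all upstream cells are done
                ready.append(target)
    return accum


# Assumed flow directions for the 3 x 3 example DEM.
direction = {
    (0, 0): (1, 1), (0, 1): (1, 1), (0, 2): (1, 1),
    (1, 0): (1, 1), (1, 1): (2, 2), (1, 2): (2, 2),
    (2, 0): (2, 1), (2, 1): (2, 2), (2, 2): None,
}
accum = accumulate_flow(direction)
print(accum[(1, 1)], accum[(2, 1)], accum[(2, 2)])   # prints 5 2 9
```

Every cell and every flow link is handled once, so the running time is linear in the number of cells.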
Basically, RWFlood stores the cells on the boundary of the flooded regions, and these cells are processed based on their elevation, from the lowest to the highest.
When a cell is processed, it is necessary to access its neighbors. This means the terrain matrix is accessed nonsequentially, since cells that are neighbors in the two-dimensional matrix representation may not be close in memory. Thus, this process can be inefficient when the matrix is huge and is stored in external memory.
To reduce the number of disk accesses, we propose EMFlow, whose basic idea is to use a cache strategy to benefit from the spatial locality of reference present in the sequence of accesses.
Spatial locality of reference: a special library, named TiledMatrix, is used to subdivide the matrix into square blocks (of cells) that are stored sequentially in external memory.
The blocks are managed as in a cache memory. When a cell needs to be accessed, the entire block containing this cell is loaded (and kept) in memory, so the next accesses to cells in this block can be done efficiently. When the internal memory is full, blocks are replaced using the LRU policy.
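The block cache described above can be illustrated with a small sketch. It is not the TiledMatrix API, which the slides do not show. The class name, method names and parameters are hypothetical, and a dictionary stands in for the external memory so that the example stays self-contained.

```python
from collections import OrderedDict


class TiledMatrixSketch:
    """Hypothetical matrix split into square tiles with an LRU tile cache."""

    def __init__(self, tile_size, max_tiles, fill=0):
        self.tile_size = tile_size
        self.max_tiles = max_tiles      # how many tiles fit in "internal memory"
        self.fill = fill
        self.disk = {}                  # stand-in for external memory
        self.cache = OrderedDict()      # least recently used tile comes first
        self.loads = 0                  # number of simulated tile loads

    def _tile(self, r, c):
        key = (r // self.tile_size, c // self.tile_size)
        if key in self.cache:
            self.cache.move_to_end(key)             # mark as most recently used
        else:
            if len(self.cache) >= self.max_tiles:
                old_key, old_tile = self.cache.popitem(last=False)
                self.disk[old_key] = old_tile       # write the evicted tile back
            self.loads += 1
            tile = self.disk.get(key)
            if tile is None:                        # tile never written before
                tile = [[self.fill] * self.tile_size for _ in range(self.tile_size)]
            self.cache[key] = tile
        return self.cache[key]

    def get(self, r, c):
        return self._tile(r, c)[r % self.tile_size][c % self.tile_size]

    def set(self, r, c, value):
        self._tile(r, c)[r % self.tile_size][c % self.tile_size] = value


m = TiledMatrixSketch(tile_size=4, max_tiles=2)
m.set(0, 0, 7)      # loads tile (0, 0)
m.get(0, 1)         # same tile: cache hit, no load
m.get(0, 5)         # loads tile (0, 1)
m.get(5, 5)         # loads tile (1, 1) and evicts tile (0, 0), the LRU tile
loads_before = m.loads
value = m.get(0, 0) # tile (0, 0) is loaded again from the backing store
print(loads_before, value, m.loads)   # prints 3 7 4
```

Accesses to neighboring cells usually fall in a tile that is already cached, which is the spatial locality the method relies on.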
In EMFlow, all matrices used in RWFlood are replaced by matrices managed by TiledMatrix. Supposing the blocks near the flooded region border fit in memory, the disk accesses are reduced.
EMFlow was implemented in C++. It was compared against TerraFlow and r.watershed.seg (both included in GRASS). The test machine was rebooted with 1GB and with 4GB of memory to consider different scenarios. Computer: Intel Core 2 Duo 2.8 GHz, Ubuntu Linux 11.04 64-bit, with a 5400 RPM SATA HD.
Terrains from NASA SRTM (30 meter). Processing times (s):

Terrain   EMFlow            TerraFlow         r.watershed.seg
width     1GB      4GB      1GB      4GB      1GB        4GB
1000      1        1        24       19       6          6
5000      14       15       661      401      625        617
10000     75       65       2330     2252     12636      8530
15000     326      154      7588     5870     > 100000   22276
20000     718      295      12937    13067    > 100000   41493
25000     2006     530      22221    19340    > 100000   77729
30000     2848     851      35408    30364    > 100000   > 100000
40000     5654     1827     67076    56421    > 100000   > 100000
50000     10649    2898     98222    82673    > 100000   > 100000
We developed a very fast and simple algorithm to compute the drainage network on huge terrains stored in external memory. On huge terrains it was about 30 times faster than TerraFlow with 4GB of memory, and roughly 10 times faster with 1GB.
Future work includes: A parameter that may affect the algorithm's efficiency is the TiledMatrix block size. In future work we intend to run more tests to evaluate this influence.
The algorithm needs to keep in memory the cells on the border of the flooded region. If this border is big and the TiledMatrix cache size is small, the cache may not be enough to store the blocks in this area. One way to avoid this is to identify islands during the flooding process and process each island separately. We intend to use this strategy to improve the method.
Acknowledgements
Drainage network on Tapajos basin computed by EMFlow
Drainage network on Tapajos basin computed by r.watershed
Drainage network on Tapajos basin computed by TerraFlow