Kriging in a Parallel Environment


Jason Morrison (Ph.D. Candidate, School of Computer Science, Carleton University, ON, K1S 5B6, Canada; morrison@scs.carleton.ca)

Introduction

In spatial data modelling and analysis there are a variety of techniques for performing prediction. The goal of these techniques is to take spatially located data and produce estimates of data values at unknown locations. Among these techniques, the attractive properties of kriging are often overshadowed by the slow speed of the calculation: the computations required for simple, ordinary and universal kriging all have a high computational complexity. Even if algorithms achieving the theoretical minimum complexity become available, kriging will still be too slow to be an interactive process. Faster paradigms are required to advance the use of this technique in modern interactive data analysis. As part of the Parallel and Distributed Geomatics Project, the goal of this work is to provide insight into how kriging can benefit from parallelism.

Problem Definition

From the previous work of Kerry and Hawick [4] it is known that parallelism can be applied successfully to kriging. In that work the authors showed that it is possible to produce speedup using tightly coupled parallel machines (a CM-5 and a farm of Alpha workstations connected by an optical ATM switching network). Since tightly coupled machines are not always available and can be prohibitively expensive, we address a more practical question: is it possible to attain speedup for kriging on a general network of workstations (NOW)? We further constrain ourselves by demanding that the implementation be portable to different platforms and configurations.

Kriging is itself often used to describe an entire family of estimation techniques. In this research we restrict our attention to three types: Simple, Ordinary and Universal Kriging. This restriction is motivated by the fact that the mathematics involved in these three types of estimation is very similar. From here on we discuss only the Simple Kriging algorithm; the interested reader should consult [5] and [6] to see how our work extends to Ordinary and Universal Kriging.

Essentially, Simple Kriging (SK) is a mathematical technique that uses the known data points to calculate a data value at whatever point the user desires. In our algorithm SK is used repeatedly to generate data values at the points of a grid of size m. We assume that the size n of the data set is smaller than the number of output locations (i.e., n < m). This assumption seems reasonable given that it holds in the applications discussed in the literature [5].

Theory of Simple Kriging

Simple Kriging begins with the assumption that the original set of data is a partial realization of a random function denoted by Z. Each known data value Z(x) has an associated spatial location x. It is further assumed that Z has the property of second-order stationarity.

Second-order stationarity means that the first- and second-order statistical properties of Z are invariant under any translation; that is, Equations (1) and (2) must hold:

  E[Z(x)] = m                                              (1)
  E[(Z(x) - m)(Z(x+h) - m)] = Cov(x, x+h) = Cov(h)         (2)

where m is a scalar constant, h is a vector distance and Cov() is the covariance of the random function. The set of all data locations is X = {x_1, x_2, ..., x_n}. In SK it is the data analyst's task to define a covariance function Cov(x_i, x_j) = Cov(x_i - x_j) that matches the observed covariance in the data. This task is quite complex and involves statistical measurement of the data as well as knowledge of the data source and the data collection technique; more information can be found in [5] and [6].

The SK estimate of the data value at a point x_0 is denoted Z_est(x_0). Each estimate is defined by calculating a weight l(x_0, x_i) for each known data location x_i; the weights and their respective data values are then multiplied together and summed to form the estimate. The first task is to calculate the vector L(x_0) of length n that holds the weight l(x_0, x_i) in element L_i(x_0). The n-by-n matrix C and the length-n vector c have elements C_ij = Cov(x_i, x_j) and c_i = Cov(x_0, x_i). Equation (3) then yields the weights L(x_0), and the estimate follows from Equation (4) using the length-n vector Y with elements Y_i = Z(x_i) - m:

  L(x_0) = C^(-1) c                                        (3)
  Z_est(x_0) = Y L(x_0) + m = Y C^(-1) c + m = q c + m     (4)
  Z_err^2(x_0) = Cov(0) - c C^(-1) c                       (5)

where q = Y C^(-1). This makes the time for each estimation O(n): form c and take its dot product with q. Computing q itself is a single O(n^3) pre-computation, in which C is formed and inverted and the result multiplied with Y.

Finally, Equation (5) defines the squared standard error of the estimate at a point x_0. At O(n^2) per point this calculation is expensive but mandatory when analyzing data. Fortunately it adds nothing to the pre-computation step, as its requirements are identical to those of Z_est(x_0).

The estimation achieved by SK is commonly classified as a BLUE, or Best Linear Unbiased Estimate. It is linear because it is a linear combination of known values, unbiased because the average estimation error is zero, and best because the square of the errors is minimized (see [5] or [6] for the derivations).
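To make Equations (1)-(5) concrete, here is a minimal NumPy sketch of the two-phase SK computation. Everything in it is illustrative: the exponential covariance model, the toy data and all names are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def simple_kriging(X, Z, m, cov, X0):
    """Simple kriging estimates and squared errors at the query points X0.

    X  : (n, d) known locations         Z   : (n,) known values
    m  : the constant mean of eq. (1)   cov : isotropic covariance Cov(h)
    X0 : (m0, d) estimation locations
    """
    # Phase one: the O(n^3) pre-computation of C^(-1) and q.
    H = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    C = cov(H)                       # C_ij = Cov(x_i - x_j)
    Y = Z - m                        # Y_i = Z(x_i) - m
    q = np.linalg.solve(C, Y)        # C is symmetric, so q c == Y C^(-1) c
    C_inv = np.linalg.inv(C)         # kept for the error term of eq. (5)

    # Phase two: O(n) per estimate, O(n^2) per squared error.
    est = np.empty(len(X0))
    err2 = np.empty(len(X0))
    for k, x0 in enumerate(X0):
        c = cov(np.linalg.norm(X - x0, axis=1))  # c_i = Cov(x_0 - x_i)
        est[k] = c @ q + m                       # eq. (4)
        err2[k] = cov(0.0) - c @ C_inv @ c       # eq. (5)
    return est, err2

# Toy run with an assumed exponential covariance model.
cov = lambda h: np.exp(-np.asarray(h) / 10.0)
X = np.random.rand(50, 2) * 100.0
Z = np.random.rand(50)
est, err2 = simple_kriging(X, Z, Z.mean(), cov, np.random.rand(200, 2) * 100.0)
```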

Algorithms

There are three versions of our algorithm: the first is sequential, the second and third are parallel. The sequential version follows the basic steps described in the theory and operates in two phases. In phase one the data points, their locations and Cov() are used to create Y, C^(-1) and q. In phase two each of the estimation locations on the grid is computed and the appropriate Z_est() and Z_err^2() values are calculated.

Sequential I (and Parallel II)

1) Calculate Y, C^(-1) and q.
2) Compute all estimation locations on the output grid.
3) For each location calculate c, Z_est() and Z_err^2().

The first parallel version divides the calculations for the output grid between the processors, runs the sequential algorithm on each processor and then returns the output to a single processor. This implies that the input, the output and the intermediate structures Y, C^(-1) and q must fit on every processor, which establishes a requirement of more than n^2 + (4+d)n storage per processor plus the space required to perform the matrix inversion (d is the dimension of each location). We further make the assumption, commonly made in CGM algorithms [3], that n > p^2 and hence m > p^2, where p is the number of processors. We do not assume the data is already copied to every processor; rather, distributing it is the initial step of the algorithm.

Parallel I

1) Distribute the data to all processors.
2) Distribute the grid information to all processors.
3) Each processor calculates its portion of the grid.
4) Each processor runs the sequential program on its sub-grid.
5) Collect the answers on the original processor.

The important stage of this program is calculating the sub-grid for each processor. This is done by applying a technique used to divide raster data between multiple processors [7]; a code sketch of the rule appears at the end of this section. Suppose, without loss of generality, that the output grid has at least as many rows as columns. The processors are then numbered in the arrangement of a single column. If the number of processors divides evenly into the number of rows, each processor is assigned an equal number of rows and the problem is solved. If instead the division has a remainder r, the first r processors are each assigned one extra row. This strategy minimizes the difference in row counts while still assigning a contiguous block of the grid to each processor. It is also guaranteed to produce a non-empty sub-grid for every processor: since m > p^2 and the number of rows m_row is at least the number of columns m_col, it follows that p < m_row.

The work of Kerry and Hawick [4] concentrates on a web-based interface to their parallel kriging implementation, but it also provides the inspiration for a second form of parallel kriging. They mention that "at present High Performance Fortran cannot express the high degree of parallelism obtained with message passing implementations of ScaLAPACK" (page 5). The high degree of parallelism they refer to is the basis of fine-grained parallelism and of the ScaLAPACK library. Instead of dividing up the output of the program, fine-grained parallelism performs the algorithm's computations in the same order as the sequential algorithm; the emphasis is placed on making each matrix and vector computation itself work in parallel. Dividing each matrix into blocks and distributing the blocks across the processors allows each processor to work on a portion of the overall computation. Details of each operation are beyond the scope of this abstract; the reader is referred to [2] for more information.
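The row-assignment rule referenced above is small enough to state directly in code. This is a sketch of my own, assuming 0-based processor ranks; the function name and interface are illustrative, not taken from the implementation.

```python
def assign_rows(m_row, p, rank):
    """Half-open range [start, end) of grid rows owned by processor `rank`.

    Each of the p processors receives floor(m_row / p) rows; the first
    r = m_row mod p processors receive one extra row, so the row counts
    of any two processors differ by at most one.
    """
    base, r = divmod(m_row, p)
    start = rank * base + min(rank, r)
    end = start + base + (1 if rank < r else 0)
    return start, end

# Example: 10 rows over 4 processors -> [(0, 3), (3, 6), (6, 8), (8, 10)]
print([assign_rows(10, 4, k) for k in range(4)])
```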

Implementation

In the spirit of providing a usable implementation on a variety of platforms, we used a collection of standard, freely available libraries: the LAM/MPI message-passing library, the BLAS basic linear algebra subroutines, the LAPACK linear algebra package, the BLACS basic linear algebra communication subprograms, the ATLAS library of optimized BLAS routines and the ScaLAPACK scalable LAPACK library. With the exception of LAM/MPI, all of the above can be found at a web site currently maintained by the University of Tennessee, Knoxville.

The main implementation issues in the sequential program are the efficiency, stability and precision requirements of the matrix inversion and multiplication; the LAPACK, BLAS and ATLAS libraries were used to meet them. The most important aspect of the sequential algorithm is the inversion of the matrix C and the subsequent calculation of q. First, q was obtained by solving the system of equations Cq = Y. The LAPACK function dsysvx() solves such symmetric systems by factoring C and applying iterative refinement, which guarantees precision and convergence to the solution (see [1] for details). The matrix C was then inverted by the LAPACK function dsytri(), reusing the factorization obtained while solving for q; details can likewise be found in [1]. The remainder of the algorithm follows directly, and BLAS functionality was used for all remaining operations.

The only remaining concern in the first parallel implementation is the distribution of the data and the collection of the output. This was performed with the BLACS library, which is optimized for communicating vectors and matrices between processors. On our system BLACS is essentially a front end for MPI, but it was chosen for its common use in the numerical analysis community and its consequent availability on many types of high-performance computers.

The second parallel algorithm was implemented by taking the sequential algorithm and changing its LAPACK and BLAS calls to the equivalent ScaLAPACK and PBLAS parallel calls. The distribution of the matrices also had to be handled through BLACS, but the remainder of the code stays the same. While this program is currently functional, it is clear that our implementation is restricted by the hardware we are currently using. We reserve our final conclusions until the program has been ported to a tightly coupled machine (e.g., IBM SP2, SGI Origin 2000, or Cray T3E).
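For orientation, the five steps of Parallel I can be sketched with mpi4py standing in for the BLACS layer used by the real implementation. This reuses the hypothetical simple_kriging and assign_rows helpers from the earlier sketches, together with toy data, so it is an outline under those assumptions rather than the authors' code.

```python
import numpy as np
from mpi4py import MPI   # message passing; BLACS plays this role in the paper

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()
cov = lambda h: np.exp(-np.asarray(h) / 10.0)   # assumed covariance model

# Steps 1-2: the original processor creates the inputs and broadcasts them.
if rank == 0:
    X = np.random.rand(200, 2) * 100.0          # known locations (toy data)
    Z = np.random.rand(200)                     # known values
    ticks = np.arange(50, dtype=float)
    grid = np.stack(np.meshgrid(ticks, ticks), axis=-1)  # (50, 50, 2) grid
else:
    X = Z = grid = None
X, Z, grid = comm.bcast((X, Z, grid), root=0)

# Step 3: each processor determines its own block of grid rows.
start, end = assign_rows(grid.shape[0], p, rank)

# Step 4: each processor runs the sequential SK algorithm on its sub-grid.
pts = grid[start:end].reshape(-1, 2)
local_est, local_err2 = simple_kriging(X, Z, Z.mean(), cov, pts)

# Step 5: collect the partial answers on the original processor.
parts = comm.gather((start, local_est, local_err2), root=0)
if rank == 0:
    est = np.concatenate([e for _, e, _ in sorted(parts, key=lambda t: t[0])])
```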

Results and Discussion

At the time of writing, testing of the algorithms has yielded very interesting results. It is clear that our first parallel version is a success and can achieve very high efficiency. For our testing we used the sample data sets presented in the appendix of [5]. The tests were conducted on a cluster of 450 MHz Pentium III machines running Sun Solaris 2.7 and connected by a 100 Mb/s hub. This strictly non-parallel communication device is limiting and simulates roughly average conditions in NOWs. The high speedup and parallel efficiency are especially important given the high computational speed relative to the medium communication speed. A small sample of efficiencies appears in Table (1); detailed results will be presented at the conference.

Table (1) -- Parallel speed-up and efficiency in Parallel Algorithm I (n = 200, m = 25,000)

  # of procs   Speed-Up   Efficiency
      ...         ...       96.8%
      ...         ...       95.6%
      ...         ...       94.5%
      ...         ...       91.7%
      ...         ...       93.4%

Conclusions

Previous work concentrated on programming for tightly coupled computer systems. This work shows that such a system is not necessary to achieve high performance: high levels of performance are attainable through less expensive, off-the-shelf components. We also show that this performance is achievable using easily accessed libraries, and that high efficiency is obtained by working in parallel.

References

[1] E. Anderson et al., LAPACK Users' Guide, 3rd ed. SIAM, Philadelphia.
[2] L. S. Blackford et al., ScaLAPACK Users' Guide. SIAM, Philadelphia.
[3] F. Dehne, A. Fabri and A. Rau-Chaplin, "Scalable Parallel Geometric Algorithms for Coarse Grained Multicomputers," in Proc. 9th Ann. ACM Symp. on Computational Geometry.
[4] K. Kerry and K. Hawick, "Kriging Interpolation on High Performance Computers," in Proc. High Performance Computing and Networks Europe, LNCS 1401, Springer-Verlag.
[5] R. Olea, Geostatistics for Engineers and Earth Scientists. Kluwer Academic Publishers, Boston.
[6] B. Ripley, Spatial Statistics. Wiley Series in Probability and Mathematical Statistics, John Wiley and Sons, Toronto.
[7] D. Roytenberg, "Developing Parallel GIS Applications on the ALEX AVX II Computer," Master's thesis, School of Computer Science, Carleton University.

Acknowledgements

A special thank-you to Hossam Khalil for his work implementing the applications described. This work was funded by OGS and GEOIDE.
