University of Malaga. Image Template Matching on Distributed Memory and Vector Multiprocessors


Image Template Matching on Distributed Memory and Vector Multiprocessors

V. Blanco, M. Martin, D.B. Heras, O. Plata, F.F. Rivera

September 1995
Technical Report No: UMA-DAC-95/20

Published in: 5th Int'l. Conf. on Parallel Computing (ParCo 95), Gent, Belgium, September 19-22, 1995

University of Malaga
Department of Computer Architecture
C. Tecnologico, PO Box 44 E Malaga, Spain

V. Blanco, M. Martín, D.B. Heras, O. Plata and F.F. Rivera
Dept. Electrónica y Computación, Fac. Física, Univ. Santiago de Compostela
elvicente@usc.es, elfran@usc.es
6th December 1995

This work was supported in part by the CICYT under grant TIC C03-03 and Xunta de Galicia under grant XUGA20606B93. The authors wish to acknowledge the help offered by Fujitsu Labs Ltd. for the use of their systems.

1 Introduction

In this work we present a study of the trade-off between temporal and spatial parallelism in highly parallel algorithms. We have selected the image template matching algorithm as representative of the different computational structures that can be executed efficiently on both vector and distributed memory systems [2, 3, 4].

The computational core of the template matching that we consider is the computation of the cross-correlation coefficient. This coefficient is given in terms of the cross-correlation function as:

$$C(i,j) = \frac{\sum_{k=0}^{M_r-1} \sum_{l=0}^{M_c-1} P(k+i,\, l+j)\, T(k,l)}{\left\{ \sum_{k=0}^{M_r-1} \sum_{l=0}^{M_c-1} P^2(k+i,\, l+j) \right\}^{1/2}} \qquad (1)$$

where $P$ and $T$ are the image and the template, with $N_r \times N_c$ and $M_r \times M_c$ pixels respectively. Note that the calculation of this coefficient presents high spatial and temporal locality. The program has four nested loops corresponding to the rows and columns of the image and of the template. The loops associated with indexes $i$ and $j$ can be executed as a doall structure, and the ones associated with $k$ and $l$ constitute summations that can be executed as a typical efficient reduction structure.

We have executed this code on the Fujitsu AP1000 system as representative of distributed memory multiprocessors. The programming strategy we used is based on the SPMD paradigm exploiting data parallelism. It consists of determining the most adequate distribution of the data, dividing it into subspaces, one for each processor. Then a mapping must be defined that assigns computations to each of these subspaces and explicitly establishes the required communications [5]. In this way, the parallel code keeps the same general structure as its sequential counterpart, with the necessary routing statements added.

At this point, different transformations of the local code can be applied to exploit the vector capabilities of each node. To study them we have used the Fujitsu VP2400/10 vector computer.
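The loop structure described in the introduction can be made concrete with a minimal sequential sketch of Eq. (1). It is written in Python purely for illustration and is not the authors' code. The function name `cross_correlation`, the restriction to placements where the template lies fully inside the image, and the choice to return 0.0 when the denominator vanishes are all assumptions of this sketch.

```python
import math


def cross_correlation(P, T):
    """Cross-correlation coefficient C(i, j) of Eq. (1).

    P and T are lists of rows (image and template). Assumption of this
    sketch: only placements with the template fully inside the image
    are evaluated.
    """
    n_r, n_c = len(P), len(P[0])
    m_r, m_c = len(T), len(T[0])
    C = [[0.0] * (n_c - m_c + 1) for _ in range(n_r - m_r + 1)]
    for i in range(n_r - m_r + 1):          # doall loop over image rows
        for j in range(n_c - m_c + 1):      # doall loop over image columns
            num = 0.0                       # numerator of Eq. (1)
            energy = 0.0                    # sum of P^2 under the template
            for k in range(m_r):            # reduction over template rows
                for l in range(m_c):        # reduction over template columns
                    p = P[k + i][l + j]
                    num += p * T[k][l]
                    energy += p * p
            # Assumption: an all-zero window yields 0.0 instead of dividing by zero.
            C[i][j] = num / math.sqrt(energy) if energy > 0 else 0.0
    return C
```

The two outer loops carry no dependence between iterations, which is why they can run as a doall structure, while the two inner loops only accumulate into the scalars `num` and `energy`.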

[Figure 1: Access scheme of template and image. The drawing shows the $N_r \times N_c$ image split over a 5 x 5 grid of nodes labelled (0,0) to (4,4), a local image block of $n_r \times n_c$ pixels, the $M_r \times M_c$ template, and a shaded zone of $(M_r + n_r - 1) \times (M_c + n_c - 1)$ pixels.]

2 Exploiting spatial parallelism

The parallel implementation of the cross-correlation coefficient is not direct, because the bounds of the summations over indexes $k$ and $l$ depend on the indexes $i$ and $j$ of the external loops. The most efficient strategy to distribute the data is based on replicating the template in every node. The template is usually small, so the memory cost of this approach is not too high in practice, and the amount of communication saved justifies it. The image is stored in the local memories using a block distribution scheme, mapping the two-dimensional matrix onto the two-dimensional mesh of processors in a straightforward way.

Each processor computes the products in the equation from position (0, 0) of the image in lexicographic order, and from position ($M_r$, $M_c$) of the template in reverse lexicographic order. In this way, each processor executes every computation that involves only local memory accesses. The routing operations needed to compute the global result can be mapped efficiently onto the mesh network, through rows and columns.

Figure 1 displays the access scheme of the template and the local image in each node. In this example, node (3, 3) has to send local results to every node that has some shaded zone: nodes (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1) and (3, 2). Finally, in order to minimize communication costs, we pack the individual messages into buffers, each of which is sent in a single routing operation.
In Figure 2 we present the efficiencies of the parallel algorithm executed on the AP1000, which is a general-purpose MIMD system with distributed memory and a two-dimensional torus network. Note that the efficiency is high even when 512 processors are used. Moreover, in some cases we found superlinear speedups, when the memory hierarchy (especially the cache) operates efficiently.
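The data distribution of this section (template replicated, image split into blocks, partial results combined afterwards) can be mimicked in a single process. The sketch below is an illustration under stated assumptions, not the AP1000 message-passing code: the names `block_bounds`, `local_partials` and `distributed_cross_correlation` are hypothetical, the "nodes" are simulated by a loop, and the routing step is replaced by a plain accumulation into shared arrays. As in Eq. (1), only placements with the template fully inside the image are considered.

```python
import math


def block_bounds(n, parts, b):
    """Half-open index range [lo, hi) of block b when n items are split into parts blocks."""
    return b * n // parts, (b + 1) * n // parts


def local_partials(P, T, rows, cols):
    """Partial sums one simulated node forms from its own image block only.

    Returns {(i, j): [numerator_part, energy_part]} for every window (i, j)
    that overlaps the block given by the index ranges rows and cols.
    """
    n_r, n_c = len(P), len(P[0])
    m_r, m_c = len(T), len(T[0])
    out = {}
    for r in range(*rows):
        for c in range(*cols):
            p = P[r][c]
            # Pixel (r, c) meets template entry (k, l) in window (r - k, c - l).
            for k in range(m_r):
                i = r - k
                if i < 0 or i > n_r - m_r:
                    continue
                for l in range(m_c):
                    j = c - l
                    if j < 0 or j > n_c - m_c:
                        continue
                    acc = out.setdefault((i, j), [0.0, 0.0])
                    acc[0] += p * T[k][l]
                    acc[1] += p * p
    return out


def distributed_cross_correlation(P, T, p_r, p_c):
    """Combine the partial sums of a simulated p_r x p_c grid of nodes."""
    n_r, n_c = len(P), len(P[0])
    m_r, m_c = len(T), len(T[0])
    out_r, out_c = n_r - m_r + 1, n_c - m_c + 1
    num = [[0.0] * out_c for _ in range(out_r)]
    energy = [[0.0] * out_c for _ in range(out_r)]
    for a in range(p_r):
        for b in range(p_c):
            part = local_partials(P, T, block_bounds(n_r, p_r, a),
                                  block_bounds(n_c, p_c, b))
            # Stand-in for the routing step: partial results are accumulated
            # at the position of the window they belong to.
            for (i, j), (s, e) in part.items():
                num[i][j] += s
                energy[i][j] += e
    return [[num[i][j] / math.sqrt(energy[i][j]) if energy[i][j] > 0 else 0.0
             for j in range(out_c)] for i in range(out_r)]
```

Windows that straddle a block boundary receive contributions from several simulated nodes, which corresponds to the local results that a node has to send to its neighbours in the access scheme described above.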

[Figure 2: Efficiency of the cross-correlation on the AP1000. Two panels, one per template size, each plotting efficiency against the number of PEs.]

[Table 1: Run-times on the VP2400/10. Columns: three image sizes, each with templates of 8 x 8 and 4 x 4; rows: Scalar, Automatic, Optimized, Speedup.]

3 Exploiting temporal parallelism

The local program associated with each node computes a local cross-correlation, so, if we assume vector capabilities in the processors, temporal parallelism at a finer grain can be exploited. We have implemented a vector code for the local program, using the Fujitsu VP2400/10 vector computer as a tool for evaluating the vectorization possibilities of this algorithm.

In order to make the best use of the hardware, we have systematically applied to the algorithm different transformations that exploit the vector capabilities of the system [1]. In particular, we have considered the following: vectorization over the longest loop, minimization of memory conflicts, loop fusion, use of scalar variables in reduction operations, unrolling and blocking.

Table 1 shows the run-times in milliseconds for different sizes of the local image and the template. Note that automatic compilation does not offer good performance, because it vectorizes the innermost loop, which corresponds to the rows of the template and is therefore short.

4 Conclusions

The cross-correlation coefficient and other related computations are the computational kernel of codes in the field of image processing, and in particular of the image template matching problem. In this paper we focus on the computational features that make this kind of loop-structured code suitable for parallel and vector machines. We found that a block distribution of the image and a replication of the template in every processor produce a high efficiency in the parallel algorithm on distributed memory systems, in particular on systems with a mesh interconnection topology. On the other hand, we found that vectorization is a more efficient solution than spatial parallelization for increasing the processing speed of this kind of code, owing to the communication costs. The best solution would be to combine both approaches in a distributed memory system with vector capabilities.

References

[1] W. Cowell and C. Thompson. Transforming Fortran DO loops to improve performance on vector architectures. ACM Transactions on Mathematical Software, 12(4):326-353, 1986.
[2] Z. Fang, X. Li, and L. Ni. Parallel algorithms for image template matching on hypercube SIMD computers. IEEE Transactions on Pattern Anal. Mach. Intell., PAMI-9(6):835-841, Nov.
[3] V. Kumar and V. Krishnan. Efficient image template matching on hypercube SIMD arrays. IEEE Transactions on Pattern Anal. Mach. Intell., PAMI-11(6):665-669, 1989.
[4] E. Zapata, J. Benavides, O. Plata, and F. Rivera. Image template matching on hypercube SIMD computers. Signal Processing, 21:49-60, 1990.
[5] E. Zapata, F. Rivera, and O. Plata. On the partition of algorithms into hypercubes. Advances in Parallel Computing, :49-7,
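As an illustration of two of the transformations listed in Section 3 (vectorization over the longest loop, and scalar variables in reduction operations), the sketch below interchanges the loops of Eq. (1) so that the innermost loop runs over the image columns rather than over the short template dimension. The paper does not give its optimized code, so this is only one plausible reading of the transformation, written in Python for readability; the function name `cross_correlation_long_loop` is hypothetical, and Python itself does not vectorize the loop.

```python
import math


def cross_correlation_long_loop(P, T):
    """Eq. (1) with the loop over image columns j moved innermost.

    Assumption of this sketch: only placements with the template fully
    inside the image are evaluated, and an all-zero window yields 0.0.
    """
    n_r, n_c = len(P), len(P[0])
    m_r, m_c = len(T), len(T[0])
    width = n_c - m_c + 1                   # length of the candidate vector loop
    C = []
    for i in range(n_r - m_r + 1):
        num = [0.0] * width                 # one accumulator per column j
        energy = [0.0] * width
        for k in range(m_r):
            row = P[k + i]
            for l in range(m_c):
                t = T[k][l]                 # template entry kept in a scalar
                # Innermost loop: independent iterations over all j, so its
                # length is set by the image width, not by the template size.
                for j in range(width):
                    p = row[l + j]
                    num[j] += p * t
                    energy[j] += p * p
        C.append([n / math.sqrt(e) if e > 0 else 0.0
                  for n, e in zip(num, energy)])
    return C
```

With a 4 x 4 or 8 x 8 template, the original innermost loop has at most 8 iterations, whereas the interchanged loop has roughly as many iterations as the local image has columns, which is the kind of loop a vector unit can use effectively.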
