Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o

Size: px

Start display at page:

Download "Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o"

Evelyn Stanley
5 years ago
Views:

1 Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department of Computer Science and Information Engineering National Central University Chung-Li 32054, Taiwan steven@axp1.csie.ncu.edu.tw sheujp@csie.ncu.edu.tw z Department of Computer and Information Science The Ohio State University Columbus, OH chh@cis.ohio-state.edu Abstract This paper presents compilation techniques to compress holes, which are memory locations mapped by useless template cells and are caused by the non-unit alignment stride in a two-level data-processor mapping. A two-level data-processor mapping provides user to specify data-processor mapping by aligning related array objects with a template, and then distributing the template onto the user-declared abstract processors. In a two-level data-processor mapping, there is a repeated pattern for array elements mapped onto processors. We classify blocks into classes and use a class table to record the attributes of classes for the data distribution. Similarly, data distribution on a processor also has a repeated pattern. We use compression table to record the attributes of the rst data distribution pattern on that processor. By using class table and compression table, hole compression can be easily and eciently achieved. Compressing holes can save memory usage, improve spatial locality and further increase system performance. The proposed method is ecient, stable and easy implement. The experimental results do conrm the advantages of our proposed method over existing methods. Keywords: Communication Set, Distributed-Memory Multicomputers, High Performance Fortran, Hole Compression, Two-Level Data-Processor Mapping. 1 Introduction Distributed-memory multicomputers are superior than shared-memory multiprocessors in cost and scalability. These advantages have made distributed-memory multicomputers be extensively used in scientic and engineering computation. However, the absence of a global shared address causes the diculty to program in distributed-memory multicomputers. Programmers have to not only distribute the computation and data onto processors but also manage communication among processors. Currently, parallelizing compilers try to ll up the gap between users and machines to relieve the burden of such a tedious and error-prone work from This work was supported in part by the NSC of the ROC under Grant #NSC E

2 !HPF$ PROCESSORS P ROC(P )!HPF$ ALIGN A(i) WITH T (s i + o)!hpf$ DISTRIBUTE T (cyclic(x)) ONTO P ROC Fig. 1: HPF-like loop model considered in the paper. users. Many researchers investigate the extension of existing languages to provide users more convenient way to manage data distribution, such as Fortran D [6, 25], HPF (High Performance Fortran) [8, 16], and Vienna Fortran [2, 3], etc. The major characteristic of these languages is to provide users directives to specify data distribution at language level. Generally speaking, data parallel languages support two-level data-processor mapping. A two-level dataprocessor mapping provides user to specify data-processor mapping by aligning related array objects with a template, an abstract index space, and then distributing the template onto the user-declared abstract processors. In distribution phase, three regular data distributions, block, cyclic, and block-cyclic data distributions, are provided by these languages. The distribution that contiguous blocks of array elements are evenly distributed onto processors is block distribution. The cyclic distribution is to distribute each array element onto processors one at a time and in a round-robin fashion. The distribution that blocks of size x are distributed onto processors in a round-robin fashion is block-cyclic distribution and is denoted as cyclic(x). Block-cyclic distribution is known to be the most general data distribution. A Block or a cyclic distribution can be represented by a block-cyclic distribution. A cyclic(d NA P e) distribution is, basically, a block distribution and the cyclic distribution can be represented by cyclic(1), where N A is the number of array elements and P is the number of processors. Consequently, the program model considered in this paper is demonstrated in Fig. 1, where P is the number of processors, s is the alignment stride, o is the alignment oset and x is the distribution block size. Fig. 2 illustrates an example of this model, where P = 4, s = 3, o = 1, and x = 5. The white squares represent the array elements of A and the number in the square is the global index of that array element. The gradations of gray squares represent dierent template cells mapped to dierent processors and the number in the square is the global index of that template cell. A template is an abstract index space and whose elements have no content and occupy no storage [8]. Therefore, in real situation, we concern actually about how array elements are distributed onto processors, not how the template cells are distributed onto processors. Furthermore, if the alignment stride is non-unit, the two-level data-processor mapping will create holes; that is, there are a lot of template cells with which no array element is aligned. These holes imply that the memory locations are never used. Fig. 3(a) illustrates the distribution of data elements onto processors without hole compression. On the contrary Fig. 3(b) illustrates that of with hole compression. Obviously, there are 30 memory spaces that should be allocated by each processor if hole compression is not performed. However, only a few template cells are aligned by array elements. The rest of template cells, which have no array elements aligned with, are never used and are holes. Therefore, after hole compression performed, only 10 memory spaces are needed by each processor. Memory holes caused by the non-unit alignment stride will result in a large amount of memory wastage, even for a small alignment stride. Therefore, only the template cells aligned by array elements need to be mapped to memory locations and all the holes should be removed. Suppose the number of template cells is N T and s is the alignment stride. Thus, only N T =s template cells are aligned by array elements. The NT =s percentage of memory usage is N T 100% and equals 1 s 100%. In other words, the percentage of memory wastage is s?1 s 100%. The larger the alignment stride is, the more the memory usage wastes. Even for the 2

3 Fig. 2: A Two-level data-processor mapping. Array A(i) is aligned with template T (3 i + 1) and the template is then distributed onto 4 processors with cyclic(5) distribution. 3

4 (a) (b) Fig. 3: The distributions of data elements onto processors without and with hole compression. (a) The distribution of data elements onto processors without hole compression. (b) The distribution of data elements onto processors with hole compression. 4

5 least positive non-unit alignment stride 2, it still has 50% waste of memory space. Therefore, compressing holes is quite necessary and important. The processes that map the useful template cells to processors and eliminate useless holes are called hole compression. In addition to increase memory usage, removing holes can also improve spatial locality and, furthermore, achieve higher performance. To eciently achieve the goal of removing holes, this paper presents compilation techniques for compiling two-level data-processor mappings. Observing the distribution of data onto processors, one can nd out that the distribution patterns for every three blocks are identical as shown in Fig. 2. By the observation, we classify all blocks into classes and design a table, named class table, to record the attributes of the rst repeated pattern. On the other hand, from processor viewpoint, the above observation is true as well. Therefore, another table, compression table, are established to record the attributes of the rst repeated data distribution pattern on that processor. Compression table is established according to the class table. We design systematic methods to construct these tables and the time complexity of each construction is O(s) in worst case, where s is the alignment stride. Hole compression can be easily achieved by using compression table accordingly. Table-lookup approach can save a lot of redundant computations since the computations of repetition of xed patterns can be obtained by table lookup instead of recomputations. Experimental results verify the advantages of the proposed approach. Moreover, the proposed approach has high stability against existing methods. The execution time changes little with the variations of the alignment stride and the distribution block size. In addition, from implementation viewpoint, the proposed approach can be easily implemented as well. This paper is organized as follows. On compiling two-level data-processor mapping, class table is useful for summarizing the two-level mapping. Therefore, the structure and characteristics of class table are presented in Section 2. Section 3 introduces how to utilize class table for compressing holes and generating global indices of array elements contained by each processor. Here a useful structure termed as compression table is also introduced. Experimental results to show the advantages of ours over existing methods are provided in Section 4. Section 5 describes the related work. Section 6 concludes the paper and points out the possible direction of future research as well. 2 Characteristics of Class Table As compiling array statements, we have designed a useful structure to summarize the characteristics of access patterns for an array statement, while the data distribution is block-cyclic distribution [28]. Based on the similar concept, a structure named class table is designed in this paper to record the attributes of a two-level data-processor mapping. In this section we briey describe the basic components of class table. Without loss of generality, we assume every numbering system is starting from zero, such as numbering array elements, template cells, and processors, etc. Suppose array element A(i) is aligned with template T at (s i + o), where s is the alignment stride and o is the alignment oset. Two dierent alignment osets o 1 and o 2 lead to isomorphic data-processor mapping if and only if o 1 o 2 (mod s). In order to reuse the same class table for isomorphic data-processor mappings, class table regards the alignment oset o as another reduced one r, where r = (o mod s). For example, suppose array A(i) is aligned with a template T at (3i + 10). The alignment phase is illustrated in Fig. 4. In this example, s = 3 and o = 10. Since 10 > 3, the alignment oset is reduced to 1, which is equal to (10 mod 3). For the reduced alignment, a template cell t has an array element aligned with if and only if t r (mod s), such as the template cells 1; 4; 7; 10; 13; 16;, in this example. For these template cells with which have array elements aligned, a template cell t is active if o t (s (N A? 1) + o); otherwise, it is 5

6 A p p p T Fig. 4: The illustration of active elements and pseudo active elements. The boldfaced numbers are active elements and the italic numbers are pseudo active elements. termed pseudo active, where N A is the number of array elements. For this example, the active template cells are starting from 10 and then each strides 3 and the template cells 1, 4, 7 are pseudo active elements. Fig. 4 also illustrates the active elements and pseudo active elements for this example. The boldfaced numbers are active elements and the italic numbers are pseudo active elements. Note that the pseudo active elements are viewed as the same with active elements when we are generating class table and compression table. However, the pseudo active elements will not be counted when we are generating the compressed local array. For a cyclic(x) distribution, every x template cells forms a block. A block is numbered according to the occurrence of the block in the data-processor mapping. Suppose the numbers of array elements and template cells are N A and N T, respectively. Let N b be the number of blocks. Thus N b = dn T =xe. According to the alignment stride and oset, blocks of the same format can be classied into the same class. In other words, those blocks within which have the same positions of active elements are classied into the same class. For example, consider the two-level data-processor mapping shown in Fig. 2. Blocks 0, 3, 6, 9, have the same positions of active elements. These blocks are classied into the same class. Similarly, blocks 1, 4, 7, 10, can be classied into a class and blocks 2, 5, 8, 11, is another class. The following theorem demonstrates that, for a two-level data-processor mapping, according to the positions of active elements within the blocks, all blocks can be classied into dierent classes and the number of classes is equal to s= gcd(s; x), where gcd(a; b) is the greatest common divisor of a and b. Theorem 1 For any two-level data-processor mapping, suppose array A is aligned with T at a stride s and an oset o and template T is distributed onto processors using cyclic(x) distribution. All template blocks can be classied into s= gcd(s; x) classes. Proof. Since each block is of the same size and the alignment stride is xed on s, we only have to prove that if the position of the rst active element for some block is the same as that of another block, the positions of the rest of active elements within the two blocks are the same. Suppose the position of the rst active element within block b is, 0 < s. We would like to show that the position of the rst active element is still for every period of s= gcd(s; x) blocks. The positions of active elements before or after can be formulated as (( + k s) mod x), where k 2 Z and Z is the set of integers. We claim that ( + k 0 s) mod x =, for some k 0. Therefore, the number of blocks between b and the block whose the rst active element position is the same with b is k0 s x. Since ( + k 0 s) mod x =, it implies that + k 0 s = x q +, where q 2 Z. Thus, k 0 s = x q. Because k 0 and q are integers, the unique solution to k 0 s = x q is k 0 = t lcm(s;x) s and lcm(a; b) is the least common multiple of a and b. Therefore, k0 s x and q = t lcm(s;x) x, where t 2 Z = q = t lcm(s;x) x = t s gcd(s;x). In other words, for every period of s= gcd(s; x) blocks, the rst active elements of these blocks are the same. Thus, for arbitrary two-level data-processor mapping, blocks can be classied into s= gcd(s; x) classes. The theorem is therefore obtained. 2 Let N c be the number of classes. By Theorem 1, N c = s= gcd(s; x). Since all blocks are classied into s= gcd(s; x) classes, we number the class number of each block in lexicographical order and every s= gcd(s; x) 6

7 c A f A l A n Fig. 5: Class table for the example shown in Fig. 2. N c = s= gcd(s; x) /* the number of classes */ r = o mod s /* reduced alignment oset */ DO c = 0 TO (N c? 1) A f (c) = dmax(c x? r; 0)=se A l (c) = b((c + 1) x? r? 1)=sc A n (c) = A l (c)? A f (c) + 1 ENDDO Fig. 6: Class Table Generation algorithm. blocks repeats again. In other words, block b belongs to class (b mod N c ). Blocks b 1 and b 2 belong to the same class if and only if b 1 b 2 (mod N c ). Blocks fb; (b + 1); (b + 2); : : : ; (b + N c? 1) j b 0 (mod N c )g are a period in terms of class and we call such a period a class cycle. For a class cycle, a class table is used for recording the rst, the last and the number of active elements on a class. For a class c, A f (c) represents the order of occurrence in a class cycle for the rst active element in c, A l (c) the order of occurrence in a class cycle for the last active element in c, and A n (c) the number of active elements within c. For the example shown in Fig. 2, the class table is shown in Fig. 5. In this example, blocks can be classied into 3 classes. The blocks 0, 3, 6, 9,, are class 0, the blocks 1, 4, 7, 10,, are class 1, and the blocks 2, 5, 8, 11,, are class 2. Blocks 0, 1, 2 form a class cycle and blocks 3, 4, 5 form another class cycle, and so on. For the rst class in a class cycle, the occurrences of the rst and the last active elements are 0 and 1, respectively, and the number of active elements in this class is 2. Thus, A f (0) = 0, A l (0) = 1, and A n (0) = 2. Likewise, A f (1) = 2, A l (1) = 2, and A n (1) = 1 for class 1 and A f (2) = 3, A l (2) = 4, and A n (2) = 2 for class 2. The algorithm to generate a class table is presented in Fig. 6 and is termed Class Table Generation algorithm. The major factors aecting the construction of a class table are the alignment stride s, alignment oset o, and the distribution block size x. The time complexity of Class Table Generation algorithm is O(s= gcd(s; x)). In addition to the original meanings described above, class table also contains a lot of additional information implicitly. For a class c, A f (c) implies total number of active elements contained from class 0 to class (c? 1) and (A l (c) + 1) implies total number of active elements contained from 0 up to class c. Therefore, the number of active elements contained in a cycle is equal to A l (N c? 1) + 1. Let A c be the number of active elements contained in a class cycle. Thus, A c = A l (N c? 1) + 1. Note that, if the block size x is larger than the alignment stride s, each block contains at least one active element. Thus, A l (c) is always larger than or equal to A f (c), where c is the class number of that block. However, if the block size is smaller than the alignment stride, each block contains at most one active element and there are blocks that contain no active element at all. For any block that contains no active element, A f (c) will be one larger than A l (c). Lemma 1 demonstrates the phenomenon. 7

8 Lemma 1 For any block b, A f (c) = A l (c) + 1 () A n (c) = 0; where c is the class number of b. Proof. ()): Since A n (c) = A l (c)? A f (c) + 1 and A f (c) = A l (c) + 1; obviously, A n (c) = 0. ((): Since A n (c) = A l (c)? A f (c) + 1 and A n (c) = 0; clearly, A f (c) = A l (c) The construction of a class table from a two-level data-processor mapping has been introduced. In the following section, how to use the class table to construct a compression table to extract information from a repeated data distribution pattern on a processor will be proposed. The generation of the compressed local array for a processor by using the compression table will be described next. The transformations between global and local indices are addressed as well. 3 Hole Compression HPF supports programmers directives to explicitly specify regular data distribution. It adopts two-level data-processor mapping. That is, related array objects are rst aligned with each other or aligned with a template; this group of arrays is then distributed onto the user-declared abstract processors. A two-level data-processor mapping causes a lot of useless holes if the alignment stride is non-unit. Memory holes result in memory wastage and degrade system performance. Therefore, this section proposes a compilation technique to totally eliminate these useless holes and systematically generate the sequence of array elements exactly obtained by each processor. 3.1 Construction of Compression Table Suppose a two-level data-processor mapping is of the form shown in Fig. 1. By Theorem 1, for arbitrary twolevel data-processor mapping, all blocks can be classied into N c classes, where N c = s= gcd(s; x). Therefore, for every lcm(n c ; P ) blocks, the same data distribution pattern for all processors will repeat again. In other words, blocks within a data distribution pattern can be viewed as dierent, but are identical for every data distribution pattern. Hence, we only have to consider the rst data distribution pattern and the rest of data distribution patterns can be done likewise. As a result, a compression table records the information to generate the compressed local array for each processor on the rst data distribution pattern. Similar to the class table, compression table also regards an alignment oset o as another reduced alignment oset r if o s, where r = (o mod s). Let the number of blocks in a data distribution pattern be N pb. Thus N pb = lcm(n c ; P ). The number of blocks on each processor within a data distribution pattern is, thus, equal to N pb =P, which can also be written as N c = gcd(n c ; P ). Let the number of blocks on a processor within a data distribution pattern be N p pb, then N p pb = N c= gcd(n c ; P ). That is, on each processor, each data distribution pattern contains N p pb blocks. We number a block on a processor within a data distribution pattern according to the order of occurrence of that block in the data distribution pattern. For the example shown in Fig. 2, a data distribution pattern contains 12(= N pb ) blocks. Blocks 0 to 11 form the rst data distribution pattern and blocks 12 to 23 are within the second data distribution pattern. Blocks 0 and 4 are the rst and the second occurred blocks on processors 0, blocks 1 and 5 are the rst and the second occurred blocks on processor 1 within the rst data distribution pattern, and blocks 12 and 13 are the rst occurred blocks respectively on processor 0 and 8

9 Fig. 7: The representations of the notations class, class cycle, data distribution pattern, and occurrence of a block within a data distribution pattern. 1 within the second data distribution pattern, and so on. The representations of the notations class, class cycle, data distribution pattern and occurrence of a block within a data distribution pattern are illustrated in Fig. 7. For a processor, data distributions among dierent data distribution patterns are identical. The only dierence between the rst data distribution pattern and other data distribution patterns is the indices of every pair of corresponding cells. However, for every corresponding cells, their indices are dierent in only a xed oset. Hence, for a processor, only how to compress holes for the rst data distribution pattern needs to be considered and, for the rest of data distribution patterns, only the xed oset needs to be evaluated. For every block allocated on processor p in the rst data distribution pattern, if the following three items can be evaluated, we can systematically generate the compressed local array. Thus, hole compression can be easily achieved. C p g, the global index of the rst array element in a block on processor p, C p n, the number of array elements in a block on processor p, C p l, the local index in the compressed local array for the rst array element in a block on processor p. Therefore, to facilitate hole compression, we design a structure, termed compression table, to characterize blocks in a data distribution pattern on each processor. For a block in processor p, the compression table records the following three items: Cg p, Cn, p and C p l. Take the compression table of processor p 0 as an example. Block b 0 is the rst occurred block within the rst data distribution pattern on processor p 0. The global index of the rst array element in block b 0 is 0, the number of array elements in the block is 2, and the local index of array element A(0) in the compressed local array is 0. Thus, Cg(0) 0 = 0, Cn(0) 0 = 2, and Cl 0 (0) = 0. The second occurrence of block within the rst data distribution pattern on processor p 0 is block b 4. The global index of the rst array element in block b 4 is 7, the number of array elements in the block is 1, and the local index of A(7) in the compressed local array is 2. Therefore, Cg(1) 0 = 7, Cn(1) 0 = 1, and Cl 0(1) = 2. Similarly, C0 g(2) = 13, Cn(2) 0 = 2, and Cl 0 (2) = 3. The 9

10 occurrence Cg 0 Cn 0 Cl occurrence Cg 1 Cn 1 Cl (a) The compression table of p 0. (b) The compression table of p 1. occurrence Cg 2 Cn 2 Cl occurrence Cg 3 Cn 3 Cl (c) The compression table of p 2. (d) The compression table of p 3. Fig. 8: Compression tables for processor p 0 and p 1 in the example shown in Fig. 2. (a) The compression table of processor p 0. (b) The compression table of processor p 1. (c) The compression table of processor p 2. (d) The compression table of processor p 3. compression tables of processor p 0, p 1, p 2, and p 3, for the example shown in Fig. 2 are illustrated in Fig. 8 (a), (b), (c), and (d), respectively. How to systematically construct a compression table for processor p is described as follows. Since all blocks are classied into N c classes, every N c blocks form a class cycle. For any block b, there are b b N c c class cycles appeared before b. Furthermore, each class cycle contains A c active elements. Hence, there are (b b N c ca c ) active elements in these class cycles. From Section 2, A f (c) implicitly implies the number of active elements from class 0 to class (c? 1) in a class cycle. As a result, for any block b, the global index of the rst array element in the block can be obtained by b b N c ca c +A f (c), where c is the class number of b. Therefore, Cg p = b b N c c A c + A f (c). On the other hand, Cn p can be obtained by A n (c) and C p l can be evaluated by the local index of the rst array element in the previous block plus the number of active elements contained by the previous block. Of course, the local index of the rst array element in the rst block on the rst data distribution pattern is 0. Note that C p l can also represent the number of active elements occurred before the block in a data distribution pattern. The algorithm to generate a compression table is demonstrated in Fig. 9. The time complexity of Compression Table Generation algorithm is O(N c = gcd(n c ; P )), where N c is the number of classes and P is the number of processors. By Section 2, N c = s= gcd(s; x). Hence, the worst case of Compression Table Generation algorithm is O(s), as much as that of Class Table Generation algorithm. Note that if the alignment stride is smaller than or equal to the distribution block size, (s x), each block contains at least one active element. In such a condition, the rst array element in a block obtained by the above calculation is exactly the rst array element of that block. On the contrary, if the alignment stride is larger than the distribution block size, (s > x), a block may contain no active element. The rst array element of the empty block, a block contains no active element, obtained by the above calculation is a pseudo array element. However, the pseudo array element does not aect the correctness of our proposed approach. We shall show the fact in the following section. 10

11 b = p; nelem = 0; addr = 0; /* initialization */ N p pb = N c gcd(n c;p ) DO occurrence = 0 TO (N p pb? 1) c = b mod N c /* the class number of block b */ Cg p (occurrence) = b b N c c A c + A f (c) Cn(occurrence) p = A n (c) C p l (occurrence) = addr + nelem b = b + P /*the next block on processor p */ nelem = Cn(occurrence) p /* keep the current value for the next C p l */ addr = C p l (occurrence) /* keep the current value for the next Cp l */ ENDDO Fig. 9: Compression Table Generation algorithm. 3.2 Generation of Compressed Local Arrays Using the compression table to generate the compressed local array on a processor is proposed in this subsection. Dierent alignment stride and oset cause dierent data-processor mapping. For simplifying discussion, we rst consider a two-level data-processor mapping without considering the pseudo active elements. That is, we assume o < s. However, a compressed local array should not include the pseudo active elements if pseudo active elements do exist. Therefore, it follows the discussions of the general two-level data-processor mapping containing pseudo active elements Special Case of Two-Level Data-Processor Mapping Generally speaking, the main factors aecting a two-level data-processor mapping are the alignment stride s, alignment oset o, distribution block size x and the number of processors P. However, two dierent alignment osets o 1 and o 2 lead to isomorphic data-processor mapping if and only if o 1 o 2 (mod s). Hence, the constructions of class table and compression table regard two alignment osets as the same if the two alignment osets lead to isomorphic data-processor mapping. That is, if o s, we shift the alignment oset from o to r as we are generating class table and compression table, where r equals (o mod s). In this subsection, to simplify the discussion of the generation of compressed local array, we do not take the pseudo active elements into consideration. Specically, we assume that the alignment oset o is smaller than the alignment stride s. On the other hand, since, in our proposed method, a basic unit to generate the compressed local array is a data distribution pattern, thus we also require that array elements be distributed onto multiple of data distribution patterns. To do so can avoid evaluating the pseudo active elements occurred in back of the last active element (s (N A? 1) + o). Fig. 2 is an example of such a two-level data-processor mapping. Again, here we consider that array A(i) is aligned with T (si+o) and T is distributed onto P processors using cyclic(x) distribution. As previously stated, the data distribution among dierent data distribution patterns are identical except the indices of the corresponding elements. Let the dierence of the global indices of two corresponding array elements on two contiguous data distribution patterns be global indexing oset. Therefore, if one array element is known, the corresponding array element on the previous/next data distribution pattern can be obtained by subtracting/adding the global indexing oset from/to the global index of that array element. Hence, the most ecient approach to generating the compressed local array for 11

12 Algorithm: Hole Compression algorithm for the special case of two-level data-processor mapping. Input: Compression table and global indexing oset. Output: A p, the compressed local array on processor p. Procedure: Step 1. Generate the compressed local array for the rst data distribution pattern on processor p by using the compression table. Step 2. Generate A p for the rest of data distribution patterns according to the global indexing oset. Fig. 10: A simple Hole Compression algorithm for processor p in the special case of two-level data-processor mapping. processor p is to rst generate the compressed local array for the rst data distribution pattern according to the compression table, and then, for the following data distribution patterns, the compressed local array can be generated according to previously generated compressed local array and the global indexing oset. Consider the two-level data-processor mapping shown in Fig. 2 and take the generation of the compressed local array for processor p 0 as an example. The compressed local array for the rst data distribution pattern is A(0; 1; 7; 13; 14). Accordingly, the compressed local array for the second data distribution pattern is A(20; 21; 27; 33; 34) since the global indexing oset is 20. Thus, the compressed local array of processor p 0 is A(0; 1; 7; 13; 14; 20; 21; 27; 33; 34). A simple Hole Compression algorithm for the special case of two-level data-processor mapping is shown in Fig. 10. The algorithm shown in Fig. 10 is straightforward but less ecient. Based on the similar idea, we propose another ecient algorithm to generate the compressed local array for the special case of two-level dataprocessor mapping. Let the dierence of the local indices in a compressed local array for two corresponding array elements on two contiguous data distribution patterns be local indexing oset. As the dierences of the global and the local indices of two corresponding array elements on two contiguous data distribution patterns are xed, which are equal to the global and the local indexing osets, respectively, the global indices of the compressed local array elements on the latter data distribution pattern can be obtained by adding the global indexing oset to the global indices of the corresponding compressed local array elements on the former data distribution pattern. Similarly, the local indices in the compressed local array for the compressed local array elements on the latter data distribution pattern can be obtained by adding the local indexing oset to the local indices of the corresponding compressed local array elements on the former data distribution pattern. Once the global and local indexing osets for two corresponding array elements on two contiguous data distribution patterns are gured out, the compressed local array for processor p can be eciently generated as follows. For the blocks of the same occurrence on every data distribution pattern, the global index of the rst array element can be obtained by compression table, and for the remaining array elements at the same position in each block on each data distribution pattern, its index can be obtained by adding the global indexing oset to the global index of the previous corresponding array element. Similarly, the local index of the rst array element in the rst occurred block on the rst data distribution pattern can also be obtained by the compression table. For the other corresponding array elements, the local indices can be obtained by adding the local indexing oset to the local index of the previous corresponding array element. Fig. 11 illustrates the ecient algorithm to generate the compressed local array for processor p in the special case of two-level data-processor mapping. 12

13 Algorithm: Hole Compression algorithm for the special case of two-level data-processor mapping. Input: Compression table, global and local indexing osets. Output: A p, the compressed local array on processor p. Procedure: Step 1. for the blocks of the same occurrence on every data distribution pattern do Fill in A p according to the compression table, global and local indexing osets. Fig. 11: Hole Compression algorithm for processor p in the special case of two-level data-processor mapping. Again, take the generation of the compressed local array for processor p 0 in Fig. 2 as an example. The global index of the rst array element in block b 0 on the rst data distribution pattern is 0, which can be obtained by looking up the compression table of processor p 0. The local index of A(0) is 0, which can also be obtained by the compression table. For the corresponding array elements on the second data distribution pattern in this example, the global index of that array element is 20, which is equal to 20+0, where 20 is the global indexing oset and 0 is the global index of the previous corresponding array element. The local index of A(20) is 5, which equals 5 + 0, where 5 is the local indexing oset and 0 is the local index of the previous corresponding array element. That is, A(20) is mapped to A p0 (5). By the same way, we can obtained the compressed local array A p0 = (0; 1; 7; 13; 14; 20; 21; 27; 33; 34). Implementation Issues The main concept to generate the compressed local array for the special case of two-level data-processor mapping has been discussed in the previous section. However, some implementation issues should be addressed. Since the data distributions among dierent data distribution patterns are identical except the indices of the corresponding elements, the key to generating the compressed local array is to evaluate the global indexing oset. Evidently the dierence of the global indices of two corresponding array elements on two contiguous data distribution patterns is the number of active elements in a data distribution pattern. Let A ptn be the number of active elements in a data distribution pattern. Thus A ptn is the global indexing oset. The number of active elements in a data distribution pattern can be evaluated by A ptn = d(n pb x)=se, where N pb is the number of blocks in a data distribution pattern. For instance, in the example shown in Fig. 2, the rst blocks in the rst and the second data distribution patterns are blocks 0 and 12, respectively. The rst array elements in blocks 0 and 12 are 0 and 20, respectively. The dierence between the global indices of the two corresponding array elements is 20, which is equal to A ptn. The global indices of two corresponding array elements on two contiguous data distribution patterns are xed dierence. Similarly, the local indices in a processor's compressed local array for two corresponding array elements on two contiguous data distribution patterns are also xed dierence. We term this xed dierence local indexing oset. Clearly, the dierence of the local indices in a processor's compressed local array for two corresponding array elements on two contiguous data distribution patterns equals the number of active elements on a data distribution pattern allocated onto that processor. Let A p ptn be the number of active elements allocated onto processor p in a data distribution pattern. Obviously, A p ptn is the local 13

14 indexing oset. From compression table, A p ptn = C p n(n p pb? 1) + Cp l (N p pb? 1): For processor p 0 in the example shown in Fig. 2, the number of active elements on a data distribution pattern is 5. The rst array elements on the rst and the second data distribution patterns allocated onto processor p 0 are 0 and 20, respectively. The local index of array element 0 in compressed local array is 0 and that for array element 20 is 5. The dierence of the local indices in the compressed local array of p 0 for these two corresponding array elements is 5, which is equal to A p ptn. Suppose there are N ptn data distribution patterns in the special case of two-level data-processor mapping. The number of data distribution patterns N ptn can be obtained by N ptn = N b =N pb, where N b is the number of blocks and N pb is the number of blocks in a data distribution pattern. Since the array elements are distributed onto multiple of data distribution patterns, the number of compressed local array elements on processor p can be gured out as follows. Let the number of compressed local array elements on processor p be N p A. Then N p A = N ptn A p ptn : The code to generate the compressed local array on processor p for the special case of two-level data-processor mapping is shown in Appendix A General Case of Two-Level Data-Processor Mapping The concept used by the special case of two-level data-processor mapping is the same for the general case of two-level data-processor mapping. However, the pseudo active elements are taken into consideration for the general case of two-level data-processor mapping. For example, Fig. 12 is a general two-level data-processor mapping. In this data-processor mapping, array A has 30 elements indexed from 0 to 29. Array element A(i) is aligned with a template T at a stride 3 and an oset 28. Template T is distributed onto 4 processors using cyclic(5) distribution. The two data-processor mappings shown in Figs. 2 and 12 are isomorphic since the alignment strides, distribution block sizes and the numbers of processors of two mappings are identical, and the alignment osets 1 28 (mod 3). The class table and compression table used by the data-processor mapping shown in Fig. 2 are the same for that in Fig. 12. The compressed local array of processor p 0 in Fig. 2 is A p0 = A(0; 1; 7; 13; 14; 20; 21; 27; 33; 34), which has been introduced in previous section. However, in Fig. 12, taking the pseudo active elements into consideration, the compressed local array is turned to A p0 = A(4; 5; 11; 12; 18; 24; 25). Hole Compression algorithm for the special case of two-level data-processor mapping ignores the eect caused by the pseudo active elements. In the general case of two-level data-processor mapping, the rst and the last data distribution patterns may have pseudo active elements. These two data distribution patterns should be considered separately. However, for the rest of data distribution patterns between the rst and the last data distribution patterns, there is no pseudo active element in these data distribution patterns. Hence the generation of compressed local array for these data distribution patterns can adopt the Hole Compression algorithm for the special case of data-processor mapping. The evaluations of the global and local indexing osets in the special case of two-level data-processor mapping are also important for general two-level data-processor mapping, where the global and local indexing osets are respectively the dierences of the global indices and the local indices in a compressed local array for two corresponding array elements on two contiguous data distribution patterns. In addition, for processor p, we still have to count 14

15 Fig. 12: A general two-level data-processor mapping. Array A(i) is aligned with template T (3 i + 28) and the template is then distributed onto 4 processors with cyclic(5) distribution, assuming N A =

16 Algorithm: Hole Compression algorithm for the general two-level data-processor mapping. Input: Compression table, N p A, global and local indexing osets. Output: A p, the compressed local array on processor p. Procedure: Step 1. Generate the rst A p ptn compressed local array elements according to the compression table and the number of pseudo active elements. Step 2. Generate the remainder (bn p A =Ap ptnc? 1) A p ptn compressed local array elements. Every A p ptn elements are obtained by adding the global indexing oset to the previous A p ptn elements. Step 3. Generate the last (N p A mod Ap ptn) compressed local array elements according to the previous A p ptn compressed local array elements and the global indexing oset. Fig. 13: Hole Compression algorithm for processor p on a general two-level data-processor mapping. the number of pseudo active elements on p before the rst active element o and that after the last active element (s (N A? 1) + o) within the block where the last active element is allocated. Consider the generation of compressed local array of processor p 0 for the general case of two-level data-processor mapping shown in Fig. 12. Since the data-processor mapping shown in Fig. 12 is isomorphic to that in Fig. 2, the compression tables used by the two data-processor mappings are identical and have been shown in Fig. 8. The global and local indexing osets are 20 and 5, respectively, which are the same as those in the special case of two-level data-processor mapping shown in Fig. 2. The compressed local array of processor p 0 without considering the impact caused by the pseudo active elements is A p0 = A(0; 1; 7; 13; 14; 20; 21; 27; 33; 34), as discussed in Section Take the pseudo active elements into consideration. There are totally 9 pseudo active elements appeared before o in the two-level dataprocessor mapping, which are indexed as 1, 4, 7, 10, 13, 16, 19, 22, and 25. It implies, in this example, all the global indices of array elements have to subtract 9 from their original indices in the special case of two-level data-processor mapping to get their real indices in the general case. Thus, A p0 is turned to A(?9;?8;?2; 4; 5; 11; 12; 18; 24; 25). In addition, there are three pseudo active elements on processor p 0 ( 1, 4, and 22). As a result, the previous three array elements should be kicked out. Thus, the compressed local array of processor p 0 for the general two-level data-processor mapping in Fig. 12 is A p0 = A(4; 5; 11; 12; 18; 24; 25). However, if the number of compressed local array on processor p is known, the compressed local array on processor p can be generated more eciently. Let the number of compressed local array on processor p be N p A. Suppose the local indexing oset is Ap ptn. It implies that the global indices of array elements in the compressed local array will dier in a xed oset, global indexing oset, for a period of A p ptn elements. For example, in the two-level data-processor mapping shown in Fig. 12, the global and local indexing osets are 20 and 5, respectively. If the global index of the rst compressed local array element is 4, the global index of the sixth(= 1+5) compressed local array element is 24(= 4+20). Thus we can generate the compressed local array for the rst A p ptn elements by using the compression table and the number of pseudo active elements. After that, we can generate the next A p ptn elements according to the previous A p ptn elements and the global indexing oset until the last (N p A mod Ap ptn) elements. Finally, we can generate the last (N p A mod Ap ptn) elements accordingly. The algorithm to generate the compressed local array for the general case of two-level data-processor mapping is demonstrated in Fig

17 Although the concept of Hole Compression algorithm for the general case of two-level data-processor mapping is easy, some implementation issues should be addressed. The details of the algorithm will also be explained in the following subsection. Implementation Issues For a processor p, only the blocks which contain active elements do we need to concern about. Hence the block which contains active elements is termed active block. Let b p f and bp l respectively denote the rst and the last active blocks that processor p contains. Thus b p f and bp l can be gured out as follows. The block numbers of blocks contained by processor p can be written in the form of (p + k P ), where k 2 Z + and Z + is the set of positive integers. Since the rst and the last active elements are o and s (N A? 1) + o, they are allocated to blocks bo=xc and b(s (N A? 1) + o)=xc, respectively. Therefore, b p f should be greater than or equal to bo=xc and b p l should be smaller than or equal to b(s(n A?1)+o)=xc. Suppose b p f = (p+k f P ) and b p l = (p + k l P ), where k f and k l 2 Z +. Thus, (p + k f P ) bo=xc and (p + k l P ) b(s (N A? 1) + o)=xc. We can obtain that k f = d(bo=xc? p)=p e and k l = b(b(s (N A? 1) + o)=xc? p)=p c. As a result, b p f = p + d(bo=xc? p)=p e P; b p l = p + b(b(s (N A? 1) + o)=xc? p)=p c P: For any block in processor p, the block will belong to a unique data distribution pattern and occurrence in that data distribution pattern. Therefore, for any block b in processor p, b can be decomposed into two factors and, where and are the data distribution pattern number and the occurrence in the distribution pattern where b belongs, respectively. and can be respectively obtained by bb=n pb c and b(b mod N pb )=P c, where N pb is the number of blocks in a data distribution pattern and N pb = lcm(n c ; P ), which has been introduced in Section 3.1. On the other hand, given (; ) p, it will correspond to a unique block number. That is, given (; ) p, a unique block number b can be obtained via b = p + N pb + P: Thus, we term (; ) p addressing pair. Accordingly, the addressing pairs for the rst and the last active blocks that contained by processor p, b p f and bp l, are represented by ( f ; f ) p and ( l ; l ) p, respectively. When we are generating the class table and the compression table, pseudo active elements are regarded as the same with active elements. However, as we are generating the compressed local array, these pseudo active elements have to be deducted out. Hence, in addition to the evaluations of the global and local indexing osets, we have to gure out the number of pseudo active elements allocated on processor p and occurred before the rst active element o. In order not to over generate the compressed local array elements, we still have to calculate the number of pseudo active elements occurred in the last active block b p l on processor p. Let PA p h denote the number of pseudo active elements allocated on processor p and occurred before the rst active element o and PA p t denote the number of pseudo active elements occurred in the last active block b p l. The values of PA p h and PAp t can be calculated as follows. To calculate PA p h, we have to decide whether the rst active element o is allocated on processor p or not. If the rst active element o is not allocated on p, pseudo active elements only occur in the blocks before the rst active block b p f. Therefore, PAp h = f A p ptn + C p l ( f ), where A p ptn is the number of active elements allocated on processor p in a data distribution pattern. Otherwise, which means that the rst active element o is allocated on processor p, thus, in addition to the blocks before the rst active block b p f, the pseudo active elements occurred in the rst active block also need to be considered. Hence, PA p h = f A p ptn +C p l ( f )+b(o 17

18 mod x)=sc. As a result, PA p h can be obtained by ( PA p h = f A p ptn + C p l ( f ) if o 62 p; f A p ptn + C p l ( f ) + b(o mod x)=sc if o 2 p: For the example shown in Fig. 12, PA p h for processor p 0 can be evaluated as follows. The rst active block on p 0 is b 8 and the addressing pair for the block is (0; 2) p0. Since the rst active element 28 is not allocated on processor p 0, the pseudo active elements on p 0 only occur in the blocks before the rst active block b 8, that is, blocks b 0 and b 4. Accordingly, PA p0 h = Cp0 l (2) = 3, where A p ptn = 5. As regards to PA p h for processor p 1, the rst active block on p 1 is b 5 and the addressing pair of this block is (0; 1) p1. Since the rst active element 28 is allocated on p 1, the pseudo active elements occur not only in the blocks before the rst active block but also in the rst active block. Thus, PA p1 h = Cp1 l (1) + 1 = 2. To evaluate PA p t, we also have to decide whether the last active element (s (N A? 1) + o) is allocated on processor p or not. If the last active element (s (N A? 1) + o) is not allocated on p, there is no pseudo active element in the last active block. Therefore, PA p t = 0. For the example shown in Fig. 12, the values of PA p t for processors p 0, p 1 and p 2 are 0 because the last active element 115 is not allocated on p 0, p 1 and p 2. On the other hand, if the last active element is allocated on p, the number of pseudo active elements in the last active block b p l of p 3 shown in Fig. 12, since the last active element 115 is allocated on p 3, thus PA p3 t where the addressing pair of b p3 l is (1; 2) p3. Consequently, PA p t can be obtained by PA p t = can be calculated by PA p t = Cn( p l )? b((s (N A? 1) + o) mod x)=sc? 1. For the case = Cn p3 (2)? 0? 1 = 1, ( 0 if (s (N A? 1) + o) 62 p; Cn( p l )? b((s (N A? 1) + o) mod x)=sc? 1 if (s (N A? 1) + o) 2 p: Since the rst and the last active blocks on processor p and the corresponding addressing pairs can be evaluated, the number of active elements on processor p can be obtained accordingly. For processor p, the number of active elements occurred from the rst block up to the last active block is l A p ptn+c p l ( l)+cn( p l ). However, the pseudo active elements are also counted. Therefore, the number of pseudo active elements should be deducted out. As a result, the number of active elements allocated on processor p is N p A and can be obtained by N p A = l A p ptn + C p l ( l) + C p n( l )? PA p h? PAp t : The code to generate the compressed local array on processor p for the general two-level data-processor mapping is shown in Appendix B. We explain the ideas of the algorithm shown in Fig. 13 by using the following example. Consider the general two-level data-processor mapping shown in Fig. 12. Take the generation of the compressed local array for processor p 0 as an example. We have to evaluate the number of active elements allocated on processor p 0 rst. The rst and the last active blocks on processor p 0 are b 8 and b 20, respectively. That is, b p0 f = b 8 and b p0 l = b 20. The addressing pairs ( f ; f ) p0 and ( l ; l ) p0 are (0; 2) p0 and (1; 2) p0, respectively. The values of PA p h and PAp t for processor p 0 have been described in previous paragraphs and PA p0 h = 3 and PA p0 t = 0. The number of active elements on processor p 0 is N p0 A = Cp0 l (2) + Cn p0 (2)? 3? 0 = 7, where A p0 ptn = 5. Since A p0 ptn = 5, according to Hole Compression algorithm shown in Fig. 13, the rst step is to generate 5 compressed local array elements. The compression table records the information to generate A p ptn compressed local array elements when pseudo active elements are not considered. However, there are 9 pseudo active elements occurred before the rst active element o in this example. Hence all the global indices 18

1 Elementary number theory

Math 215 - Introduction to Advanced Mathematics Spring 2019 1 Elementary number theory We assume the existence of the natural numbers and the integers N = {1, 2, 3,...} Z = {..., 3, 2, 1, 0, 1, 2, 3,...},