Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF

Kuei-Ping Shih†, Jang-Ping Sheu†, and Chua-Huang Huang‡

† Department of Computer Science and Information Engineering, National Central University, Chung-Li 32054, Taiwan
steven@axp1.csie.ncu.edu.tw, sheujp@csie.ncu.edu.tw

‡ Department of Computer and Information Science, The Ohio State University, Columbus, OH
chh@cis.ohio-state.edu

Abstract

This paper presents compilation techniques to compress holes, which are memory locations mapped by useless template cells and are caused by a non-unit alignment stride in a two-level data-processor mapping. A two-level data-processor mapping lets the user specify the mapping by aligning related array objects with a template and then distributing the template onto the user-declared abstract processors. In a two-level data-processor mapping, the array elements mapped onto processors follow a repeated pattern. We classify blocks into classes and use a class table to record the attributes of the classes for the data distribution. Similarly, the data distribution on a processor also has a repeated pattern. We use a compression table to record the attributes of the first data distribution pattern on that processor. By using the class table and the compression table, hole compression can be achieved easily and efficiently. Compressing holes saves memory, improves spatial locality, and further increases system performance. The proposed method is efficient, stable, and easy to implement. The experimental results confirm the advantages of the proposed method over existing methods.

Keywords: Communication Set, Distributed-Memory Multicomputers, High Performance Fortran, Hole Compression, Two-Level Data-Processor Mapping.

1 Introduction

Distributed-memory multicomputers are superior to shared-memory multiprocessors in cost and scalability. These advantages have led to the extensive use of distributed-memory multicomputers in scientific and engineering computation.
However, the absence of a global shared address space makes distributed-memory multicomputers difficult to program. Programmers must not only distribute the computation and data onto processors but also manage communication among processors. Currently, parallelizing compilers try to fill the gap between users and machines and to relieve users of this tedious and error-prone work. Many researchers have investigated extensions of existing languages that give users a more convenient way to manage data distribution, such as Fortran D [6, 25], HPF (High Performance Fortran) [8, 16], and Vienna Fortran [2, 3]. The major characteristic of these languages is that they provide directives for specifying data distribution at the language level.

This work was supported in part by the NSC of the ROC under Grant #NSC E

Generally speaking, data-parallel languages support two-level data-processor mapping. A two-level data-processor mapping lets the user specify the mapping by aligning related array objects with a template, an abstract index space, and then distributing the template onto the user-declared abstract processors. In the distribution phase, these languages provide three regular data distributions: block, cyclic, and block-cyclic. A block distribution distributes contiguous blocks of array elements evenly onto processors. A cyclic distribution distributes array elements onto processors one at a time in a round-robin fashion. A block-cyclic distribution, denoted cyclic(x), distributes blocks of size $x$ onto processors in a round-robin fashion. Block-cyclic distribution is known to be the most general data distribution, since a block or a cyclic distribution can be represented by a block-cyclic distribution: a cyclic($\lceil N_A/P \rceil$) distribution is, basically, a block distribution, and the cyclic distribution can be represented by cyclic(1), where $N_A$ is the number of array elements and $P$ is the number of processors. Consequently, the program model considered in this paper is the one shown in Fig. 1, where $P$ is the number of processors, $s$ is the alignment stride, $o$ is the alignment offset, and $x$ is the distribution block size.

    !HPF$ PROCESSORS PROC(P)
    !HPF$ ALIGN A(i) WITH T(s*i + o)
    !HPF$ DISTRIBUTE T(cyclic(x)) ONTO PROC

Fig. 1: HPF-like loop model considered in the paper.

Fig. 2 illustrates an example of this model, where $P = 4$, $s = 3$, $o = 1$, and $x = 5$. The white squares represent the array elements of A, and the number in each square is the global index of that array element. The shades of gray represent template cells mapped to different processors, and the number in each square is the global index of that template cell. A template is an abstract index space whose elements have no content and occupy no storage [8]. Therefore, in practice, what matters is how array elements are distributed onto processors, not how the template cells are distributed. Furthermore, if the alignment stride is non-unit, the two-level data-processor mapping creates holes; that is, many template cells have no array element aligned with them. These holes correspond to memory locations that are never used. Fig. 3(a) illustrates the distribution of data elements onto processors without hole compression; in contrast, Fig. 3(b) illustrates the distribution with hole compression. Without hole compression, each processor must allocate 30 memory locations. However, only a few template cells have array elements aligned with them. The remaining template cells are never used and are holes. After hole compression, each processor needs only 10 memory locations.

Memory holes caused by a non-unit alignment stride waste a large amount of memory, even for a small alignment stride. Therefore, only the template cells that have array elements aligned with them need to be mapped to memory locations, and all the holes should be removed. Suppose the number of template cells is $N_T$ and $s$ is the alignment stride. Then only $N_T/s$ template cells have array elements aligned with them. The percentage of memory usage is $\frac{N_T/s}{N_T} \times 100\% = \frac{1}{s} \times 100\%$. In other words, the percentage of memory wastage is $\frac{s-1}{s} \times 100\%$.
The larger the alignment stride, the more memory is wasted. Even for the smallest non-unit alignment stride, 2, 50% of the memory space is wasted. Therefore, compressing holes is necessary and important. The process that maps the useful template cells to processors and eliminates useless holes is called hole compression. In addition to improving memory utilization, removing holes also improves spatial locality and, furthermore, achieves higher performance.

Fig. 2: A two-level data-processor mapping. Array A(i) is aligned with template T(3i + 1) and the template is then distributed onto 4 processors with cyclic(5) distribution.

Fig. 3: The distributions of data elements onto processors without and with hole compression. (a) The distribution of data elements onto processors without hole compression. (b) The distribution of data elements onto processors with hole compression.

To remove holes efficiently, this paper presents compilation techniques for compiling two-level data-processor mappings. Observing the distribution of data onto processors, one can find that the distribution pattern repeats every three blocks, as shown in Fig. 2. Based on this observation, we classify all blocks into classes and design a table, named the class table, to record the attributes of the first repeated pattern. From the processor's viewpoint, the same observation holds. Therefore, another table, the compression table, is established to record the attributes of the first repeated data distribution pattern on that processor. The compression table is built from the class table. We design systematic methods to construct these tables, and the time complexity of each construction is $O(s)$ in the worst case, where $s$ is the alignment stride. Hole compression can then be achieved easily by using the compression table. The table-lookup approach saves many redundant computations, since the values for repetitions of fixed patterns are obtained by table lookup instead of recomputation. Experimental results verify the advantages of the proposed approach. Moreover, the proposed approach is more stable than existing methods: its execution time changes little as the alignment stride and the distribution block size vary. In addition, the proposed approach is easy to implement.

This paper is organized as follows.
When compiling a two-level data-processor mapping, the class table is useful for summarizing the mapping. Therefore, the structure and characteristics of the class table are presented in Section 2. Section 3 introduces how to use the class table to compress holes and to generate the global indices of the array elements held by each processor. A useful structure termed the compression table is also introduced there. Experimental results showing the advantages of our method over existing methods are provided in Section 4. Section 5 describes the related work. Section 6 concludes the paper and points out possible directions for future research.

2 Characteristics of Class Table

For compiling array statements, we previously designed a structure that summarizes the characteristics of the access patterns of an array statement when the data distribution is block-cyclic [28]. Based on a similar concept, a structure named the class table is designed in this paper to record the attributes of a two-level data-processor mapping. In this section we briefly describe the basic components of the class table. Without loss of generality, we assume that all numbering starts from zero, including the numbering of array elements, template cells, and processors.

Suppose array element A(i) is aligned with template T at $(s \cdot i + o)$, where $s$ is the alignment stride and $o$ is the alignment offset. Two different alignment offsets $o_1$ and $o_2$ lead to isomorphic data-processor mappings if and only if $o_1 \equiv o_2 \pmod{s}$. In order to reuse the same class table for isomorphic data-processor mappings, the class table replaces the alignment offset $o$ with a reduced offset $r$, where $r = (o \bmod s)$. For example, suppose array A(i) is aligned with a template T at $(3i + 10)$. The alignment phase is illustrated in Fig. 4. In this example, $s = 3$ and $o = 10$. Since $10 > 3$, the alignment offset is reduced to 1, which is equal to $(10 \bmod 3)$.
For the reduced alignment, a template cell $t$ has an array element aligned with it if and only if $t \equiv r \pmod{s}$, such as the template cells 1, 4, 7, 10, 13, 16, ... in this example. Among the template cells that have array elements aligned with them, a template cell $t$ is active if $o \le t \le s(N_A - 1) + o$; otherwise, it is termed pseudo active, where $N_A$ is the number of array elements. For this example, the active template cells start from 10 and proceed with stride 3, and the template cells 1, 4, and 7 are pseudo active elements. Fig. 4 also illustrates the active elements and pseudo active elements for this example. Note that pseudo active elements are treated the same as active elements when the class table and the compression table are generated. However, pseudo active elements are not counted when the compressed local array is generated.

Fig. 4: The illustration of active elements and pseudo active elements. The boldfaced numbers are active elements and the italic numbers are pseudo active elements.

For a cyclic(x) distribution, every $x$ template cells form a block. A block is numbered according to its occurrence in the data-processor mapping. Suppose the numbers of array elements and template cells are $N_A$ and $N_T$, respectively. Let $N_b$ be the number of blocks. Thus $N_b = \lceil N_T/x \rceil$. According to the alignment stride and offset, blocks of the same format can be classified into the same class. In other words, blocks whose active elements are at the same positions within the block are classified into the same class. For example, consider the two-level data-processor mapping shown in Fig. 2. Blocks 0, 3, 6, 9, ... have the same positions of active elements. These blocks are classified into the same class. Similarly, blocks 1, 4, 7, 10, ... can be classified into a class, and blocks 2, 5, 8, 11, ... form another class. The following theorem shows that, for a two-level data-processor mapping, all blocks can be classified into classes according to the positions of the active elements within the blocks, and that the number of classes is equal to $s/\gcd(s, x)$, where $\gcd(a, b)$ is the greatest common divisor of $a$ and $b$.
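The classification described above can be checked by brute force. The following is a minimal sketch, not the authors' implementation: the function name `block_patterns` is hypothetical, and it simply lists, for each block, the positions of the template cells $t$ with $t \equiv r \pmod{s}$. It then checks that the patterns repeat every $s/\gcd(s, x)$ blocks for the example of Fig. 2.

```python
from math import gcd

def block_patterns(s, r, x, n_blocks):
    """For each block, return the in-block positions of cells t with t % s == r."""
    patterns = []
    for b in range(n_blocks):
        start = b * x  # global index of the first template cell in block b
        patterns.append(tuple(t - start for t in range(start, start + x)
                              if t % s == r))
    return patterns

# Example of Fig. 2: stride s = 3, reduced offset r = 1, block size x = 5.
s, r, x = 3, 1, 5
patterns = block_patterns(s, r, x, n_blocks=12)
n_c = s // gcd(s, x)  # number of classes claimed by the text

# Blocks b and b + n_c should carry active elements at the same positions.
periodic = all(patterns[b] == patterns[b + n_c]
               for b in range(len(patterns) - n_c))
print(patterns[:n_c], periodic)
```

Note that counting distinct patterns would undercount the classes when $s > x$, because several classes can then be empty blocks; the sketch therefore checks periodicity rather than distinctness.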
Theorem 1 For any two-level data-processor mapping, suppose array A is aligned with T at a stride $s$ and an offset $o$, and template T is distributed onto processors using cyclic(x) distribution. All template blocks can be classified into $s/\gcd(s, x)$ classes.

Proof. Since every block has the same size and the alignment stride is fixed at $s$, we only have to prove that if the position of the first active element of some block is the same as that of another block, then the positions of the remaining active elements within the two blocks are the same. Suppose the position of the first active element within block $b$ is $\alpha$, $0 \le \alpha < s$. We would like to show that the position of the first active element is again $\alpha$ for every period of $s/\gcd(s, x)$ blocks. The positions of active elements before or after $\alpha$ can be formulated as $((\alpha + k s) \bmod x)$, where $k \in \mathbb{Z}$ and $\mathbb{Z}$ is the set of integers. We claim that $(\alpha + k' s) \bmod x = \alpha$ for some $k'$. Then the number of blocks between $b$ and the block whose first active element position is the same as that of $b$ is $k' s / x$. Since $(\alpha + k' s) \bmod x = \alpha$, it follows that $\alpha + k' s = x q + \alpha$, where $q \in \mathbb{Z}$. Thus, $k' s = x q$. Because $k'$ and $q$ are integers, the solutions to $k' s = x q$ are $k' = t \cdot \mathrm{lcm}(s, x)/s$ and $q = t \cdot \mathrm{lcm}(s, x)/x$, where $t \in \mathbb{Z}$ and $\mathrm{lcm}(a, b)$ is the least common multiple of $a$ and $b$. Therefore, $k' s / x = q = t \cdot \mathrm{lcm}(s, x)/x = t \cdot s/\gcd(s, x)$. In other words, for every period of $s/\gcd(s, x)$ blocks, the first active elements of these blocks are at the same position. Thus, for an arbitrary two-level data-processor mapping, blocks can be classified into $s/\gcd(s, x)$ classes. The theorem is therefore obtained. □

Let $N_c$ be the number of classes. By Theorem 1, $N_c = s/\gcd(s, x)$. Since all blocks are classified into $s/\gcd(s, x)$ classes, we assign class numbers to blocks in order of occurrence, and the numbering repeats every $s/\gcd(s, x)$ blocks. In other words, block $b$ belongs to class $(b \bmod N_c)$. Blocks $b_1$ and $b_2$ belong to the same class if and only if $b_1 \equiv b_2 \pmod{N_c}$. Blocks $\{b, (b + 1), (b + 2), \ldots, (b + N_c - 1) \mid b \equiv 0 \pmod{N_c}\}$ form a period in terms of class, and we call such a period a class cycle. For a class cycle, a class table records the first active element, the last active element, and the number of active elements of each class. For a class $c$, $A_f(c)$ represents the order of occurrence in a class cycle of the first active element in $c$, $A_l(c)$ the order of occurrence in a class cycle of the last active element in $c$, and $A_n(c)$ the number of active elements within $c$.

For the example shown in Fig. 2, the class table is shown in Fig. 5. In this example, blocks can be classified into 3 classes. Blocks 0, 3, 6, 9, ... are class 0, blocks 1, 4, 7, 10, ... are class 1, and blocks 2, 5, 8, 11, ... are class 2. Blocks 0, 1, 2 form a class cycle, blocks 3, 4, 5 form another class cycle, and so on. For the first class in a class cycle, the occurrences of the first and the last active elements are 0 and 1, respectively, and the number of active elements in this class is 2. Thus, $A_f(0) = 0$, $A_l(0) = 1$, and $A_n(0) = 2$. Likewise, $A_f(1) = 2$, $A_l(1) = 2$, and $A_n(1) = 1$ for class 1, and $A_f(2) = 3$, $A_l(2) = 4$, and $A_n(2) = 2$ for class 2.

    c    A_f   A_l   A_n
    0     0     1     2
    1     2     2     1
    2     3     4     2

Fig. 5: Class table for the example shown in Fig. 2.

    N_c = s / gcd(s, x)                      /* the number of classes */
    r = o mod s                              /* reduced alignment offset */
    DO c = 0 TO (N_c - 1)
        A_f(c) = ⌈max(c x - r, 0) / s⌉
        A_l(c) = ⌊((c + 1) x - r - 1) / s⌋
        A_n(c) = A_l(c) - A_f(c) + 1
    ENDDO

Fig. 6: Class Table Generation algorithm.

The algorithm to generate a class table is presented in Fig. 6 and is termed the Class Table Generation algorithm. The major factors affecting the construction of a class table are the alignment stride $s$, the alignment offset $o$, and the distribution block size $x$. The time complexity of the Class Table Generation algorithm is $O(s/\gcd(s, x))$.
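The Class Table Generation algorithm of Fig. 6 translates almost line for line into Python. The sketch below is illustrative only; the function name `class_table` and the tuple layout `(A_f, A_l, A_n)` are assumptions made for this example, not part of the paper.

```python
from math import gcd

def class_table(s, o, x):
    """Return a list of (A_f, A_l, A_n) tuples, one per class (Fig. 6)."""
    n_c = s // gcd(s, x)   # number of classes, N_c
    r = o % s              # reduced alignment offset
    table = []
    for c in range(n_c):
        a_f = -(-max(c * x - r, 0) // s)   # ceiling division
        a_l = ((c + 1) * x - r - 1) // s   # floor division
        a_n = a_l - a_f + 1                # number of active elements
        table.append((a_f, a_l, a_n))
    return table

# Example of Fig. 2: s = 3, o = 1, x = 5.
print(class_table(s=3, o=1, x=5))
# → [(0, 1, 2), (2, 2, 1), (3, 4, 2)]
```

The printed rows agree with the values of $A_f$, $A_l$, and $A_n$ given in the text for classes 0, 1, and 2. Python's `//` rounds toward negative infinity, which matches the floor in the definition of $A_l$ even when the numerator is negative.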
In addition to the meanings described above, the class table also contains additional information implicitly. For a class $c$, $A_f(c)$ gives the total number of active elements contained in classes 0 to $(c - 1)$, and $(A_l(c) + 1)$ gives the total number of active elements contained in classes 0 up to $c$. Therefore, the number of active elements contained in a class cycle is equal to $A_l(N_c - 1) + 1$. Let $A_c$ be the number of active elements contained in a class cycle. Thus, $A_c = A_l(N_c - 1) + 1$.

Note that if the block size $x$ is larger than the alignment stride $s$, each block contains at least one active element. Thus, $A_l(c)$ is always larger than or equal to $A_f(c)$, where $c$ is the class number of that block. However, if the block size is smaller than the alignment stride, each block contains at most one active element, and some blocks contain no active element at all. For any block that contains no active element, $A_f(c)$ is one larger than $A_l(c)$. Lemma 1 states this property.
Lemma 1 For any block $b$, $A_f(c) = A_l(c) + 1 \iff A_n(c) = 0$, where $c$ is the class number of $b$.

Proof. ($\Rightarrow$): Since $A_n(c) = A_l(c) - A_f(c) + 1$ and $A_f(c) = A_l(c) + 1$, obviously $A_n(c) = 0$. ($\Leftarrow$): Since $A_n(c) = A_l(c) - A_f(c) + 1$ and $A_n(c) = 0$, clearly $A_f(c) = A_l(c) + 1$. □

The construction of a class table from a two-level data-processor mapping has been introduced. The following section shows how to use the class table to construct a compression table, which extracts information from the repeated data distribution pattern on a processor. The generation of the compressed local array for a processor by using the compression table is described next. The transformations between global and local indices are addressed as well.

3 Hole Compression

HPF provides programmers with directives to explicitly specify regular data distributions. It adopts two-level data-processor mapping. That is, related array objects are first aligned with each other or with a template; this group of arrays is then distributed onto the user-declared abstract processors. A two-level data-processor mapping causes many useless holes if the alignment stride is non-unit. Memory holes waste memory and degrade system performance. Therefore, this section proposes a compilation technique that eliminates these useless holes entirely and systematically generates the exact sequence of array elements held by each processor.

3.1 Construction of Compression Table

Suppose a two-level data-processor mapping is of the form shown in Fig. 1. By Theorem 1, for an arbitrary two-level data-processor mapping, all blocks can be classified into $N_c$ classes, where $N_c = s/\gcd(s, x)$. Therefore, the same data distribution pattern for all processors repeats every $\mathrm{lcm}(N_c, P)$ blocks. In other words, the blocks within one data distribution pattern may differ from each other, but corresponding blocks in different data distribution patterns are identical.
Hence, we only have to consider the first data distribution pattern; the remaining data distribution patterns can be handled likewise. As a result, a compression table records the information needed to generate the compressed local array for each processor on the first data distribution pattern. Like the class table, the compression table replaces an alignment offset $o$ with the reduced alignment offset $r$ if $o \ge s$, where $r = (o \bmod s)$.

Let the number of blocks in a data distribution pattern be $N_{pb}$. Thus $N_{pb} = \mathrm{lcm}(N_c, P)$. The number of blocks on each processor within a data distribution pattern is then $N_{pb}/P$, which can also be written as $N_c/\gcd(N_c, P)$. Let the number of blocks on a processor within a data distribution pattern be $N_{pb}^p$; then $N_{pb}^p = N_c/\gcd(N_c, P)$. That is, on each processor, each data distribution pattern contains $N_{pb}^p$ blocks. We number a block on a processor within a data distribution pattern according to its order of occurrence in the data distribution pattern. For the example shown in Fig. 2, a data distribution pattern contains 12 ($= N_{pb}$) blocks. Blocks 0 to 11 form the first data distribution pattern and blocks 12 to 23 form the second. Within the first data distribution pattern, blocks 0 and 4 are the first and the second blocks to occur on processor 0, and blocks 1 and 5 are the first and the second blocks to occur on processor 1. Within the second data distribution pattern, blocks 12 and 13 are the first blocks to occur on processors 0 and 1, respectively, and so on. The notions of class, class cycle, data distribution pattern, and occurrence of a block within a data distribution pattern are illustrated in Fig. 7.

Fig. 7: The representations of the notations class, class cycle, data distribution pattern, and occurrence of a block within a data distribution pattern.

For a processor, the data distributions in different data distribution patterns are identical. The only difference between the first data distribution pattern and the others lies in the indices of corresponding cells, and the indices of every pair of corresponding cells differ only by a fixed offset. Hence, for a processor, only the hole compression of the first data distribution pattern needs to be considered; for the remaining data distribution patterns, only the fixed offset needs to be evaluated. For every block allocated on processor $p$ in the first data distribution pattern, if the following three items can be evaluated, we can systematically generate the compressed local array, and hole compression is easily achieved.

    $C_g^p$, the global index of the first array element in a block on processor $p$,
    $C_n^p$, the number of array elements in a block on processor $p$,
    $C_l^p$, the local index in the compressed local array of the first array element in a block on processor $p$.

Therefore, to facilitate hole compression, we design a structure, termed the compression table, to characterize the blocks in a data distribution pattern on each processor. For a block on processor $p$, the compression table records the three items $C_g^p$, $C_n^p$, and $C_l^p$. Take the compression table of processor $p_0$ as an example. Block $b_0$ is the first block to occur within the first data distribution pattern on processor $p_0$.
The global index of the first array element in block $b_0$ is 0, the number of array elements in the block is 2, and the local index of array element A(0) in the compressed local array is 0. Thus, $C_g^0(0) = 0$, $C_n^0(0) = 2$, and $C_l^0(0) = 0$. The second block to occur within the first data distribution pattern on processor $p_0$ is block $b_4$. The global index of the first array element in block $b_4$ is 7, the number of array elements in the block is 1, and the local index of A(7) in the compressed local array is 2. Therefore, $C_g^0(1) = 7$, $C_n^0(1) = 1$, and $C_l^0(1) = 2$. Similarly, $C_g^0(2) = 13$, $C_n^0(2) = 2$, and $C_l^0(2) = 3$. The compression tables of processors $p_0$, $p_1$, $p_2$, and $p_3$ for the example shown in Fig. 2 are illustrated in Fig. 8 (a), (b), (c), and (d), respectively.

Fig. 8: Compression tables for processors $p_0$, $p_1$, $p_2$, and $p_3$ in the example shown in Fig. 2, each with columns occurrence, $C_g^p$, $C_n^p$, and $C_l^p$. (a) The compression table of processor $p_0$. (b) The compression table of processor $p_1$. (c) The compression table of processor $p_2$. (d) The compression table of processor $p_3$.

A compression table for processor $p$ is constructed systematically as follows. Since all blocks are classified into $N_c$ classes, every $N_c$ blocks form a class cycle. For any block $b$, $\lfloor b/N_c \rfloor$ class cycles appear before $b$. Furthermore, each class cycle contains $A_c$ active elements. Hence, there are $\lfloor b/N_c \rfloor A_c$ active elements in these class cycles. From Section 2, $A_f(c)$ implicitly gives the number of active elements in classes 0 to $(c - 1)$ of a class cycle. As a result, for any block $b$, the global index of the first array element in the block is $\lfloor b/N_c \rfloor A_c + A_f(c)$, where $c$ is the class number of $b$. Therefore, $C_g^p = \lfloor b/N_c \rfloor A_c + A_f(c)$. On the other hand, $C_n^p$ is given by $A_n(c)$, and $C_l^p$ is the local index of the first array element in the previous block plus the number of active elements contained in the previous block. Of course, the local index of the first array element in the first block of the first data distribution pattern is 0. Note that $C_l^p$ also represents the number of active elements that occur before the block in a data distribution pattern. The algorithm to generate a compression table is shown in Fig. 9.
The time complexity of the Compression Table Generation algorithm is $O(N_c/\gcd(N_c, P))$, where $N_c$ is the number of classes and $P$ is the number of processors. By Section 2, $N_c = s/\gcd(s, x)$. Hence, the worst case of the Compression Table Generation algorithm is $O(s)$, the same as that of the Class Table Generation algorithm. Note that if the alignment stride is smaller than or equal to the distribution block size ($s \le x$), each block contains at least one active element. In that case, the first array element of a block obtained by the above calculation is exactly the first array element of that block. In contrast, if the alignment stride is larger than the distribution block size ($s > x$), a block may contain no active element. The first array element of such an empty block, as obtained by the above calculation, is a pseudo array element. However, the pseudo array element does not affect the correctness of the proposed approach. We show this in the following section.
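The construction described above, listed as the Compression Table Generation algorithm in Fig. 9, can be sketched in Python as follows. This is a hedged illustration under stated assumptions: the function name `compression_table` is hypothetical, the class table is passed in as a list of `(A_f, A_l, A_n)` tuples, and a single running counter replaces the `addr` and `nelem` variables of the figure (the two formulations produce the same $C_l^p$ values).

```python
from math import gcd

def compression_table(p, P, class_tbl):
    """Return (C_g, C_n, C_l) rows for processor p in the first pattern."""
    n_c = len(class_tbl)              # number of classes, N_c
    a_c = class_tbl[-1][1] + 1        # A_c = A_l(N_c - 1) + 1
    n_ppb = n_c // gcd(n_c, P)        # blocks per processor per pattern
    rows = []
    b = p                             # first block owned by processor p
    local = 0                         # running local index (addr + nelem)
    for _ in range(n_ppb):
        a_f, _a_l, a_n = class_tbl[b % n_c]
        c_g = (b // n_c) * a_c + a_f  # global index of first element
        rows.append((c_g, a_n, local))
        local += a_n
        b += P                        # next block on processor p
    return rows

# Class table of Fig. 5 (s = 3, o = 1, x = 5), with P = 4 processors.
CLASS_TBL = [(0, 1, 2), (2, 2, 1), (3, 4, 2)]
print(compression_table(0, 4, CLASS_TBL))
# → [(0, 2, 0), (7, 1, 2), (13, 2, 3)]
```

The printed rows for processor $p_0$ match the values $C_g^0$, $C_n^0$, and $C_l^0$ stated in the text for blocks $b_0$, $b_4$, and $b_8$.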
    b = p; nelem = 0; addr = 0                      /* initialization */
    N_pb^p = N_c / gcd(N_c, P)
    DO occurrence = 0 TO (N_pb^p - 1)
        c = b mod N_c                               /* the class number of block b */
        C_g^p(occurrence) = ⌊b / N_c⌋ A_c + A_f(c)
        C_n^p(occurrence) = A_n(c)
        C_l^p(occurrence) = addr + nelem
        b = b + P                                   /* the next block on processor p */
        nelem = C_n^p(occurrence)                   /* keep the current value for the next C_l^p */
        addr = C_l^p(occurrence)                    /* keep the current value for the next C_l^p */
    ENDDO

Fig. 9: Compression Table Generation algorithm.

3.2 Generation of Compressed Local Arrays

This subsection describes how to use the compression table to generate the compressed local array on a processor. Different alignment strides and offsets cause different data-processor mappings. To simplify the discussion, we first consider a two-level data-processor mapping without pseudo active elements. That is, we assume $o < s$. However, a compressed local array should not include pseudo active elements if they exist. Therefore, the discussion of the general two-level data-processor mapping, which contains pseudo active elements, follows afterwards.

Special Case of Two-Level Data-Processor Mapping

Generally speaking, the main factors affecting a two-level data-processor mapping are the alignment stride $s$, the alignment offset $o$, the distribution block size $x$, and the number of processors $P$. However, two different alignment offsets $o_1$ and $o_2$ lead to isomorphic data-processor mappings if and only if $o_1 \equiv o_2 \pmod{s}$. Hence, the constructions of the class table and the compression table treat two alignment offsets as the same if they lead to isomorphic data-processor mappings. That is, if $o \ge s$, we shift the alignment offset from $o$ to $r$ when generating the class table and the compression table, where $r$ equals $(o \bmod s)$. In this subsection, to simplify the discussion of the generation of the compressed local array, we do not take pseudo active elements into consideration.
Specifically, we assume that the alignment offset $o$ is smaller than the alignment stride $s$. In addition, since the basic unit for generating the compressed local array in the proposed method is a data distribution pattern, we also require that the array elements span a whole number of data distribution patterns. This avoids evaluating the pseudo active elements that occur after the last active element, $s(N_A - 1) + o$. Fig. 2 is an example of such a two-level data-processor mapping. Again, we consider that array A(i) is aligned with $T(s \cdot i + o)$ and T is distributed onto $P$ processors using cyclic(x) distribution.

As previously stated, the data distributions in different data distribution patterns are identical except for the indices of the corresponding elements. Let the global indexing offset be the difference between the global indices of two corresponding array elements on two contiguous data distribution patterns. Then, if one array element is known, the corresponding array element on the previous or next data distribution pattern can be obtained by subtracting the global indexing offset from, or adding it to, the global index of that array element. Hence, a straightforward approach to generating the compressed local array for processor $p$ is first to generate the compressed local array for the first data distribution pattern according to the compression table, and then to generate the compressed local array for the following data distribution patterns from the previously generated compressed local array and the global indexing offset.

    Algorithm: Hole Compression algorithm for the special case of two-level data-processor mapping.
    Input: Compression table and global indexing offset.
    Output: $A^p$, the compressed local array on processor $p$.
    Procedure:
        Step 1. Generate the compressed local array for the first data distribution pattern on processor $p$ by using the compression table.
        Step 2. Generate $A^p$ for the rest of the data distribution patterns according to the global indexing offset.

Fig. 10: A simple Hole Compression algorithm for processor $p$ in the special case of two-level data-processor mapping.

Consider the two-level data-processor mapping shown in Fig. 2 and take the generation of the compressed local array for processor $p_0$ as an example. The compressed local array for the first data distribution pattern is A(0, 1, 7, 13, 14). Accordingly, the compressed local array for the second data distribution pattern is A(20, 21, 27, 33, 34), since the global indexing offset is 20. Thus, the compressed local array of processor $p_0$ is A(0, 1, 7, 13, 14, 20, 21, 27, 33, 34). A simple Hole Compression algorithm for the special case of two-level data-processor mapping is shown in Fig. 10.

The algorithm shown in Fig. 10 is straightforward but less efficient. Based on a similar idea, we propose a more efficient algorithm to generate the compressed local array for the special case of two-level data-processor mapping. Let the local indexing offset be the difference between the local indices, in a compressed local array, of two corresponding array elements on two contiguous data distribution patterns.
Since the differences between the global indices and between the local indices of two corresponding array elements on two contiguous data distribution patterns are fixed, and equal to the global and the local indexing offsets, respectively, the global indices of the compressed local array elements on the latter data distribution pattern can be obtained by adding the global indexing offset to the global indices of the corresponding elements on the former data distribution pattern. Similarly, the local indices of the compressed local array elements on the latter data distribution pattern can be obtained by adding the local indexing offset to the local indices of the corresponding elements on the former data distribution pattern.

Once the global and local indexing offsets have been determined, the compressed local array for processor $p$ can be generated efficiently as follows. For the blocks of the same occurrence on every data distribution pattern, the global index of the first array element is obtained from the compression table, and for the remaining array elements at the same position in each block on each data distribution pattern, the index is obtained by adding the global indexing offset to the global index of the previous corresponding array element. Similarly, the local index of the first array element in the first block to occur on the first data distribution pattern is also obtained from the compression table. For the other corresponding array elements, the local indices are obtained by adding the local indexing offset to the local index of the previous corresponding array element. Fig. 11 illustrates the efficient algorithm to generate the compressed local array for processor $p$ in the special case of two-level data-processor mapping.
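The procedure described above can be sketched in Python for the special case (no pseudo active elements, whole data distribution patterns only). This is an illustrative sketch rather than the authors' code: the function name `compressed_local_array` and the parameter `n_patterns` are assumptions, and the compression table and the offsets 20 and 5 are the values the paper gives for processor $p_0$ in the example of Fig. 2.

```python
def compressed_local_array(table, g_off, l_off, n_patterns):
    """Fill the compressed local array from a compression table.

    table      -- rows (C_g, C_n, C_l) for the first data distribution pattern
    g_off      -- global indexing offset between contiguous patterns
    l_off      -- local indexing offset between contiguous patterns
    n_patterns -- number of whole data distribution patterns (assumed input)
    """
    local = [None] * (l_off * n_patterns)
    for c_g, c_n, c_l in table:            # blocks of the same occurrence
        for k in range(n_patterns):        # same block in every pattern
            for j in range(c_n):           # elements inside the block
                local[c_l + k * l_off + j] = c_g + k * g_off + j
    return local

# Processor p_0 in the example of Fig. 2.
P0_TABLE = [(0, 2, 0), (7, 1, 2), (13, 2, 3)]
print(compressed_local_array(P0_TABLE, g_off=20, l_off=5, n_patterns=2))
# → [0, 1, 7, 13, 14, 20, 21, 27, 33, 34]
```

Each slot holds the global index of the array element stored at that local index, so `local[5] == 20` expresses that A(20) is mapped to local position 5. The sketch assumes that the local indexing offset equals the number of elements the processor holds per pattern, which is what sizes the output list.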
Algorithm: Hole Compression algorithm for the special case of two-level data-processor mapping.
Input: Compression table, global and local indexing offsets.
Output: $A_p$, the compressed local array on processor $p$.
Procedure:
Step 1. for the blocks of the same occurrence on every data distribution pattern do
  Fill in $A_p$ according to the compression table, global and local indexing offsets.
Fig. 11: Hole Compression algorithm for processor $p$ in the special case of two-level data-processor mapping.

Again, take the generation of the compressed local array for processor $p_0$ in Fig. 2 as an example. The global index of the first array element in block $b_0$ on the first data distribution pattern is 0, which can be obtained by looking up the compression table of processor $p_0$. The local index of $A(0)$ is 0, which can also be obtained from the compression table. For the corresponding array element on the second data distribution pattern, the global index is 20, which is equal to $20 + 0$, where 20 is the global indexing offset and 0 is the global index of the previous corresponding array element. The local index of $A(20)$ is 5, which equals $5 + 0$, where 5 is the local indexing offset and 0 is the local index of the previous corresponding array element. That is, $A(20)$ is mapped to $A_{p_0}(5)$. In the same way, we obtain the compressed local array $A_{p_0} = A(0, 1, 7, 13, 14, 20, 21, 27, 33, 34)$.

Implementation Issues

The main concept for generating the compressed local array in the special case of two-level data-processor mapping has been discussed in the previous section. However, some implementation issues should be addressed. Since the data distributions among different data distribution patterns are identical except for the indices of the corresponding elements, the key to generating the compressed local array is to evaluate the global indexing offset.
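The worked example can be cross-checked by enumerating the mapping directly. The brute-force routine below is not the table-lookup method of the paper; it only serves as a reference. The Fig. 2 parameters used here (stride 3, offset 1, cyclic(5), 4 processors) follow the comparison with Fig. 12 made later in the text, and the array size of 40 is an assumption inferred from the two data distribution patterns of 20 elements each.

```python
def brute_force_local_arrays(n_a, stride, offset, block, procs):
    """Reference enumeration: A(i) is aligned with T(stride*i + offset), and the
    template is distributed cyclic(block) over `procs` processors."""
    local = [[] for _ in range(procs)]
    for i in range(n_a):
        cell = stride * i + offset            # template cell holding A(i)
        owner = (cell // block) % procs       # block number modulo processor count
        local[owner].append(i)                # holes are never stored
    return local


# Assumed Fig. 2 parameters: N_A = 40, s = 3, o = 1, x = 5, P = 4.
arrays = brute_force_local_arrays(40, 3, 1, 5, 4)
print(arrays[0])              # compressed local array of p0 (global indices)
print(arrays[0].index(20))    # local index of A(20) on p0
```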
Evidently the difference of the global indices of two corresponding array elements on two contiguous data distribution patterns is the number of active elements in a data distribution pattern. Let $A_{ptn}$ be the number of active elements in a data distribution pattern. Thus $A_{ptn}$ is the global indexing offset. The number of active elements in a data distribution pattern can be evaluated by $A_{ptn} = \lceil (N_{pb} \cdot x)/s \rceil$, where $N_{pb}$ is the number of blocks in a data distribution pattern. For instance, in the example shown in Fig. 2, the first blocks in the first and the second data distribution patterns are blocks 0 and 12, respectively. The first array elements in blocks 0 and 12 are 0 and 20, respectively. The difference between the global indices of the two corresponding array elements is 20, which is equal to $A_{ptn}$. The global indices of two corresponding array elements on two contiguous data distribution patterns thus differ by a fixed amount. Similarly, the local indices in a processor's compressed local array for two corresponding array elements on two contiguous data distribution patterns also differ by a fixed amount. We term this fixed difference the local indexing offset. Clearly, the difference of the local indices in a processor's compressed local array for two corresponding array elements on two contiguous data distribution patterns equals the number of active elements of a data distribution pattern allocated onto that processor. Let $A^p_{ptn}$ be the number of active elements allocated onto processor $p$ in a data distribution pattern. Obviously, $A^p_{ptn}$ is the local
indexing offset. From the compression table, $A^p_{ptn} = C^p_n(N^p_{pb} - 1) + C^p_l(N^p_{pb} - 1)$. For processor $p_0$ in the example shown in Fig. 2, the number of active elements on a data distribution pattern is 5. The first array elements on the first and the second data distribution patterns allocated onto processor $p_0$ are 0 and 20, respectively. The local index of array element 0 in the compressed local array is 0 and that of array element 20 is 5. The difference of the local indices in the compressed local array of $p_0$ for these two corresponding array elements is 5, which is equal to $A^{p_0}_{ptn}$. Suppose there are $N_{ptn}$ data distribution patterns in the special case of two-level data-processor mapping. The number of data distribution patterns can be obtained by $N_{ptn} = N_b / N_{pb}$, where $N_b$ is the number of blocks and $N_{pb}$ is the number of blocks in a data distribution pattern. Since the array elements are distributed over a whole number of data distribution patterns, the number of compressed local array elements on processor $p$ can be figured out as follows. Let the number of compressed local array elements on processor $p$ be $N^p_A$. Then $N^p_A = N_{ptn} \cdot A^p_{ptn}$. The code to generate the compressed local array on processor $p$ for the special case of two-level data-processor mapping is shown in Appendix A.

General Case of Two-Level Data-Processor Mapping

The concept used in the special case of two-level data-processor mapping carries over to the general case. However, the pseudo active elements have to be taken into consideration in the general case. For example, Fig. 12 is a general two-level data-processor mapping. In this data-processor mapping, array $A$ has 30 elements indexed from 0 to 29. Array element $A(i)$ is aligned with a template $T$ at stride 3 and offset 28. Template $T$ is distributed onto 4 processors using cyclic(5) distribution. The two data-processor mappings shown in Figs.
2 and 12 are isomorphic since the alignment strides, distribution block sizes and numbers of processors of the two mappings are identical, and the alignment offsets satisfy $1 \equiv 28 \pmod{3}$. The class table and compression table used by the data-processor mapping shown in Fig. 2 are the same as those for Fig. 12. The compressed local array of processor $p_0$ in Fig. 2 is $A_{p_0} = A(0, 1, 7, 13, 14, 20, 21, 27, 33, 34)$, which has been introduced in the previous section. However, in Fig. 12, taking the pseudo active elements into consideration, the compressed local array becomes $A_{p_0} = A(4, 5, 11, 12, 18, 24, 25)$. The Hole Compression algorithm for the special case of two-level data-processor mapping ignores the effect caused by the pseudo active elements. In the general case of two-level data-processor mapping, the first and the last data distribution patterns may have pseudo active elements. These two data distribution patterns should be considered separately. However, the data distribution patterns between the first and the last contain no pseudo active elements. Hence the generation of the compressed local array for these data distribution patterns can adopt the Hole Compression algorithm for the special case of data-processor mapping. The evaluations of the global and local indexing offsets in the special case of two-level data-processor mapping are also important for the general two-level data-processor mapping, where the global and local indexing offsets are respectively the differences of the global indices and the local indices in a compressed local array for two corresponding array elements on two contiguous data distribution patterns. In addition, for processor $p$, we still have to count
Fig. 12: A general two-level data-processor mapping. Array $A(i)$ is aligned with template $T(3i + 28)$ and the template is then distributed onto 4 processors with cyclic(5) distribution, assuming $N_A = 30$.
Algorithm: Hole Compression algorithm for the general two-level data-processor mapping.
Input: Compression table, $N^p_A$, global and local indexing offsets.
Output: $A_p$, the compressed local array on processor $p$.
Procedure:
Step 1. Generate the first $A^p_{ptn}$ compressed local array elements according to the compression table and the number of pseudo active elements.
Step 2. Generate the next $(\lfloor N^p_A / A^p_{ptn} \rfloor - 1) \cdot A^p_{ptn}$ compressed local array elements. Each group of $A^p_{ptn}$ elements is obtained by adding the global indexing offset to the previous $A^p_{ptn}$ elements.
Step 3. Generate the last $(N^p_A \bmod A^p_{ptn})$ compressed local array elements according to the previous $A^p_{ptn}$ compressed local array elements and the global indexing offset.
Fig. 13: Hole Compression algorithm for processor $p$ on a general two-level data-processor mapping.

the number of pseudo active elements on $p$ before the first active element $o$ and the number after the last active element $s(N_A - 1) + o$ within the block where the last active element is allocated. Consider the generation of the compressed local array of processor $p_0$ for the general case of two-level data-processor mapping shown in Fig. 12. Since the data-processor mapping shown in Fig. 12 is isomorphic to that in Fig. 2, the compression tables used by the two data-processor mappings are identical and have been shown in Fig. 8. The global and local indexing offsets are 20 and 5, respectively, the same as those in the special case of two-level data-processor mapping shown in Fig. 2. The compressed local array of processor $p_0$ without considering the impact of the pseudo active elements is $A_{p_0} = A(0, 1, 7, 13, 14, 20, 21, 27, 33, 34)$, as discussed for the special case. Now take the pseudo active elements into consideration. There are 9 pseudo active elements before $o$ in the two-level data-processor mapping, located at template cells 1, 4, 7, 10, 13, 16, 19, 22, and 25.
This implies that, in this example, 9 has to be subtracted from every global index obtained in the special case of two-level data-processor mapping to get the real index in the general case. Thus, $A_{p_0}$ becomes $A(-9, -8, -2, 4, 5, 11, 12, 18, 24, 25)$. In addition, there are three pseudo active elements on processor $p_0$ (template cells 1, 4, and 22). As a result, the first three array elements should be removed. Thus, the compressed local array of processor $p_0$ for the general two-level data-processor mapping in Fig. 12 is $A_{p_0} = A(4, 5, 11, 12, 18, 24, 25)$. However, if the number of compressed local array elements on processor $p$ is known, the compressed local array on processor $p$ can be generated more efficiently. Let the number of compressed local array elements on processor $p$ be $N^p_A$. Suppose the local indexing offset is $A^p_{ptn}$. This implies that the global indices of array elements in the compressed local array differ by a fixed offset, the global indexing offset, with a period of $A^p_{ptn}$ elements. For example, in the two-level data-processor mapping shown in Fig. 12, the global and local indexing offsets are 20 and 5, respectively. If the global index of the first compressed local array element is 4, the global index of the sixth $(= 1 + 5)$ compressed local array element is $24\ (= 4 + 20)$. Thus we can generate the first $A^p_{ptn}$ elements of the compressed local array by using the compression table and the number of pseudo active elements. After that, we can generate each next group of $A^p_{ptn}$ elements from the previous $A^p_{ptn}$ elements and the global indexing offset, until only the last $(N^p_A \bmod A^p_{ptn})$ elements remain. Finally, we generate the last $(N^p_A \bmod A^p_{ptn})$ elements in the same manner. The algorithm to generate the compressed local array for the general case of two-level data-processor mapping is demonstrated in Fig. 13.
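The three steps of the general algorithm can be sketched as below, under stated assumptions. The first pattern is passed as a plain list of special-case global indices, and the two pseudo-active counts are passed in as parameters: `pa_head` is the number of pseudo active elements on $p$ before $o$, and `shift` is the number of pseudo active elements before $o$ on all processors. All names are illustrative. Steps 2 and 3 are merged into one loop, since both add the global indexing offset to the element one period earlier.

```python
def general_hole_compression(first_pattern, global_offset, n_local, pa_head, shift):
    """Sketch of the general-case generation for one processor.

    first_pattern : special-case global indices of the first pattern on p
    global_offset : global indexing offset
    n_local       : number of compressed local array elements on p
    pa_head       : pseudo active elements on p before the first active element
    shift         : pseudo active elements before the first active element overall
    """
    period = len(first_pattern)               # local indexing offset

    def special_case(k):
        # k-th entry of the (periodic) special-case compressed local array
        return first_pattern[k % period] + global_offset * (k // period)

    a_p = []
    for i in range(n_local):
        if i < period:                        # Step 1: skip pseudo elements, re-index
            a_p.append(special_case(pa_head + i) - shift)
        else:                                 # Steps 2 and 3: one period back + offset
            a_p.append(a_p[i - period] + global_offset)
    return a_p


# p0 in Fig. 12: 7 local elements, 3 pseudo elements on p0, 9 overall.
print(general_hole_compression([0, 1, 7, 13, 14], 20, 7, 3, 9))
```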
Although the concept of the Hole Compression algorithm for the general case of two-level data-processor mapping is simple, some implementation issues should be addressed. The details of the algorithm are explained in the following subsection.

Implementation Issues

For a processor $p$, we only need to be concerned with the blocks that contain active elements. A block which contains active elements is termed an active block. Let $b^p_f$ and $b^p_l$ respectively denote the first and the last active blocks that processor $p$ contains. They can be figured out as follows. The block numbers of the blocks contained by processor $p$ can be written in the form $p + k \cdot P$, where $k$ is a nonnegative integer. Since the first and the last active elements are $o$ and $s(N_A - 1) + o$, they are allocated to blocks $\lfloor o/x \rfloor$ and $\lfloor (s(N_A - 1) + o)/x \rfloor$, respectively. Therefore, $b^p_f$ should be greater than or equal to $\lfloor o/x \rfloor$ and $b^p_l$ should be smaller than or equal to $\lfloor (s(N_A - 1) + o)/x \rfloor$. Suppose $b^p_f = p + k_f \cdot P$ and $b^p_l = p + k_l \cdot P$, where $k_f$ and $k_l$ are nonnegative integers. Thus, $p + k_f \cdot P \ge \lfloor o/x \rfloor$ and $p + k_l \cdot P \le \lfloor (s(N_A - 1) + o)/x \rfloor$. We obtain $k_f = \lceil (\lfloor o/x \rfloor - p)/P \rceil$ and $k_l = \lfloor (\lfloor (s(N_A - 1) + o)/x \rfloor - p)/P \rfloor$. As a result,

$$b^p_f = p + \lceil (\lfloor o/x \rfloor - p)/P \rceil \cdot P, \qquad b^p_l = p + \lfloor (\lfloor (s(N_A - 1) + o)/x \rfloor - p)/P \rfloor \cdot P.$$

Any block on processor $p$ belongs to a unique data distribution pattern and has a unique occurrence in that data distribution pattern. Therefore, any block $b$ on processor $p$ can be decomposed into two factors $\alpha$ and $\beta$, where $\alpha$ and $\beta$ are respectively the data distribution pattern number and the occurrence in the data distribution pattern to which $b$ belongs. They can be obtained by $\alpha = \lfloor b/N_{pb} \rfloor$ and $\beta = \lfloor (b \bmod N_{pb})/P \rfloor$, where $N_{pb}$ is the number of blocks in a data distribution pattern and $N_{pb} = \mathrm{lcm}(N_c, P)$, which has been introduced in Section 3.1. On the other hand, given $(\alpha, \beta)_p$, it will correspond to a unique block number.
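The block and addressing-pair formulas above can be checked with a short sketch. The function and parameter names are illustrative, and the two pair components are called `alpha` (pattern number) and `beta` (occurrence) here. The value $N_{pb} = 12$ for the running example is taken from the earlier remark that blocks 0 and 12 start the first and second data distribution patterns.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers (b > 0), also correct for negative a."""
    return -(-a // b)


def first_last_active_blocks(p, s, o, x, n_a, procs):
    """First and last active blocks on processor p."""
    first_block = o // x                          # block of the first active element
    last_block = (s * (n_a - 1) + o) // x         # block of the last active element
    b_f = p + ceil_div(first_block - p, procs) * procs
    b_l = p + ((last_block - p) // procs) * procs
    return b_f, b_l


def addressing_pair(b, n_pb, procs):
    """(pattern number, occurrence within the pattern) of block b."""
    return b // n_pb, (b % n_pb) // procs


def block_number(p, alpha, beta, n_pb, procs):
    """Inverse mapping: block number for an addressing pair on processor p."""
    return p + alpha * n_pb + beta * procs


# Fig. 12 example: s = 3, o = 28, x = 5, N_A = 30, P = 4, N_pb = 12.
b_f, b_l = first_last_active_blocks(0, 3, 28, 5, 30, 4)
print(b_f, b_l, addressing_pair(b_f, 12, 4), addressing_pair(b_l, 12, 4))
```

Python's `//` rounds toward negative infinity, which is why `ceil_div` stays correct when the first active block number is smaller than `p`.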
That is, given $(\alpha, \beta)_p$, a unique block number $b$ can be obtained via $b = p + \alpha \cdot N_{pb} + \beta \cdot P$. Thus, we term $(\alpha, \beta)_p$ an addressing pair. Accordingly, the addressing pairs for the first and the last active blocks contained by processor $p$, $b^p_f$ and $b^p_l$, are represented by $(\alpha_f, \beta_f)_p$ and $(\alpha_l, \beta_l)_p$, respectively. When we generate the class table and the compression table, pseudo active elements are treated the same as active elements. However, when we generate the compressed local array, these pseudo active elements have to be excluded. Hence, in addition to evaluating the global and local indexing offsets, we have to figure out the number of pseudo active elements allocated on processor $p$ that occur before the first active element $o$. In order not to generate too many compressed local array elements, we also have to calculate the number of pseudo active elements that occur in the last active block $b^p_l$ on processor $p$. Let $PA^p_h$ denote the number of pseudo active elements allocated on processor $p$ that occur before the first active element $o$, and let $PA^p_t$ denote the number of pseudo active elements that occur in the last active block $b^p_l$. The values of $PA^p_h$ and $PA^p_t$ can be calculated as follows.

To calculate $PA^p_h$, we have to decide whether the first active element $o$ is allocated on processor $p$ or not. If the first active element $o$ is not allocated on $p$, pseudo active elements only occur in the blocks before the first active block $b^p_f$. Therefore, $PA^p_h = \alpha_f \cdot A^p_{ptn} + C^p_l(\beta_f)$, where $A^p_{ptn}$ is the number of active elements allocated on processor $p$ in a data distribution pattern. Otherwise, the first active element $o$ is allocated on processor $p$, and in addition to the blocks before the first active block $b^p_f$, the pseudo active elements in the first active block also need to be considered. Hence, $PA^p_h = \alpha_f \cdot A^p_{ptn} + C^p_l(\beta_f) + \lfloor (o \bmod x)/s \rfloor$. As a result, $PA^p_h$ can be obtained by

$$PA^p_h = \begin{cases} \alpha_f \cdot A^p_{ptn} + C^p_l(\beta_f) & \text{if } o \notin p, \\ \alpha_f \cdot A^p_{ptn} + C^p_l(\beta_f) + \lfloor (o \bmod x)/s \rfloor & \text{if } o \in p. \end{cases}$$

For the example shown in Fig. 12, $PA^p_h$ for processor $p_0$ can be evaluated as follows. The first active block on $p_0$ is $b_8$ and the addressing pair for the block is $(0, 2)_{p_0}$. Since the first active element 28 is not allocated on processor $p_0$, the pseudo active elements on $p_0$ only occur in the blocks before the first active block $b_8$, that is, blocks $b_0$ and $b_4$. Accordingly, $PA^{p_0}_h = 0 \cdot A^{p_0}_{ptn} + C^{p_0}_l(2) = 3$, where $A^{p_0}_{ptn} = 5$. As for $PA^p_h$ for processor $p_1$, the first active block on $p_1$ is $b_5$ and the addressing pair of this block is $(0, 1)_{p_1}$. Since the first active element 28 is allocated on $p_1$, the pseudo active elements occur not only in the blocks before the first active block but also in the first active block. Thus, $PA^{p_1}_h = 0 \cdot A^{p_1}_{ptn} + C^{p_1}_l(1) + 1 = 2$.

To evaluate $PA^p_t$, we also have to decide whether the last active element $s(N_A - 1) + o$ is allocated on processor $p$ or not. If the last active element is not allocated on $p$, there is no pseudo active element in the last active block. Therefore, $PA^p_t = 0$. For the example shown in Fig. 12, the values of $PA^p_t$ for processors $p_0$, $p_1$ and $p_2$ are 0 because the last active element 115 is not allocated on $p_0$, $p_1$ or $p_2$. On the other hand, if the last active element is allocated on $p$, the number of pseudo active elements in the last active block $b^p_l$ can be calculated by $PA^p_t = C^p_n(\beta_l) - \lfloor ((s(N_A - 1) + o) \bmod x)/s \rfloor - 1$. For the case of $p_3$ shown in Fig. 12, since the last active element 115 is allocated on $p_3$, $PA^{p_3}_t = C^{p_3}_n(2) - 0 - 1 = 1$, where the addressing pair of $b^{p_3}_l$ is $(1, 2)_{p_3}$. Consequently, $PA^p_t$ can be obtained by

$$PA^p_t = \begin{cases} 0 & \text{if } (s(N_A - 1) + o) \notin p, \\ C^p_n(\beta_l) - \lfloor ((s(N_A - 1) + o) \bmod x)/s \rfloor - 1 & \text{if } (s(N_A - 1) + o) \in p. \end{cases}$$

Since the first and the last active blocks on processor $p$ and the corresponding addressing pairs can be evaluated, the number of active elements on processor $p$ can be obtained accordingly. For processor $p$, the number of elements counted from the first block up to the last active block is $\alpha_l \cdot A^p_{ptn} + C^p_l(\beta_l) + C^p_n(\beta_l)$. However, this count also includes the pseudo active elements, so their number should be deducted. As a result, the number of active elements allocated on processor $p$, $N^p_A$, can be obtained by

$$N^p_A = \alpha_l \cdot A^p_{ptn} + C^p_l(\beta_l) + C^p_n(\beta_l) - PA^p_h - PA^p_t.$$

The code to generate the compressed local array on processor $p$ for the general two-level data-processor mapping is shown in Appendix B. We explain the ideas of the algorithm shown in Fig. 13 by using the following example. Consider the general two-level data-processor mapping shown in Fig. 12. Take the generation of the compressed local array for processor $p_0$ as an example. We first have to evaluate the number of active elements allocated on processor $p_0$. The first and the last active blocks on processor $p_0$ are $b_8$ and $b_{20}$, respectively. That is, $b^{p_0}_f = b_8$ and $b^{p_0}_l = b_{20}$. The addressing pairs $(\alpha_f, \beta_f)_{p_0}$ and $(\alpha_l, \beta_l)_{p_0}$ are $(0, 2)_{p_0}$ and $(1, 2)_{p_0}$, respectively. The values of $PA^p_h$ and $PA^p_t$ for processor $p_0$ have been described in the previous paragraphs: $PA^{p_0}_h = 3$ and $PA^{p_0}_t = 0$. The number of active elements on processor $p_0$ is $N^{p_0}_A = 1 \cdot A^{p_0}_{ptn} + C^{p_0}_l(2) + C^{p_0}_n(2) - 3 - 0 = 7$, where $A^{p_0}_{ptn} = 5$. Since $A^{p_0}_{ptn} = 5$, according to the Hole Compression algorithm shown in Fig. 13, the first step is to generate 5 compressed local array elements. The compression table records the information needed to generate $A^p_{ptn}$ compressed local array elements when pseudo active elements are not considered. However, there are 9 pseudo active elements before the first active element $o$ in this example. Hence 9 has to be subtracted from all the global indices.
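The counting formulas for the pseudo active elements and for $N^p_A$ can be exercised with the sketch below. It rests on an assumed reading of the compression table: $C^p_n(\beta)$ is taken to be the number of (pseudo) active cells in the block of occurrence $\beta$, and $C^p_l(\beta)$ the number of such cells in the preceding occurrences of the same pattern. Both are rebuilt here by brute force rather than by the paper's table construction, and all names are illustrative.

```python
def compression_counts(p, s, o, x, procs, n_pb):
    """Assumed meaning of C_n and C_l, rebuilt by brute force over one pattern."""
    c_n = []
    for beta in range(n_pb // procs):
        start = (p + beta * procs) * x            # first template cell of the block
        c_n.append(sum(1 for c in range(start, start + x) if c % s == o % s))
    c_l = [sum(c_n[:k]) for k in range(len(c_n))]
    return c_n, c_l


def local_counts(p, s, o, x, n_a, procs, n_pb):
    """Return (PA_h, PA_t, N_A^p) for processor p."""
    c_n, c_l = compression_counts(p, s, o, x, procs, n_pb)
    a_p_ptn = c_n[-1] + c_l[-1]                   # local indexing offset
    last = s * (n_a - 1) + o                      # last active template cell
    b_f = p + -(-(o // x - p) // procs) * procs   # first active block on p
    b_l = p + ((last // x - p) // procs) * procs  # last active block on p
    alpha_f, beta_f = b_f // n_pb, (b_f % n_pb) // procs
    alpha_l, beta_l = b_l // n_pb, (b_l % n_pb) // procs

    pa_h = alpha_f * a_p_ptn + c_l[beta_f]
    if (o // x) % procs == p:                     # first active element lies on p
        pa_h += (o % x) // s
    pa_t = 0
    if (last // x) % procs == p:                  # last active element lies on p
        pa_t = c_n[beta_l] - (last % x) // s - 1
    n_p_a = alpha_l * a_p_ptn + c_l[beta_l] + c_n[beta_l] - pa_h - pa_t
    return pa_h, pa_t, n_p_a


# Fig. 12 example: s = 3, o = 28, x = 5, N_A = 30, P = 4, N_pb = 12.
for proc in range(4):
    print(proc, local_counts(proc, 3, 28, 5, 30, 4, 12))
```

As a sanity check, the per-processor values of $N^p_A$ in this example add up to the 30 elements of the array.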
More informationLeveraging Transitive Relations for Crowdsourced Joins*
Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,
More informationLemma (x, y, z) is a Pythagorean triple iff (y, x, z) is a Pythagorean triple.
Chapter Pythagorean Triples.1 Introduction. The Pythagorean triples have been known since the time of Euclid and can be found in the third century work Arithmetica by Diophantus [9]. An ancient Babylonian
More informationExperiments on string matching in memory structures
Experiments on string matching in memory structures Thierry Lecroq LIR (Laboratoire d'informatique de Rouen) and ABISS (Atelier de Biologie Informatique Statistique et Socio-Linguistique), Universite de
More informationA Simplied NP-complete MAXSAT Problem. Abstract. It is shown that the MAX2SAT problem is NP-complete even if every variable
A Simplied NP-complete MAXSAT Problem Venkatesh Raman 1, B. Ravikumar 2 and S. Srinivasa Rao 1 1 The Institute of Mathematical Sciences, C. I. T. Campus, Chennai 600 113. India 2 Department of Computer
More informationEdge disjoint monochromatic triangles in 2-colored graphs
Discrete Mathematics 31 (001) 135 141 www.elsevier.com/locate/disc Edge disjoint monochromatic triangles in -colored graphs P. Erdős a, R.J. Faudree b; ;1, R.J. Gould c;, M.S. Jacobson d;3, J. Lehel d;
More informationTopology Homework 3. Section Section 3.3. Samuel Otten
Topology Homework 3 Section 3.1 - Section 3.3 Samuel Otten 3.1 (1) Proposition. The intersection of finitely many open sets is open and the union of finitely many closed sets is closed. Proof. Note that
More informationMDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science
MDP Routing in ATM Networks Using the Virtual Path Concept 1 Ren-Hung Hwang, James F. Kurose, and Don Towsley Department of Computer Science Department of Computer Science & Information Engineering University
More informationreasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap
Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com
More informationValmir C. Barbosa. Programa de Engenharia de Sistemas e Computac~ao, COPPE. Caixa Postal Abstract
The Interleaved Multichromatic Number of a Graph Valmir C. Barbosa Universidade Federal do Rio de Janeiro Programa de Engenharia de Sistemas e Computac~ao, COPPE Caixa Postal 685 2945-970 Rio de Janeiro
More informationTilings of the Euclidean plane
Tilings of the Euclidean plane Yan Der, Robin, Cécile January 9, 2017 Abstract This document gives a quick overview of a eld of mathematics which lies in the intersection of geometry and algebra : tilings.
More informationParameterized Complexity of Independence and Domination on Geometric Graphs
Parameterized Complexity of Independence and Domination on Geometric Graphs Dániel Marx Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany. dmarx@informatik.hu-berlin.de
More informationMITOCW watch?v=kvtlwgctwn4
MITOCW watch?v=kvtlwgctwn4 PROFESSOR: The idea of congruence was introduced to the world by Gauss in the early 18th century. You've heard of him before, I think. He's responsible for some work on magnetism
More informationCompiler and Runtime Support for Programming in Adaptive. Parallel Environments 1. Guy Edjlali, Gagan Agrawal, and Joel Saltz
Compiler and Runtime Support for Programming in Adaptive Parallel Environments 1 Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, and Joel Saltz UMIACS and Dept. of Computer Science University
More informationIntroduction to Programming in C Department of Computer Science and Engineering\ Lecture No. #02 Introduction: GCD
Introduction to Programming in C Department of Computer Science and Engineering\ Lecture No. #02 Introduction: GCD In this session, we will write another algorithm to solve a mathematical problem. If you
More information31.6 Powers of an element
31.6 Powers of an element Just as we often consider the multiples of a given element, modulo, we consider the sequence of powers of, modulo, where :,,,,. modulo Indexing from 0, the 0th value in this sequence
More informationParallel Pipeline STAP System
I/O Implementation and Evaluation of Parallel Pipelined STAP on High Performance Computers Wei-keng Liao, Alok Choudhary, Donald Weiner, and Pramod Varshney EECS Department, Syracuse University, Syracuse,
More informationPick s Theorem and Lattice Point Geometry
Pick s Theorem and Lattice Point Geometry 1 Lattice Polygon Area Calculations Lattice points are points with integer coordinates in the x, y-plane. A lattice line segment is a line segment that has 2 distinct
More informationAn Eternal Domination Problem in Grids
Theory and Applications of Graphs Volume Issue 1 Article 2 2017 An Eternal Domination Problem in Grids William Klostermeyer University of North Florida, klostermeyer@hotmail.com Margaret-Ellen Messinger
More informationTelecommunication and Informatics University of North Carolina, Technical University of Gdansk Charlotte, NC 28223, USA
A Decoder-based Evolutionary Algorithm for Constrained Parameter Optimization Problems S lawomir Kozie l 1 and Zbigniew Michalewicz 2 1 Department of Electronics, 2 Department of Computer Science, Telecommunication
More informationSegmentation. Multiple Segments. Lecture Notes Week 6
At this point, we have the concept of virtual memory. The CPU emits virtual memory addresses instead of physical memory addresses. The MMU translates between virtual and physical addresses. Don't forget,
More informationTrees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.
Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial
More informationGeneral properties of staircase and convex dual feasible functions
General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia
More informationFundamental Properties of Graphs
Chapter three In many real-life situations we need to know how robust a graph that represents a certain network is, how edges or vertices can be removed without completely destroying the overall connectivity,
More informationSEQUENCES, MATHEMATICAL INDUCTION, AND RECURSION
CHAPTER 5 SEQUENCES, MATHEMATICAL INDUCTION, AND RECURSION Copyright Cengage Learning. All rights reserved. SECTION 5.5 Application: Correctness of Algorithms Copyright Cengage Learning. All rights reserved.
More informationOne of the most important areas where quantifier logic is used is formal specification of computer programs.
Section 5.2 Formal specification of computer programs One of the most important areas where quantifier logic is used is formal specification of computer programs. Specification takes place on several levels
More information2 The Fractional Chromatic Gap
C 1 11 2 The Fractional Chromatic Gap As previously noted, for any finite graph. This result follows from the strong duality of linear programs. Since there is no such duality result for infinite linear
More informationInternational Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA
International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI
More informationA NOTE ON INCOMPLETE REGULAR TOURNAMENTS WITH HANDICAP TWO OF ORDER n 8 (mod 16) Dalibor Froncek
Opuscula Math. 37, no. 4 (2017), 557 566 http://dx.doi.org/10.7494/opmath.2017.37.4.557 Opuscula Mathematica A NOTE ON INCOMPLETE REGULAR TOURNAMENTS WITH HANDICAP TWO OF ORDER n 8 (mod 16) Dalibor Froncek
More informationTiling Rectangles with Gaps by Ribbon Right Trominoes
Open Journal of Discrete Mathematics, 2017, 7, 87-102 http://www.scirp.org/journal/ojdm ISSN Online: 2161-7643 ISSN Print: 2161-7635 Tiling Rectangles with Gaps by Ribbon Right Trominoes Premalatha Junius,
More informationAlgebraic Properties of CSP Model Operators? Y.C. Law and J.H.M. Lee. The Chinese University of Hong Kong.
Algebraic Properties of CSP Model Operators? Y.C. Law and J.H.M. Lee Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong SAR, China fyclaw,jleeg@cse.cuhk.edu.hk
More informationEXTREME POINTS AND AFFINE EQUIVALENCE
EXTREME POINTS AND AFFINE EQUIVALENCE The purpose of this note is to use the notions of extreme points and affine transformations which are studied in the file affine-convex.pdf to prove that certain standard
More information6th Bay Area Mathematical Olympiad
6th Bay Area Mathematical Olympiad February 4, 004 Problems and Solutions 1 A tiling of the plane with polygons consists of placing the polygons in the plane so that interiors of polygons do not overlap,
More information2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006
2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,
More informationAlgorithms for an FPGA Switch Module Routing Problem with. Application to Global Routing. Abstract
Algorithms for an FPGA Switch Module Routing Problem with Application to Global Routing Shashidhar Thakur y Yao-Wen Chang y D. F. Wong y S. Muthukrishnan z Abstract We consider a switch-module-routing
More informationModule 11. Directed Graphs. Contents
Module 11 Directed Graphs Contents 11.1 Basic concepts......................... 256 Underlying graph of a digraph................ 257 Out-degrees and in-degrees.................. 258 Isomorphism..........................
More informationCS 6402 DESIGN AND ANALYSIS OF ALGORITHMS QUESTION BANK
CS 6402 DESIGN AND ANALYSIS OF ALGORITHMS QUESTION BANK Page 1 UNIT I INTRODUCTION 2 marks 1. Why is the need of studying algorithms? From a practical standpoint, a standard set of algorithms from different
More informationEulerian subgraphs containing given edges
Discrete Mathematics 230 (2001) 63 69 www.elsevier.com/locate/disc Eulerian subgraphs containing given edges Hong-Jian Lai Department of Mathematics, West Virginia University, P.O. Box. 6310, Morgantown,
More informationMultiway Blockwise In-place Merging
Multiway Blockwise In-place Merging Viliam Geffert and Jozef Gajdoš Institute of Computer Science, P.J.Šafárik University, Faculty of Science Jesenná 5, 041 54 Košice, Slovak Republic viliam.geffert@upjs.sk,
More informationThe Structure of Bull-Free Perfect Graphs
The Structure of Bull-Free Perfect Graphs Maria Chudnovsky and Irena Penev Columbia University, New York, NY 10027 USA May 18, 2012 Abstract The bull is a graph consisting of a triangle and two vertex-disjoint
More informationDierential-Linear Cryptanalysis of Serpent? Haifa 32000, Israel. Haifa 32000, Israel
Dierential-Linear Cryptanalysis of Serpent Eli Biham, 1 Orr Dunkelman, 1 Nathan Keller 2 1 Computer Science Department, Technion. Haifa 32000, Israel fbiham,orrdg@cs.technion.ac.il 2 Mathematics Department,
More informationHeap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the
Heap-on-Top Priority Queues Boris V. Cherkassky Central Economics and Mathematics Institute Krasikova St. 32 117418, Moscow, Russia cher@cemi.msk.su Andrew V. Goldberg NEC Research Institute 4 Independence
More informationData Dependences and Parallelization
Data Dependences and Parallelization 1 Agenda Introduction Single Loop Nested Loops Data Dependence Analysis 2 Motivation DOALL loops: loops whose iterations can execute in parallel for i = 11, 20 a[i]
More informationA GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY
A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY KARL L. STRATOS Abstract. The conventional method of describing a graph as a pair (V, E), where V and E repectively denote the sets of vertices and edges,
More informationarxiv: v1 [math.co] 13 Aug 2017
Strong geodetic problem in grid like architectures arxiv:170803869v1 [mathco] 13 Aug 017 Sandi Klavžar a,b,c April 11, 018 Paul Manuel d a Faculty of Mathematics and Physics, University of Ljubljana, Slovenia
More informationON SWELL COLORED COMPLETE GRAPHS
Acta Math. Univ. Comenianae Vol. LXIII, (1994), pp. 303 308 303 ON SWELL COLORED COMPLETE GRAPHS C. WARD and S. SZABÓ Abstract. An edge-colored graph is said to be swell-colored if each triangle contains
More informationReverse Polish notation in constructing the algorithm for polygon triangulation
Reverse Polish notation in constructing the algorithm for polygon triangulation Predrag V. Krtolica, Predrag S. Stanimirović and Rade Stanojević Abstract The reverse Polish notation properties are used
More information