Generation of parallel synchronization-free tiled code


1 Computing (2018) 100: Generation of parallel synchronization-free tiled code Wlodzimierz Bielecki 1 Marek Palkowski 1 Piotr Skotnicki 1 Received: 22 August 2016 / Accepted: 5 October 2017 / Published online: 20 October 2017 Springer-Verlag GmbH Austria 2017

Abstract A novel approach to the generation of parallel synchronization-free tiled code for loop nests is presented. It is derived via a combination of the Polyhedral and Iteration Space Slicing frameworks. It uses the transitive closure of loop nest dependence graphs to carry out corrections of original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of target (corrected) tiles. Then parallel synchronization-free tiled code is generated on the basis of valid (corrected) tiles by applying the transitive closure of dependence graphs. The main contribution of the paper is demonstrating that the presented technique is able to generate parallel synchronization-free tiled code, provided that the exact transitive closure of a dependence graph can be calculated and there exist synchronization-free slices on the statement instance level in the loop nest. We show that the presented approach extracts such parallelism when well-known techniques fail to extract it. Enlarging the scope of loop nests for which synchronization-free tiled code can be generated is achieved by means of applying the intersection of extracted slices and generated valid tiles, in contrast to forming slices of valid tiles as suggested in previously published techniques based on the transitive closure of a dependence graph. The presented approach is implemented in the publicly available TC optimizing compiler. Results of experiments demonstrating the effectiveness of the approach and the efficiency of parallel programs generated by means of it are discussed.
B Piotr Skotnicki pskotnicki@wi.zut.edu.pl Wlodzimierz Bielecki wbielecki@wi.zut.edu.pl Marek Palkowski mpalkowski@wi.zut.edu.pl 1 Faculty of Computer Science, West Pomeranian University of Technology, ul. Zolnierska 49, Szczecin, Poland

2 278 W. Bielecki et al. Keywords Synchronization-free parallelism · Tiling · Transitive closure · Optimizing compiler · Polyhedral model · Iteration space slicing Mathematics Subject Classification 65Y05 · 68M20 · 68N20

1 Introduction

In this paper, we deal with automatic parallelization of sequential programs by means of an optimizing compiler. The parallel program is executed on a computer including two or more processing units. Process synchronization is required to guarantee that a parallel program produces correct results. Synchronization is the coordination of parallel tasks in real time, often implemented by establishing a synchronization point within an application where a task may not proceed further until another task (or tasks) reaches the same point. Synchronization has a considerable impact on parallel program overhead, granularity, and load balancing. It usually involves waiting by at least one task, and can therefore cause a parallel application's wall-clock execution time to increase, i.e., it introduces parallel program overhead. Any time one task spends waiting for another is considered synchronization overhead. Minimizing its cost is a very important part of making a program efficient. Since synchronization overhead tends to grow rapidly as the number of tasks in a parallel job increases, it is the most important factor in obtaining good scaling behavior for the parallel program. In parallel computing, granularity is a qualitative measure of the ratio of computation to communication. A parallel application can be coarse- or fine-grained: coarse means that relatively large amounts of computational work are done between communication or synchronization events; fine means that relatively small amounts of computational work are done between communication events. For current multicore CPUs with support for SMT (Simultaneous Multithreading), coarse-grained parallelism is strictly required, independently of the parallel technology used (pthreads, OpenMP, MPI).
Usually, decreasing the number of synchronization events allows for increasing parallel code granularity. Load balancing is important to parallel programs for performance reasons: to get good parallel program performance, all threads should have the same volume of work to be executed. If all tasks are subject to a barrier synchronization point, the largest task will determine the overall performance. Summing up, we may conclude that to decrease synchronization overhead, increase parallel program granularity, and improve load balancing, we need to minimize the number of synchronization events in the parallel program. The ideal situation is when there is no synchronization in the parallel program, i.e., when an optimizing compiler discovers synchronization-free parallelism that requires no synchronization. Synchronization-free parallelism can be considered on the statement instance level or on the tile level, when multiple threads run independent fragments of the program comprising statement instances or tiles, respectively. Well-known techniques discovering synchronization-free parallelism are based on the affine transformation framework [10,21]. Limitations of affine transformations to extract synchronization-free parallelism are discussed in paper [3]; the authors demonstrate that despite the fact that there exists synchronization-free parallelism on the statement instance level for some classes of loop nests, there does not exist any affine transformation allowing for extracting such parallelism. The authors show how the transitive closure of dependence graphs can be used to extract synchronization-free parallelism on the statement instance level for such problematic loop nests, but they do not consider extracting synchronization-free parallelism on the tile level. In this paper, we present a way to extract synchronization-free parallelism on the tile level applying the transitive closure of dependence graphs.

Tiling [10,13,17,21,29,35] is a very important iteration reordering transformation for both improving data locality and extracting loop nest parallelism. Tiling for improving locality groups loop statement instances into smaller blocks (tiles), allowing reuse when a block fits in local memory. In parallel tiled code, tiles are considered as indivisible macro statements. This coarsens the granularity of parallel applications, which often leads to improving the performance of an application running on parallel computers with shared memory. One well-known class of tiling techniques is based on affine transformations of program loops. To generate tiled code, first affine transformations allowing for producing a band of fully permutable loops are formed, then this band is transformed into tiled code. Papers [5,6] introduce a novel approach for automatic generation of tiled code for nested loops which is based on the transitive closure of a loop nest dependence graph. This technique produces tiled code even when there does not exist any affine transformation allowing for producing a band of fully permutable loops.
According to that approach, we first form fixed rectangular original tiles and next examine whether all loop nest dependences are respected under the lexicographic order of tile enumeration. If so, we conclude that all original tiles are valid, hence code generation is straightforward. Otherwise, we correct original tiles so that all target tiles are valid, i.e., the lexicographic enumeration order of target tiles respects all dependences available in the original loop nest. The final step is the generation of code representing target (corrected) tiles. In this paper, we present a way to extract synchronization-free parallelism on the tile level applying the transitive closure of dependence graphs. Paper [6] deals with extracting slices on the tile level so that each slice contains valid tiles generated by means of applying the transitive closure of dependence graphs. In Sect. 5, we demonstrate that such a way of generating synchronization-free tiled code can lose synchronization-free parallelism on the tile level despite the existence of synchronization-free parallelism on the statement instance level. In this paper, we show how this problem can be resolved, i.e., how to generate synchronization-free tiled code provided we are able to calculate the exact transitive closure of a loop nest dependence graph and there exists synchronization-free parallelism on the statement instance level. The contributions of this paper over previous work are as follows:

– a concept and an algorithm demonstrating how the Iteration Space Slicing framework can be combined with the Polyhedral Model to generate parallel synchronization-free tiled code on the tile level when there exists synchronization-free parallelism on the statement instance level and the exact transitive closure of a dependence graph can be calculated;
– development and presentation of the publicly available source-to-source TC compiler implementing the introduced algorithms using the ISL library;
– evaluation of the effectiveness of the introduced algorithms and the speed-up of tiled code produced by means of the presented approach.

The rest of the paper is organized as follows. Section 2 contains background. Section 3 describes how synchronization-free slices can be extracted on the statement instance level. Section 4 summarizes how tiled code can be generated by means of the transitive closure of dependence graphs. Section 5 introduces a concept and an algorithm to generate parallel synchronization-free code on the tile level. Section 6 discusses related work. Section 7 highlights the results of experiments. Section 8 concludes our work and outlines plans for future work.

2 Background

In this paper, we deal with affine loop nests [11]. A statement instance S[I] is a particular execution of a loop statement S for a given iteration vector I. Given a loop nest with q statements, we transform it into its polyhedral representation, including: an iteration space IS_i for each statement S_i, i = 1,...,q; read/write access relations (RA/WA, respectively); and a global schedule S corresponding to the original execution order of statement instances in the loop nest. The loop nest iteration space IS_i is the set of statement instances executed by a loop nest for statement S_i. An access relation maps an iteration vector I_i to one or more memory locations of array elements. Schedule S is represented with a relation which maps an iteration vector of a statement to a corresponding multidimensional timestamp, i.e., a discrete time when the statement instance has to be executed.
In this paper, we use two types of iteration spaces: (i) the original one, where instances are identified by a statement identifier and a sequence of integers, and (ii) the global one, where instances are identified by their execution order. We obtain the global iteration space by means of applying the global schedule to the original iteration space. A global iteration vector I represents statement instances in the global iteration space. Further on, under I and IS, we mean the global iteration vector and global iteration space, respectively. Two statement instances S1[I] and S2[J] are dependent if both access the same memory location and if at least one access is a write. S1[I] and S2[J] are called the source and target of a dependence, respectively, provided that S1[I] is executed before S2[J]. The sequential ordering of statement instances, denoted S1[I] ≺ S2[J], is induced by the global schedule. The algorithms presented in this paper use a dependence relation, which is a tuple relation of the form {[input list] → [output list] : formula}, where input list and output list are the lists of variables and/or expressions used to describe input and output tuples, and formula describes the constraints imposed upon the input and output lists; it is a Presburger formula built of constraints represented by algebraic expressions and using logical and existential operators.
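As a finite illustration of dependent instances (a toy sketch only; real polyhedral tools represent accesses and the schedule as parametric Presburger relations), the flow dependences of a statement such as A[i][j] = A[i-1][j+1] over an assumed 4 × 4 iteration space can be enumerated by matching each read with the earlier write to the same array cell:

```python
# Toy dependence detection: iteration (i, j) writes A[i][j] and reads
# A[i-1][j+1]; a flow dependence connects the write of a cell with a
# later read of the same cell. The bound n and the access functions are
# assumptions chosen for illustration.

def dependences(n=4):
    iters = {(i, j) for i in range(1, n + 1) for j in range(1, n + 1)}
    deps = set()
    for (i, j) in iters:
        src = (i - 1, j + 1)                # iteration that wrote the cell read here
        if src in iters and src < (i, j):   # source executes first (lexicographic order)
            deps.add((src, (i, j)))
    return deps

# each pair has the form ((a, b), (a + 1, b - 1)): source -> target
print(sorted(dependences()))
```

The lexicographic comparison `src < (i, j)` plays the role of the schedule-induced order ≺ in this small setting.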

A dependence relation, describing all the dependences in a loop nest, can be computed as a union of flow, anti, and output dependences according to the following formula [32]:

R = ((RA⁻¹ ∘ WA) ∪ (WA⁻¹ ∘ RA) ∪ (WA⁻¹ ∘ WA)) ∩ ≺_S, (1)

where RA/WA are read/write access relations, respectively, mapping an iteration vector to one or more referenced memory locations, and S is the original schedule represented with a relation which maps an iteration vector of a statement to a corresponding multidimensional timestamp (i.e., a discrete time when the statement instance has to be executed). ≺_S denotes a strict partial order of statement instances: ≺_S = S⁻¹ ∘ ({[e] → [e'] : e ≺ e'} ∘ S). A dependence relation is a mathematical representation of a data dependence graph whose vertices correspond to loop statement instances while edges connect dependent instances. The input and output tuples of a relation represent dependence sources and destinations, respectively; the relation constraints point out instances which are dependent. In the presented algorithm, standard operations on relations and sets are used, such as intersection (∩), union (∪), difference (−), composition (∘), domain (domain(R)), range (range(R)), and relation application (R(S) = {[e'] : ∃ [e] ∈ S : [e] → [e'] ∈ R}). The positive transitive closure of a given relation R, R⁺, is defined as follows:

R⁺ = ⋃_{i=1}^{∞} R^i, (2)

where R^i is the i-th power of R, defined inductively by R¹ = R and, for i > 1, R^i = R ∘ R^{i−1}, where ∘ denotes the composition of relations. A relation R is reflexively closed on a set D if the identity relation Id_D is a subset of R. The reflexive closure of R on D is R ∪ Id_D. The reflexive and transitive closure of R on D is:

R* = R⁺ ∪ Id_D. (3)

It describes the same connections in a dependence graph (represented by R) that R⁺ does, plus connections of each vertex with itself.
Techniques aimed at calculating the transitive closure of a dependence graph, which in general is parametric, are presented in papers [4,18,33]; they are out of the scope of this paper. In general, it is impossible to represent the exact transitive closure using affine constraints [18]. Existing algorithms return either the exact transitive closure or an approximation of it. The exact transitive closure or an over-approximation of it can be used in the presented algorithms, but if we use an over-approximation of transitive closure, tiled code will not be optimal: it provides less code locality and/or parallelism. Paper [4] presents the time of transitive closure calculation for NPB benchmarks [23]. It depends on the number of dependence relations extracted for a loop nest and can vary from milliseconds to several minutes (when the number of dependence relations is equal to hundreds or thousands).
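For finite, non-parametric graphs, the positive transitive closure of formula (2) can be computed directly by iterating composition to a fixed point. The following sketch (plain Python over explicit vertex pairs, toy data) illustrates the definition; tools such as ISL instead operate symbolically on parametric relations:

```python
# Positive transitive closure R+ of a finite relation, computed by
# iterating R^{i+1} = R o R^i until no new pairs appear. A toy sketch;
# parametric dependence relations require symbolic techniques.

def transitive_closure(R):
    """R is a set of (source, target) pairs; returns R+."""
    closure = set(R)
    while True:
        # compose: (a, c) whenever (a, b) in closure and (b, c) in R
        new_pairs = {(a, c) for (a, b) in closure for (b2, c) in R if b == b2}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

R = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(R)))
# -> [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

Adding the identity pairs over a domain D to the result yields the reflexive and transitive closure R* of formula (3).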

3 Extraction of synchronization-free slices

To extract synchronization-free parallelism on the statement instance level applying the transitive closure of dependence graphs, we use the approach presented in paper [3]. Let us recall basic definitions related to iteration space slicing on the statement instance level.

Definition 1 (Slice) Given a dependence graph, a slice is a weakly connected component of this graph, i.e., a maximal subgraph such that each pair of its vertices is connected by some path, ignoring the edge directions.

Definition 2 (Ultimate dependence source) An ultimate dependence source is a source that is not the destination of another dependence.

Definition 3 (Representative source) The representative (source) of a slice is its lexicographically minimal ultimate dependence source in the global iteration space of a loop nest.

In order to extract parallelism represented with synchronization-free slices, we need to carry out the following steps: 1. Find a set, REPR, including representatives of slices; 2. Reconstruct slices from their representatives as independent execution flows. Given relation R found as the union of all dependence relations extracted for a loop nest, we start with forming a set of statement instances, UDS, describing all ultimate dependence sources, as the difference between the domain of R and the range of R:

UDS = domain(R) − range(R). (4)

Subsequently, to find which elements of set UDS are representatives of slices,¹ we construct a relation, R_USC, that describes all pairs (e, e') of the ultimate dependence sources contained in set UDS that are connected by some path, ignoring the edge directions. Formally, relation R_USC is defined as shown below:

R_USC = {[e] → [e'] : e, e' ∈ UDS ∧ e ≺ e' ∧ e' ∈ (R ∪ R⁻¹)⁺({[e]})}. (5)

The inequality (e ≺ e') in the constraints of relation R_USC means that e is lexicographically less than e'.
Such a condition guarantees that the lexicographically smallest element will be represented only with the input tuple, and as a result the set range(R_USC) will contain all but the lexicographically smallest sources of synchronization-free slices. R⁻¹ denotes the inverse relation of R, i.e., R⁻¹ = {[e] → [e'] : [e'] → [e] ∈ R}. The condition e' ∈ (R ∪ R⁻¹)⁺({[e]}) implies that there exists some path between e and e' when the edge directions are ignored.

¹ If a slice has multiple sources, then although all its sources belong to UDS, only the lexicographically minimal source is the representative of a slice.

Fig. 1 Connections described with relations R, R⁻¹ and R_USC

In order to illustrate the presented idea, let us consider the following relation: R := {[1] → [5]; [2] → [4]; [3] → [4]; [3] → [5]}. The graph, described with relation R, is a single weakly connected component, presented in Fig. 1 (solid lines). Set UDS computed over relation R includes vertices belonging to the set {[1]; [2]; [3]}, i.e., all the vertices that have no incoming edges and thus satisfy the definition of an ultimate dependence source. According to Definition 1, each weakly connected component constitutes a synchronization-free slice. We would therefore like to find its lexicographically smallest element, serving as the representative of a slice. First, we compute the inverse relation of R: R⁻¹ := {[4] → [2]; [4] → [3]; [5] → [1]; [5] → [3]}. Let us highlight that the forward (solid) and backward (dashed) edges in Fig. 1 form paths connecting ultimate dependence sources contained in the same slice, i.e., if there exists a pair of ultimate dependence sources described with the relation (R ∪ R⁻¹)⁺, then they both belong to a single synchronization-free slice. For the working example, relation R_USC is as follows: R_USC := {[1] → [2]; [1] → [3]; [2] → [3]}. The paths connecting the ultimate dependence sources for the working example, described with relation R_USC, are presented in Fig. 1 (dotted lines). As already mentioned, the range of relation R_USC contains all but the lexicographically smallest sources of all slices. Following this observation, in order to find set REPR comprising representatives of slices, we carry out the computation below:

REPR = UDS − range(R_USC). (6)

Set REPR is very important because its cardinality is equal to the number of synchronization-free slices; each of its elements is a slice representative, which is enough to reconstruct the corresponding slice by means of the way discussed below.
As far as the working example is considered, the set of representatives contains a single element, REPR := {[1]}, which means there exists a single slice. After finding the representatives of slices and a relation describing the connection of each representative with other ultimate dependence sources contained in the same slice, given a representative, rpr, we can reconstruct the corresponding synchronization-free slice of the original graph using the following formula:

SFS(rpr) = R*(R_USC*(rpr)). (7)

It is worth noting that if R_USC = ∅ then R_USC*(rpr) = rpr.
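The steps above can be traced on the working example with a small finite sketch (an illustration only; the actual algorithms operate on parametric sets and relations): UDS is the difference of domain and range, and the representative of each weakly connected component is its smallest ultimate dependence source:

```python
# Finite-graph sketch of slice-representative extraction (Sect. 3):
# UDS = domain(R) - range(R); the representative of each weakly
# connected component (slice) is its smallest element of UDS.
# Assumes an acyclic R, so every component contains a source.

from itertools import chain

def slice_representatives(R):
    dom = {a for a, _ in R}
    rng = {b for _, b in R}
    uds = dom - rng
    adj = {}                       # undirected adjacency: (R u R^{-1})
    for a, b in R:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, repr_set = set(), set()
    for v in sorted(chain(dom, rng)):
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:               # flood-fill one weakly connected component
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj.get(u, ()))
        seen |= comp
        repr_set.add(min(comp & uds))   # representative of this slice
    return repr_set

# Working example from the text: R = {[1]->[5]; [2]->[4]; [3]->[4]; [3]->[5]}
R = {(1, 5), (2, 4), (3, 4), (3, 5)}
print(slice_representatives(R))   # -> {1}: a single slice
```

Taking the minimum of a component's sources corresponds to discarding range(R_USC) in formula (6).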

To generate valid code representing synchronization-free parallelism and executing all statement instances that reside in the loop nest iteration space, we also need to calculate a set, IND, including independent statement instances, as follows:

IND = IS − (domain(R) ∪ range(R)), (8)

where set IS represents the global iteration space of a loop nest.

4 Loop tiling

In papers [5,6], algorithms based on the transitive closure of a dependence graph allowing for loop nest tiling are introduced. They correct original rectangular tiles so that target tiles are valid under lexicographic order. Let us consider the following example:

Example 1

for (i = 1; i <= 4; ++i)
  for (j = 1; j <= 4; ++j)
S1:   A[i][j] = A[i-1][j+1];

A data dependence analysis over the read and write accesses, based on formula (1), results in the following dependence relation: R := {S1[i, j] → S1[i + 1, j − 1] : 0 < i ≤ 3 ∧ 2 ≤ j ≤ 4}, which describes data dependences between the instances of statement S1. Figure 2a shows dependences and synchronization-free slices for Example 1. In general, for each statement S_i, i = 1,...,q, surrounded by d_i loops, we form a set TILE_i(II_i) including iterations belonging to a parametric original rectangular tile as follows:

TILE_i(II_i) = [II_i] → {[I_i] : B_i · II_i + LB_i ≤ I_i ≤ min(B_i · (II_i + 1_i) + LB_i − 1_i, UB_i) ∧ II_i ≥ 0_i},

where vectors LB_i and UB_i include the lower and upper bounds, respectively, of the indices of the loops surrounding statement S_i; diagonal matrix B_i defines the size of original rectangular tiles; elements of vectors I_i and II_i represent the original indices of the loops enclosing statement S_i and the identifiers of tiles, respectively; 1_i and 0_i are the vectors whose d_i elements all have value 1 and 0, respectively.
Additionally, with each set TILE_i, i = 1,...,q, we associate another set, II_SET_i, that includes the tile identifiers of all tiles represented with set TILE_i:

II_SET_i = {[II_i] : II_i ≥ 0_i ∧ B_i · II_i + LB_i ≤ UB_i}.

As far as Example 1 is considered, sets TILE_1 and II_SET_1 represent tiles of size 2 × 2 in space IS_1:

Fig. 2 Illustrations for Example 1: a dependences, UDS, independent iterations, slice representatives, and slices, b original tiles, c target tiles, d slices generated without splitting tiles including slice representatives, e slices generated with relation T_RPR, presented in Sect. 5.1, f slices generated with the affine transformations i' = i, j' = i + j

TILE_1 := [ii, jj] → {S1[i, j] : 0 ≤ ii ≤ 1 ∧ 0 ≤ jj ≤ 1 ∧ i > 2ii ∧ 0 < i ≤ 4 ∧ i ≤ 2 + 2ii ∧ j > 2jj ∧ 0 < j ≤ 4 ∧ j ≤ 2 + 2jj},

II_SET_1 := {[ii, jj] : 0 ≤ ii ≤ 1 ∧ 0 ≤ jj ≤ 1}.

Figure 2b illustrates original tiles of size 2 × 2 (T00, T01, T10, T11) defined by the above sets. The approach discussed in this paper is applicable to both perfectly and imperfectly nested loops. We form a global iteration space for instances of all loop nest statements by means of applying global schedule S, computed by the Polyhedral Extraction Tool (PET) [34], to sets TILE_i, IS_i used in a tiling algorithm. We call this procedure normalization. To normalize dependence relation R, we apply global schedule S to both the domain and range of this relation. To compare tile identifiers lexicographically in the global iteration space and generate valid code, we also normalize sets II_SET_i in the same way. For the reader's convenience, we present the tiling algorithm in Appendix A; it is a slight modification of that presented in paper [6]. The modification concerns the way of set and relation normalization, which is carried out using the global schedule of loop nest statement instances returned with PET [34].
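The sets TILE_1 and II_SET_1 can be enumerated concretely with a short sketch (tile size B = 2 × 2 and bounds LB = (1, 1), UB = (4, 4) as in Example 1; a toy illustration of the bounds above, not the parametric formulation):

```python
# Enumerate the original rectangular 2x2 tiles of Example 1:
# TILE(II) = { I : B*II + LB <= I <= min(B*(II+1) + LB - 1, UB) }.

B, LB, UB = (2, 2), (1, 1), (4, 4)

def tile(ii, jj):
    lo = (B[0] * ii + LB[0], B[1] * jj + LB[1])
    hi = (min(B[0] * (ii + 1) + LB[0] - 1, UB[0]),
          min(B[1] * (jj + 1) + LB[1] - 1, UB[1]))
    return {(i, j)
            for i in range(lo[0], hi[0] + 1)
            for j in range(lo[1], hi[1] + 1)}

# tile identifiers: II >= 0 and B*II + LB <= UB
II_SET = [(ii, jj) for ii in range(2) for jj in range(2)]
print(sorted(tile(0, 0)))   # -> [(1, 1), (1, 2), (2, 1), (2, 2)]  (tile T00)
```

The min() against UB clips boundary tiles when the tile size does not divide the iteration-space extent, exactly as in the parametric definition.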
The first step of the algorithm transforms a loop nest into its polyhedral representation. The second one prepares data to be used for tile correction. The third step envisages carrying out a dependence

Fig. 3 Inter-tile dependence graphs for a Example 1, and b Example 2

analysis for the loop nest. Step 4 carries out the normalization of sets and relations formed in steps 1 to 3. Steps 5 to 7 are to generate set TILE_VLD. It is the result of the correction of the original rectangular tiles and it represents target tiles valid under lexicographic order. The inter-tile dependence graph whose vertices are represented with set TILE_VLD is acyclic, so there exists a schedule for those vertices [6]. In the next section, we show how we use set TILE_VLD to generate synchronization-free code on the tile level. Figure 2c shows valid target tiles for Example 1, generated according to Algorithm A.

5 Generation of synchronization-free tiled code

In this section, we demonstrate how the techniques presented in Sects. 3 and 4 can be combined to generate parallel synchronization-free tiled code on the tile level when there exist synchronization-free slices on the statement instance level. Extracting synchronization-free parallelism on the tile level is a more complex task than that on the statement instance level. The techniques to extract slices on the tile level, discussed in papers [6,25], are based on the following steps: (i) valid (corrected) tiles are generated; (ii) a relation describing all the dependences among valid tiles (inter-tile dependences) is derived; (iii) techniques presented in paper [3] are applied to the relation obtained in step (ii) to generate synchronization-free code on the tile level. Applying the way described in the previous paragraph to Example 1, we get the target tiles shown in Fig. 2c. The inter-tile dependence graph for Example 1 is shown in Fig. 3a.
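To see why tile-level slicing can merge tiles, one can lift the instance-level dependences of Example 1 onto tile identifiers. In this sketch we use the original (uncorrected) 2 × 2 tiles for simplicity rather than the corrected T_VLD tiles; the resulting edge set also exposes the lexicographic-order violations that the tile correction of Sect. 4 removes:

```python
# Lift instance dependences (i, j) -> (i + 1, j - 1) of Example 1 to the
# level of original 2x2 tiles; an edge between two different tiles is an
# inter-tile dependence. Chains of such edges put tiles into one slice.

def tile_of(i, j, size=2, lb=1):
    return ((i - lb) // size, (j - lb) // size)

# the nine dependences of relation R for Example 1
deps = [((i, j), (i + 1, j - 1)) for i in range(1, 4) for j in range(2, 5)]
inter_tile = sorted({(tile_of(*s), tile_of(*t))
                     for s, t in deps if tile_of(*s) != tile_of(*t)})
for src, dst in inter_tile:
    flag = " (violates lexicographic order)" if dst < src else ""
    print(src, "->", dst, flag)
```

Edges whose target tile precedes the source tile lexicographically are exactly the invalid ones that force tile correction, and the connectivity of the remaining edges is what merges tiles into a single slice.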
As we can see, tiles T_VLD01, T_VLD10, T_VLD11 are in the same slice, so we lose synchronization-free parallelism on the tile level even though there exists synchronization-free parallelism on the statement instance level. In the following subsection, we discuss how this problem can be resolved.

5.1 Basic concept

There can be the following possible cases concerning the number of slice representatives within a valid tile: (i) no representative, (ii) a single representative, (iii) two or

more representatives. When two or more representatives are contained within a valid tile, it may be reasonable to split this tile into several sub-tiles so that each sub-tile includes at least one representative. The reason is the following. The number of slice representatives within a tile impacts the parallelism degree and granularity of tiled code. Increasing the number of representatives in a tile leads to decreasing the parallelism degree but increases parallel program granularity. We illustrate this trade-off in the following sub-section by means of two examples. In this paper, we consider the following three cases of splitting a tile including slice representatives: (i) a target tile including multiple representatives is not sliced, i.e., all slice representatives, extracted on the statement instance level and contained in some target tile, are included in a single slice on the tile level; (ii) a valid tile is sliced into several sub-tiles so that each one includes only a single representative, i.e., each slice representative, extracted on the statement instance level and contained in some target tile, is included in a separate slice on the tile level; (iii) a set including slice representatives is tiled so that each tile includes the same number of representatives, N, N ≥ 2, except for the last one. To generate target code corresponding to each of those cases, we need corresponding mappings from statement instances or slice representatives to abstract identifiers. For this purpose, we form the following two relations. Relation SLC, mapping each statement instance, i, to the corresponding slice representative, rpr, is formed as follows:

SLC = {[i] → [rpr] : i ∈ SFS(rpr) ∧ rpr ∈ REPR},

where SFS(rpr) is the set of instances within the slice whose representative is instance rpr, and REPR is the set including slice representatives; sets SFS(rpr) and REPR are defined in Sect. 3.
Relation T, mapping each statement instance, i, to the corresponding tile identifier, II, is built as follows:

T = {[i] → [II] : i ∈ TILE_VLD(II) ∧ II ∈ II_SET},

where TILE_VLD(II) represents instances within the valid tile with identifier II; this set is formed by means of Algorithm A. II_SET is the set including the identifiers of valid tiles; this set is defined in step 4.4 of Algorithm A. The third relation, T_RPR, is user-provided; it maps each slice representative, rpr, to the corresponding identifier, ID, of the tile generated as the result of tiling a set including all slice representatives. It is of the form:

T_RPR = {[rpr] → [ID] : rpr ∈ REPR ∧ constraints on ID}.

It is worth noting that relation T_RPR represents tiles different from the valid target tiles generated with Algorithm 1; they include only slice representatives. In this paper, we suppose that relation T_RPR has to be provided by an expert and it is an input of Algorithm 1 discussed below. If it is not present on input, the algorithm skips its usage.

For Example 1, such a relation, provided that each tile should include 2 representatives (except for the last one), can be the following:

T_RPR := {[rpr1, rpr2] → [id1, id2] : ((rpr1 = 1 ∧ id1 = 0 ∧ 2·id2 + 1 ≤ rpr2 ≤ 2·id2 + 2) ∨ (rpr2 = 4 ∧ id1 = 1 ∧ 2·id2 + 2 ≤ rpr1 ≤ min(2·id2 + 3, 4))) ∧ 0 ≤ id1, id2 ≤ 1}.

Tiles ID00, ID01, ID10, and ID11, formed according to relation T_RPR above, are shown in Fig. 2e in green. To generate code without splitting tiles including slice representatives, taking into account that representative rpr can be found as rpr = SLC(i), identifier II of the tile including instance i is formed as II = T(i), and identifier II' of the tile including representative rpr is calculated as II' = T(rpr) = T(SLC(i)), we form the following set:

CODE_UNSLICED = {[II', II, i] : II' = T(SLC(i)) ∧ II = T(i)}.

To generate code with splitting tiles including slice representatives, taking into account that representative rpr can be found as rpr = SLC(i) and identifier II of the tile including instance i is calculated as II = T(i), we form the following set:

CODE_SLICED = {[rpr, II, i] : rpr = SLC(i) ∧ II = T(i)}.

When relation T_RPR is given on input, taking into account that rpr = SLC(i), we form the following set:

CODE_SLICED = {[ID, II, i] : ID = T_RPR(SLC(i)) ∧ II = T(i)}.

Next, we apply the code generator of the Integer Set Library [14] to generate pseudo-code scanning elements of set CODE_UNSLICED or CODE_SLICED and finally postprocess this pseudo-code to get parallel pseudo-code of the following structure:

parfor scanning elements II' or rpr or ID
  for scanning elements II
    for scanning elements i

Algorithm 1 lists the steps of the procedure for the generation of synchronization-free parallel tiled pseudo-code. It includes the following three steps. The first step is responsible for the calculation of slice representatives and independent statement instances according to the technique presented in Sect. 3.
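The shape of that pseudo-code can be mimicked with a small sketch: instances are grouped by a (slice key, tile identifier) pair, the outer loop over slice keys is parallel, and the inner loops are sequential. The mappings slc and t below are toy stand-ins for relations SLC and T, not the relations computed by the algorithm:

```python
# Sketch of assembling a CODE_SLICED-like set: each statement instance i
# is keyed by (representative of its slice, identifier of its tile), so
# the outer parallel loop scans slice keys while inner loops scan tiles
# and instances. Toy 1-D data; slc and t are illustrative stand-ins.

from collections import defaultdict

def build_schedule(instances, slc, t):
    """Group instances into {slice_key: {tile_id: [instances]}}."""
    sched = defaultdict(lambda: defaultdict(list))
    for i in instances:
        sched[slc(i)][t(i)].append(i)
    return sched

# toy data: instances 0..7, two slices (even/odd), tiles of size 4
sched = build_schedule(range(8), slc=lambda i: i % 2, t=lambda i: i // 4)
for rpr in sorted(sched):               # parfor over slice keys
    for tile_id in sorted(sched[rpr]):  # sequential over tiles of a slice
        print(rpr, tile_id, sched[rpr][tile_id])
```

In the real generator the grouping and scanning are performed symbolically by ISL's code generator over the parametric sets, not by materializing instances.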
The second step is to extract synchronization-free slices on the tile level by means of the concept presented in this section and Algorithm A. The last step generates synchronization-free pseudo-code on the tile level. It is worth noting that to generate compilable code, we need a postprocessor which transforms pseudo-code into compilable code. The postprocessor organization depends on the target platform and the programming API or library used to represent and then

13 Generation of parallel synchronization-free tiled code 289 compile parallel programs. In Sect. 7, we clarify how the postprocessor of the TC compiler generates target compilable code. 5.2 Illustrative examples In this sub-section, we illustrate extracting synchronization-free slices on the tile level by means of two examples. All calculations were carried out by means of the iscc calculator [31]. Let us start with Example 1. For this loop nest, set UDS, calculated according to formula (4), is as follows: UDS := { [i, 4] 2 i 3 } {[1, j] 2 j 4 }. Relation R_USC, calculated with formula (5), is empty, hence we conclude that each element of set UDSis the representative of a slice, that is REPR = UDS.For Example 1, Fig. 2a shows dependences, ultimate dependence sources (blue points), independent iterations (green points), and slices on the statement instance level (points within red parallelograms). Consequently, set S_REPR_INDis as follows: S_REPR_IND:= { [i, 4] 2 i 4 } {[1, j] 1 j 4 }. Slice representatives and independent statement instances are depicted in Fig. 2a. After applying Algorithm 1, we receive the following set representing valid target tiles: TILE_VLD:= [ii, jj] {[i, j] i > 2ii i > 0 (( jj = 0 i 4 0 < j 3 + 2ii i) ( jj = 1 ii 1 i 2 + 2ii 4 + 2ii i j 4)) }. Figure 2c shows four target tiles defined by the above set. Assuming that tiles, including multiple slice representatives, should not be sliced and a slice representative is presented with variables i, j,setsfs is as follows: SFS := [i, j] {[i0, 4 + i i0] j = 4 2 i 4 i < i0 4 i0 <= 3 + i } {[i0, 1 + j i0] i = 1 0 < j 4 2 i0 <= 4 i0 j } {[i, 4] j = 4 2 i 4;[1, j] i = 1 0 < j 4 }. Let us note that tile T _VLD11 is divided into two sub-tiles T _VLD11 and T _VLD11. The former belongs to slice SFS1 while the latter is within slice SFS2 (see Fig. 2d. 
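The Example 1 sets above can be reproduced without iscc by brute-force enumeration over the 4 × 4 iteration space. The following plain-Python sketch does so; note that Example 1's source code is not reproduced in this excerpt, so the dependence pattern (i, j) → (i + 1, j − 1) assumed below is an inference, chosen because it is consistent with the UDS and S_REPR_IND sets quoted in the text.

```python
# Sketch: checking the Example 1 sets by enumeration instead of with iscc.
# The dependence (i, j) -> (i + 1, j - 1) is an assumption (see lead-in).
IS = {(i, j) for i in range(1, 5) for j in range(1, 5)}
R = {((i, j), (i + 1, j - 1)) for (i, j) in IS if (i + 1, j - 1) in IS}

dom = {s for (s, t) in R}
rng = {t for (s, t) in R}
UDS = dom - rng                  # formula (4): ultimate dependence sources
IND = IS - (dom | rng)           # independent statement instances
S_REPR_IND = UDS | IND           # R_USC is empty here, so REPR = UDS

# These match the sets quoted in the text.
assert UDS == {(2, 4), (3, 4)} | {(1, j) for j in (2, 3, 4)}
assert S_REPR_IND == {(i, 4) for i in (2, 3, 4)} | {(1, j) for j in (1, 2, 3, 4)}
```

With this dependence pattern, the only iterations that are neither a source nor a destination of any dependence are (1, 1) and (4, 4), which is why S_REPR_IND extends UDS by exactly those two points.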
Set CODE includes the following elements:

{ [1,1,1,1,4,4]; [1,1,1,1,3,4]; [0,1,0,1,2,4]; [0,1,0,1,1,4]; [1,1,1,1,4,3]; [0,1,1,1,3,3]; [0,1,0,1,2,3]; [0,1,0,1,1,3]; [0,1,1,1,4,2]; [0,1,1,0,3,2];

Algorithm 1 Generation of synchronization-free code on the tile level.

Input: Global iteration space represented with set IS, normalized dependence relation R describing all the dependences available in the original loop nest, set TILE_VLD describing valid target tiles obtained by means of Algorithm A, set II_SET comprising valid tile identifiers, a value of variable sliced = TRUE/FALSE defining whether multiple slice representatives in a tile should be sliced (TRUE) or not (FALSE), relation T_RPR (if provided) mapping a slice representative to the corresponding identifier of the tile generated due to tiling a set including all representatives.

Output: Synchronization-free parallel tiled code if it exists.

Method:

1. Calculation of slice representatives and independent loop nest statement instances.
1.1 Calculate set UDS, including ultimate dependence sources: UDS = domain(R) − range(R).
1.2 Calculate relation R_USC, which describes all pairs (e, e′) of the ultimate dependence sources contained in set UDS that belong to the same slice: R_USC = { [e] → [e′] : e, e′ ∈ UDS ∧ e ≺ e′ ∧ e′ ∈ (R ∪ R⁻¹)⁺({[e]}) }.
1.3 Calculate set REPR, including synchronization-free slice representatives: REPR = UDS − range(R_USC).
1.4 Calculate set IND, including independent statement instances: IND = IS − (domain(R) ∪ range(R)).
1.5 Calculate set S_REPR_IND, including both slice representatives and independent statement instances: S_REPR_IND = REPR ∪ IND.
1.6 If the cardinality of set S_REPR_IND is equal to 1, then print "There are no synchronization-free slices in the original code"; the end.
2. Extraction of synchronization-free slices on the tile level.
2.1 Form set SFS(rpr), including statement instances belonging to the synchronization-free slice defined with representative rpr: SFS(rpr) = R∗(R_USC∗({[rpr]})).
2.2 Form relation SLC, which maps each statement instance i to the corresponding slice representative rpr: SLC = { [i] → [rpr] : i ∈ SFS(rpr) ∧ rpr ∈ S_REPR_IND }.
2.3 Form relation T, which maps each instance i to the corresponding tile identifier II: T = { [i] → [II] : i ∈ TILE_VLD(II) ∧ II ∈ II_SET }, where II_SET is the set including the identifiers of tiles; this set is defined in step 4.4 of Algorithm A.
2.4 If sliced = FALSE, form set CODE as follows: CODE = { [II′, II, i] : II′ = T(SLC(i)) ∧ II = T(i) }. Else, if relation T_RPR is not provided: CODE = { [rpr, II, i] : rpr = SLC(i) ∧ II = T(i) }. Else: CODE = { [ID, II, i] : ID = T_RPR(SLC(i)) ∧ II = T(i) }.
3. Code generation.
3.1 Generate pseudo-code applying ISL [30] or CLooG [1] to set CODE and transform it to the following form:
parfor scanning elements II′ or rpr or ID
  for scanning elements II
    for scanning elements i
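The steps of Algorithm 1 can be sketched in plain Python over finite, explicitly enumerated relations (the paper performs these operations on parametric sets with ISL). The 1-D iteration space below, its chain dependence i → i + 2, and the dictionaries standing in for relations SLC and T are hypothetical, chosen so that both slice representatives land in the same tile, which makes the difference between CODE_UNSLICED and CODE_SLICED visible.

```python
# Steps 1.1-1.5 and 2.2-2.4 of Algorithm 1 on a toy 1-D space i = 1..8 with
# dependences i -> i + 2, i.e., two interleaved synchronization-free slices.
IS = set(range(1, 9))
R = {(i, i + 2) for i in range(1, 7)}

dom, rng = {s for s, _ in R}, {t for _, t in R}
UDS = dom - rng                                     # step 1.1: {1, 2}

def reach(rel, start):                              # transitive closure applied to a set
    seen, frontier = set(start), set(start)
    while frontier:
        frontier = {t for (s, t) in rel if s in frontier} - seen
        seen |= frontier
    return seen

undirected = R | {(t, s) for s, t in R}             # R u R^-1
R_USC = {(e, f) for e in UDS for f in UDS           # step 1.2 (e < f stands in for lexicographic order)
         if e < f and f in reach(undirected, {e})}
REPR = UDS - {f for _, f in R_USC}                  # step 1.3: the two chains never meet
IND = IS - (dom | rng)                              # step 1.4: empty here
S_REPR_IND = REPR | IND                             # step 1.5

SLC = {i: min(reach(undirected, {i})) for i in IS}  # step 2.2: instance -> representative
T = {i: (i - 1) // 2 for i in IS}                   # step 2.3: tiles of size 2, ids 0..3

# Step 2.4: both representatives fall into tile 0, so CODE_UNSLICED exposes only
# one value for the parfor coordinate, while CODE_SLICED recovers both slices.
CODE_UNSLICED = {(T[SLC[i]], T[i], i) for i in IS}
CODE_SLICED = {(SLC[i], T[i], i) for i in IS}
print(len({c[0] for c in CODE_UNSLICED}), len({c[0] for c in CODE_SLICED}))  # 1 2
```

The outermost (parfor) coordinate of the generated pseudo-code scans the first element of each tuple, so its number of distinct values is the parallelism degree exposed on the tile level; here slicing the tile that holds both representatives is exactly what recovers the two slices.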

[0,1,0,1,2,2]; [0,0,0,0,1,2]; [0,1,1,0,4,1]; [0,1,1,0,3,1]; [0,0,0,0,2,1]; [0,0,0,0,1,1] }.

The functions of the elements in set CODE are the following. The first pair of elements represents the identifier of the tile including slice representatives (0,0 or 0,1 or 1,1), the second pair stands for the identifiers of the tiles including elements of the slice defined with the slice representatives determined with the first pair (from 0,0 to 1,1), and the third pair defines the iterations of the tile with the identifier represented with the second pair (from 1,1 to 4,4). Figure 2d illustrates the three slices SFS0, SFS1, and SFS2 represented with set CODE.

Applying relation T_RPR presented in the previous sub-section, we get the tiles shown in blue in Fig. 2e. They are the same as those obtained by applying the affine transformations i′ = i, j′ = i + j to the original iteration space and then applying Algorithm A to the transformed iteration space, see Fig. 2f.

Let us now consider another example.

Example 2

    for (i = 1; i <= 4; ++i)
      for (j = 1; j <= 4; ++j)
S1:     A[i][j] = A[i+1][j+1] + A[i+1][j-1];

Dependences in this loop are described by the following relation:

R := { [i, j] → [i + 1, j + 1] : 0 < i ≤ 3 ∧ 0 < j ≤ 3 } ∪ { [i, j] → [i + 1, j − 1] : 0 < i ≤ 3 ∧ 2 ≤ j ≤ 4 }.

Figure 4a illustrates the dependences for this example. As we can see, there are two synchronization-free slices; red arrows depict the first one, while the black ones form the second. Set UDS computed over relation R includes the iterations contained in the following set: { [1, 1]; [1, 2]; [1, 3]; [1, 4] }. Figure 4a presents those four ultimate dependence sources; however, only two of them are slice representatives. In order to exclude non-representative sources, we apply formula (5) to get the following relation:

R_USC := { [1, 1] → [1, 3]; [1, 2] → [1, 4] }.

Subsequently, we form set S_REPR as

S_REPR := UDS − range(R_USC) = { [1, 1]; [1, 2] }.
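For Example 2, the quoted sets can be verified by brute-force enumeration, since relation R is given explicitly above. A plain-Python sketch (sets of pairs standing in for the iscc relations, and Python tuple ordering standing in for the lexicographic order ≺):

```python
# Sketch: reproducing the Example 2 calculations (UDS, R_USC, S_REPR) by
# enumerating relation R over the 4 x 4 iteration space.
IS = {(i, j) for i in range(1, 5) for j in range(1, 5)}
R = ({((i, j), (i + 1, j + 1)) for (i, j) in IS if i <= 3 and j <= 3} |
     {((i, j), (i + 1, j - 1)) for (i, j) in IS if i <= 3 and j >= 2})

dom, rng = {s for s, _ in R}, {t for _, t in R}
UDS = dom - rng
assert UDS == {(1, 1), (1, 2), (1, 3), (1, 4)}

# Connect ultimate sources that reach a common instance via (R u R^-1)+.
undirected = R | {(t, s) for (s, t) in R}

def reachable(start):
    seen, frontier = {start}, {start}
    while frontier:
        frontier = {t for (s, t) in undirected if s in frontier} - seen
        seen |= frontier
    return seen

R_USC = {(e, f) for e in UDS for f in UDS if e < f and f in reachable(e)}
assert R_USC == {((1, 1), (1, 3)), ((1, 2), (1, 4))}

S_REPR = UDS - {f for _, f in R_USC}
assert S_REPR == {(1, 1), (1, 2)}
```

Both dependences preserve the parity of i + j, so (1, 1) and (1, 3) end up in one slice and (1, 2) and (1, 4) in the other, which is exactly what relation R_USC records.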
Let us note that there are no independent statement instances for this example, i.e., set S_REPR_IND is the same as set S_REPR. Applying Algorithm A, we obtain the following set representing target tiles:

TILE_VLD := { [ii, jj] → [i, j] : i > 2ii ∧ i > 0 ∧ ((jj = 0 ∧ i ≤ 4 ∧ 0 < j ≤ 3 + 2ii − i) ∨ (jj = 1 ∧ ii ≤ 1 ∧ i ≤ 2 + 2ii ∧ 4 + 2ii − i ≤ j ≤ 4)) }.

Figure 4b shows the original tiles of size 2 × 2 and the target tiles. The dependence graph on the target tile level is shown in Fig. 3b. As we can see, for this example, all

Fig. 4 Illustrations for Example 2: (a) dependences, UDS, slice representatives, and slices; (b) original and target tiles; (c) slice with representative (1,1); (d) slice with representative (1,2)

target tiles are combined in a single slice, i.e., the way described in paper [6] fails to extract any synchronization-free parallelism on the tile level. Applying Algorithm 1 with slicing of tiles including multiple slice representatives, we first construct set SFS; we skip its mathematical representation because it is too long. Figure 4c, d shows the two synchronization-free slices with the representatives (1,1) and (1,2), respectively. As we can see, the elements of the target tiles are divided between the two slices. Each slice includes different elements of all target tiles. Eventually, for the purpose of the code generation phase, we form set CODE, which contains the following elements:

{ [1,1,1,1,4,4]; [1,2,1,1,3,4]; [1,1,0,1,2,4]; [1,2,0,1,1,4]; [1,2,1,1,4,3]; [1,1,1,1,3,3]; [1,2,0,1,2,3]; [1,1,0,1,1,3]; [1,1,1,1,4,2]; [1,2,1,0,3,2]; [1,1,0,1,2,2]; [1,2,0,0,1,2]; [1,2,1,0,4,1]; [1,1,1,0,3,1]; [1,2,0,0,2,1]; [1,1,0,0,1,1] }.

The roles of the elements in this set are the following. The values of the first two elements correspond to a slice representative (1,1 or 1,2), the second pair stands for the identifiers of the tiles including elements of the slice defined with that slice representative

(0,0 or 0,1 or 1,0 or 1,1), and the third pair defines the iterations of the tile with the identifier represented with the second pair (from 1,1 to 4,4).

5.3 Discussion

The high-level concept presented in Sect. 5.1 does not depend on how synchronization-free slices on the statement instance level are extracted or on how target tiles valid under the lexicographic order are derived, but the way it is implemented defines the quality of the generated tiled code. The result of extracting slices should be presented with a relation which maps each statement instance to the corresponding slice representative, while the result of tiling has to be presented with a relation which maps each statement instance to the corresponding tile identifier. In addition to those relations, an expert can provide a way of splitting slice representatives and the corresponding relation which maps each representative to the identifier of the tile generated due to splitting slice representatives. Such a relation may allow for the generation of better tiled code. These three relations are then used to form a set, as shown in Sect. 5.1, responsible for the generation of tiled code. These relations define the effectiveness of the concept and the quality of the code generated by it: the parallelism degree, program granularity, performance, and scalability depend on the implementation of this concept. In this paper, to implement the introduced concept, we present and apply Algorithms A and 1, based on the transitive closure of dependence graphs. Applying other algorithms, for example those based on affine transformations, will lead to other tiled code.

6 Related work

There has been a considerable amount of research into tiling demonstrating how to aggregate a set of loop nest iterations into tiles with each tile as an atomic macro statement, from pioneering papers [17,29,35,36] to those presenting advanced techniques [13,16,19-21].
Several popular frameworks are used to produce tiled code: the affine transformation framework based on the Polyhedral Model [9,11,12,22], the non-polyhedral model [19], and Iteration Space Slicing [27,28]. The affine transformation framework is one of the most advanced reordering transformations. Let us recall that this approach includes the following three steps: (i) program analysis, aimed at translating high-level codes to their polyhedral representation and providing data dependence analysis based on this representation; (ii) program transformation, with the aim of improving program locality and/or parallelization; (iii) code generation. All three steps are present in the algorithms presented in this paper. But there exists the following difference in step (ii): in affine transformations, a (sequence of) program transformation(s) is represented by a set of affine functions, one for each statement, while the presented approach does not find or use any affine function for program transformation(s). It applies the transitive closure of a program dependence graph to transform invalid original tiles into valid target ones. From this point of view, the program transformation step is rather within the Iteration Space Slicing framework introduced by Pugh and Rosser [27]: Iteration Space Slicing takes dependence information as input to find all statement instances from a given loop nest which must be executed to produce correct values for the specified array elements. That is, we may conclude that the introduced algorithms are based on a combination of the Polyhedral and Iteration Space Slicing frameworks. Such a combination allows for improving the effectiveness of the loop nest tiling transformation. In the next section, we show that for the examined benchmarks, the presented approach extracts more synchronization-free parallelism than that provided by well-known affine transformations.

Papers [3,7] demonstrate how to extract coarse- and fine-grained parallelism applying different Iteration Space Slicing algorithms; however, they do not consider any tiling transformation. Paper [5] deals with applying transitive closure to only perfectly nested loops and does not present any algorithm to extract synchronization-free parallelism. In paper [25], the authors present a way to extract synchronization-free parallelism using a relation representing inter-tile dependences. As we demonstrated in Sect. 5, such an approach can fail to discover synchronization-free parallelism on the tile level. Paper [8] demonstrates how tiled code can be generated applying free scheduling of tiles, but such a way results in generating code with synchronization. Paper [6] deals with tiling arbitrarily nested loops; however, extracting synchronization-free parallelism is based on forming slices including valid tiles without splitting tiles that include multiple slice representatives. That prevents extracting synchronization-free parallelism on the tile level for some classes of loop nests. Diamond and hexagonal tiling [2,15] allow for scalable parallelism, but they can be applied only to stencil algorithms, while the approach presented in our paper is of general usage.
Summing up, we may conclude that, provided that the exact transitive closure of a dependence graph can be calculated, the approach presented in this paper is able to generate synchronization-free parallelism on the tile level whenever there exists synchronization-free parallelism on the statement instance level. As far as disadvantages of the presented technique are concerned, we observed that for loop nests exposing non-regular dependences, the generated tiles are also non-regular, which results in more complicated code than that generated with affine transformations. Some target tiles can be parametric, i.e., their sizes depend on parametric upper loop index bounds. This does not guarantee that the data size per parametric tile is smaller than the capacity of the cache. In such a case, a parametric tile represents an iteration sub-space where tiling is excluded. We will address and illustrate these issues in our future work.

7 Implementation and experimental study

The algorithms presented in this paper have been implemented in the TC optimizing compiler, which utilizes the Polyhedral Extraction Tool [34] for extracting

polyhedral representations of original loop nests, and the Integer Set Library [30] for performing dependence analysis, manipulating integer sets and relations, and generating output code. To evaluate the effectiveness of the presented approach and the performance of parallel tiled code generated by means of this approach, we have experimented with the PolyBench/C 4.1 [26] benchmark suite. Out of the 30 benchmarks contained in PolyBench, TC finds a total of 8 benchmarks for which there exists more than one synchronization-free slice on both the statement instance and tile levels. The list of these loop nests is as follows: 2mm, bicg, gemm, gesummv, mvt, syr2k, syrk, trmm. The code generated by TC for the studied kernels can be found in the results directory of the compiler's repository.

The evaluation of parallel code performance was carried out on a multicore architecture (2x Intel Xeon E v3 clocked at 2.3 GHz, 18 cores/socket, 36 threads/socket, 32 KB L1 data cache/core, 256 KB L2 cache/core, 45 MB L3 cache/socket, 256 GB RAM clocked at 2133 MHz). The code of both the original and transformed loop nests was compiled under the Linux kernel x86_64 by GCC with the -O3 optimization enabled. Each examined loop nest was tiled with tiles of size 32 (in each dimension). The transformed code was then executed using 1, 2, 4, 8, 16, 32, and 64 threads in subsequent runs. The problem sizes used for the studied benchmarks are shown in Table 1, which also presents the number of synchronization-free slices extracted from each loop nest, expressed as formulas involving loop index upper bounds and a tile size constant B representing the width of a tile side in each dimension (32 in our study).
The vertical bar indicates that in code generated with TC and PLUTO [10], a state-of-the-art optimizing compiler based on the affine transformation framework, there is a synchronization barrier between parallel regions, each including synchronization-free code; e.g., M | N means that first a loop nest of M synchronization-free slices is computed, then, after a barrier, another loop nest of N synchronization-free slices is executed. The row "Theoretical" presents the cardinality of set S_REPR_IND, i.e., the total number of synchronization-free slices extracted with Algorithm 1. The rows "TC sliced" and "TC unsliced" present the number of slices extracted when the tiles including multiple slice representatives are sliced (so that each such tile includes a single representative) and unsliced, respectively. The row "PLUTO" shows the number of slices extracted with PLUTO.

For several kernels, although Algorithm 1 extracts only synchronization-free parallelism, the compilable TC code has a synchronization point (for example, for kernel mvt). This fact is explained by the way a TC post-processor generates compilable code. Set CODE, generated with Algorithm 1, may consist of several sub-sets. For each sub-set, the code generator may produce a separate loop nest (although all those loop nests are independent), and for each of them, the TC post-processor generates a separate parallel loop nest. So we have a sequence of parallel loop nests, and there exists barrier synchronization between each pair of such loop nests. Thus, TC does not exploit the whole parallelism extracted with Algorithm 1.
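The post-processing behaviour described above can be illustrated with a small sketch (hypothetical code, not TC's actual post-processor): each sub-set of CODE becomes its own parallel region, and composing the regions sequentially implies a barrier between every pair of them, even though the regions themselves are independent.

```python
# Sketch: why emitting one parallel loop nest per sub-set of CODE re-introduces
# barriers. `code_subsets` is a hypothetical partition of a CODE set; the emitted
# "parfor" lines stand in for generated parallel loop nests.

def postprocess(code_subsets):
    lines = []
    for k, subset in enumerate(code_subsets):
        if k > 0:
            # sequential composition of parallel regions implies a barrier here
            lines.append("// implicit barrier between parallel regions")
        lines.append("parfor over " + str(sorted(subset)))
    return lines

# Two independent sub-sets of a toy CODE set still get one barrier between them.
regions = postprocess([{(0, 0, 1), (0, 0, 2)}, {(1, 1, 3), (1, 1, 4)}])
barriers = sum("barrier" in line for line in regions)
print(barriers)  # 1
```

Merging the independent sub-sets into a single parallel region before emitting code would remove these barriers, which is exactly the unexploited parallelism the text refers to.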


x ji = s i, i N, (1.1)

x ji = s i, i N, (1.1) Dual Ascent Methods. DUAL ASCENT In this chapter we focus on the minimum cost flow problem minimize subject to (i,j) A {j (i,j) A} a ij x ij x ij {j (j,i) A} (MCF) x ji = s i, i N, (.) b ij x ij c ij,

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Lecture 2: Getting Started

Lecture 2: Getting Started Lecture 2: Getting Started Insertion Sort Our first algorithm is Insertion Sort Solves the sorting problem Input: A sequence of n numbers a 1, a 2,..., a n. Output: A permutation (reordering) a 1, a 2,...,

More information

Offload acceleration of scientific calculations within.net assemblies

Offload acceleration of scientific calculations within.net assemblies Offload acceleration of scientific calculations within.net assemblies Lebedev A. 1, Khachumov V. 2 1 Rybinsk State Aviation Technical University, Rybinsk, Russia 2 Institute for Systems Analysis of Russian

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

Performance Issues in Parallelization Saman Amarasinghe Fall 2009

Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

Framework for Design of Dynamic Programming Algorithms

Framework for Design of Dynamic Programming Algorithms CSE 441T/541T Advanced Algorithms September 22, 2010 Framework for Design of Dynamic Programming Algorithms Dynamic programming algorithms for combinatorial optimization generalize the strategy we studied

More information

Loop Transformations, Dependences, and Parallelization

Loop Transformations, Dependences, and Parallelization Loop Transformations, Dependences, and Parallelization Announcements HW3 is due Wednesday February 15th Today HW3 intro Unimodular framework rehash with edits Skewing Smith-Waterman (the fix is in!), composing

More information

A Connection between Network Coding and. Convolutional Codes

A Connection between Network Coding and. Convolutional Codes A Connection between Network Coding and 1 Convolutional Codes Christina Fragouli, Emina Soljanin christina.fragouli@epfl.ch, emina@lucent.com Abstract The min-cut, max-flow theorem states that a source

More information

Affine Loop Optimization using Modulo Unrolling in CHAPEL

Affine Loop Optimization using Modulo Unrolling in CHAPEL Affine Loop Optimization using Modulo Unrolling in CHAPEL Aroon Sharma, Joshua Koehler, Rajeev Barua LTS POC: Michael Ferguson 2 Overall Goal Improve the runtime of certain types of parallel computers

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

Shared memory parallel algorithms in Scotch 6

Shared memory parallel algorithms in Scotch 6 Shared memory parallel algorithms in Scotch 6 François Pellegrini EQUIPE PROJET BACCHUS Bordeaux Sud-Ouest 29/05/2012 Outline of the talk Context Why shared-memory parallelism in Scotch? How to implement

More information

Transforming Complex Loop Nests For Locality

Transforming Complex Loop Nests For Locality Transforming Complex Loop Nests For Locality Qing Yi Ken Kennedy Computer Science Department Rice University Abstract Because of the increasing gap between the speeds of processors and standard memory

More information

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational

More information

Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures

Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures Uday Bondhugula Indian Institute of Science Supercomputing 2013 Nov 16 22, 2013 Denver, Colorado 1/46 1 Introduction 2 Distributed-memory

More information

Static and Dynamic Frequency Scaling on Multicore CPUs

Static and Dynamic Frequency Scaling on Multicore CPUs Static and Dynamic Frequency Scaling on Multicore CPUs Wenlei Bao 1 Changwan Hong 1 Sudheer Chunduri 2 Sriram Krishnamoorthy 3 Louis-Noël Pouchet 4 Fabrice Rastello 5 P. Sadayappan 1 1 The Ohio State University

More information

EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100

EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100 EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100 1 [10 points] 1. Task parallelism: The computations in a parallel algorithm can be split into a set of tasks for concurrent execution. Task

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Overpartioning with the Rice dhpf Compiler

Overpartioning with the Rice dhpf Compiler Overpartioning with the Rice dhpf Compiler Strategies for Achieving High Performance in High Performance Fortran Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/hug00overpartioning.pdf

More information

Jump Statements. The keyword break and continue are often used in repetition structures to provide additional controls.

Jump Statements. The keyword break and continue are often used in repetition structures to provide additional controls. Jump Statements The keyword break and continue are often used in repetition structures to provide additional controls. break: the loop is terminated right after a break statement is executed. continue:

More information

Handout 9: Imperative Programs and State

Handout 9: Imperative Programs and State 06-02552 Princ. of Progr. Languages (and Extended ) The University of Birmingham Spring Semester 2016-17 School of Computer Science c Uday Reddy2016-17 Handout 9: Imperative Programs and State Imperative

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

More information

Practice Exercises 449

Practice Exercises 449 Practice Exercises 449 Kernel processes typically require memory to be allocated using pages that are physically contiguous. The buddy system allocates memory to kernel processes in units sized according

More information

Example of a Parallel Algorithm

Example of a Parallel Algorithm -1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software

More information

ABC basics (compilation from different articles)

ABC basics (compilation from different articles) 1. AIG construction 2. AIG optimization 3. Technology mapping ABC basics (compilation from different articles) 1. BACKGROUND An And-Inverter Graph (AIG) is a directed acyclic graph (DAG), in which a node

More information

Exploring Parallelism At Different Levels

Exploring Parallelism At Different Levels Exploring Parallelism At Different Levels Balanced composition and customization of optimizations 7/9/2014 DragonStar 2014 - Qing Yi 1 Exploring Parallelism Focus on Parallelism at different granularities

More information

High Performance Computing. Introduction to Parallel Computing

High Performance Computing. Introduction to Parallel Computing High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Extremal Graph Theory: Turán s Theorem

Extremal Graph Theory: Turán s Theorem Bridgewater State University Virtual Commons - Bridgewater State University Honors Program Theses and Projects Undergraduate Honors Program 5-9-07 Extremal Graph Theory: Turán s Theorem Vincent Vascimini

More information

Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery

Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Introduction Problems & Solutions Join Recognition Experimental Results Introduction GK Spring Workshop Waldau: Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Database & Information

More information

Parametrically Tiled Distributed Memory Parallelization of Polyhedral Programs

Parametrically Tiled Distributed Memory Parallelization of Polyhedral Programs Computer Science Technical Report Parametrically Tiled Distributed Memory Parallelization of Polyhedral Programs Tomofumi Yuki Sanjay Rajopadhye June 10, 2013 Colorado State University Technical Report

More information

SourcererCC -- Scaling Code Clone Detection to Big-Code

SourcererCC -- Scaling Code Clone Detection to Big-Code SourcererCC -- Scaling Code Clone Detection to Big-Code What did this paper do? SourcererCC a token-based clone detector, that can detect both exact and near-miss clones from large inter project repositories

More information

EULER S FORMULA AND THE FIVE COLOR THEOREM

EULER S FORMULA AND THE FIVE COLOR THEOREM EULER S FORMULA AND THE FIVE COLOR THEOREM MIN JAE SONG Abstract. In this paper, we will define the necessary concepts to formulate map coloring problems. Then, we will prove Euler s formula and apply

More information

Polyhedral Optimizations of Explicitly Parallel Programs

Polyhedral Optimizations of Explicitly Parallel Programs Habanero Extreme Scale Software Research Group Department of Computer Science Rice University The 24th International Conference on Parallel Architectures and Compilation Techniques (PACT) October 19, 2015

More information

Job-shop scheduling with limited capacity buffers

Job-shop scheduling with limited capacity buffers Job-shop scheduling with limited capacity buffers Peter Brucker, Silvia Heitmann University of Osnabrück, Department of Mathematics/Informatics Albrechtstr. 28, D-49069 Osnabrück, Germany {peter,sheitman}@mathematik.uni-osnabrueck.de

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

Performance Issues in Parallelization. Saman Amarasinghe Fall 2010

Performance Issues in Parallelization. Saman Amarasinghe Fall 2010 Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

Polyhedral-Based Data Reuse Optimization for Configurable Computing

Polyhedral-Based Data Reuse Optimization for Configurable Computing Polyhedral-Based Data Reuse Optimization for Configurable Computing Louis-Noël Pouchet 1 Peng Zhang 1 P. Sadayappan 2 Jason Cong 1 1 University of California, Los Angeles 2 The Ohio State University February

More information

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science Entrance Examination, 5 May 23 This question paper has 4 printed sides. Part A has questions of 3 marks each. Part B has 7 questions

More information

FADA : Fuzzy Array Dataflow Analysis

FADA : Fuzzy Array Dataflow Analysis FADA : Fuzzy Array Dataflow Analysis M. Belaoucha, D. Barthou, S. Touati 27/06/2008 Abstract This document explains the basis of fuzzy data dependence analysis (FADA) and its applications on code fragment

More information

ARELAY network consists of a pair of source and destination

ARELAY network consists of a pair of source and destination 158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 55, NO 1, JANUARY 2009 Parity Forwarding for Multiple-Relay Networks Peyman Razaghi, Student Member, IEEE, Wei Yu, Senior Member, IEEE Abstract This paper

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

arxiv:cs/ v1 [cs.ds] 20 Feb 2003

arxiv:cs/ v1 [cs.ds] 20 Feb 2003 The Traveling Salesman Problem for Cubic Graphs David Eppstein School of Information & Computer Science University of California, Irvine Irvine, CA 92697-3425, USA eppstein@ics.uci.edu arxiv:cs/0302030v1

More information

1 Linear programming relaxation

1 Linear programming relaxation Cornell University, Fall 2010 CS 6820: Algorithms Lecture notes: Primal-dual min-cost bipartite matching August 27 30 1 Linear programming relaxation Recall that in the bipartite minimum-cost perfect matching

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np Chapter 1: Introduction Introduction Purpose of the Theory of Computation: Develop formal mathematical models of computation that reflect real-world computers. Nowadays, the Theory of Computation can be

More information

HPCC Random Access Benchmark Excels on Data Vortex

HPCC Random Access Benchmark Excels on Data Vortex HPCC Random Access Benchmark Excels on Data Vortex Version 1.1 * June 7 2016 Abstract The Random Access 1 benchmark, as defined by the High Performance Computing Challenge (HPCC), tests how frequently

More information

Jump Statements. The keyword break and continue are often used in repetition structures to provide additional controls.

Jump Statements. The keyword break and continue are often used in repetition structures to provide additional controls. Jump Statements The keyword break and continue are often used in repetition structures to provide additional controls. break: the loop is terminated right after a break statement is executed. continue:

More information

arxiv: v1 [cs.dm] 21 Dec 2015

arxiv: v1 [cs.dm] 21 Dec 2015 The Maximum Cardinality Cut Problem is Polynomial in Proper Interval Graphs Arman Boyacı 1, Tinaz Ekim 1, and Mordechai Shalom 1 Department of Industrial Engineering, Boğaziçi University, Istanbul, Turkey

More information

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware

More information