Generation of parallel synchronization-free tiled code


1 Computing (2018) 100: Generation of parallel synchronization-free tiled code Wlodzimierz Bielecki 1 Marek Palkowski 1 Piotr Skotnicki 1 Received: 22 August 2016 / Accepted: 5 October 2017 / Published online: 20 October 2017 Springer-Verlag GmbH Austria 2017

Abstract A novel approach to the generation of parallel synchronization-free tiled code for loop nests is presented. It is derived via a combination of the Polyhedral and Iteration Space Slicing frameworks. It uses the transitive closure of loop nest dependence graphs to carry out corrections of original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of target (corrected) tiles. Then parallel synchronization-free tiled code is generated on the basis of valid (corrected) tiles by applying the transitive closure of dependence graphs. The main contribution of the paper is demonstrating that the presented technique is able to generate parallel synchronization-free tiled code, provided that the exact transitive closure of a dependence graph can be calculated and there exist synchronization-free slices on the statement instance level in the loop nest. We show that the presented approach extracts such parallelism when well-known techniques fail to extract it. Enlarging the scope of loop nests for which synchronization-free tiled code can be generated is achieved by means of applying the intersection of extracted slices and generated valid tiles, in contrast to forming slices of valid tiles as suggested in previously published techniques based on the transitive closure of a dependence graph. The presented approach is implemented in the publicly available TC optimizing compiler. Results of experiments demonstrating the effectiveness of the approach and the efficiency of parallel programs generated by means of it are discussed.
B Piotr Skotnicki pskotnicki@wi.zut.edu.pl Wlodzimierz Bielecki wbielecki@wi.zut.edu.pl Marek Palkowski mpalkowski@wi.zut.edu.pl 1 Faculty of Computer Science, West Pomeranian University of Technology, ul. Zolnierska 49, Szczecin, Poland

2 278 W. Bielecki et al. Keywords Synchronization-free parallelism · Tiling · Transitive closure · Optimizing compiler · Polyhedral model · Iteration space slicing Mathematics Subject Classification 65Y05 · 68M20 · 68N20

1 Introduction

In this paper, we deal with automatic parallelization of sequential programs by means of an optimizing compiler. The parallel program is executed on a computer including two or more processing units. Process synchronization is required to guarantee that a parallel program produces correct results. Synchronization is the coordination of parallel tasks in real time, often implemented by establishing a synchronization point within an application where a task may not proceed further until another task (or tasks) reaches the same point. Synchronization has a considerable impact on parallel program overhead, granularity, and load balancing. It usually involves waiting by at least one task, and can therefore cause a parallel application's wall-clock execution time to increase, i.e., it introduces parallel program overhead. Any time one task spends waiting for another is considered synchronization overhead. Minimizing its cost is a very important part of making a program efficient. Since synchronization overhead tends to grow rapidly as the number of tasks in a parallel job increases, it is the most important factor in obtaining good scaling behavior for the parallel program. In parallel computing, granularity is a qualitative measure of the ratio of computation to communication. A parallel application can be coarse- or fine-grained: coarse means that relatively large amounts of computational work are done between communication or synchronization events; fine means that relatively small amounts of computational work are done between communication events. For current multicore CPUs with support for SMT (Simultaneous Multithreading), coarse-grained parallelism is strictly required, independently of the parallel technology used (pthreads, OpenMP, MPI).
Usually, decreasing the number of synchronization events allows for increasing parallel code granularity. Load balancing is important to parallel programs for performance reasons: to get good parallel program performance, all threads should have the same volume of work to be executed. If all tasks are subject to a barrier synchronization point, the largest task will determine the overall performance. Summing up, we may conclude that to decrease synchronization overhead, increase parallel program granularity, and improve load balancing, we need to minimize the number of synchronization events in the parallel program. The ideal situation is when there is no synchronization in the parallel program, i.e., when an optimizing compiler discovers synchronization-free parallelism that requires no synchronization. Synchronization-free parallelism can be considered on the statement instance level or on the tile level, when multiple threads run independent fragments of the program comprising statement instances or tiles, respectively. Well-known techniques discovering synchronization-free parallelism are based on the affine transformation framework [10,21]. Limitations of affine transformations to extract synchronization-free parallelism are discussed in paper [3]; the authors demonstrate that despite the fact that there exists synchronization-free parallelism on the statement instance level for some classes of loop nests, there does not exist any affine transformation allowing for extracting such parallelism. The authors show how the transitive closure of dependence graphs can be used to extract synchronization-free parallelism on the statement instance level for such problematic loop nests, but they do not consider extracting synchronization-free parallelism on the tile level. In this paper, we present a way to extract synchronization-free parallelism on the tile level applying the transitive closure of dependence graphs.

Tiling [10,13,17,21,29,35] is a very important iteration reordering transformation for both improving data locality and extracting loop nest parallelism. Tiling for improving locality groups loop statement instances into smaller blocks (tiles), allowing reuse when a block fits in local memory. In parallel tiled code, tiles are considered as indivisible macro statements. This coarsens the granularity of parallel applications, which often leads to improving the performance of an application running on parallel computers with shared memory. One well-known class of tiling techniques is based on affine transformations of program loops. To generate tiled code, first affine transformations allowing for producing a band of fully permutable loops are formed, then this band is transformed into tiled code. Papers [5,6] introduce a novel approach for automatic generation of tiled code for nested loops which is based on the transitive closure of a loop nest dependence graph. This technique produces tiled code even when there does not exist any affine transformation allowing for producing a band of fully permutable loops.
According to that approach, we first form fixed rectangular original tiles and next examine whether all loop nest dependences are respected under the lexicographic order of tile enumeration. If so, we conclude that all original tiles are valid, hence code generation is straightforward. Otherwise, we correct original tiles so that all target tiles are valid, i.e., the lexicographic enumeration order of target tiles respects all dependences available in the original loop nest. The final step is the generation of code representing target (corrected) tiles. In this paper, we present a way to extract synchronization-free parallelism on the tile level applying the transitive closure of dependence graphs. Paper [6] deals with extracting slices on the tile level so that each slice contains valid tiles generated by means of applying the transitive closure of dependence graphs. In Sect. 5, we demonstrate that such a way of generating synchronization-free tiled code can lose synchronization-free parallelism on the tile level despite the existence of synchronization-free parallelism on the statement instance level. In this paper, we show how this problem can be resolved, i.e., how to generate synchronization-free tiled code provided we are able to calculate the exact transitive closure of a loop nest dependence graph and there exists synchronization-free parallelism on the statement instance level. The contributions of this paper over previous work are as follows:

– a concept and an algorithm demonstrating how the Iteration Space Slicing framework can be combined with the Polyhedral Model to generate parallel synchronization-free tiled code on the tile level when there exists synchronization-free parallelism on the statement instance level and the exact transitive closure of a dependence graph can be calculated;
– development and presentation of the publicly available source-to-source TC compiler implementing the introduced algorithms using the ISL library;
– evaluation of the effectiveness of the introduced algorithms and the speed-up of tiled code produced by means of the presented approach.

The rest of the paper is organized as follows. Section 2 contains background. Section 3 describes how synchronization-free slices can be extracted on the statement instance level. Section 4 summarizes how tiled code can be generated by means of the transitive closure of dependence graphs. Section 5 introduces a concept and an algorithm to generate parallel synchronization-free code on the tile level. Section 6 discusses related work. Section 7 highlights the results of experiments. Section 8 concludes our work and outlines plans for future work.

2 Background

In this paper, we deal with affine loop nests [11]. A statement instance S[I] is a particular execution of a loop statement S for a given iteration vector I. Given a loop nest with q statements, we transform it into its polyhedral representation, including: an iteration space IS_i for each statement S_i, i = 1,...,q; read/write access relations (RA/WA, respectively); and a global schedule S corresponding to the original execution order of statement instances in the loop nest. The loop nest iteration space IS_i is the set of statement instances executed by a loop nest for statement S_i. An access relation maps an iteration vector I_i to one or more memory locations of array elements. Schedule S is represented with a relation which maps an iteration vector of a statement to a corresponding multidimensional timestamp, i.e., a discrete time when the statement instance has to be executed.
In this paper, we use two types of iteration spaces: (i) the original one, where instances are identified by a statement identifier and a sequence of integers, and (ii) the global one, where instances are identified by their execution order. We obtain the global iteration space by means of applying the global schedule to the original iteration space. A global iteration vector I represents statement instances in the global iteration space. Further on, under I and IS, we mean the global iteration vector and global iteration space, respectively. Two statement instances S1[I] and S2[J] are dependent if both access the same memory location and if at least one access is a write. S1[I] and S2[J] are called the source and target of a dependence, respectively, provided that S1[I] is executed before S2[J]. The sequential ordering of statement instances, denoted S1[I] ≺ S2[J], is induced by the global schedule. The algorithms presented in this paper use a dependence relation, which is a tuple relation of the form {[input list] → [output list] : formula}, where input list and output list are the lists of variables and/or expressions used to describe input and output tuples, and formula describes the constraints imposed upon the input and output lists; it is a Presburger formula built of constraints represented by algebraic expressions and using logical and existential operators.
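As a finite illustration of dependent instances (a toy sketch only; real polyhedral tools represent accesses and the schedule as parametric Presburger relations), the flow dependences of a statement such as A[i][j] = A[i-1][j+1] over an assumed 4 × 4 iteration space can be enumerated by matching each read with the earlier write to the same array cell:

```python
# Toy dependence detection: iteration (i, j) writes A[i][j] and reads
# A[i-1][j+1]; a flow dependence connects the write of a cell with a
# later read of the same cell. The bound n and the access functions are
# assumptions chosen for illustration.

def dependences(n=4):
    iters = {(i, j) for i in range(1, n + 1) for j in range(1, n + 1)}
    deps = set()
    for (i, j) in iters:
        src = (i - 1, j + 1)                # iteration that wrote the cell read here
        if src in iters and src < (i, j):   # source executes first (lexicographic order)
            deps.add((src, (i, j)))
    return deps

# each pair has the form ((a, b), (a + 1, b - 1)): source -> target
print(sorted(dependences()))
```

The lexicographic comparison `src < (i, j)` plays the role of the schedule-induced order ≺ in this small setting.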

A dependence relation, describing all the dependences in a loop nest, can be computed as a union of flow, anti, and output dependences according to the following formula [32]:

R = ((RA⁻¹ ∘ WA) ∪ (WA⁻¹ ∘ RA) ∪ (WA⁻¹ ∘ WA)) ∩ ≺_S, (1)

where RA/WA are read/write access relations, respectively, mapping an iteration vector to one or more referenced memory locations, and S is the original schedule represented with a relation which maps an iteration vector of a statement to a corresponding multidimensional timestamp (i.e., a discrete time when the statement instance has to be executed). ≺_S denotes a strict partial order of statement instances: ≺_S = S⁻¹ ∘ ({[e] → [e'] : e ≺ e'} ∘ S). A dependence relation is a mathematical representation of a data dependence graph whose vertices correspond to loop statement instances while edges connect dependent instances. The input and output tuples of a relation represent dependence sources and destinations, respectively; the relation constraints point out instances which are dependent. In the presented algorithm, standard operations on relations and sets are used, such as intersection (∩), union (∪), difference (−), composition (∘), domain (domain(R)), range (range(R)), and relation application (R(S) = {[e'] : ∃ [e] ∈ S : [e] → [e'] ∈ R}). The positive transitive closure of a given relation R, R⁺, is defined as follows:

R⁺ = ⋃_{i=1}^{∞} R^i, (2)

where R^i is the i-th power of R, defined inductively by R¹ = R and, for i > 1, R^i = R ∘ R^{i−1}, where ∘ denotes the composition of relations. A relation R is reflexively closed on a set D if the identity relation Id_D is a subset of R. The reflexive closure of R on D is R ∪ Id_D. The reflexive and transitive closure of R on D is:

R* = R⁺ ∪ Id_D. (3)

It describes the same connections in a dependence graph (represented by R) that R⁺ does, plus connections of each vertex with itself.
Techniques aimed at calculating the transitive closure of a dependence graph, which in general is parametric, are presented in papers [4,18,33]; they are out of the scope of this paper. In general, it is impossible to represent the exact transitive closure using affine constraints [18]. Existing algorithms return either the exact transitive closure or an approximation of it. The exact transitive closure or an over-approximation of it can be used in the presented algorithms, but if we use an over-approximation of transitive closure, tiled code will not be optimal: it provides less code locality and/or parallelism. Paper [4] presents the time of transitive closure calculation for NPB benchmarks [23]. It depends on the number of dependence relations extracted for a loop nest and can vary from milliseconds to several minutes (when the number of dependence relations is equal to hundreds or thousands).
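For finite, non-parametric graphs, the positive transitive closure of formula (2) can be computed directly by iterating composition to a fixed point. The following sketch (plain Python over explicit vertex pairs, toy data) illustrates the definition; tools such as ISL instead operate symbolically on parametric relations:

```python
# Positive transitive closure R+ of a finite relation, computed by
# iterating R^{i+1} = R o R^i until no new pairs appear. A toy sketch;
# parametric dependence relations require symbolic techniques.

def transitive_closure(R):
    """R is a set of (source, target) pairs; returns R+."""
    closure = set(R)
    while True:
        # compose: (a, c) whenever (a, b) in closure and (b, c) in R
        new_pairs = {(a, c) for (a, b) in closure for (b2, c) in R if b == b2}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

R = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(R)))
# -> [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

Adding the identity pairs over a domain D to the result yields the reflexive and transitive closure R* of formula (3).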

3 Extraction of synchronization-free slices

To extract synchronization-free parallelism on the statement instance level applying the transitive closure of dependence graphs, we use the approach presented in paper [3]. Let us recall basic definitions related to iteration space slicing on the statement instance level.

Definition 1 (Slice) Given a dependence graph, a slice is a weakly connected component of this graph, i.e., a maximal subgraph such that each pair of its vertices is connected by some path, ignoring the edge directions.

Definition 2 (Ultimate dependence source) An ultimate dependence source is a source that is not the destination of another dependence.

Definition 3 (Representative source) The representative (source) of a slice is its lexicographically minimal ultimate dependence source in the global iteration space of a loop nest.

In order to extract parallelism represented with synchronization-free slices, we need to carry out the following steps: 1. Find a set, REPR, including representatives of slices; 2. Reconstruct slices from their representatives as independent execution flows. Given relation R found as the union of all dependence relations extracted for a loop nest, we start with forming a set of statement instances, UDS, describing all ultimate dependence sources, as the difference between the domain of R and the range of R:

UDS = domain(R) − range(R). (4)

Subsequently, to find which elements of set UDS are representatives of slices,¹ we construct a relation, R_USC, that describes all pairs (e, e') of the ultimate dependence sources contained in set UDS that are connected by some path, ignoring the edge directions. Formally, relation R_USC is defined as shown below:

R_USC = {[e] → [e'] : e, e' ∈ UDS ∧ e ≺ e' ∧ e' ∈ (R ∪ R⁻¹)⁺({[e]})}. (5)

The inequality (e ≺ e') in the constraints of relation R_USC means that e is lexicographically less than e'.
Such a condition guarantees that the lexicographically smallest element will be represented only with the input tuple, and as a result the set range(R_USC) will contain all but the lexicographically smallest sources of synchronization-free slices. R⁻¹ denotes the inverse relation of R, i.e., R⁻¹ = {[e] → [e'] : [e'] → [e] ∈ R}. The condition e' ∈ (R ∪ R⁻¹)⁺({[e]}) implies that there exists some path between e and e' when the edge directions are ignored.

¹ If a slice has multiple sources, then although all its sources belong to UDS, only the lexicographically minimal source is the representative of a slice.

Fig. 1 Connections described with relations R, R⁻¹ and R_USC

In order to illustrate the presented idea, let us consider the following relation: R := {[1] → [5]; [2] → [4]; [3] → [4]; [3] → [5]}. The graph, described with relation R, is a single weakly connected component, presented in Fig. 1 (solid lines). Set UDS computed over relation R includes vertices belonging to the set {[1]; [2]; [3]}, i.e., all the vertices that have no incoming edges and thus satisfy the definition of an ultimate dependence source. According to Definition 1, each weakly connected component constitutes a synchronization-free slice. We would therefore like to find its lexicographically smallest element, serving as the representative of a slice. First, we compute the inverse relation of R: R⁻¹ := {[4] → [2]; [4] → [3]; [5] → [1]; [5] → [3]}. Let us highlight that the forward (solid) and backward (dashed) edges in Fig. 1 form paths connecting ultimate dependence sources contained in the same slice, i.e., if there exists a pair of ultimate dependence sources described with the relation (R ∪ R⁻¹)⁺, then they both belong to a single synchronization-free slice. For the working example, relation R_USC is as follows: R_USC := {[1] → [2]; [1] → [3]; [2] → [3]}. The paths connecting the ultimate dependence sources for the working example, described with relation R_USC, are presented in Fig. 1 (dotted lines). As already mentioned, the range of relation R_USC contains all but the lexicographically smallest sources of all slices. Following this observation, in order to find set REPR comprising representatives of slices, we carry out the computation below:

REPR = UDS − range(R_USC). (6)

Set REPR is very important because its cardinality is equal to the number of synchronization-free slices; each of its elements is a slice representative, which is enough to reconstruct the corresponding slice by means of the way discussed below.
As far as the working example is considered, the set of representatives contains a single element, REPR := {[1]}, which means there exists a single slice. After finding the representatives of slices and a relation describing the connection of each representative with other ultimate dependence sources contained in the same slice, given a representative, rpr, we can reconstruct the corresponding synchronization-free slice of the original graph using the following formula:

SFS(rpr) = R*(R_USC*(rpr)). (7)

It is worth noting that if R_USC = ∅ then R_USC*(rpr) = rpr.
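The steps above can be traced on the working example with a small finite sketch (an illustration only; the actual algorithms operate on parametric sets and relations): UDS is the difference of domain and range, and the representative of each weakly connected component is its smallest ultimate dependence source:

```python
# Finite-graph sketch of slice-representative extraction (Sect. 3):
# UDS = domain(R) - range(R); the representative of each weakly
# connected component (slice) is its smallest element of UDS.
# Assumes an acyclic R, so every component contains a source.

from itertools import chain

def slice_representatives(R):
    dom = {a for a, _ in R}
    rng = {b for _, b in R}
    uds = dom - rng
    adj = {}                       # undirected adjacency: (R u R^{-1})
    for a, b in R:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, repr_set = set(), set()
    for v in sorted(chain(dom, rng)):
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:               # flood-fill one weakly connected component
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj.get(u, ()))
        seen |= comp
        repr_set.add(min(comp & uds))   # representative of this slice
    return repr_set

# Working example from the text: R = {[1]->[5]; [2]->[4]; [3]->[4]; [3]->[5]}
R = {(1, 5), (2, 4), (3, 4), (3, 5)}
print(slice_representatives(R))   # -> {1}: a single slice
```

Taking the minimum of a component's sources corresponds to discarding range(R_USC) in formula (6).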

To generate valid code representing synchronization-free parallelism and executing all statement instances that reside in the loop nest iteration space, we also need to calculate a set, IND, including independent statement instances, as follows:

IND = IS − (domain(R) ∪ range(R)), (8)

where set IS represents the global iteration space of a loop nest.

4 Loop tiling

In papers [5,6], algorithms based on the transitive closure of a dependence graph allowing for loop nest tiling are introduced. They correct original rectangular tiles so that target tiles are valid under lexicographic order. Let us consider the following example:

Example 1

for (i = 1; i <= 4; ++i)
  for (j = 1; j <= 4; ++j)
S1:   A[i][j] = A[i-1][j+1];

A data dependence analysis over the read and write accesses, based on formula (1), results in the following dependence relation: R := {S1[i, j] → S1[i + 1, j − 1] : 0 < i ≤ 3 ∧ 2 ≤ j ≤ 4}, which describes data dependences between the instances of statement S1. Figure 2a shows dependences and synchronization-free slices for Example 1. In general, for each statement S_i, i = 1,...,q, surrounded by d_i loops, we form a set TILE_i(II_i) including iterations belonging to a parametric original rectangular tile as follows:

TILE_i(II_i) = [II_i] → {[I_i] : B_i · II_i + LB_i ≤ I_i ≤ min(B_i · (II_i + 1_i) + LB_i − 1_i, UB_i) ∧ II_i ≥ 0_i},

where vectors LB_i and UB_i include the lower and upper bounds, respectively, of the indices of the loops surrounding statement S_i; diagonal matrix B_i defines the size of original rectangular tiles; elements of vectors I_i and II_i represent the original indices of the loops enclosing statement S_i and the identifiers of tiles, respectively; 1_i and 0_i are the vectors whose d_i elements all have value 1 and 0, respectively.
Additionally, with each set TILE_i, i = 1,...,q, we associate another set, II_SET_i, that includes the tile identifiers of all tiles represented with set TILE_i:

II_SET_i = {[II_i] : II_i ≥ 0_i ∧ B_i · II_i + LB_i ≤ UB_i}.

As far as Example 1 is considered, sets TILE_1 and II_SET_1 represent tiles of size 2 × 2 in space IS_1:

Fig. 2 Illustrations for Example 1: a dependences, UDS, independent iterations, slice representatives, and slices, b original tiles, c target tiles, d slices generated without splitting tiles including slice representatives, e slices generated with relation T_RPR, presented in Sect. 5.1, f slices generated with the affine transformations i' = i, j' = i + j

TILE_1 := [ii, jj] → {S1[i, j] : 0 ≤ ii ≤ 1 ∧ 0 ≤ jj ≤ 1 ∧ i > 2ii ∧ 0 < i ≤ 4 ∧ i ≤ 2 + 2ii ∧ j > 2jj ∧ 0 < j ≤ 4 ∧ j ≤ 2 + 2jj},

II_SET_1 := {[ii, jj] : 0 ≤ ii ≤ 1 ∧ 0 ≤ jj ≤ 1}.

Figure 2b illustrates original tiles of size 2 × 2 (T00, T01, T10, T11) defined by the above sets. The approach discussed in this paper is applicable to both perfectly and imperfectly nested loops. We form a global iteration space for instances of all loop nest statements by means of applying global schedule S, computed by the Polyhedral Extraction Tool (PET) [34], to sets TILE_i, IS_i used in a tiling algorithm. We call this procedure normalization. To normalize dependence relation R, we apply global schedule S to both the domain and range of this relation. To compare tile identifiers lexicographically in the global iteration space and generate valid code, we also normalize sets II_SET_i in the same way. For the reader's convenience, we present the tiling algorithm in Appendix A; it is a slight modification of that presented in paper [6]. The modification concerns the way of set and relation normalization, which is carried out using the global schedule of loop nest statement instances returned with PET [34].
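The sets TILE_1 and II_SET_1 can be enumerated concretely with a short sketch (tile size B = 2 × 2 and bounds LB = (1, 1), UB = (4, 4) as in Example 1; a toy illustration of the bounds above, not the parametric formulation):

```python
# Enumerate the original rectangular 2x2 tiles of Example 1:
# TILE(II) = { I : B*II + LB <= I <= min(B*(II+1) + LB - 1, UB) }.

B, LB, UB = (2, 2), (1, 1), (4, 4)

def tile(ii, jj):
    lo = (B[0] * ii + LB[0], B[1] * jj + LB[1])
    hi = (min(B[0] * (ii + 1) + LB[0] - 1, UB[0]),
          min(B[1] * (jj + 1) + LB[1] - 1, UB[1]))
    return {(i, j)
            for i in range(lo[0], hi[0] + 1)
            for j in range(lo[1], hi[1] + 1)}

# tile identifiers: II >= 0 and B*II + LB <= UB
II_SET = [(ii, jj) for ii in range(2) for jj in range(2)]
print(sorted(tile(0, 0)))   # -> [(1, 1), (1, 2), (2, 1), (2, 2)]  (tile T00)
```

The min() against UB clips boundary tiles when the tile size does not divide the iteration-space extent, exactly as in the parametric definition.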
The first step of the algorithm transforms a loop nest into its polyhedral representation. The second one prepares data to be used for tile correction. The third step envisages carrying out a dependence

Fig. 3 Inter-tile dependence graphs for a Example 1, and b Example 2

analysis for the loop nest. Step 4 carries out the normalization of sets and relations formed in steps 1 to 3. Steps 5 to 7 are to generate set TILE_VLD. It is the result of the correction of the original rectangular tiles and it represents target tiles valid under lexicographic order. The inter-tile dependence graph whose vertices are represented with set TILE_VLD is acyclic, so there exists a schedule for those vertices [6]. In the next section, we show how we use set TILE_VLD to generate synchronization-free code on the tile level. Figure 2c shows valid target tiles for Example 1, generated according to Algorithm A.

5 Generation of synchronization-free tiled code

In this section, we demonstrate how the techniques presented in Sects. 3 and 4 can be combined to generate parallel synchronization-free tiled code on the tile level when there exist synchronization-free slices on the statement instance level. Extracting synchronization-free parallelism on the tile level is a more complex task than that on the statement instance level. The techniques to extract slices on the tile level, discussed in papers [6,25], are based on the following steps: (i) valid (corrected) tiles are generated; (ii) a relation describing all the dependences among valid tiles (inter-tile dependences) is derived; (iii) techniques presented in paper [3] are applied to the relation obtained in step (ii) to generate synchronization-free code on the tile level. Applying the way described in the previous paragraph to Example 1, we get the target tiles shown in Fig. 2c. The inter-tile dependence graph for Example 1 is shown in Fig. 3a.
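To see why tile-level slicing can merge tiles, one can lift the instance-level dependences of Example 1 onto tile identifiers. In this sketch we use the original (uncorrected) 2 × 2 tiles for simplicity rather than the corrected T_VLD tiles; the resulting edge set also exposes the lexicographic-order violations that the tile correction of Sect. 4 removes:

```python
# Lift instance dependences (i, j) -> (i + 1, j - 1) of Example 1 to the
# level of original 2x2 tiles; an edge between two different tiles is an
# inter-tile dependence. Chains of such edges put tiles into one slice.

def tile_of(i, j, size=2, lb=1):
    return ((i - lb) // size, (j - lb) // size)

# the nine dependences of relation R for Example 1
deps = [((i, j), (i + 1, j - 1)) for i in range(1, 4) for j in range(2, 5)]
inter_tile = sorted({(tile_of(*s), tile_of(*t))
                     for s, t in deps if tile_of(*s) != tile_of(*t)})
for src, dst in inter_tile:
    flag = " (violates lexicographic order)" if dst < src else ""
    print(src, "->", dst, flag)
```

Edges whose target tile precedes the source tile lexicographically are exactly the invalid ones that force tile correction, and the connectivity of the remaining edges is what merges tiles into a single slice.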
As we can see, tiles T_VLD01, T_VLD10, T_VLD11 are in the same slice, so we lose synchronization-free parallelism on the tile level even though there exists synchronization-free parallelism on the statement instance level. In the following subsection, we discuss how this problem can be resolved.

5.1 Basic concept

There can be the following possible cases concerning the number of slice representatives within a valid tile: (i) no representative, (ii) a single representative, (iii) two or

more representatives. When two or more representatives are contained within a valid tile, it may be reasonable to split this tile into several sub-tiles so that each sub-tile includes at least one representative. The reason is the following. The number of slice representatives within a tile impacts the parallelism degree and granularity of tiled code. Increasing the number of representatives in a tile leads to decreasing the parallelism degree but increases parallel program granularity. We illustrate this trade-off in the following sub-section by means of two examples. In this paper, we consider the following three cases of splitting a tile including slice representatives: (i) a target tile including multiple representatives is not sliced, i.e., all slice representatives, extracted on the statement instance level and contained in some target tile, are included in a single slice on the tile level; (ii) a valid tile is sliced into several sub-tiles so that each one includes only a single representative, i.e., each slice representative, extracted on the statement instance level and contained in some target tile, is included in a separate slice on the tile level; (iii) a set including slice representatives is tiled so that each tile includes the same number of representatives, N, N ≥ 2, except for the last one. To generate target code corresponding to each of those cases, we need corresponding mappings from statement instances or slice representatives to abstract identifiers. For this purpose, we form the following two relations. Relation SLC, mapping each statement instance, i, to the corresponding slice representative, rpr, is formed as follows:

SLC = {[i] → [rpr] : i ∈ SFS(rpr) ∧ rpr ∈ REPR},

where SFS(rpr) is the set of instances within the slice whose representative is instance rpr, and REPR is the set including slice representatives; sets SFS(rpr) and REPR are defined in Sect. 3.
Relation T, mapping each statement instance, i, to the corresponding tile identifier, II, is built as follows:

T = {[i] → [II] : i ∈ TILE_VLD(II) ∧ II ∈ II_SET},

where TILE_VLD(II) represents instances within the valid tile with identifier II; this set is formed by means of Algorithm A. II_SET is the set including the identifiers of valid tiles; this set is defined in step 4.4 of Algorithm A. The third relation, T_RPR, is user-provided; it maps each slice representative, rpr, to the corresponding identifier, ID, of the tile generated as the result of tiling a set including all slice representatives. It is of the form:

T_RPR = {[rpr] → [ID] : rpr ∈ REPR ∧ constraints on ID}.

It is worth noting that relation T_RPR represents tiles different from the valid target tiles generated with Algorithm 1; they include only slice representatives. In this paper, we suppose that relation T_RPR has to be provided by an expert and it is an input of Algorithm 1 discussed below. If it is not present on input, the algorithm skips its usage.

For Example 1, such a relation, provided that each tile should include 2 representatives (except for the last one), can be the following:

T_RPR := {[rpr1, rpr2] → [id1, id2] : ((rpr1 = 1 ∧ id1 = 0 ∧ 2·id2 + 1 ≤ rpr2 ≤ 2·id2 + 2) ∨ (rpr2 = 4 ∧ id1 = 1 ∧ 2·id2 + 2 ≤ rpr1 ≤ min(2·id2 + 3, 4))) ∧ 0 ≤ id1, id2 ≤ 1}.

Tiles ID00, ID01, ID10, and ID11, formed according to relation T_RPR above, are shown in Fig. 2e in green. To generate code without splitting tiles including slice representatives, taking into account that representative rpr can be found as rpr = SLC(i), identifier II of the tile including instance i is formed as II = T(i), and identifier II' of the tile including representative rpr is calculated as II' = T(rpr) = T(SLC(i)), we form the following set:

CODE_UNSLICED = {[II', II, i] : II' = T(SLC(i)) ∧ II = T(i)}.

To generate code with splitting tiles including slice representatives, taking into account that representative rpr can be found as rpr = SLC(i) and identifier II of the tile including instance i is calculated as II = T(i), we form the following set:

CODE_SLICED = {[rpr, II, i] : rpr = SLC(i) ∧ II = T(i)}.

When relation T_RPR is given on input, taking into account that rpr = SLC(i), we form the following set:

CODE_SLICED = {[ID, II, i] : ID = T_RPR(SLC(i)) ∧ II = T(i)}.

Next, we apply the code generator of the Integer Set Library [14] to generate pseudo-code scanning elements of set CODE_UNSLICED or CODE_SLICED and finally postprocess this pseudo-code to get parallel pseudo-code of the following structure:

parfor scanning elements II' or rpr or ID
  for scanning elements II
    for scanning elements i

Algorithm 1 lists the steps of the procedure for the generation of synchronization-free parallel tiled pseudo-code. It includes the following three steps. The first step is responsible for the calculation of slice representatives and independent statement instances according to the technique presented in Sect. 3.
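The shape of that pseudo-code can be mimicked with a small sketch: instances are grouped by a (slice key, tile identifier) pair, the outer loop over slice keys is parallel, and the inner loops are sequential. The mappings slc and t below are toy stand-ins for relations SLC and T, not the relations computed by the algorithm:

```python
# Sketch of assembling a CODE_SLICED-like set: each statement instance i
# is keyed by (representative of its slice, identifier of its tile), so
# the outer parallel loop scans slice keys while inner loops scan tiles
# and instances. Toy 1-D data; slc and t are illustrative stand-ins.

from collections import defaultdict

def build_schedule(instances, slc, t):
    """Group instances into {slice_key: {tile_id: [instances]}}."""
    sched = defaultdict(lambda: defaultdict(list))
    for i in instances:
        sched[slc(i)][t(i)].append(i)
    return sched

# toy data: instances 0..7, two slices (even/odd), tiles of size 4
sched = build_schedule(range(8), slc=lambda i: i % 2, t=lambda i: i // 4)
for rpr in sorted(sched):               # parfor over slice keys
    for tile_id in sorted(sched[rpr]):  # sequential over tiles of a slice
        print(rpr, tile_id, sched[rpr][tile_id])
```

In the real generator the grouping and scanning are performed symbolically by ISL's code generator over the parametric sets, not by materializing instances.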
The second step is to extract synchronization-free slices on the tile level by means of the concept presented in this section and Algorithm A. The last step generates synchronization-free pseudo-code on the tile level. It is worth noting that to generate compilable code, we need a postprocessor which transforms pseudo-code into compilable code. The postprocessor organization depends on the target platform and the programming API or library used to represent and then

13 Generation of parallel synchronization-free tiled code 289 compile parallel programs. In Sect. 7, we clarify how the postprocessor of the TC compiler generates target compilable code. 5.2 Illustrative examples In this sub-section, we illustrate extracting synchronization-free slices on the tile level by means of two examples. All calculations were carried out by means of the iscc calculator [31]. Let us start with Example 1. For this loop nest, set UDS, calculated according to formula (4), is as follows: UDS := { [i, 4] 2 i 3 } {[1, j] 2 j 4 }. Relation R_USC, calculated with formula (5), is empty, hence we conclude that each element of set UDSis the representative of a slice, that is REPR = UDS.For Example 1, Fig. 2a shows dependences, ultimate dependence sources (blue points), independent iterations (green points), and slices on the statement instance level (points within red parallelograms). Consequently, set S_REPR_INDis as follows: S_REPR_IND:= { [i, 4] 2 i 4 } {[1, j] 1 j 4 }. Slice representatives and independent statement instances are depicted in Fig. 2a. After applying Algorithm 1, we receive the following set representing valid target tiles: TILE_VLD:= [ii, jj] {[i, j] i > 2ii i > 0 (( jj = 0 i 4 0 < j 3 + 2ii i) ( jj = 1 ii 1 i 2 + 2ii 4 + 2ii i j 4)) }. Figure 2c shows four target tiles defined by the above set. Assuming that tiles, including multiple slice representatives, should not be sliced and a slice representative is presented with variables i, j,setsfs is as follows: SFS := [i, j] {[i0, 4 + i i0] j = 4 2 i 4 i < i0 4 i0 <= 3 + i } {[i0, 1 + j i0] i = 1 0 < j 4 2 i0 <= 4 i0 j } {[i, 4] j = 4 2 i 4;[1, j] i = 1 0 < j 4 }. Let us note that tile T _VLD11 is divided into two sub-tiles T _VLD11 and T _VLD11. The former belongs to slice SFS1 while the latter is within slice SFS2 (see Fig. 2d. 
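The Example 1 sets above can be reproduced without iscc by brute-force enumeration over the 4 × 4 iteration space. The following plain-Python sketch does so; note that Example 1's source code is not reproduced in this excerpt, so the dependence pattern (i, j) → (i + 1, j − 1) assumed below is an inference, chosen because it is consistent with the UDS and S_REPR_IND sets quoted in the text.

```python
# Sketch: checking the Example 1 sets by enumeration instead of with iscc.
# The dependence (i, j) -> (i + 1, j - 1) is an assumption (see lead-in).
IS = {(i, j) for i in range(1, 5) for j in range(1, 5)}
R = {((i, j), (i + 1, j - 1)) for (i, j) in IS if (i + 1, j - 1) in IS}

dom = {s for (s, t) in R}
rng = {t for (s, t) in R}
UDS = dom - rng                  # formula (4): ultimate dependence sources
IND = IS - (dom | rng)           # independent statement instances
S_REPR_IND = UDS | IND           # R_USC is empty here, so REPR = UDS

# These match the sets quoted in the text.
assert UDS == {(2, 4), (3, 4)} | {(1, j) for j in (2, 3, 4)}
assert S_REPR_IND == {(i, 4) for i in (2, 3, 4)} | {(1, j) for j in (1, 2, 3, 4)}
```

With this dependence pattern, the only iterations that are neither a source nor a destination of any dependence are (1, 1) and (4, 4), which is why S_REPR_IND extends UDS by exactly those two points.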
Set CODE includes the following elements:

{ [1,1,1,1,4,4]; [1,1,1,1,3,4]; [0,1,0,1,2,4]; [0,1,0,1,1,4]; [1,1,1,1,4,3]; [0,1,1,1,3,3]; [0,1,0,1,2,3]; [0,1,0,1,1,3]; [0,1,1,1,4,2]; [0,1,1,0,3,2];

Algorithm 1 Generation of synchronization-free code on the tile level.

Input: Global iteration space represented with set IS, normalized dependence relation R describing all the dependences available in the original loop nest, set TILE_VLD describing valid target tiles obtained by means of Algorithm A, set II_SET comprising valid tile identifiers, a value of variable sliced = TRUE/FALSE defining whether multiple slice representatives in a tile should be sliced (TRUE) or not (FALSE), relation T_RPR (if provided) mapping a slice representative to the corresponding identifier of the tile generated due to tiling a set including all representatives.

Output: Synchronization-free parallel tiled code if it exists.

Method:

1. Calculation of slice representatives and independent loop nest statement instances.
1.1 Calculate set UDS, including ultimate dependence sources: UDS = domain(R) − range(R).
1.2 Calculate relation R_USC, which describes all pairs (e, e′) of the ultimate dependence sources contained in set UDS that belong to the same slice: R_USC = { [e] → [e′] : e, e′ ∈ UDS ∧ e ≺ e′ ∧ e′ ∈ (R ∪ R⁻¹)⁺({[e]}) }.
1.3 Calculate set REPR, including synchronization-free slice representatives: REPR = UDS − range(R_USC).
1.4 Calculate set IND, including independent statement instances: IND = IS − (domain(R) ∪ range(R)).
1.5 Calculate set S_REPR_IND, including both slice representatives and independent statement instances: S_REPR_IND = REPR ∪ IND.
1.6 If the cardinality of set S_REPR_IND is equal to 1, then print "There are no synchronization-free slices in the original code"; the end.
2. Extraction of synchronization-free slices on the tile level.
2.1 Form set SFS(rpr), including statement instances belonging to the synchronization-free slice defined with representative rpr: SFS(rpr) = R∗(R_USC∗({[rpr]})).
2.2 Form relation SLC, which maps each statement instance i to the corresponding slice representative rpr: SLC = { [i] → [rpr] : i ∈ SFS(rpr) ∧ rpr ∈ S_REPR_IND }.
2.3 Form relation T, which maps each instance i to the corresponding tile identifier II: T = { [i] → [II] : i ∈ TILE_VLD(II) ∧ II ∈ II_SET }, where II_SET is the set including the identifiers of tiles; this set is defined in step 4.4 of Algorithm A.
2.4 If sliced = FALSE, form set CODE as follows: CODE = { [II′, II, i] : II′ = T(SLC(i)) ∧ II = T(i) }. Else, if relation T_RPR is not provided: CODE = { [rpr, II, i] : rpr = SLC(i) ∧ II = T(i) }. Else: CODE = { [ID, II, i] : ID = T_RPR(SLC(i)) ∧ II = T(i) }.
3. Code generation.
3.1 Generate pseudo-code applying ISL [30] or CLooG [1] to set CODE and transform it to the following form:
parfor scanning elements II′ or rpr or ID
  for scanning elements II
    for scanning elements i
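The steps of Algorithm 1 can be sketched in plain Python over finite, explicitly enumerated relations (the paper performs these operations on parametric sets with ISL). The 1-D iteration space below, its chain dependence i → i + 2, and the dictionaries standing in for relations SLC and T are hypothetical, chosen so that both slice representatives land in the same tile, which makes the difference between CODE_UNSLICED and CODE_SLICED visible.

```python
# Steps 1.1-1.5 and 2.2-2.4 of Algorithm 1 on a toy 1-D space i = 1..8 with
# dependences i -> i + 2, i.e., two interleaved synchronization-free slices.
IS = set(range(1, 9))
R = {(i, i + 2) for i in range(1, 7)}

dom, rng = {s for s, _ in R}, {t for _, t in R}
UDS = dom - rng                                     # step 1.1: {1, 2}

def reach(rel, start):                              # transitive closure applied to a set
    seen, frontier = set(start), set(start)
    while frontier:
        frontier = {t for (s, t) in rel if s in frontier} - seen
        seen |= frontier
    return seen

undirected = R | {(t, s) for s, t in R}             # R u R^-1
R_USC = {(e, f) for e in UDS for f in UDS           # step 1.2 (e < f stands in for lexicographic order)
         if e < f and f in reach(undirected, {e})}
REPR = UDS - {f for _, f in R_USC}                  # step 1.3: the two chains never meet
IND = IS - (dom | rng)                              # step 1.4: empty here
S_REPR_IND = REPR | IND                             # step 1.5

SLC = {i: min(reach(undirected, {i})) for i in IS}  # step 2.2: instance -> representative
T = {i: (i - 1) // 2 for i in IS}                   # step 2.3: tiles of size 2, ids 0..3

# Step 2.4: both representatives fall into tile 0, so CODE_UNSLICED exposes only
# one value for the parfor coordinate, while CODE_SLICED recovers both slices.
CODE_UNSLICED = {(T[SLC[i]], T[i], i) for i in IS}
CODE_SLICED = {(SLC[i], T[i], i) for i in IS}
print(len({c[0] for c in CODE_UNSLICED}), len({c[0] for c in CODE_SLICED}))  # 1 2
```

The outermost (parfor) coordinate of the generated pseudo-code scans the first element of each tuple, so its number of distinct values is the parallelism degree exposed on the tile level; here slicing the tile that holds both representatives is exactly what recovers the two slices.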

[0,1,0,1,2,2]; [0,0,0,0,1,2]; [0,1,1,0,4,1]; [0,1,1,0,3,1]; [0,0,0,0,2,1]; [0,0,0,0,1,1] }.

The functions of the elements in set CODE are the following. The first pair of elements represents the identifier of the tile including slice representatives (0,0 or 0,1 or 1,1), the second pair stands for the identifiers of the tiles including elements of the slice defined with the slice representatives determined with the first pair (from 0,0 to 1,1), and the third pair defines the iterations of the tile with the identifier represented with the second pair (from 1,1 to 4,4). Figure 2d illustrates the three slices SFS0, SFS1, and SFS2 represented with set CODE.

Applying relation T_RPR presented in the previous sub-section, we get the tiles shown in blue in Fig. 2e. They are the same as those obtained by applying the affine transformations i′ = i, j′ = i + j to the original iteration space and then applying Algorithm A to the transformed iteration space, see Fig. 2f.

Let us now consider another example.

Example 2

    for (i = 1; i <= 4; ++i)
      for (j = 1; j <= 4; ++j)
S1:     A[i][j] = A[i+1][j+1] + A[i+1][j-1];

Dependences in this loop are described by the following relation:

R := { [i, j] → [i + 1, j + 1] : 0 < i ≤ 3 ∧ 0 < j ≤ 3 } ∪ { [i, j] → [i + 1, j − 1] : 0 < i ≤ 3 ∧ 2 ≤ j ≤ 4 }.

Figure 4a illustrates the dependences for this example. As we can see, there are two synchronization-free slices; red arrows depict the first one, while the black ones form the second. Set UDS computed over relation R includes the iterations contained in the following set: { [1, 1]; [1, 2]; [1, 3]; [1, 4] }. Figure 4a presents those four ultimate dependence sources; however, only two of them are slice representatives. In order to exclude non-representative sources, we apply formula (5) to get the following relation:

R_USC := { [1, 1] → [1, 3]; [1, 2] → [1, 4] }.

Subsequently, we form set S_REPR as

S_REPR := UDS − range(R_USC) = { [1, 1]; [1, 2] }.
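For Example 2, the quoted sets can be verified by brute-force enumeration, since relation R is given explicitly above. A plain-Python sketch (sets of pairs standing in for the iscc relations, and Python tuple ordering standing in for the lexicographic order ≺):

```python
# Sketch: reproducing the Example 2 calculations (UDS, R_USC, S_REPR) by
# enumerating relation R over the 4 x 4 iteration space.
IS = {(i, j) for i in range(1, 5) for j in range(1, 5)}
R = ({((i, j), (i + 1, j + 1)) for (i, j) in IS if i <= 3 and j <= 3} |
     {((i, j), (i + 1, j - 1)) for (i, j) in IS if i <= 3 and j >= 2})

dom, rng = {s for s, _ in R}, {t for _, t in R}
UDS = dom - rng
assert UDS == {(1, 1), (1, 2), (1, 3), (1, 4)}

# Connect ultimate sources that reach a common instance via (R u R^-1)+.
undirected = R | {(t, s) for (s, t) in R}

def reachable(start):
    seen, frontier = {start}, {start}
    while frontier:
        frontier = {t for (s, t) in undirected if s in frontier} - seen
        seen |= frontier
    return seen

R_USC = {(e, f) for e in UDS for f in UDS if e < f and f in reachable(e)}
assert R_USC == {((1, 1), (1, 3)), ((1, 2), (1, 4))}

S_REPR = UDS - {f for _, f in R_USC}
assert S_REPR == {(1, 1), (1, 2)}
```

Both dependences preserve the parity of i + j, so (1, 1) and (1, 3) end up in one slice and (1, 2) and (1, 4) in the other, which is exactly what relation R_USC records.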
Let us note that there are no independent statement instances for this example, i.e., set S_REPR_IND is the same as set S_REPR. Applying Algorithm A, we obtain the following set representing target tiles:

TILE_VLD := { [ii, jj] → [i, j] : i > 2ii ∧ i > 0 ∧ ((jj = 0 ∧ i ≤ 4 ∧ 0 < j ≤ 3 + 2ii − i) ∨ (jj = 1 ∧ ii ≤ 1 ∧ i ≤ 2 + 2ii ∧ 4 + 2ii − i ≤ j ≤ 4)) }.

Figure 4b shows the original tiles of size 2 × 2 and the target tiles. The dependence graph on the target tile level is shown in Fig. 3b. As we can see, for this example, all

Fig. 4 Illustrations for Example 2: (a) dependences, UDS, slice representatives, and slices; (b) original and target tiles; (c) slice with representative (1,1); (d) slice with representative (1,2)

target tiles are combined in a single slice, i.e., the way described in paper [6] fails to extract any synchronization-free parallelism on the tile level. Applying Algorithm 1 with slicing of tiles including multiple slice representatives, we first construct set SFS; we skip its mathematical representation because it is too long. Figure 4c, d shows the two synchronization-free slices with the representatives (1,1) and (1,2), respectively. As we can see, the elements of the target tiles are divided between the two slices. Each slice includes different elements of all target tiles. Eventually, for the purpose of the code generation phase, we form set CODE, which contains the following elements:

{ [1,1,1,1,4,4]; [1,2,1,1,3,4]; [1,1,0,1,2,4]; [1,2,0,1,1,4]; [1,2,1,1,4,3]; [1,1,1,1,3,3]; [1,2,0,1,2,3]; [1,1,0,1,1,3]; [1,1,1,1,4,2]; [1,2,1,0,3,2]; [1,1,0,1,2,2]; [1,2,0,0,1,2]; [1,2,1,0,4,1]; [1,1,1,0,3,1]; [1,2,0,0,2,1]; [1,1,0,0,1,1] }.

The roles of the elements in this set are the following. The values of the first two elements correspond to a slice representative (1,1 or 1,2), the second pair stands for the identifiers of the tiles including elements of the slice defined with that slice representative

(0,0 or 0,1 or 1,0 or 1,1), and the third pair defines the iterations of the tile with the identifier represented with the second pair (from 1,1 to 4,4).

5.3 Discussion

The high-level concept presented in Sect. 5.1 does not depend on how synchronization-free slices on the statement instance level are extracted or on how target tiles valid under the lexicographic order are derived, but the way it is implemented defines the quality of the generated tiled code. The result of extracting slices should be presented with a relation which maps each statement instance to the corresponding slice representative, while the result of tiling has to be presented with a relation which maps each statement instance to the corresponding tile identifier. In addition to those relations, an expert can provide a way of splitting slice representatives and the corresponding relation which maps each representative to the identifier of the tile generated due to splitting slice representatives. Such a relation may allow for the generation of better tiled code. These three relations are then used to form a set, as shown in Sect. 5.1, responsible for the generation of tiled code. These relations define the effectiveness of the concept and the quality of the code generated by it: the parallelism degree, program granularity, performance, and scalability depend on the implementation of this concept. In this paper, to implement the introduced concept, we present and apply Algorithms A and 1, based on the transitive closure of dependence graphs. Applying other algorithms, for example those based on affine transformations, will lead to other tiled code.

6 Related work

There has been a considerable amount of research into tiling demonstrating how to aggregate a set of loop nest iterations into tiles with each tile as an atomic macro statement, from pioneering papers [17,29,35,36] to those presenting advanced techniques [13,16,19-21].
Several popular frameworks are used to produce tiled code: the affine transformation framework based on the Polyhedral Model [9,11,12,22], the non-polyhedral model [19], and Iteration Space Slicing [27,28]. The affine transformation framework is one of the most advanced reordering transformations. Let us recall that this approach includes the following three steps: (i) program analysis, aimed at translating high-level codes to their polyhedral representation and providing data dependence analysis based on this representation; (ii) program transformation, with the aim of improving program locality and/or parallelization; (iii) code generation. All three steps are present in the algorithms presented in this paper. But there exists the following difference in step (ii): in affine transformations, a (sequence of) program transformation(s) is represented by a set of affine functions, one for each statement, while the presented approach does not find or use any affine function for program transformation(s). It applies the transitive closure of a program dependence graph to transform invalid original tiles into valid target ones. From this point of view, the program transformation step is rather within the Iteration Space Slicing framework introduced by Pugh and Rosser [27]: Iteration Space Slicing takes dependence information as input to find all statement instances from a given loop nest which must be executed to produce correct values for the specified array elements. That is, we may conclude that the introduced algorithms are based on a combination of the Polyhedral and Iteration Space Slicing frameworks. Such a combination allows for improving the effectiveness of the loop nest tiling transformation. In the next section, we show that for the examined benchmarks, the presented approach extracts more synchronization-free parallelism than that provided by well-known affine transformations.

Papers [3,7] demonstrate how to extract coarse- and fine-grained parallelism applying different Iteration Space Slicing algorithms; however, they do not consider any tiling transformation. Paper [5] deals with applying transitive closure to only perfectly nested loops and does not present any algorithm to extract synchronization-free parallelism. In paper [25], the authors present a way to extract synchronization-free parallelism using a relation representing inter-tile dependences. As we demonstrated in Sect. 5, such an approach can fail to discover synchronization-free parallelism on the tile level. Paper [8] demonstrates how tiled code can be generated applying free scheduling of tiles, but such a way results in generating code with synchronization. Paper [6] deals with tiling arbitrarily nested loops; however, extracting synchronization-free parallelism is based on forming slices including valid tiles without splitting tiles that include multiple slice representatives. That prevents extracting synchronization-free parallelism on the tile level for some classes of loop nests. Diamond and hexagonal tiling [2,15] allow for scalable parallelism, but they can be applied only to stencil algorithms, while the approach presented in our paper is of general usage.
Summing up, we may conclude that, provided that the exact transitive closure of a dependence graph can be calculated, the approach presented in this paper is able to generate synchronization-free parallelism on the tile level whenever there exists synchronization-free parallelism on the statement instance level. As far as disadvantages of the presented technique are concerned, we observed that for loop nests exposing non-regular dependences, the generated tiles are also non-regular, which results in more complicated code than that generated with affine transformations. Some target tiles can be parametric, i.e., their sizes depend on parametric upper loop index bounds. This does not guarantee that the data size per parametric tile is smaller than the capacity of the cache. In such a case, a parametric tile represents an iteration sub-space where tiling is excluded. We will address and illustrate these issues in our future work.

7 Implementation and experimental study

The algorithms presented in this paper have been implemented in the TC optimizing compiler, which utilizes the Polyhedral Extraction Tool [34] for extracting

polyhedral representations of original loop nests, and the Integer Set Library [30] for performing dependence analysis, manipulating integer sets and relations, and generating output code. To evaluate the effectiveness of the presented approach and the performance of parallel tiled code generated by means of this approach, we have experimented with the PolyBench/C 4.1 [26] benchmark suite. Out of the 30 benchmarks contained in PolyBench, TC finds a total of 8 benchmarks for which there exists more than one synchronization-free slice on both the statement instance and tile levels. The list of these loop nests is as follows: 2mm, bicg, gemm, gesummv, mvt, syr2k, syrk, trmm. The code generated by TC for the studied kernels can be found in the results directory of the compiler's repository.

The evaluation of parallel code performance was carried out on a multicore architecture (2x Intel Xeon E v3 clocked at 2.3 GHz, 18 cores/socket, 36 threads/socket, 32 KB L1 data cache/core, 256 KB L2 cache/core, 45 MB L3 cache/socket, 256 GB RAM clocked at 2133 MHz). The code of both the original and transformed loop nests was compiled under the Linux kernel x86_64 by GCC with the -O3 optimization enabled. Each examined loop nest was tiled with tiles of size 32 (in each dimension). The transformed code was then executed using 1, 2, 4, 8, 16, 32, and 64 threads in subsequent runs. The problem sizes used for the studied benchmarks are shown in Table 1, which also presents the number of synchronization-free slices extracted from each loop nest, expressed as formulas involving loop index upper bounds and a tile size constant B representing the width of a tile side in each dimension (32 in our study).
The vertical bar indicates that in code generated with TC and PLUTO [10], a state-of-the-art optimizing compiler based on the affine transformation framework, there is a synchronization barrier between parallel regions, each including synchronization-free code; e.g., M | N means that first a loop nest of M synchronization-free slices is computed, then, after a barrier, another loop nest of N synchronization-free slices is executed. The row "Theoretical" presents the cardinality of set S_REPR_IND, i.e., the total number of synchronization-free slices extracted with Algorithm 1. The rows "TC sliced" and "TC unsliced" present the number of slices extracted when the tiles including multiple slice representatives are sliced (so that each such tile includes a single representative) and unsliced, respectively. The row "PLUTO" shows the number of slices extracted with PLUTO.

For several kernels, although Algorithm 1 extracts only synchronization-free parallelism, the compilable TC code has a synchronization point (for example, for kernel mvt). This fact is explained by the way a TC post-processor generates compilable code. Set CODE, generated with Algorithm 1, may consist of several sub-sets. For each sub-set, the code generator may produce a separate loop nest (although all those loop nests are independent), and for each of them, the TC post-processor generates a separate parallel loop nest. So we have a sequence of parallel loop nests, and there exists barrier synchronization between each pair of such loop nests. Thus, TC does not exploit the whole parallelism extracted with Algorithm 1.
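The post-processing behaviour described above can be illustrated with a small sketch (hypothetical code, not TC's actual post-processor): each sub-set of CODE becomes its own parallel region, and composing the regions sequentially implies a barrier between every pair of them, even though the regions themselves are independent.

```python
# Sketch: why emitting one parallel loop nest per sub-set of CODE re-introduces
# barriers. `code_subsets` is a hypothetical partition of a CODE set; the emitted
# "parfor" lines stand in for generated parallel loop nests.

def postprocess(code_subsets):
    lines = []
    for k, subset in enumerate(code_subsets):
        if k > 0:
            # sequential composition of parallel regions implies a barrier here
            lines.append("// implicit barrier between parallel regions")
        lines.append("parfor over " + str(sorted(subset)))
    return lines

# Two independent sub-sets of a toy CODE set still get one barrier between them.
regions = postprocess([{(0, 0, 1), (0, 0, 2)}, {(1, 1, 3), (1, 1, 4)}])
barriers = sum("barrier" in line for line in regions)
print(barriers)  # 1
```

Merging the independent sub-sets into a single parallel region before emitting code would remove these barriers, which is exactly the unexploited parallelism the text refers to.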


x ji = s i, i N, (1.1)

x ji = s i, i N, (1.1) Dual Ascent Methods. DUAL ASCENT In this chapter we focus on the minimum cost flow problem minimize subject to (i,j) A {j (i,j) A} a ij x ij x ij {j (j,i) A} (MCF) x ji = s i, i N, (.) b ij x ij c ij,

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Lecture 2: Getting Started

Lecture 2: Getting Started Lecture 2: Getting Started Insertion Sort Our first algorithm is Insertion Sort Solves the sorting problem Input: A sequence of n numbers a 1, a 2,..., a n. Output: A permutation (reordering) a 1, a 2,...,

More information

Offload acceleration of scientific calculations within.net assemblies

Offload acceleration of scientific calculations within.net assemblies Offload acceleration of scientific calculations within.net assemblies Lebedev A. 1, Khachumov V. 2 1 Rybinsk State Aviation Technical University, Rybinsk, Russia 2 Institute for Systems Analysis of Russian

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

Performance Issues in Parallelization Saman Amarasinghe Fall 2009

Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

Framework for Design of Dynamic Programming Algorithms

Framework for Design of Dynamic Programming Algorithms CSE 441T/541T Advanced Algorithms September 22, 2010 Framework for Design of Dynamic Programming Algorithms Dynamic programming algorithms for combinatorial optimization generalize the strategy we studied

More information

Loop Transformations, Dependences, and Parallelization

Loop Transformations, Dependences, and Parallelization Loop Transformations, Dependences, and Parallelization Announcements HW3 is due Wednesday February 15th Today HW3 intro Unimodular framework rehash with edits Skewing Smith-Waterman (the fix is in!), composing

More information

A Connection between Network Coding and. Convolutional Codes

A Connection between Network Coding and. Convolutional Codes A Connection between Network Coding and 1 Convolutional Codes Christina Fragouli, Emina Soljanin christina.fragouli@epfl.ch, emina@lucent.com Abstract The min-cut, max-flow theorem states that a source

More information

Affine Loop Optimization using Modulo Unrolling in CHAPEL

Affine Loop Optimization using Modulo Unrolling in CHAPEL Affine Loop Optimization using Modulo Unrolling in CHAPEL Aroon Sharma, Joshua Koehler, Rajeev Barua LTS POC: Michael Ferguson 2 Overall Goal Improve the runtime of certain types of parallel computers

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

Shared memory parallel algorithms in Scotch 6

Shared memory parallel algorithms in Scotch 6 Shared memory parallel algorithms in Scotch 6 François Pellegrini EQUIPE PROJET BACCHUS Bordeaux Sud-Ouest 29/05/2012 Outline of the talk Context Why shared-memory parallelism in Scotch? How to implement

More information

Transforming Complex Loop Nests For Locality

Transforming Complex Loop Nests For Locality Transforming Complex Loop Nests For Locality Qing Yi Ken Kennedy Computer Science Department Rice University Abstract Because of the increasing gap between the speeds of processors and standard memory

More information

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014

Introduction to Parallel Computing. CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 Introduction to Parallel Computing CPS 5401 Fall 2014 Shirley Moore, Instructor October 13, 2014 1 Definition of Parallel Computing Simultaneous use of multiple compute resources to solve a computational

More information

Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures

Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures Uday Bondhugula Indian Institute of Science Supercomputing 2013 Nov 16 22, 2013 Denver, Colorado 1/46 1 Introduction 2 Distributed-memory

More information

Static and Dynamic Frequency Scaling on Multicore CPUs

Static and Dynamic Frequency Scaling on Multicore CPUs Static and Dynamic Frequency Scaling on Multicore CPUs Wenlei Bao 1 Changwan Hong 1 Sudheer Chunduri 2 Sriram Krishnamoorthy 3 Louis-Noël Pouchet 4 Fabrice Rastello 5 P. Sadayappan 1 1 The Ohio State University

More information

EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100

EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100 EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100 1 [10 points] 1. Task parallelism: The computations in a parallel algorithm can be split into a set of tasks for concurrent execution. Task

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Overpartioning with the Rice dhpf Compiler

Overpartioning with the Rice dhpf Compiler Overpartioning with the Rice dhpf Compiler Strategies for Achieving High Performance in High Performance Fortran Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/hug00overpartioning.pdf

More information

Jump Statements. The keyword break and continue are often used in repetition structures to provide additional controls.

Jump Statements. The keyword break and continue are often used in repetition structures to provide additional controls. Jump Statements The keyword break and continue are often used in repetition structures to provide additional controls. break: the loop is terminated right after a break statement is executed. continue:

More information

Handout 9: Imperative Programs and State

Handout 9: Imperative Programs and State 06-02552 Princ. of Progr. Languages (and Extended ) The University of Birmingham Spring Semester 2016-17 School of Computer Science c Uday Reddy2016-17 Handout 9: Imperative Programs and State Imperative

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

More information

Practice Exercises 449

Practice Exercises 449 Practice Exercises 449 Kernel processes typically require memory to be allocated using pages that are physically contiguous. The buddy system allocates memory to kernel processes in units sized according

More information

Example of a Parallel Algorithm

Example of a Parallel Algorithm -1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software

More information

ABC basics (compilation from different articles)

ABC basics (compilation from different articles) 1. AIG construction 2. AIG optimization 3. Technology mapping ABC basics (compilation from different articles) 1. BACKGROUND An And-Inverter Graph (AIG) is a directed acyclic graph (DAG), in which a node

More information

Exploring Parallelism At Different Levels

Exploring Parallelism At Different Levels Exploring Parallelism At Different Levels Balanced composition and customization of optimizations 7/9/2014 DragonStar 2014 - Qing Yi 1 Exploring Parallelism Focus on Parallelism at different granularities

More information

High Performance Computing. Introduction to Parallel Computing

High Performance Computing. Introduction to Parallel Computing High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Extremal Graph Theory: Turán s Theorem

Extremal Graph Theory: Turán s Theorem Bridgewater State University Virtual Commons - Bridgewater State University Honors Program Theses and Projects Undergraduate Honors Program 5-9-07 Extremal Graph Theory: Turán s Theorem Vincent Vascimini

More information

Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery

Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Introduction Problems & Solutions Join Recognition Experimental Results Introduction GK Spring Workshop Waldau: Pathfinder/MonetDB: A High-Performance Relational Runtime for XQuery Database & Information

More information

Parametrically Tiled Distributed Memory Parallelization of Polyhedral Programs

Parametrically Tiled Distributed Memory Parallelization of Polyhedral Programs Computer Science Technical Report Parametrically Tiled Distributed Memory Parallelization of Polyhedral Programs Tomofumi Yuki Sanjay Rajopadhye June 10, 2013 Colorado State University Technical Report

More information

SourcererCC -- Scaling Code Clone Detection to Big-Code

SourcererCC -- Scaling Code Clone Detection to Big-Code SourcererCC -- Scaling Code Clone Detection to Big-Code What did this paper do? SourcererCC a token-based clone detector, that can detect both exact and near-miss clones from large inter project repositories

More information

EULER S FORMULA AND THE FIVE COLOR THEOREM

EULER S FORMULA AND THE FIVE COLOR THEOREM EULER S FORMULA AND THE FIVE COLOR THEOREM MIN JAE SONG Abstract. In this paper, we will define the necessary concepts to formulate map coloring problems. Then, we will prove Euler s formula and apply

More information

Polyhedral Optimizations of Explicitly Parallel Programs

Polyhedral Optimizations of Explicitly Parallel Programs Habanero Extreme Scale Software Research Group Department of Computer Science Rice University The 24th International Conference on Parallel Architectures and Compilation Techniques (PACT) October 19, 2015

More information

Job-shop scheduling with limited capacity buffers

Job-shop scheduling with limited capacity buffers Job-shop scheduling with limited capacity buffers Peter Brucker, Silvia Heitmann University of Osnabrück, Department of Mathematics/Informatics Albrechtstr. 28, D-49069 Osnabrück, Germany {peter,sheitman}@mathematik.uni-osnabrueck.de

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

Performance Issues in Parallelization. Saman Amarasinghe Fall 2010

Performance Issues in Parallelization. Saman Amarasinghe Fall 2010 Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

Polyhedral-Based Data Reuse Optimization for Configurable Computing

Polyhedral-Based Data Reuse Optimization for Configurable Computing Polyhedral-Based Data Reuse Optimization for Configurable Computing Louis-Noël Pouchet 1 Peng Zhang 1 P. Sadayappan 2 Jason Cong 1 1 University of California, Los Angeles 2 The Ohio State University February

More information

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science Entrance Examination, 5 May 23 This question paper has 4 printed sides. Part A has questions of 3 marks each. Part B has 7 questions

More information

FADA : Fuzzy Array Dataflow Analysis

FADA : Fuzzy Array Dataflow Analysis FADA : Fuzzy Array Dataflow Analysis M. Belaoucha, D. Barthou, S. Touati 27/06/2008 Abstract This document explains the basis of fuzzy data dependence analysis (FADA) and its applications on code fragment

More information

ARELAY network consists of a pair of source and destination

ARELAY network consists of a pair of source and destination 158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 55, NO 1, JANUARY 2009 Parity Forwarding for Multiple-Relay Networks Peyman Razaghi, Student Member, IEEE, Wei Yu, Senior Member, IEEE Abstract This paper

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

arxiv:cs/ v1 [cs.ds] 20 Feb 2003

arxiv:cs/ v1 [cs.ds] 20 Feb 2003 The Traveling Salesman Problem for Cubic Graphs David Eppstein School of Information & Computer Science University of California, Irvine Irvine, CA 92697-3425, USA eppstein@ics.uci.edu arxiv:cs/0302030v1

More information

1 Linear programming relaxation

1 Linear programming relaxation Cornell University, Fall 2010 CS 6820: Algorithms Lecture notes: Primal-dual min-cost bipartite matching August 27 30 1 Linear programming relaxation Recall that in the bipartite minimum-cost perfect matching

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np

Complexity Theory. Compiled By : Hari Prasad Pokhrel Page 1 of 20. ioenotes.edu.np Chapter 1: Introduction Introduction Purpose of the Theory of Computation: Develop formal mathematical models of computation that reflect real-world computers. Nowadays, the Theory of Computation can be

More information

HPCC Random Access Benchmark Excels on Data Vortex

HPCC Random Access Benchmark Excels on Data Vortex HPCC Random Access Benchmark Excels on Data Vortex Version 1.1 * June 7 2016 Abstract The Random Access 1 benchmark, as defined by the High Performance Computing Challenge (HPCC), tests how frequently

More information

Jump Statements. The keyword break and continue are often used in repetition structures to provide additional controls.

Jump Statements. The keyword break and continue are often used in repetition structures to provide additional controls. Jump Statements The keyword break and continue are often used in repetition structures to provide additional controls. break: the loop is terminated right after a break statement is executed. continue:

More information

arxiv: v1 [cs.dm] 21 Dec 2015

arxiv: v1 [cs.dm] 21 Dec 2015 The Maximum Cardinality Cut Problem is Polynomial in Proper Interval Graphs Arman Boyacı 1, Tinaz Ekim 1, and Mordechai Shalom 1 Department of Industrial Engineering, Boğaziçi University, Istanbul, Turkey

More information

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware

More information