Victor Malyshkin (Ed.): Parallel Computing Technologies. 13th International Conference, PaCT 2015, Petrozavodsk, Russia, August 31 – September 4, 2015, Proceedings


Lecture Notes in Computer Science

The LNCS series reports state-of-the-art results in computer science research, development, and education, at a high level and in both printed and electronic form. Enjoying tight cooperation with the R&D community, with numerous individuals, as well as with prestigious organizations and societies, LNCS has grown into the most comprehensive computer science research forum available. The scope of LNCS, including its subseries LNAI and LNBI, spans the whole range of computer science and information technology, including interdisciplinary topics in a variety of application fields. The type of material published traditionally includes
- proceedings (published in time for the respective conference),
- post-proceedings (consisting of thoroughly revised final full papers), and
- research monographs (which may be based on outstanding PhD work, research projects, technical reports, etc.).

More recently, several color-cover sublines have been added featuring, beyond a collection of papers, various added-value components; these sublines include
- tutorials (textbook-like monographs or collections of lectures given at advanced courses),
- state-of-the-art surveys (offering complete and mediated coverage of a topic), and
- hot topics (introducing emergent topics to the broader community).

In parallel to the printed book, each new volume is published electronically in LNCS Online. Proposals for publication should be sent to LNCS Editorial, Tiergartenstr. 17, Heidelberg, Germany (lncs@springer.com).

Victor Malyshkin (Ed.), LNCS 9251: Parallel Computing Technologies. 13th International Conference, PaCT 2015, Petrozavodsk, Russia, August 31 – September 4, 2015, Proceedings.

Heuristic Algorithms for Optimizing Array Operations in Parallel PGAS-programs

Ivan Kulagin 1, Alexey Paznikov 1,2, and Mikhail Kurnosov 1,3

1 Siberian State University of Telecommunications and Information Sciences, 86 Kirova str., Novosibirsk, Russia
2 Rzhanov Institute of Semiconductor Physics of the Siberian Branch of the RAS, 13 Lavrentev ave., Novosibirsk, Russia
3 Saint Petersburg Electrotechnical University LETI, 5 Professor Popov str., Saint-Petersburg, Russia
ikulagin@sibsutis.ru, apaznikov@isp.nsc.ru, mkurnosov@gmail.com

Abstract. Algorithms for optimizing array operations in PGAS programs are presented. They minimize execution time by taking into account the hierarchical structure of computer systems during reduction and by preloading remote elements to nodes when accessing distributed arrays. The algorithms are implemented for Cray Chapel and IBM X10.

Keywords: PGAS, compiler optimization, reduction, scalar replacement

1 Introduction

The main approach to developing parallel programs for modern distributed computer systems (CS) is the message-passing interface, MPI (MPICH2, Open MPI, Intel MPI). The major challenge for modern CS is their lack of programmability: to exploit all the resources of such systems, diverse technologies (OpenMP/Intel TBB/Intel Cilk Plus, NVIDIA CUDA/OpenCL, SSE/AVX) have to be used in conjunction with MPI. While this model provides a great deal of flexibility and performance potential, it burdens programmers with the complexity of using multiple programming systems in the same application. The need to simplify parallel programming has led to the development of high-level tools, e.g. languages that implement the partitioned global address space (PGAS) model, including Cray Chapel, IBM X10, and UPC. Unlike MPI programs, PGAS programs do not call communication functions explicitly; instead, they operate on distributed structures and use constructs for parallel task management (threads, activities) and synchronization. All communications are scheduled by the compiler and performed by the runtime system, which provides transparent access to the memory of remote nodes. The high abstraction level of PGAS reduces the complexity of parallel program development, but it requires effective methods of optimizing compilation.

The reported study was partially supported by RFBR, research projects , and by the Ministry of Education and Science of the Russian Federation (02.G from ).

One can identify two of the most common patterns in parallel PGAS programs: (i) iteration over distributed arrays, and (ii) applying a specified reduction operation to the elements of distributed arrays (reduce, reduction). The existing algorithms for operations on distributed arrays [4, 3] do not take into account the specific features of PGAS, such as the high intensity of one-sided communications, memory consistency, multithreading, etc. The compiler optimization algorithms implemented in IBM X10 [1] and UPC [2] do not minimize the overheads of PGAS programs that repeatedly access array elements located on remote nodes. In this paper we propose algorithms for optimizing the communications in operations on distributed arrays. The algorithms are implemented for Cray Chapel and IBM X10.

2 Communications Optimization

2.1 PGAS Model

Let P = {1, 2, ..., N} be the set of SMP/NUMA nodes of a distributed CS. Each node i ∈ P consists of n CPU cores and local memory. The PGAS model represents a multicore node by the abstraction of a locale (region, place). Each locale manages its own local memory segment. Dynamically spawned tasks (activities, threads) run within a locale. A task can access the global address space, which comprises the local memory segments of the nodes. Access to the local segment is much faster, because access to remote segments requires communications. The constructs required for developing PGAS programs are:
- begin S: executes the statements S asynchronously in a separate thread;
- on i S: executes the statements S on node i;
- on x S: executes the statements S on the node that owns the object x;
- coforall S: executes each iteration of the loop body S in an independent thread;
- sync T: a synchronization variable.

2.2 Parallel Reduction Algorithm

Reduction is a collective operation that applies an associative operation ⊕ to the distributed array V[1:D]. The result r of the operation is placed in the memory of the thread that initiated the reduction: r = V[1] ⊕ V[2] ⊕ ... ⊕ V[D]. This paper proposes the BlockReduce reduction algorithm for PGAS programs (Figure 1); Figure 2 presents the algorithm implemented for Cray Chapel. Each node i ∈ P knows the set V_i of elements of the array V stored in its local memory. In the first stage (Figure 2, lines 3-17) the sub-arrays V_i are split into n parts, one per core (Figure 2, line 6), and these parts are processed in parallel: the threads t = 1, 2, ..., n of each node i reduce their sub-arrays V_it (Figure 2, lines 7-11). In the second stage the nodes form a binary tree whose root is the first locale. Each operation ⊕ on a pair of values r[first] and r[second] is performed in a separate thread on the node where the value r[first] is located. After all values have been reduced, a barrier is performed.
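The two-level scheme above can be illustrated by a minimal Chapel sketch of a BlockReduce-style sum over a block-distributed array. This is a simplified sketch rather than the authors' implementation: the per-locale partial results are combined here with a plain reduce instead of the binary tree of tasks described above, and the Block distribution syntax assumes a Chapel release of that period.

use BlockDist;

config const D = 16000;                    // array length

// Block-distributed array: each locale owns a contiguous chunk of V
const Space = {1..D} dmapped Block(boundingBox={1..D});
var V: [Space] real = 1.0;

// One partial result per locale
var partials: [0..#numLocales] real;

coforall loc in Locales do on loc {        // stage 1: one task per locale
  var localSum = 0.0;
  // reduce the locally stored elements using all cores of the node
  forall i in V.localSubdomain() with (+ reduce localSum) do
    localSum += V[i];
  partials[loc.id] = localSum;
}

// stage 2 (simplified): combine the per-locale results;
// the paper combines them with a binary tree of tasks instead
const r = + reduce partials;
writeln("sum = ", r);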

Fig. 1: Distribution of array elements in the algorithm BlockReduce

The barrier may be implemented, e.g., by the dissemination barrier algorithm with O(log N) time; the complexity of BlockReduce is then T = O(|V|/N + log N). In the current implementation we used the centralized barrier with O(N) time.

Input: V[1:D] is the distributed array, ⊕ is the reduction operation.
Output: r is the reduction result for the array V.
 1: procedure BlockReduce(V[1:D], ⊕)
 2:   Parallel computation of ⊕ over the local elements of the locales
 3:   coforall i in [1, 2, ..., N] do
 4:     on i
 5:       Split V_i into n blocks V_it (n is the number of cores)
 6:       SplitArray(V_i, n)
 7:       coforall t in [1, 2, ..., n] do
 8:         forall x in V_it do
 9:           r[i][t] ← r[i][t] ⊕ x
10:         end for
11:       end coforall
12:       forall t in [1, 2, ..., n] do
13:         r[i] is the reduction of the elements V_i located on locale i
14:         r[i] ← r[i] ⊕ r[i][t]
15:       end for
16:     end on
17:   end coforall
18:   return r ← Bintree(r[1:N])
19: end procedure

Fig. 2: Algorithm BlockReduce

2.3 Array Access Optimization

Another common pattern in PGAS programs is looping over array elements, where threads access elements located in the memory of other nodes (Figure 3a). In this case the runtime system fetches the required elements. Current PGAS compilers use relatively straightforward heuristics: accessing an element of a remote array causes the entire array to be copied to local memory, although copying the whole array is redundant and incurs communication overheads. The scalar replacement algorithm [1, 2] reduces these overheads. When looping over the elements of a remote array, the runtime system copies the entire array to local memory at each iteration (Figure 3a); this scheme is highly inefficient. Scalar replacement may also cause redundant copying in loops, because the total number of transferred elements can exceed the size of the entire array. We propose the ArrayPreload algorithm, which optimizes loop access to remote arrays in order to minimize communication time. ArrayPreload prevents the repeated copying of remote arrays by preemptively copying each array once before the loop iterations (Figure 3b).
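For consistency with the earlier sketch, the following Chapel fragment shows the effect of this transformation applied by hand; the paper's own example (Figure 3, discussed next) is written in IBM X10. The array A, the length D, and the choice of locales are illustrative assumptions, not taken from the paper.

config const D = 10000;
var A: [1..D] real = 1.0;            // A is allocated in locale 0's memory

on Locales[numLocales-1] {
  // Without preloading: each A[i] is a fine-grained remote get from
  // locale 0, so every loop iteration pays communication latency.
  var s1 = 0.0;
  for i in 1..D do
    s1 += A[i];

  // With preloading applied by hand: a loop prologue copies A once into
  // the memory of the current locale, and the loop reads the local copy.
  var localA: [1..D] real;
  localA = A;                        // prologue: one bulk transfer
  var s2 = 0.0;
  for i in 1..D do
    s2 += localA[i];                 // purely local accesses

  writeln(s1 == s2);                 // the result is unchanged
}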

5 4 Ivan Kulagin, Alexey Paznikov, Mikhail Kurnosov iterations (Figure 3b). Figure 3 shows the example of optimization of array A access for IBM X10 language. Unoptimized version (Figure 3a) incurs passing the array A to the id node on each iteration. The optimization (Figure 3b) involves the copying array A in advance to every node to the local array locala. The statement at used in IBM X10 corresponds the statement on. (a) Version without optimization (b) Optimized version (ArrayPreload) Fig. 3: Example of optimization by passing the array A in a IBM X10 program The ArrayPreload algorithm is based on static analyze by Abstract Syntax Tree (AST) traversal. The first stage realizes the search of the loops with accessing remote array elements. The second checks if array is not changed during the loop iterations so as to avoid violation the original program during optimization. The way this examination depends on compiler implementation, e.g. this check may be implemented on base of previously built loop context. The third stage makes AST transformation which includes loop prologues for each found arrays. The prologue performs coping a remote array to local memory once before iterations. The remote array access is replaced by access to the local one copied by prologue loop. Computational complexity of the algorithm ArrayPreload is determined of the AST height. 3 Experiments and Results Experiments are carried out on the cluster A (16 nodes: 2 x Quad-Core Intel Xeon E5420, Gigabit Ethernet) and cluster B (6 nodes: 2 x Quad-Core Intel Xeon E5420, Infiniband QDR). The algorithms are implemented for the languages Cray Chapel (BlockReduce) and IBM X10 (ArrayPreload). The evaluation of reduction algorithms was done on the basis of microbenchmarks (reduction of distributed array of length D = 4000,..., 20000) and Chapel programs PTRANS (transposition of distributed matrices) and minimd (molecular dynamics). Node number N was varying from 1 to 16. BlockReduce efficiency depends on the N and D. Algorithm outperforms by 10 30% the default algorithm DefaultReduce. Slight benefit on the real programs is due to the reduce computation time is much less than the total execution time. For the efficiency evaluation of ArrayPreload and Scalar replacement the microbenchmark was used. The benchmark performs the looping through the array s elements placed in the memory of remote nodes.

Fig. 4: Speedup of the test program on cluster A: (a) ArrayPreload algorithm, (b) scalar replacement algorithm

Both ArrayPreload and scalar replacement achieve speedups of 5 to 82 times (Figure 4). In general, the efficiency depends on the interconnect performance, the number of nodes, the array size, and the number of iterations.

4 Conclusion

The proposed algorithms reduce the execution time of PGAS programs by minimizing communication overheads. This is achieved by preemptively copying remote arrays and by taking the structure of the computer system into account. The algorithms can be used for a wide range of PGAS languages.

References

1. R. Barik, J. Zhao, D. Grove, I. Peshansky, Z. Budimlic, and V. Sarkar. Communication Optimizations for Distributed-Memory X10 Programs. In IEEE International Parallel and Distributed Processing Symposium, pages 1-13.
2. W. Chen, C. Iancu, and K. Yelick. Communication Optimizations for Fine-grained UPC Applications. In 14th International Conference on Parallel Architectures and Compilation Techniques (PACT).
3. M. Kurnosov. All-to-all broadcast algorithms in hierarchical distributed computer systems. Vestnik of Computer and Information Technologies, 5:27-34 [in Russian].
4. R. Rabenseifner. Optimization of Collective Reduction Operations. Computational Science - ICCS 2004, Lecture Notes in Computer Science, 3036:1-9, 2004.
