Level-synchronous BFS algorithm implemented in Java using PCJ Library


2016 International Conference on Computational Science and Computational Intelligence

Level-synchronous BFS algorithm implemented in Java using PCJ Library
Full/Regular Research Paper, CSCI-ISPD

Magdalena Ryczkowska, Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/, Toruń, Poland
Marek Nowicki, Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/, Toruń, Poland
Piotr Bała, Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Pawińskiego 5a, Warsaw, Poland

Abstract: Graph processing is used in many fields of science such as sociology, risk prediction, or biology. Although the analysis of graphs is important, it also poses numerous challenges, especially for large graphs, which have to be processed on multicore systems. In this paper, we present a PGAS (Partitioned Global Address Space) version of the level-synchronous BFS (Breadth-First Search) algorithm and its implementation written in Java. Java is so far not extensively used in high-performance computing, but because of its popularity, portability, and increasing capabilities, it is becoming more widely exploited, especially for data analysis. The level-synchronous BFS has been implemented using the PCJ (Parallel Computations in Java) library. In this paper, we present implementation details and compare the scalability and performance of our code with the MPI implementation of the Graph500 benchmark. We show good scalability and performance of our implementation in comparison with the MPI code written in C. We also present the challenges we faced and the optimizations necessary to obtain good performance.

Index Terms: parallel and distributed graph algorithms, BFS, Java, PGAS, performance evaluation

I. INTRODUCTION

A number of computational problems can be formulated in terms of graphs, since graphs are a very convenient way to describe relations in any context. Graphs are widely investigated in computer science and are commonly used in many scientific fields, for example in biology (to model protein interactions or a food chain), in sociology (for social network analysis), in WWW mining, and in the analysis of networks and data transfer. The size of the analyzed problems keeps increasing; therefore, the time needed to obtain a solution for a large graph is becoming more and more important [1]. Graphs of interest often consist of millions of vertices, and processing them is not an easy task. The important challenges are the large amount of memory needed to store and process graphs and, in the case of parallel processing, the huge amount of communication and synchronization. In order to analyze the performance of supercomputers across the world in the context of graph problems, the Graph500 benchmark has been created [2].

Most of the tools for graph processing use traditional programming languages such as C/C++. However, the growing adoption of Java as a programming language for data analytics creates a demand for new scalable solutions. Parallel execution in Java is based on the Thread class or the fork-join framework. A recent addition to the Java parallelization capabilities is the Java concurrency package, introduced in Java SE 5 and improved in Java SE 6. All these features can be used only within a single Java Virtual Machine, which limits parallelization to a single shared-memory node and is not enough for large problems.
In this paper, we present the parallelization of large graph traversal using the level-synchronous breadth-first search (BFS) algorithm. The algorithm is adapted to the PGAS programming paradigm and is implemented in the Java language. The algorithm and the optimizations described here are a continuation of the work presented in [3]. Here we present in detail the level-synchronous BFS optimizations used to improve performance. We have adapted these optimizations to the PGAS model and implemented them in Java using the PCJ library. As a result, we have obtained much better performance in comparison with the previous work. The solution has been verified on different hardware architectures by a performance analysis carried out for large graphs of different sizes. The performance has been compared with the C/C++ MPI implementation, which is reported to be the fastest and to have the best scalability.

The paper is organized as follows: the next section describes the basic features of the PCJ library. The third section presents related work. The BFS implementation details are described in section 4. The next section contains performance results. The paper is concluded with final remarks.

II. PCJ LIBRARY

PCJ [4], [5], [6] is a Java library for parallel and distributed computations based on the PGAS (Partitioned Global Address Space) model [7]. In this model, all communication details, such as thread management or network programming, are hidden.

The communication is one-sided, which makes programming easier than in the traditional two-sided message-passing approach. There have been a number of attempts to use Java for parallel computing; however, most of the existing solutions have not become widely used, either because of the difficulty for the programmer or due to insufficient performance. Programs developed with PCJ can be run on distributed systems with a different JVM running on each node, which makes it possible to run an application on hundreds and thousands of cores. The PCJ library takes care of this process. The application can be started using ssh or dedicated scheduling mechanisms such as SLURM, LoadLeveler, or LSF. The number of nodes and threads can be easily configured at the start of the application.

In PCJ, as in other PGAS languages, the memory is divided among the threads of execution (PCJ threads). By default, all variables in a particular area of memory are private to the owner thread; however, some of them can be shared between threads. The PCJ library offers all the primary methods, such as broadcasting, asynchronous one-sided communication (put, get), task synchronization, creating groups of tasks, and monitoring and waiting for a variable change. The core PCJ methods are as follows:

- int PCJ.myId() - returns the identifier of the task (an integer from 0 to N-1, where N is the number of all tasks);
- int PCJ.threadCount() - returns the number of all tasks;
- PCJ.put() - asynchronously sends data to a remote PCJ thread;
- PCJ.get() - asynchronously gets the value of a variable stored at a remote PCJ thread;
- PCJ.broadcast() - sends a value to all PCJ threads;
- PCJ.waitFor() - holds execution until communication is finished; this method is used together with PCJ.put().

The PCJ library also provides additional functionality to handle asynchronous communication through FutureObjects and allows the user to maintain groups of PCJ threads. The PCJ library has been successfully used to parallelize selected scientific applications on HPC systems with thousands of cores.
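To show how these methods fit together, the sketch below exercises the primitives listed above in the calling convention used by the pseudocode later in this paper (PCJ.put(target, "variable", value), PCJ.waitFor("variable"), PCJ.broadcast("variable", value)). The shared-variable names are hypothetical, and the StartPoint/Storage scaffolding and deployment call needed to launch a real PCJ application are omitted because they differ between PCJ versions; this is an illustrative sketch, not version-accurate API usage.

```java
import org.pcj.PCJ;

// Illustrative use of the core PCJ primitives described above.
// Variable names ("counter", "flag") are hypothetical shared variables;
// exact method signatures depend on the PCJ library version.
public class PcjPrimitivesSketch {

    public void run() {
        int me = PCJ.myId();          // identifier of this PCJ thread (0 .. N-1)
        int n  = PCJ.threadCount();   // total number of PCJ threads

        // one-sided, asynchronous put: store a value in the shared
        // variable "counter" owned by the next PCJ thread in the ring
        PCJ.put((me + 1) % n, "counter", me);

        // block until a value has arrived in the local "counter" variable
        PCJ.waitFor("counter");

        if (me == 0) {
            // thread 0 broadcasts a value of the shared variable "flag"
            PCJ.broadcast("flag", true);
        } else {
            // remaining threads wait until the broadcast value arrives
            PCJ.waitFor("flag");
        }
    }
}
```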
III. RELATED WORK

BFS, as one of the most important graph algorithms, has been widely studied. The main idea of our BFS implementation is based on the simple MPI reference implementation of the Graph500 benchmark [2], with synchronization after each level, which has been closely examined in [8][9]. Most of the algorithms based on level-synchronous BFS adapt the idea either to a specific programming model or to the environment and present some optimizations to improve performance [10]. Many of the studied BFS implementations use the MPI-threads model [11][12]. Some papers describe BFS in a specific environment and hardware [13][14]. To improve the efficiency of level-synchronous BFS, several algorithm modifications have been developed [15][17]. Some of them are: keeping a bitmap vector, separate socket queues or batching when inserting into and removing from the inter-socket communication channel, threading extensions, and a lazy polling implementation. Most of the work has been done for message passing; however, the graph processing problem has also been studied in PGAS languages [18]. In particular, fast PGAS implementations of graph algorithms for the connected components and minimum spanning tree problems have been reported. Some other papers describe graph traversal in the PGAS language UPC [19][20].

IV. BFS IMPLEMENTATION

In this section, we present the implementation details of the BFS algorithm. We first describe the input to the BFS algorithm; later we present the general idea of the BFS algorithm together with our approach and the optimizations we used.

A. BFS Input

The input to the BFS algorithm consists of a graph in the sparse-efficient CSR form. CSR (Compressed Sparse Row) is a representation of sparse graphs in which the graph is stored in two one-dimensional arrays. The first array keeps all nonzero entries (endpoints of the edges) from the sparse adjacency matrix in top-to-bottom order. The second array stores offsets that denote the start vertex of the edges. In our implementation, 1D partitioning is used: all vertices and edges of the original graph are distributed so that each PCJ thread owns N/p vertices and their incident edges (p is the number of processors and N is the number of vertices in the graph). The distribution of vertices is realized by blocks: process 0 owns vertices with numbers from 0 to x, process 1 owns vertices with numbers from x+1 to y, and so on. A sketch of this layout is given below.
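The following fragment illustrates the CSR representation and the 1D block distribution described above. It is an illustrative sketch under the assumption that the number of tasks evenly divides the number of vertices (which holds for the Graph500 graphs with a power-of-two number of tasks); the names mirror the helper functions used later in Algorithm 1 but are not the authors' code.

```java
// Illustrative sketch of the CSR layout and 1D block partitioning of vertices.
// Assumes the number of tasks divides the number of vertices evenly.
public class CsrPartitionSketch {

    // CSR arrays for the locally owned part of the graph:
    // adjacency[rowOffset[i] .. rowOffset[i+1]-1] holds the neighbours
    // (global vertex numbers) of local vertex i.
    long[] adjacency;    // nonzero entries of the sparse adjacency matrix
    int[]  rowOffset;    // offsets, length = localVertexCount + 1

    final long totalVertices;    // N, total number of vertices in the graph
    final int  taskCount;        // p, number of PCJ threads
    final long verticesPerTask;  // block size N / p

    CsrPartitionSketch(long totalVertices, int taskCount) {
        this.totalVertices = totalVertices;
        this.taskCount = taskCount;
        this.verticesPerTask = totalVertices / taskCount;
    }

    // analogue of FIND-TASK-FOR-VERTEX: which task owns global vertex v
    int ownerOf(long v) {
        return (int) (v / verticesPerTask);
    }

    // analogue of FIND-LOCAL-NBR: local index of global vertex v on its owner
    int localIndexOf(long v) {
        return (int) (v % verticesPerTask);
    }

    // analogue of CHECK-IF-I-OWN-THE-VERTEX for task myId
    boolean owns(int myId, long v) {
        return ownerOf(v) == myId;
    }
}
```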

B. BFS Implementation Details

In the BFS algorithm, the vertices of a graph G(V, E) are traversed starting from a given source vertex s in V. The algorithm traverses the edges E of G to find all vertices v in V reachable from the source vertex s. The algorithm's outcome is a tree rooted at s that gives the shortest paths from the source s to the reachable vertices of G. Our implementation is based on the level-synchronous BFS strategy: if a vertex v is visited at level l, the distance between s and v equals l, and all vertices at level l are visited before vertices at distance l+1.

Algorithm 1: BFS pseudocode.
Input: graph G(V, E) in CSR form, distributed by blocks between processors; source vertex s in V
Output: array of predecessors (pred) in the BFS outcome tree rooted at s

Functions:
INIT-DATA() - initializes all necessary data, e.g. pred (array of predecessors), bitmap (array of bits; a bit is set to 1 if the vertex is visited), currentLvl/nextLvl (arrays of vertices that should be visited at the current/next level), buffers, etc.
CHECK-IF-I-OWN-THE-VERTEX(v) - returns true if the thread performing the check owns vertex v, false otherwise
FIND-LOCAL-NBR(v) - returns the local vertex number for global vertex number v
FIND-TASK-FOR-VERTEX(v) - returns the identifier of the task that owns v
ADD-TO-BUFFER(b, v) - adds the provided value v to buffer b
BUFFER-IS-FULL(b) - returns true if buffer b is full, false otherwise
BUFFER-IS-NOT-EMPTY(b) - returns true if buffer b contains any data, false if buffer b is empty
VISIT-VERTICES(a) - visits the vertices from array a received in inter-process communication
SUM(a) - returns the sum of all elements of array a
REDUCE(a) - returns true if any of the integers in array a is greater than 0, false otherwise
INIT-FOR-NEXT-LOOP() - reinitializes the data needed for the next loop iteration, e.g. clearing buffers, moving the elements of nextLvl to currentLvl, etc.
ALL-TO-ALL(a) - all-to-all communication: sends array a to all threads

BFS(s)
 1: INIT-DATA()
 2: currentLvlIndex <- 0
 3: if CHECK-IF-I-OWN-THE-VERTEX(s)
 4:     MARK-SOURCE-AS-VISITED(s)
 5: while (true)
 6:     for each v1 in currentLvl
 7:         for each v2 adjacent to v1
 8:             v2Owner <- FIND-TASK-FOR-VERTEX(v2)
 9:             if bitmap[v2Owner].get(FIND-LOCAL-NBR(v2)) == false
10:                 if CHECK-IF-I-OWN-THE-VERTEX(v2)
11:                     VISIT(v2, v1)
12:                 else
13:                     ACCUMULATE-BUFFER-SEND(v2Owner, v2, v1)
14:     for each process p
15:         BUFFER-SEND(p)                               // send buffer to task p
16:     for each process p                               // send number of parts
17:         PCJ.put(p, partsSent, toSendPredPartCounter[p])
18:     WAIT-AND-VISIT()
19:     if PCJ.myId() != 0
20:         PCJ.put(0, verticesInNextLvl, nextLvlIndex, PCJ.myId())
21:     else
22:         computeNextLvl <- REDUCE(verticesInNextLvl)
23:         PCJ.broadcast(computeNextLvl, computeNextLvl)
24:     PCJ.waitFor(computeNextLvl)
25:     if computeNextLvl == true
26:         INIT-FOR-NEXT-LOOP()
27:         ALL-TO-ALL(bitmap[PCJ.myId()])
28:         PCJ.waitFor(bitmap)
29:     else
30:         break

MARK-SOURCE-AS-VISITED(s)
 1: currentLvl[currentLvlIndex++] <- s
 2: sLocal <- FIND-LOCAL-NBR(s)
 3: pred[sLocal] <- -1
 4: bitmap[PCJ.myId()].set(sLocal)                       // mark source vertex as visited

VISIT(v2, v1)
 1: v2Local <- FIND-LOCAL-NBR(v2)
 2: pred[v2Local] <- v1
 3: nextLvl[nextLvlIndex++] <- v2
 4: bitmap[PCJ.myId()].set(v2Local)                      // mark vertex v2 as visited

ACCUMULATE-BUFFER-SEND(v2Owner, v2, v1)
 1: ADD-TO-BUFFER(toSendPredBuffer[v2Owner], v2)
 2: ADD-TO-BUFFER(toSendPredBuffer[v2Owner], v1)
 3: if BUFFER-IS-FULL(toSendPredBuffer[v2Owner])
 4:     PCJ.put(v2Owner, rcvedPred, toSendPredBuffer[v2Owner], ..)
 5:     toSendPredPartCounter[v2Owner]++                 // count sent data parts

BUFFER-SEND(p)
 1: if BUFFER-IS-NOT-EMPTY(toSendPredBuffer[p])
 2:     PCJ.put(p, rcvedPred, toSendPredBuffer[p], ..)   // send data
 3:     toSendPredPartCounter[p]++                       // count sent out data parts

WAIT-AND-VISIT()
 1: partsSentCounter <- PCJ.waitFor(partsSent, 0)
 2: while partsSentCounter < PCJ.threadCount()
 3:     VISIT-VERTICES(rcvedPred)
 4:     partsSentCounter <- PCJ.waitFor(partsSent, 0)
 5: chunksWaitingForCounter <- SUM(partsSent)
 6: rcvedPredCounter <- PCJ.waitFor(rcvedPred, 0)
 7: while rcvedPredCounter < chunksWaitingForCounter
 8:     VISIT-VERTICES(rcvedPred)                        // visit received vertices
 9:     rcvedPredCounter <- PCJ.waitFor(rcvedPred, 0)
10: VISIT-VERTICES(rcvedPred)                            // all parts came, so visit vertices

The idea of the BFS algorithm is presented in Algorithm 1, together with an explanation of the functions used in the pseudocode. The first step of the algorithm is the initialization of the necessary data. Each task initializes its own predecessor array, a bitmap (an array of bits in which a bit is set if the corresponding vertex has already been visited), buffers, counters, the current-level array, the next-level array, etc. Although Java has many dynamic data structures suitable for handling changing data, using them in this implementation was highly ineffective; therefore, we have decided to use simple arrays instead. When the initialization is done, each PCJ thread checks whether it is the owner of the source vertex (line 3).
If the thread owns the source vertex, it adds it to its current-level array, sets the predecessor of the source vertex to -1, and sets the proper bit in the bitmap array. The current-level array stores the vertices at the current level; the next-level array stores the vertices that will be visited at the next level. Afterwards, all tasks execute the following code in a while loop: for every vertex in the current level, all its adjacent vertices are checked to see whether they have already been visited (line 9). If the bit for a vertex is set, the thread does nothing and moves on to the next adjacent vertex. If the bit is not set, which means that the vertex has not been visited yet, the thread needs to check which task owns the vertex (line 10). There are two situations. When the task performing the check is the owner of the vertex, the situation is simple: the vertex becomes visited, so the task sets the predecessor and the bitmap bit, and adds the vertex to the next level. But if the owner of the vertex is not the task performing the check, communication is necessary. A proper message is constructed, and when the buffer is full, the message is sent to the vertex owner (lines 13-15). Instead of sending a single message per vertex, we accumulate the data in a buffer of 64 KB (the size of the buffer has been adjusted experimentally). Besides sending information about visited vertices, the task needs to pass the number of sent parts (lines 16-17) so that the receiving task knows how many chunks it should expect. While waiting for data, the procedure WAIT-AND-VISIT is being executed (line 18). When all parts have been received, the remaining vertices are visited (function WAIT-AND-VISIT, line 10). At the end of each level, all tasks communicate to check whether there is any vertex that needs to be visited at the next level (lines 19-23). When there are no such vertices on any PCJ thread, the algorithm stops by breaking out of the main loop (line 30). Otherwise, the tasks exchange bitmaps and the search continues at the next level (lines 26-28).
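The buffered remote visits described above (pseudocode lines 13-17) could be sketched in Java roughly as follows. The names, the 64 KB buffer size, and the PCJ.put form follow the pseudocode of Algorithm 1; the exact PCJ method signatures and the shared-variable declarations depend on the library version, so this is an assumption-laden sketch rather than the authors' code.

```java
import java.util.Arrays;
import org.pcj.PCJ;

// Sketch of ACCUMULATE-BUFFER-SEND / BUFFER-SEND: remote visits are batched
// per destination task and shipped with one-sided puts when a buffer fills.
public class BufferedSendSketch {

    // ~64 KB worth of long values per destination (even, so (v2, v1) pairs fit)
    static final int BUFFER_CAPACITY = 64 * 1024 / Long.BYTES;

    long[][] toSendPredBuffer;      // one send buffer per destination task
    int[] bufferFill;               // number of longs already stored per buffer
    int[] toSendPredPartCounter;    // number of parts sent to each task

    void init(int taskCount) {
        toSendPredBuffer = new long[taskCount][BUFFER_CAPACITY];
        bufferFill = new int[taskCount];
        toSendPredPartCounter = new int[taskCount];
    }

    // called for every frontier edge (v1 -> v2) whose target v2 is owned remotely
    void accumulateBufferSend(int v2Owner, long v2, long v1) {
        long[] buf = toSendPredBuffer[v2Owner];
        buf[bufferFill[v2Owner]++] = v2;    // vertex to be visited remotely
        buf[bufferFill[v2Owner]++] = v1;    // its predecessor
        if (bufferFill[v2Owner] == BUFFER_CAPACITY) {
            // one-sided put of the full buffer into "rcvedPred" on the owner
            PCJ.put(v2Owner, "rcvedPred", buf.clone());
            toSendPredPartCounter[v2Owner]++;   // count sent data parts
            bufferFill[v2Owner] = 0;
        }
    }

    // called once per destination at the end of a level (BUFFER-SEND)
    void bufferSend(int p) {
        if (bufferFill[p] > 0) {
            long[] tail = Arrays.copyOf(toSendPredBuffer[p], bufferFill[p]);
            PCJ.put(p, "rcvedPred", tail);
            toSendPredPartCounter[p]++;         // count sent-out data parts
            bufferFill[p] = 0;
        }
    }
}
```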

In order to speed up the implementation, we have used a number of optimizations. The first one is the bitmap used to reduce the amount of communication between tasks [15]. Each task keeps a vector of bits, bitmap[v]; a bit is set to 1 if vertex v has been visited and to 0 otherwise. Besides the memory savings and the reduced memory-access time, checking whether a vertex has already been visited (line 9) avoids a great amount of communication. In our implementation, the bitmap contains information about the visited vertices from the previous level. At the end of each level, the bitmap information is exchanged between tasks (line 27). This solution does not require locks, so synchronization is avoided and we can benefit from the asynchronous one-sided communication provided by the PGAS model.

To gather data that is to be sent to another task, we used array buffers. Every task has its own buffer array, which allows us to split messages at the time of creation. This is important because the PCJ library supports sending arrays with provided indexes. Moreover, we have used message batching. Buffers minimize the overhead connected with starting communications: instead of sending a single vertex, each task batches a portion of the vertices, and when the buffer is full, it is sent to the target task. The buffer size has been set experimentally so as to reduce the overhead of starting communication while still sending frequently enough that other PCJ threads receive data quickly.

While waiting for the communication to complete, the procedure VISIT-VERTICES is being executed. Instead of waiting unproductively for the communication to finish, the task uses this time to process the data that has already been received. This is realized using the PCJ.waitFor("variable", 0) method, which returns the number of received data items.
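A rough Java rendering of this overlap, following the WAIT-AND-VISIT routine of Algorithm 1, is shown below. The non-blocking use of PCJ.waitFor(name, 0) as a counter of received updates follows the description above; all other names are illustrative and the shared-variable handling is simplified, so this is a sketch rather than the actual implementation.

```java
import org.pcj.PCJ;

// Sketch of the WAIT-AND-VISIT overlap: while waiting for the remaining
// chunks, the task keeps processing the chunks that have already arrived.
public class WaitAndVisitSketch {

    void waitAndVisit(int[] partsSent) {
        // 1) wait until every task has announced how many parts it sent
        int announced = PCJ.waitFor("partsSent", 0);
        while (announced < PCJ.threadCount()) {
            visitVertices();                         // useful work while waiting
            announced = PCJ.waitFor("partsSent", 0);
        }
        int expectedChunks = sum(partsSent);

        // 2) wait until all announced chunks of "rcvedPred" have arrived
        int received = PCJ.waitFor("rcvedPred", 0);
        while (received < expectedChunks) {
            visitVertices();
            received = PCJ.waitFor("rcvedPred", 0);
        }
        visitVertices();                             // all parts came; visit the rest
    }

    private int sum(int[] a) {
        int s = 0;
        for (int x : a) s += x;
        return s;
    }

    private void visitVertices() {
        // placeholder: set predecessors, update the bitmap and append the
        // received vertices to the next-level array (details omitted)
    }
}
```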
V. RESULTS

In this section, we present the performance results of the BFS implementation with the PCJ library compared with the simple BFS implementation of the Graph500 benchmark. Please note that in most cases found in the literature, performance results are presented for weak scaling, i.e., with an increasing number of nodes the problem size is also increased. In this paper, we focus on the algorithm evaluation; therefore, scalability is measured for a constant size of the input data (strong scaling). The tests have been carried out in the following environments:

- AMD Opteron(TM) Processor 6272 (Interlagos), 2.2 GHz, with nodes built of 64 cores (on four sockets) and 512 GB RAM; the nodes are connected with 10 Gb Ethernet.
- AMD Opteron(TM) Processor 2435 (Istanbul), 2.6 GHz, with nodes built of 12 cores (on two sockets) and 32 GB RAM; the nodes are connected with InfiniBand DDR + 1 Gb Ethernet.
- Cray XC40 (Okeanos), Intel Xeon CPU E v3 @ 2.60 GHz, with nodes built of 12 cores (on two sockets) and 128 GB RAM; the nodes are connected with the Aries interconnect (dragonfly topology).

We have used the 64-bit Oracle Java 8 JVM and OpenMPI with the gcc compiler. The sample graphs, in the form of edge tuple lists, and the source vertices used in the performance tests have been generated with the graph generator of the Graph500 benchmark. The size of a graph is specified by the following parameters: SCALE (the base-two logarithm of the number of vertices) and edgefactor (the ratio of the graph's edge count to its vertex count), which means that the total number of vertices is N = 2^SCALE and the number of edges equals M = edgefactor * N. The performance has been tested on graphs of SCALE = {26, 27, 28, 29} with edgefactor = 16. It is well known that performance and scalability depend on the graph structure [16]. For the purposes of this paper, we have used the same input data for all tests, following the Graph500 methodology. All charts present results from several independent BFS runs (we took five different source vertices, and for each of them five independent runs have been carried out). The execution time together with the number of traversed edges per second (TEPS) is presented.
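As a quick illustration of these size parameters and the TEPS metric, the snippet below converts a SCALE/edgefactor pair and a hypothetical BFS time into a TEPS value. Note that the actual Graph500 harness counts only the edges within the traversed component, so this is a simplification of the metric, not the benchmark's exact computation.

```java
// Toy illustration of the Graph500 size parameters and the TEPS metric:
// N = 2^SCALE vertices, M = edgefactor * N edges, TEPS ~ M / BFS time.
// The real benchmark counts only edges of the traversed component.
public class TepsSketch {
    public static void main(String[] args) {
        int scale = 26;
        int edgeFactor = 16;
        long vertices = 1L << scale;                 // N = 2^SCALE
        long edges = (long) edgeFactor * vertices;   // M = edgefactor * N

        double bfsTimeSeconds = 2.5;                 // hypothetical measured time
        double teps = edges / bfsTimeSeconds;        // traversed edges per second
        System.out.printf("N = %d, M = %d, TEPS = %.3e%n", vertices, edges, teps);
    }
}
```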

Figure 1 presents the TEPS obtained in various environments for graphs of SCALE 26 and 27, respectively. In general, the PCJ implementation provides performance and scalability similar to the MPI C implementation. The speedup is linear up to 32 PCJ/MPI threads, and this is independent of the number of threads run on a single node. The results for higher numbers of nodes are scattered, because communication starts to dominate. This is confirmed by the fact that the best results are obtained for the nodes connected with the InfiniBand interconnect, which provides higher communication bandwidth and lower latency.

In figures 2 and 3, the time spent in the different parts of the algorithm is presented. We have divided the algorithm into four parts:

- Part 1: initialization of the data structures, processing all vertices in the current frontier, and sending data to remote tasks (lines 1-29 in the pseudocode)
- Part 2: waiting for data and processing received messages (lines 30-39)
- Part 3: reduction, broadcast, and waiting for the message deciding whether the next level should be computed (lines 40-45)
- Part 4: reinitialization of the necessary data and all-to-all communication exchanging the bitmap (lines 46-51)

The most time-consuming part of the algorithm is Part 1; however, this part is well parallelized, and with an increasing number of threads the time spent in it decreases. Part 2 is the second most important part of the algorithm. It involves communication and becomes dominant for larger numbers of threads. For systems with a fast interconnect (Cray XC40, or the Istanbul processors with the InfiniBand interconnect), the time spent in this part of the algorithm is smaller than when Gigabit Ethernet is used; as a result, better scaling is achieved. Parts 3 and 4 of the algorithm contribute less to the overall time. For small numbers of threads, both parts are dominated by the performed operations, but with an increasing number of threads the communication (broadcast, reduction) starts to dominate. With the increase in the number of cores used, the communication time grows, reaching 20-30% for 32 nodes and 20-45% for 128 cores.

Fig. 1. BFS TEPS performance for the graph of SCALE 26 and SCALE 27. The PCJ performance is compared with the Graph500 MPI C implementation in various environments. Tests have been carried out for different numbers of PCJ threads per node (marked as x pn): 4pn means 4 PCJ threads running on the node.

Fig. 2. Profiling of the PCJ implementation of the level-synchronous BFS algorithm for the graph of SCALE 26.

Fig. 3. Profiling of the PCJ implementation of the level-synchronous BFS algorithm for the graph of SCALE

Fig. 4. BFS TEPS performance for the graphs of SCALE 27, 28 and 29.

VI. CONCLUSIONS

We have presented the parallelization of large graph traversal using the level-synchronous breadth-first search (BFS) algorithm. The algorithm has been adapted to the PGAS programming model and implemented using the PCJ library. The necessary optimizations, including overlapping the work with asynchronous communication, have been performed to increase performance and scalability. The solution has been verified on different hardware architectures by a performance analysis carried out for large graphs of different sizes. Our implementation shows good performance and scalability. The results have been compared to the MPI C implementation and show similar behavior. Such a comparison has not yet been provided for the Cray XC40 due to problems with the C code. The results prove that Java, together with the PCJ library, can be used for the successful parallelization of complicated algorithms.

ACKNOWLEDGMENT

This work has been performed using the PL-Grid infrastructure. Partial support from CHIST-ERA under the HPDCJ project is acknowledged (the Polish partners are supported by NCN grant 2014/14/Z/ST6/00007).

REFERENCES

[1] A. Lumsdaine, D. Gregor, B. Hendrickson, J. Berry: Challenges in parallel graph processing. Parallel Processing Letters, vol. 17, no. 01 (2007)
[2] R. C. Murphy, K. B. Wheeler, B. W. Barrett, J. A. Ang: Introducing the Graph 500. Cray Users Group (CUG) 2010
[3] M. Ryczkowska, M. Nowicki, P. Bała: The performance evaluation of the Java implementation of Graph500. 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015), in press
[4] The PCJ library home page. Accessed:
[5] M. Nowicki, P. Bała: Parallel computations in Java with PCJ library. In: W. W. Smari and V. Zeljkovic (Eds.) 2012 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2012
[6] M. Nowicki, Ł. Górski, P. Grabarczyk, P. Bała: PCJ - Java library for high performance computing in PGAS model. In: W. W. Smari and V. Zeljkovic (Eds.) 2014 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2014
[7] D. Mallón, G. Taboada, C. Teijeiro, J. Tourino, B. Fraguela, A. Gómez, R. Doallo, J. Mourino: Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures. In: M. Ropo, J. Westerholm, J. Dongarra (Eds.) Recent Advances in Parallel Virtual Machine and Message Passing Interface (Lecture Notes in Computer Science 5759), Springer Berlin/Heidelberg, 2009
[8] T. Suzumura, K. Ueno, H. Sato, K. Fujisawa, S. Matsuoka: Performance Characteristics of Graph500 on Large-Scale Distributed Environment. In: 2011 IEEE International Symposium on Workload Characterization (IISWC)
[9] K. Ueno, T. Suzumura: Highly scalable graph search for the Graph500 benchmark. In: Proceedings of the 21st International ACM Symposium on High-Performance Parallel and Distributed Computing
[10] A. Buluc, K. Madduri: Parallel breadth-first search on distributed memory systems. Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
[11] N. Satish, C. Kim, J. Chhugani, P. Dubey: Large-scale energy-efficient graph traversal: A path to efficient data-intensive supercomputing. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012, p. 14
[12] H. Lv, G. Tan, M. Chen, N. Sun: Understanding parallelism in graph traversal on multi-core clusters. Computer Science - Research and Development, vol. 28, no. 2-3
[13] D. P. Scarpazza, O. Villa, F. Petrini: Efficient Breadth-First Search on the Cell/BE Processor. IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 10
[14] D. Mizell, K. Maschhoff: Early experiences with large-scale Cray XMT systems. In: Proceedings of the 24th International Symposium on Parallel & Distributed Processing (IPDPS09), Rome, Italy
[15] V. Agarwal, F. Petrini, D. Pasetto, D. A. Bader: Scalable Graph Exploration on Multicore Processors. In: SC'10 Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
[16] R. Berrendorf, M. Makulla: Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems. FC'14, pp. 26-31, 2014
[17] A. Amer, L. Huiwei, P. Balaji, S. Matsuoka: Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS. IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
[18] G. Cong, G. Almasi, V. Saraswat: Fast PGAS connected components algorithms. In: Proceedings of the Third Conference on Partitioned Global Address Space Programming Models, ACM, p. 13
[19] J. Jose, S. Potluri, M. Luo, S. Sur, D. Panda: UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters. In: Conference on PGAS Programming Models
[20] G. Cong, G. Almasi, V. Saraswat: Fast PGAS implementation of distributed graph algorithms. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society, 2010
[21] M. Ryczkowska: Evaluating PCJ library for graph problems - Graph500 in PCJ. In: W. W. Smari and V. Zeljkovic (Eds.) 2014 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2014


Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace James Southern, Jim Tuccillo SGI 25 October 2016 0 Motivation Trend in HPC continues to be towards more

More information

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Department ofradim Computer Bača Science, and Technical David Bednář University of Ostrava Czech

More information

Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect

Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect Alexander Agarkov and Alexander Semenov JSC NICEVT, Moscow, Russia {a.agarkov,semenov}@nicevt.ru

More information

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card The Rise of MongoDB Summary One of today s growing database

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI

Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI Huansong Fu*, Manjunath Gorentla Venkata, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

Data Communication and Parallel Computing on Twisted Hypercubes

Data Communication and Parallel Computing on Twisted Hypercubes Data Communication and Parallel Computing on Twisted Hypercubes E. Abuelrub, Department of Computer Science, Zarqa Private University, Jordan Abstract- Massively parallel distributed-memory architectures

More information

Victor Malyshkin (Ed.) Malyshkin (Ed.) 13th International Conference, PaCT 2015 Petrozavodsk, Russia, August 31 September 4, 2015 Proceedings

Victor Malyshkin (Ed.) Malyshkin (Ed.) 13th International Conference, PaCT 2015 Petrozavodsk, Russia, August 31 September 4, 2015 Proceedings Victor Malyshkin (Ed.) Lecture Notes in Computer Science The LNCS series reports state-of-the-art results in computer science re search, development, and education, at a high level and in both printed

More information