Level-synchronous BFS algorithm implemented in Java using PCJ Library


2016 International Conference on Computational Science and Computational Intelligence

Level-synchronous BFS algorithm implemented in Java using PCJ Library
Full/Regular Research Paper, CSCI-ISPD

Magdalena Ryczkowska, Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/, Toruń, Poland
Marek Nowicki, Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/, Toruń, Poland
Piotr Bała, Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Pawińskiego 5a, Warsaw, Poland

Abstract: Graph processing is used in many fields of science such as sociology, risk prediction, or biology. Although the analysis of graphs is important, it also poses numerous challenges, especially for large graphs, which have to be processed on multicore systems. In this paper, we present a PGAS (Partitioned Global Address Space) version of the level-synchronous BFS (Breadth-First Search) algorithm and its implementation written in Java. Java is so far not extensively used in high-performance computing, but because of its popularity, portability, and increasing capabilities, it is becoming more widely exploited, especially for data analysis. The level-synchronous BFS has been implemented using the PCJ (Parallel Computations in Java) library. In this paper, we present implementation details and compare the scalability and performance of our code with the MPI implementation of the Graph500 benchmark. We show good scalability and performance of our implementation in comparison with the MPI code written in C. We also present the challenges we faced and the optimizations necessary to obtain good performance.

Index Terms: parallel and distributed graph algorithms, BFS, Java, PGAS, performance evaluation

I. INTRODUCTION

A number of computational problems can be formulated in terms of graphs, since graphs are a very convenient way to describe relations in any context. Graphs are widely investigated in computer science and are commonly used in many scientific fields, for example in biology (to model protein interactions or a food chain), in sociology (for social network analysis), in WWW mining, and in the analysis of networks and data transfer. The size of the analyzed problems keeps increasing; therefore, the time needed to obtain a solution for a large graph is becoming more and more important [1]. Graphs of interest often consist of millions of vertices, and processing them is not an easy task. The important challenges are the large amount of memory needed to store and process graphs and, in the case of parallel processing, the huge amount of communication and synchronization. In order to analyze the performance of supercomputers across the world in the context of graph problems, the Graph500 benchmark has been created [2].

Most of the tools for graph processing use traditional programming languages such as C/C++. However, the growing adoption of Java as a programming language for data analytics creates a demand for new scalable solutions. Parallel execution in Java is based on the Thread class or the fork-join framework. A recent addition to the Java parallelization capabilities is the Java concurrency package, introduced in Java SE 5 and improved in Java SE 6. All these features can be used only within a single Java Virtual Machine, which limits parallelization to a single shared-memory node and is not enough for large problems.
In this paper, we present the parallelization of large graph traversal using the level-synchronous breadth-first search (BFS) algorithm. The algorithm is adapted to the PGAS programming paradigm and is implemented in the Java language. The algorithm and the optimizations described here are a continuation of the work presented in [3]. Here we present in detail the level-synchronous BFS optimizations used to improve performance. We have adapted these optimizations to the PGAS model and implemented them in Java using the PCJ library. As a result, we have obtained much better performance in comparison with the previous work. The solution has been verified on different hardware architectures by a performance analysis carried out for large graphs of different sizes. The performance has been compared with the C/C++ MPI implementation, which is reported to be the fastest and to have the best scalability.

The paper is organized as follows: the next section describes the basic features of the PCJ library. The third section presents related work. The BFS implementation details are described in section 4. The next section contains performance results. The paper is concluded with final remarks.

II. PCJ LIBRARY

PCJ [4], [5], [6] is a Java library for parallel and distributed computations based on the PGAS (Partitioned Global Address Space) model [7]. In this model, all communication details, such as thread management or network programming, are hidden.

The communication is one-sided, which makes programming easier than in the traditional two-sided message-passing approach. There have been a number of attempts to use Java for parallel computing; however, most of the existing solutions have not become widely used, either because of the difficulty for the programmer or due to insufficient performance. Programs developed with PCJ can be run on distributed systems with a different JVM running on each node, which makes it possible to run an application on hundreds and thousands of cores. The PCJ library takes care of this process. The application can be started using ssh or dedicated scheduling mechanisms such as SLURM, LoadLeveler, or LSF. The number of nodes and threads can be easily configured at the start of the application.

In PCJ, as in other PGAS languages, the memory is divided among the threads of execution (PCJ threads). By default, all variables in a particular area of memory are private to the owner thread; however, some of them can be shared between threads. The PCJ library offers all the primary methods, such as broadcasting, asynchronous one-sided communication (put, get), task synchronization, creating groups of tasks, and monitoring and waiting for a variable change. The core PCJ methods are as follows:

- int PCJ.myId() - returns the identifier of the task (an integer from 0 to N-1, where N is the number of all tasks);
- int PCJ.threadCount() - returns the number of all tasks;
- PCJ.put() - asynchronously sends data to a remote PCJ thread;
- PCJ.get() - asynchronously gets the value of a variable stored at a remote PCJ thread;
- PCJ.broadcast() - sends a value to all PCJ threads;
- PCJ.waitFor() - holds execution until communication is finished; this method is used together with PCJ.put().

The PCJ library also provides additional functionality to handle asynchronous communication through FutureObjects and allows the user to maintain groups of PCJ threads. The PCJ library has been successfully used to parallelize selected scientific applications on HPC systems with thousands of cores.
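To show how these methods fit together, the sketch below exercises the primitives listed above in the calling convention used by the pseudocode later in this paper (PCJ.put(target, "variable", value), PCJ.waitFor("variable"), PCJ.broadcast("variable", value)). The shared-variable names are hypothetical, and the StartPoint/Storage scaffolding and deployment call needed to launch a real PCJ application are omitted because they differ between PCJ versions; this is an illustrative sketch, not version-accurate API usage.

```java
import org.pcj.PCJ;

// Illustrative use of the core PCJ primitives described above.
// Variable names ("counter", "flag") are hypothetical shared variables;
// exact method signatures depend on the PCJ library version.
public class PcjPrimitivesSketch {

    public void run() {
        int me = PCJ.myId();          // identifier of this PCJ thread (0 .. N-1)
        int n  = PCJ.threadCount();   // total number of PCJ threads

        // one-sided, asynchronous put: store a value in the shared
        // variable "counter" owned by the next PCJ thread in the ring
        PCJ.put((me + 1) % n, "counter", me);

        // block until a value has arrived in the local "counter" variable
        PCJ.waitFor("counter");

        if (me == 0) {
            // thread 0 broadcasts a value of the shared variable "flag"
            PCJ.broadcast("flag", true);
        } else {
            // remaining threads wait until the broadcast value arrives
            PCJ.waitFor("flag");
        }
    }
}
```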
III. RELATED WORK

BFS, as one of the most important graph algorithms, has been widely studied. The main idea of our BFS implementation is based on the simple MPI reference implementation of the Graph500 benchmark [2], with synchronization after each level, which has been closely examined in [8][9]. Most of the algorithms based on level-synchronous BFS adapt the idea either to a specific programming model or to the environment and present some optimizations to improve performance [10]. Many of the studied BFS implementations use the MPI-threads model [11][12]. Some papers describe BFS in a specific environment and hardware [13][14]. To improve the efficiency of level-synchronous BFS, several algorithm modifications have been developed [15][17]. Some of them are: keeping a bitmap vector, separate socket queues or batching when inserting into and removing from the inter-socket communication channel, threading extensions, and a lazy polling implementation. Most of the work has been done for message passing; however, the graph processing problem has also been studied in PGAS languages [18]. In particular, fast PGAS implementations of graph algorithms for the connected components and minimum spanning tree problems have been reported. Some other papers describe graph traversal in the PGAS language UPC [19][20].

IV. BFS IMPLEMENTATION

In this section, we present the implementation details of the BFS algorithm. We first describe the input to the BFS algorithm; later we present the general idea of the BFS algorithm together with our approach and the optimizations we used.

A. BFS Input

The input to the BFS algorithm consists of a graph in the sparse-efficient CSR form. CSR (Compressed Sparse Row) is a representation of sparse graphs in which the graph is stored in two one-dimensional arrays. The first array keeps all nonzero entries (endpoints of the edges) from the sparse adjacency matrix in top-to-bottom order. The second array stores offsets that denote the start vertex of the edges. In our implementation, 1D partitioning is used: all vertices and edges of the original graph are distributed so that each PCJ thread owns N/p vertices and their incident edges (p is the number of processors and N is the number of vertices in the graph). The distribution of vertices is realized by blocks: process 0 owns vertices with numbers from 0 to x, process 1 owns vertices with numbers from x+1 to y, and so on. A sketch of this layout is given below.
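The following fragment illustrates the CSR representation and the 1D block distribution described above. It is an illustrative sketch under the assumption that the number of tasks evenly divides the number of vertices (which holds for the Graph500 graphs with a power-of-two number of tasks); the names mirror the helper functions used later in Algorithm 1 but are not the authors' code.

```java
// Illustrative sketch of the CSR layout and 1D block partitioning of vertices.
// Assumes the number of tasks divides the number of vertices evenly.
public class CsrPartitionSketch {

    // CSR arrays for the locally owned part of the graph:
    // adjacency[rowOffset[i] .. rowOffset[i+1]-1] holds the neighbours
    // (global vertex numbers) of local vertex i.
    long[] adjacency;    // nonzero entries of the sparse adjacency matrix
    int[]  rowOffset;    // offsets, length = localVertexCount + 1

    final long totalVertices;    // N, total number of vertices in the graph
    final int  taskCount;        // p, number of PCJ threads
    final long verticesPerTask;  // block size N / p

    CsrPartitionSketch(long totalVertices, int taskCount) {
        this.totalVertices = totalVertices;
        this.taskCount = taskCount;
        this.verticesPerTask = totalVertices / taskCount;
    }

    // analogue of FIND-TASK-FOR-VERTEX: which task owns global vertex v
    int ownerOf(long v) {
        return (int) (v / verticesPerTask);
    }

    // analogue of FIND-LOCAL-NBR: local index of global vertex v on its owner
    int localIndexOf(long v) {
        return (int) (v % verticesPerTask);
    }

    // analogue of CHECK-IF-I-OWN-THE-VERTEX for task myId
    boolean owns(int myId, long v) {
        return ownerOf(v) == myId;
    }
}
```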

B. BFS Implementation Details

In the BFS algorithm, the vertices of a graph G(V, E) are traversed starting from a given source vertex s in V. The algorithm traverses the edges E of G to find all vertices v in V reachable from the source vertex s. The algorithm's outcome is a tree rooted at s that gives the shortest paths from the source s to the reachable vertices of G. Our implementation is based on the level-synchronous BFS strategy: if a vertex v is visited at level l, the distance between s and v equals l, and all vertices at level l are visited before vertices at distance l+1.

Algorithm 1: BFS pseudocode.
Input: graph G(V, E) in CSR form, distributed by blocks between processors; source vertex s in V
Output: array of predecessors (pred) in the BFS outcome tree rooted at s

Functions:
INIT-DATA() - initializes all necessary data, e.g. pred (array of predecessors), bitmap (array of bits; a bit is set to 1 if the vertex is visited), currentLvl/nextLvl (arrays of vertices that should be visited at the current/next level), buffers, etc.
CHECK-IF-I-OWN-THE-VERTEX(v) - returns true if the thread performing the check owns vertex v, false otherwise
FIND-LOCAL-NBR(v) - returns the local vertex number for global vertex number v
FIND-TASK-FOR-VERTEX(v) - returns the identifier of the task that owns v
ADD-TO-BUFFER(b, v) - adds the provided value v to buffer b
BUFFER-IS-FULL(b) - returns true if buffer b is full, false otherwise
BUFFER-IS-NOT-EMPTY(b) - returns true if buffer b contains any data, false if buffer b is empty
VISIT-VERTICES(a) - visits the vertices from array a received in inter-process communication
SUM(a) - returns the sum of all elements of array a
REDUCE(a) - returns true if any of the integers in array a is greater than 0, false otherwise
INIT-FOR-NEXT-LOOP() - reinitializes the data needed for the next loop iteration, e.g. clearing buffers, moving the elements of nextLvl to currentLvl, etc.
ALL-TO-ALL(a) - all-to-all communication: sends array a to all threads

BFS(s)
 1: INIT-DATA()
 2: currentLvlIndex <- 0
 3: if CHECK-IF-I-OWN-THE-VERTEX(s)
 4:     MARK-SOURCE-AS-VISITED(s)
 5: while (true)
 6:     for each v1 in currentLvl
 7:         for each v2 adjacent to v1
 8:             v2Owner <- FIND-TASK-FOR-VERTEX(v2)
 9:             if bitmap[v2Owner].get(FIND-LOCAL-NBR(v2)) == false
10:                 if CHECK-IF-I-OWN-THE-VERTEX(v2)
11:                     VISIT(v2, v1)
12:                 else
13:                     ACCUMULATE-BUFFER-SEND(v2Owner, v2, v1)
14:     for each process p
15:         BUFFER-SEND(p)                               // send buffer to task p
16:     for each process p                               // send number of parts
17:         PCJ.put(p, partsSent, toSendPredPartCounter[p])
18:     WAIT-AND-VISIT()
19:     if PCJ.myId() != 0
20:         PCJ.put(0, verticesInNextLvl, nextLvlIndex, PCJ.myId())
21:     else
22:         computeNextLvl <- REDUCE(verticesInNextLvl)
23:         PCJ.broadcast(computeNextLvl, computeNextLvl)
24:     PCJ.waitFor(computeNextLvl)
25:     if computeNextLvl == true
26:         INIT-FOR-NEXT-LOOP()
27:         ALL-TO-ALL(bitmap[PCJ.myId()])
28:         PCJ.waitFor(bitmap)
29:     else
30:         break

MARK-SOURCE-AS-VISITED(s)
 1: currentLvl[currentLvlIndex++] <- s
 2: sLocal <- FIND-LOCAL-NBR(s)
 3: pred[sLocal] <- -1
 4: bitmap[PCJ.myId()].set(sLocal)                       // mark source vertex as visited

VISIT(v2, v1)
 1: v2Local <- FIND-LOCAL-NBR(v2)
 2: pred[v2Local] <- v1
 3: nextLvl[nextLvlIndex++] <- v2
 4: bitmap[PCJ.myId()].set(v2Local)                      // mark vertex v2 as visited

ACCUMULATE-BUFFER-SEND(v2Owner, v2, v1)
 1: ADD-TO-BUFFER(toSendPredBuffer[v2Owner], v2)
 2: ADD-TO-BUFFER(toSendPredBuffer[v2Owner], v1)
 3: if BUFFER-IS-FULL(toSendPredBuffer[v2Owner])
 4:     PCJ.put(v2Owner, rcvedPred, toSendPredBuffer[v2Owner], ..)
 5:     toSendPredPartCounter[v2Owner]++                 // count sent data parts

BUFFER-SEND(p)
 1: if BUFFER-IS-NOT-EMPTY(toSendPredBuffer[p])
 2:     PCJ.put(p, rcvedPred, toSendPredBuffer[p], ..)   // send data
 3:     toSendPredPartCounter[p]++                       // count sent out data parts

WAIT-AND-VISIT()
 1: partsSentCounter <- PCJ.waitFor(partsSent, 0)
 2: while partsSentCounter < PCJ.threadCount()
 3:     VISIT-VERTICES(rcvedPred)
 4:     partsSentCounter <- PCJ.waitFor(partsSent, 0)
 5: chunksWaitingForCounter <- SUM(partsSent)
 6: rcvedPredCounter <- PCJ.waitFor(rcvedPred, 0)
 7: while rcvedPredCounter < chunksWaitingForCounter
 8:     VISIT-VERTICES(rcvedPred)                        // visit received vertices
 9:     rcvedPredCounter <- PCJ.waitFor(rcvedPred, 0)
10: VISIT-VERTICES(rcvedPred)                            // all parts came, so visit vertices

The idea of the BFS algorithm is presented in Algorithm 1, together with an explanation of the functions used in the pseudocode. The first step of the algorithm is the initialization of the necessary data. Each task initializes its own predecessor array, a bitmap (an array of bits in which a bit is set if the corresponding vertex has already been visited), buffers, counters, the current-level array, the next-level array, etc. Although Java has many dynamic data structures suitable for handling changing data, using them in this implementation was highly ineffective; therefore, we have decided to use simple arrays instead. When the initialization is done, each PCJ thread checks whether it is the owner of the source vertex (line 3).
If the thread owns the source vertex, it adds it to its current-level array, sets the predecessor of the source vertex to -1, and sets the proper bit in the bitmap array. The current-level array stores the vertices at the current level; the next-level array stores the vertices that will be visited at the next level. Afterwards, all tasks execute the following code in a while loop: for every vertex in the current level, all its adjacent vertices are checked to see whether they have already been visited (line 9). If the bit for a vertex is set, the thread does nothing and moves on to the next adjacent vertex. If the bit is not set, which means that the vertex has not been visited yet, the thread needs to check which task owns the vertex (line 10). There are two situations. When the task performing the check is the owner of the vertex, the situation is simple: the vertex becomes visited, so the task sets the predecessor and the bitmap bit, and adds the vertex to the next level. But if the owner of the vertex is not the task performing the check, communication is necessary. A proper message is constructed, and when the buffer is full, the message is sent to the vertex owner (lines 13-15). Instead of sending a single message per vertex, we accumulate the data in a buffer of 64 KB (the size of the buffer has been adjusted experimentally). Besides sending information about visited vertices, the task needs to pass the number of sent parts (lines 16-17) so that the receiving task knows how many chunks it should expect. While waiting for data, the procedure WAIT-AND-VISIT is being executed (line 18). When all parts have been received, the remaining vertices are visited (function WAIT-AND-VISIT, line 10). At the end of each level, all tasks communicate to check whether there is any vertex that needs to be visited at the next level (lines 19-23). When there are no such vertices on any PCJ thread, the algorithm stops by breaking out of the main loop (line 30). Otherwise, the tasks exchange bitmaps and the search continues at the next level (lines 26-28).
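The buffered remote visits described above (pseudocode lines 13-17) could be sketched in Java roughly as follows. The names, the 64 KB buffer size, and the PCJ.put form follow the pseudocode of Algorithm 1; the exact PCJ method signatures and the shared-variable declarations depend on the library version, so this is an assumption-laden sketch rather than the authors' code.

```java
import java.util.Arrays;
import org.pcj.PCJ;

// Sketch of ACCUMULATE-BUFFER-SEND / BUFFER-SEND: remote visits are batched
// per destination task and shipped with one-sided puts when a buffer fills.
public class BufferedSendSketch {

    // ~64 KB worth of long values per destination (even, so (v2, v1) pairs fit)
    static final int BUFFER_CAPACITY = 64 * 1024 / Long.BYTES;

    long[][] toSendPredBuffer;      // one send buffer per destination task
    int[] bufferFill;               // number of longs already stored per buffer
    int[] toSendPredPartCounter;    // number of parts sent to each task

    void init(int taskCount) {
        toSendPredBuffer = new long[taskCount][BUFFER_CAPACITY];
        bufferFill = new int[taskCount];
        toSendPredPartCounter = new int[taskCount];
    }

    // called for every frontier edge (v1 -> v2) whose target v2 is owned remotely
    void accumulateBufferSend(int v2Owner, long v2, long v1) {
        long[] buf = toSendPredBuffer[v2Owner];
        buf[bufferFill[v2Owner]++] = v2;    // vertex to be visited remotely
        buf[bufferFill[v2Owner]++] = v1;    // its predecessor
        if (bufferFill[v2Owner] == BUFFER_CAPACITY) {
            // one-sided put of the full buffer into "rcvedPred" on the owner
            PCJ.put(v2Owner, "rcvedPred", buf.clone());
            toSendPredPartCounter[v2Owner]++;   // count sent data parts
            bufferFill[v2Owner] = 0;
        }
    }

    // called once per destination at the end of a level (BUFFER-SEND)
    void bufferSend(int p) {
        if (bufferFill[p] > 0) {
            long[] tail = Arrays.copyOf(toSendPredBuffer[p], bufferFill[p]);
            PCJ.put(p, "rcvedPred", tail);
            toSendPredPartCounter[p]++;         // count sent-out data parts
            bufferFill[p] = 0;
        }
    }
}
```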

In order to speed up the implementation, we have used a number of optimizations. The first one is the bitmap used to reduce the amount of communication between tasks [15]. Each task keeps a vector of bits, bitmap[v]; a bit is set to 1 if vertex v has been visited and to 0 otherwise. Besides the memory savings and the reduced memory-access time, checking whether a vertex has already been visited (line 9) avoids a great amount of communication. In our implementation, the bitmap contains information about the visited vertices from the previous level. At the end of each level, the bitmap information is exchanged between tasks (line 27). This solution does not require locks, so synchronization is avoided and we can benefit from the asynchronous one-sided communication provided by the PGAS model.

To gather data that is to be sent to another task, we used array buffers. Every task has its own buffer array, which allows us to split messages at the time of creation. This is important because the PCJ library supports sending arrays with provided indexes. Moreover, we have used message batching. Buffers minimize the overhead connected with starting communications: instead of sending a single vertex, each task batches a portion of the vertices, and when the buffer is full, it is sent to the target task. The buffer size has been set experimentally so as to reduce the overhead of starting communication while still sending frequently enough that other PCJ threads receive data quickly.

While waiting for the communication to complete, the procedure VISIT-VERTICES is being executed. Instead of waiting unproductively for the communication to finish, the task uses this time to process the data that has already been received. This is realized using the PCJ.waitFor("variable", 0) method, which returns the number of received data items.
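A rough Java rendering of this overlap, following the WAIT-AND-VISIT routine of Algorithm 1, is shown below. The non-blocking use of PCJ.waitFor(name, 0) as a counter of received updates follows the description above; all other names are illustrative and the shared-variable handling is simplified, so this is a sketch rather than the actual implementation.

```java
import org.pcj.PCJ;

// Sketch of the WAIT-AND-VISIT overlap: while waiting for the remaining
// chunks, the task keeps processing the chunks that have already arrived.
public class WaitAndVisitSketch {

    void waitAndVisit(int[] partsSent) {
        // 1) wait until every task has announced how many parts it sent
        int announced = PCJ.waitFor("partsSent", 0);
        while (announced < PCJ.threadCount()) {
            visitVertices();                         // useful work while waiting
            announced = PCJ.waitFor("partsSent", 0);
        }
        int expectedChunks = sum(partsSent);

        // 2) wait until all announced chunks of "rcvedPred" have arrived
        int received = PCJ.waitFor("rcvedPred", 0);
        while (received < expectedChunks) {
            visitVertices();
            received = PCJ.waitFor("rcvedPred", 0);
        }
        visitVertices();                             // all parts came; visit the rest
    }

    private int sum(int[] a) {
        int s = 0;
        for (int x : a) s += x;
        return s;
    }

    private void visitVertices() {
        // placeholder: set predecessors, update the bitmap and append the
        // received vertices to the next-level array (details omitted)
    }
}
```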
V. RESULTS

In this section, we present the performance results of the BFS implementation with the PCJ library compared with the simple BFS implementation of the Graph500 benchmark. Please note that in most cases found in the literature, performance results are presented for weak scaling, i.e., with an increasing number of nodes the problem size is also increased. In this paper, we focus on the algorithm evaluation; therefore, scalability is measured for a constant size of the input data (strong scaling). The tests have been carried out in the following environments:

- AMD Opteron(TM) Processor 6272 (Interlagos), 2.2 GHz, with nodes built of 64 cores (on four sockets) and 512 GB RAM; the nodes are connected with 10 Gb Ethernet.
- AMD Opteron(TM) Processor 2435 (Istanbul), 2.6 GHz, with nodes built of 12 cores (on two sockets) and 32 GB RAM; the nodes are connected with InfiniBand DDR + 1 Gb Ethernet.
- Cray XC40 (Okeanos), Intel Xeon CPU E v3 @ 2.60 GHz, with nodes built of 12 cores (on two sockets) and 128 GB RAM; the nodes are connected with the Aries interconnect (dragonfly topology).

We have used the 64-bit Oracle Java 8 JVM and OpenMPI with the gcc compiler. The sample graphs, in the form of edge tuple lists, and the source vertices used in the performance tests have been generated with the graph generator of the Graph500 benchmark. The size of a graph is specified by the following parameters: SCALE (the base-two logarithm of the number of vertices) and edgefactor (the ratio of the graph's edge count to its vertex count), which means that the total number of vertices is N = 2^SCALE and the number of edges equals M = edgefactor * N. The performance has been tested on graphs of SCALE = {26, 27, 28, 29} with edgefactor = 16. It is well known that performance and scalability depend on the graph structure [16]. For the purposes of this paper, we have used the same input data for all tests, following the Graph500 methodology. All charts present results from several independent BFS runs (we took five different source vertices, and for each of them five independent runs have been carried out). The execution time together with the number of traversed edges per second (TEPS) is presented.
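As a quick illustration of these size parameters and the TEPS metric, the snippet below converts a SCALE/edgefactor pair and a hypothetical BFS time into a TEPS value. Note that the actual Graph500 harness counts only the edges within the traversed component, so this is a simplification of the metric, not the benchmark's exact computation.

```java
// Toy illustration of the Graph500 size parameters and the TEPS metric:
// N = 2^SCALE vertices, M = edgefactor * N edges, TEPS ~ M / BFS time.
// The real benchmark counts only edges of the traversed component.
public class TepsSketch {
    public static void main(String[] args) {
        int scale = 26;
        int edgeFactor = 16;
        long vertices = 1L << scale;                 // N = 2^SCALE
        long edges = (long) edgeFactor * vertices;   // M = edgefactor * N

        double bfsTimeSeconds = 2.5;                 // hypothetical measured time
        double teps = edges / bfsTimeSeconds;        // traversed edges per second
        System.out.printf("N = %d, M = %d, TEPS = %.3e%n", vertices, edges, teps);
    }
}
```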

Figure 1 presents the TEPS obtained in various environments for graphs of SCALE 26 and 27, respectively. In general, the PCJ implementation provides performance and scalability similar to the MPI C implementation. The speedup is linear up to 32 PCJ/MPI threads, and this is independent of the number of threads run on a single node. The results for higher numbers of nodes are scattered, because communication starts to dominate. This is confirmed by the fact that the best results are obtained for the nodes connected with the InfiniBand interconnect, which provides higher communication bandwidth and lower latency.

In figures 2 and 3, the time spent in the different parts of the algorithm is presented. We have divided the algorithm into four parts:

- Part 1: initialization of the data structures, processing all vertices in the current frontier, and sending data to remote tasks (lines 1-29 in the pseudocode)
- Part 2: waiting for data and processing received messages (lines 30-39)
- Part 3: reduction, broadcast, and waiting for the message deciding whether the next level should be computed (lines 40-45)
- Part 4: reinitialization of the necessary data and all-to-all communication exchanging the bitmap (lines 46-51)

The most time-consuming part of the algorithm is Part 1; however, this part is well parallelized, and with an increasing number of threads the time spent in it decreases. Part 2 is the second most important part of the algorithm. It involves communication and becomes dominant for larger numbers of threads. For systems with a fast interconnect (Cray XC40, or the Istanbul processors with the InfiniBand interconnect), the time spent in this part of the algorithm is smaller than when Gigabit Ethernet is used; as a result, better scaling is achieved. Parts 3 and 4 of the algorithm contribute less to the overall time. For small numbers of threads, both parts are dominated by the performed operations, but with an increasing number of threads the communication (broadcast, reduction) starts to dominate. With the increase in the number of cores used, the communication time grows, reaching 20-30% for 32 nodes and 20-45% for 128 cores.

Fig. 1. BFS TEPS performance for the graph of SCALE 26 and SCALE 27. The PCJ performance is compared with the Graph500 MPI C implementation in various environments. Tests have been carried out for different numbers of PCJ threads per node (marked as x pn): 4pn means 4 PCJ threads running on the node.

Fig. 2. Profiling of the PCJ implementation of the level-synchronous BFS algorithm for the graph of SCALE 26.

Fig. 3. Profiling of the PCJ implementation of the level-synchronous BFS algorithm for the graph of SCALE

Fig. 4. BFS TEPS performance for the graphs of SCALE 27, 28 and 29.

VI. CONCLUSIONS

We have presented the parallelization of large graph traversal using the level-synchronous breadth-first search (BFS) algorithm. The algorithm has been adapted to the PGAS programming model and implemented using the PCJ library. The necessary optimizations, including overlapping the work with asynchronous communication, have been performed to increase performance and scalability. The solution has been verified on different hardware architectures by a performance analysis carried out for large graphs of different sizes. Our implementation shows good performance and scalability. The results have been compared to the MPI C implementation and show similar behavior. Such a comparison has not yet been provided for the Cray XC40 due to problems with the C code. The results prove that Java, together with the PCJ library, can be used for the successful parallelization of complicated algorithms.

ACKNOWLEDGMENT

This work has been performed using the PL-Grid infrastructure. Partial support from CHIST-ERA under the HPDCJ project is acknowledged (the Polish partners are supported by NCN grant 2014/14/Z/ST6/00007).

REFERENCES

[1] A. Lumsdaine, D. Gregor, B. Hendrickson, J. Berry: Challenges in parallel graph processing. Parallel Processing Letters, vol. 17, no. 01 (2007)
[2] R. C. Murphy, K. B. Wheeler, B. W. Barrett, J. A. Ang: Introducing the Graph 500. Cray Users Group (CUG) 2010
[3] M. Ryczkowska, M. Nowicki, P. Bała: The performance evaluation of the Java implementation of Graph500. 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015), in press
[4] The PCJ library home page. Accessed:
[5] M. Nowicki, P. Bała: Parallel computations in Java with PCJ library. In: W. W. Smari and V. Zeljkovic (Eds.) 2012 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2012
[6] M. Nowicki, Ł. Górski, P. Grabarczyk, P. Bała: PCJ - Java library for high performance computing in PGAS model. In: W. W. Smari and V. Zeljkovic (Eds.) 2014 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2014
[7] D. Mallón, G. Taboada, C. Teijeiro, J. Tourino, B. Fraguela, A. Gómez, R. Doallo, J. Mourino: Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures. In: M. Ropo, J. Westerholm, J. Dongarra (Eds.) Recent Advances in Parallel Virtual Machine and Message Passing Interface (Lecture Notes in Computer Science 5759), Springer Berlin/Heidelberg, 2009
[8] T. Suzumura, K. Ueno, H. Sato, K. Fujisawa, S. Matsuoka: Performance Characteristics of Graph500 on Large-Scale Distributed Environment. In: 2011 IEEE International Symposium on Workload Characterization (IISWC)
[9] K. Ueno, T. Suzumura: Highly scalable graph search for the Graph500 benchmark. In: Proceedings of the 21st International ACM Symposium on High-Performance Parallel and Distributed Computing
[10] A. Buluc, K. Madduri: Parallel breadth-first search on distributed memory systems. Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
[11] N. Satish, C. Kim, J. Chhugani, P. Dubey: Large-scale energy-efficient graph traversal: A path to efficient data-intensive supercomputing. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012, p. 14
[12] H. Lv, G. Tan, M. Chen, N. Sun: Understanding parallelism in graph traversal on multi-core clusters. Computer Science - Research and Development, vol. 28, no. 2-3
[13] D. P. Scarpazza, O. Villa, F. Petrini: Efficient Breadth-First Search on the Cell/BE Processor. IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 10
[14] D. Mizell, K. Maschhoff: Early experiences with large-scale Cray XMT systems. In: Proceedings of the 24th International Symposium on Parallel & Distributed Processing (IPDPS09), Rome, Italy
[15] V. Agarwal, F. Petrini, D. Pasetto, D. A. Bader: Scalable Graph Exploration on Multicore Processors. In: SC'10 Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
[16] R. Berrendorf, M. Makulla: Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems. FC'14, pp. 26-31, 2014
[17] A. Amer, L. Huiwei, P. Balaji, S. Matsuoka: Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS. IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
[18] G. Cong, G. Almasi, V. Saraswat: Fast PGAS connected components algorithms. In: Proceedings of the Third Conference on Partitioned Global Address Space Programming Models, ACM, p. 13
[19] J. Jose, S. Potluri, M. Luo, S. Sur, D. Panda: UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters. In: Conference on PGAS Programming Models
[20] G. Cong, G. Almasi, V. Saraswat: Fast PGAS implementation of distributed graph algorithms. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society, 2010
[21] M. Ryczkowska: Evaluating PCJ library for graph problems - Graph500 in PCJ. In: W. W. Smari and V. Zeljkovic (Eds.) 2014 International Conference on High Performance Computing and Simulation (HPCS), IEEE 2014


Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace James Southern, Jim Tuccillo SGI 25 October 2016 0 Motivation Trend in HPC continues to be towards more

More information

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář

Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Flexible Cache Cache for afor Database Management Management Systems Systems Radim Bača and David Bednář Department ofradim Computer Bača Science, and Technical David Bednář University of Ostrava Czech

More information

Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect

Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect Alexander Agarkov and Alexander Semenov JSC NICEVT, Moscow, Russia {a.agarkov,semenov}@nicevt.ru

More information

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card The Rise of MongoDB Summary One of today s growing database

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI

Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI Portable SHMEMCache: A High-Performance Key-Value Store on OpenSHMEM and MPI Huansong Fu*, Manjunath Gorentla Venkata, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

Data Communication and Parallel Computing on Twisted Hypercubes

Data Communication and Parallel Computing on Twisted Hypercubes Data Communication and Parallel Computing on Twisted Hypercubes E. Abuelrub, Department of Computer Science, Zarqa Private University, Jordan Abstract- Massively parallel distributed-memory architectures

More information

Victor Malyshkin (Ed.) Malyshkin (Ed.) 13th International Conference, PaCT 2015 Petrozavodsk, Russia, August 31 September 4, 2015 Proceedings

Victor Malyshkin (Ed.) Malyshkin (Ed.) 13th International Conference, PaCT 2015 Petrozavodsk, Russia, August 31 September 4, 2015 Proceedings Victor Malyshkin (Ed.) Lecture Notes in Computer Science The LNCS series reports state-of-the-art results in computer science re search, development, and education, at a high level and in both printed

More information