Design Issues for Efficient Implementation of MPI in Java

Glenn Judd, Mark Clement, Quinn Snell
Computer Science Department, Brigham Young University, Provo, USA

Vladimir Getov
School of Computer Science, University of Westminster, London, UK

Abstract

While there is growing interest in using Java for high-performance applications, many in the high-performance computing community do not believe that Java can match the performance of traditional native message passing environments. This paper discusses critical issues that must be addressed in the design of Java-based message passing systems. Efficient handling of these issues allows Java-MPI applications to obtain performance which rivals that of traditional native message passing systems. To illustrate these concepts, the design and performance of a pure Java implementation of MPI are discussed.

1 Introduction

The Message Passing Interface (MPI) [1] has proven to be an effective means of writing portable parallel programs. With the increasing interest in using Java for high-performance computing, several groups have investigated using MPI from within Java. Nevertheless, there are still many in the high-performance computing community who are skeptical that Java MPI performance can compete with native MPI. These skeptics usually refer to data showing that early Java implementations of message passing standards [2] performed orders of magnitude slower than native versions. Their skepticism is further backed by the fact that, as of yet, there has not been an MPI implementation completely written in Java that is competitive with native MPI implementations.

To investigate possible Java MPI performance, we have designed and implemented an MPI system written completely in Java which seeks to be competitive with native MPI implementations in clustered computing environments. In this paper we discuss issues that must be addressed in order to efficiently implement MPI in Java, explain how they are concretely addressed in our implementation, and compare the performance obtained with that of native MPI. Our results show that there are many cases where MPI in Java can compete with native MPI.

Section 2 reviews previous work on Java message passing. In Section 3, we discuss issues which must be addressed in order to allow efficient communication between MPI processes in Java. In Section 4 we discuss issues related to supporting threads in Java MPI processes. Section 5 discusses methods for integrating high performance libraries into Java. In Section 6 we discuss the design of our pure Java implementation of MPI. Section 7 presents performance results.

2 Related Work

Over the last few years many Java message passing systems have been developed. A large number of these systems, such as JavaParty [3], JET [4], and IceT [5], have developed novel parallel programming methodologies using Java. Others have looked at using variations on Java Remote Method Invocation [6] or JavaSpaces [7] for high performance computing. A number of efforts have also investigated using Java versions of established message passing standards such as MPI [1] and Parallel Virtual Machine (PVM) [8]:

JPVM [2] is an implementation of PVM written completely in Java. Unfortunately, JPVM has very poor performance compared to native PVM and MPI systems.

mpijava [9] is a Java wrapper to native MPI implementations. It allows application code to be written in pure Java, but currently requires a native MPI implementation in order to function.
JavaMPI [10] is also a Java wrapper to native MPI libraries, but JavaMPI wrappers are generated automatically with the help of a special-purpose tool called JCI (Java-to-C Interface generator).

Efforts are currently underway to develop a standard Java MPI binding in order to increase the interoperability and quality of Java MPI bindings [11]. This research adds to that effort by exploring issues that must be addressed in order to efficiently implement MPI in Java. These issues can then be addressed both by the developing Java MPI bindings and by the Java environment itself, in order to foster the development of efficient message passing systems which are written completely in Java.

3 Network Communication in Java

3.1 Native Marshalling

High network communication performance is a critical element of any MPI implementation. Achieving high network communication performance under Java requires consideration of issues not found under native code. Before discussing these issues, consider Figure 6, which compares Java byte array communication to C byte array communication. It is clear that Java communication of byte arrays completely matches C in this case. As both C and Java rely on the same underlying communication library to carry out the communication, it is not surprising that they achieve very comparable results.

However, most communication in MPI consists of data other than bytes. In C this is a trivial issue since an array of any type can be type cast to a byte array, but in Java this issue is significant because a simple type cast is not permitted. The most common method of sending non-byte data in Java is to marshal the data into a byte array and then send this byte array. Performing this marshaling in Java code is an intrinsically less efficient operation than performing it in native code. Consider the following Java code fragment which marshals an array of doubles:

    void marshal(double[] src, byte[] dst) {
        int count = 0;
        for (int i = 0; i < src.length; i++) {
            long value = Double.doubleToLongBits(src[i]);
            dst[count++] = (byte)((int)(value >>> 56));
            dst[count++] = (byte)((int)(value >>> 48));
            dst[count++] = (byte)((int)(value >>> 40));
            dst[count++] = (byte)((int)(value >>> 32));
            dst[count++] = (byte)(value >>> 24);
            dst[count++] = (byte)(value >>> 16);
            dst[count++] = (byte)(value >>> 8);
            dst[count++] = (byte)(value >>> 0);
        }
    }

Figure 1: Native vs. Java Marshalling (JDK 1.2 on a Pentium II 266 MHz machine running Windows NT)

In C this marshaling can be accomplished with a simple memcpy. In Java the equivalent marshaling code requires a total of 35 Java bytecode instructions, including a method invocation, several shift operations, and several type conversion operations, none of which are required in C. It is suggested in [12] that just-in-time (JIT) compilers should be able to optimize data marshaling into its native equivalent (i.e. eliminate the method invocation, shifts, etc., and replace them with memcpy code). While this is theoretically possible, it is complicated by the fact that there are several primitive types and several approaches to marshaling each of these types that a JIT compiler would need to be able to optimize. At this time, we are not aware of any JIT compiler which even attempts this optimization.

A much simpler solution is to follow the precedent established by the System.arraycopy method already included in Java. This method is used to efficiently copy data between Java arrays. Currently this method requires both source and destination to be of the same data type. This routine could be extended to allow the source and destination arrays to be of different primitive types.
Alternatively, a new method could be added which would specifically allow for copying data between arrays of different primitive types. Note that such a method would not compromise Java language safety or security, as it introduces no new functionality but rather expedites existing functionality. As shown in Figure 1, this approach enables huge increases in data marshaling speed, allowing data marshaling to occur at the same speed as a memcpy.
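To illustrate how such a native marshaling routine can be combined with a pure Java fallback, the sketch below hides the two behind a single call. The class name NativeMarshal, the library name "nativemarshal", and the native entry point arrayCopyDoubleToByte are hypothetical, not part of any existing API; the Java fallback simply mirrors the marshaling loop shown above.

    // Hypothetical helper: prefer a native cross-type copy when available,
    // and fall back to pure Java marshaling otherwise.
    public final class NativeMarshal {
        private static final boolean NATIVE_AVAILABLE = tryLoad();

        private static boolean tryLoad() {
            try { System.loadLibrary("nativemarshal"); return true; }  // hypothetical library name
            catch (UnsatisfiedLinkError e) { return false; }
        }

        // Hypothetical native entry point: copies src.length doubles into dst as raw bytes.
        private static native void arrayCopyDoubleToByte(double[] src, byte[] dst);

        public static void marshal(double[] src, byte[] dst) {
            if (NATIVE_AVAILABLE) {
                arrayCopyDoubleToByte(src, dst);   // effectively a memcpy
                return;
            }
            int count = 0;                         // pure Java fallback, as in the fragment above
            for (int i = 0; i < src.length; i++) {
                long v = Double.doubleToLongBits(src[i]);
                for (int shift = 56; shift >= 0; shift -= 8) {
                    dst[count++] = (byte)(v >>> shift);
                }
            }
        }
    }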

3.2 Typed Array Communication

The native marshaling we have discussed still requires a memory copy. Native code is able to transfer data over the network without any copy, unless the destination machine of a message uses different byte ordering, in which case a byte-order-changing memory copy is required. The same approach could be used in Java code, but Java's current design does not allow this. Currently in Java, all network communication is sent using java.io.InputStream and java.io.OutputStream; these classes only provide routines for sending bytes. This design allows Java to leave byte ordering undefined, allowing each Java Virtual Machine (JVM) to use the native machine's byte ordering. Since JVMs are only able to send bytes to each other, different internal byte ordering is unimportant. While this is a clean design, it limits performance for primitive arrays of type other than byte.

A possible solution is to add input and output classes which are able to send typed data directly without any memory copy. These classes would have the ability to automatically determine when different byte ordering is used on the input and output machines, and introduce byte ordering changes only when needed. These classes could also provide a uniform means of access to the non-TCP/IP communication found in many high performance computing clusters. Under this scheme, Java MPI implementations would request that a "factory" provide a typed array communication class capable of communicating on the local machine's specialized network.

Now consider a typical implementation of MPI's standard communication mode: for messages below a certain size threshold, messages are buffered and sent; for messages above the threshold, the sender blocks until the receiver actually posts the receive, which allows the message to be sent without any buffering. Native marshaling allows buffering in Java to occur at essentially the same speed as buffering in native code. However, native marshaling does not allow Java code to send without buffering. Using typed input and output classes alleviates this problem by allowing communication to occur directly between send and receive buffers. Together, the two additions to Java we have discussed allow Java applications to achieve communication performance which is comparable to that of native applications.

3.3 Shared Memory Communication

On multi-processor machines, it is desirable to use direct memory transfers for communication. In Java this is accomplished by placing the multiple MPI processes in a single JVM. This introduces a significant difference with native MPI: MPI processes become Java threads. This means that multiple threads in a single class which uses class variables (class variables are global to the JVM) will all see the same data. This is inconvenient for applications which assume a native MPI process model, where MPI processes do not see the same global information. However, the cost of this inconvenience is more than made up for by the fact that Java class variables can be exploited by programmers to provide a very simple and very powerful shared memory mechanism for threads residing on the same machine. MPI implementations can exploit this shared memory mechanism to speed up both point-to-point communication and global operations. We have found that global operations in Java MPI benefit greatly from using Java class variables to organize a rendezvous and direct memory transfer, rather than the standard method of using point-to-point communication for global operations.
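As a minimal sketch of this rendezvous idea, the class below uses a class (static) variable to publish the root's buffer so that the other ranks, running as threads in the same JVM, can copy from it directly. The names are illustrative, a fixed rank count is assumed, and java.util.concurrent is used for brevity even though it postdates this work; a real implementation would key the shared slot by communicator and tag.

    import java.util.concurrent.BrokenBarrierException;
    import java.util.concurrent.CyclicBarrier;

    // Intra-JVM broadcast built on a class-variable rendezvous.
    public final class SharedMemoryBcast {
        private static volatile double[] sharedSrc;      // visible to all ranks (threads) in the JVM
        private static final int NUM_RANKS = 4;          // assumed fixed thread count for this sketch
        private static final CyclicBarrier barrier = new CyclicBarrier(NUM_RANKS);

        // Every rank calls bcast with its own buffer; rank 'root' supplies the data.
        public static void bcast(double[] buf, int root, int myRank)
                throws InterruptedException, BrokenBarrierException {
            if (myRank == root) {
                sharedSrc = buf;                         // publish the root's buffer
            }
            barrier.await();                             // rendezvous: the buffer is now published
            if (myRank != root) {
                System.arraycopy(sharedSrc, 0, buf, 0, buf.length);  // direct memory copy, no sockets
            }
            barrier.await();                             // root must not reuse its buffer until all copies finish
        }
    }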
4 Thread Support

4.1 Thread Support for Shared Memory Utilization

Programmers desiring the maximum amount of performance on multi-processor machines can write programs which use Java MPI calls between machines and Java threads within a machine. One simple way that Java MPI implementations can aid this process is to provide a method for determining the number of processors on the machine. Java provides no mechanism for determining the number of processors on a machine, but this can be overcome by writing a method which determines the number of effective processors. This is easily accomplished by writing a routine which divides work among increasing numbers of threads; the number of processors is then indicated by the occurrence of a significant drop in the amount of work completed per CPU. This method can also be used by Java MPI implementations to automatically determine the number of MPI processes to run on the machine, rather than relying on a process group file.
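A sketch of such a probe is shown below: it divides a fixed amount of CPU-bound work among increasing numbers of threads and stops as soon as an additional thread no longer yields a significant gain in aggregate throughput, i.e. when the work completed per CPU drops. The work kernel, thread limit, and 25% threshold are illustrative assumptions rather than values from this work; on later JVMs, Runtime.getRuntime().availableProcessors() reports the processor count directly.

    // Estimate the number of effective processors by probing with 1, 2, 3, ... threads.
    public final class ProcessorProbe {
        private static final long WORK_UNITS = 20_000_000L;

        private static volatile long sink;               // volatile sink keeps the JIT from removing the loop
        private static void burn(long units) {
            long acc = 0;
            for (long i = 0; i < units; i++) acc += i ^ (i << 1);
            sink = acc;
        }

        // Time the same per-thread workload run by 'threads' threads in parallel.
        private static long timeWithThreads(int threads) throws InterruptedException {
            Thread[] workers = new Thread[threads];
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                workers[t] = new Thread(() -> burn(WORK_UNITS));
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            return System.nanoTime() - start;
        }

        // Aggregate throughput should keep growing until the thread count exceeds the CPU count.
        public static int effectiveProcessors(int maxToProbe) throws InterruptedException {
            double prevThroughput = 0;
            int detected = 1;
            for (int n = 1; n <= maxToProbe; n++) {
                double throughput = (double) n * WORK_UNITS / timeWithThreads(n);
                if (n == 1 || throughput > prevThroughput * 1.25) {   // "significant" gain: heuristic threshold
                    detected = n;
                    prevThroughput = throughput;
                } else {
                    break;                                            // per-CPU work dropped: stop probing
                }
            }
            return detected;
        }

        public static void main(String[] args) throws InterruptedException {
            System.out.println("Effective processors: " + effectiveProcessors(8));
        }
    }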

4.2 General Thread Support

Threads in Java are useful for more than just taking advantage of multiple processors. They allow many important functions, such as I/O, to be performed outside of the main thread of execution. The widespread use of threading in Java makes support for threading a very important issue. Unfortunately, MPI includes very little thread support. Rather, MPI merely delineates what a "threaded" version of MPI should provide, and what the user should be required to provide. As threads are pervasive in Java, any Java MPI implementation should, at least, follow the guidelines provided by MPI for a threaded MPI. However, the ease and power of Java threading begs for a more elegant solution.

5 Integrating Standard High Performance Libraries Into Java

A significant issue that must be addressed is how to integrate the additions we have proposed into Java. Native marshaling could easily be added to Java's current API with very little difficulty, but the proposed classes for high performance communication of typed arrays introduce a more substantial change. As Java is largely driven by the needs of business applications, it is unlikely that a substantial class like the typed array networking class will make it into the core Java API, in spite of the huge performance increases possible. The issue of how to integrate a high performance computing API into Java is faced by several efforts to establish standard Java high performance computing libraries [12]. No standard method has yet been established for inclusion of these libraries in Java. Therefore, we propose a straightforward and flexible system for adding high performance libraries to Java.

With the introduction of the Java 2 platform, Java contains a core API, packages named java.*, and several standard extension packages named javax.*. When Sun introduced the Swing API (Java's new GUI API), Sun originally defined its package as com.sun.java.swing.*. In order to move Swing into the core library for Java 2, but leave it as an extension for JDK 1.1, Sun defined Swing's package to be javax.swing. Sun then defined javax.swing as a core API, in addition to the java.* packages, in Java 1.2. Following this pattern, we propose that libraries critical for high-performance computing be included in standard Java Grande extensions javax.grande.*. Parts of these libraries which are useful to the general public could eventually be defined as core. Less critical libraries, or libraries still under development, could be defined in grande.org.* packages. These classes could eventually be promoted to javax.* if necessary. Under this scheme, standard native libraries critical for performance would be installed on systems; if a native library cannot be found, a default Java implementation is substituted. In this way applications can have both portability to machines which do not have any native code installed, and superior performance on machines which do.

6 Design Principles

In designing our implementation of MPI in Java, we followed four major principles:

Pure Java implementation. A pure Java implementation is very desirable as it inherits all of Java's cross-platform, security, and language safety features. The only exception we allowed to the pure Java implementation is on systems where a library for native marshaling of arrays is available. In this case, it is best to use native marshaling. If the native marshaling library is unavailable, we simply use Java marshaling. As will be shown, this small bit of native code allows messaging to compete favorably with native message passing schemes.

Java Grande Forum MPI binding proposal compliance. The MPI standard contains bindings for C, Fortran, and C++, but none for Java. It is important to have well-defined Java bindings for MPI in order to foster compatible, high-quality implementations. To remedy this situation, we are working as part of the Java Grande Forum Concurrency and Applications Working Group to develop MPI bindings. We sought to follow these emerging bindings in order to allow programs written under our implementation to run under other Java MPI systems and vice versa.
High communication performance. Efficient communication is critical in order to make Java MPI a viable alternative to native MPI. When our implementation starts, it first searches for a native marshaling library. If a native library is found, it is used to perform native marshaling as described earlier. If no native marshaling library is found, a Java library is used for marshaling data. Our implementation does not yet use any typed array communication classes; we are currently working on incorporating them, and we expect to see significant performance increases when they are included. On multi-processor machines, our implementation makes use of shared access to class variables to perform efficient collective communication. This allows us to copy data directly between source and destination buffers, and achieve a high degree of efficiency.

Independence from any particular application framework. This greatly increases the usability of our implementation by allowing it to be used by any framework which provides a few simple startup functions.

7 Performance Results

7.1 Test Environment

To quantify the performance of our current implementation, we ran benchmarks on three different parallel computing environments:

1. A cluster of dual-processor Pentium II 266 MHz Windows NT machines under JDK 1.2, communicating via switched Ethernet, with only one MPI process per machine.

2. The same cluster as in 1, but with each machine running up to two MPI processes (one per CPU).

3. A four-processor 400 MHz Xeon Windows NT machine under JDK 1.2.

As stated, one of our major aims is to show that MPI under Java can match native MPI performance. In order to demonstrate this we compare performance with one of the best available MPI systems for Windows NT, WMPI [13]. WMPI was chosen because an evaluation study elsewhere [14] showed it to have very good shared memory and distributed memory communication performance. We do not compare our results with JPVM and PVM 3.4 under Windows NT because their performance is significantly less than that of WMPI. We also do not compare against Linux MPI because WMPI performance is fairly comparable to MPI on Linux (NT and Linux bandwidth on our hardware is nearly equal, while Linux latency is much lower), and because Java on Linux is far less advanced than Java on Windows NT.

7.2 Point-to-Point Communication Performance

Ping Pong

The Ping Pong benchmark finds the maximum bandwidth that can be achieved sending bytes between two nodes, one direction at a time. As can be seen in Figure 2 and Figure 3, our distributed memory communication is essentially equivalent to that of WMPI. Shared memory performance of our implementation is reasonably close to that of WMPI. (Note that this test was run with explicit tags and sources in the MPI receive call. Our implementation currently performs significantly slower when using the MPI.ANY_SOURCE wildcard. The cause of this inefficiency seems to be synchronization overhead, and we are investigating more efficient implementations.)

Figure 2: Ping Pong Distributed Memory
Figure 3: Ping Pong Shared Memory

Ping Ping

The Ping Ping test (Figure 4 and Figure 5) finds the maximum bandwidth that can be obtained between two nodes when messages are being sent simultaneously in both directions. Once again, distributed memory performance is equivalent to that of WMPI. However, in this case, our implementation significantly outperforms WMPI in a shared memory environment.

Figure 4: Ping Ping Distributed Memory
Figure 5: Ping Ping Shared Memory
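For reference, a minimal Ping Pong measurement loop written against the draft mpiJava-style binding [9, 11] might look as follows. The mpi.* package and the method signatures are assumed from that binding and may differ in detail, and the message size and repetition count are arbitrary; note the explicit source rank in Recv, matching the test configuration described above.

    import mpi.MPI;            // mpiJava-style draft binding assumed; names may differ in other bindings
    import mpi.MPIException;

    // Rank 0 sends a byte array to rank 1, rank 1 sends it straight back,
    // and the round-trip times give the bandwidth.
    public class PingPong {
        public static void main(String[] args) throws MPIException {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            final int REPS = 100, TAG = 0;
            byte[] buf = new byte[1 << 20];                    // 1 MB message (illustrative size)

            double start = MPI.Wtime();
            for (int r = 0; r < REPS; r++) {
                if (rank == 0) {
                    MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.BYTE, 1, TAG);
                    MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.BYTE, 1, TAG);  // explicit source, not ANY_SOURCE
                } else if (rank == 1) {
                    MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.BYTE, 0, TAG);
                    MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.BYTE, 0, TAG);
                }
            }
            double elapsed = MPI.Wtime() - start;
            if (rank == 0) {
                double mbytes = 2.0 * REPS * buf.length / (1024.0 * 1024.0);    // data moved in both directions
                System.out.println("Bandwidth: " + (mbytes / elapsed) + " MB/s");
            }
            MPI.Finalize();
        }
    }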

Communication of Various Primitive Types

The Ping Pong and Ping Ping tests measure communication of bytes. As stated previously, communicating other data types is more troublesome in Java. Figure 6 compares our implementation and WMPI communicating double precision floating point data and integer data. The native marshaling technique mentioned previously allows our implementation to reach essentially the same performance as WMPI on double precision floating point data, and on integer data our implementation actually outperforms WMPI slightly. If a native marshaling library is unavailable, our implementation will use pure Java marshaling. The lowest line in Figure 6 represents double precision floating point communication performance when Java marshaling is used, and clearly shows that the use of Java marshaling instead of native marshaling results in significantly worse performance.

Figure 6: Communication of Primitive Types (double with native marshalling, double with Java marshalling, and int with native marshalling)

Startup Latency

Table 1 shows startup latency for both distributed and shared memory. Our implementation's distributed memory latency is lower than that of WMPI, possibly because our implementation is built directly on the Java socket API, while WMPI relies on an intermediate API before accessing the Windows socket API. However, our implementation is significantly slower in shared memory mode, possibly due to Java synchronization overhead.

Table 1: Startup latencies (in microseconds) for our implementation and WMPI, in shared and distributed memory

7.3 Other Benchmarks

Barrier

The Barrier test measures process synchronization performance. Figures 7, 8, and 9 compare our implementation's performance to that of WMPI for the hybrid system, the shared memory system, and the distributed memory system. Our implementation performs well in both the hybrid and distributed memory modes, but is significantly slower in shared memory. This performance gap should shrink significantly once we optimize the shared memory barrier code.

Figure 7: Barrier Hybrid Memory
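A minimal sketch of a counter-based intra-JVM barrier of the kind discussed here is shown below; it relies only on Java monitors, and the class name and structure are illustrative rather than taken from our implementation. A per-communicator instance would be held in a static field so that all ranks (threads) in the JVM reach the same object.

    // Generation-counting barrier for threads within one JVM.
    public final class SharedMemoryBarrier {
        private final int parties;
        private int waiting = 0;
        private int generation = 0;          // distinguishes successive barrier rounds

        public SharedMemoryBarrier(int parties) { this.parties = parties; }

        public synchronized void await() throws InterruptedException {
            int myGeneration = generation;
            if (++waiting == parties) {      // last rank to arrive releases everyone
                waiting = 0;
                generation++;
                notifyAll();
            } else {
                while (generation == myGeneration) {
                    wait();                  // block until the last rank arrives
                }
            }
        }
    }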

Figure 8: Barrier Shared Memory
Figure 9: Barrier Distributed Memory

NAS Parallel Benchmarks: Integer Sort

As a final test, we evaluated the performance of our implementation on a single NAS Parallel Benchmark: Integer Sort [15]. We compare this performance with the performance of both WMPI on the four-processor Xeon and of MPI on an IBM SP2. A critical element for this benchmark is the performance of the MPI function ALLTOALLV. We optimized this function to exploit shared memory variables. As shown in Figure 10, our implementation was able to outperform WMPI, and performed quite well compared to the SP2 [16].

Figure 10: Integer Sort (seconds vs. processors; Pentium II 266 cluster, Xeon 400, IBM SP2 LAM, IBM SP2 IBM MPI)

8 Conclusions

We have shown several instances where MPI implemented in Java can match the performance of native MPI in a clustered environment. Achieving this performance in Java requires careful implementation of data marshaling. Currently, data marshaling must occur in native code in order to achieve high performance. In our view, this functionality should be added to the core Java classes, either by allowing System.arraycopy to copy between arrays of different types or by adding a method which has this functionality. However, this still requires a memory copy. The most demanding environments will require a zero copy communication system. This is possible by adding a class similar to DataOutputStream that is capable of sending arrays without marshaling unless the message destination requires different byte ordering.

We have also shown that Java MPI implementations which allow multiple threads to exist in a single JVM can exploit shared access to static variables. We have demonstrated how this technique can be used by MPI to speed up global operations, but application programmers could also use threads directly to allow shared access to data without any message passing. As Java MPI implementations mature and incorporate key communication capabilities, they will be able to provide a viable alternative to native MPI implementations.

9 Future Work

We have examined some of the most critical Java MPI performance issues, but there are still many other open questions to be addressed. In addition, while our implementation of MPI contains the most essential functionality, it is not yet complete. Future work will address implementation of the remaining MPI features as they are included in the final Java MPI bindings, as well as implementation on supercomputers such as the IBM SP2.

References

[1] MPI Forum. MPI: A message-passing interface standard. International Journal of Supercomputer Applications, 8(3/4), 1994.

[2] A. Ferrari. JPVM: network parallel computing in Java. Concurrency: Practice and Experience, vol. 10 (11-13), pp. 985-992, 1998.

[3] M. Philippsen and M. Zenger. JavaParty - transparent remote objects in Java. Concurrency: Practice and Experience, vol. 9 (11), pp. 1225-1242, 1997.

[4] H. Pedroso, L. M. Silva, and J. G. Silva. Web-based metacomputing with JET. Concurrency: Practice and Experience, vol. 9 (11), pp. 1169-1173, 1997.

[5] P. Gray and V. Sunderam. IceT: Distributed computing and Java. Concurrency: Practice and Experience, vol. 9 (11), pp. 1161-1167, 1997.

[6] JavaSoft. Remote method invocation. Technical report, docs/guide/rmi/index.html, 1997.

[7] JavaSoft. JavaSpaces. Technical report, javaspaces/, 1997.

[8] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM 3 user's guide and reference manual. Technical Report ORNL/TM-12187, Oak Ridge National Laboratory, Sept.

[9] B. Carpenter, G. Fox, G. Zhang, and X. Li. A draft Java binding for MPI. pcrc/hpjava/mpijava.html, Nov.

[10] S. Mintchev and V. Getov. Towards portable message passing in Java: Binding MPI. In M. Bubak, J. Dongarra, J. Wasniewski (Eds.), Recent Advances in PVM and MPI, LNCS, Springer, pp. 135-142, Nov. 1997.

[11] B. Carpenter, V. Getov, G. Judd, T. Skjellum, and G. Fox. MPI for Java: Position Document and Draft API Specification. Technical Report JGF-TR-3, Java Grande Forum, Nov. 1998.

[12] Java Grande Forum. Making Java Work for High-End Computing. Technical Report JGF-TR-1, Java Grande Forum, Nov. 1998.

[13] WMPI. Technical report.

[14] M. Baker and G. Fox. MPI on NT: A preliminary evaluation of the available environments. In Jose Rolim (Ed.), Parallel and Distributed Computing (12th IPPS and 9th SPDP), LNCS, Springer, pp. 549-563, April 1998.

[15] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS parallel benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, 1994.

[16] V. Getov, S. Flynn-Hummel, and S. Mintchev. High-performance parallel programming in Java: Exploiting native libraries. Concurrency: Practice and Experience, vol. 10 (11-13), pp. 863-872, 1998.


Chapter 1 GETTING STARTED. SYS-ED/ Computer Education Techniques, Inc. Chapter 1 GETTING STARTED SYS-ED/ Computer Education Techniques, Inc. Objectives You will learn: Java platform. Applets and applications. Java programming language: facilities and foundation. Memory management

More information

White Paper: Delivering Enterprise Web Applications on the Curl Platform

White Paper: Delivering Enterprise Web Applications on the Curl Platform White Paper: Delivering Enterprise Web Applications on the Curl Platform Table of Contents Table of Contents Executive Summary... 1 Introduction... 2 Background... 2 Challenges... 2 The Curl Solution...

More information

Java Virtual Machine

Java Virtual Machine Evaluation of Java Thread Performance on Two Dierent Multithreaded Kernels Yan Gu B. S. Lee Wentong Cai School of Applied Science Nanyang Technological University Singapore 639798 guyan@cais.ntu.edu.sg,

More information

EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH PARALLEL IN-MEMORY DATABASE. Dept. Mathematics and Computing Science div. ECP

EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH PARALLEL IN-MEMORY DATABASE. Dept. Mathematics and Computing Science div. ECP EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH CERN/ECP 95-29 11 December 1995 ON-LINE EVENT RECONSTRUCTION USING A PARALLEL IN-MEMORY DATABASE E. Argante y;z,p. v.d. Stok y, I. Willers z y Eindhoven University

More information

Throughput in Mbps. Ethernet e+06 1e+07 Block size in bits

Throughput in Mbps. Ethernet e+06 1e+07 Block size in bits NetPIPE: A Network Protocol Independent Performance Evaluator Quinn O. Snell, Armin R. Mikler and John L. Gustafson Ames Laboratory/Scalable Computing Lab, Ames, Iowa 5, USA snelljmiklerjgus@scl.ameslab.gov

More information

David H. Bailey. November 14, computational uid dynamics and other aerophysics applications. Presently this organization

David H. Bailey. November 14, computational uid dynamics and other aerophysics applications. Presently this organization Experience with Parallel Computers at NASA Ames David H. Bailey November 14, 1991 Ref: Intl. J. of High Speed Computing, vol. 5, no. 1 (993), pg. 51{62. Abstract Beginning in 1988, the Numerical Aerodynamic

More information

Yasuo Okabe. Hitoshi Murai. 1. Introduction. 2. Evaluation. Elapsed Time (sec) Number of Processors

Yasuo Okabe. Hitoshi Murai. 1. Introduction. 2. Evaluation. Elapsed Time (sec) Number of Processors Performance Evaluation of Large-scale Parallel Simulation Codes and Designing New Language Features on the (High Performance Fortran) Data-Parallel Programming Environment Project Representative Yasuo

More information

High Performance Computing Course Notes Message Passing Programming I

High Performance Computing Course Notes Message Passing Programming I High Performance Computing Course Notes 2008-2009 2009 Message Passing Programming I Message Passing Programming Message Passing is the most widely used parallel programming model Message passing works

More information

short long double char octet struct Throughput in Mbps Sender Buffer size in KBytes short long double char octet struct

short long double char octet struct Throughput in Mbps Sender Buffer size in KBytes short long double char octet struct Motivation Optimizations for High Performance ORBs Douglas C. Schmidt (www.cs.wustl.edu/schmidt) Aniruddha S. Gokhale (www.cs.wustl.edu/gokhale) Washington University, St. Louis, USA. Typical state of

More information

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Zhou B. B., Brent R. P. and Tridgell A. y Computer Sciences Laboratory The Australian National University Canberra,

More information

Parallel Pipeline STAP System

Parallel Pipeline STAP System I/O Implementation and Evaluation of Parallel Pipelined STAP on High Performance Computers Wei-keng Liao, Alok Choudhary, Donald Weiner, and Pramod Varshney EECS Department, Syracuse University, Syracuse,

More information

Go Deep: Fixing Architectural Overheads of the Go Scheduler

Go Deep: Fixing Architectural Overheads of the Go Scheduler Go Deep: Fixing Architectural Overheads of the Go Scheduler Craig Hesling hesling@cmu.edu Sannan Tariq stariq@cs.cmu.edu May 11, 2018 1 Introduction Golang is a programming language developed to target

More information

Cost-Performance Evaluation of SMP Clusters

Cost-Performance Evaluation of SMP Clusters Cost-Performance Evaluation of SMP Clusters Darshan Thaker, Vipin Chaudhary, Guy Edjlali, and Sumit Roy Parallel and Distributed Computing Laboratory Wayne State University Department of Electrical and

More information

Communication Characteristics in the NAS Parallel Benchmarks

Communication Characteristics in the NAS Parallel Benchmarks Communication Characteristics in the NAS Parallel Benchmarks Ahmad Faraj Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 32306 {faraj, xyuan}@cs.fsu.edu Abstract In this

More information