NUSGRID: a computational grid at NUS

Grace Foo (SVU/Academic Computing, Computer Centre)

SVU is leading an initiative to set up a campus-wide computational grid prototype at NUS. The initiative arose out of a desire to improve resource sharing and the overall utilization and efficiency of compute resources across NUS. The grid prototype, called NUSgrid, is built on the popular grid middleware Globus Toolkit.

NUSgrid will link existing computational resources, connected over the campus network, from three organizations: SVU, the Computational Science Department (CZD) and the Engineering IT Unit (EITU). More resources will be added after the prototype has been tested and proven. The contributed resources are a heterogeneous mix of parallel servers and Linux or other UNIX workstation clusters, as shown in Table 1:

Entity: CZD, EITU, SVU
Resource Contribution: AMD Linux cluster; SUN BLADE 2000 workstation cluster; Compaq ES40 (4 CPUs) server; Intel Xeon (16 CPUs) Linux cluster

Table 1. NUSgrid resources

In addition to the compute servers, two other servers complete the NUSgrid infrastructure: a server hosting a certificate authority (CA) and a web portal server. Having its own CA makes digital certification more convenient during the development and testing of NUSgrid. The infrastructure is summarized in Figure 1:
Figure 1. NUSgrid infrastructure

Grid Design

The grid middleware is the component that makes the grid possible. For NUSgrid, we used Globus Toolkit (GT), also known as Globus (web site at http://www.globus.org). Since its release in the late 1990s, Globus has become the de facto grid middleware, with a very high rate of adoption in academia. Funding for the NUSgrid project is minimal and goes mainly to hardware (the CA and portal servers), so the open source Globus was the obvious choice.

Globus Toolkit has a command line interface which is not easy to use. For greater accessibility and ease of use, we developed a web interface to NUSgrid. Since this is a computational grid, the main focus is on compiling and running jobs: users will be able to compile code and submit their jobs from the web interface. The Java Commodity Grid (CoG) kit (web site at http://wwwunix.globus.org/cog/java), an open source Globus application development toolkit, was used to implement the portal.

Applications need to be enabled to run on the grid. We grid enabled the common compilers (C/C++, Fortran, Java). Because some of the resources on the grid are parallel servers or clusters, we also grid enabled the MPI (C, Fortran) compilers and the MPI run time environment. Matlab, a general purpose mathematical tool widely used across many research domains, was also enabled. We intend to grid enable more applications in the future.
Portal User Interface

The portal user interface is meant to facilitate a developer's compile / run cycle. We briefly describe some of its features. After a registration / activation process, the user may log into the portal to see a page with the following menu items (Figure 2).

Figure 2. Portal menus

The main menu items are Compile job and Run job. Before code can be compiled, however, it has to be uploaded to the portal server. Every user is allocated space on the portal to hold files and data. Files are uploaded through the Manage files link, which also lets the user view and delete files in his portal space.

When the Compile job link is clicked, the right side of the page expands to show host and compiler dropdown lists. The host dropdown lets the user select the host (contributed by the organizations) for the compilation. The compiler dropdown lists only the compilers available on the selected host. After the compiler is selected, the page expands to show more items, as in Figure 3. The user enters the source code filename and may specify compiler arguments. Pressing the Update button refreshes the command line box, which shows the command that will be executed on the host. The user may edit the command line further if necessary.
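The behaviour of the host and compiler dropdowns and of the Update button described above can be sketched as follows. This is only an illustrative sketch in Python, not the portal's actual code (the portal was implemented with the Java CoG kit). The host names, the compiler lists in HOST_COMPILERS and the function names are hypothetical, and the gcc-style -o flag for naming the executable is an assumption.

```python
import shlex

# Hypothetical catalogue: which compilers each grid host offers.
# Host names and compiler commands are illustrative assumptions.
HOST_COMPILERS = {
    "host-a": ["gcc", "g++", "mpicc"],
    "host-b": ["gcc", "javac"],
}


def compilers_for(host):
    """Return the compilers to list in the dropdown for the selected host."""
    return list(HOST_COMPILERS.get(host, []))


def build_compile_command(host, compiler, source, args=(), output=None):
    """Build the text shown in the command line box after Update is pressed."""
    if compiler not in compilers_for(host):
        raise ValueError(f"{compiler} is not available on {host}")
    parts = [compiler, *args]
    if output is not None:
        # Assumes a gcc-style "-o" flag for naming the executable.
        parts += ["-o", output]
    parts.append(source)
    # shlex.join quotes any part containing spaces or shell metacharacters.
    return shlex.join(parts)


if __name__ == "__main__":
    print(compilers_for("host-a"))
    print(build_compile_command("host-a", "gcc", "hello.c", ["-O2"], "hello"))
```

Returning the command as a single editable string mirrors the portal's command line box, which the user may still change before the job is sent to the host.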
Figure 3. Compiling a job

There is an add / drop section for specifying any other files needed in the compilation, for example, header files or special libraries. These are copied to the execution host together with the source code. The executable filename may also be specified (the default name depends on the compiler), with an option to keep a copy on the execution host. The compilation status and result are shown in the portal window.

The Run job link provides a similar window in which the user fills in the host, application, arguments, input / output files and other requirements for his job. As in Compile job, the page is customized for the host and application selected. Both interactive and batch job submissions are possible for most applications. An add / drop section lets the user specify any other files needed in the execution, for example, (Java) class files or data input files.

An interactive job is submitted directly to the host and the user waits for the results of the execution, much as with code compilation. Batch jobs run in the background and their results may not be returned immediately. Furthermore, batch jobs are submitted to queues if a job scheduler (for example, LSF) exists on the execution host.
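The submission rules above amount to a small decision: interactive jobs go straight to the host, while batch jobs are queued only when the host has a job scheduler. A minimal sketch in Python follows. The Host class, the submission_route function and the host names are hypothetical, and the sketch performs no actual Globus submission; it only shows the routing logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    name: str
    scheduler: Optional[str] = None  # e.g. "LSF"; None if the host has none


def submission_route(host, batch):
    """Describe how a job would be handed to the execution host."""
    if not batch:
        # Interactive: the user waits for the result, as with compilation.
        return "interactive: run directly on the host"
    if host.scheduler is not None:
        # Batch on a host with a scheduler: the job goes into a queue.
        return f"batch: queued through {host.scheduler}"
    # Batch on a host without a scheduler: plain background execution.
    return "batch: run in the background"


if __name__ == "__main__":
    cluster = Host("cluster-1", scheduler="LSF")  # hypothetical hosts
    server = Host("server-1")
    for host in (cluster, server):
        for batch in (False, True):
            print(host.name, "->", submission_route(host, batch))
```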
The Job status link allows the user to check the status of his submitted batch jobs. When the user clicks the link, the portal server checks the host for the user's uncompleted batch jobs and updates the status information in the listing. If a job has completed, its result / data files are copied back to the portal. Queued batch jobs are listed in a separate table from the ordinary (non-queued) ones. For each job, the listings show information such as time of submission, host and output files.

The Grid info link provides information about the hosts / servers and applications available. The host information includes status, operating system, number and speed of CPUs, and free and total memory.

Problems and Status

We encountered many problems in the design and implementation of NUSgrid. One of the main problem areas was the Globus Toolkit middleware itself. Globus Toolkit is open source and under active development, so there are many bugs and ongoing bug fixes, and the documentation is poor. Installation of the Toolkit on some platforms ran into problems because of bugs in the Toolkit that were only later discovered and fixed. The open source Java Commodity Grid (CoG) kit used to develop the portal has similar stability issues. Also, many features of Globus are not available through the Java CoG kit.

The portal user interface was not easy to design. It had to be quite general, as our computational grid was intended to support general computing needs. At the same time, it was sometimes necessary to customize it for specific applications and hosts. Each host environment was different, but these differences had to be hidden from the user as much as possible.

Testing of the grid and portal required substantial manpower and patience. There were two main types of testing: testing at the Globus command line and portal integration testing.
Successful execution at the Globus command line was necessary but did not guarantee an easy or successful portal integration, because of the peculiarities of the Java CoG kit.

Lack of manpower was a major problem in this project, and the development and implementation of the system proceeded at a very slow pace. Nevertheless, the system implementation is now close to completion. NUSgrid will be released for use by year end. After release, the system will be monitored and users will be surveyed after a period of use. In the future, more hosts / servers will be added and more applications will be enabled to run in our grid environment. The system features will be improved based on data collected during monitoring and on user feedback.

For questions or feedback on NUSgrid, please contact the author at ccefoog@nus.edu.sg.