An Intelligent and Cost-effective Solution to Implement High Performance Computing
International Journal of Automation and Power Engineering (IJAPE) Volume 5, 2016 doi: /ijape

An Intelligent and Cost-effective Solution to Implement High Performance Computing

Afrin Naz *1, Mingyu Lu 2, Joshua Keiffer 3, Benjamin Culkin 4

1,3,4 Computer Science and Information Systems Department, West Virginia University Institute of Technology, Montgomery, WV, USA
2 Electrical and Computer Engineering Department, West Virginia University Institute of Technology, Montgomery, WV, USA

*1 afrin.naz@mail.wvu.edu; 2 mingyu.lu@mail.wvu.edu; 3 jbkeiffer@mix.wvu.edu; 4 bjculkin@mix.wvu.edu

Abstract

In this paper we describe a smart and cost-effective way to develop a high performance cluster computer to support the undergraduate education program as well as the research of West Virginia University Institute of Technology (WVU Tech). The cluster computer will be used primarily, and heavily, to support undergraduate education at WVU Tech: it will be integrated into a wide range of undergraduate courses in the Computer Science and Computer Engineering programs. We expect the new scalable supercomputer to benefit the entire curriculum of the College of Engineering and Sciences at WVU Tech.

Keywords

Parallel Computing; High Performance Computing; Scalable

Introduction

High performance computing, also termed parallel computing, is a fast-developing field in Computer Science. In high performance computing, one computational task is partitioned into multiple subtasks that are executed in parallel. The hardware platform that supports parallel computing is usually called a supercomputer. In this paper we describe a smart and highly cost-effective way to develop a supercomputer (parallel computer) to support the undergraduate education program as well as the research of West Virginia University Institute of Technology (WVU Tech).
The proposed supercomputer employs a cluster architecture: 10 computing nodes are interconnected by Ethernet switches, and each computing node consists of regular components, including two CPUs, a motherboard, and an Ethernet interface card. This cluster architecture constitutes an intelligent and highly cost-effective way to implement a supercomputer: the proposed cluster computer (with 20 CPUs in total) requires only the cost of its commodity hardware parts, a fraction of the price of a comparable commercial system. In this project, a high performance cluster computer is developed at WVU Tech. It will be used primarily to support undergraduate education at WVU Tech. We are currently developing a new course on parallel programming, and the students registered in that course will use the cluster computer extensively. Meanwhile, the cluster will create significant synergistic impact across the entire undergraduate curriculum of Computer Science and Computer Engineering at WVU Tech. The cluster computer will be connected to the Internet; all students and faculty members of WVU Tech can apply for access to support their education and research. As a close neighbor of WVU Tech, BridgeValley Community and Technical College is interested in using the cluster computer for its teaching as well. Because parallel computing plays a critical role in virtually every science and engineering discipline, the proposed supercomputer is expected to benefit all undergraduate students of the College of Engineering and Sciences. The supercomputer is scalable: more computing nodes can be incorporated into the cluster network straightforwardly, without altering the existing nodes. On the basis of this project, we will actively seek other funding sources to upgrade the machine, and in the course of upgrading we will gradually offer access to other institutions.
It is our goal that this supercomputer will eventually become a valuable asset for the entire state of West Virginia.

Motivation

Founded in 1895, WVU Tech is a nationally recognized institution of about 100 faculty and 1000 students, dedicated to offering high-quality undergraduate education to the region centered at Charleston, WV. Over the past 110 years, the institution has supplied a large number of graduates to industry, business, and government agencies. Currently, WVU Tech does not have any parallel computing facility, which prevents our students from gaining first-hand experience with the fascinating field of parallel computing. The availability of a supercomputer will enable undergraduate students to visualize a parallel computing architecture, learn parallel programming, and conduct hands-on experiments. Meanwhile, as a visually impressive instrument, the supercomputer can help motivate more students to choose engineering and science as their future careers. The new high performance cluster will enable us to develop new courses on parallel computing, and it will be integrated into a wide range of undergraduate courses in the Computer Science and Computer Engineering programs. To name a few: in the Computer System Concepts (CS 350) and Linux (CS 270) classes, we will develop hands-on projects demonstrating how to build and administer a Linux cluster; in the E-commerce (CS 266), C# (CS 225), Visual Basic (MANG 370), and Database Management (CS 324) classes, vivid examples will show how large amounts of data can be processed in parallel; and since the cluster depends on Ethernet connections, it can readily be employed to illustrate many concepts in Introduction to Networking (CS 263). We expect the new supercomputer to benefit the entire curriculum of the College of Engineering and Sciences at WVU Tech.
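Classroom examples of parallel data processing such as those described above can be kept very small. The following Python sketch (the record format and the normalize function are hypothetical, chosen purely for illustration) processes a list of records in parallel using the standard multiprocessing module:

```python
from multiprocessing import Pool

def normalize(record):
    """Hypothetical per-record computation: scale a value into [0, 1]."""
    name, value = record
    return (name, value / 100.0)

if __name__ == "__main__":
    # A toy data set standing in for a large table of records.
    records = [("a", 25), ("b", 50), ("c", 75), ("d", 100)]
    # Two worker processes each handle a share of the records.
    with Pool(processes=2) as pool:
        results = pool.map(normalize, records)
    print(results)
    # → [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
```

Pool.map distributes the records across worker processes much as MPI distributes work across cluster nodes, which makes this a convenient single-machine warm-up before students move on to MPICH.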
Moreover, all students of the College of Engineering and Sciences may opt to use the supercomputer in their senior design projects or independent study. The cluster computer will also have a significant impact on research at WVU Tech. At present, faculty members of WVU Tech must resort to supercomputers at other institutions, such as the National Center for Supercomputing Applications (NCSA), for heavy-duty computational tasks. Though the proposed supercomputer is not as powerful as those at NCSA, it is sufficient for researchers to debug and test light-duty jobs before a heavy-duty job is submitted to NCSA. The supercomputer is open to all WVU Tech faculty members for their research.

Related Work

As mentioned before, the hardware platform that supports parallel computing is usually called a supercomputer. Since 1992, the top 500 supercomputers in the world have been ranked twice a year [10]. In June 2014, MilkyWay-2 was ranked the most powerful supercomputer; it comprises more than three million cores [5]. Nowadays, parallel computing is an integral part of the Computer Science curriculum at numerous universities worldwide; a few well-established examples can be found in [1-4, 9].

Implementation

In this section we describe our implementation process. First we describe our hardware and software; then we walk through the entire implementation process step by step.

Hardware

The proposed supercomputer employs a cluster architecture and is thus also called a high performance cluster computer. As depicted in Fig. 1, 10 computing nodes are interconnected by Ethernet switches. Each computing node consists of regular components, including two CPUs, a motherboard, and an Ethernet interface card. Compared with a single CPU, the cluster computer is expected to be at least 10 times faster (though ideally a 20-CPU computer ought to achieve a speed-up of 20 times, the speed-up measured in practice is typically 10 to 15 times [6]).
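The gap between the ideal 20-fold speed-up and the 10 to 15 times observed in practice is commonly explained by Amdahl's law. The short Python sketch below illustrates the effect; the serial fractions used are illustrative assumptions, not measurements from this cluster:

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Amdahl's law: the speed-up attainable when a fixed fraction of the
    work is inherently serial and the rest is split across n_processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# With 20 CPUs, even a small serial fraction keeps the speed-up well below 20x.
for f in (0.0, 0.02, 0.05):
    print(f, round(amdahl_speedup(f, 20), 1))
# → 0.0 20.0
# → 0.02 14.5
# → 0.05 10.3
```

A serial fraction of only a few percent, plus communication overhead that this simple model does not capture, is enough to account for a measured speed-up in the 10 to 15 range.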
The cluster architecture shown in Fig. 1 constitutes a highly cost-effective and intelligent solution to implement a supercomputer: the proposed cluster computer (with 20 CPUs in total) requires only the cost of its hardware parts, a fraction of the price of a comparable commercial system. Meanwhile, the cluster computer is scalable: more computing nodes can be incorporated straightforwardly. On the basis of this project, we will actively seek other funding sources to expand the supercomputer.

FIG. 1 ARCHITECTURE OF THE HIGH PERFORMANCE CLUSTER COMPUTER (10 computing nodes connected through Ethernet switches, with an uplink to the Internet)

Software

1) Operating System

All ten servers run Ubuntu LTS Server. Ubuntu was chosen for its stability and flexibility, and because it is Debian-based, its package management is very good. Also, being Linux, it incurs no cost.

2) MPI Software

For networking the cluster nodes together, we use the MPICH implementation of the MPI (Message Passing Interface) standard. MPICH was chosen because it is one of the first and best-maintained implementations of the standard. The MPI configuration we currently use is per-node, meaning that each physical computer hosts a single MPI process. The alternative is per-core mode, where each core hosts a single MPI process, leading, on our hardware, to six processes per node. Per-node mode was chosen to give each process complete access to a single node, so that processes do not have to share memory or disk space with each other.

TABLE I TIMELINE FOR THIS PROJECT
1. Build the 10 computing nodes
2. Install Linux operating system
3. Construct network connections
4. Build the cluster
5. Diagnose the cluster
6. Test the cluster
7. Start to develop a new course

MPICH configures the programs to talk to each other over the network, but requires an external piece of software, called the process manager, to actually start all of the programs. We use the default process manager, named Hydra after the many-headed beast of Greek legend. To launch processes, Hydra itself delegates to SSH, which is configured for keyed login, which means that once a user has
provided their key to the server, they can run programs through SSH without having to enter their password. MPICH also provides the functionality to pass data across the network to a different process so that it may be shared. While the installation is the same on every machine, one machine is required to act as the master, while the remaining machines receive jobs from it. In the end, any output and/or collected data is sent back to the master.

3) Implementation Steps

This project's implementation plan and timeline are presented in Table 1. The project started on July 1. It has the following seven (7) specific tasks.

Task 1: Build the 10 computing nodes. This task is approximately equivalent to constructing 10 regular computers from regular components, including CPUs, motherboards, and RAM.

Task 2: Install the Linux operating system on the 10 computing nodes. In this task, Red Hat Linux 7.1 (which is free of charge) was installed on each computing node.

Task 3: Construct network connections. The 10 computing nodes were interconnected using Ethernet cables and switches.

Task 4: Build the cluster. One computer is designated as the master node and the other nine behave as slave nodes. The latest version of MPICH, an implementation of MPI, the most commonly adopted standard for parallel programming, was downloaded and installed [7].

Task 5: Diagnose the cluster. A few diagnostic tools readily available on the MPICH website were used to diagnose the cluster; hardware and software faults reported by these tools were identified and removed.

Task 6: Test the cluster. The Numerical Aerodynamic Simulation (NAS) parallel benchmarks [11] are applied to test the performance of individual nodes as well as the entire cluster. All the data collected in this task are being documented as benchmark data for future diagnosis and calibration.
We have also started data collection with the Standard Performance Evaluation Corporation (SPEC) parallel benchmarks [12].

Task 7: Develop a new course on parallel programming. We are now planning a new course named Parallel Programming, built around the proposed cluster computer, at the Department of Computer Science and Information Systems of WVU Tech.

Benchmarks

For our testing procedure we have used NASA's NAS (Numerical Aerodynamic Simulation) Parallel Benchmarks (NPB), a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications in the original "pencil-and-paper" specification [11]. A brief description of each program is provided below [11].

LU solver (LU): A simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system, arising from a finite-difference discretization of the 3-D Navier-Stokes equations, by splitting it into block lower and upper triangular systems.

3-D FFT PDE (FT): Contains the computational kernel of a 3-D fast Fourier transform (FFT)-based spectral method. FT performs three one-dimensional (1-D) FFTs, one for each dimension.

Multigrid (MG): Uses a V-cycle multigrid method to compute the solution of the 3-D scalar Poisson equation. The algorithm works continuously on a set of grids ranging between coarse and fine. It tests both short- and long-distance data movement.

Conjugate Gradient (CG): Uses a conjugate gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications, using a matrix with randomly generated locations of entries.
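The core of the CG kernel is an ordinary conjugate gradient solve. The sketch below shows that inner solve in plain Python on a toy 2x2 symmetric positive-definite system; NPB's actual benchmark wraps such a solve inside an eigenvalue estimation over a much larger random sparse matrix, which is not reproduced here:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite matrix A,
    given as a list of row lists. Returns the solution vector x."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x (x starts at zero)
    p = r[:]                      # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:          # squared residual norm small enough
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# Toy SPD system: 4x + y = 1, x + 3y = 2; exact solution is [1/11, 7/11].
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
# x ≈ [0.0909, 0.6364]
```

In the parallel benchmark, the matrix-vector product (the Ap line above) is the step that is distributed across nodes, so it dominates both the computation and the communication being measured.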
Block tridiagonal solver (BT): A simulated CFD application that uses an implicit algorithm to solve the three-dimensional (3-D) compressible Navier-Stokes equations. The finite-difference solution is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y, and z dimensions.

Scalar pentadiagonal solver (SP): A simulated CFD application with a structure similar to BT. The finite-difference solution is based on a Beam-Warming approximate factorization that decouples the x, y, and z dimensions. The resulting system has scalar pentadiagonal bands of linear equations that are solved sequentially along each dimension.

Embarrassingly Parallel (EP): Generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish a reference point for the peak performance of a given platform.

Integer Sort (IS): A large integer sort. This kernel performs a sorting operation that is important in particle-method codes. It tests both integer computation speed and communication performance.

Problem sizes in NPB are predefined and indicated by different classes, as described below:

Class S: small, for test purposes;
Class W: workstation size;
Classes A, B, C: standard test problems, with roughly a 4X size increase from one class to the next;
Classes D, E, F: large test problems, each roughly 16X larger than the previous class.

Results

In this paper we present initial data collected from class C of the NAS Parallel Benchmarks. In Fig. 2 we compare the execution times of the benchmarks CG, EP, FT, LU, and MG while running on an individual node as well as with 2, 4, and 8 nodes, respectively. We encountered verification failures for the benchmarks BT, IS, and SP, and we are currently working to fix this problem. FIG.
2 EXECUTION TIMES (IN SECONDS) FOR BENCHMARKS WHILE RUNNING WITH ONE, TWO, FOUR AND EIGHT NODES, RESPECTIVELY

Conclusions

In this paper we described a smart and cost-effective way to develop a high performance cluster computer to support the undergraduate education program as well as the research of WVU Tech. We are currently collecting data to be documented for future diagnosis and calibration, and we are also developing a new course on parallel programming. The cluster has already been incorporated into some of our classes at the College of Engineering and Sciences. The proposed supercomputer is scalable: more computing nodes can be straightforwardly incorporated into the cluster network without altering the existing nodes. On the basis of this project, we will actively seek other funding
sources to expand our supercomputer. For instance, we will submit proposals to the Major Research Instrumentation (MRI) program and the Improving Undergraduate STEM Education (IUSE: EHR) program of the National Science Foundation. With further funding, we will continuously upgrade the 10-node supercomputer to hundreds of nodes, and in the course of upgrading we will gradually offer access to other institutions. It is our goal that this supercomputer will eventually become a valuable asset for the entire state of West Virginia.

ACKNOWLEDGMENT

This work was supported by a West Virginia Higher Education Policy Commission Instrumentation Grant.

REFERENCES

[1] COMP 633: Parallel Computing, University of North Carolina.
[2] CS525: Parallel Computing, Purdue University.
[3] ECE408/CS483: Applied Parallel Programming, University of Illinois.
[4] INFR11023: Parallel Programming Languages and Systems, University of Edinburgh, UK.
[5] "Intel's Milky Way 2 Is the World's Fastest Computer, New Top Supercomputer Named."
[6] Introduction to Parallel Computing, by Blaise Barney, Lawrence Livermore National Laboratory.
[7] MPICH: High-Performance Portable MPI.
[8] National Center for Supercomputing Applications.
[9] Parallel Programming for Multicore Machines Using OpenMP and MPI, OpenCourseWare, Massachusetts Institute of Technology.
[10] Top 500 Supercomputer Sites.
[11] NAS Parallel Benchmarks, NASA Advanced Supercomputing Division.
[12] Standard Performance Evaluation Corporation (SPEC) benchmarks.
More informationIntel Math Kernel Library
Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra
More informationAPPLICATION OF PARALLEL ARRAYS FOR SEMIAUTOMATIC PARALLELIZATION OF FLOW IN POROUS MEDIA PROBLEM SOLVER
Mathematical Modelling and Analysis 2005. Pages 171 177 Proceedings of the 10 th International Conference MMA2005&CMAM2, Trakai c 2005 Technika ISBN 9986-05-924-0 APPLICATION OF PARALLEL ARRAYS FOR SEMIAUTOMATIC
More informationApplication of Finite Volume Method for Structural Analysis
Application of Finite Volume Method for Structural Analysis Saeed-Reza Sabbagh-Yazdi and Milad Bayatlou Associate Professor, Civil Engineering Department of KNToosi University of Technology, PostGraduate
More informationBenchmarking CPU Performance
Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed
More informationA Software Developing Environment for Earth System Modeling. Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012
A Software Developing Environment for Earth System Modeling Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012 1 Outline Motivation Purpose and Significance Research Contents Technology
More informationAchieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation
Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College
More informationEfficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid
Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid Yu Zhong Martin D. F. Wong Dept. of Electrical and Computer Engineering Dept. of Electrical and Computer Engineering Univ. of
More informationJ. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst
Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function
More informationParallel solution for finite element linear systems of. equations on workstation cluster *
Aug. 2009, Volume 6, No.8 (Serial No.57) Journal of Communication and Computer, ISSN 1548-7709, USA Parallel solution for finite element linear systems of equations on workstation cluster * FU Chao-jiang
More informationApplication-Transparent Checkpoint/Restart for MPI Programs over InfiniBand
Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of Computer Science & Engineering
More informationExploring Hardware Overprovisioning in Power-Constrained, High Performance Computing
Exploring Hardware Overprovisioning in Power-Constrained, High Performance Computing Tapasya Patki 1 David Lowenthal 1 Barry Rountree 2 Martin Schulz 2 Bronis de Supinski 2 1 The University of Arizona
More informationOptimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,
More informationTools and Primitives for High Performance Graph Computation
Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World
More informationDISTRIBUTED VIRTUAL CLUSTER MANAGEMENT SYSTEM
DISTRIBUTED VIRTUAL CLUSTER MANAGEMENT SYSTEM V.V. Korkhov 1,a, S.S. Kobyshev 1, A.B. Degtyarev 1, A. Cubahiro 2, L. Gaspary 3, X. Wang 4, Z. Wu 4 1 Saint Petersburg State University, 7/9 Universitetskaya
More informationSplotch: High Performance Visualization using MPI, OpenMP and CUDA
Splotch: High Performance Visualization using MPI, OpenMP and CUDA Klaus Dolag (Munich University Observatory) Martin Reinecke (MPA, Garching) Claudio Gheller (CSCS, Switzerland), Marzia Rivi (CINECA,
More informationDevelopment of an Integrated Computational Simulation Method for Fluid Driven Structure Movement and Acoustics
Development of an Integrated Computational Simulation Method for Fluid Driven Structure Movement and Acoustics I. Pantle Fachgebiet Strömungsmaschinen Karlsruher Institut für Technologie KIT Motivation
More informationEnergy- Regional Innovation Cluster (E-RIC)
Energy- Regional Innovation Cluster (E-RIC) Dual E-RIC Mission: Reduced energy use in buildings Regional economic development Department of Energy $122 million Economic Development Administration $5 million
More informationExploring unstructured Poisson solvers for FDS
Exploring unstructured Poisson solvers for FDS Dr. Susanne Kilian hhpberlin - Ingenieure für Brandschutz 10245 Berlin - Germany Agenda 1 Discretization of Poisson- Löser 2 Solvers for 3 Numerical Tests
More informationA Local-View Array Library for Partitioned Global Address Space C++ Programs
Lawrence Berkeley National Laboratory A Local-View Array Library for Partitioned Global Address Space C++ Programs Amir Kamil, Yili Zheng, and Katherine Yelick Lawrence Berkeley Lab Berkeley, CA, USA June
More informationD036 Accelerating Reservoir Simulation with GPUs
D036 Accelerating Reservoir Simulation with GPUs K.P. Esler* (Stone Ridge Technology), S. Atan (Marathon Oil Corp.), B. Ramirez (Marathon Oil Corp.) & V. Natoli (Stone Ridge Technology) SUMMARY Over the
More informationModelling and implementation of algorithms in applied mathematics using MPI
Modelling and implementation of algorithms in applied mathematics using MPI Lecture 1: Basics of Parallel Computing G. Rapin Brazil March 2011 Outline 1 Structure of Lecture 2 Introduction 3 Parallel Performance
More informationMSE Comprehensive Exam
MSE Comprehensive Exam The MSE requires a comprehensive examination, which is quite general in nature. It is administered on the sixth Friday of the semester, consists of a written exam in the major area
More informationLarge-scale Gas Turbine Simulations on GPU clusters
Large-scale Gas Turbine Simulations on GPU clusters Tobias Brandvik and Graham Pullan Whittle Laboratory University of Cambridge A large-scale simulation Overview PART I: Turbomachinery PART II: Stencil-based
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationA Case for High Performance Computing with Virtual Machines
A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation
More informationDepartment of Computer Science and Engineering
Department of Computer Science and Engineering 1 Department of Computer Science and Engineering Department Head: Professor Edward Swan Office: 300 Butler Hall The Department of Computer Science and Engineering
More informationCo-array Fortran Performance and Potential: an NPB Experimental Study. Department of Computer Science Rice University
Co-array Fortran Performance and Potential: an NPB Experimental Study Cristian Coarfa Jason Lee Eckhardt Yuri Dotsenko John Mellor-Crummey Department of Computer Science Rice University Parallel Programming
More informationBİL 542 Parallel Computing
BİL 542 Parallel Computing 1 Chapter 1 Parallel Programming 2 Why Use Parallel Computing? Main Reasons: Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion,
More informationEFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI
EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,
More informationBenchmark 1.a Investigate and Understand Designated Lab Techniques The student will investigate and understand designated lab techniques.
I. Course Title Parallel Computing 2 II. Course Description Students study parallel programming and visualization in a variety of contexts with an emphasis on underlying and experimental technologies.
More informationLarge Scale Debugging of Parallel Tasks with AutomaDeD!
International Conference for High Performance Computing, Networking, Storage and Analysis (SC) Seattle, Nov, 0 Large Scale Debugging of Parallel Tasks with AutomaDeD Ignacio Laguna, Saurabh Bagchi Todd
More informationPerformance of Implicit Solver Strategies on GPUs
9. LS-DYNA Forum, Bamberg 2010 IT / Performance Performance of Implicit Solver Strategies on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Abstract: The increasing power of GPUs can be used
More informationDynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection
Numerical Libraries in the DOE ACTS Collection The DOE ACTS Collection SIAM Parallel Processing for Scientific Computing, Savannah, Georgia Feb 15, 2012 Tony Drummond Computational Research Division Lawrence
More informationClusters. Rob Kunz and Justin Watson. Penn State Applied Research Laboratory
Clusters Rob Kunz and Justin Watson Penn State Applied Research Laboratory rfk102@psu.edu Contents Beowulf Cluster History Hardware Elements Networking Software Performance & Scalability Infrastructure
More informationUpdate of Post-K Development Yutaka Ishikawa RIKEN AICS
Update of Post-K Development Yutaka Ishikawa RIKEN AICS 11:20AM 11:40AM, 2 nd of November, 2017 FLAGSHIP2020 Project Missions Building the Japanese national flagship supercomputer, post K, and Developing
More informationTRAFFIC CONTROLLER LABORATORY UPGRADE
TRAFFIC CONTROLLER LABORATORY UPGRADE Final Report KLK206 N06-21 National Institute for Advanced Transportation Technology University of Idaho Ahmed Abdel-Rahim August 2006 DISCLAIMER The contents of this
More informationAdaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics
More informationWhy Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends
Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer
More informationAnalysis of 2D Torus and Hub Topologies of 100Mb/s Ethernet for the Whitney Commodity Computing Testbed 1
Analysis of 2D Torus and Hub Topologies of 1Mb/s Ethernet for the Whitney Commodity Computing Testbed 1 Kevin T. Pedretti and Samuel A. Fineberg NAS Technical Report NAS-97-17 September 1997 MRJ, Inc.
More informationPERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015
PERFORMANCE PORTABILITY WITH OPENACC Jeff Larkin, NVIDIA, November 2015 TWO TYPES OF PORTABILITY FUNCTIONAL PORTABILITY PERFORMANCE PORTABILITY The ability for a single code to run anywhere. The ability
More informationVIRTUAL NETWORKING LABORATORY FOR EDUCATION IN COMPUTER SCIENCE
INFORMATION TECHNOLOGY IN EDUCATION VIRTUAL NETWORKING LABORATORY FOR EDUCATION IN COMPUTER SCIENCE Jordan H. Kanev 1, Stanimir M. Sadinov 1 1 Technical University of Gabrovo, Gabrovo, Bulgaria Abstract:
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report
ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new
More informationLinux+ Base Pod Installation and Configuration Guide
Linux+ Base Pod Installation and Configuration Guide This document provides detailed guidance on performing the installation and configuration of the Linux+ Base Pod on a NETLAB+ system. The Linux+ Base
More informationQuickGuide for CC, GS, and Barnard CS Students
QuickGuide for CC, GS, and Barnard CS Students (New Requirements Beginning Fall 2013) This QuickGuide is for Columbia College, General Studies, and Barnard students thinking of majoring or concentrating
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More information