An Intelligent and Cost-effective Solution to Implement High Performance Computing
International Journal of Automation and Power Engineering (IJAPE) Volume 5, 2016 doi: /ijape

An Intelligent and Cost-effective Solution to Implement High Performance Computing

Afrin Naz *1, Mingyu Lu 2, Joshua Keiffer 3, Benjamin Culkin 4

1,3,4 Computer Science and Information Systems Department, West Virginia University Institute of Technology, Montgomery, WV, USA
2 Electrical and Computer Engineering Department, West Virginia University Institute of Technology, Montgomery, WV, USA

*1 afrin.naz@mail.wvu.edu; 2 mingyu.lu@mail.wvu.edu; 3 jbkeiffer@mix.wvu.edu; 4 bjculkin@mix.wvu.edu

Abstract

In this paper we describe a smart and cost-effective way to develop a high performance cluster computer to support the undergraduate education program as well as the research of West Virginia University Institute of Technology (WVU Tech). The cluster computer will be used primarily, and heavily, to support undergraduate education at WVU Tech: it will be integrated into a wide range of undergraduate courses in the Computer Science and Computer Engineering programs. We expect the new scalable supercomputer to benefit the entire curriculum of the College of Engineering and Sciences at WVU Tech.

Keywords

Parallel Computing; High Performance Computing; Scalable

Introduction

High performance computing, also termed parallel computing, is a fast-developing field in Computer Science. In high performance computing, one computational task is partitioned into multiple subtasks that are executed in parallel. The hardware platform that supports parallel computing is usually called a supercomputer. In this paper we describe a smart and highly cost-effective way to develop a supercomputer (parallel computer) to support the undergraduate education program as well as the research of West Virginia University Institute of Technology (WVU Tech).
The proposed supercomputer employs a cluster architecture: 10 computing nodes are interconnected by Ethernet switches, and each computing node consists of regular components, including two CPUs, a motherboard, and an Ethernet interface card. This cluster architecture constitutes an intelligent and highly cost-effective way to implement a supercomputer: the proposed cluster computer (with 20 CPUs in total) requires only the cost of its commodity hardware parts, a fraction of the price of a comparable commercial system. In this project, a high performance cluster computer is developed at WVU Tech. It will be used primarily to support undergraduate education at WVU Tech. We are currently developing a new course on parallel programming, and the students registered in that course will use the cluster computer extensively. Meanwhile, the cluster will create significant synergistic impact across the entire undergraduate curriculum of Computer Science and Computer Engineering at WVU Tech. The cluster computer will be connected to the Internet; all students and faculty members of WVU Tech can apply for access to support their education and research. As a close neighbor of WVU Tech, BridgeValley Community and Technical College is interested in using the cluster computer for its teaching as well. Because parallel computing plays a critical role in virtually every science and engineering discipline, the proposed supercomputer is expected to benefit all undergraduate students of the College of Engineering and Sciences. The supercomputer is scalable: more computing nodes can be incorporated into the cluster network straightforwardly, without altering the existing nodes. On the basis of this project, we will actively seek other funding sources to upgrade the machine, and in the course of upgrading we will gradually offer access to other institutions.
It is our goal that this supercomputer will eventually become a valuable asset for the entire state of West Virginia.

Motivation

Founded in 1895, WVU Tech is a nationally recognized institution of about 100 faculty and 1000 students, dedicated to offering high-quality undergraduate education to the region centered at Charleston, WV. Over the past 110 years, the institution has supplied a large number of graduates to industry, business, and government agencies. Currently, WVU Tech does not have any parallel computing facility, which prevents our students from gaining first-hand experience with the fascinating field of parallel computing. The availability of a supercomputer will enable undergraduate students to visualize a parallel computing architecture, learn parallel programming, and conduct hands-on experiments. Meanwhile, as a visually impressive instrument, the supercomputer can help motivate more students to choose engineering and science as their future careers. The new high performance cluster will enable us to develop new courses on parallel computing, and it will be integrated into a wide range of undergraduate courses in the Computer Science and Computer Engineering programs. To name a few: in the Computer System Concepts (CS 350) and Linux (CS 270) classes, we will develop hands-on projects demonstrating how to build and administer a Linux cluster; in the E-commerce (CS 266), C# (CS 225), Visual Basic (MANG 370), and Database Management (CS 324) classes, vivid examples will show how large amounts of data can be processed in parallel; and since the cluster depends on Ethernet connections, it can readily be employed to illustrate many concepts in Introduction to Networking (CS 263). We expect the new supercomputer to benefit the entire curriculum of the College of Engineering and Sciences at WVU Tech.
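Classroom examples of parallel data processing such as those described above can be kept very small. The following Python sketch (the record format and the normalize function are hypothetical, chosen purely for illustration) processes a list of records in parallel using the standard multiprocessing module:

```python
from multiprocessing import Pool

def normalize(record):
    """Hypothetical per-record computation: scale a value into [0, 1]."""
    name, value = record
    return (name, value / 100.0)

if __name__ == "__main__":
    # A toy data set standing in for a large table of records.
    records = [("a", 25), ("b", 50), ("c", 75), ("d", 100)]
    # Two worker processes each handle a share of the records.
    with Pool(processes=2) as pool:
        results = pool.map(normalize, records)
    print(results)
    # → [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
```

Pool.map distributes the records across worker processes much as MPI distributes work across cluster nodes, which makes this a convenient single-machine warm-up before students move on to MPICH.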
Moreover, all students of the College of Engineering and Sciences may opt to use the supercomputer in their senior design projects or independent study. The cluster computer will also have a significant impact on research at WVU Tech. At present, faculty members of WVU Tech must resort to supercomputers at other institutions, such as the National Center for Supercomputing Applications (NCSA), for heavy-duty computational tasks. Though the proposed supercomputer is not as powerful as those at NCSA, it is sufficient for researchers to debug and test light-duty jobs before a heavy-duty job is submitted to NCSA. The supercomputer is open to all WVU Tech faculty members for their research.

Related Work

As mentioned before, the hardware platform that supports parallel computing is usually called a supercomputer. Since 1992, the top 500 supercomputers in the world have been ranked twice a year [10]. In June 2014, MilkyWay-2 was ranked the most powerful supercomputer; it comprises more than three million cores [5]. Nowadays, parallel computing is an integral part of the Computer Science curriculum at numerous universities worldwide; a few well-established examples can be found in [1-4, 9].

Implementation

In this section we describe our implementation process. First we describe our hardware and software; then we walk through the entire implementation process step by step.

Hardware

The proposed supercomputer employs a cluster architecture and is thus also called a high performance cluster computer. As depicted in Fig. 1, 10 computing nodes are interconnected by Ethernet switches. Each computing node consists of regular components, including two CPUs, a motherboard, and an Ethernet interface card. Compared with a single CPU, the cluster computer is expected to be at least 10 times faster (though ideally a 20-CPU computer ought to achieve a speed-up of 20 times, the speed-up measured in practice is typically 10 to 15 times [6]).
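The gap between the ideal 20-fold speed-up and the 10 to 15 times observed in practice is commonly explained by Amdahl's law. The short Python sketch below illustrates the effect; the serial fractions used are illustrative assumptions, not measurements from this cluster:

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Amdahl's law: the speed-up attainable when a fixed fraction of the
    work is inherently serial and the rest is split across n_processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# With 20 CPUs, even a small serial fraction keeps the speed-up well below 20x.
for f in (0.0, 0.02, 0.05):
    print(f, round(amdahl_speedup(f, 20), 1))
# → 0.0 20.0
# → 0.02 14.5
# → 0.05 10.3
```

A serial fraction of only a few percent, plus communication overhead that this simple model does not capture, is enough to account for a measured speed-up in the 10 to 15 range.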
The cluster architecture shown in Fig. 1 constitutes a highly cost-effective and intelligent solution to implement a supercomputer: the proposed cluster computer (with 20 CPUs in total) requires only the cost of its hardware parts, a fraction of the price of a comparable commercial system. Meanwhile, the cluster computer is scalable: more computing nodes can be incorporated straightforwardly. On the basis of this project, we will actively seek other funding sources to expand the supercomputer.

FIG. 1 ARCHITECTURE OF THE HIGH PERFORMANCE CLUSTER COMPUTER (10 computing nodes connected through Ethernet switches, with an uplink to the Internet)

Software

1) Operating System

All ten servers run Ubuntu LTS Server. Ubuntu was chosen for its stability and flexibility, and because it is Debian-based, its package management is very good. Also, being Linux, it incurs no cost.

2) MPI Software

For networking the cluster nodes together, we use the MPICH implementation of the MPI (Message Passing Interface) standard. MPICH was chosen because it is one of the first and best-maintained implementations of the standard. The MPI configuration we currently use is per-node, meaning that each physical computer hosts a single MPI process. The alternative is per-core mode, where each core hosts a single MPI process, leading, on our hardware, to six processes per node. Per-node mode was chosen to give each process complete access to a single node, so that processes do not have to share memory or disk space with each other.

TABLE I TIMELINE FOR THIS PROJECT
1. Build the 10 computing nodes
2. Install Linux operating system
3. Construct network connections
4. Build the cluster
5. Diagnose the cluster
6. Test the cluster
7. Start to develop a new course

MPICH configures the programs to talk to each other over the network, but requires an external piece of software, called the process manager, to actually start all of the programs. We use the default process manager, named Hydra after the many-headed beast of Greek legend. To launch processes, Hydra itself delegates to SSH, which is configured for keyed login, which means that once a user has
provided their key to the server, they can run programs through SSH without having to enter their password. MPICH also provides the functionality to pass data across the network to a different process so that it may be shared. While the installation is the same on every machine, one machine is required to act as the master, while the remaining machines receive jobs from it. In the end, any output and/or collected data is sent back to the master.

3) Implementation Steps

This project's implementation plan and timeline are presented in Table 1. The project started on July 1. It has the following seven (7) specific tasks.

Task 1: Build the 10 computing nodes. This task is approximately equivalent to constructing 10 regular computers from regular components, including CPUs, motherboards, and RAM.

Task 2: Install the Linux operating system on the 10 computing nodes. In this task, Red Hat Linux 7.1 (which is free of charge) was installed on each computing node.

Task 3: Construct network connections. The 10 computing nodes were interconnected using Ethernet cables and switches.

Task 4: Build the cluster. One computer is designated as the master node and the other nine behave as slave nodes. The latest version of MPICH, an implementation of MPI, the most commonly adopted standard for parallel programming, was downloaded and installed [7].

Task 5: Diagnose the cluster. A few diagnostic tools readily available on the MPICH website were used to diagnose the cluster; hardware and software faults reported by these tools were identified and removed.

Task 6: Test the cluster. The Numerical Aerodynamic Simulation (NAS) parallel benchmarks [11] are applied to test the performance of individual nodes as well as the entire cluster. All the data collected in this task are being documented as benchmark data for future diagnosis and calibration.
We have also started data collection with the Standard Performance Evaluation Corporation (SPEC) parallel benchmarks [12].

Task 7: Develop a new course on parallel programming. We are now planning a new course named Parallel Programming, built around the proposed cluster computer, at the Department of Computer Science and Information Systems of WVU Tech.

Benchmarks

For our testing procedure we have used NASA's NAS (Numerical Aerodynamic Simulation) Parallel Benchmarks (NPB), a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications in the original "pencil-and-paper" specification [11]. A brief description of each program is provided below [11].

LU solver (LU): A simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system, arising from a finite-difference discretization of the 3-D Navier-Stokes equations, by splitting it into block lower and upper triangular systems.

3-D FFT PDE (FT): Contains the computational kernel of a 3-D fast Fourier transform (FFT)-based spectral method. FT performs three one-dimensional (1-D) FFTs, one for each dimension.

Multigrid (MG): Uses a V-cycle multigrid method to compute the solution of the 3-D scalar Poisson equation. The algorithm works continuously on a set of grids ranging between coarse and fine. It tests both short- and long-distance data movement.

Conjugate Gradient (CG): Uses a conjugate gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications, using a matrix with randomly generated locations of entries.
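The core of the CG kernel is an ordinary conjugate gradient solve. The sketch below shows that inner solve in plain Python on a toy 2x2 symmetric positive-definite system; NPB's actual benchmark wraps such a solve inside an eigenvalue estimation over a much larger random sparse matrix, which is not reproduced here:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite matrix A,
    given as a list of row lists. Returns the solution vector x."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x (x starts at zero)
    p = r[:]                      # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:          # squared residual norm small enough
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# Toy SPD system: 4x + y = 1, x + 3y = 2; exact solution is [1/11, 7/11].
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
# x ≈ [0.0909, 0.6364]
```

In the parallel benchmark, the matrix-vector product (the Ap line above) is the step that is distributed across nodes, so it dominates both the computation and the communication being measured.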
Block tridiagonal solver (BT): A simulated CFD application that uses an implicit algorithm to solve the three-dimensional (3-D) compressible Navier-Stokes equations. The finite-difference solution is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y, and z dimensions.

Scalar pentadiagonal solver (SP): A simulated CFD application with a structure similar to BT. The finite-difference solution is based on a Beam-Warming approximate factorization that decouples the x, y, and z dimensions. The resulting system has scalar pentadiagonal bands of linear equations that are solved sequentially along each dimension.

Embarrassingly Parallel (EP): Generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish a reference point for the peak performance of a given platform.

Integer Sort (IS): A large integer sort. This kernel performs a sorting operation that is important in particle-method codes. It tests both integer computation speed and communication performance.

Problem sizes in NPB are predefined and indicated by different classes, as described below:

Class S: small, for test purposes;
Class W: workstation size;
Classes A, B, C: standard test problems, with roughly a 4X size increase from one class to the next;
Classes D, E, F: large test problems, each roughly 16X larger than the previous class.

Results

In this paper we present initial data collected from class C of the NAS Parallel Benchmarks. In Fig. 2 we compare the execution times of the benchmarks CG, EP, FT, LU, and MG while running on an individual node as well as with 2, 4, and 8 nodes, respectively. We encountered verification failures for the benchmarks BT, IS, and SP, and we are currently working to fix this problem. FIG.
2 EXECUTION TIMES (IN SECONDS) FOR BENCHMARKS WHILE RUNNING WITH ONE, TWO, FOUR AND EIGHT NODES, RESPECTIVELY

Conclusions

In this paper we described a smart and cost-effective way to develop a high performance cluster computer to support the undergraduate education program as well as the research of WVU Tech. We are currently collecting data to be documented for future diagnosis and calibration, and we are also developing a new course on parallel programming. The cluster has already been incorporated into some of our classes at the College of Engineering and Sciences. The proposed supercomputer is scalable: more computing nodes can be straightforwardly incorporated into the cluster network without altering the existing nodes. On the basis of this project, we will actively seek other funding
sources to expand our supercomputer. For instance, we will submit proposals to the Major Research Instrumentation (MRI) program and the Improving Undergraduate STEM Education (IUSE: EHR) program of the National Science Foundation. With further funding, we will continuously upgrade the 10-node supercomputer to hundreds of nodes, and in the course of upgrading we will gradually offer access to other institutions. It is our goal that this supercomputer will eventually become a valuable asset for the entire state of West Virginia.

ACKNOWLEDGMENT

This work was supported by a West Virginia Higher Education Policy Commission Instrumentation Grant.

REFERENCES

[1] COMP 633: Parallel Computing, University of North Carolina.
[2] CS525: Parallel Computing, Purdue University.
[3] ECE408/CS483: Applied Parallel Programming, University of Illinois.
[4] INFR11023: Parallel Programming Languages and Systems, University of Edinburgh, UK.
[5] "Intel's Milky Way 2 Is the World's Fastest Computer, New Top Supercomputer Named."
[6] Introduction to Parallel Computing, by Blaise Barney, Lawrence Livermore National Laboratory.
[7] MPICH: High-Performance Portable MPI.
[8] National Center for Supercomputing Applications.
[9] Parallel Programming for Multicore Machines Using OpenMP and MPI, OpenCourseWare, Massachusetts Institute of Technology.
[10] Top 500 Supercomputer Sites.
[11] NAS Parallel Benchmarks, NASA Advanced Supercomputing Division.
[12] Standard Performance Evaluation Corporation (SPEC) benchmarks.
More informationIntel Math Kernel Library
Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra
More informationAPPLICATION OF PARALLEL ARRAYS FOR SEMIAUTOMATIC PARALLELIZATION OF FLOW IN POROUS MEDIA PROBLEM SOLVER
Mathematical Modelling and Analysis 2005. Pages 171 177 Proceedings of the 10 th International Conference MMA2005&CMAM2, Trakai c 2005 Technika ISBN 9986-05-924-0 APPLICATION OF PARALLEL ARRAYS FOR SEMIAUTOMATIC
More informationApplication of Finite Volume Method for Structural Analysis
Application of Finite Volume Method for Structural Analysis Saeed-Reza Sabbagh-Yazdi and Milad Bayatlou Associate Professor, Civil Engineering Department of KNToosi University of Technology, PostGraduate
More informationBenchmarking CPU Performance
Benchmarking CPU Performance Many benchmarks available MHz (cycle speed of processor) MIPS (million instructions per second) Peak FLOPS Whetstone Stresses unoptimized scalar performance, since it is designed
More informationA Software Developing Environment for Earth System Modeling. Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012
A Software Developing Environment for Earth System Modeling Depei Qian Beihang University CScADS Workshop, Snowbird, Utah June 27, 2012 1 Outline Motivation Purpose and Significance Research Contents Technology
More informationAchieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation
Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College
More informationEfficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid
Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid Yu Zhong Martin D. F. Wong Dept. of Electrical and Computer Engineering Dept. of Electrical and Computer Engineering Univ. of
More informationJ. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst
Ali Khajeh-Saeed Software Engineer CD-adapco J. Blair Perot Mechanical Engineering UMASS, Amherst Supercomputers Optimization Stream Benchmark Stag++ (3D Incompressible Flow Code) Matrix Multiply Function
More informationParallel solution for finite element linear systems of. equations on workstation cluster *
Aug. 2009, Volume 6, No.8 (Serial No.57) Journal of Communication and Computer, ISSN 1548-7709, USA Parallel solution for finite element linear systems of equations on workstation cluster * FU Chao-jiang
More informationApplication-Transparent Checkpoint/Restart for MPI Programs over InfiniBand
Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda Network-Based Computing Laboratory Department of Computer Science & Engineering
More informationExploring Hardware Overprovisioning in Power-Constrained, High Performance Computing
Exploring Hardware Overprovisioning in Power-Constrained, High Performance Computing Tapasya Patki 1 David Lowenthal 1 Barry Rountree 2 Martin Schulz 2 Bronis de Supinski 2 1 The University of Arizona
More informationOptimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,
More informationTools and Primitives for High Performance Graph Computation
Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World
More informationDISTRIBUTED VIRTUAL CLUSTER MANAGEMENT SYSTEM
DISTRIBUTED VIRTUAL CLUSTER MANAGEMENT SYSTEM V.V. Korkhov 1,a, S.S. Kobyshev 1, A.B. Degtyarev 1, A. Cubahiro 2, L. Gaspary 3, X. Wang 4, Z. Wu 4 1 Saint Petersburg State University, 7/9 Universitetskaya
More informationSplotch: High Performance Visualization using MPI, OpenMP and CUDA
Splotch: High Performance Visualization using MPI, OpenMP and CUDA Klaus Dolag (Munich University Observatory) Martin Reinecke (MPA, Garching) Claudio Gheller (CSCS, Switzerland), Marzia Rivi (CINECA,
More informationDevelopment of an Integrated Computational Simulation Method for Fluid Driven Structure Movement and Acoustics
Development of an Integrated Computational Simulation Method for Fluid Driven Structure Movement and Acoustics I. Pantle Fachgebiet Strömungsmaschinen Karlsruher Institut für Technologie KIT Motivation
More informationEnergy- Regional Innovation Cluster (E-RIC)
Energy- Regional Innovation Cluster (E-RIC) Dual E-RIC Mission: Reduced energy use in buildings Regional economic development Department of Energy $122 million Economic Development Administration $5 million
More informationExploring unstructured Poisson solvers for FDS
Exploring unstructured Poisson solvers for FDS Dr. Susanne Kilian hhpberlin - Ingenieure für Brandschutz 10245 Berlin - Germany Agenda 1 Discretization of Poisson- Löser 2 Solvers for 3 Numerical Tests
More informationA Local-View Array Library for Partitioned Global Address Space C++ Programs
Lawrence Berkeley National Laboratory A Local-View Array Library for Partitioned Global Address Space C++ Programs Amir Kamil, Yili Zheng, and Katherine Yelick Lawrence Berkeley Lab Berkeley, CA, USA June
More informationD036 Accelerating Reservoir Simulation with GPUs
D036 Accelerating Reservoir Simulation with GPUs K.P. Esler* (Stone Ridge Technology), S. Atan (Marathon Oil Corp.), B. Ramirez (Marathon Oil Corp.) & V. Natoli (Stone Ridge Technology) SUMMARY Over the
More informationModelling and implementation of algorithms in applied mathematics using MPI
Modelling and implementation of algorithms in applied mathematics using MPI Lecture 1: Basics of Parallel Computing G. Rapin Brazil March 2011 Outline 1 Structure of Lecture 2 Introduction 3 Parallel Performance
More informationMSE Comprehensive Exam
MSE Comprehensive Exam The MSE requires a comprehensive examination, which is quite general in nature. It is administered on the sixth Friday of the semester, consists of a written exam in the major area
More informationLarge-scale Gas Turbine Simulations on GPU clusters
Large-scale Gas Turbine Simulations on GPU clusters Tobias Brandvik and Graham Pullan Whittle Laboratory University of Cambridge A large-scale simulation Overview PART I: Turbomachinery PART II: Stencil-based
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationA Case for High Performance Computing with Virtual Machines
A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation
More informationDepartment of Computer Science and Engineering
Department of Computer Science and Engineering 1 Department of Computer Science and Engineering Department Head: Professor Edward Swan Office: 300 Butler Hall The Department of Computer Science and Engineering
More informationCo-array Fortran Performance and Potential: an NPB Experimental Study. Department of Computer Science Rice University
Co-array Fortran Performance and Potential: an NPB Experimental Study Cristian Coarfa Jason Lee Eckhardt Yuri Dotsenko John Mellor-Crummey Department of Computer Science Rice University Parallel Programming
More informationBİL 542 Parallel Computing
BİL 542 Parallel Computing 1 Chapter 1 Parallel Programming 2 Why Use Parallel Computing? Main Reasons: Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion,
More informationEFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI
EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,
More informationBenchmark 1.a Investigate and Understand Designated Lab Techniques The student will investigate and understand designated lab techniques.
I. Course Title Parallel Computing 2 II. Course Description Students study parallel programming and visualization in a variety of contexts with an emphasis on underlying and experimental technologies.
More informationLarge Scale Debugging of Parallel Tasks with AutomaDeD!
International Conference for High Performance Computing, Networking, Storage and Analysis (SC) Seattle, Nov, 0 Large Scale Debugging of Parallel Tasks with AutomaDeD Ignacio Laguna, Saurabh Bagchi Todd
More informationPerformance of Implicit Solver Strategies on GPUs
9. LS-DYNA Forum, Bamberg 2010 IT / Performance Performance of Implicit Solver Strategies on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Abstract: The increasing power of GPUs can be used
More informationDynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection
Numerical Libraries in the DOE ACTS Collection The DOE ACTS Collection SIAM Parallel Processing for Scientific Computing, Savannah, Georgia Feb 15, 2012 Tony Drummond Computational Research Division Lawrence
More informationClusters. Rob Kunz and Justin Watson. Penn State Applied Research Laboratory
Clusters Rob Kunz and Justin Watson Penn State Applied Research Laboratory rfk102@psu.edu Contents Beowulf Cluster History Hardware Elements Networking Software Performance & Scalability Infrastructure
More informationUpdate of Post-K Development Yutaka Ishikawa RIKEN AICS
Update of Post-K Development Yutaka Ishikawa RIKEN AICS 11:20AM 11:40AM, 2 nd of November, 2017 FLAGSHIP2020 Project Missions Building the Japanese national flagship supercomputer, post K, and Developing
More informationTRAFFIC CONTROLLER LABORATORY UPGRADE
TRAFFIC CONTROLLER LABORATORY UPGRADE Final Report KLK206 N06-21 National Institute for Advanced Transportation Technology University of Idaho Ahmed Abdel-Rahim August 2006 DISCLAIMER The contents of this
More informationAdaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics
More informationWhy Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends
Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer
More informationAnalysis of 2D Torus and Hub Topologies of 100Mb/s Ethernet for the Whitney Commodity Computing Testbed 1
Analysis of 2D Torus and Hub Topologies of 1Mb/s Ethernet for the Whitney Commodity Computing Testbed 1 Kevin T. Pedretti and Samuel A. Fineberg NAS Technical Report NAS-97-17 September 1997 MRJ, Inc.
More informationPERFORMANCE PORTABILITY WITH OPENACC. Jeff Larkin, NVIDIA, November 2015
PERFORMANCE PORTABILITY WITH OPENACC Jeff Larkin, NVIDIA, November 2015 TWO TYPES OF PORTABILITY FUNCTIONAL PORTABILITY PERFORMANCE PORTABILITY The ability for a single code to run anywhere. The ability
More informationVIRTUAL NETWORKING LABORATORY FOR EDUCATION IN COMPUTER SCIENCE
INFORMATION TECHNOLOGY IN EDUCATION VIRTUAL NETWORKING LABORATORY FOR EDUCATION IN COMPUTER SCIENCE Jordan H. Kanev 1, Stanimir M. Sadinov 1 1 Technical University of Gabrovo, Gabrovo, Bulgaria Abstract:
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report
ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new
More informationLinux+ Base Pod Installation and Configuration Guide
Linux+ Base Pod Installation and Configuration Guide This document provides detailed guidance on performing the installation and configuration of the Linux+ Base Pod on a NETLAB+ system. The Linux+ Base
More informationQuickGuide for CC, GS, and Barnard CS Students
QuickGuide for CC, GS, and Barnard CS Students (New Requirements Beginning Fall 2013) This QuickGuide is for Columbia College, General Studies, and Barnard students thinking of majoring or concentrating
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More information