An Intelligent and Cost-effective Solution to Implement High Performance Computing

International Journal of Automation and Power Engineering (IJAPE), Volume 5, 2016, doi: /ijape

Afrin Naz *1, Mingyu Lu 2, Joshua Keiffer 3, Benjamin Culkin 4
1,3,4 Computer Science and Information Systems Department, West Virginia University Institute of Technology, Montgomery, WV, USA
2 Electrical and Computer Engineering Department, West Virginia University Institute of Technology, Montgomery, WV, USA
*1 afrin.naz@mail.wvu.edu; 2 mingyu.lu@mail.wvu.edu; 3 jbkeiffer@mix.wvu.edu; 4 bjculkin@mix.wvu.edu

Abstract

In this paper we describe a smart and cost-effective way to develop a high performance cluster computer to support the undergraduate education program as well as the research of West Virginia University Institute of Technology (WVU Tech). The proposed high performance cluster computer will be used primarily, and frequently, to support undergraduate education at WVU Tech. The developed supercomputer will be integrated into a wide range of undergraduate courses in the Computer Science and Computer Engineering programs. We expect the new scalable supercomputer to benefit the entire curriculum of the College of Engineering and Sciences at WVU Tech.

Keywords

Parallel Computing; High Performance Computing; Scalable

Introduction

High performance computing, also termed parallel computing, is a fast-developing field in Computer Science. In high performance computing, one computational task is partitioned into multiple sub-tasks that are executed in parallel. The hardware platform that supports parallel computing is usually called a supercomputer. In this paper we describe a smart and highly cost-effective way to develop a supercomputer (parallel computer) to support the undergraduate education program as well as the research of West Virginia University Institute of Technology (WVU Tech). The proposed supercomputer employs a cluster architecture: 10 computing nodes are interconnected using Ethernet switches, and each computing node consists of regular components, including two CPUs, RAM, a motherboard, and an Ethernet interface card. The proposed cluster architecture constitutes an intelligent and highly cost-effective solution to implement a supercomputer: the proposed cluster computer (with 20 CPUs in total) costs only a fraction of the price of a commercial supercomputer, essentially the cost of its hardware parts.

In this project, a high performance cluster computer is developed at WVU Tech. The developed high performance cluster computer will be used primarily to support the undergraduate education of WVU Tech. We are currently developing a new course on parallel programming, and the students registered in that course will use the cluster computer extensively. Meanwhile, it will create significant synergistic impact on the entire undergraduate curriculum of Computer Science and Computer Engineering at WVU Tech. The proposed cluster computer will be connected to the Internet; all the students and faculty of WVU Tech can apply for access to support their education and research. As a close neighbor of WVU Tech, BridgeValley Community and Technical College is interested in using the cluster computer for its teaching as well. As parallel computing plays critical roles in virtually every science and engineering discipline, the proposed supercomputer is expected to benefit all the undergraduate students of the College of Engineering and Sciences.
The proposed supercomputer is scalable: more computing nodes can be straightforwardly incorporated into the cluster network without altering the existing nodes. On the basis of this project, we will actively seek other funding sources to upgrade the proposed supercomputer. In the course of upgrading, we will gradually offer access to other institutes. It is our goal that this supercomputer will eventually become a valuable asset for the entire state of West Virginia.

Motivation

Founded in 1895, WVU Tech is a nationally recognized institution of about 100 faculty and 1000 students. WVU Tech has been dedicated to offering high-quality undergraduate education to the region centered at Charleston, WV. Over the past 110 years, the institution has supplied a large number of graduates to industry, business, and government agencies. Currently, WVU Tech does not have parallel computing facilities, which prevents our students from having first-hand experience with the fascinating field of parallel computing. The availability of a supercomputer will enable undergraduate students to visualize a parallel computing architecture, learn about parallel programming, and conduct hands-on experiments. Meanwhile, as a visually impressive instrument, the proposed supercomputer can serve the purpose of motivating more students to select engineering and science as their future careers.

This new high performance cluster will enable us to develop new courses on parallel computing. Also, the proposed supercomputer will be integrated into a wide range of undergraduate courses in the Computer Science and Computer Engineering programs. To name a few: in the Computer System Concept (CS 350) and Linux (CS 270) classes, we will develop hands-on projects to demonstrate how to build and administer a Linux cluster; in the E-commerce (CS 266), C# (CS 225), Visual Basic (MANG 370), and Database Management (CS 324) classes, vivid examples will be offered by showing how large amounts of data can be processed in parallel; and since the proposed cluster computer depends on Ethernet connections, it can readily be employed to illustrate many concepts in Introduction to Networking (CS 263). We are hoping that the new supercomputer will benefit the entire curriculum of the College of Engineering and Sciences at WVU Tech. Moreover, all students from the College of Engineering and Sciences may opt to use the supercomputer in their senior design projects or independent study.

The proposed cluster computer will also have significant impact on the research of WVU Tech. At present, faculty of WVU Tech must resort to supercomputers of other institutes, such as the National Center for Supercomputing Applications (NCSA), for heavy-duty computational tasks. Though the proposed supercomputer is not as powerful as those at NCSA, it is sufficient for researchers to debug and test light-duty jobs before a heavy-duty job is submitted to NCSA. The proposed supercomputer is open to all the faculty of WVU Tech for their research.

Related Work

As mentioned before, the hardware platform to support parallel computing is usually called a supercomputer. Since 1992, the top 500 supercomputers in the world have been ranked twice a year [10]. In June 2014, MilkyWay-2 was ranked the most powerful supercomputer; it comprises more than three million cores [5]. Nowadays, parallel computing is an integral part of the Computer Science curriculum in numerous universities worldwide; a few well-established examples can be found in [1-4, 9].

Implementation

In this section we describe our implementation process. First we describe our hardware and software, respectively; then we walk through the entire implementation process step by step.

Hardware

The proposed supercomputer employs a cluster architecture, and thus is also named a high performance cluster computer.
As depicted in Fig. 1, 10 computing nodes are interconnected using Ethernet switches. Each computing node consists of regular components, including two CPUs, RAM, a motherboard, and an Ethernet interface card. Compared with one single CPU, the cluster computer is expected to be at least 10 times faster (though ideally a 20-CPU computer ought to achieve a speed-up of 20 times, the speed-up value measured in practice is typically 10 to 15 times [6]). The cluster architecture shown in Fig. 1 constitutes a highly cost-effective and intelligent solution to implement a supercomputer: the proposed cluster computer (with 20 CPUs in total) costs only a fraction of the price of a commercial supercomputer, essentially the cost of its hardware parts. Meanwhile, the proposed cluster computer is scalable: more computing nodes can be incorporated straightforwardly. On the basis of this project, we will actively seek other funding sources to expand the proposed supercomputer.
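The gap between the ideal and the measured speed-up can be illustrated with Amdahl's law; the serial fraction used below is only an assumed, illustrative number, not a figure reported for this cluster:

    S(p) = \frac{1}{\,s + (1 - s)/p\,}

For example, with p = 20 CPUs and an assumed serial fraction s = 0.05, S(20) = 1/(0.05 + 0.95/20) ≈ 10.3, which falls within the 10 to 15 times range quoted above.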

FIG. 1 ARCHITECTURE OF THE HIGH PERFORMANCE CLUSTER COMPUTER (Nodes 1 through 10 are connected by Ethernet cables to Ethernet switches, which in turn connect to the Internet.)

Software

1) Operating System

All ten servers are running Ubuntu LTS Server. Ubuntu was chosen for its stability and flexibility, and because it is Debian-based its package management is very good. Also, being Linux, it involves no licensing cost.

2) MPI Software

For networking the cluster nodes together, we use the MPICH implementation of the MPI (Message Passing Interface) standard. The MPICH implementation was chosen because it is one of the first and most well-maintained implementations of the standard. The MPI configuration we are currently using is Per-Node, which means that each physical computer hosts a single MPI process. The alternative is Per-Core mode, where each core hosts a single MPI process, leading to, on our hardware, six processes per node. Per-Node mode was chosen to give each process complete access to a single node, so that processes do not have to share memory or disk space with one another.
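As a minimal sketch of the Per-Node configuration described above (the machine file name and node hostnames are hypothetical, not taken from the actual cluster), the machine file lists each node exactly once, so that mpiexec starts one MPI process per node:

/* hello_pernode.c -- compile with: mpicc hello_pernode.c -o hello_pernode
 * Per-Node launch (hypothetical machine file "hosts" listing node01 ... node10,
 * one hostname per line):  mpiexec -f hosts -n 10 ./hello_pernode
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);               /* one process is started on each node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* 0 .. size-1; rank 0 acts as the master */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* 10 when launched with -n 10 */
    MPI_Get_processor_name(host, &len);

    printf("Process %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}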

MPICH handles configuring the programs to talk to each other over the network, but it requires an external piece of software, called the process manager, to actually start all of the programs. We use the default process manager, named Hydra in analogy to the many-headed beast of Greek legend. To launch processes, Hydra itself delegates to SSH, which is configured for keyed login: once a user has provided their key to the server, they can run programs through SSH without having to enter their password.

MPICH also provides the functionality to pass data across the network to a different process so that it may be shared. While the installation is the same on every machine, one machine is required to act as the master, while the remaining machines receive jobs from it. In the end, any output and/or collected data is sent back to the master.
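The following is a hedged sketch of that master/worker data flow (our own illustration, not code from the project): every process computes a partial sum over its share of the indices, and the combined result is delivered to the master, rank 0.

/* partial_sum.c -- each process sums part of 1..N; the master (rank 0) collects the total.
 * Hypothetical build and launch: mpicc partial_sum.c -o partial_sum
 *                                mpiexec -f hosts -n 10 ./partial_sum
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    const long long N = 1000000;
    long long i, local = 0, total = 0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank handles the indices i with (i - 1) % size == rank. */
    for (i = rank + 1; i <= N; i += size)
        local += i;

    /* All partial sums are combined and the result is delivered to the master. */
    MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of 1..%lld = %lld\n", N, total);

    MPI_Finalize();
    return 0;
}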
3) Implementation Steps

This project's implementation plan and timeline are presented in Table 1. The project started on July 1 and has the following seven (7) specific tasks.

TABLE I TIMELINE FOR THIS PROJECT
1. Build the 10 computing nodes
2. Install Linux operating system
3. Construct network connections
4. Build the cluster
5. Diagnose the cluster
6. Test the cluster
7. Start to develop a new course

Task 1: Build the 10 computing nodes. This task is approximately equivalent to constructing 10 regular computers using regular components, including CPUs, motherboards, RAM, and Ethernet interface cards.

Task 2: Install the Linux operating system on the 10 computing nodes. In this task, Red Hat Linux 7.1 (which is free of charge) was installed on each computing node.

Task 3: Construct network connections. The 10 computing nodes were interconnected using Ethernet cables and switches.

Task 4: Build the cluster. One computer is designated as the master node and the other nine nodes behave as slave nodes. The latest version of MPICH, an implementation of MPI, the most commonly adopted standard for parallel programming, was downloaded and installed [7].

Task 5: Diagnose the cluster. A few diagnostic tools readily available on the MPICH website were used to diagnose the cluster; hardware and software faults reported by these tools were identified and removed.

Task 6: Test the cluster. The Numerical Aerodynamic Simulation (NAS) parallel benchmarks [11] are applied to test the performance of individual nodes as well as the entire cluster. All the data collected in this task are being documented as benchmark data for the purpose of future diagnosis and calibration. We have also started data collection with the Standard Performance Evaluation Corporation (SPEC) parallel benchmarks [12].

Task 7: Develop a new course on parallel programming. We are now planning to develop a new course named Parallel Programming, using the proposed cluster computer, in the Department of Computer Science and Information Systems of WVU Tech.

Benchmarks

For our testing procedure we have used the NAS (Numerical Aerodynamic Simulation) Parallel Benchmarks (NPB) of NASA, a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks are derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications in the original "pencil-and-paper" specification [11]. A brief description of each program is provided below [11].

LU solver (LU): This benchmark is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system resulting from finite-difference discretization of the Navier-Stokes equations in 3-D, by splitting it into block Lower and Upper triangular systems.

3D FFT PDE (FT): This benchmark contains the computational kernel of a 3-D fast Fourier Transform (FFT)-based spectral method. FT performs three one-dimensional (1-D) FFTs, one for each dimension.

Multigrid (MG): This benchmark uses a V-cycle multigrid method to compute the solution of the 3-D scalar Poisson equation. The algorithm cycles repeatedly over a set of grids ranging from coarse to fine. It tests both short- and long-distance data movement.

Conjugate Gradient (CG): This benchmark uses a Conjugate Gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications by using a matrix with randomly generated locations of entries.

Block tridiagonal solver (BT): This benchmark is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional (3-D) compressible Navier-Stokes equations. The finite-difference solution of the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions.

Pentadiagonal solver (SP): This is a simulated CFD application that has a similar structure to BT. The finite-difference solution of the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions. The resulting system has scalar pentadiagonal bands of linear equations that are solved sequentially along each dimension.

Embarrassingly Parallel (EP): This benchmark generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish the reference point for peak performance of a given platform.

Integer Sort (IS): A large integer sort. This kernel performs a sorting operation that is important in particle-method codes. It tests both integer computation speed and communication performance.

Problem sizes in NPB are predefined and indicated as different classes, as described below:
Class S: small, for test purposes;
Class W: workstation size;
Classes A, B, C: standard test problems, with roughly a 4X size increase going from one class to the next;
Classes D, E, F: large test problems, roughly 16X larger than the previous classes.

Results

In this paper we present initial data collected from Class C of the NAS (Numerical Aerodynamic Simulation) Parallel Benchmarks. In Fig. 2 we compare the execution times for the benchmarks CG, EP, FT, LU and MG while running on an individual node as well as with 2, 4 and 8 nodes, respectively. We encountered verification failures for the benchmarks BT, IS and SP and are currently working to fix this verification problem.

FIG. 2 EXECUTION TIMES (IN SECONDS) FOR THE BENCHMARKS WHILE RUNNING WITH ONE, TWO, FOUR AND EIGHT NODES, RESPECTIVELY
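To interpret measurements like those in Fig. 2, speed-up and parallel efficiency follow directly from the single-node and multi-node run times. The sketch below uses hypothetical placeholder times, not our measured data:

/* speedup.c -- compute speed-up and efficiency from benchmark run times.
 * The times below are hypothetical placeholders, not the measured Fig. 2 data.
 */
#include <stdio.h>

int main(void)
{
    const int    nodes[]  = {1, 2, 4, 8};
    const double time_s[] = {400.0, 210.0, 115.0, 65.0};  /* hypothetical Class C times */
    const int n = sizeof(nodes) / sizeof(nodes[0]);

    for (int i = 0; i < n; i++) {
        double speedup    = time_s[0] / time_s[i];  /* T(1) / T(p) */
        double efficiency = speedup / nodes[i];     /* ideal value is 1.0 */
        printf("%d node(s): time = %6.1f s, speed-up = %4.2f, efficiency = %4.2f\n",
               nodes[i], time_s[i], speedup, efficiency);
    }
    return 0;
}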

Conclusions

In this paper we describe a smart and cost-effective way to develop a high performance cluster computer to support the undergraduate education program as well as the research of WVU Tech. Currently we are collecting data to be documented for the purpose of future diagnosis and calibration. We are also developing a new course on parallel programming, and the cluster has already been incorporated into some of our classes at the College of Engineering.

The proposed supercomputer is scalable: more computing nodes can be straightforwardly incorporated into the cluster network without altering the existing nodes. On the basis of this project, we will actively seek other funding sources to expand our supercomputer. For instance, we will submit proposals to the Major Research Instrumentation (MRI) program and the Improving Undergraduate STEM Education (IUSE: EHR) program of the National Science Foundation. With further funding, we will continuously upgrade the 10-node supercomputer to hundreds of nodes. In the course of upgrading, we will gradually offer access to other institutes. It is our goal that this supercomputer will eventually become a valuable asset for the entire state of West Virginia.

ACKNOWLEDGMENT

This work was supported by a West Virginia Higher Education Policy Commission Instrumentation Grant.

REFERENCES

[1] COMP 633: Parallel Computing, University of North Carolina.
[2] CS525: Parallel Computing, Purdue University.
[3] ECE408/CS483: Applied Parallel Programming, University of Illinois.
[4] INFR11023: Parallel Programming Languages and Systems, University of Edinburgh, UK.
[5] "Intel's Milky Way 2 Is the World's Fastest Computer, New Top Supercomputer Named."
[6] Blaise Barney, "Introduction to Parallel Computing," Lawrence Livermore National Laboratory.
[7] MPICH: High-Performance Portable MPI.
[8] National Center for Supercomputing Applications (NCSA).
[9] "Parallel Programming for Multicore Machines Using OpenMP and MPI," OpenCourseWare, Massachusetts Institute of Technology.
[10] Top 500 Supercomputer Sites.
[11] NAS Parallel Benchmarks, NASA Advanced Supercomputing Division.
[12] Standard Performance Evaluation Corporation (SPEC) benchmarks.
