Heterogeneous Multi-Computer System: A New Platform for Multi-Paradigm Scientific Simulation


Heterogeneous Multi-Computer System: A New Platform for Multi-Paradigm Scientific Simulation
Taisuke Boku, Hajime Susa, Masayuki Umemura, Akira Ukawa (Center for Computational Physics, University of Tsukuba)
Junichiro Makino, Toshiyuki Fukushige (Department of Astronomy, University of Tokyo)

Outline
- Background
- Concept and design of the HMCS prototype
- Implementation of the prototype
- Performance evaluation
- Computational physics result
- Variations of HMCS
- Conclusions

Background
Requirements on platforms for next-generation large-scale scientific simulation:
- More powerful computation
- Larger memory capacity and wider network bandwidth
- Higher speed and wider bandwidth of I/O
- High-speed external networking interface
Is that enough? How about the quality of the simulation?
-> Multi-scale or multi-paradigm simulation

Multi-Scale Physics Simulation
- Various levels of interaction: Newtonian dynamics, electromagnetic interaction, quantum dynamics; microscopic and macroscopic interactions
- Differences in computational order (see the cost comparison below):
  O(N^2): e.g. N-body
  O(N log N): e.g. FFT
  O(N): e.g. straightforward CFD
- Combining these simulations realizes multi-scale or multi-paradigm computational physics
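As a rough illustration of why the combination matters (the per-element cost constants c_grav, c_fft, c_cfd below are illustrative, not from the slides), the per-step costs scale as

  T_{\mathrm{grav}} \sim c_{\mathrm{grav}} N^2, \qquad T_{\mathrm{FFT}} \sim c_{\mathrm{fft}} N \log N, \qquad T_{\mathrm{CFD}} \sim c_{\mathrm{cfd}} N,

so for large N the O(N^2) gravity part dominates the total time. This motivates offloading it to special-purpose hardware while the O(N) continuum part stays on the general-purpose MPP.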

HMCS: Heterogeneous Multi-Computer System
- Combines particle simulation (e.g. gravitational interaction) and continuum simulation (e.g. SPH) on one platform
- Combines a general-purpose processor (flexibility) and a special-purpose processor (high speed)
- Connects a general-purpose MPP and a special-purpose MPP via a high-throughput network
- Exchanges particle data at every time step
- Prototype system: CP-PACS + GRAPE-6 (JSPS Research for the Future project "Computational Science and Engineering")

Block Diagram of HMCS
[Diagram] Components:
- MPP for continuum simulation: CP-PACS
- MPP for particle simulation: GRAPE-6, attached via 32-bit PCI to the hybrid system communication cluster (Compaq Alpha)
- Parallel I/O system PAVEMENT/PIO with parallel file server (SGI Origin2000)
- Parallel visualization system PAVEMENT/VIZ with parallel visualization server (SGI Onyx2)
- All platforms connected through 100base-TX switches

CP-PACS
- 2048 pseudo-vector processors with 300 Mflops peak performance each -> 614.4 Gflops peak (arithmetic below)
- 128 I/O nodes with the same performance
- Interconnection network: 3-D hyper-crossbar (300 MB/s per link)
- Platform for general-purpose scientific computation
- 100base-TX NICs on 16 IOUs for outside communication
- Partitioning is available (any partition can access any IOU)
- Manufactured by Hitachi
- In operation since April 1996 with 1024 PUs, since October 1996 with 2048 PUs
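The peak figure follows directly from the node count and the per-node peak:

  2048 \times 300\ \mathrm{Mflops} = 614{,}400\ \mathrm{Mflops} = 614.4\ \mathrm{Gflops}.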

CP-PACS (Center for Computational Physics) [photo slide]

GRAPE-6
- The 6th generation of the GRAPE (Gravity Pipe) project
- Gravity calculation for many particles at 31 Gflops per chip
- 32 chips per board -> 0.99 Tflops per board
- The full system of 64 boards (63 Tflops) is under construction
- On each board, the data of all particles (j-particles) are stored in SRAM; each target particle (i-particle) is injected into the pipeline and its acceleration is computed (see the force formula below)
- Gordon Bell Prize at SC01 (Denver)
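For reference, the pairwise interaction evaluated by a GRAPE-type gravity pipeline is the softened Newtonian acceleration (the Plummer-softened form shown here is the usual choice and is our assumption about the exact pipeline formula):

  \vec{a}_i = \sum_j \frac{G\, m_j\, (\vec{r}_j - \vec{r}_i)}{\left( |\vec{r}_j - \vec{r}_i|^2 + \epsilon^2 \right)^{3/2}},

where the softening parameter \epsilon should correspond to the eps2 argument of the g6cpplib interface described later. Evaluating this sum for all i costs O(N^2) operations, which is exactly the part the special-purpose hardware accelerates. The quoted peaks are consistent:

  32 \times 31\ \mathrm{Gflops} = 992\ \mathrm{Gflops} \approx 0.99\ \mathrm{Tflops\ per\ board}, \qquad 64 \times 0.99\ \mathrm{Tflops} \approx 63\ \mathrm{Tflops}.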

GRAPE-6 (University of Tokyo) [photo slide]: GRAPE-6 board (32 chips); 8 boards / 4 systems

GRAPE-6 (cont'd) [photo slide]: daughter card module (4 chips per module), top and bottom views

Host Computer for GRAPE-6
- GRAPE-6 is not a stand-alone system; a host computer is required
- Alpha-based PC (Intel x86 and AMD Athlon are also available)
- Connected to the GRAPE-6 board via a 32-bit PCI interface card
- A host computer can handle several GRAPE-6 boards
- A single host computer cannot handle an enormous number of particles in a complicated calculation

Hyades (Alpha-based Cluster)
- 16-node cluster with Alpha 21264A (600 MHz), Samsung UP1100 (single-CPU) boards
- 768 MB memory per node
- Dual 100base-TX NICs
- 8 nodes are equipped with GRAPE-6 PCI cards; the 8 GRAPE-6 boards work cooperatively under MPI programming (a sketch follows below)
- One of the 100base-TX NICs is connected to CP-PACS via PIO (Parallel I/O System)
- RedHat Linux 6.2 (kernel 2.2.16)
- Operated as the data-exchange and control system connecting CP-PACS and GRAPE-6
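A minimal sketch of how such a host cluster might drive the boards cooperatively under MPI: each rank takes one block of target particles, computes their accelerations, and the blocks are gathered on all ranks. The function compute_block_gravity() is a plain-CPU stand-in for the per-board GRAPE-6 call (the real board interface is not shown in the slides); all names and the work split are illustrative assumptions, not the actual Hyades code.

/* Illustrative sketch: distribute i-particles over MPI ranks, compute
 * softened gravity (G = 1 units) for each local block, gather results.
 * compute_block_gravity() is a CPU stand-in for a per-board GRAPE-6 call. */
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

#define N    1024        /* total number of particles (illustrative) */
#define EPS2 1.0e-4      /* softening parameter squared (illustrative) */

static void compute_block_gravity(int i0, int ni, const double pos[][3],
                                  const double *mass, double acc[][3])
{
    for (int i = i0; i < i0 + ni; i++) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        for (int j = 0; j < N; j++) {
            double dx = pos[j][0] - pos[i][0];
            double dy = pos[j][1] - pos[i][1];
            double dz = pos[j][2] - pos[i][2];
            double r2 = dx * dx + dy * dy + dz * dz + EPS2;
            double rinv3 = 1.0 / (r2 * sqrt(r2));
            ax += mass[j] * dx * rinv3;
            ay += mass[j] * dy * rinv3;
            az += mass[j] * dz * rinv3;
        }
        acc[i - i0][0] = ax; acc[i - i0][1] = ay; acc[i - i0][2] = az;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double pos[N][3], mass[N];       /* replicated j-particle data */
    for (int i = 0; i < N; i++) {           /* dummy data for the sketch  */
        pos[i][0] = i; pos[i][1] = 2.0 * i; pos[i][2] = 3.0 * i;
        mass[i] = 1.0 / N;
    }

    int ni = N / size;                      /* assume size divides N */
    double (*acc_local)[3] = malloc(ni * sizeof *acc_local);
    static double acc[N][3];

    compute_block_gravity(rank * ni, ni, (const double (*)[3])pos,
                          mass, acc_local);

    /* collect every rank's block of accelerations on all ranks */
    MPI_Allgather(acc_local, 3 * ni, MPI_DOUBLE,
                  acc, 3 * ni, MPI_DOUBLE, MPI_COMM_WORLD);

    free(acc_local);
    MPI_Finalize();
    return 0;
}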

GRAPE-6 & Hyades [photo slide]: connection between GRAPE-6 and Hyades

PAVEMENT/PIO: Parallel I/O and Visualization Environment
- Connects multiple parallel processing platforms with a commodity-based parallel network
- Automatic and dynamic load balancing to exploit spatial parallelism in applications
- Utilizes the multiple I/O processors of the MPP so that communication does not become a bottleneck
- Provides an easy-to-program API with various operation modes (user-oriented, static or dynamic load balancing)

[Diagram: example PIO configuration. On the MPP side (e.g. CP-PACS), calculation processors run the user processes and I/O processors run PIO servers; on the DSM system, SMP, or cluster side, PIO servers and user processes (threads) connect to the MPP through a switch.]

HMCS Prototype
[Diagram: the massively parallel processor CP-PACS (2048 PUs, 128 IOUs) is connected, with 8 links on each side, through two switching HUBs (parallel 100Base-TX Ethernet) to the parallel file server SGI Origin-2000 (8 processors), the parallel visualization server SGI Onyx2 (4 processors), and GRAPE-6 & Hyades (16 nodes, 8 boards).]

SPH (Smoothed Particle Hydrodynamics)
Representing the material as a collection of particles, the density is estimated as

  \rho(\vec{r}_i) = \sum_j m_j \, W(\vec{r}_i - \vec{r}_j),

where W is the kernel function.
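As a concrete illustration of this estimate, the sketch below sums kernel-weighted particle masses to obtain the density at each particle. The cubic-spline kernel and the single fixed smoothing length h are assumptions made for the example; the slides only state that W is the kernel function.

/* Illustrative SPH density estimate: rho_i = sum_j m_j W(|r_i - r_j|, h).
 * Uses the common cubic-spline kernel (an assumption) and one fixed h. */
#include <math.h>

/* 3-D cubic-spline kernel, normalized so its integral over space is 1 */
static double kernel_w(double r, double h)
{
    double q = r / h;
    double sigma = 1.0 / (3.14159265358979323846 * h * h * h);
    if (q < 1.0)
        return sigma * (1.0 - 1.5 * q * q + 0.75 * q * q * q);
    else if (q < 2.0)
        return sigma * 0.25 * (2.0 - q) * (2.0 - q) * (2.0 - q);
    return 0.0;
}

/* Direct O(N^2) density summation over all particle pairs */
void sph_density(int n, const double pos[][3], const double *mass,
                 double h, double *rho)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            double dx = pos[i][0] - pos[j][0];
            double dy = pos[i][1] - pos[j][1];
            double dz = pos[i][2] - pos[j][2];
            double r = sqrt(dx * dx + dy * dy + dz * dz);
            sum += mass[j] * kernel_w(r, h);   /* includes the j == i term */
        }
        rho[i] = sum;
    }
}

In a production code the inner loop would run only over neighbours within the kernel support (2h); the direct double loop keeps the sketch short.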

RT (Radiative Transfer) for SPH
- Accurate calculation of the optical depth along light paths is required.
- Use the method of Kessel-Deynet & Burkert (2000): the optical depth from the source S to the target T is accumulated by trapezoidal integration over evaluation points E_i along the ray,

  \tau_{TS} = \sigma \sum_i \frac{n_{E_i} + n_{E_{i+1}}}{2} \left( s_{E_{i+1}} - s_{E_i} \right),

where n_{E_i} is the SPH-interpolated density at evaluation point E_i, s_{E_i} its position along the ray, and \sigma the cross section.
[Figure: a ray from source S to target T sampled at evaluation points E1-E5, with nearby SPH particles P1-P5.]
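A minimal sketch of this trapezoidal accumulation, assuming the densities n[i] and path positions s[i] at the evaluation points have already been obtained by SPH interpolation (the function and variable names are illustrative):

/* Trapezoidal optical depth along one ray, following the formula above:
 * tau = sigma * sum_i 0.5 * (n[i] + n[i+1]) * (s[i+1] - s[i]).
 * n[] and s[] hold SPH-interpolated densities and path positions at the
 * m evaluation points E_0 .. E_{m-1}. */
double optical_depth(int m, const double *n, const double *s, double sigma)
{
    double tau = 0.0;
    for (int i = 0; i + 1 < m; i++)
        tau += 0.5 * (n[i] + n[i + 1]) * (s[i + 1] - s[i]);
    return sigma * tau;
}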

SPH Algorithm with Self-Gravity Interaction
[Flow diagram] In each iteration, GRAPE-6 performs the O(N^2) gravity calculation, while CP-PACS performs the O(N) continuum part: SPH (density), radiation transfer, chemistry, and temperature (iterated), followed by the pressure calculation and Newtonian dynamics. Particle data are exchanged between the two machines with O(N) communication.

g6cpplib: CP-PACS API
- g6cpp_start(myid, nio, mode, error)
- g6cpp_unit(n, t_unit, x_unit, eps2, error)
- g6cpp_calc(mass, r, f_old, phi_old, error)
- g6cpp_wait(acc, pot, error)
- g6cpp_end(error)
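A sketch of a plausible per-run call order is given below. Only the routine names and argument lists come from the slide; the argument types, the separation of a one-time setup from a per-step calc/wait pair, and the overlap of CP-PACS work between g6cpp_calc and g6cpp_wait are assumptions made so the sequence can be written as compilable C.

/* Call-order sketch for g6cpplib.  Argument TYPES are assumptions; only the
 * names and argument lists come from the slides.  Linking needs the library. */
void g6cpp_start(int myid, int nio, int mode, int *error);
void g6cpp_unit(int n, double t_unit, double x_unit, double eps2, int *error);
void g6cpp_calc(double *mass, double (*r)[3], double (*f_old)[3],
                double *phi_old, int *error);
void g6cpp_wait(double (*acc)[3], double *pot, int *error);
void g6cpp_end(int *error);

void run_gravity(int myid, int nio, int nsteps, int n, double *mass,
                 double (*r)[3], double (*f_old)[3], double *phi_old,
                 double (*acc)[3], double *pot)
{
    int error = 0;
    g6cpp_start(myid, nio, /*mode=*/0, &error);   /* open the connection once */
    g6cpp_unit(n, 1.0, 1.0, 1.0e-4, &error);      /* units and softening eps2 */
    for (int step = 0; step < nsteps; step++) {
        g6cpp_calc(mass, r, f_old, phi_old, &error);  /* send particles, start G6 */
        /* ...CP-PACS can work on the SPH / RT part here while GRAPE-6 runs... */
        g6cpp_wait(acc, pot, &error);                 /* receive acc and potential */
        /* ...update r, f_old, phi_old from acc/pot for the next step... */
    }
    g6cpp_end(&error);                            /* close the connection */
}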

Performance (raw GRAPE-6 cluster)
GRAPE-6 cluster performance with dummy data (without real RT-SPH); 4 GRAPE-6 boards, 128K particles:

  process                       time (sec)
  particle data transfer        1.177
  all-to-all data circulation   0.746
  set-up data in SRAM           0.510
  N-body computation            0.435
  result return                 0.085

Processing time for 1 iteration = 3.24 sec (total)

Scalability with problem size (sec); number of particles N = 2^n, #P = 512:

  process                      n=15     n=16     n=17
  data transfer                5.613    10.090   17.998
  all-to-all circulation       0.309    0.476    0.681
  set data to SRAM             0.231    0.362    0.628
  calculation (incl. RT-SPH)   0.064    0.169    0.504
  TOTAL                        6.217    11.097   19.811
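As a quick check of the scaling, the dominant data-transfer time grows almost linearly with the particle count:

  \frac{10.090}{5.613} \approx 1.80, \qquad \frac{17.998}{10.090} \approx 1.78,

i.e. roughly a factor of 2 for each doubling of N, as expected for an O(N) data volume.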

Scalability with number of PUs; number of particles N = 2^17:

  process                      #P=512   #P=1024
  data transfer                17.998   10.594
  all-to-all circulation       0.681    0.639
  set data to SRAM             0.628    0.609
  calculation (incl. RT-SPH)   0.504    0.503
  TOTAL                        19.811   12.345
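Doubling the CP-PACS partition from 512 to 1024 PUs improves the total iteration time by

  \frac{19.811}{12.345} \approx 1.60,

with essentially all of the gain coming from the data-transfer phase (17.998 s -> 10.594 s); the GRAPE-6-side phases are nearly unchanged and therefore bound further scaling in an Amdahl-like way.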

Example of Physics Results (64K SPH particles + 64K dark matter particles)

Various implementation methods of HMCS
- HMCS-L (Local): same as the current prototype; simple, but the system is closed
- HMCS-R (Remote): remote access to a GRAPE-6 server through a network (LAN or WAN = Grid); the GRAPE-6 cluster is utilized in a time-sharing manner as a gravity server
- HMCS-E (Embedded): enhanced HMCS-L in which each node of an MPP (or large-scale cluster) is equipped with GRAPE chips, combining the wide network bandwidth of the MPP (or cluster) with powerful node processing

HMCS-R on Grid
[Diagram: multiple general client computers access a GRAPE + host server over a high-speed network.]
- Remote access to the GRAPE-6 server via the g6cpp API
- No persistency of particle data -> suitable for the Grid
- O(N^2) of calculation with only O(N) of data transferred (see the note below)
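This is why the remote model stays attractive despite network latency: the communication-to-computation ratio shrinks with problem size,

  \frac{t_{\mathrm{comm}}}{t_{\mathrm{comp}}} \propto \frac{N}{N^2} = \frac{1}{N},

so for large N the wide-area transfer cost is amortized over the O(N^2) gravity work done at the server.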

HMCS-E (Embedded)
[Diagram: each node contains a general-purpose processor (G-P), memory (M), a NIC, and a special-purpose processor (S-P); the nodes are connected through a high-speed network switch.]
- Local communication between general-purpose and special-purpose processors
- Utilizes the wide bandwidth of a large-scale network
- An ideal fusion of flexibility and high performance

Conclusions
- HMCS: a platform for multi-scale scientific simulation, combining a general-purpose MPP (CP-PACS) and a special-purpose MPP (GRAPE-6) with a parallel network under the PAVEMENT/PIO middleware
- SPH + radiation transfer with gravitational interaction enables detailed simulation of galaxy formation
- A real 128K-particle simulation with 1024 PUs of CP-PACS opens a new epoch of simulation
- Next steps: HMCS-R and HMCS-E