
Programming for Fujitsu Supercomputers
Koh Hotta, Next Generation Technical Computing Unit, Fujitsu Limited

For programmers who are busy with their own research, Fujitsu provides environments for parallel programming and for tuning parallel programs.

System Software Stack

User/ISV applications sit on top, accessed through the HPC Portal / System Management Portal. Below them, the Technical Computing Suite provides:

- System operations management: system configuration management, system control, system monitoring, system installation & operation
- Job operations management: job manager, job scheduler, resource management, parallel execution environment
- High-performance file system: Lustre-based distributed file system with high scalability, I/O bandwidth guarantee, and high reliability & availability
- VISIMPACT: shared L2 cache on a chip, hardware intra-processor synchronization
- Compilers: hybrid parallel programming, sector cache support, SIMD / register file extensions
- Support tools: IDE, profiler & tuning tools, interactive debugger
- MPI library: scalability of the high-function barrier communication

The stack runs on a Linux-based enhanced operating system on the supercomputer (PRIMEHPC FX10) and on Red Hat Enterprise Linux on the PC cluster (PRIMERGY).

HPC-ACE Architecture & Compiler

- Expanded registers reduce the wait time of instructions
- SIMD instructions reduce the number of instructions
- Sector cache improves cache efficiency
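As an illustration (not from the slides), here is a simple kernel of the kind these features target; Fujitsu's vectorizing compiler can map such a loop onto HPC-ACE's 2-wide double-precision SIMD, roughly halving the instructions committed, while the expanded register file gives the scheduler room to hide instruction latency:

```c
/* A stride-1 loop with independent iterations: the kind of kernel that
 * SIMD instructions and the expanded register file accelerate.  With
 * 2-wide SIMD, two double-precision elements are processed per
 * instruction, roughly halving the instructions committed. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}
```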

Compiler Gives Life to HPC-ACE

NPB3.3 LU: operation wait time is dramatically reduced; efficient use of the expanded registers reduces operation wait.
NPB3.3 MG: the number of instructions is halved; the SIMD implementation reduces instructions committed.

[Figure: execution-time comparison (relative values, FX1 vs. PRIMEHPC FX10) for NPB3.3 LU and MG, broken down into memory wait, operation wait, cache misses, and instructions committed.]

Parallel Programming for Busy Researchers

Large-scale systems demand a large degree of parallelism, but a large number of processes needs large memory and incurs large overhead, so hybrid thread-process programming is needed to reduce the number of processes. Hybrid parallel programming, however, is annoying for programmers. Even for multi-threading, the coarser the grain, the better: procedure-level or outer-loop parallelism is desired (as sketched below), but there is little opportunity for such coarse-grain parallelism, so system support for fine-grain parallelism is required. VISIMPACT solves these problems.
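A hypothetical sketch of the coarse-grain style meant here (the kernel and names are illustrative, not from the talk): parallelizing the outer loop forks threads once per call, so synchronization is rare and each thread gets a large block of work:

```c
/* Coarse-grain threading: parallelize the OUTER loop, so each thread
 * receives whole rows of the grid and the fork/join cost is paid once. */
void smooth(int nx, int ny, const double *in, double *out)
{
    #pragma omp parallel for
    for (int j = 1; j < ny - 1; j++)        /* outer loop: coarse grain */
        for (int i = 1; i < nx - 1; i++)    /* inner loop stays serial per thread */
            out[j * nx + i] = 0.25 * (in[j * nx + i - 1] + in[j * nx + i + 1]
                                    + in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
}
```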

VISIMPACT (Virtual Single Processor by Integrated Multi-core Parallel Architecture)

A mechanism that treats multiple cores as one high-speed CPU through automatic parallelization. Just program and compile, and enjoy high speed: you need not think about hybrid parallelism.

[Figure: each process runs on one CPU, spanning its cores and shared L2 cache through inter-core thread-parallel processing.]

VISIMPACT technologies that make multiple cores act as a single high-speed CPU:

- Shared L2 cache memory, to avoid false sharing
- Inter-core hardware barrier facilities, to reduce the overhead of thread synchronization

Thread-parallel programs use these technologies. The hardware barrier synchronization is about 10 times faster than that of conventional systems.

[Figure: timeline of threads 1..N, one per core, meeting at barrier synchronization points.]

Hybrid-Parallel Programming: Threads & Processes

- Thread parallelism within a chip: automatic parallelization or explicit parallel description using the OpenMP API; VISIMPACT enables low-overhead thread execution.
- Process parallelism beyond a CPU: MPI programming; collective communication is tuned using the Tofu barrier facility.
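A minimal hybrid sketch, assuming one MPI process per chip with OpenMP threads inside it (the summation kernel is illustrative, not from the talk):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the main thread calls MPI; threads stay inside the chip */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double partial = 0.0;
    #pragma omp parallel for reduction(+:partial)   /* thread parallel in a chip */
    for (int i = 0; i < 1000000; i++)
        partial += 1.0 / (1.0 + i);

    double total;                                   /* process parallel beyond the CPU */
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}
```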

MPI Software Stack

Original Open MPI software stack (using the openib BTL):
- MPI
- PML ob1 (Point-to-Point Messaging Layer)
- BML r2 (BTL Management Layer)
- BTL openib (Byte Transfer Layer), over OpenFabrics Verbs
- COLL (collective components)

Adaptation to Tofu (the extensions are hardware-dependent layers for the Tofu interconnect):
- tofu BTL plus an LLP (Low Latency Path) extension
- tofu common: provides common data processing and structures for BTL, LLP, and COLL
- COLL: special Bcast, Allgather, Alltoall, and Allreduce supported for Tofu
- Rendezvous protocol optimization, etc.
- All built on the Tofu Library

Customized MPI Library for High Scalability

Point-to-point communication: two transfer methods are provided; the method is selected by data length, process location, and number of hops.

Collective communication: Barrier, Allreduce, Bcast, and Reduce use the Tofu-barrier facility; Bcast, Allgather, Allgatherv, Allreduce, and Alltoall use Tofu-optimized algorithms. (Figures quoted from K computer performance data.)
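Because the tuning lives inside the library, applications simply keep calling the standard collectives; a hedged sketch (the function name is illustrative):

```c
#include <mpi.h>

/* The Tofu-optimized library accelerates these standard calls
 * transparently; the application source does not change. */
void broadcast_params(double *params, int n)
{
    MPI_Barrier(MPI_COMM_WORLD);   /* mapped onto the Tofu-barrier facility */
    MPI_Bcast(params, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);  /* Tofu-optimized algorithm */
}
```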

Application Tuning Cycle and Tools

The tuning cycle iterates: collecting job information -> analysis & tuning -> execution.

- Collecting job information: Tofu-PA, PAPI
- Analysis & tuning: profiler, Vampir-trace
- Overall tuning: RMATT (rank mapping)
- Execution: MPI tuning, CPU tuning

Fujitsu tools (Tofu-PA, profiler, RMATT) are used alongside open-source tools (PAPI, Vampir-trace).
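A hedged sketch of collecting hardware counters with PAPI (the region function is illustrative; preset-event availability varies by platform):

```c
#include <papi.h>
#include <stdio.h>

/* Count total instructions and L2 data-cache misses around a kernel. */
void profile_region(void (*kernel)(void))
{
    int events[2] = { PAPI_TOT_INS, PAPI_L2_DCM };
    long long counters[2];
    int eventset = PAPI_NULL;

    PAPI_library_init(PAPI_VER_CURRENT);   /* initialize once per process */
    PAPI_create_eventset(&eventset);
    PAPI_add_events(eventset, events, 2);

    PAPI_start(eventset);
    kernel();
    PAPI_stop(eventset, counters);
    printf("instructions: %lld  L2 data misses: %lld\n",
           counters[0], counters[1]);
}
```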

Program Profile (Event Counter Example)

3-D job example: 4096 processes are displayed as 16 x 16 x 16 cells, each painted in a color according to the status of its process (e.g. CPU time). A slice of the job can be cut along the x-, y-, or z-axis for viewing.

Rank Mapping Optimization (RMATT)

RMATT takes as input the network construction and the communication pattern (which ranks communicate with which) and outputs an optimized rank map that reduces the number of hops and the congestion.

Example: MPI_Allgather, 4096 ranks, on a 16 x 16 x 16 node (4096-node) network. With plain x,y,z-order mapping the communication takes 22.3 ms; after remapping with RMATT it takes 5.5 ms, a 4x performance improvement.
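RMATT itself is a Fujitsu tool; as a portable analogue (an assumption, not the slide's method), MPI's Cartesian topology interface can be asked to reorder ranks so the library may fit them to the network:

```c
#include <mpi.h>

/* A 16x16x16 process grid matching the example above.  reorder = 1
 * permits the MPI library to remap ranks onto nearby nodes. */
void make_torus_comm(MPI_Comm *cart)
{
    int dims[3]    = { 16, 16, 16 };
    int periods[3] = { 1, 1, 1 };   /* wrap around each axis, like a torus */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, cart);
}
```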