PRIMEHPC FX10: Advanced Software


PRIMEHPC FX10: Advanced Software. Koh Hotta, Fujitsu Limited

System software supports:
- Stable/robust and low-overhead execution of large-scale programs: Operating System, File System.
- Program development for high-speed execution ("just compile and enjoy high performance"): Compiler, MPI, Tuning support environment.

System Software Stack
- User/ISV Applications
- HPC Portal / System Management Portal
- File system and operations management:
  - System operations management: system configuration management, system control, system monitoring, system installation & operation.
  - Job operations management: job manager, job scheduler, resource management, parallel execution environment.
  - High-performance file system: Lustre-based distributed file system, high scalability, IO bandwidth guarantee, high reliability & availability.
- Application development environment:
  - VISIMPACT™: shared L2 cache on a chip, hardware intra-processor synchronization.
  - Compilers: hybrid parallel programming, sector cache support, SIMD / register file extensions.
  - MPI library: scalability of high-function barrier communication.
  - Support tools: IDE, profiler & tuning tools, interactive debugger.
- Linux-based enhanced Operating System
- PRIMEHPC FX10 hardware: enhanced hardware support, system noise reduction, error detection, low power.

OS: Linux ported to SPARC64
- Applications developed on PC clusters can be ported with little effort.
- Additional features for large-scale systems:
  - Daemons are scheduled synchronously across nodes to reduce application wait time (daemon services are the main source of system noise).
  - Idle CPU suspension facility.
[Figure: four nodes running an application; synchronous daemon scheduling and idle-CPU suspension keep daemon activity from interrupting the running job.]

Flexible System Management
- Hierarchical structure for efficient operation of large-scale systems: a job is distributed to each compute node through a sub job management node.
- A single system image through the control node: easy for the system administrator to operate.
- Logical resource partitions, named Resource Units, are allocated flexibly.
[Figure: control node (power control, hardware monitoring, software service monitoring) and job management node above the node groups; each node group has a sub job management node, IO nodes, and compute nodes on the Tofu interconnect, partitioned into Resource Unit #1 and Resource Unit #2.]

Robust System Operation
- The important nodes have redundancy: control node, job management node, sub job management nodes, file servers.
- Job data is synchronized between active nodes and stand-by nodes.
- In case of a job management node failure, a stand-by node takes over and executing jobs continue to run.


I have a dream that one day you will just compile your programs and enjoy high performance on your high-end supercomputer. So we must provide an easy hybrid parallel programming method, including compiler and run-time system support.

Hybrid Parallelism on a Huge Number of Cores
- With one process per core there are too many processes to manage; to reduce the number of processes, hybrid thread-process programming is required.
- But hybrid parallel programming is annoying for programmers: even for multi-threading, procedure-level or outer-loop parallelism was desired, and there is little opportunity for such coarse-grain parallelism.
- System support for fine-grain parallelism is therefore required. VISIMPACT solves these problems.

VISIMPACT™ (Virtual Single Processor by Integrated Multi-core Parallel Architecture)
- A mechanism that treats multiple cores as one high-speed CPU through practical automatic parallelization: just program and compile, and enjoy high speed; you need not think about hybrid programming.
- Inter-core thread parallel processing is backed by CPU technologies:
  - Shared L2 cache memory to avoid false sharing.
  - Inter-core hardware barrier facilities to reduce the overhead of thread synchronization; hardware barrier synchronization is 10 times faster than on conventional systems.
[Figure: one process spans the cores of a chip as threads (Thread 1 ... Thread N) synchronized by the hardware barrier, in contrast to running one process per core.]
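
For comparison, the process-plus-thread pattern that VISIMPACT automates looks roughly like the following when written by hand. This is a minimal, generic MPI + OpenMP sketch (standard APIs only, not Fujitsu-generated code); with VISIMPACT the thread level is produced automatically by the compiler.

```c
/* Illustrative sketch only: a hand-written MPI + OpenMP hybrid loop.
 * VISIMPACT's automatic parallelization aims to provide the thread level
 * without the programmer writing it.
 * Compile with a generic toolchain, e.g.: mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each MPI process owns a block of data ... */
    static double a[N];
    double local = 0.0;

    /* ... and the cores inside the chip work on it as threads. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        a[i] = (double)(rank + i);
        local += a[i] * a[i];
    }

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum of squares = %e (%d procs x %d threads)\n",
               global, nprocs, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```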

The Compiler Uses the HPC-ACE Architecture
- Instruction-level parallelism with SIMD instructions.
- Improved computing efficiency using the 256 floating-point registers.
- Improved cache efficiency using the sector cache.
[Figure: execution-time comparisons (relative values) of FX1 vs. PRIMEHPC FX10, broken down into memory wait, operation wait, cache misses, and instructions committed. For NPB3.3 MG, the SIMD implementation reduces instructions committed; for NPB3.3 LU, efficient use of the expanded registers reduces operation wait.]
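
A minimal C sketch of the kind of loop the compiler maps to SIMD instructions, where the wide HPC-ACE register file allows aggressive unrolling without spilling. The pragma shown is a generic portable hint, not Fujitsu-specific syntax, and the sector-cache directives are Fujitsu-specific and not reproduced here.

```c
/* Sketch: a SIMD-friendly loop. The compiler can vectorize it and use
 * the large floating-point register file for unrolling; "#pragma omp simd"
 * is a portable hint only, not the Fujitsu directive syntax. */
#include <stddef.h>

void daxpy(size_t n, double alpha,
           const double * restrict x, double * restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];   /* fused multiply-add, SIMD-friendly */
}
```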

MPI Approach for FX10
- Based on Open MPI: open standard, open source, multi-platform (including PC clusters).
- Extensions added to Open MPI for the Tofu interconnect:
  - High performance: a short-cut message path for low-latency communication.
  - Torus-oriented protocol, sensitive to message size, location, and hop count.
  - Trunking communication that utilizes the multi-dimensional network links via Tofu selective routing.

Goals for MPI on FX10
- High performance: low latency and high bandwidth.
- High scalability: collective performance optimized for the Tofu interconnect.
- High availability, flexibility, and ease of use:
  - Providing a logical 3D torus for each job while eliminating failed nodes.
  - Providing functions from new versions of the MPI standard as soon as possible.

MPI Software Stack
- Original Open MPI software stack (using the openib BTL): MPI calls pass through the ob1 PML (Point-to-Point Messaging Layer), the r2 BML (BTL Management Layer), and the openib BTL (Byte Transfer Layer) down to OpenFabrics Verbs, with COLL alongside for collectives.
- FX10 extensions adapting the stack to Tofu:
  - tofu BTL: hardware-dependent byte-transfer layer for the Tofu interconnect.
  - tofu LLP (Low Latency Path): a special hardware-dependent layer for the Tofu interconnect, used for rendezvous protocol optimization, etc.
  - tofu COLL: special Bcast, Allgather, Alltoall, and Allreduce implementations for Tofu.
  - tofu common: common data processing and structures shared by the BTL, LLP, and COLL components, on top of the Tofu library.

Flexible Process Mapping to the Tofu Environment
- You can allocate your processes as you like.
- Coordinates can be specified for each rank in one, two, or three dimensions: 1D (x), 2D (x,y), or 3D (x,y,z).
[Figure: example rank layouts, e.g. ranks 0-7 placed along a line for 1D (x), on a 4x2 grid for 2D (x,y), and on a 2x2x2 cube for 3D (x,y,z).]
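
Fujitsu MPI takes this mapping through its own rank-map specification, whose format is not reproduced here. The standard MPI topology interface below is only a portable analogue sketching the same idea of giving each rank 3D coordinates.

```c
/* Sketch (standard MPI only): a 3D process topology analogous to a 3D
 * rank mapping. This is not the Fujitsu rank-map file mechanism.
 * Run with 8 processes for the 2 x 2 x 2 example below. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[3]    = {2, 2, 2};   /* 2 x 2 x 2 = 8 processes (example) */
    int periods[3] = {1, 1, 1};   /* torus-like wrap-around in all axes */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &cart);

    if (cart != MPI_COMM_NULL) {
        int crank, coords[3];
        MPI_Comm_rank(cart, &crank);
        MPI_Cart_coords(cart, crank, 3, coords);
        printf("rank %d -> (x,y,z) = (%d,%d,%d)\n",
               crank, coords[0], coords[1], coords[2]);
        MPI_Comm_free(&cart);
    }

    MPI_Finalize();
    return 0;
}
```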

Customized MPI Library for High Scalability
- Point-to-point communication:
  - Uses a special low-latency path that bypasses part of the software layer.
  - The transfer method is optimized according to data length, process location, and number of hops.
- Collective communication:
  - High-performance Barrier, Allreduce, Bcast, and Reduce using the Tofu barrier facility.
  - Scalable Bcast, Allgather, Allgatherv, Allreduce, and Alltoall algorithms optimized for the Tofu network.
[Figure: Barrier / Allreduce performance, elapsed time in microseconds vs. number of nodes (48x6xZ, up to 10,000 nodes); the Tofu-barrier implementations stay well below the software implementations. Quotation from K computer performance data.]
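
From the application's point of view the Tofu-offloaded collectives are reached through ordinary MPI calls; the library chooses the implementation internally. A minimal timing sketch using only standard MPI (no Fujitsu-specific API assumed):

```c
/* Sketch: timing MPI_Allreduce with plain MPI calls. On FX10 the library
 * decides internally whether the Tofu barrier facility can serve the
 * operation; nothing changes in user code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    double in = (double)rank, out = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);          /* align start times */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Allreduce: %.2f us per call\n", 1e6 * (t1 - t0) / iters);

    MPI_Finalize();
    return 0;
}
```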

Programming Model for High Scalability
- Hybrid parallelism by VISIMPACT and the MPI library:
  - VISIMPACT: multi-thread parallelization.
  - MPI library: collective communications using the Tofu barrier facility.
[Figure: scalability of the Himeno benchmark (XL size), performance ratio vs. number of cores (1,000 to 100,000), and a time-ratio breakdown at 65,536 cores into collective communication vs. neighbor communication and calculation, comparing Hybrid + Tofu barrier, Hybrid, and Flat MPI. Quotation from K computer performance data.]
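
The "neighbor communication and calculation" portion of such a stencil code is typically a halo exchange. The following is a generic MPI sketch of that pattern, not the Himeno benchmark source itself.

```c
/* Sketch: a 1D halo exchange of the kind that dominates "neighbor
 * communication" in stencil codes such as Himeno. Generic MPI only. */
#include <mpi.h>

#define NLOC 1024                /* interior points per process (example) */

void halo_exchange(double u[NLOC + 2], MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request req[4];
    /* receive into the ghost cells u[0] and u[NLOC+1] ... */
    MPI_Irecv(&u[0],        1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[NLOC + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    /* ... while sending the boundary values u[1] and u[NLOC]. */
    MPI_Isend(&u[1],        1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[NLOC],     1, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* Interior computation could overlap with communication here. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```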

Performance Tuning
- Performance comes not only from compiler optimization; you can also tune it yourself:
  - Compiler directives to tune programs.
  - Tools to help your tuning effort.

Application Tuning Cycle and Tools
- The tuning cycle covers execution, MPI tuning, CPU tuning, and overall tuning, based on job information.
- Tools used along the cycle: profiler, profiler snapshot, Vampir-trace, RMATT, Tofu-PA, and PAPI (a mix of FX10-specific and open-source tools).
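
On the open-source side, hardware event counters can be read through PAPI. A minimal sketch using the standard PAPI low-level API; the events chosen here are illustrative and their availability depends on the platform.

```c
/* Sketch: counting committed instructions and L2 cache misses with the
 * standard PAPI low-level API. Link with -lpapi. */
#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }

    int evset = PAPI_NULL;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);   /* instructions completed */
    PAPI_add_event(evset, PAPI_L2_TCM);    /* L2 total cache misses  */

    long long counts[2];
    PAPI_start(evset);

    /* region of interest: a simple dot product */
    double s = 0.0;
    static double x[1 << 20];
    for (size_t i = 0; i < (1 << 20); i++)
        s += x[i] * x[i];

    PAPI_stop(evset, counts);
    printf("TOT_INS=%lld  L2_TCM=%lld  (s=%f)\n", counts[0], counts[1], s);
    return 0;
}
```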

Performance Tuning (Event Counter Example)
- 3D job example: display 4,096 processes in 16 x 16 x 16 cells.
- Cells are painted in colors according to the process status (e.g., CPU time).
- A slice of the job can be cut along the x-, y-, or z-axis for viewing.

Rank Mapping Optimization (RMATT: Rank Map Automatic Tuning Tool)
- Input: a log file of the network configuration and the communication pattern (communication weights between ranks), obtained for example from communication analysis with Vampir-trace.
- Output: an optimized rank map file that reduces the number of hops and congestion; the job is then re-executed using this rank map.
- Example (Bruck algorithm, allgather-type communication; 4,096 ranks on a 16x16x16 node network): the standard x,y,z-order mapping takes 22.3 ms, while the RMATT remapping takes 5.5 ms, about 4 times faster.
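
The quantity RMATT tries to reduce can be pictured as a weighted hop count over the torus. The following toy sketch is my own illustration of evaluating a mapping, not RMATT's actual algorithm.

```c
/* Toy sketch (not RMATT itself): score a rank-to-coordinate mapping on a
 * 3D torus as the sum of (communication weight x hop count) over all rank
 * pairs. A rank-remapping tool searches for a mapping with lower cost. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int x, y, z; } Coord;

/* hop count between two coordinates on a torus of size (X, Y, Z) */
static int torus_hops(Coord a, Coord b, int X, int Y, int Z)
{
    int dx = abs(a.x - b.x); if (dx > X - dx) dx = X - dx;
    int dy = abs(a.y - b.y); if (dy > Y - dy) dy = Y - dy;
    int dz = abs(a.z - b.z); if (dz > Z - dz) dz = Z - dz;
    return dx + dy + dz;
}

/* weight[i*n+j] = traffic (bytes or messages) rank i sends to rank j */
static long long mapping_cost(int n, const double *weight,
                              const Coord *map, int X, int Y, int Z)
{
    long long cost = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            cost += (long long)(weight[i * n + j] *
                                torus_hops(map[i], map[j], X, Y, Z));
    return cost;
}

int main(void)
{
    enum { N = 8, X = 2, Y = 2, Z = 2 };
    Coord map[N];
    double weight[N * N] = {0};

    /* simple x,y,z-order mapping and a ring communication pattern */
    for (int r = 0; r < N; r++) {
        map[r] = (Coord){ r % X, (r / X) % Y, r / (X * Y) };
        weight[r * N + (r + 1) % N] = 1.0;
    }

    printf("cost = %lld\n", mapping_cost(N, weight, map, X, Y, Z));
    return 0;
}
```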

Conclusion: FX10 provides a practical high-level parallel environment.
- Linux for SPARC64 processors, reducing the system noise effect and power usage, with highly available job/system management facilities.
- VISIMPACT™ lets you treat a multi-core CPU as one single high-speed core, through collaboration between the CPU architecture and the compiler:
  - A high-speed hardware barrier to reduce the overhead of synchronization.
  - A shared L2 cache to improve memory access.
  - Automatic parallelization to recognize parallelism and accelerate your program.
- An Open MPI-based MPI library that utilizes the Tofu interconnect.
- Tuning facilities that show the activity of parallel programs.