Advanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED



System Configuration of PRIMEHPC FX10

- Login nodes: user login, compilation, and job submission
- Compute nodes: connected by the 6D mesh/torus interconnect; a local file system provides a temporary area occupied by jobs
- Global file system: the data storage area, reached via data transfer over the IO network (IB)
- Control nodes, management nodes, job management nodes, and file management nodes: system operations management and job operations management, with data communication over the management network (GbE)
- System integration node: used by the administrator

System Software Stack

- User/ISV applications
- HPC Portal / System Management Portal
- System operations management: system configuration management, system control, system monitoring, system installation & operation
- Job operations management: job manager, job scheduler, resource management, parallel execution environment
- High-performance file system: Lustre-based distributed file system with high scalability, IO bandwidth guarantee, and high reliability & availability
- VISIMPACT™: shared L2 cache on a chip, hardware intra-processor synchronization
- Compilers: hybrid parallel programming, sector cache support, SIMD / register file extensions
- MPI library: scalability of high-function barrier communication
- Support tools: IDE, profiler & tuning tools, interactive debugger
- Linux-based enhanced operating system on PRIMEHPC FX10: enhanced hardware support, system noise reduction, error detection / low power

OS (Linux-based Enhanced Operating System)

- Easy porting of existing applications: POSIX API (Linux kernel 2.6.x, glibc 2.x)
- High performance and high scalability: enhanced hardware support (CPU registers, large memory pages, high-speed interconnect); reduced system noise for highly parallel programs through inter-node OS scheduling
- High availability and low power consumption: hardware error detection and isolation (memory patrol, enhanced IO drivers); CPU suspend during the system idle state

(Figure: daemon services are a source of system noise; scheduling daemons synchronously across nodes and suspending idle CPUs keeps them from interrupting the running application.)


System Operations Management

- Hierarchical structure for efficient operation and adaptability to large-scale systems: the load is distributed by using job management sub nodes
- Easy to operate with a single system image
- The system is operated efficiently by dividing it into logical resource partitions called Resource Units

(Figure: a control node handles power control, hardware monitoring, and software service monitoring; under the job management node, each group has a job management sub node and IO nodes; the compute nodes on the Tofu interconnect are divided into Resource Unit #1 and Resource Unit #2.)

High Availability System

- The important nodes have redundancy: control nodes, job management nodes, job management sub nodes, and file servers
- Example (figure): job execution continues even if the job management node fails
- Job data is always synchronized between the active node and the standby node, and the standby node takes over if the active node goes down
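The active/standby pattern described above can be sketched in a few lines. This is purely illustrative (hypothetical classes, not Fujitsu's implementation): the point is that every job-state update is mirrored to the standby before it is considered done, so a failover loses no job data.

```python
# Illustrative active/standby sketch; class and method names are
# hypothetical, not taken from the FX10 software.

class JobManager:
    """Holds the job-state table of one management node."""
    def __init__(self):
        self.jobs = {}

    def apply(self, job_id, state):
        self.jobs[job_id] = state


class HaPair:
    """An active node plus a standby node kept in sync."""
    def __init__(self):
        self.active, self.standby = JobManager(), JobManager()

    def submit(self, job_id, state):
        self.active.apply(job_id, state)    # update the active node...
        self.standby.apply(job_id, state)   # ...and sync to the standby

    def failover(self):
        """Promote the standby; running jobs are not lost."""
        self.active = self.standby
        self.standby = JobManager()
```

A failover after a submission leaves the promoted node with the full job table, which is exactly why the slide can claim that job execution continues through a management-node failure.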

Job Operations Environment

- Efficient resource usage: flexible job scheduling based on prioritized resource assignment; interconnect topology-aware resource assignment; backfill scheduling to keep the resources busy; asynchronous file staging
- High availability: faulty resources are not assigned to jobs, and faulty nodes are disconnected from job operations; management nodes have failover support

(Figure: job timelines with backfilling disabled and enabled; with backfilling enabled, a waiting job is slotted into otherwise idle resources ahead of jobs A and B, so it finishes earlier without delaying them.)
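Backfill scheduling can be illustrated with a small simulation (a sketch of the common EASY-backfill idea, not the actual FX10 scheduler): jobs start in queue order, but when the head job is blocked, a later job may jump ahead if it fits in the free nodes and finishes before the head job's reserved start time.

```python
import heapq

# EASY-backfill sketch (illustrative only). Each job is
# (name, nodes_needed, runtime); the cluster has `total` nodes.

def schedule(jobs, total):
    """Return {job name: start time} for FCFS with backfill."""
    free, now = total, 0
    running = []                       # heap of (end_time, nodes)
    starts, queue = {}, list(jobs)
    while queue:
        name, need, length = queue[0]
        if need <= free:               # head job fits: start it now
            queue.pop(0)
            starts[name] = now
            heapq.heappush(running, (now + length, need))
            free -= need
            continue
        # Head job is blocked: find its earliest possible start time
        # (when enough running jobs will have finished).
        f = free
        shadow = now
        for end, n in sorted(running):
            f += n
            shadow = end
            if f >= need:
                break
        # Backfill: a later job may start now if it fits in the free
        # nodes and ends by `shadow`, so the head job is not delayed.
        for j in list(queue[1:]):
            jname, jneed, jlen = j
            if jneed <= free and now + jlen <= shadow:
                queue.remove(j)
                starts[jname] = now
                heapq.heappush(running, (now + jlen, jneed))
                free -= jneed
        # Advance time to the next job completion.
        end, n = heapq.heappop(running)
        now, free = end, free + n
    return starts
```

On a 4-node cluster with jobs A (2 nodes, 10 units), B (4 nodes, 5) and C (2 nodes, 5), strict FCFS would make C wait behind B; with backfill, C runs alongside A immediately, and B still starts at time 10.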

Resource Assignment

- Interconnect topology-aware resource assignment: treats 12 compute nodes as one interconnect unit and assigns cubic-shaped interconnect unit(s) to a job
- Using adjacent interconnect units suits contiguous communication and also avoids interference with other jobs
- Optimizes the alignment of resources: rotating the cubic-shaped interconnect units improves total system utilization

(Figure: a grid of interconnect units along the x, y, and z axes, showing in-use and unoccupied units.)
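The idea of assigning a rotated cuboid of interconnect units can be sketched as a search over a 3-D occupancy grid. This is a hypothetical illustration of the concept, not Fujitsu's allocation algorithm: trying all axis permutations of the requested shape is what lets oddly placed free regions still be used.

```python
from itertools import permutations, product

# Illustrative topology-aware allocation sketch: grid[x][y][z] is True
# when that interconnect unit is free; a job asks for an a x b x c
# block of units and accepts any rotation of it.

def find_cuboid(grid, shape):
    """Return ((x0, y0, z0), oriented_shape) for a free contiguous
    cuboid of units, or None if no orientation fits anywhere."""
    X, Y, Z = len(grid), len(grid[0]), len(grid[0][0])
    for a, b, c in set(permutations(shape)):      # try all rotations
        for x0, y0, z0 in product(range(X - a + 1),
                                  range(Y - b + 1),
                                  range(Z - c + 1)):
            if all(grid[x][y][z]
                   for x in range(x0, x0 + a)
                   for y in range(y0, y0 + b)
                   for z in range(z0, z0 + c)):
                return (x0, y0, z0), (a, b, c)
    return None
```

Because the job always receives adjacent units, its traffic stays inside its own cuboid of the torus, which is the non-interference property the slide describes.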


High Scalability

- Highly scalable IO performance is achieved with multiple OSSes (object storage servers): throughput grows as servers and storage are added
- Adapted to various IO models: parallel IO (MPI-IO), single-stream IO, and master IO to a shared file

(Figure: throughput versus number of servers; files and a shared file are spread over rows of OSSes.)
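The "add servers, gain throughput" scaling comes from striping: each file's blocks are spread over the OSSes so that reads and writes hit many servers in parallel. A minimal sketch of a round-robin stripe layout, as used in Lustre-style file systems (the sizes and function names here are illustrative):

```python
# Round-robin striping sketch: which OSS holds each stripe-sized block
# of a file, and how many bytes each OSS ends up serving.

def stripe_map(file_size, stripe_size, n_oss):
    """Map each stripe-sized block of a file to an OSS index."""
    n_blocks = (file_size + stripe_size - 1) // stripe_size
    return [block % n_oss for block in range(n_blocks)]

def bytes_per_oss(file_size, stripe_size, n_oss):
    """Bytes of the file held by each OSS (last block may be short)."""
    counts = [0] * n_oss
    for block, oss in enumerate(stripe_map(file_size, stripe_size, n_oss)):
        start = block * stripe_size
        counts[oss] += min(stripe_size, file_size - start)
    return counts
```

With a 10 MiB file, 1 MiB stripes, and 4 OSSes, the load splits 3/3/2/2 MiB, so a parallel reader can pull from all four servers at once instead of queuing on one.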

IO Bandwidth Guarantee

- Fair Share QoS: shares the IO bandwidth among all users. Without it, one user's IO can crowd out another's at the file servers; with it, users A and B each receive a fair share.
- Best Effort QoS: utilizes all IO bandwidth exhaustively. Bandwidth that a single client occupies when it is alone is shared among all clients when several are active.
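The policy behind Fair Share QoS is essentially max-min fairness: no user gets more than they ask for, and headroom left by light users is redistributed to the heavy ones. A sketch of that division (the FX10 documentation does not describe its QoS mechanism at this level, so this is the textbook algorithm, not Fujitsu's):

```python
# Max-min fair-share sketch: divide `capacity` among users with the
# given bandwidth demands.

def max_min_share(capacity, demands):
    """Return {user: allocated bandwidth} under max-min fairness."""
    share = {u: 0.0 for u in demands}
    active = set(demands)
    left = float(capacity)
    while active and left > 1e-9:
        fair = left / len(active)          # equal split of what remains
        for u in sorted(active):
            give = min(fair, demands[u] - share[u])
            share[u] += give
            left -= give
        # Users whose demand is met drop out; their leftover is
        # redistributed in the next round.
        active = {u for u in active if demands[u] - share[u] > 1e-9}
    return share
```

With 100 units of bandwidth and demands A=80, B=10, C=30, the light users B and C are fully satisfied and A absorbs the remaining 60, rather than A monopolizing the servers as in the "without Fair Share QoS" picture.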

High Reliability and High Availability

- Single points of failure are avoided by redundant hardware and a failover mechanism, under monitoring & managing software on the file management server

(Figure: an active/standby MDS pair and paired active OSSes, each able to fail over to its partner; redundant network paths through duplicated IB switches, dual-attached servers, and redundant disk paths to RAID storage.)


VISIMPACT™ (Virtual Single Processor by Integrated Multi-core Parallel Architecture)

- A mechanism that treats multiple cores as one high-speed CPU: easy and efficient inter-core thread parallel processing on a multi-core CPU; supports a highly efficient hybrid model (automatic parallelization + MPI)
- CPU technologies: a large-capacity shared L2 cache, which decreases the influence of false sharing; inter-core hardware barrier facilities, 6-10 times faster than a conventional software barrier

(Figure: flat MPI runs one process per core, while inter-core thread parallel processing runs one process per CPU with threads 1..N synchronized by hardware barrier synchronization, about 10 times faster than a conventional system.)
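The execution model VISIMPACT targets, one process per CPU whose threads fan out over the cores and meet at a barrier, can be sketched with ordinary threads. Here `threading.Barrier` stands in for the FX10's inter-core hardware barrier (the whole example is illustrative; on the real machine the compiler generates the threading automatically):

```python
import threading

# Fork-join thread parallelism sketch: one thread per "core" computes a
# partial sum, all threads synchronize at a barrier, then thread 0
# combines the partial results.

def parallel_sum(data, n_threads):
    """Sum `data` using n_threads workers joined by a barrier."""
    chunk = (len(data) + n_threads - 1) // n_threads
    partial = [0] * n_threads
    barrier = threading.Barrier(n_threads)
    result = []

    def worker(tid):
        lo = tid * chunk
        partial[tid] = sum(data[lo:lo + chunk])
        barrier.wait()             # all "cores" synchronize here
        if tid == 0:               # one thread reduces the partials
            result.append(sum(partial))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

The barrier is on the critical path of every such fork-join region, which is why replacing a software barrier with a hardware one that is several times faster matters so much for fine-grained loops.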


Programming Model for High Scalability

Hybrid parallelism by VISIMPACT and the MPI library:
- VISIMPACT: automated multi-thread parallelization; a high-performance thread barrier using the inter-core hardware barrier facility
- MPI library: high-performance collective communications using the Tofu barrier facility

(Charts, quoted from K computer performance data: scalability of the Himeno benchmark (XL size) from about 1,000 to 100,000 cores for Hybrid + Tofu barrier, Hybrid, and flat MPI; and a time breakdown at 65,536 cores into collective communication versus neighbor communication and calculation.)

Customized MPI Library for High Scalability

- Point-to-point communication: uses a special low-latency path that bypasses the software layer; the transfer method is optimized according to the data length, process location, and number of hops
- Collective communication: high-performance Barrier, Allreduce, Bcast, and Reduce using the Tofu barrier facility; scalable Bcast, Allgather, Allgatherv, Allreduce, and Alltoall algorithms optimized for the Tofu network

(Chart, quoted from K computer performance data: Barrier and Allreduce elapsed time in microseconds, Tofu barrier versus software implementations, for up to about 10,000 nodes in a 48x6xZ configuration.)
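To see what the Tofu barrier facility is replacing, it helps to look at what a software Allreduce does. A common algorithm is recursive doubling: log2(P) exchange rounds, each doubling the span of ranks whose values are combined. The sketch below simulates the exchange pattern with plain lists instead of real messages (it is a generic textbook algorithm, not the FX10 library's actual implementation):

```python
# Recursive-doubling Allreduce sketch. values[i] is rank i's input;
# after log2(P) rounds every rank holds the full reduction.

def allreduce(values, op=lambda a, b: a + b):
    """Return (per-rank results, number of communication rounds)."""
    p = len(values)
    assert p & (p - 1) == 0, "sketch assumes a power-of-two rank count"
    vals = list(values)
    rounds, step = 0, 1
    while step < p:
        nxt = list(vals)
        for rank in range(p):
            partner = rank ^ step     # pairwise exchange at distance `step`
            nxt[rank] = op(vals[rank], vals[partner])
        vals = nxt
        step *= 2
        rounds += 1
    return vals, rounds
```

Even this efficient software scheme costs log2(P) message latencies per call (13+ rounds at 10,000 nodes), each paying the full software stack; offloading small Allreduce and Barrier operations to the Tofu barrier hardware removes that per-round software overhead, which matches the large gap in the chart above.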

Compiler Optimization for High Performance

- Instruction-level parallelism with SIMD instructions
- Improved computing efficiency using the expanded registers
- Improved cache efficiency using the sector cache

(Charts: NPB3.3 LU and MG execution times on PRIMEHPC FX10 relative to FX1, broken down into memory wait, operation wait, cache misses, and instructions committed; efficient use of the expanded registers reduces operation wait in LU, and the SIMD implementation reduces instructions committed in MG.)

Application Tuning Cycle and Tools

(Figure: a tuning cycle of execution, job information, MPI tuning, CPU tuning, and overall tuning. FX10-specific tools: profiler, profiler snapshot, RMATT, Tofu-PA. Open-source tools: Vampir-trace, PAPI.)
