Advanced Software for the Supercomputer PRIMEHPC FX10
System Configuration of PRIMEHPC FX10
- Users log in to login nodes for compilation and job submission.
- Compute nodes are connected by a 6D mesh/torus interconnect and use a local file system as a temporary area occupied by jobs; data is transferred to and from the global file system (the data storage area).
- Control nodes, job management nodes, management nodes, and file management nodes provide system operations management and job operations management for the administrator; data communication for these functions runs over the IO network (InfiniBand) and the management network (GbE).
[Figure: overall system configuration showing login nodes, compute nodes, the global file system, the management nodes, and the system integration node]
System Software Stack
- User/ISV applications; HPC Portal / System Management Portal
- System operations management: system configuration management, system control, system monitoring, system installation & operation
- Job operations management: job manager, job scheduler, resource management, parallel execution environment
- High-performance file system: Lustre-based distributed file system with high scalability, IO bandwidth guarantee, and high reliability & availability
- VISIMPACT(TM): shared L2 cache on a chip, hardware intra-processor synchronization
- Compilers: hybrid parallel programming, sector cache support, SIMD / register file extensions
- MPI library: scalability of high-function barrier communication
- Support tools: IDE, profiler & tuning tools, interactive debugger
- Linux-based enhanced operating system: enhanced hardware support, system noise reduction, error detection / low power
OS (Linux-based enhanced Operating System)
- Easy porting of existing applications: POSIX API (Linux kernel 2.6.x, glibc 2.x)
- High performance / high scalability:
  - Enhanced hardware support: CPU registers, large memory pages, high-speed interconnect
  - Reduced system noise in highly parallel programs through inter-node OS scheduling: daemon services, which are a source of system noise, are scheduled synchronously across nodes instead of interrupting the application at random times
- High availability / low power consumption:
  - Hardware error detection and isolation: memory patrol, enhanced IO drivers
  - CPU suspend during the system idle state
[Figure: timelines contrasting unsynchronized daemon services interrupting a running application with synchronous daemon scheduling and idle CPU suspend while a job is running]
System Operations Management
- Hierarchical structure for efficient operation and adaptability to large-scale systems: the load is distributed across job management sub-nodes.
- Easy to operate with a single system image for the system administrator.
- The system is operated efficiently by dividing it into logical resource partitions called Resource Units.
[Figure: control node (power control, hardware monitoring, software service monitoring) above a job management node and per-group job management sub-nodes with IO nodes; compute nodes on the Tofu interconnect are divided into Resource Unit #1 and Resource Unit #2]
High Availability System
- The important nodes are redundant: control node, job management node, job management sub-nodes, and file servers.
- Example (right figure): job execution continues even if the job management node fails.
  - Job data is continuously synchronized between the active node and the stand-by node.
  - The stand-by node takes over automatically if the active node goes down.
[Figure: active and stand-by job management nodes keeping job data in sync; on failure of the active node, jobs continue on the stand-by node]
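The active/stand-by pattern above can be sketched in a few lines. This is an illustrative model only, not Fujitsu's implementation; the `JobManager` class and `failover` function are hypothetical names. The point is that because every job record is replicated synchronously, the stand-by node already holds the full job state when it is promoted.

```python
# Illustrative sketch of active/stand-by job-data replication and failover.
class JobManager:
    def __init__(self, name):
        self.name = name
        self.jobs = {}          # job id -> state
        self.standby = None     # replication target

    def submit(self, job_id, state="QUEUED"):
        self.jobs[job_id] = state
        if self.standby is not None:       # synchronous replication
            self.standby.jobs[job_id] = state

def failover(active, standby):
    """Promote the stand-by node; replicated job data lets jobs continue."""
    standby.standby = None
    return standby

active = JobManager("jm-active")
standby = JobManager("jm-standby")
active.standby = standby

active.submit("job-1", "RUNNING")
active.submit("job-2", "QUEUED")

new_active = failover(active, standby)     # the active node has failed
# new_active.jobs still holds both jobs
```

In a real system the synchronization would be transactional and the failure would be detected by heartbeat monitoring; the sketch only shows why no job state is lost.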
Job Operations Environment
- Efficient resource usage:
  - Flexible job scheduling based on prioritized resource assignment
  - Interconnect topology-aware resource assignment
  - Backfill scheduling to keep the resources busy
  - Asynchronous file staging
- High availability:
  - Avoids assigning faulty resources to jobs and disconnects faulty nodes from job operations
  - Management nodes with failover support
[Figure: timelines with backfilling disabled vs enabled; with backfilling, small jobs A and B run in the gap before large job C starts, so idle resources are used without delaying C]
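The backfill idea in the figure can be sketched as an EASY-style scheduler: the job at the head of the queue gets a reservation for the earliest time enough nodes free up, and later jobs may start immediately only if they fit in the currently free nodes and finish before that reservation. This is a minimal sketch under those assumptions, not the FX10 scheduler's actual algorithm; `backfill_schedule` is a hypothetical function.

```python
def backfill_schedule(total_nodes, running, queue, now=0):
    """EASY-style backfill sketch.

    running: list of (nodes, end_time) for jobs currently executing.
    queue:   list of (nodes, duration); queue[0] is the blocked head job.
    Returns (reservation time for the head job, indices of backfilled jobs).
    """
    free = total_nodes - sum(n for n, _ in running)
    head_nodes, _head_len = queue[0]
    # Earliest time enough nodes accumulate for the head job.
    avail, t = free, now
    for n, end in sorted(running, key=lambda r: r[1]):
        if avail >= head_nodes:
            break
        avail += n
        t = end
    reservation = t
    # Backfill later jobs that fit now and end before the reservation.
    started = []
    for i, (n, length) in enumerate(queue[1:], start=1):
        if n <= free and now + length <= reservation:
            started.append(i)
            free -= n
    return reservation, started

# 8-node machine: 6 nodes busy until t=10, head job needs all 8 nodes,
# a 2-node/5-unit job can be backfilled without delaying the head job.
reservation, started = backfill_schedule(8, [(6, 10)], [(8, 5), (2, 5)])
```

With the inputs above the head job is reserved at t=10 and the small job is started immediately, matching the "Backfilling enabled" timeline.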
Resource Assignment
- Interconnect topology-aware resource assignment:
  - Treats 12 compute nodes as one interconnect unit
  - Assigns cuboid-shaped sets of interconnect units to a job; adjacent units suit contiguous communication and avoid interference with other jobs
- Optimizes the alignment of resources: rotating the cuboid-shaped interconnect units improves total system utilization.
[Figure: 3D view along the x, y, and z axes of in-use and unoccupied interconnect units, with rotated cuboid placements]
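The placement idea can be sketched as a search for a free cuboid of interconnect units, trying axis rotations of the requested shape so a request that does not fit one way may fit another. This is an illustrative brute-force sketch, not the production allocator; `find_cuboid` is a hypothetical function and each grid cell stands for one 12-node interconnect unit.

```python
from itertools import permutations

def find_cuboid(grid, shape):
    """Search a 3-D grid of interconnect units (True = in use) for a free
    cuboid matching any axis rotation of `shape`; return (origin, dims)."""
    X, Y, Z = len(grid), len(grid[0]), len(grid[0][0])
    for dx, dy, dz in set(permutations(shape)):      # rotate the request
        for x in range(X - dx + 1):
            for y in range(Y - dy + 1):
                for z in range(Z - dz + 1):
                    if all(not grid[x + i][y + j][z + k]
                           for i in range(dx)
                           for j in range(dy)
                           for k in range(dz)):
                        return (x, y, z), (dx, dy, dz)
    return None

# 2x2x2 grid of units (each unit = 12 nodes); one unit already in use.
grid = [[[False] * 2 for _ in range(2)] for _ in range(2)]
grid[0][0][0] = True
origin, dims = find_cuboid(grid, (1, 2, 2))   # request 4 units = 48 nodes
```

Because rotations are tried, the 1x2x2 request lands in whichever orientation avoids the occupied unit, which is exactly how rotating cuboids raises utilization.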
High Scalability
- Highly scalable IO performance: throughput grows as object storage servers (OSSes) and storage are added.
- Adapted to various IO models: parallel IO (MPI-IO), single-stream IO, and master IO.
[Figure: throughput vs number of servers; a shared file and per-process files distributed across multiple OSSes. OSS: object storage server]
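The reason adding OSSes scales throughput is that a Lustre-style file system stripes each file across servers, so independent chunks can be read or written in parallel. The following is a minimal sketch of round-robin striping; `stripe_layout` is a hypothetical helper, not a Lustre API.

```python
def stripe_layout(file_size, stripe_size, n_oss):
    """Round-robin striping sketch: map each stripe-sized chunk of a
    file to an object storage server. Returns (offset, length, oss)."""
    layout = []
    offset, idx = 0, 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)
        layout.append((offset, length, idx % n_oss))
        offset += length
        idx += 1
    return layout

# A 10 MiB file striped in 4 MiB chunks over 4 OSSes: the first three
# servers each serve one chunk, so the chunks can move concurrently.
layout = stripe_layout(10 * 2**20, 4 * 2**20, 4)
```

With more OSSes than chunks the extra servers stay free for other files, which is why aggregate bandwidth keeps growing as servers and storage are added.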
IO Bandwidth Guarantee
- Fair-share QoS: IO bandwidth is shared fairly among all users; without it, one user can crowd out the others.
- Best-effort QoS: the full IO bandwidth is used exhaustively; the file servers' bandwidth may be occupied by one client or shared by all clients.
[Figure: login nodes and file servers, showing unfair vs fair bandwidth split between user A and user B, and bandwidth occupied by one client group vs shared by client groups A and B]
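Fair sharing of a fixed IO bandwidth is commonly modeled as max-min fairness: no user receives more than an equal share unless other users leave bandwidth unused. The sketch below illustrates that policy; it is an assumption about how such a QoS could be computed, not the FX10 file system's actual algorithm, and `max_min_fair` is a hypothetical function.

```python
def max_min_fair(capacity, demands):
    """Max-min fair split sketch: repeatedly satisfy every user whose
    demand fits under an equal share of the remaining bandwidth."""
    alloc = {u: 0.0 for u in demands}
    active = dict(demands)
    left = float(capacity)
    while active and left > 1e-9:
        share = left / len(active)
        satisfied = [u for u, d in active.items() if d <= share]
        if not satisfied:
            for u in active:          # everyone is bottlenecked: equal split
                alloc[u] += share
            break
        for u in satisfied:           # small demands get exactly what they ask
            alloc[u] += active[u]
            left -= active.pop(u)
    return alloc

# Capacity 10: user B only wants 3, so A may use the leftover 7
# instead of being capped at an equal 5/5 split.
alloc = max_min_fair(10, {"A": 8, "B": 3})
```

Best-effort QoS corresponds to skipping this computation entirely: whichever client issues IO first consumes whatever bandwidth is free.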
High Reliability and High Availability
- Single points of failure are avoided through redundant hardware and a failover mechanism, under monitoring and managing software on the file management server.
[Figure: active/standby MDS pair and active OSS pairs behind dual InfiniBand switches, with redundant network paths, dual servers, redundant disk paths, and RAID storage]
VISIMPACT(TM) (Virtual Single Processor by Integrated Multi-core Parallel Architecture)
- A mechanism that treats multiple cores as one high-speed CPU:
  - Easy and efficient inter-core thread parallel processing on a multi-core CPU
  - Supports a highly efficient hybrid model (automatic parallelization + MPI)
- CPU technologies:
  - Large-capacity shared L2 cache memory, which reduces the impact of false sharing
  - Inter-core hardware barrier facilities, 6-10 times faster than a conventional software barrier
[Figure: per-process threads 1..N on separate cores sharing the L2 cache and synchronizing through hardware barrier synchronization, alongside conventional one-process-per-core execution]
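The thread-parallel pattern that the hardware barrier accelerates looks like this: each thread computes a partial result on its own core, all threads meet at a barrier, then one thread completes the reduction. The sketch below uses Python's `threading.Barrier` purely to illustrate the synchronization structure; on FX10 the compiler generates this pattern and the barrier step is the inter-core hardware barrier rather than a software one.

```python
import threading

N_THREADS = 4
partial = [0] * N_THREADS
barrier = threading.Barrier(N_THREADS)
result = []

def worker(tid, data):
    # Thread-local work: sum a strided slice of the input.
    partial[tid] = sum(data[tid::N_THREADS])
    # All threads must finish their partial sums before the reduction;
    # this wait is what the inter-core hardware barrier makes cheap.
    barrier.wait()
    if tid == 0:
        result.append(sum(partial))

data = list(range(100))
threads = [threading.Thread(target=worker, args=(t, data)) for t in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# result[0] == 4950, the sum of 0..99
```

Because fine-grained loops hit this barrier very often, cutting its cost by 6-10x is what makes automatic intra-chip parallelization efficient enough to pair with MPI across chips.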
Programming Model for High Scalability
- Hybrid parallelism by VISIMPACT and the MPI library:
  - VISIMPACT: automatic multi-thread parallelization; high-performance thread barrier using the inter-core hardware barrier facility
  - MPI library: high-performance collective communications using the Tofu barrier facility
[Figure: scalability of the Himeno benchmark (XL size) from 1,000 to 100,000 cores, and a time breakdown at 65,536 cores (collective communication vs neighbor communication and calculation), comparing Hybrid + Tofu barrier, Hybrid, and flat MPI; quoted from K computer performance data]
Customized MPI Library for High Scalability
- Point-to-point communication:
  - Uses a special low-latency path that bypasses the software layer
  - The transfer method is optimized according to data length, process location, and number of hops
- Collective communication:
  - High-performance Barrier, Allreduce, Bcast, and Reduce using the Tofu barrier facility
  - Scalable Bcast, Allgather, Allgatherv, Allreduce, and Alltoall algorithms optimized for the Tofu network
[Figure: Barrier/Allreduce elapsed time (microseconds) vs number of nodes (48x6xZ), comparing Tofu-barrier and software implementations; quoted from K computer performance data]
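To see what a scalable Allreduce algorithm looks like, here is a sketch of recursive doubling, a classic scheme that finishes in log2(p) exchange rounds: in round k every rank combines its value with the rank at distance 2^k. This is a generic textbook algorithm simulated over a list, offered as an analogy only; the actual Tofu-optimized algorithms are tailored to the 6D torus topology and are not described in detail here.

```python
def recursive_doubling_allreduce(values, op=lambda a, b: a + b):
    """Simulate recursive-doubling allreduce over `values` (one entry
    per rank, power-of-two rank count): after log2(p) rounds every
    rank holds op applied across all initial values."""
    p = len(values)
    assert p & (p - 1) == 0 and p > 0, "power-of-two ranks only in this sketch"
    vals = list(values)
    dist = 1
    while dist < p:
        # Round: each rank r exchanges with partner r XOR dist and combines.
        vals = [op(vals[r], vals[r ^ dist]) for r in range(p)]
        dist *= 2
    return vals

# 4 ranks, 2 rounds: every rank ends with the global sum.
summed = recursive_doubling_allreduce([1, 2, 3, 4])
```

The logarithmic round count is what keeps collectives cheap at scale; the Tofu barrier facility goes further by running the reduction in hardware, which is why the Tofu-barrier curves in the figure stay nearly flat as node counts grow.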
Compiler Optimization for High Performance
- Instruction-level parallelism with SIMD instructions
- Improved computing efficiency using the expanded registers
- Improved cache efficiency using the sector cache
[Figure: NPB3.3 LU and MG execution-time comparisons (relative values, FX1 vs PRIMEHPC FX10), broken down into memory wait, operation wait, cache misses, and instructions committed; efficient use of the expanded registers reduces operation wait in LU, and the SIMD implementation reduces instructions committed in MG]
Application Tuning Cycle and Tools
- Tuning cycle: execution -> job information -> MPI tuning / CPU tuning / overall tuning
- FX10-specific tools: profiler, profiler snapshot, RMATT, Tofu-PA
- Open source tools: Vampir-trace, PAPI