Technical Computing Suite supporting the hybrid system

1 Technical Computing Suite supporting the hybrid system
Supercomputer PRIMEHPC FX10 and PRIMERGY x86 cluster

2 Hybrid System Configuration
- Supercomputer PRIMEHPC FX10: 6D mesh/torus interconnect (Tofu)
- PRIMERGY x86 cluster: fat-tree interconnect (InfiniBand)
- Local file system: temporary area occupied by jobs
- Global file system: data storage area
- Networks: IO network (IB), management network (GbE)
- Login nodes (users): login, compilation, job submission
- Management nodes: control nodes, job management nodes, file management nodes
- Management functions (administrator): system operations management, job operations management

3 System Software Stack
- User/ISV applications
- HPC Portal / System Management Portal
- Technical Computing Suite
  - System operations management: system configuration management, system control, system monitoring, system installation & operation
  - Job operations management: job manager, job scheduler, resource management, parallel execution environment
  - High-performance file system: Lustre-based distributed file system, high scalability, IO bandwidth guarantee, high reliability & availability
  - VISIMPACT(TM): shared L2 cache on a chip, hardware intra-processor synchronization
  - Compilers: hybrid parallel programming, sector cache support, SIMD / register file extensions
  - Support tools: IDE, profiler & tuning tools, interactive debugger
  - MPI library: scalability of high-function barrier communication
- Operating system: Linux-based enhanced operating system (Supercomputer PRIMEHPC FX10), Red Hat Enterprise Linux (PRIMERGY x86 cluster)

4 System Operations Management
- Single system image in FX10 and PRIMERGY
  - Installation / update of packages
  - High availability control
  - Hardware/software monitoring
  - Power control
- Hierarchical structure for large-scale systems
  - Load balancing by using the job management sub node
  - Logical resource partitions for efficient operations
  - Easy to operate with a single system image
- Figure: the administrator operates the cluster through the control node. The cluster is divided into logical resource partitions: Resource Unit #1 (PRIMEHPC FX10: job management sub node, IO nodes, compute nodes) and Resource Unit #2 (PRIMERGY: job management sub node, compute nodes).

5 Installation / Update Packages
- Large-scale system support
  - Hierarchical installer node structure: the installer node (with the package repository) feeds sub installer nodes, which install and update the nodes below them by broadcast/multicast
  - Diskless nodes are supported for FX10
- 2-tier (common/additional) package management
  - Common packages: on all nodes
  - Additional packages: on some nodes
- Figure example: common package (PKG-A, PKG-B, PKG-C), additional package-1 (PKG-D, PKG-E), additional package-2 (PKG-F), applied to node1 through node4
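The 2-tier package rule above can be illustrated with a minimal Python sketch. The package names follow the PKG-A to PKG-F labels on the slide; the assignment of additional packages to node1 through node4 is an assumption made for illustration, since the slide does not state it, and `packages_for` is a hypothetical helper, not part of the product.

```python
# Package groups named after the slide's labels.
COMMON = {"PKG-A", "PKG-B", "PKG-C"}
ADDITIONAL = {
    "additional-1": {"PKG-D", "PKG-E"},
    "additional-2": {"PKG-F"},
}

# Assumed mapping of nodes to additional groups (not given in the source).
NODE_GROUPS = {
    "node1": ["additional-1"],
    "node2": ["additional-1", "additional-2"],
    "node3": [],
    "node4": ["additional-2"],
}


def packages_for(node):
    """Return the sorted package list for a node: common plus its additional groups."""
    pkgs = set(COMMON)  # tier 1: installed on every node
    for group in NODE_GROUPS[node]:
        pkgs |= ADDITIONAL[group]  # tier 2: installed only where assigned
    return sorted(pkgs)


if __name__ == "__main__":
    for node in sorted(NODE_GROUPS):
        print(node, packages_for(node))
```

Keeping the common set separate means a change to a common package is described once, while per-node differences stay confined to the small additional groups.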

6 High Availability Features
- The important nodes have redundancy:
  - Control nodes (installer nodes)
  - Job management nodes
  - Job management sub nodes
  - File servers (Meta Data Server / Object Storage Server)
- Full automatic failover
  - The job management node/sub node runs in hot standby mode.
  - Job execution continues even if the job management node fails.
  - Rapid failover without time lag
- Figure: a user submits jobs to the active job management node, which keeps its data in sync with the standby node; on failure, the standby node becomes active and the jobs keep running.

8 Job Operations Management
- Same job operations in FX10 and PRIMERGY
- Efficient, fair and system-optimal resource usage
  - Backfill scheduling
  - Fair share scheduling
  - System-optimal job scheduling
- Resource / access control
  - Elapsed time / CPU time / physical memory
  - Permission of job operation commands
- Reduced OS jitter / power saving control
- Figure: job timelines (Now, t1, t2, t3) with backfilling disabled and enabled. With backfilling enabled, Job C starts earlier, next to the running job, while Job A and Job B keep their start times.
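The effect of backfilling shown in the figure can be reproduced with a small Python sketch. It uses conservative backfilling (every queued job gets a reservation, and a later job may start early only if it delays no earlier reservation); the source does not say which backfill variant the job scheduler uses, so this variant, the 4-node system, and the job sizes are all assumptions chosen for illustration.

```python
def free_nodes(placed, total, t):
    """Nodes not used at time t by the (start, end, nodes) reservations."""
    return total - sum(n for start, end, n in placed if start <= t < end)


def earliest_start(placed, total, nodes, duration, not_before):
    """Earliest time >= not_before at which `nodes` stay free for `duration`."""
    # Capacity only grows when a reservation ends, so those are the only
    # start times worth trying besides not_before itself.
    candidates = sorted({not_before} | {end for _, end, _ in placed if end > not_before})
    for t in candidates:
        # Free capacity can only drop where another reservation starts.
        checkpoints = [t] + [s for s, _, _ in placed if t < s < t + duration]
        if all(free_nodes(placed, total, p) >= nodes for p in checkpoints):
            return t
    raise ValueError("job needs more nodes than the system has")


def schedule(jobs, total, backfill):
    """Return {job name: start time} for jobs given as (name, nodes, duration)."""
    placed, starts, prev_start = [], {}, 0
    for name, nodes, duration in jobs:
        # Without backfilling, a job may not start before the job queued ahead of it.
        not_before = 0 if backfill else prev_start
        t = earliest_start(placed, total, nodes, duration, not_before)
        placed.append((t, t + duration, nodes))
        starts[name] = t
        prev_start = t
    return starts


# Hypothetical workload: a running job plus queued jobs A, B, C on 4 nodes.
JOBS = [("running", 2, 3), ("A", 4, 2), ("B", 3, 2), ("C", 2, 2)]

if __name__ == "__main__":
    print("backfilling disabled:", schedule(JOBS, 4, backfill=False))
    print("backfilling enabled: ", schedule(JOBS, 4, backfill=True))
```

In this workload Job C fits into the two nodes left idle beside the running job, so enabling backfilling moves only Job C forward.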

9 Optimal Job Scheduling for FX10: Interconnect Topology-Aware Resource Assignment
- One interconnect unit: 12 nodes (2 x 3 x 2)
- Job assignment rule: rectangular solid shape
  - Guarantees neighbor communication
  - Avoids interfering with other jobs
- Rotates nodes to reduce fragmentation
- Figure: in-use and unoccupied nodes on the x, y, z axes
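A minimal Python sketch of the rectangular-solid assignment rule with rotation follows. It is an illustration under simplifying assumptions, not the scheduler's actual algorithm: the node grid is a plain 3D box without torus wraparound, placement is first-fit, and `find_box` and `allocate` are hypothetical names.

```python
from itertools import permutations, product


def box_cells(origin, dims):
    """All node coordinates inside the box starting at origin with size dims."""
    return product(*(range(o, o + d) for o, d in zip(origin, dims)))


def find_box(occupied, grid, shape):
    """Return (origin, rotated_shape) of the first free box, or None."""
    # Trying every axis permutation of the shape is the "rotation" step.
    for dims in sorted(set(permutations(shape))):
        ranges = [range(g - d + 1) for g, d in zip(grid, dims)]
        for origin in product(*ranges):
            if all(cell not in occupied for cell in box_cells(origin, dims)):
                return origin, dims
    return None


def allocate(occupied, grid, shape):
    """Reserve a free box for a job and mark its nodes as in use."""
    found = find_box(occupied, grid, shape)
    if found is None:
        return None
    origin, dims = found
    occupied.update(box_cells(origin, dims))
    return origin, dims


if __name__ == "__main__":
    in_use = set()
    grid = (4, 2, 2)  # hypothetical 4 x 2 x 2 node grid
    print(allocate(in_use, grid, (1, 1, 4)))  # only fits after rotation
    print(allocate(in_use, grid, (2, 2, 2)))  # no free cube remains
```

Because each job owns a whole box, its neighbor traffic stays inside the box and does not cross nodes that belong to other jobs.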

10 Optimal Job Scheduling for FX10: Asynchronous File Staging
- Stage IN: asynchronously transfer files from the global file system to the local file system before the job starts.
- Stage OUT: asynchronously transfer files from the local file system to the global file system after the job ends.
- Computation and file transfer are co-scheduled.
- Figure: compute nodes of PRIMEHPC FX10 run Job A while the IO nodes stage files in and out between the local file system and the global file system (data storage area) over the IO network (IB) and management network (GbE). On the timeline (Now, t1, t2, t3), the stage-in and stage-out of Jobs A, B and C overlap with the running job.
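The benefit of co-scheduling computation and file transfer can be estimated with a small timeline model in Python. The model is an assumption for illustration only: jobs run one after another on the compute nodes, stage-ins are handled one at a time in queue order, stage-outs are handled one at a time, and stage-in and stage-out traffic do not slow each other down. The job durations are made up.

```python
def makespan_synchronous(jobs):
    """Total time when each job stages in, runs, and stages out with no overlap."""
    return sum(stage_in + run + stage_out for stage_in, run, stage_out in jobs)


def makespan_asynchronous(jobs):
    """Total time when staging overlaps with the computation of other jobs."""
    in_done = 0   # time at which the latest stage-in has finished
    cpu_free = 0  # time at which the compute nodes become free
    out_done = 0  # time at which the latest stage-out has finished
    for stage_in, run, stage_out in jobs:
        in_done += stage_in              # stage-ins proceed back to back
        start = max(in_done, cpu_free)   # job needs its files and free nodes
        cpu_free = start + run
        out_done = max(cpu_free, out_done) + stage_out
    return out_done


# Hypothetical jobs as (stage-in, run, stage-out) durations in arbitrary units.
JOBS = [(2, 5, 2), (2, 5, 2), (2, 5, 2)]

if __name__ == "__main__":
    print("synchronous :", makespan_synchronous(JOBS))
    print("asynchronous:", makespan_asynchronous(JOBS))
```

Only the first stage-in and the last stage-out remain on the critical path in this example; the other transfers are hidden behind computation.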

11 Optimal Job Scheduling for PRIMERGY
- Fine-grained node assignment
  - Selection method: balancing / concentration
  - Rank placement policy: pack / unpack
  - Priority control of allocated nodes
  - Execution mode: whether or not a node is occupied by a single job
- Strict core assignment
  - Processes are bound to cores in the job territory.
  - No process can move to cores in another job's territory.
- Figure: Jobs A to D assigned to nodes #0 to #2 under the concentration method; ranks R0 and R1 placed with rank pack and rank unpack; cores of one node split between Job A and Job B.
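The pack and unpack rank placement policies can be sketched in Python as follows. The interpretation is an assumption, since the slide only names the policies: "pack" here fills every core of one node before moving to the next node, and "unpack" spreads consecutive ranks across the allocated nodes in round-robin order. `place_ranks` is a hypothetical helper.

```python
def place_ranks(n_ranks, nodes, cores_per_node, policy):
    """Return {rank: node} for the given placement policy."""
    if n_ranks > len(nodes) * cores_per_node:
        raise ValueError("not enough cores for the requested ranks")
    if policy == "pack":
        # Fill one node completely before using the next one.
        return {r: nodes[r // cores_per_node] for r in range(n_ranks)}
    if policy == "unpack":
        # Spread consecutive ranks over the nodes in round-robin order.
        return {r: nodes[r % len(nodes)] for r in range(n_ranks)}
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    nodes = ["node0", "node1"]
    print("pack  :", place_ranks(4, nodes, 4, "pack"))
    print("unpack:", place_ranks(4, nodes, 4, "unpack"))
```

Packing tends to favor ranks that talk mostly to their neighbors, while unpacking gives each rank a larger share of a node's memory bandwidth; which is better depends on the application.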

12 Reduce OS Jitter
- PRIMEHPC FX10: stripped-down system processing
  - Minimizes OS jitter by using RDMA of Tofu for:
    a. / service health check
    b. System information monitor (remote sadc)
    c. Job information monitor (CPU time / used memory)
- PRIMERGY
  - Isolates OS jitter from jobs by using Hyper-Threading.
  - Avoids conflicts between jobs and OS jitter.
- Figure: on FX10, the IO node reads node memory by RDMA through the ICC without involving the node CPU; on PRIMERGY, each core runs the job on one hardware thread (HT) and the OS on the other.

14 High Scalability
- Achieves highly scalable IO performance with multiple OSSes (OSS: object storage server); throughput grows with the number of servers as servers and storage are added.
- Adapts to various IO models:
  - Parallel IO (MPI-IO)
  - Single stream IO
  - Master IO
- Figure: throughput versus number of servers; a shared file and per-process files distributed across the OSSes.

15 IO Bandwidth Guarantee
- Fair Share QoS: shares IO bandwidth among all users.
  - Figure: IO bandwidth from the login node to the file servers. Without Fair Share QoS, the split between User A and User B is not fair; with Fair Share QoS, it is fair.
- Best Effort QoS: utilizes all IO bandwidth exhaustively.
  - Figure: bandwidth from clients to the file servers, either occupied by one client or shared by all clients (Client(s) A 70%, Client(s) B 30%).
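The difference between an unfair and a fair split can be illustrated with a short Python sketch. The model is an assumption, not the file system's actual QoS mechanism: without fair share, bandwidth is assumed to be divided evenly per client process, so a user running more processes gets more; with fair share, it is divided evenly per user. The user names and the bandwidth value are hypothetical.

```python
from collections import Counter


def share_per_client(clients, total_bw):
    """Bandwidth per user when every client process gets an equal slice."""
    counts = Counter(clients)
    return {user: total_bw * n / len(clients) for user, n in counts.items()}


def share_per_user(clients, total_bw):
    """Bandwidth per user when every user gets an equal slice."""
    users = set(clients)
    return {user: total_bw / len(users) for user in users}


if __name__ == "__main__":
    # One entry per client process: user A runs three, user B runs one.
    clients = ["A", "A", "A", "B"]
    print("without fair share:", share_per_client(clients, 100.0))
    print("with fair share:   ", share_per_user(clients, 100.0))
```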

16 High Reliability and High Availability
- Avoids single points of failure through redundant hardware and a failover mechanism.
- Figure: monitoring and managing software on the file management server; redundant network paths through two IB switches; an MDS pair (active/standby) with failover on shared RAID; an OSS pair (active/active) with failover; dual servers with redundant disk paths to RAID.

17 Performance: I/O Throughput of FEFS
- Achieved the world's top-level throughput on the K computer using over 2,500 OSSes.
  - Write: 965 GB/s
  - Read: 1486 GB/s
- We encountered serious problems along the way: memory shortage and system noise issues.
- Collaborative work with RIKEN
- Figure: IOR write and IOR read throughput [GB/s] (direct, file per process) versus number of OSSes.

19 VISIMPACT: Thread & Process Hybrid-Parallel Programming
- Auto-parallel + MPI
  - Automatic thread parallelization within a chip
  - Process parallelism by MPI
- VISIMPACT: CPU architecture for low-overhead parallelism among cores
  - Inter-core hardware barrier
  - Shared L2 cache
- Tofu barrier facility for collective communication

20 Customized MPI Library for High Scalability
- Point-to-point communication
  - The transfer method is selected according to the data length, process location and number of hops.
- Collective communication
  - Barrier, Allreduce, Bcast and Reduce use the Tofu barrier and reduction facility.
  - Bcast, Allgather, Allgatherv, Allreduce and Alltoall use Tofu-optimized algorithms.
- Quotation from K computer performance data

21 Application Tuning Cycle and Tools
- Tuning cycle: execution, collecting job information, analysis and tuning (MPI tuning, CPU tuning, overall tuning)
- Fujitsu tools: Profiler, Profiler snapshot, Tofu-PA, RMATT (rank mapping)
- Open source tools: PAPI, Vampir-trace

22 Rank Mapping Optimization (RMATT)
- Input
  - Network construction: 16 x 16 x 16 nodes (4096)
  - Communication pattern (communication between ranks), number of ranks: 4096
- Output: optimized rank map that reduces the number of hops and congestion
- MPI_Allgather communication performance
  - x,y,z order mapping: 22.3 ms
  - After remapping with RMATT: 5.5 ms, about 4 times faster
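Why a rank map changes the hop count can be seen with a small Python sketch. It does not reproduce RMATT's algorithm, which the source does not describe; it only evaluates a hop-count metric on a 3D torus for two candidate maps. The 4 x 4 x 4 torus, the ring communication pattern, and the `snake_order` map are assumptions chosen for illustration.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between two coordinates on a torus with wraparound."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


def xyz_order(rank, dims):
    """Place ranks along x first, then y, then z."""
    x_size, y_size, _ = dims
    return (rank % x_size, (rank // x_size) % y_size, rank // (x_size * y_size))


def snake_order(rank, dims):
    """Like xyz_order, but reverse x on odd rows so row ends stay adjacent."""
    x, y, z = xyz_order(rank, dims)
    if y % 2 == 1:
        x = dims[0] - 1 - x
    return (x, y, z)


def total_hops(pairs, mapping, dims):
    """Sum of hop counts over all communicating rank pairs under a mapping."""
    return sum(torus_hops(mapping(a, dims), mapping(b, dims), dims) for a, b in pairs)


DIMS = (4, 4, 4)
N_RANKS = DIMS[0] * DIMS[1] * DIMS[2]
# Hypothetical pattern: each rank sends to the next rank in a ring.
RING = [(r, (r + 1) % N_RANKS) for r in range(N_RANKS)]

if __name__ == "__main__":
    print("x,y,z order:", total_hops(RING, xyz_order, DIMS))
    print("snake order:", total_hops(RING, snake_order, DIMS))
```

A mapping tool can search over candidate maps with a metric of this kind; congestion on shared links, which the slide also mentions, would need a per-link traffic count in addition to the hop total.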
