Job Startup at Exascale:

Size: px

Start display at page:

Download "Job Startup at Exascale:"

Eleanor Alexander
6 years ago
Views:

1 Job Startup at Exascale: Challenges and Solutions Hari Subramoni The Ohio State University

watt Stampede2 @ TACC Hybrid programming models are popular for developing applications Message Passing

2 Current Trends in HPC Supercomputing systems scaling rapidly Multi-/Many-core architectures High-performance interconnects Core density (per node) is increasing Improvements in manufacturing tech More performance per watt TACC Hybrid programming models are popular for developing applications Message Passing Interface (MPI) Partitioned Global Address Space (PGAS) Sunway TaihuLight Fast and scalable job-startup is essential! SC '17 2

3 Why is Job Startup Important? Development and debugging Regression / Acceptance testing Checkpoint - Restart SC '17 3

4 Memory Overhead Towards Exascale: Challenges to Address Dynamic allocation of resources Leveraging high-performance interconnects Exploiting opportunities for overlap State of the art Towards Exascale Job Startup Performance Minimizing memory usage SC '17 4

5 Time Taken (Seconds) Challenge: Avoid All-to-all Connectivity Connection Setup PMI Exchange Memory Registration Shared Memory Setup K 2K 4K Application Processes Average Peers BT EP MG SP D Heat Connection setup phase takes 85% of initialization time with 4K processes Applications rarely require full all-to-all connectivity SC '17 5

6 On-demand Connection Management Main Connection Main Connection Thread Manager Thread Thread Manager Thread Put/Get Create QP (P2) QP Init Enqueue Send Connect Request (LID, QPN) (address, size, rkey) Create QP Connect Reply (LID, QPN) QP Init QP RTR Connection Established Dequeue Send QP RTR QP RTS (address, size, rkey) Put/Get (P2) QP RTS Connection Established Process 1 Process 2 SC '17 6

7 Time Taken (Seconds) Execution Time (seconds) Results - On-demand Connections Performance of OpenSHMEM Initialization and Hello World Hello World - Static start_pes - Static Hello World - On-demand start_pes - On-demand Execution time of OpenSHMEM NAS Parallel Benchmarks Static On-demand K 2K 4K 8K 0 BT EP MG SP Benchmark Initialization 29.6 times faster Total execution time 35% better On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D K Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS 15) SC '17 7

8 Time Taken (Seconds) Challenge: Exploit High-performance Interconnects in PMI Breakdown of MPI_Init in MVAPICH2 PMI Exchanges Shared Memory Other K 2K 4K 8K Used for network address exchange, heterogeneity detection, etc. Used by major parallel programming frameworks Uses TCP/IP for transport Not efficient for moving large amount of data Required to bootstrap highperformance interconnects PMI = Process Management Interface SC '17 8

9 Time Taken (seconds) PMIX_Ring: A Scalable Alternative Exchange data with only the left and right neighbors over TCP Exchange bulk of the data over High-speed interconnect (e.g. InfiniBand, OmniPath) int PMIX_Ring( char value[], char left[], char right[], ) Comparison of PMI operations Fence Put Gets Ring k 4k 16k PMIX_Ring is more scalable SC '17 9

10 Time Taken (seconds) Time Taken (seconds) Results - PMIX_Ring Performance of MPI_Init and Hello World with PMIX_Ring Hello World (Fence) Hello World (Ring) MPI_Init (Fence) MPI_Init (proposed) K 2K 4K 8K Fence Ring NAS Benchmarks with 1K Processes, Class B Data EP MG CG FT BT SP Benchmark 33% improvement in MPI_Init Total execution time 20% better PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D K Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia 14) SC '17 10

11 Challenge: Exploit Overlap in Application Initialization PMI operations are progressed by the process manager MPI/PGAS library is not involved Can be overlapped with other initialization tasks / application computation Put+Fence+Get combined into a single function - Allgather int PMIX_KVS_Ifence( PMIX_Request *request) int PMIX_Iallgather( const char value[], char buffer[], PMIX_Request *request) int PMIX_Wait( PMIX_Request request) SC '17 11

12 Time Taken (Seconds) Time Taken (Seconds) Results - Non-blocking PMI Collectives Performance of MPI_Init Fence Ifence Allgather Iallgather Comparison of Fence and Allgather PMI2_KVS_Fence PMIX_Allgather K 4K 16K K 4K 16K Near-constant MPI_Init at any scale Allgather is 38% faster than Fence Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D K Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 15) SC '17 12

13 Challenge: Minimize Memory Footprint Address table and similar information is stored in the PMI Key-value store (KVS) Process Manager All processes in the node duplicate the KVS PMI Key-Value Store (KVS) PPN redundant copies per node PMI KVS PMI KVS PMI KVS Process 1 Process 2 Process N PPN = per Node SC '17 13

14 Shared Memory based PMI Hash Table (Table) Process manager creates and populates shared memory region 1 2 Key Value Store (KVS) Top MPI processes directly read from shared memory 3 Empty Head Tail Key Value Next Dual shared memory region based hash-table design for performance and memory efficiency SC '17 14

15 Time Taken (milliseconds) Memory Usage per Node (MB) Shared Memory based PMI Time Taken by one PMI_Get Default Shmem PMI Memory Usage Fence - Default Allgather - Default Fence - Shmem Allgather - Shmem K 64K 128K 256K 512K 1M PMI Gets are 1000x faster Memory footprint reduced by O(PPN) SHMEMPMI Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D K Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 16) SC '17 15

16 Efficient Intra-node Topology Discovery Previous Design Current Design Process 0 Process 0 Process 0 Process 0 Process 0 Process 0 hwloc hwloc hwloc hwloc hwloc hwloc /proc/ filesystem Contention! /proc/ filesystem No Contention! Significant improvement on Many-core systems SC '17 16

17 K 2K 4K 8K 16K 32K 64K 128K 181K 232K MPI_Init (Seconds) Time Taken (Seconds) Startup Performance on KNL + Omni-Path MPI_Init on TACC Stampede-KNL MPI_Init and Hello World on Oakforest-PACS Intel MPI 2018 beta MVAPICH2 2.3a 8.8x Hello World (MVAPICH2-2.3a) MPI_Init (MVAPICH2-2.3a) 21s s s K 2K 4K 8K 16K 32K 64K MPI_Init takes 22 seconds on 231,956 processes on 3,624 KNL nodes (Stampede Full scale) 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC) At 64K processes, MPI_Init and Hello World takes 5.8s and 21s respectively (Oakforest-PACS) All numbers reported with 64 processes per node SC '17 17

18 Memory Required to Store Endpoint Information Summary P a b c M d a On-demand Connection b PMIX_Ring P PGAS State of the art e c PMIX_Ibarrier M MPI State of the art d PMIX_Iallgather O PGAS/MPI Optimized O e Shmem based PMI Job Startup Performance Near constant MPI/OpenSHMEM initialization at any process count 10x and 30x improvement in startup time of MPI and OpenSHMEM with 16,384 processes (1,024 nodes) Full scale startup on Stampede2 under 22 seconds with 232K processes O(PPN) reduction in PMI memory footprint Optimized designs available in MVAPICH2 and MVAPICH2X-2.3b SC '17 18

19 Thank You!

Job Startup at Exascale: Challenges and Solutions

Job Startup at Exascale: Challenges and Solutions Sourav Chakraborty Advisor: Dhabaleswar K (DK) Panda The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Tremendous increase