DISP: Optimizations Towards Scalable MPI Startup


1 DISP: Optimizations Towards Scalable MPI Startup Huansong Fu, Swaroop Pophale*, Manjunath Gorentla Venkata*, Weikuan Yu Florida State University *Oak Ridge National Laboratory

2 Outline: Background and motivation; Issues with MPI startup; Cost analysis; Design of DISP (Delayed initialization, Module sharing, Prediction-based topology setup); Experiments; Conclusion.

3 Increasing Scale of HPC Systems The scale of High Performance Computing (HPC) systems is increasing rapidly.
Table: Top500 List (Nov 2016)
Rank  Name       Cores       Rmax (TFlop/s)  Rank in Nov 2015
1     Sunway     10,649,600  93,014.6        N/A
2     Tianhe-2   3,120,000   33,862.7        1
3     Titan      560,640     17,590.0        2
4     Sequoia    1,572,864   17,173.2        3
5     Cori       622,336     14,014.7        N/A
6     Oakforest  556,104     13,554.6        N/A

4 MPI Startup at Scale For the last 20 years, the Message Passing Interface (MPI) has been the de facto parallel programming system on HPC systems. However, MPI startup has serious performance issues at scale. Fig.: Time (ms) of MPI_Init vs. MPI_Reduce as the number of processes grows; MPI_Init takes up to 24x longer than MPI_Reduce.

5 MPI Startup Breakdown The startup phase consists of: opal_init and orte_init, which get the backend framework ready; ompi_mpi_init, which initializes OMPI, including the global communicator (ompi_comm_init) and its collective module (mca_coll_select); and sub-communicator initialization (ompi_comm_split, ompi_comm_create, ompi_comm_dup). The work phase then runs collectives (barrier, bcast, reduce) with these communicators. The initialization of communicator objects and collective modules is particularly non-scalable.
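
To make the two phases concrete, the sketch below is a minimal, ordinary MPI program (plain user code, not Open MPI internals): everything up to and including MPI_Comm_split belongs to the startup phase described above, and the collectives afterwards form the work phase.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        /* Startup phase: backend init, global communicator creation and
         * collective-module selection all happen inside MPI_Init. */
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Sub-communicator init: every process creates another
         * communicator object and selects a collective module for it. */
        MPI_Comm half;
        MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &half);

        /* Work phase: the communicators are finally used for collectives. */
        int local = rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, half);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of even ranks: %d (out of %d processes)\n", sum, size);

        MPI_Comm_free(&half);
        MPI_Finalize();
        return 0;
    }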

6 Issues with Comm & Coll Init There can be many communicators for various uses. For every communicator, an identical communicator object must be created on every participating process. A communicator object contains its basic info (c_id, rank, size) and a collective module that orchestrates collective communication. Thus, computation time and memory consumption grow linearly with the number of processes. Fig.: Three processes each holding communicator objects and collective modules for Comm A, B and C, for a total of 9 communicator objects and collective modules.
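
As a rough mental model (a simplified sketch, not Open MPI's actual ompi_communicator_t), each of those per-process objects can be pictured as a small struct holding the basic info plus a pointer to its collective module; with P processes and C communicators, P x C such objects and modules are created in total.

    #include <stddef.h>

    typedef struct coll_module {
        void  *fn_table;     /* collective implementations          */
        void  *topology;     /* topology / group state              */
        size_t state_bytes;  /* per-communicator working buffers    */
    } coll_module_t;

    typedef struct communicator {
        int            c_id;   /* communicator id                   */
        int            rank;   /* this process's rank in the comm   */
        int            size;   /* number of participating processes */
        coll_module_t *coll;   /* collective module selected for it */
    } communicator_t;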

7 Multi-Level Collective Cheetah is a popular framework that provides a suite of fast collectives. Its module for OpenMPI is called Multi-Level (ML). ML has a hierarchical structure: processes in one group communicate with other groups through a higher-level root. Every process needs to set up a global topology in order to communicate.

8 Cost Analysis of MPI_Init We study the time and memory cost of MPI_Init using OpenMPI's default collective module (i.e., Tuned) and ML. The two show different performance behaviors. In general, the init of ML performs worse than the init of Tuned, and the ML init scales particularly badly in terms of time. Fig.: Cost of MPI_Init using ML as the number of processes grows, reaching about 16 s of time and 24 MB of memory consumption at the largest scale.

9 Cost Breakdown of ML Init The topology setup needs to conduct many all-to-all collective communications across all participating processes in the corresponding communicator. The time spent on this inter-process communication can occupy most of the ML init, and it is the essential cause of ML init's poor scalability. Fig.: Share of inter-process communication in the total ML init cost, rising from 44% at small scale to 93% at the largest scale.

10 Related Work & Our Solution Previous studies have recognized the performance issue of communicator initialization, but most of them have not identified and addressed the non-scalable initialization of the collective module, especially ML. We propose a hybrid solution, Delayed Initialization with Sharing and Prediction (DISP), with three components: 1. Delayed Initialization; 2. Module Sharing; 3. Prediction-based Topology Setup.

11 Delayed Initialization Delay the initialization of a communicator until it is actually used. Instead of a full-fledged communicator, we create a shadow communicator that only contains its basic info; the collective module is built on demand when the communicator runs its first collective. This removes the cost of unused modules. Delayed initialization also facilitates module sharing between successive identical communicators, which removes the initialization cost of identical modules. Fig.: Timeline of MPI startup with shallow init of the global and sub-communicators, followed by on-demand init (or module sharing) when collectives are called; unused communicators are never fully initialized.
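
A minimal sketch of the idea, with hypothetical names (shadow_comm_t, build_coll_module) rather than the authors' actual code: communicator creation only fills in the cheap basic info, and the expensive collective module is built the first time the communicator is used.

    #include <stdlib.h>

    typedef struct coll_module {
        int ready;            /* stands in for topology, buffers, fn table */
    } coll_module_t;

    /* Stand-in for the expensive module construction (topology setup etc.). */
    static coll_module_t *build_coll_module(int c_id, int rank, int size)
    {
        (void)c_id; (void)rank; (void)size;
        coll_module_t *m = malloc(sizeof *m);
        m->ready = 1;
        return m;
    }

    typedef struct shadow_comm {
        int            c_id, rank, size;  /* cheap, filled at creation time */
        coll_module_t *coll;              /* NULL until the comm is used    */
    } shadow_comm_t;

    /* On-demand init: the expensive part is paid only by communicators
     * that actually run a collective; unused ones never pay it. */
    static coll_module_t *get_coll(shadow_comm_t *c)
    {
        if (c->coll == NULL)
            c->coll = build_coll_module(c->c_id, c->rank, c->size);
        return c->coll;
    }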

12 Module Sharing Temporal sharing: a collective module is shared between identical communicators. Spatial sharing: a collective module is shared between MPI processes on the same node; only the root process on that node initializes the module. Fig.: Temporal sharing reuses the collective module of comm A for an identical comm B; spatial sharing lets Process 0 and Process 1 on Node 1 share one collective module for comm A.
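
A minimal sketch of temporal sharing, using assumed, hypothetical names (lookup_shared_module, register_module) rather than the authors' implementation: before building a collective module for a new communicator, a process checks whether an earlier communicator had the same participant list and, if so, reuses its module. A real implementation would also compare the full rank lists rather than trusting the hash alone.

    #define MAX_CACHED 64

    typedef struct coll_module { int refcount; } coll_module_t;

    typedef struct cache_entry {
        unsigned long  group_hash;   /* signature of the participant ranks */
        int            nranks;
        coll_module_t *module;
    } cache_entry_t;

    static cache_entry_t cache[MAX_CACHED];
    static int           n_cached;

    static unsigned long hash_group(const int *ranks, int n)
    {
        unsigned long h = 5381;
        for (int i = 0; i < n; i++)
            h = h * 33 + (unsigned long)ranks[i];
        return h;
    }

    /* Returns an existing module if an identical communicator already built one. */
    coll_module_t *lookup_shared_module(const int *ranks, int n)
    {
        unsigned long h = hash_group(ranks, n);
        for (int i = 0; i < n_cached; i++) {
            if (cache[i].group_hash == h && cache[i].nranks == n) {
                cache[i].module->refcount++;     /* temporal sharing */
                return cache[i].module;
            }
        }
        return NULL;                             /* caller builds a new module */
    }

    /* Called after a module is (expensively) built, so that later
     * identical communicators can reuse it. */
    void register_module(const int *ranks, int n, coll_module_t *m)
    {
        if (n_cached < MAX_CACHED) {
            cache[n_cached].group_hash = hash_group(ranks, n);
            cache[n_cached].nranks     = n;
            cache[n_cached].module     = m;
            n_cached++;
        }
    }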

13 Prediction-based Topology Setup Based on system specifics, every process predicts the topology without exchanging information with others. Our prediction algorithm computes the following information: (1) the highest and lowest hierarchy levels; (2) the ranks of all participating processes; (3) all group lists, containing the ranks of their members; (4) the routing table describing how one process can be reached by another. Fig.: Example hierarchy with its levels (Level 0 to Level 2), groups and member ranks, annotated with the four predicted items.
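
The sketch below is a minimal illustration, under the simplifying assumption of a regular layout (a fixed number of processes per node, ranks numbered consecutively), of how each process can compute its lowest hierarchy levels from local arithmetic alone instead of exchanging topology information; the paper's full algorithm covers all four items above.

    #include <stdio.h>

    typedef struct level_info {
        int group_id;   /* which group this rank belongs to at this level */
        int local_id;   /* position inside that group                     */
        int group_root; /* rank that talks to the next level up           */
    } level_info_t;

    /* Level 0: processes sharing a node.  Level 1: one root per node. */
    static void predict_levels(int rank, int procs_per_node,
                               level_info_t *l0, level_info_t *l1)
    {
        l0->group_id   = rank / procs_per_node;
        l0->local_id   = rank % procs_per_node;
        l0->group_root = l0->group_id * procs_per_node; /* lowest rank on node */

        /* Only node roots participate at level 1. */
        l1->group_id   = 0;
        l1->local_id   = l0->group_id;
        l1->group_root = 0;
    }

    int main(void)
    {
        for (int rank = 0; rank < 8; rank++) {
            level_info_t l0, l1;
            predict_levels(rank, 4, &l0, &l1);
            printf("rank %d: node group %d (local %d, root %d)\n",
                   rank, l0.group_id, l0.local_id, l0.group_root);
        }
        return 0;
    }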

14 Experimental Setup Testbed: all experiments are conducted on Titan's Cray XK6 nodes, each with a 16-core AMD Opteron 6200-series processor and 32 GB of DDR3 memory, connected through a Gemini interconnect, with 600 TB of total storage. Software: OpenMPI and Cheetah. Benchmarks: NAS Parallel Benchmarks v3.3 (customized) and the MVAPICH MPI benchmark suite.

15 Overall Improvement The real improvement is the difference between DISP's improvement to the startup phase and its delay to the work phase. DISP improves ML by a bigger factor than Tuned because of ML's longer initialization cost. Fig. 1: Startup improvement vs. work-phase delay (ms) for Tuned and ML across the NAS benchmarks bt, cg, ep, is, ft, lu, mg and sp.

16 Memory Savings Delayed initialization saves the memory of unused communicators, and module sharing saves the memory of reusable collective modules. Actual savings depend on the ratio between the size of the collective module and the size of the communicator object. Fig. 1: Memory consumption (MB) of the original implementation vs. DISP for Tuned, with average savings of 8.6%. Fig. 2: The same for ML, with average savings of 85.7%.

17 Benefit of Prediction-based Setup By speeding up the initialization of the collective module, topology prediction significantly reduces the cost of MPI initialization calls. Fig. 1: Time (ms) of MPI_Init() for the original implementation vs. DISP, a 70.0% improvement. Fig. 2: Time of MPI_Comm_split and MPI_Comm_create, improved by 63.8% and 74.9%.

18 Conclusion Issues with communicator and collective module initialization can significantly diminish MPI's scalability to thousands of processes or more. We have examined such impact in terms of time and memory cost. By prudently delaying the initialization and sharing reusable collective modules, we can efficiently reduce the time and memory cost. The costly topology setup of the multi-level collective module can be well mitigated by a prediction-based approach without affecting the collective module's functionality.

19 Acknowledgment

20 Thank You and Questions?
