Topology Awareness in the Tofu Interconnect Series


Yuichiro Ajima, Senior Architect, Next Generation Technical Computing Unit, Fujitsu Limited
June 23rd, 2016, ExaComm2016 Workshop


Introduction
Networks are getting larger
  Systems have tens of thousands of nodes
  Highly scalable network topologies, e.g. multi-dimensional torus, dragonfly
  Channel bisection < 1/2 node count
  Bisection bandwidth < injection bandwidth
Issue: communication algorithms
  Existing general algorithms will be inefficient
  (video) MPI_Bcast on the K computer
  Topology-aware optimization is required
This talk presents the topology-awareness design of the Tofu interconnect series and visualizes the achievements

Tofu Interconnect Series
Highly scalable 6D mesh/torus network
Tofu interconnect
  Developed for the K computer
Tofu interconnect 2
  SoC integration and optical transceiver
Another version is being developed for the Post-K machine

Index
Topology-aware task allocation
Topology-aware optimization
Tuned collective communication library
  Low-level features of the network interface
  Topology-aware algorithms (for long messages)

6D Mesh/Torus Network
Dimension labels: XYZABC
Lengths of the A-, B-, and C-axes are fixed at 2, 3, and 2
(Figure: conceptual model with axes X, Y, Z, A, B, and C)

Task Allocation and Rank Mapping
A rectangular region in the physical 6D network for each task
  Contiguous in the XYZ-axes and not divided in the ABC-axes
Virtual torus rank mapping
  Users define the logical shape of the task as a virtual 1D/2D/3D torus
  The length of each dimension is defined in the batch script
  Example: using the full system of the K computer ( )
    Virtual 1D torus: #PJM -L "node=82944"
    Virtual 2D torus: #PJM -L "node=576x144"
    Virtual 3D torus: #PJM -L "node=54x48x32"
  A rank number reflects the logical coordinates of the process
Embedding a virtual torus into a physical rectangular region
  A nearest-neighbor node in the virtual torus space is guaranteed to be a nearest-neighbor node in the physical 6D network
  The task scheduler may add padding nodes and rotate the shape to increase the chance of allocation
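The relation between a rank number and its logical torus coordinates can be sketched as follows. This is an illustration only: the ordering used here (first dimension varies fastest) is an assumption, not the documented ordering of the Fujitsu job scheduler, and the function names are hypothetical.

```python
from math import prod


def rank_to_coords(rank, shape):
    """Logical coordinates of a rank (assumed order: first dimension fastest)."""
    coords = []
    for length in shape:
        coords.append(rank % length)
        rank //= length
    return tuple(coords)


def coords_to_rank(coords, shape):
    """Inverse of rank_to_coords under the same assumed ordering."""
    rank, stride = 0, 1
    for coord, length in zip(coords, shape):
        rank += coord * stride
        stride *= length
    return rank


def torus_neighbors(rank, shape):
    """Ranks one step away in each direction of the virtual torus (with wraparound)."""
    coords = rank_to_coords(rank, shape)
    neighbors = []
    for dim, length in enumerate(shape):
        for step in (-1, 1):
            moved = list(coords)
            moved[dim] = (coords[dim] + step) % length
            neighbors.append(coords_to_rank(moved, shape))
    return neighbors


# The three example shapes from the batch scripts cover the same node count.
for shape in [(82944,), (576, 144), (54, 48, 32)]:
    print(shape, prod(shape), torus_neighbors(0, shape))
```

The embedding guarantee stated above means that each rank returned by `torus_neighbors` would sit on a physically adjacent node, so stencil-style exchanges with these ranks stay at one hop.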



Manual Tuning with Profiling
Dynamic profiling
  Enable profiling during the application's communication activity
  The profiler periodically samples performance analysis (PA) counters
  The profiling log is saved to storage after profiling
PA counters of the Tofu interconnect
  Each counter is a hardware 64-bit register
  A set of PA counters is provided for each port of the router
  Bytes transferred, busy cycles, idle cycles, packet buffer depleted cycles, etc.
Visualization
  Users find bottlenecks
Manual performance tuning
  MPI and task allocation options
  Communication algorithms
(Figure: screenshot of the Fujitsu Profiler)


Automatic Tuning without Profiling
Custom rank mapping order
  A default rank mapping order often affects the communication performance in a multi-dimensional torus
  One of the optimization candidates right after executing a vanilla code
RMATT (Rank Mapping Automatic Tuning Tool)
  Requires no profiling log but execution statistics
  Calculates rank mapping order using the simulated annealing algorithm
  Users input the shape of the torus and a list of communication patterns
  Each line of the list includes a source and destination pair of processes and the total amount of data transferred during a task
(Figure: RMATT workflow. Input: the MPI application configuration (MPI option, e.g. 32x32x16) and the communication pattern (e.g. "0 1 1kB", "0 4 2kB", "1 0 1kB"). RMATT performs rank mapping optimization. Output: an MPI rank mapping file (machine file, e.g. "0 (0,0,0)", "1 (0,0,1)", "2 (0,1,0)") used by the execution environment.)
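The idea of optimizing a rank mapping with simulated annealing can be sketched as below. This is not RMATT's implementation: the cost function (bytes multiplied by torus hop count), the swap move, the cooling schedule, and all function names are assumptions chosen for illustration. The inputs mirror what the slide describes, namely a torus shape and a list of (source, destination, bytes) entries.

```python
import itertools
import math
import random


def torus_hops(a, b, shape):
    """Shortest hop count between two coordinates on a torus."""
    return sum(min((x - y) % n, (y - x) % n) for x, y, n in zip(a, b, shape))


def mapping_cost(coords_of_rank, pattern, shape):
    """Assumed cost: transferred bytes weighted by hop distance."""
    return sum(nbytes * torus_hops(coords_of_rank[src], coords_of_rank[dst], shape)
               for src, dst, nbytes in pattern)


def anneal_rank_map(shape, pattern, steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    """Return (best mapping, best cost); mapping[rank] is a torus coordinate."""
    rng = random.Random(seed)
    # Start from a plain lexicographic mapping (an assumed default order).
    current = [tuple(c) for c in itertools.product(*(range(n) for n in shape))]
    cost = mapping_cost(current, pattern, shape)
    best, best_cost = list(current), cost
    scale = max(cost, 1)  # normalize cost changes so temperatures are unitless
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        i, j = rng.randrange(len(current)), rng.randrange(len(current))
        current[i], current[j] = current[j], current[i]  # swap two ranks
        new_cost = mapping_cost(current, pattern, shape)
        delta = (new_cost - cost) / scale
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost  # accept the move
            if cost < best_cost:
                best, best_cost = list(current), cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return best, best_cost


# Hypothetical pattern on a 4x4 torus: rank i sends 1 kB to rank (i + 5) mod 16.
shape = (4, 4)
pattern = [(i, (i + 5) % 16, 1024) for i in range(16)]
best, best_cost = anneal_rank_map(shape, pattern, steps=5000)
print(best_cost)
```

Recomputing the full cost after every swap keeps the sketch short; a practical tool would update only the entries that involve the two swapped ranks.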


Evaluations of Improvement by RMATT
NAS Parallel Benchmark (CG)
Case 1: NPROCS=1024, CLASS=B, 2D torus 32x32
  Execution time: Default (x-y order) 1.33 sec, rank map optimized by RMATT 1.24 sec
  7% improved (includes calculation time)
Case 2: NPROCS=8192, CLASS=D, 2D torus 128x64
  Execution time: Default [missing] sec, RMATT 9.98 sec
  9% improved (includes calculation time)


Low-Level Network Interface
Fujitsu's FJMPI is developed based on Open MPI
The tuned collective communication library bypasses the Open MPI stack and uses the low-level network interface directly
(Figure: FJMPI software stack with bypass paths. Components: MPI interface layer, tuned COLL, ob1 PML, tofu LLP, r2 BML, tofu BTL, tofu COMMON, Tofu library, Tofu interconnect)

Simultaneous Communication
Four RDMA engines (Tofu network interfaces, TNIs) per node
  The peak injection bandwidth of each TNI is 5 GB/s for Tofu1 and 12.5 GB/s for Tofu2
The point-to-point messaging layer of the FJMPI uses the four TNIs in a round-robin manner
The tuned COLL identifies the four TNIs to avoid a collision at the destination TNI
(Figure: a CPU connected to RDMA engines 0 to 3 and router links 0 to 9 on the XYZ and ABC axes)
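The round-robin use of the four TNIs and the resulting per-node peak can be illustrated with a small sketch. The function and constant names are hypothetical, and the assignment policy is a plain round-robin model, not the FJMPI code; only the TNI count and per-TNI bandwidths come from the slide.

```python
from itertools import cycle

TNIS_PER_NODE = 4
# Peak injection bandwidth per TNI in GB/s, as stated on the slide.
PEAK_GB_PER_S_PER_TNI = {"Tofu1": 5.0, "Tofu2": 12.5}


def assign_round_robin(message_ids, tni_count=TNIS_PER_NODE):
    """Map each outgoing message to a TNI index in round-robin order."""
    tni = cycle(range(tni_count))
    return {msg: next(tni) for msg in message_ids}


def peak_node_injection(generation):
    """Aggregate peak injection bandwidth when all four TNIs are busy."""
    return TNIS_PER_NODE * PEAK_GB_PER_S_PER_TNI[generation]


print(assign_round_robin(range(6)))
print(peak_node_injection("Tofu1"), peak_node_injection("Tofu2"))
```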

Injection Rate Control
Contention depletes packet buffers and causes congestion
Congestion can be avoided by reducing the injection rate
(Figure: latencies of simultaneous 8-hop data transfers on a 32-node ring. Y-axis: latency (us). X-axis: packet gap (multiples of MTU). Annotated regions: congestion, optimized packet gap, low injection rate)


Overview
Assumed environment
  The shape of the communicator is a mesh or a torus
  One process per node participates in inter-process communication
  When there are multiple processes in a node, collective communication is fanned out through shared memory
Optimization policies (for long messages only)
  Use multiple network interfaces
  Communicate with nearest-neighbor nodes
  Control the injection rate for communication with far nodes
Algorithms implemented in the FJMPI
  Triple trinary tree for broadcast and reduce
  Three-phase quad rings for gather
  Uniformly overlaid symmetrical pattern for all-to-all

Triple Trinary Tree
Broadcasts data by dividing it into three parts and simultaneously propagating each part via a different path
Each path is a spanning trinary tree, and the three trees share no directed edges
By reversing the direction of all edges, data can be reduced

MPI_Allreduce
Two phases
  First phase: reduce data using triple trinary trees
  Second phase: broadcast the reduced data using the reversed trees
(video) MPI_Allreduce on the K computer

Three-Phase Quad Rings
The ring all-gather algorithm transfers data cyclically
Divides data into four parts, and simultaneously transfers each part along a different direction
(Figure: ring all-gather example with blocks A, B, C, and D; panels labeled Phase 1, Phase 2, and Phase 3)
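The basic ring all-gather that the quad-ring algorithm builds on can be simulated in a few lines. This sketch covers only the plain single-direction ring on one dimension; the four-way data split and the three phases of the FJMPI algorithm are not reproduced here, and the function name is hypothetical.

```python
def ring_allgather(blocks):
    """Simulate a ring all-gather: blocks[r] is the block owned by rank r.

    In step s, every rank r forwards the block that originated at rank
    (r - s) mod P to its right neighbor, so after P - 1 steps each rank
    holds all P blocks.
    """
    p = len(blocks)
    gathered = [{rank: blocks[rank]} for rank in range(p)]
    for step in range(p - 1):
        # Collect all sends first so that every rank sends simultaneously.
        sends = []
        for rank in range(p):
            owner = (rank - step) % p
            sends.append(((rank + 1) % p, owner, gathered[rank][owner]))
        for dst, owner, data in sends:
            gathered[dst][owner] = data
    return [[g[owner] for owner in range(p)] for g in gathered]


print(ring_allgather(["A", "B", "C", "D"]))
```

Every transfer in this scheme goes to a nearest neighbor on the ring, which is the property the optimization policies for long messages rely on.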

MPI_Allgather
The three-phase quad ring algorithm
(video) MPI_Allgather on the K computer

Uniformly Overlaid Symmetrical Pattern (1)
A multi-phase all-to-all communication algorithm
In each phase, each process transfers data to multiple processes that have symmetrical relative coordinates
Each phase is divided into sub-phases

Uniformly Overlaid Symmetrical Pattern (2)
For each phase, the communication patterns of all processes are uniform
For each sub-phase, the number of colliding transfers is the same as the hop count of a transfer
Injection rate control for each sub-phase avoids congestion and increases effective throughput

MPI_Alltoall
Uniformly overlaid symmetrical pattern algorithm
(video) MPI_Alltoall on the K computer
  Left: the uniformly overlaid symmetrical pattern algorithm
  Right: the default algorithm of Open MPI

Summary
Topology-awareness design of the Tofu series
Task allocation
  Virtual torus rank mapping
Performance optimization
  Tofu PA counters for manual tuning with the Fujitsu Profiler
  Rank Mapping Automatic Tuning Tool (RMATT)
Tuned collective communication library in the FJMPI
  Utilizes low-level network features
    Simultaneous communication
    Injection rate control
  Topology-aware algorithms for long messages
    Triple trinary tree algorithm for broadcast and reduce
    Three-phase quad rings algorithm for gather
    Uniformly overlaid symmetrical pattern algorithm for all-to-all

