Results from TSUBAME3.0: A 47 AI-PFLOPS System for HPC & AI Convergence


1 Results from TSUBAME3.0: A 47 AI-PFLOPS System for HPC & AI Convergence
Jens Domke, Research Staff at MATSUOKA Laboratory, GSIC, Tokyo Institute of Technology, Japan
Omni-Path User Group, 2017/11/14, Denver, Colorado, USA


2 Overview of TSUBAME3.0
BYTES-centric architecture: scalability to all 2160 GPUs, all nodes, the entire memory hierarchy.
- Full operations since Aug
- Compute nodes: SGI ICE XA + new blade; Intel Xeon CPU x2 + NVIDIA Pascal x4 (NVLink); 256 GB memory; 2 TB Intel NVMe SSD
- Performance: 47.2 AI-Petaflops, 12.1 Petaflops
- Interconnect: Intel Omni-Path, 4 ports/node, full bisection bandwidth, 432 Terabits/s bidirectional (about 2x the bandwidth of the entire Internet backbone traffic)
- Storage: DDN (Lustre FS 15.9 PB + Home 45 TB)

3 TSUBAME3.0 Compute Node
SGI ICE XA, a new compute blade co-designed by SGI and Tokyo Tech GSIC.
[Figure: node and network block diagram. Labels: CPU 0 and CPU 1 linked by QPI; PCH attached via DMI; SSD; NVLink; Optane NVM; compute blade x9; x60 sets (540 nodes); 400 Gbps / node for HPC and DNN; x60 pairs (total 120 switches); Terabytes Memory]
SGI ICE XA infrastructure:
- Intel Omni-Path director switch, full-bisection fat-tree network
- 432 Terabit/s bidirectional for HPC and DNN
Ultra high performance & bandwidth "fat node":
- High performance: 4 SXM2 (NVLink) NVIDIA Pascal + Intel Xeon, 84 AI-TFLOPS
- High network bandwidth: Intel Omni-Path 100 Gbps x 4 = 400 Gbps (100 Gbps per GPU)
- High I/O bandwidth: Intel 2 TB NVMe per node; > 1 PB and 1.5~2 TB/s system total
- Future Optane 3D XPoint memory: a petabyte or more directly accessible
- Ultra high density, hot-water-cooled blades: 36 blades / rack, 50-60 kW, x10 thermals c.f. IDC
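The system-wide figures on the two slides above can be cross-checked from the per-node numbers. The following is a minimal sketch of that arithmetic, assuming 540 nodes, 4 Pascal GPUs and 4 Omni-Path ports per node at 100 Gbps each, and 2 TB of NVMe per node, as quoted on the slides. The variable names are illustrative only, and decimal units (1 PB = 1000 TB) are an assumption.

```python
# Per-node figures quoted on the slides.
NODES = 540              # 60 sets x 9 compute blades
GPUS_PER_NODE = 4        # NVIDIA Pascal, SXM2 (NVLink)
PORTS_PER_NODE = 4       # Omni-Path ports per node
GBPS_PER_PORT = 100      # Omni-Path link speed
NVME_TB_PER_NODE = 2     # Intel NVMe SSD per node

total_gpus = NODES * GPUS_PER_NODE
node_bw_gbps = PORTS_PER_NODE * GBPS_PER_PORT

# Injection bandwidth summed over all nodes, one direction, in Tbit/s.
unidir_tbps = NODES * node_bw_gbps / 1000
bidir_tbps = 2 * unidir_tbps

# Assumption: decimal units, 1 PB = 1000 TB.
nvme_pb = NODES * NVME_TB_PER_NODE / 1000

print(f"GPUs: {total_gpus}")
print(f"Per-node bandwidth: {node_bw_gbps} Gbps")
print(f"Aggregate bidirectional bandwidth: {bidir_tbps} Tbit/s")
print(f"Aggregate NVMe capacity: {nvme_pb} PB")
```

Under these assumptions the products agree with the slides: 2160 GPUs, 400 Gbps per node, 432 Tbit/s bidirectional, and slightly more than 1 PB of node-local NVMe.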

4 Resource Partitioning with Containers in TSUBAME 3.0
Divide a compute node into partitions in a hierarchical manner:
- F: full node
- H: 1/2 node
- Q: 1/4 node
- G: CPU core
- C4: 4 CPU cores
- C1: 1 CPU core
Fig. 4: T3 node partitioning example (figure labels: CPU, CPU; H 14 cores, Q 7 cores, G 2 cores, C4 4 cores)
December 11, 2017, Jens Domke
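The partition types can be read as a small lookup table of core counts. The sketch below is illustrative only and is not the scheduler's actual logic: the core counts for H, Q, G and C4 come from the figure labels, the 28-core full node is an assumption (2 CPUs x 14 cores, twice the H partition), and the hypothetical helper `fits_on_node` checks CPU cores only, ignoring GPUs, memory and the hierarchical placement rules.

```python
# Cores per partition type, read off the slide's figure labels.
# Assumption: F (full node) = 2 CPUs x 14 cores = 28 cores.
PARTITION_CORES = {"F": 28, "H": 14, "Q": 7, "G": 2, "C4": 4, "C1": 1}
NODE_CORES = PARTITION_CORES["F"]


def fits_on_node(partitions):
    """Return True if the requested partitions need no more cores than one node has.

    Hypothetical helper: counts CPU cores only.
    """
    return sum(PARTITION_CORES[p] for p in partitions) <= NODE_CORES


# 14 + 7 + 4 + 2 + 1 = 28 cores: exactly one node's worth.
print(fits_on_node(["H", "Q", "C4", "G", "C1"]))
# 14 + 14 + 1 = 29 cores: one more than a node provides.
print(fits_on_node(["H", "H", "C1"]))
```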

5 Research opened up by TSUBAME3.0
- Large-scale graphs
- Accelerating conventional HPC apps
- Flame simulations for energy propulsion
- Environment-friendly urban planning
- Real-time tsunami simulation
- Air-water violent flow simulation
- Ultrafast genome analysis, etc.
- Big Data / AI-oriented HPC
- Optimizing system software and ops
- Mutual and semi-automated co-acceleration of HPC and BD/ML/AI
- Future Big Data AI supercomputer design
- Big Data and ML/AI apps and methodologies
- Image and video
- Robots / drones

6 Early User Experiences with T3
The Good (HW + management tools):
- More stable hardware environment (compared to T2's IB)
- Less performance variability (BW/latency) for large-scale jobs
The Bad (SW stack immature + bugs):
- Issues with direct and MPICH, which affected many -based legacy codes transitioning from T2
- 1 process seems to underutilize 4 s (only 1 used?)
- Not all GASNet / DL workloads work out of the box (verbs?)
- Endpoints unable to connect (software stack issue?)
- Thanks to Intel's ongoing efforts, (soon) fixed in newer releases
The Ugly (operation-critical bugs):
- Dead nodes after HPL run (software stack issue?)
- Load-related connection issues with (fixed in 10.6)
Fig. 5: Intel director switch


7 Cache Misses: Ping-Pong Latency Killer?
Xeon CPUs, FDR SW/HCAs (Fig. 6: FDR):
- osu_latency + cache flush before MPI_Send
- Intel MPI with verbs to use IB
- L1-L3: 50%-100% latency increase
KNL with integrated Fabric Suite + mvapich2-2.1-hfi (Fig. 7: on KNL):
- 1b latency > 4 us (low clock freq.?)
- L1 flush: no effect
- L2 flush: big slowdown (up to 3.6x)
