The Role of InfiniBand Technologies in High Performance Computing
Managed by UT-Battelle for the Department of Energy


1 The Role of InfiniBand Technologies in High Performance Computing

2 Contributors: Gil Bloch, Noam Bloch, Hillel Chapman, Manjunath Gorentla Venkata, Richard Graham, Michael Kagan, Vasily Philipov, Steve Poole, Ishai Rabinovich, Ariel Shahar, Gilad Shainer, Pavel Shamis, Josh Ladd

3 Outline: Spider file system; CORE-Direct; InfiniBand overview; new InfiniBand capabilities; software design for collective operations; results

4 Spider File System at the Oak Ridge Leadership Computing Facility

5 Motivation for the Spider File System
- Building a dedicated file system for each platform does not scale operationally
- Storage is often 10% or more of a new system's cost
- Bundled storage is often not poised to grow independently of the attached machine; storage and compute technology follow different curves
- Data needs to be moved between compute islands, for example from the simulation platform to the visualization platform
- Dedicated storage is only accessible when its machine is available
- Managing multiple file systems requires more manpower
(Figure: Lens, Smoky, Ewok, Jaguar XT4, and Jaguar XT5 sharing data over the SION network and the Spider system)

6 Spider: A System at Scale
- Over 10.7 PB of RAID 6 capacity on 13,440 1 TB drives
- 192 storage servers with over 3 TB of memory (Lustre OSS)
- Available to many compute systems through a high-speed network: over 3,000 IB ports, over 5 km of cable, over 26,000 client mounts for I/O
- Demonstrated I/O performance: 240 GB/s
- Current status: in production use on all major OLCF computing platforms
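A quick consistency check on those headline numbers, assuming the common 8+2 RAID 6 tier geometry (an assumption, since the slide does not state it): 13,440 drives × 1 TB × 8/10 = 10,752 TB, roughly the quoted 10.7 PB of usable capacity.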

7 Spider: Couplet and Scalable Cluster
(Diagram of a Scalable Cluster (SC): three DDN couplets (2 controllers each), each with 280 1 TB disks in 5 disk trays, each served by a Flextronics 24-port IB switch and an OSS unit of 4 Dell nodes, with IB uplinks to the Cisco core switch; 16 SC units on the floor, 2 racks per SC)

8 Snapshot of Technical Challenges Solved
- Performance: asynchronous journaling; network congestion avoidance (topology-aware I/O)
- Scalability: 26,000 clients; 7 OSTs per OSS; lessons from server-side client statistics
- Fault tolerance and reliability: network, I/O servers, storage arrays
(Figure: SeaStar torus congestion)

9 Spider - How Did We Get Here?
- A four-year project; we didn't just pick up the phone and order a center-wide file system
- No single vendor could deliver this system, so trail blazing was required
- Collaborative effort was key to success: ORNL, Cray, DDN, Cisco, and CFS, Sun, Oracle, and now Whamcloud

10 CORE-Direct Technology

11 Problems Being Addressed: Collective Operations
- Collective communication characteristics at scale
- Overlapping computation with communication: true asynchronous communication
- System noise
- Performance and scalability
Goal: avoid using the CPU for communication processing by offloading communication management to the network

12 Collective Communications
- Communication pattern involving multiple processes (in MPI, all ranks in the communicator are involved)
- Optimized collectives involve a communicator-wide, data-dependent communication pattern
- Data needs to be manipulated at intermediate stages of a collective operation
- Collective operations limit application scalability
- Collective operations magnify the effects of system noise

13 Scalability of Collective Operations (figure: ideal algorithm vs. the impact of system noise)

14 Scalability of Collective Operations - II (figure: offloaded algorithm vs. nonblocking algorithm; legend: communication processing)
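To make the nonblocking case concrete, here is a minimal sketch of how an application overlaps computation with a barrier, written against the MPI-3 MPI_Ibarrier interface (the later slides benchmark the pre-standard MPIX_Ibarrier, which has the same shape); do_useful_work is just a placeholder for application computation.

```c
/* A minimal sketch of the overlap a nonblocking barrier enables.
 * MPI_Ibarrier is the MPI-3 name; the slides benchmark the
 * pre-standard MPIX_Ibarrier, which behaves the same way. */
#include <mpi.h>

static void do_useful_work(void)
{
    /* placeholder for application computation */
}

void barrier_with_overlap(MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Ibarrier(comm, &req);                      /* post the collective */
    while (!done) {
        do_useful_work();                          /* CPU keeps computing */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* poll for completion */
    }
}
```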

15 Approach to Solving the Problem: Co-Design
- Network stack design (Mellanox)
- Hardware development (Mellanox)
- Application-level requirements (ORNL)
- MPI/SHMEM-level implementation (joint)

16 InfiniBand Collective Offload
Key idea:
- Create a local description of the communication pattern
- Hand the description to the HCA
- Manage collective communication at the network level
- Poll for collective completion
New support added: synchronization primitives in hardware (send-enable task, receive-enable task, wait task), the multiple work request (a sequence of network tasks), and the management queue

17 InfiniBand Hardware Changes
- Tasks defined in the current standard: send, receive, read, write, atomic
- New support: synchronization primitives in hardware (send-enable task, receive-enable task, wait task), the multiple work request (a sequence of network tasks), and the management queue
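The standard tasks above correspond to existing verbs work-request opcodes, while the new task types are vendor extensions that are not part of the standard verbs API. The sketch below lists the real opcodes and models the additions with made-up names (TASK_WAIT and friends are assumptions for illustration, not the CORE-Direct interface).

```c
/* The tasks "defined in the current standard" map onto existing verbs
 * work-request opcodes (enum ibv_wr_opcode in <infiniband/verbs.h>);
 * receives are posted separately via ibv_post_recv and have no opcode.
 * The offload additions below are NOT standard verbs; the enum at the
 * bottom only sketches their semantics with illustrative names. */
#include <infiniband/verbs.h>

static const enum ibv_wr_opcode standard_tasks[] = {
    IBV_WR_SEND,                  /* send   */
    IBV_WR_RDMA_WRITE,            /* write  */
    IBV_WR_RDMA_READ,             /* read   */
    IBV_WR_ATOMIC_CMP_AND_SWP,    /* atomic */
    IBV_WR_ATOMIC_FETCH_AND_ADD,  /* atomic */
};

/* Illustrative placeholders for the new hardware task types. */
enum offload_task {
    TASK_WAIT,         /* block the queue until a given completion arrives */
    TASK_SEND_ENABLE,  /* activate a previously posted, disabled send      */
    TASK_RECV_ENABLE,  /* activate a previously posted, disabled receive   */
};
```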

18 Standard InfiniBand Connected Queue Design

19 Queue Structure (diagram)
- Per-communicator resources: collective MQ, MQ CQ, service MQ, a shared send CQ, and all send queues (small data, resource recycling, large data, credit QP)
- Per-peer resources: a send/receive QP per peer, each with its own receive CQ
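One way to picture that split in code is the sketch below; every type and field name is an illustrative assumption, not the implementation's actual data structures.

```c
/* Sketch of the per-communicator vs. per-peer resource split on this
 * slide.  All names here are illustrative assumptions. */
#include <stddef.h>

struct peer_resources {            /* one per remote rank in the communicator */
    void *qp;                      /* send/receive queue pair to that peer    */
    void *recv_cq;                 /* its own receive completion queue        */
};

struct comm_resources {            /* one per MPI communicator                */
    void *collective_mq;           /* management queue holding task lists     */
    void *mq_cq;                   /* completion queue for the MQ             */
    void *service_mq;              /* internal service operations             */
    void *send_cq;                 /* completion queue shared by all sends    */
    void *small_data_qp;           /* eager path                              */
    void *large_data_qp;           /* rendezvous path                         */
    void *credit_qp;               /* credit / flow-control messages          */
    void *recycling_qp;            /* resource recycling                      */
    struct peer_resources *peers;  /* array with one entry per peer           */
    size_t npeers;
};
```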

20 Collectives Software Layers (diagram)
Open MPI Modular Component Architecture: the collective framework (basic collectives component, tuned point-to-point collectives component, and the ML hierarchical collectives component), a subgroup framework (socket, shared memory, IBNET), and the underlying paths (point-to-point, shared memory, IB, IB offload) over MLNX OFED

21 Example: 4-Process Recursive Doubling (figure of the exchange steps)

22 4-Process Barrier Example
Algorithm (recursive doubling):
- Step 1: proc 0 exchanges with proc 1, proc 1 with proc 0, proc 2 with proc 3, proc 3 with proc 2
- Step 2: proc 0 exchanges with proc 2, proc 1 with proc 3, proc 2 with proc 0, proc 3 with proc 1
MWR task lists:
- Proc 0: send to 1, wait on recv from 1; send to 2, wait on recv from 2
- Proc 1: send to 0, wait on recv from 0; send to 3, wait on recv from 3
- Proc 2: send to 3, wait on recv from 3; send to 0, wait on recv from 0
- Proc 3: send to 2, wait on recv from 2; send to 1, wait on recv from 1
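The per-rank lists follow directly from the recursive-doubling pairing rule: at step k, rank r exchanges with rank r XOR 2^k, and each exchange contributes a send task plus a wait-on-receive task. The small, runnable program below regenerates the table above; the textual task encoding is only illustrative, not the CORE-Direct API.

```c
/* Regenerates the MWR table above from the recursive-doubling rule. */
#include <stdio.h>

int main(void)
{
    const int nprocs = 4;                 /* matches the slide */
    for (int rank = 0; rank < nprocs; rank++) {
        printf("Proc %d:", rank);
        for (int dist = 1; dist < nprocs; dist <<= 1) {
            int peer = rank ^ dist;       /* recursive-doubling partner */
            printf(" send to %d, wait on recv from %d;", peer, peer);
        }
        printf("\n");
    }
    return 0;
}
```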

23 4-Process Barrier Example, Queue View
Send QP contents:
- Proc 0: send to proc 1 (enabled); send to proc 2 (not enabled)
- Proc 1: send to proc 0 (enabled); send to proc 3 (not enabled)
- Proc 2: send to proc 3 (enabled); send to proc 0 (not enabled)
- Proc 3: send to proc 2 (enabled); send to proc 1 (not enabled)
MQ completion contents:
- Proc 0: recv wait from 1; send enable 1; recv wait from 2
- Proc 1: recv wait from 0; send enable 0; recv wait from 3
- Proc 2: recv wait from 3; send enable 3; recv wait from 0
- Proc 3: recv wait from 2; send enable 2; recv wait from 1

24 8-Process Barrier Example, Queue View with No MQ (view at rank 0): tasks are chained across QP 1, QP 2, and QP 4: send on QP 1, wait on QP 1; wait on QP 1, send on QP 2, wait on QP 2; send on QP 4, wait on QP 4

25 Network and System Hierarchy (figure: node, socket, system; legend: unused core, occupied core)

26 Benchmarks

27 System Setup
- 8-node cluster
- Node architecture: 3 GHz Intel Xeon, dual socket, quad core
- Network: ConnectX-2 HCAs and a 36-port QDR switch, running pre-release firmware

28 Barrier Data

29 8-Node Blocking MPI Barrier

30 MPI Barrier - Offloaded

31 MPI Barrier Comparison with PtP

32 MPIX_Ibarrier Performance

33 Nonblocking Barrier Overlap, Multiple Work Quanta

34 Nonblocking Barrier Overlap, One Work Quantum

35 Barrier Data: Hierarchy

36 Flat Barrier Algorithm (figure: two hosts, with inter-host communication at each step)

37 Hierarchical Barrier Algorithm (figure: two hosts, with on-host steps and a single inter-host communication step)
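A plain-MPI sketch of the hierarchical idea follows: ranks on a host synchronize locally, one leader per host takes part in the single inter-host step, and the local ranks are then released. It uses the MPI-3 MPI_Comm_split_type for brevity and only mirrors the algorithm; the deck's implementation performs the on-host steps in shared memory and offloads the inter-host step to the HCA.

```c
/* Hierarchy-aware barrier sketch: on-host gather, one inter-host step
 * among the per-host leaders, then an on-host release.  In a real
 * library the communicators would be created once, not per call. */
#include <mpi.h>

void hierarchical_barrier(MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    /* Group the ranks that share a host. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* A communicator containing only the on-host leaders
     * (rank 0 of each node_comm); other ranks get MPI_COMM_NULL. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    MPI_Barrier(node_comm);            /* step 1: on-host gather          */
    if (node_rank == 0)
        MPI_Barrier(leader_comm);      /* step 2: single inter-host step  */
    MPI_Barrier(node_comm);            /* step 3: on-host release         */

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```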

38 MPI Barrier Timings

39 Barrier Timings: Blocking vs. Nonblocking

40 Nonblocking Barrier Overlap

41 Broadcast Data

42 IB Large-Message Algorithm (figure: processes I and J, data and credit QPs)
1) Register the receive memory
2) Notify the sender (credit message on the credit QP)
3) Wait on the credit message
4) Send the user data
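The same ordering can be mirrored with plain MPI point-to-point calls, as in the runnable two-rank sketch below. This only illustrates the handshake: the deck implements it at the IB level, where a hardware wait task on the credit QP, rather than a blocking receive on the host, holds back the data send.

```c
/* Two-rank MPI sketch of the credit-based large-message path: the
 * receiver posts its buffer, sends a small credit message, and the
 * sender transmits the bulk data only after the credit arrives. */
#include <mpi.h>
#include <stdlib.h>

#define LEN        (1 << 20)
#define TAG_CREDIT 0
#define TAG_DATA   1

int main(int argc, char **argv)
{
    int rank;
    char credit = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(LEN);

    if (rank == 1) {                                  /* receiver          */
        MPI_Request rreq;
        MPI_Irecv(buf, LEN, MPI_CHAR, 0, TAG_DATA,
                  MPI_COMM_WORLD, &rreq);             /* 1) buffer posted  */
        MPI_Send(&credit, 1, MPI_CHAR, 0, TAG_CREDIT,
                 MPI_COMM_WORLD);                     /* 2) notify sender  */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    } else if (rank == 0) {                           /* sender            */
        MPI_Recv(&credit, 1, MPI_CHAR, 1, TAG_CREDIT,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* 3) wait on credit */
        MPI_Send(buf, LEN, MPI_CHAR, 1, TAG_DATA,
                 MPI_COMM_WORLD);                     /* 4) send user data */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```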

43 Broadcast Latency (table of usec per call at 16 B, 1 KB, and MB-scale message sizes for IBOff + SM, IBOff, P2P + SM, Open MPI default, and MVAPICH)

44 Nonblocking Broadcast Latency (table of usec per call at 16 B, 1 KB, and MB-scale message sizes for IBOff + SM, IBOff, and P2P + SM)

45 Broadcast small data - hierarchical

46 Broadcast large data - hierarchical

47 Overlap Measurement
Polling method:
1. Post the broadcast
2. Do work and poll for completion
3. Continue until the broadcast completes
Post-work-wait method:
1. Post the broadcast
2. Do work
3. Wait for broadcast completion
4. Compare the time of steps 1-3 with plain post-wait (no work)
5. Increase the work and repeat steps 1-4 until the post-work-wait time exceeds the post-wait time
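A sketch of the two measurement methods is shown below, written against the MPI-3 MPI_Ibcast (the deck used the pre-standard MPIX_ prefix for its nonblocking broadcast); work(n) is a stand-in for n units of application computation.

```c
/* Sketch of the two overlap-measurement methods on this slide. */
#include <mpi.h>

static void work(int units)                 /* placeholder computation */
{
    volatile double x = 0.0;
    for (int i = 0; i < units * 1000; i++) x += i * 0.5;
}

/* Polling method: post, then interleave work with completion polls. */
double bcast_polling(void *buf, int count, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;
    double t0 = MPI_Wtime();
    MPI_Ibcast(buf, count, MPI_BYTE, 0, comm, &req);
    while (!done) {
        work(1);
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    return MPI_Wtime() - t0;
}

/* Post-work-wait method: post, do a fixed amount of work, then wait.
 * The caller grows `units` until this time exceeds plain post-wait. */
double bcast_post_work_wait(void *buf, int count, int units, MPI_Comm comm)
{
    MPI_Request req;
    double t0 = MPI_Wtime();
    MPI_Ibcast(buf, count, MPI_BYTE, 0, comm, &req);
    work(units);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return MPI_Wtime() - t0;
}
```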

48 Nonblocking Broadcast Overlap - Poll

49 Nonblocking Broadcast Overlap - Wait

50 All-To-All Data

51 All-To-All: 1 Byte

52 All-To-All: 64 Bytes

53 All-To-All: 128 Bytes

54 All-To-All: 4 MB/process

55 Allgather Data

56 All-Gather: 1 Byte

57 All-Gather: 128 Bytes

58 All-Gather: Bytes

59 Summary
- Added hardware support for offloading broadcast operations
- Developed MPI-level support for one-copy asynchronous transfer of large, contiguous data
- Good collective performance
- Good overlap capabilities
