Mellanox HPC-X Software Toolkit Release Notes


NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Mellanox Technologies
350 Oakmead Parkway Suite 100
Sunnyvale, CA 94085 U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Copyright 2018. Mellanox Technologies Ltd. All Rights Reserved. Mellanox, Mellanox logo, Accelio, BridgeX, CloudX logo, CompustorX, Connect-IB, ConnectX, CoolBox, CORE-Direct, EZchip, EZchip logo, EZappliance, EZdesign, EZdriver, EZsystem, GPUDirect, InfiniHost, InfiniBridge, InfiniScale, Kotura, Kotura logo, Mellanox CloudRack, Mellanox CloudXMellanox, Mellanox Federal Systems, Mellanox HostDirect, Mellanox Multi-Host, Mellanox Open Ethernet, Mellanox OpenCloud, Mellanox OpenCloud Logo, Mellanox PeerDirect, Mellanox ScalableHPC, Mellanox StorageX, Mellanox TuneX, Mellanox Connect Accelerate Outperform logo, Mellanox Virtual Modular Switch, MetroDX, MetroX, MLNX-OS, NP-1c, NP-2, NP-3, NPS, Open Ethernet logo, PhyX, PlatformX, PSIPHY, SiPhy, StoreX, SwitchX, Tilera, Tilera logo, TestX, TuneX, The Generation of Open Ethernet logo, UFM, Unbreakable Link, Virtual Protocol Interconnect, Voltaire and Voltaire logo are registered trademarks of Mellanox Technologies, Ltd. All other trademarks are property of their respective owners. For the most updated list of Mellanox trademarks, visit http://www.mellanox.com/page/trademarks

Table of Contents

Table of Contents
List Of Tables
Release Update History
Chapter 1 Overview
  1.1 HPC-X Requirements
  1.2 HPC-X Content
  1.3 Important Notes
Chapter 2 Changes and New Features
Chapter 3 Known Issues
Chapter 4 Bug Fixes History
Chapter 5 Change Log History
  5.1 HPC-X Toolkit Change Log History
  5.2 FCA Change Log History
  5.3 MXM Change Log History
  5.4 HPC-X Open MPI/OpenSHMEM Change Log History

List Of Tables

Table 1: Release Update History
Table 2: Changes and New Features
Table 3: Known Issues
Table 4: Bug Fixes History
Table 5: HPC-X Toolkit Change Log History
Table 6: FCA Change Log History
Table 7: MXM Change Log History
Table 8: HPC-X Open MPI/OpenSHMEM Change Log History

Release Update History

Table 1 - Release Update History
  Release Date: July 5, 2018
  Description: Initial release of this HPC-X version.

1 Overview

These are the release notes for Mellanox HPC-X. The Mellanox HPC-X Software Toolkit is a comprehensive software package that includes Open MPI, OpenSHMEM, PGAS, MXM, UCX, and the FCA tool suite for high performance computing environments. HPC-X provides enhancements that significantly increase the scalability and performance of message communications in the network. HPC-X enables you to rapidly deploy and deliver maximum application performance without the complexity and costs of licensed third-party tools and libraries.

1.1 HPC-X Requirements

The platform and requirements for HPC-X are detailed in the following table:

  OFED / MLNX_OFED: OFED 1.5.3; MLNX_OFED 4.2-x.x.x and above
  HCAs: ConnectX-5 / ConnectX-5 Ex, ConnectX-4 / ConnectX-4 Lx, ConnectX-3 / ConnectX-3 Pro, Connect-IB

1.2 HPC-X Content

The following communications libraries and acceleration packages are part of the Mellanox HPC-X package (Library/Acceleration Package: Version Number):

  Open MPI: 3.1.x
  Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): 1.7
  HCOLL: 4.1
  UCX: 1.4
  MXM: 3.7
  OpenSHMEM specification compliant: 1.3

1.3 Important Notes

When HPC-X is launched in an environment without a resource manager (Slurm, PBS, etc.) installed, or from a compute node, it uses the Open MPI default rsh/ssh-based launcher, which does not propagate the library path to the compute nodes. In such a case, pass the LD_LIBRARY_PATH variable as follows:

  % mpirun -x LD_LIBRARY_PATH -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
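For reference, the HPC-X library paths can be set up on the launch node before invoking mpirun by loading the HPC-X environment. The lines below are a minimal sketch, assuming the toolkit was extracted to a directory pointed to by HPCX_HOME and that the hpcx-init.sh helper script from the HPC-X package is used; see the HPC-X User Manual for the authoritative environment-setup procedure:

  % export HPCX_HOME=/opt/hpcx          # illustrative install path (assumption)
  % source $HPCX_HOME/hpcx-init.sh      # defines the hpcx_load / hpcx_unload helpers
  % hpcx_load                           # adds the HPC-X Open MPI, UCX and HCOLL paths to the environment
  % mpirun -x LD_LIBRARY_PATH -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c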

2 Changes and New Features

HPC-X provides the following changes and new features:

Table 2 - Changes and New Features

  HPC-X Content:
  - Updated the following communications libraries and acceleration packages versions: Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7; HCOLL version 4.1; UCX version 1.4.
  - Removed the KNEM library from the HPC-X package. UCX will use the KNEM available in MLNX_OFED.

  HPC-X Setup:
  - Added support for Singularity containerization. For further information, please refer to the HPC-X User Manual.

  MXM:
  - Open MPI and HCOLL are no longer compiled with MXM. Both are compiled with UCX only and use it by default.

  UCX:
  - osc ucx is no longer the default one-sided component in Open MPI (see the example after this table).
  - Added support for the following UCX features: a new API for establishing client-server connections; out-of-box support for Memory In Chip (MEMIC) on ConnectX-5 HCAs.

  Support:
  - Added support for HPC-X on the Huawei ARM architecture.

  HCOLL:
  - Improved performance by utilizing zero-copy messaging for MPI Bcast.
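Because osc ucx is no longer selected automatically, MPI RMA (one-sided) runs that should use UCX now have to request the component explicitly on the command line. A minimal sketch of such an invocation is shown below; the benchmark binary and rank count are illustrative only:

  % mpirun -mca osc ucx -np 2 ./osu_get_latency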

3 Known Issues

The following is a list of general limitations and known issues of the various components of this HPC-X release.

Table 3 - Known Issues (Sheet 1 of 5)

Internal Ref. 2670
  Description: When enabling the Hardware Tag Matching feature on a large scale, the following error message may be printed due to the increased threshold for BCOPY messages:
    mpool.c:177 UCX ERROR Failed to allocate memory pool chunk: Out of memory.
    (GitHub issue: https://github.com/openucx/ucx/issues/2670)
  Workaround: Add "-x UCX_DC_TM_MAX_BCOPY=8k -x UCX_RC_TM_MAX_BCOPY=8k" to the command line to lower this threshold (see the example after this sheet).
  Keywords: Hardware Tag Matching
  Discovered in Version: 2.2 (UCX 1.4)

Internal Ref. 2454
  Description: Some one-sided benchmarks may hang when using osc ucx. For example: osu-micro-benchmarks-5.3.2/osu_get_acc_latency (Latency Test for accumulate with Active/Passive Synchronization).
    (GitHub issue: https://github.com/openucx/ucx/issues/2454)
  Workaround: N/A
  Keywords: UCX, one_sided
  Discovered in Version: 2.2 (UCX 1.4)

Internal Ref. 2702
  Description: When using the Hardware Tag Matching feature, the following error messages may be printed:
    rcache.c:481 UCX WARN failed to register region 0xdec25a0 [0x2b7139ae0020..0x2b7139ae2020]: Input/output error
    ucp_mm.c:105 UCX ERROR failed to register address 0x2b7139ae0020 length 8192 on md[1]=ib/mlx5_0: Input/output error
    ucp_request.c:259 UCX ERROR failed to register user buffer datatype 0x20 address 0x2b7139ae0020 len 8192: Input/output error
    (GitHub issue: https://github.com/openucx/ucx/issues/2702)
  Workaround: Add "-x UCX_LOG_LEVEL=fatal" to the command line to silence the errors.
  Keywords: Hardware Tag Matching
  Discovered in Version: 2.2 (UCX 1.4)
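As a concrete form of the Hardware Tag Matching workaround above, both threshold variables can be passed directly on the mpirun command line; the application name and rank count below are placeholders:

  % mpirun -x UCX_DC_TM_MAX_BCOPY=8k -x UCX_RC_TM_MAX_BCOPY=8k -np 512 ./my_mpi_app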

Table 3 - Known Issues (Sheet 2 of 5)

Internal Ref. 1307243
  Description: One-sided tests may fail with a segmentation fault.
  Workaround: In order to run one-sided tests, make sure to add "-mca osc ucx" to the command line.
  Keywords: OSC UCX, Open MPI, one-sided
  Discovered in Version: 2.1 (Open MPI 3.1.x)

Internal Ref. -
  Description: When the DC transport is used on a large scale, "Retry exceeded" messages may be printed from UCX.
  Workaround: Configure SL2VL on your OpenSM in the fabric and make UCX use SL=1 for the InfiniBand transports via "-x UCX_IB_SL=1".
  Keywords: UCX, DC transport
  Discovered in Version: 2.1 (UCX 1.3)

Internal Ref. 2267
  Description: The following error message might appear when running at the scale of 256 ranks with the RC transport, when UD is used for wireup only:
    Fatal: send completion with error: Endpoint timeout.
    (GitHub issue: https://github.com/openucx/ucx/issues/2267)
  Workaround: Set the DC transport using the UCX_TLS parameter (see the example after this sheet).
  Keywords: UCX
  Discovered in Version: 2.1 (UCX 1.3)

Internal Ref. -
  Description: When UCX requires more memory than the space defined in the /proc/sys/kernel/shmmni file, the following message is printed from UCX:
    shmget failed: No space left on device. (size=4325376). The max number of shared memory segments in the system is = 4096. Please try to increase this value through /proc/sys/kernel/shmmni.
  Workaround: Follow the instructions in the error message above and increase the number of shared memory segments in the /proc/sys/kernel/shmmni file (see the example after this sheet).
  Keywords: UCX, memory
  Discovered in Version: 2.1 (UCX 1.3)

Internal Ref. 2111
  Description: When UCX is used in multi-threaded mode, the osu_latency_mt test might take a long time to complete.
    (GitHub issue: https://github.com/openucx/ucx/issues/2111)
  Workaround: N/A
  Keywords: UCX, multi-threaded
  Discovered in Version: 2.1 (UCX 1.3)
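The two workarounds above that only name a parameter can be spelled out as follows. Both lines are minimal sketches; the transport list and the segment count are illustrative values, not mandated by this document:

  % mpirun -x UCX_TLS=dc,self,sm -np 256 ./my_mpi_app   # force the DC transport instead of RC
  $ sudo sysctl -w kernel.shmmni=8192                   # raise the shared memory segment limit (requires root)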

Table 3 - Known Issues (Sheet 3 of 5)

Internal Ref. 2226
  Description: The following assertion may fail in certain cases:
    Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed
    (GitHub issue: https://github.com/openucx/ucx/issues/2226)
  Workaround: Set the DC transport using the UCX_TLS parameter.
  Keywords: UCX, assertion
  Discovered in Version: 2.1 (UCX 1.3)

Internal Ref. -
  Description: In Open MPI 3.0.0, the MCA options rmaps_dist_device and rmaps_base_mapping_policy are not functional. A fix will be available in Open MPI 3.1.
  Workaround: N/A
  Keywords: Process binding policy, NUMA/HCA locality
  Discovered in Version: 2.0 (Open MPI 3.0.0)

Internal Ref. -
  Description: The Mellanox SHARP library is not available in HPC-X for the Community OFED and Inbox OFED.
  Workaround: N/A
  Keywords: Mellanox SHARP library
  Discovered in Version: 2.0

Internal Ref. 1162
  Description: UCX currently does not support canceling send requests.
    (GitHub issue: https://github.com/openucx/ucx/issues/1162)
  Workaround: N/A
  Keywords: UCX
  Discovered in Version: 2.0

Internal Ref. 1156724
  Description: When using OpenSHMEM with MXM, a timeout on the UD transport might occur that results in OpenSHMEM hanging during process finalization. The following error is displayed when this happens:
    ud_channel.c:768 Fatal: UD timeout sending to <hostname>
  Workaround: N/A
  Keywords: MXM, OpenSHMEM, timeout
  Discovered in Version: 2.0

Internal Ref. -
  Description: MXM over Ethernet does not function for MTUs higher than 1024B when using firmware version 2.11.0500.
  Workaround: N/A
  Keywords: MXM over Ethernet

Table 3 - Known Issues (Sheet 4 of 5)

Internal Ref. -
  Description: While running, MXM may print excessive log messages.
  Workaround: To minimize the volume of log messages, use "-x MXM_LOG_LEVEL=fatal", i.e.:
    % mpirun -x MXM_LOG_LEVEL=fatal ...
  Keywords: Logs

Internal Ref. -
  Description: A mixed configuration of active ports (one InfiniBand and the other Ethernet) on a single HCA is not supported.
  Workaround: In such a case, specify the port you would like to use with "-x MXM_RDMA_PORTS" or "-x MXM_IB_PORTS" (see the example after this sheet).
  Keywords: Port Configuration

Internal Ref. -
  Description: When the stack size is set to "unlimited", some applications may suffer from performance degradation.
  Workaround: Make sure that 'ulimit -s unlimited' is not set before running MXM.
  Keywords: Performance

Internal Ref. -
  Description: MXM v3.4 and v3.5 require the max_op_vl value in OpenSM to be set to >=3.
  Workaround: Set the MXM environment parameter MXM_OOB_FIRST_SL to 0 from the command line:
    $ mpirun -x MXM_OOB_FIRST_SL=0 ...
  Keywords: OpenSM Configuration

Internal Ref. -
  Description: MXM_IB_USE_GRH must be set to "yes" when one of the following is used:
    1. Socket Direct
    2. Multi-Host
    3. SR-IOV
  Workaround: N/A
  Keywords: MXM parameters

Internal Ref. -
  Description: Currently, the UPC Barrier does not utilize the FCA Barrier. While the GASNET_FCA_ENABLE_BARRIER option that enables/disables the FCA barrier does affect various UPC collectives, it does not affect the UPC Barrier.
  Workaround: N/A
  Keywords: UPC Barrier

Internal Ref. -
  Description: UCX may not work properly with RoCE when running on a large scale.
  Workaround: N/A
  Keywords: UCX
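As an illustration of the port-selection workaround above, MXM can be pinned to a specific device and port on the mpirun command line. The device and port below (mlx5_0, port 1) are placeholders for whatever ibstat reports on the actual system:

  % mpirun -x MXM_RDMA_PORTS=mlx5_0:1 -np 2 ./my_mpi_app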

Table 3 - Known Issues (Sheet 5 of 5)

Internal Ref. -
  Description: Using UCX on ARM hosts may result in hangs due to a known issue in Open MPI when running on ARM.
  Workaround: N/A
  Keywords: UCX

Internal Ref. -
  Description: The UCX embedded in HPC-X is compiled with AVX support, so it cannot run on hosts without AVX support. If AVX is not available, recompile the UCX that is available in HPC-X with the option "--with-avx=no" (see the example after this sheet).
  Workaround: N/A
  Keywords: UCX
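A minimal sketch of rebuilding the bundled UCX without AVX is shown below. The source path, configure wrapper, and install prefix are assumptions about the UCX sources shipped with the HPC-X package; only the --with-avx=no option comes from the note above:

  % cd $HPCX_HOME/sources/ucx-1.4                                  # UCX sources shipped with HPC-X (path is illustrative)
  % ./contrib/configure-release --prefix=$PWD/install --with-avx=no
  % make -j8 && make install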

4 Bug Fixes History

Table 4 lists the bugs fixed in this release.

Table 4 - Bug Fixes History (Sheet 1 of 3)

Internal Ref. 1295679
  Description: Fixed the issue where the OpenSHMEM group cache had a default limit of 100 entries, which might have resulted in an OpenSHMEM application exiting with the following message: group cache overflow on rank xxx: cache_size = 100.
  Keywords: OpenSHMEM, Open MPI
  Discovered in Version: 2.1 (Open MPI 3.1.x)
  Fixed in Version: 2.2 (Open MPI 3.1.x)

Internal Ref. -
  Description: Fixed the issue where UCX did not work out-of-the-box with CUDA support.
  Keywords: UCX, CUDA
  Discovered in Version: 2.1 (UCX 1.3)
  Fixed in Version: 2.2 (UCX 1.4)

Internal Ref. 1926
  Description: Fixed an issue where, when using multiple transports, invalid data was sent out of sync with Hardware Tag Matching traffic.
    (GitHub issue: https://github.com/openucx/ucx/issues/1926)
  Keywords: Hardware Tag Matching
  Discovered in Version: 2.1 (UCX 1.3)
  Fixed in Version: 2.2 (UCX 1.4)

Internal Ref. 1949
  Description: Fixed an issue where Hardware Tag Matching might not have functioned properly with UCX over the DC transport.
    (GitHub issue: https://github.com/openucx/ucx/issues/1949)
  Keywords: UCX, Hardware Tag Matching, DC transport
  Discovered in Version: 2.0
  Fixed in Version: 2.1

Internal Ref. -
  Description: Fixed job data transfer from SD to libsharp.
  Keywords: Mellanox SHARP library
  Discovered in Release: 1.9
  Fixed in Release: 1.9.7

Internal Ref. 884482
  Description: Fixed internal HCOLL datatype mapping.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Table 4 - Bug Fixes History (Sheet 2 of 3)

Internal Ref. 884508
  Description: Fixed internal HCOLL datatype lower bound calculation.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Internal Ref. 884490
  Description: Fixed allgather unpacking issues.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Internal Ref. 885009
  Description: Fixed wrong answer in alltoallv.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Internal Ref. 882193
  Description: Fixed mcast group leak in HCOLL.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Internal Ref. -
  Description: Added IN_PLACE support for alltoall, alltoallv, and allgatherv.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Internal Ref. -
  Description: Fixed an issue related to multi-threaded MPI_Bcast.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Salesforce: 316541
  Description: Fixed a memory barrier issue in MPI_Barrier on Power PPC systems.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Salesforce: 316547
  Description: Fixed multi-threaded MPI_COMM_DUP and MPI_COMM_SPLIT hanging issues.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Table 4 - Bug Fixes History (Sheet 3 of 3)

Internal Ref. 894346
  Description: Fixed Quantum Espresso hanging issues.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Internal Ref. 898283
  Description: Fixed an issue which caused CP2K applications to hang when HCOLL was enabled.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.7.405
  Fixed in Release: 1.7.406

Internal Ref. 906155
  Description: Fixed an issue which caused VASP applications to hang in MPI_Allreduce.
  Keywords: HCOLL, FCA
  Discovered in Release: 1.6
  Fixed in Release: 1.7.406

5 Change Log History

5.1 HPC-X Toolkit Change Log History

Table 5 - HPC-X Toolkit Change Log History

Rev 2.1

  HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  - Open MPI version 3.1.x
  - Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.5
  - HCOLL version 4.0
  - MXM version 3.7
  - UCX version 1.3
  - OpenSHMEM v1.3 specification compliant

  UCX: UCX is now the default pml layer for Open MPI, the default spml layer for OpenSHMEM, and the default OSC component for MPI RMA. Added the following UCX features:
  - Support for GPU memory in UCX communication libraries
  - Support for the Multi-Rail protocol

  MXM: The UD_RNDV_ZCOPY parameter is set to "no" by default. This means that the zcopy mechanism for the UD transport is disabled when using the Rendezvous protocol.

  HCOLL:
  - UCX is now the default p2p transport in HCOLL
  - Improved multi-threaded performance
  - Improved shared memory performance
  - Added support for Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.5
  - Added support for Mellanox SHARP software multi-channel/multi-rail capable algorithms
  - Improved Allreduce large message algorithm
  - Improved AlltoAll algorithm

  Profiling IB verbs API (ibprof): Removed the ibprof tool from the HPC-X toolkit.

  UPC: Removed UPC from the HPC-X toolkit.

Rev 2.0

  HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  - Open MPI version 3.0.0
  - Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.4
  - HCOLL version 3.9
  - UCX version 1.3

  UCX: UCX is now at GA level. Added the following UCX features:
  - [ConnectX-5 only] Support for hardware Tag Matching with the DC transport.
  - [ConnectX-5 only] Support for out-of-order RDMA RC and DC to support adaptive routing with true RDMA.
  - Hardware Tag Matching (see section "Hardware Tag Matching" in the User Manual)
  - SR-IOV support (see section "SR-IOV Support" in the User Manual)
  - Adaptive Routing (AR) (see section "Adaptive Routing" in the User Manual)
  - Error Handling (see section "Error Handling" in the User Manual)

  HCOLL:
  - Added support for Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.4.
  - Added support for NCCL on-host GPU based collectives.
  - Added support for hierarchical GPU based allreduce using NCCL for scale-in and MXM/UCX for scale-out.
  - Improved shared memory performance for allreduce, barrier, and broadcast, targeting high thread count systems, e.g. Power9.
  - Improved large message allreduce (multi-radix, zero-copy fragmentation, CPU vectorization).
  - Added a new and improved AlltoAllv algorithm - hybrid logarithmic pairwise exchange.
  - Added support for on-demand HCOLL memory; improves HCOLL's memory footprint on high thread count systems, e.g. Power9.
  - Added a high performance multithreaded implementation to support MPI_THREAD_MULTIPLE applications, designed specifically for high thread count systems, e.g. Power9.
  - HCOLL startup improvements.

  Open MPI / OpenSHMEM:
  - Added support for Open MPI 3.0.0.
  - Added support for the xpmem kernel module.
  - Added a high performance implementation of shmem_ptr() with the UCX SPML.
  - Added a UCX allocator. The UCX allocator optimizes intra-node communication by allowing direct access to the memories of processes on the same node. The UCX allocator can only be used with the UCX SPML.
  - Added a UCX one-sided component to support MPI RMA operations.

Rev 1.9.7

  Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): Bug fixes; see Section 4, Bug Fixes History.

Rev 1.9

  HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  - Open MPI version 2.1.2a1
  - Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.3.1
  - HCOLL version 3.8.1652
  - MXM version 3.6.3103
  - UCX version 1.2.2947

  UCX:
  - Point-to-point communication API with tag matching, remote memory access, and atomic operations. This can be used to implement MPI, PGAS, and Big Data libraries and applications.
  - IB transport: a cleaner API with lower software overhead which provides better performance, especially for small messages.
  - Support for a multitude of InfiniBand transports and Mellanox offloads to optimize data transfer performance: RDMA, DC, out-of-order, HW tag matching offload, registration cache, ODP.
  - Shared memory communications for optimal intra-node data transfer: SysV, posix, knem, CMA, xpmem.

  MXM:
  - Enabled Adaptive Routing for all the transport layers (UD/RC/DC).
  - Memory registration optimization.

  Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): Improved the out-of-the-box performance of SHARP.

  Shared memory: Improved the intra-node performance of allreduce and barrier.

  Configuration: Changed many default parameter settings in order to achieve the best out-of-the-box experience for several applications, including CP2K, minidft, VASP, DL-POLY, Amber, Fluent, GAMES-UK, and LS-DYNA.

  FCA:
  - As of HPC-X v1.9, FCA v2.5 is no longer included in the HPC-X package.
  - Improved AlltoAllv algorithm.
  - Improved large data allreduce.
  - Improved UCX BCOL.

  OS architecture: Added support for ARM architecture.

Rev 1.8.2

  MXM: Updated MXM version to 3.6.2098, which includes memory registration optimization.

Rev 1.8

  Cross Channel (CC):
  - Added Cross Channel (CC) AlltoAllv
  - Added CC zcpy Ring Bcast

  Scalable Hierarchical Aggregation and Reduction Protocol (SHARP): Added SHARP non-blocking collectives.

  Shared memory POWER:
  - Added shared memory POWER optimizations for allreduce
  - Added shared memory POWER optimizations for Barrier

  Mixed data types: Added support for mixed data types.

  Non-contiguous Bcast: Added support for non-contiguous Bcast with UMR or SGE in CC.

  UMR: Added UMR support in the CC bcol.

  Unified Communication X Framework (UCX): A new acceleration library, integrated into Open MPI (as a pml layer) and available as part of HPC-X. It is an open source communication library designed to achieve the highest performance for HPC applications.

  HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  - HCOLL updated to v3.7
  - Open MPI updated to v2.10

  FCA: FCA 2.x is no longer the default FCA used in HPC-X. As of HPC-X v1.8, FCA 3.x (HCOLL) is the default FCA and it replaces FCA v2.x.

  Bug Fixes: See Section 4, Bug Fixes History.

Rev 1.7

  MXM: Updated MXM version to 3.6.

  FCA Collective, FCA:
  - Added Cross-Channel based Allgather, Bcast, and 8-byte Allreduce.
  - Added MPI datatype support.
  - Added optimizations for PPC platforms.
  - Added support for multiple Mellanox SHARP technology leaders on a single host.
  - Added support for collecting Mellanox SHARP technology usage statistics.
  - Exposed cross-channel non-blocking collectives to the MPI level.

Rev 1.6

  MXM v3.5: See Section 5.3, MXM Change Log History.

  IB-Router: Allows hosts that are located on different IB subnets to communicate with each other. This support is currently available when using the 'openib btl' in Open MPI.
    Note: When using 'openib btl', RoCE and IB router are mutually exclusive. The Open MPI inside HPC-X 1.6 is not compiled with ib-router support, therefore it supports RoCE out-of-the-box.

  FCA v3.5: See Section 5.2, FCA Change Log History.

Rev 1.5

  HPC-X Content: Updated the following communications libraries and acceleration packages versions:
  - Open MPI updated to v1.10
  - UPC updated to 2.22.0
  - MXM updated to v3.4.369
  - FCA updated to v3.4.799

  MXM v3.4.369: See Section 5.3, MXM Change Log History.

  FCA v3.4.799: See Section 5.2, FCA Change Log History.

Rev 1.4

  FCA v3.3: See Section 5.2, FCA Change Log History.

  MXM v3.4: See Section 5.3, MXM Change Log History.

Rev 1.3

  MLNX_OFED: Added support for OFED Inbox drivers.

  CPU Architecture: Added support for PPC architecture.

  LID Mask Control (LMC): Added support for using multiple LIDs when the LMC in the fabric is higher than zero. MXM will use multiple LIDs to distribute traffic across multiple links and achieve better resource utilization.

  Performance: Performance improvements for all transport layers.

  Adaptive Routing: Enhanced support for Adaptive Routing for the UD transport layer. For further information, please refer to the HPC-X User Manual section "Adaptive Routing for UD Transport".

  UD zero copy: UD zero copy support on the receiver side to achieve better bandwidth utilization and reduce CPU usage.

5.2 FCA Change Log History

Table 6 - FCA Change Log History

Rev 3.5

  FCA Collective: Added MPI Allgatherv and MPI reduce.

  FCA:
  - Added support for the Mellanox SHARP library (including SHARP allreduce, reduce, and barrier).
  - Enhanced scalability for CORE-Direct based collectives.
  - Added support for complex data types.

Rev 3.4

  General:
  - UCX support.
  - Communicator caching scheme with eviction: improves job start and communicator creation time.

  Collectives: Added Alltoallv and Alltoall small message algorithms.

Rev 3.3

  General:
  - Ported to PowerPC.
  - Thread safety added.

  Collectives:
  - Improved large message allreduce algorithm (enabled by default).
  - Beta version of network topology awareness (enabled by default).

Rev 3.0

  Collectives:
  - Offload collectives communication from the MPI process onto Mellanox interconnect hardware.
  - Efficient collectives communication flow optimized to job and topology.

  MPI collectives: Significantly reduce MPI collectives runtime.

  MPI-3: Native support for MPI-3.

  Blocking and Non-blocking collectives: Support for blocking and non-blocking collectives.

  HCOLL Collective algorithm:
  - Supports hierarchical communication algorithms (HCOLL).
  - Supports multiple optimizations within a single collective algorithm.

  Performance: Increase CPU availability and efficiency for increased application performance.

  MPI libraries: Seamless integration with MPI libraries and job schedulers.

Rev 2.5

  Multicast Group: Added MCG (Multicast Group) cleanup tool.

  Performance: Performance improvements.

  Dynamic offloading rules: Enabled dynamic offloading rules configuration based on the data type and reduce operations.

  Mixed MTU: Added support for mixed MTU.

Rev 2.1.1

  AMD/Interlagos CPUs: Added support for AMD/Interlagos CPUs.

Rev 2.1

  Core-Direct: Added support for Mellanox Core-Direct technology (enables offloading collective operations to the HCA).

  Non-contiguous data layouts: Added support for non-contiguous data layouts.

  PGI compilers: Added support for PGI compilers.

5.3 MXM Change Log History

Table 7 - MXM Change Log History

Rev 3.6

  General: Updated MXM version to 3.6.

Rev 3.5

  Performance: Performance improvements.

Rev 3.4.369

  Initialization: Job startup performance optimization.

  Supported Transports: DC enhancements and startup optimizations.

Rev 3.4

  Supported Transports: Optimizations for the DC transport at scale.

Rev 3.3

  LID Mask Control (LMC): Added support for using multiple LIDs when the LMC in the fabric is higher than zero. MXM will use multiple LIDs to distribute traffic across multiple links and achieve better resource utilization.

  Adaptive Routing: Enhanced support for Adaptive Routing for the UD transport layer.

  UD zero copy: UD zero copy support on the receiver side to achieve better bandwidth utilization and reduce CPU usage.

Rev 3.2

  Atomic Operations: Added hardware atomic operations support in the RC and DC transport layers for 32-bit and 64-bit operands. This feature is set to ON by default. To disable it, run:
    oshrun -x MXM_CIB_USE_HW_ATOMICS=n ...
    Note: If hardware atomic operations are disabled, software atomics are used instead.

  MXM API: Added two additional functions (mxm_ep_wireup() and mxm_ep_powerdown()) to the MXM API to allow pre-connection establishment for MXM (rather than on-demand). For further information, please refer to the HPC-X User Manual section "MXM Performance Tuning".

  Event Interrupt: Added a solicited event interrupt for the rendezvous protocol for potential performance improvement. For further information, please refer to the HPC-X User Manual section "MXM Performance Tuning".

  Performance: Performance improvements for the DC transport layer.

  Adaptive Routing: Added Adaptive Routing for the UD transport layer. For further information, please refer to the HPC-X User Manual section "Adaptive Routing for UD Transport".

Rev 3.0

  Service Level: Service Level support (at Alpha level).

  Adaptive Routing: Adaptive Routing support in the UD transport layer.

  Supported Transports: Dynamically Connected Transport (DC) (at GA level).

  Performance: Performance optimizations.

Rev 2.1

  Supported Transports:
  - Dynamically Connected Transport (DC) (at Beta level).
  - RC is currently fully supported.
  - KNEM support for intra-node communication.

  Performance: Performance optimizations.

Rev 2.0

  Reliable Connected: Added Reliable Connection (RC) support (at beta level).

  MXM Binding: An MXM process can be pinned to a specific HCA port. MXM supports the following binding policies:
  - static - the user can specify a process-to-port map
  - cpu affinity based - the HCA port will be bound automatically based on process affinity

  On-demand connection establishment: Added on-demand connection establishment between the processes.

  Performance: Performance tuning improvements.

Rev 1.5

  MXM over Ethernet: Added Ethernet support.

  Multi-Rail: Added Multi-Rail support.

5.4 HPC-X Open MPI/OpenSHMEM Change Log History

Table 8 - HPC-X Open MPI/OpenSHMEM Change Log History

Rev 1.8.2

  Acceleration Packages: Added support for new MXM, FCA, and HCOLL versions.

  Job start optimization: Added job start optimization.

  Performance: Performance improvements.

  Performance: Added Sandy Bridge performance optimizations.

  memheap: Allocated memheap using contiguous memory provided by the HCA.

  ptmalloc allocator: Replaced the buddy memheap with the ptmalloc allocator.

  multiple psync arrays: Added the option of using multiple psync arrays instead of barrier synchronization between collective routines (fcollect, reduction routines).

  spml yoda: Optimized small size puts.

  Performance: Performance optimization.

  Memory footprint optimizations: Added memory footprint optimizations.