Modern trends in the development of high-performance applications. Dmitry Durnov, 15 February 2017
Agenda
- Modern cluster architecture (node level, cluster level)
- Programming models
- Tools
2/20/2017
Modern cluster architecture
Modern cluster architecture. Node level
[Block diagram: two CPUs, each with a memory controller (MC) and local memory, linked by QPI; PCIe-attached co-processors and a fast interconnect adapter; a PCH reached via DMI, with SSD and Ethernet.]
Abbreviations: MC = Memory Controller, QPI = Quick Path Interconnect, PCIe = PCI Express, DMI = Direct Media Interface, PCH = Platform Controller Hub, SSD = Solid-State Drive
Modern cluster architecture. Node level
Modern cluster architecture. Xeon Phi
[Block diagram: many cores with on-package MCDRAM, two DDR4 memory controllers, a fast interconnect over PCIe, and a PCH (via DMI) with SSD and Ethernet.]
Modern cluster architecture. Xeon Phi
Modern cluster architecture. Node level
- HW:
  - Several multicore CPU sockets (2-4 sockets, 12+ cores per socket)
  - 2-4 GB of memory per core
  - Accelerator/co-processor (17.2% of the Top500 list: www.top500.org)
  - Fast interconnect adapter (communication and I/O)
  - Slow interconnect adapter (management/ssh)
  - Local storage
- SW:
  - Linux OS (RHEL/SLES/CentOS/...)
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Job manager, node-level part (LSF/PBS/Torque/SLURM/...)
Modern cluster architecture. Cluster level. Fat Tree topology
[Diagram: a head node and compute nodes connected through a fat-tree switch hierarchy.]
Modern cluster architecture. Cluster level
- HW:
  - Interconnect switches/cables (fat tree/dragonfly/butterfly/... topologies)
- SW:
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Job manager (LSF/PBS/Torque/SLURM/...)
Modern cluster architecture. Node level. CPU
- 64-bit architecture
- Out-of-order execution
- Xeon: up to 22 cores per socket (44 threads with Hyper-Threading)
- Xeon Phi: 60+ cores (240+ threads with Hyper-Threading)
- 1-, 2-, and 4-socket configurations (connected via QPI links)
- Vectorization (AVX instruction set, 256/512-bit vector length)
- 2/3 cache levels
- And many other features
Modern cluster architecture. Node level. Memory hierarchy
- Several levels of hierarchy:
  - L1 cache latency: ~4-5 cycles
  - L2 cache latency: ~10-12 cycles
  - L3 (LLC) cache latency: ~36-38 cycles
  - Local memory latency: ~150-200 cycles
- NUMA (Non-Uniform Memory Access) impact:
  - Remote LLC latency: ~70-80 cycles
  - Remote memory latency: ~200-250 cycles
- Data locality is very important
[Diagram: two CPUs connected by QPI; each has hyper-threaded cores with private L1/L2 caches, a shared LLC, a memory controller, and local memory.]
Abbreviations: MC = Memory Controller, QPI = Quick Path Interconnect, HT = Hyper-Thread, LLC = Last-Level Cache
Modern cluster architecture. Interconnect
- InfiniBand technologies/APIs:
  - RDMA (ibverbs, udapl, mxm)
  - PSM (True Scale)
- Ethernet technologies/APIs:
  - TCP/IP (sockets)
  - RoCE (ibverbs, udapl, ...)
- Remote memory access latency (different nodes, pingpong): ~1 usec
- Local memory access latency (cross CPU socket, pingpong): ~0.5 usec
- OS-bypassing and zero-copy mechanisms
[Diagram: two nodes, each with CPU and memory; RDMA traffic flows HCA-to-HCA, bypassing the OS.]
Abbreviations: RDMA = Remote Direct Memory Access, PSM = Performance Scaled Messaging, RoCE = RDMA over Converged Ethernet, HCA = Host Channel Adapter
Modern cluster architecture. Interconnect. Intel Omni-Path
- Key features:
  - Link speed: 100 Gbit/s
  - MPI latency: less than 1 usec end-to-end
  - High MPI message rate (160 million messages per second)
  - Scalable to tens of thousands of nodes
- APIs:
  - PSM2 (compatible with PSM)
  - OFI (OpenFabrics Interfaces)
  - ibverbs
Modern cluster architecture. Interconnect. OFI API http://ofiwg.github.io/libfabric/
Programming models
Programming models
- Node level:
  - Pthreads
  - OpenMP
  - TBB
  - Cilk Plus
- Cluster level:
  - MPI
  - Different PGAS models
- Hybrid model: MPI+X, e.g. MPI+OpenMP
What is MPI?
- MPI = Message Passing Interface
- Version 1.0 of the standard was released in June 1994
- Current version is MPI 3.1
- Provides a language-independent API for point-to-point, collective, and many other operations across distributed-memory systems
- Many implementations exist (MPICH, Intel MPI, MVAPICH, Cray MPT, Platform MPI, MS MPI, Open MPI, HPC-X, etc.)
MPI basics
- MPI provides a powerful, efficient, and portable way to program parallel applications
- MPI typically supports the SPMD model (MPMD is possible too), i.e., the same sub-program runs on each processor; the total program (all sub-programs) must be started with the MPI startup tool
- An MPI program communicates by means of messages (not streams)
- Rich API: MPI environment; point-to-point communication; collective communication; one-sided communication (Remote Memory Access); MPI datatypes; application topologies; profiling interface; file I/O; dynamic processes
MPI Program

C:
    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char *argv[] )
    {
        MPI_Init( &argc, &argv );
        printf( "Hello, world!\n" );
        MPI_Finalize();
        return 0;
    }

Fortran:
    program main
    use MPI
    integer ierr
    call MPI_INIT( ierr )
    print *, 'Hello, world!'
    call MPI_FINALIZE( ierr )
    end
Point-to-Point Communication
- Messages are matched by the triplet of source, tag, and communicator
- A tag is just a message mark (MPI_Recv may pass MPI_ANY_TAG to match a message with any tag)
- MPI_Recv may receive from any process by using MPI_ANY_SOURCE as the source
- A communicator represents two things: a group of processes and a communication context
Collective communication
- Represents different communication patterns, which may involve an arbitrary number of ranks
- Why wouldn't plain send and receive be enough? Optimization
- All collective operations involve every process in a given communicator
- MPI implementations may contain several algorithms for every collective
- Typically based on point-to-point functionality (but not necessarily)
- Can be divided into 3 categories: one-to-all, all-to-one, all-to-all
- There are regular, nonblocking, and neighbor collectives
Collective communication
MPI_Bcast: one process (the root) sends a chunk of data to the rest of the processes in the given communicator.
[Diagrams of possible broadcast algorithms.]
One-sided communication
- One process specifies all communication parameters, both for the sending side and for the receiving side
- Separates communication and synchronization
- No matching
- The process that initiates a one-sided communication operation is the origin process; the process that contains the memory being accessed is the target process
- Memory is exposed via the window concept
- Quite a rich API: a bunch of different window creation, communication, synchronization, and atomic routines
- Example of a communication call:
    MPI_Put(const void *origin_addr, int origin_count,
            MPI_Datatype origin_datatype, int target_rank,
            MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win)
One-sided communication. Window
- Memory that a process allows other processes to access via one-sided communication is called a window
- A group of processes exposes its local windows to other processes by calling a collective function (e.g., MPI_Win_create, MPI_Win_allocate)
[Diagram: processes P1, P2, P3, each exposing a window region.]
Passive mode

    int b1 = 1, b2 = 2;
    int winbuf = -1;
    int rank;
    MPI_Win win;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&winbuf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Put(&b1, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Put(&b2, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1) {
        printf("my win %d\n", winbuf);
    }
MPI+MPI: Shared Memory Windows

    MPI_Comm_split_type(TYPE_SHARED)
    MPI_Win_allocate_shared(shr_comm)
    MPI_Win_shared_query(&baseptr)
    MPI_Win_lock_all(shr_win)
    // access baseptr[]
    MPI_Win_sync()
    MPI_Win_unlock_all(shr_win)

- Leverages RMA to incorporate node-level programming
- RMA provides portable atomics, synchronization, etc.
- Eliminates the X in MPI+X when only shared memory is needed (memory per core is not increasing)
- Allows NUMA-aware mapping: each window piece is associated with the process that allocated it

http://htor.inf.ethz.ch/publications/img/mpi_mpi_hybrid_programming.pdf
Tools
How Intel Parallel Studio XE 2017 helps make faster code faster for HPC
- Cluster Edition (HPC cluster): multi-fabric MPI library; MPI error checking and tuning
- Professional Edition (threading design & prototyping): parallel performance tuning; memory & thread correctness
- Composer Edition (vectorized & threaded node): Intel C++ and Fortran compilers; parallel models (e.g., OpenMP*); optimized libraries
Performance Tuning Tools for Distributed Applications
- Intel Trace Analyzer and Collector (tune cross-node MPI): visualize MPI behavior; evaluate MPI load balancing; find communication hotspots
- Intel VTune Amplifier XE (tune single-node threading): visualize thread behavior; evaluate thread load balancing; find thread sync bottlenecks
Intel Trace Analyzer and Collector. Overview
Intel Trace Analyzer and Collector helps the developer:
- Visualize and understand parallel application behavior
- Evaluate profiling statistics and load balancing
- Identify communication hotspots
Features:
- Event-based approach
- Low overhead
- Excellent scalability
- Powerful aggregation and filtering functions
- Performance assistance and imbalance tuning
NEW in 9.1: MPI Performance Snapshot
Using the Intel Trace Analyzer and Collector is Easy!
Step 1 - Run your binary and create a tracefile: $ mpirun -trace -n 2 ./test
Step 2 - View the results: $ traceanalyzer &
Intel Trace Analyzer and Collector
- Compare the event timelines of two communication profiles
- Blue = computation, red = communication
- Chart shows how the MPI processes interact
Improving Load Balance: Real World Case Collapsed data per node and coprocessor card Too high load on Host = too low load on coprocessor Host 16 MPI procs x 1 OpenMP thread Coprocessor 8 MPI procs x 28 OpenMP threads
Improving Load Balance: Real World Case Collapsed data per node and coprocessor card Too low load on Host = too high load on coprocessor Host 16 MPI procs x 1 OpenMP thread Coprocessor 24 MPI procs x 8 OpenMP threads
Improving Load Balance: Real World Case Collapsed data per node and coprocessor card Perfect balance Host load = Coprocessor load Host 16 MPI procs x 1 OpenMP thread Coprocessor 16 MPI procs x 12 OpenMP threads
Ideal Interconnect Simulator (Idealizer)
Helps to figure out an application's imbalance by simulating its behavior in an ideal communication environment.
- Actual trace vs. idealized trace
- Easy way to identify application bottlenecks
MPI Performance Assistance
- Automatic Performance Assistant: detects common MPI performance issues
- Automated tips on potential solutions
- Automatically detects performance issues and their impact on runtime
MPI Performance Snapshot
- High-capacity MPI profiler
- Lightweight: low-overhead profiling for 100K+ ranks
- Scalability: performance variation at scale can be detected sooner
- Identifying key metrics: shows PAPI counters and MPI/OpenMP imbalances
MPI Correctness Checking
Highlights:
- Checks and pinpoints hard-to-find run-time errors
- Unique feature to identify run-time errors
- Displays the correctness (parameter passing) of MPI communication for more robust and reliable MPI-based HPC applications
Run-time errors and warnings can be identified easily; a single mouse-click reveals more detailed information to help identify root causes. MPI statistics are also shown.
Intel MPI Benchmarks 4.1
- Standard benchmarks under the OSI-approved Common Public License (CPL)
- Enables testing of interconnects, systems, and MPI implementations
- Comprehensive set of MPI kernels that provide performance measurements for: point-to-point message passing; global data movement and computation routines; one-sided communications; file I/O
- Supports the MPI-1.x, MPI-2.x, and MPI-3.x standards
- What's new: new benchmarks measuring cumulative bandwidth and message rate values
The Intel MPI Benchmarks provide a simple and easy way to measure MPI performance on your cluster.
Online Resources
- Intel MPI Library product page: www.intel.com/go/mpi
- Intel Trace Analyzer and Collector product page: www.intel.com/go/traceanalyzer
- Intel Clusters and HPC Technology forums: http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology
- Intel Xeon Phi Coprocessor Developer Community: http://software.intel.com/en-us/mic-developer