Dmitry Durnov 15 February 2017


1 Modern Trends in Developing High-Performance Applications. Dmitry Durnov, 15 February 2017

2 Agenda
- Modern cluster architecture: node level, cluster level
- Programming models
- Tools

3 Modern cluster architecture

4 Modern cluster architecture. Node level
(Block diagram of a two-socket node: each CPU with its memory controller and local memory, QPI links between the sockets, PCIe links to co-processors and the fast interconnect adapter, and a DMI link to the PCH with SSD and Ethernet.)
Abbreviations: MC = Memory Controller, QPI = Quick Path Interconnect, PCIe = PCI Express, DMI = Direct Media Interface, PCH = Platform Controller Hub, SSD = Solid-State Drive

5 Modern cluster architecture. Node level

6 Modern cluster architecture. Xeon Phi
(Block diagram of a Xeon Phi node: many cores with on-package MCDRAM, two DDR4 memory controllers, and PCIe/DMI links to the fast interconnect adapter and the PCH with SSD and Ethernet.)

7 Modern cluster architecture. Xeon Phi

8 Modern cluster architecture. Node level
- HW:
  - Several multicore CPU sockets (2-4 sockets, 12+ cores per socket); a few GB of memory per core
  - Accelerator/co-processor (17.2% of the Top500 list)
  - Fast interconnect adapter (communication and I/O)
  - Slow interconnect adapter (management/ssh)
  - Local storage
- SW:
  - Linux OS (RHEL/SLES/CentOS/...)
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Job manager, node-level part (LSF/PBS/Torque/SLURM/...)

9 Modern cluster architecture. Cluster level. Fat Tree topology
(Diagram of a fat-tree cluster: a head node and compute nodes connected through a hierarchy of interconnect switches.)

10 Modern cluster architecture. Cluster level
- HW:
  - Interconnect switches/cables (Fat tree/Dragonfly/Butterfly/... topology)
- SW:
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Job manager (LSF/PBS/Torque/SLURM/...)

11 Modern cluster architecture. Node level. CPU
- 64-bit architecture
- Out-of-order execution
- Xeon: up to 22 cores per socket (44 threads with Hyper-Threading)
- Xeon Phi: 60+ cores (240+ threads with Hyper-Threading)
- 1-, 2-, 4-socket configurations (QPI links)
- Vectorization (AVX instruction sets, 256/512-bit vector length)
- 2/3 cache levels
- And many other features

12 Modern cluster architecture. Node level. Memory hierarchy
- Several levels of hierarchy:
  - L1 cache latency: ~4-5 cycles
  - L2 cache latency: ~10-12 cycles
  - L3 (LLC) cache latency: ~36-38 cycles
  - Local memory latency: ~ cycles
- NUMA (Non-Uniform Memory Access) impact:
  - Remote LLC latency: ~ cycles
  - Remote memory latency: ~ cycles
- Data locality is very important (see the sketch below)
(Diagram of a two-socket node: each CPU has hyper-threaded cores with private L1/L2 caches, a shared LLC, and a memory controller attached to local memory; the sockets are connected by QPI links.)
Abbreviations: MC = Memory Controller, QPI = Quick Path Interconnect, HT = Hyper-Thread, LLC = Last Level Cache
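
The data-locality point is easiest to see in code. The sketch below is a general first-touch illustration, not taken from the slides: it assumes Linux's default first-touch page placement and an OpenMP-capable compiler (e.g. cc -O2 -fopenmp). Initializing the arrays with the same static thread schedule that later computes on them keeps most accesses in each thread's local NUMA memory.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 1u << 24;                  /* 16M doubles per array */
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        if (!a || !b) return 1;

        /* Parallel first touch: each thread pages in the chunk it will use later,
           so the OS places those pages in that thread's local NUMA memory. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; i++) { a[i] = 0.0; b[i] = 1.0; }

        /* Same static schedule: threads mostly read and write local memory. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; i++) a[i] = 2.0 * b[i];

        printf("a[0] = %f\n", a[0]);
        free(a); free(b);
        return 0;
    }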

13 Modern cluster architecture. Interconnect
- Infiniband
  - Technologies/APIs: RDMA (ibverbs, udapl, mxm), PSM (True Scale)
- Ethernet
  - Technologies/APIs: TCP/IP (sockets), RoCE (ibverbs, udapl, ...)
- Remote memory access latency (different nodes, pingpong): ~1 usec
- Local memory access latency (cross CPU socket, pingpong): ~0.5 usec
- OS-bypassing and zero-copy mechanisms
(Diagram: two nodes, each with CPU, memory, OS and an HCA, communicating over the fabric.)
Abbreviations: RDMA = Remote Direct Memory Access, PSM = Performance Scaled Messaging, RoCE = RDMA over Converged Ethernet, HCA = Host Channel Adapter

14 Modern cluster architecture. Interconnect. Intel Omni-Path
- Key features:
  - Link speed: 100 Gbit/s
  - MPI latency: less than 1 usec end-to-end
  - High MPI message rate (160 million messages per second)
  - Scalable to tens of thousands of nodes
- APIs:
  - PSM2 (compatible with PSM)
  - OFI (Open Fabrics Interface)
  - ibverbs

15 Modern cluster architecture. Interconnect. OFI API

16 Programming models

17 Programming models
- Node level:
  - Pthreads
  - OpenMP
  - TBB
  - Cilk Plus
- Cluster level:
  - MPI
  - Different PGAS models
- Hybrid model: MPI+X, e.g. MPI+OpenMP (see the sketch below)
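
As a minimal illustration of the MPI+OpenMP hybrid model mentioned above (my own sketch, not taken from the slides): each MPI rank runs an OpenMP team for node-level parallelism, and MPI combines the per-rank results across the cluster. Build with an MPI compiler wrapper and OpenMP enabled, e.g. mpicc -fopenmp hybrid.c.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, rank;
        /* Ask for MPI_THREAD_FUNNELED: only the master thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        /* Node-level parallelism: OpenMP threads share the rank's address space. */
        #pragma omp parallel reduction(+:local)
        local += omp_get_thread_num() + 1;

        double total;
        /* Cluster-level parallelism: combine the per-rank results with MPI. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum over all ranks and threads = %f\n", total);

        MPI_Finalize();
        return 0;
    }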


19 What is MPI?
- MPI = Message Passing Interface
- Version 1.0 of the standard was released in June 1994; the current version is MPI 3.1
- Provides a language-independent API for point-to-point, collective and many other operations across distributed-memory systems
- Many implementations exist (MPICH, Intel MPI, MVAPICH, Cray MPT, Platform MPI, MS MPI, Open MPI, HPC-X, etc.)

20 MPI basics
- MPI provides a powerful, efficient and portable way to write parallel programs
- MPI typically supports the SPMD model (MPMD is possible too), i.e. the same sub-program runs on each processor; the total program (all sub-programs) must be started with the MPI startup tool
- An MPI program communicates by means of messages (not streams)
- Rich API: MPI environment, point-to-point communication, collective communication, one-sided communication (Remote Memory Access), MPI datatypes, application topologies, profiling interface, file I/O, dynamic processes

21 MPI Program
C version (build and run commands follow after the listings):

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char *argv[] )
    {
        MPI_Init( &argc, &argv );
        printf( "Hello, world!\n" );
        MPI_Finalize();
        return 0;
    }

Fortran version:

    program main
    use MPI
    integer ierr

    call MPI_INIT( ierr )
    print *, 'Hello, world!'
    call MPI_FINALIZE( ierr )
    end
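
To build and launch the C version above with a typical MPI installation (wrapper and launcher names vary slightly between implementations; mpiicc is the Intel MPI wrapper for the Intel compiler):

    $ mpicc hello.c -o hello        # or: mpiicc hello.c -o hello
    $ mpirun -n 4 ./hello           # start 4 ranks; mpiexec works as well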

22 Point-to-Point Communication
- Messages are matched by the triplet of source, tag and communicator (see the sketch below)
- The tag is just a message mark (MPI_Recv may pass MPI_ANY_TAG to match a message with any tag)
- MPI_Recv may receive from any process by using MPI_ANY_SOURCE as the source
- A communicator represents two things: a group of processes and a communication context
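
A minimal send/receive sketch (my own example, not from the slides; run with at least 2 ranks): rank 0 sends one integer with tag 42, rank 1 receives it using the MPI_ANY_SOURCE and MPI_ANY_TAG wildcards and then inspects the status to see who actually sent it.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 123;
            /* Matched by (source=0, tag=42, MPI_COMM_WORLD) on the receiver side. */
            MPI_Send(&value, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            /* Wildcards: accept any sender and any tag within this communicator. */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from rank %d with tag %d\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }

        MPI_Finalize();
        return 0;
    }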

23 Collective communication
- Represents different communication patterns, which may involve an arbitrary number of ranks
- Why wouldn't plain send and receive be enough? Optimization
- All collective operations involve every process in a given communicator
- MPI implementations may contain several algorithms for every collective
- Typically built on top of point-to-point functionality (but not necessarily)
- Can be divided into 3 categories: one-to-all, all-to-one, all-to-all
- There are regular, nonblocking and neighbor collectives

24 Collective communication
- MPI_Bcast: one process (root) sends a chunk of data to the rest of the processes in the given communicator (see the sketch below)
- Possible algorithms: (illustrated with diagrams on the slide)
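
A minimal MPI_Bcast sketch (my own example, not from the slides): the root fills a small buffer, and after the collective call every rank, root included, holds the same contents.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, data[4] = {0, 0, 0, 0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                    /* only the root has the data initially */
            for (int i = 0; i < 4; i++)
                data[i] = i + 1;
        }

        /* Every rank in the communicator calls MPI_Bcast with the same root. */
        MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d: data[3] = %d\n", rank, data[3]);   /* prints 4 everywhere */
        MPI_Finalize();
        return 0;
    }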

25 One-sided communication
- One process specifies all communication parameters, both for the sending side and for the receiving side
- Communication and synchronization are separated
- No matching
- The process that initiates a one-sided operation is the origin process; the process that owns the memory being accessed is the target process
- Memory is exposed via the window concept
- Quite a rich API: a bunch of different window creation, communication, synchronization and atomic routines
- Example of a communication call (an active-target sketch follows below):
  MPI_Put(const void *origin_addr, int origin_count, MPI_Datatype origin_datatype, int target_rank, MPI_Aint target_disp, int target_count, MPI_Datatype target_datatype, MPI_Win win)
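
A minimal active-target sketch using MPI_Win_fence (my own example, not from the slides; the passive-target lock/unlock variant is shown on slide 27; run with at least 2 ranks): rank 0 puts a value into rank 1's window, and the two fences provide the synchronization.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, winbuf = -1, value = 42;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every rank exposes one int; window creation is collective. */
        MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);              /* open the access/exposure epoch */
        if (rank == 0)
            MPI_Put(&value, 1, MPI_INT, 1 /* target rank */, 0 /* displacement */,
                    1, MPI_INT, win);
        MPI_Win_fence(0, win);              /* close the epoch: the put is complete */

        if (rank == 1)
            printf("rank 1 window now holds %d\n", winbuf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }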

26 One-sided communication. Window
- Memory that a process allows other processes to access via one-sided communication is called a window
- A group of processes exposes its local windows to the other processes by calling a collective function (e.g. MPI_Win_create, MPI_Win_allocate, etc.)
(Diagram: per-process window regions of processes P1, P2, P3.)

27 Passive mode

    int b1 = 1, b2 = 2;
    int winbuf = -1;
    int rank;
    MPI_Win win;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Each rank exposes one int in the window. */
    MPI_Win_create(&winbuf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    if (rank == 0) {
        /* Passive target: rank 0 locks rank 1's window; rank 1 makes no RMA calls. */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Put(&b1, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Put(&b2, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);   /* puts are complete at the target after unlock */
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1) {
        printf("my win %d\n", winbuf);
    }

28 MPI+MPI: Shared Memory Windows
Typical call sequence (a fuller sketch follows below):

    MPI_Comm_split_type(TYPE_SHARED)
    MPI_Win_allocate_shared(shr_comm)
    MPI_Win_shared_query(&baseptr)
    MPI_Win_lock_all(shr_win)
    // access baseptr[]
    MPI_Win_sync()
    // ...
    MPI_Win_unlock_all(shr_win)

- Leverages RMA to incorporate node-level programming
- RMA provides portable atomics, synchronization, ...
- Eliminates the X in MPI+X when only shared memory is needed
- Memory per core is not increasing
- Allows NUMA-aware mapping
- Each window piece is associated with the process that allocated it
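
A fuller sketch of that call sequence (my own example; the one-int segments and the read of rank 0's segment are made up for illustration): the ranks on one node allocate a shared window, each writes its own segment, and every rank then reads rank 0's segment directly through the queried base pointer.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int world_rank, shm_rank;
        MPI_Comm shm_comm;
        MPI_Win shm_win;
        int *my_seg;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Group the ranks that can share memory, i.e. live on the same node. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shm_comm);
        MPI_Comm_rank(shm_comm, &shm_rank);

        /* Each rank contributes one int to a node-wide shared window. */
        MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                shm_comm, &my_seg, &shm_win);

        MPI_Win_lock_all(MPI_MODE_NOCHECK, shm_win);
        *my_seg = world_rank;               /* write my own segment */
        MPI_Win_sync(shm_win);              /* make the store visible */
        MPI_Barrier(shm_comm);              /* everyone has written */
        MPI_Win_sync(shm_win);              /* see the others' stores */

        /* Query the base address of rank 0's segment and read it directly. */
        int *root_seg;
        MPI_Aint seg_size;
        int disp_unit;
        MPI_Win_shared_query(shm_win, 0, &seg_size, &disp_unit, &root_seg);
        printf("node-local rank %d sees rank-0 segment = %d\n", shm_rank, *root_seg);

        MPI_Win_unlock_all(shm_win);
        MPI_Win_free(&shm_win);
        MPI_Comm_free(&shm_comm);
        MPI_Finalize();
        return 0;
    }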

29 Tools

30 How Intel Parallel Studio XE 2017 helps make Faster Code Faster for HPC
- Cluster Edition (HPC cluster, MPI messages): multi-fabric MPI library; MPI error checking and tuning
- Professional Edition (threading design & prototyping): parallel performance tuning; memory & thread correctness
- Composer Edition (vectorized & threaded node): Intel C++ and Fortran compilers; parallel models (e.g., OpenMP*); optimized libraries

31 Performance Tuning Tools for Distributed Applications
- Intel Trace Analyzer and Collector: tune cross-node MPI; visualize MPI behavior; evaluate MPI load balancing; find communication hotspots
- Intel VTune Amplifier XE: tune single-node threading; visualize thread behavior; evaluate thread load balancing; find thread sync bottlenecks

32 Intel Trace Analyzer and Collector. Overview
- Intel Trace Analyzer and Collector helps the developer:
  - Visualize and understand parallel application behavior
  - Evaluate profiling statistics and load balancing
  - Identify communication hotspots
- Features:
  - Event-based approach
  - Low overhead
  - Excellent scalability
  - Powerful aggregation and filtering functions
  - Performance assistance and imbalance tuning
- NEW in 9.1: MPI Performance Snapshot

33 Using the Intel Trace Analyzer and Collector is Easy!
- Step 1: Run your binary and create a tracefile
  $ mpirun -trace -n 2 ./test
- Step 2: View the results
  $ traceanalyzer &

34 Intel Trace Analyzer and Collector
- Compare the event timelines of two communication profiles
- Blue = computation, red = communication
- Chart showing how the MPI processes interact

35 Improving Load Balance: Real World Case
- Collapsed data per node and coprocessor card
- Too high a load on the host = too low a load on the coprocessor
- Host: 16 MPI procs x 1 OpenMP thread; Coprocessor: 8 MPI procs x 28 OpenMP threads

36 Improving Load Balance: Real World Case
- Collapsed data per node and coprocessor card
- Too low a load on the host = too high a load on the coprocessor
- Host: 16 MPI procs x 1 OpenMP thread; Coprocessor: 24 MPI procs x 8 OpenMP threads

37 Improving Load Balance: Real World Case
- Collapsed data per node and coprocessor card
- Perfect balance: host load = coprocessor load
- Host: 16 MPI procs x 1 OpenMP thread; Coprocessor: 16 MPI procs x 12 OpenMP threads

38 Ideal Interconnect Simulator (Idealizer)
- Helps to figure out an application's imbalance by simulating its behavior in an ideal communication environment
- Compares the actual trace with the idealized trace
- Easy way to identify application bottlenecks

39 MPI Performance Assistance
- Automatic Performance Assistant
- Detects common MPI performance issues
- Automated tips on potential solutions
- Automatically detects performance issues and their impact on runtime

40 MPI Performance Snapshot
- High-capacity MPI profiler
- Lightweight: low-overhead profiling for 100K+ ranks
- Scalability: performance variation at scale can be detected sooner
- Identifying key metrics: shows PAPI counters and MPI/OpenMP imbalances

41 MPI Correctness Checking
- Highlights:
  - Checks for and pinpoints hard-to-find run-time errors
  - Unique feature to identify run-time errors
  - Displays the correctness (parameter passing) of MPI communication for more robust and reliable MPI-based HPC applications
- Run-time errors and warnings can be identified easily; a single mouse click brings up more detailed information that helps to identify root causes
- MPI statistics

42 Intel MPI Benchmarks 4.1
- Standard benchmarks with an OSI-compatible CPL license
- Enables testing of interconnects, systems, and MPI implementations
- Comprehensive set of MPI kernels that provide performance measurements for:
  - Point-to-point message passing
  - Global data movement and computation routines
  - One-sided communications
  - File I/O
- Supports the MPI-1.x, MPI-2.x, and MPI-3.x standards
- What's new: introduction of new benchmarks; measure cumulative bandwidth and message rate values
- The Intel MPI Benchmarks provide a simple and easy way to measure MPI performance on your cluster (see the example runs below)
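
For example, typical invocations look like the following (binary and benchmark names as shipped with recent IMB releases; launcher options depend on the installation):

    $ mpirun -n 2 ./IMB-MPI1 PingPong           # point-to-point latency/bandwidth
    $ mpirun -n 16 ./IMB-MPI1 Allreduce Bcast   # selected collectives on 16 ranks
    $ mpirun -n 4 ./IMB-RMA                     # one-sided (MPI-3 RMA) benchmarks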

43 Online Resources
- Intel MPI Library product page
- Intel Trace Analyzer and Collector product page
- Intel Clusters and HPC Technology forums
- Intel Xeon Phi Coprocessor Developer Community

44
