
Modern Trends in the Development of High-Performance Applications. Dmitry Durnov, 15 February 2017

Agenda
- Modern cluster architecture
  - Node level
  - Cluster level
- Programming models
- Tools

Modern cluster architecture

Modern cluster architecture. Node level
[Diagram: two-socket node - each CPU with its memory controller and local memory, CPUs linked by QPI, PCIe-attached co-processors, a fast interconnect adapter, and a PCH with SSD and Ethernet attached via DMI]
Abbreviations:
- MC: Memory Controller
- QPI: Quick Path Interconnect
- PCIe: PCI Express
- DMI: Direct Media Interface
- PCH: Platform Controller Hub
- SSD: Solid-State Drive


Modern cluster architecture. Xeon Phi
[Diagram: Xeon Phi node - many cores with on-package MCDRAM, two DDR4 memory controllers, PCIe-attached fast interconnect, and a PCH with SSD and Ethernet attached via DMI]


Modern cluster architecture. Node level
- HW:
  - Several multicore CPU sockets (2-4 sockets, 12+ cores per socket)
  - 2-4 GB of memory per core
  - Accelerator/co-processor (17.2% of the Top500 list: www.top500.org)
  - Fast interconnect adapter (communication and I/O)
  - Slow interconnect adapter (management/ssh)
  - Local storage
- SW:
  - Linux OS (RHEL/SLES/CentOS/...)
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Node-level job manager (LSF/PBS/Torque/SLURM/...)

Modern cluster architecture. Cluster level. Fat tree topology
[Diagram: fat-tree topology - a head node and compute nodes connected through a hierarchy of switches]

Modern cluster architecture. Cluster level
- HW:
  - Interconnect switches/cables (fat tree/dragonfly/butterfly/... topology)
- SW:
  - Parallel file system (PVFS/PanFS/GPFS/Lustre/...)
  - Job manager (LSF/PBS/Torque/SLURM/...)

Modern cluster architecture. Node level. CPU
- 64-bit architecture
- Out-of-order execution
- Xeon: up to 22 cores per socket (44 logical cores with Hyper-Threading)
- Xeon Phi: 60+ cores (240+ logical cores with Hyper-Threading)
- 1, 2 and 4 socket configurations (QPI links)
- Vectorization (AVX instruction sets, 256/512-bit vector length)
- 2-3 cache levels
- And many other features

Modern cluster architecture. Node level. Memory hierarchy
- Several levels of hierarchy:
  - L1 cache latency: ~4-5 cycles
  - L2 cache latency: ~10-12 cycles
  - L3 (LLC) cache latency: ~36-38 cycles
  - Local memory latency: ~150-200 cycles
- NUMA (Non-Uniform Memory Access) impact:
  - Remote LLC latency: ~70-80 cycles
  - Remote memory latency: ~200-250 cycles
- Data locality is very important
[Diagram: two CPUs connected by QPI, each with hyper-threaded cores, per-core L1/L2 caches, a shared LLC, and a memory controller with local memory]
Abbreviations:
- MC: Memory Controller
- QPI: Quick Path Interconnect
- HT: Hyper-Thread
- LLC: Last Level Cache

Modern cluster architecture. Interconnect
- InfiniBand
  - Technologies/APIs: RDMA (ibverbs, udapl, mxm), PSM (True Scale)
- Ethernet
  - Technologies/APIs: TCP/IP (sockets), RoCE (ibverbs, udapl, ...)
- Remote memory access latency (different nodes, pingpong): ~1 usec
- Local memory access latency (cross CPU socket, pingpong): ~0.5 usec
- OS-bypassing and zero-copy mechanisms
[Diagram: two nodes, each with CPU, memory, OS and an HCA, communicating over the fabric]
Abbreviations:
- RDMA: Remote Direct Memory Access
- PSM: Performance Scaled Messaging
- RoCE: RDMA over Converged Ethernet
- HCA: Host Channel Adapter

Modern cluster architecture. Interconnect. Intel Omni-Path
- Key features:
  - Link speed: 100 Gbit/s
  - MPI latency: less than 1 usec end-to-end
  - High MPI message rate (160 million messages per second)
  - Scalable to tens of thousands of nodes
- APIs:
  - PSM2 (compatible with PSM)
  - OFI (OpenFabrics Interfaces)
  - ibverbs

Modern cluster architecture. Interconnect. OFI API
http://ofiwg.github.io/libfabric/

Programming models

Programming models
- Node level:
  - Pthreads
  - OpenMP
  - TBB
  - Cilk Plus
- Cluster level:
  - MPI
  - Different PGAS models
- Hybrid model: MPI+X, e.g. MPI+OpenMP (see the sketch below)
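A minimal sketch of the MPI+OpenMP hybrid pattern mentioned above (not from the original slides); it assumes an MPI library that supports MPI_THREAD_FUNNELED, i.e. only the master thread makes MPI calls:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, rank;

        /* Request FUNNELED: MPI calls come only from the master thread */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* OpenMP handles node-level parallelism inside each MPI process */
        #pragma omp parallel
        {
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }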


What is MPI?
- MPI: Message Passing Interface
- Version 1.0 of the standard was released in June 1994
- The current version is MPI 3.1
- Provides a language-independent API for point-to-point, collective and many other operations across distributed memory systems
- Many implementations exist (MPICH, Intel MPI, MVAPICH, Cray MPT, Platform MPI, MS MPI, Open MPI, HPC-X, etc.)

MPI basics
- MPI provides a powerful, efficient and portable way of parallel programming
- MPI typically supports the SPMD model (MPMD is possible though), i.e. the same sub-program runs on each processor. The whole program (all sub-programs of the program) must be started with the MPI startup tool.
- An MPI program communicates by means of messages (not streams)
- Rich API:
  - MPI environment
  - Point-to-point communication
  - Collective communication
  - One-sided communication (Remote Memory Access)
  - MPI datatypes
  - Application topologies
  - Profiling interface
  - File I/O
  - Dynamic processes

MPI program

C:

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char *argv[] )
    {
        MPI_Init( &argc, &argv );
        printf( "Hello, world!\n" );
        MPI_Finalize();
        return 0;
    }

Fortran:

    program main
    use MPI
    integer ierr

    call MPI_INIT( ierr )
    print *, 'Hello, world!'
    call MPI_FINALIZE( ierr )
    end
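One common way to build and launch the C version (not shown on the original slide; the wrapper and launcher names vary by implementation, but mpicc and mpirun -n are typical for MPICH-derived libraries such as Intel MPI):

    $ mpicc hello.c -o hello
    $ mpirun -n 4 ./hello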

Point-to-point communication
- Messages are matched by the triplet of source, tag and communicator
- The tag is just a message mark (MPI_Recv may pass MPI_ANY_TAG to match a message with any tag)
- MPI_Recv may receive from any process by using MPI_ANY_SOURCE as the source
- A communicator represents two things: a group of processes and a communication context
(A minimal send/receive sketch follows below.)
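A minimal sketch (not from the original slides) of blocking point-to-point matching on the (source, tag, communicator) triplet; it assumes the job is launched with at least two ranks:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, value = 42, tag = 7;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Send one int to rank 1; the message carries tag 7 */
            MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Matched on (source, tag, communicator); wildcards would be
               MPI_ANY_SOURCE / MPI_ANY_TAG */
            MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d from rank %d\n", value, status.MPI_SOURCE);
        }

        MPI_Finalize();
        return 0;
    }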

Collective communication
- Represents different communication patterns, which may involve an arbitrary number of ranks
- Why wouldn't plain send and receive be enough? Optimization.
- All collective operations involve every process in a given communicator
- MPI implementations may contain several algorithms for every collective
- Typically built on top of point-to-point functionality (but not necessarily)
- Can be divided into 3 categories: one-to-all, all-to-one, all-to-all
- There are regular, nonblocking and neighbor collectives (see the example below)
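For illustration, a minimal sketch (not from the original slides) of a collective call that every rank in the communicator must make; here MPI_Allreduce sums one integer contribution per rank, and the algorithm used underneath is chosen by the implementation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Every rank contributes one value; every rank gets the sum back */
        local = rank;
        MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d of %d: sum = %d\n", rank, size, global);
        MPI_Finalize();
        return 0;
    }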

Collective communication. MPI_Bcast
- One process (the root) sends a chunk of data to the rest of the processes in the given communicator
- Possible algorithms: [diagrams of several broadcast algorithms]
(A usage sketch follows below.)
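A minimal MPI_Bcast usage sketch (not from the original slides), assuming rank 0 is the root and the communicator is MPI_COMM_WORLD:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, data[4] = {0, 0, 0, 0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                       /* root fills the buffer */
            for (int i = 0; i < 4; i++) data[i] = i + 1;
        }

        /* Root (rank 0) sends the chunk to every rank in the communicator;
           all ranks must call MPI_Bcast with the same root */
        MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d has data[3] = %d\n", rank, data[3]);
        MPI_Finalize();
        return 0;
    }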

One-sided communication
- One process specifies all communication parameters, both for the sending side and for the receiving side
- Communication and synchronization are separated
- No matching
- The process that initiates a one-sided operation is the origin process; the process that owns the memory being accessed is the target process
- Memory is exposed via the window concept
- Quite a rich API: a bunch of window creation, communication, synchronization and atomic routines
- Example of a communication call:
    MPI_Put(const void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
            int target_rank, MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win)

One-sided communication. Window
- Memory that a process allows other processes to access via one-sided communication is called a window
- A group of processes expose their local windows to each other by calling a collective function (e.g. MPI_Win_create, MPI_Win_allocate, etc.)
[Diagram: processes P1, P2, P3, each exposing a window region of its memory]
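For contrast with the passive-target example on the next slide, a minimal sketch (not from the original deck) of window creation with fence, i.e. active-target, synchronization; it assumes at least two ranks:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, winbuf = 0, one = 1;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each process exposes one int as its local window */
        MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open access epoch */
        if (rank == 0)
            MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);  /* origin 0 -> target 1 */
        MPI_Win_fence(0, win);                 /* close epoch; data visible at target */

        if (rank == 1)
            printf("rank 1 window value: %d\n", winbuf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }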

Passive mode

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int b1 = 1, b2 = 2;
        int winbuf = -1;
        int rank;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Win_create(&winbuf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        if (rank == 0) {
            /* Lock rank 1's window; unlock completes both puts at the target */
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
            MPI_Put(&b1, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Put(&b2, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Win_unlock(1, win);
        }

        /* Barrier so rank 1 reads only after rank 0's epoch is complete */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 1) {
            printf("my win %d\n", winbuf);
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

MPI+MPI: Shared memory windows
- Call sequence:
    MPI_Comm_split_type(TYPE_SHARED)
    MPI_Win_allocate_shared(shr_comm)
    MPI_Win_shared_query(&baseptr)
    MPI_Win_lock_all(shr_win)
    // access baseptr[]
    MPI_Win_sync()
    MPI_Win_unlock_all(shr_win)
- Leverages RMA to incorporate node-level programming
- RMA provides portable atomics, synchronization, ...
- Eliminates the X in MPI+X when only shared memory is needed
- Memory per core is not increasing
- Allows NUMA-aware mapping: each window piece is associated with the process that allocated it
http://htor.inf.ethz.ch/publications/img/mpi_mpi_hybrid_programming.pdf
(An expanded sketch follows below.)
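A compilable sketch expanding the call sequence above; the names shr_comm, shm_win and baseptr follow the slide, while everything else (sizes, the barrier, the printed value) is an assumption of this minimal example:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int shm_rank, disp_unit;
        int *baseptr;
        MPI_Aint size;
        MPI_Comm shr_comm;
        MPI_Win shm_win;

        MPI_Init(&argc, &argv);

        /* Split MPI_COMM_WORLD into per-node (shared memory) communicators */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shr_comm);
        MPI_Comm_rank(shr_comm, &shm_rank);

        /* Each rank contributes one int to the node-local shared window
           (NUMA-aware: each piece is placed by the rank that allocated it) */
        MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                shr_comm, &baseptr, &shm_win);

        /* Query the base address of rank 0's segment and access it directly */
        MPI_Win_shared_query(shm_win, 0, &size, &disp_unit, &baseptr);

        MPI_Win_lock_all(0, shm_win);
        if (shm_rank == 0)
            baseptr[0] = 123;          /* plain load/store, no MPI_Put needed */
        MPI_Win_sync(shm_win);
        MPI_Barrier(shr_comm);
        MPI_Win_sync(shm_win);
        printf("node-local rank %d sees %d\n", shm_rank, baseptr[0]);
        MPI_Win_unlock_all(shm_win);

        MPI_Win_free(&shm_win);
        MPI_Comm_free(&shr_comm);
        MPI_Finalize();
        return 0;
    }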

Tools

How Intel Parallel Studio XE 2017 helps make faster code faster for HPC
- Cluster Edition (HPC cluster, MPI messages):
  - Multi-fabric MPI library
  - MPI error checking and tuning
- Professional Edition:
  - Threading design & prototyping
  - Parallel performance tuning
  - Memory & thread correctness
- Composer Edition (vectorized & threaded node):
  - Intel C++ and Fortran compilers
  - Parallel models (e.g., OpenMP*)
  - Optimized libraries

Performance tuning tools for distributed applications
- Intel Trace Analyzer and Collector (tune cross-node MPI):
  - Visualize MPI behavior
  - Evaluate MPI load balancing
  - Find communication hotspots
- Intel VTune Amplifier XE (tune single-node threading):
  - Visualize thread behavior
  - Evaluate thread load balancing
  - Find thread sync bottlenecks

Intel Trace Analyzer and Collector. Overview
Intel Trace Analyzer and Collector helps the developer:
- Visualize and understand parallel application behavior
- Evaluate profiling statistics and load balancing
- Identify communication hotspots
Features:
- Event-based approach
- Low overhead
- Excellent scalability
- Powerful aggregation and filtering functions
- Performance assistance and imbalance tuning
- NEW in 9.1: MPI Performance Snapshot

Using the Intel Trace Analyzer and Collector is easy!
- Step 1: Run your binary and create a tracefile
    $ mpirun -trace -n 2 ./test
- Step 2: View the results
    $ traceanalyzer &

Intel Trace Analyzer and Collector
- Compare the event timelines of two communication profiles
- Blue = computation, red = communication
- Chart shows how the MPI processes interact

Improving load balance: real-world case
- Collapsed data per node and coprocessor card
- Host: 16 MPI procs x 1 OpenMP thread; coprocessor: 8 MPI procs x 28 OpenMP threads
- Too high load on the host = too low load on the coprocessor

Improving load balance: real-world case
- Collapsed data per node and coprocessor card
- Host: 16 MPI procs x 1 OpenMP thread; coprocessor: 24 MPI procs x 8 OpenMP threads
- Too low load on the host = too high load on the coprocessor

Improving load balance: real-world case
- Collapsed data per node and coprocessor card
- Host: 16 MPI procs x 1 OpenMP thread; coprocessor: 16 MPI procs x 12 OpenMP threads
- Perfect balance: host load = coprocessor load

Ideal Interconnect Simulator (Idealizer)
- Helps to figure out an application's imbalance by simulating its behavior in an ideal communication environment
- Compares the actual trace with the idealized trace
- An easy way to identify application bottlenecks

MPI Performance Assistance
- Automatic performance assistant
- Detects common MPI performance issues and their impact on runtime
- Gives automated tips on potential solutions

MPI Performance Snapshot
- High-capacity MPI profiler
- Lightweight: low-overhead profiling for 100K+ ranks
- Scalability: performance variation at scale can be detected sooner
- Identifying key metrics: shows PAPI counters and MPI/OpenMP imbalances

MPI correctness checking
Highlights:
- Checks for and pinpoints hard-to-find run-time errors
- Unique feature to identify run-time errors
- Displays the correctness (parameter passing) of MPI communication for more robust and reliable MPI-based HPC applications
- Run-time errors and warnings can be identified easily; a single mouse click brings up more detailed information that helps to identify root causes
- MPI statistics

Intel MPI Benchmarks 4.1
- Standard benchmarks with an OSI-compatible CPL license
- Enables testing of interconnects, systems, and MPI implementations
- Comprehensive set of MPI kernels that provide performance measurements for:
  - Point-to-point message passing
  - Global data movement and computation routines
  - One-sided communication
  - File I/O
- Supports the MPI-1.x, MPI-2.x, and MPI-3.x standards
- What's new: new benchmarks that measure cumulative bandwidth and message rate values
- The Intel MPI Benchmarks provide a simple and easy way to measure MPI performance on your cluster

Online resources
- Intel MPI Library product page: www.intel.com/go/mpi
- Intel Trace Analyzer and Collector product page: www.intel.com/go/traceanalyzer
- Intel Clusters and HPC Technology forums: http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology
- Intel Xeon Phi Coprocessor Developer Community: http://software.intel.com/en-us/mic-developer