Understanding Communication and MPI on Cray XC40
- Kerry Jones
- 5 years ago
1 Understanding Communication and MPI on Cray XC40
2 Features of the Cray MPI library Cray MPI uses the MPICH3 distribution from Argonne, which provides a good, robust and feature-rich MPI with well-tested code for high-level features like MPI derived types. Cray provides enhancements on top of this: low-level communication libraries, point-to-point tuning and collective tuning. The shared memory device is built on top of Cray XPMEM. Cray MPI uses the NEMESIS module for SMP-aware communication: ranks on the same node communicate via shared memory rather than the Aries network, which gives higher performance. Supported under all Cray PrgEnv; the compiler/linker wrappers take care of including the correct headers and libraries. 2
3 Basics about communication 3
4 Costs of communication All parallel applications communicate data between individual processes unless they're embarrassingly parallel The cost of any communication is usually defined by two properties of the underlying network (or memory system) 1. Latency 2. Bandwidth 4
5 Costs of communication 1. Latency: the time from a message being sent to it reaching its destination. Dominates the performance of small messages. A combination of factors: constant software and hardware overheads, plus the physical and topological distance between the nodes (hops). 2. Bandwidth: the maximum rate at which data can flow over the network. Dominates the performance of larger messages. Bandwidth between nodes generally depends upon the number of possible paths between nodes on the network (topology). Can usually be tuned with a large enough budget. 5
6 How message size affects communication performance (As with all things) the decisions made by the application developer can affect the overall performance of the application. The size of messages sent between processes affects how important latency and bandwidth costs become. When a message is small the network latency is dominant, so it is advisable to bundle multiple small messages into fewer, larger messages to reduce the number of latency penalties. This is true for all closely coupled communication over any protocol, e.g. MPI, SHMEM, UPC, TCP/IP. 6
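The latency/bandwidth trade-off above can be sketched with a simple linear cost model, t = latency + bytes / bandwidth. The numbers used below are illustrative assumptions, not measured XC40 values:

```c
#include <assert.h>

/* Simple linear transfer cost model: t = latency + bytes / bandwidth.
 * Bandwidth is expressed in bytes per microsecond
 * (e.g. 10 GB/s == 10000 bytes/us). */
static double transfer_time_us(double n_bytes, double latency_us,
                               double bytes_per_us) {
    return latency_us + n_bytes / bytes_per_us;
}

/* Cost of `count` separate messages of `n_bytes` each:
 * the latency penalty is paid once per message. */
static double separate_us(int count, double n_bytes,
                          double latency_us, double bytes_per_us) {
    return count * transfer_time_us(n_bytes, latency_us, bytes_per_us);
}

/* Cost of one bundled message of count * n_bytes:
 * the latency penalty is paid only once. */
static double bundled_us(int count, double n_bytes,
                         double latency_us, double bytes_per_us) {
    return transfer_time_us(count * n_bytes, latency_us, bytes_per_us);
}
```

With 100 messages of 8 bytes each, an assumed 1.5 µs latency and 10 GB/s bandwidth, the separate sends cost about 150 µs while the single bundled send costs under 2 µs: the latency term dominates completely for small messages.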
7 Understanding inter- and intra-node performance The rise of multi-core has led to fat nodes being common. Five years ago there may have been one or two CPU cores per node; now nodes routinely have many more, and this will only increase in the future (e.g. Intel Xeon Phi). Codes usually have multiple MPI ranks per node: many (even most) codes are flat MPI rather than hybrid with, for instance, OpenMP threads, and even hybrid codes usually have more than one rank per node as threading does not usually scale well across NUMA regions (e.g. sockets). Latency and bandwidth are different for on- and off-node messages: messages between PEs on the same node (intra-node) will be faster; messages between PEs on different nodes (inter-node) will be slower. We can optimise application performance by maximising communication between processes on the same node. 7
8 [Chart: intra-node MPI ping-ping message performance; time (µs) and bandwidth (MB/s) against message size (bytes)] 8
9 [Chart: single-rank MPI inter-node message performance; time (µs) and bandwidth (MB/s) against message size (bytes)] 9
10 Some information about what the Aries HW can do and when it is not used 10
11 Inter-node transfer protocols Building ever-larger flat SMPs is expensive and doesn't scale. An XC system has a hybrid architecture: a set of SMP machines (nodes) linked into an MPP by the Cray Aries network. Messages to another node are sent via Aries by one of two possible methods; the choice depends on message size. Developers do not target these directly, but knowing about them helps you understand how to use MPI successfully. Possible Aries protocols: 1. Fast Memory Access (FMA) 2. Block Transfer Engine (BTE) 11
12 Comparing FMA and BTE Fast Memory Access (FMA) is used for small messages. Good features: lowest latency; more than one transfer can be active at the same time (multi-core). Bad features: synchronous, so the CPU is involved in the transfer. The Block Transfer Engine (BTE) is used for large messages. Good features: achieves the better point-to-point bandwidth in more cases; asynchronous, so the transfer is done by Aries independently of the CPU. Bad features: higher latency; transfers are queued if the BTE is busy. 12
13 Cray Aries and XPMEM RDMA capabilities Cray Aries (and Gemini) NICs can do RDMA operations. Remote Direct Memory Access can directly read from or write into another node's memory space (with certain limitations); the remote CPU does not have to be involved in the transfer, and the sending CPU only needs to initialise the transfer if the sending Aries uses its Block Transfer Engine (BTE). XPMEM provides a similar capability within a node: it is a feature of the Cray Linux Environment in which each PE shares part of its address space with other PEs on the same node. RDMA operations offer good potential for overlapping communication with computation. 13
14 Overlapping communication 14
15 Overlapping communication with computation The holy grail: do communication "in the background" while each PE does (separate) computation. The cost of communication is then almost nothing, save the overhead of initiating transfers and synchronisation. This relies on: having enough (independent) computation to hide the comms time; having the correct code structure to make this possible; using non-blocking communication calls, e.g. via RDMA. 15
16 Using RDMA to overlap communication and computation to hide costs Rather than sitting waiting for a communication operation to complete, applications can use asynchronous RDMA operations instead, e.g. putting some data into a remote PE's memory. The application can then continue with other useful computation until the checkpoint where the data is required, with the data transferred in the background. [Diagram: PE A and PE B each issue an RDMA put, continue computing, then meet at a checkpoint once the data has arrived.] 16
17 A short note on PGAS programming models To exploit RDMA directly, an application needs a programming model that exposes parts of its address space to other PEs (on or off the node): the Partitioned Global Address Space (PGAS) programming models. All memory (across all the PEs) is addressable, but it is not evenly accessible, and the addressing method recognises this: PE number plus local address. Languages: e.g. Fortran coarrays (CAF), Unified Parallel C (UPC), Coarray C++, Chapel. APIs: e.g. OpenSHMEM, GASPI; also MPI-3 RMA single-sided. Symmetric allocation: users declare portions of each PE's memory space symmetrically, e.g. (usually) using exactly the same addresses on each PE. This is done either automatically, using special language features (e.g. coarrays), or using collective allocation API calls (e.g. shmalloc). Get/put operations are only allowed between memory on the symmetric heap. 17
18 PGAS on Cray Aries The Cray Aries network's RDMA capabilities are highly suited to one-sided communications: low latency, high bandwidth, ideal for small-message transfers. PGAS and other single-sided models fit this hardware model very closely and have the potential to give the highest bandwidths and lowest latencies. Nonetheless, two-sided protocols, specifically MPI, are more popular: MPI has been around for longer as a de facto standard and is traditionally more portable. 18
19 Two-sided protocols Typically two-sided protocols like MPI are easier to use. The implicit synchronisation between PEs makes it easier to write programs that are not as vulnerable to race conditions. They allow data to be sent or received into or from any part of the PE's address space, and messages can be matched (or not) via tags or by the source PE (MPI_ANY_SOURCE, MPI_ANY_TAG). However, this additional flexibility requires the cooperation of the Intel Xeon CPU to perform many of these tasks. This means communication may wait until the CPU enters an MPI call. Overheads caused by the MPI standard may increase latency and reduce effective bandwidth! 19
20 Overlapping communication and computation The MPI API provides many functions that allow point-to-point messages (and, with MPI-3, collectives) to be performed asynchronously. Ideally applications would be able to overlap communication and computation, hiding all data transfer behind useful computation: each rank posts MPI_Irecv and MPI_Isend, computes while data is transferred in the background, then calls MPI_Waitall. Unfortunately this is not always possible at the application level and not always possible at the implementation level. 20
21 What prevents overlap? Overlapping computation and communication is not always possible, even if the library has asynchronous API calls and the application has enough computation to allow overlap. The usual reason: the sending PE does not know where to put messages on the destination; this is part of the MPI_Recv, not the MPI_Send. Also, on Gemini and Aries the host CPU is required: complex tasks, e.g. matching message tags between sender and receiver, are performed on the CPU. This allows the NICs to have a higher clockspeed (better latency and bandwidth), but messages can only "progress" when the program is in MPI, i.e. within an MPI library function or subroutine. 21
22 MPI messaging protocols To understand when overlap is or isn't possible we need to understand how MPI actually sends messages. There are two different protocols; the choice depends on message size. 1. Eager messaging: used for small messages; offers good potential for overlap. 2. Rendezvous messaging: used for large messages; does not usually overlap (without a progress engine). These map (loosely) onto the Cray Aries FMA and BTE methods, though the finer details can be quite complicated. 22
23 EAGER messaging: buffering small messages Smaller messages can avoid this problem using the eager protocol. If the sender does not know where the message will end up, the data is pushed into pre-allocated buffers on the receiver, where it waits until the receiver is ready to take it. When MPI_Recv is called the library fetches the message data from the buffer into the appropriate location (or potentially a local buffer). The sender can proceed as soon as the data has been copied to the buffer, but will block if there are no free buffers. 23
24 EAGER potentially allows overlapping Data is pushed into empty buffer(s) on the remote processor, then copied from the buffer into the real receive destination when the wait or waitall is called. This involves an extra memcpy, but gives a much greater opportunity for overlap of computation and communication. [Diagram: ranks A and B each post MPI_Irecv and MPI_Isend, compute, then call MPI_Waitall.] 24
25 RENDEZVOUS messaging: larger messages Larger messages (that are too big to fit in the buffers) are sent via the rendezvous protocol. Messages cannot begin to transfer until MPI_Recv is called by the receiver; the data is then pulled from the sender by the receiver. The sender must wait for the data to be copied to the receiver before continuing; sender and receiver block until communication is finished. 25
26 RENDEZVOUS does not usually overlap With rendezvous, data transfer often only occurs during the Wait or Waitall statement. When the message arrives at the destination, the host CPU is busy doing computation, so it is unable to do any message matching. Control only returns to the library when MPI_Waitall occurs, and does not return until all data is transferred. There has been no overlap of computation and communication. 26
27 Making more messages EAGER One way to improve performance is to send more messages via the eager protocol, giving potentially more overlap. Do this by raising the value of the eager threshold: set an environment variable in the jobscript, export MPICH_GNI_MAX_EAGER_MSG_SIZE=<value>. The value is in bytes; the default is 8192 bytes and the maximum is 128 KB. When might this help? If MPI takes a significant time in the profile and you have a lot of messages between 8 kB and 128 kB; CrayPAT MPI tracing can tell you this. Also try to post the MPI_Irecv call before the MPI_Isend call, which can avoid unnecessary buffer copies. 27
28 Consequences of more EAGER messages This places more demands on the buffers on the receiver. If the buffers are full, the transfer will wait until space is available, or until the MPI_Wait. The number of buffers can be increased at runtime: export MPICH_GNI_NUM_BUFS=<number>; the default number is 64, and buffers are 32 kB each (a total of 2 MB). Buffer memory space competes with application memory, so we recommend only moderate increases. 28
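The buffer-memory arithmetic above is easy to check. A minimal sketch (the function name is ours for illustration, not a Cray API):

```c
#include <assert.h>

/* Each internal eager buffer is 32 KiB; MPICH_GNI_NUM_BUFS controls
 * how many are allocated per rank (default 64, i.e. 2 MiB). */
static long eager_buffer_bytes(int num_bufs) {
    return (long)num_bufs * 32 * 1024;
}
```

So the default of 64 buffers claims 2 MiB per rank, and doubling MPICH_GNI_NUM_BUFS to 128 would claim 4 MiB per rank: memory that is no longer available to the application, which is why only moderate increases are recommended.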
29 The MPI progress engine 29
30 Progress threads help overlap Cray's MPT library can spawn additional threads that allow messages to progress while computation occurs in the background. The thread performs message matching and initiates the transfer, so the data has already arrived by the time Waitall is called, giving overlap between compute and communication. [Diagram: ranks A and B post MPI_Irecv and MPI_Isend; helper threads move the data while the ranks compute, before MPI_Waitall.] 30
31 MPI async progress engine support Used to improve communication/computation overlap. Each MPI rank starts a helper thread during MPI_Init; the helper threads progress the MPI engine in the background while the application computes. This only works for messages using the BTE: inter-node messages using the rendezvous path (i.e. large messages). 31
32 Using the progress engine To enable it on XC, set two environment variables in the jobscript: export MPICH_NEMESIS_ASYNC_PROGRESS=1 and export MPICH_MAX_THREAD_SAFETY=multiple. You also need somewhere for the progress engine threads to run. If you are running with aprun -j1, all the (second) hyperthreads are spare and the progress engine threads will use these by default, so you don't need to do anything else. If you are running with aprun -j2 you need to reserve one (or more) hyperthreads for progress using the -r core specialisation flag, e.g. aprun -n XX -N63 -r1 ./a.out or aprun -n XX -N62 -r2 ./a.out. 32
33 Will the progress engine help? For codes that spend a lot of time on large-message transfers and use non-blocking MPI calls: yes, it can help; performance improvements of 10% or more have been seen with some apps. Why might it not help (even with slow, large-message transfers and non-blocking MPI)? Possible reasons include: MPICH_MAX_THREAD_SAFETY=multiple has performance implications, since thread safety means more locking in the library; core specialisation means fewer user processes per node, so less computational power per node and a reduced amount of intra-node MPI messaging. 33
34 Improving performance of MPI collectives 34
35 Using optimised MPI collectives on a Cray DMAPP is the low-level communication API on Cray systems, used "under the hood" by MPI, SHMEM, CAF, UPC... Some MPI collectives have optimised DMAPP versions (MPI_Allreduce, MPI_Barrier, MPI_Alltoall) which are not used by default. To use DMAPP collectives, users must manually add the library to the link line using -ldmapp and set an environment variable in the jobscript: export MPICH_USE_DMAPP_COLL=1. Cray Aries also offers some accelerated collective ops (barriers, single-word Allreduce calls), enabled using an environment variable in the jobscript: export MPICH_DMAPP_HW_CE=1. 35
36 Other techniques for collectives (1) Cray MPICH uses optimised versions of MPI collectives; in most cases these are the correct thing to use. If they are not suitable for a particular application, you can switch them off using environment variables in the jobscript: export MPICH_COLL_OPT_OFF=<collective name>, e.g. MPICH_COLL_OPT_OFF=mpi_allgather. When would you try this? If MPI collectives take a significant part of your profile, and if a particular collective takes a "surprising" amount of time, e.g. compared to runs on a different architecture. 36
37 Other techniques for collectives (2) Cray MPICH uses various algorithms for all-to-all routines: allgather(v), alltoall(v). Library-internal decisions on which to use are based on the number of ranks on the calling communicator and the message sizes. You can change this decision using environment variables in the jobscript: MPICH_XXXX_VSHORT_MSG, where XXXX is a collective name, e.g. ALLGATHER. When might you try this? E.g. if ALLGATHER suddenly becomes very important for a small change in problem size. 37
38 Issue: expensive collectives The implementation of collectives based on the DMAPP layer will (usually significantly) improve the performance of the expensive collectives. It is enabled by the variable MPICH_USE_DMAPP_COLL, which can also be used selectively, e.g. export MPICH_USE_DMAPP_COLL=mpi_allreduce; consult man mpi. XC Hardware Collective Engine (CE): XC supports hardware offload of Barrier and Allreduce collectives. Invoke these via the MPICH_USE_DMAPP_COLL and MPICH_DMAPP_HW_CE environment variables; you must also link libdmapp into your application (see man mpi). 38
39 [Chart: CE test, 8-byte MPI_Allreduce latency in microseconds (16 processes/node) against number of MPI processes, comparing MPICH2 software, DMAPP software and the Aries Collective Engine (DMAPP hardware CE); Cray XC system (CSCS daint)] 39
40 Huge pages The Aries NIC performs better with huge pages than with 4K pages: the Aries can map more pages using fewer resources, meaning communications may be faster. The cray-mpich library will map its buffers into huge pages by default. The size of the huge pages used is controlled by export MPICH_GNI_HUGEPAGE_SIZE=<size>, where <size> can be 2M, 4M, 16M, 32M, 64M, 128M or 256M; the default is 2M. 40
41 Miscellaneous useful flags Performance enhancements: export MPICH_COLL_SYNC=1 adds a barrier before collectives; use this if CrayPAT makes your code run faster. Reporting: export MPICH_CPUMASK_DISPLAY=1 shows the binding of each MPI rank by core and hostname. export MPICH_ENV_DISPLAY=1 prints the value of all MPI environment variables at runtime (STDERR). export MPICH_MPIIO_STATS=1 prints some MPI-IO stats useful for optimisation (STDERR). export MPICH_RANK_REORDER_DISPLAY=1 prints the node that each rank is residing on, useful for checking MPICH_RANK_REORDER_METHOD results. export MPICH_VERSION_DISPLAY=1 displays library version and build information. For more information: man mpi. 41
42 How can I make my MPI faster? Some hints. Runtime options: try to maximise on-node transfers (rank reordering); try using optimised collectives, or DMAPP collectives (relink needed). Help the MPI library get better overlap: use non-blocking MPI calls (MPI_Isend, MPI_Irecv, MPI_Iallgather...). Small messages use the EAGER protocol, with good overlap potential; try to send more data using the small-message EAGER method and consider raising the EAGER threshold. Larger messages use the RENDEZVOUS protocol; post non-blocking receives as early as possible and consider using the asynchronous progress thread. Try to reorder code to give more potential for overlap: local computation (or I/O) that can be done while messages transfer. Perhaps consider adding some PGAS. 42
43 Additional, detailed slides
44 Day in the life of an MPI message There are four main pathways through the MPICH2 GNI NetMod. Two EAGER paths (E0 and E1): E0 for a message that can fit in a GNI SMSG mailbox; E1 for a message that can't fit into a mailbox but is less than MPICH_GNI_MAX_EAGER_MSG_SIZE in length. Two RENDEZVOUS (aka LMT) paths: R0 (RDMA get) and R1 (RDMA put). The selected pathway is based on message size: E0 up to MPICH_GNI_MAX_VSHORT_MSG_SIZE, E1 up to MPICH_GNI_MAX_EAGER_MSG_SIZE, R0 up to MPICH_GNI_NDREG_MAXSIZE, and R1 above that. 44
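The size-based selection can be summarised as a small decision function. This is a hedged sketch: the thresholds are passed in explicitly because the real values vary with job size and with the MPICH_GNI_* environment variables named on the slide.

```c
#include <assert.h>
#include <string.h>

/* Choose the GNI NetMod pathway for a message of `bytes` bytes,
 * given the three size thresholds described on the slide. */
static const char *gni_pathway(long bytes,
                               long max_vshort, /* MPICH_GNI_MAX_VSHORT_MSG_SIZE */
                               long max_eager,  /* MPICH_GNI_MAX_EAGER_MSG_SIZE  */
                               long ndreg_max)  /* MPICH_GNI_NDREG_MAXSIZE       */
{
    if (bytes <= max_vshort) return "E0"; /* fits in an SMSG mailbox    */
    if (bytes <= max_eager)  return "E1"; /* copied via MPI buffers     */
    if (bytes <= ndreg_max)  return "R0"; /* rendezvous, RDMA get       */
    return "R1";                          /* rendezvous, chunked put    */
}
```

For example, with illustrative defaults of an 8152-byte mailbox, an 8192-byte eager threshold and a 4 MB NDREG limit, a 512-byte message takes E0, a 16 kB message takes R0, and an 8 MB message takes R1.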
45 Day in the life of message type E0: EAGER messages that fit in the GNI SMSG mailbox. 1. The sender does a GNI SMSG send (MPI header + user data) into the receiver's SMSG mailbox. 2. The receiver memcpys the data out. The GNI SMSG mailbox size changes with the number of ranks in the job. If the user data is 16 bytes or less, it is copied into the MPI header. 45
46 Day in the life of message type E1: EAGER messages that don't fit in the GNI SMSG mailbox. 1. The sender memcpys the data into pre-allocated MPI buffers. 2. The sender does a GNI SMSG send of the MPI header. 4. The receiver does a GNI SMSG send (recv done). 5. Memcpy into the receive destination. User data is copied into internal MPI buffers on both the send and receive side. MPICH_GNI_NUM_BUFS: default 64 buffers, each 32K. 46
47 EAGER message protocol This is the protocol for messages that can fit into a GNI SMSG mailbox. The default mailbox size varies with the number of ranks in the job, although this can be tuned by the user to some extent.

Ranks in job              Max user data (MPT 5.3)   MPT 5.4 and later
<= 512 ranks              984 bytes                 8152 bytes
> 512 and <= 1024 ranks   -                         2008 bytes
> 1024 ranks              -                         472 bytes
largest jobs              216 bytes                 216 bytes

47
48 Day in the life of message type R0: rendezvous messages using RDMA get. 1. The sender registers the application send buffer. 2. The sender does a GNI SMSG send of the MPI header. 3. The receiver registers the application receive buffer. 5. The receiver does a GNI SMSG send (recv done). There are no extra data copies, giving the best chance of overlapping communication with computation. 48
49 Day in the life of message type R1: rendezvous messages using RDMA put. 1. The sender does a GNI SMSG send of the MPI header. 2. The receiver registers a chunk of the application receive buffer. 3. The receiver does a GNI SMSG send (CTS msg). 4. The sender registers a chunk of the application send buffer. 5. RDMA PUT. 6. GNI SMSG send (send done). Steps 2-6 are repeated until all the sender's data is transferred. The chunksize is MPI_GNI_MAX_NDREG_SIZE (default of 4MB). 49
PGAS languages The facts, the myths and the requirements Dr Michèle Weiland m.weiland@epcc.ed.ac.uk What is PGAS? a model, not a language! based on principle of partitioned global address space many different
More informationThe Cray Programming Environment. An Introduction
The Cray Programming Environment An Introduction Vision Cray systems are designed to be High Productivity as well as High Performance Computers The Cray Programming Environment (PE) provides a simple consistent
More informationLS-DYNA Scalability Analysis on Cray Supercomputers
13 th International LS-DYNA Users Conference Session: Computing Technology LS-DYNA Scalability Analysis on Cray Supercomputers Ting-Ting Zhu Cray Inc. Jason Wang LSTC Abstract For the automotive industry,
More informationProgramming Environment 4/11/2015
Programming Environment 4/11/2015 1 Vision Cray systems are designed to be High Productivity as well as High Performance Computers The Cray Programming Environment (PE) provides a simple consistent interface
More informationManaging Cray XT MPI Runtime Environment Variables to Optimize and Scale Applications
Managing Cray XT MPI Runtime Environment Variables to Optimize and Scale Applications Geir Johansen, Cray Inc. ABSTRACT: The Cray XT implementation of MPI provides configurable runtime environment variables
More informationCSE 160 Lecture 15. Message Passing
CSE 160 Lecture 15 Message Passing Announcements 2013 Scott B. Baden / CSE 160 / Fall 2013 2 Message passing Today s lecture The Message Passing Interface - MPI A first MPI Application The Trapezoidal
More informationHow to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries
How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationMPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016
MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared
More informationManaging Cray XT MPI Runtime Environment Variables to Optimize and Scale Applications Geir Johansen
Managing Cray XT MPI Runtime Environment Variables to Optimize and Scale Applications Geir Johansen May 5, 2008 Cray Inc. Proprietary Slide 1 Goals of the Presentation Provide users an overview of the
More informationCompiling applications for the Cray XC
Compiling applications for the Cray XC Compiler Driver Wrappers (1) All applications that will run in parallel on the Cray XC should be compiled with the standard language wrappers. The compiler drivers
More informationARCHER Single Node Optimisation
ARCHER Single Node Optimisation Profiling Slides contributed by Cray and EPCC What is profiling? Analysing your code to find out the proportion of execution time spent in different routines. Essential
More informationHow to Use MPI on the Cray XT. Jason Beech-Brandt Kevin Roy Cray UK
How to Use MPI on the Cray XT Jason Beech-Brandt Kevin Roy Cray UK Outline XT MPI implementation overview Using MPI on the XT Recently added performance improvements Additional Documentation 20/09/07 HECToR
More informationProgramming with MPI
Programming with MPI p. 1/?? Programming with MPI Miscellaneous Guidelines Nick Maclaren nmm1@cam.ac.uk March 2010 Programming with MPI p. 2/?? Summary This is a miscellaneous set of practical points Over--simplifies
More informationMPI - Today and Tomorrow
MPI - Today and Tomorrow ScicomP 9 - Bologna, Italy Dick Treumann - MPI Development The material presented represents a mix of experimentation, prototyping and development. While topics discussed may appear
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationIntroduction to parallel computing concepts and technics
Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing
More informationPROGRAMMING MODEL EXAMPLES
( Cray Inc 2015) PROGRAMMING MODEL EXAMPLES DEMONSTRATION EXAMPLES OF VARIOUS PROGRAMMING MODELS OVERVIEW Building an application to use multiple processors (cores, cpus, nodes) can be done in various
More informationLiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster
LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster H. W. Jin, S. Sur, L. Chai, and D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering
More informationThe Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing
The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task
More informationMPI Message Passing Interface
MPI Message Passing Interface Portable Parallel Programs Parallel Computing A problem is broken down into tasks, performed by separate workers or processes Processes interact by exchanging information
More informationParallel Programming with Coarray Fortran
Parallel Programming with Coarray Fortran SC10 Tutorial, November 15 th 2010 David Henty, Alan Simpson (EPCC) Harvey Richardson, Bill Long, Nathan Wichmann (Cray) Tutorial Overview The Fortran Programming
More informationLecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality)
COMP 322: Fundamentals of Parallel Programming Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality) Mack Joyner and Zoran Budimlić {mjoyner,
More informationPortable, MPI-Interoperable! Coarray Fortran
Portable, MPI-Interoperable! Coarray Fortran Chaoran Yang, 1 Wesley Bland, 2! John Mellor-Crummey, 1 Pavan Balaji 2 1 Department of Computer Science! Rice University! Houston, TX 2 Mathematics and Computer
More informationEnabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters
Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda
More informationIntroduction to MPI. Branislav Jansík
Introduction to MPI Branislav Jansík Resources https://computing.llnl.gov/tutorials/mpi/ http://www.mpi-forum.org/ https://www.open-mpi.org/doc/ Serial What is parallel computing Parallel What is MPI?
More informationADVANCED PGAS CENTRIC USAGE OF THE OPENFABRICS INTERFACE
13 th ANNUAL WORKSHOP 2017 ADVANCED PGAS CENTRIC USAGE OF THE OPENFABRICS INTERFACE Erik Paulson, Kayla Seager, Sayantan Sur, James Dinan, Dave Ozog: Intel Corporation Collaborators: Howard Pritchard:
More informationUsing Lamport s Logical Clocks
Fast Classification of MPI Applications Using Lamport s Logical Clocks Zhou Tong, Scott Pakin, Michael Lang, Xin Yuan Florida State University Los Alamos National Laboratory 1 Motivation Conventional trace-based
More informationOne-Sided Append: A New Communication Paradigm For PGAS Models
One-Sided Append: A New Communication Paradigm For PGAS Models James Dinan and Mario Flajslik Intel Corporation {james.dinan, mario.flajslik}@intel.com ABSTRACT One-sided append represents a new class
More informationRDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits
RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits Sayantan Sur Hyun-Wook Jin Lei Chai D. K. Panda Network Based Computing Lab, The Ohio State University Presentation
More informationAdvanced Computer Networks. End Host Optimization
Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationHIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS
HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS CS6410 Moontae Lee (Nov 20, 2014) Part 1 Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access
More informationDistributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello
Distributed recovery for senddeterministic HPC applications Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello 1 Fault-tolerance in HPC applications Number of cores on one CPU and
More informationPerformance Evaluation of MPI on Cray XC40 Xeon Phi Systems
Performance Evaluation of MPI on Cray XC0 Xeon Phi Systems ABSTRACT Scott Parker Argonne National Laboratory sparker@anl.gov Kevin Harms Argonne National Laboratory harms@anl.gov The scale and complexity
More informationSayantan Sur, Intel. ExaComm Workshop held in conjunction with ISC 2018
Sayantan Sur, Intel ExaComm Workshop held in conjunction with ISC 2018 Legal Disclaimer & Optimization Notice Software and workloads used in performance tests may have been optimized for performance only
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationIntel Xeon Phi Coprocessor
Intel Xeon Phi Coprocessor A guide to using it on the Cray XC40 Terminology Warning: may also be referred to as MIC or KNC in what follows! What are Intel Xeon Phi Coprocessors? Hardware designed to accelerate
More informationUCX: An Open Source Framework for HPC Network APIs and Beyond
UCX: An Open Source Framework for HPC Network APIs and Beyond Presented by: Pavel Shamis / Pasha ORNL is managed by UT-Battelle for the US Department of Energy Co-Design Collaboration The Next Generation
More informationEnabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER MPI-3.0 REMOTE MEMORY ACCESS MPI-3.0 supports RMA ( MPI One Sided ) Designed
More informationLeveraging the Cray Linux Environment Core Specialization Feature to Realize MPI Asynchronous Progress on Cray XE Systems
PROCEEDINGS OF THE CRAY USER GROUP, 2012 1 Leveraging the Cray Linux Environment Core Specialization Feature to Realize MPI Asynchronous Progress on Cray XE Systems Howard Pritchard, Duncan Roweth, David
More informationDocument Classification
Document Classification Introduction Search engine on web Search directories, subdirectories for documents Search for documents with extensions.html,.txt, and.tex Using a dictionary of key words, create
More informationShared Memory & Message Passing Programming on SCI-Connected Clusters
Shared Memory & Message Passing Programming on SCI-Connected Clusters Joachim Worringen, RWTH Aachen SCI Summer School 2000 Trinitiy College Dublin Agenda How to utilize SCI-Connected Clusters SMI Library
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationParallel and High Performance Computing CSE 745
Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel
More informationMyths and reality of communication/computation overlap in MPI applications
Myths and reality of communication/computation overlap in MPI applications Alessandro Fanfarillo National Center for Atmospheric Research Boulder, Colorado, USA elfanfa@ucar.edu Oct 12th, 2017 (elfanfa@ucar.edu)
More informationAssessment of LS-DYNA Scalability Performance on Cray XD1
5 th European LS-DYNA Users Conference Computing Technology (2) Assessment of LS-DYNA Scalability Performance on Cray Author: Ting-Ting Zhu, Cray Inc. Correspondence: Telephone: 651-65-987 Fax: 651-65-9123
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More informationHybrid MPI - A Case Study on the Xeon Phi Platform
Hybrid MPI - A Case Study on the Xeon Phi Platform Udayanga Wickramasinghe Center for Research on Extreme Scale Technologies (CREST) Indiana University Greg Bronevetsky Lawrence Livermore National Laboratory
More informationNoise Injection Techniques to Expose Subtle and Unintended Message Races
Noise Injection Techniques to Expose Subtle and Unintended Message Races PPoPP2017 February 6th, 2017 Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau
More informationLAPI on HPS Evaluating Federation
LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationMOVING FORWARD WITH FABRIC INTERFACES
14th ANNUAL WORKSHOP 2018 MOVING FORWARD WITH FABRIC INTERFACES Sean Hefty, OFIWG co-chair Intel Corporation April, 2018 USING THE PAST TO PREDICT THE FUTURE OFI Provider Infrastructure OFI API Exploration
More information