Understanding Communication and MPI on Cray XC40


1 Understanding Communication and MPI on Cray XC40

2 Features of the Cray MPI library
Cray MPI uses the MPICH3 distribution from Argonne. It provides a good, robust and feature-rich MPI, with well-tested code for high-level features like MPI derived types.
Cray provides enhancements on top of this: low-level communication libraries, point-to-point tuning and collective tuning. The shared memory device is built on top of Cray XPMEM, and Cray MPI uses the NEMESIS module.
SMP-aware: communication between ranks on the same node uses shared memory rather than the Aries network, which gives higher performance.
Supported under all Cray PrgEnvs; the compiler/linker wrappers take care of including the correct headers and libraries.

3 Basics about communication

4 Costs of communication
All parallel applications communicate data between individual processes, unless they are embarrassingly parallel.
The cost of any communication is usually defined by two properties of the underlying network (or memory system): 1. Latency 2. Bandwidth

5 Costs of communication
1. Latency: the time from a message being sent to it reaching its destination. Dominates the performance of small messages. A combination of factors: constant software and hardware overheads, plus the physical and topological distance between the nodes (hops).
2. Bandwidth: the maximum rate at which data can flow over the network. Dominates the performance of larger messages. Bandwidth between nodes generally depends on the number of possible paths between nodes on the network (topology). Can usually be tuned with a large enough budget.
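A simple way to combine the two costs is the usual latency/bandwidth (alpha-beta) model for a message of n bytes; the numbers below are illustrative assumptions, not measured XC40 values.

    T(n) ~ alpha + n / beta
    Assuming alpha = 1 us and beta = 10 GB/s:
      n = 1 KB:   T ~ 1 us + 0.1 us  = 1.1 us    (latency dominated)
      n = 10 MB:  T ~ 1 us + 1000 us = ~1 ms     (bandwidth dominated)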

6 How message size affects communication performance
(As with all things) the decisions made by the application developer can affect the overall performance of the application. The size of messages sent between processes affects how important latency and bandwidth costs become.
When a message is small, the network latency is dominant. Therefore it is advisable to bundle multiple small messages into fewer, larger messages to reduce the number of latency penalties (see the sketch after this slide).
This is true for all closely coupled communication over any protocol, e.g. MPI, SHMEM, UPC, TCP/IP.
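A minimal sketch of message bundling, with illustrative function names and a made-up count; the point is simply that the aggregated version pays the latency cost once instead of N times.

    #include <mpi.h>

    #define N 1000   /* illustrative number of small values */

    /* Naive: N tiny sends, paying the latency cost N times. */
    void send_individually(const double *vals, int dest, MPI_Comm comm) {
        for (int i = 0; i < N; i++)
            MPI_Send(&vals[i], 1, MPI_DOUBLE, dest, 0, comm);
    }

    /* Bundled: one larger send, paying the latency cost once. */
    void send_bundled(const double *vals, int dest, MPI_Comm comm) {
        MPI_Send(vals, N, MPI_DOUBLE, dest, 0, comm);
    }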

7 Understanding inter- and intra-node performance
The rise of multi-core has led to fat nodes being common. Five years ago there may have been one or two CPUs per node; now we routinely see many more CPUs per node, and this will only increase in the future (e.g. Intel Xeon Phi).
Codes usually have multiple MPI ranks per node. Many (even most) codes are flat MPI rather than hybrid with, for instance, OpenMP threads. Even hybrid codes usually have more than one rank per node, as threading does not usually scale well across NUMA regions (e.g. sockets).
Latency and bandwidth are different for on- and off-node messages: messages between PEs on the same node (intra-node) will be faster, messages between PEs on different nodes (inter-node) will be slower.
We can optimise application performance by maximising communication between processes on the same node (see the sketch after this slide).
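One way to see, or exploit, which ranks share a node is MPI-3's MPI_Comm_split_type with MPI_COMM_TYPE_SHARED. The sketch below is a hedged illustration rather than anything the slides prescribe.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Group the ranks that can share memory, i.e. ranks on the same node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);

        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);

        printf("world rank %d is rank %d of %d on its node\n",
               world_rank, node_rank, node_size);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }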

8 [Figure: intra-node MPI ping-ping message performance; time (us) and bandwidth (MB/s) against message size (bytes).]

9 [Figure: single-rank MPI inter-node message performance; time (us) and bandwidth (MB/s) against message size (bytes).]

10 Some information about what the Aries HW can do and when it is not used

11 Inter-node transfer protocols
Building ever-larger flat SMPs is expensive and doesn't scale, so an XC system has a hybrid architecture: a set of SMP machines (nodes) linked into an MPP by the Cray Aries network.
Messages to another node are sent via Aries by one of two possible methods; the choice depends on message size. Developers do not target these directly, but knowing about them helps you to understand how to use MPI successfully.
Possible Aries protocols: 1. Fast Memory Access (FMA) 2. Block Transfer Engine (BTE)

12 Comparing FMA and BTE
Fast Memory Access (FMA): used for small messages.
Good features: lowest latency; more than one transfer can be active at the same time (multi-core).
Bad features: synchronous, so the CPU is involved in the transfer.
Block Transfer Engine (BTE): used for large messages.
Good features: gets the better point-to-point bandwidth in more cases; asynchronous, the transfer is done by Aries independently of the CPU.
Bad features: higher latency; transfers are queued if the BTE is busy.

13 Cray Aries and XPMEM RDMA capabilities
Cray Aries (and Gemini) NICs can do RDMA operations. Remote Direct Memory Access can directly read from or write into another node's memory space (with certain limitations). The remote CPU does not have to be involved in the transfer, and the sending CPU only needs to initialise the transfer if the sending Aries uses its Block Transfer Engine (BTE).
XPMEM provides a similar capability within a node. It is a feature of the Cray Linux Environment: each PE shares part of its address space with other PEs on the same node.
RDMA operations offer good potential for overlapping communication with computation.

14 Overlapping communication

15 Overlapping communication with computation
The Holy Grail: do communication "in the background" while each PE does (separate) computation. The cost of communication is then almost nothing, save the overhead of initiating transfers and synchronisation.
This relies on having enough (independent) computation to hide the comms time, having the correct code structure to make this possible, and using non-blocking communication calls, e.g. via RDMA.

16 Using RDMA to overlap communication and computation to hide costs
[Diagram: PE A and PE B each issue an RDMA Put, the data is transferred in the background, and both PEs carry on computing until a checkpoint where the data is required.]
Rather than sitting waiting for a communication operation to complete, applications can use asynchronous RDMA operations instead, e.g. putting some data into a remote PE's memory. The application can then continue with other useful computation until the checkpoint where the data is required.
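The slides do not give code for this pattern, but MPI-3 one-sided operations express the same idea. The sketch below, with an assumed ring exchange and fence synchronisation, is only an illustration; whether the transfer truly overlaps the computation is up to the implementation.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each PE exposes one double that a neighbour may write into. */
        double recv_val = 0.0, send_val = (double)rank;
        MPI_Win win;
        MPI_Win_create(&recv_val, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        int right = (rank + 1) % size;

        MPI_Win_fence(0, win);
        /* Put our value into the right-hand neighbour's window... */
        MPI_Put(&send_val, 1, MPI_DOUBLE, right, 0, 1, MPI_DOUBLE, win);

        /* ...and do independent computation while the transfer proceeds. */

        /* The closing fence is the "checkpoint": the data is guaranteed here. */
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }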

17 A short look at PGAS programming models
To exploit RDMA directly, an application needs a programming model that exposes parts of the address space to other PEs (on or off the node): the Partitioned Global Address Space (PGAS) programming models.
All memory (across all the PEs) is addressable, but it is not evenly accessible; the addressing method recognises this: PE number plus local address.
Languages: e.g. Fortran Coarrays (CAF); Unified Parallel C (UPC); Coarray C++; Chapel. APIs: e.g. OpenSHMEM; GASPI. Also MPI-3 RMA (single-sided).
Symmetric allocation: users declare portions of each PE's memory space symmetrically, e.g. (usually) using exactly the same addresses on each PE. This is done either automatically, using special language features (e.g. coarrays), or using collective allocation API calls (e.g. shmalloc). Get/put operations are only allowed on memory in the symmetric heap (see the sketch after this slide).
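As a small, hedged illustration of symmetric allocation and a one-sided put (not taken from the slides), an OpenSHMEM fragment might look like this; shmem_malloc returns symmetric memory, so the same address is valid as a remote destination on every PE.

    #include <shmem.h>
    #include <stdio.h>

    int main(void) {
        shmem_init();

        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric allocation: every PE gets a buffer at the same
           symmetric address, so remote PEs can be addressed directly. */
        long *dest = (long *)shmem_malloc(sizeof(long));
        *dest = -1;
        long src = (long)me;

        shmem_barrier_all();

        /* One-sided put into the right-hand neighbour's symmetric buffer. */
        int right = (me + 1) % npes;
        shmem_long_put(dest, &src, 1, right);

        shmem_barrier_all();
        printf("PE %d received %ld\n", me, *dest);

        shmem_free(dest);
        shmem_finalize();
        return 0;
    }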

18 PGAS on Cray Aries
The Cray Aries network's RDMA capabilities are highly suited to one-sided communications: low latency and high bandwidth, ideal for small-message transfers.
PGAS and other single-sided models fit this hardware model very closely and have the potential to give the highest bandwidths and lowest latencies.
Nonetheless, two-sided protocols are more popular, specifically MPI: it has been around for longer as a de facto standard and is traditionally more portable.

19 Two-sided protocols
Typically two-sided protocols like MPI are easier to use. The implicit synchronisation between PEs makes it easier to write programs that are not as vulnerable to race conditions. They allow data to be sent or received into or from any part of the PE's address space, and messages can be matched (or not) via tags or by the source PE (MPI_ANY_SOURCE, MPI_ANY_TAG), as sketched below.
However, this additional flexibility requires the cooperation of the Intel Xeon CPU to perform many of these tasks. This means communication may wait until the CPU enters an MPI call. Overheads caused by the MPI standard may increase latency and reduce effective bandwidth!
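To make the tag and source matching concrete, here is a small hedged sketch (not from the slides) of a receive that matches any sender, then inspects the MPI_Status to see who actually sent the message and with which tag.

    #include <mpi.h>
    #include <stdio.h>

    /* Receive one int from whichever rank sends first and report its origin. */
    void recv_from_anyone(MPI_Comm comm) {
        int value;
        MPI_Status status;

        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 comm, &status);

        printf("got %d from rank %d with tag %d\n",
               value, status.MPI_SOURCE, status.MPI_TAG);
    }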

20 Overlapping communication and computation
[Diagram: Rank A and Rank B each call MPI_Irecv and MPI_Isend, data is transferred in the background, and both ranks call MPI_Waitall.]
The MPI API provides many functions that allow point-to-point messages (and, with MPI-3, collectives) to be performed asynchronously. Ideally applications would be able to overlap communication and computation, hiding all data transfer behind useful computation (see the sketch after this slide).
Unfortunately this is not always possible at the application level and not always possible at the implementation level.
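A minimal sketch of the pattern in the diagram, assuming a hypothetical neighbour exchange with an illustrative buffer size and a placeholder for the independent computation.

    #include <mpi.h>

    #define N 100000   /* illustrative message length */

    /* Exchange data with a neighbour while doing independent local work. */
    void exchange_and_compute(double *sendbuf, double *recvbuf,
                              int neighbour, MPI_Comm comm) {
        MPI_Request reqs[2];

        /* Post the receive first, then the send, then keep computing. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, neighbour, 0, comm, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, neighbour, 0, comm, &reqs[1]);

        /* ...useful computation that does not touch the buffers... */

        /* Both transfers must be complete before the data is used. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }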

21 What prevents overlap?
Overlapping computation and comms is not always possible, even if the library has asynchronous API calls and the application has enough computation to allow overlap.
The usual reason: the sending PE does not know where to put messages on the destination; this is part of the MPI_Recv, not the MPI_Send.
Also, on Gemini and Aries the host CPU is required: complex tasks, e.g. matching message tags between sender and receiver, are performed on the CPU. This allows the NICs to have a higher clock speed (better latency and bandwidth), but messages can only "progress" when the program is in MPI, i.e. within an MPI library function or subroutine.

22 MPI messaging protocols
To understand when overlap is or isn't possible, you need to understand how MPI actually sends messages. There are two different protocols; the choice depends on message size.
1. Eager messaging: used for small messages; offers good potential for overlap.
2. Rendezvous messaging: used for large messages; does not usually overlap (without the progress engine).
These map (loosely) onto the Cray Aries FMA and BTE methods, though the finer details can be quite complicated.

23 EAGER messaging: buffering small messages
[Diagram: the sender's MPI_Send pushes the data into MPI buffers on the receiver; the receiver's MPI_Recv copies it out.]
Smaller messages can avoid this problem by using the eager protocol. If the sender does not know where to put a message on the receiver, the data can be buffered until the receiver is ready to take it.
When MPI_Recv is called, the library fetches the message data from the buffer into the appropriate location (or potentially a local buffer).
The sender can proceed as soon as the data has been copied to the buffer, but will block if there are no free buffers.

24 EAGER potentially allows overlapping
[Diagram: Rank A and Rank B each call MPI_Irecv and MPI_Isend, the data is delivered eagerly while both ranks compute, then both call MPI_Waitall.]
Data is pushed into one or more empty buffers on the remote processor, and copied from the buffer into the real receive destination when the wait or waitall is called.
This involves an extra memcpy, but gives a much greater opportunity for overlap of computation and communication.

25 RENDEZVOUS messaging: larger messages
[Diagram: the sender's MPI_Send announces the data; once the receiver calls MPI_Recv, the data is pulled from the sender.]
Larger messages (that are too big to fit in the buffers) are sent via the rendezvous protocol. The message cannot begin to transfer until MPI_Recv is called by the receiver; the data is then pulled from the sender by the receiver.
The sender must wait for the data to be copied to the receiver before continuing; sender and receiver block until the communication is finished.

26 RENDEZVOUS does not usually overlap
[Diagram: Rank A and Rank B each call MPI_Irecv and MPI_Isend, but the data only moves inside MPI_Waitall.]
With rendezvous, data transfer often only occurs during the Wait or Waitall call. When the message arrives at the destination, the host CPU is busy doing computation, so it is unable to do any message matching. Control only returns to the library when MPI_Waitall occurs, and does not return until all data is transferred.
There has been no overlap of computation and communication.

27 Making more messages EAGER
One way to improve performance is to send more messages via the eager protocol, giving potentially more overlap. Do this by raising the value of the eager threshold, set as an environment variable in the jobscript:
export MPICH_GNI_MAX_EAGER_MSG_SIZE=<value>
The value is in bytes; the default is 8192 bytes and the maximum is 131072 bytes (128 KB).
When might this help? If MPI takes a significant time in the profile, and if you have a lot of messages between 8 kB and 128 kB. CrayPAT MPI tracing can tell you this.
Also try to post the MPI_Irecv call before the MPI_Isend call; this can avoid unnecessary buffer copies (see the sketch after this slide).
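A hedged sketch of the "post receives early" advice: the receive is posted before any other work and before the matching send is issued, so an eager message that arrives early can be matched immediately. The names, tag and message size are illustrative.

    #include <mpi.h>

    #define NCHUNK 512   /* 512 doubles = 4 KB, under the default 8 KB eager threshold */

    void early_recv_exchange(double *sendbuf, double *recvbuf,
                             int partner, MPI_Comm comm) {
        MPI_Request recv_req, send_req;

        /* Post the receive as early as possible... */
        MPI_Irecv(recvbuf, NCHUNK, MPI_DOUBLE, partner, 7, comm, &recv_req);

        /* ...do other work, then issue the send... */
        MPI_Isend(sendbuf, NCHUNK, MPI_DOUBLE, partner, 7, comm, &send_req);

        /* ...and complete both only when the data is actually needed. */
        MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
        MPI_Wait(&send_req, MPI_STATUS_IGNORE);
    }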

28 Consequences of more EAGER messages
This places more demands on the buffers on the receiver. If the buffers are full, the transfer will wait until space is available, or until the MPI_Wait.
The number of buffers can be increased at runtime:
export MPICH_GNI_NUM_BUFS=<number>
The default is 64 buffers of 32 kB each (a total of 2 MB). Buffer memory space competes with application memory, so we recommend only moderate increases.

29 The MPI progress engine

30 Progress threads help overlap
[Diagram: Rank A and Rank B each call MPI_Irecv and MPI_Isend; helper threads on each rank do the message matching and initiate the transfer while the ranks compute, so the data has already arrived by the time MPI_Waitall is called.]
Cray's MPT library can spawn additional threads that allow messages to progress while computation occurs in the background. The thread performs the message matching and initiates the transfer.
Data has already arrived by the time Waitall is called, so there is overlap between compute and communication.

31 MPI async progress engine support
Used to improve communication/computation overlap. Each MPI rank starts a helper thread during MPI_Init; the helper threads progress the MPI engine in the background while the application computes.
This only works for messages using the BTE, i.e. inter-node messages using the rendezvous path (large messages).

32 Using the progress engine
To enable it on the XC, set two environment variables in the jobscript:
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple
You also need somewhere for the progress engine threads to run.
If you are running with aprun -j1, all the (second) hyperthreads are spare; the progress engine threads will use these by default, so you don't need to do anything else.
If you are running with aprun -j2, you need to reserve one (or more) hyperthreads for progress using the -r core specialisation flag, e.g.
aprun -n XX -N63 -r1 ./a.out
aprun -n XX -N62 -r2 ./a.out
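MPICH_MAX_THREAD_SAFETY=multiple corresponds to the MPI_THREAD_MULTIPLE support level. A hedged sketch of requesting and checking that level from the application side (standard MPI, not something the slides require):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided;

        /* Request full thread support, matching MPICH_MAX_THREAD_SAFETY=multiple. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE)
            fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not available (got level %d)\n",
                    provided);

        /* ... application ... */

        MPI_Finalize();
        return 0;
    }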

33 Will the progress engine help?
For codes that spend a lot of time on large-message transfers and use non-blocking MPI calls: yes, it can help. Performance improvements of 10% or more have been seen with some apps.
Why might it not help (even if we have slow, large-message transfers with non-blocking MPI)? Possible reasons include:
MPICH_MAX_THREAD_SAFETY=multiple has performance implications, since thread safety means more locking in the library;
core specialisation means fewer user processes per node, so less computational power per node and a reduced amount of intra-node MPI messaging.

34 Improving performance of MPI collectives

35 Using optimised MPI collectives on a Cray
DMAPP is the low-level communication API on Cray systems, used "under the hood" by MPI, SHMEM, CAF, UPC and others.
Some MPI collectives have optimised DMAPP versions (MPI_Allreduce, MPI_Barrier, MPI_Alltoall), but these are not used by default. To use the DMAPP collectives, users must manually add the library to the link line using -ldmapp and set an environment variable in the jobscript:
export MPICH_USE_DMAPP_COLL=1
Cray Aries also offers some hardware-accelerated collective operations (barriers and single-word Allreduce calls), enabled using an environment variable in the jobscript:
export MPICH_DMAPP_HW_CE=1

36 Other techniques for collectives (1)
Cray MPICH uses optimised versions of the MPI collectives. In most cases these are the correct thing to use, but if they are not suitable for a particular application they can be switched off using environment variables in the jobscript:
export MPICH_COLL_OPT_OFF=<collective name>
e.g. MPICH_COLL_OPT_OFF=mpi_allgather
When would you try this? If MPI collectives take a significant part of your profile, and if a particular collective takes a "surprising" amount of time, e.g. compared to runs on a different architecture.

37 Other techniques for collectives (2)
Cray MPICH uses various algorithms for the all-to-all routines: allgather(v), alltoall(v). The library-internal decision on which to use is based on the number of ranks in the calling communicator and on the message sizes.
You can change this decision using environment variables in the jobscript:
MPICH_XXXX_VSHORT_MSG, where XXXX is a collective name, e.g. ALLGATHER
When might you try this? E.g. if ALLGATHER suddenly becomes very important for a small change in problem size.

38 Issue: expensive collectives
The implementation of collectives based on the DMAPP layer will (usually significantly) improve the performance of the expensive collectives. This is enabled by the variable MPICH_USE_DMAPP_COLL, which can also be used selectively, e.g.
export MPICH_USE_DMAPP_COLL=mpi_allreduce
Consult man mpi.
XC hardware Collective Engine (CE): the XC supports hardware offload of Barrier and Allreduce collectives. Invoke these via the MPICH_USE_DMAPP_COLL and MPICH_DMAPP_HW_CE environment variables. You must also link libdmapp into your application (see man mpi).

39 [Figure: CE test, 8-byte MPI_Allreduce latency in microseconds (16 processes per node) against number of MPI processes, comparing MPICH2 software, DMAPP software and the DMAPP hardware Collective Engine (CE), on a Cray XC system (CSCS daint).]

40 Huge pages
The Aries NIC performs better with huge pages than with 4K pages: it can map more pages using fewer resources, meaning communications may be faster. The cray-mpich library will map its buffers into huge pages by default.
The size of the huge pages used is controlled by:
export MPICH_GNI_HUGEPAGE_SIZE=<size>
<size> can be 2M, 4M, 16M, 32M, 64M, 128M or 256M; the default is 2M.

41 Miscellaneous useful flags
Performance enhancements:
export MPICH_COLL_SYNC=1   (adds a barrier before collectives; use this if CrayPAT makes your code run faster)
Reporting:
export MPICH_CPUMASK_DISPLAY=1   (shows the binding of each MPI rank by core and hostname)
export MPICH_ENV_DISPLAY=1   (prints the value of all MPI environment variables at runtime, to STDERR)
export MPICH_MPIIO_STATS=1   (prints some MPI-IO stats useful for optimisation, to STDERR)
export MPICH_RANK_REORDER_DISPLAY=1   (prints the node that each rank is residing on; useful for checking MPICH_RANK_REORDER_METHOD results)
export MPICH_VERSION_DISPLAY=1   (displays library version and build information)
For more information: man mpi

42 How can I make my MPI faster? Some hints
Runtime options: try to maximise on-node transfers (rank reordering); try using the optimised collectives, or the DMAPP collectives (a relink is needed).
Help the MPI library get better overlap: use non-blocking MPI calls (MPI_Isend, MPI_Irecv, MPI_Iallgather, ...).
Small messages use the EAGER protocol, which has good overlap potential: try to send more data using the small-message EAGER method and consider raising the EAGER threshold.
Larger messages use the RENDEZVOUS protocol: post non-blocking receives as early as possible and consider using the asynchronous progress thread.
Try to reorder code to give more potential for overlap: local computation (or I/O) that can be done while messages transfer.
Perhaps consider adding some PGAS.

43 Additional, detailed slides

44 Day in the Life of an MPI Message
There are four main pathways through the MPICH2 GNI NetMod.
Two EAGER paths: E0, for a message that can fit in a GNI SMSG mailbox, and E1, for a message that can't fit into a mailbox but is less than MPICH_GNI_MAX_EAGER_MSG_SIZE in length.
Two RENDEZVOUS (aka LMT) paths: R0 (RDMA get) and R1 (RDMA put).
The selected pathway is based on message size.
[Figure: the E0, E1, R0 and R1 pathways laid out against message size (1K to 4MB), with the crossover points set by MPICH_GNI_MAX_VSHORT_MSG_SIZE, MPICH_GNI_MAX_EAGER_MSG_SIZE and MPICH_GNI_NDREG_MAXSIZE.]

45 Day in the Life of Message type E0
EAGER messages that fit in the GNI SMSG mailbox.
[Diagram: 1. the sender does a GNI SMSG Send (MPI header + user data) into the receiver's SMSG mailbox; 2. the receiver memcpys the data out.]
The GNI SMSG mailbox size changes with the number of ranks in the job. If the user data is 16 bytes or less, it is copied into the MPI header.

46 Day in the Life of Message type E1
EAGER messages that don't fit in the GNI SMSG mailbox.
[Diagram: 1. memcpy the data to pre-allocated MPI buffers; 2. GNI SMSG Send (MPI header); ... 4. GNI SMSG Send (Recv done); 5. memcpy into the receive location.]
User data is copied into internal MPI buffers on both the send and the receive side. MPICH_GNI_NUM_BUFS: default 64 buffers, each 32K.

47 EAGER Message Protocol
Protocol for messages that can fit into a GNI SMSG mailbox. The default mailbox size varies with the number of ranks in the job, although this can be tuned by the user to some extent.
Maximum user data per mailbox message:
  Ranks in job        MPT 5.3      MPT 5.4 and later
  <= 512 ranks        984 bytes    8152 bytes
  > 512 and <= ...    ... bytes    2008 bytes
  > 1024 and < ...    ... bytes    472 bytes
  > ... ranks         216 bytes    216 bytes

48 Day in the Life of Message type R0
Rendezvous messages using RDMA Get.
[Diagram: 1. register the application send buffer; 2. GNI SMSG Send (MPI header); 3. register the application receive buffer; ... 5. GNI SMSG Send (Recv done).]
No extra data copies. Best chance of overlapping communication with computation.

49 Day in the Life of Message type R1
Rendezvous messages using RDMA Put.
[Diagram: 1. GNI SMSG Send (MPI header); 2. register a chunk of the application receive buffer; 3. GNI SMSG Send (CTS message); 4. register a chunk of the application send buffer; 5. RDMA Put; 6. GNI SMSG Send (Send done).]
Steps 2-6 are repeated until all the sender's data is transferred. The chunk size is MPI_GNI_MAX_NDREG_SIZE (default of 4MB).
