OPENSHMEM AND OFI: BETTER TOGETHER


14th ANNUAL WORKSHOP 2018

OPENSHMEM AND OFI: BETTER TOGETHER

James Dinan, David Ozog, and Kayla Seager, Intel Corporation [ April 2018 ]

NOTICES AND DISCLAIMERS

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

Intel Advanced Vector Extensions (Intel AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration, and you can learn more at http://www.intel.com/go/turbo.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

© 2018 Intel Corporation. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.

WHAT IS OPENSHMEM?

[Figure: Process 0 through Process 3, each with data and heap regions partly exposed to the global address space and partly kept private]

- Open standard for the SHMEM programming model
- Partitioned Global Address Space (PGAS) memory model, SPMD execution
- Part of the memory in a process is exposed for remote access
- Asynchronous read (get), write (put), and atomic update operations
- Fence (ordering), quiet (remote completion), barrier/wait (synchronization)
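For readers new to the model, here is a minimal sketch of an OpenSHMEM program (illustrative, not taken from the slides): each PE exposes a symmetric variable, writes it into a neighbor with a one-sided put, and uses quiet and a barrier for completion and synchronization.

    #include <stdio.h>
    #include <shmem.h>

    int main(void) {
        static long dest = 0;            /* symmetric: exposed for remote access on every PE */

        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        long src = me;
        shmem_long_put(&dest, &src, 1, (me + 1) % npes);  /* asynchronous one-sided write */
        shmem_quiet();                   /* remote completion of the put */
        shmem_barrier_all();             /* synchronize before reading the result */

        printf("PE %d received %ld\n", me, dest);
        shmem_finalize();
        return 0;
    }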

OPENSHMEM 1.4

- Specification ratified Dec. 14, 2017
- New in 1.4: thread safety, communication management API (contexts), test, sync, calloc, bitwise atomic operations, updated C generic selection bindings
- Committee actively working on 1.5. Happy to have you join us!

Intel is engaged in the Sandia OpenSHMEM (SOS) implementation effort:
- SOS v1.4.1 release candidate out
- Open source, supports OFI and Portals 4
- Requires: FI_RMA, FI_ATOMICS, FI_EP_RDM
- First open source implementation to support OpenSHMEM 1.3 and 1.4
- https://github.com/sandia-openshmem/sos
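A hedged sketch of a few of the 1.4 additions listed above (shmem_calloc, bitwise atomics, and shmem_test); the bit layout and polling loop are illustrative assumptions, not from the talk.

    #include <stdio.h>
    #include <shmem.h>

    int main(void) {
        shmem_init();
        int me = shmem_my_pe();

        /* 1.4: zero-initialized symmetric allocation */
        unsigned long *flags = shmem_calloc(1, sizeof(unsigned long));

        /* 1.4: bitwise atomic OR; every PE sets its own bit on PE 0 */
        shmem_ulong_atomic_or(flags, 1UL << (me % 64), 0);

        if (me == 0) {
            /* 1.4: non-blocking test of a point-to-point synchronization variable */
            while (!shmem_ulong_test(flags, SHMEM_CMP_NE, 0UL))
                ;                        /* poll until at least one bit has arrived */
            printf("PE 0 flags = 0x%lx\n", *flags);
        }

        shmem_barrier_all();
        shmem_free(flags);
        shmem_finalize();
        return 0;
    }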

OPENSHMEM 1.4 THREAD SAFETY

    int shmem_init_thread(int requested, int *provided);
    void shmem_query_thread(int *provided);

Defines the semantics of threads interacting with OpenSHMEM routines. The threading level is selected at initialization:
- SHMEM_THREAD_SINGLE: no threading
- SHMEM_THREAD_FUNNELED: only the master thread calls the SHMEM API
- SHMEM_THREAD_SERIALIZED: any thread calls the SHMEM API, but calls are serialized
- SHMEM_THREAD_MULTIPLE: any thread calls the SHMEM API, concurrently

Sandia OpenSHMEM supports FI_THREAD_SAFE and FI_THREAD_COMPLETION:
- FI_THREAD_SAFE: SOS-level atomics, no mutexes
- FI_THREAD_COMPLETION: SOS-level mutexes, which can be eliminated with user-provided hints
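As a brief usage sketch (assuming nothing beyond the API above), an application requests a threading level at startup and checks what the implementation actually provided before spawning threads.

    #include <stdio.h>
    #include <shmem.h>

    int main(void) {
        int provided;
        shmem_init_thread(SHMEM_THREAD_MULTIPLE, &provided);  /* request full multithreading */

        if (provided != SHMEM_THREAD_MULTIPLE) {
            /* e.g., fall back to funneling all OpenSHMEM calls through one thread */
        }

        shmem_query_thread(&provided);
        printf("PE %d running at threading level %d\n", shmem_my_pe(), provided);

        shmem_finalize();
        return 0;
    }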

OPENSHMEM CONTEXTS: ISOLATION AND OVERLAP

[Figure: a SHMEM PE without contexts vs. with contexts; without contexts a quiet must complete all outstanding operations, while with contexts each quiet completes only the operations issued on its context]

- Programmer chooses which operations are completed by quiet
- Control communication/computation overlap
- Eliminate interference between threads

OPENSHMEM 1.4 CONTEXTS API

    int shmem_ctx_create(long options, shmem_ctx_t *ctx);
    void shmem_ctx_destroy(shmem_ctx_t ctx);
    void shmem_ctx_putmem(shmem_ctx_t ctx, void *dest, const void *source, size_t nbytes, int pe);
    void shmem_ctx_fence(shmem_ctx_t ctx);
    void shmem_ctx_quiet(shmem_ctx_t ctx);

- SHMEM_CTX_DEFAULT: created during initialization; legacy SHMEM API operations are performed on the default context
- Context options:
  - SHMEM_CTX_SERIALIZED: the given context will not be used by multiple threads concurrently
  - SHMEM_CTX_PRIVATE: the given context will be used only by the thread that created it
- Options enable thread synchronization optimizations; a way is needed to pass hints to OFI in FI_THREAD_SAFE mode to relax synchronization
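The following sketch ties the API above to the isolation/overlap idea from the previous slide: a private context lets a bulk transfer be completed independently of any other traffic. The function and buffer names are hypothetical.

    #include <shmem.h>

    void overlapped_put(void *remote_buf, const void *local_buf,
                        size_t nbytes, int target_pe) {
        shmem_ctx_t ctx;
        int have_ctx = (shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx) == 0);
        if (!have_ctx)
            ctx = SHMEM_CTX_DEFAULT;     /* fall back if no context resources are available */

        /* Issue the transfer on the (private) context ... */
        shmem_ctx_putmem(ctx, remote_buf, local_buf, nbytes, target_pe);

        /* ... overlap it with local computation ... */
        /* compute_something(); */

        /* ... and complete only the operations issued on this context. */
        shmem_ctx_quiet(ctx);

        if (have_ctx)
            shmem_ctx_destroy(ctx);
    }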

CONTEXTS AND THREADS EXAMPLE

    long task_cntr = 0;                      /* Next task counter (symmetric) */

    int main(int argc, char **argv) {
        long ntasks = 1024;                  /* Total tasks per PE */
        ...                                  /* initialization elided on the slide */
        int npes = shmem_n_pes();

    #pragma omp parallel
        {
            shmem_ctx_t ctx;
            int task_pe = shmem_my_pe(), pes_done = 0;
            shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx);

            while (pes_done < npes) {
                long task = shmem_atomic_fetch_inc(ctx, &task_cntr, task_pe);
                while (task < ntasks) {
                    /* Perform task (task_pe, task) */
                    task = shmem_atomic_fetch_inc(ctx, &task_cntr, task_pe);
                }
                pes_done++;
                task_pe = (task_pe + 1) % shmem_n_pes();
            }

            shmem_ctx_destroy(ctx);
        } /* End parallel section */
    }

[Figure: PE 0 through PE N, each holding its own task counter targeted by the fetch-and-increment operations]

- Dynamic load balancing: threads process local tasks, then proceed to help other PEs in round-robin order
- Contexts isolate threads: each fetch-inc completion waits on its own event counter
- Without contexts, threads share a counter, which leads to interference

SOS 1.4.X OFI TRANSPORT ARCHITECTURE

[Figure: OFI transport architecture. SHMEM_CTX_DEFAULT and user contexts each hold put/get counters (bound with FI_WRITE / FI_READ), a CQ for fetch results and optional buffered puts, and a transmit binding (FI_TRANSMIT) onto shareable transmit contexts (STX); an RX endpoint with the address vector and FI address serves as the RMA target; heap and data MRs live in the fabric domain]

- Remote VA: one MR exposes the full virtual address space
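A hedged sketch (not the SOS source) of the OFI objects named in the diagram: an RDM endpoint with separate FI_WRITE and FI_READ counters for put and get completions, a selectively used CQ, and a single memory region exposing the whole address space. Error handling and the receive/target side are omitted, and the attribute choices are assumptions.

    #include <stdint.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>
    #include <rdma/fi_endpoint.h>

    static struct fid_fabric *fabric;
    static struct fid_domain *domain;
    static struct fid_av     *av;
    static struct fid_ep     *ep;
    static struct fid_cntr   *put_cntr, *get_cntr;
    static struct fid_cq     *cq;
    static struct fid_mr     *mr;

    /* 'info' comes from fi_getinfo() requesting FI_RMA | FI_ATOMIC with an FI_EP_RDM endpoint. */
    void transport_setup(struct fi_info *info)
    {
        struct fi_cntr_attr cntr_attr = { .events = FI_CNTR_EVENTS_COMP };
        struct fi_cq_attr   cq_attr   = { .format = FI_CQ_FORMAT_CONTEXT };
        struct fi_av_attr   av_attr   = { .type   = FI_AV_TABLE };

        fi_fabric(info->fabric_attr, &fabric, NULL);
        fi_domain(fabric, info, &domain, NULL);
        fi_endpoint(domain, info, &ep, NULL);

        fi_av_open(domain, &av_attr, &av, NULL);      /* address vector of remote FI addresses */
        fi_ep_bind(ep, &av->fid, 0);

        /* Separate counters track remote completion of puts (writes) and gets (reads). */
        fi_cntr_open(domain, &cntr_attr, &put_cntr, NULL);
        fi_cntr_open(domain, &cntr_attr, &get_cntr, NULL);
        fi_ep_bind(ep, &put_cntr->fid, FI_WRITE);
        fi_ep_bind(ep, &get_cntr->fid, FI_READ);

        /* CQ for operations that need full completions (e.g., buffered puts, fetching atomics). */
        fi_cq_open(domain, &cq_attr, &cq, NULL);
        fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);

        /* "Remote VA": one MR exposes the full virtual address space as the RMA target. */
        fi_mr_reg(domain, NULL, SIZE_MAX, FI_REMOTE_READ | FI_REMOTE_WRITE,
                  0, 0, 0, &mr, NULL);

        fi_enable(ep);
    }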

THREAD-AWARE RESOURCE PRIVATIZATION

[Figure: threads 0-2 mapped onto STXs; each context retains its own put/get counters and CQ (FI_WRITE / FI_READ) within the fabric domain]

- Use shareable transmit contexts (STX)
- Leverage thread-to-context mapping hints to optimize STX assignment
- Scalable endpoints: TX resource assignment is automatic, so it can't be optimized for the usage model
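In libfabric, the shareable resource in question is the STX object; a minimal sketch of creating a few STXs and binding one to a context's endpoint follows (names and the STX count are hypothetical; error handling omitted).

    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>
    #include <rdma/fi_endpoint.h>

    #define NUM_STX 3                       /* hypothetical: e.g., one per communicating thread */

    static struct fid_stx *stx[NUM_STX];

    void alloc_stxs(struct fid_domain *domain, struct fi_info *info)
    {
        for (int i = 0; i < NUM_STX; i++)
            fi_stx_context(domain, info->tx_attr, &stx[i], NULL);
    }

    /* Each SHMEM context gets its own endpoint; the STX it is bound to determines
     * which transmit resource its puts and gets use. */
    void bind_ctx_endpoint(struct fid_ep *ctx_ep, int stx_idx)
    {
        fi_ep_bind(ctx_ep, &stx[stx_idx]->fid, 0);
    }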

SHAREABLE TRANSMIT CONTEXT MANAGEMENT

- The STX allocator controls the assignment of STXs to contexts
- STXs are in a shared, private, or free state
- The default context is created first and claims the 0th STX as shared
- Private contexts:
  - Check the TID-to-STX table for the given thread
  - If no STX is found, attempt to allocate a private STX to the calling thread
  - If none is available, treat the context as shared
- Shared contexts:
  - Allocate according to policy: round-robin, random, least used, etc.
  - Set a low-water mark to favor private usage, or disable private STXs to favor sharing

[Figure: an 8-core CPU whose threads map through a TID-to-STX table onto the HFI's STXs; STX 0 serves the default (shared) context, the others are private, each with a reference count]
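A sketch (not the SOS source) of the kind of bookkeeping described above: a table of STX slots in shared/private/free states with reference counts, a thread-to-STX lookup for private allocation, and round-robin assignment for shared contexts. Locking is omitted for brevity.

    #include <pthread.h>

    enum stx_state { STX_FREE, STX_PRIVATE, STX_SHARED };

    struct stx_slot {
        enum stx_state state;
        int            refcount;
        pthread_t      owner;              /* valid only when state == STX_PRIVATE */
    };

    #define MAX_STX 8
    static struct stx_slot stx_table[MAX_STX];   /* slot 0 is claimed as shared by the default context */
    static int next_shared = 0;                  /* round-robin cursor (synchronization omitted) */

    /* Called when a context is created with SHMEM_CTX_PRIVATE. */
    int stx_alloc_private(void)
    {
        pthread_t me = pthread_self();

        /* Reuse an STX already privately owned by this thread. */
        for (int i = 0; i < MAX_STX; i++)
            if (stx_table[i].state == STX_PRIVATE && pthread_equal(stx_table[i].owner, me)) {
                stx_table[i].refcount++;
                return i;
            }

        /* Otherwise claim a free STX for this thread. */
        for (int i = 0; i < MAX_STX; i++)
            if (stx_table[i].state == STX_FREE) {
                stx_table[i] = (struct stx_slot){ STX_PRIVATE, 1, me };
                return i;
            }

        return -1;                         /* none available: caller falls back to shared */
    }

    /* Shared allocation policy: simple round-robin over the shared STXs. */
    int stx_alloc_shared(void)
    {
        for (int tries = 0; tries < MAX_STX; tries++) {
            int i = next_shared++ % MAX_STX;
            if (stx_table[i].state == STX_SHARED) {
                stx_table[i].refcount++;
                return i;
            }
        }
        return 0;                          /* STX 0 is always shared */
    }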

STX PARTITIONING

- Multiple PEs per node: query the maximum number of STXs and partition them automatically, or set the maximum STXs per PE manually with SHMEM_OFI_STX_MAX
- OpenSHMEM threading introduces new and interesting resource management challenges
- Exposing threads to the middleware enables optimizations
- Good solutions are critical for realizing the full performance potential

[Figure: two PEs sharing an 8-core CPU, each with its private and shared contexts mapped onto a partition of the HFI's STXs]
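A small sketch of the partitioning rule described above: divide the provider's transmit-context count evenly across the PEs on a node unless the user caps it with SHMEM_OFI_STX_MAX. The use of tx_ctx_cnt as the provider limit is an assumption for illustration.

    #include <stdlib.h>
    #include <rdma/fabric.h>

    size_t stx_per_pe(struct fi_info *info, int pes_per_node)
    {
        size_t max_stx = info->domain_attr->tx_ctx_cnt;      /* assumed provider limit */
        size_t per_pe  = max_stx / (size_t) pes_per_node;    /* automatic partition */

        const char *env = getenv("SHMEM_OFI_STX_MAX");       /* manual override */
        if (env && (size_t) atoi(env) < per_pe)
            per_pe = (size_t) atoi(env);

        return per_pe ? per_pe : 1;
    }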

BLOCKING PUT BANDWIDTH

Early results, subject to change. Multithreaded point-to-point unidirectional bandwidth test; each thread has a separate context.

[Figure: unidirectional put bandwidth vs. message size, peaking at roughly 12.5 GB/s]

Configuration: two nodes, 1 PE per node.
- Dual-socket Intel Xeon CPU E5-2699 v3 (Haswell), 2.30 GHz, 18 cores, 36 threads
- Intel Omni-Path Architecture, nodes connected via a single switch
- 64 GB RAM
- Libfabric v1.6.0, PSM2 provider
- CentOS* Linux release 7.3.1611
- Sandia OpenSHMEM v1.4.1rc
- Manual progress and thread completion support enabled

Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown"; implementation of these updates may make these results inapplicable to your device or system. See the notices and disclaimers above; for more complete information visit http://www.intel.com/performance.
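For reference, a hedged sketch of the kind of multithreaded unidirectional put-bandwidth loop described above, with one private context per OpenMP thread; the message size, iteration count, and timing are illustrative assumptions rather than the actual benchmark code.

    #include <omp.h>
    #include <shmem.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_SIZE (1 << 20)             /* hypothetical message size */
    #define ITERS    1000

    int main(void) {
        int provided;
        shmem_init_thread(SHMEM_THREAD_MULTIPLE, &provided);
        int me = shmem_my_pe();
        int nthreads = omp_get_max_threads();
        char *dst = shmem_malloc((size_t) MSG_SIZE * nthreads);   /* symmetric target buffer */

        if (me == 0) {                     /* PE 0 streams puts to PE 1 */
            double t0 = omp_get_wtime();
    #pragma omp parallel
            {
                shmem_ctx_t ctx;
                shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx);
                char *src = malloc(MSG_SIZE);
                char *rem = dst + (size_t) omp_get_thread_num() * MSG_SIZE;

                for (int i = 0; i < ITERS; i++)
                    shmem_ctx_putmem(ctx, rem, src, MSG_SIZE, 1);
                shmem_ctx_quiet(ctx);      /* complete this thread's puts */

                shmem_ctx_destroy(ctx);
                free(src);
            }
            double gbytes = (double) MSG_SIZE * ITERS * nthreads / 1e9;
            printf("bandwidth: %.2f GB/s\n", gbytes / (omp_get_wtime() - t0));
        }

        shmem_barrier_all();
        shmem_free(dst);
        shmem_finalize();
        return 0;
    }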

COMPARISON OF STX ALLOCATION POLICIES

[Figure: bandwidth under the private vs. shared STX allocation policies]

Experiment: 8 threads per PE, increasing the number of STXs from 1 to 8. There is always at least one shared STX (for the default context); how do we assign the rest?
- Private policy at 2 STXs: 1 thread gets a private STX, and 7 threads share 1 STX
- Shared policy at 2 STXs: 8 threads share 2 STXs
The application usage model determines the best way to use the available resources.

Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown"; implementation of these updates may make these results inapplicable to your device or system. See the notices and disclaimers above; for more complete information visit http://www.intel.com/performance.

14th ANNUAL WORKSHOP 2018

THANK YOU

James Dinan, Intel Corporation