HPC Network Stack on ARM

1 HPC Network Stack on ARM Pavel Shamis (Pasha) Principal Research Engineer ARM Research ExaComm /22/2017

2 HPC network stack on ARM?

3 Serious ARM HPC deployments starting in 2017
- ARM: an emerging CPU architecture in the HPC and server space
- Future deployments: Isambard (Cray CS-400), Japan Post-K (ARMv8)

4 An introduction to ARM
ARM is the world's leading semiconductor intellectual property supplier. We license to over 350 partners and are present in 95% of smartphones, 80% of digital cameras, and 35% of all electronic devices; a total of 60 billion ARM cores have been shipped.
Our CPU business model: license technology to partners, who use it to create their own system-on-chip (SoC) products. We may license an instruction set architecture (ISA), such as ARMv8-A, or a specific implementation, such as Cortex-A72. Partners who license an ISA can create their own implementation, as long as it passes the compliance tests.
...and our IP extends beyond the CPU.

5 Range of SoCs addressing infrastructure
- Highly Accelerated
- Massively Multicore
- QorIQ Layerscape 2080A
- One size does not fit all

6 Integration with Network Interconnects

7 CCIX
- Accelerators and network (NIC/HCA/etc.) as first-class citizens in the system
- Seamless processor and accelerator hardware cache-coherence support
- Low latency and high bandwidth
- Allows in-line acceleration
  - Bump-in-the-wire processing (network packet processing, storage acceleration, etc.)
- Allows off-line acceleration (co-processor model)
- Driver-less / interrupt-less usage model

8 Scale-up server node
[Diagram labels: DMC-620 (x8), CoreLink CMN-600 (x2), Accelerator, Smart Network, Persistent Memory; shared virtual memory system]

9 CCIX multichip connectivity and topologies
- New class of interconnect providing high performance and low latency for new accelerator use cases
  - CCIX defines 25GT/s (3x performance*)
  - Examining 56GT/s (7x performance*) and beyond
  - Enabling low latency via a light transaction layer
- Flexible, scalable interconnect topologies
  - Flexible point-to-point, daisy-chained, and switched topologies
- Simplified deployment by leveraging existing PCIe hardware and software infrastructure
  - Runs on the existing PCIe transport layer and management stack
  - Coexists with legacy PCIe designs
[Diagram labels: Compute Node, Switch, Accelerator, Smart Network, Persistent Memory]
* Note: Based on PCIe Gen3 performance
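The "3x" and "7x" figures on this slide are stated relative to PCIe Gen3. A minimal sketch of that arithmetic follows, assuming the comparison is between raw per-lane signalling rates and taking 8 GT/s for PCIe Gen3 from the PCIe 3.0 specification (that number is not on the slide). The function names are illustrative only, and encoding or protocol overheads are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Raw per-lane signalling rate of PCIe Gen3 in GT/s. Assumption: taken from
 * the PCIe 3.0 specification, not from the slide. */
#define PCIE_GEN3_GTS 8.0

/* Ratio of a link's raw signalling rate to PCIe Gen3 (hypothetical helper). */
double ratio_vs_pcie_gen3(double gts)
{
    assert(gts > 0.0);
    return gts / PCIE_GEN3_GTS;
}

/* Print the two CCIX rates quoted on the slide next to their ratios. */
void print_ccix_ratios(void)
{
    const double rates[] = { 25.0, 56.0 };

    for (size_t i = 0; i < sizeof rates / sizeof rates[0]; i++)
        printf("%.0f GT/s -> %.3fx PCIe Gen3\n",
               rates[i], ratio_vs_pcie_gen3(rates[i]));
}
```

Under this assumption 25/8 = 3.125, which the slide rounds to 3x, and 56/8 = 7.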

10 Building CCIX devices
Cadence IP for CCIX: built upon silicon-proven PCIe solutions (Cadence Controller & PHY IP)
IP products:
- Controller IP: provides the CCIX transaction and data link layers.
- PHY IP: provides the high-performance SerDes physical layer supporting speeds up to 25Gbps.
- Verification IP: provides the necessary test infrastructure to verify CCIX designs.

11 Cadence CCIX integration
- Example CMN-600 mesh design
- CML converts CHI to CCIX messages
- Low-latency CCIX transaction layer
- Supports up to 25Gbps vs. 16Gbps for PCIe Gen4
- PCIe IP connects to a CMN I/O interface via AXI
[Diagram labels: DMC-620 (x4), CoreLink CMN-600, XP, CML, RNI, CXS, AXI; Cadence IP: CCIX Transaction Layer, PCIe Transaction Layer, Data Link Layer, PHY (up to 25Gbps), 16 lanes]

12 Gen-Z
- All data is accessed by some form of a read or a write
- Examples of reads: DDR Row + Column Read, PCI DMA Read, SCSI Read, Socket Read, File Read, RDMA Read
- Examples of writes: DDR Row + Column Write, PCI DMA Write, SCSI Write, Socket Write, File Write, RDMA Write
- The goal: simplify the world to memory-semantic reads and writes

13 Gen-Z Overview
- An open, standards-based, scalable system interconnect and protocol
- Optimized to support memory-semantic communications
- Breaks the processor-memory interlock
- Split controller model
  - Memory controller
    - Initiates high-level requests: Read, Write, Atomic, Put/Get, etc.
    - Enforces ordering, reliability, path selection, etc.
  - Media controller
    - Abstracts memory media
    - Supports volatile / non-volatile / mixed media
    - Performs media-specific operations
    - Executes requests and returns responses
- Enables data-centric computing (accelerator, compute, etc.)

14 Software Stack Overview

15 Linux / FreeBSD with AArch64 support
- Debian 8 adds AArch64: April
- LTS & 14.04 LTS released; also & releases
- Fedora 22 released May 2015
- Fedora 23 released Nov 2015
- Red Hat Enterprise Linux Server for ARM 7.2 Beta: Sept 2015
- CentOS Linux 7 for AArch64 GA: August 2015
- openSUSE 13.2: Nov 2014
- SUSE launches partner program to bring SUSE Linux Enterprise 12 to 64-bit ARM: July, ISC 15
- Engaged with the FreeBSD Foundation, Semihalf, and Cavium to get FreeBSD on ARMv8
- FreeBSD beta version demoed by Semihalf: Nov. 2015

16 Open source and commercial compilers
- GCC: C, C++, Fortran; OpenMP 4.0
- PathScale: C, C++, Fortran; OpenACC; OpenMP 4.0
- LLVM: C, C++; OpenMP 3.1 (4.0 coming soon); Fortran coming Q
- NAG: Fortran; OpenMP 3.1
- ARM C/C++ Compiler: LLVM-based; includes SVE

17 ARM HPC ecosystem roadmap
[Chart legend: Released, Planned, Concept]
- Hardware: AppliedMicro X-Gene 1 & 2, AMD Seattle, Cavium ThunderX, Qualcomm Centriq, Phytium Mars, Cavium ThunderX2, AppliedMicro X-Gene 3, Fujitsu Post-K (SVE)
- Open-source software: OpenHPC 1.2, ARM Optimized Routines, ARM Optimized Routines vector versions, Altair PBS Pro, GCC (gcc/g++/gfortran), LLVM clang, LLVM Flang
- ARM HPC tools: ARM C/C++ Compiler (ahead of LLVM trunk), ARM Fortran Compiler, ARM Performance Libraries, ARM Code Advisor (Beta), ARM Code Advisor (full release), ARM Instruction Emulator
- ISV software: Allinea DDT and MAP, NAG Library & Compiler, PathScale ENZO, Rogue Wave TotalView, future ISV software

18 RDMA Networks
- Remote Direct Memory Access (RDMA): a popular hardware network technology
- InfiniBand: 37% of systems in TOP

19 RDMA Support
- Mellanox OFED 2.4 and above supports ARM
- Linux kernel and above (maybe even earlier)
- rdma-core runs on ARM
- OFED: no support
- Linux distributions: ongoing process

20 OpenUCX v1.2
- The first official release from the OpenUCX community
- Features
  - Support for InfiniBand and RoCE
    - Transports: RC, UD, DC
    - Support for Accelerated Verbs: 40% speedup on ARM compared to vanilla Verbs
  - Support for Cray Aries and Gemini
  - Support for shared memory: KNEM, CMA, XPMEM, POSIX, SysV
  - Support for x86, ARMv8, Power
  - Efficient memory polling: 36% increase in efficiency on ARM
- The UCX interface is integrated with MPICH, Open MPI, OSHMEM, ORNL-SHMEM, etc.
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. Enabling One-sided Communication Semantics on ARM, HIPS 2017

21 Programming models
- Open MPI compiles and runs on ARMv8
  - Continuous integration with HPCAC ARMv8 server
- MPICH compiles and runs on ARMv8
- MVAPICH compiles (with patches) and runs
- OSHMEM compiles and runs
  - Continuous integration with HPCAC ARMv8 server

22 Example: MPI+SHMEM+OpenUCX on InfiniBand

23 Lessons Learned
- Memory barriers
  - Multithreaded environment
  - Software-hardware interaction
- Examples
  - You can fish for these bugs in MPI implementations around eager-RDMA and shared-memory protocols
  - Writer side: RDMA Write Payload, Write Barrier, Write Notify
  - Reader side: Busy-wait Read Notify, Read Barrier, Read Payload
Maranget, Luc, Susmit Sarkar, and Peter Sewell. "A tutorial introduction to the ARM and POWER relaxed memory models." Draft available from cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf (2012).
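The payload / barrier / notify ordering on this slide can be sketched for the CPU-to-CPU shared-memory case with C11 atomics. This is a minimal illustration, not code from any MPI implementation: `struct mailbox` and the `mailbox_*` functions are hypothetical names, and the release and acquire fences stand in for the slide's "write barrier" and "read barrier". When the payload is written by a NIC via RDMA instead of by another core, only the reader half applies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_BYTES 64

/* Hypothetical shared-memory mailbox: a payload plus a notification flag. */
struct mailbox {
    uint8_t     payload[PAYLOAD_BYTES];
    atomic_uint notify;               /* 0 = empty, 1 = payload is valid */
};

void mailbox_init(struct mailbox *mb)
{
    memset(mb->payload, 0, sizeof mb->payload);
    atomic_init(&mb->notify, 0);
}

/* Writer side: write payload, write barrier, write notify. */
void mailbox_publish(struct mailbox *mb, const void *data, size_t len)
{
    assert(len <= PAYLOAD_BYTES);
    memcpy(mb->payload, data, len);
    /* Without this fence a weakly ordered CPU may make the flag visible
     * before the payload. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mb->notify, 1, memory_order_relaxed);
}

/* Reader side: busy-wait on notify, read barrier, read payload. */
void mailbox_wait(struct mailbox *mb, void *out, size_t len)
{
    assert(len <= PAYLOAD_BYTES);
    while (atomic_load_explicit(&mb->notify, memory_order_relaxed) == 0)
        ;                             /* spin until the flag is set */
    /* Without this fence the payload loads may be satisfied before the
     * flag load that observed the notification. */
    atomic_thread_fence(memory_order_acquire);
    memcpy(out, mb->payload, len);
}
```

On x86 the missing fences often go unnoticed because the hardware ordering is stronger, which is one reason such bugs tend to surface only after a port to ARM or POWER.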

24 More About Barriers
- There are multiple types of barriers
  - DSB
    - Completion semantics
    - Interaction with external devices (PCIe doorbells)
    - Device drivers
  - DMB
    - ISH* domain on Linux
    - Poll-flag, barrier, data
  - ISB
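A hedged sketch of how these barrier types are commonly wrapped in C follows, assuming a GCC or Clang toolchain with inline assembly. The macro names and `ring_doorbell` are hypothetical; on non-AArch64 hosts the macros fall back to a C11 sequentially consistent fence so that the sketch still compiles. The doorbell example follows the slide's pairing of DSB with device interaction; whether a weaker barrier would suffice depends on the device and memory type and is not addressed here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
/* DSB: completes all prior memory accesses before execution continues. */
#define barrier_dsb()       __asm__ volatile("dsb sy"    ::: "memory")
/* DMB in the inner-shareable domain: orders accesses among CPUs. */
#define barrier_dmb_ish()   __asm__ volatile("dmb ish"   ::: "memory")
#define barrier_dmb_ishld() __asm__ volatile("dmb ishld" ::: "memory")
#define barrier_dmb_ishst() __asm__ volatile("dmb ishst" ::: "memory")
/* ISB: flushes the pipeline so later instructions are refetched. */
#define barrier_isb()       __asm__ volatile("isb"       ::: "memory")
#else
/* Portable stand-in (an assumption of this sketch, not an exact mapping). */
#define barrier_dsb()       atomic_thread_fence(memory_order_seq_cst)
#define barrier_dmb_ish()   atomic_thread_fence(memory_order_seq_cst)
#define barrier_dmb_ishld() atomic_thread_fence(memory_order_seq_cst)
#define barrier_dmb_ishst() atomic_thread_fence(memory_order_seq_cst)
#define barrier_isb()       atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical doorbell write: the descriptor must be complete in memory
 * before the device is told to fetch it. */
void ring_doorbell(uint64_t *descriptor, volatile uint32_t *doorbell,
                   uint64_t desc_value, uint32_t db_value)
{
    assert(descriptor != NULL && doorbell != NULL);
    *descriptor = desc_value;   /* data the device will read */
    barrier_dsb();              /* completion barrier before the doorbell */
    *doorbell = db_value;       /* notify the device */
}

/* Poll-flag, barrier, data: the CPU-to-CPU pattern served by DMB. */
uint64_t read_after_flag(const volatile uint32_t *flag, const uint64_t *data)
{
    while (*flag == 0)
        ;                       /* poll the flag */
    barrier_dmb_ishld();        /* order the flag load before the data load */
    return *data;
}
```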

25 Lessons Learned - continued
- Low-level timers
  - Typically found in benchmarks and MPI
  - Code examples
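The slide's own code examples did not survive transcription. As a stand-in, here is a minimal sketch of a low-level timer, assuming a GCC or Clang toolchain and a kernel that permits user-space reads of the ARMv8 generic timer registers (`CNTVCT_EL0` and `CNTFRQ_EL0`). The function names are hypothetical, and the fallback uses C11 `timespec_get`, which is wall-clock rather than monotonic time.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a free-running tick counter (hypothetical helper). */
uint64_t read_ticks(void)
{
#if defined(__aarch64__)
    uint64_t t;
    /* ISB keeps the counter read from being executed early. */
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(t) : : "memory");
    return t;
#else
    /* Fallback for other hosts: nanoseconds of wall-clock time. */
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Tick rate in Hz; on ARMv8 it is read from the timer, not assumed. */
uint64_t ticks_per_second(void)
{
#if defined(__aarch64__)
    uint64_t hz;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(hz));
    return hz;
#else
    return 1000000000ull;
#endif
}

/* Convert a tick delta to seconds. */
double ticks_to_seconds(uint64_t ticks)
{
    uint64_t hz = ticks_per_second();
    assert(hz > 0);
    return (double)ticks / (double)hz;
}
```

The generic timer usually ticks far slower than the CPU clock, so code ported from cycle-counter-based timing should not treat a tick as a cycle.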

26 Lessons Learned - continued
- Not all cache lines are 64 bytes!
  - Implementation dependent: 128 bytes and 64 bytes
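One way to avoid hard-coding 64 is to query the line size at run time. The sketch below assumes a POSIX system; `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension that can be absent or report 0, in which case the code falls back to 128 bytes as a conservative padding value. The fallback and the function names are choices made for this sketch, not something the slide prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Conservative padding when the platform reports nothing (assumption). */
#define FALLBACK_CACHE_LINE 128

/* Best-effort L1 data cache line size in bytes (hypothetical helper). */
size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)
        return (size_t)n;
#endif
    return FALLBACK_CACHE_LINE;
}

/* Round a size up to a multiple of the line size, e.g. to keep per-thread
 * slots from sharing a line. The line size must be a power of two. */
size_t round_up_to_line(size_t bytes, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return (bytes + line - 1) & ~(line - 1);
}
```

Padding to the larger of the possible sizes wastes a little memory on 64-byte parts but avoids false sharing on 128-byte ones.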

27 Optimizations
- AVX => Neon
  - Mostly found around communication request initialization code
  - ib/mlx5/ib_mlx5.inl#l160
- Busy-wait loop
  - See Wait-For-Event (WFE)
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. Enabling One-sided Communication Semantics on ARM, HIPS 2017
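A WFE-based busy-wait can be sketched as follows. This is an illustration and not the UCX implementation. It assumes a GCC or Clang toolchain, that user-space WFE is permitted, and that a write to the monitored address by another agent (or the periodic event stream) wakes the waiting core. `load_or_wait` and `wait_for_change` are hypothetical names, and other architectures fall back to a plain acquire load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load *addr; if it still equals 'old', sleep in WFE until an event arrives.
 * Returns the value that was loaded, which may still be 'old' after a
 * spurious wake-up. */
static inline uint32_t load_or_wait(uint32_t *addr, uint32_t old)
{
#if defined(__aarch64__)
    uint32_t v;
    __asm__ volatile(
        "ldaxr %w0, [%1]\n\t"   /* load-acquire exclusive arms the monitor */
        "cmp   %w0, %w2\n\t"
        "b.ne  1f\n\t"          /* value already changed: do not sleep */
        "wfe\n"                 /* low-power wait for an event */
        "1:"
        : "=&r"(v)
        : "r"(addr), "r"(old)
        : "memory", "cc");
    return v;
#else
    return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
#endif
}

/* Block until *addr differs from 'old' and return the new value. */
uint32_t wait_for_change(uint32_t *addr, uint32_t old)
{
    uint32_t v;

    assert(addr != NULL);
    do {
        v = load_or_wait(addr, old);   /* the loop absorbs spurious wake-ups */
    } while (v == old);
    return v;
}
```

Compared with a tight polling loop, this lets the core idle between checks, which is the kind of saving the slide points to for busy-wait paths.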

28 Preliminary Results

29 Testbed
- 2 x SoftIron Overdrive 3000 servers with AMD Opteron A1100 / 2GHz
- ConnectX-4 IB/VPI EDR (PCIe Gen2 x8)
- Ubuntu
- MOFED
- UCX [0558b41]
- XPMEM [bdfcc52]
- OSHMEM/Open MPI [fed4849]

30 OpenUCX IB: MLX5 vs Verbs
[Chart annotation: 40%]

31 OpenUCX: XPMEM

32 SHMEM_WAIT()
[Chart annotations: 73%, 35%]

33 OpenSHMEM SSCA
[Chart annotation: 7-30%]

34 OpenSHMEM GUPs
[Chart annotation: 21%]
Pavel Shamis, M. Graham Lopez, and Gilad Shainer. Enabling One-sided Communication Semantics on ARM, HIPS 2017

35 Summary
- The Linux RDMA community is doing a great job!
- A lot of progress has been made in the ARM HPC/server software ecosystem

36 The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2017 ARM Limited


More information

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand

MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand Matthew Koop 1,2 Terry Jones 2 D. K. Panda 1 {koop, panda}@cse.ohio-state.edu trj@llnl.gov 1 Network-Based Computing Lab, The

More information

genzconsortium.org Gen-Z Technology: Enabling Memory Centric Architecture

genzconsortium.org Gen-Z Technology: Enabling Memory Centric Architecture Gen-Z Technology: Enabling Memory Centric Architecture Why Gen-Z? Gen-Z Consortium 2017 2 Why Gen-Z? Gen-Z Consortium 2017 3 Why Gen-Z? Businesses Need to Monetize Data Big Data AI Machine Learning Deep

More information

Arm in High Performance Computing: Fortran on AArch64

Arm in High Performance Computing: Fortran on AArch64 Arm in High Performance Computing: Fortran on AArch64 Nathan Sircombe Arm Manchester nathan.sircombe@arm.com 70% of the world s population uses Arm technology 2 Total computing experience Consumer Arm

More information

Toward Building up Arm HPC Ecosystem --Fujitsu s Activities--

Toward Building up Arm HPC Ecosystem --Fujitsu s Activities-- Toward Building up Arm HPC Ecosystem --Fujitsu s Activities-- Shinji Sumimoto, Ph.D. Next Generation Technical Computing Unit FUJITSU LIMITED Jun. 28 th, 2018 0 Copyright 2018 FUJITSU LIMITED Outline of

More information

iwarp Learnings and Best Practices

iwarp Learnings and Best Practices iwarp Learnings and Best Practices Author: Michael Fenn, Penn State Date: March 28, 2012 www.openfabrics.org 1 Introduction Last year, the Research Computing and Cyberinfrastructure group at Penn State

More information

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

LAMMPS Performance Benchmark and Profiling. July 2012

LAMMPS Performance Benchmark and Profiling. July 2012 LAMMPS Performance Benchmark and Profiling July 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource - HPC

More information

PCCC WORKSHOP:AMD の最新製品戦略とプラットフォームソリューション FEBRUARY 19 TH 2016 HIDETOSHI IWASA, FAE MANAGER AMD JAPAN

PCCC WORKSHOP:AMD の最新製品戦略とプラットフォームソリューション FEBRUARY 19 TH 2016 HIDETOSHI IWASA, FAE MANAGER AMD JAPAN PCCC WORKSHOP:AMD の最新製品戦略とプラットフォームソリューション FEBRUARY 19 TH 2016 HIDETOSHI IWASA, FAE MANAGER AMD JAPAN BUILDING ON A HERITAGE OF INNOVATION 64-bit x86 Hardware Virtualization Enablement Integrated Memory

More information

Application Acceleration Beyond Flash Storage

Application Acceleration Beyond Flash Storage Application Acceleration Beyond Flash Storage Session 303C Mellanox Technologies Flash Memory Summit July 2014 Accelerating Applications, Step-by-Step First Steps Make compute fast Moore s Law Make storage

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

ROCm: An open platform for GPU computing exploration

ROCm: An open platform for GPU computing exploration UCX-ROCm: ROCm Integration into UCX {Khaled Hamidouche, Brad Benton}@AMD Research ROCm: An open platform for GPU computing exploration 1 JUNE, 2018 ISC ROCm Software Platform An Open Source foundation

More information

CP2K Performance Benchmark and Profiling. April 2011

CP2K Performance Benchmark and Profiling. April 2011 CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council HPC works working group activities Participating vendors: HP, Intel, Mellanox

More information

Performance of Mellanox ConnectX Adapter on Multi-core Architectures Using InfiniBand. Abstract

Performance of Mellanox ConnectX Adapter on Multi-core Architectures Using InfiniBand. Abstract Performance of Mellanox ConnectX Adapter on Multi-core Architectures Using InfiniBand Abstract...1 Introduction...2 Overview of ConnectX Architecture...2 Performance Results...3 Acknowledgments...7 For

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

ARMv8-A Software Development

ARMv8-A Software Development ARMv8-A Software Development Course Description ARMv8-A software development is a 4 days ARM official course. The course goes into great depth and provides all necessary know-how to develop software for

More information

ARM instruction sets and CPUs for wide-ranging applications

ARM instruction sets and CPUs for wide-ranging applications ARM instruction sets and CPUs for wide-ranging applications Chris Turner Director, CPU technology marketing ARM Tech Forum Taipei July 4 th 2017 ARM computing is everywhere #1 shipping GPU in the world

More information

Modeling Performance Use Cases with Traffic Profiles Over ARM AMBA Interfaces

Modeling Performance Use Cases with Traffic Profiles Over ARM AMBA Interfaces Modeling Performance Use Cases with Traffic Profiles Over ARM AMBA Interfaces Li Chen, Staff AE Cadence China Agenda Performance Challenges Current Approaches Traffic Profiles Intro Traffic Profiles Implementation

More information

RDMA in Embedded Fabrics

RDMA in Embedded Fabrics RDMA in Embedded Fabrics Ken Cain, kcain@mc.com Mercury Computer Systems 06 April 2011 www.openfabrics.org 2011 Mercury Computer Systems, Inc. www.mc.com Uncontrolled for Export Purposes 1 Outline Embedded

More information

Comparing Ethernet & Soft RoCE over 1 Gigabit Ethernet

Comparing Ethernet & Soft RoCE over 1 Gigabit Ethernet Comparing Ethernet & Soft RoCE over 1 Gigabit Ethernet Gurkirat Kaur, Manoj Kumar 1, Manju Bala 2 1 Department of Computer Science & Engineering, CTIEMT Jalandhar, Punjab, India 2 Department of Electronics

More information

Programming for the Intel Many Integrated Core Architecture By James Reinders. The Architecture for Discovery. PowerPoint Title

Programming for the Intel Many Integrated Core Architecture By James Reinders. The Architecture for Discovery. PowerPoint Title Programming for the Intel Many Integrated Core Architecture By James Reinders The Architecture for Discovery PowerPoint Title Intel Xeon Phi coprocessor 1. Designed for Highly Parallel workloads 2. and

More information

LS-DYNA Productivity and Power-aware Simulations in Cluster Environments

LS-DYNA Productivity and Power-aware Simulations in Cluster Environments LS-DYNA Productivity and Power-aware Simulations in Cluster Environments Gilad Shainer 1, Tong Liu 1, Jacob Liberman 2, Jeff Layton 2 Onur Celebioglu 2, Scot A. Schultz 3, Joshua Mora 3, David Cownie 3,

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

Early Software Development Through Emulation for a Complex SoC

Early Software Development Through Emulation for a Complex SoC Early Software Development Through Emulation for a Complex SoC FTF-NET-F0204 Raghav U. Nayak Senior Validation Engineer A P R. 2 0 1 4 TM External Use Session Objectives After completing this session you

More information

Comparing Ethernet and Soft RoCE for MPI Communication

Comparing Ethernet and Soft RoCE for MPI Communication IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 7-66, p- ISSN: 7-77Volume, Issue, Ver. I (Jul-Aug. ), PP 5-5 Gurkirat Kaur, Manoj Kumar, Manju Bala Department of Computer Science & Engineering,

More information

AcuSolve Performance Benchmark and Profiling. October 2011

AcuSolve Performance Benchmark and Profiling. October 2011 AcuSolve Performance Benchmark and Profiling October 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox, Altair Compute

More information

Software Development Using Full System Simulation with Freescale QorIQ Communications Processors

Software Development Using Full System Simulation with Freescale QorIQ Communications Processors Patrick Keliher, Simics Field Application Engineer Software Development Using Full System Simulation with Freescale QorIQ Communications Processors 1 2013 Wind River. All Rights Reserved. Agenda Introduction

More information