Maximizing heterogeneous system performance with ARM interconnect and CCIX

Transcription:

Maximizing heterogeneous system performance with ARM interconnect and CCIX. Neil Parris, Director of Product Marketing, Systems and Software Group, ARM. Teratec, June 2017.

Intelligent, flexible cloud to enable new use cases: compute and acceleration.

Distributing intelligence efficiently at scale. Cortex-A75: ground-breaking performance. Cortex-A55: high efficiency redefined.

Unveiling DynamIQ: a new single-cluster design
- New application processors: the Cortex-A75 and Cortex-A55 32b/64b CPUs, built for DynamIQ technology.
- Private L2 and shared L3 caches: local cache close to the processors, with the L3 cache shared between all cores (see the sketch below).
- DynamIQ Shared Unit (DSU): contains the L3, the Snoop Control Unit (SCU) and all cluster interfaces, including the new AMBA 5 CHI bus interface.
(Slide diagram: per-CPU private L2 caches feeding the DSU's SCU and shared L3 cache; ACP, peripheral port and async bridges; AMBA 4 ACE or AMBA 5 CHI out to the system; DDR4-3200 DRAM.)
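
For a concrete view of the private-L2 / shared-L3 split on a running Linux system, the short C sketch below (an illustration, not ARM-provided code) walks the cache levels that the standard /sys/devices/system/cpu/cpu0/cache nodes report; on a DynamIQ-style part this typically shows a private L2 per CPU and an L3 shared across the cluster.

```c
/* Minimal sketch: list the cache levels Linux reports for CPU 0.
 * Assumes the usual sysfs cache nodes are populated. */
#include <stdio.h>
#include <string.h>

static void read_field(const char *index, const char *field, char *out, size_t len)
{
    char path[256];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu0/cache/%s/%s", index, field);
    FILE *f = fopen(path, "r");
    if (!f) { snprintf(out, len, "?"); return; }
    if (!fgets(out, (int)len, f)) snprintf(out, len, "?");
    out[strcspn(out, "\n")] = '\0';
    fclose(f);
}

int main(void)
{
    char level[16], type[32], size[32], shared[128];
    for (int i = 0; i < 8; i++) {
        char index[16];
        snprintf(index, sizeof(index), "index%d", i);
        read_field(index, "level", level, sizeof(level));
        if (level[0] == '?') break;                /* no more cache nodes */
        read_field(index, "type", type, sizeof(type));
        read_field(index, "size", size, sizeof(size));
        read_field(index, "shared_cpu_list", shared, sizeof(shared));
        printf("L%s %-12s %-8s shared with CPUs %s\n", level, type, size, shared);
    }
    return 0;
}
```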

Build higher-performance systems
- Boost performance: up to 7.5x more compute.
- Innovations to increase throughput: up to 3x more packets per second.
- Tailor solutions from edge to cloud: scale from 1 to 256 CPUs on a single chip, with multichip connectivity supporting CCIX to scale to 1000+ CPUs.
(Slide diagram: an open-source software stack of Linux kernel, ARM Trusted Firmware and UEFI boot running on a Fixed Virtual Platform system model; 1 to 32 Cortex-A75 DynamIQ clusters with DSU and L3 cache, GIC-600, SCP/MCP, CoreSight and peripherals on a Coherent Mesh Network; PCIe/25GbE, MMU-500, NIC-450, storage, and 1 to 8 controllers of DDR4-3200 DRAM; a mix of ARM and non-ARM IP.)

CMN-600: delivering maximum compute density
(Chart: relative performance, on the same process node and test conditions, of 16 Cortex-A57 cores with CoreLink CCN-504 and 2x DMC-520, 32 Cortex-A72 cores with CoreLink CCN-508 and 4x DMC-520, and 64 Cortex-A75 cores with CMN-600 and 8 memory controllers; chart annotations read 7.5x compute, 5x throughput and 3x. Compute is measured by specint2k6_rate; throughput is achieved requested bandwidth.)

Scalable solutions from core to edge
(Slide diagram: scalable system configurations ranging from an access point to data-center compute: 1 to 256 Cortex-A CPUs, 0 MB to 128 MB of system cache, 20 GB/s to 1 TB/s of bandwidth, and 1 to 8 DDR channels, with PCIe, 100GbE, accelerators, IO and NIC-450 connectivity.)

ARM AMBA 5 CHI coherency protocol
- A credited, non-blocking coherence protocol, capable of achieving high frequencies (>2 GHz).
- Layered architecture definition: Protocol, Routing, Link and Physical layers, carrying messages, packets, flits and phits respectively (sketched below).
- Topology agnostic: flexible topologies for target applications.
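
As a rough mental model of that layering (and nothing more), the sketch below encodes the message/packet/flit/phit relationship as C types. Every field name and size here is invented for illustration; the real AMBA 5 CHI encodings are defined by the specification, not by this code.

```c
/* Illustrative model of the layering described above: protocol messages are
 * carried in packets, packets are broken into flits at the link layer, and
 * flits into phits on the physical wires. NOT the AMBA 5 CHI encodings. */
#include <stdint.h>
#include <stdio.h>

typedef struct {              /* protocol layer: a coherence message */
    uint64_t address;
    uint8_t  opcode;          /* e.g. read, write, snoop response */
    uint8_t  src_id, tgt_id;  /* requester and target node IDs */
} chi_message_t;

typedef struct {              /* routing layer: message plus routing choice */
    chi_message_t msg;
    uint8_t       hops[4];    /* path picked by the topology-agnostic router */
} chi_packet_t;

typedef struct {              /* link layer: fixed-size, credited flow unit */
    uint8_t payload[15];
    uint8_t credit;           /* credited, non-blocking flow control */
} chi_flit_t;

/* physical layer: each flit crosses the link as several phits, with the
 * count set by the physical link width (again, illustrative numbers). */
enum { PHIT_BYTES = 4, PHITS_PER_FLIT = sizeof(chi_flit_t) / PHIT_BYTES };

int main(void)
{
    printf("message %zuB -> packet %zuB -> flit %zuB -> %d phits of %dB\n",
           sizeof(chi_message_t), sizeof(chi_packet_t), sizeof(chi_flit_t),
           (int)PHITS_PER_FLIT, PHIT_BYTES);
    return 0;
}
```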

Innovations to increase throughput: up to 3x more throughput with cache stashing
- Used for acceleration, networking and storage use cases: critical data stays on-chip.
- The accelerator or IO selects data placement, giving low-latency access to a CPU or group of CPUs.
- Stash critical data to any cache level: the CPU L2 cache (private to a single CPU thread), the DynamIQ L3 cache (shared by the CPUs in a cluster) or the Agile System Cache (shared by all DynamIQ clusters); see the descriptor sketch below.
(Slide diagram: an accelerator or IO device stashing into CPU L2 caches, cluster L3 caches or the Agile System Cache in front of DDR4.)
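
A hypothetical, software-eye view of what a stash hint might look like is sketched below. The descriptor layout and every field name are invented for illustration; in a real system the stash attributes are carried on the interconnect write transaction itself, not in a software structure like this.

```c
/* Hypothetical sketch of cache stashing from the device side: a DMA/receive
 * descriptor carries a hint naming where the interconnect should place
 * ("stash") the incoming data so the consuming CPU finds it on-chip. */
#include <stdint.h>
#include <stdio.h>

enum stash_target {
    STASH_NONE = 0,       /* no hint: normal allocation policy applies */
    STASH_CPU_L2,         /* private L2 of one CPU: lowest latency for it */
    STASH_CLUSTER_L3,     /* DynamIQ shared L3: any CPU in that cluster */
    STASH_SYSTEM_CACHE,   /* Agile System Cache: shared by all clusters */
};

struct io_descriptor {            /* hypothetical DMA/receive descriptor */
    uint64_t buffer_addr;         /* destination buffer in shared memory */
    uint32_t length;              /* bytes to transfer */
    enum stash_target stash;      /* which cache level to stash into */
    uint16_t stash_id;            /* target CPU or cluster when applicable */
};

int main(void)
{
    /* Example: a NIC receive queue configured so packet data is stashed
     * into the L3 of the cluster running the network stack, keeping the
     * critical data on-chip instead of forcing a DRAM round trip. */
    struct io_descriptor rx = {
        .buffer_addr = 0x80000000ull,
        .length      = 2048,
        .stash       = STASH_CLUSTER_L3,
        .stash_id    = 0,
    };
    printf("stash %u bytes at %#llx into target %d (cluster %u L3)\n",
           (unsigned)rx.length, (unsigned long long)rx.buffer_addr,
           (int)rx.stash, (unsigned)rx.stash_id);
    return 0;
}
```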

Customized, tiered acceleration platform
- Local accelerator: closely coupled cluster accelerator with low-latency control and data exchange.
- Global accelerator or IO: SoC-level accelerator or IO with a direct, high-bandwidth path to memory.
- Remote accelerator: off-chip coherent accelerator with high-performance CCIX connectivity.
- Agile System Cache: shared heterogeneous system cache.
(Slide diagram: Cortex-A DSU clusters, global accelerators/IO and Agile System Cache slices in front of DDR4, with a remote accelerator attached over CCIX.)

Coherent multichip

Interconnect needs at different scales
- SoC interconnect: connectivity for on-chip processors, accelerators, IO and memory elements.
- Server node interconnect (scale-up): a simple multichip interconnect, typically PCIe, on a PCB motherboard with simple switches and expansion connectors.
- Rack interconnect (scale-out): scale-out capabilities with complex topologies connecting 1000s of server nodes and storage elements.

Scale-up server node compute and acceleration: a shared virtual memory system connecting accelerators, smart network devices and persistent memory.

Cache Coherent Interconnect for Accelerators (CCIX)
- Extends cache coherency to multi-chip use cases: intelligent network/storage, machine learning, image/video analytics, search and 4G/5G wireless.
- An open standard to foster innovation, collaboration and choice in solving emerging workload challenges.
- The hardware specification is available to member companies so design starts can begin: http://www.ccixconsortium.com

CCIX multichip connectivity and topologies
- A new class of interconnect providing high performance and low latency for new accelerator use cases: CCIX defines up to 25 GT/s (3x performance*), is examining 56 GT/s (7x performance*) and beyond, and enables low latency via a lightweight transaction layer.
- Flexible, scalable interconnect topologies: point-to-point, daisy-chained and switched topologies connecting compute nodes, switches, accelerators, smart network devices and persistent memory.
- Simplified deployment by leveraging existing PCIe hardware and software infrastructure: CCIX runs on the existing PCIe transport layer and management stack and coexists with legacy PCIe designs.
* Note: relative to PCIe Gen3 (8 GT/s) performance; see the worked arithmetic below.
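
The 3x and 7x figures follow from comparing per-lane signalling rates against PCIe Gen3's 8 GT/s. The minimal C sketch below does that arithmetic for an x16 link, assuming the 128b/130b line code carries over from PCIe; that encoding efficiency is an assumption made here for illustration, not something the slide states.

```c
/* Worked arithmetic behind the 3x / 7x claims: raw x16 link bandwidth at
 * PCIe Gen3 (8 GT/s), CCIX 25 GT/s and a possible 56 GT/s, assuming
 * 128b/130b line encoding throughout (an assumption, not from the slide). */
#include <stdio.h>

static double x16_gbytes_per_s(double gtransfers_per_s)
{
    const double lanes = 16.0;
    const double encoding = 128.0 / 130.0;             /* assumed efficiency */
    return gtransfers_per_s * lanes * encoding / 8.0;  /* Gbit -> GByte */
}

int main(void)
{
    const double rates[] = { 8.0, 25.0, 56.0 };
    for (int i = 0; i < 3; i++)
        printf("%4.0f GT/s x16: %6.1f GB/s (%.1fx PCIe Gen3)\n",
               rates[i], x16_gbytes_per_s(rates[i]), rates[i] / 8.0);
    return 0;
}
```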

Extending the benefits of cache coherency
- CCIX provides true peer processing with shared memory: accelerators may cache data from processor memory, and coherency eliminates the software and DMA overhead of transferring data between devices (see the sketch below).
- The coherency protocol is easily adaptable to other transports in the future.
- Supports all major instruction set architectures (ISAs), enabling innovation and flexibility in accelerator system design.
(Slide diagram: compute nodes with DDR memory and accelerators with caches sharing virtual memory.)
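
To make the "eliminates the software and DMA overhead" point concrete, here is a minimal C sketch contrasting the two models. The accel_* functions and the stub "device" are entirely hypothetical stand-ins for whatever driver API a real accelerator would expose; only the shape of the two code paths matters.

```c
/* Hypothetical accelerator API: without coherency the host copies data to
 * device memory and back; with CCIX-style shared virtual memory the
 * accelerator caches and works on the host's buffer in place.
 * None of these functions are real ARM or CCIX APIs. */
#include <stdio.h>
#include <string.h>

struct accel { int id; };                 /* stand-in device handle */
static float device_mem[1024];            /* stand-in private device memory */

static void accel_copy_to_device(struct accel *a, const void *src, size_t n)
{ (void)a; memcpy(device_mem, src, n); }

static void accel_copy_from_device(struct accel *a, void *dst, size_t n)
{ (void)a; memcpy(dst, device_mem, n); }

static void accel_run(struct accel *a, float *buf, size_t n_bytes)
{   /* the "kernel": double every element of the buffer it is handed */
    (void)a;
    float *p = buf ? buf : device_mem;
    for (size_t i = 0; i < n_bytes / sizeof(float); i++) p[i] *= 2.0f;
}

/* Non-coherent model: explicit copies and DMA bookkeeping on either side. */
static void process_with_copies(struct accel *a, float *data, size_t n)
{
    accel_copy_to_device(a, data, n * sizeof(*data));
    accel_run(a, NULL, n * sizeof(*data));      /* runs on the device copy */
    accel_copy_from_device(a, data, n * sizeof(*data));
}

/* Coherent shared-virtual-memory model: hand over the same pointer the CPU
 * uses; the coherency protocol keeps caches consistent, so no explicit
 * transfers or cache-maintenance calls are needed. */
static void process_coherently(struct accel *a, float *data, size_t n)
{
    accel_run(a, data, n * sizeof(*data));
}

int main(void)
{
    struct accel dev = { 0 };
    float v[4] = { 1, 2, 3, 4 };
    process_with_copies(&dev, v, 4);
    process_coherently(&dev, v, 4);
    printf("v[0] after both paths: %.1f\n", v[0]);  /* 1 -> 2 -> 4 */
    return 0;
}
```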

CCIX example request-to-home data flows
(Slide diagrams of four configurations: an accelerator sharing processor memory, shared processor and accelerator memory, a daisy chain to shared processor memory, and shared memory with aggregation; each shows requestor caches and the home memory they access.)

Today's existing PCIe integration (example CMN-600 mesh design)
- The PCIe IP connects to a CMN IO interface (RNI) via AXI.
(Slide diagram: Cadence PCIe IP with transaction layer, data link layer and a 16 Gbps PHY over 16 lanes, attached through AXI to an RNI on a CMN-600 crosspoint (XP).)

Coherent Multichip Link (CML) (example CMN-600 mesh design)
- The CML converts CHI to CCIX messages; CXS is the CCIX stream interface.
(Slide diagram: the same Cadence PCIe IP with transaction layer, data link layer and 16 Gbps PHY over 16 lanes, now connected to the XP through both the RNI/AXI path and a CML/CXS path.)

Cadence CCIX integration (example CMN-600 mesh design)
- A low-latency CCIX transaction layer sits alongside the PCIe transaction layer.
- Supports up to 25 Gbps, versus 16 Gbps for PCIe Gen4.
(Slide diagram: Cadence IP with PCIe and CCIX transaction layers, a data link layer and a PHY running at up to 25 Gbps over 16 lanes, connected to the XP via the RNI/AXI path and the CML/CXS path.)

CCIX 25 Gbps technology demonstration: 3x faster transfer speed with CCIX versus existing PCIe Gen3 solutions
- A data pattern is transferred at 25 Gbps between two FPGAs over a channel comprising an Amphenol/FCI PCI Express CEM connector and a trace card; the transceivers are electrically compliant with CCIX.
- The fastest data transfer between accelerators over PCI Express connections, and the first public CCIX technology demo, from Xilinx and Amphenol FCI.
https://forums.xilinx.com/t5/xcell-daily-blog/ccix-tech-demo-proves-25Gbps-Performance-over-PCIe/ba-p/767484
https://youtu.be/jpusacnn7va

Scale-up server node performance with CCIX
- CCIX is a class of interconnect providing high performance and low latency for new accelerator use cases.
- Easy adoption and simplified development by leveraging today's data center infrastructure.
- Optimize CCIX SoCs with ARM and CCIX controller IP: servers, FPGAs, GPUs, network/storage adapters, intelligent networks and custom ASICs.
(Slide diagram: accelerator, smart network and persistent memory attached to the compute node.)

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2017 ARM Limited