CCIX: a new coherent multichip interconnect for accelerated use cases

Similar documents
Maximizing heterogeneous system performance with ARM interconnect and CCIX

New Interconnnects. Moderator: Andy Rudoff, SNIA NVM Programming Technical Work Group and Persistent Memory SW Architect, Intel

DynamIQ Processor Designs Using Cortex-A75 & Cortex-A55 for 5G Networks

Cortex-A75 and Cortex-A55 DynamIQ processors Powering applications from mobile to autonomous driving

Implementing debug. and trace access. through functional I/O. Alvin Yang Staff FAE. Arm Tech Symposia Arm Limited

Cortex-A75 and Cortex-A55 DynamIQ processors Powering applications from mobile to autonomous driving

A Secure and Connected Intelligent Future. Ian Smythe Senior Director Marketing, Client Business Arm Tech Symposia 2017

Accelerating intelligence at the edge for embedded and IoT applications

DynamIQ Processor Designs Using Cortex-A75 & Cortex- A55 for 5G Networks

Advanced IP solutions enabling the autonomous driving revolution

SmartNICs: Giving Rise To Smarter Offload at The Edge and In The Data Center

Beyond TrustZone PSA Reed Hinkel Senior Manager Embedded Security Market Development

2017 Arm Limited. How to design an IoT SoC and get Arm CPU IP for no upfront license fee

Bringing Intelligence to Enterprise Storage Drives

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems

Comprehensive Arm Solutions for Innovative Machine Learning (ML) and Computer Vision (CV) Applications

WAVE ONE MAINFRAME WAVE THREE INTERNET WAVE FOUR MOBILE & CLOUD WAVE TWO PERSONAL COMPUTING & SOFTWARE Arm Limited

Arm s Latest CPU for Laptop-Class Performance

OCP Engineering Workshop - Telco

The Changing Face of Edge Compute

Connect your IoT device: Bluetooth 5, , NB-IoT

Connect Your IoT Device: Bluetooth 5, , NB-IoT

Fast, Scalable and Energy Efficient IO Solutions: Accelerating infrastructure SoC time-to-market

GEN-Z AN OVERVIEW AND USE CASES

OpenCAPI and its Roadmap

Modeling Performance Use Cases with Traffic Profiles Over ARM AMBA Interfaces

New Approaches to Connected Device Security

Building blocks for 64-bit Systems Development of System IP in ARM

genzconsortium.org Gen-Z Technology: Enabling Memory Centric Architecture

Arm Mbed Edge. Shiv Ramamurthi Arm. Arm Tech Symposia Arm Limited

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

A Developer's Guide to Security on Cortex-M based MCUs

How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC

DRAM and Storage-Class Memory (SCM) Overview

Arm crossplatform. VI-HPS platform October 16, Arm Limited

ARM mbed Towards Secure, Scalable, Efficient IoT of Scale

Bringing Intelligence to Enterprise Storage Drives

ARISTA: Improving Application Performance While Reducing Complexity

Next Generation Enterprise Solutions from ARM

Evolving IP configurability and the need for intelligent IP configuration

OpenCAPI Technology. Myron Slota Speaker name, Title OpenCAPI Consortium Company/Organization Name. Join the Conversation #OpenPOWERSummit

Industry Collaboration and Innovation

Design Process. in an embedded system. Kasper Ornstein Mecklenburg SW/HW development engineer Arm Limited

Compute solutions for mass deployment of autonomy

6.9. Communicating to the Outside World: Cluster Networking

Data Path acceleration techniques in a NFV world

Arm Mbed Edge. Nick Zhou Senior Technical Account Manager. Arm Tech Symposia Arm Limited

Zhang Tianfei. Rosen Xu

Toward a Memory-centric Architecture

Beyond TrustZone Security Enclaves Reed Hinkel Senior Manager Embedded Security Market Develop

DPDK on Arm64 Status Review & Plan

Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs

HPC Network Stack on ARM

ARM mbed mbed OS mbed Cloud

Deploying Data Center Switching Solutions

Each Milliwatt Matters

Agilio CX 2x40GbE with OVS-TC

Messaging Overview. Introduction. Gen-Z Messaging

Welcome. Altera Technology Roadshow 2013

Using Virtual Platforms To Improve Software Verification and Validation Efficiency

RapidIO.org Update. Mar RapidIO.org 1

Building firmware update: The devil is in the details

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd

How to Network Flash Storage Efficiently at Hyperscale. Flash Memory Summit 2017 Santa Clara, CA 1

Combining Arm & RISC-V in Heterogeneous Designs

What is gem5 and where do I get it?

Bifrost - The GPU architecture for next five billion

Industry Collaboration and Innovation

High-Performance, Highly Secure Networking for Industrial and IoT Applications

RapidIO.org Update.

Differentiation in Flexible Data Center Interconnect

FC-NVMe. NVMe over Fabrics. Fibre Channel the most trusted fabric can transport NVMe natively. White Paper

IBM Power Systems: Open innovation to put data to work Dexter Henderson Vice President IBM Power Systems

EDGE COMPUTING & IOT MAKING IT SECURE AND MANAGEABLE FRANCK ROUX MARKETING MANAGER, NXP JUNE PUBLIC

Diversity of. connectivity required for scalable IoT devices. Sam Grove Principal Software Engineer Arm. Arm TechCon 2017.

Integrating CPU and GPU, The ARM Methodology. Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM

Netronome NFP: Theory of Operation

Adaptable Intelligence The Next Computing Era

Negotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye

Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems

Heterogeneous SoCs. May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1

Netronome 25GbE SmartNICs with Open vswitch Hardware Offload Drive Unmatched Cloud and Data Center Infrastructure Performance

An FPGA-Based Optical IOH Architecture for Embedded System

Programmable NICs. Lecture 14, Computer Networks (198:552)

Beyond Hardware IP An overview of Arm development solutions

Gen-Z Memory-Driven Computing

Fast Hardware For AI

An Intelligent NIC Design Xin Song

RISC-V: Enabling a New Era of Open Data-Centric Computing Architectures

Implementing RapidIO. Travis Scheckel and Sandeep Kumar. Communications Infrastructure Group, Texas Instruments

Optimize HPC - Application Efficiency on Many Core Systems

QLogic/Lenovo 16Gb Gen 5 Fibre Channel for Database and Business Analytics

This presentation provides an overview of Gen Z architecture and its application in multiple use cases.

Innovator, Disruptor or Laggard, Where will your storage applications live? Next generation storage

Maintaining Cache Coherency with AMD Opteron Processors using FPGA s. Parag Beeraka February 11, 2009

Designing Security & Trust into Connected Devices

mbed OS Update Sam Grove Technical Lead, mbed OS June 2017 ARM 2017

ARM mbed: Internet of Possible

Gen-Z Overview. 1. Introduction. 2. Background. 3. A better way to access data. 4. Why a memory-semantic fabric

Transcription:

: a new coherent multichip interconnect for accelerated use cases Akira Shimizu Senior Manager, Operator relations Arm 2017 Arm Limited Arm 2017

Interconnects for different scale SoC interconnect. Connectivity for on-chip processor, accelerator, IO and memory elements Server node interconnect - scale-up. Simple multichip interconnect (typically PCIe) topology on a PCB motherboard with simple switches and expansion connectors Rack interconnect - scale-out. Scale-out capabilities with complex topologies connecting 1000 s of server nodes and storage elements 2

Key drivers for interconnect technology Decline of Moore s law forcing more heterogeneous compute. Big data analytics growing at 11.7% CAGR. 5G wireless applications requiring 10x more bandwidth, 10x lower latency by 2021. Increase in distributed data forcing more network intelligence at faster data rates (10GbE -> 100GbE -> 400GbE). Data bandwidth and sharing growth projected at 10x-50x increase vs. present PCIe by 2021. 3

TM cache coherent interconnect for accelerators New class of interconnect for accelerated applications. Mission of the Consortium is to develop and promote adoption of an industry standard specification to enable coherent interconnect technologies between general-purpose processors and acceleration devices for efficient heterogeneous computing. https://www.ccixconsortium.com/ 4

Consortium Inc Promoters Formed January 2016, incorporated in February 2017. Complete ecosystem with 34 members and growing. Contributors Hardware specification available for design starts for member companies. pronounced: (c siks). Arteris, Inc. Guizhou Huaxintong Semiconductor Technology Co. Ltd. INVECAS INC Netronome Phytium Technology Co., Ltd. PLDA Shanghai Zhaoxin Semiconductor Co., Ltd. Silicon Laboratories Inc. SmartDV Technologies India Private Ltd. Adopters 5

Applications benefiting from 4G and 5G base station Data-center Search Embedded Computing High Performance (Super)Computing In memory database processing Intelligent network acceleration Machine / Deep Learning Mobile Edge Computing Video analytics 6

multichip connectivity High performance, low latency. defines 25GT/s (3x performance*) Examining 56GT/s (7x performance*) and beyond Accelerator Enabling low latency via light transaction layer Flexible, scalable interconnect topologies. Flexible point-to-point, daisy chained and switched topologies Seamless integration. Switch Smart Network Persistent Runs on existing PCIe transport layer and management stack Supports all major instruction set architectures (ISA) 7

System topology examples PCIe PCIe Accelerator Accel Accel PCIe Accel PCIe Accel Switch Accel Accel Accel Accel Accelerator Direct attached, daisy chain, mesh and switched topologies. 8

DMA Engines: The problem with traditional accelerators Operating System vendors are interested in the opportunity for workload-optimized accelerators. Traditional DMA approach is to provide a special (Linux) kernel driver for every unique accelerator. uires skilled kernel developers (a driver for each accelerator), failure mode is catastrophic (system crash/downtime) Operating Systems used tomorrow have already been deployed. Updates are 9-12 months apart. Drivers must be in upstream Linux before we support them, a year+ turnaround for every accelerator 'Trilby : DMA Engine driven FPGA based workload accelerator built by Jon Masters for research into the barriers to adoption in the Enterprise, uses traditional approach of kernel driver and Operating System hacks. 9

Coherent virtual memory eliminates data transfer overhead Non-coherent system without Shared Virtual (SVM) Software must manage cache maintenance and data copying Clean and copy data Accelerator Clean and copy data Clean and copy data Accelerator coherent system with Shared Virtual (SVM) Hardware managed cache maintenance, shared address space with direct memory access Accelerator 10

acceleration functions that just work in the cloud Container_1 Container_2 Container_1 Container_2 JVM_1 App_1 VNF_2 function... VNF_1 function... App_4 VNF_3 func VNF_4 func Non-privileged Nonpriveleged Guest OS 1 Guest OS 2 Guest OS 3 Priveleged Privileged Virtual Machines Virtual Machine 1 Virtual Machine 2 Virtual Machine Monitor (VMM)/Hypervisor Virtual Machine 3 Hyper- Hyper Priveleged Privileged Firmware, Option ROMs, etc Physical Machine Firmware Physical Machine (e.g., processors, DRAM, caches, mmu, iommu, other resources and SoC devices...) (Optional) Optional System Dependent System Dependent Other External Devices (e.g., disks, NICs, FPGAs, GPUs, crypto, other accelerators, other devices...) 11

layered architecture Protocol Layer coherency protocol, memory read & write flows Link Layer formats messages for target transport Protocol Layer messages Link Layer PCIe packets Transaction Layer Adds optimized packets, manages credit based flow control Physical Layer Dual mode PHY to support extended data rates Transaction Layer PCIe Data Link Layer PCIe Transaction Layer /PCIe Physical Layer Tx Rx 12

example request to home data flows Accelerator shares processor memory Shared processor and accelerator memory Home Home Home Daisy chain to shared processor memory Shared memory with aggregation Home Home Home 13

Improved efficiency with transaction layer Reduced latency with light weight transaction layer Improved packet efficiency with optimized header 14

PCI PCI PCI PCI port aggregation to boost bandwidth and transactions with Port Aggregation defines a hashing function to steer requests across multiple links. Aggregation effectively multiplies the bandwidth. Aggregation could also be used to increase number of transactions (eg 50GT/s vs 25GT/s). PCIe requires separate address spaces, requests can not be hashed. Home Home Home PCIe with Aggregation Home0 Accelerator Home0 Mem0 Mem1 15

RNI PCIe Transaction Layer Data Link Layer Transaction Layer PHY (up to 25Gpbs) CML SoC integration example Example CMN-600 mesh design DMC-620 DMC-620 3 rd party PCIe/ IP CXS CoreLink CMN-600 XP 16 Lanes AXI DMC-620 DMC-620 16

Arm demonstration vehicle Arm s DynamIQ and CoreLink CMN-600 technology. Cadence and PCIe controller and PHY IP. TSMC 7nm process technology. Connectivity to Xilinx s Virtex UltraSoC+ FPGA. Xilinx, Arm, Cadence, and TSMC Announce World's First Silicon Demonstration Vehicle in 7nm Process Technology 17

Scale-up server node performance with is a class of interconnect providing high performance, low latency for new accelerators use cases. Easy adoption and simplified development by leveraging today s data center infrastructure. IP available from Arm and ecosystem to optimize SoC today. Server, FPGAs, GPUs, network/storage adapters, intelligent networks and custom ASICs For more information go to: https://developer.arm.com 18

Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks 19