Fast, Scalable and Energy Efficient IO Solutions: Accelerating infrastructure SoC time-to-market

Fast, Scalable and Energy Efficient IO Solutions: Accelerating infrastructure SoC time-to-market. Sridhar Valluru, Product Manager. ARM Tech Symposia 2016

Intelligent Flexible Cloud

Scalability and Flexibility: compute, storage and acceleration nodes handling packet flows across the network. The target design space ranges from the access point (2-5W SoC, 4-8 CPUs, ~30mm²) to the data center (100W SoC, 64-96 CPUs, ~300mm²).

Execution environment supporting IFC: containers, JVMs, apps and VNFs run on non-privileged guest OSes (Guest OS 1-3); the guest OSes run in virtual machines under a privileged Virtual Machine Monitor (VMM)/hypervisor; hyper-privileged firmware, Option ROMs, etc. sit on the physical machine (processors, DRAM, caches, MMU, IOMMU, other resources and SoC devices), together with optional, system-dependent external devices (disks, NICs, FPGAs, GPUs, crypto, other accelerators, other devices).

IO challenges for next-gen SoC systems: scalability, performance, power efficiency
- Limited number of hardware IO due to capacity
- Large number of translations
- Large number of page table walks
- Large number of memory accesses
- Large amount of IO stream traffic management
- Insufficient TLB in the SMMU
- Enormous TLB
- Large dynamic power

System memory management unit (SMMU)

Next-generation example server subsystem: Cortex-A processors on a Coherent Mesh Network (CMN-600); Generic Interrupt Controller (GIC); IO (PCIe and accelerators) behind a System MMU (SMMU); a non-coherent interconnect to peripherals; security (CryptoCell); CoreSight SoC debug with ELA-500 embedded logic analyzers; and 1-8 DMC-620 DDR4 memory controllers.

ARM IO MMU or System MMU (SMMU)
IO accelerators (IO #1, IO #2, IO #3) issue virtual addresses in the virtual address space (VA); memory is accessed in the physical address space (PA).
Translation Buffer Unit (TBU) - Performs translation from VA to PA - Holds TLB - Performs security/access checks - Requests missed translations from the TCU
AXI-Stream - Free-flowing transport - Enables a distributed SMMU - Carries address translation traffic over the AXI-Stream interconnect
Translation Cache Unit (TCU) - Performs table walks of the translation tables - Handles ATS* requests/responses for PCIe - Performs security/access checks
*ATS = PCIe Address Translation Services (PCIe 3.0 ECN)
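
The TBU/TCU split above amounts to a two-level lookup: a small translation buffer local to the IO traffic, backed by a central unit that walks the translation tables on a miss. The C sketch below models that flow under simplified assumptions; the structures, sizes and the functions tbu_translate and tcu_walk are illustrative stand-ins, not ARM's implementation.

```c
/* Minimal model of the distributed SMMU flow described above:
 * a TBU holds a small TLB and defers misses to the TCU, which
 * walks the translation tables. Illustrative only; sizes, fields
 * and function names are assumptions, not the MMU-600 design. */
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define TLB_ENTRIES 8
#define PAGE_SHIFT  12          /* 4 KiB granule */

typedef struct {
    bool     valid;
    uint64_t vpn;               /* virtual page number             */
    uint64_t ppn;               /* physical page number            */
} tlb_entry_t;

typedef struct {
    tlb_entry_t tlb[TLB_ENTRIES];
    unsigned    next;           /* trivial round-robin replacement */
} tbu_t;

/* TCU stand-in: a real walk reads the stage-1/stage-2 translation
 * tables from memory; here a fake mapping keeps the demo self-contained. */
static uint64_t tcu_walk(uint64_t vpn)
{
    return vpn ^ 0x40000;
}

/* TBU: check the local TLB, ask the TCU on a miss, then cache the result. */
static uint64_t tbu_translate(tbu_t *tbu, uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT;

    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        if (tbu->tlb[i].valid && tbu->tlb[i].vpn == vpn)
            return (tbu->tlb[i].ppn << PAGE_SHIFT) | (va & 0xFFF);

    uint64_t ppn = tcu_walk(vpn);               /* miss: table walk   */
    tbu->tlb[tbu->next] = (tlb_entry_t){ true, vpn, ppn };
    tbu->next = (tbu->next + 1) % TLB_ENTRIES;  /* simple replacement */
    return (ppn << PAGE_SHIFT) | (va & 0xFFF);
}

int main(void)
{
    tbu_t tbu = { 0 };
    printf("PA = 0x%llx\n",
           (unsigned long long)tbu_translate(&tbu, 0x1234ABCD));
    printf("PA = 0x%llx\n",            /* second access hits the TLB */
           (unsigned long long)tbu_translate(&tbu, 0x1234A000));
    return 0;
}
```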

SMMU architecture evolution
SMMUv1 features: support for v7 page tables for IO virtualization; 4K page granule; implemented by CoreLink MMU-401.
SMMUv2 adds: up to 128 translation contexts; support for v8 page tables; 64K page granule; implemented by CoreLink MMU-500.
SMMUv3 adds: scalability enhancements for millions of translation contexts; context store in memory; PCIe address translation services (ATS) for returning translations to end points with address translation caches (ATC); PCIe process address space ID (PASID) for process-specific translations; PCIe page request interface (PRI) support for access to unpinned pages in memory; software communication via memory queues (non-blocking/scalable, sketched below); support for message-signalled interrupts.
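
A minimal sketch of the SMMUv3 memory-queue idea mentioned above: software produces commands into a ring in memory and the SMMU consumes them, so neither side blocks the other. The layout, command encoding and names (smmu_cmdq_t, cmdq_issue, cmdq_pop) are assumptions for illustration, not the SMMUv3 queue format.

```c
/* Sketch of "software communication via memory queues": the driver
 * produces commands into a ring in memory and the (modelled) SMMU
 * consumes them, with no blocking register handshake. */
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define CMDQ_ENTRIES 16          /* must be a power of two */

typedef struct {
    uint64_t opcode;             /* e.g. a hypothetical "invalidate" code */
    uint64_t operand;            /* e.g. address or context ID            */
} smmu_cmd_t;

typedef struct {
    smmu_cmd_t ring[CMDQ_ENTRIES];
    uint32_t   prod;             /* advanced by software                  */
    uint32_t   cons;             /* advanced by the consumer (the SMMU)   */
} smmu_cmdq_t;

/* Non-blocking producer: returns false if the queue is currently full. */
static bool cmdq_issue(smmu_cmdq_t *q, const smmu_cmd_t *cmd)
{
    if ((q->prod - q->cons) == CMDQ_ENTRIES)
        return false;                         /* caller retries later */
    q->ring[q->prod % CMDQ_ENTRIES] = *cmd;
    q->prod++;
    return true;
}

/* Consumer side, standing in for the SMMU draining the queue. */
static bool cmdq_pop(smmu_cmdq_t *q, smmu_cmd_t *out)
{
    if (q->cons == q->prod)
        return false;                         /* nothing pending */
    *out = q->ring[q->cons % CMDQ_ENTRIES];
    q->cons++;
    return true;
}

int main(void)
{
    smmu_cmdq_t q;
    memset(&q, 0, sizeof q);
    smmu_cmd_t inv = { .opcode = 0x1, .operand = 0x1234A000 };
    cmdq_issue(&q, &inv);                     /* software enqueues    */
    smmu_cmd_t got;
    return cmdq_pop(&q, &got) ? 0 : 1;        /* consumer dequeues    */
}
```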

Addressing the performance and scalability challenge
SMMU microarchitecture: VA-to-PA translation overhead, and limited TLB scalability with the number of IO devices. The TBU holds a micro-TLB (small, fully associative) and a config cache (caches context info); the TCU holds a main TLB (large, set associative) and a multi-level walk cache (separate stage 1/stage 2).
Local ATC (address translation caches): VA-to-PA translation in the ATC removes the dependency on SMMU TLB size. Each IO device (IO #1, #2, #3) has its own ATC and connects to the TCU over the AXI-Stream interconnect. Populating the ATC requires the SMMU to support ATS (address translation services).

Advantages of PCIe ATS for IO access performance
- Scalability of ATCs: the size and number of ATCs grows with the number of IO devices, whereas the TLB size in the SMMU is fixed (however large).
- Independence of ATCs: local ATC accesses are independent of each other and do not result in cache thrashing. A shared TLB in an SMMU can suffer from thrashing if multiple IO devices access too many scattered locations in memory.
- Customizable pre-fetch: IO devices can request translations ahead of time according to known access patterns. A shared TLB in an SMMU is not aware of IO access patterns and cannot implement a universal pre-fetch policy.
- Customizable replacement policies: IO devices can prioritize caching of some entries over others based upon known access patterns. For example, an Ethernet NIC might choose to exclusively cache ring descriptor translations and store data buffer translations only temporarily (see the sketch after this slide).
- Support for unpinned memory, without stalling faults, with the use of PRI.
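
The NIC example in the last bullet list can be pictured as a tiny two-class cache: descriptor-ring translations are pinned, data-buffer translations are disposable. The sketch below shows one way such a replacement policy could look; the entry count, the pinned flag and the API are assumptions, not a real NIC's ATC design.

```c
/* Sketch of the NIC example: an ATC that pins ring-descriptor
 * translations and treats data-buffer translations as evictable. */
#include <stdint.h>
#include <stdbool.h>

#define ATC_ENTRIES 16

typedef struct {
    bool     valid;
    bool     pinned;     /* descriptor-ring translation: never evicted */
    uint64_t vpn, ppn;
} atc_entry_t;

typedef struct {
    atc_entry_t e[ATC_ENTRIES];
    unsigned    victim;  /* round-robin over unpinned entries          */
} atc_t;

static bool atc_lookup(const atc_t *atc, uint64_t vpn, uint64_t *ppn)
{
    for (unsigned i = 0; i < ATC_ENTRIES; i++)
        if (atc->e[i].valid && atc->e[i].vpn == vpn) {
            *ppn = atc->e[i].ppn;
            return true;
        }
    return false;
}

/* Insert a translation; pinned entries (ring descriptors) are skipped
 * when choosing a victim, so bulk data traffic cannot evict them. */
static void atc_insert(atc_t *atc, uint64_t vpn, uint64_t ppn, bool pinned)
{
    for (unsigned tries = 0; tries < ATC_ENTRIES; tries++) {
        unsigned i = atc->victim;
        atc->victim = (atc->victim + 1) % ATC_ENTRIES;
        if (!atc->e[i].valid || !atc->e[i].pinned) {
            atc->e[i] = (atc_entry_t){ true, pinned, vpn, ppn };
            return;
        }
    }
    /* all entries pinned: drop the new translation rather than evict */
}

int main(void)
{
    atc_t atc = { 0 };
    uint64_t ppn;
    atc_insert(&atc, 0x100, 0x900, true);   /* descriptor ring page   */
    atc_insert(&atc, 0x200, 0xA00, false);  /* data buffer page       */
    return atc_lookup(&atc, 0x100, &ppn) ? 0 : 1;
}
```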

ARM SMMU and Cadence PCIe RP integration
M: AXI master interface - all normal PCIe packets, with or without a translated address, are seen here.
S: AXI slave interface.
T: DTI-ATS (direct translation interface for PCIe ATS) - supports ATS translation requests from the EP, invalidation requests from the TCU, and PRI (page request interface) requests from the EP.
Topology: EndPoint (EP, with local ATC) - PCIe link - Root Port (RP, with M, S and T/DTI-ATS interfaces) - TBU and TCU in the SMMU - memory.
*ATS = PCIe Address Translation Services (PCIe 3.0 ECN)

Cadence PCIe RC's DTI-ATS features
- Separate interface provided with the PCIe RC IP: all PCIe ATS-related requests, responses and invalidations are routed to this interface.
- The DTI-ATS implementation supports additional features: PCIe PRI (page request interface) and PCIe PASID (process address space ID).
- DTI-ATS is conveyed using AXI4-Stream: separate master AXI4-Stream and slave AXI4-Stream interfaces; transaction sideband signals indicate the context information (sketched below); DTI-ATS packets can be presented/accepted in one clock cycle.
- Debug and status: registers capture status and error conditions encountered in the DTI-ATS protocol.
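
To make the sideband/context description concrete, here is a hypothetical C view of one DTI-ATS request carried over AXI4-Stream. The field names and widths are assumptions chosen to mirror the slide's description (translation, invalidation and PRI requests, plus context sideband); they are not the ARM DTI specification.

```c
/* Hypothetical layout of one DTI-ATS request over AXI4-Stream,
 * following the slide's description (separate master/slave streams,
 * sideband context, one beat per packet). Illustration only. */
#include <stdint.h>
#include <stdio.h>

typedef enum {
    DTI_REQ_TRANSLATE  = 0,   /* ATS translation request from the EP   */
    DTI_REQ_INVALIDATE = 1,   /* invalidation pushed from the TCU      */
    DTI_REQ_PAGE_REQ   = 2,   /* PRI page request for an unpinned page */
} dti_req_type_t;

typedef struct {
    dti_req_type_t type;
    uint32_t       stream_id;   /* which IO device / PCIe requester    */
    uint32_t       pasid;       /* process address space ID, if used   */
    uint64_t       address;     /* untranslated address to look up     */
    uint8_t        no_write;    /* read-only translation requested     */
} dti_ats_request_t;

/* Sideband signals accompanying the AXI4-Stream data beat. */
typedef struct {
    uint16_t tdest;             /* routing: which TCU port to target   */
    uint16_t tid;               /* transaction tag for the completion  */
} dti_sideband_t;

int main(void)
{
    dti_ats_request_t req = {
        .type      = DTI_REQ_TRANSLATE,
        .stream_id = 3,
        .pasid     = 0x42,
        .address   = 0x1234A000,
        .no_write  = 0,
    };
    dti_sideband_t sb = { .tdest = 0, .tid = 7 };
    printf("ATS request: stream %u pasid 0x%x addr 0x%llx tag %u\n",
           (unsigned)req.stream_id, (unsigned)req.pasid,
           (unsigned long long)req.address, (unsigned)sb.tid);
    return 0;
}
```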

SMMU with PCIe operation: no ATS
1. Client logic generates a TLP with an untranslated address.
2. The EP sends this as a PCIe TLP to the RP.
3. On receipt by the RP, since the packet is a data-flow packet, it is sent on the M interface.
   a) If the TBU does not have a suitable translation for the address received, it issues a request to the TCU.
   b) The TCU responds with the translation for the TBU.
4. The TBU then forwards the transaction to memory.

SMMU with PCIe operation: ATS with an ATC hit
1. Client logic generates a TLP with a virtual address.
2. The client logic uses the translated address if it is available from the ATC.
3. The EP sends this as a PCIe TLP that carries the translated address.
4. On receipt by the RP, since the packet is a data-flow packet, it is sent on the M interface.
5. The TBU then forwards the transaction to memory via the main interconnect.

SMMU with PCIe operation: ATS with an ATC miss
1. The EP client generates a PCIe translation request for a particular address that needs translation.
2. The translation request goes out on the PCIe link to the RP.
3. The RP sends the translation request it received on the T interface to the TCU.
4. The TCU then generates the translation completion.
5. The RP packs the translation completion into a TLP back to the EP.
6. Once the EP has received the completion for the translation request it generated, it populates its local ATC.
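
The three operation flows reduce, on the endpoint side, to one decision: send the TLP untranslated and let the TBU/TCU translate it inline, or consult the local ATC and fall back to an ATS translation request on a miss. The sketch below condenses that decision; ats_request_translation stands in for steps 1-5 of the miss flow, and all names and sizes are illustrative.

```c
/* Condensed model of the three flows above, seen from the endpoint:
 * either send the TLP untranslated (the TBU/TCU translate inline), or
 * consult the local ATC and issue an ATS translation request on a miss. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_MASK 0xFFFULL

typedef struct { bool valid; uint64_t vpn, ppn; } atc_entry_t;
typedef struct { atc_entry_t entry[8]; unsigned next; } atc_t;

/* Stand-in for the ATS request/completion travelling EP -> RP -> TCU -> EP. */
static uint64_t ats_request_translation(uint64_t vpn)
{
    return vpn ^ 0x40000;                     /* fake VA->PA mapping  */
}

static bool atc_lookup(const atc_t *a, uint64_t vpn, uint64_t *ppn)
{
    for (unsigned i = 0; i < 8; i++)
        if (a->entry[i].valid && a->entry[i].vpn == vpn) {
            *ppn = a->entry[i].ppn;
            return true;
        }
    return false;
}

/* Returns the address to place in the outgoing TLP and whether its
 * "translated" attribute should be set. */
static uint64_t ep_issue_tlp(atc_t *atc, uint64_t va, bool use_ats,
                             bool *translated)
{
    if (!use_ats) {                /* no-ATS flow: TBU translates later */
        *translated = false;
        return va;
    }
    uint64_t vpn = va >> 12, ppn;
    if (!atc_lookup(atc, vpn, &ppn)) {        /* ATC miss: steps 1-6    */
        ppn = ats_request_translation(vpn);
        atc->entry[atc->next] = (atc_entry_t){ true, vpn, ppn };
        atc->next = (atc->next + 1) % 8;
    }
    *translated = true;                       /* ATC hit, or now filled */
    return (ppn << 12) | (va & PAGE_MASK);
}

int main(void)
{
    atc_t atc = { 0 };
    bool translated;
    uint64_t addr = ep_issue_tlp(&atc, 0x1234ABCD, true, &translated);
    printf("TLP address 0x%llx translated=%d\n",
           (unsigned long long)addr, translated);
    return 0;
}
```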

IO challenges for next-gen SoC systems: scalability, performance and power efficiency addressed
- The ATC allows the PCIe RC to support multiple IO accelerators.
- With an ATC, address translation is no longer needed for every transaction.
- The ATC and TCU minimize memory accesses for page table walks.
- The AXI-Stream interface allows distributed TBUs to be connected to a TCU.
- The TCU cache reduces page table walks.
- A custom ATC in the IO accelerator removes the need for a very large TLB in the SMMU.

Summary
- IFC is driving the need for scalability, performance and efficiency for IO accesses in infrastructure SoCs.
- ARM has been addressing IO virtualization via its SMMU.
- Fast, performant IO such as PCIe Gen4 from Cadence has been efficiently integrated with ARM's SMMU through an architected interface, DTI-ATS.
- The combined SMMU-PCIe solution delivers high-performance access for IO devices, with PCIe ATS as well as PRI and PASID support.
- SMMU IP from ARM is designed to handle the performance, scalability and power efficiency demands of SoCs for IFC.

Questions? Want to know more? Please contact sridhar.valluru@arm.com