Fast, Scalable and Energy-Efficient IO Solutions: Accelerating Infrastructure SoC Time-to-Market
Sridhar Valluru, Product Manager
ARM Tech Symposia 2016
Intelligent Flexible Cloud
Scalability and Flexibility
[Diagram: compute, storage and acceleration nodes handling packet flows, spanning the target design space from the access point (2-5W SoC, 4-8 CPUs, ~30mm²) to the data center (100W SoC, 64-96 CPUs, ~300mm²)]
Execution environment supporting IFC
[Diagram: the virtualization stack, from least to most privileged. Non-privileged: containers, apps and VNFs, some hosted in a JVM, running in guest OSes on Virtual Machines 1-3. Privileged: the Virtual Machine Monitor (VMM)/hypervisor. Hyper-privileged: firmware, option ROMs, etc. Underneath sits the physical machine (processors, DRAM, caches, MMU, IOMMU, other resources and SoC devices) with optional, system-dependent external devices (disks, NICs, FPGAs, GPUs, crypto, other accelerators, other devices)]
IO challenges for next-gen SoC systems
- Scalability: limited number of hardware IO contexts due to capacity; large number of IO streams whose traffic must be managed
- Performance: large number of translations; large number of page table walks; large number of memory accesses; insufficient TLB capacity in the SMMU
- Power efficiency: an enormous TLB costs large dynamic power
System memory management unit (SMMU)
Next-generation example server subsystem
[Diagram: Cortex-A processor clusters on a Coherent Mesh Network (CMN-600), with the Generic Interrupt Controller (GIC), a System MMU (SMMU) fronting IO (PCIe and accelerators), CoreSight SoC debug with ELA-500 embedded logic analyzers, a non-coherent interconnect to peripherals, security (CryptoCell), and 1-8 DMC-620 DDR4 memory controllers]
ARM IO MMU or system MMU (SMMU)
[Diagram: IO devices #1-#3 issue transactions in the virtual address space (VA); TBUs translate them into the physical address space (PA) on the way to memory, and an AXI-Stream interconnect links the TBUs to the TCU]
- Translation Buffer Unit (TBU): performs translation from VA to PA; holds a TLB; performs security/access checks; requests missing translations from the TCU
- Local AXI-Stream: free-flowing transport; enables a distributed SMMU
- Translation Control Unit (TCU): performs table walks of the translation tables; handles ATS* requests/responses for PCIe; performs security/access checks
* ATS: PCIe address translation services (PCIe 3.0 ECN)
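To make the TBU/TCU split concrete, here is a minimal C sketch of the flow above: the TBU checks its local TLB and performs the access check, deferring misses to a TCU table walk. All names (tbu_translate, tcu_table_walk) and the stubbed walk result are illustrative assumptions, not ARM's implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

#define PAGE_SHIFT  12                 /* 4K page granule */
#define TLB_ENTRIES 8

typedef struct { uint64_t va_page, pa_page; bool valid, write_ok; } tlb_entry;
static tlb_entry tbu_tlb[TLB_ENTRIES]; /* the TBU's local TLB */

/* TCU side: walk the translation tables in memory (stubbed here) and
 * hand the result back to the requesting TBU over the AXI-Stream link. */
static bool tcu_table_walk(uint64_t va_page, tlb_entry *out)
{
    out->va_page  = va_page;
    out->pa_page  = va_page ^ 0x100000; /* pretend walk result */
    out->valid    = true;
    out->write_ok = true;
    return true;                        /* no translation fault */
}

/* TBU side: check the local TLB; on a miss, request the translation
 * from the TCU; then perform the security/access check. */
static bool tbu_translate(uint64_t va, bool is_write, uint64_t *pa)
{
    uint64_t va_page = va >> PAGE_SHIFT;
    tlb_entry *e = &tbu_tlb[va_page % TLB_ENTRIES];

    if (!e->valid || e->va_page != va_page)   /* TLB miss */
        if (!tcu_table_walk(va_page, e))
            return false;                     /* translation fault */
    if (is_write && !e->write_ok)
        return false;                         /* access check fails */
    *pa = (e->pa_page << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return true;
}

int main(void)
{
    uint64_t pa;
    if (tbu_translate(0x40001234, false, &pa))
        printf("VA 0x40001234 -> PA 0x%" PRIx64 "\n", pa);
    return 0;
}
```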
SMMU architecture evolution
- SMMUv1 features: support for v7 page tables for IO virtualization; 4K page granule; implemented by CoreLink MMU-401
- SMMUv2 adds: up to 128 translation contexts; support for v8 page tables; 64K page granule; implemented by CoreLink MMU-500
- SMMUv3 adds: scalability enhancements for millions of translation contexts; context store in memory; PCIe address translation services (ATS) for returning translations to endpoints with address translation caches (ATCs); PCIe process address space ID (PASID) for process-specific translations; PCIe page request interface (PRI) support for access to unpinned pages in memory; software communication via memory queues (non-blocking/scalable); support for message-signalled interrupts
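The SMMUv3 memory-queue model can be sketched as a circular buffer with explicit producer/consumer indices, so issuing a command never blocks. This is a minimal sketch assuming an illustrative command layout; a real SMMUv3 exchanges the indices through MMIO registers such as SMMU_CMDQ_PROD and uses wrap bits.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CMDQ_DEPTH 64u                   /* power of two */

typedef struct { uint32_t opcode; uint64_t arg; } cmd_t;

static cmd_t    cmdq[CMDQ_DEPTH];        /* queue lives in ordinary memory */
static uint32_t prod;                    /* advanced by software           */
static uint32_t cons;                    /* advanced by the SMMU hardware  */

/* Non-blocking enqueue: fail instead of stalling when the queue is full. */
static bool cmdq_issue(uint32_t opcode, uint64_t arg)
{
    if (prod - cons == CMDQ_DEPTH)       /* ring is full */
        return false;
    cmdq[prod % CMDQ_DEPTH] = (cmd_t){ opcode, arg };
    prod++;                              /* real HW: update SMMU_CMDQ_PROD */
    return true;
}

int main(void)
{
    /* e.g. queue a hypothetical invalidate-by-ASID command */
    if (cmdq_issue(0x11 /* CMD_TLBI_ASID, illustrative */, 5))
        printf("queued: prod=%u cons=%u\n", prod, cons);
    else
        puts("queue full, retry later");
    return 0;
}
```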
Addressing the performance & scalability challenge
- SMMU microarchitecture challenges: VA-to-PA translation overhead; limited TLB scalability with the number of IO devices
- VA-to-PA translation held in a local ATC removes the dependency on SMMU TLB size
- SMMU/TCU structures: micro-TLB (small, fully associative); config cache (caches context info); main TLB (large, set associative); multi-level walk cache (separate stage 1/stage 2)
- Populating the ATC requires the SMMU to support ATS (address translation services)
[Diagram: IO devices #1-#3, each with a local ATC (address translation cache), reaching the SMMU TCU over the AXI-Stream interconnect]
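A minimal sketch of the TLB hierarchy on this slide: a small fully associative micro-TLB in front of a larger set-associative main TLB. The sizes and the naive refill policy are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT   12
#define UTLB_ENTRIES 4                  /* micro-TLB: small, fully associative */
#define MAIN_SETS    64                 /* main TLB: large, set associative    */
#define MAIN_WAYS    4

typedef struct { uint64_t va_page, pa_page; bool valid; } tlbe;

static tlbe utlb[UTLB_ENTRIES];
static tlbe main_tlb[MAIN_SETS][MAIN_WAYS];

static bool tlb_lookup(uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_SHIFT, off = va & ((1u << PAGE_SHIFT) - 1);

    /* Fully associative: compare against every micro-TLB entry. */
    for (int i = 0; i < UTLB_ENTRIES; i++)
        if (utlb[i].valid && utlb[i].va_page == vpn) {
            *pa = (utlb[i].pa_page << PAGE_SHIFT) | off;
            return true;
        }

    /* Set associative: VA bits select one set; compare each way in it. */
    tlbe *set = main_tlb[vpn % MAIN_SETS];
    for (int w = 0; w < MAIN_WAYS; w++)
        if (set[w].valid && set[w].va_page == vpn) {
            utlb[vpn % UTLB_ENTRIES] = set[w];  /* naive micro-TLB refill */
            *pa = (set[w].pa_page << PAGE_SHIFT) | off;
            return true;
        }
    return false;       /* miss: continue to the multi-level walk cache */
}

int main(void)
{
    main_tlb[0x40000 % MAIN_SETS][0] = (tlbe){ 0x40000, 0x90000, true };
    uint64_t pa;
    printf(tlb_lookup(0x40000123, &pa) ? "hit\n" : "miss\n");
    return 0;
}
```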
Advantages of PCIe ATS for IO access performance
- Scalability of ATCs: the size and number of ATCs grow with the number of IO devices, whereas the TLB size in an SMMU is fixed (however large)
- Independence of ATCs: local ATC accesses are independent of each other and do not result in cache thrashing; a shared TLB in an SMMU can suffer from thrashing if multiple IO devices access too many scattered locations in memory
- Customizable pre-fetch: IO devices can request translations ahead of time according to known access patterns; a shared TLB in an SMMU is not aware of IO access patterns and cannot implement a universal pre-fetch policy
- Customizable replacement policies: IO devices can prioritize caching some entries over others based upon known access patterns; e.g., an Ethernet NIC might choose to cache ring-descriptor translations exclusively and store data-buffer translations only temporarily (see the sketch below)
- Support for unpinned memory, without stalling on faults, through the use of PRI
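As a sketch of the NIC example above, a device-side ATC might mark ring-descriptor translations as pinned so that eviction only ever recycles data-buffer entries. The ATC layout and round-robin policy here are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define ATC_ENTRIES 8

typedef struct { uint64_t va_page, pa_page; bool valid, pinned; } atc_entry;

static atc_entry atc[ATC_ENTRIES];
static unsigned  next_victim;          /* round-robin eviction pointer */

/* Insert a translation, evicting only unpinned entries. */
static bool atc_insert(uint64_t va_page, uint64_t pa_page, bool pinned)
{
    for (unsigned tries = 0; tries < ATC_ENTRIES; tries++) {
        atc_entry *e = &atc[next_victim];
        next_victim = (next_victim + 1) % ATC_ENTRIES;
        if (!e->valid || !e->pinned) {     /* never evict pinned entries */
            *e = (atc_entry){ va_page, pa_page, true, pinned };
            return true;
        }
    }
    return false;  /* every slot pinned: caller falls back to untranslated access */
}

int main(void)
{
    atc_insert(0x100, 0x9100, true);   /* descriptor ring: keep resident */
    atc_insert(0x200, 0x9200, false);  /* data buffer: temporary         */
    return 0;
}
```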
ARM SMMU and Cadence PCIe RP integration
- M: AXI master interface; all normal PCIe packets, with or without a translated address, are seen here
- S: AXI slave interface
- T: DTI-ATS (direct translation interface for PCIe ATS); supports ATS translation requests from the EP, invalidation requests from the TCU, and PRI (page request interface) requests from the EP
[Diagram: EndPoint (EP) with a local ATC connected over the PCIe link to the Root Port (RP); the RP's M interface feeds the SMMU on the path to memory, while its T interface carries DTI-ATS traffic to the TCU]
* ATS: PCIe address translation services (PCIe 3.0 ECN)
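A minimal sketch of the routing decision these interfaces imply: ATS/PRI control traffic is steered to the T (DTI-ATS) interface towards the TCU, while ordinary memory TLPs, translated or not, flow out of M. The enum names are illustrative; in PCIe the AT header field distinguishes untranslated (00b), translation-request (01b) and translated (10b) memory requests.

```c
#include <stdio.h>

/* AT field of a PCIe memory request header (per the ATS ECN). */
typedef enum { AT_UNTRANSLATED, AT_TRANSLATION_REQ, AT_TRANSLATED } at_field;
/* Upstream TLP kinds of interest (names are illustrative). */
typedef enum { TLP_MEM_RD, TLP_MEM_WR, TLP_MSG_INV_COMPLETION,
               TLP_MSG_PAGE_REQ } tlp_kind;
typedef enum { ROUTE_M_IF, ROUTE_T_IF } route;

static route rp_route(tlp_kind kind, at_field at)
{
    if (kind == TLP_MSG_INV_COMPLETION || kind == TLP_MSG_PAGE_REQ)
        return ROUTE_T_IF;              /* ATS/PRI control traffic -> TCU */
    if (kind == TLP_MEM_RD && at == AT_TRANSLATION_REQ)
        return ROUTE_T_IF;              /* ATS translation request -> TCU */
    return ROUTE_M_IF;                  /* normal data flow, translated or not */
}

int main(void)
{
    printf("MemRd/untranslated  -> %s\n",
           rp_route(TLP_MEM_RD, AT_UNTRANSLATED) == ROUTE_M_IF ? "M" : "T");
    printf("Translation request -> %s\n",
           rp_route(TLP_MEM_RD, AT_TRANSLATION_REQ) == ROUTE_M_IF ? "M" : "T");
    return 0;
}
```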
Cadence PCIe RC's DTI-ATS features
- Separate interface provided with the PCIe RC IP: all PCIe ATS-related requests, responses and invalidations are routed to this I/F
- The DTI-ATS implementation supports additional features: PCIe PRI (page request interface); PCIe PASID (process address space ID)
- DTI-ATS is conveyed using AXI4-Stream: separate master AXI4-Stream and slave AXI4-Stream interfaces; transaction sideband signals carry the context information, so DTI-ATS packets can be presented/accepted in one clock cycle
- Debug & status: registers capture status and error conditions encountered in the DTI-ATS protocol
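One way to picture the sideband arrangement is a single AXI4-Stream beat in which the context travels next to the payload, so a whole request can be presented and accepted in one cycle. Every field name and width below is an assumption for illustration, not the actual Cadence signal list.

```c
#include <stdint.h>

/* One beat of the DTI-ATS AXI4-Stream channel (all names/widths assumed). */
typedef struct {
    uint64_t tdata;      /* packed translation request/response payload */
    uint16_t tuser_ctx;  /* sideband: translation context / stream ID   */
    uint8_t  tid;        /* sideband: tags outstanding requests         */
    uint8_t  tlast;      /* 1 = final beat of this DTI-ATS packet       */
} dti_ats_beat;
```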
SMMU with PCIe operation: no ATS
1. Client logic generates a TLP with an untranslated address
2. The EP sends this as a PCIe TLP to the RP
3. On receipt by the RP, since the packet is a data-flow packet, it is sent on the M interface
   a) If the TBU does not have a suitable translation for the address received, it issues a request to the TCU
   b) The TCU responds with the translation for the TBU
4. The TBU then forwards the transaction to memory
[Diagram: steps 1-4 on the EP → RP → SMMU → memory path, with a/b between the TBU and the TCU]
SMMU with PCIe operation: ATS with ATC hit
1. Client logic generates a TLP with a virtual address
2. The client logic uses the translated address, available from the ATC (lookup: ATC hit)
3. The EP sends this as a PCIe TLP carrying the translated address
4. On receipt by the RP, since the packet is a data-flow packet, it is sent on the M interface
5. The TBU then forwards the transaction to memory via the main interconnect
SMMU with PCIe operation: ATS with ATC miss
1. EP client logic generates a PCIe translation request for a particular address that needs translation (lookup: ATC miss)
2. The translation request goes out on the PCIe link to the RP
3. The RP sends the translation request it received out on the T interface to the TCU
4. The TCU then generates the translation completion
5. The RP packs the translation completion into a TLP back to the EP
6. Once the EP receives the completion for the translation request it generated, it populates the local ATC
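The three flows above condense into one EP-side sketch: look up the local ATC, on a miss perform the ATS translation round-trip and populate the ATC, then emit the memory TLP with a translated address. ats_translate() stands in for steps 2-5 of the miss flow; all names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12
#define ATC_ENTRIES 8

typedef struct { uint64_t va_page, pa_page; bool valid; } atc_entry;
static atc_entry atc[ATC_ENTRIES];     /* the EP's local ATC */

/* Stand-in for the ATS translation request/completion over the link. */
static bool ats_translate(uint64_t va_page, uint64_t *pa_page)
{
    *pa_page = va_page ^ 0x80000;      /* pretend TCU walk result */
    return true;
}

/* Issue a DMA read for va: on a miss the ATS round-trip fills the ATC;
 * on a hit only the final, already-translated TLP is sent. */
static bool ep_dma_read(uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    atc_entry *e = &atc[vpn % ATC_ENTRIES];

    if (!e->valid || e->va_page != vpn) {   /* ATC miss */
        uint64_t ppn;
        if (!ats_translate(vpn, &ppn))      /* translation request -> TCU */
            return false;
        *e = (atc_entry){ vpn, ppn, true }; /* populate the local ATC */
    }
    uint64_t pa = (e->pa_page << PAGE_SHIFT) | (va & 0xFFF);
    printf("MemRd TLP, AT=translated, addr=0x%llx\n",
           (unsigned long long)pa);
    return true;
}

int main(void) { return ep_dma_read(0x7f001230) ? 0 : 1; }
```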
IO challenges for next-gen SoC systems
- Scalability: the ATC allows the PCIe RC to support multiple IO accelerators; the AXI-Stream interface allows distributed TBUs to be connected to one TCU
- Performance: with an ATC, address translation is no longer needed for every transaction; the TCU cache reduces page table walks
- Power efficiency: the ATC & TCU minimize memory accesses for page table walks; a custom ATC in the IO accelerator removes the need for a very large TLB in the SMMU
Summary
- IFC is driving the need for scalability, performance and efficiency in IO accesses for infrastructure SoCs
- ARM has been addressing IO virtualization solutions via its SMMU
- Fast, performant IO such as PCIe Gen4 from Cadence has been efficiently integrated with ARM's SMMU through an architected interface, DTI-ATS
- The combined SMMU-PCIe solution delivers high-performance access for IO devices, with PCIe ATS as well as PRI and PASID support
- SMMU IP from ARM is designed to handle the performance, scalability and power-efficiency demands of SoCs for IFC
Questions? Want to know more? Please contact sridhar.valluru@arm.com