DynamIQ Processor Designs Using Cortex-A75 & Cortex-A55 for 5G Networks
David Koenen, Sr. Product Manager, Arm
Arm Tech Symposia 2017, Taipei
© 2017 Arm Limited
Agenda
- 5G networks
- Ecosystem software to support Arm-based deployments of network hardware
- Arm Cortex-A75 and Cortex-A55
- Conclusions
5G Networks
5G Dynamics Across the Network: Edge to Cloud Compute
[Diagram: packet flows from devices (IoT, automotive, industrial, medical) through access/edge and edge compute to the core/DC, with storage, acceleration, and compute at each stage]
5G vs. LTE:
- Latency: 30-50x
- Throughput: 100x
- Connections: 100x
- Mobility: 1.5x
4G logical RAN architecture
[Diagram: Remote Units (RUs) connect over CPRI to Baseband Units (BBUs, the eNodeBs), which connect to the Evolved Packet Core (EPC)]
- BBU protocol stack: L1/PHY, MAC, RLC, PDCP, plus management and control
- EPC elements: MME, HSS (S6a), SGW (S11), PGW (S5), PCRF (Gx)
- RU: Radio Unit; BBU: Baseband Unit; EPC: Evolved Packet Core
Source: Ericsson review, 3GPP TR 38.801
5G logical RAN architecture
[Diagram: Radio Units (RUs) connect to Distributed Units (DUs), which connect through a management switch to a Central Unit (CU); the CU and DUs form a gNB, which connects to the EPC/NGC]
- CU hosts PDCP and management functions
- DU hosts L1/PHY, MAC, and RLC
- RU: Radio Unit; DU: Distributed Unit; CU: Central Unit; EPC: Evolved Packet Core; NGC: Next Generation Core
Source: Ericsson review, 3GPP TR 38.801
Network/Wireless Infrastructure: Three Distinct Segments, Two Distinct Approaches
- Edge network (typically specialized platform designs): 5G air interface management — antenna, RRH, DFE, distributed baseband, multi-access compute acceleration, IoT gateway and end applications. Optimized for power, performance, scalability, and latency. Air interface processing, base station, network and RAN offload.
- Mobile Edge Compute: brings performance-critical functions to the edge; runs virtualized/containerized workloads; server network offload.
- Core network/data centre (typically standardized platform designs): server-like architecture, more efficient on Arm. 5G network hierarchy design — Multi-Access Edge Compute, distributed Evolved Packet Core, distributed applications processing, distributed content delivery. Runs virtualized/containerized workloads.
Networking-centric Ecosystem Enabling Deployments
Software stack, top to bottom:
- Key applications, VNFs, and middleware
- Operating system and virtual infrastructure managers
- Virtualization and HAL
- Firmware & network boot (ACPI)
- OEMs and ODMs
- Arm SoC
Linaro Segment Groups - Infrastructure
Networking (LNG): OSS for networking
- OpenDataPlane (ODP): cross-platform support for SoC accelerators
- Real-time support: virtualization, core isolation
- OPNFV integration
Enterprise (LEG): OSS for Arm servers
- Reduces fragmentation and cost; accelerates time to market for servers based on Arm
- ATF/UEFI/ACPI, KVM/Xen
- Armv8 optimization: OpenJDK, Hadoop, OpenStack, Docker
WorksOnArm and AIDC
- WorksOnArm: initiated and hosted by Packet.net; a public vehicle to engage open-source developers and share software readiness
- AIDC: initiated and hosted by Arm; a public vehicle to engage the infrastructure developer community (hardware & software)
Demonstrating Scalability: the PicoPod
The world's tiniest Pharos pod: the NFV PicoPod — a fully functional OPNFV Pharos pod in ~1 cu. ft.
- 1 jump server, 3 controllers, and 2 compute nodes
- 12-port switch, power supply, fans
- Marvell MACCHIATObin boards with Marvell Armada 8040 SoC (quad Arm Cortex-A72)
- Dual 10G NICs, 16GB RAM, PCIe, SATA, and more
- Runs Danube-release NFVi and sample VNFs
Cortex-A75 and Cortex-A55
Networking requires diverse platforms
[Diagram: example platform block diagrams built from general-purpose processors (GPPs), accelerators, NPUs, DSPs, PLDs, memory, storage, and networking blocks]
- Wireless access point / vCPE
- Edge content caching & optimization
- Switch / router blade
- NFV appliance
- Base station / modem
- Scale out / server
Scalable solutions from edge to core
Scalable system configurations, from access point to data center compute:
- Cortex-A CPUs: 4 to 256
- System cache: 0MB to 128MB
- Bandwidth: 20 GB/s to 1 TB/s
- DDR channels: 1 to 8
[Diagram: CoreLink CMN-600 mesh connecting Cortex-A CPU clusters, system cache, DMC-620 memory controllers, accelerators, PCIe, 100GbE, and NIC-450 I/O]
- DMC-620: dynamic memory controller
- CMN-600: mesh network interconnect
New DynamIQ-based CPUs for new possibilities
Arm Cortex-A75 (baseline: Cortex-A72):
- SPECint2006: 1.23x
- SPECfp2006: 1.39x
- LMBench memcpy: 1.30x
Arm Cortex-A55 (baseline: Cortex-A53):
- SPECint2006: 1.21x
- SPECfp2006: 1.42x
- LMBench memcpy: 1.97x
All comparisons at ISO process and frequency.
Next-generation features
- Dot product and half-precision float instructions for AI/ML processing
- Virtualization Host Extensions (VHE) for Type-2 hypervisor (KVM) performance improvements
- Cache stashing and atomic operations improve multicore networking performance and reduce latency
- Cache clean to point of persistence to support storage-class memory
- Infrastructure-class RAS enhancements, including data poisoning and improved error management
DynamIQ Shared Unit (DSU)
- Supports multiple performance domains within a DynamIQ cluster of up to 8 cores
- Latency and bandwidth optimizations; streamlines traffic across bridges
- Integrated snoop filter and shared L3 cache; supports large amounts of local memory
- Advanced power management features
- Scalable interfaces for edge-to-cloud applications
- Low-latency ACP and peripheral port interfaces for closely coupled accelerators
[Diagram: cores 0-7 connect to the DSU, which contains the snoop filter, L3 cache, asynchronous bridges, bus interface, ACP and peripheral port interface, and power management]
Level 3 cache partitioning in a networking application
Example configuration with two core groups in a DynamIQ cluster:
- Process 1 (data plane) runs on core group 1; process 2 (control plane) runs on core group 2 (cores 0-7 split across the two groups)
- The L3 cache is partitioned between the two groups, with a region reserved for external accelerators via ACP
- Packet-processing data from sensors or I/O agents is sent through the low-latency ACP interface
Increasing performance through cache stashing
- Enables reads/writes directly into the shared L3 cache or a per-core L2 cache
- Allows closely coupled accelerators and I/O agents to gain access to core memory
- AMBA 5 CHI and the Accelerator Coherency Port (ACP) can be used for cache stashing
- The peripheral port (PP) adds throughput for acceleration, network, and storage use cases
[Diagram: an accelerator or I/O agent stashes critical data into any cache level — per-core L1/L2, the cluster's DSU L3 cache, or the agile system cache — across the CoreLink CMN-600, with DMC-620 controllers and DDR4 below]
Networking: Compute and packet-processing solution
Build the right mix of compute:
- High-performance Cortex-A75 for control-plane processing (performance compute)
- High-efficiency Cortex-A55 for data-plane processing (throughput compute)
- DSPs and optional local accelerators for timing-critical processing
Scale CoreLink CMN-600 for the application:
- Access points: 2-5W
- Macro BTS: 20-30W
- Telecom server: 60-100W
[Diagram: Cortex-A55 and Cortex-A75 clusters with L3 caches, DSP, local accelerator, agile system caches, Ethernet and I/O, and DMC-620 memory controllers on a scalable coherent SoC memory system]
Data center server blade: Maximize compute density
Cortex-A75 can deliver 1.4x to 2.9x higher performance:
- >1.4x performance with 32-core Cortex-A75 + CMN-600 vs. 32-core Cortex-A72 + CCN-512
- 2.9x performance with 64-core Cortex-A75 + CMN-600
- Higher rate performance per socket than competitive architectures
- 1-1.5W/core enables sustained operation at maximum performance; ~60W compute power
[Diagram: CMN-600 mesh with cached CPU clusters, DMC-620 memory controllers driving DDR4-3200 channels, PCIe, 100GbE, and CML links]
Enabling 5G transformation
- The transition to 5G networks and new use cases is transforming network architectures, requiring efficient compute closer to the edge
- Mobile Edge Compute (MEC) nodes combine server-like design (software driven, high aggregate performance) with edge network needs (latency, throughput, power optimized)
- Software investments from Arm and ecosystem partners enable real 5G deployment
- Cortex-A75 and Cortex-A55, together with CMN and other Arm IP, enable scalable designs that address compute requirements across the network in a range of power budgets
For more information go to: https://developer.arm.com
Thank You! Danke! Merci! 謝謝! ありがとう! Gracias! Kiitos! 감사합니다! धन्यवाद!
The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks