Arm Processor Technology Update and Roadmap
Cavium: Giri Chukkapalli is a Distinguished Engineer in the Data Center Group (DCG).
This talk introduces the ARM architecture for HPC deployment and presents the rationale for the design point of the ThunderX2 core and SoC in the single-thread performance vs. throughput space. The focus is on sustained performance per TCO and ease of exploiting spectral parallelism of HPC applications. Preliminary experience of porting, running, and performance analysis of HPC applications will be discussed.
2016 Cavium, Inc. Confidential and Proprietary Information
ThunderX2 in HPC
Cavium Corporate Overview
Enterprise Mobile Infrastructure Data Center and Cloud Service Provider Cloud
Multi-Core MIPS, ARM Processors, Security, SDN Switch and Server/Storage Connectivity
~$10B TAM
ARM Servers & ARM for HPC
Most Widely Used
- Over 90B shipped in 25 yrs
- Outships x86 by 20X per year
Licensing Model
- Anyone can build
- Innovate & optimize for targeted applications
ARM for HPC
- ARM = choice & path to more optimized solutions
- March to Exascale opening door for new ISA
- Massive parallelism requires SW changes
- ARM HPC projects active worldwide
- HPC has large open source component
Thriving ARM ecosystem for HPC
Cavium's Proven Leadership in Silicon Design
#1 in Security & Wireless Infrastructure, #2 in Embedded Multicore CPU
Expert Performance 2S Config: highest performance, most widely supported, dual socket ARMv8 servers in production
THE CPU company for Infrastructure
ARMv8 architectural licensee: High Perf Custom Cores
Complete Portfolio: 2 core to 48 core, variety of price points, TDP
Power, Perf, Area Optimized
Common SW architecture
OPTIMIZING ARM64 SERVERS FOR HPC & CLOUD DATA CENTER
World's Highest Performance Xeon Class ARM Server
2nd generation product from Cavium
ARM Leadership
ThunderX2 FIRSTS for ARM Processors
- Multi-threaded, fully out-of-order high performance ARMv8 custom cores
- Single and dual socket support
- Highest memory bandwidth & capacity
- Server class virtualization
- Server class RAS
- Extensive power management
- Rich IO configurations
Core and socket level performance competitive with next gen incumbent server CPUs
Comprehensive hardware and software ecosystem
Differentiation
[Chart: Incumbent Server CPU vs. ThunderX2]
- Cores: higher core count delivers higher throughput
- Total Threads: higher thread count = larger number of vCPUs
- Memory Bandwidth: more memory bandwidth for memory intensive workloads
- Memory Capacity: more memory capacity for in-memory workloads
- PCIe Lanes: rich IO connectivity options, direct attach to NVMe devices
Thriving HPC Ecosystem
- Industry Leading Operating Systems (Linux Enterprise SLE12)
- Debuggers, Profilers & Cluster Mgt
- Open Source & Community Focus
- Standards Based Sys Management & FW
- Optimized Compilers & Dev Environments
ThunderX Momentum in HPC Continues to Grow
[Chart, relative to a 1.0 baseline: 2.2X Memory Bandwidth, 2.5X Floating Point, 3X Integer, 4X Vectors]
2-4X better HPC performance
- Server platforms at World's premier HPC Labs
- Significant HPC Engagements
- Early Press Announcements
Early Performance for HPC Applications
ThunderX2 Delivers Compelling Memory Bandwidth
[Chart: Stream Scaling. Y axis: % of peak bandwidth (0 to 100). X axis: % of CPU load. Series: load, copy, scale, add, triad]
Highest memory bandwidth enables memory-bound applications to scale better.
Details: ThunderX2 CPU; Linux kernel 4.8.0-32-generic (4k pages); Stream compiled with GCC version 5.4.0-6 Ubuntu 16.04.4 at -O3
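Note: the four STREAM kernels named in the chart, and the way a "% of peak bandwidth" figure is typically derived from them, can be sketched as below. This is an illustrative pure-Python sketch, not the benchmark code used for the slide; the function names and the peak bandwidth passed in are hypothetical placeholders. The bytes-per-iteration counts follow the usual STREAM convention for 8-byte doubles.

```python
# Bytes STREAM counts per loop iteration with 8-byte doubles:
# copy and scale touch two arrays, add and triad touch three.
BYTES_PER_ITER = {"copy": 16, "scale": 16, "add": 24, "triad": 24}


def run_stream_kernels(a, b, c, scalar):
    """Run the four STREAM kernels once, in the benchmark's usual order."""
    n = len(a)
    for i in range(n):          # copy:  c = a
        c[i] = a[i]
    for i in range(n):          # scale: b = scalar * c
        b[i] = scalar * c[i]
    for i in range(n):          # add:   c = a + b
        c[i] = a[i] + b[i]
    for i in range(n):          # triad: a = b + scalar * c
        a[i] = b[i] + scalar * c[i]


def stream_percent_of_peak(kernel, n, seconds, peak_bytes_per_s):
    """Achieved bandwidth of one kernel as a percentage of an assumed peak.

    peak_bytes_per_s is a hypothetical input; the slide does not state it.
    """
    achieved = BYTES_PER_ITER[kernel] * n / seconds
    return 100.0 * achieved / peak_bytes_per_s
```

In practice the arrays are sized well beyond the last-level cache so that the measured rate reflects main memory rather than cache bandwidth.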
OpenBLAS DGEMM
Efficient cores capable of achieving close to theoretical peak performance.
Details: ThunderX2 CPU; Linux kernel 4.8.0-32-generic (4k pages); OpenBLAS compiled with GCC version 5.4.0-6ubuntu1~16.04.4 at -O3
OpenBLAS SGEMM
Efficient cores capable of achieving close to theoretical peak performance.
Details: ThunderX2 CPU; Linux kernel 4.8.0-32-generic (4k pages); OpenBLAS compiled with GCC version 5.4.0-6ubuntu1~16.04.4 at -O3
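Note: "close to theoretical peak" for DGEMM and SGEMM is usually judged by comparing achieved GFLOPS against cores x clock x FLOPs per cycle. The sketch below shows that arithmetic only. The core count, clock, and FLOPs-per-cycle values in the usage lines are made-up placeholders, not ThunderX2 specifications, and the function names are hypothetical.

```python
def gemm_flops(m, n, k):
    """FLOPs in C = A * B for an m x k by k x n product:
    one multiply and one add per inner-product term."""
    return 2 * m * n * k


def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS; every argument is an assumption
    supplied by the caller."""
    return cores * ghz * flops_per_cycle


def gemm_percent_of_peak(m, n, k, seconds, peak):
    """Achieved GEMM rate as a percentage of a given peak (in GFLOPS)."""
    achieved = gemm_flops(m, n, k) / seconds / 1e9
    return 100.0 * achieved / peak


# Hypothetical machine: 4 cores at 2.0 GHz, 8 FLOPs per cycle per core.
example_peak = peak_gflops(cores=4, ghz=2.0, flops_per_cycle=8)
# Hypothetical run: a 4000 x 4000 x 4000 GEMM that took 2.5 seconds.
example_efficiency = gemm_percent_of_peak(4000, 4000, 4000, 2.5, example_peak)
```

The same formula applies to SGEMM; only the FLOPs-per-cycle figure changes, since single precision packs twice as many elements into a vector register of a given width.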
ThunderX2 Delivers Best-In-Class HPL Performance
[Chart: HPL scaling, % of peak Gflops. Y axis: % of peak performance (0 to 100). X axis: % of system load (3, 13, 19, 25, 28, 38, 50, 63, 75, 78, 94, 100)]
Efficient cores capable of achieving close to theoretical peak performance.
Details: ThunderX2 CPU; Linux kernel 4.8.0-32-generic (4k pages); HPL compiled with GCC version 5.4.0-6ubuntu1~16.04.4 (defaults); mpich v3.2, no OpenMP, single socket test; process grid for each test case based on number of cores in test
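Note: the slide says the process grid was chosen per test case from the number of cores but does not say how. One common heuristic, shown here purely as an assumption, is to pick the most nearly square P x Q grid with P <= Q. The function name `hpl_grid` is hypothetical.

```python
import math


def hpl_grid(nprocs):
    """Return (P, Q) with P * Q == nprocs, P <= Q, and P as large as possible.

    This near-square heuristic is an assumption for illustration; the slide
    does not describe the rule actually used for the reported runs.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be a positive integer")
    p = math.isqrt(nprocs)      # start at floor(sqrt(nprocs))
    while nprocs % p != 0:      # step down to the nearest divisor
        p -= 1
    return p, nprocs // p
```

For a prime number of ranks this degenerates to a 1 x N grid, which is one reason core counts with many divisors are convenient for scaling studies.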
ThunderX2 Performance Scaling on Real Applications
- High memory throughput benefits simple simulations
- Large high-performance core count combined with high memory throughput benefits complex simulations