KeyStone II. CorePac Overview

Similar documents
RA3 - Cortex-A15 implementation

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye

Introduction to AM5K2Ex/66AK2Ex Processors

ARM CORTEX-R52. Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture.

ARM. Cortex -A15 MPCore. Processor. Technical Reference Manual. Revision: r4p0. Copyright ARM. All rights reserved. ARM DDI 0438I (ID062913)

Cortex -A15 MPCore. Technical Reference Manual. Revision: r3p0. Copyright ARM. All rights reserved. ARM DDI 0438E (ID040512)

Heterogeneous Multi-Processor Coherent Interconnect

ARMv8-A Software Development

Cortex-A15 MPCore Software Development

Big.LITTLE Processing with ARM Cortex -A15 & Cortex-A7

Cortex-A9 MPCore Software Development

The ARM Cortex-A9 Processors

Copyright 2016 Xilinx

Each Milliwatt Matters

Low-power Architecture. By: Jonathan Herbst Scott Duntley

Keystone Architecture Inter-core Data Exchange

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Adapted from David Patterson s slides on graduate computer architecture

ARM Cortex-A* Series Processors

LECTURE 5: MEMORY HIERARCHY DESIGN

ELC4438: Embedded System Design ARM Embedded Processor

Chapter 5. Introduction ARM Cortex series

Copyright 2012, Elsevier Inc. All rights reserved.

Cortex-R5 Software Development

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Lecture 24: Virtual Memory, Multiprocessors

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan

Copyright 2012, Elsevier Inc. All rights reserved.

Designing with NXP i.mx8m SoC

SoC Platforms and CPU Cores

Lecture 21: Virtual Memory. Spring 2018 Jason Tang

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Introduction to the ARM Architecture. or: a loose set of random facts blatantly copied from tech sheets and the Architecture Ref.

Cortex-A5 MPCore Software Development

Multicore and MIPS: Creating the next generation of SoCs. Jim Whittaker EVP MIPS Business Unit

Cortex-A15 MPCore Software Development

ARMv8-A Memory Systems. Systems. Version 0.1. Version 1.0. Copyright 2016 ARM Limited or its affiliates. All rights reserved.

Negotiating the Maze Getting the most out of memory systems today and tomorrow. Robert Kaye

New ARMv8-R technology for real-time control in safetyrelated

Zynq-7000 All Programmable SoC Product Overview

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

PowerPC 740 and 750

Chapter 5 B. Large and Fast: Exploiting Memory Hierarchy

Virtual Memory. Patterson & Hennessey Chapter 5 ELEC 5200/6200 1

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

Building blocks for 64-bit Systems Development of System IP in ARM

ARM Processors for Embedded Applications

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

AMD s Next Generation Microprocessor Architecture

Freescale i.mx6 Architecture

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Lecture 25: Interrupt Handling and Multi-Data Processing. Spring 2018 Jason Tang

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

The Cortex-A15 Verification Story

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

SPARC64 X: Fujitsu s New Generation 16 core Processor for UNIX Server

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

HOT CHIPS 2014 NVIDIA S DENVER PROCESSOR. Darrell Boggs, CPU Architecture Co-authors: Gary Brown, Bill Rozas, Nathan Tuck, K S Venkatraman

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

XT Node Architecture

SiFive Freedom SoCs: Industry s First Open-Source RISC-V Chips

ECE 571 Advanced Microprocessor-Based Design Lecture 22

OPENSPARC T1 OVERVIEW

Intel Architecture for HPC

High Performance Compute Platform Based on multi-core DSP for Seismic Modeling and Imaging

ARM Trusted Firmware: Changes for Axxia

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Module Outline. CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies.

Copyright 2012, Elsevier Inc. All rights reserved.

High performance computing. Memory

Universität Dortmund. ARM Architecture

Hercules ARM Cortex -R4 System Architecture. Processor Overview

CSE Memory Hierarchy Design Ch. 5 (Hennessy and Patterson)

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

ISSCC 2003 / SESSION 14 / MICROPROCESSORS / PAPER 14.5

Chapter 8: Memory-Management Strategies

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

Virtual to physical address translation

Keywords and Review Questions

Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs

KeyStone Training. Turbo Encoder Coprocessor (TCP3E)

Sam Naffziger. Gary Hammond. Next Generation Itanium Processor Overview. Lead Circuit Architect Microprocessor Technology Lab HP Corporation

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

EEM870 Embedded System and Experiment Lecture 3: ARM Processor Architecture

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

Introduction to Sitara AM437x Processors

CPUs. Caching: The Basic Idea. Cache : MainMemory :: Window : Caches. Memory management. CPU performance. 1. Door 2. Bigger Door 3. The Great Outdoors

Transcription:

KeyStone II ARM Cortex A15 CorePac Overview

ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB L2 cache 20cc L1 miss L2 hit load to use Same latency on dual and single core variants Per core NEON (SIMD co processor) 128 bit wide SIMDv2 processor 8, 16, 32, and 64 bit data types Single precision floating point PercoreVFP (vector floatingpoint co processor) Single and double precision floating point, VFPv4 Error correction and detection on L1 and L2 caches AXI 4 ACE interface for cache coherent interconnect GIC 400 Interrupt Controller (SoC level)

Each Cortex A15 Core in the MPCore Full ARMv7 A architecture instruction set 3 issue out of order pipeline (3 12 stages) Dynamic branch prediction 2k entry Branch Target Buffer (BTB), 8k entry Global History Buffer (GHB) 32kB data L1 cache and 32kB L1 instruction cache 4cc load to use latency (typically hidden by out of order pipeline) Full support for virtualization, 32bit virtual and 40bit physical addresses 2 level MemoryManagement Management unit (MMU) Virtual to intermediate physical address Intermediate physical to physical address Three 32 entry fully associative L1 TLBs Instruction fetch, load, store 512 entry 4 way set associative L2 TLB Performance Monitoring Unit (6 counter PMUv2)

KeyStone II Quad Cortex A15 MPCore ARM Cortex-A15 MPCore ARM GIC-400 interrupt controller Access to and from the SoC

KeyStone II ARM CorePac: Key Features Cache coherency for ARM CorePac and IO ARM CorePac and DDR3 (DDR3A for those devices with multiple DDR3s) MSMC SRAM (on chip scratch memory) Low latency and high bandwidth connection from CorePac to external memory Virtualization support with 40 bit physical addressing (large physical address extensions, LPAE) Large L2 cache (4MB, 16 way set associative) Reliability and availability with ECC in internal and external memories High performance IO from user space and any paravirtualized guest OS Energy efficiency

KeyStone II: IO Cache Coherency ARM A15 Write-invalidate Read-snoop for MSMC SRAM Write-invalidate Read-snoop for DDR3A TeraNet IO Coherency for the ARMs, SMP for the quad cluster DDR3A from 0x08_0000_0000 to 0x09_FFFF_FFFF MSMC SRAM Coherency for ease of use and performance

Benchmarks: Core Only, Memory Latency Dhrystone, DMIPS/MHz, CPU core and L1 only 3.5DMIPS/MHz (highly dependant on compiler) 19600DMIPS with KeyStone II Quad ARM CorePac at 1.4GHz Floating point Quad single precision IEEE 754 FMAC per cycle Memory Latency, load to use latency 4cc L1D hit, typically hidden by A15 micro architecture 20cc L2 hit (4MB) MSMC2 SRAM: ~50cc External Memory: ~100ns L2 miss to DDR page that is open

Memory Bandwidth Benchmarks DDR3-1600 theoretical ti throughput 12.8GB/s ~30% to ~50% achieved Physical placement of arrays is critical; Linux virtual memory with 4kB pages is good. Memory Bandwidth, external memory only: Stream Copy a(i) = b(i), where a and a b are arrays. Stream Scale a(i) = q * b(i), where a and b are arrays, and q is a constant. Stream Add computes a(i) = b(i) + c(i), where a, b, and c are arrays. Stream Triad computes a(i) = b(i) + q * c(i), where a, b, and c are arrays, and q is a constant. Array sizes are defined to force missing on cache regardless of size

Long descriptor format page tables 40 bit physical addressing (LPAE) Virtualization Short descriptor with 32 bit physical addressing with ARM CorePac running from DDR3B Three level data structure to get to 4kB pages at each stage Two levels to get to 2MB pages (Linux huge pages) TLBs cache a page per entry Hypervisor privilege (3 rd level, user, supervisor) Manages the second stage address translation ti per each virtual it machine HW traps exceptions, CP15 register access and WFI/WFE Paravirtualized drivers or direct IO register access for IO performance IO accesses use MPAX

Reliability KeyStone II ARM CorePac is designed for highreliability embedded applications. 100k power on hours at 105C Data caches support SECDED (single bit error correct, double bit error detect). L1D and L2 caches data and tag RAMs Program caches support parity and refetch on error. L1I cache Tag RAM Branch Target Buffer (BTB)

Energy Efficiency Clock gating inside the ARM CorePac: Total dynamic power consumption for a fully loaded 1.4GHz core will range from 1.2W to 0.35W depending on the type of instructions it runs. Waitfor interrupt and event (WFI, WFE) instructions bring the dynamic power down to <0.1W per core. Power switches per core and per quad cluster including L2: Each ARM A15 core can be shut down independently The entire Quad ARM A15 CorePac, including the 4MB/1MB L2 cache, can also be shut down. Reduces static power to <5%

KeyStone II Architecture

For More Information ARM Reference Manuals http://infocenter.arm.com/help/index.jsp p// / p/ jp A15 Technical Reference Manual (TRM) r2p2 GIC 400 r0p0rel1 STREAM Benchmark http://www.cs.virginia.edu/stream/ For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website. KeyStone SoC & Software Overview Training