Big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7


Improving Energy Efficiency in High-Performance Mobile Platforms

Peter Greenhalgh, ARM
September 2011

This paper presents the rationale and design behind the first big.LITTLE system from ARM, based on the high-performance Cortex-A15 processor, the energy-efficient Cortex-A7 processor, the coherent CCI-400 interconnect and supporting IP.

Contents
Introduction
The Processors
The System
big.LITTLE Task Migration Use Model
big.LITTLE MP Use Model
Software
Conclusion

Introduction

The range of performance being demanded from modern, high-performance mobile platforms is unprecedented. Users require platforms to be accomplished at high processing intensity tasks such as gaming and web browsing, while providing long battery life for low processing intensity tasks such as texting, e-mail and audio.

In the first big.LITTLE system from ARM, a "big" ARM Cortex-A15 processor is paired with a "LITTLE" Cortex-A7 processor to create a system that can accomplish both high intensity and low intensity tasks in the most energy efficient manner. By coherently connecting the Cortex-A15 and Cortex-A7 processors via the CCI-400 coherent interconnect, the system is flexible enough to support a variety of big.LITTLE use models, which can be tailored to the processing requirements of the tasks.

The Processors

The central tenet of big.LITTLE is that the processors are architecturally identical. Both Cortex-A15 and Cortex-A7 implement the full ARMv7-A architecture, including the Virtualization and Large Physical Address Extensions. Accordingly, all instructions execute in an architecturally consistent way on both Cortex-A15 and Cortex-A7, albeit with different performance.

The implementation-defined feature set of Cortex-A15 and Cortex-A7 is also similar. Both processors can be configured to have between one and four cores, and both integrate a level-2 cache inside the processing cluster. Additionally, each processor implements a single AMBA 4 coherent interface that can be connected to a coherent interconnect such as CCI-400.

It is in the micro-architectures that the differences between Cortex-A15 and Cortex-A7 become clear. Cortex-A7 (Figure 1) is an in-order, non-symmetric dual-issue processor with a pipeline length of between 8 and 10 stages, whereas Cortex-A15 (Figure 2) is an out-of-order, sustained triple-issue processor with a pipeline length of between 15 and 24 stages.

Figure 1 Cortex-A7 Pipeline (blocks shown: Fetch, Decode, Issue, Queue, Integer, Multiply, Floating-Point / NEON, Dual Issue, Load/Store, Writeback)

Figure 2 Cortex-A15 Pipeline (blocks shown: Fetch; Decode, Rename & Dispatch; Loop Cache; Queue; Issue; Integer; Multiply; Floating-Point / NEON; Branch; Load; Store; Writeback)

Since the energy consumed by the execution of an instruction is partially related to the number of pipeline stages it must traverse, a significant part of the energy difference between Cortex-A15 and Cortex-A7 comes from their different pipeline lengths.

In general, the Cortex-A15 micro-architecture takes a different ethos from the Cortex-A7 micro-architecture. When appropriate, Cortex-A15 trades off energy efficiency for performance, while Cortex-A7 trades off performance for energy efficiency. A good example of these micro-architectural trade-offs is the level-2 cache design. While a more area-optimized approach would have been to share a single level-2 cache between Cortex-A15 and Cortex-A7, this part of the design can benefit from optimizations in favor of either energy efficiency or performance. As such, Cortex-A15 and Cortex-A7 each have their own integrated level-2 cache.

Table 1 illustrates the difference in performance and energy between Cortex-A15 and Cortex-A7 across a variety of benchmarks and micro-benchmarks. The first column describes the uplift in performance from Cortex-A7 to Cortex-A15, while the second column considers both the performance and power difference to show the improvement in energy efficiency from Cortex-A15 to Cortex-A7. All measurements are on complete, frequency-optimized layouts of Cortex-A15 and Cortex-A7 using the same cell and RAM libraries. All code that is executed on Cortex-A7 is compiled for Cortex-A15.

Benchmark      Cortex-A15 vs Cortex-A7    Cortex-A7 vs Cortex-A15
               Performance                Energy Efficiency
Dhrystone      1.9x                       3.5x
FDCT           2.3x                       3.8x
IMDCT          3.0x                       3.0x
MemCopy L1     1.9x                       2.3x
MemCopy L2     1.9x                       3.4x

Table 1 Cortex-A15 & Cortex-A7 Performance & Energy Comparison

It should be observed from Table 1 that although Cortex-A7 is labeled the "LITTLE" processor, its performance potential is considerable. In fact, due to micro-architecture advances, Cortex-A7 provides higher performance than current Cortex-A8 based implementations for a fraction of the power. As such, a significant amount of processing can remain on Cortex-A7 without resorting to Cortex-A15.
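
The two columns of Table 1 can be combined to estimate how much more power Cortex-A15 draws than Cortex-A7 on each benchmark. The sketch below is only an illustration under an assumption the paper does not state explicitly: that "energy efficiency" means work done per unit of energy, so that for a fixed workload the energy ratio and the run-time ratio multiply to give the power ratio. The function name `implied_power_ratio` is hypothetical.

```python
# Ratios copied from Table 1:
# (Cortex-A15 speed-up over Cortex-A7, Cortex-A7 energy-efficiency gain over Cortex-A15)
TABLE_1 = {
    "Dhrystone": (1.9, 3.5),
    "FDCT": (2.3, 3.8),
    "IMDCT": (3.0, 3.0),
    "MemCopy L1": (1.9, 2.3),
    "MemCopy L2": (1.9, 3.4),
}


def implied_power_ratio(speedup, efficiency_gain):
    """Average power of Cortex-A15 relative to Cortex-A7 for the same work.

    Assumption (not stated in the paper): efficiency is work per unit energy,
    so E_A15 / E_A7 = efficiency_gain and t_A7 / t_A15 = speedup.
    Because P = E / t, the power ratio is the product of the two.
    """
    return speedup * efficiency_gain


for name, (speedup, gain) in TABLE_1.items():
    ratio = implied_power_ratio(speedup, gain)
    print(f"{name:<11} implied A15/A7 power ratio: {ratio:.2f}x")
```

Under that assumption, a benchmark on which Cortex-A15 is faster but much less efficient implies a large power gap, which is consistent with the argument that work should stay on Cortex-A7 whenever its performance is sufficient.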

The System

To create a compelling big.LITTLE solution, the system around the processors must also be considered. A key part is the CCI-400 interconnect, which facilitates full coherency between Cortex-A15 and Cortex-A7 as well as I/O coherency for components such as a GPU. Through optimizations around the transaction characteristics of Cortex-A15 and Cortex-A7, and by considering the paths to main memory and the system, a single solution is offered with the highest possible performance.

Figure 3 Cortex-A15 CCI Cortex-A7 System (blocks shown: GIC-400 delivering interrupts to both clusters; Cortex-A15 cores with L2; Cortex-A7 cores with L2; IO Coherent Master; CCI-400 Cache Coherent Interconnect; Memory Controller Ports; System Port)

Another element of the big.LITTLE system is a shared Generic Interrupt Controller (GIC-400). As well as being able to distribute up to 480 interrupts to Cortex-A15 and Cortex-A7, the programmable nature of the GIC-400 allows interrupts to be migrated between any cores in the Cortex-A15 or Cortex-A7 clusters.

From the perspective of trace and debug, both Cortex-A15 and Cortex-A7 offer trace solutions and both are compliant with the Debug v7.1 architecture. Full support for big.LITTLE debug and trace is provided through CoreSight SoC.

A final point to consider is that a big.LITTLE system incorporating Cortex-A15, Cortex-A7, CCI-400 and the GIC-400 is optimal for all big.LITTLE use models. There is no configurable feature or optimization that favors a particular use model. However, to reduce software complexity in the big.LITTLE task migration use model, it is recommended that the same number of cores be implemented in the Cortex-A15 cluster and the Cortex-A7 cluster.

big.LITTLE Task Migration Use Model

In the big.LITTLE task migration use model, the OS and applications only ever execute on Cortex-A15 or Cortex-A7, and never on both processors at the same time. This use model is a natural extension of the Dynamic Voltage and Frequency Scaling (DVFS) operating points that current mobile platforms with a single application processor provide so that the OS can match the performance of the platform to the performance required by the application.

However, in a Cortex-A15-Cortex-A7 platform these operating points are applied to both Cortex-A15 and Cortex-A7. When Cortex-A7 is executing, the OS can tune the operating points as it would for an existing platform with a single applications processor. Once Cortex-A7 is at its highest operating point, if more performance is required a task migration can be invoked that picks up the OS and applications and moves them to Cortex-A15. This allows low and medium intensity applications to be executed on Cortex-A7 with better energy efficiency than Cortex-A15 can achieve, while the high intensity applications that characterize today's smartphones can execute on Cortex-A15.

Figure 4 Cortex-A15-Cortex-A7 DVFS Curves

An important consideration of a big.LITTLE system is the time it takes to migrate a task between the Cortex-A15 cluster and the Cortex-A7 cluster. If it takes too long, the migration may become noticeable to the operating system, and for some time the system power may outweigh the benefit of task migration. Therefore, the Cortex-A15-Cortex-A7 system is designed to migrate in less than 20,000 cycles, or 20 microseconds with processors operating at 1GHz.
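
The operating-point behaviour described above can be sketched as a single ordered list that spans both clusters, with a task migration triggered whenever the chosen point sits on the other cluster. This is a minimal illustration, not ARM's implementation: the `OperatingPoint` type, the frequencies and the relative performance figures are all hypothetical, and the list is simplified so that every Cortex-A15 point is faster than every Cortex-A7 point. The last helper reproduces the cycles-to-time arithmetic quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    cluster: str        # "A7" or "A15"
    freq_mhz: int       # hypothetical clock frequency
    performance: float  # hypothetical relative performance


# Hypothetical operating points, ordered from lowest to highest performance.
OPERATING_POINTS = [
    OperatingPoint("A7", 400, 1.0),
    OperatingPoint("A7", 800, 2.0),
    OperatingPoint("A7", 1000, 2.5),
    OperatingPoint("A15", 800, 3.8),
    OperatingPoint("A15", 1200, 5.7),
]


def select_operating_point(required_performance):
    """Return the lowest point that meets the requirement, else the highest point."""
    for point in OPERATING_POINTS:
        if point.performance >= required_performance:
            return point
    return OPERATING_POINTS[-1]


def needs_migration(current, target):
    """A task migration is needed only when the cluster changes."""
    return current.cluster != target.cluster


def migration_time_us(cycles, freq_mhz):
    """Convert a cycle count to microseconds (1 MHz is one cycle per microsecond)."""
    return cycles / freq_mhz


current = select_operating_point(2.0)   # stays on the Cortex-A7 cluster
target = select_operating_point(3.0)    # only a Cortex-A15 point is fast enough
print(current, target, needs_migration(current, target))
print(migration_time_us(20_000, 1000), "microseconds at 1 GHz")
```

A real governor would also weigh the cost of the migration itself, which is why the migration latency budget matters.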

One of the reasons the task migration can be so fast is that the amount of processor state involved is relatively small. The processor that is going to be turned off, termed the outbound processor, must have the contents of all of its integer and Advanced SIMD register files saved, along with the entire CP15 configuration state. The processor that is going to resume execution, termed the inbound processor, must then restore all of the state saved from the outbound processor. Additionally, any active interrupts that are being controlled by the GIC-400 must be migrated. Fewer than 2,000 instructions are required to achieve the save-restore, and because the two processors are architecturally identical there is a one-to-one mapping between state registers in the inbound and outbound processors.

Figure 5 big.LITTLE Task Migration (steps shown for the inbound processor: Power on & Reset, Cache Invalidate, Enable Snooping, Ready for task migration, Restore State with snooping of the outbound processor, Normal Operation; steps shown for the outbound processor: Normal Operation, Task Migration Stimulus, Save State, L2 Snooping Allowed, Clean L2 Cache, Disable Snooping, Power Down, OFF)

Figure 5 describes the task migration process between inbound and outbound processors. Coherency is clearly a critical enabler in achieving a fast task migration time, as it allows the state that has been saved on the outbound processor to be snooped and restored on the inbound processor rather than going via main memory. Additionally, because the level-2 cache of the outbound processor is coherent, it can remain powered up after a task migration to improve the cache warming time of the inbound processor through snooping of data values. However, since the level-2 cache of the outbound processor cannot be allocated to, it will eventually need to be cleaned and powered off to save leakage power.
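
The ordering of the migration steps described above can be sketched as a small simulation. This is only an illustrative model under stated assumptions: the `Processor` class, the `migrate` function, the register names and the interrupt numbers are hypothetical, and real save-restore code operates on hardware registers rather than dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    name: str
    powered: bool = False
    snooping: bool = False
    state: dict = field(default_factory=dict)  # integer, Advanced SIMD and CP15 state
    log: list = field(default_factory=list)    # steps performed, in order


def migrate(outbound, inbound, active_interrupts):
    """Move execution from the outbound to the inbound processor.

    Returns a mapping of each active interrupt to its new target processor.
    """
    # The inbound processor is prepared while the outbound one keeps running.
    inbound.powered = True
    inbound.log.append("power on & reset")
    inbound.log.append("cache invalidate")
    inbound.snooping = True
    inbound.log.append("enable snooping")

    # Blackout period: state is saved on one side and restored on the other.
    saved = dict(outbound.state)
    outbound.log.append("save state")
    inbound.state = saved
    inbound.log.append("restore state")
    routes = {irq: inbound.name for irq in active_interrupts}

    # The outbound L2 may stay snoopable for a while to warm the inbound
    # caches; eventually it is cleaned and the processor is powered down.
    outbound.log.append("clean L2 cache")
    outbound.snooping = False
    outbound.log.append("disable snooping")
    outbound.powered = False
    outbound.log.append("power down")
    return routes


a7 = Processor("Cortex-A7", powered=True, snooping=True,
               state={"r0": 1, "q0": 2, "SCTLR": 3})
a15 = Processor("Cortex-A15")
routes = migrate(a7, a15, active_interrupts=[32, 33])
print(a15.state, routes)
```

Because the two processors are architecturally identical, the saved dictionary can be copied across unchanged, which mirrors the one-to-one register mapping noted in the text.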

It should also be observed that normal execution of the thread continues for most of the task migration process. The only blackout period is while interrupts are disabled and state is transferred from the outbound to the inbound processor.

big.LITTLE MP Use Model

Since a big.LITTLE system containing Cortex-A15 and Cortex-A7 is fully coherent through CCI-400, another logical use model is to allow both Cortex-A15 and Cortex-A7 to be powered on and simultaneously executing code. This is termed big.LITTLE MP, which is essentially Heterogeneous Multi-Processing. Note that in this use model Cortex-A15 only needs to be powered on and executing alongside Cortex-A7 if there are threads that need that level of processing performance. If not, only Cortex-A7 needs to be powered on.

big.LITTLE MP is compelling because it enables threads to be executed on the processing resource that is most appropriate. Compute-intensive threads that require significant amounts of processing performance, because their output is user visible, can be allocated to Cortex-A15. Threads that are I/O heavy, or that do not produce a result that is time critical to the user, can be executed on Cortex-A7.

A simple example of a non-time-critical thread is one associated with e-mail updates. While web browsing, the user will want e-mail updates to continue, but it does not matter whether they are done at Cortex-A15 or Cortex-A7 performance levels. Since Cortex-A7 is the more energy efficient processor, it makes more sense to take a LITTLE longer but consume less battery life.

Finally, as a fully coherent system can create a significant volume of coherent transactions, Cortex-A15, Cortex-A7 and CCI-400 have been designed to cope with worst-case snooping scenarios. This includes the case where a Mali-T604 GPU is connected to one of the I/O coherent CCI-400 ports and every transaction is snooping Cortex-A15 and Cortex-A7 at the same time as Cortex-A15 and Cortex-A7 are snooping each other.

Software

As part of the big.LITTLE system, ARM provides a software switcher for use with Cortex-A15, Cortex-A7, CCI-400 and the GIC-400. The switcher serves two purposes.

The first purpose is to provide all of the mechanisms required for task migration between Cortex-A15 and Cortex-A7. As well as the processor state save-restore, this includes the code required to bring the processors in and out of coherency, control snooping in the interconnect and migrate interrupts. The switcher can be used as-is, or the code can be used as a template for integration into the operating system.

The second purpose is to hide the small number of programmer's model differences between Cortex-A15 and Cortex-A7 from the operating system. While Cortex-A15 and Cortex-A7 are architecturally identical and all registers are read and written in an architecturally consistent manner, the contents of the registers may not always be identical. Cortex-A15 and Cortex-A7 are therefore not identical from the programmer's model point of view in every case.
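
The second purpose can be pictured as a trap handler that answers reads of processor-specific registers with one fixed set of values, whichever cluster is actually running. The sketch below is an assumption-laden illustration, not the switcher's actual code: the `Switcher` class, the `handle_trapped_read` method, the policy of always presenting one chosen cluster, and the register contents are all hypothetical placeholders.

```python
# Illustrative register contents per physical cluster. These values are
# placeholders for the sketch and are not taken from the paper.
PHYSICAL_REGISTERS = {
    "Cortex-A15": {"MIDR": 0x410FC0F0, "CACHE_TOPOLOGY": "A15 cache description"},
    "Cortex-A7": {"MIDR": 0x410FC070, "CACHE_TOPOLOGY": "A7 cache description"},
}


class Switcher:
    """Hypothetical hypervisor-level handler that hides cluster differences."""

    def __init__(self, presented_cluster):
        # Assumed policy: the OS always sees the registers of one chosen cluster.
        self.presented = PHYSICAL_REGISTERS[presented_cluster]

    def handle_trapped_read(self, register, running_on):
        """Return the value the OS should see for a trapped register read.

        `running_on` names the cluster executing the OS; it is deliberately
        ignored so that the answer does not change across a task migration.
        """
        if running_on not in PHYSICAL_REGISTERS:
            raise ValueError(f"unknown cluster: {running_on}")
        return self.presented[register]


switcher = Switcher(presented_cluster="Cortex-A15")
before = switcher.handle_trapped_read("MIDR", running_on="Cortex-A15")
after = switcher.handle_trapped_read("MIDR", running_on="Cortex-A7")
print(hex(before), hex(after), before == after)
```

With a handler of this kind, an OS that reads an identification register before and after a migration sees the same value and needs no knowledge of the second cluster.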

For example, the contents of the Main ID register that identifies the processor will differ between Cortex-A15 and Cortex-A7, as will the contents of the CP15 registers that describe the level-1 and level-2 cache topologies. Fortunately, since both Cortex-A15 and Cortex-A7 implement the Virtualization Extensions, OS accesses to these registers can be trapped to the hypervisor layer, which is where the switcher can handle them.

The switcher enables a big.LITTLE system to be built today with current operating systems. However, as in the case of the state save-restore code, it may be preferable for the small number of programmer's model differences between Cortex-A15 and Cortex-A7 to be handled by the OS rather than the switcher.

Conclusion

This white paper has described the first big.LITTLE system from ARM. The combination of a fully coherent system with Cortex-A15 and Cortex-A7 opens up new processing possibilities beyond what is possible in current high-performance mobile platforms.

Rather than having to compromise the implementation of a single applications processor to cope with high and low intensity tasks, the big.LITTLE system opens the door to an extremely high performance implementation of Cortex-A15, since it will only be powered on when that performance is needed. This is complemented by the opportunity to create an extremely energy efficient implementation of Cortex-A7, since it will be the workhorse of the platform.

Through these implementation techniques and the variety of use models, big.LITTLE provides the opportunity to raise performance and extend battery life in the next generation of mobile platforms.