The Challenges of System Design. Raising Performance and Reducing Power Consumption

Agenda
- The key challenges
- Visibility for software optimisation
- Efficiency for improved PPA

Product Challenge - Software
For software engineers:
- Good visibility
- On-chip debug and trace
- A standard, easy-to-program h/w platform
- ARM Profiler to optimise performance
[Diagram: example SoC block diagram. Labels: Osprey, Power Management, application processors (CPU 1 to CPU 4), CPU L2 cache, coherent interconnect (coherency, virtualisation), AXI interconnects, CoreSight debug and trace, DMA controller, HD LCD controller, media processors (graphics processor, video engine), DDR3/LPDDR2 memory controller, static memory controller, PCIe, APB peripherals, SRAM]

Design Challenge - PPA
[Diagram: example SoC block diagram annotated with "Performance" and "Power" callouts. Labels: Osprey, Power Management, application processors (CPU 1 to CPU 4), CPU L2 cache, media processors (graphics processor, video engine), coherent interconnect, AXI interconnects, CoreSight debug and trace, DMA controller, HD LCD controller, DDR3/LPDDR2 memory controller, static memory controller, PCIe, APB peripherals, SRAM]

VISIBILITY FOR OPTIMISATION
How to optimise your software and understand your design

On-Chip Visibility: a key requirement
[Diagram: example SoC block diagram. Labels: Osprey, Power Management, application processors (CPU 1 to CPU 4), CPU L2 cache, coherent interconnect, AXI interconnects, CoreSight debug and trace, DMA controller, HD LCD controller, media processors (graphics processor, video engine), DDR3/LPDDR2 memory controller, static memory controller, PCIe, APB peripherals, SRAM]

Typical CoreSight System
- Cross triggering between cores
- Single debug access port
- Cost effective debug
- Trace collection strategies
[Diagram: example ARM SoC. Labels: SWD, DAP, Cortex-A9 with PTM, Cortex-R4 with ETM-R4, DSP with DSP ETM, cross trigger interfaces, cross trigger matrix, AMBA AXI, bus trace, system trace, APB bridge, debug bus (APB), trace bus (ATB), funnel, Embedded Trace Buffer, Trace Port Interface Unit, trace port, RealView ICE, RealView Trace]

Software profiling using CPU Trace
- Top-down insight into the analyzed software
- Starts with an overview screen containing the top 5 functions by Self Time, Delay and Memory access
- Detailed information on the source code and its derived assembly code, annotated with performance information:
  - Code coverage
  - Source associated instructions
  - Cycles per instruction
  - Interlock information

System Trace Macrocell - STM
- System-level visibility is required by application development up to final product:
  - Debug and tuning of s/w applications running on an OS
  - Tracing of system events and system performance
- System Trace Macrocell enables:
  - High-level application software view
  - Tuning of system performance
  - Tracing of SoC internal signals
- Benefits:
  - Flexible and affordable hardware-based debug for application and system-level developers
  - Complements CPU trace, MIPI STPv2 compliant
[Diagram labels: PMU Counts, OS Trace]

System Level Code Instrumentation
System-level debug information can be sent through trace to debug your system:

    /* stm_addr is not declared on the slide; assumed here to point at the
       memory-mapped STM stimulus ports. */
    extern volatile unsigned int *stm_addr;

    static inline void stm_emit(unsigned int port, unsigned int value)
    {
        stm_addr[port] = value;
    }

    static inline void stm_emit_blocking(unsigned int port, unsigned int value)
    {
        /* Reading from an STM port returns 1 if the FIFO can
           accept data, 0 if it is full. */
        while (!stm_addr[port]) {
            /* spin until the FIFO has room */
        }
        stm_addr[port] = value;
    }

- Export debug information
- Visualise system-level data
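As a rough illustration of how the emit helper above might be used, the sketch below brackets a unit of work with start and end markers. It is a host-side sketch only: the `sim_ports` array stands in for the real memory-mapped stimulus ports, and the event encoding (`make_event`, `EV_FRAME_START`, `EV_FRAME_END`) and `render_frame_traced` are hypothetical names that do not come from the slides.

```c
#include <assert.h>

/* Hypothetical stand-in for the memory-mapped STM stimulus ports so the
 * sketch can run on a host PC. On a real SoC, stm_addr would point at the
 * STM base address taken from the device memory map. */
#define SIM_PORT_COUNT 4u
static volatile unsigned int sim_ports[SIM_PORT_COUNT];
static volatile unsigned int *stm_addr = sim_ports;

/* Same shape as the slide's stm_emit, plus a bounds check for the sketch. */
static inline void stm_emit(unsigned int port, unsigned int value)
{
    assert(port < SIM_PORT_COUNT);
    stm_addr[port] = value;
}

/* Hypothetical event encoding: event id in the top byte, payload below. */
enum { EV_FRAME_START = 0x01, EV_FRAME_END = 0x02 };

static inline unsigned int make_event(unsigned int id, unsigned int payload)
{
    return (id << 24) | (payload & 0x00FFFFFFu);
}

/* Bracket a unit of work with start and end markers on port 1, so a trace
 * viewer could measure the time between the two events. */
static unsigned int render_frame_traced(unsigned int frame_no)
{
    stm_emit(1u, make_event(EV_FRAME_START, frame_no));
    unsigned int work = frame_no * 2u;  /* placeholder for the real work */
    stm_emit(1u, make_event(EV_FRAME_END, frame_no));
    return work;
}
```

Dedicating one port per software component is one possible convention; it keeps the streams separable when the trace is visualised.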

Event Profiling using STM
[Diagram labels: Cortex-A9, Cortex-A9, L2 cache]

Trace Memory Controller
- Single solution for cost effective and flexible trace collection
- SoC visibility in final product with only 2 pins
- Storage of trace using low cost system memory
- Routing to Gigabit links such as HSSTP or Ethernet
- Reduces trace overflows and trace port size by averaging out trace bandwidth
- Existing modes with ETB (SRAM) & Trace Port (TPIU)
[Chart axis: Bits / cycle]

EFFICIENT SOC DESIGN
Getting the highest performance at the lowest power consumption

Introduction
- Systems use external memory:
  - Large address space
  - Low cost-per-bit
  - Large interface bandwidth
- Challenge: manage the flow of data to and from external memory to present the best bandwidth and latency characteristics to each processing element
- Access to memory depends on accesses from other processing elements
[Diagram: physical view of the system. Masters: CPU (apps processor), GPU (geometry processor, renderer, tiling), comms control, network interface, DMA controller, display controller, audio CODEC, video (image transform, motion estimation, motion compensate). Interconnect, dynamic memory controller, static memory controller, NAND flash. Memory contents: primitives, frame buffer, texture, tile lists, application memory, media source]

QoS Contracts
[Diagram: the system block diagram (CPU, GPU, comms control, network interface, DMA controller, display controller, audio CODEC, video, interconnect, dynamic and static memory controllers, NAND flash) with each master annotated with its contract type. Contract types shown: Minimum Bandwidth, Minimize Latency, Maximum Latency, and Maximum Latency or Minimum Bandwidth]

QoS Objectives
- Allocate system capacity (latency and bandwidth) to each master to meet the contract
- Dynamically vary the priority to react to changes in bus traffic
- If there is excess capacity:
  - Allocate excess to where it can offer the most improvement (usually reducing the CPU latency)
  - Allocate excess to masters that can reduce performance later
- If there is insufficient capacity:
  - Remove capacity from masters that have the least impact on system performance
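The allocation rules above can be sketched as a simple bandwidth budget. This is a deliberately simplified model, not how any particular interconnect implements QoS: it considers bandwidth only, and the `qos_master` structure, its `impact` ranking, and the `allocate_capacity` function are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-master record for the sketch. */
struct qos_master {
    double contract_bw;   /* bandwidth needed to meet the contract (MB/s) */
    unsigned int impact;  /* assumed rank: higher = more impact on system performance */
    double granted_bw;    /* output: bandwidth granted by allocate_capacity */
};

/* Grant every master its contract. If capacity is left over, give the excess
 * to the master at index excess_to (for example the CPU). If capacity is
 * short, take bandwidth away from the lowest-impact masters first. */
static void allocate_capacity(struct qos_master *m, size_t n,
                              double capacity, size_t excess_to)
{
    assert(m != NULL && n > 0 && excess_to < n && capacity >= 0.0);

    double needed = 0.0;
    for (size_t i = 0; i < n; i++) {
        m[i].granted_bw = m[i].contract_bw;
        needed += m[i].contract_bw;
    }

    if (needed <= capacity) {
        m[excess_to].granted_bw += capacity - needed;
        return;
    }

    double shortfall = needed - capacity;
    while (shortfall > 0.0) {
        /* Find the lowest-impact master that still holds some bandwidth. */
        size_t victim = n;
        for (size_t i = 0; i < n; i++) {
            if (m[i].granted_bw > 0.0 &&
                (victim == n || m[i].impact < m[victim].impact)) {
                victim = i;
            }
        }
        if (victim == n) {
            break;  /* nothing left to remove */
        }
        double cut = (m[victim].granted_bw < shortfall) ? m[victim].granted_bw
                                                        : shortfall;
        m[victim].granted_bw -= cut;
        shortfall -= cut;
    }
}
```

For example, with contracts of 400, 300 and 200 MB/s and 1000 MB/s of capacity, the 100 MB/s excess goes to the chosen master; with only 600 MB/s, the lowest-impact master loses its whole grant before the next one is reduced.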

System Latency
- Latency is added throughout the system in two forms:
  - Static latency: the delay through pipeline stages. Constant and specific to the path from master to slave
  - Queuing latency: the delay at arbitration points in the system. The delay for each transaction depends on the number of transactions ahead of it in the queue and the rate at which they are processed
- The queue length depends on the capacity of the slave (memory type and efficiency) and the desired throughput
- Efficiency of the memory controller is a function of: queue length, burst length, read-write mix, address distribution
[Chart: System Latency. Population against latency/clocks, split into static latency and queuing latency]
[Chart: AMBA DMC-341. Efficiency (0% to 90%) against average burst length (3 to 16)]
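The two latency components can be combined in a first-order model, sketched below. The function names are hypothetical, and the model assumes that a 100% efficient memory controller transfers one data beat per clock and that every queued transaction takes the same service time; real controllers will deviate from both assumptions.

```c
#include <assert.h>

/* Clocks the memory controller spends on one burst, assuming one beat per
 * clock at 100% efficiency (a simplifying assumption for this sketch). */
static double service_clocks_per_burst(unsigned int burst_beats, double efficiency)
{
    assert(burst_beats > 0u && efficiency > 0.0 && efficiency <= 1.0);
    return burst_beats / efficiency;
}

/* Latency seen by a transaction that arrives with queued_ahead transactions
 * in front of it: a constant static part plus a queuing part that grows with
 * queue length and shrinks with the processing rate. */
static double transaction_latency(double static_clocks,
                                  unsigned int queued_ahead,
                                  double service_clocks)
{
    assert(static_clocks >= 0.0 && service_clocks > 0.0);
    double queuing_clocks = queued_ahead * service_clocks;
    return static_clocks + queuing_clocks;
}
```

With 8-beat bursts at 50% efficiency each burst costs 16 clocks, so a transaction with 20 clocks of static latency and 4 transactions ahead of it sees 20 + 4 x 16 = 84 clocks.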

System Interface Characteristics
- The performance on the CPU master is determined by the latency characteristics that it sees from the system:
  - Determined by the bandwidth from the other masters
  - Also by the efficiency of the memory controller, which depends on the burst characteristics of the traffic
- Other masters can be replaced with traffic profile generators (VPE), calibrated to generate the same traffic behaviour
[Diagram: the system block diagram with apps processor, GPU (geometry processor, renderer, tiling), comms control, network interface, DMA controller, display controller, audio CODEC, video (image transform, motion estimation, motion compensate), interconnect, dynamic and static memory controllers, NAND flash]

VPE: Verification and Performance Exploration
- The AMBA VPE design tool is for verification of the system performance:
  - A graphical profiling toolkit to generate & view traffic profiles
  - 3 verification components: AXI Monitor, AXI Master, AXI Slave
  - Runs on all of the big 3 RTL simulation tools
- Speeds up RTL simulation by:
  - Giving up execution of functions (e.g. CPU, GPU) in favour of emulating their traffic, so there is no need to model their cycle-accurate behaviour
  - Replacing real data with constrained random data
- Can test typical and worst case scenarios

System Interface Characteristics
- The performance on the master is determined by the latency characteristics that it sees from the system:
  - Determined by the bandwidth from the other masters
  - Also by the efficiency of the memory controller, which depends on the burst characteristics of the traffic
- Other masters can be replaced with traffic profile generators (VPE), calibrated to generate the same traffic behaviour
[Diagram: the system block diagram with a VPE Master and a VPE Slave substituted into the system alongside the GPU (geometry processor, renderer, tiling), network interface, DMA controller, audio CODEC, display controller, video (image transform, motion estimation, motion compensate), interconnect, static memory controller and NAND flash]

Calibrating VPE Master Behaviour
- Run benchmark applications on the bus master
- VPE Monitor captures the traffic profile
- VPE Slave varies the latency seen by the master
- System Architect selects a representative set of benchmarks
- Benchmark results provide bandwidth and latency contracts
- Traffic profile and latency sensitivity results are used to generate a VPE model of the bus master
[Diagram labels: Benchmarks, Bus Master, VPE Monitor, VPE Slave]

Better designs more quickly
Iteration time of a spreadsheet with accuracy approaching RTL simulation
[Chart: design-exploration methods plotted by cycle time (LOW to HIGH) against realistic behaviour (LOW to HIGH), with time scales of minutes/hours, days/weeks and months/years]
- Spreadsheet analysis: mathematical formula, not dynamic
- RTL simulation with VPE, user VIP and industry standards VIP: statistical or recorded traffic profiles
- Acceleration/emulation with VIP, Logic Tiles, SW: adding S/W and external I/F with realistic scenarios
- Silicon/applications: observe actual behaviour

Reducing Visible System Latency
- Write data can be buffered:
  - The latency for write traffic seen by the system is significantly reduced
  - Can be used to reduce read latency by prioritizing reads, as long as coherency is managed
- Cache memory reduces latency seen by the master:
  - Also reduces system bandwidth, which reduces latency to other masters
  - Diminishing returns from increases in cache size
[Chart: burst latency against system utilization (10% to 94%) for unbuffered, read and write traffic]

Increasing Latency Tolerance
- Masters that generate transactions that are only weakly dependent on the completion of previous transactions can issue multiple outstanding transactions
- Multiple outstanding transactions can eliminate the effects of static (pipeline) latency
- They have no impact on the effects of dynamic latency
- Additional outstanding transactions will increase the queue length
[Diagram labels: static latency, processing rate]
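A first-order sketch of why outstanding transactions hide static latency: a master with N transactions in flight can sustain at most N transactions per round trip, and never more than the slave can process. The function names and the model itself are assumptions made for illustration, with latencies in clocks and a fixed service time per transaction.

```c
#include <assert.h>

/* Sustained rate in transactions per clock for a master that keeps
 * `outstanding` transactions in flight. The rate is limited either by the
 * round-trip time (static latency plus service time) or by the rate at which
 * the slave processes transactions, whichever is lower. */
static double sustained_rate(unsigned int outstanding,
                             double static_clocks,
                             double service_clocks)
{
    assert(outstanding > 0u && static_clocks >= 0.0 && service_clocks > 0.0);
    double latency_bound = outstanding / (static_clocks + service_clocks);
    double slave_bound = 1.0 / service_clocks;
    return (latency_bound < slave_bound) ? latency_bound : slave_bound;
}

/* Smallest number of outstanding transactions that reaches the slave's
 * processing rate, i.e. the round trip divided by the service time,
 * rounded up. */
static unsigned int outstanding_to_hide_static(unsigned int static_clocks,
                                               unsigned int service_clocks)
{
    assert(service_clocks > 0u);
    unsigned int round_trip = static_clocks + service_clocks;
    return (round_trip + service_clocks - 1u) / service_clocks;
}
```

With 30 clocks of static latency and a 10-clock service time, one outstanding transaction sustains about 0.025 transactions per clock, while four reach the slave's limit of 0.1. Going beyond four adds nothing to the rate in this model and only lengthens the queue.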

Queue location
- A queue is implemented in the memory controller:
  - Allows re-ordering of transactions to maximize efficiency
- If the queue fills, it extends through the interconnect:
  - Interconnect arbitration only operates when the queue extends through the interconnect
  - For effective QoS, the arbitration policy should be consistent throughout the system
- System topology influences the performance:
  - CPU and LCD Ctrl placed close to the memory controller for lower latency
  - Mali and DMA on a separate level
  - Hierarchy improves performance
[Diagram labels: Cortex-A9, LCD Ctrl, Mali-400, Mali-VE6, DMA Ctrl, Peripherals, Interconnect, Memory Ctrl]

Stream Processing Masters
- Adding latency does not affect performance while latency is less than the maximum
- Entry priority set third highest:
  - Only higher priority than Best-effort
  - Reduces the latency to the batch processing masters when necessary
- If the transaction is still waiting after a time-out period, it is promoted to highest priority
[Diagram: priority levels. Labels: Time-out, Batch processing, Stream processing, Best-effort]
[Chart: survival (0% to 100%) against latency/clocks (0 to 140), with the maximum latency marked]
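The time-out promotion described above can be sketched as a small priority function. The enum ordering is an assumed reading of the slide's priority stack, and `timeout_for_deadline` encodes an assumption of this sketch (that the time-out is chosen so a promoted transaction still completes within the maximum latency); neither comes from a documented interface.

```c
#include <assert.h>

/* Assumed ordering of the priority levels, lowest to highest. */
enum qos_priority {
    PRIO_BEST_EFFORT = 0,
    PRIO_STREAM      = 1,  /* entry priority: only above best-effort */
    PRIO_BATCH       = 2,
    PRIO_TIMEOUT     = 3   /* highest: promoted after the time-out period */
};

/* Pick a time-out so that a transaction promoted at the time-out still
 * completes within the maximum latency. promoted_service_clocks is the
 * assumed worst-case time to service a highest-priority transaction. */
static unsigned int timeout_for_deadline(unsigned int max_latency_clocks,
                                         unsigned int promoted_service_clocks)
{
    assert(promoted_service_clocks < max_latency_clocks);
    return max_latency_clocks - promoted_service_clocks;
}

/* Priority of a stream master's transaction that has waited waited_clocks. */
static enum qos_priority stream_priority(unsigned int waited_clocks,
                                         unsigned int timeout_clocks)
{
    assert(timeout_clocks > 0u);
    return (waited_clocks >= timeout_clocks) ? PRIO_TIMEOUT : PRIO_STREAM;
}
```

Keeping the entry priority low means a stream master only competes with latency-sensitive traffic when its own deadline is actually at risk.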

Batch Processing Masters
- Average latency, bandwidth and queue length are related by Little's Law: E(L) = λ·E(S), where E(L) is the average queue length, λ the transaction rate (bandwidth) and E(S) the average latency
- Hold queue length constant
- Measure average latency and control priority
- Priority controls latency, which controls bandwidth
- Excess bandwidth is used
- Priority only exceeds other masters when insufficient bandwidth is obtained
- Minimizes transactions prioritized over Best-effort masters
[Diagram: priority levels. Labels: Time-out, Batch processing, Stream processing, Best-effort]
[Chart: bandwidth/MB/s (0 to 2000) against latency/clocks (0 to 700)]
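Little's Law makes the control loop above concrete: with the queue length held constant, a bandwidth target translates directly into a latency target, and priority can be nudged until the measured latency meets it. The sketch below is one possible reading of that loop; the function names, the one-step priority adjustment and the units (bytes per transaction, clock in MHz, bandwidth in MB/s with 1 MB = 10^6 bytes) are assumptions.

```c
#include <assert.h>

/* Bandwidth from Little's Law: rate = outstanding / latency, scaled to MB/s. */
static double bandwidth_mb_s(unsigned int outstanding, unsigned int bytes_per_txn,
                             double latency_clocks, double clock_mhz)
{
    assert(latency_clocks > 0.0 && clock_mhz > 0.0);
    return ((double)outstanding * bytes_per_txn * clock_mhz) / latency_clocks;
}

/* Average latency (clocks) that yields required_mb_s at a fixed queue length. */
static double target_latency_clocks(unsigned int outstanding, unsigned int bytes_per_txn,
                                    double required_mb_s, double clock_mhz)
{
    assert(required_mb_s > 0.0 && clock_mhz > 0.0);
    return ((double)outstanding * bytes_per_txn * clock_mhz) / required_mb_s;
}

/* One regulation step: raise priority when the measured average latency is
 * above target (too little bandwidth), otherwise lower it so the master is
 * prioritized over others as rarely as possible. */
static unsigned int regulate_priority(unsigned int priority, double measured_latency,
                                      double target_latency, unsigned int max_priority)
{
    if (measured_latency > target_latency) {
        return (priority < max_priority) ? priority + 1u : max_priority;
    }
    return (priority > 0u) ? priority - 1u : 0u;
}
```

For instance, 8 outstanding 64-byte transactions at 400 MHz deliver about 1024 MB/s when the average latency is 200 clocks, so a 1024 MB/s contract corresponds to a 200-clock latency target.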

QoS with Existing Memory Controllers
- Existing memory controllers have only 3 priority levels
- CPU is given high priority but is demoted if there is insufficient minimum bandwidth available for the batch processors
- Batch processing bandwidth increases to use any available bandwidth:
  - Increase is proportional to outstanding transactions
  - Excess bandwidth is partitioned between other masters
- Hard regulation can set a maximum bandwidth for a batch processing master
- Batch processor bandwidth is partitioned by varying the number of outstanding transactions
[Diagram: priority levels. Labels: Time-out, Best-effort, Batch processing, Stream processing]
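One way to picture regulation through outstanding transactions is to turn a bandwidth ceiling into a cap on transactions in flight, again using Little's Law. This is a sketch under assumed units (latency in clocks, clock in MHz, bandwidth in MB/s with 1 MB = 10^6 bytes); `outstanding_cap` is a hypothetical helper, not a register or API of any memory controller or regulator.

```c
#include <assert.h>

/* Largest number of outstanding transactions that keeps a batch master at or
 * below max_mb_s for a given average latency. At least one transaction is
 * always allowed so the master can make progress. */
static unsigned int outstanding_cap(double max_mb_s, double latency_clocks,
                                    unsigned int bytes_per_txn, double clock_mhz)
{
    assert(max_mb_s > 0.0 && latency_clocks > 0.0);
    assert(bytes_per_txn > 0u && clock_mhz > 0.0);
    double exact = (max_mb_s * latency_clocks) / (bytes_per_txn * clock_mhz);
    unsigned int cap = (unsigned int)exact;  /* truncate: stay under the ceiling */
    return (cap > 0u) ? cap : 1u;
}
```

With 64-byte transactions at 400 MHz and a 200-clock average latency, a 512 MB/s ceiling allows 4 outstanding transactions and a 500 MB/s ceiling allows 3. Giving each batch master its own cap is one way the available bandwidth could be partitioned between them.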

30% Browsing Boost with QoS-301
[Chart: Mali bandwidth (MB/s) and Cortex-A9 performance improvement (100% to 135%) against targeted Mali bandwidth (350 to 700 MB/s). Series: Mali BW (No QoS), Mali BW (QoS-301), Target Mali BW, CPU Performance (No QoS), CPU Performance (QoS-301), Base Case Performance]
- Cortex performs >30% better and soaks up spare system bandwidth: lower Mali requirements are exceeded, and Cortex-A9 is highest priority most of the time
- Cortex performs 22% better while Mali meets its target: a reduced Mali requirement allows Cortex-A9 to be higher priority more often

Optimizing Efficiency
- The performance of a system depends on:
  - Maximizing the efficiency of the memory controller
  - Using cache to minimize the system bandwidth and reduce latency to the masters
  - Using write buffering to minimize the latency from the system
- Performance is optimized by:
  - Implementing a consistent arbitration policy throughout the system
  - Exploiting the different latency sensitivities of masters
- Roadmap to QoS:
  - Consistent, system-wide, priority-based arbitration policy
  - Priority controllers for the system masters
  - Time-out mechanism in the system queue
  - High efficiency memory controller with write buffering
  - Regulation from QoS-301
[Diagram labels: Cortex-A9, LCD Ctrl, Mali-400, Mali-VE6, DMA Ctrl, Peripherals, NIC-301, Memory Ctrl]