The Challenges of System Design: Raising Performance and Reducing Power Consumption
Agenda
- The key challenges
- Visibility for software optimisation
- Efficiency for improved PPA
Product Challenge - Software
For software engineers:
- Good visibility
- A standard, easy-to-program h/w platform
- ARM Profiler to optimise performance
[Diagram: example SoC — Osprey application processors (CPU1-CPU4, L2 cache, coherency and virtualisation), media processors (graphics processor, video engine), power management, CoreSight on-chip debug & trace, AXI interconnects, coherent interconnect, DMA controller, HD LCD controller, DDR3/LPDDR2 memory controller, static memory controller, PCIe, APB peripherals, SRAM]
Design Challenge - PPA
[Diagram: the same SoC annotated with power and performance concerns — power management across the application processors, media processors and interconnects; performance at the CPUs, coherent interconnect and memory controllers]
VISIBILITY FOR OPTIMISATION
How to optimise your software and understand what your design is doing
On-Chip Visibility: a key requirement
[Diagram: the example SoC with the CoreSight on-chip debug & trace block highlighted alongside the application processors, media processors, interconnects, memory controllers and peripherals]
Typical CoreSight System
- Cross triggering between cores
- Single debug access port (DAP)
- Cost-effective debug
- Trace collection strategies
[Diagram: example ARM SoC — SWD DAP; Cortex-A9 with PTM, Cortex-R4 with ETM-R4, and a DSP with its ETM, each with a CoreSight interface and cross trigger block joined by a cross trigger matrix; system trace and bus trace sources on AMBA AXI; shared debug bus (APB) via an APB bridge, trace bus (ATB) with funnel, and a debug control bus; trace sinks: Trace Port Interface Unit driving a trace port to RealView Trace, and an Embedded Trace Buffer; RealView ICE on the debug port]
Software Profiling using CPU Trace
- Top-down insight into the analysed software
- Starting with an overview screen containing the top 5 functions by self time, delay and memory access
- Detailed information on the source code and its derived assembly code, annotated with performance information:
  - Code coverage
  - Source-associated instructions
  - Cycles per instruction
  - Interlock information
System Trace Macrocell - STM
System level visibility is required by application development up to the final product:
- Debug and tuning of s/w applications running on an OS
- Tracing of system events and system performance (PMU counts, OS trace)
The System Trace Macrocell enables:
- A high level application software view
- Tuning of system performance
- Tracing of SoC internal signals
Benefits:
- Flexible and affordable hardware-based debug for application and system level developers
- Complements CPU trace; MIPI STPv2 compliant
System Level Code Instrumentation
System level debug information can be sent through trace to debug your system.

/* Base of the memory-mapped STM stimulus ports. The address is
   platform-specific; volatile is required so that the spin loop in
   stm_emit_blocking() is not optimised away. */
static volatile unsigned int *stm_addr;

static inline void stm_emit(unsigned int port, unsigned int value)
{
    stm_addr[port] = value;
}

static inline void stm_emit_blocking(unsigned int port, unsigned int value)
{
    /* Reading from an STM port returns 1 if the FIFO can
       accept data, 0 if it is full. */
    while (!stm_addr[port])
        ;
    stm_addr[port] = value;
}

- Export debug information
- Visualise system level data
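The stimulus-port semantics above can be illustrated with a small host-side simulation (my sketch only; the FIFO depth and one-entry drain model are made-up assumptions, not real STM behaviour):

```c
#define SIM_STM_FIFO_DEPTH 16  /* hypothetical depth, not the real STM's */

static unsigned int sim_fifo[SIM_STM_FIFO_DEPTH];
static int sim_fifo_count = 0;

/* Mirrors reading an STM port: 1 if the FIFO can accept data, 0 if full. */
int sim_stm_ready(void)
{
    return sim_fifo_count < SIM_STM_FIFO_DEPTH;
}

/* Non-blocking emit: drops the value when the FIFO is full.
 * Returns 1 on success, 0 if the value was dropped. */
int sim_stm_emit(unsigned int value)
{
    if (!sim_stm_ready())
        return 0;
    sim_fifo[sim_fifo_count++] = value;
    return 1;
}

/* Models the trace infrastructure draining one entry.
 * Returns 1 and stores the oldest value, or 0 if the FIFO is empty. */
int sim_stm_drain(unsigned int *out)
{
    if (sim_fifo_count == 0)
        return 0;
    *out = sim_fifo[0];
    for (int i = 1; i < sim_fifo_count; i++)
        sim_fifo[i - 1] = sim_fifo[i];
    sim_fifo_count--;
    return 1;
}
```

This makes the difference between the two emit variants concrete: the non-blocking form can silently lose events under back-pressure, while the blocking form trades CPU time for a lossless stream.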
Event Profiling using STM
[Screenshot: event profiling view showing STM events from two Cortex-A9 cores and the L2 cache]
Trace Memory Controller
A single solution for cost-effective and flexible trace collection:
- SoC visibility in the final product with only 2 pins
- Storage of trace using low-cost system memory
- Routing to Gigabit links such as HSSTP or Ethernet
- Reduces trace overflows and trace port size by averaging out trace bandwidth
- Retains the existing modes with ETB (SRAM) & trace port (TPIU)
[Chart: trace bandwidth in bits/cycle, smoothed by the trace memory controller]
EFFICIENT SOC DESIGN
Getting the highest performance at the lowest power consumption
Introduction
Systems use external memory:
- Large address space
- Low cost-per-bit
- Large interface bandwidth
Each processing element's access to memory depends on the accesses from the other processing elements.
Challenge: manage the flow of data to and from external memory to present the best bandwidth and latency characteristics to each processing element.
[Diagram: physical view of a typical SoC — CPU (apps processor), GPU (geometry processor, renderer, tiling), video (motion estimation, motion compensation, image transform), comms control (network interface), display controller, audio CODEC and DMA controller sharing an interconnect to a dynamic memory controller (application memory holding frame buffer, texture, primitives and tile lists) and a static memory controller (NAND flash holding the media source)]
QoS Contracts
Each master has a QoS contract with the system:
- CPU: minimise latency
- GPU (geometry processor, renderer, tiling): minimum bandwidth
- Video (motion estimation, motion compensation, image transform): minimum bandwidth
- Comms control (network interface): minimum bandwidth
- Display controller: maximum latency or minimum bandwidth
- Audio CODEC: maximum latency
[Diagram: the SoC from the previous slide with each master annotated with its contract type at the interconnect and memory controllers]
QoS Objectives
- Allocate system capacity (latency and bandwidth) to each master to meet its contract
- Dynamically vary priority to react to changes in bus traffic
- If there is excess capacity:
  - Allocate the excess where it can offer the most improvement, usually reducing CPU latency
  - Allocate the excess to masters that can use it now and tolerate reduced capacity later
- If there is insufficient capacity:
  - Remove capacity from the masters that have the least impact on system performance
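One way to picture "dynamically vary the priority" is an arbiter that combines a static entry priority with a temporary boost for masters whose transactions have waited too long. This is a hypothetical sketch of that idea, not ARM's implementation; the level names and fields are mine:

```c
/* Hypothetical dynamic-priority arbitration sketch (not ARM RTL). */
enum prio { BEST_EFFORT = 0, STREAM = 1, BATCH = 2, TIMEOUT = 3 };

struct master {
    enum prio entry_prio;  /* static entry priority */
    int waited;            /* clocks the oldest transaction has waited */
    int timeout;           /* promotion threshold in clocks */
    int requesting;        /* 1 if the master has a pending transaction */
};

/* Effective priority: promote to the top level once the time-out expires. */
static enum prio effective_prio(const struct master *m)
{
    return (m->waited >= m->timeout) ? TIMEOUT : m->entry_prio;
}

/* Grant the requesting master with the highest effective priority.
 * Ties go to the lowest index; returns -1 if nobody is requesting. */
int arbitrate(const struct master *m, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!m[i].requesting)
            continue;
        if (best < 0 || effective_prio(&m[i]) > effective_prio(&m[best]))
            best = i;
    }
    return best;
}
```

The time-out promotion is what lets a low-priority master still meet a maximum-latency contract, while high-priority masters keep the excess capacity the rest of the time.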
System Latency
Latency is added throughout the system in two forms:
- Static latency: the delay through pipeline stages. Constant, and specific to the path from master to slave.
- Queuing latency: the delay at arbitration points in the system. The delay for each transaction depends on the number of transactions ahead of it in the queue and the rate at which they are processed. The queue length depends on the capacity of the slave (memory type and efficiency) and the desired throughput.
Efficiency of the memory controller is a function of: queue length, burst length, read-write mix, address distribution.
[Charts: AMBA DMC-341 efficiency and burst-length population vs. average burst length (3-16); system latency in clocks split into static and queuing components]
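A rough way to see why queuing latency dominates at high load is to model the arbitration point as an M/M/1 queue. That model is my simplification (a real memory-controller queue's service rate also depends on burst length and read-write mix, as the slide notes), but it reproduces the characteristic shape:

```c
/* Queuing-latency sketch using M/M/1 as a stand-in for a memory queue.
 * utilization = offered load / achievable throughput, in [0, 1). */
double queuing_latency(double service_clocks, double utilization)
{
    /* Mean wait in queue: Wq = service time * rho / (1 - rho). */
    return service_clocks * utilization / (1.0 - utilization);
}

/* Total latency seen by a master: fixed pipeline delay plus queuing. */
double system_latency(double static_clocks, double service_clocks,
                      double utilization)
{
    return static_clocks + queuing_latency(service_clocks, utilization);
}
```

At 50% utilisation queuing adds one service time; at 90% it adds nine. Static latency, by contrast, is a constant offset, which is why the two components must be attacked differently.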
System Interface Characteristics
The performance of the CPU master is determined by the latency characteristics that it sees from the system:
- Determined by the bandwidth from the other masters
- Also by the efficiency of the memory controller, which depends on the burst characteristics of the traffic
The other masters can be replaced with traffic profile generators (VPE), calibrated to generate the same traffic behaviour.
[Diagram: the SoC with the apps processor retained and the GPU, video, comms, display, audio and DMA masters as candidates for replacement by traffic generators]
VPE - Verification and Performance Exploration
The AMBA VPE design tool is for verification of system performance:
- A graphical profiling toolkit to generate & view traffic profiles
- 3 verification components: AXI Monitor, AXI Master, AXI Slave
- Runs on all of the big 3 RTL simulation tools
Speeds up RTL simulation by:
- Giving up execution of functions (e.g. CPU, GPU) in favour of emulating their traffic; no need to model their cycle-accurate behaviour as a result
- Replacing real data with constrained random data
Can test typical and worst case scenarios.
System Interface Characteristics
The performance of the master is determined by the latency characteristics that it sees from the system:
- Determined by the bandwidth from the other masters
- Also by the efficiency of the memory controller, which depends on the burst characteristics of the traffic
[Diagram: the same SoC with a VPE Master replacing the apps processor and a VPE Slave replacing the dynamic memory controller, each calibrated to generate the same traffic behaviour]
Calibrating VPE Master Behaviour
- Run benchmark applications on the bus master
- The VPE Monitor captures the traffic profile
- The VPE Slave varies the latency seen by the master
- The system architect selects a representative set of benchmarks
- Benchmark results provide the bandwidth and latency contracts
- The traffic profile and latency sensitivity results are used to generate a VPE model of the bus master
[Diagram: benchmarks driving the bus master, observed by a VPE monitor and terminated by a VPE slave]
Better Designs More Quickly
Goal: the iteration time of a spreadsheet, with accuracy approaching RTL simulation.
[Chart: cycle time (low to high) vs. realism of behaviour (low to high)]
- Spreadsheet analysis — minutes/hours; a mathematical formula, not dynamic
- RTL simulation with VPE, user VIP and industry-standard VIP, driven by statistical or recorded traffic profiles — days/weeks
- Acceleration/emulation with VIP, Logic Tiles and software — adds s/w and external interfaces with realistic scenarios
- Silicon/applications — months/years; observe actual behaviour
Reducing Visible System Latency
- Write data can be buffered, as long as coherency is managed: the latency for write traffic seen by the system is significantly reduced
- Buffering can also be used to reduce read latency, by prioritising reads
- Cache memory reduces the latency seen by the master; it also reduces system bandwidth, which reduces the latency to the other masters
- There are diminishing returns from increases in cache size
[Chart: burst latency vs. system utilisation (10%-94%) for unbuffered traffic, reads and writes]
Increasing Latency Tolerance
- Masters that generate transactions that are weakly dependent on the completion of previous transactions can issue multiple outstanding transactions
- Multiple outstanding transactions can eliminate the effects of static (pipeline) latency
- They have no impact on the effects of dynamic (queuing) latency: additional outstanding transactions simply increase the queue length
[Diagram: static latency hidden by overlapping outstanding transactions at the processing rate]
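The latency-hiding effect of outstanding transactions can be quantified with a simple pipelining relationship (my illustration, not from the slides' tooling; units are bytes per clock and the numbers in the comments are made up):

```c
/* Sustained bandwidth (bytes/clock) achievable with n outstanding
 * transactions of burst_bytes each, over a round trip of
 * latency_clocks, capped by the interface peak. Illustrative only. */
double sustained_bw(int n_outstanding, double burst_bytes,
                    double latency_clocks, double peak_bw)
{
    double bw = n_outstanding * burst_bytes / latency_clocks;
    return bw < peak_bw ? bw : peak_bw;
}

/* Outstanding transactions needed to saturate the interface:
 * smallest n with n * burst_bytes / latency >= peak (rounded up). */
int needed_outstanding(double burst_bytes, double latency_clocks,
                       double peak_bw)
{
    double n = peak_bw * latency_clocks / burst_bytes;
    return (int)(n + 0.999999);  /* cheap ceiling for positive values */
}
```

This is why once the interface is saturated, extra outstanding transactions stop helping: they no longer hide latency, they just lengthen the queue, which is the dynamic-latency point above.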
Queue Location
- A queue is implemented in the memory controller, allowing re-ordering of transactions to maximise efficiency
- If the queue fills, it extends back through the interconnect
- Interconnect arbitration only operates when the queue extends through the interconnect
- For effective QoS, the arbitration policy should be consistent throughout the system
System topology influences performance:
- The CPU and LCD controller are placed close to the memory controller, for lower latency
- Mali and DMA sit on a separate interconnect level
- Hierarchy improves performance
[Diagram: Cortex-A9 and LCD controller on the interconnect next to the memory controller; Mali-VE6, Mali-400, DMA controller and peripherals on a second-level interconnect]
Stream Processing Masters
- Entry priority is set third highest of the four levels: only higher than best-effort
- Adding latency does not affect performance while the latency is less than the maximum
- If a transaction is still waiting after a time-out period, it is promoted to the highest (time-out) priority
- This reduces the latency to the stream processing masters when necessary
[Chart: survival plot of transaction latency, 100% down to 0% over 0-140 clocks, against the maximum-latency bound; priority levels shown: time-out, batch processing, stream processing, best-effort]
Batch Processing Masters
- Average latency, bandwidth and queue length are related by Little's Law: E(L) = λ·E(S)
- Hold the queue length constant; measure the average latency and control the priority
- Priority controls latency, which controls bandwidth
- Priority only exceeds that of other masters when insufficient bandwidth is being obtained, which minimises the number of transactions prioritised
- Excess bandwidth is used ahead of best-effort masters
[Chart: bandwidth (MB/s) vs. latency (clocks) for batch processing traffic]
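The E(L) = λ·E(S) relationship can be applied directly: with the queue length held constant, a latency measurement yields the transaction rate, and so the bandwidth being obtained. A hedged sketch of that control idea (the regulator step is my illustration, not QoS-301's actual algorithm):

```c
/* Little's Law: E(L) = lambda * E(S). With a constant queue length L
 * and a measured average latency S, the rate lambda = L / S. */
double transaction_rate(double queue_length, double avg_latency_clocks)
{
    return queue_length / avg_latency_clocks;  /* transactions/clock */
}

double obtained_bandwidth(double queue_length, double avg_latency_clocks,
                          double bytes_per_transaction)
{
    return transaction_rate(queue_length, avg_latency_clocks)
           * bytes_per_transaction;
}

/* Toy regulator: raise priority when the obtained bandwidth is below
 * the contract, lower it when above. Illustrative only. */
int adjust_priority(int prio, double obtained_bw, double contract_bw)
{
    if (obtained_bw < contract_bw)
        return prio + 1;
    if (obtained_bw > contract_bw)
        return prio - 1;
    return prio;
}
```

Because priority controls latency and latency (at fixed queue length) controls bandwidth, this loop only escalates priority while the contract is being missed, matching the "minimise transactions prioritised" objective.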
QoS with Existing Memory Controllers
- Existing memory controllers have only 3 priority levels
- The CPU is given high priority, but is demoted if there is insufficient minimum bandwidth available for the batch processors
- Batch processing bandwidth increases to use any available bandwidth; the increase is proportional to the number of outstanding transactions
- Batch processor bandwidth is partitioned by varying the number of outstanding transactions
- Hard regulation can set a maximum bandwidth for a batch processing master
- Excess bandwidth is partitioned between the other masters
30% Browsing Boost with QoS-301
- When the lower Mali bandwidth requirements are exceeded, Cortex-A9 is the highest priority most of the time and performs >30% better than the base case
- While Mali meets its target, Cortex-A9 still performs 22% better
- A reduced Mali requirement allows Cortex-A9 to be the higher priority more often
- Mali soaks up the spare system bandwidth
[Chart: Mali bandwidth in MB/s (with and without QoS-301, against the target) and Cortex-A9 performance improvement (100%-135%) vs. targeted Mali bandwidth from 350 to 700 MB/s]
Optimizing Efficiency
The performance of a system depends on:
- Maximising the efficiency of the memory controller
- Using cache to minimise system bandwidth, and to reduce the latency to the masters
- Using write buffering to minimise the latency from the system
Performance is optimised by:
- Implementing a consistent arbitration policy throughout the system
- Exploiting the different latency sensitivities of the masters
Roadmap to QoS:
- A consistent, system-wide, priority-based arbitration policy
- Priority controllers for the system masters
- A time-out mechanism in the system queue
- A high-efficiency memory controller with write buffering
- Regulation from QoS-301
[Diagram: NIC-301 interconnect hierarchy with Cortex-A9, the memory controller and the LCD controller at the first level; Mali-VE6, Mali-400, the DMA controller and peripherals at the second]