It's not about the core, it's about the system

Gajinder Panesar, CTO, UltraSoC (gajinder.panesar@ultrasoc.com)
RISC-V Workshop, 18-19 July 2018, Chennai, India

Outline: Overview, Architecture overview, Example scenarios, In-field analysis/ML, Summary, Demos

Overview
In complex systems, understanding behaviour is not easy. Surprisingly, systems sometimes do not behave as expected. This may be due to a number of factors: for example, interactions with cores, software, peripherals and real-time events, poor implementation, or some combination of all of the above. Hiring better software engineers is not always an option: you have done that already. And RTL engineers introduce bugs too. Providing visibility of SoC behaviour is important, and it needs to be done intelligently, without swamping the system with vast amounts of data. Remember that the core is a very small part of the overall SoC.

Some obvious statements
SoCs have become increasingly complicated, and they are not going to get simpler. They contain several (even thousands of) processors from different vendors, hundreds of SIP blocks and complex interconnects, with software created by large, disparate teams. All of this has to work together successfully. Debugging is more than just run control, and more than just CPU-centric information such as instruction trace. These are important, but they are only part of the problem. For RISC-V to be successful, it must be usable in systems constructed as above.

Key requirements
A vendor-neutral debug, monitoring and analytics infrastructure, one that enables access to the different proprietary debug schemes used today by various cores. It allows for monitors in interconnects, NoCs, interfaces and custom logic. These need to be run-time configurable, so that the hardware can be re-used to provide visibility for different scenarios. Cross-triggering must also be configurable at run time and support tens if not hundreds of cross-triggering events. These can be interrogated after a problem to determine actual status. The infrastructure needs to be power aware, with security built in. It can be used during the whole development flow and, more importantly, in the field.

Corporate overview
Founded 2009. VC-funded start-up; 2017 D-round ($7M). New Chairman, October 2017: Alberto Sangiovanni-Vincentelli. Headquarters in Cambridge, UK. 44 patents. 32 employees. Industry leaders adopting UltraSoC; silicon-proven with multiple customers.
[Design labels from the slide graphic: SSD Controller-1, Custom uP Server, ARMv8 Server, SSD Controller-2, Tier-1 Automotive.]

Advanced debug/monitoring for the whole SoC
[Block diagram, three layers of UltraSoC IP.]
Portfolio of Analytic Modules: Bus Monitor (AXI, ACE, ACE-Lite, OCP, NoC), Trace Receiver, PAM, Trace Encoder, Static Instrumentation and DMA Monitor, attached to system blocks such as an Xtensa core, DRAM controller, GPU and custom logic.
Flexible & Scalable Message Fabric: Message Engines.
Family of Communicators: AXI Comm, JTAG Comm, USB Comm, Universal Streaming Comm, System Memory Buffer.

Software tools for data-driven insights
Eclipse-based UltraDevelop IDE: single step and breakpoint; CPU code and decoded trace; script based; RISC-V CPU and multiple other CPUs; SW and HW in one tool; real-time HW data; RISC-V instruction packets.

Example of UltraSoC-enabled SoC
[Block diagram: two processors, each with I$ and D$; FFT, Turbo, two Radio IF blocks, USB MAC, Peripheral, DMA-1, DMA-2, RAM, Timer and Security blocks on the buses; DRAM controller with DFI-PHY and PHY to DDR3; UltraSoC IP blocks, UltraSoC infrastructure and Debug Hub.]

Example problems UltraSoC solves
Why is the CPU not performing as fast as expected? Why do some DMA transfers take too long? What is the mismatch between the host & the? What is going on with my memory controller? Why does the system hang or deadlock on rare occasions?
[The questions annotate the example SoC block diagram.]

Example 1: Where have my MIPS gone?
Why is the CPU not performing as fast as expected?
[Pie chart "CPU spent cycles" on the example SoC block diagram, with segments Compute, Stall (1 outstanding) and Stall (2 outstanding); the values shown are 12%, 8% and 80%.]

Example 2: DDR bandwidth
Why do some DMA transfers take too long? What is going on with my memory controller?
[Plot "Windowed DDR traffic": effective B/s (0 to 1.00E+09) against time in ns (1000 to 49000), series CPU1 and CPU2, with markers 1 and 2, shown on the example SoC block diagram.]
Look at I$ traffic from the compute engines. Aggregate bandwidth from each is within spec, but at time 2300 there is a combined peak I$ read request of >2 GB/s, cf. an average of ~570 MB/s.
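The gap between a within-spec average and a short combined burst can be illustrated with a windowed calculation. The following is a minimal sketch, assuming a hypothetical list of `(timestamp_ns, nbytes)` read records; the function names, window size and traffic pattern are illustrative and are not taken from the UltraSoC tools.

```python
from collections import defaultdict

def windowed_bandwidth(records, window_ns):
    """Return {window_start_ns: bytes_per_second} for (timestamp_ns, nbytes) records."""
    totals = defaultdict(int)
    for t_ns, nbytes in records:
        totals[(t_ns // window_ns) * window_ns] += nbytes
    # Convert bytes per window into bytes per second (1 s = 1e9 ns).
    return {start: total * 1e9 / window_ns for start, total in sorted(totals.items())}

def peak_and_average(records, window_ns, span_ns):
    """Peak windowed bandwidth and whole-run average bandwidth, both in B/s."""
    per_window = windowed_bandwidth(records, window_ns)
    peak = max(per_window.values())
    average = sum(nbytes for _, nbytes in records) * 1e9 / span_ns
    return peak, average

# Hypothetical traffic: one 64-byte read every 100 ns, plus a burst starting at t = 2300 ns.
records = [(t, 64) for t in range(0, 10000, 100)]
records += [(2300 + i, 64) for i in range(20)]

peak, average = peak_and_average(records, window_ns=100, span_ns=10000)
print(f"peak {peak:.3e} B/s, average {average:.3e} B/s")
```

The average alone hides the burst; only the per-window view shows that one window carries many times the mean load.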

Example 3: Deadlock detection
There are many different types of deadlock, but consider this example: the CPU (master) asserts arvalid and issues a read address to the slave; the slave asserts rvalid and outputs read data but never sees rready asserted. Configure the bus monitor trace to trigger when a transaction's duration exceeds a threshold (programmable up to 16k cycles). Trace is not output until triggered. When triggered by the deadlocked transaction, the trace outputs the most recent transactions up to and including the deadlocked one. The trace gives the transaction ID and address, which identify both the master and the slave of the deadlocked transaction.
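The trigger-on-timeout behaviour described above can be modelled in a few lines. This is a toy software sketch under stated assumptions, not the bus monitor's actual implementation: the `BusMonitor` class, its method names and the buffer depth are hypothetical, and cycles are passed in explicitly rather than taken from a clock.

```python
from collections import deque

class BusMonitor:
    """Toy model: remembers recent transactions, emits them only when one overruns."""

    def __init__(self, threshold_cycles, depth=4):
        self.threshold = threshold_cycles
        self.recent = deque(maxlen=depth)  # circular buffer of completed transactions
        self.pending = {}                  # txn_id -> (address, start_cycle)

    def start(self, txn_id, address, cycle):
        self.pending[txn_id] = (address, cycle)

    def complete(self, txn_id, cycle):
        address, started = self.pending.pop(txn_id)
        self.recent.append((txn_id, address, cycle - started))

    def tick(self, cycle):
        """Return the trace if a pending transaction exceeded the threshold, else None."""
        for txn_id, (address, started) in self.pending.items():
            if cycle - started > self.threshold:
                return list(self.recent) + [(txn_id, address, cycle - started)]
        return None

mon = BusMonitor(threshold_cycles=16, depth=2)
mon.start(1, 0x1000, 0)
mon.complete(1, 5)
mon.start(2, 0x2000, 6)
mon.complete(2, 9)
mon.start(3, 0x3000, 10)
mon.complete(3, 12)
mon.start(4, 0x8000, 13)   # rready is never seen, so this one never completes

quiet = mon.tick(20)       # 7 cycles outstanding: below threshold, no trace
trace = mon.tick(30)       # 17 cycles outstanding: trigger fires
```

Nothing is emitted while transactions complete normally; when the trigger fires, the trace holds the last `depth` completed transactions followed by the stuck one, with its ID and address.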

Example 4: System hang or freeze
The monitors continue to function when the system freezes. They can operate by updating an internal circular buffer. When a system freeze is detected, the trace buffers from all the monitors can be extracted. The monitors themselves can detect the freeze, for example as no transaction within a window. Trace is not output until triggered. When triggered by a system freeze, the trace outputs the most recent transactions up to and including the deadlocked transaction. The trace gives the transaction ID and address, which identify both the master and the slave of the deadlocked transaction. This is similar for other monitors. The result can be considered a system-wide core dump: use it to establish the known state before the hang, or send out core dumps periodically.
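The idea of monitors that keep a circular buffer and detect a freeze as "no transaction in a window" can be sketched as follows. This is an illustrative model only: the `FreezeDetector` class, the `system_dump` helper and the rule that every monitor must be idle before a freeze is declared are assumptions made for the example.

```python
from collections import deque

class FreezeDetector:
    """Toy monitor: circular buffer of recent events plus an inactivity window."""

    def __init__(self, window_cycles, depth=8):
        self.window = window_cycles
        self.buffer = deque(maxlen=depth)
        self.last_activity = 0

    def observe(self, cycle, event):
        self.buffer.append((cycle, event))
        self.last_activity = cycle

    def frozen(self, cycle):
        # Freeze criterion from the slide: no transaction within the window.
        return cycle - self.last_activity >= self.window

def system_dump(monitors, cycle):
    """Extract every monitor's buffer once all of them report inactivity (assumed rule)."""
    if all(m.frozen(cycle) for m in monitors.values()):
        return {name: list(m.buffer) for name, m in monitors.items()}
    return None

monitors = {"cpu_bus": FreezeDetector(100), "dma": FreezeDetector(100)}
monitors["cpu_bus"].observe(10, "read 0x1000")
monitors["dma"].observe(40, "write 0x2000")

partial = system_dump(monitors, 120)  # dma was active 80 cycles ago: no dump yet
dump = system_dump(monitors, 200)     # both idle for >= 100 cycles: dump extracted
```

The extracted dictionary plays the role of the system-wide core dump: one buffer of last-known activity per monitor.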

Metrics generation: Example 1
Runtime configuration: the monitor is configured to count stall triggers from the processor; the period of the interval timer is set; counter values are snapshot on expiry of the interval timer.
Data flow: 1. Stall trigger observed on SM inputs. 2. Counter data periodically output from SM. 3. Data traced out via USB.
[Plot "Monitor Counter Values": stall triggers observed (0 to 10) against sample time (ns); data-flow steps 1 to 3 are marked on the example SoC block diagram.]
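The counter-plus-interval-timer arrangement can be modelled in software as below. This is a minimal sketch: the `IntervalCounter` class is hypothetical, and resetting the counter after each snapshot is an assumption (the slide only says that values are snapshot on expiry).

```python
class IntervalCounter:
    """Counts stall triggers and snapshots the count each time the interval expires."""

    def __init__(self, period_ns):
        self.period = period_ns
        self.next_expiry = period_ns
        self.count = 0
        self.snapshots = []  # list of (expiry_time_ns, count)

    def _advance(self, now_ns):
        # Emit one snapshot per expired interval, including empty intervals.
        while now_ns >= self.next_expiry:
            self.snapshots.append((self.next_expiry, self.count))
            self.count = 0   # assumption: counter restarts after each snapshot
            self.next_expiry += self.period

    def stall(self, now_ns):
        self._advance(now_ns)
        self.count += 1

    def flush(self, now_ns):
        self._advance(now_ns)
        return self.snapshots

counter = IntervalCounter(period_ns=1000)
for t in [100, 250, 900, 1200, 3100, 3150]:
    counter.stall(t)
snapshots = counter.flush(4000)
print(snapshots)
```

Each tuple corresponds to one point on a counter-values-against-sample-time plot.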

Cross-triggering: Example 1
Example ARM + RISC-V system.
Data flow: 1. Bus Monitor A outputs an UltraSoC event when a memory access is detected. 2. Monitor receives stall trigger. 3. Event output from SM after transitioning from DMA START to STALL. 4. Trace Receiver(s) and RISC-V encoder enabled after receiving the event. 5. Processor trace output via USC-P.
[Block diagram: non-CPU masters, Bus Monitors A, B and C, system SRAM, NoC or bus fabric, DMA-AXI, PAM-APB, APB monitor, CTI, ARM core with ETM and Trace Receiver, JPAM, RISC-V Trace Encoder, Message Engine and Comm, with an external debugger beyond the SoC boundary. State machine with states IDLE, DMA START and STALL, and labels "Memory access", "Stall Trigger" and "Interval expired".]
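The data flow above amounts to a small state machine whose DMA START to STALL transition enables the trace sources. The sketch below is a hypothetical software model: the class names, event names and the two "interval expired" transitions back to IDLE are assumptions made for illustration, not the hardware's definition.

```python
# (state, event) -> next state. The "interval_expired" edges are assumed.
TRANSITIONS = {
    ("IDLE", "memory_access"): "DMA_START",
    ("DMA_START", "stall_trigger"): "STALL",
    ("DMA_START", "interval_expired"): "IDLE",
    ("STALL", "interval_expired"): "IDLE",
}

class StatusMonitor:
    """Emits an event to subscribers on the DMA_START -> STALL transition."""

    def __init__(self):
        self.state = "IDLE"
        self.listeners = []

    def subscribe(self, callback):
        self.listeners.append(callback)

    def on_event(self, name):
        new_state = TRANSITIONS.get((self.state, name), self.state)
        fire = self.state == "DMA_START" and new_state == "STALL"
        self.state = new_state
        if fire:
            for callback in self.listeners:
                callback()

class TraceSource:
    """Stand-in for a trace receiver or trace encoder that starts disabled."""

    def __init__(self, name):
        self.name = name
        self.enabled = False

    def enable(self):
        self.enabled = True

sm = StatusMonitor()
arm = TraceSource("arm_trace_receiver")
riscv = TraceSource("riscv_trace_encoder")
sm.subscribe(arm.enable)
sm.subscribe(riscv.enable)

sm.on_event("stall_trigger")              # stall with no preceding memory access: ignored
before = (arm.enabled, riscv.enabled)
sm.on_event("memory_access")              # step 1: bus monitor event
sm.on_event("stall_trigger")              # steps 2-3: transition fires the event
after = (arm.enabled, riscv.enabled)      # step 4: both trace sources enabled
```

Only the ordered sequence of memory access followed by stall enables trace, which is the point of cross-triggering: trace is captured for the condition of interest rather than continuously.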

Example of instrumented SoC
The SI (Static Instrumentation) provides independent memory-mapped channels (mailboxes). Software and hardware can post writes to these channels, which can be used to understand system-wide behaviour. The data is timestamped, or a write can carry a timestamp only, with no data. The channels can be filtered. Each channel can be enabled to provide events, which can be used for cross-triggering. The Virtual Console provides bi-directional channels.
[Block diagram: the example SoC with added Efuse, Key Store and Static Instrumentation blocks.]
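The channel behaviour described above (timestamped writes, timestamp-only writes, filtering and per-channel events) can be sketched as a small software model. Everything here is hypothetical: the `StaticInstrumentation` class, its attributes and the counter used as a timestamp source are assumptions for illustration and do not describe the real register interface.

```python
import itertools

class StaticInstrumentation:
    """Toy model of independent mailbox channels with timestamps, filters and events."""

    def __init__(self, n_channels, clock=None):
        self.enabled = [True] * n_channels   # per-channel filter
        self.event_channels = set()          # channels that raise cross-trigger events
        self.trace = []                      # (timestamp, channel, data)
        self.events = []                     # (timestamp, channel)
        # Assumed timestamp source: a simple incrementing counter.
        self._clock = clock or itertools.count().__next__

    def write(self, channel, data=None):
        if not self.enabled[channel]:
            return                           # filtered channels produce nothing
        stamp = self._clock()
        self.trace.append((stamp, channel, data))
        if channel in self.event_channels:
            self.events.append((stamp, channel))

si = StaticInstrumentation(n_channels=4)
si.enabled[2] = False          # filter out channel 2
si.event_channels.add(1)       # channel 1 also raises events

si.write(0, 0xCAFE)            # data plus timestamp
si.write(2, 0xDEAD)            # dropped by the filter
si.write(1)                    # timestamp only, and raises an event
```

In this model, software would post a write at points of interest (for example on entering an interrupt handler), and the resulting timestamped trace is merged with hardware monitor data.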

Simple SI visualization

Actionable insights across the whole SoC
[Pyramid, top to bottom: Value, Knowledge, Information, Data.]
UltraSoC delivers actionable insights, with system-wide understanding, from rich data across the whole SoC. UltraSoC enables full visibility of the SoC.

Non-intrusive latency-bandwidth correlation
Shows how bandwidth and latency are cross-correlated. The interest is in masters, since this is where latency is consumed and affects master operation, and mainly in reads: a master has to wait for read results, whereas writes are less critical. The result is presented as a heat map. For example, on the diagram shown, all CPU latencies are affected by DMA bandwidths.
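A heat map of this kind is, in essence, a matrix of correlation coefficients between per-master read latency and per-master bandwidth. The sketch below uses the Pearson coefficient as an assumed metric (the slides do not say which measure is used), with made-up sample data and hypothetical function names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def correlation_matrix(latency, bandwidth):
    """Rows: masters whose read latency is observed. Columns: bandwidth sources."""
    return {
        lat_name: {bw_name: pearson(lat, bw) for bw_name, bw in bandwidth.items()}
        for lat_name, lat in latency.items()
    }

# Made-up samples, one value per time window.
bandwidth = {
    "dma": [100, 400, 200, 800, 300],   # MB/s
    "gpu": [50, 50, 50, 50, 50],        # constant load
}
latency = {"cpu0": [12, 18, 14, 26, 16]}  # cycles; rises and falls with DMA load

matrix = correlation_matrix(latency, bandwidth)
```

A cell close to 1 would be drawn as a hot cell: in this made-up data, CPU read latency tracks DMA bandwidth and is unrelated to the constant GPU load.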

Non-intrusive anomaly detection
The three plots show cache-like traffic for three CPUs configured with different miss rates. Excessive (anomalous) latencies are shown in red.
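The slides do not state how latencies are classified as anomalous. As one possible illustration, the sketch below flags samples that exceed the median by a multiple of the median absolute deviation (MAD); the technique, the `flag_anomalies` name, the factor `k` and the sample data are all assumptions, not the method used by UltraSoC.

```python
from statistics import median

def flag_anomalies(latencies, k=5.0):
    """Return indices of samples above median + k * MAD (MAD floored at 1 cycle)."""
    med = median(latencies)
    mad = median(abs(x - med) for x in latencies)
    limit = med + k * max(mad, 1)
    return [i for i, x in enumerate(latencies) if x > limit]

# Made-up read latencies in cycles: mostly cache hits, with two long misses.
latencies = [10, 11, 10, 12, 11, 95, 10, 11, 12, 10, 140, 11]
anomalous = flag_anomalies(latencies)
print(anomalous)
```

A robust statistic such as the median is used here so that the outliers themselves do not shift the threshold, as they would with a mean and standard deviation.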

Non-intrusive profiling with anomaly detection
Traditional profilers are inadequate: sampling misses subtle or fast events (Nyquist), and they are intrusive, with performance impact and Heisenbugs. UltraSoC is non-intrusive and wire-speed (100% coverage). Analytics and automated anomaly detection make engineers more efficient.
18 July 2018, Gajinder Panesar, UL-002074-PT

Non-intrusive stuck pixels detection
[Figure panels: incoming image; detected stuck pixels.] Fastest time to detection.

Summary
The challenge today is systemic complexity: processor-processor interactions, HW/SW interactions, NoC and deadlock. Long-tail bugs dominate performance but are hard to detect. UltraSoC provides a completely scalable, coherent analytics, monitoring and debug system. UltraSoC is system-wide, non-intrusive and wire-speed. Analytics and ML help engineers identify subtle problems efficiently.

Demo system architecture
Zynq ZC706 FPGA platform: ARM plus RV32 RISC-V plus custom logic.
Demo shows: bus state, traffic, performance histogram, memory, processor control, bus deadlock detection, RISC-V processor trace.
[Block diagram: Zynq SoC with ARM A9 (Linux), ARM A9 (Bare), DRAM controllers, SODIMM, SD card, LEDs and switches; GPIO, DMA (dma1), SRAM and LCD controller on the system (AXI) bus; RISC-V core with debug JTAG; Custom Mon (sm1), AXI Mon (xbm1, xbm2), AXI Proc. Analytic Module (pam1), JTAG Proc. Analytic Module (jtm1), Trace Enc (rte1), Static Instr (si1), Virtual Console (vc1), AXI CTI, message infrastructure and System Memory Buffer; AXI Comm., JTAG Comm. (5-pin 1149.1) and USB 2.0 Debug Hub communicator (ULPI to off-chip PHY). UltraSoC components are marked in the diagram key.]

UltraSoC IDE
[Screenshot: decoded trace showing source code and assembly; bus activity; control configuration; trace packets.]