It's not about the core, it's about the system
Gajinder Panesar, CTO, UltraSoC
gajinder.panesar@ultrasoc.com
RISC-V Workshop, 18-19 July 2018, Chennai, India
- Overview
- Architecture overview
- Example scenarios
- In-field analysis/ML
- Summary
- Demos
Overview
- In complex systems, understanding the behaviour is not easy
- Surprisingly, systems sometimes do not behave as expected
  - This may be due to a number of factors, for example interactions with cores, software, peripherals and real-time events, poor implementation, or some combination of all of the above
  - Hiring better software engineers is not always an option: you have done that already
  - Oh, RTL engineers introduce bugs too
- Providing visibility of SoC behaviour is important
  - This needs to be done in an intelligent manner and without swamping the system with vast amounts of data
- Remember, the core is a very small part of the overall SoC
Some obvious statements
- SoCs have become increasingly complicated and they are not going to get simpler
  - Contain several (even 1000s of) processors, from different vendors
  - Contain 100s of SIP blocks
  - Contain complex interconnects
  - Software created by large, disparate teams
- All this has to work together successfully
- Debugging is more than just run control
  - It is more than just CPU-centric information such as instruction trace
  - These are important but are only parts of the problem
- For RISC-V to be successful, it must be usable in systems constructed as above
Key requirements
- A vendor-neutral debug, monitoring and analytics infrastructure
  - One that enables access to the different proprietary debug schemes used today by various cores
- Allows for monitors in interconnects, NoCs, interfaces and custom logic
  - These need to be run-time configurable
  - Re-use the hardware to provide visibility for different scenarios
- Run-time configuration of cross-triggering
  - Support 10s if not 100s of cross-triggering events
  - These can be interrogated after a problem to determine actual status
- Needs to be power aware
- Security built in
- Can be used during the whole development flow and, more importantly, in the field
Corporate overview
- Founded 2009
- VC-funded start-up; 2017 D-round ($7M)
- New Chairman October 2017: Alberto Sangiovanni-Vincentelli
- Headquarters in Cambridge, UK
- 44 patents
- 32 employees
- Silicon-proven with multiple customers
- Industry leaders adopting UltraSoC: SSD Controller-1, Custom up Server, ARMv8 Server, SSD Controller-2, Tier-1 Automotive
Advanced debug/monitoring for the whole SoC
[Architecture diagram: UltraSoC IP attached to system blocks (Xtensa, DRAM controller, GPU, custom logic) and to buses (AXI, ACE, ACE-lite, OCP, NoC)]
- Portfolio of Analytic Modules: Bus Monitor, Trace Receiver, PAM, Trace Encoder, Static Instrumentation, DMA Monitor
- Message Engines connected by a flexible and scalable message fabric
- Family of Communicators: AXI Comm, JTAG Comm, USB Comm, Universal Streaming Comm, System Memory Buffer
Software tools for data-driven insights
- Eclipse-based UltraDevelop IDE
- RISC-V CPU: single step and breakpoint, CPU code and decoded trace
- Script based
- Multiple other CPUs
- SW and HW in one tool
- Real-time HW data
- RISC-V instruction packets
Example of UltraSoC-enabled SoC
[Block diagram: two processors, each with I$ and D$; FFT, Turbo and two Radio IF blocks; USB, MAC and Debug Hub; Peripheral, DMA-1, RAM, DMA-2, Timer and Security blocks on the bus; DRAM controller with DFI-PHY and PHY to DDR3; UltraSoC IP and UltraSoC infrastructure alongside]
Example problems UltraSoC solves
- Why is the CPU not performing as fast as expected?
- Why do some DMA transfers take too long?
- What is the mismatch between the host & the?
- What is going on with my memory controller?
- Why does the system hang or deadlock on rare occasions?
[Block diagram of the example SoC, annotated with the questions above]
Example 1: Where have my MIPS gone?
- Why is the CPU not performing as fast as expected?
[Pie chart "CPU spent cycles" with segments Compute, Stall 1 outstanding and Stall 2 outstanding; values shown: 12%, 8%, 80%]
Example 2: DDR bandwidth
- Why do some DMA transfers take too long?
- What is going on with my memory controller?
[Plot "Windowed DDR traffic" for CPU1 and CPU2: effective B/s (0.00E+00 to 1.00E+09) against time in ns (1000 to 49000)]
- Look at I$ traffic from the compute engines
- Aggregate bandwidth from each is within spec
- But at time 2300, the combined peak I$ read request is >2 GB/s, compared with an average of ~570 MB/s
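The windowed view above can be illustrated with a minimal sketch, assuming the monitor data has already been exported as (timestamp in ns, bytes) records. The function name `windowed_bandwidth` and the record layout are hypothetical and are not part of any UltraSoC tool.

```python
from collections import defaultdict


def windowed_bandwidth(transfers, window_ns):
    """Return {window_start_ns: bytes_per_second}.

    transfers: iterable of (timestamp_ns, nbytes) records (assumed format).
    """
    bytes_per_window = defaultdict(int)
    for timestamp_ns, nbytes in transfers:
        # Align each transfer to the start of the window it falls in.
        window_start = (timestamp_ns // window_ns) * window_ns
        bytes_per_window[window_start] += nbytes
    # Convert bytes per window into bytes per second.
    return {start: total * 1e9 / window_ns
            for start, total in sorted(bytes_per_window.items())}


# Hypothetical I$ read traffic from two CPUs, merged into one stream.
cpu1 = [(0, 64), (2300, 64), (2310, 64)]
cpu2 = [(150, 64), (2320, 64)]
bandwidth = windowed_bandwidth(cpu1 + cpu2, window_ns=100)
peak_window = max(bandwidth, key=bandwidth.get)
```

Each stream on its own looks modest; only the combined, windowed view exposes the short burst, which is why an average over the whole run hides the problem.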
Example 3: Deadlock detection
- Many different types, but consider this as an example
  - CPU (master) asserts arvalid and issues a read address to the slave
  - Slave asserts rvalid and outputs read data but never sees rready asserted
- Configure the bus monitor trace to trigger when transaction duration exceeds a threshold (programmable up to 16k cycles)
  - Trace is not output until triggered
  - When triggered by a deadlocked transaction, trace will output the most recent transactions up to and including the deadlocked transaction
  - Trace identifies transaction ID and address, identifying both master and slave of the deadlocked transaction
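The trigger-on-duration scheme can be modelled in software as a rough sketch. This is not the bus monitor's real interface: the class `BusMonitorModel`, its methods and the buffer depth are all hypothetical, chosen only to show the threshold check and the circular trace buffer.

```python
from collections import deque


class BusMonitorModel:
    """Toy model: trace stays in a ring buffer until a transaction overruns."""

    def __init__(self, threshold_cycles, depth=8):
        self.threshold_cycles = threshold_cycles
        self.history = deque(maxlen=depth)  # circular trace buffer
        self.outstanding = {}               # txn_id -> (addr, start_cycle)

    def start(self, txn_id, addr, cycle):
        self.outstanding[txn_id] = (addr, cycle)
        self.history.append((txn_id, addr, cycle))

    def complete(self, txn_id):
        self.outstanding.pop(txn_id, None)

    def tick(self, cycle):
        """Return None, or the buffered trace once a transaction exceeds the threshold."""
        for txn_id, (addr, start_cycle) in self.outstanding.items():
            if cycle - start_cycle > self.threshold_cycles:
                return {"deadlocked": (txn_id, addr),
                        "trace": list(self.history)}
        return None


mon = BusMonitorModel(threshold_cycles=100, depth=4)
mon.start(1, 0x1000, cycle=0)
mon.complete(1)
mon.start(2, 0x2000, cycle=10)   # read data is never accepted
quiet = mon.tick(50)             # still under the threshold
report = mon.tick(200)           # threshold exceeded, trace released
```

Nothing leaves the model until the trigger fires, which mirrors the point that trace is not output until triggered.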
Example 4: System hang or freeze
- The monitors continue to function when the system freezes
  - They can operate by updating an internal circular buffer
- When a system freeze is detected, the trace buffers from all the monitors can be extracted
- The detection of a freeze can be done by the monitors themselves
  - For example, no transaction in a window
  - Trace is not output until triggered
  - When triggered by a system freeze, trace will output the most recent transactions up to and including the deadlocked transaction
  - Trace identifies transaction ID and address, identifying both master and slave of the deadlocked transaction
  - Similar for monitor
- Can be considered as a system-wide core dump
  - Use to create known state before hang
  - Send out core dumps periodically
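A minimal sketch of the "no transaction in a window" check and the system-wide dump follows. The names `FreezeWatch` and `system_dump` are hypothetical, and the rule that one silent monitor triggers a dump of all buffers is an assumption made for illustration.

```python
from collections import deque


class FreezeWatch:
    """Toy monitor: keeps recent events and notices when traffic stops."""

    def __init__(self, window_cycles, depth=4):
        self.window_cycles = window_cycles
        self.last_seen = 0
        self.buffer = deque(maxlen=depth)  # internal circular buffer

    def observe(self, cycle, event):
        self.last_seen = cycle
        self.buffer.append((cycle, event))

    def frozen(self, cycle):
        # No transaction seen within the configured window.
        return cycle - self.last_seen > self.window_cycles


def system_dump(monitors, cycle):
    """Return every monitor's buffer if any monitor reports a freeze (assumed policy)."""
    if any(m.frozen(cycle) for m in monitors.values()):
        return {name: list(m.buffer) for name, m in monitors.items()}
    return None


monitors = {"bus": FreezeWatch(100), "dma": FreezeWatch(100)}
monitors["bus"].observe(10, "read 0x10")
monitors["dma"].observe(20, "burst")
early = system_dump(monitors, cycle=50)    # traffic is recent, no dump
dump = system_dump(monitors, cycle=500)    # silence exceeds the window
```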
Metrics generation: Example 1
Runtime configuration
- Monitor configured to count Stall triggers from the processor
- Set period of interval timer
- Counter values snapshot on expiry of interval timer
Data flow
1. Stall trigger observed on SM inputs
2. Counter data periodically output from SM
3. Data traced out via USB
[Plot "Stall Triggers Observed": monitor counter values (0 to 10) against sample time (ns)]
[Block diagram of the example SoC with data-flow steps 1 to 3 marked]
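The counter-and-interval-timer configuration can be sketched as below. It assumes the counter is cleared after each snapshot, which the slide does not state, and `snapshot_counts` is a hypothetical name.

```python
def snapshot_counts(trigger_times_ns, interval_ns, end_ns):
    """Count triggers per timer interval, one snapshot per expiry.

    Assumption: the counter restarts from zero after every snapshot.
    """
    num_intervals = -(-end_ns // interval_ns)  # ceiling division
    counts = [0] * num_intervals
    for t in trigger_times_ns:
        index = t // interval_ns
        if 0 <= index < num_intervals:
            counts[index] += 1
    return counts


# Hypothetical stall trigger timestamps, sampled with a 10 ns interval timer.
stall_triggers = [5, 12, 18, 40, 41, 42]
per_interval = snapshot_counts(stall_triggers, interval_ns=10, end_ns=50)
```

Only the small list of counts needs to leave the chip, rather than every individual trigger, which keeps the trace bandwidth low.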
Cross-triggering: Example 1
Example ARM + RISC-V system
Data flow
1. Bus Monitor A outputs UltraSoC event when memory access detected
2. Monitor receives Stall trigger
3. Event output from SM after transitioning from DMA START -> STALL
4. Trace Receiver(s) and RISC-V encoder enabled after receiving event
5. Processor trace output via USC-P
[Block diagram: non-CPU masters, System SRAM and NoC or bus fabric with Bus Monitors A, B and C; DMA-AXI, PAM-APB and APB Monitor; ARM core with CTI, ETM and Trace Receiver; RISC-V core with JPAM and Trace Encoder; Message Engine and Comm to an external debugger across the SoC boundary]
[State machine: IDLE, DMA START, STALL, with transitions labelled Memory access, Stall Trigger and Interval expired]
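The state machine in this data flow can be sketched as a small transition table. The IDLE to DMA START to STALL path follows the slide; the return to IDLE on "interval expired" is an assumption, since the extracted diagram does not show where that arc leads. All identifiers are hypothetical.

```python
TRANSITIONS = {
    ("IDLE", "memory_access"): "DMA_START",
    ("DMA_START", "stall_trigger"): "STALL",
    # Assumption: an expired interval re-arms the state machine.
    ("DMA_START", "interval_expired"): "IDLE",
    ("STALL", "interval_expired"): "IDLE",
}


def run(events):
    """Return (final_state, indices of events that fired the cross-trigger)."""
    state = "IDLE"
    fired = []
    for index, event in enumerate(events):
        next_state = TRANSITIONS.get((state, event), state)
        if state == "DMA_START" and next_state == "STALL":
            fired.append(index)  # here the trace encoders would be enabled
        state = next_state
    return state, fired


# A stall with no preceding memory access is ignored; the sequence fires once.
final_state, fired_at = run(["stall_trigger", "memory_access", "stall_trigger"])
```

The point of sequencing the two events is that processor trace is only enabled for stalls that follow the memory access of interest.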
Example of instrumented SoC
- The SI provides independent memory-mapped channels (mailboxes)
- Software and hardware can post writes to these channels, which can be used to understand system-wide behaviour
- The data is timestamped
  - Or no data, only a timestamp
- The channels can be filtered
- Each channel can be enabled to provide events which can be used for cross-triggering
- The Virtual Console provides bi-directional channels
[Block diagram of the example SoC with Static Instrumentation, Efuse and Key Store blocks added]
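A rough software model of the channel behaviour described above is shown here. It is a sketch only: `StaticInstrumentationModel` and its methods are hypothetical and do not reflect the SI register map, and filtering is reduced to a per-channel enable.

```python
class StaticInstrumentationModel:
    """Toy model of timestamped mailbox channels with per-channel filtering."""

    def __init__(self, num_channels, enabled=None):
        self.num_channels = num_channels
        # By default every channel is traced; pass a set to filter.
        self.enabled = set(range(num_channels)) if enabled is None else set(enabled)
        self.trace = []

    def post(self, channel, timestamp, data=None):
        """Record a write; data=None models a timestamp-only message."""
        if not 0 <= channel < self.num_channels:
            raise ValueError(f"no such channel: {channel}")
        if channel not in self.enabled:
            return False  # filtered out, nothing is traced
        self.trace.append((timestamp, channel, data))
        return True


si = StaticInstrumentationModel(num_channels=4, enabled={0, 2})
si.post(0, timestamp=100, data=0xCAFE)  # software marker with payload
si.post(1, timestamp=110, data=1)       # channel 1 is filtered
si.post(2, timestamp=120)               # timestamp-only marker
```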
Simple SI visualization
Actionable insights across the whole SoC
[Pyramid diagram with levels Data, Information, Knowledge and Value]
- UltraSoC enables full visibility of SoC
- From rich data across the whole SoC
- With system-wide understanding
- UltraSoC delivers actionable insights
Non-intrusive latency-bandwidth correlation
- Shows how bandwidth and latency are cross-correlated
- Interested in masters: this is where latency is consumed, affecting master operation
- Interested mainly in reads: the master has to wait for read results; writes are less critical
- Presented as a heat map
- For example, in the diagram shown, all CPU latencies are affected by DMA bandwidths
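One way to build such a heat map is to compute a correlation coefficient for every (latency series, bandwidth series) pair. The sketch below uses the Pearson coefficient, which is an assumption; the slides do not say which measure is used. The function names and sample data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


def correlation_grid(latencies, bandwidths):
    """Return {(latency_master, bandwidth_master): coefficient} for a heat map."""
    return {(lat_name, bw_name): pearson(lat, bw)
            for lat_name, lat in latencies.items()
            for bw_name, bw in bandwidths.items()}


# Hypothetical per-window samples: CPU read latency and master bandwidths.
latencies = {"cpu0": [10, 20, 30, 40]}
bandwidths = {"dma": [1, 2, 3, 4], "gpu": [4, 3, 2, 1]}
grid = correlation_grid(latencies, bandwidths)
```

A cell close to 1 suggests that the master's read latency rises with the other master's bandwidth, which is the pattern the slide describes between CPU latencies and DMA bandwidths.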
Non-intrusive anomaly detection
- The three plots show cache-like traffic for 3 CPUs configured with different miss rates
- Excessive (anomalous) latencies are shown in red
Non-intrusive profiling with anomaly detection
- Traditional profilers are inadequate:
  - Sampling misses subtle or fast events (Nyquist)
  - Performance impact/intrusive
  - Heisenbugs
- UltraSoC is non-intrusive
- UltraSoC is wire-speed (100% coverage)
- Analytics and automated anomaly detection make the engineer more efficient
18 July 2018 Gajinder Panesar UL-002074-PT
Non-intrusive stuck-pixel detection
[Figure panels: Incoming image; Detected stuck pixels]
- Fastest time to detection
Summary
- The challenge today is systemic complexity
  - Processor-processor interactions
  - HW/SW interactions, NoC and deadlock
  - Long-tail bugs dominate performance but are hard to detect
- UltraSoC provides a completely scalable, coherent analytics, monitoring and debug system
- UltraSoC is system-wide, non-intrusive and wire-speed
- Analytics and ML help the engineer identify subtle problems efficiently
Demo system architecture
- Zynq ZC706 FPGA platform
  - ARM plus RV32 RISC-V plus custom logic
- Demo shows:
  - Bus state
  - Traffic
  - Performance histogram
  - Memory
  - Processor control
  - Bus deadlock detection
  - RISC-V processor trace
[Block diagram: Zynq SoC with ARM A9 (Linux), ARM A9 (bare), DRAM controllers, SODIMM, SD card, LEDs and switches, GPIO, DMA (dma1), SRAM, LCD controller and RISC-V core with debug JTAG]
[UltraSoC components in the diagram: Custom Mon (sm1), AXI Mon (xbm1, xbm2), Virtual Console (vc1), Static Instr (si1), AXI Proc. Analytic Module (pam1), JTAG Proc. Analytic Module (jtm1), Trace Enc (rte1), AXI CTI, System Memory Buffer, message infrastructure, AXI Comm., JTAG Comm. (5-pin 1149.1) and USB 2.0 Debug Hub Communicator (ULPI to off-chip PHY)]
UltraSoC IDE
[Screenshot with the following annotations]
- Decoded trace showing source code and assembly
- Bus activity
- Control configuration
- Trace packets