Tracing embedded heterogeneous systems P R O G R E S S R E P O R T M E E T I N G, M A Y

Similar documents
Tracing embedded heterogeneous systems

Introduction to AM5K2Ex/66AK2Ex Processors

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

KeyStone C665x Multicore SoC

KeyStone II. CorePac Overview

Providing Near-Optimal Fair- Queueing Guarantees at Round-Robin Amortized Cost

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Software Design Challenges for heterogenic SOC's

Keystone Architecture Inter-core Data Exchange

«UNDERSTANDING EMBEDDED LINUX BENCHMARKING USING KERNEL TRACE ANALYSIS» ALEXIS MARTIN INRIA / LIG / UNIV. GRENOBLE, FRANCE

Optimizing the performance and portability of multicore DSP platforms with a scalable programming model supporting the Multicore Association s MCAPI

Heterogeneous Multi-Processor Coherent Interconnect

Sri Vidya College of Engineering and Technology. EC6703 Embedded and Real Time Systems Unit IV Page 1.

Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing

Leverage Vybrid's asymmetrical multicore architecture for real-time applications by Stefan Agner

Doing more with multicore! Utilizing the power-efficient, high-performance KeyStone multicore DSPs. November 2012

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Parallel Simulation Accelerates Embedded Software Development, Debug and Test

KeyStone C66x Multicore SoC Overview. Dec, 2011

Designing with ALTERA SoC Hardware

Tutorial: PREESM - Dataflow Programming of Multicore DSPs

Implementation of Deep Convolutional Neural Net on a Digital Signal Processor

ADVANCED trouble-shooting of real-time systems. Bernd Hufmann, Ericsson

A new Computer Vision Processor Chip Design for automotive ADAS CNN applications in 22nm FDSOI SOI Symposium Santa Clara, Apr.

KeyStone Training. Multicore Navigator Overview

AT-501 Cortex-A5 System On Module Product Brief

MYC-C437X CPU Module

ARM+DSP - a winning combination on Qseven

With Fixed Point or Floating Point Processors!!

Simplify System Complexity

Under The Hood: Performance Tuning With Tizen. Ravi Sankar Guntur

Introducing the AM57x Sitara Processors from Texas Instruments

Speeding AM335x Programmable Realtime Unit (PRU) Application Development Through Improved Debug Tools

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

KeyStone Training. Turbo Encoder Coprocessor (TCP3E)

C66x KeyStone Training HyperLink

Linux Storage System Bottleneck Exploration

MTAPI: Parallel Programming for Embedded Multicore Systems

Maximizing heterogeneous system performance with ARM interconnect and CCIX

C66x KeyStone Training HyperLink

Designing with ALTERA SoC

EyeCheck Smart Cameras

ODP Relationship to NFV. Bill Fischofer, LNG 31 October 2013

Near Memory Key/Value Lookup Acceleration MemSys 2017

Chapter 6 Storage and Other I/O Topics

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

SDSoC: Session 1

Time Synchronization for AV applications across Wired and Wireless 802 LANs [for residential applications]

High Performance Compute Platform Based on multi-core DSP for Seismic Modeling and Imaging

Partitioning of computationally intensive tasks between FPGA and CPUs

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

extended external Benchmarking extension (XXBX)

SoC Platforms and CPU Cores

Real-Timeness and System Integrity on a Asymmetric Multi Processing configuration

MYD-C437X-PRU Development Board

Scalable embedded Realtime

Porting BLIS to new architectures Early experiences

ARM Powered SoCs OpenEmbedded: a framework for toolcha. generation and rootfs management

Design Choices for FPGA-based SoCs When Adding a SATA Storage }

Tile Processor (TILEPro64)

The Next Steps in the Evolution of Embedded Processors

Embedded Linux Conference 2010

Development of Real-Time Systems with Embedded Linux. Brandon Shibley Senior Solutions Architect Toradex Inc.

PRU Hardware Overview. Building Blocks for PRU Development: Module 1

Introduction to Sitara AM437x Processors

Asymmetric MultiProcessing for embedded vision

System Wide Tracing User Need

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Video Interface Module for TI EVM TMDXEVM8148 and TMDXEVM368

Scaling the Peak: Maximizing floating point performance on the Epiphany NoC

29 th NATIONAL RADIO SCIENCE CONFERENCE (NRSC 2012) April 10 12, 2012, Faculty of Engineering/Cairo University, Egypt

TMS320C6678 Memory Access Performance

Performance of Host Identity Protocol on Lightweight Hardware

Simplify System Complexity

Low Level Tracing for Latency Analysis

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye

QuartzV: Bringing Quality of Time to Virtual Machines

Introduction to gem5. Nizamudheen Ahmed Texas Instruments

Exploring OpenCL Memory Throughput on the Zynq

Getting the Most out of Advanced ARM IP. ARM Technology Symposia November 2013

64-bit ARM Unikernels on ukvm

CanSCA4.1ReplaceSTRSinSpace Applications?

Next Generation Enterprise Solutions from ARM

Porting VME-Based Optical-Link Remote I/O Module to a PLC Platform - an Approach to Maximize Cross-Platform Portability Using SoC

Recovering Disk Storage Metrics from low level Trace events

Buses. Disks PCI RDRAM RDRAM LAN. Some slides adapted from lecture by David Culler. Pentium 4 Processor. Memory Controller Hub.

Heterogeneous Software Architecture with OpenAMP

OpenMP Accelerator Model for TI s Keystone DSP+ARM Devices. SC13, Denver, CO Eric Stotzer Ajay Jayaraj

Xytech MediaPulse Equipment Guidelines (Version 8 and Sky)

EMBEDDED SYSTEMS WITH ROBOTICS AND SENSORS USING ERLANG

Software Driven Verification at SoC Level. Perspec System Verifier Overview

NanoMind Z7000. Datasheet On-board CPU and FPGA for space applications

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems

VICP Signal Processing Library. Further extending the performance and ease of use for VICP enabled devices

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

«Real Time Embedded systems» Multi Masters Systems

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

Altera SDK for OpenCL

Transcription:

Tracing embedded heterogeneous systems P R O G R E S S R E P O R T M E E T I N G, M A Y 2 0 1 6 T H O M A S B E R T A U L D D I R E C T E D B Y M I C H E L D A G E N A I S May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 1

Presentation plan 1. Introduction 2. The Keystone 2 architecture 3. BareCTF 4. Tracing embedded heterogeneous systems 5. The synchronization process 6. Use-case 7. Conclusion May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 2

Introduction - Why tracing heterogeneous embedded systems? Have you ever wondered how images get processed inside these cameras? May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 3

Introduction - Why tracing heterogeneous embedded systems? Systems designed for specific needs/tasks Often used for real-time applications like signal processing Can be used anywhere Power-efficient Used inside much more complex systems May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 4

Introduction - Challenges Different kind of processors Some may be «unconventional» ones Some may be «bare-metal» ones Complex and specialized hardwares Limited resources No internal storage Little RAM Lack of traditionnal tools May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 5

The Keystone 2 - Specifications 66AK2H TI SoC 4 ARM Cortex A15 running Linux (1.4 GHz) 8 C66x TI CorePacs DSPs (1.2 GHz) 2 GB DDR3 6 MB Multicore Shared Memory http://www.ti.com/product/66ak2h12 May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 6

The Keystone 2 - Benefits and drawbacks Broadly used TI DSPs Powerful SoC 8 processors with built-in signal processing abilities TI s SYS/BIOS modules Full C support on the DSPs No way of tracing the DSPs Complex to use May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 7

BareCTF - Tracing bare-metal systems Python tool created by Philippe Proulx (EfficiOS) Targets bare-metal systems Generates CTF traces Easy-to-use (configuration by YAML files) Lightweight https://github.com/efficios/barectf May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 8

Tracing embedded heterogeneous systems - Facts and goal LTTng can be used to trace the ARM side of any board BareCTF can be used to trace every other type of cores For what end? Trace the whole application s chain Detect anomalies, bottlenecks, latencies Have a global view of a process distributed between different type of cores May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 9

Tracing embedded heterogeneous systems - Challenges BareCTF must be ported to any new platform The traces obtained from different processors must be synchronized Necessity to generate matching events in each trace Interrupt-based mechanism May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 10

The synchronization process - Description ARM Generic core SYNC Generic cores ARM May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 11

The synchronization process - Description ARM Generic core SYNC ACK Generic cores ARM May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 12

Use-case - Description Instrumentation of an image processing algorithm Edge detection Sobel s filter May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 13

Use-case - Setup 3000*3000 bmp image 1 ARM process acting as master Gives commands Sends input image and receives result 8 DSPs running acting as slaves Wait for commands Use TI s ImgLib for image processing In charge of memory management May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 14

Use-case - Results May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 15

Use-case - Results Low impact ~95ms to ~96ms of processing time Effective Can show if the work isn t well balanced Allows to keep track of the overall process Uses TraceCompass internal traces synchronization mechanism May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 16

Conclusion - Limitations The barectf platform can be improved Heavy API High latency Wasted memory The synchronization doesn t take drift in account May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 17

Conclusion - Future work Switch to more efficient message-passing methods Determine an optimal synchronization rate Improve the overall overhead of the barectf platform Tests on more complex systems May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 18

Thank you for your attention! Contact : thomas.bertauld@gmail.com May 5th 2016 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 19