Tracing embedded heterogeneous systems
Progress report meeting, December 2015
Thomas Bertauld
Directed by Michel Dagenais
December 10th 2015

Presentation plan
1. Introduction
2. The Parallella board
3. BareCTF
4. The synchronization process
5. Results
6. Future work

Introduction - Why trace heterogeneous embedded systems?
- Systems designed for specific needs/tasks
- Often used for real-time applications like signal processing
- Can be used anywhere
- Power-efficient

Introduction - Goals
- Heterogeneous systems imply different kinds of processors
- Not all of them are directly traceable
- DSPs and generic calculation processors are not meant to run any OS (bare-metal)
- How do the different processing units interact?
- Use case 1: a classical CPU sends work to a DSP and retrieves the results
- Use case 2: a classical processor monitors the activity of a bare-metal processor

The Parallella board - Specifications
- Cortex A9 running Linux
- 1 GB DDR3
- Epiphany chip: 16-processor mesh (ecores)
  - 32-bit RISC CPUs
  - 512 KB distributed on-chip shared memory (32 KB per CPU)
  - Superscalar architecture (internal pipeline)
  - 2 32-bit event counters
- 32 MB external shared memory
http://parallella.org/
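
A 32-bit event counter wraps around after 2^32 counts, so a tracer that timestamps events with it usually has to extend the value in software. The block below is a minimal sketch of that idea, not code from the project: the class name `CounterExtender` is hypothetical, and it assumes an up-counting counter sampled at least once per wrap period (for a down-counting counter the subtraction is reversed).

```python
class CounterExtender:
    """Extend a wrapping 32-bit up-counter into a monotonically growing total.

    Hypothetical helper: assumes the counter counts up and is sampled
    at least once per wrap period.
    """

    MASK = 0xFFFFFFFF  # 32-bit counter width

    def __init__(self, first_sample):
        self.last = first_sample & self.MASK
        self.total = 0  # cycles elapsed since the first sample

    def update(self, sample):
        """Feed a new raw counter sample and return the extended total."""
        sample &= self.MASK
        # The modular difference stays correct across a single wrap-around.
        self.total += (sample - self.last) & self.MASK
        self.last = sample
        return self.total


if __name__ == "__main__":
    ext = CounterExtender(0xFFFFFFF0)
    print(ext.update(0x00000010))  # counter wrapped between the two samples
    print(ext.update(0x00000030))
```

If samples are taken less often than once per wrap period, whole wraps are silently lost, which is one reason to read the counter periodically.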

The Parallella board - Benefits and drawbacks
- Cheap (~$100 per board)
- Open-source design
- Can be used inside a cluster
- Power-efficient
- 16 multi-purpose CPUs but no optimization for signal processing
- Only 32 KB of local memory per core
- One-way interrupt mechanism
- Only partial support of nested interrupts
- libc not fully integrated (no malloc)
- Lack of community support and documentation
- Hard to debug

The Parallella board - Benchmarks
- ecore -> EDRAM 1-byte read: ~540 cycles
- ecore -> EDRAM 1-byte write: 17 cycles
- Interrupt handling: ~300 000 cycles
- Internal synchronization between 16 ecores: ~200 cycles
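
To compare these cycle counts with wall-clock figures, they can be converted to time once the ecore clock rate is known. The sketch below shows the arithmetic only. The slides do not state the clock rate, so the 600 MHz value is an assumption used for illustration, and the function names are hypothetical.

```python
# Assumed ecore clock rate, for illustration only (not stated in the slides).
ASSUMED_CLOCK_HZ = 600_000_000

# Cycle counts taken from the benchmark slide.
BENCHMARK_CYCLES = {
    "EDRAM 1-byte read": 540,
    "EDRAM 1-byte write": 17,
    "interrupt handling": 300_000,
    "16-ecore internal synchronization": 200,
}


def cycles_to_microseconds(cycles, clock_hz):
    """Convert a cycle count to microseconds for a core clocked at clock_hz."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return cycles * 1_000_000 / clock_hz


def benchmark_table(clock_hz=ASSUMED_CLOCK_HZ):
    """Return the benchmark figures converted to microseconds."""
    return {
        name: cycles_to_microseconds(cycles, clock_hz)
        for name, cycles in BENCHMARK_CYCLES.items()
    }


if __name__ == "__main__":
    for name, microseconds in benchmark_table().items():
        print(f"{name}: {microseconds:.3f} us")
```

Passing a different `clock_hz` to `benchmark_table` rescales every figure, so the table can be regenerated for the rate measured on a given board.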

BareCTF - Tracing bare-metal systems
- Python tool created by Philippe Proulx (EfficiOS)
- Targets bare-metal systems
- Generates CTF traces
- Easy to use (configured through YAML files)
- Lightweight
https://github.com/efficios/barectf

The synchronization process - Facts and goal
- We can use LTTng to trace the Cortex A9 (Linux) side of the Parallella
- We can use BareCTF to trace the ecore (bare-metal) side of the Parallella
- A hardware barrier can be used for internal synchronization
- All ecores share the same clock rate
- How do we synchronize and correlate the two traces?
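
One common way to correlate two traces, once synchronization points exist, is to record a pair (ecore cycle count, host timestamp) at each synchronization and interpolate between neighbouring pairs. The sketch below illustrates that generic technique. It is not presented as the method used in this work, and the function name and data layout are hypothetical.

```python
from bisect import bisect_right


def ecore_to_host_ns(cycles, sync_points):
    """Map an ecore cycle count to host time by linear interpolation.

    sync_points is a list of (ecore_cycles, host_ns) pairs with strictly
    increasing ecore_cycles (hypothetical layout, at least two pairs).
    Values outside the covered range are extrapolated from the nearest segment.
    """
    if len(sync_points) < 2:
        raise ValueError("need at least two synchronization points")
    cycle_keys = [c for c, _ in sync_points]
    # Index of the segment whose start is the last sync point <= cycles,
    # clamped so that a valid (start, end) pair always exists.
    i = bisect_right(cycle_keys, cycles) - 1
    i = max(0, min(i, len(sync_points) - 2))
    (c0, t0), (c1, t1) = sync_points[i], sync_points[i + 1]
    return t0 + (cycles - c0) * (t1 - t0) / (c1 - c0)


if __name__ == "__main__":
    points = [(0, 1000), (600, 2000), (1200, 3000)]
    print(ecore_to_host_ns(300, points))
    print(ecore_to_host_ns(900, points))
```

Because all ecores share the same clock rate, a single set of synchronization points could in principle serve every ecore. Interpolating per segment, rather than fitting one global line, limits the effect of drift between the host clock and the ecore clock.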

The synchronization process - Description
[Diagram; label: Shared Memory]

The synchronization process - Description
[Diagram; label: Shared Memory]
0. Periodically check the monitored core's cycle counter

The synchronization process - Description
[Diagram; label: Shared Memory]
1.1 Send sync signal
1.2 Poll interrupt_ack flag

The synchronization process - Description
[Diagram; label: Shared Memory]
1.2 Poll interrupt_ack flag
2. Set interrupt_ack flag

The synchronization process - Description
[Diagram; label: Shared Memory]
3.1 Enter sync routine
3.2 Poll sync_ack flag
3.3 Internal synchronization

The synchronization process - Description
[Diagram; label: Shared Memory]
3.1 Enter sync routine
3.2 Poll sync_ack flag
4.1 Set sync_ack flag
4.2 Idle

The synchronization process - Description
[Diagram; label: Shared Memory]
4.2 Idle
5. Send end signal
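
The numbered steps can be replayed as a small two-thread simulation. This is a sketch under stated assumptions, not the project's code. The diagrams are lost, so the split of steps between the host (1.1, 1.2, 3.2, 5) and the ecore (2, 3.1, 3.3, 4.1, 4.2) is an assumed reading. `threading.Event` objects stand in for the interrupt lines and the shared-memory flags, and a blocking wait replaces the real polling loop.

```python
import threading


def run_handshake(timeout=5.0):
    """Simulate one synchronization round and return the ordered step log."""
    log, log_lock = [], threading.Lock()
    # Stand-ins for the interrupt lines and the shared-memory flags.
    sync_signal = threading.Event()
    interrupt_ack = threading.Event()
    sync_ack = threading.Event()
    end_signal = threading.Event()

    def record(step):
        with log_lock:
            log.append(step)

    def host():  # assumed role: the Linux side
        record("1.1 send sync signal")
        sync_signal.set()
        if not interrupt_ack.wait(timeout):  # 1.2 poll interrupt_ack flag
            return
        record("1.2 interrupt_ack seen")
        if not sync_ack.wait(timeout):  # 3.2 poll sync_ack flag
            return
        record("3.2 sync_ack seen")
        record("5 send end signal")
        end_signal.set()

    def ecore():  # assumed role: the bare-metal side
        if not sync_signal.wait(timeout):
            return
        record("2 set interrupt_ack")
        interrupt_ack.set()
        record("3.1 enter sync routine")
        record("3.3 internal synchronization")  # placeholder for the barrier
        record("4.1 set sync_ack")
        sync_ack.set()
        if end_signal.wait(timeout):  # 4.2 idle until the end signal
            record("ecore resumes")

    threads = [threading.Thread(target=host), threading.Thread(target=ecore)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


if __name__ == "__main__":
    for step in run_handshake():
        print(step)
```

Steps that run concurrently on the two sides (for example 1.2 on the host and 3.1 on the ecore) may appear in either order in the log. Only the ordering enforced by the flags is guaranteed.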

Results - General
- Synchronization is working
- Takes 0.1 ms on average to complete on the host
- Partial integration with barectf
- It is possible to allow nested interrupts of the same kind
  - Requires overwriting a system register
  - Not very safe
- Another method using only one interrupt has been tested
  - Average cycles taken by the first method: ~300 000 (without the first interrupt)
  - Average cycles taken by the second method: a few thousand
  - Drawback: polls shared memory instead of idling, and synchronization is less precise

Results - Limitations
- BareCTF + synchronization code on a core
- One core is periodically checked
- Need to use shared memory
- Huge overhead due to interrupts
- Internal synchronization can't be shared by different workgroups

Future work
- Try the DMA engine of the Parallella
- Create a TraceCompass view
- Finalize barectf integration
- See how this method can be integrated with other devices
- Estimate the optimal synchronization rate

Future work - The TI board
- 4 Cortex A15
- 8 C66x DSPs
- 6 MB shared memory
- Cache system

Thank you for your attention!
Contact: thomas.bertauld@gmail.com