Tracing embedded heterogeneous systems

Similar documents
Tracing embedded heterogeneous systems P R O G R E S S R E P O R T M E E T I N G, M A Y

Computer System Overview

Common Computer-System and OS Structures

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2.

History. PowerPC based micro-architectures. PowerPC ISA. Introduction

I/O Systems. Jo, Heeseung

ECE 598 Advanced Operating Systems Lecture 4

Chapter 1 Computer System Overview

Module 2: Computer-System Structures. Computer-System Architecture

KeyStone II. CorePac Overview

Keystone Architecture Inter-core Data Exchange

Computer-System Structures

Hardware and Software Architecture. Chapter 2

Architectural Support for Operating Systems

CS 333 Introduction to Operating Systems. Class 11 Virtual Memory (1) Jonathan Walpole Computer Science Portland State University

ADVANCED trouble-shooting of real-time systems. Bernd Hufmann, Ericsson

Architecture and OS. To do. q Architecture impact on OS q OS impact on architecture q Next time: OS components and structure

Computer-System Architecture

Computer-System Architecture. Common Functions of Interrupts. Computer-System Operation. Interrupt Handling. Chapter 2: Computer-System Structures

Microprocessors and Microcontrollers. Assignment 1:

Projects on the Intel Single-chip Cloud Computer (SCC)

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

XPU A Programmable FPGA Accelerator for Diverse Workloads

DSP/BIOS Kernel Scalable, Real-Time Kernel TM. for TMS320 DSPs. Product Bulletin

ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7

Instruction Cycle. Computer-System Architecture. Computer-System Operation. Common Functions of Interrupts. Chapter 2: Computer-System Structures

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Optimising for the p690 memory system

CS 341l Fall 2008 Test #4 NAME: Key

Providing Near-Optimal Fair- Queueing Guarantees at Round-Robin Amortized Cost

Design Choices for FPGA-based SoCs When Adding a SATA Storage }

ECE 571 Advanced Microprocessor-Based Design Lecture 24

I/O Systems. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Introduction to gem5. Nizamudheen Ahmed Texas Instruments

Under The Hood: Performance Tuning With Tizen. Ravi Sankar Guntur

Architectural Support for Operating Systems

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

Tomasz Włostowski Beams Department Controls Group Hardware and Timing Section. Developing hard real-time systems using FPGAs and soft CPU cores

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Tile Processor (TILEPro64)

INTRODUCTION TO FLEXIO

4. Hardware Platform: Real-Time Requirements

Parallelism in Hardware

Low-power Architecture. By: Jonathan Herbst Scott Duntley

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster


EE 354 Fall 2015 Lecture 1 Architecture and Introduction

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Interrupt response times on Arduino and Raspberry Pi. Tomaž Šolc

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System?

Designing with ALTERA SoC Hardware

Uniprocessor Computer Architecture Example: Cray T3E

EE4144: ARM Cortex-M Processor

CS330: Operating System and Lab. (Spring 2006) I/O Systems

Chapter 1 Computer System Overview

Chapter 5. Introduction ARM Cortex series

ECE331: Hardware Organization and Design

Initial Evaluation of a User-Level Device Driver Framework

Computer Architecture

Virtual Memory. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. April 12, 2018 L16-1

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

_ V Renesas R8C In-Circuit Emulation. Contents. Technical Notes

Processors, Performance, and Profiling

Control unit. Input/output devices provide a means for us to make use of a computer system. Computer System. Computer.

ECE 3055: Final Exam

ARM Trusted Firmware: Changes for Axxia

OS And Hardware. Computer Hardware Review PROCESSORS. CPU Registers. CPU Registers 02/04/2013

Evaluation and Improvements of Programming Models for the Intel SCC Many-core Processor

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

KeyStone C665x Multicore SoC

AT-501 Cortex-A5 System On Module Product Brief

Hardware OS & OS- Application interface

Profiling Grid Data Transfer Protocols and Servers. George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA

Tracing and profiling dataflow applications

CS2253 COMPUTER ORGANIZATION AND ARCHITECTURE 1 KINGS COLLEGE OF ENGINEERING DEPARTMENT OF INFORMATION TECHNOLOGY

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Linux Storage System Bottleneck Exploration

REVOLUTIONIZING THE COMPUTING LANDSCAPE AND BEYOND.

Four Components of a Computer System

Anne Bracy CS 3410 Computer Science Cornell University

ECE232: Hardware Organization and Design

Assembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam

Systems Architecture II

Interconnecting Components

Operating Systems: Internals and Design Principles. Chapter 1 Computer System Overview Seventh Edition By William Stallings

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Multithreading: Exploiting Thread-Level Parallelism within a Processor

The T0 Vector Microprocessor. Talk Outline

CS510 Operating System Foundations. Jonathan Walpole

I/O Devices. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau)

Copyright 2016 Xilinx

1. state the priority of interrupts of Draw and explain MSW format of List salient features of

HETEROGENOUS COMPUTE IN A QUAD CORE CPU

Dongjun Shin Samsung Electronics

System Wide Tracing User Need

Introduction to Operating. Chapter Chapter

Simplifying the Development and Debug of 8572-Based SMP Embedded Systems. Wind River Workbench Development Tools

Transcription:

Tracing embedded heterogeneous systems P R O G R E S S R E P O R T M E E T I N G, D E C E M B E R 2015 T H O M A S B E R T A U L D D I R E C T E D B Y M I C H E L D A G E N A I S December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 1

Presentation plan 1. Introduction 2. The Parallella board 3. BareCTF 4. The synchronization process 5. Results 6. Future work December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 2

Introduction - Why tracing heterogeneous embedded systems? Systems designed for specific needs/tasks Often used for real-time applications like signal processing Can be used anywhere Power-efficient December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 3

Introduction - Goals Heterogeneous systems imply different kinds of processors Not all of them are directly tracable DSPs and generic calculation processors are not meant to run any OS (bare-metal) How do the different processing units interact? Use-case 1 : a classical CPU sends work to a DSP and retrieves the results Use-case 2 : some classical processor monitors a bare-metal processor activity December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 4

The Parallella board - Specifications Cortex A9 running Linux 1 GB DDR3 chip : 16 processors mesh (ecores) 32-bits RISC CPU 512KB distributed on-chip shared memory (32KB per CPU) Superscalar architecture (internal pipeline) 2 32-bits events counters 32MB external shared memory http://parallella.org/ December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 5

The Parallella board - Benefits and drawbacks Cheap (~100$ per board) Open-source design Can be used inside a cluster Power-efficient 16 multi-purposes CPU but no optimization for signal processing Only 32KB of local memory per core One way interrupt mechanism Only partial support of nested interrupts LibC not fully integrated (no malloc ) Lack of community support and documentation Hard to debug December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 6

The Parallella board - Benchmarks ecore -> EDRAM 1 byte read: ~540 cycles ecore -> EDRAM 1 byte write: 17 cycles Interrupt handling: ~300 000 cycles Internal synchronization between 16 ecores: ~200 cycles December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 7

BareCTF - Tracing bare-metal systems Python tool created by Philippe Proulx (EfficiOS) Targets bare-metal systems Generates CTF traces Easy-to-use (configuration by YAML files) Lightweight https://github.com/efficios/barectf December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 8

The synchronization process - Facts and goal We can use LTTng to trace the side of the Parallella We can use BareCTF to trace the side of the Parallella A hardware barrier can be used for internal synchronization All ecores share the same clock rate How do we synchronize and correlate and traces? December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 9

The synchronization process - Description Shared Memory December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 10

The synchronization process - Description Shared Memory 0. Periodically check the monitored core s cycle counter December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 11

The synchronization process - Description 1.2 Poll interrupt_ack flag Shared Memory 1.1 Send sync signal December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 12

The synchronization process - Description 1.2 Poll interrupt_ack flag Shared Memory 2. Set interrupt_ack flag December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 13

The synchronization process - Description 3.2 Poll sync_ack flag Shared Memory 3.1 Enter sync routine 3.3 Internal synchronization December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 14

The synchronization process - Description 3.2 Poll sync_ack flag Shared Memory 4.1 Set sync_ack flag 3.1 Enter sync routine 4.2 Idle December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 15

The synchronization process - Description Shared Memory 5. Send end signal 4.2 Idle December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 16

Results - General Synchronization is working Takes an average 0,1 ms to complete on host Partial integration with barectf It is possible to allow nested interrupts of the same kind Need to overwrite a system register Not very safe Another method using only one interruption has been tested Average cycles taken by first method: ~300 000 (without first interruption) Average cycles taken by second method: ~a few thousand Drawback: poll shared memory instead of idle and less precise synchronization December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 17

Results - Limitations BareCTF + synchronization code on a core One core is periodically checked Need to use shared memory Huge overhead due to interruptions Internal synchronization can t be shared by different workgroups December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 18

Future work Try the DMA engine of the Parallella Create a TraceCompass view Finalize barectf integration See how this method can be integrated to other devices Estimate the optimal synchronization rate December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 19

Future work - The TI board 4 Cortex A15 8 C66x DSPs 6 MB shared memory Cache system December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 20

Thank you for your attention! Contact : thomas.bertauld@gmail.com December 10th 2015 TRACING EMBEDDED HETEROGENEOUS SYSTEMS 21