Technology and Design Tools for Multicore Embedded Systems Software Development

Similar documents
Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs

Methods and tools for system on chip retargetable parallel programming

Standards for Vision Processing and Neural Networks

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Graph Streaming Processor

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

Accelerating Vision Processing

Overview. Think Silicon is a privately held company founded in 2007 by the core team of Atmel MMC IC group

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

A So%ware Developer's Journey into a Deeply Heterogeneous World. Tomas Evensen, CTO Embedded So%ware, Xilinx

Modern Processor Architectures. L25: Modern Compiler Design

Software Driven Verification at SoC Level. Perspec System Verifier Overview

An innovative compilation tool-chain for embedded multi-core architectures M. Torquati, Computer Science Departmente, Univ.

The OpenVX Computer Vision and Neural Network Inference

Copyright Khronos Group Page 1. Vulkan Overview. June 2015

Update on Khronos Open Standard APIs for Vision Processing Neil Trevett Khronos President NVIDIA Vice President Mobile Ecosystem

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

SDSoC: Session 1

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Navigating the Vision API Jungle: Which API Should You Use and Why? Embedded Vision Summit, May 2015

Home Gateway: the next battle ground. Majid Bemanian Security & Networking Marketing

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006

Multicore and MIPS: Creating the next generation of SoCs. Jim Whittaker EVP MIPS Business Unit

Test and Verification Solutions. ARM Based SOC Design and Verification

Advances in Compilers

XPU A Programmable FPGA Accelerator for Diverse Workloads

Altera SDK for OpenCL

Trends and Challenges in Multicore Programming

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

arm MULTICORE PLATFORMS FOR ADVANCED APPLICATIONS Product Longevity

Intelligent Interconnect for Autonomous Vehicle SoCs. Sam Wong / Chi Peng, NetSpeed Systems

Implementing Flexible Interconnect Topologies for Machine Learning Acceleration

Intel Parallel Studio XE 2015

THE LEADER IN VISUAL COMPUTING

The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues

Parallel Programming on Larrabee. Tim Foley Intel Corp

Silicon Acceleration APIs

Getting the Most out of Advanced ARM IP. ARM Technology Symposia November 2013

Copyright Khronos Group Page 1

HW/SW Co-design. Design of Embedded Systems Jaap Hofstede Version 3, September 1999

Enabling Safe, Secure, Smarter Cars from Silicon to Software. Jeff Hutton Synopsys Automotive Business Development

Handling Challenges of Multi-Core Technology in Automotive Software Engineering

Deep Learning Requirements for Autonomous Vehicles

Simplify System Complexity

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

Introduction to parallel Computing

What s New with the MATLAB and Simulink Product Families. Marta Wilczkowiak & Coorous Mohtadi Application Engineering Group

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Open Standard APIs for Augmented Reality

ClearSpeed Visual Profiler

Combining Arm & RISC-V in Heterogeneous Designs

Design guidelines for embedded real time face detection application

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

Parallel Computing. Hwansoo Han (SKKU)

C6000 Compiler Roadmap

Enabling a Richer Multimedia Experience with GPU Compute. Roberto Mijat Visual Computing Marketing Manager

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

Chapter 4: Threads. Chapter 4: Threads

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Fast and Accurate Source-Level Simulation Considering Target-Specific Compiler Optimizations

SystemC Modelling of the Embedded Networks

Application Programming

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS

Hardware Software Bring-Up Solutions for ARM v7/v8-based Designs. August 2015

OPERATING SYSTEM. Chapter 4: Threads

Application parallelization for multi-core Android devices

Copyright Khronos Group Page 1

Copyright Khronos Group Page 1

High Performance Computing

Microsoft Windows HPC Server 2008 R2 for the Cluster Developer

Mapping applications into MPSoC

Computational Interdisciplinary Modelling High Performance Parallel & Distributed Computing Our Research

FPQ6 - MPC8313E implementation

27 March 2018 Mikael Arguedas and Morgan Quigley

Department of aerospace computer and software systems

Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems

OpenCV on Zynq: Accelerating 4k60 Dense Optical Flow and Stereo Vision. Kamran Khan, Product Manager, Software Acceleration and Libraries July 2017

PRACE Autumn School Basic Programming Models

Next Generation Verification Process for Automotive and Mobile Designs with MIPI CSI-2 SM Interface

Computer Architecture

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

Martin Kruliš, v

Coding Tools. (Lectures on High-performance Computing for Economists VI) Jesús Fernández-Villaverde 1 and Pablo Guerrón 2 March 25, 2018

Unleashing the benefits of GPU Computing with ARM Mali TM Practical applications and use-cases. Steve Steele, ARM

VisionCPP Mehdi Goli

Verification Futures Nick Heaton, Distinguished Engineer, Cadence Design Systems

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules

General introduction: GPUs and the realm of parallel architectures

GPUs and Emerging Architectures

Kalray MPPA Manycore Challenges for the Next Generation of Professional Applications Benoît Dupont de Dinechin MPSoC 2013

Hardware/Software Co-design

The Challenges of System Design. Raising Performance and Reducing Power Consumption

개발과정에서의 MATLAB 과 C 의연동 ( 영상처리분야 )

OpenCL: History & Future. November 20, 2017

Transcription:

Technology and Design Tools for Multicore Embedded Systems Software Development Yuriy Sheynin, Alexey Syschikov, Boris Sedov Saint Petersburg State University of Aerospace Instrumentation

Why do we need such technology? 1. Two-in-one developer is required: skilled domain experts + skilled programmer 2. Contradictive requirements to hardware platforms 3. Development of an algorithm and program should be started before the selection of a specific platform 4. Hardware platforms become more and more complex, includes many cores, are heterogeneous in all aspects (cores, memory, interconnect) 5. In order to achieve the necessary requirements an adaptation of algorithms for the platform and the platform for algorithms is needed 2/49 2006 OMS & MCIT GUAP

The life cycle of the program in the technology Algorithms design and programming Performance evaluation Deployment to target platforms Frontend Middlend Backend 3/49

The life cycle of the program in the technology Algorithms design and programming Performance evaluation Deployment to target platforms 4/49

The life cycle of the program in the technology Algorithms design and programming Performance evaluation Deployment to target platforms Platform 1 5/49

The life cycle of the program in the technology Algorithms design and programming Performance evaluation Deployment to target platforms Platform N Platform 2 Platform 1 6/49

The high-level technology structure VIPE 7/49

The high-level technology structure VIPE 1 Algorithms Размещение Симуляция design на и целевой ранняя and programming оценка платформе 8/49

The high-level technology structure VIPE 1 Simulation and early estimation 2 9/49

The high-level technology structure VIPE 1 Deployment Симуляция to и ранняя target platforms оценка 3 2 10/49

The Visual Programming Language (VPL) Separation of design and programming (coding) 11/49

The Visual Programming Language (VPL) Separation of design and programming (coding) Expert Programmer C/C++ OpenCL Asm 12/49

The Visual Programming Language (VPL) Separation of design and programming (coding) Expert Programmer Flexibility and ease-of-change at any design stage Explicit parallel program scheme control and management No direct coder influence on a parallel program scheme Decreasing errors possibility without sacrificing parallel program visibility Efficient program maintenance during the whole lifecycle 13/49

The Visual Programming Language (VPL) Cognitive advantages: clear view of the development process traceability of the dependency graph calculations control structures shared data natural parallelism potential pipelining Program on VPL scheme that consist from calculation operator, control operators and share data 14/49

VPL and Domain Specific Programming DSL libraries provide convenience of design within an application domain Easy re-use of development results Easy to involve domain experts in embedded software development Simplicity of language and libraries extensions by users CImg OpenCV OpenVX Lowlevel math Arduino 15/49

An example: DSL creation for image processing (OpenCV) 1. Analysis of the domain area 3. Development of FE functionality on C++ & OpenCV 2. Creation of the functional elements (FE) library 16/49

An example: DSL usage for image processing Image recognition 4. The scheme is designed of DSL and basic VPL language elements 17/49

An example: DSL usage for image processing Face/eyes recognition 4. The scheme is designed of DSL and basic language elements 18/49

About OpenVX OpenVX C-based programming approach with mixed C/non-C computing model Includes functions and data types of video processing domain area Functions library can be expanded, but it is inconvenient (non-portable) OpenVX support in VIPE Full implementation for spec. v.1.0.1 (working on v.1.1) VPL: DSL for OpenVX functions OpenVX-specific data objects Code generation: Plain C mode with OpenVX functions (vxu) OpenVX graph mode Mixed mode with OpenVX graphs and other DSLs An OpenVX graph a limited subset of VPL program schemes. VPL scheme + OpenVX functions combines all benefits 19/49

Asynchronous Growing Processes (AGP) formal computational model AGP defines: VPL language syntax semantics of VPL language objects control units AGP provides: formal verification identical results in different run-time environments dynamics of parallel computations Domain combination of working in shared and distributed memory models AGP the single model for all types of parallel computations and kernel - data interaction (shared memory / message passing) R Program scheme oriented graph W V d 1 W V p 1 W V d 3 WRE V d 2 V p 2 V p 3 RE V p 4 V p operator vertex, V d data-object vertex, W/R/RE arcs (links) marking W 20/49 2006 OMS & MCIT GUAP

Visual programming environment: VIPE 21/49

Interactive tools Development process support tools: parallel program scheme validation syntax checking types checking links correctness scheme completeness etc. 22/49

Interactive tools Development process support tools: parallel program scheme validation verification (in progress) unambiguity deadlocks livelocks finitness etc. 23/49

Interactive tools Development process support tools: parallel program scheme validation verification (in progress) interactive debugging, etc. step-by-step debug breakpoints watches data transfers computation traces operators executions functional debugging by serial execution etc. 24/49

expert Early program estimation and evaluation 25/49

Performance evaluation. Visual Profiler Hot-spots detection Modes : Absolute execution time of each node Relative execution time of each node Hot-spots 26/49

Performance evaluation. Static Analysis Fast, early estimation of the program performance on the many core platform Time reduce R = T n T 1 100% Speedup S = T 1 T n Efficiency E = T 1 N T n T 1 program execution time on the1 processor T n program execution time on N processors N number of processors 27/49

Performance evaluation. VPL Simulator Allows estimating : 1. Performance requirements for cores of the embedded system 2. Memory requirements 3. Load balance of various allocations 4. Volume and intensity of data exchange 5. Efficiency of hardware occupation 6. Bottlenecks of hardware platform, program and task distribution 28/49

Support of heterogeneous platforms programming Mapping operators to one or several core types (CPU, GPU, DSP, DMA) Operators on various core types Data on various data types Data exchanges on various connection types Selecting the implementation for data processing operators Preparation of initial data and the results of operator of the program, taking into account the specifics of the different communication mechanisms 29/49

Main memory Heterogeneous allocation Local memory CPU CPU CPU DSP (host) CPU DSP (parallel flows) Local memory DSP (host) Memory pre-allocation Start of computations Dynamic unrolling on CPU Offload the parallel loop to DSP cluster Dynamic unrolling on DSP (another loop) Results back to CPU, end of computation DSP (dynamic parallel flows) 30/49

Deployment to target platforms Working prototypes ANSI C C++ RT-run-time on a multicore platform (in progress) Parallel OpenMP Proof of concept Parallel threads MPI Assembler MIPS, DSP 31/49 2006 OMS & MCIT GUAP

Use cases and demonstrators

Use case: face identification Task description Camera Program is developed with VIPE Camera Face database Training of neural network Algorithm of face identification Identification and tracking (camera rotation) 33/49 2006 OMS & MCIT GUAP

Use case: face identification Software part: low quality of face recognition Hardware part: Ci20 (perspective ELISE by ELVEES) Computations only on the CPU Works slowly 34/49 2006 OMS & MCIT GUAP

Terminator Vision System. Student project Autonomous Cyber-Physical System combining multicore computations, control and mechanical parts. The Vision System identifies people from the database and tracks them with rotating camera. Project presented on hackster.io + Imagination challenge: https://www.hackster.io/contests/ci20 Project is developed with VIPE Face recognition is performed by using training neural network Database of faces was created for face classification Tracking is performed by using servo, which is controlled by Arduino that receives commands from the Ci-20 board 35/49

Use case: number plate recognition Program developed with VIPE 2 threads are used OpenGL 36/49 2006 OMS & MCIT GUAP

Use case: number plate recognition PC CPU GPU ELISE AFE 802.11 ac 2x2 RPU (VOLT) ao 2x2 ISP (Felix) 2 pipes M6160 Virtuoso L1 = 18K InterAptiv Single Core L1 = 32 K L2 = 268K Р6802 (Apache) 2 cores L1 = 64K L2 cache = 1MB GPU (Ctyde) 2 clusters VPU Decode (Coral) VPU Enoode (ONYX) Bus fabric Security / NOC / Arbitration System RAM / ROM / DMA Navicore Velcore Peripherals DMA Engine Crypto Engine PDF DDR Controller PLL POR Ci20 USB USB MIPI MIPI HDMI HDMI DDR Software part: low quality of number plate recognition Hardware part: Ci20 (perspective ELISE by ELVEES) Computations only on the CPU Works slowly 37/49 2006 OMS & MCIT GUAP

Use case: number plate recognition Camera Regions Of Interests Symbol detection of each number plate Result output 38/49 2006 OMS & MCIT GUAP

Use case: number plate recognition Scheme of working with Imagination Creator Ci20 board Wi-Fi Webcam Ci20 C code Result PC with VIPE Compile Run 39/49 2006 OMS & MCIT GUAP

VIPE one button deployment 40/49

Feature tracking (OpenVX) DSL and design DSL, based on OpenVX Feature tracking program in VIPE 41/49

Feature tracking (OpenVX) Results Feature tracking program run on the x86 platform with using the sample implementation by Khronos 42/49

Feature tracking Static analysis Performance estimation of the feature tracking program with sequential frame processing Performance estimation of the feature tracking program with parallel frame processing 43/49

Feature tracking Visual Profiler Large amount of time is taken by image format conversion function (from OpenVX format and back) Profiling of the feature tracking program 44/49

Traffic radar object detection Development Traffic radar object detection is developed in the PaPP project, part of the European technical platform ARTEMIS/ECSEL. 45/49

Traffic radar object detection Static analysis Static analysis shows acceptable reduction of time on 2-3 cores However, the results were worse than expected. Static analysis of subprogram "Data processing units" shows close to a linear reduction of time for 8 cores 46/49

Traffic radar object detection Visual Profiler Visual Profiler shows, that a large amount of time is taken by function for reading the input file (prototype uses data from the input file). File reading function is in sequential part, hence the parallelism is limited by Amdahl's law. Actual process of getting the input data should be optimized to take less time. Evaluation of program with reduced operating time of reading function shows satisfactory results. 47/49

Traffic radar object detection Comparison of the results of analysis, simulation and execution Cores (simulator) or threads(openmp) Static analysis sec. / % Modeling VPL sec. / % Execution sec. / % Without OpenMP 1.60 1 1.26 / 100 1.29 / 100 1.65 / 100 2 0.64 / 50.8 0.88 / 68.2 1.00 / 60.6 3 0.60 / 47.6 0.72 / 55.8 0.81 / 49.1 4 0.34 / 30.0 0.55 / 42.6 1.25 / 76.7 Hardware platform Core i7 8 cores-> VirtualBox VM 4 cores Ubuntu 14.04, GCC 4.9.2. Input data 12 MB signal samples 48/49

Summary Technology covers various requirements of embedded SW development Design, programming, evaluation, porting etc. DSLs for involving domain specialists into development process Rapid SW prototyping for early customer presentations Formal model basis for proofed and predictable results, including debugging Fast tools adaptation for new cores and platforms Supporting development tool for heterogeneous cores, platforms, system software infrastructure 49/49