A Software Developer's Journey into a Deeply Heterogeneous World. Tomas Evensen, CTO Embedded Software, Xilinx


Embedded Development: Then
- Simple single CPU
- Most code developed internally
- Tens of thousands of lines of code in C and assembly
- Single real-time operating system
- JTAG/BDM debugger
- Simple I/O
Page 2

Embedded Development: Now
- Multiple heterogeneous CPUs
- Multiple accelerators and programmable logic
- Millions of lines of code, mostly from other places like open source
- Multiple operating systems (e.g. Linux + RTOS)
- JTAG debugger
- Safety and security concerns
[Figure: Xilinx Zynq MPSoC]
Page 3

Dedicated Hardware is Energy Efficient
[Figure: energy efficiency of microprocessors, general-purpose DSPs, and dedicated hardware, with a gap of 3 orders of magnitude.]
Courtesy Bob Brodersen, based on published results at ISSCC conferences
Page 4

Heterogeneous Example: IIoT Gateway
[Figure labels: Zynq UltraScale+ SoC; ARM Processing System; FPGA Fabric; Cloud Connectivity; API; HW Acceleration of Application & RT Processing; Motor Control (FOC); Any Network; Image Signal Processing; Sensor Fusion; I/O Expansion; Any Design]
Expertise needed all the way from a system level to cloud connectivity
Page 5

FPGA: The Chameleon Chip
- Is it glue logic?
- Is it a powerful parallel DSP engine?
- Is it an RTL simulator?
- Yes!!! And more
Page 6

FPGA Reaching New Developers
- Limited pool of FPGA developers
- Need to reach software developers
  - Software developers are different!
- Key to reaching software developers:
  1. Create libraries so they can utilize accelerators written by others
  2. Create tools so they can utilize FPGA without RTL
Page 7

Heterogeneous Software Development
Page 8

Mapping Applications to Heterogeneous Systems
[Figure: a user application mapped onto software blocks for the A9, A53, and R5 cores and onto hardware blocks.]
Page 9

Components for Heterogeneous SW Development
- Accelerated libraries and frameworks for common functions
  - E.g. OpenCV, CNN, ...
- Support for multiple types of operating environments
  - Solid Linux support, bare metal, FreeRTOS, 3rd-party RTOS, Windows EC
  - Mixing of OSs through AMP and hypervisors
- System debugger
  - Unifying debug/profile
  - Debug across cores and FPGA, including profiling and trace
- FPGA compiler: SDSoC
  - Write code for FPGA using C/C++/OpenCL
  - Automate the glue between execution engines
- Other
  - Virtual prototyping for complete system
Page 10

Framework Programming: Deep Learning
- Many embedded problems are being converted to use deep learning
  - Embedded vision, speech, ...
  - Using neural networks of different kinds, e.g. CNN, ...
- Neural networks are programmed through learning
- Neural networks are typically controlled by frameworks
  - Caffe, TensorFlow, Torch, Theano, ...
- Neural networks are very computation intensive
- FPGAs can be very efficient for neural networks
  - Combination of fixed point, flexible routing, memory hierarchies and DSPs
- By supporting existing frameworks, programmers can avoid RTL

AlexNet Calculations
Layer   Output feature maps      Filter sizing     MACs
        Rows   Cols   Depth      Dim    Depth
conv1   55     55     64         11     3           70,276,800
conv2   27     27     192        5      64         223,948,800
conv3   13     13     384        3      192        112,140,288
conv4   13     13     256        3      384        149,520,384
conv5   13     13     256        3      256         99,680,256
fc6     6      6      256        -      4096        37,748,736
fc7     1      1      4096       -      4096        16,777,216
fc8     1      1      4096       -      1000         4,096,000
Total                                              714,188,480
Page 11
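The MAC column of the AlexNet table can be reproduced from the other columns. The sketch below is only an illustration of that arithmetic: the helper names (`conv_macs`, `fc_macs`, `alexnet_total_macs`) are hypothetical and not part of any framework, and the fully connected rows are assumed to be "inputs times outputs".

```c
#include <stdint.h>

/* Hypothetical helper: MACs for one convolution layer.
 * Each of the out_rows x out_cols x out_depth outputs needs
 * filter_dim * filter_dim * filter_depth multiply-accumulates. */
uint64_t conv_macs(uint64_t out_rows, uint64_t out_cols, uint64_t out_depth,
                   uint64_t filter_dim, uint64_t filter_depth)
{
    return out_rows * out_cols * out_depth
         * filter_dim * filter_dim * filter_depth;
}

/* Hypothetical helper: a fully connected layer needs one MAC
 * per (input, output) pair. */
uint64_t fc_macs(uint64_t inputs, uint64_t outputs)
{
    return inputs * outputs;
}

/* Sum of all eight rows of the table on the slide. */
uint64_t alexnet_total_macs(void)
{
    return conv_macs(55, 55, 64, 11, 3)
         + conv_macs(27, 27, 192, 5, 64)
         + conv_macs(13, 13, 384, 3, 192)
         + conv_macs(13, 13, 256, 3, 384)
         + conv_macs(13, 13, 256, 3, 256)
         + fc_macs(6 * 6 * 256, 4096)
         + fc_macs(4096, 4096)
         + fc_macs(4096, 1000);
}
```

Calling `conv_macs(55, 55, 64, 11, 3)` gives the 70,276,800 listed for conv1, and `alexnet_total_macs()` matches the total row of the table.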

OpenAMP: A Standard for Multi-OS Systems
- What is OpenAMP?
  - A standard for mixing embedded operating systems
  - An open source project
- Trend to combine operating systems
  - Linux is used in the majority of use cases
  - Many free and commercial RTOSs are being used
  - Bare metal (no OS) is common on smaller cores
- Why multiple operating systems?
  - Heterogeneous cores
  - Different needs
    - Real-time vs general purpose
    - Different safety/security levels
  - Legacy
  - GPL avoidance
- Safety and security issues common
  - Affects boot order, messaging implementation, ...
[Figure: examples of OpenAMP applications. RPU (R5 Core 0, R5 Core 1) and FPGA (MicroBlaze, MicroBlaze), each running App 1 / App 2 on an RTOS or bare metal. A53 Core 0 to Core 3 running Linux, RTOS, and Trusted OS with their apps in non-secure and secure state, on top of a hypervisor and ARM Trusted Firmware (ATF).]
Page 12

OpenAMP Capabilities
[Figure: system-wide applications on Linux, VxWorks, FreeRTOS, uC/OS-II, Nucleus, and bare metal, over an OpenAMP SW layer spanning four A53 cores, two R5 cores, and four MicroBlaze (MB) cores.]
- Provides a layer for applications
  - Standard APIs that allow applications to be ported across processors and operating systems
- System development
  - Provides a wide range of capabilities needed to deploy applications across asymmetric computing elements
- Inter-OS and inter-processor communication
  - Send messages back and forth
- OS management
  - Provides booting/rebooting of processors
- Two implementations
  - GPL implementation in Linux kernel
  - BSD implementation for RTOS/BM/Linux user space
Page 13
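To make "send messages back and forth" concrete, here is a minimal sketch of a single-producer, single-consumer mailbox of the kind that could sit in memory shared by two cores. This is not the OpenAMP API: every name (`mailbox_t`, `mbox_send`, `mbox_receive`) is hypothetical, and a real multi-core port would additionally need memory barriers, cache maintenance, and an interrupt or polling scheme for notification.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MBOX_SLOTS     8u   /* assumed; must be a power of two */
#define MBOX_MSG_BYTES 32u  /* assumed maximum message size */

typedef struct {
    volatile uint32_t head;                  /* written only by the sender */
    volatile uint32_t tail;                  /* written only by the receiver */
    uint8_t len[MBOX_SLOTS];                 /* payload length per slot */
    uint8_t data[MBOX_SLOTS][MBOX_MSG_BYTES];
} mailbox_t;

void mbox_init(mailbox_t *m)
{
    m->head = 0;
    m->tail = 0;
}

/* Copy one message into the next free slot; false if full or too large. */
bool mbox_send(mailbox_t *m, const void *msg, uint8_t n)
{
    uint32_t head = m->head;
    if (n > MBOX_MSG_BYTES || head - m->tail == MBOX_SLOTS)
        return false;
    uint32_t slot = head % MBOX_SLOTS;
    memcpy(m->data[slot], msg, n);
    m->len[slot] = n;
    /* A real port needs a memory barrier here before publishing. */
    m->head = head + 1;
    return true;
}

/* Copy the oldest message out; returns its length, -1 if the mailbox
 * is empty, or -2 if the caller's buffer is too small. */
int mbox_receive(mailbox_t *m, void *buf, uint8_t cap)
{
    uint32_t tail = m->tail;
    if (tail == m->head)
        return -1;
    uint32_t slot = tail % MBOX_SLOTS;
    uint8_t n = m->len[slot];
    if (n > cap)
        return -2;
    memcpy(buf, m->data[slot], n);
    m->tail = tail + 1;
    return n;
}
```

Each index has exactly one writer, which is what lets the two sides run without a lock in this simplified model.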

Software Development Tools (SDK)
71% - Software Development tools
- Complete system visibility needed
  - Heterogeneous debugging and analysis is very hard!
  - Especially timing-related problems
- Tools features:
  - Heterogeneous system-level debugging
    - Visibility into both CPUs and FPGA
  - Integrated performance profiling
    - Which parts of the chip are busy?
    - Measure processor and bus activities
    - Integrated traffic generator
  - System event trace
    - What is happening in the chip over time?
    - Combined timeline for SW and HW events
  - Based on standards
    - Open source Eclipse, TCF
- Strong system-level tools are critical for heterogeneous development
Page 14

Performance Data
- Live tables
  - ARM performance registers: cache misses, IPC, ...
  - AXI performance registers: transactions, latency, ...
  - Non-intrusive JTAG profiling
- Timeline plot
  - Correlate performance: cache, buses, CPU, ...
- Examples:
  - How does ACP traffic affect cache miss rate?
  - How balanced are the buses?
  - How does changing memory access priority affect throughput?
Page 15
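The derived metrics named on this slide are simple ratios of raw counter values. The sketch below shows that arithmetic only; the struct layout and function names are hypothetical, and reading the actual ARM or AXI performance registers is not shown.

```c
#include <stdint.h>

/* Hypothetical snapshot of raw counters taken at one point in time. */
typedef struct {
    uint64_t cycles;
    uint64_t instructions;
    uint64_t cache_accesses;
    uint64_t cache_misses;
} perf_sample_t;

/* Counter deltas between two snapshots (end must be taken after start). */
perf_sample_t perf_delta(const perf_sample_t *start, const perf_sample_t *end)
{
    perf_sample_t d = {
        end->cycles - start->cycles,
        end->instructions - start->instructions,
        end->cache_accesses - start->cache_accesses,
        end->cache_misses - start->cache_misses
    };
    return d;
}

/* Instructions per cycle; 0.0 if no cycles elapsed. */
double perf_ipc(const perf_sample_t *s)
{
    return s->cycles ? (double)s->instructions / (double)s->cycles : 0.0;
}

/* Fraction of cache accesses that missed; 0.0 if there were no accesses. */
double perf_miss_rate(const perf_sample_t *s)
{
    return s->cache_accesses
         ? (double)s->cache_misses / (double)s->cache_accesses
         : 0.0;
}
```

Comparing `perf_miss_rate` over an interval with and without ACP traffic is one way to approach the first example question on the slide.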

Evaluate Performance - Traffic Generation
- Generate traffic patterns
  - Pre-defined bitstream
  - Configurable to emulate traffic patterns on multiple ports
- Simultaneous CPU loading
  - Configurable app types
- Allows for pre-porting evaluation
Page 16

Event Trace to Dissect Timing Issues
- Common timeline
- Software events
  - OS events (sys calls, locks, ...)
  - User events
- Hardware events
  - Bus transactions, PL events
- Low overhead
Page 17
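A common timeline amounts to merging the software and hardware event streams by timestamp. The sketch below assumes both streams are already sorted and stamped from the same clock, which a real trace tool has to arrange; the types and the function name `merge_timelines` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { EV_SOFTWARE, EV_HARDWARE } ev_source_t;

/* Hypothetical trace record: timestamps are assumed to share one clock. */
typedef struct {
    uint64_t    timestamp;
    ev_source_t source;
    const char *label;
} trace_event_t;

/* Merge two timestamp-sorted streams into one timeline.
 * out must have room for n_sw + n_hw events. Returns the event count.
 * On equal timestamps the software event is placed first. */
size_t merge_timelines(const trace_event_t *sw, size_t n_sw,
                       const trace_event_t *hw, size_t n_hw,
                       trace_event_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < n_sw && j < n_hw)
        out[k++] = (sw[i].timestamp <= hw[j].timestamp) ? sw[i++] : hw[j++];
    while (i < n_sw)
        out[k++] = sw[i++];
    while (j < n_hw)
        out[k++] = hw[j++];
    return k;
}
```

With the merged array, a lock acquisition and the bus transaction that follows it appear next to each other, which is the point of the combined view.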

SDSoC: FPGA Development through Software
Page 18

FPGA Productivity with Technology Advancement
[Figure: chart with axes "Ease of Development" and "Performance / Watt & Any to Any Connectivity"; plotted items: CPU, GPU, ARM SoCs & DSPs, Zynq SoC & MPSoC, and Zynq SoC & MPSoC with HLS.]
Page 19

Typical Zynq Development Flow
APP(){ funca(); funcb(); funcc(); }
- HW-SW partition?
  - Explore optimal architecture
- HW-SW connectivity?
  - Datamover
  - PS-PL interfaces
  - SW drivers
[Figure: funca on the Processing System (PS); funcb and funcc in the Programmable Logic (PL).]
Page 20

Before SDSoC:
[Figure labels: Met Req?; Application (C/C++; SDK, OS tools); PS driver (C; SDK); Datamover and PS-PL interface (IPI project; IP Integrator); PL IP (HLS, Verilog, VHDL; Vivado); HW-SW partition spec]
Need to modify multiple levels of design entry
Page 21

After SDSoC: Remove the manual design of SW drivers and HW connectivity
[Figure labels: C/C++; HLS, Verilog, VHDL; HW-SW partition spec]
Page 22

After SDSoC: Remove the manual design of SW drivers and HW connectivity
- Use the C/C++ end application as the input, calling the user algorithm IPs as function calls
    func1();  <- sw
    func2();  <- hw
    func3();  <- hw
- Partition a set of functions to Programmable Logic with a single click (select functions for PL)
Page 23

After SDSoC: Automatic System Generation
[Figure labels: Met Req?; C/C++ application (func1(); <- sw, func2(); <- hw, func3(); <- hw); Select functions for PL; SDSoC; PS: application, driver; Datamover and PS-PL interface; PL: IP]
C/C++ to system in hours, days
Page 24

Example 1: Matrix Multiply + Add
main(){ malloc(a,b,c); mmult(a,b,d); madd(c,d,e); printf(e); }
mmult(ina,inb,out){ HLS C/C++ }
madd(ina,inb,out){ HLS C/C++ }
[Figure labels: Generated; Platform; PS: application, driver; AXI bus; PL: mmult, D, madd; A, B datamovers]
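The slide's pseudocode can be read as the plain C below. This is a software-only sketch under assumptions: the matrix size `N`, the row-major `float` layout, and the function signatures are illustrative, and nothing here is SDSoC-specific. Selecting `mmult` and `madd` for programmable logic would happen in the tool and is not shown.

```c
#include <stddef.h>

#define N 4  /* assumed matrix dimension, for illustration only */

/* out = in_a * in_b, all N x N, row-major. */
void mmult(const float in_a[N * N], const float in_b[N * N], float out[N * N])
{
    for (size_t row = 0; row < N; row++) {
        for (size_t col = 0; col < N; col++) {
            float sum = 0.0f;
            for (size_t k = 0; k < N; k++)
                sum += in_a[row * N + k] * in_b[k * N + col];
            out[row * N + col] = sum;
        }
    }
}

/* out = in_a + in_b, element by element. */
void madd(const float in_a[N * N], const float in_b[N * N], float out[N * N])
{
    for (size_t i = 0; i < N * N; i++)
        out[i] = in_a[i] + in_b[i];
}
```

As on the slide, the caller computes `mmult(a, b, d)` followed by `madd(c, d, e)`, so the intermediate `d` is the only data passed between the two functions. That corresponds to the `D` link drawn between the two PL blocks in the figure.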

Example 2: 1080p60 Stereo Vision
main(){ histequal(a); histequal(b); rectify(a,b,c); stereobm(c,d); overlay(d,out); display(out); }
[Figure labels: ZC706 + HDMI FMC platform; SDSoC-generated platform; PS: Linux, application, libraries, stub, AXI drivers; PL: DMA, AXI-S, HDMI, histogram equalize (x2), rectify, stereo block matching, overlay, HDMI]
Image processing on the video I/Os via DDR3 memory
Copyright 2016 Xilinx

How to Call Accelerators - Programming Paradigms
- Explicit message passing APIs
  - Generic API to transfer data (send/receive, set/get)
  - Tasks written in C/C++ (SW) and/or VHDL/Verilog (HW)
  - Mental model: threads communicating with each other
      send_i(port1, A, ...);
      send_i(port2, B, ...);
      receive_i(port3, C, ...);
      cf_wait_all(...);
- Function call paradigm
  - Standard function call paradigm
  - Synchronous or asynchronous
  - Mental model: call an accelerator that returns a result
      mm_mul(A, B, C);
      // or
      mm_mul_i(A, B, C, ...);
      wait(...);
- Enqueue work items (OpenCL)
  - Compile OpenCL host and kernels
  - Kernels compiled to CPU/Neon or FPGA
  - Mental model: enqueue tasks to the next available execution unit
      *k = clCreateKernel(*prog, "mmul", &err);
      err |= clSetKernelArg(*k, 3, SIZE, &A);
      err |= clSetKernelArg(*k, 4, SIZE, &B);
      err |= clSetKernelArg(*k, 5, SIZE, &C);
      err = clEnqueueNDRangeKernel(cmds, *k, ...);
- High-level modeling
  - MathWorks MATLAB/Simulink
  - National Instruments LabVIEW
- No right way of doing this
  - Depends on application
Page 27
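The "enqueue work items" mental model can be illustrated without OpenCL. The sketch below is a single-threaded stand-in, not OpenCL and not any vendor API: work items are recorded in a queue and nothing runs until an execution unit drains it. All names (`work_queue_t`, `wq_enqueue`, `wq_drain`, `task_square`) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16  /* assumed queue depth */

typedef void (*task_fn)(void *arg);

typedef struct {
    task_fn fn;
    void   *arg;
} work_item_t;

/* Fixed-size FIFO of pending work items; zero-initialize before use. */
typedef struct {
    work_item_t items[QUEUE_CAP];
    size_t      head;   /* index of the oldest pending item */
    size_t      count;  /* number of pending items */
} work_queue_t;

/* Record a task for later execution; false if the queue is full. */
bool wq_enqueue(work_queue_t *q, task_fn fn, void *arg)
{
    if (q->count == QUEUE_CAP)
        return false;
    size_t slot = (q->head + q->count) % QUEUE_CAP;
    q->items[slot].fn = fn;
    q->items[slot].arg = arg;
    q->count++;
    return true;
}

/* Stand-in for an execution unit: run all pending items in FIFO order.
 * Returns how many items were executed. */
size_t wq_drain(work_queue_t *q)
{
    size_t ran = 0;
    while (q->count > 0) {
        work_item_t item = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        item.fn(item.arg);
        ran++;
    }
    return ran;
}

/* Example task: square an int in place. */
void task_square(void *arg)
{
    int *value = arg;
    *value = *value * *value;
}
```

The host only describes the work and hands it over; which unit executes it, and when, is decided by whatever drains the queue. That decoupling is what distinguishes this paradigm from the function call paradigm.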

Summary
- Heterogeneous systems are here to stay
  - And they will be increasingly complex
- Developing for heterogeneous systems is hard
  - Each component might have its own language and operating environment
  - Parallel programming is hard to get right
- New standards, tools, frameworks and APIs are here to help
  - Hiding the complexity and unifying the environments
- Don't get stuck in old ways
  - Embedded developers are conservative
  - Never a good time to try new methodologies
  - Boiling frog syndrome
Page 28
