Elaborazione di dati ad elevate prestazioni e bassa potenza su architetture embedded many-core e FPGA


Elaborazione di dati ad elevate prestazioni e bassa potenza su architetture embedded many-core e FPGA Dott. Alessandro Capotondi alessandro.capotondi@unibo.it Prof. Davide Rossi davide.rossi@unibo.it

Agenda
- Many-core introduction
- CIRI ICT: OpenMP technologies
- Productive parallel programming models
- Accelerator virtualization for high-performance and power-efficient computation: resource sharing among applications, heterogeneous unified shared memory
- Conclusions

Collaborations: academia and industry. EU projects: P-SOCRATES (FP7 ICT grant n. 288574), ERC grant n. 291125.

The Advent of the Heterogeneous Many-Core Architecture Era in High-Performance Embedded Systems

Embedded systems must now be capable of processing workloads usually tailored for workstations or HPC. Multi-Processor Systems-on-Chip (MPSoCs) integrate computing units in the same die, designed to deliver high performance at low power consumption, i.e. high energy efficiency (GOPS/Watt). Various design schemes are available:
- Targeting ADAPTIVITY: heterogeneity and specialization for efficient computing (e.g. ARM big.LITTLE).
- Targeting PARALLELISM: massively parallel many-core accelerators that maximize GOPS/Watt (e.g. GPUs, GPGPUs, PMCAs).

Example: the Nvidia Tegra K1 has two levels of heterogeneity: a host processor (4 powerful cores + 1 energy-efficient core) and a parallel many-core coprocessor (a 192-core accelerator: an Nvidia Kepler GPU).

Nvidia K1 (Jetson)

Hardware features: 5" x 5" (127 mm x 127 mm) board; Tegra K1 SoC (1 to 5 Watts): NVIDIA Kepler GPU with 192 CUDA cores (326 GFLOPS), NVIDIA "4-Plus-1" 2.32 GHz quad-core ARM Cortex-A15; DRAM: 2 GB DDR3L 933 MHz.
IO features: mini-PCIe, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO.
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs including H.264, OpenCV4Tegra.
$200. Large community of users!

Nvidia TX1

Hardware features: 8" x 8" board; Tegra TX1 SoC (15 Watts): NVIDIA Maxwell GPU with 256 NVIDIA CUDA cores (1 TFLOP/s), quad-core ARM Cortex-A57 MPCore processor, 4 GB LPDDR4 memory.
IO features: PCI-E x4, 5MP CSI, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO.
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs including H.264, OpenCV4Tegra.
$599. Workstation-comparable performance!

TI Keystone II

Hardware features: 8" x 8" board; TI 66AK2H12 SoC (14 Watts): 8x C6600 DSPs @ 1.2 GHz (304 GMACs), quad-core ARM Cortex-A15 MPCore, up to 4 GB DDR3 memory.
IO features: PCI-E, SD/MMC card, USB 3.0/2.0, 2x Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 38x GPIO, HyperLink, SRIO, 20x 64-bit timers, security accelerators.
Software features: OpenMP, OpenCL.
Evaluation board: $1000. Targeted at signal-processing acceleration.

Kalray MPPA

Evaluation board: $1000. Targeting high-performance, time-critical missions: aerospace/military/autonomous driving, industrial robotics.

Programmable many-core accelerator (PMCA)

Challenges
- Fast programmability: high-productivity programming techniques
- Time predictability for industrial applications
- Accelerator virtualization for high-performance and power-efficient computation: resource sharing among applications, heterogeneous unified shared memory

Performance is not a free meal

[Chart: thread-level parallelism (TLP, scale 0-2) measured on common mobile applications — iFunny, Netflix, Candy Crush Saga, My Talking Tom, BS Player, LinkedIn, Google Drive, Instagram, YouTube, Dropbox, Facebook, Twitter — on Android and Apple devices.]

Tests [*] based on common mobile applications show that real platforms are still far from materializing the potential parallelism provided by the hardware: average TLP over 52 apps is 1.22 on Android and 1.36 on Apple.

[*] "Analysis of the Effective Use of Thread-Level Parallelism in Mobile Applications: A preliminary study on iOS and Android devices", Ethan Bogdan, Hongin Yun.

Parallel Programming Models
- Proprietary programming models
- Khronos standard for heterogeneous computing (OpenCL)
- Standard for shared-memory systems (OpenMP)
- Academic proposals: OmpSs, OpenHMPP

OpenMP
- De-facto standard for shared-memory programming
- Support for nested (multi-level) parallelism: good for clusters
- Annotations to incrementally convey parallelism to the compiler: increased ease of use
- Based on well-understood programming practices (shared memory, C language): increased productivity
- 2x to 10x fewer LOC than OpenCL ("OpenCL for programming shared memory multicore CPUs", Akhtar Ali, Usman Dastgeer, Christoph Kessler)

But:
- Designed for uniform SMPs with a main shared memory; lacks constructs to control accelerators
- ...and the compilation toolchain has to deal with multiple ISAs...
- ...and with multiple runtime systems too! (Intel's Parallel Universe magazine, May 2014)

Open-Next OpenMP 4.0 runtime. What's new? UNTIED tasks.

Open-Next OpenMP 4.0 runtime — comparison with other OpenMP implementations and tasking runtimes

[Chart: speedup (0-16) on the RECURSIVE benchmark, on x86 (Intel Haswell, 2x 8 cores @ 2.40 GHz), for: libgomp (GNU OpenMP, GCC 4.9.2), iomp (Intel OpenMP, ICC 15.0.2), nanos (BSC OmpSs: Mercurium 15.06 + Nanos++), Intel Cilk Plus (ICC 15.0.2), Intel TBB (ICC 15.0.2), Wool (GCC 4.9.2).]

Time Predictability
- At compile time, generate the TDG (task dependency graph), including timing information to account for task communication.
- At design time, assign the TDG to OS threads (mapping).
- At run time, schedule OS threads to achieve both predictability and high performance (scheduling).

[Toolflow: C/C++ source code with #pragma omp annotations -> compiler -> binary code (newtask() calls) + extended TDG (eTDG) -> static scheduler + timing analysis (design time) -> OpenMP RTE dispatcher on the many-core (run time).]


Open-Next: Offload using OpenMP

void main(){
    int a[]; int ker_id;

    /* some code here */

    #pragma omp offload \
        shared (a) \
        name ("myker", ker_id) \
        nowait
    {
        /* offloaded code block */
    }

    #pragma omp parallel sections proc_bind(spread)
    {
        #pragma omp section
        TASK_A();
        #pragma omp section
        TASK_B();
    }

    /* some more code here */
    #pragma omp wait (ker_id)
}

TASK_A(){
    int i;
    #pragma omp parallel proc_bind(close)
    #pragma omp for
    for( i=0; ... )
        do_smthg();
}

- offload: new OpenMP directive used to offload the execution of a code block to the accelerator.
- shared clause: specifies data that needs to be shared between the host and the accelerator.
- name clause (new): retrieves a handle (an ID) to the kernel instance at runtime, necessary to wait for asynchronous offloads.
- nowait: specifies an asynchronous offload.
- All standard OpenMP constructs and custom extensions can be used within an offload block.

Early evaluation

[Charts: speedup (0-40) vs. kernel repetitions for six kernels: FAST, CT, Mahalanobis, Strassen, NCC, SHOT.]

"Simplifying Heterogeneous Embedded SoC Programming with Directive-Based Offload", Marongiu, Capotondi, Tagliavini, Benini, IEEE Transactions on Industrial Informatics, 2015.

Accelerator Resource Sharing
- There is no dominant standard parallel programming model (PPM): OpenMP, OpenCL, OpenVX, TBB coexist on top of a low-level runtime and a hardware abstraction layer.
- Goal: improve the overall utilization of accelerators in multi-user environments.
- On PMCAs, runtime environments (RTEs) are typically developed on top of bare metal.
- Legacy applications must be supported.

Accelerator Resource Sharing
Virtual accelerators: multiple offloads (O1 ... ON, from OpenMP, OpenCL, OpenVX applications) share the accelerator through a host driver with lightweight spatial partitioning support.

Runtime Efficiency: Computer Vision Use-Case
- ORB object detector (OpenCL, 4 clusters) [1]
- Face detector (OpenCL, 1 cluster) [2]
- FAST corner detector (OpenMP, 1 cluster) [3]
- Abandoned/removed object detector (OpenMP, 4 clusters) [4]

[1] Rublee, Ethan, et al. "ORB: an efficient alternative to SIFT or SURF." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[2] Jones, Michael, et al. "Fast multi-view face detection." Mitsubishi Electric Research Lab TR-2003-96 (2003).
[3] Rosten, et al. "Faster and better: A machine learning approach to corner detection." Pattern Analysis and Machine Intelligence, IEEE Transactions on 32.1 (2010): 105-119.
[4] Magno, Michele, et al. "Multimodal abandoned/removed object detection for low power video surveillance systems." Advanced Video and Signal Based Surveillance, 2009. AVSS '09. Sixth IEEE International Conference on. IEEE, 2009.

Runtime Efficiency: Computer Vision Use-Case

[Chart: efficiency (%) vs. ideal, over 10 to 10000 frames, for MPM-MO, SPM-MO (0%, 25%, 50%, 100%) and SPM-SO. MPM-MO reaches 90% efficiency with respect to the ideal case: +30% efficiency over SPM-MO and +40% over SPM-SO.]

Heterogeneous unified shared memory

Shared memory for accelerators in embedded SoCs: there is no clear view of the practical implementation aspects and performance implications of virtual memory support.

Today's reality: memory partitioning.
- Coherent virtual memory for the host.
- The accelerator can only access a contiguous section of shared main memory; no virtual memory.
- Explicit data management involving copies: limited programmability, low performance.

Open-Next goal: lightweight virtual memory support.
- Sharing of virtual address pointers, transparent to the application developer.
- Zero-copy offload, higher predictability.
- Low complexity, low area, low cost.

Heterogeneous unified shared memory: heterogeneous systems
- Increase computing power and energy efficiency.
- The host executes control-intensive and sequential tasks; highly parallel tasks are offloaded at fine grain to the accelerator.
- Host and accelerator communicate via coherent shared memory: an IOMMU provides heterogeneous uniform memory access (hUMA) in high-end SoCs.
- Zero-copy (transparent) virtual pointer sharing moves the complexity from the software to the hardware.
- Not only many-core accelerators: e.g. an FPGA CNN deep-learning accelerator.

Heterogeneous unified shared memory: low-cost IOMMU
- Host: dual-core ARM Cortex-A9, Linux kernel 3.13.
- Accelerator: PULP, implemented in the FPGA (http://www.pulp-platform.org/) — the first open-source RISC-V core.

Open-Next CIRI-ICT Activities
- Identification of the reference programming model for the multi- and many-core platforms that implement the project's use cases.
- Implementation of software mechanisms to ease programming and make data exchange more efficient in heterogeneous shared-memory architectures composed of a host with virtual memory support and accelerators without it (e.g. GPU, DSP, FPGA).
- Implementation of software mechanisms for the high-level management of functions accelerated through dedicated hardware (FPGA).
- Identification of possible programming-model extensions for the next generation of real-time industrial plants.
- Porting of significant kernels extracted from the applications implementing the use cases, and performance analysis.

Open-Next CIRI-ICT Unibo: your industrial use cases! More than 10 years of experience in embedded many-core programming; 36 person-months on industrial use-case exploration. Move from workstations to efficient embedded systems!
