High-performance, low-power data processing on embedded many-core and FPGA architectures
Dott. Alessandro Capotondi alessandro.capotondi@unibo.it
Prof. Davide Rossi davide.rossi@unibo.it
Agenda
Many-core introduction
CIRI ICT OpenMP technologies
Productive parallel programming models
Accelerator virtualization for high-performance and power-efficient computation: resource sharing among applications; heterogeneous unified shared memory
Conclusions
Collaborations: academia and industry.
EU projects: P-SOCRATES (FP7 ICT grant no. 288574), ERC grant no. 291125.
The Advent of the Heterogeneous Many-Core Architecture Era in High-Performance Embedded Systems
Embedded systems need to be capable of processing workloads usually tailored for workstations or HPC.
Multi-Processor Systems-on-Chip (MPSoCs): computing units embedded in the same die, designed to deliver high performance at low power consumption, i.e. high energy efficiency (GOPS/Watt).
Various design schemes are available:
Targeting ADAPTIVITY: heterogeneity and specialization for efficient computing, e.g. ARM big.LITTLE.
Targeting PARALLELISM: massively parallel many-core accelerators to maximize GOPS/Watt (e.g. GPUs, GPGPUs, PMCAs).
Example: the NVIDIA Tegra K1 has two levels of heterogeneity:
Host processor (4 powerful cores + 1 energy-efficient core)
Parallel many-core coprocessor (192-core accelerator: NVIDIA Kepler GPU)
NVIDIA Tegra K1 (Jetson TK1)
Hardware features: 5" x 5" (127mm x 127mm) board; Tegra K1 SoC (1 to 5 Watts): NVIDIA Kepler GPU with 192 CUDA cores (326 GFLOPS), NVIDIA "4-Plus-1" 2.32GHz ARM quad-core Cortex-A15, 2GB DDR3L 933MHz DRAM.
I/O features: mini-PCIe, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO.
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs including H.264, OpenCV4Tegra.
$200. Large community of users!
NVIDIA Jetson TX1
Hardware features: 8" x 8" board; Tegra X1 SoC (15 Watts): NVIDIA Maxwell GPU with 256 NVIDIA CUDA cores (1 TFLOP/s), quad-core ARM Cortex-A57 MPCore processor, 4 GB LPDDR4 memory.
I/O features: PCIe x4, 5MP CSI, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO.
Software features: CUDA, OpenGL 4.4, OpenMAX IL multimedia codecs including H.264, OpenCV4Tegra.
$599. Workstation-comparable performance.
TI Keystone II
Hardware features: 8" x 8" board; TI 66AK2H12 SoC (14 Watts): 8x C66x DSPs @ 1.2 GHz (304 GMACs), quad-core ARM Cortex-A15 MPCore, up to 4 GB DDR3 memory.
I/O features: PCIe, SD/MMC card, USB 3.0/2.0, 2x Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 38x GPIO, HyperLink, SRIO, 20x 64-bit timers, security accelerators.
Software features: OpenMP, OpenCL.
Evaluation board $1000. Targeted as a signal-processing accelerator.
Kalray MPPA
Evaluation board $1000. Targeting high-performance, time-critical missions: aerospace/military/autonomous driving, industrial robotics.
Programmable many-core accelerator (PMCA)
Challenges
Fast programmability: high-productivity programming techniques.
Time predictability for industrial applications.
Accelerator virtualization for high-performance and power-efficient computation: resource sharing among applications; heterogeneous unified shared memory.
Performance is not a free meal: Thread-Level Parallelism
[Bar chart: measured TLP (scale 0 to 2) of common mobile applications on Android and Apple devices: iFunny, Netflix, Candy Crush Saga, My Talking Tom, BS Player, LinkedIn, Google Drive, Instagram, YouTube, Dropbox, Facebook, Twitter]
Tests[*] based on common mobile applications show that real platforms are still far from materializing the potential parallelism provided by the hardware:
Average TLP over 52 apps: 1.22 on Android, 1.36 on Apple.
[*] "Analysis of the Effective Use of Thread-Level Parallelism in Mobile Applications: A preliminary study on iOS and Android devices", Ethan Bogdan, Hongin Yun.
Parallel Programming Models
Proprietary programming models.
Khronos standard for heterogeneous computing (OpenCL).
Standard for shared-memory systems (OpenMP).
Academic proposals: OmpSs, OpenHMPP.
OpenMP
De-facto standard for shared-memory programming.
Support for nested (multi-level) parallelism: good for clusters.
Annotations incrementally convey parallelism to the compiler: increased ease of use.
Based on well-understood programming practices (shared memory, C language): increased productivity.
2x to 10x fewer lines of code than OpenCL ("OpenCL for programming shared memory multicore CPUs", Akhtar Ali, Usman Dastgeer, Christoph Kessler).
But: it was designed for uniform SMPs with a single main shared memory, and lacks constructs to control accelerators. The compilation toolchain must also deal with multiple ISAs, and with multiple runtime systems too! (Intel's Parallel Universe magazine, May 2014)
Open-Next OpenMP 4.0 runtime: what's new? UNTIED tasks.
Open-Next OpenMP 4.0 runtime: comparison with other OpenMP implementations
[Bar chart: speedup (scale 0 to 16) on the RECURSIVE benchmark for libgomp and iomp]
x86 (Intel Haswell, 2x 8 cores @ 2.40 GHz)
libgomp: GNU OpenMP implementation (GCC 4.9.2)
iomp: Intel OpenMP implementation (ICC 15.0.2)
Open-Next OpenMP 4.0 runtime: comparison with other tasking runtimes
[Bar chart: speedup (scale 0 to 16) on the RECURSIVE benchmark for nanos, libgomp, iomp, Intel Cilk Plus, Intel TBB, and Wool]
x86 (Intel Haswell, 2x 8 cores @ 2.40 GHz)
nanos: BSC OmpSs (Mercurium 15.06 + Nanos++)
Intel Cilk Plus: ICC 15.0.2
Intel TBB: ICC 15.0.2
Wool: GCC 4.9.2
Time Predictability
At compile time, generate the task dependency graph (TDG), including timing information to account for task communication.
At design time, assign the TDG to OS threads (mapping).
At run time, schedule OS threads to achieve both predictability and high performance (scheduling).
[Toolflow: C/C++ source code with #pragma omp annotations -> compiler -> binary code (newtask() calls) + extended TDG (eTDG) -> static scheduler + timing analysis at design time -> OpenMP RTE dispatcher on the many-core at run time]
Open-Next Offload using OpenMP

    void main() {
        int a[];
        int ker_id;
        /* some code here */
        #pragma omp offload shared(a) name(myker, ker_id) nowait
        {
            #pragma omp parallel sections proc_bind(spread)
            {
                #pragma omp section
                TASK_A();
                #pragma omp section
                TASK_B();
            }
        }
        /* some more code here */
        #pragma omp wait(ker_id)
    }

    TASK_A() {
        int i;
        #pragma omp parallel proc_bind(close)
        #pragma omp for
        for (i = 0; ...)
            do_smthg();
    }

offload: new OpenMP directive used to offload the execution of a code block to the accelerator.
The shared clause specifies data that needs to be shared between the host and the accelerator.
The new name clause retrieves a handle (an ID) to the kernel instance at runtime, necessary to wait for asynchronous offloads.
nowait specifies an asynchronous offload.
All standard OpenMP directives and custom extensions can be used within an offload block.
Early evaluation
[Bar charts: speedup (scale 0 to 40) vs. number of kernel repetitions for six kernels: FAST, CT, Mahalanobis, Strassen, NCC, SHOT]
"Simplifying Heterogeneous Embedded SoC Programming with Directive-Based Offload", Marongiu, Capotondi, Tagliavini, Benini, IEEE Transactions on Industrial Informatics, 2015.
Accelerator Resource Sharing
There is no dominant standard PPM: OpenMP, OpenCL, OpenVX and TBB all sit on a low-level runtime and hardware abstraction layer, and legacy applications must be supported too.
Goal: improve the overall utilization of accelerators in multi-user environments.
On PMCAs, runtime environments (RTEs) are typically developed on top of bare metal.
Accelerator Resource Sharing
Virtual accelerators: multiple programming models (OpenMP, OpenCL, OpenVX) offload kernels O1, O2, O3, ..., ON through the host driver, backed by lightweight spatial partitioning support on the accelerator.
Runtime Efficiency: Computer Vision Use-Case
ORB object detector (OpenCL, 4 clusters) [1]
Face detector (OpenCL, 1 cluster) [2]
FAST corner detector (OpenMP, 1 cluster) [3]
Abandoned/removed object detector (OpenMP, 4 clusters) [4]
[1] Rublee, Ethan, et al. "ORB: an efficient alternative to SIFT or SURF." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[2] Jones, Michael, et al. "Fast multi-view face detection." Mitsubishi Electric Research Lab TR-2003-96 (2003).
[3] Rosten, et al. "Faster and better: A machine learning approach to corner detection." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.1 (2010): 105-119.
[4] Magno, Michele, et al. "Multimodal abandoned/removed object detection for low power video surveillance systems." Advanced Video and Signal Based Surveillance (AVSS '09), Sixth IEEE International Conference on. IEEE, 2009.
Runtime Efficiency: Computer Vision Use-Case
[Line chart: efficiency vs. ideal (0% to 100%) against number of frames (10 to 10000) for MPM-MO, SPM-MO (0%, 25%, 50%, 100%), and SPM-SO]
MPM-MO is 90% efficient wrt the ideal case: +30% efficiency wrt SPM-MO, +40% efficiency wrt SPM-SO.
Heterogeneous Unified Shared Memory
Shared memory for accelerators in embedded SoCs: there is no clear view of the practical implementation aspects and performance implications of virtual memory support.
Today's reality: memory partitioning.
Coherent virtual memory for the host only.
The accelerator can only access a contiguous section of shared main memory, with no virtual memory.
Explicit data management involving copies: limited programmability, low performance.
Open-Next goal: lightweight virtual memory support.
Sharing of virtual address pointers, transparent to the application developer.
Zero-copy offload, higher predictability.
Low complexity, low area, low cost.
Heterogeneous Unified Shared Memory
Heterogeneous systems increase computing power and energy efficiency: the host executes control-intensive and sequential tasks and performs fine-grained OFFLOAD of highly parallel tasks to the accelerator.
Host and accelerator communicate via coherent shared memory; high-end SoCs use an IOMMU for hUMA (heterogeneous uniform memory access).
ZERO-COPY (transparent) virtual pointer sharing moves the complexity from the software to the hardware.
Not only many-core accelerators: the same applies to FPGA accelerators, e.g. a CNN deep-learning accelerator.
Heterogeneous Unified Shared Memory: Low-Cost IOMMU
Host: dual-core ARM Cortex-A9, Linux kernel 3.13.
Accelerator: PULP, implemented on the FPGA (http://www.pulp-platform.org/), featuring the first open-source RISC-V core.
Open-Next CIRI-ICT Activities
Identification of the reference programming model for the multi- and many-core platforms that implement the project use cases.
Implementation of software mechanisms to ease programming and make data exchange more efficient in heterogeneous shared-memory architectures composed of a host with virtual memory support and accelerators without virtual memory support (e.g. GPUs, DSPs, FPGAs).
Implementation of software mechanisms for the high-level management of functions accelerated by dedicated hardware (FPGAs).
Identification of possible programming-model extensions for the next generation of real-time industrial plants.
Porting of significant kernels extracted from the applications that implement the use cases, and performance analysis.
Open-Next CIRI-ICT
UniBo: more than 10 years of experience in embedded many-core programming; 36 person-months dedicated to industrial use-case exploration.
Your industrial use cases: move from workstations to efficient embedded systems!