Elaborazione di dati ad elevate prestazioni e bassa potenza su architetture embedded many-core e FPGA

Size: px

Start display at page:

Download "Elaborazione di dati ad elevate prestazioni e bassa potenza su architetture embedded many-core e FPGA"

Christina Burns
5 years ago
Views:

1 Elaborazione di dati ad elevate prestazioni e bassa potenza su architetture embedded many-core e FPGA Dott. Alessandro Capotondi alessandro.capotondi@unibo.it Prof. Davide Rossi davide.rossi@unibo.it

2 Agenda Many-core introduction CIRI ICT OpenMP Technologies Productive Parallel Programming Models Accelerator virtualization for high performance and power efficient computations resource sharing among applications heterogeneous unified shared memory Conclusions

3 Collaborations Academia Industrial UE Projects P-SOCRATES FP7 ICT GRANT N ERC GRANT N

4 The Advent of Heterogeneous Many-Core Architectures Era in High-Performance Embedded systems

5 The Advent of Heterogeneous Many-Core Architectures Era in High-Performance Embedded systems

6 The Advent of Heterogeneous Many-Core Architectures Era in High-Performance Embedded systems

7 The Advent of Heterogeneous Many-Core Architectures Era in High-Performance Embedded systems

8 The Advent of Heterogeneous Many-Core Architectures Era in High-Performance Embedded systems

schemes are available: Targeting ADAPTIVITY:

workloads usually tailored for workstations or HPC.

embedded in the same die designed to deliver high

efficiency (GOPS/Watt) Targeting PARALLELISM: massively

9 The Advent of Heterogeneous Many-Core Architectures Era in High-Performance Embedded systems Various design schemes are available: Targeting ADAPTIVITY: heterogeneity and specialization for efficient computing. Es. ARM Big.Little Embedded systems need to be capable to process workloads usually tailored for workstations or HPC. Multi-Processor Systems-on-Chip (MPSoCs) computing units embedded in the same die designed to deliver high performance at low power consumption = high energy efficiency (GOPS/Watt) Targeting PARALLELISM: massively parallel many-core accelerators, to maximize GOPS/Watt (i.e. GPUs, GPGPUs, PMCA). Nvidia Tegra-K1 Two levels of heterogeneity: Host Processor (4 powerful cores + 1 energy efficient core) Parallel many-core coprocessor (192 cores accelerator: NvidiaKeplerGPU)

10 Nvidia K1 (jetson) Hardware Features Dimensions: 5" x 5" (127mm x 127mm) board Tegra K1 SOC (1 to 5 Watts): NVIDIA Kepler GPU with 192 SM (326 GFLOPS) NVIDIA "4-Plus-1" 2.32GHz ARM quad-core Cortex-A15 DRAM: 2GB DDR3L 933MHz IO Features mini-pcie SD/MMC card USB 3.0/2.0 HDMI RS232 Gigabit Ethernet SATA JTAG UART 3x I2C 7 x GPIo Software Features CUDA 6.0 OpenGL 4.4 OpenMAX IL multimedia codec including H.264 OpenCV4Tegra

32GHz ARM quad-core Cortex-A15 DRAM: 2GB DDR3L 933MHz IO Features mini-pcie SD/MMC card USB 3.0/2.

11 Nvidia K1 (jetson) Hardware Features Dimensions: 5" x 5" (127mm x 127mm) board Tegra K1 SOC (1 to 5 Watts): NVIDIA Kepler GPU with 192 SM (326 GFLOPS) NVIDIA "4-Plus-1" 2.32GHz ARM quad-core Cortex-A15 DRAM: 2GB DDR3L 933MHz IO Features mini-pcie SD/MMC card USB 3.0/2.0 HDMI RS232 Gigabit Ethernet SATA JTAG UART 3x I2C 7 x GPIo Software Features CUDA 6.0 OpenGL 4.4 OpenMAX IL multimedia codec including H.264 OpenCV4Tegra 200 $. Big Community of user!

12 Nvidia TX1 Hardware Features Dimensions: 8" x 8 board Tegra TX1 SOC (15 Watts): NVIDIA Maxwell GPU with 256 NVIDIA CUDA Cores(1 TFLOP/s) Quad-core ARM Cortex-A57 MPCore Processor 4 GB LPDDR4 Memory IO Features PCI-E x4 5MP CSI SD/MMC card USB 3.0/2.0 HDMI RS232 Gigabit Ethernet SATA JTAG UART 3x I2C 7 x GPIo Software Features CUDA 6.0 OpenGL 4.4 OpenMAX IL multimedia codec including H.264 OpenCV4Tegra

13 Nvidia TX1 Hardware Features Dimensions: 8" x 8 board Tegra TX1 SOC (15 Watts): NVIDIA Maxwell GPU with 256 NVIDIA CUDA Cores(1 TFLOP/s) Quad-core ARM Cortex-A57 MPCore Processor 4 GB LPDDR4 Memory IO Features PCI-E x4 5MP CSI SD/MMC card USB 3.0/2.0 HDMI RS232 Gigabit Ethernet SATA JTAG UART 3x I2C 7 x GPIo Software Features CUDA 6.0 OpenGL 4.4 OpenMAX IL multimedia codec including H.264 OpenCV4Tegra 599 $. WORKSTATION COMPARABLE PERFORMANCES

14 TI Keystone II Hardware Features Dimensions: 8" x 8 board TI 66AK2H12 SOC (14 Watts): 8x C6600 DSP 1.2 GHz (304 GMACs) Quad-core ARM Cortex-A15 MPCore Up to 4 GB DDR3 Memory IO Features PCI-E SD/MMC card USB 3.0/2.0 2x Gigabit Ethernet SATA JTAG UART 3x I2C 38 x GPIo Hyperlink SRIO 20x 64bit Timers Security Accelerators Software Features OpenMP OpenCL

2 GHz (304 GMACs) Quad-core ARM Cortex-A15 MPCore Up to 4 GB DDR3

0 2x Gigabit Ethernet SATA JTAG UART 3x I2C 38 x GPIo Hyperlink SRIO

15 TI Keystone II Hardware Features Dimensions: 8" x 8 board TI 66AK2H12 SOC (14 Watts): 8x C6600 DSP 1.2 GHz (304 GMACs) Quad-core ARM Cortex-A15 MPCore Up to 4 GB DDR3 Memory IO Features PCI-E SD/MMC card USB 3.0/2.0 2x Gigabit Ethernet SATA JTAG UART 3x I2C 38 x GPIo Hyperlink SRIO 20x 64bit Timers Security Accelerators Software Features OpenMP OpenCL Evaluation Board 1000$. Target for Signal Processing Accelerator

16 MPPA Kalray

17 MPPA Kalray Evaluation Board 1000$. Targeting High Performance Time critical missions Aerospace/Military/Autonomous driving Industrial Robotics

18 Programmable many-core accelerator (PMCA)

19 Programmable many-core accelerator (PMCA)

20 Programmable many-core accelerator (PMCA)

21 Programmable many-core accelerator (PMCA)

22 Challenges Fast Programmability, High Productivity programming techniques Time predictability for industrial applications Accelerator virtualization for high performance and power efficient computations resource sharing among applications heterogeneous unified shared memory

23 Performance is not free MEAL Thread-Level Parallelism 2 1,5 1 0,5 0 ifunny NetFlix Candy Crush Saga My Talking Tom Android BS Player LinkedIn Google Drive Apple Instagram Youtube Dropbox Facebook Twitter Tests[*] based on common mobile applications show that real platforms are still far from materializing the potential parallelism provided by hardware: TLP AVG(52apps) 1.22 Android TLP AVG(52apps) 1.36 Apple

far from materializing the potential parallelism provided by hardware: TLP AVG(52apps) 1.22 Android TLP AVG(52apps) 1.

24 Performance is not free MEAL Thread-Level Parallelism 2 1,5 1 0,5 0 ifunny NetFlix Candy Crush Saga My Talking Tom Android BS Player LinkedIn Google Drive Apple Instagram Youtube Dropbox Facebook Twitter Tests[*] based on common mobile applications show that real platforms are still far from materializing the potential parallelism provided by hardware: TLP AVG(52apps) 1.22 Android TLP AVG(52apps) 1.36 Apple [*] Analysis of the Effective Use of Thread-Level Parallelism in Mobile Applications. A preliminary study on ios and Android devices, Ethan Bogdan Hongin Yun.

25 Parallel Programming models Proprietary Programming models

26 Proprietary Programming models Parallel Programming models

27 Proprietary Programming models Parallel Programming models

28 Proprietary Programming models Parallel Programming models

29 Proprietary Programming models Parallel Programming models Khronos Standard for Heterogeneous Computing

30 Proprietary Programming models Parallel Programming models Khronos Standard for Heterogeneous Computing

31 Proprietary Programming models Parallel Programming models Khronos Standard for Heterogeneous Computing Standard for shared memory system

32 Proprietary Programming models Parallel Programming models Khronos Standard for Heterogeneous Computing Standard for shared memory system

33 Proprietary Programming models Parallel Programming models Standard for shared memory system Khronos Standard for Heterogeneous Computing

34 Proprietary Programming models Parallel Programming models Standard for shared memory system Khronos Standard for Heterogeneous Computing OmpSS OpenHMPP Academic Proposals

35 OpenMP De-facto standard for shared memory programming Support for nested (multi-level) parallelism good for clusters Annotations to incrementally convey parallelism to the compiler increased ease of use Based on well-understood programming practices (shared memory, C language) increases productivity

36 OpenMP De-facto standard for shared memory programming Support for nested (multi-level) parallelism good for clusters Annotations to incrementally convey parallelism to the compiler increased ease of use Based on well-understood programming practices (shared memory, C language) increases productivity OpenCL for programming shared memory multicore CPUs by Akhtar Ali, Usman Dastgeer, Christoph Kessler

OpenMP De-facto standard for shared memory programming Support for nested (multi-level) parallelism good for clusters Annotations to incrementally convey parallelism to the compiler increased ease of

37 OpenMP De-facto standard for shared memory programming Support for nested (multi-level) parallelism good for clusters Annotations to incrementally convey parallelism to the compiler increased ease of use Based on well-understood programming practices (shared memory, C language) increases productivity 2x to 10x less LOC OpenCL for programming shared memory multicore CPUs by Akhtar Ali, Usman Dastgeer, Christoph Kessler

38 OpenMP De-facto standard for shared memory programming Support for nested (multi-level) parallelism good for clusters Annotations to incrementally convey parallelism to the compiler increased ease of use Based on well-understood programming practices (shared memory, C language) increases productivity Designed for uniform SMP with main shared memory Lacks constructs to control accelerators But 1. And compilation toolchain to deal with multiple ISA and multiple runtime systems too!!

39 OpenMP De-facto standard for shared memory programming Support for nested (multi-level) parallelism good for clusters Annotations to incrementally convey parallelism to the compiler increased ease of use Based on well-understood programming practices (shared memory, C language) increases productivity Designed for uniform SMP with main shared memory Lacks constructs to control accelerators But 1. And compilation toolchain to deal with multiple ISA and multiple runtime systems too!! Intel s Parallel Universe magazine May 2014

40 Open-Next OpenMP runtime 4.0 What snew? UNTIED tasks

41 Open-Next OpenMP runtime Comparison with other OpenMP implementations RECURSIVE libgomp iomp >> x86 (Intel Haswell GHz.) libgomp: GNU OpenMP implementation (GCC 4.9.2) iomp: Intel OpenMP implementation (ICC )

42 Open-Next OpenMP runtime 4.0 Comparison with other tasking runtimes RECURSIVE nanos libgomp iomp Intel CILK+ ICC (15.0.2) Intel TBB ICC (15.0.2) WOOL GCC (4.9.2) >> x86 (Intel Haswell GHz.) nanos: BSC OpenSS (Mercurium Nanos++) Intel CILK+: ICC Intel TBB: ICC Wool: GCC 4.9.2

43 Time Predictability At compile-time, generate the TDG that includes timing information to consider the tasks communication At design-time, assign the TDG to OS threads (mapping) At run-time, schedule OS threads to achieve both predictability and high-performance (scheduling) source code #pragma omp #pragma omp Compiler C/C++ binary code newtask() newtask() etdg Static Scheduler + Timing Analysis OpenMP RTE Dispatcher Many-core Compile-time Design time Run-time

44 Time Predictability At compile-time, generate the TDG that includes timing information to consider the tasks communication At design-time, assign the TDG to OS threads (mapping) At run-time, schedule OS threads to achieve both predictability and high-performance (scheduling) source code #pragma omp #pragma omp Compiler C/C++ binary code newtask() newtask() etdg Static Scheduler + Timing Analysis OpenMP RTE Dispatcher Many-core Compile-time Design time Run-time

45 Open-Next Offload using OpenMP

46 Open-Next Offload using OpenMP void main(){ int a[]; int ker_id; { } /* some code here */ #pragma omp offload \ shared (a) \ name ( myker, ker_id) \ nowait #pragma omp parallel sections \ proc_bind (spread) { } #pragma omp section TASK_A(); #pragma omp section TASK_B(); /* some more code here */ #pragma omp wait (ker_id) }

47 Open-Next Offload using OpenMP void main(){ int a[]; int ker_id; new OpenMP directive used to offload the execution of a code block to the accelerator { } /* some code here */ #pragma omp offload \ shared (a) \ name ( myker, ker_id) \ nowait #pragma omp parallel sections \ proc_bind (spread) { } #pragma omp section TASK_A(); #pragma omp section TASK_B(); /* some more code here */ #pragma omp wait (ker_id) }

48 Open-Next Offload using OpenMP void main(){ int a[]; int ker_id; { } /* some code here */ #pragma omp offload \ shared (a) \ name ( myker, ker_id) \ nowait #pragma omp parallel sections \ proc_bind (spread) { } #pragma omp section TASK_A(); #pragma omp section TASK_B(); new OpenMP directive used to offload the execution of a code block to the accelerator shared clause specifies data that needs to be shared between the host and accelerator /* some more code here */ #pragma omp wait (ker_id) }

49 Open-Next Offload using OpenMP void main(){ int a[]; int ker_id; { } /* some code here */ #pragma omp offload \ shared (a) \ name ( myker, ker_id) \ nowait #pragma omp parallel sections \ proc_bind (spread) { } #pragma omp section TASK_A(); #pragma omp section TASK_B(); new OpenMP directive used to offload the execution of a code block to the accelerator shared clause specifies data that needs to be shared between the host and accelerator New name clause retrieves a handle (an ID) to the kernel instance at runtime, necessary to wait for asynchronous offloads /* some more code here */ #pragma omp wait (ker_id) }

50 Open-Next Offload using OpenMP void main(){ int a[]; int ker_id; { } /* some code here */ #pragma omp offload \ shared (a) \ name ( myker, ker_id) \ nowait #pragma omp parallel sections \ proc_bind (spread) { } #pragma omp section TASK_A(); #pragma omp section TASK_B(); new OpenMP directive used to offload the execution of a code block to the accelerator shared clause specifies data that needs to be shared between the host and accelerator New name clause retrieves a handle (an ID) to the kernel instance at runtime, necessary to wait for asynchronous offloads Specify asynchronous offloads /* some more code here */ #pragma omp wait (ker_id) }

51 Open-Next Offload using OpenMP void main(){ int a[]; int ker_id; { } /* some code here */ #pragma omp offload \ shared (a) \ name ( myker, ker_id) \ nowait #pragma omp parallel sections \ proc_bind (spread) { } #pragma omp section TASK_A(); #pragma omp section TASK_B(); new OpenMP directive used to offload the execution of a code block to the accelerator shared clause specifies data that needs to be shared between the host and accelerator New name clause retrieves a handle (an ID) to the kernel instance at runtime, necessary to wait for asynchronous offloads Specify asynchronous offloads All standard OpenMP and custom extensions can be used within an offload block /* some more code here */ #pragma omp wait (ker_id) }

52 Open-Next Offload using OpenMP void main(){ int a[]; int ker_id; { } /* some code here */ #pragma omp offload \ shared (a) \ name ( myker, ker_id) \ nowait #pragma omp parallel sections \ proc_bind (spread) { } #pragma omp section TASK_A(); #pragma omp section TASK_B(); /* some more code here */ #pragma omp wait (ker_id) } new OpenMP directive used to offload the execution of a code block to the accelerator shared clause specifies data that needs to be shared between the host and accelerator New name clause retrieves a handle (an ID) to the kernel instance at runtime, necessary to wait for asynchronous offloads Specify asynchronous offloads All standard OpenMP and custom extensions can be used within an offload block TASK_A(){ int i; #pragma omp parallel proc_bind (close) #pragma omp for for( i=0;. ) do_smthg(); }

53 Early evaluation FAST CT Mahala Strassen NCC SHOT Kernel Repetitions Kernel Repetitions Speedup Speedup

54 Early evaluation Speedup FAST CT NCC Kernel Repetitions Speedup Mahala Strassen Simplifying Heterogeneous Embedded SoC Programming with Directive-Based Offload Marongiu, Capotondi, Tagliavini, Benini IEEE Transactions Industrial Informatics 2015 Kernel Repetitions SHOT

55 Accelerator Resource Sharing OpenMP OpenCL OpenVX TBB Hardware Abstraction Layer

56 Accelerator Resource Sharing there is not dominant standard PPM OpenMP OpenCL OpenVX TBB Hardware Abstraction Layer

57 Accelerator Resource Sharing there is not dominant standard PPM OpenMP OpenCL OpenVX TBB low-level Runtime Hardware Abstraction Layer

58 Accelerator Resource Sharing there is not dominant standard PPM Improve overall utilization of accelerators in multiuser environment OpenMP OpenCL OpenVX TBB low-level Runtime Hardware Abstraction Layer

59 Accelerator Resource Sharing there is not dominant standard PPM Improve overall utilization of accelerators in multiuser environment OpenMP OpenCL OpenVX TBB on PMCAs RTEs are typically developed on top of bare metal low-level Runtime Hardware Abstraction Layer

60 Accelerator Resource Sharing there is not dominant standard PPM Improve overall utilization of accelerators in multiuser environment OpenMP OpenCL OpenVX TBB on PMCAs RTEs are typically developed on top of bare metal low-level Runtime Hardware Abstraction Layer Legacy Applications

61 Accelerator Resource Sharing OpenMP O 1 OpenCL OpenVX O 2 O 3 O N HOST driver

62 Accelerator Resource Sharing OpenMP O 1 OpenCL OpenVX O 2 O 3 O N HOST driver Lightweight Spatial Partitioning Support

63 Accelerator Resource Sharing OpenMP Virtual Accelerators O 1 OpenCL OpenVX O 2 O 3 O N HOST driver Lightweight Spatial Partitioning Support

64 Accelerator Resource Sharing Virtual Accelerators O 2 O 1 O 3 O N HOST driver Lightweight Spatial Partitioning Support

65 Runtime Efficiency: Computer Vision Use-Case

66 Runtime Efficiency: Computer Vision Use-Case ORB Object Detector (OpenCL 4 Clusters)[1] Face Detector (OpenCL 1 Cluster)[2] FAST Corner Detector (OpenMP 1 Cluster)[3] Removal Object Detector (OpenMP 4 Clusters)[4]

Rublee, Ethan, et al. "ORB: an efficient alternative to SIFT or SURF.

[2] Jones, Michael, et al. "Fast multi-view face detection.

"Faster and better: A machine learning approach to corner detection.

67 Runtime Efficiency: Computer Vision Use-Case ORB Object Detector (OpenCL 4 Clusters)[1] Face Detector (OpenCL 1 Cluster)[2] FAST Corner Detector (OpenMP 1 Cluster)[3] [1] Rublee, Ethan, et al. "ORB: an efficient alternative to SIFT or SURF."Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, [2] Jones, Michael, et al. "Fast multi-view face detection." Mitsubishi Electric Research Lab TR (2003): 14. [3] Rosten, et al. "Faster and better: A machine learning approach to corner detection." Pattern Analysis and Machine Intelligence, IEEE Transactions on 32.1 (2010): [4] Magno, Michele, et al. "Multimodal abandoned/removed object detection for low power video surveillance systems." Advanced Video and Signal Based Surveillance, AVSS'09. Sixth IEEE International Conference on. IEEE, Removal Object Detector (OpenMP 4 Clusters)[4]

68 Runtime Efficiency: Computer Vision Use-Case 100% 80% Efficiency % vs Ideal 60% 40% 20% 0% #Frames MPM-MO SPM-MO (0%) SPM-MO (25%) SPM-MO (50%) SPM-MO (100%) SPM-SO

69 Runtime Efficiency: Computer Vision Use-Case 100% 90% efficient wrt ideal +30% efficiency wrt SPM-MO +40% efficiency wrt SPM-SO Efficiency % vs Ideal 80% 60% 40% 20% 0% #Frames MPM-MO SPM-MO (0%) SPM-MO (25%) SPM-MO (50%) SPM-MO (100%) SPM-SO

70 Heterogeneous unified shared memory Shared Memory for Accelerators in Embedded SoCs No clear view about practical implementation aspects and performance implications of virtual memory support.

71 Heterogeneous unified shared memory Shared Memory for Accelerators in Embedded SoCs No clear view about practical implementation aspects and performance implications of virtual memory support. Today s reality: Memory partitioning

72 Heterogeneous unified shared memory Shared Memory for Accelerators in Embedded SoCs No clear view about practical implementation aspects and performance implications of virtual memory support. Today s reality: Memory partitioning Coherent virtual memory for host.

73 Heterogeneous unified shared memory Shared Memory for Accelerators in Embedded SoCs No clear view about practical implementation aspects and performance implications of virtual memory support. Today s reality: Memory partitioning Coherent virtual memory for host. Accelerator can only access contiguous section in shared main memory, no virtual memory.

74 Heterogeneous unified shared memory Shared Memory for Accelerators in Embedded SoCs No clear view about practical implementation aspects and performance implications of virtual memory support. Today s reality: Memory partitioning Coherent virtual memory for host. Accelerator can only access contiguous section in shared main memory, no virtual memory. Explicit data management involving copies: Limited programmability Low performance

Today s reality: Memory partitioning Open-Next goal: Lightweight Virtual Memory Support > Sharing of

75 Heterogeneous unified shared memory Shared Memory for Accelerators in Embedded SoCs No clear view about practical implementation aspects and performance implications of virtual memory support. Today s reality: Memory partitioning Open-Next goal: Lightweight Virtual Memory Support > Sharing of virtual address pointers >Transparent to the application developer Coherent virtual >Zero-copy offload, memory for higher host. predictability >Low complexity, low area, low cost Explicit data management involving copies: Limited programmability Low performance Accelerator can only access contiguous section in shared main memory, no virtual memory.

76 Heterogeneous unified shared memory Heterogeneous Systems Increase computing power and energy efficiency. Execute control intensive and sequential tasks. Fine-grained offloading of highly parallel tasks. Communicate via coherent shared memory. IOMMU for huma in high-end SoCs.

77 Heterogeneous unified shared memory Heterogeneous Systems Increase computing power and energy efficiency. OFFLOAD Execute control intensive and sequential tasks. Fine-grained offloading of highly parallel tasks. Communicate via coherent shared memory. IOMMU for huma in high-end SoCs.

78 Heterogeneous unified shared memory Heterogeneous Systems Increase computing power and energy efficiency. OFFLOAD ZERO-COPY (transparent) virtual pointer sharing Execute control intensive and sequential tasks. Fine-grained offloading of highly parallel tasks. Moves the complexity from the software to the hardware Communicate via coherent shared memory. IOMMU for huma in high-end SoCs.

Execute control intensive and sequential tasks.

FPGA CNN Deep- Learning Accelerator Fine-grained

79 Not only many-core accelerator! Heterogeneous unified shared memory Heterogeneous Systems Increase computing power and energy efficiency. OFFLOAD ZERO-COPY (transparent) virtual pointer sharing Execute control intensive and sequential tasks. Moves the complexity from the software to the hardware FPGA CNN Deep- Learning Accelerator Fine-grained offloading of highly parallel tasks. Communicate via coherent shared memory. IOMMU for huma in high-end SoCs.

80 HETEROGENEOUS UNIFIED SHARED MEMORY Low-cost IOMMU

81 HETEROGENEOUS UNIFIED SHARED MEMORY Low-cost IOMMU Host: Dual-Core ARM Cortex-A9 Linux Kernel 3.13

82 HETEROGENEOUS UNIFIED SHARED MEMORY Low-cost IOMMU Host: Dual-Core ARM Cortex-A9 Linux Kernel 3.13 Accelerator: PULP implemented in the FPGA (

83 HETEROGENEOUS UNIFIED SHARED MEMORY Low-cost IOMMU Host: Dual-Core ARM Cortex-A9 Linux Kernel 3.13 Accelerator: PULP implemented in the FPGA ( First open-source RISC-V core

84 Open-Next CIRI-ICT Activities Identificazione del programming model di riferimento per le piattaforme multi- e many-core che implementano gli use-case del progetto Implementazione di meccanismi software per facilitare la programmazione e rendere piu' efficiente lo scambio di dati in architetture eterogenee "shared memory" composte da host con supporto per memoria virtuale ed acceleratori senza supporto per memoria virtuale (es. GPU, DSP, FPGA) Implementazione di meccanismi software per la gestione ad alto livello di funzioni accelerate attraverso l'utilizzo di hardware dedicato (FPGA) Identificazione di possibili estensioni al programming model per la prossima generazione di impianti industriali real-time Porting di kernel significativi estratti dalle applicazioni che implementano gli use-cases e analisi delle performance

85 Open-Next CIRI-ICT Unibo Your Industrial use-cases! >10 year experience on embedded many-core programming 36 pm on industrial usecases exploration Move from workstation to efficient embedded systems!

86 Elaborazione di dati ad elevate prestazioni e bassa potenza su architetture embedded many-core e FPGA

Elaborazione dati real-time su architetture embedded many-core e FPGA

Elaborazione dati real-time su architetture embedded many-core e FPGA DAVIDE ROSSI A L E S S A N D R O C A P O T O N D I G I U S E P P E T A G L I A V I N I A N D R E A M A R O N G I U C I R I - I C T