Real-time data processing on embedded many-core and FPGA architectures


Real-time data processing on embedded many-core and FPGA architectures
DAVIDE ROSSI, ALESSANDRO CAPOTONDI, GIUSEPPE TAGLIAVINI, ANDREA MARONGIU
CIRI-ICT, Università di Bologna

Heterogeneous Many-Core Architectures

Heterogeneous Many-Core Architectures
Various design schemes are available:
- Targeting ADAPTIVITY: heterogeneity and specialization for efficient computing (e.g. ARM big.LITTLE). Embedded systems need to be capable of processing workloads usually tailored for workstations or HPC.
- Targeting PARALLELISM: massively parallel many-core accelerators that maximize GOPS/Watt (e.g. GPUs, GPGPUs, PMCAs).
Multi-Processor Systems-on-Chip (MPSoCs): computing units embedded in the same die, designed to deliver high performance at low power consumption = high energy efficiency (GOPS/Watt).
Example: the NVIDIA Tegra K1 has two levels of heterogeneity:
- Host processor: 4 powerful cores + 1 energy-efficient core
- Parallel many-core co-processor: a 192-core accelerator (NVIDIA Kepler GPU)

NVIDIA Jetson TK1
Hardware features:
- Dimensions: 5" x 5" (127mm x 127mm) board
- Tegra K1 SoC (1 to 5 Watts):
  - NVIDIA Kepler GPU with 192 CUDA cores (326 GFLOPS)
  - NVIDIA "4-Plus-1" 2.32 GHz quad-core ARM Cortex-A15
  - DRAM: 2GB DDR3L 933MHz
I/O features:
- mini-PCIe, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO
Software features:
- CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codec including H.264, OpenCV4Tegra

NVIDIA Jetson TX1
Hardware features:
- Dimensions: 8" x 8" board
- Tegra X1 SoC (15 Watts):
  - NVIDIA Maxwell GPU with 256 NVIDIA CUDA cores (1 TFLOP/s)
  - Quad-core ARM Cortex-A57 MPCore processor
  - 4 GB LPDDR4 memory
I/O features:
- PCIe x4, 5MP CSI, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO
Software features:
- CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codec including H.264, OpenCV4Tegra

TI Keystone II
Hardware features:
- Dimensions: 8" x 8" board
- TI 66AK2H12 SoC (14 Watts):
  - 8x C66x DSPs @ 1.2 GHz (304 GMACs)
  - Quad-core ARM Cortex-A15 MPCore
  - Up to 4 GB DDR3 memory
I/O features:
- PCIe, SD/MMC card, USB 3.0/2.0, 2x Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 38x GPIO, HyperLink, SRIO, 20x 64-bit timers, security accelerators
Software features:
- OpenMP, OpenCL

Kalray MPPA

Xilinx Zynq All Programmable SoC
Hardware features:
- Optimized for quickly prototyping embedded applications using Zynq-7000 SoCs
- Hardware, design tools, IP, and pre-verified reference designs
- Demonstrates an embedded design targeting a video pipeline
- Advanced memory interface with 1GB DDR3 component memory and 1GB DDR3 SODIMM memory
Connectivity features:
- Serial connectivity with PCIe Gen2 x4, SFP+ and SMA pairs, USB OTG, UART, IIC
- Embedded processing with dual ARM Cortex-A9 cores
- Networking applications with 10/100/1000 Mbps Ethernet (RGMII)
- Video display applications with HDMI out
- I/O expansion via the FPGA Mezzanine Card (FMC) interface
Programming features:
- C/C++, VHDL/Verilog, SDSoC / OpenCL flows

Heterogeneous SoCs: Memory Model
- Strong push for a unified memory model in heterogeneous SoCs (shared DRAM)
- Good for programmability (see the sketch below)
- Optimized to reduce performance loss
- How about predictability?
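
To make the programmability point concrete, here is a minimal sketch, assuming an OpenMP 5.0 toolchain with unified shared memory support (the array name and size are ours): on a shared-DRAM SoC a single allocation is visible to both host and accelerator with no explicit copies, which is precisely why accelerator traffic can interfere with the host.

    /* Minimal sketch, assuming OpenMP 5.0 unified shared memory support.
     * One malloc'd buffer is touched by both the host and the accelerator
     * with no map() copies: good for programmability, but both actors now
     * contend for the same DRAM. */
    #include <stdio.h>
    #include <stdlib.h>

    #pragma omp requires unified_shared_memory

    int main(void) {
        size_t n = 1 << 20;
        float *a = malloc(n * sizeof *a);    /* visible to host and device */
        for (size_t i = 0; i < n; i++)
            a[i] = 1.0f;

        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; i++)
            a[i] *= 2.0f;                    /* device reads/writes host DRAM */

        printf("a[0] = %.1f\n", a[0]);       /* expect 2.0 */
        free(a);
        return 0;
    }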

Maxwell GPGPU Architecture (Tegra X1): Memory Hierarchy
- On-chip shared memory: explicitly managed data accesses; no interference between host and accelerator
- Instruction/data caches: refill from system memory; can cause interference between host and accelerator

How large can the interference in execution time between the two subsystems be?
NVIDIA Tegra X1:
- Up to 3x GPU execution slowdown with a mix of synthetic workloads and real benchmarks (PolyBench)
- Up to 4x CPU execution slowdown with synthetic workloads modeling high interference
Slowdown is measured vs. execution in isolation (a sketch of such a synthetic interference workload is given below).
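
The following sketch shows the kind of synthetic CPU workload typically used to provoke such interference; the buffer size, thread count, and names are our assumptions, not the exact setup behind the numbers above. Each core streams through a buffer far larger than the last-level cache, so nearly every access goes to DRAM and steals bandwidth from the GPU.

    /* Hypothetical interference generator: each thread sweeps a buffer far
     * larger than the A57 cluster's 2 MiB L2, so essentially every access
     * misses and goes to DRAM. Run it alongside a GPU benchmark to measure
     * the slowdown. Compile with -fopenmp. */
    #include <stdint.h>
    #include <stdlib.h>

    #define BUF_BYTES (64u << 20)    /* 64 MiB per thread (assumption) */

    static volatile uint64_t sink;   /* keeps the loop from being optimized away */

    int main(void) {
        #pragma omp parallel num_threads(4)    /* one hammer per A57 core */
        {
            uint64_t *buf = calloc(BUF_BYTES / sizeof(uint64_t), sizeof(uint64_t));
            uint64_t acc = 0;
            for (;;) {                         /* run until killed */
                for (size_t i = 0; i < BUF_BYTES / sizeof(uint64_t); i += 8)
                    acc += buf[i];             /* touch one 64-byte line per step */
                sink = acc;
            }
        }
        return 0;
    }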

How large can the interference in execution time between the two subsystems be? HOW ABOUT REAL WORKLOADS?
Rodinia benchmarks executing on both the GPU and the CPU (on all 4 A57 cores, one is observed): up to 2.5x slower execution under mutual interference.
NVIDIA Tegra X1, CPU slowdown. Rows: observed CPU benchmark. Columns (co-running GPU benchmark, left to right): euler3d, sc_gpu, lud_cuda, srad_v1, srad_v2, heartwall, leukocyte, needle, backprop, kmeans, particlefilter_float, pathfinder, lavamd.

srad_v1          1.7  1.5  1.2  1.4  1.4  1.2  1.2  1.2  1.4  1.6  1.2  1.2  1.2
srad_v2          1.2  1.4  1.1  1.3  1.2  1.1  1.1  1.1  1.3  1.5  1.1  1.2  1.1
needle           1.6  2.0  1.3  1.9  1.9  1.3  1.3  1.3  1.9  2.5  1.3  1.7  1.3
euler3d_cpu      1.2  1.4  1.1  1.1  1.3  1.1  1.1  1.1  1.3  1.5  1.1  1.2  1.1
streamcluster    1.5  1.9  1.3  1.3  1.8  1.3  1.3  1.3  2.0  2.0  1.3  1.4  1.3
lud_omp          1.1  1.3  1.0  1.2  1.2  1.1  1.0  1.0  1.2  1.4  1.0  1.1  1.0
heartwall        1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.1  1.0
particle_filter  1.2  1.2  1.2  1.3  1.1  1.1  1.2  1.1  1.1  1.1  1.1  1.3  1.1
mummergpu        1.6  2.1  1.3  1.4  1.9  1.4  1.4  1.3  2.0  2.5  1.4  1.7  1.4
bfs              2.1  1.7  1.6  1.9  1.8  1.7  1.5  1.7  1.6  1.6  1.5  1.8  1.7
backprop         1.9  2.4  1.8  2.0  2.4  1.5  1.6  1.5  2.3  2.5  1.7  2.0  1.6
nn               1.2  1.2  1.2  1.2  1.2  1.2  1.2  1.2  1.2  1.2  1.2  1.2  1.2
hotspot          1.5  1.5  1.5  1.5  1.7  1.5  1.7  1.7  1.5  1.8  1.6  1.5  1.5
hotspot3d        1.5  2.0  1.5  1.6  1.9  1.4  1.5  1.9  2.4  1.9  1.6  1.8  1.6
kmeans           1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.0
lavamd           1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
myocyte.out      1.1  1.1  1.0  1.1  1.1  1.0  1.1  1.0  1.1  1.2  1.1  1.1  1.1
pathfinder       1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1
b+tree           1.0  1.1  1.0  1.1  1.1  1.0  1.0  1.0  1.1  1.1  1.0  1.1  1.0
leukocyte        1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1  1.1
hotspot          1.7  1.6  1.6  1.6  1.6  1.7  1.6  1.7  1.6  1.6  1.7  1.5  1.7

How large can the interference in execution time between the two subsystems be? HOW ABOUT REAL WORKLOADS?
Rodinia benchmarks executing on both the GPU (observed values) and the CPU (on all 4 A57 cores): up to 33x slower execution under mutual interference.
NVIDIA Tegra X1, GPU slowdown. Rows: observed GPU benchmark. Columns (co-running CPU benchmark, left to right): srad_v1, srad_v2, needle, euler3d_cpu, sc_omp, lud_omp, heartwall, particle_filter, mummergpu, bfs, backprop, filelist_512k, hotspot, hotspot3D, kmeans_serial, lavamd, myocyte.out, pathfinder, b+tree.out, leukocyte, hotspot, classification.

euler3d                1.2   1.2   1.3   1.1   1.2   1.2   1.0   1.0   1.2   1.1   1.4   1.0   1.0   1.1   1.1   1.0   1.1   1.1   1.0   1.0   1.0   1.2
lud_cuda               1.5   1.5   1.3   1.1   1.2   1.1   1.0   1.0   2.2   1.1   1.6   1.0   1.0   1.2   1.1   1.0   1.2   1.1   1.0   1.0   1.0   2.4
srad_v1                1.2   5.2   1.3   1.3   1.4   1.2   1.1   1.3   1.4   1.6   1.5   1.2   1.2   1.2   1.2   1.1   1.1   1.2   1.1   1.1   1.1   3.1
srad_v2                1.4   8.9   1.5   1.1   1.1   1.9   1.0   6.5   1.3   3.6   5.6   1.2   1.8   1.1   1.1   1.0   1.2   1.5   1.3   1.0   2.9   3.6
heartwall              1.1   1.4   1.2   1.0   1.1   1.1   1.0   1.1   1.8   1.0   1.4   1.0   1.0   1.1   1.0   1.0   1.2   1.1   1.1   1.0   1.1   1.2
leukocyte              1.4   1.5   1.6   1.0   1.0   1.0   1.0   1.6   2.6   1.2   1.6   1.3   1.2   1.1   1.0   1.1   1.3   1.1   1.1   1.0   1.4   2.3
needle                 2.8   4.5   2.9   2.6   2.7   2.4   1.7  13.0   4.6   5.2   2.9   3.8   6.4   2.6   2.4   2.4   3.2   2.6   2.5   1.9   4.1  14.2
backprop               3.6  32.7   2.6   2.6   2.6  16.3   2.0   3.6   9.2   2.1   3.8   7.7   2.0   2.2   2.6   2.0   3.6   2.2   2.1   2.0   2.1   2.5
kmeans                19.3  21.0  14.4   1.2   2.2   4.4   1.0  29.8   4.2  28.9  22.7   9.5  19.9   2.9   4.4   1.0   1.9  15.7   5.4   1.2   8.3   3.8
particlefilter_float   1.2   1.3   1.3   1.1   1.1   1.1   1.0   1.3   1.8   1.1   1.2   1.1   1.5   1.1   1.1   1.0   1.2   1.1   1.0   1.1   1.0   1.1
pathfinder             8.0  19.3   8.2   1.1   7.5  10.5   6.4   9.3   4.1  27.2   9.2  12.1   5.4   9.6   3.4   1.6   7.4  10.8   7.8   1.1   8.2  27.7
lavamd                 1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.2   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.2

The Predictable Execution Model (PREM)
- Predictable interval: memory prefetching in the first phase; no cache misses in the execution phase; non-preemptive execution
- Requires compiler support for code restructuring (an illustrative PREM-restructured loop is sketched below)
- Originally proposed for (multi-core) CPUs; we study the applicability of this idea to heterogeneous SoCs
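
The following sketch (ours, not the compiler's actual output; mem_phase_begin()/mem_phase_end() are hypothetical hooks standing in for the memory arbitration mechanism) shows what a loop looks like after PREM restructuring: a memory phase prefetches a tile into local storage, the compute phase then runs without cache misses, and a final memory phase writes the results back.

    /* Illustrative PREM "predictable interval" for a vector add. */
    #define TILE 1024

    extern void mem_phase_begin(void);   /* hypothetical: acquire a DRAM slot */
    extern void mem_phase_end(void);     /* hypothetical: release the DRAM slot */

    void prem_vadd(const float *a, const float *b, float *c, int n) {
        static float la[TILE], lb[TILE], lc[TILE];   /* stands in for the SPM */
        for (int base = 0; base < n; base += TILE) {
            int len = n - base < TILE ? n - base : TILE;

            mem_phase_begin();                   /* memory phase: prefetch tile */
            for (int i = 0; i < len; i++) {
                la[i] = a[base + i];
                lb[i] = b[base + i];
            }
            mem_phase_end();

            for (int i = 0; i < len; i++)        /* compute phase: no cache misses */
                lc[i] = la[i] + lb[i];

            mem_phase_begin();                   /* memory phase: write back tile */
            for (int i = 0; i < len; i++)
                c[base + i] = lc[i];
            mem_phase_end();
        }
    }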

The Predictable Execution Model (PREM)
- System-wide co-scheduling of memory phases from multiple actors
- Requires runtime techniques for global memory arbitration
- Originally proposed for (multi-core) CPUs; we study the applicability of this idea to heterogeneous SoCs

A Heterogeneous Variant of PREM
- Current focus on GPU behavior (far more severely affected by interference than the CPU)
- SPM (scratchpad memory) as a predictable, local memory
- Implement PREM phases within a single offload
- Arbitration of main-memory accesses via timed interrupts + shared memory

CPU/GPU Synchronization
- Basic mechanism in place to control GPU execution
- CPU inactivity forced via the throttle-thread approach (MemGuard)
- GPU inactivity forced via active polling on the on-chip shared memory (GPUguard); a sketch of this polling scheme follows
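
A hedged sketch of the polling-based control just described; the flag layout and function names are illustrative, not the actual GPUguard interface.

    #include <stdatomic.h>

    typedef struct {
        _Atomic int gpu_may_access_dram;   /* written by the CPU-side arbiter */
    } guard_t;

    /* Accelerator-side wait, written as plain C for illustration: spin on a
     * flag held in on-chip shared memory until a memory phase is granted. */
    static inline void gpu_wait_for_memory_phase(guard_t *g) {
        while (!atomic_load_explicit(&g->gpu_may_access_dram,
                                     memory_order_acquire))
            ;   /* active polling on on-chip memory, no DRAM traffic */
    }

    /* CPU-side arbiter, invoked from a periodic timer: alternately grant
     * DRAM access to CPU and GPU (MemGuard-style throttling on the CPU
     * side, GPUguard-style gating on the GPU side). */
    void timer_tick(guard_t *g, int grant_gpu) {
        atomic_store_explicit(&g->gpu_may_access_dram, grant_gpu,
                              memory_order_release);
    }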

OpenMP Offloading Annotations

    #pragma omp target teams distribute parallel for
    for (i = LB; i < UB; i++)
        C[i] = A[i] + B[i];

- The annotated loop is offloaded to the accelerator
- The iteration space is divided over the available execution groups (SMs in CUDA)
- Loop iterations execute in parallel over the available threads
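
For reference, a self-contained version of the snippet above; the array size, initialization, and map() clauses are ours, not the slide's.

    /* Compile e.g. with an offloading-enabled clang:
     *   clang -fopenmp -fopenmp-targets=nvptx64 vadd.c */
    #include <stdio.h>
    #define N 4096

    int main(void) {
        static float A[N], B[N], C[N];
        for (int i = 0; i < N; i++) {
            A[i] = (float)i;
            B[i] = 2.0f * (float)i;
        }

        /* Offload: the iteration space is split across teams (SMs), and
         * iterations within a team run in parallel across threads. */
        #pragma omp target teams distribute parallel for map(to: A, B) map(from: C)
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];

        printf("C[10] = %.1f\n", C[10]);   /* expect 30.0 */
        return 0;
    }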

Loops
- Apply tiling to reorganize loops so that they operate on stripes of the original data structures whose footprint can be accommodated in the SPM (a before/after sketch follows)
- Well-established techniques can be leveraged to prepare loops so they expose the most suitable access pattern for SPM reorganization
- So far this is the main transformation applied
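
A minimal before/after view of the tiling transformation; SPM_FLOATS is an assumed scratchpad capacity, not a real device parameter.

    #define SPM_FLOATS 2048
    #define TILE (SPM_FLOATS / 2)   /* stripe holds one input and one output buffer */

    /* Before: one pass over the full arrays, footprint 2*n floats. */
    void scale(const float *in, float *out, int n) {
        for (int i = 0; i < n; i++)
            out[i] = 2.0f * in[i];
    }

    /* After: stripe-by-stripe passes, footprint 2*TILE floats, small
     * enough for each stripe to be staged in the SPM. */
    void scale_tiled(const float *in, float *out, int n) {
        for (int base = 0; base < n; base += TILE) {
            int len = n - base < TILE ? n - base : TILE;
            for (int i = 0; i < len; i++)
                out[base + i] = 2.0f * in[base + i];
        }
    }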

Clang OpenMP Compilation
- Loop tiling
- Loop outlining
- Separation
- Specialization

EVALUATION: PREDICTABILITY (PolyBench)
1. Near-zero variance when sizing PREM periods for the worst case
2. Up to 70x variance reduction w.r.t. the baseline!
3. What's the performance drop?
(Metric: execution time with interference / execution time with no interference)

EVALUATION: PREDICTABILITY vs PERFORMANCE (PolyBench)
1. Up to 6x slowdown w.r.t. the unmodified program without interference
2. Up to 11x improvement w.r.t. the unmodified program with interference
(Metric: execution time normalized w.r.t. unmodified baseline execution without interference)

EVALUATION: OVERHEADS (1)
1. Code refactoring
2. Synchronization
EVALUATION: OVERHEADS (2)
1. GPU idleness

EVALUATION: PATH PLANNER APPLICATION
- WCET: near-zero variance, 3x reduction in WCET
- Increased instruction count due to specialization and/or tiling; compiler optimizations possible
- Overhead due to GPU idleness: synchronization scheme, property of the workload
Reference: Björn Forsberg, Daniele Palossi, Andrea Marongiu, Luca Benini: "GPU-Accelerated Real-Time Path Planning and the Predictable Execution Model." ICCS 2017.

WORK IN PROGRESS
- Understanding key functionality/performance issues
- Extensively characterizing performance and overheads on real benchmarks
- PREMization of CPU code
- Extending the scope of the analysis to more control-oriented codes
- Extending the methodology to All Programmable SoCs (CPU + FPGA)

END! davide.rossi@unibo.it