A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators


A Framework for Optimizing OpenVX Applications on Embedded Many-Core Accelerators

Giuseppe Tagliavini, DEI, University of Bologna
Germain Haugou, IIS, ETHZ
Andrea Marongiu, DEI, University of Bologna & IIS, ETHZ
Luca Benini, DEI, University of Bologna & IIS, ETHZ

Outline
- Introduction
- ADRENALINE: virtual platform
- ADRENALINE: OpenVX runtime
- Conclusion

Many-core accelerators for signal/image processing

[Figure: energy-efficiency spectrum, in GOPS/W, from general-purpose computing to throughput computing: CPUs (~1), GPGPUs (~3-6), and HW IP (>100). Many-core accelerators fill the gap between SW programmability and fixed-function HW IP.]

Clustered many-core accelerators (CMA)

[Figure: SoC design with a multi-core host processor, L3 DDR3 memory, and a many-core accelerator built as a set of MPMD clusters. Each cluster contains processing elements (PEs), a low-latency shared L1 TCDM, a cluster controller (CC), a DMA engine for L1-L3 transfers, and a HW synchronizer (HWS); an L2 memory and a per-cluster memory are optional.]

Some examples: STM STHORM, Kalray MPPA, PULP.

PULP: Parallel Ultra-Low-Power platform

[Figure: PULP SoC. The SoC voltage domain (0.8V) hosts the L2 memory, peripherals, and bridges towards the cluster. The cluster voltage domain (0.5V-0.8V) contains N PEs with private instruction caches on a shared instruction bus, a hybrid L1 memory of M SRAM and SCM banks attached through RMUs, a low-latency interconnect, a DMA engine, and a peripheral interconnect. Peak efficiency: 400 GOPS/W.]

OpenVX overview
- Foundational API for vision acceleration
- Focus on mobile and embedded systems
- Stand-alone or complementary to other libraries
- Enables efficient implementations on different devices: CPUs, GPUs, DSPs, many-core accelerators

OpenVX programming model
The OpenVX model is based on a directed acyclic graph of nodes (kernels), with data objects (images) as linkage:

    vx_image imgs[] = {
        vxCreateImage(ctx, width, height, VX_DF_IMAGE_RGB),
        vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_U8),
        /* ... more virtual images (imgs[2]..imgs[4]) elided on the slide ... */
        vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8),
    };
    vx_node nodes[] = {
        vxColorConvertNode(graph, imgs[0], imgs[1]),
        vxSobel3x3Node(graph, imgs[1], imgs[2], imgs[3]),
        vxMagnitudeNode(graph, imgs[2], imgs[3], imgs[4]),
        vxThresholdNode(graph, imgs[4], thresh, imgs[5]),
    };
    vxVerifyGraph(graph);
    vxProcessGraph(graph);

OpenVX DAG
Virtual images are not required to actually reside in main memory:
- They define a data dependency between kernels, but they cannot be read/written by the application
- They are the main target of our optimization efforts

[Figure: example DAG. The input image I feeds the color-convert node (CC); virtual images V1-V4 link the Sobel (S), magnitude (M), and threshold (T) nodes down to the output image O.]

An OpenVX program must be verified to guarantee some mandatory properties:
- Inputs and outputs are compliant with the node interfaces
- No cycles in the graph
- Only a single writer node to any data object is allowed (writes have higher priority than reads)
- Virtual images must be resolved into concrete types
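As an illustration of the single-writer rule, the following minimal sketch (standard OpenVX 1.x calls, continuing the graph and input image from the previous example; the node choice is hypothetical) would make verification fail:

    /* Two nodes writing the same virtual image violate the
       single-writer rule, so vxVerifyGraph() reports an error. */
    vx_image tmp = vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_U8);
    vxGaussian3x3Node(graph, imgs[1], tmp);
    vxMedian3x3Node(graph, imgs[1], tmp);   /* second writer to tmp */
    vx_status status = vxVerifyGraph(graph); /* != VX_SUCCESS */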

ADRENALINE

[Figure: ADRENALINE overview. Test applications (OpenVX graphs) are mapped onto the virtual platform (PE0..PEn plus memory) according to a platform configuration, with runtime support and runtime policies.]

http://www-micrel.deis.unibo.it/adrenaline/

ADRENALINE: virtual platform

Virtual platform (1)
The virtual platform is written in Python and C++:
- Python is used for the architecture configuration
- C++ is used to provide an efficient implementation of the internal models
A library of basic components is available, but custom blocks can also be implemented and assembled.

Virtual platform (2)
Standard configuration:
- OpenRISC core: an Instruction Set Simulator (ISS) for the OpenRISC ISA, extended with timing models to emulate pipeline stalls
- Memories: multi-bank, constant-latency timing model
- L1 interconnect: one transaction per memory bank serviced at each cycle
- DMA: single synchronous request to the interconnect for each line to be transferred
- Shared instruction cache: dedicated interconnect and memory banks

Virtual platform (3)

[Figure: platform template. A host with L2 and L3 memories (parametric sizes) is connected to a cluster with a parametric number of PEs (OpenRISC ISSs), an L1 memory of parametric size, and a DMA engine with configurable bandwidth/latency.]

ADRENALINE: OpenVX runtime

A first solution: using OpenCL to accelerate OpenVX kernels
OpenCL is a widely used programming model for many-core accelerators.
First solution: OpenVX kernel == OpenCL kernel. When a node is selected for execution, the related OpenCL kernel is enqueued on the device.
Limiting factor: it consumes too much memory bandwidth!
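A minimal sketch of this node-to-kernel mapping on the host side, using the standard OpenCL API (the queue, kernel, and image buffers are assumed to be set up by the runtime; the function name is illustrative):

    #include <CL/cl.h>

    /* Run one OpenVX node as a pre-built OpenCL kernel. */
    void run_node(cl_command_queue queue, cl_kernel kernel,
                  cl_mem src_buf, cl_mem dst_buf,
                  size_t width, size_t height)
    {
        size_t global_size[2] = { width, height };  /* one work-item per pixel */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &src_buf);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst_buf);
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                               global_size, NULL, 0, NULL, NULL);
        clFinish(queue);  /* each node reads and writes full images in L3 */
    }

Note the structural problem: every node round-trips its full input and output images through main memory, which is where the bandwidth pressure comes from.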

OpenCL bandwidth
Experiments performed with the OpenCL runtime on a STHORM evaluation board (same results using the virtual platform).

[Bar chart, log scale: memory bandwidth (MB/s) of the OpenCL execution versus the available BW, per benchmark; the OpenCL demand exceeds what the platform can sustain by one to two orders of magnitude.]

OpenVX for CMA
We realized an OpenVX framework for many-core accelerators coupling a tiling approach with algorithms for graph partitioning and scheduling.
Main goals:
- Reduce the memory bandwidth
- Maximize the accelerator efficiency
Several steps are required:
- Tile size propagation
- Graph partitioning
- Node scheduling
- Buffer allocation
- Buffer sizing

Localized execution
Localized execution: when a kernel is executed by a many-core accelerator, read/write operations are always performed on local buffers in the L1 scratchpad memory.

[Figure: an RGB-to-grayscale kernel running on the PEs, with the DMA moving data between L3 and L1.]

- Reads/writes on L1 do not stall the PEs
- In real platforms the L1 is often too small to contain a full image
- In addition, multiple kernels require more L1 buffers
- During DMA transfers the cores are waiting

Localized execution with tiling
- Images are partitioned into smaller blocks (tiles)
- Double buffering overlaps data transfers and computation (see the sketch below)

[Figure: the same RGB-to-grayscale kernel, now operating on tiles streamed between L3 and L1 by the DMA.]

- Single tiles always fit the L1 memory
- Transfer latency is hidden by computation
- Tiling is not so trivial for all data access patterns
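A minimal sketch of the double-buffering scheme, assuming hypothetical dma_in_async()/dma_out_async()/dma_wait() primitives for the cluster DMA (the actual runtime API is not shown in the talk):

    #define TILE_BYTES 4096  /* assumption: one tile easily fits in L1 */

    /* Hypothetical cluster-DMA primitives */
    int  dma_in_async(void *l1_dst, const void *l3_src, int size);
    int  dma_out_async(void *l3_dst, const void *l1_src, int size);
    void dma_wait(int job);
    void process_tile(const unsigned char *in, unsigned char *out);

    void run_tiled(const unsigned char *l3_src, unsigned char *l3_dst, int num_tiles)
    {
        /* Two L1 buffers per stream: compute on one while filling the other */
        static unsigned char l1_in[2][TILE_BYTES], l1_out[2][TILE_BYTES];
        int in_job = dma_in_async(l1_in[0], l3_src, TILE_BYTES);  /* prologue */

        for (int t = 0; t < num_tiles; ++t) {
            int cur = t & 1;
            dma_wait(in_job);                     /* tile t is now resident in L1 */
            if (t + 1 < num_tiles)                /* prefetch tile t+1 meanwhile  */
                in_job = dma_in_async(l1_in[!cur],
                                      l3_src + (size_t)(t + 1) * TILE_BYTES,
                                      TILE_BYTES);
            process_tile(l1_in[cur], l1_out[cur]);  /* all accesses hit L1 */
            /* write-back kept synchronous for simplicity; a real runtime
               would overlap it with the next iteration as well */
            dma_wait(dma_out_async(l3_dst + (size_t)t * TILE_BYTES,
                                   l1_out[cur], TILE_BYTES));
        }
    }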

Common access patterns for image processing kernels
(A) POINT OPERATORS: compute the value of each output point from the corresponding input point. Support: basic tiling
(B) LOCAL NEIGHBOR OPERATORS: compute the value of a point in the output image from the corresponding input window. Support: tile overlapping (sketched below)
(C) RECURSIVE NEIGHBOR OPERATORS: like the previous ones, but also consider the previously computed values in the output window. Support: persistent buffer
(D) GLOBAL OPERATORS: compute the value of a point in the output image using the whole input image. Support: host exec / graph partitioning
(E) GEOMETRIC OPERATORS: compute the value of a point in the output image using a non-rectangular input window. Support: host exec / graph partitioning
(F) STATISTICAL OPERATORS: compute any statistical function of the image points. Support: graph partitioning
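For a local neighbor operator, tile overlapping means each input tile must be enlarged by the kernel radius on every side. A minimal sketch of this geometry (the helper and struct names are illustrative, not the ADRENALINE API):

    typedef struct { int x, y, w, h; } tile_t;

    /* Input tile required by a local neighbor operator, e.g. a 3x3
       Sobel has radius = 1. */
    tile_t input_tile_for(tile_t out, int radius, int img_w, int img_h)
    {
        tile_t in;
        in.x = out.x - radius;            /* grow by the kernel radius... */
        in.y = out.y - radius;
        in.w = out.w + 2 * radius;
        in.h = out.h + 2 * radius;
        if (in.x < 0) { in.w += in.x; in.x = 0; }   /* ...then clamp at */
        if (in.y < 0) { in.h += in.y; in.y = 0; }   /* the image borders */
        if (in.x + in.w > img_w) in.w = img_w - in.x;
        if (in.y + in.h > img_h) in.h = img_h - in.y;
        return in;
    }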

Tile size propagation

[Figure: example DAG with kernels K1-K5 and virtual images V1-V5 between the input image I and the output image O; tile sizes are propagated along the graph so that each kernel's input and output tiles are mutually consistent.]
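One way to realize this step, sketched below under the assumption that tile sizes are propagated backwards from the output, with each kernel enlarging its input tile by its own border (as in the overlap computation above); the data structure and values are illustrative:

    typedef struct { const char *name; int border; } kernel_info_t;

    /* Edge-detector chain from the earlier example: border = 0 for
       point operators, kernel radius for local neighbor operators. */
    static const kernel_info_t chain[] = {
        { "ColorConvert", 0 },
        { "Sobel3x3",     1 },
        { "Magnitude",    0 },
        { "Threshold",    0 },
    };

    /* Walk from the output kernel back to the input: every border met
       on the path inflates the tile that must be fetched from L3. */
    void propagate_tile(int out_w, int out_h, int *in_w, int *in_h)
    {
        *in_w = out_w;
        *in_h = out_h;
        for (int k = 3; k >= 0; --k) {
            *in_w += 2 * chain[k].border;
            *in_h += 2 * chain[k].border;
        }
    }
    /* A 64x16 output tile thus requires a 66x18 input tile at image I. */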

Example (1)

[Figure: a nested graph spanning two memory domains (L3 - L1 - L3). The portion mapped to the accelerator forms a sub-graph executed with adaptive tiling out of L1, while a host node and the graph inputs/outputs stay in L3.]

Example (2)

[Figure: execution timeline. The PEs alternate through the scheduled node phases (S, N, P, NM) while the host/cluster controller issues commands (a-e); DMA-in transfers (i1-i4) and DMA-out transfers (o1, o2) overlap the computation, and the L3 accesses of buffers B0-B5 are tracked over time.]

CMA kernel

    kernel void threshold(global unsigned char *src, int srcstride,
                          global unsigned char *dst, int dststride,
                          short width, short height, short bandwidth,
                          char nbcores, global unsigned char *params)
    {
        int i, j;
        int id = get_id();                    /* index of the executing PE */
        unsigned char threshold = params[4];
        int srcindex = 0, dstindex = 0;

        for (j = 0; j < height; ++j) {
            /* each core processes its own band of columns */
            for (i = id * bandwidth; i < (id + 1) * bandwidth && i < width; ++i) {
                unsigned char value = src[srcindex + i];
                dst[dstindex + i] = (value >= threshold ? value : 0);
            }
            srcindex += srcstride;
            dstindex += dststride;
        }
    }

The global space is remapped on local buffers!

Bandwidth reduction

[Bar chart, log scale: memory bandwidth (MB/s) of the OVX runtime versus OpenCL and the available BW, per benchmark; the tiling-based OVX runtime consumes substantially less bandwidth than the OpenCL baseline.]

Speed-up w.r.t. OpenCL

[Bar chart: speed-up of the OVX runtime over the OpenCL baseline, up to 9.61x across the benchmarks: random graph, edge detector, object detection, super resolution, FAST9, disparity, pyramid, optical flow, Canny, retina preprocessing, disparity S4.]

Conclusion

Next steps
Virtual platform:
- More accurate models
- Multi-cluster configuration
OpenVX runtime:
- Evolution of the models
- FPGA emulator

THANKS!!! Work supported by EU-funded research projects