Challenges of Heterogeneous MPSoC for Image Processing
Transcription
1 Challenges of Heterogeneous MPSoC for Image Processing
DGLR 2017
Walter Stechele, Institute for Integrated Systems, Technische Universität München
2 Overview
- Reconfigurable hardware
- Hardware <---> software migration
- Heterogeneous MPSoC
- Application mapping and resource-aware programming
- Case studies from driver assistance and robotic vision
3 AutoVision Processor
Table columns: Shape Engine, Contrast Engine, Taillight Engine, Optical Flow, C
Table rows (two columns marked X in each): Highway, Tunnel entrance, Tunnel, City
[Figure: C1, FPGA and C2 with I/O on an on-chip bus; engine slots Eng0 and Eng1 (e.g. ShapeEng), ICAP, MEM CTRL and Video IF; SDRAM holding the partial bitstreams MatE, CensusE, TaillightE, ConEng, ShapeEng and EdgeEng]
4 Optical Flow
[Figure: HW/SW processing chain with the blocks Census Transformation, Matching, Post Proc, List of features and Draw features]
Status: SW-Algorithm (Matlab), SW-Algorithm (OpenCV), Profiling, HW/SW Partitioning, HW Accelerator, Demonstrator
Results: optical flow (640x480, ~17000 features)
- Core2 Duo 1.86 GHz: 40 ms
- Engine: 2217 ms
5 Census transformation
1) Compute a signature for every pixel within frame $t_k$ (0 ms)
2) Same comparison in frame $t_{k+1}$ (40 ms)
3) Match the signatures from both images: correspondence / motion vector
[Figure: every neighbor of the center pixel x is compared with it and encoded as >, < or =; the resulting signatures of both frames are then matched]
[1] F Stein: Efficient Computation of Optical Flow Using the Census Transform, DAGM-Symposium, 2004
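Step 1 above can be sketched in C as follows. This is a minimal illustration, not the AutoVision implementation: the 3x3 neighborhood, the tolerance `eps` for the "=" case, the base-3 encoding and the name `census_signature` are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: ternary census signature of pixel (x, y).
 * Assumes an 8-bit grayscale image in row-major order and a 3x3 neighborhood.
 * Each neighbor contributes one base-3 digit:
 *   0 = darker than the center by more than eps   ("<")
 *   1 = within eps of the center                  ("=")
 *   2 = brighter than the center by more than eps (">") */
uint32_t census_signature(const uint8_t *img, int width, int height,
                          int x, int y, int eps)
{
    /* The 3x3 neighborhood must lie completely inside the image. */
    assert(x >= 1 && y >= 1 && x < width - 1 && y < height - 1);

    int center = img[y * width + x];
    uint32_t sig = 0;

    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0)
                continue;  /* the center is not compared with itself */
            int diff = img[(y + dy) * width + (x + dx)] - center;
            uint32_t digit = (diff > eps) ? 2u : (diff < -eps) ? 0u : 1u;
            sig = sig * 3u + digit;  /* append one base-3 digit */
        }
    }
    return sig;
}
```

Two pixels in consecutive frames are candidates for a correspondence when their signatures are equal.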
6 Algorithmic redesign: Software Version [1]
3 steps: low pass filter, Census transformation, Matching
- Signature = address/key: the signature serves as address, the pixel coordinates as value
- Non-consecutive memory regions (no bursting possible)
- Counter update requires a read for every write operation
- Memory consumption is too high because of the table-based indexing scheme
- Global matching: motion vectors across the whole image are possible
Algorithm unsuitable for FPGA implementation!
[Figure: frames $t_k$ and $t_{k+1}$ (m x n pixels) are entered into a table indexed by signature; each entry holds a counter and pixel coordinates (x,y)]
[1] F Stein: Efficient Computation of Optical Flow Using the Census Transform, DAGM-Symposium, 2004
7 Algorithmic redesign: Hardware Version [2]
3 steps: low pass filter, Census transformation, Matching
- Signature = value: the signature serves as value, the pixel coordinates as address
- Bursting possible
- Table-based approach removed completely -> no counter update
- Local matching: motion vectors are only possible within a neighborhood (225 parallel paths and comparisons in one clock cycle)
Algorithm in that form unsuitable for software!
[Figure: frames $t_k$ and $t_{k+1}$ (m x n pixels) stored as signature images]
[2] C Claus, A Laika, Li Jia, W Stechele: High performance FPGA based optical flow calculation using the census transformation, IV
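The local matching scheme can be illustrated with a sequential C sketch. It assumes that the 225 comparisons correspond to a 15x15 search window (radius 7), which the slide does not state, and the function name `match_local` is hypothetical. In hardware all candidates are compared in parallel; the two nested loops below only model that behavior.

```c
#include <assert.h>
#include <stdint.h>

#define RADIUS 7  /* assumption: 15 x 15 window = 225 candidate positions */

/* Hypothetical sketch of local matching. Signatures are stored per pixel
 * (pixel coordinates are the address, the signature is the value).
 * Returns the number of candidates in the next frame whose signature equals
 * the one at (x, y) in the previous frame. The displacement of the last hit
 * is written to dx_out and dy_out, so the motion vector is only meaningful
 * when the return value is exactly 1. */
int match_local(const uint32_t *sig_prev, const uint32_t *sig_next,
                int width, int height, int x, int y,
                int *dx_out, int *dy_out)
{
    assert(x >= 0 && x < width && y >= 0 && y < height);

    uint32_t ref = sig_prev[y * width + x];
    int hits = 0;

    for (int dy = -RADIUS; dy <= RADIUS; dy++) {
        for (int dx = -RADIUS; dx <= RADIUS; dx++) {
            int nx = x + dx;
            int ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= width || ny >= height)
                continue;  /* candidate lies outside the image */
            if (sig_next[ny * width + nx] == ref) {
                *dx_out = dx;
                *dy_out = dy;
                hits++;
            }
        }
    }
    return hits;
}
```

Because both signature images are read at consecutive addresses, this access pattern allows burst transfers, in contrast to the table lookup of the software version.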
8 Performance Comparison
                          SW                  HW
Bursting possible         no                  yes
Counter update required   yes                 no
Matching scheme           global              local
Platform                  Intel Core 2 Duo    FPGA, 2 embedded PPCs
Frequency                 1.86 GHz            100 MHz (FPGA), 300 MHz (PPC)
Stage times               187 ms, 3868 ms     395 ms, 592 ms, 123 ms
Total time                4055 ms*            2217 ms*
Power consumption         65 W (TDP)          10 W (TDP, < 1 W target)
Stages: image smoothing, census transformation, finding matches, drawing motion vectors (grouped into processing on C and processing on FPGA)
*Execution time for images with a resolution of 640x480 and approximately 17000 features detected
9 CensusEngine
10 MatchingEngine
11 AutoVision in a car
Special thanks to DFG, BMW, Xilinx, sensor-to-image
12 Object Recognition
- An approach used in computer vision to extract features and infer the contents of an image
- Enables the ARMAR robot to recognize objects and to carry out tasks such as object tracking and object grasping
- Consists of three main stages: Harris corner detection, SIFT feature extraction and SIFT feature matching
Pipeline: CAM -> Frame Buffer -> Stage-1 (Harris Corner) -> Stage-2 (SIFT Feature Extraction) -> Stage-3 (SIFT Feature Matching)
13 Heterogeneous MPSoC
Tiled hardware architecture with:
- Tightly Coupled Processor Array (TCPA) for image processing
- Invasive-Core (i-core) with special instruction support (SI)
- Loosely coupled LEON3 cores for high level algorithms
- Network-on-Chip (NoC)
- Tile Local Memory (TLM)
- Memory tile with interface to external DDR-II memory
- IO tile and Ethernet interface
14 Humanoid Robot ARMAR from KIT [Asfour et al]
Sensors: camera, microphone, accelerometer, pressure sensor, torque sensor, rotary encoder
15 Harris Corner Detection on TCPA
Tightly Coupled Processor Array [Teich et al]
- The TCPA consists of numerous lightweight processing elements (PE)
- The TCPA benefits from instruction-level and loop-level parallelism and significantly accelerates image processing algorithms
- Direct PE-to-PE communication channels result in continuous streaming of data from the surrounding buffers through the array
[Figure: TCPA tile with the processor array surrounded by four reconfigurable buffers, each with AG, GC and IM blocks; configuration manager with config memory and config loader; Config & Com Proc, Irq Ctrl, network adapter and AHB/APB bridge on the AMBA AHB and APB buses]
16 Mapping HCD on TCPA
- The TCPA prototype for HCD consists of two PEs
- Achieved a frame rate of 5 fps (640x480 pixels)
- The TCPA implementation is expected to consume less power due to its lightweight PE structure
[Figure: 3x3 TCPA with I/O buffers, AG, GC and IM blocks and configuration manager, connected via the AMBA bus to the Conf & Com Proc (LEON3) and memory]
17 SIFT Feature Matching on i-core [Henkel et al]
i-core: an extension of a LEON3 processor with a reconfigurable fabric, which allows loading application-specific accelerators at runtime
Flow: Start -> Harris Corner Detection -> SIFT Feature Extraction -> SIFT Feature Matching -> Visualize -> Stop
Euclidean distance, distance between p and q: $D(p, q) = \sum_k (p_k - q_k)^2$
    for (k = 0; k < ndimension; k++) {
        v = pquery[k] - pdata[k];   /* difference in dimension k */
        sum += v * v;               /* accumulate the squared difference */
    }
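The distance loop above can be wrapped into a self-contained function and used for a brute-force nearest-neighbor search. This is a sketch under assumptions: the `float` descriptor type, the flat database layout and the names `squared_distance` and `nearest_feature` are illustrative and not taken from the slides, and the i-core accelerator itself is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squared differences between two descriptors, as in the slide's loop. */
float squared_distance(const float *pquery, const float *pdata,
                       size_t ndimension)
{
    float sum = 0.0f;
    for (size_t k = 0; k < ndimension; k++) {
        float v = pquery[k] - pdata[k];
        sum += v * v;
    }
    return sum;
}

/* Hypothetical brute-force matcher: returns the index of the database
 * descriptor closest to pquery. The database is assumed to hold `count`
 * descriptors of `ndimension` floats each, stored back to back. */
size_t nearest_feature(const float *pquery, const float *database,
                       size_t count, size_t ndimension)
{
    assert(count > 0);

    size_t best = 0;
    float best_dist = squared_distance(pquery, database, ndimension);

    for (size_t i = 1; i < count; i++) {
        float d = squared_distance(pquery, database + i * ndimension,
                                   ndimension);
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}
```

Comparing squared distances is sufficient for finding the nearest neighbor, because the square root does not change the ordering.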
18 SIFT Feature Matching on i-core
Two memory ports provide a high-bandwidth connection (2x128 bits) to the tile-local memory
19 Homogeneous vs Heterogeneous
The three stages of the object recognition algorithm operate in a pipelined fashion
Hardware variants used for comparison:
- Homogeneous MPSoC: 2x3 tile design with four LEON3 PEs per tile
- Heterogeneous MPSoC: 2x3 tile design with one TCPA tile and one i-core
Pipeline: CAM -> Frame Buffer -> Stage-1 (Harris Corner on TCPA) -> Stage-2 (SIFT Feature Extraction on LEON3) -> Stage-3 (SIFT Feature Matching on i-core)
[Figure: timeline over TCPA, i-core and LEON3 with the tasks HCD-TCPA, SIFT-Extr-LEON3 and SIFT-Match-iCore]
20 Homogeneous vs Heterogeneous
                 Load LEON3   Load TCPA   Load i-core   Throughput   WOLT (msec)
Homogeneous      59%          0%          0%            97 frames    732
Heterogeneous    23%          62%         72%           97 frames    683
21 Additional Applications
Conventional task distribution
- App-1: Audio Filtering on TCPA
- App-2: Matrix Multiplication on i-core
- App-3: CAM -> Frame Buffer -> Stage-1 (Harris Corner on TCPA) -> Stage-2 (SIFT Feature Extraction on LEON3) -> Stage-3 (SIFT Feature Matching on i-core)
[Figure: timeline over TCPA, i-core and LEON3 for Frame 1 and Frame 2 with markers (a) to (d); tasks Audio-TCPA, MatrixMul-iCore, HCD-TCPA, HCD-LEON3, SIFT-Extr-LEON3, SIFT-Match-iCore and SIFT-Match-LEON3]
22 Additional Applications
Conventional task distribution
                 Load LEON3   Load TCPA   Load i-core   Throughput   WOLT (msec)
Obj-Recog Only   23%          62%         72%           97 frames    683
All Three Apps   17%          96%         67%           72 frames    1400
23 Conventional vs Resource-aware
Resource-aware task distribution
- App-1: Audio Filtering on TCPA
- App-2: Matrix Multiplication on i-core
- App-3: CAM -> Frame Buffer -> Stage-1 (Harris Corner on TCPA) -> Stage-2 (SIFT Feature Extraction on LEON3) -> Stage-3 (SIFT Feature Matching on i-core)
[Figure: timeline over TCPA, i-core and LEON3 for Frame 1 to Frame 3 with markers (e) to (h), (x) and (y); tasks Audio-TCPA, MatrixMul-iCore, HCD-TCPA, HCD-LEON3, SIFT-Extr-LEON3, SIFT-Match-iCore and SIFT-Match-LEON3]
24 Conventional vs Resource-aware
Resource-aware task distribution
                 Load LEON3   Load TCPA   Load i-core   Throughput   WOLT (msec)
Conventional     17%          96%         67%           72 frames    1400
Resource-aware   38%          80%         75%           98 frames    705
25 Summary for Heterogeneous MPSoC
What we could see so far:
- Resource-awareness helps task distribution on heterogeneous MPSoC
- It improves throughput and WOLT (worst observed latency time)
Now, what if no more C cores are available?
26 HCD and Additional Load
[Figure: execution time per frame number, expected vs observed]
[Figure: core count per frame number, required vs available]
[Figure: Harris Corner Detection and additional load running on the operating system and the many-core HW]
27 Harris Corner Detection - Stages
The Harris corner detection algorithm consists of three main stages (Stage-1 to Stage-3):
- Covariance: $I_x = p_1 - p_2$, $I_y = p_3 - p_4$
- Harris Map: $\begin{pmatrix} \sum_{w} I_x^2 & \sum_{w} I_x I_y \\ \sum_{w} I_x I_y & \sum_{w} I_y^2 \end{pmatrix} = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$
- Threshold: $R = ac - b^2 - k\,(a + c)(a + c)$
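The formulas above translate into a few lines of C. This is a sketch, not the TCPA mapping: it assumes uniform weights inside the window w, double-precision arithmetic, and hypothetical function names. The constant k is a free parameter; values around 0.04 to 0.06 are commonly used.

```c
#include <assert.h>

/* Accumulates the matrix entries a, b and c over n window samples.
 * ix[i] and iy[i] are the horizontal and vertical intensity differences
 * at sample i. Uniform window weights are an assumption of this sketch. */
void structure_tensor(const double *ix, const double *iy, int n,
                      double *a, double *b, double *c)
{
    assert(n > 0);
    *a = 0.0;
    *b = 0.0;
    *c = 0.0;
    for (int i = 0; i < n; i++) {
        *a += ix[i] * ix[i];
        *b += ix[i] * iy[i];
        *c += iy[i] * iy[i];
    }
}

/* Corner response R = ac - b^2 - k (a + c)^2. */
double harris_response(double a, double b, double c, double k)
{
    assert(k >= 0.0);
    double det = a * c - b * b;   /* determinant of the 2x2 matrix */
    double trace = a + c;         /* trace of the 2x2 matrix */
    return det - k * trace * trace;
}
```

A pixel is reported as a corner when its response R exceeds the chosen threshold.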
28 Subsampling by Dropping Pixels
Advantages:
- Reduces the computational load by dropping pixels
- Dropping every alternate pixel horizontally reduces the computations by 50%
- Dropping every alternate pixel horizontally and vertically reduces the workload by 75%
Disadvantages:
- Regions with and without corners are treated equally
- Memory reads/writes remain almost the same
- The low ratio of computation to memory access leads to poor scalability
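The 50% and 75% figures follow from simple counting, as the following C sketch shows. The function name and the step parameters are hypothetical; a step of 2 means that every alternate pixel is dropped in that direction.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pixels that are still processed when only every step_x-th
 * column and every step_y-th row is kept. */
size_t pixels_processed(size_t width, size_t height,
                        size_t step_x, size_t step_y)
{
    assert(step_x > 0 && step_y > 0);
    size_t cols = (width + step_x - 1) / step_x;   /* kept columns, rounded up */
    size_t rows = (height + step_y - 1) / step_y;  /* kept rows, rounded up */
    return cols * rows;
}
```

For a 640x480 frame this gives 307200 pixels without subsampling, 153600 with horizontal dropping (50% less) and 76800 with horizontal and vertical dropping (75% less).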
29 Masking Technique
Stages: Covariance -> Harris Map -> Threshold
- Threshold the covariance image to generate a mask unique to the input image, based on [Alkaabi 2004]
- Mask out regions with regular intensities and unmask the others
- Masked and unmasked regions appear in clusters, which makes the technique highly cache friendly
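Mask generation can be sketched as a single pass over the covariance image. The details are assumptions made for illustration: one 16-bit covariance value per pixel, a strict greater-than comparison, and the name `build_mask`. The slides do not specify which covariance quantity is thresholded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mask builder: mask[i] is 1 where the covariance value exceeds
 * mask_threshold (pixel stays unmasked and is processed by later stages)
 * and 0 otherwise (region with regular intensities, masked out).
 * Returns the number of unmasked pixels. */
size_t build_mask(const uint16_t *cov, uint8_t *mask, size_t npixels,
                  uint16_t mask_threshold)
{
    assert(cov != NULL && mask != NULL);

    size_t active = 0;
    for (size_t i = 0; i < npixels; i++) {
        mask[i] = (cov[i] > mask_threshold) ? 1u : 0u;
        active += mask[i];
    }
    return active;
}
```

The returned count is proportional to the work left for the Harris map and threshold stages, so raising the mask threshold trades accuracy for execution time.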
30 Masking Technique
Self-adaptive HCD algorithm with variable masking
- Negligible loss in precision & recall up to a mask threshold of 8
- Execution time reduced by 60% with a mask threshold of 100, keeping precision & recall at 88-90%
$\text{precision} = 1 - \frac{\#\text{incorrect matches}}{\#\text{correct} + \#\text{incorrect}}$
$\text{recall} = \frac{\#\text{correct matches}}{\#\text{total possible matches}}$
[Figure: resource usage, precision and recall for mask thresholds 0, 2, 4, 8, 16, 32, 64 and 128; Harris Corner Detection running on the operating system and the many-core HW]
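The two quality metrics above can be computed directly from match counts. A minimal sketch with hypothetical function names:

```c
#include <assert.h>

/* precision = 1 - incorrect / (correct + incorrect) */
double precision(unsigned correct, unsigned incorrect)
{
    assert(correct + incorrect > 0);
    return 1.0 - (double)incorrect / (double)(correct + incorrect);
}

/* recall = correct / total possible matches */
double recall(unsigned correct, unsigned total_possible)
{
    assert(total_possible > 0);
    return (double)correct / (double)total_possible;
}
```

For example, 3 correct and 1 incorrect match out of 4 possible matches give a precision of 0.75 and a recall of 0.75.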
31 Results: Conventional vs Resource-aware HCD
[Figure: execution time profile, duration (milliseconds) per frame number for RA and CN]
[Figure: accuracy, precision/recall rate per frame number for PR-RA, RE-RA, PR-CN and RE-CN]
32 Conclusion
- Reconfigurable hardware: hardware is not always fixed
- Optical flow in hardware and software: software is not simply mapped to hardware
- Heterogeneous MPSoC: static mapping is not sufficient
- Resource-aware computing: software adaptation is not limited to parameter tuning
33 References
- J A Colmenares, G Eads, S A Hofmeyr, S Bird, M Moretó, D Chou, B Gluzman, E Roman, D B Bartolini, N Mor et al: Tessellation: refactoring the OS around explicit resource containers with continuous adaptation, DAC 2013
- Henry Hoffmann, Jonathan Eastep, Marco D Santambrogio, Jason E Miller, Anant Agarwal: Application Heartbeats - A Generic Interface for Specifying Program Performance and Goals in Autonomous Computing Environments, ICAC 2010
- Henry Hoffmann, Stelios Sidiroglou, Michael Carbin, Sasa Misailovic, Anant Agarwal, Martin Rinard: Dynamic Knobs for Responsive Power-Aware Computing, ASPLOS 2012
- D B Bartolini, R Cattaneo, G C Durelli, M Maggio, M D Santambrogio, F Sironi: The autonomic operating system research project: achievements and future directions, DAC 2013
- Edoardo Paone, Davide Gadioli, Gianluca Palermo, Vittorio Zaccaria, Cristina Silvano: Evaluating Orthogonality between Application Auto-Tuning and Run-Time Resource Management for Adaptive OpenCL Applications, ASAP
34 References
- F Stein: Efficient Computation of Optical Flow Using the Census Transform, DAGM-Symposium, 2004
- S Alkaabi, F Deravi: Candidate pruning for fast corner detection, Electronics Letters, 2004
- [Teich et al]
- [Henkel et al]
- J Paul, W Stechele et al: Resource awareness on heterogeneous MPSoCs for image processing, Journal of Systems Architecture, Elsevier, 2015
- J Paul, W Stechele et al: Self-adaptive corner detection on MPSoCs through resource-aware programming, Journal of Systems Architecture, Elsevier, 2015
- C Claus, A Laika, Li Jia, W Stechele: High performance FPGA based optical flow calculation using the census transformation, Intelligent Vehicles Symposium,
More informationCogniSight, image recognition engine
CogniSight, image recognition engine Making sense of video and images Generating insights, meta data and decision Applications 2 Inspect, Sort Identify, Track Detect, Count Search, Tag Match, Compare Find,
More informationA Bus-based SoC Architecture for Flexible Module Placement on Reconfigurable FPGAs
The work was published in Proceedings of International Conference on Field-Programmable Logic and Applications (FPL 10), pp. 234-239 A Bus-based SoC Architecture for Flexible Module Placement on Reconfigurable
More informationECE/CS 757: Advanced Computer Architecture II Interconnects
ECE/CS 757: Advanced Computer Architecture II Interconnects Instructor:Mikko H Lipasti Spring 2017 University of Wisconsin-Madison Lecture notes created by Natalie Enright Jerger Lecture Outline Introduction
More informationMotion Estimation and Optical Flow Tracking
Image Matching Image Retrieval Object Recognition Motion Estimation and Optical Flow Tracking Example: Mosiacing (Panorama) M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003 Example 3D Reconstruction
More informationMapping applications into MPSoC
Mapping applications into MPSoC concurrency & communication Jos van Eijndhoven jos@vectorfabrics.com March 12, 2011 MPSoC mapping: exploiting concurrency 2 March 12, 2012 Computation on general purpose
More informationInterfacing a High Speed Crypto Accelerator to an Embedded CPU
Interfacing a High Speed Crypto Accelerator to an Embedded CPU Alireza Hodjat ahodjat @ee.ucla.edu Electrical Engineering Department University of California, Los Angeles Ingrid Verbauwhede ingrid @ee.ucla.edu
More informationRuntime Application Mapping Using Software Agents
1 Runtime Application Mapping Using Software Agents Mohammad Abdullah Al Faruque, Thomas Ebi, Jörg Henkel Chair for Embedded Systems (CES) Karlsruhe Institute of Technology Overview 2 Motivation Related
More informationCS 378: Autonomous Intelligent Robotics. Instructor: Jivko Sinapov
CS 378: Autonomous Intelligent Robotics Instructor: Jivko Sinapov http://www.cs.utexas.edu/~jsinapov/teaching/cs378/ Visual Registration and Recognition Announcements Homework 6 is out, due 4/5 4/7 Installing
More informationAdapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES
Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Tom Atwood Business Development Manager Sun Microsystems, Inc. Takeaways Understand the technical differences between
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationELCT 912: Advanced Embedded Systems
ELCT 912: Advanced Embedded Systems Lecture 2-3: Embedded System Hardware Dr. Mohamed Abd El Ghany, Department of Electronics and Electrical Engineering Embedded System Hardware Used for processing of
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationGiancarlo Vasta, Magneti Marelli, Lucia Lo Bello, University of Catania,
An innovative traffic management scheme for deterministic/eventbased communications in automotive applications with a focus on Automated Driving Applications Giancarlo Vasta, Magneti Marelli, giancarlo.vasta@magnetimarelli.com
More informationFast dynamic and partial reconfiguration Data Path
Fast dynamic and partial reconfiguration Data Path with low Michael Hübner 1, Diana Göhringer 2, Juanjo Noguera 3, Jürgen Becker 1 1 Karlsruhe Institute t of Technology (KIT), Germany 2 Fraunhofer IOSB,
More informationRuntime Reconfigurable Memory Hierarchy in Embedded Scalable Platforms
Runtime Reconfigurable Memory Hierarchy in Embedded Scalable Platforms Davide Giri Columbia University New York, USA davide_giri@cs.columbia.edu ABSTRACT In heterogeneous systems-on-chip, the optimal choice
More informationImplementing Flexible Interconnect Topologies for Machine Learning Acceleration
Implementing Flexible Interconnect for Machine Learning Acceleration A R M T E C H S Y M P O S I A O C T 2 0 1 8 WILLIAM TSENG Mem Controller 20 mm Mem Controller Machine Learning / AI SoC New Challenges
More informationAcceleration of Optical Flow Computations on Tightly-Coupled Processor Arrays
Acceleration of Optical Flow Computations on Tightly-Coupled Processor Arrays Éricles Rodrigues Sousa 1, Alexandru Tanase 1,VahidLari 1, Frank Hannig 1,Jürgen Teich 1, Johny Paul 2, Walter Stechele 2,
More information02 - Distributed Systems
02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/58 Definition Distributed Systems Distributed System is
More informationCopyright 2016 Xilinx
Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building
More information100M Gate Designs in FPGAs
100M Gate Designs in FPGAs Fact or Fiction? NMI FPGA Network 11 th October 2016 Jonathan Meadowcroft, Cadence Design Systems Why in the world, would I do that? ASIC replacement? Probably not! Cost prohibitive
More informationOptimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd
Optimizing ARM SoC s with Carbon Performance Analysis Kits ARM Technical Symposia, Fall 2014 Andy Ladd Evolving System Requirements Processor Advances big.little Multicore Unicore DSP Cortex -R7 Block
More informationEnabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center
Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto Leon-Garcia, Paul Chow University of Toronto 1 Cloudy with
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationAutomatic Pruning of Autotuning Parameter Space for OpenCL Applications
Automatic Pruning of Autotuning Parameter Space for OpenCL Applications Ahmet Erdem, Gianluca Palermo 6, and Cristina Silvano 6 Department of Electronics, Information and Bioengineering Politecnico di
More informationC-Based Hardware Design Platform for Dynamically Reconfigurable Processor
C-Based Hardware Design Platform for Dynamically Reconfigurable Processor September 22 nd, 2005 IPFlex Inc. Agenda Merits of C-Based hardware design Hardware enabling C-Based hardware design DAPDNA-FW
More information