Throughput Exploration and Optimization of a Consumer Camera Interface for a Reconfigurable Platform

By: Floris Driessen (f.c.driessen@student.tue.nl)

2 Introduction
- Video applications on embedded platforms
- Use of accelerators
  - Faster
  - Energy efficiency
- USB camera

3 Platform of Interest - ZYNQ
- Zedboard by Digilent
- Xilinx Zynq platform
  - Dual-core ARM Cortex-A9
  - Programmable logic
- 512 MB RAM
- USB connectivity
- HDMI output
- USB camera

4 Naïve implementation
Software:
1. Read camera frame
2. Copy frame to DMA region
3. Perform HW accelerated operation (Sobel)
4. Copy result from DMA region
5. Show result
A separate DMA region is needed due to the lack of DMA drivers.
[Block diagram: Zynq platform with USB, ARM core 0, ARM core 1, Linux RAM, programmable logic, DMA RAM]

5 Bottleneck Study
Performance limit:
- Converting the format
  - Camera output to accelerator input
- Copying from/to DMA region
  - Mmap
  - Not cached
- Frame capturing
[Block diagram: Zynq platform with USB, ARM core 0, ARM core 1, Linux RAM, programmable logic, DMA RAM]
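The cost of each stage listed above can be measured with a small timing helper. The sketch below is one assumed way to set up such a measurement, not the code behind the slides: `time_stage`, `stage_fn`, `fill_ctx` and `fill_stage` are hypothetical names, and it relies on POSIX `clock_gettime` with `CLOCK_MONOTONIC`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical stage signature: each pipeline step takes an opaque context. */
typedef void (*stage_fn)(void *ctx);

/* Run one stage and return its wall-clock duration in seconds. */
double time_stage(stage_fn stage, void *ctx)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    stage(ctx);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Placeholder stage for illustration only: writes every byte of a buffer. */
struct fill_ctx {
    unsigned char *buf;
    size_t len;
};

void fill_stage(void *ctx)
{
    struct fill_ctx *c = ctx;

    memset(c->buf, 0x80, c->len);
}
```

Wrapping the capture, convert, copy and accelerator steps in functions of this shape and timing each one separately gives a per-stage breakdown. Wall-clock time is used rather than CPU time so that waiting on the camera is included.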

6 Possible improvements
- Exploiting scratchpad
  - A frame would not fit
- DMA driver support
  - Not feasible within time frame of project
- Optimize the current implementation
  - Copying data
  - Converting format
  - Capturing camera frame

7 Format conversion
- Naïve implementation
  - Combined conversion and copy
  - Writing small chunks to mmapped memory (slow)
- Split conversion and copy
- OpenCV mixChannels
- NEON interleaving
  - ARM SIMD
  - Next slide

Implementation   Convert + copy [s]      Speed-up
Naïve            1,95                    1x
Split            0,28 + 0,04 = 0,32      6,1x
OpenCV           0,05 + 0,04 = …         …,7x
NEON             0,04                    50,6x

[Diagram: vld3.8 {d0-d2} loads 8 packed RGB pixels from memory (R0 G0 B0 R1 G1 B1 ...) and de-interleaves them into d0 = R7..R0, d1 = G7..G0, d2 = B7..B0; d3 holds the fill byte x; vst4.8 {d0-d3} stores them interleaved as R0 G0 B0 x R1 G1 B1 x ...]
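Why splitting helps can be illustrated in plain C. Writing each converted pixel straight into the uncached, mmapped DMA region issues many small writes. Converting into an ordinary cached buffer first and then moving the whole frame with a single `memcpy` avoids that. The sketch below is a scalar illustration, not the code measured in the table; the function names, the `staging` buffer and the 0xff fill byte are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Expand packed 3-byte pixels to 4-byte pixels, padding with 0xff. */
void rgb24_to_rgb32_scalar(const uint8_t *src, uint8_t *dst, size_t numpix)
{
    for (size_t i = 0; i < numpix; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0xff;
    }
}

/*
 * Split variant: convert into a cached staging buffer (numpix * 4 bytes),
 * then copy the finished frame to the destination in one call.
 * In the setup from the slides, `dma` would point into the mmapped DMA
 * region; here it is just a caller-supplied buffer (assumption).
 */
void convert_then_copy(const uint8_t *src, uint8_t *staging,
                       uint8_t *dma, size_t numpix)
{
    rgb24_to_rgb32_scalar(src, staging, numpix);
    memcpy(dma, staging, numpix * 4);
}
```

The staging buffer costs one extra pass over the frame, but that pass runs entirely in cached memory.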

8 NEON RGB24 to RGB32 conversion example

    void __attribute__((noinline))
    neonrgbtorgba_gas(unsigned char *src, unsigned char *dst, int numpix)
    {
        // numpix must be a non-zero multiple of 8: each iteration converts 8 pixels
        asm volatile(
            "   mov      %[n], %[n], lsr #3\n"       // numpix/8 iterations
            "   vmov.u8  d3, #0xff\n"                // load alpha channel value
            "1:\n"
            "   vld3.8   {d0,d1,d2}, [%[src]]!\n"    // load 8 rgb pixels with deinterleave
            "   pld      [%[src], #40]\n"            // preload next values
            "   pld      [%[src], #48]\n"
            "   pld      [%[src], #56]\n"
            "   subs     %[n], %[n], #1\n"           // subtract from loop counter
            // "   vswp     d0, d2\n"                // (disabled) swap the R and B channels
            "   vst4.8   {d0-d3}, [%[dst]]!\n"       // store as 4*8bit values
            "   bgt      1b\n"                       // loop if not ready
            // named operands instead of fixed r0-r2, so the compiler picks the registers
            : [src] "+r"(src), [dst] "+r"(dst), [n] "+r"(numpix)
            :
            : "d0", "d1", "d2", "d3", "cc", "memory"
        );
    }

[Diagram: source bytes at 0x00.. are R0 G0 B0 R1 G1 B1 R2 G2 ...; destination bytes are R0 G0 B0 x R1 G1 B1 x ...; registers d0 = R7..R0, d1 = G7..G0, d2 = B7..B0, d3 = x (fill byte)]
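The same conversion can also be written with NEON intrinsics from `<arm_neon.h>` instead of inline assembly, which leaves register allocation to the compiler. This is a sketch under assumptions, not the routine from the slide: the function name is hypothetical, and a scalar tail loop is added so that pixel counts that are not a multiple of 8, and builds without NEON, still work.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Convert numpix packed RGB24 pixels to RGB32, filling the 4th byte with 0xff. */
void rgb24_to_rgb32_neon(const uint8_t *src, uint8_t *dst, size_t numpix)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x8_t alpha = vdup_n_u8(0xff);

    /* 8 pixels per iteration: de-interleaving load, interleaving store. */
    for (; i + 8 <= numpix; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);   /* corresponds to vld3.8 */
        uint8x8x4_t rgba;

        rgba.val[0] = rgb.val[0];
        rgba.val[1] = rgb.val[1];
        rgba.val[2] = rgb.val[2];
        rgba.val[3] = alpha;
        vst4_u8(dst + 4 * i, rgba);               /* corresponds to vst4.8 */
    }
#endif

    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i < numpix; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0xff;
    }
}
```

On a 32-bit ARM toolchain the NEON path typically needs a flag such as `-mfpu=neon`; without it the scalar loop handles everything.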

9 Frame copy from/to DMA RAM
- OpenCV (as used in the naïve implementation)
- Manual copy (loop over virtually contiguous memory)
- Memcpy from C library
- NEON accelerated copy
[Chart: execution time [ms] for OpenCV, manual, memcpy and NEON copy, grouped by direction: Linux RAM to Linux RAM, Linux RAM to DMA RAM, DMA RAM to Linux RAM]
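For reference, the "manual copy" variant amounts to a plain loop over the buffer. A minimal sketch, assuming a word-wise loop, 4-byte aligned buffers and a length that is a multiple of 4; the function name is hypothetical and the slides do not say whether the original loop copied bytes or words. The C library `memcpy` and the NEON routine are drop-in replacements for this loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Word-by-word copy; nbytes must be a multiple of 4, pointers 4-byte aligned. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    size_t nwords = nbytes / sizeof(uint32_t);

    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}
```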

10 Camera capture
- OpenCV
  - Always BGR24
- Video4Linux
  - Different formats
  - Not a big improvement
[Chart: frame delay [s] per capture method: V4L2 RGB24, V4L2 BGR24, V4L2 MJPEG, V4L2 YUYV, OpenCV BGR; axis value shown: 0.04]

11 Results
- Multiple configurations
- Combined the conversion and copy (NEON accelerated)
Application configurations:
1: Split convert and copy
2: OpenCV mixChannels
3: Combined mixChannels to external
4: No convert back + V4L capture
5: NEON copy
6: Combined NEON convert and NEON copy
[Chart: execution time per frame [s] for each application configuration, broken down into: get frame, convert and copy, Sobel calculation, copy back and convert]

12 Contributions
- Framework for combining USB camera with accelerators in programmable logic
- Multiple format conversion routines
  - NEON
- NEON copying routines
- Video4Linux frame capture
[Diagram: capture frame, convert format, copy to DMA RAM, execute accelerator, copy result back, convert format, process result]

13 Conclusion and Future work
- Huge improvement
  - 32x (0,2 to 7,7 FPS)
  - Still one ARM core unoccupied for processing data after accelerator
- Make camera frame buffer available to DMA
  - DMA buffer sharing
  - Linux kernel 3.8
- Improve frame capture
  - Takes more than half of the time
  - Latency of ~4 frames
  - Driver from manufacturer
  - Consider other cameras

Lecture 25: Interrupt Handling and Multi-Data Processing. Spring 2018 Jason Tang

Lecture 25: Interrupt Handling and Multi-Data Processing. Spring 2018 Jason Tang Lecture 25: Interrupt Handling and Multi-Data Processing Spring 2018 Jason Tang 1 Topics Interrupt handling Vector processing Multi-data processing 2 I/O Communication Software needs to know when: I/O

More information

Real-time image processing and object recognition for robotics applications. Adrian Stratulat

Real-time image processing and object recognition for robotics applications. Adrian Stratulat Real-time image processing and object recognition for robotics applications Adrian Stratulat What is computer vision? Computer vision is a field that includes methods for acquiring, processing, analyzing,

More information

OpenCV on Zynq: Accelerating 4k60 Dense Optical Flow and Stereo Vision. Kamran Khan, Product Manager, Software Acceleration and Libraries July 2017

OpenCV on Zynq: Accelerating 4k60 Dense Optical Flow and Stereo Vision. Kamran Khan, Product Manager, Software Acceleration and Libraries July 2017 OpenCV on Zynq: Accelerating 4k60 Dense Optical Flow and Stereo Vision Kamran Khan, Product Manager, Software Acceleration and Libraries July 2017 Agenda Why Zynq SoCs for Traditional Computer Vision Automated

More information

W I S S E N T E C H N I K L E I D E N S C H A F T

W I S S E N T E C H N I K L E I D E N S C H A F T W I S S E N T E C H N I K L E I D E N S C H A F T System-on-Chip Architectures and Modelling 2014 Ehrenhöfer, Lalic, Steinbäck, Jelinek, Ortoff, Jantscher, Fellner, Schilling, Weiser, Sparber www.iaik.tugraz.at

More information

The Design of Sobel Edge Extraction System on FPGA

The Design of Sobel Edge Extraction System on FPGA The Design of Sobel Edge Extraction System on FPGA Yu ZHENG 1, * 1 School of software, Beijing University of technology, Beijing 100124, China; Abstract. Edge is a basic feature of an image, the purpose

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Methods to protect proprietary components in device drivers

Methods to protect proprietary components in device drivers Methods to protect proprietary components in device drivers Matt Porter Embedded Alley Solutions, Inc. Introduction Why the interest in closed drivers on Linux? Competition Advantage perception Upsell

More information

Multimedia SoC System Solutions

Multimedia SoC System Solutions Multimedia SoC System Solutions Presented By Yashu Gosain & Forrest Picket: System Software & SoC Solutions Marketing Girish Malipeddi: IP Subsystems Marketing Agenda Zynq Ultrascale+ MPSoC and Multimedia

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Linux Storage System Bottleneck Exploration

Linux Storage System Bottleneck Exploration Linux Storage System Bottleneck Exploration Bean Huo / Zoltan Szubbocsev Beanhuo@micron.com / zszubbocsev@micron.com 215 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications

More information

Atlys (Xilinx Spartan-6 LX45)

Atlys (Xilinx Spartan-6 LX45) Boards & FPGA Systems and and Robotics how to use them 1 Atlys (Xilinx Spartan-6 LX45) Medium capacity Video in/out (both DVI) Audio AC97 codec 220 US$ (academic) Gbit Ethernet 128Mbyte DDR2 memory USB

More information

Realtime Signal Processing on Embedded GPUs

Realtime Signal Processing on Embedded GPUs Realtime Signal Processing on Embedded s Dr. Matthias Rosenthal Armin Weiss Dr. Amin Mazloumian Institute of Embedded Systems Realtime Platforms Research Group Zurich University of Applied Sciences Motivation

More information

Memory Management in Tizen. SW Platform Team, SW R&D Center

Memory Management in Tizen. SW Platform Team, SW R&D Center Memory Management in Tizen SW Platform Team, SW R&D Center Contents Tizen Kernel Overview Memory Management in Tizen Kernel Memory Size Optimization 2 Tizen Kernel Overview 3 Tizen Kernel Overview Core

More information

DM3730 Camera Interfaces on Gumstix. M. (Mattanja) Venema. MSc Report. C e Dr.ir. J.F. Broenink Ir. E. Molenkamp Ing. M.H. Schwirtz.

DM3730 Camera Interfaces on Gumstix. M. (Mattanja) Venema. MSc Report. C e Dr.ir. J.F. Broenink Ir. E. Molenkamp Ing. M.H. Schwirtz. DM3730 Camera Interfaces on Gumstix M. (Mattanja) Venema MSc Report C e Dr.ir. J.F. Broenink Ir. E. Molenkamp Ing. M.H. Schwirtz August 2016 033RAM2016 EE-Math-CS P.O. Box 217 7500 AE Enschede The Netherlands

More information

Use ZCU102 TRD to Accelerate Development of ZYNQ UltraScale+ MPSoC

Use ZCU102 TRD to Accelerate Development of ZYNQ UltraScale+ MPSoC Use ZCU102 TRD to Accelerate Development of ZYNQ UltraScale+ MPSoC Topics Hardware advantages of ZYNQ UltraScale+ MPSoC Software stacks of MPSoC Target reference design introduction Details about one Design

More information

借助 SDSoC 快速開發複雜的嵌入式應用

借助 SDSoC 快速開發複雜的嵌入式應用 借助 SDSoC 快速開發複雜的嵌入式應用 May 2017 What Is C/C++ Development System-level Profiling SoC application-like programming Tools and IP for system-level profiling Specify C/C++ Functions for Acceleration Full System

More information

ECE 598 Advanced Operating Systems Lecture 4

ECE 598 Advanced Operating Systems Lecture 4 ECE 598 Advanced Operating Systems Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Announcements HW#1 was due HW#2 was posted, will be tricky Let me know

More information

Case Study: Building a High Quality Video Pipeline Using GStreamer and V4Linux on an i.mx6

Case Study: Building a High Quality Video Pipeline Using GStreamer and V4Linux on an i.mx6 Case Study: Building a High Quality Video Pipeline Using GStreamer and V4Linux on an i.mx6 Sean Hudson Embedded Linux Architect & Member of Technical Staff Android is a trademark of Google Inc. Use of

More information

Hardware Accelerated SDR Platform for Adaptive Air Interfaces Tarik Kazaz, Christophe Van Praet, Merima Kulin, Pieter Willemen, Ingrid Moerman

Hardware Accelerated SDR Platform for Adaptive Air Interfaces Tarik Kazaz, Christophe Van Praet, Merima Kulin, Pieter Willemen, Ingrid Moerman Hardware Accelerated SDR Platform for Adaptive Air Interfaces Tarik Kazaz, Christophe Van Praet, Merima Kulin, Pieter Willemen, Ingrid Moerman 27/01/2016 1 Overview Common SDR approach Propposed approach

More information

I/O Handling. ECE 650 Systems Programming & Engineering Duke University, Spring Based on Operating Systems Concepts, Silberschatz Chapter 13

I/O Handling. ECE 650 Systems Programming & Engineering Duke University, Spring Based on Operating Systems Concepts, Silberschatz Chapter 13 I/O Handling ECE 650 Systems Programming & Engineering Duke University, Spring 2018 Based on Operating Systems Concepts, Silberschatz Chapter 13 Input/Output (I/O) Typical application flow consists of

More information

A memcpy Hardware Accelerator Solution for Non Cache-line Aligned Copies

A memcpy Hardware Accelerator Solution for Non Cache-line Aligned Copies A memcpy Hardware Accelerator Solution for Non Cache-line Aligned Copies Filipa Duarte and Stephan Wong Computer Engineering Laboratory Delft University of Technology Abstract In this paper, we present

More information

Architecture of Computers and Parallel Systems Part 2: Communication with Devices

Architecture of Computers and Parallel Systems Part 2: Communication with Devices Architecture of Computers and Parallel Systems Part 2: Communication with Devices Ing. Petr Olivka petr.olivka@vsb.cz Department of Computer Science FEI VSB-TUO Architecture of Computers and Parallel Systems

More information

A flexible memory shuffling unit for image processing accelerators

A flexible memory shuffling unit for image processing accelerators Eindhoven University of Technology MASTER A flexible memory shuffling unit for image processing accelerators Xie, R.Z. Award date: 2013 Disclaimer This document contains a student thesis (bachelor's or

More information

A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators

A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators Giuseppe Tagliavini, DEI University of Bologna Germain Haugou, IIS ETHZ Andrea Marongiu, DEI University of Bologna & IIS

More information

KeyStone II. CorePac Overview

KeyStone II. CorePac Overview KeyStone II ARM Cortex A15 CorePac Overview ARM A15 CorePac in KeyStone II Standard ARM Cortex A15 MPCore processor Cortex A15 MPCore version r2p2 Quad core, dual core, and single core variants 4096kB

More information

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015. Optimizing HW/SW Partition of a Complex Embedded Systems Simon George November 2015 Zynq-7000 All Programmable SoC HP ACP GP Page 2 Zynq UltraScale+ MPSoC Page 3 HW/SW Optimization Challenges application()

More information

MAGEWELL Pro Capture HDMI Technical Specification

MAGEWELL Pro Capture HDMI Technical Specification MAGEWELL Pro Capture HDMI Technical Specification Copyright (c) 2011 2015 Nanjing Magewell Electronics Co., Ltd. All rights reserved. Specifications are based on current hardware, firmware and software

More information

ECE 598 Advanced Operating Systems Lecture 18

ECE 598 Advanced Operating Systems Lecture 18 ECE 598 Advanced Operating Systems Lecture 18 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 5 April 2018 Announcements Homework #9 will be posted (graphics) 1 Graphics Interface

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can

More information

MYD-C7Z010/20 Development Board

MYD-C7Z010/20 Development Board MYD-C7Z010/20 Development Board MYC-C7Z010/20 CPU Module as Controller Board Two 0.8mm pitch 140-pin Connectors for Board-to-Board Connections 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor

More information

ARMv8-A CPU Architecture Overview

ARMv8-A CPU Architecture Overview ARMv8-A CPU Architecture Overview Chris Shore Training Manager, ARM ARM Game Developer Day, London 03/12/2015 Chris Shore ARM Training Manager With ARM for 16 years Managing customer training for 15 years

More information

Porting BLIS to new architectures Early experiences

Porting BLIS to new architectures Early experiences 1st BLIS Retreat. Austin (Texas) Early experiences Universidad Complutense de Madrid (Spain) September 5, 2013 BLIS design principles BLIS = Programmability + Performance + Portability Share experiences

More information

Asymmetric MultiProcessing for embedded vision

Asymmetric MultiProcessing for embedded vision Asymmetric MultiProcessing for embedded vision D. Berardi, M. Brian, M. Melletti, A. Paccoia, M. Rodolfi, C. Salati, M. Sartori T3LAB Bologna, October 18, 2017 A Linux centered SW infrastructure for the

More information

Information Coding / Computer Graphics, ISY, LiTH. CUDA memory! ! Coalescing!! Constant memory!! Texture memory!! Pinned memory 26(86)

Information Coding / Computer Graphics, ISY, LiTH. CUDA memory! ! Coalescing!! Constant memory!! Texture memory!! Pinned memory 26(86) 26(86) Information Coding / Computer Graphics, ISY, LiTH CUDA memory Coalescing Constant memory Texture memory Pinned memory 26(86) CUDA memory We already know... Global memory is slow. Shared memory is

More information

ECE 5775 (Fall 17) High-Level Digital Design Automation. Hardware-Software Co-Design

ECE 5775 (Fall 17) High-Level Digital Design Automation. Hardware-Software Co-Design ECE 5775 (Fall 17) High-Level Digital Design Automation Hardware-Software Co-Design Announcements Midterm graded You can view your exams during TA office hours (Fri/Wed 11am-noon, Rhodes 312) Second paper

More information

Introduction to Operating. Chapter Chapter

Introduction to Operating. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin

An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin An NVMe-based Offload Engine for Storage Acceleration Sean Gibb, Eideticom Stephen Bates, Raithlin 1 Overview Acceleration for Storage NVMe for Acceleration How are we using (abusing ;-)) NVMe to support

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Project. Expect. Why MPEG Encode? MPEG Encoding Project Motion Estimation DCT Entropy Encoding

ESE532: System-on-a-Chip Architecture. Today. Message. Project. Expect. Why MPEG Encode? MPEG Encoding Project Motion Estimation DCT Entropy Encoding ESE532: System-on-a-Chip Architecture Day 16: March 20, 2017 MPEG Encoding MPEG Encoding Project Motion Estimation DCT Entropy Encoding Today Penn ESE532 Spring 2017 -- DeHon 1 Penn ESE532 Spring 2017

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

Design Choices for FPGA-based SoCs When Adding a SATA Storage }

Design Choices for FPGA-based SoCs When Adding a SATA Storage } U4 U7 U7 Q D U5 Q D Design Choices for FPGA-based SoCs When Adding a SATA Storage } Lorenz Kolb & Endric Schubert, Missing Link Electronics Rudolf Usselmann, ASICS World Services Motivation for SATA Storage

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV

0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV 0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV Rajeev Jayavant Cyrix Corporation A National Semiconductor Company 8/18/98 1 0;L$UFKLWHFWXUDO)HDWXUHV ¾ Next-generation Cayenne Core Dual-issue pipelined

More information

Exploring OpenCL Memory Throughput on the Zynq

Exploring OpenCL Memory Throughput on the Zynq Exploring OpenCL Memory Throughput on the Zynq Technical Report no. 2016:04, ISSN 1652-926X Chalmers University of Technology Bo Joel Svensson bo.joel.svensson@gmail.com Abstract The Zynq platform combines

More information

Multimedia in Mobile Phones. Architectures and Trends Lund

Multimedia in Mobile Phones. Architectures and Trends Lund Multimedia in Mobile Phones Architectures and Trends Lund 091124 Presentation Henrik Ohlsson Contact: henrik.h.ohlsson@stericsson.com Working with multimedia hardware (graphics and displays) at ST- Ericsson

More information

A Linux multimedia platform for SH-Mobile processors

A Linux multimedia platform for SH-Mobile processors A Linux multimedia platform for SH-Mobile processors Embedded Linux Conference 2009 April 7, 2009 Abstract Over the past year I ve been working with the Japanese semiconductor manufacturer Renesas, developing

More information

Specifications are based on current hardware, firmware and software revisions, and are subject to change without notice.

Specifications are based on current hardware, firmware and software revisions, and are subject to change without notice. MAGEWELL Pro Capture HDMI Technical Specification Copyright (c) 2011 2018 Nanjing Magewell Electronics Co., Ltd. All rights reserved. Specifications are based on current hardware, firmware and software

More information

Realtime Signal Processing on Nvidia TX2 using CUDA

Realtime Signal Processing on Nvidia TX2 using CUDA Realtime Signal Processing on Nvidia TX2 using CUDA Armin Weiss Dr. Amin Mazloumian Dr. Matthias Rosenthal Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of

More information

Zynq-7000 All Programmable SoC Product Overview

Zynq-7000 All Programmable SoC Product Overview Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform

More information

Writing high performance code. CS448h Nov. 3, 2015

Writing high performance code. CS448h Nov. 3, 2015 Writing high performance code CS448h Nov. 3, 2015 Overview Is it slow? Where is it slow? How slow is it? Why is it slow? deciding when to optimize identifying bottlenecks estimating potential reasons for

More information

Introduction to the ARM Architecture. or: a loose set of random facts blatantly copied from tech sheets and the Architecture Ref.

Introduction to the ARM Architecture. or: a loose set of random facts blatantly copied from tech sheets and the Architecture Ref. Introduction to the ARM Architecture or: a loose set of random facts blatantly copied from tech sheets and the Architecture Ref. Manual Glance into the past Initial ARM Processor developed by Acorn Computers,

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Efficient Video Processing on Embedded GPU

Efficient Video Processing on Embedded GPU Efficient Video Processing on Embedded GPU Tobias Kammacher Armin Weiss Matthias Frei Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of Applied Sciences (ZHAW)

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

ECE 598 Advanced Operating Systems Lecture 12

ECE 598 Advanced Operating Systems Lecture 12 ECE 598 Advanced Operating Systems Lecture 12 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 March 2018 Announcements Next homework will be due after break. Midterm next Thursday

More information

SDSoC: Session 1

SDSoC: Session 1 SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the

More information

[537] I/O Devices/Disks. Tyler Harter

[537] I/O Devices/Disks. Tyler Harter [537] I/O Devices/Disks Tyler Harter I/O Devices Motivation What good is a computer without any I/O devices? - keyboard, display, disks! We want: - H/W that will let us plug in different devices - OS that

More information

Recommended OS (tested)

Recommended OS (tested) MAGEWELL Pro Capture AIO Technical Specifications Copyright (c) 2011 2018 Nanjing Magewell Electronics Co., Ltd. All rights reserved. Specifications are based on current hardware, firmware and software

More information

i.mx 7 - Hetereogenous Multiprocessing Architecture

i.mx 7 - Hetereogenous Multiprocessing Architecture i.mx 7 - Hetereogenous Multiprocessing Architecture Overview Toradex Innovative Business Model Independent Companies Direct Sales Publicly disclosed Sales Prices Local Warehouses In-house HW and SW Development

More information

S100 Series. Compact Smart Camera. High Performance: Dual Core Cortex-A9 processor and Xilinx FPGA. acquisition and preprocessing

S100 Series. Compact Smart Camera. High Performance: Dual Core Cortex-A9 processor and Xilinx FPGA. acquisition and preprocessing S100 Series Compact Smart Camera High Performance: Dual Core Cortex-A9 processor and Xilinx FPGA IP-67 Rated enclosure Programmable FPGA for image acquisition and preprocessing Multiple resolution: VGA,

More information

Integer GEMM (under)performance Marat Dukhan

Integer GEMM (under)performance Marat Dukhan Integer GEMM (under)performance Marat Dukhan Software Engineer on Caffe 2 GEMM in Neural Networks Fully-connected layers im2col+gemm algorithm for convolution 1x1 convolutional layers Android CPU Landscape

More information

Memory Management. To do. q Basic memory management q Swapping q Kernel memory allocation q Next Time: Virtual memory

Memory Management. To do. q Basic memory management q Swapping q Kernel memory allocation q Next Time: Virtual memory Memory Management To do q Basic memory management q Swapping q Kernel memory allocation q Next Time: Virtual memory Memory management Ideal memory for a programmer large, fast, nonvolatile and cheap not

More information

Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms

Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms Gaurav Mitra 1 Beau Johnston 1 Alistair P. Rendell 1 Eric McCreath 1 Jun Zhou 2 1 Research

More information

HEAD HardwarE Accelerated Deduplication

HEAD HardwarE Accelerated Deduplication HEAD HardwarE Accelerated Deduplication Final Report CS710 Computing Acceleration with FPGA December 9, 2016 Insu Jang Seikwon Kim Seonyoung Lee Executive Summary A-Z development of deduplication SW version

More information

Product Technical Brief S3C2416 May 2008

Product Technical Brief S3C2416 May 2008 Product Technical Brief S3C2416 May 2008 Overview SAMSUNG's S3C2416 is a 32/16-bit RISC cost-effective, low power, high performance micro-processor solution for general applications including the GPS Navigation

More information

User Guide. TexturePerformancePBO Demo

User Guide. TexturePerformancePBO Demo User Guide TexturePerformancePBO Demo The TexturePerformancePBO Demo serves two purposes: 1. It allows developers to experiment with various combinations of texture transfer methods for texture upload

More information

MYC-C7Z010/20 CPU Module

MYC-C7Z010/20 CPU Module MYC-C7Z010/20 CPU Module - 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor with Xilinx 7-series FPGA logic - 1GB DDR3 SDRAM (2 x 512MB, 32-bit), 4GB emmc, 32MB QSPI Flash - On-board Gigabit

More information

Hardware Acceleration of Feature Detection and Description Algorithms on Low Power Embedded Platforms

Hardware Acceleration of Feature Detection and Description Algorithms on Low Power Embedded Platforms Hardware Acceleration of Feature Detection and Description Algorithms on LowPower Embedded Platforms Onur Ulusel, Christopher Picardo, Christopher Harris, Sherief Reda, R. Iris Bahar, School of Engineering,

More information

OpenAMP Discussion - Linaro2018HK. Copyright 2018 Xilinx

OpenAMP Discussion - Linaro2018HK. Copyright 2018 Xilinx OpenAMP Discussion - Linaro2018HK Agenda o SPDX Short Licenses Identifier o Coding Guideline o API Standardisation o Coprocessor Image Format o OpenAMP and Container Page 2 OpenAMP License Page 3 SPDX

More information

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA
