CNP: An FPGA-based Processor for Convolutional Networks
|
|
- Cori Rodgers
- 6 years ago
- Views:
Transcription
1 Clément Farabet Computational & Biological Learning Laboratory Courant Institute, NYU Joint work with: Yann LeCun, Cyril Poulet, Jefferson Y. Han Now collaborating with Eugenio Culurciello s e-lab Yale {Berin Martini, Polina Akselrod, Selcuk Talay} 1 / 35
2 1 Introduction Our goal: Real-Time Embedded Vision Our Solution: the CNP (ConvNet Processor) 2 CNP: Architecture 3 Results 2 / 35
3 Real-time Vision Convolutional Neural Networks The state of the art for object recognition/classification, ConvNets are biologically-inspired hierarchical architectures that can be trained to perform a variety of detection/recognition tasks, such as... 3 / 35
4 Figure: Object recognition. 4 / 35
5 5 / 35
6 Figure: Robot navigation/obstacle avoidance. 6 / 35
7 Convolutional Networks ConvNets have a feed-forward architecture consisting of multiple linear convolution filters interspersed with point-wise non-linear squashing functions. [2, 3, 4, 1] Convolutions Subsampling Convolutions Subs. Conv. Full connections F6: F7: 2@70x50 1@70x50 Input Image C1: 6@314x234 S2: 6@157x117 C3: 16@151x111 S4: 16@75x55 C5: 80@70x50 7 / 35
8 Figure: Ex: Features extracted from an image, by a trained ConvNet. 8 / 35
9 Our Goal: Achieve high-performance in low-power environments: 9 / 35
10 Our Solution: A ConvNet Processor Our entire system fits in a single FPGA plus an external RAM module, and no extra parts. It is a programmable SIMD (Single Instruction, Multiple Data) processor, with a vector instruction set that matches the elementary operations of a ConvNet. These elementary operations are highly optimized, and make extensive use of the parallelism inherent in the hardware. 10 / 35
11 Flexibility / Efficiency Implementing a particular ConvNet simply consists in programming the software layer of our processor, and does not require to reconfigure the logic circuits in the FPGA. A network compiler software was implemented, which takes a description of a trained ConvNet and compiles it into a proper program for the ConvNet Processor (CNP). 11 / 35
12 1 Introduction 2 CNP: Architecture Overview Hardware Software 3 Results 12 / 35
13 Our System Fits in a single FPGA + an external memory module. The FPGA: a low-end Xilinx Spartan-3A DSP 3400 FPGA, or a high-end Virtex-4 SX35. Both are roughly the same size. Spartan: 126 built-in hardware fixed-point multiply-accumulate units with 18 by 18 bit multiplication, and with accumulation on 48 bits, 200MHz. Virtex 4 : 192 units, 400MHz. 13 / 35
14 Remote system Terminal Control Unit (CU) Serial File Manager 32bit General CPU Pre/Post Processing Hardware HW module ALU Driver Memory I/O External Flash SW program Memory CONV, DOT, SUB NONLIN DIVISION SQRT Stream Router PRODUCT Streaming Multiport Memory Arbiter { Mux Demux } External Memory Chip(s) Vector / Stream ALU Display Manager I/O Video Manager Screen Camera FPGA 14 / 35
15 Remote system Terminal Control Unit (CU) Serial File Manager 32bit General CPU Pre/Post Processing ALU Driver CONV, DOT, SUB NONLIN DIVISION SQRT Stream Router PRODUCT Memory I/O Streaming Multiport Memory Arbiter { Mux Demux } Vector / Stream ALU Display Manager I/O Video Manager FPGA Screen Camera 15 / 35
16 Remote system Terminal Control Unit (CU) ALU Driver CONV, DOT, SUB NONLIN DIVISION SQRT Stream Router PRODUCT Vector / Stream ALU Display Manager Screen Serial File Manager I/O Memory I/O 32bit General CPU Pre/Post Processing Streaming Multiport Video Manager Camera Memory Arbiter { Mux Demux } FPGA External Flash External Memory Chip(s) 16 / 35
17 FIFOS WR... FIFOS WR FIFOS WR 17 / 35
18 Remote system Terminal Control Unit (CU) ALU Driver CONV, DOT, SUB NONLIN DIVISION SQRT Stream Router PRODUCT Vector / Stream ALU Display Manager Screen Serial File Manager I/O Memory I/O 32bit General CPU Pre/Post Processing Streaming Multiport Video Manager Camera Memory Arbiter { Mux Demux } FPGA 18 / 35
19 The most important instruction: 2D Convolution 80 to 90% of the computations Input/output coded on Q10.6, intermediate results on Q36.12 [5] 19 / 35
20 W00 W W02 W10 W11 W12 x + x + W20 W21 W22 x x + x x + x + x + x Pooling / Subsampling 20 / 35
21 The point-wise non-linearity Implemented as a piecewise approximation of the hyperbolic tangent function g(x) = A.tanh(B.x) g (x) = a i x + b i for x [l i, l i+1 ] (1) a i = 1 2 m n m, n [0, 5]. (2) Eq.(2) is used to replace all the multipliers by simple shifts and adds. 21 / 35
22 The point-wise non-linearity The piecewise mapping is done by a hardware mapper, which streams the input data through a cascade of linear mappers. Each linear mapper is configurable from the soft CPU. l0 >= a0 b0 x + l0 >= a0 b0 x +... l0 >= a0 b0 x + val? val? val? 22 / 35
23 Figure: Approximations of x [16 segments] and tanh(x) [16 segments] 23 / 35
24 24 / 35 Introduction CNP: Architecture Results References Remote system Terminal Control Unit (CU) ALU Driver CONV, DOT, SUB NONLIN DIVISION SQRT Stream Router PRODUCT Vector / Stream ALU Display Manager Serial File Manager I/O Memory I/O 32bit General CPU Pre/Post Processing Streaming Multiport Video Manager Memory Arbiter { Mux Demux } FPGA External Flash External Memory Chip(s) Screen Camera
25 Embedded software A complete library has been developed in C++, to abstract the hardware of the CNP. Its functions: Provide a high-level/modular/scalable user interface to allow different kinds of ConvNets to be loaded from a flash memory/shell interface (with their trained weights), then interpreted and computed in real-time; Hide all the intricacies of the VALU into a Matrix class, that allows simple calls, such as image1.convolve(image1, kernel2); Manipulate the data generated by the hardware, to perform operations that cannot be done by the hardware (e.g. blob detection, calculation of their centroids,... ); Provide a high-level shell interface, to allow scripting/debugging/polling the system. 25 / 35
26 External software: Training and Porting ConvNets are defined and trained on a conventional machine, using a standard machine learning library (e.g. LUSH, EBLearn); Our custom compiler then takes a Lush description of a trained ConvNet, and automatically compiles it into a compact representation describing the content of each layer type of operation, matrix of connections, kernels. This representation can then be stored on a flash drive (e.g. SD Card) connected to the CNP, to be interpreted on the fly by our embedded program. 26 / 35
27 Unifying the software: nrgizer [joint work with e Lab, Yale] We are currently developing a C++ library (inspired by LUSH), to unify the software nrgizer: nrgizer is a training environment that allows fast prototyping of ConvNets (and similar) architectures; All the computations in this lib are abstracted in a single class Matrix, that can easily be ported to different platforms (a PC, our CNP, and also GPUs, or embedded devices...); The whole lib can be ported to a little CPU (such as Microblaze, or OpenRisc); The API is extremely simplified / 35
28 1 Introduction 2 CNP: Architecture 3 Results Resources Usage Speed Images Ongoing Work 28 / 35
29 Resources Usage One single Spartan-3A DSP 3400 and an external DDR-RAM memory to store the kernels ( 50kB) and the temporary feature maps (approx. 16MB for input images) OR one single Virtex-4 SX35 and two external QDR-SRAM chips Entity Occupancy I/Os 135 out of % DCMs 2 out of 8 25% Mult/Accs 56 out of % Bloc Rams 100 out of % Slices out of % Table: Resources used in a Spartan-3A DSP / 35
30 Speed The design can be synthesized to run at up to 200MHz, which gives a peak performance of 10 GMACs for a 7 7 kernel! At 200MHz, it computes 8 fps in multi-scale (constant time for more than 2 scales), with 512x384 input images (this includes pre-processing, maximas detection and centroid calculation) or 12 fps in mono-scale 30 / 35
31 x1 3x3 5x5 7x7 9x9 11x11 13x13 15x15 17x Figure: Left: Time required to compute a 2D convolution, right: Time required to compute a full ConvNet. 31 / 35
32 Introduction CNP: Architecture Results References Figure: Applications running on the CNP (video generated by the CNP) 32 / 35
33 Figure: The CNP, integrated in our custom hardware platform (Virtex-4 + QDRs) 33 / 35
34 Ongoing / Future Work Our currents efforts [with e-lab, Yale]: Transforming the convolver into a flexible and programmable grid of elementary units, to compute several convolutions of different sizes, at the same time; Allowing the operations in the VALU to be cascaded, to combine operations on a single stream; Pre-caching the kernels in the grid, and pre-fetching any stream of data required while processing... to suppress any latency; Making a custom (ASIC) chip, that would integrate both memory and logic in a single chip, to increase the bandwidth and frequency of operation. 34 / 35
35 References I Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Zurich, Switzerland, Elsevier. an extended version was published as a technical report of the University of Toronto. Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, Y. LeCun and Y. Bengio. Pattern recognition and neural networks. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Intelligent Signal Processing, pages IEEE Press, Richard G. Shoup. Parameterized convolution filtering in a field programmable gate array. In Selected papers from the Oxford 1993 international workshop on field programmable logic and applications on More FPGAs, pages , Oxford, United Kingdom, Abingdon EE&CS Books. 35 / 35
CNP: AN FPGA-BASED PROCESSOR FOR CONVOLUTIONAL NETWORKS
CNP: AN FPGA-BASED PROCESSOR FOR CONVOLUTIONAL NETWORKS Clément Farabet 1, Cyril Poulet 1, Jefferson Y. Han 2, Yann LeCun 1 (1) Courant Institute of Mathematical Sciences, New York University, 715 Broadway,
More informationDataflow Computing for General Purpose Vision
Dataflow Computing for General Purpose Vision Clément Farabet joint work with: Yann LeCun, Eugenio Culurciello, Berin Martini, Polina Akselrod, Selcuk Talay, Benoit Corda General Purpose Vision? the task
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationDigital Integrated Circuits
Digital Integrated Circuits Lecture 9 Jaeyong Chung Robust Systems Laboratory Incheon National University DIGITAL DESIGN FLOW Chung EPC6055 2 FPGA vs. ASIC FPGA (A programmable Logic Device) Faster time-to-market
More informationMinimizing Computation in Convolutional Neural Networks
Minimizing Computation in Convolutional Neural Networks Jason Cong and Bingjun Xiao Computer Science Department, University of California, Los Angeles, CA 90095, USA {cong,xiao}@cs.ucla.edu Abstract. Convolutional
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationFull Linux on FPGA. Sven Gregori
Full Linux on FPGA Sven Gregori Enclustra GmbH FPGA Design Center Founded in 2004 7 engineers Located in the Technopark of Zurich FPGA-Vendor independent Covering all topics
More informationL2: FPGA HARDWARE : ADVANCED DIGITAL DESIGN PROJECT FALL 2015 BRANDON LUCIA
L2: FPGA HARDWARE 18-545: ADVANCED DIGITAL DESIGN PROJECT FALL 2015 BRANDON LUCIA 18-545: FALL 2014 2 Admin stuff Project Proposals happen on Monday Be prepared to give an in-class presentation Lab 1 is
More informationDesign Space Exploration of FPGA-Based Deep Convolutional Neural Networks
Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Abstract Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such
More informationRevolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5
More informationTechniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company
Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor David Johnson Systems Technology Division Hewlett-Packard Company Presentation Overview PA-8500 Overview uction Fetch Capabilities
More informationNeural Computer Architectures
Neural Computer Architectures 5kk73 Embedded Computer Architecture By: Maurice Peemen Date: Convergence of different domains Neurobiology Applications 1 Constraints Machine Learning Technology Innovations
More informationHardware Implementation of TRaX Architecture
Hardware Implementation of TRaX Architecture Thesis Project Proposal Tim George I. Project Summery The hardware ray tracing group at the University of Utah has designed an architecture for rendering graphics
More informationSimplify System Complexity
1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller
More informationDesign Space Exploration of FPGA-Based Deep Convolutional Neural Networks
Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Mohammad Motamedi, Philipp Gysel, Venkatesh Akella and Soheil Ghiasi Electrical and Computer Engineering Department, University
More informationRecurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks
Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and
More informationBasic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices
3 Digital Systems Implementation Programmable Logic Devices Basic FPGA Architectures Why Programmable Logic Devices (PLDs)? Low cost, low risk way of implementing digital circuits as application specific
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationMemory-Centric Accelerator Design for Convolutional Neural Networks
Memory-Centric Accelerator Design for Convolutional Neural Networks Maurice Peemen, Arnaud A. A. Setio, Bart Mesman and Henk Corporaal Department of Electrical Engineering, Eindhoven University of Technology,
More informationSystem-on Solution from Altera and Xilinx
System-on on-a-programmable-chip Solution from Altera and Xilinx Xun Yang VLSI CAD Lab, Computer Science Department, UCLA FPGAs with Embedded Microprocessors Combination of embedded processors and programmable
More informationEmbedded Systems: Hardware Components (part I) Todor Stefanov
Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationSimplify System Complexity
Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationSoC Platforms and CPU Cores
SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University
More informationAscenium: A Continuously Reconfigurable Architecture. Robert Mykland Founder/CTO August, 2005
Ascenium: A Continuously Reconfigurable Architecture Robert Mykland Founder/CTO robert@ascenium.com August, 2005 Ascenium: A Continuously Reconfigurable Processor Continuously reconfigurable approach provides:
More informationNn-X - a hardware accelerator for convolutional neural networks
Purdue University Purdue e-pubs Open Access Theses Theses and Dissertations Summer 2014 Nn-X - a hardware accelerator for convolutional neural networks Vinayak A. Gokhale Purdue University Follow this
More informationVHX - Xilinx - FPGA Programming in VHDL
Training Xilinx - FPGA Programming in VHDL: This course explains how to design with VHDL on Xilinx FPGAs using ISE Design Suite - Programming: Logique Programmable VHX - Xilinx - FPGA Programming in VHDL
More informationINTRODUCTION TO FPGA ARCHITECTURE
3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)
More informationConvolutional Neural Networks
Lecturer: Barnabas Poczos Introduction to Machine Learning (Lecture Notes) Convolutional Neural Networks Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
More informationComputer Systems Colloquium (EE380) Wednesday, 4:15-5:30PM 5:30PM in Gates B01
Adapting Systems by Evolving Hardware Computer Systems Colloquium (EE380) Wednesday, 4:15-5:30PM 5:30PM in Gates B01 Jim Torresen Group Department of Informatics University of Oslo, Norway E-mail: jimtoer@ifi.uio.no
More informationField Programmable Gate Array (FPGA)
Field Programmable Gate Array (FPGA) Lecturer: Krébesz, Tamas 1 FPGA in general Reprogrammable Si chip Invented in 1985 by Ross Freeman (Xilinx inc.) Combines the advantages of ASIC and uc-based systems
More informationFace Detection using GPU-based Convolutional Neural Networks
Face Detection using GPU-based Convolutional Neural Networks Fabian Nasse 1, Christian Thurau 2 and Gernot A. Fink 1 1 TU Dortmund University, Department of Computer Science, Dortmund, Germany 2 Fraunhofer
More informationField Program mable Gate Arrays
Field Program mable Gate Arrays M andakini Patil E H E P g r o u p D H E P T I F R SERC school NISER, Bhubaneshwar Nov 7-27 2017 Outline Digital electronics Short history of programmable logic devices
More informationdirect hardware mapping of cnns on fpga-based smart cameras
direct hardware mapping of cnns on fpga-based smart cameras Workshop on Architecture of Smart Cameras Kamel ABDELOUAHAB, Francois BERRY, Maxime PELCAT, Jocelyn SEROT, Jean-Charles QUINTON Cordoba, June
More information27 March 2018 Mikael Arguedas and Morgan Quigley
27 March 2018 Mikael Arguedas and Morgan Quigley Separate devices: (prototypes 0-3) Unified camera: (prototypes 4-5) Unified system: (prototypes 6+) USB3 USB Host USB3 USB2 USB3 USB Host PCIe root
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationReconfigurable Computing. Introduction
Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally
More informationCourse Overview Revisited
Course Overview Revisited void blur_filter_3x3( Image &in, Image &blur) { // allocate blur array Image blur(in.width(), in.height()); // blur in the x dimension for (int y = ; y < in.height(); y++) for
More informationFPGA VHDL Design Flow AES128 Implementation
Sakinder Ali FPGA VHDL Design Flow AES128 Implementation Field Programmable Gate Array Basic idea: two-dimensional array of logic blocks and flip-flops with a means for the user to configure: 1. The interconnection
More informationSpecializing Hardware for Image Processing
Lecture 6: Specializing Hardware for Image Processing Visual Computing Systems So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.
More informationFPGA Polyphase Filter Bank Study & Implementation
FPGA Polyphase Filter Bank Study & Implementation Raghu Rao Matthieu Tisserand Mike Severa Prof. John Villasenor Image Communications/. Electrical Engineering Dept. UCLA 1 Introduction This document describes
More informationSparse Matrix-Vector Multiplication FPGA Implementation
UNIVERSITY OF CALIFORNIA, LOS ANGELES Sparse Matrix-Vector Multiplication FPGA Implementation (SID: 704-272-121) 02/27/2015 Table of Contents 1 Introduction... 3 2 Sparse Matrix-Vector Multiplication...
More informationFPGA BASED IMPLEMENTATION OF DEEP NEURAL NETWORKS USING ON-CHIP MEMORY ONLY. Jinhwan Park and Wonyong Sung
FPGA BASED IMPLEMENTATION OF DEEP NEURAL NETWORKS USING ON-CHIP MEMORY ONLY Jinhwan Park and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul 151-744 Korea
More informationVertex Shader Design I
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More informationMetropolitan Road Traffic Simulation on FPGAs
Metropolitan Road Traffic Simulation on FPGAs Justin L. Tripp, Henning S. Mortveit, Anders Å. Hansson, Maya Gokhale Los Alamos National Laboratory Los Alamos, NM 85745 Overview Background Goals Using the
More informationPricing of Derivatives by Fast, Hardware-Based Monte-Carlo Simulation
Pricing of Derivatives by Fast, Hardware-Based Monte-Carlo Simulation Prof. Dr. Joachim K. Anlauf Universität Bonn Institut für Informatik II Technische Informatik Römerstr. 164 53117 Bonn E-Mail: anlauf@informatik.uni-bonn.de
More informationDNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
More informationEE 8217 *Reconfigurable Computing Systems Engineering* Sample of Final Examination
1 Student name: Date: June 26, 2008 General requirements for the exam: 1. This is CLOSED BOOK examination; 2. No questions allowed within the examination period; 3. If something is not clear in question
More informationReconOS: An RTOS Supporting Hardware and Software Threads
ReconOS: An RTOS Supporting Hardware and Software Threads Enno Lübbers and Marco Platzner Computer Engineering Group University of Paderborn marco.platzner@computer.org Overview the ReconOS project programming
More informationINTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)
INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) Bill Jason P. Tomas Dept. of Electrical and Computer Engineering University of Nevada Las Vegas FIELD PROGRAMMABLE ARRAYS Dominant digital design
More informationEmbedded Computing Platform. Architecture and Instruction Set
Embedded Computing Platform Microprocessor: Architecture and Instruction Set Ingo Sander ingo@kth.se Microprocessor A central part of the embedded platform A platform is the basic hardware and software
More informationArm s First-Generation Machine Learning Processor
Arm s First-Generation Machine Learning Processor Ian Bratt 2018 Arm Limited Introducing the Arm Machine Learning (ML) Processor Optimized ground-up architecture for machine learning processing Massive
More informationMachine Learning 13. week
Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of
More informationEITF35: Introduction to Structured VLSI Design
EITF35: Introduction to Structured VLSI Design Introduction to FPGA design Rakesh Gangarajaiah Rakesh.gangarajaiah@eit.lth.se Slides from Chenxin Zhang and Steffan Malkowsky WWW.FPGA What is FPGA? Field
More informationVirtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued)
Virtex-II Architecture SONET / SDH Virtex II technical, Design Solutions PCI-X PCI DCM Distri RAM 18Kb BRAM Multiplier LVDS FIFO Shift Registers BLVDS SDRAM QDR SRAM Backplane Rev 4 March 4th. 2002 J-L
More informationComprehensive Arm Solutions for Innovative Machine Learning (ML) and Computer Vision (CV) Applications
Comprehensive Arm Solutions for Innovative Machine Learning (ML) and Computer Vision (CV) Applications Helena Zheng ML Group, Arm Arm Technical Symposia 2017, Taipei Machine Learning is a Subset of Artificial
More informationEECS150 - Digital Design Lecture 09 - Parallelism
EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization
More informationmachine cycle, the CPU: (a) Fetches an instruction, (b) Decodes the instruction, (c) Executes the instruction, and (d) Stores the result.
Central Processing Unit (CPU) A processor is also called the CPU, and it works hand in hand with other circuits known as main memory to carry out processing. The CPU is the "brain" of the computer; it
More informationA Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models
A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating Michael Price*, James Glass, Anantha Chandrakasan MIT, Cambridge, MA * now at Analog Devices, Cambridge,
More informationAdaptable Intelligence The Next Computing Era
Adaptable Intelligence The Next Computing Era Hot Chips, August 21, 2018 Victor Peng, CEO, Xilinx Pervasive Intelligence from Cloud to Edge to Endpoints >> 1 Exponential Growth and Opportunities Data Explosion
More informationA new Computer Vision Processor Chip Design for automotive ADAS CNN applications in 22nm FDSOI based on Cadence VP6 Technology
Dr.-Ing Jens Benndorf (DCT) Gregor Schewior (DCT) A new Computer Vision Processor Chip Design for automotive ADAS CNN applications in 22nm FDSOI based on Cadence VP6 Technology Tensilica Day 2017 16th
More informationECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University
ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm
More informationFPGA Implementation and Validation of the Asynchronous Array of simple Processors
FPGA Implementation and Validation of the Asynchronous Array of simple Processors Jeremy W. Webb VLSI Computation Laboratory Department of ECE University of California, Davis One Shields Avenue Davis,
More informationFPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011
FPGA for Complex System Implementation National Chiao Tung University Chun-Jen Tsai 04/14/2011 About FPGA FPGA was invented by Ross Freeman in 1989 SRAM-based FPGA properties Standard parts Allowing multi-level
More informationFPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Date of publication 2018 00, 0000, date of current version 2018 00, 0000. Digital Object Identifier 10.1109/ACCESS.2018.2890150.DOI arxiv:1901.00121v1 [cs.ne] 1 Jan 2019 FPGA-based Accelerators of Deep
More informationTangent Prop - A formalism for specifying selected invariances in an adaptive network
Tangent Prop - A formalism for specifying selected invariances in an adaptive network Patrice Simard AT&T Bell Laboratories 101 Crawford Corner Rd Holmdel, NJ 07733 Yann Le Cun AT&T Bell Laboratories 101
More informationYet Another Implementation of CoRAM Memory
Dec 7, 2013 CARL2013@Davis, CA Py Yet Another Implementation of Memory Architecture for Modern FPGA-based Computing Shinya Takamaeda-Yamazaki, Kenji Kise, James C. Hoe * Tokyo Institute of Technology JSPS
More informationSpiral 2-8. Cell Layout
2-8.1 Spiral 2-8 Cell Layout 2-8.2 Learning Outcomes I understand how a digital circuit is composed of layers of materials forming transistors and wires I understand how each layer is expressed as geometric
More informationZynq-7000 All Programmable SoC Product Overview
Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform
More informationA software platform to support dynamically reconfigurable Systems-on-Chip under the GNU/Linux operating system
A software platform to support dynamically reconfigurable Systems-on-Chip under the GNU/Linux operating system 26th July 2005 Alberto Donato donato@elet.polimi.it Relatore: Prof. Fabrizio Ferrandi Correlatore:
More informationRUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch
RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,
More informationParallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010
Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:
More informationA Parallel Accelerator for Semantic Search
A Parallel Accelerator for Semantic Search Abhinandan Majumdar, Srihari Cadambi, Srimat T. Chakradhar and Hans Peter Graf NEC Laboratories America, Inc., Princeton, NJ, USA. Email: {abhi, cadambi, chak,
More informationFPGA Based Digital Design Using Verilog HDL
FPGA Based Digital Design Using Course Designed by: IRFAN FAISAL MIR ( Verilog / FPGA Designer ) irfanfaisalmir@yahoo.com * Organized by Electronics Division Integrated Circuits Uses for digital IC technology
More informationKeywords: Soft Core Processor, Arithmetic and Logical Unit, Back End Implementation and Front End Implementation.
ISSN 2319-8885 Vol.03,Issue.32 October-2014, Pages:6436-6440 www.ijsetr.com Design and Modeling of Arithmetic and Logical Unit with the Platform of VLSI N. AMRUTHA BINDU 1, M. SAILAJA 2 1 Dept of ECE,
More informationOptimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms
Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk
More informationTETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural
More informationPINE TRAINING ACADEMY
PINE TRAINING ACADEMY Course Module A d d r e s s D - 5 5 7, G o v i n d p u r a m, G h a z i a b a d, U. P., 2 0 1 0 1 3, I n d i a Digital Logic System Design using Gates/Verilog or VHDL and Implementation
More informationDesign of Feature Extraction Circuit for Speech Recognition Applications
Design of Feature Extraction Circuit for Speech Recognition Applications SaambhaviVB, SSSPRao and PRajalakshmi Indian Institute of Technology Hyderabad Email: ee10m09@iithacin Email: sssprao@cmcltdcom
More informationFPGA: What? Why? Marco D. Santambrogio
FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much
More informationThe DSP Primer 8. FPGA Technology. DSPprimer Home. DSPprimer Notes. August 2005, University of Strathclyde, Scotland, UK
The DSP Primer 8 FPGA Technology Return DSPprimer Home Return DSPprimer Notes August 2005, University of Strathclyde, Scotland, UK For Academic Use Only THIS SLIDE IS BLANK August 2005, For Academic Use
More informationLow-Power Neural Processor for Embedded Human and Face detection
Low-Power Neural Processor for Embedded Human and Face detection Olivier Brousse 1, Olivier Boisard 1, Michel Paindavoine 1,2, Jean-Marc Philippe, Alexandre Carbon (1) GlobalSensing Technologies (GST)
More informationDistributed Vision Processing in Smart Camera Networks
Distributed Vision Processing in Smart Camera Networks CVPR-07 Hamid Aghajan, Stanford University, USA François Berry, Univ. Blaise Pascal, France Horst Bischof, TU Graz, Austria Richard Kleihorst, NXP
More informationIntroducing the Spartan-6 & Virtex-6 FPGA Embedded Kits
Introducing the Spartan-6 & Virtex-6 FPGA Embedded Kits Overview ß Embedded Design Challenges ß Xilinx Embedded Platforms for Embedded Processing ß Introducing Spartan-6 and Virtex-6 FPGA Embedded Kits
More informationHandwritten Digit Classication using 8-bit Floating Point based Convolutional Neural Networks
Downloaded from orbit.dtu.dk on: Nov 22, 2018 Handwritten Digit Classication using 8-bit Floating Point based Convolutional Neural Networks Gallus, Michal; Nannarelli, Alberto Publication date: 2018 Document
More informationMulticore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF.
Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems Liang-Gee Chen Distinguished Professor General Director, SOC Center National Taiwan University DSP/IC Design Lab, GIEE, NTU 1
More informationMassively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain
Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,
More informationEmbedded Real-Time Video Processing System on FPGA
Embedded Real-Time Video Processing System on FPGA Yahia Said 1, Taoufik Saidani 1, Fethi Smach 2, Mohamed Atri 1, and Hichem Snoussi 3 1 Laboratory of Electronics and Microelectronics (EμE), Faculty of
More informationCharacter Recognition Using Convolutional Neural Networks
Character Recognition Using Convolutional Neural Networks David Bouchain Seminar Statistical Learning Theory University of Ulm, Germany Institute for Neural Information Processing Winter 2006/2007 Abstract
More informationAn Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs
An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs Architecture optimized for Fast Ultra Long FFTs Parallel FFT structure reduces external memory bandwidth requirements Lengths from 32K to
More informationObject Counting Using Convolutional Neural Network Accelerator IP Reference Design
Object Counting Using Convolutional Neural Network Accelerator IP FPGA-RD-02036 Version 1.1 September 2018 Contents Acronyms in This Document... 3 1. Introduction... 4 2. Related Documentation... 5 2.1.
More informationDeploying Deep Neural Networks in the Embedded Space
Deploying Deep Neural Networks in the Embedded Space Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis 2 nd International Workshop on Embedded and Mobile Deep Learning (EMDL) MobiSys,
More informationSpiral 3-1. Hardware/Software Interfacing
3-1.1 Spiral 3-1 Hardware/Software Interfacing 3-1.2 Learning Outcomes I understand the PicoBlaze bus interface signals: PORT_ID, IN_PORT, OUT_PORT, WRITE_STROBE I understand how a memory map provides
More informationXilinx ML Suite Overview
Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame
More informationLearning Outcomes. Spiral 3 1. Digital Design Targets ASICS & FPGAS REVIEW. Hardware/Software Interfacing
3-. 3-.2 Learning Outcomes Spiral 3 Hardware/Software Interfacing I understand the PicoBlaze bus interface signals: PORT_ID, IN_PORT, OUT_PORT, WRITE_STROBE I understand how a memory map provides the agreement
More informationPowerPC on NetFPGA CSE 237B. Erik Rubow
PowerPC on NetFPGA CSE 237B Erik Rubow NetFPGA PCI card + FPGA + 4 GbE ports FPGA (Virtex II Pro) has 2 PowerPC hard cores Untapped resource within NetFPGA community Goals Evaluate performance of on chip
More informationEECS 151/251A Spring 2019 Digital Design and Integrated Circuits. Instructor: John Wawrzynek. Lecture 18 EE141
EECS 151/251A Spring 2019 Digital Design and Integrated Circuits Instructor: John Wawrzynek Lecture 18 Memory Blocks Multi-ported RAM Combining Memory blocks FIFOs FPGA memory blocks Memory block synthesis
More informationExperiment 3. Digital Circuit Prototyping Using FPGAs
Experiment 3. Digital Circuit Prototyping Using FPGAs Masud ul Hasan Muhammad Elrabaa Ahmad Khayyat Version 151, 11 September 2015 Table of Contents 1. Objectives 2. Materials Required 3. Background 3.1.
More information