Massively Parallel Computing on Silicon: SIMD Implementations. V. M. Brea, Univ. of Santiago de Compostela, Spain


Goal: Give an overview of the state of the art in digital on-chip CMOS SIMD solutions, mainly visual processors, through papers found in the literature.

Outline
- Parallel Computing
- SIMD Computing
- Digital Solutions
- Conclusions


Supercomputers: Evolution
Source: Trends in High Performance Computing, IEEE Circuits & Devices Magazine, January/February 2006
- Late 70s to early 80s: vector systems
- 80s: symmetric multiprocessors (SMPs), with shared memory
- Late 80s: distributed-memory computer systems, overcoming the hardware scalability limitations of shared memory
- 90s: massively parallel processors (MPPs), 256 to 10000 processors, scalable (distributed memory)
- Today: off-the-shelf components, clusters of PCs or workstations
- Today and tomorrow: grid computing


A System for Evaluating Performance and Cost of SIMD Array Designs, J. of Parallel and Distributed Computing 60, 217-246, 2000
- Goal: efficient evaluation of SIMD arrays with respect to complex applications while accounting for operating frequency and chip area
- Remark: the first 10% of the design cycle determines nearly 80% of a system's cost
- Many alternatives in SIMD realization

Design Alternatives in SIMD Machines (I)
- Asymmetric: control and SIMD array
- PE design: CPU, datapath: 1, 8, 32 bits
- PE memory hierarchy: on- and/or off-chip memory, cache (propagating templates)
- PE nearest-neighbor communication: NEWS network

Design Alternatives in SIMD Machines (II)
- General communication: additional inter-PE communication networks
- Feedback from array to controller: global OR mechanisms, count of corresponding PEs, flags
- Array instruction issue: the speed at which instructions can be issued, the latency of the instructions from the issuer to the PEs, and the skew in instruction distribution to the various PEs in the array
- Mapping data to processing elements: one-to-one correspondence, virtual PEs (VPEs)

Variable Parameters at PE Level
- PE datapath from 1 to 64 bits
- FP units, possibly shared among PEs
- ALU complexity, multiplier and/or divider
- Nearest-neighbor communication network parameters
- Number of internal buses and ports in the register file
- Local indexing
- Per-PE caching
- Simple, pipelined PE datapaths

Variable Parameters at Array Level
- Size: 256 to 16 million PEs
- Load/store mechanism
- Communication networks: broadcast
- Feedback, including OR and count
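The array-level mechanisms listed above (one instruction applied by every PE, NEWS neighbor communication, and global OR/count feedback to the controller) can be sketched in a few lines of Python. This is only an illustrative software model; the function names `simd_apply`, `read_neighbor`, `global_or` and `global_count` are assumptions made for this sketch and do not come from the paper.

```python
from typing import Callable, List

Grid = List[List[int]]

def simd_apply(grid: Grid, op: Callable[[int], int]) -> Grid:
    """Every PE applies the same operation to its own local value."""
    return [[op(value) for value in row] for row in grid]

def read_neighbor(grid: Grid, direction: str, fill: int = 0) -> Grid:
    """Each PE reads the value held by its N, E, W or S neighbor.

    PEs on the array border read `fill` (an assumed boundary condition).
    """
    offsets = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
    dr, dc = offsets[direction]
    rows, cols = len(grid), len(grid[0])
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            row.append(grid[nr][nc] if inside else fill)
        result.append(row)
    return result

def global_or(flags: Grid) -> bool:
    """Feedback to the controller: is any PE flag set?"""
    return any(value for row in flags for value in row)

def global_count(flags: Grid) -> int:
    """Feedback to the controller: how many PE flags are set?"""
    return sum(1 for row in flags for value in row if value)

image = [[0, 0, 0],
         [0, 5, 0],
         [0, 0, 0]]
flags = simd_apply(image, lambda v: int(v > 3))  # same threshold in every PE
east = read_neighbor(image, "E")                 # each PE gets its east neighbor's value
```

In this model the controller would use `global_or(flags)` or `global_count(flags)` to decide which instruction to issue next, which is the role the slides assign to the feedback path.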


Digital Implementation of Cellular Sensor-Computers, Int. J. Circ. Theor. Appl. 2006; 34: 409-428
Contributions
- Bit-sliced digital implementation method for the sensor-computer
- For 0.18 um and below, digital is a viable alternative to analog technology
- Photocells and A/D converters shared among several PEs (4) without performance degradation
- Note: the chip, named XENON, has not been tested

The Processing Element
- Look-up table (LUT) for morphology, hit-and-miss, and arbitrary bitwise logic operations
- Arithmetic datapath: full adder for addition/subtraction-based arithmetic
- Independent paths; both can work in parallel
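The arithmetic datapath described above is built around a single full adder, so multi-bit addition and subtraction are carried out one bit per step. A minimal sketch of that idea follows, assuming two's-complement operands and an 8-bit word; the function names and the word width default are choices made for this illustration, not details taken from the paper.

```python
def full_adder(a: int, b: int, carry_in: int):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out

def bit_serial_add(x: int, y: int, width: int = 8):
    """Add two unsigned words one bit per step, least significant bit first."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry

def bit_serial_sub(x: int, y: int, width: int = 8) -> int:
    """Subtract using the same adder: x - y = x + (not y) + 1, modulo 2**width."""
    carry = 1  # the "+ 1" of the two's complement enters as the initial carry
    result = 0
    for i in range(width):
        inverted = ((y >> i) & 1) ^ 1
        bit, carry = full_adder((x >> i) & 1, inverted, carry)
        result |= bit << i
    return result

total, carry_out = bit_serial_add(100, 55)
difference = bit_serial_sub(100, 55)
```

Each loop iteration corresponds to one pass through the full adder, so an operation on a `width`-bit word takes `width` steps in this model.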

The Processing Element
- 64 bits for each pixel: seven 8-bit variables (LAM) plus 8 Boolean variables with random access
- Neighborhood interconnections: directly from the memory, or through the crossbar switch
- Global buses for global interconnection and data condition flags


Sensor Signal AD Conversion
- Single-slope AD conversion: the slope is provided by a global analog signal
- The digital staircase is calculated by the processor itself, specifically by its arithmetic part
- A sample-and-hold (S/H) stage is needed
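A behavioral sketch of single-slope conversion may help here: a shared ramp rises step by step, and every PE keeps incrementing its own counter for as long as its held sample is at or above the ramp. The reference voltage, the 8-bit resolution, and the comparison rule `v >= ramp` are assumptions made for this illustration; the slides do not give those details.

```python
def single_slope_adc(sampled, v_ref=1.0, bits=8):
    """Convert held sample voltages to digital codes with a shared ramp.

    `sampled` holds one sample-and-hold voltage per PE. The ramp is the
    global analog signal; the per-PE counter stands in for the digital
    staircase that each processor computes with its own adder.
    """
    steps = 1 << bits
    codes = [0] * len(sampled)
    for step in range(1, steps):
        ramp = v_ref * step / steps          # global ramp value at this step
        for i, v in enumerate(sampled):
            if v >= ramp:                    # comparator still not tripped
                codes[i] += 1                # PE increments its counter
    return codes

codes = single_slope_adc([0.0, 0.25, 0.5, 1.0])
```

The sample-and-hold requirement shows up in the model as the fact that `sampled` must stay constant for the whole ramp duration.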


Sensing Schemes
- Local integration time control
- Local intensity information is fed back to the sensing medium to increase the integration time in dark regions and decrease it in bright regions
- As a consequence, realistic images with compressed dynamic range
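The feedback idea above can be illustrated with a toy one-dimensional model. Everything specific in it is an assumption made for this sketch: the 3-pixel neighborhood mean used as the "local intensity", the rule that integration time is inversely proportional to that mean, the clamping limits, and the linear sensor response with saturation. The paper's actual control scheme is not given in the slides.

```python
def local_mean(values, index, radius=1):
    """Mean intensity over a small neighborhood (clipped at the borders)."""
    lo = max(0, index - radius)
    hi = min(len(values), index + radius + 1)
    window = values[lo:hi]
    return sum(window) / len(window)

def integration_times(intensity, target=0.5, t_min=0.01, t_max=100.0):
    """Longer integration in dark neighborhoods, shorter in bright ones."""
    times = []
    for i in range(len(intensity)):
        mean = local_mean(intensity, i)
        t = target / mean if mean > 0 else t_max
        times.append(min(t_max, max(t_min, t)))
    return times

def sense(intensity, times, full_scale=1.0):
    """Linear sensor response that saturates at full scale."""
    return [min(full_scale, i * t) for i, t in zip(intensity, times)]

scene = [0.01, 0.02, 0.01, 5.0, 10.0, 5.0]      # dark region next to a bright one
fixed = sense(scene, [1.0] * len(scene))        # one global integration time
times = integration_times(scene)
adaptive = sense(scene, times)
```

With a single global integration time the bright pixels saturate and lose their differences, while the adaptive version keeps all outputs below full scale and spans a smaller ratio between its largest and smallest output than the scene itself.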

Computational Throughput
- Higher for bit-serial arithmetic than for bit-parallel
- Straightforward pipeline structures
- High-speed logic

Chip Data
- Technology: 0.18 um
- Clock: 100 MHz
- Cell size: 33x33 um
- Processor kernel size: 33000 λ², where λ is the feature size
- Sensor: 5x5 um
- AD and accompanying circuitry: 3000 λ²
- Note: there are no measurements

A Digital Vision Chip Specialized for High-Speed Target Tracking, IEEE Transactions on Electron Devices, vol. 50, no. 1, January 2003
Goal
- High-speed target tracking, including multitarget tracking with collision and separation

Features
- Global feature extraction: I/O bottleneck, scalar features extracted from the array
- Need for high-speed global operations
- The chip calculates moments (global information)
- Serial communication through nearest-neighbor connections, with global operations performed by bit-serial cumulative adders at the ends of rows and columns
- Sensor output binarized: time-controlled
- Datapath: bit-serial


Tracking
- Local operations: logic and arithmetic
- Global operations: extraction of moments
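As a rough illustration of how moments reduce an image to a few scalars suitable for tracking, the sketch below computes the zeroth and first moments of a binary image from its row and column sums, loosely mirroring adders placed at the ends of rows and columns. It uses the standard moment definitions; the function names and the centroid output are choices made for this example, and the chip's actual dataflow is not specified in the slides.

```python
def moments(binary):
    """Zeroth and first moments of a binary image from row/column sums."""
    row_sums = [sum(row) for row in binary]            # one total per row
    col_sums = [sum(col) for col in zip(*binary)]      # one total per column
    m00 = sum(row_sums)                                # area of the target
    m10 = sum(x * s for x, s in enumerate(col_sums))   # sum of x over set pixels
    m01 = sum(y * s for y, s in enumerate(row_sums))   # sum of y over set pixels
    return m00, m10, m01

def centroid(binary):
    """Target position as (x, y), or None if no pixel is set."""
    m00, m10, m01 = moments(binary)
    if m00 == 0:
        return None
    return m10 / m00, m01 / m00

target = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
position = centroid(target)
```

Only three scalars leave the array per frame in this model, which is the point of extracting global features on chip when I/O is the bottleneck.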


A Dynamically Reconfigurable SIMD Processor for a Vision Chip, IEEE Journal of Solid-State Circuits, vol. 39, no. 1, January 2004
- Goal: to overcome the poor performance in global operations shown by conventional SIMD image processors
- Contribution: a SIMD vision chip that reconfigures its hardware dynamically by chaining processing elements

Features
- Early visual processing: edge detection, smoothing
- Global feature calculation: calculate moments and output them as scalar values
- Scalar feature values such as summations at high speed
- Fast communication between distant PEs
- Grain size of the PE and network structure dynamically reconfigurable
- Key to high speed: with n PEs chained, the ALU behaves as one n-bit ALU and the memory capacity is multiplied by n
- Number of instructions reduced
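The chaining idea can be pictured as n one-bit PEs whose carries are linked, so that one instruction performs an n-bit addition across the chain. The sketch below is a simplified software model under that reading; the `PE` class, its fields, and the ripple-carry ordering are assumptions made for illustration and are not taken from the paper's circuit.

```python
class PE:
    """A one-bit processing element holding one bit of each operand."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.result = 0

    def add_step(self, carry_in: int) -> int:
        """Add the local bits plus the incoming carry; return the carry out."""
        total = self.a + self.b + carry_in
        self.result = total & 1
        return total >> 1

def load_chain(chain, x: int, y: int) -> None:
    """Spread the bits of x and y over the chain, LSB in the first PE."""
    for i, pe in enumerate(chain):
        pe.a = (x >> i) & 1
        pe.b = (y >> i) & 1

def chained_add(chain):
    """One 'instruction': the carry ripples through all chained PEs."""
    carry = 0
    for pe in chain:
        carry = pe.add_step(carry)
    value = sum(pe.result << i for i, pe in enumerate(chain))
    return value, carry

chain = [PE() for _ in range(8)]   # 8 PEs chained behave like one 8-bit ALU
load_chain(chain, 77, 50)
value, carry = chained_add(chain)
```

Compared with a bit-serial loop inside one PE, the chain spends one issued instruction on the whole word, which is one way to read the "number of instructions reduced" claim.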


Near-Sensor Image Processing: A New Paradigm, IEEE Transactions on Image Processing, vol. 3, no. 6, Nov. 1994
Goal and contributions
- Physical properties of the image sensor itself are utilized to do part of the signal processing task
- Analog temporal behavior of photodiodes combined with thresholding amplifiers
- Nonlinear operations like the median filter, and other tasks like convolution
- Adaptivity to different light levels
- Moments and shape factors

The Photodiode Sensor
- Reverse-biased diode
- The generated photocurrent discharges a small capacitor that has been initially charged to a nominal level: U = U0 - kIt
- Two ways to proceed:
  1st: keep the exposure time constant and change the threshold level until it equals the photodiode voltage (CCD)
  2nd: keep the threshold constant and measure the time (NSIP)
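The discharge law U = U0 - kIt can be solved for the time at which the voltage reaches a fixed threshold, t = (U0 - Uth) / (kI), which is the quantity the second (NSIP) approach measures. A small sketch follows; the function names and the numeric values are illustrative assumptions only.

```python
def photodiode_voltage(u0: float, k: float, intensity: float, t: float) -> float:
    """Capacitor voltage after discharging for time t: U = U0 - k*I*t."""
    return u0 - k * intensity * t

def time_to_threshold(u0: float, u_th: float, k: float, intensity: float) -> float:
    """Time at which U reaches the threshold: t = (U0 - Uth) / (k*I)."""
    if intensity <= 0:
        return float("inf")   # no light: the threshold is never reached
    return (u0 - u_th) / (k * intensity)

t_dim = time_to_threshold(u0=3.0, u_th=1.0, k=1.0, intensity=2.0)
t_bright = time_to_threshold(u0=3.0, u_th=1.0, k=1.0, intensity=4.0)
```

Because the crossing time is inversely proportional to intensity, brighter pixels cross the threshold earlier, so the order of crossings encodes the order of intensities.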


The Processing Element
- Based on the traditional bit-serial SIMD architecture
- Extensions to handle image operations of a global nature
- The PE is register-accumulator oriented
- Multilevel, gray-scale operations are performed bitwise, using the registers for intermediate storage (bit-serial operations)
- Neighbor communication: NEWS
- Global status value: COUNT


Some operations
- Location of the highest-intensity pixel: sense the array of photodiodes continuously after precharge. The first photodiode to pass the threshold level corresponds to the location of highest intensity.

Some operations
- Median filter: for each neighborhood of three, the second pixel to cross the threshold is the median pixel
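Both operations rely only on the order in which photodiodes cross the threshold. The sketch below models that order in software, assuming a crossing time inversely proportional to intensity (taken from the discharge law U = U0 - kIt with all constants set to 1). The function names are hypothetical, and sorting stands in for what the hardware observes directly in time.

```python
def crossing_time(intensity: float) -> float:
    """Assumed model: time to threshold is proportional to 1 / intensity."""
    return float("inf") if intensity <= 0 else 1.0 / intensity

def crossing_order(intensities):
    """Pixel indices in the order their photodiodes cross the threshold."""
    return sorted(range(len(intensities)),
                  key=lambda i: crossing_time(intensities[i]))

def brightest_location(intensities) -> int:
    """The first photodiode to cross marks the highest-intensity pixel."""
    return crossing_order(intensities)[0]

def median_of_three(neighborhood):
    """The second of three pixels to cross holds the median value."""
    assert len(neighborhood) == 3
    return neighborhood[crossing_order(neighborhood)[1]]

pixels = [4.0, 9.0, 2.0]
first = brightest_location(pixels)
median = median_of_three(pixels)
```

No intensity value is ever digitized in this scheme: the maximum and the median follow from counting threshold crossings, which is the near-sensor aspect the slides emphasize.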

Global feature extraction
- Through asynchronous operations: propagating-type templates in the CNN framework
- An in-depth study appears in Global Feature Extraction Operations for Near-Sensor Image Processing, IEEE Transactions on Image Processing, vol. 5, no. 1, January 1996
- Chip measured with 32x32 resolution in 0.8 um CMOS technology in VLSI Implementation of a Focal Plane Image Processor - A Realization of the Near-Sensor Image Processing Concept, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 3, Sept. 1996

Some other solutions
SIMD solutions:
- SRAM or any other memory-based chip (literature still to be reviewed)
- CAM-based (literature still to be reviewed)
- FPGA solutions

FPGA-Based Real-Time Optical-Flow System, IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 2, February 2006
Goal and contributions
- Pipelined optical-flow processing system: a virtual motion sensor
- Conventional system: a conventional camera acting as a front end, supported by an FPGA processing device
- FPGA programmability permits easy change of parameters to adapt to different conditions like speed, environment or light intensity
- Stand-alone: 30 Hz with 320x240 pixels
- Hardware: stand-alone or PCI board hardware accelerator
  RC1000-PP, Celoxica: Virtex 2000E-6 Xilinx FPGA
  RC 200-PP, Celoxica: XC2V2000-4 FPGA
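The stand-alone figure of 30 Hz at 320x240 pixels translates into a sustained pixel rate that the pipeline must accept. The arithmetic is worked out below; the helper names are made up for this example, and the per-pixel interval is only an average derived from the quoted numbers, not a clock period reported in the paper.

```python
def pixel_rate(width: int, height: int, frames_per_second: int) -> int:
    """Pixels that must be processed per second to keep up with the camera."""
    return width * height * frames_per_second

def average_interval_ns(rate: int) -> float:
    """Average time available per pixel, in nanoseconds."""
    return 1e9 / rate

rate = pixel_rate(320, 240, 30)          # 2,304,000 pixels per second
interval = average_interval_ns(rate)     # roughly 434 ns per pixel
```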

Conclusions
- Early visual processing
- SIMD solutions with CMOS on-chip implementations for local and global operations: digital, analog
- FPGA-based solutions
- Things to explore: SRAM and memory-based approaches, along with how our techniques could be adopted in conventional SIMD architectures (not CNNs)
- Contributions on SIMD: at the first stage of the design cycle, where the ideas have significant impact at layout level. Look at new paradigms, techniques, algorithms