EECS150 - Digital Design Lecture 13 - Accelerators. Recap and Outline


EECS150 - Digital Design
Lecture 13 - Accelerators
Oct. 10, 2013
Prof. Ronald Fearing
Electrical Engineering and Computer Sciences, University of California, Berkeley
(slides courtesy of Prof. John Wawrzynek)
http://www-inst.eecs.berkeley.edu/~cs150

Recap and Outline
[Figure: project block diagram with SRAM, USB WebCam, Host PC, VGA Interface, Frame Buffer, DVI Interface]
Note: partners HW/ project
Overview of MicroBlaze + feature detection
Hardware acceleration/co-processors

Motivation
90/10 rule: Often 90 percent of the program runtime and energy is consumed by 10 percent of the code (inner loops).
Only small portions of an application become the performance bottlenecks.
Usually, these portions of code are data-processing intensive with relatively fixed dataflow patterns (little control): cryptography, graphics, video, communications signal processing, networking, ...
The other 90 percent of the code is not performance critical: UI, control, glue, exceptional cases, ...
A hardware accelerator/economizer implements specialized circuits for the inner loops.
The processor packs the noncritical portions (90% of the code, 10% of the computation) into minimal space.
[Figure: hybrid of processor core and hardware accelerator]

Energy Efficiency of CPU versus ASIC versus FPGA
ASIC : CPU = 500x. Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. SIGARCH Comput. Archit. News, 38:37-47, June 2010.
ASIC : FPGA = 7x. Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, FPGA '06, pages 21-30, New York, NY, USA, 2006. ACM.
FPGA : CPU = 70x
Similar story for performance efficiency.
(Wawrzynek, ReConFig, 12/14/2010)
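The 90/10 rule can be turned into a rough bound on what an accelerator buys. The sketch below uses Amdahl's law, which the slides do not name; it is brought in here only as a standard way to reason about the rule. The function names and the list of accelerator factors are illustrative assumptions.

```c
#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f of the runtime
 * (0 <= f <= 1) is accelerated by a factor s (s > 0). */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Print the overall speedup for a few hypothetical accelerator factors. */
void print_speedups(double f)
{
    const double factors[] = { 2.0, 10.0, 100.0, 1000.0 };
    for (unsigned i = 0; i < sizeof factors / sizeof factors[0]; i++)
        printf("accelerator %6.0fx -> overall %.2fx\n",
               factors[i], amdahl_speedup(f, factors[i]));
}
```

With f = 0.9, even an arbitrarily fast accelerator cannot push the overall speedup past 10x, because the remaining 10 percent of the runtime still executes on the processor.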

Why is HW more efficient than processors? (Performance/cost or energy/op)
1. Exploit problem-specific parallelism, at the thread and instruction level.
2. Custom instructions match the set of operations needed for the algorithm (replace multiple instructions with one), custom word-width arithmetic, etc.
3. Remove the overhead of instruction storage and fetch, and of ALU multiplexing.
What about FPGAs?

System on Chip Example
Three ARM cores, plus lots of accelerators.
Targets smart phones.

Processors in FPGAs
Xilinx: Zynq
Altera: Dual-Core ARM Cortex-A9 MPCore Processor

Xilinx: MicroBlaze Soft Processor
Altera: Nios, MIPS

Custom Hardware in the Pipeline

Custom Instructions Example: Tensilica Product
Special language TIE is used for defining special function units.
Custom architecture is automatically compiled, e.g. custom SIMD instructions.
Compiler support is challenging.

Tightly Coupled Co-processor
MicroBlaze: Fast Simplex Links (FSL)
Similar to MIPS coprocessor model

FSL: Fast Simplex Link

MicroBlaze Fast Simplex Links

Fast Simplex Link
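The slides only name the Fast Simplex Link and its master/slave interfaces. As a rough host-side model, assume each link behaves like a one-directional FIFO of 32-bit words: the processor pushes operands on one link and pulls results from another. Everything below (the type `simplex_link`, the depth of 16, the adder co-processor) is a hypothetical illustration, not the Xilinx implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINK_DEPTH 16u  /* assumed FIFO depth, for illustration only */

/* One simplex (one-directional) link modeled as a ring buffer. */
typedef struct {
    uint32_t data[LINK_DEPTH];
    unsigned head;   /* index of the next word to read */
    unsigned count;  /* number of words currently queued */
} simplex_link;

/* Non-blocking write on the master side; returns false if the FIFO is full. */
bool link_try_put(simplex_link *l, uint32_t word)
{
    if (l->count == LINK_DEPTH)
        return false;
    l->data[(l->head + l->count) % LINK_DEPTH] = word;
    l->count++;
    return true;
}

/* Non-blocking read on the slave side; returns false if the FIFO is empty. */
bool link_try_get(simplex_link *l, uint32_t *word)
{
    if (l->count == 0)
        return false;
    *word = l->data[l->head];
    l->head = (l->head + 1) % LINK_DEPTH;
    l->count--;
    return true;
}

/* Hypothetical co-processor: when two operands are waiting on `in` and
 * `out` has room, consume both and push their sum onto `out`. */
void coprocessor_step(simplex_link *in, simplex_link *out)
{
    uint32_t a, b;
    if (in->count >= 2 && out->count < LINK_DEPTH) {
        link_try_get(in, &a);
        link_try_get(in, &b);
        link_try_put(out, a + b);
    }
}
```

Using two links, one per direction, mirrors the master/slave pairing in the interface list; a blocking variant would simply retry until the put or get succeeds.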

Memory Mapped Accelerator
Memory mapped control/data registers

Memory Mapped Accelerator: Common Variations
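A memory-mapped accelerator is typically driven from software through a block of control/data registers. The sketch below assumes a made-up register layout (`ctrl`, `status`, two operands, one result) and a made-up multiply operation; `accel_model_tick` stands in for the hardware so the driver logic can be exercised on a host. On a real board the `accel_regs` pointer would be a board-specific base address.

```c
#include <stdint.h>

/* Hypothetical register block; volatile so every access reaches the device. */
typedef struct {
    volatile uint32_t ctrl;       /* bit 0: START, written by the CPU */
    volatile uint32_t status;     /* bit 0: DONE, set by the accelerator */
    volatile uint32_t operand_a;  /* data register */
    volatile uint32_t operand_b;  /* data register */
    volatile uint32_t result;     /* data register */
} accel_regs;

#define ACCEL_CTRL_START  0x1u
#define ACCEL_STATUS_DONE 0x1u

/* Load the operands, clear any stale DONE flag, then set START. */
void accel_start(accel_regs *r, uint32_t a, uint32_t b)
{
    r->operand_a = a;
    r->operand_b = b;
    r->status = 0;
    r->ctrl = ACCEL_CTRL_START;
}

/* Nonzero once the accelerator has set DONE. */
int accel_done(const accel_regs *r)
{
    return (r->status & ACCEL_STATUS_DONE) != 0;
}

uint32_t accel_result(const accel_regs *r)
{
    return r->result;
}

/* Host-side stand-in for the hardware: on START, multiply the operands,
 * clear START, and raise DONE. */
void accel_model_tick(accel_regs *r)
{
    if (r->ctrl & ACCEL_CTRL_START) {
        r->result = r->operand_a * r->operand_b;
        r->ctrl = 0;
        r->status = ACCEL_STATUS_DONE;
    }
}
```

On hardware the driver would call `accel_start`, poll `accel_done` (or wait for an interrupt), and then read `accel_result`; the model tick exists only for host testing.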

CPU/Accelerator Shared Memory
Processor instructs accelerator to independently access memory and perform work.
How does the processor synchronize with the accelerator (how does it know when it is done)?
Data cache on the CPU creates a coherency issue.
What about a cache in the accelerator?

Tightly Coupled Co-processor
MIPS: load/store to/from coprocessor, coprocessor op
Memory mapped control/data registers
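One possible answer to the shared-memory questions above is a job descriptor with a done flag, combined with explicit cache maintenance around the hand-off. This is a sketch under assumptions: the descriptor layout, the `cache_flush_range` and `cache_invalidate_range` hooks (no-ops here), and the doubling accelerator are all hypothetical, and `accel_model_run` replaces the real start-register write so the flow can run on a host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical job descriptor placed in memory shared with the accelerator. */
typedef struct {
    const uint32_t *src;     /* input buffer, read by the accelerator */
    uint32_t *dst;           /* output buffer, written by the accelerator */
    size_t len;              /* number of words */
    volatile uint32_t done;  /* set to 1 by the accelerator when finished */
} accel_job;

/* Hypothetical cache maintenance hooks; no-ops on a host build. */
static void cache_flush_range(const void *p, size_t n) { (void)p; (void)n; }
static void cache_invalidate_range(const void *p, size_t n) { (void)p; (void)n; }

/* Host-side stand-in for the accelerator: doubles every input word. */
static void accel_model_run(accel_job *job)
{
    for (size_t i = 0; i < job->len; i++)
        job->dst[i] = job->src[i] * 2u;
    job->done = 1;
}

/* Offload one job and wait for completion by polling the done flag. */
void offload_double(accel_job *job)
{
    job->done = 0;
    /* Write dirty cache lines back so the accelerator sees current data. */
    cache_flush_range(job->src, job->len * sizeof job->src[0]);
    cache_flush_range(job, sizeof *job);
    accel_model_run(job);  /* on hardware: write the start register instead */
    while (!job->done) {
        /* poll; an interrupt is the usual alternative */
    }
    /* Drop stale cached copies before the CPU reads the results. */
    cache_invalidate_range(job->dst, job->len * sizeof job->dst[0]);
}
```

Polling a flag in cacheable memory has the same coherency problem as the data itself, so in practice the flag would live in an uncached region or a device register.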

Summary so far
Custom hardware in pipeline
Tightly coupled co-processor
  e.g. Fast Simplex Link
  e.g. floating point co-processor
Memory-mapped co-processor

Feature Tracking Project
[Figure: project block diagram with USB WebCam, SRAM, Host PC, VGA Interface, Frame Buffer, DVI Interface, Feature Detector, P.L.B., serial interface, MicroBlaze CPU, F.S.L., Xilinx FPGA, DDRAM (program + data memory)]
(EECS150 - Lec12-video)

MicroBlaze Block Diagram

MicroBlaze IO
DPLB: Data interface, Processor Local Bus
DLMB: Data interface, Local Memory Bus (Block RAM only)
IPLB: Instruction interface, Processor Local Bus
ILMB: Instruction interface, Local Memory Bus (Block RAM only)
MFSL 0..15: FSL master interfaces
DWFSL 0..15: FSL master direct connection interfaces
SFSL 0..15: FSL slave interfaces
DRFSL 0..15: FSL slave direct connection interfaces
DXCL: Data side Xilinx CacheLink interface (FSL master/slave pair)
IXCL: Instruction side Xilinx CacheLink interface (FSL master/slave pair)
Core: Miscellaneous signals for clock, reset, debug, and trace
M_AXI_DP: Peripheral Data Interface, AXI4-Lite or AXI4 interface
M_AXI_IP: Peripheral Instruction interface, AXI4-Lite interface
M0_AXIS..M15_AXIS: AXI4-Stream interface master direct connection interfaces
S0_AXIS..S15_AXIS: AXI4-Stream interface slave direct connection interfaces
M_AXI_DC: Data side cache AXI4 interface
M_AXI_IC: Instruction side cache AXI4 interface
(>= Virtex 6)

Processor Local Bus
For frame buffer interface

Feature Detection using D.o.G.
[Figure: feature detection pipeline from VGA to MicroBlaze; WB: convolution, D.o.G.]
"A Parallel Hardware Architecture for Scale and Rotation Invariant Feature Detection," Vanderlei Bonato, Eduardo Marques, and George A. Constantinides, IEEE Trans. on Circuits and Systems for Video Technology, vol. 18, no. 12, Dec. 2008.
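A difference of Gaussians (D.o.G.) subtracts a wider blur from a narrower one, leaving a band-pass response that peaks at small features. The slides do not give the kernels used, so the sketch below assumes small binomial kernels ([1 2 1]/4 and [1 4 6 4 1]/16) as Gaussian approximations and works on a single image row with integer arithmetic; the function names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Read x[i], clamping the index to the row borders. */
static int32_t sample(const uint8_t *x, size_t n, long i)
{
    if (i < 0)
        i = 0;
    if (i >= (long)n)
        i = (long)n - 1;
    return x[i];
}

/* 1-D difference of Gaussians over one row of n pixels.
 * out[k] = 16 * (narrow blur - wide blur), kept scaled by 16 so the
 * arithmetic stays in integers. */
void dog_row(const uint8_t *x, int32_t *out, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        long i = (long)k;
        /* [1 2 1]/4, scaled by 16 */
        int32_t narrow = 4 * (sample(x, n, i - 1) + 2 * sample(x, n, i)
                              + sample(x, n, i + 1));
        /* [1 4 6 4 1]/16, scaled by 16 */
        int32_t wide = sample(x, n, i - 2) + 4 * sample(x, n, i - 1)
                     + 6 * sample(x, n, i) + 4 * sample(x, n, i + 1)
                     + sample(x, n, i + 2);
        out[k] = narrow - wide;
    }
}
```

A flat row gives zero everywhere, while an isolated bright pixel gives a positive peak flanked by negative lobes. In hardware the same multiply-add pattern maps onto a fixed convolution pipeline, which is the kind of inner loop the lecture argues belongs in an accelerator.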

SRAM
PS #6, problem 2
[Figure: SRAM block with signals DATA, DOUT, ADDR, DIN]

Conclusions
Custom hardware in pipeline
Tightly coupled co-processor
  e.g. Fast Simplex Link
  e.g. floating point co-processor
Memory-mapped co-processor
MicroBlaze connections to coprocessor and frame buffer