Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design
COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/
Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan
Electrical and Computer Engineering, Ryerson University

Overview
- Introduction to Hardware-Software Co-design
- Hardware-Software Co-synthesis
- System Partitioning
- CPU-Accelerator Co-design
Some material from Chapter 8 of the text by M. Wolf.

Embedded System Architecture Design
- Real-time system design: performance analysis, scheduling and allocation
- Accelerated systems: use an additional computational unit dedicated to some functions
  - Hardwired logic, e.g. FPGA
  - Multiple processing elements (PEs) or an extra CPU
- Hardware/software co-design: a joint design of the hardware and software architectures of an embedded system
G. Khan EE8205: Architectural Co-Synthesis Page: 2

Hardware-Software Codesign
The study of the design of embedded computer systems, encompassing the following parts:
- Co-specification: developing a system specification that describes the hardware and software modules and the relationships among them
- Co-synthesis: automatic and semi-automatic design of the hardware and software elements to meet the specification
- Co-simulation: simultaneous simulation of hardware and software
- Hardware-software co-verification
- Hardware-software co-estimation

Embedded System Co-Specification
Common specification approaches use task graphs: high-level descriptions are usually translated to some form of task graph. Common forms:
- Data flow graphs
- Control flow graphs
- Control-data flow graphs

Hardware-Software Partitioning
The assignment of system parts to implementation units (hardware and software).
Goals of a partitioning algorithm:
- Meet constraints (timing)
- Minimize cost (area, power)
Partitioning directly affects the cost and performance of the final system. The major problem is computation time, and partitioning remains a significant research area.

Hardware-Software Partitioning
Granularity: the size of the system components assigned to hardware or software.
Coarse-grain approaches:
- Assign complete functions or processes to hardware or software
- A single task is a large block of system functionality
- Prevent excess communication
- Example tasks: MPEG decode, JPEG-2000 encode, FFT, DCT, etc.
- Also referred to as high-level granularity

Hardware-Software Partitioning
Fine-grain approaches:
- Operate on basic operations (add, subtract, multiply, etc.)
- Avoid the problem of dealing with poorly defined functional specifications
- Very useful when dealing with partially reconfigurable processors (IP cores that can be modified during the design process)
- Here tasks are often referred to as base blocks
- Also referred to as low-level granularity

Hardware-Software Partitioning
Flexible granularity: the computationally intensive parts are often small loops hidden inside a function or process.
- Coarse-grain approaches produce costly designs, since some redundant parts may be moved to hardware
- Fine-grain approaches open up such a large design space that it is usually very hard to explore
- Flexible granularity therefore performs HW/SW partitioning with decisions at both high and low levels
- This is a more complicated problem than purely high-level or low-level partitioning

Partitioning Approaches
Optimal partitioning approaches:
- Exhaustive: computationally very intensive, limited to small task graphs
- Branch and bound: start from a good initial condition and try pre-sorted combinations; exhaustive in the limiting case
Heuristic approaches:
- Heuristic optimization techniques: simulated annealing, tabu search, and genetic algorithms
- Greedy approaches: start from an all-software or all-hardware architecture and move parts to hardware or software

Hardware-Software Co-Synthesis
Four principal phases of co-synthesis:
- Partitioning: dividing the functionality of a system or SoC into units of computation
- Scheduling (and pipelining): choosing the times at which the various units of computation will occur
- Allocation: determining the processing elements (PEs) on which computations will occur; selecting the types of processing elements and the communication structure (system architecture)
- Assignment (task mapping): choosing particular component types for the allocated units of computation

Hardware-Software Co-Synthesis
Automatic definition of the embedded system architecture (e.g. bus, tree). Co-synthesis is tightly coupled with HW/SW partitioning: the system must be scheduled to determine the feasibility of each solution being evaluated.

Hardware-Software Co-Synthesis Approaches
Optimal approaches:
- Try all possible combinations
- Limited use due to the large design space
- Examples: exhaustive search, constraint solving, integer linear programming
Heuristic approaches:
- Avoid long execution times
- Usually give good results
- Iterative or constructive

Hardware-Software Co-Synthesis
Iterative heuristic approaches: start with an initial solution and improve it somewhat with each iteration of the algorithm; as the algorithm progresses, the solution is refined.

Embedded Hardware Structure
Various hardware options: CPU, cache, ASIC, FPGA, memory, GPU, I/O, and other PEs.

System Partitioning
A design methodology that uses several techniques:
- Partition the system specification into tasks (processes). The best way to partition a specification depends on the characteristics of the underlying hardware platform.
- Try different functional partitions and determine the performance of each function when executed on the hardware platform. We usually rely on approximation, since exact performance depends on hardware and software details.
- Allocate processes onto the various processing elements, e.g. ASIC or FPGA.

Hardware-Software Partitioning
Hardware/software system design involves modeling, validation, and implementation. System implementation involves hardware-software partitioning (co-synthesis): identifying the parts of the system model best implemented in hardware and in software modules. Such partitions can be decided by the designer (with successive refinement) or determined by codesign (CAD) tools.

Hardware-Software Partitioning
For embedded systems, such partitioning represents a physical partition of the system functionality into:
- Hardware: accelerators, GPU, etc.
- Software: executing on one or more CPUs
There are various formulations of the partitioning problem, based on architectural assumptions, partitioning goals, and solution strategies.
COWARE: a design environment for application-specific architectures targeting telecom applications.

Partitioning Techniques
The hardware-software homogeneous system model is a task graph. For each node of the task graph, determine the implementation choice (HW or SW), while keeping the scheduling of nodes in step so that real-time constraints are met. There is an intimate relationship between partitioning and scheduling: the wide variation in timing properties between the hardware and software implementations of a task affects the overall system latency significantly.

Accelerator System Architectures
(Block diagram: the CPU issues a request and data to the accelerator (ASIC, FPGA, etc.); the accelerator exchanges data with memory and I/O and returns the result.)

Accelerator vs Coprocessor
A coprocessor executes instructions, which are dispatched by the CPU. An accelerator appears as a device on the bus and is controlled by registers.
Accelerator implementations:
- Application-specific integrated circuit (ASIC)
- Field-programmable gate array (FPGA)
- Standard component, for example GPUs

SoC Architecture Design Tasks
- Design a heterogeneous multiprocessor SoC architecture, where a processing element (PE) may be a CPU, an accelerator, etc.
- Program the system

Why Accelerators?
Better cost/performance: custom logic may be able to perform an operation faster than a CPU of equivalent cost, since CPU cost is a non-linear function of performance. (Figure: cost vs. performance curve.)

CPU and Accelerators
Better real-time performance: put time-critical functions on less-loaded processing elements. Remember that rate-monotonic scheduling (RMS) utilization stays below 100%, so extra CPU cycles must be reserved to meet deadlines. (Figure: cost vs. required application performance, showing the deadline with and without RMS scheduling overhead.)

Why Accelerators?
- Good for processing I/O in real time
- May consume less energy
- May be better at streaming data
- The application may not be able to do all the work on even the largest single CPU

Accelerated System Design
First, determine that the system really needs to be accelerated: how much faster is the accelerator on the core function, and how much data transfer overhead is there? Then design the accelerator itself, and design the CPU interface to the accelerator.

Performance Analysis
The critical parameter is the speedup: how much faster is the system with the accelerator? It must take into account:
- Accelerator execution time
- Data transfer time
- Synchronization with the master CPU

Accelerator Execution Time
Total accelerator execution time:
t_accel = t_in + t_x + t_out
where t_in is the data input time, t_x the accelerator computation time, and t_out the data output time.

Data Input/Output Times
For bus-based SoCs, bus transactions include:
- Flushing register/cache values to main memory
- The time required for the CPU to set up the transaction
- The overhead of data transfers by bus packets, handshaking, etc.

Accelerator Speedup
Assume the loop is executed n times. Comparing the accelerated system to the non-accelerated system, the time saved is:
Speedup = n(t_CPU - t_accel) = n[t_CPU - (t_in + t_x + t_out)]
where t_CPU is the execution time, on the CPU, of the accelerated function.

Single-threaded vs Multi-threaded
One critical factor is the available parallelism in the application:
- Single-threaded/blocking: the CPU waits for the accelerator call to complete
- Multi-threaded/non-blocking: the CPU continues to execute alongside the accelerator
For multi-threading, the CPU must have some useful work to do while the accelerator performs its task, and the software environment must also support multithreading.

Total Execution Time
Single-threaded: count the execution time of all component processes.
Multi-threaded: find the longest path through the execution.
(Figure: task graph with processes P1-P4 and accelerator A1. P2 and P3 are independent; after P1 completes, the CPU starts P3 while the accelerator runs A1; P2 depends on A1; P4 follows P2 and P3.)

Sources of Parallelism
- Overlap I/O and accelerator computation: perform operations in batches, reading in the second batch of data while computing on the first batch.
- Find other work to do on the CPU; operations may be rescheduled to move work after the accelerator initiation.

Accelerator/CPU Interface
- Accelerator registers provide control registers for the CPU
- Data registers can be used for small data objects
- The accelerator may include special-purpose read/write logic, which is especially valuable for large data transfers

Memory Caching Problems
Main memory provides the primary data transfer mechanism to the accelerator. Programs must ensure that caching does not invalidate main-memory data:
1. CPU reads location S (S enters the cache)
2. Accelerator writes location S (main memory changes behind the cache)
3. CPU writes location S - PROBLEM: the CPU is working with stale data

Partitioning
- Divide the functional specification into units
- Map units onto PEs; units may become processes
- Determine the proper level of parallelism, e.g. composing f3(f1( ), f2( )) as one unit vs. separate f1( ), f2( ), and f3( )

Scheduling and Allocation
We must schedule operations in time and allocate computations to processing elements. Scheduling and allocation interact, but separating them helps: schedule then allocate, or alternatively allocate, then schedule.

An Example of Scheduling and Allocation
(Figure: a task graph in which F1 and F2 feed F3 via communications d1 and d2, and a hardware platform consisting of processor P1 and accelerator A2.)

Process Execution Times
Time to perform F1, F2, and F3 on the P1 and A2 hardware (from the hardware library):

Task   P1   A2
F1     5    5
F2     5    6
F3     -    4

First Design
Allocate F1, F2 -> P1; F3 -> A2.
(Timing diagram: P1 executes F1 then F2, each followed by its communication, F1C and F2C; A2 then executes F3.)

Communication Model Example
Assume communication within a PE is free. The cost of communication from F1 to F3 is d1 = 2; the cost of the F2-to-F3 communication is d2 = 4.

Second Design
Allocate F1 -> P1; F2, F3 -> A2.
(Timing diagram: P1 executes F1 and sends d1 across the network while A2 executes F2, then F3.)

First Design with Communication
Allocate F1, F2 -> P1; F3 -> A2.
(Timing diagram: P1 runs F1 (0-5) and F2 (5-10); d2 crosses the interconnection network (10-14); A2 runs F3 (14-18). Total time = 18.)

Example: Adjusting Messages to Reduce Delay
(Task graph: F1 and F2, each with execution time 3, feed F3 (execution time 4) via d1 and d2. Allocation: F1 -> P1, F2 -> P2, F3 -> P3 on a processor network. Transmission time of each message over the network = 4.)

Initial Schedule
(Timing diagram: P1 runs F1 and P2 runs F2 in parallel (0-3); d1 then d2 cross the network sequentially (3-11); P3 runs F3 (11-15). Total time = 15.)

Efficient Schedule
Modify the P3 hardware so that it reads one packet each from F1 and F2, computes a partial result, and continues with the next packet. The d1 and d2 messages are interleaved as packets on the network, so F3's computation overlaps with communication. Total time = 12.

Buffering and Performance
Buffering can sequentialize operations: the next process must wait for data to enter the buffer before it can continue. The buffer policy (queue, RAM) can affect the available parallelism.

Buffers and Latency
Three processes D, E, and F separated by buffers B1, B2, and B3:
B1 -> D -> B2 -> E -> B3 -> F

Buffers and Latency: Schedules
Batch schedule: D[0] D[1] E[0] E[1] F[0] F[1] - the pipeline must wait for all of D's outputs before getting any output from E.
Interleaved schedule: D[0] E[0] F[0] D[1] E[1] F[1] - each element flows through as soon as it is produced.

System Integration and Debugging
- Try to debug the CPU/accelerator interface separately from the accelerator core
- Build scaffolding to test the accelerator
- Hardware/software co-simulation can be useful, e.g. Seamless (Mentor Graphics) or CoWare

Accelerator Case Study
An example: a video accelerator.

Concept
Build an accelerator for block motion estimation, a step in video compression, which performs a two-dimensional correlation between Frame 1 and Frame 2.

Block Motion Estimation
MPEG divides each frame into 16 x 16 macroblocks for motion estimation. Search for the best match within a search range, measuring similarity with the sum of absolute differences (SAD):

SAD(o_x, o_y) = sum over i, j of | M(i, j) - S(i - o_x, j - o_y) |

Best Match
The best match produces the motion vector for the motion of an image block.

Full Search Algorithm

bestx = 0; besty = 0;
bestsad = MAXSAD;
for (ox = -SEARCHSIZE; ox < SEARCHSIZE; ox++) {
    for (oy = -SEARCHSIZE; oy < SEARCHSIZE; oy++) {
        int result = 0;
        for (i = 0; i < MBSIZE; i++) {
            for (j = 0; j < MBSIZE; j++) {
                result += iabs(mb[i][j]
                               - search[i - ox + XCENTER][j - oy + YCENTER]);
            }
        }
        if (result <= bestsad) {
            bestsad = result;
            bestx = ox;
            besty = oy;
        }
    }
}

Computational Requirements
Let MBSIZE = 16 and SEARCHSIZE = 8. The search area spans 8 + 8 + 1 = 17 positions in each dimension, so each macroblock requires
n_ops = (16 x 16) x (17 x 17) = 73,984 operations.
An image of 512 x 512 pixels contains 32 x 32 = 1024 macroblocks.