HW SW Partitioning. Reading. Hardware/software partitioning. Hardware/Software Codesign. CS4272: HW SW Codesign

Similar documents
Hardware/ Software Partitioning

Hardware-Software Codesign

Econ 172A - Slides from Lecture 2

Standard Optimization Techniques

Foundations of Computing

Hardware/Software Partitioning using Integer Programming. Ralf Niemann, Peter Marwedel. University of Dortmund. D Dortmund, Germany

HW/SW Co-design. Design of Embedded Systems Jaap Hofstede Version 3, September 1999

Chapter 15 Introduction to Linear Programming

An Algorithm for Hardware/Software Partitioning. Using Mixed Integer Linear Programming

Introduction to Mathematical Programming IE406. Lecture 20. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

1. Lecture notes on bipartite matching

Standard Optimization Techniques

Figure : Example Precedence Graph

MVE165/MMG631 Linear and integer optimization with applications Lecture 7 Discrete optimization models and applications; complexity

Hardware/Software Partitioning of Digital Systems

CSE 417 Network Flows (pt 4) Min Cost Flows

16.410/413 Principles of Autonomy and Decision Making

EXERCISES SHORTEST PATHS: APPLICATIONS, OPTIMIZATION, VARIATIONS, AND SOLVING THE CONSTRAINED SHORTEST PATH PROBLEM. 1 Applications and Modelling

Workload Characterization Techniques

EE382V: System-on-a-Chip (SoC) Design

Hardware-Software Codesign. 1. Introduction

Hardware/Software Codesign

Financial Optimization ISE 347/447. Lecture 13. Dr. Ted Ralphs

11. APPROXIMATION ALGORITHMS

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Partitioning for Better Synthesis Results

Introduction Warp Processors Dynamic HW/SW Partitioning. Introduction Standard binary - Separating Function and Architecture

11. APPROXIMATION ALGORITHMS

DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: OUTLINE APPLICATIONS OF DIGITAL SIGNAL PROCESSING

IE 5531: Engineering Optimization I

Karthik Narayanan, Santosh Madiraju EEL Embedded Systems Seminar 1/41 1

The Simplex Algorithm for LP, and an Open Problem

CSE 417 Network Flows (pt 3) Modeling with Min Cuts

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.

Slides for Faculty Oxford University Press All rights reserved.

Social-Network Graphs

Outline. CS38 Introduction to Algorithms. Linear programming 5/21/2014. Linear programming. Lecture 15 May 20, 2014

15.082J and 6.855J. Lagrangian Relaxation 2 Algorithms Application to LPs

CS 372: Computational Geometry Lecture 10 Linear Programming in Fixed Dimension

Math 273a: Optimization Linear programming

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Excel for Algebra 1 Lesson 5: The Solver

Models for grids. Computer vision: models, learning and inference. Multi label Denoising. Binary Denoising. Denoising Goal.

AMS : Combinatorial Optimization Homework Problems - Week V

MPSoC Design Space Exploration Framework

LINEAR PROGRAMMING INTRODUCTION 12.1 LINEAR PROGRAMMING. Three Classical Linear Programming Problems (L.P.P.)

Hardware/Software Co-Design/Co-Verification

REGULAR GRAPHS OF GIVEN GIRTH. Contents

Association for Computing Machinery Hong Kong Chapter Collegiate Programming Contest 2003

Simulation. Lecture O1 Optimization: Linear Programming. Saeed Bastani April 2016

Lecture 3: Totally Unimodularity and Network Flows

Partitioning Methods. Outline

Lecture Compiler Backend

Relational Database: The Relational Data Model; Operations on Database Relations

/ Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang

An Introduction to FPGA Placement. Yonghong Xu Supervisor: Dr. Khalid

Linear Programming. Readings: Read text section 11.6, and sections 1 and 2 of Tom Ferguson s notes (see course homepage).

Linear Programming. Widget Factory Example. Linear Programming: Standard Form. Widget Factory Example: Continued.

LATIN SQUARES AND THEIR APPLICATION TO THE FEASIBLE SET FOR ASSIGNMENT PROBLEMS

1. Lecture notes on bipartite matching February 4th,

NOTATION AND TERMINOLOGY

DD2429 Computational Photography :00-19:00

Algorithms for Euclidean TSP

This Lecture. We will first introduce some basic set theory before we do counting. Basic Definitions. Operations on Sets.

Lecture 7: Introduction to Co-synthesis Algorithms

CS 580: Algorithm Design and Analysis. Jeremiah Blocki Purdue University Spring 2018

Integer Programming Theory

Lecture 14: Linear Programming II

In this chapter we introduce some of the basic concepts that will be useful for the study of integer programming problems.

Linear Programming. them such that they

4 Integer Linear Programming (ILP)

Planning and Optimization

WDM Network Provisioning

Advanced Algorithms Linear Programming

6 Randomized rounding of semidefinite programs

Embedded Systems Design with Platform FPGAs

System Design and Methodology/ Embedded Systems Design (Modeling and Design of Embedded Systems)

Linear Programming in Small Dimensions

Research Interests Optimization:

Architectural-Level Synthesis. Giovanni De Micheli Integrated Systems Centre EPF Lausanne

The Polyhedral Model (Transformations)

Lecture Notes 2: The Simplex Algorithm

Discrete Mathematics. Kruskal, order, sorting, induction

No model may be available. Software Abstractions. Recap on Model Checking. Model Checking for SW Verif. More on the big picture. Abst -> MC -> Refine

Methods and Models for Combinatorial Optimization Modeling by Linear Programming

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini

Hardware/Software Codesign

Algorithm and Complexity of Disjointed Connected Dominating Set Problem on Trees

ABC basics (compilation from different articles)

Hardware-Software Codesign. 1. Introduction

Optimization Techniques for Design Space Exploration

LEAST COST ROUTING ALGORITHM WITH THE STATE SPACE RELAXATION IN A CENTRALIZED NETWORK

Hardware-Software Codesign

1 Parsing (25 pts, 5 each)

Hardware-Software Codesign. 9. Worst Case Execution Time Analysis

Chapter 10 Part 1: Reduction

A Comparison of Mixed-Integer Programming Models for Non-Convex Piecewise Linear Cost Minimization Problems

(Refer Slide Time: 1:27)

Overview. Design flow. Principles of logic synthesis. Logic Synthesis with the common tools. Conclusions

Transcription:

CS4272: HW SW Codesign HW SW Partitioning Abhik Roychoudhury School of Computing National University of Singapore Reading Section 5.3 of textbook Embedded System Design Peter Marwedel Also must read Hardware/software partitioning using Integer programming, by Ralf Niemann URL available from CS4272 webpage. This article has a much better explanation of the same material. Modified and augmented from Peter Marwedel s lecture notes 20/09/2007 1 20/09/2007 2 Hardware/Software Codesign Hardware/software partitioning No need to consider special purpose hardware in the long run? Correct for fixed functionality, but wrong in general, since By the time MPEG-n can be implemented in software, MPEG-n+1 has been invented [de Man] 20/09/2007 3 Functionality to to be be implemented in in software or or in in hardware? 20/09/2007 4 Functionality to be implemented in software or in hardware? Codesign Tool (COOL) as an example of HW/SW partitioning Decision based on on hardware/ software partitioning, a special case of of hardware/ software codesign. platform Inputs to to COOL: 1. Target 1. technology 2. Design 2. constraints 3. Required 3. behavior Design constraints refer to constraints on performance, hardware area etc. We need to clarify the other two terms. 20/09/2007 5 20/09/2007 6 1

Target Technology A graph of nodes --- Nodes denote Hardware components or Processors Memories also present, but mapping of tasks to HW or Proc. Edges denote interconnections --- often in the form of buses So, Target Tech is Hardware Components Possibly of different types Set of Processors External memory and buses between them. P1 P2 M H 20/09/2007 7 20/09/2007 8 Behavior Hierarchical Task Graphs What is a task graph? Typically DAG of tasks Nodes denotes specific tasks in the functionality of the system being designed Edges can denote several things Causal dependences, or in more details Communication (with weightage of data being communicated) There might exist causal dependence T1 T2 without any data being communicated from T1 to T2 Nodes of hierarchical task graphs can be task graphs Why task graphs? Reasonable way to capture repetitive reactive behavior Tasks produce output streams from input streams (from environment or other tasks). Task graphs thus represent a high-level behavioral specification of the system. How each task is described depends on who is designing it (hardware designer, programmer) Diff. from how each task will be implemented! 20/09/2007 9 20/09/2007 10 Specification Approach Task graph input Input to the partitioning method is a hierarchical task graph. At the lowest level (leaves of the hierarchy), behavior of each node is specified say in VHDL Alternately in C? Mapping Processor P1 Processor P2 Hardware [Niemann, Hardware/Software Co-Design for Data Flow Dominated Embedded Systems, Kluwer Academic Publishers, 1998 (Comprehensive mathematical model)] 20/09/2007 11 20/09/2007 12 2

Schematic Target Tech. VHDL System Spec. Design Constraints Schematic (solving ILP) Solving ILP Optimization Problem Syntax Graph Model Solution? no Result := ValidPartition C Code Generation VHDL Code Generation yes ValidPartition := Solution found Retargetable Compilation High-level Synthesis Cluster SW nodes Retargetable Compilation SW costs HW costs ILP optimization Problem Solve 20/09/2007 13 (contains all info.) Refine ILP problem SW costs 20/09/2007 14 Partition Refinement COOL partitioning algorithm A B 1 2 C D A,B 2 1 C,D A,B, C,D, 1,2 1. 1. Translation of of the the behavior into an an internal graph model 2. 2. Translation of of the the behavior of of each node from VHDL into C (we assumed task description in in VHDL, otherwise this this step is is not not required). 3. 3. Compilation All All C programs compiled for for the the target processor, Computation of of the the resulting program size, estimation of of the the resulting execution time (simulation input data might be be required) 4. 4. Synthesis of of hardware components: leaf leaf node, application-specific hardware is is synthesized. High-level synthesis sufficiently fast. Denotes implemented in software 20/09/2007 15 20/09/2007 16 COOL partitioning algorithm 5. 5. Flattening of of the the hierarchy: Granularity used by by the the designer is is maintained. Cost and and performance information added to to the the nodes. Precise information required for for partitioning is is pre-computed 6. 6. Generating and and solving a mathematical model of of the the optimization problem: Integer programming IP IP model for for optimization. Optimal with with respect to to the the cost cost function (approximates communication time) COOL partitioning algorithm 7. 7. Iterative improvements: Adjacent nodes mapped to to the the same hardware component are are now now merged. 20/09/2007 17 20/09/2007 18 3

COOL partitioning algorithm 8. Interface synthesis: After partitioning, the glue logic required for interfacing processors, application-specific hardware and memories is created. We now describe step 6 (Integer Programming) in more details. 20/09/2007 19 Integer programming models Ingredients: Cost function Constraints Involving linear expressions of integer variables from a set X Cost function C = ai xi withai R, xi N N (1) xi X Constraints: j J b x c with b, c (2) : i, j i j i, j j R xi X Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem. If all x i are constrained to be either 0 or 1, the IP problem said to be a 0/1 integer programming problem. 20/09/2007 20 R Example On integer programming C = 5x + x x + x 1 x, x, x 1 2 1 + 6x2 4 2 + x 3 3 3 2 {0,1} C Optimal Maximizing the the cost cost function can can be be done by by setting C =-C Integer programming is is NP-complete. In In practice, running times can can increase exponentially with with the the size size of of the the problem, but but problems of of some thousands of of variables can can still still be be solved with with commercial solvers, depending on on the the size size and and structure of of the the problem. IP IP models can can be be a good starting point for for modeling, even if if in in the the end end heuristics have to to be be used to to solve them. 20/09/2007 21 20/09/2007 22 Digress: Linear Programming Example: Grocery Shopping m varieties of nutrients (vitamins, protein,..) need b 1 units of Nut 1, b 2 units of Nut 2,, b m units of Nut m. Can buy n types of food (milk, bread, beef,..) Each unit of food contains a certain number of units of each type of nutrients. a i, j represents the number of units of the ith type nutrient contained in one unit of food of the jth type. Digress: Linear Programming Milk bread fish Beef Celery. Ice-cream 1 2. n Vitamin A 1 Vitamin B 2.... Protein m 15 20/09/2007 23 20/09/2007 24 4

Digress: Linear Programming Suppose you buy x 1 units of food type 1 and x 2 units of food type 2, and x n units of food type n. Then for nutrition type of type i it must be the case : a i,1. x 1 + a i,2. x 2 +.. + a i,n. x n b i Digress: Linear Programming a 1,1. x 1 + a 1,2. x 2 +.. + a 1,n. x n b 1 a 2,1. x 1 + a 2,2. x 2 +.. + a 2,n. x n b 2 a i,1. x 1 + a i,2. x 2 +.. + a i,n. x n b i a m, 1,. x 1 + a m, 2. x 2 +.. + a m, n. x n b m 20/09/2007 25 20/09/2007 26 Digress: Linear programming AX b A - m n matrix X -n 1 column vector of unknowns. b -m 1 column vector of constants. Additional constraints x j 0 (j = 1, 2, n) (why?) Cost function: z = c 1. x 1 + c 2. x 2 + + c n. x n c i - the cost of one unit of food type i. Digress: Linear Programming The LP Problem. Find (x 1, x 2,, x n ) such that: All the constraints are satisfied. The cost function is minimized. If (y 1, y 2,, y n ) also satisfies all the constraints then z z where: z = c 1. x 1 + c 2. x 2 + + c n. X n z = c 1. y 1 + c 2. y 2 + + c n. y n 20/09/2007 27 20/09/2007 28 Integer Linear Programming Demand in addition: Each x i should be an integer. Solving an ILP problem usually boils down to solving a series of LP problems. The General Idea in solving an LP: Feasible solution is a solution that satisfies all the constraints. The set of feasible solutions (for sensible LP problems!) is a convex polyhedron. One of the corner points of the polyhedron is the optimal solution. Mixed Integer Linear Programming Problem Demand the integer-value constraint only for a subset of the variables. In principle, LP problems can be solved in polynomial time. But ILP problems have only exponential time algorithms at present. NP-complete Role of ILP in solving co-design problems ILP based resource aware compilation Palsberg and Naik http://www.cs.ucla.edu/~palsberg/paper/mpsoc-chapter03.pdf 20/09/2007 29 20/09/2007 30 5

The Partitioning Problem Possible solution A Task Graph Target Tech. A Task Graph B C B C ASIC DSP Set of entities or tasks D G E F Memory D G E F Shared implementation Partitioning Problem: Map {A,B,C,D.E,F,G} to {ASIC, DSP}. Green color: executing in DSP 20/09/2007 31 20/09/2007 32 IP model for partitioning IP model for partitioning Notation: Index set set II denotes task task graph nodes. Index set set L denotes task task graph node types e.g. e.g. square root, DCT DCT or or FFT FFT Index set set KH KHdenotes hardware component types. e.g. e.g. hardware components for for the the DCT DCT or or the the FFT. FFT. Index set set J of of hardware component instances Index set set KP KPdenotes processors. All All processors are are assumed to to be be of of the the same type X X i,k : i,k =1 if if node v i is i is mapped to to hardware component type k KH and 0 otherwise. Y Y i,k : i,k =1 if if node v i is i is mapped to to processor k KP and 0 otherwise. NY l,k =1 l,k if if at at least one node of of type l is is mapped to to processor k KP and 0 otherwise. T Tis is a mapping from task graph nodes to to their types: T: T: I L The cost function accumulates the costs: C = cost(processors) + cost(memories) + cost(application specific hardware) 20/09/2007 33 20/09/2007 34 Constraints Operation assignment constraints i I = 1 : X i, k + Yi, k k KH k KP All task graph nodes have to be mapped either in software or in hardware. Variables are assumed to be integers. Additional constraints to guarantee they are either 0 or 1: i I : k KH :, 1 X i k i I : k KP :, 1 Y i k 20/09/2007 35 Operation assignment constraints (2) l L, i:t(v i )= i )= l, l, k KP: NY l,k l,k Y i,k i,k For all all types ll of of operations and and for for all all nodes iiof of this this type: if if iiis is mapped to to some processor k, k, then that that processor must implement the the functionality of of l. l. Decision variables must also also be be 0/1 0/1 variables: ll L, L, k KP: KP: NY NY l,k l,k 1. 1. 20/09/2007 36 6

Resource & design constraints k KH, the cost (area) used for for components of of that type is is calculated as as the sum of of the costs of of the components of of that type. This cost should not exceed its its maximum. k KP, the cost for for associated data storage area should not exceed its its maximum. k KP the cost for for storing instructions should not exceed its its maximum. The total cost (Σ (Σ kk KH ) KH ) of of HW components should not exceed its its maximum The total cost of of data memories (Σ (Σ kk KP ) KP ) should not exceed its its maximum The total cost instruction memories (Σ (Σ kk KP ) KP ) should not exceed its its maximum 20/09/2007 37 Timing constraints Timing constraints These constraints can can be be used to to guarantee that that certain time constraints are are met. Execution time of a node in the task graph is variable (implemented in hardware or software) This defines execution time of node as a linear expression on our decision variables. Using these execution times, we can define start and end times of each node. The end time of the sink node in the task graph should be less than pre-defined constant --- overall timing constraint on the design. 20/09/2007 38 v 1 v 2 v 3 v 4 e 3 e 4 v 5 v 6 v 7 v 8 v 9 v 10 Scheduling FIR 2 on h 1 Processor p 1 FIR 1 FIR 2 ASIC h1 Communication channel c 1 p 1 c 1 Scheduling / precedence constraints For all all nodes v i1 and i1 and v i2 that i2 that are are potentially mapped to to the the same processor or or hardware component instance, introduce a binary decision variable b i1,i2,k with i1,i2,k with b i1,i2,k =1 i1,i2,k =1 if if v i1 is i1 is executed before v i2 in i2 in component k and and = 0 otherwise. Define constraints of of the the type (end-time of of v i2 ) i2 (start time of of v i1 ) i1 if if b i1,i2,k =0 i1,i2,k =0 Ensure that that the the schedule for for executing operations is is consistent with with the the precedence constraints in in the the task task graph. v 11... v... 3 v 4 or... v... 4 v 3 t... v... 7 v 8 or... v... 8 v 7 t... e 3... or...... e 4 e 4 e 3 t 20/09/2007 40 Scheduling Constraints b i1,i2,k + y i1,k 1 b i1,i2,k +y i2,k 1 b i1,i2,k + b i2,i1,k 1 b i1,i2,k + b i2,i1,k + y i1,k +y i2,k 3 Example HW HW types H1, H1, H2 H2 and and H3 H3 with with costs of of 20, 20, 25, 25, and and 30. 30. Processors of of type P. P. Tasks T1 T1 to to T5. T5. Execution times: Time i1 start Time i2 end (Large Const) * b i1,i2,k Time i2 start Time i1 end (Large Const) * b i2,i1,k 20/09/2007 41 T H1 H2 H3 P 1 20 100 2 20 100 3 12 10 4 12 10 5 20 100 20/09/2007 42 7

Operation assignment constraints (1) T H1 H2 H3 P 1 20 100 2 20 100 3 12 10 4 12 10 5 20 100 X 1,1 +Y 1,1 =1 (task 1 mapped to H1 or to P) X 2,2 +Y 2,1 =1 X 3,3 +Y 3,1 =1 X 4,3 +Y 4,1 =1 X 5,1 +Y 5,1 =1 i I = 1 : X i, k + Yi, k k KH k KP Operation assignment constraints (2) Assume types of of tasks are are l l =1, =1, 2, 2, 3, 3, 3, 3, and and 1. 1. ll L, L, i:t(v i )= i )= ll k KP: KP: NY NY l,k l,k Y i,k i,k Functionality 3 to be implemented on processor if node 4 is mapped to it. 20/09/2007 43 20/09/2007 44 Other equations Time constraints leading to: Application specific hardware required for time constraints under 100 time units. T H1 H2 H3 P 1 20 100 2 20 100 3 12 10 4 12 10 5 20 100 Cost function: C=20 #(H1) + 25 #(H2) + 30 # (H3) + cost(processor) + cost(memory) 20/09/2007 45 Result For a time constraint of of 100 100 time units and and cost(p)<cost(h3): T H1 H2 H3 P 1 20 100 2 20 100 3 12 10 4 12 10 5 20 100 Solution (educated guessing) : T1 H1 T2 H2 T3 P T4 P T5 H1 20/09/2007 46 Separation of scheduling and partitioning Combined scheduling/partitioning very complex; Heuristic: Compute approx. time values. Perform partitioning for for approx. time values. Perform final scheduling using the the above partition. If If final final schedule does not not meet time constraint, go go to to 1 using a reduced overall timing constraint. Actual execution time specification approx. execution time t 1 st Iteration Actual execution time New specification approx. execution time 2 nd Iteration 20/09/2007 47 t How to get approx time? Compute start and end times of each node without the scheduling constraints. Only basic constraints based on topological ordering in the task graph. Time start i Σ predecessors j Y j,k * (sw time of j on k) Similarly for hardware Time start i Time end j + Σ j path(j,i) Y j,k * (sw time of j on k) where j is the dominator of i in task graph. (similar constraints for hardware costs). Compute partitioning with these time values. Then solve the scheduling using the given partition. 20/09/2007 48 8

Application example Audio lab lab (mixer, fader, echo, equalizer, balance units); slow SPARC processor 1µ 1µ ASIC library Allowable delay of of 22.675 µs µs (~ (~ 44.1 khz) Running time for COOL optimization SPARC processor ASIC (Compass, 1 µ) External memory Outdated technology; just a proof of concept. 20/09/2007 49 Only simple models can be solved optimally. 20/09/2007 50 Deviation from optimal design Running time for heuristic Hardly any loss in design quality. 20/09/2007 51 20/09/2007 52 Design space for audio lab Final remarks COOL approach: shows that that formal model of of hardware/sw codesign is is beneficial; IP IP modeling can can lead lead to to useful implementation even if if optimal result is is available only only for for small designs. Other approaches for for HW/SW partitioning: starting with with everything mapped to to hardware; gradually moving to to software as as long as as timing constraint is is met. starting with with everything mapped to to software; gradually moving to to hardware until timing constraint is is met. Simple approaches like like Binary search. Everything in software: 72.9 µs, 0 λ 2 Everything in hardware: 3.06 µs, 457.9x10 6 λ 2 Lowest cost for given sample rate: 18.6 µs, 78.4x10 6 λ 2, 20/09/2007 53 20/09/2007 54 9

Exercise Exercise T1 T2 T3 160 time units T4 T5 At most 2 nodes can be implemented in HW ASIC. Each ASIC costs 20 units and each software implementation costs 5 units. Each task s HW execution is 8 time units and the SW execution is 60 time units. Total execution time should be less than 160 time units. Perform HW-SW partitioning for this task graph. 20/09/2007 55 20/09/2007 56 Exercises A. Consider a traffic light controller with two different lights each of which must be red when the other is not red. They both start in the red state. When one of them receives a start event, it performs a cycle going from red to green to yellow and back to red. When either light reaches the red state, it tells the other to perform a cycle. Exercise A 1. Draw the above as a statechart. 2. Identify what features of statechart you found most useful? B. Event broadcasting allows output of a transition to serve as triggers of transitions in orthogonal components of a system. Can such broadcast go on in an infinite loop? Construct a small example statechart to show that this is possible. 20/09/2007 57 tm(n): a timeout transition triggered after a specified amount of time (n) 20/09/2007 has passed. 58 Exercise B 20/09/2007 59 10