Hardware/Software Codesign
3. Partitioning
Marco Platzner, Lothar Thiele

Overview
- A Model for System Synthesis
- The Partitioning Problem
- General Partitioning Methods
- HW/SW Partitioning Methods
- Case Studies
A Model for System Synthesis
Synthesis = Allocation + Binding + Scheduling
On the system level: Partitioning = Allocation + Binding

Graph Models
- Problem graph
  » vertices: functional and communication objects
  » edges: dependencies
- Architecture graph
  » vertices: functional and communication resources
  » edges: directed communication channels
- Specification graph
  » problem graph + architecture graph + possible mappings

Problem Graph
(Figure: a task graph with nodes 1-4, and the derived problem graph in which additional nodes 5-7 model the communication on the edges.)
Architecture Graph
(Figure: a target architecture with a RISC processor, two hardware modules HWM1 and HWM2, a shared bus, and a point-to-point link; the corresponding architecture graph has vertices v_RISC, v_bus, v_HWM1, v_ptp, v_HWM2.)

Specification Graph
(Figure: the specification graph combines the problem graph (nodes 1-7) with the architecture graph (v_RISC, v_bus, v_HWM1, v_ptp, v_HWM2); additional edges represent the possible mappings.)
Allocation, Binding
(Figure: in the specification graph, the allocation selects a subset of the architecture graph, and the binding selects one mapping edge per problem graph node.)

Example: Homogeneous Multiprocessor
- given: the allocation (processing elements v_PE1, v_PE2, v_PE3 with memories M_PE1, M_PE2, M_PE3, connected by a bus v_bus)
- find: binding and schedule with
  » minimal latency, or
  » guaranteed deadlines
Example: HW/SW Partitioning
In its simplest form there are only two blocks: SW and HW (bi-partitioning).
(Figure: target architecture with a processor and an ASIC connected by a bus; architecture graph with v_processor, v_bus, v_ASIC.)

Partitioning at Levels of Abstraction
- structural partitioning: at the register transfer (RTL) level, at the netlist level
  » split a digital circuit and map it to several devices (FPGAs, ASICs)
  » system parameters are relatively well-known (area, delay)
  » no comparison of design alternatives possible anymore
- functional partitioning: at the system level
  » comparison of design alternatives possible (design space exploration)
  » system parameters are unknown → estimation (analysis, simulation, rapid prototyping)
The Partitioning Problem
Definition: The partitioning problem is to assign n objects O = {o_1, ..., o_n} to m blocks (also called partitions) P = {p_1, ..., p_m}, such that
- p_1 ∪ p_2 ∪ ... ∪ p_m = O
- p_i ∩ p_j = ∅ for all i, j with i ≠ j
and the cost c(P) is minimized.
The general partitioning problem is NP-complete.

Cost Metrics - Example
cost function: f(C, L, P) = k_1 · h_C(C, C*) + k_2 · h_L(L, L*) + k_3 · h_P(P, P*)
- C: system cost in [$]
- L: execution time in [sec]
- P: power consumption in [W]
- h_C, h_L, h_P: functions that determine how much C, L, P violate the design constraints C*, L*, P* (penalty)
- k_1, k_2, k_3: weighting and normalization
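A penalty-style cost function of this shape can be sketched as follows; the linear penalty functions and the concrete constraint values are illustrative assumptions, not taken from the slides.

```python
def penalty(value, limit):
    """Linear penalty: how far a metric exceeds its design constraint (0 if met)."""
    return max(0.0, value - limit)

def cost(C, L, P, k=(1.0, 1.0, 1.0), limits=(100.0, 0.05, 2.0)):
    """Weighted sum of constraint violations for cost [$], time [s], power [W]."""
    k1, k2, k3 = k
    C_max, L_max, P_max = limits
    return k1 * penalty(C, C_max) + k2 * penalty(L, L_max) + k3 * penalty(P, P_max)

# A design meeting all constraints has zero cost:
print(cost(80.0, 0.04, 1.5))   # -> 0.0
# Exceeding the price budget by 20 $ is penalized:
print(cost(120.0, 0.04, 1.5))  # -> 20.0
```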
General Partitioning Methods
- exact methods
  » enumeration
  » Integer Linear Programs (ILP)
- heuristic methods
  » constructive methods: random mapping, hierarchical clustering
  » iterative methods: Kernighan-Lin algorithm, Simulated Annealing, Evolutionary Algorithms (EA)

Integer Linear Programs (1)
binary variables x_{i,k}
- x_{i,k} = 1: object o_i in block p_k
- x_{i,k} = 0: object o_i not in block p_k
cost c_{i,k} if object o_i is in block p_k
integer linear program:
  x_{i,k} ∈ {0,1}                          for 1 ≤ i ≤ n, 1 ≤ k ≤ m
  Σ_{k=1..m} x_{i,k} = 1                   for 1 ≤ i ≤ n
  minimize Σ_{k=1..m} Σ_{i=1..n} x_{i,k} · c_{i,k}
Integer Linear Programs (2)
additional constraints
- example: at most h_k objects in block p_k
    Σ_{i=1..n} x_{i,k} ≤ h_k               for 1 ≤ k ≤ m
ILP is NP-complete
- worst-case exponential runtime
- solved by branch & bound algorithms
- the formulation gets difficult when constraints are non-linear

Constructive Methods
random mapping
- each object is assigned to a block randomly
hierarchical clustering
- stepwise grouping of objects
- a closeness function determines how desirable it is to group two objects
constructive methods
- are often used to generate a starting partition for iterative methods
- show the difficulty of finding proper closeness functions
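For tiny instances, the ILP above can be checked against plain enumeration (the first of the exact methods listed); the cost matrix and block capacities below are made-up illustrative values.

```python
from itertools import product

def optimal_partition(cost, capacity):
    """Exhaustively enumerate all assignments of n objects to m blocks.

    cost[i][k]  = c_{i,k}, cost of placing object o_i in block p_k
    capacity[k] = h_k, maximum number of objects in block p_k
    Returns (best_cost, assignment), where assignment[i] is the block of o_i.
    Equivalent to the ILP: each object in exactly one block, minimize the
    sum of x_{i,k} * c_{i,k} subject to the capacity constraints.
    """
    n, m = len(cost), len(cost[0])
    best = (float("inf"), None)
    for assign in product(range(m), repeat=n):           # all m^n assignments
        if any(assign.count(k) > capacity[k] for k in range(m)):
            continue                                      # capacity violated
        total = sum(cost[i][assign[i]] for i in range(n))
        if total < best[0]:
            best = (total, assign)
    return best

# 3 objects, 2 blocks; without the capacities all objects would go to block 0
c = [[1, 4],
     [2, 3],
     [1, 5]]
print(optimal_partition(c, capacity=[2, 2]))   # -> (5, (0, 1, 0))
```

The worst-case exponential runtime named on the slide is visible directly: the loop inspects all m^n assignments.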
Hierarchical Clustering - Example (1)
Initial graph: nodes v_1..v_4 with edge weights v_1-v_2 = 10, v_2-v_3 = 10, v_1-v_3 = 20, v_1-v_4 = 8, v_3-v_4 = 6, v_2-v_4 = 4.
Step 1: merge the closest pair, v_5 = {v_1, v_3}. New weights: v_5-v_2 = 10, v_5-v_4 = 7.
closeness function: arithmetic mean of the edge weights

Hierarchical Clustering - Example (2)
Step 2: v_6 = {v_2, v_5}. New weight: v_6-v_4 = (7 + 4) / 2 = 5.5.
Hierarchical Clustering - Example (3)
Step 3: v_7 = {v_6, v_4}, merging the last two nodes (closeness 5.5).

Hierarchical Clustering - Example (4)
The merge steps form a dendrogram over v_1, v_2, v_3, v_4:
- step 1: v_5 = {v_1, v_3}
- step 2: v_6 = {v_2, v_5}
- step 3: v_7 = {v_6, v_4}
Cutting the dendrogram at different levels yields different partitions (cut lines).
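The example above can be reproduced with a short script: merging always picks the closest pair, and the closeness to a merged node is the arithmetic mean of its members' closenesses, as on the slides. The edge weights are the ones read off the figures.

```python
def hierarchical_clustering(weights, n):
    """Greedy agglomerative clustering.

    weights: dict mapping frozenset({a, b}) -> closeness of nodes a and b
    n: number of initial nodes (named 1..n); merged nodes get names n+1, n+2, ...
    Returns the list of merge steps as (new_node, (a, b)).
    """
    weights = dict(weights)
    nodes = set(range(1, n + 1))
    steps = []
    next_name = n + 1
    while len(nodes) > 1:
        # pick the pair of current nodes with maximal closeness
        pair = max((p for p in weights if p <= nodes), key=weights.get)
        a, b = sorted(pair)
        new = next_name
        next_name += 1
        # closeness to the merged node = arithmetic mean of the members' closenesses
        for other in nodes - pair:
            vals = [w for w in (weights.get(frozenset({a, other})),
                                weights.get(frozenset({b, other}))) if w is not None]
            if vals:
                weights[frozenset({new, other})] = sum(vals) / len(vals)
        nodes = (nodes - pair) | {new}
        steps.append((new, (a, b)))
    return steps

w = {frozenset(p): c for p, c in [((1, 2), 10), ((2, 3), 10), ((1, 3), 20),
                                  ((1, 4), 8), ((3, 4), 6), ((2, 4), 4)]}
print(hierarchical_clustering(w, 4))
# -> [(5, (1, 3)), (6, (2, 5)), (7, (4, 6))]  -- the three merge steps of the example
```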
Iterative Methods - Kernighan-Lin (1)
Generation of bi-partitions: re-group the object that leads to the largest gain in cost.
(Figure: a bi-partition of nodes v_1..v_9; example cost = number of edges crossing the partition boundary.)

Iterative Methods - Kernighan-Lin (2)
Extensions
- re-group the object that leads to the largest gain in cost or the smallest loss in cost
  » as long as a better partition is found: from all n objects, virtually re-group the best one, then from the remaining n-1 objects again the best one, etc., until all objects have been re-grouped; from these n partitions, take the one with the smallest cost and actually perform the corresponding re-group operations
  » escapes from local minima
  » asymptotic complexity O(n³)
- partitioning into m blocks: O(m·n³)
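One pass of this scheme can be sketched as follows, using the classical pair-swap form of Kernighan-Lin (one object from each block is exchanged, so the block sizes stay balanced); the cut-cost function follows the slide's example, while the small graph is an illustrative stand-in.

```python
def cut_cost(edges, side):
    """Number of edges crossing the bi-partition (side[v] is 0 or 1)."""
    return sum(1 for a, b in edges if side[a] != side[b])

def kl_pass(edges, side):
    """One Kernighan-Lin pass: repeatedly (virtually) swap the best unlocked
    pair, lock it, and finally keep the prefix of swaps with the lowest
    cut cost seen -- even intermediate swaps that worsen the cost are made,
    which is how the method escapes from local minima."""
    side = dict(side)
    trial = dict(side)
    unlocked = set(side)
    best_cost, best_side = cut_cost(edges, side), dict(side)
    while True:
        pairs = [(x, y) for x in sorted(unlocked) if trial[x] == 0
                        for y in sorted(unlocked) if trial[y] == 1]
        if not pairs:
            break
        def cost_after(p):
            x, y = p
            trial[x], trial[y] = 1, 0
            c = cut_cost(edges, trial)
            trial[x], trial[y] = 0, 1
            return c
        x, y = min(pairs, key=cost_after)
        trial[x], trial[y] = 1, 0          # perform the best swap
        unlocked -= {x, y}                 # lock both objects
        c = cut_cost(edges, trial)
        if c < best_cost:
            best_cost, best_side = c, dict(trial)
    return best_side, best_cost

# cost = number of edges crossing the partition, as in the slide's example
edges = [("a", "b"), ("c", "d"), ("a", "c")]
side = {"a": 0, "b": 1, "c": 0, "d": 1}
print(kl_pass(edges, side)[1])   # -> 1 (the initial cut cost was 2)
```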
Iterative Methods - Simulated Annealing (1)
from physics:
- metal and gas take on a minimal-energy state during cooling down (under certain constraints):
  » at each temperature, the system reaches a thermodynamic equilibrium
  » the temperature is decreased sufficiently slowly
- probability that a particle jumps to a higher-energy state:
    P(e_i, e_j, T) = e^(-(e_j - e_i) / (k_B · T))
application to combinatorial optimization
- energy = cost of a solution (partition)
- cost decreases with temperature; sometimes (with a certain probability) increases in cost are accepted

Iterative Methods - Simulated Annealing (2)
temp = temp_start;
cost = c(P);
while (Frozen() == FALSE) {
  while (Equilibrium() == FALSE) {
    P' = RandomMove(P);
    cost' = c(P');
    deltacost = cost' - cost;
    if (Accept(deltacost, temp) > random[0,1)) {
      P = P';
      cost = cost';
    }
  }
  temp = DecreaseTemp(temp);
}

Accept(deltacost, temp) = min(1, e^(-deltacost / (k · temp)))
Iterative Methods - Simulated Annealing (3)
cooling down: DecreaseTemp(), Frozen()
  » temp_start = 1.0
  » temp = α · temp (typical: 0.8 ≤ α ≤ 0.99)
  » terminate when temp < temp_min or there is no more improvement
equilibrium: Equilibrium()
  » after a defined number of iterations or when there is no more improvement
complexity
- from exponential to constant, depending on the implementation of the functions Equilibrium(), DecreaseTemp(), and Frozen()
- the longer the runtime, the better the quality of the results
- typically the functions are constructed to obtain polynomial runtimes

Iterative Methods - EA (1)
Principles of Evolution
- Selection
- Cross-over
- Mutation
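The pseudocode and cooling schedule above translate into the following sketch. The geometric cooling (α = 0.95) and the fixed number of moves per temperature follow the typical choices named on the slides; the toy cost function (block imbalance) and the single-object random move are illustrative assumptions.

```python
import math
import random

def simulated_annealing(cost, random_move, p0,
                        temp_start=1.0, temp_min=1e-3, alpha=0.95,
                        moves_per_temp=100):
    """Simulated annealing as in the slides' pseudocode.

    cost(P)        -> cost of a partition
    random_move(P) -> a random neighbouring partition P'
    Accept(delta, T) = min(1, exp(-delta / T)): improvements are always
    taken, worse solutions with a probability that shrinks as T drops.
    """
    P, c = p0, cost(p0)
    temp = temp_start
    while temp > temp_min:                         # Frozen()
        for _ in range(moves_per_temp):            # Equilibrium()
            P2 = random_move(P)
            c2 = cost(P2)
            delta = c2 - c
            if delta <= 0 or math.exp(-delta / temp) > random.random():
                P, c = P2, c2
        temp *= alpha                              # DecreaseTemp()
    return P, c

# toy bi-partitioning: objects with sizes, cost = imbalance between the blocks
sizes = [4, 3, 3, 2, 2, 2]
def imbalance(part):
    s0 = sum(s for s, side in zip(sizes, part) if side == 0)
    return abs(s0 - (sum(sizes) - s0))
def flip_one(part):
    i = random.randrange(len(part))
    return part[:i] + (1 - part[i],) + part[i + 1:]

random.seed(0)
best, best_cost = simulated_annealing(imbalance, flip_one, (0,) * len(sizes))
print(best_cost)   # a well-balanced split; 0 is the optimum for these sizes
```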
Iterative Methods - EA (2)
Example: minimize g(x) = x²
(Figure: each individual encodes one solution x; fitness calculation, e.g. fitness = 9; selection of parents; cross-over and mutation produce the next generation.)

Application Domains of EA
- problems that are diffuse and complex
- examples:
  » system synthesis
  » route planning in robotics
  » container loading
multi-criteria optimization
- multiple criteria that are conflicting
  » example: performance vs. cost vs. power consumption
- EA find Pareto fronts (sets of Pareto points)
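The slide's toy problem, minimizing g(x) = x² over bit-string individuals, can be sketched with all three principles (selection, cross-over, mutation); the population size, rates, and the 5-bit encoding are illustrative choices.

```python
import random

def evolve(generations=30, pop_size=8, bits=5, seed=1):
    """Tiny generational EA minimizing g(x) = x^2 (x encoded as a bit string)."""
    rng = random.Random(seed)
    def decode(ind):                      # bit string -> integer x
        return int("".join(map(str, ind)), 2)
    def fitness(ind):                     # lower is better (we minimize g)
        return decode(ind) ** 2
    pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            # selection: binary tournament picks each parent
            a, b = (min(rng.sample(pop, 2), key=fitness) for _ in range(2))
            # cross-over: single cut point
            cut = rng.randrange(1, bits)
            child = a[:cut] + b[cut:]
            # mutation: flip each bit with small probability
            child = [g ^ 1 if rng.random() < 0.1 else g for g in child]
            nxt.append(child)
        pop = nxt
        best = min(pop + [best], key=fitness)   # keep the best-ever individual
    return decode(best), fitness(best)

x, g = evolve()
print(x, g)   # with enough generations the best individual approaches x = 0
```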
Dominance, Pareto Points (1)
Definition: A (design) point J_i is dominated by J_k if J_k is better than or equal to J_i in each criterion.
Definition: A point is Pareto-optimal (a Pareto point) if it is not dominated.

Dominance, Pareto Points (2)
(Figure: six design points 1-6 in the cost / execution-time plane; points 1, 4, and 5 form the Pareto front, while 2, 3, and 6 are dominated.)
Pareto-Ranking
Fitness function: F(J) = |{ J_i , i = 1..N : J_i dominates J }|, i.e., the number of points that dominate J. Pareto points have fitness 0.
(Figure: the six design points in the cost / execution-time plane with their ranks:)
F(1) = 0, F(2) = 1, F(3) = 2, F(4) = 0, F(5) = 0, F(6) = 5

EA - Case Study (1)
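Pareto ranking as defined above can be computed directly. The point coordinates below are illustrative values chosen to reproduce the ranks shown in the figure (minimizing both cost and execution time); the extra "strictly better in at least one criterion" clause is the usual refinement that keeps identical points from dominating each other.

```python
def dominates(a, b):
    """a dominates b: a is better or equal in every criterion (minimization)
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_rank(points):
    """F(J) = number of points that dominate J; Pareto points have F(J) = 0."""
    return [sum(dominates(q, p) for q in points) for p in points]

# (cost, execution time) for points 1..6, chosen so the ranks match the slide
pts = [(1, 5), (2, 6), (2, 7), (3, 3), (5, 1), (6, 8)]
print(pareto_rank(pts))   # -> [0, 1, 2, 0, 0, 5]
```

Points with rank 0 (here 1, 4, and 5) are exactly the Pareto front.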
EA - Case Study (2)

EA - Case Study (3)
(Figure: block diagram of the video encoder used in the case study: input module, frame memory, dual-ported frame memory, block matching module, subtract/add module, DCT/IDCT module, Huffman encoder, output module.)
HW/SW Partitioning
simplest case: bi-partitioning (processor/ASIC system)
P = {p_SW, p_HW}
software-oriented approach: P = {O, {}}
- in software, all functions can be realized
- but the performance may be unacceptably low
→ migrate objects into hardware to improve performance
hardware-oriented approach: P = {{}, O}
- in hardware, the performance is sufficient
- but the cost might be too high
→ migrate objects into software to lower cost

Greedy Algorithms
migration of objects into the other block (HW/SW) until there is no more improvement:
repeat {
  P_old = P;
  for i = 1 to n {
    if (f(Move(P, o_i)) < f(P)) {
      P = Move(P, o_i);
    }
  }
} until (P == P_old)
f: cost function; Move(P, o_i) moves object o_i into the other block
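The greedy migration loop above can be sketched as follows; the software-oriented starting point matches the slide, while the simple cost model (hardware area plus a penalty for software runtime over a deadline) is an illustrative assumption.

```python
def greedy_partition(objects, cost):
    """Greedy HW/SW bi-partitioning as in the pseudocode above.

    objects: list of object names
    cost(part): cost of a partition, part[o] in {"SW", "HW"}
    Start from all-software and repeatedly move any object whose migration
    lowers the cost, until a full pass brings no improvement.
    """
    part = {o: "SW" for o in objects}
    while True:
        old = dict(part)
        for o in objects:                  # for i = 1 to n
            moved = dict(part)             # Move(P, o_i)
            moved[o] = "HW" if part[o] == "SW" else "SW"
            if cost(moved) < cost(part):
                part = moved
        if part == old:                    # until (P == P_old)
            return part

# toy cost model (illustrative): HW area per object + penalty for SW runtime
hw_area = {"a": 5, "b": 1, "c": 1}
sw_time = {"a": 2, "b": 8, "c": 9}
def cost(part):
    area = sum(hw_area[o] for o in part if part[o] == "HW")
    time = sum(sw_time[o] for o in part if part[o] == "SW")
    return area + 0.5 * max(0, time - 10)   # time over a 10-unit deadline is penalized

print(greedy_partition(["a", "b", "c"], cost))
# -> {'a': 'SW', 'b': 'HW', 'c': 'SW'} -- only the cheap, slow object b migrates
```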
Yorktown Silicon Compiler (YSC)
functional partitioning of hardware
- input: functional description at the level of arithmetic and logic expressions
- partitioning into functional units of a datapath (ALUs, registers)
- method: hierarchical clustering
- closeness function:

    Closeness(p_i, p_j) = ( sharedwires(p_i, p_j) / maxwires(P) )^c1
                          · ( maxsize / min{size(p_i), size(p_j)} )^c2
                          · ( maxsize / (size(p_i) + size(p_j)) )^c3

Hw/Sw Partitioning - Vulcan
input: program in HardwareC
- C extended by a process concept and inter-process communication
- specification with constraints (min/max times and rates)
target architecture: single processor / single ASIC
- one global bus, one global memory
- the processor is the bus master
abstraction level: basic blocks and operations
- deterministic computation times
- internally/externally non-deterministic computation times
method: HW-oriented greedy algorithm
- cost function includes HW cost, memory requirement, performance, and synchronization effort
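The YSC closeness function rewards merging blocks that share many wires and are small, both individually and combined. A numeric sketch, with illustrative wire counts, sizes, and exponents:

```python
def closeness(shared, maxwires, size_i, size_j, maxsize, c1=1.0, c2=1.0, c3=1.0):
    """YSC-style closeness of two blocks p_i, p_j.

    shared   = sharedwires(p_i, p_j), maxwires = maxwires(P)
    The three factors reward shared wiring, small individual size,
    and small combined size; c1..c3 weight the three aspects.
    """
    return ((shared / maxwires) ** c1
            * (maxsize / min(size_i, size_j)) ** c2
            * (maxsize / (size_i + size_j)) ** c3)

# two small, tightly connected blocks score higher than two large, loose ones
tight_small = closeness(shared=8, maxwires=10, size_i=20, size_j=30, maxsize=100)
loose_large = closeness(shared=2, maxwires=10, size_i=80, size_j=90, maxsize=100)
print(tight_small > loose_large)   # -> True
```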
Hw/Sw Partitioning - Cosyma
input: program in C^x
- C extended by a process concept and inter-process communication
- specification with min/max times
target architecture: processor + coprocessor
- coupled by shared memory
- computations on the processor and on the coprocessor may not overlap
abstraction level: basic blocks
method: SW-oriented, two nested loops:
- inner loop: Simulated Annealing with a cost function that measures the gain in computation time for a hardware realization of a block
- outer loop: synthesis to obtain estimations for the inner loop