Chapter 8 Folding. VLSI DSP 2008 Y.T. Hwang 8-1. Introduction (1)


Introduction (1)
Folding: a DSP architecture in which multiple operations are time-multiplexed onto a single function unit.
Folding trades area for time in a DSP architecture: the number of function units is reduced by a factor of N at the expense of increasing the computing time by a factor of N.
N: folding factor.
This chapter presents a systematic way to derive the folded DSP architecture.

Introduction (2)
Folding example: $y(n) = a(n) + b(n) + c(n)$, time-multiplexed on a single pipelined adder.
Each input sample must remain valid for 2 clock cycles.

Introduction (3)
More on folding: folding may lead to an architecture that uses a large number of registers, so the design should minimize the number of registers.

Folding transformation (1)
Preliminary: consider a DFG with an edge e connecting nodes U and V with $w(e)$ delays.
The l-th iterations of U and V are executed at time units $Nl + u$ and $Nl + v$.
u and v: folding orders, with $0 \le u, v \le N - 1$.
N: folding factor, the number of operations folded onto a single function unit.
$H_U$ and $H_V$: function units that execute nodes U and V. $H_U$ is pipelined by $P_U$ stages.

Folding transformation (2)
Folding an edge e: U → V that has $w(e)$ delays.
The result of the l-th iteration of node U is available at time $Nl + u + P_U$.
The generated data is used by the $(l + w(e))$-th iteration of V, which executes at time $N(l + w(e)) + v$.
The result must therefore be stored for
$D_F(U \to V) = [N(l + w(e)) + v] - [Nl + P_U + u] = N\,w(e) - P_U + v - u$
time units, which does not depend on l.
Folding factor = N.
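The folding equation can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names and the sample edge (N = 4, two delays, one pipeline stage, u = 3, v = 1) are illustrative assumptions.

```python
def folded_delay(N, w, P_u, u, v):
    """Registers on the folded edge U -> V.

    N   : folding factor
    w   : delays on the original edge U -> V
    P_u : pipeline stages of the unit that executes U
    u, v: folding orders of U and V (0 <= u, v <= N - 1)
    """
    if not (0 <= u < N and 0 <= v < N):
        raise ValueError("folding orders must lie in [0, N - 1]")
    return N * w - P_u + v - u


def storage_time(N, w, P_u, u, v, l):
    """Same quantity computed from the two time stamps for iteration l."""
    produced = N * l + u + P_u      # result of U leaves the pipeline
    consumed = N * (l + w) + v      # iteration l + w of V reads it
    return consumed - produced


# Hypothetical edge used only for illustration.
example = folded_delay(4, 2, 1, 3, 1)
```

Comparing `storage_time` for several values of `l` against `folded_delay` shows that the iteration index cancels, which is why one fixed register count per folded edge suffices.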

Folding transformation (3)
Folding set: an ordered set of operations executed by the same hardware.
Example: $S_1 = \{A_1, \varnothing, A_2\}$, so $A_1$ is $(S_1|0)$ and $A_2$ is $(S_1|2)$.
Biquad filter example:
Addition: 1 u.t. and 1-stage pipelining, $P_A = 1$.
Multiplication: 2 u.t. and 2-stage pipelining, $P_M = 2$.
Folding factor N = 4.
Assume folding sets $S_1 = \{4, 2, 3, 1\}$ and $S_2 = \{5, 8, 6, 7\}$.

Folding transformation (4)
Biquad filter example, cont. Node 3 is executed on the adder at time instances $4l + 2$.
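The relation between a folding set and the folding orders can be written down directly. This is a small illustrative Python sketch; the helper names are assumptions, and `None` stands in for the null operation Ø.

```python
def folding_orders(folding_set):
    """Map each operation to its folding order (its position in the set).

    A None entry is a null operation: the unit idles in that time partition.
    """
    return {op: pos for pos, op in enumerate(folding_set) if op is not None}


def execution_time(folding_set, op, l):
    """Time unit at which iteration l of `op` runs on the shared unit."""
    N = len(folding_set)            # folding factor = size of the folding set
    return N * l + folding_orders(folding_set)[op]


S1 = [4, 2, 3, 1]                   # adder folding set of the biquad example
S2 = [5, 8, 6, 7]                   # multiplier folding set
S_example = ["A1", None, "A2"]      # set with a null operation
```

For instance, `execution_time(S1, 3, l)` evaluates to `4 * l + 2`, which is the time instance given for node 3.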

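Folding transformation (5)
Biquad filter example, cont. $D_F(1 \to 8) = 5$: an edge from the adder to the multiplier in the folded DFG with 5 delays.
Because node 8 is $(S_2|1)$, the folded edge is switched at the input of the multiplier at $4l + 1$.

Folding transformation (6)
Valid folding: $D_F(U \to V) \ge 0$ must hold for all edges in the DFG. This can be achieved by retiming.
Recall that an edge e: U → V has $w_r(e) = w(e) + r(V) - r(U)$ delays after retiming.
Let $D'_F(U \to V)$ denote the number of folded delays obtained by folding the retimed DFG. Then
$D'_F(U \to V) \ge 0$
$\Rightarrow N\,w_r(e) - P_U + v - u \ge 0$
$\Rightarrow N\,(w(e) + r(V) - r(U)) - P_U + v - u \ge 0$
$\Rightarrow r(U) - r(V) \le D_F(U \to V) / N$
$\Rightarrow r(U) - r(V) \le \lfloor D_F(U \to V) / N \rfloor$,
where the last step holds because retiming values are integers.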
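The bound on $r(U) - r(V)$ can be evaluated per edge. The sketch below is illustrative Python, not part of the slides; the sample edge (N = 4, no delays, a 2-stage source unit, u = 3, v = 2) is an assumption chosen so that the unretimed folded delay is negative.

```python
def retiming_bound(N, w, P_u, u, v):
    """Upper bound on r(U) - r(V) that makes the folded edge valid."""
    d_f = N * w - P_u + v - u
    return d_f // N                 # Python's // floors, also for negative d_f


def is_valid_after_retiming(N, w, P_u, u, v, r_u, r_v):
    """True if the folded delay of the retimed edge is non-negative."""
    w_r = w + r_v - r_u             # delays on the edge after retiming
    return N * w_r - P_u + v - u >= 0


# Assumed edge: D_F = 4*0 - 2 + 2 - 3 = -3, so r(U) - r(V) <= -1.
bound = retiming_bound(4, 0, 2, 3, 2)
```

Floor division matters here: truncating $-3/4$ toward zero would give 0 and wrongly accept the unretimed edge.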

Folding transformation (7)
Retiming for valid folding: solve a system of inequalities.
First construct a constraint graph, then use the Floyd-Warshall algorithm to solve the problem.

Folding transformation (8)
Retiming for valid folding, cont.
[Figure: constraint graph]
Solution: $r(1) = -1$, $r(2) = 0$, $r(3) = -1$, $r(4) = 0$, $r(5) = -1$, $r(6) = -1$, $r(7) = -2$, $r(8) = -1$.
This leads to the DFG in Fig. 6.3.
The same result can be achieved by cutset retiming using cutsets $C_1$ and $C_2$.
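As an illustration of solving the inequality system, the following Python sketch uses Bellman-Ford style relaxation from a virtual source instead of the Floyd-Warshall algorithm named on the slide; both solve the same shortest-path formulation. The edge list of the unretimed biquad is an assumption reconstructed from the direct-form structure (it is not listed in the text), and all function names are hypothetical.

```python
def solve_retiming(nodes, edges, N, P, order):
    """Find r with r(U) - r(V) <= floor(D_F(U->V) / N) for every edge.

    edges is a list of (U, V, w). Returns a dict of retiming values,
    or None when the constraints are infeasible.
    """
    arcs = []
    for U, V, w in edges:
        d_f = N * w - P[U] + order[V] - order[U]
        arcs.append((V, U, d_f // N))       # constraint arc V -> U
    r = {n: 0 for n in nodes}               # virtual source reaches all at 0
    for _ in range(len(nodes)):
        changed = False
        for a, b, c in arcs:
            if r[a] + c < r[b]:
                r[b] = r[a] + c
                changed = True
        if not changed:
            return r
    return None                             # negative cycle: no solution


S1, S2 = [4, 2, 3, 1], [5, 8, 6, 7]         # adder / multiplier folding sets
order = {op: pos for s in (S1, S2) for pos, op in enumerate(s)}
P = {n: 1 for n in S1}                      # 1-stage adders
P.update({n: 2 for n in S2})                # 2-stage multipliers

# Assumed delays of the unretimed biquad (U, V, w).
edges = [(1, 2, 0), (1, 5, 1), (1, 6, 1), (1, 7, 2), (1, 8, 2),
         (3, 1, 0), (4, 2, 0), (5, 3, 0), (6, 4, 0), (7, 3, 0), (8, 4, 0)]

r = solve_retiming(sorted(order), edges, 4, P, order)
```

Under these assumed edge delays the sketch returns r(1) = -1, r(2) = 0, r(3) = -1, r(4) = 0, r(5) = -1, r(6) = -1, r(7) = -2, r(8) = -1. Other feasible solutions exist; shortest-path distances simply pick one of them.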

More on folding
The original DFG and the N-unfolded version of the folded DFG synthesized with folding factor N are retimed and/or pipelined versions of each other.
An arbitrary DFG can be unfolded by a factor N and then folded again to generate a family of architectures.

Register minimization in folding (1)
Lifetime analysis: computes the minimum number of registers required to implement a DSP algorithm in hardware.
A data sample (variable) is live from the time it is produced (excluded) through the time it is consumed (included).
A variable is called dead after its lifetime.
The maximum number of live variables over all time units is the minimum number of registers required to implement the DSP program.

Register minimization in folding (2)
Example: assume 3 variables a, b, c.
Lifetime of variable a: {1, 2, 3, 4}
Lifetime of variable b: {2, 3, 4, 5, 6, 7}
Lifetime of variable c: {5, 6, 7}
Number of live variables in cycles 1 to 7: {1, 2, 2, 2, 2, 2, 2}
2 registers are needed to implement the DSP program.

Linear lifetime chart (1)
When the iteration period is less than the span of the schedule, successive iterations of the schedule overlap.
The number of live variables at time instance n is then the sum of the numbers of live variables at cycles $n - kN$, $k \ge 0$.
[Figure: non-overlapped schedule, and overlapped schedule with period 6]
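Lifetime analysis amounts to counting live variables per cycle. The sketch below is illustrative Python with assumed names; it represents each lifetime as a (produced, consumed) pair, so variable a with live cycles {1, 2, 3, 4} becomes (0, 4). Applying a period of 6 to the same three variables is an assumption made only to show the overlap rule.

```python
from collections import Counter


def min_registers(lifetimes, period=None):
    """lifetimes maps a variable to (t_produced, t_consumed).

    A variable is live in cycles t_produced + 1 .. t_consumed.
    With a period N, cycle t of every iteration folds onto t mod N.
    Returns (minimum register count, live-variable count per cycle).
    """
    live = Counter()
    for t_produced, t_consumed in lifetimes.values():
        for t in range(t_produced + 1, t_consumed + 1):
            live[t % period if period else t] += 1
    return max(live.values(), default=0), dict(live)


lifetimes = {"a": (0, 4), "b": (1, 7), "c": (4, 7)}
registers, per_cycle = min_registers(lifetimes)
registers_overlapped, _ = min_registers(lifetimes, period=6)
```

Without overlap the count peaks at 2. With the assumed period of 6, cycle 7 of one iteration coincides with cycle 1 of the next, so a, b and c are live together and the peak rises to 3.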

Linear lifetime chart (2)
Matrix transpose example. Assume row-wise access:
input matrix     transposed output
a b c            a d g
d e f            b e h
g h i            c f i
Input time: $T_{input}$
Zero-latency output time: $T_{zlout}$
$T_{diff} = T_{zlout} - T_{input}$
Required latency $T_{lat}$ = magnitude of the most negative value of $T_{diff}$
$T_{output} = T_{zlout} + T_{lat}$

Linear lifetime chart (3)
Matrix transpose example, cont. Assume the iteration period of the DSP program is N = 9.
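The lifetime table of the transposer follows mechanically from the two access orders. This Python sketch is an illustration under the assumption that one sample arrives per cycle starting at cycle 0 and that outputs leave in transposed row-wise order; the function names are hypothetical.

```python
from collections import Counter


def transpose_lifetimes(n=3):
    """Lifetime table (T_input, T_output) for an n x n transposer."""
    rows = {}
    for r in range(n):
        for c in range(n):
            name = chr(ord("a") + r * n + c)
            t_input = r * n + c      # arrival order: row by row
            t_zlout = c * n + r      # position in the transposed stream
            rows[name] = (t_input, t_zlout)
    # Latency = magnitude of the most negative T_diff = T_zlout - T_input.
    t_lat = max(0, -min(z - i for i, z in rows.values()))
    table = {k: (i, z + t_lat) for k, (i, z) in rows.items()}
    return table, t_lat


def min_registers(table, period):
    """Peak number of live variables when the schedule repeats every period."""
    live = Counter()
    for t_input, t_output in table.values():
        for t in range(t_input + 1, t_output + 1):
            live[t % period] += 1
    return max(live.values(), default=0)


table, t_lat = transpose_lifetimes(3)
registers = min_registers(table, period=9)
```

Under these assumptions the required latency is 4 cycles, and with N = 9 the peak of the overlapped lifetime chart is 4 live variables.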

Circular lifetime chart
Point i represents time partition i and all time instances $\{Nl + i\}$.
[Figure: linear lifetime chart and the equivalent circular lifetime chart]

Data allocation (1)
Forward-backward register allocation: determines how variables are assigned to registers in the allocation table so that the minimum number of registers is achieved.
Step 1: Determine the minimum number of registers using lifetime analysis.
Step 2: Input each variable at the time step corresponding to the beginning of its lifetime. If multiple variables are input in a given cycle, they are allocated to multiple registers in descending order of lifetime.

Data allocation (2)
Forward allocation: if register i holds a variable in the current cycle, then register i+1 holds the same variable in the next cycle. If register i+1 is not available, the variable is allocated to the first available forward register.
Step 3: Each variable is allocated in a forward manner until it is dead or reaches the last register.
Step 4: In periodic scheduling, the allocation of the current iteration repeats itself in subsequent iterations. If $R_j$ is occupied by a variable in cycle l, hash the position for $R_j$ at time unit $l + N$.

Data allocation (3)
Step 5: A variable that reaches the last register and is not yet dead is allocated in a backward manner. If multiple registers are available, choose the one with the least but sufficient number of forward registers capable of completing the allocation. After a variable has been allocated backward, allocate it in a forward manner until it is dead or again reaches the last register.
Step 6: Repeat steps 4 and 5 as required until the allocation is complete.

Data allocation (4)
3×3 matrix transpose example with N = 9.
[Figure: allocation table with hashing, after completion of steps 1-4]

Data allocation (5)
Another example.
[Figure: linear lifetime chart and allocation table after completion of steps 1-4]

Data allocation (6), (7)
[Figures: architecture design after register allocation]

Register minimization in folding
Goal: to synthesize control circuits in folded architectures with the minimum number of registers.
Procedure:
Perform retiming for folding.
Write the folding equations.
Use the folding equations to construct a lifetime table.
Draw the lifetime chart and determine the required number of registers.
Perform forward-backward register allocation.
Draw the folded architecture that uses the minimum number of registers.

Biquad filter example (1)
[Figures: original biquad filter design, and the design after retiming]

Biquad filter example (2)
Design without register minimization: a total of 6 external registers and 3 internal pipelining registers.
[Figure: folding equations and folded architecture]

Biquad filter example (3)
Construct a lifetime table: each node U with lifetime $T_{input} \to T_{output}$ corresponds to an entry in the table.
$T_{input} = u + P_U$, where u is the folding order and $P_U$ is the number of pipelining stages of the function unit.
$T_{output} = u + P_U + \max_V \{D_F(U \to V)\}$.
For node 1, the folding order is 3 and the adder's $P_U$ is 1, so $T_{input} = 3 + 1 = 4$ and $T_{output} = 3 + 1 + \max\{1, 0, 2, 3, 5\} = 9$.
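The lifetime table can be generated from the folding equations. The Python sketch below is illustrative only: the delays of the retimed biquad are assumptions, chosen to be consistent with the folded delays {1, 0, 2, 3, 5} quoted for node 1, and the helper names are hypothetical.

```python
from collections import Counter

N = 4
S1, S2 = [4, 2, 3, 1], [5, 8, 6, 7]      # adder / multiplier folding sets
order = {op: pos for s in (S1, S2) for pos, op in enumerate(s)}
P = {n: 1 for n in S1}                   # 1-stage adders
P.update({n: 2 for n in S2})             # 2-stage multipliers

# Assumed delays of the retimed biquad (U, V, w).
retimed_edges = [(1, 2, 1), (1, 5, 1), (1, 6, 1), (1, 7, 1), (1, 8, 2),
                 (3, 1, 0), (4, 2, 0), (5, 3, 0), (6, 4, 1), (7, 3, 1),
                 (8, 4, 1)]


def folded_delay(U, V, w):
    return N * w - P[U] + order[V] - order[U]


def lifetime_table():
    """T_input = u + P_U, T_output = T_input + longest folded delay out of U."""
    longest = {n: 0 for n in order}
    for U, V, w in retimed_edges:
        longest[U] = max(longest[U], folded_delay(U, V, w))
    return {n: (order[n] + P[n], order[n] + P[n] + longest[n])
            for n in sorted(order)}


def min_registers(table):
    """Peak number of live variables on the circular chart of period N."""
    live = Counter()
    for t_input, t_output in table.values():
        for t in range(t_input + 1, t_output + 1):
            live[t % N] += 1
    return max(live.values(), default=0)


table = lifetime_table()
registers = min_registers(table)
```

With these assumed delays only nodes 1, 7 and 8 have a non-zero lifetime (4 to 9, 5 to 6 and 3 to 4), and the circular chart with period 4 peaks at 2 live variables.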

Biquad filter example (4)
Construct a lifetime table and lifetime chart. Assume the iteration period N is 4.
The minimum number of registers required is 2.

Biquad filter example (5)
Allocation table: only the variables $n_1$, $n_7$ and $n_8$, which have non-zero duration, are shown.
Variable $n_1$ is output in cycles 4, 5, 6, 7 and 9 (its input time 4 plus the folded delays 0, 1, 2, 3 and 5); only the latest cycle, 9, is shown in the table.

Biquad filter example (6)
Folded design with registers.
Edge 1 → 2 has $D_F(1 \to 2) = 1$ delay, and after 1 delay the variable $n_1$ is located in $R_1$.
An edge from $R_1$ to the adder is therefore switched at $4l + 1$, because node 2 has folding order 1.

Biquad filter example (7)
Folded design with registers, cont.
Edge 1 → 7 has $D_F(1 \to 7) = 3$ delays, and after 3 delays the variable $n_1$ sits in the register that the allocation table assigns to it in that cycle.
An edge from that register to the multiplier is switched at $4l + 3$, because node 7 has folding order 3.

IIR filter example (1)
IIR filter before retiming: $y(n) = a\,y(n-3) + b\,y(n-5) + x(n)$.
Folding factor N = 2.
Folding sets: the adder set $S_1$ holds nodes 1 and 2, and the multiplier set is $S_2 = \{4, 3\}$.
Retiming solution: $r(1) = 0$, $r(2) = 0$, with negative retiming values for nodes 3 and 4.

IIR filter example (2)
IIR filter after retiming. The folding equations for the retimed DFG evaluate to folded delays of 0, 5, 3, 4 and 0.
[Table: lifetime table]

IIR filter example (3)
[Figure: lifetime chart]
A total of 3 registers is needed.

IIR filter example (4)
[Figure: allocation table and folded design]
3 registers with minimization vs. 6 registers without.

Folding of multi-rate systems (1)
Decimators and expanders lead to a multi-rate system: decimation by M and expansion by M.
Decimator: throws away M-1 out of every M samples, $y_D(n) = x(Mn)$.
Expander: inserts M-1 zeros between samples, $y_E(n) = x(n/M)$ if n is a multiple of M, and 0 otherwise.

Folding of multi-rate systems (2)
Folding of a decimator.
[Figure: arc with decimator, and the folded arc]
The l-th iteration of node U is executed at time $N_U l + u$.
The l-th iteration of node V is executed at time $N_V l + v$.
Folding order $u \in [0, N_U - 1]$, folding order $v \in [0, N_V - 1]$.
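The two rate-changing operations are easy to state on finite sequences. This is a minimal Python sketch on plain lists; the function names are illustrative.

```python
def decimate(x, M):
    """Decimator: y_D(n) = x(M n), keeping one out of every M samples."""
    return x[::M]


def expand(x, M):
    """Expander: y_E(n) = x(n / M) when M divides n, else 0."""
    y = [0] * (len(x) * M)
    y[::M] = x                      # place the samples, zeros stay in between
    return y


samples = [1, 2, 3]
expanded = expand(samples, 2)       # [1, 0, 2, 0, 3, 0]
recovered = decimate(expanded, 2)   # [1, 2, 3]
```

Decimating an expanded sequence by the same M returns the original samples, whereas expanding a decimated sequence does not restore the discarded ones.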

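Folding of multi-rate systems (3)
Folding of a decimator, cont. Let the arc carry $w_1$ delays before the decimator and $w_2$ delays after it, let x be the output of U and s the output of the decimator:
$s(l) = x(Ml - w_1)$, $y(l) = s(l - w_2) = x(Ml - Mw_2 - w_1)$.
The sample $y(l)$ consumed during the l-th iteration of V is produced during the $(Ml - Mw_2 - w_1)$-th iteration of U.
$y(l)$ is consumed by $H_V$ in time unit $N_V l + v$ and generated by $H_U$ in time unit $N_U (Ml - Mw_2 - w_1) + u + P_U$.
$y(l)$ must be stored for
$D_F(U \to V) = [N_V l + v] - [N_U (Ml - Mw_2 - w_1) + P_U + u]$
time units.

Folding of multi-rate systems (4)
Folding of a decimator, cont. In a decimator, $N_V = M N_U$: node U executes M times for each execution of node V. The storage time then reduces to
$D_F(U \to V) = N_U (M w_2 + w_1) - P_U + v - u$.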
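The decimator folding equation can be checked the same way as the single-rate one. In this Python sketch the names `w1` (delays before the decimator) and `w2` (delays after it) are labelling assumptions, and the numeric example is hypothetical.

```python
def decimator_folded_delay(N_u, M, w1, w2, P_u, u, v):
    """Folded delay of an arc U -> (w1 D) -> decimate by M -> (w2 D) -> V."""
    return N_u * (M * w2 + w1) - P_u + v - u


def storage_time(N_u, M, w1, w2, P_u, u, v, l):
    """Same quantity from the production and consumption time stamps."""
    N_v = M * N_u                           # V runs once per M runs of U
    produced_iter = M * (l - w2) - w1       # iteration of U that makes y(l)
    produced = N_u * produced_iter + u + P_u
    consumed = N_v * l + v
    return consumed - produced


# Hypothetical arc: N_U = 3, M = 2, one delay on each side, P_U = 1.
example = decimator_folded_delay(3, 2, 1, 1, 1, 2, 4)
```

The terms in l cancel only because $N_V = M N_U$; with any other ratio the storage time would grow or shrink from one iteration to the next.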

Folding of multi-rate systems (5)
Decimator folding example.
[Figure: folding factors, folding orders and folding equations of the example]

Folding of multi-rate systems (6)
Decimator folding example, cont.
The number of registers required can be reduced using lifetime analysis.
$D_F \ge 0$ must hold for a feasible schedule.
Noble identities: delay redistribution in a multi-rate system.

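Folding of multi-rate systems (7)
Retiming of a multi-rate DFG.
Let $w_1'$ and $w_2'$ be the numbers of delays on the arc before and after the decimator once retiming has been applied.
$r(U)$, $r(V)$: retiming values of nodes U and V, respectively.
$r_{UV}$: number of times one delay is removed from the decimator's output and M delays are added to its input.
$w_1' = w_1 + M r_{UV} - r(U)$
$w_2' = w_2 + r(V) - r_{UV}$
$D'_F(U \to V) = N_U (M w_2' + w_1') - P_U + v - u = N_U (M w_2 + w_1) + N_U (M r(V) - r(U)) - P_U + v - u \ge 0$
$\Rightarrow r(U) - M r(V) \le \lfloor D_F(U \to V) / N_U \rfloor$,
where $D_F(U \to V) = N_U (M w_2 + w_1) - P_U + v - u$.

Folding of multi-rate systems (8)
Retiming of a multi-rate DFG, cont.
Note that retiming a multi-rate DFG may yield a result that is not equivalent to the original, because the system is periodically time-varying.
Example: with a negative retiming value for the adder and $r(MPY) = 0$, the output $z(n)$, originally a sum of $a \cdot x$ and $y$ samples, becomes a sum of delayed x and y samples that no longer matches the original output.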
