Regular Fabrics for Retiming & Pipelining over Global Interconnects

1 Regular Fabrics for Retiming & Pipelining over Global Interconnects. Jason Cong, Computer Science Department, University of California, Los Angeles. cong@cs.ucla.edu, http://cadlab.cs.ucla.edu/~cong. FCRP Interconnect Workshop, June 28, 2002. DUSD(Labs).

2 Overarching GSRC Research Emphasis [Jan Rabaey, June 2002]. A broadened focus on application-oriented embedded systems under tight cost, PDA, and time-to-market constraints. Founded on one basic principle: from ad-hoc system-on-a-chip design to disciplined, platform-based design.

3 The Discipline of Platform-Based Design. Diagram labels: Application; Architecture(s); Architectural Platform; Microarchitecture(s); Circuit Fabric(s); Silicon Implementation Platform; Manufacturing Interface; Silicon Implementation. Annotations: Programming Model: models/estimators, kernels/benchmarks. Cycle-speed, power, area. Functional blocks, interconnect. Delay, variation, SPICE models. Basic device & interconnect structures.

4 The Discipline of Platform-Based Design. Overlaid labels: Programmable Systems; Comp and Comm Based Design; Constructive Fabrics; Test, Verification, Energy & Power; Calibrating Achievable Design. Diagram labels: Application; Architecture(s); Architectural Platform; Microarchitecture(s); Circuit Fabric(s); Silicon Implementation Platform; Manufacturing Interface; Silicon Implementation. Annotations: Programming Model: models/estimators, kernels/benchmarks. Cycle-speed, power, area. Functional blocks, interconnect. Delay, variation, SPICE models. Basic device & interconnect structures.

5 From Architecture to Silicon Implementation Platform. Different targets employ different intermediate platforms, hence different layers of regularity and design-space constraints. The design space may actually be smaller than with large steps! Large-step predictions/abstractions may misguide the optimizations. Diagram labels: Architecture; Logic Regularity; Component Regularity and Reuse; Regular Fabrics; Geometrical Regularity; Silicon Implementation; Constructive Fabrics Th. [Source: Larry Pileggi]

6 Sample Work from the GSRC Fabric Theme. Bob Brayton: Topologically Constrained Logic Synthesis. Malgorzata Marek-Sadowska: Interconnecting Regular Fabrics. Wojtek Maly: Geometrical Regularity. Herman Schmit: Regular Communication Fabrics. Jason Cong: Regular Fabrics for Retiming and Pipelining over Global Interconnects.

7 Motivation: How Far Can We Go in Each Clock Cycle. Setting: NTRS um Tech; 5 GHz across-chip clock; 620 mm² die (24.9 mm x 24.9 mm); IPEM BIWS estimations; buffer size 100x; driver/receiver size 100x. Figure: regions of the die reachable in 1 to 7 clock cycles (axes in mm). From corner to corner: 7 clock cycles.

8 Solutions. Fully asynchronous designs. GALS (globally asynchronous, locally synchronous) designs. Latency-insensitive designs. Synchronous designs with multi-cycle communications: much better understood; supported by the current tool set; more energy efficient?

9 Need of Considering Retiming during Placement: Retiming/pipelining on global interconnects. Multiple clock cycles are needed to cross the chip. Proper placement allows retiming to hide global interconnect delays. Example with d(v) = 1, WL = 6, and edge delay d(e) depending on wirelength. Placement 1 (order a, b, c, d): before retiming, φ = 5.0; after retiming, φ = 3.0. Placement 2 (order a, d, b, c): before retiming, φ = 4.0, a better initial placement!

10 Need of Considering Retiming during Placement: Retiming/pipelining on global interconnects. Multiple clock cycles are needed to cross the chip. Proper placement allows retiming to hide global interconnect delays. Example with d(v) = 1, WL = 6, and edge delay d(e) depending on wirelength. Placement 1 (order a, b, c, d): before retiming, φ = 5.0; after retiming, φ = 3.0. Placement 2 (order a, d, b, c): before retiming, φ = 4.0, a better initial placement, but after retiming φ is still 4.0.

11 Difficulties. How to consider retiming/pipelining over global interconnects: flip-flop boundaries are not fixed during placement, which makes static timing analysis difficult. Approach: use the concepts of c-retiming and sequential timing analysis (Seq-TA). How to handle the high complexity of the combined problem: use the multi-level optimization technique.

12 Static Timing Analysis (STA). Sequential circuit example on nodes a to g. PIs: a, b. PO: g. Suppose d(v) = 1, d(e) = 2, and clock cycle φ = 11. Transform the circuit into a DAG for static timing analysis. Topological order: a, b, g, f, c, d, e. The arrival time (AT) and required time (RT) of each node are computed in linear time.
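
The linear-time computation mentioned on this slide can be sketched as one forward and one backward pass over a topological order. This is a minimal illustration, not the tool used in the talk: the function name `static_timing`, the five-node example DAG, and the convention that arrival time is 0 at sources and required time is φ at sinks are all assumptions made here. The slide's own circuit topology did not survive extraction, so it is not reproduced.

```python
from collections import defaultdict, deque

def static_timing(nodes, edges, d_v, d_e, phi):
    """Arrival/required times on a DAG with uniform vertex delay d_v and edge delay d_e."""
    succ, pred = defaultdict(list), defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        indeg[v] += 1

    # Kahn's algorithm: topological order in O(V + E).
    order, ready = [], deque(v for v in nodes if indeg[v] == 0)
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle; STA needs a DAG")

    # Forward pass: sources start at 0 (assumed convention).
    at = {}
    for v in order:
        at[v] = max(at[u] + d_e for u in pred[v]) + d_v if pred[v] else 0

    # Backward pass: sinks must be ready by phi (assumed convention).
    rt = {}
    for v in reversed(order):
        rt[v] = min(rt[s] - d_v - d_e for s in succ[v]) if succ[v] else phi
    return at, rt

# Hypothetical DAG, using the slide's d(v) = 1, d(e) = 2, phi = 11.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
at, rt = static_timing(nodes, edges, d_v=1, d_e=2, phi=11)
slack = {v: rt[v] - at[v] for v in nodes}
```

A non-negative slack at every node means the hypothetical DAG meets the clock period.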

13 Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT). Definition [Pan et al., TCAD98]: given a clock period φ, transform circuit C into an edge-weighted, vertex-weighted graph G and label each vertex v with $l(v)$ = the weight of the longest path from PIs to v, that is, $l(v) = \max_{(u,v)} \{\, l(u) - \phi\, w(u,v) + d(u,v) + d(v) \,\}$. $l(v)$ is also called SAT(v). Each edge carries the weight $w_l(u,v) = d(e(u,v)) - \phi\, w(u,v)$. Theorem: C can be retimed to $\phi + \max\{d(v)\}$ iff $l(\text{POs}) \le \phi$. Relation to retiming: $r(v) = \lceil l(v)/\phi \rceil - 1$. Complexity is O(VE). Example: w(a,c) = 1, w(b,c) = 0, l(a) = 7, l(b) = 3, d(a) = d(b) = 1, d(a,c) = d(b,c) = 2, φ = 5, so $l(c) = \max\{7 - 5 + 2 + 1,\ 3 + 2 + 1\} = 6$.
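
The worked example on this slide can be checked with a few lines of code. The sketch below evaluates the label update for vertex c and the corresponding retiming value. The helper name `label` is hypothetical. Two things are assumptions: d(c) = 1, which is inferred from the slide's term 3+2+1, and the ceiling in r(v) = ⌈l(v)/φ⌉ - 1, since the extracted text shows only l(v)/φ - 1.

```python
import math

def label(d_v, fanins, phi):
    """l(v) = max over fanin edges (u, v) of l(u) - phi*w(u,v) + d(u,v) + d(v).

    fanins is a list of (l_u, w_uv, d_uv) tuples, one per incoming edge.
    """
    return max(l_u - phi * w_uv + d_uv + d_v for l_u, w_uv, d_uv in fanins)

phi = 5
# Edge (a, c): l(a) = 7, one register, delay 2. Edge (b, c): l(b) = 3, no register, delay 2.
l_c = label(d_v=1, fanins=[(7, 1, 2), (3, 0, 2)], phi=phi)

# Retiming value read with a ceiling (assumption, see above).
r_c = math.ceil(l_c / phi) - 1
```

The register on edge (a, c) subtracts one full clock period from that path, which is why the path through b determines l(c) even though l(a) is larger than l(b).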

14 Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT). Figure: a sequential circuit on nodes a to g, its retimed version, and the retiming graph (not a DAG) with edge weights such as 2 and -2.5. d(v) = 1, d(e) = 2. Is φ = 4.5 possible? Table: l-values of a, b, c, d, e, f, g per iteration. Cycle time 4.5 is possible because l(g) ≤ 4.5.

15 Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT) (cont'd). Figure: the sequential circuit on nodes a to g and its retiming graph (not a DAG). d(v) = 1, d(e) = 2. Is φ = 2.5 feasible? Table: l-values of a, b, c, d, e, f, g per iteration. Cycle time 2.5 is not feasible because l(g) > 2.5.

16 Sequential Timing Analysis (Seq-TA). With loops, the problem is difficult: a topological order does not exist! Start with a minimum l-value for each node and iteratively improve it. Convergence is guaranteed in O(n) iterations if the circuit can be retimed to the target cycle time. Outline of Seq-TA: binary search for the minimum feasible clock period. Given a clock period φ, check whether φ is feasible: set l(PI) = 0 and l(others) = -∞; relax one vertex at a time and update the l-values; if any l(PO) > φ, φ is not feasible; if the relaxation converges, φ is feasible. Complexity is O(VE).
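
The outline above can be sketched as a Bellman-Ford style longest-path relaxation wrapped in a binary search. This is an illustrative sketch under stated assumptions, not the authors' implementation. The function names, the edge tuple layout `(u, v, w, d_e)`, the numeric tolerances, and the four-vertex example circuit are all hypothetical. The upper end of the search range assumes every loop contains at least one register.

```python
def is_feasible(phi, vertices, edges, d, pis, pos, eps=1e-9):
    """Check clock period phi by relaxing l-values; edges are (u, v, w, d_e) tuples."""
    neg_inf = float("-inf")
    l = {v: neg_inf for v in vertices}
    for p in pis:
        l[p] = 0.0                      # l(PI) = 0, l(others) = -infinity
    for _ in range(len(vertices)):      # O(V) passes over O(E) edges
        changed = False
        for u, v, w, d_e in edges:
            if l[u] == neg_inf or v in pis:
                continue
            cand = l[u] - phi * w + d_e + d[v]
            if cand > l[v] + eps:
                l[v] = cand
                changed = True
        if any(l[p] > phi + eps for p in pos):
            return False                # some l(PO) exceeds phi
        if not changed:
            return True                 # relaxation converged
    return False                        # still changing: a positive-weight loop

def min_period(vertices, edges, d, pis, pos, tol=1e-3):
    """Binary search for the minimum feasible phi."""
    lo = 0.0
    hi = sum(d.values()) + sum(e[3] for e in edges)  # total delay is always enough
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_feasible(mid, vertices, edges, d, pis, pos):
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical circuit: a -> c -> d -> g, with a loop d -> c through one register
# and one register on d -> g. Uses d(v) = 1 and d(e) = 2 as on the earlier slides.
vertices = ["a", "c", "d", "g"]
d = {v: 1 for v in vertices}
edges = [("a", "c", 0, 2), ("c", "d", 0, 2), ("d", "c", 1, 2), ("d", "g", 1, 2)]
best = min_period(vertices, edges, d, pis={"a"}, pos={"g"})
```

In this example the loop c, d, c has total delay 6 and one register, so any φ below 6 makes the loop weight positive and the labels never converge. As on slide 13, feasibility here is in the c-retiming sense; the retimed circuit's period is bounded by φ plus the maximum gate delay.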

17 Multi-Level Optimization Framework. Figure: levels and problem sizes, with a coarsening phase followed by uncoarsening & refinement (optimization). Multi-level coarsening generates smaller problem sizes for the top levels, which gives faster optimization on the top levels. It may explore different aspects of the solution space at different levels. Gradual refinement of good solutions from coarser levels is very efficient. Successful in many applications: originally developed for PDEs; recent success in VLSI CAD (partitioning, placement, routing).

18 Challenges. Previous Seq-TA can only handle single-output gates. In reality, multi-output modules exist: IP blocks, MUXes, adders, and clusters in the multi-level optimization process. How to integrate Seq-TA into multi-level coarse placement efficiently?

19 Generalize c-retiming for Complex Combinational Modules. A complex module is combinational logic with multiple outputs and non-uniform propagation delay (figure: a module with input pins $v_{I0}$, $v_{I1}$, $v_{I2}$ and output pins $v_{O0}$, $v_{O1}$). $l_1$-value labeling for each vertex: $l_1(v)$ = weight of the longest path from PIs to v using $d'(v)$ as a uniform gate delay. Each vertex has an $l_1$-value label. The non-uniform gate delay is reduced to a uniform gate delay by taking the maximum internal delay as the gate delay, $d'(v) = \max\{ d(v_{(i,j)}) \}$; in the figure, $d'(v) = 11$. This is an upper bound of the labeling. $l_2$-value labeling for each output of a vertex: decompose the complex module by treating each pin of the module as a vertex with zero delay. $l_2(v_{O_t})$ = weight of the longest path from PIs to output $O_t$. Each output of a vertex has an $l_2$-value label. This is a lower bound of the labeling of v.
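
The difference between the two labelings can be illustrated on a single module. In the sketch below the pin-to-pin delay table and the input labels are hypothetical; only the values 4 and 11 appear in the slide's figure residue. Register weights on the incoming edges are left out to keep the example short. The uniform delay is called `d_uniform` here, a name chosen for this sketch.

```python
# Hypothetical pin-to-pin delays of one complex module: (input pin, output pin) -> delay.
pin_delay = {
    ("I0", "O0"): 4,
    ("I1", "O0"): 6,
    ("I1", "O1"): 11,
    ("I2", "O1"): 3,
}

# Hypothetical labels already computed at the module's input pins.
l_in = {"I0": 2, "I1": 0, "I2": 5}

# l1 labeling: one label per vertex, using the maximum internal delay as a uniform gate delay.
d_uniform = max(pin_delay.values())
l1 = max(l_in.values()) + d_uniform

# l2 labeling: one label per output pin, following the real pin-to-pin paths.
l2 = {}
for (i, o), delay in pin_delay.items():
    l2[o] = max(l2.get(o, float("-inf")), l_in[i] + delay)
```

Here `l1` is 16 while the per-output labels are 6 and 11, so the uniform-delay label is pessimistic for both outputs. That matches the slide's description of the per-vertex label as an upper bound.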

20 Integrate Seq-TA with a Multi-level SA-based Coarse Placement. In the coarsening phase, FFs can only be clustered after a certain level k. From level $L_n$ to $L_{k+1}$, perform static timing analysis (where FFs are clustered). From level $L_k$ to $L_0$, perform Seq-TA (where FFs are not clustered). Figure: levels $L_0$, ..., $L_k$, ..., $L_n$, labeled with initial placement and refinement by timing-driven SA-based coarse placement.

21 Initial Experimental Result on Impact of Simultaneous Retiming and Placement. Table columns: circuit; #gates; grid size; delay with WL-driven placement (before retiming); delay with WL-driven placement (after retiming); delay with simultaneous retiming and placement. Rows: one S circuit, four Ind circuits, and an average (Avg) row.

22 Limitation of Exploring Multi-cycle Interconnect Communication during Logic Synthesis. The minimum clock period that can be achieved by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit. Example: a loop with 4 logic cells and 2 registers, cell delay = 1 ns, interconnect delay = 1 ns. DR ratio = $(D_{\text{logic}} + D_{\text{int}}) / \#\text{registers}$ = (4+4)/2 = 4 ns, so the clock cycle is ≥ 4 ns. This requires consideration of multi-cycle communication during architecture and behavior synthesis.
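
The bound on this slide is a short calculation. The sketch below reproduces the slide's loop and takes the maximum over a list of loops; the function name `dr_ratio` and the list-of-loops layout are assumptions for illustration.

```python
def dr_ratio(logic_delays_ns, interconnect_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: (D_logic + D_int) / #registers."""
    return (sum(logic_delays_ns) + sum(interconnect_delays_ns)) / num_registers

# The slide's loop: 4 logic cells of 1 ns, 4 ns of interconnect delay, 2 registers.
loops = [
    {"logic": [1, 1, 1, 1], "wires": [1, 1, 1, 1], "registers": 2},
]

# The clock period cannot drop below the worst loop's ratio.
min_clock_ns = max(dr_ratio(lp["logic"], lp["wires"], lp["registers"]) for lp in loops)
```

Retiming moves registers around a loop but cannot change how many registers the loop holds, which is why this ratio is a lower bound for logic-level optimization.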

23 Regular Distributed Register Architecture. Figure: an array of islands, each of width $W_i$ and height $H_i$, connected by global interconnect. Each island contains a function unit cluster (FUC), that is, a cluster with an area constraint holding units such as DIV, MUX, and ADD, and a register file with banks for 1-cycle, 2-cycle, ..., k-cycle communication. $D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{opt-int}}(2W_i + 2H_i)$. Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication in each island. Highly regular.

24 Example: Regular Distributed Register Architecture for 70 nm Technology. NTRS 97 70 nm technology. Chip dimension: 620 mm² (24.9 mm x 24.9 mm). 5 GHz across-chip clock. A wire can travel up to 7.52 mm within 1 clock cycle under interconnect optimization, so 7 clock cycles are needed to cross the chip. Each island has base dimension $W_i = H_i$ = 2.08 mm = critical length (the longest length that a wire can run without buffer insertion), estimated by IPEM BIWS assuming buffer size 2x and driver/receiver size 2x; 1/3 of the distance a wire can travel in 1 clock cycle. Logic volume: 6.76M min-size 2-NAND gates. 12x12 island-base array. Local registers are partitioned into 7 banks.
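
Two of the figures on this slide can be reproduced from the stated dimensions. The sketch below assumes that "crossing the chip" means a corner-to-corner Manhattan route of twice the die side, and that the island count per side is the die side divided by the island dimension, rounded up. Both readings are assumptions made here, not statements from the slide.

```python
import math

chip_side_mm = 24.9          # die is 24.9 mm x 24.9 mm
reach_per_cycle_mm = 7.52    # wire distance covered in one clock cycle
island_side_mm = 2.08        # island base dimension W_i = H_i

# Corner-to-corner Manhattan distance, in clock cycles (assumed interpretation).
cycles_to_cross = math.ceil(2 * chip_side_mm / reach_per_cycle_mm)

# Islands along one side of the die (assumed interpretation).
islands_per_side = math.ceil(chip_side_mm / island_side_mm)

# One register bank per possible communication latency.
register_banks = cycles_to_cross
```

Under these assumptions the results agree with the slide: 7 cycles to cross the chip, a 12x12 island array, and 7 register banks.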

25 Example: Impact of Interconnect on Scheduling. Data flow graph extracted from a discrete cosine transform (DCT). The delay of a * operation is 2 ns; the delay of + and - operations is 1 ns. The resources available are 2 multipliers and 2 ALUs. Nodes with the same color are assigned to the same functional unit. Binding: Mul1: 4, 8, 11; Mul2: 3, 7, 12; Alu1: 1, 5, 10; Alu2: 2, 6, 9. Legend: one edge style represents long interconnect delay (2 ns); the other represents short interconnect delay (1 ns). Figure: wirelength-driven placement of the functional units.

26 Single-cycle vs. Multi-cycle Interconnect Communication. Figure: the two schedules of the DFG, cycle by cycle, with registers marked. Single-cycle interconnect communication: scheduled in 6 clock cycles; clock period is 4 ns; total latency is 24 ns. Multi-cycle interconnect communication: scheduled in 9 clock cycles; clock period is 2 ns; total latency is 18 ns.

27 Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization. Figure: schedule over cycles 1 to 8 with binding Mul1: 4, 8, 11; Mul2: 3, 7, 12; Alu1: 1, 5, 10; Alu2: 2, 6, 9. With placement integrated with scheduling, the critical path is reduced. The DFG can be scheduled in 8 clock cycles with a clock period of 2 ns. The total latency is 16 ns.

28 Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization. Figure: schedule over cycles 1 to 7 with binding Mul1: 4, 8, 12; Mul2: 3, 7, 11; Alu1: 1, 5, 10; Alu2: 2, 6, 9. With placement integrated with scheduling and binding, the critical path is further reduced. The DFG can be scheduled in 7 clock cycles with a clock period of 2 ns. The total latency is 14 ns.
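
The four schedules from slides 26 to 28 can be compared side by side, since total latency is the cycle count times the clock period. The sketch below uses only the numbers stated on those slides; the dictionary keys are labels chosen here.

```python
# (clock cycles, clock period in ns) for each schedule of the DCT data flow graph.
schedules = {
    "single-cycle communication": (6, 4),
    "multi-cycle communication": (9, 2),
    "placement + scheduling": (8, 2),
    "placement + scheduling + binding": (7, 2),
}

# Total latency in ns = number of cycles x clock period.
latency_ns = {name: cycles * period for name, (cycles, period) in schedules.items()}

for name, total in latency_ns.items():
    print(f"{name}: {total} ns")
```

Multi-cycle communication uses more cycles than the single-cycle schedule but halves the clock period, and each enhancement then removes one cycle at the same 2 ns period.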

29 Example: Multicluster Architectures of DEC Alpha. Source: "The Multicluster Architecture: Reducing Cycle Time Through Partitioning" by Keith I. Farkas et al.

30 Conclusions. Multi-cycle communication is needed for gigahertz designs. Sequential timing analysis + multilevel optimization enables efficient retiming/pipelining over global interconnects. The regular distributed register (RDR) fabric provides regularity to support multi-cycle communication and integrated resource binding, scheduling, and physical planning.

31 From Architecture to Silicon Implementation Platform. Different targets employ different intermediate platforms, hence different layers of regularity and design-space constraints. The design space may actually be smaller than with large steps! Large-step predictions/abstractions may misguide the optimizations. Diagram labels: Architecture; Logic Regularity; Component Regularity and Reuse; Regular Fabrics; Geometrical Regularity; Silicon Implementation; Constructive Fabrics Th. [Source: Larry Pileggi]
