Retiming & Pipelining over Global Interconnects

1 Retiming & Pipelining over Global Interconnects
Jason Cong, Computer Science Department, University of California, Los Angeles
Joint work with C. C. Chang, D. Pan*, and X. Yuan
* IBM Research

2 Motivation: How Far Can We Go in Each Clock Cycle
- NTRS roadmap technology, 5 GHz across-chip clock
- Chip area: 620 mm^2 (24.9 mm x 24.9 mm)
- IPEM BIWS estimations; buffer size: 100x; driver/receiver size: 100x
- From corner to corner: 7 clock cycles
[Figure: regions of the chip reachable in 1 to 7 clock cycles; axis in mm]

3 Solutions
- Fully asynchronous designs
- GALS (globally asynchronous, locally synchronous) designs
- Latency-insensitive designs
- Synchronous designs with multi-cycle communications
  - Much better understood
  - Supported by the current tool set
  - More energy efficient?

4 Interconnect-Centric IC Design Flow Under Development at UCLA
Views: structure view, functional view, physical view, timing view
Flow:
- Architecture/conceptual-level design; design specification; HDM
- Interconnect Planning
  - Physical hierarchy generation for multi-cycle communication
  - Interconnect architecture planning
- Synthesis and placement under physical hierarchy
- Interconnect Synthesis
  - Topology generation and wire sizing for delay
  - Wire ordering and spacing for noise control
- Interconnect Layout
  - Route planning
  - Point-to-point gridless routing
- Final layout
Supporting components:
- Interconnect Performance Estimation Models (IPEM): OWS, SDWS, BISWS (linked to the flow by abstraction)
- Interconnect Optimization (TRIO)
  - Topology optimization with buffer insertion
  - Wire sizing and spacing
  - Simultaneous buffer insertion and wire sizing
  - Simultaneous topology construction with buffer insertion and wire sizing

5 Physical Hierarchy Generation: Problem Formulation
- Input: logical hierarchy of hard IP and soft modules (in the figure, modules of the same logic hierarchy share a color)
- Output: physical hierarchy = placement bins + module locations
- Assigning modules to the physical hierarchy defines the global interconnects
- Optimization objectives:
  - Wire length minimization
  - Routing congestion minimization
  - Clock period, latency, performance (with consideration of multi-cycle communication)

7 Need of Considering Retiming during Placement
Retiming/pipelining on global interconnects:
- Multiple clock cycles are needed to cross the chip
- Proper placement allows retiming to hide global interconnect delays
Example (both placements: d(v) = 1, WL = 6, d(e) ∝ WL):
- Placement 1 (order a b c d): before retiming φ = 5.0; after retiming φ = 3.0
- Placement 2 (order a d b c): before retiming φ = 4.0 (the better initial placement); after retiming φ = 4.0
Placement 2 looks better before retiming, but Placement 1 reaches the shorter clock period once retiming is applied.

8 Difficulties
- How to consider retiming/pipelining over global interconnects?
  - Flip-flop boundaries are not fixed during placement, so static timing analysis is difficult to apply
  - Answer: use the concepts of c-retiming and sequential timing analysis (Seq-TA)
- How to handle the high complexity of the combined problem?
  - Answer: use the multi-level optimization technique

9 Simultaneous Coarse Placement with Retiming on Interconnects
- Our solution
  - Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
  - Minimize the longest sequential path by improving the placement solution
- Alternative solution [Brayton, et al.]
  - Enforce all loop constraints during placement

10 Static Timing Analysis (STA)
- Sequential circuit example: PIs: a, b. PO: g. Suppose d(v) = 1, d(e) = 2, and clock cycle φ = 11
- Transform the circuit into a DAG for static timing analysis
- Topological order: a, b, g, f, c, d, e
- The arrival time (AT) and required time (RT) of each node are computed in linear time
[Figure: the sequential circuit, its DAG, and the AT and RT values of each node]
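
The linear-time AT/RT pass described on this slide can be sketched as follows. This is a minimal illustration, not the authors' tool: the four-node DAG (`p`, `q`, `r`, `s`) is hypothetical because the slide's netlist did not survive, and the function name `static_timing` is an assumption. It reuses the slide's parameters d(v) = 1, d(e) = 2, φ = 11.

```python
from collections import deque


def static_timing(nodes, edges, d_v, d_e, phi):
    """Arrival/required times on a DAG with uniform gate delay d_v and edge delay d_e."""
    preds = {v: [] for v in nodes}
    succs = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Kahn's algorithm gives a topological order in O(V + E).
    indeg = {v: len(preds[v]) for v in nodes}
    queue = deque(v for v in nodes if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succs[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle; cut it at the flip-flops first")

    # Forward pass: arrival time at the output of each node.
    at = {}
    for v in order:
        at[v] = d_v + max((at[u] + d_e for u in preds[v]), default=0)

    # Backward pass: required time at the output of each node.
    rt = {}
    for v in reversed(order):
        rt[v] = min((rt[s] - d_v - d_e for s in succs[v]), default=phi)
    return order, at, rt


# Hypothetical DAG: p and q feed r, which feeds s.
nodes = ["p", "q", "r", "s"]
edges = [("p", "r"), ("q", "r"), ("r", "s")]
order, at, rt = static_timing(nodes, edges, d_v=1, d_e=2, phi=11)
slack = {v: rt[v] - at[v] for v in nodes}
```

A non-negative slack on every node means the hypothetical circuit meets the clock period without moving any flip-flop.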

11 Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)
Definition [Pan et al., TCAD98]:
- Given a clock period φ, transform circuit C into an edge-weighted, vertex-weighted graph G
- Label vertex v with l(v) = the weight of the longest path from the PIs to v:
  $l(v) = \max_{(u,v) \in E} \{\, l(u) - \phi \cdot w(u,v) + d(u,v) + d(v) \,\}$
- l(v) is also called SAT(v)
Theorem: C can be retimed to $\phi + \max_v d(v)$ iff $l(\mathrm{POs}) \le \phi$
Relation to retiming: $r(v) = \lceil l(v)/\phi \rceil - 1$
Complexity is O(VE)
Example: a and b drive c, with w(a,c) = 1, w(b,c) = 0, l(a) = 7, l(b) = 3, d(a) = d(b) = d(c) = 1, d(a,c) = d(b,c) = 2, φ = 5
- Edge weights in the labeling graph: $w_l(a,c) = d(e_{(a,c)}) - \phi \cdot w(a,c) = 2 - 5 = -3$ and $w_l(b,c) = d(e_{(b,c)}) - \phi \cdot w(b,c) = 2$
- $l(c) = \max\{\, 7 - 5 + 2 + 1,\; 3 + 2 + 1 \,\} = \max\{5, 6\} = 6$
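
The labeling rule on this slide amounts to a longest-path computation with edge weights d(u,v) - φ·w(u,v), which a Bellman-Ford style relaxation can evaluate in O(VE). The sketch below is an illustration under assumptions, not the published implementation: the function name, the edge tuple layout `(u, v, w, d_e)`, and the convention of passing PI labels in `fixed` are all hypothetical. It reproduces the slide's small example and then checks a hypothetical one-register loop.

```python
import math


def sequential_arrival_times(nodes, edges, d_v, phi, fixed):
    """Return l(v) for every node, or None if the labels never converge.

    edges: list of (u, v, w, d_e) with w registers and delay d_e on edge (u, v).
    fixed: labels of the primary inputs (assumed to have no incoming edges).
    """
    label = {v: -math.inf for v in nodes}
    label.update(fixed)
    # At most |V| rounds: a label that still grows afterwards lies on a
    # positive-weight cycle, so the target period phi is infeasible.
    for _ in range(len(nodes)):
        changed = False
        for u, v, w, d_e in edges:
            if label[u] == -math.inf:
                continue
            candidate = label[u] - phi * w + d_e + d_v[v]
            if candidate > label[v] + 1e-12:
                label[v] = candidate
                changed = True
        if not changed:
            return label
    return None


# The slide's example: one register on (a, c), none on (b, c), phi = 5.
example = sequential_arrival_times(
    nodes=["a", "b", "c"],
    edges=[("a", "c", 1, 2), ("b", "c", 0, 2)],
    d_v={"a": 1, "b": 1, "c": 1},
    phi=5,
    fixed={"a": 7, "b": 3},
)

# Hypothetical loop x -> y -> x with one register: total loop delay is 6.
loop_nodes = ["p", "x", "y"]
loop_edges = [("p", "x", 0, 2), ("x", "y", 0, 2), ("y", "x", 1, 2)]
loop_d = {"p": 1, "x": 1, "y": 1}
feasible = sequential_arrival_times(loop_nodes, loop_edges, loop_d, 6, {"p": 1})
infeasible = sequential_arrival_times(loop_nodes, loop_edges, loop_d, 5, {"p": 1})
```

In the loop case the labels converge when φ is at least the loop delay per register and diverge otherwise, which is how the iteration detects an unreachable clock period.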

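12 Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)
- Sequential circuit with nodes a to g, d(v) = 1, d(e) = 2. Is φ = 4.5 possible?
- Retiming graph (not a DAG): an edge without a register has weight 2; an edge with one register has weight 2 - 4.5 = -2.5
- The labels of a to g are updated iteration by iteration until they converge
- Cycle time 4.5 is possible because l(g) ≤ 4.5
[Figure: sequential circuit, retimed circuit, retiming graph, and table of labels per iteration (Iter#, a to g)]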

13 Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT) (cont'd)
- Same sequential circuit, d(v) = 1, d(e) = 2. Is φ = 2.5 feasible?
- Cycle time 2.5 is not feasible because l(g) > 2.5
[Figure: sequential circuit, retiming graph (not a DAG), and table of labels per iteration (Iter#, a to g)]

14 Multi-Level Optimization Framework
- Multi-level coarsening generates smaller problem sizes for the top levels, giving faster optimization on the top levels
- May explore different aspects of the solution space at different levels
- Gradual refinement on good solutions from coarser levels is very efficient
- Successful in many applications
  - Originally developed for PDEs
  - Recent success in VLSI CAD: partitioning, placement, routing
[Figure: levels versus problem sizes, with a coarsening phase followed by uncoarsening & refinement (optimization)]

15 Challenges
- Previous Seq-TA can only handle single-output gates
  - In reality, multi-output modules exist: IP blocks, MUXes, adders
  - Clusters in the multi-level optimization process
- How to integrate Seq-TA into multi-level coarse placement efficiently
- Need to consider congestion and routability

16 Generalize c-retiming for Complex Combinational Modules
A complex module is combinational logic with multiple outputs and non-uniform propagation delay.
- l1-value labeling for each vertex
  - l1(v) = weight of the longest path from PIs to v using d(v) as uniform gate delay
  - Each vertex has an l1-value label
  - Upper bound of the labeling
  - Reduce the non-uniform gate delay to a uniform gate delay by taking the max. internal delay as the gate delay: $d(v) = \max_{i,j} \{\, d(v_{(i,j)}) \,\}$
- l2-value labeling for each output of a vertex
  - Flatten/decompose the complex module by treating each pin of the module as a vertex with zero delay
  - l2(vo_t) = weight of the longest path from PIs to output o_t
  - Each output of a vertex has an l2-value label
  - Lower bound of the labeling of v
[Figure: module v with input pins v_I0, v_I1, v_I2 and output pins v_O0, v_O1; uniform delay d(v) = 11 in the l1 view, pin-to-pin delays in the flattened l2 view]
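
To make the two labelings concrete, here is a small sketch for a single module. It is a simplification under stated assumptions: interconnect delays and register weights are left out, the pin names, pin-to-pin delays, and input labels are hypothetical values chosen for illustration, and the function names are not from the slides.

```python
import math


def uniform_delay(pin_delay):
    """Uniform gate delay: the largest pin-to-pin delay of the module."""
    return max(pin_delay.values())


def l1_label(input_labels, pin_delay):
    """Upper-bound label: latest input plus the uniform (worst-case) delay."""
    return max(input_labels.values()) + uniform_delay(pin_delay)


def l2_labels(input_labels, pin_delay):
    """Lower-bound labels: one per output, using the real pin-to-pin delays."""
    labels = {}
    for (in_pin, out_pin), delay in pin_delay.items():
        candidate = input_labels[in_pin] + delay
        labels[out_pin] = max(labels.get(out_pin, -math.inf), candidate)
    return labels


# Hypothetical module with three inputs and two outputs.
pin_delay = {
    ("I0", "O0"): 4,
    ("I1", "O0"): 11,
    ("I1", "O1"): 3,
    ("I2", "O1"): 3,
}
input_labels = {"I0": 5, "I1": 2, "I2": 6}

upper = l1_label(input_labels, pin_delay)
lower = l2_labels(input_labels, pin_delay)
```

In this sketch every per-output label stays at or below the single per-module label, which is the bound relation the slide describes.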

17 Properties of Generalized c-retiming for Complex Combinational Modules
- Theorem: If there is a PO t with l2(PO_t) > Φ, then the circuit cannot be retimed to a clock period of Φ.
- Theorem: If l1(PO_i) ≤ Φ for every PO_i, then the circuit can be retimed to a clock period less than Φ + k, where k is the max. input-output delay of all gates.
- Theorem: For any module v and its out-pin vo_t, l2(vo_t) ≤ l1(v).
- Theorem: Given a circuit C, let Φ be the min. clock period achieved by retiming on C. If C_c is derived from C by clustering and Φ_c is the min. clock period achieved by retiming on C_c, then Φ ≤ Φ_c.

18 Integrate Seq-TA with a Multi-level SA-based Coarse Placement
- In the coarsening phase, FFs can only be clustered after a certain level k
- From level L_n to L_{k+1}, perform static timing analysis (FFs are clustered)
- From level L_k to L_0, perform Seq-TA (FFs are not clustered)
- Initial placement, then refinement by timing-driven SA-based coarse placement
[Figure: levels L_0, L_k, and L_n of the multi-level hierarchy]

19 Area Density Problems in Multi-level Coarse Placement
- Traditional area density control: cell area in each bin < bin area utilization, with a small percentage of overflow allowed
- Does not work when cluster sizes vary significantly and may be bigger than a bin
- How about using different grid sizes for different levels of clustering?
  - Hard to find fixed percentages that work
  - Significant placement cost jump when switching grid sizes

20 Hierarchical Area Density Control
- Use the same grid structure for placement at all clustering levels
- Impose a hierarchy on the bin structure for area density control
- Each cluster move must satisfy the area constraints on each level in the bin hierarchy
- Area constraint for moving a cell of size A: allowed overflow on each level in the bin hierarchy = kA, where k is a small constant (usually 1 or 2)
- Works well in the multi-level framework: area constraints are gradually tightened during optimization
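
One way to picture the per-level check is the sketch below. The slide does not say how the bin hierarchy is built, so the quad-tree grouping (each level merges 2 x 2 bins of the level beneath it), the capacity of a merged bin, and the class and method names are all assumptions made for illustration.

```python
class BinHierarchy:
    """Area bookkeeping on a 2**levels x 2**levels grid with merged-bin levels."""

    def __init__(self, levels, bin_capacity, k=1):
        self.levels = levels
        self.bin_capacity = bin_capacity  # usable area of one finest-level bin
        self.k = k                        # allowed overflow factor (k * A)
        self.usage = [dict() for _ in range(levels + 1)]

    def _keys(self, x, y):
        # The bin containing (x, y) at hierarchy level h covers 2**h x 2**h fine bins.
        for h in range(self.levels + 1):
            yield h, (x >> h, y >> h)

    def can_place(self, x, y, area):
        """A move is legal only if every level tolerates it with overflow <= k * area."""
        for h, key in self._keys(x, y):
            capacity = self.bin_capacity * 4 ** h
            if self.usage[h].get(key, 0.0) + area > capacity + self.k * area:
                return False
        return True

    def place(self, x, y, area):
        for h, key in self._keys(x, y):
            self.usage[h][key] = self.usage[h].get(key, 0.0) + area

    def remove(self, x, y, area):
        for h, key in self._keys(x, y):
            self.usage[h][key] = self.usage[h].get(key, 0.0) - area


grid = BinHierarchy(levels=1, bin_capacity=10.0, k=1)
first_ok = grid.can_place(0, 0, 8.0)
grid.place(0, 0, 8.0)
second_ok = grid.can_place(0, 0, 8.0)   # overflow of 6 is within k * A = 8
grid.place(0, 0, 8.0)
small_ok = grid.can_place(0, 0, 1.0)    # bin already over capacity by more than 1
other_ok = grid.can_place(1, 1, 1.0)    # a neighbouring bin still has room
```

Because the tolerated overflow scales with the moved area, large clusters can move freely at coarse levels while the constraint tightens automatically as clusters shrink, matching the last bullet of the slide.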

21 Fast Incremental A-tree Routing for Multi-pin Nets
Simple incremental A-tree (illustrated for pins in the first quadrant of the root, i.e. the source pin):
- Recursively quad-partition the grid
- Each pin recursively connects to the lower-left corner of each level of partition
- For a net with bounding box length B, at most 2 log B edge updates for each pin move, except the root
- Each edge is routed by the LZ-router

22 Fast LZ-routing for Two-pin Connections
- Decide HVH or VHV: select the less congested layer
- Binary search on the V-stem (or H-stem)
  - Initialize the left region and right region to cover the bounding box
  - Repeat: query wire usage on both regions; select the region with less congestion
- A wire usage query can be done in O(log grid_size)
[Figure: HVH and VHV routes with the left and right search regions]
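
The binary search on the V-stem can be sketched as below. This is an illustrative reading of the slide rather than the actual LZ-router: the function names are hypothetical, congestion is compared by total usage in each half of the bounding box, and a static 2-D prefix sum stands in for the O(log grid_size) usage query (a structure supporting updates would be needed in an incremental router).

```python
def build_prefix(usage):
    """2-D prefix sums over a grid of per-bin wire usage (usage[y][x])."""
    rows, cols = len(usage), len(usage[0])
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        for x in range(cols):
            prefix[y + 1][x + 1] = (usage[y][x] + prefix[y][x + 1]
                                    + prefix[y + 1][x] - prefix[y][x])
    return prefix


def region_usage(prefix, x_lo, x_hi, y_lo, y_hi):
    """Total usage in the inclusive bin rectangle [x_lo, x_hi] x [y_lo, y_hi]."""
    return (prefix[y_hi + 1][x_hi + 1] - prefix[y_lo][x_hi + 1]
            - prefix[y_hi + 1][x_lo] + prefix[y_lo][x_lo])


def choose_v_stem(prefix, pin_a, pin_b):
    """Binary-search the column of the vertical stem of an HVH route."""
    (xa, ya), (xb, yb) = pin_a, pin_b
    x_lo, x_hi = sorted((xa, xb))
    y_lo, y_hi = sorted((ya, yb))
    while x_lo < x_hi:
        mid = (x_lo + x_hi) // 2
        left = region_usage(prefix, x_lo, mid, y_lo, y_hi)
        right = region_usage(prefix, mid + 1, x_hi, y_lo, y_hi)
        if left <= right:
            x_hi = mid       # keep the less congested left half
        else:
            x_lo = mid + 1   # keep the less congested right half
    return x_lo


# Hypothetical vertical-layer usage on a 4 x 2 grid; column 2 is empty.
vertical_usage = [[5, 1, 0, 3],
                  [5, 1, 0, 3]]
prefix = build_prefix(vertical_usage)
stem_column = choose_v_stem(prefix, (0, 0), (3, 1))
```

Each iteration halves the candidate range, so the stem is found after about log2 of the bounding-box width usage queries.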

23 Placement Cost Functions
- Wire length driven: summation of the bounding boxes of all nets
- Congestion driven: wire usages estimated from the fast global router
  - Cost = summation of the squares of wire usages in all bins
  - For fixed wire width, the cost is equivalent to a summation of weighted wire length, where the weight on a bin = wire usage of the bin
  - For a congestion-driven run, the congestion-driven cost is turned on only at the finest placement level
- Example: for a 3 x 3 grid of bins with wire usages W1 to W9, congestion cost = $W_1^2 + W_2^2 + \dots + W_9^2$
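
Both cost terms are simple to state in code. The sketch below uses hypothetical function names and made-up pin coordinates and bin usages; it only illustrates the two formulas on this slide, not the placer itself.

```python
def bounding_box_wirelength(nets):
    """Sum of half-perimeter bounding boxes; each net is a list of (x, y) pins."""
    total = 0
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def congestion_cost(bin_usage):
    """Sum of squared wire usage over all bins of the placement grid."""
    return sum(w * w for row in bin_usage for w in row)


# Hypothetical data: two nets and a 3 x 3 grid with usages W1..W9 = 1..9.
nets = [[(0, 0), (2, 3)], [(1, 1), (4, 1), (2, 5)]]
usage = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

wirelength = bounding_box_wirelength(nets)
congestion = congestion_cost(usage)
```

Squaring the usage penalizes a few heavily used bins more than many lightly used ones, which is what steers the placer away from hot spots.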

24 Experimental Results on Wire Length Minimization
- Multi-level simulated annealing coarse placement
- Wire length comparison with GORDIAN-L: our engine only turns on wire length optimization
- Legalized by DOMINO for wire length comparison
- Test cases by size:
  - 20k-50k: avqlarge, avqsmall, ibm04, ibm07
  - 50k-100k: ibm09, ibm10
  - 100k-210k: ibm14, ibm15, ibm16, ibm17, ibm18
- Our multi-level engine performs well for big circuits
[Figure: two bar charts of the ratio mpg+dom / gor+dom per size class, "Wire Length Comparison" and "CPU Time Comparison"]

25 Experimental Results on Congestion Control
- Metrics compared: BBOX WL, routed WL, max boundary congestion, total overflow, CPU
- Modes compared:
  - mpg: wire length driven mode
  - mpg-cg: congestion driven at the finest clustering level
  - mpg-cg.rd: alternative congestion driven + wire length driven at the finest clustering level
- Test cases: ibm01, ibm04, ibm07, ibm11, ibm13, ibm15
[Table: one row per mode (mpg, mpg-cg.rd, mpg-cg), one column per metric]

26 Initial Experimental Result on Impact of Simultaneous Retiming and Placement
[Table: per circuit, #gates, grid size, delay under WL-driven placement (before retiming and after retiming), and delay under simultaneous retiming and placement; last row: average]

27 Limitation of Exploring Multi-cycle Interconnect Communication during Logic Synthesis
- The minimum clock period achievable by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuit
- Example: a loop with 4 logic cells and 2 registers; cell delay = 1 ns; interconnect delay = 1 ns
  - DR ratio = (D_logic + D_int) / #registers = (4 + 4) / 2 = 4 ns
  - Clock cycle >= 4 ns
- Requires consideration of multi-cycle communication during architecture and behavior synthesis
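
The bound on this slide is a one-line calculation per loop. The helper names below are hypothetical, and the second loop in the usage example is invented only to show that the bound is the maximum over loops.

```python
def dr_ratio(logic_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop, in ns per register."""
    return (sum(logic_delays_ns) + sum(wire_delays_ns)) / num_registers


def clock_period_lower_bound(loops):
    """The clock period cannot go below the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)


# The slide's loop: 4 cells of 1 ns, 4 wires of 1 ns, 2 registers.
slide_loop = ([1, 1, 1, 1], [1, 1, 1, 1], 2)
# A hypothetical shorter loop, added only for illustration.
short_loop = ([1, 1], [1, 1], 2)

slide_bound = dr_ratio(*slide_loop)
overall_bound = clock_period_lower_bound([slide_loop, short_loop])
```

Retiming can move the registers around a loop but cannot change their count, which is why the ratio is a hard floor at this stage of the flow.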

28 Regular Distributed Register Architecture (1)
- The chip is an array of islands connected by global interconnect; each island (width W_i, height H_i) holds a function unit cluster (FUC; e.g. DIV, MUX, ADD, clustered under an area constraint) and a register file
- Island size constraint: $D_{intra\text{-}island} = D_{logic} + D_{opt\text{-}int} \le D_{logic} + D_{opt\text{-}int}(2W_i + 2H_i) \le T$
- Distribute registers to each island
- Local computation and communication in each island can be done in a single clock cycle
- But registers may need to be inserted along global interconnects for multi-cycle communication (less regular)

29 Regular Distributed Register Architecture (2)
- Same island structure: each island (width W_i, height H_i) holds a function unit cluster (FUC; e.g. DIV, MUX, ADD, clustered under an area constraint) and a register file, with global interconnect between islands
- Island size constraint: $D_{intra\text{-}island} = D_{logic} + D_{opt\text{-}int} \le D_{logic} + D_{opt\text{-}int}(2W_i + 2H_i) \le T$
- Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication
- Highly regular

30 Example: Regular Distributed Register Architecture for 70nm Technology
- NTRS 97 70 nm technology
- Chip dimension: 620 mm^2 (24.9 mm x 24.9 mm)
- 5 GHz across-chip clock
- A wire can travel up to 7.52 mm within 1 clock cycle under interconnect optimization
- Need 7 clock cycles to cross the chip
- Each island's base dimension W_i = H_i = 2.08 mm
  - = critical length (longest length that a wire can run without buffer insertion)
  - Estimated by IPEM BIWS, assuming buffer size 2x and driver/receiver size 2x
  - 1/3 of the distance a wire can travel in 1 clock cycle
- Logic volume: 6.76M min-size 2-NAND gates
- 12 x 12 island-base array
- Local registers are partitioned into 7 banks
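
The figures on this slide can be cross-checked with a few lines of arithmetic. One assumption is made and should be read as such: the corner-to-corner distance is taken as the Manhattan distance (width plus height), since global wires are routed rectilinearly. All variable names are hypothetical.

```python
import math

chip_side_mm = 24.9          # chip is 24.9 mm x 24.9 mm
reach_per_cycle_mm = 7.52    # distance a wire travels in one clock cycle
island_side_mm = 2.08        # island base dimension W_i = H_i

chip_area_mm2 = chip_side_mm ** 2

# Assumption: corner-to-corner distance is the Manhattan distance.
corner_to_corner_mm = 2 * chip_side_mm
cycles_to_cross = math.ceil(corner_to_corner_mm / reach_per_cycle_mm)

# Number of islands needed along one side of the chip.
islands_per_side = math.ceil(chip_side_mm / island_side_mm)
```

Under that assumption the numbers agree with the slide: about 620 mm^2 of area, 7 cycles to cross the chip (hence 7 register banks), and a 12 x 12 island array.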

31 Example: Impact of Interconnect on Scheduling
- Data flow graph extracted from the discrete cosine transform (DCT)
- The delay of a * operation is 2 ns; the delay of + and - operations is 1 ns
- The resources available are 2 multipliers and 2 ALUs
- Nodes with the same color are assigned to the same functional unit:
  - Mul1: 4, 8, 11
  - Mul2: 3, 7, 12
  - Alu1: 1, 5, 10
  - Alu2: 2, 6, 9
- Long interconnect delay is 2 ns; short interconnect delay is 1 ns
[Figure: the DFG and the wirelength-driven placement of the functional units (FUCs), with long and short interconnects marked]

32 Single-cycle vs. Multi-cycle Interconnect Communication
- Single-cycle interconnect communication: scheduled in 6 clock cycles; clock period is 4 ns; total latency is 24 ns
- Multi-cycle interconnect communication: scheduled in 9 clock cycles; clock period is 2 ns; total latency is 18 ns
[Figure: the two schedules cycle by cycle, with registers marked]

33 Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization
- With placement integrated with scheduling, the critical path is reduced
- The DFG can be scheduled in 8 clock cycles with a clock period of 2 ns; the total latency is 16 ns
- Multiplier binding: Mul1: 4, 8, 11; Mul2: 3, 7, 12
[Figure: schedule over cycles 1 to 8 and the placement of Mul1, Mul2, Alu1, Alu2]

34 Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization
- With placement integrated with scheduling and binding, the critical path is further reduced
- The DFG can be scheduled in 7 clock cycles with a clock period of 2 ns; the total latency is 14 ns
- Multiplier binding: Mul1: 4, 8, 12; Mul2: 3, 7, 11
[Figure: schedule over cycles 1 to 7 and the placement of Mul1, Mul2, Alu1, Alu2]
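
The latency figures quoted on slides 32 to 34 all follow from latency = number of cycles x clock period. The short sketch below just tabulates them; the dictionary keys are descriptive labels chosen here, not names from the slides.

```python
# (clock cycles, clock period in ns) as stated on slides 32 to 34.
schedules = {
    "single-cycle communication": (6, 4),
    "multi-cycle communication": (9, 2),
    "simultaneous placement + scheduling": (8, 2),
    "simultaneous placement + scheduling + binding": (7, 2),
}

latency_ns = {name: cycles * period
              for name, (cycles, period) in schedules.items()}

best = min(latency_ns, key=latency_ns.get)
```

The multi-cycle schedules use more cycles than the single-cycle one, yet every one of them finishes sooner because the clock period is halved.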

35 Example: Multicluster Architectures of DEC Alpha
Source: "The Multicluster Architecture: Reducing Cycle Time Through Partitioning" by Keith I. Farkas, et al.

36 Conclusions
- Multi-cycle communication is needed for gigahertz designs
- Sequential timing analysis + multilevel optimization enables efficient retiming/pipelining over global interconnects
- Regular distributed register (RDR) fabric provides regularity to support
  - Multi-cycle communication
  - Integrated resource binding, scheduling, and physical planning

Regular Fabrics for Retiming & Pipelining over Global Interconnects
Jason Cong, Computer Science Department, University of California, Los Angeles
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu/~cong