Sungmin Bae, Hyung-Ock Kim, Jungyun Choi, and Jaehong Park. Design Technology Infrastructure Design Center System-LSI Business Division



2
1. Motivation
2. Design flow
3. Parallel multiplier
4. Coarse-grained structural placement methodology
5. Experimental results
6. Future works


3 Data-flow (design structure) awareness is crucial to enhance physical design qualities: timing, area, congestion, power, etc.
- Structured datapath placement is mostly done manually.
- In general, it is thought that placement tools do not perform well on datapath designs.
- Design effort: days to weeks.
[Figure: control granularity, from coarser to finer: floorplan, memory macro placement, structured datapath placement. Example datapath: Sum = A + B.]


4 We have added another methodology to data-flow aware physical design: automated extraction and mapping for a synthesized parallel multiplier.
[Figure: control granularity, from coarser to finer. Figure labels: floorplan, logic synthesis, memory macro placement, coarse-grained structured datapath placement, automated datapath extraction and mapping, datapath template, structured datapath placement. Example datapath: Sum = A * B.]

5
- Identify cells of a synthesized parallel multiplier to be structurally placed.
- Extract the inherent structural locations of the cells.
- Analyze the data-flow of the multiplier.
- Structurally map the cells onto a logical 2-D array.
- Physically bit-slice align the cells.
- Generate structural relative placement directives.
- Guide structural placement during global placement.
[Figure: flow chart. Logic synthesis: RTL code, technology library, and timing/area constraints; parsing/elaboration; arithmetic operation extraction and high-level arithmetic optimizations; datapath generator with structural templates (multiplier); high-level optimizations of non-arithmetic logic; technology independent and dependent optimizations; optimized gate-level netlist. Structure extraction and mapping: structural location inference/cell mapping; physical aware bit-slice alignment; structural relative placement directives. Global placement: coarse-grained structural placement; "Result satisfactory?" (Yes/No); user dataflow analysis.]

6 A parallel multiplier is one of the most abundant arithmetic circuits in today's multimedia-feature-intensive SoCs.
A parallel multiplier largely consists of three parts:
- Partial product generation
- Partial product reduction
- Carry-propagating adder (final adder)
[Figure: multiplication in dot notation for multiplicand Y3..Y0 and multiplier X3..X0. Partial product row i holds XiY3 XiY2 XiY1 XiY0 for i = 0..3; partial product reduction and the final adder produce the final product S7..S0.]
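The dot-notation picture can be reproduced in a few lines. This is only an illustrative sketch of non-Booth partial product generation; the function names are hypothetical and the bit labelling (Y as multiplicand, X as multiplier) follows the figure labels.

```python
def partial_product_rows(y: int, x: int, width: int) -> list[list[int]]:
    """Row i holds the bits X_i AND Y_j for j = 0..width-1 (non-Booth)."""
    rows = []
    for i in range(width):
        xi = (x >> i) & 1  # multiplier bit X_i selects row i
        rows.append([xi & ((y >> j) & 1) for j in range(width)])
    return rows


def sum_rows(rows: list[list[int]]) -> int:
    """Add every partial product bit at its column weight 2**(i + j)."""
    return sum(bit << (i + j)
               for i, row in enumerate(rows)
               for j, bit in enumerate(row))


rows = partial_product_rows(y=0b1011, x=0b1101, width=4)
product = sum_rows(rows)  # equals 11 * 13
```

Bit X_iY_j lands in column i + j, which is the column weight the later structural inference relies on.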

7
- Partial product generation
  - Non-Booth: generates the logical product (AND) of a multiplicand and a multiplier.
  - Booth (radix-4): reduces the number of partial products by half.
- Partial product reduction
  - Carry-save addition: reduces every column to 2 output rows using compressor cells.
- Carry-propagate adder (final adder)
  - Carry look-ahead adder: adds the 2 output rows.
[Figure: multiplication in dot notation (multiplicand, multiplier, partial products, partial product reduction, final adder, final product); non-Booth and Booth partial product generation cells with inputs Xi, Yj and output PPij; 3:2 compressors reducing PPi+2,j-2, PPi+1,j-1, PPi,j, PPi-1,j+1; a carry-propagate adder built from full adders (inputs A2..A0, B2..B0, C0; outputs S2..S0) with a carry look-ahead unit (P2..P0, G2..G0, C1..C3).]
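As a rough illustration of the reduction stage, the sketch below applies 3:2 compressors (full adders) column by column until every column holds at most two bits. The function names and the greedy grouping order are assumptions of this sketch, not the structure a synthesis tool would necessarily produce, and a plain integer sum stands in for the carry look-ahead final adder.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor: three bits of one weight -> (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def columns_of(y: int, x: int, width: int) -> list[list[int]]:
    """Non-Booth partial product bits grouped by column weight i + j."""
    cols = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            cols[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    return cols


def reduce_columns(columns: list[list[int]]) -> list[list[int]]:
    """Carry-save reduction: repeat until no column has more than 2 bits."""
    columns = [list(c) for c in columns]
    while any(len(c) > 2 for c in columns):
        nxt = [[] for _ in range(len(columns) + 1)]
        for k, col in enumerate(columns):
            i = 0
            while len(col) - i >= 3:
                s, carry = full_adder(col[i], col[i + 1], col[i + 2])
                nxt[k].append(s)          # sum stays in the same bit-slice
                nxt[k + 1].append(carry)  # carry moves to the next column
                i += 3
            nxt[k].extend(col[i:])        # leftover bits pass through
        columns = nxt
    return columns


def value(columns: list[list[int]]) -> int:
    """Stand-in for the final carry-propagate adder."""
    return sum(bit << k for k, col in enumerate(columns) for bit in col)


reduced = reduce_columns(columns_of(y=11, x=13, width=4))
```

Each compressor consumes three bits and emits two, so the total bit count strictly decreases and the loop terminates.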

8 It performs
1. Identify cells of a synthesized parallel multiplier to be structurally placed
   - The PI cells from the partial product generation
   - The PO cells from the final adder
2. Inherent structural location extraction of the cells
   - Tagging structural locations for the PI and PO cells
3. Analyze data-flow of the multiplier
4. Structurally mapping the cells on a logical 2-D array
5. Physical bit-slice alignment of the cells
6. Generate structural relative placement directives
7. Guide structural placement during global placement
[Figure: design flow chart (logic synthesis; structure extraction and mapping; global placement).]

9 The PI cells from the partial product generation
- The PI cells are retrieved as the immediate fan-out cone cells of the input nets.
- The set of nets used to collect the PI cells differs depending on the type of the partial product generation.
  - Non-Booth: multiplicand and multiplier input nets
  - Booth: multiplicand input nets
[Figure: multiplication in dot notation (multiplicand Y3..Y0, multiplier X3..X0, partial products XiYj, final product S7..S0) with the partial product generation stage highlighted; non-Booth and Booth cells with inputs Xi, Yj and output PPij.]
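The PI cell collection can be sketched on a toy netlist model. Everything here is hypothetical scaffolding: the `fanout` dictionary (net name to the cells it directly drives), the cell names, and the function name are illustrative assumptions, not the data model of any particular synthesis or P&R tool.

```python
def collect_pi_cells(fanout, multiplicand_nets, multiplier_nets, pp_type):
    """Return cells in the immediate fan-out of the relevant input nets.

    fanout: dict mapping a net name to the list of cells it drives.
    pp_type: "non-booth" uses multiplicand and multiplier nets,
             "booth" uses multiplicand nets only.
    """
    if pp_type == "non-booth":
        seed_nets = list(multiplicand_nets) + list(multiplier_nets)
    elif pp_type == "booth":
        seed_nets = list(multiplicand_nets)
    else:
        raise ValueError(f"unknown partial product type: {pp_type}")
    cells = set()
    for net in seed_nets:
        cells.update(fanout.get(net, ()))
    return cells


# Toy 2x2 non-Booth generator: AND cell and_i_j is driven by X_i and Y_j.
nb_fanout = {
    "Y0": ["and_0_0", "and_1_0"], "Y1": ["and_0_1", "and_1_1"],
    "X0": ["and_0_0", "and_0_1"], "X1": ["and_1_0", "and_1_1"],
}
nb_cells = collect_pi_cells(nb_fanout, ["Y0", "Y1"], ["X0", "X1"], "non-booth")

# Toy Booth generator: multiplier nets drive encoders, multiplicand nets
# drive selector cells; only the selectors are collected.
booth_fanout = {"Y0": ["sel_0"], "Y1": ["sel_0", "sel_1"], "X0": ["enc_0"], "X1": ["enc_0"]}
booth_cells = collect_pi_cells(booth_fanout, ["Y0", "Y1"], ["X0", "X1"], "booth")
```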

10 After extracting the PI cells, the PI cells are tagged with the 2-D locations of a partial product row and column.
Row inference
- The row of the PI cell can be inferred from its topologically closest multiplier inputs.
- i indicates the ith row of the partial product generator.
- PIrow(Ck) : the row number of the PI cell Ck
- PIcol(Ck) : the column number of the PI cell Ck
- Bmd(Ck) : the closest multiplicand bit of Ck
- Bmr(Ck) : the closest multiplier bit of Ck
- PPtype : the partial product type
[Figure: non-Booth and Booth partial product generation cells with inputs Xi, Yj and output PPij.]
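One way to picture the "topologically closest multiplier input" is a breadth-first search backwards through the fan-in. The sketch below computes Bmr(Ck) that way and uses it directly as the row for the non-Booth case; that direct use is an assumption of this sketch. The Booth case depends on PPtype in a way the surviving text does not spell out, so it is deliberately left out. The `fanin` model and all names are hypothetical.

```python
from collections import deque


def closest_input_bit(cell, fanin, input_bits):
    """Breadth-first search backwards from `cell`.

    fanin: dict mapping a cell to its drivers (cells or primary input nets).
    input_bits: dict mapping a primary input net name to its bit index.
    Returns the bit index of the nearest listed input, or None.
    """
    seen = {cell}
    queue = deque([cell])
    while queue:
        node = queue.popleft()
        for driver in fanin.get(node, ()):
            if driver in input_bits:
                return input_bits[driver]
            if driver not in seen:
                seen.add(driver)
                queue.append(driver)
    return None


def pi_row_non_booth(cell, fanin, multiplier_bits):
    """Assumed rule for non-Booth: the row equals Bmr(Ck)."""
    return closest_input_bit(cell, fanin, multiplier_bits)


# Toy example: an AND cell fed by X1 through a buffer and by Y0 directly.
fanin = {"and_1_0": ["buf_x1", "Y0"], "buf_x1": ["X1"]}
multiplier_bits = {"X0": 0, "X1": 1}
row = pi_row_non_booth("and_1_0", fanin, multiplier_bits)
```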

11 Column inference
- The column of the PI cell can be inferred from its topologically closest and bit-slice aligned multiplier output bit.
- Topological order propagation is restricted to follow only the same-weighted bit-slice along the CSA tree.
  - Carry-out pins of the compressor cells are ignored.
[Figure: find the topologically closest and bit-slice aligned result. A tree of 3:2 compressors reduces the partial products XiYj (multiplicand Y3..Y0, multiplier X3..X0) along Column[i] and Column[i+1] to the outputs S7..S0.]
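A minimal sketch of the restricted propagation: search forwards from a PI cell, skip every carry-out pin so the walk stays in one bit-slice, and stop at the first multiplier output bit reached. The pin name `CO`, the `fanout_pins` model, and the cell names are assumptions made for illustration only.

```python
from collections import deque


def pi_column(cell, fanout_pins, output_bits, carry_pins=("CO",)):
    """Forward BFS that ignores carry-out pins.

    fanout_pins: dict cell -> {output pin name: [sink cells or output nets]}.
    output_bits: dict mapping a multiplier output net to its bit index.
    Returns the bit index of the nearest same-slice output, or None.
    """
    seen = {cell}
    queue = deque([cell])
    while queue:
        node = queue.popleft()
        for pin, sinks in fanout_pins.get(node, {}).items():
            if pin in carry_pins:
                continue  # carries leave the bit-slice, so do not follow them
            for sink in sinks:
                if sink in output_bits:
                    return output_bits[sink]
                if sink not in seen:
                    seen.add(sink)
                    queue.append(sink)
    return None


# Toy slice: the carry path reaches S3 sooner than the sum path reaches S2.
fanout_pins = {
    "pp_1_1": {"Z": ["fa_a"]},
    "fa_a": {"CO": ["fa_c"], "S": ["fa_b"]},
    "fa_c": {"S": ["S3"]},
    "fa_b": {"S": ["fa_d"]},
    "fa_d": {"S": ["S2"]},
}
output_bits = {"S2": 2, "S3": 3}
column = pi_column("pp_1_1", fanout_pins, output_bits)
```

In the toy data, following carries would reach S3 first; ignoring them yields column 2, matching the weight of X1Y1.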

12 The PO cells are parts of the final carry-propagating adder.
- The PO cells are retrieved as the immediate fan-in cone cells of the output nets.
- The corresponding multiplier output bits are tagged to the PO cells.
[Figure: multiplication in dot notation with the final adder highlighted; carry-propagate adder built from full adders (inputs A2..A0, B2..B0, C0; outputs S2..S0) with a carry look-ahead unit (P2..P0, G2..G0, C1..C3).]


14 Step 4. Structurally mapping the cells on a logical 2-D array
- Using the inferred row and column numbers.

15 The PI cells are mapped onto a logical 2-D array according to their tagged row and column numbers.
- However, the number of cells inferred to the same location can be uneven due to the local nature of logic synthesis optimizations.
- If enough slots are allocated for all the cells, the 2-D array may have an uncontrollable aspect ratio, which may degrade placement quality.
The maximum number of columns is constrained to control the array dimension.
- The number of rows is fixed.
- Some mis-mappings are allowed.
- Slots are shared between adjacent columns.
There is spacing between the rows of the 2-D array.
- Non-guided cells are to be placed close to their inherent structural locations.

16 Min-cost max-flow based cell mapping maximizes the number of mapped PI cells with minimum mis-mapping cost for a given 2-D array.
- An initial 2-D slot array may not fully contain all the PI cells.
- It allows empty slot sharing between adjacent bit-slice columns.
- It iteratively adds dummy (empty) column slots at the columns with the worst mis-mapping costs during the mapping.
The slots are divided into three types for each column, each with a different mapping cost weight:
- Non-shared: mapping weight γown (capacity m slots)
- Shared: mapping weight γshared (capacity j slots)
- Dummy: mapping weight γdummy (capacity k slots)
Mis-mapping cost: $\gamma_x \cdot \lvert \mathrm{row}_{cell} - \mathrm{row}_{slot} \rvert$, with $x \in \{own, shared, dummy\}$
[Figure: flow network from PI Cell[i-1,0], PI Cell[i,0], PI Cell[i+1,0] to the slots of Column[i-1], Column[i], Column[i+1] (capacity m each), the shared slots between adjacent columns (capacity j), and the dummy slots (capacity k), with edge costs Cost, CostSH, and CostDS.]
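To make the flow formulation concrete, here is a small self-contained min-cost max-flow (successive shortest paths with Bellman-Ford) applied to one row of cells. It is a sketch under simplifying assumptions: each slot type carries a flat placeholder cost (`g_own`, `g_shared`, `g_dummy`) rather than the weighted distance term, the dummy capacity is fixed instead of being grown iteratively, and all names are hypothetical.

```python
class MinCostFlow:
    def __init__(self, n):
        self.n = n
        self.graph = [[] for _ in range(n)]  # edge: [to, cap, cost, rev index]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        """Return (max flow, min cost) by repeated shortest augmenting paths."""
        flow = cost = 0
        inf = float("inf")
        while True:
            dist = [inf] * self.n
            prev = [None] * self.n
            dist[s] = 0
            for _ in range(self.n - 1):  # Bellman-Ford on the residual graph
                changed = False
                for u in range(self.n):
                    if dist[u] == inf:
                        continue
                    for idx, (v, cap, w, _) in enumerate(self.graph[u]):
                        if cap > 0 and dist[u] + w < dist[v]:
                            dist[v], prev[v], changed = dist[u] + w, (u, idx), True
                if not changed:
                    break
            if prev[t] is None:
                return flow, cost
            push, v = inf, t
            while v != s:  # bottleneck capacity along the path
                u, idx = prev[v]
                push = min(push, self.graph[u][idx][1])
                v = u
            v = t
            while v != s:  # update residual capacities
                u, idx = prev[v]
                edge = self.graph[u][idx]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            flow += push
            cost += push * dist[t]


def map_row(cell_cols, n_cols, m, j, k, g_own=0, g_shared=1, g_dummy=2):
    """Map cells (given by tagged column) to own/shared/dummy slot groups."""
    src, sink, cell0 = 0, 1, 2
    own0 = cell0 + len(cell_cols)
    shared0 = own0 + n_cols            # shared group c lies between c and c+1
    dummy0 = shared0 + (n_cols - 1)
    net = MinCostFlow(dummy0 + n_cols)
    for c in range(n_cols):
        net.add_edge(own0 + c, sink, m, 0)
        net.add_edge(dummy0 + c, sink, k, 0)
    for c in range(n_cols - 1):
        net.add_edge(shared0 + c, sink, j, 0)
    for i, col in enumerate(cell_cols):
        net.add_edge(src, cell0 + i, 1, 0)
        net.add_edge(cell0 + i, own0 + col, 1, g_own)
        net.add_edge(cell0 + i, dummy0 + col, 1, g_dummy)
        if col > 0:
            net.add_edge(cell0 + i, shared0 + col - 1, 1, g_shared)
        if col < n_cols - 1:
            net.add_edge(cell0 + i, shared0 + col, 1, g_shared)
    return net.solve(src, sink)


without_dummy = map_row([0, 0, 0, 1], n_cols=2, m=1, j=1, k=0)
with_dummy = map_row([0, 0, 0, 1], n_cols=2, m=1, j=1, k=1)
```

With three cells tagged to column 0 and only one own slot plus one shared slot, one cell stays unmapped until a dummy slot is added, which mirrors the reason for adding dummy column slots.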

17 HPWL is considered as a tiebreaker for the mapping, to compensate for its net-connection blindness.
- Linear programming formulation: the weighted sum of the min-cost max-flow objective CostMA(ci) and the HPWL minimization objective CostHPWL(ni).
  - CostMA(ci) : weighted sum of mis-mapping cost of cell ci
  - CostHPWL(ni) : half-perimeter wirelength (HPWL) cost of net ni
- Dummy column slots are gradually added at the columns with the worst mis-mapping cost, and the linear program is solved iteratively.
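One plausible way to write the combined objective is shown below. The weights $\alpha$ and $\beta$, the pin coordinate $x_p$, and the bound variables $l_{n_i}$, $u_{n_i}$ are notation introduced here for illustration and are not taken from the slides; the second line is the standard linearization of a one-dimensional HPWL term, and choosing $\beta \ll \alpha$ is one way to make HPWL act only as a tiebreaker.

```latex
\begin{aligned}
\min \quad & \alpha \sum_{c_i} \mathrm{Cost}_{MA}(c_i) \;+\; \beta \sum_{n_i} \mathrm{Cost}_{HPWL}(n_i),
  \qquad \beta \ll \alpha \\
\text{with} \quad & \mathrm{Cost}_{HPWL}(n_i) = u_{n_i} - l_{n_i},
  \qquad l_{n_i} \le x_p \le u_{n_i} \quad \text{for every pin } p \text{ of net } n_i
\end{aligned}
```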


19 The logically mapped PI and PO cells are then bit-slice aligned with respect to their physical dimensions.
- Strict bit-slice alignment: each column width is decided by the widest cell in the column.
  - The cell alignment size is uncontrollable.
- Compression alignment: this generates a compact cell cluster.
  - It cannot ensure vertical bit-slice alignment.
[Figure: cells C(i-2..i, j-1..j+3) arranged under strict bit-slice alignment and under compression alignment.]

20 Our method combines the advantages of the aforementioned methods.
- Align the columns within a maximum width constraint.
- It minimizes bit-slice misalignment while ensuring a maximum alignment width.
[Figure: cells C(i-2..i, j-1..j+3) aligned within the maximum width constraint, with the misalignment at each column marked.]
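The trade-off between the three alignment styles can be illustrated numerically. In this sketch a layout is described by column origins; cells in a row are placed left to right at the later of their column origin and the end of the previous cell. The greedy pitch-shrinking loop is a simple stand-in chosen for this sketch, not the authors' minimization, whose details the slides do not give; all names are hypothetical.

```python
def place(widths, origins):
    """Return (total width, total misalignment) for given column origins.

    widths[r][c] is the width of the cell at row r, column c (0 = empty).
    """
    total = mis = 0
    for row in widths:
        x = 0
        for c, w in enumerate(row):
            if w == 0:
                continue
            start = max(x, origins[c])
            mis += start - origins[c]  # how far the cell sits off its column
            x = start + w
        total = max(total, x)
    return total, mis


def origins_from(pitches):
    out = [0]
    for p in pitches:
        out.append(out[-1] + p)
    return out


def strict_pitches(widths):
    """Strict alignment: each column is as wide as its widest cell."""
    n_cols = len(widths[0])
    return [max(row[c] for row in widths) for c in range(n_cols - 1)]


def constrained_origins(widths, max_width):
    """Greedy: shrink one pitch at a time until the width limit is met."""
    pitches = strict_pitches(widths)
    while place(widths, origins_from(pitches))[0] > max_width and any(pitches):
        best = None
        for c, p in enumerate(pitches):
            if p == 0:
                continue
            trial = pitches[:c] + [p - 1] + pitches[c + 1:]
            width, mis = place(widths, origins_from(trial))
            if best is None or (mis, width) < best[0]:
                best = ((mis, width), trial)
        pitches = best[1]
    return origins_from(pitches)


widths = [[2, 2, 2], [4, 1, 1], [1, 1, 4]]
strict = place(widths, origins_from(strict_pitches(widths)))
compressed = place(widths, [0, 0, 0])
constrained = place(widths, constrained_origins(widths, max_width=8))
```

Strict alignment has zero misalignment but the largest width, compression has the smallest width but the most misalignment, and the constrained variant sits in between.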

21 Step 6. Generate structural relative placement directives
- The relative row and column locations of the cells
- The column spaces between the cells


22 After the bit-slice alignment, the structural locations and the cell spacings are transformed into structural relative placement directives.
- Relative row and column locations of the cells
- Cell spaces between the cells
To accommodate the cell spaces, the number of array columns is set to twice that of the logical 2-D array.
- The compression-based alignment is used to align the cells.
An estimated dataflow direction is used to set the initial orientations of the arrays for global placement.
[Figure: array rows alternating cell slots and space slots (cell spacing) for cells C(i-2..i, j-1..j+3).]
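The column doubling can be sketched as follows: logical column c becomes a cell slot at array column 2c, and the gap after it becomes a space slot at array column 2c + 1. The directive records here are a made-up dictionary format for illustration; they are not the relative placement syntax of any commercial P&R tool.

```python
def make_directives(cell_grid, spacing):
    """Turn a logical 2-D array into cell and space slot entries.

    cell_grid[r][c]: cell name or None for an empty slot.
    spacing[r][c]: gap width to leave after logical slot (r, c).
    """
    n_cols = len(cell_grid[0])
    entries = []
    for r, row in enumerate(cell_grid):
        for c, cell in enumerate(row):
            if cell is not None:
                entries.append({"kind": "cell", "name": cell, "row": r, "col": 2 * c})
            gap = spacing[r][c]
            if gap > 0:
                entries.append({"kind": "space", "width": gap, "row": r, "col": 2 * c + 1})
    # The array has twice as many columns as the logical 2-D array.
    return {"rows": len(cell_grid), "cols": 2 * n_cols, "entries": entries}


grid = [["c00", "c01"], ["c10", None]]
gaps = [[1, 0], [0, 0]]
directives = make_directives(grid, gaps)
```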


24 Structural relative placement directives hold the locations of the PI and PO cells.
- Non-guided cells are attracted to the PI and PO cells.
[Figure: 13*12 non-Booth multiplier; 32*16 Booth multiplier.]

25 We implemented the proposed methodology in Tcl, with CLP as the linear program solver.
- Commercial logic synthesis and P&R tools were used with industrial designs.
- About 2%, 42%, and 2% improvements in critical path delay (CPD), total negative slack (TNS), and total wire-length, respectively.
- D11 showed degraded physical implementation quality: about 25% of its inputs were pruned due to constant propagation, which made it unsuitable for the approach.
[Table: Design, # Mults, Area ratio, CPD, TNS, Wirelength for eleven designs and their average.]

26 [Figure: a snapshot of D10.]

27 Future work will focus on:
- Extending the methodology to other synthesized datapath circuits.
- Developing regularity measuring methods to avoid structurally mapping insufficiently regular multipliers.
- Adding more surround awareness to further automate the methodology.
