Analysis and Design of High Performance 128-bit Parallel Prefix End-Around-Carry Adder


A Thesis Presented by Ogun Turkyilmaz to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering

Northeastern University, Boston, Massachusetts

August 2011

Abstract

Addition is a timing-critical operation in today's floating point units. To speed up processing, an end-around-carry (EAC) adder was proposed as part of a fused multiply-add unit, which performs multiplication followed by addition [5]. The proposed EAC adder was also investigated with other prefix adders in FPGA technology as a complete adder [6]. In this thesis, we propose a 128-bit standalone adder with parallel prefix end-around-carry logic and conditional sum blocks to improve the critical path delay and to provide the flexibility to design with different adder architectures. In previous works, CLA logic was used for the EAC logic. Using a modified structure of a parallel prefix $2^n - 1$ adder provides flexibility to the design and decreases the length of the carry path. After the architecture is tested and verified, the critical path is analyzed using the FreePDK45nm library. Full custom design techniques are applied carefully during critical path optimization. Critical path analysis provides a fast comparison of the total delay among different architectures without designing the whole circuit, and a simpler approach to sizing the transistors for the lowest possible delay. As a final step, the datapath is designed as a recurring bitslice for fast layout entry. The results show that the proposed adder achieves 142 ps delay, 2.42 mW average power dissipation, and 3,132 sq. micron area, assuming there is not much routing area overhead in the estimated area.

Acknowledgements

I would like to express my foremost appreciation to my advisor, Prof. Yong-Bin Kim, for giving me the opportunity to conduct research at Northeastern University. I am grateful for his technical guidance and constant support in my graduate career. Without his valuable suggestions and assistance, this thesis would not have been accomplished.

I would like to thank the committee members, Prof. Fabrizio Lombardi and Prof. Gunar Schirner, for reading my thesis and offering valuable suggestions and contributions. I am especially grateful to Prof. Schirner for the long discussions about graduate study. He has been a mentor and a teacher to me, who generously shares his knowledge and experiences with tremendous enthusiasm and never-ending encouragement.

It has been an honor and a great pleasure to study as a Fulbright Scholar. I would like to express my appreciation to the Fulbright Commission for giving me the chance to pursue further academic study and connect with many accomplished scholars.

I would also like to thank Faith Crisley, Graduate Coordinator at the ECE Department, for her support and valuable suggestions. She has always been helpful with her comforting manner, even in the most stressful moments.

Last but not least, I would like to express my sincere appreciation to my beloved parents, Nuket and Nevzat Turkyilmaz, and my sister, Pinar Turkyilmaz, who encouraged me continuously in every step I took, supported me constantly through every hardship I faced and loved me without boundaries. Without them, I could not even imagine being where I am today.

Ogun Turkyilmaz
August 2011

Contents

1 Introduction
    1.1 Fused Multiply-Add Operation
    1.2 Adders
        1.2.1 Ripple Carry Adders
        1.2.2 Carry lookahead adders
        1.2.3 Parallel Prefix Adders
    1.3 Tree Adders
    1.4 Recurrence Algorithms
        1.4.1 Weinberger Recurrence
        1.4.2 Ling Recurrence
    1.5 Conclusion
2 Modulo Adders
    2.1 Introduction
    2.2 Addition in Modulo $2^n - 1$ Adder
    2.3 Analysis of Previous End-Around-Carry Adders
    2.4 Carry-lookahead EAC Logic Unit
    2.5 Conclusion
3 Modified Parallel Prefix EAC Adder
    3.1 Introduction
    3.2 Proposed Adder
        3.2.1 The 16bit blocks in EAC adder
        Parallel Prefix $2^n - 1$ EAC Block
    Implementation and Validation
    Conclusion
4 Critical Path Analysis
    Path Identification
    Path Design
    Transistor Level Design and Sizing
    Logic Level Minimization
    Late arriving signal exploitation
    Logical Effort
    Design with Helpers
    Transistor Sizing
    4.5 Simulation Results
    Conclusion
5 Datapath Library
    Introduction
    Concepts in Full Custom Design
    Datapath Design
    Layout Design
    Results
    Conclusion
6 Conclusion and Future Works
A Verilog Code of the Proposed Adder
B HSPICE Simulation Files
    B.1 Cells
    B.2 Simulation Code
    B.3 Condition of Transistors

List of Figures

1.1 Ripple Carry adder [1]
1.2 Carry lookahead adder [2]
1.3 Group PG cells [3]
1.4 Taxonomy of prefix networks [3]
1.5 Kogge-Stone adder [3]
1.6 Sklansky adder [3]
1.7 Brent-Kung adder [3]
1.8 Han-Carlson adder [3]
1.9 Knowles [2,1,1,1] adder [3]
1.10 Ladner Fischer adder [3]
2.1 Prefix graph with fast end-around carry [4]
2.2 General block diagram Modulo $2^n - 1$ Adder [4]
2.3 Block diagram of the 128-bit binary adder [5]
2.4 Architecture of the EAC adder [6]
2.5 Architecture of the EAC adder [6]
3.1 Architecture of the modified EAC Adder
3.2 16bit conditional sum blocks
$c_{in}$ merge with fast carry link [7]
$c_{in}$ merge with extra bit [7]
Modified 8bit Kogge-Stone EAC block
Critical path of the modified EAC adder
Gate level design of critical path
Reduced first stage in Weinberger recursion adder [8]
AOI without late arriving exploitation
AOI with late arriving exploitation
EAC logic with helper
Spreadsheet for Logical Effort Calculation
Transistor level schematic
Transistor level schematic
Delay vs. $V_{dd}$ at 25 °C
Power vs. $V_{dd}$ at 25 °C
Delay vs. $V_{dd}$ at 100 °C
Power vs. $V_{dd}$ at 100 °C
Global floorplan of a datapath [9]
5.2 Regularity placement and routing datapath circuit [10]
Schematic layout of datapath and detailed view of bitcell [11]
Representation of a datapath cell [9]
Designed basic cells
Bit slices of the blocks in the adder
Bit slice of 16bit Kogge-Stone adder
Wide layout
Stacked compact layout
Layout of the blocks

List of Tables

1.1 Trade-off between different adder topologies
Delay and Power dissipation values in correspondence to $V_{DD}$ and Temperature
Results comparison of proposed adder with the previous work

Chapter 1 Introduction

The fused multiply-add unit plays an important role in modern microprocessors. It performs a floating point multiplication followed by an addition of the product with a third floating point operand. In 2007, a seven-cycle fused multiply-add pipeline unit was proposed as part of the floating point unit in IBM's POWER6 microprocessor [5]. In this fused multiply-add data flow, the product must be aligned before it is added to the addend. Because the magnitude of the product is unknown in the early stages, prior to the combination with the addend, it is difficult to determine a priori which operand is bigger. Even if it were determined early that the product is bigger, there would be a problem in conditionally complementing two intermediate operands, the carry and sum outputs of the counter tree. Thus an adder needs to be designed to always output a positive magnitude result, and preferably it only needs to complement one operand.

In [6], the adder in POWER6 was taken as a reference for design space exploration in FPGA technology. The authors designed a complete adder independent of the FMA block and showed that Kogge-Stone does not provide the best performance in FPGA technology. Zhang et al. [12] recently proposed a 108-bit adder for an FMA unit. All these adders took the adder in POWER6 as a reference. We believe that using a CLA block in this adder limits the possibility of fully exploiting the benefits of parallel prefix adders. We designed an adder with a parallel prefix $2^n - 1$ block. Although the carry increment topology is still employed, the number of carry merge terms is decreased, as well as the length of the end-around-carry path.

Another area of improvement lies in the design of the first level addition blocks. Using a carry-select scheme provides the benefit of choosing the sum

at the end of the computation. However, the carry path should not include the calculation of group propagate and generate (PG) terms according to the input carry. In our conditional sum blocks, the carry path only includes PG terms with $c_{in} = 0$, and the sum is calculated in the non-critical path for both conditions: $c_{in} = 0$ and $c_{in} = 1$. At the final stage, the real sum is selected according to the output of the EAC block.

The thesis is organized as follows. In Chapter 1, general information about adders is provided to show the importance of the design space. In Chapter 2, a detailed analysis of modulo $2^n - 1$ adders and EAC adders is given. The modified adder is described in Chapter 3 and compared with the previous architectures. The critical path analysis methodology is covered in detail in Chapter 4, and simulation results are provided. Finally, in Chapter 5, the datapath library design methodology is described.

1.1 Fused Multiply-Add Operation

A fused multiply-add (FMA) unit performs the multiplication $A \times B$ followed immediately by an addition of the product and a third operand $C$, so that the result $T$ is calculated as in Eqn. 1.1 in a single indivisible step [2]. Such a unit is capable of performing multiply only by setting $C = 0$, and add (or subtract) only by setting, for example, $B = 1$.

$$T = A \times B + C = M + C \tag{1.1}$$

An advantage of a fused multiply-add unit, compared to a separate multiplier and adder, arises when executing floating-point operations, since rounding is performed only once for the result of $T = A \times B + C$ rather than twice (for the multiply and then for the add) [13]. Since rounding may introduce computation errors, reducing the number of rounding operations has a positive effect on the overall error. The operands are produced by the CSA (Carry-Save Adder) multiplier tree, and their magnitude is not known prior to the addition, so it cannot be determined in advance which operand has the greater value. Since floating point is a sign-magnitude operation, the result of the adder should be in two's complement form [14]. Therefore, an adder is needed to produce two separate results

for the following cases:

Case 1: If operand $M > C$, $|M - C| = M - C = M + \overline{C} + 1$

Case 2: If operand $C > M$, $|M - C| = C - M = -(M - C) = -(M + \overline{C} + 1) = -(M + \overline{C}) - 1 = \overline{(M + \overline{C})}$

During the subtraction $M - C$, the final carry out $C_{out}$ is 1 when $M > C$, and 0 when $C > M$. Consequently, an end-around-carry adder produces two different results, and $C_{out}$ determines whether case 1 or case 2 applies [12].

1.2 Adders

1.2.1 Ripple Carry Adders

The addition of two operands is the most frequent operation in almost any arithmetic unit. A two-operand adder is used not only when performing additions and subtractions, but is also often employed when executing more complex operations like multiplication and division. Consequently, a fast two-operand adder is essential [3].

$$s_i = a_i \oplus b_i \oplus c_i \tag{1.2}$$

$$c_{i+1} = a_i b_i + (a_i + b_i) c_i \tag{1.3}$$

Figure 1.1: Ripple Carry adder [1].
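Equations 1.2 and 1.3 can be exercised bit by bit. The following is a minimal behavioral sketch in Python, not a circuit description; the function names `full_adder` and `ripple_carry_add` are illustrative and do not come from the thesis.

```python
def full_adder(a, b, c):
    """One-bit full adder: Eqn. 1.2 for the sum, Eqn. 1.3 for the carry."""
    s = a ^ b ^ c
    c_next = (a & b) | ((a | b) & c)
    return s, c_next


def ripple_carry_add(x, y, n, c_in=0):
    """Add two n-bit integers by chaining n full adders, LSB first.

    Returns (n-bit sum, carry out). The carry visits every bit position
    in turn, which is why the delay grows linearly with n.
    """
    carry = c_in
    total = 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry
```

The sequential loop mirrors the dependence of each stage on the carry of the previous one, which is the property the faster schemes in the rest of this chapter try to remove.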

The most straightforward implementation of a parallel adder for two operands $x$ and $y$ is through the use of $n$ basic units called full adders. A full adder (FA) is a logical circuit that accepts two operand bits, say $x_i$ and $y_i$, and an incoming carry bit, denoted by $c_i$, and produces a sum bit, denoted by $s_i$, and an outgoing carry bit, denoted by $c_{i+1}$. As the notation suggests, the outgoing carry $c_{i+1}$ is also the incoming carry for the subsequent FA, which has $x_{i+1}$ and $y_{i+1}$ as the input bits. The FA is a combinational digital circuit implementing the binary addition of three bits through the Boolean equations in 1.2 and 1.3, where the operand bits are written $a_i$ and $b_i$. The ripple carry adder is shown in Figure 1.1.

1.2.2 Carry lookahead adders

The most commonly used scheme for accelerating carry propagation is the carry lookahead scheme [2], shown in Figure 1.2. The equations in Eqn. 1.4 show the realization of the carry-lookahead generator. The main idea behind carry lookahead addition is an attempt to generate all incoming carries in parallel (for all the $n-1$ high order FAs) and avoid the need to wait until the correct carry propagates from the stage (FA) of the adder where it has been generated. This can be accomplished in principle, since the carries generated and the way they propagate depend only on the digits of the original numbers $x_{n-1}, x_{n-2}, \ldots, x_0$ and $y_{n-1}, y_{n-2}, \ldots, y_0$. These digits are available simultaneously to all stages of the adder, and consequently each stage can evaluate the value of its incoming carry and compute the sum bit accordingly. This, however, would require an inordinately large number of inputs to each stage of the adder, rendering this approach impractical [1].

Figure 1.2: Carry lookahead adder [2].
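The lookahead idea can be illustrated by expanding the carry recurrence $c_{i+1} = g_i + p_i c_i$ until every carry depends only on operand bits and $c_0$. The sketch below is a behavioral Python model under that assumption; the names `lookahead_carries` and `cla_add` are hypothetical, and the OR form of propagate is used.

```python
def lookahead_carries(x, y, n, c0=0):
    """Return [c_0, ..., c_n], each computed directly from the operand bits.

    c_{i+1} = g_i + p_i g_{i-1} + ... + p_i ... p_1 g_0 + p_i ... p_0 c_0,
    with g_i = x_i y_i and p_i = x_i + y_i (OR form).
    """
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) | (y >> i)) & 1 for i in range(n)]
    carries = [c0]
    for i in range(n):
        c = 0
        # One product term per generating position j <= i.
        for j in range(i, -1, -1):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            c |= term
        # Term that lets the external carry c_0 propagate through bits 0..i.
        tail = c0
        for k in range(i + 1):
            tail &= p[k]
        carries.append(c | tail)
    return carries


def cla_add(x, y, n, c0=0):
    """n-bit addition using the fully expanded lookahead carries."""
    c = lookahead_carries(x, y, n, c0)
    s = 0
    for i in range(n):
        s |= (((x >> i) ^ (y >> i) ^ c[i]) & 1) << i
    return s, c[n]
```

No carry in `lookahead_carries` depends on another carry, but the number of product terms and their fan-in grow with the bit position. This is the practical limit noted above, and it is why lookahead is applied to small groups as in Eqn. 1.4.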

$$\begin{aligned} c_4 &= G_0 + c_0 P_0, \\ c_8 &= G_1 + G_0 P_1 + c_0 P_0 P_1, \\ c_{12} &= G_2 + G_1 P_2 + G_0 P_1 P_2 + c_0 P_0 P_1 P_2 \end{aligned} \tag{1.4}$$

1.2.3 Parallel Prefix Adders

A parallel prefix circuit is a combinational circuit with $n$ inputs $x_1, x_2, \ldots, x_n$ producing the outputs $x_1,\ x_2 \circ x_1,\ \ldots,\ x_n \circ x_{n-1} \circ \cdots \circ x_1$, where $\circ$ is an associative binary operation. The first stage of the adder generates the individual P and G signals. The remaining stages constitute the parallel prefix circuit, with the fundamental carry operation serving as the associative binary operation. This part of the adder can be designed in many different ways.

$$g_i = a_i b_i, \qquad p_i = a_i \oplus b_i \tag{1.5}$$

$$G_{i:k} = G_{i:j} + P_{i:j} G_{j-1:k}, \qquad P_{i:k} = P_{i:j} P_{j-1:k} \tag{1.6}$$

Although carry-propagate addition can be computed with generate and propagate signals, its implementation in VLSI can be quite inefficient due to the number of wires that have to be connected together. Parallel-prefix adders solve this problem by making the wires shorter, with simple gate structures to aid in the passing of groups of carries to the next weight [3] [15]. The proof of the parallel prefix adder can be found in [16]. Parallel-prefix adders can be broken down into three stages:

Pre-computation: single-bit carry generate/propagate are obtained with Equation 1.5 and the temporary sum is generated. This stage can be simplified applying the rules defined in Section

Parallel-prefix tree: the carry at each bit is computed with group carry generate/propagate, where Equation 1.6 is applied. It is possible to simplify the first stage of the tree using the rules defined in Section

Post-computation: the sum and carry-out are derived with Equation 1.2. The basic cell definitions of prefix adders are shown in Figure 1.3.

Figure 1.3: Group PG cells [3].

1.3 Tree Adders

Tree structures have been used for graphically representing the various parallel prefix algorithms. Many state-of-the-art adder circuits use parallel prefix schemes to achieve high performance [17-19]. For wide adders, the delay of carry-lookahead (or carry-skip or carry-select) adders becomes dominated by the delay of passing the carry through the lookahead stages. This delay can be reduced by looking ahead across the lookahead blocks [20]. In general, it is possible to construct a multilevel tree of lookahead structures to achieve a delay that grows with $\log N$. Such adders are variously referred to as tree adders, logarithmic adders, multilevel-lookahead adders, parallel-prefix adders, or simply lookahead adders. There are many ways to build the lookahead tree that offer tradeoffs among the number of

stages of logic, the number of logic gates, the maximum fanout on each gate, and the amount of wiring between stages. Figure 1.4 shows a three-dimensional taxonomy of prefix adders [21]. The adders at the far edges are Kogge-Stone [22], Sklansky [23] and Brent-Kung [24], with the properties of low logic level and high wiring track count; low logic level and high fanout; and high logic level and low fanout, respectively, as shown in Table 1.1.

Table 1.1: Trade-off between different adder topologies.

Topology           Logic Level   Fanout   Wiring Track
Kogge-Stone [22]   Low           Low      High
Sklansky [23]      Low           High     Low
Brent-Kung [24]    High          Low      Low

Figure 1.4: Taxonomy of prefix networks [3].

The following parallel prefix adders can be found in the literature:

Kogge-Stone: The Kogge-Stone tree [22], shown in Figure 1.5, achieves both $\log_2 N$ stages and a fanout of 2 at each stage. This comes at the cost of long wires that must be routed between stages. The tree also contains more PG cells; while this may not impact the area if the adder layout is on a regular grid, it will increase power consumption. Despite these costs,


the Kogge-Stone adder is generally used for wide adders because it shows the lowest delay among these structures.

Figure 1.5: Kogge-Stone adder [3].

Sklansky: The Sklansky or divide-and-conquer tree [23], shown in Figure 1.6, reduces the delay to $\log_2 N$ stages by computing intermediate prefixes along with the large group prefixes. This comes at the expense of fanouts that double at each level: the gates fan out to [8, 4, 2, 1] other columns. These high fanouts cause poor performance on wide adders unless the gates are appropriately sized or the critical signals are buffered before being used for the intermediate prefixes. Transistor sizing can cut into the regularity of the layout because multiple sizes of each cell are required, although the larger gates can spread into adjacent columns. With appropriate buffering, the fanouts can be reduced to [8, 1, 1, 1].
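The three stages of a parallel prefix adder, together with the Kogge-Stone and Sklansky trees described above, can be sketched behaviorally as follows. This is an illustrative Python model, not the thesis implementation; the function names are hypothetical, the carry-in is assumed to be 0, and the width is assumed to be a power of two.

```python
def prefix_op(left, right):
    """Fundamental carry operator of Eqn. 1.6 on (G, P) pairs."""
    g_hi, p_hi = left
    g_lo, p_lo = right
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def bitwise_gp(a, b, n):
    """Pre-computation (Eqn. 1.5): bitwise generate and XOR propagate."""
    return [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1)
            for i in range(n)]


def kogge_stone(gp):
    """At distance d, every bit i >= d merges with bit i - d."""
    n, d, levels = len(gp), 1, 0
    while d < n:
        gp = [prefix_op(gp[i], gp[i - d]) if i >= d else gp[i]
              for i in range(n)]
        d *= 2
        levels += 1
    return gp, levels


def sklansky(gp):
    """Upper half of each 2d-bit block merges with the top bit of the lower half."""
    n, d, levels = len(gp), 1, 0
    while d < n:
        gp = [prefix_op(gp[i], gp[(i // d) * d - 1]) if (i // d) % 2 else gp[i]
              for i in range(n)]
        d *= 2
        levels += 1
    return gp, levels


def prefix_add(a, b, n, tree=kogge_stone):
    """Return (sum, carry out, number of prefix levels) for n-bit a + b."""
    gp0 = bitwise_gp(a, b, n)
    gp, levels = tree(gp0)
    # Post-computation: the carry into bit i is the group generate G_{i-1:0}.
    carries = [0] + [g for g, _ in gp]
    s = 0
    for i in range(n):
        s |= (gp0[i][1] ^ carries[i]) << i
    return s, carries[n], levels
```

Both trees finish in $\log_2 N$ levels. They differ only in which earlier column each cell reads: in `sklansky` one column feeds up to `d` cells at a level, while in `kogge_stone` each column feeds at most one other column, at the price of more cells and longer wires.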

Figure 1.6: Sklansky adder [3].

Brent-Kung: The Brent-Kung tree [24], shown in Figure 1.7, computes prefixes for 2-bit groups. These are used to find prefixes for 4-bit groups, which in turn are used to find prefixes for 8-bit groups, and so forth. The prefixes then fan back down to compute the carries-in to each bit. The tree requires $2\log_2 N - 1$ stages. The fanout is limited to 2 at each stage. The diagram shows buffers used to minimize the fanout and loading on the gates, but in practice the buffers are generally omitted.

Figure 1.7: Brent-Kung adder [3].
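The Brent-Kung stage count can be checked with a small behavioral model. The sketch below assumes a power-of-two width and a carry-in of 0; the name `brent_kung_carries` is hypothetical.

```python
def prefix_op(left, right):
    """Fundamental carry operator of Eqn. 1.6 on (G, P) pairs."""
    g_hi, p_hi = left
    g_lo, p_lo = right
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def brent_kung_carries(a, b, n):
    """Return ([G_{i:0} for every bit i], number of prefix stages)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1)
          for i in range(n)]
    stages = 0
    # Up-sweep: prefixes for 2-bit, 4-bit, 8-bit, ... groups.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            gp[i] = prefix_op(gp[i], gp[i - d])
        d *= 2
        stages += 1
    # Down-sweep: fan the group prefixes back to the remaining columns.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            gp[i] = prefix_op(gp[i], gp[i - d])
        d //= 2
        stages += 1
    return [g for g, _ in gp], stages
```

For `n = 16` the model uses four up-sweep and three down-sweep stages, seven in total, which agrees with $2\log_2 N - 1$.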


Beyond these major adders, other trade-offs can be achieved with the following adders.

The Han-Carlson trees [25] are a family of networks between Kogge-Stone and Brent-Kung. Figure 1.8 shows such a tree that performs Kogge-Stone on the odd-numbered bits, and then uses one more stage to ripple into the even positions.

The Knowles trees [26] are a family of networks between Kogge-Stone and Sklansky. All of these trees have $\log_2 N$ stages, but differ in the fanout and number of wires. If we say that 16-bit Kogge-Stone and Sklansky adders drive fanouts of [1, 1, 1, 1] and [8, 4, 2, 1] other columns, respectively, the Knowles networks lie between these extremes. For example, Figure 1.9 shows a [2, 1, 1, 1] Knowles tree that halves the number of wires in the final track at the expense of doubling the load on those wires.

The Ladner-Fischer trees [27] are a family of networks between Sklansky and Brent-Kung. Figure 1.10 is similar to Sklansky, but computes prefixes for the odd-numbered bits and again uses one more stage to ripple into the even positions. Cells at high-fanout nodes must still be sized or merged appropriately to achieve good speed.

Figure 1.8: Han-Carlson adder [3].


Figure 1.9: Knowles [2,1,1,1] adder [3].

Figure 1.10: Ladner Fischer adder [3].

1.4 Recurrence Algorithms

Recurrence algorithms have been a research area for a long time [8] [28]. Weinberger presented the most widely known carry recurrence for VLSI addition in 1958 [20]. Over the years, several addition algorithms have been developed. These algorithms manipulate the carry and sum equations in an attempt to improve the speed of addition. The equations for sum and carry are defined in Equations 1.2 and 1.3.

Ling modified the algorithm to reduce the complexity of the carry computation at the cost of increased complexity in the sum computation. An analysis was later performed by Doran [29] to determine the set of recurrences which have recurrence properties that are similar to Weinberger's and Ling's.

1.4.1 Weinberger Recurrence

Weinberger [20] demonstrated that addition speed could be improved by parallelizing the computation of carry. Although widely credited with only the Carry Look-Ahead Adder, Weinberger's recurrence was not limited in group size or number of levels for carry computation [20]. The fundamental advancement of his work was the introduction of generate and propagate, as shown in Eqn. 1.7.

$$g_i = a_i b_i, \qquad p_i = a_i + b_i \tag{1.7}$$

Weinberger defined the terms: bitwise generate ($g$), bitwise propagate ($p$), group generate ($G$), and group propagate ($P$). These terms allow the carry computation to be performed in parallel, yielding a significant improvement in performance compared to ripple-carry addition. For a group of 4 bits, the Weinberger recurrence has ten terms for the generation of $G_{5:2}$ from the inputs and four terms for the generation of $P_{5:2}$. The maximum transistor stack height is 5. Weinberger demonstrated that the group generate and propagate terms could be used to create blocks of arbitrary size and parallelized to form multiple levels of recurrence [20]. Thus, the majority of parallel prefix adders proposed for high-performance addition employ realizations of Weinberger's recurrence, e.g., Kogge-Stone [22], Brent-Kung [24], Han-Carlson [25], Ladner-Fischer [27], and those described by Knowles [26].

1.4.2 Ling Recurrence

IBM ECL technology limitations on fan-in (limited to 4) and wired-OR (limited to 8) motivated Ling to develop a transformation that reduced the fan-in of Weinberger's recurrence [30] [8]. For clarity, a simple derivation of Ling's transformation will be shown. This derivation provides

the physical meaning of the signals used in Ling's transformation and identifies the favorable characteristics of Ling for implementation in modern CMOS technology. In the derivation, the bitwise generate signal is defined as $g_i = a_i b_i$ and the bitwise propagate signal is defined as $t_i = a_i + b_i$. Note that the propagate signal $t_i$ is the same as Weinberger's $p_i$ (when implemented using an OR). To maintain consistency with Ling's original paper, $t_i$ will be used for propagate. Ling's transformation reduces the complexity of Weinberger's recurrence by factoring $t_i$ from $c_{i+1}$ to create a pseudo-carry ($h_i$) on which the recurrence is performed. The transformation is shown below on $c_1$ to form $h_0$. The carry-out signal, $c_1$, of the first bit position is

$$c_1 = g_0 + t_0 c_0 \tag{1.8}$$

Ling's transformation uses the property $t_i g_i = g_i$ to form

$$c_1 = t_0 g_0 + t_0 c_0 = t_0 (g_0 + c_0) \tag{1.9}$$

where $g_0 + c_0 = h_0$, which leads to

$$c_1 = t_0 (g_0 + c_0) = t_0 h_0 \tag{1.10}$$

The general transformation of $c_i$ is defined as

$$c_i = \begin{cases} t_{i-1} h_{i-1} & \text{if } i > 0 \\ c_0 & \text{if } i = 0 \end{cases} \tag{1.11}$$

where the pseudo-carry, $h_i$, is defined as

$$h_i = g_i + c_i \tag{1.12}$$

The physical meaning of the pseudo-carry signal $h$ can be described as follows. By factoring $t_i$ out of the carry expression and propagating $h_i$ instead of $c_{i+1}$, all cases where a carry is generated and/or propagated from the stage preceding stage $i$ are included in $h_i$. This includes the case where a carry-in to the $i$th stage can be assimilated (which should not result in a carry-out). The assimilate condition is handled when forming $c_{i+1}$ by ANDing $h_i$ with $t_i$ to produce $c_{i+1}$. If the carry-assimilate (carry-kill) condition exists, then $t_i = 0$, which results in $c_{i+1} = 0$. A recurrence for $h_i$ can be defined, as has been done previously for Weinberger's $c_i$. The

group pseudo-carry and transmit, which allow for parallel prefix computation, can be defined over the group of bits (capital letters are used to refer to the group):

$$T_{i:j} = t_i t_{i-1} \cdots t_j \tag{1.13}$$

$$H_{i:j} = g_i + g_{i-1} + t_{i-1} g_{i-2} + t_{i-1} t_{i-2} g_{i-3} + \cdots + t_{i-1} t_{i-2} \cdots t_{j+1} g_j \tag{1.14}$$

The recurrence can be expressed using the $\circ$ operator as

$$(H_{i:j},\ T_{i-1:j-1}) \circ (H_{j-1:k},\ T_{j-2:k-1}) = (H_{i:j} + T_{i-1:j-1} H_{j-1:k},\ T_{i-1:j-1} T_{j-2:k-1}) \tag{1.15}$$

The transformation from Weinberger's recurrence to Ling's recurrence for a group of 4 bits is shown in the example in Fig. 2. This figure should dispel any difficulties associated with understanding Ling's original derivation. The advantage of using the pseudo-carry instead of the carry is offset by the increased complexity of the sum computation, which requires the real carry to form the individual sum signals. In CMOS technology the sum $s_i$ can be efficiently calculated conditionally, thus avoiding the AND operation on the critical carry path:

$$s_i = \begin{cases} a_i \oplus b_i & \text{if } h_{i-1} = 0 \\ a_i \oplus b_i \oplus t_{i-1} & \text{if } h_{i-1} = 1 \end{cases} \tag{1.16}$$

1.5 Conclusion

In this chapter, general information about binary adder realization from a VLSI perspective is described. Parallel prefix adders provide the fastest carry propagation on the critical path for wide adders, and they allow fast layout design because of their regularity. The most important advantage is the design space, with many trade-offs in delay, power dissipation and area. Also, recurrence algorithms are provided to show that it is possible to shorten the carry path with different propagate and generate terms. In the next chapter, a detailed analysis of modulo adders is given. End-around-carry adders are analyzed in detail as a specialized realization of modulo adders. Analysis of previous

work is also included.
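The algebra of the Ling recurrence in Section 1.4.2 can be checked with a bit-serial model. The sketch below only verifies Equations 1.11, 1.12 and 1.16 against ordinary addition; it is not a prefix implementation, and the name `ling_add` is hypothetical.

```python
def ling_add(a, b, n, c0=0):
    """n-bit addition that propagates the pseudo-carry h_i instead of c_{i+1}."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    t = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    carry_in = c0          # real carry c_i into the current bit
    h_prev = 0             # pseudo-carry h_{i-1}, unused for i = 0
    total = 0
    for i in range(n):
        h = g[i] | carry_in                  # Eqn. 1.12: h_i = g_i + c_i
        x = ((a >> i) ^ (b >> i)) & 1
        if i == 0:
            s = x ^ c0                       # bit 0 uses c_0 directly
        else:
            # Eqn. 1.16: the sum is selected by h_{i-1}, not by the real carry.
            s = (x ^ t[i - 1]) if h_prev else x
        total |= s << i
        carry_in = t[i] & h                  # Eqn. 1.11: c_{i+1} = t_i h_i
        h_prev = h
    return total, carry_in
```

The selection on `h_prev` shows where the cost moves: the carry path is shorter by one AND, while the sum needs the extra term $t_{i-1}$.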

Chapter 2 Modulo Adders

2.1 Introduction

Modular arithmetic has been of interest to researchers in a wide range of areas, since its operations are the basis for systems that use the Residue Number System (RNS) [31]. Modulo addition/subtraction and multiplication can also be applied to digital filters [32], cryptography [33], error detection and correction [34], as well as checksum computation in high-speed networks [35]. More importantly, modulo $2^n - 1$ addition is a common operation that can be implemented in hardware, because of its circuit efficiency and simple implementation [36].

In end-around-carry adders the carry-in depends on the carry-out [4], i.e. the carry-out $c_{out}$ is fed through some logic back to the carry-in $c_{in}$. In particular, this is used for modulo $2^n + 1$ [37] and $2^n - 1$ [38] addition, which rely on a decrement and an increment, respectively, of the addition result depending on $c_{out}$. Since prefix algorithms actually rely on incrementer structures, considering parallel-prefix schemes for this kind of adder is very promising. In order to obtain fast end-around-carry adders, both conditions of fast carry-out generation and fast carry-in processing have to be met. This implies that there should be no combinational path between $c_{in}$ and $c_{out}$. A fast end-around-carry adder can be built using the prefix structure. Here the last prefix stage is used as an incrementer, which is controlled by the $c_{out}$ of the previous prefix stages.

2.2 Addition in Modulo $2^n - 1$ Adder

Addition modulo $2^n - 1$, or one's complement addition, can be formulated by the following equation:

$$(A + B) \bmod (2^n - 1) = \begin{cases} (A + B + 1) \bmod 2^n & \text{if } A + B \ge 2^n - 1 \\ A + B & \text{otherwise} \end{cases} \tag{2.1}$$

However, the condition $A + B \ge 2^n - 1$ is not trivial to compute. It can be rewritten as $A + B \ge 2^n$ with a carry input of 1.

$$(A + B) \bmod (2^n - 1) = \begin{cases} (A + B + 1) \bmod 2^n & \text{if } A + B \ge 2^n \\ A + B & \text{otherwise} \end{cases} \tag{2.2}$$

Now the carry out $c_{out}$ from the addition $A + B$ can be used to determine whether the increment has to be performed or, even simpler, $c_{out}$ can be added to the sum of $A + B$. This equation, however, results in a double representation of zero (i.e. $0 = 00\ldots0 = 11\ldots1$).

Figure 2.1: Prefix graph with fast end-around carry [4].

The standard approach for the implementation of a modulo $2^n - 1$ adder is to use a conventional carry propagate adder (CPA) and have the carry out fed back into the carry in of the adder. This creates the end-around carry needed for modulo $2^n - 1$ to operate correctly, as shown in Figure 2.2.
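Equation 2.2 amounts to adding the carry out back into the sum. A minimal Python sketch, with the hypothetical name `add_mod_2n_minus_1`, makes the double representation of zero visible:

```python
def add_mod_2n_minus_1(a, b, n):
    """End-around-carry addition of two n-bit values (Eqn. 2.2).

    The carry out of a + b is added back into the n-bit sum. The result
    is congruent to a + b modulo 2**n - 1, but the all-ones pattern can
    appear as a second representation of zero.
    """
    mask = (1 << n) - 1
    total = a + b
    c_out = total >> n
    return (total + c_out) & mask
```

For `n = 4`, `add_mod_2n_minus_1(9, 8, 4)` wraps the carry and returns 2, while `add_mod_2n_minus_1(7, 8, 4)` returns 15, the all-ones form of zero.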

Figure 2.2: General block diagram Modulo $2^n - 1$ Adder [4].

2.3 Analysis of Previous End-Around-Carry Adders

The end-around action defined previously can be obtained using different rules. Although the EAC adder has been used [39] and implemented on several microprocessors, very few details exist on its formulations and arithmetic algorithms in today's literature. Schwarz [40] provided explanations of some aspects of the EAC adder's algorithm as part of a fused multiply-add (FMA) unit. Shedletsky [41] defined the indeterminate behavior of EAC adders, and Liu et al. [42] presented a formal analysis of EAC adders.

Figure 2.3: Block diagram of the 128-bit binary adder [5].

Liu et al. [6] [42] extended the algorithm to make the adder independent, without being part of an FMA unit. The design mainly follows the algorithms of the EAC adder implemented in the IBM POWER6 microprocessor [5]. The additional logic units of the proposed adder ensure that the whole adder works independently. Another advantage is that it is easier to implement and test the adder in FPGA technology, which enables design space exploration. Figure 2.4 shows the architecture of the adder.

Figure 2.4: Architecture of the EAC adder [6].

EAC means that when subtracting two signed numbers that are in sign-magnitude format, the subtraction is implemented by the addition of the first operand with the Boolean complement of the second operand. For this addition, instead of setting a carry into the least significant digit, the carry out of the most significant digit is taken as the carry in. This ensures that the result of the addition is always a positive magnitude result, and preferably only one operand needs to be conditionally complemented. Thus, an EAC adder performs addition like other regular adders, and subtraction using the end-around carry to ensure the result is positive. The adder shown in Figure 2.4 should satisfy the following conditions: 1) when $x.\hat{s} = \tilde{y}.\hat{s}$, the adder should do addition, and we have $s.\hat{s} = x.\hat{s}$ and $s.s = x.x + \tilde{y}.y$. 2) when $x.\hat{s} \ne \tilde{y}.\hat{s}$, the

adder should do subtraction. If $x.x \ge \tilde{y}.y$, then $s.\hat{s} = x.\hat{s}$ and $s.s = x.x - \tilde{y}.y$; if $x.x < \tilde{y}.y$, then $s.\hat{s} = \tilde{y}.\hat{s}$ and $s.s = \tilde{y}.y - x.x$.

The subtraction operation can be described as follows:

1) Determining which operand is bigger. After a subtraction operation, if the result is positive, operand $x$ is bigger; otherwise $y$ is bigger. Since $x.x - \tilde{y}.y = x.x + \overline{\tilde{y}.y} + 1 = x.x + 2^n - \tilde{y}.y$, the carry out of $x.x + \overline{\tilde{y}.y} + 1$ will be 1 when $x$ is bigger. Therefore, $c_{out}$ results as 1 if $x$ is bigger and 0 if $y$ is bigger. Hence the sum equation can be written as $x.x + \overline{\tilde{y}.y} + c_{out}$.

2) When $y$ is bigger, $c_{out} = 0$ and the subtraction can be written as $s.s = \tilde{y}.y - x.x = -(x.x - \tilde{y}.y) = -(x.x + \overline{\tilde{y}.y} + 1) = \overline{(x.x + \overline{\tilde{y}.y} + 0)} = \overline{(x.x + \overline{\tilde{y}.y})}$.

3) The $c_{out}$ is used to select the correct result:

$$s.s = \begin{cases} x.x + \overline{\tilde{y}.y} + c_{out} & \text{if } c_{out} = 1 \\ \overline{x.x + \overline{\tilde{y}.y} + c_{out}} & \text{if } c_{out} = 0 \end{cases}$$

In order to implement addition and subtraction in one adder, $y$ should be complemented conditionally. The effective operation can be defined as

$$O_s = x.\hat{s} \oplus \tilde{y}.\hat{s} \tag{2.3}$$

$$y_t = \begin{cases} \tilde{y}.y & \text{if } O_s = 0 \\ \overline{\tilde{y}.y} & \text{if } O_s = 1 \end{cases}$$

The sign of the result is determined by the sign logic:

$$s.\hat{s} = \begin{cases} x.\hat{s} & \text{if } c_{out} = 1 \\ \tilde{y}.\hat{s} & \text{if } c_{out} = 0 \end{cases}$$

Generally, the implementation of an adder/subtracter is achieved using two different adders, one for addition and one for subtraction. After the results for both operations are calculated, the final result is selected with a multiplexer, as given in Figure 2.5.
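The conditions and the three subtraction steps above can be collected into a small behavioral model. This is a hedged Python sketch, not the adder of [6]: the name `eac_add_sub` and the returned overflow bit are illustrative, and the all-ones case (equal magnitudes) is resolved toward $c_{out} = 1$ as an assumption, so that the result sign follows the $x.x \ge \tilde{y}.y$ condition.

```python
def eac_add_sub(x_sign, x_mag, y_sign, y_mag, n):
    """Sign-magnitude add/subtract with an end-around carry.

    Returns (sign, n-bit magnitude, overflow carry of an effective addition).
    """
    mask = (1 << n) - 1
    o_s = x_sign ^ y_sign                  # Eqn. 2.3: effective operation
    if o_s == 0:
        # Effective addition: y is not complemented, no end-around carry.
        total = x_mag + y_mag
        return x_sign, total & mask, total >> n
    # Effective subtraction: add the complemented magnitude of y.
    raw = x_mag + (~y_mag & mask)
    # Assumption: when every bit propagates (raw is all ones) the carry
    # is taken as 1, so equal magnitudes give zero with the sign of x.
    c_out = (raw >> n) | int(raw == mask)
    if c_out:
        return x_sign, (raw + c_out) & mask, 0   # x - y via end-around carry
    return y_sign, ~raw & mask, 0                # y - x via complement
```

For example, with `n = 8`, subtracting magnitude 5 from magnitude 3 (opposite signs) takes the complement branch and yields magnitude 2 with the sign of $y$, without a second addition.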

Figure 2.5: Architecture of the EAC adder [6].

2.4 Carry-lookahead EAC Logic Unit

The use of an EAC unit helps to implement an adder/subtracter using only one adder. The algorithm for an EAC unit with four carry bits can be found in [40]. The most significant bit is labeled as 0. The group carries for a CLA adder are defined as:

$$\begin{aligned} C_0 &= G_0 + P_0 G_1 + P_0 P_1 G_2 + P_0 P_1 P_2 G_3 + P_0 P_1 P_2 P_3 C_{in} \\ C_1 &= G_1 + P_1 G_2 + P_1 P_2 G_3 + P_1 P_2 P_3 C_{in} \\ C_2 &= G_2 + P_2 G_3 + P_2 P_3 C_{in} \\ C_3 &= G_3 + P_3 C_{in} \end{aligned} \tag{2.4}$$

If the carry out $C_0$ is fed to the carry in, EAC operation is achieved as in Equation 2.5.

$$\begin{aligned} C_0 &= G_0 + P_0 G_1 + P_0 P_1 G_2 + P_0 P_1 P_2 G_3 + P_0 P_1 P_2 P_3 \\ C_1 &= G_1 + P_1 G_2 + P_1 P_2 G_3 + P_1 P_2 P_3 G_0 + P_0 P_1 P_2 P_3 \\ C_2 &= G_2 + P_2 G_3 + P_2 P_3 G_0 + P_2 P_3 P_0 G_1 + P_0 P_1 P_2 P_3 \\ C_3 &= G_3 + P_3 G_0 + P_3 P_0 G_1 + P_3 P_0 P_1 G_2 + P_0 P_1 P_2 P_3 \end{aligned} \tag{2.5}$$

The combination of the carries in this way results in a carry chain for every group that is as long as the width of the adder. This wrapping of the carries is correct for subtraction but is not correct for addition. To make the adder selectable for addition and subtraction, the $P_3$ term needs to be modified. An extra bit is combined with the least significant bit of the adder to assert the carry propagation when the effective operation is subtraction. This bit can be integrated

into $P_3$ to make $P_3 = 0$ for an effective operation of addition, as defined in Eqn 2.6.
\[
P^t_3 = \begin{cases} P_3, & O_s = 1 \\ 0, & O_s = 0 \end{cases} \tag{2.6}
\]

2.5 Conclusion

In this chapter, detailed information about binary modulo adders and previous work on EAC adders is provided. It is shown that feeding $c_{out}$ back into a carry-increment stage provides the EAC operation, as in $2^n - 1$ adders. The designs in [5] and [12] employ the adder as a part of the FMA flow. However, by adding extra logic, Liu [6] proposes a standalone adder. In the next chapter, a detailed analysis of the proposed parallel prefix EAC adder will be given. The design promises an alternative to the previous work while providing a wider design space.
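The wrapped carries of Equation 2.5 and the gating of Eqn 2.6 can be checked numerically. The sketch below is only an illustration with hypothetical function names (`cla_carries`, `eac_carries`). It evaluates Equation 2.4 as a ripple over the four groups and Equation 2.5 term by term, with index 0 as the most significant group as in the text.

```python
from itertools import product


def cla_carries(G, P, c_in):
    """Group carries of Equation 2.4 (index 0 is the most significant group)."""
    carries = [0] * 4
    carry = c_in
    for i in (3, 2, 1, 0):                 # least significant group first
        carry = G[i] | (P[i] & carry)
        carries[i] = carry
    return carries


def eac_carries(G, P, o_s):
    """Wrapped carries of Equation 2.5 using the gated propagate of Eqn 2.6."""
    p0, p1, p2 = P[0], P[1], P[2]
    p3 = P[3] & o_s                        # P3 is forced to 0 for addition
    g0, g1, g2, g3 = G
    all_p = p0 & p1 & p2 & p3
    c0 = g0 | (p0 & g1) | (p0 & p1 & g2) | (p0 & p1 & p2 & g3) | all_p
    c1 = g1 | (p1 & g2) | (p1 & p2 & g3) | (p1 & p2 & p3 & g0) | all_p
    c2 = g2 | (p2 & g3) | (p2 & p3 & g0) | (p2 & p3 & p0 & g1) | all_p
    c3 = g3 | (p3 & g0) | (p3 & p0 & g1) | (p3 & p0 & p1 & g2) | all_p
    return [c0, c1, c2, c3]


# Compare both forms over every combination of group generate/propagate bits.
for bits in product((0, 1), repeat=8):
    G, P = bits[:4], bits[4:]
    wrapped = eac_carries(G, P, o_s=1)
    # Subtraction: same as Equation 2.4 with C_in driven by the carry out C_0.
    assert wrapped == cla_carries(G, P, wrapped[0])
    # Addition: the gated form reduces to a plain CLA with C_in = 0.
    assert eac_carries(G, P, o_s=0) == cla_carries(G, P, 0)
```

With `o_s = 0` every term containing $P_3$ vanishes, which is why the same logic behaves as a regular adder.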

Chapter 3

Modified Parallel Prefix EAC Adder

3.1 Introduction

Previously proposed adders were analyzed in section 2.3. Since the adder designed in [5] did not include implementation details about how the blocks were built internally, [6] made a detailed analysis of how the first stage adder and EAC blocks are defined. They also extended the work to a complete adder without an FMA unit. Their implementation was directed to FPGA. Although their analysis answered many of the questions about the previously designed work, a number of parts of the adder still need to be analyzed. In this chapter, we propose a modified adder which uses a modified parallel prefix $2^n - 1$ adder block as the EAC logic, together with conditional sum blocks, for flexibility among different adder architectures and lower total propagation delay.

3.2 Proposed Adder

Figure 3.1 shows the architecture of the proposed adder. The first level includes eight 16-bit blocks of Kogge-Stone prefix-2 adders for 128-bit inputs, and the second level includes a modified $2^n - 1$ parallel prefix Kogge-Stone adder. In order to design a standalone adder, the input complement, add/sub and sign blocks are included as suggested in [6].


Figure 3.1: Architecture of the modified EAC Adder.

3.2.1 The 16bit blocks in EAC adder

Figure 3.2 shows the 16-bit conditional sum blocks. The black and gray blocks are the same as Fig The dashed lines correspond to the half-sum blocks, $h_i = a_i \oplus b_i$. GG and GP refer to the group generate and propagate signals. As explained in [6], when the carry in to the adder block is assumed to be 0, it is possible to reduce the complexity of the adder. When the carry in is intended to propagate, the generate term in Equation 1.6 can be extended as $G_{i:k} = G_{i:j} + P_{i:j} c_{in}$. Thus, for $c_{in} = 0$ it results in $G_{i:k} = G_{i:j}$. It can be seen that the generated group carry for the corresponding bit position need not be merged with the previous carry. Thus, it is possible to reduce the number of black cells and use gray cells instead. Namely, if there is no carry merge operation in the next level, the corresponding level can be terminated with a gray cell. Figure 1.5 shows 16-bit Kogge-Stone tree with gray

and black cells. However, since our design includes a second stage, we need both the generate and propagate terms. This necessity increases the use of black cells instead of gray cells.

Figure 3.2: 16bit conditional sum blocks.

The adder in [6] is not clear about how the input carry is merged in the first level of adders. A discussion of the most efficient approaches for the traditional carries can be found in [43]. The carry-in bit can be included either by adding a fast carry increment stage or by treating $c_{in}$ as an extra bit of the preprocessing stage of the adder. The first case is shown in Figure 3.3. The second case can be derived by setting $g_{-1} = c_{in}$ according to Equation 1.6. Both schemes increase the complexity in order to solve the carry incorporation problem. Additionally, [6] and [5] do not explain how the carry is propagated after the 8-bit blocks. Although a similar structure is used in [12], the adder architecture is different and only one set of generate-propagate terms is calculated and propagated in the first level. Since [6] uses two different adders, one for each of the conditions $c_{in} = 0$ and $c_{in} = 1$, two sets of carries need to be selected before


leaving the first level. This operation makes the calculation more complicated. Thus, we use the architecture in Figure 3.2 to calculate one set of generate-propagate group terms for the condition $c_{in} = 0$. In the next stage, the conditional carry for $c_{in} = 1$ is calculated with the simple equation $G_{i:k} = G_{i:j} + P_{i:j}$, which is simply an OR gate.

Figure 3.3: $c_{in}$ merge with fast carry link [7].

Figure 3.4: $c_{in}$ merge with extra bit [7].

As a final step the sum for each carry condition, $c_{in} = 0$ and $c_{in} = 1$, is calculated to be

selected according to the result of the second level of carry calculation.

3.2.2 Parallel Prefix $2^n - 1$ EAC Block

As explained in chapter 2, $2^n - 1$ adders can be used for EAC calculation. In this section, we extend the use of $2^n - 1$ adders to an adder/subtractor. Figure 3.5 shows the modified 8-bit Kogge-Stone EAC block.

Figure 3.5: Modified 8bit Kogge-Stone EAC block.

The block first takes the group generate and propagate terms that are calculated in the first level. Since there are eight 16-bit adders in the first level, the carry for the whole adder must be calculated first. The 8 carries are merged in the Kogge-Stone adder. In order to achieve the subtraction operation, the final carry at the most significant bit position must be merged into the less significant positions. This step is also called the carry increment stage. The effective operation signal $O_s$ is defined as in Eqn 2.3. When the operation is subtraction, $O_s = 1$, the AND gate propagates the most significant carry; when the operation is addition, $O_s = 0$, the most significant carry is blocked and the unit works as a regular adder. $c^t_8 = O_s \cdot G_{127:0}$ is defined as the end-around carry in Eqn 3.1.
\[
c^t_8 = \begin{cases} G_{127:0}, & O_s = 1 \\ 0, & O_s = 0 \end{cases} \tag{3.1}
\]
In the literature [37] [4] [44] [45] [15], the carry-increment stage has been widely used for $2^n - 1$ operation. Especially for wide adders, $N \ge 64$, the final carry needs to travel a very long path to arrive at the least significant position, actually twice the length from the least significant

position to least significant position again. It is shown in [40] that the carry needs to travel only the total length in an EAC operation. That result motivated the use of CLA logic as the EAC logic. In our adder, we use a second stage for the EAC calculation instead of one long parallel prefix propagation followed by a carry increment stage. Namely, for a 128-bit EAC adder the shortest stage count is $n = 7$ (from $2^n = 128$) plus one stage of 127 carry merge terms. In our adder, we have 4 stages for the first level, 3 stages for the second level and 1 stage of 8 carry merge terms, which keeps the total stage count the same with a much lower count of carry merge terms. Decreasing the number of carry merge terms decreases the length of the carry path as well as the delay.

3.3 Implementation and Validation

After the adder architecture is finalized, both the modified and the previously proposed [6] adders are designed in Verilog. A test fixture is created to validate the adder with corner cases, such as the 0-to-1 crossover and carry propagation from the 16th bit to the next bit. A check procedure is defined to flag any difference between the output of the adder and the calculated result. The tests show that both adders have the same behavior, and thus the modified EAC adder is verified.

3.4 Conclusion

In this chapter the proposed parallel prefix EAC adder is analyzed. It is shown that using conditional sum blocks solves the carry incorporation problem existing in the previous works. Moreover, the EAC logic in [6] is redesigned with a modified $2^n - 1$ adder to provide a wider design space and a shorter carry path. As a final step the adder is implemented in Verilog and validated through simulation. In the next chapter, a simple method for transistor level realization of the critical path is discussed.
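The two-level organization described in this chapter can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the Verilog of the thesis. The names `block_terms`, `kogge_stone` and `eac_adder_128` are hypothetical, each 16-bit block is modeled with integer addition rather than a gate-level prefix tree, block 0 is taken as the least significant block, and the operand `b` is assumed to be already conditionally complemented.

```python
N_BLOCKS, BLOCK_BITS = 8, 16
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def block_terms(a, b):
    """First level: conditional sums and group terms of one 16-bit block."""
    s0 = a + b                           # sum assuming c_in = 0
    s1 = a + b + 1                       # sum assuming c_in = 1
    gg = s0 >> BLOCK_BITS                # group generate (carry out for c_in = 0)
    gp = int((a ^ b) == BLOCK_MASK)      # group propagate (every bit propagates)
    return s0 & BLOCK_MASK, s1 & BLOCK_MASK, gg, gp


def kogge_stone(g, p):
    """Second level: prefix merge returning (G_{i:0}, P_{i:0}) for each block."""
    g, p = list(g), list(p)
    dist = 1
    while dist < len(g):                 # 3 stages for 8 blocks
        ng, npp = list(g), list(p)
        for i in range(dist, len(g)):
            ng[i] = g[i] | (p[i] & g[i - dist])
            npp[i] = p[i] & p[i - dist]
        g, p = ng, npp
        dist *= 2
    return g, p


def eac_adder_128(a, b, o_s):
    """Return (sum, end_around_carry); the carry wraps only when o_s = 1."""
    terms = []
    for k in range(N_BLOCKS):
        ak = (a >> (k * BLOCK_BITS)) & BLOCK_MASK
        bk = (b >> (k * BLOCK_BITS)) & BLOCK_MASK
        terms.append(block_terms(ak, bk))
    G, P = kogge_stone([t[2] for t in terms], [t[3] for t in terms])
    c8 = o_s & G[-1]                     # Eqn 3.1: gated most significant carry
    result = 0
    for k, (s0, s1, _, _) in enumerate(terms):
        # Carry increment stage: merge the end-around carry into each block.
        c_in = c8 if k == 0 else G[k - 1] | (P[k - 1] & c8)
        result |= (s1 if c_in else s0) << (k * BLOCK_BITS)   # sum select mux
    return result, c8
```

The model mirrors the stage count discussed above: the block sums stand in for the 4 first-level stages, `kogge_stone` runs 3 merge stages, and the final loop is the single carry increment stage with 8 merge terms.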

Chapter 4

Critical Path Analysis

4.1 Path Identification

The critical path is identified as the path from the sign logic to the sum at the output, as shown in Figure 4.1 for the proposed adder. In order to conditionally complement the second input, the effective operation $O_s$ is calculated from the sign inputs sa and sb. After $B^t$ is calculated, the propagate and generate terms are calculated using Eqn The half-sum block, $h_i = a_i \oplus b_i$, is not on the critical path because the half-sum is needed only for the sum calculation, not for carry propagation. After the 4 stages of the 16-bit adder, the carry is merged with the other group terms in the 8-bit EAC block. The EAC block consists of 3 carry merge stages and 1 carry increment stage. We only take into account the carry merge operation $G_{i:k} = G_{i:j} + P_{i:j} G_{j-1:k}$, which is simply an AND-OR-INVERT (AOI) gate, because the group propagate term is not on the critical path. The most significant carry should either be blocked or propagated. Thus, the $O_s$-controlled AND gate is on the critical path. Since the end-around carry term, the most significant bit in the EAC block, is calculated before the carry increment stage, $c_8$ is not on the critical path. Choosing one of the carries between the last and first bits gives a good estimate of the delay of the operation. At the next stage, one of the sums needs to be selected according to the carry from the EAC block as the output of each 16-bit adder. Therefore, one mux is on the critical path. As a final stage, the output needs to be conditionally complemented according to the end-around carry and the operation. The calculation of the condition is not on

critical path because it can be calculated during the first MUX stage. Thus, we end up with one XOR stage for complementing.

Figure 4.1: Critical path of the modified EAC adder.

4.2 Path Design

After the critical path is identified, the gate level model is created as in Figure 4.2. All the inverters on the critical path are removed to decrease the number of logic levels by applying the inverting property of CMOS to consecutive levels. It can be observed that the stages of the carry merge operation alternate between AOI and OAI gates. The output load is one minimum size inverter.

Figure 4.2: Gate level design of critical path.

Gate level design is just a logical expression of the circuit. Therefore, in order to make accurate simulations, the gates should be mapped to transistor level models. In this phase, we created a library of custom designed gates instead of using a standard cell library, for higher performance.

4.3 Transistor Level Design and Sizing

Full custom design generally outperforms standard cell design in terms of power consumption, area and propagation delay. However, full custom design is hard to automate and the design process takes much longer than a design with standard cells. A number of techniques exist for designing at the transistor level.

4.3.1 Logic Level Minimization

Before designing any logic circuit, it is beneficial to exploit logic level minimization. In our design we tried to combine the pg generation and the first level of AOI as suggested in [8]. This approach allows a single stage to be removed from the Weinberger adder realization. However, this method increases the stack count, and since our blocks were designed with close to minimum width transistors, we did not observe any improvement in the delay. Figure 4.3 shows the reduced first stage.

Figure 4.3: Reduced first stage in Weinberger recursion adder [8].

As another step, Ling recursion is applied to the first stage. However, as suggested in [15], Ling carries must be converted to real carries by ANDing the most significant carry with the corresponding propagate term, $c_{i+1} = d_i \cdot p_i$, before leaving the adder block. This calculation is on the critical path. Therefore, Ling recursion is not used in the final design. As a final step, we combined the $O_s$-controlled AND gate with the previous OAI gate. However, we observed that increasing the complexity of the gate did not provide much delay improvement because the gates are already sized close to minimum.

4.3.2 Late arriving signal exploitation

One of the most important considerations in transistor level design is to connect the late arriving signals close to the output. For the AOI gate in Figure 4.4, assuming A is the late arriving signal, the arrangement in Figure 4.5 provides a better delay.

Figure 4.4: AOI without late arriving exploitation.

Figure 4.5: AOI with late arriving exploitation.

4.3.3 Logical Effort

Sutherland et al. described a very useful concept of Logical Effort [46]. The method of logical effort is founded on a simple model of the delay through a single MOS logic gate. The model describes delays caused by the capacitive load that the logic gate drives and by the topology of the logic gate. Clearly, as the load increases, the delay increases, but delay also depends on the logic function of the gate. Inverters, the simplest logic gates, drive loads best and are often used as amplifiers to drive large capacitances. Logic gates that compute other functions require more transistors, some of which are connected in series, making them poorer than inverters at driving current. A NAND gate has more delay than an inverter with similar transistor sizes that drives the same load. The method of logical effort quantifies these effects to simplify delay analysis for individual logic gates and multistage logic networks. The logical effort of a logic gate tells how much worse it is at producing output current than is an inverter, given that each of its inputs may present the same input capacitance as the

inverter. Reduced output current means slower operation, and thus the logical effort number for a logic gate tells how much more slowly it will drive a load than would an inverter. Equivalently, logical effort is how much more input capacitance a gate must present in order to deliver the same output current as an inverter.

Logical Effort for Multistage Networks

The method of logical effort reveals the best number of stages in a multistage network and how to obtain the least overall delay by balancing the delay among the stages. The notions of logical and electrical effort generalize easily from individual gates to multistage paths. The logical effort along a path compounds by multiplying the logical efforts of all the logic gates along the path. The symbol G denotes the path logical effort, so that it is distinguished from g, the logical effort of a single gate in the path. The subscript i indexes the logic stages along the path.
\[
G = \prod g_i \tag{4.1}
\]
The electrical effort along a path through a network is simply the ratio of the capacitance that loads the last logic gate in the path to the input capacitance of the first gate in the path. The uppercase symbol H indicates the electrical effort along a path. In this case, $C_{in}$ and $C_{out}$ refer to the input and output capacitances of the path.
\[
H = C_{out} / C_{in} \tag{4.2}
\]
Branching effort b is used to account for fanout within a network. When fanout occurs within a logic network, some of the available drive current is directed along the analyzed path and some is directed off the path. The branching effort b at the output of a logic gate is defined as
\[
b = (C_{onpath} + C_{offpath}) / C_{onpath} \tag{4.3}
\]
Note that if the path does not branch, the branching effort is one. The branching effort along an entire path, B, is the product of the branching effort at each of the stages along the

path.
\[
B = \prod b_i \tag{4.4}
\]
Utilizing the definitions of logical, electrical and branching effort along a path, the path effort is defined as F. Note that the path branching and electrical efforts are related to the electrical effort of each stage.
\[
F = G B H \tag{4.5}
\]
Although it is not a direct measure of delay along the path, the path effort holds the key to minimizing the delay. Observe that the path effort depends only on the circuit topology and loading and not upon the sizes of the transistors used in the logic gates embedded within the network. The path effort is related to the minimum achievable delay along the path. Only a little work is needed to find the best number of stages and the proper transistor sizes to realize the minimum delay. Optimizing the design of an N-stage logic network proceeds from a very simple principle: the path delay is least when each stage in the path bears the same stage effort. This minimum delay is achieved when the stage effort is
\[
f = g_i h_i = F^{1/N} \tag{4.6}
\]
To equalize the effort borne by each stage on a path, and therefore achieve the minimum delay along the path, appropriate transistor sizes for each stage of logic along the path must be chosen. Each logic stage should be designed with electrical effort
\[
h_i = F^{1/N} / g_i \tag{4.7}
\]
From this relationship, it is straightforward to determine the transistor sizes of gates along a path: start at the end of the path and work backward, applying the capacitance transformation
\[
C_{in,i} = (g_i C_{out,i}) / f \tag{4.8}
\]
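Equations 4.1 to 4.8 can be turned into a short sizing routine. The sketch below is only an illustration: the function name `size_path` and the three-stage example path (inverter, 2-input NAND with the textbook logical effort 4/3, inverter) are assumptions chosen for round numbers, not values from the spreadsheet in Figure 4.7. Branching is folded into the backward step, which reduces to Eqn 4.8 when every $b_i = 1$.

```python
import math


def size_path(g, b, c_in, c_load):
    """Equal-stage-effort sizing of an N-stage path.

    g, b    : per-stage logical and branching efforts
    c_in    : input capacitance of the first gate
    c_load  : capacitance loading the last gate
    Returns (F, f, input capacitance of each stage).
    """
    n = len(g)
    path_effort = math.prod(g) * math.prod(b) * (c_load / c_in)   # F = G B H
    stage_effort = path_effort ** (1.0 / n)                        # f = F^(1/N)

    caps = [0.0] * n
    c_out = c_load
    for i in reversed(range(n)):          # start at the end, work backward
        caps[i] = g[i] * b[i] * c_out / stage_effort
        c_out = caps[i]
    return path_effort, stage_effort, caps


# Hypothetical path: inverter -> NAND2 -> inverter, driving 48x the input load.
F, f, caps = size_path(g=[1.0, 4.0 / 3.0, 1.0], b=[1.0, 1.0, 1.0],
                       c_in=1.0, c_load=48.0)
print(F, f, caps)   # F is about 64, f about 4, caps about [1, 4, 12]
```

The first entry of `caps` coming back to the specified `c_in` is a useful sanity check that the stage efforts were distributed consistently.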

The equation determines the input capacitance of each gate, which can then be distributed appropriately among the transistors connected to the input.

4.3.4 Design with Helpers

As described in [47], in most adders the stage effort is generally constant if wire capacitance is neglected. This means uniform gate sizes may be used throughout with little loss in performance, and a very regular layout is possible. However, adders such as Sklansky show an exponential fanout increase. In this case, the stage effort becomes high for those cells and, as a result, a bigger driver is needed. Therefore, Harris et al. [47] propose the concept of helpers. When the stage effort increases, it is good practice to duplicate driving cells in parallel to maintain a lower delay.

Figure 4.6: EAC logic with helper.

Using the concept described in [47], we duplicate the operation controlled NAND gate. That helps to decrease the fanout of the carry increment stage and, consequently, the size of the NAND gate.

4.4 Transistor Sizing

Logical effort has found wide interest in transistor sizing [48-51]. In our proposed adder, we used the concept to size the transistors in the critical path. After the sizes are determined, it is

possible to use these widths in the remaining blocks because other blocks which are not on the critical path will have the same loads and fan-outs within the same logic level. In order to equalize the rising and falling edges of the output signal, a ratio of $w_p / w_n = 2.5/1$ is used between PMOS and NMOS transistors. In the transistor level library, late arriving signal exploitation, helpers and logical effort are used to achieve the highest performance. In addition, the $O_s$-controlled AND gate needs to drive 7 OAI cells. If we use two helpers instead of one, as described in 4.3.4, each gate drives at most 4 OAI cells, and it is possible to decrease the gate size as well as the delay. In order to apply logical effort, a spreadsheet is created as in Figure 4.7. After the logical effort parameters are calculated, the width is distributed to each stage according to the stage coefficients. Since the gates are designed according to the 2.5/1 ratio, the stage coefficients show how large each gate is compared to an inverter.

Figure 4.7: Spreadsheet for Logical Effort Calculation.

After the appropriate transistor sizes are found, we performed simulations using HSPICE [52] and the FreePDK45 [53] [54] library. Figures 4.8 and 4.9 show the transistor level schematic.

Figure 4.8: Transistor level schematic.

Figure 4.9: Transistor level schematic.

4.5 Simulation Results

During the simulation, a square wave signal with 150 ps rising and falling edges is applied. The delay between input and output is then measured. The power dissipation of the critical path is measured as an estimate of the power of one bit of the adder. Temperature (25 °C and 100 °C) and supply voltage $V_{dd}$ are swept to observe the optimal performance. It can be observed from Figures 4.10 to 4.13 that the critical path shows 141.8 ps delay and 18.9 µW power dissipation for one bit when 10% switching activity is considered [3], which makes 2.42 mW for 128 bits.

Figure 4.10: Delay vs. $V_{dd}$ at 25 °C.
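The 128-bit power figure follows from scaling the per-bit measurement, under the assumption stated above that every bit slice dissipates the same power as the measured critical path slice:

```latex
\[
P_{128} \approx 128 \times 18.9\,\mu\mathrm{W} = 2419.2\,\mu\mathrm{W} \approx 2.42\,\mathrm{mW}
\]
```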

Figure 4.11: Power vs. $V_{dd}$ at 25 °C.

Figure 4.12: Delay vs. $V_{dd}$ at 100 °C.

Table 4.1: Delay and power dissipation values with respect to $V_{dd}$ and temperature.

| $V_{dd}$ | Delay (ps) at 25 °C | Delay (ps) at 100 °C | Power (µW) at 25 °C | Power (µW) at 100 °C |

Figure 4.13: Power vs. $V_{dd}$ at 100 °C.

Table 4.1 shows the trade-off between temperature and supply voltage for the proposed adder. When $V_{dd}$ is increased from 1.0 V to 1.2 V, delay decreases by 12% and power increases by 79%; when it is decreased from 1.0 V to 0.8 V, delay increases by 26% and power decreases by 45%. When the temperature changes from 25 °C to 100 °C, delay increases by 51% and power increases by 9.2%. For low power applications, lowering the supply voltage therefore saves proportionally more power than it costs in delay. For high speed applications, however, more power must be provided. Since a higher temperature has an adverse effect on both power and delay, the circuit should remain at a lower temperature for higher performance.
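One way to read these percentages together is to multiply the relative delay and power changes, which gives the relative power-delay product. This derived metric is not reported in the text and is added here only as an illustration; the function name `relative_pdp` is hypothetical and the inputs are the percentages quoted above.

```python
def relative_pdp(delay_change_pct, power_change_pct):
    """Power-delay product relative to the 1.0 V, 25 C operating point."""
    delay = 1.0 + delay_change_pct / 100.0
    power = 1.0 + power_change_pct / 100.0
    return delay * power


print(relative_pdp(-12, +79))    # 1.0 V -> 1.2 V, about 1.58
print(relative_pdp(+26, -45))    # 1.0 V -> 0.8 V, about 0.69
print(relative_pdp(+51, +9.2))   # 25 C -> 100 C, about 1.65
```

Under this reading, only the move to 0.8 V lowers the product below the nominal point, which is consistent with the low power recommendation in the text.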

4.6 Conclusion

In this chapter the methodology for transistor level critical path optimization is provided. First, the critical path of the adder is identified, taking into account the logic levels. Transistors are sized with the Logical Effort concept [46]. Architecture level and transistor level optimizations are applied according to the full custom design rules of the FreePDK45 library [54]. As a final step, transient simulations are carried out to calculate the delay and power dissipation. It is observed that the method enables a fast comparison of the critical path delay and power dissipation among different adder architectures. In the next chapter, a methodology for fast area estimation and layout entry is given. A datapath library is generated by exploiting the regularity of the proposed adder.

Chapter 5

Datapath Library

5.1 Introduction

Datapaths have been a research topic in recent years as an approach to make layout entry faster [9] [10] [55] [56]. Informally, datapaths are circuits where the same or similar logic is applied to several bits [57]. A datapath stack [58] is made up of many custom word lines such as registers, ALUs, adders, shifters, multiplexers and buffers that form the data flow of the functional units. Datapaths are characterized by a highly regular layout structure. A typical datapath floorplan consists of an array of horizontally oriented words of identical bit cells, called datapath cells, and vertically oriented bit slices, as shown in Figure 5.1. Since each bit slice is replicated a number of times (determined by the datapath width) with very little or no modification, layout generation of such regular structures reduces to a careful design, often handcrafted, of individual datapath cells. Figure 5.2 shows the regular placement and routing inside a datapath slice. Figure 5.3 shows the schematic layout of a datapath and the corresponding bit slice [55].

Figure 5.1: Global floorplan of a datapath [9].

Figure 5.2: Regularity placement and routing datapath circuit [10].

Figure 5.3: Schematic layout of datapath and detailed view of bitcell [11].

Datapath circuits are typically organized in horizontal rows of words representing the same functional block and vertical bit slices, delimited by vertically running power and ground rails. The layout of the datapath cell of bit slice i is identical to that of bit slice (i+1), but mirrored along the vertical axis so that adjacent bit slices can share a common power or ground rail.

Figure 5.4: Representation of a datapath cell [9].

The width of the bit slice, also known as the pitch, is fixed; it determines the width of all the datapath cells, as outlined in Figure 5.4. Power and ground ($V_{DD}$/$V_{SS}$) supply rails generally delimit the pitch. Signal nets are connected to the datapath cell components by means of bristles. Vertical bristles, or data lines, provide wiring between different datapath cells within the same bit slice. They run in parallel to the power rails. Horizontal bristles, or control lines, provide wiring between datapath cells of different bit slices. Control lines span the width of the datapath and run perpendicular to the power rails. Since adders exploit very regular structures, they can be designed as datapaths.

5.2 Concepts in Full Custom Design

Transistor chaining and device merging. Transistor chaining is a widely used technique to improve both the area and the performance of datapath cells. Several transistors can be chained together by combining their diffusion areas in order to reduce the diffusion capacitance. Diffusion sharing applied to simple logic gates in the same datapath cell is known as device merging.

Transistor folding. Transistor folding is another popular technique aimed at minimizing area


and improving performance of custom designs. Folding changes the aspect ratio of the component while maintaining the required device size (W/L ratio). By performing folding with different numbers of fingers (poly gates), different component instances can be created for the placement phase.

Intracell sharing. Two component areas (diffusion regions or poly gates) belonging to components from adjacent bit slices can be merged if they share the same global net, such as a power line, control line or clock signal. In a typical organization of a datapath, adjacent bit slices are identical copies of each other, reflected with respect to the vertical boundary line. In this case the components can be pushed under the boundary line (ground or power rail) to create a more compact layout.

5.3 Datapath Design

The logic gates are designed using the concepts defined in sections 5.2 and 4.3. Figures 5.5(a) and 5.5(b) show the design of the basic AOI and MUX2 cells according to these rules. The AOI cell occupies 0.77 µm × 1.5 µm and the MUX2 cell 0.97 µm × 1.5 µm.

(a) AOI Layout (b) MUX2 Layout

Figure 5.5: Designed basic cells.

The longest cell according to the values from Figure 4.7 has a 1.5 µm height. Thus, the other


cells are designed according to the longest cell to maintain regularity and to exploit the datapath structure. Figure 5.7 shows the bitslice of the 16-bit Kogge-Stone adder. Using small cells allowed us to reduce the height of the bit slice. The bitslice includes the sign logic and the conditional sum calculation. It can be seen from the datapath that the carry merge and propagate cells alternate from one stage to the next due to the intrinsic inversion of CMOS gates.

(a) EAC bitslice (b) EAC last bit

Figure 5.6: Bit slices of the blocks in the adder.

Figure 5.7: Bit slice of 16bit Kogge-Stone adder.


Using the same approach as in Figure 5.7, bitslices for the last and regular bits of the parallel prefix EAC block can be designed. Since the last bit has an irregular layout, it is designed separately. Figures 5.6(a) and 5.6(b) show the designed bitslices for the EAC block.

5.4 Layout Design

Using the datapath designs, it is straightforward to design the whole adder. Replicating the bitslice designed in Figure 5.7 gives the layout of the 16-bit Kogge-Stone adder, shown in Figure 5.8(a). The 16-bit adder and the EAC block are sized 13.42 µm × 21.98 µm and 5.18 µm × 11.06 µm respectively. As a final step, using the designed blocks, it is possible to estimate the total area of the 128-bit EAC adder. Two different schemes are considered. Figure 5.9 shows a thinner but wider layout to be used as a part of a larger datapath. Figure 5.10 shows a thicker but more compact layout, in which half of the adder is stacked on top of the other. The first and the second are sized 20.8 µm × 175 µm and 35.8 µm × 87.5 µm respectively.

Figure 5.9: Wide layout.

Figure 5.10: Stacked compact layout.
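The footprints implied by these dimensions can be computed directly. The short sketch below uses only the dimensions quoted above; the helper name `footprint` is an assumption made for the example.

```python
def footprint(width_um, length_um):
    """Rectangular footprint in square microns."""
    return width_um * length_um


adder16 = footprint(13.42, 21.98)    # one 16-bit Kogge-Stone block
eac_block = footprint(5.18, 11.06)   # parallel prefix EAC block
wide = footprint(20.8, 175.0)        # layout of Figure 5.9
stacked = footprint(35.8, 87.5)      # layout of Figure 5.10

for name, area in [("16-bit adder", adder16), ("EAC block", eac_block),
                   ("wide layout", wide), ("stacked layout", stacked)]:
    print(f"{name}: {area:.1f} um^2")
```

The stacked layout comes out at about 3,132 sq. micron, which matches the area figure quoted for the proposed adder, while the wide layout is about 3,640 sq. micron.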

Table 5.1: Comparison of the proposed adder with the previous work.

| Design | Delay (ps) | Power (mW) | Area (µm²) | Technology |
| [5] | 200 (+29%) | | | 65 nm |
| [12] | 270 (+47%) | 20 (+88%) | 17,237 (+82%) | 65 nm |
| Proposed | 142 | 2.42 | 3,132 | 45 nm |

5.5 Results

The results of previous works and the proposed adder are given in Table 5.1. Among previous works, a pipelined 128-bit adder for a 5 GHz+ binary floating point unit is proposed in [5], and a 108-bit EAC adder is proposed in [12]. A direct comparison with either adder is not possible because they are implemented as a part of an FMA unit and are designed in 65 nm IBM SOI technology, which is not publicly available. The adder designed in [5] is a pipelined adder, which does not fall into the scope of this thesis. The adder in [6] is designed in FPGA technology, whereas our adder is a full custom design. Therefore, to the best of our knowledge, this is the first adder to use a small parallel prefix $2^n - 1$ EAC block with the full custom design methodology. It can be seen in Table 5.1 that the adder operates with a delay of 142 ps and 2.42 mW power dissipation in 3,132 µm² area at 25 °C with a 1 V supply. The proposed adder shows up to 47% improvement in delay, 81% improvement in area and 88% improvement in power dissipation in comparison with the previous works, assuming routing does not majorly contribute to delay and power.

5.6 Conclusion

In this final chapter a detailed analysis of layout entry through datapath design is provided. The comparison between the proposed and the previous work is included. The datapath library provides building blocks for early area estimation before designing the whole adder. Further, routing is simplified with the datapath since wires can be placed regularly in the bitslice. The cells are designed using full custom design rules to achieve a compact, area efficient layout. Two different layouts are generated by repeating the bitslices. We aimed to keep the longest carry path under 100 µm, assuming the routing does not majorly contribute to delay. The final result

shows that, in the stacked layout, the longest path remains under 45 µm and the proposed adder shows up to 47% improvement in delay, 81% improvement in area and 88% improvement in power dissipation in 45 nm technology at 25 °C and $V_{dd}$ = 1.0 V.
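The improvement percentages can be reproduced from the values in Table 5.1. The sketch below assumes each improvement is the reduction relative to the previous design's value; the function name `improvement` is hypothetical. Note that the compared designs use different technology nodes, as discussed in the results.

```python
def improvement(previous, proposed):
    """Reduction relative to the previous design, in percent."""
    return 100.0 * (previous - proposed) / previous


print(round(improvement(200, 142), 1))      # delay vs. [5]
print(round(improvement(270, 142), 1))      # delay vs. [12]
print(round(improvement(20, 2.42), 1))      # power vs. [12]
print(round(improvement(17237, 3132), 1))   # area vs. [12]
```

The area reduction evaluates to about 81.8%, which appears as 82% in Table 5.1 and as 81% in the text.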

(a) Layout of 16bit Kogge-Stone adder (b) Layout of parallel prefix EAC block

Figure 5.8: Layout of the blocks.
