A Reconfigurable Network Architecture For Parallel Prefix Counting

Size: px

Start display at page:

Download "A Reconfigurable Network Architecture For Parallel Prefix Counting"

Jonah Dixon
6 years ago
Views:

1 A Reconfigurable Network Architecture For Parallel Prefix Counting R. Lin S. Olariu Department of Computer Science Department of Computer Science SUNY at Geneseo Old Dominion University Geneseo, NY Norfolk, Va Abstract We propose an efficient reconfigurable parallel prefix counting network based on the recently-proposed technique of shift switching with domino logic, where the discharging signals can propagate along the switch chain asynchronously and produce a semaphore to indicate the end of each domino process. This results in a network that is fast and highly hardwarecompact. The proposed architecture for prefix counting N-1 bits involves a total of N+ cascaded simple switches and achieves a delay of (logn)( T d + ) + T add where T d is the time for discharging 4 cascaded precharged pass-transistors, T add is a full adder delay, and < T add. This is significantly faster than any design known to us for N Another important feature of the proposed architecture is that it requires a very simple control structure, driven solely by the semaphores. This significantly reduces the hardware complexity and fully utilizes the inherent speed of the computation. Index Terms Special purpose architectures, reconfigurable architectures, digital signal processing, VLSI design, domino logic, computer arithmetic. The work was supported, in part, by National Science Foundation under grants MIP , CCR and...

2 1. Introduction Due to the high degree of miniaturization possible today in VLSI technology, the size and complexity of designs that can be implemented in hardware has increased dramatically. This has made it technologically feasible and econnomically viable to develop high-speed, applications-specific, architectures featuring a spectacular performance increase over their general-purpose counterparts. The main goal of this work is to present such a special-purpose architecture for fast and hardware compact computation of parallel binary prefix counting. Our design is based on the recently-proposed technique of shift switching with domino logic [7]. Reconfigurable bus systems [1, 4, 5, 13, 15] with shift switches have been recently proposed to solve a number of fundamental computation problems [7-11]. In particular, it has been shown that shift switching on enhanced reconfigurable buses, that is, on buses containing shift switches endowed with buffers (inverters), is an elegant and powerful arithmetic design technique. It possesses the following unique features: (1) When special digital signals called state signals are propagating through a shift switch array, or enhanced reconfigurable buses, the desired modulo operations can be performed directly, simplifying the traditional approach to basic arithmetic computation; and (2) during the process the state signals are inverted, after passing through each shift switch, alternatively in two mutually inverted forms (n and p), minimizing the loads of transistors and maximizing the speeds of circuits. By apply precharged CMOS NP domino logic technique on pass-transistor-based shift switching buses, we can obtain a domino discharging chain with a new feature, that is, the discharging signals can propagate along the chain asynchronously (in the absence of a clocking scheme) and always produces a semaphore to indicate the end of the process[7]. This makes it possible to construct a network of processing elements to compute parallel binary prefix sums fast and in a highly hardware-compact fashion. The problem of parallel binary prefix counting is fundamental in parallel processing, for its solution is the principle ingredient in arithmetic expression evaluation, storage and data compaction, processor assignment, and routing, among many others [3, 8]. In this paper we propose a parallel prefix counting network of size N-1 (that is, with N-1 input bits, for N=2 2k =n*n). It is a two levels construction containing a total of N+ cascaded basic shift switches, with N pass-transistor-based shift switch plus trans-gate-based shift switches. Each pass-transistor-based switch is associated with a simple processing element (PE): for each row of switch units there is a row processing element (PE_r). In fact, both PEs and PE_rs, are simple control units, receiving 1

3 corresponding semaphores, and either sending control signals to tri-state drivers, or sending select signal to the MUX, or sending precharge/evaluation enable signals to start precharge/evaluation process. Thus, the entire network can be perceived as an application-specific circuit. The proposed network achieves a total delay of (logn)( T d + ) + T add where T d is the delay of discharging 4 cascaded precharged pass-transistors, and T add is an full adder delay, with <T add (see Section 5). Since it is not realistic to compute the prefix sums of a large array of binary numbers, say N 2 10, we assume that N<2 10. Under this assumption, the proposed architecture is significantly (at least 50%) faster than any known architectures, including tree of adders [3], or a processor with the same structure as ours but with each shift switch substituted by a half adder (half-adder-based processor for short), or our previously proposed architectures [8]. Another important feature of the proposed design is that it has a compact VLSI area, requiring an area about 0.5(N+ ) A h (almost linear in the input size), where A h is the area equivalent of a half adder. This area is significantly smaller than both the half adder-based processors (about 50%), and the tree of half adders processors, which require area of about (NlogN -1.5N+1)A h (note that the area for the control devices necessary in any such architecture is omitted). Another important feature of our architecture is that the N-prefix sums are computed and output row by row, with a simple PE (or control unit) per row. The control units runs asynchronously, driven solely by enable signals, i.e. semaphores produced at the end of the row s domino discharging. This greatly simplifies the hardware requirement, and the full inherent speed of the computation can be utilized. 2. State signals and shift switches Shift switches and state signals have been introduced in [6-11]. To make the paper self-contained, we now briefly review some of the relevant concepts. Definition 1. A state signal with integer value I, (0 I w-1; w 2 ), is represented by bit sequence b0, b1,..bw-1 (note that the order of bits here is either from left to right or bottom to top (Figure 1), with the unique bit u (either 0 or 1) in the I-th position. A state signal is said to be of n- (p-) type denoted by I (w) (I ) if u=0 (1). An S-signal may also be denoted by I (w) regardless of type or simply I (value only). In this paper we use only state signals with w=2, the discussion of general state signals can be found in [7]. The type of a state signals with w>2 is implied by its representation. However, for w=2 it must be defined 2

4 explicitly. The conversion between I (2) and I can be done easily through exchanging the positions of its two bits. Definition 2. A degenerate state signal with value I, (0 I w-1; w 2), is state signal I with the 0-th bit removed, for w=2 it may be denoted by I ( 2) (I ) or I (2), or simple I if no confusion arises (Figure 1). Clearly, the degenerate state signals I( ) and I(2) can also be seen as 1-bit regular signals I and respectively. (a) (b) (c) (d) Figure 1.State (and degenerate) signals (a) (or 1 (2)) (b) (0 (2) ) (c) (1 (2) ) (d) ( 0 (2) ). Definition 3. The inversion of a w-bit state signal I is represented by I with each of its bits inverted. The inversion of I( ) (or I (2) ) can be interpreted either as the same type state signal with value inverted, i.e. ( ) (or (2) ), or as a different type state signal with the same value, i.e. I (2) (I ( ) ), and this can be explicitly indicated if needed. In this paper, we alway interpret state signal inversions as the inversions of their types without changes in their values (which may be changed by shift switches as shown below). And, if necessary, we convert I( ) to I, I(2) to (see Figure 2). Figure 2 The conversions of regular and state signals and the inversions of state signals. Note that the bit sequence inside parentheses represents the corresponding state signal. It will soon be clear that the primary reason for employing state signals is that we can add small integers through shifting, instead of doing logic operations. This is particularly useful when a sequence of such additions must be performed. A shift switch processing 2-bit state signals is defined below: 3

5 Definition 4. Given state signals X (2), Y (2) called shift_in and control, respectively, a shift switch S<2, 1> is a trans-gate-based digital filter which realizes the function F (X (2), Y (2) )=(R (2), C (2) ), where outputs R (2) and C (2) are called shift_out and rotation signals, respectively, and the following hold: (a) X+Y = 2Q+R for some integers Q, R of 0 or 1; (b) C( ) =Q; Note that the shift switch definition is independent of implementation details. However, once a shift switch is implemented, the types of its inputs/outputs signals must be fixed. For example, the output C (2) can be implemented as C( ) or C(2) i.e. Q or.we may associate the quadruple, (X( ),Y( ),R(2), ), called specification, with a shift switch to describe its inputs/outputs signal types if necessary. The Shift switch S<2, 1> is also called basic (shift) switch. Figure 3. A shift switch S<2, 1> specified as: (X( ),Y( ),R(2), ) and sample inputs/outputs. Figure 3 shows such a switch. Since R=(X+Y) mod 2; and Q= =, the shift_out signal can be seen as the shift_in signal with Y bits cyclically shifted, i.e. the shifting is controlled by Y. We refer to R as the remainder of (X+Y) over 2, to Q as the quotient of (X+Y) over 2, and to Y as the state of the switch. When a signal propagates through an array of cascaded shift switches, we also call it a shifting signal. Definition 5. (extension of definition 4) (1) If C (2) is not produced, i.e. the output of the function consists of R only, the shift switch is denoted by S <2,1>. It is functionally identical to the well-known 2 2 barrel shifter (processing state signals) [19]; (2) if R (2) instead of R (2) is produced, the shift switch is denoted by <2, 1>; Note that, in general, a shift switch which processes w-bit state signals with a d-bit control signal is denoted by S<w, d>. The significance of a shift switch lies in the fact that it can sum two small integers Y and X by essentially letting X pass through one transistor that is preset by the control signal Y. Also since both shifting and control signals are state signals, we can use the shift-out of a switch to control another switch, thus we may 4

6 easily organize a network of shift switches (or enhanced reconfigurable buses) for the desired arithmetic computation. 3. The trans-gate based CMOS implementation of shift switches Figure 4 illustrates some trans-gate based schematics of basic shift switches: S<2,1>, S <2,1> and <2,1>. It is clear that if the trans-gates are preset by state signal Y, i.e. Y is received properly (about T inv ) earlier than X then for all of them the delay from X to R (or Q) can be specified as T trans.+ T inv, T trans is a trans-gate delay, T inv is the delay of an inverter with a small standard load (we assume that 4T inv + 2T trans =T add or 4 (T inv + T trans ) =T add [12]). For the purpose of our special-purpose application in this paper, the trans-gates of a switch may be preset by two zero bits (or state signal 0 ), in which case, all trans-gates of the switch are open, and the shift in X and shift out R are disconnected. In section 5 we will use the trans-gate based switches S <2,1> and <2,1> to construct a switch array for the computation of prefix row-parities. (a) (b) (c) Figure 4. The trans-gate-based schematics (a) S<2,1>: (X(2),Y( ),R(2),Q); (b) S <2,1>:(X(2),Y( ),R(2));. (c) <2,1>:(X(2),Y( ),R( ) ); 4. The Precharged CMOS implementation of shift switches and switch units Figures 5 illustrates two pass-transistor-based schematics of basic shift switches: S<2,1> (in the doted boxes). The corresponding modified switches with a state signal generator of Y added, are called n-switch 5

7 (Figures 5.a and 5.c) and p-switch (Figure 5.b and 5.d), respectively. Once the switches are precharged (high for n-switch, low for p-switch) and preset, the discharging signal X(2) (X( )) from the shift-in port will pull-down (pull-up) the data path to yield shift-out R(2) (R( )) and Q ( ). To improve the efficiency of discharging, we cascade a small number (four in the proposed scheme) of switches of the same type to form an n- (or p-) switch unit, as illustrated in Figure 6. The discharging process now yields the following (modulo 2) results (refer to Figure 6): u=(x+a) mod 2, v=(x+a+b) mod 2, w=(x+a+b+c) mod 2, z=(x+a+b+c+d) mod 2=R, and a =(X+a) DIV 2 (i.e. the quotient of (X+a) over 2), b =(X+a+b ) DIV 2, c =(X+a+b+c) DIV 2, z =(X+a+b+c+d) DIV 2. The complete process is composed of two phases that we describe next: A. Precharge phase, in this phase the following should occur (we use n-unit as example, see Figure 6.a), (1) E (tri-state enable signal) is set to 0, i.e. the output of each tri-state internal bus driver is in Hi-Z; (2) a bit data item is loaded in the corresponding register (at the beginning the item is the input bit of the PE associated with the switch), which preset each switch to the corresponding state; (3) prec/eval (precharge and evaluation signal) is set to 0, which starts recharging all switches of the unit in parallel. When the precharge is done, the semaphore q=0 and R=11 are produced. If a row contains more than one switch unit, then the units are cascaded in an alternative (n-p) chain form. The precharge phase for p-unit can be described similarly (refer to Figure 6.b), with the results of q=1, and R=00, which may become the input X of the following n-unit. Evaluation phase, in this phase the following should occur (see Figure 6.a), (1) prec/eval is set to 1; (2) when the discharging state signal X =01 (or 10) arrives, its bit 0 pull-down the preset data path along the unit to produce the (modulo 2) results (i.e. u, v, w, z and a, b, c, d ) for each unit as described above, and to produce a semaphore q=1 and R=01 (or 10); (3) if the signal E from PE_r is 1, read output bits (i.e.u, v, w, z ), otherwise no outputs; (4) if signal E from PE_r is 1, each PE triggers register-load operation (loading the corresponding a, b, c, d ) otherwise there is no loading (5) set prec/eval back to 0, i.e. start precharge again (and keep in precharge phase until next evaluation begin). If a row contains more than one switch unit, the discharging process can propagate from one switch unit to another automatically, i.e. the whole process will be a domino discharging process. Note that the PEs of each switch unit here can be simply driven by the semaphore of the unit, in other words, all operations here can be driven by the semaphore after initialization. 6

8 (a) (b) (c) (d) Figure 5. The pass-transistor-based schematics of (a) n-switch S<2,1>: (X( ),Y( ),R(2), ) and (b) p-switch (c and d) the alternative representations of n-switch S<2, 1> and p-switch. (a) 7

9 (a) Figure 6. (a) The n-switch unit and (b) the p-switch unit. 5. The parallel prefix counting network architecture We are now in a position to present our prefix counting network architecture and the prefix counting algorithm. Figure 7 shows the block diagram of a complete prefix counting network with input size of N-1=63. The mesh consists of n=8 rows, each featuring two cascaded (an n- and a p-) switch units as described in section 4. To simplify the network control, we use an additional column of trans-gate-based shift switch array on the left part of the mesh (see Figure 7). Let the input of the column switch array be a sequence of (p-type) state signals, b1, b2,..b7, and the output be a sequence of regular signals, p1, p2,.. p7. Clearly the pi=(b1+..+bi) mod 2. for all i (Note that it is slower than the precharged switch array and generates no semaphore, however, the computation does not require two phases). The beginning of each row consists of a row processing element (PE_r) which receives semaphore from the previous row, and controls a 2-input multiplexer, and an input state signal generator (consisting of two tri-state buffers). The computation algorithm is spelled out as follows: 8

10 1. Initial stage {to compute and output the least significant bits of the prefix counts} Step 1. all PE load their input bits into the registers; Step 2. each unit starts the initial precharge phase; Step 3. PE_r sets select signal to 0, such that, the right input (i.e. 0) of each MUX is selected; Step 4. PE_r sets Er=1, each row begin domino discharging; Step 5. PE_r sets E=0 (i.e. no output and register loading, refer to evaluation phase (4) in the last ection); Step 6. i *(T inv +T trans ) time after the semaphore value of 1 is received, the i-th PE_r set select signal to 1 Step 7. the i-th PE_r sets E=1 (i.e. to output and load register); {note after this step the precharge is automatically driven by semaphores and is done in parallel to the operations for charging controls} 2. Main stage, consists of logn-1 iterations of the following {note: different rows get into the main stage at different time, the first row gets into the stage earliest, the last row, the latest} Step 8. PE_r sets select signal to 0, such that, the right input (i.e. 0) of each MUX is selected; Step 9. PE_r sets Er=1, each row begin domino discharging; Step 10. PE_r sets E=0 (i.e. no output and register loading); Step 11. PE_r sets select signal to 1; Step 12. PE_r sets Er=1, each row begin domino discharging; Step 13. PE_r sets E=1 (i.e. output and register loading). The algorithm can be interpreted as follows: In the initial stage each row computes the least significant bit of the sum of the row (steps 1 to 5). The results (called the parity-bits of the rows) then are prefix summed by the column switch array, which takes about i * (T inv +T trans ) time to get the i-th prefix sum computed, so i * (T inv +T trans ) time after the first semaphore value of 1 is received, the i-the row start to compute the LSBs of the global prefix sums for the row (Steps 6 and 7); Once the inital stage is passed, each row gets into the main stage, and in the stage, the process is similar to the inital stage except that there is no waiting for the prefix sum of the parity-bits of the lower rows (the waiting has been made in the intial stage), the data are always available and the computations are for the remaining bits of the prefix sums. In fact, the column switch array takes a pipeline process to produce the bits of the global prefix counts of rows. From the implementation point of view the algorithm is very simple, in fact, every step can be easily driven by the value of the corresponding semaphore, with little additional logic devices needed to realize the controls. 9

11 Now the running time and VLSI area can be estimated as follows: the initial stage takes about 2 T d (two domino discharging processes along a row) + T add (waiting time, i.e. column switch array delay time); the main stage takes time (logn-1)*2 T d (two domino discharging processes along a row) + logn, here is the time from a switch of the column array receives an input to the beginning of domino discharging of the row next to the switch, which is about T trans + T inv + 2T inv (or 3T inv ) < T add So the total delay can be specified as (logn)( T d + )+ T add here T d is the delay of discharging 4 cascaded precharged pass-transistors, T add is an full adder delay =4(T inv +T trans ). Since to prefix a large array of binary number, say N 2 10 is not realistic, we assume that N<2 10. Under this assumption, it is easy to verify that our processor is significantly (at least 50%) faster than any known processors, say, tree of adders [5], or a processor with the same structure as ours but each shift switch is substituted by a half adder (half-adder-based processor for short). Since each pass-transistor-based switch is about half size of a half adder, the total area can be specified as 0.5(N+ ) A h, where A h is a half adder area equivalent (note registers and basic controls devices are not counted because they are necessary for any scheme to accomplish the computation). This area is significantly smaller than both half adder-based processor (about 50% less), and the tree of half adders processor (which requires an area (NlogN -1.5N+1)A h ). Note that the half adder-based processor requires significantly larger number of control devices because that it does not generate semaphore (even it uses the traditional domino logic technique). 10

12 Figure 7. Parallel prefix counting network. 11

13 7 Concluding remarks The main contribution of this paper was to propose a parallel prefix counting network of size N-1 (i.e.with N-1 input bits, for N=2 2k =n*n). Our design is a two level architecture involving a total of N+ cascaded simple shift switches, with N pass-transistor-based shift switches along with trans-gate-based shift switches. The resulting network achieves a delay of (logn)( T d + ) + T add where T d is the delay of discharging 4 cascaded precharged pass-transistors, and T add is an full adder delay. It is easily seen that this delay is at least 50% faster than any design known to us for N Our architecture is also area-compact and is constructed by using CMOS, NP domino logic technique on buses with shift switches. The processing elements require a very simple asynchronous control, being driven by semaphores produced at the end of each row s domino discharging process. This greatly simplifies the hardware requirement, and allows the full inherent speed of the computation to be utilized. 12

14 REFERENCES [1] J. Jang, H. Park and V. K. Prasanna, A bit model of the reconfigurable mesh, Proc. of the Workshop on Reconfigurable Architectures, the 8th International Parallel Processing Symposium, Cancun, Maxico, [2] R. H. Krambeck, C. M. Lee, and H.S. Law, High-Speed Compact Circuits with CMOS, IEEE Journal of Solid-State Circuits, Vol SC-17, No. 3, June, [3] F. T. Leighton, Parallel algorithms and architectures: Arrays.trees.hypercubes, Morgan kaufmann publishers, 1992, 49. [4] H. Li and M Maresca Polymorphic-torus architecture for computer vision, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 3, 1989, [5] H. Li and Q. Stout, Reconfigurable massively parallel computers, Prentice-Hall, [6] R. Lin, Reconfigurable Buses with Shift Switching - VLSI Radix Sort, Proc. of International Conference on Parallel Processing, St. Charles, Illinois, 1992, Vol III, pp [7] R. Lin, Shift switching and novel arithmetic schemes, in Proc. of 29th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov Vol [8] R. Lin, and S. Olariu, Reconfigurable buses with shift switching --concept and applications, in IEEE Trans. on Parallel And Distributed System.Vol. 6, No 1, January 1995, [9] R. Lin and Olariu, Reconfigurable network architecture for parallel prefix counting, Proc. of the Workshop on Reconfigurable Architectures, the 10th International Parallel Processing Symposium, Hawaii, [10] R. Lin, and S. Olariu, Reconfigurable shift switching parallel comparators, VLSI Design: An International Journal of Custom-Chip Design, Simulation, and Testing, VOL. 9, No. 1. March, [11] R. Lin, K. Nakano, S. Olariu, M.C. Pintoti, J.L. Schwing, A.Y. Zomaya, Scalable hardware-algorithms for binary prefix sums, in IEEE Transactions on Parallel and Distributed Systems, [12] LSI logic 1.0 micron cell-based products data book, LSI logic corporation, Milpitas, Califoria, [13] R. Miller, V. K. P. Kumar, D. Reisis, and Q. F. Stout, Parallel Computations on Reconfigurable Meshes, IEEE Transactions on Computers, 42 (1993), [14] A. Mukherjee, Introduction to nmos & CMOS VLSI System Design, Prentice-Hall, Englewood, NJ, [15] D. B. Shu, L. W. Chow, and J. G. Nash, A constant addressable, bit serial associate processor, Proc. of the IEEE workshop on VLSI signal processing, Montery CA, Nov [16] E. E. Swartzlander, Jr., Computer Arithmetic Vol. 1 and 2 (IEEE CSP, CA, 1990). [17] J. D. Ullman, Computational Aspect of VLSI (Computer Science Press, MD, 1983). [18] J. Wang, Z. Wang, G. A. Jullien, S. S. Bizzan, W. Luo and W. C. Miller, Circuit Driven Delay Optimization of EMODL Carry Lookahead Adders, Proc. of 28th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov [19] N. Weste and K. Eshraghian, PRINCIPLES OF CMOS VLSI DESIGN, A systems Perspective (Second Edition), A,ddison-Wesley Publishing Company,

An Efficient List-Ranking Algorithm on a Reconfigurable Mesh with Shift Switching

IJCSNS International Journal of Computer Science and Network Security, VOL.7 No.6, June 2007 209 An Efficient List-Ranking Algorithm on a Reconfigurable Mesh with Shift Switching Young-Hak Kim Kumoh National