Mapping Algorithms to Hardware By Prawat Nagvajara


Electrical and Computer Engineering

Synopsis

This note covers the theory, design and implementation of the bit-vector multiplication algorithm. It also presents a general method for mapping iterative-loop algorithms to hardware.

Introduction

Numerical calculations involving nested-loop algorithms are ubiquitous in signal processing and telecommunication applications. Examples of doubly nested loop algorithms are the convolution filters (Finite Impulse Response, FIR, filters) and the Infinite Impulse Response (IIR) filters. Arithmetic calculations such as bit-vector (positional number) addition, multiplication, division and sorting are also loop algorithms. In fact, signal processing algorithms have their foundations in arithmetic algorithms. For instance, convolution and bit-vector multiplication are the same algorithm.

Algorithms

The arithmetic algorithms and hardware considered in the earlier notes were one-dimensional loops: for instance, bit-vector addition, the two's complement, bit-vector comparison and finding the maximum of a set of numbers. Their data dependency graphs are one-dimensional arrays in which the loop index is the node index. For a doubly nested loop, the data dependency graph consists of nodes arranged in a two-dimensional array whose indices are the loop indices. The directed edges represent the data dependencies.

The positional bit-vector multiplication is

$$x \cdot y = x \cdot \left( y_{n-1} 2^{n-1} + y_{n-2} 2^{n-2} + \cdots + y_0 \right) \qquad (1)$$

where x is an m-bit number and y is an n-bit number with $y_i \in \{0, 1\}$, $n-1 \ge i \ge 0$. A recursive description of (1) is

    partial_sum(0) := 0;
    for i in 0 to n-1 loop
        if y(i) = 1 then
            partial_sum(i+1) := partial_sum(i) + x*2**i;
        else
            partial_sum(i+1) := partial_sum(i);
        end if;
    end loop;

With the initial partial sum $ps_0$ equal to 0, the ith iteration accumulates $x \cdot 2^i$ into the partial sum $ps_{i+1}$ when $y_i = 1$. The final answer, the product x*y, is $ps_n$.
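As a quick software check of the recursion above, the following Python sketch accumulates x*2^i whenever bit i of y is 1. It is an illustrative model only, not part of the hardware description; the function name `shift_add_multiply` and its arguments are hypothetical.

```python
def shift_add_multiply(x: int, y: int, n: int) -> int:
    """Multiply x by the n-bit number y using the partial-sum recursion."""
    partial_sum = 0                      # partial_sum(0) := 0
    for i in range(n):                   # i = 0 .. n-1
        y_i = (y >> i) & 1               # bit i of the multiplier
        if y_i == 1:
            partial_sum += x << i        # accumulate x * 2**i
        # when y_i == 0 the partial sum is carried forward unchanged
    return partial_sum                   # partial_sum(n) = x * y


print(shift_add_multiply(13, 11, 4))     # 143
```

The loop body is the one-dimensional view of the algorithm; the bit-level description that follows expands the addition itself into a second loop over bit positions.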
At the bit level, the addition $ps_{i+1} := ps_i + x \cdot 2^i$ is a bit-vector addition that iterates over the bit positions, with the carry bits passed to the next positions. x is an m-bit vector, whereas $x \cdot 2^i$ and $ps_i$ are (m+n-1)-bit vectors (unsigned type). The vector $x \cdot 2^i$ is (i+m-1 downto i => x, others => '0'); in other words, $x \cdot 2^i$ is an (m+n-1)-bit vector with the vector x at positions i+m-1 down to i and zeros elsewhere. When i = n-1 (the most significant position of y), $x \cdot 2^i$ has x at positions n-1+m-1 down to n-1 and zeros elsewhere.

The addition $ps_{i+1} := ps_i + x \cdot 2^i$ involves only the m consecutive positions i+m-1 down to i. A description in a hardware description language is as follows.

    function "*" (x, y : std_logic_vector) return std_logic_vector is
        variable m : natural := x'length;
        variable n : natural := y'length;
        type two_d_array is array (natural range <>, natural range <>) of std_logic;
        variable ps, c : two_d_array (n - 1 downto 0, m+n-2 downto 0);
        variable temp : std_logic;
        variable z : std_logic_vector(n+m-1 downto 0);

In the declaration part of the algorithm, the partial sum ps and the carry c are declared as two-dimensional arrays of bits, where row i is the ith iteration, from 0 to n-1, and the columns indexed by j are the bit positions from 0 to m-1+n-1. The elements of the two-dimensional arrays are of std_logic type; e.g., ps(i, j) is the bit at position j of the vector $ps_i$ in the ith iteration of the algorithm. Since the addition of $x \cdot 2^i$ to the partial sum happens only when y(i) = 1, a temporary variable temp is declared: if y(i) = 1, temp is assigned the bit of $x \cdot 2^i$ at position j, which is x(j-i); if y(i) = 0, temp is assigned 0. In other words, temp := y(i) and x(j-i). The variable temp is added to the ith partial sum at bit position j in calculating the (i+1)th partial sum.
The algorithm body is as follows.

    begin
        -- partial sum initial values
        for j in 0 to m-1 loop
            ps(0,j) := '0';
        end loop;
        for i in 0 to n-1 loop
            for j in i to m-1+i loop
                temp := (y(i) and x(j-i));
                -- not the last iteration
                if i < n-1 then
                    -- rightmost bit position
                    if j = i then
                        c(i, j) := '0';
                        z(j) := temp xor ps(i, j) xor c(i, j);
                        c(i,j+1) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                    -- not rightmost and leftmost positions
                    elsif j < i+m-1 and j > i then
                        ps(i+1,j) := temp xor ps(i, j) xor c(i, j);
                        c(i,j+1) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                    -- leftmost position
                    elsif j = i+m-1 then
                        ps(i+1,j) := temp xor ps(i, j) xor c(i, j);
                        ps(i+1,j+1) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                    end if;
                -- last iteration
                elsif i = n-1 then
                    -- carry initial value
                    if j = i then
                        c(i, j) := '0';
                    end if;
                    -- not leftmost position
                    if j < i+m-1 then
                        z(j) := temp xor ps(i, j) xor c(i, j);
                        c(i,j+1) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                    -- leftmost position
                    elsif j = i+m-1 then
                        z(j) := temp xor ps(i, j) xor c(i, j);
                        z(j+1) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                    end if;
                end if;
            end loop;
        end loop;
        return z;

    end "*";
    end arith_pack;

The inner loop with index j is the bitwise addition, where the sum and carry assignments are described as logic expressions. The carry at the beginning of the inner loop, c(i, i), is initialized to 0. During the ith iteration, the inner-loop m-bit addition involves the positions j = i to i+m-1, where the least significant position is j = i and the most significant position is j = i+m-1. When j = i+m-1, the carry out is added to the partial sum in the (i+1)th iteration. Since this carry goes to bit position j = i+m, which is the most significant position of the addition in the (i+1)th iteration, c(i, i+m) is the most significant bit of the partial sum passed into the next iteration; in other words, ps(i+1, i+m) := c(i, i+m).

The output (returned value) is described as follows. If i < n-1, the sum bit at position j = i computed in the ith iteration is no longer involved in the addition of the (i+1)th iteration, and it becomes bit j of the answer, z(j), for 0 ≤ i < n-1, j = i. When i = n-1 and j = n-1, ..., m+n-2, the sum bit at position j is z(j), and when j = m+n-2 the carry into position m+n-1 is z(m+n-1), the most significant bit of the answer. The data dependency graph of the algorithm for n = 4 is given below.

Fig. 1 Data Dependency Graph

The graph consists of nodes and edges on a two-dimensional grid, where the index i = 0, ..., n-1 enumerates the iterations as the rows and j = 0, ..., m+n-2 enumerates the bit positions as the columns. The top right corner is the coordinate (i, j) = (0, 0); the index i increases in the downward direction and the index j increases toward the left-hand side. The two-dimensional column vector $[1\ 0]^T$ denotes the i-direction and $[0\ 1]^T$ denotes the j-direction, where T is the transposition. The nodes form a parallelogram in which the vector x traverses in the $[1\ 1]^T$ direction. At i = 0, the vector x enters the calculation at positions j = 0 to j = m-1.
In general, at row i the vector x enters the nodes at positions j = i to j = i+m-1. This follows from the fact that at the ith iteration x is multiplied by $2^i$, which is equivalent to shifting the vector x to the left by i positions. The bits of the vector y are edges that traverse in the j-direction, $[0\ 1]^T$. The components y(i) enter the calculation at column j = i, 0 ≤ i ≤ n-1. The partial-sum edges ps(i, j) traverse in the i-direction, $[1\ 0]^T$, and the carry edges c(i, j) traverse in the $[0\ 1]^T$ direction. The initial values ps(0, j) = 0 and c(i, i) = 0 are the edges on the top and right borders of the parallelogram. The carry c(i, i+m) is ps(i+1, i+m). The results z(j), j = 0, ..., m+n-1, appear at the right and bottom borders.

The data dependency graph of the algorithm can be transformed to a more efficient indexing by transforming the parallelogram into a rectangle. In general, if x is an m-bit vector and y is an n-bit vector, the product is an (m+n)-bit vector and the parallelogram has width m and height n. The transformation matrix is [1 0; -1 1], where the semicolon separates the rows of the matrix. The direction vectors map as

follows: $[1\ 0]^T \to [1\ {-1}]^T$, $[0\ 1]^T \to [0\ 1]^T$ and $[1\ 1]^T \to [1\ 0]^T$. The coordinate (i, j) maps to the new coordinate (i', j') = (i, -i+j).

Fig. 2 Transformed Data Dependency Graph

Figure 2 shows an example of the transformed data dependency graph for 4-bit vector multiplication. The node (i, j) in Fig. 1 is mapped to the node (i, j-i) in Fig. 2. The vector x now traverses downward in the $[1\ 0]^T$ direction, the y and c vectors traverse leftward in the $[0\ 1]^T$ direction, and the partial-sum vector ps now traverses in the $[1\ {-1}]^T$ direction. The vectors x and y are simply broadcasts of their values, whereas the partial sum ps and the carry c are functions of x, y, ps and c. The carry signals at the leftmost column nodes (j = m-1) are the partial sums ps(i+1, m-1).

A description of a bit-vector multiplication algorithm based on the transformed data dependency graph (Fig. 2) is as follows.

    library ieee;
    use ieee.std_logic_1164.all;

    package arith_pack is
        function "*" (x, y : std_logic_vector) return std_logic_vector;
    end arith_pack;

    package body arith_pack is
        function "*" (x, y : std_logic_vector) return std_logic_vector is
            variable m : natural := x'length;
            variable n : natural := y'length;
            type two_d_array is array (natural range <>, natural range <>) of std_logic;
            variable ps, c : two_d_array (n - 1 downto 0, m - 1 downto 0);
            variable temp : std_logic;
            variable z : std_logic_vector(n+m-1 downto 0);
        begin
            -- partial sum initial values (top border)
            for j in 0 to m-1 loop
                ps(0,j) := '0';
            end loop;
            -- carry initial values (right border)
            for i in 0 to n-1 loop
                c(i, 0) := '0';
            end loop;
            for i in 0 to n-1 loop
                for j in 0 to m-1 loop
                    temp := (y(i) and x(j));
                    -- not the last row
                    if i < n-1 then
                        -- not 1st and last column
                        if j < m-1 and j > 0 then
                            ps(i+1,j-1) := temp xor ps(i, j) xor c(i, j);
                            c(i,j+1) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                        -- 1st column
                        elsif j = 0 then
                            z(i) := temp xor ps(i, j) xor c(i, j);
                            c(i,j+1) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                        -- last column
                        elsif j = m-1 then
                            ps(i+1,j-1) := temp xor ps(i, j) xor c(i, j);
                            ps(i+1,j) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                        end if;
                    -- last row
                    elsif i = n-1 then
                        -- not last column
                        if j < m-1 then
                            z(i+j) := temp xor ps(i, j) xor c(i, j);
                            c(i,j+1) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                        -- last column
                        elsif j = m-1 then
                            z(i+j) := temp xor ps(i, j) xor c(i, j);
                            z(i+j+1) := (temp and ps(i, j)) or (ps(i, j) and c(i, j)) or (c(i, j) and temp);
                        end if;
                    end if;
                end loop;
            end loop;
            return z;
        end "*";
    end arith_pack;

    library ieee;

    use ieee.std_logic_1164.all, work.arith_pack.all;

    entity test_mult_arith_pack is
        port (
            x, y : in std_logic_vector(3 downto 0);
            z    : out std_logic_vector(7 downto 0)
        );
    end test_mult_arith_pack;

    architecture beh of test_mult_arith_pack is
    begin
        z <= x * y;
    end beh;

Figure 3 shows a simulation waveform of test_mult_arith_pack.

Fig. 3 Verification

Mapping Data Dependency Graph to Combinational Circuit

Mapping the transformed dependency graph to an array of processing elements as a combinational circuit is straightforward. A hardware description can use the for-generate statement to construct the interconnection of the processing elements. Figure 4 shows the processing element block diagram, where xi, yi, psi and ci denote the x, y, partial-sum and carry inputs. The outputs xo and yo are copies of the inputs xi and yi, and pso and co are the partial-sum and carry outputs. The signal w is the temp variable in the "*" function.

Fig. 4 Processing Element Block Diagram

A hardware description of the array comprises the processing elements, the sets of interconnecting wires, and the input, initial-value and output assignments. A description of the processing element is as follows.

    library ieee;
    use ieee.std_logic_1164.all;

    entity pe is
        port (
            xi, yi, psi, ci : in std_logic;
            xo, yo, pso, co : out std_logic
        );
    end pe;

    architecture dataflow of pe is
        signal w : std_logic;
    begin
        xo  <= xi;
        yo  <= yi;
        w   <= xi and yi;
        pso <= w xor ci xor psi;
        co  <= (psi and w) or (w and ci) or (ci and psi);
    end dataflow;
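The interconnection of processing elements on the transformed (rectangular) graph can also be modelled in software. The sketch below is a Python model written for illustration, not the VHDL array description; the names `pe`, `array_multiply`, `to_bits` and `from_bits` are hypothetical, and bit lists are stored least significant bit first. It follows the index rules of the "*" function on the transformed graph: the node at row i, column j computes w = x(j) and y(i), its sum bit moves to column j-1 of the next row, and the carry out of the last column becomes the most significant partial-sum bit of the next row.

```python
def pe(xi, yi, psi, ci):
    """One processing element: a gated full adder (mirrors the VHDL entity pe)."""
    w = xi & yi                                  # partial-product bit
    pso = w ^ ci ^ psi                           # sum output
    co = (psi & w) | (w & ci) | (ci & psi)       # carry output
    return xi, yi, pso, co


def array_multiply(x_bits, y_bits):
    """Multiply two LSB-first bit lists on an n-by-m array of processing elements."""
    m, n = len(x_bits), len(y_bits)
    z = [0] * (m + n)                            # product bits, LSB first
    ps = [0] * m                                 # ps(0, j) = 0 on the top border
    for i in range(n):                           # rows: multiplier bits
        carry = 0                                # c(i, 0) = 0 on the right border
        sums = []
        for j in range(m):                       # columns: multiplicand bits
            _, _, s, carry = pe(x_bits[j], y_bits[i], ps[j], carry)
            sums.append(s)
        if i < n - 1:
            z[i] = sums[0]                       # first column gives product bit i
            ps = sums[1:] + [carry]              # ps(i+1, j-1) and ps(i+1, m-1)
        else:
            z[i:i + m] = sums                    # last row gives z(i+j)
            z[i + m] = carry                     # final carry is the MSB
    return z


def to_bits(value, width):
    """Integer to LSB-first bit list of the given width."""
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    """LSB-first bit list to integer."""
    return sum(b << k for k, b in enumerate(bits))


product = from_bits(array_multiply(to_bits(7, 4), to_bits(7, 4)))
print(product)  # 49
```

Each pass of the inner loop corresponds to one row of processing elements whose carries ripple from column 0 to column m-1, which is the combinational behaviour the array is meant to have.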


Mapping Data Dependency Graph to Synchronous Circuit

This section covers a method for mapping a data dependency graph to a synchronous circuit comprising combinational logic and storage elements synchronized to the rising clock edges. A serial (single-thread) code is basically a map of the data flow (edges) and processing (nodes) of the data dependency graph to a single processor, where the loops in the code describe a schedule of when the nodes are to be calculated. For example, the code of the multiplication function "*" above describes a row-wise schedule, in which the nodes within each row are calculated in order of increasing column index j. The data dependency graph is useful for designing parallel computation on multiple processors. Moreover, linear algebra provides a design tool for mapping the computations (nodes) and data flows (arcs).

Consider a linear projection P that maps a k-dimensional data dependency graph G (a k-nested loop algorithm) to a (k-1)-dimensional graph G'. The nodes in G are vectors in a k-dimensional space over the integers, $V^k$, and $P: G \to G'$, where G' lies in a (k-1)-dimensional subspace. The projected graph has vertices $\{v' : v' \in G'\}$ that represent the multiple processors for calculating the vertices $\{v : v \in G\}$ of the data dependency graph. A processor v' calculates the nodes $\{v \mid Pv = v',\ v \in G\}$ during different clock cycles. The projection P maps the arcs $a \in G$, which are vectors describing the data flow directions, to the data flow directions between the processors.

Let e be the vector orthogonal to a subspace V: $u \in V$ if and only if $u^T e = 0$, with $u, e \in V^k$. The affine subspaces $V_t$, t = 0, 1, ..., where t represents the discrete time (clock cycles), are defined by $u \in V_t$ if and only if $u^T e = t$. The calculations at time t are the nodes in the affine subspace $V_t$, which are distributed among the processors by the linear projection $P: G \to G'$. The affine subspaces $V_t$, t = 0, 1, ..., are the schedules for the processors.

Registers and memory provide temporary storage for the processors. They can be placed on the arcs of G' to provide delay buffers that synchronize the calculations. Define a directed arc to be an ordered pair of nodes, a = (q, r). The number of time steps d between $q \in V_t$ and $r \in V_{t+d}$, where q and r are projected to adjacent processors, is the number of delay buffers required on the projected arc a in G'.

Serial Multiplier with Combinational Adder

Fig. 5 Mapping Data Dependency Graph to Multiple Processing Units

Figure 5 shows an example of a mapping of the multiplication data dependency graph to an array of processing units (processors). The vector $[i\ j]^T$ means row i and column j, where i increases downward and j increases leftward. The projection matrix P = [0 0; 0 1] maps the nodes $[i\ j]^T \to P[i\ j]^T = [0\ j]^T$; that is, the nodes in column j are calculated by processor j. The nodes (processors) of the projected graph G' are at the bottom of Fig. 5. The schedule is the set of affine subspaces orthogonal to $e = [1\ 0]^T$, shown as the red lines. The nodes u in the affine subspace $V_t$ are calculated at time $t = u^T e = [i\ j][1\ 0]^T = i$.

The data flow arcs for the multiplicand bits a(j), j = 0, ..., 4 (downward arcs) map to $P[i\ 0]^T = [0\ 0]^T$, self-loop arcs with a delay buffer on the nodes $[0\ j]^T$, j = 0, ..., 4, of G'. These arcs are not shown in Fig. 5; however, they are delays that store a(j) in node j, j = 0, ..., 4, of G'. The arcs in the $[0\ 1]^T$ direction, the multiplier bits and the carries, map to $P[0\ j]^T = [0\ j]^T$, the arcs traversing leftward in G'.
These arcs in G connect nodes belonging to the same affine subspace, which implies that the data are available to the nodes during clock cycle i; thus, no delays are required on the arcs traversing leftward in G'. The schedule implies combinational bit-vector addition hardware, where the carry signals propagate from the least significant position (j = 0) to the most significant position (j = m-1) during the clock cycle, and the partial sums are valid before the next clock cycle. The partial-sum arcs, in the $[1\ {-1}]^T$ direction, map to $P[i\ {-j}]^T = [0\ {-j}]^T$, the arcs traversing rightward in G'. The number of time steps (distance) between q and r of an arc (q, r) in the direction $e_a$ is $e^T e_a = [1\ 0][1\ {-1}]^T = 1$ delay. The projected graph G' shows one delay D placed on the arcs traversing rightward.

In the graph G', the nodes store the multiplicand bits, which are multiplied by multiplier bit i at time t = i. These products are added to the partial-sum bits, which are updated in the storage elements on the rightward traversing arcs (the projected partial-sum arcs). The carry out at the most significant position loops back as the most significant bit of the partial sum. The product bits p(t), t = 0, 1, ..., 9, are the output of G' at time t.
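The projection and schedule arithmetic used in this section is small enough to check mechanically. The following Python sketch (an illustration written for this note; the helper names `project`, `time_steps` and `map_arcs` are hypothetical) applies P = [0 0; 0 1] and e = [1 0]^T to the three arc directions of the multiplication graph and reports the projected direction together with the number of delay buffers, e^T a.

```python
def project(P, u):
    """Return P*u for a 2x2 matrix P (given as rows) and a 2-vector u."""
    return tuple(sum(p * v for p, v in zip(row, u)) for row in P)


def time_steps(e, a):
    """Number of clock cycles spanned by vector a under schedule e, i.e. e^T a."""
    return sum(ei * ai for ei, ai in zip(e, a))


def map_arcs(P, e, arcs):
    """Map each named arc direction to (projected direction, number of delays)."""
    return {name: (project(P, a), time_steps(e, a)) for name, a in arcs.items()}


P = ((0, 0), (0, 1))              # project along the i-direction onto processor j
e = (1, 0)                        # schedule: node [i j]^T is computed at t = i
ARCS = {
    "multiplicand": (1, 0),       # downward arcs
    "multiplier/carry": (0, 1),   # leftward arcs
    "partial sum": (1, -1),       # diagonal arcs
}

for name, (direction, delays) in map_arcs(P, e, ARCS).items():
    print(f"{name:17s} -> direction {direction}, delays {delays}")

# Node [2 3]^T is computed by processor [0 3]^T at time t = 2.
print(project(P, (2, 3)), time_steps(e, (2, 3)))
```

The multiplicand arcs project to the zero vector (a self-loop) with one delay, the multiplier and carry arcs need no delay, and the partial-sum arcs point rightward with one delay, in agreement with the text. Passing a different schedule vector to the same helpers gives the delay counts for that schedule.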


Figure 6 below shows the projected graph G' as serial multiplier hardware with m = 5 multiplicand bits x4, ..., x0. The multiplier bits b0, ..., b4, 0, 0, ... are applied at times t = 0, 1, ..., and the product bits p(t) appear at the output at times t = 0, 1, ..., 9.

Fig. 6 Serial Multiplier Based on [1 0]^T Schedule

Pipeline Multiplier

Consider the same projected graph G' with the projection matrix P = [0 0; 0 1]. A different schedule, $e = [2\ 1]^T$, eliminates the propagation latency due to the carry signals, which grows linearly with the number of bits in the multiplicand. The affine subspaces are the nodes u such that $e^T u = t$; for instance, the nodes $[0\ 4]^T$, $[1\ 2]^T$ and $[2\ 0]^T$ lie on the affine subspace $e^T u = [2\ 1]u = t = 4$ (see Fig. 7). The partial-sum arcs in G traverse one time step, $e^T [1\ {-1}]^T = [2\ 1][1\ {-1}]^T = 1$, which places a delay on the projected partial-sum arcs in G'. The carry and multiplier arcs in the $[0\ 1]^T$ direction in G also traverse one time step, $e^T [0\ 1]^T = [2\ 1][0\ 1]^T = 1$, which places a delay on the projected carry and multiplier arcs in G' (see Fig. 7).

The pipeline multiplier G' has a data rate equal to 1/2; that is, the input data are applied on every second rising clock edge (cycle). An optimum pipelining rate is one; that is, input data are applied on every cycle. In Fig. 7, the multiplier bits are applied at t = 0, 2, 4, ..., 8, and zeros are applied afterward. The output product bits also appear at a rate equal to 1/2. Note that the arcs in the $[1\ 0]^T$ direction traverse 2 time steps, that is, $[2\ 1][1\ 0]^T = 2$. The processor utilization is 1/2 because at any instant t half of the processors are computing; for example, at t = 4, the processors $[0\ j]^T$, j = 0, 2, 4, are calculating the nodes $[0\ 4]^T$, $[1\ 2]^T$ and $[2\ 0]^T$.

Fig. 7 Pipeline Multiplier Based on [2 1]^T Schedule

Fig. 8 Pipeline Multiplier Processing Unit

Figure 8 shows the processing unit of the projected graph G' (see Fig. 7).
The delay buffers (delay flip-flops, DFF) are placed on the arcs as the output buffers. This gives a stable drive of the signals from the unit. An array of the processing units and an additional delay flip-flop

at the most significant position at the PS_in port form the pipeline multiplier G'.

A series of snapshots of the pipeline multiplier with 3-bit multiplicand a = 111 and 3-bit multiplier b = 111, calculating 7 x 7 = 49 = 110001 (binary), is shown below (Fig. 9). The input multiplier bits are applied at t = 0, 2, 4, followed by zeros until t = 11. The inputs are also zero for t = 1, 3. The product bits appear at the output at t = 1, 3, 5, 7, 9 and 11, starting from the least significant bit. The snapshots begin at t = 0 and continue to t = 11, showing the data flow in the pipeline. The inputs and outputs are highlighted (blue and red). Calculating units are highlighted red.

Fig. 9 Pipeline Multiplier Snapshots

Optimum Rate Pipeline Multiplier

Consider a different projection, P = [1 0; 0 0], of the data dependency graph G in the $[0\ 1]^T$


direction. Figure 10 shows an example of G and the projected graph G'.

Fig. 10 Projected Graph in [0 1]^T Direction

The projected graph G' has 2 delays placed on the multiplicand arcs in the $[1\ 0]^T$ direction, that is, $e^T [1\ 0]^T = [2\ 1][1\ 0]^T = 2$ delays. The projected partial-sum arcs have one delay, $[2\ 1][1\ {-1}]^T = 1$. The pipeline rate is 1, as the multiplicand bits a(t), t = 0, 1, ..., are applied to the pipeline every clock cycle. The latency is 2n + m, where m is the number of multiplicand bits and n is the number of multiplier bits. The least significant bit of the product appears at t = n, and the product consists of n+m bits. The processing unit consists of a serial adder with internal storage for the carry signal. The carry signal is assigned as the partial-sum output every m clock cycles. In a sense the unit is a processor (computer) consisting of a state machine and memory storage.

Convolution Filter

When the data in the graph G are integers (or real numbers in floating point), the multiplication of the multiplicand bits and the multiplier bits (implemented as AND logic) becomes integer multiplication, and the bit-vector addition becomes integer addition. There are no carry bits. The graph is extended to the left indefinitely, as the input signals are applied to the convolution algorithm indefinitely. The addition of the partial sum is now an accumulation (integration) of numbers. The projected graph G' (Fig. 10) is a convolution filter, also called a Finite Impulse Response (FIR) filter or a moving average filter, which calculates a weighted sum (the filter coefficients are the weights) of the past n inputs, where n is the number of nodes in G'.

The projected graph G', as a version of the convolution filter, comprises processing units connected as shown in Fig. 10, where the delays are registers storing numbers.
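The weighted sum of past inputs computed by such a filter can be sketched in Python. This is a direct-form software model with a shift-register delay line, not the systolic arrangement of Fig. 10; the function name `fir_filter` and the sample data are hypothetical. With coefficients and inputs taken from {0, 1}, the outputs are the column sums of a binary multiplication, which illustrates the remark in the Introduction that convolution and bit-vector multiplication are the same algorithm.

```python
def fir_filter(b, a):
    """Return p(t) = sum_i b[i] * a[t - i], with inputs before t = 0 taken as zero."""
    taps = [0] * len(b)                  # delay-line registers holding past inputs
    outputs = []
    for sample in a:
        taps = [sample] + taps[:-1]      # shift the new input into the delay line
        outputs.append(sum(bi * ai for bi, ai in zip(b, taps)))
    return outputs


b = [1, 1, 1]                            # coefficients: the bits of 7, LSB first
a = [1, 1, 1, 0, 0, 0]                   # input stream: the bits of 7, then zeros
p = fir_filter(b, a)
print(p)                                         # [1, 2, 3, 2, 1, 0]
print(sum(pt << t for t, pt in enumerate(p)))    # 49, i.e. 7 * 7
```

Weighting output p(t) by 2^t and summing resolves the carries that the integer filter does not propagate, which recovers the product 49.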
The processing unit consists of a Multiply and Add (MAC) unit that multiplies the filter coefficients {b(i), i = 0, ..., n-1} (n is called the number of taps) with the past input signal a(t) traversing through the filter. The products of the past inputs and the coefficients are added into the partial sum recursively. The output of the filter is the weighted sum of the past n inputs. In Fig. 10 the dependency graph shows that the filter output p(t) reaches the steady state when t ≥ 4, and the weighted sum is given by

$$p(t) = b(0)a(t) + b(1)a(t-1) + b(2)a(t-2) + b(3)a(t-3) + b(4)a(t-4), \quad t \ge 4.$$

Based on the method of mapping an algorithm to hardware, the projected graph G' in Fig. 10 is a 5-tap convolution filter with a pipeline rate of one and a latency of 5 cycles.

Conclusions

A method for mapping data dependency graphs of algorithms to array processing hardware is relevant in today's (2017) signal processing. Further studies and designs can include the infinite impulse response filter, matrix multiplication and matrix decompositions (Lower-Upper, orthogonal and singular value decompositions). The study of bit-vector multiplication hardware presented here provides a foundation.
