COMPUTATIONAL PROPERTIES OF DSP ALGORITHMS


DSP Algorithms

A DSP algorithm is a computational rule, f, that maps an ordered input sequence, x(nT), to an ordered output sequence, y(nT), according to

x(nT) -> y(nT),    y(nT) := f(x(nT))

Generally, in hard real-time applications the mapping is required to be causal, since the input sequence is usually causal. We stress that the computational rule is an unambiguously specified sequence of operations on an ordered data set. A DSP algorithm thus consists of a sequence of operations, drawn from a basic set of arithmetic and logic operations, applied to an ordered data set.

A DSP algorithm can be described by a set of expressions. The sign ":=" is used to indicate that the expressions are given in computational order.

x1(n) := f1[..., x1(n - 1), ..., xp(n - 1), xp(n - 2), ..., a1, b1, ...]
x2(n) := f2[..., x1(n - 1), ..., xp(n - 1), xp(n - 2), ..., a2, b2, ...]
x3(n) := f3[x1(n), x1(n - 1), ..., xp(n - 1), xp(n - 2), ..., a3, b3, ...]
...
xN(n) := fN[x1(n), x1(n - 1), ..., xp(n - 1), xp(n - 2), ..., aN, bN, ...]

DSP algorithms can be divided into iterative processing and block processing algorithms. An iterative algorithm performs computations on a semi-infinite stream of input data: input samples arrive sequentially, and the algorithm is executed once for every input sample, producing a corresponding output sample. In block processing, a block of output samples is computed for each input block, which results in a large delay between input and output.
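The distinction between iterative and block processing can be sketched in Python. This is illustrative only; the first-order recursion y(n) = a*y(n-1) + x(n) and all names are assumptions made for the example, not taken from the text.

```python
def iterative(xs, a=0.5):
    """Iterative processing: the algorithm runs once per input sample."""
    y_prev = 0.0
    out = []
    for x in xs:
        y = a * y_prev + x     # one execution of the algorithm per sample
        out.append(y)
        y_prev = y
    return out

def block_processing(xs, a=0.5, block=4):
    """Block processing: one output block is computed per input block."""
    out = []
    y_prev = 0.0
    for i in range(0, len(xs), block):
        for x in xs[i:i + block]:   # the whole block is processed at once
            y_prev = a * y_prev + x
            out.append(y_prev)
    return out
```

Both variants compute the same output sequence; the block version simply cannot emit anything until a whole input block has arrived, which is the source of the large input-output delay mentioned above.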

PRECEDENCE GRAPHS

A precedence graph describes the order of occurrence of events: A, B, ..., F. The directed branches between the nodes denote the ordering between the events, which are represented by the nodes. A directed branch between node E and node B shows that event E precedes event B; E is therefore called a precedent, or predecessor, to B. Node E also precedes event A, so node E is a second-order precedent to node A. Event B is a succedent to event E. An initial node has no precedents and a terminal node has no succedents, while an internal node has both. If two events are not connected via a branch, their precedence order is unspecified.

Sometimes it may be more convenient to let the branches represent the events and the nodes represent the precedence relations. Such precedence graphs are called AOA (activity on arcs) graphs, while the former type, with activity on the nodes, are called AON (activity on nodes) graphs.

Parallelism in Algorithms

Precedence relations between operations are unspecified in a parallel algorithm. The figure illustrates two examples of parallel algorithms: in the first case the additions have a common precedent, while in the second case they are independent. Two operations (algorithms) are concurrent if their execution times overlap. In a sequential algorithm every operation, except for the first and the last, has exactly one precedent and one succedent operation. Thus, the precedence relations are uniquely specified for a sequential algorithm.

Latency

We define latency as the time it takes to generate an output value from the corresponding input value.

Example: bit-parallel multiplication. (Figure: inputs x and a, output y, with the latency and the throughput marked on a time axis.)

Note that latency refers to the time between input and output, while the algorithmic delay refers to the difference between input and output sample indices.

Example: bit-serial multiplication. (Figure: inputs x and a and output y transmitted LSB first, with the latency and the throughput marked on a time axis.)

Bit-serial multiplication can be done either by processing the least significant or the most significant bit first. If the LSB is processed first, the latency is in principle equal to the number of fractional bits in the coefficient. For example, a multiplication with the two's-complement coefficient Wc = (1.0011) will have a latency corresponding to four clock cycles.

A bit-serial addition or subtraction has in principle zero latency, while a multiplication by an integer may have zero or negative latency. However, the latency in a recursive loop is always positive, since the operations must be performed by causal PEs.

Latency models for bit-serial arithmetic. (Figure: models 0 and 1 of a bit-serial adder built from a full adder, FA, and a carry flip-flop. In model 0 each sum bit si appears in the same clock cycle as the input bits xi and yi, giving latency 0; in model 1 each sum bit is delayed one clock cycle, giving latency 1.)

(Figure: latency models 0a and 1a for a serial/parallel multiplier with input x(n) and coefficient a.)

Denoting the number of fractional bits of the coefficient Waf, the latencies become Waf for latency model 0 and Waf + 1 for latency model 1.

Sequentially Computable Algorithms

A precedence graph may be contradictory in the sense that it describes an impossible ordering of events. An example of such a precedence graph is shown in the figure: event A can occur only after event C has occurred, event C can occur only after event B, but event B cannot occur until event A has occurred. Hence this sequence of events is impossible, since the sequence cannot begin. In a digital signal processing algorithm, events correspond to arithmetic operations. In a proper algorithm, at least one of the operations in each recursive loop in the signal-flow graph must have all its input values available so that it can be executed.

This is satisfied only if the loops contain at least one delay element, since the delay element holds a value from the preceding sample interval that can be used as a starting point for the computations in the current sample interval. Thus, there must not be any delay-free loops in the signal-flow graph. For a sequentially computable algorithm the corresponding precedence graph is a directed acyclic graph (DAG).

A necessary and sufficient condition for a recursive algorithm to be sequentially computable is that every directed loop in the signal-flow graph contains at least one delay element. An algorithm that has a delay-free loop is nonsequentially computable. Such an algorithm can be implemented neither as a computer program nor in digital hardware.
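The condition can be checked mechanically: delete every branch that contains a delay element and test whether the remaining graph is acyclic. A minimal Python sketch; the graph representation (a dict mapping each node to a list of (successor, number-of-delays) edges) is an assumption made for illustration.

```python
def sequentially_computable(sfg):
    """True iff every directed loop of the signal-flow graph sfg
    contains at least one delay element, i.e. the graph with all
    delay branches removed is a DAG."""
    # keep only the delay-free edges
    adj = {u: [v for v, d in edges if d == 0] for u, edges in sfg.items()}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: WHITE for u in adj}

    def has_cycle(u):
        color[u] = GRAY
        for v in adj.get(u, []):
            if color.get(v, WHITE) == GRAY:
                return True            # back edge -> delay-free loop
            if color.get(v, WHITE) == WHITE and has_cycle(v):
                return True
        color[u] = BLACK
        return False

    return not any(color[u] == WHITE and has_cycle(u) for u in adj)
```

For example, a first-order loop whose feedback branch contains one delay element passes the test, while the contradictory A-B-C precedence graph above fails it.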

Fully Specified Signal-Flow Graphs

In a fully specified signal-flow graph, the ordering of all operations is uniquely specified, as illustrated in the figure. (Figure: two fully specified realizations of y = a + b + c, differing in the order in which the two additions are performed.)

SFGs IN PRECEDENCE FORM

A signal-flow graph in precedence form shows the order in which the node values must be computed.

Example 6.1. The necessary quantization in the recursive loops is shown, but not the ordering of the additions. Hence, the signal-flow graph is not fully specified. We get a fully specified signal-flow graph by ordering the additions; note that there are several ways to order them. The first step, according to the procedure in Box 6.1, does not apply since, in this case, there are no multiplications by 1. (Figure: direct form II second-order section with coefficients c0, a0, a1, a2, b1, b2, quantization Q, two delay elements, and node variables u1, ..., u10, v1(n), v2(n).)

The second step is to assign node variables to the input, the output, and the delay elements. Next we identify and assign node variables to the outputs of the basic operations. Here we assume that multiplication, two-input addition, and quantization are the basic operations. An operation can be executed when all of its input values are available. The input, x(n), and the values stored in the delay elements, v1(n) and v2(n), are available at the beginning of the sample interval; these nodes are therefore the initial nodes for the computations. To find the order of the computations we begin (step 3) by removing all delay branches in the signal-flow graph. (Figure: the signal-flow graph with the delay branches removed.)

In step four, all initial nodes are identified, i.e., x(n), v1(n), and v2(n). These nodes are assigned to node set N1. In step five we remove all executable operations, i.e., all operations that have only initial nodes as inputs. In this case we remove the five multiplications by the coefficients c0, a1, a2, b1, and b2. The resulting graph is shown in the figure.

(Figure: successive removal of executable operations until the graph is empty. The node values are thereby assigned to the node sets N1 = {x(n), v1(n), v2(n)}, N2 = {u1, u4, u5, u8, u9}, N3 = {u2, u10}, N4 = {u3}, N5 = {u6}, N6 = {u7}, and N7 = {y(n)}.)
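Steps three through five amount to a level-by-level topological sort: repeatedly remove the operations whose inputs are all already known. A Python sketch, using the node names of the example; the dict representation (output node mapped to its list of input nodes) is assumed for illustration.

```python
def node_sets(initial, ops):
    """initial: names of the initial nodes (inputs and delay elements).
    ops: {output_node: [input_nodes]} for every basic operation.
    Returns the node sets [N1, N2, ...] of the precedence form."""
    sets = [set(initial)]
    known = set(initial)
    remaining = dict(ops)
    while remaining:
        # executable operations: all inputs have already been computed
        ready = {out for out, ins in remaining.items()
                 if all(i in known for i in ins)}
        if not ready:
            raise ValueError("delay-free loop: not sequentially computable")
        sets.append(ready)
        known |= ready
        for out in ready:
            del remaining[out]
    return sets
```

Applied to the example second-order section, it reproduces the seven node sets N1, ..., N7.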

Finally, we obtain the signal-flow graph in precedence form by connecting the nodes with the delay branches and the arithmetic branches according to the original signal-flow graph. (Figure: the signal-flow graph in precedence form, with the node sets N1 through N7 ordered from left to right and the algorithmic delay elements drawn above.) The computations take place from left to right. A delay element is here represented by a branch running from left to right and a gray branch running from right to left.

The latter indicates that the delayed value is used as input for the computations belonging to the subsequent sample interval. If a node value, for example u1, is computed earlier than needed, an auxiliary node must be introduced; the branch connecting such nodes represents storing the value in memory. It is illustrative to draw the precedence form on a cylinder to demonstrate the cyclic nature of the computations: the computations are imagined to be performed repeatedly around the cylinder, and the circumference of the cylinder corresponds to a multiple of the length of the sample interval. (Figure: the precedence form drawn on a cylinder.)

DIFFERENCE EQUATIONS

A digital filter algorithm consists of a set of difference equations to be evaluated for each input sample value. The difference equations in computable order can be obtained directly from the signal-flow graph in precedence form. The signal values corresponding to node set N1 are known at the beginning of the sample interval. Hence, operations having output nodes belonging to node set N2 have the necessary inputs and can therefore be executed immediately.

Once this is done, the operations having output nodes belonging to node set N3 have all their inputs available and can be executed. This process is repeated until the last set of nodes has been computed.

Node set    Equations
N2          u1 := c0 x(n)
            u4 := b1 v1(n)
            u5 := b2 v2(n)
            u8 := a1 v1(n)
            u9 := a2 v2(n)
N3          u2 := u4 + u5
            u10 := u8 + u9
N4          u3 := u1 + u2
N5          u6 := [u3]Q
N6          u7 := a0 u6
N7          y(n) := u7 + u10
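Written out in this order, one sample interval becomes straight-line code. A Python sketch of the table; the quantization [.]Q is replaced by the identity here, purely for illustration.

```python
def second_order_section(x, v1, v2, c0, a0, a1, a2, b1, b2):
    """One sample interval of the example filter, evaluated in the
    computational order given by node sets N2..N7."""
    u1 = c0 * x            # N2: the five multiplications
    u4 = b1 * v1
    u5 = b2 * v2
    u8 = a1 * v1
    u9 = a2 * v2
    u2 = u4 + u5           # N3: two additions
    u10 = u8 + u9
    u3 = u1 + u2           # N4
    u6 = u3                # N5: u6 := [u3]Q (identity in this sketch)
    u7 = a0 * u6           # N6
    y = u7 + u10           # N7
    return y, u6           # u6 is needed afterwards to update v1
```

The caller then performs the delay-element update v2 := v1, v1 := u6 before the next sample interval.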

Finally, the signal values corresponding to node set N1 can be updated via the delay elements. This completes the operations during one sample interval.

Node set    Equations
N11         v2(n + 1) := v1(n)
N12         v1(n + 1) := u6

Delay elements connected in series are updated sequentially, starting with the last delay element, so that auxiliary memory cells are not required for intermediate values. Operations with outputs belonging to different node sets must be executed sequentially, while operations with outputs belonging to the same node set can be executed in parallel. Similarly, updating of node values that correspond to delay elements must be done sequentially if they belong to different subsets N1k. The updating is done here at the end of the sample interval, but it can be done as soon as a particular node value is no longer needed.

The system of difference equations can be obtained directly from the table. In the method just discussed, a large number of simple expressions are computed and assigned to intermediate variables. Hence, a large number of intermediate values must be stored in temporary memory cells, which requires unnecessary store and load operations. If the algorithm is to be implemented using, for example, a standard signal processor, it may be more efficient to eliminate some of the intermediate results and explicitly compute only those node values that are required. Obviously, the only values that need to be computed explicitly are: node values that have more than one outgoing branch; inputs to some types of noncommuting operations; and the output value.

The only node with two outgoing branches is node u6. The remaining node values represent intermediate values used as inputs to one subsequent operation only. Hence, their computation can be delayed until they are needed.

In this algorithm, the only operation that does not commute with its adjacent operations is the quantization operation. Generally, the inputs to such operations must be computed explicitly, but in this case u3 appears in only one subsequent expression. Hence, u3 can be eliminated.

u3 := c0 x(n) + b1 v1(n) + b2 v2(n)
u6 := [u3]Q
y(n) := a0 u6 + a1 v1(n) + a2 v2(n)
v2(n + 1) := v1(n)
v1(n + 1) := u6

Eliminating u3 gives the following expressions, which can be translated directly into a program.

u6 := [c0 x(n) + b1 v1(n) + b2 v2(n)]Q
y(n) := a0 u6 + a1 v1(n) + a2 v2(n)
v2(n + 1) := v1(n)
v1(n + 1) := u6

program Direct_form_II;

const
  c0 = ...; a0 = ...; a1 = ...; a2 = ...;
  b1 = ...; b2 = ...;
  NSamples = ...;

var
  xin, u6, v1, v2, y : Real;
  i : Longint;

function Q(x : Real) : Real;
begin
  x := Trunc(x*32768);              { Wd = 16 bits => Q = 2^(-15) }
  if x > 32767 then x := 32767;     { saturate on overflow }
  if x < -32768 then x := -32768;
  Q := x/32768;
end;

begin
  v1 := 0; v2 := 0;                 { clear the delay elements }
  for i := 1 to NSamples do
  begin
    Readln(xin);                    { Read a new input value, xin }
    u6 := Q(c0*xin + b1*v1 + b2*v2);  { Direct form II }
    y := a0*u6 + a1*v1 + a2*v2;
    v2 := v1;
    v1 := u6;
    Writeln(y);                     { Write the output value }
  end;
end.

COMPUTATION GRAPHS

It is convenient to describe the computational properties of an algorithm with a computation graph that combines the information contained in the signal-flow graph with the corresponding precedence graph for the operations. The computation graph will later be used as the basis for scheduling the operations. Further, the storage and communication requirements can be derived from the computation graph. In this graph the signal-flow graph in precedence form remains essentially intact, but the branches representing the arithmetic operations are also assigned appropriate execution times, and branches with delay elements are mapped to branches with delay. Additional branches with different types of delay must often be inserted to obtain consistent timing properties in the computation graph. These delays determine the amount of physical storage required to implement the algorithm.

Critical Path

By assigning proper execution times to the operations, represented by the branches in the precedence graph, we obtain the computation graph. The longest (in time) directed path in the computation graph is called the critical path (CP), and its execution time is denoted TCP. Several equally long critical paths may exist. For example, there are two critical paths in the computation graph: the first starts at node v1(n) and goes through nodes u4, u2, u3, u6, u7, and y(n), while the second begins at node v2(n) and goes through nodes u5, u2, u3, u6, u7, and y(n).
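TCP is the longest weighted path through the (acyclic) computation graph, which can be found by memoized depth-first search. A sketch; the edge weights are assumed operation execution times, chosen only for illustration.

```python
from functools import lru_cache

def critical_path_time(edges):
    """edges: {node: [(successor, op_time), ...]} for a DAG.
    Returns T_CP, the execution time of the critical path."""
    @lru_cache(maxsize=None)
    def longest_from(u):
        # longest remaining path starting at node u
        return max((t + longest_from(v) for v, t in edges.get(u, [])),
                   default=0)
    return max(longest_from(u) for u in edges)
```

For instance, with assumed times Tmul = 2, Tadd = 1, TQ = 1 the path v1(n) -> u4 -> u2 -> u3 -> u6 -> u7 -> y(n) has length 2 + 1 + 1 + 1 + 2 + 1 = 8.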

Equalizing Delay

Assume that an algorithm shall be executed with a sample period T and that the time taken by the arithmetic operations in the critical path is TCP, where TCP < T. Then additional delay, here called equalizing delay, has to be introduced into all branches that cross an arbitrary vertical cut in the computation graph. This delay accounts for the time difference between a path and the length of the sample interval. The required duration of equalizing delay in each such branch is

Te = T - TCP

The amount of equalizing delay, which usually corresponds to physical storage, can be minimized by proper scheduling of the operations.

Shimming Delay

If two paths in a computation graph have a common origin and a common end node, then the data streams in these two paths have to be synchronized at the summation node by introducing a delay in the fastest path.

For example, suppose execution of the two multiplications starts at time instants t0a and t0b, and the times required for the multiplications are ta and tb, respectively. The products will then arrive at the inputs of the subsequent adder with a time difference

Dt = (t0b + tb) - (t0a + ta)

A delay of Dt must therefore be inserted in the faster branch of the computation graph so that the products arrive simultaneously. These delays are called shimming delays or slack. Shimming delays usually correspond to physical storage.
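Both delay types reduce to simple arithmetic on the timing data. A small sketch; the function names are illustrative, not from the text.

```python
def shimming_delay(t0a, ta, t0b, tb):
    """Delays to insert in branches a and b so that both products
    arrive at the adder simultaneously (only the faster one is padded)."""
    dt = (t0b + tb) - (t0a + ta)      # arrival-time difference
    return (max(dt, 0), max(-dt, 0))  # (delay in branch a, delay in branch b)

def equalizing_delay(T, T_cp):
    """Equalizing delay per branch crossing a cut, when T_CP < T."""
    return T - T_cp
```

For example, if branch a starts at t0a = 0 and takes ta = 3 while branch b starts at t0b = 1 and takes tb = 4, branch a must be delayed by 2 time units.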

(Figure: computation graph for the second-order section, showing the shimming delays, the equalizing delays, and the memory corresponding to the algorithmic delay elements; TCP is shorter than Tsample.)

Maximal Sample Rate

The maximal sample rate of an algorithm is determined only by its recursive parts. The minimal sample period for a recursive algorithm described by a fully specified signal-flow graph is

Tmin = max over i of { Topi / Ni }

where Topi is the total latency of the arithmetic operations in directed loop i and Ni is the number of delay elements in that loop. Nonrecursive parts of the signal-flow graph, e.g., input and output branches, generally do not limit the sample rate, but to achieve this bound additional delay elements may have to be introduced into the nonrecursive branches.

The minimal sample period is also referred to as the iteration period bound. Loops that yield Tmin are called critical loops. The signal-flow graph (a second-order section with feedback coefficients b1 and b2 and quantization Q) has two loops. Loop 1, containing b1 and one delay element, is the critical loop if

(Tb2 + 2Tadd + TQ)/2 < (Tb1 + Tadd + TQ)/1

otherwise loop 2, containing b2 and two delay elements, is the critical loop. The iteration period bound cannot be improved for a given algorithm. However, a new algorithm with a better bound can often be derived from the original algorithm. We recognize two possibilities for improving the bound: reduce the operation latency in the critical loop, or introduce additional delay elements into the loop.
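The bound is easy to evaluate once the directed loops and their operation latencies are known. A sketch; the loop-list representation and the numeric latencies are assumptions made for illustration.

```python
def iteration_period_bound(loops):
    """loops: list of (total_operation_latency, n_delay_elements)
    pairs, one per directed loop. Returns T_min = max_i(T_opi / N_i)."""
    return max(t_op / n for t_op, n in loops)
```

With assumed latencies Tmul = 3, Tadd = 1, TQ = 1, loop 1 gives (3 + 1 + 1)/1 = 5 and loop 2 gives (3 + 2 + 1)/2 = 3, so loop 1 is the critical loop and Tmin = 5.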