Automatic Merge-point Detection for Sequential Equivalence Checking of System-level and RTL Descriptions


Bijan Alizadeh and Masahiro Fujita
VLSI Design and Education Center (VDEC), University of Tokyo, Japan

Abstract. In this paper, we propose a novel approach to verifying the equivalence of a C-based system level description against a Register Transfer Level (RTL) model by looking for merge points as early as possible, which reduces the size of the equivalence checking problems. We tackle the exponential path enumeration problem by identifying merge points as well as equivalent nodes automatically. We describe a hybrid bit- and word-level representation called Linear Taylor Expansion Diagram (LTED) [1], which can be used to check the equivalence of two descriptions at different levels of abstraction. This representation not only has a compact and canonical form, but is also close to high-level descriptions, so that it can be utilized as a formal model for many EDA applications such as synthesis. We then show how this leads to a more effective use of LTED to verify the equivalence of two descriptions at different levels of abstraction. We use the LTED package to successfully verify some industrial circuits. In order to show that our approach is applicable to industrial designs, we apply it to the 64-point Fast Fourier Transform and Viterbi algorithms, which are the most computationally intensive parts of a communication system.

Keywords: Formal Verification, Sequential Equivalence Checking, System on a Chip (SoC), Communication System, Canonical Representation.

1 Introduction

As system on a chip (SoC) designs continue to increase in size and complexity, many companies have paid more attention to designing hardware at higher levels of abstraction, because design changes are faster and simulation speed is higher. In this phase, a C-based high level specification is described and then refined to a Register Transfer Level (RTL) description by adding more and more implementation details at different steps.
Therefore there is a significant increase in the amount of verification required to achieve a functionally correct description at each step if traditional dynamic techniques such as simulation are used. This has led to a trend away from dynamic approaches, and Sequential Equivalence Checking (SEC) methods have become very important for reducing time-to-market. SEC is the process of formally proving functional equivalence of designs that may in general have sequentially different implementations. Examples of sequential differences range from retimed pipelines and differing latencies and throughputs to scheduling and resource allocation differences.

A few approaches have been proposed to perform equivalence checking between a C-based specification and an RTL description. In symbolic simulation based approaches, loop and conditional statements need to be unrolled and then all paths through the code must be explored [2-7]. If dependencies exist between different iterations of a loop statement, the run time of symbolic simulation increases and quality degrades due to the exponential number of paths. For example, consider the C code of Fig. 1(a). After unrolling the for-loop, each iteration needs two execution paths, one for the then branch and one for the else branch. In general, for N iterations we have to enumerate $2^N$ paths, and therefore the exponential path enumeration problem occurs. In addition, the different results computed on the different paths must be tracked, which causes a blow-up in logic if lower level techniques such as BDDs and SAT solvers are utilized.

    (a)
    for (i = 0; i < 2; i++)
        if (a < b + c[i])
            a = b + c[i];
        else
            a = b - c[i];

    (b)
    if (a < b + c[0]) a1 = b + c[0];
    else              a2 = b - c[0];
    /* Potential Merge Point */
    if (a < b + c[1]) a3 = b + c[1];
    else              a4 = b - c[1];
    /* Potential Merge Point */

Fig. 1. Path enumeration of conditional statements (a) original source code (b) potential merge points to be detected.

To cope with this complexity, the basic idea is to look for merge points as shown in Fig. 1(b), because the two branches of an if-then-else statement can be merged again after the statement. In this paper we not only attempt to find merge points automatically but also represent word-level arithmetic functions without requiring bit-level encoding, thanks to a canonical hybrid bit- and word-level representation, i.e., LTED [1]. Furthermore, we point out how to check the equivalence of a C-based description against an RTL model when there is no information about corresponding equivalent points in the two descriptions. The main contributions of our paper are as follows:

- Automatic merge point detection as early as possible to overcome the exponential path enumeration problem.
- Defining cut-planes (each cut-plane is a set of cut-points) as outputs of different iterations of loops in the C-based description and thereby finding equivalent nodes in the RTL model automatically, rather than specifying them in the two descriptions as done in [2].
- Efficient representation of the C-based description as well as the RTL model to reduce the run time for checking their equivalence.

The rest of this paper is structured as follows. Related works are addressed in Section 2. LTED as a hybrid canonical representation is briefly described in Section 3. The automatic merge-point detection approach to checking the equivalence between C-based and RTL descriptions is presented in Section 4, and the 64-point Fast Fourier Transform and Viterbi decoder algorithms are discussed as two case studies in Section 5. Finally, a brief conclusion and future work are given in Section 6.
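The growth argument behind Fig. 1 can be made concrete with a small sketch. The function names below are illustrative and not taken from the paper: `paths_without_merging` counts the paths a symbolic simulator would track for N unrolled if-then-else statements, `branches_with_merging` counts the branch bodies visited when the two branches are merged right after each statement, and `run_fig1` simply executes the loop of Fig. 1(a) on concrete values.

```c
#include <assert.h>
#include <stdint.h>

/* Paths tracked when no branches are merged: each of the n unrolled
   if-then-else statements doubles the number of paths. */
uint64_t paths_without_merging(unsigned n) {
    assert(n < 64);                 /* keep the shift well defined */
    return (uint64_t)1 << n;
}

/* Branch bodies visited when every if-then-else is merged right after
   it: two per statement, independent of the earlier statements. */
uint64_t branches_with_merging(unsigned n) {
    return (uint64_t)2 * n;
}

/* Fig. 1(a) executed on concrete values for iterations i = 0..n-1. */
int run_fig1(int a, int b, const int *c, unsigned n) {
    for (unsigned i = 0; i < n; i++) {
        if (a < b + c[i])
            a = b + c[i];
        else
            a = b - c[i];
    }
    return a;
}
```

For the two iterations of Fig. 1 this gives 4 paths without merging and 4 branch bodies with merging; the difference only becomes visible for larger N, where the first count grows exponentially and the second linearly.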

2 Related Works

Recently, some techniques have been proposed to apply equivalence checking to system level and RTL descriptions [2-7]. In [2], an equivalence checking technique to verify system level design descriptions against their RTL implementations was proposed. It presented an automatic technique to compute high level sequential compare points at which variables of interest in the candidate design descriptions are compared. The two design state machines are started in the same initial state and stepped through every cycle until a sequential compare point is reached. At this point the equivalence of the two state machines is proved using a lower (Boolean) level engine, namely the zChaff Satisfiability (SAT) solver. One limitation of this technique is that it does not scale with the number of cycles: as the number of cycles gets larger, the size of the expression grows quadratically, causing capacity problems for the lower level SAT engine. Furthermore, it may not be applicable to large designs due to arithmetic encoding. In addition, corresponding equivalent points between the two descriptions have to be determined in this technique, although these points may not be obvious at all due to complex control flow.

The authors of [3] have proposed early cut-point insertion for checking the equivalence of high level software against the RTL of combinational components. They introduce cut-points early during the analysis of the software model, rather than after generating a low level hardware equivalent. In this way, they overcome the exponential enumeration of software paths as well as the logic blow-up of tracking merged paths. However, word level information has to be synthesized into bit level because BDDs are used to represent the symbolic expressions, so the capacity is limited by memory and run time requirements.
In addition, it focuses only on combinational equivalence checking and does not address how to extend the proposed method to the sequential equivalence checking problem.

Another approach to equivalence checking between C descriptions is presented in [4]. This approach detects the textual differences between the two target programs and then performs a dependence analysis using program slicing to check for the actual differences between them. It then symbolically simulates these differences and reports the equivalence checking results. Since this process uses syntactic information, the similarity of the target descriptions is essential to its applicability.

A solution with a C-based bounded model checking (CBMC) engine that takes a C program and a Verilog implementation was proposed in [6]. The authors described an innovative method to convert the C program, including pointers and nested loops, into Boolean formulas. The Verilog code is also converted to Boolean formulas by a synthesis-like process. The two programs are then converted into a Boolean satisfiability problem. Since this tool works entirely in the Boolean domain, the capacity of CBMC is limited by space and time considerations.

In [7], a method of equivalence checking between the Finite State Machine with Data-path (FSMD) model of the high-level behavioral specification and the FSMD model of the behavior transformed by the scheduler has been proposed. In this method, cut-points are introduced in one FSMD and computations are then viewed as concatenations of paths from cut-point to cut-point. Finally, equivalent finite path segments in the other FSMD are identified. This technique, however, is not scalable due to its limited application.

In all the above approaches, BDD or SAT based methods are utilized to represent symbolic expressions, while algorithmic specifications such as those for digital signal processing contain many arithmetic operations that have to be encoded into bit level operations. Thus lower-level techniques like BDD or SAT are not able to handle these designs due to the large number of Boolean variables or clauses to be generated.

In order to improve Boolean SAT-based methods, a Hybrid Satisfiability approach (HSAT) has been introduced [8] to generate functional test vectors for RTL designs. This approach creates linear arithmetic constraints for arithmetic operators and conjunctive normal form (CNF) clauses for Boolean logical operators. It then uses 3-SAT checking to solve the logic equations and an integer linear programming (ILP) solver to check the feasibility of the arithmetic equations separately, in different domains. For variables that correspond to the interaction between the Boolean and arithmetic domains of the design, an assignment is selected from the CNF clauses, and the resulting constraints are propagated to the arithmetic domain for the linear program to check for consistency. If variable assignments that satisfy the CNF clauses cause the linear programming constraints in the arithmetic domain to be infeasible, backtracking is needed to select another set of Boolean assignments. Since these two engines operate in separate domains, the performance of HSAT is limited by the heuristics that choose the set of assignments to Boolean variables. In addition, although HSAT is able to model bit- and word-level expressions, it only deals with scalar multiplication because it relies on integer linear programming.
In contrast, in our previous works [1] and [9], we have proposed a canonical hybrid bit- and word-level representation that integrates the two domains in one engine. It also represents the two descriptions to be checked for equivalence in such a way that equivalent nodes can be found automatically, without having to specify state or output mappings between the two descriptions.

3 Hybrid Bit and Word Levels Representation

The goal of this section is to introduce a graph-based representation called Linear Taylor Expansion Diagram (LTED) for functions with a mixed Boolean and integer domain and an integer range, which represents arithmetic operations at a high level of abstraction. Other proposed Word Level Decision Diagrams (WLDDs), in contrast, are graph-based representations that provide a concise representation of integer-valued functions defined over binary variables as a bit vector. A thorough review of WLDDs can be found in [10]. BDD or SAT based methods, on the other hand, suffer from size explosion problems when designs grow in size and complexity. BDD-based verification tools have not been very successful for designs containing large arithmetic data-path units due to prohibitive memory requirements.

In LTED, the functions to be represented are maintained as a single graph in strongly canonical form. We assume that the set of variables is totally ordered and that all of the vertices constructed obey this ordering. Maintaining a canonical form requires obeying a set of conventions for vertex creation as well as weight manipulation. These conventions are similar to those of other word level canonical representations and are not discussed here for brevity. In contrast to TED, LTED is a binary graph-based representation where the algebraic expression $F(X,\ldots)$ is expressed by a hierarchical linearization of the Taylor series expansion []. Suppose variable X is the top variable of $F(X,\ldots)$. Equation (1) shows $F(X,\ldots)$, where const is independent of variable X, while linear is the coefficient of variable X.

$$F(X,\ldots) = F(X{=}0,\ldots) + X\left[F'(X{=}0,\ldots) + \tfrac{1}{2}X\,F''(X{=}0,\ldots) + \cdots\right] = \mathit{const} + X \cdot \mathit{linear} \qquad (1)$$

The LTED data structure consists of Variable nodes v, each of which has as attributes an integer variable var(v) and two children const(v) and linear(v). In order to normalize the weights, any common factor is extracted by taking the greatest common divisor (gcd) of the argument weights. In addition, we adopt the convention that the sign of the extracted weight matches that of the const part. This assumes that gcd always returns a nonnegative value. Once the weights have been normalized, the hash table is searched for an existing vertex, and a new vertex is created only if none is found. Similar to BDDs, each entry in the hash table is indexed by a key formed from the variable and the two children, i.e. the const and linear parts. As long as all vertices are created in this way, the graph will remain in strongly canonical form (see [1] for more details). Fig. 2 illustrates how the following multivariate polynomial expression is represented by LTED.

$$f(X, Y, Z) = 24 - 8Z + 12YZ - 6X^2 - 6X^2 Z$$

Fig. 2. LTED representation of $24 - 8Z + 12YZ - 6X^2 - 6X^2 Z$ (a) decomposition with respect to variable X (b) decomposition with respect to variables X and Y (c) decomposition with respect to variables X, Y and Z.

Let the ordering of variables be X, Y and Z. First the decomposition with respect to variable X is taken into account. As shown in Fig. 2(a), the const part is $24 - 8Z + 12YZ$, and the X-dependent part $-6X^2 - 6X^2 Z = X(-6X - 6XZ)$ gives the linear part $-6X - 6XZ$, which is decomposed with respect to X once more to leave $-6 - 6Z$. After that, the decomposition is performed with respect to variable Y, as shown in Fig. 2(b). Finally the expressions are decomposed with respect to variable Z and a reduced diagram is obtained. In order to reduce the size of an LTED, redundant nodes are removed and isomorphic sub-graphs are merged, as shown in Fig. 2(c).
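The const/linear split of Equation (1) and the gcd-based weight normalization can be illustrated with a minimal sketch. It is restricted to polynomials in a single variable, so it is not the LTED package itself; the type `Poly` and all function names are assumptions made for illustration. The rule used when the const part is zero (take the sign from the linear part) is also an assumption, since the text only fixes the sign convention for a nonzero const part.

```c
#include <assert.h>
#include <stdlib.h>

/* Univariate polynomial c[0] + c[1]*X + ... + c[deg]*X^deg.
   deg < 0 denotes the zero polynomial. Illustrative type only. */
typedef struct { const long *c; int deg; } Poly;

/* const part of Equation (1): F(X = 0). */
long lted_const(Poly f) { return f.deg < 0 ? 0 : f.c[0]; }

/* linear part of Equation (1): (F - const) / X, i.e. the remaining
   coefficients shifted down by one power of X. */
Poly lted_linear(Poly f) {
    Poly g = { f.c + 1, f.deg - 1 };
    return g;
}

/* Rebuild F(x) by applying const + X*linear recursively. */
long lted_eval(Poly f, long x) {
    if (f.deg < 0) return 0;
    return lted_const(f) + x * lted_eval(lted_linear(f), x);
}

static long gcd_nonneg(long a, long b) {
    a = labs(a); b = labs(b);
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;                       /* always nonnegative */
}

/* Edge weights after extracting their common factor. */
typedef struct { long weight, cst, lin; } Normalized;

/* Extract gcd(cst, lin); the extracted weight takes the sign of the
   const part (assumption: of the linear part when const is zero). */
Normalized normalize_weights(long cst, long lin) {
    Normalized r = { 0, cst, lin };
    long g = gcd_nonneg(cst, lin);
    if (g == 0) return r;           /* both weights are zero */
    if ((cst != 0 ? cst : lin) < 0) g = -g;
    r.weight = g; r.cst = cst / g; r.lin = lin / g;
    return r;
}
```

Applied to the pair (24, -8) from $24 - 8Z$, the normalization extracts 8 and leaves (3, -1); applied to (-6, -6) from $-6 - 6Z$, it extracts -6 and leaves (1, 1).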
Analogous to TED and *BMDs, LTED is a canonical representation. In this representation, dashed and solid lines indicate const and linear parts respectively. It should be noted that LTED was introduced in [1] as a graph-based representation with application to formal property verification. In order to have a canonical form, all nodes introduced in [1] except Constant (C) and Variable (V) nodes have been removed. In this representation, basic arithmetic operators such as addition, unary addition, subtraction, unary subtraction and multiplication are available and work on symbolic integer variables. In order to represent Boolean functions, logical bitwise operations including NOT, AND, and OR have been provided.

4 Sequential Equivalence Checking

In this section, we describe a sequential equivalence checking algorithm which is based on the LTED canonical representation. Moreover, we discuss merge point and cut-plane identification techniques.

4.1 Merge-point and Cut-plane Detection Approaches

Fig. 3 depicts our proposed equivalence checking algorithm. An algorithmic specification in C (ASC) and an RTL description in Verilog (RTL) are the inputs to the algorithm. Although the set of cut-planes (C), i.e. the sets of variables that are interesting for observation, can be defined by the user as done in [2], in this paper it is obtained automatically as the outputs of different iterations of loop executions in the ASC, as shown by the first three lines of Fig. 3. This automatic decomposition converts the original description into simpler expressions that can be handled more easily, even though there are data dependencies among different loop iterations.

As illustrated in Fig. 3, first of all a cut-plane is chosen. The cut-plane nearest to the primary inputs is selected for better performance. A straightforward way to do this is to sort the cut-planes from primary inputs to primary outputs in the ASC. The selected cut-plane (CP) is removed from the set of cut-planes (C) and then all variables in CP are created in LTED. In order to detect merge-points of conditional statements such as if-then-else and case statements appearing in the ASC description, variables assigned in different branches of a conditional statement are renamed with different indices (e.g., variable n is defined as $n_1, n_2, \ldots, n_m$ for m cases, as shown in Fig. 3) and then added to CP.

On the other hand, the RTL description is synthesized using a high-level synthesis tool and modeled by a Finite State Machine with Datapath (FSMD). The FSMD adds a datapath including variables and operators on communication to the classic FSM.
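An FSMD of this kind can be pictured as a table whose rows pair a controller state and a condition with datapath operations and a next state. The sketch below is only one possible encoding and is not taken from the paper or from any synthesis tool; the register file, the row layout and the three demo rows in `fsmd_demo` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical datapath: a small register file. */
typedef struct { int r[4]; } Datapath;

typedef int  (*Condition)(const Datapath *);  /* guard of a transition  */
typedef void (*Action)(Datapath *);           /* operations it executes */

/* One row of the transition table, taken in a single clock cycle. */
typedef struct {
    int       state;  /* current controller state          */
    Condition cond;   /* condition that must evaluate true */
    Action    act;    /* datapath operations to execute    */
    int       next;   /* next controller state             */
} Transition;

/* Execute one cycle: fire the first row whose state and condition match. */
int fsmd_step(const Transition *tab, size_t n, int state, Datapath *dp) {
    for (size_t i = 0; i < n; i++)
        if (tab[i].state == state && tab[i].cond(dp)) {
            tab[i].act(dp);
            return tab[i].next;
        }
    return state;     /* no enabled transition: stay in place */
}

/* Demo rows: state 0 adds r1 to r0, state 1 selects min(r0, r2) into r3. */
static int  always(const Datapath *d)   { (void)d; return 1; }
static int  r0_lt_r2(const Datapath *d) { return d->r[0] <  d->r[2]; }
static int  r0_ge_r2(const Datapath *d) { return d->r[0] >= d->r[2]; }
static void add01(Datapath *d)          { d->r[0] += d->r[1]; }
static void take_r0(Datapath *d)        { d->r[3] = d->r[0]; }
static void take_r2(Datapath *d)        { d->r[3] = d->r[2]; }

int fsmd_demo(int r0, int r1, int r2) {
    const Transition tab[] = {
        { 0, always,   add01,   1 },
        { 1, r0_lt_r2, take_r0, 0 },
        { 1, r0_ge_r2, take_r2, 0 },
    };
    Datapath dp = { { r0, r1, r2, 0 } };
    int s = 0;
    s = fsmd_step(tab, 3, s, &dp);   /* first cycle:  add    */
    s = fsmd_step(tab, 3, s, &dp);   /* second cycle: select */
    assert(s == 0);
    return dp.r[3];
}
```

The two rows for state 1 carry complementary conditions, so exactly one of them fires in a given cycle; this is the situation in which the two assigned values become candidates for a merge point.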
The FSMD is represented as a transition table, where we assume each transition is executed in a single clock cycle. Operations associated with each transition of this model are executed sequentially. Each controller transition is defined by the current state, the condition to be satisfied and a set of operations or actions. The condition that evaluates to true determines the transition to be taken and thus the actions to be executed.

In an inner while loop, the FSMD is traversed at the current cycle and all variables on the left hand side of the assignments are created in LTED. While they are being represented in LTED, equivalent nodes are found automatically due to the canonical form of the LTED representation. At any time during this process, if it is found that $n_1, n_2, \ldots, n_m$ are equivalent to some nodes in the RTL model, they are merged into the original variable, i.e. n, and a primary input is introduced in its place. In this way we are able to prevent the exponential path enumeration problem, since it is not necessary to consider the different branches of the conditional statements any further. If the equivalent nodes do not belong to $\{n_1, n_2, \ldots, n_m\}$, we cut out the equivalent part and introduce new primary inputs in their places. These primary inputs are used when the next iteration of the outer while loop is executed. In the inner while loop, the algorithm proceeds to the next state of the RTL model until all variables in the selected cut-plane have been checked for equivalence with some nodes in the ASC. In the outer while loop, the process repeats until no cut-plane is left. If we can carry this process on to the outputs of the two descriptions, then we have formally verified their equivalence.

    Sequential_EC (ASC: Algorithmic Level Model; RTL: RTL Model)
      Cut-plane_i = Variables on the left hand side of the assignments in the i-th iteration of a loop
      C (set of Cut-planes) = union of Cut-plane_i over all loop iterations i
      WHILE (C is not empty)
        Select a cut-plane (CP) and remove it from C (C = C - CP)
        ASC = Generate LTED representation of all variables in CP
        IF (a conditional statement is encountered)
          FOR (each variable n on the left hand side of the assignments)
            Define n_1, n_2, ..., n_m for m different cases instead of n
            CP = CP + {n_1, n_2, ..., n_m}
        WHILE (CP is not empty)
          RTL(t) = Generate LTED representation of all variables that are assigned to at the current cycle (t) of RTL
          IF (a set of variables (v) are assigned to at the current cycle)
            RTL_v = Get LTED representation of v
            IF ((RTL_v is equivalent to some nodes in ASC) in CP)
              CP = CP - v
              IF (v == {n_1, n_2, ..., n_m})
                Merge n_1, n_2, ..., n_m points and introduce primary input
              ELSE
                Introduce primary inputs at v and related nodes in ASC
          Proceed to the next cycle

Fig. 3. Sequential equivalence checking algorithm with cut-plane and merge-point detection.

4.2 Example

Fig. 4 illustrates an example containing the heart of the Viterbi decoder algorithm, the Add-Compare-Select (ACS) block, which will be discussed in detail in Section 5.2. The C code of Fig. 4(a) is a high-level model of the ACS block, while the code of Fig. 4(b) describes the different cycles of its RTL model.
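The role of a canonical form in this example can be sketched with a deliberately simpler stand-in for LTED: since every expression in the ACS block of Fig. 4 is linear in the inputs, each one can be stored as a vector of integer coefficients, and two expressions are equivalent exactly when their vectors are identical. The type `Lin` and the helper names are assumptions made for illustration; the approach below does not handle multiplication and is not the LTED data structure.

```c
#include <assert.h>
#include <string.h>

/* Coefficient slots of a linear expression over the inputs of Fig. 4. */
enum { IN0, IN1, PI0, PI1, CONST, NTERMS };
typedef struct { int k[NTERMS]; } Lin;

Lin lin_var(int v)   { Lin r = {{0}}; r.k[v] = 1;     return r; }
Lin lin_const(int c) { Lin r = {{0}}; r.k[CONST] = c; return r; }
Lin lin_add(Lin a, Lin b) { for (int i = 0; i < NTERMS; i++) a.k[i] += b.k[i]; return a; }
Lin lin_sub(Lin a, Lin b) { for (int i = 0; i < NTERMS; i++) a.k[i] -= b.k[i]; return a; }
int lin_coeff(Lin a, int term) { return a.k[term]; }

/* Canonical form: equivalence is plain structural equality. */
int lin_same(Lin a, Lin b) { return memcmp(a.k, b.k, sizeof a.k) == 0; }

/* ASC model of Fig. 4(a): c0 = b0 + PI0, c1 = b1 + PI1. */
Lin asc_c0(void) {
    Lin b0 = lin_add(lin_var(IN0), lin_var(IN1));
    return lin_add(b0, lin_var(PI0));
}
Lin asc_c1(void) {
    Lin b1 = lin_sub(lin_sub(lin_const(2), lin_var(IN0)), lin_var(IN1));
    return lin_add(b1, lin_var(PI1));
}

/* RTL model of Fig. 4(b): f0 = d0 + PI0, f1 = d1 + PI1. */
Lin rtl_f0(void) {
    Lin d0 = lin_add(lin_var(IN0), lin_var(IN1));     /* cycle t   */
    return lin_add(d0, lin_var(PI0));                 /* cycle t+1 */
}
Lin rtl_f1(void) {
    Lin d1 = lin_sub(lin_sub(lin_const(2), lin_var(IN0)), lin_var(IN1));
    return lin_add(d1, lin_var(PI1));
}

/* The then/else copies of a0 match the else/then copies of e0, so a0
   and e0 can be merged and replaced by a fresh primary input. */
int acs_merge_point_found(void) {
    Lin a01 = asc_c0(), a02 = asc_c1();   /* a0 in then / else branch */
    Lin e01 = rtl_f1(), e02 = rtl_f0();   /* e0 in then / else branch */
    return lin_same(a01, e02) && lin_same(a02, e01);
}
```

The check never evaluates the conditions `c0 < c1` or `f1 < f0`; it only compares the values assigned in the branches, which mirrors the way the branch copies are matched in the text that follows.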
As soon as our proposed verification algorithm encounters the if-then-else statement in the high-level model, it first represents $a_{01} = c_0$ (variable $a_0$ in the then branch) and $a_{02} = c_1$ (variable $a_0$ in the else branch) in LTED with respect to the inputs $in_0$, $in_1$, $PI_0$ and $PI_1$, as shown in Fig. 5(a). After that it looks for equivalent nodes in the RTL model. At cycle t+2 of the RTL model, it finds that $e_{01} = f_1$ and $e_{02} = f_0$ are equivalent to $a_{02}$ and $a_{01}$ respectively, as depicted in Fig. 5(b). Therefore, $a_0$ and $e_0$ in the high-level and RTL models respectively are detected as a merge point and can then be treated as primary inputs in the rest of the two descriptions. In this work, we assume that the condition parts of the conditional statements can be checked using model checking methods, and they are therefore skipped.

    (a) ASC model:
    b0 = in0 + in1;
    b1 = 2 - in0 - in1;
    c0 = b0 + PI0;
    c1 = b1 + PI1;
    if (c0 < c1) a0 = c0;
    else         a0 = c1;

    (b) RTL model:
    Cycle t:    d0 = in0 + in1;   d1 = 2 - in0 - in1;
    Cycle t+1:  f0 = d0 + PI0;    f1 = d1 + PI1;
    Cycle t+2:  if (f1 < f0) e0 = f1;
                else         e0 = f0;

Fig. 4. ACS block in Viterbi benchmark (a) C-based model and (b) RTL model.

Fig. 5. LTED representations of variables in Fig. 4 (a) $a_{01}$ and $a_{02}$ (b) $e_{01}$ and $e_{02}$.

5 Case Studies

In recent years, high speed wireless data communication has found many application areas. Fourth generation wireless and mobile systems currently focus on packet-based high-data-rate communication suitable for video transmission and mobile internet applications. Apart from the high speed of operation, such systems demand low power consumption. A general purpose DSP with associated software is not beneficial for this application, since its power consumption is an order of magnitude higher than that of a dedicated hardware solution. Fig. 6 shows an IEEE 802.11a transmitter and receiver, which contain many signal processing functions such as the convolutional coder and inverse fast Fourier transform (IFFT) in the transmitter, and the fast Fourier transform (FFT) and Viterbi decoder in the receiver. In addition, it has been shown [2] through extensive simulation that the most computationally intensive parts of such a high-data-rate system are the 64-point IFFT in the transmit direction and the Viterbi decoder in the receive direction. Therefore it is necessary to pay close attention to the 64-point FFT and Viterbi decoder blocks.
In order to demonstrate that our approach is applicable to such a complete system solution for communication systems, we present experimental results for two case studies: (1) 64-point Fast Fourier Transform (FFT64) and (2) Viterbi decoder with K=3 (Viterbi3), K=7 (Viterbi7) and K=9 (Viterbi9). The important point to note here is that Boolean SAT based verification is not able to handle all benchmarks discussed here, due to the huge number of Boolean variables or clauses generated after encoding arithmetic functions into bit-level operations.

General information about the benchmark circuits is given in Table 1. Column Benchmark gives the benchmark's name, whereas column #spec provides the number of lines of C code after unrolling all loops. Column #impl reports the number of lines of RTL code after synthesis. The fourth, fifth and sixth columns (#add, #sub and #mul) provide the number of additions, subtractions and multiplications required in each benchmark respectively. For the Viterbi3 benchmark, this information is provided before (Viterbi3bmp) and after (Viterbi3amp) identifying merge-points. While the number of additions to be computed is 6474 before applying the merge-point detection technique, it is reduced to 39 after detecting merge points. For the Viterbi7 and Viterbi9 benchmarks, it is not possible to provide the information before applying the merge-point detection technique, because too many branches of ACS blocks are generated. In Section 5.2, we will see that $2^{42}-2$ and $2^{54}-2$ states would have to be processed for Viterbi7 and Viterbi9 respectively if the merge-point detection technique were not used. In the rest of this paper, experimental results are reported for the LTED package implemented in C++ and run on an Intel 2.GHz Core Duo and GByte of main memory running Windows XP.

Fig. 6. Block diagram of 802.11a Transmitter and Receiver. Transmitter block: Scrambler, FEC, Puncture, Interleaver, Mapper, IFFT, GI Addition. Receiver block: Equalizer, FFT, CFO Correction, Timing Detection, Demapper/Deinterleaver, Depuncture, Viterbi Decoder, Descrambler.

Table 1. Industrial benchmark characteristics.
Columns: Benchmark, #spec, #impl, #add, #sub, #mul
Rows: FFT64, Viterbi3bmp, Viterbi3amp, Viterbi7amp, Viterbi9amp

5.1 64-point FFT Benchmark

The first case study is the 64-point Fast Fourier Transform (FFT64), which is one of the most computationally intensive building blocks in communication systems.
Although the FFT64 is usually realized by decomposing it into a two-dimensional structure of 8-point FFTs to reduce the number of required multiplications compared to the conventional radix-2 FFT64, we consider the conventional radix-2 FFT64 in order to have the maximum number of multiplications. Fig. 7 illustrates the N-point FFT algorithm, which performs the butterfly computations with three main loops. An outer loop counts through the $\log_2 N$ stages of the FFT computation and causes a large number of data-dependent computations. Two inner loops perform the individual butterfly computations of each stage. The heart of this algorithm is the block of code that performs each butterfly computation in the third loop. In this figure, the wr and wi parameters are commonly known as twiddle factors and can be computed before the algorithm is performed. Here, however, we have treated them as symbolic variables rather than constant values to increase the number of arithmetic operations. Although there is no conditional statement in Fig. 7 that would define merge points, many data-dependent computations exist, which makes this test case a suitable benchmark for supporting the claim that our approach is able to deal with real industrial designs even when it only has to determine cut-planes rather than look for merge points. As illustrated in Fig. 8, cut-planes have been defined as the outputs of different iterations of the outer loop in Fig. 7. In Fig. 8, the butterfly diagrams are shown according to the different iterations of the inner loops in Fig. 7.

    for (s = 0; s < log2(N); s++)
      for (i = 0; i < N/2^(s+1); i++)
        C = wr[idx];
        S = wi[idx];
        for (j = i; j < N; j += N/2^s)
          tmpr = aar[idx] - aar[idx+N/2^(s+1)];
          tmpi = aai[idx] - aai[idx+N/2^(s+1)];
          aar[idx] = aar[idx] + aar[idx+N/2^(s+1)];
          aai[idx] = aai[idx] + aai[idx+N/2^(s+1)];
          aar[idx+N/2^(s+1)] = tmpr*C - tmpi*S;
          aai[idx+N/2^(s+1)] = tmpr*S + tmpi*C;
        idx = idx + 2^s;

Fig. 7. C code of N-point FFT benchmark.

Fig. 8. Cut-planes defined in FFT64 benchmark.

Table 2 summarizes the results for two configurations, i.e., 64-point FFT without cut-planes (FFT64nocp) and 64-point FFT with cut-planes (FFT64cp). In this table, columns #Nodes and #InputVar give the number of LTED nodes and the number of input variables respectively. The memory and CPU time required for equivalence checking of the two descriptions are provided in columns Memory Usage (in MByte) and Run Time (in seconds) respectively. After identifying cut-planes as shown in Fig. 8, the number of LTED nodes will decrease to 668 from 220, while the number of inputs will increase from 190 to 830. In other words, 830 - 190 = 640 points were identified as equivalent parts and new primary inputs were introduced in their places. As expected, after applying the cut-plane detection technique, the run time required for checking the equivalence between the two descriptions is reduced from 3.5 seconds to 0.66 second. Moreover, the memory needed to generate the LTED without looking for cut-planes is 0.8 MB, while after applying the cut-plane detection method it is reduced to .3 MB.
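To make the notion of a per-stage cut-plane concrete, the sketch below runs one radix-2 stage at a time so that the arrays after each call are exactly the stage outputs at which a cut-plane could be placed. It is a standard decimation-in-frequency stage written with integer arithmetic, not a transcription of the index scheme of Fig. 7; the function names, the `half` and `stride` parameters and the helper `fft2_second_output` are assumptions. `cut_plane_inputs` encodes one reading that is consistent with the 640 new inputs reported above: one cut-plane after each of the first five of the six stages, with a real and an imaginary variable for each of the 64 points.

```c
#include <assert.h>

/* One radix-2 decimation-in-frequency stage over n points, in place.
   half is the distance between the two partners of a butterfly. */
void fft_stage(long *re, long *im, int n, int half,
               const long *wr, const long *wi) {
    assert(half > 0 && n % (2 * half) == 0);
    int stride = n / (2 * half);            /* twiddle index step */
    for (int base = 0; base < n; base += 2 * half)
        for (int j = 0; j < half; j++) {
            int p = base + j, q = p + half;
            long c = wr[j * stride], s = wi[j * stride];
            long tmpr = re[p] - re[q];
            long tmpi = im[p] - im[q];
            re[p] += re[q];
            im[p] += im[q];
            re[q] = tmpr * c - tmpi * s;    /* same butterfly as Fig. 7 */
            im[q] = tmpr * s + tmpi * c;
        }
}

/* Fresh primary inputs if the outputs of every stage except the last
   become a cut-plane: (stages - 1) planes, n points, re and im each. */
int cut_plane_inputs(int n, int stages) {
    return (stages - 1) * 2 * n;
}

/* Helper for a quick check: second output of a 2-point transform with
   twiddle factor 1 + 0i, which is simply x0 - x1. */
long fft2_second_output(long x0, long x1) {
    long re[2] = { x0, x1 }, im[2] = { 0, 0 };
    const long wr[1] = { 1 }, wi[1] = { 0 };
    fft_stage(re, im, 2, 1, wr, wi);
    return re[1];
}
```

Calling `fft_stage` with `half` equal to 32, 16, 8, 4, 2 and 1 in turn covers the six stages of a 64-point transform; the contents of `re` and `im` between two calls are the values that a cut-plane would replace by new primary inputs.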

Table 2. FFT64 benchmark experimental results.
Columns: Type, #Nodes, #InputVar, Memory Usage, Run Time
Rows: FFT64nocp, FFT64cp

5.2 Viterbi Benchmark

Viterbi decoding is a technique for performing maximum likelihood sequence detection on data that has been convolutionally coded. The decoding problem is to determine the path with the minimum path metric through the trellis, with the path metric being defined as the sum of the branch metrics along the path. This is done in a stepwise manner by processing a set of state metrics forward in time, stage by stage over the trellis, as shown in Fig. 9. The complexity of the Viterbi algorithm lies in the computation of $2^{K-1}$ path metrics for a constraint length K decoder at each time stage. For the rate ½ codes (n=2) we are considering, there are just two predecessor states or branches for each state. Thus, state metric computation involves the calculation of two branch metrics per state and then the selection of the branch that gives the smaller value of the new state metric. The former operation is done in the Branch Metric Unit (BMU), which takes in the received n-bit blocks of data and generates branch metrics by computing the distance between the received data and the actual codeword. The latter selection operation is performed by the Add-Compare-Select (ACS) unit. The ACS unit takes in two state metrics and two branch metrics as input and yields an updated path metric. As the above process is performed, the selected or surviving branches for each state are recorded by storing one survivor bit per state at each trellis stage. The Survivor Management Unit (SMU) is responsible for tracing back through the trellis using the survivor bits to produce the input data bits.

Fig. 9. Block diagram of Viterbi decoder (ChannelInput, BMU 1 to BMU n, ACS 1 to ACS n with potential merge-points at their outputs, SMU (TraceBack Memory), Controller).

In order to better understand the Viterbi decoder algorithm, consider the pseudo code of Fig. 10, where an outer loop is repeated ChannelLength = K*6-1 times. The inner loops run $2^{K-1}$ (the number of states) and n=2 (due to rate ½ codes) times respectively. In each iteration of the inner loops, the branch metric (BrMetric[i][j]) is added to the current path metric using the Add part of the ACS block, then the two updated path metrics at each node (i.e., A and B in Fig. 10) are compared (Compare part of the ACS block), and finally the smaller one is saved and the other is discarded (Select part of the ACS block). Thus, the essence of the Viterbi algorithm lies in the relatively simple operations of add, compare, select and trace-back, which need to be applied to a large number of states.

To give an impression of the complexity and size of the Viterbi decoder benchmark, we compute how many states need to be processed after unrolling the conditional statements of all ACS blocks if we do not look for merge points. In Fig. 10, it is necessary to check 2 states on the first iteration of the second loop nest, $2^2$ states on the second iteration and finally $2^{6K-1}$ states on the last iteration. Therefore the total number of states to be checked is

$$2 + 2^2 + \cdots + 2^{6K-1} = 2^{6K} - 2.$$

For K=7 and K=9, this amounts to $2^{42}-2$ and $2^{54}-2$ states respectively, which is large enough that the methods mentioned in the literature are not able to handle them easily. After looking for merge points, however, the number of states to be processed is reduced to

$$2 + 2^2 + \cdots + 2^{K-2} + \left(2^{K-1} + \cdots + 2^{K-1}\right) = 2\left(2^{K-2} - 1\right) + 5K \cdot 2^{K-1} = (5K+1) \cdot 2^{K-1} - 2.$$

    for (t = 0; t < ChannelLength; t++)
      for (i = 0; i < 2^(K-1); i += step)
        for (j = 0; j < n; j++)
          A = AcumErr[nextstate[i][j]][1];
          B = AcumErr[i][0] + BrMetric[i][j];
          if (A > B)
            AcumErr[nextstate[i][j]][1] = B;
            StateHistory[nextstate[i][j]][t] = i;
      for (i = 0; i < 2^(K-1); i++)
        AcumErr[i][0] = AcumErr[i][1];
        AcumErr[i][1] = MAXINTEGER;

Fig. 10. Pseudo code of Viterbi algorithm.

Cut-planes and Merge-points in Viterbi Benchmark. In this section we discuss how to determine cut-planes and merge-points in the C-based description to reduce the size of the equivalence checking problem. In the Viterbi decoder the first K stages are different from the other stages, as shown in Fig. 11(a), where K is 7. This is because during the first K stages there is only one path from the current state to each next state. For instance at t=1, there is only one way to reach the next states 0, 16, 32 and 48. These stages are the outputs of the corresponding iterations of the outer loop of Fig. 10, which are viable candidates for cut-planes, as illustrated in Fig. 11(a).
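The two closed forms derived earlier in this section from Fig. 10, $2^{6K} - 2$ states without merge points and $(5K+1) \cdot 2^{K-1} - 2$ states with them, can be evaluated directly. The function names in this sketch are illustrative, and the 64-bit range limits it to constraint lengths with 6K below 64, which covers the three decoders considered here.

```c
#include <assert.h>
#include <stdint.h>

/* States to check when no merge points are used:
   2 + 2^2 + ... + 2^(6K-1) = 2^(6K) - 2. */
uint64_t states_without_merge(unsigned k) {
    assert(k >= 1 && 6 * k < 64);       /* result must fit in 64 bits */
    return ((uint64_t)1 << (6 * k)) - 2;
}

/* States to process after merge-point detection:
   (5K + 1) * 2^(K-1) - 2. */
uint64_t states_with_merge(unsigned k) {
    assert(k >= 1 && k < 32);
    return (uint64_t)(5 * k + 1) * ((uint64_t)1 << (k - 1)) - 2;
}
```

For K=7 the two functions return 4398046511102 and 2302; for K=9 they return 18014398509481982 and 11774.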
On the other hand, a different decision flow exists for stages K+1 to $6K-1$, where each state is reachable through two paths. One decision butterfly out of the 32 pairs needed for the Viterbi decoder with K=7 is depicted in Fig. 11(b), where S varies from 0 to 31. In this figure each circle indicates a state and corresponds to an ACS operation in Fig. 10. For instance, consider state S, which can be reached from states 2S and 2S+1 through different branch metrics. According to the Viterbi algorithm described in Fig. 10, to compute the accumulated error metric for this state, AcumErr[2S][0]+BrMetric0 is computed first (B) and then compared to AcumErr[S][1] (A in Fig. 10). The smaller one is saved as the new value of AcumErr[S][1]. This process is repeated when B = AcumErr[2S+1][0]+BrMetric2 is computed and compared to AcumErr[S][1]. As illustrated in Fig. 10, after the second loop nest completes, AcumErr[S][1] is saved into AcumErr[S][0] and is then set to a very large integer, i.e., MAXINTEGER, so that the next iteration of the outer loop starts properly (see the fourth loop in Fig. 10). Obviously, each output of the ACS units has the potential to be a merge point due to the conditional statements.

Fig. 11. (a) First seven stages of Viterbi K=7, with cut-planes marked between stages t=0 to t=6. (b) Decision butterfly for an ACS pair in Viterbi K=7: states 2S and 2S+1 at stage (t-1) connect to states S and S+32 at stage (t) through branch metrics BrMetric0 to BrMetric3.

Experimental Results. Table 3 provides experimental results for six configurations of the Viterbi decoder, i.e., Viterbi (K=3) without merge point detection (Vitbi3nomp), Viterbi (K=3) with merge point detection (Vitbi3mp), Viterbi (K=7) with merge point detection (Vitbi7mp), Viterbi (K=7) with merge point and cut-plane detection (Vitbi7mpcp), Viterbi (K=9) with merge point detection (Vitbi9mp) and Viterbi (K=9) with merge point and cut-plane detection (Vitbi9mpcp). In this table, rows #Nodes and #Vars give the number of LTED nodes and the number of input variables respectively. The memory usage and CPU time needed for equivalence checking of the two descriptions are given in rows Mem (in megabytes) and Time (in seconds) respectively. The second and third columns, i.e., Vitbi3nomp and Vitbi3mp, show the results before and after applying the automatic merge point detection method to the Viterbi K=3 test case. In this case, after finding merge points automatically, 66 new primary inputs (#Vars row in Table 3) are introduced and the number of LTED nodes (#Nodes) is reduced to 355. Moreover, the memory and run time required for equivalence checking are reduced from 36.3 MB to 0.4 MB and from 57.8 seconds to under one second respectively. Columns Vitbi7mp and Vitbi7mpcp in Table 3 give the experimental results for Viterbi K=7.
Although we are not sure whether the LTED package can handle this case without merge point detection, preparing the input file for the package is very difficult because it requires duplicating the number of states on each iteration, where the number of iterations and the number of states on the first iteration are $6K-1 = 41$ and $2^{K-1} = 64$ respectively. Thus we only report the experimental results of Viterbi K=7 after applying the merge point detection technique, where the memory usage and CPU time required for equivalence checking are 6.9 MB and 2.6 seconds. After defining cut-planes, as shown in column Vitbi7mpcp of Table 3, they are reduced to 6 MB and 2 seconds respectively.

Fortunately, the case study in [2] was also Viterbi K=7, which makes it possible to compare results without spending a lot of time applying SAT-based methods to Viterbi K=7. The authors of [2] used zchaff as the SAT solver to check the equivalence between the expressions computed at every cycle of the RTL model and the expressions obtained from the C-based description. They gave a breakdown of the number of clauses in the CNF formula for various blocks. Table 4 compares the experimental results of our method with those of the method proposed in [2]. They reported that, without their decomposition method, the monolithic Trellis computation would generate a CNF with nearly 1.9 million clauses; using the decomposition technique, they created 32 independent CNF formulas that were input to zchaff. Each of these formulas had 5936 clauses and 28 variables. In addition, the number of clauses in the CNF formula for the Trellis computation per butterfly was 57344, while our method requires 352 LTED nodes, 0.28 MB of memory and 0.06 second of run time to check the equivalence between butterflies in the two descriptions. No memory usage or CPU time was reported for the SAT-based method proposed in [2], so the related entries are left blank in Table 4.

The last two columns of Table 3 give the experimental results for Viterbi K=9. After applying the merge-point technique, the LTED package generated the LTED nodes needed to verify the equivalence of the two descriptions in 90 seconds of run time, while the memory manager reported that 27.3 MB of RAM was consumed. This case demonstrates the scalability of our approach in comparison with the method in [2], which was only applied to Viterbi K=7 and cannot deal with Viterbi K=9 due to the computational explosion problem of lower-level SAT-based methods.

Table 3. Experimental results of Viterbi benchmark. Columns: Vitbi3nomp, Vitbi3mp, Vitbi7mp, Vitbi7mpcp, Vitbi9mp, Vitbi9mpcp. Rows: #Nodes, #Vars, Mem (MB), Time (s).

Table 4.
Experimental results of Trellis computation per butterfly in Viterbi benchmark. Columns: Technique, #Nodes, #Var, Memory (MByte), Time (Sec), #add, #sub. Rows: Our Method, Method in [2].

Conclusion and Future Work

In this paper, we proposed an automatic merge-point detection technique based on a hybrid bit- and word-level canonical representation called LTED. We then used it to check the equivalence between the C-based specification and the RTL implementation of two large industrial circuits, i.e., the 64-point FFT algorithm (FFT64) and the Viterbi decoder with K=3, 7, 9. This representation is strong enough to handle arithmetic operations at the word level, so there is no need to encode them as bit-level operations. As opposed to low-level methods such as the Boolean SAT-based techniques reported in the literature, the empirical results indicate that our approach not only uses an efficient canonical form to represent symbolic expressions but also scales to large industrial circuits. An obvious direction for future work is to integrate the LTED package with a SpecC environment to address equivalence checking between different abstractions of SpecC as a system-level language.

Acknowledgement

This work was supported in part by Semiconductor Technology Academic Research Center (STARC).

References

1. Alizadeh, B., Fujita, M.: LTED: A Canonical and Compact Hybrid Word-Boolean Representation as a Formal Model for Hardware/Software Co-designs. The Fourth Workshop on Constraints in Formal Verification (CFV 2007)
2. Vasudevan, S., Viswanath, V., Abraham, J., Tu, J.: Automatic Decomposition for Sequential Equivalence Checking of System Level and RTL Descriptions. In Proceedings of Formal Methods and Models for Co-Design (MemoCode 2006)
3. Feng, X., Hu, A.: Early Cutpoint Insertion for High-Level Software vs. RTL Formal Combinational Equivalence Verification. In Proceedings of 43rd Design Automation Conference (DAC 2006)
4. Matsumoto, T., Saito, H., Fujita, M.: Equivalence checking of C programs by locally performing symbolic simulation on dependence graphs. In Proceedings of 7th International Symposium on Quality Electronic Design (ISQED 2006)
5. Koelbl, A., Lu, Y., Mathur, A.: Embedded tutorial: Formal Equivalence Checking Between System-level Models and RTL. In Proceedings of ICCAD (2005)
6. Kroening, D., Clarke, E., Yorav, K.: Behavioral Consistency of C and Verilog Programs Using Bounded Model Checking. In Proceedings of 40th Design Automation Conference (DAC 2003)
7. Karfa, C., Mandal, C., Sarkar, D., Pentakota, S. R., Reade, C.: A Formal Verification Method of Scheduling in High-level Synthesis. In Proceedings of 7th International Symposium on Quality Electronic Design (ISQED 2006)
8. Fallah, F., Devadas, S., Keutzer, K.: Functional Vector Generation for HDL Models Using Linear Programming and 3-Satisfiability. In Proceedings of 35th Design Automation Conference (DAC 1998)
9. Alizadeh, B., Fujita, M.: A Hybrid Approach for Equivalence Checking Between System Level and RTL Descriptions. In 16th International Workshop on Logic and Synthesis (IWLS)
10. Horeth, S., Drechsler, R.: Formal Verification of Word-Level Specifications. In Proceedings of Design Automation and Test in Europe (DATE 1999)
11. Alizadeh, B., Navabi, Z.: Word Level Symbolic Simulation in Processor Verification. In IEE Proceedings Computers and Digital Techniques Journal, Vol. 151, No. 5 (2004)
12. Grass, E., Tittelbach, K., Jagdhold, U., Troya, A., Lippert, G., Krueger, O., Lehmann, J., Maharatna, K., Fiebig, N., Dombrowski, K., Kraemer, R., Aehoenen, P.: On the Single Chip Implementation of a Hiperlan/2 and IEEE802.11a Capable Modem. In IEEE Pers. Commun., Vol. 8 (2001) 48-57

Use of Non-linear Solver to Check Assertions of Behavioral Descriptions

I. Ugarte, P. Sanchez
Microelectronics Engineering Group, TEISA Department, ETSIIT, University of Cantabria
{ugarte, sanchez}@teisa.unican.es