Exploring Parallel Processing Levels for Convolutional Turbo Decoding


Olivier Muller, Amer Baghdadi, Michel Jézéquel
Electronics Department, GET/ENST Bretagne, Technopôle Brest Iroise, 29238 Brest, France
olivier.muller@enst-bretagne.fr, amer.baghdadi@enst-bretagne.fr, michel.jezequel@enst-bretagne.fr

Abstract

In forward error correction, convolutional turbo codes were introduced to bring error correction capability close to the Shannon bound. Decoding these codes, however, is an iterative process requiring a high computation rate and incurring high latency. Thus, to achieve the high throughput and low latency that are crucial in emerging digital communication applications, parallel implementations become mandatory. In this paper, we explore parallelism in convolutional turbo decoding with the BCJR algorithm and propose a multi-level classification of the explored parallelism techniques. We also present promising results on the sub-block and component-decoder levels of parallelism. The sub-block parallelism results show that, for sub-block initialization, the message passing technique outperforms the acquisition approach. Furthermore, sub-block parallelism becomes quite inefficient in terms of speed gain at high sub-block parallelism degrees. Conversely, component-decoder parallelism efficiency, which depends only on the interleaving rules, increases with the sub-block parallelism degree.

1. Introduction

Digital communication systems, such as fiber-optic communication, wireless communication and storage applications, require very high data rates as well as powerful error correction capabilities. For the latter, performance approaching the Shannon bound is obtained with iterative decoding algorithms, such as turbo decoding [1] or LDPC decoding [2].
As those algorithms are characterized by their high complexity, achieving high data rates requires optimal parallelism exploitation. Parallelism in convolutional turbo decoding has been widely investigated over the last few years, either at a fine grain level [4] [5], on the symbol elementary computations of the decoding algorithm, called the BCJR or Forward-Backward algorithm [3], or at a coarse grain level [6-9], mainly based on the frame decoding scheme. Especially in the latter, new parallelism techniques continue to appear, of which the recently introduced shuffled decoding [10] constitutes a typical example. In this paper, we classify the existing parallelism possibilities in convolutional turbo decoding with the BCJR algorithm. We also propose a performance analysis of the parallelism efficiency related to sub-block decoding and shuffled decoding.

The rest of the paper is organized as follows. The next section presents the convolutional turbo decoding algorithm for a better understanding of the subsequent sections. Section 3 proposes a multi-level classification of turbo decoding parallelism. In sections 4 and 5, sub-block parallelism and component-decoder parallelism (shuffled decoding) are respectively analyzed on the basis of parallelism efficiency criteria. Finally, section 6 summarizes the results obtained and concludes the paper.

2. Convolutional turbo decoding

Introduced in 1993, the turbo principle [1] relies on information exchange and iterative processing between different elementary blocks. The exchanged information is called extrinsic information. For parallel convolutional turbo codes (Figure 1.a), the elementary blocks are the component decoders. The convolutional decoding is performed using the BCJR algorithm [3], which is the optimal algorithm for maximum a posteriori (MAP) decoding of convolutional codes. In practice, a log-domain derivation of the algorithm (Log-MAP) is used. The Log-MAP algorithm can be approximated by the Max-Log-MAP algorithm [15].
The BCJR algorithm is implemented in Soft-Input Soft-Output (SISO) decoders. Using the input symbols and a priori extrinsic information, each SISO decoder computes a posteriori probabilities (APP). These APPs constitute the a priori information for the other decoder and are exchanged via interleaving (Π) and deinterleaving (Π⁻¹) processes. Figure 1.b illustrates the main steps of the BCJR algorithm. Firstly, the branch metric (or γ metric) between two states represents the probability that a transition occurs between these two states.
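The α and β recursions that complete these steps (eqs. 1 and 2 below) reduce, in the max-log domain, to repeated add-compare-select updates over the trellis states. The following sketch illustrates them on a toy fully connected trellis; the function name, the fully connected transition structure and the absence of normalization are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def max_log_recursions(gamma):
    """Forward/backward recursions of eqs. (1)-(2) in the max-log domain
    (Max-Log-MAP: sums replaced by max). gamma[k, s_prev, s_next] is the
    log-domain branch metric of trellis section k on a toy fully connected
    trellis; real decoders restrict transitions by the code structure and
    add metric normalization. Illustrative sketch only."""
    K, S, _ = gamma.shape
    alpha = np.full((K + 1, S), -np.inf)
    beta = np.full((K + 1, S), -np.inf)
    alpha[0, 0] = 0.0               # frame starts in state 0
    beta[K, :] = 0.0                # uniform termination at the frame end
    for k in range(K):              # forward recursion, eq. (1)
        for s in range(S):
            alpha[k + 1, s] = max(alpha[k, sp] + gamma[k, sp, s]
                                  for sp in range(S))
    for k in range(K - 1, -1, -1):  # backward recursion, eq. (2)
        for s in range(S):
            beta[k, s] = max(beta[k + 1, sn] + gamma[k, s, sn]
                             for sn in range(S))
    return alpha, beta
```

The symmetric structure of the two loops is what the schemes of Figure 1.c and 1.d exploit: they can run concurrently on shared branch metrics.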

Figure 1. Turbo decoding: (a) turbo decoder, (b) BCJR SISO, (c) Forward-Backward scheme, (d) butterfly scheme

Secondly, the forward recursion (or α recursion) computes the probability of all the states in the trellis given the past observations (eq. 1). This processing is recursive, since a trellis section (i.e. the probability of all states) is computed using the previous trellis section and the branch metrics between these two sections.

  α_k(s) = Σ_{s'=0}^{2^ν − 1} α_{k−1}(s') γ_k(s', s)    (1)

Thirdly, the backward recursion (or β recursion) computes the probability of all the states in the trellis given the future observations (eq. 2). This computation is similar to the forward recursion, but the frame is processed in the backward direction.

  β_k(s) = Σ_{s'=0}^{2^ν − 1} β_{k+1}(s') γ_{k+1}(s, s')    (2)

Finally, the extrinsic information is computed from the forward recursion, the backward recursion and the extrinsic part γ^e of the branch metrics (eq. 3).

  Pr_e(d_k = i) = [ Σ_{(s',s) / d(s',s)=i} α_{k−1}(s') γ^e_k(s', s) β_k(s) ] / [ Σ_{(s',s)} α_{k−1}(s') γ^e_k(s', s) β_k(s) ]    (3)

3. Parallel processing levels

In turbo decoding with the BCJR algorithm, parallelism techniques can be classified at three levels: (1) BCJR metric level parallelism, (2) BCJR-SISO decoder level parallelism, and (3) turbo-decoder level parallelism. The first (lowest) parallelism level concerns the symbol elementary computations inside a SISO decoder executing the BCJR algorithm. Parallelism between these SISO decoders, inside one turbo decoder, belongs to the second parallelism level. The third (highest) parallelism level duplicates the turbo decoder itself.

3.1. BCJR metric level parallelism

BCJR metric level parallelism concerns the processing of all metrics involved in the decoding of each received symbol inside a BCJR SISO decoder (Figure 1). It exploits the inherent parallelism of the trellis structure, and also the parallelism of the BCJR computations [4] [5].

3.1.1. Parallelism of trellis transitions

Trellis-transition parallelism can easily be extracted from the trellis structure, as the same operations are repeated for all transition pairs. In the log domain [15], these operations are either ACS operations (Add-Compare-Select) for the Max-Log-MAP algorithm or ACSO operations (ACS with a correction Offset [15]) for the Log-MAP algorithm. Each BCJR computation (eqs. 1-3) requires a number of ACS-like operations equal to half the number of transitions per trellis section. Thus this number, which depends on the structure of the convolutional code, constitutes the upper bound of the trellis-transition parallelism degree. This parallelism implies only a low area overhead, as only the ACS units have to be duplicated. In particular, no additional memories are required, since all the parallelized operations are executed on the same trellis section, and consequently on the same data.

3.1.2. Parallelism of BCJR computations

A second metric parallelism can be orthogonally extracted from the BCJR algorithm through the parallel execution of the three BCJR computations. Parallel execution of the backward recursion and the APP computations was proposed with the original Forward-Backward scheme, depicted in Figure 1.c. In this scheme, the BCJR computation parallelism degree is equal to one in the forward part and two in the backward part. To increase this parallelism degree, several schemes have been proposed [8]. Figure 1.d shows the butterfly scheme, which doubles the parallelism degree of the original scheme through the parallelism between the forward and

backward recursion computations. This is performed without any memory increase; only the BCJR computation resources have to be duplicated. Thus, BCJR computation parallelism is area efficient but still limited in parallelism degree.

In conclusion, BCJR metric level parallelism achieves optimal area efficiency, as it does not affect the memory size, which occupies most of the area in a turbo decoder circuit. Nevertheless, the parallelism degree is limited by the decoding algorithm and the code structure. Thus, achieving a higher parallelism degree implies exploring higher processing levels.

3.2. BCJR-SISO decoder level parallelism

The second level of parallelism concerns the SISO decoder level. It consists of the use of multiple SISO decoders, each executing the BCJR algorithm and processing a sub-block of the same frame in one of the two interleaving orders. This level of parallelism can reach a reasonable parallelism degree and preserve memory area [6]. Two kinds of parallelism exist in this class: sub-block parallelism and component-decoder parallelism.

3.2.1. Sub-block parallelism

In sub-block parallelism, each frame is divided into M sub-blocks and each sub-block is then processed on a BCJR-SISO decoder using adequate initializations [6] [7] [8] [9]. A graphical formalism is proposed in [8] to compare various existing sub-block decoding schemes with respect to parallelism degree and memory efficiency. Besides the duplication of BCJR-SISO decoders, this parallelism imposes two other constraints. On the one hand, the interleaving has to be parallelized in order to extend the communication bandwidth proportionally [12]. On the other hand, the BCJR-SISO decoders have to be initialized adequately, as detailed in section 4.

3.2.2. Component-decoder parallelism

Component-decoder parallelism is a new kind of parallelism that has become practical with the introduction of the shuffled decoding technique [10].
The basic idea of shuffled decoding is to execute all component decoders in parallel and to exchange APP information as soon as it is created. With this decoding scheme, the decoding time can theoretically be halved compared with the serial approach at the same iteration number. Section 5 analyzes the performance that can be obtained with this kind of parallelism.

3.3. Turbo-decoder level parallelism

The highest level of parallelism duplicates whole turbo decoders to process iterations and/or frames in parallel. Iteration parallelism occurs in a pipelined fashion with a maximum pipeline depth equal to the iteration number, whereas frame parallelism presents no limitation in parallelism degree. Turbo-decoder level parallelism, however, is too area-expensive (all memories and computation resources are duplicated) and presents no gain in decoding latency.

Table 1. Parallel processing levels

Level             | Parallelism
BCJR metric       | Trellis transitions; BCJR computations
BCJR-SISO decoder | Sub-blocks; Component decoders
Turbo-decoder     | Iterations; Frames

4. Initialization in sub-block parallelism

As described in section 3, sub-block parallelism takes place at frame level and requires initializations. These initializations are mandatory, as information on the recursion metrics is available at the frame ending points, but not at the sub-block ending points [7]. An estimate of this undetermined information can be obtained either by acquisition or by message passing between neighboring sub-blocks.

4.1. Initialization by acquisition

This widely used initialization method consists in estimating the recursion metrics thanks to an overlapping region called the acquisition window or prologue. Starting from a trellis section where all the states are initialized to a uniform constant, the acquisition window is processed over its length, denoted AL, to provide reliable state metrics at the beginning of the sub-block. This acquisition length is determined at design time so as to make the error rate degradation negligible.
It is fixed as a function of the number of redundancies in the prologue, typically 6. Another empirical rule recommends 3 to 5 times the constraint length of the code for this acquisition length [7]. When all the sub-blocks are initialized with the acquisition method, the decoding time (t_d), the speed gain (S_g) and the additional computation ratio (R_C) can be expressed as:

  t_d ∝ (N/d + AL) · it    (4)

  S_g = d / (1 + AL·d/N)    (5)

  R_C = AL·(d − 1) / N    (6)

where N represents the frame length, d the sub-block parallelism degree and it the number of iterations. Equation 4 clearly shows that the decoding time tends towards a constant value when the parallelism degree

increases. Thus sub-block parallelism with initialization by acquisition encounters a throughput ceiling, and the maximum speed gain is equal to N/(AL+1). The corresponding efficiency, which is defined as the speed gain S_g divided by the parallelism degree d, decreases to the minimum value 1/(AL+1). Furthermore, the additional computation ratio, which concerns recursion computations and input data memory accesses, increases linearly with the parallelism degree.

4.2. Initialization by message passing

The second method initializes a sub-block dynamically with the recursion metrics computed during the last iteration in the neighboring sub-blocks [9]. This technique therefore requires no additional hardware except some communication resources between the BCJR SISO units. To evaluate this technique, the bit error rate performance degradation has to be evaluated and compensated with additional iterations. In Figure 2, the Frame Error Rate (FER) performance is represented for different parallelism degrees as a function of the iteration number. The figure shows that the asymptotic error rate is not affected by the message passing approach, whatever the parallelism degree. This ensures that initialization by message passing can be used without degradation.

Figure 2. Convergence of the message passing technique, DVB-RCS, R=6/7, 188-byte frame, SNR=4.2 dB, 5-bit quantization, Log-MAP algorithm

However, the figure also reveals that additional iterations are mandatory to reach a given FER. The decoding time (t_d) and speed gain (S_g) can be expressed as:

  t_d ∝ (N/d) · it_MP    (7)

  S_g = d · it / it_MP    (8)

where it_MP represents the iteration number with the message passing technique. An estimate of it_MP can be obtained at a fixed FER value. In Figure 3, the iteration number with the message passing technique, the speed gain and the efficiency of the technique are represented as a function of the parallelism degree.
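Equations (7)-(8) can be explored numerically. The sketch below additionally assumes the piecewise iteration model suggested by Figure 3 (it_MP constant below a threshold degree, then growing linearly); the function name, the threshold and the constant C are illustrative assumptions, since both are empirical and depend on the frame size and code rate.

```python
def mp_speed_gain(d, it, threshold, C):
    """Speed gain of message-passing initialization, eq. (8), under the
    piecewise model it_MP = it for d below the threshold degree and
    it_MP = it + C*d above it. Returns (speed gain, efficiency)."""
    it_mp = it if d < threshold else it + C * d
    s_g = d * it / it_mp      # eq. (8)
    return s_g, s_g / d       # efficiency = it / it_MP

# Below the threshold the efficiency stays at its maximum of 1; far
# above it the speed gain flattens towards it / C.
```

This makes the ceiling explicit: once d exceeds the threshold, S_g = d·it/(it + C·d), which saturates at it/C however large d becomes.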
Figure 3. Iterations, speed gain and efficiency with the message passing technique, DVB-RCS, R=6/7, 188-byte frame, SNR=4.2 dB, 5-bit quantization, Log-MAP algorithm

It appears that the iteration number is constant for small parallelism degrees, and that beyond a threshold (8 in the presented case) it increases linearly with the parallelism degree. In terms of decoding time, this can be written as: it_MP = it if d is less than the threshold, and it_MP = it + C·d if d is greater than the threshold, where C is a constant. Like sub-block parallelism with initialization by acquisition, initialization by message passing also encounters a throughput ceiling. The maximum speed gain is roughly equal to it/C, corresponding to an efficiency of roughly it/(C·d). The threshold position strongly depends on the sub-block size. It can be physically interpreted as the minimum sub-block size which provides reliable recursion values at the end of the first iteration. Below this minimum size, the recursion values have to be refined using more iterations. Thus this threshold will change according to the frame size and the code rate, as the latter has an influence on recursion reliability.

4.3. Efficiency and performance comparison

In Figure 4, the sub-block parallelism efficiency of both initialization methods is compared. Under the presented conditions, the efficiency of the message passing technique and of the acquisition technique with a 16-symbol acquisition length are quite similar at high sub-block parallelism degrees. Nevertheless, at low sub-block parallelism degrees, the efficiency of the message passing technique is constant and equal to the maximum efficiency, and thus outperforms the acquisition technique whatever the acquisition length.
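The acquisition-side efficiency curves of Figure 4 follow directly from eq. (5); a small helper (the function name and example values are ours) makes the trend concrete:

```python
def acquisition_efficiency(N, AL, d):
    """Efficiency of sub-block parallelism with acquisition
    initialization: the speed gain of eq. (5) divided by the degree d.
    N: frame length in symbols, AL: acquisition length."""
    s_g = d / (1 + AL * d / N)    # speed gain, eq. (5)
    return s_g / d                # efficiency = 1 / (1 + AL*d/N)

# With N = 1504: AL = 16 gives an efficiency of about 0.85 at d = 16
# but only about 0.60 at d = 64, illustrating the collapse of
# sub-block parallelism efficiency at high degrees.
```

Message passing, by contrast, holds the maximum efficiency up to its threshold degree, which is why it dominates at low parallelism degrees.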

Figure 4. Efficiency and initialization methods (message passing; acquisition with AL=8, AL=16, AL=32), DVB-RCS, R=6/7, 188-byte frame, SNR=4.2 dB, 5-bit quantization, Log-MAP algorithm

Furthermore, initialization by acquisition degrades the error-rate performance, whereas initialization by message passing induces no degradation. Figure 5 illustrates the Frame Error Rate (FER) performance for a DVB-RCS code [14] with a sub-block parallelism degree of 47. A 0.15 dB degradation is observed between initialization with a 32-symbol acquisition length and either message passing initialization or Max-Log-MAP decoding without parallelism.

Figure 5. FER and initialization methods for a high parallelism degree (47), DVB-RCS, R=6/7, 188-byte frame, 5-bit quantization, Max-Log-MAP algorithm

The comparison between the two techniques is clearly in favor of the message passing technique, which enables better error rate performance without resource overhead while giving similar efficiency. However, the simulation results (Figure 4) also show that sub-block parallelism, whatever the initialization method, becomes quite inefficient at high parallelism degrees.

5. Component-decoder parallelism analysis

As described in section 3, component-decoder parallelism takes advantage of the shuffled decoding technique, which executes all component decoders in parallel and exchanges APP information as soon as it is created. The following section analyzes the efficiency of this parallelism.

5.1. Shuffled decoding efficiency

Like sub-block parallelism efficiency, component-decoder parallelism efficiency is defined as the speed gain divided by the parallelism degree at equivalent error rate performance. For shuffled decoding, the parallelism degree is limited to the number of component decoders (usually 2) and only the iteration number can vary in the speed gain (eq. 8). Thus shuffled decoding efficiency depends only on the iteration number needed to reach the same error rate performance as serial decoding. Simulation results demonstrate efficiencies ranging from 0.6 to 0.95. By definition, shuffled decoding efficiency is computed with a given set of component decoders, with interleavers of a defined size and at a fixed SNR. The efficiency can always be computed along the turbo decoder convergence process, as shuffled decoding and serial decoding converge to the same value. Simulations reveal that the efficiency is almost invariant along the turbo decoder convergence. Then, with a defined error rate at various SNRs, it is also possible to show that the efficiency is SNR invariant. So shuffled decoding efficiency can only depend on the interleaving rules and on BCJR-SISO decoder parallelism.

5.2. Shuffled decoding and interleaving

The dependency between shuffled decoding and interleaving has already been studied in [10]. According to the interleaving law Π, symbols belong to three different classes. The first class contains all points processed at the same time in interleaved and deinterleaved order, such that t(k) = t(Π(k)). The second class contains all points verifying t(k) < t(Π(k)), and the third all points verifying t(k) > t(Π(k)). A symbol of the first class is processed concurrently by both component decoders, so the decoders cannot take advantage of the APPs sent by the other decoder before the next iteration. When component-decoder and sub-block parallelism are used at the same time, the number of first-class symbols increases with the sub-block parallelism degree. Nevertheless, shuffled decoding efficiency increases with the sub-block parallelism degree, as shown in Table 2 and Table 3.

Table 2. Efficiency and sub-block parallelism degree with the 53-byte DVB-RCS interleaver (R=6/7; Max-Log-MAP; SNR=4.0 dB; FER=1.6e-3)

Sub-block parallelism degree | Iterations without shuffling | Iterations with shuffling | Efficiency
1  | 8  | 12 | 0.66
4  | 11 | 15 | 0.73
8  | 16 | 20 | 0.8
53 | 47 | 51 | 0.92
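The three symbol classes of Section 5.2 can be counted directly from an interleaving law. The sketch below assumes the simplest timing model, where symbol k is processed at time t(k) = k; the function name and this timing model are our simplifying assumptions.

```python
def count_symbol_classes(perm):
    """Count the three symbol classes of a permutation perm, where
    perm[k] is the interleaved position of natural-order symbol k,
    under the simplifying assumption t(k) = k."""
    first = sum(1 for k, pk in enumerate(perm) if k == pk)   # t(k) == t(perm(k))
    second = sum(1 for k, pk in enumerate(perm) if k < pk)   # t(k) <  t(perm(k))
    third = sum(1 for k, pk in enumerate(perm) if k > pk)    # t(k) >  t(perm(k))
    return first, second, third
```

For example, count_symbol_classes([0, 2, 1]) returns (1, 1, 1): position 0 is a fixed point and hence a first-class symbol, whose APP update only benefits the other component decoder at the next iteration.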

Table 3. Efficiency and sub-block parallelism degree with the 188-byte DVB-RCS interleaver (R=6/7; Max-Log-MAP; SNR=4.0 dB; FER=1.6e-3)

Sub-block parallelism degree | Iterations without shuffling | Iterations with shuffling | Efficiency
1   | 8  | 11 | 0.72
2   | 9  | 11 | 0.82
4   | 9  | 12 | 0.75
16  | 13 | 15 | 0.86
64  | 19 | 23 | 0.83
128 | 34 | 37 | 0.92

This result can be explained by the fact that the iteration number increases with the sub-block parallelism degree: the penalty of postponing first-class APP exchanges to the next iteration becomes less significant over the whole decoding, and shuffled decoding efficiency is thereby improved.

5.3. Combining component-decoder and sub-block parallelism

From these results, it makes sense to combine component-decoder parallelism and sub-block parallelism. Indeed, component-decoder parallelism efficiency increases with the sub-block parallelism degree while, at the same time, sub-block parallelism efficiency decreases. To determine when component-decoder parallelism becomes more efficient than sub-block parallelism, sub-block parallelism with parallelism degree d should become the new reference in the efficiency computation. Thus the efficiency of doubling the sub-block parallelism degree to 2d can be compared with the efficiency of shuffled decoding at the reference parallelism degree d, in order to select the most efficient parallelism. In our examples, shuffled decoding becomes more efficient for d greater than 4 in Table 2 and for d greater than 16 in Table 3. The obtained results illustrate that, beyond a certain bound, component-decoder parallelism becomes more efficient than sub-block parallelism.

6. Conclusion

In this paper we analyzed and classified the various parallelism techniques that can be used in convolutional turbo decoding with the BCJR algorithm. The proposed three-level classification comprises BCJR metric level parallelism, BCJR-SISO decoder level parallelism, and turbo-decoder level parallelism.
It has been shown that sub-block initialization is more efficient with the message passing technique than with the acquisition technique, and also that sub-block parallelism becomes inefficient at high sub-block parallelism degrees. On the contrary, component-decoder parallelism, with the newly introduced shuffled decoding technique, becomes more efficient at high sub-block parallelism degrees. The efficiency of this parallelism depends only on the interleaving rules. Furthermore, a criterion based on the analysis of parallelism efficiency is proposed to help select between these two parallelism techniques.

References

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes," in Proc. 1993 International Conference on Communications (ICC '93), Geneva, Switzerland, 1993.
[2] D. J. C. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Trans. Inf. Theory, vol. 45, pp. 399-431, Mar. 1999.
[3] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, pp. 284-287, Mar. 1974.
[4] G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, "VLSI architectures for turbo codes," IEEE Trans. VLSI Syst., vol. 7, pp. 369-379, Sept. 1999.
[5] E. Boutillon, W. J. Gross, and P. G. Gulak, "VLSI architectures for the MAP algorithm," IEEE Trans. Commun., vol. 51, pp. 175-185, Feb. 2003.
[6] C. Schurgers, F. Catthoor, and M. Engels, "Memory optimization of MAP turbo decoder algorithms," IEEE Trans. VLSI Syst., vol. 9, pp. 305-312, Apr. 2001.
[7] T. Wolf, "Initialization of Sliding Windows in Turbo Decoders," 3rd International Symposium on Turbo Codes and Related Topics, Brest, France, pp. 219-222, Sept. 2003.
[8] Y. Zhang and K. K. Parhi, "Parallel Turbo decoding," Proceedings of the International Symposium on Circuits and Systems, vol. 2, 23-26 May 2004, pp. II-509-512.
[9] A. Abbasfar and K. Yao, "An Efficient Architecture for High Speed Turbo Decoders," Proc.
of ICASSP 2003, pp. IV-521-IV-524, April 2003.
[10] J. Zhang and M. P. C. Fossorier, "Shuffled iterative decoding," IEEE Transactions on Communications, vol. 53, no. 2, pp. 209-213, Feb. 2005.
[11] D. Gnaëdig, E. Boutillon, M. Jézéquel, V. Gaudet, and G. Gulak, "On Multiple Slice Turbo Codes," 3rd International Symposium on Turbo Codes and Related Topics, Brest, France, pp. 343-346, Sept. 2003.
[12] F. Gilbert, M. Thul, and N. Wehn, "Communication Centric Architectures for Turbo-Decoding on Embedded Multiprocessors," Proceedings of DATE 2003, Munich.
[13] M. J. Thul, F. Gilbert, T. Vogt, G. Kreiselmaier, and N. Wehn, "A Scalable System Architecture for High-Throughput Turbo-Decoders," Journal of VLSI Signal Processing, vol. 39, pp. 63-77, Netherlands, 2005.
[14] C. Douillard, M. Jézéquel, C. Berrou, N. Brengarth, J. Tousch, and N. Pham, "The Turbo Code Standard for DVB-RCS," 2nd International Symposium on Turbo Codes & Related Topics, Brest, France, 2000, pp. 535-538.
[15] P. Robertson, P. Hoeher, and E. Villebrun, "Optimal and Sub-Optimal Maximum a Posteriori Algorithms Suitable for Turbo Decoding," European Transactions on Telecommunications (ETT), vol. 8, no. 2, 1997, pp. 119-125.