Scalable ASIP Implementation and Parallelization of a MIMO Sphere Detector

Size: px

Start display at page:

Download "Scalable ASIP Implementation and Parallelization of a MIMO Sphere Detector"

Marcia Grant
5 years ago
Views:

1 Scalable ASIP Implementation and Parallelization of a MIMO Sphere Detector Esther P. Adeva and Björn Mennenga and Gerhard Fettweis Vodafone Chair Mobile Communication Systems Technische Universität Dresden Dresden, Germany esther.perez, mennenga, fettweis@ifn.et.tu-dresden.de Abstract High detection complexity is known to be one of the major challenges in MIMO communications based on spatial multiplexing. Tuple Search Detector (TSD) was recently introduced, significantly reducing detection complexity in comparison to conventional algorithms while achieving close to full maxlog-app BER performance. Besides high computational complexity, irregular control flow and sequential nature of the tree search represent major limitations of depth-first-based detectors, frustrating efficient application of parallelization techniques and hence leading to inefficient realizations with regard to most practical applications. This work 1 presents a novel TSD implementation, scalable in constellation size and number of antennas and mapped to a highly parallel and pipelined ASIP architecture. Major challenges and key strategies enabling a high-throughput and low-complexity realization are presented and performance of the resulting flexible and efficient detector implementation is evaluated. Proposed realization is shown to achieve > 300 Mbps throughput at a reference clock frequency of 400 MHz (regarding 4 4 MIMO transmission with 16-QAM), by far outperforming comparable state-of-the-art realizations. Keywords MIMO detection, sphere decoder, tuple search, parallelism, SIMD, pipeline-interleaving, STA, ASIP. I. INTRODUCTION Future mobile communication systems will make use of multiple-input multiple-output (MIMO) techniques in combination with high constellation orders to enhance spectral efficiency. Transmission of spatially multiplexed data streams allows increasing data rates as well as diversity. However, the inherent high detection complexity and lack of flexibility of common detector realizations still represent limiting factors towards efficient implementations. Conventional low-complex detector realizations provide poor BER performance (e.g. Successive Interference Cancelation -SIC detector), whereas high-ber-performance implementations present unusable high complexity (e.g. Single Tree Search -STS detector). In this regard, TSD [1] has demonstrated to achieve significant complexity reduction while providing close to full max-log-app BER performance, outperforming state-of-theart detection strategies [2] like STS [3] or List Sphere Detection (LSD) [4]. Variable search complexity of so-called depthfirst detection algorithms (like STS and LSD) represents one of the major challenges towards efficient realizations. Resulting 1 This work has been supported by the German Federal Ministry of Education and Research (BMBF) under grant 01BU1001. irregular and data-dependent control flow typically frustrates efficient algorithm parallelization, thus considerably limiting throughput enhancement. In this work, a novel TSD-ASIP model is presented. Lowcomplexity, flexibility, high throughput (>300 Mbps at a clock frequency of 400 MHz, using 4 4 transmission and 16-QAM) and high efficiency characterize the proposed implementation, by far outperforming state-of-the-art realizations. Throughout this work, major challenges with regard to practical realization as well as key strategies enabling such an efficient implementation are described and discussed. The considered communications system model is introduced in section II, followed by description of the complexityreduced TSD (section III) representing the basis of our implementation. Regularized control flow enabling processor vectorization is also described in this section. In order to minimize the required resources and speed up the computations, fixed-point arithmetic is utilized. Resolution, flexibility and influence on detection performance of the proposed fixed-point representation are analyzed in section IV. Considered parallelism approaches and implementation challenges are detailed in section VI. The proposed processor architecture is described in section VII, especially focusing memory organization. Challenges regarding the integration of conditional memory access into pipeline-interleaved and vectorized schemes are also discussed in this section. Throughout this work, computational complexity, critical path (V) and throughput (VI-C) of the resulting implementation are evaluated. Proposed implementation results are additionally discussed and compared to other works in section VIII. Finally, main highlights of this work are summarized in section IX. II. SYSTEM MODEL Throughout this paper, we consider a N T N R MIMO system based on a BICM transmission strategy with N T transmit and N R receive antennas, as depicted in Fig. 1. A vector u of i.i.d. information bits is encoded by the outer channel code with rate R. The resulting stream of vectors c is bit-interleaved and portioned into blocks c of N T L bits, where L denotes the number of bits per transmit symbol. For the transmission, the corresponding bits c C, covered in the set of permitted bit vectors, are mapped (e.g. gray mapping) onto complex constellation symbolsx(c) = [x 0,...x NT 1] T =

2 Transmitter Binary Data Source Receiver Binary Data Sink û u Outer c Interleaver c Encoder Hard Decision Le(u) R Outer Decoder L a,dec + L e,dec π Interleaver π -1 Interleaver π AWGN L e,det + - L a,det Constellation Mapper + + Detector / Demapper Fig. 1: Communications system model with BICM transmitter and iterative receiver. map(c) X, being X the set of valid transmit symbols with cardinality #X = #C = 2 L = Q. The transmit energy is normalized so that E{xx H } = E S /N T I, where E s represents the average transmit energy of the transmitter. Regarding the transmission, we consider an uncorrelated, flat fading channel and an additive noise vector n C NR 1 at the receiver with complex components of zero mean i.i.d. gaussian random variables of variance N 0 /2 per real dimension (E{nn H } = N 0 I). The considered passive channel is represented by H C NR NT with entries of a zero mean i.i.d. gaussian random process of variance 1 and is assumed to be perfectly known at the receiver. The received signal y is therefore given by: y = Hx+n and the signal-to-noise-ratio (SNR = E s /N 0 ) at the receiver applied to the energy of one information bit can be stated as E b /N 0 = E s N R /N 0 N T LR. In order to ensure comparability of results, a setup equivalent to the one used in e.g. [1], [5], [6] has been used for our simulations. These have been carried out for a rate 1/2 PCCC with (7 R,5) convolutional codes, an information block size of 9216 bits (including tail bits), gray mapping, and spatial and temporal fading. The detection of the transmitted bits is carried out by complex-valued tuple search sphere detector in conjunction with a BCJR based decoder with 8 internal iterations and, for the sake of simplicity, without detector decoder iterations. III. COMPLEXITY-REDUCED TUPLE SEARCH MIMO DETECTION A. Fundamentals Task of the focused detector is the determination of bits c most likely sent as well as of reliability information for these bits. This can be accomplished by calculating log-likelihood ratios (L-values): ( ) P (cm,l = +1 y) L(c m,l y)=ln P (c m,l = 1 y) 1 N 0 min {λ 0}+ 1 min c c m,l =+1 N {λ 0},(1) 0 c c m,l = 1 where (1) results from application of the max-log approximation [7]. The l-th bit of a symbol sent by the m-th antenna is represented by c m,l and λ 0 represents the distance metric. For the considered case without detector decoder iterations (i.e. without a priori information), this metric depends only on the Euclidean distance between the set of received symbols y and the representative of the symbol ˆx(c) likely transmitted through the channel H: λ 0 (y,c) = y Hˆx(c) 2 Consequently, besides the most probably sent symbol arg min{λ 0 } (i.e. the detection ˆx(c) c C hypothesis) and its corresponding metric λ 0 (c ML ), the detector has to determine also the counter-hypotheses arg min {λ 0 } and their metrics for each bit. ˆx(c) c C,c m,l c ML m,l B. Tree Search Basics Since brute force (full max-log-app) detection of (1) is known to be of exponentially growing computational complexity with the number of transmit antennas and order of the constellation, several close to optimal detection approaches have been lately proposed, some of the most promising based on tree search strategies. As described in detail in [5], transforming the detection problem is permitted by QRdecomposition (QRD) of H = QR, where Q is unitary and R an upper triangular matrix. With modified received symbols y = Q H y, determination of the Euclidean distance y Rˆx(c) 2 (2) can be interpreted as a tree search. Resulting from this, λ 0 can be recursively calculated through the layered partial metric λ i = λ i+1 }{{} + y i }{{} r 2 N T 1 iiˆx i, y i = y i r ijˆx j, (3) metric from already estimated symbols interference reduced symbol j=i+1 by adding the metric of the corresponding parent node to the squared distance between a representative of the estimated symbol and an interference reduced symbol. The search is carried out in depth-first fashion, successively extending the selected nodes by analyzing their child nodes. As presented in [8], a regularized control flow and the parallel calculation of sibling parent nodes as well as of leaf nodes permit a so called one-node-per-cycle implementation [9]. Based on this, the average number of node extensions n performed in the detection is taken as complexity measure throughout this paper. C. Tuple Search and Bit-Specific Candidate Determination Computing the L-values in (1) requires determination of a detection hypothesis and all counter-hypotheses as described in section III-A. Explicit search for all needed minimums leads to impractically high n [10]. Therefore, instead of searching all possible minima, TSD introduced in [1] searches a subset of T most likely leaves, similarly to the approach used by LSD. The metrics λ 0 of these leaves are stored in a search tuple T := {λ 0 (c 1 ),λ 0 (c 2 ),...,λ 0 (c T )}, defining the sphere radius as the maximum metric in the tuple: R = max c t c t T {λ 0,t}. (4) Tuple search can be additionally combined with separated bitspecific storage of information for the L-value calculation [1].

3 The resulting TSD achieves much better BER performance than LSD at a significantly reduced n compared to STS detection. Additionally, adjustment of SNR n trade-off is enabled by varying the size of the tuple T. D. Complexity Reduction Strategies like sorted QR decomposition [1], MMSE preprocessing [1] as well as sphere and L-values clipping [1] are widely applied approaches towards n reduction. Novel approaches like search sequence determination (SSD) [11] and metric estimation (ME) [12] contribute to further reduce n as well as the computations involved in the detection, as described in the following. 1) Search Sequence Determination (SSD): In [11], the commonly used Schnorr-Euchner (SE) enumeration as well as all the costly metric calculations (3) and sorting operations it requires are replaced by few inexpensive basic operations through a heuristic technique. This approach is based on geometrical position analysis relative to reference nodes, reducing the node selection task to simple geometrical case differentiation. Reference nodes ˆx i are determined by ˆx i = y i = y i, (5) r ii (with ( ) representing the rounding operator). Resulting enumeration is defined by fixed sequences which are mapped to constellation symbols during the detection, based on the relative position between y i and ˆx i. Computational complexity of this strategy can be further decreased by reducing the amount and length of considered sequences. As proposed in [11], in this work only two decision regions (i.e. two sequences) have been considered, sequences of 14 partially combined elements are used and adapted leaf sequences are applied. 2) Metric Estimation: The computationally costly operations required for the metric calculation (3) can be considerably simplified by an estimation based on SSD s geometrical approach [12]. Euclidean distances in (3) are replaced by predefined geometrical distances d defined for each decision region: r 2 ii y i ˆx i 2 r 2 ii d2. (6) Moreover it is possible to precalculate r 2 ii as well as r2 ii d2, hence reducing the complex products in (3) to a single real multiplication or even completely dissolving it. This considerable reduction in computational complexity makes this technique especially interesting for realizable hardware implementations. E. Algorithm Partitioning and Regularization TS-based detection process can be decomposed, as proposed in [8], in an arbitrary number of regularized loops. Operations performed within each loop have been partitioned into the task blocks a... n illustrated in Fig. 2. Selection of first node to be extended (5) is performed in a. SSD s geometric position analysis is subsequently carried out in b, determining the corresponding search sequence and the second node to Fig. 2: TSD regularized control flow diagram. be extended. Metric (6) and interference (3) associated to these nodes are determined by c, f, g and h. Subsequently, radius tuple is updated (4) by d. Final task within the loop is selection of the tree layer to be processed during next loop ( e). As mentioned in section III-B, a so called one-node-percycle(-loop) realization is desired. Consequently, sibling parent nodes and leaf nodes are assumed to be processed in parallel [11]. For this purpose, it is necessary to define separated blocks which process first ( a, c, f), second ( b, g, h) and subsequent nodes ( i, j, k) individually. Likewise, definition of task blocks processing candidate leaf nodes ( l, m) for later L-values (1) calculation ( n) is required. IV. FIXED-POINT REPRESENTATION AND RESULTS Conventional hardware implementation typically makes use of fixed-point arithmetic in order to limit the required latency and hardware resources. In the context of MIMO detection, overflow and quantization errors incurred by fixed-point representation typically lead to SNR performance degradation. It is therefore necessary to define appropriate resolution, attending to the trade-off between SNR performance and hardware complexity. Based on the analyses in [7], a maximum word width of 8 bits 2 is proposed in this work, whereas 0.3dB SNR loss has to be accepted, as shown in IV-B. This practical representation is enabled by the normalization strategy proposed in IV-A. The resulting detector presents lower hardware complexity and higher efficiency than comparable sphere detector (SD) realizations [13], [14], typically requiring greater precision (10-16 bits) to achieve similar SNR performance. A. Normalization Parameters involved in (1) and hence (3) (y i, R = {r ii,r ij } and λ i, with i,j = {1,...,(N T 1)},i j) possess dynamic range fluctuating with the received energy E r. In order to define fixed ranges, independent of the transmission conditions, removal of E r dependency from r ii and λ i is required. For this purpose, a scaling factor s f = f E r [7] is utilized to normalize these parameters ( r ii = r ii /s f, λi = λ i /s 2 f ), with f representing a constant factor, leading to (1) modified as 2 Integer and fractional word lengths have been individually determined for each of the parameters involved in the detection. Integer word lengths have been selected to cover the parameters range, while fractional lengths have been determined based on BER performance obtained through simulations. Details on the analysis can be found in [7].

4 Fig. 3: n required for 10 5 BER, corresponding to different transmission and detection schemes L(c m,l y) = L(c m,l y)/s 2 f. Concerning y i and r ij, s f normalization is not necessary since application of SSD strategy (5) eliminates the energy dependency. A certain scaling factor f is nevertheless required in order to accommodate their dynamic range to the proposed 8-bit quantization [7]. Based on this, ỹ = y i /f and r ij = r ij /f are additionally defined. To ensure flexibility of the proposed detector implementation, fixed-point representation has to be additionally independent of system configuration. For this purpose, factors f and f can be adjusted according to the transmission scheme, as shown in table I 3. As a result, definition of different fixed point representations depending on transmission characteristics as in [14] is avoided. By applying the proposed normalization, 8-bit fixed-point representation is enabled, independent of transmission conditions and adaptable to different N T and Q by simple adjustment of f and f. N T N T Q f f TABLE I: Scaling factors f and f, to enabling flexibility of the proposed fixed-point representation. B. Simulation Results Fig. 3 depicts SNR n achieved at 10 5 BER, using different transmission and detection schemes. FLP denotes floating point, while FXP stands for fixed point. Results corresponding to unclipped (L max = ) STS(FLP) detection are depicted representing full max-log-app SNR performance bound. SIC(FXP) detection has been implemented using the proposed TSD(FXP) realization with limited n = N T, hence representing the lowest n bound. Fig. 3 also includes results corresponding to the proposed TSD for the selected SNR n trade-off (adjusted through T ). Fixed-point arithmetic leads to 0.3dB SNR degradation with respect to the analogous floating point approach. Metric underestimations resulting from limited precision lead, in addition, to n reduction of 20% (slightly greater reduction in 8 8 with 64-QAM system). Higher detection accuracy (i.e. greater T ) can ease the shown 3 Values determined through simulation for the given system model. SNR performance loss ( n reduction). It should be noticed that SNR n difference between TSD(FLP) and TSD(FXP) is approximately constant for different transmission schemes, hence validating the normalization approach proposed in IV-A. For the selected SNR n trade-off, TSD(FXP) presents a total SNR loss of dB with respect to full max-log- APP SNR performance bound, while processing 1-6 orders of magnitude less nodes than unclipped STS(FLP). This loss is mainly caused by ME approach, and can be partially compensated by adjusting T. TSD provides considerable SNR performance gain with respect to SIC ( 2.5dB for 4 4 with 4-QAM transmission, 4dB in the remaining cases) at only slightly greater n (<30%). Resulting from this, proposed soft-output TSD(FXP) approaches hard-output SIC complexity (in terms of n), especially concerning low-order systems, while achieving close to full max-log-app SNR performance. Additionally, SNR n relationship is adjustable attending to transmission requirements, ranging from close to full maxlog-app SNR performance at significantly reduced n, to SIC SNR performance and complexity. V. COMPUTATIONAL COMPLEXITY AND CRITICAL PATH ESTIMATION In order to estimate the costs of a hardware implementation, computational complexity and latency will be analyzed in the subsequent. Resulting from application of complexity reduction techniques in III-D only basic operations (addition, comparison and shift operations) are required to carry out the detection [7]. The amount of required computations c is taken as computational complexity measure in this work. Cost of comparison and addition operations is assumed to be approximately equivalent, whereas shift operation cost is considered negligible. Table II summarizes c (add-equivalent operations, i.e. addition and comparison) required by each task block. As shown in this table, node selection ( a, b, i) and especially metric determination ( c, g, j) tasks present the lowest computational complexity, moreover not varying with N T,Q or T. Complexity of radius update task ( d) grows linearly with T. Remaining tasks complexity depends on system order (Q,N T ), presenting l and m (i.e. handling bit-specific metrics) the greatest complexity. Attending to the resulting overall computational complexity c total, doubling N T or Q implies increasing c total by a factor of 1.5 ( c total 3/4 N T 3/4 Q). Cost in terms of latency l of shift operations is neglected, while cost of addition and comparison operations is assumed to be equivalent to l = 1 clock cycle. This consideration is a quite conservative design constraint, which ensures the targeted throughput (depicted in VI-C and V) in a worstcase hardware implementation. Attending to data dependency 4, most of the computations enclosed within each task block may be performed in parallel. Considering sufficient parallelization level (varying with N T,Q and T ), fixed l can be assumed, 4 Detailed analysis of data dependency is considered to be out of the scope of this work and thus is not presented.

4 4 4 4 4 4 2 2 8 8 4-QAM 16-QAM 64-QAM 64-QAM 64-QAM l (T = 4) (T = 8) (T = 16) (T = 8) (T = 16) a 4 1 b 11 1 c 2 1 d 9 13 21 13 21 1 e 18 18 18 6 42 1 f 14 20 32 12 72 4 g 1 1 h 14 20 32 12 72 4 i

5 QAM 16-QAM 64-QAM 64-QAM 64-QAM l (T = 4) (T = 8) (T = 16) (T = 8) (T = 16) a 4 1 b 11 1 c 2 1 d e f g 1 1 h i 15 1 j 1 1 k l m n c total l cp TABLE II: Computational complexity c (in number of add-equivalent operations) and latency of each task block, for different transmission schemes. as shown in table II. Resulting from this, latency of the regularized loop becomes independent of the transmission system, in contrast to computational complexity. Loop latency can be further reduced by partially overlapping execution of the comprehended task blocks, resulting in the task scheduling illustrated in Fig. 4(c). Since output from blocks h, l, m and n is never required in the immediately subsequent loop [7], the critical path is comprised by blocks a e. Considering latencies given in table II, the critical path presents a latency of l cp = 5 clock cycles. Similar computational complexity analysis can be applied to conventional SD and SIC detector implementations [7], leading to task latencies shown in Fig. 4(a) and 4(b), respectively. As illustrated in Fig. 4, l cp of the proposed 8-bit TSD implementation (Fig. 4(c)) is < 1/4 of l cp corresponding to conventional (16-bit precision) SD realizations (Fig. 4(a)) and < 1/2 of common (16-bit precision) SIC detectors l cp (Fig. 4(b)). SSD and ME approaches in combination with the proposed task scheduling represent the main strategies contributing to this considerable critical path latency reduction. VI. PARALLELIZATION STRATEGIES FOR THROUGHPUT ENHANCEMENT Implementation of bit- and instruction-level parallelism mentioned in V does not require major considerations besides dependency analysis 4. Resulting detection throughput is dominated by 1) the number of loops required to complete the detection (or equivalently, n) and2) the loop latencyl cp (for a given frequency f ref ). In order to enhance throughput, further parallelization approaches are investigated in this work. A. Pipeline-Interleaving Task blocks defined in III-E can be grouped according to their execution times t ex, depicted in Fig. 4(c). Thereby 5 sets of tasks Sx, x = {0,...,4} are defined, each assigned to a different processing element (PE). Notice that tasks executed after l cp cycles ( m and n) are scheduled together with tasks corresponding to the following loop. Based on this, throughput can be enhanced by interleaving execution of Sx Fig. 4: Loop latency (in clock cycles), corresponding to different detection strategies. corresponding to different detections d in pipeline fashion. Fig. 5 illustrates pipeline-interleaving of Sx d, where e.g. S3 2 represents execution of task set 3 (see Fig. 4) corresponding to detection 2 (i.e. third interleaved detection). Throughput and resource utilization are maximized when the number of interleaved detections equals the maximum number of possible pipeline stages P, i.e. P = 5 for the identified critical path (l cp = 5). Assuming in addition that a new detection starts as soon as another has finished, no PE is idle after filling in the pipe. By applying these considerations, resulting average detection throughput τ is given by: τ TSD = (NT L) P n l cp f ref = NT L n f ref. P=lcp B. SIMD Vectorization TSD control flow regularization presented in section III-E enables direct implementation of SIMD vectorization. Assuming accordingly vectorized memory access (as described in VII-C), q individual detection paths can be performed in parallel as illustrated in Fig. 5. Vectorization of n-variable SD results in increased average n ( n) per vector element [8], in contrast to n-fixed detectors (M-algorithm, K-best... ). Assuming that a vector is completely processed as soon as execution of all its q parallel slices have finished, the overall vector latency will be thus determined by the slowest path (i.e. greater n). As a result, n increases with the degree of vectorization q, thereby dramatically affecting the achievable speedup. A simple approach to ease this problem consists in bounding n n max [8]. Consequent early termination of tree searches with n > n max typically leads to BER performance loss. However, it is possible to find suitable n max, e.g. n max = 1.25 n, causing negligible BER performance loss while allowing to achieve almost the maximum feasible speedup q [8]. Simplicity and effectiveness of this approach make it especially interesting for vectorization of n-variable SD. By applying q-fold SIMD parallelism with bounded n, τ is consequently enhanced by a factor of almost q, τ q-simd q τ TSD TSD.

For simplicity, SIMD vectorization has not been considered here (q = 1).

6 Fig. 6: ASIP-based detector architecture Fig. 5: Pipeline-interleaving scheme of 5 detection paths (5 pipeline stages). C. Throughput Analysis Table III shows τ corresponding to different detection approaches and transmission schemes, considering same BER performance (10 5 BER). For simplicity, SIMD vectorization has not been considered here (q = 1). Under the conservative latency assumption in V, hardware realization of the proposed pipelined architecture can be expected to achieve high frequency. Based on this consideration, clock frequency f ref = 400 MHz will be taken as reference in the following. TSD results correspond to the detection accuracy used in IV-B. As shown in table III, τ TSD is 8 times greater than τ of conventional SD implementations [7]. For low system orders (2 2 MIMO with 64-QAM, 4 4 MIMO with 4-,16- QAM) proposed TSD achieves 1-4 times lower τ than a conventional SIC detector. For high order systems (8 8 MIMO with 64-QAM) τ TSD is 15 times lower than conventional SIC τ. The gap between TSD and SIC in terms of τ has been thus considerably reduced compared to conventional SD. By applying pipeline-interleaving techniques (VI-A), τ TSD is enhanced 40 times with regard to common (unpipelined) SD implementations, surpassing τ of conventional (unpipelined) SIC detector by a factor of times greater (except for 8 8 MIMO with 64-QAM). Additionally, implementation of SIC detection based on our TSD realization achieves >2.5 times greater τ than conventional SIC realizations. Throughput enhancement via SIMD vectorization is not directly possible in the case of conventional depth-first SD due to their unregularized control flow. Conventional SIC detectors as well as the proposed n-bounded TSD achieve approximately the same speedup factor ( q). Resulting from this, proposed lowcomplex and flexible TSD realization achieves much higher τ than conventional SD implementations, approaching τ of common SIC detectors (for low order systems) and even surpassing it (TSD-based SIC implementation). N T N T Q-QAM Conventional SD (unpip.) TSD (unpip.) TSD (pip.) Conventional SIC (unpip.) TSD-based SIC (unpip.) TABLE III: Average throughput τ (in Mbps) achieved at f ref = 400 MHz and10 5 BER, corresponding to different transmission and detection schemes and considering a scalar architecture. VII. ASIP IMPLEMENTATION MODEL A. Synchronous Transfer Architecture (STA) The ASIP design proposed in this work has been modeled based on Synchronous Transfer Architecture (STA) [15]. STA architectural template is comprised of basic modules known as functional units (FUs). FUs present an arbitrary number of input and output ports and are connected through a multiplexing interconnection network. Output ports are buffered with registers so that data produced by a FU can be directly consumed by connected FUs without need for additional intermediate storage. The design is controlled by Very Long Instruction Words (VLIWs). At each cycle, the multiplexing network and the functionality of each FU is configured by different segments of the corresponding VLIW. The resulting system forms a synchronous network, which at each clock cycle consumes and produces some data. VLIW-based control and definition of vector data paths enable both instruction and data level parallelism. B. Architecture Block diagram in Fig. 6 depicts the architecture of the proposed STA-based ASIP model. A control unit decodes VLIWs stored in the instruction memory (IMEM) and manages the execution flow. Data path contains register files, data memories (SMEM, VMEMI, VMEMO) and the MIMO detector module (proposed TSD). The detector is comprised of several FUs, each implementing one of the detection tasks illustrated in Fig. 2. The proposed processor model has been dimensioned to support up to N T = 8 and Q = 64. Pipelineinterleaving scheme with P = 5 is assumed. For simplicity, q-fold vectorization with q = 1 is assumed in the following, unless otherwise stated. C. Memory Architecture and Scaling The presented ASIP-based detector model relies on instruction memory and data memories specified in table IV and described in the following. 1) Scalar Memory (SMEM): SMEM contains system parameters (N T,L,T,L max and n max ) as well as channel-dependent parameters, namely σ, N 0, r ij (3) and r 2 ii d2 (6). While system data is assumed to be constant for a given transmission system, channel data varies with frequency depending on the channel coherence time (e.g. each m detections). Assuming q-fold processor vectorization and channel coherence time equivalent to k detection vectors (m = kq), SMEM content can be shared by kq individual

7 detection paths. Resulting from this, vectorization of SMEM is unnecessary. This leads to enormous memory savings, directly proportional to the degree of vectorization q. 2) Input and Output Vector Memories (VMEMI, VMEMO): VMEMI represents a buffer (of length bf) storing the symbols y to be detected. The MIMO detector module gets a complete received symbol y on each read access, as long as there is data available in the input buffer. L-values resulting from the detection of y are stored into VMEMO. Both VMEMI and VMEMO are q-fold vector memories with regard to SIMD implementation. In order to avoid memory growing with q, usage of joint scalar buffers could be investigated as future work. Both pipeline-interleaving and SIMD vectorization are supported by the designed memory structure. Resulting from these design considerations, size of SMEM is fixed, while VMEMI and VMEMO size scales with the degree of vectorization q and the desired buffer length bf, as depicted in table IV. For a sample non-vectorized design (q = 1) with bf = 40, a total of 3KB memory is required, corresponding 2.8KB to data memory. Memory Width Length Size (bits) ( words) (bytes) Content IMEM Program code SMEM System and channel info. VMEMI 128 bf 16 bf q Received symbols VMEMO 384 bf 48 bf q Soft output (L-values) TABLE IV: Memory interface specifications, supporting up to N T,N R = 8 and Q = 64. D. Memory Access Implementation concepts and analyses presented in sections IV to VI are based on the assumption of constant data availability. Latency of memory tasks like read/write access or address generation have been therefore disregarded. In the following, challenges concerning the integration of these tasks into the proposed regularized scheme (III-E) are presented, as well as the strategies considered to overcome them. 1) Memory Address Generation: Due to different nature and frequency of data memories access, address generation functionality has been separately defined and implemented for each memory. VMEMI and SMEM access is sequentially performed, thus relying on simple pointers and a register file for address storage. In contrast, VMEMO access is not sequential, hence requiring more elaborate approaches. It is desired that y and corresponding L-values occupy in VMEMI and VMEMO (respectively) homologous memory areas. In order to implement this sorting while dealing with variable detection latency, a tagging approach is proposed. A tag is assigned to each y during its processing, indicating the VMEMI-address where this symbol was stored. When detection is complete, the tag is used to determine the corresponding VMEMO address (requiring only simple shift and addition operations). Simplicity of the proposed strategy enables calculating addresses in parallel to detection tasks execution. Resulting from this, address generation requires no additional clock cycles and detection latency depicted in V remains unaffected. 2) SMEM Data Temporal Locality: SMEM content is very frequently accessed by most detector FUs. Access frequency is further increased by the proposed pipeline-interleaving approach (VI-A). As a consequence, data demand rate is much higher than serving capacity of the defined memory interface 5. Simplest solutions like including stall cycles or enhancing interface bandwidth lead to speedup detriment or limit scalability of the design. To solve the problem avoiding these negative effects, partial copy of SMEM content into a register file is proposed (block o in Fig. 6). Since register access is significantly faster than memory access, data availability problem is solved and throughput speedup remains unaffected. In addition, costly memory access is reduced, since SMEM is only accessed when its content varies (depending on channel coherence time). It should be noticed that additional storage required does not scale with q in case of processor vectorization. As part of future work, SMEM could be completely eliminated, hence reducing memory requirements. E. Conditional Control Flow Fig. 7(a) illustrates control flow including memory access, with Detection Loop representing the regularized loop shown in Fig. 2. As depicted in this diagram, data memories access is conditional, triggered by detection termination and channel update. Assuming that channel information is shared by kq > P individual detection paths (VII-C), decision on SMEM access can be jointly performed for the P interleaved detection paths. In contrast, decision on VMEMI and VMEMO access is individual, since it depends on variable detection latency. Joint control flow of interleaved detection paths is consequently frustrated. In order to satisfy individual memory access demand without disrupting the pipelined execution of the remaining detection paths, the regularized and extended pipeline-interleaving scheme shown in Fig. 7(b) is proposed, in combination with the following approaches: Unconditional VMEMI access: periodic and unconditional read operations are performed during the complete detection process. Internal signaling detects when data has been effectively consumed (i.e. a new detection begins) in order to update VMEMI address pointer accordingly. Assisted control flow: VMEMO access is conditional and triggered by up to P = 5 detection paths. SMEM access is additionally conditional, depending on channel coherence time. Resulting from this, complex conditional branch operations have to be performed by the sequencer, as illustrated by arrows in Fig. 7(b). Based on control signals received from the detector module, a Flow Control Unit (FCU, p in Fig. 6) identifies triggering events and determines the corresponding IMEM address, subsequently transferring it to the sequencer. As a result, 5 Read access is assumed to take 2 clock cycles.

(a) Control flow diagram (b) Pipeline-interleaving scheme Fig. 7: Extended control flow and pipeline-interleaving schemes, including data memory access.

8 (a) Control flow diagram (b) Pipeline-interleaving scheme Fig. 7: Extended control flow and pipeline-interleaving schemes, including data memory access. execution flow is altered by performing uniquely simple unconditional branch operations. By applying these techniques, no overhead (i.e. additional clock cycles) is caused due to execution flow control or memory access. Resulting detection latency depends uniquely on the detection process, as described in VI-C. Achievable speedup provided by parallelization approaches considered in VI remains hence unaffected. VIII. RESULTS DISCUSSION In this work a novel ASIP-based detector implementation has been presented, combining low complexity and high BER performance of TSD with high throughput of highly parallel and pipelined architectures. Performance comparison to different bread-first and depth-first detector realizations ratifies the suitability of the proposed TSD implementation. Table V depicts considered detectors features. f ref corresponds to the maximum clock frequency f max achieved by the corresponding hardware implementations on the specified CMOS technologies. Concerning this work (not including hardware realization) fref TSD = 400 MHz has been selected as reference, according to the explanations in VI-C f ref has been extracted from [14], [16] [19] corresponding to 10 4 BER or alternatively 1% FER, depending on the metric utilized MHz for SO system, 250 MHz for SISO system. by each author. 71 MHz represents τ down-scaled to min{f ref } = 71 MHz. The benefit in n reduction provided by TSD grows exponentially with N T and Q [2]. However, a 4 4 MIMO system using 16-QAM constellation has been considered to ensure comparability of results. As illustrated in table V, the proposed pipelined TSD-ASIP outperforms in terms of throughput all considered works (except [16], providing hard output) at same frequency and comparable error rate. τ TSD is 40% greater than τ corresponding to the considered n-fixed strategies [18], [19]. Compared to depthfirst approaches,τ TSD 2.5 τ [17] andτ TSD 7 τ [14] (notice that [14] is SISO capable, but τ [14] corresponds to f max achieved by SO-only system, provided by the authors). The considerable throughput increase achieved by TSD-ASIP can be further enhanced by a factor of q through SIMD vectorization (notice that except [16], considered realizations also implement some sort of parallelism and / or pipelining approaches). It should be also noticed that to achieve comparable BER performance, SNR required by TSD is much lower than SNR required by remaining approaches. IX. CONCLUSION In this work key strategies enabling efficient implementation of a complexity-reduced tuple search sphere detector have been presented. A novel flexible 8-bit fixed-point representation has been introduced, reducing the complexity compared to conventional 16-bit implementations. Proposed regularization and architectural concepts permit efficient application of pipelining and parallelization approaches. As a result, a high-throughput realization is enabled, achieving > 300Mbps at low SNR for a reference frequency of 400 MHz (considering 4 4 MIMO transmission with 16-QAM) 7. Under comparable conditions, proposed implementation achieves up to 7 times greater throughput than relevant state-of-the-art realizations, whereas greater improvement can be still achieved by exploiting SIMD vectorization and possibly using higher frequency. Low-complexity, flexibility, high throughput and high efficiency of the resulting detector make it a very favorable candidate for MIMO transmission in a wide range of applications. 7 HDL instantiation of a 4 4 system using a 64-QAM constellation and T = 8 reached f = 454 MHz on 65-nm technology, achieving a throughput of Mbps. Further results on hardware implementation analysis will be submitted to SiPS Work This work [16] [17] [14] [18] [19] Detector DF-SD (TS) DF-SD DF-SD (STS) DF-SD (STS) BF-SD (K-best) BF-FSD Input/Output SO HO SO SISO SO SO close to close to ML close to close to close to BER performance maxlogapp significant SNR loss maxlogapp maxlogapp maxlogapp suboptimal Flexibility with respect to SO 4,16,64-QAM ,64-QAM, QPSK ,4 4, QAM 16-QAM 2 2,4 4, QAM 16-QAM Technology µm 0.25 µm 90 nm 0.18 µm 0.13 µm f ref (MHz) τ f ref q 9.5dB 25dB 17.25dB 15dB τ 71 MHz q SISO - Soft Input, Soft Output HO - Hard Output DF - Depth-First tree traversal SO - Soft Output only BF - Breadth-First tree traversal TABLE V: Comparative between sphere search based MIMO detectors.

9 ACKNOWLEDGMENT The authors would like to thank the project partner Blue Wonder Communications (Dresden, Germany) for an excellent cooperation. REFERENCES [1] B. Mennenga, A. von Borany, and G. Fettweis, Complexity reduced Soft-In Soft-Out Sphere Detection based on Search Tuples, in Proceedings of the IEEE International Conference on Communications (ICC 09), Dresden, Germany, [2] E. P. Adeva and B. Mennenga, Survey on an Efficient, Low-complex Tuple Search Based Sphere Detector, Submitted to IEEE 34th Sarnoff Symposium, [3] C. Studer, A. Burg, and H. Bölcskei, Soft-output sphere decoding: Algorithms and VLSI implementation, IEEE Journal on Selected Areas in Communications, Feb [4] M. S. Yee, Max-log-MAP sphere decoder, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 05), vol. 3, Mar [5] B. Hochwald and S. ten Brink, Achieving near-capacity on a multipleantenna channel, IEEE Transactions on Communications, vol. 51, pp , Mar [6] E. Zimmermann and G. Fettweis, Unbiased MMSE Tree Search Detection for Multiple Antenna Systems, in Proceedings of the International Symposium on Wireless Personal Multimedia Communications (WPMC 06), San Diego, USA, Sep [7] B. Mennenga, Aufwandsgünstige Detektion in Mehrantennensystemen mittels komplexitätsreduzierter Baumsuchverfahren. Dissertation, Technische Universität Dresden, [8] B. Mennenga, E. Matus, and G. Fettweis, Vectorization of the Sphere Detection Algorithm, in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS 09), Taipei, Taiwan, [9] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bolcskei, VLSI implementation of MIMO detection using the sphere decoding algorithm, IEEE Journal of Solid-State Circuits, vol. 40, pp , Jul [10] J. Jaldén and B. Ottersten, Parallel Implementation of a Soft Output Sphere Decoder, in Proceedings of Asilomar Conference on Signals, Systems, and Computers, [11] Mennenga and Fettweis, Search Sequence Determination for Tree Search based Detection Algorithms, in Proceedings of the IEEE Sarnoff Symposium 2009, Princeton, USA, [12] B. Mennenga and G. Fettweis, Simplified Search Sequence and Metric Determination for Tree Search based Detection Algorithms, Submitted to the IEEE Transaction on Wireless Communications, [13] M. Myllyla, J. R. Cavallaro, and M. Juntti, Architecture Design and Implementation of the Metric First List Sphere Detector Algorithm, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, [14] E. M. Witte, F. Borlenghi, G. Ascheid, R. Leupers, and H. Meyr, A Scalable VLSI Architecture for Soft-input Soft-output Depth-first Sphere Decoding, IEEE Transactions on Circuits and Systems II: Express Briefs, [15] G. Cichon, A Novel Compiler-Friendly Micro-Architecture for Rapid Development of High-Performance and Low-Power DSPs. Dissertation, Technische Universität Dresden, [16] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bleskei, VLSI Implementation of MIMO detection using the sphere decoding algorithm, IEEE Journal on Solid-State Circuits, [17] C. Studer, A. Burg, and H. Bolcskei, Soft-output Sphere Decoding: Algorithms and VLSI Implementation, IEEE Journal on Selected Areas in Communications, [18] Z. Guo and P. Nilsson, Algorithm and Implementation of the K-best Sphere Decoding for MIMO Detection, IEEE Journal on Selected Areas in Communications, [19] B. Wu and G. Masera, A Novel VLSI Architecture of Fixed-complexity Sphere Decoder, in 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools (DSD), 2010.

A New MIMO Detector Architecture Based on A Forward-Backward Trellis Algorithm

A New MIMO Detector Architecture Based on A Forward-Backward Trellis Algorithm A New MIMO etector Architecture Based on A Forward-Backward Trellis Algorithm Yang Sun and Joseph R Cavallaro epartment of Electrical and Computer Engineering Rice University, Houston, TX 775 Email: {ysun,