High-Throughput Data Detection for Massive MU-MIMO-OFDM using Coordinate Descent. Michael Wu, Chris Dick, Joseph R. Cavallaro, and Christoph Studer

Size: px

Start display at page:

Download "High-Throughput Data Detection for Massive MU-MIMO-OFDM using Coordinate Descent. Michael Wu, Chris Dick, Joseph R. Cavallaro, and Christoph Studer"

Emil Logan
6 years ago
Views:

1 1 High-Throghpt Data Detection for Massive MU-MIMO-OFDM sing Coordinate Descent Michael W, Chris Dick, Joseph R. Cavallaro, and Christoph Stder Abstract Data detection in massive mlti-ser (MU) mltipleinpt mltiple-otpt (MIMO) wireless systems is among the most critical tasks de to the excessively high implementation complexity. In this paper, we propose a novel, eqalization-based soft-otpt data-detection algorithm and corresponding reference FPGA designs for wideband massive MU-MIMO systems that se orthogonal freqency-division mltiplexing (OFDM). Or data-detection algorithm performs approximate minimm meansqare error (MMSE) or box-constrained eqalization sing coordinate descent. We deploy a variety of algorithm-level optimizations that enable near-optimal error-rate performance at low implementation complexity, even for systems with hndreds of base-station (BS) antennas and thosands of sbcarriers. We design a parallel VLSI architectre that ses pipeline interleaving and can be parametrized at design time to spport varios antenna configrations. We develop reference FPGA designs for massive MU-MIMO-OFDM systems and provide an extensive comparison to existing designs in terms of implementation complexity, throghpt, and error-rate performance. For a 128 BS antenna, 8 ser massive MU-MIMO-OFDM system, or FPGA design otperforms the next-best implementation by more than 2.6 in terms of throghpt per FPGA look-p tables. Index Terms Coordinate descent, eqalization, FPGA design, massive mlti-ser (MU) MIMO, orthogonal freqency-division mltiplexing (OFDM), soft-otpt data detection. I. INTRODUCTION MASSIVE mlti-ser (MU) mltiple-inpt mltipleotpt (MIMO) technology promises significant improvements in terms of spectral efficiency, coverage, and range compared to traditional, small-scale MIMO [2] [5]. In fact, massive MU-MIMO is commonly believed to be one of the key technologies for ftre fifth-generation (5G) wireless systems [6]. The idea nderlying this technology is to eqip the base-station (BS) with hndreds of antenna elements while commnicating with tens of ser terminals concrrently and within the same time-freqency resorce. However, the large dimensionality of the data detection problem faced in the plink (where sers commnicate to the BS), reslts in excessively high implementation complexity at the BS (see, e.g., [7] and the references therein). Hence, to redce the implementation costs while enabling throghpts in the Gb/s regime for practical wideband massive MU-MIMO systems MW and JRC are with the Department of ECE, Rice University, Hoston, TX; {mbw2,cavallar}@rice.ed MW and CD are with Xilinx Inc., San Jose, CA; {miw, chris.dick}@xilinx.com CS is with the School of ECE, Cornell University, Ithaca, NY; stder@cornell.ed A short version of this paper for a single-carrier freqency-division mltiple access (SC-FDMA) massive MU-MIMO systems has been presented at the IEEE International Symposim on Circits and Systems (ISCAS) [1]. with hndreds of antenna elements and thosands of sbcarriers, novel algorithms and dedicated hardware implementations on field-programmable gate arrays (FPGAs) or application specific integrated circits (ASICs) are necessary. Dring recent years, varios data-detection algorithms [8], [9] and dedicated hardware implementations have been proposed for massive MU-MIMO systems [7], [10] [13]. All of the existing hardware implementations, however, are either nable to achieve the high throghpts offered by ftre wideband massive MU-MIMO systems [7], [12], [13], or exhibit excessive hardware complexity [11]. Frthermore, the hardware implementations in [7], [11] only spport singlecarrier freqency-division mltiple-access (SC-FDMA). As demonstrated in [14], however, orthogonal freqency-division mltiplexing (OFDM) enables (often significantly) less complex baseband processing 1, which may be a critical design factor for wideband massive MU-MIMO systems with hndreds of BS antennas and thosands of sbcarriers. A. Contribtions We propose a new, low-complexity soft-otpt data-detection algorithm and a corresponding high-throghpt FPGA design for massive MU-MIMO wireless systems that se OFDM. Or algorithm, referred to as optimized coordinate descent (OCD), performs approximate minimm mean-sqare error (MMSE) or box-constrained eqalization, which enables near maximmlikelihood (ML) soft-otpt data detection performance in massive MU-MIMO systems with a large BS-to-ser-antenna ratio. We develop a corresponding high-throghpt VLSI architectre with a deep and interleaved pipeline, which can be parametrized at design time to spport varios BS and ser antenna configrations. The algorithmic reglarity of OCD and the fact that preprocessing can be implemented at minimm hardware overhead enables high-throghpt VLSI designs that reqire lower complexity than state-of-the-art designs, even for systems with hndreds of BS antennas and thosands of sbcarriers. To demonstrate the advantages of OCD compared to existing massive MU-MIMO data-detector designs in terms 1 SC-FDMA typically generates baseband signals with a lower dynamic range, bt the receiver mst perform an additional freqency-to-time conversion (compared to OFDM). This additional conversion step reqires one to separate eqalization (that is sally carried ot in the freqency domain per sbcarrier) and data detection (that mst be carried ot in the time domain). This separation prevents the se of powerfl, non-linear eqalization methods [15], sch as the box-constrained detector proposed in this paper. OFDM, in contrast, cases a slightly higher dynamic range, bt reqires only one time-to-freqency conversion and enables non-linear data-detection methods that operate directly in the freqency domain on a per-sbcarrier basis [14].

2 2 of throghpt, hardware complexity, and error-rate performance, we provide implementation reslts on a Xilinx Virtex-7 FPGA. B. Notation Boldface lowercase and boldface ppercase letters stand for colmn vectors and matrices, respectively. For a matrix A, we denote its hermitian transpose by A H. We se a k,l for the entry in the kth row and lth colmn of the matrix A; the kth entry of a colmn vector a is denoted by a k = [a] k. The l 2 -norm of a vector a is defined as a 2 = k a k 2. The real part of a complex nmber a is R{a}. Sets are denoted by ppercase calligraphic letters; the cardinality of a set A is A. The expectation operator is designated by E[ ]. C. Paper Otline The rest of the paper is organized as follows. Section II introdces the massive MU-MIMO-OFDM system model and describes data detection sing MMSE and box-constrained eqalization. Section III details or OCD algorithm and shows error-rate simlation reslts. Section IV and Section V describe or VLSI architectre and shows FPGA implementation reslts, respectively. We conclde in Section VI. II. SYSTEM MODEL AND DATA DETECTION This section introdces the considered OFDM-based plink model and smmarizes efficient methods for linear MMSE and box-constrained soft-otpt data detection. A. OFDM-based Uplink System Model We consider a massive MU-MIMO-OFDM plink system, where U single-antenna ser terminals send data simltaneosly to a BS with B U antennas over W sbcarriers. Each ser i = 1,..., U encodes its own bit stream (sing a forward error-correction scheme) and maps the generated coded bits onto constellation points in a finite set O (e.g., 64-QAM sing a Gray mapping rle), with nit average transmit power, i.e., E [ s 2] = 1 with s O, and Q = log 2 O bits per constellation point. The reslting W freqency-domain symbols {s (i) 1,..., s(i) W } are then transformed into the time domain (TD) sing an inverse discrete Forier transform (DFT) [16]. After prepending the cyclic prefix, all sers transmit their TD signals over the freqency-selective wireless channel at the same time. After removing the cyclic prefixes, the TD signals received at each BS antenna are transformed back to the FD sing a DFT. For the sake of simplicity, we assme a sfficiently long cyclic prefix, perfect synchronization, and that perfect channel-state information (CSI) has been acqired via pilot-based training. 2 Under these assmptions, the FD inpt-otpt relation on the wth sbcarrier is commonly modeled as [17] y w = H w s w + n w, w = 1,..., W, (1) where y w C B is the associated received FD vector, H w C B U is the channel matrix, s w O U contains the symbols transmitted by all U sers, i.e., [s w ] i = s (i) w refers to the symbol transmitted by ser i over sbcarrier w, and n w C U models thermal noise as i.i.d. complex circlarly-symmetric Gassian vector with variance N 0 per complex entry. 2 These assmptions are common in the MIMO-OFDM literatre [16]. B. Eqalization-based Data Detection For the model in (1), optimal data detection in terms of minimizing the symbol error-rate is accomplished by solving the maximm-likelihood (ML) problem [18] s ML w = arg min y w H w z 2 2. (2) z O U Unfortnately, solving (2) exactly for massive MU-MIMO systems qickly reslts in prohibitive complexity, even with the best-known sphere-decoding algorithms [19]. Eqalizationbased data detection algorithms [18] enable one to find approximate soltions to the ML problem at low comptational complexity. Virtally all linear as well as non-linear eqalization methods relax the finite-alphabet constraint z O U in (2), which enables the efficient comptation of an estimate s that is (hopeflly) close to the ML soltion. The estimate s can then either be sliced element-wise onto the nearest constellation point in O as follows: ŝ i = arg min [ s] i z, i = 1,..., U, (3) z O which is known as hard-otpt data detection, or sed to compte reliability information for each transmitted bit in the form of log-likelihood ratio (LLR) vales (see Section II-E), which is known as soft-otpt data detection [20], [21]. C. Linear MMSE Eqalization The most common eqalization-based data detection algorithm is linear MMSE data detection [18], [20]. This method was shown to enable FPGA and ASIC designs that are able to achieve high throghpt in massive MU-MIMO systems [7]. Frthermore, for systems with large BS-to-ser antenna ratios δ = B/U (e.g., two or larger), linear detectors are able to achieve near-ml error-rate performance [3] [5]. The key idea of MMSE data detection is to relax the constraint z O U in the ML problem (2) to the U-dimensional complex space z C U, and to inclde a qadratic penalty fnction. In particlar, MMSE eqalization solves the following reglarized least-sqares problem [10], [22]: s MMSE w = arg min y w H w z N 0 z 2 2. (4) z C U Since the objective fnction in (4) is qadratic in z, the MMSE eqalization problem has a closed-form soltion. An explicit soltion to (4) can be compted as follows. First, compte the reglarized Gram matrix A w = G w + N 0 I U with G w = H H w H w and the matched filter vector s MF w = H H w y w. Then, the MMSE estimate in (4) is compted as s MMSE w = A 1 w smf w. (5) While this closed-form approach was shown to be efficient for traditional, small-scale MIMO systems (e.g., with for antennas at both ends of the wireless link) [21], compting the reglarized Gram matrix A w and its inverse A 1 w qickly reslts in prohibitive complexity in massive MU-MIMO systems with hndreds of BS antennas [11]. In Section III, we present a comptationally-efficient eqalization algorithm that directly solves (4) in a hardware efficient way, which avoids expensive calclations sch as the comptation of the reglarized Gram matrix A w and its inverse A 1 w.

3 3 D. Non-Linear Box-Constrained (BOX) Eqalization While linear eqalization methods are the most common approach in the MIMO literatre, a few non-linear eqalizers have recently emerged and shown to otperform linear methods in terms of error-rate performance [23]. A promising non-linear eqalization method, referred to as box-constrained eqalization (short BOX eqalization) [24] [26], relaxes the constraint z O U to the convex polytope C O arond the constellation set O, which is formally defined as follows: O O C O = α i s i (α i 0, α i R, i) α i = 1. (6) i=1 i=1 For example, the convex polytope C QPSK for QPSK with 3 O = {+1 + j, +1 j, 1 + j, 1 j} (7) is given by C QPSK = {x R + jx I : x R, x I [ 1, +1]} with j 2 = 1; this is simply a box with radis 1 arond the sqare constellation (ths the name BOX eqalization). For higher-order QAM alphabets, sch as 16-QAM or 64-QAM, we have C O = {x R + jx I : x R, x I [ α, +α]}, where α = max a O R{a} is the radis of the tightest box arond the sqare constellation. BOX eqalization solves the following relaxed version of the ML problem in (2): s BOX w = arg min y w H w z 2 2. (8) z CO U Since this eqalization problem (8) is convex, it can be solved exactly sing well-established nmerical methods from convex optimization [27]. Frthermore, as shown recently in [23], [26], the BOX eqalizer exhibits near-ml error-rate performance in the large-antenna limit, where we fix the BS-to-ser antenna ratio δ = B/U so that δ > 1/2 and by letting B. In addition, the BOX eqalizer does only need knowledge of the transmit constellation O bt not of the noise variance N 0. Unfortnately, solving (8) exactly with conventional interiorpoint methods reslts in prohibitive complexity and reqires high nmerical precision, which prevents efficient hardware designs that se finite precision (fixed-point) arithmetic. In order to solve (8) at low complexity and in a hardware efficient way, we propose a new algorithm in Section III. E. Soft-Otpt Data Detection From MMSE and BOX eqalization, hard-otpt estimates can easily be obtained by element-wise slicing of the entries of s MMSE w and s BOX w onto the nearest constellation point as in (3), respectively. In systems that se forward error-correction, however, one is generally interested in soft-otpt detection [28]. From MMSE eqalization where s w = s MMSE w, LLR vales are typically compted via the max-log approximation [21] ( 2 L w,i,b = ρ w,i min [ s w ] i a a Ob 0 µ w,i min [ s w ] i 2) a, (9) a Ob 1 µ w,i 3 We note that this constellation is not normalized to nit expected power. where the sets Ob 0 and O1 b contain the constellation symbols for which the bth bit is 0 and 1, respectively. For explicit MMSE detection, i.e., the approach discssed in Section II-C that comptes A 1 w, the post-eqalization signal-to-noise-andinterference-ratio (SINR) ρ w,i and the channel gain µ w,i can be calclated exactly and in the following efficient way [21]. The SINR is calclated as ρ w,i = µ w,i /(1 µ w,i ) and the channel gain is µ w,i = [A w ] H i [G w] i, where [A w ] i is the ith row of A 1 w and [G w ] i is the ith colmn of G w. However, for BOX eqalization in Section II-D, as well as for data detection algorithms that implicitly solve the MMSE detection problem (4), no efficient methods that exactly compte the SINR ρ w,i are known this prevents a straightforward comptation of the LLR vales in (9). In Section III-C, we propose an approximate way to generate ρ w,i and µ w,i, which enables s to compte approximate LLR vales for sch linear and non-linear eqalizers. III. FAST EQUALIZATION VIA COORDINATE DESCENT While the soltion to the implicit MMSE problem (4) can be compted (exactly or approximately) at moderate complexity sing iterative conjgate gradient (CG) or Gass- Seidel (GS) methods, see, e.g., [9], [13], [22], corresponding VLSI designs [10], [13] are nable to achieve high throghpt, mainly de to a fairly complex algorithm strctre, stringent data dependencies, or the need for high arithmetic precision. We next propose an alternative method to solve both the MMSE eqalization (4) and BOX eqalizaton (8) problems at low complexity and in a hardware friendly way. A. Coordinate Descent (CD) Coordinate descent (CD) [29] is a well-established iterative framework to exactly or approximately solve a large nmber of convex optimization problems sing a series of simple, coordinate-wise pdates. We first define the following fnction: f(z 1,..., z U ) = f(z) = y w H w z g(z), (10) where g(z) is a convex reglarizer. It is now important to realize that both eqalization problems (4) and (8) are special cases when minimizing (10). In fact, by setting g MMSE (z) = N 0 z 2 2, minimizing (10) is eqivalent to solving the MMSE eqalization problem (4). By setting g BOX (z) = χ(z C O ), where χ(z C O ) denotes the characteristic fnction that is zero if z C O and infinity otherwise, minimizing (10) is eqivalent to solving the BOX eqalization problem (8). CDbased eqalization simply minimizes the fnction f(z 1,..., z U ) in (10) seqentially for each variable (or coordinate) z, = 1,..., U, in a rond-robin fashion. 4 For more details on CD, see [29], [30] and the references therein. We next detail the CD algorithms for MMSE and BOX eqalization. 4 The performance of CD can often be improved by sing a careflly-selected variable-pdate order [29]; or own experiments have shown that for MMSE and BOX eqalization, a simple rond-robin pdate scheme performs well and is easier to implement.

4 4 1) CD-based MMSE Eqalization: Assme we want to find the th optimm vale z for the MMSE eqalization problem (4), i.e., we seek to compte the soltion to ẑ = arg min y w H w z N 0 z 2 2, (11) z C where we hold all other vales z j, j, fixed. Since this is a qadratic problem, we can solve it in closed form by setting the gradient of the fnction (10) with respect to the th component to zero: 0 = f(z) = h H (Hz y) + N 0 z. (12) By decomposing Hz = h z + j h jz j, we can solve (12) for z to obtain the following closed-form expression: 1 ẑ = h N h H y h j z j. (13) 0 j This expression is exactly the CD pdate rle for the th entry of z. For every iteration, we can compte (13) seqentially for each ser = 1,..., U, where we immediately re-se the new reslt ẑ for the th ser in sbseqent steps. We repeat this procedre for a total nmber of K iterations in order to obtain an estimate for s MMSE = z (K), where z (K) is the end reslt of the above-described iterative process. 2) CD-based BOX Eqalization: Analogosly to CD-based MMSE eqalization, we can derive the pdate rle for the BOX eqalization problem (8). Even thogh the characteristic fnction g BOX (z) = χ(z C O ) is not differentiable, a similar approach that ses sbgradients (instead of gradients) enables one to derive the following closed-form expression [30]: ẑ = proj CO 1 h 2 h H y h j z j. (14) 2 j Here, proj CO ( ) is the orthogonal projection onto the convex polytope C O and is given by { w if w CO proj CO (w) = (15) arg min q CO w q if w / C O. In words, if the argment w C is within the set C O, then the projection otpts w; if w is otside the set C O, the projection otpts the vale q that is closest to w within the set C O in terms of the Eclidean distance. We emphasize that for many practically-relevant constellation sets O, the projection (15) can be carried ot efficiently. For any QAM constellation, for example, we independently clip the real and imaginary part of w onto the interval [ α, +α], where α is the radis of the tightest box that covers the QAM constellation (see Section II-D for the details). For BPSK with O = { 1, +1}, we clip the real part of w onto the interval [ 1, +1] and set the imaginary part to zero. 5 5 Orthogonal projections for PSK constellations sets are also possible. The development of efficient algorithms for PSK systems is left for ftre work. Algorithm 1 Optimized Coordinate Descent (OCD) 1: inpts: y, H, and N 0 2: initialization: 3: r = y and z (0) = 0 U 1 4: MMSE mode: α = N 0 and C = C 5: BOX mode: α = 0 and C = C O 6: preprocessing: 7: d 1 = ( h α) 1, = 1,..., U 8: p = d 1 h 2 2, = 1,..., U 9: eqalization: 10: for k = 1,..., K do 11: for = 1,..., U( do ) 12: z (k) = proj C d 1 h H r + p z (k 1) 13: z (k) = z (k) z (k 1) 14: r r h z (k) 15: end for 16: end for 17: otpts: s = [z (K) 1,..., z (K) U ]T B. Optimized Coordinate Descent (OCD) Instead of blindly compting the pdates (13) and (14) for MMSE and BOX eqalization, respectively, we perform preprocessing and algorithm restrctring in order to minimize the amont of (recrrent) operations dring each of the k = 1,..., K iterations. These optimizations entail no performance loss, i.e., both methods, OCD and CD, deliver exactly the same reslts. We refer to the reslting method as the optimized CD algorithm (short OCD), which is smmarized in Algorithm 1. OCD spports both BOX and MMSE eqalization and the individal optimization steps are as follows. 6 1) Preprocessing: To redce the comptational complexity, OCD precomptes certain key qantities that can be re-sed dring each of the k = 1,..., K iterations. This preprocessing step not only reslts in significant complexity savings dring the iterative process (compared to CD), bt also simplifies or hardware implementation (see Section IV). In particlar, we precompte so-called (reglarized) inverse sqared colmn norms of H, i.e., d 1 = ( h α) 1 for = 1,..., U, with α 0, as well as reglarized gains p = d 1 h 2 2 for = 1,..., U. In MMSE mode, the reglarization parameter is given by α = N 0 ; in BOX mode, the reglarization parameter is given by α = 0, which yields p = 1, = 1,..., U. 2) Eqalization: In order to avoid recrrent operations dring the eqalization process, OCD performs incremental pdates and re-ses intermediate qantities dring each of the k = 1,..., K iterations. In essence, we perform seqential pdates on the so-called residal approximation vector, which is defined as r = y U j=1 h j z (k) j (16) 6 The OCD algorithm proposed in the conference version of this paper [1] differs from the one presented here. The operations in OCD as proposed here have been restrctred in order to (i) spport MMSE as well as BOX eqalization and (ii) redce the hardware complexity.

5 5 MMSE detector OCD, K = 4 OCD fp, K = 4 GS, K = 4 CG, K = MMSE detector OCD, K = 3 OCD fp, K = 3 GS, K = 3 CG, K = 3 OCD, K = 2 GS, K = 2 CG, K = 2 Nemann, K = MMSE detector OCD, K = 3 OCD fp, K = 3 GS, K = 3 CG, K = 3 OCD, K = 2 GS, K = 2 CG, K = 2 Nemann, K = 3 (a) 32 BS antennas and 8 sers. (b) 64 BS antennas and 8 sers. (c) 128 BS antennas and 8 sers. Fig. 1. for a massive MU-MIMO-OFDM system ( fp denotes fixed-point performance). Optimized coordinate descent (OCD) with box-constrained eqalization achieves close-to-mmse PER performance and otperforms the other three approximate eqalization methods [7], [10], [13]. Exact Inversion OCD BOX, K = 4 OCD MMSE, K = MMSE detector OCD BOX, K = 3 OCD MMSE, K = 3 OCD BOX, K = 2 OCD MMSE, K = MMSE detector OCD BOX, K = 3 OCD MMSE, K = 3 OCD BOX, K = 2 OCD MMSE, K = 2 (a) 32 BS antennas and 8 sers. (b) 64 BS antennas and 8 sers. (c) 128 BS antennas and 8 sers. Fig. 2. for a massive MU-MIMO-OFDM system. BOX eqalization otperforms MMSE eqalization, especially for systems with a smaller BS-to-ser antenna ratio. Frthermore, both approximate eqalization methods achieve near-exact MMSE performance for a small nmber of iterations. at every algorithm iteration k = 1,..., K and for each ser = 1,..., U. Note, however, that we do not recompte this residal approximate vector for every iteration and ser from scratch. In contrary, we pdate the residal approximation vector in every iteration and for each ser by first compting the symbol estimates z (k) on line 12 of Algorithm 1. We then compte a so-called delta vale z (k) on line 13, which enables s to pdate the residal r on line 14 withot calclating the residal (16) explicitly. As mentioned above, OCD delivers exactly the same reslts as CD, bt does so at significantly lower comptational complexity. In fact, the original CD algorithm in Section III-A reqires one complex-valed inner prodct and U 1 complex scalar-by-vector mltiplications per iteration k, whereas the proposed OCD algorithm reqires only one inner prodct and one complex scalar-by-vector mltiplication. More precisely, for MMSE eqalization, CD reqires 4BU 2 + 2U real-valed mltiplications 7 per iteration k, whereas OCD reqires only 8BU + 4U real-valed mltiplications. Hence, for a large 7 We cont 4 real-valed mltiplications per complex-valed mltiplication. nmber of BS antennas B, OCD reqires roghly U/2 times lower complexity than CD per iteration. C. LLR Approximation for OCD To compte the LLR vales (9) for MMSE and BOX eqalization sing OCD, we mst resort to an approximation as we never explicitly compte the inverse A 1 w. To this end, we se the approximation pt forward in [11], [22] for SC-FDMA-based systems. For OFDM, this approach simplifies significantly and corresponds to approximating the channel gains by µ w,i = d 1 w,i g w,i, where d 1 w,i is the ith reglarized inverse sqared colmn norm of H w and g i,w is the entry in the ith main diagonal of the Gram matrix G w at sbcarrier w. Frthermore, the approach from [11], [22] applied to OFDM systems reslts in the following SINR approximation: ρ w,i = µ w,i /(1 µ w,i ). We refer the interested reader to [22] for more details. As we will show next, this LLR approximation enables near-optimal performance in massive MU-MIMO systems with large BS-to-ser-antenna ratios.

6 6 D. Error-Rate Performance In order to assess the error-rate performance for the proposed OCD-BOX algorithm, we perform Monte-Carlo simlations in a coded 20 MHz MIMO-OFDM plink system with 2048 sbcarriers, where 1200 are sed for data transmission as in LTE Advanced (LTE-A) [31]. We se 64-QAM with Gray mapping and a rate-3/4 trbo code. To accont for spatial and freqency correlation, we generate channel matrices sing the WINNER-Phase-2 model [32] with 7.8 cm antenna spacing as in [11], [22]. For channel decoding, we se a log-map trbo decoder. We report the packet error-rate, which is obtained by coding over one OFDM symbol with 1200 data sbcarriers. The signal-to-noise-ratio (SNR) per bit in decibels, defined as ( ) ( [ ] ) Eb E s 2 10 log 10 = 10 log N 10 0 QE[ n 2. (17) ] Figres 1 and 2 compare the packet error rate (PER) for OCD-BOX with other exact and approximate data-detection methods for massive MU-MIMO systems with varios antenna configrations. In particlar, we show PER reslts for Nemannseries detection [7], CG-based detection [10], and Gass-Seidel (GS)-based detection [13]. We also inclde an exact linear MMSE eqalizer as a reference. For all considered antenna configrations, OCD-BOX otperforms Nemann, CG, and GS detection for the same iteration cont. We see that OCD with BOX eqalization (OCD-BOX for short) achieves near-exact MMSE performance for only three iterations (K = 3) for 64 and 128 BS antennas, whereas K = 4 is reqired for the not-so-large system with 32 BS antennas; lower vales of K reslt in a high error floor. These reslts confirm that for larger BS-to-ser-antenna ratios, approximate linear data detectors approach the performance of the MMSE detector. We note that for the considered antenna configrations, linear MMSE detection achieves near-ml performance [7]. Figres 2(a), 2(b), and 2(c) compare the PER for OCD-BOX against OCD with MMSE eqalization (short OCD-MMSE). The performance of OCD-BOX is sperior than that of OCD- MMSE, especially in the 32 BS antenna, 8 ser case. In general, the performance difference is more prononced for smaller BS-to-ser-antenna ratios. This observation is in accordance to recent theoretical reslts [23], and can be addressed to the fact that the box constraint arond the constellation is more accrate than the qadratic penalty g MMSE (z) = N 0 z 2 2 imposed by MMSE eqalization. We conclde by noting that for many modern wireless commnication standards (sch as LTE-A [31]) achieving a target PER of 10% is sfficient. The proposed OCD detector is able to meet this target performance at only a small SNR loss compared to the exact MMSE-based data detector. IV. VLSI ARCHITECTURE We now detail or VLSI architectre for OCD-based MMSE and BOX eqalization. The architectre was designed and optimized sing Xilinx Vivado HLS (version ), which allows s to conveniently simlate, parameterize, and generate different OCD designs that spport varios antenna configrations at design time. At rn-time, the proposed designs can be configred in terms of the nmbers of spported sers U and maximm nmber of iterations K. A. Architectre Overview Figre 3 shows two high-level block diagrams of the proposed OCD architectre. The inpts of or architectre are the channel matrix H w, the residal error vector r (which is initialized to the received vector y w ), and the reglarization parameter α, which we initialized to N 0 and 0 for MMSE and BOX eqalization, respectively. Or architectre spports two operation modes: (a) preprocessing (lines 6 8 of Algorithm 1) and (b) OCD-based qalization (lines 10 16). Preprocessing and eqalization are carried ot in a B-wide vector pipeline, i.e., we process B-dimensional vectors at a time. In the preprocessing mode, we compte the reglarized inverse sqared colmn norms d 1, = 1,..., U, as well as the reglarized gains p, = 1,..., U. In the eqalization mode, we perform the iterations on lines of Algorithm 1. In order to spport these two operation modes withot the need of redndant comptation nits, the processing pipeline shares the key bilding blocks sed in both modes. In particlar, both of the spported modes share the inner-prodct nit and the rightshift nit (highlighted in red in Figre 3). The inner prodct nit consists of B parallel complex-valed mltipliers followed by a balanced adder tree. We se mltiplexers at the inpt of the inner prodct nit, which enables s to switch between preprocessing and eqalization on a per-clock cycle basis. One of the main implementation challenges of the proposed OCD algorithm are data dependencies between sccessive iterations, which prevent traditional architectre pipelining. In particlar, as it can be seen on line 14 of Algorithm 1, each OCD iteration pdates the temporary vector r and the vector z (k+1) given the previos vectors r and z (k). Hence, in order to achieve high throghpt, we deploy pipeline interleaving [33], i.e., we simltaneosly process mltiple sbcarriers in a parallel and interleaved manner within the same architectre. For example, after performing an OCD iteration for the first sbcarrier, we start an OCD iteration for the second sbcarrier in the next clock cycle; we repeat this interleaving process ntil all pipeline stages are flly occpied. Or final architectre ses a total nmber of 24 pipeline stages, which enables or design to achieve p to 260 MHz in a Xilinx Virtex-7 FPGA (see Section V for more details). We note that it is possible to achieve even higher clock freqencies by increasing the nmber of pipeline stages (especially for smaller small B); this approach, however, reslts in a significant hardware overhead. B. Architectre and Fixed-point Optimization In order to optimize the hardware efficiency of or architectre, we se fixed-point arithmetic throghot or design. We achieved a negligible implementation loss with 16 bit precision with 11 fractional bit for most internal signals; see Figre 1 for the fixed-point (fp) performance. Or design has an implementation loss of less than 0.2 db SNR (measred at a target PER of 10%) compared to floating-point performance for the considered scenarios, which is a reslt of the following two optimizations.

7 7 16b 72b 72b + / (Bx32)b >> 16b 16b log2(b) x 16b (a) OCD preprocessing mode. (Bx32)b (Bx32)b b >> log2(b) (Bx32)b b (Bx32)b x x 16b + proj (Bx32)b pdate pdate (b) OCD iteration mode. Fig. 3. High-level block diagram of the proposed OCD-based preprocessing and eqalization pipeline. The pipeline is reconfigrable for varios BS-antenna configrations at design time, and is able to perform preprocessing as well as MMSE or BOX eqalization. The shared comptation nits between preprocessing and eqalization are highlighted in red. 1) Inner-prodct nit: This nit first comptes entry-wise prodcts of two B-dimensional vectors and then, generates the final sm of these prodcts. We se a balanced adder tree to compte the final sm and 36 bit adders to achieve sfficiently high arithmetic internal precision. Dring preprocessing, the inner-prodct nit comptes h 2 2 (line 7 of Algorithm 1); dring eqalization, the same nit comptes h H (r) (line 12). As both of these terms are close to B (for large vales of B), we shift these terms by b = log 2 (B) bits to the right in order to redce the dynamic range. Since we shift h 2 2 by b to the right, when we compte the reciprocal vale, = ( h α) 1, we effectively shift the reciprocal vale d 1 by b bits to the left. In the inner-prodct nit, we also shift the term h H r by b bits to the right. Conseqently, we do not need to ndo both of these shifts, as they cancel ot dring the mltiplication on line 12 of Algorithm 1. d 1 2) Reciprocal nit: This nit consists of two parts. The first part normalizes the inpt vale to the range [0.5, 1], which is accomplished sing a leading-zero detector and programmable shift to the left. The second part generates a reciprocal vale for the normalized inpt sing a look-p table (LUT). We se a FPGA BRAM18 to implement a 18 bit, 2048 entry LUT, where the leading 11 bits of the normalized inpt vale are TABLE I IMPLEMENTATION RESULTS ON A XILINX VIRTEX-7 XC7VX690T FPGA FOR DIFFERENT BS ANTENNA NUMBERS Array size B = 32 B = 64 B = 128 # of Slices # of LUTs # of FFs # of DSP48s # of BRAM18s Max. clock freqency 261 MHz 261 MHz 258 MHz sed to point to the entry in the LUT that stores the associated normalized reciprocal. Finally, we denormalize the normalized reciprocal vale by another left shift. V. IMPLEMENTATION RESULTS AND COMPARISON We now show FPGA implementation reslts and compare or design to the recently proposed data-detectors for massive MU-MIMO systems in [7], [10], [12], [13].

8 8 TABLE II AREA BREAKDOWN ON A XILINX VIRTEX-7 XC7VX690T FPGA FOR DIFFERENT BS ANTENNA NUMBERS B = 32 B = 64 B = 128 Main nits # of Slices # of LUTs # of FFs # of DSP48s # of BRAM18s r pdate nit 256 (8.91%) (16.9%) 0 (0%) 0 (0%) 0 (0%) Inner-prodct nit 811 (28.2%) (17.3%) (22.6%) 96 (48.5%) 0 (0%) h z scaling nit 265 (9.22%) 249 (4.11%) (10.6%) 96 (48.5%) 0 (0%) Miscellaneos (53.6%) (61.74%) (66.8%) 6 (3.0%) 2 (100%) Total (100%) (100%) (100%) 198 (100%) 2 (100%) r pdate nit 512 (7.87%) (16.3%) 0 (0%) 0 (0%) 0 (0%) Inner-prodct nit (25.0%) (15.9%) (23.3%) 192 (49.2%) 0 (0%) h z scaling nit 485 (7.45%) 505 (4.01%) (8.71%) 192 (49.2%) 0 (0%) Miscellaneos (59.7%) (63.8%) (68.0%) 6 (1.6%) 2 (100%) Total (100%) (100%) (100%) 390 (100%) 2 (100%) r pdate nit (9.23%) (17.1%) 0 (0%) 0 (0%) 0 (0%) Inner-prodct nit (31.1%) (17.2%) (27.0%) 384 (49.6%) 0 (0%) h z scaling nit (17.6%) (21.4%) (9.72%) 384 (49.6%) 0 (0%) Miscellaneos (42.1%) (44.3%) (63.3%) 6 (0.8%) 2 (100%) Total (100%) (100%) (100%) 774 (100%) 2 (100%) TABLE III THROUGHPUT AND LATENCY ON A XILINX VIRTEX-7 XC7VX690T FPGA FOR K ITERATIONS AND 64-QAM, AND 128 BS AND 8 USER ANTENNAS K = 1 K = 2 K = 3 K = 4 Max. throghpt [Mb/s] Latency [µs] A. FPGA Implementation Reslts We designed three different implementations for the following BS antenna configrations: B = 32, B = 64 and B = 128. For each configration, we provide post place-androte implementation reslts on a Xilinx Virtex-7 XC7VX690T FPGA. All or designs spport U 32 sers and K 256 OCD iterations; both of these parameters can be set at rn-time. The hardware complexity, resorce tilization, and maximm clock freqency reslts are smmarized in Table I. We note that there is no particlar critical path in all or designs as Vivado HLS evenly optimizes the delays among all pipeline stages. A detailed area breakdown of the main nits is shown in Table II. The r pdate nit corresponds to the otpt adder in Figre 3(b); the inner-prodct nit corresponds to the nit that comptes h H h and h H r in Figre 3(a) and Figre 3(b), respectively; the h z scaling nit corresponds to the scaling block in Figre 3(a); all remaining circitry has been flattened by Vivado HLS and is sbsmed in miscellaneos. Since the proposed architectre performs operations on B- dimensional vectors, the resorce tilization (exclding the BRAMs) scales linearly with B. Since the qantities H w and y w are assmed to be stored in external memories, or OCD architectre only ses two BRAM18s: one for the reciprocal LUT and one to store the reglarized channel gains p, = 1,..., U. The maximm achievable throghpt as well as the processing latency are shown in Table III. We see that the throghpt only depends on the maximm iteration nmber K and the clock freqency, bt does not depend on U. The reason is becase the nmber of bits per sbcarrier and the nmber of clock cycles reqired to process 24 sbcarriers grows linearly with respect to U. For example, dobling U dobles the nmber of bits per sbcarrier. However, since the nmber of OCD pdates is KU, the nmber of reqired clock cycles also dobles; this reslts in a constant throghpt. For K = 3 iterations, which was shown in Figre 1 to achieve near-optimal performance, or design achieves 376 Mb/s. Hence, the se of only three parallel instances (to process sbcarriers in parallel) wold easily exceed 1.1 Gb/s, while consming less than 65% of the FPGA s BRAM18s (cf. Table IV). The processing latency increases roghly linearly with respect to K and U. More specifically, the processing latency of this design is approximately 24(K + 1)U + O clock cycles, where O is the nmber of cycles reqired to flsh the pipeline. Typically, 26 cycles are reqired to flsh the pipeline, the exact vale of O depends on B. The (approximately) linear increase in K can be seen in Table III and for K = 3, or design reqires only 3.08 µs to prodce its first eqalized otpt. B. Comparison Table IV compares OCD to other, recently proposed largescale MIMO data detectors, namely the conjgate gradient (CG)-based detector [10], the Nemann-series detector [7], the Gass-Seidel (GS) detector [13], and trianglar approximate semidefinite relaxation (TASER) [12]. All of these detectors have been implemented on the same FPGA and for a 128 BS antenna, 8 ser system. We see that for the same system configration, OCD otperforms all other designs in terms of hardware efficiency, which we define as throghpt per FPGA LUTs. Frthermore, or OCD detector achieves sperior PER performance than the CG, Nemann, and GS detector (see Figs. 1(b) and 1(c)), which demonstrates the effectiveness of OCD. TASER, in contrast, achieves better error-rate performance for the considered antenna configration 8 bt only spports QPSK constellations. We note that the throghpt of (approximate) linear detectors, sch as the ones in [7], [10], [13] scales linearly in the nmber of bits Q per symbol; for TASER, however, the throghpt is limited by QPSK modlation, which 8 TASER achieves near-ml performance in not-so-massive MIMO systems, where the nmber of sers is comparable to the nmber of BS antennas.

9 9 TABLE IV COMPARISON OF DATA DETECTORS FOR MASSIVE MU-MIMO SYSTEM ON A XILINX VIRTEX-7 XC7VX690T FPGA Detector CG [10] Nemann [7] Gass-Seidel [13] TASER [12] OCD Performance near-mmse near-mmse near-mmse near-ml near-mmse Highest modlation 64-QAM 64-QAM 64-QAM QPSK 64-QAM Iteration cont K a 3 3 # of slices (1.0%) (45%) n.a (4.0%) (10%) # of LUTs (0.8%) (34%) (4.3%) (3.2%) (5.5%) # of FFs (0.4%) (19%) (1.8%) (0.8%) (4.96%) # of DSP48s 33 (0.9%) (28%) 232 (6.3%) 168 (5.7%) 774 (21.5%) # of BRAM18s Maximm clock freqency [MHz] Latency [clock cycles] n.a Maximm throghpt [Mb/s] Throghpt/LUTs a The method ses a special Nemann-series initializer followed by one GS iteration. prevents this detector to achieve comparable throghpts as the other approximate methods. In smmary, we see that OCD otperforms the next-best design (namely the CG-detector from [10]) by more than 2.6 in terms of hardware efficiency. The reasons for this advantage are de to the facts that (i) OCD can be implemented in a very reglar and parallel manner and (ii) preprocessing reqires significantly lower complexity compared to that of the other detectors that reqire the comptation of the reglarized Gram matrix A w, which can be a significant brden in massive MU-MIMO-OFDM systems. VI. CONCLUSIONS We have proposed a novel coordinate descent (CD)-based data detector, called optimized CD (OCD), for massive MU- MIMO systems that se orthogonal freqency division mltiplexing (OFDM). The proposed OCD detector enables highperformance linear MMSE and non-linear box-constrained data detection sing a simple, parallel VLSI architectre that reqires low hardware complexity. Or FPGA reference design achieves 376 Mb/s for a 128 BS antenna, 8 ser system, and sbstantially otperforms existing approximate linear datadetection methods in terms of hardware efficiency and/or errorrate performance. Or reslts show that OCD enables realistic OFDM-based massive MU-MIMO systems to spport tens of sers commnicating with hndreds of BS antennas, while achieving high throghpt at low implementation costs. There are many avenes for ftre work. OCD can also be sed for linear and non-linear precoding in the massive MU-MIMO downlink; a corresponding stdy is part of ongoing work. Compting exact soft-otpt vales for OCD-based detection (for MMSE and BOX eqalization) is an interesting open research problem. Finally, accelerated CD algorithms have been proposed recently [34]; sch methods may lead to even faster convergence and hence, cold enable higher throghpt at the same error-rate performance when implemented in VLSI. VII. ACKNOWLEDGMENTS C. Stder wold like to thank Tom Goldstein, Charles Jeon, Shahriar Shahabddin for insightfl discssions on the boxconstrained eqalization method. The work of M. W and J. R. Cavallaro was spported in part by Xilinx Inc., and by the US National Science Fondation (NSF) nder grants ECCS , CNS , and ECCS The work of C. Stder was spported in part by Xilinx Inc. and by the US NSF nder grants ECCS and CCF REFERENCES [1] M. W, C. Dick, J. Cavallaro, and C. Stder, FPGA design of a coordinate-descent detector for large-mimo, in Proc. IEEE Intl. Conf. on Circits and Systems (ISCAS), May [2] T. L. Marzetta, Noncooperative celllar wireless with nlimited nmbers of base station antennas, IEEE Trans. Wireless Commn., vol. 9, no. 11, pp , Nov [3] F. Rsek, D. Persson, B. K. La, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tfvesson, Scaling p MIMO: Opportnities and challenges with very large arrays, IEEE Signal Process. Mag., vol. 30, no. 1, pp , Jan [4] J. Hoydis, S. Ten Brink, and M. Debbah, Massive MIMO in the UL/DL of celllar networks: How many antennas do we need?, IEEE Jornal on Selected Areas in Commnications, vol. 31, no. 2, pp , Feb [5] E. Larsson, O. Edfors, F. Tfvesson, and T. Marzetta, Massive MIMO for next generation wireless systems, IEEE Commnications Magazine, vol. 52, no. 2, pp , Feb [6] J. G. Andrews, S. Bzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. Soong, and J. C. Zhang, What will 5G be?, IEEE Jornal on Selected Areas in Commnications, vol. 32, no. 6, pp , Jne [7] M. W, B. Yin, G. Wang, C. Dick, J. R. Cavallaro, and C. Stder, Large-scale MIMO detection for 3GPP LTE: algorithms and FPGA implementations, IEEE J. Sel. Topics in Sig. Proc., vol. 8, no. 5, pp , Oct [8] H. Prabh, J. Rodriges, O. Edfors, and F. Rsek, Approximative matrix inverse comptations for very-large MIMO and applications to linear pre-coding systems, in Proc. IEEE WCNC, 2013, pp [9] Y. H, Z. Wang, X. Gaol, and J. Ning, Low-complexity signal detection sing CG method for plink large-scale MIMO systems, in Proc. IEEE ICCS, Nov 2014, pp [10] B. Yin, M. W, J. Cavallaro, and C. Stder, VLSI Design of Large- Scale Soft-Otpt MIMO Detection Using Conjgate Gradients, in Proc. IEEE ISCAS, May 2015, pp [11] B. Yin, M. W, G. Wang, C. Dick, J. R. Cavallaro, and C. Stder, A 3.8 Gb/s large-scale MIMO detector for 3GPP LTE-Advanced, in Proc. IEEE ICASSP, May 2014, pp [12] O. Castañeda, T. Goldstein, and C. Stder, FPGA design of approximate semidefinite relaxation for data detection in large MIMO wireless systems, in Proc. IEEE Intl. Conf. on Circits and Systems (ISCAS), May 2016.

10 10 [13] Z. W, C. Zhang, Y. Xe, S. X, and Z. Yo, Efficient architectre for soft-otpt massive MIMO detection with Gass-Seidel method, in Proc. IEEE Intl. Conf. on Circits and Systems (ISCAS), May [14] N. E. Tnali, M. W, C. Dick, and C. Stder, Linear large-scale mimo data detection for 5g mlti-carrier waveform candidates, in Proc. Asilomar Conference on Signals, Systems, and Compters, Nov [15] M. W, C. Dick, J. R. Cavallaro, and C. Stder, Iterative detection and decoding in 3GPP LTE-based massive MIMO systems, in 22nd Eropean Signal Processing Conference (EUSIPCO), Sept. 2014, pp [16] R. Prasad, OFDM for Wireless Commnications Systems, Artech Hose, Inc., Norwood, MA, USA, [17] D. Gesbert, M. Shafi, D. Shi, P. J. Smith, and A. Nagib, From theory to practice: an overview of MIMO space-time coded wireless systems, IEEE Jornal on Selected Areas in Commnications, vol. 21, no. 3, pp , [18] A. Palraj, R. Nabar, and D. Gore, Introdction to Space-Time Wireless Commnications, Cambridge University Press, New York, USA, [19] D. Seethaler, J. Jaldén, C. Stder, and H. Bölcskei, On the complexity distribtion of sphere decoding, IEEE Trans. Inf. Theory, vol. 57, no. 9, pp , Sept [20] D. Seethaler, G. Matz, and F. Hlawatsch, An efficient MMSE-based demodlator for MIMO bit-interleaved coded modlation, in Proc. Global Telecommnications Conference (GLOBECOM), Nov. 2004, vol. 4, pp [21] C. Stder, S. Fateh, and D. Seethaler, ASIC implementation of softinpt soft-otpt MIMO detection sing MMSE parallel interference cancellation, IEEE J. Solid-State Circits, vol. 46, no. 7, pp , Jl [22] B. Yin, M. W, J. R. Cavallaro, and C. Stder, Conjgate gradient-based soft-otpt detection and precoding in massive MIMO systems, in Proc. IEEE GLOBECOM, Dec 2014, pp [23] C. Jeon, A. Maleki, and C. Stder, On the performance of mismatched data detection in large MIMO systems, in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2016, pp [24] P. H. Tan, L. K. Rasmssen, and T. J. Lim, Constrained maximmlikelihood detection in CDMA, IEEE Trans. Commn., vol. 49, no. 1, pp , Jan [25] A. Yener, R. D. Yates, and S. Ulks, CDMA mltiser detection: A nonlinear programming approach, IEEE Trans. Commn., vol. 50, no. 6, pp , Jne [26] C. Thrampolidis, E. Abbasi, W. X, and B. Hassibi, BER analysis of the box relaxation for BPSK signal recovery, IEEE Int. Conf. Acost., Speech, Signal Process. (ICASSP), [27] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge Univ. Press, New York, NY, USA, [28] G. Caire, G. Taricco, and E. Biglieri, Bit-interleaved coded modlation, IEEE Transactions on Information Theory, vol. 44, no. 3, pp , May [29] S. J. Wright, Coordinate descent algorithms, Mathematical Programming, vol. 151, no. 1, pp. 3 34, [30] G. Gordon and R. Tibshirani, Coordinate descent, Tech. Rep., Lectre Notes, Optimization , Carnegie Mellon University, [31] 3rd Generation Partnership Project; Technical Specification Grop Radio Access Network; Evolved Universal Terrestrial Radio Access (E-UTRA); Physical Layer Procedres (Release 10), 3GPP Organizational Partners TS version , Jl [32] L. Hentilä, P. Kyösti, M. Käske, M. Narandzic, and M. Alatossava, Matlab implementation of the WINNER phase II channel model ver 1.1, Dec [33] H. Kaeslin, Digital integrated circit design: from VLSI architectres to CMOS fabrication, Cambridge University Press, [34] Y. T. Lee and A. Sidford, Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems, in IEEE 54th Annal Symposim on Fondations of Compter Science (FOCS), Oct. 2013, pp

Conjugate Gradient-based Soft-Output Detection and Precoding in Massive MIMO Systems

Conjugate Gradient-based Soft-Output Detection and Precoding in Massive MIMO Systems Bei Yin, Michael Wu, Joseph R. Cavallaro, and Christoph Studer Department of ECE, Rice University, Houston, TX; e-mail: